# COMM 187: Data Science in Communication Research, Winter 2026

## Week #4 Coding Lab: Statistics with Python

Welcome to the Week #4 Coding Lab for COMM 187: Data Science in Communication Research! 

Thus far, we have learned some basic Python skills (variables, data types, lists, and NumPy), Python dictionaries, and Pandas.

Today's lesson plan:
 - Introduction to `scipy`
 - Comparative Statistics: t-test and ANOVA
 - Associative Statistics: correlation and covariance 

### Introduction to `scipy`

In [3]:
import pandas as pd
import numpy as np

We will now use a new library, called `scipy`, which is a powerful scientific computation library in Python. 

In Python, we have learned how to `import` a library, and how to `import` a library `as` a nickname/alias. Now, we will try importing a specific subset of functions from a library. We do this by using the format:
```
from <library> import <name of sub-library>
```

Here, we only need the sub-library `stats` from `scipy`, so we will run the following code:

In [4]:
from scipy import stats

Let us start with a familiar dataset: the college majors dataset. This dataset is the data behind [this article](https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/).

You can access the repository for this dataset [here](https://github.com/fivethirtyeight/data/tree/master/college-majors).

In [7]:
### Your code below this line
recentgrads = pd.read_csv('./data/recent-grads.csv')

Now, print the names of the columns of this dataframe using `recentgrads.columns`.

In [8]:
### Your code below this line
recentgrads.columns

Index(['Rank', 'Major_code', 'Major', 'Total', 'Men', 'Women',
       'Major_category', 'ShareWomen', 'Sample_size', 'Employed', 'Full_time',
       'Part_time', 'Full_time_year_round', 'Unemployed', 'Unemployment_rate',
       'Income', 'P25th', 'P75th', 'College_jobs', 'Non_college_jobs',
       'Low_wage_jobs'],
      dtype='object')

For your reference, here are the descriptions of the values in each of these columns:

Column Name | Description
---|---------
`Rank` | Rank by median earnings
`Major_code` | Major code, FO1DP in ACS PUMS
`Major` | Major description
`Major_category` | Category of major from Carnevale et al
`Total` | Total number of people with major
`Sample_size` | Sample size (unweighted) of full-time, year-round ONLY (used for earnings)
`Men` | Male graduates
`Women` | Female graduates
`ShareWomen` | Women as share of total
`Employed` | Number employed (ESR == 1 or 2)
`Full_time` | Employed 35 hours or more
`Part_time` | Employed less than 35 hours
`Full_time_year_round` | Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35)
`Unemployed` | Number unemployed (ESR == 3)
`Unemployment_rate` | Unemployed / (Unemployed + Employed)
`Income` | Median earnings of full-time, year-round workers
`P25th` | 25th percentile of earnings
`P75th` | 75th percentile of earnings
`College_jobs` | Number with job requiring a college degree
`Non_college_jobs` | Number with job not requiring a college degree
`Low_wage_jobs` | Number in low-wage service jobs

---

## Descriptive Statistics

### Mean (or Average)

Two ways to go about it:
 - **numpy**: Use the `np.mean()` function
 - **pandas**: Use the method `.mean()` for any column with numerical values.

For example, for a column named "col1" in a dataframe `df`, you can calculate the mean by using either 
 - `np.mean(df["col1"])`, or
 - `df["col1"].mean()`

Let us try it on the column "Employed" in dataframe `recentgrads`. Try out both ways.
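As a minimal sketch of the two approaches, here is the same calculation on a small made-up dataframe. The dataframe `df` and its values are hypothetical and only stand in for a numeric column such as "Employed".

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from recent-grads.csv
df = pd.DataFrame({"col1": [10, 20, 30, 40]})

mean_np = np.mean(df["col1"])  # NumPy function applied to the column
mean_pd = df["col1"].mean()    # pandas method called on the column

print(mean_np, mean_pd)  # both print 25.0
```

Both calls should agree; which one you use is mostly a matter of style.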

**Practice:** Calculate the mean of "Median earnings of full-time, year-round workers" in the dataset.

In [None]:
### Your code below this line


### MEDIAN

Two ways to go about it:
 - **numpy**: Use the `np.median()` function
 - **pandas**: Use the method `.median()` for any column with numerical values.

Let us try it on the column "Employed" in dataframe `recentgrads`. Try out both ways.
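A quick sketch on made-up values shows both approaches, and also why the median can be more informative than the mean when a column has an extreme value. The dataframe `df` below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one extreme value (100)
df = pd.DataFrame({"col1": [1, 2, 3, 4, 100]})

median_np = np.median(df["col1"])  # NumPy function
median_pd = df["col1"].median()    # pandas method

print(median_np, median_pd)   # both print 3.0
print(df["col1"].mean())      # 22.0, pulled upward by the extreme value
```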

**Practice:** Calculate the median of "Median earnings of full-time, year-round workers" in the dataset.

In [None]:
### Your code below this line


### STANDARD DEVIATION

Two ways to go about it:
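 - **numpy**: Use the `np.std()` function. By default it computes the population standard deviation (`ddof=0`).
 - **pandas**: Use the method `.std()` for any column with numerical values. By default it computes the sample standard deviation (`ddof=1`), so the two results can differ slightly.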

Let us try it on the column "Employed" in dataframe `recentgrads`. Try out both ways. 
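One thing to watch for when trying both ways: the two defaults are not the same. The sketch below, on a hypothetical toy column, shows the difference and how the `ddof` argument makes the results match.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from recent-grads.csv
df = pd.DataFrame({"col1": [2, 4, 4, 4, 5, 5, 7, 9]})

std_np = np.std(df["col1"])   # population standard deviation (ddof=0)
std_pd = df["col1"].std()     # sample standard deviation (ddof=1)

print(std_np)  # 2.0
print(std_pd)  # slightly larger, because it divides by n - 1

# Passing ddof explicitly makes the two approaches agree
std_np_sample = np.std(df["col1"], ddof=1)
std_pd_population = df["col1"].std(ddof=0)
```

If your two answers on "Employed" differ in the later decimal places, this default is the likely reason.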

**Practice:** Calculate the standard deviation of "Median earnings of full-time, year-round workers" in the dataset.

In [None]:
### Your code below this line


### Multiple Descriptive Statistics (`.agg()`)

Use the method `.agg()` for any column with numerical values, and inside the `()` brackets, enter a **list** of the statistical functions you would like to use on that column. 

 - for mean, just write `'mean'`
 - for median, just write `'median'`
 - for standard deviation, just write `'std'`

Remember! In Python, `'` and `"` are interchangeable.

For example, to calculate both the mean and the median for "col1" in dataframe `df`, you can write `df["col1"].agg(['mean', 'median'])`
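Here is a minimal sketch of `.agg()` with all three statistics at once, using a hypothetical toy dataframe `df`. The result is a pandas Series whose index labels are the statistic names, so individual values can be pulled out by name.

```python
import pandas as pd

# Hypothetical toy data, not taken from recent-grads.csv
df = pd.DataFrame({"col1": [2, 4, 4, 4, 5, 5, 7, 9]})

# Pass a list of statistic names; the output is indexed by those names
summary = df["col1"].agg(["mean", "median", "std"])
print(summary)

# Access a single statistic by its label
print(summary["median"])  # 4.5
```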

Let us try it on the column "Employed" in dataframe `recentgrads`. Try getting all three statistics in the same line, and then separately. 

**Practice:** Calculate the mean, median, and standard deviation of "Median earnings of full-time, year-round workers" in the dataset in a single line using `.agg()`.

In [None]:
### Your code below this line


### Summary Statistics (`.describe()`)

Instead of calculating individual statistics for individual columns, we can also get a summary of the **mean**, **median** (reported as `50%`), **standard deviation**, and some other statistics using the method `.describe()`.

For example, for a column "col1" in dataframe `df`, you would write `df["col1"].describe()`

Let us try it on the column "Employed" in dataframe `recentgrads`.
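As a rough illustration of what `.describe()` returns for a numeric column, here is a sketch on a hypothetical toy dataframe `df`. Note that the median appears under the label `50%`.

```python
import pandas as pd

# Hypothetical toy data, not taken from recent-grads.csv
df = pd.DataFrame({"col1": [2, 4, 4, 4, 5, 5, 7, 9]})

summary = df["col1"].describe()
print(summary)  # count, mean, std, min, 25%, 50%, 75%, max

# The 50% row is the median
print(summary["50%"], df["col1"].median())  # both 4.5
```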

## Comparative Statistics

### T-TEST

With your group, go through the [SciPy Documentation for T-Test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) and solve the following practice question:

**Question:** Compare the mean `Unemployment_rate` between the following two `Major_category` values: "Humanities & Liberal Arts" and "Social Science" in dataframe `recentgrads`. \
Run a t-test, and interpret the results.
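Before working on the real data, it may help to see the general shape of a call to `stats.ttest_ind`. The sketch below uses two small made-up groups (`group_a` and `group_b` are hypothetical, not values from the dataset), and the 0.05 threshold is only the conventional choice, assumed here for illustration.

```python
from scipy import stats

# Hypothetical unemployment rates for two made-up groups
group_a = [0.08, 0.09, 0.10, 0.11, 0.07]
group_b = [0.05, 0.06, 0.04, 0.05, 0.06]

# Independent two-sample t-test; by default it assumes equal variances
result = stats.ttest_ind(group_a, group_b)
print(result.statistic, result.pvalue)

# Welch's t-test drops the equal-variance assumption
welch = stats.ttest_ind(group_a, group_b, equal_var=False)
print(welch.statistic, welch.pvalue)

alpha = 0.05  # conventional significance threshold (an assumption)
if result.pvalue < alpha:
    print("Reject the null hypothesis: the group means differ.")
else:
    print("Fail to reject the null hypothesis.")
```

The sign of the statistic tells you which group has the larger mean (positive means the first group is larger), and the p-value tells you how surprising a difference this large would be if the two population means were actually equal.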

In [9]:
# Extract data for 'Humanities & Liberal Arts' and 'Social Science'
HLA = recentgrads[recentgrads['Major_category'] == 'Humanities & Liberal Arts']['Unemployment_rate']
SS = recentgrads[recentgrads['Major_category'] == 'Social Science']['Unemployment_rate']

# Perform t-test using stats.ttest_ind


**Question:** Compare the mean `Unemployment_rate` between the following two `Major_category` values: "Law & Public Policy" and "Engineering". \
Run a t-test, and interpret the results.

In [None]:
# Extract data for different majors

# Perform t-test using stats.ttest_ind


### ANOVA

With your group, go through the [SciPy Documentation for one-way ANOVA](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html) and solve the following practice question:

**Question:** Test if there are any significant differences in the `Unemployment_rate` between the following four `Major_category` values: "Humanities & Liberal Arts", "Law & Public Policy", "Engineering", and "Social Science".
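As a warm-up, here is a minimal sketch of `stats.f_oneway` on three made-up groups. The group names and values are hypothetical and chosen so the group means are clearly different.

```python
from scipy import stats

# Hypothetical toy groups with clearly separated means (2, 6, and 10)
group_1 = [1, 2, 3]
group_2 = [5, 6, 7]
group_3 = [9, 10, 11]

# One-way ANOVA: pass each group as a separate argument
result = stats.f_oneway(group_1, group_2, group_3)
print(result.statistic, result.pvalue)

alpha = 0.05  # conventional significance threshold (an assumption)
if result.pvalue < alpha:
    print("At least one group mean differs from the others.")
else:
    print("No significant difference detected between group means.")
```

A significant result only says that at least one group mean differs; it does not say which groups differ from each other.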

In [10]:
# Extract data for each major
HLA = recentgrads[recentgrads['Major_category'] == 'Humanities & Liberal Arts']['Unemployment_rate']
SS = recentgrads[recentgrads['Major_category'] == 'Social Science']['Unemployment_rate']
LPP = recentgrads[recentgrads['Major_category'] == 'Law & Public Policy']['Unemployment_rate']
E = recentgrads[recentgrads['Major_category'] == 'Engineering']['Unemployment_rate']

# Perform ANOVA using stats.f_oneway


F_onewayResult(statistic=np.float64(3.2734832320511895), pvalue=np.float64(0.027932234385892967))

**Question:** Test if there are any significant differences in `ShareWomen` between the following three `Major_category` values: "Biology & Life Science", "Computers & Mathematics", and "Business".

In [None]:
# Extract data for each major

# Perform ANOVA using stats.f_oneway


## Associative Statistics

### COVARIANCE

Two ways to go about it:
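 - **numpy**: Use the `np.cov()` function. For two columns it returns a 2×2 covariance matrix; the covariance between the columns is the off-diagonal entry.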
 - **pandas**: Use the method `.cov( <column to calculate covariance with> )` for any two columns with numerical values.

Let us try it on the columns "Employed" and "ShareWomen".
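The two approaches return differently shaped results, which the following sketch illustrates on a hypothetical toy dataframe (`x` and `y` are made-up column names).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from recent-grads.csv
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8]})

# NumPy returns a 2x2 matrix: variances on the diagonal,
# the covariance between x and y off the diagonal
cov_matrix = np.cov(df["x"], df["y"])
cov_np = cov_matrix[0, 1]

# pandas returns the covariance as a single number
cov_pd = df["x"].cov(df["y"])

print(cov_matrix)
print(cov_np, cov_pd)  # the two values agree
```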

**Practice:** Calculate the covariance between the column with the "Median earnings of full-time, year-round workers" and the column with "Women as share of total" in the dataset.

In [None]:
### Your code below this line


### CORRELATION

Preferred way to go about it:
 - **pandas**: Use the method `.corr( <column to calculate correlation with> )` for any two columns with numerical values. Let us try it on the columns "Employed" and "ShareWomen".
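A minimal sketch on a hypothetical toy dataframe shows the pandas call, and also checks it against the definition of the Pearson correlation as the covariance divided by the product of the two standard deviations. The column names `x` and `y` are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from recent-grads.csv
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 5, 4, 5]})

# Pearson correlation is the default for .corr()
r = df["x"].corr(df["y"])

# Same value from the definition: cov(x, y) / (std(x) * std(y))
r_manual = df["x"].cov(df["y"]) / (df["x"].std() * df["y"].std())

print(r, r_manual)
```

Unlike covariance, the correlation always falls between -1 and 1, so it can be compared across pairs of columns with different units.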

**Practice:** Calculate the correlation between the column with the "Median earnings of full-time, year-round workers" and the column with "Women as share of total" in the dataset.

In [None]:
### Your code below this line
