# COMM 187: Data Science in Communication Research, Winter 2026

## Week #4 Coding Lab: Statistics with Python

Welcome to the Week #4 Coding Lab for COMM 187: Data Science in Communication Research! 

Thus far, we have learned some basic Python skills (variables, data types, lists, and NumPy), Python dictionaries, and Pandas.

Today's lesson plan:
 - Introduction to `scipy`
 - Comparative Statistics: t-test and ANOVA
 - Associative Statistics: correlation and covariance 

### Introduction to `scipy`

In [1]:
import pandas as pd
import numpy as np

We will now use a new library, called `scipy`, which is a powerful scientific computation library in Python. 

In Python, we have learned how to `import` a library, and how to `import` a library `as` a nickname/alias. Now, we will try importing a specific subset of functions from a library. We do this by using the format:
```
from <library> import <name of sub-library>
```

Here, we only need the sub-library `stats` from `scipy`, so we will run the following code:

In [2]:
from scipy import stats
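As a small side illustration (not part of the original lab), the sketch below compares two import styles. It imports the `stats` sub-library as above, then imports a single function from it by name, and checks that both names refer to the same function. The variable name `same_function` is made up for this example.

```python
# Import the whole sub-library, then reach its functions through the name `stats`
from scipy import stats

# Import one specific function directly from the sub-library
from scipy.stats import ttest_ind

# Both names refer to the same function object
same_function = ttest_ind is stats.ttest_ind
print(same_function)
```

In this lab we keep the `stats.` prefix, which makes it easy to see which library each function comes from.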

Let us start with a familiar dataset: the college majors dataset. This dataset is the data behind [this article](https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/).

You can access the repository for this dataset [here](https://github.com/fivethirtyeight/data/tree/master/college-majors).

In [3]:
### Your code below this line
recentgrads = pd.read_csv('./data/recent-grads.csv')

In [4]:
recentgrads

Unnamed: 0,Rank,Major_code,Major,Total,Men,Women,Major_category,ShareWomen,Sample_size,Employed,...,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Income,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
0,1,2419,PETROLEUM ENGINEERING,2339.0,2057.0,282.0,Engineering,0.120564,36,1976,...,270,1207,37,0.018381,110000,95000,125000,1534,364,193
1,2,2416,MINING AND MINERAL ENGINEERING,756.0,679.0,77.0,Engineering,0.101852,7,640,...,170,388,85,0.117241,75000,55000,90000,350,257,50
2,3,2415,METALLURGICAL ENGINEERING,856.0,725.0,131.0,Engineering,0.153037,3,648,...,133,340,16,0.024096,73000,50000,105000,456,176,0
3,4,2417,NAVAL ARCHITECTURE AND MARINE ENGINEERING,1258.0,1123.0,135.0,Engineering,0.107313,16,758,...,150,692,40,0.050125,70000,43000,80000,529,102,0
4,5,2405,CHEMICAL ENGINEERING,32260.0,21239.0,11021.0,Engineering,0.341631,289,25694,...,5180,16697,1672,0.061098,65000,50000,75000,18314,4440,972
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
168,169,3609,ZOOLOGY,8409.0,3050.0,5359.0,Biology & Life Science,0.637293,47,6259,...,2190,3602,304,0.046320,26000,20000,39000,2771,2947,743
169,170,5201,EDUCATIONAL PSYCHOLOGY,2854.0,522.0,2332.0,Psychology & Social Work,0.817099,7,2125,...,572,1211,148,0.065112,25000,24000,34000,1488,615,82
170,171,5202,CLINICAL PSYCHOLOGY,2838.0,568.0,2270.0,Psychology & Social Work,0.799859,13,2101,...,648,1293,368,0.149048,25000,25000,40000,986,870,622
171,172,5203,COUNSELING PSYCHOLOGY,4626.0,931.0,3695.0,Psychology & Social Work,0.798746,21,3777,...,965,2738,214,0.053621,23400,19200,26000,2403,1245,308


Now, print the name of the columns of this dataframe using `recentgrads.columns`.

In [5]:
### Your code below this line
recentgrads.columns

Index(['Rank', 'Major_code', 'Major', 'Total', 'Men', 'Women',
       'Major_category', 'ShareWomen', 'Sample_size', 'Employed', 'Full_time',
       'Part_time', 'Full_time_year_round', 'Unemployed', 'Unemployment_rate',
       'Income', 'P25th', 'P75th', 'College_jobs', 'Non_college_jobs',
       'Low_wage_jobs'],
      dtype='object')

For your reference, here are the descriptions of the values in each of these columns:

Column Name | Description
---|---------
`Rank` | Rank by median earnings
`Major_code` | Major code, FO1DP in ACS PUMS
`Major` | Major description
`Major_category` | Category of major from Carnevale et al
`Total` | Total number of people with major
`Sample_size` | Sample size (unweighted) of full-time, year-round ONLY (used for earnings)
`Men` | Male graduates
`Women` | Female graduates
`ShareWomen` | Women as share of total
`Employed` | Number employed (ESR == 1 or 2)
`Full_time` | Employed 35 hours or more
`Part_time` | Employed less than 35 hours
`Full_time_year_round` | Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35)
`Unemployed` | Number unemployed (ESR == 3)
`Unemployment_rate` | Unemployed / (Unemployed + Employed)
`Income` | Median earnings of full-time, year-round workers
`P25th` | 25th percentile of earnings
`P75th` | 75th percentile of earnings
`College_jobs` | Number with job requiring a college degree
`Non_college_jobs` | Number with job not requiring a college degree
`Low_wage_jobs` | Number in low-wage service jobs

---

## Descriptive Statistics

### Mean (or Average)

Two ways to go about it:
 - **numpy**: Use the `np.mean()` function
 - **pandas**: Use the method `.mean()` for any column with numerical values.

For example, for a column named "col1" in a dataframe `df`, you can calculate the mean by using either 
 - `np.mean(df["col1"])`, or
 - `df["col1"].mean()`

Let us try it on the column "Employed" in dataframe `recentgrads`. Try out both ways.

In [6]:
np.mean(recentgrads["Employed"])

np.float64(31192.763005780347)

In [7]:
recentgrads["Employed"].mean()

np.float64(31192.763005780347)

**Practice:** Calculate the mean of "Median earnings of full-time, year-round workers" in the dataset.

In [8]:
### Your code below this line
recentgrads["Income"].mean()

np.float64(40151.4450867052)

### MEDIAN

Two ways to go about it:
 - **numpy**: Use the `np.median()` function
 - **pandas**: Use the method `.median()` for any column with numerical values.

Let us try it on the column "Employed" in dataframe `recentgrads`. Try out both ways.

In [9]:
np.median(recentgrads["Employed"])

np.float64(11797.0)

In [10]:
recentgrads["Employed"].median()

np.float64(11797.0)

**Practice:** Calculate the median of "Median earnings of full-time, year-round workers" in the dataset.

In [11]:
### Your code below this line
recentgrads["Income"].median()

np.float64(36000.0)
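In the outputs above, the mean of "Employed" (about 31,193) is much larger than its median (11,797). A gap like this usually suggests that a few very large values are pulling the mean upward. Here is a minimal sketch of that effect, using hypothetical toy numbers rather than the lab dataset:

```python
import pandas as pd

# Hypothetical toy data: four small values and one very large value
values = pd.Series([1, 2, 3, 4, 100])

mean_value = values.mean()      # pulled upward by the single large value
median_value = values.median()  # the middle value, barely affected by it

print(mean_value)    # 22.0
print(median_value)  # 3.0
```

When the mean and the median disagree this much, it is often worth reporting both.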

### STANDARD DEVIATION

Two ways to go about it:
 - **numpy**: Use the `np.std()` function
 - **pandas**: Use the method `.std()` for any column with numerical values.

Let us try it on the column "Employed" in dataframe `recentgrads`. Try out both ways. 

In [12]:
np.std(recentgrads["Employed"])

np.float64(50528.330436051576)

In [13]:
recentgrads["Employed"].std()

np.float64(50675.0022407546)
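Notice that the two results above are slightly different. The reason is the default divisor: `np.std()` divides by `n` (population standard deviation, `ddof=0`), while the pandas method `.std()` divides by `n - 1` (sample standard deviation, `ddof=1`). The sketch below shows this on hypothetical toy data; the variable names are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with mean 5
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

population_sd = np.std(values)       # numpy default: ddof=0, divide by n
sample_sd = values.std()             # pandas default: ddof=1, divide by n - 1
matched_sd = np.std(values, ddof=1)  # numpy told to divide by n - 1 instead

print(population_sd)  # 2.0
print(sample_sd)
print(matched_sd)     # same as the pandas result
```

For data that is a sample from a larger population, the `n - 1` version (the pandas default) is the one usually reported.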

**Practice:** Calculate the standard deviation of "Median earnings of full-time, year-round workers" in the dataset.

In [14]:
### Your code below this line
recentgrads["Income"].std()

np.float64(11470.18180213382)

### Multiple Descriptive Statistics (`.agg()`)

Use the method `.agg()` on any column with numerical values, and inside the `()` brackets, enter a **list** of the statistical functions you would like to apply to that column. 

 - for mean, just write `'mean'`
 - for median, just write `'median'`
 - for standard deviation, just write `'std'`

Remember! In Python, `'` and `"` are interchangeable.

For example, to calculate both the mean and the median for "col1" in dataframe `df`, you can write `df["col1"].agg(['mean', 'median'])`

Let us try it on the column "Employed" in dataframe `recentgrads`. Try getting all three statistics in the same line, and then separately. 

In [15]:
recentgrads["Employed"].agg(["median", "mean", "std"])

median    11797.000000
mean      31192.763006
std       50675.002241
Name: Employed, dtype: float64

In [16]:
recentgrads["Employed"].agg(["median"])

median    11797.0
Name: Employed, dtype: float64

In [17]:
recentgrads["Employed"].agg(["mean"])

mean    31192.763006
Name: Employed, dtype: float64

In [18]:
recentgrads["Employed"].agg(["std"])

std    50675.002241
Name: Employed, dtype: float64

**Practice:** Calculate the mean and the standard deviation of "Median earnings of full-time, year-round workers" in the dataset.

In [19]:
### Your code below this line
recentgrads["Income"].agg(["mean", "std"])

mean    40151.445087
std     11470.181802
Name: Income, dtype: float64
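The same `.agg()` call can also be applied to several columns at once by selecting them with a list of column names first. Here is a minimal sketch on a hypothetical toy dataframe (`df`, `col1`, and `col2` are made-up names, not columns from `recentgrads`):

```python
import pandas as pd

# Hypothetical toy dataframe with two numerical columns
df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [10, 20, 30, 40]})

# Select both columns, then request two statistics for each
summary = df[["col1", "col2"]].agg(["mean", "median"])
print(summary)

# Individual values can be looked up by statistic name and column name
col1_mean = summary.loc["mean", "col1"]
print(col1_mean)  # 2.5
```

The result is a small dataframe with one row per statistic and one column per selected column.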

### Summary Statistics (`.describe()`)

Instead of calculating individual statistics for individual columns, we can also get a summary of the **mean**, **median** (shown as `50%`), **standard deviation**, and some other statistics using the method `.describe()`.

For example, for a column "col1" in dataframe `df`, you would write `df["col1"].describe()`

Let us try it on the column "Employed" in dataframe `recentgrads`.

In [20]:
recentgrads["Employed"].describe()

count       173.000000
mean      31192.763006
std       50675.002241
min           0.000000
25%        3608.000000
50%       11797.000000
75%       31433.000000
max      307933.000000
Name: Employed, dtype: float64
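The output of `.describe()` is itself a pandas Series, so individual statistics can be pulled out by their labels. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data
values = pd.Series([1, 2, 3, 4, 100])

summary = values.describe()

# The row labelled "50%" is the median
median_from_describe = summary["50%"]
count = summary["count"]

print(median_from_describe)  # 3.0
print(count)                 # 5.0
```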

## Comparative Statistics

### T-TEST

With your group, go through the [SciPy Documentation for T-Test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) and solve the following practice question:

**Question:** Compare the mean `Unemployment_rate` between the following two `Major_category`: "Humanities & Liberal Arts" and "Social Science" in dataframe `recentgrads`. \
Run a t-test, and interpret the results.

In [21]:
recentgrads[recentgrads['Major_category'] == 'Humanities & Liberal Arts']['Unemployment_rate']

69     0.047179
99     0.063429
114    0.095667
115    0.075566
116    0.083634
129    0.104436
135    0.096052
137    0.087724
140    0.078268
148    0.060298
157    0.068584
158    0.062628
162    0.102792
165    0.107116
167    0.081742
Name: Unemployment_rate, dtype: float64

In [22]:
recentgrads[recentgrads['Major_category'] == 'Social Science']['Unemployment_rate']

36     0.099092
56     0.096799
68     0.073080
78     0.101175
79     0.113459
102    0.097244
124    0.084951
131    0.092306
142    0.103455
Name: Unemployment_rate, dtype: float64

In [24]:
# Extract data for 'Humanities & Liberal Arts' and 'Social Science'
HLA = recentgrads[recentgrads['Major_category'] == 'Humanities & Liberal Arts']['Unemployment_rate']
SS = recentgrads[recentgrads['Major_category'] == 'Social Science']['Unemployment_rate']

# Perform t-test using stats.ttest_ind
stats.ttest_ind(HLA, SS)

TtestResult(statistic=np.float64(-2.1749033790189163), pvalue=np.float64(0.04066619780114849), df=np.float64(22.0))
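To interpret the result, it helps to pull out the test statistic and the p-value by name and compare the p-value with a significance level. The sketch below reuses the unemployment rates printed in the outputs of In [21] and In [22] (rounded to six decimals), so it runs without the CSV file. The significance level `alpha = 0.05` is an assumption for this example; the lab does not fix one.

```python
from scipy import stats

# Unemployment rates copied from the outputs of In [21] and In [22] above
hla = [0.047179, 0.063429, 0.095667, 0.075566, 0.083634,
       0.104436, 0.096052, 0.087724, 0.078268, 0.060298,
       0.068584, 0.062628, 0.102792, 0.107116, 0.081742]
ss = [0.099092, 0.096799, 0.073080, 0.101175, 0.113459,
      0.097244, 0.084951, 0.092306, 0.103455]

result = stats.ttest_ind(hla, ss)

alpha = 0.05  # assumed significance level, not specified in the lab

print(result.statistic)  # negative: the first group's mean is lower
print(result.pvalue)

# True when the difference in means is significant at the chosen level
significant = result.pvalue < alpha
print(significant)
```

Here the statistic is negative because the mean unemployment rate for "Humanities & Liberal Arts" is lower than for "Social Science", and the p-value (about 0.041) is below 0.05, so at that level we would call the difference statistically significant.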

**Question:** Compare the mean `Unemployment_rate` between the following two `Major_category`: "Law & Public Policy" and "Engineering". \
Run a t-test, and interpret the results.

In [26]:
# Extract data for different majors
LPP = recentgrads[recentgrads['Major_category'] == 'Law & Public Policy']['Unemployment_rate']
E = recentgrads[recentgrads['Major_category'] == 'Engineering']['Unemployment_rate']

# Perform t-test using stats.ttest_ind
stats.ttest_ind(LPP, E)

TtestResult(statistic=np.float64(1.4784947640128945), pvalue=np.float64(0.1490521670401815), df=np.float64(32.0))
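Both t-tests above use the default settings of `stats.ttest_ind`, which assume that the two groups have equal variances. The SciPy documentation linked above describes an `equal_var` parameter; setting `equal_var=False` runs Welch's t-test, which drops that assumption. The sketch below is only an illustration of that option, reusing the rates printed in the outputs of In [21] and In [22] so that it runs without the CSV file.

```python
from scipy import stats

# Unemployment rates copied from the outputs of In [21] and In [22] above
hla = [0.047179, 0.063429, 0.095667, 0.075566, 0.083634,
       0.104436, 0.096052, 0.087724, 0.078268, 0.060298,
       0.068584, 0.062628, 0.102792, 0.107116, 0.081742]
ss = [0.099092, 0.096799, 0.073080, 0.101175, 0.113459,
      0.097244, 0.084951, 0.092306, 0.103455]

# Default: assumes equal variances in the two groups
student = stats.ttest_ind(hla, ss)

# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(hla, ss, equal_var=False)

print(student.statistic, student.pvalue)
print(welch.statistic, welch.pvalue)
```

The two versions generally give somewhat different statistics and p-values, so it is good practice to state which one was used when reporting results.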

### ANOVA

With your group, go through the [SciPy Documentation for one-way ANOVA](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html) and solve the following practice question:

**Question:** Test if there are any significant differences in the `Unemployment_rate` between the following four `Major_category`: "Humanities & Liberal Arts", "Law & Public Policy", "Engineering", and "Social Science".

In [27]:
# Extract data for each major
HLA = recentgrads[recentgrads['Major_category'] == 'Humanities & Liberal Arts']['Unemployment_rate']
SS = recentgrads[recentgrads['Major_category'] == 'Social Science']['Unemployment_rate']
LPP = recentgrads[recentgrads['Major_category'] == 'Law & Public Policy']['Unemployment_rate']
E = recentgrads[recentgrads['Major_category'] == 'Engineering']['Unemployment_rate']

# Perform ANOVA using stats.f_oneway
stats.f_oneway(HLA, SS, LPP, E)

F_onewayResult(statistic=np.float64(3.2734832320511895), pvalue=np.float64(0.027932234385892967))

**Question:** Test if there are any significant differences in `ShareWomen` between the following three `Major_category`: "Biology & Life Science", "Computers & Mathematics", and "Business".

In [28]:
# Extract data for each major
BLS = recentgrads[recentgrads['Major_category'] == 'Biology & Life Science']['ShareWomen']
CM = recentgrads[recentgrads['Major_category'] == 'Computers & Mathematics']['ShareWomen']
B = recentgrads[recentgrads['Major_category'] == 'Business']['ShareWomen']

# Perform ANOVA using stats.f_oneway
stats.f_oneway(BLS, CM, B)

F_onewayResult(statistic=np.float64(20.69422793238571), pvalue=np.float64(1.1700528545519522e-06))
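One way to see how ANOVA relates to the t-test: with only two groups, a one-way ANOVA and the default (equal-variance) t-test lead to the same p-value, and the F statistic equals the square of the t statistic. The sketch below checks this with the unemployment rates printed in the outputs of In [21] and In [22]; it is an illustration, not part of the original practice questions.

```python
from scipy import stats

# Unemployment rates copied from the outputs of In [21] and In [22] above
hla = [0.047179, 0.063429, 0.095667, 0.075566, 0.083634,
       0.104436, 0.096052, 0.087724, 0.078268, 0.060298,
       0.068584, 0.062628, 0.102792, 0.107116, 0.081742]
ss = [0.099092, 0.096799, 0.073080, 0.101175, 0.113459,
      0.097244, 0.084951, 0.092306, 0.103455]

ttest = stats.ttest_ind(hla, ss)  # two-group comparison
anova = stats.f_oneway(hla, ss)   # one-way ANOVA on the same two groups

print(ttest.statistic ** 2, anova.statistic)  # these two numbers match
print(ttest.pvalue, anova.pvalue)             # and so do the p-values
```

With three or more groups, a significant ANOVA result only tells us that at least one group mean differs from the others; it does not say which one.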

## Associative Statistics

### COVARIANCE

Two ways to go about it:
 - **numpy**: Use the `np.cov()` function
 - **pandas**: Use the method `.cov( <column to calculate covariance with> )` for any two columns with numerical values.

Let us try it on the columns "Employed" and "ShareWomen". 

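NOTE: `np.cov()` does not skip missing values, so it returns `nan` here, most likely because one of these columns contains a missing value. The pandas method leaves out rows with missing values, so use the pandas way.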

In [32]:
recentgrads["Employed"].cov(recentgrads["ShareWomen"])

np.float64(1732.1987732913128)
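The difference between the numpy and pandas behavior with missing values can be sketched on hypothetical toy data (the values and variable names below are made up for this example):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; y has one missing value
x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = pd.Series([2.0, 4.0, np.nan, 8.0])

# np.cov returns a 2x2 matrix; the covariance of x and y is at position [0, 1]
numpy_cov = np.cov(x, y)[0, 1]
print(numpy_cov)  # nan, because of the missing value in y

# pandas leaves out the row with the missing value
pandas_cov = x.cov(y)
print(pandas_cov)

# numpy gives the same answer once the incomplete rows are removed by hand
mask = x.notna() & y.notna()
numpy_cov_clean = np.cov(x[mask], y[mask])[0, 1]
print(numpy_cov_clean)
```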

**Practice:** Calculate the covariance between the column with the "Median earnings of full-time, year-round workers" and the column with "Women as share of total" in the dataset.

In [37]:
### Your code below this line
recentgrads["Income"].cov(recentgrads["ShareWomen"])

np.float64(-1639.4846731945195)

### CORRELATION

Preferred way to go about it:
 - **pandas**: Use the method `.corr( <column to calculate correlation with> )` for any two columns with numerical values.

Let us try it on the columns "Employed" and "ShareWomen".

In [38]:
recentgrads["Employed"].corr(recentgrads["ShareWomen"])

np.float64(0.14754681093687835)

**Practice:** Calculate the correlation between the column with the "Median earnings of full-time, year-round workers" and the column with "Women as share of total" in the dataset.

In [39]:
### Your code below this line
recentgrads["Income"].corr(recentgrads["ShareWomen"])

np.float64(-0.618689751213161)
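Correlation and covariance are closely linked: the (Pearson) correlation is the covariance divided by the product of the two standard deviations, which rescales it to lie between -1 and 1. The sketch below checks this on hypothetical toy data. It also calls `stats.pearsonr` from `scipy`, which returns a p-value along with the correlation; that function is not used elsewhere in this lab and is shown here only as an optional extra.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 1.0, 4.0, 3.0, 5.0])

# Correlation from the pandas method
pandas_corr = x.corr(y)

# Correlation computed by hand: covariance / (std of x * std of y)
manual_corr = x.cov(y) / (x.std() * y.std())

# scipy returns the correlation together with a p-value
r, p_value = stats.pearsonr(x, y)

print(pandas_corr)
print(manual_corr)
print(r, p_value)
```

All three correlation values agree (0.8 for this toy data, up to floating-point rounding).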