# COMM 187: Data Science in Communication Research
# Spring 2025

## Week #4 Coding Lab: Statistics with Python
**Monday, April 21, 2025**

Welcome to the Week #4 Coding Lab for COMM 187: Data Science in Communication Research! 

Thus far, we have learned some basic Python skills (variables, data types, lists, and NumPy), Python dictionaries, and Pandas.

Today's lesson plan:
 - Introduction to `scipy`
 - Comparative Statistics: t-test, Z-test, ANOVA, and Chi-Squared test
 - Visualizing comparative statistics using `matplotlib`

### Introduction to `scipy`

In [None]:
import pandas as pd
import numpy as np

We will now use a new library, called `scipy`, which is a powerful scientific computation library in Python. 

In Python, we have learned how to `import` a library, and how to `import` a library `as` a nickname/alias. Now, we will try importing a specific subset of functions from a library. We do this by using the format:
```
from <library> import <name of sub-library>
```

Here, we only need the sub-library `stats` from `scipy`, so we will run the following code:

In [None]:
from scipy import stats
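
As a quick check of what this import style does, the short sketch below compares it with the full dotted import. It assumes only that `scipy` is installed; both names end up pointing at the same sub-library, so functions such as `stats.ttest_ind` are reached through the shorter name.

```python
import scipy.stats            # full dotted path: functions are scipy.stats.<function>
from scipy import stats       # same sub-library, shorter name: stats.<function>

# Both names refer to the same module object
same_module = stats is scipy.stats
print(same_module)
```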

Let us start with a familiar dataset: the college majors dataset. This dataset is the data behind [this article](https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/).

You can access the repository for this dataset [here](https://github.com/fivethirtyeight/data/tree/master/college-majors).

In [None]:
### Your code below this line
df = pd.read_csv('./data/recent-grads.csv')

Now, print the name of the columns of this DataFrame using `df.columns`.

In [None]:
### Your code below this line
df.columns

For your reference, here are the descriptions of the values in each of these columns:

Column Name | Description
---|---------
`Rank` | Rank by median earnings
`Major_code` | Major code, FOD1P in ACS PUMS
`Major` | Major description
`Major_category` | Category of major from Carnevale et al
`Total` | Total number of people with major
`Sample_size` | Sample size (unweighted) of full-time, year-round ONLY (used for earnings)
`Men` | Male graduates
`Women` | Female graduates
`ShareWomen` | Women as share of total
`Employed` | Number employed (ESR == 1 or 2)
`Full_time` | Employed 35 hours or more
`Part_time` | Employed less than 35 hours
`Full_time_year_round` | Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35)
`Unemployed` | Number unemployed (ESR == 3)
`Unemployment_rate` | Unemployed / (Unemployed + Employed)
`Income` | Median earnings of full-time, year-round workers
`P25th` | 25th percentile of earnings
`P75th` | 75th percentile of earnings
`College_jobs` | Number with job requiring a college degree
`Non_college_jobs` | Number with job not requiring a college degree
`Low_wage_jobs` | Number in low-wage service jobs

---

## Descriptive Statistics

### Mean (or Average)

Two ways to go about it:
 - **numpy**: Use the `np.mean()` function
 - **pandas**: Use the method `.mean()` for any column with numerical values.

Let us try it on the column "Employed".

In [None]:
df["Employed"].mean()

In [None]:
np.mean(df["Employed"])

**Practice:** Calculate the mean of "Median earnings of full-time, year-round workers" in the dataset.

In [None]:
### Your code below this line


### Median

Two ways to go about it:
 - **numpy**: Use the `np.median()` function
 - **pandas**: Use the method `.median()` for any column with numerical values.

Let us try it on the column "Employed".

In [None]:
df["Employed"].median()

In [None]:
np.median(df["Employed"])

**Practice:** Calculate the median of "Median earnings of full-time, year-round workers" in the dataset.

In [None]:
### Your code below this line


### Standard Deviation

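Two ways to go about it:
 - **numpy**: Use the `np.std()` function
 - **pandas**: Use the method `.std()` for any column with numerical values.

Note that the two do not return the same number by default: `np.std()` divides by `n` (population standard deviation, `ddof=0`), while the pandas method `.std()` divides by `n - 1` (sample standard deviation, `ddof=1`).

Let us try it on the column "Employed".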

In [None]:
df["Employed"].std()

In [None]:
np.std(df["Employed"])
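
The two cells above can give slightly different answers because NumPy and pandas use different default denominators. Here is a minimal sketch of that difference; the numbers are made-up toy values standing in for a numeric column such as "Employed", not values from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values (not from recent-grads.csv)
employed = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = employed.std()                          # pandas default: ddof=1 (divide by n - 1)
population_std = np.std(employed.to_numpy())         # NumPy default: ddof=0 (divide by n)
matched_std = np.std(employed.to_numpy(), ddof=1)    # NumPy made to match pandas

print(sample_std, population_std, matched_std)
```

Passing `ddof=1` to `np.std()` (or `ddof=0` to `.std()`) makes the two libraries agree.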

**Practice:** Calculate the standard deviation of "Median earnings of full-time, year-round workers" in the dataset.

In [None]:
### Your code below this line


### Multiple Descriptive Statistics (`.agg()`)

Use the method `.agg()` on any column with numerical values, and inside the `()` brackets, enter the statistical function (or a list of functions) you would like to apply to that column.

 - for mean, just write `'mean'`
 - for median, just write `'median'`
 - for standard deviation, just write `'std'`

Remember! In Python, `'` and `"` are interchangeable.

Let us try it on the column "Employed".

In [None]:
df["Employed"].agg('mean')

In [None]:
df["Employed"].agg(['mean'])

In [None]:
df["Employed"].agg(['mean', 'median'])

In [None]:
df["Employed"].agg(['mean', 'median', 'std'])
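
Notice that `.agg('mean')` and `.agg(['mean'])` do not return the same kind of object. The sketch below shows the difference on a small, hypothetical DataFrame (the values are invented for illustration).

```python
import pandas as pd

# Hypothetical toy data (not from recent-grads.csv)
toy = pd.DataFrame({"Employed": [10, 20, 30, 40]})

single = toy["Employed"].agg("mean")                       # a single number
as_list = toy["Employed"].agg(["mean"])                    # a Series with one labeled entry
several = toy["Employed"].agg(["mean", "median", "std"])   # a Series with three labeled entries

# Individual results can be looked up by name
print(single, as_list["mean"], several["median"])
```

Passing a list is handy when you want the results labeled, since each statistic can then be retrieved by its name.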

**Practice:** Use `.agg()` to calculate the mean, median, and standard deviation of "Median earnings of full-time, year-round workers" in the dataset.

In [None]:
### Your code below this line


### Summary Statistics (`.describe()`)

Instead of calculating individual statistics one at a time, we can get a summary that includes the **mean**, the **median** (shown as the `50%` row), the **standard deviation**, and some other statistics using the method `.describe()`.

Let us try it on the column "Employed".

In [None]:
df["Employed"].describe()
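
The output of `.describe()` is itself a pandas Series, so individual statistics can be pulled out by label. A minimal sketch on invented toy values:

```python
import pandas as pd

# Hypothetical toy values (not from recent-grads.csv)
toy = pd.Series([10, 20, 30, 40, 100])

summary = toy.describe()

# The "50%" row is the 50th percentile, i.e. the median
median_from_describe = summary["50%"]
print(list(summary.index))
print(median_from_describe, toy.median())
```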

## Comparative Statistics

### T-Test

With your group, go through the [SciPy Documentation for T-Test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) and solve the following practice question:

**Question:** Compare the mean `Unemployment_rate` between the following two `Major_category`: "Humanities & Liberal Arts" and "Social Science". \
Run a t-test, and interpret the results.

In [None]:
pd.unique(df['Major_category'])

In [None]:
# Extract data for 'Humanities & Liberal Arts' and 'Social Science'
HLA = df[df['Major_category'] == 'Humanities & Liberal Arts']['Unemployment_rate']
SS = df[df['Major_category'] == 'Social Science']['Unemployment_rate']

# Perform t-test using stats.ttest_ind
stats.ttest_ind(HLA, SS)
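
To interpret the output, it helps to store the result and look at its two parts. The sketch below uses two small, invented groups in place of `HLA` and `SS`, and assumes the conventional significance level of 0.05. The `equal_var=False` option (Welch's t-test) is shown as an alternative for groups whose variances may differ.

```python
from scipy import stats

# Hypothetical unemployment rates for two groups (not from the dataset)
group_a = [0.08, 0.07, 0.09, 0.10, 0.06]
group_b = [0.10, 0.11, 0.09, 0.12]

# Standard independent-samples t-test (assumes equal variances)
result = stats.ttest_ind(group_a, group_b)
t_stat, p_value = result.statistic, result.pvalue

# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

alpha = 0.05  # assumed significance level
if p_value < alpha:
    print("Reject the null hypothesis: the group means differ significantly.")
else:
    print("Fail to reject the null hypothesis: no significant difference detected.")
```

The sign of the t-statistic tells you which group has the larger mean (negative here means the first group's mean is lower), and the p-value tells you how surprising a difference this large would be if the true means were equal.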

**Question:** Compare the mean `Unemployment_rate` between the following two `Major_category`: "Law & Public Policy" and "Engineering". \
Run a t-test, and interpret the results.

In [None]:
# Extract data for different majors

# Perform t-test using stats.ttest_ind


### ANOVA

With your group, go through the [SciPy Documentation for one-way ANOVA](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html) and solve the following practice question:

**Question:** Test if there are any significant differences in the `Unemployment_rate` between the following four `Major_category`: "Humanities & Liberal Arts", "Law & Public Policy", "Engineering", and "Social Science".

In [None]:
# Extract data for each major
HLA = df[df['Major_category'] == 'Humanities & Liberal Arts']['Unemployment_rate']
SS = df[df['Major_category'] == 'Social Science']['Unemployment_rate']
LPP = df[df['Major_category'] == 'Law & Public Policy']['Unemployment_rate']
E = df[df['Major_category'] == 'Engineering']['Unemployment_rate']

# Perform ANOVA using stats.f_oneway
stats.f_oneway(HLA, SS, LPP, E)
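
When there are many groups, extracting each one by hand gets repetitive. One possible alternative, sketched here on a small invented DataFrame that reuses the column names from the dataset, is to let `groupby` build the groups and pass them all to `stats.f_oneway` at once.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data (values are invented; column names mirror the dataset)
toy = pd.DataFrame({
    "Major_category": ["Engineering"] * 3 + ["Social Science"] * 3 + ["Law & Public Policy"] * 3,
    "Unemployment_rate": [0.05, 0.06, 0.07, 0.09, 0.10, 0.11, 0.08, 0.09, 0.10],
})

# One Series of unemployment rates per category, with missing values dropped
groups = [g["Unemployment_rate"].dropna() for _, g in toy.groupby("Major_category")]

# The * unpacks the list so each group is passed as a separate argument
f_stat, p_value = stats.f_oneway(*groups)
print(f_stat, p_value)
```

A small p-value only tells us that at least one group mean differs from the others; it does not say which one.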

**Question:** Test if there are any significant differences in `ShareWomen` between the following three `Major_category`: "Biology & Life Science", "Computers & Mathematics", and "Business".

In [None]:
# Extract data for each major

# Perform ANOVA using stats.f_oneway


## Associative Statistics

### Covariance

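Two ways to go about it:
 - **numpy**: Use the `np.cov()` function. It returns a 2×2 covariance matrix: the diagonal holds the variance of each column, and the off-diagonal entries (`[0, 1]` and `[1, 0]`) hold the covariance between the two columns.
 - **pandas**: Use the method `.cov( <column to calculate covariance with> )` for any two columns with numerical values.

If either column contains missing values, `np.cov()` returns `nan` for the covariance, while the pandas method skips the incomplete rows.

Let us try it on the columns "Employed" and "ShareWomen".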

In [None]:
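# np.cov returns a 2x2 matrix; [0, 1] is the covariance ([0, 0] would be the variance of "Employed")
np.cov(df["Employed"], df["ShareWomen"])[0, 1]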

In [None]:
df["Employed"].cov(df["ShareWomen"])
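
To see how the NumPy and pandas approaches relate, here is a minimal sketch on an invented DataFrame with one missing value. The column names mirror the dataset, but the numbers are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value (not from recent-grads.csv)
toy = pd.DataFrame({
    "Employed": [1.0, 2.0, 3.0, 4.0],
    "ShareWomen": [0.2, 0.4, 0.5, np.nan],
})

# pandas skips rows where either value is missing
pandas_cov = toy["Employed"].cov(toy["ShareWomen"])

# NumPy does not skip missing values, so the covariance entry becomes nan
raw = np.cov(toy["Employed"], toy["ShareWomen"])[0, 1]

# Dropping incomplete rows first makes NumPy agree with pandas
clean = toy[["Employed", "ShareWomen"]].dropna()
matrix = np.cov(clean["Employed"], clean["ShareWomen"])
numpy_cov = matrix[0, 1]   # off-diagonal entry = covariance

print(pandas_cov, raw, numpy_cov)
```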

**Practice:** Calculate the covariance between the column with the "Median earnings of full-time, year-round workers" and the column with "Women as share of total" in the dataset.

In [None]:
### Your code below this line


### Correlation

Preferred way to go about it:
 - **pandas**: Use the method `.corr( <column to calculate correlation with> )` for any two columns with numerical values.

Let us try it on the columns "Employed" and "ShareWomen".

In [None]:
df["Employed"].corr(df["ShareWomen"])
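
Correlation is the covariance rescaled by the two standard deviations, which is why it always falls between -1 and 1. The sketch below checks this on invented toy values, and also shows `stats.pearsonr` from `scipy` as one way to get a p-value alongside the correlation coefficient.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data (not from recent-grads.csv)
toy = pd.DataFrame({
    "Employed": [1.0, 2.0, 3.0, 4.0, 5.0],
    "ShareWomen": [0.9, 0.7, 0.6, 0.4, 0.2],
})
x, y = toy["Employed"], toy["ShareWomen"]

r_pandas = x.corr(y)                          # Pearson correlation (pandas default)
r_manual = x.cov(y) / (x.std() * y.std())     # covariance divided by both standard deviations
r_scipy, p_value = stats.pearsonr(x, y)       # same coefficient, plus a p-value

print(r_pandas, r_manual, r_scipy, p_value)
```

Unlike the pandas method, `stats.pearsonr` does not skip missing values, so drop incomplete rows before calling it.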

**Practice:** Calculate the correlation between the column with the "Median earnings of full-time, year-round workers" and the column with "Women as share of total" in the dataset.

In [None]:
### Your code below this line
