## COMM 187 (160DS): Data Science in Communication Research -- Spring 2024

## Coding Lab #7: Statistical Analysis in Python
**Wednesday, May 15, 2024**

Welcome to Coding Lab #7 for COMM 187 (160DS): Data Science in Communication Research!

In the last Coding Lab, we learned about Python dictionaries and the pandas library, including pandas `Series` and `DataFrame` objects.

Today's lesson plan:
 - Introduction to `scipy`
 - Comparative Statistics: t-test, Z-test, ANOVA, and Chi-Squared test
 - Association between variables: Covariance and Correlation
 - Linear Regression

### Introduction to `scipy`

In [None]:
import pandas as pd
import numpy as np

We will now use a new library, called `scipy`, which is a powerful scientific computation library in Python. 

In Python, we have learned how to `import` a library, and how to `import` a library `as` a nickname/alias. Now, we will try importing a specific subset of functions from a library. We do this by using the format:
```
from <library> import <name of sub-library>
```

Here, we only need the sub-library `stats` from `scipy`, so we will run the following code:

In [None]:
from scipy import stats
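
As a quick check that the import worked, the sketch below calls one function through the `stats` name. The choice of `stats.norm.cdf` is only for illustration; any function in `scipy.stats` is reached the same way.

```python
from scipy import stats

# The standard normal distribution is symmetric around 0,
# so half of its probability mass lies at or below 0.
half = stats.norm.cdf(0)
print(half)
```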

Let us start with a familiar dataset: the redlining dataset from Assignment #6. 

To review: Redlining is the discriminatory practice of denying services (like mortgages) to residents of certain areas based on racial or ethnic composition, leading to economic and social disparities.

The dataset is stored as a csv file in the sub-folder "data" by the name "metro-grades.csv". \
You can read more about the dataset on its [github repository page](https://github.com/fivethirtyeight/data/tree/master/redlining). 

In [None]:
df = pd.read_csv('./data/metro-grades.csv')

I have copied relevant information about the dataset below:

This csv file contains 2020 population total estimates by race/ethnicity for combined zones of each redlining grade (from Home Owners' Loan Corporation \[HOLC\] maps originally drawn in 1935-40, downloaded from the [Mapping Inequality project](https://dsl.richmond.edu/panorama/redlining/#loc=5/37.8/-97.9&maps=0)) within micro- and metropolitan areas. Also included are population estimates in the surrounding area of each metropolitan area's HOLC map (computed by adding a 10 percent buffer radius to the minimum bounding circle of all zones in that metro area) and [location quotients](https://belonging.berkeley.edu/technical-appendix#footnote34_cnxakh3) (LQs) for each racial/ethnic group and HOLC grade. LQs are small-area measures of segregation that specifically compare one racial/ethnic group’s proportion in a granular geography to their proportion in a larger surrounding geography. An LQ above 1 for a given racial group indicates overrepresentation in that HOLC zone relative to the broader surrounding area, and values below 1 indicate underrepresentation. Only micro- and metropolitan areas with both A- (“best”) and D-rated (“hazardous”) zones in their redlining map are included — 138 of a total 143 metropolitan areas in the data from Mapping Inequality.

Header | Definition
--- | ---
`metro_area` | Official U.S. Census name of micro- or metropolitan area — defined as ["Core-Based Statistical Areas"](https://www.census.gov/topics/housing/housing-patterns/about/core-based-statistical-areas.html). The first city and state listed are used as the display name for each micro/metropolitan area in the story (for example, "Chicago-Naperville-Elgin, IL-IN-WI" is referred to as "Chicago, IL").
`holc_grade` | Grade assigned by the Home Owners' Loan Corporation (HOLC). `A`: "best" (green). `B`: "Still Desirable" (blue). `C`: "Definitely Declining" (yellow). `D`: "Hazardous" (red).
`white_pop` | Estimate of non-Hispanic white population within HOLC zones with a given `holc_grade` in a given `metro_area`. Rounded to the nearest integer.
`black_pop` | Estimate of non-Hispanic Black population within HOLC zones with a given `holc_grade` in a given `metro_area`. Rounded to the nearest integer.
`hisp_pop` | Estimate of Hispanic/Latino population within HOLC zones with a given `holc_grade` in a given `metro_area`. Rounded to the nearest integer.
`asian_pop` | Estimate of non-Hispanic Asian population within HOLC zones with a given `holc_grade` in a given `metro_area`. Rounded to the nearest integer.
`other_pop` | Estimate of population in any other racial/ethnic groups within HOLC zones with a given `holc_grade` in a given `metro_area`. Rounded to the nearest integer.
`total_pop` | Estimate of total population (across all racial/ethnic groups) within HOLC zones with a given `holc_grade` in a given `metro_area`. Rounded to the nearest integer.
`pct_white` | Estimate of the percentage of total population within HOLC zones with a given `holc_grade` in a given `metro_area` that are non-Hispanic white. Represented between 0-100. Rounded to the nearest two decimal places.
`pct_black` | Estimate of the percentage of total population within HOLC zones with a given `holc_grade` in a given `metro_area` that are non-Hispanic Black. Represented between 0-100. Rounded to the nearest two decimal places.
`pct_hisp` | Estimate of the percentage of total population within HOLC zones with a given `holc_grade` in a given `metro_area` that are Hispanic/Latino. Represented between 0-100. Rounded to the nearest two decimal places.
`pct_asian` | Estimate of the percentage of total population within HOLC zones with a given `holc_grade` in a given `metro_area` that are non-Hispanic Asian. Represented between 0-100. Rounded to the nearest two decimal places.
`pct_other` | Estimate of the percentage of total population within HOLC zones with a given `holc_grade` in a given `metro_area` in any other racial/ethnic group. Represented between 0-100. Rounded to the nearest two decimal places.
`lq_white` | Non-Hispanic white location quotient for a given `holc_grade` and `metro_area`.
`lq_black` | Non-Hispanic Black location quotient for a given `holc_grade` and `metro_area`.
`lq_hisp` | Hispanic/Latino location quotient for a given `holc_grade` and `metro_area`.
`lq_asian` | Non-Hispanic Asian location quotient for a given `holc_grade` and `metro_area`.
`lq_other` | All other racial/ethnic groups' location quotient for a given `holc_grade` and `metro_area`.
`surr_area_white_pop` | Estimate of non-Hispanic white population within surrounding area of a given `metro_area`'s HOLC zones. Rounded to nearest integer. Repeated for each `holc_grade` for a given `metro_area`.
`surr_area_black_pop` | Estimate of non-Hispanic Black population within surrounding area of a given `metro_area`'s HOLC zones. Rounded to nearest integer. Repeated for each `holc_grade` for a given `metro_area`.
`surr_area_hisp_pop` | Estimate of Hispanic/Latino population within surrounding area of a given `metro_area`'s HOLC zones. Rounded to nearest integer. Repeated for each `holc_grade` for a given `metro_area`.
`surr_area_asian_pop` | Estimate of non-Hispanic Asian population within surrounding area of a given `metro_area`'s HOLC zones. Rounded to nearest integer. Repeated for each `holc_grade` for a given `metro_area`.
`surr_area_other_pop` | Estimate of population in any other racial/ethnic groups within surrounding area of a given `metro_area`'s HOLC zones. Rounded to nearest integer. Repeated for each `holc_grade` for a given `metro_area`.
`surr_area_total_pop` | Estimate of total population (across all racial/ethnic groups) within surrounding area of a given `metro_area`'s HOLC zones. Rounded to nearest integer. Repeated for each `holc_grade` for a given `metro_area`.
`surr_area_pct_white` | Estimate of the percentage of total population within surrounding area of a given `metro_area`'s HOLC zones that are non-Hispanic white. Represented between 0-100. Rounded to the nearest two decimal places. Repeated for each `holc_grade` for a given `metro_area`.
`surr_area_pct_black` | Estimate of the percentage of total population within surrounding area of a given `metro_area`'s HOLC zones that are non-Hispanic Black. Represented between 0-100. Rounded to the nearest two decimal places. Repeated for each `holc_grade` for a given `metro_area`.
`surr_area_pct_hisp` | Estimate of the percentage of total population within surrounding area of a given `metro_area`'s HOLC zones that are Hispanic/Latino. Represented between 0-100. Rounded to the nearest two decimal places. Repeated for each `holc_grade` for a given `metro_area`.
`surr_area_pct_asian` | Estimate of the percentage of total population within surrounding area of a given `metro_area`'s HOLC zones that are non-Hispanic Asian. Represented between 0-100. Rounded to the nearest two decimal places. Repeated for each `holc_grade` for a given `metro_area`.
`surr_area_pct_other` | Estimate of the percentage of total population within surrounding area of a given `metro_area`'s HOLC zones in any other racial/ethnic group. Represented between 0-100. Rounded to the nearest two decimal places. Repeated for each `holc_grade` for a given `metro_area`.

---

## Comparative Statistics

### T-Test

With your group, go through the [SciPy Documentation for T-Test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) and solve the following practice question:

**Question:** Compare the mean White population between areas graded 'A' and 'D'. \
Run a t-test, and interpret the results.

In [None]:
# Extract data for grades 'A' and 'D'
grade_A_White = df[df['holc_grade'] == 'A']['white_pop']
grade_D_White = df[df['holc_grade'] == 'D']['white_pop']

# Perform t-test using stats.ttest_ind
...

**Question:** Compare the mean Asian population between areas graded 'A' and 'B'. \
Run a t-test, and interpret the results.

In [None]:
# Extract data for grades 'A' and 'B'
grade_A_Asian = df[df['holc_grade'] == 'A']['asian_pop']
grade_B_Asian = df[df['holc_grade'] == 'B']['asian_pop']

# Perform t-test using stats.ttest_ind
...

### Z-Test

The Z-test is used to compare a *sample mean* with a *population mean*.

The formula for Z-test statistic is:

$$ Z = {(\bar{x} - \mu) \over (\sigma / \sqrt{n})} $$

where $\bar{x}$ = sample mean \
$\mu$ = population mean \
$\sigma$ = standard deviation of population \
$\sqrt{n}$ = square root of sample size, $n$
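
The formula can be traced step by step in code. The sketch below uses made-up numbers (the sample values, `mu`, and `sigma` are assumptions for illustration and are not taken from the redlining dataset).

```python
import numpy as np

# Hypothetical numbers, not taken from the redlining dataset
sample = np.array([52.0, 48.0, 55.0, 50.0, 45.0])  # made-up sample values
mu = 47.0      # assumed population mean
sigma = 5.0    # assumed population standard deviation

x_bar = sample.mean()   # sample mean
n = len(sample)         # sample size

# Z = (x_bar - mu) / (sigma / sqrt(n))
z = (x_bar - mu) / (sigma / np.sqrt(n))
print(z)
```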

**Question:** Compare the mean Black population in grade A (sample mean) with the mean Black population across all grades (population mean).

In [None]:
# Extract data for grade 'A' and for all grades
grade_A_Black = ...
all_pop_Black = ...

# Calculate Z-statistic using the formula outlined above

How do we get the p-value for this? 

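We pass the absolute value of the Z statistic to the function `stats.norm.cdf( <Z statistic> )`, which returns the probability of observing a value at or below it under the standard normal distribution. The two-tailed p-value is then `2 * (1 - stats.norm.cdf( <absolute Z statistic> ))`.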
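
A minimal sketch of that conversion, using a made-up Z statistic rather than one computed from the dataset:

```python
from scipy import stats

z = 1.96  # hypothetical Z statistic, chosen for illustration

# Probability of a value at or below |z| under the standard normal distribution
cumulative = stats.norm.cdf(abs(z))

# One-tailed p-value: probability of a value above |z|
p_one_tailed = 1 - cumulative

# Two-tailed p-value: probability of a value at least this extreme in either direction
p_two_tailed = 2 * p_one_tailed
print(p_one_tailed, p_two_tailed)
```

Whether the one-tailed or two-tailed value applies depends on whether the hypothesis specifies a direction.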

In [None]:
# Calculate p-value using stats.norm.cdf

### ANOVA

With your group, go through the [SciPy Documentation for one-way ANOVA](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html) and solve the following practice question:

**Question:** Test if there are any significant differences in the total population among different grades ('A', 'B', 'C', 'D').

In [None]:
# Extract data for each grade
...

# Perform ANOVA using stats.f_oneway
...

**Question:** Test if there are any significant differences in the total population among different grades ('B', 'C', 'D').

In [None]:
# Perform ANOVA using stats.f_oneway
...

### Chi-Squared Test

The Chi-squared test is used to compare the observed frequency of each value of a categorical variable with its expected frequency.
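
To see the idea on a small scale, the sketch below runs `stats.chisquare` on hypothetical counts for four categories (the numbers are invented for illustration and do not come from the dataset).

```python
from scipy import stats

# Hypothetical observed counts for four categories (not from the dataset)
observed = [30, 20, 25, 25]

# With no f_exp argument, stats.chisquare assumes every category
# is expected to have the same count (here, 100 / 4 = 25 each).
result = stats.chisquare(observed)
print(result.statistic, result.pvalue)
```

A large p-value here would mean the observed counts are consistent with an equal split across categories.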

With your group, go through the [SciPy Documentation for Chi-Squared Test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html) and solve the following practice question:

**Question:** Subset the data to include only the `pct_black` and `holc_grade` columns.\
Filter the data to include only rows where `pct_black` is greater than or equal to 50%.\
Use the `stats.chisquare` function to test the observed frequencies of `holc_grade` in this subset of the data against the expected frequencies (which, by default, are an equal distribution across categories). Interpret the results.

In [None]:
# Step 1: Subset the data to include only the pct_black and holc_grade columns
...

# Step 2: Filter the data to include only rows where pct_black is greater than or equal to 50%
...

# Step 3: Get observed frequencies of holc_grade
...

# Expected frequencies assuming no association (equal distribution across categories)
...

# Perform Chi-Squared test using scipy.stats.chisquare
...

## Association between variables

### Covariance

For this, we do not need to rely on SciPy. NumPy provides a covariance function, called `cov`.

With your group, go through the [NumPy Documentation for Covariance](https://numpy.org/doc/stable/reference/generated/numpy.cov.html) and solve the following practice question:

**Question:** Calculate the covariance between the percentage of Hispanic population in an area and the percentage of White population in the surrounding area.

In [None]:
# Perform covariance using np.cov
...

Are you getting a single number or a matrix (2-dimensional array)? If you are getting a matrix, it is because `np.cov` computes a covariance MATRIX, which looks like this:

![](./images/Lab7_cov.jpg)
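
To make that layout concrete, here is a sketch on two made-up arrays (the values of `x` and `y` are assumptions for illustration, not columns from the dataset).

```python
import numpy as np

# Hypothetical values for two variables (not from the dataset)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 9.0])

cov_matrix = np.cov(x, y)
print(cov_matrix.shape)   # two variables give a 2 x 2 array
print(cov_matrix)

# The diagonal entries match the sample variances of x and y
# (ddof=1 uses the same n - 1 denominator that np.cov uses by default).
print(np.var(x, ddof=1), np.var(y, ddof=1))
```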

Based on this information, how do we use indexing on the covariance matrix to find our desired covariance value? \
Discuss with your group and try below:

In [None]:
# Index the covariance between the two variables from the covariance matrix using [row index, column index]
...

### Correlation

With your group, go through the [SciPy Documentation for Pearson's R correlation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html) and solve the following practice question:

**Question:** Calculate the correlation between the percentage of Hispanic population and the percentage of White population in the surrounding area.

In [None]:
# Perform correlation using stats.pearsonr
...

**Question:** Calculate the correlation between the percentage of White population and the percentage of Hispanic population in the surrounding area.

In [None]:
# Perform correlation using stats.pearsonr
...

### Linear Regression

With your group, go through the [SciPy Documentation for Linear Regression](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html) and solve the following practice question:

**Question:**  Perform a linear regression to predict the percentage of White population in the surrounding area based on the percentage of Hispanic population.

In [None]:
# Perform linear regression using stats.linregress
...

Discuss with your group: 
 - What is the slope?
 - What is the intercept (or constant value "c" in regression)?
 - What is the p-value?
 - For a significance level of 0.05, is this linear regression statistically significant? 
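
Each of these values can be read off the object that `stats.linregress` returns. The sketch below shows this on hypothetical data (the `x` and `y` lists are invented for illustration and are not from the dataset).

```python
from scipy import stats

# Hypothetical data, not from the dataset
x = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

result = stats.linregress(x, y)
print("slope:", result.slope)
print("intercept:", result.intercept)
print("p-value:", result.pvalue)

# The p-value tests the null hypothesis that the slope is zero
print("significant at 0.05:", result.pvalue < 0.05)
```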

**Question:** Perform a linear regression to predict the percentage of Hispanic population in the surrounding area based on the percentage of Asian population.

In [None]:
# Perform linear regression using stats.linregress
...