## COMM 187 (160DS): Data Science in Communication Research -- Spring 2024

## Coding Lab #6: Data Manipulation and Statistics with Pandas
**Wednesday, May 8, 2024**

Welcome to Coding Lab #6 for COMM 187 (160DS): Data Science in Communication Research!

In the last Coding Lab, we learnt about Python dictionaries and the pandas library, including pandas `Series` and `DataFrame` objects.

Today's lesson plan:
 - Review of Coding Assignment #5
 - Loading CSV files using pandas
 - Sorting pandas DataFrames
 - Summary Statistics in Python -- Mean, Median, Standard Deviation, Covariance, Correlation
 - Grouping data using pandas `groupby`

Today's lessons are based on the following online resources (feel free to try them out yourselves too!):
 - https://wesmckinney.com/book/pandas-basics

### Loading CSV data into Python

In [None]:
import pandas as pd
import numpy as np

A **CSV** (Comma Separated Values) file is simply a table, just like a spreadsheet, where each line is a row of values separated by commas.

In order to import, or load, a csv file into your Python code, you will need to use the following function:

```
pd.read_csv(path_to_file)
```

Here, `path_to_file` should be replaced with the file path to the csv file.
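
To see `pd.read_csv()` in action without relying on any file from this lab, here is a minimal sketch. The file name `toy.csv` and its contents are made up for illustration; the code writes that tiny CSV file into a temporary folder and then loads it back.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data: the first line holds the column names,
# and every following line is one row of comma-separated values.
csv_text = "name,score\nAda,90\nGrace,85\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    # Build the file path: <temporary folder>/toy.csv
    path_to_file = Path(tmp_dir) / "toy.csv"
    path_to_file.write_text(csv_text)

    # Load the CSV file into a pandas DataFrame
    toy_df = pd.read_csv(path_to_file)

print(toy_df)
```

Notice that the first line of the file became the column names of the DataFrame.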

Go through this tutorial on file paths to learn more about them: https://www.codecademy.com/resources/docs/general/file-paths 

To load a csv file named "recent-grads.csv" which is in the "data" subfolder (or sub-directory) within the current folder (or directory), the file path should be "./data/recent-grads.csv".

Let us try to load that file into our code:

In [None]:
### Your code below this line
df = pd.read_csv('./data/recent-grads.csv')

This dataset is the data behind [this article](https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/). This data shows the earnings of Americans with different college majors. 

You can access the repository for this dataset [here](https://github.com/fivethirtyeight/data/tree/master/college-majors).

Now, print the dataset below to see what it looks like.

In [None]:
### Your code below this line

**Question:** Print the names of the columns of this DataFrame.

In [None]:
### Your code below this line
df.columns

For your reference, here are the descriptions of the values in each of these columns:

Column Name | Description
---|---------
`Rank` | Rank by median earnings
`Major_code` | Major code, FOD1P in ACS PUMS
`Major` | Major description
`Major_category` | Category of major from Carnevale et al
`Total` | Total number of people with major
`Sample_size` | Sample size (unweighted) of full-time, year-round ONLY (used for earnings)
`Men` | Male graduates
`Women` | Female graduates
`ShareWomen` | Women as share of total
`Employed` | Number employed (ESR == 1 or 2)
`Full_time` | Employed 35 hours or more
`Part_time` | Employed less than 35 hours
`Full_time_year_round` | Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35)
`Unemployed` | Number unemployed (ESR == 3)
`Unemployment_rate` | Unemployed / (Unemployed + Employed)
`Income` | Median earnings of full-time, year-round workers
`P25th` | 25th percentile of earnings
`P75th` | 75th percentile of earnings
`College_jobs` | Number with job requiring a college degree
`Non_college_jobs` | Number with job not requiring a college degree
`Low_wage_jobs` | Number in low-wage service jobs

Now, use the `.head()` function to just print the first 5 rows of the DataFrame.

In [None]:
df.head()

Now, similarly, use `.tail()` to print the last 5 rows of the DataFrame.

In [None]:
df.tail()
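
Both `.head()` and `.tail()` also accept the number of rows you want to see. Here is a small sketch on a made-up DataFrame (the columns `letter` and `number` are only for illustration):

```python
import pandas as pd

# Hypothetical toy DataFrame with 8 rows
toy_df = pd.DataFrame({
    "letter": ["a", "b", "c", "d", "e", "f", "g", "h"],
    "number": [0, 1, 2, 3, 4, 5, 6, 7],
})

first_three = toy_df.head(3)  # first 3 rows instead of the default 5
last_two = toy_df.tail(2)     # last 2 rows instead of the default 5

print(first_three)
print(last_two)
```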

#### Sorting pandas DataFrame

In pandas, you can sort the entire table based on the values in any column. We do it using the `.sort_values()` method.

Read [this documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) for the `.sort_values()` method, and write code below that sorts the pandas DataFrame `df` by median earnings, which are stored in the column "Income". Discuss with your table and learn from each other!

In [None]:
### Your code below this line
df.sort_values(by = "Income")
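
Two things about `.sort_values()` are easy to miss: it sorts in ascending order unless you pass `ascending=False`, and it returns a new, sorted DataFrame without changing the original. Here is a minimal sketch on made-up data (the columns `city` and `population` are hypothetical):

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "population": [300, 100, 400, 200],
})

# Default: smallest value first
ascending_df = toy_df.sort_values(by="population")

# ascending=False: largest value first
descending_df = toy_df.sort_values(by="population", ascending=False)

print(ascending_df)
print(descending_df)

# The original DataFrame keeps its original row order
print(toy_df)
```

If you want to keep the sorted table, store the result in a variable, as done above.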

**Practice:** Sort the DataFrame `df` in DESCENDING order of the values in the column "ShareWomen", which is the share of women among all graduates with that major.

In [None]:
### Your code below this line

**Practice:** Sort the DataFrame `df` in the ASCENDING order of the values in the column "Sample_size", and select only the following columns: "Major_code", "Major", "Sample_size", and "Income".

In [None]:
### Your code below this line

### Summary Statistics in `pandas`

Let us now perform some basic statistical analysis!

**1. MEAN**

Use the function `.mean()` for any column with numerical values. Let us try it on the column "Employed".

In [None]:
df["Employed"].mean()

**Practice:** Calculate the mean of "Median earnings of full-time, year-round workers" in the dataset.

In [None]:
### Your code below this line

**2. MEDIAN**

Use the function `.median()` for any column with numerical values. Let us try it on the column "Employed".

In [None]:
df["Employed"].median()

**Practice:** Calculate the median of "Median earnings of full-time, year-round workers" in the dataset.

In [None]:
### Your code below this line

**3. STANDARD DEVIATION**

Use the function `.std()` for any column with numerical values. Let us try it on the column "Employed".

In [None]:
df["Employed"].std()

**Practice:** Calculate the standard deviation of "Median earnings of full-time, year-round workers" in the dataset.

In [None]:
### Your code below this line

***DOING MEAN, MEDIAN, STANDARD DEVIATION TOGETHER?***

Use the function `.agg()` for any column with numerical values, and inside the `()` brackets, enter a list of the statistical functions you would like to use on that column.

 - for mean, just write `'mean'`
 - for median, just write `'median'`
 - for standard deviation, just write `'std'`

Remember! In Python, `'` and `"` are interchangeable when writing strings.

Let us try it on the column "Employed".

In [None]:
df["Employed"].agg('mean')

In [None]:
df["Employed"].agg(['mean'])

In [None]:
df["Employed"].agg(['mean', 'median'])

In [None]:
df["Employed"].agg(['mean', 'median', 'std'])

**Practice:** Calculate the mean, median, and standard deviation of "Median earnings of full-time, year-round workers" in the dataset together, using `.agg()`.

In [None]:
### Your code below this line

**4. COVARIANCE**

Use the function `.cov( <column to calculate covariance with> )` for any two columns with numerical values. Let us try it on the columns "Employed" and "ShareWomen".

In [None]:
df["Employed"].cov(df["ShareWomen"])

**Practice:** Calculate the covariance between the column with the "Median earnings of full-time, year-round workers" and the column with "Women as share of total" in the dataset.

In [None]:
### Your code below this line

**5. CORRELATION**

Use the function `.corr( <column to calculate correlation with> )` for any two columns with numerical values. Let us try it on the columns "Employed" and "ShareWomen".

In [None]:
df["Employed"].cov(df["ShareWomen"])

**Practice:** Calculate the correlation between the column with the "Median earnings of full-time, year-round workers" and the column with "Women as share of total" in the dataset.

In [None]:
### Your code below this line

**6. SUMMARY STATISTICS USING `.describe()`**

Instead of calculating individual statistics for individual columns, we can also get a summary of mean, median, standard deviation, and some other statistics using the function `.describe()`. Let us try that out:

In [None]:
print(df.describe())
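
The output of `.describe()` is itself a DataFrame, so you can pull individual statistics out of it. The median appears in the row labelled `50%` (the 50th percentile). Here is a sketch on made-up data (the column `score` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({"score": [1, 2, 3, 4, 100]})

summary = toy_df.describe()
print(summary)

# The rows of the summary are the names of the statistics
print(list(summary.index))

# The "50%" row is the median
median_from_describe = summary.loc["50%", "score"]
print(median_from_describe, toy_df["score"].median())
```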

### Grouping data in `pandas` using `groupby`

So far, we have calculated these statistics for all of the data at once. Suppose we instead want to calculate them separately for each group of majors. We would then have to subset the data for each group and calculate the summary statistics for each subset. That is a very long process!

Instead, we can "group" the data based on the values of a column (e.g., the "Major_category" column) and calculate the same statistic for every group at once. We do this operation using the `.groupby()` method.
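
To make the idea concrete, here is a minimal sketch on made-up data (the columns `team` and `score` are hypothetical). It first calculates the mean for one group by subsetting manually, and then gets the means of all groups in a single step.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    "team": ["red", "blue", "red", "blue", "green"],
    "score": [10, 20, 30, 40, 50],
})

# The long way: subset one group at a time, then calculate the statistic
red_rows = toy_df[toy_df["team"] == "red"]
manual_red_mean = red_rows["score"].mean()
print("Mean score for red (manual):", manual_red_mean)

# The short way: group by "team", select "score", and calculate all means at once
group_means = toy_df.groupby("team")["score"].mean()
print(group_means)
```

The result is a pandas `Series` with one entry per group, so methods such as `.sort_values()` can be applied to it afterwards.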

Learn more about how to use `.groupby()` here: https://www.geeksforgeeks.org/python-pandas-dataframe-groupby/ 

Using the tutorial linked here, discuss with your table how to solve the following questions:

**Question 1.** Use the `.groupby()` function to group the DataFrame based on the column "Major_category".

In [None]:
### Your code below this line

**Question 2.** Use the `.groupby()` function to group the DataFrame based on the column "Major_category", and select only the "Income" column.

In [None]:
### Your code below this line

**Question 3.** Use the `.groupby()` function and `.mean()` together to print the mean income for each "Major_category" group.

In [None]:
### Your code below this line

**Question 4.** Use the `.groupby()` function, the `.mean()` function, and `.sort_values()` together to print the mean income for each "Major_category" group, sorted in descending order of the mean incomes.

In [None]:
### Your code below this line

**Question 5.** Use the `.groupby()` function, the `.median()` function, and `.sort_values()` together to print the median income for each "Major_category" group, sorted in descending order of the median incomes.

In [None]:
### Your code below this line

### Practice

Identify the top 3 major categories with the highest average unemployment rates and provide the standard deviation for the unemployment rate within those categories. Use `.groupby()`, `.mean()`, `.std()`, and `.sort_values()`.

In [None]:
### Your code below this line