# COMM 187: Data Science in Communication Research
# Spring 2025

## Week #3 Coding Lab: Pandas
**Monday, April 14, 2025**

Welcome to the Week #3 Coding Lab for COMM 187: Data Science in Communication Research! 

Thus far, we have learned some basic Python skills, including variables, data types, lists, and NumPy.

Today's lesson plan:
 - Dictionaries in Python
 - Pandas
 - File Paths
 - CSV files

### Dictionaries in Python `dict`

Going beyond lists and arrays, we will now discuss dictionaries in Python. \
Dictionaries are used to store data values in `key`:`value` pairs.

These are just like a real-life dictionary. The `key`s are like the words, and the `value`s are like the meanings of the words.
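
To make the analogy concrete, here is a minimal sketch of a tiny word-to-meaning dictionary. The name `glossary` and its entries are made up for illustration; looking up a key with square brackets returns its value.

```python
# Hypothetical example: a tiny "real-life" dictionary of words and meanings.
glossary = {
    "key": "the word you look up",
    "value": "the meaning stored for that word",
}

# Look up a value by putting its key in square brackets.
meaning = glossary["key"]
print(meaning)

# Assigning to a new key adds a new key:value pair.
glossary["pair"] = "one key together with its value"
print(len(glossary))
```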

In [None]:
thisdict = {"school": "UCSB", "department": "COMM", "class": 187}

print(thisdict)

The keys here are `"school"`, `"department"`, and `"class"`.

In [None]:
thisdict.keys()

The values here are `"UCSB"`, `"COMM"`, and `187`.

In [None]:
thisdict.values()

Let's practice!

**Question:** Create a dictionary called `student_grades` that stores the names of the following five students as **keys** and their respective grades (out of 100) as **values**. After creating the dictionary, print the dictionary to verify the contents.

 - Ace: 67
 - Buck: 98
 - Chip: 90
 - Domingo: 71
 - Echo: 85

In [None]:
### Your code below this line


### Pandas

Pandas is a fast, powerful, and flexible open-source **data analysis and manipulation tool** built on top of the Python programming language. Under the hood it is built on NumPy, and its plotting methods rely on Matplotlib, two very important Python libraries.

We will start by exploring its basic features.

#### Importing pandas

To use pandas, you need to import it first. 

In [None]:
import pandas

Just as we typically import numpy **as np**, we also typically give pandas the **nickname pd** for ease of coding.

In [None]:
import pandas as pd

### Data Structures in `pandas`

Pandas has two primary data structures: **Series** and **DataFrame**. Let's explore each.

A Series is a one-dimensional, array-like object. It is very similar to, and actually built on top of, NumPy arrays!

In [None]:
## Example of a pandas Series
s = pd.Series([1, 2, 3, 4, 5])
print(s)

A pandas Series contains two key elements: \
the *array*, and \
the *index*.

In [None]:
s.array

In [None]:
s.index

Now, use indexing on this series. Print the value at index 2:

In [None]:
### Your code below this line

Great!

We learned with lists and NumPy arrays that the index starts at 0 and ends at N-1 for a list/array of length N. As you just saw, a pandas Series follows the same logic by default!

Except.

You can make the index whatever you want. With a pandas Series, the index can be any label you want to use for each item in your series, even text!
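
One way to see this, sketched under the assumption that pandas is imported as `pd`: when a Series is built from a dictionary, the dictionary's keys become the index labels. The name `grades` is only illustrative, and the values reuse two of the grades listed earlier.

```python
import pandas as pd

# Building a Series from a dictionary: keys become index labels,
# values become the data.
grades = pd.Series({"Ace": 67, "Buck": 98})

print(grades.index.tolist())   # the labels
print(grades.tolist())         # the values
```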

Let us play with an example.

In [None]:
s2 = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

In [None]:
s2.array

In [None]:
s2.index

Now, print the value of series `s2` at index `"c"` \
Hint: Use indexing as you normally do, just replace the index number with this new index!

In [None]:
### Your code below this line

#### DataFrame

A DataFrame is a two-dimensional data structure, like a table of data.

In [None]:
# Example of a pandas DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df)

Here, instead of just ONE series, we have TWO series, `A` and `B`. Basically, a DataFrame is made up of multiple series!

The index (or label) for `A` and `B` is the same!

**Thinking in terms of columns and rows**

Think of a DataFrame as a table of data with columns and rows. \
Each column is a pandas Series. \
Each row is identified by its index (or label).
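
A quick way to check this, as a sketch that rebuilds the small `df` from above: selecting a single column with square brackets gives back a pandas Series, and both columns carry the same index.

```python
import pandas as pd

# Same small DataFrame as in the example above.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Selecting one column returns a Series.
col_a = df['A']
print(type(col_a))

# Both columns share the DataFrame's index (0, 1, 2 by default).
print(df['A'].index.equals(df['B'].index))
```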


The syntax to make a pandas DataFrame is:

```
pd.DataFrame( { 'column1': [col1_value1, col1_value2, ...], 'column2': [col2_value1, col2_value2, ...]} )
```

Let us practice!

**Question:** You have the two lists provided below: `Grade1` and `Grade2`. Make a pandas DataFrame which looks like this:

![](./images/Lab6_Grade1Grade2.png)

In [None]:
Grade1 = [95, 91, 89]
Grade2 = [72, 99, 92]
### Your code below this line


### File Paths

(The following tutorial has been prepared with the help of this [online tutorial](https://www.codecademy.com/resources/docs/general/file-paths). Use this as a reference for future help regarding file paths.)

A file path specifies the **location of a file** in a computer’s file system structure.

In general, a path is a string of characters which specifies a unique location in a directory or page hierarchy. For file systems, each level in the hierarchy is called a **directory**.

Different sections of the path are separated by a path separator, such as a forward slash (/). These different sections represent the separate directories or pages in the hierarchy.

Consider the example shown below:

`/home/user/python/test.py`

In this example file path, the test.py file is inside the python directory. The python directory is a subdirectory of the user directory, which is a subdirectory of the home directory. 

```
home
    |- user
        |- python
            |- test.py
```
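
The same hierarchy can be read off in Python. This is a minimal sketch using `PurePosixPath` from the standard-library `pathlib` module, which only manipulates the path text and does not need the file to exist.

```python
from pathlib import PurePosixPath

# The example path from above; nothing is read from disk.
path = PurePosixPath("/home/user/python/test.py")

print(path.name)     # the file itself
print(path.parent)   # the directory that contains the file
print(path.parts)    # every level of the hierarchy, root first
```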

Consider this very notebook (.ipynb file) that you are working on right now. Where in the file system is this notebook located? \
To find out, run the function `getcwd()`, which is short for "**get** **c**urrent **w**orking **d**irectory", from the `os` module:

In [None]:
import os
os.getcwd()

Here, the Week03Lab_COMM187S25.ipynb file is located in the codinglabs directory. The codinglabs directory is a subdirectory of the COMM187_S25 directory, which is a subdirectory of the jovyan directory, which is a subdirectory of home.

```
home
    |- jovyan
        |- COMM187_S25
            |- codinglabs
                |- Week03Lab_COMM187S25.ipynb
```

#### Absolute vs Relative Path

**Absolute file paths** specify the location of a file from the root directory in the file system structure. They are also called “full file paths” or “full paths.” 

The output of the `getcwd()` function is an absolute file path.

On this platform, the `/home/jovyan/` part is the **home directory**, which can be replaced with `~` for ease of use.

This means that the following two file paths point to the same location:
 - `/home/jovyan/COMM187_S25/codinglabs`
 - `~/COMM187_S25/codinglabs`
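
In Python, the `~` shorthand can be expanded explicitly with `os.path.expanduser()`. The sketch below assumes nothing about your username: on this platform the result would start with `/home/jovyan`, while on your own computer it will start with your own home directory.

```python
import os

# Expand the ~ shorthand into the full home-directory path.
short_path = "~/COMM187_S25/codinglabs"
full_path = os.path.expanduser(short_path)

print(full_path)
```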

**Relative file paths** specify the location of a file relative to the current working directory, rather than from the root directory. A leading `.` stands for the current directory, and `..` stands for its parent directory.

For example, in the current directory of this file, we have a directory called **data**. The relative file path for the file `recent-grads.csv` in this directory would be:

```
./data/recent-grads.csv
```

The equivalent absolute file path would be:

```
/home/jovyan/COMM187_S25/codinglabs/data/recent-grads.csv
```
OR
```
~/COMM187_S25/codinglabs/data/recent-grads.csv
```
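
How a relative path turns into an absolute one can be sketched with the standard-library `posixpath` module (the POSIX flavor of `os.path`, used here so the result is the same on any operating system). The working directory is written out by hand as an assumption; in the notebook it would come from `os.getcwd()`.

```python
import posixpath

# Assumed current working directory (what os.getcwd() reports on this platform).
cwd = "/home/jovyan/COMM187_S25/codinglabs"
relative_path = "./data/recent-grads.csv"

# Join the two, then tidy up the "./" in the middle.
absolute_path = posixpath.normpath(posixpath.join(cwd, relative_path))
print(absolute_path)
```

In everyday use, `os.path.abspath('./data/recent-grads.csv')` does both steps at once, starting from the actual current working directory.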

### Comma Separated Values (CSV) file

A **CSV** (Comma Separated Values) file is simply a table, just like a spreadsheet, where each line is a row of values separated by commas.

To import, or load, a CSV file into your Python code, you can use the following pandas function:

```
pd.read_csv(path_to_file)
```

Here, `path_to_file` should be replaced with the absolute or relative file path to the csv file.
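
To see how comma-separated lines turn into rows and columns, here is a small sketch that reads CSV text held in a string instead of a file. `pd.read_csv()` also accepts file-like objects such as `io.StringIO`; the column names `name` and `grade` are made up for illustration, and the values reuse grades from the dictionary exercise.

```python
import io
import pandas as pd

# Three lines of CSV text: a header row followed by two data rows.
csv_text = "name,grade\nAce,67\nBuck,98\n"

# Wrap the string in a file-like object so read_csv can parse it.
small_df = pd.read_csv(io.StringIO(csv_text))
print(small_df)
```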

Let us try to load the `recent-grads.csv` file into our code:

In [None]:
### Your code below this line
df = pd.read_csv('./data/recent-grads.csv')

This dataset is the data behind [this article](https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/). It shows the earnings of Americans with different college majors.

You can access the repository for this dataset [here](https://github.com/fivethirtyeight/data/tree/master/college-majors).

Now, print the dataset below to see what it looks like.

In [None]:
### Your code below this line


Now, print the names of the columns of this DataFrame using `df.columns`.

In [None]:
### Your code below this line
df.columns

For your reference, here are the descriptions of the values in each of these columns:

Column Name | Description
---|---------
`Rank` | Rank by median earnings
`Major_code` | Major code, FO1DP in ACS PUMS
`Major` | Major description
`Major_category` | Category of major from Carnevale et al
`Total` | Total number of people with major
`Sample_size` | Sample size (unweighted) of full-time, year-round ONLY (used for earnings)
`Men` | Male graduates
`Women` | Female graduates
`ShareWomen` | Women as share of total
`Employed` | Number employed (ESR == 1 or 2)
`Full_time` | Employed 35 hours or more
`Part_time` | Employed less than 35 hours
`Full_time_year_round` | Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35)
`Unemployed` | Number unemployed (ESR == 3)
`Unemployment_rate` | Unemployed / (Unemployed + Employed)
`Income` | Median earnings of full-time, year-round workers
`P25th` | 25th percentile of earnings
`P75th` | 75th percentile of earnings
`College_jobs` | Number with job requiring a college degree
`Non_college_jobs` | Number with job not requiring a college degree
`Low_wage_jobs` | Number in low-wage service jobs

Now, use the `.head()` method to print just the first 5 rows of the DataFrame.

In [None]:
df.head()

Now, similarly, use `.tail()` to print the last 5 rows of the DataFrame.

In [None]:
df.tail()
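
Both methods also take an optional number of rows. The sketch below uses a small made-up DataFrame (`toy`) rather than the grads data, so it runs on its own.

```python
import pandas as pd

# Hypothetical ten-row DataFrame, just to have something to slice.
toy = pd.DataFrame({"x": range(10)})

print(toy.head(3))   # first 3 rows instead of the default 5
print(toy.tail(2))   # last 2 rows
```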

You will learn more about this in this week's assigned DataCamp module, so make sure to go through it before attempting this week's coding assignment.