## COMM 187 (160DS): Data Science in Communication Research -- Spring 2024

## Coding Lab #5: Basics of Pandas
**Wednesday, May 1, 2024**

Welcome to Coding Lab #5 for COMM 187 (160DS): Data Science in Communication Research!

In the last Coding Lab, we learnt about ranges, indexing and slicing, conditional statements (`if` `else`) and `for` loops.

Today's lesson plan:
 - Review of `for` loops
 - Python Dictionaries `dict`
 - Basics of `pandas` library

Today's lessons are based on the following online resources (feel free to try them out yourselves too!):
 - https://wesmckinney.com/book/pandas-basics

## IMPORTANT ANNOUNCEMENT: DATACAMP ACCESS

Great news! We now have access to [DataCamp](https://www.datacamp.com/), which has a large suite of coding tutorials. As we move on to more intermediate coding skills in Python, I will be assigning certain chapters from courses on this website for you to do along with your assignments.

TO JOIN, CLICK THIS LINK: https://www.datacamp.com/groups/shared_links/0d032623fe95677c03dd5d41331db87feeb0738725bb7ae390f6d9ee17f2bed8  

Please let me know if you are having any difficulty.

For Assignment #5, you will be required to finish the assigned chapter on DataCamp first, and then do the assignment questions. Since the assigned chapter will take 1-2 hours for you to finish, the assignment will be shorter, but based on the skills you will have learnt in the chapter.

### `for` loop review

A `for` loop is used for **iterating** over a sequence (that is either a list, a tuple, a dictionary, a set, or a string).

*What does it mean to "iterate" over a sequence?* \
It means going through *each* element in the sequence *one by one*, in order, and performing some action with each element. \
For example, think of it as reading a book *page by page*: you start at the beginning and move sequentially to the end, doing something with each page before moving on to the next one.

In a `for` loop, we are running a block of code *FOR* each value in a list of values.

Typical syntax for a `for` loop in Python:
```
for value in <collection>:
    <do something with value>
```

Here, the `<collection>` can be a list, a NumPy array, or a range!\
In this `for` loop, the variable `value` takes on the value of each element in the collection, in sequence, and the block of code does something with that value.


For example:\
If I wanted to print out each day of the week, I could either do it this way:

In [None]:
print("Monday")
print("Tuesday")
print("Wednesday")
print("Thursday")
print("Friday")
print("Saturday")
print("Sunday")

**OR** I could simplify things and write a `for` loop.

Step 1: I will make a list that I will iterate over.

In [None]:
days_in_week = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

Step 2: I will write a `for` loop to iterate over it. For this, we need a loop variable that takes on each value in the list `days_in_week`, one at a time. We will call it `i`.

In [None]:
for i in days_in_week:
    print(i)
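The same pattern works for other sequences mentioned above, such as a range or a string. Here is a small sketch; the variable names `total`, `number`, `letter`, and `letter_count` are made up for this illustration.

```python
# Loop over a range: number takes the values 1, 2, 3 (the end value 4 is excluded)
total = 0
for number in range(1, 4):
    total = total + number
print(total)  # 6

# Loop over a string: letter takes each character in turn
letter_count = 0
for letter in "UCSB":
    print(letter)
    letter_count = letter_count + 1
print(letter_count)  # 4
```

In both loops, the indented block runs once per element, and the loop variable holds the current element each time.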

### Dictionaries in Python `dict`

Going beyond lists and arrays, we will now discuss dictionaries in Python. \
Dictionaries are used to store data values in `key`:`value` pairs.

These are just like a real-life dictionary. The `key`s are like the words, and the `value`s are like the meanings of the words.

In [None]:
thisdict = {"school": "UCSB", "department": "COMM", "class": 187}

print(thisdict)

The keys here are `"school"`, `"department"`, and `"class"`.

In [None]:
thisdict.keys()

The values here are `"UCSB"`, `"COMM"`, and `187`.

In [None]:
thisdict.values()
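Just as you look up a word in a real dictionary to find its meaning, you can look up a key to get its value. The sketch below reuses `thisdict` from above; the new key `"quarter"` and the variable `school` are added only for illustration.

```python
thisdict = {"school": "UCSB", "department": "COMM", "class": 187}

# Look up a value by its key, using square brackets
school = thisdict["school"]
print(school)  # UCSB

# Loop over every key-value pair with .items()
for key, value in thisdict.items():
    print(key, "->", value)

# Add a new key-value pair by assigning to a new key
thisdict["quarter"] = "Spring"
print(len(thisdict))  # 4
```

Note that asking for a key that does not exist (for example `thisdict["room"]`) raises a `KeyError`.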

Let's practice!

**Question:** Create a dictionary called student_grades that stores the names of three students as keys and their respective grades (out of 100) as values. After creating the dictionary, print the dictionary to verify the contents.

In [None]:
### Your code below this line

### Introduction to pandas

Pandas is a fast, powerful, and flexible open-source **data analysis and manipulation tool** built on top of the Python programming language. It is built on top of NumPy, and its plotting features rely on Matplotlib, two very important Python libraries.

We will start by exploring its basic features.

#### Importing pandas

To use pandas, you need to import it first. 

In [None]:
import pandas

**IMPORTANT:** You can give a library a nickname, or an alias, when you import it. This is done using the `as` keyword. For example, if I want to import pandas, but want to give it the alias `pd`, I will do it as follows:

In [None]:
import pandas as pd

This can be done with any library. 
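As a small illustration with a different library, here is the same idea using Python's built-in `math` module. The alias `m` is an arbitrary choice for this sketch, not a common convention.

```python
# Import the built-in math module under the alias m
import math as m

# After importing with an alias, use the alias in place of the full name
root = m.sqrt(16)
print(root)  # 4.0
```

Once a library is imported with an alias, you write the alias (here `m`) before the dot instead of the full library name.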

***Practice:*** Import numpy with alias `np`.

In [None]:
### Your code below this line

### Data Structures in `pandas`

Pandas has two primary data structures: **Series** and **DataFrame**. Let's explore each.

A Series is a one-dimensional array-like object. It is very similar to, and actually built on top of, numpy arrays!

In [None]:
## Example of a pandas Series
import pandas as pd
s = pd.Series([1, 2, 3, 4, 5])
print(s)

A pandas Series contains two key elements: \
the *array*, and \
the *index*.

In [None]:
s.array

In [None]:
s.index
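To see the connection to NumPy, here is a minimal sketch that converts the values of the Series into a NumPy array with `.to_numpy()` and lists out its default index. The variable name `values` is just for this example.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# The values of a Series can be converted to a NumPy array
values = s.to_numpy()
print(type(values))

# The default index is simply the positions 0, 1, 2, ...
print(list(s.index))  # [0, 1, 2, 3, 4]
print(len(s))         # 5
```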

Now, use indexing on this series. Print the value at index 2:

In [None]:
### Your code below this line

Great!

We learnt with lists and numpy arrays that index starts with 0, and ends with N-1, for a list/array of length N. As you just saw, pandas series follows the same logic!

Except.

You can set the index to be whatever you want. With a pandas Series, the index can be any label you want to use for each item in your series, even text!

Let us play with an example.

In [None]:
s2 = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

In [None]:
s2.array

In [None]:
s2.index

Now, print the value of series `s2` at index `"c"` \
Hint: Use indexing as you normally do, just replace the index number with this new index!

In [None]:
### Your code below this line

#### DataFrame

A DataFrame is a two-dimensional data structure, like a table of data.

In [None]:
# Example of a pandas DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df)

Here, instead of just ONE series, we have TWO series, `A` and `B`. Basically, a DataFrame comprises multiple series!

The index (or set of labels) for `A` and `B` is the same!

**Thinking in terms of columns and rows**

Think of a DataFrame as a table of data with columns and rows. \
Each column is a pandas Series.\
Each row is identified by its index value (or label).
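You can check this for yourself. The sketch below rebuilds the small DataFrame from above and inspects one column; the variable name `column_a` is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Pulling out a single column gives back a pandas Series
column_a = df['A']
print(type(column_a))

# The column names and the row labels of the table
print(list(df.columns))  # ['A', 'B']
print(list(df.index))    # [0, 1, 2]

# Both columns share the same index
print(column_a.index.equals(df['B'].index))  # True
```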


The syntax to make a pandas DataFrame is:

```
pd.DataFrame( { 'column1': [col1_value1, col1_value2, ...], 'column2': [col2_value1, col2_value2, ...]} )
```

Let us practice!

**Question:** You have the two lists provided below: `Grade1` and `Grade2`. Make a pandas DataFrame which looks like this:

![](./images/Lab5_Grade1Grade2.png)

In [None]:
Grade1 = [95, 91, 89]
Grade2 = [72, 99, 92]
### Your code below this line

### Subsetting Data in `pandas`

With pandas, you can select specific rows and columns using the following syntax:

In [None]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df)

In [None]:
# Selecting a single column
df['A']

In [None]:
# Selecting multiple columns
df[['A', 'B']]

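You can select **rows** by their position (0 for the first row, 1 for the second, and so on) using the `.iloc` indexer. Note that `.iloc` is used with square brackets `[]`, not parentheses.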

In [None]:
# Selecting rows using index
# Select first row
df.iloc[0]

In [None]:
# Selecting rows AND columns using the index label and column name
# Select the row labeled 0 and column 'A' (.loc selects by label, not position)
df.loc[0, 'A']
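With the default index, row positions and row labels happen to be the same numbers, so `.iloc` and `.loc` can look interchangeable. The difference shows up when the index uses text labels. Here is a minimal sketch, assuming a hypothetical DataFrame `grades` with made-up row labels `'x'`, `'y'`, and `'z'`:

```python
import pandas as pd

# Same data as before, but with text labels for the rows
grades = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])

# .iloc selects by position: 0 means "the first row"
first_row = grades.iloc[0]

# .loc selects by label: 'x' is the label of the first row
same_row = grades.loc['x']
print(first_row.equals(same_row))  # True

# Row label 'y' and column 'B' with .loc
value = grades.loc['y', 'B']
print(value)  # 5

# The same cell by position: second row, second column
value_by_position = grades.iloc[1, 1]
print(value_by_position)  # 5
```

Here, `grades.loc[0]` would raise an error, because no row is labeled `0`.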