# Lab 8A - Data Analysis with `numpy` and `pandas`
*Day 8 - August 8, 2024*

*I School Python Bootcamp*

*Author: Lauren Chambers<br>Modified from notebook by George McIntire*

## `numpy`

NumPy is a powerful Python library used for numerical and scientific computing. Its name stands for "Numerical Python." NumPy provides support for large, multi-dimensional arrays and matrices, as well as a vast collection of high-level mathematical functions to operate on these arrays. It is one of the fundamental libraries in the Python data science ecosystem.

Common practice is to give an alias `np` to numpy when we import the library:

In [None]:
import numpy as np

We can convert a list to a numpy array:

In [None]:
num_arr = np.array([93,  5, 65, 53, 52, 55, 20, 59, 79, 30, 16, 29, 23,
       61, 96, 89, 86, 38, 84, 25])
type(num_arr)

Now that it's an array, we can use a variety of numpy array methods.

In [None]:
#mean
num_arr.mean()

In [None]:
#sum
num_arr.sum()

In [None]:
#standard deviation
num_arr.std()

`argmax` and `argmin` tell us the index (location) of the maximum and minimum values.

In [None]:
num_arr.argmax()

In [None]:
num_arr[num_arr.argmax()]
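
The cells above only show `argmax`, so here is a small sketch of `argmin` alongside it. The array repeats the values of `num_arr` so that the cell runs on its own; `min_idx` and `max_idx` are names made up for this example.

```python
import numpy as np

# Same values as num_arr above, repeated so this cell is self-contained
num_arr = np.array([93, 5, 65, 53, 52, 55, 20, 59, 79, 30, 16, 29, 23,
                    61, 96, 89, 86, 38, 84, 25])

min_idx = num_arr.argmin()   # index of the smallest value
max_idx = num_arr.argmax()   # index of the largest value

print(min_idx, num_arr[min_idx])   # 1 5
print(max_idx, num_arr[max_idx])   # 14 96

# Indexing with argmin/argmax agrees with min()/max()
print(num_arr[min_idx] == num_arr.min(), num_arr[max_idx] == num_arr.max())
```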

Reshape into a 2D matrix. Since `len(num_arr)` is 20, we can transform it into a 5x4 matrix.

In [None]:
num_arr

In [None]:
num_matrix = num_arr.reshape(5,4)
num_matrix

Let's check the shape just to be sure that worked as we expected:

In [None]:
num_matrix.shape
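
Reshaping only works when the new dimensions multiply to the number of elements. The sketch below illustrates that rule on a throwaway array (`arr`, `inferred`, and `reshape_failed` are names invented here), and also shows `-1`, which asks numpy to work out one dimension for you.

```python
import numpy as np

arr = np.arange(20)              # 20 elements: 0, 1, ..., 19

# -1 tells numpy to infer that dimension from the array's size
inferred = arr.reshape(-1, 4)
print(inferred.shape)            # (5, 4)

# 3 * 7 = 21, which does not match 20 elements, so this raises a ValueError
reshape_failed = False
try:
    arr.reshape(3, 7)
except ValueError as err:
    reshape_failed = True
    print("ValueError:", err)
```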

With a 2D matrix we can do 2D slicing

Slice the second column

In [None]:
num_matrix[:, 1]

Slice from the third row onward, keeping only the first three columns

In [None]:
num_matrix[2:, :3]
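
Slicing is easier to check on an array whose values are predictable. This is a small sketch using a made-up 5x4 array called `grid` (not the lab's `num_matrix`), where every entry equals its position when counting across the rows.

```python
import numpy as np

grid = np.arange(20).reshape(5, 4)
# grid looks like:
#    0  1  2  3
#    4  5  6  7
#    8  9 10 11
#   12 13 14 15
#   16 17 18 19

second_col = grid[:, 1]      # every row, column index 1
print(second_col)            # the values 1, 5, 9, 13, 17

block = grid[2:, :3]         # row indices 2, 3, 4 and column indices 0, 1, 2
print(block.shape)           # (3, 3)
print(block)
```

Remember that the stop index of a slice is excluded, so `:3` keeps column indices 0, 1, and 2.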

Recall, too, that we can do element-wise operations using our numpy arrays:

In [None]:
num_matrix * num_matrix

In [None]:
num_matrix - num_matrix

In [None]:
(num_matrix / 2) + num_matrix
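
To see why element-wise behavior matters, it can help to compare an array with a plain Python list. A minimal sketch, using made-up values (`values` and `arr` are not part of the lab):

```python
import numpy as np

values = [1, 2, 3]
arr = np.array(values)

doubled_list = values * 2     # lists repeat: [1, 2, 3, 1, 2, 3]
doubled_arr = arr * 2         # arrays multiply each element: 2, 4, 6

print(doubled_list)
print(doubled_arr)
print(arr * arr)              # each element squared: 1, 4, 9
print(arr / 2 + arr)          # 1.5, 3.0, 4.5
```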

## `pandas`

`pandas` is a powerful Python library designed for data manipulation and analysis, making it an essential tool for working with structured data. Its `DataFrame` and `Series` objects allow us to easily clean, filter, and transform datasets. Tbh, especially compared to Python-native tools like `csv.reader`, I just adore `pandas`.

Common practice is to give an alias `pd` to pandas when we import the library:

In [None]:
import pandas as pd

You can create a DataFrame using a `np.array` or a dictionary:

In [None]:
pd.DataFrame(data = {"students": ["Natalia", "Kenny", "Andrew", "Jeremy", "Carl", "Ray"],
                     "age": np.round(np.random.random(6) * 10 + 18).astype(int),
                     "tv_or_movie": ["Mad Men", "Chungking Express", "See", "Silicon Valley", "Paprika", "Portrait of a Lady on Fire"]})

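The cell above builds a DataFrame from a dictionary. For the `np.array` route, a minimal sketch might look like this; the scores and the column names `quiz_1` and `quiz_2` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical scores: 3 rows (students) x 2 columns (quizzes)
scores = np.array([[90, 85],
                   [72, 88],
                   [65, 95]])

# A 2D array has no column names of its own, so we supply them
scores_df = pd.DataFrame(scores, columns=["quiz_1", "quiz_2"])
print(scores_df)
print(scores_df.shape)   # (3, 2)
```

Without the `columns` argument, pandas would label the columns 0 and 1.
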
But for our lab today let's revisit our Titanic dataset:

In [None]:
titanic_df = pd.read_csv("titanic.csv")

Viewing dataframes with pandas is a very pleasant experience, as the package creates an HTML table showing our columns and a reasonable subset of rows that doesn't flood our entire browser screen:

In [None]:
titanic_df

Pandas makes it super easy to select one column, or even a subset of columns:

In [None]:
# One column of a DataFrame is a Series
titanic_df.Name

In [None]:
titanic_df[["Name", "Sex", "Age", "Survived"]]

However, do note that column names are case sensitive!

In [None]:
# This raises an AttributeError: the column is named "Age", not "age"
titanic_df.age
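
If you want to see the case-sensitivity errors without stopping your notebook, you can catch them. This is a sketch on a made-up two-row table (`people_df` is hypothetical and not part of `titanic.csv`):

```python
import pandas as pd

# Hypothetical two-row table, not part of titanic.csv
people_df = pd.DataFrame({"Name": ["Ada", "Grace"], "Age": [36, 45]})

print(people_df.Age.tolist())    # works: matches the column name exactly

attribute_error_raised = False
try:
    people_df.age                # lowercase "age" is not a column
except AttributeError as err:
    attribute_error_raised = True
    print("AttributeError:", err)

key_error_raised = False
try:
    people_df["age"]             # bracket notation fails too, with a KeyError
except KeyError as err:
    key_error_raised = True
    print("KeyError:", err)
```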

Pandas also makes filtering really intuitive. You can even define variables to make your code easier to read.

In [None]:
# Children who survived
titanic_df[(titanic_df.Survived == 1) & (titanic_df.Age < 18)]

In [None]:
# Just get their names
titanic_df[(titanic_df.Survived == 1) & (titanic_df.Age < 18)].Name

In [None]:
# Use variables for readability
survived = titanic_df.Survived == 1
children = titanic_df.Age < 18
titanic_df[survived & children]

In [None]:
survived # Just a mask of booleans for each row
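
Here is a small sketch of how boolean masks combine, using four made-up passengers rather than the real data. The names `mask_df`, `demo_survived`, and `demo_children` are chosen so they don't overwrite the lab's variables.

```python
import pandas as pd

# Hypothetical passengers, not rows from titanic.csv
mask_df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid", "Dee"],
    "Age": [8, 40, 15, 62],
    "Survived": [1, 0, 0, 1],
})

demo_survived = mask_df.Survived == 1
demo_children = mask_df.Age < 18

print(demo_survived.tolist())                    # [True, False, False, True]
print((demo_survived & demo_children).tolist())  # and: [True, False, False, False]
print((demo_survived | demo_children).tolist())  # or:  [True, False, True, True]

# True counts as 1, so summing a mask counts the matching rows
print(demo_survived.sum())                       # 2

print(mask_df[demo_survived & demo_children].Name.tolist())  # ['Ann']
```

When you write the conditions inline, keep the parentheses around each comparison: `&` binds more tightly than `==` and `<`.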

It can be a bit tricky to remember when to use which (don't be afraid to Google!), but we can use the `.loc` and `.iloc` indexers to select and filter along both dimensions, rows and columns. `.loc` selects by label or boolean mask, while `.iloc` selects by integer position.

In [None]:
# Syntax is df.loc[row_filters, col_filters]
titanic_df.loc[survived & children, ["Name", "Age"]]

In [None]:
# Syntax is df.iloc[row_indices, col_indices] 
titanic_df.iloc[:7]

In [None]:
titanic_df.iloc[1:11, [3, 5]]
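
The difference between `.loc` and `.iloc` is easiest to see when the row labels are not just 0, 1, 2, and so on. A minimal sketch with a made-up table (`pets_df` and its contents are invented for this example):

```python
import pandas as pd

pets_df = pd.DataFrame(
    {"species": ["cat", "dog", "owl", "eel"], "legs": [4, 4, 2, 0]},
    index=["tom", "rex", "hoot", "zap"],   # string row labels
)

# .loc uses labels, and a label slice includes its end label
by_label = pets_df.loc["rex":"hoot", ["legs"]]
print(by_label)

# .iloc uses integer positions, and a position slice excludes its end
by_position = pets_df.iloc[1:3, [1]]
print(by_position)

# .loc also accepts a boolean mask for the rows
four_legged = pets_df.loc[pets_df.legs == 4, "species"].tolist()
print(four_legged)                          # ['cat', 'dog']
```

Both selections return the rows for `rex` and `hoot`, just addressed in two different ways.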

And finally, we can also create new columns easily using this column name notation:

In [None]:
titanic_df["Child"] = titanic_df.Age < 18
titanic_df

In [None]:
titanic_df[titanic_df.Child]
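
New columns don't have to be booleans; any expression that produces one value per row will do. A short sketch on made-up data (`fares_df` and the `FareDoubled` column are hypothetical):

```python
import pandas as pd

fares_df = pd.DataFrame({"Fare": [10.0, 80.0, 25.0], "Age": [30, 5, 17]})

fares_df["Child"] = fares_df.Age < 18          # boolean column
fares_df["FareDoubled"] = fares_df.Fare * 2    # numeric column

print(fares_df)

# A boolean column can be used directly as a filter
child_fares = fares_df[fares_df.Child].Fare.tolist()
print(child_fares)                             # [80.0, 25.0]
```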

Using `pandas` can make it waaaaaay easier to make analytic plots with `matplotlib`, too:

In [None]:
import matplotlib.pyplot as plt

In [None]:
# Calculate how many men survived and died
is_male = titanic_df.Sex == "male"

male_survival = titanic_df[is_male].Survived
n_male_survival = male_survival.value_counts()
n_male_survival.sort_index() # sort to make sure the order is the same for both sexes

In [None]:
# We can use the tilde (~) to invert the filter
titanic_df[~is_male].Survived.value_counts().sort_index()
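
For inverting a boolean mask, `~` is the operator that works the same way in both pandas and numpy. The sketch below uses made-up values (`sex`, `male_mask`, and `not_male` are names invented here) and also shows that a plain numpy boolean array refuses the minus sign.

```python
import numpy as np
import pandas as pd

sex = pd.Series(["male", "female", "female", "male"])  # hypothetical values
male_mask = sex == "male"

not_male = ~male_mask
print(not_male.tolist())         # [False, True, True, False]

# A plain numpy boolean array rejects unary minus with a TypeError
minus_rejected = False
try:
    -male_mask.to_numpy()
except TypeError as err:
    minus_rejected = True
    print("TypeError:", err)
```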

In [None]:
# Initialize
plt.figure()

# Draw plot and axis labels
width = 0.4  # the width of the bars
plt.bar([0 - width/2, 1 - width/2], 
        titanic_df[is_male].Survived.value_counts().sort_index(), 
        label="Male", width=width)
plt.bar([0 + width/2, 1 + width/2],
        titanic_df[~is_male].Survived.value_counts().sort_index(),
        label="Female", width=width)
plt.ylabel("Number of passengers")
plt.title("Survival counts by sex")
plt.xticks([0, 1], labels=["died", "survived"])
plt.ylim(0, 600)
plt.legend()

# Display
plt.show()

In [None]:
# Initialize
fig, (ax1, ax2, ax3) = plt.subplots(nrows=1, ncols=3, sharey=True)

# Draw plot and axis labels
width = 0.4  # the width of the bars
for c, ax in zip([1, 2, 3], [ax1, ax2, ax3]):
    ax.bar([0 - width/2, 1 - width/2], 
           
           # Use pandas to filter, count, and sort inline
           titanic_df[is_male & (titanic_df.Pclass == c)].Survived.value_counts().sort_index(),
           label="Male", width=width)
    ax.bar([0 + width/2, 1 + width/2], 
           
           # Use pandas to filter, count, and sort inline
           titanic_df[~is_male & (titanic_df.Pclass == c)].Survived.value_counts().sort_index(),
           label="Female", width=width)
    ax.set_xticks([0, 1], labels=["died", "survived"])
    ax.set_title(str(c) + " Class")

ax1.set_ylabel("Number of passengers")
plt.suptitle("Survival counts by sex and class")
ax2.legend(loc="lower center", bbox_to_anchor = (.5, -.2), ncol=2)

# Display
plt.show()
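
One caveat with the `value_counts().sort_index()` pattern used in these plots: it only returns the categories that actually occur. If some group had no deaths (or no survivors), the result would have one value instead of two and the bar chart would fail. A sketch of one possible guard, using `reindex` on a made-up group (`outcomes`, `counts`, and `safe_counts` are names invented here):

```python
import pandas as pd

outcomes = pd.Series([1, 1, 1])      # hypothetical group where everyone survived

counts = outcomes.value_counts().sort_index()
print(len(counts))                   # 1: there is no entry for 0

# Force both categories to be present, filling the missing one with 0
safe_counts = counts.reindex([0, 1], fill_value=0)
print(safe_counts.tolist())          # [0, 3]
```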

# Exercises

## Exercise 1
Create a numpy array with 20 elements that are evenly spaced between 0 and 10. *Hint: Use either `np.arange()` or `np.linspace()` - do you remember the difference?*

1. Calculate the sine of each element using `np.sin()`
2. Do the same with `np.exp()` and `np.log()`
3. Use `matplotlib` to create a figure with three subplots; plot the sine in the first, the exponential in the second, and the logarithm in the last.

## Exercise 2

Load the `countries.csv` file using `pd.read_csv()`.

1. Using `pandas`, print the unique values in the `continent` column. Try finding a `pd` function that will let you do this without using `set`!
1. Calculate the mean GDP per capita across *all* countries for each year. (You'll probably want to use a for loop for this.)
1. Use `matplotlib` to plot the mean worldwide GDP over time.
1. For each year, determine the country with the highest life expectancy.
1. Just as in Lab 4A, add a new column to the dataframe: the raw GDP, as opposed to the per capita. (`gdp = pop * gdpPercap`)
1. Use bracket notation to display only the Year, GDP, GDP Per Capita, and Country columns.

## Bonus Exercise
1. Use `np.random.random(shape)` to generate a 100 x 100 array of random values.
1. Let's use `matplotlib` to visualize this 2D data. (Surprise!) Call `plt.imshow(data)`, passing in your numpy array as the data, and see what it looks like. Fun, huh?
1. Reshape the array to a 200 x 50 array (the new shape must still hold all 10,000 values), then plot it again.