# Data analysis and visualization (Day 2)

Today we will be continuing our journey with Pandas and Seaborn.

# Review of Day 1

## 1. What is Pandas?

## 2. What is Seaborn?

## 3. What is a Pandas DataFrame?

## 4. Some coding review:

In [None]:
import pandas as pd
import seaborn as sns
# Seaborn has a set of example datasets so we use it to load Penguins.
df = sns.load_dataset("penguins")
print(f"`df` has type {type(df)}")
# QUESTION: what does df.head() do?
df.head()

In [None]:
# How can we show the last 5 rows of this dataframe?


In [None]:
# How do we select the column "island"?


In [None]:
# How do we calculate the mean of column "bill_depth_mm" ?


In [None]:
# How do you count the number of NaN values in `bill_length_mm` ?

In [None]:
# How do we set a new column with the product of "bill_length_mm" and "bill_depth_mm" ?


In [None]:
# How can we make a scatter plot of 'bill_length_mm' by 'bill_depth_mm' ?


In [None]:
# What is a Pandas Series?


# Loading and saving data in Pandas

In [None]:
import pandas as pd
import seaborn as sns

# Load a sample dataset from Seaborn's repository
df = sns.load_dataset("taxis")
# What is the type of df?
df.head()

Pandas DataFrames have a variety of methods with the pattern ".to_FORMAT" where FORMAT is the saving format. CSV and Excel are two common options.

See the documentation at https://pandas.pydata.org/docs/user_guide/io.html

In [None]:
# Show the help message for this function.
# This only works in Jupyter Notebooks and Colab (not in scripts).
df.to_csv?

In [None]:
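# Save to a file. index=False skips the row index, so reloading
# the file does not produce an extra "Unnamed: 0" column.
df.to_csv("mytaxis.csv", index=False)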

To load a file, use one of the many functions with the pattern "pd.read_FORMAT".

In [None]:
pd.read_csv?

In [None]:
# Read the file!
df = pd.read_csv("mytaxis.csv")
df.head()
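One thing to watch for in a round trip like this: by default `to_csv` also writes the row index, and `read_csv` reads datetime columns back as plain text. The sketch below uses a tiny made-up DataFrame (`df_small` is hypothetical, not the taxis data) and an in-memory buffer instead of a file to show both effects and one possible way to avoid them.

```python
import io

import pandas as pd

# Tiny made-up stand-in for the taxis data (hypothetical values).
df_small = pd.DataFrame(
    {
        "pickup": pd.to_datetime(["2019-03-23 20:21:09", "2019-03-04 16:11:55"]),
        "tip": [2.15, 0.0],
    }
)

# Default round trip: the index is written as an extra, unnamed column.
csv_text = df_small.to_csv()
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default.columns.tolist())

# Skip the index when saving and parse the datetime column when loading.
csv_text = df_small.to_csv(index=False)
df_clean = pd.read_csv(io.StringIO(csv_text), parse_dates=["pickup"])
print(df_clean.dtypes)
```

The same `index=False` and `parse_dates` arguments work when a file name is passed instead of a buffer.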

# Group By

There are often groups in our data. In the taxis data, for example, one can group by taxi color or by pickup borough and perform calculations on each of the groups. With this, you can answer the question, "Does the tip vary by the color of the taxi?"

In [None]:
# First, let's load the dataset again...
df = sns.load_dataset("taxis")

In [None]:
df["color"].value_counts()

In [None]:
df.groupby("color")["tip"].mean()

In [None]:
df.groupby("color").agg({"tip": "mean"})

In [None]:
df.groupby("color")["tip"].describe()
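The `agg` method can also compute several summaries at once. Here is a minimal sketch using named aggregation on a small made-up table (`df_small` and the output column names `mean_tip`, `max_fare`, and `n_rides` are chosen for illustration only).

```python
import pandas as pd

# Small made-up table with the same kind of columns as the taxis data.
df_small = pd.DataFrame(
    {
        "color": ["yellow", "yellow", "green", "green"],
        "tip": [1.0, 3.0, 0.0, 2.0],
        "fare": [10.0, 20.0, 8.0, 12.0],
    }
)

# Each keyword is an output column: (input column, aggregation function).
summary = df_small.groupby("color").agg(
    mean_tip=("tip", "mean"),
    max_fare=("fare", "max"),
    n_rides=("tip", "size"),
)
print(summary)
```

The result has one row per group, indexed by `color`, and one column per keyword.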

# Plotting by category

In [None]:
sns.set_context("talk")  # Make text bigger.
sns.countplot(data=df, x="passengers")

In [None]:
sns.countplot(data=df, x="pickup_borough")

In [None]:
sns.countplot(data=df, x="passengers", hue="pickup_borough")

### Now with scatter plots!! And datetime!

Introducing datetime types!

In [None]:
# Let's inspect the dtypes. We expect that pickup and dropoff will be datetime objects.
df.dtypes

In [None]:
# What does this do?
df["dropoff"] - df["pickup"]

In [None]:
df["duration"] = df["dropoff"] - df["pickup"]

In [None]:
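# Let's get the duration in seconds. This will allow us to plot this more easily...
# Note: .dt.seconds returns only the seconds component (days are dropped),
# whereas .dt.total_seconds() returns the full duration.
df["duration_sec"] = df["duration"].dt.total_seconds()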
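Timedelta columns offer two similar-looking accessors: `.dt.seconds` and `.dt.total_seconds()`. They agree for durations shorter than one day but differ for longer ones. The sketch below uses two made-up durations to show the difference.

```python
import pandas as pd

# Two made-up durations: 10.5 minutes, and 1 day plus 5 seconds.
durations = pd.Series(pd.to_timedelta(["0 days 00:10:30", "1 days 00:00:05"]))

# .dt.seconds is only the seconds component; whole days are left out.
component = durations.dt.seconds

# .dt.total_seconds() is the entire duration expressed in seconds.
total = durations.dt.total_seconds()

print(pd.DataFrame({"component": component, "total": total}))
```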

In [None]:
df.dtypes

In [None]:
sns.scatterplot(data=df, x="duration_sec", y="tip")

## Quick detour into linear regression

We have seen the plot of Tip by Trip Duration. What's the relationship between these two features?

Scikit-learn is _the_ machine learning library for Python.

Use it for classical machine learning methods like linear regression and random forests. Do not use it for deep learning.

See [the docs](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) for information on linear regression in scikit-learn.

In [None]:
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [None]:
df_train, df_test = train_test_split(df, random_state=42)

In [None]:
linreg = LinearRegression()

In [None]:
linreg.fit(df_train[["duration_sec"]], df_train[["tip"]])
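After fitting, the model stores the slope in `coef_` and the intercept in `intercept_`. The sketch below fits a line to made-up data that follows `tip = 0.002 * duration + 1` exactly, so the recovered parameters are easy to check. The variable names are hypothetical, and `y` is passed as a 1-D array here, which means the predictions come back 1-D as well.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: durations in seconds and tips on an exact straight line.
x = np.array([[300.0], [600.0], [900.0], [1200.0]])  # 2-D: (n_samples, n_features)
y = 0.002 * x[:, 0] + 1.0  # 1-D: (n_samples,)

model = LinearRegression().fit(x, y)

slope = model.coef_[0]
intercept = model.intercept_
predictions = model.predict(x)

print(f"slope={slope:.4f}, intercept={intercept:.4f}")
print(f"Shape of predictions is {predictions.shape}")
```

With real data the points will not lie on a line, so the slope and intercept are estimates rather than exact values.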

In [None]:
y_pred = linreg.predict(df_test[["duration_sec"]])
print(f"Shape of y_pred is {y_pred.shape}")
y_pred = y_pred.squeeze()
print(f"After squeezing, shape of y_pred is {y_pred.shape}")

In [None]:
sns.scatterplot(x=df_test["tip"], y=y_pred)
plt.xlabel("Ground truth tip")
plt.ylabel("Predicted tip")

In [None]:
from scipy import stats

result = stats.pearsonr(df_test["tip"], y_pred)
# The result object exposes the `statistic` and `pvalue` attributes.
result

In [None]:
result.statistic

In [None]:
result.pvalue

# Set values conditionally

Let's say we want to add new values to the dataframe based on existing data. Here's how we can do that.

In the example below, we will create a new column indicating whether the tip was high or low based on the median tip.

In [None]:
sns.displot(data=df, x="tip")  # Make plot...
sns.despine()  # Remove the top and right borders.

In [None]:
median_tip = df["tip"].median()
median_tip

In [None]:
# This creates a boolean Series, where True indicates a value above the median.
mask_high_tipper = df["tip"] > median_tip

# The tilde ~ is the NOT operator. NOT True is False. NOT False is True.
mask_low_tipper = ~mask_high_tipper

In [None]:
mask_high_tipper.head()

In [None]:
mask_low_tipper.head()

In [None]:
df.loc[mask_high_tipper, "tipper_class"] = "high"

In [None]:
df["tipper_class"].head()

In [None]:
df.loc[mask_low_tipper, "tipper_class"] = "low"

In [None]:
df["tipper_class"].head()

In [None]:
df["tipper_class"].value_counts()
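The two `.loc` assignments above can also be written as a single step with `np.where`, which picks one value where the condition is True and another where it is False. This is an alternative sketch on made-up data, not a replacement for the approach above.

```python
import numpy as np
import pandas as pd

# Made-up tips; the median of these five values is 2.0.
df_small = pd.DataFrame({"tip": [0.0, 1.0, 2.0, 3.0, 10.0]})
median_tip = df_small["tip"].median()

# "high" where the tip is above the median, "low" everywhere else.
df_small["tipper_class"] = np.where(df_small["tip"] > median_tip, "high", "low")
print(df_small)
```

As with the mask approach, a tip exactly equal to the median is classified as "low".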

We can combine multiple conditions when making boolean masks.

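`&` is logical AND. `|` is logical OR. Wrap each condition in parentheses, because `&` and `|` bind more tightly than comparisons like `==`.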

In [None]:
(df["color"] == "yellow") & (df["tipper_class"] == "low")
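A combined mask can be used to filter rows or to count matches. The sketch below uses a small made-up table (`df_small` is hypothetical) to show both, along with an OR condition.

```python
import pandas as pd

df_small = pd.DataFrame(
    {
        "color": ["yellow", "green", "yellow", "green"],
        "tipper_class": ["low", "low", "high", "high"],
    }
)

# AND: both conditions must hold for a row to be selected.
mask_and = (df_small["color"] == "yellow") & (df_small["tipper_class"] == "low")
yellow_low = df_small[mask_and]  # keep only the matching rows
n_and = int(mask_and.sum())  # True counts as 1, so sum() counts matches

# OR: at least one condition must hold.
mask_or = (df_small["color"] == "green") | (df_small["tipper_class"] == "high")
n_or = int(mask_or.sum())

print(yellow_low)
print(n_and, n_or)
```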

In [None]:
sns.displot(data=df, x="distance")

# Merging data frames

Merging is a very powerful tool. If two data frames share a common column, we can combine the data frames based on that column.

See https://pandas.pydata.org/docs/user_guide/merging.html for more information.

In [None]:
import pandas as pd
import seaborn as sns

df = sns.load_dataset("mpg")
df.head()

In [None]:
df.query("cylinders == 5")

In [None]:
df["cylinders"].value_counts()

In [None]:
notes_dict = {
    3: "cool engine!",
    4: "normal, economical",
    5: "unique!",
    6: "normal, powerful",
    8: "oh goodness!"
}
df_notes = pd.DataFrame.from_dict(notes_dict, orient="index", columns=["engine_notes"])
df_notes.index.name = "cylinders"
df_notes = df_notes.reset_index()
df_notes.head()

In [None]:
df.merge(df_notes)
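By default, `merge` joins on the columns the two data frames share and keeps only rows whose key appears in both (an inner join). The sketch below uses made-up tables (`cars` and `notes` are hypothetical) to show the `on` and `how` parameters, including what happens to a key with no match.

```python
import pandas as pd

cars = pd.DataFrame({"name": ["a", "b", "c"], "cylinders": [4, 8, 12]})
notes = pd.DataFrame(
    {
        "cylinders": [4, 8],
        "engine_notes": ["normal, economical", "oh goodness!"],
    }
)

# Inner join (the default): the 12-cylinder car has no note, so it is dropped.
inner = cars.merge(notes, on="cylinders")

# Left join: every car is kept, and missing notes become NaN.
left = cars.merge(notes, on="cylinders", how="left")

print(inner)
print(left)
```

Naming the key with `on` makes the join explicit, which helps when the data frames share more than one column name.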

# Customizing plots

We will be going through some examples from Seaborn's website https://seaborn.pydata.org/tutorial/aesthetics.html

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

def sinplot(n=10, flip=1):
    """Make a plot of a few sin functions."""
    x = np.linspace(0, 14, 100)
    for i in range(1, n + 1):
        plt.plot(x, np.sin(x + i * .5) * (n + 2 - i) * flip)

In [None]:
sinplot()

Seaborn has a function `set_theme()` which offers stylistic control. See https://seaborn.pydata.org/generated/seaborn.set_theme.html#seaborn.set_theme for more information.

Seaborn groups matplotlib parameters into "style" and "context". Style controls the general look of the plot (background color, grid lines, ticks), while context controls the scale of plot elements such as labels and lines.

In [None]:
sns.set_theme(context="notebook", style="darkgrid")
sinplot()

Seaborn has several preset styles: `darkgrid`, `whitegrid`, `dark`, `white`, and `ticks`.

In [None]:
sns.set_style("whitegrid")
data = np.random.normal(size=(20, 6)) + np.arange(6) / 2
sns.boxplot(data=data)

In [None]:
sns.set_style("dark")
sinplot()

In [None]:
sns.set_style("white")
sinplot()

In [None]:
sns.set_style("ticks")
sinplot()

The top and right axes spines can be removed in many cases.

In [None]:
sinplot()
sns.despine()

In [None]:
f, ax = plt.subplots()
sns.violinplot(data=data)
sns.despine(offset=10, trim=True)

In [None]:
sns.set_style("whitegrid")
sns.boxplot(data=data, palette="deep")
sns.despine(left=True)

You can use context managers (the `with` statement) to temporarily set a style.

In [None]:
f = plt.figure(figsize=(6, 6))
gs = f.add_gridspec(2, 2)

with sns.axes_style("darkgrid"):
    ax = f.add_subplot(gs[0, 0])
    sinplot(6)

with sns.axes_style("white"):
    ax = f.add_subplot(gs[0, 1])
    sinplot(6)

with sns.axes_style("ticks"):
    ax = f.add_subplot(gs[1, 0])
    sinplot(6)

with sns.axes_style("whitegrid"):
    ax = f.add_subplot(gs[1, 1])
    sinplot(6)

f.tight_layout()

View all of the different style parameters using `sns.axes_style()`:

In [None]:
sns.axes_style()

In [None]:
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
sinplot()

### Scaling plot elements

In [None]:
# Reset the theme
sns.set_theme()

In [None]:
sns.set_context("paper")
sinplot()

In [None]:
sns.set_context("talk")
sinplot()

In [None]:
sns.set_context("poster")
sinplot()

In [None]:
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})
sinplot()
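Just as `sns.axes_style()` shows the style parameters, `sns.plotting_context()` returns the parameters that a context sets. It can also be used in a `with` statement to scale a single plot. A minimal sketch, with made-up data for the plot:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Each context is a dictionary of matplotlib parameters.
paper = sns.plotting_context("paper")
talk = sns.plotting_context("talk")
print(paper["font.size"], talk["font.size"])

# Inside the `with` block, the "talk" scaling applies only to this plot.
with sns.plotting_context("talk"):
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])
    ax.set_title("Drawn with the talk context")
```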

### Customizing plots with matplotlib

Seaborn is built on top of matplotlib. We can use matplotlib methods to further customize our plots.

In [None]:
sinplot()
plt.title("Arbitrary sin waves", fontweight="bold")
plt.xlabel("Time")
plt.ylabel("Amplitude")

#### Multiple subplots

Use `plt.subplots()` to create one figure containing a grid of subplots (axes).

In [None]:
data.shape

In [None]:
# Make the scale a bit smaller.
sns.set_context("paper")

In [None]:
fig, axes = plt.subplots(nrows=5, ncols=4, figsize=(10, 10))
for i, ax in enumerate(axes.flat):
    ax.plot(data[i])
    ax.set_title(f"Plot number {i}")
fig.tight_layout()
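`plt.subplots()` returns the figure and a NumPy array of axes. The array can be flattened with `.flat`, as above, or indexed by row and column. The sketch below uses a smaller grid and its own made-up data (`small_data` is hypothetical), and assumes `sharey=True` is wanted so that all panels use the same y-axis range.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
small_data = rng.normal(size=(4, 6))  # 4 rows, one per subplot

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(6, 4), sharey=True)
print(axes.shape)

# Loop over every subplot in row-major order.
for i, ax in enumerate(axes.flat):
    ax.plot(small_data[i])
    ax.set_title(f"Row {i}")

# Address a single subplot by (row, column), here the bottom-left one.
axes[1, 0].set_xlabel("Column index")
fig.tight_layout()
```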