### Plotting Data with pandas and matplotlib.pyplot

For common plot types and settings, pandas provides plotting methods that can
be called directly on a dataframe or one of its columns through the `.plot`
accessor. For more customized figures, it is always possible to build plots
manually with matplotlib.pyplot or to use another library such as seaborn.
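
As a minimal sketch of these two routes, the following draws the same histogram once through the pandas `.plot` accessor and once by calling matplotlib directly. The dataframe `toy` and its values are made up for illustration and are not taken from `LaborSupply1988.csv`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data, used only to illustrate the two plotting routes
toy = pd.DataFrame({"age": [25, 31, 31, 40, 47, 52, 31, 40]})

fig, (ax_pandas, ax_manual) = plt.subplots(1, 2, figsize=(8, 3))

# Route 1: pandas plotting accessor, drawing on the axes passed via ax=
toy["age"].plot.hist(bins=5, ax=ax_pandas, title="pandas .plot.hist")

# Route 2: the same histogram built manually with matplotlib
ax_manual.hist(toy["age"], bins=5)
ax_manual.set_title("matplotlib Axes.hist")
ax_manual.set_xlabel("age")

fig.tight_layout()
```

The pandas route is shorter, while the manual route exposes every matplotlib option on the `Axes` object.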

In [None]:
# Read the file "LaborSupply1988.csv" into a pandas dataframe
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("LaborSupply1988.csv")

In [None]:
# Plot a histogram of the attribute "age". What is the most frequent age?

# Pandas dataframes and series have built-in basic plotting functionality
df["age"].plot.hist(bins=15)
df["age"].mode()  # the mode is the most common value in a dataset

In [None]:
# Plot the average number of "kids" against the "age" and interpret the resulting graph.
df.groupby("age")["kids"].mean().plot(style=".")

# Compute correlation of kids vs age
corr = df[["kids", "age"]].corr()
print(corr)

# --> The correlation between age and kids is negative, meaning that the
#     number of kids tends to decrease with increasing age.
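
`DataFrame.corr()` returns a symmetric matrix (Pearson correlation by default), so the value of interest is the off-diagonal entry. The sketch below, which uses a small made-up dataframe `toy` rather than the labor supply data, shows how to read that entry and how `Series.corr()` yields the same number directly.

```python
import pandas as pd

# Hypothetical toy data: the number of kids falls as age rises
toy = pd.DataFrame({"age": [25, 30, 35, 40, 45], "kids": [3, 3, 2, 1, 0]})

# DataFrame.corr() returns the full (symmetric) Pearson correlation matrix
corr_matrix = toy[["kids", "age"]].corr()

# The off-diagonal entry is the coefficient of interest
r_from_matrix = corr_matrix.loc["kids", "age"]

# Series.corr() gives the same number directly
r_direct = toy["kids"].corr(toy["age"])

print(corr_matrix)
print(f"r = {r_direct:.3f}")
```

A coefficient close to -1 or +1 indicates a strong linear relationship, and a value near 0 indicates a weak one.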

In [None]:
# Plot the "log of hourly wage (lnwg)" against the "age".
df.plot(x="age", y="lnwg", style=".")

In [None]:
# Plot the mean of the "log of hourly wages (lnwg)" against the "age". 
# Compute and discuss the type of correlation between "lnwg" and "age".
# The grouped result is a Series indexed by age, so no x/y arguments are needed
df.groupby("age")["lnwg"].mean().plot(style=".")
corr = df[["age", "lnwg"]].corr()
print(corr)


In [None]:
# Plot "lnhr" against the "age" with different colors for "disab=0" and "disab=1".
# Map disab == 0 to red and every other value to blue, as a list of color names
colors = df["disab"].eq(0).map({True: "red", False: "blue"}).tolist()
# A single vectorized scatter call is much faster than one call per point
plt.scatter(df["age"], df["lnhr"], s=10, c=colors)
plt.xlabel("age")
plt.ylabel("lnhr")
plt.show()
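
A legend makes the color coding self-explanatory. One possible sketch, again on a made-up dataframe `toy` that only borrows the column names of the exercise, issues one scatter call per `disab` group so that each group gets its own legend entry. The `palette` dictionary is a hypothetical helper.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data with the same column names as the exercise
toy = pd.DataFrame({
    "age": [25, 30, 35, 40, 45, 50],
    "lnhr": [7.6, 7.5, 7.7, 7.4, 7.2, 7.3],
    "disab": [0, 0, 1, 0, 1, 0],
})
palette = {0: "red", 1: "blue"}  # hypothetical helper: color per disab value

fig, ax = plt.subplots()
# One scatter call per group, so each group gets its own legend entry
for disab_value, group in toy.groupby("disab"):
    ax.scatter(group["age"], group["lnhr"], s=10,
               color=palette[disab_value], label=f"disab={disab_value}")
ax.set_xlabel("age")
ax.set_ylabel("lnhr")
ax.legend()
```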

In [None]:
# Plot a boxplot of the "log of annual hours (lnhr)" against the number of kids.
# What can be observed regarding median and variance? Is the observation meaningful
# for large values of kids?
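# Note: the "by" argument of plot.box requires pandas >= 1.4;
# on older versions, df.boxplot(column="lnhr", by="kids") is an alternative.
df.plot.box(column="lnhr", by="kids")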
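
Whether the boxes for large values of `kids` are meaningful depends on how many observations fall into each group. A quick way to check this is to tabulate the group sizes next to the median and variance. The sketch below does so on a made-up dataframe `toy`, not on the actual data.

```python
import pandas as pd

# Hypothetical toy data: only a single row has a large number of kids
toy = pd.DataFrame({
    "kids": [0, 0, 0, 1, 1, 1, 2, 2, 5],
    "lnhr": [7.6, 7.5, 7.7, 7.4, 7.5, 7.3, 7.2, 7.4, 6.9],
})

# Number of observations, median and sample variance of lnhr per number of kids
summary = toy.groupby("kids")["lnhr"].agg(["count", "median", "var"])
print(summary)
```

A group with a single observation has no defined sample variance (pandas reports `NaN`), and a box drawn from only a handful of points says little about the underlying distribution.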