# Demo 03: Summary Stats and Graphs

In this demo we'll go over how to compute basic summary statistics in Pandas. This will help us do more Exploratory Data Analysis (EDA) on our data.

In [None]:
# COLAB Code Only!
# clone the course repository, change to right directory, and import libraries.
%cd /content
!git clone https://github.com/nmattei/cmps6790.git
%cd /content/cmps6790/_demos

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
# Make the fonts a little bigger in our graphs.
font = {'size'   : 20}
plt.rc('font', **font)
plt.rcParams['mathtext.fontset'] = 'cm'
plt.rcParams['pdf.fonttype'] = 42

The cell above loads a few libraries and does some initialization. In this notebook we'll do a few basic data manipulations, see the Pandas formatting for the first time, and make some simple graphs.


## Working with some real data and Pandas!

Opening and reading CSV files is very easy with Pandas; see the [Read CSV Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).

In general the [Pandas Documentation](https://pandas.pydata.org/docs/) is very good and you should spend some time getting to know it. You can also use the cheatsheets that we have posted on the [course webpage](https://nmattei.github.io/cmps3160/).

Last class we only worked with the salaries dataset. This time let's work with some more data, including player stats!

In [None]:
# Open the NBA stats file (it includes salaries along with player stats).

df_nba = pd.read_csv("./data/nba_stats.csv")
df_nba.head()
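
As a minimal, self-contained sketch of what `pd.read_csv` does, the cell below reads a tiny CSV from an in-memory string instead of a file. The rows and player names are hypothetical and are not taken from `nba_stats.csv`; only the column names mirror ones used in this demo.

```python
import io
import pandas as pd

# Hypothetical rows for illustration only; not real data from nba_stats.csv.
csv_text = """Name,Season,Age,Salary
Player A,1998,25,1000000
Player B,1998,31,
Player C,1999,28,2500000
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
df_toy = pd.read_csv(io.StringIO(csv_text))

# head() shows the first rows; dtypes shows the type inferred for each column.
print(df_toy.head())
print(df_toy.dtypes)
```

Note that the empty Salary field is read as `NaN`, which is why that column is inferred as a float rather than an integer.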

This dataset has so much to look at! For example, we can see what the league average salary is in a given year.

In [None]:
# Display gives us a basic table.  Note that we can index and slice this in many different ways.
df_nba[(df_nba["Season"] == 1998)]["Salary"]

In [None]:
# What is the type of the above? What do we want?
type(df_nba[(df_nba["Season"] == 1998)]["Salary"])

In [None]:
# What if we want the mean?
df_nba[(df_nba["Season"] == 1998)]["Salary"].mean()
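
The filtering pattern above happens in two steps: a comparison builds a boolean mask, and the mask selects rows. The sketch below spells that out on hypothetical toy data and shows the equivalent `.loc[rows, column]` form, which does the row and column selection in one step.

```python
import pandas as pd

# Hypothetical data for illustration only.
df_toy = pd.DataFrame({
    "Season": [1998, 1998, 1999, 1998],
    "Salary": [1_000_000.0, 3_000_000.0, 2_500_000.0, None],
})

# Step 1: comparing a column to a value yields a boolean Series (the mask).
mask = df_toy["Season"] == 1998

# Step 2: chained indexing, as in the cells above.
chained = df_toy[mask]["Salary"]

# Equivalent selection in a single step with .loc[rows, column].
with_loc = df_toy.loc[mask, "Salary"]

print(type(with_loc))   # selecting one column gives a Series
print(with_loc.mean())  # missing values (NaN) are skipped by default
```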

In [None]:
# Pandas supports lots of summary functions; use autocomplete to explore them.
df_nba[(df_nba["Season"] == 1998)]["Salary"].describe()

In [None]:
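# Careful, Pandas will sometimes let you do things you maybe shouldn't...
# This also averages columns like Season. Recent pandas versions raise an
# error on text columns unless numeric_only=True is passed.
df_nba[(df_nba["Season"] == 1998)].mean(numeric_only=True)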
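
To make the warning above concrete, here is a small sketch on hypothetical toy data. Taking the mean of a whole DataFrame averages every numeric column, including ones like Season where an average means little. Naming the columns you actually want is usually the safer habit.

```python
import pandas as pd

# Hypothetical data for illustration only.
df_toy = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Season": [1998, 1998, 1998],
    "Age": [25, 31, 28],
    "Salary": [1_000_000.0, 3_000_000.0, 2_000_000.0],
})

# numeric_only=True skips the text column but still averages Season,
# which is a label here rather than a quantity worth averaging.
all_numeric = df_toy.mean(numeric_only=True)

# Selecting columns first makes the intent explicit.
chosen = df_toy[["Age", "Salary"]].mean()

print(all_numeric)
print(chosen)
```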

In [None]:
# We can also look at just a few stats if we want to: check the rebounds
# (total, offensive, defensive) for 35-year-olds.

df_nba[(df_nba["Age"] == 35)][["Name", "TRB","ORB","DRB"]]

In [None]:
df_nba[(df_nba["Age"] == 35)][["TRB","ORB","DRB"]].mean()
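
Filtering for one age at a time works, but if we want the same average for every age, `groupby` computes them all at once. This is a sketch on hypothetical toy data, not something the cells above rely on; the row for age 35 matches the filter-then-mean approach.

```python
import pandas as pd

# Hypothetical data for illustration only.
df_toy = pd.DataFrame({
    "Age": [35, 35, 24, 24, 24],
    "TRB": [400, 200, 300, 600, 900],
    "ORB": [100, 50, 100, 200, 300],
    "DRB": [300, 150, 200, 400, 600],
})

# One row of means per distinct Age value.
by_age = df_toy.groupby("Age")[["TRB", "ORB", "DRB"]].mean()
print(by_age)

# Same result for a single age using the filter-then-mean pattern.
filtered = df_toy[df_toy["Age"] == 35][["TRB", "ORB", "DRB"]].mean()
print(filtered)
```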

In [None]:
# We can also make fun histograms
df_nba['Age'].hist()
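
A histogram is just a count of values per bin. The sketch below uses `np.histogram` on hypothetical ages to compute the counts and bin edges directly; it assumes 10 equal-width bins, which is also the default for `Series.hist`.

```python
import numpy as np
import pandas as pd

# Hypothetical ages for illustration only.
ages = pd.Series([19, 22, 22, 24, 25, 27, 27, 28, 30, 31, 33, 35, 39])

# 10 equal-width bins spanning the min and max of the data.
counts, edges = np.histogram(ages, bins=10)

# Print each bin's range and how many values fall in it.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:.1f}, {right:.1f}): {count}")
```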

In [None]:
# We can also slice by Season. Sometimes it's good to have boxplots per year.
df_nba[(df_nba["Season"] == 1998)].boxplot(column=["Age","Salary"], showmeans=True)

In [None]:
# Age and Salary (above) are on very different scales; we'll come back to this!

df_nba[(df_nba["Season"] == 1998)].boxplot(column=["ORB","DRB"], showmeans=True)
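
The numbers a boxplot draws can be computed directly, which helps when reading the plots above. This sketch uses hypothetical ages and assumes matplotlib's default whisker setting (`whis=1.5`), where whiskers reach to the most extreme data points within 1.5 times the interquartile range of the box.

```python
import pandas as pd

# Hypothetical ages for illustration only.
ages = pd.Series([22, 24, 25, 27, 30, 31, 33, 40], name="Age")

q1 = ages.quantile(0.25)      # bottom edge of the box
median = ages.quantile(0.50)  # line inside the box
q3 = ages.quantile(0.75)      # top edge of the box
iqr = q3 - q1                 # height of the box

# Points outside these fences are drawn as individual outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

# showmeans=True adds a marker at the mean.
mean = ages.mean()

print(q1, median, q3, iqr, mean)
print(outliers.tolist())
```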

In [None]:
# Very fancy, though apparently there is a bug in matplotlib? https://github.com/matplotlib/matplotlib/issues/16353
df_nba[["Season","Salary"]].boxplot(column="Salary", by="Season", showmeans=True)

In [None]:
# how can we make this prettier?
df_nba.boxplot?

In [None]:
df_nba[["Season","Salary"]].dropna().boxplot(column="Salary", by="Season", figsize=(20,9), showmeans=True)
plt.xticks(rotation=90)

In [None]:
salary_1995 = df_nba[df_nba['Season']==1995]['Salary']
salary_1996 = df_nba[df_nba['Season']==1996]['Salary']
print('average salary in 1995=%.2f and in 1996=%.2f, a difference of %.2f' % (
                                                      salary_1995.mean(),
                                                      salary_1996.mean(),
                                                      salary_1996.mean()-salary_1995.mean()))

In [None]:
def f2c(f):
  # Format float value as currency for pretty printing.
  return '${0:,.2f}'.format(f)

print('average salary in 1995=%s and in 1996=%s, a difference of %s' % (f2c(salary_1995.mean()),
                                                                        f2c(salary_1996.mean()),
                                                                        f2c(salary_1996.mean()-salary_1995.mean())))

In [None]:
# Let's repeat using the median and see the difference...
print('median salary in 1995=%s and in 1996=%s, a difference of %s' % (f2c(salary_1995.median()),
                                                                       f2c(salary_1996.median()),
                                                                       f2c(salary_1996.median()-salary_1995.median())))
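
Why can the mean and median tell such different stories? A minimal sketch with hypothetical salaries: adding a single very large value pulls the mean up sharply, while the median barely moves.

```python
import pandas as pd

# Hypothetical salaries for illustration only.
salaries = pd.Series([500_000, 600_000, 700_000, 800_000, 900_000])

# Add one star contract to the same group.
with_star = pd.concat([salaries, pd.Series([20_000_000])], ignore_index=True)

for label, s in [("without star", salaries), ("with star", with_star)]:
    print(f"{label}: mean=${s.mean():,.2f}, median=${s.median():,.2f}")
```

Salary data tends to be skewed like this, so the mean and the median can both be "the average" while describing very different players.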

<br><br><br>

Means and medians are handy for politicians...

<br>

**"The average American family will get a \$4,000 tax cut"**
[source](https://americansfortaxfairness.org/promise-will-middle-class-tax-cut/)

<img width=800 src="https://americansfortaxfairness.org/wp-content/uploads/P2F5-3.png">

<br><br><br>

### A mathematician, engineer, and accountant are interviewing for a job...

First question: What is 2+2?

**Engineer:** pulls out a slide rule, shuffles it back and forth, and finally announces, “It lies between 3.98 and 4.02”.

**Mathematician:** “In two hours I can demonstrate it equals 4 with the following short proof.”

**Accountant:** looks at the business owner, then gets out of his chair, checks to see if anyone is listening at the door and pulls the drapes. Then he returns to the business owner, leans across the desk and says in a low voice, “What would you like it to be?”