# Demo 03: Basic Summary Stats

In this demo we'll go over how to compute basic summary statistics in Pandas. This will help us do more Exploratory Data Analysis (EDA) on our data.

In [None]:
# Startup magic to:
# (1) Mount Google Drive
# (2) Change to the course folder
# (3) Pull the latest changes
# (4) Move to the demo directory so that the data files are available

from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My Drive/cmps3160
!git pull
%cd _demos

The cell below loads a few libraries and does some initialization. In this notebook we'll do a few basic data manipulations, see the Pandas formatting for the first time, and make some simple graphs.


In [None]:
### Standard Magic and startup initializers.

# Load Numpy
import numpy as np
# Load MatPlotLib
import matplotlib
import matplotlib.pyplot as plt
# Load Pandas
import pandas as pd

# This lets us show plots inline and also save PDF plots if we want them. 
# This is not strictly necessary if you are working in CoLab but if you
# run this on a local install these packages help.
%matplotlib inline
from matplotlib.backends.backend_pdf import PdfPages
matplotlib.style.use('fivethirtyeight')
# Seaborn is a plotting package for Pandas that we'll try out...
import seaborn as sns

# Make the fonts a little bigger in our graphs.
font = {'size'   : 20}
matplotlib.rc('font', **font)
matplotlib.rcParams['mathtext.fontset'] = 'cm'
matplotlib.rcParams['pdf.fonttype'] = 42

# These two lines are for Pandas: they widen the notebook so we can display data easily.
# Again, these have no effect on Colab but are useful if you are working with Anaconda
# or Docker on your own machine.
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

## Working with some real data and Pandas!

Opening and reading CSV files is very easy with Pandas; see the [Read CSV Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).

In general the [Pandas Documentation](https://pandas.pydata.org/docs/) is very good and you should spend some time getting to know it. You can also use the cheatsheets that we have posted on the [course webpage](https://nmattei.github.io/cmps3160/).

Last class we only worked with the salaries dataset. This time let's work with some more data, including stats!

In [None]:
# Open the NBA stats file.

df_nba = pd.read_csv("./data/nba_stats.csv")
df_nba.head()
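
To see what `pd.read_csv` does without the course data on hand, here is a minimal sketch that reads a tiny CSV from an in-memory string. The rows are made up for illustration; only the column names mirror the ones used later in this demo.

```python
import io
import pandas as pd

# Made-up rows for illustration only; they are NOT taken from nba_stats.csv.
toy_csv = io.StringIO(
    "Name,Season,Age,Salary,TRB,ORB,DRB\n"
    "Player A,1998,35,1000000,400,100,300\n"
    "Player B,1998,24,3000000,200,50,150\n"
    "Player C,1999,35,2000000,600,200,400\n"
)

# read_csv accepts a file path or any file-like object.
df_toy = pd.read_csv(toy_csv)

print(df_toy.shape)   # (rows, columns)
print(df_toy.dtypes)  # Pandas infers a type for each column
```

Checking `shape` and `dtypes` right after loading is a quick way to confirm that the file was parsed the way you expected.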

This dataset has so much to look at! For example, we can see what the league average salary was in a given year.

In [None]:
# The notebook displays the result of the last line. Note that we can index and slice a DataFrame in many different ways.
df_nba[(df_nba["Season"] == 1998)]["Salary"]

In [None]:
# What is the type of the above? What do we want?
type(df_nba[(df_nba["Season"] == 1998)]["Salary"])
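
The type depends on how the column is selected. A minimal sketch on a made-up table (the rows are hypothetical; only the column names follow the demo): a single label returns a `Series`, while a list of labels returns a `DataFrame`.

```python
import pandas as pd

# Made-up rows for illustration only.
df_toy = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Season": [1998, 1998, 1999],
    "Salary": [1_000_000, 3_000_000, 2_000_000],
})

# Boolean mask: True for the rows we want to keep.
mask = df_toy["Season"] == 1998

one_column = df_toy[mask]["Salary"]    # single label -> Series
as_frame = df_toy[mask][["Salary"]]    # list of labels -> DataFrame

print(type(one_column).__name__)
print(type(as_frame).__name__)
```

A `Series` is usually what we want for a single summary number, since methods like `.mean()` then return a scalar.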

In [None]:
# What if we want the mean?
df_nba[(df_nba["Season"] == 1998)]["Salary"].mean()
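
The same filter-then-average step can also be written with `.loc`, which selects rows and a column in one call. This is a sketch on made-up data, not the course dataset, and it is offered as an alternative spelling rather than the required one.

```python
import pandas as pd

# Made-up rows for illustration only.
df_toy = pd.DataFrame({
    "Season": [1998, 1998, 1999],
    "Salary": [1_000_000, 3_000_000, 2_000_000],
})

mask = df_toy["Season"] == 1998

# Chained indexing, as in the cell above: filter rows, then pick a column.
chained_mean = df_toy[mask]["Salary"].mean()

# .loc takes the row mask and the column label together.
loc_mean = df_toy.loc[mask, "Salary"].mean()

print(chained_mean, loc_mean)
```

Both forms give the same number when reading data; `.loc` tends to matter more once you start assigning values to a filtered selection.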

In [None]:
# Pandas supports lots of summary functions; use autocomplete to see them.
df_nba[(df_nba["Season"] == 1998)]["Salary"].describe()

In [None]:
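# Careful, Pandas will sometimes let you do things you maybe shouldn't...
# This averages every numeric column, even ones like Season where a mean is meaningless.
# numeric_only=True skips text columns such as Name, which newer Pandas versions refuse to average.
df_nba[(df_nba["Season"] == 1998)].mean(numeric_only=True)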
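
The pitfall can be sketched on a small made-up table (hypothetical rows; column names follow the demo). Passing `numeric_only=True` skips the text column, but Pandas still averages `Season` without complaint, so it is often safer to name the columns you actually care about.

```python
import pandas as pd

# Made-up rows for illustration only.
df_toy = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Season": [1998, 1998, 1998],
    "Age": [35, 24, 30],
    "Salary": [1_000_000, 3_000_000, 2_000_000],
})

# Restrict the mean to numeric columns so the string column "Name" is skipped.
means = df_toy.mean(numeric_only=True)
print(means)

# The "mean Season" above is computed even though it tells us nothing.
# Picking the columns explicitly keeps only the averages that make sense.
sensible = df_toy[["Age", "Salary"]].mean()
print(sensible)
```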

In [None]:
# We can also look at just some stats if we want to. Here we list the rebounds
# of 35-year-olds; the next cell averages them.

df_nba[(df_nba["Age"] == 35)][["Name", "TRB","ORB","DRB"]]

In [None]:
df_nba[(df_nba["Age"] == 35)][["TRB","ORB","DRB"]].mean()

In [None]:
# We can also make fun histograms
df_nba['Age'].hist()

In [None]:
# But we can slice by Season... sometimes it's good to have boxplots per year.
df_nba[(df_nba["Season"] == 1998)].boxplot(column=["Age","Salary"])

In [None]:
# These are on very different scales, we'll come back to this! 

df_nba[(df_nba["Season"] == 1998)].boxplot(column=["ORB","DRB"])

In [None]:
# Very fancy.. though apparently there is an error in Pandas?
df_nba[["Season","Salary"]].boxplot(by="Season")
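
The numbers behind a per-season boxplot can also be computed directly, which helps when the plot itself looks off. A minimal sketch, assuming made-up salaries rather than the course data: `groupby` splits the rows by season and `describe` reports the median and quartiles that the boxes are drawn from.

```python
import pandas as pd

# Made-up salaries for illustration only.
df_toy = pd.DataFrame({
    "Season": [1998, 1998, 1998, 1999, 1999],
    "Salary": [1_000_000, 2_000_000, 3_000_000, 2_000_000, 4_000_000],
})

# One row of summary statistics per season.
summary = df_toy.groupby("Season")["Salary"].describe()

# The five-number summary is what a boxplot visualizes.
print(summary[["min", "25%", "50%", "75%", "max"]])
```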