<a href="https://colab.research.google.com/github/jimhaines37/DataScience/blob/main/_demos/Demo-02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Demo 02: Introduction to Pandas, Reading Data, and Plotting!

In this demo we will get more practice working with Pandas, slicing data, and some simple plotting. At the very end we will do a bit more complex analysis and make a model!

In [None]:
# Startup Magic to: (1) Mount Google Drive
# (2) Change to Course Folder
# (3) Pull latest Changes
# (4) Move to the Demo Directory so that the data files are available

from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My Drive/cmps3160
!git pull
%cd _demos

# Intro to Pandas, Reading Data, and Plotting

The cell below loads up a few libraries and does some initialization.  In this notebook we'll do a few basic data manipulations and see the Pandas formatting for the first time and make some simple graphs.


In [None]:
### Standard Magic and startup initializers.

# Load Numpy
import numpy as np
# Load MatPlotLib
import matplotlib
import matplotlib.pyplot as plt
# Load Pandas
import pandas as pd

# This lets us show plots inline and also save PDF plots if we want them. 
# This is not strictly necessary if you are working in CoLab but if you
# run this on a local install these packages help.
%matplotlib inline
from matplotlib.backends.backend_pdf import PdfPages
matplotlib.style.use('fivethirtyeight')
# Seaborn is a plotting package for Pandas that we'll try out...
import seaborn as sns

# Make the fonts a little bigger in our graphs.
font = {'size'   : 20}
matplotlib.rc('font', **font)
matplotlib.rcParams['mathtext.fontset'] = 'cm'
matplotlib.rcParams['pdf.fonttype'] = 42

# These two things are for Pandas, it widens the notebook and lets us display data easily.
# Again these have no effect on CoLab but are useful for if you are working on Anaconda
# or docker on your own machine.
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

## Working with some real data and Pandas!

Opening and reading CSV files is very easy with Pandas; see the [Read CSV Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).

In general the [Pandas Documentation](https://pandas.pydata.org/docs/) is very good and you should spend some time getting to know it. You can also use the cheatsheets that we have posted on the [course webpage](https://nmattei.github.io/cmps3160/).

In [None]:
# Open the NBA Salaries file.

df_nba = pd.read_csv("./data/nba_salaries.csv")

Once we have loaded the data into the variable `df_nba` we can print it out! We can also slice the data by the index, or first column of the dataframe.

In [None]:
# Display gives us a basic table.  Note that we can index and slice this in many different ways.
display(df_nba)

In [None]:
# Within this we can get different rows of the table by the index, the left most column
display(df_nba[10:30])

We can pull out specific cells in the table.

In [None]:
# Look at a specific person... a little clunky.
display(df_nba.loc[df_nba['PLAYER'] == "Stephen Curry"])

In [None]:
# Can filter for a whole set
df_nba.loc[df_nba['POSITION'] == "PG"][:10]
# Again note that we can slice this different ways
# but we will always get the results in the order of the *index*

Because our data is well organized (we'll get more into how to do this well when we get into Tidy Data) we can use the selecting features to see a team.

In [None]:
# Can also see a team...
df_nba.loc[df_nba['TEAM'] == "New Orleans Pelicans"]

You may have noticed that we are using boolean conditions to select rows. We can combine these using the element-wise boolean operators `|` (or), `&` (and), and `~` (not). One quirk of Pandas is that we really should wrap each condition in parentheses, because `&` and `|` bind more tightly than comparisons like `==`. This can get a bit cumbersome.

In [None]:
df_nba.loc[(df_nba['TEAM'] == "New Orleans Pelicans") | (df_nba['TEAM'] == "Boston Celtics")]

In [None]:
# We can also do more complex selection, which we'll need a lot!
df_nba.loc[(df_nba["'15-'16 SALARY"] > 5.0)]
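The cells above only use `|` and a single comparison, so here is a minimal sketch of `&` and `~` as well. The `toy` table and its values are made up for illustration and are not part of the NBA file.

```python
import pandas as pd

# Hypothetical mini-table; the names and numbers are made up for illustration.
toy = pd.DataFrame({
    "PLAYER": ["A", "B", "C", "D"],
    "POSITION": ["PG", "C", "PG", "SF"],
    "SALARY": [12.0, 3.5, 1.2, 8.0],
})

# AND: point guards earning more than 5.0 (each condition in its own parentheses)
rich_pgs = toy.loc[(toy["POSITION"] == "PG") & (toy["SALARY"] > 5.0)]

# NOT: everyone who is not a point guard
not_pgs = toy.loc[~(toy["POSITION"] == "PG")]

print(rich_pgs)
print(not_pgs)
```

Dropping the parentheses around a comparison usually raises an error or selects the wrong rows, which is why it is worth always writing them.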

**Gotcha Warning** Pandas will let you use a `.attribute` notation... but I don't like this...

In [None]:
# There are other ways to index, but I always use the full selection as it's more explicit.
df_nba.loc[df_nba.TEAM == "New Orleans Pelicans"]

**Question:** Why would I not want to use the `.TEAM` syntax all the time? What could go wrong? 

In [None]:
# Can also just pick out a subset of columns if we want.
df_nba[['PLAYER', 'TEAM']]

In the above, the selection returns a new dataframe with just the relevant columns. We can use this to compute summary stats, which we'll use more later.

I really cannot stress enough how useful [the handy Pandas Cheat Sheet is.](https://drive.google.com/file/d/1SWw2QXKPGJv99_a4VceEdBkmnB2Zljb5/view)

In [None]:
# Can also compute summary stats over columns, like the sum... (look at cheat sheet!)
# Note the problem with the column name!
df_nba.loc[df_nba['TEAM'] == "New Orleans Pelicans"]["'15-'16 SALARY"].sum()

In [None]:
df_nba["'15-'16 SALARY"].median()

In [None]:
df_nba["'15-'16 SALARY"].describe()

# Sorting!

Up to now everything has been sorted by the `index`, the leftmost column of the dataframe. We might want to sort by other things, so we can use `sort_values`.

In [None]:
# Look at all PGs sorted by salary...
df_nba.loc[df_nba['POSITION'] == "PG"].sort_values("'15-'16 SALARY", ascending=False)[:10]

Note that the above sorting does not happen *in place* unless we explicitly tell Pandas to do so -- [Documentation for sort_values](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html)

In [None]:
# Once we sort by values it does not stay the same unless we overwrite the table or do it in place...
df_nba.loc[df_nba['POSITION'] == "PG"][:5]

### So let's try to sort it in place!

**Gotcha Warning!**

In [None]:
df_nba.loc[df_nba['POSITION'] == "PG"].sort_values("'15-'16 SALARY", inplace=True, ascending=False)

## This won't work -- the first command hands us a temporary selection of the rows, so the
## in-place sort is applied to that selection and never reaches df_nba. This is where the
## dreaded SettingWithCopyWarning comes from: we are not operating on the original dataframe!

You are going to get so sick of the **SettingWithCopyWarning** before this class is over. Do not ignore these! They will cost you professionalism points on turned in assignments and, more importantly, it means you are likely doing something that you do not want to or do not mean to do.
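As a minimal sketch of two ways to avoid the warning, assuming a made-up `toy` table rather than the real NBA data: either keep the sorted result that `sort_values` returns, or take an explicit `.copy()` of the selection before modifying it in place.

```python
import pandas as pd

# Hypothetical data, made up for illustration.
toy = pd.DataFrame({
    "POSITION": ["PG", "C", "PG", "PG"],
    "SALARY": [1.2, 3.5, 12.0, 8.0],
})

# Option 1: sort_values returns a new, sorted dataframe; give it a name.
pgs_sorted = toy.loc[toy["POSITION"] == "PG"].sort_values("SALARY", ascending=False)

# Option 2: take an explicit copy of the selection, then sort the copy in place.
pgs = toy.loc[toy["POSITION"] == "PG"].copy()
pgs.sort_values("SALARY", ascending=False, inplace=True)

print(pgs_sorted)
print(toy)  # the original table keeps its original order
```

Both options leave `toy` untouched, which makes it clear which table is being changed.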

[Remember Arie's Handy Coding Guide!](https://nmattei.github.io/cmps3140/codingguide)

In [None]:
# We have to sort, then print the whole data frame
df_nba.sort_values("'15-'16 SALARY", inplace=True, ascending=False)

In [None]:
df_nba.loc[df_nba['POSITION'] == "PG"][:5]

In [None]:
# But now everything is sorted!
df_nba[:5]

In [None]:
# Now let's put it back to normal... Note here we used .sort_index, if we used reset_index we'd renumber!!
df_nba.sort_index(inplace=True)
df_nba.loc[df_nba['POSITION'] == "PG"][:5]

To make sure it is clear -- if we had called `reset_index` on the above then we would have completely renumbered and set the index to the current order. While this may not seem like a big deal now, it might be if our data was, e.g., dates!
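Here is a small sketch of the difference on a made-up three-row table (the values are hypothetical). Note that it passes `drop=True` to `reset_index`, which discards the old labels; without it, the old labels are kept as a new column.

```python
import pandas as pd

# Hypothetical data, made up for illustration.
toy = pd.DataFrame({"SALARY": [1.2, 12.0, 8.0]}, index=[0, 1, 2])

# Sorting by value keeps each row's original label.
by_salary = toy.sort_values("SALARY", ascending=False)

# sort_index puts the rows back in their original order.
restored = by_salary.sort_index()

# reset_index renumbers the rows in their current (sorted) order.
renumbered = by_salary.reset_index(drop=True)

print(list(by_salary.index))
print(list(restored["SALARY"]))
print(list(renumbered["SALARY"]))
```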

In [None]:
# Get a histogram with Pandas
df_nba['POSITION'].hist()

In [None]:
# Maybe see what the position distribution is... (using seaborn)
sns.countplot(x=df_nba['POSITION'])

In [None]:
# Or we can see how salaries are distributed...
df_nba.plot.scatter(x='POSITION', y="'15-'16 SALARY")

In [None]:
# Or box plots to get really fancy...
df_nba.boxplot(column=["'15-'16 SALARY"], by=['POSITION'], figsize=(10,10))

# Reading some Books and Making a Model!

So far we've seen the basics of how to sort and slice data, you'll practice these more in Lab 2! 

But now let's try something more fun. We're going to build a model that shows the relationship between the number of characters in each chapter of a book and the punctuation!

To do this we'll use several tools that we'll do much more with later in the semester, including fetching text from the web (here with `urllib.request` from the standard library) and Python's regular expression processing features.

In [None]:
from urllib.request import urlopen 
import re
def read_url(url): 
    return re.sub('\\s+', ' ', urlopen(url).read().decode())
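To see what the regular expression in `read_url` does without downloading anything, here is a minimal sketch on a made-up string. The pattern `\s+` matches any run of whitespace, and each run is replaced by a single space.

```python
import re

# Made-up sample text with newlines, repeated spaces, and a tab.
raw = "CHAPTER I.\n\nYou don't know   about me\twithout you have read a book"

# Every run of whitespace (spaces, tabs, newlines) becomes a single space.
clean = re.sub(r"\s+", " ", raw)
print(clean)
```

Flattening the text this way means a later `split('CHAPTER ')` is not thrown off by line breaks.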

In [None]:

# Read two books, fast!

huck_finn_url = 'https://www.inferentialthinking.com/data/huck_finn.txt'
huck_finn_text = read_url(huck_finn_url)
huck_finn_chapters = huck_finn_text.split('CHAPTER ')[44:]

little_women_url = 'https://www.inferentialthinking.com/data/little_women.txt'
little_women_text = read_url(little_women_url)
little_women_chapters = little_women_text.split('CHAPTER ')[1:]

In [None]:
len(huck_finn_chapters)

In [None]:
huck_finn_chapters[0][:300]

In [None]:
# Turn it into a data frame..
df_huck = pd.DataFrame(huck_finn_chapters, columns=["Text"])

In [None]:
display(df_huck[:5])

In [None]:
# Count how many times we see each character...
# Here we make a data frame out of a dictionary where the key is the column name
# and the values are the column
counts = pd.DataFrame({
        'Jim':np.char.count(huck_finn_chapters, 'Jim'),
        'Tom':np.char.count(huck_finn_chapters, 'Tom'),
        'Huck':np.char.count(huck_finn_chapters, 'Huck')
    })

In [None]:
counts[:10]
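A quick sketch of how `np.char.count` behaves, using three made-up "chapters" instead of the real book: it counts non-overlapping occurrences of a substring in each string and returns one count per element.

```python
import numpy as np

# Made-up "chapters" for illustration.
chapters = ["Tom saw Jim. Jim waved.", "Huck ran.", "Tom and Tom's aunt."]

# One count per chapter, in the same order as the list.
jim_counts = np.char.count(chapters, "Jim")
tom_counts = np.char.count(chapters, "Tom")

print(jim_counts)
print(tom_counts)
```

Because the match is on substrings, "Tom's" counts as a "Tom". A short name would likewise be counted inside any longer word that contains it.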

In [None]:
ax = counts.cumsum().plot(figsize=(10,8))
ax.set_xlabel("Chapter")
ax.set_ylabel("Number of Times")
#ax.set_ylim((-5,310))
plt.show()
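The plot above uses `cumsum`, which turns per-chapter counts into running totals. A minimal sketch with made-up counts:

```python
import pandas as pd

# Hypothetical per-chapter counts, made up for illustration.
per_chapter = pd.DataFrame({"Jim": [2, 0, 3], "Tom": [1, 4, 0]})

# Each row becomes the sum of that row and all rows before it.
running = per_chapter.cumsum()
print(running)
```

A flat stretch in the cumulative plot therefore means the character is not mentioned in those chapters.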

There are lots of options for the figures... Note that here we are using [Pandas Plot](https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.plot.html), which is a wrapper around [Matplotlib's plot](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.plot.html).

In [None]:
ax = counts.cumsum().plot(figsize=(10,8), fontsize=(5),
                                          lw=2, 
                                          markersize=12,
                                          style=['X-','o-.','v--','s:','d:','*-.'])
ax.set_xlabel("Chapter")
ax.set_ylabel("Number of Times")
ax.set_ylim((-5,310))
plt.show()

In [None]:
# Now for Little women... here we're going to do the counts a little differently
# and use Numpy (NUmerical PYthon) functions

people = ['Amy', 'Beth', 'Jo', 'Laurie', 'Meg']
people_counts = {pp: np.char.count(little_women_chapters, pp) for pp in people}


In [None]:
people_counts.keys()

In [None]:
people_counts['Beth']

In [None]:
# Make a pandas table...
counts = pd.DataFrame(people_counts)
counts

In [None]:
ax = counts.cumsum().plot(figsize=(10,8), fontsize=(15),
                                          lw=2, 
                                          markersize=12,
                                          style=['X-','o-.','v--','s:','d:','*-.'])
ax.set_xlabel("Chapter")
ax.set_ylabel("Number of Times")
plt.show()

# Something more fun...

Inspired by the [Inferential Thinking Book](https://www.inferentialthinking.com/chapters/01/3/2/Another_Kind_Of_Character) let's do some more analysis on the text that we have loaded up.

First let's count the number of periods and the total number of characters in each chapter of the books.

In [None]:
# Recall that each element in the array corresponds to a chapter.
print(huck_finn_chapters[0][:50])
print(little_women_chapters[0][:50])

In [None]:
chars_periods_huck_finn = pd.DataFrame({
        'Huck Finn Chapter Length':[len(s) for s in huck_finn_chapters],
        'Number of Periods':np.char.count(huck_finn_chapters, '.')
        })
chars_periods_little_women = pd.DataFrame({
        'Little Women Chapter Length': [len(s) for s in little_women_chapters],
        'Number of Periods': np.char.count(little_women_chapters, '.')
        })

In [None]:
display(chars_periods_huck_finn[:5])
display(chars_periods_little_women[:5])

What do we notice about the above?  It seems like *Little Women* is significantly longer per chapter than *Huck Finn*.  Let's try plotting this relationship on the same graph.

To do this we are going to use the `scatter` function from [MatPlotLib](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.scatter.html)

In [None]:
plt.figure(figsize=(6, 6))
plt.scatter(chars_periods_huck_finn["Number of Periods"], 
              chars_periods_huck_finn["Huck Finn Chapter Length"], 
              color='darkblue')

plt.scatter(chars_periods_little_women["Number of Periods"], 
              chars_periods_little_women["Little Women Chapter Length"], 
              color='gold')

plt.xlabel('Number of periods in chapter')
plt.ylabel('Number of characters in chapter')

The above plot shows us a few things:

1. *Little Women* chapters are much longer on average than *Huck Finn* chapters.
2. There seems to be a linear relationship between the number of characters and the number of periods.

If we look at the chapters that have about 100 periods, we see they have 10,000-15,000 characters, or roughly 100-150 characters per sentence. Seems like a Tweet.

In [None]:
# Let's formally find the relationship...
from scipy import stats

# First let's make the tables the same..
chars_periods_huck_finn.columns = ['characters', 'periods']
chars_periods_little_women.columns = ['characters', 'periods']
display(chars_periods_huck_finn[:5])
len(chars_periods_huck_finn)

In [None]:
# Now we are going to concatenate the data together -- this is our first join operation!

merged = pd.concat([chars_periods_huck_finn, chars_periods_little_women])
merged
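One thing to notice in the merged table is that `pd.concat` stacks the rows and keeps each table's original index labels, so labels repeat. A minimal sketch with two made-up tables; `ignore_index=True` is an optional extra not used in the cell above.

```python
import pandas as pd

# Hypothetical tables with the same column names; values are made up.
a = pd.DataFrame({"characters": [7000, 12000], "periods": [66, 117]})
b = pd.DataFrame({"characters": [21000], "periods": [189]})

# Default: rows are stacked and the original labels are kept (0, 1, 0).
stacked = pd.concat([a, b])

# ignore_index=True gives the combined table fresh labels (0, 1, 2).
renumbered = pd.concat([a, b], ignore_index=True)

print(stacked)
print(renumbered)
```

Repeated labels do not matter for the regression, but they can surprise you if you later select rows by label.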

In [None]:
len(merged)

In [None]:
slope, intercept, r_value, p_value, std_err = stats.linregress(merged['periods'],merged['characters'])

In [None]:
slope

In [None]:
intercept

In [None]:
r_value

In [None]:
p_value

In [None]:
std_err
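To get a feel for what these values mean, here is a sketch of `stats.linregress` on made-up points that lie exactly on a known line, so the fit should recover that line. The data and the value 25.0 are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# Made-up points that lie exactly on y = 100 * x + 50.
x = np.array([10.0, 20.0, 30.0, 40.0])
y = 100.0 * x + 50.0

# The result can be unpacked as a tuple or read by attribute name.
result = stats.linregress(x, y)
slope, intercept, r_value = result.slope, result.intercept, result.rvalue

# Predict y for a new x value using the fitted line.
predicted = slope * 25.0 + intercept
print(slope, intercept, r_value, predicted)
```

The slope is the change in `y` per unit of `x` (for our books, characters per period), the intercept is the predicted `y` at `x = 0`, and an `r_value` near 1 means the points sit close to a straight line.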

In [None]:
line = slope * merged['periods'] + intercept

Now we can add the line above to our plot using the [plot function](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.plot.html).

In [None]:
plt.figure(figsize=(10, 10))
plt.scatter(chars_periods_huck_finn["periods"], 
              chars_periods_huck_finn["characters"], 
              color='darkblue')

plt.scatter(chars_periods_little_women["periods"], 
              chars_periods_little_women["characters"], 
              color='gold')

plt.plot(merged['periods'], line, lw=1, ls=':')

plt.xlabel('Number of periods in chapter')
plt.ylabel('Number of characters in chapter')