# Data Science with Pandas

* Matt Burton 
* Data Dojo 
* November 13th, 2018

## Doing Data Science with Python

* Python is a very popular programming language for *doing* data science
* It is a powerful and expressive interpreted programming language
* It is fast enough for many data processing tasks
* It can hook into lower-level languages like C and Fortran when necessary
* It has a HUGE user community and many powerful 3rd party libraries

### Essential Python Libraries for Data Science

* **NumPy** - A low-level numerical computing library with a fast multidimensional array object *ndarray*
* **pandas** - A higher level library with several user-friendly data structures for numerical computing and data processing. 
* **matplotlib** - The most used (but not necessarily loved) Python library for data visualizations. 
* **Jupyter** - A platform for interactive computing and data analysis. Allows for the creation of *notebooks* (like this one here) for conducting and publishing data workflows. IT IS GREAT!!!
* **scikit-learn** - The go-to library for machine learning in Python. Implements many popular ML algorithms, has a nice API, and has many useful helper functions.
* **statsmodels** - A library for "classical" (frequentist) statistics (think ANOVA). Mirrors many of the models in R. 


* All of these libraries work well together, and together they make up the Python data science ecosystem.

## Dive into Pandas


* Pandas is a third party library for doing data analysis
* It is a foundational component of Python data science
* Developed by Wes McKinney while he was working in the finance industry, but it is now used by everyone
* Vanilla Python can do many of the same things, but Pandas is *faster*
* The core of Pandas is its data structures

### Pandas Data Structures

* To understand Pandas, which is hard, you need to start with three data structures
    * Series - For one dimensional data
    * DataFrame - For two dimensional data
    * Index - For naming, selecting, and transforming data within a Pandas Series or DataFrame 

### Series

* A one-dimensional array of indexed data
* Kind of like a blend of a Python list and dictionary
* You can create them from a Python list


In [None]:
import pandas as pd

In [None]:
my_list = [0.25, 0.5, 0.75, 1.0]
data = pd.Series(my_list)
data

* You can index a Series just like a list
* Use index notation to grab the 2nd element of `data`

In [None]:
# remember, indexing starts at zero, so 1 is the second element
data[1]

* You can also slice a Series
* Use slices to grab the 2nd and 3rd elements of this series

In [None]:
# slicing the 2nd & 3rd elements 
data[1:3]


* Series also act like Python dictionaries

In [None]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

* You can use indexing and slicing like above, but now with keys instead of numbers!


In [None]:
population['California']

* Like a Python dictionary, a Series is a collection of key/value pairs
* But these are *ordered*, which means you can do slicing
* Try slicing this series, but with keys instead of numbers!

In [None]:
# Hint: Use the same : notation, but use the state names listed above
# Your code here:
population.loc['California':'Illinois']
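
One detail worth seeing side by side: slicing by label includes the end label, while slicing by position does not. A minimal sketch, reusing the population figures from above:

```python
import pandas as pd

population = pd.Series({
    "California": 38332521,
    "Texas": 26448193,
    "New York": 19651127,
    "Florida": 19552860,
    "Illinois": 12882135,
})

# Label-based slice: the end label IS included
by_label = population.loc["California":"New York"]

# Position-based slice: the end position is NOT included
by_position = population.iloc[0:2]

print(by_label)
print(by_position)
```

Using `.loc` for labels and `.iloc` for positions keeps it obvious which rule applies.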

* There are a couple ways of creating `Series` objects

In [None]:
# From a list with an implicit index
pd.Series([2, 4, 6])

In [None]:
# From a list with an *explicit* index
pd.Series([2, 4, 6], index=['a','b','c'])

In [None]:
# From a dictionary, so the keys become the index
# (recent pandas versions keep the insertion order instead of sorting the keys)
pd.Series({2:'a', 1:'b', 3:'c'})
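
A dictionary and an explicit index can also be combined. The sketch below (the variable names are only for illustration) shows that the index then decides which keys are kept and in what order, and that a label missing from the dictionary becomes a missing value:

```python
import pandas as pd

letters = {2: "a", 1: "b", 3: "c"}

# Only the keys named in `index` are kept, in the order given
subset = pd.Series(letters, index=[3, 2])

# A label that is not a key in the dictionary gets a missing value (NaN)
with_gap = pd.Series(letters, index=[1, 4])

print(subset)
print(with_gap)
```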

### DataFrame

* `DataFrames` are the real workhorse of Pandas and Python Data Science
* We will be spending a lot of time with data inside of `DataFrames`, so buckle up!
* `DataFrames` contain two-dimensional data, just like an Excel spreadsheet
* In practice, a `DataFrame` is a bunch of `Series` lined up next to each other

In [None]:
# Start with our population Series defined above
population

In [None]:
# Then create an area Series
area_dict = {'Illinois': 149995, 'California': 423967, 
             'Texas': 695662, 'Florida': 170312, 
             'New York': 141297}
area = pd.Series(area_dict)
area

In [None]:
# Now mash them together into a DataFrame
states = pd.DataFrame({'population': population,
                       'area': area}   )
states

* Pandas automatically lines everything up because they have shared index values

In [None]:
print(area.index)
print(population.index)
print(states.index)
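
To see the alignment at work, here is a small sketch where the two Series list their labels in a different order and only partly overlap. The Alaska entry is added purely for illustration:

```python
import pandas as pd

# Labels appear in a different order and do not fully overlap
population = pd.Series({"California": 38332521,
                        "Texas": 26448193,
                        "New York": 19651127})
area = pd.Series({"Texas": 695662,
                  "Alaska": 1723337,
                  "California": 423967})

# Rows are matched by label, not by position; the result has the
# union of both indexes, with NaN where a Series has no value
states = pd.DataFrame({"population": population, "area": area})
print(states)
```

Note that a column containing `NaN` is stored as floating point, so the integer populations show up as floats here.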

* A `DataFrame` actually has two indexes
* One for the rows (as seen above)
* And another for the columns

In [None]:
states.columns
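
The two indexes are what make columns easy to work with. A minimal sketch, using a two-row version of the states table (the `density` column name is an assumption for the example):

```python
import pandas as pd

states = pd.DataFrame(
    {"population": [38332521, 26448193], "area": [423967, 695662]},
    index=["California", "Texas"],
)

# Selecting one column by name returns a Series that carries the row index
area = states["area"]
print(type(area).__name__)
print(area.index.equals(states.index))

# Columns share the row index, so arithmetic between them lines up row by row
states["density"] = states["population"] / states["area"]
print(states)
```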

## Indexes

* Pandas `Series` and `DataFrames` are containers for data
* The index (and indexing) is the mechanism that makes that data retrievable
* In a `Series` the index holds the key for each value
* A `DataFrame` has a row index (the row labels) and a column index (the column headers)
* Indexing allows you to merge or join disparate datasets together
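
As a rough sketch of that last point, two small tables that share state names as their index can be joined without any explicit key column. The `capitals` table is a second dataset made up for this example:

```python
import pandas as pd

population = pd.DataFrame(
    {"population": [38332521, 26448193, 19651127]},
    index=["California", "Texas", "New York"],
)
# A second dataset for illustration, with different rows in a different order
capitals = pd.DataFrame(
    {"capital": ["Austin", "Sacramento", "Tallahassee"]},
    index=["Texas", "California", "Florida"],
)

# join() matches rows on the index; how="inner" keeps only labels found in both
combined = population.join(capitals, how="inner")
print(combined)
```

Other values of `how` (`"left"`, `"outer"`) keep unmatched rows and fill the gaps with missing values.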

## Real world data processing 

Let's write a script that parses information out of an mbox email archive, `mbox.txt`, and puts it into a Pandas DataFrame.

* Parse every piece of information into a dictionary
* Aggregate all of those dictionaries into a list
* Create a Pandas DataFrame from that list of dictionaries


So we will transform this:
```
From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
```
into this:
```
{'year': '2008', 'month': 'Jan', 'dayofweek': 'Sat', 'address': 'stephen.marquard@uct.ac.za', 'day': '5', 'time': '09:14:16'}
```
into this:
```
address      stephen.marquard@uct.ac.za
day                                   5
dayofweek                           Sat
month                               Jan
time                           09:14:16
year                               2008
Name: 0, dtype: object
```
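
Here is a minimal sketch of those three steps on the single sample line shown above. The helper name `parse_from_line` is hypothetical, and a real archive would feed in every line of the file rather than a one-item list:

```python
import pandas as pd

# The sample line from above; a real run would read these from the file
sample_lines = [
    "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008",
]

def parse_from_line(line):
    """Turn one mbox 'From ' line into a dictionary (hypothetical helper)."""
    # split() with no argument also collapses the double space before the day
    _, address, dayofweek, month, day, time, year = line.split()
    return {"address": address, "dayofweek": dayofweek, "month": month,
            "day": day, "time": time, "year": year}

# One dictionary per message, collected into a list
records = [parse_from_line(line)
           for line in sample_lines
           if line.startswith("From ")]

# A list of dictionaries becomes a DataFrame: keys turn into column names
emails = pd.DataFrame(records)
print(emails.iloc[0])
```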


* Download the data manually with [this link](http://www.py4e.com/code3/mbox.txt) or run the cell below if you are on a Unix-based system
* If you are running this with Binder the data should already be downloaded.

In [None]:
# Run this cell to download the data
!wget -nv https://www.py4e.com/code3/mbox.txt -O mbox.txt

In [None]:
!head -n 10 mbox.txt

* What we want to do is parse the text file above into the nicely structured data below

In [None]:
# Quick and dirty code that parses the mbox file
with open("mbox.txt", encoding="utf-8") as email_file:
    # build a list with one entry per message
    # the list comprehension keeps only the envelope lines that start with "From "
    email_data = [line.split()[1:7] 
                  for line in email_file 
                  if line.startswith("From ")]
     
cols = ["address", 
        "dayofweek",
        "month",
        "day",
        "time",
        "year"]
emails_dataframe = pd.DataFrame(email_data, columns=cols)
emails_dataframe

* Once your data is in a Pandas `DataFrame` you can easily use a ton of analytical tools
* You just have to get your data to fit into a dataframe
* Getting data to fit is a big part of the "data janitor" work...it is the craft of data carpentry
* However, as we will see, there is still a lot of carpentry work to do once your data fits into a `DataFrame`
* This dataframe allows us to ask questions of the data, if you know how to ask.
* `value_counts()` is a `Series` method that counts how many times each distinct value occurs.
* First we need to extract the column we want

In [None]:
emails_dataframe['dayofweek']

In [None]:
emails_dataframe['dayofweek'].value_counts()
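
To make the shape of the result concrete, here is a small sketch on made-up values (not taken from the mbox file). It also shows the `normalize=True` option, which returns proportions instead of counts:

```python
import pandas as pd

# Made-up values, only to show what value_counts() returns
days = pd.Series(["Sat", "Fri", "Fri", "Thu", "Fri", "Thu"])

# Counts per distinct value, sorted from most to least frequent
counts = days.value_counts()

# Proportions instead of raw counts
shares = days.value_counts(normalize=True)

print(counts)
print(shares)
```

The result is itself a `Series`, indexed by the distinct values.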

### Vectorized String Operations

* Pandas has a terse and compact way of working with columns of text
* Its vectorized string operations do much of the painful work for you
* Especially handling bad data!

* So now let's try tabulating the number of institutions the Pandas way

In [None]:
# use a vectorized string operation over the email addresses
emails_dataframe['address'].str.split("@")

* Now we have a Series of list objects (you can tell from the square brackets)
* Let's get just the 2nd element of those lists. We can do that with [vectorized item access](http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.10-Working-With-Strings.ipynb#Vectorized-item-access-and-slicing)

In [None]:
# take the second item of each list, which is the part after the "@"
emails_dataframe['address'].str.split("@").str.get(1)

In [None]:
emails_dataframe['institution'] = emails_dataframe['address'].str.split("@").str.get(1)
emails_dataframe

In [None]:
emails_dataframe['institution'].value_counts()
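
The point about bad data is easiest to see on a tiny example. The addresses below are made up, and `None` stands in for a missing value:

```python
import pandas as pd

# Made-up addresses; None stands in for a missing value
addresses = pd.Series(["alice@example.edu", None, "bob@example.org"])

# Calling .split("@") on each item in a plain Python loop would fail on None.
# The .str accessor passes missing values through as missing instead.
domains = addresses.str.split("@").str.get(1)

print(domains)
```

The missing entry stays missing in the result, so one bad row does not stop the whole computation.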

In [None]:
emails_dataframe.to_csv("email-data-with-institution.csv", index=False)

### Playing with time on real data

* Let's look at the [311 data for the city of Pittsburgh](https://data.wprdc.org/dataset/311-data) from the WPRDC
* You can download the CSV file [here](https://data.wprdc.org/datastore/dump/40776043-ad00-40f5-9dc8-1fde865ff571) and save it as `311.csv` so the cell below can load it

In [None]:
# load the 311 data from the downloaded CSV file and parse the CREATED_ON dates into the index
pgh_311_data = pd.read_csv("311.csv",
                           index_col="CREATED_ON", 
                           parse_dates=True)
pgh_311_data.head()

In [None]:
pgh_311_data.info()

* Now that the dataframe has been indexed by time we can select 311 complaints by time

In [None]:
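# Select 311 complaints on November 13th 2017
# (.loc is needed here: recent pandas versions no longer select rows with a bare date string in [])
pgh_311_data.loc['2017-11-13']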

In [None]:
# Select the hours around the New Year's celebration that rang in 2016
# (slicing a time range works most reliably when the index is sorted)
pgh_311_data.loc["2015-12-31 20:00:00":"2016-01-01 02:00:00"]

* Someone clearly had a very rowdy New Year's 
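
If you do not have the 311 file at hand, the same kind of selection can be tried on a tiny synthetic table. The rows below are invented, and the sketch uses `.loc`, which works for both forms of selection across pandas versions:

```python
import pandas as pd

# Invented rows standing in for the 311 data, indexed by timestamp
requests = pd.DataFrame(
    {"REQUEST_TYPE": ["Potholes", "Fireworks", "Fireworks", "Snow/Ice removal"]},
    index=pd.to_datetime([
        "2015-12-31 09:30:00",
        "2015-12-31 23:45:00",
        "2016-01-01 00:10:00",
        "2016-01-02 08:00:00",
    ]),
)

# A partial date string selects every row on that calendar day
new_years_eve = requests.loc["2015-12-31"]

# A slice between two timestamps includes both end points
around_midnight = requests.loc["2015-12-31 20:00:00":"2016-01-01 02:00:00"]

print(new_years_eve)
print(around_midnight)
```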

### Grouping time with the `resample` method

* You use the `resample()` method to *split* time into groups
* Then you can *apply* the regular aggregation functions 

In [None]:
# compute the mean of the numeric columns per quarter...
# note this doesn't make sense, but works anyway
# (pandas 2.2+ prefers the alias "QE" over "Q", and "ME" over "M")
pgh_311_data.resample("Q").mean(numeric_only=True)

In [None]:
# count the number of complaints per month
pgh_311_data.resample("M").count()
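
The split-then-apply idea can be checked by hand on a few synthetic rows. This sketch assumes an invented request log and uses the `"MS"` (month start) alias, chosen here because it is spelled the same across pandas versions:

```python
import pandas as pd

# Invented request log: one row per request, indexed by creation date
requests = pd.DataFrame(
    {"REQUEST_ID": [1, 2, 3, 4, 5]},
    index=pd.to_datetime([
        "2017-01-03", "2017-01-20", "2017-02-11", "2017-04-02", "2017-04-28",
    ]),
)

# Split the rows into calendar-month bins, then count the rows in each bin
monthly = requests["REQUEST_ID"].resample("MS").count()

print(monthly)
```

March has no requests in this toy data, and `resample` still reports it with a count of zero, which is what you want before plotting a time series.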

* Ok, these data are *begging* to be visualized, so let me show you one last feature of Pandas...Visualization!

In [None]:
# tell Jupyter to render matplotlib plots inline in the notebook
%matplotlib inline


In [None]:
# Create a graph of the monthly complaint counts
pgh_311_data['REQUEST_ID'].resample("M").count().plot()

## Further Resources

* I highly recommend this book, it covers NumPy, pandas, matplotlib, and scikit-learn. 
* It is well written and up to date!

![Python Data Science Handbook](https://covers.oreillystatic.com/images/0636920034919/lrg.jpg)

* If you want to go deeper into pandas, you can't do better than this book. 
* The 2nd edition just came out and it is written by Wes McKinney, the creator of pandas!

![Python for Data Analysis: *2nd Edition*](https://covers.oreillystatic.com/images/0636920050896/lrg.jpg)