# Python for Data Analysis

# Tour de Python Level 2 ●○○○

* Python `stdlib`
* `import` syntax
* Python packages

## built-in Python

* Everything we've talked about so far is part of what Python calls the "built-ins"
* Every Python session has access to everything we've learned no matter what
* The built-ins are general purpose building blocks: "primitive" data types (like strings, integers, dictionaries), control flow statements, basic operators, etc

## moving on (briefly) to the Python `stdlib`

* every Python installation also comes with special data types, operators, functions, and methods to address specific types of problems
    * ex. `datetime` for storing dates and times that are aware of year/month/day/timezone
* by default these are not loaded into each Python session, but instead have to be **imported**
* `stdlib` = "standard library"
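
A minimal sketch of the `datetime` example above. The specific date is made up for illustration; the point is only that the module must be imported before its data types are available.

```python
import datetime

# datetime ships with Python, but it is not loaded until we import it
launch = datetime.datetime(2020, 3, 14, 9, 30, tzinfo=datetime.timezone.utc)

# the object keeps track of its own year, month, day, and timezone
print(launch.year, launch.month, launch.day)
print(launch.tzinfo)
print(launch.isoformat())
```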

## Python modules

* any `.py` file can also be referred to as a "Python module"
* modules can be imported using one of three styles of import syntax. here's one of them:

In [None]:
import math

* the modules available in any vanilla Python installation, because they are part of the `stdlib`, are listed online at https://docs.python.org/3/library/ ⇢
* importing a module makes its code definitions accessible in whatever environment they are being imported to

## Variants of import syntax and namespaces

* Python provides 3 styles of import syntax that affect the namespacing of the imported module and its members


In [None]:
import math
math.ceil(5)

## Anatomy of import syntax 1

<p><b>if import syntax is:</b></p>
<p><font color=green>import</font> <font color=orange>module_name</font></p>

<p><b>then call syntax is:</b></p>
<p><font color=orange>module_name</font>.<font color=blue>member_name</font></p>

In [None]:
import math as m
m.ceil(5)

## Anatomy of import syntax 2

<p><b>if import syntax is:</b></p>
<p><font color=green>import</font> <font color=orange>module_name</font> <font color=green>as</font> <font color=goldenrod>alias</font></p>

<p><b>then call syntax is:</b></p>
<p><font color=goldenrod>alias</font>.<font color=blue>member_name</font></p>

In [None]:
from math import ceil
ceil(5)

## Anatomy of import syntax 3

<p><b>if import syntax is:</b></p>
<p><font color=green>from</font> <font color=orange>module_name</font> <font color=green>import</font> <font color=blue>member_name</font>, <font color=grey>...</font></p>

<p><b>then call syntax is:</b></p>
<p><font color=blue>member_name</font></p>
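
To see that the three styles only change the name you type, not the thing you get, here is a small sketch that uses all three side by side in one session (normally you would pick just one):

```python
import math
import math as m
from math import ceil

# all three names point at the very same function object
same_function = math.ceil is m.ceil is ceil
print(same_function)

# only the spelling of the call changes
results = [math.ceil(4.2), m.ceil(4.2), ceil(4.2)]
print(results)
```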

## `stdlib` greatest hits

* `datetime`
* `random.seed`, `random.random`
* `os.path.exists`, `os.path.join`, `os.path.abspath`
* `csv.reader`, `csv.DictReader`
* `csv.writer`
* `json.loads`, `json.dumps`
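
A quick, hedged tour of a few of these. The path, the record, and the CSV text below are invented for illustration, and `io.StringIO` stands in for a real file so the cell runs anywhere.

```python
import csv
import io
import json
import os.path
import random

# seeding makes the "random" numbers repeatable from run to run
random.seed(42)
first = random.random()
random.seed(42)
second = random.random()
print(first == second)

# os.path.join glues path pieces together with the right separator for your OS
path = os.path.join("data", "iris.csv")
print(path, os.path.exists(path))

# json.dumps turns Python data into a string; json.loads turns it back
record = {"species": "setosa", "sepal_length": 5.1}
round_trip = json.loads(json.dumps(record))
print(round_trip == record)

# csv.DictReader yields one dictionary per row, keyed by the header line
fake_file = io.StringIO("letter,integer\na,1\nb,2\n")
rows = list(csv.DictReader(fake_file))
print(rows[0]["letter"], rows[1]["integer"])
```

Note that `csv` hands every value back as a string, so `rows[1]["integer"]` is `"2"`, not `2`.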

## Get your feet wet

In the Python interpreter, try using the 3 different styles of import syntax to import the following **functions**, and call each one the way that style of import requires. You will need to exit and re-enter your Python session each time to clear your prior imports.

* `random.random`
* `os.getcwd`

## Going past the `stdlib`

* remember: the `stdlib` is maintained by the Python Software Foundation and comes with every installation of Python
* other members of the Python community write their own extensions to the Python built-ins called **packages**
    * usually they are even more specialized than modules in the `stdlib`

## Introducing our data analysis packages

* **Pandas**
   * used for processing tabular data
   * core data type is the `DataFrame`
   * modeled on R's `data.frame` paradigm
* **Matplotlib**
   * used to generate charts such as histograms or box plots from Python data structures
   * port of MATLAB's charting functionality

## ~~Installing~~ python packages

* lucky you - you don't have to! For this class, since we used the Anaconda distribution of Python, the python packages we want to use are already installed!
    * the full list for your installation can be found at https://docs.anaconda.com/anaconda/packages/pkg-docs ⇢
* more generally: there are many ways to find and download community-supported Python extensions, but the most popular way is via a *package manager* that downloads from a package index such as PyPI at https://pypi.python.org/pypi ⇢
    * popular *package managers* include `pip` and `pipenv` (which download from PyPI) and `conda` (which downloads from Anaconda's own channels)

# `pandas`

# Tour de Python Level 2 ○●○○

* `DataFrame`
* `Series`
* Python attributes
* `DataFrame` indexing
* Querying `DataFrame`s with boolean series

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("iris.csv")

In [None]:
type(df)

In [None]:
df.head()

In [None]:
df.head(2)

## The `pandas` dataframe

* a two dimensional data structure representing tabular data
* has *columns* and *rows*
* each column's data is of the same *data type*

## Creating a pandas dataframe

* use a convenience function against a file on disk
    * [`pd.read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html), for CSV data
    * [`pd.read_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html), for general reading of tabular data, including `.tsv` files
    * [`pd.read_json`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html) for JSON data
    * [`pd.read_excel`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html) for Excel files, particularly useful for Excel files with many sheets
    * [`pd.read_html`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html) for reading HTML `<table>`s

In [None]:
df = pd.read_csv("iris.csv")

## Anatomy of a pandas dataframe convenience function

<p><font color=blue>variable_name</font> = <font color=green>pd</font>.<font color=goldenrod>convenience_method</font>(<font color=red>path_as_a_string</font>)</p>
<ul><li> <font color=goldenrod>read_csv</font></li>
 <li><font color=goldenrod>read_table</font></li>
 <li><font color=goldenrod>read_json</font></li>
 <li><font color=goldenrod>read_excel</font></li>
 <li><font color=goldenrod>read_html</font></li></ul>
 
 PS: Of course, remember that the path can be an absolute or relative path!
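
If you don't have a file handy, the same pattern can be tried on in-memory text. This is a sketch: the CSV text is made up, and `io.StringIO` is used as a stand-in for a file on disk, since `pd.read_csv` also accepts file-like objects.

```python
import io
import pandas as pd

# the same text a tiny CSV file on disk would contain
csv_text = "letter,integer,float\na,1,5.0\nb,2,10.0\n"

# variable_name = pd.convenience_method(path_or_file_like_object)
toy_df = pd.read_csv(io.StringIO(csv_text))
print(toy_df)
print(toy_df.shape)
```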


## Creating a pandas dataframe inline

* instantiate a `DataFrame` instance directly, passing it a `data` parameter with something that can be cast into a dataframe shape
* the general format for something that can be cast to a dataframe shape takes a form like: `[[row],[row],[row]]`

In [None]:
df_direct = pd.DataFrame(data=[["a", 1, 5.0], ["b", 2, 10.0]])

In [None]:
df_direct

## Anatomy of instantiating a DataFrame directly

<p><font color=blue>variable_name</font> = <font color=green>pd</font>.<font color=green>DataFrame</font>(data=<font color=red>data_castable_to_dataframe</font>)</p>

In [None]:
df_direct_with_columns = pd.DataFrame(data=[["a", 1, 5.0], ["b", 2, 10.0]],
                                     columns=["letter", "integer", "float"])

In [None]:
df_direct_with_columns

## Anatomy of instantiating a DataFrame directly, with column names

<p><font color=blue>variable_name</font> = <font color=green>pd</font>.<font color=green>DataFrame</font>(data=<font color=red>data_castable_to_dataframe</font>, columns=<font color=purple>list_of_column_names</font>)</p>

## Python attributes

* instances of more complex data types have **attributes** associated with them
* they are accessible using the dot notation like `variable_name.attribute_name`
* these are not callable - in practical terms for us at this point, this means they don't take parentheses `()` after them - and they simply hand back the data that the attribute refers to
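
The difference between an attribute and a method can be sketched with a `stdlib` type. The date here is arbitrary; `callable()` is a built-in that reports whether something can be called with `()`.

```python
import datetime

day = datetime.date(2020, 3, 14)

# an attribute: no parentheses, it just hands back stored data
print(day.year)

# a method: callable, so it needs parentheses to run
print(day.isoformat())

# callable() tells the two apart
print(callable(day.year), callable(day.isoformat))
```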

## DataFrame attributes

* `DataFrame`s are one example of a data type that has attributes
* three interesting ones for us are
    * `DataFrame.columns`
    * `DataFrame.shape`
    * `DataFrame.values`

In [None]:
df.columns

In [None]:
df.shape

In [None]:
df.values

## Series

The other important data type in the `pandas` package is the [`<class 'pandas.core.series.Series'>`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html), which is effectively the one-dimensional representation of a DataFrame axis - for example one row, or one column.

In [None]:
sepal_length = df['Sepal Length']

In [None]:
type(sepal_length)

In [None]:
sepal_length
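
A column is the usual way to meet a Series, but a single row is a Series too. A small sketch on made-up data (the same toy values used earlier for `df_direct_with_columns`):

```python
import pandas as pd

# toy data, made up for illustration
toy_df = pd.DataFrame(
    data=[["a", 1, 5.0], ["b", 2, 10.0]],
    columns=["letter", "integer", "float"],
)

# one column is a Series...
column = toy_df["integer"]
# ...and so is one row
row = toy_df.iloc[0]

print(type(column), type(row))
print(column.name, column.dtype, column.shape)
```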

## Series attributes

In [None]:
sepal_length.name

In [None]:
sepal_length.dtype

In [None]:
sepal_length.shape

## Indexing a DataFrame

Index notation, which we are used to from one-dimensional data structures like lists and dictionaries, is modified a bit for two-dimensional DataFrames.

When we introduced Series, we already saw the following:

In [None]:
sepal_length = df['Sepal Length']

## Anatomy of basic indexing for columns

<p><font color=blue>variable_name</font>[<font color=red>column_label</font>]</p>

In [None]:
df.iloc[1]

## Anatomy of one-dimensional iloc indexing for rows

<p><font color=blue>variable_name</font>.<font color=green>iloc</font>[<font color=red>row_index</font>]</p>

In [None]:
df.iloc[1,1]

## Anatomy of two-dimensional iloc indexing for cells

<p><font color=blue>variable_name</font>.<font color=green>iloc</font>[<font color=red>row_index</font>,<font color=red>column_index</font>]</p>

In [None]:
df.loc[1]

In [None]:
df.loc[1,'Sepal Width']

## Anatomy of one- and two-dimensional `loc` indexing 

<p><font color=blue>variable_name</font>.<font color=green>loc</font>[<font color=red>row_label</font>]</p>

<p><font color=blue>variable_name</font>.<font color=green>loc</font>[<font color=red>row_label</font>,<font color=red>column_label</font>]</p>
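
With the default row labels (0, 1, 2, ...), `iloc` and `loc` look like they do the same thing. A sketch with made-up data and deliberately odd row labels shows the difference: `iloc` counts positions, `loc` looks up labels.

```python
import pandas as pd

# toy data whose row labels do NOT match the row positions
toy_df = pd.DataFrame(
    data=[[5.1, 3.5], [4.9, 3.0], [6.9, 3.1]],
    columns=["Sepal Length", "Sepal Width"],
    index=[10, 20, 30],
)

# iloc: second row, second column, counting from 0
by_position = toy_df.iloc[1, 1]

# loc: the row labelled 20, the column labelled "Sepal Width"
by_label = toy_df.loc[20, "Sepal Width"]

# both land on the same cell
print(by_position, by_label)
```

Here `toy_df.loc[1]` would raise a `KeyError`, because no row carries the label `1`.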

## Basic querying with a dataframe

In [None]:
# you can use expressions to slice and dice using logic
print(df[df['Sepal Length'] == 6.9])

In [None]:
# how does this work? by passing a boolean Series inside the square brackets
boolean_series = df['Sepal Length'] == 6.9

In [None]:
boolean_series.head()

In [None]:
# this can get quite complex
df[(df['Sepal Length']==6.9) & (df['Species']=='versicolor')]

## Anatomy of boolean array indexing

<p><font color=blue>variable_name</font>[<font color=red>series_wise_boolean_expression</font>]</p>

Use `&` and `|` to represent `and` and `or`, respectively, and wrap each comparison in parentheses, because `&` and `|` bind more tightly than `==`
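
A small sketch of combining boolean Series, on made-up rows. Naming each condition first is one way to keep a longer query readable.

```python
import pandas as pd

# toy data, made up for illustration
toy_df = pd.DataFrame(
    data=[[6.9, "versicolor"], [6.9, "virginica"], [5.1, "setosa"]],
    columns=["Sepal Length", "Species"],
)

# each comparison produces one True/False value per row
is_long = toy_df["Sepal Length"] == 6.9
is_versicolor = toy_df["Species"] == "versicolor"

# & keeps rows where both are True; | keeps rows where at least one is True
both = toy_df[is_long & is_versicolor]
either = toy_df[is_long | (toy_df["Species"] == "setosa")]

print(len(both), len(either))
```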

## Grouping data

In [None]:
groups = df.groupby("Species")

In [None]:
for key, group in groups:
    print(key)
    print(group.head())

In [None]:
# This gives you a convenient way to apply logic based on a group filter
# For example, use the DataFrame.describe method to easily get summary statistics on each species group
for key, group in groups:
    print(key)
    print(group.describe())

In [None]:
# You can chain an aggregation onto a groupby to get groupwise stats outside of what is in `describe`
print(df.groupby("Species").sum())

In [None]:
print(df.groupby("Species").max())

In [None]:
print(df.groupby("Species").min())
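
You can also pick a single column after grouping and aggregate just that. A sketch on made-up rows; `mean()` and `size()` are two more aggregations alongside `sum()`, `max()`, and `min()`.

```python
import pandas as pd

# toy data, made up for illustration
toy_df = pd.DataFrame(
    data=[["setosa", 5.1], ["setosa", 4.9], ["virginica", 6.9]],
    columns=["Species", "Sepal Length"],
)

# group, select one column, then aggregate it
mean_length = toy_df.groupby("Species")["Sepal Length"].mean()
print(mean_length)

# size() counts the rows in each group
counts = toy_df.groupby("Species").size()
print(counts)
```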

## Get your feet wet

Choose any of the data sets I've provided in Canvas to begin practicing with these first few pandas tasks.

Try to:

1. Load the data as a pandas DataFrame.
    * HINT: Use a convenience method to pull the data into a DataFrame from a file path!
2. Describe the data in the DataFrame using the `describe()` method.
3. Select just row 5 from the DataFrame. Now how about the value from row 5, column 2? How about selecting a whole column by its label?
4. Use the `groupby()` method against a categorical column in your data.

# Tour de Python Level 2 ○○●○

* `pandas` based processing techniques for
    * dealing with duplicates
    * dealing with sparse data
    * applying custom logic
    * quick vis with just pandas

## Dealing with duplicates

In [None]:
df[df.duplicated()]

In [None]:
df[df.duplicated(keep=False)]

In [None]:
dropped_df = df.drop_duplicates()

In [None]:
dropped_df.shape
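
What the `keep` argument changes is easier to see on a tiny made-up DataFrame where you can spot the duplicate by eye:

```python
import pandas as pd

# toy data: the first and last rows are identical
toy_df = pd.DataFrame(
    data=[["a", 1], ["b", 2], ["a", 1]],
    columns=["letter", "integer"],
)

# default: flag only the repeats that come after the first occurrence
later_repeats = toy_df.duplicated().tolist()
print(later_repeats)

# keep=False: flag every row that has a twin, first occurrence included
all_twins = toy_df.duplicated(keep=False).tolist()
print(all_twins)

# drop_duplicates returns a new DataFrame; toy_df itself is unchanged
deduped = toy_df.drop_duplicates()
print(deduped.shape, toy_df.shape)
```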

## Dealing with sparse data

In [None]:
sparse_df = pd.read_csv("hepatitis.csv", na_values="?", header=None)

In [None]:
sparse_df.head()

In [None]:
sparse_df.shape

In [None]:
sparse_df.dropna().shape

In [None]:
sparse_df.fillna(1000).head()

In [None]:
sparse_df.interpolate().head()
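
The same four moves on a tiny made-up table, small enough to check by eye. As with the hepatitis data, `?` marks a missing value, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# toy data: one "?" in each column
csv_text = "1.0,?\n?,20.0\n3.0,30.0\n"
toy_df = pd.read_csv(io.StringIO(csv_text), na_values="?", header=None)

# dropna throws away every row that has at least one missing value
dropped = toy_df.dropna()

# fillna swaps each missing value for a constant of your choosing
filled = toy_df.fillna(1000)

# interpolate fills a gap from its neighbours (linear by default);
# a gap with no earlier value to start from is left as missing
interpolated = toy_df.interpolate()

print(dropped.shape)
print(filled)
print(interpolated)
```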

## Applying custom logic cellwise

> Write a program that prints the numbers from 1 to 100. But for multiples of three print “Fizz” instead of the number and for the multiples of five print “Buzz”. For numbers which are multiples of both three and five print “FizzBuzz”.

In [None]:
import numpy as np
num_df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=['A','B','C','D'])

In [None]:
num_df.head()

In [None]:
def fizz_buzz_ify(cell):
    # label multiples of 3 and/or 5; hand every other value back as a number
    cell = float(cell)
    if (cell % 3.0 == 0) and (cell % 5.0 == 0):
        return "FizzBuzz"
    elif cell % 3.0 == 0:
        return "Fizz"
    elif cell % 5.0 == 0:
        return "Buzz"
    else:
        return cell

In [None]:
# applymap runs the function on every cell (pandas 2.1+ prefers the equivalent DataFrame.map)
num_df.applymap(fizz_buzz_ify).head()

## Quick vis with just pandas

Pandas also includes some built-in visualization methods against dataframes for common plots. It is as simple as calling the `hist()` or `plot()` method on a dataframe to get a visualization.

In [None]:
%matplotlib inline

In [None]:
df.plot('Sepal Length', 'Sepal Width', kind="scatter")

In [None]:
df.hist()

In [None]:
df[df['Species'] == 'virginica'].hist(column=['Sepal Width'])

## Exercises

Using the `chipotle.tsv` file from the class folder, answer the following questions. (HINT: What convenience method works on `.tsv`s?)

1. What is the number of observations in this dataset?
    - HINT: (1) and (2) can be answered with the same DataFrame attribute!
2. What is the number of columns in the dataset?
3. What are the names of all the columns of this dataset?
4. What was the most ordered item?
   - HINT: Consider a groupby with an aggregation!
   - HINT: You will need to add up the `quantity` field across items of the same `item_name` and look at the results. There is an aggregation method called `sum()`.
5. How many times was a Veggie Salad Bowl ordered?

# Matplotlib

# Tour de Python Level 2 ○○○●

* basic `Matplotlib`
* a realistic example

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.scatter(df['Sepal Width'], df['Sepal Length'])
plt.xlabel('Sepal Width')
plt.ylabel('Sepal Length')
plt.title('Sepal Width vs Sepal Length')
plt.show()

## A more realistic example

Take a look at the file `"gdp_time_series"` in your terminal with `cat`. You'll notice it's not so well formatted...

In [None]:
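import pandas as pd
# skip the first 3 lines and split columns on runs of whitespace
# (sep=r'\s+' is the equivalent of the deprecated delim_whitespace=True)
df = pd.read_csv('gdp_time_series', skiprows=3, sep=r'\s+')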

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df.dtypes

In [None]:
plt.plot(df['YEAR'],df['AUSTRIA'])
plt.ylabel('Per Capita Annual GDP')
plt.show()