# $\Omega$ Pandas

> `pandas` is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

> `pandas` is well suited for many different kinds of data:

> * Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
> * Ordered and unordered (not necessarily fixed-frequency) time series data
> * Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
> * Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

For more tutorials, visit: https://pandas.pydata.org/pandas-docs/stable/tutorials.html

And here is a nice cheatsheet: https://elitedatascience.com/python-cheat-sheet

### Imports

Here we import pandas, plus Matplotlib for data visualization.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

### Read data

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

We load the data with pandas into a DataFrame called `titanic`.

In [None]:
# local
titanic = pd.read_csv('data/titanic.csv')
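
As a minimal sketch of what `pd.read_csv` does, the cell below parses a small in-memory CSV instead of `data/titanic.csv`. The three rows and their values are invented for illustration and are not taken from the real dataset.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the real file has more rows and columns.
csv_text = """name,age,fare,survived
Passenger A,29,211.34,1
Passenger B,,151.55,0
Passenger C,30,7.25,0
"""

# read_csv accepts a path, a URL, or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # column types inferred from the text
```

The empty `age` field is read as a missing value, which is why that column is inferred as a float rather than an integer.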

### Help

Jupyter (through IPython) lets you view the documentation of a function by appending a question mark to its name.

In [None]:
pd.read_csv?
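
The `?` suffix only works inside Jupyter or IPython. As a rough equivalent in plain Python, the sketch below reads the same documentation from the function's `__doc__` attribute; the built-in `help(pd.read_csv)` prints it in full.

```python
import pandas as pd

# __doc__ holds the docstring as a plain string;
# help(pd.read_csv) would print the whole thing.
doc = pd.read_csv.__doc__
print(doc[:200])  # first 200 characters only
```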

### Get first n rows

In [None]:
titanic.head()

### Dataframe dimensions

Get the number of rows and columns. `(rows, columns)`

In [None]:
titanic.shape

### Index

If you don't specify an index, pandas creates a default one that runs from 0 to `nrows - 1`.

In [None]:
titanic.index

Let's consult the pandas documentation to learn more about setting and resetting the index.

In [None]:
titanic.set_index?

In [None]:
titanic.reset_index?
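
To make the two methods concrete, here is a small sketch on a made-up frame (the names and fares are placeholders, not Titanic records). It shows the default index, then moves a column into the index and back.

```python
import pandas as pd

# Hypothetical frame for illustration only.
df = pd.DataFrame({'name': ['A', 'B', 'C'], 'fare': [7.25, 71.28, 8.05]})
print(df.index)  # default index: positions 0, 1, 2

by_name = df.set_index('name')    # 'name' becomes the index, not a column
print(by_name.loc['B', 'fare'])   # look up a row by its new label

restored = by_name.reset_index()  # the index goes back to a regular column
print(restored.columns.tolist())
```

Both methods return a new DataFrame by default, so `df` itself is unchanged here.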

### Columns

In [None]:
titanic.columns

### Dataframe information

Show general information about your DataFrame: the column names, their dtypes, the number of non-missing values per column, and the memory usage. pandas DataFrames are loaded entirely into memory.

In [None]:
titanic.info()

### Values

In [None]:
titanic.values

### Summary statistics for numeric columns

By default, `describe` summarizes only numeric columns. Setting `include='all'` adds the non-numeric columns too, reporting statistics such as `unique`, `top`, and `freq` for them.

In [None]:
titanic.describe(include='all')
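
The difference between the two modes is easier to see on a tiny frame. The sketch below uses invented values with one numeric and one text column.

```python
import pandas as pd

# Hypothetical data: one numeric column, one categorical column.
df = pd.DataFrame({
    'fare': [7.25, 71.28, 8.05, 53.10],
    'gender': ['male', 'female', 'male', 'female'],
})

numeric_only = df.describe()             # summarizes 'fare' only
everything = df.describe(include='all')  # summarizes 'fare' and 'gender'

print(numeric_only.columns.tolist())
print(everything.index.tolist())
```

In the `include='all'` result, statistics that do not apply to a column (for example the mean of `gender`) appear as `NaN`.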

# $\Omega$ Selecting Data

### Select column

In [None]:
titanic['name']

### Select multiple columns

In [None]:
titanic[['name', 'fare']]

### Select rows and columns by name

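Use `loc` to select data by label. The first argument selects rows by index label (a single label, a list, or a slice) and the second selects a column or a list of columns. Unlike ordinary Python slicing, a `loc` slice includes both endpoints, so `10:20` returns the rows labeled 10 through 20.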

In [None]:
titanic.loc[10:20, ['name', 'fare']]

### Select rows and columns by range of indices

`iloc` differs from `loc` in that it uses **integer positions** rather than the **names** of the indices or columns. For example, instead of selecting the 'age' column by name, we can select the column in position 4. An `iloc` slice excludes its endpoint, so `10:20` returns positions 10 through 19.

In [None]:
titanic.iloc[10:20, 0:3]
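
A short side-by-side sketch of the two selectors on a made-up six-row frame. With the default integer index the same slice `1:3` returns a different number of rows depending on which selector is used.

```python
import pandas as pd

# Hypothetical frame with the default 0..5 index.
df = pd.DataFrame({'name': list('abcdef'),
                   'fare': [1, 2, 3, 4, 5, 6],
                   'age': [10, 20, 30, 40, 50, 60]})

by_label = df.loc[1:3, ['name', 'fare']]  # labels 1, 2 and 3 (end included)
by_position = df.iloc[1:3, 0:2]           # positions 1 and 2 (end excluded)

print(len(by_label), len(by_position))
```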

### Select rows based on condition (filter)

In [None]:
titanic[titanic['fare'] > 100]
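
Filtering works by building a boolean Series and passing it back to the DataFrame. The sketch below, on invented values, prints that mask and then combines two conditions.

```python
import pandas as pd

# Hypothetical fares and ages.
df = pd.DataFrame({'fare': [7.25, 151.55, 211.34, 30.0],
                   'age': [22, 2, 29, 40]})

mask = df['fare'] > 100  # boolean Series, one entry per row
print(mask.tolist())

# Combine conditions with & (and) or | (or); each needs its own parentheses.
expensive_adults = df[(df['fare'] > 100) & (df['age'] >= 18)]
print(expensive_adults)
```

Python's `and` / `or` keywords do not work element-wise on a Series, which is why `&` and `|` are used instead.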

## $\Delta$ Exercise 1 - Toronto Subway Delay Data

https://www.toronto.ca/city-government/data-research-maps/open-data/open-data-catalogue/#917dd033-1fe5-4ba8-04ca-f683eec89761

**Deliverables:**
1. Read the data `subway.csv` and set it to the variable `subway`
2. Show the last 10 rows (hint: use `tail`)
3. Show the summary statistics
4. Select columns "Date" and "Time" for rows 100 to 200

In [None]:
########################
# Your Code Below
########################

# $\Omega$ Transforming Data

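### Calculate ticket price in today's dollars
According to the Bureau of Labor Statistics consumer price index, prices in 2018 are 2,669.00% higher than prices in 1909. The dollar experienced an average inflation rate of 3.09% per year. Note that "X% higher" corresponds to a multiplier of 1 + X/100, so a 2,669% increase is a factor of 27.69; the cells below multiply by 26.69.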

In [None]:
titanic['fare'] * 26.69

In [None]:
(titanic['fare'] * 26.69).round(2)
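
As a quick check of the arithmetic, the sketch below turns the two quoted figures (the percentage increase and the average annual rate) into price multipliers. It uses only the numbers given in the text; the variable names are illustrative.

```python
# Figures quoted in the text above.
percent_higher = 2669.0
annual_rate = 0.0309
years = 2018 - 1909

# "X% higher" means new = old * (1 + X / 100).
factor_from_percent = 1 + percent_higher / 100

# Compounding the average annual rate over the same period.
factor_from_rate = (1 + annual_rate) ** years

print(round(factor_from_percent, 2))
print(round(factor_from_rate, 2))
```

Both routes give a factor between 27 and 28. They differ slightly because the quoted annual rate is rounded.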

### Add new column

In [None]:
titanic['fare_2018'] = (titanic['fare'] * 26.69).round(2)

### Count missing values

In [None]:
titanic.isnull().sum()

### Calculate mean

In [None]:
titanic['age'].mean()

In [None]:
mean_age = titanic['age'].mean()

### Fill missing values

In [None]:
titanic['age'] = titanic['age'].fillna(mean_age).round(0).astype(int)
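
The order of operations in the cell above matters: `astype(int)` raises an error while missing values remain, so `fillna` has to come first. A minimal sketch on an invented Series:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
age = pd.Series([22.0, np.nan, 30.0, np.nan, 41.0])

mean_age = age.mean()  # NaN values are skipped: (22 + 30 + 41) / 3
filled = age.fillna(mean_age).round(0).astype(int)

print(mean_age)
print(filled.tolist())
```

Filling with the mean is a simple default. It keeps the column average unchanged but reduces its variance, so treat it as one option rather than the only one.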

### Count categorical data

In [None]:
titanic['gender'].value_counts()

### Groupby & aggregate

In [None]:
titanic.groupby('gender')['survived'].sum()

In [None]:
titanic['is_child'] = titanic['age'] < 18

In [None]:
titanic.groupby('is_child')['survived'].sum()
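
Because `survived` is coded as 0 or 1, its `sum` counts survivors while its `mean` gives the survival rate, which is easier to compare across groups of different sizes. The sketch below shows both on made-up data using `agg`.

```python
import pandas as pd

# Hypothetical records: 2 women and 4 men.
df = pd.DataFrame({
    'gender': ['female', 'female', 'male', 'male', 'male', 'male'],
    'survived': [1, 1, 1, 0, 0, 0],
})

# Several aggregations at once: survivors, group size, survival rate.
summary = df.groupby('gender')['survived'].agg(['sum', 'count', 'mean'])
print(summary)
```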

### Save new dataframe as csv

In [None]:
# titanic.to_csv('data/new_titanic.csv')

## $\Delta$ Exercise 2 - Subway Dataset

Some helpful functions:
- `pd.to_datetime`
- `.dt.year`
- `.dt.month`
- `.dt.dayofweek`
- `.dt.hour`
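
The helpers listed above can be sketched on a tiny frame. The column name `when` and its two timestamps are hypothetical and are not taken from `subway.csv`.

```python
import pandas as pd

# Hypothetical column name and values, for illustration only.
events = pd.DataFrame({'when': ['2018-01-01 08:30', '2018-06-15 17:45']})

events['when'] = pd.to_datetime(events['when'])    # strings -> datetime64
events['year'] = events['when'].dt.year
events['month'] = events['when'].dt.month
events['dayofweek'] = events['when'].dt.dayofweek  # Monday=0 ... Sunday=6
events['hour'] = events['when'].dt.hour

print(events)
```

The `.dt` accessor is only available once the column has a datetime dtype, so the `pd.to_datetime` conversion comes first.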

**Deliverables:**
1. Convert `Date` to a datetime object and replace the old one
2. Create a new column called `year`
3. Create a new column called `month`
4. Create a new column called `dayofweek`
5. Create a new column called `label` with value `0` if the `Min Delay` is less than 5 and `1` if it's greater than or equal to 5
6. Print the type of every column
7. Print new summary statistics

In [None]:
########################
# Your Code Below
########################

# $\Omega$ Data Visualization
https://pandas.pydata.org/pandas-docs/stable/visualization.html

### Bar plot

In [None]:
titanic['survived'].value_counts().plot(kind='bar')

### Pie plot

In [None]:
titanic['survived'].value_counts().plot(kind='pie')

### Set figure size

In [None]:
titanic['survived'].value_counts().plot(kind='pie', figsize=(5, 5))

### Set plot style

In [None]:
plt.style.use('ggplot')

In [None]:
titanic['survived'].value_counts().plot(kind='pie', figsize=(5, 5))

### Histogram

In [None]:
titanic['age'].plot(kind='hist')

### Set number of histogram bins

In [None]:
titanic['age'].plot(kind='hist', bins=100)

### Boxplot

In [None]:
titanic['fare'].plot(kind='box')

### Scatter plot

In [None]:
titanic[['fare', 'age']].plot(x='age', y='fare', kind='scatter')

### Transform then plot

In [None]:
titanic.groupby('gender')['survived'].sum().plot(kind='bar')

## $\Delta$ Exercise 3 - Subway Dataset

**Deliverables:**
1. Create a bar plot of `label`
2. Create a histogram of `Min Delay`
3. Create a boxplot of `Min Delay`
4. Plot the 10 most frequent `Codes`
5. Plot the 10 `Codes` that have the most delays over 5 minutes

**Bonus:**
- Explore the dataset and see if you can come up with your own interesting analysis or plots

In [None]:
########################
# Your Code Below
########################

# $\Omega$ Machine Learning

http://scikit-learn.org/stable/user_guide.html

https://docs.google.com/presentation/d/1HDCbQ7-Abh3wi0L4Dg4i8V1orovsatKX_LgeQ85t_HY/edit#slide=id.p

- `pip install scikit-plot`
- `pip install mlxtend`

In [None]:
!pip install scikit-plot
!pip install mlxtend

In [None]:
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

from mlxtend.plotting import category_scatter
from mlxtend.plotting import plot_decision_regions

import scikitplot as skplt

In [None]:
X, y = make_blobs(centers=[[1, 1], [3, 3]], random_state=1)

df = pd.DataFrame(X, columns=['feature1', 'feature2']).assign(label=y)

df.head()

In [None]:
category_scatter(x='feature1', y='feature2', label_col='label', data=df);

In [None]:
d_tree = DecisionTreeClassifier(max_depth=1)
d_tree.fit(X, y)

plot_decision_regions(X, y, clf=d_tree);

### Titanic 

In [None]:
titanic.head()

### Selecting features

In [None]:
features = ['gender', 'fare', 'age', 'is_child']

titanic[features]

### One hot encoding

In [None]:
titanic['gender'].head(20)

In [None]:
pd.get_dummies(titanic['gender'])

### Make feature dataset

In [None]:
X = pd.get_dummies(titanic[features], drop_first=True)
X.head()
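
To see what `drop_first=True` changes, the sketch below encodes a made-up frame both ways. With two categories, one indicator column already carries all the information, so the other is redundant.

```python
import pandas as pd

# Hypothetical data with one categorical and one numeric column.
df = pd.DataFrame({'gender': ['male', 'female', 'female'],
                   'fare': [7.25, 71.28, 8.05]})

full = pd.get_dummies(df)                     # one column per category
reduced = pd.get_dummies(df, drop_first=True) # first category dropped

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Numeric columns such as `fare` pass through unchanged; only object or categorical columns are expanded.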

### Make label

In [None]:
y = titanic['survived']
y.head()

### Make validation dataset

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
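
A sketch of what the split produces, on synthetic arrays. The `stratify` argument is an optional addition not used in the cell above; it keeps the class ratio the same in both parts, which can help when classes are imbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 2 features, 70/30 class balance.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 70 + [1] * 30)

# stratify=y_demo preserves the 70/30 ratio in train and test.
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo)

print(len(Xtr), len(Xte))      # sizes of the two parts
print(ytr.mean(), yte.mean())  # share of class 1 in each part
```

Fixing `random_state` makes the split reproducible from run to run.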

### Build/train model

In [None]:
model = DecisionTreeClassifier()

model.fit(X_train, y_train)

### Model accuracy

In [None]:
model.score(X_test, y_test)

### Model prediction

In [None]:
y_pred = model.predict(X_test)

### Model evaluation

In [None]:
y_pred[:10]

In [None]:
y_test[:10]

In [None]:
confusion_matrix(y_test, y_pred)
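
Reading the matrix is easier with a hand-checkable example. In scikit-learn, rows are the true classes and columns are the predicted classes. The labels below are invented.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions.
y_true = [0, 0, 1, 1, 1]
y_hat = [0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_hat)
print(cm)

# For binary labels 0/1, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)
```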

In [None]:
skplt.metrics.plot_confusion_matrix(y_test, y_pred)

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
skplt.estimators.plot_feature_importances(model, feature_names=X.columns)

## $\Delta$ Exercise 4 - Subway Dataset

**Deliverables:**
1. Create a list of features: Day, Code, Station, Bound, Line, month, year
2. One hot encode the categorical features
3. Create the feature set called `X`
4. Create the target label `y` (delays over 5 minutes)
5. Train and evaluate a model
6. Train 10 models, each on a different random train/test split, and average their results

# More Productive ML

## Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
cv_model = RandomForestClassifier(n_estimators=25)
scores = cross_val_score(cv_model, X, y, cv=10)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
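
`cross_val_score` returns one score per fold, which is what the mean and standard deviation above summarize. A self-contained sketch on synthetic blobs (not the Titanic features):

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic two-class data, similar to the blobs used earlier.
X_demo, y_demo = make_blobs(n_samples=200, centers=[[1, 1], [3, 3]],
                            random_state=1)

clf = RandomForestClassifier(n_estimators=25, random_state=0)
fold_scores = cross_val_score(clf, X_demo, y_demo, cv=10)  # accuracy per fold

print(fold_scores.shape)
print(fold_scores.mean(), fold_scores.std())
```

With `cv=10`, the model is fitted ten times, each time holding out a different tenth of the data for scoring.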

### Scoring Functions

In [None]:
from sklearn import metrics
scores = cross_val_score(cv_model, X, y, cv=10, scoring='f1_macro')
print("F1 (macro): %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

More information here:

https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

https://scikit-learn.org/stable/modules/cross_validation.html

## Random Forest: an ensemble model

In [None]:
rf_model = RandomForestClassifier(n_estimators=25)
rf_model.fit(X_train, y_train)
rf_model.score(X_test, y_test)

In [None]:
y_pred = rf_model.predict(X_test)

print(classification_report(y_test, y_pred))

More on ensemble models:

https://scikit-learn.org/stable/modules/ensemble.html

### $\Delta$ Quick Test

How does increasing `n_estimators` influence `rf_model.score()`?

In [None]:
########################
# Your Code Below
########################

## Hyperparameter Tuning

In [None]:
from time import time
import numpy as np

# Utility function to report best scores
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print()
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print()
            print("Parameters: {0}".format(results['params'][candidate]))
            print("------------------------------------------------------------")

In [None]:
X.head()

### Randomized Search

In [None]:
from scipy.stats import randint as sp_randint
from sklearn.model_selection import RandomizedSearchCV
rs_model = RandomForestClassifier(n_estimators=25)

# specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 5),
              "min_samples_split": sp_randint(2, 5),
              "min_samples_leaf": sp_randint(1, 5),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# run randomized search
n_iter_search = 20
rand_search = RandomizedSearchCV(rs_model, param_distributions=param_dist,
                                   n_iter=n_iter_search)

start = time()
rand_search.fit(X, y)

print("RandSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(rand_search.cv_results_['params'])))
print("-----------------------------------------------------------------------------")
print()
report(rand_search.cv_results_)
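
Besides `cv_results_`, a fitted search exposes the winning configuration directly. The sketch below is a reduced version of the search above on synthetic data, with fewer parameters, trees, and iterations so that it runs quickly.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the real features.
X_demo, y_demo = make_classification(n_samples=200, n_features=4,
                                     random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=0),
    param_distributions={"max_depth": [3, None],
                         "min_samples_leaf": randint(1, 5)},  # samples 1..4
    n_iter=5, cv=3, random_state=0)
search.fit(X_demo, y_demo)

print(search.best_params_)           # best sampled combination
print(round(search.best_score_, 3))  # its mean cross-validated score
```

Note that `randint(low, high)` excludes `high`, so `randint(1, 5)` samples integers from 1 to 4.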

### Grid Search

In [None]:
from sklearn.model_selection import GridSearchCV
gs_model = RandomForestClassifier(n_estimators=25)

# use a full grid over all parameters
param_grid = {"max_depth": [3, None],
              "max_features": [1, 4],
              "min_samples_split": [2, 5, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# run grid search
grid_search = GridSearchCV(gs_model, param_grid=param_grid, cv=5)
start = time()
grid_search.fit(X, y)

print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(grid_search.cv_results_['params'])))
print("-----------------------------------------------------------------------------")
print()
report(grid_search.cv_results_)
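
Grid search tries every combination, so its cost grows multiplicatively with each added parameter. The sketch below counts the candidates for the grid above using `ParameterGrid`, without fitting anything.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above.
param_grid = {"max_depth": [3, None],
              "max_features": [1, 4],
              "min_samples_split": [2, 5, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

n_candidates = len(ParameterGrid(param_grid))  # 2 * 2 * 3 * 2 * 2
n_fits = n_candidates * 5                      # one fit per candidate per fold (cv=5)

print(n_candidates, n_fits)
```

By default `GridSearchCV` also refits the best candidate once on the full data, so the actual number of fits is one more than `n_fits`.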

## $\Delta$ Exercise 5 - Subway Dataset

**Deliverable:**
Create an ensemble model, using a pipeline and grid search, that beats your last decision tree.

In [None]:
########################
# Your Code Below
########################