# Project Layout Suggestion

- The Cookiecutter Data Science package suggests a project layout that makes an analysis easy to reproduce and its code easy to share

## Imports

- This example is based mostly on:
    - pandas: a library for data munging and analysis
    - scikit-learn: a library with predictive modeling tools
    - Yellowbrick: a visualization library for evaluating models

In [137]:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import (ensemble, preprocessing, tree)
from sklearn.metrics import (auc, confusion_matrix, roc_auc_score, roc_curve)
from sklearn.model_selection import (train_test_split, StratifiedKFold)
from yellowbrick.classifier import (ConfusionMatrix, ROCAUC)
from yellowbrick.model_selection import (LearningCurve)

## Ask a Question

In this example, we want to create a predictive model to answer a question: does an individual survive the Titanic disaster, given individual and trip characteristics?

Our model should be able to take passenger information and predict whether that passenger would survive on the Titanic.

This is a **classification** question, as we are predicting a label for survival: either they survived or they died.

## Terms for Data

y = f(X)

y is a vector that contains labels (for classification) or values (for regression)

X is a matrix. Each row represents a sample of data or information about an individual. Every column in X is a feature
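
As a minimal sketch of this notation, the toy DataFrame below (invented values, not the Titanic data) is split into a feature matrix X and a label vector y. The column names are only illustrative.

```python
import pandas as pd

# Toy data with invented values, only to illustrate the X / y split.
toy = pd.DataFrame(
    {
        "age": [22.0, 38.0, 26.0],
        "fare": [7.25, 71.28, 7.92],
        "survived": [0, 1, 1],
    }
)

y = toy["survived"]               # vector of labels, one per sample
X = toy.drop(columns="survived")  # matrix: rows are samples, columns are features

print(X.shape)  # (3, 2)
print(y.shape)  # (3,)
```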

## Gather Data

In [138]:
url = ("titanic3.xls")

In [139]:
df = pd.read_excel(url)
orig_df = df

## Clean Data

Once we have the data, we need to ensure that it is in a format we can use for building our model.

- Most scikit-learn models require that our features be numeric (float or integer)
- Many models fail if the features contain missing values (NaN in pandas and NumPy)
- Some models perform better if the features are standardized (given a mean of 0 and a standard deviation of 1)
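
The first two requirements can be checked before modeling. The sketch below uses a small invented DataFrame (not the Titanic data) to list the columns that are not numeric and the columns that contain missing values.

```python
import numpy as np
import pandas as pd

# Invented rows, only to illustrate the checks.
toy = pd.DataFrame(
    {
        "age": [22.0, np.nan, 26.0],
        "sex": ["male", "female", "female"],
        "fare": [7.25, 71.28, 7.92],
    }
)

# Columns that would need converting before most scikit-learn models accept them.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

# Columns that contain at least one missing value.
missing = toy.isnull().sum()
cols_with_missing = missing[missing > 0].index.tolist()

print(non_numeric)        # ['sex']
print(cols_with_missing)  # ['age']
```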

The Titanic dataset has leaky features.

### Leaky features
Leaky features are variables that contain information about the future or the target.

There is nothing wrong with having data about the target, and we often have it at model creation time. However, if those variables are not available when we predict on a new sample, we should remove them from the model, as they leak data from the future.
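
A minimal sketch of removing leaky columns, on invented rows. It assumes that boat (lifeboat) and body (body identification number) are the leaky columns, since both are recorded only after the outcome is known; the `leaky_cols` name is hypothetical.

```python
import pandas as pd

# Invented rows; "boat" and "body" stand in for columns recorded after the outcome.
toy = pd.DataFrame(
    {
        "age": [29.0, 30.0, 25.0],
        "boat": ["2", None, None],
        "body": [None, 135.0, None],
        "survived": [1, 0, 0],
    }
)

# Assumption: these columns would not be available for a new passenger.
leaky_cols = ["boat", "body"]

# Drop the leaky columns (and the target) before building features.
features = toy.drop(columns=leaky_cols + ["survived"])

print(list(features.columns))  # ['age']
```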

### Subject Matter Expert Required
Cleaning data can take a bit of time. It helps to have access to a subject matter expert (SME) who can provide guidance on dealing with outliers or missing data.


In [140]:
df.dtypes

pclass         int64
survived       int64
name          object
sex           object
age          float64
sibsp          int64
parch          int64
ticket        object
fare         float64
cabin         object
embarked      object
boat          object
body         float64
home.dest     object
dtype: object

- When reading data, pandas will try to coerce data into the appropriate types
- We still need to look through the data to confirm the inferred types

- integer types are fine
- float types might have some missing values
- date and string types will need to be converted or used to engineer numeric features
- string types that have low cardinality are called categorical columns
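
One way to spot categorical candidates is to count the distinct values in each non-numeric column. The sketch below uses invented rows, and the `max_levels` threshold is a hypothetical choice rather than a fixed rule.

```python
import pandas as pd

# Invented rows, only to illustrate the cardinality check.
toy = pd.DataFrame(
    {
        "name": ["Allen", "Allison", "Anderson", "Andrews"],
        "sex": ["female", "male", "male", "female"],
        "embarked": ["S", "S", "C", "S"],
    }
)

max_levels = 3  # hypothetical cutoff for "low cardinality"

# Number of distinct values in each non-numeric column.
cardinality = toy.select_dtypes(exclude="number").nunique()

# Columns with few distinct values are candidates for categorical encoding.
categorical_cols = cardinality[cardinality <= max_levels].index.tolist()

print(categorical_cols)  # ['sex', 'embarked']
```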

## Pandas Profiling Library

In [141]:
import pandas_profiling
pandas_profiling.ProfileReport(df)




In [142]:
df.shape

By default, the .describe method reports only on numeric columns.

In [None]:
df.describe().iloc[:,:2]
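
To see the difference, the sketch below calls .describe on an invented two-column DataFrame, once with the defaults and once with `include="all"`, which also summarizes the string column.

```python
import pandas as pd

# Invented rows with one numeric and one string column.
toy = pd.DataFrame(
    {
        "age": [22.0, 38.0, 26.0],
        "sex": ["male", "female", "female"],
    }
)

numeric_summary = toy.describe()           # default: numeric columns only
full_summary = toy.describe(include="all")  # adds unique/top/freq for string columns

print(list(numeric_summary.columns))  # ['age']
print(list(full_summary.columns))     # ['age', 'sex']
```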

## Finding missing values

- Use the .isnull method to find columns or rows with missing values

In [None]:
df.isnull().sum(axis=0)  # axis=0 (the default) sums down the rows, giving one count per column

To count the missing values for each sample, apply the sum along axis 1 (across the columns).

In [None]:
df.isnull().sum(axis=1).loc[:10]
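
The two axes are easy to confuse, so here is a small sketch on invented rows. It also shows the percentage of missing values per column, which is often easier to read than raw counts.

```python
import numpy as np
import pandas as pd

# Invented rows with a few gaps.
toy = pd.DataFrame(
    {
        "age": [22.0, np.nan, np.nan],
        "cabin": ["C85", None, "E46"],
        "fare": [7.25, 71.28, 7.92],
    }
)

per_column = toy.isnull().sum(axis=0)  # one count per column
per_row = toy.isnull().sum(axis=1)     # one count per sample
pct_missing = toy.isnull().mean() * 100  # percent missing per column

print(per_column.to_dict())  # {'age': 2, 'cabin': 1, 'fare': 0}
print(per_row.tolist())      # [0, 2, 1]
```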

In [None]:
df.sex.value_counts()

In [None]:
df.embarked.value_counts()

In [None]:
df.embarked.value_counts(dropna=False)

### Create Features

In [156]:
name = df.name
name.head(3)

0     Allen, Miss. Elisabeth Walton
1    Allison, Master. Hudson Trevor
2      Allison, Miss. Helen Loraine
Name: name, dtype: object

In [157]:
df = df.drop(columns=['name', 'ticket', 'home.dest', 'boat', 'body', 'cabin'])

In [158]:
df.columns

Index(['pclass', 'survived', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked'],
      dtype='object')

In [159]:
df = pd.get_dummies(df)

In [160]:
df.columns

Index(['pclass', 'survived', 'age', 'sibsp', 'parch', 'fare', 'sex_female',
       'sex_male', 'embarked_C', 'embarked_Q', 'embarked_S'],
      dtype='object')

In [161]:
y = df.survived
X = df.drop(columns='survived')

In [162]:
y.describe()

count    1309.000000
mean        0.381971
std         0.486055
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max         1.000000
Name: survived, dtype: float64

In [146]:
X.describe()


         pclass        age     sibsp     parch       fare        body
count    1309.0     1046.0    1309.0    1309.0     1308.0       121.0
mean   2.294882  29.881135  0.498854  0.385027  33.295479  160.809917
std    0.837836    14.4135  1.041658   0.86556  51.758668   97.696922
min         1.0     0.1667       0.0       0.0        0.0         1.0
25%         2.0       21.0       0.0       0.0     7.8958        72.0
50%         3.0       28.0       0.0       0.0    14.4542       155.0
75%         3.0       39.0       1.0       0.0     31.275       256.0
max         3.0       80.0       8.0       9.0   512.3292       328.0


In [65]:
import janitor as jn
X, y = jn.get_features_targets(df, target_columns='survived')


### Sample Data

- We always want to train and test on different data
- Otherwise, we don't really know how well the model generalizes to data it hasn't seen before
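
A minimal sketch of this idea on invented data: after `train_test_split`, no sample appears in both sets. The 0.3 test fraction and the seed mirror the values used in this notebook, and the toy names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented data: ten samples with one feature and a binary label.
X_toy = pd.DataFrame({"x": range(10)})
y_toy = pd.Series([0, 1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)

# The two sets share no rows, so the test score reflects unseen data.
overlap = set(X_tr.index) & set(X_te.index)

print(len(X_tr), len(X_te))  # 7 3
print(overlap)               # set()
```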

In [175]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [176]:
X_train.shape

(916, 10)

In [177]:
X_test.shape

(393, 10)

In [178]:
y_train.shape

(916,)

In [179]:
y_test.shape

(393,)

### Impute Data

- The age column has missing values

- We need to impute age from the other numeric values

- We fit the imputer on the training set only, and then use that fitted imputer to fill in the test set. Otherwise, we are leaking data (cheating by giving future information to the model).

- Now that we have the train and test data, we can impute missing values on the training set and use the trained imputer to fill in the test set.
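
The fit-on-train, transform-on-test pattern can be sketched with a simpler imputer. This example swaps in scikit-learn's `SimpleImputer` with a median strategy instead of an iterative imputer, and the data are invented, so that the learned value is easy to verify by hand.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented data: one gap in train, one gap in test.
train = pd.DataFrame({"age": [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({"age": [np.nan, 90.0]})

imputer = SimpleImputer(strategy="median")

# The median (30.0) is learned from the training set only.
train_filled = imputer.fit_transform(train)

# The test gap is filled with the training median; the test value 90.0 has no influence.
test_filled = imputer.transform(test)

print(imputer.statistics_)  # [30.]
print(test_filled[0, 0])    # 30.0
```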

In [180]:
from sklearn.experimental import enable_iterative_imputer
from sklearn import impute

In [181]:
num_cols = ['pclass','age', 'sibsp', 'parch', 'fare', 'sex_male']

In [182]:
X_train.columns

Index(['pclass', 'age', 'sibsp', 'parch', 'fare', 'sex_female', 'sex_male',
       'embarked_C', 'embarked_Q', 'embarked_S'],
      dtype='object')

In [183]:
imputer = impute.IterativeImputer()
imputed = imputer.fit_transform(X_train[num_cols])
X_train.loc[:, num_cols] = imputed
# we also need to impute the data in the test data
imputed = imputer.transform(X_test[num_cols])
X_test.loc[:, num_cols] = imputed

### Imputing with median

In [172]:
#meds = X_train.median()
#X_train = X_train.fillna(meds)
#X_test = X_test.fillna(meds)

### Normalize Data

- Normalizing or preprocessing the data helps many models perform better.
- This is particularly true for models that depend on a distance metric to determine similarity. (Note that tree models, which treat each feature on its own, don't have this requirement.)

- We are going to standardize the data for the preprocessing.

- Standardizing is translating the data so that it has a mean value of zero and a standard deviation of one.

- This way, models don't treat variables with larger scales as more important than variables with smaller scales.
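
Standardizing a column amounts to z = (x - mean) / std. The sketch below, on invented values, checks that `StandardScaler` matches this formula computed by hand (using the population standard deviation, which is NumPy's default).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented single-feature data.
fares = np.array([[10.0], [20.0], [30.0], [40.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(fares)

# The same transformation written out: subtract the mean, divide by the std (ddof=0).
manual = (fares - fares.mean(axis=0)) / fares.std(axis=0)

print(np.allclose(scaled, manual))  # True
print(np.isclose(scaled.mean(), 0.0), np.isclose(scaled.std(), 1.0))  # True True
```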

In [186]:
cols = X_train.columns  # the scaler transforms all 10 columns, so keep all 10 names
sca = preprocessing.StandardScaler()
# fit the scaler on the training set only, then reuse it on the test set
X_train = pd.DataFrame(sca.fit_transform(X_train), columns=cols, index=X_train.index)
X_test = pd.DataFrame(sca.transform(X_test), columns=cols, index=X_test.index)