# Prepare Data

Plan - Acquire - **Prepare** - Explore - Model - Deliver

## What we are doing and why:

**What:** Clean and tidy our data so that it is ready for exploration, analysis and modeling

**Why:** Set ourselves up for certainty! 

1. Ensure that our observations will be sound: validity of statistical and human observations.
2. Ensure that we will not have computational errors: non-numerical data cells, nulls/NaNs.
3. Protect against overfitting: ensure that we have split our data before drawing conclusions.

## High level Roadmap:

**Input:** An acquired dataset (One Pandas Dataframe)

**Output:** Cleaned data split into Train, Validate, and Test sets (Three Pandas Dataframes)

**Processes:** Inspect and summarize the data ---> Clean the data ---> Split the data

## Inspect and Summarize

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# train test split from sklearn
from sklearn.model_selection import train_test_split
# imputer from sklearn
from sklearn.impute import SimpleImputer

# filter out warnings
import warnings
warnings.filterwarnings('ignore')

# our own acquire script:
import acquire


In [18]:
from acquire import get_titanic_data

In [19]:
df = get_titanic_data()
df.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,,Southampton,1


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   passenger_id  891 non-null    int64  
 1   survived      891 non-null    int64  
 2   pclass        891 non-null    int64  
 3   sex           891 non-null    object 
 4   age           714 non-null    float64
 5   sibsp         891 non-null    int64  
 6   parch         891 non-null    int64  
 7   fare          891 non-null    float64
 8   embarked      889 non-null    object 
 9   class         891 non-null    object 
 10  deck          203 non-null    object 
 11  embark_town   889 non-null    object 
 12  alone         891 non-null    int64  
dtypes: float64(2), int64(6), object(5)
memory usage: 97.5+ KB


In [5]:
# survived is the target variable, so it will not be a feature of our model
# passenger_id is irrelevant
# pclass and class are the same - keep one
# embarked and embark_town are the same data - keep one


In [6]:
pd.crosstab(df['class'], df.pclass)

pclass,1,2,3
class,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
First,216,0,0
Second,0,184,0
Third,0,0,491


In [7]:
pd.crosstab(df.embarked, df.embark_town)

embark_town,Cherbourg,Queenstown,Southampton
embarked,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
C,168,0,0
Q,0,77,0
S,0,0,644


In [8]:
df.describe()

Unnamed: 0,passenger_id,survived,pclass,age,sibsp,parch,fare,alone
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0,891.0
mean,445.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208,0.602694
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429,0.489615
min,0.0,0.0,1.0,0.42,0.0,0.0,0.0,0.0
25%,222.5,0.0,2.0,20.125,0.0,0.0,7.9104,0.0
50%,445.0,0.0,3.0,28.0,0.0,0.0,14.4542,1.0
75%,667.5,1.0,3.0,38.0,1.0,0.0,31.0,1.0
max,890.0,1.0,3.0,80.0,8.0,6.0,512.3292,1.0


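#### Gather our takeaways, i.e., what we are going to do when we clean:

 - Drop duplicate rows.
 - Drop `passenger_id` (irrelevant), and drop `embarked` and `class`, which duplicate `embark_town` and `pclass`.
 - Drop `deck`: only 203 of its 891 values are non-null.
 - Encode `sex` and `embark_town` as 0/1 dummy columns.
 - `age` (714 non-null) and `embark_town` (889 non-null) still have missing values that we will need to handle.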

## Clean

In [9]:
df.drop_duplicates(inplace = True)

In [10]:
columns_to_drop = ['embarked','class', 'passenger_id', 'deck']

In [11]:
data = df.drop(columns = columns_to_drop)
data

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embark_town,alone
0,0,3,male,22.0,1,0,7.2500,Southampton,0
1,1,1,female,38.0,1,0,71.2833,Cherbourg,0
2,1,3,female,26.0,0,0,7.9250,Southampton,1
3,1,1,female,35.0,1,0,53.1000,Southampton,0
4,0,3,male,35.0,0,0,8.0500,Southampton,1
...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,Southampton,1
887,1,1,female,19.0,0,0,30.0000,Southampton,1
888,0,3,female,,1,2,23.4500,Southampton,0
889,1,1,male,26.0,0,0,30.0000,Cherbourg,1


#### Encoding: Turning Categorical Values into Boolean Values (0,1)
 - We have two options: simple encoding or one-hot encoding
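To make the two options concrete, here is a minimal sketch on a tiny hand-made frame. The `toy` dataframe, its values, and the `is_male` column name are made up for illustration and are not taken from the Titanic data. Simple encoding maps each category to an integer in a single column, while one-hot encoding creates one 0/1 column per category; `drop_first=True` drops the first category because it is implied by the others.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    'sex': ['male', 'female', 'female'],
    'embark_town': ['Southampton', 'Cherbourg', 'Queenstown'],
})

# Simple encoding: a single column, with each category mapped to an integer
toy['is_male'] = toy['sex'].map({'female': 0, 'male': 1})

# One-hot encoding: one 0/1 column per category, minus the first (drop_first=True)
one_hot = pd.get_dummies(toy[['embark_town']], drop_first=True, dtype=int)

encoded = pd.concat([toy, one_hot], axis=1)
print(encoded)
```

Simple encoding is usually enough for a two-category column such as `sex`; one-hot encoding avoids implying an order among three or more categories.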

In [12]:
dummy_df = pd.get_dummies(data[['sex', 'embark_town']], dummy_na=False, drop_first=True)

In [13]:
dummy_df

Unnamed: 0,sex_male,embark_town_Queenstown,embark_town_Southampton
0,1,0,1
1,0,0,0
2,0,0,1
3,0,0,1
4,1,0,1
...,...,...,...
886,1,0,1
887,0,0,1
888,0,0,1
889,1,0,0


In [14]:
data = pd.concat([data, dummy_df], axis=1)
data.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embark_town,alone,sex_male,embark_town_Queenstown,embark_town_Southampton
0,0,3,male,22.0,1,0,7.25,Southampton,0,1,0,1
1,1,1,female,38.0,1,0,71.2833,Cherbourg,0,0,0,0
2,1,3,female,26.0,0,0,7.925,Southampton,1,0,0,1
3,1,1,female,35.0,1,0,53.1,Southampton,0,0,0,1
4,0,3,male,35.0,0,0,8.05,Southampton,1,1,0,1


## Putting our Work Into a Function

In [15]:
def clean_titanic_data(df):
    '''
    Takes in titanic df and returns a clean df.
    Arguments: df - a pandas dataframe with the expected
    column names
    Return: clean_df - a dataframe with the cleaning operations performed on it
    '''
    df.drop_duplicates(inplace = True)
    columns_to_drop = ['embarked','class', 'passenger_id', 'deck']
    df = df.drop(columns = columns_to_drop)
    dummy_df = pd.get_dummies(df[['sex', 'embark_town']], dummy_na=False, drop_first=True)
    # concatenate onto the local df, not the global `data` from the earlier cells
    df = pd.concat([df, dummy_df], axis=1)
    return df.drop(columns=['sex', 'embark_town'])

In [16]:
df = acquire.get_titanic_data()
clean_df = clean_titanic_data(df)
clean_df

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,alone,sex_male,embark_town_Queenstown,embark_town_Southampton
0,0,3,22.0,1,0,7.2500,0,1,0,1
1,1,1,38.0,1,0,71.2833,0,0,0,0
2,1,3,26.0,0,0,7.9250,1,0,0,1
3,1,1,35.0,1,0,53.1000,0,0,0,1
4,0,3,35.0,0,0,8.0500,1,1,0,1
...,...,...,...,...,...,...,...,...,...,...
886,0,2,27.0,0,0,13.0000,1,1,0,1
887,1,1,19.0,0,0,30.0000,1,0,0,1
888,0,3,,1,2,23.4500,0,0,0,1
889,1,1,26.0,0,0,30.0000,1,1,0,0


In [17]:
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   survived                 891 non-null    int64  
 1   pclass                   891 non-null    int64  
 2   age                      714 non-null    float64
 3   sibsp                    891 non-null    int64  
 4   parch                    891 non-null    int64  
 5   fare                     891 non-null    float64
 6   alone                    891 non-null    int64  
 7   sex_male                 891 non-null    uint8  
 8   embark_town_Queenstown   891 non-null    uint8  
 9   embark_town_Southampton  891 non-null    uint8  
dtypes: float64(2), int64(5), uint8(3)
memory usage: 58.3 KB


## Train, Validate, Test Split
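
A minimal sketch of a three-way split, assuming we call `train_test_split` twice: once to carve off test, then again to divide the remainder into train and validate. The helper name `split_data`, the proportions (20% for test, then 30% of the remainder for validate), the seed, and stratifying on the target are illustrative choices rather than requirements.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_data(df, target, seed=123):
    '''
    Hypothetical helper: split df into train, validate, and test dataframes,
    stratifying on the target column so class proportions stay similar.
    '''
    # first split: hold out 20% of the rows as test
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target])
    # second split: 30% of what remains becomes validate
    train, validate = train_test_split(
        train_validate, test_size=0.3, random_state=seed,
        stratify=train_validate[target])
    return train, validate, test

# Toy frame standing in for the cleaned Titanic data (values are made up)
toy = pd.DataFrame({'survived': [0, 1] * 50, 'fare': range(100)})
train, validate, test = split_data(toy, target='survived')
print(train.shape, validate.shape, test.shape)
```

Splitting before exploring or imputing keeps information from validate and test out of any decisions made on train.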

## Option for Missing Values: Impute

We can impute values using the mean, median, mode (most frequent), or a constant value. We will use `sklearn.impute.SimpleImputer` to do this.

1. Create the imputer object, selecting the strategy used to impute (mean, median, or mode via `strategy='most_frequent'`).
2. Fit to train. This means computing the mean, median, or most frequent value (i.e. the mode) for each of the columns that will be imputed, and storing that value in the imputer object.
3. Transform train: fill missing values in the train dataset with the stored value.
4. Transform test: fill missing values with the same value that was learned from train.

In more detail:

1. Create the `SimpleImputer` object, which we will store in the variable `imputer`. When creating the object, we specify the strategy to use (`mean`, `median`, `most_frequent`). Essentially, this creates the instructions and assigns them to a variable we will reference.

2. `Fit` the imputer to the columns in the training df. This means that the imputer will determine the `most_frequent` value, or another value depending on the `strategy` chosen, for each column.

3. The imputer stores that value to use when we call `transform`. We will call `transform` on each of our samples to fill any missing values.
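
The steps above can be sketched as follows. The tiny `train` and `test` frames and their values are made up for illustration; on the Titanic data the same calls would apply to a column such as `age` once the data has been split.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy samples with missing ages (values are made up)
train = pd.DataFrame({'age': [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({'age': [np.nan, 50.0]})

# 1. Create the imputer object and choose the strategy
imputer = SimpleImputer(strategy='mean')

# 2. Fit on train only: the mean of the observed train ages is stored on the imputer
imputer.fit(train[['age']])

# 3. Transform each sample using the value learned from train
train[['age']] = imputer.transform(train[['age']])
test[['age']] = imputer.transform(test[['age']])

print(imputer.statistics_)  # the stored fill value for each column
print(train)
print(test)
```

Fitting on train alone means the fill value for test comes from train, so nothing about test leaks into the preparation step.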

Next, create a function that runs through all of these steps given a train and a test dataframe, a strategy, and a list of columns.
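
One possible shape for such a function is sketched below. The name `impute`, its argument order, and the choice to return copies instead of modifying the inputs are assumptions made for illustration; the toy values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

def impute(train, test, strategy, columns):
    '''
    Hypothetical helper: fit a SimpleImputer on the given train columns and
    use it to fill missing values in both train and test.
    Returns new dataframes and leaves the inputs untouched.
    '''
    train, test = train.copy(), test.copy()
    imputer = SimpleImputer(strategy=strategy)
    # fit on train only, then reuse the learned values for test
    train[columns] = imputer.fit_transform(train[columns])
    test[columns] = imputer.transform(test[columns])
    return train, test

# Toy usage (values are made up)
train = pd.DataFrame({'age': [10.0, 20.0, np.nan, 60.0],
                      'fare': [5.0, np.nan, 7.0, 9.0]})
test = pd.DataFrame({'age': [np.nan], 'fare': [np.nan]})
train_imputed, test_imputed = impute(train, test, 'median', ['age', 'fare'])
print(train_imputed)
print(test_imputed)
```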

Blend the clean, split, and impute functions into a single `prep_data()` function.

## Exercises

The end product of this exercise should be the specified functions in a Python script named `prepare.py`.
Do these in your `classification_exercises.ipynb` first, then transfer them to the `prepare.py` file.

This work should all be saved in your local `classification-exercises` repo. Then add, commit, and push your changes.

Using the Iris Data:  

1. Use the function defined in `acquire.py` to load the iris data.  

1. Drop the `species_id` and `measurement_id` columns.  

1. Rename the `species_name` column to just `species`.  

1. Create dummy variables of the species name. 

1. Create a function named `prep_iris` that accepts the untransformed iris data, and returns the data with the transformations above applied.  