# Prepare Data

Plan - Acquire - **Prepare** - Explore - Model - Deliver

## What we are doing and why:

**What:** Clean and tidy our data so that it is ready for exploration, analysis and modeling

**Why:** Set ourselves up for certainty! 

    1) Ensure that our observations will be sound:
        Validity of statistical and human observations
    2) Ensure that we will not have computational errors:
        Non-numeric data cells, nulls/NaNs
    3) Protect against overfitting:
        Ensure that we have split our data before drawing conclusions

## High level Roadmap:

**Input:** An acquired dataset (One Pandas Dataframe) 

**Output:** Cleaned data split into Train, Validate, and Test sets (Three Pandas Dataframes)

**Processes:** Inspect and summarize the data ---> Clean the data ---> Split the data

## Inspect and Summarize

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# train test split from sklearn
from sklearn.model_selection import train_test_split
# imputer from sklearn
from sklearn.impute import SimpleImputer

# filter out warnings
import warnings
warnings.filterwarnings('ignore')

# our own acquire script:
import acquire 

| Variable | Description | Details |
|---|---|---|
| passenger_id | Index | Unique |
| survived | Survived the crisis | 0 = No; 1 = Yes |
| pclass | Passenger Class | 1 = 1st; 2 = 2nd; 3 = 3rd |
| sex | Sex | "male", "female" |
| age | Age | |
| sibsp | Number of Siblings/Spouses Aboard | |
| parch | Number of Parents/Children Aboard | |
| fare | Passenger Fare | |
| embarked | Port of Embarkation | C = Cherbourg; Q = Queenstown; S = Southampton |
| deck | Location of cabin | |
| embark_town | Port of Embarkation | |
| alone | Registered as a solo traveler | |


In [3]:
# Importing our data
df = acquire.get_titanic_data()

In [4]:
# Take a look at the data
df.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,,Southampton,1


### Takeaways
- **Continuous Variables**
    - `age`, `fare`
    
- **Categorical Variables**
    - `survived`, `pclass`, `sex`, `sibsp`, `parch`, `embarked`, `class`, `deck`, `embark_town`, `alone`

**Notes**:
- `passenger_id` is effectively an index and provides no predictive quality
- `survived` is our target variable
- `embarked` and `embark_town` seem to carry identical information (the stored values differ, but the information is the same...what's the difference?)
- `pclass` and `class` also seem to be identical
- Redundant columns will need to be removed

In [5]:
# Looking at relationship between embarked and embark_town
pd.crosstab(df.embarked, df.embark_town)

| embarked (rows) / embark_town (columns) | Cherbourg | Queenstown | Southampton |
|---|---|---|---|
| C | 168 | 0 | 0 |
| Q | 0 | 77 | 0 |
| S | 0 | 0 | 644 |


`embarked` and `embark_town` contain identical information

In [None]:
# Detailed look at the relationship between class and pclass
pd.crosstab(df['class'], df.pclass) 

`class` and `pclass` contain identical information

In [None]:
pd.crosstab(df['sibsp'], df['alone'])

>71 passengers had no siblings or spouses aboard, but were not marked as being alone. Perhaps they are children? We could look at `parch` for this. Something to explore later...
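
One way to follow up on that question is to compare `alone` against the total number of relatives aboard. The sketch below uses a small hypothetical dataframe (`toy`) with the same column names, not the real Titanic data, and the `family_size` column is an illustrative addition.

```python
import pandas as pd

# Hypothetical toy rows using the same column names as the Titanic data
toy = pd.DataFrame({
    "sibsp": [0, 0, 1, 0, 2],
    "parch": [0, 2, 0, 1, 0],
    "alone": [1, 0, 0, 0, 0],
})

# Total relatives aboard: siblings/spouses plus parents/children
toy["family_size"] = toy["sibsp"] + toy["parch"]

# Rows with no siblings/spouses that are still not flagged as alone
not_alone_no_sibsp = toy[(toy["sibsp"] == 0) & (toy["alone"] == 0)]
print(not_alone_no_sibsp)

# Does `alone` agree with "no relatives aboard" on every row?
alone_matches = (toy["alone"] == (toy["family_size"] == 0).astype(int)).all()
print(alone_matches)
```

Running the same two checks on `df` would show whether `parch` explains the passengers in question.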

#### `df.info()` will give us a quick view of the datatypes (Dtype) and the nulls in each column

In [None]:
df.info()

**Takeaways**
- There is a substantial number of nulls in `deck`
- There are 2 nulls in `embarked`
- There are 177 nulls in `age` (714 of 891 values are non-null)
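
If reading null counts off `df.info()` gets tedious, they can be computed directly. This is a minimal sketch on a hypothetical dataframe (`toy`); the `null_summary` name is only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing values
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.93, 53.10],
})

# Count of nulls and share of nulls for each column
null_summary = pd.DataFrame({
    "null_count": toy.isnull().sum(),
    "null_pct": toy.isnull().mean().round(2),
})
print(null_summary.sort_values("null_count", ascending=False))
```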

In [None]:
# Review summary statistics of numeric columns
df.describe()

Let's dig deeper into each of the fields
- For categorical columns, we can look at `value_counts()`
- For numeric columns, we can look at histograms

In [6]:
# Creating a list of our numeric columns
numcols = [col for col in df.columns if df[col].dtype != 'O']

In [7]:
numcols

['passenger_id',
 'survived',
 'pclass',
 'age',
 'sibsp',
 'parch',
 'fare',
 'alone']

In [8]:
# Creating a list of our categorical columns
catcols = [col for col in df.columns if df[col].dtype == 'O']

In [9]:
catcols

['sex', 'embarked', 'class', 'deck', 'embark_town']
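
As an alternative to the list comprehensions above, pandas can select columns by dtype for us. The sketch below assumes a small hypothetical dataframe (`toy`) rather than the Titanic data.

```python
import pandas as pd

# Hypothetical toy data mixing numeric and text columns
toy = pd.DataFrame({
    "survived": [0, 1, 1],
    "age": [22.0, 38.0, 26.0],
    "sex": ["male", "female", "female"],
    "embark_town": ["Southampton", "Cherbourg", "Southampton"],
})

# select_dtypes filters columns by data type instead of comparing dtype strings
numcols = toy.select_dtypes(include="number").columns.tolist()
catcols = toy.select_dtypes(exclude="number").columns.tolist()

print(numcols)
print(catcols)
```

Keep in mind that "numeric dtype" does not mean "continuous variable": a column like `survived` is stored as an integer but is categorical in meaning.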

In [None]:
# Describe the object columns
for col in catcols:
    print(f"Column: {col}")
    print(df[col].value_counts())
    print("--------")
    print(df[col].value_counts(normalize=True, dropna=False))
    print("=================")

In [None]:
# Histograms of numeric columns
for col in numcols:
    print(col)
    df[col].hist()
    plt.show()

## IMPORTANT NOTE: Visualizations created through a loop should only be part of your personal exploration. Do not include this much noise in a report or presentation!!!

### Next Steps:
1. Removal
- Remove `embarked`
- Remove `pclass`
- Remove `passenger_id`
- Remove `deck`
    - Has too many nulls
    - Would require an extensive imputation process
        - Build this out after an MVP is achieved
        
2. Imputing Nulls
- Lots of missing information in `age`
    - Going to have to impute nulls
- Two nulls in `embark_town`
    - Going to have to impute these nulls (maybe just use mode)
    
3. Encoding categorical variables
- Create dummy variables for `sex`, `class`, and `embark_town`

## Clean

Drop duplicates

In [10]:
df.shape

(891, 13)

In [11]:
df = df.drop_duplicates()

In [12]:
df.shape # No duplicates after all

(891, 13)

Drop redundant columns (and `deck` because it has too many nulls)

In [13]:
columns_to_drop = ['embarked', 'pclass', 'passenger_id', 'deck']

In [14]:
df = df.drop(columns = columns_to_drop) 

In [15]:
df.head()

Unnamed: 0,survived,sex,age,sibsp,parch,fare,class,embark_town,alone
0,0,male,22.0,1,0,7.25,Third,Southampton,0
1,1,female,38.0,1,0,71.2833,First,Cherbourg,0
2,1,female,26.0,0,0,7.925,Third,Southampton,1
3,1,female,35.0,1,0,53.1,First,Southampton,0
4,0,male,35.0,0,0,8.05,Third,Southampton,1


#### Encoding: Turning Categorical Values into Boolean Values (0,1)
 - We have two options: simple encoding or one-hot encoding
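
To make the two options concrete, here is a minimal sketch on a hypothetical dataframe (`toy`). The `is_male` column name and the explicit `dtype=int` are choices made for this illustration, not part of the lesson's code.

```python
import pandas as pd

# Hypothetical toy data with one two-valued and one three-valued column
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "class": ["Third", "First", "Second", "Third"],
})

# Simple encoding: a two-valued column maps directly onto a single 0/1 column
toy["is_male"] = toy["sex"].map({"female": 0, "male": 1})

# One-hot encoding: one 0/1 column per category; drop_first=True drops the
# first category in sorted order, which becomes the implied baseline
class_dummies = pd.get_dummies(toy["class"], prefix="class", drop_first=True, dtype=int)
encoded = pd.concat([toy, class_dummies], axis=1)
print(encoded)
```

Simple encoding is enough when a column has exactly two values; one-hot encoding is the safer choice for three or more unordered categories.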

In [16]:
# Encoding steps
# 1. Make a dataframe out of "dummy" columns
# 2. Concatenate our dummy dataframe to our original dataframe

# drop_first expects a single boolean, applied to every encoded column
dummy_df = pd.get_dummies(df[['sex', 'class', 'embark_town']], dummy_na=False, drop_first=True)

In [17]:
dummy_df

Unnamed: 0,sex_male,class_Second,class_Third,embark_town_Queenstown,embark_town_Southampton
0,1,0,1,0,1
1,0,0,0,0,0
2,0,0,1,0,1
3,0,0,0,0,1
4,1,0,1,0,1
...,...,...,...,...,...
886,1,1,0,0,1
887,0,0,0,0,1
888,0,0,1,0,1
889,1,0,0,0,0


In [18]:
# Concatenate my dummy_df to my data

df = pd.concat([df, dummy_df], axis=1)
df

Unnamed: 0,survived,sex,age,sibsp,parch,fare,class,embark_town,alone,sex_male,class_Second,class_Third,embark_town_Queenstown,embark_town_Southampton
0,0,male,22.0,1,0,7.2500,Third,Southampton,0,1,0,1,0,1
1,1,female,38.0,1,0,71.2833,First,Cherbourg,0,0,0,0,0,0
2,1,female,26.0,0,0,7.9250,Third,Southampton,1,0,0,1,0,1
3,1,female,35.0,1,0,53.1000,First,Southampton,0,0,0,0,0,1
4,0,male,35.0,0,0,8.0500,Third,Southampton,1,1,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,male,27.0,0,0,13.0000,Second,Southampton,1,1,1,0,0,1
887,1,female,19.0,0,0,30.0000,First,Southampton,1,0,0,0,0,1
888,0,female,,1,2,23.4500,Third,Southampton,0,0,0,1,0,1
889,1,male,26.0,0,0,30.0000,First,Cherbourg,1,1,0,0,0,0


## Putting our Work Into a Function

In [19]:
def clean_titanic_data(df):
    '''
    Takes in a titanic dataframe and returns a cleaned dataframe
    Arguments: df - a pandas dataframe with the expected feature names and columns
    Return: clean_df - a dataframe with the cleaning operations performed on it
    '''
    # Drop duplicates (reassign rather than mutate the caller's dataframe)
    df = df.drop_duplicates()
    # Drop redundant columns, plus deck because it has too many nulls
    columns_to_drop = ['embarked', 'pclass', 'passenger_id', 'deck']
    df = df.drop(columns = columns_to_drop)
    # Encode categorical variables as dummy columns
    dummy_df = pd.get_dummies(df[['sex', 'class', 'embark_town']], dummy_na=False, drop_first=True)
    df = pd.concat([df, dummy_df], axis=1)
    return df

In [20]:
df = acquire.get_titanic_data()
clean_df = clean_titanic_data(df)
clean_df

Unnamed: 0,survived,sex,age,sibsp,parch,fare,class,embark_town,alone,sex_male,class_Second,class_Third,embark_town_Queenstown,embark_town_Southampton
0,0,male,22.0,1,0,7.2500,Third,Southampton,0,1,0,1,0,1
1,1,female,38.0,1,0,71.2833,First,Cherbourg,0,0,0,0,0,0
2,1,female,26.0,0,0,7.9250,Third,Southampton,1,0,0,1,0,1
3,1,female,35.0,1,0,53.1000,First,Southampton,0,0,0,0,0,1
4,0,male,35.0,0,0,8.0500,Third,Southampton,1,1,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,male,27.0,0,0,13.0000,Second,Southampton,1,1,1,0,0,1
887,1,female,19.0,0,0,30.0000,First,Southampton,1,0,0,0,0,1
888,0,female,,1,2,23.4500,Third,Southampton,0,0,0,1,0,1
889,1,male,26.0,0,0,30.0000,First,Cherbourg,1,1,0,0,0,0


In [21]:
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   survived                 891 non-null    int64  
 1   sex                      891 non-null    object 
 2   age                      714 non-null    float64
 3   sibsp                    891 non-null    int64  
 4   parch                    891 non-null    int64  
 5   fare                     891 non-null    float64
 6   class                    891 non-null    object 
 7   embark_town              889 non-null    object 
 8   alone                    891 non-null    int64  
 9   sex_male                 891 non-null    uint8  
 10  class_Second             891 non-null    uint8  
 11  class_Third              891 non-null    uint8  
 12  embark_town_Queenstown   891 non-null    uint8  
 13  embark_town_Southampton  891 non-null    uint8  
dtypes: float64(2), int64(4), o

We still have two columns with nulls:
1. `age`
2. `embark_town`

As a general practice, wait until after the Train, Validate, Test Split before filling nulls.

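### WHY?

Imputation learns its fill values (a mean, median, or mode) from the data. If those values are computed before splitting, information from the validate and test rows leaks into train, and our evaluation no longer reflects truly unseen data.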

> Note: There can be cases where it is okay to fill nulls before splitting. We will talk about those cases after we get through creating the Train, Validate, Test split.
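
The sketch below illustrates the concern with a hypothetical ten-row dataframe (`toy`): the mean computed from every row generally differs from the mean computed from train alone, and only the latter avoids using information from the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with two missing ages
toy = pd.DataFrame({"age": [20.0, 30.0, np.nan, 40.0, 50.0, np.nan, 60.0, 70.0, 80.0, 90.0]})

train, test = train_test_split(toy, train_size=0.8, random_state=1234)

full_mean = toy["age"].mean()     # uses every row, including future test rows
train_mean = train["age"].mean()  # uses only rows the model is allowed to see

print(f"mean from all rows:   {full_mean:.2f}")
print(f"mean from train only: {train_mean:.2f}")

# Fill both sets with the value learned from train only
train_filled = train.assign(age=train["age"].fillna(train_mean))
test_filled = test.assign(age=test["age"].fillna(train_mean))
```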

## Train, Validate, Test Split

In [22]:
train, test = train_test_split(clean_df,
                               train_size = 0.8,
                               stratify = clean_df.survived,
                               random_state=1234)

In [23]:
train.shape

(712, 14)

In [24]:
test.shape

(179, 14)

In [25]:
train, validate = train_test_split(train,
                                  train_size = 0.7,
                                  stratify = train.survived,
                                  random_state=1234)

In [26]:
train.shape

(498, 14)

In [27]:
validate.shape

(214, 14)

In [28]:
test.shape

(179, 14)
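
Splitting 80/20 and then 70/30 leaves roughly 56% of the rows in train, 24% in validate, and 20% in test. The sketch below reproduces the two-step split on a hypothetical 891-row dataframe (`toy`, random values, not the Titanic data) to check the resulting sizes and to confirm that stratifying keeps the target rate similar in each set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for clean_df: 891 rows with a binary target
rng = np.random.default_rng(1234)
toy = pd.DataFrame({
    "survived": rng.integers(0, 2, size=891),
    "fare": rng.uniform(5, 100, size=891),
})

# First split: 80% train+validate, 20% test
train_validate, test = train_test_split(
    toy, train_size=0.8, stratify=toy.survived, random_state=1234)
# Second split: 70% of the remainder to train, 30% to validate
train, validate = train_test_split(
    train_validate, train_size=0.7, stratify=train_validate.survived, random_state=1234)

print(train.shape, validate.shape, test.shape)

# Share of the original rows and target rate in each set
for name, subset in [("train", train), ("validate", validate), ("test", test)]:
    print(name, round(len(subset) / len(toy), 2), round(subset.survived.mean(), 2))
```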

In [29]:
train.head()

Unnamed: 0,survived,sex,age,sibsp,parch,fare,class,embark_town,alone,sex_male,class_Second,class_Third,embark_town_Queenstown,embark_town_Southampton
301,1,male,,2,0,23.25,Third,Queenstown,0,1,0,1,1,0
290,1,female,26.0,0,0,78.85,First,Southampton,1,0,0,0,0,1
779,1,female,43.0,0,1,211.3375,First,Southampton,0,0,0,0,0,1
356,1,female,22.0,0,1,55.0,First,Southampton,0,0,0,0,0,1
147,0,female,9.0,2,2,34.375,Third,Southampton,0,0,0,1,0,1


In [30]:
validate.head()

Unnamed: 0,survived,sex,age,sibsp,parch,fare,class,embark_town,alone,sex_male,class_Second,class_Third,embark_town_Queenstown,embark_town_Southampton
91,0,male,20.0,0,0,7.8542,Third,Southampton,1,1,0,1,0,1
297,0,female,2.0,1,2,151.55,First,Southampton,0,0,0,0,0,1
101,0,male,,0,0,7.8958,Third,Southampton,1,1,0,1,0,1
705,0,male,39.0,0,0,26.0,Second,Southampton,1,1,1,0,0,1
335,0,male,,0,0,7.8958,Third,Southampton,1,1,0,1,0,1


In [31]:
test.head()

Unnamed: 0,survived,sex,age,sibsp,parch,fare,class,embark_town,alone,sex_male,class_Second,class_Third,embark_town_Queenstown,embark_town_Southampton
92,0,male,46.0,1,0,61.175,First,Southampton,0,1,0,0,0,1
552,0,male,,0,0,7.8292,Third,Queenstown,1,1,0,1,1,0
810,0,male,26.0,0,0,7.8875,Third,Southampton,1,1,0,1,0,1
29,0,male,,0,0,7.8958,Third,Southampton,1,1,0,1,0,1
681,1,male,27.0,0,0,76.7292,First,Cherbourg,1,1,0,0,0,0


## Option for Missing Values: Impute

We can impute values using the mean, median, mode (most frequent), or a constant value. We will use `sklearn.impute.SimpleImputer` to do this.  

1. Create the imputer object, selecting the strategy used to impute: mean, median, or mode (`strategy = 'most_frequent'`). 
2. Fit to train. This means compute the mean, median, or most_frequent (i.e. mode) for each of the columns that will be imputed. Store that value in the imputer object. 
3. Transform train: fill missing values in the train dataset with the stored value
4. Transform validate and test: fill missing values with that same stored value

1. Create the `SimpleImputer` object, which we will store in the variable `imputer`. In the creation of the object, we will specify the strategy to use (`mean`, `median`, `most_frequent`). Essentially, this is creating the instructions and assigning them to a variable we will reference.  

In [32]:
imputer = SimpleImputer(strategy='mean', missing_values=np.nan)

In [33]:
type(imputer)

sklearn.impute._base.SimpleImputer

2. `Fit` the imputer to the columns in the training df.  This means that the imputer will compute the value dictated by the `strategy` (here, the mean) for each column.   

In [None]:
imputer = imputer.fit(train[['age']])

3. It will store that value in the imputer object to use upon calling `transform.` We will call `transform` on each of our samples to fill any missing values.  

In [None]:
train[['age']] = imputer.transform(train[['age']])

In [None]:
train.info()

In [None]:
validate[['age']] = imputer.transform(validate[['age']])

In [None]:
test[['age']] = imputer.transform(test[['age']])

Create a function that will run through all of these steps when I provide train, validate, and test dataframes. 

In [None]:
def impute_age(train, validate, test):
    '''
    Imputes the mean age of train to all three datasets
    '''
    imputer = SimpleImputer(strategy='mean', missing_values=np.nan)
    imputer = imputer.fit(train[['age']])
    train[['age']] = imputer.transform(train[['age']])
    validate[['age']] = imputer.transform(validate[['age']])
    test[['age']] = imputer.transform(test[['age']])
    return train, validate, test
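
The same pattern can be generalized to any list of columns and any strategy. The sketch below is one possible shape for such a helper; `impute_columns` and the toy dataframes are hypothetical names introduced for illustration, and the function works on copies so the inputs are left unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def impute_columns(train, validate, test, columns, strategy="mean"):
    """Hypothetical helper: fit a SimpleImputer on train, then fill all three sets."""
    # Work on copies so the caller's dataframes are left untouched
    train, validate, test = train.copy(), validate.copy(), test.copy()
    imputer = SimpleImputer(strategy=strategy, missing_values=np.nan)
    imputer.fit(train[columns])  # learn fill values from train only
    train[columns] = imputer.transform(train[columns])
    validate[columns] = imputer.transform(validate[columns])
    test[columns] = imputer.transform(test[columns])
    return train, validate, test


# Toy data with the same column names as the lesson
toy_train = pd.DataFrame({"age": [20.0, 30.0, np.nan], "fare": [10.0, np.nan, 30.0]})
toy_validate = pd.DataFrame({"age": [np.nan], "fare": [5.0]})
toy_test = pd.DataFrame({"age": [40.0], "fare": [np.nan]})

tr, va, te = impute_columns(toy_train, toy_validate, toy_test,
                            ["age", "fare"], strategy="median")
print(va)
print(te)
```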

Blend the clean, split and impute functions into a single `prep_titanic_data()` function. 

In [None]:
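def prep_titanic_data(df): 
    '''
    Cleans the titanic data, splits it into train, validate, and test,
    and imputes age using the mean of train
    '''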
    df = clean_titanic_data(df)
    train, test = train_test_split(df,
                               train_size = 0.8,
                               stratify = df.survived,
                               random_state=1234)
    train, validate = train_test_split(train,
                                  train_size = 0.7,
                                  stratify = train.survived,
                                  random_state=1234)
    train, validate, test = impute_age(train, validate, test)
    return train, validate, test

In [None]:
df = acquire.get_titanic_data()
train, validate, test = prep_titanic_data(df)
train.head()

In [None]:
train.info()

**How should we impute `embark_town`?**
- `SimpleImputer()`
- `.fillna()`
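
Both options can be sketched on hypothetical toy data (`toy_train`, `toy_test`). In either case the fill value is learned from train only. The columns are created with `dtype=object` here as an assumption, to mirror how the lesson's text columns are stored.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: object columns with missing towns
toy_train = pd.DataFrame(
    {"embark_town": ["Southampton", "Cherbourg", "Southampton", np.nan]}, dtype=object)
toy_test = pd.DataFrame({"embark_town": [np.nan, "Queenstown"]}, dtype=object)

# Option 1: SimpleImputer with the most frequent value learned from train
imputer = SimpleImputer(strategy="most_frequent", missing_values=np.nan)
imputer.fit(toy_train[["embark_town"]])
test_option_1 = toy_test.copy()
test_option_1[["embark_town"]] = imputer.transform(toy_test[["embark_town"]])

# Option 2: .fillna() with the mode computed from train
train_mode = toy_train["embark_town"].mode()[0]
test_option_2 = toy_test.assign(embark_town=toy_test["embark_town"].fillna(train_mode))

print(test_option_1)
print(test_option_2)
```

With only two nulls, `.fillna()` is the lighter option; `SimpleImputer` pays off when the same fitted object has to be reused across several columns or datasets.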

## Exercises

The end product of this exercise should be the specified functions in a Python script named `prepare.py`.
Do these in your `classification_exercises.ipynb` first, then transfer to the `prepare.py` file. 

This work should all be saved in your local `classification-exercises` repo. Then add, commit, and push your changes.

**Using the Iris Dataset:**  

1. Use the function defined in `acquire.py` to load the iris data.  

1. Drop the `species_id` and `measurement_id` columns.  

1. Rename the `species_name` column to just `species`.  

1. Create dummy variables of the species name. 

1. Create a function named `prep_iris` that accepts the untransformed iris data, and returns the data with the transformations above applied.  

**Using the Titanic Dataset:**

1. Use the function defined in `acquire.py` to load the Titanic data.

1. Drop any unnecessary, unhelpful, or duplicated columns.

1. Encode the categorical columns. Create dummy variables of the categorical columns and concatenate them onto the dataframe.

1. Create a function named `prep_titanic` that accepts the raw titanic data, and returns the data with the transformations above applied.

**Using the Telco Dataset:**

1. Use the function defined in `acquire.py` to load the Telco data.

1. Drop any unnecessary, unhelpful, or duplicated columns. This could mean dropping foreign key columns but keeping the corresponding string values, for example.

1. Encode the categorical columns. Create dummy variables of the categorical columns and concatenate them onto the dataframe.

1. Create a function named `prep_telco` that accepts the raw telco data, and returns the data with the transformations above applied.

**Split your data**

1. Write a function to split your data into `train`, `validate`, and `test` datasets. Add this function to `prepare.py`.

1. Run the function in your notebook on the Iris dataset, returning 3 datasets: `train_iris`, `validate_iris`, and `test_iris`.

1. Run the function on the Titanic dataset, returning 3 datasets: `train_titanic`, `validate_titanic`, and `test_titanic`.

1. Run the function on the Telco dataset, returning 3 datasets: `train_telco`, `validate_telco`, and `test_telco`.