# Prepare
Plan --- Acquire --- **Prepare** --- Explore --- Model --- Deliver

- **very** critical
- cannot explore or model without preparing first
- clean and make legible
- 3rd step in the pipeline (after Plan and Acquire)
- must do preliminary exploration so you know what your data looks like before you clean it

to *not* **overfit**, split the data into:
   - train
   - validate
   - test
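
A minimal sketch of that three-way split, assuming scikit-learn's `train_test_split` applied twice. The `split_data` helper, the 20% / 30% proportions, the seed, and the toy frame are illustrative assumptions, not values fixed by this lesson.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data(df, target, seed=123):
    """Hypothetical helper: split df into train, validate, and test frames."""
    # First cut: hold out 20% of the rows as the test set.
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target]
    )
    # Second cut: carve a validate set out of what remains.
    train, validate = train_test_split(
        train_validate,
        test_size=0.3,
        random_state=seed,
        stratify=train_validate[target],
    )
    return train, validate, test


# Toy stand-in for the real data (assumption: a binary target column).
toy = pd.DataFrame({"survived": [0, 1] * 50, "fare": list(range(100))})
train, validate, test = split_data(toy, "survived")
print(train.shape, validate.shape, test.shape)
```

Stratifying on the target keeps the class balance roughly the same in all three sets; the model is fit on train, tuned on validate, and scored once on test.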

## Process of Preparing
- Summarize data
- Clean the data
- Split the data

### Summarize:

##### Step 1: imports

In [2]:
#imports
import numpy as np #for vectorized operations
import pandas as pd #for dataframe manipulation of tabular data
import matplotlib.pyplot as plt #for visualization

#new import
from sklearn.model_selection import train_test_split #train, test, split
from sklearn.impute import SimpleImputer # impute

#import warnings
#warnings.filterwarnings('ignore')
#^this turns OFF the warnings

import acquire #the functions that we created

##### Step 2: grab data

In [3]:
df = acquire.get_titanic_data()
type(df)

pandas.core.frame.DataFrame

##### Step 3: get to know your data

In [4]:
df.shape

(891, 13)

In [5]:
df.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,,Southampton,1


**takeaways from df.head():**
- 'survived' is our target -- it is not a potential feature
- duplicates:
    - 'passenger_id' is the same as the index in this case
    - 'pclass' and 'class' are the same, just different dtypes
    - 'embarked' and 'embark_town' are the same
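
One way to confirm that two columns really carry the same information, rather than eyeballing the first five rows, is to check that their values map one-to-one. This is a sketch on a toy frame; `is_one_to_one` is a hypothetical helper name, not part of pandas or the `acquire` module.

```python
import pandas as pd


def is_one_to_one(df, col_a, col_b):
    """Return True if every value in col_a pairs with exactly one value in col_b and vice versa."""
    # Keep each distinct (col_a, col_b) pair once, ignoring rows with nulls.
    pairs = df[[col_a, col_b]].dropna().drop_duplicates()
    return pairs[col_a].is_unique and pairs[col_b].is_unique


# Toy rows shaped like the Titanic columns (assumption: real data behaves the same way).
toy = pd.DataFrame({
    "pclass": [3, 1, 3, 1, 2],
    "class": ["Third", "First", "Third", "First", "Second"],
    "sex": ["male", "female", "female", "female", "male"],
})

print(is_one_to_one(toy, "pclass", "class"))  # True
print(is_one_to_one(toy, "pclass", "sex"))    # False
```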

In [11]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   passenger_id  891 non-null    int64  
 1   survived      891 non-null    int64  
 2   pclass        891 non-null    int64  
 3   sex           891 non-null    object 
 4   age           714 non-null    float64
 5   sibsp         891 non-null    int64  
 6   parch         891 non-null    int64  
 7   fare          891 non-null    float64
 8   embarked      889 non-null    object 
 9   class         891 non-null    object 
 10  deck          203 non-null    object 
 11  embark_town   889 non-null    object 
 12  alone         891 non-null    int64  
dtypes: float64(2), int64(6), object(5)
memory usage: 90.6+ KB


**takeaways from .info()**
- this can help you find the nulls
    - (ex): age, embarked, embark_town, and deck all have nulls (missing info)
    - deck: only 203 non-null data points out of 891 - cannot use
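
The non-null counts are easier to compare as fractions. A small sketch, using a toy frame in place of the Titanic data, that ranks columns by their share of missing values:

```python
import numpy as np
import pandas as pd

# Toy stand-in (assumption: four rows are enough to show the idea).
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.93, 53.10],
})

# isnull() gives True/False per cell; the mean of booleans is the fraction that are True.
pct_missing = toy.isnull().mean().sort_values(ascending=False)
print(pct_missing)
```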

In [9]:
df.describe()

Unnamed: 0,passenger_id,survived,pclass,age,sibsp,parch,fare,alone
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0,891.0
mean,445.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208,0.602694
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429,0.489615
min,0.0,0.0,1.0,0.42,0.0,0.0,0.0,0.0
25%,222.5,0.0,2.0,20.125,0.0,0.0,7.9104,0.0
50%,445.0,0.0,3.0,28.0,0.0,0.0,14.4542,1.0
75%,667.5,1.0,3.0,38.0,1.0,0.0,31.0,1.0
max,890.0,1.0,3.0,80.0,8.0,6.0,512.3292,1.0


In [6]:
df.columns.to_list()

['passenger_id',
 'survived',
 'pclass',
 'sex',
 'age',
 'sibsp',
 'parch',
 'fare',
 'embarked',
 'class',
 'deck',
 'embark_town',
 'alone']

In [10]:
df.dtypes

passenger_id      int64
survived          int64
pclass            int64
sex              object
age             float64
sibsp             int64
parch             int64
fare            float64
embarked         object
class            object
deck             object
embark_town      object
alone             int64
dtype: object

In [18]:
#find columns that are NOT objects
num_cols = df.select_dtypes(exclude=['object'])
num_cols

Unnamed: 0,passenger_id,survived,pclass,age,sibsp,parch,fare,alone
0,0,0,3,22.0,1,0,7.2500,0
1,1,1,1,38.0,1,0,71.2833,0
2,2,1,3,26.0,0,0,7.9250,1
3,3,1,1,35.0,1,0,53.1000,0
4,4,0,3,35.0,0,0,8.0500,1
...,...,...,...,...,...,...,...,...
886,886,0,2,27.0,0,0,13.0000,1
887,887,1,1,19.0,0,0,30.0000,1
888,888,0,3,,1,2,23.4500,0
889,889,1,1,26.0,0,0,30.0000,1


In [19]:
#get descriptive stats on all columns that are NOT objects
num_cols.describe()

Unnamed: 0,passenger_id,survived,pclass,age,sibsp,parch,fare,alone
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0,891.0
mean,445.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208,0.602694
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429,0.489615
min,0.0,0.0,1.0,0.42,0.0,0.0,0.0,0.0
25%,222.5,0.0,2.0,20.125,0.0,0.0,7.9104,0.0
50%,445.0,0.0,3.0,28.0,0.0,0.0,14.4542,1.0
75%,667.5,1.0,3.0,38.0,1.0,0.0,31.0,1.0
max,890.0,1.0,3.0,80.0,8.0,6.0,512.3292,1.0


In [22]:
obj_cols = df.select_dtypes(include=['object'])
obj_cols

Unnamed: 0,sex,embarked,class,deck,embark_town
0,male,S,Third,,Southampton
1,female,C,First,C,Cherbourg
2,female,S,Third,,Southampton
3,female,S,First,C,Southampton
4,male,S,Third,,Southampton
...,...,...,...,...,...
886,male,S,Second,,Southampton
887,female,S,First,B,Southampton
888,female,S,Third,,Southampton
889,male,C,First,C,Cherbourg


In [24]:
for col in obj_cols:
    print(df[col].value_counts())
    print('--------')
    print(df[col].value_counts(normalize=True))

male      577
female    314
Name: sex, dtype: int64
--------
male      0.647587
female    0.352413
Name: sex, dtype: float64
S    644
C    168
Q     77
Name: embarked, dtype: int64
--------
S    0.724409
C    0.188976
Q    0.086614
Name: embarked, dtype: float64
Third     491
First     216
Second    184
Name: class, dtype: int64
--------
Third     0.551066
First     0.242424
Second    0.206510
Name: class, dtype: float64
C    59
B    47
D    33
E    32
A    15
F    13
G     4
Name: deck, dtype: int64
--------
C    0.290640
B    0.231527
D    0.162562
E    0.157635
A    0.073892
F    0.064039
G    0.019704
Name: deck, dtype: float64
Southampton    644
Cherbourg      168
Queenstown      77
Name: embark_town, dtype: int64
--------
Southampton    0.724409
Cherbourg      0.188976
Queenstown     0.086614
Name: embark_town, dtype: float64


##### Step 4: Missing (null) Values

In [25]:
#this will show you the count of null (missing) values for each column
missing = df.isnull().sum()
missing

passenger_id      0
survived          0
pclass            0
sex               0
age             177
sibsp             0
parch             0
fare              0
embarked          2
class             0
deck            688
embark_town       2
alone             0
dtype: int64

In [26]:
#shows ONLY the columns that have missing data
missing = df.isnull().sum()
missing[missing > 0]

age            177
embarked         2
deck           688
embark_town      2
dtype: int64

- this shows that 'deck' has far too many missing values to use in your model
- 'embarked' and 'embark_town' are the same
    - the value counts above show that the most common embark town is Southampton (72%), so we can fill those 2 missing data points with Southampton
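
Rather than typing the most common town by hand, it can be looked up with `Series.mode()`. A sketch on toy data (the values are illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "embark_town": ["Southampton", "Cherbourg", np.nan,
                    "Southampton", "Queenstown", np.nan],
})

# mode() ignores nulls by default and returns a Series; take its first entry.
most_common = toy["embark_town"].mode()[0]
toy["embark_town"] = toy["embark_town"].fillna(most_common)

print(most_common)
print(toy["embark_town"].isnull().sum())
```

Looking the value up also avoids spelling mistakes in the fill value, which would otherwise create a brand-new category.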

#### Takeaways after getting to know your data:
1. embarked == embark_town - choose one, drop the other
2. class == pclass - choose one, drop the other
3. deck (688 missing) and age (177 missing) have too many missing data points - drop them
4. embark_town has 2 missing data points - fill in with Southampton
5. embark_town and sex will be encoded values
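
The takeaways above can be sketched as a single function. `prep_titanic` is a hypothetical name, and the toy frame only mimics the relevant columns; treat this as an outline of the steps, not a finished implementation.

```python
import numpy as np
import pandas as pd


def prep_titanic(df):
    """Hypothetical helper applying the five takeaways in order."""
    df = df.drop_duplicates()
    # Takeaways 1-3: drop the redundant and mostly-empty columns.
    df = df.drop(columns=["deck", "age", "embarked", "class"])
    # Takeaway 4: fill the missing towns with the most common one.
    df = df.assign(embark_town=df["embark_town"].fillna("Southampton"))
    # Takeaway 5: encode the remaining text columns as 0/1 indicators.
    dummies = pd.get_dummies(
        df[["sex", "embark_town"]], drop_first=True, dtype=int
    )
    return pd.concat([df, dummies], axis=1)


# Toy rows with the same column names as the Titanic frame (values are made up).
toy = pd.DataFrame({
    "passenger_id": [0, 1, 2, 3],
    "survived": [0, 1, 1, 0],
    "pclass": [3, 1, 3, 2],
    "sex": ["male", "female", "female", "male"],
    "age": [22.0, 38.0, np.nan, 35.0],
    "embarked": ["S", "C", np.nan, "Q"],
    "class": ["Third", "First", "Third", "Second"],
    "deck": [np.nan, "C", np.nan, np.nan],
    "embark_town": ["Southampton", "Cherbourg", np.nan, "Queenstown"],
})

clean = prep_titanic(toy)
print(clean.columns.to_list())
```

Because the function returns a new frame and never modifies its input, it can be re-run safely from a fresh `acquire` call.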

### Step 5: Start Cleaning

In [28]:
#this drops duplicates
df.drop_duplicates(inplace=True)

In [29]:
#look at the shape to see if any duplicate rows were dropped
#if the shape is the same as before... there were no duplicates
df.shape

(891, 13)
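
Comparing shapes works, but the duplicates can also be counted directly before dropping anything. A sketch on a toy frame with one repeated row:

```python
import pandas as pd

toy = pd.DataFrame({
    "passenger_id": [0, 1, 1, 2],
    "fare": [7.25, 71.28, 71.28, 7.93],
})

# duplicated() marks every row that repeats an earlier row exactly.
n_dupes = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()

print(n_dupes)        # 1
print(deduped.shape)  # (3, 2)
```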

In [33]:
#remove the columns you don't need (from your takeaways)
columns_to_drop = ['deck', 'age', 'embarked', 'class']

In [39]:
#reassign variable 'df' with the columns dropped
#(run once: re-running raises a KeyError because the columns are already gone)
df = df.drop(columns=columns_to_drop)

In [35]:
df

Unnamed: 0,passenger_id,survived,pclass,sex,sibsp,parch,fare,embark_town,alone
0,0,0,3,male,1,0,7.2500,Southampton,0
1,1,1,1,female,1,0,71.2833,Cherbourg,0
2,2,1,3,female,0,0,7.9250,Southampton,1
3,3,1,1,female,1,0,53.1000,Southampton,0
4,4,0,3,male,0,0,8.0500,Southampton,1
...,...,...,...,...,...,...,...,...,...
886,886,0,2,male,0,0,13.0000,Southampton,1
887,887,1,1,female,0,0,30.0000,Southampton,1
888,888,0,3,female,1,2,23.4500,Southampton,0
889,889,1,1,male,0,0,30.0000,Cherbourg,1


In [40]:
#fill in null values
df['embark_town'] = df.embark_town.fillna(value='Southampton')
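
The `SimpleImputer` imported in Step 1 can do the same fill without hard-coding the value. A sketch on toy data, assuming the `most_frequent` strategy is what we want here:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({
    "embark_town": ["Southampton", "Cherbourg", np.nan, "Southampton"],
})

# most_frequent learns the mode of each column during fit.
imputer = SimpleImputer(strategy="most_frequent")

# fit_transform returns a 2-D array, so flatten it back into one column.
toy["embark_town"] = imputer.fit_transform(toy[["embark_town"]]).ravel()

print(toy["embark_town"].to_list())
```

In a full workflow the imputer would usually be fit on the train split only and then applied to validate and test, so that nothing about those sets leaks into the fill value.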

In [41]:
#this will show you the new information after the drops and the fill
#there are no nulls now
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   passenger_id  891 non-null    int64  
 1   survived      891 non-null    int64  
 2   pclass        891 non-null    int64  
 3   sex           891 non-null    object 
 4   sibsp         891 non-null    int64  
 5   parch         891 non-null    int64  
 6   fare          891 non-null    float64
 7   embark_town   891 non-null    object 
 8   alone         891 non-null    int64  
dtypes: float64(1), int64(6), object(2)
memory usage: 69.6+ KB


In [42]:
dummy_df = pd.get_dummies(df[['sex', 'embark_town']], dummy_na=False, drop_first=True)
#drop_first takes a single boolean; True drops the first category of each encoded column (here 'female' and 'Cherbourg')
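
To see what `drop_first=True` does, here is a sketch on three toy rows. Passing `dtype=int` is an assumption made for readability: it forces 0/1 output, since newer pandas versions return booleans by default.

```python
import pandas as pd

toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "embark_town": ["Southampton", "Cherbourg", "Queenstown"],
})

# Categories are sorted, so 'female' and 'Cherbourg' are the ones dropped.
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)

print(dummies.columns.to_list())
# ['sex_male', 'embark_town_Queenstown', 'embark_town_Southampton']
```

A row with zeros in every `embark_town_*` column is therefore a Cherbourg passenger; the dropped category is implied rather than lost.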

In [44]:
#these are 0/1 indicator columns (numeric, not object dtype)
dummy_df

Unnamed: 0,sex_male,embark_town_Queenstown,embark_town_Southampton,embark_town_Southhampton
0,1,0,1,0
1,0,0,0,0
2,0,0,1,0
3,0,0,1,0
4,1,0,1,0
...,...,...,...,...
886,1,0,1,0
887,0,0,1,0
888,0,0,1,0
889,1,0,0,0


In [46]:
#axis mnemonic: r0ws -> axis=0, co1s -> axis=1

#join the cleaned dataframe and dummy_df side by side (column-wise)
df = pd.concat([df, dummy_df], axis=1)
df

Unnamed: 0,passenger_id,survived,pclass,sex,sibsp,parch,fare,embark_town,alone,sex_male,embark_town_Queenstown,embark_town_Southampton,embark_town_Southhampton,sex_male.1,embark_town_Queenstown.1,embark_town_Southampton.1,embark_town_Southhampton.1
0,0,0,3,male,1,0,7.2500,Southampton,0,1,0,1,0,1,0,1,0
1,1,1,1,female,1,0,71.2833,Cherbourg,0,0,0,0,0,0,0,0,0
2,2,1,3,female,0,0,7.9250,Southampton,1,0,0,1,0,0,0,1,0
3,3,1,1,female,1,0,53.1000,Southampton,0,0,0,1,0,0,0,1,0
4,4,0,3,male,0,0,8.0500,Southampton,1,1,0,1,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,886,0,2,male,0,0,13.0000,Southampton,1,1,0,1,0,1,0,1,0
887,887,1,1,female,0,0,30.0000,Southampton,1,0,0,1,0,0,0,1,0
888,888,0,3,female,1,2,23.4500,Southampton,0,0,0,1,0,0,0,1,0
889,889,1,1,male,0,0,30.0000,Cherbourg,1,1,0,0,0,1,0,0,0
