# Titanic ML EDA
## Titanic Passenger Survival EDA & Data Cleaning

In [1]:
# Imports

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

### Data Import & DataFrame Setup

The cell below loads the training, testing, and gender-submission CSV files into DataFrames and pops the `Survived` column out of the training data so it can serve as the labels during training.

In [19]:
# Training Data
training_file = './data/train.csv'
training_df = pd.read_csv(training_file)

training_lables = training_df.pop('Survived')

# Testing Data
testing_file = './data/test.csv'
testing_df = pd.read_csv(testing_file)

# Gender Submission Data
gender_file = './data/gender_submission.csv'
gender_df = pd.read_csv(gender_file)

### Handle Errors in the Data

Missing data was resolved in a variety of ways. The following assumptions and decisions were made: 

- **Ages**: Missing ages were filled with the mean age of the passengers whose age is known. A finer-grained approach, such as taking the mean within each Pclass or Fare band, could give closer estimates. The overall mean was chosen because the exact age likely matters less than the general age group (e.g. whether an individual is a child, an adult, or a senior).

- **Cabin**: Cabin data is missing in ~77.1% of the records in the source file. Because so many values are missing, the Cabin column was dropped from the DataFrame.

- **Embarked**: The port of embarkation was missing for two records in the source file. Southampton was assigned to these passengers because more passengers boarded there than at any other port, which makes it the statistically most likely value. This assignment is not expected to have a measurable effect on the outcome of the model.
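
The missing-value figures quoted above can be checked with a small helper. The following is a minimal sketch, not the notebook's own code: `missing_summary` is a hypothetical helper name, and `toy_df` is a made-up frame standing in for `training_df`.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    missing_count = df.isna().sum()
    missing_pct = (df.isna().mean() * 100).round(1)
    summary = pd.DataFrame({"missing": missing_count, "missing_pct": missing_pct})
    # Columns with the most missing values come first.
    return summary.sort_values("missing", ascending=False)


# Toy frame with made-up values (an assumption for illustration only).
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan],
})

summary = missing_summary(toy_df)
print(summary)
```

Calling `missing_summary(training_df)` right after loading the CSV should show which columns need attention before any filling or dropping is done.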

In [20]:
# Handling Missing Ages
ages = training_df['Age']
mean_age = ages.mean()

training_df['Age'] = training_df['Age'].fillna(mean_age)
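
The alternative mentioned earlier, filling ages by the mean within each Pclass, could look roughly like the sketch below. It is not what this notebook does: `fill_age_by_group` is a hypothetical helper, and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd


def fill_age_by_group(df: pd.DataFrame, group_col: str = "Pclass") -> pd.Series:
    """Fill missing ages with the mean age of each row's group."""
    # transform("mean") returns one value per row: the mean age of that row's group.
    group_mean = df.groupby(group_col)["Age"].transform("mean")
    # Fall back to the overall mean if an entire group has no known ages.
    return df["Age"].fillna(group_mean).fillna(df["Age"].mean())


# Toy frame with made-up values (an assumption for illustration only).
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, 30.0, np.nan],
})

filled_age = fill_age_by_group(toy_df)
print(filled_age.tolist())
```

In the toy data the missing first-class age becomes 45.0 and the missing third-class age becomes 25.0, rather than both receiving the overall mean of 35.0.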

In [9]:
# Handling Missing Cabin Data
training_df = training_df.drop('Cabin', axis=1)

In [16]:
# Handling Missing Embarked Data
training_df['Embarked'] = training_df['Embarked'].fillna('S')
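
Instead of hard-coding `'S'`, the most common port can be derived from the data itself. This is a minimal sketch on a made-up toy column, assuming the goal is simply to fill with the most frequent value; `most_common_port` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy column with made-up values (an assumption for illustration only).
toy_df = pd.DataFrame({"Embarked": ["S", "C", "S", np.nan, "Q", "S", np.nan]})

# mode() ignores NaN by default and returns a Series; take the first entry.
most_common_port = toy_df["Embarked"].mode().iloc[0]
toy_df["Embarked"] = toy_df["Embarked"].fillna(most_common_port)

print(most_common_port)
print(toy_df["Embarked"].value_counts())
```

Deriving the fill value this way keeps the cell correct if the same step is later applied to a different split of the data.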

In [21]:
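# Output Info of the Training DataFrame
# Note: the execution counts show the data was reloaded (In [19]) after the Cabin and
# Embarked cells ran (In [9], In [16]), so the output below still lists the Cabin column
# and two unfilled Embarked values. Rerun the cells in order to see the fully cleaned frame.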
training_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Pclass       891 non-null    int64  
 2   Name         891 non-null    object 
 3   Sex          891 non-null    object 
 4   Age          891 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Ticket       891 non-null    object 
 8   Fare         891 non-null    float64
 9   Cabin        204 non-null    object 
 10  Embarked     889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 76.7+ KB
