# Predicting Disaster Results!

We are going to use machine learning to predict whether or not passengers survived the sinking of the Titanic! We will use known information about the passengers as the feature set (**X variables**) and the **binary target** 'Survived' (the **y variable**) to classify whether or not each passenger survived.

We will use three different models to complete these predictions:
- K Nearest Neighbors (KNN)
- Decision Tree
- Random Forest


The data is sourced from the Kaggle competition Titanic: Machine Learning from Disaster.

To see other examples of data cleaning, analysis, & modeling of this data, follow this link: https://www.kaggle.com/c/titanic

## Data Dictionary

Your dataset may come with a data dictionary. The dictionary for the Titanic dataset is below. It provides a definition for each feature and a key for the categorical features.

| Field | Definition | Key |
|------|------|------|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

## K Nearest Neighbors (KNN) Model

Our problem is to predict whether a passenger will survive the Titanic disaster given some personal data and information about their trip. To start, we will classify the passenger survivor outcomes by creating a KNN model. 

**The Model Components:**
- **Target Variable:** 'Survived'
- **Candidate Features:** 'PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'

**Bootcamp Modeling Process:**
1. Load packages
2. Load training data
3. Review initial training data & features
4. Clean Data / Create Data Pipeline
5. Load test data / apply data pipeline to test data
6. Split training dataset
7. Fit KNN Model
8. Test model
9. Evaluate model

**1. Load Packages**

Packages for working with / modifying / preprocessing **data**

In [None]:
import numpy as np #Numpy for working with your data as an array
import pandas as pd #Pandas for working with your data as a dataframe
from sklearn import preprocessing #preprocessing is used to normalize the data
from sklearn.model_selection import train_test_split #train_test_split splits the data into training and testing sets

Packages for **visualization**

In [None]:
from matplotlib import pyplot as plt #matplotlib for visualization functions 
import seaborn as sns #seaborn for visualization functions

Packages for **modeling**: used to create models from training data & define model hyperparameters

In [None]:
from sklearn.tree import DecisionTreeClassifier #DecisionTreeClassifier is the decision tree classification function
from sklearn.ensemble import RandomForestClassifier #RandomForestClassifier is the random forest classification functionality
from sklearn.neighbors import KNeighborsClassifier #KNeighborsClassifier is the KNN classification functionality

Packages for **evaluating** Model Results

In [None]:
from sklearn import metrics #metrics provides evaluation functions such as accuracy_score
from sklearn.metrics import confusion_matrix #confusion_matrix counts correct and incorrect predictions for each class

**2. Load training data**

In [None]:
df = pd.read_csv('train.csv')

**3. Review initial training data & features**

This data is used to predict whether a passenger survived the sinking of the Titanic. We will look further into some of the features in the dataset before building our KNN model.

In [None]:
df.head()

In [None]:
df.columns

In [None]:
df.info()

**Review the Numerical Features**

A flag of 1 means that the passenger survived the sinking of the Titanic; a flag of 0 means that they did not.

In [None]:
df.Survived.value_counts()

In [None]:
df.Pclass.value_counts()

In [None]:
df.Age.value_counts()

In [None]:
#It will be easier to look at this distribution using a histogram!
df.Age.hist()

In [None]:
df.SibSp.value_counts()

In [None]:
df.Parch.value_counts()

In [None]:
df.Fare.value_counts()

In [None]:
#Take a look at the distribution
df.Fare.hist()

In [None]:
#Review the summary statistics
df.Fare.describe()

**Identify two features to use to predict whether a passenger will survive.** <br> The scatterplots below compare different features so that we can start evaluating their relationships with each other and with our target, **Survived**.

In [None]:
sns.scatterplot(
    x='Age',
    y='Fare',
    data=df)

Include the target feature as the **hue** and look for patterns

In [None]:
sns.scatterplot(
    x='Age',
    y='Fare',
    hue='Survived',
    data=df
)

**In Class Activity**

Using the features in the dataframe **df**, create a scatterplot comparing the two features that you want to use to predict whether a passenger survived. Select from the features output by the cell below. You can recreate the scatterplot using the code from the cell above.

In [None]:
df.columns

In [None]:
## Put scatterplot code in this cell

**Review Categorical Data**

In [None]:
df.info()

In [None]:
df.Name

In [None]:
df.Ticket

In [None]:
df.Cabin.value_counts()

In [None]:
df.Sex.value_counts()

In [None]:
df.Embarked.value_counts()

**4. Clean Data / Create Data Pipeline**

Clean Null Values

In [None]:
df_na = (df.isnull().sum() / len(df)) * 100
df_na = df_na.drop(df_na[df_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio' :df_na})
missing_data.head()

**Drop features that are missing too much information**

In [None]:
df = df.drop(['Cabin'], axis=1)

**Drop features that will not add value**

In [None]:
df = df.drop(['Name'], axis=1)

In [None]:
df = df.drop(['Ticket'], axis=1)

**Fill empty values in the Age feature**

In [None]:
df['Age'] = df['Age'].fillna(df['Age'].median()) #assign the result back instead of using inplace=True on a single column

**Fill empty values in the Embarked feature**

In [None]:
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0]) #mode()[0] is the most common port
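
The two fills above replace missing ages with the median age and missing ports with the most common port. The following is a minimal plain-Python sketch of the same idea; the sample values and the helper name `fill_missing` are made up for illustration and are not part of the notebook or of pandas.

```python
from statistics import median, mode

def fill_missing(values, fill_value):
    """Return a copy of values with every None replaced by fill_value."""
    return [fill_value if v is None else v for v in values]

# Made-up sample values for illustration (not rows from train.csv)
ages = [22.0, None, 38.0, 26.0, None, 35.0]
ports = ["S", "C", None, "S", "Q", "S"]

# Compute each fill value from the observed (non-missing) entries only
age_fill = median(v for v in ages if v is not None)
port_fill = mode(v for v in ports if v is not None)

ages_filled = fill_missing(ages, age_fill)
ports_filled = fill_missing(ports, port_fill)
print(ages_filled)
print(ports_filled)
```

The median is a reasonable default for a skewed numeric feature such as Age because a few extreme values do not pull it far, while the mode is the natural choice for a categorical feature such as Embarked.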

**Verify that all nulls are filled**

In [None]:
df_na = (df.isnull().sum() / len(df)) * 100
df_na = df_na.drop(df_na[df_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio' :df_na})
missing_data.head()

**Convert Sex & Embarked to numeric values**

In [None]:
df['Sex'] = df['Sex'].map({'female': 0, 'male': 1}).astype(int)

In [None]:
df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)

**All features are now numeric**

The K Nearest Neighbor model requires numeric data

In [None]:
df.info()
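
KNN compares passengers by the distance between their rows, so the numbers chosen for a categorical feature affect the result. The sketch below compares the integer codes used for Embarked in this notebook with a one-hot encoding, which is a common alternative that this notebook does not use. The helper `port_distance` is a hypothetical name introduced only for this example.

```python
from math import dist

# Integer codes used in this notebook for Embarked
integer_code = {"S": [0], "C": [1], "Q": [2]}

# One-hot alternative (an assumption, not used above): one 0/1 column per port
one_hot_code = {"S": [1, 0, 0], "C": [0, 1, 0], "Q": [0, 0, 1]}

def port_distance(encoding, a, b):
    """Euclidean distance between two ports under the given encoding."""
    return dist(encoding[a], encoding[b])

for a, b in [("S", "C"), ("S", "Q"), ("C", "Q")]:
    print(a, b, port_distance(integer_code, a, b), round(port_distance(one_hot_code, a, b), 3))
```

With integer codes, Southampton and Queenstown end up twice as far apart as Southampton and Cherbourg, an ordering that the ports do not really have. With one-hot columns every pair of ports is the same distance apart.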

**5. Load test data / apply data pipeline to test data**

In [None]:
df_test = pd.read_csv('test.csv')

In [None]:
df_test = df_test.drop(['Cabin'], axis=1)
df_test = df_test.drop(['Name'], axis=1)
df_test = df_test.drop(['Ticket'], axis=1)
df_test['Fare'] = df_test['Fare'].fillna(df_test['Fare'].median()) #fill Fare with the Fare median, not the Age median
df_test['Age'] = df_test['Age'].fillna(df_test['Age'].median())
df_test['Embarked'] = df_test['Embarked'].fillna(df_test['Embarked'].mode()[0])

In [None]:
df_test['Sex'] = df_test['Sex'].map({'female': 0, 'male': 1}).astype(int)
df_test['Embarked'] = df_test['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)

In [None]:
df_test.info()

**Isolate the X & Y Features in the training and test datasets**

Define the features that you want to model

In [None]:
df_train = df

In [None]:
features = ['Age', 'Fare', 'Sex', 'Embarked']

In [None]:
X = df_train[features]

In [None]:
y = df_train['Survived']

**Normalize the Feature Set**

In [None]:
X = preprocessing.StandardScaler().fit_transform(X)

In [None]:
X
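
Standardizing rescales each feature to a mean of 0 and a standard deviation of 1, so that a feature with large values such as Fare does not dominate the distance calculation in KNN. The sketch below shows the arithmetic on a made-up list of fares; `standardize` is a hypothetical helper written for this example, and it divides by the population standard deviation, which is what StandardScaler does.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (divides by n)
    return [(v - mean) / std for v in values]

fares = [10.0, 20.0, 30.0]  # made-up fares
z = standardize(fares)
print(z)
```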

**6. Split training dataset**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
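
The split above holds out 20% of the rows so that the model can be scored on data it did not see during fitting. As a rough illustration of the idea, the sketch below shuffles row positions and sets a fraction aside. The function `holdout_split` and its `seed` argument are hypothetical names for this example and are not the scikit-learn implementation.

```python
import random

def holdout_split(rows, test_size=0.20, seed=42):
    """Shuffle row indices and hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # a fixed seed makes the split repeatable
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(10))  # stand-in for 10 passengers
train_rows, test_rows = holdout_split(rows)
print(len(train_rows), len(test_rows))
```

train_test_split accepts a `random_state` argument that plays the same role as `seed` here. Without it, each run produces a different split and therefore a slightly different accuracy score.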

**7. Fit KNN Model**

We need to choose how many neighbors the model will use. This is set with the **n_neighbors** argument of the **KNeighborsClassifier()** function.

In [None]:
test_knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

In [None]:
test_knn
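
To make the role of n_neighbors concrete, here is a minimal from-scratch sketch of how a KNN classifier labels one new point: find the k closest training points and take a majority vote. It assumes Euclidean distance and equal-weight votes, which match the KNeighborsClassifier defaults. The function `knn_predict` and the toy points are made up for this example.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of query by majority vote among its k nearest training points."""
    # Rank training rows from nearest to farthest (Euclidean distance), keep the first k
    nearest = sorted(range(len(train_points)), key=lambda i: dist(train_points[i], query))[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up, already-scaled points: label 0 near the origin, label 1 near (5, 5)
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (0.5, 0.5)))
print(knn_predict(points, labels, (5.5, 5.5)))
```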

**8. Test model**

In [None]:
test_yhat = test_knn.predict(X_test)

In [None]:
test_yhat

**9. Evaluate model**

**Create a Confusion Matrix**

This will allow us to compare the results of our model to the actual results

In [None]:
test_confusion = confusion_matrix(y_test, test_yhat)

In [None]:
test_confusion

In [None]:
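#Rows are the actual classes and columns are the predicted classes, in the order given by labels
pd.DataFrame(confusion_matrix(y_test, test_yhat, labels = [0, 1]), 
             columns = ['Predicted Did Not Survive (0)','Predicted Survived (1)']
            ).rename(index = {0:'Actual Did Not Survive (0)',1:'Actual Survived (1)'})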

**Review the Accuracy Score**

In [None]:
metrics.accuracy_score(y_test, test_yhat)
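
The confusion matrix and the accuracy score are both simple counts over pairs of actual and predicted labels. The sketch below computes them by hand on made-up labels. It follows the same layout as scikit-learn's confusion_matrix with labels = [0, 1], where rows are actual classes and columns are predicted classes. The helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels where 1 is the positive class."""
    counts = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        counts[actual][predicted] += 1  # row = actual class, column = predicted class
    return counts

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the actual labels."""
    correct = sum(a == p for a, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels for eight passengers
actual =    [0, 0, 1, 1, 1, 0, 1, 0]
predicted = [0, 1, 1, 0, 1, 0, 1, 0]

print(confusion_counts(actual, predicted))
print(accuracy(actual, predicted))
```

Accuracy is the sum of the diagonal of the confusion matrix divided by the total number of predictions.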

### In Class Activity
**Can we improve the model through different feature selection?**

The code that creates the KNN model in the previous section is aggregated in the cells below. Modify the **features** list by:
- Adding a feature or features
- Removing a feature or features
- Swapping one feature for another

When you run the cells, your model's accuracy score will be returned! Can you get an accuracy better than our previous accuracy of **.7932**? Because the train/test split is random, your score will vary slightly from run to run.

Try to only make one change at a time. This way you can see and measure the impact of the change. 

In [None]:
df.columns

In [None]:
features = ['Age', 'Fare', 'Sex', 'Embarked']
X = df_train[features]
y = df_train['Survived']
X = preprocessing.StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
test_knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
test_yhat = test_knn.predict(X_test)
metrics.accuracy_score(y_test, test_yhat)

**Test / Modify / Tune Model**

In [None]:
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
yhat = knn.predict(X_test)
confusion = confusion_matrix(y_test, yhat)
print(metrics.accuracy_score(y_test, yhat))
print(confusion)

In [None]:
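#Use yhat (the n_neighbors=5 predictions); rows are actual classes and columns are predicted classes
pd.DataFrame(confusion_matrix(y_test, yhat, labels = [0, 1]), 
             columns = ['Predicted Did Not Survive (0)','Predicted Survived (1)']
            ).rename(index = {0:'Actual Did Not Survive (0)',1:'Actual Survived (1)'})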

In [None]:
knn_10 = KNeighborsClassifier(n_neighbors=10).fit(X_train, y_train)
yhat_10 = knn_10.predict(X_test)
confusion_10 = confusion_matrix(y_test, yhat_10)
print(metrics.accuracy_score(y_test, yhat_10))
print(confusion_10)

In [None]:
pd.DataFrame(confusion_matrix(y_test, yhat_10, labels = [0, 1]), 
             columns = ['Predicted Did Not Survive (0)','Predicted Survived (1)']
            ).rename(index = {0:'Actual Did Not Survive (0)',1:'Actual Survived (1)'})

In [None]:
knn_20 = KNeighborsClassifier(n_neighbors=20).fit(X_train, y_train)
yhat_20 = knn_20.predict(X_test)
confusion_20 = confusion_matrix(y_test, yhat_20)
print(metrics.accuracy_score(y_test, yhat_20))
print(confusion_20)

In [None]:
pd.DataFrame(confusion_matrix(y_test, yhat_20, labels = [0, 1]), 
             columns = ['Predicted Did Not Survive (0)','Predicted Survived (1)']
            ).rename(index = {0:'Actual Did Not Survive (0)',1:'Actual Survived (1)'})
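
The cells above repeat the same fit, predict, and score steps for each value of n_neighbors. That pattern can be written as a loop that records one score per candidate value. The sketch below shows the loop on made-up, well-separated data with a small from-scratch classifier, so it runs without the Titanic files. All names in it are hypothetical and the scores say nothing about the real dataset.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k):
    """Majority vote among the k training points nearest to query."""
    nearest = sorted(range(len(train_points)), key=lambda i: dist(train_points[i], query))[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]

# Made-up data: ten label-0 points in one cluster, ten label-1 points in another
train_points = [(i * 0.1, i * 0.2) for i in range(10)] + [(3 + i * 0.1, 3 + i * 0.2) for i in range(10)]
train_labels = [0] * 10 + [1] * 10
val_points = [(0.5, 0.5), (3.5, 3.5), (0.2, 0.9), (3.9, 3.1)]
val_labels = [0, 1, 0, 1]

# Score each candidate k on the same held-out points
scores = {}
for k in [1, 5, 10]:
    predictions = [knn_predict(train_points, train_labels, q, k) for q in val_points]
    scores[k] = sum(p == a for p, a in zip(predictions, val_labels)) / len(val_labels)

best_k = max(scores, key=scores.get)  # ties go to the first k tried
print(scores, best_k)
```

Keeping the held-out points fixed while only k changes makes the scores directly comparable, which is the same reason the activity above asks you to make one change at a time.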

## Define Final Model

In [None]:
knn_final = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

## Make Final Predictions
Use the Test dataset to make final predictions

In [None]:
df_test.head()

**Preprocessing**

In [None]:
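features = ['Age', 'Fare', 'Sex', 'Embarked']
#The KNN model was trained on scaled features, so scale the test features with a scaler fit on the training features
scaler = preprocessing.StandardScaler().fit(df_train[features])
X = scaler.transform(df_test[features])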

In [None]:
yhat = knn_final.predict(X)

**Review Results**

In [None]:
df_test['Results'] = yhat #one prediction per row of df_test

In [None]:
df_test

# Create a Decision Tree to Predict whether a Passenger will Survive

We will use the same dataset with a different approach to predict! Luckily, we have already created a data prep pipeline, so we can use the same methods to prepare our dataset (steps 1-6 in the previous modeling process).

Our problem is to predict whether a passenger will survive the Titanic disaster given some personal data and information about their trip. This time, we will classify the passenger survivor outcomes by creating a Decision Tree model.

**The Model Components:**
- **Target Variable:** 'Survived'
- **Candidate Features:** 'PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'

**Bootcamp Modeling Process:**
1. Load packages
2. Load training data
3. Review initial training data & features
4. Clean Data / Create Data Pipeline
5. Load test data / apply data pipeline to test data
6. Split training dataset
7. Fit Decision Tree Model
8. Test model
9. Evaluate model

**Load Packages**

In [None]:
from sklearn.tree import DecisionTreeClassifier

### Data Preprocessing

In [None]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

In [None]:
df_train = df_train.drop(['Cabin'], axis=1)
df_train = df_train.drop(['Name'], axis=1)
df_train = df_train.drop(['Ticket'], axis=1)
df_train['Age'] = df_train['Age'].fillna(df_train['Age'].median())
df_train['Embarked'] = df_train['Embarked'].fillna(df_train['Embarked'].mode()[0])

In [None]:
df_train['Sex'] = df_train['Sex'].map({'female': 0, 'male': 1}).astype(int)
df_train['Embarked'] = df_train['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)

In [None]:
df_test = df_test.drop(['Cabin'], axis=1)
df_test = df_test.drop(['Name'], axis=1)
df_test = df_test.drop(['Ticket'], axis=1)
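df_test['Fare'] = df_test['Fare'].fillna(df_test['Fare'].median()) #Fare is also filled in the KNN section's test pipeline
df_test['Age'] = df_test['Age'].fillna(df_test['Age'].median())
df_test['Embarked'] = df_test['Embarked'].fillna(df_test['Embarked'].mode()[0])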

In [None]:
df_test['Sex'] = df_test['Sex'].map({'female': 0, 'male': 1}).astype(int)
df_test['Embarked'] = df_test['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)

**Identify Features for Modeling**

In [None]:
features = ['Age', 'Fare', 'Sex', 'Embarked']
X = df_train[features]
y = df_train['Survived']

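**We do not need to normalize the features for a Decision Tree**

A decision tree splits on one feature at a time, so the scale of each feature does not affect the result. scikit-learn's tree models still require numeric input, which is why Sex and Embarked were mapped to numbers above.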

**Split the Training Data**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

### Create Initial Model

In [None]:
decision_tree = DecisionTreeClassifier(max_depth = 3)

In [None]:
decision_tree.fit(X_train, y_train)
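
When a decision tree is fit, it repeatedly picks the split that leaves the purest groups of labels on each side, and max_depth = 3 caps how many splits can be stacked. scikit-learn's DecisionTreeClassifier measures purity with Gini impurity by default. The sketch below computes that measure for one hypothetical split on made-up labels; the helper names are not part of scikit-learn.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 0.0 means the node is pure."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - (p1 ** 2 + (1.0 - p1) ** 2)

def split_impurity(left, right):
    """Size-weighted average impurity of the two child nodes of a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Made-up Survived labels on each side of a hypothetical split such as Sex <= 0.5
left = [1, 1, 1, 0]
right = [0, 0, 0, 1]

print(gini(left + right))
print(split_impurity(left, right))
```

A split is worth making when the weighted impurity of the children is lower than the impurity of the parent node, as it is here.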

**Create Visual of the Decision Tree Model**

To create the below visual, you will need to install the pydotplus package. You can uncomment the cell below to install the package. If this results in an error, run the command in your computer's terminal.

In [None]:
#pip install pydotplus

In [None]:
from io import StringIO #sklearn.externals.six has been removed from recent scikit-learn releases
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus

In [None]:
dot_data = StringIO()
export_graphviz(decision_tree, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())

### Predict Initial Results

In [None]:
y_pred = decision_tree.predict(X_test)

### Review Test Results

In [None]:
y_true = np.array(y_test)

**Create Dataframe to Review Individual Results**

In [None]:
pd.DataFrame({'Survived': y_true, 'Predictions': y_pred}, columns=['Survived', 'Predictions'])

**Create confusion matrix to review overall accuracy**

In [None]:
confusion_matrix(y_true, y_pred, labels = [0, 1])

In [None]:
pd.DataFrame(confusion_matrix(y_true, y_pred, labels = [0, 1]), 
             columns = ['Predicted Did Not Survive (0)','Predicted Survived (1)']
            ).rename(index = {0:'Actual Did Not Survive (0)',1:'Actual Survived (1)'})

# Random Forest
Improve on the single decision tree by using a random forest model, which combines the predictions of many decision trees.

In [None]:
df_train.shape

Use the **RandomForestClassifier()** function to make an initial predictor. We will tune our model by modifying the **n_estimators** hyperparameter, which sets the number of trees in the forest. To begin testing, we will set n_estimators equal to 50.

In [None]:
random_forest = RandomForestClassifier(n_estimators=50)

In [None]:
random_forest.fit(X_train, y_train)
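
Conceptually, each of the n_estimators trees is trained on its own bootstrap sample of the training rows, and the forest combines the trees' answers. The sketch below illustrates both ideas with made-up values and hypothetical helper names. It uses a simple majority vote, whereas scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities, so treat this as an approximation of the idea rather than the library's method.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as one tree's training set."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(per_tree_predictions):
    """Combine predictions (one list per tree) into one label per passenger."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*per_tree_predictions)]

rng = random.Random(0)
rows = list(range(8))  # stand-in for 8 training rows
print(bootstrap_sample(rows, rng))  # rows may repeat or be left out

# Hypothetical predictions from three trees for three passengers
tree_predictions = [[1, 0, 1],
                    [1, 1, 0],
                    [0, 0, 1]]
print(majority_vote(tree_predictions))
```

Because every tree sees a slightly different sample, the trees make different mistakes, and combining them tends to be more stable than relying on any single tree.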

In [None]:
y_pred = random_forest.predict(X_test)

In [None]:
y_pred

In [None]:
random_forest.score(X_test, y_test) #score on the held-out data rather than the data the model was trained on

## Try Improving the Model Score by Modifying the Hyperparameter

In [None]:
#Change n_estimators and re-run this cell to compare scores
random_forest = RandomForestClassifier(n_estimators=50)
random_forest.fit(X_train, y_train)
y_pred = random_forest.predict(X_test)
random_forest.score(X_test, y_test) #score on the held-out data

## Make Final Predictions

In [None]:
y_pred = random_forest.predict(X_test)

## Review Results

In [None]:
results = X_test.copy() #copy so that we do not modify a slice of the training dataframe
results['results'] = y_pred
results