# Building and Using Predictive Models

"*It is a capital mistake to theorize before one has data.*" — Sherlock Holmes

## Data Preparation: Selecting Variables

The first step is to load the data.

In [None]:
# This line imports a library called "pandas", a very useful tool to manipulate data with Python. 
import pandas as pd

# Load data from CSV file
df1 = pd.read_csv('known_survival.csv')
# Print data size
print(f"Total number of rows and columns: {df1.shape}")
# This line displays the top rows in the dataframe.
df1.head()

This data set contains information about passengers of the Titanic. In this example, we will create a model that predicts the probability of a passenger surviving the sinking of the Titanic.

To do so, we will first keep only the columns that may be related to survival (i.e., all except PassengerId and Name*).

\* Can you think of a reason why this column could potentially be useful to predict survival? If so, make sure to post your answer in the discussion board (this will count as class participation). 

In [None]:
# List of the names of columns
cols = ['Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'Embarked']
# This selects only the columns that are in 'cols'
df2 = df1[cols]
df2.head()

## Data Preparation: Missing Values

We will now address rows with missing values. The best way to handle missing values varies from one context to another. The code below illustrates two common alternatives: (1) imputing missing values with some new value (e.g., the mean) and (2) dropping observations with missing values. Because Option 2 runs last and overwrites `df3`, the rest of this example uses the version of the training data in which the observations with missing values were dropped.

In [None]:
print("Number of passengers with missing values for each column:")
print(df2.isna().sum(axis=0))

##### OPTION 1: Imputing missing values with the mean
df3 = df2.copy()
# This replaces the missing values in Age with its average.
df3.Age = df2.Age.fillna(df2.Age.mean())
# This replaces the missing values in Embarked with its mode.
df3.Embarked = df2.Embarked.fillna(df2.Embarked.mode()[0])
print(f"Observations and columns in the original data set: {df2.shape}")
print(f"Observations and columns in the new data set: {df3.shape}")
print(df3.isna().sum(axis=0))

##### OPTION 2: Dropping observations with missing values
# Drop observations with missing values
df3 = df2.dropna()
print(f"Observations and columns in the original data set: {df2.shape}")
print(f"Observations and columns in the new data set: {df3.shape}")
print(df3.isna().sum(axis=0))

## Data Preparation: Categorical Variables

Next, we will do some data pre-processing and transform the existing columns into variables that can be processed by data mining algorithms. More specifically, we will transform categorical variables (variables that represent categories rather than numeric quantities) into dummy variables. Each dummy variable takes the value 1 or 0 to indicate whether or not the individual belongs to a certain category.

In [None]:
# The function "get_dummies" creates dummy variables for the listed variables
df4 = pd.get_dummies(df3, columns=['Pclass', 'Sex', 'Embarked'])
df4.head()

When creating dummy variables, it's also common to drop one of the categories because you only need K-1 binary columns to represent a categorical variable with K categories. You can do that with the `get_dummies` function introduced above by setting the `drop_first` parameter to `True`.

In [None]:
# Same as above, but drop_first=True drops the first category of each listed variable
df4 = pd.get_dummies(df3, columns=['Pclass', 'Sex', 'Embarked'], drop_first=True)
df4.head()
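
To make the effect of `drop_first` concrete, here is a minimal sketch on a tiny made-up table (the `toy` dataframe and its values are assumptions for illustration, not the Titanic data). It compares the column names produced with and without `drop_first=True`.

```python
import pandas as pd

# Toy data (made up for illustration only)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# One dummy column per category
full = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=int)
# K-1 dummy columns per variable: the first category (alphabetically) is dropped
reduced = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

In this sketch the dropped categories are `female` and `C`, so a woman is represented by `Sex_male` equal to 0, and a passenger who embarked at `C` has both remaining `Embarked` dummies equal to 0.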

Finally, we need to split the data into the variable we want to predict (also known as the target or dependent variable), and the data we want to use to predict (also known as the features or the independent variables).

In [None]:
# Name of the target variable
target_variable = 'Survived'
# Name of all the columns in df4
all_columns = df4.columns.values
# Keep the name of all the columns that are not the target variable
features = all_columns[all_columns != target_variable]
# Select the data for the features
X = df4[features]
# Select the data for the target variable
y = df4[target_variable]
print(X.head())
print(y.head())
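
The same split can also be written with `DataFrame.drop`. The sketch below uses a small made-up dataframe (`toy`, `X_toy`, and `y_toy` are hypothetical names) and is only one possible alternative to the boolean mask used above.

```python
import pandas as pd

# Toy data standing in for a dummy-coded dataframe (values are made up)
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Age": [22.0, 38.0, 26.0],
    "Sex_male": [1, 0, 0],
})

target_variable = "Survived"
# Features: every column except the target
X_toy = toy.drop(columns=[target_variable])
# Target: the single column we want to predict
y_toy = toy[target_variable]

print(list(X_toy.columns))
print(y_toy.tolist())
```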

## Modeling: Decision Tree

Finally, here's a simple example of how to ***build*** and visualize a decision-tree model (which is a specific type of machine learning model). 

In [None]:
# This imports the library that contains the code to build a decision tree
from sklearn.tree import DecisionTreeClassifier
# This imports the function to visualize the tree
from sklearn.tree import plot_tree
# This imports a general library for visualization
import matplotlib.pyplot as plt

# Define the parameters for building the model (e.g., max depth of 3)
tree_model = DecisionTreeClassifier(max_depth=3)
# Build a decision tree model based on historical data of who survived the Titanic
tree_model = tree_model.fit(X, y)
# Set size of tree figure to be displayed
plt.figure(figsize=(22,10))
# Visualize the decision tree
t = plot_tree(tree_model, fontsize=16, feature_names=features, filled=True,
              impurity=False, class_names=["Dead!", "Alive!"])
# The visualization can be interpreted as follows:
# - If the condition at the top is true, move left. Otherwise, move right.
# - 'samples': Total number of individuals.
# - 'value': Number of people that died (left) and survived (right).
# - 'class': Predicted outcome for people at that node.

As you can see, the people more likely to survive are women (Sex_male < 0.5) who did not travel in third class (Pclass_3 < 0.5) and were more than 2.5 years old. On the other hand, the people more likely to die are men (Sex_male > 0.5) who were more than 6.5 years old and paid a low fare (Fare < 26.269). Note that sex appears in the tree as Sex_male because the dummy variables were created with `drop_first=True`, which drops the Sex_female column.
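
If the figure is hard to read, the same rules can be printed as text with scikit-learn's `export_text`. The sketch below fits a tree on a tiny made-up data set (`X_toy`, `y_toy`, and `toy_tree` are hypothetical; the values are not from the Titanic data) just to show what the text output looks like.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (made up): survival is fully determined by Sex_male here
X_toy = pd.DataFrame({
    "Sex_male": [0, 0, 0, 1, 1, 1],
    "Age": [30.0, 5.0, 40.0, 35.0, 8.0, 50.0],
})
y_toy = pd.Series([1, 1, 1, 0, 0, 0], name="Survived")

toy_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy, y_toy)

# Each indented line is a condition; "class" is the prediction at a leaf
rules = export_text(toy_tree, feature_names=list(X_toy.columns))
print(rules)
```

On the real model, the analogous call would pass `tree_model` and the list of feature names used for training.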

Note, however, that some nodes contain very few observations, which makes us wonder: are these nodes representative? How can we tell if this model is any good? How could we improve the model? We will analyze these (and many other) questions as we advance in the course. But for now, let's look at how to ***use*** this model to make predictions.

## Model Usage

Below is a data set of people with an unknown survival status.

In [None]:
# Same as above, this code creates a dataframe from a CSV file

df_unk1 = pd.read_csv("unknown_survival.csv")
print(f"Total number of rows and columns: {df_unk1.shape}")
df_unk1.head()

Note that this data set has exactly the same columns as the original data set we loaded, except for one: this data set does not contain the 'Survived' column. Fortunately, we can use our decision-tree model to predict how likely these passengers were to survive the Titanic (based on their features). But first, we need to do the data pre-processing for this data set as well.

In [None]:
cols = ['Pclass', 'Sex', 'Age', 'Fare', 'Embarked']
X2 = pd.get_dummies(df_unk1[cols], columns=['Pclass', 'Sex', 'Embarked'], drop_first=True)
print("Number of passengers with missing values for each column:")
print(X2.isna().sum())

# In this case, we will replace missing values with the mean, so that we can 
# make predictions for all passengers. Do you think this makes sense? Why or 
# why not? If you want to share your thoughts, please feel free to post your 
# answer in the discussion board (this counts as class participation).
X2['Age'] = X2['Age'].fillna(X2['Age'].mean())
X2['Fare'] = X2['Fare'].fillna(X2['Fare'].mean())
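
One thing to watch for: the model expects the new data to have exactly the same feature columns, in the same order, as the training data. If a category is missing from the new data, `get_dummies` will not create its column, and `drop_first=True` may drop a different category than it did during training. The sketch below shows one possible safeguard on made-up data (all names and values are hypothetical): encode the new data without `drop_first`, then use `reindex` to match the training columns.

```python
import pandas as pd

cat_cols = ["Sex", "Embarked"]

# Toy training data (made up), encoded as in training with drop_first=True
train = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})
X_train = pd.get_dummies(train, columns=cat_cols, drop_first=True, dtype=int)

# Toy new data (made up): only men who embarked at S
new = pd.DataFrame({
    "Age": [30.0, 41.0],
    "Sex": ["male", "male"],
    "Embarked": ["S", "S"],
})

# Encode every category present, then keep exactly the training columns.
# Columns absent from the new data are added and filled with 0.
X_new = pd.get_dummies(new, columns=cat_cols, dtype=int)
X_new = X_new.reindex(columns=X_train.columns, fill_value=0)

print(X_new)
```

Had the new data been encoded with `drop_first=True`, the `Sex_male` column would have been dropped (it is the only category present), and both men would have looked like women to the model.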

Now that the data is pre-processed, we can use the decision tree to make predictions.

In [None]:
# Predict whether the person survived
df_unk1["SurvivalPrediction"] = tree_model.predict(X2)
# Predict the probability of survival for this person
df_unk1["SurvivalProbability"] = tree_model.predict_proba(X2)[:, 1]
df_unk1.head(10)
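
To see how `predict` and `predict_proba` relate, here is a small sketch on made-up data (`X_toy`, `y_toy`, and `toy_tree` are hypothetical). The columns of `predict_proba` follow the order of `classes_`, which is why `[:, 1]` picks the probability of class 1 (survived) when the classes are 0 and 1.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data (made up): 3 of 4 women survive, 1 of 4 men survives
X_toy = pd.DataFrame({"Sex_male": [0, 0, 0, 0, 1, 1, 1, 1]})
y_toy = pd.Series([1, 1, 1, 0, 0, 0, 0, 1])

toy_tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_toy, y_toy)

# Order of the columns returned by predict_proba
print(toy_tree.classes_)

proba = toy_tree.predict_proba(X_toy)   # one column per class
labels = toy_tree.predict(X_toy)        # most likely class per row

# The predicted label is the class with the highest probability
labels_from_proba = toy_tree.classes_[np.argmax(proba, axis=1)]
print(proba[:, 1])
print(labels)
```

For a decision tree, the probability at a leaf is simply the share of training observations of each class that ended up in that leaf.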

And done! This is how you build a predictive model and use it to make predictions. Simple, right?

## Modeling: Logistic Regression

So, what if you would like to build a logistic regression model instead? Well, it's very easy. You just import `LogisticRegression` and replace the call to `DecisionTreeClassifier(...)` with a call to `LogisticRegression(...)`; fitting the model and making predictions work the same way. It's that simple! Here's an illustration:

In [None]:
# This imports the code for the logistic regression
from sklearn.linear_model import LogisticRegression

# This code builds the logistic regression model (ignore the solver parameter for now)
logistic_model = LogisticRegression(solver='liblinear')
logistic_model = logistic_model.fit(X, y)

# This code makes the predictions
df_unk1["SurvivalPrediction"] = logistic_model.predict(X2)
df_unk1["SurvivalProbability"] = logistic_model.predict_proba(X2)[:, 1]
df_unk1.head(10)

And here are the coefficients for the logistic regression model.

In [None]:
pd.DataFrame(logistic_model.coef_, columns=X.columns.values)

The coefficients convey similar results to the tree: a negative coefficient lowers the predicted probability of survival, so you are less likely to survive if you are older, male, paid a cheaper fare, or traveled in a lower passenger class.
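
A common way to read logistic regression coefficients is to exponentiate them into odds ratios: a value below 1 means the feature lowers the odds of survival, and a value above 1 means it raises them. The sketch below does this on made-up data (`X_toy`, `y_toy`, and `toy_logit` are hypothetical, and the numbers are not from the Titanic data).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data (made up): women and younger passengers survive more often
X_toy = pd.DataFrame({
    "Sex_male": [0, 0, 0, 0, 1, 1, 1, 1],
    "Age": [20.0, 30.0, 40.0, 50.0, 20.0, 30.0, 40.0, 50.0],
})
y_toy = pd.Series([1, 1, 1, 0, 1, 0, 0, 0])

toy_logit = LogisticRegression(solver="liblinear").fit(X_toy, y_toy)

# One coefficient per feature, labeled with the feature name
coefs = pd.Series(toy_logit.coef_[0], index=X_toy.columns)
# exp(coefficient) = multiplicative change in the odds per one-unit increase
odds_ratios = np.exp(coefs)

print(coefs)
print(odds_ratios)
```

Keep in mind that coefficients depend on the scale of each feature (a one-unit change in Age is one year, while a one-unit change in Sex_male is the whole difference between women and men), so their magnitudes are not directly comparable.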