In [None]:
# plotting
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl

# data visualization
import seaborn as sns
from helper_functions import plot_setup
sns.set_style('white')
plot_setup()

# scientific computing/mathematics
import numpy as np

# data analysis
import pandas as pd

# data mining & ML
from sklearn import preprocessing

import warnings
warnings.filterwarnings('ignore')

In this part, we'll look at Logistic Regression.<br /> 
Along the way, we'll also explore how the number and choice of features affect model performance.  

First, let's load the Titanic dataset we prepared in the previous notebook.

In [None]:
titanic = pd.read_csv('titanic_processed.csv')

In [None]:
titanic.head()

As we heard earlier, X is a 2D array with shape `[n_samples, n_features]`, and y is a 1D array of length `n_samples`. Here, X is our whole dataset minus the `survived` label; y contains only the `survived` label.
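
To make these shapes concrete, here is a minimal sketch on a hypothetical three-row frame. The name `toy` and all of its values are made up for illustration; they are not taken from the Titanic data.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration only.
toy = pd.DataFrame({
    'pclass':   [1, 3, 2],
    'gender':   [0, 1, 1],
    'survived': [1, 0, 0],
})

X_toy = toy.drop('survived', axis=1)  # every column except the label
y_toy = toy['survived']               # only the label

print(X_toy.shape)  # (n_samples, n_features)
print(y_toy.shape)  # (n_samples,)
```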

In [None]:
X = titanic.drop('survived', axis=1)
y = titanic['survived']

## Logistic Regression

Regression models learn a function that maps a feature vector to an output value. Once the model is trained, predicting is just a matter of passing a new feature vector into the learned function and reading off the value it returns.

Regressions come in many flavors depending on the form of that function. For example, linear regression uses a linear function, while logistic regression passes a linear combination of the features through the logistic (sigmoid) function, which squashes it into the range between 0 and 1. You learn the coefficients of these functions during training and use the resulting function during testing.

The value a logistic regression returns represents a probability. It is the _probability that this feature set belongs to a particular category_.
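
As a rough illustration of how a logistic regression turns features into a probability, the sketch below applies the logistic (sigmoid) function to a linear combination of one feature. The `intercept` and `coef` values are hypothetical stand-ins for parameters a model would learn during training.

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned parameters for a single-feature model.
intercept = 1.5
coef = -0.8

feature_values = np.array([1.0, 2.0, 3.0])
z = intercept + coef * feature_values  # the linear part
probabilities = logistic(z)            # squashed into (0, 1)

print(probabilities)
```

Because the coefficient here is negative, larger feature values produce smaller probabilities.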

You may be wondering, then: if a regression gives the probability for one category, how do we use it when we want to predict among multiple categories?

scikit-learn's [`LogisticRegression`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) can predict more than two classes. One common strategy is "one-versus-the-rest": we build a binary model for each class we want to predict, and when we want to classify an item, we look at the probabilities produced by all of the models and return the class with the highest probability. The alternative, multinomial logistic regression, fits a single model that outputs a probability for every class at once. Which of the two is used depends on the scikit-learn version and the solver settings.
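
The one-versus-the-rest idea can be sketched by hand. The example below is an illustration only, not how scikit-learn implements it internally: it uses the three-class iris dataset bundled with scikit-learn (an assumption made for the demo, since our Titanic label has only two values) and fits one binary model per class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Three-class demo data (not the Titanic dataset).
X_demo, y_demo = load_iris(return_X_y=True)
classes = np.unique(y_demo)

# One binary model per class: "is it this class, or any other?"
binary_models = [
    LogisticRegression(max_iter=1000).fit(X_demo, (y_demo == c).astype(int))
    for c in classes
]

# Column k holds each sample's probability of belonging to classes[k].
scores = np.column_stack(
    [m.predict_proba(X_demo)[:, 1] for m in binary_models]
)

# Return the class whose model reports the highest probability.
ovr_pred = classes[np.argmax(scores, axis=1)]
print(ovr_pred[:10])
```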

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

Let's start with just the passenger class as a feature, for simplicity. We will see how well this one feature predicts whether someone survived.

In [None]:
X_sub = X[['pclass']]

Separate the data into training and test sets using an 80/20 split.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_sub, y, test_size=0.2, random_state=42)

Train the model. (Learn about `.fit()` [here](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit))

In [None]:
model.fit(X_train, y_train)

Predict the labels for the test set.

In [None]:
y_pred = model.predict(X_test)
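
Behind `.predict()` sit the probabilities discussed earlier. The sketch below, on made-up one-feature data (the `X_demo`, `y_demo`, and `demo_model` names are illustrative, not part of this notebook), shows that for a two-class problem the predicted label is the class with the larger probability from `.predict_proba()`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: one feature; label 1 becomes more likely as the feature grows.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 1))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

demo_model = LogisticRegression().fit(X_demo, y_demo)

proba = demo_model.predict_proba(X_demo)  # columns follow demo_model.classes_
labels = demo_model.predict(X_demo)

# Pick the class with the larger probability and compare with predict().
labels_from_proba = demo_model.classes_[np.argmax(proba, axis=1)]
print(np.array_equal(labels, labels_from_proba))
```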

There are several ways to measure how well a classifier performs.  

The `accuracy_score` we're using (you can read more on it [here](http://scikit-learn.org/stable/modules/model_evaluation.html#accuracy-score) and [here](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)) returns the fraction of test samples whose predicted label matches the true label. For binary and multiclass problems like ours, this gives the same number as the older [Jaccard Similarity](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html#sklearn.metrics.jaccard_similarity_score) score, which scikit-learn has since deprecated in favor of `jaccard_score`. Don't worry too much about the details, but know that the score you report depends on which metric you choose, so make sure the metric makes sense for the type of data you have. For simple purposes, plain accuracy is usually a reasonable starting point.  
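
For a single-label problem like this one, accuracy can be checked by hand. The sketch below uses two short made-up label arrays (`y_true_toy` and `y_pred_toy` are illustrative names, not this notebook's data) and compares a manual calculation with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration.
y_true_toy = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_toy = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Fraction of positions where the prediction equals the true label.
manual = np.mean(y_true_toy == y_pred_toy)

print(manual, accuracy_score(y_true_toy, y_pred_toy))  # both 0.75 (6 of 8 correct)
```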

Read more [here](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics) about other metrics you can use to evaluate the performance of your model.  

What is the accuracy of this model? 

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

A confusion matrix is one way to show the results of a machine learning model visually: it counts how many samples of each true class were assigned to each predicted class.

In [None]:
from helper_functions import plot_confusion_matrix

plot_confusion_matrix(y_test, y_pred)
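
The `plot_confusion_matrix` helper comes from this course's `helper_functions` module, whose implementation is not shown here. To see the raw counts behind such a plot, one option is scikit-learn's `confusion_matrix`; the sketch below uses made-up labels and does not assume the helper works the same way.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration.
y_true_toy = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_toy = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Rows are true classes, columns are predicted classes (labels sorted: 0, 1).
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)

# Correct predictions sit on the diagonal, so accuracy = trace / total.
print(np.trace(cm) / cm.sum())
```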

#### Let's add in other features

We started with just a passenger's `pclass` as a feature. Let's see what happens if we add their gender as well.

In [None]:
X_sub = X[['gender', 'pclass']]

Separate the data into training and test sets using an 80/20 split.

In [None]:
X_train, X_test, y_train, y_test = # TODO

Train the model and predict the labels for this new model.

In [None]:
# TODO - train
# TODO - predict

What is the accuracy of this new model?

In [None]:
# TODO

As an exercise, pick another feature to add and see how that changes the accuracy of your model.

In [None]:
# TODO