In [None]:
# plotting
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl

# data visualization
from helper_functions import plot_setup
plot_setup()

# data analysis
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

In this part, we'll look at **Logistic Regression**.<br /> 
While using Logistic Regression, we'll also explore the effects the number and the selection of features have on the model performance.  

First, let's load the Titanic dataset we prepared in the previous notebook.

In [None]:
titanic = pd.read_csv('titanic_processed.csv')

In [None]:
titanic.head()

As we heard earlier, X is a 2D array with shape `[n_samples, n_features]`, and y is a 1D array of length `n_samples`. Here X is our whole dataset minus the `survived` label, and y contains only the `survived` label.

In [None]:
X = titanic.drop('survived', axis = 1)
y = titanic['survived']

## Logistic Regression

Regressions are models that fit a function to the data, so that predicting a label later only requires passing the feature vector through the learned regression function. Because a regression function is fully determined by its coefficients, these models are easy to implement in practice and in production.

Regressions come in many flavors depending on what the function is. For example, a linear regression uses a linear function and a logistic regression uses a logistic function. You learn the coefficients of these functions during training and use the resulting function during testing.

Linear regression is generally used for regression problems, i.e. those where the target is a continuous variable. For classification problems, where the target is categorical, we want the model to output the probability that a data point belongs to a particular category. Since a probability is bounded between 0 and 1, we want a function that maps the input space into the interval [0, 1]. The logistic function has exactly that property.
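
To make that last point concrete, here is a minimal sketch of the logistic (sigmoid) function, $\sigma(z) = 1 / (1 + e^{-z})$, written with NumPy. The helper name `logistic` is ours, chosen for illustration; it is not part of this notebook's helper code.

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Large negative inputs approach 0, large positive inputs approach 1,
# and an input of 0 maps to exactly 0.5.
z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
probabilities = logistic(z)
print(probabilities)
```

Whatever real-valued score goes in, the output can be read as a probability.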

We start by importing the relevant classifier from `sklearn`.

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

model

You can find more information about the logistic regression parameters [here](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

For simplicity, let's start with just the passenger class as a feature. We will see how well this one feature predicts whether someone survived.

In [None]:
X_sub = X[['pclass']]

Separate the data into training and test sets using an 80/20 split.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_sub, y, test_size=0.2, random_state=42)
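
As a small illustration of what `test_size` and `random_state` do, the sketch below splits a ten-row toy dataset. The values are invented for illustration and are not taken from the Titanic file.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration only.
toy_X = pd.DataFrame({'pclass': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]})
toy_y = pd.Series([1, 1, 0, 1, 0, 1, 0, 0, 1, 0])

# test_size=0.2 holds out 20% of the rows for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42)

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42)

print(len(X_tr), len(X_te))
print(X_te.index.equals(X_te2.index))
```

Fixing `random_state` is what makes the results in this notebook repeatable from run to run.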

Train the model. (Learn about `.fit()` [here](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit))

In [None]:
model.fit(X_train, y_train)

Let's see what parameters our model learned for our regression function.

In [None]:
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
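
As a rough sketch of how these two numbers are used: for a binary problem, the model computes `intercept + coefficient * x` and passes the result through the logistic function to get the probability of the positive class. The snippet below checks this on a small made-up dataset (the `pclass` and `survived` values are invented for illustration, and `toy_model` is a separate model from the one trained above).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data, invented for illustration only.
toy_X = pd.DataFrame({'pclass': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]})
toy_y = pd.Series([1, 1, 0, 1, 0, 1, 0, 0, 1, 0])

toy_model = LogisticRegression()
toy_model.fit(toy_X, toy_y)

# Linear part: intercept + coefficient * feature value.
z = toy_model.intercept_[0] + toy_model.coef_[0, 0] * toy_X['pclass'].to_numpy()

# The logistic function turns the linear score into P(survived = 1).
manual_proba = 1.0 / (1.0 + np.exp(-z))

# predict_proba returns one column per class; column 1 is the positive class.
sklearn_proba = toy_model.predict_proba(toy_X)[:, 1]

print(np.allclose(manual_proba, sklearn_proba))
```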

Predict the labels for the test set.

In [None]:
y_pred = model.predict(X_test)

y_pred
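
For a binary logistic regression, `.predict()` amounts to picking the class with the higher predicted probability, i.e. thresholding the positive-class probability at 0.5. The sketch below illustrates this on made-up data (the values and the name `toy_model` are for illustration only).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data, invented for illustration only.
toy_X = pd.DataFrame({'pclass': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]})
toy_y = pd.Series([1, 1, 0, 1, 0, 1, 0, 0, 1, 0])

toy_model = LogisticRegression().fit(toy_X, toy_y)

# Probability of the positive class (survived = 1) for each row.
positive_proba = toy_model.predict_proba(toy_X)[:, 1]

# Thresholding those probabilities at 0.5 gives the hard labels.
thresholded = (positive_proba > 0.5).astype(int)

print(np.array_equal(thresholded, toy_model.predict(toy_X)))
```

If one kind of error matters more than the other, you can apply a different threshold to the output of `.predict_proba()` yourself.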

There are several ways to measure how well a classifier performs; accuracy is the simplest.

The `accuracy_score` we're using (you can read more on it [here](http://scikit-learn.org/stable/modules/model_evaluation.html#accuracy-score) and [here](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)) computes the fraction of predictions that exactly match the true labels. Don't worry too much about the details, but know that accuracy is only one of several evaluation metrics and that it can be misleading when the classes are imbalanced. You should ensure that the metric you use makes sense for the type of data you have. For simple purposes, though, accuracy is usually a reasonable starting point.

Read more [here](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics) about other metrics you can use to evaluate the performance of your model.  

What is the accuracy of this model? 

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

This means that our model predicted the label correctly in 65% of the cases.
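
To see what that number measures, here is a minimal sketch that computes accuracy by hand and compares it with `accuracy_score`. The labels are invented for illustration; they are not our Titanic predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels, invented for illustration only.
y_true = np.array([1, 0, 0, 1, 1, 0, 1, 0])
y_hat = np.array([1, 0, 1, 1, 0, 0, 1, 0])

# Accuracy is the share of positions where prediction and truth agree.
manual_accuracy = np.mean(y_true == y_hat)
library_accuracy = accuracy_score(y_true, y_hat)

print(manual_accuracy, library_accuracy)
```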

A confusion matrix is one way to visualize the results of a classification model.

In [None]:
from helper_functions import plot_confusion_matrix

plot_confusion_matrix(y_test, y_pred)

Our model correctly predicted that 33 passengers would survive (*true positives*) and that 103 would not (*true negatives*). However, it also predicted survival for 22 passengers who didn't survive (*false positives*), and it predicted that 51 passengers who did survive would not (*false negatives*). Depending on the particular problem you are modeling, you may care more about minimizing false positives than false negatives, or vice versa.
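
The four counts are enough to recompute accuracy and two other common metrics with plain arithmetic. The sketch below uses the numbers quoted in the text; precision and recall are included as examples of the other metrics mentioned earlier.

```python
# Counts quoted in the text for the pclass-only model.
tp, tn, fp, fn = 33, 103, 22, 51

total = tp + tn + fp + fn

accuracy = (tp + tn) / total   # share of all predictions that were correct
precision = tp / (tp + fp)     # of those predicted to survive, how many did
recall = tp / (tp + fn)        # of those who survived, how many we caught

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

The accuracy here agrees with the roughly 65% reported by `accuracy_score`, while the low recall shows that the model misses most of the actual survivors.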

#### Let's add in other features

We started with just a passenger's `pclass` as a feature. Let's see what happens if we add their gender as well.

In [None]:
X_sub = X[['gender', 'pclass']]

Separate the data into training and test sets using an 80/20 split.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_sub, y, test_size=0.2, random_state=42)

Train the model and predict the labels for this new model.

In [None]:
# TODO - train
# TODO - predict

What is the accuracy of this new model?

In [None]:
# TODO

As an exercise, pick another feature to add and see how that changes the accuracy of your model.

In [None]:
# TODO