In [None]:
%matplotlib inline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from helper_functions import plot_supervised_model

In this part, we'll look at Logistic Regression.<br />
Along the way, we'll also explore how the number of features and the relative sizes of the training and test sets affect model performance.

First, let's load the same Iris dataset as in the previous example.

In [None]:
iris_dataset = load_iris()
X = iris_dataset.data
y = iris_dataset.target

## Logistic Regression

Regressions are models that attempt to fit a function to the data, so that when you later use the model to predict labels, all you need to do is pass the feature vector into the learned regression function and your model returns a value.

Regressions come in many flavors depending on what the function is. For example, a linear regression uses a linear function, and a logistic regression passes a linear function of the features through the logistic (sigmoid) function. You learn the coefficients of these functions during training and use the resulting function during testing.

The value a logistic regression returns represents a probability. It is the _probability that this feature set belongs to a particular category_.
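
To make this concrete, here is a minimal sketch of how a logistic regression turns a feature vector into a probability. The weights and bias below are made-up values chosen only for illustration; they are not coefficients learned from the Iris data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for a two-feature model (illustration only).
demo_weights = np.array([0.8, -1.2])
demo_bias = 0.1

# An example feature vector, e.g. sepal length and sepal width in cm.
demo_features = np.array([5.1, 3.5])

# Weighted sum of the features, then squashed into a probability.
demo_score = np.dot(demo_weights, demo_features) + demo_bias
demo_probability = sigmoid(demo_score)
print(demo_probability)
```

Training amounts to finding the weights and bias that make these probabilities match the labels in the training set as closely as possible.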

You may be wondering, then: If a logistic regression gives the probability for one category, how do we use it when we want to predict among multiple categories?

scikit-learn's [Logistic Regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) can predict multiple classes. One way to do this is "one-versus-the-rest": we build a model for each class we want to predict, and when we want to classify an item, we look at the probabilities produced by all of the models and return the class with the highest probability. Depending on the scikit-learn version and solver, the classifier may instead fit a single multinomial model that produces one probability per class. Either way, the predicted class is the one with the highest probability.
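
The "highest probability wins" rule can be checked directly. The sketch below uses separate `demo_` names so it does not interfere with the rest of the notebook, and it raises `max_iter` only to avoid convergence warnings; both choices are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Separate names so this sketch does not overwrite the notebook's variables.
demo_iris = load_iris()
demo_model = LogisticRegression(max_iter=1000)
demo_model.fit(demo_iris.data, demo_iris.target)

# One row per sample, one column per class; each row sums to 1.
demo_proba = demo_model.predict_proba(demo_iris.data[:5])

# Pick the class whose column holds the highest probability.
demo_manual = demo_model.classes_[np.argmax(demo_proba, axis=1)]

print(demo_proba.round(3))
print(demo_manual)
print(demo_model.predict(demo_iris.data[:5]))
```

The last two printed lines should agree, since `predict` returns the class with the highest predicted probability.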

In [None]:
model = LogisticRegression()

Let's start with just the first two dimensions as features (sepal length and width) for simplicity. We will see whether these two features alone are predictive enough to identify the species.

In [None]:
X_sub = X[:, :2]

Separate the data into training and test sets using a 60/40 split.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_sub, y, test_size=0.4, random_state=42)
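
As a quick sanity check on what a 60/40 split means, the sketch below repeats the split on a separate copy of the data (the `demo_` names are introduced here only for illustration) and prints the resulting shapes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Separate names so this sketch does not overwrite the notebook's variables.
demo_X, demo_y = load_iris(return_X_y=True)
demo_X_sub = demo_X[:, :2]  # sepal length and sepal width only

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X_sub, demo_y, test_size=0.4, random_state=42
)

# Iris has 150 samples, so test_size=0.4 holds out 60 and trains on 90.
print(demo_X_train.shape, demo_X_test.shape)
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the training and test sets on every run.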

Train the model. (Learn more about `fit` [here](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit))

In [None]:
model.fit(X_train, y_train)

Predict the labels for the test set.

In [None]:
y_pred = model.predict(X_test)

There are a variety of ways to measure how well a model performs.

The `accuracy_score` we're using (you can read more on it [here](http://scikit-learn.org/stable/modules/model_evaluation.html#accuracy-score) and [here](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)) computes the fraction of predictions that exactly match the true labels. Older scikit-learn documentation notes that, for binary and multiclass problems, this is equal to the [Jaccard Similarity](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html#sklearn.metrics.jaccard_similarity_score) score. Don't worry too much about the details, but know that the reported performance will change depending on which metric you use. You should ensure that the metric used makes sense for the type of data you have. Though for simple purposes, accuracy is usually the one you want.

Read more [here](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics) about other metrics you can use to evaluate the performance of your model.  
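
To see what the accuracy number means, the sketch below computes it by hand on a small made-up set of labels (not Iris predictions) and compares it with `accuracy_score`. It also shows `confusion_matrix`, which breaks the result down per class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration; one of the six predictions is wrong.
demo_true = np.array([0, 1, 2, 2, 1, 0])
demo_pred = np.array([0, 2, 2, 2, 1, 0])

# Accuracy is the fraction of predictions that match the true labels.
demo_manual_accuracy = np.mean(demo_true == demo_pred)
demo_sklearn_accuracy = accuracy_score(demo_true, demo_pred)
print(demo_manual_accuracy, demo_sklearn_accuracy)

# Rows are true classes, columns are predicted classes.
demo_cm = confusion_matrix(demo_true, demo_pred)
print(demo_cm)
```

Off-diagonal entries in the confusion matrix show which classes get mistaken for which, something a single accuracy number hides.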

What is the accuracy of this model? 

In [None]:
accuracy_score(y_test, y_pred)

Where did our model do well, and where did it fail? Let's plot the dataset and find out.

In [None]:
plot_supervised_model('Logistic Regression, 2 features, 40% test set', model, X_test, y_test, y_pred)

##### Now let's explore the effect of training size versus test size.
First, come up with a hypothesis.

How do you think accuracy could change if we made the training set much smaller and the test set much larger?<br />
What if we did the opposite and made the training set much larger and the test set much smaller?

We'll look at the latter scenario.
Let's increase the size of the training set to an 80/20 split.

We use the first two features (sepal length and width) again.

In [None]:
X_sub = X[:, :2]

Separate the data into training and test sets using an *80/20* split this time.

In [None]:
X_train, X_test, y_train, y_test = ## TODO

Train the model and predict the labels for this new model.

In [1]:
## TODO - train
## TODO - predict

What is the accuracy of this new model? Use the `accuracy_score` function from above.

In [2]:
## TODO

Let's plot to check out our model. What information do you get from this plot?

In [None]:
plot_supervised_model('Logistic Regression, 2 features, 20% test set', model, X_test, y_test, y_pred)

#### Let's add in other features

We started with two of the four total features available in this data set. Let's see what happens when we add in another feature (petal length).

We use the first three features now.

In [None]:
X_sub = X[:, :3]

Separate the data into training and test sets using an 80/20 split.

In [None]:
# TODO

Train the model and predict the labels for this new model.

In [None]:
# TODO

What is the accuracy of this new model?

In [None]:
# TODO

Let's plot again and compare with previous models.

In [None]:
# TODO

As an exercise, now use all 4 features. (We will now use the full feature array `X`, not just a subset like in the previous examples.)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

In [None]:
# TODO