# Assignment 04: Scikit Learn Basic Regression and Classification

**Due Date:** Friday 10/11/2019 (by midnight)


## Introduction 

In this exercise we will be redoing the regression problem from assignment 02 and the classification problem
from assignment 03, but we will use the Scikit Learn python machine learning library to perform the
model fitting.

For the first part of this assignment, I recommend looking through the following tutorials on using
Scikit Learn and statsmodels for linear regression:

[A Beginners guide to Linear Regression in Python with Scikit-Learn](https://towardsdatascience.com/a-beginners-guide-to-linear-regression-in-python-with-scikit-learn-83a8f7ae2b4f)

[Use statsmodels to Perform Linear Regression in Python](https://datatofish.com/statsmodels-linear-regression/)

I am using this material as a reference for the first part of this assignment.

**Please fill these in before submitting, just in case I accidentally mix up file names while grading**:

Name: Joe Student

CWID-5: (Last 5 digits of cwid)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# By convention, we often just import the specific classes/functions
# from scikit-learn we will need to train a model and perform prediction.
# Here we include all of the classes and functions you should need for this
# assignment from the sklearn library, but there could be other methods you might
# want to try or would be useful to the way you approach the problem, so feel free
# to import others you might need or want to try
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# statsmodels has an api, it is often imported as sm by convention
import statsmodels.api as sm

%matplotlib inline

In [None]:
plt.rcParams['figure.figsize'] = (10, 8) # set default figure size, 10in by 8in

# Linear Regression with One Variable

## Scikit-Learn LinearRegression model

Load and plot the single-feature data set from assignment 02 again, just to remind us of
the data.  We plot using the seaborn linear model plot, which already fits its own linear regression model to the
data as part of the visualization.  Recall that the data set has profit (y or dependent
variable) for a food truck business as a function of the population size (x or independent variable).

In [None]:
# load the assignment 02 linear regression data (food truck profit vs. population) here

In [None]:
# plot the data here

The tutorial shows an example of building a regression model where some data is held back from training
so that we can evaluate the accuracy of the predictive model on data it has not seen.  Before trying that, let's
fit a linear regression model to all of the data, and compare the fitted parameters with what we obtained when
we implemented linear regression by hand in assignment 02.
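The hold-out idea mentioned above can be sketched as follows. This is a minimal illustration on synthetic data, not the assignment 02 data set; the generated values, the 80/20 split, and the `random_state` are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the assignment 02 data)
rng = np.random.default_rng(1)
X = rng.uniform(5.0, 25.0, size=(200, 1))
y = -4.0 + 1.2 * X[:, 0] + rng.normal(0.0, 1.0, size=200)

# Hold back 20% of the samples; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# score() returns the R^2 of the predictions on the given data
r2_test = model.score(X_test, y_test)
print(r2_test)
```

Scoring on the held-back rows gives an estimate of how well the model generalizes, rather than how well it memorized the training rows.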

In [None]:
# fit the linear regression model to all of the data

In [None]:
# retrieve the intercept and slope

You should compare the intercept and slope you determine here using Scikit Learn with the ones we obtained in
assignment 02.  The slope and intercept should match the ones obtained using the normal
equation at the end of assignment 02, up to floating point precision (and should match your by hand implementation
if you iterate your gradient descent a sufficient number of times).

The slope and intercept we had from assignment 02 were:

intercept: -3.89578088

slope: 1.19303364
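As a rough check of why the two approaches agree, the sketch below fits `LinearRegression` and solves the normal equation on the same data. The data here is synthetic (an assumption made so the example is self-contained), so the numbers it prints will not be the assignment 02 values above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption, not the assignment 02 data)
rng = np.random.default_rng(0)
x = rng.uniform(5.0, 25.0, size=100)
y = -4.0 + 1.2 * x + rng.normal(0.0, 1.0, size=100)

# scikit-learn expects a 2-D feature array of shape (n_samples, n_features)
X = x.reshape(-1, 1)
model = LinearRegression().fit(X, y)

# Normal equation: solve (A^T A) theta = A^T y, where A has a leading
# column of ones for the intercept term
A = np.column_stack([np.ones_like(x), x])
theta = np.linalg.solve(A.T @ A, A.T @ y)

print(model.intercept_, model.coef_[0])
print(theta)
```

Both lines should print the same intercept and slope, since both compute the ordinary least squares solution.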

As shown in the tutorial, use the predict() method of your Scikit Learn regression model to predict
the output for each value of our x feature data.
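A minimal sketch of what `predict()` does, using a tiny made-up data set that lies exactly on the line y = 1 + 2x (an assumption for illustration only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on y = 1 + 2x (illustration only)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression().fit(X, y)

# predict() takes a 2-D array and returns one prediction per row
y_hat = model.predict(X)

# The same values computed by hand from the fitted parameters
y_manual = model.intercept_ + model.coef_[0] * X[:, 0]

print(y_hat)
print(y_manual)
```

For a single-feature linear model, `predict()` is simply the intercept plus the slope times each x value.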

In [None]:
# using predict() from scikit-learn find the predicted or hypothesized profit for each of the model populations


Now we can plot the linear fit line determined by Scikit Learn over our data.

In [None]:
# plot the fitted line using the predict() method from the LinearRegression object

## statsmodels Linear Regression

In contrast to the scikit-learn library, the Python statsmodels library is primarily geared towards statistical
analysis of data, similar to a stats package like SPSS or R.  You can perform a linear regression on a data
set using the statsmodels package, and get much more information about the goodness of the fit from the
constructed model.

In the next cells, create a model using the statsmodels OLS (ordinary least squares) class, fit the model, and use the summary() method
to get information about the fit.

In [None]:
# load the data from our assignment 02 linear regression problem again if needed

In [None]:
# unlike for sklearn library, we actually have to add the dummy feature by hand to 
# represent the intercept feature, it is not assumed automatically by OLS
# use the add_constant() method to add a column to represent our intercept coefficient in the model.


In [None]:
# use the statsmodels summary method to get a summary of the statistical fit of your linear regression.
# Check the fitted parameters to the results from scikit-learn and from your assignment 02.

In the summary you should note that you get the same coefficients as we have determined using all of our
other methods for the linear regression fit to this data.  The $R^2$ measure of the fit is also the same
as what we got for fitting all of the data for the sklearn model.  

The rest of the summary gives statistical information about how well the model fits
the data.  The values in the table under the [0.025 0.975] columns give a 95% confidence interval for
each coefficient.  For example, given the noise in the data and the fit, we are 95% confident that the true
coefficient for the x1 parameter is somewhere between 1.035 and 1.351.  The P>|t| column is also
important here.  This is a P-value that measures how surprised we would be to see this fit if there were
actually no linear relationship between the independent variable and the dependent variable.  The P-values for
both coefficients are essentially 0, which means we would be very surprised to see this fit if there were no linear
relationship between the feature and the dependent variable.  When a P-value is large (above a cutoff,
usually 0.05), we would not be so surprised to see the result even if there were no
relationship.


# Logistic Regression for a Binary Classifier

## scikit-learn LogisticRegression model

Load and plot the exam score data with binary class labels of accepted/not accepted.  Here we
use the features of the Seaborn plotting library again to display a different marker for each
class in a scatter plot.  Recall that this data set has 2 exam scores (exam1 and exam2)
for a number of students, and a binary category for each student of whether they were admitted
or not to the university.  The following tutorials are useful references for this part:


[Logistic Regression using Python (scikit-learn)](https://towardsdatascience.com/logistic-regression-using-python-sklearn-numpy-mnist-handwriting-recognition-matplotlib-a6b31e2b166a)

[Logistic Regression in Python Using statsmodels](http://blog.yhat.com/posts/logistic-regression-python-rodeo.html)

In [None]:
# load the data from our assignment 03 logistic classification problem here

In [None]:
# replot the exam1/exam2 data indicating the binary categories using marker type again here for reference

Now we will use Scikit Learn to fit a model again, but this time we will fit a binary logistic regression classifier
to our data to find the best decision boundary between the two classes.

In the next cell, create the Scikit Learn logistic regression model and fit it to our data.

In [None]:
# Create the scikit-learn LogisticRegression instance here

# and fit your model to the college acceptance using exam scores data here


In assignment 03, we used an optimization method to optimize our cost and gradient functions and find the
best parameters.  For a binary classifier with 2 features like this, there are 3 parameters in the model,
the intercept, and the parameters for exam1 and exam2 that were fit to define the decision boundary.

In assignment 03, our by hand solution using a scipy optimizer found the following parameters, where the first
parameter is the intercept, and the second and third are the theta parameters fit for the exam1 and exam2
feature respectively:

[-25.16133586   0.20623176   0.2014716 ]

In the model returned by Scikit Learn, the intercept_ should correspond and match the intercept value,
and the coef_ should match the exam1 and exam2 coefficient parameters.
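The sketch below shows where these attributes live and what shapes they have. It uses synthetic exam-like scores and noisy labels (assumptions for illustration), so the fitted values will not match the assignment 03 parameters; `max_iter=1000` is also just a choice made here to give the solver room to converge on unscaled features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic exam-like scores and noisy admit labels (assumptions)
rng = np.random.default_rng(0)
X = rng.uniform(30.0, 100.0, size=(200, 2))
noise = rng.normal(0.0, 10.0, size=200)
y = (X[:, 0] + X[:, 1] + noise > 130.0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# intercept_ has shape (1,), coef_ has shape (1, n_features)
print(clf.intercept_)
print(clf.coef_)

# Flatten into the [intercept, exam1, exam2] layout used in assignment 03
theta = np.concatenate([clf.intercept_, clf.coef_[0]])
print(theta)
```

For a binary problem `coef_` has a single row, so `coef_[0]` holds the two feature coefficients.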

In [None]:
# display the intercept and the model coefficients for the exam1 and exam2 feature here


The parameters in this case might not exactly match because of differences in the optimization methods
(and because Scikit Learn's LogisticRegression applies L2 regularization by default), but
they should be close and form almost the same decision boundary.

As we did in assignment 03, for a data set with 2 features we can use the intercept and coefficients to
visualize the decision boundary specified by the fitted logistic regression model.
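The boundary is the set of points where the model's linear term is zero, that is $\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0$, which can be solved for $x_2$. A small sketch using the assignment 03 parameters quoted above (the range of exam1 scores used for the endpoints is an assumption):

```python
import numpy as np

# Parameters reported above from assignment 03: intercept, exam1, exam2
theta0, theta1, theta2 = -25.16133586, 0.20623176, 0.2014716

# Two exam1 scores are enough to define the line (range is an assumption)
exam1 = np.array([30.0, 100.0])

# Solve theta0 + theta1 * x1 + theta2 * x2 = 0 for x2
exam2 = -(theta0 + theta1 * exam1) / theta2

print(exam1, exam2)
# plt.plot(exam1, exam2) would then draw the boundary over the scatter plot
```

With a fitted Scikit Learn model, the same formula applies with `theta0 = clf.intercept_[0]` and `theta1, theta2 = clf.coef_[0]`, where `clf` is a hypothetical name for your fitted classifier.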

In [None]:
# plot the decision boundary line found by the scikit-learn logistic classification

## statsmodels Logistic Regression

Likewise, use the statsmodels library to redo the logistic regression classification of the admit/not admit
data set once again.  In the following cells, load the data, add the constant column needed by statsmodels
to fit the intercept parameter, then create an instance and fit the model, and show a summary
of your logistic regression results.

In [None]:
# get fresh reload of the data if needed here to ensure you have correct starting values of the assignment 03
# classification data


In [None]:
# unlike for sklearn library, we actually have to add the dummy feature by hand to 
# represent the intercept feature, it is not assumed automatically by Logit
# make sure you add the intercept feature column here before fitting the model.
# (x and y are assumed to hold the features and labels loaded in the previous cell)
x = sm.add_constant(x)
model = sm.Logit(y, x).fit()

print(model.summary())

In [None]:
# create an instance of the statsmodels Logit model (you don't need MNLogit here since this is
# a binary classification task).

In [None]:
# fit your model to get a statsmodels model fit wrapper

In [None]:
# display a summary of the fit of your classifier.  You might want to compare your intercept and
# fitted coefficients again, though because of differences in the optimizers/solvers used your coefficients
# may not match exactly, unlike what we saw with the linear regression fit.