# In this lecture we'll review some concepts of machine learning with examples
# I will also give a rundown of sklearn

In [None]:
#standard import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### You all should already have sklearn installed

In [None]:
import sklearn
#Let me know if this doesn't work

# What is sklearn?

At its core, sklearn (scikit-learn) is a general-purpose Python package for machine learning.

# Features of sklearn
* Load example datasets
* Build, train, and test ML models
    * Classification
    * Regression
    * Clustering
* Preprocessing
    * Could do all of this in Pandas but sklearn makes it easier
* Dimensionality Reduction
    * We may discuss this Friday
* Model selection
    * Won't have time to discuss
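
To make the preprocessing bullet concrete, here is a minimal sketch comparing hand-rolled standardization in pandas with sklearn's `StandardScaler`. The tiny `rooms`/`age` table is made up for illustration and is not one of the lecture datasets.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up toy table, only for illustration
df = pd.DataFrame({"rooms": [3.0, 4.0, 5.0, 6.0],
                   "age": [10.0, 20.0, 40.0, 80.0]})

# pandas version: subtract the mean, divide by the standard deviation.
# ddof=0 matches what StandardScaler uses (pandas defaults to ddof=1).
manual = (df - df.mean()) / df.std(ddof=0)

# sklearn version: fit learns the mean and scale, transform applies them
scaler = StandardScaler()
scaled = scaler.fit_transform(df)

print(np.allclose(manual.to_numpy(), scaled))

# The fitted scaler remembers the training statistics,
# so new rows are scaled with the same mean and scale
new_rows = pd.DataFrame({"rooms": [4.5], "age": [37.5]})
print(scaler.transform(new_rows))
```

Both give the same numbers. What sklearn adds is a fitted object you can reuse on the test set later.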
    
# When you wouldn't use sklearn
* Specialized ML
    * Custom loss
    * Active Learning
        * Libact
    * Semisupervised learning
        * Other libraries
    * Neural Networks beyond simple ones, and GPU support
        * TensorFlow, Keras, PyTorch (my favorite)

## Load example datasets
An easy way to test out sklearn functions without having to look for a dataset yourself.

In my own research in the CS department, where we build novel algorithms, I use these datasets for testing.

https://scikit-learn.org/stable/datasets/index.html#toy-datasets

In [None]:
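#Let's load the boston one
#Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this import needs an older version
from sklearn.datasets import load_boston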

In [None]:
dataset = load_boston()
dataset

In [None]:
X = dataset.data

In [None]:
y = dataset.target

In [None]:
dataset.feature_names

In [None]:
X_frame = pd.DataFrame(X,columns=dataset.feature_names)

In [None]:
y_ser = pd.Series(y,name='price')
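
As an aside, many of the dataset loaders can hand back pandas objects directly, which saves the manual `pd.DataFrame` and `pd.Series` steps above. The sketch below uses `load_diabetes` as a stand-in regression dataset (my substitution, since `load_boston` is missing from recent scikit-learn releases), and the variable names are only illustrative.

```python
from sklearn.datasets import load_diabetes

# as_frame=True asks sklearn to return pandas objects instead of NumPy arrays
bunch = load_diabetes(as_frame=True)

X_df = bunch.data    # DataFrame, one column per feature
y_s = bunch.target   # Series with the regression target
full = bunch.frame   # features and target together in one DataFrame

print(X_df.shape, y_s.shape, full.shape)
print(list(X_df.columns))
```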

In [None]:
#Let's plot pupil-teacher ratio (PTRATIO) versus price
plt.plot(X_frame.PTRATIO.values,y_ser.values,'o')

### We want to fit a line to this dataset that says: if your student-teacher ratio is around here, then the house is worth around here. In practice, we will use all features at once.

### We cannot do this yet. We must talk about train/test.

# Train/Test Split.
Who has heard of this concept in ML?

How about the bias variance tradeoff?

<img src="images/b_v.png">

For a very complicated model like a neural net, we may fit the quirks of the particular dataset too well and fail to generalize. Hence, we split our dataset into a training set that the model sees during fitting and a test set that it does not. If performance is roughly the same on both, then we are doing well. What do I mean by performance?

<img src="images/tt_split.png">

It's not much of a concern for linear regression, but we'll do it anyway.

# There is an sklearn function for it

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y)

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
X.shape
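
The shapes above come from the default split, which holds out 25% of the rows. Here is a small sketch on made-up random data (not the housing data) showing the default next to an explicit `test_size`. `random_state` just makes the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 3 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# Default: 75% of rows go to train, 25% to test
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(X_tr.shape, X_te.shape)

# Explicit 80/20 split
X_tr80, X_te20, y_tr80, y_te20 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr80.shape, X_te20.shape)
```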

# Fitting our first sklearn model

Find your model and import it. I know where linear regression is.

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
#standard
reg = LinearRegression()
#reg for regressor

In [None]:
# Fit the best fit line to the training data


reg.fit(X_train,y_train)

In [None]:
#Predict house prices on the test set
preds = reg.predict(X_test)
preds 

In [None]:
#what is the average square error?
y_test

In [None]:
preds - y_test

In [None]:
((preds - y_test)**2).mean()

In [None]:
#Another way that is more interpretable 
reg.score(X_test,y_test)
#Returns R^2 score: closer to 1 is a better fit, near 0 is no better than always predicting the mean, and it can go negative
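
To see what these two numbers are, here is a sketch that computes mean squared error and R^2 by hand and checks them against sklearn's own functions. It uses `load_diabetes` as a stand-in regression dataset (my substitution, not the housing data), and the variable names are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X_d, y_d = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)
p = model.predict(X_te)

# Mean squared error: average of the squared residuals
mse_manual = ((p - y_te) ** 2).mean()
mse_sklearn = mean_squared_error(y_te, p)

# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = ((y_te - p) ** 2).sum()
ss_tot = ((y_te - y_te.mean()) ** 2).sum()
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mse_sklearn)
print(r2_manual, r2_score(y_te, p), model.score(X_te, y_te))
```

R^2 is easier to read than MSE because it does not depend on the units of the target.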

In [None]:
#make sure we aren't overfitting, see if r^2 similar for train
reg.score(X_train,y_train)
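
For contrast, here is roughly what overfitting looks like. This sketch compares linear regression with an unrestricted decision tree, a much more flexible model. The tree and the `load_diabetes` dataset are my choices for illustration and are not part of the lecture so far.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X_d, y_d = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, random_state=0)

linear = LinearRegression().fit(X_tr, y_tr)
# No depth limit, so the tree is free to memorize the training rows
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

for name, m in [("linear", linear), ("tree", tree)]:
    train_r2 = m.score(X_tr, y_tr)
    test_r2 = m.score(X_te, y_te)
    print(f"{name}: train R^2 = {train_r2:.2f}, test R^2 = {test_r2:.2f}")
```

A large gap between the train and test score is the warning sign to look for.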


# What features were important?

In [None]:
#the fitted regressor is called reg (clf is not defined until later)
reg.coef_

In [None]:
pd.Series(reg.coef_,index = dataset.feature_names)

In [None]:
dataset.DESCR

### Reading importance off the coefficients like this only works for linear models
* Linear and Logistic Regression
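
One caveat when reading coefficients: their size depends on the units of each feature. The sketch below uses made-up data with two features on very different scales and compares raw coefficients with coefficients fit on standardized features. The names `x_small` and `x_large` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
x_small = rng.normal(0, 1, n)      # spread of about 1
x_large = rng.normal(0, 1000, n)   # spread of about 1000
y_toy = 2.0 * x_small + 0.003 * x_large + rng.normal(0, 0.1, n)
X_toy = np.column_stack([x_small, x_large])

# Fit on the raw features
raw = LinearRegression().fit(X_toy, y_toy)

# Fit on standardized features (each column has mean 0 and std 1)
X_scaled = StandardScaler().fit_transform(X_toy)
scaled = LinearRegression().fit(X_scaled, y_toy)

coefs = pd.DataFrame(
    {"raw": raw.coef_, "standardized": scaled.coef_},
    index=["x_small", "x_large"],
)
print(coefs)
```

On the raw scale `x_small` has the larger coefficient, but after standardizing `x_large` does, because a typical change in `x_large` moves the target more.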

# Classification with logistic regression

You should all be familiar with logistic regression. We've seen it for binary classification. What does this mean?

Now we will see it can generalize to more classes.

In [None]:
from sklearn.linear_model import LogisticRegression

#Why is logistic regression linear?

clf = LogisticRegression()
#clf for classifier

### A new dataset
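https://en.wikipedia.org/wiki/MNIST_database

Note that sklearn's `load_digits` is a smaller set of 8x8-pixel digit images in the same spirit as MNIST, not MNIST itself.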

In [None]:
from sklearn.datasets import load_digits

In [None]:
dataset = load_digits()

In [None]:
X = dataset.data
#pixels

In [None]:
y = dataset.target
#what the number represents, may seem quantitative, but it's qualitative

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,stratify=y)
#What does stratify do?
#makes sure roughly same proportion of each class in train and test
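
The effect of `stratify` is easiest to see on imbalanced labels. This is a small sketch on made-up labels (90 zeros and 10 ones), not the digits data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 90% class 0, 10% class 1
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.arange(100).reshape(-1, 1)

# Plain random split: the share of ones in the test set is left to chance
_, _, _, y_te_plain = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=0
)

# Stratified split: class proportions are preserved in both parts
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=0, stratify=y_imb
)

print("plain test share of ones:     ", y_te_plain.mean())
print("stratified test share of ones:", y_te_strat.mean())
print("stratified train share of ones:", y_tr_strat.mean())
```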

In [None]:
pd.Series(y_train).value_counts(normalize=True, sort=False)

In [None]:
pd.Series(y_test).value_counts(normalize=True, sort=False)

In [None]:
#fit on training data
clf.fit(X_train,y_train)

In [None]:
#predict test
preds = clf.predict(X_test)
preds

In [None]:
#we can also predict probabilities for each class
probs = clf.predict_proba(X_test)
probs

In [None]:
pd.DataFrame(probs)
#just chooses largest in each row as prediction
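
A quick check of that last comment: taking the most probable column in each row should reproduce `predict`. This is a self-contained sketch on the digits data. `max_iter=5000` is my addition so the solver has room to converge, and the variable names are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_dig, y_dig = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_dig, y_dig, stratify=y_dig, random_state=0
)

model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)

# One column per class, and each row sums to 1
print(proba.shape, model.classes_)

# Column index of the largest probability, mapped back to a class label
from_proba = model.classes_[proba.argmax(axis=1)]
print((from_proba == model.predict(X_te)).all())
```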

In [None]:
#measure performance
(preds == y_test).mean()

In [None]:
#Exactly the same thing
clf.score(X_test,y_test)

In [None]:
#See if overfit
clf.score(X_train,y_train)

#We are in fact overfitting a bit. How to not overfit?
#https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
#look into the 'C' parameter. C is the inverse of regularization strength, so a smaller C means more regularization and a less complicated fit
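
Here is one way to try out different amounts of regularization: a sketch that refits the classifier for a few values of `C` and prints train and test accuracy. The particular values of `C` and `max_iter=5000` are my choices for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_dig, y_dig = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_dig, y_dig, stratify=y_dig, random_state=0
)

results = {}
for C in [0.001, 0.01, 1.0]:
    # Smaller C = stronger regularization
    model = LogisticRegression(C=C, max_iter=5000).fit(X_tr, y_tr)
    results[C] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"C={C}: train acc = {results[C][0]:.3f}, test acc = {results[C][1]:.3f}")
```

Look at how the gap between train and test accuracy changes as `C` shrinks. To choose `C` properly you would use a validation set or cross-validation rather than the test set.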

# Next: Clustering and Kmeans