**scikit-learn refresher**
___

In [None]:
#KNN classification

#In this exercise you'll explore a subset of the Large Movie Review Dataset.
#The variables X_train, X_test, y_train, and y_test are already loaded
#into the environment. The X variables contain features based on the words
#in the movie reviews, and the y variables contain labels for whether the
#review sentiment is positive (+1) or negative (-1).

from sklearn.neighbors import KNeighborsClassifier

# Create and fit the model
knn = KNeighborsClassifier()
#knn.fit(X_train, y_train)

# Predict on the test features, print the results
#pred = knn.predict(X_test)[0]
#print("Prediction for test example 0:", pred)

#################################################
#<script.py> output:
#    Prediction for test example 0: 1.0

In [None]:
#Comparing models
# Create two instances of KNeighborsClassifier, with n_neighbors=1 and n_neighbors=5
# Fit both instances on the training data

# Compare accuracy on the test set
#knn1.score(X_test, y_test)
#knn5.score(X_test, y_test)
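
A runnable sketch of this comparison. It assumes the digits dataset as a stand-in, because the movie review features are not available outside the exercise environment; random_state=0 is an arbitrary choice, and the names knn1 and knn5 follow the scoring calls above.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): the digits dataset, not the movie reviews
digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

# Create and fit one model per value of n_neighbors
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
knn5 = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Accuracy on the held-out test set
acc1 = knn1.score(X_test, y_test)
acc5 = knn5.score(X_test, y_test)
print("n_neighbors=1 test accuracy:", acc1)
print("n_neighbors=5 test accuracy:", acc5)
```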

Review:
- **Underfitting**: the model is too simple, so training accuracy is low
- **Overfitting**: the model is too complex, so test accuracy is low even though training accuracy is high

An example of **overfitting**:
- Training accuracy 95%, testing accuracy 50%.

*Overfitting refers to doing substantially better on the training set than on the test set.*
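
One way to check for overfitting is to compare training and test accuracy directly. The sketch below assumes the digits dataset as stand-in data and uses a 1-nearest-neighbor model, which effectively memorizes the training set. On data this easy the gap may be small; the point is only how to measure it.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): the digits dataset
digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

# With n_neighbors=1, each training point is its own nearest neighbor
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

train_acc = knn.score(X_train, y_train)
test_acc = knn.score(X_test, y_test)
print("train:", train_acc, "test:", test_acc, "gap:", train_acc - test_acc)
```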
___

**Applying logistic regression and Support Vector Machines/Classifier (SVM/SVC)**
___
- SVC fits a non-linear SVM by default (its default kernel is RBF)
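
A short sketch of how the kernel choice shows up in code. The variable names are illustrative only.

```python
from sklearn.svm import SVC, LinearSVC

svm_default = SVC()                 # default kernel is "rbf" (non-linear)
svm_linear = SVC(kernel="linear")   # same class, linear decision boundary
linear_svc = LinearSVC()            # separate, linear-only implementation

print(svm_default.kernel, svm_linear.kernel)
```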

In [1]:
#Running LogisticRegression and SVC

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn import datasets

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target)

# Apply logistic regression and print scores
lr = LogisticRegression()
lr.fit(X_train, y_train)
print(lr.score(X_train, y_train))
print(lr.score(X_test, y_test))

# Apply SVM and print scores
svm = SVC()
svm.fit(X_train, y_train)
print(svm.score(X_train, y_train))
print(svm.score(X_test, y_test))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


1.0
0.9688888888888889
0.9955456570155902
0.9888888888888889


In [3]:
#Sentiment analysis for movie reviews

from sklearn.linear_model import LogisticRegression

# Instantiate logistic regression and train
lr = LogisticRegression()
#lr.fit(X, y)

# Predict sentiment for a glowing review
review1 = "LOVED IT! This movie was amazing. Top 10 this year."
#review1_features = get_features(review1)
print("Review:", review1)
#print("Probability of positive review:", lr.predict_proba(review1_features)[0,1])

# Predict sentiment for a poor review
review2 = "Total junk! I'll never watch a film by that director again, no matter how good the reviews."
#review2_features = get_features(review2)
print("Review:", review2)
#print("Probability of positive review:", lr.predict_proba(review2_features)[0,1])

#################################################
#<script.py> output:
#    Review: LOVED IT! This movie was amazing. Top 10 this year.
#    Probability of positive review: 0.8079007873616059
#    Review: Total junk! I'll never watch a film by that director again, no matter how good the reviews.
#    Probability of positive review: 0.5855117402793947

Review: LOVED IT! This movie was amazing. Top 10 this year.
Review: Total junk! I'll never watch a film by that director again, no matter how good the reviews.
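
The cell above depends on X, y, and get_features from the exercise environment. A self-contained sketch of the same idea follows, under loud assumptions: the four training reviews are made up for illustration, and get_features is a hypothetical helper built on CountVectorizer, not the course's actual feature function.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data (made up for illustration, not the real dataset)
train_reviews = [
    "loved it, amazing movie",
    "great film, wonderful acting",
    "total junk, terrible movie",
    "boring and awful, never again",
]
train_labels = [1, 1, -1, -1]  # +1 positive, -1 negative

# Word-count features learned from the toy reviews
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_reviews)
lr = LogisticRegression().fit(X, train_labels)

def get_features(review):
    """Hypothetical helper: turn one review string into a 1-row feature matrix."""
    return vectorizer.transform([review])

review1 = "LOVED IT! This movie was amazing. Top 10 this year."
proba = lr.predict_proba(get_features(review1))

# Columns follow lr.classes_, here [-1, 1], so column 1 is the positive class
print("Probability of positive review:", proba[0, 1])
```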


**Linear classifiers**
___
Definitions
- **classification**: learning to predict categories
- **decision boundary**: the surface separating different predicted classes
- **linear classifier**: a classifier that learns linear decision boundaries
    - e.g., logistic regression, linear SVM
- **linearly separable**: a data set whose classes can be perfectly separated by a linear decision boundary
___
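
To make "linearly separable" concrete, the sketch below fits logistic regression to two tiny hand-made data sets (both are assumptions for illustration). The first has two well-separated clusters; the second is an XOR pattern, which no single straight line can split, so a linear classifier cannot reach perfect training accuracy on it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two well-separated clusters: linearly separable
X_sep = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]])
y_sep = np.array([0, 0, 0, 1, 1, 1])

# XOR pattern: no single straight line separates the classes
X_xor = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])
y_xor = np.array([0, 0, 1, 1])

# Training accuracy of a linear classifier on each data set
acc_sep = LogisticRegression().fit(X_sep, y_sep).score(X_sep, y_sep)
acc_xor = LogisticRegression().fit(X_xor, y_xor).score(X_xor, y_xor)
print("separable:", acc_sep, "XOR:", acc_xor)
```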


In [6]:
#Visualizing decision boundaries

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier

wine = datasets.load_wine()
X = wine.data
y = wine.target

# Define the classifiers
classifiers = [LogisticRegression(), LinearSVC(),
               SVC(), KNeighborsClassifier()]

# Fit the classifiers
for c in classifiers:
    c.fit(X, y)

# Plot the classifiers (plot_4_classifiers is a helper built from previously defined functions)
#plot_4_classifiers(X, y, classifiers)
#plt.show()

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


![images/9.1.svg](images/9.1.svg)

**Linear classifiers: the coefficients**
___
- Dot products
    - multiply the two arrays element-wise
    - sum the elements of the resulting array
    - in Python: x@y
- Linear classifier prediction
    - raw model output = coefficients (.coef_) @ features + intercept (.intercept_)
    - if the raw output is positive, predict one class; if negative, predict the other class
- This is the same for logistic regression and linear SVM
    - i.e. 'fit' is different but 'predict' is the same
- **intercept** changes boundary location but not orientation
- **coefficients** change orientation of boundary line
___
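
The prediction rule above can be checked against scikit-learn directly. This sketch assumes a small hand-made binary data set; it computes the dot product by hand, forms the raw model output, and compares it with decision_function and predict.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary data (assumption, for illustration only)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression().fit(X, y)

x = X[3]             # one example's features
w = model.coef_[0]   # coefficients for the binary problem

# Dot product by hand: multiply element-wise, then sum
manual_dot = np.sum(w * x)

# Raw model output = coefficients @ features + intercept
raw = w @ x + model.intercept_[0]
sk_raw = model.decision_function(x.reshape(1, -1))[0]
print(raw, sk_raw)

# Positive raw output -> second class in classes_, otherwise the first
pred = model.classes_[int(raw > 0)]
sk_pred = model.predict(x.reshape(1, -1))[0]
print(pred, sk_pred)
```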

In [None]:
#Changing the model coefficients

#When you call fit with scikit-learn, the logistic regression coefficients
#are automatically learned from your dataset. In this exercise you will
#explore how the decision boundary is represented by the coefficients. To
#do so, you will change the coefficients manually (instead of with fit),
#and visualize the resulting classifiers.

#A 2D dataset is already loaded into the environment as X and y,
#along with a linear classifier object model

# Set the coefficients
#model.coef_ = np.array([[0,1]])
#model.intercept_ = np.array([0])

# Plot the data and decision boundary using preset function
#plot_classifier(X,y,model)

# Print the number of errors
#num_err = np.sum(y != model.predict(X))
#print("Number of errors:", num_err)

#################################################
#<script.py> output:
#    Number of errors: 3

![images/9.2.svg](images/9.2.svg)
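
A self-contained sketch of the same experiment, without the preloaded X, y, model, or plot_classifier. The four data points are an assumption chosen so that the class depends only on the sign of the second feature. The model is fitted once first so that it has the attributes predict needs (such as classes_) before the coefficients are overwritten.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 2D data (assumption): class 1 when the second feature is positive
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, 1.0], [-2.0, -3.0]])
y = np.array([1, 0, 1, 0])

# Fit once so the model has the attributes predict() relies on
model = LogisticRegression().fit(X, y)

# Overwrite the learned parameters: the boundary is the line x2 = 0
model.coef_ = np.array([[0.0, 1.0]])
model.intercept_ = np.array([0.0])
errors_horizontal = np.sum(y != model.predict(X))

# Change the coefficients to rotate the boundary to the line x1 = 0
model.coef_ = np.array([[1.0, 0.0]])
errors_vertical = np.sum(y != model.predict(X))

print("Number of errors:", errors_horizontal, errors_vertical)
```

Changing only the coefficients rotates the boundary, which is why the error count changes even though the intercept stays at zero.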

**What is a loss function?**
___
- Least squares: the squared loss (linear regression)
    - minimizes the sum of squared errors made on the training set
    - $$\sum_{i=1}^{n}(\text{true } i\text{th target value} - \text{predicted } i\text{th target value})^{2}$$
- **loss function**
    - penalty score that tells us how well or poorly the model is doing on the training data
- **fit function**
    - minimizes the loss
- *squared loss/error is not appropriate for classification problems*
    - a natural loss for a classification problem is the number of errors
    - this is the 0-1 loss: 0 for a correct prediction, 1 for an incorrect prediction
- minimizing loss using Python
    - from **scipy.optimize** import **minimize**
    - inputs are the values of the model coefficients
    - what values of the model coefficients make my squared error as small as possible?
___
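
The two losses can be computed in a couple of lines of NumPy. The label arrays below are made up for illustration.

```python
import numpy as np

# Made-up true and predicted labels (+1 / -1)
y_true = np.array([1, -1, 1, 1])
y_pred = np.array([1, 1, 1, -1])

# Squared loss: sum of squared differences (used for regression)
squared_loss = np.sum((y_true - y_pred) ** 2)

# 0-1 loss: number of incorrect predictions (natural for classification)
zero_one_loss = np.sum(y_true != y_pred)

print(squared_loss, zero_one_loss)
```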

In [None]:
#Minimizing a loss function

#In this exercise you'll implement linear regression "from scratch" using
#scipy.optimize.minimize.

#We'll train a model on the Boston housing price data set, which is
#already loaded into the variables X and y. For simplicity, we won't
#include an intercept in our regression model.

from scipy.optimize import minimize
from sklearn.linear_model import LinearRegression

# The squared error, summed over training examples
def my_loss(w):
    s = 0
    for i in range(y.size):
        # Get the true and predicted target values for example 'i'
        y_i_true = y[i]
        y_i_pred = w@X[i]
        s = s + (y_i_true - y_i_pred)**2
    return s

# Returns the w that makes my_loss(w) smallest
#w_fit = minimize(my_loss, X[0]).x
#print(w_fit)

# Compare with scikit-learn's LinearRegression coefficients
#lr = LinearRegression(fit_intercept=False).fit(X,y)
#print(lr.coef_)

#################################################
#<script.py> output:
#    [-9.16299112e-02  4.86754828e-02 -3.77698794e-03  2.85635998e+00
#     -2.88057050e+00  5.92521269e+00 -7.22470732e-03 -9.67992974e-01
#      1.70448714e-01 -9.38971600e-03 -3.92421893e-01  1.49830571e-02
#     -4.16973012e-01]
#    [-9.16297843e-02  4.86751203e-02 -3.77930006e-03  2.85636751e+00
#     -2.88077933e+00  5.92521432e+00 -7.22447929e-03 -9.67995240e-01
#      1.70443393e-01 -9.38925373e-03 -3.92425680e-01  1.49832102e-02
#     -4.16972624e-01]
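
A runnable variant of this exercise on synthetic data, since X and y are only preloaded in the exercise environment. The data-generating coefficients, noise level, and seed are assumptions for illustration, and the loss is written in vectorized form instead of a loop. The starting point is a vector of zeros.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption): 50 examples, 3 features, no intercept
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

def my_loss(w):
    # Squared error summed over training examples, vectorized
    return np.sum((y - X @ w) ** 2)

# Returns the w that makes my_loss(w) smallest
w_fit = minimize(my_loss, np.zeros(X.shape[1])).x
print(w_fit)

# Compare with scikit-learn's LinearRegression coefficients
lr = LinearRegression(fit_intercept=False).fit(X, y)
print(lr.coef_)
```

The two printed vectors should agree closely, because both procedures minimize the same squared loss.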

**Loss function diagrams**
___


