<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Group Lab: Model Debugging

_Author: Dan Wilhelm_

---

What is wrong, if anything, with the following models?

The models are ordered (approximately) by difficulty. Can you debug them all?

```python
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import r2_score, accuracy_score

# load_boston was removed in scikit-learn 1.2; this import needs an older version
from sklearn.datasets import load_iris, load_boston, load_breast_cancer
```

---

### Model 1

Two classes: Malignant and Benign

+ We expect a test accuracy > 90%.
+ However, the test accuracy is only ~60%! 

Why?

```python
bc_data = load_breast_cancer()

NPOINTS = 500

X = bc_data.data[:NPOINTS]
y = bc_data.target[:NPOINTS]

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.5)
```

```python
lr = LogisticRegression(solver='lbfgs', max_iter=10000)
lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)

print(f'TEST ACCURACY: {accuracy_score(y_test, y_pred)}')
```

```
TEST ACCURACY: 0.98
```

```
array([0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1,
       1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0,
       0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1,
       0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1,
       1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1,
       1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1,
       0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1,
       1, 0, 0, 0, 0, 1, 1, 0])
```
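When an accuracy number looks suspicious, one quick sanity check is to compare it with a trivial baseline that always predicts the most common training class. The sketch below uses scikit-learn's `DummyClassifier(strategy='most_frequent')` on the same first 500 rows; the fixed `random_state=0` is an added assumption so the split is reproducible, so its numbers will not match the run above exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

bc_data = load_breast_cancer()
X = bc_data.data[:500]
y = bc_data.target[:500]

# random_state is an added assumption so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Baseline: always predict the most frequent class seen in y_train
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

# The same logistic regression as in the cell above
model = LogisticRegression(solver='lbfgs', max_iter=10000)
model.fit(X_train, y_train)
model_acc = accuracy_score(y_test, model.predict(X_test))

print(f'Majority-class baseline accuracy: {baseline_acc:.3f}')
print(f'Logistic regression accuracy:     {model_acc:.3f}')
```

A model whose test accuracy sits near this baseline has learned little beyond the class proportions, which narrows down where to look.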

---

### Model 2

+ The Iris data is easily classifiable: a reasonable model should reach an accuracy above 95%.
+ So why is the test accuracy only 42%?

```python
iris = load_iris()

X = iris.data
y = iris.target
```

```python
# Manually split the data
# 70% Train - 30% Test Split
total_pts = len(iris.data)
num_train_pts = int(total_pts * 0.7)

X_train = X[:num_train_pts]
y_train = y[:num_train_pts]

X_test = X[num_train_pts:]
y_test = y[num_train_pts:]

print('TRAIN SET SIZE:', X_train.shape, y_train.shape)
print('TEST SET SIZE:', X_test.shape, y_test.shape)
```

```
TRAIN SET SIZE: (105, 4) (105,)
TEST SET SIZE: (45, 4) (45,)
```


```python
# Create a logreg model for predicting three target classes (note: this line is not the problem)
lr = LogisticRegression(solver='lbfgs', multi_class='multinomial')

lr.fit(X_train, y_train)

y_test_pred = lr.predict(X_test)

print(f'FINAL ACCURACY ON TEST SET: {accuracy_score(y_test, y_test_pred)}')
```

```
FINAL ACCURACY ON TEST SET: 0.4222222222222222
```
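Because the split above is done by slicing, a natural thing to inspect is how many examples of each class end up on each side. The sketch below repeats the same 70/30 slice and counts labels with `np.bincount`, passing `minlength=3` so that a class with no examples still shows up as a zero.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Same slicing-based 70/30 split as in the cell above
num_train_pts = int(len(X) * 0.7)
y_train = y[:num_train_pts]
y_test = y[num_train_pts:]

# Count labels per class; minlength=3 keeps classes with zero examples
train_counts = np.bincount(y_train, minlength=3)
test_counts = np.bincount(y_test, minlength=3)

print('Train class counts:', train_counts)
print('Test class counts: ', test_counts)
```

Comparing these counts with the label distribution of the full dataset is a cheap check worth running after any train/test split.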


---

### Model 3

+ The Iris data is easily classifiable: a reasonable model should reach an accuracy above 95%.
+ So why is the test accuracy only 68%?

```python
iris = load_iris()

X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, random_state=14227)
```

```python
# Create a logreg model for predicting three target classes (note: this line is not the problem)
lr = LogisticRegression(solver='lbfgs', multi_class='multinomial', max_iter=50000)

# Train a logistic regression on all of the features
lr.fit(X_train, y_train)

y_test_pred = lr.predict(X_test)

print(f'FINAL ACCURACY ON TEST SET: {accuracy_score(y_test, y_test_pred)}')
```

```
FINAL ACCURACY ON TEST SET: 0.6842105263157895
```
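Before hunting for a bug, it can help to pin down what accuracy this data and model family normally achieve, and whether the cell's code reproduces its own output when run on its own. The sketch below estimates a reference accuracy with 5-fold cross-validation on the full Iris data, then re-runs the exact `random_state=14227` split outside the notebook. The `cv=5` and `max_iter=1000` settings are assumptions, and the sketch relies on recent scikit-learn defaults, where `lbfgs` already fits a multinomial model for multi-class targets. A clear discrepancy with the notebook's output would suggest looking beyond the lines shown in the cell.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# Reference point: 5-fold cross-validated accuracy on the full dataset.
# For classifiers, an integer cv uses stratified folds.
cv_acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f'Mean 5-fold CV accuracy: {np.mean(cv_acc):.3f}')

# Re-run the exact split from the notebook cell in a standalone script
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14227)
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
split_acc = accuracy_score(y_test, lr.predict(X_test))
print(f'Accuracy on the random_state=14227 split: {split_acc:.3f}')
```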


---

### Model 4

+ The test $R^2$ is exactly 1.0 on every cross-validation fold!
+ Why do we get a perfect score on data the model was not trained on?

```python
# First, format our data in a DataFrame

def get_boston_df():
    boston = load_boston()

    df = pd.DataFrame(boston.data, columns=boston.feature_names)
    df['MEDV'] = boston.target

    return df

df = get_boston_df()
```
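Because `load_boston` no longer exists in recent scikit-learn, Models 4 and 5 need another source for the same data. A possible drop-in replacement is sketched below, assuming the original CMU StatLib copy at `http://lib.stat.cmu.edu/datasets/boston` is still reachable. It follows the workaround given in scikit-learn's deprecation message: after a 22-line description, each record spans two text lines. The `source` parameter is a hypothetical addition so the parser can also read a local file; with this version, also drop `load_boston` from the first cell's imports.

```python
import numpy as np
import pandas as pd

BOSTON_URL = 'http://lib.stat.cmu.edu/datasets/boston'
BOSTON_FEATURES = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                   'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']


def get_boston_df(source=BOSTON_URL):
    """Build the Boston DataFrame without sklearn's load_boston.

    `source` (hypothetical parameter) may be a URL, a path, or a file-like object.
    """
    # 22 description lines precede the data; fields are whitespace-separated
    raw_df = pd.read_csv(source, sep=r'\s+', skiprows=22, header=None)

    # Each record spans two lines: 11 features, then 2 more features + MEDV
    data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
    target = raw_df.values[1::2, 2]

    df = pd.DataFrame(data, columns=BOSTON_FEATURES)
    df['MEDV'] = target
    return df


df = get_boston_df()  # downloads from BOSTON_URL by default
```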

```python
stats_X = df.iloc[:,1:]        # All columns except MEDV
stats_y = df.MEDV              # Price of home

linreg = LinearRegression()

# R^2 scores using K folds
linreg_scores = cross_val_score(linreg, stats_X, stats_y, cv=10)

print(f'Cross-validation R^2 values: {linreg_scores}')
print(f'Mean cross-validated R^2: {np.mean(linreg_scores)}')
```

```
Cross-validation R^2 values: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Mean cross-validated R^2: 1.0
```
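A perfect held-out score is often a sign that information about the target has leaked into the features. One generic check is to flag any feature whose absolute Pearson correlation with the target is essentially 1. The sketch below demonstrates this on a small synthetic DataFrame; the helper name `find_leaky_columns`, the column names, and the 0.99 threshold are illustrative assumptions. The same helper could be applied to `stats_X` and `stats_y`.

```python
import numpy as np
import pandas as pd


def find_leaky_columns(X_df, y, threshold=0.99):
    """Hypothetical helper: names of columns whose |Pearson r| with y exceeds threshold."""
    # corrwith correlates each column of X_df with the Series, aligned on the index
    corr = X_df.corrwith(pd.Series(y, index=X_df.index)).abs()
    return corr[corr > threshold].index.tolist()


# Synthetic example: 'target_copy' is the target itself, hidden among the features
rng = np.random.default_rng(0)
n = 200
target = rng.normal(size=n)
features = pd.DataFrame({
    'noise_a': rng.normal(size=n),
    'noise_b': rng.normal(size=n),
    'target_copy': target,
})

print(find_leaky_columns(features, target))
```

A near-perfect correlation does not prove leakage, and leakage can also be non-linear or spread across several columns, so treat this as a first filter rather than a guarantee.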


---

### Model 5

+ Cross-validation gives a negative mean $R^2$.
+ Yet `train_test_split` consistently gives much higher $R^2$ values.
+ Why?

```python
df = get_boston_df()
df.head()
```

|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |


```python
# Create a very simple model from the Boston data
FEATURES = ['RM']       # Average number of rooms

X = df[FEATURES].values
y = df['MEDV'].values   # House price

X.shape, y.shape
```

```
((506, 1), (506,))
```

```python
# Cross-validation gives a negative average R^2?!

cv_scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(cv_scores.mean())
```

```
-0.02952191995781055
```


```python
# No matter how many times this cell is run, we never get a negative test R^2!

# Double-check R^2 score using train-test-split
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.2)   # Same test size as CV

lr = LinearRegression()
lr.fit(X_train, y_train)

print('Training score:', lr.score(X_train, y_train))
print('Test score:', lr.score(X_test, y_test))
```

```
Training score: 0.49472897403047944
Test score: 0.43596180205252877
```
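One difference between the two cells is how rows are assigned to the test set. For a regressor, `cross_val_score` with an integer `cv` uses `KFold` without shuffling, so each test fold is a contiguous block of rows, whereas `train_test_split` shuffles by default. The synthetic sketch below shows how much that can matter when rows are stored in a systematic order; the data, the sort-by-target order, and the seeds are assumptions chosen for illustration, not a model of the Boston file.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: y depends linearly on x, plus noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=500)
y = 3 * x + rng.normal(0, 3, size=500)

# Store the rows sorted by target, so neighbouring rows have similar y values
order = np.argsort(y)
X_sorted = x[order].reshape(-1, 1)
y_sorted = y[order]

# Contiguous folds (no shuffling): each test fold covers a narrow band of y
contiguous = cross_val_score(LinearRegression(), X_sorted, y_sorted,
                             cv=KFold(n_splits=5))

# Shuffled folds: each test fold is a random sample of all rows
shuffled = cross_val_score(LinearRegression(), X_sorted, y_sorted,
                           cv=KFold(n_splits=5, shuffle=True, random_state=0))

print('Contiguous folds, mean R^2:', contiguous.mean())
print('Shuffled folds, mean R^2:  ', shuffled.mean())
```

Whether the Boston rows are ordered in a way that matters here is left for you to check; plotting `MEDV` against the row index is one quick way to look.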
