# Evaluation of Regression Models

* Supervised learning: learn y using features X.
* R² vs. error metrics
* Training and testing
* Cross validation
* Comparing to a baseline 

In [None]:
import pandas
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
abalone = pandas.read_csv('../Datasets/abalone.csv')
X = pandas.get_dummies(abalone.drop(columns=['Rings']))
scaled_X = pandas.DataFrame(scaler.fit_transform(X), columns=X.columns)
y = abalone['Rings']
X.sample()

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

model1 = LinearRegression()
model2 = KNeighborsRegressor(n_neighbors=9)
print(model1, model2)

In [None]:
new_data = X.sample(3)
new_data = pandas.DataFrame(
    scaler.transform(new_data), 
    columns=new_data.columns,
    index = new_data.index,
)
new_data

In [None]:
# Fit on the scaled features: new_data was transformed with the same scaler
model1.fit(scaled_X, y)
model2.fit(scaled_X, y)
y1 = model1.predict(new_data)
y2 = model2.predict(new_data)
print(y1, y2)


In [None]:
y.loc[new_data.index]

### How good are these predictions? How good is the model's ability to make predictions?

Several things are needed. First, we need a metric.

In [None]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
true_values = [1,1,1]
predict_values = [1,0,1]
print(mean_absolute_error(true_values,predict_values))
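
The imported metrics can also be worked out by hand, which makes it clearer what each one measures. The following is a minimal sketch on made-up numbers (the arrays are illustrative assumptions, not abalone data). It computes MAE, MSE, and R² with NumPy and compares them with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values chosen for easy arithmetic (illustrative only)
true_demo = np.array([1.0, 2.0, 3.0, 4.0])
pred_demo = np.array([1.0, 2.0, 2.0, 5.0])

residuals = true_demo - pred_demo              # [0, 0, 1, -1]

# Mean absolute error: average size of the residuals
mae = np.mean(np.abs(residuals))

# Mean squared error: average squared residual (penalizes large misses more)
mse = np.mean(residuals ** 2)

# R^2: 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((true_demo - true_demo.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, mean_absolute_error(true_demo, pred_demo))
print(mse, mean_squared_error(true_demo, pred_demo))
print(r2, r2_score(true_demo, pred_demo))
```

MAE and MSE are errors (lower is better), while R² is a score (higher is better, with 1 being a perfect fit).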

In [None]:
y1, y2, y.loc[new_data.index].values

In [None]:
mean_absolute_error(y.loc[new_data.index], y1)

In [None]:
mean_absolute_error(y.loc[new_data.index], y2)

In [None]:
y.loc[new_data.index]

#### Summary

In [None]:
scaler = MinMaxScaler()
abalone = pandas.read_csv('../Datasets/abalone.csv')
X = pandas.get_dummies(abalone.drop(columns=['Rings']))
scaled_X = pandas.DataFrame(scaler.fit_transform(X), columns=X.columns)

model1 = LinearRegression()
model1.fit(scaled_X,y)
model2 = KNeighborsRegressor(n_neighbors=9)
model2.fit(scaled_X,y)

# Caution: these predictions are made on the same data the models were trained on
y1 = model1.predict(scaled_X)
y2 = model2.predict(scaled_X)

print(mean_absolute_error(y, y1))
print(mean_absolute_error(y, y2))

### Next, we need two separate training and testing sets

To test a learner's ability to learn, we first give the learner data to learn from.

Then, we test the learner using different data.
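
To see why the test data must differ from the training data, consider a model that simply memorizes. The sketch below is an illustration on synthetic data from `make_regression` (an assumption made so the example is self-contained; it is not the abalone data). A 1-nearest-neighbor regressor scores a perfect error of zero on its own training data, yet does much worse on unseen data.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (illustrative stand-in for a real dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# With n_neighbors=1, each training point is its own nearest neighbor,
# so the model reproduces the training targets exactly.
memorizer = KNeighborsRegressor(n_neighbors=1)
memorizer.fit(X_tr, y_tr)

train_error = mean_absolute_error(y_tr, memorizer.predict(X_tr))
test_error = mean_absolute_error(y_te, memorizer.predict(X_te))
print('train error:', train_error)
print('test error: ', test_error)
```

The training error says nothing about how well the model generalizes; only the test error does.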

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(scaled_X, y, test_size=0.01)

In [None]:
# total number of rows, and the expected size of a 1% test set
len(y), 0.01 * len(y)

In [None]:
y_train.index

In [None]:
X_train.index

In [None]:
y_test.index

In [None]:
X_test.index

In [None]:

model1 = LinearRegression()
model1.fit(X_train, y_train)

model2 = KNeighborsRegressor(n_neighbors=9)
model2.fit(X_train, y_train)

y1 = model1.predict(X_test)
y2 = model2.predict(X_test)

print(mean_absolute_error(y_test, y1))
print(mean_absolute_error(y_test, y2))

This is a single round of validation.

#### Summary

To evaluate a model we need the following:
* A metric, e.g. mean_absolute_error
* Validation: two different datasets, a training dataset and a testing dataset.
* Cross validation: validating the model in multiple rounds.
* Other things...

In [None]:
# 1. prepare the data, select features

import pandas
from sklearn.preprocessing import MinMaxScaler

abalone = pandas.read_csv('../Datasets/abalone.csv')

y = abalone['Rings']

X = abalone.drop(columns=['Rings'])
X = pandas.get_dummies(X)

scaler = MinMaxScaler()
scaled_X = pandas.DataFrame(
    scaler.fit_transform(X), 
    columns=X.columns,
)




In [None]:
# 2. Create model

from sklearn.linear_model import LinearRegression
model = LinearRegression()



In [None]:
# 3. Plan for validation
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

def one_round_validate(model, X, y, test_size):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size) 
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    error = mean_absolute_error(y_test, predictions)
    return error


In [None]:
one_round_validate(model, scaled_X, y, 0.1)

In [None]:
# 4. Cross validation

# repeated sampling (this is equivalent to ShuffleSplit)

errors = []
for i in range(100):
    errors.append(one_round_validate(model, scaled_X, y, 0.1))

print('Average test error:', sum(errors) / len(errors))

### Cross Validation with ShuffleSplit

Cross validation simply means that we validate a model across different splits.



In [None]:
from sklearn.model_selection import cross_validate, ShuffleSplit

In [None]:
ss = ShuffleSplit(n_splits=100, test_size=0.05)
splits = list(ss.split(scaled_X,y))
len(splits)

In [None]:
train_idx, test_idx = splits[0]

In [None]:
train_idx

In [None]:
len(test_idx), len(y)*0.05

In [None]:
ss = ShuffleSplit(n_splits=100, test_size=0.05)
for train_idx, test_idx in ss.split(scaled_X,y):
    print(test_idx)

In [None]:
# to get all the names of "scoring"
#from sklearn.metrics import get_scorer_names
#get_scorer_names()

In [None]:
ss = ShuffleSplit(n_splits=100, test_size=0.05)
model = LinearRegression()

result = cross_validate(model, scaled_X, y, cv=ss,  scoring='neg_mean_absolute_error')
result.keys()

In [None]:
result['test_score'].mean().round(3), result['test_score'].std().round(3)

Using `cross_validate` and `ShuffleSplit`, we combine the whole cross-validation procedure into one step.
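
Note that scikit-learn scorers follow a "higher is better" convention, so `neg_mean_absolute_error` returns the error with its sign flipped. A minimal sketch of recovering the usual positive MAE follows. It uses synthetic data from `make_regression` as an assumption so that it runs on its own, and the `_demo` names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_validate

# Synthetic data (illustrative stand-in for scaled_X and y)
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)

ss_demo = ShuffleSplit(n_splits=20, test_size=0.1, random_state=0)
result_demo = cross_validate(
    LinearRegression(), X_demo, y_demo,
    cv=ss_demo, scoring='neg_mean_absolute_error',
)

# test_score holds one negated MAE per split; negate again to get the MAE
mae_per_split = -result_demo['test_score']
print('MAE mean:', mae_per_split.mean().round(3))
print('MAE std: ', mae_per_split.std().round(3))
```

Fixing `random_state` on the splitter makes the splits repeatable, which helps when comparing models on identical splits.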


#### KFold Cross Validation

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html

This picture shows the 5-fold cross validation procedure.
<img src="https://scikit-learn.org/stable/_images/grid_search_cross_validation.png" width="60%">

Benefits of K-fold compared to ShuffleSplit:
* Fewer splits.
* Each data point is tested exactly once.

Cons:
* Can be biased: without shuffling, neighboring rows land in the same fold, so groups of data are learned/tested together.
* We should shuffle the data at the beginning. `KFold` does not do this by default (`shuffle=False`).
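
The two points above (shuffling, and each point being tested exactly once) can be checked directly. This is a small sketch on a toy array of 20 row indices, with sizes chosen only for readability.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 20
X_demo = np.arange(n_samples).reshape(-1, 1)   # toy data: one feature, 20 rows

# shuffle=True randomizes which rows go into which fold
kf = KFold(n_splits=5, shuffle=True, random_state=0)

tested = []
for fold_train_idx, fold_test_idx in kf.split(X_demo):
    print(fold_test_idx)
    tested.extend(fold_test_idx.tolist())

# Every row index appears in exactly one test fold
print(sorted(tested) == list(range(n_samples)))
```

A `KFold` object can be passed as `cv=` to `cross_validate` in the same way as a `ShuffleSplit` object.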

### Baseline Comparison

In [None]:
ss = ShuffleSplit(n_splits=100, test_size=0.05)
model = LinearRegression()

result = cross_validate(model, scaled_X, y, cv=ss,  
                        scoring=['r2','neg_mean_absolute_error'])
print(result['test_neg_mean_absolute_error'].mean().round(2))
print(result['test_r2'].mean().round(2))


In [None]:
from sklearn.dummy import DummyRegressor
baseline = DummyRegressor()
baseline.strategy

In [None]:
result = cross_validate(baseline, scaled_X, y, cv=ss,  
                        scoring=['r2','neg_mean_absolute_error'])
print(result['test_neg_mean_absolute_error'].mean().round(2))
print(result['test_r2'].mean().round(2))
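
To interpret the baseline numbers: `DummyRegressor` with the default `'mean'` strategy ignores the features and always predicts the mean of the training targets. Its R² on test data is therefore at most 0, which is the reference point a real model should beat. A minimal sketch follows, using synthetic data as an assumption so that it runs on its own.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative stand-in for scaled_X and y)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# The baseline always predicts the training mean
baseline_demo = DummyRegressor(strategy='mean').fit(X_tr, y_tr)
linear_demo = LinearRegression().fit(X_tr, y_tr)

baseline_r2 = r2_score(y_te, baseline_demo.predict(X_te))
linear_r2 = r2_score(y_te, linear_demo.predict(X_te))
print('baseline r2:', round(baseline_r2, 3))
print('linear r2:  ', round(linear_r2, 3))
```

A model is only useful to the extent that it improves on this baseline, both in R² and in error.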

### Learning Curves

How does a model learn with more data? Does its performance improve with more experience?

A learning curve reveals insights about both the data and the learner: how good the learner (model) is, and/or how difficult the problem is.

Learning curve:
* x-axis: training size
* y-axis: score (higher is better)
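
The numbers behind such a curve can be computed with scikit-learn's `learning_curve` function. This is a minimal sketch on synthetic data (an assumption made for self-containment); it prints the averaged scores rather than plotting them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, learning_curve

# Synthetic data (illustrative stand-in for scaled_X and y)
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)

# For each training size, the model is trained and scored on every split
train_sizes, train_scores, test_scores = learning_curve(
    LinearRegression(), X_demo, y_demo,
    train_sizes=np.linspace(0.1, 1.0, 5),   # fractions of the available training data
    cv=ShuffleSplit(n_splits=20, test_size=0.2, random_state=0),
    scoring='neg_mean_absolute_error',
)

# Rows are training sizes, columns are splits: average over the splits
for size, tr, te in zip(train_sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(size, round(tr, 2), round(te, 2))
```

Plotting the two averaged columns against `train_sizes` gives the training curve and the test curve.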


In [None]:
def one_round_validate(model, X, y, test_size):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size) 
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    test_score = -mean_absolute_error(y_test, predictions)

    predictions = model.predict(X_train)
    train_score = -mean_absolute_error(y_train, predictions)
    
    return train_score, test_score



In [None]:
# from sklearn.metrics import get_scorer_names
# get_scorer_names()

In [None]:
import lcplot

model = LinearRegression()
lcplot.plot(model, scaled_X, y, scoring='neg_mean_squared_error')



<img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_learning_curve_001.png" width="75%">


#### Exercise: evaluate the following learning curve

https://i.stack.imgur.com/uHDIM.png

My evaluation:
* learner's ability: test score may not get above 0.8
* problem difficulty: with additional data, scores may not get higher.



https://i.stack.imgur.com/uHDIM.png

https://i.stack.imgur.com/MHRKD.png

https://i.stack.imgur.com/VGhxI.png

https://i.stack.imgur.com/dDgMw.png