# Challenge: Model Comparison
You now know two kinds of regression and two kinds of classifier. So let's use that to compare models!

Comparing models is something data scientists do all the time. A given situation rarely admits only one workable model, so learning to choose the best one is very important.

Here let's work on regression. Find a data set and build a KNN Regression and an OLS regression. Compare the two. How similar are they? Do they miss in different ways?

Create a Jupyter notebook with your models. At the end in a markdown cell write a few paragraphs to describe the models' behaviors and why you favor one model or the other. Try to determine whether there is a situation where you would change your mind, or whether one is unambiguously better than the other. Lastly, try to note what it is about the data that causes the better model to outperform the weaker model. Submit a link to your notebook below.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statistics as sts
from sklearn import linear_model
import scipy
from sklearn import neighbors
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error as mse
%matplotlib inline

# Data

In [2]:
distress = pd.read_csv('Financial Distress.csv')
print(distress.head())

   Company  Time  Financial Distress      x1        x2       x3       x4  \
0        1     1            0.010636  1.2810  0.022934  0.87454  1.21640   
1        1     2           -0.455970  1.2700  0.006454  0.82067  1.00490   
2        1     3           -0.325390  1.0529 -0.059379  0.92242  0.72926   
3        1     4           -0.566570  1.1131 -0.015229  0.85888  0.80974   
4        2     1            1.357300  1.0623  0.107020  0.81460  0.83593   

         x5        x6       x7 ...       x74    x75     x76     x77   x78  \
0  0.060940  0.188270  0.52510 ...    85.437  27.07  26.102  16.000  16.0   
1 -0.014080  0.181040  0.62288 ...   107.090  31.31  30.194  17.000  16.0   
2  0.020476  0.044865  0.43292 ...   120.870  36.07  35.273  17.000  15.0   
3  0.076037  0.091033  0.67546 ...    54.806  39.80  38.377  17.167  16.0   
4  0.199960  0.047800  0.74200 ...    85.437  27.07  26.102  16.000  16.0   

   x79  x80       x81  x82  x83  
0  0.2   22  0.060390   30   49  
1  0.4   22 

[Data](https://www.kaggle.com/shebrahimi/financial-distress)

The target variable is the "Financial Distress" column. If its value is greater than -0.50, the company is considered healthy (0); otherwise it is regarded as financially distressed (1).
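The threshold rule can be written as a single vectorized comparison. This is a minimal sketch: the first four values are taken from the preview above, the exact -0.50 boundary case is added for illustration, and the names `scores` and `labels` are hypothetical.

```python
import pandas as pd

# Four values from the data preview plus a hypothetical boundary case (-0.50).
scores = pd.Series([0.010636, -0.455970, -0.566570, 1.357300, -0.50])

# 1 = financially distressed (score <= -0.50), 0 = healthy (score > -0.50).
labels = (scores <= -0.50).astype(int)
print(labels.tolist())  # [0, 0, 1, 0, 1]
```

A value of exactly -0.50 is not "greater than -0.50", so it falls on the distressed side.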

# OLS Regression

In [3]:
x = distress.drop('Financial Distress', axis=1)
y = distress['Financial Distress']

# Instantiate our model.
regr = linear_model.LinearRegression()

# Fit our model to our data.
regr.fit(x, y)

# Cross-validated R-squared (the default score for a regressor).
regr_score = cross_val_score(regr, x, y, cv=20)
# Display the attributes we calculated.
#print('Coefficients: \n', regr.coef_)
#print('Intercept: \n', regr.intercept_)
print('R-squared: ',regr.score(x,y))
print("Fold R-squared: %0.2f (+/- %0.2f)" % (regr_score.mean(), regr_score.std() * 2))

y_pred = regr.predict(x)

print('MSE:', mse(y,y_pred))
print('rMSE:',mse(y,y_pred)**.5)

R-squared:  0.4138037866827684
Fold R-squared: -26985.74 (+/- 186292.43)
MSE: 4.122362242119699
rMSE: 2.0303601262139925
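
For a regressor, `cross_val_score` reports R-squared per fold by default, and an integer `cv` splits the rows into consecutive, unshuffled folds. Because the rows here are ordered by company, each fold holds out whole blocks of companies, which may contribute to the large gap between the in-sample and cross-validated scores above. A minimal sketch of passing a shuffled `KFold` and also collecting a cross-validated RMSE follows. It uses synthetic data, not the distress file, and `X_demo`, `y_demo`, and the chosen coefficients are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 rows, 5 features, linear signal plus noise.
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=200)

# Shuffle rows before splitting so each fold mixes all parts of the data.
cv = KFold(n_splits=20, shuffle=True, random_state=0)

r2_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")
neg_mse_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=cv, scoring="neg_mean_squared_error"
)
cv_rmse = np.sqrt(-neg_mse_scores.mean())

print("Fold R-squared: %0.2f (+/- %0.2f)" % (r2_scores.mean(), r2_scores.std() * 2))
print("Cross-validated rMSE: %0.3f" % cv_rmse)
```

Whether shuffling is appropriate depends on the question: for panel data like this, holding out whole companies may be the more honest test.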


# KNN Regression

In [4]:
# Build our model.
knn = neighbors.KNeighborsRegressor(n_neighbors=10)
X = distress.drop('Financial Distress', axis=1)
Y = distress['Financial Distress']

# Fit our model to the data.
knn.fit(X, Y)

# Cross-validated R-squared (the default score for a regressor).
knn_score = cross_val_score(knn, X, Y, cv=20)
print('Unweighted KNN R-squared: ',knn.score(X,Y))
# Summarize the cross-validation scores computed above.
print("Fold Unweighted KNN R-squared: %0.2f (+/- %0.2f)" % (knn_score.mean(), knn_score.std() * 2))

Y_pred = knn.predict(X)

print('Unweighted KNN MSE:', mse(Y,Y_pred))
print('Unweighted KNN rMSE:',mse(Y,Y_pred)**.5)

Unweighted KNN R-squared:  0.2732229842292031
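
KNN ranks neighbors by Euclidean distance by default, so columns with large values (such as x74, near 100 in the preview) can dominate columns with small values (such as x1, near 1). A minimal sketch of standardizing features inside a pipeline before KNN follows. The data is synthetic, and `signal`, `noise`, `X_demo`, and `y_demo` are hypothetical names, not columns from the distress file.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
signal = rng.normal(size=300)            # informative feature, values around 1
noise = rng.normal(scale=100, size=300)  # uninformative feature, values around 100
X_demo = np.column_stack([signal, noise])
y_demo = 2.0 * signal + rng.normal(scale=0.1, size=300)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
raw_knn = KNeighborsRegressor(n_neighbors=10)
# The scaler is fit on each training fold only, so no test-fold statistics leak in.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=10))

raw_r2 = cross_val_score(raw_knn, X_demo, y_demo, cv=cv).mean()
scaled_r2 = cross_val_score(scaled_knn, X_demo, y_demo, cv=cv).mean()
print("Raw KNN fold R-squared:    %0.2f" % raw_r2)
print("Scaled KNN fold R-squared: %0.2f" % scaled_r2)
```

OLS predictions do not change when a feature is rescaled, so this step matters for the KNN side of the comparison only.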

In [55]:
# Build our model.
knn = neighbors.KNeighborsRegressor(n_neighbors=10, weights='distance')
X = distress.drop('Financial Distress', axis=1)
Y = distress['Financial Distress']

# Fit our model to the data.
knn.fit(X, Y)

# Cross-validated R-squared (the default score for a regressor).
knn_score = cross_val_score(knn, X, Y, cv=20)
print('Weighted KNN R-squared: ',knn.score(X,Y))
# Summarize the cross-validation scores computed above.
print("Fold Weighted KNN R-squared: %0.2f (+/- %0.2f)" % (knn_score.mean(), knn_score.std() * 2))

Y_pred = knn.predict(X)

print('Weighted KNN MSE:', mse(Y,Y_pred))
print('Weighted KNN rMSE:',mse(Y,Y_pred)**.5)

Weighted KNN R-squared:  1.0
Weighted KNN MSE: 0.0
Weighted KNN rMSE: 0.0
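
With `weights='distance'`, every training row is its own nearest neighbor at distance zero, so predicting on the training data returns the training targets exactly. The R-squared of 1.0 and MSE of 0.0 above therefore measure memorization rather than predictive quality. A minimal sketch on synthetic data, with a hypothetical 75/25 split, shows the difference between in-sample and held-out scores.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in: one informative feature, two irrelevant ones, noisy target.
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo[:, 0] + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

knn_w = KNeighborsRegressor(n_neighbors=10, weights="distance").fit(X_train, y_train)
train_r2 = knn_w.score(X_train, y_train)  # scored on rows the model has stored
test_r2 = knn_w.score(X_test, y_test)     # scored on rows the model has not seen

print("Train R-squared:", train_r2)
print("Test R-squared: ", test_r2)
```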


# With an Engineered Output Feature

In [56]:
distress_sel = pd.read_csv('Financial Distress.csv')
distress_sel['fin_dis'] = distress_sel['Financial Distress'].map(lambda x: 0 if x > -0.50 else 1)
print(distress_sel.head())

   Company  Time  Financial Distress      x1        x2       x3       x4  \
0        1     1            0.010636  1.2810  0.022934  0.87454  1.21640   
1        1     2           -0.455970  1.2700  0.006454  0.82067  1.00490   
2        1     3           -0.325390  1.0529 -0.059379  0.92242  0.72926   
3        1     4           -0.566570  1.1131 -0.015229  0.85888  0.80974   
4        2     1            1.357300  1.0623  0.107020  0.81460  0.83593   

         x5        x6       x7   ...       x75     x76     x77   x78  x79  \
0  0.060940  0.188270  0.52510   ...     27.07  26.102  16.000  16.0  0.2   
1 -0.014080  0.181040  0.62288   ...     31.31  30.194  17.000  16.0  0.4   
2  0.020476  0.044865  0.43292   ...     36.07  35.273  17.000  15.0 -0.2   
3  0.076037  0.091033  0.67546   ...     39.80  38.377  17.167  16.0  5.6   
4  0.199960  0.047800  0.74200   ...     27.07  26.102  16.000  16.0  0.2   

   x80       x81  x82  x83  fin_dis  
0   22  0.060390   30   49        0  
1   

# OLS Regression with Feature Engineering

In [57]:
x = distress_sel.drop(['Financial Distress','fin_dis'], axis=1)
y = distress_sel['fin_dis']

# Instantiate our model.
regr = linear_model.LinearRegression()

# Fit our model to our data.
regr.fit(x, y)

# Cross-validated R-squared (the default score for a regressor).
regr_score = cross_val_score(regr, x, y, cv=20)
# Display the attributes we calculated.
#print('Coefficients: \n', regr.coef_)
#print('Intercept: \n', regr.intercept_)
print('R-squared: ',regr.score(x,y))
print("Fold R-squared: %0.2f (+/- %0.2f)" % (regr_score.mean(), regr_score.std() * 2))

y_pred = regr.predict(x)

print('MSE:', mse(y,y_pred))
print('rMSE:',mse(y,y_pred)**.5)

R-squared:  0.20964936272078705
Fold R-squared: -796808.09 (+/- 6918070.14)
MSE: 0.028188088572372472
rMSE: 0.16789308673192138


# KNN Regression with Feature Engineering

In [58]:
# Build our model.
knn = neighbors.KNeighborsRegressor(n_neighbors=10)
X = distress_sel.drop(['Financial Distress','fin_dis'], axis=1)
Y = distress_sel['fin_dis']

# Fit our model to the data.
knn.fit(X, Y)

# Cross-validated R-squared (the default score for a regressor).
knn_score = cross_val_score(knn, X, Y, cv=20)
print('Unweighted KNN R-squared: ',knn.score(X,Y))
# Summarize the cross-validation scores computed above.
print("Fold Unweighted KNN R-squared: %0.2f (+/- %0.2f)" % (knn_score.mean(), knn_score.std() * 2))

Y_pred = knn.predict(X)

print('Unweighted KNN MSE:', mse(Y,Y_pred))
print('Unweighted KNN rMSE:',mse(Y,Y_pred)**.5)

Unweighted KNN R-squared:  0.265670248868778
Unweighted KNN MSE: 0.026190087145969505
Unweighted KNN rMSE: 0.16183351675709673
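
The in-sample errors of OLS and unweighted KNN are close here (rMSE of about 0.168 and 0.162), but similar error sizes do not show whether the two models miss on the same rows. One way to check is to compare held-out residuals directly. The sketch below is a minimal illustration on synthetic, deliberately nonlinear data. The sine-shaped target and all variable names are assumptions, not part of the distress data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in with a curved relationship a straight line cannot follow.
X_demo = rng.uniform(-3, 3, size=(400, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.2, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Residual = actual minus predicted, on rows neither model was fit on.
ols_resid = y_test - LinearRegression().fit(X_train, y_train).predict(X_test)
knn_resid = y_test - KNeighborsRegressor(n_neighbors=10).fit(X_train, y_train).predict(X_test)

ols_rmse = np.sqrt(np.mean(ols_resid ** 2))
knn_rmse = np.sqrt(np.mean(knn_resid ** 2))
# Correlation near 1 means the models miss on the same rows in the same direction.
resid_corr = np.corrcoef(ols_resid, knn_resid)[0, 1]

print("OLS rMSE: %0.3f, KNN rMSE: %0.3f" % (ols_rmse, knn_rmse))
print("Residual correlation: %0.2f" % resid_corr)
```

A scatter plot of `ols_resid` against `knn_resid` gives the same information visually.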


In [59]:
# Build our model.
knn = neighbors.KNeighborsRegressor(n_neighbors=10, weights='distance')
X = distress_sel.drop(['Financial Distress','fin_dis'], axis=1)
Y = distress_sel['fin_dis']

# Fit our model to the data.
knn.fit(X, Y)

# Cross-validated R-squared (the default score for a regressor).
knn_score = cross_val_score(knn, X, Y, cv=20)
print('Weighted KNN R-squared: ',knn.score(X,Y))
# Summarize the cross-validation scores computed above.
print("Fold Weighted KNN R-squared: %0.2f (+/- %0.2f)" % (knn_score.mean(), knn_score.std() * 2))

Y_pred = knn.predict(X)

print('Weighted KNN MSE:', mse(Y,Y_pred))
print('Weighted KNN rMSE:',mse(Y,Y_pred)**.5)

Weighted KNN R-squared:  1.0
Weighted KNN MSE: 0.0
Weighted KNN rMSE: 0.0
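
To compare the models on equal footing, the same cross-validation split and metrics can be applied to each one and the results gathered into a single table. This is a minimal sketch using `cross_validate` on synthetic linear data. The data, the shuffled 5-fold split, and the names `models` and `comparison` are assumptions, not results from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in: 300 rows, 4 features, linear signal plus noise.
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo @ np.array([1.5, -1.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=300)

models = {
    "OLS": LinearRegression(),
    "KNN (uniform)": KNeighborsRegressor(n_neighbors=10),
    "KNN (distance)": KNeighborsRegressor(n_neighbors=10, weights="distance"),
}
# One shared split so every model is scored on identical folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

rows = []
for name, model in models.items():
    result = cross_validate(
        model, X_demo, y_demo, cv=cv, scoring=("r2", "neg_mean_squared_error")
    )
    rows.append({
        "model": name,
        "cv_r2": result["test_r2"].mean(),
        "cv_rmse": np.sqrt(-result["test_neg_mean_squared_error"].mean()),
    })

comparison = pd.DataFrame(rows).set_index("model")
print(comparison.round(3))
```

On data that really is linear, OLS should come out ahead. On the distress data the ranking may differ, which is the point of running the table.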
