# Using Cross Validation to Evaluate a Model's Performance

### Problem: Given FICO Score and the loan amount requested, what will be the interest rate?

**Step 1:** Downloading the data and cleaning pertinent columns:

In [41]:
import pandas as pd


loansData = pd.read_csv('https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv')

# Clean Interest.Rate: drop the trailing '%' and convert to a fraction (e.g. '8.90%' -> 0.089)
loansData['Interest.Rate'] = loansData['Interest.Rate'].map(lambda x: round(float(x[0:-1]) / 100, 4))

# Clean FICO.Range: keep the lower bound of the range as an integer (e.g. '735-739' -> 735)
loansData['fico_score'] = loansData['FICO.Range'].map(lambda z: int(z.split('-')[0]))

# Modify column names
loansData.columns = [col.replace(".", "_").lower() for col in loansData.columns]
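
The three cleaning rules above can be tried on single values before applying them to the whole DataFrame. This is a minimal sketch: the helper names `clean_interest_rate`, `lower_fico_bound`, and `clean_column_name` are hypothetical, and the sample strings are illustrative values in the same format as the raw columns.

```python
def clean_interest_rate(raw):
    """Convert a percent string such as '8.90%' to a fraction."""
    return round(float(raw[0:-1]) / 100, 4)


def lower_fico_bound(fico_range):
    """Return the lower bound of a range string such as '735-739' as an int."""
    return int(fico_range.split('-')[0])


def clean_column_name(name):
    """Replace dots with underscores and lowercase the column name."""
    return name.replace(".", "_").lower()


# Illustrative inputs (assumed to match the raw column formats)
rate = clean_interest_rate('8.90%')
fico = lower_fico_bound('735-739')
column = clean_column_name('Interest.Rate')
print(rate, fico, column)
```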

In [42]:
loansData.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2500 entries, 81174 to 3116
Data columns (total 15 columns):
amount_requested                  2500 non-null int64
amount_funded_by_investors        2500 non-null float64
interest_rate                     2500 non-null float64
loan_length                       2500 non-null object
loan_purpose                      2500 non-null object
debt_to_income_ratio              2500 non-null object
state                             2500 non-null object
home_ownership                    2500 non-null object
monthly_income                    2499 non-null float64
fico_range                        2500 non-null object
open_credit_lines                 2498 non-null float64
revolving_credit_balance          2498 non-null float64
inquiries_in_the_last_6_months    2498 non-null float64
employment_length                 2500 non-null object
fico_score                        2500 non-null int64
dtypes: float64(6), int64(2), object(7)
memory usage: 312.5+

**Step 2:** Saving the cleaned data to CSV for future use, then importing the modeling libraries and reloading the file:

In [43]:
loansData.to_csv('loansData_crossv.csv', header=True, index=False)

In [44]:
import pandas as pd
import statsmodels.formula.api as smf
from sklearn import linear_model
from sklearn import model_selection as cv  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn import metrics
from sklearn import svm

In [45]:
loansData = pd.read_csv('loansData_crossv.csv')

**Step 3:** Starting with statsmodels to create the first model:

In [46]:
model = smf.ols('interest_rate ~ fico_score + amount_requested', loansData).fit()
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | interest_rate | R-squared: | 0.657 |
| Model: | OLS | Adj. R-squared: | 0.656 |
| Method: | Least Squares | F-statistic: | 2388.0 |
| Date: | Thu, 21 Jan 2016 | Prob (F-statistic): | 0.0 |
| Time: | 17:42:47 | Log-Likelihood: | 5727.6 |
| No. Observations: | 2500 | AIC: | -11450.0 |
| Df Residuals: | 2497 | BIC: | -11430.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 0.7288 | 0.010 | 73.734 | 0.000 | 0.709 0.748 |
| fico_score | -0.0009 | 1.4e-05 | -63.022 | 0.000 | -0.001 -0.001 |
| amount_requested | 2.107e-06 | 6.3e-08 | 33.443 | 0.000 | 1.98e-06 2.23e-06 |

| | | | |
|---|---|---|---|
| Omnibus: | 69.496 | Durbin-Watson: | 1.979 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 77.811 |
| Skew: | 0.379 | Prob(JB): | 1.27e-17 |
| Kurtosis: | 3.414 | Cond. No. | 296000.0 |


**Step 4:** Using scikit-learn to make the same model, checking to see if the coefficients match:

In [47]:
X = loansData[['amount_requested','fico_score']]
y = loansData['interest_rate']

model_2 = linear_model.LinearRegression()
model_2.fit(X, y)
model_2.coef_

array([  2.10747769e-06,  -8.84424222e-04])
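
The scikit-learn coefficients come back in the column order of `X` (`amount_requested` first, then `fico_score`), so they line up with the statsmodels table once the order is accounted for. The sketch below checks that using only the numbers printed above, then works one prediction for the original problem. The borrower values (FICO 720, $10,000 requested) are hypothetical, and the intercept is the rounded value from the statsmodels summary, so the result is approximate.

```python
# Values copied from the outputs above
sm_intercept = 0.7288            # statsmodels, rounded to 4 decimals
sm_fico_coef = -0.0009           # statsmodels, rounded to 4 decimals
sm_amount_coef_text = '2.107e-06'
sk_amount_coef, sk_fico_coef = 2.10747769e-06, -8.84424222e-04  # order follows X's columns

# Round the scikit-learn values the way the statsmodels summary displays them
fico_matches = round(sk_fico_coef, 4) == sm_fico_coef
amount_matches = format(sk_amount_coef, '.3e') == sm_amount_coef_text
print(fico_matches, amount_matches)

# Hypothetical borrower: FICO score 720, requesting 10,000
fico_score, amount_requested = 720, 10000
predicted_rate = sm_intercept + sk_fico_coef * fico_score + sk_amount_coef * amount_requested
print(round(predicted_rate, 4))
```

Under these assumptions the fitted line predicts an interest rate of roughly 11.3% for that borrower.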

**Step 5:** Splitting the data just once (70% train, 30% test) and scoring a linear regression on the held-out set before moving on to KFold:

In [48]:
X_train, X_test, y_train, y_test = cv.train_test_split(X, y, train_size=0.7, random_state=0)
model_3 = linear_model.LinearRegression()
model_3.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [49]:
predicted = model_3.predict(X_test)
expected = y_test

metrics.mean_squared_error(expected, predicted)

0.00058921001758735405
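
Mean squared error is the average of the squared differences between expected and predicted values. The sketch below computes it by hand on a few hypothetical numbers, then takes the square root of the score reported above to put it back on the scale of the interest rate. The hand-rolled `mean_squared_error` function is for illustration only and is not the scikit-learn implementation.

```python
import math


def mean_squared_error(expected, predicted):
    """Average of the squared differences between two equal-length sequences."""
    squared_errors = [(e - p) ** 2 for e, p in zip(expected, predicted)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical toy values, not taken from the loans data
toy_expected = [0.10, 0.15, 0.20]
toy_predicted = [0.12, 0.14, 0.17]
toy_mse = mean_squared_error(toy_expected, toy_predicted)
print(toy_mse)

# Score reported above for the single 70/30 split
single_split_mse = 0.00058921001758735405
single_split_rmse = math.sqrt(single_split_mse)
print(single_split_rmse)
```

The root mean squared error comes out near 0.024, meaning the linear model's predictions on the test set are off by about 2.4 percentage points of interest rate on average.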

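**Step 6:** Splitting the data via KFold (10 folds), this time scoring a support vector regressor (`svm.SVR` with default settings) instead of the linear model: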

In [52]:
clf = svm.SVR()
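# scikit-learn 0.18+ names this scorer 'neg_mean_squared_error'; the old name was removed in 0.20
mse_scores = cv.cross_val_score(clf, X, y, cv=10, scoring='neg_mean_squared_error')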
mse_scores

array([-0.00239206, -0.00206945, -0.00201442, -0.00199629, -0.00210307,
       -0.00245783, -0.00213549, -0.00227609, -0.00223815, -0.00212754])
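
Each of the ten scores comes from holding out a different tenth of the rows. When `cv` is an integer and the estimator is a regressor, `cross_val_score` splits the rows into contiguous folds without shuffling, as far as I know. The sketch below reproduces that splitting logic in plain Python. The generator `k_fold_indices` is a hypothetical helper written for illustration, not scikit-learn's own code.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional row when the split is uneven
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# 2500 rows and 10 folds, as in the cell above
folds = list(k_fold_indices(2500, 10))
fold_sizes = [len(test) for _, test in folds]
print(fold_sizes)
```

With 2500 rows, every model is trained on 2250 rows and scored on the remaining 250, and each row is used for testing exactly once.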

In [53]:
abs(mse_scores.mean())

0.0021810386520000006
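
The fold scores are negative because scikit-learn reports error metrics with the sign flipped, so that a higher score always means a better model. Taking `abs()` of the mean undoes that. The sketch below repeats the calculation from the printed (rounded) fold scores and converts the result to a root mean squared error for easier reading.

```python
import math

# Fold scores copied from the output above (negated MSE, one value per fold)
fold_scores = [-0.00239206, -0.00206945, -0.00201442, -0.00199629, -0.00210307,
               -0.00245783, -0.00213549, -0.00227609, -0.00223815, -0.00212754]

# Flip the sign back to get an ordinary, non-negative mean squared error
cv_mse = -sum(fold_scores) / len(fold_scores)
cv_rmse = math.sqrt(cv_mse)
print(cv_mse, cv_rmse)
```

The cross-validated error is about 0.047 on the interest-rate scale, or roughly 4.7 percentage points.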

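**Preliminary conclusion:**
The smaller the mean squared error, the closer the predictions are to the data. The 10-fold cross-validation result (about 0.0022) is worse than the result from splitting the data once (about 0.00059), but the two numbers are not directly comparable: Step 5 fits a `LinearRegression`, while Step 6 scores an `svm.SVR()` with default settings on unscaled features. The worse result therefore most likely comes from switching models, not from KFold itself. Cross-validating the linear model would give a like-for-like comparison.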
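
One way to check whether the gap comes from KFold or from the choice of model is to cross-validate both estimators on the same folds. The sketch below does that on synthetic data, which is an assumption made so the example runs without the download: the feature ranges, coefficients, and noise level only loosely imitate the loans data and are not the real dataset. It uses the current `sklearn.model_selection` API.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in for the loans data (assumption: NOT the real dataset)
rng = np.random.RandomState(0)
n = 300
fico = rng.uniform(640, 830, n)
amount = rng.uniform(1000, 35000, n)
X = np.column_stack([amount, fico])
y = 0.73 - 0.00088 * fico + 2.1e-6 * amount + rng.normal(0, 0.02, n)

# Reuse the same ten folds for both models so the scores are comparable
folds = KFold(n_splits=10, shuffle=True, random_state=0)


def cv_mse(estimator):
    """Mean cross-validated MSE (sign flipped back to positive)."""
    scores = cross_val_score(estimator, X, y, cv=folds,
                             scoring='neg_mean_squared_error')
    return -scores.mean()


linear_mse = cv_mse(LinearRegression())
svr_mse = cv_mse(SVR())  # default hyperparameters, unscaled features
print(linear_mse, svr_mse)
```

On this synthetic data the linear model should score a clearly lower cross-validated MSE than the default `SVR`. Running the same comparison on the real `X` and `y` would show whether that also explains the numbers above.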