## Model Evaluation and Refinement

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# module_5_auto.csv was downloaded beforehand from the IBM skill lab
df = pd.read_csv('module_5_auto.csv')
print('data read successfully in pandas dataframe')

data read successfully in pandas dataframe


In [3]:
df.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,symboling,normalized-losses,make,aspiration,num-of-doors,body-style,drive-wheels,engine-location,...,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price,city-L/100km,horsepower-binned,diesel,gas
0,0,0,3,122,alfa-romero,std,two,convertible,rwd,front,...,9.0,111.0,5000.0,21,27,13495.0,11.190476,Medium,0,1
1,1,1,3,122,alfa-romero,std,two,convertible,rwd,front,...,9.0,111.0,5000.0,21,27,16500.0,11.190476,Medium,0,1
2,2,2,1,122,alfa-romero,std,two,hatchback,rwd,front,...,9.0,154.0,5000.0,19,26,16500.0,12.368421,Medium,0,1
3,3,3,2,164,audi,std,four,sedan,fwd,front,...,10.0,102.0,5500.0,24,30,13950.0,9.791667,Medium,0,1
4,4,4,2,164,audi,std,four,sedan,4wd,front,...,8.0,115.0,5500.0,18,22,17450.0,13.055556,Medium,0,1



First, let's only use numeric data:

In [4]:
# Keep only numeric columns (public API instead of the private _get_numeric_data)
df = df.select_dtypes(include='number')
df.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,symboling,normalized-losses,wheel-base,length,width,height,curb-weight,engine-size,...,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price,city-L/100km,diesel,gas
0,0,0,3,122,88.6,0.811148,0.890278,48.8,2548,130,...,2.68,9.0,111.0,5000.0,21,27,13495.0,11.190476,0,1
1,1,1,3,122,88.6,0.811148,0.890278,48.8,2548,130,...,2.68,9.0,111.0,5000.0,21,27,16500.0,11.190476,0,1
2,2,2,1,122,94.5,0.822681,0.909722,52.4,2823,152,...,3.47,9.0,154.0,5000.0,19,26,16500.0,12.368421,0,1
3,3,3,2,164,99.8,0.84863,0.919444,54.3,2337,109,...,3.4,10.0,102.0,5500.0,24,30,13950.0,9.791667,0,1
4,4,4,2,164,99.4,0.84863,0.922222,54.3,2824,136,...,3.4,8.0,115.0,5500.0,18,22,17450.0,13.055556,0,1


Let's remove the columns 'Unnamed: 0.1' and 'Unnamed: 0', since they only hold leftover index values and add nothing to the models.

In [5]:
df.drop(['Unnamed: 0.1', 'Unnamed: 0'], axis=1, inplace=True)

df.head()

Unnamed: 0,symboling,normalized-losses,wheel-base,length,width,height,curb-weight,engine-size,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price,city-L/100km,diesel,gas
0,3,122,88.6,0.811148,0.890278,48.8,2548,130,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0,11.190476,0,1
1,3,122,88.6,0.811148,0.890278,48.8,2548,130,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0,11.190476,0,1
2,1,122,94.5,0.822681,0.909722,52.4,2823,152,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0,12.368421,0,1
3,2,164,99.8,0.84863,0.919444,54.3,2337,109,3.19,3.4,10.0,102.0,5500.0,24,30,13950.0,9.791667,0,1
4,2,164,99.4,0.84863,0.922222,54.3,2824,136,3.19,3.4,8.0,115.0,5500.0,18,22,17450.0,13.055556,0,1


#### Functions for Plotting

In [6]:
import seaborn as sns  # needed for kdeplot; not imported in the cells above

def DistributionPlot(RedFunction, BlueFunction, RedName, BlueName, Title):
    plt.figure(figsize=(12,10))

    # Overlay the two kernel density estimates on the same axes
    ax1 = sns.kdeplot(RedFunction, color='r', label=RedName)
    ax2 = sns.kdeplot(BlueFunction, color='b', label=BlueName, ax=ax1)

    plt.title(Title)
    plt.xlabel('Price (in dollars)')
    plt.ylabel('Proportion of Cars')
    plt.legend()
    plt.show()
    plt.close()

In [7]:
def PollyPlot(xtrain, xtest, y_train, y_test, lr, poly_transform):
    plt.figure(figsize=(12,10))

    # Evaluate the fitted curve over the full range covered by both splits
    xmax = max([xtrain.values.max(), xtest.values.max()])
    xmin = min([xtrain.values.min(), xtest.values.min()])

    x = np.arange(xmin, xmax, 0.1)

    plt.plot(xtrain, y_train, 'ro', label='Training Data')
    plt.plot(xtest, y_test, 'go', label='Test Data')
    plt.plot(x, lr.predict(poly_transform.fit_transform(x.reshape(-1,1))), label='Predicted Function')
    plt.ylim([-10000, 60000])
    plt.ylabel('Price')
    plt.legend()

#### Part 1: Training and Testing
An important step in testing your model is to split your data into training and testing data. We place the target column, price, in a separate variable y_data:

In [8]:
y_data = df['price']

We then drop price from the dataframe and store the remaining features in x_data:

In [9]:
x_data = df.drop('price', axis=1)

Now, we randomly split our data into training and testing data using the function train_test_split.

In [10]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.10, random_state=1)

print('number of test samples:', x_test.shape[0])
print('number of training samples:', x_train.shape[0])

number of test samples: 21
number of training samples: 180


The <b>test_size</b> parameter sets the proportion of data that is split into the testing set. In the above, the testing set is 10% of the total dataset.
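
To make the effect of <b>test_size</b> concrete, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` are hypothetical stand-ins with the same number of rows (201) as the dataset above; they are not the car data itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 201 rows, the same row count as the dataset above
X_demo = np.arange(201).reshape(-1, 1)
y_demo = np.arange(201)

split_sizes = {}
for test_size in (0.10, 0.25):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=test_size, random_state=0
    )
    # Record (training rows, test rows) for this proportion
    split_sizes[test_size] = (X_tr.shape[0], X_te.shape[0])
    print(f"test_size={test_size:.2f}: "
          f"{X_tr.shape[0]} training samples, {X_te.shape[0]} test samples")
```

Because 201 is not evenly divisible by these proportions, the split sizes are rounded to whole rows, which is why the 10% split above yields 21 test samples rather than 20.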


### 1): Use the function "train_test_split" to split up the dataset such that 40% of the data samples will be utilized for testing. Set the parameter "random_state" equal to zero. The output of the function should be the following: "x_train1" , "x_test1", "y_train1" and "y_test1".

In [11]:
x_train1, x_test1, y_train1, y_test1 = train_test_split(x_data, y_data, test_size=0.40, random_state=0)

print('number of test samples:', x_test1.shape[0])
print('number of training samples:', x_train1.shape[0])

number of test samples: 81
number of training samples: 120


Let's import <b>LinearRegression</b> from the module <b>linear_model</b>.


In [21]:
from sklearn.linear_model import LinearRegression


We create a Linear Regression object:

In [22]:
lre = LinearRegression()
lre

We fit the model using the feature "horsepower":

In [23]:
lre.fit(x_train[['horsepower']], y_train)

Let's calculate the R^2 on the test data:

In [24]:
lre.score(x_test[['horsepower']], y_test)

0.36358755750788263


The R^2 on the test data is much smaller than the R^2 on the training data:

In [25]:
lre.score(x_train[['horsepower']], y_train)

0.6619724197515104
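
For reference, the value returned by `score` for a linear regression is the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below checks this by hand on synthetic data; the variables `hp_demo` and `price_demo` are hypothetical and only loosely imitate a horsepower/price relationship.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: a noisy linear relationship, not the actual car dataset
rng = np.random.default_rng(0)
hp_demo = rng.uniform(50, 250, size=100).reshape(-1, 1)
price_demo = 150 * hp_demo.ravel() + rng.normal(0, 4000, size=100)

model = LinearRegression().fit(hp_demo, price_demo)
pred = model.predict(hp_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((price_demo - pred) ** 2)
ss_tot = np.sum((price_demo - price_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

r2_sklearn = model.score(hp_demo, price_demo)
print(r2_manual, r2_sklearn)
```

The two printed values should agree up to floating-point error. When `score` is called on a test set, the same formula is applied to the test targets and the model's predictions for them.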

### 2): Find the R^2 on the test data using 40% of the dataset for testing.

In [33]:
x_data = df.drop('price', axis=1)
y_data = df['price']

x_train1, x_test1, y_train1, y_test1 = train_test_split(x_data, y_data, test_size=0.40, random_state=0)

lre = LinearRegression()
lre.fit(x_train1[['horsepower']], y_train1)  # fit() returns the estimator, not a score
train_score1 = lre.score(x_train1[['horsepower']], y_train1)
test_score1 = lre.score(x_test1[['horsepower']], y_test1)


Sometimes you do not have sufficient testing data; as a result, you may want to perform cross-validation. Let's go over several methods that you can use for cross-validation.

### Cross-Validation Score
Let's import cross_val_score from the module model_selection.

In [35]:
from sklearn.model_selection import cross_val_score

We input the object, the feature ("horsepower"), and the target data (y_data). The parameter 'cv' determines the number of folds. In this case, it is 4.
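
As a minimal sketch of that call, the example below runs 4-fold cross-validation on synthetic data. The arrays `X_cv` and `y_cv` are hypothetical stand-ins for the horsepower feature and the price target; for a regressor, `cross_val_score` reports R^2 on each held-out fold by default.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data standing in for x_data[['horsepower']] and y_data
rng = np.random.default_rng(0)
X_cv = rng.uniform(50, 250, size=(200, 1))
y_cv = 150 * X_cv.ravel() + rng.normal(0, 4000, size=200)

# cv=4 splits the data into 4 folds; each fold is used once as the test set
scores = cross_val_score(LinearRegression(), X_cv, y_cv, cv=4)

print("R^2 per fold:", scores)
print("mean:", scores.mean(), "standard deviation:", scores.std())
```

The result is an array with one score per fold, so its mean and standard deviation give a rough picture of how stable the out-of-sample R^2 is.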