# Assignment-3

`assignment3.csv` contains the data for this analysis. The goal is to predict car prices.

1. The complete documentation on the dataset can be found here for your reference: https://archive.ics.uci.edu/ml/datasets/Automobile
2. Delete rows with missing data points (denoted by "?").
3. Convert features to the appropriate data types (e.g. float, integer) where needed. Refer to the documentation.
4. You should be able to distinguish numerical from categorical features. Refer to the documentation.
5. target = \['price'\]
6. When calling `train_test_split`, use `random_state = 20190227`.
7. Use cross-validation together with a greedy algorithm to find the feature subset that gives the smallest mean squared error (MSE) on the validation set. Finally, compute the root mean squared error (RMSE) on the test set.

Note: the running time for this assignment could be long. Do not shut down the kernel while the code is running.

This time, we will not give you hints. You should use sklearn to do the analysis!

In [1]:
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import mean_squared_error
from itertools import combinations

In [2]:
data = pd.read_csv('assignment3.csv').drop(['Unnamed: 0'], axis = 1)
data = data[~(data == '?').any(axis=1)]
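
As an alternative to comparing every cell with "?", `read_csv` can be told to treat "?" as missing, after which `dropna` removes the incomplete rows. A minimal sketch on a small made-up table (the three rows below are illustrative and are not taken from `assignment3.csv`):

```python
import io
import pandas as pd

# hypothetical miniature of the file: "?" marks a missing value
raw = io.StringIO(
    "make,horsepower,price\n"
    "audi,102,13950\n"
    "bmw,?,16430\n"
    "dodge,68,?\n"
)

# na_values turns "?" into NaN, so the numeric columns are parsed as numbers
demo = pd.read_csv(raw, na_values="?")

# drop every row that has at least one missing value
demo_clean = demo.dropna()

print(demo_clean)
print(demo_clean.dtypes)
```

With this route the columns that contained "?" arrive as floats, so they need no separate type conversion.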

In [3]:
data.dtypes

symboling              int64
normalized-losses     object
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                  object
stroke                object
compression-ratio    float64
horsepower            object
peak-rpm              object
city-mpg               int64
highway-mpg            int64
price                 object
dtype: object

In [4]:
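# these numeric columns were read as text, so convert them to float and keep the result
numeric_as_text = ['normalized-losses','bore','stroke','horsepower','peak-rpm','price']
data = data.astype({col: 'float' for col in numeric_as_text})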
features_numerical = ['normalized-losses','wheel-base','length','width','height','curb-weight','engine-size','bore','stroke','compression-ratio','horsepower','peak-rpm','city-mpg','highway-mpg']
target = ['price']
features_categorical = list(set(data.columns) - set(features_numerical) - set(target))
features = features_categorical + features_numerical
sales = data[features_numerical + features_categorical + target]
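
The numerical/categorical split can also be derived from the dtypes once the conversion is done. The sketch below assumes a tiny made-up frame and a hypothetical `coded_as_int` list for integer columns that are really categories (such as `symboling`); dtype alone cannot identify those.

```python
import pandas as pd

# hypothetical miniature of the cleaned data
demo = pd.DataFrame({
    "symboling": [3, 1, 2],
    "make": ["audi", "bmw", "dodge"],
    "curb-weight": [2337, 2395, 1876],
    "price": [13950.0, 16430.0, 5572.0],
})
target = ["price"]

# integer-coded columns to treat as categories (an assumption of this sketch)
coded_as_int = ["symboling"]

# every column with a numeric dtype, in the frame's column order
numeric_cols = demo.select_dtypes(include="number").columns

features_numerical = [c for c in numeric_cols if c not in target + coded_as_int]
features_categorical = [c for c in demo.columns
                        if c not in features_numerical + target]

print(features_numerical)
print(features_categorical)
```

List comprehensions keep the column order of the frame, which makes the feature lists reproducible from run to run.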

In [5]:
def onehot_encoder(df, feature):
    # replace `feature` with one indicator column per category
    if feature not in df.columns:
        raise ValueError(f"'{feature}' is not a column of this DataFrame")
    # The following line is important, refer to Assignment 2
    return pd.get_dummies(df, columns=[feature])
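
To see what the encoding step does, here is a small illustration on a made-up frame (the values are placeholders): `pd.get_dummies` drops the original column and appends one indicator column per category.

```python
import pandas as pd

# hypothetical three-row frame with one categorical feature
demo = pd.DataFrame({
    "fuel-type": ["gas", "diesel", "gas"],
    "price": [13950.0, 16430.0, 5572.0],
})

# "fuel-type" is replaced by "fuel-type_diesel" and "fuel-type_gas"
encoded = pd.get_dummies(demo, columns=["fuel-type"])

print(list(encoded.columns))
```

Columns that are not encoded stay in place, so the target column survives and can still be dropped by name afterwards.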

In [6]:
def CrossValidation(data, target, proportion):

    # train valid split
    train, valid = train_test_split(data,
                                    test_size = proportion,
                                    random_state = 20190227)
    
    # extract X and Y to be fit in a model
    X_train = train.drop(target, axis = 1)
    Y_train = train[target]
    X_valid = valid.drop(target, axis = 1)
    Y_valid = valid[target]
    
    # build linear regression model
    model = linear_model.LinearRegression()

    # fit model using training data
    model.fit(X_train, Y_train)
    
    # predict using validation data
    Y_valid_fit = model.predict(X_valid)
    
    return mean_squared_error(Y_valid, Y_valid_fit)
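
The function above scores a model on a single train/validation split. A k-fold variant averages the validation MSE over several splits instead. The following is a minimal sketch on synthetic data; the helper name `kfold_mse` and the choice of five folds are assumptions, not part of the assignment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

def kfold_mse(X, y, n_splits=5, seed=20190227):
    """Average validation MSE of a linear model over n_splits folds."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # sklearn scorers follow "higher is better", so the MSE comes back negated
    scores = cross_val_score(LinearRegression(), X, y,
                             scoring="neg_mean_squared_error", cv=folds)
    return -scores.mean()

# synthetic regression problem: three features, small Gaussian noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

demo_mse = kfold_mse(X_demo, y_demo)
print(demo_mse)
```

Averaging over folds makes the score less sensitive to which rows happen to land in the validation part, at the cost of fitting the model `n_splits` times.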

In [7]:
# initialize a list to save features
greedy_select = []

# and a numpy array to save their model MSE
MSE_greedy_algo = np.array([])

for i in range(len(features)):
    MSE = np.array([])
    features_left = list(set(features) - set(greedy_select))
    
    for new in features_left:
        features_new = greedy_select + [new]
        train_valid_sub = sales[features_new + target]
        
        # get all categorical features in sub
        categorical_sub = list(set(features_new) & set(features_categorical))
        
        # if there really are categorical features,
        # we need to do one-hot encoding.
        if len(categorical_sub) != 0:
            for cat in categorical_sub:
                # Again, this line is important. Refer to Assignment 2
                train_valid_sub = onehot_encoder(train_valid_sub, cat)
            
        # CrossValidation, compute the mse and save it into MSE_sub
        MSE_sub = CrossValidation(train_valid_sub, 'price', 0.2)
        MSE = np.append(MSE, MSE_sub)
        
    # pick the feature that gives the smallest MSE
    # and add it to our feature list;
    # meanwhile, save the corresponding MSE
    greedy_select += [features_left[MSE.argmin()]]
    MSE_greedy_algo = np.append(MSE_greedy_algo, MSE.min())
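
The same greedy forward selection can be written as a reusable function. The sketch below is one possible arrangement, not the assignment's reference solution: it works on numeric columns only, scores each candidate with k-fold cross-validation, and runs on synthetic data in which only columns `a` and `c` drive the target. The names `cv_mse` and `greedy_forward` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

def cv_mse(frame, cols, target, folds):
    """Mean validation MSE of a linear model on the given columns."""
    scores = cross_val_score(LinearRegression(), frame[cols], frame[target],
                             scoring="neg_mean_squared_error", cv=folds)
    return -scores.mean()

def greedy_forward(frame, candidates, target, folds):
    """Add one feature at a time, always the one with the lowest CV MSE."""
    selected, history = [], []
    remaining = list(candidates)
    while remaining:
        # score every one-feature extension of the current subset
        trial = {c: cv_mse(frame, selected + [c], target, folds)
                 for c in remaining}
        best = min(trial, key=trial.get)
        selected.append(best)
        remaining.remove(best)
        history.append(trial[best])
    # keep the prefix of the selection path with the lowest validation MSE
    best_len = int(np.argmin(history)) + 1
    return selected[:best_len], history

# synthetic data: y depends on "a" and "c" only
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
demo["y"] = 3 * demo["a"] - 2 * demo["c"] + rng.normal(scale=0.1, size=200)

folds = KFold(n_splits=5, shuffle=True, random_state=20190227)
chosen, history = greedy_forward(demo, list("abcde"), "y", folds)
print(chosen)
```

Each pass fits one model per remaining feature, so the total number of fits grows quadratically with the number of features, which is why the full run can take a while.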

In [8]:
features_greedy = greedy_select[:(MSE_greedy_algo.argmin()+1)]
sales_greedy = sales[features_greedy + target]
features_greedy

['curb-weight',
 'make',
 'height',
 'aspiration',
 'body-style',
 'stroke',
 'symboling',
 'width',
 'engine-type',
 'length',
 'num-of-doors',
 'engine-location']

In [9]:
categorical_cv = list(set(features_greedy) & set(features_categorical))

if len(categorical_cv) != 0:
    for i in categorical_cv:
        sales_greedy = onehot_encoder(sales_greedy, i)

In [10]:
train_valid_greedy, test_greedy = train_test_split(sales_greedy, test_size = 0.2, random_state = 20190227)
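
Calling `train_test_split` twice on the same rows with the same `random_state` and `test_size` returns the same partition, so the test rows stay independent of the selection step only if they are set aside before it. A minimal sketch of that ordering on synthetic data (the column names `x1` and `x2` are placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# synthetic stand-in for the encoded data set
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "price"])

# 1) set the test rows aside before any feature selection
train_valid, test = train_test_split(demo, test_size=0.2,
                                     random_state=20190227)

# 2) split only the remaining rows into training and validation parts
train, valid = train_test_split(train_valid, test_size=0.2,
                                random_state=20190227)

print(len(train), len(valid), len(test))
```

Feature selection would then use only `train` and `valid`, and `test` would be touched once, for the final RMSE.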

In [11]:
# build the linear regression model
model_greedy = linear_model.LinearRegression()

# split features and target
X_greedy = train_valid_greedy.drop(target, axis = 1)
Y_greedy = train_valid_greedy[target]

# fit model
model_greedy.fit(X_greedy, Y_greedy)

# predict on the test set
X_test_greedy = test_greedy.drop(target, axis = 1)
Y_test_greedy = test_greedy[target]

Y_test_greedy_fit = model_greedy.predict(X_test_greedy)
np.sqrt(mean_squared_error(Y_test_greedy, Y_test_greedy_fit))

1240.310026313878