In [1]:
from ml import *

In [2]:
df = pd.read_csv('/data/datasets/auto-mpg.csv')

# Data Inspection

A dataframe offers several methods for inspecting the data. A good starting point is the `head` method, which shows the first rows so we can see the columns and guess their meaning and data type.

In [3]:
df.head()

,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino
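
Beyond `head`, a few other inspection calls are often useful. The sketch below rebuilds some of the columns printed above by hand, so that it runs without the CSV file, and shows `shape`, `describe` and `isna`. The variable name `sample` is only for illustration.

```python
import pandas as pd

# Rows copied from the df.head() output above; horsepower is kept as text,
# which is how it arrives from the raw file.
sample = pd.DataFrame({
    "mpg": [18.0, 15.0, 18.0, 16.0, 17.0],
    "cylinders": [8, 8, 8, 8, 8],
    "horsepower": ["130", "165", "150", "150", "140"],
    "weight": [3504, 3693, 3436, 3433, 3449],
})

shape = sample.shape            # (number of rows, number of columns)
summary = sample.describe()     # summary statistics, numeric columns only
missing = sample.isna().sum()   # count of missing values per column

print(shape)
print(summary)
print(missing)
```

Note that `describe` silently skips the text column, which is another hint that a column you expected to be numeric was not parsed as such.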


It can be useful to check the data types of the columns. A column that contains only numbers and missing values is stored as a float (or as an int when nothing is missing). If you expect a numeric column and it says object, it may contain a corrupt value. This appears to be the case for the feature horsepower.

In [4]:
df.dtypes

mpg             float64
cylinders         int64
displacement    float64
horsepower       object
weight            int64
acceleration    float64
model year        int64
origin            int64
car name         object
dtype: object

In [5]:
df.horsepower.unique()

array(['130', '165', '150', '140', '198', '220', '215', '225', '190',
       '170', '160', '95', '97', '85', '88', '46', '87', '90', '113',
       '200', '210', '193', '?', '100', '105', '175', '153', '180', '110',
       '72', '86', '70', '76', '65', '69', '60', '80', '54', '208', '155',
       '112', '92', '145', '137', '158', '167', '94', '107', '230', '49',
       '75', '91', '122', '67', '83', '78', '52', '61', '93', '148',
       '129', '96', '71', '98', '115', '53', '81', '79', '120', '152',
       '102', '108', '68', '58', '149', '89', '63', '48', '66', '139',
       '103', '125', '133', '138', '135', '142', '77', '62', '132', '84',
       '64', '74', '116', '82'], dtype=object)

We can either replace the '?' with NaN, or reload the data with '?' declared as a missing value. We can then drop 'car name' and opt to drop the row(s) containing the missing value.

In [6]:
df = pd.read_csv('/data/datasets/auto-mpg.csv', na_values=['?'])
df = df.drop(columns='car name').dropna()
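
The cell above takes the reload route. For comparison, here is a minimal sketch of the other option mentioned in the text, replacing '?' in place and converting the column to float. The frame `raw` holds made-up rows purely for illustration; it is not the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the loaded data (illustration only).
raw = pd.DataFrame({
    "mpg": [18.0, 15.0, 25.0],
    "horsepower": ["130", "?", "95"],
    "car name": ["car a", "car b", "car c"],
})

clean = raw.copy()
# Swap the '?' marker for NaN, then convert the remaining strings to floats.
clean["horsepower"] = clean["horsepower"].replace("?", np.nan).astype(float)
# Drop the text column and every row that still has a missing value.
clean = clean.drop(columns="car name").dropna()

print(clean.dtypes)
print(len(clean))
```

Both routes should end in the same place; declaring `na_values` at load time is shorter when the marker is known in advance.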

We can convert the data to NumPy arrays for training. We can use the train_test_split function from SKLearn to set aside 20% of the data for validation.

In [7]:
y = df.mpg.to_numpy()
X = df.drop(columns='mpg').to_numpy()
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.2, random_state=0)
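
To make the effect of `test_size` and `random_state` concrete, the sketch below splits a small made-up array. The import path `sklearn.model_selection` is the standard scikit-learn location; the notebook itself gets the function through `from ml import *`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 made-up samples with 2 features; row i is [2i, 2i+1] and its target is i.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

train_X, valid_X, train_y, valid_y = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# Same random_state, same split: useful for reproducible experiments.
again = train_test_split(X, y, test_size=0.2, random_state=0)

print(train_X.shape, valid_X.shape)
```

With `test_size=0.2`, 8 of the 10 rows go to training and 2 to validation, and each row of `train_X` stays paired with its entry in `train_y`.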

The data is now stored in NumPy arrays, which can be passed directly to the model.

In [9]:
train_X

array([[  4. ,  85. ,  70. , ...,  16.8,  77. ,   3. ],
       [  6. , 225. , 100. , ...,  17.2,  78. ,   1. ],
       [  4. , 105. ,  70. , ...,  13.2,  79. ,   1. ],
       ...,
       [  4. , 116. ,  75. , ...,  15.5,  73. ,   2. ],
       [  6. , 250. ,  88. , ...,  14.5,  71. ,   1. ],
       [  6. , 171. ,  97. , ...,  14.5,  75. ,   1. ]])

Then we fit a LinearRegression model like before, and report the Root Mean Squared Error (RMSE). An RMSE of 28 would mean that our predictions of miles per gallon (mpg) are typically off by roughly 28 mpg.

In [10]:
model = LinearRegression()
model.fit(train_X, train_y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [11]:
pred_y = model.predict(valid_X)
sqrt(squared_error(valid_y, pred_y))

28.345807484665073
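
The `squared_error` helper comes from the `ml` import and its definition is not shown here, so it is worth knowing how to compute RMSE independently. The sketch below uses scikit-learn's `mean_squared_error` on made-up numbers, and contrasts it with the square root of the summed squared errors, which is a different quantity that grows with the number of validation rows.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions (illustration only).
valid_y = np.array([18.0, 15.0, 18.0, 16.0])
pred_y = np.array([17.0, 17.0, 18.0, 14.0])

mse = mean_squared_error(valid_y, pred_y)   # mean of the squared differences
rmse = np.sqrt(mse)                         # root mean squared error

# Square root of the SUM of squared errors: equals rmse * sqrt(n), not rmse.
root_sse = np.sqrt(np.sum((valid_y - pred_y) ** 2))

print(rmse, root_sse)
```

If a reported error looks large compared with the range of the target (the mpg values printed earlier are below 20), checking whether the helper averages or sums is a reasonable first step.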

Next we can add polynomial features to see if a non-linear function improves results. SKLearn has a PolynomialFeatures transformation class for that. When we fit a second-order polynomial function, the reported error improves from about 28 to about 23 mpg.

In [12]:
poly = PolynomialFeatures(degree=2)
Xp = poly.fit_transform(X)
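
To see what the degree-2 expansion produces, the sketch below applies `PolynomialFeatures` to a single made-up row with two features, here called a and b for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One made-up sample with two features: a = 2, b = 3.
X = np.array([[2.0, 3.0]])

poly = PolynomialFeatures(degree=2)
Xp = poly.fit_transform(X)

# Columns are: 1 (bias), a, b, a^2, a*b, b^2
print(Xp)
```

A linear model fitted on these columns is still linear in its coefficients, but it can now represent squared terms and pairwise interactions of the original features.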

In [13]:
train_X, valid_X, train_y, valid_y = train_test_split(Xp, y, test_size=0.2, random_state=0)

In [14]:
model = LinearRegression()
model.fit(train_X, train_y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [15]:
pred_y = model.predict(valid_X)
sqrt(squared_error(valid_y, pred_y))

23.2217850097223