### Machine Learning Model for Auto-MPG (Miles Per Gallon)

Data Source: [Auto MPG](https://archive-beta.ics.uci.edu/ml/datasets/auto+mpg)  
We use the non-original version of the data (`auto-mpg.data`) and remove the rows with unknown values.  
Goal: build a linear regression model that predicts `mpg` from five numeric features.
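
For reference, a linear regression on the five features selected later in the notebook has the form below, and the notebook scores it with the root mean squared error (RMSE). The symbols $\beta_j$, $y_i$, $\hat{y}_i$ and $n$ are notation introduced here for illustration, not taken from the source.

$$
\widehat{\text{mpg}} = \beta_0 + \beta_1\,\text{cylinders} + \beta_2\,\text{displacement} + \beta_3\,\text{horsepower} + \beta_4\,\text{weight} + \beta_5\,\text{acceleration}
$$

$$
\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

Here $y_i$ is the observed mpg of car $i$, $\hat{y}_i$ is the model's prediction, and $n$ is the number of cars in the set being scored. RMSE is expressed in the same unit as the target, miles per gallon.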

In [36]:
import pandas as pd
from math import sqrt
import pickle

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [39]:
# read data
col = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year', 'origin', 'car_name']
df = pd.read_fwf("./data/auto-mpg.data", names=col)
df.head()

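|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | car_name |
|---|-----|-----------|--------------|------------|--------|--------------|------------|--------|----------|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 | "chevrolet chevelle malibu" |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 | "buick skylark 320" |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 | "plymouth satellite" |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 | "amc rebel sst" |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 | "ford torino" |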


In [40]:
# drop the columns that are not used as features
df1 = df.drop(columns=['origin', 'car_name', 'model_year'])
# remove rows whose horsepower is unknown ('?'); .copy() makes df2 an
# independent DataFrame, so the assignment below no longer triggers
# pandas' SettingWithCopyWarning
df2 = df1[df1['horsepower'] != '?'].copy()
df2['horsepower'] = df2['horsepower'].astype(float)
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 0 to 397
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           392 non-null    float64
 1   cylinders     392 non-null    int64  
 2   displacement  392 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        392 non-null    float64
 5   acceleration  392 non-null    float64
dtypes: float64(5), int64(1)
memory usage: 21.4 KB
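
The cleaning step above filters out the `'?'` placeholder and then casts the column to float. A possible alternative, sketched here on a small made-up frame rather than the real file, is to let `pd.to_numeric` coerce non-numeric entries to `NaN` and then drop them. The names `raw` and `clean` and the sample values are illustrative assumptions.

```python
import pandas as pd

# Illustrative stand-in for the raw data: horsepower arrives as strings and
# '?' marks an unknown value. These rows are made up for the demo.
raw = pd.DataFrame({
    "mpg": [18.0, 25.0, 15.0],
    "horsepower": ["130.0", "?", "165.0"],
})

# errors="coerce" converts anything that cannot be parsed as a number to NaN
clean = raw.assign(horsepower=pd.to_numeric(raw["horsepower"], errors="coerce"))

# Drop the rows whose horsepower is unknown and renumber the index
clean = clean.dropna(subset=["horsepower"]).reset_index(drop=True)

print(clean.dtypes)
print(len(raw) - len(clean), "row(s) dropped")
```

Because `assign` and `dropna` return new DataFrames, this variant never writes into a slice of another frame, so it should not raise the chained-assignment warning either.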


In [26]:
# train-test split
X = df2[['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration']]
y = df2['mpg']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train Model
lr = LinearRegression()
lr.fit(X_train, y_train)

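# Get predictions and report the root mean squared error (RMSE)
# on the training set and on the held-out test set
y_train_pred = lr.predict(X_train)
print('Train RMSE:', sqrt(mean_squared_error(y_train, y_train_pred)))
y_test_pred = lr.predict(X_test)
print('Test RMSE:', sqrt(mean_squared_error(y_test, y_test_pred)))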
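
RMSE alone says little about which features drive the prediction. The sketch below shows one way to add an R² score and list the fitted coefficients by feature name. It runs on synthetic stand-in data, not the Auto MPG values; `X_demo`, `y_demo`, `true_coefs` and `model` are hypothetical names chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

features = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration']

# Synthetic stand-in data (NOT the Auto MPG values): random features and a
# target that is linear in them plus a little Gaussian noise.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, len(features))), columns=features)
true_coefs = np.array([-1.0, 0.5, -2.0, -3.0, 0.2])  # hypothetical values
y_demo = 25.0 + X_demo.to_numpy() @ true_coefs + rng.normal(scale=0.1, size=200)

# Same split-and-fit pattern as the notebook cell above
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
model = LinearRegression().fit(X_tr, y_tr)

# Score on the held-out part: RMSE in target units, R^2 as a unitless ratio
y_pred = model.predict(X_te)
rmse = np.sqrt(mean_squared_error(y_te, y_pred))
r2 = r2_score(y_te, y_pred)
print(f"test RMSE: {rmse:.3f}, test R^2: {r2:.3f}")

# Pair each fitted coefficient with its feature name for easier reading
coef_by_feature = pd.Series(model.coef_, index=features)
print(coef_by_feature)
```

On the real data the same last two steps would use `lr`, `X_test` and `y_test` from the cell above. Coefficients of unscaled features are not directly comparable in size, since each one is expressed per unit of its own feature.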

In [38]:
# save model; the with-block closes the file as soon as the dump is done
filename = 'auto_mpg_lr_model.pkl'
with open(filename, 'wb') as f:
    pickle.dump(lr, f)
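
A saved model is only useful if it can be loaded again. The sketch below round-trips a tiny stand-in model through `pickle` and checks that the restored object predicts the same values. The training numbers, the temporary directory and the file name `demo_model.pkl` are assumptions made for the example.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model fitted on made-up numbers
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([2.0, 4.0, 6.0, 8.0])
model = LinearRegression().fit(X_demo, y_demo)

# Write the model to a temporary file, then read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should give the same predictions as the original
same_predictions = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print("predictions match after reload:", same_predictions)
```

Pickle files can execute arbitrary code when loaded, so only unpickle models from sources you trust, and preferably load them with the same scikit-learn version that saved them.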