# Intro to Linear Regression with scikit-learn


- We will run a simple linear regression with basic data cleanup steps. Statistical tests, other methods, and charts are not included, for simplicity.
- The objective of this notebook is simply to introduce how to run a linear regression with the scikit-learn package.

### Target: 
Predict the prices of automobiles from selected numerical variables. The target variable is price.

In [2]:
#imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

df= pd.read_csv('automobiles.csv')
display(df)
display(df.info())
# missing values
print('Missing valus sum: ', df.isnull().sum().sum())
#duplicates check
print('Duplicates sum: ', df.duplicated().sum())
df.columns

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,-1,95,volvo,gas,std,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,9.5,114,5400,23,28,16845
201,-1,95,volvo,gas,turbo,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,8.7,160,5300,19,25,19045
202,-1,95,volvo,gas,std,four,sedan,rwd,front,109.1,...,173,mpfi,3.58,2.87,8.8,134,5500,18,23,21485
203,-1,95,volvo,diesel,turbo,four,sedan,rwd,front,109.1,...,145,idi,3.01,3.4,23.0,106,4800,26,27,22470


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalized-losses  205 non-null    object 
 2   make               205 non-null    object 
 3   fuel-type          205 non-null    object 
 4   aspiration         205 non-null    object 
 5   num-of-doors       205 non-null    object 
 6   body-style         205 non-null    object 
 7   drive-wheels       205 non-null    object 
 8   engine-location    205 non-null    object 
 9   wheel-base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64  
 14  engine-type        205 non-null    object 
 15  num-of-cylinders   205 non-null    object 
 16  engine-size        205 non

None

Missing valus sum:  0
Duplicates sum:  0


Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
       'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price'],
      dtype='object')

##### Observations for cleanup and preprocessing

- '?' values found under normalized-losses --> how many are there, and what percentage of rows? For simplicity we will drop these rows
- '?' values found under the target variable price --> delete these rows
- '?' values also found under bore and stroke --> delete these rows for simplicity
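
The first observation asks how many '?' placeholders there are and what share of the rows they affect. A minimal sketch of that check, assuming a small made-up frame (`toy` and its values are illustrative, not the real automobiles.csv):

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    'normalized-losses': ['?', '164', '?', '95'],
    'bore': ['3.47', '?', '3.19', '3.78'],
    'price': ['13495', '16500', '?', '16845'],
})

# Count '?' placeholders per column and express them as a share of all rows
placeholder_count = (toy == '?').sum()
placeholder_pct = (placeholder_count / len(toy) * 100).round(1)

summary = pd.DataFrame({'count': placeholder_count, 'percent': placeholder_pct})
print(summary)
```

Running the same two lines on the real frame would show whether dropping the rows costs only a handful of cars or a large part of the dataset.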

In [3]:
# target variable is price: show the rows where price == '?'
df[df['price'] == '?']

# drop rows with price == '?' (.copy() avoids a SettingWithCopyWarning on the assignment below)
df = df[df['price'] != '?'].copy()
# convert price to numeric
df['price'] = pd.to_numeric(df['price'])

In [4]:
# filter on normalized-losses == '?'
df[df['normalized-losses'] == '?']
# drop rows with normalized-losses == '?'
df = df[df['normalized-losses'] != '?'].copy()
# convert normalized-losses to numeric
df['normalized-losses'] = pd.to_numeric(df['normalized-losses'])

# filter on bore == '?'
df[df['bore'] == '?']
# drop rows with bore == '?'
df = df[df['bore'] != '?'].copy()
# convert bore to numeric
df['bore'] = pd.to_numeric(df['bore'])

# filter on stroke == '?'
df[df['stroke'] == '?']
# drop rows with stroke == '?'
df = df[df['stroke'] != '?'].copy()
# convert stroke to numeric
df['stroke'] = pd.to_numeric(df['stroke'])

# check for missing values after the cleanup
print('Missing values sum: ', df.isnull().sum().sum())

Missing values sum:  0
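
The drop-and-convert steps repeat the same pattern for each column. A compact sketch of the same idea, assuming a toy frame (`toy`, `cols` and the values are made up for illustration): mark '?' as missing, drop those rows, then convert the columns in one step.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    'price': ['13495', '?', '16845', '22470'],
    'bore': ['3.47', '3.19', '?', '3.01'],
    'make': ['audi', 'audi', 'volvo', 'volvo'],
})

cols = ['price', 'bore']  # columns assumed to contain the '?' placeholder

# Turn the '?' placeholder into a real missing value, then drop those rows
cleaned = toy.copy()
cleaned[cols] = cleaned[cols].replace('?', np.nan)
cleaned = cleaned.dropna(subset=cols)

# Convert the cleaned columns to numeric dtypes in one step
cleaned[cols] = cleaned[cols].apply(pd.to_numeric)
print(cleaned.dtypes)
```

Another route, not used in this notebook, is to pass `na_values='?'` to `pd.read_csv` so the placeholders arrive as NaN from the start.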


In [5]:
# for simplicity we keep only selected numerical type columns
df= df[['symboling', 'normalized-losses',  'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-size', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price']]
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 160 entries, 3 to 204
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          160 non-null    int64  
 1   normalized-losses  160 non-null    int64  
 2   wheel-base         160 non-null    float64
 3   length             160 non-null    float64
 4   width              160 non-null    float64
 5   height             160 non-null    float64
 6   curb-weight        160 non-null    int64  
 7   engine-size        160 non-null    int64  
 8   bore               160 non-null    float64
 9   stroke             160 non-null    float64
 10  compression-ratio  160 non-null    float64
 11  horsepower         160 non-null    object 
 12  peak-rpm           160 non-null    object 
 13  city-mpg           160 non-null    int64  
 14  highway-mpg        160 non-null    int64  
 15  price              160 non-null    int64  
dtypes: float64(7), int64(7), object

- The symboling variable indicates the degree of risk in relation to the insurer, taking into account factors like the risk of accidents and breakdowns.

- The normalized-losses variable represents the relative average annual cost of vehicle insurance. It is normalized across cars of the same type (SUV, utility, sports, etc.).

- The next 13 variables refer to technical specifications of the cars, including dimensions, engine displacement, horsepower, etc.

- The final variable, price, denotes the selling price of the vehicle. This is the variable we aim to predict.

In [6]:
#convert all to numeric
df = df.apply(pd.to_numeric, errors='coerce')
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 160 entries, 3 to 204
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          160 non-null    int64  
 1   normalized-losses  160 non-null    int64  
 2   wheel-base         160 non-null    float64
 3   length             160 non-null    float64
 4   width              160 non-null    float64
 5   height             160 non-null    float64
 6   curb-weight        160 non-null    int64  
 7   engine-size        160 non-null    int64  
 8   bore               160 non-null    float64
 9   stroke             160 non-null    float64
 10  compression-ratio  160 non-null    float64
 11  horsepower         160 non-null    int64  
 12  peak-rpm           160 non-null    int64  
 13  city-mpg           160 non-null    int64  
 14  highway-mpg        160 non-null    int64  
 15  price              160 non-null    int64  
dtypes: float64(7), int64(9)
memory 
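
A note on `errors='coerce'`: any value that cannot be parsed becomes NaN instead of raising an error, so it is worth checking how many values were coerced. A small sketch, assuming a made-up series (`raw` is illustrative):

```python
import pandas as pd

raw = pd.Series(['102', '115', '?', '5500'])

# errors='coerce' replaces anything that cannot be parsed with NaN
# instead of raising, so the result is float64 whenever a value was coerced
converted = pd.to_numeric(raw, errors='coerce')

# Comparing NaN counts before and after shows how many values were coerced
n_coerced = converted.isna().sum() - raw.isna().sum()
print(converted.dtype, n_coerced)
```

In the cell above, horsepower and peak-rpm come out as int64 with 160 non-null values, which indicates that no value had to be coerced.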

In [7]:
df

Unnamed: 0,symboling,normalized-losses,wheel-base,length,width,height,curb-weight,engine-size,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
3,2,164,99.8,176.6,66.2,54.3,2337,109,3.19,3.40,10.0,102,5500,24,30,13950
4,2,164,99.4,176.6,66.4,54.3,2824,136,3.19,3.40,8.0,115,5500,18,22,17450
6,1,158,105.8,192.7,71.4,55.7,2844,136,3.19,3.40,8.5,110,5500,19,25,17710
8,1,158,105.8,192.7,71.4,55.9,3086,131,3.13,3.40,8.3,140,5500,17,20,23875
10,2,192,101.2,176.8,64.8,54.3,2395,108,3.50,2.80,8.8,101,5800,23,29,16430
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,-1,95,109.1,188.8,68.9,55.5,2952,141,3.78,3.15,9.5,114,5400,23,28,16845
201,-1,95,109.1,188.8,68.8,55.5,3049,141,3.78,3.15,8.7,160,5300,19,25,19045
202,-1,95,109.1,188.8,68.9,55.5,3012,173,3.58,2.87,8.8,134,5500,18,23,21485
203,-1,95,109.1,188.8,68.9,55.5,3217,145,3.01,3.40,23.0,106,4800,26,27,22470


In [8]:
# split target variable
X = df.drop(columns=['price'])
y = df['price']

In [9]:
#train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
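
To see what `test_size=0.15` means in rows, here is a sketch on synthetic arrays with the same number of rows as the cleaned data (160). The names `X_demo` and `y_demo` are placeholders, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the cleaned data (160)
X_demo = np.arange(160 * 2).reshape(160, 2)
y_demo = np.arange(160)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=42
)

# test_size=0.15 reserves 15% of the rows for testing
print(X_tr.shape, X_te.shape)
```

With 160 rows this leaves 136 rows for training and 24 for testing; `random_state` only fixes which rows land on each side.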

In [10]:
# missing values
X_train.isnull().sum()
      

symboling            0
normalized-losses    0
wheel-base           0
length               0
width                0
height               0
curb-weight          0
engine-size          0
bore                 0
stroke               0
compression-ratio    0
horsepower           0
peak-rpm             0
city-mpg             0
highway-mpg          0
dtype: int64

In [11]:
#linear regression
# fit: trains the model on the dataset given as input
# predict: predicts the target variable from the set of explanatory variables given as input

#instantiate the model
linreg= LinearRegression()
# fit the model
linreg.fit(X_train, y_train)
# predict the target variable on train set
y_pred_train= linreg.predict(X_train)
# predict the target variable on test set
y_pred_test= linreg.predict(X_test)
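
To make fit and predict more concrete, the sketch below uses synthetic data that follows a known linear rule, so the learned parameters can be compared with the true ones. The data and the names `X_demo`, `y_demo`, `model` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 3*x1 - 2*x2 + 5 exactly (no noise)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 2))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 5

model = LinearRegression()
model.fit(X_demo, y_demo)  # learns one coefficient per column plus an intercept

print(model.coef_)                  # close to [3, -2]
print(model.intercept_)             # close to 5
print(model.predict([[1.0, 1.0]]))  # close to 3 - 2 + 5 = 6
```

On the automobile data the same attributes, `linreg.coef_` and `linreg.intercept_`, hold the fitted weights for the 15 explanatory variables.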

### Evaluating performance

MSE (mean squared error): a common metric for evaluating performance --> the average of the squared differences between the predicted values and the actual target values
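
As a quick check of that definition, the sketch below computes the MSE by hand on three made-up prices and compares it with `mean_squared_error`. The arrays `y_true` and `y_hat` are illustrative values, not notebook data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([10000, 12000, 15000])
y_hat = np.array([11000, 11500, 13000])

# MSE = mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)
mse_sklearn = mean_squared_error(y_true, y_hat)

print(mse_manual, mse_sklearn)
```

The residuals are -1000, 500 and 2000, so both values come out at 1,750,000: errors of a few thousand already give an MSE in the millions.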

In [12]:
mse_train = mean_squared_error(y_train, y_pred_train)
mse_test = mean_squared_error(y_test, y_pred_test)

print('Train MSE: ', round(mse_train, 0))
print('Test MSE: ', round(mse_test, 0))

Train MSE:  5499270.0
Test MSE:  3582946.0


- large gap between the train and test scores
- because MSE is expressed in squared price units, the numbers are very large and hard to interpret
- instead, we use MAE --> this metric is on the same scale as the target variable, which makes it easier to interpret
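
Another way to get back to the scale of the target, not used in this notebook, is the root of the MSE (RMSE). A short sketch using the two MSE values printed above:

```python
import math

# MSE values reported by the cell above (units: price squared)
mse_train = 5499270.0
mse_test = 3582946.0

# Taking the square root brings the error back to price units
rmse_train = math.sqrt(mse_train)
rmse_test = math.sqrt(mse_test)

print(round(rmse_train), round(rmse_test))  # 2345 1893
```

RMSE weights large errors more heavily than MAE does, so it is never smaller than the MAE on the same predictions.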

In [13]:
mae_train= mean_absolute_error(y_train, y_pred_train)
mae_test= mean_absolute_error(y_test, y_pred_test)
print('Train MAE: ', round(mae_train, 0))
print('Test MAE: ', round(mae_test, 0))

Train MAE:  1732.0
Test MAE:  1555.0


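- much better for interpretation!
- the difference is less than 200, and the error is lower on the test set. This is not by itself a sign of underfitting: the test set is small (15% of 160 rows, i.e. 24 cars), so the gap may simply reflect which rows ended up in it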

- We will now calculate the relative error by first getting the average price of all vehicles in the dataset


In [14]:
mean_price= df['price'].mean()
print('Mean price: ', round(mean_price, 0))

Mean price:  11428.0


In [17]:
print('Relative error test set: ', round(mae_test/mean_price*100, 0), '%')

Relative error test set:  14.0 %
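
The same calculation can be wrapped in a small helper. The function `relative_mae` below is a hypothetical name, and unlike the cell above it divides by the mean of the values passed in rather than by the mean price of the whole dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

def relative_mae(y_true, y_pred):
    """MAE divided by the mean of the true values, as a percentage."""
    return mean_absolute_error(y_true, y_pred) / np.mean(y_true) * 100

# Arithmetic check with the rounded figures printed earlier in the notebook
print(round(1555.0 / 11428.0 * 100, 1))  # 13.6
```

Calling `relative_mae(y_test, y_pred_test)` would give the error relative to the average test price, which can differ slightly from the 14% shown above.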


## Conclusion
The MAE is 14% of the average price, which is not optimal. It is still a good baseline for improving the model or testing more advanced models.


### Improve performance
- Instead of deleting the '?' values, we could replace them with the mean/median for numerical values or with the mode for categorical values (using the SimpleImputer)
- Instead of removing the categorical variables, we could run statistical tests to decide which features to keep --> Pearson, Spearman, chi-squared test --> calculate p-values
- Review the choice of encoding for categorical variables (one-hot encoding? ordinal encoding?)
- Review correlations among numerical variables --> if some variables correlate strongly with each other, keep one; the rest may only add noise
- Outliers --> decide how to deal with them --> remove them? create categories of ranges?
- Create new features out of the available ones?
....
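
The imputation and encoding ideas from the list can be sketched as follows. This is a minimal example on a made-up frame, not a tested pipeline for the automobile data: `toy`, `preprocess` and the chosen strategies are assumptions, and NaN stands in for the '?' placeholder.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy rows (illustrative values); NaN stands in for the '?' placeholder
toy = pd.DataFrame({
    'horsepower': [102.0, np.nan, 110.0, 160.0],
    'fuel-type': ['gas', 'gas', np.nan, 'diesel'],
})

preprocess = ColumnTransformer(
    [
        # numeric column: fill gaps with the median
        ('num', SimpleImputer(strategy='median'), ['horsepower']),
        # categorical column: fill gaps with the mode, then one-hot encode
        ('cat', make_pipeline(SimpleImputer(strategy='most_frequent'),
                              OneHotEncoder(handle_unknown='ignore')),
         ['fuel-type']),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_prepared = preprocess.fit_transform(toy)
print(X_prepared)
```

The missing horsepower is filled with the median (110) and the missing fuel type with the most frequent value (gas). Fitting the transformer on the training split only, then applying it to the test split, keeps test information out of the imputation.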

There are various approaches that can be taken, both on the data processing side (feature selection, engineering, reduction) and on the modelling side (choosing other, more advanced models).
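
As one example on the feature selection side, strongly correlated numerical features can be flagged with a correlation matrix. The sketch below uses synthetic columns; the 0.9 threshold and the names `toy`, `upper`, `to_drop` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy numeric features (synthetic); 'highway-mpg' is built to track 'city-mpg'
rng = np.random.default_rng(0)
city = rng.uniform(15, 40, size=100)
toy = pd.DataFrame({
    'city-mpg': city,
    'highway-mpg': city + 5 + rng.normal(0, 0.5, size=100),
    'height': rng.uniform(48, 60, size=100),
})

# Absolute Pearson correlation between every pair of features
corr = toy.corr().abs()

# Keep only the upper triangle so each pair is listed once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off, to be tuned
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)
```

Here only highway-mpg is flagged, because it was constructed from city-mpg. On the real data the flagged columns are candidates to review, not an automatic drop list.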

This notebook served simply as an introduction to performing a linear regression with the scikit-learn package.