# DSC410: Regression Models

**Name**: Joseph Choi <br>
**Class**: DSC410-T301 Predictive Analytics (2243-1)

**Instructions**: 
- For the large data set you performed EDA on in Week 2, use one of the covered algorithms to predict y 
- Note that the target is a continuous numerical variable (regression problem)
- You can optionally print out the R2 score

### Setup:

In [52]:
# Packages:

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.neighbors import KNeighborsRegressor

In [33]:
# Loading the 'eda_data' CSV file into 'eda_df'
eda_df = pd.read_csv('eda_data.csv')

### Data Cleaning:

In [20]:
# Creating a copy of the df to perform data cleaning procedures
eda_df_copy = eda_df.copy()

# Removing the characters '$', ',', '(' and ')' from column 'x6' and converting cleaned values to float64
# (raw string so that '\$' is not treated as an invalid Python escape sequence)
eda_df_copy['x6'] = pd.to_numeric(eda_df_copy['x6'].replace(r'[\$,()]', '', regex=True))
eda_df_copy['x6'] = eda_df_copy['x6'].astype('float64')

# Removing the '%' character from column 'x10' and converting cleaned values to float64
eda_df_copy['x10'] = pd.to_numeric(eda_df_copy['x10'].str.replace('%', '', regex=False))

# Printing results:
eda_df_copy.head(5)

| | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 | x13 | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -17.933519 | 6.55922 | -14.45281 | -4.732855 | 0.381673 | 2.563194 | 1306.52 | -89.394348 | -28.454044 | -16.201298 | -0.01 | 0.21701 | 9.729891 | -0.786431 | 0.666146 |
| 1 | -37.214754 | 10.77493 | -15.384004 | -0.077339 | 10.983774 | -15.210206 | 24.86 | 153.032652 | -32.557736 | 69.675903 | 0.0 | -3.584908 | 35.727926 | -0.985552 | 0.378411 |
| 2 | 0.330441 | -19.609972 | -9.167911 | 2.064124 | 12.071688 | 12.506141 | 110.85 | -141.437276 | -20.794952 | 55.042604 | 0.0 | -3.991366 | -9.283523 | -3.394718 | 0.624498 |
| 3 | -13.709765 | -8.01139 | 6.759264 | 1.727615 | -1.768382 | 24.039733 | 324.43 | 51.039653 | -7.046908 | -31.424419 | 0.01 | 7.908897 | -2.891882 | -2.690222 | 0.126622 |
| 4 | -4.202598 | 7.07621 | -26.004919 | -4.269696 | -3.414224 | 2.115989 | 1213.37 | -31.0467 | 19.061182 | -31.525515 | -0.01 | 0.846719 | 25.49748 | 3.516801 | 0.640025 |


In [32]:
# Displaying summary of eda_df_copy
eda_df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 15 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x0      9996 non-null   float64
 1   x1      9995 non-null   float64
 2   x2      9996 non-null   float64
 3   x3      9997 non-null   float64
 4   x4      9997 non-null   float64
 5   x5      9999 non-null   float64
 6   x6      9996 non-null   float64
 7   x7      9998 non-null   float64
 8   x8      9999 non-null   float64
 9   x9      9996 non-null   float64
 10  x10     9997 non-null   float64
 11  x11     9995 non-null   float64
 12  x12     9999 non-null   float64
 13  x13     9998 non-null   float64
 14  y       9999 non-null   float64
dtypes: float64(15)
memory usage: 1.1 MB


**Interpretation**: <br>Several columns contain a few null values (at most 4 of the 9,999 rows in any one column). A simple imputation procedure will therefore be applied to handle the nulls, replacing each missing value with the mean of its column.

In [38]:
# Initializing a mean imputer via 'SimpleImputer()'
imputer = SimpleImputer(strategy='mean')

# Applying imputation to the dataset
eda_df_copy_imputed = pd.DataFrame(imputer.fit_transform(eda_df_copy), columns=eda_df_copy.columns)

# Displaying summary of eda_df_copy_imputed
eda_df_copy_imputed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 15 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x0      9999 non-null   float64
 1   x1      9999 non-null   float64
 2   x2      9999 non-null   float64
 3   x3      9999 non-null   float64
 4   x4      9999 non-null   float64
 5   x5      9999 non-null   float64
 6   x6      9999 non-null   float64
 7   x7      9999 non-null   float64
 8   x8      9999 non-null   float64
 9   x9      9999 non-null   float64
 10  x10     9999 non-null   float64
 11  x11     9999 non-null   float64
 12  x12     9999 non-null   float64
 13  x13     9999 non-null   float64
 14  y       9999 non-null   float64
dtypes: float64(15)
memory usage: 1.1 MB
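
To make the mean-imputation step concrete, here is a minimal sketch on a tiny made-up frame. The names `toy_df`, `toy_imputer`, and `toy_imputed` are hypothetical and are not part of the assignment data; the point is only to show that each missing value is replaced by the mean of the non-missing values in its own column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column
toy_df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, np.nan]})

# Same procedure as in the notebook: fit a mean imputer and rebuild the DataFrame
toy_imputer = SimpleImputer(strategy="mean")
toy_imputed = pd.DataFrame(toy_imputer.fit_transform(toy_df), columns=toy_df.columns)

# The NaN in 'a' becomes mean(1, 3) and the NaN in 'b' becomes mean(10, 20)
print(toy_imputed)
```

Because at most 4 values per column are missing here, this choice is unlikely to change the column distributions much; with heavier missingness, a different strategy might be worth considering.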


### MLR:

In [42]:
# Splitting data into input (X) and target (y)
mlr_X = eda_df_copy_imputed.drop('y', axis=1)
mlr_y = eda_df_copy_imputed['y']

In [43]:
# Initializing and fitting the model
mlr_model = LinearRegression()
mlr_model.fit(mlr_X, mlr_y)

In [45]:
# Predicting 'y' using the trained model
mlr_predictions = mlr_model.predict(mlr_X)

In [46]:
# Evaluating model performance via R2 score
mlr_r2 = r2_score(mlr_y, mlr_predictions)

# Printing results:
mlr_r2

0.0010547721715407077
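
Note that the R2 above is computed on the same rows the model was fit on. One possible way to estimate out-of-sample performance is to hold out part of the data, as in the sketch below. The helper `holdout_r2` and the synthetic arrays `X_demo` and `y_demo` are hypothetical names introduced only for illustration, and the 80/20 split is an assumption, not something the assignment specifies.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def holdout_r2(model, X, y, test_size=0.2, random_state=0):
    """Fit `model` on a training split and return R2 on the held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model.fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))


# Synthetic stand-in data (hypothetical, not the notebook's dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)

demo_r2 = holdout_r2(LinearRegression(), X_demo, y_demo)
print(f"Held-out R2 on synthetic data: {demo_r2:.3f}")
```

In this notebook the same helper could be called as `holdout_r2(LinearRegression(), mlr_X, mlr_y)`.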

### Lasso Regression:

In [48]:
# Splitting data into input (X) and target (y)
lasso_X = eda_df_copy_imputed.drop('y', axis=1)
lasso_y = eda_df_copy_imputed['y']

In [49]:
# Initializing and fitting the Lasso Regression model
lasso_model = Lasso()
lasso_model.fit(lasso_X, lasso_y)

In [50]:
# Predicting 'y' using the trained Lasso model
lasso_predictions = lasso_model.predict(lasso_X)

In [51]:
# Evaluating model performance via R2 score
lasso_r2 = r2_score(lasso_y, lasso_predictions)

# Printing results:
lasso_r2

3.642776540380144e-05
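
`Lasso()` uses `alpha=1.0` by default, and the L1 penalty is sensitive to feature scale, so an R2 this close to zero may indicate that most coefficients were shrunk to zero rather than that Lasso is unsuited to the data. The sketch below, on synthetic data only, shows one way to check this: count the non-zero coefficients, then compare against a pipeline that standardizes the features and picks `alpha` by cross-validation with `LassoCV`. The arrays `X_demo` and `y_demo` and their coefficients are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): features on different scales,
# small-valued target that depends only on the first two features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5)) * np.array([1.0, 10.0, 50.0, 1.0, 1.0])
y_demo = 0.05 * X_demo[:, 0] + 0.002 * X_demo[:, 1] + rng.normal(scale=0.01, size=300)

# Default Lasso (alpha=1.0) on unscaled features
default_lasso = Lasso().fit(X_demo, y_demo)
n_nonzero_default = int(np.sum(default_lasso.coef_ != 0))
default_r2 = default_lasso.score(X_demo, y_demo)

# Standardize features, then let cross-validation choose alpha
tuned_lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
tuned_lasso.fit(X_demo, y_demo)
n_nonzero_tuned = int(np.sum(tuned_lasso.named_steps["lassocv"].coef_ != 0))
tuned_r2 = tuned_lasso.score(X_demo, y_demo)

print("Non-zero coefficients (default alpha):", n_nonzero_default)
print("Non-zero coefficients (LassoCV):", n_nonzero_tuned)
```

For the notebook's model, `np.sum(lasso_model.coef_ != 0)` would give the same diagnostic. Whether a tuned `alpha` improves the score on the actual dataset is not something this sketch can establish.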

### KNN:

In [53]:
# Splitting data into input (X) and target (y)
knn_X = eda_df_copy_imputed.drop('y', axis=1)
knn_y = eda_df_copy_imputed['y']

In [54]:
# Initializing and fitting the KNN Regression model
knn_model = KNeighborsRegressor()
knn_model.fit(knn_X, knn_y)

In [55]:
# Predicting 'y' using the trained KNN model
knn_predictions = knn_model.predict(knn_X)

In [56]:
# Evaluating model performance via R2 score
knn_r2 = r2_score(knn_y, knn_predictions)

# Printing results
knn_r2

0.20746493103417973
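
This score should be read with caution: `KNeighborsRegressor()` defaults to `n_neighbors=5`, and when predictions are made on the training rows, each row is normally one of its own five neighbors. As a result, an in-sample R2 of roughly 1/5 can arise even when the target is unrelated to the features. The sketch below illustrates this on purely random synthetic data (the names `X_noise`, `y_noise`, and `knn_pipeline` are hypothetical) and contrasts the in-sample score with a 5-fold cross-validated one. The `StandardScaler` step is an added assumption, included because KNN distances depend on feature scale.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): the target is independent of the features
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(1000, 5))
y_noise = rng.uniform(size=1000)

# Scale features, then fit a 5-nearest-neighbors regressor
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))

# In-sample R2: each training row contributes to its own prediction
in_sample_r2 = knn_pipeline.fit(X_noise, y_noise).score(X_noise, y_noise)

# Cross-validated R2: each row is scored by a model that never saw it
cv_r2 = cross_val_score(knn_pipeline, X_noise, y_noise, cv=5, scoring="r2")

print(f"In-sample R2: {in_sample_r2:.3f}")
print(f"Mean cross-validated R2: {cv_r2.mean():.3f}")
```

Running `cross_val_score` with `knn_X` and `knn_y` in place of the synthetic arrays would show whether the KNN model in this notebook actually outperforms the two linear models on unseen rows.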