# Machine Learning 101 - Regression

## How to Build Regression Models

Author: Kris Barbier

### Overview:

This notebook demonstrates how to build and interpret two types of regression models: Linear Regression and Random Forest.

### Regression Models Overview:

- In supervised machine learning, there are two main tasks: classification, which predicts discrete categories, and regression, which predicts continuous numerical values. In this notebook, we will build and interpret regression models using the following steps:
    - Import the needed libraries and read in the data.
    - Quickly preprocess the data for modeling (for an in-depth look at preprocessing, check out the preprocessing notebook).
    - Use model pipelines to efficiently build 2 different types of regression models.
    - Evaluate each model with several metrics to measure how accurate its predictions are.
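
As a rough illustration of the evaluation step, the sketch below computes the three imported metrics (plus RMSE, derived from MSE) on a handful of made-up numbers. The arrays `y_true` and `y_pred` are hypothetical and are not taken from the insurance dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual values and predictions, made up for illustration only
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

mae = mean_absolute_error(y_true, y_pred)  # average absolute error
mse = mean_squared_error(y_true, y_pred)   # average squared error
rmse = np.sqrt(mse)                        # square root of MSE, in the target's units
r2 = r2_score(y_true, y_pred)              # share of the target's variance explained

print(f"MAE:  {mae:.2f}")
print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R^2:  {r2:.2f}")
```

MAE and RMSE are expressed in the same units as the target, which tends to make them easier to explain than MSE. R^2 is unitless, and a value of 1.0 would mean perfect predictions.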

## Regression Models in Code

### Import Libraries and Read in Data

In [1]:
#Common imports for data science
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  #For visualizations
import seaborn as sns #For visualizations

#Imports for machine learning
from sklearn.model_selection import train_test_split  #For train/test split

#Imports for feature transformations
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

#Imports for building preprocessing object
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline

#Imports for regression models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

#Imports for model metrics
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

#Set sklearn output to pandas
from sklearn import set_config
set_config(transform_output = 'pandas')

#Mute warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
#Read in sample dataset from repo folder
file_path = "Data/insurance_mod.csv"
df = pd.read_csv(file_path)
#Preview data
df.head()

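| Unnamed: 0 | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | 1 | southwest | 16885.0 |
| 1 | 18 | male | 33.77 | 1 | 0 | southeast | 1726.0 |
| 2 | 28 | male | 33.0 | 3 | 0 | southeast | 4449.0 |
| 3 | 33 | male | 22.705 | 0 | 0 | northwest | 21984.0 |
| 4 | 32 | male | 28.88 | 0 | 0 | northwest | 3867.0 |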


### Preprocess Data

- In this step, we will preprocess the data for modeling. The steps are condensed here to save space; for an in-depth look at preprocessing, see the separate preprocessing notebook in this repo.

In [3]:
#Define X and y variables
y = df['charges']
X = df.drop(columns = 'charges')

In [4]:
#Perform train/test split (by default, 25% of rows are held out for testing)
#Setting random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

#Verify the split is correct
X_train.head()  #Note the absence of the charges column from the X_train data

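| Unnamed: 0 | age | sex | bmi | children | smoker | region |
|---|---|---|---|---|---|---|
| 693 | 24 | male | 23.655 | 0 | 0 | northwest |
| 1297 | 28 | female | 26.51 | 2 | 0 | southeast |
| 634 | 51 | male | 39.7 | 1 | 0 | southwest |
| 1022 | 47 | male | 36.08 | 1 | 1 | southeast |
| 178 | 46 | female | 28.9 | 2 | 0 | southwest |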
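
To see what the split does in isolation, here is a minimal sketch on a hypothetical 100-row frame (the `demo` data and its column names are invented for illustration). It shows the default 75/25 split, that the same `random_state` reproduces the same rows, and that features and target stay aligned by index.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 100-row frame standing in for the insurance data
demo = pd.DataFrame({"feature": np.arange(100), "target": np.arange(100) * 2.0})
X_demo = demo.drop(columns="target")
y_demo = demo["target"]

# With no test_size given, 25% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state selects the same rows
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, random_state=42)
print(X_tr.index.equals(X_tr2.index))

# Features and target are shuffled together, so their indices still match
print(X_tr.index.equals(y_tr.index))
```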


In [5]:
##Create numeric pipeline
#Define numeric columns
num_cols = X_train.select_dtypes('number').columns

#Instantiate transformers
impute_mean = SimpleImputer(strategy='mean')
scaler = StandardScaler()

#Set numeric pipeline
num_pipe = make_pipeline(impute_mean, scaler)

#Create tuple for column transformer
num_tuple = ("Numeric", num_pipe, num_cols)
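
As a small sketch of what the numeric pipeline does, the example below runs the same two transformers on a hypothetical one-column frame (`toy` and its values are made up). The missing value is first replaced by the column mean, then the column is rescaled to mean 0 and standard deviation 1.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric column with one missing value
toy = pd.DataFrame({"age": [20.0, 30.0, np.nan, 40.0]})

toy_pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())

# Step 1 fills the NaN with the mean of 20, 30 and 40 (which is 30).
# Step 2 subtracts the column mean and divides by the population standard deviation.
scaled = np.asarray(toy_pipe.fit_transform(toy)).ravel()
print(scaled)
```

Because the imputed value equals the column mean, it lands exactly at 0 after scaling.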

In [6]:
##Create categorical pipeline
#Define categorical columns
cat_cols = X_train.select_dtypes('object').columns

#Instantiate transformers
impute_missing = SimpleImputer(strategy='constant', fill_value='Missing')
cat_encode = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

#Set categorical pipeline
cat_pipe = make_pipeline(impute_missing, cat_encode)

#Create tuple for column transformer
cat_tuple = ("Categorical", cat_pipe, cat_cols)

In [7]:
#Finalize preprocessing object
preprocessor = ColumnTransformer([num_tuple, cat_tuple], verbose_feature_names_out=False)

## Model 1: Linear Regression

- Now that the preprocessing object is defined, we will fit our first model, Linear Regression.
- Linear Regression is a simple, easily interpreted model that fits its coefficients by minimizing the sum of squared errors between its predictions and the actual values. With a single feature, the result is a straight line of best fit through the data; with several features, as in this dataset, it is the equivalent flat surface (a hyperplane). While this model is easy to evaluate and understand, it is often less accurate than more flexible models.
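
The idea of minimizing squared errors can be checked on synthetic data. In the sketch below, the data is generated from a known line (slope 3, intercept 5, both chosen arbitrarily for this example) with some noise added. `LinearRegression` should recover values close to those, and they should match a direct least-squares solve with NumPy.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data from a known line plus noise (values chosen for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200).reshape(-1, 1)
y_line = 3.0 * x.ravel() + 5.0 + rng.normal(0, 0.5, size=200)

# Fit the model and read off the learned line
lin = LinearRegression().fit(x, y_line)
slope, intercept = lin.coef_[0], lin.intercept_

# Solve the same least-squares problem directly for comparison
A = np.column_stack([x.ravel(), np.ones(len(x))])
(ls_slope, ls_intercept), *_ = np.linalg.lstsq(A, y_line, rcond=None)

print(f"sklearn: slope={slope:.3f}, intercept={intercept:.3f}")
print(f"lstsq:   slope={ls_slope:.3f}, intercept={ls_intercept:.3f}")
```

Both approaches solve the same optimization problem, so they agree up to floating-point error.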