# **First Model (Practice)**

## Assignment:

For this exercise, you will create, fit, and evaluate the performance of a linear regression model. The machine learning question is:

How well can the additional charges be predicted based on the age, sex, BMI, number of children, smoking habit, and region of the patient?

This is the dataset you will be using: [insurance.csv](https://drive.google.com/file/d/1zkcVqin1DV7ym7DFPVovCCsqkoiDZs-6/view)

For this task, you will need to:

- Create a preprocessing object, such as a column transformer or pipeline, that will:
  - Ordinal encode any ordinal features
  - One-hot encode any nominal features
  - Scale any numeric features
- Instantiate a linear regression model
- Create a model pipeline with your preprocessor first and linear regression model last
- Fit the modeling pipeline on the training data
- Evaluate the model performance on both the training set and the test set using the R-squared score.

[Assignment Solution](https://colab.research.google.com/drive/162Q49DsUcLgUgJl5yqAnbVAHW2sDF5-k?usp=sharing)

# Preliminary steps

In [None]:
# import libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn import set_config
set_config(display = 'diagram')

In [None]:
# mount drive
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
# import data
path = '/content/drive/MyDrive/Coding Dojo/06 Week 6: Regression Models/insurance.csv'
df = pd.read_csv(path)

# Clean data before preprocessing

In [None]:
# count duplicate rows
df.duplicated().sum()

1

In [None]:
# drop the duplicate rows in place
df.drop_duplicates(inplace = True)

In [None]:
df.duplicated().sum()

0

In [None]:
# explore data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


There is a mix of numeric and categorical features, and no column has missing values. None of the categorical features is ordinal, so there is nothing to ordinal encode. The target variable ('charges') is continuous numeric, so this is a **regression** problem.
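The assignment lists ordinal encoding as a step, but this dataset gives it nothing to work on. For reference, here is a minimal sketch of what that step could look like if an ordered category were present. The `coverage` column and its three levels are hypothetical and do not exist in insurance.csv.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# hypothetical ordered feature; NOT a column in insurance.csv
toy = pd.DataFrame({'coverage': ['basic', 'premium', 'standard', 'basic']})

# list the levels from lowest to highest so the integer codes keep that order
coverage_order = ['basic', 'standard', 'premium']
ord_enc = OrdinalEncoder(categories = [coverage_order])

# each level is replaced by its position in coverage_order
encoded = ord_enc.fit_transform(toy[['coverage']])
print(encoded.ravel())
```

In a full pipeline, an encoder like this would get its own tuple in the column transformer, alongside the numeric and nominal ones.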

In [None]:
# explore data
df.sample(10)

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 1236 | 63 | female | 21.66 | 0 | no | northeast | 14449.8544 |
| 310 | 50 | male | 26.6 | 0 | no | southwest | 8444.474 |
| 869 | 25 | female | 24.3 | 3 | no | southwest | 4391.652 |
| 1162 | 30 | male | 38.83 | 1 | no | southeast | 18963.17192 |
| 482 | 18 | female | 31.35 | 0 | no | southeast | 1622.1885 |
| 1214 | 27 | female | 31.255 | 1 | no | northwest | 3956.07145 |
| 889 | 57 | male | 33.63 | 1 | no | northwest | 11945.1327 |
| 514 | 39 | male | 28.3 | 1 | yes | southwest | 21082.16 |
| 555 | 28 | male | 23.8 | 2 | no | southwest | 3847.674 |
| 819 | 33 | female | 35.53 | 0 | yes | northwest | 55135.40209 |


In [None]:
# sort feature columns by dtype and predict the column count after one-hot encoding
num_cols = []
cat_cols = []

for column in df.columns:
  if column == 'charges': # target variable
    pass
  elif df[column].dtype == 'object':
    cat_cols.append(column)
  else:
    num_cols.append(column)

num_transformed_cat_cols = 0

for column in cat_cols:
  num_transformed_cat_cols += df[column].nunique()

num_transformed_cols = len(num_cols) + num_transformed_cat_cols

print(f"Dataset has {df.shape[0]} rows and {df.shape[1]} columns. \n\
Target variable column is {df['charges'].dtype}. \n\
Numeric feature columns are {num_cols}. \n\
Object feature columns are {cat_cols}. \n\
Predicted size of transformed dataset is {df.shape[0]} rows and \
{num_transformed_cols} columns.")

Dataset has 1337 rows and 7 columns. 
Target variable column is float64. 
Numeric feature columns are ['age', 'bmi', 'children']. 
Object feature columns are ['sex', 'smoker', 'region']. 
Predicted size of transformed dataset is 1337 rows and 11 columns.


# Validation split

In [None]:
# identify target column name
target = 'charges'

# assign target column to y
y = df[target]

# assign rest of columns to feature matrix
X = df.drop(columns = target)

In [None]:
# train/test split (no test_size given, so the default of 0.25 is used)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
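The split above does not pass `test_size`, so scikit-learn falls back to its default and holds out 25% of the rows. As a rough check of what that means for this dataset, the sketch below splits a plain list that stands in for the 1337 de-duplicated rows; only the row count is taken from the notebook.

```python
from sklearn.model_selection import train_test_split

# stand-in for the 1337 de-duplicated rows; only the row count matters here
rows = list(range(1337))

# same call pattern as the notebook: no test_size, fixed random_state
train_rows, test_rows = train_test_split(rows, random_state = 42)

print(len(train_rows), len(test_rows))
```

This gives 1002 training rows and 335 test rows, because the test share is rounded up.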

# Create preprocessing object and model in pipeline

In [None]:
# instantiate column selectors
num_selector = make_column_selector(dtype_include = 'number')
cat_selector = make_column_selector(dtype_include = 'object')

In [None]:
# instantiate imputers
# for numeric columns, impute the mean; for object columns, impute the most_frequent

mean_imputer = SimpleImputer(strategy = 'mean')
freq_imputer = SimpleImputer(strategy = 'most_frequent')

In [None]:
# instantiate transformers
scaler = StandardScaler()
ohe = OneHotEncoder(handle_unknown = 'ignore', sparse_output = False)
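The `handle_unknown = 'ignore'` setting matters when the test data contains a category the encoder never saw during fitting. A small sketch of that behavior follows. The `'midwest'` value is made up for illustration, and the sketch leaves the output sparse and converts it with `.toarray()` instead of setting `sparse_output`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# toy training data with two known regions
train = pd.DataFrame({'region': ['northeast', 'southwest', 'northeast']})

# 'midwest' is a made-up category the encoder never sees during fit
new = pd.DataFrame({'region': ['southwest', 'midwest']})

ohe_demo = OneHotEncoder(handle_unknown = 'ignore')
ohe_demo.fit(train)

# default output is a sparse matrix; .toarray() makes it easy to inspect
encoded = ohe_demo.transform(new).toarray()
print(encoded)
```

The known category gets a single 1 in its column, while the unseen one becomes a row of zeros instead of raising an error.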

In [None]:
# instantiate pipelines
num_pipe = make_pipeline(mean_imputer, scaler)
display(num_pipe)

cat_pipe = make_pipeline(freq_imputer, ohe)
display(cat_pipe)

In [None]:
# instantiate tuples for column transformer
num_tuple = (num_pipe, num_selector)
cat_tuple = (cat_pipe, cat_selector)

In [None]:
# instantiate column transformer
preprocessor = make_column_transformer(num_tuple, cat_tuple)
display(preprocessor)
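The column count predicted during exploration (3 numeric columns plus 8 one-hot columns) can be checked against what a preprocessor of this shape produces. The sketch below rebuilds the same structure on four made-up rows that cover every category level shown in the sample above. The rows themselves are illustrative and not taken from insurance.csv. `sparse_output` is left out because the output shape does not depend on it.

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# made-up rows covering both sexes, both smoker values, and all four regions
toy = pd.DataFrame({
    'age': [19, 33, 45, 61],
    'sex': ['female', 'male', 'female', 'male'],
    'bmi': [27.9, 22.7, 30.1, 35.5],
    'children': [0, 1, 2, 3],
    'smoker': ['yes', 'no', 'no', 'yes'],
    'region': ['northeast', 'northwest', 'southeast', 'southwest'],
})

# keep text columns as object dtype so the 'object' selector matches them
# (some pandas versions may infer a dedicated string dtype instead)
text_cols = ['sex', 'smoker', 'region']
toy[text_cols] = toy[text_cols].astype(object)

num_pipe = make_pipeline(SimpleImputer(strategy = 'mean'), StandardScaler())
cat_pipe = make_pipeline(SimpleImputer(strategy = 'most_frequent'),
                         OneHotEncoder(handle_unknown = 'ignore'))

preprocessor = make_column_transformer(
    (num_pipe, make_column_selector(dtype_include = 'number')),
    (cat_pipe, make_column_selector(dtype_include = 'object')),
)

# 3 scaled numeric columns + 2 + 2 + 4 one-hot columns
transformed = preprocessor.fit_transform(toy)
print(transformed.shape)
```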

In [None]:
# instantiate linear regression model
lin_reg = LinearRegression()

In [None]:
# make pipeline for preprocessor and model
lin_reg_pipe = make_pipeline(preprocessor, lin_reg)
display(lin_reg_pipe)

# Fit model pipeline on training data

In [None]:
# fit model pipeline on training data
lin_reg_pipe.fit(X_train, y_train)

# Create model predictions on training and testing data

In [None]:
# create predictions for training data
train_pred = lin_reg_pipe.predict(X_train)

# create predictions for testing data
test_pred = lin_reg_pipe.predict(X_test)

# Evaluate model with R^2 score

In [None]:
# r2 score for training data
train_r2 = r2_score(y_train, train_pred)

# r2 score for testing data
test_r2 = r2_score(y_test, test_pred)

print(f"Model Training R2: {train_r2} \n\
Model Testing R2: {test_r2}")

Model Training R2: 0.7297496299680121 
Model Testing R2: 0.7959833580860984


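This model accounts for about 73% of the variation in y_train using the features in X_train, and about 80% of the variation in y_test using the features in X_test. The test score is slightly higher than the training score, so there is no sign of overfitting; a gap in this direction most likely reflects which rows happened to land in the test split.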
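To make the "share of variation explained" reading concrete, here is a minimal sketch that computes R² by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r2_by_hand` and the numbers are made up for illustration and are not taken from insurance.csv.

```python
def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, written out in plain Python."""
    mean_true = sum(y_true) / len(y_true)
    # squared error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # squared error of always predicting the mean of y_true
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# made-up values for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]

score = r2_by_hand(y_true, y_pred)
baseline = r2_by_hand(y_true, [6.0, 6.0, 6.0, 6.0])  # always predict the mean
print(score, baseline)
```

A model that always predicts the mean scores 0, so an R² of 0.80 means the squared error is 80% lower than that baseline on the same data.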