<img src="https://nyp-aicourse.s3.ap-southeast-1.amazonaws.com/agods/nyp_ago_logo.png" width='400'/>

# Final Exercise: Regression Model

We will build a linear regression model for the medical cost dataset. The dataset contains the features age, sex, BMI, children, smoker and region, which are the independent variables, and charges, which is the dependent variable. We will predict the individual medical costs billed by health insurance.

### Import Library and Dataset
We first import the Python libraries required for our analysis and then load the dataset.

In [None]:
# Import library
import pandas  as pd  
import numpy as np  
import matplotlib.pyplot as plt 
import seaborn as sns  

In [None]:
# Import dataset
df = pd.read_csv('../data/insurance.csv')

### Task 1: Check for missing values 

In [None]:
df.info()

In [None]:
df.describe()
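
`df.info()` reports the non-null count per column, but a direct count of missing values is often easier to read. The following is a minimal sketch on a small hypothetical frame (`toy`, not the insurance data); the same calls can be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the insurance dataset
toy = pd.DataFrame({
    "age": [19, 33, np.nan, 45],
    "bmi": [27.9, 22.7, 30.1, np.nan],
    "smoker": ["yes", "no", "no", None],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True if any value in the frame is missing
has_missing = bool(toy.isna().any().any())
print(has_missing)
```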

### Task 2: Check the distribution of the variables

In [None]:
sns.set_theme(style='whitegrid')
df.hist(figsize=(8,6))
plt.show()

### Task 3: Splitting Data into Train and Test Set

Assuming we think age is key for good prediction, we want to preserve the same age distribution in our train and test samples.

1. Bin the age into the following bins: (0, 25\], (25, 35\], (35, 45\], (45, 55\] and (55, $\infty$). 
2. Create a new column named 'age_cat' for the age categories. 
3. Split the data into train and test sets with an 80:20 ratio, using stratified splitting on 'age_cat'.

In [None]:
df["age_cat"] = pd.cut(df["age"],
                          bins=[0, 25, 35, 45, 55, np.inf],
                          labels=[1, 2, 3, 4, 5])
df["age_cat"].value_counts().sort_index()

In [None]:
from sklearn.model_selection import train_test_split 

strat_train_set, strat_test_set = train_test_split(df, shuffle=True, 
                                                   train_size=0.8,
                                                   stratify=df['age_cat'], 
                                                   random_state=42)

In [None]:
strat_train_set.info()
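
To confirm that stratification preserved the age distribution, one can compare the share of each age category in the full data, the train set and the test set. The sketch below uses randomly generated hypothetical ages (`toy`) rather than the insurance data, and the helper `category_shares` is an illustrative name, not part of any library.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ages standing in for the insurance data
rng = np.random.default_rng(0)
toy = pd.DataFrame({"age": rng.integers(18, 65, size=500)})

# pd.cut uses right-closed intervals by default: (0, 25], (25, 35], ...
toy["age_cat"] = pd.cut(toy["age"],
                        bins=[0, 25, 35, 45, 55, np.inf],
                        labels=[1, 2, 3, 4, 5])

toy_train, toy_test = train_test_split(toy, train_size=0.8,
                                       stratify=toy["age_cat"],
                                       random_state=42)

def category_shares(frame):
    """Return the fraction of rows in each age category (illustrative helper)."""
    return frame["age_cat"].astype(int).value_counts(normalize=True).sort_index()

# Side-by-side comparison of the category shares
proportions = pd.DataFrame({
    "overall": category_shares(toy),
    "train": category_shares(toy_train),
    "test": category_shares(toy_test),
})
print(proportions)
```

If the split is stratified correctly, the three columns should be nearly identical.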

### Task 4: Look for correlation

In [None]:
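# Correlation of each numerical feature with the target.
# numeric_only=True skips the text and categorical columns,
# which would otherwise raise an error in pandas 2.0 and later.
corr_matrix = strat_train_set.corr(numeric_only=True)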
corr_matrix['charges'].sort_values(ascending=False)

There is no strong correlation among the variables.

### Task 5: Separate features and labels

In [None]:
insurance = strat_train_set.drop('charges', axis=1)
insurance_labels = strat_train_set['charges']

### Task 6: Separate the numerical features from categorical features

We have a few categorical features that we need to one-hot encode. We also need to scale the numerical features. Because these two types of data need different transformation pipelines, first separate the numerical features from the categorical features. Name the new dataframes `insurance_num` and `insurance_cat`.

In [None]:
categorical_columns = ['sex','smoker', 'region', 'age_cat']
insurance_num = insurance.drop(categorical_columns, axis=1)
insurance_cat = insurance[categorical_columns]

### Task 7: Build a pipeline for numerical data

Build a pipeline that scales the numerical data using StandardScaler. It is also good practice to include an imputer. Although we don't have any missing data in the training data (or the test data), we cannot guarantee that there won't be missing values in the live system (e.g. users may leave certain fields blank).

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

num_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
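
To see what this pipeline does, the following minimal sketch runs the same two steps on a hypothetical single-feature array (`X_toy`) that contains one missing value. The imputer fills the gap with the column median before the scaler standardizes the column.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data with one missing value
X_toy = np.array([[20.0], [30.0], [np.nan], [40.0]])

toy_num_pipeline = make_pipeline(SimpleImputer(strategy="median"),
                                 StandardScaler())
X_scaled = toy_num_pipeline.fit_transform(X_toy)

# The missing value is replaced by the median of the observed values (30.0),
# then the column is rescaled to zero mean and unit variance
print(X_scaled.ravel())
```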

### Task 8: Build a pipeline for categorical data 

Build a pipeline to one-hot encode the categorical features. Set `sparse_output` to False so that the encoder returns a dense array rather than a sparse matrix, which would make converting the output back to a DataFrame difficult.

In [None]:
from sklearn.preprocessing import OneHotEncoder

cat_pipeline = make_pipeline(
    SimpleImputer(strategy='most_frequent'),
    OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'),
)

### Task 9: Build a single combined pipeline

In [None]:
from sklearn.compose import make_column_transformer
from sklearn.compose import ColumnTransformer

num_attribs = ["age", "bmi", "children"] 
cat_attribs = ['sex', 'smoker', 'region','age_cat']

# preprocessing = make_column_transformer(
#                     (num_pipeline, num_attribs), 
#                     (cat_pipeline, cat_attribs))
# preprocessing

from sklearn.compose import make_column_selector 

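preprocessing = make_column_transformer(
    (num_pipeline, make_column_selector(dtype_include=np.number)),
    # 'age_cat' has dtype 'category' (created by pd.cut), so select it
    # together with the object (string) columns
    (cat_pipeline, make_column_selector(dtype_include=[object, "category"])),
)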

preprocessing
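
`make_column_selector` picks columns by dtype at fit time. The following minimal sketch, on a hypothetical frame (`toy`), shows which columns different selectors return. Note that a column created by `pd.cut` has `category` dtype, which `dtype_include=object` on its own does not match.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical frame with numeric, string and categorical columns
toy = pd.DataFrame({
    "age": [19, 33, 58],
    "bmi": [27.9, 22.7, 30.1],
    "smoker": ["yes", "no", "no"],
})
toy["age_cat"] = pd.cut(toy["age"],
                        bins=[0, 25, 35, 45, 55, np.inf],
                        labels=[1, 2, 3, 4, 5])

# Calling a selector on a DataFrame returns the matching column names
numeric_cols = make_column_selector(dtype_include=np.number)(toy)
object_cols = make_column_selector(dtype_include=object)(toy)
categorical_cols = make_column_selector(dtype_include=[object, "category"])(toy)

print("numeric:", numeric_cols)
print("object only:", object_cols)
print("object or category:", categorical_cols)
```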

### Task 10: Select a model and evaluate the model using cross validation

Use `r2` as the scoring function to cross-validate your model.

In [None]:
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LinearRegression

lin_reg = make_pipeline(preprocessing, LinearRegression())
linreg_score = cross_validate(lin_reg, 
                              insurance, 
                              insurance_labels,
                              scoring="r2", 
                              return_train_score=True,
                              cv=5)

print("r2 (train): ", linreg_score['train_score'])
print("average train r2: ", linreg_score['train_score'].mean())
print("r2 (val):", linreg_score['test_score'])
print("average val r2:", linreg_score['test_score'].mean())

### Task 11: Evaluate final model on test data

In [None]:
final_model = make_pipeline(preprocessing, LinearRegression())
final_model.fit(insurance, insurance_labels)

In [None]:
from sklearn.metrics import r2_score

X_test = strat_test_set.drop("charges", axis=1)
y_test = strat_test_set["charges"]

final_predictions = final_model.predict(X_test)
final_r2 = r2_score(y_test, final_predictions)
print(final_r2) 