## _Medical Insurance Costs_

In this case, we have data on health information and the costs that a health insurer has to cover. The _medical insurance cost_ dataset contains the following fields:

1. Age: Age of the beneficiary
2. Sex: Gender of the beneficiary (_male_, _female_)
3. Bmi: Body Mass Index
4. Children: Number of children/dependents covered by the insurer
5. Smoker: Smoking status (_yes_, _no_)
6. Region: Region where the beneficiary lives
7. Charges: Costs paid by the insurer

### Challenge

Build a regression model that predicts the costs the insurer has to pay based on the data. Validate the performance of your regression model using ***R-squared ($R^2$)***.
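
For reference, $R^2$ compares the model's squared errors with those of a baseline that always predicts the mean. The symbols here are introduced for illustration only: $y_i$ is an observed charge, $\hat{y}_i$ the corresponding prediction, and $\bar{y}$ the mean of the observed charges.

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$

A value of 1 means perfect predictions, 0 means the model does no better than predicting the mean, and the value can become negative on held-out data.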

#### _Tasks_

1. Make sure all categorical variables are handled properly. (Use the mapping feature in pandas.)
2. Check for multicollinearity among all independent variables. If there is any, between which variables does it occur?
3. Make sure the model uses variables that do not exhibit high multicollinearity.
4. (Hint) You can use the ***Variance Inflation Factor (VIF)*** to gauge the degree of multicollinearity of an independent variable.
5. Evaluate the model you build using $R^2$.
6. Conclude: which independent variables can be used to produce a good regression model for the _medical insurance costs_ case?

#### Solution

In [1]:
# Import the data used in this experiment
import pandas as pd

df = pd.read_csv('data/insurance.csv')

display(df.head())

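# Correlate the numeric columns only; recent pandas versions raise an error
# when corr() is called on a frame that still contains text columns
display(df.select_dtypes('number').corr())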

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


Unnamed: 0,age,bmi,children,charges
age,1.0,0.109272,0.042469,0.299008
bmi,0.109272,1.0,0.012759,0.198341
children,0.042469,0.012759,1.0,0.067998
charges,0.299008,0.198341,0.067998,1.0


In [2]:
# Map the sex, smoker, and region columns to numeric codes

# Mapping for sex
label_for_sex = {
    'male' : 1,
    'female' : 0
}

df['sex'] = df['sex'].map(label_for_sex)

df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,0,27.9,0,yes,southwest,16884.924
1,18,1,33.77,1,no,southeast,1725.5523
2,28,1,33.0,3,no,southeast,4449.462
3,33,1,22.705,0,no,northwest,21984.47061
4,32,1,28.88,0,no,northwest,3866.8552
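
One thing worth checking after a dictionary mapping: `Series.map` does not raise on labels that are missing from the dictionary, it silently turns them into `NaN`. The sketch below uses a small hypothetical series (not the insurance data) with one misspelled label to show a quick sanity check.

```python
import pandas as pd

# Hypothetical sample with one misspelled category ('femele')
sample = pd.Series(['male', 'female', 'femele'])
label_for_sex = {'male': 1, 'female': 0}

mapped = sample.map(label_for_sex)

# Labels missing from the dictionary become NaN instead of raising an error
n_unmapped = int(mapped.isna().sum())
print(mapped.tolist())
print('unmapped rows:', n_unmapped)
```

Running the same `isna().sum()` check on `df['sex']`, `df['smoker']`, and `df['region']` after mapping would confirm that every label was covered.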


In [3]:
# Mapping for smoker
label_for_smoker = {
    'yes' : 1,
    'no' : 0
}

df['smoker'] = df['smoker'].map(label_for_smoker)

df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,0,27.9,0,1,southwest,16884.924
1,18,1,33.77,1,0,southeast,1725.5523
2,28,1,33.0,3,0,southeast,4449.462
3,33,1,22.705,0,0,northwest,21984.47061
4,32,1,28.88,0,0,northwest,3866.8552


In [4]:
# Mapping for region
label_for_region = {
    'northwest' : 3,
    'northeast' : 0,
    'southeast' : 1,
    'southwest' : 2
}

df['region'] = df['region'].map(label_for_region)

df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,0,27.9,0,1,2,16884.924
1,18,1,33.77,1,0,1,1725.5523
2,28,1,33.0,3,0,1,4449.462
3,33,1,22.705,0,0,3,21984.47061
4,32,1,28.88,0,0,3,3866.8552
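
The integer codes for region (0 to 3) follow the task's instruction to use mapping, but they also give the regions an order and equal spacing that a linear model will treat as a numeric scale. A common alternative is one-hot encoding. The sketch below applies `pd.get_dummies` to a hypothetical mini-frame that still holds the original text labels; the frame and its values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical mini-frame; the real notebook would pass the full DataFrame
toy = pd.DataFrame({
    'age': [19, 18, 28, 33],
    'region': ['southwest', 'southeast', 'northeast', 'northwest'],
})

# One 0/1 column per region; drop_first removes one level (the first in sorted
# order) so the dummies are not perfectly collinear with the intercept
encoded = pd.get_dummies(toy, columns=['region'], drop_first=True, dtype=int)
print(encoded)
```

The dropped level becomes the reference category, so each remaining coefficient reads as a difference from that region.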


In [5]:
# As the table shows, the values of sex, smoker, and region now follow
# the mappings applied above.
 
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,0,27.9,0,1,2,16884.924
1,18,1,33.77,1,0,1,1725.5523
2,28,1,33.0,3,0,1,4449.462
3,33,1,22.705,0,0,3,21984.47061
4,32,1,28.88,0,0,3,3866.8552


In [6]:
# Next, compute the VIF for the independent variables:
# age, sex, bmi, children, smoker, and region

from statsmodels.stats.outliers_influence import variance_inflation_factor
  
X = df[['age', 'sex', 'bmi', 'children', 'smoker', 'region']]
  
data_vif = pd.DataFrame()
data_vif["Features"] = X.columns
  
data_vif["VIF"] = [variance_inflation_factor(X.values, i)
                          for i in range(len(X.columns))]
  
data_vif

Unnamed: 0,Features,VIF
0,age,7.656609
1,sex,2.00303
2,bmi,9.501583
3,children,1.808837
4,smoker,1.25728
5,region,2.623906
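
These values deserve a caveat. `variance_inflation_factor` regresses each column on the others exactly as they are passed in and does not add an intercept. For columns whose mean is large relative to their spread, such as age and bmi, the result can be high even when the columns are barely correlated, which fits the low age/bmi correlation (about 0.11) in the correlation table above. The sketch below computes $VIF_i = 1 / (1 - R_i^2)$ with and without an intercept using only NumPy. The function name `vif` and the synthetic columns are illustrative assumptions, not part of the notebook.

```python
import numpy as np


def vif(X, intercept=True):
    """VIF per column, from regressing each column on all the others."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    result = []
    for i in range(k):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        if intercept:
            # Centered R^2: compare against predicting the column mean
            others = np.column_stack([np.ones(n), others])
            total = np.sum((target - target.mean()) ** 2)
        else:
            # Uncentered R^2: compare against predicting zero
            total = np.sum(target ** 2)
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / total
        result.append(1.0 / (1.0 - r2))
    return np.array(result)


# Synthetic, independent columns with large means (not the insurance data)
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.normal(40, 14, 1000), rng.normal(30, 6, 1000)])

vif_no_intercept = vif(X_demo, intercept=False)
vif_with_intercept = vif(X_demo, intercept=True)
print(vif_no_intercept)
print(vif_with_intercept)
```

With statsmodels, the usual way to get the intercept-based values is to add a constant column to `X` (for example with `statsmodels.api.add_constant`) before calling `variance_inflation_factor`, and to ignore the VIF reported for the constant itself.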


In [7]:
# According to the VIF table, bmi and age have VIF values above 5.
# The model is expected to avoid variables with high multicollinearity.
# Note: this step removes age and children; bmi stays in the model (see the
# output below), so only one of the two flagged variables is actually dropped.

df1 = df.drop(columns=['age', 'children'])

df1.head()

Unnamed: 0,sex,bmi,smoker,region,charges
0,0,27.9,1,2,16884.924
1,1,33.77,0,1,1725.5523
2,1,33.0,0,1,4449.462
3,1,22.705,0,3,21984.47061
4,1,28.88,0,3,3866.8552


In [8]:
# Compute the R2 score of the model

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

dftest1 = df1.copy()

dftest1

Unnamed: 0,sex,bmi,smoker,region,charges
0,0,27.900,1,2,16884.92400
1,1,33.770,0,1,1725.55230
2,1,33.000,0,1,4449.46200
3,1,22.705,0,3,21984.47061
4,1,28.880,0,3,3866.85520
...,...,...,...,...,...
1333,1,30.970,0,3,10600.54830
1334,0,31.920,0,0,2205.98080
1335,0,36.850,0,1,1629.83350
1336,0,25.800,0,2,2007.94500


In [9]:
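X = dftest1.iloc[:, :-1]
y = dftest1.iloc[:, -1]  # charges is the last column

# Note: train_size=0.2 trains on 20% of the rows and tests on the remaining 80%;
# the all-variable model further below uses test_size=0.2 instead
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.2, random_state=50)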
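
The `train_size` and `test_size` arguments are easy to mix up, and they lead to very different amounts of training data. The sketch below splits a plain index array with 1338 entries (the row count of this dataset) both ways so the sizes can be compared directly; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(1338)  # same number of rows as the dataset

# train_size=0.2 keeps about 20% of the rows for training
train_a, test_a = train_test_split(rows, train_size=0.2, random_state=50)

# test_size=0.2 holds out about 20% of the rows for testing
train_b, test_b = train_test_split(rows, test_size=0.2, random_state=50)

print(len(train_a), len(test_a))
print(len(train_b), len(test_b))
```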

In [10]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)

In [11]:
# Joins the two arrays end to end: actual values first, then predictions
combine = np.concatenate((y_test, y_pred))
combine

array([ 5976.8311    ,  5846.9176    , 13831.1152    , ...,
       30127.72587159,  7258.89670111,  7361.88170033])
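
Because `np.concatenate` joins the arrays end to end, the actual and predicted values do not line up row by row. If the goal is to inspect each prediction next to its true value, `np.column_stack` is one option. The numbers below are hypothetical stand-ins for `y_test` and `y_pred`.

```python
import numpy as np

# Hypothetical values standing in for y_test and y_pred
actual = np.array([5976.83, 5846.92, 13831.12])
predicted = np.array([6100.0, 7200.5, 12950.25])

# column_stack pairs each actual value with its prediction (one row per sample)
side_by_side = np.column_stack((actual, predicted))
residuals = actual - predicted
print(side_by_side)
print(residuals)
```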

In [12]:
from sklearn.metrics import r2_score

rscore = r2_score(y_test, y_pred)

print('R2 score : ', rscore)

R2 score :  0.6498290218230921


In [13]:
# Compute the R2 score of a model that uses all independent variables

dftest2 = df.copy()

dftest2

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,0,27.900,0,1,2,16884.92400
1,18,1,33.770,1,0,1,1725.55230
2,28,1,33.000,3,0,1,4449.46200
3,33,1,22.705,0,0,3,21984.47061
4,32,1,28.880,0,0,3,3866.85520
...,...,...,...,...,...,...,...
1333,50,1,30.970,3,0,3,10600.54830
1334,18,0,31.920,0,0,0,2205.98080
1335,18,0,36.850,0,0,1,1629.83350
1336,21,0,25.800,0,0,2,2007.94500


In [14]:
X2 = dftest2.iloc[:, :-1].values
y2 = dftest2.iloc[:, 6].values

X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X2, y2, test_size=0.2, random_state=50)

In [15]:
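# Reshape y2 into a column vector; this runs after the split above,
# so it does not affect the data used to fit the model below
y2 = y2.reshape(-1, 1)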
y2.shape

(1338, 1)

In [16]:
lr2 = LinearRegression()
lr2.fit(X_train_2, y_train_2)

y_pred_2 = lr2.predict(X_test_2)

In [17]:
# Joins the two arrays end to end: actual values first, then predictions
combine2 = np.concatenate((y_test_2, y_pred_2))
combine2

array([ 5.97683110e+03,  5.84691760e+03,  1.38311152e+04,  9.62592000e+03,
        2.68094930e+03,  4.78967913e+04,  1.82234512e+04,  7.41947790e+03,
        3.73262510e+03,  1.22228983e+04,  7.05002130e+03,  2.19786769e+04,
        6.28223500e+03,  3.77018768e+04,  7.04672220e+03,  1.20323260e+04,
        1.31126048e+04,  4.23989265e+03,  1.23338280e+04,  3.41032400e+03,
        1.72778500e+03,  4.46411974e+04,  1.71284261e+04,  6.11235295e+03,
        4.52947700e+03,  1.05945016e+04,  6.40229135e+03,  4.61511245e+04,
        1.71102680e+03,  1.70470015e+03,  4.58632050e+04,  4.68779700e+03,
        1.50197601e+04,  3.18051010e+03,  3.86120965e+03,  3.44306400e+03,
        2.71179938e+04,  2.70924395e+03,  1.34511220e+04,  4.79280300e+04,
        2.35630162e+04,  6.71019190e+03,  1.42350720e+04,  1.40011338e+04,
        2.72184372e+04,  1.33905590e+04,  4.10342214e+04,  2.02017700e+03,
        1.42561928e+04,  2.12321823e+04,  4.86755177e+04,  6.98669700e+03,
        4.14973600e+03,  

In [18]:
rscore2 = r2_score(y_test_2, y_pred_2)

print('R2 score : ', rscore2)

R2 score :  0.7835627749480735


## Conclusion

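Based on these experiments, the model that uses all independent variables achieves a better R2 score (closer to 1) than the model that uses only four independent variables (sex, bmi, smoker, region). Note that the two models were not evaluated on identical splits (`train_size=0.2` for the four-variable model, `test_size=0.2` for the full model), so part of the gap may come from the amount of training data rather than from the choice of variables. The results are:

- R2 score (4 variables): 0.6498290218230921
- R2 score (all variables): 0.7835627749480735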
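
To separate the effect of the variable choice from the effect of the split, both feature sets can be scored on one identical split. The following is a minimal sketch, assuming a hypothetical helper named `score_features` and synthetic data in place of `insurance.csv`; the coefficients used to generate the data are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def score_features(frame, features, target='charges', seed=50):
    """Fit a linear model on `features` and return R2 on a fixed 80/20 split."""
    X_train, X_test, y_train, y_test = train_test_split(
        frame[features], frame[target], test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))


# Synthetic stand-in data (not the insurance dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame({'age': rng.uniform(18, 64, 500),
                    'smoker': rng.integers(0, 2, 500)})
toy['charges'] = 250 * toy['age'] + 20000 * toy['smoker'] + rng.normal(0, 3000, 500)

r2_small = score_features(toy, ['smoker'])
r2_full = score_features(toy, ['age', 'smoker'])
print(r2_small, r2_full)
```

With the notebook's data, the same helper could be called once with `['sex', 'bmi', 'smoker', 'region']` and once with all six independent variables, so that both scores come from the same training and test rows.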