### Medical Expenses

**Purpose**: Build models to predict **medical expenses** ("charges").

    1. We import the data and check for missing (NaN) values. The training and validation sets are loaded from separate files (`train.csv` and `test.csv`). Categorical columns are converted to integer codes. Since there are few columns, we use all of them as X variables, with 'charges' as the y variable.








In [1]:
import pandas as pd
desc = pd.read_csv("data_description.txt", delimiter='\t')
desc

Unnamed: 0,Columns
0,* age: age of primary beneficiary
1,"* sex: insurance contractor gender, female, male"
2,"* bmi: Body mass index, providing an understan..."
3,objective index of body weight (kg / m ^ 2) us...
4,* children: Number of children covered by heal...
5,* smoker: Smoking
6,* region: the beneficiary's residential area i...
7,* charges: Individual medical costs billed by ...


In [2]:
train = pd.read_csv("train.csv")
val = pd.read_csv("test.csv")

In [3]:
list(train.isna().sum())

[0, 0, 0, 0, 0, 0, 0]

In [4]:
list(val.isna().sum())

[0, 0, 0, 0, 0, 0, 0]

In [5]:
train.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,34,male,42.9,1,no,southwest,4536.259
1,61,female,36.385,1,yes,northeast,48517.56315
2,60,male,25.74,0,no,southeast,12142.5786
3,44,female,29.81,2,no,southeast,8219.2039
4,40,female,29.6,0,no,southwest,5910.944


In [6]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

for i in ["sex", "smoker", "region"]:
    # Fit on the training column only, then apply the same mapping to both sets
    encoder.fit(train[i])
    train[i] = encoder.transform(train[i])
    val[i] = encoder.transform(val[i])
val.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,57,0,31.16,0,1,1,43578.9394
1,61,0,39.1,2,0,3,14235.072
2,61,1,23.655,0,0,0,13129.60345
3,59,1,29.7,2,0,2,12925.886
4,19,0,28.88,0,1,1,17748.5062


In [7]:
cols = list(train.columns)
cols.remove("charges")
cols

['age', 'sex', 'bmi', 'children', 'smoker', 'region']

In [8]:
X_train = train[cols]
y_train = train["charges"]
X_val = val[cols]
y_val = val["charges"]

    2. Using RandomForestRegressor, we train the model and predict the validation targets. The model is evaluated with MAE, MSE, RMSE, R², adjusted R², and MAPE.
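
For reference, the metrics printed in the cells of this notebook can be written as follows. The notation is introduced here for convenience: $y_i$ are the validation targets, $\hat{y}_i$ the predictions, $\bar{y}$ the mean of the targets, $n$ the number of validation rows, and $p$ the number of predictors (the code's `n` and `p`).

$$
\begin{aligned}
\mathrm{MAE} &= \frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|, &
\mathrm{MSE} &= \frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2, &
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}}, \\
R^2 &= 1-\frac{\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2}, &
R^2_{\mathrm{adj}} &= 1-\left(1-R^2\right)\frac{n-1}{n-p-1}, &
\mathrm{MAPE} &= \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i-\hat{y}_i}{y_i}\right|.
\end{aligned}
$$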

In [9]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
reg = RandomForestRegressor(n_estimators = 10, random_state=1)
reg.fit(X_train,y_train)
y_pred = reg.predict(X_val)

p = len(X_train.columns)
n = len(y_val)

mae = mean_absolute_error(y_val, y_pred)
print(f'Mean Absolute Error (MAE): {mae}')

mse = mean_squared_error(y_val, y_pred)
print(f'Mean Squared Error (MSE): {mse}')

rmse = np.sqrt(mse)
print(f'Root Mean Squared Error (RMSE): {rmse}')

r2 = r2_score(y_val, y_pred)
print(f'R-squared (R^2): {r2}')

adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f'Adjusted R-squared: {adjusted_r2}')

mape = np.mean(np.abs((y_val - y_pred) / y_val)) * 100
print(f'Mean Absolute Percentage Error (MAPE): {mape}%')   # Best result

Mean Absolute Error (MAE): 2412.6323251929825
Mean Squared Error (MSE): 21699929.089153938
Root Mean Squared Error (RMSE): 4658.318268340404
R-squared (R^2): 0.8665430295430516
Adjusted R-squared: 0.8601879357117683
Mean Absolute Percentage Error (MAPE): 30.38353287385947%
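
The same block of metric calculations is repeated for every model in this notebook. A minimal sketch of how it could be wrapped in one helper is shown below; `regression_report` and its `n_features` parameter are hypothetical names introduced here, and the sample numbers are made up purely to show the call.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred, n_features):
    """Return the metrics used in this notebook as a dict."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R2": r2,
        # n_features should match the number of columns the model was fit on
        "Adjusted R2": 1 - (1 - r2) * (n - 1) / (n - n_features - 1),
        "MAPE (%)": np.mean(np.abs((y_true - y_pred) / y_true)) * 100,
    }


# Made-up numbers, only to demonstrate the call.
report = regression_report(
    [100.0, 200.0, 300.0, 400.0],
    [110.0, 190.0, 310.0, 390.0],
    n_features=1,
)
for name, value in report.items():
    print(f"{name}: {value}")
```

Passing the feature count explicitly keeps the adjusted R² tied to the model actually being scored.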


    3. Using GradientBoostingRegressor, we train the model and predict the validation targets. The model is evaluated with the same metrics as in the previous step.








In [10]:
from sklearn.ensemble import GradientBoostingRegressor

reg = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=6, random_state=1)
reg.fit(X_train, y_train)

y_pred = reg.predict(X_val)


mae = mean_absolute_error(y_val, y_pred)
print(f'Mean Absolute Error (MAE): {mae}')

mse = mean_squared_error(y_val, y_pred)
print(f'Mean Squared Error (MSE): {mse}')

rmse = np.sqrt(mse)
print(f'Root Mean Squared Error (RMSE): {rmse}')

r2 = r2_score(y_val, y_pred)
print(f'R-squared (R^2): {r2}')

adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f'Adjusted R-squared: {adjusted_r2}')

mape = np.mean(np.abs((y_val - y_pred) / y_val)) * 100
print(f'Mean Absolute Percentage Error (MAPE): {mape}%')

Mean Absolute Error (MAE): 2488.2000494143977
Mean Squared Error (MSE): 22302290.85040321
Root Mean Squared Error (RMSE): 4722.530132291716
R-squared (R^2): 0.8628384379084356
Adjusted R-squared: 0.8563069349516945
Mean Absolute Percentage Error (MAPE): 28.125559118901077%


    4. Using LinearRegression, we train the model on all X columns and predict the validation targets. The model is evaluated with the same metrics as before.

In [11]:
from sklearn.linear_model import LinearRegression

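reg = LinearRegression()  # was assigned to `red`, so the fit below reused the previous model
# NOTE: the recorded output below is identical to the GradientBoostingRegressor cell,
# consistent with that earlier model having been refit; rerun this cell to obtain
# the LinearRegression metrics.
reg.fit(X_train, y_train)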

y_pred = reg.predict(X_val)


mae = mean_absolute_error(y_val, y_pred)
print(f'Mean Absolute Error (MAE): {mae}')

mse = mean_squared_error(y_val, y_pred)
print(f'Mean Squared Error (MSE): {mse}')

rmse = np.sqrt(mse)
print(f'Root Mean Squared Error (RMSE): {rmse}')

r2 = r2_score(y_val, y_pred)
print(f'R-squared (R^2): {r2}')

adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f'Adjusted R-squared: {adjusted_r2}')

mape = np.mean(np.abs((y_val - y_pred) / y_val)) * 100
print(f'Mean Absolute Percentage Error (MAPE): {mape}%')

Mean Absolute Error (MAE): 2488.2000494143977
Mean Squared Error (MSE): 22302290.85040321
Root Mean Squared Error (RMSE): 4722.530132291716
R-squared (R^2): 0.8628384379084356
Adjusted R-squared: 0.8563069349516945
Mean Absolute Percentage Error (MAPE): 28.125559118901077%
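
Reusing one variable name (`reg`) across cells makes it easy to fit or score the wrong estimator. A minimal sketch of one way to keep the models separate is to store them in a dictionary and loop over it. The data here is synthetic (an assumption for illustration), and the names `models` and `scores` are not part of the original notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: y depends linearly on 3 features plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = X[:150], X[150:], y[:150], y[150:]

models = {
    "random_forest": RandomForestRegressor(n_estimators=10, random_state=1),
    "gradient_boosting": GradientBoostingRegressor(random_state=1),
    "linear_regression": LinearRegression(),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)  # each estimator is fit and scored under its own key
    scores[name] = r2_score(y_te, model.predict(X_te))

for name, score in scores.items():
    print(f"{name}: R^2 = {score:.3f}")
```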


    5. Using polynomial regression (degree 2), we train the model and predict the validation targets. As in the previous step, we use all X columns. The model is evaluated with the same metrics. Spoiler: the R² recorded for the previous step (0.863) is higher than the one obtained here (0.847).








In [12]:
from sklearn.preprocessing import PolynomialFeatures

degree = 2

poly = PolynomialFeatures(degree=degree)
X_poly_train = poly.fit_transform(X_train)
X_poly_val = poly.transform(X_val)

reg = LinearRegression()
reg.fit(X_poly_train, y_train)

y_pred = reg.predict(X_poly_val)


mae = mean_absolute_error(y_val, y_pred)
print(f'Mean Absolute Error (MAE): {mae}')

mse = mean_squared_error(y_val, y_pred)
print(f'Mean Squared Error (MSE): {mse}')

rmse = np.sqrt(mse)
print(f'Root Mean Squared Error (RMSE): {rmse}')

r2 = r2_score(y_val, y_pred)
print(f'R-squared (R^2): {r2}')

# Note: p is still 6 here, although X_poly_train has 28 columns (including the bias term)
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f'Adjusted R-squared: {adjusted_r2}')

mape = np.mean(np.abs((y_val - y_pred) / y_val)) * 100
print(f'Mean Absolute Percentage Error (MAPE): {mape}%')

Mean Absolute Error (MAE): 2946.193637495301
Mean Squared Error (MSE): 24902516.443053015
Root Mean Squared Error (RMSE): 4990.242122688339
R-squared (R^2): 0.8468467621442456
Adjusted R-squared: 0.839553750817781
Mean Absolute Percentage Error (MAPE): 27.063927403978866%


    6. Using KNeighborsRegressor with k=5 (the default), we train the model on standardized features and predict the validation targets. The model is evaluated with the same metrics.

In [13]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)



k = 5
knn = KNeighborsRegressor(n_neighbors=k)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_val)

mae = mean_absolute_error(y_val, y_pred)
print(f'Mean Absolute Error (MAE): {mae}')

mse = mean_squared_error(y_val, y_pred)
print(f'Mean Squared Error (MSE): {mse}')

rmse = np.sqrt(mse)
print(f'Root Mean Squared Error (RMSE): {rmse}')

r2 = r2_score(y_val, y_pred)
print(f'R-squared (R^2): {r2}')

adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f'Adjusted R-squared: {adjusted_r2}')

mape = np.mean(np.abs((y_val - y_pred) / y_val)) * 100
print(f'Mean Absolute Percentage Error (MAPE): {mape}%')

Mean Absolute Error (MAE): 3159.1184669022564
Mean Squared Error (MSE): 28945654.908108138
Root Mean Squared Error (RMSE): 5380.116625883507
R-squared (R^2): 0.8219810121932972
Adjusted R-squared: 0.8135039175358352
Mean Absolute Percentage Error (MAPE): 32.83632686839687%
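
The cell above overwrites `X_train` and `X_val` with their scaled versions. A minimal sketch of an alternative is to put the scaler and the regressor in one scikit-learn pipeline, which leaves the original arrays untouched and makes it easy to try several values of k. The data is synthetic and the name `knn_pipe` is introduced here for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two features on very different scales.
rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(40, 10, 120), rng.integers(0, 2, 120)])
y = 200 * X[:, 0] + 20000 * X[:, 1] + rng.normal(0, 500, 120)
X_tr, X_te, y_tr, y_te = X[:90], X[90:], y[:90], y[90:]

for k in (3, 5, 10):
    # The scaler is fit on the training split only, inside the pipeline.
    knn_pipe = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=k))
    knn_pipe.fit(X_tr, y_tr)
    print(f"k={k}: R^2 = {knn_pipe.score(X_te, y_te):.3f}")
```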


    7. Using LinearRegression again, we fit a model on a single predictor, the column most correlated with 'charges' (`smoker`), and predict the validation targets. The model is evaluated with the same metrics. The models trained on all columns still performed better (R² of at least 0.82, versus 0.63 here).








In [14]:
X_train = train[cols]
y_train = train["charges"]
X_val = val[cols]
y_val = val["charges"]

train.corr()["charges"]

age         0.296395
sex         0.064249
bmi         0.204654
children    0.059493
smoker      0.785957
region      0.000332
charges     1.000000
Name: charges, dtype: float64
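
The most correlated column can also be picked programmatically rather than by reading the table. A small sketch on a made-up frame (not the notebook's data); the names `toy`, `corr_with_target`, and `best_feature` are hypothetical.

```python
import pandas as pd

# Made-up toy frame with a numeric target column.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [5, 3, 4, 1, 2],
    "target": [2.1, 3.9, 6.2, 8.1, 9.8],
})

# Correlation of every feature with the target, excluding the target itself.
corr_with_target = toy.corr()["target"].drop("target")

# Use the absolute value so strong negative correlations also count.
best_feature = corr_with_target.abs().idxmax()
print(best_feature)
```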

In [15]:
reg = LinearRegression()
reg.fit(X_train["smoker"].values.reshape(-1,1),y_train)
y_pred = reg.predict(X_val["smoker"].values.reshape(-1,1))


mae = mean_absolute_error(y_val, y_pred)
print(f'Mean Absolute Error (MAE): {mae}')

mse = mean_squared_error(y_val, y_pred)
print(f'Mean Squared Error (MSE): {mse}')

rmse = np.sqrt(mse)
print(f'Root Mean Squared Error (RMSE): {rmse}')

r2 = r2_score(y_val, y_pred)
print(f'R-squared (R^2): {r2}')

# Note: p is still 6 from the earlier cell; this model uses a single predictor (p = 1)
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f'Adjusted R-squared: {adjusted_r2}')

mape = np.mean(np.abs((y_val - y_pred) / y_val)) * 100
print(f'Mean Absolute Percentage Error (MAPE): {mape}%')

Mean Absolute Error (MAE): 5687.433054307644
Mean Squared Error (MSE): 59801783.79001118
Root Mean Squared Error (RMSE): 7733.1613063488585
R-squared (R^2): 0.6322123975729772
Adjusted R-squared: 0.6146987022193096
Mean Absolute Percentage Error (MAPE): 86.24638182558469%
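
Because `smoker` takes only two values after encoding, a one-variable linear fit can produce only two distinct predictions. For ordinary least squares these are the mean target of each group, which may help explain the large MAPE above. A small sketch with made-up numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up numbers: x is a 0/1 flag, y is the target.
x = np.array([0, 0, 0, 1, 1])
y = np.array([4000.0, 6000.0, 8000.0, 30000.0, 40000.0])

model = LinearRegression().fit(x.reshape(-1, 1), y)
pred_0, pred_1 = model.predict(np.array([[0], [1]]))

# Each prediction matches the mean of its group.
print(pred_0, y[x == 0].mean())
print(pred_1, y[x == 1].mean())
```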
