# Medical Costs Prediction

## 1. Importing libraries & dataset

In [61]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [62]:
dfInsurance = pd.read_csv('/content/drive/MyDrive/ml-medical-costs/data/insurance.csv')
dfInsurance

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


In [63]:
dfInsurance.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [64]:
dfInsurance.describe()

Unnamed: 0,age,bmi,children,charges
count,1338.0,1338.0,1338.0,1338.0
mean,39.207025,30.663397,1.094918,13270.422265
std,14.04996,6.098187,1.205493,12110.011237
min,18.0,15.96,0.0,1121.8739
25%,27.0,26.29625,0.0,4740.28715
50%,39.0,30.4,1.0,9382.033
75%,51.0,34.69375,2.0,16639.912515
max,64.0,53.13,5.0,63770.42801


## 2. Exploring our dataset

<p>We'll look at subsets of the dataset and analyze how each feature relates to our target (charges).</p>

In [65]:
fig = px.scatter(dfInsurance, x='age', y='charges', title='Charges distribution per age', color='sex')
fig.show()

In [66]:
fig = px.box(dfInsurance, x='age', y='charges', title='Charges distribution per age group and its deviation')
fig.show()

<p>The boxplot of charges per age group is interesting: there appears to be a function that describes the minimum charge at each age, regardless of smoking status, BMI, sex, or number of children.</p>
<p>We should probably build a model for the minimum charge and work from there.</p>
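<p>A minimal sketch of that idea, assuming the "minimum charge" is simply the lowest observed charge at each age and that a straight line is a reasonable first fit. The small frame reuses rows displayed above; the names <code>df</code> and <code>min_per_age</code> are illustrative and not part of the notebook.</p>

```python
import numpy as np
import pandas as pd

# A few rows copied from the dfInsurance preview; the real analysis
# would use the full dataframe instead of this stand-in.
df = pd.DataFrame({
    'age':     [19, 18, 28, 33, 32, 50, 18, 18, 21],
    'charges': [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552,
                10600.5483, 2205.9808, 1629.8335, 2007.945],
})

# Lowest observed charge for each age.
min_per_age = df.groupby('age')['charges'].min()

# Fit a degree-1 polynomial (a line) through those minima.
ages = min_per_age.index.to_numpy(dtype=float)
slope, intercept = np.polyfit(ages, min_per_age.to_numpy(), deg=1)

print(min_per_age)
print(f"min charge ~ {slope:.2f} * age + {intercept:.2f}")
```

<p>With only a handful of rows the fitted line is not meaningful; on the full dataset the same two steps would give a baseline to compare the other features against.</p>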

<p>We'll divide our dataframe into two, one for smokers and another for non smokers.</p>

In [67]:
dfSmokers = dfInsurance.loc[dfInsurance['smoker']=='yes']
dfSmokers

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
11,62,female,26.290,0,yes,southeast,27808.72510
14,27,male,42.130,0,yes,southeast,39611.75770
19,30,male,35.300,0,yes,southwest,36837.46700
23,34,female,31.920,1,yes,northeast,37701.87680
...,...,...,...,...,...,...,...
1313,19,female,34.700,2,yes,southwest,36397.57600
1314,30,female,23.655,3,yes,northwest,18765.87545
1321,62,male,26.695,0,yes,northeast,28101.33305
1323,42,female,40.370,2,yes,southeast,43896.37630


In [68]:
dfSmokers.describe()

Unnamed: 0,age,bmi,children,charges
count,274.0,274.0,274.0,274.0
mean,38.514599,30.708449,1.113139,32050.231832
std,13.923186,6.318644,1.157066,11541.547176
min,18.0,17.195,0.0,12829.4551
25%,27.0,26.08375,0.0,20826.244213
50%,38.0,30.4475,1.0,34456.34845
75%,49.0,35.2,2.0,41019.207275
max,64.0,52.58,5.0,63770.42801


In [69]:
fig = px.scatter(dfSmokers, x='bmi', y='charges', color='children')
fig.show()

In [70]:
fig = px.box(dfSmokers, x='children', y='charges')
fig.show()

In [71]:
fig = px.scatter(dfSmokers, x='age', y='charges', color='bmi')
fig.show()

<p>Alright, it's clear now that BMI affects the charges, but the number of children doesn't seem to change them much; at least, I couldn't see a pattern.</p>
<p>Now let's see if region affects the charges.</p>

In [72]:
fig = px.box(dfSmokers, x='region', y='charges')
fig.show()

In [73]:
fig = px.scatter(dfSmokers, x='bmi', y='charges', color='region')
fig.show()

In [74]:
fig = px.scatter(dfSmokers, x='age', y='charges', color='region', size='bmi')
fig.show()

In [75]:
fig = px.box(dfSmokers, x='sex', y='charges')
fig.show()

<p>It's noticeable that the southern regions are the most expensive ones: their median charges are well above those of the northern regions.</p>
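<p>The boxplot reading can be checked numerically with a grouped median. This is a small sketch on the nine smoker rows displayed earlier, so the ordering it prints is only illustrative; the names <code>df</code> and <code>medians</code> are assumptions, not notebook variables.</p>

```python
import pandas as pd

# Nine rows copied from the dfSmokers preview; the real check
# would run on the full dfSmokers dataframe.
df = pd.DataFrame({
    'region': ['southwest', 'southeast', 'southeast', 'southwest', 'northeast',
               'southwest', 'northwest', 'northeast', 'southeast'],
    'charges': [16884.924, 27808.7251, 39611.7577, 36837.467, 37701.8768,
                36397.576, 18765.87545, 28101.33305, 43896.3763],
})

# Median charge per region, most expensive first.
medians = df.groupby('region')['charges'].median().sort_values(ascending=False)
print(medians)
```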

<p>Now we do the same for the non smokers, who make up the majority of our dataset.</p>

In [76]:
dfNonSmokers = dfInsurance.loc[dfInsurance['smoker']=='no']
dfNonSmokers

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
5,31,female,25.740,0,no,southeast,3756.62160
...,...,...,...,...,...,...,...
1332,52,female,44.700,3,no,southwest,11411.68500
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350


In [77]:
dfNonSmokers.describe()

Unnamed: 0,age,bmi,children,charges
count,1064.0,1064.0,1064.0,1064.0
mean,39.385338,30.651795,1.090226,8434.268298
std,14.08341,6.043111,1.218136,5993.781819
min,18.0,15.96,0.0,1121.8739
25%,26.75,26.315,0.0,3986.4387
50%,40.0,30.3525,1.0,7345.4053
75%,52.0,34.43,2.0,11362.88705
max,64.0,53.13,5.0,36910.60803


In [78]:
fig = px.scatter(dfNonSmokers, x='age', y='charges', color='bmi')
fig.show()

In [79]:
fig = px.scatter(dfNonSmokers, x='bmi', y='charges', color='children')
fig.show()

In [80]:
fig = px.box(dfNonSmokers, x='children', y='charges')
fig.show()

In [81]:
fig = px.box(dfNonSmokers, x='age', y='charges')
fig.show()

<p>We notice that the boxplots of charges per age and charges per children show something that resembles a linear function for the minimum charge, the same idea as in the smokers dataframe.</p>

In [82]:
fig = px.box(dfNonSmokers, x='region', y='charges')
fig.show()

<p>Once again there's a difference between the southern and northern regions, this time with the north being the most expensive.</p>

In [83]:
fig = px.scatter(dfNonSmokers, x='age', y='charges', color='region', size='bmi')
fig.show()

In [84]:
fig = px.box(dfNonSmokers, x='sex', y='charges')
fig.show()

### Conclusion 1 (number of features and models needed)

<p>I believe it'll be interesting to have two models, one for smokers and another for non smokers. If we look at charges per age for the full dataset and compare it with the smokers and non smokers plots, we notice that the full plot has two parts. So, I suggest two models to try to predict medical costs.</p>

<p>As for the number of features per model, we'll use all the columns from the dataframe. All the columns seem to have influence over the charges.</p>

<p>For now, I'll create a training set for each model with half of the available data from each dataframe we created.</p>

<p>We'll use a simple random sample for each training set.</p>
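<p>One practical note on random sampling: without a seed, every run draws a different half, so the scores further down change from run to run. A minimal sketch, assuming a fixed <code>random_state</code> is acceptable (the value 42 and the stand-in frame are arbitrary choices for illustration):</p>

```python
import pandas as pd

# Small stand-in frame; in the notebook this would be dfSmokers or dfNonSmokers.
df = pd.DataFrame({
    'age': list(range(18, 28)),
    'charges': [1000.0 + 100 * i for i in range(10)],
})

# A fixed random_state makes the random half identical on every run.
first_draw = df.sample(frac=0.5, random_state=42)
second_draw = df.sample(frac=0.5, random_state=42)

print(first_draw.equals(second_draw))
```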

## 3. Models

### a) Smokers

<p>Let's prepare our training set.</p>

<p>First, we need to encode the sex column: female becomes 0 and male becomes 1.</p>
<p>Then we'll remove the smoker column, which is no longer relevant since we split the original dataframe in two.</p>

In [85]:
# Encode sex as an integer (female -> 0, male -> 1). assign() returns a new
# dataframe, so nothing is written into a slice of dfInsurance.
dfSmokers = dfSmokers.assign(sex=dfSmokers['sex'].map({'female': 0, 'male': 1}))

In [86]:
dfSmokers

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,0,27.900,0,yes,southwest,16884.92400
11,62,0,26.290,0,yes,southeast,27808.72510
14,27,1,42.130,0,yes,southeast,39611.75770
19,30,1,35.300,0,yes,southwest,36837.46700
23,34,0,31.920,1,yes,northeast,37701.87680
...,...,...,...,...,...,...,...
1313,19,0,34.700,2,yes,southwest,36397.57600
1314,30,0,23.655,3,yes,northwest,18765.87545
1321,62,1,26.695,0,yes,northeast,28101.33305
1323,42,0,40.370,2,yes,southeast,43896.37630


In [87]:
# Rebind instead of dropping in place, which avoids modifying a slice.
dfSmokers = dfSmokers.drop(columns='smoker')
dfSmokers

Unnamed: 0,age,sex,bmi,children,region,charges
0,19,0,27.900,0,southwest,16884.92400
11,62,0,26.290,0,southeast,27808.72510
14,27,1,42.130,0,southeast,39611.75770
19,30,1,35.300,0,southwest,36837.46700
23,34,0,31.920,1,northeast,37701.87680
...,...,...,...,...,...,...
1313,19,0,34.700,2,southwest,36397.57600
1314,30,0,23.655,3,northwest,18765.87545
1321,62,1,26.695,0,northeast,28101.33305
1323,42,0,40.370,2,southeast,43896.37630


<p>Now we do the same for region, exchanging the names for numeric values.</p>
<ul>
  <li>Southeast: 0</li>
  <li>Southwest: 1</li>
  <li>Northeast: 2</li>
  <li>Northwest: 3</li>
</ul>
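<p>Integer codes like these imply an ordering (southeast &lt; southwest &lt; northeast &lt; northwest) that a linear model will treat as a numeric magnitude. A common alternative, not used in this notebook, is one-hot encoding. The sketch below assumes <code>pd.get_dummies</code> with <code>drop_first=True</code>; the frame <code>df</code> is a stand-in built from a few rows shown above.</p>

```python
import pandas as pd

# Stand-in rows; the notebook would pass dfSmokers or dfNonSmokers instead.
df = pd.DataFrame({
    'age': [19, 62, 27, 34],
    'region': ['southwest', 'southeast', 'southeast', 'northeast'],
})

# One indicator column per region; drop_first removes one level
# so the indicators are not redundant with the model intercept.
encoded = pd.get_dummies(df, columns=['region'], drop_first=True, dtype=int)
print(encoded)
```

<p>Each remaining indicator then gets its own coefficient, so the model no longer assumes the regions are evenly spaced along a line.</p>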

In [88]:
# Map each region name to the integer code listed above.
regionCodes = {'southeast': 0, 'southwest': 1, 'northeast': 2, 'northwest': 3}
dfSmokers = dfSmokers.assign(region=dfSmokers['region'].map(regionCodes))

In [89]:
dfSmokers

Unnamed: 0,age,sex,bmi,children,region,charges
0,19,0,27.900,0,1,16884.92400
11,62,0,26.290,0,0,27808.72510
14,27,1,42.130,0,0,39611.75770
19,30,1,35.300,0,1,36837.46700
23,34,0,31.920,1,2,37701.87680
...,...,...,...,...,...,...
1313,19,0,34.700,2,1,36397.57600
1314,30,0,23.655,3,3,18765.87545
1321,62,1,26.695,0,2,28101.33305
1323,42,0,40.370,2,0,43896.37630


In [90]:
smokersTraining = dfSmokers.sample(frac=0.5)
smokersTraining

Unnamed: 0,age,sex,bmi,children,region,charges
1093,22,0,30.400,0,3,33907.54800
1042,20,1,30.685,0,2,33475.81715
1207,36,1,33.400,2,1,38415.47400
292,25,1,45.540,2,0,42112.23560
609,30,1,37.800,2,1,39241.44200
...,...,...,...,...,...,...
322,34,1,30.800,0,1,35491.64000
1231,20,0,21.800,0,1,20167.33603
29,31,1,36.300,2,1,38711.00000
621,37,1,34.100,4,1,40182.24600


In [91]:
smokersX = smokersTraining.iloc[:, 0:5]
smokersX

Unnamed: 0,age,sex,bmi,children,region
1093,22,0,30.400,0,3
1042,20,1,30.685,0,2
1207,36,1,33.400,2,1
292,25,1,45.540,2,0
609,30,1,37.800,2,1
...,...,...,...,...,...
322,34,1,30.800,0,1
1231,20,0,21.800,0,1
29,31,1,36.300,2,1
621,37,1,34.100,4,1


In [92]:
smokersy = smokersTraining['charges']
smokersy

1093    33907.54800
1042    33475.81715
1207    38415.47400
292     42112.23560
609     39241.44200
           ...     
322     35491.64000
1231    20167.33603
29      38711.00000
621     40182.24600
280     22331.56680
Name: charges, Length: 137, dtype: float64

In [93]:
smokersModel = LinearRegression()
smokersModel.fit(smokersX, smokersy)

LinearRegression()

In [94]:
smokersTest = pd.merge(dfSmokers, smokersTraining, indicator=True, how='outer').query('_merge=="left_only"').drop('_merge', axis=1)
smokersTest

Unnamed: 0,age,sex,bmi,children,region,charges
0,19,0,27.900,0,1,16884.92400
1,62,0,26.290,0,0,27808.72510
4,34,0,31.920,1,2,37701.87680
6,22,1,35.600,0,1,35585.57600
9,60,1,39.900,0,1,48173.36100
...,...,...,...,...,...,...
264,43,1,27.800,0,1,37829.72420
265,42,1,24.605,2,2,21259.37795
268,25,0,30.200,0,1,33900.65300
269,19,0,34.700,2,1,36397.57600
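<p>The merge above matches rows by their column values, which resets the index and treats exact duplicate rows as already sampled. A simpler alternative, sketched here on a toy frame with one deliberate duplicate, is to drop the sampled index labels. The names <code>df</code>, <code>train</code>, <code>test_by_index</code>, and <code>test_by_merge</code> are illustrative only.</p>

```python
import pandas as pd

# Toy frame: rows 0 and 1 are exact duplicates.
df = pd.DataFrame({
    'age': [18, 18, 30, 40],
    'charges': [1000.0, 1000.0, 2000.0, 3000.0],
})
train = df.iloc[[0, 2]]

# Complement by index label: keeps every row that was not sampled.
test_by_index = df.drop(train.index)

# Complement by value (the approach used above): the duplicate in row 1
# matches the sampled row 0, so it is flagged "both" and left out.
test_by_merge = (
    pd.merge(df, train, indicator=True, how='outer')
      .query('_merge == "left_only"')
      .drop('_merge', axis=1)
)

print(len(test_by_index), len(test_by_merge))
```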


In [95]:
smokersPredict = smokersModel.predict(smokersTest.iloc[:, 0:5])

In [96]:
smokersPredict

array([23229.02756629, 30625.01342143, 32827.44570583, 34515.4277529 ,
       49281.84032002, 36728.75166442, 35649.15718616, 17295.09915176,
       19830.38295128, 36675.58234887, 36853.6628247 , 24148.39229165,
       37798.53939883, 36229.90116106, 38526.42608608, 25097.59286966,
       34642.21844351, 23767.16989529, 28443.11720007, 34653.11920931,
       25018.37146538, 20145.81927653, 35710.93555976, 28118.43427051,
       20037.00843306, 26223.46703734, 39500.89415446, 36655.70499535,
       41103.31994645, 30057.73113141, 27162.52537622, 36831.5976976 ,
       48798.90850361, 16955.68068564, 39337.13140326, 23178.06023974,
       36347.99756006, 39506.48184063, 45215.96809327, 31722.53025208,
       14887.82146692, 31915.71670953, 36303.76021699, 19470.48463869,
       20203.96961859, 41527.07229571, 43481.82030865, 31816.94781251,
       31708.47858409, 25917.20808604, 42205.51235865, 24143.2179307 ,
       26007.58292667, 57148.11101738, 29055.87387627, 52920.35658178,
      

In [97]:
smokersTest['charges']

0      16884.92400
1      27808.72510
4      37701.87680
6      35585.57600
9      48173.36100
          ...     
264    37829.72420
265    21259.37795
268    33900.65300
269    36397.57600
271    28101.33305
Name: charges, Length: 137, dtype: float64

In [98]:
# Mean Squared Error
mean_squared_error(smokersTest['charges'], smokersPredict)

38936448.47303441
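<p>The MSE is expressed in squared charge units, which makes it hard to read. Taking the square root gives an error in the same units as the charges, roughly 6,240 here, or about 19% of the mean smoker charge of 32,050 reported by <code>describe()</code>. A short sketch using the values printed above:</p>

```python
import math

mse_smokers = 38936448.47303441      # MSE printed above
mean_charge_smokers = 32050.231832   # mean from dfSmokers.describe()

# Root mean squared error: same units as the target.
rmse_smokers = math.sqrt(mse_smokers)
relative_error = rmse_smokers / mean_charge_smokers

print(round(rmse_smokers, 2), round(relative_error, 3))
```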

In [99]:
# R2 Score (1-> perfection)
r2_score(smokersTest['charges'], smokersPredict)

0.7435208259502408

<p>Alright, we got an R² of about 0.74 (74.35%) for smokers, which is okay.</p>
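<p>For reference, R² compares the model's squared error with that of always predicting the mean: R² = 1 - SS_res / SS_tot. The sketch below computes it by hand on five test rows and their predictions taken from this notebook's comparison output, so the number it prints applies to those five rows only, not to the full test set.</p>

```python
import numpy as np

# Five (actual, predicted) pairs from the smokers test set.
y_true = np.array([16884.924, 27808.7251, 37701.8768, 35585.576, 48173.361])
y_pred = np.array([23229.027566, 30625.013421, 32827.445706, 34515.427753, 49281.84032])

ss_res = np.sum((y_true - y_pred) ** 2)         # error left after the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean
r2 = 1 - ss_res / ss_tot

print(round(r2, 3))
```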

In [100]:
dfComparasion = pd.DataFrame({'Test':smokersTest['charges'], 'Prediction':smokersPredict, 'Prediction/Test':smokersPredict/smokersTest['charges']})
dfComparasion

Unnamed: 0,Test,Prediction,Prediction/Test
0,16884.92400,23229.027566,1.375726
1,27808.72510,30625.013421,1.101274
4,37701.87680,32827.445706,0.870711
6,35585.57600,34515.427753,0.969927
9,48173.36100,49281.840320,1.023010
...,...,...,...
264,37829.72420,28985.047946,0.766198
265,21259.37795,25138.759572,1.182479
268,33900.65300,27741.458989,0.818316
269,36397.57600,32464.788149,0.891949


In [101]:
dfComparasion['Prediction/Test'].mean()

1.0540797859941327
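<p>A mean ratio close to 1 can hide large individual errors, because over-predictions and under-predictions cancel out. A complementary measure is the mean absolute percentage error, which averages the distance of each ratio from 1. The sketch uses the five ratios visible in the table above; <code>ratios</code> and <code>mape</code> are illustrative names.</p>

```python
import numpy as np

# Prediction/Test ratios for the first five rows of dfComparasion.
ratios = np.array([1.375726, 1.101274, 0.870711, 0.969927, 1.023010])

mean_ratio = ratios.mean()          # errors in opposite directions cancel
mape = np.abs(ratios - 1).mean()    # errors in opposite directions add up

print(round(mean_ratio, 4), round(mape, 4))
```

<p>On these five rows the mean ratio suggests predictions about 7% too high on average, while the typical individual miss is closer to 13%.</p>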

### b) Non Smokers

<p>After a moderate result for the smokers, we'll repeat the process with the non smokers dataset.</p>

In [103]:
# Map each region name to the same integer codes used for the smokers.
regionCodes = {'southeast': 0, 'southwest': 1, 'northeast': 2, 'northwest': 3}
dfNonSmokers = dfNonSmokers.assign(region=dfNonSmokers['region'].map(regionCodes))

In [104]:
dfNonSmokers

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
1,18,male,33.770,1,no,0,1725.55230
2,28,male,33.000,3,no,0,4449.46200
3,33,male,22.705,0,no,3,21984.47061
4,32,male,28.880,0,no,3,3866.85520
5,31,female,25.740,0,no,0,3756.62160
...,...,...,...,...,...,...,...
1332,52,female,44.700,3,no,1,11411.68500
1333,50,male,30.970,3,no,3,10600.54830
1334,18,female,31.920,0,no,2,2205.98080
1335,18,female,36.850,0,no,0,1629.83350


In [107]:
# Encode sex as an integer (female -> 0, male -> 1) without writing into a slice.
dfNonSmokers = dfNonSmokers.assign(sex=dfNonSmokers['sex'].map({'female': 0, 'male': 1}))

In [108]:
# Rebind instead of dropping in place, which avoids modifying a slice.
dfNonSmokers = dfNonSmokers.drop(columns='smoker')
dfNonSmokers

Unnamed: 0,age,sex,bmi,children,region,charges
1,18,1,33.770,1,0,1725.55230
2,28,1,33.000,3,0,4449.46200
3,33,1,22.705,0,3,21984.47061
4,32,1,28.880,0,3,3866.85520
5,31,0,25.740,0,0,3756.62160
...,...,...,...,...,...,...
1332,52,0,44.700,3,1,11411.68500
1333,50,1,30.970,3,3,10600.54830
1334,18,0,31.920,0,2,2205.98080
1335,18,0,36.850,0,0,1629.83350


In [109]:
nonSmokersTraining = dfNonSmokers.sample(frac=0.5)
nonSmokersTraining

Unnamed: 0,age,sex,bmi,children,region,charges
216,53,0,26.600,0,3,10355.64100
726,41,1,28.405,1,3,6664.68595
183,44,0,26.410,0,3,7419.47790
1309,41,1,32.200,2,1,6875.96100
825,64,0,31.825,2,2,16069.08475
...,...,...,...,...,...,...
892,54,1,24.035,0,2,10422.91665
744,50,1,26.410,0,3,8827.20990
459,40,0,33.000,3,0,7682.67000
754,24,1,33.630,4,2,17128.42608


In [110]:
nonSmokersX = nonSmokersTraining.iloc[:, 0:5]

In [111]:
nonSmokersy = nonSmokersTraining['charges']

In [112]:
nonSmokersModel = LinearRegression()

In [113]:
nonSmokersModel.fit(nonSmokersX, nonSmokersy)

LinearRegression()

In [114]:
nonSmokersTest = pd.merge(dfNonSmokers, nonSmokersTraining, indicator=True, how='outer').query('_merge=="left_only"').drop('_merge', axis=1)
nonSmokersTest

Unnamed: 0,age,sex,bmi,children,region,charges
2,33,1,22.705,0,3,21984.47061
3,32,1,28.880,0,3,3866.85520
4,31,0,25.740,0,0,3756.62160
8,60,0,25.840,0,3,28923.13692
10,23,1,34.400,0,1,1826.84300
...,...,...,...,...,...,...
1055,23,0,24.225,2,2,22395.74424
1056,52,1,38.600,2,1,10325.20600
1058,23,0,33.400,0,1,10795.93733
1060,50,1,30.970,3,3,10600.54830


In [115]:
nonSmokersPredict = nonSmokersModel.predict(nonSmokersTest.iloc[:,0:5])

In [116]:
nonSmokersPredict

array([ 6224.17643825,  6113.04047711,  5411.781851  , 13467.03240154,
        3549.69332133, 12215.95059997,  3480.89770222,  6045.70254723,
        2065.53763533,  8486.17647022, 14352.95311114, 13993.86496877,
       13439.26241113,  2429.22812272,  5309.17424716, 14119.34446541,
        2547.78651827,  3988.2828407 ,  8227.09889569, 11897.63132042,
        3824.40327983,  5401.80999887, 12885.37641936, 10011.04521869,
       14009.23121263,  8115.38509323, 10389.09437365,  6066.03845698,
        7989.88354462,  5536.39155822,  2875.83391489,  3786.4438715 ,
        8721.86852076,  9890.53691884,  3153.30908427,  8129.95518886,
       11737.60176324,  8322.34792219,  5330.44430281,  2520.04124752,
       12833.24670819, 12771.34612618, 10102.68746461, 10147.10400549,
       11946.86428784, 11306.73760675,  8537.50461411, 12629.6301111 ,
        2670.38423674,  3138.62107118,  2507.69470837,  3437.42479107,
       13159.31110373,  3512.06147164, 11422.48318741,  7325.66218868,
      

In [117]:
mean_squared_error(nonSmokersTest['charges'], nonSmokersPredict)

19573209.270637162

In [118]:
# R2 Score(1-> perfection)
r2_score(nonSmokersTest['charges'], nonSmokersPredict)

0.44858756276834877

<p>We got an R² of about 0.45 (44.86%), a noticeably weaker result.</p>

In [119]:
dfComparasion = pd.DataFrame({'Test':nonSmokersTest['charges'], 'Prediction':nonSmokersPredict, 'Prediction/Test':nonSmokersPredict/nonSmokersTest['charges']})
dfComparasion

Unnamed: 0,Test,Prediction,Prediction/Test
2,21984.47061,6224.176438,0.283117
3,3866.85520,6113.040477,1.580882
4,3756.62160,5411.781851,1.440598
8,28923.13692,13467.032402,0.465615
10,1826.84300,3549.693321,1.943075
...,...,...,...
1055,22395.74424,4772.849850,0.213114
1056,10325.20600,12252.609603,1.186670
1058,10795.93733,3708.633465,0.343521
1060,10600.54830,12469.767649,1.176332


## 4. Final Thoughts

<p>Alright, I believe this was good practice. The smokers model was more successful than the non smokers model, but it was still a nice first try.</p>
<p>Next time, I should find a way of weighting the features differently and only then work on the models. I still believe that dividing our dataframe in two, one for smokers and another for non smokers, was the best option.</p>
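<p>One possible reading of "weighting the features" is to put them on a common scale first, so that the fitted coefficients can be compared directly. The sketch below is an assumption about what that could look like, not something done in this notebook: it standardizes age and BMI for nine smoker rows shown earlier and fits ordinary least squares with NumPy.</p>

```python
import numpy as np

# Nine smoker rows shown earlier in the notebook: age, bmi, charges.
age = np.array([19, 62, 27, 30, 34, 19, 30, 62, 42], dtype=float)
bmi = np.array([27.9, 26.29, 42.13, 35.3, 31.92, 34.7, 23.655, 26.695, 40.37])
charges = np.array([16884.924, 27808.7251, 39611.7577, 36837.467, 37701.8768,
                    36397.576, 18765.87545, 28101.33305, 43896.3763])

# Standardize each feature to zero mean and unit variance.
X = np.column_stack([age, bmi])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Ordinary least squares with an explicit intercept column.
A = np.column_stack([np.ones(len(charges)), X_std])
params, *_ = np.linalg.lstsq(A, charges, rcond=None)
intercept, coef_age, coef_bmi = params

# Each coefficient is now the change in charges per one standard deviation.
print(round(coef_age, 2), round(coef_bmi, 2))
```

<p>With standardized inputs, the larger coefficient in absolute value points to the feature with more influence on the prediction, which could guide which features to keep or transform.</p>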
<p>Overall I'm satisfied with this first attempt.</p>