## Medical insurance price prediction
Data definition of the columns
##### age:
        Age of the primary beneficiary.
##### sex:
        Gender of the insurance contractor: female or male.
##### bmi:
        Body mass index, an objective measure of body weight relative to height, computed as weight divided by height squared (kg / m^2). It indicates whether a weight is relatively high or low for a given height; the ideal range is 18.5 to 24.9.
##### children:
        Number of children covered by health insurance (number of dependents).
##### smoker:
        Whether the beneficiary smokes: yes or no.
##### region:
        The beneficiary's residential area in the US: northeast, southeast, southwest, or northwest.
##### charges:
        Individual medical costs billed by health insurance.
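
The bmi column defined above reduces to a one-line formula: weight in kilograms divided by the square of height in metres. A minimal sketch follows; the function names `bmi` and `in_ideal_range` and the sample person are illustrative and not part of the dataset, and the 18.5 to 24.9 range is the one quoted in the column description.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index in kg / m^2."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


def in_ideal_range(value: float, low: float = 18.5, high: float = 24.9) -> bool:
    """True when a BMI value falls inside the range quoted in the column description."""
    return low <= value <= high


# Hypothetical person: 70 kg, 1.75 m tall.
example = bmi(70, 1.75)
print(round(example, 2), in_ideal_range(example))
```
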
##### Acknowledgements
        The dataset is available on GitHub here.
##### Inspiration
        Can you accurately predict insurance costs?
##### Goal: Create a linear regression model that predicts medical insurance charges


In [124]:
# Install from PyPI as `scikit-learn`; the `sklearn` package name is deprecated and fails to install.
%pip install numpy pandas matplotlib scikit-learn
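
The pip name and the import name differ for scikit-learn: it is installed as `scikit-learn` but imported as `sklearn`. A small sketch for checking which distributions are present, using only the standard library; the helper name `installed_version` is illustrative.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a PyPI distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# The distribution is "scikit-learn"; the module you import is "sklearn".
for name in ("numpy", "pandas", "matplotlib", "scikit-learn"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```
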

In [134]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [135]:
# Loading data into pandas
insurance_data = pd.read_csv('data/insurance.csv')


In [136]:
print(insurance_data.head())
print(insurance_data.describe())
# describe() summarizes the numeric columns: count, mean, std, min, max, and quartiles
print(insurance_data.isnull().sum())


   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520
               age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
max      64.000000    53.130000     5.000000  63770.428010
age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
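
To make the summary statistics concrete, the sketch below recomputes a few of them by hand for the five `charges` values shown by `head()`. Because it uses only those five rows, the numbers differ from the full-table values; the variable names are illustrative. In the full table the mean of charges (13270.42) sits well above the median (9382.03), which suggests a right-skewed distribution.

```python
import statistics

# The five `charges` values printed by head() above.
charges_head = [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520]

mean = statistics.mean(charges_head)
median = statistics.median(charges_head)
std = statistics.stdev(charges_head)  # sample standard deviation (divides by n - 1)

print(f"mean={mean:.2f} median={median:.2f} std={std:.2f}")
```
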

In [137]:
# One-hot encode the categorical columns; drop_first=True drops one level per column
# so the remaining dummy columns are not redundant
insurance_data_encoded = pd.get_dummies(insurance_data, columns=['sex', 'smoker', 'region'], drop_first=True)

print(insurance_data_encoded.head())


   age     bmi  children      charges  sex_male  smoker_yes  region_northwest  \
0   19  27.900         0  16884.92400     False        True             False   
1   18  33.770         1   1725.55230      True       False             False   
2   28  33.000         3   4449.46200      True       False             False   
3   33  22.705         0  21984.47061      True       False              True   
4   32  28.880         0   3866.85520      True       False              True   

   region_southeast  region_southwest  
0             False              True  
1              True             False  
2              True             False  
3             False             False  
4             False             False  
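
The encoding above can be mimicked in plain Python to show what `drop_first` does. This is a sketch that imitates the behaviour seen in the output, not pandas' implementation; the helper `one_hot` is hypothetical, and the four region levels are taken from the column description. With the first level (northeast) dropped, a row whose three region flags are all False is a northeast row.

```python
from typing import Dict, List, Sequence


def one_hot(values: Sequence[str], levels: Sequence[str], prefix: str,
            drop_first: bool = True) -> Dict[str, List[bool]]:
    """One boolean column per level; optionally skip the first (sorted) level."""
    ordered = sorted(levels)
    kept = ordered[1:] if drop_first else ordered
    return {f"{prefix}_{level}": [v == level for v in values] for level in kept}


# Region values of the five rows shown by head().
regions = ["southwest", "southeast", "southeast", "northwest", "northwest"]
all_levels = ["northeast", "northwest", "southeast", "southwest"]

encoded = one_hot(regions, all_levels, prefix="region")
for column, flags in encoded.items():
    print(column, flags)
```
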


In [138]:
print(insurance_data_encoded.describe())
print(insurance_data_encoded.isnull().sum())
# Sanity-check the encoded data; describe() covers only the numeric columns,
# so the boolean dummy columns do not appear in its output

               age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
max      64.000000    53.130000     5.000000  63770.428010
age                 0
bmi                 0
children            0
charges             0
sex_male            0
smoker_yes          0
region_northwest    0
region_southeast    0
region_southwest    0
dtype: int64


In [139]:
# Features: every column except the target. axis=1 drops a column rather than a row.
X = insurance_data_encoded.drop('charges', axis=1)
print(X.head())
# Target: the charges column.
y = insurance_data_encoded['charges']
print(y.head())
# X is a DataFrame, while y is a Series (a single column)


   age     bmi  children  sex_male  smoker_yes  region_northwest  \
0   19  27.900         0     False        True             False   
1   18  33.770         1      True       False             False   
2   28  33.000         3      True       False             False   
3   33  22.705         0      True       False              True   
4   32  28.880         0      True       False              True   

   region_southeast  region_southwest  
0             False              True  
1              True             False  
2              True             False  
3             False             False  
4             False             False  
0    16884.92400
1     1725.55230
2     4449.46200
3    21984.47061
4     3866.85520
Name: charges, dtype: float64


In [140]:
# Split into training and test sets with scikit-learn.
# random_state seeds the shuffle so the same rows land in each set on every run.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("X_train shape is: {} and y_train shape is: {}".format(X_train.shape, y_train.shape))
print("X_test shape is: {} and y_test shape is: {}".format(X_test.shape, y_test.shape))

 


X_train shape is: (1070, 8) and y_train shape is: (1070,)
X_test shape is: (268, 8) and y_test shape is: (268,)
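
The idea behind a seeded 80/20 split can be sketched with the standard library alone. This is an illustration, not scikit-learn's implementation, and it will not select the same rows as `train_test_split`; the helper `split_indices` is hypothetical. Rounding the test share up reproduces the 268 and 1070 row counts printed above.

```python
import math
import random
from typing import List, Tuple


def split_indices(n_rows: int, test_size: float, seed: int) -> Tuple[List[int], List[int]]:
    """Shuffle row indices with a fixed seed, then cut off the test portion."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)  # 1338 * 0.2 = 267.6, rounded up to 268
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(1338, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))
```

Calling the function again with the same seed returns the same two index lists, which is the reproducibility that `random_state` provides.
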


In [141]:
# Model training using linear regression

# Create the model
lr = LinearRegression()
# Fit the model on the training data
lr.fit(X_train, y_train)
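
`LinearRegression` fits ordinary least squares: it picks the coefficients that minimize the sum of squared residuals. For a single feature the solution has a simple closed form, sketched below on toy data. The function `fit_line` and the data are illustrative; the model above solves the same kind of problem across all eight features at once.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for a single feature."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Toy data generated from y = 2x + 1 (not taken from the insurance dataset).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```
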


In [142]:
# Predict charges for the held-out test set

y_pred = lr.predict(X_test)

# Evaluate the predictions with mean squared error and R^2

mean_squared_error_insurance_cost = mean_squared_error(y_test, y_pred)
print(f"mean_squared_error_insurance_cost is: {mean_squared_error_insurance_cost}")
r2_score_insurance_cost = r2_score(y_test, y_pred)
print(f"r2_score_insurance_cost is: {r2_score_insurance_cost}")

mean_squared_error_insurance_cost is: 33596915.85136145
r2_score_insurance_cost is: 0.7835929767120724
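
Both metrics are short formulas, sketched here in plain Python on toy values so the definitions are visible. The helper names `mse` and `r2` are illustrative. The last lines take the square root of the MSE reported above, which gives an error in the same units as `charges` (roughly 5,800), and that is easier to compare with the mean charge of about 13,270.

```python
import math
from typing import Sequence


def mse(y_true: Sequence[float], y_hat: Sequence[float]) -> float:
    """Mean of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_hat)) / len(y_true)


def r2(y_true: Sequence[float], y_hat: Sequence[float]) -> float:
    """1 - SS_res / SS_tot: share of variance explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values, not taken from the insurance dataset.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
print(mse(actual, predicted), r2(actual, predicted))

# RMSE puts the reported MSE back into the units of `charges`.
rmse = math.sqrt(33596915.85136145)
print(round(rmse, 2))
```
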


In [143]:
# Coefficients: one per feature, in the order of X's columns

coeff_df = pd.DataFrame(lr.coef_, index=X.columns, columns=['coefficient'])
print("Coefficient is: ")
print(coeff_df)

Coefficient is: 
                   coefficient
age                 256.975706
bmi                 337.092552
children            425.278784
sex_male            -18.591692
smoker_yes        23651.128856
region_northwest   -370.677326
region_southeast   -657.864297
region_southwest   -809.799354
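
Each coefficient is the change in predicted charges for a one-unit change in that feature with the others held fixed. The intercept is not printed above, so absolute predictions cannot be rebuilt from this table, but differences between two profiles can, because the intercept cancels. A sketch under that assumption; the profiles and the helper `predicted_difference` are hypothetical.

```python
from typing import Dict

# Coefficients copied from the table above (the intercept is not printed there).
COEFFICIENTS: Dict[str, float] = {
    "age": 256.975706,
    "bmi": 337.092552,
    "children": 425.278784,
    "sex_male": -18.591692,
    "smoker_yes": 23651.128856,
    "region_northwest": -370.677326,
    "region_southeast": -657.864297,
    "region_southwest": -809.799354,
}


def predicted_difference(profile_a: Dict[str, float], profile_b: Dict[str, float],
                         coefs: Dict[str, float] = COEFFICIENTS) -> float:
    """Predicted charges for profile_a minus profile_b; the intercept cancels."""
    return sum(coef * (profile_a.get(name, 0) - profile_b.get(name, 0))
               for name, coef in coefs.items())


# Two hypothetical beneficiaries who differ only in smoking status.
smoker = {"age": 40, "bmi": 30.0, "children": 1, "smoker_yes": 1}
non_smoker = {"age": 40, "bmi": 30.0, "children": 1, "smoker_yes": 0}
print(round(predicted_difference(smoker, non_smoker), 2))
```

Under this model, smoking status moves the prediction far more than any other single feature in the table.
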
