**Independent variables**

1. age: age of the policyholder
2. sex: gender of the policyholder (female=0, male=1)
3. bmi: body mass index, ideally 18.5 to 25
4. children: number of children / dependents of the policyholder
5. smoker: smoking status of the policyholder (non-smoker=0, smoker=1)
6. region: the residential area of the policyholder in the US (northeast=0, northwest=1, southeast=2, southwest=3)
7. charges: individual medical costs billed by health insurance

**Target variable**

1. insuranceclaim: binary categorical variable (0 = no claim, 1 = claim)

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
 
from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings('ignore')


In [None]:
insuranceDF = pd.read_csv('insurance2.csv')
print(insuranceDF.head())

In [None]:
insuranceDF.info()

Let's start by computing the correlation between every pair of features (and the target variable) and visualizing the correlations as a heatmap.

In [None]:
corr = insuranceDF.corr()
print(corr)
sns.heatmap(corr, 
        xticklabels=corr.columns,
        yticklabels=corr.columns)
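
Colors in a heatmap can be hard to compare precisely, so it may help to also rank the features by the strength of their correlation with the target. The following is a minimal sketch of that idea. It uses a small synthetic frame in place of `insurance2.csv`; the values and the `demo` name are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for insuranceDF; all values are invented for illustration.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age": rng.integers(18, 65, size=200),
    "bmi": rng.normal(30, 6, size=200),
    "children": rng.integers(0, 5, size=200),
})
# Hypothetical target that depends only on bmi, so bmi should rank first.
demo["insuranceclaim"] = (demo["bmi"] > 28).astype(int)

# Correlation of each feature with the target (the target itself is dropped).
corr_with_target = demo.corr()["insuranceclaim"].drop("insuranceclaim")

# Order by absolute value so strong negative correlations are not buried.
order = corr_with_target.abs().sort_values(ascending=False).index
target_corr = corr_with_target[order]
print(target_corr)
```

With the real data, the same three lines at the end could be applied to `insuranceDF` directly.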

The dataset consists of 1338 records in total. Use the first 1000 records for training, the next 300 records for testing, and the last 38 records to cross-check your model.

In [None]:
dfTrain = insuranceDF[:1000]
dfTest = insuranceDF[1000:1300]
dfCheck = insuranceDF[1300:] 
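
Slicing by position works well only if the rows in the file are in no particular order. If that is in doubt, one option is to shuffle the rows once with a fixed seed before slicing. The sketch below assumes a stand-in frame with a hypothetical `row_id` column so it can run without the CSV file; the split sizes match the ones used above.

```python
import numpy as np
import pandas as pd

# Stand-in frame with the same number of rows as the dataset (1338).
# The row_id column is hypothetical and only used to track rows.
demo = pd.DataFrame({"row_id": np.arange(1338)})

# Shuffle once with a fixed seed so the split is reproducible.
shuffled = demo.sample(frac=1.0, random_state=42).reset_index(drop=True)

train_part = shuffled[:1000]      # training rows
test_part = shuffled[1000:1300]   # test rows
check_part = shuffled[1300:]      # held-out rows for the final cross-check

print(len(train_part), len(test_part), len(check_part))
```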

In [None]:
print(dfTrain.head())
print(dfTest.head())
print(dfCheck.head())

In [None]:
trainLabel = np.asarray(dfTrain['insuranceclaim'])
trainData = np.asarray(dfTrain.drop('insuranceclaim', axis=1))
testLabel = np.asarray(dfTest['insuranceclaim'])
testData = np.asarray(dfTest.drop('insuranceclaim', axis=1))

In [None]:
print(trainData[:5])
print(trainLabel[:5])
print(testData[:5])
print(testLabel[:5])

Before using machine learning, normalize your inputs. Machine learning models often benefit substantially from input normalization. It also makes it easier to compare the importance of each feature later, when looking at the model weights. Normalize the data so that each variable has a mean of 0 and a standard deviation of 1, using statistics computed on the training data only.

In [None]:
means = np.mean(trainData, axis=0)
stds = np.std(trainData, axis=0)
 
trainData = (trainData - means)/stds
testData = (testData - means)/stds
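
As a quick sanity check on this step, the sketch below applies the same formula to synthetic arrays (invented values, two made-up columns) and confirms that the training columns end up with mean 0 and standard deviation 1. The test columns are scaled with the training statistics, so their mean and standard deviation are only approximately 0 and 1. The guard against a zero standard deviation is an addition for illustration and is not part of the cell above.

```python
import numpy as np

# Synthetic stand-ins for trainData and testData; values are invented.
rng = np.random.default_rng(1)
train = rng.normal(loc=[40.0, 30.0], scale=[14.0, 6.0], size=(1000, 2))
test = rng.normal(loc=[40.0, 30.0], scale=[14.0, 6.0], size=(300, 2))

# Statistics come from the training data only.
means = train.mean(axis=0)
stds = train.std(axis=0)
stds[stds == 0] = 1.0  # avoid division by zero for constant columns

train_scaled = (train - means) / stds
test_scaled = (test - means) / stds

print("train mean:", train_scaled.mean(axis=0).round(6))
print("train std: ", train_scaled.std(axis=0).round(6))
print("test mean: ", test_scaled.mean(axis=0).round(3))
```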

In [None]:
insuranceCheck = LogisticRegression()
insuranceCheck.fit(trainData, trainLabel)

In [None]:
from sklearn.metrics import confusion_matrix
y_pred = insuranceCheck.predict(testData)
cm_lr = confusion_matrix(testLabel,y_pred)

print(cm_lr)
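
In scikit-learn's binary confusion matrix, rows are the true classes and columns are the predicted classes, so the entries are laid out as `[[TN, FP], [FN, TP]]`. The sketch below shows how accuracy, precision, and recall follow from those four counts. The matrix values are made up for illustration and are not the output of the model above.

```python
import numpy as np

# Made-up confusion matrix in scikit-learn's layout: [[TN, FP], [FN, TP]].
cm = np.array([[100, 20],
               [15, 165]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted claims that are real claims
recall = tp / (tp + fn)           # share of real claims that the model finds

print(f"accuracy  = {accuracy:.3f}")
print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
```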

Now, use the test data to measure the accuracy of the model.

In [None]:
accuracy = insuranceCheck.score(testData, testLabel)
print("accuracy = ", accuracy * 100, "%")
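
The 38 records set aside for cross-checking can be scored the same way, as long as they are normalized with the training means and standard deviations. The following is a minimal sketch of that flow on synthetic data, so it runs without the CSV file. The two features, the label rule, and the names `X`, `y`, and `model` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 1338 rows, 2 features, label driven by the first feature.
rng = np.random.default_rng(2)
X = rng.normal(size=(1338, 2))
y = (X[:, 0] + 0.3 * rng.normal(size=1338) > 0).astype(int)

# Same positional split as above: first 1000 rows train, last 38 rows check.
X_train, y_train = X[:1000], y[:1000]
X_check, y_check = X[1300:], y[1300:]

# Normalize with training statistics only.
means = X_train.mean(axis=0)
stds = X_train.std(axis=0)

model = LogisticRegression()
model.fit((X_train - means) / stds, y_train)

# Score the held-out rows, which the model has never seen.
check_pred = model.predict((X_check - means) / stds)
check_accuracy = (check_pred == y_check).mean()
print(f"check accuracy = {check_accuracy * 100:.1f}%")
```

In the notebook itself, the equivalent would be to build `checkData` and `checkLabel` from `dfCheck` in the same way as the training and test arrays.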

To get a better sense of what is going on inside the logistic regression model, visualize how the model uses the different features and which features have the greatest effect.

In [None]:
coeff = list(insuranceCheck.coef_[0])
labels = list(dfTrain.drop(columns='insuranceclaim').columns)
features = pd.DataFrame()
features['Features'] = labels
features['importance'] = coeff
features.sort_values(by=['importance'], ascending=True, inplace=True)
features['positive'] = features['importance'] > 0
features.set_index('Features', inplace=True)
features.importance.plot(kind='barh', figsize=(11, 6), color=features.positive.map({True: 'blue', False: 'red'}))
plt.xlabel('Importance')

From the above figure:

1. BMI and smoker status have a significant influence on the model, especially BMI.

2. The number of children has a negative influence on the prediction, i.e. a larger number of children pushes the model toward predicting that the policyholder has not made an insurance claim.

3. Although age was better correlated than BMI with the target variable, insuranceclaim, the model relies more on BMI. This can happen for several reasons, including the fact that the information captured by age is also captured by some other variable, whereas the information captured by BMI is not captured by other variables.
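
Because the inputs were standardized, each logistic regression coefficient can also be read as an odds ratio: `exp(coefficient)` is the factor by which the odds of a claim change when that feature increases by one standard deviation, with the other features held fixed. The sketch below illustrates the conversion. The coefficient values are made up and are not the fitted values from `insuranceCheck.coef_`.

```python
import numpy as np

# Made-up coefficients for illustration; not the fitted values from the model.
labels = ["age", "bmi", "children", "smoker"]
coefs = np.array([0.5, 1.5, -1.0, 1.2])

# exp(coef) is the multiplicative change in the odds of a claim
# per one standard deviation increase in the feature.
odds_ratios = np.exp(coefs)

for name, coef, ratio in zip(labels, coefs, odds_ratios):
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name:>8}: coef = {coef:+.2f}, odds ratio = {ratio:.2f} ({direction} the odds)")
```

An odds ratio above 1 corresponds to a positive bar in the figure and a ratio below 1 to a negative bar.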

