# Scikit-learn for Classification Modeling - Predicting Health Quality

You are an analyst for a large insurance company, Insurance Co. You obtain the following sample data on health quality from a well-known research organization:
<br>
- Season in which the analysis was performed: 1) winter, 2) spring, 3) summer, 4) fall. (-1, -0.33, 0.33, 1)
- Age at the time of analysis: 18-36. (0, 1)
- Childhood diseases (i.e., chicken pox, measles, mumps, polio): 1) yes, 2) no. (0, 1)
- Accident or serious trauma: 1) yes, 2) no. (0, 1)
- Surgical intervention: 1) yes, 2) no. (0, 1)
- High fevers in the last year: 1) less than three months ago, 2) more than three months ago, 3) no. (-1, 0, 1)
- Frequency of alcohol consumption: 1) several times a day, 2) every day, 3) several times a week, 4) once a week, 5) hardly ever or never. (0, 1)
- Smoking habit: 1) never, 2) occasional, 3) daily. (-1, 0, 1)
- Number of hours spent sitting per day: 1-16. (0, 1)
- Output: diagnosis, normal (N) or altered (O).

The data is stored on the research organization's server. Run the following code to import the data into a pandas DataFrame:

In [2]:
import pandas as pd

# The file has no header row, so the column names are supplied explicitly.
colnames = ['SEASON','AGE','DISEASES','ACCIDENTS','SURGERIES','FEVERS','ALCOHOL','SMOKING','SITTING','QUALITY']
health_qual = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/00244/fertility_Diagnosis.txt', names=colnames)
# Preview the first five rows to confirm the import worked.
print(health_qual.head())

   SEASON   AGE  DISEASES  ACCIDENTS  SURGERIES  FEVERS  ALCOHOL  SMOKING  \
0   -0.33  0.69         0          1          1       0      0.8        0   
1   -0.33  0.94         1          0          1       0      0.8        1   
2   -0.33  0.50         1          0          0       0      1.0       -1   
3   -0.33  0.75         0          1          1       0      1.0       -1   
4   -0.33  0.67         1          1          0       0      0.8       -1   

   SITTING QUALITY  
0     0.88       N  
1     0.31       O  
2     0.50       N  
3     0.38       N  
4     0.50       O  


You think that the first nine columns of this dataset will be sufficient to predict the last column, QUALITY. That is, given these inputs you can build a model that predicts a person's overall health quality: will a future Insurance Co customer's health be normal or altered? A person predicted to have normal health might then be offered a lower premium, whereas a person predicted to have altered health might be offered a higher premium.

**Question 1)** Run the following code to create your testing and training subsets:

In [4]:
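from sklearn.model_selection import train_test_split

# Features: the nine input columns (label-based .loc slices include both endpoints).
X = health_qual.loc[:,'SEASON':'SITTING']
# Target: the diagnosis label, 'N' (normal) or 'O' (altered).
y = health_qual.loc[:,'QUALITY']
# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)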
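
The split above is random, so accuracy figures will vary from run to run. As an optional sketch (not part of the assignment code), the example below shows how the `random_state` and `stratify` parameters of `train_test_split` make a split repeatable and keep the class ratio the same in both subsets. The DataFrame `demo` and its 90/10 class mix are made-up stand-ins, not the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for health_qual (assumption: NOT the real data).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "AGE": rng.uniform(0, 1, 100),
    "SITTING": rng.uniform(0, 1, 100),
    "QUALITY": ["O"] * 10 + ["N"] * 90,
})
X_demo = demo[["AGE", "SITTING"]]
y_demo = demo["QUALITY"]

# stratify keeps the N/O ratio the same in the training and testing subsets;
# random_state makes the split repeatable across runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=42
)

print(y_tr.value_counts())
print(y_te.value_counts())
```

Stratifying matters most when one class is rare, because a purely random 20% sample could otherwise contain very few examples of that class.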

**Question 2)** Write the code below to model the data using Logistic Regression.  In the cell below it, explain the accuracy results.
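
As a hedged starting point for Question 2, the sketch below fits scikit-learn's `LogisticRegression` and compares its test accuracy with the share of the most common class, which is what a model that always predicts that class would score. It runs on synthetic data so that it is self-contained: `X_demo`, `y_demo`, and the rule used to create the labels are assumptions for illustration only. Substitute `X_train`, `X_test`, `y_train`, and `y_test` from Question 1 in your own answer.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

features = ['SEASON', 'AGE', 'DISEASES', 'ACCIDENTS', 'SURGERIES',
            'FEVERS', 'ALCOHOL', 'SMOKING', 'SITTING']

# Synthetic stand-in data (assumption: NOT the real health data).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.uniform(0, 1, size=(200, 9)), columns=features)
# Hypothetical labelling rule, used only so the example has something to learn.
y_demo = np.where(X_demo['AGE'] + X_demo['SITTING'] > 1.3, 'O', 'N')

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

# max_iter is raised from the default to give the solver room to converge.
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_tr, y_tr)
pred = logreg.predict(X_te)

acc = accuracy_score(y_te, pred)
# Share of the most common class in the test set: the score of a model
# that always predicts that class.
baseline = pd.Series(y_te).value_counts(normalize=True).max()

print(f"accuracy: {acc:.2f}  majority-class baseline: {baseline:.2f}")
```

When explaining the accuracy, it helps to state it relative to this baseline: an accuracy close to the baseline suggests the model has learned little beyond the class frequencies.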

**Question 3)** Write the code below to model the data using Classification Trees. In the cell below it, explain the accuracy results.
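
For Question 3, one possible sketch uses scikit-learn's `DecisionTreeClassifier`. The `max_depth=3` limit is an illustrative choice to keep the tree small and readable, not a requirement of the assignment, and the data is again synthetic (`X_demo`, `y_demo`, and the labelling rule are assumptions). Replace it with the subsets from Question 1.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

features = ['SEASON', 'AGE', 'DISEASES', 'ACCIDENTS', 'SURGERIES',
            'FEVERS', 'ALCOHOL', 'SMOKING', 'SITTING']

# Synthetic stand-in data (assumption: NOT the real health data).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.uniform(0, 1, size=(200, 9)), columns=features)
# Hypothetical labelling rule, used only so the example has something to learn.
y_demo = np.where(X_demo['AGE'] + X_demo['SITTING'] > 1.3, 'O', 'N')

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

# A shallow tree is easier to interpret and less prone to overfitting.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_tr, y_tr)

acc = accuracy_score(y_te, tree.predict(X_te))
print(f"accuracy: {acc:.2f}")

# Print the learned decision rules as indented text.
print(export_text(tree, feature_names=features))
```

The printed rules show which columns the tree actually splits on, which can support your written explanation of the results.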

**Question 4)** Write the code below to model the data using Naive Bayes.  In the cell below it, explain the accuracy results.
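
For Question 4, scikit-learn offers several Naive Bayes variants. The sketch below assumes `GaussianNB`, which treats every feature as continuous; that is a simplification here, since several of the real columns are encoded categories. It also prints a confusion matrix, which shows how the errors divide between the two classes. The data is synthetic (`X_demo`, `y_demo`, and the labelling rule are assumptions).

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

features = ['SEASON', 'AGE', 'DISEASES', 'ACCIDENTS', 'SURGERIES',
            'FEVERS', 'ALCOHOL', 'SMOKING', 'SITTING']

# Synthetic stand-in data (assumption: NOT the real health data).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.uniform(0, 1, size=(200, 9)), columns=features)
# Hypothetical labelling rule, used only so the example has something to learn.
y_demo = np.where(X_demo['AGE'] + X_demo['SITTING'] > 1.3, 'O', 'N')

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

nb = GaussianNB()
nb.fit(X_tr, y_tr)
pred = nb.predict(X_te)

acc = accuracy_score(y_te, pred)
# Rows are the true classes, columns the predicted classes, in the order N, O.
cm = confusion_matrix(y_te, pred, labels=['N', 'O'])

print(f"accuracy: {acc:.2f}")
print(cm)
```

Accuracy alone can hide poor performance on the rarer class, so the off-diagonal counts are worth mentioning when you explain the results.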

**Question 5)** Use the BEST model to make a prediction for the following customer. In the cell below your analysis, explain what the model's results indicate about this customer's predicted overall health quality. What would you recommend regarding his/her insurance premium? Explain.

In [9]:
data = [1,0.75,1,1,1,0,1,1,0.25]

import numpy as np
customer = np.array(data).reshape((1,-1))
print(customer)

[[ 1.    0.75  1.    1.    1.    0.    1.    1.    0.25]]
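
One way to approach the prediction, sketched under assumptions: compare the three model types with 5-fold cross-validation, refit the one with the highest mean accuracy, and ask it for both a label and class probabilities. Using cross-validation to define "best" is a choice made for this sketch; the assignment may intend the test-set accuracies from Questions 2 to 4 instead. The training data is synthetic (`X_demo`, `y_demo`, and the labelling rule are assumptions), so the printed result says nothing about the real customer.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

features = ['SEASON', 'AGE', 'DISEASES', 'ACCIDENTS', 'SURGERIES',
            'FEVERS', 'ALCOHOL', 'SMOKING', 'SITTING']

# Synthetic stand-in data (assumption: NOT the real health data).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.uniform(0, 1, size=(200, 9)), columns=features)
# Hypothetical labelling rule, used only so the example has something to learn.
y_demo = np.where(X_demo['AGE'] + X_demo['SITTING'] > 1.3, 'O', 'N')

candidates = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'classification tree': DecisionTreeClassifier(max_depth=3, random_state=0),
    'naive Bayes': GaussianNB(),
}

# Mean 5-fold cross-validated accuracy for each candidate model.
cv_means = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in candidates.items()
}
best_name = max(cv_means, key=cv_means.get)
best_model = candidates[best_name].fit(X_demo, y_demo)

# Wrapping the customer in a DataFrame with the training column names
# avoids the feature-name mismatch warning a bare NumPy array can trigger.
customer_df = pd.DataFrame([[1, 0.75, 1, 1, 1, 0, 1, 1, 0.25]], columns=features)

label = best_model.predict(customer_df)[0]
proba = best_model.predict_proba(customer_df)[0]

print(best_name, cv_means[best_name].round(3))
print(label, dict(zip(best_model.classes_, proba.round(3))))
```

The class probabilities are useful for the premium recommendation: a prediction of N at a probability near 0.5 is much weaker evidence than one near 1.0.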


Nice. After you have completed your prediction, you can make a recommendation to your company regarding the new customer’s overall health quality. Understanding a customer’s overall health quality gives your organization insight into whether to offer that customer a higher or lower premium.