# AI Lab Assignment 4

# 2. Building a classifier on a real dataset (3.5 points)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score

%matplotlib inline

In [None]:
df = pd.read_csv("pima.csv", header=0, sep=',')
print(df.shape)
df.head(5)

**The goal is to predict whether or not a patient has diabetes from the values of some variables. The target variable is "class".**

* **Pregnancies:** Number of times pregnant
* **Glucose:** Plasma glucose concentration at 2 hours in an oral glucose tolerance test
* **BloodPressure:** Diastolic blood pressure (mm Hg)
* **SkinThickness:** Triceps skin fold thickness (mm)
* **Insulin:** 2-Hour serum insulin (mu U/ml)
* **BMI:** Body mass index (weight in kg/(height in m)^2)
* **DiabetesPedigreeFunction:** Diabetes pedigree function
* **Age:** Age (years)
* **class:** Class variable ("yes" / "no")

In [None]:
feature_names = list(df.columns)
feature_names.remove('class')
print(feature_names)
X = df[feature_names].values
y = df['class'].values

**Basic stats for each attribute:**

In [None]:
df.describe()

**Smoothed histograms (kernel density estimates) of each attribute in each class. Color indicates class ("yes"/"no"):**

In [None]:
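plt.figure(figsize=(15,7))
for i, n in enumerate(feature_names):
    plt.subplot(2, 4, i+1)
    # One KDE curve per class, labeled so the legend identifies each class
    for label, group in df.groupby("class"):
        group[n].plot(kind='kde', label=str(label))
    plt.title('Density of ' + n)
    # Label the y axis only on the first plot of each row
    plt.ylabel('Density' if i % 4 == 0 else '')
    plt.legend()
plt.tight_layout()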

## Training a model and testing its quality using 5-fold cross validation

The following cell trains a model and evaluates it on several different train/test partitions of the data. The result is the mean score and its standard deviation. Choose the type of model (Naïve Bayes / decision tree / k-NN / logistic regression / neural network) and its parameters to obtain the best result.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
# other classifiers (from notebook p4_01)

clf = KNeighborsClassifier(n_neighbors=1) # DecisionTreeClassifier(max_depth=3)
scores = cross_val_score(clf, X, y, cv=5)
print("All scores: ", scores)
print("Global Model Score: {:.2f} +/- {:.2f}".format(scores.mean(), scores.std()))

## Answer the following questions here:

* What is the best score you get with a k-NN and with what k (value of n_neighbors)?
* What is the best score you get with a decision tree and at what maximum depth (value of max_depth)?
* What is the best score you get with a neural network and with what configuration (value of hidden_layer_sizes)?

Note: to answer these questions you just have to change the type of model and its parameters in the previous cell.
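
Instead of editing the cell by hand for every value, the comparison can be run in a loop. The snippet below is a minimal sketch for k-NN only. It uses synthetic stand-in data (`X_demo`, `y_demo`, both hypothetical names) so that it runs on its own. In the notebook you would pass the `X` and `y` built from `pima.csv`, and the list of k values is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): replace with the notebook's X and y.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

results = {}
for k in (1, 3, 5, 11, 21):  # candidate values, chosen arbitrarily
    clf = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(clf, X_demo, y_demo, cv=5)
    results[k] = (scores.mean(), scores.std())
    print("k={:2d}: {:.2f} +/- {:.2f}".format(k, scores.mean(), scores.std()))

# k with the highest mean cross-validation score
best_k = max(results, key=lambda k: results[k][0])
print("Best k:", best_k)
```

The same loop works for `DecisionTreeClassifier(max_depth=...)` or for a neural network by swapping the estimator and the parameter being varied.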

TO DO

## Improve the model: feature processing and parameter search

Sometimes, instead of using more complex models, it is more useful to spend more time processing the data to get better results.

In this section you will investigate a few approaches for preparing the data which are likely to improve the results: feature construction and selection, feature preprocessing (detection of outliers, missing values, centering and scaling).

Give reasons why you decide to try or ignore any of these methods, and how the results change when you apply them (you can create as many cells as you want).
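
As one possible starting point, the sketch below compares a k-NN classifier with and without feature scaling, after filling missing values with the column median. The data is synthetic (`X_demo`, `y_demo` are hypothetical stand-ins), with one feature stretched and a few entries set to NaN on purpose. For the real data, whether zeros in a column such as Glucose encode missing measurements is an assumption you would need to check yourself, for example with `df.describe()`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): replace with the notebook's X and y.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_demo[:, 0] *= 100.0  # put one feature on a much larger scale
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan  # simulate missing values

models = {
    "imputed only": make_pipeline(
        SimpleImputer(strategy="median"),
        KNeighborsClassifier(n_neighbors=5),
    ),
    "imputed + scaled": make_pipeline(
        SimpleImputer(strategy="median"),
        StandardScaler(),
        KNeighborsClassifier(n_neighbors=5),
    ),
}

pipeline_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    pipeline_scores[name] = scores
    print("{}: {:.2f} +/- {:.2f}".format(name, scores.mean(), scores.std()))
```

Putting the imputer and scaler inside the pipeline means their statistics are computed on the training folds only, so no information from the held-out fold leaks into preprocessing.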

In [None]:
# include code here

Could a different configuration of the classifier's hyperparameters solve the problem better?
Most likely.

Now try changing the hyperparameter values and return, as the final classifier, the one that minimizes the estimated generalization error. This requires two things. First, change how the generalization error is estimated: if we pick hyperparameters using the same test folds whose score we then report, we overfit to those folds and the reported score becomes optimistic. Instead, we will estimate the generalization error of each classifier using nested cross-validation.
Second, we will run a grid search over the hyperparameters and return the values that optimize that error estimate.

Adapt the code found at https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html to this problem and to the hyperparameter space of one of the classifiers. 
Remember that the documentation pages for DecisionTreeClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) and KNeighborsClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier) describe each of the hyperparameters. You are free to choose the hyperparameters and values to consider. Before configuring the grid, read about each hyperparameter to make sure your search makes sense.
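
The overall shape of a nested cross-validation with a grid search might look like the sketch below. It is not a reference solution: the data is synthetic (`X_demo`, `y_demo` are hypothetical stand-ins for the notebook's `X` and `y`), and the grid values and fold counts are illustrative assumptions that you should replace with your own reasoned choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): replace with the notebook's X and y.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Illustrative grid (assumption): pick values after reading the documentation.
param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_leaf": [1, 5, 20],
    "criterion": ["gini", "entropy"],
}

# Inner loop selects hyperparameters; outer loop estimates generalization.
inner_cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=inner_cv)

# Each outer fold reruns the whole grid search on its training part only.
nested_scores = cross_val_score(search, X_demo, y_demo, cv=outer_cv)
print("Nested CV score: {:.2f} +/- {:.2f}".format(nested_scores.mean(), nested_scores.std()))

# Final model: rerun the search on all the data to get the hyperparameters.
search.fit(X_demo, y_demo)
print("Selected hyperparameters:", search.best_params_)
```

The nested score estimates how well the whole procedure (grid search plus fitting) generalizes, while `search.best_params_` from the final fit gives the configuration to return.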

In [None]:
# include code about this section here