# Task 3 - predict diabetes occurrence

- Predict diabetes using a decision tree model from `sklearn.tree`.
- Extract the target variable `y` from the `class` column.
- Set aside a test set (20%), producing `X_train, X_test, y_train, y_test`, using `train_test_split` from `sklearn.model_selection`.
- Fit the model and display the train and test scores.

In [None]:
import pandas as pd

url = "https://gist.githubusercontent.com/SaxMan96/4738e28799226fc76b40a35e5d55b282/raw/02a16853a83ae701db3dc95b638760a61d76d01f/diabetes.csv"
df = pd.read_csv(url)

In [None]:
y = df['class']
X = df.drop(columns=['class']).copy()

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

In [None]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()

model.fit(X_train, y_train)
model.score(X_train, y_train), model.score(X_test, y_test)

---
To make sure your split wasn't lucky or unlucky, perform cross-validation to average the score across multiple folds.

- Use `cross_validate` from `sklearn.model_selection` to get a CV accuracy score (accuracy is also what `DecisionTreeClassifier.score` returns by default)
- Use 5-fold split (`cv=5`)
- Use parameters `scoring` and `return_train_score` to receive required scores.
- Print mean test and train scores.

In [None]:
from sklearn.model_selection import cross_validate
scores_cv = cross_validate(model, X, y, cv=5, scoring='accuracy', return_train_score=True)
scores_cv = pd.DataFrame(scores_cv)
scores_cv.mean().to_frame().T
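
To see what `cross_validate` averages, here is a minimal sketch that repeats the 5-fold procedure by hand. It assumes synthetic stand-in data from `make_classification` (not the diabetes dataset) and a fixed `random_state` so that both approaches build identical trees. The names `X_demo`, `y_demo`, and `manual_test_scores` are illustrative only. For classifiers, an integer `cv` is documented to use stratified k-fold splitting, which the loop mirrors with `StratifiedKFold`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): this is NOT the diabetes dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# A fixed random_state makes the tree deterministic, so both approaches agree.
tree = DecisionTreeClassifier(random_state=0)

# Manual 5-fold loop: fit on four folds, score on the held-out fold.
manual_test_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_model = clone(tree)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_test_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

# The same computation delegated to cross_validate.
cv_result = cross_validate(tree, X_demo, y_demo, cv=5, scoring='accuracy')

# Both means should be equal: one score per fold, then averaged.
print(np.mean(manual_test_scores), cv_result['test_score'].mean())
```

Each fold serves as the test set exactly once, so the mean is less sensitive to a single lucky or unlucky split than one `train_test_split` score.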

---
As you can see, the model is overfitted (the test score is much worse than the train score).

- Try to improve the test score by tweaking the `max_depth` (default `None`) and `min_samples_leaf` (default 1) parameters.
- Use `cross_validate` again to measure the performance.

In [None]:
model = DecisionTreeClassifier(max_depth=2, min_samples_leaf=20)
scores_cv = cross_validate(model, X, y, cv=5, scoring='accuracy', return_train_score=True)
scores_cv = pd.DataFrame(scores_cv)
scores_cv.mean().to_frame().T

---
We could keep experimenting by hand for a long time, but it's better to automate the search.

- Use `GridSearchCV` from `sklearn.model_selection`.
- You will need a parameter grid with defined ranges of hyperparameters. Use `max_depth` and `min_samples_leaf`.
- Use `range(start, stop, step)` to define ranges of values to test.

In [None]:
from sklearn.model_selection import GridSearchCV
# The max_depth and min_samples_leaf set here are overridden by each grid candidate
model = DecisionTreeClassifier(max_depth=5, min_samples_leaf=40)
param_grid = {
    'max_depth': range(2, 10),
    'min_samples_leaf': range(10, 60, 5),
}
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, return_train_score=True)
grid_search.fit(X, y)
results = grid_search.cv_results_
results = pd.DataFrame(results).sort_values('rank_test_score')
results.head(1)[['params', 'mean_test_score', 'mean_train_score']]
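
To get a feel for how much work the grid search above does, the following sketch enumerates the grid with plain Python. It only counts combinations and does not train anything. The names `candidates`, `n_candidates`, and `n_fits` are illustrative, and the fit count assumes `cv=5` as in the cell above.

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    'max_depth': range(2, 10),
    'min_samples_leaf': range(10, 60, 5),
}

# Every combination of the listed values is one candidate model.
candidates = [
    dict(zip(param_grid, values))
    for values in product(*param_grid.values())
]

n_candidates = len(candidates)
n_fits = n_candidates * 5  # one fit per fold with cv=5, not counting the final refit

print(n_candidates, n_fits)  # 80 400
print(candidates[0])   # {'max_depth': 2, 'min_samples_leaf': 10}
print(candidates[-1])  # {'max_depth': 9, 'min_samples_leaf': 55}
```

Because `range` excludes its stop value, `range(2, 10)` tests depths 2 to 9 and `range(10, 60, 5)` tests leaf sizes 10 to 55. Widening either range multiplies the number of fits.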

---
Now you have the scores and hyperparameters of the best model. How close were you when tuning manually?

- Train the best model again on the train dataset and use its predictions on the test set to display a confusion matrix.
- Display the confusion matrix, assigning the predicted and true values properly.

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

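# Note: best_estimator_ was already refit on all of X, y by GridSearchCV.
# Refit on the train split only, so X_test stays unseen by the fitted model.
model = grid_search.best_estimator_
model.fit(X_train, y_train)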
predictions = model.predict(X_test)
cm = confusion_matrix(y_true=y_test, y_pred=predictions, labels=[0, 1])
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Healthy', 'Diabetes'])
disp.plot();

---
Assuming the confusion matrix above represents real results of diabetes classification for real people, think about the following questions:
- How many ill people does this model mark as healthy?
- How many healthy people will have to undergo unnecessary lab tests because of your model?
- How precise is your model in detecting ill people?
- What fraction of the people classified as ill are truly ill?
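
One way to read the answers off a confusion matrix is sketched below. The counts are made up for illustration and are not results from this notebook. The layout assumes the scikit-learn convention with `labels=[0, 1]`: rows are true classes, columns are predicted classes. Mapping the third question to recall and the fourth to precision is one possible reading.

```python
# Hypothetical confusion matrix (assumption): rows = true class, columns = predicted class,
# in the order [Healthy, Diabetes]. These numbers are illustrative, not real results.
cm_demo = [
    [85, 15],  # truly healthy: 85 predicted healthy (TN), 15 predicted ill (FP)
    [20, 34],  # truly ill:     20 predicted healthy (FN), 34 predicted ill (TP)
]

(tn, fp), (fn, tp) = cm_demo

# Ill people marked as healthy: false negatives.
print("ill marked healthy:", fn)

# Healthy people sent for unnecessary tests: false positives.
print("healthy marked ill:", fp)

# Share of truly ill people that the model detects (recall, also called sensitivity).
recall = tp / (tp + fn)

# Share of ill-classified people who are truly ill (precision).
precision = tp / (tp + fp)

print(f"recall: {recall:.2f}, precision: {precision:.2f}")
```

For a screening task like this one, false negatives are usually the costlier error, so recall is often watched more closely than overall accuracy.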