# Homework 1

To answer the following questions, consider using the sklearn API documentation and the notebooks in
the course webpage as guidance. Show in your PDF report both the code and the corresponding results.
Consider the column_diagnosis.arff data available at the homework tab, comprising 6 biomechanical
features to classify 310 orthopaedic patients into 3 classes (normal, disk hernia, spondylolisthesis).

In [None]:
import pandas as pd
from scipy.io.arff import loadarff

data = loadarff('data/column_diagnosis.arff')
df = pd.DataFrame(data[0])
df['class'] = df['class'].str.decode('utf-8')


df.head()

1) Apply f_classif from sklearn to assess the discriminative power of the input variables.
Identify the input variable with the highest and lowest discriminative power.
Plot the class-conditional probability density functions of these two input variables.

In [None]:
from sklearn.feature_selection import f_classif

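# Separate the input features from the target label
X, y = df.drop('class', axis=1), df['class']

# f_classif returns (F-statistics, p-values); keep the F-statistics, sorted in ascending order
scores = sorted(zip(X.columns, f_classif(X, y)[0]), key=lambda pair: pair[1])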

print(scores)
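
For intuition about what these scores measure: `f_classif` reports a one-way ANOVA F-statistic per feature, that is, the ratio of between-class variance to within-class variance. The following is a minimal NumPy sketch for a single feature. The helper name `anova_f` and the toy values are illustrative assumptions and are not taken from the homework data.

```python
import numpy as np


def anova_f(values, labels):
    """One-way ANOVA F-statistic of a single feature against class labels."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    grand_mean = values.mean()

    ss_between = 0.0  # spread of the class means around the grand mean
    ss_within = 0.0   # spread of the samples around their own class mean
    for c in classes:
        group = values[labels == c]
        ss_between += len(group) * (group.mean() - grand_mean) ** 2
        ss_within += ((group - group.mean()) ** 2).sum()

    df_between = len(classes) - 1
    df_within = len(values) - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)


# Toy feature (hypothetical): three well-separated classes give a large F
toy_values = [1, 2, 3, 2, 3, 4, 5, 6, 7]
toy_labels = list("aaabbbccc")
toy_f = anova_f(toy_values, toy_labels)
print(toy_f)
```

A larger F means the class means are far apart relative to the scatter inside each class, which is why the feature with the highest score is the most discriminative one.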

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plot histograms or kernel density estimates for the two variables identified above

plot = sns.FacetGrid(df, hue="class")
plot.map(sns.kdeplot, "pelvic_radius").add_legend()

plot = sns.FacetGrid(df, hue="class")
plot.map(sns.kdeplot, "degree_spondylolisthesis").add_legend()

plt.show()

2) Using a stratified 70-30 training-testing split with a fixed seed (random_state=0), assess in a
single plot both the training and testing accuracies of a decision tree with depth limits in
{1,2,3,4,5,6,8,10} and the remaining parameters as default.

[optional] Note that split thresholding of numeric variables in decision trees is non-deterministic
in sklearn, hence you may opt to average the results using 10 runs per parameterization.
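
As a side check of what the stratified 70-30 split requested above does, the sketch below splits a set of hypothetical, imbalanced labels and compares the class fractions in both parts. The toy labels and the helper `class_fractions` are assumptions made for illustration; only `train_test_split` and its `train_size`, `random_state` and `stratify` parameters come from sklearn.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60% / 30% / 10% over 100 samples
toy_y = np.array([0] * 60 + [1] * 30 + [2] * 10)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, train_size=0.7, random_state=0, stratify=toy_y
)


def class_fractions(labels):
    """Fraction of samples per class, keyed by class label."""
    counts = Counter(labels.tolist())
    return {label: counts[label] / len(labels) for label in sorted(counts)}


train_fractions = class_fractions(toy_y_train)
test_fractions = class_fractions(toy_y_test)
print(train_fractions)
print(test_fractions)
```

With `stratify` set, both parts should keep roughly the original class proportions, so the testing accuracy is not distorted by a class being under-represented in one part.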

In [None]:
from sklearn.model_selection import train_test_split
from sklearn import tree, metrics

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0, stratify=y)

depth_limits = [1, 2, 3, 4, 5, 6, 8, 10]
n_runs = 10
training_accuracies = [0.0] * len(depth_limits)
testing_accuracies = [0.0] * len(depth_limits)

# No random_state is fixed on the tree, so accumulate the accuracies over several runs
for _ in range(n_runs):
    for j, k in enumerate(depth_limits):
        predictor = tree.DecisionTreeClassifier(max_depth=k).fit(X_train, y_train)
        training_accuracies[j] += metrics.accuracy_score(y_train, predictor.predict(X_train))
        testing_accuracies[j] += metrics.accuracy_score(y_test, predictor.predict(X_test))

# Turn the accumulated sums into averages
training_accuracies = [total / n_runs for total in training_accuracies]
testing_accuracies = [total / n_runs for total in testing_accuracies]

training_accuracies

In [None]:
testing_accuracies_rows = [(d, a, 'testing') for d, a in zip(depth_limits, testing_accuracies)]
training_accuracies_rows = [(d, a, 'training') for d, a in zip(depth_limits, training_accuracies)]

# Long-format table so that both curves are drawn in a single plot
data = pd.DataFrame(testing_accuracies_rows + training_accuracies_rows, columns=['depth_limit', 'accuracy', 'type'])
sns.pointplot(x="depth_limit", y="accuracy", data=data, hue='type')
plt.legend(bbox_to_anchor=(1, 1), loc=2)
plt.show()


3) Comment on the results, including the generalization capacity across settings.

The model appears to overfit once the depth limit exceeds 4: the training accuracy keeps growing (reaching close to 100% at depth 8) while the testing accuracy tends to diminish. In other words, the model generalizes poorly for depth limits greater than 4.

When the limit is 1, the testing and training accuracies are similar, but that seems to be the case only because both of them are quite low. This is highlighted at limits 2 and 3, where the gap between the two accuracies grows.

In conclusion, to get the best generalization (avoiding both underfitting and overfitting), we would choose a depth limit of 4 for a decision tree trained on this data set.
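
The reasoning above (compare the train-test gap per depth, then pick the shallowest depth with the best testing accuracy) can be sketched as a small helper. The function name `summarize_depths` is hypothetical, and the accuracy values below are placeholders for illustration only, not results obtained from this data set.

```python
def summarize_depths(depth_limits, training_accuracies, testing_accuracies):
    """Return the train-test gap per depth and the smallest depth with the best test accuracy."""
    gaps = {
        depth: train - test
        for depth, train, test in zip(depth_limits, training_accuracies, testing_accuracies)
    }
    best_test = max(testing_accuracies)
    # Prefer the shallowest tree among those tied for the best testing accuracy
    best_depth = min(
        depth for depth, test in zip(depth_limits, testing_accuracies) if test == best_test
    )
    return gaps, best_depth


# Placeholder numbers for illustration only; they are NOT results from this data set
example_depths = [1, 2, 3, 4, 5, 6, 8, 10]
example_train = [0.70, 0.80, 0.85, 0.90, 0.95, 0.97, 1.00, 1.00]
example_test = [0.68, 0.75, 0.78, 0.82, 0.80, 0.79, 0.78, 0.78]

gaps, best_depth = summarize_depths(example_depths, example_train, example_test)
print(best_depth)
```

In the notebook, the same helper could be called with `depth_limits`, `training_accuracies` and `testing_accuracies` from the cells above; a gap that widens while the testing accuracy stalls is the overfitting pattern described in the answer.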

4) To deploy the predictor, a healthcare team opted to learn a single decision tree
(random_state=0) using all available data as training data, and further ensuring that each leaf has
a minimum of 20 individuals in order to avoid overfitting risks.
i. Plot the decision tree.
ii. Characterize a hernia condition by identifying the hernia-conditional associations.

In [None]:
healthcare_predictor = tree.DecisionTreeClassifier(random_state=0, min_samples_leaf=20)
healthcare_predictor.fit(X, y)

figure = plt.figure(figsize=(20, 15))
tree.plot_tree(healthcare_predictor, feature_names=list(X.columns), class_names=list(healthcare_predictor.classes_), impurity=False)
plt.show()

(Check if they want the whole path, including duplicates)

The hernia condition is characterized by a spondylolisthesis degree less than or equal to 16.079, combined with either a sacral slope less than or equal to 28.136, or a sacral slope between 28.136 (exclusive) and 40.149 (inclusive) together with a pelvic radius less than or equal to 117.36.

In summary, the following conditions lead to a hernia diagnosis:

- spondylolisthesis degree $\leq 16.079$ $\land$ sacral slope $\leq 28.136$
- spondylolisthesis degree $\leq 16.079$ $\land$ $28.136 <$ sacral slope $\leq 40.149$ $\land$ pelvic radius $\leq 117.36$
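
Paths like the ones listed above can also be read from a textual dump of the tree instead of the plot. The sketch below uses `tree.export_text` from sklearn; as an assumption made so that it runs on its own, the iris data set stands in for `column_diagnosis.arff`, so the printed thresholds are unrelated to the hernia rules.

```python
from sklearn import tree
from sklearn.datasets import load_iris

# Stand-in data (assumption): iris replaces column_diagnosis.arff so the sketch is self-contained
iris = load_iris()

# Same settings as the deployed predictor: fixed seed and at least 20 samples per leaf
toy_tree = tree.DecisionTreeClassifier(random_state=0, min_samples_leaf=20)
toy_tree.fit(iris.data, iris.target)

# export_text returns one line per split or leaf, indented by depth
rules = tree.export_text(toy_tree, feature_names=list(iris.feature_names))
print(rules)
```

In the notebook itself, the same call would take `healthcare_predictor` and `list(X.columns)`; following every branch that ends in a hernia leaf gives the conjunctions of conditions on that path.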