<a href="https://colab.research.google.com/github/abelowska/dataPy/blob/main/Classes_06_KNN_DT_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification: KNN and Decision Tree

Imports

In [67]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_score, recall_score, balanced_accuracy_score
from sklearn.inspection import DecisionBoundaryDisplay
import pandas as pd
import seaborn as sns
sns.set_theme(style="whitegrid", palette="deep")

import io
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn import set_config
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import power_transform

In [3]:
plt.rcParams["figure.figsize"] = (10,7)

In [4]:
# constants
test_size=0.2
random_state=42

In [107]:
def compute_score_classification(y_true, y_pred):
  '''
  Helper function for computing classification scores.

  Parameters:
  y_true: ndarray of y values from the original dataset.
  y_pred: ndarray of y values predicted by the given model.

  Returns:
  dictionary with the accuracy (formatted to three decimals) and the classification report.

  '''
  return {
    "Accuracy": f"{accuracy_score(y_true, y_pred):.3f}",
    "Classification Report": classification_report(y_true, y_pred),
  }

## Load dataset

In [99]:
df = pd.read_csv('data_neo-ffi_religion.csv')
df['Orthodoxy'] = np.log(df['Orthodoxy'])
df.head()

| | Extraversion | Agreeableness | Conscientiousness | Openness | Neuroticism | External Critique | Orthodoxy | Historical Relativism | Relativism | Second Naïveté |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 34.082439 | 46.04369 | 40.788554 | 461.110426 | 43.865868 | 2.838143 | 2.586507 | 3.74499 | 6.09386 | 4.458938 |
| 1 | 45.914894 | 45.968433 | 41.23529 | 401.384274 | 28.027017 | 3.824136 | 2.294873 | 3.392507 | 5.230517 | 3.269949 |
| 2 | 33.008654 | 42.065841 | 42.06917 | 390.19351 | 41.023889 | 2.288471 | 2.367948 | 3.765416 | 4.801786 | 4.683288 |
| 3 | 56.112153 | 45.903571 | 53.080369 | 468.518727 | 20.018578 | 5.824989 | 2.621076 | 2.826005 | 2.592473 | 0.883451 |
| 4 | 31.972346 | 49.009174 | 42.161417 | 508.686847 | 43.026028 | 4.038579 | 2.736421 | 4.689029 | 4.916692 | 4.627536 |


Inspect the dataset

In [100]:
df.describe()

| | Extraversion | Agreeableness | Conscientiousness | Openness | Neuroticism | External Critique | Orthodoxy | Historical Relativism | Relativism | Second Naïveté |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 342.0 | 342.0 | 342.0 | 342.0 | 342.0 | 342.0 | 342.0 | 342.0 | 342.0 | 342.0 |
| mean | 37.811238 | 42.318673 | 41.208357 | 436.675828 | 36.04229 | 3.880765 | 2.581919 | 4.883313 | 5.29955 | 4.223002 |
| std | 8.871624 | 7.222606 | 8.104708 | 63.64256 | 11.014191 | 1.25973 | 1.129017 | 0.778776 | 0.944093 | 1.392345 |
| min | 14.040682 | 21.008399 | 16.904128 | 289.257541 | 12.942666 | 0.876634 | 0.816596 | 1.304512 | 1.905408 | 0.80131 |
| 25% | 32.079634 | 37.959417 | 35.084398 | 388.849925 | 27.908735 | 3.046971 | 1.706643 | 4.430331 | 4.729321 | 3.1693 |
| 50% | 38.954182 | 43.089033 | 41.052187 | 440.203497 | 35.908038 | 3.762668 | 2.411769 | 4.922857 | 5.437686 | 4.434408 |
| 75% | 43.946449 | 47.078293 | 47.089308 | 480.501705 | 44.692161 | 4.684792 | 3.334454 | 5.373428 | 5.981834 | 5.21364 |
| max | 56.88282 | 59.121317 | 59.00043 | 572.957659 | 60.030339 | 6.935906 | 7.131491 | 6.877499 | 7.139185 | 7.151545 |


## Exercise 1

Recall the model from the last class:

*Orthodoxy ~ Extraversion + Agreeableness + Openness + Neuroticism + Conscientiousness*

Now we will classify our data: based on the Big Five scores, we will predict membership in one of four cognitive approaches to belief (*Orthodoxy, External Critique, Second Naïveté, and Relativism*).

To perform classification, we have to create classes. Each participant (sample) must be of a known class.

Create a new column called `'Class'` that stores the correct class for each observation. **Assume that the class of a given observation corresponds to the cognitive style that has the highest value**.
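One possible way to implement the "highest value wins" rule is sketched below on a small made-up table. The three rows are invented for illustration and are not taken from the dataset; the column names follow the four cognitive styles listed above, and `idxmax(axis=1)` is just one option among several.

```python
import pandas as pd

# Toy scores for three made-up participants (NOT taken from the dataset).
toy = pd.DataFrame({
    "Orthodoxy": [2.1, 5.0, 3.3],
    "External Critique": [4.2, 1.0, 3.0],
    "Relativism": [3.9, 2.5, 6.1],
    "Second Naïveté": [1.5, 4.8, 2.2],
})

# Columns that compete for the class label.
style_columns = ["Orthodoxy", "External Critique", "Relativism", "Second Naïveté"]

# idxmax(axis=1) returns, for each row, the name of the column
# that holds the largest value in that row.
toy["Class"] = toy[style_columns].idxmax(axis=1)

print(toy["Class"].tolist())
```

If two styles tie for the maximum, `idxmax` returns the first of the tied columns in `style_columns` order.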

In [2]:
# Your code

### K-Nearest Neighbours

We have data in the correct format, so we can begin to create a model. Let's start with the KNN classifier. Look into the documentation of [`KNeighborsClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) and write the code, following the same pattern as in the regression analysis. Do not forget to scale your data (e.g., using [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)).
To check the classification results, use the predefined `compute_score_classification()` function and print each metric separately. How do you interpret the results of the model?
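The split, scale, fit, predict pattern can be sketched as follows. To stay self-contained, the sketch uses synthetic data from `make_classification` as a stand-in for the Big Five features and the four classes, and it calls the scikit-learn metrics directly instead of the notebook's helper. The value `n_neighbors=5` is an arbitrary choice, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 5 features, 4 classes (assumption, not the real data).
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training split only, then apply it to both splits,
# so that no information from the test set leaks into training.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5)  # arbitrary number of neighbours
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(y_test, y_pred))
```

With four classes, an accuracy of about 0.25 is what random guessing would give on balanced data, which is a useful baseline when interpreting the report.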

In [1]:
# Your code

### Decision Trees

Now, use a Decision Tree classifier to predict the PCBS classes. See the documentation of [`DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) and create a DT model. Do not forget to scale your data (e.g., using [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)). Compare the results of DT and KNN. Which classifier seems to work better?


Save the structure of your DT to a `.dot` file and visualize it using the [WebGraphviz](http://www.webgraphviz.com) tool. Copy the content of the `.dot` file (saved to the *Files* directory in Colab) into the input area on the [WebGraphviz](http://www.webgraphviz.com) page.
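A minimal sketch of fitting a tree and exporting it to a `.dot` file, assuming `sklearn.tree.export_graphviz` is used for the export. The data are synthetic, the file name `tree.dot` and `max_depth=3` are arbitrary choices, and the feature names are only labels borrowed from the Big Five.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Synthetic stand-in for the real features and classes (assumption).
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)
feature_names = ["Extraversion", "Agreeableness", "Openness",
                 "Neuroticism", "Conscientiousness"]

# A shallow tree keeps the exported graph readable.
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X, y)

# With out_file=None the DOT description is returned as a string.
dot_source = export_graphviz(
    tree,
    out_file=None,
    feature_names=feature_names,
    class_names=[str(c) for c in tree.classes_],
    filled=True,
)

# Write the string to a file whose content can be pasted into WebGraphviz.
with open("tree.dot", "w") as f:
    f.write(dot_source)
```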

In [111]:
# Your code

## (Exercise 2)

Now, recall the theory behind the Post-Critical Belief Scale ([source](https://theo.kuleuven.be/apps/press/ecsi/files/2019/03/4.-Pollefeyt-Bouwens-PCB-Melb-Vict-for-dummies-EN.pdf)). Four classes of cognitive approaches to belief are defined by two dimensions: Exclusion vs. Inclusion of Transcendence and Literal vs. Symbolic Belief. Defining the class based on the highest value can be suboptimal (it implicitly assumes perfect introspection). Think about how these two dimensions could be created from the data you have, assuming the theory is (reasonably) correct. Try to define these dimensions and check whether the classification results improve.

HINT: Think of the values of the four PCBS classes as vectors. Which values should be summed up and which subtracted to obtain Literal/Symbolic and Inclusion/Exclusion dimensions?
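One possible reading of the hint is sketched below on made-up scores. It assumes the usual placement of the four styles in the two-dimensional scheme: Orthodoxy as literal and inclusive, External Critique as literal and exclusive, Relativism as symbolic and exclusive, and Second Naïveté as symbolic and inclusive. Check this placement against the source before relying on it. Using zero as the cut-off between quadrants is also an assumption.

```python
import pandas as pd

# Toy scores for three made-up participants (NOT taken from the dataset).
toy = pd.DataFrame({
    "Orthodoxy": [5.0, 1.5, 2.0],
    "External Critique": [2.0, 5.5, 2.5],
    "Relativism": [2.5, 4.0, 3.0],
    "Second Naïveté": [3.0, 1.0, 6.0],
})

# Assumed: inclusive styles add to the dimension, exclusive styles subtract.
toy["Inclusion"] = (toy["Orthodoxy"] + toy["Second Naïveté"]) - (
    toy["External Critique"] + toy["Relativism"]
)

# Assumed: symbolic styles add to the dimension, literal styles subtract.
toy["Symbolic"] = (toy["Second Naïveté"] + toy["Relativism"]) - (
    toy["Orthodoxy"] + toy["External Critique"]
)


def quadrant(row):
    """Map the signs of the two dimensions to one of the four styles."""
    if row["Inclusion"] >= 0:
        return "Second Naïveté" if row["Symbolic"] >= 0 else "Orthodoxy"
    return "Relativism" if row["Symbolic"] >= 0 else "External Critique"


toy["Class"] = toy.apply(quadrant, axis=1)
print(toy[["Inclusion", "Symbolic", "Class"]])
```

On real data it may be worth standardizing the four scales first, so that no single scale dominates the sums.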

In [3]:
# Your code

## Exercise 3

When you created the KNN and DT models, most of the code (actually all of it, except for the line defining the model) was the same. This wastes time and space, and it makes the code harder to read, analyze, and refactor. [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) was created exactly for such situations. As suggested by the name, a `Pipeline` is a pipe of transforms (objects that somehow transform the data) with a final estimator at the end. According to the documentation, intermediate steps of the pipeline must be *transforms*, that is, they must implement `fit` and `transform` methods (e.g., `StandardScaler`). The final estimator only needs to implement `fit` (e.g., `KNeighborsClassifier`). You can think of the whole pipeline as a model; in fact, individual data processing steps, such as the scaler, are already models, because they often learn from the data.
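The idea that a transform "learns from data" can be illustrated with the explicit `Pipeline` constructor. The sketch below uses a tiny made-up dataset, and the step names `"scaler"` and `"knn"` are arbitrary labels chosen for this example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny made-up dataset: two features on very different scales, two classes.
X = np.array([
    [1.0, 100.0], [2.0, 110.0], [3.0, 120.0],
    [10.0, 400.0], [11.0, 410.0], [12.0, 420.0],
])
y = np.array([0, 0, 0, 1, 1, 1])

# Each step is a (name, object) pair; the last step is the final estimator.
model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])

# fit() fits the scaler, transforms X with it, then fits the classifier.
model.fit(X, y)

# The scaler has learned the column means from the training data.
print(model.named_steps["scaler"].mean_)

# predict() applies the already-fitted scaler before calling the classifier.
print(model.predict([[2.5, 115.0], [11.5, 405.0]]))
```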

For the sake of simplicity, we'll start with the [`make_pipeline()`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline) function, which conveniently allows you to create a pipeline. Take a look at the example below:

```
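from sklearn.pipeline import make_pipeline

# X is the feature matrix and y holds the class labels prepared earlier
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
model.fit(X, y)
# here the predictions are made on the training data itself
y_pred = model.predict(X)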
```

Again, create KNN and DT classifiers, but this time:
1. Define the classification estimators beforehand and put them in a list;
2. Use a for loop to ...
3. ... make a pipeline that chains the scaler with each estimator using the `make_pipeline()` function.
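The three steps above can be sketched as follows. The data are again synthetic stand-ins, and the hyperparameters (`n_neighbors=5`, `max_depth=3`) are arbitrary; the `results` dictionary is just a convenient way to collect one score per estimator.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and classes (assumption).
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 1. Define the estimators beforehand and put them in a list.
estimators = [
    KNeighborsClassifier(n_neighbors=5),
    DecisionTreeClassifier(max_depth=3, random_state=42),
]

results = {}
# 2. Loop over the estimators ...
for estimator in estimators:
    # 3. ... and chain a fresh scaler with each estimator.
    model = make_pipeline(StandardScaler(), estimator)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[type(estimator).__name__] = accuracy_score(y_test, y_pred)

for name, accuracy in results.items():
    print(f"{name}: {accuracy:.3f}")
```

Because the scaler lives inside the pipeline, it is fitted on the training split only each time `fit` is called, so the train/test separation is handled without extra code.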

In [112]:
# Your code

## Homework: