<a href="https://colab.research.google.com/github/abelowska/dataPy/blob/main/Classes_06_KNN_DT_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification: KNN and Decision Tree

Imports

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_score, recall_score, balanced_accuracy_score
from sklearn.inspection import DecisionBoundaryDisplay
import pandas as pd
import seaborn as sns
sns.set_theme(style="whitegrid", palette="deep")

import io
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn import set_config
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import power_transform

In [None]:
plt.rcParams["figure.figsize"] = (10,7)

In [None]:
# constants
test_size=0.2
random_state=42

In [None]:
def compute_score_classification(y_true, y_pred):
  '''
  Helper function for printing scores.

  Parameters:
  y_true: ndarray of y values from original dataset.
  y_pred: ndarray of y values predicted with given model.

  Return:
  dictionary object that consists of accuracy and classification report.

  '''
  return {
      "Accuracy": f"{accuracy_score(y_true, y_pred):.3f}",
      "Classification Report": classification_report(y_true, y_pred),
  }

## Load dataset

In [None]:
df = pd.read_csv('data_neo-ffi_religion.csv')
df['Orthodoxy'] = np.log(df[['Orthodoxy']].to_numpy())
df.head()

Inspect the dataset

In [None]:
df.describe()

## Exercise 1

Recall the model from the last class:

*Orthodoxy ~ Extraversion + Agreeableness + Openness + Neuroticism + Conscientiousness*

Now we will perform a classification of our data: based on the results of the Big Five, we will predict membership in one of four cognitive approaches to belief (*Orthodoxy, External Critique, Second Naïveté, and Relativism*).

To perform classification, we have to create classes. Each participant (sample) must belong to a known class.

Create a new column called `'Class'` that stores the correct class for a given observation. **Assume that the class of a given observation corresponds to the cognitive style that has the highest value**.
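As a rough illustration of the "highest value wins" rule, the sketch below uses a small toy table rather than the real dataset. Only `Orthodoxy` is a column name confirmed by this notebook; the other three column names and all the numbers are placeholders and will likely need adjusting to match `df.columns`.

```python
import pandas as pd

# Toy data: every column name except 'Orthodoxy' is an assumed placeholder.
style_columns = ["Orthodoxy", "External Critique", "Second Naivete", "Relativism"]
toy = pd.DataFrame(
    {
        "Orthodoxy": [5, 1, 2],
        "External Critique": [2, 5, 1],
        "Second Naivete": [3, 2, 5],
        "Relativism": [1, 3, 4],
    }
)

# idxmax(axis=1) returns, for each row, the label of the column holding the largest value.
toy["Class"] = toy[style_columns].idxmax(axis=1)
print(toy["Class"].tolist())
```

If two styles tie, `idxmax` keeps the first one in column order. Since the loading cell replaces `Orthodoxy` with its logarithm, it may also be worth checking that the four columns are on comparable scales before taking the maximum.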

In [None]:
# Your code

### K-Nearest Neighbours

We have data in the correct format, so we can begin to create a model. Let's start with the KNN classifier. Look into the documentation of [`KNeighborsClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) and write the code, following the same pattern as in the regression analysis. Do not forget to scale your data (e.g., using [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)).
To check the classification results, use the predefined `compute_score_classification()` function and print each metric separately. How do you interpret the results of the model?
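The split, scale, fit, predict pattern could look roughly like the sketch below. It runs on synthetic data from `make_classification` (five features and four classes, standing in for the Big Five scores and the `Class` column), and `n_neighbors=5` and `stratify=y` are illustrative choices, not values prescribed by this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 5 features, 4 classes.
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)

# Hold out 20% of the samples; stratify keeps class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training part only, then apply it to both parts.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(y_test, y_pred))
```

With the real data, `X` would be the five NEO-FFI columns and `y` the `Class` column, and the two `print` calls could be replaced by the entries of `compute_score_classification(y_test, y_pred)`.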

In [None]:
# Your code

### Decision Trees

Now, use a Decision Tree classifier to predict the PCBS (Post-Critical Belief Scale) classes. See the documentation of [`DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) and create a DT model. Do not forget to scale your data (e.g., using [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)). Compare the results of DT and KNN. Which classifier seems to work better?


Save the structure of your DT to a `.dot` file and visualize it using the [WebGraphviz](http://www.webgraphviz.com) tool. To do so, copy the content of the `.dot` file (saved to the *Files* directory in Colab) into the input area on the WebGraphviz page.
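One possible way to fit a tree and write its structure to a `.dot` file is sketched below, using `export_graphviz` from `sklearn.tree`. The data are synthetic, and the file name `tree.dot`, the `max_depth=3` limit (chosen only to keep the plot readable), and the class labels are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Synthetic stand-in for the real data: 5 features, 4 classes.
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# A shallow tree is easier to read in the visualization.
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train_scaled, y_train)

feature_names = ["Extraversion", "Agreeableness", "Openness",
                 "Neuroticism", "Conscientiousness"]
class_names = ["class 0", "class 1", "class 2", "class 3"]  # placeholder labels

# With out_file=None, export_graphviz returns the DOT source as a string.
dot_source = export_graphviz(
    tree, out_file=None, feature_names=feature_names,
    class_names=class_names, filled=True,
)

# Assumed file name; in Colab the file appears in the Files panel.
with open("tree.dot", "w") as handle:
    handle.write(dot_source)
```

Because the tree was fitted on scaled features, the split thresholds shown in the graph are in standardized units rather than raw questionnaire scores.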

In [None]:
# Your code

## (Exercise 2)

Now, recall the theory behind the Post-Critical Belief Scale ([source](https://theo.kuleuven.be/apps/press/ecsi/files/2019/03/4.-Pollefeyt-Bouwens-PCB-Melb-Vict-for-dummies-EN.pdf)). The four classes of cognitive approaches to belief are defined by two dimensions: Exclusion vs. Inclusion of Transcendence and Literal vs. Symbolic Belief. Defining the class based on the highest value can be suboptimal (it more or less assumes perfect introspection). Think about how these two dimensions could be constructed from the data you have, assuming the theory is (reasonably) correct. Try to define these dimensions and check whether the classification results improve.

HINT: Think of the values of the four PCBS classes as vectors. Which values should be summed up and which subtracted to obtain Literal/Symbolic and Inclusion/Exclusion dimensions?
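One way to read the hint is sketched below on toy numbers. The quadrant assignment used here (Orthodoxy as literal inclusion, External Critique as literal exclusion, Relativism as symbolic exclusion, Second Naïveté as symbolic inclusion) is an assumption about the theory that should be checked against the linked source, and all column names except `Orthodoxy` are placeholders.

```python
import pandas as pd

# Toy scores; column names other than 'Orthodoxy' are assumed placeholders.
toy = pd.DataFrame(
    {
        "Orthodoxy": [5, 1, 2],
        "External Critique": [2, 5, 1],
        "Second Naivete": [3, 2, 5],
        "Relativism": [1, 3, 4],
    }
)

# Assumed mapping of the four styles onto the two dimensions:
#   Orthodoxy         -> literal,  inclusion
#   External Critique -> literal,  exclusion
#   Relativism        -> symbolic, exclusion
#   Second Naivete    -> symbolic, inclusion
# Positive 'Inclusion' means inclusion of transcendence; negative means exclusion.
toy["Inclusion"] = (
    toy["Orthodoxy"] + toy["Second Naivete"]
    - toy["External Critique"] - toy["Relativism"]
)
# Positive 'Symbolic' means symbolic belief; negative means literal belief.
toy["Symbolic"] = (
    toy["Second Naivete"] + toy["Relativism"]
    - toy["Orthodoxy"] - toy["External Critique"]
)

print(toy[["Inclusion", "Symbolic"]])
```

A class could then be derived from the signs of the two scores (the quadrant a participant falls into) and used as the target in place of the highest-value rule.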

In [None]:
# Your code

## Exercise 3

When you created the KNN and DT models, most of the code (actually all of it, except for the line defining the model) was the same. This wastes time and space, and it makes the code harder to read, analyze, and refactor. The [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class was created for exactly such situations. As suggested by the name, a `Pipeline` is a chain of transforms (objects that transform the data in some way) with a final estimator at the end. According to the documentation, intermediate steps of the pipeline must be *transforms*, that is, they must implement the `fit` and `transform` methods (e.g., `StandardScaler`). The final estimator only needs to implement `fit` (e.g., `KNeighborsClassifier`). Once created, the whole pipeline can be treated as a single model. In fact, individual preprocessing steps such as a scaler are already models of a sort, because they learn parameters from the data.

For the sake of simplicity, we'll start with the [`make_pipeline()`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline) function, which conveniently allows you to create a pipeline. Take a look at the example below:

```
from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
model.fit(X,y)
y_pred = model.predict(X)
```

Again, create KNN and DT classifiers, but this time:
1. Define the classification estimators beforehand and put them in a list;
2. Use a for loop to iterate over the estimators and ...
3. ... for each one, make a pipeline that chains the scaler with the estimator using the `make_pipeline()` function.
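The three steps above might be sketched as follows, again on synthetic data. The hyperparameter values and the `scores` dictionary are illustrative choices rather than part of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data: 5 features, 4 classes.
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 1: define the estimators up front.
estimators = [
    KNeighborsClassifier(n_neighbors=5),
    DecisionTreeClassifier(random_state=42),
]

# Steps 2 and 3: loop over the estimators and chain each one with a scaler.
scores = {}
for estimator in estimators:
    model = make_pipeline(StandardScaler(), estimator)
    model.fit(X_train, y_train)      # the scaler is fitted on the training data only
    y_pred = model.predict(X_test)   # the test data is scaled with the training statistics
    scores[type(estimator).__name__] = accuracy_score(y_test, y_pred)

print(scores)
```

Because the scaler lives inside the pipeline, there is no separate `fit_transform`/`transform` bookkeeping, which also removes the risk of accidentally fitting the scaler on the test set.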

In [None]:
# Your code

## Homework:

In this exercise, we will examine how the number of features used and the value of the model's hyperparameter(s) are related to accuracy. Such an analysis will allow us, for example, to answer the question of how many personality traits (from NEO-FFI) we need in order to create a valid model that predicts cognitive belief style. Perhaps we can achieve similar results with much less data? At the same time, we will examine how the effect of the hyperparameter values on accuracy depends on the number of features. To do so, you will randomly select a subset of features (think: why randomly?) from the full set of features and test the model's performance on that subset for, e.g., a range of `n_neighbors` values.

**HINT**: You may want to follow the step list below (for KNN):

1. Define a list of all possible features (all five scales from NEO-FFI);
2. Define a list specifying the number of features you might select (intuitively, it's a list from 1 to 5);
3. Define a list specifying the n_neighbors to be tested (e.g., from 1 to 20);
4. For each number/size from the list with possible number of features:
  
  4.1. Draw a random subset of features (feature names) of this size. To do so, you can use `random.sample()` in the following way: `random.sample(list_of_all_possible_features, number_of_features_to_select)`;
  
  4.2. Create the y set, and X set based on the output of `random.sample()`;
  
  4.3. Perform the train-test split, scale X_train and X_test, perform classification, and estimate the accuracy (e.g., `accuracy_score(y_test, y_pred)`) and the weighted-average precision (e.g., `precision_score(y_test, y_pred, average='weighted')`);

  4.4. Save the results of the classification to a dataframe: the number of features used in the classification, the value of k, the accuracy, and the weighted-average precision;

5. Repeat all steps from (4) n=50 (or more) times and for each k from your list of n_neighbors;
6. Plot the results for accuracy and weighted-average precision using, e.g., `sns.lineplot()` with hue set to the number of features.

Do not forget to add your comments about the effect of the number of features on accuracy and about the relationship between accuracy and the value of k, depending on the number of features. Did you learn anything interesting about the relationship between personality traits and cognitive belief style?
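The step list above could be organized roughly as in the sketch below. It uses synthetic data with the five NEO-FFI scale names as column labels, and `n_repeats` is set to 5 only to keep the example quick (the text suggests 50 or more). The variable names and the `zero_division=0` argument are choices made for this example.

```python
import random

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

random.seed(42)  # makes the random feature subsets reproducible

# Step 1: all possible features (synthetic values under the NEO-FFI scale names).
features = ["Extraversion", "Agreeableness", "Openness",
            "Neuroticism", "Conscientiousness"]
X_all, y = make_classification(
    n_samples=300, n_features=5, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)
data = pd.DataFrame(X_all, columns=features)

n_features_options = [1, 2, 3, 4, 5]  # step 2
k_values = range(1, 21)               # step 3
n_repeats = 5                         # step 5: use 50 or more in the real analysis

rows = []
for _ in range(n_repeats):
    for n_features in n_features_options:
        # Step 4.1: draw a random subset of feature names of the given size.
        subset = random.sample(features, n_features)
        # Steps 4.2 and 4.3: build X from the subset, split, and scale.
        X_train, X_test, y_train, y_test = train_test_split(
            data[subset], y, test_size=0.2, random_state=42
        )
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        for k in k_values:
            knn = KNeighborsClassifier(n_neighbors=k)
            knn.fit(X_train_scaled, y_train)
            y_pred = knn.predict(X_test_scaled)
            # Step 4.4: store one row per (repeat, subset size, k) combination.
            rows.append({
                "n_features": n_features,
                "k": k,
                "accuracy": accuracy_score(y_test, y_pred),
                "precision": precision_score(
                    y_test, y_pred, average="weighted", zero_division=0
                ),
            })

results = pd.DataFrame(rows)
print(results.groupby("n_features")["accuracy"].mean())
```

For step 6, a call such as `sns.lineplot(data=results, x="k", y="accuracy", hue="n_features")` would draw one line per subset size, averaging over the repeats.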

In [None]:
# Your code