
##### M0.508 · IAA · CET2 · 2020-21 · 2
##### Master's Degree in Computer Engineering · Faculty of Computer Science, Multimedia and Telecommunications
##### Universitat Oberta de Catalunya

# CET2: FEATURE EXTRACTION AND CLASSIFICATION

## INTRODUCTION

In this evaluation test we will study how to apply feature extraction and classification techniques to cartographic data about forest cover types.

## SKILLS

In this assignment, the following general skills developed in the Master are addressed:
* Ability to project, envision and design products, processes and
facilities in all areas of computer engineering.
* Abilities in mathematical modelling, calculation and simulation in
technology centres and business engineering, particularly in research,
development and innovation tasks in all areas related to computer
engineering.
* Ability to apply the knowledge acquired and solve problems in new or
unfamiliar environments within broader and multidisciplinary contexts,
and being able to integrate this knowledge.
* Skills for continuous, self-directed and autonomous learning.
* Ability to model, design, define architecture, implement, manage,
operate and maintain applications, networks, systems,
services and computer content.

The specific skills of this course that are addressed in this test are:
* Understanding what machine learning is in the context of artificial
intelligence.
* Distinguishing between different types and methods of learning.
* Applying the studied techniques to a real case.

## RESOURCES

This CET requires the following resources:

Provided files:

  * iaa2021_pac2_template_en.ipynb

Complementary: 
  * Course materials, library documentation (_scikit-learn_, _pandas_, _seaborn_, ...).

## SUBMISSION AND ASSESSMENT CRITERIA

The CET must be submitted by **27th April 2021**. 

The final submission must be an edited version of this notebook (.ipynb). Using the Google Colab platform is encouraged (https://colab.research.google.com/). The source code solutions to the exercises must be implemented and run in the corresponding code cells, and the related discussion and justified answers must be added to the corresponding text cells.

All answers must be discussed and justified. **Answers without
discussion will not be evaluated**.

## CET DESCRIPTION

In this assignment, data classification and dimensionality reduction
techniques will be applied to real-world cartographic data about forest cover types.

We will work with the data from a study published in:

> Blackard, Jock A., and Denis J. Dean. "Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables." _Computers and electronics in agriculture_ 24.3 (1999): 131-151.

The data set includes samples of seven forest cover types extracted from cover type maps created by the US Forest Service that were derived from aerial photographs. The independent cartographic variables were obtained from the US Geological Survey and the US Forest Service. The variables include, among others, elevation, horizontal distance to nearest surface water and relative measures of incident sunlight. Overall, the data set includes 581,012 instances and 54 variables.

A full description of the dataset can be found in:

https://archive.ics.uci.edu/ml/datasets/Covertype

The goal of this assignment is to use and get familiar with different
dimensionality reduction techniques and to carry out a comparative study of
different data classification algorithms and validation techniques.

The solutions to the exercises will be based on the open source library _scikit-learn_ for Python, which includes a wide selection of machine learning
algorithms as well as preprocessing, validation and visualization techniques.
Students are encouraged to refer to scikit-learn documentation available
online:

https://scikit-learn.org/stable/

## EXERCISE 1

First things first. Let's import some necessary packages and load the data. The original data set is available in _scikit-learn_ as part of the _datasets_ module.

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_covtype.html

In this exercise we are going to work with a simplified version of the original dataset. The simplified version includes fewer samples and attributes. Study and run the code provided in the following cell and get familiar with the data set.

In [5]:
import pandas as pd
from sklearn import datasets
import numpy as np

features, labels = datasets.fetch_covtype(return_X_y=True)

n_samples = 5000
n_attributes = 10
df_features = pd.DataFrame(features[:n_samples, :n_attributes])
df_labels = pd.DataFrame({'label': labels[:n_samples]})

n_classes = len(np.unique(df_labels))

print("No. of attributes = " + str(len(df_features.columns)))
print("No. of classes = " + str(n_classes))
print("No. of samples = " + str(len(df_features)))
for cl in np.unique(df_labels):
  print("--No. of samples class " + str(cl) + " = "
        + str(df_labels[df_labels==cl].count()['label']))

No. of attributes = 10
No. of classes = 7
No. of samples = 5000
--No. of samples class 1 = 557
--No. of samples class 2 = 948
--No. of samples class 3 = 643
--No. of samples class 4 = 1249
--No. of samples class 5 = 945
--No. of samples class 6 = 479
--No. of samples class 7 = 179


**1.a) (1 POINT) Explore the dataset features.**

**Use the method _describe_ in _pandas.DataFrame_. What kind of information does it generate?**

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html

**Analyse the standard deviation of all the attributes. Do they span similar ranges? Identify the attributes with maximum and minimum dispersion (standard deviation) and provide their values. Briefly describe (theoretically) what would happen if we used those features directly to perform PCA.**
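As a minimal sketch of the kind of inspection requested here, the following uses a small synthetic table (the column names `small`, `medium` and `large` and their values are hypothetical, not the Covertype data) to show how the `std` row of `describe` can be used to locate the attributes with the largest and smallest dispersion.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_features: three columns on very different scales.
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "small": rng.normal(0, 1, 1000),
    "medium": rng.normal(0, 50, 1000),
    "large": rng.normal(0, 1500, 1000),
})

# describe() returns count, mean, std, min, quartiles and max for each column.
summary = df_demo.describe()
std = summary.loc["std"]  # same values as df_demo.std()

print(summary)
print("max dispersion:", std.idxmax(), std.max())
print("min dispersion:", std.idxmin(), std.min())
```

The same `idxmax` / `idxmin` calls should work unchanged on `df_features.describe().loc["std"]`.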

In [3]:
# Exercise 1.a: add and run your source code
print(df_features.head())
print(df_labels.head())

df_features.info()
df_features.describe()


        0      1     2      3      4       5      6      7      8       9
0  2596.0   51.0   3.0  258.0    0.0   510.0  221.0  232.0  148.0  6279.0
1  2590.0   56.0   2.0  212.0   -6.0   390.0  220.0  235.0  151.0  6225.0
2  2804.0  139.0   9.0  268.0   65.0  3180.0  234.0  238.0  135.0  6121.0
3  2785.0  155.0  18.0  242.0  118.0  3090.0  238.0  238.0  122.0  6211.0
4  2595.0   45.0   2.0  153.0   -1.0   391.0  220.0  234.0  150.0  6172.0
   label
0      5
1      5
2      2
3      2
4      5
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       5000 non-null   float64
 1   1       5000 non-null   float64
 2   2       5000 non-null   float64
 3   3       5000 non-null   float64
 4   4       5000 non-null   float64
 5   5       5000 non-null   float64
 6   6       5000 non-null   float64
 7   7       5000 non-null   float64
 8   8       5000 non-nul

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 2590.2976 | 155.5296 | 17.599 | 183.294 | 46.2902 | 1723.5056 | 212.261 | 215.535 | 130.7616 | 1517.3908 |
| std | 389.357969 | 109.718326 | 9.400014 | 160.188788 | 56.633346 | 1564.713429 | 33.583068 | 25.150835 | 50.682969 | 1324.742471 |
| min | 1863.0 | 0.0 | 0.0 | 0.0 | -134.0 | 30.0 | 0.0 | 99.0 | 0.0 | 30.0 |
| 25% | 2220.0 | 66.0 | 10.0 | 42.0 | 3.0 | 630.0 | 195.0 | 202.0 | 97.0 | 601.0 |
| 50% | 2677.5 | 121.5 | 17.0 | 150.0 | 27.0 | 1167.0 | 221.0 | 221.0 | 135.0 | 1124.0 |
| 75% | 2905.25 | 263.25 | 25.0 | 277.0 | 74.0 | 2207.0 | 237.0 | 233.0 | 167.0 | 2001.0 |
| max | 3442.0 | 360.0 | 52.0 | 997.0 | 554.0 | 6890.0 | 254.0 | 254.0 | 248.0 | 6853.0 |


Exercise 1.a: Double-click (or enter) to add your answer.

**1.b) (1 POINT) PCA should be performed on scaled data whenever the features have different units or ranges. The next exercises will allow us to study PCA and to explore the impact of data scaling on the process.**

**Apply PCA to the data twice:**

* **PCA on the raw features.**
* **PCA on the features standardised with _StandardScaler_:**

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

**As in Exercise 1.a, for each of the two PCA spaces, identify the components with maximum and minimum dispersion (standard deviation) and report their values. Discuss the results.**
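The comparison described above can be sketched as follows. This is only an illustration on hypothetical synthetic data (four independent features whose scales differ by factors of ten), not the Covertype features; the names `std_raw` and `std_scaled` are introduced here for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: 500 samples, 4 features with very different scales.
X = rng.normal(size=(500, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])

# PCA on the raw features.
Z_raw = PCA().fit_transform(X)

# PCA on standardised features (zero mean, unit variance per column).
X_scaled = StandardScaler().fit_transform(X)
Z_scaled = PCA().fit_transform(X_scaled)

# Dispersion of each principal component in both spaces.
std_raw = Z_raw.std(axis=0, ddof=1)
std_scaled = Z_scaled.std(axis=0, ddof=1)

for name, std in [("raw", std_raw), ("scaled", std_scaled)]:
    print(f"{name}: max = PC{std.argmax() + 1} ({std.max():.3f}), "
          f"min = PC{std.argmin() + 1} ({std.min():.3f})")
```

Because PCA orders its components by decreasing variance, the first component carries the largest dispersion and the last one the smallest in both spaces; what changes with scaling is how far apart those two values are.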

In [9]:
# Exercise 1.b: add and run your source code
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# PCA on the raw (unscaled) features
pca = PCA()
pca.fit(df_features)
X_transformed = pca.transform(df_features)

# Display the projected dataset, one colour per class.
# The boolean mask must be a NumPy array: indexing X_transformed
# with a DataFrame mask (df_labels == i) fails.
y = df_labels['label'].to_numpy()
for cl in np.unique(y):
    plt.scatter(X_transformed[y == cl, 0], X_transformed[y == cl, 1], label=str(cl))
plt.legend()

# StandardScaler takes no data in its constructor; fit_transform scales it
df_features_scaled = StandardScaler().fit_transform(df_features)

Exercise 1.b: Double-click (or enter) to add your answer.

**1.c) (1 POINT) Create three scatter plots of the first two components in:**

* **The original raw features (original data set).**
* **The PCA space obtained without data scaling (features from the original data set transformed to the PCA space).**
* **The PCA space obtained with data scaling (features from the original space, scaled and then transformed to the PCA space).**

**Discuss the obtained plots.**
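One possible layout for the three scatter plots is sketched below. The data comes from `make_classification` and is purely a hypothetical stand-in for `df_features` and `df_labels`; one feature is multiplied by 1000 only to mimic attributes on very different scales.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data standing in for the Covertype features and labels.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
X[:, 0] *= 1000.0  # exaggerate the scale of one feature

# First two coordinates in each of the three spaces.
views = {
    "Raw features (0 vs 1)": X[:, :2],
    "PCA, no scaling": PCA(n_components=2).fit_transform(X),
    "PCA, scaled": PCA(n_components=2).fit_transform(
        StandardScaler().fit_transform(X)),
}

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (title, data) in zip(axes, views.items()):
    points = ax.scatter(data[:, 0], data[:, 1], c=y, s=8, cmap="viridis")
    ax.set_title(title)
fig.colorbar(points, ax=axes, label="class")
```

Colouring the points by class (`c=y`) makes it easier to judge how well each space separates the classes.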

In [None]:
# Exercise 1.c: add and run your source code

Exercise 1.c: Double-click (or enter) to add your answer.

**1.d) (1 POINT) For the PCA spaces obtained with and without data scaling, use the _Seaborn_ library and its _boxplot_ function to create 10 box plots each, graphically depicting the attributes (y axis) for each of the data set classes (x axis).**

**Seaborn is a Python data visualization library based on _matplotlib_. It provides a high-level interface for drawing attractive and informative statistical graphics.**

https://seaborn.pydata.org/

https://seaborn.pydata.org/generated/seaborn.boxplot.html

**Hint: Use matplotlib.pyplot.subplots with parameter _sharey=True_ to create a subplot grid of 2x10 axes (2 rows for the 2 PCA spaces and 10 columns for the 10 attributes).**
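The grid suggested in the hint could look roughly like this. The sketch uses hypothetical synthetic data with 4 features and 3 classes (so the grid is 2x4 rather than 2x10) and assumes _seaborn_ is installed; the names `spaces`, `class_names` and `class_order` are introduced only for this example.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 4 features and 3 classes instead of 10 and 7.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, n_classes=3, random_state=0)
X[:, 0] *= 100.0  # give one feature a much larger scale

spaces = {
    "unscaled": PCA().fit_transform(X),
    "scaled": PCA().fit_transform(StandardScaler().fit_transform(X)),
}
n_comp = X.shape[1]
class_names = y.astype(str)            # treat class labels as categories
class_order = sorted(set(class_names))

# One row per PCA space, one column per component, shared y axis as in the hint.
fig, axes = plt.subplots(2, n_comp, sharey=True, figsize=(3 * n_comp, 6))
for row, (name, Z) in enumerate(spaces.items()):
    for j in range(n_comp):
        ax = axes[row, j]
        sns.boxplot(x=class_names, y=Z[:, j], order=class_order, ax=ax)
        ax.set_title(f"{name}: PC{j + 1}")
        ax.set_xlabel("class")
fig.tight_layout()
```

With `sharey=True` every panel uses the same vertical range, so differences in dispersion between components and between the two PCA spaces are directly comparable.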

In [None]:
# Exercise 1.d: add and run your source code

Exercise 1.d: Double-click (or enter) to add your answer.

**1.e) (1 POINT) Plot the cumulative explained variance ratio as a function of the number of components for the two PCA spaces (without and with previous scaling). Compare both plots and discuss the relation to the box plots in the previous exercise. In the case of scaled data, identify how many PCA components are necessary to represent 95% of the variance of the original data. What is the problem when PCA is performed on unscaled data?**
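A minimal sketch of the cumulative explained variance computation, again on hypothetical synthetic data rather than the Covertype features. The variable `n_95` (introduced here) is the smallest number of components whose cumulative ratio reaches 95% in the scaled case.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 10 features, one of them on a much larger scale.
X, _ = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_redundant=3, random_state=0)
X[:, 0] *= 1000.0
X_scaled = StandardScaler().fit_transform(X)

# Cumulative explained variance ratio for both PCA spaces.
cum_raw = np.cumsum(PCA().fit(X).explained_variance_ratio_)
cum_scaled = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)

# Index of the first cumulative value >= 0.95, converted to a component count.
n_95 = int(np.searchsorted(cum_scaled, 0.95) + 1)
print("components needed for 95% (scaled):", n_95)

components = np.arange(1, len(cum_raw) + 1)
fig, ax = plt.subplots()
ax.plot(components, cum_raw, marker="o", label="PCA, no scaling")
ax.plot(components, cum_scaled, marker="o", label="PCA, scaled")
ax.axhline(0.95, color="grey", linestyle="--", label="95%")
ax.set_xlabel("number of components")
ax.set_ylabel("cumulative explained variance ratio")
ax.legend()
```

In this synthetic setting the unscaled curve jumps close to 1 at the first component because the inflated feature dominates the total variance, while the scaled curve grows more gradually.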

In [None]:
# Exercise 1.e: add and run your source code

Exercise 1.e: Double-click (or enter) to add your answer.

**1.f) (1 POINT) Using the scaled features, reconstruct the data set from the first 5 PCA components (_inverse_transform_ method) and calculate the loss of information with respect to the original set. To do so, use the average of the squared differences between each element of the reconstructed set and the original one. What is the relationship between this value and the cumulative variances plotted in the previous exercise?**
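The reconstruction step can be sketched as below, on hypothetical correlated synthetic data. The sketch assumes the comparison is made in the standardised space (reconstructed scaled features against the scaled features); the names `mse` and `discarded_ratio` are introduced here for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical correlated data: 10 features driven by 3 latent factors plus noise.
latent = rng.normal(size=(400, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.5 * rng.normal(size=(400, 10))

X_scaled = StandardScaler().fit_transform(X)

# Keep 5 components, project, and map back to the standardised feature space.
pca = PCA(n_components=5).fit(X_scaled)
X_rec = pca.inverse_transform(pca.transform(X_scaled))

# Average squared difference over every element of the data matrix.
mse = np.mean((X_scaled - X_rec) ** 2)

# Fraction of the variance carried by the discarded components.
discarded_ratio = 1.0 - pca.explained_variance_ratio_.sum()

print(f"MSE = {mse:.6f}, discarded variance ratio = {discarded_ratio:.6f}")
```

For standardised data every column has unit variance, so the total squared norm of the matrix equals the number of elements and the two printed values coincide.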

In [None]:
# Exercise 1.f: add and run your source code

Exercise 1.f: Double-click (or enter) to add your answer.

## EXERCISE 2

In this exercise we are going to explore different classification algorithms, validation techniques and performance metrics. For that purpose, a new version of the data set is used. The number of samples is decreased even further and the labels are reduced to two classes to facilitate the comparison between the different classification algorithms. Study and run the code provided in the following cell.

In [None]:
n_samples2 = 700
n_attributes2 = 54
df_features2 = pd.DataFrame(features[:n_samples2, :n_attributes2])

labels2 = [1 if value==2 else 0 for value in labels]
df_labels2 = pd.DataFrame({'label': labels2[:n_samples2]})

print("No. of attributes = " + str(len(df_features2.columns)))
print("No. of classes = " + str(len(np.unique(labels2))))
print("No. of samples = " + str(len(df_features2)))
print("--No. of samples class 0 = " +
      str(df_labels2[df_labels2==0].count()['label']))
print("--No. of samples class 1 = " +
      str(df_labels2[df_labels2==1].count()['label']))

**2.a) (1.5 POINTS) Train the following classifiers, implemented in _scikit-learn_, using 80% of the available data (training set) and report the training time for each of them (using the _timeit_ Python module):**

- **k Nearest Neighbors: 5 neighbours (first parameter).**

- **SVM with RBF kernel: kernel="rbf", C=25, and the default values for the remaining parameters.**

- **Decision Tree: criterion='entropy', max_depth=5, and the default values for the remaining parameters.**

- **AdaBoost: default parameters.**

- **Gaussian Naive Bayes: default parameters.**

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
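A minimal sketch of training the five classifiers listed above with an 80/20 split and timing each fit. The data is a hypothetical binary problem from `make_classification`, not `df_features2`; the stratified split, the fixed `random_state` and the use of `timeit.default_timer` for a single timed fit are choices made for this example.

```python
from timeit import default_timer as timer

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical binary data standing in for df_features2 / df_labels2.
X, y = make_classification(n_samples=700, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=0)

classifiers = {
    "kNN": KNeighborsClassifier(5),
    "SVM (RBF)": SVC(kernel="rbf", C=25),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", max_depth=5),
    "AdaBoost": AdaBoostClassifier(),
    "Gaussian NB": GaussianNB(),
}

# Time a single call to fit() for each classifier.
train_times = {}
for name, clf in classifiers.items():
    start = timer()
    clf.fit(X_train, y_train)
    train_times[name] = timer() - start
    print(f"{name}: {train_times[name]:.4f} s")
```

A single timed fit is noisy; `timeit.repeat` on a fresh estimator would give more stable numbers at the cost of retraining several times.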

In [None]:
# Exercise 2.a: add and run your source code

Exercise 2.a: Double-click (or enter) to add your answer.

**2.b) (1 POINT) Testing the performance of the classifiers trained in Exercise 2.a.**

**Plot in a single figure/axis all the ROC curves for the trained classifiers by using the remaining samples (20% test set). Discuss the results and compare the performance of the different algorithms. Which one would be your choice for the given task?**

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.plot_roc_curve.html
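The `plot_roc_curve` helper linked above was removed from scikit-learn in version 1.2, so the sketch below draws the curves with `sklearn.metrics.roc_curve` and plain _matplotlib_ instead. It uses hypothetical synthetic data and only three of the classifiers to stay short; `positive_scores` is a helper introduced for this example.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical binary data standing in for the reduced Covertype set.
X, y = make_classification(n_samples=700, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=0)

classifiers = {
    "SVM (RBF)": SVC(kernel="rbf", C=25),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", max_depth=5),
    "Gaussian NB": GaussianNB(),
}


def positive_scores(clf, X):
    """Continuous score for the positive class, whichever method clf offers."""
    if hasattr(clf, "decision_function"):
        return clf.decision_function(X)
    return clf.predict_proba(X)[:, 1]


# All ROC curves on a single axis, evaluated on the 20% test set.
fig, ax = plt.subplots(figsize=(6, 6))
aucs = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores = positive_scores(clf, X_test)
    fpr, tpr, _ = roc_curve(y_test, scores)
    aucs[name] = roc_auc_score(y_test, scores)
    ax.plot(fpr, tpr, label=f"{name} (AUC = {aucs[name]:.2f})")
ax.plot([0, 1], [0, 1], "k--", label="chance")
ax.set_xlabel("false positive rate")
ax.set_ylabel("true positive rate")
ax.legend()
```

On scikit-learn 1.0 or later, `RocCurveDisplay.from_estimator(clf, X_test, y_test, ax=ax)` is an alternative that produces one curve per call on the same axis.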

In [None]:
# Exercise 2.b: add and run your source code

Exercise 2.b: Double-click (or enter) to add your answer.

**2.c) (1.5 POINTS) Train and validate the same classifiers described in Exercise 2.a using cross-validation with k=5 on the whole reduced dataset. Report accuracy, precision, recall, F1-score and ROC-AUC for each classifier and discuss the results.**

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html

**Hint: Use sklearn.model_selection.RepeatedStratifiedKFold (with parameters *n_repeats*=1 and _random_state_=1) as the _cv_ parameter of _cross_validate_ to create identical splits to validate all classifiers.**

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RepeatedStratifiedKFold.html
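The cross-validation setup from the hint can be sketched as follows, on hypothetical synthetic data and with only two classifiers for brevity. The metric names passed to `scoring` are scikit-learn's built-in scorer strings; the `results` dictionary is introduced here to collect the mean of each metric over the 5 folds.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical binary data standing in for df_features2 / df_labels2.
X, y = make_classification(n_samples=700, n_features=20, random_state=0)

classifiers = {
    "kNN": KNeighborsClassifier(5),
    "Gaussian NB": GaussianNB(),
}

# A fixed random_state yields the same 5 stratified folds for every classifier.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=1)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]

results = {}
for name, clf in classifiers.items():
    scores = cross_validate(clf, X, y, cv=cv, scoring=scoring)
    # cross_validate returns one array per metric under the key "test_<metric>".
    results[name] = {m: scores[f"test_{m}"].mean() for m in scoring}

print(pd.DataFrame(results).T.round(3))
```

Reporting the standard deviation across folds alongside the mean would show how stable each classifier is on such a small data set.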

In [None]:
# Exercise 2.c: add and run your source code

Exercise 2.c: Double-click (or enter) to add your answer.