<a href="https://colab.research.google.com/github/matteo-orsi/MachineLearning/blob/main/challenges/challenge-01/challenge-01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Challenge 1: The banknote-authentication data set problem

We will carry out a near-realistic analysis of the banknote authentication data set, which can be downloaded from the UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/267/banknote+authentication

## Data set description

The data were extracted from images of genuine and forged banknote-like specimens. The specimens were digitized with an industrial camera normally used for print inspection. The final images are 400 x 400 pixels. Given the object lens and the distance to the investigated object, the resulting gray-scale pictures have a resolution of about 660 dpi. A Wavelet Transform tool was used to extract features from the images.
The data set has five columns, four features and the class label:
1. variance of Wavelet Transformed image (continuous)
2. skewness of Wavelet Transformed image (continuous)
3. kurtosis of Wavelet Transformed image (continuous); spelled `curtosis` in the original description and in the column names used below
4. entropy of image (continuous)
5. class (integer): the target label, 0 or 1

## Task description
This is a binary classification problem. The assignment is divided into three parts:

1. Load and preprocess the data.
2. Explore the data with unsupervised learning techniques.
3. Build several supervised learning models.

### 1. Data pretreatment

Load the data and inspect it. Is some kind of scaling needed, and why? Are the data points sorted in the original data set? Can that cause problems, and how can it be solved?

In [5]:
import pandas as pd
import os

In [6]:
FFILE = './data_banknote_authentication.txt'
if os.path.isfile(FFILE):
    print("File already exists")
    if os.access(FFILE, os.R_OK):
        print("File is readable")
    else:
        print("File is not readable, removing it and downloading again")
        # Remove the unreadable copy before fetching a fresh one
        os.remove(FFILE)
        !wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt"
else:
    print("File is missing, downloading it")
    !wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt"

File already exists
File is readable


In [8]:
columns = ['variance', 'skewness', 'curtosis', 'entropy', 'class']
data = pd.read_csv(FFILE, sep = ',', names = columns)

data.head()

| | variance | skewness | curtosis | entropy | class |
|---|---|---|---|---|---|
| 0 | 3.6216 | 8.6661 | -2.8073 | -0.44699 | 0 |
| 1 | 4.5459 | 8.1674 | -2.4586 | -1.4621 | 0 |
| 2 | 3.866 | -2.6383 | 1.9242 | 0.10645 | 0 |
| 3 | 3.4566 | 9.5228 | -4.0112 | -3.5944 | 0 |
| 4 | 0.32924 | -4.4552 | 4.5718 | -0.9888 | 0 |


In [9]:
data = data.sample(frac = 1, random_state = 0).reset_index(drop = True)
data.head()

| | variance | skewness | curtosis | entropy | class |
|---|---|---|---|---|---|
| 0 | -1.7713 | -10.7665 | 10.2184 | -1.0043 | 1 |
| 1 | 5.1321 | -0.031048 | 0.32616 | 1.1151 | 0 |
| 2 | -2.0149 | 3.6874 | -1.9385 | -3.8918 | 1 |
| 3 | 1.4884 | 3.6274 | 3.308 | 0.48921 | 0 |
| 4 | 5.2868 | 3.257 | -1.3721 | 1.1668 | 0 |


In [14]:
data.shape

(1372, 5)
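
The questions at the start of this section can be checked directly. The following is a minimal sketch, assuming scikit-learn is installed. `sample` is a hypothetical stand-in built from the ten rows printed above so that the block runs on its own; in the notebook one would pass `data` instead. The helper names `labels_are_sorted` and `standardize` are illustrative, not part of any library.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

FEATURES = ['variance', 'skewness', 'curtosis', 'entropy']

# Hypothetical stand-in for `data`: the ten rows printed in the outputs above.
sample = pd.DataFrame(
    [[3.6216, 8.6661, -2.8073, -0.44699, 0],
     [4.5459, 8.1674, -2.4586, -1.4621, 0],
     [3.866, -2.6383, 1.9242, 0.10645, 0],
     [3.4566, 9.5228, -4.0112, -3.5944, 0],
     [0.32924, -4.4552, 4.5718, -0.9888, 0],
     [-1.7713, -10.7665, 10.2184, -1.0043, 1],
     [5.1321, -0.031048, 0.32616, 1.1151, 0],
     [-2.0149, 3.6874, -1.9385, -3.8918, 1],
     [1.4884, 3.6274, 3.308, 0.48921, 0],
     [5.2868, 3.257, -1.3721, 1.1668, 0]],
    columns=FEATURES + ['class'],
)


def labels_are_sorted(df, target='class'):
    """Return True if the target column never changes direction (e.g. all 0s, then all 1s)."""
    labels = df[target]
    return bool(labels.is_monotonic_increasing or labels.is_monotonic_decreasing)


def standardize(df, features=FEATURES):
    """Return a copy with each feature rescaled to zero mean and unit variance."""
    scaled = df.copy()
    scaled[features] = StandardScaler().fit_transform(df[features])
    return scaled


# Clearly different spreads per column suggest scaling before
# distance- or variance-based methods such as PCA, k-means or k-NN.
spread = sample[FEATURES].std()
sorted_labels = labels_are_sorted(sample)
scaled_sample = standardize(sample)

print(spread)
print("labels sorted:", sorted_labels)
```

When a test set is held out later, the scaler should be fitted on the training part only, so that no statistics from the test set leak into the models.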

### 2. Unsupervised Learning

Apply PCA and plot the first two components, colouring the points according to their class. Are the classes linearly separable in this projection? What happens when k-means with two clusters is applied in this space? And when all the coordinates are used? Also try t-SNE for the projection and DBSCAN for the clustering, and comment on the results.
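
One possible way to set up these experiments is sketched below, assuming scikit-learn is available. The arrays `X` and `y` are a synthetic stand-in generated with `make_blobs`, not the banknote data, and the values chosen for `perplexity`, `eps` and `min_samples` are illustrative assumptions that would need tuning on the real features.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four features and the class column (assumption).
X, y = make_blobs(n_samples=200, n_features=4, centers=2, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Project onto the first two principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# k-means with two clusters: once in the 2-D projection, once on all coordinates.
km_pca = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_pca)
km_all = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

# t-SNE projection followed by density-based clustering.
# eps and min_samples are illustrative and must be tuned on the real data.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_scaled)
db_labels = DBSCAN(eps=3.0, min_samples=5).fit_predict(X_tsne)

# The adjusted Rand index compares cluster assignments with the true classes
# and does not depend on how the clusters happen to be numbered.
print("explained variance ratio:", pca.explained_variance_ratio_)
print("ARI k-means (PCA):", adjusted_rand_score(y, km_pca))
print("ARI k-means (all):", adjusted_rand_score(y, km_all))
# DBSCAN marks noise points with the label -1, so it is excluded from the count.
print("DBSCAN clusters found:", len(set(db_labels.tolist()) - {-1}))
```

For the plots, the two columns of `X_pca` (or `X_tsne`) can be passed to a scatter plot with the class column as the colour, for example `plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)` in matplotlib.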

### 3. Supervised Learning

Set aside a subset of 372 elements as the test set. With the remaining data, build the following models: Logistic Regression, Decision Tree (use the ID3 algorithm), Naive Bayes and k-NN.
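
A minimal sketch of the split and the four models follows, assuming scikit-learn. The data here are a synthetic stand-in from `make_classification` with the same shape as the real set (1372 rows, 4 features). scikit-learn does not provide ID3, so the sketch swaps in `DecisionTreeClassifier(criterion='entropy')`, which also splits on information gain but is a CART-style tree rather than ID3 proper.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same shape as the data set (assumption).
X, y = make_classification(n_samples=1372, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

# An integer test_size is an absolute number of samples;
# stratify keeps the class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=372, stratify=y, random_state=0)

models = {
    # Scaling lives inside the pipeline, so it is fitted on training data only.
    'logistic regression': make_pipeline(StandardScaler(), LogisticRegression()),
    # Entropy-based tree used as an approximation of ID3 (see the note above).
    'decision tree': DecisionTreeClassifier(criterion='entropy', random_state=0),
    'naive Bayes': GaussianNB(),
    'k-NN': make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```

With the real data, `X` and `y` would be the four feature columns and the `class` column of `data`.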

Investigate the effect of regularization (where possible) and use cross-validation to set the hyper-parameters where needed.
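
Cross-validated hyper-parameter selection could look like the sketch below, again assuming scikit-learn and a synthetic stand-in for the data. The candidate grids for `C` and `n_neighbors` are arbitrary illustrative choices. In `LogisticRegression`, `C` is the inverse of the regularization strength, so small values mean strong regularization.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the data set (assumption).
X, y = make_classification(n_samples=1372, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=372, stratify=y, random_state=0)

# Regularization strength of logistic regression, chosen by 5-fold cross-validation.
logreg = Pipeline([('scale', StandardScaler()),
                   ('clf', LogisticRegression(max_iter=1000))])
C_GRID = [0.01, 0.1, 1, 10, 100]
logreg_search = GridSearchCV(logreg, {'clf__C': C_GRID}, cv=5, scoring='accuracy')
logreg_search.fit(X_train, y_train)

# Number of neighbours for k-NN, chosen the same way.
knn = Pipeline([('scale', StandardScaler()),
                ('clf', KNeighborsClassifier())])
K_GRID = [1, 3, 5, 7, 9, 11]
knn_search = GridSearchCV(knn, {'clf__n_neighbors': K_GRID}, cv=5, scoring='accuracy')
knn_search.fit(X_train, y_train)

print("logistic regression:", logreg_search.best_params_, logreg_search.best_score_)
print("k-NN:", knn_search.best_params_, knn_search.best_score_)
```

The test set is not touched during the search; it is used only once, after the hyper-parameters have been fixed.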

Compare the performances in terms of accuracy, precision, recall and F1-score on the test set. Comment on these results in light of those obtained from the unsupervised learning analysis. Can you propose a way to improve these results?
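
As a reminder of how the four metrics are defined, the sketch below computes them from the confusion counts in plain Python. The function name `binary_metrics` and the label lists are hypothetical, chosen by hand for illustration; they are not results from the banknote data.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = len(pairs) - tp - fp - fn                               # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}


# Hand-made labels: 3 true positives, 1 false negative,
# 1 false positive and 3 true negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
scores = binary_metrics(y_true, y_pred)
print(scores)
```

In practice, `sklearn.metrics.classification_report(y_test, model.predict(X_test))` reports the same quantities per class for each fitted model.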
