# Topic 1. Machine Learning 


## Supervised and unsupervised methods

In this lab we practice the initial steps of solving a machine learning problem. In particular:

1- Methods for reading and manipulating real-world datasets (pandas library)

2- Methods for problem visualization and analysis  (seaborn library)

3- Supervised learning problems (regression and classification)

4- Unsupervised learning problems (clustering)




We import all the libraries required for the exercises.

In [1]:
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt


from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.preprocessing import binarize
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import axes3d  
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn.metrics import median_absolute_error  # needed for exercise 1.5

##  Reading and manipulating datasets with pandas 

We will use the Parkinsons Telemonitoring Data Set, available from https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/telemonitoring/

This dataset contains 16 biomedical voice measurements from 42 people with early-stage Parkinson's disease.

The main aim is to predict the motor and total UPDRS scores ('motor_UPDRS' and 'total_UPDRS') from the 16 voice measures.

This can be seen as a regression problem.

### Download the dataset and open it using the following pandas commands. Pandas is a widely used Python library: https://pandas.pydata.org/

In [None]:
# The dataset is read
df = pd.read_csv('parkinsons_updrs.data')

# The columns of the dataset are printed. These columns include the features and
# the target variables ('motor_UPDRS' and 'total_UPDRS')
print(df.columns)



In [None]:
# There are 42 subjects. We will use data for the first one
indices_subject_1 = df['subject#']==1

# Records of subject_1
df_subject_1 = df[indices_subject_1]

# Records of the dataframe are transformed to a matrix and we print its shape
data = df_subject_1.values
print('The shape of the matrix is ',data.shape)
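
The row selection used above relies on boolean masks. As a minimal sketch, assuming a made-up toy dataframe (the values below are invented and only the column names mirror the real dataset), the same three steps look like this:

```python
import pandas as pd

# Toy stand-in for the real dataset: the values are invented for illustration
toy = pd.DataFrame({
    "subject#": [1, 1, 2, 2, 2],
    "total_UPDRS": [34.4, 34.9, 20.1, 20.5, 21.0],
})

# Comparing a column with a value yields a boolean Series (the mask)
mask = toy["subject#"] == 1

# Indexing the dataframe with the mask keeps only the rows where it is True
toy_subject_1 = toy[mask]

# .to_numpy() (like .values) turns the selected rows into a NumPy array
toy_matrix = toy_subject_1.to_numpy()
print("The shape of the matrix is", toy_matrix.shape)
```

Here the mask selects the two rows of subject 1, so the resulting matrix has two rows and two columns.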

## Visualizing data with seaborn


### seaborn is a Python library for exploring data visually. A brief overview of its functionality is available at: https://seaborn.pydata.org/examples/index.html

In [None]:
# We visualize the relationships between variables in the Parkinson dataset
sns.pairplot(df)
plt.show()

### We visualize the distribution of the variables, ordered by their standard deviation

In [None]:
# Columns are ordered by increasing standard deviation
order = df.std().sort_values().index
fig = plt.figure(figsize=(12,6))
# Note: seaborn 0.13 and later rename the `scale` argument to `width_method`
chart = sns.boxenplot(data=df, order=order, scale="linear")
# Rotate the tick labels so that the column names do not overlap
plt.xticks(rotation=45)
plt.show()


### We scale variables for improved visualization

In [None]:
df_sca = df - df.min()
df_sca /= df_sca.max()

fig = plt.figure(figsize=(12,6))
# Note: seaborn 0.13 and later rename the `scale` argument to `width_method`
chart = sns.boxenplot(data=df_sca, order=order, scale="linear")
# Rotate the tick labels so that the column names do not overlap
plt.xticks(rotation=45)
plt.show()
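
The scaling step above is min-max scaling: each column is mapped to the interval [0, 1] via (x - min) / (max - min). The following sketch, on a small made-up dataframe (`toy` and its values are assumptions for illustration), checks that the manual computation agrees with scikit-learn's `MinMaxScaler`:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Small invented dataframe with two columns on different scales
toy = pd.DataFrame({"a": [1.0, 2.0, 4.0], "b": [10.0, 30.0, 20.0]})

# Manual min-max scaling, column by column: (x - min) / (max - min)
toy_sca = (toy - toy.min()) / (toy.max() - toy.min())

# The same transformation with scikit-learn
toy_sca_sklearn = MinMaxScaler().fit_transform(toy)

# Both approaches should give the same values
same = np.allclose(toy_sca.to_numpy(), toy_sca_sklearn)
print(toy_sca)
print("Manual and MinMaxScaler results agree:", same)
```

One practical difference: the manual version divides by zero for a constant column, whereas `MinMaxScaler` handles that case internally.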

## Exercise 1

1.1) Using only data from the first subject, create train and test datasets to predict the total_UPDRS score using the following features: ['Jitter(%)', 'Jitter(Abs)', 'Jitter:RAP', 'Jitter:PPQ5', 'Jitter:DDP','Shimmer', 'Shimmer(dB)', 'Shimmer:APQ3', 'Shimmer:APQ5','Shimmer:APQ11', 'Shimmer:DDA', 'NHR', 'HNR', 'RPDE', 'DFA', 'PPE']
      
1.2) Use a linear regressor to predict the response variable in the test data from a model learned in the train data.

1.3) Use a random forest regressor to predict the response variable in the test data from a model learned in the train data.

1.4) Use a Gaussian process regressor to predict the response variable in the test data from a model learned in the train data.

1.5) Use the seaborn and matplotlib libraries to visualize the median_absolute_error of the three regressor algorithms used in exercises 1.2, 1.3, and 1.4. 
Hints: Use https://seaborn.pydata.org/generated/seaborn.barplot.html or https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.bar.html



### Answer to 1.1)

In [None]:
# From the data, we select the response variable that we are going to model
# It is the total_UPDRS score
target = XXX

# The 16 variables that measure the voice will be used as features. 
features = XXX


# We divide the dataset for the first subject into training and test data. Even rows go to the train set
# and odd rows to the test set.

# Train set 
train_features = XXX
train_target = XXX
train_n_samples = XXX

# Test set
test_features = XXX
test_target = XXX
test_n_samples = XXX
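
The even/odd split described in the comments can be illustrated on synthetic data. This is only a sketch of the slicing idiom, not the solution for the Parkinson data; `X`, `y` and the array sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 16))   # 10 synthetic samples, 16 features
y = rng.normal(size=10)         # synthetic target

# A slice with step 2 picks every other row:
# rows 0, 2, 4, ... form the train set and rows 1, 3, 5, ... the test set
X_train, y_train = X[0::2], y[0::2]
X_test, y_test = X[1::2], y[1::2]

n_train, n_test = X_train.shape[0], X_test.shape[0]
print(X_train.shape, X_test.shape)
```

Because the split is deterministic, it is reproducible without a random seed, but it does assume that the row order carries no systematic pattern.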


### Answer to 1.2)

In [None]:
regressor = XXX
regressor.fit(XXX,XXX)
predicted_test_target = XXX

###  Answer to 1.3)

In [None]:
rf_regressor = XXX
rf_regressor.XXX
rf_predicted_test_target = XXX


### Answer to 1.4)

In [None]:
gp_regressor = XXX
gp_regressor.fit(XXX,XXX)
gp_predicted_test_target = XXX
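
All three regressors in exercises 1.2 to 1.4 share the scikit-learn `fit` / `predict` interface. The sketch below shows that pattern on synthetic data; the data, the hyperparameters (`n_estimators=50`, `alpha=1e-2`) and the variable names are assumptions chosen for illustration, not recommended settings for the Parkinson data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor

# Synthetic regression problem: a noisy linear function of three features
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(60, 3))
y = 2.0 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=60)
X_train, y_train = X[0::2], y[0::2]
X_test, y_test = X[1::2], y[1::2]

# Illustrative hyperparameters only
models = {
    "LR": LinearRegression(),
    "RF": RandomForestRegressor(n_estimators=50, random_state=0),
    "GP": GaussianProcessRegressor(alpha=1e-2, random_state=0),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)                # learn from the train set
    predictions[name] = model.predict(X_test)  # predict on unseen rows
    print(name, predictions[name].shape)
```

Keeping the models in a dictionary makes it easy to loop over them again later, for example when computing one error value per regressor.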

### Answer to 1.5)

In [None]:
width = 1.0
n_regressors = 3
error_lr = XXX
error_rf = XXX
error_gp = XXX

error_values = [error_lr,error_rf,error_gp]
ind = np.arange(n_regressors)+1

p1 = plt.bar(XXX, XXX, width)


plt.ylabel('median absolute error')
plt.title('Errors produced by the three regressors')
plt.xticks(ind, ('LR', 'RF', 'GP',))


plt.show()
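
The median absolute error is the median of the absolute differences between true and predicted values, which makes it robust to a few large errors. As a small sketch with invented numbers, the manual computation and `sklearn.metrics.median_absolute_error` can be compared directly:

```python
import numpy as np
from sklearn.metrics import median_absolute_error

# Invented true values and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Absolute errors are [0.5, 0.5, 0.0, 1.0]; their median is 0.5
error_manual = np.median(np.abs(y_true - y_pred))
error_sklearn = median_absolute_error(y_true, y_pred)

print(error_manual, error_sklearn)
```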

## yacht_hydrodynamics dataset

We download the yacht_hydrodynamics dataset from https://archive.ics.uci.edu/ml/datasets/Yacht+Hydrodynamics

The goal of this dataset is to predict the residuary resistance of sailing yachts from a number of features. Essential inputs include the basic hull dimensions and the boat velocity.

This can be approached as a regression problem.

In [None]:

# https://archive.ics.uci.edu/ml/datasets/Yacht+Hydrodynamics
data = np.loadtxt('yacht_hydrodynamics.data')

# The Pandas dataframe is created
df = pd.DataFrame(data,columns=['Long. position', 'Prismatic coef.', 'LD ratio', 'BD ratio', 
                                'LB ratio', 'Froude numb.', 'Resistance' ])


In [None]:
data

In [None]:
# Visualization of the dataset
sns.pairplot(df)
plt.show()

## Exercise 2

2.1) Create train and test datasets from the yacht data

2.2) Use k-means to separate the train data into five clusters

2.3) Use a dimensionality reduction method to visualize the clusters in three dimensions
Hint: Use a different color to visualize each of the five clusters

2.4) Create a pipeline that selects two variables and applies a linear regressor to predict the yacht resistance

2.5) Visualize the original resistance data versus the predictions made by the pipeline

### Answer to 2.1)

In [None]:
yach_features = XXX
yach_target = XXX

# We split the data into two sets, training and test

yach_train_features = XXX
yach_train_target = XXX
yach_train_n_samples = XXX

yach_test_features = XXX
yach_test_target = XXX
yach_test_n_samples = XXX




### Answer to 2.2)

In [None]:
kmeans = XXX
kmeans.fit(XXX)
yach_train_clusters = XXX
print(yach_train_clusters)
#test_clusters = kmeans.predict(yach_test_features)
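
As an illustration of the k-means workflow (fit, read the cluster labels, assign new points), here is a sketch on synthetic two-dimensional blobs. The blob centers, `n_init=10` and `random_state=0` are assumptions made for the example; the yacht data will not be this cleanly separated.

```python
import numpy as np
from sklearn.cluster import KMeans

# Five well-separated synthetic blobs with 20 points each
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10], [5, 20]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
kmeans.fit(X)

# labels_ holds the cluster index assigned to each training point
labels = kmeans.labels_

# predict assigns new points to the nearest learned centroid
new_points = np.array([[0.2, -0.1], [9.8, 10.3]])
new_labels = kmeans.predict(new_points)

print(np.unique(labels), new_labels)
```

Cluster indices are arbitrary: running with a different seed may permute them without changing the grouping itself.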


### Answer to 2.3)

In [None]:
colors = 'brgmk'
n_components = 3
pca = XXX
pca.XXX

dim3_yach_train_data = XXX


fig = plt.figure(figsize=(8, 12))
ax = fig.add_subplot(111, projection='3d')

for i,c in enumerate(colors):
    #print(i,c)

    index_cluster = np.where(yach_train_clusters==XXX)   
    x_vals = dim3_yach_train_data[index_cluster,0]
    y_vals = XXX
    z_vals = XXX
    # Plot the values
    ax.scatter(XXX, y_vals, z_vals, c = c, marker='o')
    #print(index_cluster)

ax.set_xlabel('X-axis')
ax.set_ylabel('Y-axis')
ax.set_zlabel('Z-axis')
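
The sketch below shows one way to obtain three-dimensional coordinates with PCA and to pick out the points of each cluster with a boolean mask. The data and the labels are random and purely illustrative, and the plotting call is left as a comment so the example stays independent of any figure setup.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 50 samples with 6 features, and made-up cluster labels
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
labels = rng.integers(0, 5, size=50)

# Project the data onto its first three principal components
pca = PCA(n_components=3)
X_3d = pca.fit_transform(X)

for cluster_id in range(5):
    mask = labels == cluster_id   # True for the rows in this cluster
    points = X_3d[mask]           # shape: (points in cluster, 3)
    # points[:, 0], points[:, 1], points[:, 2] could be passed to ax.scatter
    print(cluster_id, points.shape)

# Fraction of the total variance captured by each of the three components
print(pca.explained_variance_ratio_)
```

Checking `explained_variance_ratio_` helps judge how much of the structure in the original features survives the projection.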
    

### Answer to 2.4)

In [None]:
feature_selector = XXX
linear_regressor = XXX

yach_pipeline = XXX
yach_pipeline.XXX

yach_pipeline_test_prediction = yach_pipeline.XXX
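
A pipeline chains a feature selector and a regressor so that both are fitted with a single `fit` call. The following sketch uses synthetic data in which only two of six features influence the target; the step names `"select"` and `"regress"`, the data and the sizes are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Synthetic data: the target depends only on features 1 and 4
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + 0.1 * rng.normal(size=400)
X_train, y_train = X[0::2], y[0::2]
X_test, y_test = X[1::2], y[1::2]

pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_regression, k=2)),  # keep 2 features
    ("regress", LinearRegression()),
])
pipeline.fit(X_train, y_train)
prediction = pipeline.predict(X_test)

# Indices of the columns kept by the selector
selected = pipeline.named_steps["select"].get_support(indices=True)
print(selected, prediction.shape)
```

Because the selector is fitted inside the pipeline, the feature scores are computed from the training rows only, which avoids leaking information from the test set.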



### Answer to 2.5)

In [None]:

plt.plot(XXX,XXX,'r.')
plt.xlabel('Response variable')
plt.ylabel('Prediction')
plt.show()


## Exercise 3

Moving on to a real classification problem:

3.1) Fetch a real dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets.php)

3.2) Define and fit a classifier using the data.

3.3) Use cross-validation to estimate the accuracy, recall, and precision of the classifier.

3.4) Use a preprocessing method to transform the data before feeding it to the classifier.

3.5) Create a Pipeline that includes at least one preprocessing method and a classifier.

3.6) Apply the pipeline to the data.

3.7) Use TPOT to automatically generate a pipeline.
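
Steps 3.2 to 3.6 can be sketched as follows. This example uses a synthetic binary problem from `make_classification` instead of a UCI dataset, and the choice of `StandardScaler`, `KNeighborsClassifier` with `n_neighbors=5`, and 5-fold cross-validation are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification problem standing in for a real dataset
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Preprocessing and classifier combined in one estimator
clf_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# cross_validate refits the whole pipeline on each training fold
metrics = ["accuracy", "recall", "precision"]
scores = cross_validate(clf_pipeline, X, y, cv=5, scoring=metrics)

for metric in metrics:
    print(metric, scores["test_" + metric].mean())
```

For a dataset with more than two classes, recall and precision need an averaging strategy, for example the scorer names `"recall_macro"` and `"precision_macro"`.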