In [1]:
# import required packages
import numpy as np
import pandas as pd

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.feature_selection import r_regression, SelectKBest
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt  # For creating plots

**Question 1**: Load the dataset provided with today's lab material using numpy. Note that this is a pre-formatted version of the dataset from last week's lab, i.e. all features have already been converted to numerical values.
**Solution**: Use the function [`genfromtxt`](https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html) to load the dataset.

In [2]:
hr = np.genfromtxt('hotel_res.csv', delimiter=',')
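
A quick sanity check after loading is to look for `nan` values, since `genfromtxt` turns anything it cannot parse as a number (for example a header row) into `nan`. The sketch below uses a small made-up CSV string rather than `hotel_res.csv`, and its column names are hypothetical. It only illustrates the behaviour and the `skip_header` argument; it does not claim that the lab file has a header.

```python
import io
import numpy as np

# Hypothetical CSV text with a header row, standing in for a file on disk.
csv_text = "col_a,col_b,label\n10,95.5,0\n224,106.68,1\n"

# The header cannot be parsed as floats, so it becomes a row of nan.
raw = np.genfromtxt(io.StringIO(csv_text), delimiter=',')
print(raw.shape, np.isnan(raw).any())

# Skipping the first line leaves only the numeric samples.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1)
print(data.shape, np.isnan(data).any())
```

If `np.isnan(hr).any()` is `False` for the loaded array, no such cleanup is needed.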

**Question 2**: How many samples and features does the dataset have? The dataset is based on [this](https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset) dataset on Kaggle. The last column in the dataset represents the class.

In [3]:
hr.shape

(36275, 18)

The dataset has 36275 samples. Of its 18 columns, 17 are features and the last one is the class.

**Question 3**: How many classes does this dataset have?
**Solution**: The method we used in the last lab session won't work because this dataset was not loaded through scikit-learn. A generic way to do this is to check the number of unique values in the class column.

In [4]:
np.unique(hr[:, -1])

array([0., 1.])
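
Beyond the number of classes, it is often useful to know how many samples each class has. A minimal sketch using the `return_counts` option of `np.unique`, on a small made-up label vector that stands in for `hr[:, -1]`:

```python
import numpy as np

# Hypothetical labels standing in for the class column hr[:, -1].
labels = np.array([0., 1., 0., 0., 1., 0.])

# return_counts=True also returns how often each unique value occurs.
classes, counts = np.unique(labels, return_counts=True)
for c, n in zip(classes, counts):
    print(f'class {c:.0f}: {n} samples ({n / labels.size:.1%})')
```

Knowing the class proportions helps when interpreting accuracy later: always predicting the majority class already scores the majority proportion.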

**Question 4**: Compute the Pearson correlation coefficient between each feature and the class. Using the estimated coefficients, identify the 5 most important features.
**Solution**: Refer to the documentation of [`r_regression`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.r_regression.html).

In [5]:
# compute pearson correlation coefficient between each feature and the class
hotel_pr = r_regression(hr[:, :-1], hr[:, -1])
print(hotel_pr)


[-0.0869203  -0.03307782 -0.06156254 -0.09299601 -0.04937437  0.08618528
 -0.02298637 -0.43853792 -0.17952889  0.01123305 -0.01062905 -0.13600847
  0.10728664  0.0337278   0.06017942 -0.14223419  0.25306981]


In [6]:
# sort the absolute values of the Pearson correlation coefficients to identify the important features
imp_feature_idx = np.argsort(np.abs(hotel_pr))

# argsort sorts in ascending order, and a larger absolute coefficient indicates a more important feature,
# so the 5 most important features are the last 5 indices
imp_feature_idx[-5:]

array([11, 15,  8, 16,  7])
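
The result above lists the indices from least to most important. A small sketch that reverses the order and prints each index next to its coefficient, using the coefficient values copied from the `r_regression` output printed earlier:

```python
import numpy as np

# Coefficients copied from the printed output of r_regression.
hotel_pr = np.array([-0.0869203, -0.03307782, -0.06156254, -0.09299601, -0.04937437,
                     0.08618528, -0.02298637, -0.43853792, -0.17952889, 0.01123305,
                     -0.01062905, -0.13600847, 0.10728664, 0.0337278, 0.06017942,
                     -0.14223419, 0.25306981])

# Reverse the ascending argsort so the most important feature comes first.
top5 = np.argsort(np.abs(hotel_pr))[::-1][:5]
for idx in top5:
    print(f'feature {idx}: r = {hotel_pr[idx]:+.4f}')
```

Note that four of these five features are negatively correlated with the class, which is why the absolute value matters for the ranking.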

**Question 5**: sklearn provides a shortcut to achieve what we did in the previous question. Check out [this](https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection) page and attempt the previous question again.

In [7]:
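# Caution: SelectKBest keeps the k largest scores and r_regression is signed, so strongly negative
# correlations (e.g. feature 7) rank last; k=6 also keeps one more feature than Question 4 asked for
skb = SelectKBest(r_regression, k=6)
hr2 = skb.fit_transform(hr[:, :-1], hr[:, -1])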


In [8]:
# you can use the 'scores_' attribute of skb to view the scores computed using the Pearson correlation coefficient
skb.scores_


array([-0.0869203 , -0.03307782, -0.06156254, -0.09299601, -0.04937437,
        0.08618528, -0.02298637, -0.43853792, -0.17952889,  0.01123305,
       -0.01062905, -0.13600847,  0.10728664,  0.0337278 ,  0.06017942,
       -0.14223419,  0.25306981])

In [9]:
hr2.shape

(36275, 6)

**Question 6**: Split the dataset obtained after feature selection into training and testing sets.

In [10]:
x_train, x_test, y_train, y_test = train_test_split(hr2, hr[:, -1], test_size=0.2)
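
The split above is random, so the accuracies reported in this notebook change from run to run. A minimal sketch of two optional arguments of `train_test_split` that make the split reproducible (`random_state`) and keep the class proportions equal in both parts (`stratify`). The data here is synthetic and the seed value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 4 features, 70% class 0 and 30% class 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0.] * 70 + [1.] * 30)

# random_state fixes the shuffle; stratify=y preserves the 70/30 class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(X_tr.shape, X_te.shape)
print('fraction of class 1 in train/test:', y_tr.mean(), y_te.mean())
```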

**Question 7**: Normalize the training data to the range $[-1, 1]$ using scikit-learn.
**Solution**: Refer to the documentation for [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html).

In [11]:
mms = MinMaxScaler(feature_range=(-1, 1))
x_train_norm = mms.fit_transform(x_train)

In [12]:
x_train_norm

array([[-1.        , -0.09090909, -1.        , -1.        , -1.        ,
        -1.        ],
       [ 1.        ,  0.45454545, -1.        , -1.        , -1.        ,
        -0.2       ],
       [-1.        ,  0.45454545, -1.        , -1.        , -1.        ,
        -1.        ],
       ...,
       [-1.        ,  0.09090909, -1.        , -1.        , -1.        ,
        -1.        ],
       [-1.        , -0.09090909, -1.        , -1.        , -1.        ,
        -1.        ],
       [-1.        ,  0.45454545, -1.        , -1.        , -1.        ,
        -0.6       ]])
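
`MinMaxScaler` learns the minimum and maximum of each feature from the data passed to `fit` and reuses them in `transform`. The toy example below (made-up numbers, one feature) shows that the training data lands exactly in $[-1, 1]$, while test values outside the training range are mapped outside that interval.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data: the training range is [0, 10].
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[-2.0], [5.0], [12.0]])

scaler = MinMaxScaler(feature_range=(-1, 1))
train_scaled = scaler.fit_transform(train)  # learns min=0, max=10
test_scaled = scaler.transform(test)        # reuses the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Here the test values -2 and 12 map to about -1.4 and 1.4. This is expected: the scaler must not be refitted on the test data.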

**Question 8**: Train a Support Vector Machine (SVM) classifier using the holdout method with the normalized data. Report your model's performance.


In [13]:
clf = SVC()
clf.fit(x_train_norm, y_train) # train an SVM using the training data.
print('Training Accuracy: ', clf.score(x_train_norm, y_train)) # Accuracy on the training data

x_test_norm = mms.transform(x_test) # normalize the testing data before testing. Note that we reuse the scaler fitted on the training data
print('Testing Accuracy: ', clf.score(x_test_norm, y_test)) # Accuracy on the testing data

Training Accuracy:  0.6961750516884907
Testing Accuracy:  0.6975878704341834
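
Accuracy alone does not show which class the model gets wrong. The sketch below, on synthetic data from `make_classification` rather than the hotel dataset, bundles the scaler and the classifier with `make_pipeline` so the test data is scaled automatically, and then prints a confusion matrix. Using a pipeline is one possible arrangement here, not something the lab requires.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the hotel dataset.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# The pipeline fits the scaler on the training data and reuses it at test time.
model = make_pipeline(MinMaxScaler(feature_range=(-1, 1)), SVC())
model.fit(X_tr, y_tr)

test_acc = model.score(X_te, y_te)
print('Testing Accuracy: ', test_acc)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, model.predict(X_te))
print(cm)
```

The same matrix can be drawn with `ConfusionMatrixDisplay.from_predictions(y_te, model.predict(X_te))` followed by `plt.show()`.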


**Question 9**: Train a Support Vector Machine (SVM) classifier using the holdout method with the **original data** i.e. not the normalized data. Do you see a difference in your model's performance?

In [14]:
clf = SVC()
clf.fit(x_train, y_train) # train an SVM using the training data.
print('Training Accuracy: ', clf.score(x_train, y_train)) # Accuracy on the training data
print('Testing Accuracy: ', clf.score(x_test, y_test)) # Accuracy on the testing data

Training Accuracy:  0.6705031013094418
Testing Accuracy:  0.6803583735354928

Yes. On this split, the SVM trained on the normalized data reaches a testing accuracy of about 0.698, compared with about 0.680 on the original data. `SVC` uses an RBF kernel by default, which depends on distances between samples, so features with large numeric ranges dominate unless the data is rescaled.


#### Selecting an appropriate number of features
**Question 10**: Determine the appropriate number of features to select using the Pearson correlation coefficient.

In [16]:
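n_features = hr.shape[1] # column count includes the class column, so range(1, n_features) tries k = 1..17

prev_train_acc = 0
prev_test_acc = 0
use_n_samples = hr.shape[0] # use all samples (reduce this for a quicker run)
for f in range(1, n_features):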
    # feature selection
    skb = SelectKBest(r_regression, k=f)
    hr3 = skb.fit_transform(hr[:use_n_samples, :-1], hr[:use_n_samples, -1]) # dataset with selected features

    # split the dataset into training and testing
    x_train, x_test, y_train, y_test = train_test_split(hr3, hr[:use_n_samples, -1], test_size=0.2)

    # normalize the training data
    mms = MinMaxScaler(feature_range=(-1, 1))
    x_train_norm = mms.fit_transform(x_train)

    # train a classifier
    clf = SVC()
    clf.fit(x_train_norm, y_train) # train an SVM using the training data.
    train_acc = clf.score(x_train_norm, y_train) # estimate training accuracy

    # testing
    x_test_norm = mms.transform(x_test) # normalize the testing data before testing. Note that we reuse the scaler fitted on the training data
    test_acc = clf.score(x_test_norm, y_test) # estimate testing accuracy

    print(f'Using {f} features:')
    print('Training Accuracy: ', train_acc) # Accuracy on the training data
    print('Testing Accuracy: ', test_acc) # Accuracy on the testing data

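    # keep adding features while either accuracy improves; otherwise stop
    # (each iteration draws a new random split, so small differences in accuracy are noisy)
    if (train_acc > prev_train_acc) or (test_acc > prev_test_acc):
        prev_train_acc = train_acc
        prev_test_acc = test_acc
    else:
        print('I came here')
        break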

Using 1 features:
Training Accuracy:  0.6719159200551343
Testing Accuracy:  0.6741557546519642
Using 2 features:
Training Accuracy:  0.6712267401791868
Testing Accuracy:  0.6769124741557546
Using 3 features:
Training Accuracy:  0.6728118538938663
Testing Accuracy:  0.6705720192970366
Using 4 features:
Training Accuracy:  0.6722949689869056
Testing Accuracy:  0.6726395589248794
Using 5 features:
Training Accuracy:  0.6713990351481737
Testing Accuracy:  0.6767746381805652
Using 6 features:
Training Accuracy:  0.697381116471399
Testing Accuracy:  0.6927636113025499
Using 7 features:
Training Accuracy:  0.6891454169538249
Testing Accuracy:  0.6876636802205376
I came here


Question: Generate a random set of points to see how the Pearson correlation changes.

Question: An example of using matplotlib to plot multiple times on the same figure.