# COMP0189: Applied Artificial Intelligence
## Week 2 (Data Preprocessing)

### After this week you will be able to ...
- load various datasets from sklearn
- know the importance of data scaling and preprocessing
- know that sensitivity to feature scaling varies between learning algorithms
- split the dataset into train and test sets
- know what will happen if you apply different preprocessing steps to the train and test sets
- know how to encode categorical features to ordinal representation and how it affects the model performance
- know how to deal with missing data

### Acknowledgements
- https://github.com/UCLAIS/Machine-Learning-Tutorials
- https://www.cs.columbia.edu/~amueller/comsw4995s19/schedule/
- https://scikit-learn.org/stable/
- https://archive.ics.uci.edu/ml/datasets/adult

## Introduction to Scikit-learn

Why do we use sklearn?

1. Example Datasets
    - sklearn.datasets : Provides example datasets

2. Feature Engineering  
    - sklearn.preprocessing : Various functions for data preprocessing (scaling, encoding, etc.)
    - sklearn.feature_selection : Helps select the most informative features in a dataset
    - sklearn.feature_extraction : Vectorised feature extraction
    - sklearn.decomposition : Algorithms regarding Dimensionality Reduction

3. Data split and Parameter Tuning  
    - sklearn.model_selection : Train/test split, cross-validation, and parameter tuning with GridSearchCV

4. Evaluation  
    - sklearn.metrics : accuracy score, ROC curve, F1 score, etc.

5. ML Algorithms
    - sklearn.ensemble : Ensemble, etc.
    - sklearn.linear_model : Linear Regression, Logistic Regression, etc.
    - sklearn.naive_bayes : Gaussian Naive Bayes classification, etc.
    - sklearn.neighbors : Nearest Centroid classification, etc.
    - sklearn.svm : Support Vector Machine
    - sklearn.tree : DecisionTreeClassifier, etc.
    - sklearn.cluster : Clustering (Unsupervised Learning)

6. Utilities  
    - sklearn.pipeline: pipeline of (feature engineering -> ML Algorithms -> Prediction)

7. Train and Predict  
    - fit()
    - predict()

8. and more...
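
As a minimal sketch of the `fit()` / `predict()` pattern listed above, the snippet below trains a k-nearest-neighbours classifier on the built-in wine dataset. The choice of model and of `n_neighbors=5` is only an illustration, and predicting on the training data is done here purely to show the call signatures, not as a proper evaluation.

```python
from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier

# Load a small built-in dataset as (features, labels)
X_demo, y_demo = load_wine(return_X_y=True)

# Every sklearn estimator follows the same two-step pattern
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_demo, y_demo)            # learn from the data
predictions = model.predict(X_demo)  # predict labels (here: for the same data)

print(X_demo.shape, predictions.shape)
```

Swapping `KNeighborsClassifier` for another estimator usually leaves the rest of the code unchanged, which is one of the main reasons to use sklearn.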

In [None]:
!pip install scikit-learn==1.1.3

In [None]:
from sklearn.datasets import load_boston
import matplotlib.pyplot as plt
import pandas as pd
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning)

**1. Boston House Price Dataset**

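Let's first take a look at the Boston House Price dataset. `load_boston` was deprecated in scikit-learn 1.0 and removed in version 1.2, so this notebook relies on the pinned version 1.1.3 installed above. We use the dataset here for educational purposes only.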

In [None]:
boston = load_boston()
print(boston.DESCR)

In [None]:
boston.keys()

In [None]:
boston.feature_names, len(boston.feature_names)

In [None]:
from sklearn.model_selection import train_test_split
X, y = boston.data, boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
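
The call above uses the default split (25% of the samples go to the test set). As a small sketch on made-up data, the snippet below shows two commonly used options: `test_size` to control the proportion and `stratify` to keep the class balance similar in both parts. The arrays `X_toy` and `y_toy` are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features, two balanced classes
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 5 + [1] * 5)

# Hold out 30% of the samples and keep both classes present in each part
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)

print(X_tr.shape, X_te.shape)
```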

In [None]:
fig, axes = plt.subplots(3, 5, figsize=(20, 10))
for i, ax in enumerate(axes.ravel()):
    if i > 12:
        ax.set_visible(False)
        continue
    ax.plot(X[:, i], y, 'o', alpha=.5)
    ax.set_title("{}: {}".format(i, boston.feature_names[i]))
    ax.set_ylabel("PRICE")
plt.show()

See how our data are spread over very different ranges. The feature at index 3 (CHAS) is even binary. Many algorithms, especially distance-based ones, perform poorly when the input features live on such different scales.
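
To see why different ranges can hurt, here is a small sketch with two made-up samples: one binary feature and one feature measured in the hundreds. The numbers are invented for illustration and are not taken from the Boston data.

```python
import numpy as np

# Two hypothetical samples: [binary feature, large-range feature]
a = np.array([0.0, 300.0])
b = np.array([1.0, 320.0])

squared_diff = (a - b) ** 2
distance = np.sqrt(squared_diff.sum())

# Fraction of the squared Euclidean distance contributed by each feature
share = squared_diff / squared_diff.sum()
print(distance, share)
```

Almost all of the distance comes from the second feature, so a distance-based model would effectively ignore the binary one unless the features are rescaled.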

**2. Wine Dataset**

In [None]:
from sklearn.datasets import load_wine

In [None]:
wine = load_wine()
print(wine.DESCR)

In [None]:
wine.keys()

In [None]:
wine_X = wine.data
wine_labels = wine.target
wine_feature_names = wine.feature_names

In [None]:
wine_labels

In [None]:
pd.DataFrame(wine_X, columns=wine_feature_names)

In [None]:
def visualise_wine(X, labels=None, column_indices=(0,1), set_labels=False):
    """
    @param: X        --> Data
    @param: labels   --> Default is set to None, but if you have labels (e.g. from clustering),
                         you can pass them in a list format to colour the points.
    @param: column_indices --> column indices of dataset X to be selected for plotting.
                                 two-element tuple if you want 2D graph,
                                 three-element tuple if you want 3D graph.
    @param: set_labels --> if True, label the axes with the corresponding feature names.
    """
    assert type(column_indices) is tuple
    
    if len(column_indices)==2:  # 2D
        first_col, second_col = column_indices[0], column_indices[1]
        
        if set_labels:
            plt.xlabel(wine_feature_names[first_col])
            plt.ylabel(wine_feature_names[second_col])
            
        plt.scatter(X[:, first_col], X[:, second_col], c=labels)
        
    elif len(column_indices)==3:  # 3D
        first_col, second_col, third_col = column_indices[0], column_indices[1], column_indices[2]
        fig = plt.figure()
        plt.clf()
        ax = fig.add_subplot(projection='3d')

        plt.cla()
        
        if set_labels:
            ax.set_xlabel(wine_feature_names[first_col])
            ax.set_ylabel(wine_feature_names[second_col])
            ax.set_zlabel(wine_feature_names[third_col])

        ax.scatter(X[:, first_col], X[:, second_col], X[:, third_col], c=labels)
        
    else:
        raise RuntimeError("column_indices must contain either 2 (2D plot) or 3 (3D plot) column indices")
    
    plt.tight_layout()
    plt.show()

In [None]:
visualise_wine(wine_X, labels=wine_labels, column_indices=(8, 10), set_labels=True)

In [None]:
# try out different column_indices to get a feeling for how the data is distributed.
visualise_wine(wine_X, labels=wine_labels, column_indices=(8, 10, 12), set_labels=True)

We will look more closely at many scikit-learn functions (fit, predict, PCA, metrics, etc.) in the following practicals as we learn more in lectures.  
For now, it is enough to become familiar with the datasets and the main takeaways we demonstrate.

## Exercise 1: Impact of feature scaling in ML pipeline

Normalization (min-max scaling) rescales each input variable separately to the range [0, 1].  
Standardization rescales each input variable separately by subtracting its mean (centering) and dividing by its standard deviation, so that the result has a mean of zero and a standard deviation of one.
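
The two definitions above can be sketched by hand with NumPy and checked against sklearn's `MinMaxScaler` and `StandardScaler`. The array `X_small` is made up for illustration; note that `StandardScaler` uses the population standard deviation, which matches NumPy's default `np.std`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data: two features on very different scales
X_small = np.array([[1.0, 200.0],
                    [2.0, 400.0],
                    [3.0, 600.0],
                    [4.0, 1000.0]])

# Normalization: (x - min) / (max - min), computed per column
X_norm = (X_small - X_small.min(axis=0)) / (X_small.max(axis=0) - X_small.min(axis=0))

# Standardization: (x - mean) / std, computed per column
X_std = (X_small - X_small.mean(axis=0)) / X_small.std(axis=0)

# Both manual versions agree with the sklearn transformers
print(np.allclose(X_norm, MinMaxScaler().fit_transform(X_small)))
print(np.allclose(X_std, StandardScaler().fit_transform(X_small)))
```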

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
import numpy as np

#### Example usage of sklearn.preprocessing.StandardScaler

In [None]:
# Example
unscaled_data = np.asarray([[100, 0.001],
 [8, 0.05],
 [50, 0.005],
 [88, 0.07],
 [4, 0.1]])
# define standard scaler
scaler = StandardScaler()
# transform data
scaled_data = scaler.fit_transform(unscaled_data)

In [None]:
pd.DataFrame(unscaled_data).hist()

In [None]:
pd.DataFrame(scaled_data).hist()

In [None]:
del scaled_data, unscaled_data, scaler

**Questions**  
- Try using different methods, such as MinMaxScaler and Normalizer. Do you see the difference in the histogram?
- Experiment with the effects of different feature scaling methods on one ML algorithm.
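
One thing worth keeping in mind while experimenting: sklearn's `Normalizer` works per sample (row), whereas `MinMaxScaler` and `StandardScaler` work per feature (column). The sketch below shows the difference on a made-up array; it is a starting point for the questions above rather than a full answer.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

# Toy data: 3 samples, 2 features
X_rows = np.array([[3.0, 4.0],
                   [6.0, 8.0],
                   [1.0, 0.0]])

# Column-wise: each feature is mapped to the range [0, 1]
col_scaled = MinMaxScaler().fit_transform(X_rows)

# Row-wise: each sample is rescaled to unit L2 norm (the default norm)
row_scaled = Normalizer().fit_transform(X_rows)

print(col_scaled)
print(row_scaled)
```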

### Scaled vs. Unscaled Wine Dataset

In [None]:
RANDOM_STATE = 42
# We are using the wine dataset again
features, target = load_wine(return_X_y=True)

# Make a train/test split using 30% test size
X_train, X_test, y_train, y_test = train_test_split(None)

In [None]:
scaler = StandardScaler()
unscaled_X_train = X_train
unscaled_X_test = X_test

# scale data
scaled_X_train = scaler.None
scaled_X_test = scaler.None

In [None]:
unscaled_model = KNeighborsClassifier()
scaled_model = KNeighborsClassifier()

In [None]:
# TASK: fit each model on its train set (unscaled and scaled, respectively)
unscaled_model.None
scaled_model.None

In [None]:
# TASK: predict y_hat with scaled/unscaled test set
unscaled_y_hat = unscaled_model.None
scaled_y_hat = scaled_model.None

In [None]:
# TASK: using accuracy_score(), get the accuracy of both models
unscaled_acc = accuracy_score(y_test, unscaled_y_hat)
scaled_acc = accuracy_score(y_test, scaled_y_hat)
unscaled_acc, scaled_acc

## Exercise 2: Impact of different preprocessing strategy in train and test data

Do you see the difference in accuracy?  
**Question**  
Now, notice that I also scaled the test set.   
Using the same code, see what happens if you don't scale the test data and predict based on the unscaled data.

In [None]:
# Using the same test data for both the unscaled and the scaled model
unscaled_y_hat = unscaled_model.predict(None)
scaled_y_hat = scaled_model.predict(None)

In [None]:
unscaled_acc = accuracy_score(y_test, unscaled_y_hat)
scaled_acc = accuracy_score(y_test, scaled_y_hat)
unscaled_acc, scaled_acc
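
A related pitfall is scaling the train and test sets with two different scalers. As a sketch on made-up numbers (the arrays `toy_train` and `toy_test` are invented), the snippet below compares reusing the scaler fitted on the train set with fitting a fresh scaler on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_train = np.array([[10.0], [20.0], [30.0]])
toy_test = np.array([[20.0], [40.0]])

# Consistent: statistics come from the train set only
train_scaler = StandardScaler().fit(toy_train)
consistent = train_scaler.transform(toy_test)

# Inconsistent: the test set is scaled with its own mean and std
inconsistent = StandardScaler().fit_transform(toy_test)

# The raw value 20 is the train mean, so it should map to 0
print(consistent[0, 0], inconsistent[0, 0])
```

With the separate scaler, the same raw value ends up at a different position than it had during training, so the model sees test points in a shifted coordinate system.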

## sklearn.pipeline.make_pipeline

In [None]:
def demo_make_pipeline(pca_enable=False):
    # Fit to data and predict using pipeline
    if pca_enable:
        unscaled_clf = make_pipeline(PCA(n_components=2), KNeighborsClassifier())
    else:
        unscaled_clf = make_pipeline(KNeighborsClassifier())
    unscaled_clf.fit(X_train, y_train)
    pred_test = unscaled_clf.predict(X_test)

    # Fit to data and predict using pipeline
    if pca_enable:
        std_clf = make_pipeline(StandardScaler(), PCA(n_components=2), KNeighborsClassifier())
    else:
        std_clf = make_pipeline(StandardScaler(), KNeighborsClassifier())
    std_clf.fit(X_train, y_train)
    pred_test_std = std_clf.predict(X_test)

    # Show prediction accuracies in scaled and unscaled data.
    print("\nPrediction accuracy for the normal test dataset")
    print(f"{accuracy_score(y_test, pred_test):.2%}\n")

    print("\nPrediction accuracy for the standardized test dataset")
    print(f"{accuracy_score(y_test, pred_test_std):.2%}\n")

    # Extract PCA from pipeline
    # print(unscaled_clf.named_steps)
    # {'pca': PCA(n_components=2), 'kneighborsclassifier': KNeighborsClassifier()}
    # print(std_clf.named_steps)
    # {'standardscaler': StandardScaler(), 'pca': PCA(n_components=2), 'kneighborsclassifier': KNeighborsClassifier()}

    try:
        pca = unscaled_clf.named_steps["pca"]
        pca_std = std_clf.named_steps["pca"]

        # Show first principal components
        print(f"\nPC 1 without scaling:\n{pca.components_[0]}")
        print(f"\nPC 1 with scaling:\n{pca_std.components_[0]}")

        # Use PCA without and with scale on X_train data for visualization.
        X_train_transformed = pca.transform(X_train)

        scaler = std_clf.named_steps["standardscaler"]
        scaled_X_train = scaler.transform(X_train)
        X_train_std_transformed = pca_std.transform(scaled_X_train)

        # visualize standardized vs. untouched dataset with PCA performed
        fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 7))

        target_classes = range(0, 3)
        colors = ("blue", "red", "green")
        markers = ("^", "s", "o")

        for target_class, color, marker in zip(target_classes, colors, markers):
            ax1.scatter(
                x=X_train_transformed[y_train == target_class, 0],
                y=X_train_transformed[y_train == target_class, 1],
                color=color,
                label=f"class {target_class}",
                alpha=0.5,
                marker=marker,
            )

            ax2.scatter(
                x=X_train_std_transformed[y_train == target_class, 0],
                y=X_train_std_transformed[y_train == target_class, 1],
                color=color,
                label=f"class {target_class}",
                alpha=0.5,
                marker=marker,
            )

        ax1.set_title("Training dataset after PCA")
        ax2.set_title("Standardized training dataset after PCA")

        for ax in (ax1, ax2):
            ax.set_xlabel("1st principal component")
            ax.set_ylabel("2nd principal component")
            ax.legend(loc="upper right")
            ax.grid()

        plt.tight_layout()

        plt.show()
    except KeyError:
        pass

In [None]:
demo_make_pipeline(pca_enable=False)

**Question**  
Try replacing the KNN classifier in the pipeline with other algorithms, such as Gaussian Naïve Bayes and Decision Trees. What do you notice about the accuracy on the test set?
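
For reference, the sketch below shows how `make_pipeline` names its steps and how a pipeline can be passed to `cross_val_score` like any other estimator. Using 5-fold cross-validation here is an assumption made for illustration; the demo function above uses a single train/test split instead.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_wine, y_wine = load_wine(return_X_y=True)

# Step names are generated from the lowercased class names
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
step_names = list(pipe.named_steps)
print(step_names)

# Within each fold, the scaler is fitted on the training part only
scores = cross_val_score(pipe, X_wine, y_wine, cv=5)
print(scores.mean())
```

Because the scaler lives inside the pipeline, it is refitted on each training fold, which avoids leaking test-fold statistics into preprocessing.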

### Sneak peek at a future session (PCA)

In [None]:
demo_make_pipeline(pca_enable=True)

## Now we move on to the next part, which is about categorical features and missing data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import svm
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report
from sklearn.feature_selection import mutual_info_classif

In [None]:
# Open the csv file and skim through it. It does not have column names 
# so we will allocate names to each column 

# Naming the Columns
names = ['age','workclass','fnlwgt','education','education-num',
        'marital-status','occupation','relationship','race','sex',
        'capital-gain','capital-loss','hours-per-week','native-country',
        'y']

# Load dataset
df = pd.read_csv('../data/adult.csv', names=names, na_values='?')
df = df.dropna()

In [None]:
df.head()

In [None]:
len(df)

In [None]:
# TASK 1: Get the unique values in the race column 
df['race'].None

In [None]:
# TASK 2: Get the unique values in the 'y' column 
df['y'].None

In [None]:
# Get the population count by race
counts = df['race'].value_counts()
labels = counts.index

# Plot pie chart
plt.pie(counts, startangle=90)
plt.legend(labels, loc=2,fontsize=8)
plt.title("Race",size=20)

In [None]:
# TASK 3
# The values have a redundant leading space. Remove it. Hint: use the apply() function
df['race'] = df['race'].None
df['y'] = df['y'].None

In [None]:
df['race'].unique(), df['y'].unique(), df['occupation'].unique()

Hmm, it's not just the 'race' and 'y' columns.

In [None]:
# Let's try to apply this to all the string-valued columns
for col_name in df.columns:
    if 'int' not in str(df[col_name].dtype):
        df[col_name] = df[col_name].apply(lambda x: x.strip())

In [None]:
for col_name in df.columns:
    if 'int' not in str(df[col_name].dtype):
        print(df[col_name].unique())
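
The loops above pick the text columns by checking the dtype name. As an alternative sketch on a made-up DataFrame (`toy_df` and its values are invented), pandas can select the non-numeric columns with `select_dtypes` and strip whitespace with the vectorised `.str.strip()`.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "age": [39, 50],
    "race": [" White", " Black"],
    "y": [" <=50K", " >50K"],
})

# Select every column that is not numeric
text_cols = toy_df.select_dtypes(exclude="number").columns

# Strip leading/trailing whitespace in each text column
for col in text_cols:
    toy_df[col] = toy_df[col].str.strip()

print(toy_df["race"].tolist(), toy_df["y"].tolist())
```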

All done!  
Now let's specifically look into the 'race' and 'y' columns

In [None]:
df[['race', 'y']].head(10)

In [None]:
# TASK 4: Convert categorical and target variables to binary numerical values
# Here we convert them into binary values; later in this notebook
# we will show how to encode them into different labels using LabelEncoder and OneHotEncoder
# Also, if you see a SettingWithCopyWarning, ignore it for now.

# Converting White into 1 else 0
# df_numerical['num_race'] = [1 if r=='White' else 0 for r in df_numerical['race']]
df["race"] = None

# Define target variable 
# Converting >50K into 1 and others into 0
df["y"] = None

df[['race', 'y']].head(10)

Now, let's map the occupation into different numerical values

In [None]:
df['occupation'].unique()

In [None]:
# Map each occupation category to an ordinal numerical value
occ_mapping = {
    'Priv-house-serv':0,'?':-1, 'Other-service':0,'Handlers-cleaners':0,
    'Farming-fishing':1,'Machine-op-inspct':1,'Adm-clerical':1,
    'Transport-moving':2,'Craft-repair':2,'Sales':2,
    'Armed-Forces':3,'Tech-support':3,'Protective-serv':3,
    'Prof-specialty':4,'Exec-managerial':4
}

In [None]:
# TASK 5: using 'map' function in pandas, map the categorical values into numerical values
df["occupation"] = None
df['occupation']
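
As a reminder of how `Series.map` behaves with a dictionary, here is a sketch on a made-up Series. The category "Astronaut" and the small `toy_mapping` are hypothetical and only serve to show what happens to values that are missing from the mapping.

```python
import pandas as pd

toy_occ = pd.Series(["Sales", "Tech-support", "Sales", "Astronaut"])
toy_mapping = {"Sales": 2, "Tech-support": 3}

# Values found in the dictionary are replaced; all others become NaN
mapped = toy_occ.map(toy_mapping)
print(mapped)
```

Since unmapped categories silently turn into NaN, it is worth checking `mapped.isna().sum()` after mapping.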

### Dealing with Missing data

#### When processing the data earlier, we did not account for the missing values. 

In [None]:
df

In [None]:
# TASK 7
# Drop the rows with missing values, i.e. rows with '?' in the occupation or native-country columns
None
df

Look, the number of rows shrank from 32561 to 30162.
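
Dropping rows is only one way to deal with missing data. The sketch below, on a made-up DataFrame, marks `'?'` as missing and then compares dropping the affected rows with filling them with the most frequent value. The column values are invented, and most-frequent imputation is just one possible strategy.

```python
import numpy as np
import pandas as pd

toy_missing = pd.DataFrame({
    "workclass": ["Private", "?", "Private", "State-gov"],
    "age": [39, 50, 38, 53],
})

# Mark the placeholder '?' as a proper missing value
toy_missing = toy_missing.replace("?", np.nan)

# Option 1: drop every row that contains a missing value
dropped = toy_missing.dropna()

# Option 2: impute with the most frequent category of the column
most_frequent = toy_missing["workclass"].mode()[0]
filled = toy_missing.fillna({"workclass": most_frequent})

print(len(dropped), filled["workclass"].tolist())
```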

Above is the case where we want specific ordinal values for each occupation. What if we don't care about the order?  
Then we can use LabelEncoder.

### Basic Usage of LabelEncoder

In [None]:
df['workclass'].unique()

In [None]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()

In [None]:
# let's encode workclass column
work_class_list = list(df['workclass'].unique())
work_class_list

In [None]:
label_encoder.fit(work_class_list)
list(label_encoder.classes_)

In [None]:
label_encoder.transform(work_class_list)
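
A self-contained sketch of the same steps on a made-up list is given below. `LabelEncoder` sorts the classes and assigns integer codes in that order, and `inverse_transform` maps the codes back. Note that scikit-learn intends `LabelEncoder` for target labels; `OrdinalEncoder` is the corresponding transformer for feature columns.

```python
from sklearn.preprocessing import LabelEncoder

toy_labels = ["Private", "State-gov", "Private", "Local-gov"]

toy_encoder = LabelEncoder()
codes = toy_encoder.fit_transform(toy_labels)

# Classes are stored in sorted order; the code is the index into classes_
print(list(toy_encoder.classes_))
print(codes)

# Recover the original strings from the integer codes
decoded = toy_encoder.inverse_transform(codes)
print(list(decoded))
```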

### Now let's use this in our df

In [None]:
df

In [None]:
label_encoder = preprocessing.LabelEncoder()

In [None]:
categ = ['workclass','education','marital-status','relationship', 'sex', 'native-country']

# TASK 6: Encode Categorical Columns
# label_encoder.fit_transform fits label encoder and returns encoded labels.
None

In [None]:
df

### Now, train an SVM or KNN Classifier and check the metrics by using the function below

In [None]:
# TASK 8: Training an SVM Classifier
None

In [None]:
# Calculate the accuracy and related metrics of the model
def accuracy_metric(y, y_pred):
    """Calculate accuracy and error rates from a binary confusion matrix.

    It can also be applied to a subgroup of the population, e.g. to compare fairness.
    """
    
    cm = confusion_matrix(y, y_pred)
    TN, FP, FN, TP = cm.ravel()
    
    N = TP + FP + FN + TN  # Total population
    ACC = (TP + TN) / N    # Accuracy
    TPR = TP / (TP + FN)   # True positive rate
    FPR = FP / (FP + TN)   # False positive rate
    FNR = FN / (TP + FN)   # False negative rate
    PPP = (TP + FP) / N    # % predicted as positive
    
    return np.array([ACC, TPR, FPR, FNR, PPP])  
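
The function above relies on the order in which `confusion_matrix(...).ravel()` returns its entries for a binary problem. The sketch below checks that order on made-up labels (`toy_true` and `toy_pred` are invented).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

toy_true = np.array([0, 0, 1, 1, 1, 0])
toy_pred = np.array([0, 1, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes,
# so ravel() yields TN, FP, FN, TP for labels {0, 1}
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tn, fp, fn, tp, accuracy)
```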

#### Question 1: Try training a classifier with and without dealing with missing values
#### Question 2: Try OneHotEncoder instead of LabelEncoder and compare the performance of the models
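
As a starting point for Question 2, the sketch below shows what `OneHotEncoder` produces for a single made-up column. The values are invented. The default sparse output is converted with `.toarray()` so that the snippet does not depend on the name of the sparse-output argument, which differs between scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects a 2D input: one row per sample, one column per feature
toy_col = np.array([["Private"], ["State-gov"], ["Private"], ["Local-gov"]])

onehot_encoder = OneHotEncoder()
one_hot = onehot_encoder.fit_transform(toy_col).toarray()

# One output column per category, in sorted order
print(onehot_encoder.categories_[0])
print(one_hot)
```

Unlike integer label codes, the one-hot columns do not impose any order or distance between categories, which is why the two encodings can lead to different model performance.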