# COMP0189: Applied Artificial Intelligence
## Week 1 (Data Preprocessing)

### After this week you will be able to ...
- load various datasets from sklearn
- know the importance of data scaling and preprocessing
- know how sensitive different learning algorithms are to feature scaling
- split a dataset into train and test sets
- know what happens if you apply different preprocessing steps to the train and test sets
- know how to encode categorical features to ordinal and one-hot representations and how these affect model performance
- know how to deal with missing data

### Acknowledgements
- https://github.com/UCLAIS/Machine-Learning-Tutorials
- https://www.cs.columbia.edu/~amueller/comsw4995s19/schedule/
- https://scikit-learn.org/stable/
- https://archive.ics.uci.edu/ml/datasets/adult

## Introduction to Scikit-learn

Why do we use sklearn?

1. Example Datasets
    - sklearn.datasets : Provides example datasets

2. Feature Engineering  
    - sklearn.preprocessing : Various functions for data preprocessing
    - sklearn.feature_selection : Helps select the most informative features in a dataset
    - sklearn.feature_extraction : Vectorised feature extraction
    - sklearn.decomposition : Algorithms regarding Dimensionality Reduction

3. Data split and Parameter Tuning  
    - sklearn.model_selection : Train/test split, cross-validation, parameter tuning with GridSearchCV

4. Evaluation  
    - sklearn.metrics : accuracy score, ROC curve, F1 score, etc.

5. ML Algorithms
    - sklearn.ensemble : Ensemble, etc.
    - sklearn.linear_model : Linear Regression, Logistic Regression, etc.
    - sklearn.naive_bayes : Gaussian Naive Bayes classification, etc.
    - sklearn.neighbors : Nearest Centroid classification, etc.
    - sklearn.svm : Support Vector Machine
    - sklearn.tree : DecisionTreeClassifier, etc.
    - sklearn.cluster : Clustering (Unsupervised Learning)

6. Utilities  
    - sklearn.pipeline: pipeline of (feature engineering -> ML Algorithms -> Prediction)

7. Train and Predict  
    - fit()
    - predict()

8. and more...
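
As a rough illustration of how these modules fit together, here is a minimal sketch of the `fit()` / `predict()` workflow. It assumes the wine dataset (`load_wine`, introduced later in this notebook) and a `KNeighborsClassifier` with `n_neighbors=5`; both are arbitrary choices made only for this example.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small built-in dataset as a feature matrix X and a label vector y
X, y = load_wine(return_X_y=True)

# Hold out part of the data so the model is scored on unseen samples
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Estimators share the same two calls: fit() to learn, predict() to infer
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {acc:.3f}")
```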

In [None]:
!pip install scikit-learn==1.1.3

Collecting scikit-learn==1.1.3
  Downloading scikit_learn-1.1.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (30.5 MB)
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.2.2
    Uninstalling scikit-learn-1.2.2:
      Successfully uninstalled scikit-learn-1.2.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 0.18.0 requires scikit-learn>=1.2.2, but you have scikit-learn 1.1.3 which is incompatible.
Successfully installed scikit-learn-1.1.3


In [None]:
from sklearn.datasets import load_boston
import matplotlib.pyplot as plt
import pandas as pd
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning)

**1. Boston House Price Dataset**

Let's first take a look at the Boston House Price dataset. `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2 (hence the pinned 1.1.3 install above), but we will use it here for educational purposes.

In [None]:
boston = load_boston()
print(boston.DESCR)

In [None]:
boston.keys()

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename', 'data_module'])

In [None]:
boston.feature_names, len(boston.feature_names)

In [None]:
from sklearn.model_selection import train_test_split
X, y = boston.data, boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
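
The call above does not pass `test_size`, so scikit-learn's default of holding out 25% of the samples applies. The sketch below uses a made-up toy array (not part of the practical) to show how `test_size`, `random_state` and `stratify` affect the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each, plus a binary label per sample
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# test_size=0.3 holds out 30% of the rows; random_state makes the shuffle repeatable;
# stratify keeps the class proportions similar in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)
print(X_tr.shape, X_te.shape)
```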

In [None]:
fig, axes = plt.subplots(3, 5, figsize=(20, 10))
for i, ax in enumerate(axes.ravel()):
    if i > 12:
        ax.set_visible(False)
        continue
    ax.plot(X[:, i], y, 'o', alpha=.5)
    ax.set_title("{}: {}".format(i, boston.feature_names[i]))
    ax.set_ylabel("PRICE")
plt.show()

See how the features are spread over very different ranges. The feature at index 3 (CHAS) is even binary. Many algorithms perform poorly when the input features lie on such different scales.

**2. Wine Dataset**

In [None]:
from sklearn.datasets import load_wine

In [None]:
wine = load_wine()
print(wine.DESCR)

In [None]:
wine.keys()

In [None]:
wine_X = wine.data
wine_labels = wine.target
wine_feature_names = wine.feature_names

In [None]:
wine_labels

In [None]:
pd.DataFrame(wine_X, columns=wine_feature_names)

In [None]:
def visualise_wine(X, labels=None, column_indices=(0,1), set_labels=False):
    """
    @param: X        --> Data
    @param: labels   --> Defaults to None, but if you have labels from clustering,
                         you can pass them in as a list.
    @param: column_indices --> column indices of dataset X to be selected for plotting.
                                 two-element tuple if you want 2D graph,
                                 three-element tuple if you want 3D graph.
    """
    assert type(column_indices) is tuple

    if len(column_indices)==2:  # 2D
        first_col, second_col = column_indices[0], column_indices[1]

        if set_labels:
            plt.xlabel(wine_feature_names[first_col])
            plt.ylabel(wine_feature_names[second_col])

        plt.scatter(X[:, first_col], X[:, second_col], c=labels)

    elif len(column_indices)==3:  # 3D
        first_col, second_col, third_col = column_indices[0], column_indices[1], column_indices[2]
        fig = plt.figure()
        plt.clf()
        ax = fig.add_subplot(projection='3d')

        plt.cla()

        if set_labels:
            ax.set_xlabel(wine_feature_names[first_col])
            ax.set_ylabel(wine_feature_names[second_col])
            ax.set_zlabel(wine_feature_names[third_col])

        ax.scatter(X[:, first_col], X[:, second_col], X[:, third_col], c=labels)

    else:
        raise RuntimeError("column_indices must contain 2 entries (2D plot) or 3 entries (3D plot)")

    plt.tight_layout()
    plt.show()

In [None]:
visualise_wine(wine_X, labels=wine_labels, column_indices=(8, 10), set_labels=True)

In [None]:
# Try out different column_indices to get a feel for how the data is distributed.
visualise_wine(wine_X, labels=wine_labels, column_indices=(8, 10, 12), set_labels=True)

We will look closely at many scikit-learn functions (fit, predict, PCA, metrics, etc.) in the following practicals as we learn more in lectures.  
For now, it is enough to become familiar with the datasets and the main takeaways demonstrated here.

## Exercise 1: Impact of feature scaling

Normalization (min-max scaling) rescales each input variable separately to the range 0-1.  
Standardization rescales each input variable separately by subtracting its mean (centering) and dividing by its standard deviation, so that the variable has a mean of zero and a standard deviation of one.  
Note that scikit-learn's `Normalizer` does something different: it rescales each sample (row) to unit norm rather than each feature (column).
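
To make the two definitions concrete, the sketch below computes both by hand with NumPy and compares the result with `MinMaxScaler` and `StandardScaler`. The toy values are the same ones used in the StandardScaler example in this exercise; the variable names are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature matrix: two columns on very different scales
X_demo = np.array([[100.0, 0.001],
                   [8.0, 0.05],
                   [50.0, 0.005],
                   [88.0, 0.07],
                   [4.0, 0.1]])

# Min-max scaling by hand: (x - min) / (max - min), column by column
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
manual_minmax = (X_demo - col_min) / (col_max - col_min)

# Standardization by hand: (x - mean) / std, column by column
# (StandardScaler uses the population standard deviation, i.e. ddof=0)
manual_standard = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(manual_minmax, MinMaxScaler().fit_transform(X_demo)))
print(np.allclose(manual_standard, StandardScaler().fit_transform(X_demo)))
```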

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer
from sklearn.metrics import accuracy_score
import numpy as np
import pandas as pd

#### Example usage of sklearn.preprocessing.StandardScaler

In [None]:
# Example
unscaled_data = np.asarray([[100, 0.001],
 [8, 0.05],
 [50, 0.005],
 [88, 0.07],
 [4, 0.1]])
# define standard scaler
scaler = StandardScaler()
# transform data
scaled_data = scaler.fit_transform(unscaled_data)

In [None]:
pd.DataFrame(unscaled_data).hist()

In [None]:
pd.DataFrame(scaled_data).hist()

In [None]:
del scaled_data, unscaled_data, scaler

**Questions**  
- Try using different scaling methods, such as `MinMaxScaler` and `Normalizer`. Do you see a difference in the histograms?
- Experiment with the effects of different feature scaling methods on various ML algorithms, e.g. KNN, SVM, Decision Tree.
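
One thing worth keeping in mind when comparing the scalers: `Normalizer` works per sample (row), whereas `MinMaxScaler` and `StandardScaler` work per feature (column). The small sketch below uses a made-up array to show the difference; it is an illustration, not a solution to the questions above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

X_rows = np.array([[3.0, 4.0],
                   [6.0, 8.0],
                   [1.0, 0.0]])

# Normalizer divides each ROW by its own L2 norm, so every sample ends up with length 1
row_normalised = Normalizer(norm="l2").fit_transform(X_rows)
print(row_normalised)
print(np.linalg.norm(row_normalised, axis=1))

# MinMaxScaler rescales each COLUMN to [0, 1] using that column's min and max
col_scaled = MinMaxScaler().fit_transform(X_rows)
print(col_scaled.min(axis=0), col_scaled.max(axis=0))
```

Note that the first two rows point in the same direction, so `Normalizer` maps them to the same vector even though their magnitudes differ.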

### Scaled vs. Unscaled Wine Dataset

In [None]:
RANDOM_STATE = 42
# We are using the wine dataset again
features, target = load_wine(return_X_y=True)

# Make a train/test split using 30% test size
X_train, X_test, y_train, y_test = train_test_split(None)

In [None]:
# Define scalers and models
scalers = {
    'StandardScaler': StandardScaler(),
    'MinMaxScaler': MinMaxScaler(),
    'Normalizer': Normalizer()
}

models = {
    'KNeighborsClassifier': KNeighborsClassifier(),
    'SVC': SVC(random_state=RANDOM_STATE),
    'DecisionTreeClassifier': DecisionTreeClassifier(random_state=RANDOM_STATE)
}

# Store results
results = {}

# Iterate over each scaler
for scaler_name, scaler in scalers.items():
    scaled_X_train = scaler.None
    scaled_X_test = scaler.None

    # Iterate over each model
    for model_name, model in models.items():
        key = f'{scaler_name}_{model_name}'

        # Fit and predict with unscaled and scaled data
        model.fit(None)
        unscaled_y_hat = model.predict(None)
        unscaled_acc = accuracy_score(None)

        model.fit(None)
        scaled_y_hat = model.predict(None)
        scaled_acc = accuracy_score(None)

        # Store results
        results[key] = {
            'Unscaled Accuracy': unscaled_acc,
            'Scaled Accuracy': scaled_acc
        }


results_df = pd.DataFrame(results)
results_df



## Exercise 2: Impact of different preprocessing strategy in train and test data

Do you see the difference in accuracy in the results above?  
**Question**  
Notice that the test set was scaled as well.   
Using the same code, see what happens if you leave the test data unscaled and predict on it directly.

In [None]:
# Store results
results = {}

# Iterate over each scaler
for scaler_name, scaler in scalers.items():
    scaled_X_train = scaler.fit_transform(None)
    scaled_X_test = scaler.transform(None)

    # Iterate over each model
    for model_name, model in models.items():
        key = f'{scaler_name}_{model_name}'

        # Fit with scaled data
        model.fit(scaled_X_train, y_train)

        # Predict with unscaled test data
        unscaled_y_hat = model.None
        unscaled_acc = accuracy_score(None)

        # Predict with scaled test data
        scaled_y_hat = model.predict(None)
        scaled_acc = accuracy_score(None)

        # Store results
        results[key] = {
            'Accuracy with Unscaled Test Data': unscaled_acc,
            'Accuracy with Scaled Test Data': scaled_acc
        }

results_df = pd.DataFrame(results)
results_df

# Display results
#for key, value in results.items():
#    print(key, value)
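
One way to avoid mismatched preprocessing between train and test data is to bundle the scaler and the model with `make_pipeline` (imported earlier but not used so far). The following is a minimal sketch under the assumption of a `StandardScaler` plus `KNeighborsClassifier` combination on the wine data; the variable names are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_w, y_w = load_wine(return_X_y=True)
X_w_train, X_w_test, y_w_train, y_w_test = train_test_split(
    X_w, y_w, test_size=0.3, random_state=42
)

# fit() learns the scaling statistics from the training data only, then fits the model
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipe.fit(X_w_train, y_w_train)

# score() reuses those training statistics to transform the test data before predicting
pipeline_acc = pipe.score(X_w_test, y_w_test)
print(f"Pipeline test accuracy: {pipeline_acc:.3f}")
```

Because the scaler lives inside the pipeline, it is not possible to forget to transform the test set or to accidentally fit the scaler on it.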

## Next session: categorical features and data imputation

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import svm
from sklearn.metrics import accuracy_score,confusion_matrix
from sklearn.metrics import classification_report
from sklearn.feature_selection import mutual_info_classif

In [None]:
# Open the csv file and skim through it. It does not have column names
# so we will allocate names to each column

# Naming the Columns
names = ['age','workclass','fnlwgt','education','education-num',
        'marital-status','occupation','relationship','race','sex',
        'capital-gain','capital-loss','hours-per-week','native-country',
        'y']

# Load the dataset, treating ' ?' as a missing value
df = pd.read_csv('/adult.data', delimiter=',', names=names, na_values=' ?')


In [None]:
# Check for missing values
print(df.isnull().sum())

In [None]:
# Display the 15th row of the DataFrame - notice NaN
row_15 = df.iloc[14]
print(row_15)

In [None]:
len(df)

32561

In [None]:
# for now we will drop the rows with NA values

df = df.dropna()
len(df)

In [None]:
# TASK 1: Get the unique values in the race column
df['race'].None

In [None]:
# TASK 2: Get the unique values in the 'y' column
df['y'].None

In [None]:
# Get the population count by race
counts = df['race'].value_counts()
labels = counts.index

# Plot pie chart
plt.pie(counts, startangle=90)
plt.legend(labels, loc=2,fontsize=8)
plt.title("Race",size=20)

In [None]:
# TASK 3
# The values have a redundant leading space. Remove it.
df['race'] = df['race'].apply(None)
df['y'] = df['y'].apply(None)

In [None]:
df['race'].unique(), df['y'].unique(), df['occupation'].unique()

Hmmm it's not just the race and y column.

In [None]:
# Let's try to apply this to all the string-valued columns
for col_name in df.columns:
    if df[col_name].dtype == object:  # Checking for object type (string in pandas)
        df[col_name] = df[col_name].apply(lambda x: x.strip() if isinstance(x, str) else x)
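
As an alternative to `apply` with a lambda, pandas offers the vectorised `.str` accessor, which leaves missing values untouched. The sketch below uses a made-up toy frame (hypothetical values, not the real file) to show the idea.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the leading spaces seen in the raw file (hypothetical values)
toy = pd.DataFrame({
    "race": [" White", " Black", np.nan],
    "workclass": [" Private", " State-gov", " Private"],
    "age": [39, 50, 38],
})

# Strip whitespace on the string columns; NaN entries stay NaN
for col in ["race", "workclass"]:
    toy[col] = toy[col].str.strip()

print(toy["race"].unique())
print(toy["workclass"].unique())
```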


In [None]:
for col_name in df.columns:
    if 'int' not in str(df[col_name].dtype):
        print(df[col_name].unique())

All done!  
Now let's specifically look into the 'race' and 'y' columns

In [None]:
df[['race', 'y']].head(10)

In [None]:
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, LabelEncoder

# TASK 4: Convert features and target to binary numerical values using
# Ordinal, One-hot, LabelEncoding as appropriate.

In [None]:
# Assuming df is your DataFrame

# Ordinal Encoding for 'education'
ordinal_encoder = OrdinalEncoder()
df['education_encoded'] = ordinal_encoder.fit_transform(df[[None]])

# OneHotEncoding for nominal features without an implied order
# Including the previously missed nominal columns
nominal_columns = [None]
onehot_encoder = OneHotEncoder(sparse=False)
onehot_encoded_columns = onehot_encoder.fit_transform(df[nominal_columns])
column_names = onehot_encoder.get_feature_names_out(nominal_columns)
df_onehot_encoded = pd.DataFrame(onehot_encoded_columns, columns=column_names)

# Integrate these new columns back into the original dataframe
df = df.reset_index(drop=True)  # Reset index to align with the new onehot encoded DataFrame
df = pd.concat([df, df_onehot_encoded], axis=1)

# Optionally, remove the categorical columns if no longer needed
df.drop(columns=nominal_columns + ['education'], inplace=True)

# Label Encoding for the target variable
label_encoder = LabelEncoder()
df['y_encoded'] = label_encoder.fit_transform(df['y'])

# Remove the original 'y' column if no longer needed
df.drop(columns=['y'], inplace=True)

# Display the first few rows of the modified DataFrame
df.head(10)


### Dealing with Missing data

#### When processing the data earlier, we simply dropped the rows with missing values.
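
As a small warm-up, the sketch below contrasts dropping rows with most-frequent imputation on a made-up toy frame (hypothetical values). It is meant to show the mechanics only, not to replace the tasks that follow.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one missing entry in a categorical column
toy_missing = pd.DataFrame({
    "workclass": ["Private", "Private", np.nan, "State-gov"],
    "age": [39, 50, 38, 53],
})

# Option A: drop every row that contains a missing value
toy_dropped = toy_missing.dropna()

# Option C: replace missing entries with the most frequent value of the column
mode_imputer_demo = SimpleImputer(strategy="most_frequent")
filled = mode_imputer_demo.fit_transform(toy_missing[["workclass"]])
toy_filled = toy_missing.copy()
toy_filled["workclass"] = filled.ravel()

print(len(toy_dropped), "rows left after dropping")
print(toy_filled["workclass"].tolist())
```

Dropping rows loses the other, perfectly valid, values in those rows, whereas imputation keeps every row at the cost of introducing guessed values.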

In [None]:
# Reload the dataset, treating ' ?' as a missing value
df = pd.read_csv('/adult.data', delimiter=',', names=names, na_values=' ?')


In [None]:
# TASK 7: Create 3 datasets using different methods for dealing with missing data:
# A: Drop missing values, B: KNN imputation, C: Most frequent imputation

for col_name in df.columns:
    if df[col_name].dtype == object:  # Checking for object type (string in pandas)
        df[col_name] = df[col_name].apply(lambda x: x.strip() if isinstance(x, str) else x)

# Check for missing values
print(df.isnull().sum())


In [None]:
# first conduct encoding for features without missing values

# Ordinal Encoding for 'education'
ordinal_encoder = OrdinalEncoder()
df['education_encoded'] = ordinal_encoder.fit_transform(None)

# OneHotEncoding for nominal features without missing values and without an implied order
nominal_columns_without_missing = [None]
onehot_encoder = OneHotEncoder(sparse=False)
onehot_encoded_columns = onehot_encoder.fit_transform(df[None])
column_names = onehot_encoder.get_feature_names_out(None)
df_onehot_encoded = pd.DataFrame(None)

# Integrate these new columns back into the original dataframe
df = df.reset_index(drop=True)  # Reset index to align with the new onehot encoded DataFrame
df = pd.concat([df, df_onehot_encoded], axis=1)

# Optionally, remove the original categorical columns if no longer needed
df.drop(columns=nominal_columns_without_missing + ['education'], inplace=True)

# Label Encoding for the target variable
label_encoder = LabelEncoder()
df['y_encoded'] = label_encoder.fit_transform(df['y'])

# Remove the original 'y' column if no longer needed
df.drop(columns=['y'], inplace=True)

# Display the first few rows of the modified DataFrame
df.head(10)


In [None]:
print(df.isnull().sum())


In [None]:
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OrdinalEncoder

# A: Dataset with dropped missing values
df_dropna = df.None

print(df_dropna.isnull().sum())


In [None]:
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OrdinalEncoder

df_knn_imputed = df.copy()

# Temporarily encode categorical columns with missing values
temp_encoder = OrdinalEncoder()
columns_with_missing_values = ['workclass', 'occupation', 'native-country']
df_temp = df[columns_with_missing_values].copy()
df_temp_encoded = temp_encoder.fit_transform(df_temp)

# Apply KNN imputer
knn_imputer = KNNImputer(n_neighbors=5)
imputed_data = knn_imputer.None

# Decode the categorical columns back to original categories
imputed_data_decoded = temp_encoder.inverse_transform(imputed_data)
df_imputed_final = pd.DataFrame(imputed_data_decoded, columns=columns_with_missing_values)

# Integrate the imputed columns back into the main DataFrame
df_knn_imputed[columns_with_missing_values] = df_imputed_final

print(df_knn_imputed.isnull().sum())
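
A caveat when running `KNNImputer` on ordinal-encoded categories: the imputed value is an average of the neighbours' codes, so it may not be a valid category code. The sketch below uses a made-up array to show this; rounding to the nearest code is one possible workaround and is an assumption of this example, not something the practical prescribes.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Column 0 is a numeric feature; column 1 holds ordinal codes (0, 1, 2) with one gap
X_codes = np.array([[1.0, 0.0],
                    [1.1, np.nan],
                    [0.9, 1.0],
                    [5.0, 2.0]])

# With n_neighbors=2 the gap is filled with the mean code of the two closest rows
imputed_codes = KNNImputer(n_neighbors=2).fit_transform(X_codes)
print(imputed_codes[1, 1])

# The mean of two different codes is not itself a category code, so one option
# (an assumption here) is to round each value to the nearest integer code
rounded_codes = np.rint(imputed_codes[:, 1]).astype(int)
print(rounded_codes)
```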


In [None]:
from sklearn.impute import SimpleImputer

# Create an imputer object using the most frequent strategy
mode_imputer = SimpleImputer(strategy='most_frequent')

# Apply the imputer to the categorical columns with missing values
df_mode_imputed = df.copy()
df_mode_imputed[columns_with_missing_values] = mode_imputer.fit_transform(None)

print(df_mode_imputed.isnull().sum())


In [None]:
def apply_onehot_encoding(df, columns):
    # Perform One-Hot Encoding
    encoded_data = onehot_encoder.fit_transform(df[columns])
    column_names = onehot_encoder.get_feature_names_out(columns)
    df_encoded = pd.DataFrame(encoded_data, columns=column_names)

    # Reset indices to ensure alignment
    df_reset = df.reset_index(drop=True)
    df_encoded_reset = df_encoded.reset_index(drop=True)

    # Drop original columns and concatenate the new One-Hot Encoded columns
    return pd.concat([df_reset.drop(columns, axis=1), df_encoded_reset], axis=1)

# Columns to be One-Hot Encoded
columns_to_encode = [None]
# Apply One-Hot Encoding
df_dropna_encoded = apply_onehot_encoding(None)

# Check for missing values after encoding
print(df_dropna_encoded.isnull().sum().sum())

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Now apply one-hot encoding to the features that were imputed

# Initialize One-Hot Encoder
onehot_encoder = OneHotEncoder(sparse=False)

# Function to apply One-Hot Encoding to a DataFrame
def apply_onehot_encoding(df, columns):
    encoded_data = onehot_encoder.fit_transform(df[columns])
    column_names = onehot_encoder.get_feature_names_out(columns)
    df_encoded = pd.DataFrame(encoded_data, columns=column_names)

    # Drop original columns and concatenate the new One-Hot Encoded columns
    return pd.concat([df.drop(columns, axis=1), df_encoded], axis=1)

# Apply One-Hot Encoding to each DataFrame

df_knn_imputed_encoded = apply_onehot_encoding(None)
df_mode_imputed_encoded = apply_onehot_encoding(None)


print(df_knn_imputed_encoded.isnull().sum().sum())
print(df_mode_imputed_encoded.isnull().sum().sum())


### Now, train an SVM or KNN Classifier and check the metrics by using the function below

In [None]:
# TASK 8: Train an SVM Classifier on the different datasets to compare the accuracy of each imputation method
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

In [None]:
# For dataset A
X_dropna = df_dropna_encoded.drop('y_encoded', axis=1)
y_dropna = df_dropna_encoded['y_encoded']

# For dataset B
X_knn = df_knn_imputed_encoded.drop('y_encoded', axis=1)
y_knn = df_knn_imputed_encoded['y_encoded']

# For dataset C
X_mode = df_mode_imputed_encoded.drop('y_encoded', axis=1)
y_mode = df_mode_imputed_encoded['y_encoded']


In [None]:
# Function to train and evaluate SVM
def train_evaluate_svm(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    clf = SVC()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    return accuracy_score(y_test, y_pred)

In [None]:
# Train and evaluate on each dataset
accuracy_dropna = train_evaluate_svm(None)
accuracy_knn = train_evaluate_svm(None)
accuracy_mode = train_evaluate_svm(None)

# Print the accuracies
print(f"Accuracy with dropped missing values: {accuracy_dropna}")
print(f"Accuracy with KNN imputation: {accuracy_knn}")
print(f"Accuracy with mode imputation: {accuracy_mode}")
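
Accuracy alone can hide per-class behaviour. `confusion_matrix` and `classification_report` were imported at the start of this session, and the sketch below shows how they might be used. It assumes the wine dataset and a default `SVC` purely for illustration, so the numbers are unrelated to the adult dataset.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_m, y_m = load_wine(return_X_y=True)
X_m_train, X_m_test, y_m_train, y_m_test = train_test_split(
    X_m, y_m, test_size=0.3, random_state=42
)

clf_demo = SVC().fit(X_m_train, y_m_train)
y_m_pred = clf_demo.predict(X_m_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_m_test, y_m_pred)
print(cm)

# Per-class precision, recall and F1 alongside overall accuracy
print(classification_report(y_m_test, y_m_pred, zero_division=0))
```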