### Coursework 1 - revised and can be used with automarker

In this coursework you will complete two kinds of classification task: image classification and text classification.

The specific tasks and their marking are described in the notebook. Each task should be accompanied by a concise lab report of at most one A4 page. Submit your Jupyter Notebook and all lab reports as a single PDF file. You may implement any additional functions you need to carry out each task.

#### Task 1

In this task, you are provided with three classes of images (cars, bikes and people) in real-world settings. You are also provided with code for obtaining features for these images, specifically histogram of gradients (HoG) features. You need to implement a boosting-based classifier that can be used to classify the images.

This task is worth 30 points out of 100 points. 
Implementing a working boosting based classifier and validating it by cross-validation on the training set will be evaluated for 15 out of 30 points. 10 points are based on the evaluation carried out on a separate test dataset that will be done at the time of evaluation. Finally 5 points are reserved for analysis of this part of the task and presenting it well in a lab report. 

Note that the weak learner in your boosting classifier can be a decision tree from your previous ML1 coursework or a decision stump. Use the image_dataset directory provided with the assignment and save it in the same directory as the Python notebook.
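
The cross-validation mentioned above can be sketched as follows. This is only an illustration under stated assumptions: `cross_validate` is a hypothetical helper (not part of the provided code), the data is synthetic rather than real HoG features, and a shallow scikit-learn tree stands in for the boosted model.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier


def cross_validate(make_model, X, y, n_splits=5, seed=0):
    """Return the accuracy of a freshly built model on each held-out fold."""
    X, y = np.asarray(X), np.asarray(y)
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in folds.split(X, y):
        model = make_model()  # new model per fold so no state leaks between folds
        model.fit(X[train_idx], y[train_idx])
        scores.append(np.mean(model.predict(X[val_idx]) == y[val_idx]))
    return np.array(scores)


# Synthetic stand-in for the image features: three well-separated classes.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 4)) for c in (0, 3, 6)])
y_toy = np.repeat([1, 2, 3], 30)

# A shallow tree stands in for the boosted model in this sketch.
scores = cross_validate(lambda: DecisionTreeClassifier(max_depth=2), X_toy, y_toy)
print(scores.mean())
```

Passing a factory (`make_model`) rather than a fitted object keeps each fold independent; any class with `fit` and `predict` methods could be substituted.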

#### Write your image feature extraction code

In [49]:
import numpy as np
import cv2 as cv
import glob
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import sklearn.metrics as metrics
import os
from tqdm import tqdm
import time
import matplotlib.pyplot as plt    
import seaborn as sns
from sklearn.metrics import confusion_matrix
plt.rcParams['figure.figsize'] = [10, 6]

## Normalize Function

In [50]:
# function for normalizing data
def normalize(df, mean_method = True ):
    if mean_method:
        # mean normalization
        normalized_df = (df - df.mean()) / df.std()
        a, b = df.std(), df.mean()
    else:
        # min max normalization
        normalized_df = (df - df.min()) / (df.max() - df.min())
        a, b = df.min(), df.max()
    return normalized_df, a, b

def unnormalize(df, a, b, mean_method = True):
    if mean_method:
        # mean normalization
        unnormalized_df = df * a + b
    else:
        # min max normalization
        unnormalized_df = (df * (b - a)) + a
    return unnormalized_df
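
When separate train and test sets are involved, one common option is to compute the normalization statistics on the training set only and reuse them for the test set, so both sets are scaled identically. A minimal sketch, assuming a hypothetical helper `standardize_with` and toy data:

```python
import pandas as pd


def standardize_with(df, mean, std):
    """Standardize df column-wise using statistics computed elsewhere."""
    return (df - mean) / std


train = pd.DataFrame({"f0": [1.0, 2.0, 3.0], "f1": [10.0, 20.0, 30.0]})
test = pd.DataFrame({"f0": [2.0, 4.0], "f1": [20.0, 40.0]})

# Statistics come from the training set only.
# pandas uses the sample standard deviation (ddof=1) by default.
mean, std = train.mean(), train.std()
train_n = standardize_with(train, mean, std)
test_n = standardize_with(test, mean, std)
print(test_n)
```
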

## Augmentation Function

In [51]:

def HOG_augment(df, augment = True, n = 2):
    
    from keras.preprocessing.image import ImageDataGenerator, img_to_array, array_to_img
    
    datagen = ImageDataGenerator( 
            rotation_range = 40, 
            shear_range = 0.2, 
            zoom_range = 0.2, 
            horizontal_flip = True, 
            brightness_range = (0.5, 1.5))
    
    hog = cv.HOGDescriptor()
    
    x = []
    y = []
    
    
    
    for image, target in df.values:
        
        # get hog features for OG images
        x.append(hog.compute(image))
        y.append(target)
        
        x_array = img_to_array(image)
        x_array = x_array.reshape((1,) + x_array.shape)
        
        if augment:
            i = 0
            for batch in datagen.flow(x_array):

                b = np.asarray(array_to_img(np.squeeze(batch)))
                h = hog.compute(b)

                # save hog features and class for augmented images
                x.append(h)
                y.append(target)

                i+=1
                if i >= n:
                    break
                
                
    X = np.hstack(x).T 
    
    df_augmented = pd.DataFrame(X)
    df_augmented['y'] = y
    
    return df_augmented


## PCA Function

In [52]:


def reduce_features(df, num_features = 20):
    from sklearn.preprocessing import LabelEncoder, StandardScaler
    from sklearn.decomposition import PCA
    from tqdm import tqdm
    
    encoder = LabelEncoder()
    
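    # split input and target (copy so the column assignments below do not
    # write into a slice of the caller's DataFrame)
    x, y = df.iloc[:,:-1].copy(), df.iloc[:,-1]
    
    # NOTE: LabelEncoder replaces each value with its rank within the column,
    # so the original feature magnitudes are discarded before scaling
    for col in (x.columns[:]):
        x[col] = encoder.fit_transform(x[col])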
    
    # apply scaler to input features
    scaler = StandardScaler()
    x = scaler.fit_transform(x)

    # transform features
    pca = PCA()
    pca.fit_transform(x)
    pca_variance = pca.explained_variance_

    # fit
    pca2 = PCA(n_components = num_features, whiten = True)
    pca2.fit(x)
    pca_x = pca2.transform(x)

    # recreate dataframe after PCA
    df_pca = pd.DataFrame(pca_x)
    df_pca['y'] = y.values
    
    return df_pca
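
`reduce_features` fits a new scaler and a new PCA on whatever data it receives, so calling it separately on the training and test sets projects them onto different axes. One possible alternative, sketched here on random data with a scikit-learn pipeline (the variable names are illustrative only), is to fit the reduction once on the training features and reuse it for the test features:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 20))  # stand-in for training features
X_test = rng.normal(size=(15, 20))   # stand-in for test features

# Scaling and PCA are fitted on the training data only.
reducer = make_pipeline(StandardScaler(), PCA(n_components=5, whiten=True))
Z_train = reducer.fit_transform(X_train)
Z_test = reducer.transform(X_test)  # same mean, scale and components as training
print(Z_train.shape, Z_test.shape)
```
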

## Load Data Function

In [53]:
    
def load_data(folder_name, 
                   include_augmented = True, 
                   Hog_features = False):
    import os
    
    feature_len = 34020
    hog = cv.HOGDescriptor()

    y = []
    X = []

    c = 0
    
    # store images to visualize if necessary
    image_dict = {}
    
    class_dict = {'bike'    : 1, 
                  'car'     : 2, 
                  'people'  : 3} 
    
    for subdir, dirs, files in os.walk(folder_name):
        for file in files:
            # skip pre-augmented images unless they were requested
            if not include_augmented and 'aug' in file:
                continue
            
            # get file location from directory
            file_location = os.path.join(subdir, file)
            
            try:
                img = cv.imread(file_location)
                if img is None:
                    # cv.imread returns None for unreadable or non-image files
                    raise ValueError('not a readable image')
                image_dict.update({file:img})
                h = hog.compute(img)
                
                if Hog_features:
                    X.append(h)
                else:
                    X.append([img])
                
                # the class is inferred from the file name
                if 'bike' in file:
                    y.append(class_dict['bike'])
                elif 'car' in file:
                    y.append(class_dict['car'])
                elif 'person' in file or 'people' in file:
                    y.append(class_dict['people'])
                else:
                    y.append('error')
            except Exception:
                print('Error when loading file : ', file)
    if Hog_features:
        X = np.concatenate(X, axis = 1).T
    print('Data Loaded.')
    return X, y



In [54]:
    
def obtain_dataset_train(folder_name):
    # import dependencies
    import numpy as np
    import cv2 as cv
    import glob
    import pandas as pd
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn import svm
    from sklearn.metrics import confusion_matrix
    from sklearn.preprocessing import LabelEncoder, StandardScaler
    from sklearn.decomposition import PCA
    import sklearn.metrics as metrics
    import os
    from tqdm import tqdm
    import time
    import matplotlib.pyplot as plt    
    import seaborn as sns
    
    # load raw images w/o HOG features
    x, y = load_data(folder_name, 
                   include_augmented = False, 
                   Hog_features = False)
    df = pd.DataFrame(x)
    df['y'] = y
    
    print('Original Shape: ', df.shape)
    # augment and extract HOG from raw images
    df = HOG_augment(df, augment = False, n = 5)
    # dimensionality reduction w/ PCA
    df = reduce_features(df, num_features = 150)
    
    print('Training Shape: ', df.shape)
    # separate input and target
    X, y = df.iloc[:,:-1], df.iloc[:,-1].values
    
    # normalize input
    X = normalize(X)[0]
    
    return (X,y) 

In [55]:

# Optional function for those who want to include pre-processing for train data in obtain dataset
def obtain_dataset_test(folder_name_test):
    # import dependencies
    import numpy as np
    import cv2 as cv
    import glob
    import pandas as pd
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn import svm
    from sklearn.metrics import confusion_matrix
    from sklearn.preprocessing import LabelEncoder, StandardScaler
    from sklearn.decomposition import PCA
    import sklearn.metrics as metrics
    import os
    from tqdm import tqdm
    import time
    import matplotlib.pyplot as plt    
    import seaborn as sns
    
    # load raw test images
    x, y = load_data(folder_name_test, 
                   include_augmented = False, 
                   Hog_features = True)
    df = pd.DataFrame(x)
    df['y'] = y
    # PCA
    df = reduce_features(df, num_features = 150)
    
    # separate input and target
    X, y = df.iloc[:,:-1], df.iloc[:,-1].values
    
    # normalize input
    X = normalize(X)[0]
    return (X, y) 

#### Boosting classifier class

In [56]:
class BoostingClassifier:
    
    def __init__(self,n_estimators = 100,
                 max_depth = 5,
                 num_features = 1000,
                 start_feature = 200, 
                 sklearn = True):
        
        self.sklearn = sklearn
        
        self.n_est = n_estimators
        self.depth = max_depth
        
        self.trees = []
        self.tree_weights = []
        
        self.num_features = num_features
        self.start = start_feature
        
    def update_weights(self, weights, y, y_dt, tree_weights):
        
        for i in range(weights.shape[0]):
            if y[i] != y_dt[i]:
                weights[i] = weights[i]*np.exp(tree_weights)
        
        weights /= np.sum(weights)
        return weights
        
    def fit(self, X, y):
        
        y = np.asarray(y)
        n_class = np.unique(y).shape[0]
        
        # init sample weights uniformly so that they sum to one
        weights = np.full(X.shape[0], 1 / X.shape[0])
        
        start = self.start 
        features = np.arange(start , start + self.num_features)   
        
        for i in range(self.n_est):
            if self.sklearn == False:
                # instantiate DT & fit to data (Decision_Tree is not defined
                # in this notebook and must be supplied separately)
                DT = Decision_Tree(X, y,
                      num_features = self.num_features,
                      max_depth = self.depth,
                      start_feature = self.start)
            elif self.sklearn == True:
                DT = DecisionTreeClassifier(max_depth = self.depth)
                # pass the current sample weights; without them every round
                # fits the same tree and boosting has no effect
                DT.fit(X, y, sample_weight = weights)
            
            # predict with trees
            dt_pred = DT.predict(X)
            
            # weighted error, clipped to avoid log(0) and division by zero
            error = np.sum(weights[y != dt_pred])
            error = np.clip(error, 1e-10, 1 - 1e-10)
            
            # update tree weights (multi-class SAMME rule)
            self.tree_weights.append(np.log((1 - error) / error) \
                                     + np.log(n_class - 1))
            
            # update data weights
            weights = self.update_weights(weights, 
                                          y, 
                                          dt_pred, 
                                          self.tree_weights[-1])
            # store each tree
            self.trees.append(DT)
            
    
    def predict(self, X):
        
        n = X.shape[0]
        
        pred = np.array([tree.predict(X) for tree in self.trees]).T
        
        y_pred = []
        for i in range(n):
            
            current = pred[i,:]
            class_weights = {prediction : 0 for prediction in np.unique(current)}
            for j, prediction in enumerate(current):
                class_weights[prediction] += self.tree_weights[j]
                
            optimal_weight = max(class_weights, key = class_weights.get)
            y_pred.append(optimal_weight)
            
        return np.array(y_pred)
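
The tree weight and sample-weight update in `fit` follow the multi-class SAMME rule: a tree with weighted error `err` receives weight `log((1 - err) / err) + log(K - 1)` for `K` classes, and misclassified samples are scaled up by the exponential of that weight. One round on toy numbers can be sketched as below; `samme_round` is a hypothetical standalone helper written for illustration, not part of the class above.

```python
import numpy as np


def samme_round(weights, y_true, y_pred, n_classes):
    """One SAMME update: return (alpha, new_weights) for a weak learner's predictions."""
    miss = y_true != y_pred
    err = np.sum(weights[miss]) / np.sum(weights)
    err = np.clip(err, 1e-10, 1 - 1e-10)  # guard against log(0) and division by zero
    alpha = np.log((1 - err) / err) + np.log(n_classes - 1)
    new_weights = weights * np.exp(alpha * miss)  # only misclassified samples grow
    return alpha, new_weights / new_weights.sum()


y_true = np.array([1, 1, 2, 2, 3, 3])
y_pred = np.array([1, 1, 2, 3, 3, 3])  # one mistake out of six
w0 = np.full(6, 1 / 6)
alpha, w1 = samme_round(w0, y_true, y_pred, n_classes=3)
print(alpha, w1)
```

With an error of 1/6 and three classes, the tree weight is log(5) + log(2) = log(10), and after renormalization the single misclassified sample carries 2/3 of the total weight, which is what forces the next weak learner to focus on it.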

### Test function that will be called to evaluate your code. Separate train and test dataset will be provided

Do not modify the code below. Please write your code above such that it can be evaluated by the function below. You can modify your code above such that you obtain the best performance through this function. We will also be evaluating the cross-validation performance with a set train and val split.

In [None]:
def test_func_boosting_image(image_dataset_train, image_dataset_test):
    from sklearn.metrics import accuracy_score
    (X_train, Y_train) = obtain_dataset_train(image_dataset_train)
    (X_test, Y_test) = obtain_dataset_test(image_dataset_test)# optionally replace the two calls with a single call to obtain_dataset_train_test() function
    bc = BoostingClassifier()
    bc.fit(X_train, Y_train)
    y_pred = bc.predict(X_test)
    print('Accuracy: ', np.sum(y_pred == Y_test) / len(y_pred))
    acc = accuracy_score(Y_test, y_pred)
    print('Accuracy: ', acc)
    return acc

#### Task 2

In this task, you need to classify the above dataset using a Support Vector Machine (SVM).

This task is worth 25 points out of 100 points. You are allowed to use existing libraries such as scikit-learn for the SVM. The main idea is to analyse the dataset using different kinds of kernels, including custom kernels that you write yourself. The marking will be 15 points for analysing the dataset using various kernels including your own kernels, 5 points for the performance on the test dataset and 5 points for a lab report that provides the analysis and comparisons.
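
scikit-learn's `SVC` accepts a callable as its `kernel` argument; the callable receives two sample matrices and must return their Gram matrix. A minimal sketch of a custom Gaussian kernel written this way, assuming toy data and a hypothetical function name `gaussian_gram`:

```python
import numpy as np
from sklearn import svm


def gaussian_gram(U, V, sigma=1.0):
    """Gram matrix of exp(-||u - v||^2 / (2 sigma^2)) for every row pair of U and V."""
    U, V = np.asarray(U, dtype=float), np.asarray(V, dtype=float)
    # Squared distances via ||u||^2 + ||v||^2 - 2 u.v, computed for all pairs at once.
    sq_dists = (
        np.sum(U ** 2, axis=1)[:, None]
        + np.sum(V ** 2, axis=1)[None, :]
        - 2.0 * U @ V.T
    )
    # Rounding can make tiny distances slightly negative, so clamp at zero.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))


rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(20, 3)) for c in (0, 3)])
y = np.repeat([0, 1], 20)

clf = svm.SVC(kernel=gaussian_gram)  # the callable returns the Gram matrix
clf.fit(X, y)
print(clf.score(X, y))
```

Computing all pairwise distances with matrix operations avoids a Python double loop over samples, which matters when the Gram matrix has thousands of rows.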

In [57]:
from sklearn import svm
class SVMClassifier:
    
    def __init__(self, kernel = 'rbf',
                       reg_param = 1.0,
                       degree = 3,
                       max_iter = 1e5,
                       tol = 1e-3):
        
        
        if kernel == 'sigmoid':
            k = self.sigmoid
        elif kernel == 'linear':
            k = self.linear
        elif kernel == 'poly':
            k = self.polynomial
        elif kernel == 'gaussian':
            k = self.gaussian
        elif kernel == 'laplacian':
            k = self.laplacian
        elif kernel == 'log':
            k = self.log
        elif kernel == 'exponential':
            k = self.exponential
        elif kernel == 'rational':
            k = self.rational
        elif kernel == 'quadric':
            k = self.quadric
        elif kernel == 'rbf':
            k = kernel
        elif kernel == 'linear_sklearn':
            k = 'linear'
        elif kernel == 'poly_sklearn':
            k = 'poly'
        elif kernel == 'intersection':
            k = self.intersect
        else:
            print('Using external kernel.')
            k = kernel
            #return
        
        # instantiate svm classifier
        self.clf = svm.SVC(C = reg_param,
              kernel = k,
              degree = degree,
              gamma = 'scale')
        
    def kernels_list(self):
        return ['sigmoid', 'linear', 'poly', 'gaussian', 'laplacian', 
                'log', 'exponential', 'rational', 'quadric', 'rbf', 
                'linear_sklearn', 'poly_sklearn', 'intersection']
    
    def linear(self, x,y):
        return np.inner(x,y)
    
    def sigmoid(self, x,y, alpha = 1):
        return np.tanh(alpha*np.inner(x,y))
    
    def polynomial(self, x,y, coef = 1, p = 6):
        return (np.inner(x,y) + coef) ** p
        
    def gaussian(self, U,V,sigma = 0.1):
        def gaussianKernel(U,V,sigma = 0.1):
            # negative exponent: similarity decays with squared distance
            return np.exp(-np.linalg.norm(U-V) ** 2 / (2*sigma**2))

        G = np.zeros((U.shape[0], V.shape[0]))
        for i in range(0,U.shape[0]):
            for j in range(0,V.shape[0]):
                G[i][j] = gaussianKernel(U[i],V[j],sigma)
        return G    
    
    def laplacian(self, U,V,sigma = 0.1):
        def LaplacianKernel(U,V,sigma = 0.1):
            return np.exp(-np.linalg.norm(U-V) / sigma)
        G = np.zeros((U.shape[0], V.shape[0]))
        for i in range(0,U.shape[0]):
            for j in range(0,V.shape[0]):
                G[i][j] = LaplacianKernel(U[i],V[j],sigma)
        return G
    
    def log(self, U,V):
        def logKernel(U,V):
            return -np.log(np.linalg.norm(U-V) + 1)
        G = np.zeros((U.shape[0], V.shape[0]))
        for i in range(0,U.shape[0]):
            for j in range(0,V.shape[0]):
                G[i][j] = logKernel(U[i],V[j])
        return G
    
    def exponential(self, U,V,sigma = 0.1):
        def expKernel(U,V,sigma = 0.1):
            # exponential kernel: exp(-||U - V|| / (2 sigma^2))
            return np.exp(- np.linalg.norm(U-V) /  (2*sigma**2))
        G = np.zeros((U.shape[0], V.shape[0]))
        for i in range(0,U.shape[0]):
            for j in range(0,V.shape[0]):
                G[i][j] = expKernel(U[i],V[j],sigma)
        return G
    
    def rational(self, U,V,c = 100):
        def RationalKernel(U,V,c = 100):
            return 1 - (np.linalg.norm(U-V)**2 / (np.linalg.norm(U-V)**2 + c))
        G = np.zeros((U.shape[0], V.shape[0]))
        for i in range(0,U.shape[0]):
            for j in range(0,V.shape[0]):
                G[i][j] = RationalKernel(U[i],V[j],c)
        return G
    
    def quadric(self, U,V,c = 100):
        def quadricKernel(U,V,c = 100):
            return np.sqrt(np.sum(np.power((U-V),2)) + c**2)
        G = np.zeros((U.shape[0], V.shape[0]))
        for i in range(0,U.shape[0]):
            for j in range(0,V.shape[0]):
                G[i][j] = quadricKernel(U[i],V[j],c)
        return G
    
    def intersect(self,U,V,sigma = 0.1):
        def Kernel(U,V,sigma = 0.1):
            # histogram intersection: sum of element-wise minima
            return np.sum(np.minimum(U, V))
        G = np.zeros((U.shape[0], V.shape[0]))
        for i in range(0,U.shape[0]):
            for j in range(0,V.shape[0]):
                G[i][j] = Kernel(U[i],V[j],sigma)
        return G
    
    def fit(self, X,y):
        self.clf.fit(X,y)
    
    def fit_image(self, X,y):
        self.clf.fit(X,y)
    
    def fit_text(self, X,y):
        self.clf.fit(X,y)
    
    def predict(self, X):
        y_pred = self.clf.predict(X)
        return y_pred
    
    def predict_image(self, X):
        y_pred = self.clf.predict(X)
        return y_pred
    
    def predict_text(self, X):
        y_pred = self.clf.predict(X)
        return y_pred

### Test function that will be called to evaluate your code. Separate train and test dataset will be provided

Do not modify the code below. Please write your code above such that it can be evaluated by the function below. You can modify your code above such that you obtain the best performance through this function. We will also be evaluating the cross-validation performance with a set train and val split.

In [None]:
def test_func_svm_image(image_dataset_train, image_dataset_test):
    from sklearn.metrics import accuracy_score  
    from sklearn.utils import check_X_y
    (X_train, Y_train) = obtain_dataset_train(image_dataset_train)
    (X_test, Y_test) = obtain_dataset_test(image_dataset_test) # optionally replace the two calls with a single call to obtain_dataset_train_test() function
    sc = SVMClassifier()
    sc.fit_image(X_train, Y_train)
    y_pred = sc.predict_image(X_test)
    print('Accuracy: ', np.sum(y_pred == Y_test) / len(y_pred))
    acc = accuracy_score(Y_test, y_pred)
    #acc = np.sum(y_pred == Y_test) / len(y_pred)
    return acc

#### Task 3

In this task, you need to perform sentiment analysis on the provided dataset. The dataset consists of movie reviews, each labelled with a sentiment that is either positive or negative. You need to train a boosting-based classifier and cross-validate it on the dataset provided. The method will be evaluated against an external test set.

This task is worth 25 points out of 100 points. 15 points will be for implementing the pre-processing and Bag of Words based feature extractor correctly and evaluating the boosting based classifier for the text features and validating it by cross-validation on the training set. 5 points are based on the evaluation carried out on a separate test dataset that will be done at the time of evaluation. Finally 5 points are reserved for analysis of this part of the task and presenting it well in a lab report.

Use the movie_review_train.csv file provided with the assignment, and save it in the same directory as the Python notebook.

#### Process the text and obtain a bag of words-based features 

In [31]:
def bag_of_words(df, 
                 max_features = 4000, 
                 min_df = 0.05):
    
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer 
    
    vectorizer = CountVectorizer(stop_words = 'english', 
                                 min_df = min_df,
                                 max_features = int(max_features))
    x = vectorizer.fit_transform(df.iloc[:,-2])

    transformer = TfidfTransformer(sublinear_tf = False)
    tf = transformer.fit_transform(x)

    df_tf = pd.DataFrame(tf.toarray())
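    # map sentiment labels to integers in one assignment (in-place replace on
    # a selected column is unreliable in recent pandas versions);
    # .values avoids index misalignment with df_tf
    df_tf['y'] = df.iloc[:,-1].replace({'positive': 1, 'negative': 0}).values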
    
    return df_tf

In [27]:
def extract_bag_of_words_train(train_file):
    
    print('Extracting train text data.')
    
    # import dependencies
    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.stem import WordNetLemmatizer
    import re
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer 
    import pandas as pd
    
    df_movies = pd.read_csv(train_file)
    print('raw file shape:', df_movies.shape)
    
    df = bag_of_words(df_movies, 
                      max_features = 3000, # 3000
                      min_df = 1)
    print('Bag of words extraction: ', df.shape)
    
    df = reduce_features(df, num_features = 200)
    print('PCA: ', df.shape)
    
    X, y = df.iloc[:,:-1], df.iloc[:,-1]
    
    return X, y

In [28]:
def extract_bag_of_words_test(test_file):
    
    print('Extracting test text data.')
    
    # import dependencies
    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.stem import WordNetLemmatizer
    import re
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer 
    import pandas as pd
    
    df_movies = pd.read_csv(test_file)
    print('raw file shape:', df_movies.shape)
    
    df = bag_of_words(df_movies, 
                      max_features = 3000,
                      min_df = 1)
    print('Bag of words extraction: ', df.shape)
    
    df = reduce_features(df, num_features = 200)
    print('PCA: ', df.shape)
    
    X, y = df.iloc[:,:-1], df.iloc[:,-1]
    
    return X, y
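
The two extraction functions each fit their own vocabulary, so a given column can refer to different words in the training and test matrices. A possible alternative, sketched here under stated assumptions, is to fit the vocabulary and IDF weights on the training texts and reuse them for the test texts. `vectorize_train_test` is a hypothetical helper, the texts are made up, and `TfidfVectorizer` is used as the scikit-learn shorthand for `CountVectorizer` followed by `TfidfTransformer`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def vectorize_train_test(train_texts, test_texts, max_features=3000):
    """Fit vocabulary and IDF weights on the training texts, then reuse them for the test texts."""
    vectorizer = TfidfVectorizer(stop_words="english", max_features=max_features)
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)  # words unseen in training are ignored
    return X_train, X_test, vectorizer


train_texts = ["a wonderful moving film", "a dull boring film", "wonderful acting"]
test_texts = ["boring acting", "surprisingly wonderful"]
X_train, X_test, vec = vectorize_train_test(train_texts, test_texts)
print(X_train.shape, X_test.shape)
```

Both matrices then have the same number of columns with the same meaning, which is what a classifier trained on one and evaluated on the other requires.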

### Test function that will be called to evaluate your code. Separate train and test dataset will be provided

Do not modify the code below. Please write your code above such that it can be evaluated by the function below. You can modify your code above such that you obtain the best performance through this function. We will also be evaluating the cross-validation performance with a set train and val split.

In [18]:
def test_func_boosting_text(text_dataset_train, text_dataset_test):
    from sklearn.metrics import accuracy_score    
    (X_train, Y_train) = extract_bag_of_words_train(text_dataset_train)
    (X_test, Y_test) = extract_bag_of_words_test(text_dataset_test) # optionally the two calls can be replaced by a single extract_bag_of_words_train_test() function
    bc = BoostingClassifier()
    bc.fit(X_train, Y_train)
    y_pred = bc.predict(X_test)    
    acc = accuracy_score(Y_test, y_pred)
    return acc

#### Task 4

In this task, you need to classify the above movie review dataset using a Support Vector Machine (SVM).

This task is worth 20 points out of 100 points. You are allowed to use existing libraries such as scikit-learn for the SVM. The main idea is to analyse the dataset using different kinds of kernels, including custom text kernels that you write yourself. The marking will be 10 points for analysing the dataset using various kernels including your own kernels, 5 points for the performance on the test dataset and 5 points for a lab report that provides the analysis and comparisons.
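
As one illustration of a custom kernel suited to bag-of-words features, cosine similarity compares documents by the direction of their term vectors rather than their length. The sketch below is an assumption-laden example, not a prescribed solution: `cosine_kernel` is a hypothetical function and the term-count matrix is made up.

```python
import numpy as np
from sklearn import svm


def cosine_kernel(U, V):
    """Gram matrix of cosine similarities between the rows of U and the rows of V."""
    U, V = np.asarray(U, dtype=float), np.asarray(V, dtype=float)
    U_norm = np.linalg.norm(U, axis=1, keepdims=True)
    V_norm = np.linalg.norm(V, axis=1, keepdims=True)
    # Replace zero norms with 1 so empty documents give similarity 0 rather than NaN.
    U_unit = U / np.where(U_norm == 0, 1.0, U_norm)
    V_unit = V / np.where(V_norm == 0, 1.0, V_norm)
    return U_unit @ V_unit.T


# Toy term counts: columns 0-1 are "positive" words, columns 2-3 are "negative" words.
X = np.array([[3, 1, 0, 0], [1, 2, 0, 0], [2, 2, 1, 0],
              [0, 0, 2, 1], [0, 1, 3, 2], [0, 0, 1, 3]], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0])

clf = svm.SVC(kernel=cosine_kernel).fit(X, y)
pred = clf.predict(np.array([[4.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 2.0]]))
print(pred)
```
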

### Test function that will be called to evaluate your code. Separate train and test dataset will be provided

Do not modify the code below. Please write your code above such that it can be evaluated by the function below. You can modify your code above such that you obtain the best performance through this function. We will also be evaluating the cross-validation performance with a set train and val split.

In [19]:
def test_func_svm_text(text_dataset_train, text_dataset_test):
    from sklearn.metrics import accuracy_score    
    (X_train, Y_train) = extract_bag_of_words_train(text_dataset_train)
    (X_test, Y_test) = extract_bag_of_words_test(text_dataset_test) # optionally the two calls can be replaced by a single extract_bag_of_words_train_test() function
    sc = SVMClassifier()
    sc.fit_text(X_train, Y_train)
    y_pred = sc.predict_text(X_test)
    acc = accuracy_score(Y_test, y_pred)
    return acc