# Assignment #2 - Classification

<font color="red"> <b> Due: Oct 11 (Thursday) 11:00 pm </b> </font>

<font color="blue"> Josiah Laivins </font>

# I. Introduction

Determining which qualities contribute to what someone would call "having satisfaction in their career" is important. Using several features from the Stack Overflow 2018 Developer Survey, I hope to gain insight into what actually determines a satisfying career.

# II. Data

The data set is available at [this Kaggle Stack Overflow link](https://www.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey).

The Stack Overflow survey data seemed interesting to use for classification because it pairs lifestyle and demographic answers with self-reported satisfaction.
There are 98855 records in this data set. There are 129 features, but I am going to list the main ones that I was immediately interested in:
- SkipMeals
- WakeTime
- HoursComputer
- RaceEthnicity
- CareerSatisfaction

As a note, I am going to binarize the target (CareerSatisfaction) via this method:
- Extremely satisfied: 1
- Moderately satisfied: 1
- Slightly satisfied: 1
- Neither satisfied nor dissatisfied: -1
- Slightly dissatisfied: -1
- Moderately dissatisfied: -1 
- Extremely dissatisfied: -1
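
The mapping listed above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's own preprocessing: the names `SATISFACTION_TO_LABEL` and `binarize_satisfaction` and the toy answers are assumptions made for the example.

```python
import pandas as pd

# Mapping from survey answer strings to binary labels, as listed above.
SATISFACTION_TO_LABEL = {
    'Extremely satisfied': 1,
    'Moderately satisfied': 1,
    'Slightly satisfied': 1,
    'Neither satisfied nor dissatisfied': -1,
    'Slightly dissatisfied': -1,
    'Moderately dissatisfied': -1,
    'Extremely dissatisfied': -1,
}


def binarize_satisfaction(answers: pd.Series) -> pd.Series:
    """Map answer strings to +1 / -1; missing or unknown answers become NaN."""
    return answers.map(SATISFACTION_TO_LABEL)


# Hypothetical toy answers, not taken from the real survey file.
toy = pd.Series(['Extremely satisfied', 'Neither satisfied nor dissatisfied', None])
labels = binarize_satisfaction(toy)
print(labels)
```

Rows whose label is NaN can then be dropped with `dropna` before training.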

# III. Method

Summarize the pocket algorithm, discriminant analysis, and logistic regression.
The superclass *Classifier* defines common utility methods. 
Finish the normalize function for you. 
Do not forget to explain your implementation. 

The explanation of your code should not be placed in comments inside a code cell. 
This section should include
 - a review of the 4 classification models 
 - your implementation and its description


### A. Superclass Definition

In [None]:
from pathlib import Path
import os
from sklearn.linear_model import Perceptron
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from typing import List
import warnings
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import LabelEncoder

def show_correlation(columns_to_look_at: List[str], data: pd.DataFrame, n_rows=200, is_labeled=False):
    if not is_labeled:
        data = data.reset_index(drop=True)
        # Convert string columns into integer label codes (LabelEncoder does not one-hot encode)
        for column in data:
            if data[column].dtype == object:
                # print(f'Label-encoding column: {column} \r')
                data[column] = LabelEncoder().fit_transform(y=data[column].fillna('0'))
    else:
        warnings.warn("Assuming the dataset is already labeled. Note, if it has strings, then this will be blank.",
                      category=RuntimeWarning)

    fig, ax = plt.subplots(figsize=(10, 10))
    ax.matshow(data[columns_to_look_at].corr())
    plt.xticks(range(len(columns_to_look_at)), [column for column in data[columns_to_look_at].columns], rotation=45)
    plt.yticks(range(len(columns_to_look_at)), [column for column in data[columns_to_look_at].columns], rotation=45)
    plt.ylabel('Columns Y')
    plt.xlabel('Columns X')
    plt.margins(.1)
    # Set the title once: a second plt.title call would overwrite the first
    plt.title(f'Correlation between columns (rows loaded: {n_rows})', y=1.15)
    plt.show()


def show_accuracy(x, t):
    # x, t = x[np.argsort(t, axis=0).flatten()], np.sort(t, axis=0)

    plt.plot(t, label="Ground Truth")
    plt.plot(x, label="Result")
    plt.legend()
    plt.show()


def show_boundaries(boundary_diff: np.ndarray):
    xs, ys = np.meshgrid(np.linspace(-3, 6, 500), np.linspace(-3, 7, 500))
    plt.figure(figsize=(6, 6))
    plt.contourf(xs, ys, (boundary_diff > 0).reshape(xs.shape))
    plt.title("Decision Boundary")
    plt.show()


# Load data:
base_dir = str(Path().absolute())
n_rows = 20  # None for all
data = pd.read_csv(base_dir + os.sep + 'data' + os.sep + 'stack-overflow-2018-developer-survey' + os.sep +
                   'survey_results_public.csv', nrows=n_rows)

# Features of interest:
features = ['SkipMeals', 'WakeTime', 'HoursComputer', 'RaceEthnicity', 'CareerSatisfaction']  # Removed: 'JobSatisfaction' because... that's too easy
# features = ['SkipMeals', 'WakeTime', 'CareerSatisfaction']

# Filter Features:
data = data[features]

# We want to predict CareerSatisfaction
classification = 'CareerSatisfaction'
# We want this classification to be binary: -1 (neutral or dissatisfied) and +1 (satisfied),
# matching the mapping described in the Data section
replacement_keys = {'Extremely satisfied': 1, 'Neither satisfied nor dissatisfied': -1,
                    'Moderately satisfied': 1, 'Slightly dissatisfied': -1, 'Slightly satisfied': 1,
                    'Moderately dissatisfied': -1, 'Extremely dissatisfied': -1}

data = data.replace({classification: replacement_keys})  # Target is now binary

data = data.dropna(axis=0).reset_index(drop=True)  # Drop Null or nan records.
print(f'Rows was {n_rows} before dropping, but now is: {data.shape[0]}')
# Convert string columns into integer label codes (LabelEncoder does not one-hot encode)
for column in data:
    if data[column].dtype == object:
        # print(f'Label-encoding column: {column} \r')
        data[column] = LabelEncoder().fit_transform(y=data[column].fillna('0'))  # TODO save an array of LabelEncoders

# Show correlation Matrix for this data set
show_correlation(features, data, n_rows, is_labeled=True)

# Split the data into features and targets
x = pd.DataFrame.copy(data.drop(classification, axis=1))  # Exclude the classification field from the training samples
y = pd.DataFrame.copy(data.drop([f for f in features if f != classification], axis=1))
print("Split data")

# Split the features into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(x.values, y.values, test_size=0.33)


In [None]:
import numpy as np
from abc import ABC, abstractmethod

# Super class for machine learning models 

class BaseModel(ABC):
    """ Super class for ITCS Machine Learning Class"""
    
    @abstractmethod
    def train(self, X, T):
        pass

    @abstractmethod
    def use(self, X):
        pass

class Classifier(BaseModel):
    """
        Abstract class for classification 
        
        Attributes
        ==========
        meanX       ndarray
                    mean of inputs (from standardization)
        stdX        ndarray
                    standard deviation of inputs (standardization)
    """
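    def __init__(self):
        self.meanX = None
        self.stdX = None
        
    def normalize(self, X, reset_fields=False):
        """ standardize the input X with the stored (or freshly computed) mean and std """

        if not isinstance(X, np.ndarray):
            X = np.asanyarray(X)

        if reset_fields:
            # Store the statistics under the attribute names documented above
            self.meanX = np.mean(X, 0)
            std = np.std(X, 0)
            # Guard against division by zero for constant columns
            self.stdX = np.where(std == 0, 1, std)

        Xs = (X - self.meanX) / self.stdX
        return Xs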

    def _check_matrix(self, mat, name):
        if len(mat.shape) != 2:
            raise ValueError(''.join(["Wrong matrix ", name]))
        
    # add a basis
    def add_ones(self, X):
        """
            add a column basis to X input matrix
        """
        self._check_matrix(X, 'X')
        return np.hstack((np.ones((X.shape[0], 1)), X))

    ####################################################
    #### abstract functions ############################
    @abstractmethod
    def train(self, X, T):
        pass
    
    @abstractmethod
    def use(self, X):
        pass 

### B. Pocket Algorithm


In [None]:
import numpy as np

class PerceptronPocketClassifier(Classifier):

    def __init__(self, max_iterations: int, alpha: float = 0.1) -> None:
        """
        :param max_iterations: maximum number of passes over the training data
        :param alpha: learning rate used in the weight update
        """
        super().__init__()
        self.alpha = alpha  # Learning rate
        self.max_iterations = max_iterations  # training iterations
        # Weight matrix
        self.w = np.random.uniform(-1.0, 1.0, 1).reshape(-1, 1)
        self.w_pocket = np.copy(self.w)

    def _compare(self, x, targets):
        y = np.sign(x @ self.w)
        yp = np.sign(x @ self.w_pocket)

        return 1 if np.sum(y == targets) >= np.sum(yp == targets) else -1

    def train(self, x: np.ndarray, targets: np.ndarray):
        # Set Shape positions
        shape_features = x.shape[1]
        shape_num_samples = x.shape[0]
        shape_target_features = targets.shape[1]

        # Reset the w to reflect the dimensions being trained on
        self.w = np.zeros(shape_features).reshape(-1, 1)#np.random.uniform(-1.0, 1.0, shape_features).reshape(-1, shape_target_features)
        self.w_pocket = np.copy(self.w)

        # Normalize Training Data:
        x = self.normalize(x, reset_fields=True)

        for j in range(self.max_iterations):
            print(f'Iteration: {j}')
            converged = True
            # Visit every training sample once per pass, in random order
            for k in np.random.permutation(shape_num_samples):
                # self.w has shape (features, 1) and x[k] has shape (features,),
                # so the activation below is a length-1 array. On a mistake, the
                # update reshapes x[k] to (features, 1) and scales it by the
                # target so that the result matches the shape of self.w.
                y = np.transpose(self.w) @ x[k]

                if np.sign(y) != np.sign(targets[k]):
                    self.w += self.alpha * x[k].reshape(-1, 1) * targets[k].reshape(-1, 1)
                    converged = False
                    if self._compare(x, targets) > 0:
                        self.w_pocket[:] = self.w[:]

            if converged:
                print("converged at ", j)
                break

    def use(self, x: np.ndarray):
        x = self.normalize(x)
        # Return class labels (-1 / +1) rather than raw activations
        return np.sign(x @ self.w_pocket)

In [None]:
# Start training!!
clf = PerceptronPocketClassifier(1000, 0.1)
clf.train(X_train, y_train)

print(f'Predicted: \n\n {clf.use(X_test)} \n\nActual: {y_test}')
show_accuracy(clf.use(X_test), y_test)

### C. QDA
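
As a brief summary of what `get_qda` below evaluates: for each class $k$ with estimated mean $\mu_k$, covariance $\Sigma_k$, and prior $P(C_k)$, the quadratic discriminant is written here in its standard textbook form (the notation is an assumption; the notebook itself only gives the code).

$$
\delta_k(x) = -\frac{1}{2}\ln\lvert\Sigma_k\rvert - \frac{1}{2}(x-\mu_k)^\top \Sigma_k^{-1}(x-\mu_k) + \ln P(C_k)
$$

A sample $x$ is assigned to the class with the largest $\delta_k(x)$. Because each class keeps its own $\Sigma_k$, the resulting decision boundary is quadratic in $x$.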

In [None]:
from typing import List, Any
import numpy as np

# noinspection PyMethodMayBeStatic
class QDAClassifier(Classifier):
    def __init__(self):
        super().__init__()

        self.discriminant_functions = {}
        self.discriminant_function_params = {}

        self.global_mean = 0
        self.global_stds = 0

    def train(self, big_x: np.array, big_t: np.array):
        # Scale the big_x sample base
        scaled_big_x = self.normalize(big_x, reset_fields=True)
        labels = np.asarray(big_t).ravel()

        # Estimate one mean, covariance matrix and prior per class
        for label in np.unique(labels):
            # A boolean row mask keeps every feature column of the class samples
            temp_x = scaled_big_x[labels == label]
            mu = np.mean(temp_x, 0)
            sigma = np.atleast_2d(np.cov(temp_x, rowvar=False))  # (features x features)
            prior = temp_x.shape[0] / labels.shape[0]

            self.discriminant_function_params[label] = (mu, sigma, prior)

    def get_qda(self, big_x: np.array, mu, sigma, prior):
        # big_x holds one sample per row; one discriminant value is returned per sample
        sigma_inv = np.linalg.inv(sigma)
        diff_v = big_x - mu
        return - 0.5 * np.log(np.linalg.det(sigma)) \
               - 0.5 * np.sum(diff_v @ sigma_inv * diff_v, axis=1) + np.log(prior)

    def use(self, big_x):
        scaled_big_x = self.normalize(big_x)
        classes = list(self.discriminant_function_params)
        # One column of discriminant values per class
        scores = np.column_stack([self.get_qda(scaled_big_x, *self.discriminant_function_params[c])
                                  for c in classes])
        # Pick the class with the largest discriminant value for each sample
        return np.array(classes)[np.argmax(scores, axis=1)]


In [None]:
# Start training!!
clf = QDAClassifier()
clf.train(X_train, y_train)

print(f'Predicted: \n\n {clf.use(X_test)} \n\nActual: {y_test}')
show_accuracy(clf.use(X_test), y_test)


### D. LDA
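
As a brief summary of what `get_lda` below is meant to evaluate: LDA assumes a single covariance matrix $\Sigma$ shared by all classes, which reduces the discriminant to a function that is linear in $x$. The standard textbook form is shown here (the notation is an assumption; the notebook itself only gives the code).

$$
\delta_k(x) = x^\top \Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top \Sigma^{-1}\mu_k + \ln P(C_k)
$$

Here $\mu_k$ is the mean of class $k$ and $P(C_k)$ its prior. The log prior enters once per sample, and a sample is assigned to the class with the largest $\delta_k(x)$.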

In [None]:
from typing import List, Any
import numpy as np


# noinspection PyMethodMayBeStatic
class LDAClassifier(Classifier):
    def __init__(self):
        super().__init__()

        self.discriminant_functions = {}
        self.discriminant_function_params = {}

        self.global_mean = 0
        self.global_stds = 0

    def train(self, big_x: np.array, big_t: np.array):
        # Scale the big_x sample base
        scaled_big_x = self.normalize(big_x, reset_fields=True)
        labels = np.asarray(big_t).ravel()

        # A single covariance matrix, estimated from all samples, is shared by every class
        sigma = np.atleast_2d(np.cov(scaled_big_x, rowvar=False))

        # Estimate one mean and prior per class
        for label in np.unique(labels):
            # A boolean row mask keeps every feature column of the class samples
            temp_x = scaled_big_x[labels == label]
            mu = np.mean(temp_x, 0)
            prior = temp_x.shape[0] / labels.shape[0]

            self.discriminant_function_params[label] = (mu, sigma, prior)

    def get_lda(self, big_x: np.array, mu, sigma, prior):
        # big_x holds one sample per row; one discriminant value is returned per sample
        sigma_inv = np.linalg.inv(sigma)
        # The log prior is added once per sample, outside the matrix products
        return big_x @ sigma_inv @ mu - 0.5 * mu @ sigma_inv @ mu + np.log(prior)

    def use(self, big_x):
        scaled_big_x = self.normalize(big_x)
        classes = list(self.discriminant_function_params)
        # One column of discriminant values per class
        scores = np.column_stack([self.get_lda(scaled_big_x, *self.discriminant_function_params[c])
                                  for c in classes])
        # Pick the class with the largest discriminant value for each sample
        return np.array(classes)[np.argmax(scores, axis=1)]


In [None]:
# Start training!!
clf = LDAClassifier()
clf.train(X_train, y_train)

print(f'Predicted: \n\n {clf.use(X_test)} \n\nActual: {y_test}')
show_accuracy(clf.use(X_test), y_test)


### E. Logistic Regression
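
As a brief summary of the model implemented below: with $\tilde{x}$ denoting an input with a leading bias term and $w_k$ the weight column for class $k$, the class probabilities are given by the softmax function, and the weights are updated by gradient ascent on the log-likelihood. The notation is an assumption chosen to mirror the code (`_g`, `one_hot_big_t`, `alpha`).

$$
g_k(x) = \frac{\exp(w_k^\top \tilde{x})}{\sum_{j}\exp(w_j^\top \tilde{x})},
\qquad
W \leftarrow W + \alpha\, \tilde{X}^\top (T - G)
$$

Here $T$ is the one-hot target matrix, $G$ is the matrix of predicted probabilities $g_k(x_n)$, and $\alpha$ is the learning rate.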

In [None]:
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# noinspection PyMethodMayBeStatic
class LRClassifier(Classifier):
    def __init__(self):
        super().__init__()

        self.w = None
        self.binarizer = LabelBinarizer()

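    def _soft_max(self, z: np.ndarray):
        if not isinstance(z, np.ndarray):
            z = np.asarray(z)
        # Subtract the maximum along the class axis before exponentiating to avoid overflow
        f = np.exp(z - np.max(z, axis=-1, keepdims=True))
        # Normalize along the class axis for both 1-D and 2-D inputs
        return f / np.sum(f, axis=-1, keepdims=True)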

    def _g(self, big_x):
        return self._soft_max(big_x @ self.w)

    def train(self, big_x: np.ndarray, big_t: np.ndarray, iterations=1000, alpha=0.1):
        big_x = self.normalize(big_x, reset_fields=True)

        # Fix big_t to be one hot (one column for each class)
        one_hot_big_t = self.one_hot(big_t)

        n_samples = big_x.shape[0]
        n_features = big_x.shape[1]
        n_target_dims = one_hot_big_t.shape[1]
        n_bias_dims = 1
        self.w = np.random.rand(n_features + n_bias_dims, n_target_dims)

        # Add a bias column to big_x
        bias_big_x = np.hstack((np.ones((n_samples, n_bias_dims)), big_x))

        for step in range(iterations):
            y_scaled = self._g(bias_big_x)
            self.w += alpha * bias_big_x.T @ (one_hot_big_t - y_scaled)

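    def one_hot(self, big_t: np.ndarray):
        # One hot via sklearn's LabelBinarizer (binary targets give a single 0/1 column)
        self.binarizer.fit(big_t)
        labels = self.binarizer.transform(big_t)
        # Column i must correspond to binarizer.classes_[i] so the argmax in use() decodes correctly
        return np.hstack((1 - labels, labels))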

    def use(self, big_x: np.ndarray):
        big_x = self.normalize(big_x)
        n_samples = big_x.shape[0]
        n_bias_dims = 1
        bias_big_x = np.hstack((np.ones((n_samples, n_bias_dims)), big_x))

        y = self._g(bias_big_x)

        return self.binarizer.inverse_transform(np.argmax(y, axis=1))

In [None]:
# Start training!!
clf = LRClassifier()
clf.train(X_train, y_train)

print(f'Predicted: \n\n {clf.use(X_test)} \n\nActual: {y_test}')
show_accuracy(clf.use(X_test), y_test)

# IV. Experiments

Apply the classifiers to the data and discuss the results.
Please describe your code for the experiments. You may have subsections of results and discussions here.
Here follows the list of items that you should consider including:
- the classification results
- plots of classification results 
- model comparison 
- choice of evaluation metrics
- **Must partition data into training and testing**
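
For the evaluation metrics item, one possible starting point is plain accuracy together with a 2x2 confusion matrix for the -1 / +1 labels used in this notebook. The sketch below uses only NumPy; the helper names `accuracy` and `binary_confusion` and the toy arrays are assumptions made for illustration, not results from the survey data.

```python
import numpy as np


def accuracy(y_true, y_pred) -> float:
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    return float(np.mean(y_true == y_pred))


def binary_confusion(y_true, y_pred, labels=(-1, 1)) -> np.ndarray:
    """Confusion matrix: rows are true labels, columns are predicted labels."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    return np.array([[np.sum((y_true == t) & (y_pred == p)) for p in labels]
                     for t in labels])


# Hypothetical toy labels: a column vector of targets and a flat prediction array.
y_true = np.array([[1], [1], [-1], [-1], [1]])
y_pred = np.array([1, -1, -1, 1, 1])

print(accuracy(y_true, y_pred))
print(binary_confusion(y_true, y_pred))
```

Both helpers flatten their inputs, so they accept the `(n, 1)` target arrays produced by `train_test_split` above as well as flat prediction arrays.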

# Conclusions

Summarize your work here. 
Which classifier do you think is the best? 
Discuss the challenges or something that you learned. 
If you have any suggestions about the assignment, you can write about them. 

# References

List all your references here.

# Extra Credit

* [OPT 1] Search for an ordinal data set and apply your classifiers to it. 
  - Repeat the experiments on it. 
  - Do you have different observations from the previous results? 
  - Were you able to observe what we discussed in class about logistic regression? 
  - For a full extra credit point, you need to discuss all bullet points in the Results section.     


* [OPT 2] Partition your data into five sets. Selecting one test set and the other for training, repeat your experiments and observe/analyze the 5 different training/testing errors.  
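
The five-way partition described in OPT 2 could be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the generator name `five_fold_indices`, the fixed seed, and the sample count of 23 are all hypothetical choices for the example.

```python
import numpy as np


def five_fold_indices(n_samples: int, n_folds: int = 5, seed: int = 0):
    """Yield (train_idx, test_idx) pairs; each fold serves as the test set once."""
    rng = np.random.default_rng(seed)
    # Shuffle the sample indices, then cut them into n_folds nearly equal parts
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx


# Hypothetical usage with 23 samples: print the size of each split.
for fold, (train_idx, test_idx) in enumerate(five_fold_indices(23)):
    print(fold, len(train_idx), len(test_idx))
```

Inside the loop, a classifier from this notebook could be trained on `x.values[train_idx]` and evaluated on `x.values[test_idx]`, recording the training and testing error of each fold.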

## Grading

DO NOT forget to submit your data! Your notebook is supposed to run without errors when your code is executed.

To help our TA's grading, please make an explicit section for each grading criterion. 
Again, this is a **writing assignment**. Please don't forget to properly explain your code and results using Markdown cells. 


points | | description
--|--|:--
5 | Overview | states the objective and the approach
15 | Data | 
 | 5 | description
 | 5 | plots for understanding or analysis
 | 5 | preliminary observation
25 | Methods | 
 | 10 | Summary of Classification models
 | 5 | Explanation of codes
 | 10 | Pocket, LDA, QDA, Logistic Regression
40 | Experiments | 
 | 5 | Discussion about evaluation metrics
 | 5 | Discussion about train and test accuracies
 | 20 | plots for results (5 for each algorithm)
 | 10 | Discussions about classification model comparison
5 | | Conclusions
5 | | References
5 | | Grammar and spelling error (Proofread please)