# Initialization

In [2]:
import numpy as np
from matplotlib import pyplot as plt

In [3]:
dataFolder = "./data"
p1 = { "testDir": dataFolder + "/p1_test.csv", "trainDir": dataFolder + "/p1_train.csv" }
p2 = { "testDir": dataFolder + "/p2_test.csv", "trainDir": dataFolder + "/p2_train.csv" }
p3 = { "testDir": dataFolder + "/p3_test.csv", "trainDir": dataFolder + "/p3_train.csv" }

p1["test"] = np.genfromtxt(p1["testDir"], delimiter=',')
p1["train"] = np.genfromtxt(p1["trainDir"], delimiter=',')
p2["test"] = np.genfromtxt(p2["testDir"], delimiter=',')
p2["train"] = np.genfromtxt(p2["trainDir"], delimiter=',')
p3["test"] = np.genfromtxt(p3["testDir"], delimiter=',')
p3["train"] = np.genfromtxt(p3["trainDir"], delimiter=',')

# P1 (Regression Analysis)

In this problem, the task is to predict the current health of an organism (the target variable) from the readings of two biological sensors that measure its bio-markers. A negative reading indicates a value below the average case.

With this data, you are expected to try out linear regression models on the training data and report the following metrics on the test split:
- Mean Squared Error
- Mean Absolute Error
- p-value from a significance test

**DATA:** `p1_train.csv`, `p1_test.csv`
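As a reference for the first two metrics listed above, here is a minimal sketch of the conventional, sample-normalised definitions. The function names `mean_squared_error` and `mean_absolute_error` and the toy arrays are illustrative assumptions, not part of the assignment; the notebook's own `mse` and `mae` helpers return unnormalised sums instead.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared residuals."""
    residual = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(residual ** 2))

def mean_absolute_error(y_true, y_pred):
    """Average of the absolute residuals."""
    residual = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(residual)))

# Toy example (made-up numbers, only to show the arithmetic)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 5.0, 0.0])
print(mean_squared_error(y_true, y_pred))   # (0 + 0 + 4 + 16) / 4 = 5.0
print(mean_absolute_error(y_true, y_pred))  # (0 + 0 + 2 + 4) / 4 = 1.5
```

Normalising by the number of samples makes train and test values comparable even when the two splits have different sizes.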

In [4]:
p1["train"].shape

(10000, 3)

In [56]:
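# Squared-error loss: half the sum of squared residuals.
# Note: this is not normalised by the number of samples; multiply the
# result by 2 / len(Y) to obtain the conventional mean squared error.
def mse(X, Y, W):
    residual = X @ W - Y
    return (1/2) * residual @ residual

# Absolute-error loss: sum of absolute residuals.
# Note: divide the result by len(Y) to obtain the conventional mean absolute error.
def mae(X, Y, W):
    return np.sum(np.abs(X @ W - Y))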

# Split the data into a feature matrix (with a leading bias column) and the target vector
def splitData(data):
    X = np.c_[np.ones(data.shape[0]), data[:, :2]]
    Y = data[:, -1]
    return X, Y

In [57]:
X, Y = splitData(p1["train"])

# Initialise the parameters to the zero vector (baseline before fitting)
W = np.zeros(X.shape[1])

In [58]:
print(mse(X, Y, W))
print(mae(X, Y, W))

3389272.628302881
217168.26712067553


In [65]:
# Closed-form least-squares fit via the Moore-Penrose pseudo-inverse
W = np.linalg.pinv(X) @ Y

print("MSE (train-split): ", mse(X, Y, W))
print("MAE (train-split): ", mae(X, Y, W))

MSE (train-split):  25298.423078218584
MAE (train-split):  17917.53209393991


In [71]:
from scipy.stats import ttest_ind

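X_test, Y_test = splitData(p1["test"])

print("a) MSE: ", mse(X_test, Y_test, W))
print("b) MAE: ", mae(X_test, Y_test, W))
# ttest_ind runs a two-sample t-test between each training sensor column and the
# training target; it compares their means and does not test the fitted coefficients.
print("c) p-value -> sensor 1: ", ttest_ind(X[:, 1], Y).pvalue, ", sensor 2: ", ttest_ind(X[:, 2], Y).pvalue)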

a) MSE:  12616.090009878131
b) MAE:  8995.400265491306
c) p-value -> sensor 1:  9.492378356739791e-30 , sensor 2:  3.0848891994991416e-26
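The p-values above come from a two-sample t-test between each sensor column and the target. A more common significance test for linear regression checks whether each fitted coefficient differs from zero. The following is a minimal sketch of that alternative, assuming homoscedastic Gaussian noise; the function name `coefficient_p_values` and the synthetic data are hypothetical and not part of the notebook.

```python
import numpy as np
from scipy import stats

def coefficient_p_values(X, Y):
    """Two-sided t-test p-values for ordinary least-squares coefficients.

    X is assumed to already contain a bias column, as produced by splitData.
    """
    n, k = X.shape
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    residuals = Y - X @ W
    # Unbiased estimate of the noise variance
    sigma2 = residuals @ residuals / (n - k)
    # Standard errors from the diagonal of sigma^2 (X^T X)^{-1}
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    t_stat = W / std_err
    p_values = 2 * stats.t.sf(np.abs(t_stat), df=n - k)
    return W, t_stat, p_values

# Synthetic check: the first feature matters, the second does not
rng = np.random.default_rng(0)
n = 500
features = rng.normal(size=(n, 2))
X_demo = np.c_[np.ones(n), features]
Y_demo = 1.0 + 3.0 * features[:, 0] + rng.normal(size=n)
W_demo, t_demo, p_demo = coefficient_p_values(X_demo, Y_demo)
print(p_demo)
```

A small p-value for a coefficient suggests that the corresponding sensor contributes to the prediction beyond what noise alone would explain.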


# P2 (Regression Analysis)

Here, you are expected to predict the lifespan of the above organism given the data from three sensors. In this case, the model is not linear.

You are expected to try several (at least 3) non-linear regression models on the train split and report the following metrics on the test split:
- Mean Squared Error
- Mean Absolute Error
- p-value out of significance test

**DATA:** `p2_train.csv`, `p2_test.csv`

In [4]:
p2["train"]

array([[ 6.50199562e+00, -8.53698298e+00,  3.42293467e+00,
         1.19980220e+05],
       [ 1.32838341e+00,  8.94357801e+00, -8.14530720e+00,
         2.98902250e+04],
       [ 1.61478186e-01, -7.92835138e+00,  1.62892423e+00,
         3.24557940e+03],
       ...,
       [ 5.15542189e+00,  5.50082251e+00,  7.80498384e+00,
         1.65778154e+05],
       [ 7.41019691e+00, -3.09607941e+00,  4.39444446e+00,
         2.12850414e+05],
       [ 8.65839198e+00,  2.12902551e+00, -2.23757771e+00,
         3.96440751e+05]])
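The notebook does not yet contain a P2 model. One candidate non-linear model is polynomial basis expansion followed by ordinary least squares. The sketch below is only an illustration under that assumption: `polynomial_features` is a hypothetical helper, and the data is synthetic rather than taken from `p2["train"]`.

```python
import numpy as np
from itertools import combinations_with_replacement

def polynomial_features(X, degree):
    """Expand X with a bias column and all monomials up to the given degree."""
    n_samples, n_features = X.shape
    columns = [np.ones(n_samples)]
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(n_features), d):
            columns.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(columns)

# Synthetic data with three "sensors" and a quadratic target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 1.0 + 2.0 * X_demo[:, 0] ** 2 - X_demo[:, 1] * X_demo[:, 2]

Phi_demo = polynomial_features(X_demo, degree=2)   # 1 + 3 + 6 = 10 columns
W_poly = np.linalg.pinv(Phi_demo) @ y_demo         # least-squares fit
pred_demo = Phi_demo @ W_poly
print(Phi_demo.shape, np.max(np.abs(pred_demo - y_demo)))
```

The model stays linear in its parameters, so the same pseudo-inverse solution used in P1 still applies; only the feature matrix changes. On the real data the features would be `data[:, :3]` and the target `data[:, -1]`.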

# P3 (Multi-class classification)

We have data from 10 sensors fitted in an industrial plant. There are five classes indicating which product is being produced. The task is to predict the product being produced from the observations of these 10 sensors.

Given this, you are expected to implement:
- Bayes classifiers with 0-1 loss, assuming Normal, exponential, and GMM (with diagonal covariances) class-conditional densities. For the GMMs, code up the EM algorithm.
- A linear classifier using the one-vs-rest approach.
- A multi-class logistic regressor trained with gradient descent.
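For the first item, a Bayes classifier under 0-1 loss predicts the class with the largest posterior, that is, the largest log-prior plus log class-conditional density. A minimal sketch with Normal class conditionals follows. The function names, the small ridge added to each covariance, and the synthetic two-class data are assumptions made for illustration.

```python
import numpy as np

def fit_gaussian_bayes(X, y):
    """Estimate a log-prior, mean and covariance for every class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mean = Xc.mean(axis=0)
        # Small ridge keeps the covariance invertible (an assumption of this sketch)
        cov = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        params[c] = (np.log(len(Xc) / len(X)), mean, cov)
    return params

def predict_gaussian_bayes(params, X):
    """Pick the class with the highest posterior score (0-1 loss decision rule)."""
    classes = np.array(list(params.keys()))
    scores = []
    for c in classes:
        log_prior, mean, cov = params[c]
        diff = X - mean
        _, logdet = np.linalg.slogdet(cov)
        maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
        # The constant d * log(2 * pi) is the same for all classes and is dropped
        scores.append(log_prior - 0.5 * (logdet + maha))
    return classes[np.argmax(np.stack(scores, axis=1), axis=1)]

# Synthetic two-class example
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(-3.0, 1.0, size=(100, 2)),
                    rng.normal(3.0, 1.0, size=(100, 2))])
y_demo = np.repeat([1, 2], 100)
params_demo = fit_gaussian_bayes(X_demo, y_demo)
accuracy_demo = np.mean(predict_gaussian_bayes(params_demo, X_demo) == y_demo)
print(accuracy_demo)
```

Swapping the Normal density for an exponential or a GMM density only changes how the per-class log-likelihood term is computed; the decision rule stays the same.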

The metrics to be computed are:
- Classification accuracy
- Confusion matrix
- Class-wise F1 score
- ROC curves for any pair of classes
- Likelihood curve for EM with different choices for the number of mixture components (a hyper-parameter)
- Empirical risk on the train and test data when using the logistic regressor
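The first three metrics can all be derived from the confusion matrix. The sketch below assumes labels have been mapped to integers `0..K-1` (the raw P3 labels appear to be floats starting at 1, so they would need `astype(int) - 1` first). The helper names and the toy labels are illustrative.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns the predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

def classwise_f1(cm):
    """F1 = 2 TP / (2 TP + FP + FN) for every class."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    denom = 2 * tp + fp + fn
    return np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)

# Toy labels (made up for illustration)
y_true_demo = np.array([0, 0, 1, 1, 2, 2])
y_pred_demo = np.array([0, 1, 1, 1, 2, 0])
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, n_classes=3)
accuracy_demo = np.trace(cm_demo) / cm_demo.sum()
f1_demo = classwise_f1(cm_demo)
print(cm_demo)
print(accuracy_demo, f1_demo)
```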

**DATA:** `p3_train.csv`, `p3_test.csv`

In [5]:
p3["train"]

array([[-0.66524016, -1.44494482, -0.50279241, ..., -0.51264941,
         1.14855255,  1.        ],
       [ 2.77439505,  1.60670577,  0.55340908, ..., -0.3560387 ,
         0.67794381,  1.        ],
       [-0.62586937, -0.26507389,  0.70141731, ...,  0.10409147,
        -0.98485438,  5.        ],
       ...,
       [-1.7652821 , -0.13820513, -2.05887396, ..., -0.16761556,
         1.92810659,  5.        ],
       [ 1.31672603,  0.29958778,  0.40964062, ..., -0.50250074,
        -0.77139579,  2.        ],
       [ 0.41895968,  1.06039492, -1.1435325 , ...,  1.16254537,
         0.12443366,  4.        ]])
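For the multi-class logistic regressor, a minimal sketch of softmax regression trained with batch gradient descent is given below. The empirical risk tracked here is the mean cross-entropy. The learning rate, iteration count, function names, and synthetic three-class data are assumptions of this sketch; labels are assumed to be integers `0..K-1`.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # stabilise the exponentials
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def empirical_risk(W, X, Y_onehot):
    """Mean cross-entropy over the samples."""
    P = softmax(X @ W)
    return -np.mean(np.sum(Y_onehot * np.log(P + 1e-12), axis=1))

def fit_softmax(X, y, n_classes, lr=0.1, n_iter=500):
    Y_onehot = np.eye(n_classes)[y]
    W = np.zeros((X.shape[1], n_classes))
    history = []
    for _ in range(n_iter):
        P = softmax(X @ W)
        grad = X.T @ (P - Y_onehot) / X.shape[0]   # gradient of the mean cross-entropy
        W -= lr * grad
        history.append(empirical_risk(W, X, Y_onehot))
    return W, history

# Synthetic three-class data with a bias column
rng = np.random.default_rng(0)
centers = np.array([[0.0, 4.0], [-4.0, -3.0], [4.0, -3.0]])
y_demo = np.repeat(np.arange(3), 50)
X_demo = np.c_[np.ones(150), centers[y_demo] + rng.normal(size=(150, 2))]
W_demo, risk_history = fit_softmax(X_demo, y_demo, n_classes=3)
accuracy_demo = np.mean(np.argmax(X_demo @ W_demo, axis=1) == y_demo)
print(accuracy_demo, risk_history[0], risk_history[-1])
```

Evaluating `empirical_risk` on a held-out split with the same `W` gives the test-risk curve the assignment asks for.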

# P4 (Multi-class classification)

In this problem, we consider an image dataset called Kannada-MNIST. This dataset contains 60,000 images (6,000 per class) of digits written in Kannada, a South Indian language. The task is to build a 10-class classifier for the digits.

You are supposed to test the following classification schemes:
- Naive Bayes with Normal class-conditional densities
- Logistic regressor with gradient descent
- Multi-class Bayes classifier with GMMs (diagonal covariances) as class-conditional densities
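The GMM-based scheme needs an EM fit per class. Below is a minimal sketch of EM for a mixture of Gaussians with diagonal covariances, returning the log-likelihood after each iteration so that a likelihood curve can be drawn for different numbers of components. Initialisation from random data points, the variance floor of `1e-6`, and all names are assumptions of this sketch.

```python
import numpy as np

def log_gaussian_diag(X, means, variances):
    """Log density of every sample under every diagonal Gaussian, shape (n, k)."""
    diff = X[:, None, :] - means[None, :, :]
    log_det = np.sum(np.log(2 * np.pi * variances), axis=1)
    return -0.5 * (log_det[None, :] + np.sum(diff ** 2 / variances[None, :, :], axis=2))

def fit_gmm_diag(X, n_components, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, n_components, replace=False)].copy()
    variances = np.tile(X.var(axis=0) + 1e-6, (n_components, 1))
    weights = np.full(n_components, 1.0 / n_components)
    log_likelihoods = []
    for _ in range(n_iter):
        # E-step: responsibilities computed in the log domain
        log_joint = np.log(weights)[None, :] + log_gaussian_diag(X, means, variances)
        log_norm = np.logaddexp.reduce(log_joint, axis=1)
        resp = np.exp(log_joint - log_norm[:, None])
        log_likelihoods.append(float(log_norm.sum()))
        # M-step: weighted updates of weights, means and variances
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        diff = X[:, None, :] - means[None, :, :]
        variances = np.einsum("nk,nkd->kd", resp, diff ** 2) / nk[:, None] + 1e-6
    return weights, means, variances, log_likelihoods

# Synthetic data with two clusters; one likelihood curve per component count
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(-4.0, 1.0, size=(150, 2)),
                    rng.normal(4.0, 1.0, size=(150, 2))])
weights_demo, means_demo, variances_demo, ll_demo = fit_gmm_diag(X_demo, 2)
curves_demo = {k: fit_gmm_diag(X_demo, k)[3] for k in (1, 2, 3)}
print({k: round(v[-1], 2) for k, v in curves_demo.items()})
```

For classification, one such mixture would be fitted to the samples of each class, and its log-likelihood would play the role of the class-conditional density in the Bayes decision rule.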

Report the following metrics on the test data:
- Classification accuracy
- Confusion matrix
- Class-wise F1 score
- ROC curves for any pair of classes
- Likelihood curve for EM with different choices for the number of mixture components (a hyper-parameter)
- Empirical risk on the train and test data when using the logistic regressor
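An ROC curve for a pair of classes can be obtained by keeping only the samples of those two classes, treating one as positive, and sweeping a threshold over a score such as the posterior of the positive class. The sketch below assumes binary labels in `{0, 1}` and distinct scores (ties are not merged); the names and toy numbers are illustrative.

```python
import numpy as np

def roc_curve(scores, labels):
    """False-positive and true-positive rates as the threshold decreases."""
    order = np.argsort(-scores, kind="stable")
    sorted_labels = labels[order]
    tps = np.cumsum(sorted_labels == 1)
    fps = np.cumsum(sorted_labels == 0)
    tpr = np.r_[0.0, tps / tps[-1]]
    fpr = np.r_[0.0, fps / fps[-1]]
    return fpr, tpr

def area_under_curve(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

# Toy scores for a positive class (1) and a negative class (0)
scores_demo = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4])
labels_demo = np.array([1, 1, 0, 1, 0, 0])
fpr_demo, tpr_demo = roc_curve(scores_demo, labels_demo)
auc_demo = area_under_curve(fpr_demo, tpr_demo)
print(auc_demo)  # 8 of the 9 positive/negative pairs are ranked correctly
```

Plotting `tpr_demo` against `fpr_demo` with `plt.plot` gives the curve itself.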

In this problem, first split the data into train and test parts with train:test ratios of **20:80**, **30:70**, **50:50**, **70:30**, and **90:10**. For each ratio, train the algorithms on the train part, evaluate them on the test part, and record your observations.
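The splits described above can be produced with a shuffled index. This is a minimal sketch: `train_test_split` is a hypothetical helper written here (not the scikit-learn function), the split is random rather than stratified by class, and the arrays are placeholders for the image features and labels.

```python
import numpy as np

def train_test_split(X, y, train_fraction, seed=0):
    """Shuffle once, then cut the data at the requested train fraction."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_train = int(round(train_fraction * len(X)))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Placeholder data standing in for the flattened images and their labels
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

split_sizes = []
for fraction in (0.2, 0.3, 0.5, 0.7, 0.9):
    X_tr, y_tr, X_te, y_te = train_test_split(X_demo, y_demo, fraction)
    split_sizes.append((len(X_tr), len(X_te)))
print(split_sizes)
```

Reusing the same seed for every ratio keeps the experiments reproducible across runs.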

**DATA:** `images.zip`

# P5 (Multi-class classification)

In this part, the data from the previous problem is "condensed" (using PCA) to **10 dimensions**. Repeat the above experiment with all the models and metrics and record your observations.

**DATA:** `KannadaMNISTPCA.csv`
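The provided file is already reduced, so no PCA step is required for this part. For reference, a projection of this kind can be sketched with a singular value decomposition of the centred data. The function name `pca_project` and the synthetic 50-dimensional data are assumptions; how the provided CSV was actually generated is not stated.

```python
import numpy as np

def pca_project(X, n_components):
    """Project centred data onto its top principal directions."""
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]                      # rows are principal directions
    explained = float((S[:n_components] ** 2).sum() / (S ** 2).sum())
    return Xc @ components.T, components, explained

# Synthetic data that lies close to a 10-dimensional subspace
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 10))
mixing = rng.normal(size=(10, 50))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(300, 50))

Z_demo, components_demo, explained_demo = pca_project(X_demo, 10)
print(Z_demo.shape, explained_demo)
```

The `explained` ratio indicates how much of the total variance survives the reduction, which is useful context when comparing P5 results against P4.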