# Classification
## Analysis of Data
The dataset has 8 input features and 1 target column ("Rings") to be predicted. The "Sex" column takes the values {M, F, I};
every other column is numerical. The "Sex" column therefore needs to be one-hot encoded.
The "Rings" column is also one-hot encoded, because the task is treated as a classification problem.
## Preprocessing Steps
Checked for duplicate and null values.  
Applied one-hot encoding to the "Rings" and "Sex" columns for classification.
"Rings" is encoded into Rings_1 through Rings_29 (28 distinct values occur in the data) and "Sex" into Sex_M, Sex_F and Sex_I.
## Information about code
The code follows an object-oriented approach: each model is implemented as a class. To build a particular model, an object of the
corresponding class is created and its `fit` function is called to train it.
## Code Approach:
Logistic Regression: uses the softmax function, because this is multiclass classification.   
Naive Bayes: fits a Gaussian to each feature per class, then applies Bayes' theorem under the assumption that all features are independent given the class.
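
In symbols, the two approaches can be summarised as below. The notation ($z$ for the class scores, $K$ classes, $d$ features, $\mu_{jk}$ and $\sigma_{jk}$ for the per-class mean and standard deviation of feature $j$) is introduced here for illustration and does not appear in the code.

```latex
\mathrm{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}},
\qquad
P(y = k \mid x) \;\propto\; P(y = k) \prod_{j=1}^{d}
  \frac{1}{\sqrt{2\pi}\,\sigma_{jk}}
  \exp\!\left(-\frac{(x_j - \mu_{jk})^2}{2\sigma_{jk}^2}\right)
```

In both cases the predicted class is the $k$ with the largest value.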
## Test and train errors

| Algorithm | Features | Prediction | Accuracy (Train) | Accuracy (Test) |
| :---: | :---: | :---: | :---: | :---: |
| Univariate Logistic Regression | Length | Rings (20 vs not 20) | 0.9916201117318436 | 0.9960095770151636 |
| Multivariate Logistic Regression | All except Rings | Rings_1 to Rings_29 | 0.22825219473264166 | 0.24421388667198723 |
| Univariate Naive Bayes | Length | Rings (20 vs not 20) | 0.9916201117318436 | 0.9960095770151636 |
| Multivariate Naive Bayes | All except Rings | Rings_1 to Rings_29 | 0.25591140377132593 | 0.29571984435797666 |
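
The "20 vs not 20" rows are easier to interpret next to a majority-class baseline. According to the class counts printed in the preprocessing output, only 26 of the 4177 samples have 20 rings. A minimal sketch of that arithmetic (the variable names are illustrative):

```python
# Counts taken from the notebook output: 4177 samples, 26 of them with Rings == 20.
n_samples = 4177
n_rings_20 = 26

# Accuracy of a classifier that always predicts "not 20" on the full dataset.
baseline = (n_samples - n_rings_20) / n_samples
print(round(baseline, 4))  # 0.9938
```

An accuracy around 0.99 is therefore close to what always answering "not 20" would achieve, so these rows say little on their own about how well Length separates the two classes.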

# Preprocessing

In [230]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import math


Loading data

In [231]:
main_df = pd.read_csv("./abalone.data",names=["Sex","Length","Diameter","Height","Whole weight","Shucked weight","Viscera weight","Shell weight","Rings"])
main_df.head()

Unnamed: 0,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7


Data description

In [232]:
main_df.describe()

Unnamed: 0,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
count,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0
mean,0.523992,0.407881,0.139516,0.828742,0.359367,0.180594,0.238831,9.933684
std,0.120093,0.09924,0.041827,0.490389,0.221963,0.109614,0.139203,3.224169
min,0.075,0.055,0.0,0.002,0.001,0.0005,0.0015,1.0
25%,0.45,0.35,0.115,0.4415,0.186,0.0935,0.13,8.0
50%,0.545,0.425,0.14,0.7995,0.336,0.171,0.234,9.0
75%,0.615,0.48,0.165,1.153,0.502,0.253,0.329,11.0
max,0.815,0.65,1.13,2.8255,1.488,0.76,1.005,29.0


Since the 'Sex' column is categorical, it is one-hot encoded.

In [233]:
main_df = pd.get_dummies(main_df, columns=['Sex'])
main_df.head()

Unnamed: 0,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings,Sex_F,Sex_I,Sex_M
0,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15,0,0,1
1,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7,0,0,1
2,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9,1,0,0
3,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10,0,0,1
4,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7,0,1,0


Checking for null/empty values

In [234]:
main_df.isna().sum()

Length            0
Diameter          0
Height            0
Whole weight      0
Shucked weight    0
Viscera weight    0
Shell weight      0
Rings             0
Sex_F             0
Sex_I             0
Sex_M             0
dtype: int64

In [235]:
(main_df == "?").sum()

Length            0
Diameter          0
Height            0
Whole weight      0
Shucked weight    0
Viscera weight    0
Shell weight      0
Rings             0
Sex_F             0
Sex_I             0
Sex_M             0
dtype: int64

In [236]:
main_df['Rings'].unique().shape

(28,)

In [237]:
main_df['Rings'].describe()

count    4177.000000
mean        9.933684
std         3.224169
min         1.000000
25%         8.000000
50%         9.000000
75%        11.000000
max        29.000000
Name: Rings, dtype: float64

This is treated as a classification problem, so the target column 'Rings' is one-hot encoded as well.

In [238]:
main_df = pd.get_dummies(main_df, columns=['Rings'])

In [239]:
print(main_df.sum())

Length            2188.7150
Diameter          1703.7200
Height             582.7600
Whole weight      3461.6560
Shucked weight    1501.0780
Viscera weight     754.3395
Shell weight       997.5965
Sex_F             1307.0000
Sex_I             1342.0000
Sex_M             1528.0000
Rings_1              1.0000
Rings_2              1.0000
Rings_3             15.0000
Rings_4             57.0000
Rings_5            115.0000
Rings_6            259.0000
Rings_7            391.0000
Rings_8            568.0000
Rings_9            689.0000
Rings_10           634.0000
Rings_11           487.0000
Rings_12           267.0000
Rings_13           203.0000
Rings_14           126.0000
Rings_15           103.0000
Rings_16            67.0000
Rings_17            58.0000
Rings_18            42.0000
Rings_19            32.0000
Rings_20            26.0000
Rings_21            14.0000
Rings_22             6.0000
Rings_23             9.0000
Rings_24             2.0000
Rings_25             1.0000
Rings_26            

# Common functions

The softmax function below returns the softmax values for the input vector lst. Note that it overwrites lst in place.

In [240]:
from math import exp


def softmax(lst):
    for x in range(len(lst)):
        lst[x] = exp(lst[x])
    e_sum = sum(lst)
    for x in range(len(lst)):
        lst[x] = lst[x]/e_sum
    return lst


In [241]:
print(sum(softmax([1,2,3])))

1.0
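
The version above exponentiates the raw scores, and `math.exp` raises an `OverflowError` for large inputs. A common remedy is to subtract the maximum score before exponentiating, which leaves the result mathematically unchanged. The following is a minimal NumPy sketch of that idea; the name `stable_softmax` is hypothetical and not part of the notebook.

```python
import numpy as np

def stable_softmax(z):
    """Softmax of a 1-D array, shifted by its maximum to avoid overflow."""
    z = np.asarray(z, dtype=np.float64)
    shifted = z - np.max(z)      # the largest exponent becomes exp(0) = 1
    e = np.exp(shifted)          # new array; the input is not modified
    return e / e.sum()

# Scores this large would overflow without the shift.
probs = stable_softmax([1000.0, 1001.0, 1002.0])
print(np.isfinite(probs).all())  # True
```

Unlike the in-place version, this returns a new array, so the caller's scores are left untouched.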


Accuracy function

In [242]:
def accuracy(a,b):
    x,y = a.shape
    count = 0
    for i in range(x):
        if(a[i][b[i]] == 1):
            count = count + 1
    return count*1.0/x
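
Here `a` holds the one-hot true labels and `b` the predicted class indices. The same computation can be written without the Python loop by using NumPy integer indexing; this is a sketch with a hypothetical name, `accuracy_vectorised`, not a replacement used elsewhere in the notebook.

```python
import numpy as np

def accuracy_vectorised(one_hot, pred_idx):
    """Fraction of rows whose predicted column holds the 1 of the one-hot label."""
    one_hot = np.asarray(one_hot)
    pred_idx = np.asarray(pred_idx)
    rows = np.arange(one_hot.shape[0])
    # Pick, for every row, the label entry at the predicted column.
    return float(np.mean(one_hot[rows, pred_idx] == 1))

labels = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])
preds = np.array([0, 1, 0, 0])
print(accuracy_vectorised(labels, preds))  # 0.75
```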

Function to split data into training and testing

In [243]:
def train_test_split(X,y,train_size):
    a = int(X.shape[0]*train_size)
    b = X.shape[0]-a
    return X[:a],X[b:],y[:a],y[b:]
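
One detail of the function above is worth noting: the test slice `X[b:]` starts at `len(X) - a`, not at `a`. With 4177 rows and `train_size=0.6`, `a` is 2506 and `b` is 1671, so the returned test rows overlap the training rows. The accuracies reported in this notebook were produced with the function as written. A minimal sketch of a non-overlapping split, under the assumption that an in-order split is what is wanted (the name `ordered_split` is hypothetical):

```python
import numpy as np

def ordered_split(X, y, train_size):
    """Split in row order: the first `train_size` fraction trains, the rest tests."""
    a = int(X.shape[0] * train_size)
    return X[:a], X[a:], y[:a], y[a:]

X_demo = np.arange(10).reshape(10, 1)
y_demo = np.arange(10)
tr_X, te_X, tr_y, te_y = ordered_split(X_demo, y_demo, 0.6)
print(len(tr_X), len(te_X))  # 6 4
```

Because the rows are taken in file order, shuffling them before splitting may also be worth considering.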

Gaussian probability function

In [244]:
def GPF(mu,sg,x):
    if(sg==0):
        return 1
    a = abs((x-mu)*1.0/sg)
    a = exp(-(a**2)/2)
    b = math.sqrt(2*math.pi)*sg
    return a/b

In [245]:
GPF(0,1,0)

0.3989422804014327

# Logistic Regression

In [246]:
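class LogReg:
    def __init__(self):
        # Weight matrix of shape (n_features + 1, n_classes); set by fit().
        self.w = None

    def apply_softmax(self,X):
        # Replace each row of X with its softmax, in place.
        a,b = X.shape
        for x in range(a):
            X[x] = softmax(X[x])
        return X

    def fit(self,w_ini,x,y,lr,itrs):
        # Note: lr is accepted but not applied; each update subtracts the
        # averaged gradient directly (an effective step size of 1).
        a,b = x.shape
        x0 = np.ones((a,1))          # bias column
        X = np.hstack((x0,x))
        w = w_ini
        for i in range(itrs):
            Z = np.matmul(X,w)
            Z = self.apply_softmax(Z)
            diff = Z - y             # predicted probabilities minus one-hot labels
            grad = diff.T @ X
            w = w - grad.T/X.shape[0]
        # Store the weights on the instance so each model keeps its own.
        self.w = w
        return self.w

    def predict(self,x):
        a,b = x.shape
        x0 = np.ones((a,1))          # bias column
        X = np.hstack((x0,x))
        sf = X @ self.w
        sf = self.apply_softmax(sf)
        return np.argmax(sf,axis=1)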
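
In `fit` above, the `lr` argument does not enter the weight update. For comparison, the following is a minimal sketch of a single gradient-descent step on the mean cross-entropy loss in which the step size is applied. The names `softmax_rows` and `gd_step` and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax, shifted by each row's maximum for numerical stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def gd_step(W, X, Y, lr):
    """One gradient-descent step on the mean cross-entropy loss.

    W: (d, k) weights, X: (n, d) inputs including the bias column,
    Y: (n, k) one-hot labels, lr: step size.
    """
    P = softmax_rows(X @ W)
    grad = X.T @ (P - Y) / X.shape[0]
    return W - lr * grad

# Tiny illustrative problem: two points (bias column first), two classes.
X_demo = np.array([[1.0, 0.0], [1.0, 1.0]])
Y_demo = np.array([[1.0, 0.0], [0.0, 1.0]])
W_demo = np.zeros((2, 2))
for _ in range(200):
    W_demo = gd_step(W_demo, X_demo, Y_demo, lr=0.5)
print(np.argmax(X_demo @ W_demo, axis=1))  # [0 1]
```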


# Univariate Logistic Regression

In [247]:
X = main_df['Length'].to_numpy().astype(np.float64)
y = main_df['Rings_20'].to_numpy().astype(np.float64)
X = np.reshape(X,(X.shape[0],1))

# Two-column one-hot target: column 0 is "20 rings", column 1 is "not 20".
y = np.vstack((y,(y+1)%2)).T
train_X, test_X, train_y, test_y = train_test_split(X, y, train_size=0.6)

Univariate Logistic model

In [248]:
ulr_model = LogReg()
w = ulr_model.fit(np.zeros((train_X.shape[1]+1,train_y.shape[1])),train_X,train_y,0.0000001,300)

Train accuracy

In [249]:
print(accuracy(train_y,ulr_model.predict(train_X)))

0.9916201117318436


Test accuracy

In [250]:
print(accuracy(test_y,ulr_model.predict(test_X)))

0.9960095770151636


# Multivariate Logistic Regression

In [251]:
X = main_df.iloc[:,:10].to_numpy().astype(np.float64)
y = main_df.iloc[:,10:].to_numpy().astype(np.float64)

train_X, test_X, train_y, test_y = train_test_split(X, y, train_size=0.6)

Multivariate Logistic Regression model

In [252]:
mlr_model = LogReg()
w = mlr_model.fit(np.zeros((train_X.shape[1]+1,train_y.shape[1])),train_X,train_y,0.0000001,300)

Training accuracy

In [253]:
print(accuracy(train_y,mlr_model.predict(train_X)))

0.22825219473264166


Testing accuracy

In [254]:
print(accuracy(test_y,mlr_model.predict(test_X)))

0.24421388667198723


# Naive Bayes

In [255]:
class NB:
    def __init__(self):
        # Per-class feature means and standard deviations, class priors,
        # and number of classes; all set by fit().
        self.gauss_fit_mean = None
        self.gauss_fit_std = None
        self.y_prob = None
        self.b = 0

    def gauss_fit_class(self,X,y,c):
        # Select the rows of X whose one-hot label has class c set.
        a = X[y[:,c]==1]
        return np.mean(a,axis=0), np.std(a,axis=0)

    def bayesThm(self,mu,sg,x,p):
        # Unnormalised posterior: prior p times the product of the
        # per-feature Gaussian likelihoods.
        ans = 1
        for xi in range(len(x)):
            ans = GPF(mu[xi],sg[xi],x[xi])*ans
        return ans*p

    def fit(self,X,y):
        a = X.shape[0]
        self.b = y.shape[1]
        self.gauss_fit_mean = np.zeros((self.b,X.shape[1]))
        self.gauss_fit_std = np.zeros((self.b,X.shape[1]))
        for i in range(self.b):
            self.gauss_fit_mean[i],self.gauss_fit_std[i] = self.gauss_fit_class(X,y,i)
        self.y_prob = (np.sum(y,axis=0))/a   # class priors
        # Return the predictions on the training data, as before.
        return self.predict(X)

    def predict(self,X):
        a = X.shape[0]
        prob = np.ones((a,self.b))
        for i in range(a):
            for j in range(self.b):
                prob[i][j] = self.bayesThm(self.gauss_fit_mean[j],self.gauss_fit_std[j],X[i],self.y_prob[j])
        return np.argmax(prob,axis=1)
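
`bayesThm` multiplies one Gaussian density per feature. With many features such a product can underflow toward zero, so a common alternative is to add log-densities instead; the class with the largest sum is the same as the class with the largest product. The following is a minimal sketch of that idea. The names `log_gaussian` and `log_posterior_scores` are hypothetical, and unlike `GPF` the sketch assumes every standard deviation is strictly positive.

```python
import numpy as np

def log_gaussian(mu, sg, x):
    """Element-wise log of the Gaussian density; assumes every sg > 0."""
    return -0.5 * np.log(2.0 * np.pi * sg**2) - (x - mu) ** 2 / (2.0 * sg**2)

def log_posterior_scores(means, stds, priors, x):
    """Unnormalised log-posterior per class for one sample x.

    means, stds: (n_classes, n_features); priors: (n_classes,).
    """
    return np.log(priors) + log_gaussian(means, stds, x).sum(axis=1)

# Illustrative parameters: two classes, two features.
means = np.array([[0.0, 0.0], [2.0, 2.0]])
stds = np.array([[1.0, 1.0], [1.0, 1.0]])
priors = np.array([0.5, 0.5])
scores = log_posterior_scores(means, stds, priors, np.array([1.8, 2.1]))
print(int(np.argmax(scores)))  # 1
```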
        

# Univariate Naive Bayes

In [256]:
X = main_df.iloc[:,:1].to_numpy().astype(np.float64)
y = main_df['Rings_20'].to_numpy().astype(np.float64)
# Two-column one-hot target: column 0 is "20 rings", column 1 is "not 20".
y = np.vstack((y,(y+1)%2)).T

train_X, test_X, train_y, test_y = train_test_split(X, y, train_size=0.6)

Univariate naive bayes model

In [257]:
unb_model = NB()
w = unb_model.fit(train_X,train_y)

Train Accuracy

In [258]:
print(accuracy(train_y,unb_model.predict(train_X)))

0.9916201117318436


Test Accuracy

In [259]:
print(accuracy(test_y,unb_model.predict(test_X)))

0.9960095770151636


# Multivariate Naive Bayes

In [260]:
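# Note: iloc[:,:1] selects only the Length column; iloc[:,:10] would select all ten feature columns.
X = main_df.iloc[:,:1].to_numpy().astype(np.float64)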
y = main_df.iloc[:,10:].to_numpy().astype(np.float64)

train_X, test_X, train_y, test_y = train_test_split(X, y, train_size=0.8)

Multivariate Naive Bayes model

In [261]:
mnb_model = NB()
mnb_model.fit(train_X,train_y)

array([ 7,  6,  8, ..., 10,  8,  8])

Train accuracy

In [262]:
print(accuracy(train_y,mnb_model.predict(train_X)))

0.25591140377132593


Test accuracy

In [263]:
print(accuracy(test_y,mnb_model.predict(test_X)))

0.29571984435797666
