
#**II. Objective**

The main objective of this assignment is to implement Classification (by means of Softmax Regression) with Mini-Batch Stochastic Gradient Descent and K-Fold Cross-Validation from scratch. You’ll also evaluate the performance using different metrics and visualizations.

#**III. Dataset**

You will work with the Wine dataset, a classic dataset for classification tasks. The dataset contains 178 instances and 13 features. Your task is to classify the wine into one of the three classes. The dataset can be found here: https://archive.ics.uci.edu/dataset/109/wine.


#**IV. Tasks**

**Task 1: Dataset Preparation [10%]**
1. Load the Wine dataset and extract the features (X) and labels (y).
2. Filter the dataset to keep only the first 40 samples from each class.
3. Perform feature scaling using **Z-score normalization**, which transforms each feature to have zero mean and unit variance. Given a feature vector $x = [x_1, x_2, \dots, x_n]$, the normalized value is computed as

$$z_i = \frac{x_i - \mu}{\sigma}, \quad \text{where } \mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \quad \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2}.$$

Here, *µ* and *σ* denote the sample mean and sample standard deviation, computed using all available samples.
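
As a minimal sketch of the normalization step, the snippet below standardizes each column with NumPy, whose `std` divides by *n* by default (`ddof=0`) and so follows the 1/n definition of *σ* in Task 1. The helper name `zscore` and the toy array `X_demo` are hypothetical and are not taken from the Wine data.

```python
import numpy as np

def zscore(X):
    """Standardize each column of X to zero mean and unit variance (1/n convention)."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)  # NumPy default ddof=0 divides by n
    return (X - mu) / sigma

# Hypothetical toy data (3 samples, 2 features), for illustration only
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
Z_demo = zscore(X_demo)

print(Z_demo.mean(axis=0))  # each column mean is numerically zero
print(Z_demo.std(axis=0))   # each column standard deviation is one
```

Note that pandas `DataFrame.std` defaults to `ddof=1` (division by *n* − 1), so passing `ddof=0` is needed there to reproduce the same convention.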

In [2]:
!pip install ucimlrepo #needed these first two imports to grab wine data
from ucimlrepo import fetch_ucirepo

import pandas as pd
import numpy as np

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


In [3]:
wine = fetch_ucirepo(id = 109)

x = wine.data.features
y = wine.data.targets

print(x)
print(y)

     Alcohol  Malicacid   Ash  Alcalinity_of_ash  Magnesium  Total_phenols  \
0      14.23       1.71  2.43               15.6        127           2.80   
1      13.20       1.78  2.14               11.2        100           2.65   
2      13.16       2.36  2.67               18.6        101           2.80   
3      14.37       1.95  2.50               16.8        113           3.85   
4      13.24       2.59  2.87               21.0        118           2.80   
..       ...        ...   ...                ...        ...            ...   
173    13.71       5.65  2.45               20.5         95           1.68   
174    13.40       3.91  2.48               23.0        102           1.80   
175    13.27       4.28  2.26               20.0        120           1.59   
176    13.17       2.59  2.37               20.0        120           1.65   
177    14.13       4.10  2.74               24.5         96           2.05   

     Flavanoids  Nonflavanoid_phenols  Proanthocyanins  Color_i

In [16]:
filtered = x.assign(label = y['class']).groupby('label').head(40) #keeps the first 40 samples from each class, in their original row order

filteredX = filtered.drop(columns = ['label'])
filteredY = filtered['label']

print(filteredX)
print(filteredY)

     Alcohol  Malicacid   Ash  Alcalinity_of_ash  Magnesium  Total_phenols  \
0      14.23       1.71  2.43               15.6        127           2.80   
1      13.20       1.78  2.14               11.2        100           2.65   
2      13.16       2.36  2.67               18.6        101           2.80   
3      14.37       1.95  2.50               16.8        113           3.85   
4      13.24       2.59  2.87               21.0        118           2.80   
..       ...        ...   ...                ...        ...            ...   
165    13.73       4.36  2.26               22.5         88           1.28   
166    13.45       3.70  2.60               23.0        111           1.70   
167    12.82       3.37  2.30               19.5         88           1.48   
168    13.58       2.58  2.69               24.5        105           1.55   
169    13.40       4.60  2.86               25.0        112           1.98   

     Flavanoids  Nonflavanoid_phenols  Proanthocyanins  Color_i

In [7]:
mu = filteredX.mean(axis = 0)
sigma = filteredX.std(axis = 0) #pandas defaults to ddof=1 (divides by n-1); the 1/n formula in Task 1 corresponds to ddof=0

normalizedX = (filteredX - mu) / sigma

print(normalizedX)

      Alcohol  Malicacid       Ash  Alcalinity_of_ash  Magnesium  \
0    1.504837  -0.501045  0.158493          -1.103940   1.707731   
1    0.138713  -0.437372 -0.918762          -2.381346  -0.063978   
2    0.085659   0.090203  1.050014          -0.232981   0.001640   
3    1.690524  -0.282738  0.418520          -0.755557   0.789067   
4    0.191766   0.299414  1.792948           0.463785   1.117161   
..        ...        ...       ...                ...        ...   
165  0.841670   1.909427 -0.473002           0.899265  -0.851405   
166  0.470296   1.309083  0.789987           1.044424   0.657829   
167 -0.365295   1.008911 -0.324415           0.028306  -0.851405   
168  0.642720   0.290318  1.124307           1.479904   0.264116   
169  0.403979   2.127734  1.755802           1.625064   0.723448   

     Total_phenols  Flavanoids  Nonflavanoid_phenols  Proanthocyanins  \
0         0.912942    1.184809             -0.635608         1.350213   
1         0.674195    0.882326       

#**Task 2**

In [17]:
def softmax(z):
  z = z - np.max(z, axis = 1, keepdims = True) #subtract each row's max so np.exp cannot overflow; softmax is unchanged by this shift
  e = np.exp(z)

  return e / np.sum(e, axis = 1, keepdims = True)

print(softmax(np.array([[2, 1, 0.1]])))

[[0.65900114 0.24243297 0.09856589]]
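
To illustrate why the row-wise maximum is subtracted before exponentiating, here is a small sketch comparing the shifted version with a naive one on deliberately large logits. The names `softmax_stable` and `softmax_naive` and the logit values are hypothetical, chosen only to trigger overflow in float64.

```python
import numpy as np

def softmax_stable(z):
    # Same approach as the cell above: shift by the row max before exponentiating
    z = z - np.max(z, axis=1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=1, keepdims=True)

def softmax_naive(z):
    # No shift: np.exp overflows to inf for large inputs
    e = np.exp(z)
    return e / np.sum(e, axis=1, keepdims=True)

# Hypothetical large logits, for illustration only
logits = np.array([[1000.0, 1001.0, 1002.0]])

with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(logits)   # inf / inf produces nan
stable = softmax_stable(logits)     # identical to the softmax of [0, 1, 2]

print(naive)
print(stable)
```

Because softmax depends only on differences between logits, the shifted result equals the exact softmax while keeping every exponent at or below zero.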


In [27]:
def kfold_split(N, K):
  indices = np.arange(N)
  np.random.shuffle(indices)

  kFolds =  np.array_split(indices, K)

  splits = []

  for k in range(K):
    indicesTest = kFolds[k]

    c = []
    for i in range(K):
      if i != k:
        c.append(kFolds[i])

    indicesTrain = np.hstack(c)

    splits.append((indicesTrain, indicesTest))

  return splits

splits = kfold_split(4, 2)

for indicesTrain, indicesTest in splits:
  print("train", indicesTrain)
  print("test", indicesTest)

train [0 3]
test [2 1]
train [2 1]
test [0 3]
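
Because `np.random.shuffle` draws from the global random state, the folds above change on every run. One possible variant, sketched below, takes a seed so the splits are reproducible and then checks that the test folds partition the index range. The function name `kfold_split_seeded`, the `seed` parameter, and the choice of K = 5 are assumptions for illustration; N = 120 corresponds to 40 samples from each of the three classes.

```python
import numpy as np

def kfold_split_seeded(N, K, seed=0):
    """Return K (train, test) index pairs from a seeded shuffle of range(N)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(N)
    folds = np.array_split(indices, K)

    splits = []
    for k in range(K):
        test_idx = folds[k]
        # All folds except the k-th form the training set
        train_idx = np.hstack([folds[i] for i in range(K) if i != k])
        splits.append((train_idx, test_idx))
    return splits

splits_demo = kfold_split_seeded(120, 5, seed=0)

# Every index should appear in exactly one test fold
all_test = np.sort(np.hstack([test for _, test in splits_demo]))
print(np.array_equal(all_test, np.arange(120)))
print([len(test) for _, test in splits_demo])
```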


In [34]:
def accuracy(yTrue, yPrediction):
  return np.mean(yTrue == yPrediction)

accuracyTest = accuracy(np.array([2, 1, 0]), np.array([3, 1, 5]))

print(accuracyTest)

0.3333333333333333


In [42]:
def confusion_matrix(yTrue, yPrediction):
  uniqueLabels = np.unique(np.concatenate((yTrue, yPrediction)))

  maxLabel = np.max(uniqueLabels)

  matrix = np.zeros((maxLabel + 1, maxLabel + 1), dtype = int)

  for t, p in zip(yTrue, yPrediction):
    matrix[t, p] += 1

  return matrix

yTrue = np.array([3, 7, 1, 0, 0])
yPrediction = np.array([7, 5, 3, 3, 7])

printConfusionMatrix = confusion_matrix(yTrue, yPrediction)

print(printConfusionMatrix)

[[0 0 0 1 0 0 0 1]
 [0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0]]
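
The matrix above is sized by the largest label, so labels that never occur (such as 2, 4, and 6 here) still get all-zero rows and columns. For labels that do not start at 0, such as Wine classes assumed to be coded 1, 2, 3, a possible alternative is to map the observed labels to consecutive indices first. The sketch below uses the hypothetical name `confusion_matrix_compact` and made-up label arrays.

```python
import numpy as np

def confusion_matrix_compact(y_true, y_pred):
    """Confusion matrix with one row/column per label that actually occurs."""
    both = np.concatenate((y_true, y_pred))
    # labels: sorted unique values; encoded: position of each entry within labels
    labels, encoded = np.unique(both, return_inverse=True)
    encoded = encoded.reshape(-1)
    true_idx = encoded[:len(y_true)]
    pred_idx = encoded[len(y_true):]

    matrix = np.zeros((len(labels), len(labels)), dtype=int)
    # np.add.at accumulates correctly even when (row, col) pairs repeat
    np.add.at(matrix, (true_idx, pred_idx), 1)
    return labels, matrix

# Hypothetical labels coded 1..3, for illustration only
y_true_demo = np.array([1, 1, 2, 3, 3])
y_pred_demo = np.array([1, 2, 2, 3, 1])

labels_demo, cm_demo = confusion_matrix_compact(y_true_demo, y_pred_demo)
print(labels_demo)
print(cm_demo)
```

Rows index the true label and columns the predicted label, the same orientation as `confusion_matrix` above; `labels_demo` records which original label each row and column stands for.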
