<h1><center>CSCI 6515 - Machine Learning for Big Data (Fall 2023)</center></h1>
<h1><center>Assignment No. 3</center></h1>

<b>Mudra Verma</b>  
<b>Banner ID: B00932103</b>  


### 1. Task 1<a id='top'></a>

**Data Transformation**<a href='#1'>[1]</a>

In [1]:
##### Data Transformation on the dataset #####
### Normalizing the training dataset ###

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# The CSV has a 'label' column followed by 784 pixel columns ('pixel1', 'pixel2', ..., 'pixel784')
# Load the Sign Language MNIST training set
df = pd.read_csv('sign_mnist_train.csv')

# Separate labels and pixel values
labels = df['label']
pixels = df.drop('label', axis=1)

# Apply Min-Max scaling to normalize pixel values to the range [0, 1]
scaler = MinMaxScaler()
pixels_normalized = scaler.fit_transform(pixels)

# Combine normalized pixel values with labels
df_normalized = pd.DataFrame(data=pixels_normalized, columns=pixels.columns)
df_normalized['label'] = labels

# Save the normalized training dataset
df_normalized.to_csv('normalized_dataset.csv', index=False)

In [2]:
##### Data Transformation on the dataset #####
### Normalizing the test dataset ###

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Both CSVs have a 'label' column followed by 784 pixel columns ('pixel1', 'pixel2', ..., 'pixel784')
# Load the training set (used only to fit the scaler) and the test set
train_df = pd.read_csv('sign_mnist_train.csv')
df = pd.read_csv('sign_mnist_test.csv')

# Separate labels and pixel values
labels = df['label']
pixels = df.drop('label', axis=1)

# Fit the scaler on the training pixels only, then apply the same per-pixel
# minimum and range to the test pixels. Fitting a separate scaler on the test
# set would scale the two datasets inconsistently. Test values may therefore
# fall slightly outside [0, 1].
scaler = MinMaxScaler()
scaler.fit(train_df.drop('label', axis=1))
pixels_normalized = scaler.transform(pixels)

# Combine normalized pixel values with labels
df_normalized = pd.DataFrame(data=pixels_normalized, columns=pixels.columns)
df_normalized['label'] = labels

# Save the normalized test dataset
df_normalized.to_csv('normalized_test_dataset.csv', index=False)

**Descriptive Analysis**:

The pixel values of the Sign Language MNIST dataset are normalized, as is common for image datasets, to improve the performance and convergence of machine learning models. Min-max scaling maps each pixel column to the range [0, 1] using x' = (x - min) / (max - min). The key reasons for normalizing the pixel values are:
1. Stability of Training: Normalization keeps all pixel values within the same numerical range, which helps stabilize and accelerate training. For distance-based methods such as k-means, it also prevents features with larger raw ranges from dominating the distance computation.
2. Model Generalization: Normalization can improve a model's ability to generalize. Bringing all pixel values into a standard range makes the model less sensitive to differences in scale across the input data. This matters for datasets like Sign Language MNIST, where the lighting conditions or contrast of the images may vary.
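
The scaling described above can be illustrated with a small dependency-free sketch. It is not how `MinMaxScaler` is implemented; the helper names `fit_min_max` and `apply_min_max` and the toy data are hypothetical, and mapping a constant column to 0.0 is an assumption made here to avoid division by zero. The point it shows is that the minimum and maximum come from the training data and are then reused for the test data.

```python
def fit_min_max(rows):
    """Return per-column minimums and maximums of a list of equal-length rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, mins, maxs):
    """Scale each value with x' = (x - min) / (max - min), column by column."""
    scaled = []
    for row in rows:
        scaled_row = []
        for x, lo, hi in zip(row, mins, maxs):
            span = hi - lo
            # Assumption: a constant column maps to 0.0 instead of dividing by zero
            scaled_row.append((x - lo) / span if span else 0.0)
        scaled.append(scaled_row)
    return scaled


# Toy data (hypothetical): three "pixel" columns, the last one constant
train = [[0, 50, 7], [100, 150, 7], [200, 250, 7]]
test = [[50, 300, 7]]

# Statistics are learned from the training rows only
mins, maxs = fit_min_max(train)
train_scaled = apply_min_max(train, mins, maxs)
test_scaled = apply_min_max(test, mins, maxs)

print(train_scaled)
print(test_scaled)
```

Because the test row is scaled with the training statistics, a test value above the training maximum ends up above 1, which is expected behaviour rather than an error.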

### 2. K-means algorithm to Sign Language MNIST dataset

#### i) Subtask 2.a

In [None]:
##### K-means algorithm #####

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn import metrics
import matplotlib.pyplot as plt

# Load the normalized Sign Language MNIST training set produced in Task 1
df = pd.read_csv('normalized_dataset.csv')

# Separate labels and pixel values
labels = df['label']
pixels = df.drop('label', axis=1)

# Convert DataFrame to NumPy array
X = pixels.values

# Vary the number of clusters from 10 to 200 with a step size of 10
cluster_range = range(10, 201, 10)
accuracy_scores = []
inertia_values = []

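for n_clusters in cluster_range:
    # Fit the k-means model
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    kmeans.fit(X)

    # Predict cluster assignments
    clusters = kmeans.predict(X)

    # Cluster indices are arbitrary and do not correspond to class labels,
    # so map each cluster to the most frequent true label among its members
    cluster_to_label = labels.groupby(clusters).agg(lambda s: s.mode().iloc[0])
    labels_pred = cluster_to_label.loc[clusters].values

    # Calculate accuracy of the majority-label assignment
    accuracy = metrics.accuracy_score(labels, labels_pred)
    accuracy_scores.append(accuracy)

    # Get the inertia (objective function) value
    inertia = kmeans.inertia_
    inertia_values.append(inertia)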

# Plot the accuracy and inertia values for different numbers of clusters
plt.figure(figsize=(12, 6))

# Plot accuracy
plt.subplot(1, 2, 1)
plt.plot(cluster_range, accuracy_scores, marker='o')
plt.title('Accuracy vs. Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Accuracy')

# Plot inertia values
plt.subplot(1, 2, 2)
plt.plot(cluster_range, inertia_values, marker='o')
plt.title('Inertia vs. Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')

plt.tight_layout()
plt.show()
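
Cluster indices produced by a clustering algorithm are arbitrary, so comparing them directly with class labels gives a meaningless accuracy. One common convention, assumed here and not prescribed by the assignment text, is to assign each cluster the most frequent true label among its members and score against that. The helper name `majority_vote_accuracy` and the toy data below are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote_accuracy(true_labels, cluster_ids):
    """Map each cluster to its most frequent true label, then compute accuracy."""
    members = defaultdict(list)
    for label, cluster in zip(true_labels, cluster_ids):
        members[cluster].append(label)

    # The most common true label in each cluster becomes that cluster's prediction
    cluster_to_label = {
        cluster: Counter(cluster_labels).most_common(1)[0][0]
        for cluster, cluster_labels in members.items()
    }

    predicted = [cluster_to_label[c] for c in cluster_ids]
    correct = sum(p == t for p, t in zip(predicted, true_labels))
    return correct / len(true_labels), cluster_to_label


# Toy data (hypothetical): cluster ids 5 and 3 share no values with labels 0..2
true_labels = [0, 0, 0, 1, 1, 2]
cluster_ids = [5, 5, 3, 3, 3, 5]

accuracy, mapping = majority_vote_accuracy(true_labels, cluster_ids)
raw_matches = sum(c == t for c, t in zip(cluster_ids, true_labels))

print(mapping)
print(accuracy)
print(raw_matches)
```

With this convention, accuracy tends to rise as the number of clusters grows, because smaller clusters are purer; that should be kept in mind when reading the accuracy plot.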


##### a) Sub-Subtask 2.i.a

**Your Answer**

### 3. Fuzzy K-means algorithm to Sign Language MNIST dataset

#### i) Subtask 3.a

In [None]:
import numpy as np
import pandas as pd
import skfuzzy as fuzz
from sklearn import metrics
import matplotlib.pyplot as plt

# Load the normalized Sign Language MNIST training set produced in Task 1
df = pd.read_csv('normalized_dataset.csv')

# Separate labels and pixel values
labels = df['label']
pixels = df.drop('label', axis=1)

# Convert DataFrame to NumPy array
X = pixels.values

# Vary the number of clusters from 10 to 200 with a step size of 10
cluster_range = range(10, 201, 10)
accuracy_scores = []
objective_function_values = []

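for n_clusters in cluster_range:
    # Apply fuzzy k-means (fuzzy c-means); cmeans expects data shaped (features, samples)
    cntr, u, u0, d, jm, p, fpc = fuzz.cluster.cmeans(
        X.T,
        c=n_clusters,
        m=2,  # Fuzziness coefficient (usually set to 2)
        error=0.005,  # Stopping criterion for the algorithm
        maxiter=1000,  # Maximum number of iterations
        init=None,  # Initial fuzzy partition matrix (None means random initialization)
        seed=42
    )

    # Harden the memberships: assign each sample to its highest-membership cluster
    clusters = np.argmax(u, axis=0)

    # Cluster indices are arbitrary and do not correspond to class labels,
    # so map each cluster to the most frequent true label among its members
    cluster_to_label = labels.groupby(clusters).agg(lambda s: s.mode().iloc[0])
    labels_pred = cluster_to_label.loc[clusters].values

    # Calculate accuracy of the majority-label assignment
    accuracy = metrics.accuracy_score(labels, labels_pred)
    accuracy_scores.append(accuracy)

    # jm holds the objective function history (one value per iteration);
    # keep only the final value so the list can be plotted
    objective_function_values.append(jm[-1])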

# Plot the accuracy and objective function values for different numbers of clusters
plt.figure(figsize=(12, 6))

# Plot accuracy
plt.subplot(1, 2, 1)
plt.plot(cluster_range, accuracy_scores, marker='o')
plt.title('Accuracy vs. Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Accuracy')

# Plot objective function values
plt.subplot(1, 2, 2)
plt.plot(cluster_range, objective_function_values, marker='o')
plt.title('Objective Function Value vs. Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Objective Function Value')

plt.tight_layout()
plt.show()
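
To make the plotted objective function concrete, the sketch below computes textbook fuzzy c-means memberships, u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1)), and the objective J_m = sum_j sum_i u_ij^m * d_ij^2 for fixed cluster centers. It is a hand-written illustration, not the scikit-fuzzy implementation: the function names and toy points are hypothetical, and memberships are indexed `[sample][cluster]` here, whereas `cmeans` returns `u` with shape (clusters, samples).

```python
import math


def fuzzy_memberships(points, centers, m=2.0):
    """Return memberships u[sample][cluster] for fixed centers (textbook formula)."""
    exponent = 2.0 / (m - 1.0)
    u = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if 0.0 in dists:
            # Assumption: a point lying exactly on a center belongs fully to it
            hit = dists.index(0.0)
            u.append([1.0 if i == hit else 0.0 for i in range(len(centers))])
            continue
        u.append([
            1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
            for d_i in dists
        ])
    return u


def fuzzy_objective(points, centers, u, m=2.0):
    """Return J_m: membership-weighted sum of squared distances to the centers."""
    return sum(
        (u[j][i] ** m) * math.dist(p, c) ** 2
        for j, p in enumerate(points)
        for i, c in enumerate(centers)
    )


# Toy data (hypothetical): two points between two fixed centers
points = [(1.0, 0.0), (2.0, 0.0)]
centers = [(0.0, 0.0), (4.0, 0.0)]

u = fuzzy_memberships(points, centers, m=2.0)
j_m = fuzzy_objective(points, centers, u, m=2.0)

# Harden memberships by taking the highest-membership cluster per sample
hard = [max(range(len(row)), key=row.__getitem__) for row in u]

print(u)
print(j_m)
print(hard)
```

The first point sits closer to the first center and receives most of its membership there, while the second point is equidistant and is split evenly. Hardening with an argmax, as in the cell above, discards that distinction.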

##### b) Sub-Subtask 2.i.b

### References:

1. 
2. Scribbr. (2021, July 30). Free APA citation Generator | with Chrome Extension - Scribbr. https://www.scribbr.com/citation/generator/apa/