
In [18]:
import numpy as np

In [19]:
# Read a Train, Valid, or Test CSV file and return a list of
# [label, attributes] pairs, where attributes holds the 784 grayscale
# values of one digit image. All values are kept as strings.

def read_data(file_name):
    data_set = []
    with open(file_name, 'rt') as f:
        for line in f:
            # Drop the trailing newline, then split the row on commas
            tokens = line.rstrip('\n').split(',')
            label = tokens[0]
            # The 784 pixel values follow the label
            attribs = tokens[1:785]
            data_set.append([label, attribs])
    return data_set

**Illustration of Return Value:**

***If the input file (file_name) contains:***

L1,1,2,3,...,784th_value

L2,1,2,3,...,784th_value

...

***The returned data_set would be:***

    [
        ["L1", ["1", "2", "3", ..., "784th_value"]],
        ["L2", ["1", "2", "3", ..., "784th_value"]],
        ...
    ]

**Note:** All values, including the label and the attributes, are read as strings because the file is parsed as plain text. They must be converted explicitly (for example to integers or floats) before any numeric work.
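The return structure can be checked on a toy file. The sketch below is only an illustration: `read_rows` and its `n_features` parameter are hypothetical stand-ins for `read_data`, introduced so that the example can use 3 attributes per row instead of 784, and the file contents are made up.

```python
import os
import tempfile


def read_rows(file_name, n_features):
    """Parse 'label,v1,...,vN' lines into [label, [v1, ..., vN]] pairs of strings.

    Hypothetical variant of read_data with a configurable attribute count.
    """
    rows = []
    with open(file_name, "rt") as f:
        for line in f:
            tokens = line.rstrip("\n").split(",")
            rows.append([tokens[0], tokens[1:n_features + 1]])
    return rows


# Write a two-line toy file (made-up values, 3 attributes per row).
with tempfile.NamedTemporaryFile("wt", suffix=".csv", delete=False) as tmp:
    tmp.write("7,0,128,255\n")
    tmp.write("2,10,20,30\n")
    toy_path = tmp.name

toy_rows = read_rows(toy_path, n_features=3)
os.remove(toy_path)

print(toy_rows)

# Every entry is still a string, so convert explicitly before doing arithmetic.
first_label = int(toy_rows[0][0])
first_pixels = [float(v) for v in toy_rows[0][1]]
print(first_label, first_pixels)
```

Each row comes back as a two-element list: the label string first, then the list of attribute strings.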

In [20]:
# Reading the data using 'read_data' method

train_data = read_data("train.csv")
valid_data = read_data("valid.csv")
test_data = read_data("test.csv")

In [21]:
# Splitting the dataset into "labels" and "features"

def split_data(data):
    labels = []
    features = []
    for item in data:
        labels.append(item[0])
        features.append(item[1])
    return labels, features

y_train, X_train = split_data(train_data)
y_valid, X_valid = split_data(valid_data)
y_test, X_test = split_data(test_data)


In [22]:
# Illustrating the labels and features of the validation dataset
#print(y_valid)
#for i in range(len(X_valid)):
#  print(X_valid[i])


**Convert the features and labels into numpy arrays and change their data type**

The features are currently in the form of strings. We need to convert them to floating point values. Similarly, labels should be converted to integers.

In [23]:
X_train = np.array(X_train, dtype=np.float32)
y_train = np.array(y_train, dtype=np.int32)
X_valid = np.array(X_valid, dtype=np.float32)
y_valid = np.array(y_valid, dtype=np.int32)
X_test = np.array(X_test, dtype=np.float32)
y_test = np.array(y_test, dtype=np.int32)


This conversion matters for the numeric work that follows. Arithmetic on NumPy arrays is applied to whole arrays at once instead of through Python loops, and libraries such as scikit-learn and TensorFlow expect NumPy arrays rather than nested Python lists.
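As a small illustration of this point, the sketch below converts a made-up two-row list of strings and computes a squared Euclidean distance between the rows, the kind of whole-array arithmetic a clustering step would need. The data and variable names are assumptions for the example only.

```python
import numpy as np

# Made-up rows in the same shape read_data produces: lists of strings.
rows = [["0", "128", "255"], ["10", "20", "30"]]

# Plain Python lists do not support element-wise arithmetic.
try:
    rows[0] - rows[1]
except TypeError as err:
    print("lists cannot be subtracted:", err)

# After conversion, the same operation works on whole rows at once.
X = np.array(rows, dtype=np.float32)
print(X.shape, X.dtype)

diff = X[0] - X[1]                   # element-wise difference
sq_dist = float(np.sum(diff ** 2))   # squared Euclidean distance
print(sq_dist)
```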

**Normalize the features**

We normalize by dividing every feature by 255, the maximum grayscale value, which brings all values into the range [0, 1]. We first confirm that the maximum in the training set is indeed 255.

In [24]:
max_value = np.max(X_train)
max_value

255.0

In [25]:
X_train /= 255.0
X_valid /= 255.0
X_test /= 255.0
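A quick sanity check after scaling is to confirm that the values fall inside [0, 1]. The sketch below does this on a made-up array; `pixels` is a hypothetical stand-in for `X_train`, `X_valid`, or `X_test`. Note that the in-place `/=` relies on the array already having a floating-point dtype, which the earlier `float32` conversion provides.

```python
import numpy as np

# Made-up grayscale values standing in for one of the feature arrays.
pixels = np.array([[0, 51, 255], [102, 204, 153]], dtype=np.float32)

# In-place division keeps the float32 dtype and avoids allocating a copy.
pixels /= 255.0

# All normalized values should lie in the closed interval [0, 1].
in_range = bool(pixels.min() >= 0.0 and pixels.max() <= 1.0)
print(in_range, pixels.dtype)
```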


In [26]:
# Illustrating the normalized validation dataset

#for i in range(len(X_valid)):
#  print(X_valid[i])

**Now, the data is ready to be used in the k-means algorithm.**