
## **Natural Language Processing Lab**

### **Experiment: Implementation of K-Means Clustering Algorithm on Text**

***Name: Prexit Joshi***

 ***Roll No.: UE233118***



### **AIM:**
To implement **K-Means Clustering Algorithm** for grouping similar text documents into clusters based on their content.



### **OBJECTIVE:**
- To understand how **unsupervised learning** works on text data.  
- To implement **K-Means Clustering** using scikit-learn.  
- To observe how text data can be represented numerically using **TF-IDF vectorization**.  
- To visualize and interpret the clusters formed.



### **THEORY:**

**K-Means Clustering** is an **unsupervised machine learning algorithm** that groups data into *K distinct clusters* based on similarity.  
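It repeatedly assigns each point to the cluster with the nearest centroid, which minimizes the total squared distance between points and their own cluster centroid (the within-cluster sum of squares). Separation between clusters is a by-product of this, not something the algorithm optimizes directly.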
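
As a sketch in notation introduced here for illustration (these symbols do not appear in the original write-up), the quantity K-Means tries to minimize is usually written as follows, where \( C_k \) is the set of points assigned to cluster \( k \) and \( \mu_k \) is their centroid:

\[
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
\]

Reassigning points to their nearest centroid and recomputing each centroid as a mean can only lower \( J \) or leave it unchanged, which is why the iteration eventually settles.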

#### **Concept:**
1. Choose the number of clusters \( K \).
2. Randomly initialize \( K \) centroids.
3. Assign each data point to the nearest centroid.
4. Recalculate the centroids based on the assigned data points.
5. Repeat steps 3 and 4 until the centroids do not change or a stopping criterion is met.
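
The five steps above can be sketched in plain Python. This is a minimal illustration, not the scikit-learn implementation: the helper names (`squared_distance`, `mean_point`, `kmeans`) and the toy 2-D points are assumptions made for this example, and initial centroids are simply drawn from the data points.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the initial centroids
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 5: stop once the assignments no longer change
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_point(members)
    return labels, centroids


# Hypothetical toy data: two well-separated groups of three points each
points = [
    [1.0, 1.0], [2.0, 1.2], [3.0, 0.9],
    [10.0, 1.1], [11.0, 0.8], [12.0, 1.0],
]
labels, centroids = kmeans(points, k=2)  # Step 1: K is chosen by the caller
print(labels)
print(centroids)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful, not the cluster numbers.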

In **text clustering**, each document is converted into a **numerical vector** using **TF-IDF (Term Frequency–Inverse Document Frequency)**, which represents how important a word is in a document compared to the whole dataset.
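
To make the TF-IDF weighting concrete, here is a small hand-rolled sketch. It assumes whitespace tokenization, no stop-word removal, the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1`, and L2 normalization, which to my understanding mirrors scikit-learn's default settings. The three toy sentences and the function names are made up for this example.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, tokenized by whitespace
docs = [
    "cricket is a popular sport",
    "football is a popular sport",
    "machine learning is a field of ai",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
df = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    """Smoothed inverse document frequency (assumed formula, see lead-in)."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tfidf_vector(tokens):
    """Map each term of one document to its L2-normalized TF-IDF weight."""
    counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vec = tfidf_vector(tokenized[0])
for term, weight in sorted(vec.items(), key=lambda item: -item[1]):
    print(f"{term:10s} {weight:.3f}")
```

In the first sentence, "cricket" (found in one document) gets a higher weight than "popular" (found in two), which in turn outweighs "is" (found in all three). This is the sense in which TF-IDF favors terms that distinguish a document from the rest of the dataset.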



### **ALGORITHM:**
1. Import necessary libraries.  
2. Prepare a small dataset of text documents.  
3. Convert text data into numerical form using **TF-IDF Vectorizer**.  
4. Apply **K-Means Clustering** on the TF-IDF matrix.  
5. Print the cluster labels for each document.  
6. Display the top terms for each cluster.  
7. Analyze and interpret the results.


```python
# Step 1: Import the necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Step 2: Sample text documents
documents = [
    "Machine learning provides systems the ability to learn automatically",
    "Artificial intelligence and machine learning are closely related fields",
    "Cricket is a popular sport in India",
    "The Indian cricket team won the match",
    "Deep learning is a part of machine learning",
    "Football is the most popular sport in the world",
    "AI applications are growing rapidly in various sectors",
    "The football players practiced hard for the tournament"
]

# Step 3: Convert the text documents into a TF-IDF matrix
# (English stop words such as "the" and "is" are dropped)
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# Step 4: Apply K-Means Clustering
num_clusters = 2   # we can choose any number of clusters
model = KMeans(n_clusters=num_clusters, random_state=42)
model.fit(X)

# Cluster label assigned to each document
labels = model.labels_

# Step 5: Print the cluster label of each document
print("Document Clusters:")
for i, label in enumerate(labels):
    print(f"Document {i+1}: Cluster {label}")

# Step 6: Print the top terms per cluster
# (term indices sorted by descending centroid weight)
print("\nTop terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()

for i in range(num_clusters):
    print(f"Cluster {i}: ", end='')
    for ind in order_centroids[i, :5]:  # top 5 terms
        print(f"{terms[ind]} ", end='')
    print()
```


```
Document Clusters:
Document 1: Cluster 1
Document 2: Cluster 1
Document 3: Cluster 0
Document 4: Cluster 0
Document 5: Cluster 1
Document 6: Cluster 0
Document 7: Cluster 1
Document 8: Cluster 1

Top terms per cluster:
Cluster 0: sport popular cricket world india
Cluster 1: learning machine deep players tournament
```



### **SAMPLE OUTPUT:**
```
Document Clusters:
Document 1: Cluster 0
Document 2: Cluster 0
Document 3: Cluster 1
Document 4: Cluster 1
Document 5: Cluster 0
Document 6: Cluster 1
Document 7: Cluster 0
Document 8: Cluster 1

Top terms per cluster:
Cluster 0: learning machine artificial deep intelligence
Cluster 1: football cricket sport popular team
```
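
Because cluster numbers are arbitrary, two labelings are best compared up to renumbering. The following sketch compares the labels printed by the recorded run with those in the sample output. The label lists are copied from those two listings, while the helper name `best_relabeling` is made up for this example.

```python
from itertools import permutations

# Labels for Documents 1..8, copied from the two listings in this report
recorded = [1, 1, 0, 0, 1, 0, 1, 1]  # output printed by the code cell
sample = [0, 0, 1, 1, 0, 1, 0, 1]    # SAMPLE OUTPUT listing


def best_relabeling(a, b):
    """Find the renumbering of a's labels that agrees most often with b."""
    ids = sorted(set(a) | set(b))
    best_matches, best_mapping = -1, None
    for perm in permutations(ids):
        mapping = dict(zip(ids, perm))
        matches = sum(mapping[x] == y for x, y in zip(a, b))
        if matches > best_matches:
            best_matches, best_mapping = matches, mapping
    return best_matches, best_mapping


matches, mapping = best_relabeling(recorded, sample)
mismatched = [
    i + 1
    for i, (x, y) in enumerate(zip(recorded, sample))
    if mapping[x] != y
]
print(f"{matches}/{len(sample)} documents agree under mapping {mapping}")
print("Documents that differ:", mismatched)
```

After swapping the two cluster numbers, the listings agree on seven of the eight documents; only Document 8 is placed differently.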



### **RESULT / CONCLUSION:**
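The K-Means Clustering algorithm separated the documents into a technology-themed cluster and a sports-themed cluster. Cluster numbers are arbitrary labels that depend on initialization, so they differ between the two listings above:
- In the sample output, Cluster 0 represents **Technology / AI-related** documents and Cluster 1 represents **Sports-related** documents.  
- In the run recorded after the code, the numbering is reversed (Cluster 0 is sports, Cluster 1 is technology / AI), and Document 8, a football sentence, was grouped with the technology documents. Such a misplacement is plausible for very short documents that share few terms with the rest of their topic.  

Hence, we implemented and understood how **K-Means Clustering** can be applied to textual data to automatically group similar documents without predefined labels.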



### **OUTCOME:**
Students understood the concept and implementation of **unsupervised learning** for text data using **K-Means Clustering** and **TF-IDF vectorization**.
