# Text Clustering
Text clustering is the task of grouping a set of unlabeled texts in such a way that texts in the same group (called a cluster) are more similar to each other than to those in other clusters.

### Importing necessary package

In [1]:
from sklearn.metrics import silhouette_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import pandas as pd
import string
from nltk.corpus import stopwords
from textblob import Word

### Reading the input file (text_to_cluster.txt)

By running the below shell it will ask for path to your input text file

In [2]:
try:
    path = input("Enter the path to the input text file: ")  # Asks the user to input path
    file = open(path, "r", encoding="utf8")  # Open the file in read mode
    lines = file.read().split('\n')  # Split the contents on the basis of "\n"
    file.close()  # close the file
    print("Reads Successfully")
except:
    print("Unable to read input file")

Enter the path to the input text file: C:/Users/MOHAN KUMAR SAH/Desktop/AssignmentByBikram/text_to_cluster.txt
Reads Successfully


### Data Cleaning

In [3]:
# Storing into data frame so that we can easily apply operations on it
df = pd.DataFrame(lines, columns = ['Lines'])
df.head()

Unnamed: 0,Lines
0,Ransomware attack at Mexico's Pemex halts work...
1,#city | #ransomware | Ransomware Attack At Mex...
2,"Mexico's Pemex Oil Suffers Ransomware Attack, ..."
3,A Mexican oil company was hit by ransomware at...
4,Pemex Struck by Ransomware Attack


In [4]:
# Removing duplicates lines
df.drop_duplicates(inplace = True)
# Printing shape i.e., (number of rows, number of coumns)
df.shape

(332, 1)

In [5]:
# Filtering out unncessary information from given text

stop = stopwords.words('english')

# Removing punctuations and all digits from text
filterString = string.punctuation + '“”|”' + string.digits
df['UpdatedLines'] = df['Lines'].apply(lambda x: x.translate(str.maketrans(filterString,' '*len(filterString),'')))

# Removing all single characters
df['UpdatedLines'] = df['UpdatedLines'].replace('\s+[a-zA-Z]\s+', ' ', regex=True)

# Removing single characters in beginning
df['UpdatedLines'] = df['UpdatedLines'].replace('\^[a-zA-Z]\s+', ' ', regex=True)

# Removing multiple spaces
df['UpdatedLines'] = df['UpdatedLines'].replace('\s+', ' ', regex=True)

# Converting text to lowercase
df['UpdatedLines'] = df['UpdatedLines'].apply(lambda x: x.lower())

# Removing stop words from text
df['UpdatedLines'] = df['UpdatedLines'].str.split(' ').apply(lambda x: ' '.join(k for k in x if k not in stop))

# Lemmatizing all words in the text
df['UpdatedLines'] = df['UpdatedLines'].apply(lambda x: "".join([Word(word).lemmatize() for word in x]))

df.head()

Unnamed: 0,Lines,UpdatedLines
0,Ransomware attack at Mexico's Pemex halts work...,ransomware attack mexico pemex halts work thre...
1,#city | #ransomware | Ransomware Attack At Mex...,city ransomware ransomware attack mexico’s pe...
2,"Mexico's Pemex Oil Suffers Ransomware Attack, ...",mexico pemex oil suffers ransomware attack mil...
3,A Mexican oil company was hit by ransomware at...,mexican oil company hit ransomware attack
4,Pemex Struck by Ransomware Attack,pemex struck ransomware attack


### Conversion of sentences into vectors

I used TfidfVectorizer. So what is TF-IDF ?

Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic that demonstrates how important a word is to a corpus.
1. Term Frequency is just ratio number of current word to the number of all words in document/string/etc.
2. Inverse Document Frequency is a log of the ratio of the number of all documents/string in the corpus to the number of documents.
3. tf-idf is the product Term Frequency to Inverse Document Frequency.

##### Reason for choosing TF-IDF
TF-IDF is a well known method to evaluate how important is a word in a document. tf-idf are also a very interesting way to convert the textual representation of information into a Vector Space Model (VSM).

I have to create vectorizer using TfidfVectorizer class to fit and transform the UpdatedLines cloumn which I created in the data frame:

In [6]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["UpdatedLines"])  # Learn and transform Statements

### Procedure for choosing optimum number of clusters
I used Silhouette analysis to select optimum number of clusters

Silhouette analysis refers to a method of interpretation and validation of consistency within clusters of data. The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). It can be used to study the separation distance between the resulting clusters.

Calculation of Silhouette Value –
If the Silhouette index value is high, the object is well-matched to its own cluster and poorly matched to neighbouring clusters. The Silhouette Coefficient is defined as –

S(i) = ( b(i) – a(i) ) / ( max { ( a(i), b(i) ) }

Where,

       a(i) is the average dissimilarity of ith object to all other objects in the same cluster
       b(i) is the average dissimilarity of ith object with all objects in the closest cluster.
Range of Silhouette Value: S(i) will lie between [-1, 1]
1. If silhouette value is close to 1, sample is well-clustered and already assigned to a very appropriate cluster.
2. If silhouette value is about to 0, sample could be assign to another cluster closest to it and the sample lies equally far away from both the clusters. That means it indicates overlapping clusters
3. If silhouette value is close to –1, sample is misclassified and is merely placed somewhere in between the clusters.

##### Reason for choosing Silhouette analysis
It is very easy to apply and calculate Silhouette score corresponding to a particular number of clusters. So after getting a list of Silhouette scores we can simply choose the cluster number having highest Silhouette score.

##### Reason for choosing K-Means
K-means is one of the simplest and fastest unsupervised learning algorithms that solve the well known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters.

In [7]:
NumberOfClusters = []  # list to store number of clusters
SilhouetteCoefficient = []  # list to store Silhouette score

for n_cluster in range(2, df.shape[0]):
    kmeans = KMeans(n_clusters = n_cluster).fit(X)
    label = kmeans.labels_ # Stores labels corresponding to a particular cluster number
    sil_coeff = silhouette_score(X, label, metric='euclidean') # Calculates Silhouette score
    
    NumberOfClusters.append(n_cluster)
    SilhouetteCoefficient.append(sil_coeff)

  return_n_iter=True)


In [8]:
# After getting the list of clusters and their corresponding Silhouette Score I can store it into data frame for easier operations
df_ = pd.DataFrame()
df_["Number Of Clusters"] = NumberOfClusters
df_["Silhouette Coefficient"] = SilhouetteCoefficient
df_.head()

Unnamed: 0,Number Of Clusters,Silhouette Coefficient
0,2,0.022271
1,3,0.029543
2,4,0.032694
3,5,0.033876
4,6,0.042102


In [9]:
# Finding the value of K(Cluster Number) for which Silhouette Score is highest
k = df_["Number Of Clusters"][df_["Silhouette Coefficient"] == df_["Silhouette Coefficient"].max()]
k

133    135
Name: Number Of Clusters, dtype: int64

### Final model with optimum number of Clusters

In [10]:
model = KMeans(n_clusters = int(k))
model.fit(X)
FinalLabels = model.labels_  # Storing the final labels

In [11]:
# Storing the Lines of the input file and Final Labels into data frame
final_df = pd.DataFrame()
final_df["Lines"] = df["Lines"]
final_df["Cluster_ID"] = FinalLabels
final_df.head()

Unnamed: 0,Lines,Cluster_ID
0,Ransomware attack at Mexico's Pemex halts work...,1
1,#city | #ransomware | Ransomware Attack At Mex...,1
2,"Mexico's Pemex Oil Suffers Ransomware Attack, ...",43
3,A Mexican oil company was hit by ransomware at...,41
4,Pemex Struck by Ransomware Attack,25


In [12]:
# Creating list of list where Statements belongs to a particular list belongs from the same cluster
li = []
for i in range(int(k)):
    li.append([])

for i in range(final_df.shape[0]):
    li[final_df.iloc[i][1]].append(final_df.iloc[i][0])
    
li[0:5] # Prints list of five clusters

[['Massive Malware Attack on Organization Networks'],
 ["Ransomware attack at Mexico's Pemex halts work, threatens to cripple computers",
  '#city | #ransomware | Ransomware Attack At Mexico’s Pemex Halts Work, Threatens To Cripple Computers',
  "Ransomware attack at Mexico's Pemex halts work, threatens to cripple computers By Reuters",
  'Ransomware attack at mexicos pemex halts work threatens to cripple computers'],
 ['Update: Labour website subject to second DDoS attack'],
 ["Labour says it has been hit by 'large scale cyber attack'",
  "UK Labour's digital platforms hit by large-scale cyber attack",
  '‘Sophisticated’ cyber attack on UK Labour Party platforms was probably just a DDoS, says official',
  'UK Labour Party Says It Has Experienced A ‘Large Scale Cyber Attack’ On Its Digital Platforms'],
 ['‘Buran’ new ransomware evolved from VegaLocker']]

### Writing into the output file (Output.txt)

In [13]:
# Writing the statements and their corresponding cluster id in output file
try:
    file1 = open("Output.txt","w")  # open Output.txt in write mode
    for i in range(len(li)):
        file1.write("Cluster id: "+ str(i) + " \n")  # writes cluster id, example: Cluster id : 0
        file1.writelines(li[i])  # Writes statements corresponding to the clusters
        file1.write("\n" + "*"*100 + " \n")  # Writes 100 stars(*) in the same format which is given in "sample_output.txt"
    file1.close()  # close the file
    print("Writes Successfully")
except:
    print("Unable to write into output file")