# AI CA6 
### Kasra Hajiheidari - 810199400

### Question 1:
Imbalanced datasets can introduce various challenges in clustering. Here are some common problems that may occur when clustering imbalanced datasets:

   - Clustering algorithms may be biased toward the majority class, resulting in clusters that predominantly represent the larger class. This is because algorithms aim to minimize the overall dissimilarity, and the majority class contributes more to this measure.

   - The minority class may be underrepresented in the clusters, leading to sparse or non-representative clusters for the smaller class. This can make it difficult to identify meaningful patterns or anomalies associated with the minority class.

   - Imbalances may make clustering algorithms more sensitive to noise or outliers in the minority class, potentially leading to the creation of clusters that are influenced by noise rather than true patterns.

   - The interpretation of clusters may be biased towards the majority class, potentially overshadowing the importance of minority classes. This can be a concern in scenarios where understanding patterns in minority classes is crucial.

Addressing imbalances in clustering requires a thoughtful combination of preprocessing techniques, algorithm selection, and careful consideration of evaluation metrics based on the specific characteristics of the dataset and the goals of the analysis.Here are some common solutions:

   - Use resampling techniques such as oversampling the minority class, undersampling the majority class, or a combination of both to balance the dataset.

   - Focus on selecting or engineering features that enhance the difference between different classes, making it easier for the algorithm to identify patterns in both majority and minority classes.

   - Choose clustering algorithms that are less sensitive to imbalances, such as density-based algorithms like DBSCAN or hierarchical clustering. These algorithms do not assume equal-sized clusters.

   - Utilize cluster ensemble methods that combine multiple clustering results, potentially helping to mitigate the impact of imbalances by considering diverse perspectives on the data.



## Phase 1: Data preprocessing

In [1]:
import hazm
import pandas as pd

train_file_path = 'train.csv'
test_file_path = 'test.csv'

stop_words = ['شاید', 'اما', 'اگر', 'برای', 'از', 'به', 'و', 'که', 'با', 'تا', 'را','این', 'آن', 'در', 'درون', 'نیز', 'همچنین', 'جز',
              'چون', 'چنانچه', 'بجز', 'درباره', 'بر', 'روی', 'بی', 'غیر', 'علاوه', 'مگر',
              '؟', '.', '،', 'همچون', 'وی', 'خود', 'چنین', ':',
              'می', 'است', 'شد', 'کرد', 'شده', 'باشد', 'بود', 'بوده']


tokenizer = hazm.WordTokenizer(replace_emails=True, replace_ids=True, replace_links=True,
                               replace_numbers=True, replace_hashtags=True,
                               separate_emoji=True, join_verb_parts=False)

normalizer = hazm.Normalizer()

def clean_stop_words_norm(text):
    words = tokenizer.tokenize(text)
    cleaned_words = [word for word in words if word not in stop_words]
    return cleaned_words


train_data = pd.read_csv(train_file_path)
test_data = pd.read_csv(test_file_path)


print(train_data.iloc[0])
train_data.loc[:, 'content'] = train_data.loc[:, 'content'].apply(clean_stop_words_norm)
test_data.loc[:, 'content'] = test_data.loc[:, 'content'].apply(clean_stop_words_norm)
print(train_data.iloc[0])

label                                                 فناوری
content    گزارش های منتشر شده حاکی از آن است که کاربران ...
Name: 0, dtype: object
label                                                 فناوری
content    [گزارش, های, منتشر, حاکی, کاربران, تلگرام, منا...
Name: 0, dtype: object


### Label Encoding

In [2]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

train_labels = label_encoder.fit_transform(train_data['label'])
train_data['label'] = train_labels

test_labels = label_encoder.fit_transform(test_data['label'])
test_data['label'] = test_labels

2. Lemmatization involves transforming words into their base or most recognized forms. On the other hand, stemming is a simpler heuristic process that often involves truncating word endings to approximate the desired base form, frequently entailing the removal of derivational affixes for simplification. This linguistic preprocessing aids in enhancing the consistency of word representations in text-based clustering projects. Additionally, it contributes to a more streamlined and effective analysis of textual data.

## Phase 2: Problem Procedure

### Extract Emedding

In [3]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn import utils
from gensim.test.utils import get_tmpfile
import matplotlib.pyplot as plt

tag_train = [TaggedDocument(words=row['content'], tags=[row['label']])
             for _, row in train_data.iterrows()]
tag_test = [TaggedDocument(words=row['content'], tags=[row['label']])
             for _, row in test_data.iterrows()]

embed_model = Doc2Vec(min_count=2, workers=5, dm=1)
embed_model.build_vocab(tag_train)

# train model
for epoch in range(40):
    embed_model.train(utils.shuffle(tag_train), epochs=1, total_examples=embed_model.corpus_count)

# extract embedding
embed_vec_train = [embed_model.infer_vector(content) for content in train_data['content']]
embed_vec_test = [embed_model.infer_vector(content) for content in test_data['content']]


### Train using K-Means and DBSCAN

#### K-Means

In [4]:
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN

kmeans_model = KMeans(n_clusters=len(train_data['label'].unique()),init='k-means++',
                      random_state=557, algorithm='elkan', n_init='auto')

kmeans_model.fit(embed_vec_train)

print(kmeans_model.labels_)

n_clusters_ = len(set(kmeans_model.labels_)) - (1 if -1 in kmeans_model.labels_ else 0)
n_noise_ = list(kmeans_model.labels_).count(-1)
print('Number of Clusters : ', n_clusters_)
print('Number of Outliers : ', n_noise_)

[1 1 1 ... 3 4 3]
Number of Clusters :  6
Number of Outliers :  0


#### DBSCAN

In [5]:
dbscan_model = DBSCAN(eps=2.5, min_samples=2, n_jobs=5)

dbscan_model.fit(embed_vec_train)

print(list(dbscan_model.labels_))

n_clusters_ = len(set(dbscan_model.labels_)) - (1 if -1 in dbscan_model.labels_ else 0)
n_noise_ = list(dbscan_model.labels_).count(-1)
print('Number of Clusters : ', n_clusters_)
print('Number of Outliers : ', n_noise_)

[0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

3. As computers inherently operate with numerical data, it becomes imperative to represent our information, relationships, and dependencies in a numeric format. Embedding serves as the means to accomplish this mapping, enabling us to convert our data into numerical representations. This facilitates the manipulation and analysis of datasets and problem-solving, allowing us to apply computational techniques effectively. Embedding plays a crucial role in translating the richness of diverse data types into a format that computers can efficiently process, enabling a seamless integration of data into various computational models.<br><br>
4.
- **Word2Vec:** Primarily aimed at representing words through numerical values while preserving their relational context. This technique assigns closer numerical values to words that share greater semantic relatedness. As a result, words with similar meanings exhibit reduced numerical distance in their representations.

- **Doc2Vec:** This method extends the concept of embedding from words to entire documents, posing a more intricate challenge. Leveraging the foundation of Word2Vec, Doc2Vec introduces a concept known as a "doc-vector" or "Paragraph Vector." This vector evolves dynamically with each word in the document. Two prevalent implementations are utilized: _Distributed Memory (PV-DM)_ and _Distributed Bag of Words (PV-DBOW)_. PV-DM captures the context of words in relation to the document, while PV-DBOW focuses on predicting words in a document without considering their order. This approach empowers the representation of entire documents through numerical vectors, facilitating more comprehensive analyses in text-based clustering projects.
5. 
- **K-Means:** Operating as a centroid-based or partition-based clustering algorithm, K-means categorizes points in the sample space into K groups based on their similarity. This similarity is typically evaluated using Euclidean Distance, and each group is represented by a centroid.

- **DBSCAN:** In contrast, DBScan stands out as a density-based clustering algorithm. It establishes clusters by requiring that the neighborhood of each point within a cluster, located within a specified radius, must contain a minimum number of points. This algorithm excels in identifying outliers and effectively handles noise within the dataset.

**Key Differences:**
- *Efficiency:* K-Means is generally more computationally efficient than DBSCAN.
- *Sensitivity to Cluster Number:* K-Means is sensitive to the pre-defined number of clusters, while DBSCAN autonomously identifies clusters based on density.
- *Data Characteristics:* DBSCAN may not perform optimally with sparse data points characterized by varying densities, whereas K-Means is less affected by this data distribution.




### Comparison (6)

In [6]:
k_pred = kmeans_model.predict(embed_vec_test)
print("k-means: ",k_pred)

d_pred = dbscan_model.fit_predict(embed_vec_test)
print("DBSCAN: ",list(d_pred))

k-means:  [3 3 4 ... 4 3 2]
DBSCAN:  [0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, -1, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, -1, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, -1, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 

6. In the analysis of the results, it is observed that K-Means effectively partitions the data into 6 distinct clusters. While DBSCAN because of the way it works (identifying cores using minimum samples that have the same flag in their neighborhod )

## Evaluation and Measurements
8. 
- **Silhouette Score:** A pivotal metric for evaluating the effectiveness of clustering, the Silhouette Score ascertains the quality of the clustering arrangement. Ranging from -1 to 1, a higher value indicates greater separation between clusters. Conversely, a score of -1 suggests an erroneous assignment of clusters.

- **Homogeneity:** This metric gauges the degree of similarity among samples within a cluster, offering insights into the homogeneity of the clusters. With values spanning from 0 to 1, a higher score signifies more homogenous clusters, indicating that the samples within each cluster share greater similarity. Homogeneity provides a valuable measure for assessing the internal cohesion of clusters in a clustering solution.

### Report Measurements (9)

In [7]:
from sklearn.metrics import silhouette_samples
from sklearn.metrics import homogeneity_score

print("K-Means Silhouette: ", silhouette_samples(embed_vec_test, k_pred))
print("K-Means Homogeneity: ", homogeneity_score(test_labels, k_pred))
print("---------------")
print("DBSCAN Silhouette: ", silhouette_samples(embed_vec_test, d_pred))
print("DBSCAN Homogeneity: ", homogeneity_score(test_labels, d_pred))


K-Means Silhouette:  [-0.11789858  0.01875667  0.2181031  ...  0.5125378  -0.01282262
  0.10433124]
K-Means Homogeneity:  0.08166251057219366
---------------
DBSCAN Silhouette:  [0.42006585 0.23673569 0.6263366  ... 0.6498601  0.34790635 0.0564233 ]
DBSCAN Homogeneity:  0.01198355676565597


10. Enhance dataset normalization and preprocessing, considering potential improvements to address stop words:

- **Normalization:**
   - Implement robust techniques for text cleaning, including lowercasing all text to ensure uniformity.
   - Explore lemmatization or stemming to reduce words to their base forms and enhance the consistency of the dataset.

- **Stop Words Handling:**
   - Consider removing common stop words that may not contribute significantly to the meaning of the text. This step helps focus on more meaningful content.
   - Evaluate the impact of stop word removal on the dataset's coherence and adjust accordingly.
   - Remove unnecessary special characters, symbols, or punctuation to ensure a cleaner and more uniform dataset.

- **Tokenization:**
   - Employ effective tokenization methods to break down text into individual units, such as words or phrases. This enhances the granularity of the dataset.

- **Numerical Normalization:**
   - Standardize numerical values within the dataset to a common scale to prevent biases during clustering.


## Dimension Reduction
7. The primary objective of employing Principal Component Analysis (PCA) is to retain maximal information within the data while reducing its dimensionality. The PCA process typically encompasses five key steps:

- **Standardization of Data:**
   - Normalize the data to a common scale using the formula:
   $$  z = \frac{{\text{{value}} - \text{{mean}}}}{{\text{{standard deviation}}}} $$
   This step ensures that all features contribute equally to the analysis.

- **Computation of Covariance Matrix:**
   - Calculate the covariance matrix to unveil relationships and correlations between different features within the dataset.

- **Eigenvectors and Eigenvalues Computation:**
   - Determine the eigenvectors and eigenvalues of the covariance matrix. These mathematical constructs help identify the principal components of the dataset, representing the directions of maximum variance.

- **Feature Vector Creation:**
   - Develop a feature vector to guide the selection of principal components. This involves sorting the eigenvalues in descending order and choosing the corresponding eigenvectors.

- **Data Recasting along Principal Components Axes:**
   - Transform the data by projecting it onto the selected principal components. This step effectively reduces the dimensionality of the dataset while retaining the most critical information.


### Use PCA (2D)

In [8]:
from sklearn.decomposition import PCA
import plotly.express as px
from mpl_toolkits.mplot3d import Axes3D

test_pca = PCA(n_components=2)
test_pca.fit(embed_vec_test)
test_pca_list = test_pca.transform(embed_vec_test)
print(test_pca_list)

[[ 0.56908062  0.006007  ]
 [ 1.24423806  0.25757783]
 [-0.9704696   0.02432247]
 ...
 [-1.44101464 -0.05011537]
 [ 0.95726428 -0.18978378]
 [ 4.01918059 -2.61083062]]


#### DBSCAN

In [9]:
d_pca_df = pd.DataFrame()
d_pca_df['x'] = test_pca_list[:, 0]
d_pca_df['y'] = test_pca_list[:, 1]
d_pca_df['label'] = d_pred

fig = px.scatter(d_pca_df, x='x', y='y', color='label')
fig.show()

#### K-Means

In [10]:
k_pca_df = pd.DataFrame()
k_pca_df['x'] = test_pca_list[:, 0]
k_pca_df['y'] = test_pca_list[:, 1]
k_pca_df['label'] = k_pred

fig = px.scatter(k_pca_df, x='x', y='y', color='label')
fig.show()

### Use PCA (3D)

In [11]:
test_pca3d = PCA(n_components=3)
test_pca3d.fit(embed_vec_test)
test_pca3d_list = test_pca3d.transform(embed_vec_test)
print(test_pca3d_list)

[[ 0.56908062  0.0060066   0.31225444]
 [ 1.24423806  0.25757763 -0.29303136]
 [-0.9704696   0.02432258 -0.06183699]
 ...
 [-1.44101464 -0.05011536 -0.02631069]
 [ 0.95726428 -0.18978384  1.04697899]
 [ 4.01918059 -2.61083072 -1.61206084]]


#### K-Means

In [12]:
k_pca3d_df = pd.DataFrame()
k_pca3d_df['x'] = test_pca3d_list[:, 0]
k_pca3d_df['y'] = test_pca3d_list[:, 1]
k_pca3d_df['z'] = test_pca3d_list[:, 2]
k_pca3d_df['label'] = k_pred

fig = px.scatter_3d(k_pca3d_df, x='x', y='y', z='z', color='label')
fig.show()

#### DBSCAN

In [14]:
d_pca3d_df = pd.DataFrame()
d_pca3d_df['x'] = test_pca3d_list[:, 0]
d_pca3d_df['y'] = test_pca3d_list[:, 1]
d_pca3d_df['z'] = test_pca3d_list[:, 2]
d_pca3d_df['label'] = d_pred

fig = px.scatter_3d(d_pca3d_df, x='x', y='y', z='z', color='label')
fig.show()