
# COVID-19 Tweet Clustering using GSDMM

This notebook performs clustering of COVID-19-related tweets using the **GSDMM (Gibbs Sampling for Dirichlet Multinomial Mixture Model)** algorithm. The main steps involve text preprocessing, tokenization, lemmatization, and clustering of tweets into groups based on similar content. Below is a summary of the approach and methodology used:

### Methodology:
1. **Data Collection**: 
   - The dataset contains tweets related to COVID-19, with a focus on sentiment analysis.
   
2. **Text Preprocessing**: 
   - Tokenization: The tweets are split into individual words.
   - Removal of stopwords: Common words (e.g., "the", "and", etc.) are removed to focus on meaningful words.
   - Lemmatization: Words are reduced to their base form (e.g., "running" to "run").
   - N-grams Creation: Bigram and trigram models are generated to capture word relationships.

3. **Modeling**: 
   - GSDMM is applied to the preprocessed tweets, and a 5 × 5 grid of hyperparameter values (Alpha and Beta, each taking five evenly spaced values from 0.05 to 1) is tested to select the best combination.
   - The number of clusters (K) is fixed at 4 for this analysis; it is not tuned.

4. **Clustering and Evaluation**:
   - Each tweet is assigned to the most probable cluster based on the word distribution learned by the model.
   - The results are evaluated with a balance loss: the sum, over all clusters, of the absolute difference between 1/K and the cluster's share of documents. Lower values mean the tweets are spread more evenly across clusters.

5. **Results**: 
   - The final output includes the cluster assignments for each tweet, along with the most frequent words in each cluster.
   - Results are saved into a CSV file named `ClusteredTweets.csv`.
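
The loss from step 4 can be illustrated on its own. The sketch below assumes the loss takes the form computed later in the notebook (sum over clusters of `|1/K - share|`); the function name `balance_loss` and the cluster sizes are hypothetical and chosen only for illustration.

```python
def balance_loss(doc_counts):
    """Sum of absolute deviations of each cluster's document share from 1/K."""
    k = len(doc_counts)
    total = sum(doc_counts)
    return sum(abs(1 / k - count / total) for count in doc_counts)


# Hypothetical cluster sizes, not taken from the dataset
even = balance_loss([250, 250, 250, 250])    # perfectly balanced clusters
skewed = balance_loss([700, 100, 100, 100])  # one dominant cluster

print(even)    # 0.0
print(skewed)  # approximately 0.9
```

A loss of 0 means equally sized clusters. The measure only captures balance, so it is best read as a heuristic for avoiding one oversized cluster rather than as a measure of topical coherence.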

### Key Functions:
- `sent_to_words()`: Tokenizes the tweet texts into words.
- `make_n_grams()`: Generates bigrams and trigrams to enhance the model's understanding of word relationships.
- `remove_stopwords()`: Removes common stopwords to focus on meaningful words.
- `lemmatization()`: Reduces words to their base form using the Spacy NLP library.
- `choose_best_label()`: Assigns each tweet to the most likely cluster based on the word distribution in each cluster.

### Requirements:
- **Python 3.x**
- **Libraries**:
  - `numpy`
  - `pandas`
  - `gsdmm` (installed from https://github.com/rwalk/gsdmm)
  - `gensim`
  - `spacy`
  - `nltk`

### How to Run:
1. Install the required libraries:
   ```bash
   pip install numpy pandas gensim spacy nltk
   pip install git+https://github.com/rwalk/gsdmm.git
   ```
2. Download the Spacy English model:
   ```bash
   python -m spacy download en_core_web_sm
   ```
3. Run the notebook and observe the results. The final output CSV file will be saved as `ClusteredTweets.csv`.



In [11]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/covid-19-tweets-for-sentiment-analysis/Corona.csv


In [12]:
pip install git+https://github.com/rwalk/gsdmm.git

Collecting git+https://github.com/rwalk/gsdmm.git
  Cloning https://github.com/rwalk/gsdmm.git to /tmp/pip-req-build-g__op5s_
  Running command git clone --filter=blob:none --quiet https://github.com/rwalk/gsdmm.git /tmp/pip-req-build-g__op5s_
  Resolved https://github.com/rwalk/gsdmm.git to commit 4ad1b6b6976743681ee4976b4573463d359214ee
  Preparing metadata (setup.py) ... done
Note: you may need to restart the kernel to use updated packages.


### Importing the required libraries: numpy, pandas, gsdmm, gensim, and spacy

In [13]:

import numpy as np
import pandas as pd
from gsdmm import MovieGroupProcess
from gensim.utils import simple_preprocess
import gensim, spacy


### Loading the English language model from Spacy for NLP tasks such as lemmatization

In [14]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])


In [15]:
# Read the data from a CSV file that contains tweets about COVID-19
data = pd.read_csv(r'/kaggle/input/covid-19-tweets-for-sentiment-analysis/Corona.csv', header=0, encoding='cp437')


### Converting tweets to words using the simple_preprocess function from gensim

In [16]:
def sent_to_words(sentences):
    for sentence in sentences:
        # deacc=True strips accent marks; simple_preprocess also lowercases and drops very short or very long tokens
        yield simple_preprocess(str(sentence), deacc=True)

### Creating n-grams (such as bigrams and trigrams) to improve the textual representation and expand word interactions


In [17]:
# Create n-grams (bigrams and trigrams) to improve the text representation and capture multi-word expressions
def make_n_grams(texts):
    bigram = gensim.models.Phrases(texts, min_count=5, threshold=100)
    bigram_mod = gensim.models.phrases.Phraser(bigram)
    trigram = gensim.models.Phrases(bigram[texts], threshold=100)
    trigram_mod = gensim.models.phrases.Phraser(trigram)
    bigrams_text = [bigram_mod[doc] for doc in texts]
    # bigrams_text already has bigrams merged, so only the trigram model is applied here
    trigrams_text = [trigram_mod[doc] for doc in bigrams_text]
    return trigrams_text
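
To see why `threshold=100` keeps only strongly associated word pairs, here is a rough sketch of the default scoring rule as described in the gensim documentation: `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))`, with a pair merged when the score exceeds the threshold. The function name `phrase_score` and all counts are hypothetical; this is an illustration of the formula, not gensim's implementation.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    """Collocation score: higher means a and b co-occur far more often than their frequencies suggest."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)


THRESHOLD = 100  # same value as passed to Phrases in make_n_grams

# Hypothetical counts: two rare words that almost always appear together
score_strong = phrase_score(count_a=60, count_b=50, count_ab=45, vocab_size=10_000)

# Hypothetical counts: two frequent words that co-occur the same number of times
score_weak = phrase_score(count_a=2_000, count_b=1_500, count_ab=45, vocab_size=10_000)

print(score_strong > THRESHOLD)  # True: the pair would be merged
print(score_weak > THRESHOLD)    # False: the pair stays as separate tokens
```

Under this rule, frequent words need a very large number of co-occurrences to form a phrase, which is why a high threshold yields few but reliable n-grams.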

### Removing stopwords from the texts to improve the quality of analysis


In [18]:
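# Remove common stopwords from the texts to improve the quality of the analysis
def remove_stopwords(texts):
    # Each doc is already a list of tokens, so filter it directly; re-tokenizing it
    # through simple_preprocess would silently drop n-grams longer than 15 characters
    stopwords = gensim.parsing.preprocessing.STOPWORDS
    return [[word for word in doc if word not in stopwords] for doc in texts]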


### Lemmatizing the words using spacy to better analyze the words


In [19]:
# Reduce words to their lemmas with spacy, keeping only the allowed part-of-speech tags
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out


### Extracting the top words from each cluster using the word distribution and comparing them


In [20]:
# Extract the top words of each cluster from its word distribution so the clusters can be compared
def top_words(mgp, cluster_word_distribution, top_cluster, values):
    Text = ''
    TheseResults = []
    for cluster in top_cluster:
        sort_dicts = sorted(mgp.cluster_word_distribution[cluster].items(), key=lambda k: k[1], reverse=True)[:values]
        Text += "\nCluster %s : %s" % (cluster, sort_dicts)
        TheseResults.append([cluster, sort_dicts])
    return Text, TheseResults
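
The core of `top_words` is a sort of each cluster's word-count dictionary. A minimal sketch of that step on toy data follows; the helper name `top_n_words` and the counts are hypothetical.

```python
def top_n_words(word_counts, n):
    """Return the n (word, count) pairs with the highest counts, largest first."""
    return sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical word counts for a single cluster
toy_cluster = {"mask": 12, "store": 30, "price": 25, "panic": 7}

print(top_n_words(toy_cluster, 2))  # [('store', 30), ('price', 25)]
```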


### Preparing the data for training by tokenizing tweets, creating n-grams, lemmatizing words, and removing stopwords


In [21]:
# Prepare the data for training: tokenize the tweets, create n-grams, lemmatize the words, and remove stopwords
tokens_reviews = list(sent_to_words(data['OriginalTweet']))
tokens_reviews = make_n_grams(tokens_reviews)
reviews_lemmatized = lemmatization(tokens_reviews, allowed_postags=['NOUN', 'VERB', 'ADV'])
reviews_lemmatized = remove_stopwords(reviews_lemmatized)


### Training using different values for Alpha and Beta to select the best combination based on performance
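
The cell below sweeps every pairing of five Alpha values with five Beta values, which is why the log counts up to 25 phases. The sketch here enumerates that grid; the `linspace` helper is a hypothetical pure-Python stand-in for `np.linspace`, used only to keep the example dependency-free.

```python
from itertools import product


def linspace(start, stop, num):
    """Evenly spaced values from start to stop inclusive (stand-in for np.linspace)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


# Rounded for display; the notebook uses the unrounded values
values = [round(v, 4) for v in linspace(0.05, 1, 5)]
grid = list(product(values, values))  # every (Alpha, Beta) combination

print(values)     # [0.05, 0.2875, 0.525, 0.7625, 1.0]
print(len(grid))  # 25
```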


In [22]:
# Train with different Alpha and Beta values and record the metrics for each combination
Results = []
X = 0
K = 4  # Number of clusters
# The vocabulary does not depend on Alpha or Beta, so its size is computed once
vocab = set(x for review in reviews_lemmatized for x in review)
n_terms = len(vocab)
for Alpha in np.linspace(0.05, 1, 5):  # Candidate Alpha values
    for Beta in np.linspace(0.05, 1, 5):  # Candidate Beta values
        X += 1
        print(f'Phase Number {X}')
        mgp = MovieGroupProcess(K=K, alpha=Alpha, beta=Beta, n_iters=5)
        labels = mgp.fit(reviews_lemmatized, n_terms)  # One cluster label per document
        doc_count = np.array(mgp.cluster_doc_count)
        top_index = doc_count.argsort()[-10:][::-1]  # Clusters ordered from largest to smallest
        # Loss: total deviation of the cluster shares from a uniform 1/K split
        Loss = 0
        for i in range(K):
            Loss += abs((1 / K) - (doc_count[i] / sum(doc_count)))
        Results.append({'Parameters': [Alpha, Beta],
                        'Loss': Loss,
                        'Doc Number': doc_count,
                        'Top Index': top_index,
                        'Top Words': top_words(mgp, mgp.cluster_word_distribution, top_index, 10)[0],
                        'All Words': top_words(mgp, mgp.cluster_word_distribution, top_index, 100)[1]})


Phase Number 1
In stage 0: transferred 27414 clusters with 4 clusters populated
In stage 1: transferred 20230 clusters with 4 clusters populated
In stage 2: transferred 14793 clusters with 4 clusters populated
In stage 3: transferred 9974 clusters with 4 clusters populated
In stage 4: transferred 7928 clusters with 4 clusters populated
Phase Number 2
In stage 0: transferred 28494 clusters with 4 clusters populated
In stage 1: transferred 22228 clusters with 4 clusters populated
In stage 2: transferred 12371 clusters with 4 clusters populated
In stage 3: transferred 7732 clusters with 4 clusters populated
In stage 4: transferred 6502 clusters with 4 clusters populated
Phase Number 3
In stage 0: transferred 28851 clusters with 4 clusters populated
In stage 1: transferred 23278 clusters with 4 clusters populated
In stage 2: transferred 13377 clusters with 4 clusters populated
In stage 3: transferred 7774 clusters with 4 clusters populated
In stage 4: transferred 6398 clusters with 4 clust

In [23]:
# Convert the results to a DataFrame
ResultsDF = pd.DataFrame(Results, columns=['Parameters', 'Loss', 'Doc Number', 'Top Index', 'Top Words', 'All Words'])

# Select the Alpha and Beta pair with the lowest loss
best_result = ResultsDF.loc[ResultsDF['Loss'].idxmin()]
best_alpha, best_beta = best_result['Parameters']
print(f"Best Alpha: {best_alpha}, best Beta: {best_beta}")

# Retrain the final model with the best values
mgp = MovieGroupProcess(K=K, alpha=best_alpha, beta=best_beta, n_iters=5)
vocab = set(x for review in reviews_lemmatized for x in review)
n_terms = len(vocab)
model = mgp.fit(reviews_lemmatized, n_terms)


Best Alpha: 0.7625, best Beta: 0.05
In stage 0: transferred 27475 clusters with 4 clusters populated
In stage 1: transferred 20365 clusters with 4 clusters populated
In stage 2: transferred 13419 clusters with 4 clusters populated
In stage 3: transferred 8904 clusters with 4 clusters populated
In stage 4: transferred 6814 clusters with 4 clusters populated


In [24]:
# Assign a cluster to each tweet
def choose_best_label(review, mgp):
    scores = []
    for cluster in range(mgp.K):
        word_counts = mgp.cluster_word_distribution[cluster]
        if not word_counts:
            # An empty cluster cannot be the best match
            scores.append(-np.inf)
            continue
        log_denominator = np.log(mgp.cluster_word_count[cluster] + len(word_counts))
        # Add-one smoothing: words unseen in the cluster count as 0 instead of being skipped,
        # so a cluster is not favored merely for lacking the tweet's words
        score = sum(np.log(word_counts.get(word, 0) + 1) - log_denominator for word in review)
        scores.append(score)
    return int(np.argmax(scores))

# Add a "Cluster" column holding the cluster assigned to each tweet
data['Cluster'] = [choose_best_label(review, mgp) for review in reviews_lemmatized]

# Save the results to a CSV file
output_file = 'ClusteredTweets.csv'
data.to_csv(output_file, index=False)

# Print the final status message
print("Tweets have been clustered and the results saved to ClusteredTweets.csv")

Tweets have been clustered and the results saved to ClusteredTweets.csv
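
The scoring idea behind `choose_best_label` can be tried on toy data. The sketch below assumes an add-one-smoothed log-likelihood in which a word absent from a cluster counts as 0; the helper name `smoothed_log_score`, the two clusters, and the document are all hypothetical.

```python
import math


def smoothed_log_score(doc, word_counts):
    """Add-one-smoothed log-likelihood of doc under one cluster's word counts."""
    total = sum(word_counts.values())  # total word occurrences in the cluster
    vocab = len(word_counts)           # distinct words in the cluster
    return sum(
        math.log(word_counts.get(word, 0) + 1) - math.log(total + vocab)
        for word in doc
    )


# Hypothetical clusters and document, for illustration only
clusters = [
    {"mask": 8, "sanitizer": 6, "store": 1},
    {"price": 9, "store": 7, "stock": 4},
]
doc = ["store", "price", "stock"]

scores = [smoothed_log_score(doc, counts) for counts in clusters]
best = max(range(len(scores)), key=scores.__getitem__)

print(best)  # 1: the second cluster contains all three words
```

Every term in the sum is negative, so a word missing from a cluster lowers that cluster's score the most, which is the behavior wanted when choosing the closest cluster.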
