## Task 2: Topic Modeling

This task identifies key topics in a dataset of sustainability-related reviews and tweets using Latent Dirichlet Allocation (LDA). The primary goal is to uncover recurring themes such as environmental conservation, climate change, and recycling. The text is first cleaned, stripped of stopwords, and lemmatized, then converted to word counts for modeling; the resulting topics are summarized and their distribution across documents is visualized.

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# !pip install nltk scikit-learn pandas plotly

# Import necessary libraries

In [None]:
import pandas as pd
import re
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [None]:
# Load the dataset; the CSV has no header row, so column names are supplied explicitly
df = pd.read_csv(
    "/content/drive/My Drive/Advanced NLP/Dataset/twitter_dataset.csv",
    encoding="ISO-8859-1",
    names=["target", "ids", "date", "flag", "user", "text"],
)

df.head()

Unnamed: 0,target,ids,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   target  1600000 non-null  int64 
 1   ids     1600000 non-null  int64 
 2   date    1600000 non-null  object
 3   flag    1600000 non-null  object
 4   user    1600000 non-null  object
 5   text    1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


In [None]:
df["target"].value_counts()

target,count
0,800000
4,800000


## Removing unnecessary columns

In [None]:
df.head()

Unnamed: 0,target,ids,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


We are only interested in the target and text columns; the rest will be removed.

In [None]:
df = df[["text", "target"]]

# Let's extract tweets that are related to sustainability

In [None]:
# We defined a set of sustainability-related keywords
sustainability_keywords = [
    'climate change', 'renewable energy', 'clean energy', 'sustainable', 'green energy',
    'carbon emissions', 'environment', 'recycling', 'solar power', 'wind energy', 'sustainability',
    'biofuel', 'global warming', 'sustainable transport', 'fossil fuels', 'net zero', 'greenhouse gases',
    'carbon footprint', 'conservation', 'pollution'
]

In [None]:
# Now we will filter rows that contain any of the sustainability-related keywords in the 'text' column
def contains_sustainability_keywords(text):
    text = text.lower()  # Convert text to lowercase for case-insensitive matching
    return any(keyword in text for keyword in sustainability_keywords)
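
The check above is a plain substring test, so a keyword also matches when it appears inside a longer word. As a hedged alternative (not the approach used in this notebook), the sketch below anchors each keyword at word boundaries with a compiled regular expression. The shortened keyword list and the helper name `matches_whole_keyword` are hypothetical and exist only for illustration.

```python
import re

# Hypothetical, shortened keyword list for illustration only
keywords = ["climate change", "recycling", "net zero"]

# \b anchors every keyword at word boundaries; re.escape guards special characters
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
    re.IGNORECASE,
)

def matches_whole_keyword(text):
    """Return True if any keyword occurs as a whole word or phrase in text."""
    return pattern.search(str(text)) is not None

print(matches_whole_keyword("Recycling day at the office"))   # True
print(matches_whole_keyword("Precycling is a made-up word"))  # False
```

Whether whole-word matching is preferable depends on the keyword: it would, for example, stop 'environment' from matching "environmental", which may or may not be desired.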

In [None]:
# Keep only the rows that contain sustainability-related keywords
sustainability_related_df = df[df['text'].apply(contains_sustainability_keywords)]

In [None]:
print(f"Sustainability Related DF contains {len(sustainability_related_df)}")

Sustainability Related DF contains 493


In [None]:
# Download nltk data
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
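# Work on an explicit copy so that adding columns later does not trigger SettingWithCopyWarning
df = sustainability_related_df.copy()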

# Text Preprocessing

In [None]:
import re
from nltk.corpus import stopwords

lemmatizer = WordNetLemmatizer()

# Clean the text: remove user mentions, URLs, and any non-alphanumeric characters
TEXT_CLEANING_RE = r"@\S+|https?:\S+|[^A-Za-z0-9]+"

# English stopwords plus filler words and HTML-entity remnants (gt, amp, quot)
stop_words = set(stopwords.words("english"))
stop_words.update(['today', 'good', 'like', 'gt', 'amp', 'quot'])

def preprocess(text):
    # Lowercase, strip unwanted patterns, drop stopwords, then lemmatize each token
    text = re.sub(TEXT_CLEANING_RE, ' ', str(text).lower()).strip()
    tokens = [lemmatizer.lemmatize(token) for token in text.split() if token not in stop_words]
    return " ".join(tokens)

df["cleaned_text"] = df["text"].apply(preprocess)
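
To make the cleaning step easier to follow, here is a minimal sketch that applies the mention, URL, and non-alphanumeric alternatives of the cleaning pattern to a single made-up tweet. The sample text and the URL in it are hypothetical, and stopword removal and lemmatization are left out so the example runs without any NLTK downloads.

```python
import re

# Mentions, URLs, then any run of non-alphanumeric characters
TEXT_CLEANING_RE = r"@\S+|https?:\S+|[^A-Za-z0-9]+"

# Hypothetical tweet, not taken from the dataset
sample = "@user Check this: http://example.com Global warming?! &gt; ugh"

# Each match is replaced by a single space, then the string is split on whitespace
cleaned = re.sub(TEXT_CLEANING_RE, " ", sample.lower()).strip()
tokens = cleaned.split()

print(tokens)  # ['check', 'this', 'global', 'warming', 'gt', 'ugh']
```

The leftover `gt` token comes from the HTML entity `&gt;`, which is why 'gt', 'amp', and 'quot' are added to the stopword list.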

In [None]:
pd.set_option('display.max_colwidth', None)

df.head()

Unnamed: 0,text,target,cleaned_text
335,getting annoyed easily today &gt;&gt;&gt; biofuel proposal: getting annoyed easily today &gt;&gt;&gt; biof.. http://tinyurl.com/ceprvs,0,getting annoyed easily biofuel proposal getting annoyed easily biof
3363,Gosh It is raining in summer cause of the global warming?,0,gosh raining summer cause global warming
3862,"Arse, totally forgot about a webinar that I wanted to attend this morning. Now I'll never know how to secure virtualised environments",0,arse totally forgot webinar wanted attend morning never know secure virtualised environment
4722,http://twitpic.com/2ya1c - Good F'in Morning Springtime my ass. Global warming?! Suck it!,0,f morning springtime as global warming suck
6254,I hate global warming and i hate snow. ITS APRIL ffs.,0,hate global warming hate snow april ffs


## Tokenization and Vectorization

In [None]:
# Build a document-term matrix of word counts, ignoring terms that occur
# in more than 90% of the tweets or in fewer than 2 tweets
vectorizer = CountVectorizer(max_df=0.9, min_df=2, stop_words='english')
dtm = vectorizer.fit_transform(df['cleaned_text'])

# LDA Model Training

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

# Set the number of topics
num_topics = 5

# Create and train the LDA model
lda = LatentDirichletAllocation(n_components=num_topics, random_state=42)
lda.fit(dtm)
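
The number of topics is fixed at 5 here. One rough way to compare alternatives, sketched below under stated assumptions, is to fit several models and compare their perplexity (lower is generally better). The toy corpus and the candidate values are hypothetical, and perplexity measured on the training data itself is only a coarse guide; held-out data or manual inspection of the topics is usually more informative.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only to keep the sketch self-contained
docs = [
    "global warming summer heat weather",
    "global warming snow april weather",
    "recycling plastic bottle bin",
    "recycling paper office bin",
    "solar power wind energy",
    "clean energy solar panel",
]
toy_dtm = CountVectorizer().fit_transform(docs)

# Fit one model per candidate topic count and record its perplexity
scores = {}
for k in (2, 3, 4):
    model = LatentDirichletAllocation(n_components=k, random_state=42)
    model.fit(toy_dtm)
    scores[k] = model.perplexity(toy_dtm)

for k, score in scores.items():
    print(f"{k} topics: perplexity = {score:.1f}")
```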

# Extracting the Topics

In [None]:
# Get the vocabulary
terms = vectorizer.get_feature_names_out()

In [None]:
# Display the top words for each topic
def display_topics(model, feature_names, num_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx + 1}:")
        print(" ".join([feature_names[i] for i in topic.argsort()[:-num_top_words - 1:-1]]))

# Display the top 10 words for each topic
num_top_words = 10
display_topics(lda, terms, num_top_words)

Topic 1:
environment day work environmentally sorry safe save solar time conservation
Topic 2:
change pollution climate environmental know new recycling let lot project
Topic 3:
recycling environment environmental warming solar global day power morning people
Topic 4:
global warming environment thing think weather summer new work carbon
Topic 5:
global warming recycling environmental going weather hate really coffee got
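
The slice `topic.argsort()[:-num_top_words - 1:-1]` used in `display_topics` can look cryptic. The sketch below walks through it with NumPy on a hypothetical five-word vocabulary and made-up topic weights.

```python
import numpy as np

# Hypothetical vocabulary and weights for a single topic
feature_names = np.array(["carbon", "global", "recycling", "solar", "warming"])
topic = np.array([0.2, 3.1, 0.5, 0.1, 2.7])
num_top_words = 3

# argsort returns indices ordered from the smallest to the largest weight
ascending = topic.argsort()

# Walk backwards from the end and stop after num_top_words indices
top_idx = ascending[:-num_top_words - 1:-1]
top_words = [str(w) for w in feature_names[top_idx]]

print(top_words)  # ['global', 'warming', 'recycling']
```

An equivalent and arguably more readable form is `np.argsort(-topic)[:num_top_words]`.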


# Summary of the Identified Topics

In [None]:
topic_summaries = []
for topic_idx, topic in enumerate(lda.components_):
    top_words = [terms[i] for i in topic.argsort()[:-num_top_words - 1:-1]]
    summary = f"Topic {topic_idx + 1}: This topic focuses on {', '.join(top_words[:5])} and related themes."
    topic_summaries.append(summary)

# Print topic summaries
for summary in topic_summaries:
    print(summary)

Topic 1: This topic focuses on environment, day, work, environmentally, sorry and related themes.
Topic 2: This topic focuses on change, pollution, climate, environmental, know and related themes.
Topic 3: This topic focuses on recycling, environment, environmental, warming, solar and related themes.
Topic 4: This topic focuses on global, warming, environment, thing, think and related themes.
Topic 5: This topic focuses on global, warming, recycling, environmental, going and related themes.


# Visualizing the Topic Distribution

In [None]:
# Get the topic distribution for each document (one row per tweet, one column per topic)
doc_topic_dist = lda.transform(dtm)
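
Each row of the array returned by `lda.transform` is a probability distribution over topics. Taking the `argmax` of a row gives its dominant topic, but that hard assignment hides how confident the model is. The sketch below illustrates this with a hypothetical 3-document, 3-topic matrix; the values are made up and are not model output.

```python
import numpy as np

# Hypothetical document-topic matrix: 3 documents, 3 topics
toy_doc_topic = np.array([
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.34, 0.33, 0.33],
])

row_sums = toy_doc_topic.sum(axis=1)      # each row sums to 1
dominant = toy_doc_topic.argmax(axis=1)   # index of the largest weight per row
strength = toy_doc_topic.max(axis=1)      # weight of the dominant topic

print(dominant.tolist())  # [0, 2, 0]
print(strength.tolist())  # [0.7, 0.8, 0.34]
```

The third document is assigned to the first topic even though its weights are nearly uniform, so a count of dominant topics is best read alongside the underlying weights.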

In [None]:
import plotly.express as px
import pandas as pd

# Get the dominant topic for each document
doc_topics = doc_topic_dist.argmax(axis=1)

# Create a DataFrame for Plotly
df_plot = pd.DataFrame(doc_topics, columns=['Topic'])

# Map numeric topics to "Topic 1", "Topic 2", etc.
df_plot['Topic'] = df_plot['Topic'].apply(lambda x: f'Topic {x + 1}')

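# Plot using Plotly; category_orders keeps the bars in Topic 1..N order
fig = px.histogram(df_plot, x='Topic', title='Distribution of Topics in Documents',
                   labels={'Topic': 'Topics'},
                   category_orders={'Topic': [f'Topic {i + 1}' for i in range(num_topics)]},
                   text_auto=True)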

# Update layout for better appearance
fig.update_layout(xaxis_title='Topics',
                  yaxis_title='Number of Documents',
                  title_x=0.5)

# Show interactive plot
fig.show()