<a href="https://colab.research.google.com/github/muratal49/LDA-Topic-Selection-News-Articles/blob/main/LDA_Topic_Modelling_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We will work with news articles and classify them by topic; an article can belong to more than one topic, depending on its content.  
**LDA models** start from the assumption that each document is a mixture of topics, and that each topic is a probability distribution over the words in the corpus.
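
To make the "documents are mixtures of topics, topics are distributions over words" idea concrete, here is a minimal sketch of the generative story LDA assumes. It uses only the standard library; the toy vocabulary, the number of topics, and the Dirichlet parameter 0.5 are illustrative assumptions, not values taken from this notebook.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, doc_topic, length, rng):
    """For each word slot: pick a topic from the document's mixture,
    then pick a word from that topic's word distribution."""
    words = []
    for _ in range(length):
        topic = rng.choices(range(len(doc_topic)), weights=doc_topic)[0]
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return words

rng = random.Random(0)
vocab = ["game", "team", "image", "file", "orbit", "launch"]  # toy vocabulary (assumption)
num_topics = 3                                                  # illustrative choice

# Each topic is a probability distribution over the vocabulary.
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(num_topics)]
# Each document is a probability distribution over topics.
doc_topic = sample_dirichlet([0.5] * num_topics, rng)

toy_doc = generate_document(topic_word, vocab, doc_topic, length=8, rng=rng)
print(doc_topic)
print(toy_doc)
```

Training an LDA model runs this story in reverse: given only the words, it estimates the topic-word and document-topic distributions that most plausibly produced them.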

In this project, we will work with scikit-learn's 20 Newsgroups dataset, using the 3899 articles that belong to 4 of its 20 newsgroups.

In [3]:
#First pulling in the documents:

from sklearn.datasets import fetch_20newsgroups

# Load dataset (selecting a few categories)
categories = ['rec.sport.hockey', 'comp.graphics', 'sci.space', 'talk.politics.mideast']
newsgroups = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# Convert to list of documents
documents = newsgroups.data
print(f"Loaded {len(documents)} news articles.")

Loaded 3899 news articles.


In [10]:
print(documents[0])




Hey tough guy, freedom necessitates responsibility, and
no freedom is absolute.  
BTW, to anyone who defends Arafat, read on:

"Open fire on the new Jewish immigrants, be they from the Soviet
Union, Ethiopia or anywhere else....I give you my instructions to
use violence against the immigrants.  I willjail anyone who
refuses to do this."
				Yassir Arafat, Al-Muharar, 4/10/90

At least he's not racist!
Just anti-Jewish



In [8]:
# Required libraries and tools:
import nltk
import gensim
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords

In [19]:
# First preprocessing pass:
# tokenization and stopword removal (no lemmatization is applied here)

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Tokenization and cleaning
def preprocess(text):
    return [word for word in simple_preprocess(text) if word not in stop_words]

# Apply preprocessing

processed_docs = [preprocess(doc) for doc in documents]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [16]:
stopwords.words('english')[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [23]:
processed_docs[0][:10]

['hey',
 'tough',
 'guy',
 'freedom',
 'necessitates',
 'responsibility',
 'freedom',
 'absolute',
 'btw',
 'anyone']
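
For intuition about what `preprocess` does, here is a rough standard-library approximation of the two steps (tokenize, then drop stopwords). It is only a sketch: the regular expression is a simplification of gensim's tokenizer, and `toy_stop_words` is a tiny hypothetical subset of the NLTK list.

```python
import re

# Tiny hypothetical stop list; the notebook uses NLTK's full English list.
toy_stop_words = {"and", "no", "is", "the", "to", "who"}

def toy_preprocess(text, stop_words, min_len=2, max_len=15):
    """Lowercase, keep alphabetic tokens of a reasonable length, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens
            if min_len <= len(t) <= max_len and t not in stop_words]

sample = "Hey tough guy, freedom necessitates responsibility, and no freedom is absolute."
toy_tokens = toy_preprocess(sample, toy_stop_words)
print(toy_tokens)
```

On this sentence the result matches the first tokens shown above, because punctuation is discarded and "and", "no", and "is" are treated as stopwords.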

The variable **corpus** is a list with the same length as `processed_docs`; each element is the bag-of-words (BOW) representation of one document, i.e. a list of `(token_id, count)` pairs.

In [25]:
#Bag of words for our corpus:
from gensim.corpora import Dictionary


dictionary = Dictionary(processed_docs)

corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
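
As a mental model for the dictionary and bag-of-words step, the sketch below rebuilds the same idea with plain Python. `build_vocab` and `doc_to_bow` are hypothetical helper names, and the ids they assign will not necessarily match the ones gensim's `Dictionary` would assign.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc_to_bow(doc, token2id):
    """Return a sorted list of (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_docs = [["freedom", "responsibility", "freedom"],
            ["space", "mission", "freedom"]]
token2id = build_vocab(toy_docs)
toy_corpus = [doc_to_bow(doc, token2id) for doc in toy_docs]
print(toy_corpus)  # [[(0, 2), (1, 1)], [(0, 1), (2, 1), (3, 1)]]
```

Word order is lost in this representation; only which tokens occur, and how often, is kept.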

In [31]:
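# dictionary.dfs maps token_id -> number of documents that contain the token,
# so these are the tokens that occur in exactly one document
once_tokens = [dictionary[token_id] for token_id, freq in dictionary.dfs.items() if freq == 1]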


In [32]:
once_tokens[:10]

['necessitates',
 'willjail',
 'liverpool',
 'vanecek',
 'tres',
 'spif',
 'newsfeed',
 'spalling',
 'lasts',
 'inconvienenced']
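
Tokens like these are counted by document frequency (how many documents contain them), not by total occurrences. The standard-library sketch below shows the difference on made-up toy documents and how such tokens could be pruned; in gensim, `Dictionary.filter_extremes` is the usual tool for this kind of pruning.

```python
from collections import Counter

def document_frequencies(docs):
    """Count, for each token, the number of documents it appears in."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # set(): a token counts once per document
    return df

toy_docs = [["game", "team", "game"],
            ["team", "image"],
            ["image", "jpeg", "image"]]

df = document_frequencies(toy_docs)
# "game" occurs twice overall but only in one document, so it is listed here.
single_doc_tokens = sorted(t for t, n in df.items() if n == 1)
# Keep only tokens that appear in at least two documents.
pruned_docs = [[t for t in doc if df[t] >= 2] for doc in toy_docs]

print(single_doc_tokens)  # ['game', 'jpeg']
print(pruned_docs)        # [['team'], ['team', 'image'], ['image', 'image']]
```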

# It is time to train our **LDA** Model:

In [60]:
from gensim.models import LdaModel

# Train the LDA model
# num_topics is chosen by the user: we pick 4, so LDA will fit 4 topics.
# You can name them later, once you see what each topic is about.
lda_model = LdaModel(corpus=corpus, num_topics=4, id2word=dictionary, passes=10)

# Print the top words for each topic
topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)



(0, '0.007*"space" + 0.005*"people" + 0.005*"israel" + 0.004*"jews" + 0.003*"armenian"')
(1, '0.007*"pit" + 0.006*"la" + 0.006*"det" + 0.006*"gm" + 0.006*"period"')
(2, '0.008*"game" + 0.007*"said" + 0.007*"people" + 0.005*"like" + 0.005*"know"')
(3, '0.011*"image" + 0.008*"edu" + 0.008*"graphics" + 0.008*"jpeg" + 0.006*"file"')
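
Each printed tuple is a topic id followed by a string of `weight*"word"` terms, where the weight is the word's probability under that topic. If you want the pairs as data rather than as a string, one option is to parse the string, as in this small sketch (`parse_topic` is a hypothetical helper; gensim's `show_topic` method returns such pairs directly).

```python
import re

# One line copied from the output above.
topic_line = (3, '0.011*"image" + 0.008*"edu" + 0.008*"graphics" + 0.008*"jpeg" + 0.006*"file"')

def parse_topic(topic_string):
    """Turn '0.011*"image" + ...' into [('image', 0.011), ...]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

topic_id, topic_string = topic_line
top_words = parse_topic(topic_string)
print(topic_id, top_words)
```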


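After inspecting the top words of each topic, we label the topics by hand:
*   0: Politics
*   1: Science
*   2: Computers
*   3: Sports

Note that LDA is randomly initialized, so topic indices and top words change between runs unless `random_state` is fixed. The output printed above appears to come from a different run than these labels: there, topic 3 is clearly about computer graphics and topic 1 looks like hockey team abbreviations ("pit", "det").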



In [47]:
new_doc = "NASA is planning a mission to Mars next year using the advanced computer graphics date processing ."
bow_new_doc = dictionary.doc2bow(preprocess(new_doc))

# Get topic distribution
topic_distribution = lda_model[bow_new_doc]
print(topic_distribution)

[(0, 0.019474424), (1, 0.68809474), (2, 0.27263343), (3, 0.01979741)]
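
Since an article can belong to more than one topic, a simple way to read this distribution is to keep every topic whose probability exceeds a cutoff. The sketch below uses the output printed above and the hand-assigned labels; the 0.2 threshold and the helper name `topics_above` are assumptions for illustration.

```python
# Labels assigned by hand after inspecting the topics (they depend on the run).
topic_labels = {0: "Politics", 1: "Science", 2: "Computers", 3: "Sports"}

# Distribution copied from the output above.
topic_distribution = [(0, 0.019474424), (1, 0.68809474), (2, 0.27263343), (3, 0.01979741)]

def topics_above(distribution, labels, threshold=0.2):
    """Return (label, probability) pairs above the threshold, most probable first."""
    ranked = sorted(distribution, key=lambda pair: pair[1], reverse=True)
    return [(labels[tid], round(prob, 3)) for tid, prob in ranked if prob >= threshold]

assigned = topics_above(topic_distribution, topic_labels)
print(assigned)  # [('Science', 0.688), ('Computers', 0.273)]
```

With these labels the NASA sentence is assigned mostly to Science, with Computers as a secondary topic.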


In [50]:
pip install pyLDAvis

Collecting pyLDAvis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl.metadata (4.2 kB)
Collecting funcy (from pyLDAvis)
  Downloading funcy-2.0-py2.py3-none-any.whl.metadata (5.9 kB)
Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)
Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Installing collected packages: funcy, pyLDAvis
Successfully installed funcy-2.0 pyLDAvis-3.4.1


In [51]:
import pyLDAvis
import pyLDAvis.gensim_models  # named pyLDAvis.gensim before pyLDAvis 3.3

# Prepare visualization
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
vis

**Let's try to improve our model by pruning the vocabulary with `TfidfVectorizer` before LDA:**

In [52]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert processed documents back to text
text_documents = [" ".join(doc) for doc in processed_docs]

# Apply TF-IDF
vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
tfidf_matrix = vectorizer.fit_transform(text_documents)

# Get feature names (words)
feature_names = vectorizer.get_feature_names_out()



In [54]:
feature_names[:10]



array(['aa', 'aao', 'aaoepp', 'aaplay', 'aarnet', 'aas', 'aawin', 'ab',
       'ababa', 'abandon'], dtype=object)

In [55]:
# Keep only the words in the TF-IDF vocabulary, i.e. those that passed the max_df/min_df
# and stop-word filters above (this filters by document frequency, not by TF-IDF score):

important_words = set(feature_names)

# Filter processed docs to keep only important words
filtered_docs = [[word for word in doc if word in important_words] for doc in processed_docs]
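
The filter above relies on the vectorizer's vocabulary, which is determined by the document-frequency thresholds rather than by the TF-IDF weights themselves. For reference, here is a minimal standard-library sketch of how such weights are computed, using the smoothed IDF formula that scikit-learn applies by default, `ln((1 + n) / (1 + df)) + 1`. The toy documents are made up, and the L2 normalization that `TfidfVectorizer` also applies is left out for brevity.

```python
import math
from collections import Counter

def smoothed_idf(docs):
    """idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return {t: math.log((1 + n_docs) / (1 + n)) + 1 for t, n in df.items()}

def tfidf(doc, idf):
    """Unnormalized TF-IDF: raw term count times idf."""
    return {t: count * idf[t] for t, count in Counter(doc).items()}

toy_docs = [["space", "launch", "people"],
            ["game", "team", "people"],
            ["image", "file", "people"]]

idf = smoothed_idf(toy_docs)
weights = tfidf(toy_docs[0], idf)
print(sorted(weights.items(), key=lambda kv: kv[1], reverse=True))
```

A word such as "people", which appears in every toy document, gets the lowest possible weight, while words confined to one document score higher. Ranking words by these weights would be one way to keep only "high TF-IDF" words.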



In [58]:

# Repeat the LDA preparation steps, this time building the dictionary from filtered_docs
dictionary_TDIDF = Dictionary(filtered_docs)

# Convert documents into bag-of-words format; the ids must come from the same
# dictionary that is later passed to LdaModel as id2word
corpus_TDIDF = [dictionary_TDIDF.doc2bow(doc) for doc in filtered_docs]
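
Bag-of-words ids are only meaningful relative to the dictionary that produced them, so a corpus and the `id2word` mapping given to the model have to come from the same dictionary. The toy sketch below (plain Python, made-up vocabularies) shows what goes wrong when ids from one vocabulary are decoded with another.

```python
full_vocab = ["absolute", "freedom", "game", "image", "space"]
filtered_vocab = ["freedom", "image", "space"]  # smaller vocabulary after filtering

full_ids = {token: i for i, token in enumerate(full_vocab)}
filtered_ids = {token: i for i, token in enumerate(filtered_vocab)}
filtered_words = {i: token for token, i in filtered_ids.items()}

doc = ["freedom", "image"]

# Ids taken from the wrong (full) vocabulary, then decoded with the filtered one.
wrong_ids = [full_ids[t] for t in doc]
decoded_wrong = [filtered_words.get(i, "<out of range>") for i in wrong_ids]

# Ids taken from the matching (filtered) vocabulary.
right_ids = [filtered_ids[t] for t in doc]
decoded_right = [filtered_words[i] for i in right_ids]

print(wrong_ids, decoded_wrong)  # [1, 3] ['image', '<out of range>']
print(right_ids, decoded_right)  # [0, 1] ['freedom', 'image']
```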



In [59]:
# Train LDA model
lda_model_with_TDIDF = LdaModel(corpus=corpus_TDIDF, num_topics=4, id2word=dictionary_TDIDF, passes=10)

# Print topics
topics_TDIDF = lda_model_with_TDIDF.print_topics(num_words=5)
for topic in topics_TDIDF:
    print(topic)



(0, '0.010*"edu" + 0.009*"graphics" + 0.007*"data" + 0.005*"image" + 0.005*"available"')
(1, '0.015*"jpeg" + 0.014*"image" + 0.009*"file" + 0.008*"gif" + 0.006*"bit"')
(2, '0.008*"people" + 0.007*"armenian" + 0.006*"armenians" + 0.006*"said" + 0.005*"israel"')
(3, '0.008*"space" + 0.006*"game" + 0.005*"like" + 0.005*"team" + 0.004*"think"')


In [61]:
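# Prepare visualization for the TF-IDF-filtered model
import pyLDAvis.gensim_models  # named pyLDAvis.gensim before pyLDAvis 3.3

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model_with_TDIDF, corpus_TDIDF, dictionary_TDIDF)
vis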

