## Topic modeling
Topic modeling is a statistical technique for uncovering the abstract topics that run through a collection of documents. One of the most popular methods for this is Latent Dirichlet Allocation (LDA).
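To give a feel for what LDA assumes, here is a minimal sketch of its generative story: each document gets its own mixture of topics, and each word is produced by first picking a topic from that mixture and then picking a word from that topic. The vocabulary, the two topics, and the `alpha` values are made-up toy values, and the helper names are hypothetical. This only simulates the assumed process; it is not how gensim fits a model.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, vocab, alpha, length, rng):
    """Generate one toy document following LDA's generative story."""
    # 1. Draw this document's topic mixture.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position according to the mixture.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        word = rng.choices(vocab, weights=topic_word_probs[topic])[0]
        words.append(word)
    return theta, words

# Hypothetical toy setup: two topics over a six-word vocabulary.
vocab = ["rocket", "orbit", "nasa", "goal", "team", "season"]
topic_word_probs = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "space"-like topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "sports"-like topic
]
rng = random.Random(0)
theta, words = generate_document(
    topic_word_probs, vocab, alpha=[0.5, 0.5], length=8, rng=rng
)
print([round(p, 3) for p in theta])  # the document's topic mixture
print(words)                         # the generated words
```

Training LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the per-topic word distributions that most plausibly produced them.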

Here's a simple walkthrough using the gensim library in Python to perform LDA topic modeling on the 20 Newsgroups dataset, which is a collection of newsgroup documents classified into 20 categories.

#### Step 1: Acquire Data
The 20 Newsgroups dataset can be fetched directly using the sklearn.datasets module:

In [1]:
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
documents = newsgroups.data

#### Check how many documents we have

In [2]:
#how many documents do we have?
print(len(documents))

18846


#### Step 2: Preprocess the Data
Before running LDA, we need to preprocess the data:

Tokenization: Break each document down into individual words.

Stop word removal: Drop words like "and", "the", and "is", which carry little meaning on their own.

Lemmatization: Convert each word to its base or dictionary form.

In [3]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

In [4]:
#download necessary dictionaries
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jagoodkid/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/jagoodkid/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /Users/jagoodkid/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

#### Try to make a set of stop words named 'stop_words'

In [5]:
#Enter your code here
stop_words = set(stopwords.words('english'))

In [6]:
lemmatizer = WordNetLemmatizer()

def preprocess(document):
    tokens = word_tokenize(document.lower())
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token.isalpha() and token not in stop_words]
    return tokens

tokenized_data = [preprocess(doc) for doc in documents]
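To see what the three steps do to a single sentence without downloading any NLTK data, here is a toy sketch. The stop word set and the lemma lookup table are tiny hand-written stand-ins (hypothetical names `TOY_STOP_WORDS` and `TOY_LEMMAS`), and the regex tokenizer is a simplification, so the results will not match `word_tokenize` and `WordNetLemmatizer` in general.

```python
import re

# Hypothetical stand-ins for the NLTK resources used above.
TOY_STOP_WORDS = {"the", "are", "and", "is", "on", "a"}
TOY_LEMMAS = {"cats": "cat", "mats": "mat", "mice": "mouse"}

def toy_preprocess(document):
    # Tokenize: lowercase and keep runs of ASCII letters only.
    tokens = re.findall(r"[a-z]+", document.lower())
    # Remove stop words, then map each remaining token to its base form.
    return [TOY_LEMMAS.get(t, t) for t in tokens if t not in TOY_STOP_WORDS]

print(toy_preprocess("The cats are sitting on the mats, and 3 mice!"))
# ['cat', 'sitting', 'mat', 'mouse']
```

Note that "sitting" passes through unchanged because the toy table has no entry for it. The digit and the punctuation are dropped by the tokenizer, much as the `token.isalpha()` check does in the real `preprocess` function.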

#### Step 3: Create a Dictionary and a Corpus
The dictionary maps each unique word to an integer ID. The corpus stores each document as a bag of words, that is, a list of (word ID, count) pairs, which is the input format LDA expects:

In [7]:
from gensim.corpora import Dictionary

dictionary = Dictionary(tokenized_data)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_data]
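To make the bag-of-words representation concrete, here is a dependency-free sketch of roughly what the dictionary and `doc2bow` compute. The helper names `build_token_ids` and `to_bow` are hypothetical, and IDs here are assigned in first-seen order, which may differ from the IDs gensim assigns.

```python
from collections import Counter

def build_token_ids(tokenized_docs):
    """Assign each distinct token an integer id in first-seen order."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_docs = [["space", "launch", "space"], ["launch", "team"]]
token2id = build_token_ids(toy_docs)
toy_corpus = [to_bow(doc, token2id) for doc in toy_docs]
print(token2id)    # {'space': 0, 'launch': 1, 'team': 2}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded: the first toy document is reduced to "word 0 twice, word 1 once", which is all LDA looks at.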

#### Step 4: Run LDA Model
Let's train an LDA model with 20 topics (the dataset has 20 categories), making 15 passes over the corpus:

In [None]:
from gensim.models import LdaModel

NUM_TOPICS = 20
lda_model = LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15)

#### Step 5: Display Topics
Now, we can display the topics generated by our model:

In [None]:
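import re

topics = lda_model.print_topics(num_words=5)
for n, topic in enumerate(topics):
    # topic is a (topic_id, description) pair; strip weights and punctuation to keep only the words
    temp = re.sub(r'[^a-zA-Z\s]', '', topic[1]).split()
    print(f"{n}: {','.join(temp)}")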
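The regex step is easier to follow on a single topic string. The sketch below uses a made-up sample in the weight*"word" format that `print_topics` returns (the weights and words are invented, not model output) and compares the letters-only substitution with an alternative that captures the quoted words directly.

```python
import re

# Hypothetical example of the (topic_id, description) pair for one topic.
sample_topic = (0, '0.042*"space" + 0.021*"nasa" + 0.015*"launch"')

# Approach used above: delete everything except ASCII letters and whitespace.
letters_only = re.sub(r'[^a-zA-Z\s]', '', sample_topic[1]).split()

# Alternative: capture whatever sits between the double quotes.
quoted = re.findall(r'"([^"]+)"', sample_topic[1])

print(letters_only)  # ['space', 'nasa', 'launch']
print(quoted)        # ['space', 'nasa', 'launch']
```

Both give the same result here. The quoted-word pattern would also keep tokens containing digits or non-ASCII letters, although the preprocessing in this walkthrough already filters tokens with `isalpha()`.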

### Answers

#### 1. print(documents)
#### 2. print(len(documents))
#### 3. stop_words = set(stopwords.words('english'))