___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# Topic Modeling Assessment Project

Welcome to your Topic Modeling Assessment! For this project you will be working with a dataset of over 400,000 Quora questions that have no labeled category, and attempting to find 20 categories to assign these questions to. The .csv file of these text questions can be found in the Topic-Modeling folder.

Remember you can always check the solutions notebook and video lecture for any questions.

#### Task: Import pandas and read in the quora_questions.csv file.

In [26]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)

In [1]:
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.util import ngrams
import nltk

import spacy
spacy.load('en_core_web_lg')

<spacy.lang.en.English at 0x251e8252900>

In [5]:
import kaggle

In [49]:
df = pd.read_csv(r"C:\Users\kristian.nordby\OneDrive - West Point\Desktop\AY 25-1\NLP\project 2\train.csv\train.csv")
df.drop(columns=[col for col in df.columns if col != 'question1'], inplace=True)
df.columns = ['Question']
df['Question'] = df['Question'].astype(str)
df.head()

Unnamed: 0,Question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


# Preprocessing

#### Task: Use TF-IDF Vectorization to create a vectorized document term matrix. You may want to explore the max_df and min_df parameters.

In [50]:
# Initialize the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Get the default English stopwords from spacy
nlp = spacy.load('en_core_web_lg')
stop_words = nlp.Defaults.stop_words
    
# Function to preprocess, tokenize, and lemmatize text
def preprocess(text):
    tokens = text.split(' ')
    return ' '.join([lemmatizer.lemmatize(word.lower()) for word in tokens if word.lower() not in stop_words])

# Apply the preprocessing to the questions
df['Preprocessed_Question'] = df['Question'].apply(preprocess)

In [51]:
df.head()

Unnamed: 0,Question,Preprocessed_Question
0,What is the step by step guide to invest in sh...,step step guide invest share market india?
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,story kohinoor (koh-i-noor) diamond?
2,How can I increase the speed of my internet co...,increase speed internet connection vpn?
3,Why am I mentally very lonely? How can I solve...,mentally lonely? solve it?
4,"Which one dissolve in water quikly sugar, salt...","dissolve water quikly sugar, salt, methane car..."


In [52]:
# Vectorize the text data
# max_df: a term that appears in more than 95% of documents is discarded
# min_df: a term that appears in fewer than 2 documents is discarded

vectorizer = TfidfVectorizer(max_df = 0.95, min_df = 2, stop_words='english')
X = vectorizer.fit_transform(df['Preprocessed_Question'].fillna(''))
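
To make the effect of `max_df` and `min_df` concrete, here is a minimal sketch on a toy corpus. The four documents and the variable names are made up for illustration and are not taken from the Quora data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, not from the Quora dataset)
docs = [
    "python data science",
    "python machine learning",
    "python data analysis",
    "python web scraping",
]

# No frequency filtering: every term is kept
full_vocab = sorted(TfidfVectorizer().fit(docs).vocabulary_)

# max_df=0.95 drops terms found in more than 95% of documents ("python" is in all 4)
# min_df=2 drops terms found in fewer than 2 documents (everything except "data")
filtered_vocab = sorted(TfidfVectorizer(max_df=0.95, min_df=2).fit(docs).vocabulary_)

print(full_vocab)
print(filtered_vocab)
```

On the full dataset the same two thresholds trim both near-universal words and one-off tokens such as rare misspellings, which keeps the document term matrix smaller.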


# Non-negative Matrix Factorization

#### TASK: Using Scikit-Learn, create an instance of NMF with 20 expected components. (Use random_state=42.)

In [54]:
from sklearn.decomposition import NMF

nmf = NMF(n_components=20, random_state=42)

nmf.fit(X)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
  n_components=20, random_state=42, shuffle=False, solver='cd', tol=0.0001,
  verbose=0)

#### TASK: Print out the top 15 most common words for each of the 20 topics.

THE TOP 15 WORDS FOR TOPIC #0
['thing', 'read', 'place', 'visit', 'places', 'phone', 'buy', 'laptop', 'movie', 'ways', '2016', 'books', 'book', 'movies', 'best']


THE TOP 15 WORDS FOR TOPIC #1
['majors', 'recruit', 'sex', 'looking', 'differ', 'use', 'exist', 'really', 'compare', 'cost', 'long', 'feel', 'work', 'mean', 'does']


THE TOP 15 WORDS FOR TOPIC #2
['add', 'answered', 'needing', 'post', 'easily', 'improvement', 'delete', 'asked', 'google', 'answers', 'answer', 'ask', 'question', 'questions', 'quora']


THE TOP 15 WORDS FOR TOPIC #3
['using', 'website', 'investment', 'friends', 'black', 'internet', 'free', 'home', 'easy', 'youtube', 'ways', 'earn', 'online', 'make', 'money']


THE TOP 15 WORDS FOR TOPIC #4
['balance', 'earth', 'day', 'death', 'changed', 'live', 'want', 'change', 'moment', 'real', 'important', 'thing', 'meaning', 'purpose', 'life']


THE TOP 15 WORDS FOR TOPIC #5
['reservation', 'engineering', 'minister', 'president', 'company', 'china', 'business', 'country', 

#### TASK: Add a new column to the original quora dataframe that labels each question into one of the 20 topic categories.

In [54]:
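# Topic weights for each question, shape (n_questions, 20); inferred from the argmax call in the next cell
topic_assignments = nmf.transform(X)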


In [55]:
df['Topic'] = topic_assignments.argmax(axis=1)

In [56]:
df[['Question', 'Topic']].head(10)

Unnamed: 0,Question,Topic
0,What is the step by step guide to invest in sh...,5
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,16
2,How can I increase the speed of my internet co...,17
3,Why am I mentally very lonely? How can I solve...,11
4,"Which one dissolve in water quikly sugar, salt...",14
5,Astrology: I am a Capricorn Sun Cap moon and c...,1
6,Should I buy tiago?,0
7,How can I be a good geologist?,10
8,When do you use シ instead of し?,19
9,Motorola (company): Can I hack my Charter Moto...,17
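
The labeling step works because `transform` returns one row of topic weights per document, and `argmax(axis=1)` picks the column with the largest weight in each row. A small self-contained sketch, with a made-up four-question DataFrame and hypothetical `toy_` names:

```python
import pandas as pd
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the questions DataFrame
toy_df = pd.DataFrame({
    "Question": [
        "how to earn money online",
        "easy ways to make money online",
        "best books to read this year",
        "best movies and books of the year",
    ]
})

toy_X = TfidfVectorizer(stop_words="english").fit_transform(toy_df["Question"])
toy_nmf = NMF(n_components=2, random_state=42, max_iter=500).fit(toy_X)

# One row per question, one column per topic
weights = toy_nmf.transform(toy_X)

# The index of the largest weight in each row becomes the topic label
toy_df["Topic"] = weights.argmax(axis=1)
print(toy_df)
```

The topic numbers themselves are arbitrary indices into `components_`; they only become meaningful when read next to the top words of the corresponding topic.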


# Alternative Methods of Grouping
K-means and LDA

In [77]:
# K-means
from sklearn.cluster import KMeans

def cluster_and_filter_relevance(df, n_clusters=5, n_key_words=10):
    """
    Cluster the preprocessed questions with K-means and collect the top keywords for each cluster.
    Uses TF-IDF to vectorize the questions and K-means to cluster them into groups.
    Note: this also adds a 'K_Means_Cluster' column to the DataFrame that is passed in.
    
    :param df: DataFrame with a 'Preprocessed_Question' column.
    :param n_clusters: Number of clusters to create.
    :param n_key_words: Number of top keywords to store for each cluster.
    
    :return: A copy of the DataFrame with cluster labels, and a dictionary mapping each cluster to its keywords.
    """
    # Vectorize the text data
    # max_df: a term that appears in more than 95% of documents is discarded
    # min_df: a term that appears in fewer than 2 documents is discarded
    vectorizer = TfidfVectorizer(max_df = 0.95, min_df = 2, stop_words='english')
    X = vectorizer.fit_transform(df['Preprocessed_Question'].fillna(''))

    # Perform K-means clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(X)
    df['K_Means_Cluster'] = kmeans.labels_
    
    # Analyze the clusters to determine relevance
    # Initialize a dictionary to store the keywords for each cluster
    cluster_keywords = {}
    order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
    terms = vectorizer.get_feature_names_out()
    
    # Iterate through each cluster and store the top n keywords
    for i in range(n_clusters):
        cluster_keywords[i] = [terms[ind] for ind in order_centroids[i, :n_key_words]]
    
    # No relevance filtering is applied yet; work on a copy so later filtering leaves df untouched
    df_filtered = df.copy()

    # Return both the filtered DataFrame and the cluster_keywords dictionary
    return df_filtered, cluster_keywords

df_k_means, clusters_k_means = cluster_and_filter_relevance(df, 20, 10)

for group in sorted(clusters_k_means.keys()):
    print(f'cluster_k_means {group} keywords: {clusters_k_means[group]}')

print("\nCluster counts:")
print(df_k_means['K_Means_Cluster'].value_counts().sort_index())

cluster_k_means 0 keywords: ['mean', 'say', 'dream', 'girl', 'symbol', 'guy', 'word', 'phrase', 'love', 'person']
cluster_k_means 1 keywords: ['long', 'school', 'distance', 'relationship', 'high', 'work', 'term', 'best', 'stay', 'time']
cluster_k_means 2 keywords: ['like', 'people', 'feel', 'think', 'look', 'girl', 'work', 'culture', 'sex', 'hate']
cluster_k_means 3 keywords: ['attend', 'like', 'school', 'university', 'college', 'conference', 'international', 'student', 'yale', 'best']
cluster_k_means 4 keywords: ['buy', 'best', 'laptop', 'phone', '15000', 'car', 'india', 'online', '15k', 'inr']
cluster_k_means 5 keywords: ['account', 'password', 'gmail', 'email', 'facebook', 'instagram', 'recover', 'delete', 'hack', 'forgot']
cluster_k_means 6 keywords: ['computer', 'science', 'engineering', 'data', 'best', 'university', 'difference', 'student', 'good', 'learning']
cluster_k_means 7 keywords: ['whatsapp', 'hack', 'message', 'account', 'phone', 'group', 'profile', 'chat', 'status', 'kn

In [None]:
df.head()

In [None]:
# LDA (takes a long time)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt


# Use CountVectorizer to convert the text data into a matrix of token counts
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['Preprocessed_Question'])

# Define the LDA model with the number of topics you want to extract
num_topics = 20  # You can adjust this number
lda = LatentDirichletAllocation(n_components=num_topics, random_state=42)
lda.fit(X)

# Get the topic distribution for each document
doc_topic_dist = lda.transform(X)

# Reduce the dimensionality of the topics for visualization using t-SNE
tsne_model = TSNE(n_components=2, random_state=42)
tsne_lda = tsne_model.fit_transform(doc_topic_dist)

# Plot the topics in a 2D space
plt.figure(figsize=(12, 8))
plt.scatter(tsne_lda[:, 0], tsne_lda[:, 1], c=doc_topic_dist.argmax(axis=1), cmap='viridis')
plt.colorbar(label='Topic')
plt.title('2D Visualization of LDA Topics')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.show()

# Display the top words in each topic
num_top_words = 10  # Number of top words to display for each topic
feature_names = vectorizer.get_feature_names_out()

for topic_idx, topic in enumerate(lda.components_):
    top_words = [feature_names[i] for i in topic.argsort()[:-num_top_words - 1:-1]]
    print(f"Topic #{topic_idx + 1}: {', '.join(top_words)}")
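
Running t-SNE on the topic distribution of every question is likely to be the slowest part of the cell above. One possible workaround, sketched here as an assumption rather than as part of the original assessment, is to embed only a random sample of rows. The helper `tsne_on_sample` and the random stand-in data are hypothetical.

```python
import numpy as np
from sklearn.manifold import TSNE


def tsne_on_sample(doc_topic, sample_size=5000, seed=42):
    """Embed a random subset of rows with t-SNE; returns (row indices, 2D embedding)."""
    rng = np.random.default_rng(seed)
    n_docs = doc_topic.shape[0]
    # Sample without replacement so no document is plotted twice
    idx = rng.choice(n_docs, size=min(sample_size, n_docs), replace=False)
    embedding = TSNE(n_components=2, random_state=seed).fit_transform(doc_topic[idx])
    return idx, embedding


# Random stand-in for lda.transform(X): 200 documents, 5 topics, rows sum to 1
demo_dist = np.random.default_rng(0).dirichlet(np.ones(5), size=200)
demo_idx, demo_embedding = tsne_on_sample(demo_dist, sample_size=100)
print(demo_embedding.shape)
```

With the notebook's variables this would be called as `tsne_on_sample(doc_topic_dist)`, and the scatter plot would be colored with `doc_topic_dist[idx].argmax(axis=1)`. The sample must contain more rows than the t-SNE perplexity (30 by default).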


# Great job!