# Topic Modeling Assessment Project

For this project we will work with a dataset of over 400,000 Quora questions that have no labeled category, and attempt to find 20 categories to assign these questions to. The questions are provided as a .csv file (`quora_questions.csv`).

In [1]:
import pandas as pd 
import numpy as np

### 1. Data Preparation
- **Description**: Load and preprocess the dataset.
- **Tasks**:
  - Load the dataset into memory.
  - Perform any necessary preprocessing steps, such as removing stopwords and punctuation, and applying stemming/lemmatization.
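
The cells below do not run a separate preprocessing pass; `TfidfVectorizer` lowercases text by default and `stop_words='english'` drops common English stop words. For readers who want an explicit step, here is a minimal sketch of lowercasing, punctuation removal, and stopword filtering. The `preprocess` helper and the tiny `STOPWORDS` set are illustrative assumptions, not part of the original notebook, and no stemming or lemmatization is attempted.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "is", "in", "to", "of", "by",
             "what", "how", "can", "i", "my"}

def preprocess(text):
    """Lowercase, drop punctuation, and remove stopwords from one question."""
    text = text.lower()
    # Replace anything that is not a lowercase letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)

cleaned = preprocess("How can I increase the speed of my internet connection?")
print(cleaned)
```

Applied to the notebook's data, such a helper could be mapped over `quora['Question']` before vectorization.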

In [2]:
quora = pd.read_csv('quora_questions.csv')
quora.head()

Unnamed: 0,Question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


In [3]:
quora.isnull().sum()

Question    0
dtype: int64

### 2. TF-IDF Vectorization
- **Description**: Convert the text documents into numerical vectors using TF-IDF.
- **Tasks**:
  - Implement TF-IDF vectorization on the preprocessed text data.
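
To make the weighting concrete, the sketch below computes TF-IDF by hand on a three-document toy corpus. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is what scikit-learn uses with its default settings; the corpus and the `tfidf` helper are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for the Quora questions.
docs = [
    "best books read",
    "best movies watch",
    "earn money online",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Return an L2-normalised {term: weight} mapping for one document."""
    tf = Counter(tokens)
    # Smoothed idf, assumed to match scikit-learn's default (smooth_idf=True).
    weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf(tokenized[0])
print(vec)
```

Because "best" occurs in two of the three documents, it receives a lower weight than "books" or "read", which occur in only one.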

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [5]:
# Ignore terms that appear in more than 95% of questions or in fewer than 2 questions
TFIDF = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [6]:
# A: the sparse TF-IDF matrix of the dataset (rows = questions, columns = terms)
A = TFIDF.fit_transform(quora['Question'])

### 3. Topic Modeling with NNMF
- **Description**: Apply Non-Negative Matrix Factorization to the TF-IDF matrix for topic extraction.
- **Tasks**:
  - Implement NNMF on the TF-IDF matrix.
  - Extract topics and analyze their relevance.
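
NMF approximates a non-negative matrix `A` (documents x words) as the product `W @ H`, where `W` holds document-topic weights and `H` holds topic-word weights. The sketch below uses the classic multiplicative update rules on a toy matrix to show the idea. This is not the solver scikit-learn uses by default (the model output below reports `solver='cd'`, coordinate descent), and the `nmf_multiplicative` function is a hypothetical helper written only for illustration.

```python
import numpy as np

def nmf_multiplicative(A, n_components, n_iter=200, seed=42):
    """Factor non-negative A (docs x words) into W (docs x topics) @ H (topics x words)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = A.shape
    W = rng.random((n_docs, n_components))
    H = rng.random((n_components, n_words))
    eps = 1e-10  # keeps the denominators away from zero
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy matrix: two groups of documents that use disjoint vocabularies.
A = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [2.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 1.0, 1.0],
])
W, H = nmf_multiplicative(A, n_components=2)
error = np.linalg.norm(A - W @ H)  # Frobenius reconstruction error
print(W.round(2), H.round(2), error, sep="\n")
```

In scikit-learn terms, `H` plays the role of `nmf_model.components_` and `W` is what `nmf_model.transform(A)` returns.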

In [7]:
from sklearn.decomposition import NMF

In [8]:
nmf_model = NMF(n_components=20, random_state=42)

In [9]:
nmf_model.fit(A)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
  n_components=20, random_state=42, shuffle=False, solver='cd', tol=0.0001,
  verbose=0)

#### The fitted model now gives us the topic-word relation (`nmf_model.components_`); the document-topic relation is obtained with `transform` in step 5

### 4. Evaluate Topic - Word relation result:
- **Tasks**:
  - Print out the 15 highest-weighted words for each of the 20 topics.
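
Picking the top words for a topic comes down to sorting one row of `components_` and reading off the vocabulary at the highest-weighted positions. The sketch below shows this on a made-up five-word vocabulary; `vocab` and `topic` are hypothetical stand-ins for the fitted vectorizer's feature names and a single row of `nmf_model.components_`.

```python
import numpy as np

# Hypothetical vocabulary and one topic's word weights (one row of components_).
vocab = np.array(["book", "money", "best", "online", "movie"])
topic = np.array([0.7, 0.0, 1.2, 0.1, 0.5])

top_k = 3
# argsort sorts ascending, so take the last k indices and reverse them
# to list the strongest word first.
top_idx = topic.argsort()[-top_k:][::-1]
top_words = vocab[top_idx].tolist()
print(top_words)  # ['best', 'book', 'movie']
```

Without the final `[::-1]`, the words come out in ascending order of weight, which is why the strongest word appears last in each printed list below.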

In [10]:
len(nmf_model.components_)

20

Number of unique words kept in the TF-IDF vocabulary across all Quora questions:

In [11]:
# Note: in scikit-learn >= 1.0 this method is named get_feature_names_out()
len(TFIDF.get_feature_names())

38669

In [12]:
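# Look up the vocabulary once rather than on every loop iteration.
# (In scikit-learn >= 1.0 this method is named get_feature_names_out().)
feature_names = TFIDF.get_feature_names()
for index, topic in enumerate(nmf_model.components_):
    print(f"Top 15 words in topic #{index}")
    # argsort is ascending, so the last 15 indices are the highest-weighted words
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print("______________________________________________________________")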

Top 15 words in topic #0
['thing', 'read', 'place', 'visit', 'places', 'phone', 'buy', 'laptop', 'movie', 'ways', '2016', 'books', 'book', 'movies', 'best']
______________________________________________________________
Top 15 words in topic #1
['majors', 'recruit', 'sex', 'looking', 'differ', 'use', 'exist', 'really', 'compare', 'cost', 'long', 'feel', 'work', 'mean', 'does']
______________________________________________________________
Top 15 words in topic #2
['add', 'answered', 'needing', 'post', 'easily', 'improvement', 'delete', 'asked', 'google', 'answers', 'answer', 'ask', 'question', 'questions', 'quora']
______________________________________________________________
Top 15 words in topic #3
['using', 'website', 'investment', 'friends', 'black', 'internet', 'free', 'home', 'easy', 'youtube', 'ways', 'earn', 'online', 'make', 'money']
______________________________________________________________
Top 15 words in topic #4
['balance', 'earth', 'day', 'death', 'changed', 'live', 

### 5. Evaluate Topic - document relation result:
- **Description**: Find the most appropriate topic for each document (Question).
- **Tasks**:
  - Transform the TF-IDF matrix with the fitted NMF model to get a weight for each topic in each document. These weights are non-negative but are not true probabilities, since each row does not sum to 1.
  - Get the best topic for each document by choosing the topic with the highest weight.
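
The assignment step can be sketched on a small made-up matrix. The example also shows one optional way to turn each row into shares that sum to 1, and flags rows that are all zeros: `argmax` returns index 0 for such rows, so a question with no signal would silently be assigned to topic 0. The matrix `W` and the names `shares` and `has_signal` are hypothetical.

```python
import numpy as np

# Hypothetical document-topic weights (rows: documents, columns: topics).
W = np.array([
    [0.00, 0.03, 0.01],
    [0.02, 0.00, 0.00],
    [0.00, 0.00, 0.00],  # a document with no weight on any topic
])

# Index of the highest-weighted topic in each row.
best_topic = W.argmax(axis=1)

# Optional: rescale each row to sum to 1 so the values read as shares.
row_sums = W.sum(axis=1, keepdims=True)
# Divide only where the row sum is positive; all-zero rows stay zero.
shares = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)

# Rows with no signal still receive argmax 0, so track them separately.
has_signal = row_sums.ravel() > 0
print(best_topic, shares, has_signal, sep="\n")
```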


In [13]:
topics_probability = nmf_model.transform(A)

In [14]:
topics_probability

array([[2.75937605e-04, 5.91249293e-05, 6.17687040e-06, ...,
        6.97269969e-04, 2.13527728e-04, 0.00000000e+00],
       [1.96418670e-04, 8.85438224e-05, 0.00000000e+00, ...,
        0.00000000e+00, 5.51088847e-05, 1.05527238e-05],
       [1.78019854e-04, 6.47373072e-04, 1.60510763e-03, ...,
        3.02354836e-03, 1.05908512e-03, 1.23878889e-03],
       ...,
       [0.00000000e+00, 1.62431955e-05, 5.23720795e-06, ...,
        0.00000000e+00, 2.76279348e-06, 0.00000000e+00],
       [5.36236094e-04, 1.01567857e-03, 0.00000000e+00, ...,
        1.28720137e-04, 7.76975481e-04, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 1.25187210e-04]])

In [15]:
topics_probability[0].round(4)

array([0.0003, 0.0001, 0.    , 0.0005, 0.    , 0.0262, 0.0004, 0.    ,
       0.    , 0.    , 0.0002, 0.0012, 0.    , 0.    , 0.    , 0.0004,
       0.    , 0.0007, 0.0002, 0.    ])

In [16]:
topic_res = topics_probability.argmax(axis=1)

In [17]:
quora['Topic'] = topic_res

In [18]:
quora.head()

Unnamed: 0,Question,Topic
0,What is the step by step guide to invest in sh...,5
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,16
2,How can I increase the speed of my internet co...,17
3,Why am I mentally very lonely? How can I solve...,11
4,"Which one dissolve in water quikly sugar, salt...",14


#### Now we can map each topic number to a descriptive topic name based on the top words for that topic
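
A minimal sketch of that mapping, assuming hand-written labels: the names in `topic_names` are hypothetical, chosen only by reading the top words printed for topics 0, 2, and 3, and the remaining topics are left unlabeled here.

```python
# Hypothetical labels chosen by reading each topic's top words.
topic_names = {
    0: "recommendations (books, movies, places)",
    2: "Quora usage",
    3: "making money online",
}

# Example topic ids; fall back to a generic label for unnamed topics.
topic_ids = [3, 0, 2, 7]
labels = [topic_names.get(t, f"topic {t}") for t in topic_ids]
print(labels)
```

In the notebook itself, the same dictionary could be applied with `quora['Topic'].map(topic_names)`; topic numbers missing from the dictionary would then come back as `NaN`.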