<a href="https://colab.research.google.com/github/priyanka-ingale/unstructured-intelligence/blob/main/ReviewTextAnalysisFinal.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Review Text Analysis - Topic Modeling with LDA
This notebook performs topic modeling on restaurant and film reviews using Latent Dirichlet Allocation (LDA).

## 1. Setup and Data Loading

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [None]:
# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('omw-1.4', quiet=True)
print("NLTK resources downloaded successfully")

NLTK resources downloaded successfully


In [None]:
# Load the data
reviews = pd.read_excel('IA2_1.xlsx')
reviews.head()

Unnamed: 0,id,review,label
0,1,About the shop: There is a restaurant in Soi L...,restaurant
1,2,About the shop: Through this store for about t...,restaurant
2,3,Roast Coffee &amp; Eatery is a restaurant loca...,restaurant
3,4,Eat from the children. The shop is opposite. P...,restaurant
4,5,The Ak 1 shop at another branch tastes the sam...,restaurant


## 2. Text Preprocessing
Transform reviews into a document-term matrix with:
- Lemmatization
- Stop-words and punctuation removal
- Minimum document frequency = 5
- Include 2-grams (bigrams)

In [None]:
# 2. Initialize the Lemmatizer and Stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

processed_reviews = []

# 3. Preprocess the reviews: tokenize, lemmatize, and remove stop-words and punctuation.
for doc in reviews['review']:
    # Tokenize and Lowercase
    tokens = nltk.word_tokenize(str(doc).lower())

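    # Lemmatize all words and drop punctuation and numbers (via .isalpha()).
    # Note: lemmatize() defaults to noun POS, so "was" -> "wa" and "has" -> "ha".
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens if token.isalpha()]

    # Remove stop-words. This runs after lemmatization, so "wa" and "ha" are not caught.
    cleaned_tokens = [token for token in lemmatized_tokens if token not in stop_words]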

    # Join the tokens back into a single string for the Vectorizer
    processed_reviews.append(" ".join(cleaned_tokens))

# 4. Create the Document-Term Matrix (DTM)
vectorizer = CountVectorizer(min_df=5, ngram_range=(1, 2))
dtm = vectorizer.fit_transform(processed_reviews)
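
The loop above lemmatizes before filtering stop-words and does not decode HTML entities, which may explain tokens such as "wa", "ha", and "quot" in the fitted topics. The following is a hedged sketch of an alternative ordering, not the pipeline that produced this notebook's results. It assumes the raw text contains entities like `&quot;`, uses a simple regex in place of `nltk.word_tokenize`, and takes the lemmatizer as a parameter; `clean_review` and `demo_stop_words` are hypothetical names.

```python
import html
import re

def clean_review(text, stop_words, lemmatize=lambda token: token):
    """Decode entities, tokenize, drop stop-words, then lemmatize what is left."""
    text = html.unescape(str(text)).lower()      # "&quot;" -> '"', "&amp;" -> "&"
    tokens = re.findall(r"[a-z]+", text)         # alphabetic runs only (simplified tokenizer)
    kept = [t for t in tokens if t not in stop_words]  # filter BEFORE lemmatizing
    return " ".join(lemmatize(t) for t in kept)

demo_stop_words = {"the", "a", "was", "has"}
sample = "The film was &quot;great&quot; &amp; has heart"
print(clean_review(sample, demo_stop_words))  # film great heart
```

In the notebook, `stop_words` and `lemmatizer.lemmatize` could be passed in place of the demo values. Changing the preprocessing would change the document-term matrix, so the topics and weights reported later would need to be regenerated.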

## 3. LDA Topic Modeling
Extract 6 topics from the reviews using Latent Dirichlet Allocation.

In [None]:
# 1. Initialize the LDA model
lda_model = LatentDirichletAllocation(n_components=6, random_state=12)

# 2. Fit the model to the Document-Term Matrix (dtm) built in Section 2
lda_model.fit(dtm)

# 3. Extract the topic distribution of each document
doc_topic_distribution = lda_model.transform(dtm)

# Display the results for the first few documents
print("Topic distribution matrix shape:", doc_topic_distribution.shape)
print("\nTopic distribution for the first review (ID 1):")
print(doc_topic_distribution[0])

Topic distribution matrix shape: (1000, 6)

Topic distribution for the first review (ID 1):
[0.00100606 0.00100636 0.00100669 0.00100664 0.00100727 0.99496698]
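
Each row of this matrix is a probability distribution over the 6 topics, so it should sum to 1 and its largest entry marks the dominant topic. A small sanity check on the row printed above, using only the standard library (the variable names are illustrative):

```python
import math

# First row of the document-topic matrix, copied from the output above.
first_row = [0.00100606, 0.00100636, 0.00100669, 0.00100664, 0.00100727, 0.99496698]

row_sum = sum(first_row)
dominant_topic = max(range(len(first_row)), key=lambda k: first_row[k])

print(math.isclose(row_sum, 1.0, abs_tol=1e-6))  # True
print(dominant_topic)                            # 5
```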


## 4. Question 3: Topic Distribution for First 10 Restaurant and Movie Reviews

Report the topic distribution and top-2 topics for:
- First 10 restaurant reviews (ID = 1 to 10)
- First 10 movie reviews (ID = 501 to 510)

In [None]:
import numpy as np

# 1. Define the ranges for restaurant and movie reviews based on IDs
restaurant_ids = list(range(1, 11))
movie_ids = list(range(501, 511))

def topic_details(target_ids, distribution_matrix, dataframe):
    print(f"{'ID':<5} | {'Label':<12} | {'Top 2 Topics':<15} | {'Topic Distribution'}")
    print("-" * 80)

    for tid in target_ids:
        # Match the ID from the dataset to its row position in the distribution matrix
        row_indices = np.flatnonzero((dataframe['id'] == tid).to_numpy()).tolist()

        if row_indices:
            idx = row_indices[0]
            dist = distribution_matrix[idx]

            # Find the indices of the top 2 topics (highest probabilities)
            # argsort() sorts ascending, so we take the last two and reverse them
            top_2_topics = dist.argsort()[-2:][::-1]

            # Formatting the distribution for readability
            dist_str = ", ".join([f"{prob:.4f}" for prob in dist])
            label = dataframe.iloc[idx]['label']

            print(f"{tid:<5} | {label:<12} | {str(top_2_topics):<15} | [{dist_str}]")

# 2. Run the report for Restaurant reviews
print("Topic Analysis: First 10 Restaurant Reviews (ID 1-10)")
topic_details(restaurant_ids, doc_topic_distribution, reviews)

# 3. Run the report for Movie reviews
print("\nTopic Analysis: First 10 Movie Reviews (ID 501-510)")
topic_details(movie_ids, doc_topic_distribution, reviews)

Topic Analysis: First 10 Restaurant Reviews (ID 1-10)
ID    | Label        | Top 2 Topics    | Topic Distribution
--------------------------------------------------------------------------------
1     | restaurant   | [5 4]           | [0.0010, 0.0010, 0.0010, 0.0010, 0.0010, 0.9950]
2     | restaurant   | [5 1]           | [0.0011, 0.0559, 0.0011, 0.0011, 0.0011, 0.9397]
3     | restaurant   | [5 4]           | [0.0008, 0.0008, 0.0008, 0.0008, 0.0008, 0.9958]
4     | restaurant   | [5 3]           | [0.0021, 0.0021, 0.0021, 0.2008, 0.0021, 0.7908]
5     | restaurant   | [5 0]           | [0.0050, 0.0049, 0.0049, 0.0049, 0.0049, 0.9752]
6     | restaurant   | [5 1]           | [0.0009, 0.0009, 0.0009, 0.0009, 0.0009, 0.9957]
7     | restaurant   | [5 4]           | [0.0019, 0.0019, 0.0019, 0.0019, 0.0019, 0.9906]
8     | restaurant   | [5 3]           | [0.0017, 0.0017, 0.0017, 0.0017, 0.0017, 0.9916]
9     | restaurant   | [5 2]           | [0.0167, 0.0167, 0.0167, 0.0167, 0.0167, 0.9
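
When one topic holds nearly all the weight, the second-ranked topic is often just the largest of several near-identical small values. One possible refinement, sketched here under the assumption that a cutoff of 0.05 is a reasonable threshold (both `top_topics` and `min_weight` are hypothetical, not part of the assignment), is to report a topic only if its weight clears that cutoff:

```python
def top_topics(dist, k=2, min_weight=0.05):
    """Return up to k (topic, weight) pairs, highest first, skipping weights below min_weight."""
    ranked = sorted(range(len(dist)), key=lambda t: dist[t], reverse=True)
    return [(t, dist[t]) for t in ranked[:k] if dist[t] >= min_weight]

# Rounded rows copied from the report above.
review_1 = [0.0010, 0.0010, 0.0010, 0.0010, 0.0010, 0.9950]
review_2 = [0.0011, 0.0559, 0.0011, 0.0011, 0.0011, 0.9397]
review_4 = [0.0021, 0.0021, 0.0021, 0.2008, 0.0021, 0.7908]

print(top_topics(review_1))  # [(5, 0.995)]
print(top_topics(review_2))  # [(5, 0.9397), (1, 0.0559)]
print(top_topics(review_4))  # [(5, 0.7908), (3, 0.2008)]
```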

## 5. Question 4: Top-5 Terms for Each Topic

Find and display the top-5 terms with highest weights for each of the 6 topics, then describe what each topic is about.

In [None]:
# 1. Get the feature names (terms) from the vectorizer fitted in Section 2
terms = vectorizer.get_feature_names_out()

# 2. Loop through each of the 6 topics and extract the top 5 terms
for topic_idx, topic in enumerate(lda_model.components_):
    print(f"Topic {topic_idx}:")

    # Get the indices of the 5 largest weights, in descending order.
    # argsort() sorts ascending, so [:-6:-1] walks backward over the last 5 elements.
    top_terms_indices = topic.argsort()[:-6:-1]

    # Map indices back to the actual terms
    top_5_terms = [terms[i] for i in top_terms_indices]

    print(" ".join(top_5_terms))
    print("-" * 20)

Topic 0:
war film wa soldier stalingrad
--------------------
Topic 1:
quot people book ha also
--------------------
Topic 2:
quot life love people wa
--------------------
Topic 3:
quot film wa ha also
--------------------
Topic 4:
wa people time woman also
--------------------
Topic 5:
eat good delicious like food
--------------------
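
The rows of `lda_model.components_` are unnormalized weights. Dividing each row by its sum gives a per-topic distribution over terms, which makes it easier to see how much of a topic its top terms account for. A minimal sketch on made-up numbers follows; `toy_terms` and `toy_components` are hypothetical stand-ins, not values from the fitted model.

```python
import numpy as np

# Hypothetical stand-ins for vectorizer.get_feature_names_out() and lda_model.components_.
toy_terms = np.array(["eat", "food", "film", "war"])
toy_components = np.array([
    [8.0, 6.0, 1.0, 1.0],
    [1.0, 1.0, 5.0, 3.0],
])

# Normalize each topic (row) so its term weights sum to 1.
topic_word_probs = toy_components / toy_components.sum(axis=1, keepdims=True)

def top_terms(weights, terms, n=2):
    """Return the n highest-weighted (term, probability) pairs, highest first."""
    order = np.argsort(weights)[::-1][:n]
    return [(str(terms[i]), round(float(weights[i]), 3)) for i in order]

for topic_idx, row in enumerate(topic_word_probs):
    print(topic_idx, top_terms(row, toy_terms))
```

Printing the probability alongside each term would also make artifact tokens easier to spot when they dominate a topic.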


# Final Report

**Question 3:**



*   **Restaurant Reviews (ID 1-10)**:

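    **Topic Distribution:** All of the reviews shown (IDs 1-9) are dominated by Topic 5. Eight of the nine have a Topic 5 weight of at least 0.9, and five exceed 0.99. The exception is ID 4, which puts 0.79 on Topic 5 and 0.20 on Topic 3.

    **Top-2 Topics:** Topic 5 always ranks first; the second slot varies ([5, 4] three times, [5, 1] and [5, 3] twice each, [5, 0] and [5, 2] once each). For most reviews the runner-up is a near-tie among the remaining topics and carries little meaning; only ID 4 (Topic 3, 0.20) and ID 2 (Topic 1, 0.056) have a clearly distinct second topic. Review 10 is an outlier, showing a mix of Topic 4 and Topic 5.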

    **Insight:** This high concentration in Topic 5 indicates a very consistent vocabulary (food, dining, taste) across the first 10 restaurant reviews.

*   **Movie Reviews (ID 501-510)**:

    **Topic Distribution:** These reviews show a much broader distribution across Topic 2 (Romance/Emotional) and Topic 3 (General Film).

    **Top-2 Topics:** Common pairs include [2, 3], [3, 2], or [3, 4].

    **Insight:** Movie reviews are more diverse in their themes, blending discussion of plot emotions (Topic 2) with general cinematic quality (Topic 3).
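
One way to put a number on "concentrated" versus "broad" is the Shannon entropy of each review's topic distribution: values near 0 mean a single topic dominates, while the maximum for 6 topics is ln 6 (about 1.79). This is a hedged sketch using two restaurant rows from the report above; `topic_entropy` is a hypothetical helper, and the movie rows would need to be taken from the notebook's full output.

```python
import math

def topic_entropy(dist):
    """Shannon entropy (in nats) of a topic distribution; 0 means one topic holds all the weight."""
    return -sum(p * math.log(p) for p in dist if p > 0)

review_1 = [0.0010, 0.0010, 0.0010, 0.0010, 0.0010, 0.9950]
review_4 = [0.0021, 0.0021, 0.0021, 0.2008, 0.0021, 0.7908]
uniform = [1 / 6] * 6  # reference point: weight spread evenly over all 6 topics

for name, dist in [("ID 1", review_1), ("ID 4", review_4), ("uniform", uniform)]:
    print(f"{name}: {topic_entropy(dist):.3f}")
```

Review 1 scores far lower than review 4, which in turn sits well below the uniform reference, matching the reading that review 4 is the most mixed of the restaurant rows shown.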

  
**Question 4: Top Terms and Topic Descriptions**

Based on the top 5 terms with the highest weights, the 6 topics are described as follows. Three of the listed terms are preprocessing artifacts rather than content words: "quot" most likely comes from the HTML entity `&quot;` in the raw text, and "wa" and "ha" are what the noun-mode lemmatizer produces from "was" and "has".


*   **Topic 0** (war, film, wa, soldier, stalingrad)
    War/Historical Films: Focuses on military history and conflict narratives.
*   **Topic 1** (quot, people, book, ha, also)
    Literature & People: Related to books and general human interest.
*   **Topic 2** (quot, life, love, people, wa)
    Romance & Emotion: Centers on human relationships, "love," and life.
*   **Topic 3** (quot, film, wa, ha, also)
    General Film/Cinema: "film" is the only content-bearing term, so this reads as generic film discussion.
*   **Topic 4** (wa, people, time, woman, also)
    Society & Time: General societal themes or time-based narratives.
*   **Topic 5** (eat, good, delicious, like, food)
    Restaurant/Dining: The primary topic for food quality and restaurant experiences.


**Question 5: Review Insights (ID 1 and ID 501)**



*   **Review 1 [ID=1]:**

    **Insights:** This review is almost exclusively about Topic 5 (weight 0.9950). The content details a visit to a restaurant in Soi Langsuan, describing specific French dishes like "Duck l'orange" and "French onion soup." The model correctly identified this as a pure dining review based on keywords like "eat," "food," and "delicious."

    **Summary:** It is a standard culinary critique focusing on menu items and restaurant atmosphere.

*   **Review 501 [ID=501]:**

    **Insights:** This review is primarily associated with Topic 2 (weight 0.5462) and Topic 1 (weight 0.2450). It uses highly emotional and reflective language ("I love you," "unforgettable boy," "confidant"). Although its opening does not name a specific movie, its vocabulary aligns with the "Romance & Emotion" (Topic 2) and "Literature & People" (Topic 1) themes.

    **Summary:** It is a sentimental movie review or character analysis focusing on the emotional resonance of human connections rather than technical film aspects.