# Stage 2.0: NLP Exploratory Analysis - Wordclouds

Visualization notebook for comparing review text between quality and non-quality reviews.

**Input**: Stage 2 NLP output (2.5_lda) with review_text column  
**Output**: Wordcloud visualizations comparing quality vs non-quality reviews

A review is labeled as quality if it received at least one useful, funny, or cool vote (`T2_CLS_ufc_>0` is `True`).
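The label itself is produced upstream and only read here. As a minimal sketch of the rule described above, assuming each review carries raw `useful`, `funny`, and `cool` vote counts (hypothetical field names, as is the helper `is_quality_review`):

```python
# Hypothetical raw vote counts; the field names are assumptions for illustration.
reviews = [
    {"review_id": "a", "useful": 2, "funny": 0, "cool": 1},
    {"review_id": "b", "useful": 0, "funny": 0, "cool": 0},
    {"review_id": "c", "useful": 0, "funny": 1, "cool": 0},
]


def is_quality_review(review):
    """Return True when the review has at least one useful, funny, or cool vote."""
    return (review["useful"] + review["funny"] + review["cool"]) > 0


# Map each review ID to its quality label.
labels = {r["review_id"]: is_quality_review(r) for r in reviews}
print(labels)
```

A review with a single vote of any type counts as quality under this rule, so the label reflects whether readers engaged with the review, not how strongly.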

## Overview

This notebook creates wordclouds to visually compare the vocabulary used in:
- **Quality reviews**: Reviews that received useful, funny, or cool votes
- **Non-quality reviews**: Reviews with zero votes

The wordclouds show whether there are obvious vocabulary differences between the two classes.
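Wordclouds are a qualitative view of word frequency. A numeric complement, sketched below with the standard library only, lists the most frequent terms per class. The helper `top_terms`, the toy reviews, and the tiny stopword set are all hypothetical and stand in for the real corpus and stopword list.

```python
import re
from collections import Counter


def top_terms(texts, stopwords=frozenset(), k=3):
    """Return the k most frequent lowercase word tokens across texts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts.most_common(k)


# Toy data for illustration only, not taken from the dataset.
quality = ["The tacos were amazing and the salsa was amazing", "Amazing tacos"]
non_quality = ["Food was ok", "Service was ok, food was fine"]
stop = {"the", "and", "was", "were"}

q_top = top_terms(quality, stop)
nq_top = top_terms(non_quality, stop)
print("Quality:", q_top)
print("Non-quality:", nq_top)
```

Comparing the two lists side by side makes it easier to tell whether a word that looks prominent in one cloud is also common in the other.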

## Imports and Global Settings

In [None]:
# Common Libraries
import sys
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Wordclouds
from wordcloud import WordCloud, STOPWORDS

# Add project root to path for imports
sys.path.insert(0, str(Path.cwd().parent))
from src.config import PathConfig

pd.set_option('display.float_format', lambda x: '%.5f' % x)

## Import Data

In [None]:
# Load training data from Stage 2 NLP output
# Using final LDA output which has all features, including review_text
input_dir = PathConfig.get_nlp_lda_dir()
print(f"Reading from: {input_dir}")

In [None]:
# Read training data (only the ID, review text, and target columns are needed for wordclouds)
df = pd.read_parquet(
    input_dir / "train.parquet",
    columns=["review_id", "review_text", "T2_CLS_ufc_>0"]
)
print(f"Loaded {len(df):,} reviews")

In [None]:
# Rename target column to simpler name for this notebook
df = df.rename(columns={"T2_CLS_ufc_>0": "is_quality"})
df.head()

In [None]:
df.info()

## Dataframe Pre-Processing

In [None]:
df.is_quality.value_counts()

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   review_id        10000 non-null  object
 1   review_text      10000 non-null  object
 2   target_ufc_bool  10000 non-null  object
dtypes: object(3)
memory usage: 234.5+ KB


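In [9]:
# The target column was renamed to is_quality above; the saved output below
# predates the rename and still shows the old name target_ufc_bool
df.is_quality.value_counts()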

True     5101
False    4899
Name: target_ufc_bool, dtype: int64

## Splitting Text

In [None]:
# Split corpus by quality label
corpus = df.review_text
Q_corpus = df[df["is_quality"] == True]["review_text"]
NQ_corpus = df[df["is_quality"] == False]["review_text"]
print(f'Corpus Size: Total:{len(corpus):,}, Quality:{len(Q_corpus):,}, Not Quality:{len(NQ_corpus):,}')

## Wordclouds

### Wordcloud Stopwords

In [11]:
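# Start from the wordcloud package's default English stopwords
stopwords = set(STOPWORDS)
# Add domain-specific stopwords to this list if needed (currently none)
stopwords.update([])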
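The empty `stopwords.update([])` call leaves room for domain-specific terms. The sketch below shows how adding such terms would change which words survive filtering. It uses a small stand-in set in place of `STOPWORDS` so that it runs without the wordcloud package, and the domain terms are hypothetical examples, not a recommended list.

```python
# Small stand-in for wordcloud.STOPWORDS so the sketch runs on its own.
base_stopwords = {"the", "a", "was", "and", "is"}

# Hypothetical domain terms that might appear in nearly every restaurant review.
domain_terms = ["food", "place", "restaurant"]

demo_stopwords = set(base_stopwords)
demo_stopwords.update(domain_terms)

text = "the food was great and the place is cozy"
kept = [word for word in text.split() if word not in demo_stopwords]
print(kept)  # ['great', 'cozy']
```

Words that are frequent in both classes tend to dominate both clouds, so removing them can make class-specific vocabulary easier to spot.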

### Join Corpus Into One String

In [12]:
Q_text = " ".join(Q_corpus)
NQ_text = " ".join(NQ_corpus)
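Joining with `" ".join(...)` raises a `TypeError` if any element is not a string, for example a missing review text. The data loaded here appears to have no nulls, so this is only a defensive sketch. The helper `join_corpus` is hypothetical.

```python
def join_corpus(texts):
    """Join review strings with spaces, skipping missing or empty entries."""
    return " ".join(t for t in texts if isinstance(t, str) and t.strip())


# Toy input with the kinds of gaps a real column might contain.
sample = ["Great tacos", None, "", float("nan"), "Slow service"]
joined = join_corpus(sample)
print(joined)  # Great tacos Slow service
```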

### Make Wordclouds

In [13]:
# color_func renders every word in black and takes precedence over colormap, so colormap is omitted
Q_wordcloud = WordCloud(stopwords=stopwords, background_color="cornflowerblue",
                        collocations=False, color_func=lambda *args, **kwargs: "black",
                        width=1000, height=1000).generate(Q_text)

In [14]:
# Same settings as the quality cloud; only the background color differs
NQ_wordcloud = WordCloud(stopwords=stopwords, background_color="lightcoral",
                         collocations=False, color_func=lambda *args, **kwargs: "black",
                         width=1000, height=1000).generate(NQ_text)

### Save and Show Wordclouds

In [None]:
# Create side-by-side wordcloud visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 7))

ax1.imshow(Q_wordcloud)
ax1.axis("off")
ax1.set_title("Quality Reviews\n(has useful/funny/cool votes)", fontsize=18, pad=10)

ax2.imshow(NQ_wordcloud)
ax2.axis("off")
ax2.set_title("Non-Quality Reviews\n(no votes)", fontsize=18, pad=10)

fig.suptitle("Yelp Review Text: Quality vs Non-Quality", fontsize=22, y=1.02)
fig.tight_layout(pad=2)

# Uncomment to save
# plt.savefig('../images/review_wordclouds.png', dpi=300, bbox_inches='tight')
plt.show()