# Sprint Challenge - Natural Language Processing
## Yelp Reviews Analysis

This notebook contains solutions for Parts 0 through 4 of the Sprint Challenge:
- Part 0: Import packages and data
- Part 1: Tokenization function
- Part 2: Vector representation and similarity search
- Part 3: Classification model with GridSearchCV
- Part 4: Topic modeling with LDA

## Part 0: Import Necessary Packages

In [None]:
# Import all required packages
import spacy
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re

from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

from gensim import corpora
from gensim.models import LdaModel
import gensim

# Visible Testing
assert pd.__package__ == 'pandas'

## Part 0: Import Data

In [None]:
# Load reviews from URL
data_url = 'https://raw.githubusercontent.com/bloominstituteoftechnology/data-science-practice-datasets/main/unit_4/unit1_nlp/review_sample.json'

# Import data into a DataFrame named df
df = pd.read_json(data_url)

# Display first few rows
print(f"DataFrame shape: {df.shape}")
df.head()

In [None]:
# Visible Testing
assert isinstance(df, pd.DataFrame), 'df is not a DataFrame. Did you import the data into df?'
assert df.shape[0] == 10000, 'DataFrame df has the wrong number of rows.'

## Part 1: Tokenize Function

Create a tokenization function using spaCy that:
- Accepts one document at a time
- Returns a list of tokens
- Removes stopwords and punctuation
- Lemmatizes tokens

In [None]:
# Load spaCy model
nlp = spacy.load('en_core_web_sm')

def tokenize(doc):
    """
    Tokenize a document using spaCy.
    
    Parameters:
    -----------
    doc : str
        A single document/review text
    
    Returns:
    --------
    list
        A list of lemmatized tokens (lowercase, no stopwords or punctuation)
    """
    # Process the document with spaCy
    processed_doc = nlp(doc)
    
    # Extract tokens: lemmatize, lowercase, remove stopwords and punctuation
    tokens = [
        token.lemma_.lower() 
        for token in processed_doc 
        if not token.is_stop and not token.is_punct and not token.is_space
    ]
    
    return tokens

In [None]:
# Testing
assert isinstance(tokenize(df.sample(n=1)["text"].iloc[0]), list), "Make sure your tokenizer function accepts a single document and returns a list of tokens!"

# Test on a sample review
sample_review = df.sample(n=1)["text"].iloc[0]
print("Sample review:")
print(sample_review[:200] + "...")
print("\nTokens:")
print(tokenize(sample_review)[:20])

## Part 2: Vector Representation

Create a document-term matrix using TF-IDF vectorization.
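
To make the weighting concrete, here is a minimal plain-Python sketch of the arithmetic `TfidfVectorizer` applies under its default settings (`smooth_idf=True`, `norm='l2'`): raw term counts are multiplied by a smoothed inverse document frequency, and each row is then scaled to unit length. The toy corpus and the helper name `tfidf_rows` are hypothetical and exist only for illustration. The sketch skips the stop-word removal and the `max_df`, `min_df`, and `max_features` filters used in the cell below.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lowercased and whitespace-separated.
toy_docs = [
    "great food great service",
    "terrible service",
    "great pasta",
]

def tfidf_rows(docs):
    """Return (vocab, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf_rows(toy_docs)
for term, weight in zip(vocab, rows[1]):
    print(f"{term:10s} {weight:.3f}")
```

In the second toy document, "terrible" receives a higher weight than "service" because it occurs in fewer documents, which is the behavior TF-IDF is chosen for.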

In [None]:
%%time
# Create TF-IDF vectorizer
tfidf = TfidfVectorizer(max_features=5000, stop_words='english', max_df=0.8, min_df=5)

# Create document-term matrix
dtm = tfidf.fit_transform(df['text'])

print(f"Document-Term Matrix shape: {dtm.shape}")
print(f"Number of documents: {dtm.shape[0]}")
print(f"Number of features: {dtm.shape[1]}")

### Create NearestNeighbors Model and Find Similar Reviews
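
The model in the next cell uses `metric='cosine'`, which ranks reviews by cosine distance, defined as 1 minus the cosine similarity of two vectors. The sketch below computes that quantity by hand on hypothetical three-dimensional vectors; the names `cosine_distance`, `query`, and `candidates` are illustrative and not part of the notebook.

```python
import math

def cosine_distance(u, v):
    """Cosine distance: 1 - (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical vectors standing in for TF-IDF rows.
query = [1.0, 1.0, 0.0]
candidates = {
    "same direction": [2.0, 2.0, 0.0],
    "partial overlap": [1.0, 0.0, 1.0],
    "no overlap": [0.0, 0.0, 3.0],
}

# Smaller distance means more similar, so sort ascending.
ranked = sorted(candidates, key=lambda name: cosine_distance(query, candidates[name]))
for name in ranked:
    print(f"{name}: {cosine_distance(query, candidates[name]):.3f}")
```

Because the measure depends only on direction, a long review and a short review that use the same vocabulary in the same proportions end up close together.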

In [None]:
# Create and fit a NearestNeighbors model named "nn"
nn = NearestNeighbors(n_neighbors=10, metric='cosine')
nn.fit(dtm)

In [None]:
# Testing
assert nn.__module__ == 'sklearn.neighbors._unsupervised', 'nn is not a NearestNeighbors instance.'
assert nn.n_neighbors == 10, 'nn has the wrong value for n_neighbors'

In [None]:
# Create a fake review and find the 10 most similar reviews
fake_review = "This restaurant has amazing food and excellent service! The atmosphere is cozy and the staff is very friendly. I highly recommend the pasta dishes and the desserts are to die for. Will definitely come back again!"

# Transform the fake review using the same vectorizer
fake_review_vector = tfidf.transform([fake_review])

# Find the 10 nearest neighbors
distances, indices = nn.kneighbors(fake_review_vector)

print("Fake Review:")
print(fake_review)
print("\n" + "="*80 + "\n")
print("10 Most Similar Reviews:\n")

for i, idx in enumerate(indices[0]):
    print(f"\n--- Similar Review #{i+1} (Distance: {distances[0][i]:.4f}) ---")
    print(f"Stars: {df.iloc[idx]['stars']}")
    print(df.iloc[idx]['text'][:300] + "...")
    print("-" * 80)

In [None]:
# Visible Testing
assert isinstance(fake_review, str), "Did you write a review in the correct data type?"

## Part 3: Classification

Build a pipeline to predict star ratings from review text using GridSearchCV.
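
As a rough guide to the cost of the search, the sketch below expands the same parameter grid used in the next cell with `itertools.product` and counts the model fits. It is a plain-Python illustration of how a grid multiplies out, not a call into scikit-learn; the variable names `cv_folds`, `candidates`, and `total_fits` are assumptions made for the example.

```python
from itertools import product

# Same grid and fold count as the GridSearchCV cell below.
param_grid = {
    'tfidf__max_features': [1000, 2000],
    'clf__n_neighbors': [3, 5],
}
cv_folds = 3

# Every combination of one value per parameter is one candidate.
names = list(param_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

# Each candidate is fitted once per cross-validation fold.
total_fits = len(candidates) * cv_folds

for candidate in candidates:
    print(candidate)
print(f"{len(candidates)} candidates x {cv_folds} folds = {total_fits} fits")
```

With the default `refit=True`, `GridSearchCV` performs one more fit of the best candidate on the full data after these cross-validation fits.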

In [None]:
%%time
# Create a pipeline with TfidfVectorizer and KNeighborsClassifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', KNeighborsClassifier())
])

# Create parameter grid with 2 parameters, each with 2 values
param_grid = {
    'tfidf__max_features': [1000, 2000],
    'clf__n_neighbors': [3, 5]
}

# Create GridSearchCV object
gs = GridSearchCV(
    pipeline,
    param_grid,
    cv=3,
    n_jobs=1,
    verbose=1
)

# Fit the model
gs.fit(df['text'], df['stars'])

print(f"\nBest parameters: {gs.best_params_}")
print(f"Best cross-validation score: {gs.best_score_:.4f}")

In [None]:
# Test prediction on fake review
prediction = gs.predict([fake_review])[0]
print(f"Predicted star rating for fake review: {prediction}")

# Visible Testing
assert prediction in df.stars.values, 'Your gs object should be able to accept raw text within a list. Did you include a vectorizer in your pipeline?'

## Part 4: Topic Modeling

### 1. Estimate an LDA Topic Model
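
Before training, the cell below turns each tokenized review into a bag of words, a list of `(token_id, count)` pairs, after dropping tokens that are too rare or too common. The following plain-Python sketch mimics that idea on a hypothetical toy corpus. The helpers `build_vocab` and `to_bow` are illustrative stand-ins, not gensim functions; they ignore gensim's `keep_n` limit and assign ids alphabetically, which gensim does not guarantee.

```python
from collections import Counter

# Hypothetical tokenized documents.
toy_tokens = [
    ["pizza", "crust", "pizza"],
    ["pizza", "service"],
    ["service", "slow"],
    ["pizza", "cheese"],
]

def build_vocab(docs, no_below, no_above):
    """Keep tokens found in >= no_below docs and in <= no_above of all docs."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    kept = sorted(
        tok for tok, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    )
    return {tok: idx for idx, tok in enumerate(kept)}

def to_bow(doc, token2id):
    """Convert a token list to sorted (token_id, count) pairs."""
    counts = Counter(tok for tok in doc if tok in token2id)
    return sorted((token2id[tok], count) for tok, count in counts.items())

token2id = build_vocab(toy_tokens, no_below=2, no_above=0.5)
bows = [to_bow(doc, token2id) for doc in toy_tokens]
print(token2id)
print(bows)
```

In this toy corpus "pizza" is removed for appearing in too many documents and "crust", "slow", and "cheese" for appearing in too few, so only "service" survives. The same kind of pruning keeps the LDA vocabulary focused on words that help distinguish topics.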

In [None]:
%%time
# Do not change this value
num_topics = 5

# Tokenize all reviews (only run once!)
print("Tokenizing reviews...")
tokenized_reviews = [tokenize(doc) for doc in df['text']]

# Create dictionary
print("Creating dictionary...")
id2word = corpora.Dictionary(tokenized_reviews)

# Filter extremes to reduce vocabulary size
id2word.filter_extremes(no_below=5, no_above=0.5)

# Create corpus
print("Creating corpus...")
corpus = [id2word.doc2bow(doc) for doc in tokenized_reviews]

# Train LDA model
print("Training LDA model...")
lda = LdaModel(
    corpus=corpus,
    id2word=id2word,
    random_state=723812,
    num_topics=num_topics,
    passes=1
)

print("\nLDA Model trained successfully!")
print(f"Number of topics: {lda.num_topics}")
print(f"Vocabulary size: {len(id2word)}")

In [None]:
# Visible Testing
assert lda.get_topics().shape[0] == 5, 'Did your model complete its training? Did you set num_topics to 5?'

In [None]:
# Display topics
print("\nTop 10 words for each topic:\n")
for idx, topic in lda.print_topics(num_topics=num_topics, num_words=10):
    print(f"Topic {idx + 1}:")
    print(topic)
    print()

### 2. Create Visualizations

#### pyLDAvis Visualization (Comment out before submission)

In [None]:
# UNCOMMENT TO RUN, THEN COMMENT OUT BEFORE SUBMISSION
# import pyLDAvis.gensim_models as gensimvis
# import pyLDAvis

# pyLDAvis.enable_notebook()
# vis = gensimvis.prepare(lda, corpus, id2word)
# pyLDAvis.display(vis)

#### Matplotlib Visualization

In [None]:
# Extract top 3 words for each topic
topic_words = {}
for idx in range(num_topics):
    topic_terms = lda.show_topic(idx, topn=3)
    topic_words[f"Topic {idx + 1}"] = [word for word, _ in topic_terms]

# Create visualization with subplots
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('LDA Topic Modeling Results - Yelp Reviews', fontsize=16, fontweight='bold')

# Flatten axes for easier iteration
axes = axes.flatten()

# Plots 1-5: top-10 word weights for each topic
for idx in range(num_topics):
    topic_terms = lda.show_topic(idx, topn=10)
    words = [word for word, _ in topic_terms]
    weights = [weight for _, weight in topic_terms]
    
    axes[idx].barh(words, weights, color=plt.cm.Set3(idx))
    axes[idx].set_xlabel('Weight', fontsize=10)
    axes[idx].set_title(f'Topic {idx + 1}: {", ".join(topic_words[f"Topic {idx + 1}"])}', 
                        fontsize=11, fontweight='bold')
    axes[idx].invert_yaxis()

# Plot 6: Topic distribution across documents
topic_distributions = []
for doc_bow in corpus:
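    # minimum_probability=0.0 keeps low-weight topics that the default threshold would omit
    doc_topics = lda.get_document_topics(doc_bow, minimum_probability=0.0)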
    topic_dist = [0] * num_topics
    for topic_id, prob in doc_topics:
        topic_dist[topic_id] = prob
    topic_distributions.append(topic_dist)

topic_prevalence = [sum(dist[i] for dist in topic_distributions) for i in range(num_topics)]
axes[5].bar(range(1, num_topics + 1), topic_prevalence, color=plt.cm.Set3(range(num_topics)))
axes[5].set_xlabel('Topic', fontsize=10)
axes[5].set_ylabel('Total Prevalence', fontsize=10)
axes[5].set_title('Topic Prevalence Across All Documents', fontsize=11, fontweight='bold')
axes[5].set_xticks(range(1, num_topics + 1))

plt.tight_layout()
visual_plot = plt
plt.show()

In [None]:
# Visible testing
assert visual_plot is not None, "Variable 'visual_plot' is not created."

### Analysis of Topic Modeling Results

The LDA topic model assigns the Yelp reviews to 5 topics. Reading the top words for each topic suggests the following patterns:

**Topic Interpretation:**
The five topics appear to capture different aspects of the dining and service experience. One topic likely covers food quality and taste, with words for dishes, flavors, and menu items. Another emphasizes service and staff interactions, with words like "service," "staff," and "friendly." A third may center on atmosphere and setting, with words about ambiance and location. The remaining topics could relate to specific cuisine types or to value and pricing.

**Key Insights:**
The topic prevalence chart sums each topic's probability over all reviews, so a taller bar means that theme takes up more of the corpus. The word-weight charts show which terms are most strongly associated with each topic, and therefore which words drive its interpretation. If the pyLDAvis visualization is generated, well-separated topic circles would indicate that the model found distinct themes, while overlapping circles would indicate topics that share vocabulary. LDA does not measure sentiment, so identifying areas that need improvement requires pairing each topic with the `stars` column; with that pairing, restaurant owners could see which aspects of their business customers discuss most and how those discussions are rated.
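
The prevalence totals can be easier to compare as shares of the corpus, alongside a count of which topic dominates each review. The sketch below shows that arithmetic on hypothetical three-topic distributions; the numbers are made up for illustration and are not output from the model above.

```python
# Hypothetical per-document topic distributions (each row sums to 1).
toy_distributions = [
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.60, 0.30, 0.10],
    [0.25, 0.25, 0.50],
]
n_topics = len(toy_distributions[0])

# Total prevalence per topic, as plotted in the bar chart.
prevalence = [sum(row[k] for row in toy_distributions) for k in range(n_topics)]

# Normalize so the shares add up to 1.
total = sum(prevalence)
shares = [p / total for p in prevalence]

# Index of the highest-probability topic for each document.
dominant = [max(range(n_topics), key=lambda k: row[k]) for row in toy_distributions]

for k, share in enumerate(shares):
    print(f"Topic {k + 1}: {share:.1%} of total topic weight")
print("Dominant topic per document:", [d + 1 for d in dominant])
```

The same calculation could be applied to `topic_distributions` from the visualization cell to state each topic's share as a percentage.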