#  Research Project  
## **Building a Natural Language Opinion Search Engine**  
**Jennifer Nava**  
**Course:** COSC 4397 – Natural Language Processing  
**Instructor:** Dr. Arjun Mukherjee  

---

##  Project Overview

This research project explores the design and implementation of a Natural Language Opinion Search Engine, using a real-world corpus of Amazon product reviews. The system is intended to go beyond traditional keyword-based search by understanding and responding to natural language queries that combine product **aspects** (e.g., "battery", "image quality") with **opinion terms** (e.g., "poor", "amazing").

The project emphasizes the importance of extracting **coherent and sentiment-relevant** opinions from text data and investigates multiple approaches rooted in Natural Language Processing (NLP).

---

##  Objectives

-  **Data Preprocessing:** Clean and normalize raw review text through tokenization, stopword removal, lemmatization, rare word filtering, and smiley handling.
-  **Boolean Retrieval Baseline:** Implement a baseline opinion search using Boolean logic to retrieve documents based on:
  - **Test 1:** Aspect-only matching
  - **Test 2:** Aspect AND Opinion matching
  - **Test 3:** Aspect OR Opinion matching

---

>  Dataset: Amazon.com product reviews (electronics/software domain)  
>  Timeline: Summer 2025 | University of Houston | COSC 4397




---

In [1]:
pip install nltk


Note: you may need to restart the kernel to use updated packages.


## 🌸  Step 1: Environment Setup and Required Imports

This section initializes the Python environment for Natural Language Processing by importing necessary libraries and downloading key NLTK resources. These tools will support tasks such as:

- **Tokenization** (breaking reviews into words)
- **Stopword removal** (removing common, non-informative words)
- **Lemmatization** (reducing words to their base forms)

The following libraries are used:

- `pandas` – for handling review data
- `re` – for regular expressions used in cleaning text
- `os` – for file management
- `nltk` – for natural language processing tasks

We also ensure that required NLTK resources (`punkt`, `stopwords`, and `wordnet`) are downloaded.


In [2]:
import pandas as pd
import re
import os
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# NLTK downloads
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package punkt to C:\Users\jenni/nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jenni/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\jenni/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### 🌸 Step 2: Load Review Dataset

We load the Amazon product review dataset from a serialized `.pkl` file (`reviews_segment.pkl`) into a pandas DataFrame. This file contains reviews and associated metadata for various electronics and software products.

We also ensure the `review_text` column is properly cast to string format to prevent issues during text processing in later steps.

In [3]:
df = pd.read_pickle("reviews_segment.pkl")
df['review_text'] = df['review_text'].astype(str)

### 🌸 Step 3: Stopword Handling and Lemmatizer Setup

To ensure cleaner and more meaningful tokens, we enhance the default list of stopwords from NLTK with a custom list provided in a separate file (`NLTK's list of english stopwords`). This helps eliminate additional non-informative words from the reviews.

We then combine the two stopword sets and initialize the NLTK WordNet lemmatizer, which will later be used to normalize tokens to their base forms (e.g., "running" → "run").

In [5]:
#extra_stopwords = set(pd.read_csv("NLTK's list of english stopwords", header=None, encoding='utf-8')[0].tolist())
#stop_words = set(stopwords.words("english")).union(extra_stopwords) 
##( this is to help ta to run it )
extra_stopwords = {
    "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours",
    "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself",
    "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which",
    "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be",
    "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an",
    "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for",
    "with", "about", "against", "between", "into", "through", "during", "before", "after",
    "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under",
    "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all",
    "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not",
    "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don",
    "should", "now"
}

stop_words = set(stopwords.words("english")).union(extra_stopwords)

lemmatizer = WordNetLemmatizer()

### 🌸 Step 4: Define Smileys and Token Extraction Function

To improve sentiment detection, we account for **emoticons (smileys)** that often serve as informal sentiment indicators in user reviews.

- **Positive Smileys:** 🙂 😊 :D etc.
- **Negative Smileys:** 🙁 ☹️ 😢 etc.

We define regular expressions to identify and remove these smileys before tokenization.

We also define a custom `extract_tokens()` function that performs the following steps:
1. Removes HTML tags from the review text.
2. Strips both positive and negative smileys using regex.
3. Removes all punctuation.
4. Converts text to lowercase and tokenizes it into words using `nltk.word_tokenize()`.

This prepares the text for further preprocessing such as lemmatization and stopword removal.


In [6]:
POSITIVE_SMILEYS = r"(:\)|:-\)|:\]|:D|:o\))"
NEGATIVE_SMILEYS = r"(:\(|:-\(|:\[|:'\()"
all_tokens = []

def extract_tokens(text):
    text = re.sub(r"<.*?>", " ", text)
    text = re.sub(rf"{POSITIVE_SMILEYS}|{NEGATIVE_SMILEYS}", "", text)
    text = re.sub(rf"[{string.punctuation}]", "", text)
    tokens = word_tokenize(text.lower())
    return tokens

### 🌸 Step 5: Token Frequency Analysis and Rare Word Filtering

After defining the token extraction logic, we apply the `extract_tokens()` function to the entire dataset:

- Each review is tokenized and stored in a new column `df['tokens']`.
- All tokens are collected into a global list `all_tokens` to compute frequency statistics.

We then use Python's `Counter` to count the occurrence of each token across the dataset. Based on these frequencies, we define a set of **rare words** — any word that appears **fewer than 5 times** is considered rare and stored in the `rare_words` set.

These rare words will be filtered out during the final preprocessing step to reduce noise and vocabulary size.


In [7]:
import string
df['tokens'] = df['review_text'].apply(extract_tokens)
all_tokens = []
for tokens in df['tokens']:
    all_tokens.extend(tokens)

from collections import Counter
word_counts = Counter(all_tokens)
rare_words = set([word for word, count in word_counts.items() if count < 5])


### 🌸 Step 6: Final Text Preprocessing Function

We define the `preprocess(text)` function to perform full review text cleaning, normalization, and filtering. This function builds on the previous steps and prepares the text for downstream search and analysis.

The function applies the following operations:

1. **Null Handling:** Returns an empty string if the input is missing.
2. **HTML Removal:** Strips out any HTML tags.
3. **Emoticon Normalization:** Replaces positive and negative smileys with the placeholders `positive_smiley` and `negative_smiley`, respectively.
4. **Punctuation Removal:** Eliminates all punctuation characters.
5. **Lowercasing and Tokenization:** Converts text to lowercase and splits it into tokens using NLTK.
6. **Stopword and Rare Word Filtering:** Removes common stopwords and rare words (those occurring fewer than 5 times).
7. **Lemmatization:** Reduces each remaining token to its base (dictionary) form using `WordNetLemmatizer`.

The resulting tokens are joined back into a single cleaned string, ready for indexing and retrieval.


In [8]:
def preprocess(text):
    if pd.isnull(text):
        return ""
    text = re.sub(r"<.*?>", " ", text)
    text = re.sub(POSITIVE_SMILEYS, "positive_smiley", text)
    text = re.sub(NEGATIVE_SMILEYS, "negative_smiley", text)
    text = re.sub(rf"[{string.punctuation}]", "", text)
    tokens = word_tokenize(text.lower())
    cleaned = [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words and t not in rare_words]
    return " ".join(cleaned)

### 🌸 Step 7: Apply Preprocessing and Save Cleaned Data

We now apply the `preprocess()` function to each review in the dataset and store the cleaned output in a new column called `clean_text`.

After cleaning:

- The intermediate `tokens` column is dropped to reduce memory usage.
- The fully processed dataset is saved in two formats:
  - A serialized binary file (`.pkl`) for efficient loading in Python
  - A `.csv` file for inspection or use outside of Python environments

This marks the end of the preprocessing stage. The cleaned data is now ready for Boolean search, classification, and embedding-based retrieval in subsequent tasks.

In [9]:
df['clean_text'] = df['review_text'].apply(preprocess)
df.drop(columns=['tokens'], inplace=True)

df.to_pickle("reviews_segment_cleaned.pkl")
df.to_csv("reviews_segment_cleaned.csv", index=False)


### 🌸 Step 8: Define Opinion Queries

We define a set of five opinion queries that combine product **aspect terms** with corresponding **opinion terms**. Each query is expressed in the format:



In [10]:
queries = {
    "audio quality:poor": (["audio", "quality"], ["poor"]),
    "wifi signal:strong": (["wifi", "signal"], ["strong"]),
    "mouse button:click problem": (["mouse", "button"], ["click", "problem"]),
    "gps map:useful": (["gps", "map"], ["useful"]),
    "image quality:sharp": (["image", "quality"], ["sharp"])
}

### 🌸  Step 9: Define Boolean Retrieval Functions (Test 1–3)

We implement three Boolean search functions to evaluate different configurations of aspect and opinion term matching.

Each function operates on the preprocessed `clean_text` and returns a list of tuples containing the review ID and original review text for all matching entries.

####  🌸 `test1_boolean_aspect_retrieval(df, aspect_terms)`
- **Objective:** Match reviews that mention at least one aspect term.
- **Logic:** `aspect1 OR aspect2`

####  🌸 `test2_aspect_opinion_and(df, aspect_terms, opinion_terms)`
- **Objective:** Match reviews that contain at least one aspect term **and** at least one opinion term.
- **Logic:** `(aspect1 OR aspect2) AND (opinion1 OR opinion2)`

#### 🌸 `test3_aspect_opinion_or(df, aspect_terms, opinion_terms)`
- **Objective:** Match reviews that contain **either** an aspect term **or** an opinion term (or both).
- **Logic:** `aspect1 OR aspect2 OR opinion1 OR opinion2`

These functions serve as the foundation for evaluating the baseline Boolean retrieval model using different query interpretations.

In [11]:
def test1_boolean_aspect_retrieval(df, aspect_terms):
    matched = []
    for _, row in df.iterrows():
        text = row['clean_text']
        if any(a in text for a in aspect_terms):
            matched.append((row['review_id'], row['review_text']))
    return matched

def test2_aspect_opinion_and(df, aspect_terms, opinion_terms):
    matched = []
    for _, row in df.iterrows():
        text = row['clean_text']
        if any(a in text for a in aspect_terms) and any(o in text for o in opinion_terms):
            matched.append((row['review_id'], row['review_text']))
    return matched

def test3_aspect_opinion_or(df, aspect_terms, opinion_terms):
    matched = []
    for _, row in df.iterrows():
        text = row['clean_text']
        if any(a in text for a in aspect_terms) or any(o in text for o in opinion_terms):
            matched.append((row['review_id'], row['review_text']))
    return matched


### 🌸 Step 10: Run Boolean Tests and Save Results (Test 1–3)

For each query defined in our dictionary, we execute the three Boolean retrieval tests and evaluate their results:

1. **Test 1** – Aspect-only match  
2. **Test 2** – Aspect AND Opinion match  
3. **Test 3** – Aspect OR Opinion match

For each test:
- We retrieve matching reviews using the corresponding test function.
- We compute the precision as the ratio of matched reviews to retrieved results.
- All matched **review IDs** are written to `.txt` files for later evaluation.
- An **example review text** is printed for quick inspection.
- Results are printed to the console, including the number of matches and precision.

The output file names follow this convention:

In [12]:
import pandas as pd
import matplotlib.pyplot as plt
import ipywidgets as widgets
from IPython.display import display

# Function to check if review expresses the opinion
def is_relevant(text, opinion_terms):
    return any(opinion.lower() in text.lower() for opinion in opinion_terms)

# Store summary results
summary_rows = []

for query_label, (aspect_terms, opinion_terms) in queries.items():
    print(f"\n🌸 Test 1 Query: {query_label}")
    base_filename = query_label.replace(":", "_").replace(" ", "_")

    # --- Test 1 ---
    matches1 = test1_boolean_aspect_retrieval(df, aspect_terms)
    test1_ids = [mid for mid, _ in matches1]
    relevant1 = [mid for mid, text in matches1 if is_relevant(text, opinion_terms)]
    precision1 = len(relevant1) / len(test1_ids) if test1_ids else 0

    with open(f"{base_filename}_test1.txt", "w") as f:
        for mid in test1_ids:
            f.write(f"{mid.strip('\"\'')}\n")

    print(f"🌸 Test 1 Retrieved: {len(test1_ids)}, Relevant: {len(relevant1)}, Precision: {round(precision1, 3)}")
    if matches1:
        print("Review IDs:")
        for mid in test1_ids:
            print(mid.strip("'\""))
        print("\n🌸 Example Review Text (Test 1):")
        print(matches1[0][1].strip())

    summary_rows.append({
        'Query': query_label,
        'Test': 'Test 1',
        'Retrieved': len(test1_ids),
        'Relevant': len(relevant1),
        'Precision': round(precision1, 3)
    })

    # --- Test 2 ---
    print(f"\n🌸 Test 2 Query: {query_label}")
    matches2 = test2_aspect_opinion_and(df, aspect_terms, opinion_terms)
    test2_ids = [mid for mid, _ in matches2]
    relevant2 = [mid for mid, text in matches2 if is_relevant(text, opinion_terms)]
    precision2 = len(relevant2) / len(test2_ids) if test2_ids else 0

    with open(f"{base_filename}_test2.txt", "w") as f:
        for mid in test2_ids:
            f.write(f"{mid.strip('\"\'')}\n")

    print(f"🌸 Test 2 Retrieved: {len(test2_ids)}, Relevant: {len(relevant2)}, Precision: {round(precision2, 3)}")
    if matches2:
        print("Review IDs:")
        for mid in test2_ids:
            print(mid.strip("'\""))
        print("\n🌸 Example Review Text (Test 2):")
        print(matches2[0][1].strip())

    summary_rows.append({
        'Query': query_label,
        'Test': 'Test 2',
        'Retrieved': len(test2_ids),
        'Relevant': len(relevant2),
        'Precision': round(precision2, 3)
    })

    # --- Test 3 ---
    print(f"\n🌸 Test 3 Query: {query_label}")
    matches3 = test3_aspect_opinion_or(df, aspect_terms, opinion_terms)
    test3_ids = [mid for mid, _ in matches3]
    relevant3 = [mid for mid, text in matches3 if is_relevant(text, opinion_terms)]
    precision3 = len(relevant3) / len(test3_ids) if test3_ids else 0

    with open(f"{base_filename}_test3.txt", "w") as f:
        for mid in test3_ids:
            f.write(f"{mid.strip('\"\'')}\n")

    print(f"🌸 Test 3 Retrieved: {len(test3_ids)}, Relevant: {len(relevant3)}, Precision: {round(precision3, 3)}")
    if matches3:
        print("Review IDs:")
        for mid in test3_ids:
            print(mid.strip("'\""))
        print("\n🌸 Example Review Text (Test 3):")
        print(matches3[0][1].strip())

    summary_rows.append({
        'Query': query_label,
        'Test': 'Test 3',
        'Retrieved': len(test3_ids),
        'Relevant': len(relevant3),
        'Precision': round(precision3, 3)
    })



🌸 Test 1 Query: audio quality:poor
🌸 Test 1 Retrieved: 22982, Relevant: 1864, Precision: 0.081
Review IDs:
R1009X5OE67SIO
R100GDC82ALTP9
R101CB621E6E4K
R101Y2RFUYKFS2
R1029L2LRQDKVL
R102BUG4DYWSU2
R102DRWN99W665
R102F4AYVPYL8A
R102OFYPZ4BH7O
R102ZDRJU3MNHN
R1030NN29FWL8C
R1035FWJ49AR9D
R103C93Y5E49IT
R103DR0VSEZEDJ
R1041I7XBQ14B3
R1041OX4ER4IW5
R1043IG4GMHOFB
R1044PVTF5F1GO
R10490GQF2KDC9
R104ET8WJIR05F
R104J4YTL1RDUA
R104OHTKM727KJ
R1055KJIZC1ICG
R1058WTS41OMYK
R105GLDULN7QD8
R105LMF9L4ANCT
R105PC02WFFUZZ
R10623OAW776V9
R1063DZN992X6O
R1063MXILQ6IZ
R1064L6XX7TDJC
R1069MFR2O1E8V
R106DYUZ293LXU
R106EZ3WKI9UW2
R107KMJYYZBPLE
R1087OY4V8PRFX
R108L1PJBDDGFA
R108ZQ9AEWTPBH
R1090M2ZO331DO
R1091VMW97LTTV
R1096Q8HQTGD2A
R1099XU23GB6K2
R109IJR00DDCFV
R109TWLQSCEE83
R10A1EUN5VT1EJ
R10A4Y9QDATTP6
R10A82GZTUIDES
R10AJROABI0C3P
R10AR6AW0AGVJO
R10AWPUMH1Y4BL
R10BB2FGLY2VI4
R10BCN6Z4CCR7J
R10BPWJNXODGTF
R10C2NP2CDOQX8
R10C7RHJIG2G3H
R10CWQHXYETR5V
R10D1E2A0RAZ9R
R10D3RK0KKZ73R
R10DBATUEKOSRQ
R10DMTEV

In [14]:
import pandas as pd
summary_df = pd.DataFrame(summary_rows)
print("\n🌸 Final Precision Summary Table:")
print(summary_df.to_string(index=False))



🌸 Final Precision Summary Table:
                     Query   Test  Retrieved  Relevant  Precision
        audio quality:poor Test 1      22982      1864      0.081
        audio quality:poor Test 2       1812      1812      1.000
        audio quality:poor Test 3      27381      6263      0.229
        wifi signal:strong Test 1       3521       311      0.088
        wifi signal:strong Test 2        306       306      1.000
        wifi signal:strong Test 3       8627      5417      0.628
mouse button:click problem Test 1       9493      3476      0.366
mouse button:click problem Test 2       3401      3401      1.000
mouse button:click problem Test 3      35104     29086      0.829
            gps map:useful Test 1       3705       328      0.089
            gps map:useful Test 2        318       318      1.000
            gps map:useful Test 3       8640      5263      0.609
       image quality:sharp Test 1      23237      1089      0.047
       image quality:sharp Test 2       10