# 1. Cleaning the Data

## Removing unnecessary fields
### Description
This Python script reads a large JSON Lines (JSONL) file line by line, processes each entry to remove the specified fields, and writes the cleaned data into a new JSONL file. The script is optimized for handling large datasets by processing entries incrementally to avoid memory overload, making it suitable for datasets containing millions of records.

### Features
- Field Removal: Removes non-essential fields (images, videos, details, features, bought_together, and description) from each entry.
- Memory Efficiency: Processes each line independently without loading the entire file into memory.
- Scalability: Handles datasets with millions of entries because only one line is held in memory at a time.
- Preserves Original Structure: Leaves all remaining fields unchanged, so the dataset is ready for subsequent analysis.
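
The line-by-line approach can be tried on a small in-memory sample before running it on the full file. The sketch below is illustrative only: the two sample records and the helper name `clean_lines` are assumptions, and it also skips blank lines, which is an addition rather than part of the original script.

```python
import io
import json

# Hypothetical sample records; the real input is updated_metadata.jsonl.
sample = io.StringIO(
    '{"title": "Album A", "images": [], "price": "9.99"}\n'
    '\n'
    '{"title": "Album B", "videos": [], "description": ["..."]}\n'
)

fields_to_remove = ["images", "videos", "details", "features", "bought_together", "description"]

def clean_lines(lines, fields):
    """Yield one cleaned JSON string per non-blank input line."""
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines instead of failing in json.loads
            continue
        entry = json.loads(line)
        for field in fields:
            entry.pop(field, None)  # no error if the field is absent
        yield json.dumps(entry)

cleaned = list(clean_lines(sample, fields_to_remove))
for row in cleaned:
    print(row)
```

Because `clean_lines` is a generator, only one record is held in memory at a time, which is the property the features above rely on.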


In [12]:
import json

# Input and output file paths
input_file = "updated_metadata.jsonl"
output_file = "cleaned_metadata.jsonl"

# Fields to remove
fields_to_remove = ["images", "videos", "details", "features", "bought_together", "description"]

# Process the file
with open(input_file, "r", encoding="utf-8") as infile, open(output_file, "w", encoding="utf-8") as outfile:
    for line in infile:
        entry = json.loads(line.strip())  # Load the JSON object
        # Remove the specified fields
        for field in fields_to_remove:
            entry.pop(field, None)
        # Write the cleaned entry to the output file
        outfile.write(json.dumps(entry) + "\n")

print(f"Cleaned data written to {output_file}")


Cleaned data written to cleaned_metadata.jsonl


## Cleaning reviews
Objective:
The purpose of this step is to preprocess the cleaned_metadata.jsonl file by removing unnecessary subfields from the reviews field. This process ensures that only the relevant information for sentiment analysis is retained in the reviews data, making it cleaner and more focused for downstream processing.

Description:
In this step, we clean the reviews field within each entry of the cleaned_metadata.jsonl file. Specifically, we remove the following subfields from each review:

- parent_asin
- user_id
- asin
- helpful_vote
- images

These fields are irrelevant for sentiment analysis, as they do not contribute to evaluating the tone or opinion expressed in the review. Removing them reduces noise in the dataset and leaves only the fields needed to analyze each review, such as rating, title, text, and timestamp.

In [13]:
import json

# Input and output file paths (same file for input and output)
file_path = "cleaned_metadata.jsonl"

# Fields to remove within the reviews
review_fields_to_remove = ["parent_asin", "user_id", "asin", "helpful_vote", "images"]

# Process the file
with open(file_path, "r", encoding="utf-8") as infile:
    lines = infile.readlines()  # Read all lines into memory

# Modify the data
modified_lines = []
for line in lines:
    entry = json.loads(line.strip())  # Load the JSON object
    
    # Clean the reviews field
    if "reviews" in entry and isinstance(entry["reviews"], list):
        cleaned_reviews = []
        for review in entry["reviews"]:
            # Remove specified fields in each review
            cleaned_review = {k: v for k, v in review.items() if k not in review_fields_to_remove}
            cleaned_reviews.append(cleaned_review)
        entry["reviews"] = cleaned_reviews

    # Prepare the modified entry for output
    modified_lines.append(json.dumps(entry))

# Overwrite the file with the modified data
with open(file_path, "w", encoding="utf-8") as outfile:
    for modified_line in modified_lines:
        outfile.write(modified_line + "\n")

print(f"Reviews cleaned and file {file_path} updated.")

Reviews cleaned and file cleaned_metadata.jsonl updated.
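
Because this step reads the whole file into memory before overwriting it, very large files may call for a streaming variant. The following is a minimal sketch rather than the method used in this notebook: the helper names `rewrite_jsonl` and `strip_review_fields` are hypothetical, and it assumes a temporary file can be created in the same directory as the target so that `os.replace` can swap it in after a complete pass.

```python
import json
import os
import tempfile

def rewrite_jsonl(path, transform):
    """Stream `path` through `transform`, then replace the original file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp, \
             open(path, "r", encoding="utf-8") as src:
            for line in src:
                if line.strip():
                    tmp.write(json.dumps(transform(json.loads(line))) + "\n")
        # The original is replaced only after every line was processed.
        os.replace(tmp_path, path)
    except BaseException:
        os.remove(tmp_path)
        raise

def strip_review_fields(entry, fields=("parent_asin", "user_id", "asin", "helpful_vote", "images")):
    """Drop the unwanted subfields from every review of one entry."""
    if isinstance(entry.get("reviews"), list):
        entry["reviews"] = [
            {k: v for k, v in review.items() if k not in fields}
            for review in entry["reviews"]
        ]
    return entry

# Demo on a throwaway file with hypothetical data.
with tempfile.TemporaryDirectory() as workdir:
    demo_path = os.path.join(workdir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"title": "A", "reviews": [{"rating": 5.0, "user_id": "u1"}]}) + "\n")
    rewrite_jsonl(demo_path, strip_review_fields)
    with open(demo_path, encoding="utf-8") as f:
        result = [json.loads(line) for line in f]

print(result)
```

With this pattern an error in the middle of the pass leaves the original file untouched, at the cost of temporarily needing disk space for a second copy.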


Next, we drop every review whose verified_purchase flag is not True, so that only reviews from verified purchases remain.
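
As a small illustration of the filter on a single record, the sketch below uses made-up reviews and a hypothetical helper name, `keep_verified`. It shows that the identity check against `True` also drops reviews where the flag is missing or stored as a string.

```python
def keep_verified(entry):
    """Keep only reviews whose verified_purchase flag is the boolean True."""
    if "reviews" in entry:
        entry["reviews"] = [
            review for review in entry["reviews"]
            if review.get("verified_purchase") is True
        ]
    return entry

# Hypothetical sample record.
sample = {"reviews": [
    {"text": "a", "verified_purchase": True},
    {"text": "b", "verified_purchase": False},
    {"text": "c"},                                # flag missing
    {"text": "d", "verified_purchase": "true"},   # a string, not a boolean
]}

kept = [review["text"] for review in keep_verified(sample)["reviews"]]
print(kept)  # ['a']
```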

In [14]:
import json

# Define the file path
file_path = 'cleaned_metadata.jsonl'

# Read the file, filter the reviews, and write back to the same file
with open(file_path, 'r+') as file:
    lines = file.readlines()  # Read all lines into a list
    file.seek(0)  # Move the file pointer to the beginning
    file.truncate()  # Clear the file content

    # Process each line and filter the reviews
    for line in lines:
        data = json.loads(line)  # Parse the JSON data
        
        # Check if 'reviews' exists, then filter the reviews based on 'verified_purchase'
        if 'reviews' in data:
            # Only keep reviews where 'verified_purchase' is True
            data['reviews'] = [review for review in data['reviews'] if review.get('verified_purchase', False) is True]

        # Write the cleaned data back to the same file
        json.dump(data, file)
        file.write('\n')

print(f"File {file_path} has been updated with filtered reviews.")


File cleaned_metadata.jsonl has been updated with filtered reviews.


## Further data cleaning and feature engineering

Convert review timestamps to a readable date format and remove HTML tags from the review text.
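
The timestamp conversion can be checked on a known value. This is a minimal sketch that assumes timestamps are Unix epoch values in milliseconds; it uses a timezone-aware conversion to UTC, and the sample value is chosen for illustration only.

```python
from datetime import datetime, timezone

def convert_timestamp(timestamp_ms):
    """Convert a Unix timestamp in milliseconds to a UTC date string."""
    moment = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")

# 1609459200000 ms is midnight on 1 January 2021 (UTC).
print(convert_timestamp(1609459200000))  # 2021-01-01 00:00:00
```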

In [15]:
import json
from datetime import datetime, timezone
from bs4 import BeautifulSoup

# Read the cleaned metadata, one JSON object per line
with open("cleaned_metadata.jsonl", "r") as file:
    cleaned_data = [json.loads(line) for line in file]

# Function to convert a Unix timestamp in milliseconds to a readable UTC date
def convert_timestamp(timestamp):
    # datetime.utcfromtimestamp is deprecated since Python 3.12, so use a timezone-aware conversion
    return datetime.fromtimestamp(timestamp / 1000, tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S')

# Function to remove HTML tags from the review text
def remove_html_tags(text):
    return BeautifulSoup(text, "html.parser").get_text()

# Apply the transformations to the data
for entry in cleaned_data:
    # Convert timestamp for each review
    for review in entry.get('reviews', []):
        review['timestamp'] = convert_timestamp(review['timestamp'])
        # Remove HTML tags in the review text
        review['text'] = remove_html_tags(review['text'])

# Save the updated data back to cleaned_metadata.jsonl
with open("cleaned_metadata.jsonl", "w") as file:
    for entry in cleaned_data:
        file.write(json.dumps(entry) + "\n")

print("Timestamp converted and HTML tags removed successfully.")


Timestamp converted and HTML tags removed successfully.


## Feature engineering

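Add two product-level fields: review_count (the number of reviews kept for the product) and price_range (low for prices under 10, medium for prices from 10 up to 30, high for prices of 30 and above, and unknown when the price is missing or non-numeric).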
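
The price bucketing can be expressed as a small pure function and checked against boundary values. This is a sketch under the thresholds described above; the function name `price_range` and the sample prices are illustrative assumptions.

```python
def price_range(price):
    """Map a raw price value to 'low', 'medium', 'high', or 'unknown'."""
    if price is None or price == "":
        return "unknown"
    try:
        value = float(price)
    except (TypeError, ValueError):
        # Non-numeric strings or unexpected types cannot be bucketed.
        return "unknown"
    if value < 10:
        return "low"
    if value < 30:
        return "medium"
    return "high"

# Hypothetical sample prices, including the two boundaries 10 and 30.
samples = [None, "", "9.99", 10, "29.99", 30, "N/A"]
labels = [price_range(p) for p in samples]
print(labels)
# ['unknown', 'unknown', 'low', 'medium', 'medium', 'high', 'unknown']
```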

In [16]:
import json

# Function to add Review Count feature
def add_review_count(data):
    for entry in data:
        entry['review_count'] = len(entry.get('reviews', []))  # Count reviews for each product
    return data

# Function to add Price Range feature, handling null prices and converting to numeric
def add_price_range(data):
    for entry in data:
        price = entry.get('price', None)  # Get the price, default to None if not present
        
        # If the price is None or empty, set to 'unknown'
        if price is None or price == '':
            entry['price_range'] = 'unknown'  # Set as 'unknown' if price is missing or empty
        else:
            try:
                # Convert price to float to ensure proper comparison
                price = float(price)
                if price < 10:
                    entry['price_range'] = 'low'
                elif 10 <= price < 30:
                    entry['price_range'] = 'medium'
                else:
                    entry['price_range'] = 'high'
            except ValueError:
                # If conversion fails (e.g., if price is non-numeric), set it as 'unknown'
                entry['price_range'] = 'unknown'
    
    return data

# Load the cleaned data from the cleaned_metadata.jsonl file
with open("cleaned_metadata.jsonl", "r") as file:
    cleaned_data = [json.loads(line) for line in file]

# Apply Review Count and Price Range feature engineering
cleaned_data = add_review_count(cleaned_data)
cleaned_data = add_price_range(cleaned_data)

# Save the updated data with the new features to the same file
with open("cleaned_metadata.jsonl", "w") as file:
    for entry in cleaned_data:
        file.write(json.dumps(entry) + "\n")

print("Review Count and Price Range features added successfully, handling null and non-numeric prices.")

Review Count and Price Range features added successfully, handling null and non-numeric prices.


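Add a sentiment score to each review on a scale from -1 (negative) to 1 (positive). The review text is first preprocessed (lowercasing, removal of special characters, tokenization, stopword removal, stemming and lemmatization), then scored with VADER, and finally averaged with a score derived from the star rating.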
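
The arithmetic that combines the text score with the star rating can be worked through without any NLP dependencies. In the sketch below, `rating_to_sentiment` reproduces the mapping used in this step, `2 * (rating / 5) - 1`, which sends 1 star to -0.6 rather than -1. The function `symmetric_rating_to_sentiment` is a hypothetical alternative, not what produced the results in this notebook, shown only to illustrate a mapping that covers the full -1 to 1 range.

```python
def rating_to_sentiment(rating):
    """Mapping used in this step: 1 star -> -0.6, 5 stars -> 1.0."""
    return 2 * (rating / 5) - 1

def symmetric_rating_to_sentiment(rating):
    """Hypothetical alternative: 1 star -> -1.0, 3 stars -> 0.0, 5 stars -> 1.0."""
    return (rating - 3) / 2

def combine(text_sentiment, rating_sentiment):
    """Equal-weight average of the text score and the rating score."""
    return (text_sentiment + rating_sentiment) / 2

for rating in (1, 3, 5):
    print(rating,
          round(rating_to_sentiment(rating), 2),
          symmetric_rating_to_sentiment(rating))

# A 5-star review whose text scores 0.8 ends up at (0.8 + 1.0) / 2.
example = combine(0.8, rating_to_sentiment(5))
print(round(example, 2))  # 0.9
```

With the mapping used here, a 3-star rating contributes 0.2 rather than 0, so the combined scores lean positive.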

In [17]:
import json
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Download NLTK resources (run once)
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize NLP tools
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()
analyzer = SentimentIntensityAnalyzer()

# Function to preprocess text
def preprocess_text(text):
    # Lowercase
    text = text.lower()
    # Remove everything except letters and whitespace (digits and punctuation are dropped)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenization
    tokens = word_tokenize(text)
    # Remove stopwords, then stem and lemmatize each remaining token
    tokens = [lemmatizer.lemmatize(ps.stem(word)) for word in tokens if word not in stop_words]
    return ' '.join(tokens)

# Function to get sentiment score (VADER)
def get_sentiment_score(text):
    sentiment = analyzer.polarity_scores(text)
    return sentiment['compound']  # Compound score ranges from -1 (negative) to 1 (positive)

# Function to adjust sentiment score based on rating
def adjust_sentiment_based_on_rating(rating, text_sentiment):
    # Map the rating linearly with 2 * (rating / 5) - 1: 1 star gives -0.6, 5 stars give 1.0
    # (a missing rating, passed in as 0, gives -1.0)
    rating_sentiment = 2 * (rating / 5) - 1
    
    # Combine text sentiment and rating sentiment (average them)
    # Adjust the weights depending on how much importance you want to give to the rating vs text
    combined_sentiment = (text_sentiment + rating_sentiment) / 2
    return combined_sentiment

# File path
file_path = 'cleaned_metadata.jsonl'

# Read the file, preprocess the text, and add sentiment scores
with open(file_path, 'r+') as file:
    lines = file.readlines()  # Read all lines into a list
    file.seek(0)  # Move the file pointer to the beginning
    file.truncate()  # Clear the file content

    # Process each line
    for line in lines:
        data = json.loads(line)  # Parse the JSON data

        # Check if 'reviews' exists and process each review
        if 'reviews' in data:
            for review in data['reviews']:
                # Preprocess the review text
                cleaned_text = preprocess_text(review.get('text', ''))
                # Get sentiment score for the review text
                text_sentiment = get_sentiment_score(cleaned_text)
                # Adjust sentiment score based on the rating
                final_sentiment = adjust_sentiment_based_on_rating(review.get('rating', 0), text_sentiment)
                # Add the final sentiment score to the review
                review['sentiment_score'] = final_sentiment

        # Write the updated data back to the same file
        json.dump(data, file)
        file.write('\n')

print(f"File {file_path} has been updated with sentiment scores.")


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mitan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mitan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mitan\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


File cleaned_metadata.jsonl has been updated with sentiment scores.


Add a word_count to each review and an average_sentiment_score to each product.
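
On a single record the two new fields look as follows. This is a sketch with made-up reviews and a hypothetical helper name, `add_review_stats`; unlike the notebook cell, it tolerates reviews that lack a text or a sentiment_score.

```python
def add_review_stats(entry):
    """Add word_count per review and average_sentiment_score per product."""
    reviews = entry.get("reviews", [])
    for review in reviews:
        # Whitespace-separated tokens count as words.
        review["word_count"] = len(review.get("text", "").split())
    scores = [r["sentiment_score"] for r in reviews if "sentiment_score" in r]
    entry["average_sentiment_score"] = sum(scores) / len(scores) if scores else 0.0
    return entry

# Hypothetical sample record.
entry = add_review_stats({"reviews": [
    {"text": "Great album, love it", "sentiment_score": 0.5},
    {"text": "Not good", "sentiment_score": -0.25},
]})

print([r["word_count"] for r in entry["reviews"]])  # [4, 2]
print(entry["average_sentiment_score"])             # 0.125
```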

In [19]:
import json

# Open the JSONL file
file_path = "cleaned_metadata.jsonl"

# Process data line by line
with open(file_path, "r+") as file:
    lines = file.readlines()
    file.seek(0)  # Reset pointer to the start of the file

    for line in lines:
        data = json.loads(line)  # Parse JSON data
        
        # Add word_count for each review
        if "reviews" in data:
            for review in data["reviews"]:
                review["word_count"] = len(review["text"].split())
        
            # Compute average_sentiment_score
            total_sentiment = sum(review["sentiment_score"] for review in data["reviews"])
            data["average_sentiment_score"] = total_sentiment / len(data["reviews"]) if data["reviews"] else 0.0
        else:
            data["average_sentiment_score"] = 0.0  # Default if no reviews
        
        # Write updated JSON data back to file
        file.write(json.dumps(data) + "\n")
    
    file.truncate()  # Remove any leftover data from previous content

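Count the positive, negative, and neutral reviews for each product, treating scores above 0.1 as positive, scores below -0.1 as negative, and everything in between as neutral.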
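
The classification rule can be isolated in one function, which makes the behavior at the thresholds easy to verify. The helper name `sentiment_label` and the sample scores are assumptions made for this sketch.

```python
from collections import Counter

def sentiment_label(score, threshold=0.1):
    """Label a score; values exactly at +/- threshold count as neutral."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

# Hypothetical scores, including both boundary values.
scores = [0.9, 0.1, 0.0, -0.1, -0.45]
counts = Counter(sentiment_label(s) for s in scores)
print(dict(counts))  # {'positive': 1, 'neutral': 3, 'negative': 1}
```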

In [20]:
import json

# Path to your JSONL file
file_path = "cleaned_metadata.jsonl"

# Temporary list to hold processed data
processed_data = []

# Read the JSONL file line by line
with open(file_path, "r") as infile:
    for line in infile:
        # Parse each line as a JSON object
        data = json.loads(line)
        
        # Initialize counts
        positive_reviews_count = 0
        negative_reviews_count = 0
        neutral_reviews_count = 0
        
        # Process reviews and count sentiment categories
        if "reviews" in data:
            for review in data["reviews"]:
                score = review.get("sentiment_score", 0)
                if score > 0.1:
                    positive_reviews_count += 1
                elif score < -0.1:
                    negative_reviews_count += 1
                else:
                    neutral_reviews_count += 1
        
        # Add counts to the data
        data["positive_reviews_count"] = positive_reviews_count
        data["negative_reviews_count"] = negative_reviews_count
        data["neutral_reviews_count"] = neutral_reviews_count
        
        # Append updated data to the processed list
        processed_data.append(data)

# Write the updated dataset back to the same file
with open(file_path, "w") as outfile:
    for record in processed_data:
        outfile.write(json.dumps(record) + "\n")

# 2. Exploratory Data Analysis

## 1. General Sentiment Analysis
- Distribution of sentiment scores (mean, median, standard deviation, etc.).
- Proportion of positive, neutral, and negative reviews.
- Most common sentiment range (e.g., highly positive, slightly negative).
- Comparison of sentiment scores across product categories or subcategories.
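
The binning into sentiment ranges can be sketched with the standard library alone. The example below assumes the same bin edges and labels as this analysis and uses `bisect.bisect_right`, which assigns a score to the bin whose lower edge it has reached (the same convention as `np.digitize` with its default arguments). The sample scores are made up.

```python
import bisect
import statistics

sentiment_bins = [-1.0, -0.5, -0.1, 0.1, 0.5, 1.0]
bin_labels = ["Highly Negative", "Negative", "Neutral", "Positive", "Highly Positive"]

def bin_label(score):
    """Return the label of the half-open bin [lower, upper) containing score."""
    index = bisect.bisect_right(sentiment_bins, score) - 1
    # Clamp so that a score of exactly 1.0 falls into the last bin.
    index = max(0, min(index, len(bin_labels) - 1))
    return bin_labels[index]

# Hypothetical scores.
scores = [-0.8, -0.3, 0.0, 0.3, 0.9, 1.0]
labels = [bin_label(s) for s in scores]
print(labels)

print(statistics.mean(scores))
print(statistics.median(scores))
print(statistics.pstdev(scores))  # population standard deviation, like np.std
```

Note that a score of exactly 0.1 lands in the "Positive" bin under this convention, while the three-way positive/negative/neutral count treats 0.1 as neutral.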


In [8]:
import json
import numpy as np
from collections import defaultdict
from tabulate import tabulate

# File path to the dataset
file_path = "cleaned_metadata.jsonl"

# Sentiment binning for most common sentiment range
sentiment_bins = [-1.0, -0.5, -0.1, 0.1, 0.5, 1.0]  # Binning the sentiment scores
bin_labels = ["Highly Negative", "Negative", "Neutral", "Positive", "Highly Positive"]

# Variables to store overall sentiment analysis
all_sentiment_scores = []
positive_reviews_count = 0
negative_reviews_count = 0
neutral_reviews_count = 0
range_counts = defaultdict(int)
total_reviews = 0

# Reading and processing the data
with open(file_path, 'r') as f:
    lines = f.readlines()
    for line in lines:
        data = json.loads(line)
        total_reviews += data.get('review_count', 0)  # Add total reviews for each product

        for review in data.get('reviews', []):
            score = review['sentiment_score']
            all_sentiment_scores.append(score)
            
            # Count the positive, negative, and neutral reviews
            if score > 0.1:
                positive_reviews_count += 1
            elif score < -0.1:
                negative_reviews_count += 1
            else:
                neutral_reviews_count += 1

            # Bin sentiment score and categorize the review
            bin_index = np.digitize([score], sentiment_bins)[0] - 1
            bin_index = max(0, min(bin_index, len(bin_labels) - 1))  # Ensure the index is within range
            range_counts[bin_labels[bin_index]] += 1

# 1. Distribution of Sentiment Scores
mean_score = np.mean(all_sentiment_scores)
median_score = np.median(all_sentiment_scores)
std_dev_score = np.std(all_sentiment_scores)

# 2. Proportion of Positive, Neutral, and Negative Reviews
positive_percentage = (positive_reviews_count / total_reviews) * 100
negative_percentage = (negative_reviews_count / total_reviews) * 100
neutral_percentage = (neutral_reviews_count / total_reviews) * 100

# 3. Most Common Sentiment Range (using the bin labels)
most_common_range = max(range_counts, key=range_counts.get)

# Displaying the results in a neat, tabular format
print("\n=== Sentiment Distribution ===")
print(f"Total Reviews: {total_reviews}")
print(f"Mean Sentiment Score: {mean_score:.4f}")
print(f"Median Sentiment Score: {median_score:.4f}")
print(f"Standard Deviation of Sentiment Score: {std_dev_score:.4f}")

print("\n=== Proportion of Reviews ===")
print(f"Positive Reviews: {positive_percentage:.2f}%")
print(f"Negative Reviews: {negative_percentage:.2f}%")
print(f"Neutral Reviews: {neutral_percentage:.2f}%")

print("\n=== Most Common Sentiment Range ===")
print(f"The most common sentiment range is: {most_common_range}")

# Display Sentiment Range Distribution
print("\n=== Sentiment Range Distribution ===")
range_table = []
for label in bin_labels:
    range_table.append([label, range_counts[label]])

print(tabulate(range_table, headers=["Sentiment Range", "Count"], tablefmt="fancy_grid", numalign="center"))


=== Sentiment Distribution ===
Total Reviews: 3279585
Mean Sentiment Score: 0.6847
Median Sentiment Score: 0.7881
Standard Deviation of Sentiment Score: 0.2922

=== Proportion of Reviews ===
Positive Reviews: 94.28%
Negative Reviews: 3.11%
Neutral Reviews: 2.62%

=== Most Common Sentiment Range ===
The most common sentiment range is: Highly Positive

=== Sentiment Range Distribution ===
╒═══════════════════╤═════════╕
│ Sentiment Range   │  Count  │
╞═══════════════════╪═════════╡
│ Highly Negative   │  20871  │
├───────────────────┼─────────┤
│ Negative          │  80982  │
├───────────────────┼─────────┤
│ Neutral           │  85832  │
├───────────────────┼─────────┤
│ Positive          │ 293436  │
├───────────────────┼─────────┤
│ Highly Positive   │ 2798464 │
╘═══════════════════╧═════════╛


In [None]:
import json
import numpy as np
from tabulate import tabulate
from collections import defaultdict

# File path to the large dataset
file_path = 'cleaned_metadata.jsonl'

# Initialize the category sentiment dictionary
category_sentiments = defaultdict(list)

# Function to process data incrementally from a file
def process_large_data(file_path):
    with open(file_path, 'r') as f:
        # Read the file line by line (assuming each line is a separate JSON object)
        for line in f:
            try:
                # Parse each line as a JSON object
                data = json.loads(line)
                
                # Process the sentiment scores for each category in the data
                for review in data.get('reviews', []):
                    sentiment_score = review.get('sentiment_score', 0.0)
                    for category in data.get('categories', []):
                        category_sentiments[category].append(sentiment_score)
            except json.JSONDecodeError:
                # Handle error if the line is not a valid JSON object
                print(f"Skipping invalid line: {line}")

# Function to calculate and display the statistics
def display_statistics():
    category_comparison = []

    # Calculate statistics for each category
    for category, sentiments in category_sentiments.items():
        mean_cat_score = np.mean(sentiments)
        median_cat_score = np.median(sentiments)
        positive_count = sum(1 for score in sentiments if score > 0.1)
        negative_count = sum(1 for score in sentiments if score < -0.1)
        neutral_count = sum(1 for score in sentiments if -0.1 <= score <= 0.1)
        total_count = len(sentiments)

        positive_percentage = (positive_count / total_count) * 100
        negative_percentage = (negative_count / total_count) * 100
        neutral_percentage = (neutral_count / total_count) * 100
        
        category_comparison.append({
            'Category': category,
            'Mean Sentiment': f"{mean_cat_score:.2f}",
            'Median Sentiment': f"{median_cat_score:.2f}",
            'Positive Reviews (%)': positive_percentage,
            'Negative Reviews (%)': negative_percentage,
            'Neutral Reviews (%)': neutral_percentage,
        })

    # Sort and display the top 5 categories with highest positive and negative reviews
    top_positive_categories = sorted(
        [entry for entry in category_comparison if entry['Positive Reviews (%)'] > 0],
        key=lambda x: x['Positive Reviews (%)'], reverse=True
    )[:5]

    top_negative_categories = sorted(
        [entry for entry in category_comparison if entry['Negative Reviews (%)'] > 0],
        key=lambda x: x['Negative Reviews (%)'], reverse=True
    )[:5]

    # Display results without truncation
    print("\n=== Category Comparison (Top 5 Categories with Highest Positive Reviews) ===")
    headers = ['Category', 'Mean Sentiment', 'Median Sentiment', 'Positive Reviews (%)', 'Negative Reviews (%)', 'Neutral Reviews (%)']
    positive_category_table = [list(item.values()) for item in top_positive_categories]
    print(tabulate(positive_category_table, headers=headers, tablefmt="plain", numalign="center"))

    print("\n=== Category Comparison (Top 5 Categories with Highest Negative Reviews) ===")
    negative_category_table = [list(item.values()) for item in top_negative_categories]
    print(tabulate(negative_category_table, headers=headers, tablefmt="plain", numalign="center"))

# Process the data incrementally
process_large_data(file_path)

# Once the data is processed, display the results
display_statistics()


=== Category Comparison (Top 5 Categories with Highest Positive Reviews) ===
Category        Mean Sentiment    Median Sentiment    Positive Reviews (%)    Negative Reviews (%)    Neutral Reviews (%)
Voluntaries          0.7                0.73                  100                      0                       0
Grounds              0.81               0.86                  100                      0                       0
Scottish Folk        0.73               0.81                  100                      0                       0
Zouk                 0.76               0.81                  100                      0                       0
Nicaragua            0.71               0.73                  100                      0                       0

=== Category Comparison (Top 5 Categories with Highest Negative Reviews) ===
Category              Mean Sentiment    Median Sentiment    Positive Reviews (%)    Negative Reviews (%)    Neutral Reviews (%)
Tierra Caliente            0.

## 2. Rating Analysis
- Distribution of ratings (1-5).
- Frequency of each rating score (1-5).
- Proportion of reviews with the highest and lowest ratings.
- Average sentiment score for each rating level.
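
The per-rating aggregation needs only two running totals per rating level, so it works in a single pass. The sketch below uses a handful of made-up reviews and a hypothetical `summary` dictionary; it iterates over the ratings in sorted order so the report reads from 1 to 5.

```python
from collections import defaultdict

# Hypothetical sample reviews.
reviews = [
    {"rating": 5.0, "sentiment_score": 0.75},
    {"rating": 5.0, "sentiment_score": 0.25},
    {"rating": 1.0, "sentiment_score": -0.5},
    {"rating": 3.0, "sentiment_score": None},  # skipped: no score
]

rating_counts = defaultdict(int)
sentiment_sum = defaultdict(float)

for review in reviews:
    rating = review.get("rating")
    score = review.get("sentiment_score")
    if rating is not None and score is not None:
        rating_counts[rating] += 1
        sentiment_sum[rating] += score

total = sum(rating_counts.values())
summary = {}
for rating in sorted(rating_counts):
    count = rating_counts[rating]
    share = count / total * 100
    average = sentiment_sum[rating] / count
    summary[rating] = (count, share, average)
    print(f"Rating {rating}: {count} reviews ({share:.2f}%), average sentiment {average:.4f}")
```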

In [14]:
import json
import pandas as pd
from collections import defaultdict

# Create a dictionary to store counts and sums for the analysis
rating_counts = defaultdict(int)
sentiment_sum = defaultdict(float)

# File path to the JSONL file
file_path = "cleaned_metadata.jsonl"

# Read the file line by line to process it efficiently
with open(file_path, 'r') as file:
    for line in file:
        # Parse each line of the JSONL file
        data = json.loads(line)
        
        # Extract reviews data from each entry
        reviews = data.get('reviews', [])
        
        for review in reviews:
            rating = review.get('rating')
            sentiment_score = review.get('sentiment_score')
            
            # Update rating counts and sentiment sums
            if rating is not None and sentiment_score is not None:
                rating_counts[rating] += 1
                sentiment_sum[rating] += sentiment_score

# 1. Distribution of Ratings (1-5)
print("=== Distribution of Ratings (1-5) ===")
for rating, count in rating_counts.items():
    print(f"Rating {rating}: {count} reviews")

# 2. Frequency of Each Rating Score (1-5)
print("\n=== Frequency of Each Rating Score (1-5) ===")
for rating, count in rating_counts.items():
    print(f"Rating {rating}: {count} reviews")

# 3. Proportion of Reviews with Highest and Lowest Ratings
highest_rating_count = rating_counts.get(5.0, 0)
lowest_rating_count = rating_counts.get(1.0, 0)
total_reviews = sum(rating_counts.values())

highest_rating_proportion = (highest_rating_count / total_reviews) * 100
lowest_rating_proportion = (lowest_rating_count / total_reviews) * 100

print(f"\n=== Proportion of Reviews with Highest and Lowest Ratings ===")
print(f"Proportion of Reviews with Highest Rating (5): {highest_rating_proportion:.2f}%")
print(f"Proportion of Reviews with Lowest Rating (1): {lowest_rating_proportion:.2f}%")

# 4. Average Sentiment Score for Each Rating Level
print("\n=== Average Sentiment Score for Each Rating Level ===")
for rating in rating_counts:
    avg_sentiment = sentiment_sum[rating] / rating_counts[rating]
    print(f"Rating {rating}: {avg_sentiment:.4f}")


=== Distribution of Ratings (1-5) ===
Rating 1.0: 106123 reviews
Rating 5.0: 2554415 reviews
Rating 4.0: 376301 reviews
Rating 3.0: 165893 reviews
Rating 2.0: 76853 reviews

=== Frequency of Each Rating Score (1-5) ===
Rating 1.0: 106123 reviews
Rating 5.0: 2554415 reviews
Rating 4.0: 376301 reviews
Rating 3.0: 165893 reviews
Rating 2.0: 76853 reviews

=== Proportion of Reviews with Highest and Lowest Ratings ===
Proportion of Reviews with Highest Rating (5): 77.89%
Proportion of Reviews with Lowest Rating (1): 3.24%

=== Average Sentiment Score for Each Rating Level ===
Rating 1.0: -0.2435
Rating 5.0: 0.7811
Rating 4.0: 0.5845
Rating 3.0: 0.3174
Rating 2.0: 0.0450


## 3. Review Text Analysis
- Average, median, and maximum word count across all reviews, counted after stopword removal.
- Most common words in positive and negative reviews.


In [21]:
import json
import nltk
from collections import Counter
from nltk.corpus import stopwords

# Download NLTK stopwords if not already installed
nltk.download('stopwords')

# Set of stopwords from NLTK
stopwords_set = set(stopwords.words('english'))

# Add custom stopwords relevant to your dataset (optional)
custom_stopwords = {"album", "music", "cd", "track", "song", "The", "this", 
                    "that", "it", "is", "was", "and", "in", "on", "for", "with", 
                    "by", "to", "of", "a", "an", "i", "you", "we", "they", "I", 
                    "the", "It", "This", "this", "CD", "would", "Would", "one", "songs", 
                    "-", "it.", "get", "record", "first", "A", "sound", "even", 
                    "vinyl", "case", "if", "If", "time", "play", "much", "bought", 
                    "got", "original", "album.", "new", "I'm", "music.", "CD.", "two", 
                    "buy", "Very", "2", "disc", "still", "thought", "listen", "really",
                    "many", "back", "could", "came", "received", "sounds", "know",
                    "tracks", "recording", "heard", "It's", "return", "band", "well", 
                    "My", "listening" ,"ordered", "&", "also", "There", "think"}
stopwords_set.update(custom_stopwords)

# Initialize counters and lists
word_counts = []
positive_reviews = []
negative_reviews = []

# File path to the JSONL file
file_path = "cleaned_metadata.jsonl"

# Read the file line by line to process it efficiently
with open(file_path, 'r') as file:
    for line in file:
        # Parse each line of the JSONL file
        data = json.loads(line)
        
        # Extract reviews data from each entry
        reviews = data.get('reviews', [])
        
        for review in reviews:
            text = review.get('text', '')
            sentiment_score = review.get('sentiment_score')
            
            # Clean the review text and filter out stopwords (case-insensitive)
            tokens = [word for word in text.lower().split() if word not in stopwords_set]
            
            # Calculate word count after filtering stopwords
            word_count = len(tokens)
            word_counts.append(word_count)
            
            # Categorize reviews based on sentiment score (skip reviews without a score)
            if sentiment_score is None:
                continue
            if sentiment_score > 0:
                positive_reviews.append(text)
            elif sentiment_score < 0:
                negative_reviews.append(text)

# 1. Average, Median, and Maximum Word Count Across All Reviews
avg_word_count = sum(word_counts) / len(word_counts)
median_word_count = sorted(word_counts)[len(word_counts) // 2]
max_word_count = max(word_counts)

print("=== Review Text Analysis ===")
print(f"Average Word Count: {avg_word_count:.2f}")
print(f"Median Word Count: {median_word_count}")
print(f"Maximum Word Count: {max_word_count}")

# 2. Most Common Words in Positive and Negative Reviews
def most_common_words(reviews):
    all_words = [word for review in reviews for word in review.split() if word not in stopwords_set]
    word_count = Counter(all_words)
    return word_count.most_common(10)

positive_common_words = most_common_words(positive_reviews)
negative_common_words = most_common_words(negative_reviews)

print("\nMost Common Words in Positive Reviews:")
for word, count in positive_common_words:
    print(f"{word}: {count}")

print("\nMost Common Words in Negative Reviews:")
for word, count in negative_common_words:
    print(f"{word}: {count}")



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mitan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


=== Review Text Analysis ===
Average Word Count: 18.22
Median Word Count: 8
Maximum Word Count: 4249

Most Common Words in Positive Reviews:
great: 564076
like: 495541
love: 421432
good: 377445
Great: 301923
best: 233219
Love: 189600
favorite: 141014
fan: 124158
never: 123481

Most Common Words in Negative Reviews:
like: 18438
quality: 9334
Not: 9160
bad: 8091
disappointed: 8028
good: 7664
never: 4913
poor: 4408
disappointed.: 4239
version: 3965
