# 1. Cleaning the Data

## Removing unnecessary fields
### Description
This Python script reads a large JSON Lines (JSONL) file line by line, removes the specified fields from each entry, and writes the cleaned entries to a new JSONL file. Because entries are processed one at a time, the whole file never has to fit in memory, which makes the script suitable for datasets containing millions of records.

### Features
- Field Removal: Removes non-essential fields (images, videos, details, features, bought_together, and description) from each entry.
- Memory Efficiency: Processes each line independently without loading the entire file into memory.
- Scalability: Handles datasets with millions of entries thanks to its line-by-line processing approach.
- Preserves Original Structure: Leaves the remaining fields unchanged, so the dataset is ready for subsequent analysis.
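
As a small illustration of the line-by-line approach described above, the sketch below applies the same idea to in-memory streams instead of files. The helper name `strip_fields` and the sample records are hypothetical and exist only for this demonstration.

```python
import io
import json

def strip_fields(infile, outfile, fields):
    """Copy JSONL records from infile to outfile, dropping the given keys.

    Returns the number of records written. Only one record is held in
    memory at a time.
    """
    written = 0
    for line in infile:
        line = line.strip()
        if not line:  # skip blank lines instead of failing on them
            continue
        entry = json.loads(line)
        for field in fields:
            entry.pop(field, None)  # no error if the key is absent
        outfile.write(json.dumps(entry) + "\n")
        written += 1
    return written

# Hypothetical sample data standing in for the real metadata file
source = io.StringIO(
    '{"title": "Lamp", "images": ["a.jpg"], "price": 12.5}\n'
    "\n"
    '{"title": "Desk", "videos": []}\n'
)
target = io.StringIO()
count = strip_fields(source, target, ["images", "videos"])
cleaned = [json.loads(row) for row in target.getvalue().splitlines()]
```

Passing file objects rather than paths keeps the helper easy to test; the real script can call it with the two handles returned by `open`.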


In [12]:
import json

# Input and output file paths
input_file = "updated_metadata.jsonl"
output_file = "cleaned_metadata.jsonl"

# Fields to remove
fields_to_remove = ["images", "videos", "details", "features", "bought_together", "description"]

# Process the file
with open(input_file, "r", encoding="utf-8") as infile, open(output_file, "w", encoding="utf-8") as outfile:
    for line in infile:
        entry = json.loads(line.strip())  # Load the JSON object
        # Remove the specified fields
        for field in fields_to_remove:
            entry.pop(field, None)
        # Write the cleaned entry to the output file
        outfile.write(json.dumps(entry) + "\n")

print(f"Cleaned data written to {output_file}")


Cleaned data written to cleaned_metadata.jsonl


## Cleaning reviews
Objective:
The purpose of this step is to preprocess the cleaned_metadata.jsonl file by removing unnecessary subfields from the reviews field. This process ensures that only the relevant information for sentiment analysis is retained in the reviews data, making it cleaner and more focused for downstream processing.

Description:
In this step, we clean the reviews field within each entry of the cleaned_metadata.jsonl file. Specifically, we remove the following subfields from each review:

- parent_asin
- user_id
- asin
- helpful_vote
- images

These fields do not help evaluate the tone or opinion expressed in a review, so they are irrelevant for sentiment analysis. Removing them reduces noise in the dataset and simplifies its structure, leaving the fields that matter for the analysis, such as rating, title, text, and timestamp. Unlike the previous step, this script reads the whole file into memory and then overwrites it in place.
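
To make the effect of this step concrete, here is a minimal sketch that cleans the reviews of a single entry. The function name `clean_reviews` and the sample entry are hypothetical; unlike the notebook cell, this version returns a new dictionary rather than modifying the entry.

```python
REVIEW_FIELDS_TO_REMOVE = {"parent_asin", "user_id", "asin", "helpful_vote", "images"}

def clean_reviews(entry, fields=REVIEW_FIELDS_TO_REMOVE):
    """Return a copy of entry whose reviews lack the given subfields."""
    reviews = entry.get("reviews")
    if not isinstance(reviews, list):
        return dict(entry)  # nothing to clean
    cleaned = [{k: v for k, v in r.items() if k not in fields} for r in reviews]
    return {**entry, "reviews": cleaned}

# Hypothetical sample entry with a single review
sample = {
    "title": "Lamp",
    "reviews": [
        {"rating": 5.0, "title": "Great", "text": "Works well", "asin": "X1",
         "user_id": "U1", "helpful_vote": 3, "timestamp": 1700000000000,
         "verified_purchase": True},
    ],
}
result = clean_reviews(sample)
```

After the call, the review in `result` keeps only rating, title, text, timestamp, and verified_purchase, while `sample` itself is left untouched.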

In [13]:
import json

# Input and output file paths (same file for input and output)
file_path = "cleaned_metadata.jsonl"

# Fields to remove within the reviews
review_fields_to_remove = ["parent_asin", "user_id", "asin", "helpful_vote", "images"]

# Process the file
with open(file_path, "r", encoding="utf-8") as infile:
    lines = infile.readlines()  # Read all lines into memory

# Modify the data
modified_lines = []
for line in lines:
    entry = json.loads(line.strip())  # Load the JSON object
    
    # Clean the reviews field
    if "reviews" in entry and isinstance(entry["reviews"], list):
        cleaned_reviews = []
        for review in entry["reviews"]:
            # Remove specified fields in each review
            cleaned_review = {k: v for k, v in review.items() if k not in review_fields_to_remove}
            cleaned_reviews.append(cleaned_review)
        entry["reviews"] = cleaned_reviews

    # Prepare the modified entry for output
    modified_lines.append(json.dumps(entry))

# Overwrite the file with the modified data
with open(file_path, "w", encoding="utf-8") as outfile:
    for modified_line in modified_lines:
        outfile.write(modified_line + "\n")

print(f"Reviews cleaned and file {file_path} updated.")

Reviews cleaned and file cleaned_metadata.jsonl updated.
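
The cell above reads the whole file into memory before overwriting it. For larger files, one possible alternative (not what this notebook does) is to stream the records into a temporary file and then swap it in with `os.replace`. The helper name `rewrite_jsonl` and the demo data below are hypothetical.

```python
import json
import os
import tempfile

def rewrite_jsonl(path, transform):
    """Apply transform to every record of a JSONL file, one line at a time.

    The result is written to a temporary file in the same directory and
    then moved over the original, so the original stays intact if the
    processing fails part-way.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp, \
                open(path, "r", encoding="utf-8") as src:
            for line in src:
                if line.strip():
                    tmp.write(json.dumps(transform(json.loads(line))) + "\n")
        os.replace(tmp_path, path)  # swap the new file in for the original
    except BaseException:
        os.remove(tmp_path)  # clean up the partial temporary file
        raise

# Demonstration on a throwaway file (hypothetical sample data)
with tempfile.TemporaryDirectory() as workdir:
    demo_path = os.path.join(workdir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write('{"id": 1, "user_id": "U1"}\n{"id": 2}\n')

    def drop_user_id(entry):
        entry.pop("user_id", None)
        return entry

    rewrite_jsonl(demo_path, drop_user_id)
    with open(demo_path, "r", encoding="utf-8") as f:
        demo_result = [json.loads(row) for row in f]
```

The trade-off is extra disk space for the temporary copy in exchange for constant memory use and a file that is never left half-written.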


Next, we remove reviews that are not verified purchases: only reviews whose verified_purchase field is True are kept.
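
A minimal sketch of the filter on hypothetical sample reviews. Note that the `is True` comparison used in this step also drops reviews where the flag is missing or stored as a string.

```python
def keep_verified(reviews):
    """Keep only reviews whose verified_purchase value is exactly True."""
    return [r for r in reviews if r.get("verified_purchase", False) is True]

# Hypothetical sample reviews covering the main cases
sample_reviews = [
    {"text": "Loved it", "verified_purchase": True},
    {"text": "Never bought it", "verified_purchase": False},
    {"text": "Flag missing"},
    {"text": "Flag stored as text", "verified_purchase": "true"},
]
kept = keep_verified(sample_reviews)
kept_texts = [r["text"] for r in kept]
```

Only the first review survives the filter.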

In [14]:
import json

# Define the file path
file_path = 'cleaned_metadata.jsonl'

# Read the file, filter the reviews, and write back to the same file
with open(file_path, 'r+') as file:
    lines = file.readlines()  # Read all lines into a list
    file.seek(0)  # Move the file pointer to the beginning
    file.truncate()  # Clear the file content

    # Process each line and filter the reviews
    for line in lines:
        data = json.loads(line)  # Parse the JSON data
        
        # Check if 'reviews' exists, then filter the reviews based on 'verified_purchase'
        if 'reviews' in data:
            # Only keep reviews where 'verified_purchase' is True
            data['reviews'] = [review for review in data['reviews'] if review.get('verified_purchase', False) is True]

        # Write the cleaned data back to the same file
        json.dump(data, file)
        file.write('\n')

print(f"File {file_path} has been updated with filtered reviews.")


File cleaned_metadata.jsonl has been updated with filtered reviews.


## Further data cleaning and feature engineering

Convert review timestamps to a readable date format and remove HTML tags from the review text.
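
The two transformations can be tried in isolation with the sketch below. It assumes, as the notebook code does, that timestamps are Unix times in milliseconds. To stay dependency-free it swaps in the standard library's `html.parser` for BeautifulSoup, so it is an illustration rather than the notebook's method; the names `to_utc_string` and `strip_tags` are hypothetical.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser

def to_utc_string(timestamp_ms):
    """Format a Unix timestamp given in milliseconds as a UTC date string."""
    moment = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")

class _TextCollector(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(text):
    """Return text with HTML tags removed (standard-library stand-in)."""
    collector = _TextCollector()
    collector.feed(text)
    collector.close()
    return "".join(collector.parts)

date_text = to_utc_string(1700000000000)   # "2023-11-14 22:13:20"
plain_text = strip_tags("Works <b>well</b>, arrived <i>fast</i>")
```

Here `plain_text` becomes "Works well, arrived fast".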

In [15]:
import json
from datetime import datetime, timezone
from bs4 import BeautifulSoup

# Read the cleaned metadata, one JSON object per line
with open("cleaned_metadata.jsonl", "r") as file:
    cleaned_data = [json.loads(line) for line in file]

# Function to convert a Unix timestamp in milliseconds to a readable UTC date
def convert_timestamp(timestamp):
    # Convert milliseconds to seconds; use a timezone-aware call because
    # datetime.utcfromtimestamp is deprecated since Python 3.12
    return datetime.fromtimestamp(timestamp / 1000, tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S')

# Function to remove HTML tags from the review text
def remove_html_tags(text):
    return BeautifulSoup(text, "html.parser").get_text()

# Apply the transformations to the data
for entry in cleaned_data:
    # Convert timestamp for each review
    for review in entry.get('reviews', []):
        review['timestamp'] = convert_timestamp(review['timestamp'])
        # Remove HTML tags in the review text
        review['text'] = remove_html_tags(review['text'])

# Save the updated data back to cleaned_metadata.jsonl
with open("cleaned_metadata.jsonl", "w") as file:
    for entry in cleaned_data:
        file.write(json.dumps(entry) + "\n")

print("Timestamp converted and HTML tags removed successfully.")



Timestamp converted and HTML tags removed successfully.


## Feature engineering

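Adding new fields: review_count (the number of reviews per product) and price_range, which is low for prices below 10, medium for prices from 10 up to but not including 30, high for prices of 30 or more, and unknown when the price is missing or non-numeric.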
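
The price buckets used in the cell below can be checked on a few boundary values. This sketch restates the same thresholds as a standalone function; the name `price_range` and the example prices are hypothetical.

```python
def price_range(price):
    """Bucket a price as 'low', 'medium', 'high', or 'unknown'."""
    if price is None or price == "":
        return "unknown"
    try:
        value = float(price)
    except (ValueError, TypeError):
        return "unknown"  # non-numeric text or an unexpected type
    if value < 10:
        return "low"
    if value < 30:
        return "medium"
    return "high"

# Boundary values and missing or malformed prices
examples = [9.99, "10", 29.99, 30, None, "", "N/A"]
labels = [price_range(p) for p in examples]
```

The lower bound of each bucket is inclusive, so a price of exactly 10 is medium and a price of exactly 30 is high.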

In [16]:
import json

# Function to add Review Count feature
def add_review_count(data):
    for entry in data:
        entry['review_count'] = len(entry.get('reviews', []))  # Count reviews for each product
    return data

# Function to add Price Range feature, handling null prices and converting to numeric
def add_price_range(data):
    for entry in data:
        price = entry.get('price', None)  # Get the price, default to None if not present
        
        # If the price is None or empty, set to 'unknown'
        if price is None or price == '':
            entry['price_range'] = 'unknown'  # Set as 'unknown' if price is missing or empty
        else:
            try:
                # Convert price to float to ensure proper comparison
                price = float(price)
                if price < 10:
                    entry['price_range'] = 'low'
                elif 10 <= price < 30:
                    entry['price_range'] = 'medium'
                else:
                    entry['price_range'] = 'high'
            except (ValueError, TypeError):
                # If conversion fails (e.g., non-numeric text or an unexpected type), set it as 'unknown'
                entry['price_range'] = 'unknown'
    
    return data

# Load the cleaned data from the cleaned_metadata.jsonl file
with open("cleaned_metadata.jsonl", "r") as file:
    cleaned_data = [json.loads(line) for line in file]

# Apply Review Count and Price Range feature engineering
cleaned_data = add_review_count(cleaned_data)
cleaned_data = add_price_range(cleaned_data)

# Save the updated data with the new features to the same file
with open("cleaned_metadata.jsonl", "w") as file:
    for entry in cleaned_data:
        file.write(json.dumps(entry) + "\n")

print("Review Count and Price Range features added successfully, handling null and non-numeric prices.")

Review Count and Price Range features added successfully, handling null and non-numeric prices.


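Adding sentiment scores on a scale from -1 (negative) to 1 (positive). The review text is preprocessed first: lowercasing, removing special characters, tokenization, stopword removal, and stemming followed by lemmatization. The VADER compound score of the preprocessed text is then averaged with a score derived from the star rating.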

In [17]:
import json
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Download NLTK resources (run once)
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize NLP tools
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()
analyzer = SentimentIntensityAnalyzer()

# Function to preprocess text
def preprocess_text(text):
    # Lowercase
    text = text.lower()
    # Remove special characters and digits (keep only letters and whitespace)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenization
    tokens = word_tokenize(text)
    # Remove stopwords, then stem and lemmatize each remaining token
    tokens = [lemmatizer.lemmatize(ps.stem(word)) for word in tokens if word not in stop_words]
    return ' '.join(tokens)

# Function to get sentiment score (VADER)
def get_sentiment_score(text):
    sentiment = analyzer.polarity_scores(text)
    return sentiment['compound']  # Compound score ranges from -1 (negative) to 1 (positive)

# Function to adjust sentiment score based on rating
def adjust_sentiment_based_on_rating(rating, text_sentiment):
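    # Map the rating onto the sentiment scale with 2 * (rating / 5) - 1.
    # Note the asymmetry: a rating of 5 maps to 1.0, a rating of 1 maps to -0.6
    # (not -1.0), and a missing rating (default 0) maps to -1.0.
    rating_sentiment = 2 * (rating / 5) - 1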
    
    # Combine text sentiment and rating sentiment (average them)
    # Adjust the weights depending on how much importance you want to give to the rating vs text
    combined_sentiment = (text_sentiment + rating_sentiment) / 2
    return combined_sentiment

# File path
file_path = 'cleaned_metadata.jsonl'

# Read the file, preprocess the text, and add sentiment scores
with open(file_path, 'r+') as file:
    lines = file.readlines()  # Read all lines into a list
    file.seek(0)  # Move the file pointer to the beginning
    file.truncate()  # Clear the file content

    # Process each line
    for line in lines:
        data = json.loads(line)  # Parse the JSON data

        # Check if 'reviews' exists and process each review
        if 'reviews' in data:
            for review in data['reviews']:
                # Preprocess the review text
                cleaned_text = preprocess_text(review.get('text', ''))
                # Get sentiment score for the review text
                text_sentiment = get_sentiment_score(cleaned_text)
                # Adjust sentiment score based on the rating
                final_sentiment = adjust_sentiment_based_on_rating(review.get('rating', 0), text_sentiment)
                # Add the final sentiment score to the review
                review['sentiment_score'] = final_sentiment

        # Write the updated data back to the same file
        json.dump(data, file)
        file.write('\n')

print(f"File {file_path} has been updated with sentiment scores.")


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mitan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mitan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mitan\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


File cleaned_metadata.jsonl has been updated with sentiment scores.
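
The rating adjustment in the cell above is easier to judge with a few numbers. The sketch below evaluates the formula from the notebook for each star rating and, for comparison only, a hypothetical alternative centred on 3 stars. The function names are illustrative, and the alternative is not used anywhere in this notebook.

```python
import math

def rating_to_sentiment(rating):
    """Formula used in the notebook: 2 * (rating / 5) - 1."""
    return 2 * (rating / 5) - 1

def centred_rating_to_sentiment(rating):
    """Hypothetical alternative: centre on 3 stars so 1 -> -1 and 5 -> 1."""
    return (rating - 3) / 2

def combine(text_sentiment, rating_sentiment):
    """Equal-weight average of the text score and the rating score."""
    return (text_sentiment + rating_sentiment) / 2

# One row per star rating: (stars, notebook formula, centred alternative)
table = [
    (stars, round(rating_to_sentiment(stars), 2), centred_rating_to_sentiment(stars))
    for stars in (1, 2, 3, 4, 5)
]

# A 3-star review whose text scores 0.0 still ends up at about 0.1
neutral_three_star = combine(0.0, rating_to_sentiment(3))
```

With the notebook's formula the ratings 1 to 5 map to -0.6, -0.2, 0.2, 0.6, and 1.0, so the rating component leans positive; the centred variant would map them to -1.0, -0.5, 0.0, 0.5, and 1.0.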


Adding a word_count to each review and an average_sentiment_score to each product (0.0 when a product has no reviews).
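
A minimal sketch of both computations on a hypothetical product. Unlike the cell below, it falls back to defaults when a review lacks text or sentiment_score; that fallback is an assumption added here for robustness.

```python
def add_review_stats(product):
    """Add word_count per review and average_sentiment_score per product."""
    reviews = product.get("reviews", [])
    for review in reviews:
        # split() without arguments collapses runs of whitespace
        review["word_count"] = len(review.get("text", "").split())
    scores = [review.get("sentiment_score", 0.0) for review in reviews]
    product["average_sentiment_score"] = sum(scores) / len(scores) if scores else 0.0
    return product

# Hypothetical sample product with two reviews
sample_product = add_review_stats({
    "reviews": [
        {"text": "Works really well", "sentiment_score": 0.5},
        {"text": "Broke  after a week", "sentiment_score": -0.25},
    ]
})
empty_product = add_review_stats({"title": "No reviews yet"})
```

The two reviews get word counts of 3 and 4, and the product average is 0.125; the product without reviews gets 0.0.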

In [19]:
import json

# Path to the JSONL file
file_path = "cleaned_metadata.jsonl"

# Read all lines into memory, then rewrite the file in place
with open(file_path, "r+") as file:
    lines = file.readlines()
    file.seek(0)  # Reset pointer to the start of the file

    for line in lines:
        data = json.loads(line)  # Parse JSON data
        
        # Add word_count for each review
        if "reviews" in data:
            for review in data["reviews"]:
                review["word_count"] = len(review["text"].split())
        
            # Compute average_sentiment_score
            total_sentiment = sum(review["sentiment_score"] for review in data["reviews"])
            data["average_sentiment_score"] = total_sentiment / len(data["reviews"]) if data["reviews"] else 0.0
        else:
            data["average_sentiment_score"] = 0.0  # Default if no reviews
        
        # Write updated JSON data back to file
        file.write(json.dumps(data) + "\n")
    
    file.truncate()  # Remove any leftover data from previous content

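Counting positive, negative, and neutral reviews per product. A review counts as positive if its sentiment_score is above 0.1, negative if it is below -0.1, and neutral otherwise.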
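
The thresholds used in the cell below can be sketched as a small labelling function. The names `sentiment_label` and `count_labels` and the sample scores are hypothetical.

```python
from collections import Counter

def sentiment_label(score, threshold=0.1):
    """Label a score as positive, negative, or neutral."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

def count_labels(reviews):
    """Count reviews per sentiment label; a missing score counts as 0."""
    counts = Counter(sentiment_label(r.get("sentiment_score", 0)) for r in reviews)
    return {label: counts.get(label, 0) for label in ("positive", "negative", "neutral")}

# Hypothetical scores, including the two boundary values
sample_reviews = [{"sentiment_score": s} for s in (0.8, 0.1, -0.1, -0.4, 0.0)]
label_counts = count_labels(sample_reviews)
```

Both comparisons are strict, so scores of exactly 0.1 and -0.1 fall into the neutral bucket.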

In [20]:
import json

# Path to your JSONL file
file_path = "cleaned_metadata.jsonl"

# Temporary list to hold processed data
processed_data = []

# Read the JSONL file line by line
with open(file_path, "r") as infile:
    for line in infile:
        # Parse each line as a JSON object
        data = json.loads(line)
        
        # Initialize counts
        positive_reviews_count = 0
        negative_reviews_count = 0
        neutral_reviews_count = 0
        
        # Process reviews and count sentiment categories
        if "reviews" in data:
            for review in data["reviews"]:
                score = review.get("sentiment_score", 0)
                if score > 0.1:
                    positive_reviews_count += 1
                elif score < -0.1:
                    negative_reviews_count += 1
                else:
                    neutral_reviews_count += 1
        
        # Add counts to the data
        data["positive_reviews_count"] = positive_reviews_count
        data["negative_reviews_count"] = negative_reviews_count
        data["neutral_reviews_count"] = neutral_reviews_count
        
        # Append updated data to the processed list
        processed_data.append(data)

# Write the updated dataset back to the same file
with open(file_path, "w") as outfile:
    for record in processed_data:
        outfile.write(json.dumps(record) + "\n")

# 2. Exploratory Data Analysis