## Data Extraction for the Project
This notebook extracts the data used for Paper 3, titled 'Most popular Topics in Positive and Negative Sentiments in Amazon Movies and TV Reviews Dataset'.
#### Author: Rishikesh Kakde

#### Import Required Libraries

In [9]:
import json
import pandas as pd
from empath import Empath

import warnings
warnings.filterwarnings('ignore')

In [10]:
file_path = 'Movies_and_TV.jsonl'

In [17]:
# Count the number of lines in the file
with open(file_path, 'r', encoding='utf-8') as f:
    total_entries = sum(1 for line in f)

print(f"Total number of entries: {total_entries}")

Total number of entries: 17328314


In [11]:
# Initialize lists to hold the analysis and training datasets
analysis_data = []
training_data = []

# Open the JSONL file and process only the first 3500 rows
with open(file_path, 'r', encoding='utf-8') as file:
    for i, line in enumerate(file):
        if i >= 3500:  # Stop before parsing anything past the first 3500 rows
            break
        data = json.loads(line)
        if i < 1500:  # First 1500 rows for analysis dataset
            analysis_data.append(data)
        else:  # Next 2000 rows for training dataset
            training_data.append(data)

# Print the size of each dataset
print(f"Analysis dataset size: {len(analysis_data)}")
print(f"Training dataset size: {len(training_data)}")

Analysis dataset size: 1500
Training dataset size: 2000
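
The same split can also be written with `itertools.islice`, which stops reading after the requested number of lines instead of checking the row index on every iteration. This is a minimal sketch rather than the notebook's own code: `read_jsonl_head` is a hypothetical helper name, and an in-memory `io.StringIO` with made-up records stands in for `Movies_and_TV.jsonl` so the example runs on its own.

```python
import io
import json
from itertools import islice


def read_jsonl_head(handle, n_rows):
    """Parse at most n_rows JSON objects from an open JSONL file handle."""
    return [json.loads(line) for line in islice(handle, n_rows)]


# Stand-in for the real file: ten tiny JSONL records (illustrative data only).
fake_file = io.StringIO("".join(json.dumps({"row": i}) + "\n" for i in range(10)))

# Scaled-down version of the 1500 / 2000 split: first 3 rows, then the next 4.
# The second call continues where the first one stopped.
analysis_rows = read_jsonl_head(fake_file, 3)
training_rows = read_jsonl_head(fake_file, 4)

print(len(analysis_rows), len(training_rows))
print(training_rows[0])
```

With the real file, the two calls would be `read_jsonl_head(file, 1500)` followed by `read_jsonl_head(file, 2000)` inside the same `with open(...)` block.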


#### Convert the Analysis Dataset to a DataFrame and Export as CSV

In [12]:
# Convert the analysis dataset to a pandas DataFrame
analysis_df = pd.DataFrame(analysis_data)

# Export the DataFrame to a CSV file
analysis_df.to_csv('analysis_dataset.csv', index=False)

# Confirm the file has been saved
print("Analysis dataset saved as 'analysis_dataset.csv'.")

Analysis dataset saved as 'analysis_dataset.csv'.


#### Convert the Training Dataset to a DataFrame

In [13]:
training_df = pd.DataFrame(training_data)

# Display the first few rows to identify unnecessary columns
print("Columns in training dataset:", training_df.columns)
training_df.head()

Columns in training dataset: Index(['rating', 'title', 'text', 'images', 'asin', 'parent_asin', 'user_id',
       'timestamp', 'helpful_vote', 'verified_purchase'],
      dtype='object')


| | rating | title | text | images | asin | parent_asin | user_id | timestamp | helpful_vote | verified_purchase |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | One Star | The Kids wanted it and watched. I thought it ... | [] | B006QKJQ12 | B006QKJQ12 | AHRM6YFXTOLMHK2MZZ7D2W2W2Q6Q | 1453930439000 | 0 | True |
| 1 | 3.0 | Sharknado | My 7 year old grandson had me watch this. It ... | [] | B00EY74PJM | B00EY74PJM | AHRM6YFXTOLMHK2MZZ7D2W2W2Q6Q | 1402404892000 | 1 | False |
| 2 | 5.0 | Flight | Very good movie and enjoyed it very much. I l... | [] | B00B0LPWU6 | B00B0LPWU6 | AHRM6YFXTOLMHK2MZZ7D2W2W2Q6Q | 1398870334000 | 0 | False |
| 3 | 5.0 | Downton Abbey Season 1 | Great series, loved it. Very easy to get righ... | [] | B004KAJLNS | B004KAJLNS | AHRM6YFXTOLMHK2MZZ7D2W2W2Q6Q | 1397309345000 | 0 | False |
| 4 | 5.0 | Excellent Movie!!! | This film has become my favorite movie. Denzel... | [] | B00UGQAB4S | B00UGQAB4S | AHU2Y2ZFQKI3V3ARFDKZA6ER4NUQ | 1566075686049 | 2 | True |


#### Drop Unwanted Columns from the Training Dataset

In [14]:
# Specify the columns to drop
columns_to_drop = ['title', 'images', 'asin', 'parent_asin', 'user_id', 'helpful_vote', 'verified_purchase']

# Drop the specified columns
training_df = training_df.drop(columns=columns_to_drop, errors='ignore')

# Display the remaining columns
print("Columns after dropping unnecessary ones:", training_df.columns)

Columns after dropping unnecessary ones: Index(['rating', 'text', 'timestamp'], dtype='object')
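
The retained `timestamp` values in the preview (for example `1453930439000`) look like Unix epoch times in milliseconds. Assuming that reading is correct, the sketch below shows how one such value maps to a calendar date using only the standard library; `ms_to_utc` is a hypothetical helper name, not part of the notebook.

```python
from datetime import datetime, timezone


def ms_to_utc(timestamp_ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)


# First timestamp shown in the preview of the training dataset.
first_review_time = ms_to_utc(1453930439000)
print(first_review_time.isoformat())
```

For the whole column, `pd.to_datetime(training_df['timestamp'], unit='ms')` would perform the same conversion under the same assumption.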


#### Label Training Data Using EMPATH

In [15]:
# Initialize the Empath tool
lexicon = Empath()

# Function to classify sentiment based on the review text
def assign_sentiment(text):
    if not isinstance(text, str) or not text:
        return None  # Handle empty or missing text (NaN is a float, not a str)
    analysis = lexicon.analyze(text, categories=['positive_emotion', 'negative_emotion'])
    positive_score = analysis.get('positive_emotion', 0)
    negative_score = analysis.get('negative_emotion', 0)
    
    if positive_score > negative_score:
        return 'positive'
    elif negative_score > positive_score:
        return 'negative'
    else:
        return 'neutral'

# Apply the function to the review text column ('text')
training_df['sentiment'] = training_df['text'].apply(assign_sentiment)

# Drop rows where sentiment could not be assigned
training_df = training_df.dropna(subset=['sentiment'])

# Display the sentiment distribution
print(training_df['sentiment'].value_counts())


sentiment
neutral     1277
positive     567
negative     156
Name: count, dtype: int64
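
The labeling rule in `assign_sentiment` returns 'neutral' on any tie between the two scores, and that includes reviews where neither category matches a single word (both scores are 0). This may account for part of the large neutral class above, although the notebook does not report how many ties are 0 vs 0. The sketch below isolates the comparison so it runs without Empath; `label_from_scores` is a hypothetical helper, and the scores are made up for illustration.

```python
def label_from_scores(positive_score, negative_score):
    """Mirror the comparison used in assign_sentiment on two numeric scores."""
    if positive_score > negative_score:
        return "positive"
    if negative_score > positive_score:
        return "negative"
    return "neutral"  # any tie, including the 0 vs 0 case


# Made-up scores, for illustration only.
examples = {
    "more positive matches": (2.0, 1.0),
    "more negative matches": (0.0, 3.0),
    "equal matches": (1.0, 1.0),
    "no matches at all": (0.0, 0.0),
}

labels = {name: label_from_scores(pos, neg) for name, (pos, neg) in examples.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

If the distinction matters for training, one option would be to give the no-match case its own label, or to drop those rows, rather than folding them into 'neutral'.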


#### Export the Labeled Training Dataset as CSV


In [16]:
# Export the labeled training dataset to a CSV file
training_df.to_csv('training_dataset_labeled.csv', index=False)

# Confirm the file has been saved
print("Labeled training dataset saved as 'training_dataset_labeled.csv'.")

Labeled training dataset saved as 'training_dataset_labeled.csv'.
