# Data Preparation for Llama Model Chat Application

This notebook prepares the Amazon Reviews 2023 dataset for training and evaluating the Llama models used in the Streamlit chat application. It covers data cleaning, feature engineering, and saving the processed dataset so that it is ready for modeling.

## Sections Included:
- Load Raw Data
- Data Cleaning
- Feature Engineering
- Save Processed Dataset


In [None]:
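from datasets import load_dataset
import pandas as pd

# Load one category of the Amazon Reviews 2023 dataset.
# NOTE (assumptions): the Hub repo is organized into per-category configs, so a
# config name is needed; "raw_review_All_Beauty" is only an example choice.
# Reviews are exposed under the "full" split, and loading relies on a dataset
# script, which needs trust_remote_code=True on `datasets` versions that still
# support scripts.
dataset = load_dataset(
    "McAuley-Lab/Amazon-Reviews-2023",
    "raw_review_All_Beauty",
    split="full",
    trust_remote_code=True,
)

# Convert to a pandas DataFrame. The review text column is named "text" in this
# dataset; rename it to "review_body", the name the cells below expect.
raw_data = dataset.to_pandas().rename(columns={"text": "review_body"})

# Display the first few rows of raw data
raw_data.head()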
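
The cleaning and feature engineering cells rely on a `review_body` column being present. A minimal sketch of a guard that fails early when an expected column is missing is shown here; `require_columns` is a hypothetical helper (not part of pandas or `datasets`), and the toy frame only stands in for `raw_data`.

```python
import pandas as pd


def require_columns(df: pd.DataFrame, required: list[str]) -> None:
    """Raise a KeyError naming any required column that is absent from df."""
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")


# Toy stand-in for raw_data (hypothetical values).
toy_raw = pd.DataFrame({"review_body": ["Great product", None], "rating": [5.0, 1.0]})

# Passes silently because the column exists.
require_columns(toy_raw, ["review_body"])
```

Running a check like this right after loading turns a confusing failure deep inside a later cell into an immediate, readable error.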


In [None]:
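# Data Cleaning: drop rows with missing review text
clean_data = raw_data.dropna(subset=['review_body'])

# Remove exact duplicates, comparing only hashable columns
# (drop_duplicates() raises a TypeError on list- or array-valued columns)
hashable_cols = [c for c in clean_data.columns if clean_data[c].map(pd.api.types.is_hashable).all()]
clean_data = clean_data.drop_duplicates(subset=hashable_cols)

# Remove empty or whitespace-only reviews; .copy() avoids a SettingWithCopyWarning later
clean_data = clean_data[clean_data['review_body'].str.strip() != ''].copy()

# Display cleaned data summary
clean_data.info()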
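
To see what each cleaning rule removes, the same three steps (drop missing text, drop exact duplicates, drop whitespace-only text) can be run on a tiny hand-made frame. This is only an illustrative sketch; the rows below are invented and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for raw_data (hypothetical values).
toy_raw = pd.DataFrame(
    {
        "review_body": ["Great product", "Great product", None, "   ", "Too small"],
        "rating": [5.0, 5.0, 3.0, 2.0, 1.0],
    }
)

# 1) Drop rows with missing text, 2) drop rows that are exact duplicates.
toy_clean = toy_raw.dropna(subset=["review_body"]).drop_duplicates()

# 3) Drop rows whose text is empty once surrounding whitespace is stripped.
toy_clean = toy_clean[toy_clean["review_body"].str.strip() != ""].copy()

print(f"{len(toy_raw)} rows -> {len(toy_clean)} rows")
print(toy_clean["review_body"].tolist())
```

Of the five toy rows, one is dropped as missing, one as a duplicate, and one as whitespace-only, leaving two.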


In [None]:
# Feature Engineering: add custom features
# Length of each review in characters
clean_data['review_length'] = clean_data['review_body'].str.len()

# Number of whitespace-separated words in each review
clean_data['word_count'] = clean_data['review_body'].str.split().str.len()

# Display first few rows of engineered data
clean_data.head()
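
As a quick sanity check on the two features, the sketch below computes them for two invented reviews. Note that `review_length` counts every character, including spaces and punctuation, while `word_count` counts whitespace-separated tokens. The toy frame is an assumption made for illustration only.

```python
import pandas as pd

# Toy stand-in for clean_data (hypothetical values).
toy_clean = pd.DataFrame({"review_body": ["Great product", "Too small for me"]})

# Character count per review (spaces included).
toy_clean["review_length"] = toy_clean["review_body"].str.len()

# Number of whitespace-separated tokens per review.
toy_clean["word_count"] = toy_clean["review_body"].str.split().str.len()

print(toy_clean)
```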


In [None]:
import os

# Save cleaned and engineered dataset for later use
processed_filepath = 'data/processed_amazon_reviews_2023.csv'
# Create the output directory first; to_csv() does not create missing folders
os.makedirs(os.path.dirname(processed_filepath), exist_ok=True)
clean_data.to_csv(processed_filepath, index=False)

# Confirm save by displaying message
print(f'Processed data saved to {processed_filepath}')
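
One way to confirm the save step worked is to read the CSV back and compare it with what was written. The sketch below does this with a toy frame and a temporary directory so it does not touch the real output file; the file name `processed_sample.csv` and the sample rows are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the processed data (hypothetical values).
toy = pd.DataFrame(
    {
        "review_body": ["Great product", "Too small"],
        "review_length": [13, 9],
        "word_count": [2, 2],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "data" / "processed_sample.csv"
    # Create the parent folder, since to_csv() will not create it.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    toy.to_csv(out_path, index=False)
    # Read the file back while the temporary directory still exists.
    reloaded = pd.read_csv(out_path)

# Compare shape and column order of the written and reloaded frames.
same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```

Because `index=False` was used when writing, the reloaded frame has the same columns as the original and no extra index column.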