## Introduction

This notebook explores the dataset to understand its features and detect missing values.

- The dataset contains **101836** rows and **15** columns
    - Source: Kaggle - Amazon Reviews US Digital Software V1

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
from wordcloud import WordCloud


In [None]:
file_path = "../data/raw/amazon_reviews_us_Digital_Software_v1_00.tsv"

In [None]:
import os
if os.path.exists(file_path):
    print("file exists.")
else:
    print("file does not exist!")


In [None]:

data = pd.read_csv(file_path, sep="\t", encoding="utf-8")


In [None]:
data.shape

In [None]:
data.info()

In [None]:
data.describe()

### Data Insights

1. Central Tendency (Mean):

    - The means of the numeric columns indicate where the data is generally concentrated. For example, the average star_rating of 3.54 suggests that ratings lean positive overall, although a mean alone does not describe how the individual ratings are distributed.

2. Variation (Standard Deviation - STD):

    - The high standard deviation in helpful_votes and total_votes suggests that while most reviews have received very few votes, some have attracted a significantly larger amount of attention.

3. Range (Min/Max):

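    - customer_id and product_parent are identifiers, so their wide min/max range reflects how IDs were assigned rather than a measurable quantity; it mainly indicates that many distinct customers and products are present. Potential outliers in the count columns (helpful_votes and total_votes) should be identified and managed.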
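
One common way to flag potential outliers is the interquartile-range (IQR) rule. The sketch below is only an illustration on a small made-up frame: the values, the helper name `iqr_outliers`, and the multiplier `k=1.5` are assumptions, not something the notebook prescribes.

```python
import pandas as pd

# Toy stand-in for the vote columns (made-up values, not from the dataset).
votes = pd.DataFrame({
    "helpful_votes": [0, 0, 1, 1, 2, 2, 3, 500],
    "total_votes":   [0, 1, 1, 2, 2, 3, 4, 520],
})

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

mask = iqr_outliers(votes["helpful_votes"])
flagged = votes.loc[mask, "helpful_votes"].tolist()
print(flagged)
```

On heavily skewed count data this rule tends to flag many rows, so the flagged values are best treated as candidates for inspection rather than rows to drop automatically.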

In [None]:
data.columns.tolist()  # list of columns

In [None]:
data.isnull().sum()

5 rows have missing values in the `review_body` and `review_date` columns.
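
Before filling the gaps, it can help to look at the affected rows directly. The following is a minimal sketch on a made-up frame (the toy values and the `review_id` labels are assumptions) showing how rows with a missing value in either column could be selected.

```python
import pandas as pd

# Made-up sample with the same column names as the dataset.
toy = pd.DataFrame({
    "review_id": ["R1", "R2", "R3"],
    "review_body": ["works fine", None, "crashes on start"],
    "review_date": ["2015-08-31", "2015-08-30", None],
})

cols = ["review_body", "review_date"]

# Rows where at least one of the two columns is missing.
missing_rows = toy[toy[cols].isnull().any(axis=1)]
missing_ids = missing_rows["review_id"].tolist()

# Rows where both columns are missing at the same time.
both_missing = int(toy[cols].isnull().all(axis=1).sum())

print(missing_ids, both_missing)
```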

In [None]:
# replace missing values with empty strings
data["review_body"] = data["review_body"].fillna("")
data["review_date"] = data["review_date"].fillna("")

In [None]:
data.isnull().sum()  # ✅

In [None]:
data.duplicated().sum()

No duplicate records.

In [None]:
sns.histplot(data["star_rating"], kde=True, bins=10, color="green")
plt.xlabel("star_rating")
plt.ylabel("frequency")
plt.show()

The ratings are dominated by 5-star reviews, but a significant portion of reviewers also gave low ratings (1 and 2 stars).
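
To put numbers on a statement like this, the share of each rating can be computed with `value_counts(normalize=True)`. This is a sketch on made-up ratings, so the printed shares say nothing about the real dataset.

```python
import pandas as pd

# Made-up ratings; in the notebook this would be data["star_rating"].
ratings = pd.Series([5, 5, 5, 5, 4, 3, 1, 1, 2, 5], name="star_rating")

# Fraction of reviews per rating, ordered from 1 to 5 stars.
share = ratings.value_counts(normalize=True).sort_index()

# Combined share of 1- and 2-star reviews.
low_share = share.loc[[1, 2]].sum()

print(share)
print(round(low_share, 2))
```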

In [None]:
data["review_length"] = data["review_body"].apply(lambda x: len(str(x)))

sns.histplot(data["review_length"], kde=True, bins=40)
plt.xlabel("length of review body")
plt.ylabel("frequency")
plt.xlim(0, 10000)
plt.ylim(0, 100000)
plt.show()

The distribution of `review_body` lengths shows that most reviews are shorter than 2000 characters.
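
A histogram gives a visual impression, and a couple of summary numbers can back it up. The sketch below uses made-up lengths; the 2000-character threshold comes from the text above, while the chosen quantiles are just an assumption about what might be useful to report.

```python
import pandas as pd

# Made-up review lengths; in the notebook this would be data["review_length"].
lengths = pd.Series(
    [12, 80, 150, 300, 450, 900, 1500, 2500, 6000, 40],
    name="review_length",
)

# Fraction of reviews shorter than 2000 characters.
share_below = (lengths < 2000).mean()

# Median and upper quantiles describe the long right tail.
quantiles = lengths.quantile([0.5, 0.9, 0.99])

print(share_below)
print(quantiles)
```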

In [None]:
review_body = data["review_body"]

In [None]:
review_body.head()

In [None]:
sns.scatterplot(x=data["review_length"], y=data["star_rating"])
plt.title("relationship between review body length and star rating")
plt.xlabel("review length")
plt.ylabel("star rating")
plt.show()

Relationship between Review Length and Star Rating:
- The scatter plot shows that shorter reviews tend to have lower star ratings, while longer reviews are often associated with higher ratings.
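
Because star ratings take only five values, points in a scatter plot overlap heavily, and a per-rating summary may be easier to read. The following is a rough sketch on made-up values (not results from the dataset) of how the median review length per rating could be tabulated to check the pattern.

```python
import pandas as pd

# Made-up sample with the two columns used in the scatter plot.
toy = pd.DataFrame({
    "star_rating":   [1, 1, 2, 3, 4, 5, 5, 5],
    "review_length": [40, 60, 90, 200, 150, 300, 500, 120],
})

# Number of reviews and median length for each star rating.
summary = toy.groupby("star_rating")["review_length"].agg(["count", "median"])

print(summary)
```

The median is used here instead of the mean because review lengths are typically right-skewed, so a few very long reviews would otherwise dominate the comparison.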

In [None]:
# strip HTML tags (e.g. <br />) so they do not pollute the most common words;
# replace each tag with a space so the words on either side are not merged
_REGEX_HTML_TAG = re.compile(r"<[^>]+>")

data["cleaned_review_body"] = review_body.str.replace(_REGEX_HTML_TAG, " ", regex=True)

In [None]:
text = " ".join(data["cleaned_review_body"].astype(str))

wordcloud = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

- This word cloud represents the most common words used in customer reviews.
    - From the cloud, we can observe that words like **"software," "use," "product," "program,"** and **"work"** appear frequently. These are mostly descriptive terms about the product rather than clear sentiment words. Terms such as **"problem," "error,"** and **"issue"**, however, suggest areas where customers have expressed dissatisfaction or concerns.
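
A word cloud shows relative prominence but not exact counts. As a complement, the sketch below counts tokens with `collections.Counter` on three made-up reviews; the tiny `STOPWORDS` set and the `NEGATIVE_TERMS` list are hypothetical and only serve the illustration.

```python
import re
from collections import Counter

# Made-up reviews; in the notebook this would be data["cleaned_review_body"].
reviews = [
    "Great software, easy to use.",
    "The program works, but I hit an install error.",
    "Install problem: the product key gave an error.",
]

# Hypothetical minimal stopword list for this example.
STOPWORDS = {"the", "to", "but", "i", "an", "a"}

# Lowercase, split into alphabetic tokens, and drop stopwords.
tokens = [
    word
    for text in reviews
    for word in re.findall(r"[a-z']+", text.lower())
    if word not in STOPWORDS
]
counts = Counter(tokens)

# Look up the terms the text associates with dissatisfaction.
NEGATIVE_TERMS = ["problem", "error", "issue"]
negative_hits = {term: counts[term] for term in NEGATIVE_TERMS}

print(counts.most_common(3))
print(negative_hits)
```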