# Introduction

Analyzing the Amazon Customer dataset is a useful step toward understanding how customers interact with Amazon, one of the world's largest e-commerce platforms. In this analysis, we follow the ETL (Extract, Transform, Load) approach to draw insights from Amazon customer review data. We explore how review activity can inform product recommendations, analyze user behavior to enhance the shopping experience, conduct sentiment analysis to understand how users feel about Amazon's products and services, and identify top-performing products based on their reviews. Together, these steps give a clearer picture of the dynamics of the e-commerce market and of how Amazon maintains its position as an industry leader.

# Load Data (Extract)

In [None]:
# Disable warning

import warnings

warnings.filterwarnings('ignore')
warnings.filterwarnings('ignore', category=FutureWarning)

In [None]:
import sqlite3

con = sqlite3.connect("data/database.sqlite")

In [None]:
type(con)

In [None]:
import pandas as pd

amazon = pd.read_sql_query("SELECT * FROM reviews", con)
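
A minimal sketch of the extract step, assuming a hypothetical two-column `reviews` table in an in-memory database rather than the real `data/database.sqlite` file. It shows one possible pattern: select only the columns you need and close the connection once the data is in a DataFrame.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# In-memory database standing in for data/database.sqlite (toy schema, made-up rows).
with closing(sqlite3.connect(":memory:")) as con:
    con.execute("CREATE TABLE reviews (Id INTEGER, Score INTEGER)")
    con.executemany("INSERT INTO reviews VALUES (?, ?)", [(1, 5), (2, 3)])

    # Read only the needed columns instead of SELECT *.
    df = pd.read_sql_query("SELECT Id, Score FROM reviews", con)

print(df)
```

`closing` guarantees the connection is released even if the query raises.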

In [None]:
amazon.head(5)

Features explanation:

- **Id**: unique identifier of each review in the database.
- **ProductId**: unique identifier of the product being reviewed.
- **UserId**: unique identifier of the user who wrote the review.
- **ProfileName**: the username or profile name of the user who wrote the review.
- **HelpfulnessNumerator**: the number of people who found the review helpful.
- **HelpfulnessDenominator**: the total number of people who rated the review's helpfulness, whether or not they found it helpful.
- **Score**: the rating given by the user to the product being reviewed.
- **Time**: when the review was created or published. This can be used to track when the review was written.
- **Summary**: a short summary of the review. It provides a quick overview of what the user is saying.
- **Text**: the full text of the review.
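
To make the two helpfulness columns concrete, here is a small sketch on made-up rows. The `HelpfulnessRatio` column is a hypothetical name introduced only for illustration; it is not part of the dataset.

```python
import pandas as pd

# Toy rows using the column names described above (values are made up).
reviews = pd.DataFrame({
    "HelpfulnessNumerator": [1, 0, 3],
    "HelpfulnessDenominator": [1, 0, 4],
})

# Share of helpfulness votes that were positive. Rows with no votes have
# no meaningful ratio, so the zero denominator is masked and the result is NaN.
votes = reviews["HelpfulnessDenominator"]
reviews["HelpfulnessRatio"] = reviews["HelpfulnessNumerator"] / votes.where(votes > 0)

print(reviews)
```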

In [None]:
amazon.shape

In [None]:
amazon.info()

# Data Preprocessing (Transform)

#### Is HelpfulnessNumerator ever greater than HelpfulnessDenominator?

In review data, HelpfulnessNumerator should never be larger than HelpfulnessDenominator, because that would mean more people found the review helpful than actually rated it.

In [None]:
for index, row in amazon.iterrows():
    helpfulness_numerator = row['HelpfulnessNumerator']
    helpfulness_denominator = row['HelpfulnessDenominator']

    if helpfulness_numerator > helpfulness_denominator:
        print(f"Error in row {index}: HelpfulnessNumerator ({helpfulness_numerator}) is greater than HelpfulnessDenominator ({helpfulness_denominator})")

Because some rows have a HelpfulnessNumerator greater than the HelpfulnessDenominator, we drop them.

In [None]:
amazon = amazon[amazon['HelpfulnessNumerator'] <= amazon['HelpfulnessDenominator']]

In [None]:
amazon.shape
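
The same consistency check and filter can also be expressed without a row-by-row loop, which tends to be much faster on large frames. A minimal sketch on made-up data (the `toy` frame and its values are assumptions for illustration):

```python
import pandas as pd

# Toy data (made up); the second row violates the rule.
toy = pd.DataFrame({
    "HelpfulnessNumerator": [2, 5, 0],
    "HelpfulnessDenominator": [3, 4, 0],
})

# Boolean mask of inconsistent rows, computed column-wise.
invalid = toy["HelpfulnessNumerator"] > toy["HelpfulnessDenominator"]
bad_index = toy[invalid].index.tolist()
print(f"{invalid.sum()} inconsistent row(s) at index {bad_index}")

# Keep only the consistent rows.
clean = toy[~invalid]
```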

#### Is there any duplicate data?

When analyzing Amazon review data, it is important to remove rows that repeat the same combination of UserId, ProfileName, Time, and Text, because duplicated reviews distort counts and can obscure patterns and trends in the dataset.

In [None]:
# Count rows that repeat an earlier (UserId, ProfileName, Time, Text) combination
n_duplicates = amazon.duplicated(['UserId', 'ProfileName', 'Time', 'Text']).sum()

if n_duplicates > 0:
    print(f"There are {n_duplicates} duplicate rows.")
else:
    print("There are no duplicate rows.")

In [None]:
# Drop duplicate data

amazon = amazon.drop_duplicates(subset=['UserId', 'ProfileName', 'Time', 'Text'])

In [None]:
# Check again

n_duplicates = amazon.duplicated(['UserId', 'ProfileName', 'Time', 'Text']).sum()

if n_duplicates > 0:
    print(f"There are {n_duplicates} duplicate rows.")
else:
    print("There are no duplicate rows.")

In [None]:
amazon.shape
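
To see what `duplicated` and `drop_duplicates` do with a column subset, here is a small sketch on made-up rows. By default, the first occurrence of each key combination is kept and later copies are flagged, even when other columns (here `ProductId`) differ.

```python
import pandas as pd

# Made-up rows: the first two share the same user, name, time, and text.
toy = pd.DataFrame({
    "UserId": ["A1", "A1", "B2"],
    "ProfileName": ["ann", "ann", "bob"],
    "Time": [100, 100, 200],
    "Text": ["Great tea", "Great tea", "Too sweet"],
    "ProductId": ["P1", "P2", "P1"],
})
key = ["UserId", "ProfileName", "Time", "Text"]

dup_mask = toy.duplicated(subset=key)       # True for later copies only
deduped = toy.drop_duplicates(subset=key)   # keeps the first occurrence

print(dup_mask.tolist())
print(deduped)
```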

#### Are there any missing values?

In [None]:
# Check for missing values
missing_values = amazon.isnull().sum()

# Display columns with missing values and the count of missing values
missing_values = missing_values[missing_values > 0]

if not missing_values.empty:
    print("Columns with missing values:")
    for column, count in missing_values.items():
        print(f"{column}: {count} missing values")
else:
    print("There are no columns with missing values")

#### Is there an incorrect data type?

In [None]:
amazon.dtypes

Looking at the data types above, the `Time` feature stands out: its data type is int64, but it should be datetime. We therefore need to convert it first.

In [None]:
pd.to_datetime(amazon['Time'])

Notice that every value lands on 1970-01-01 and carries a long fraction of a second. This happens because, without a `unit`, pandas interprets the integers as nanoseconds since the Unix epoch, whereas this column stores seconds. Since we want the actual day, month, and year of each review, we pass `unit='s'`.

In [None]:
amazon['Time'] = pd.to_datetime(amazon['Time'], unit='s')
amazon['Time']
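
The effect of the `unit` argument can be seen on a single epoch value. This is a minimal sketch; the value below is only an illustrative timestamp in seconds.

```python
import pandas as pd

# One epoch value in seconds (illustrative).
raw = pd.Series([1303862400])

naive = pd.to_datetime(raw)             # integers read as nanoseconds
parsed = pd.to_datetime(raw, unit="s")  # integers read as seconds

print(naive.iloc[0])   # a moment just after 1970-01-01 00:00:01
print(parsed.iloc[0])  # 2011-04-27 00:00:00
```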

# Let's analyze the data (Load)

#### How does Amazon recommend products?

In [None]:
amazon.columns

To look at how Amazon could recommend products, we group by 'UserId' rather than 'ProfileName', for the following reasons:
- **User Uniqueness**: 'UserId' is a unique identification for each user. This ensures that you have clear information about each user individually.
- **Ease of Processing**: Using 'UserId' makes it easier to automatically group reviews and user behavior.
- **Consistency**: 'ProfileName' may change over time or may not accurately reflect user preferences.
- **Privacy**: Using 'ProfileName' may reveal more personal information about the user than intended.
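
Grouping by a unique user key can be sketched with pandas named aggregation. The toy data and the output column names (`n_reviews`, `avg_score`, `n_products`) are assumptions for illustration, not the names used in this notebook.

```python
import pandas as pd

# Made-up reviews: user A1 wrote three reviews covering two products.
toy = pd.DataFrame({
    "UserId": ["A1", "A1", "A1", "B2"],
    "ProductId": ["P1", "P2", "P2", "P1"],
    "Score": [5, 4, 3, 1],
})

per_user = (
    toy.groupby("UserId")
    .agg(
        n_reviews=("Score", "count"),         # number of reviews written
        avg_score=("Score", "mean"),          # average rating given
        n_products=("ProductId", "nunique"),  # distinct products reviewed
    )
    .sort_values("n_reviews", ascending=False)
)

print(per_user)
```

Named aggregation attaches the intended meaning to each output column at the point where it is computed, which avoids a separate renaming step.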

In [None]:
amazon['UserId'].nunique()

In [None]:
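# Count summaries, texts, and products per user; use the mean for Score so it matches the 'Avg_score' column name
recommend = amazon.groupby(['UserId']).agg({'Summary':'count', 'Text':'count', 'Score':'mean', 'ProductId':'count'}).sort_values(by="ProductId", ascending=False)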

In [None]:
#rename columns
recommend.columns = ['Number_of_summaries', 'Num_text', 'Avg_score', 'Prods_purchased']

In [None]:
recommend.head()

In [None]:
# show top 10 users

top_10_users = recommend.index[0:10]
top_10_users

In [None]:
purchase_counts = recommend['Prods_purchased'][0:10].values
purchase_counts

In [None]:
import matplotlib.pyplot as plt

# Create a horizontal bar plot
plt.figure(figsize=(10, 6))
plt.barh(top_10_users, purchase_counts, color='skyblue')

# Add labels to the axes
plt.xlabel('Purchase Count')
plt.ylabel('UserID')

# Add a title
plt.title('Top 10 User Recommendations Based on Purchase Count')

# Add values on the bars
for i, v in enumerate(purchase_counts):
    plt.text(v + 1, i, str(v), va='center', fontsize=12)

# Reverse the order so that the highest count is at the top
plt.gca().invert_yaxis()  
plt.show()

#### Which products have the best reviews?

In [None]:
amazon.columns

In [None]:
len(amazon['ProductId'].unique())

In [None]:
# Name the count column explicitly so the code does not depend on the pandas version
prod_count = amazon['ProductId'].value_counts().rename('Review_count').to_frame()
prod_count

Choosing the minimum count that qualifies a product as best-selling (here, most-reviewed) is subjective. Several techniques exist, and the choice should ideally involve domain experts. We do not explore this further in this analysis and simply set the threshold at 500.

In [None]:
prod_count[prod_count['Review_count']>500]

In [None]:
freq_prods_ids = prod_count[prod_count['Review_count']>500].index
freq_prods_ids

In [None]:
# Boolean mask of rows whose product has more than 500 reviews

amazon['ProductId'].isin(freq_prods_ids)

In [None]:
freq_prods = amazon[amazon['ProductId'].isin(freq_prods_ids)]
freq_prods
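
The threshold filter above can be sketched on made-up data. The `min_reviews` value of 2 is a hypothetical stand-in for the 500 used in this notebook.

```python
import pandas as pd

# Made-up reviews: P1 has four reviews, P2 has two, P3 has one.
toy = pd.DataFrame({"ProductId": ["P1"] * 4 + ["P2"] * 2 + ["P3"]})
min_reviews = 2  # hypothetical threshold; the notebook uses 500

# Reviews per product, then the ids strictly above the threshold.
counts = toy["ProductId"].value_counts()
frequent_ids = counts[counts > min_reviews].index

# Keep only the reviews that belong to those products.
frequent = toy[toy["ProductId"].isin(frequent_ids)]

print(list(frequent_ids), len(frequent))
```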

In [None]:
import seaborn as sns

# Apply the grid style before creating the figure so that it takes effect
sns.set(style="whitegrid")

# Create a figure and axis
plt.figure(figsize=(10, 8))

# Use one color per distinct score
n_colors = freq_prods['Score'].nunique()
palette = sns.color_palette("viridis", n_colors)

# Create the countplot
ax = sns.countplot(y='ProductId', data=freq_prods, hue='Score', palette=palette)

# Add labels and title
ax.set_xlabel('Review Count')
ax.set_ylabel('Product ID')
ax.set_title('Top Products with Best Reviews')

# Place the legend outside the plot area
ax.legend(title='Score', loc='upper right', bbox_to_anchor=(1.3, 1.0))

# Show the plot
plt.show()

In [None]:
import plotly.express as px

# create interactive visualizations
data = freq_prods.groupby(['ProductId', 'Score']).size().reset_index(name='Count')

# Create an interactive bar chart using Plotly
fig = px.bar(data, y='ProductId', x='Count', color='Score', orientation='h', 
             labels={'ProductId': 'Product ID', 'Count': 'Review Count'},
             title='Top Products with Best Reviews')

# Customize the legend
fig.update_traces(marker_line_width=0)
fig.update_layout(legend_title_text='Score')

# Show the interactive plot
fig.show()

#### Understanding Amazon user behavior

In [None]:
amazon.columns

In [None]:
x = amazon['UserId'].value_counts()
x

In [None]:
amazon['Viewer_type'] = amazon['UserId'].apply(lambda user: "Frequent" if x[user]>50 else "Not Frequent")
amazon['Viewer_type']

In [None]:
amazon.head()

In [None]:
freq = amazon[amazon['Viewer_type']=='Frequent']
not_freq = amazon[amazon['Viewer_type']=='Not Frequent']

In [None]:
freq['Score'].value_counts()/len(freq)*100

In [None]:
not_freq['Score'].value_counts()/len(not_freq)*100
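
The labeling and the per-group score percentages can also be computed in a vectorized way. A minimal sketch on made-up data; the `threshold` of 2 is a hypothetical stand-in for the 50 used in this notebook.

```python
import pandas as pd

# Made-up reviews: A1 wrote three, B2 and C3 wrote one each.
toy = pd.DataFrame({
    "UserId": ["A1", "A1", "A1", "B2", "C3"],
    "Score": [5, 5, 4, 1, 5],
})
threshold = 2  # hypothetical; the notebook uses 50

# Map each row to its author's review count, then label the row.
n_reviews = toy["UserId"].map(toy["UserId"].value_counts())
toy["Viewer_type"] = n_reviews.gt(threshold).map(
    {True: "Frequent", False: "Not Frequent"}
)

# Score shares (in percent) within each group.
shares = (
    toy.groupby("Viewer_type")["Score"]
    .value_counts(normalize=True)
    .mul(100)
)

print(shares)
```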

In [None]:
# Count reviews per score, sorted by score so both panels use the same color for each score
freq_counts = freq['Score'].value_counts().sort_index()
not_freq_counts = not_freq['Score'].value_counts().sort_index()
freq_labels = freq_counts.index
not_freq_labels = not_freq_counts.index

# Set color palette
colors = ['#ff9999', '#66b3ff', '#99ff99', '#ffcc99', '#c2c2f0']

# Set plot size
plt.figure(figsize=(12, 6))

# Plot frequency data
plt.subplot(1, 2, 1)
plt.bar(freq_labels, freq_counts, color=colors)
plt.title('Behavior of Frequent Viewers on Amazon')
plt.xlabel('Score')
plt.ylabel('Number of Reviews')
plt.xticks(rotation=45)

# Plot non-frequency data
plt.subplot(1, 2, 2)
plt.bar(not_freq_labels, not_freq_counts, color=colors)
plt.title('Behavior of Non-Frequent Viewers on Amazon')
plt.xlabel('Score')
plt.ylabel('Number of Reviews')
plt.xticks(rotation=45)

# Set overall layout
plt.tight_layout()

# Show the plot
plt.show()

In [None]:
# Create dataframes for frequency and non-frequency data
freq_df = pd.DataFrame({'Score': freq_labels, 'Number of Reviews': freq_counts})
not_freq_df = pd.DataFrame({'Score': not_freq_labels, 'Number of Reviews': not_freq_counts})

# Create interactive bar charts for frequency and non-frequency data
fig = px.bar(freq_df, x='Score', y='Number of Reviews', color='Score',
             title='Behavior of Frequent Viewers on Amazon')
fig.update_xaxes(title_text='Score', tickangle=45)
fig.update_yaxes(title_text='Number of Reviews')
fig.update_layout(xaxis_title='Score', yaxis_title='Number of Reviews')

fig2 = px.bar(not_freq_df, x='Score', y='Number of Reviews', color='Score',
              title='Behavior of Non-Frequent Viewers on Amazon')
fig2.update_xaxes(title_text='Score', tickangle=45)
fig2.update_yaxes(title_text='Number of Reviews')
fig2.update_layout(xaxis_title='Score', yaxis_title='Number of Reviews')

# Combine the two charts into one interactive dashboard
from plotly.subplots import make_subplots

dashboard = make_subplots(rows=1, cols=2, subplot_titles=("Frequent Viewers", "Non-Frequent Viewers"))
dashboard.add_trace(fig.data[0], row=1, col=1)
dashboard.add_trace(fig2.data[0], row=1, col=2)

# Set the layout for the dashboard
dashboard.update_layout(showlegend=False, title_text="Amazon Viewer Behavior Analysis")

# Show the interactive dashboard
dashboard.show()

#### Which users often beat around the bush?

In [None]:
amazon.columns

In [None]:
amazon[['UserId', 'ProductId', 'Text']]

In [None]:
# Count words per review; split() with no argument ignores repeated whitespace
amazon['Text_length'] = amazon['Text'].str.split().str.len()
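
Word counts depend on how whitespace is handled. A small sketch on a made-up string shows the difference between splitting on a single space and splitting on any run of whitespace:

```python
# Made-up review text with two double spaces.
text = "Great  coffee, would buy  again"

naive_count = len(text.split(" "))  # empty strings between double spaces are counted
word_count = len(text.split())      # any run of whitespace is one separator

print(naive_count, word_count)  # 7 5
```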

In [None]:
amazon.head()

In [None]:
freq = amazon[amazon['Viewer_type']=='Frequent']
not_freq = amazon[amazon['Viewer_type']=='Not Frequent']

In [None]:
freq

In [None]:
import plotly.graph_objs as go

# Box plot of review length for frequent reviewers
fig = go.Figure()
fig.add_trace(go.Box(y=freq['Text_length'], name='Frequent'))

# Box plot of review length for non-frequent reviewers
fig.add_trace(go.Box(y=not_freq['Text_length'], name='Not Frequent'))

# Add layout and title
fig.update_layout(
    title='Box Plot of Text Length',
    xaxis=dict(title='User Type'),
    yaxis=dict(title='Text Length'),
)

# Show the interactive plot
fig.show()

#### Sentiment Analysis

In [None]:
from textblob import TextBlob

In [None]:
amazon.shape

In [None]:
# Work on an independent copy so that adding columns does not touch the original frame
sample = amazon[0:50000].copy()

In [None]:
polarity = [TextBlob(text).sentiment.polarity if isinstance(text, str) else 0 for text in sample['Summary']]

In [None]:
len(polarity)

In [None]:
sample.loc[:, 'Polarity'] = polarity

In [None]:
sample.head()

In [None]:
sample_positive = sample[sample['Polarity']>0]
sample_negative = sample[sample['Polarity']<0]

In [None]:
from collections import Counter

Counter(sample_positive['Summary']).most_common(10)

In [None]:
Counter(sample_negative['Summary']).most_common(10)
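
Counting exact strings treats summaries that differ only in case or surrounding whitespace as distinct. One possible refinement, sketched here on made-up summaries, is to normalize the text before counting:

```python
from collections import Counter

# Made-up summaries that differ only in case or leading whitespace.
summaries = ["Yummy!", "yummy!", " Great product", "great product", "Yummy!"]

# Strip whitespace and lowercase before counting.
normalized = [s.strip().lower() for s in summaries]
top = Counter(normalized).most_common(2)

print(top)  # [('yummy!', 3), ('great product', 2)]
```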