# Amazon Product Data Analysis

This notebook explores the Amazon product dataset to identify key patterns and insights that will be useful for our vector search system.

## Dataset Overview

- Contains product listings from various categories with pricing, ratings, and review information
- Key columns: product_id, product_name, category, discounted_price, actual_price, rating, review_content
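
The price columns may arrive as text rather than numbers, for example with a currency symbol or thousands separators. That is an assumption about the CSV, not something confirmed above. A minimal sketch of a normalizing helper, where `to_number` is a hypothetical name:

```python
import re

def to_number(value):
    """Best-effort conversion of a price-like value to float (hypothetical helper).

    Returns None when no number can be recovered. Signs are dropped,
    which is acceptable for prices but not for general numeric data.
    """
    if value is None:
        return None
    if isinstance(value, (int, float)):
        return float(value)
    # Keep digits and the decimal point; drop currency symbols, commas, spaces
    cleaned = re.sub(r"[^0-9.]", "", str(value))
    try:
        return float(cleaned)
    except ValueError:
        return None

print(to_number("$1,099.50"))
print(to_number("n/a"))
```

If the columns do turn out to be strings, something like `amazon_df['actual_price'].map(to_number)` could be applied before any arithmetic on prices.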

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestRegressor

# Set plotting style
plt.style.use('ggplot')
sns.set_theme(style="whitegrid")

In [None]:
# Load the dataset
amazon_df = pd.read_csv('../data/amazon_products.csv')

# Display basic information
print(f"Total records: {len(amazon_df)}")
amazon_df.info()

In [None]:
# Display sample data
amazon_df.head()

## Category Analysis

### 1. Main Category Distribution

In [None]:
# Extract main categories and count products
main_categories = amazon_df['category'].str.split('/').str[0].value_counts()

# Plot
plt.figure(figsize=(12, 8))
main_categories.plot(kind='barh')
plt.title('Product Distribution by Main Category')
plt.xlabel('Number of Products')
plt.ylabel('Main Category')
plt.tight_layout()
plt.savefig('../images/Product Distribution by Main Category.png')
plt.show()

![Main Categories](../images/Product%20Distribution%20by%20Main%20Category.png)

**Key Insights:**
- Computers & Accessories is the largest main category
- Electronics (which includes the Home Theater branch) is the second-largest main category
- Helps identify focus areas for inventory management
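
The main category is simply the first segment of the category path. The sketch below expands a path into one column per level so that each level can be counted separately. The helper name `split_category_levels`, the `/` separator default, and the toy paths are assumptions for illustration; the separator should be checked against the actual data.

```python
import pandas as pd

def split_category_levels(categories, sep="/", max_levels=3):
    """Expand category paths into one column per level (hypothetical helper)."""
    # A single-character separator is treated as a literal string by pandas
    levels = categories.str.split(sep, expand=True)
    # Pad or truncate to a fixed number of level columns
    levels = levels.reindex(columns=range(max_levels))
    levels.columns = [f"level_{i}" for i in range(max_levels)]
    return levels

toy = pd.Series(["Electronics/HomeTheater/Televisions", "Computers/Cables", None])
levels = split_category_levels(toy)
print(levels)
print(levels["level_0"].value_counts())
```

Counting `level_0` reproduces the main-category view, and `level_1` the sub-category view, without repeating the split in every cell.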

### 2. Sub-Category Distribution

In [None]:
# Extract sub-categories and count products
sub_categories = amazon_df['category'].str.split('/').str[1].value_counts().head(15)

# Plot
plt.figure(figsize=(12, 8))
sub_categories.plot(kind='barh')
plt.title('Product Distribution By Sub-Category (Top 15)')
plt.xlabel('Number of Products')
plt.ylabel('Sub-Category')
plt.tight_layout()
plt.savefig('../images/Product Distribution By Sub-Category.png')
plt.show()

![Sub Categories](../images/Product%20Distribution%20By%20Sub-Category.png)

**Key Insights:**
- Cables & Accessories is the most populated sub-category
- USBCables and WirelessUSBAdapters are prominent
- Helps understand product type distribution

### 3. Category Hierarchy

In [None]:
# Create a hierarchical view of categories
category_hierarchy = (
    amazon_df.groupby(['category'])
    .size()
    .reset_index(name='count')
    .sort_values('count', ascending=False)
    .head(15)
)

# Plot
plt.figure(figsize=(14, 10))
sns.barplot(x='count', y='category', data=category_hierarchy)
plt.title('Product Distribution By Category Hierarchical (Top 15)')
plt.xlabel('Number of Products')
plt.ylabel('Category Path')
plt.tight_layout()
plt.savefig('../images/Product Distribution By Category Hierarchical.png')
plt.show()

![Hierarchical View](../images/Product%20Distribution%20By%20Category%20Hierarchical.png)

**Key Insights:**
- Detailed view of category relationships
- Shows the full path hierarchy
- Helps understand the product taxonomy

## Rating Analysis

### 1. Distribution of Ratings

In [None]:
# Plot rating distribution
plt.figure(figsize=(10, 6))
sns.histplot(amazon_df['rating'].dropna(), bins=20, kde=True)
plt.title('Distribution of Product Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.tight_layout()
plt.savefig('../images/Distribution of Product Ratings.png')
plt.show()

![Rating Distribution](../images/Distribution%20of%20Product%20Ratings.png)

**Key Insights:**
- Negatively (left) skewed distribution: most products are rated highly, with a thin tail of low ratings
- Few products have low ratings
- May reflect product quality, or selection and rating bias in which products get reviewed
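
The direction of skew can be checked numerically rather than read off the histogram: a negative sample skewness indicates a longer left tail. A small sketch on made-up ratings (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy ratings clustered near the top of a 1-5 scale (illustrative values only)
ratings = pd.Series([4.5, 4.3, 4.4, 4.2, 4.6, 4.1, 4.4, 3.0, 2.0, 4.3])

skewness = ratings.skew()             # sample skewness; negative means a longer left tail
share_high = (ratings >= 4.0).mean()  # fraction of products rated 4.0 or higher

print(f"skewness={skewness:.2f}, share rated >= 4.0: {share_high:.0%}")
```

The same two lines can be run on `amazon_df['rating'].dropna()` to put numbers behind the histogram.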

## Price Analysis

### 1. Discount Analysis

In [None]:
# Calculate discount percentage
amazon_df['discount_pct'] = ((amazon_df['actual_price'] - amazon_df['discounted_price']) / 
                           amazon_df['actual_price'] * 100).round(2)

# Discount vs Rating
plt.figure(figsize=(12, 8))
valid_data = amazon_df.dropna(subset=['discount_pct', 'rating'])
sns.scatterplot(x='discount_pct', y='rating', 
               data=valid_data, 
               alpha=0.5, 
               hue=valid_data['category'].str.split('/').str[0],  # main category only, so the legend stays readable
               palette='viridis')
plt.title('Discount Percentage vs Product Rating')
plt.xlabel('Discount Percentage %')
plt.ylabel('Rating')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.savefig('../images/Discount Percentage vs Product Rating.png')
plt.show()

![Discount vs Rating](../images/Discount%20Percentage%20vs%20Product%20Rating.png)

**Key Insights:**
- Products discounted by more than 70% tend to have slightly lower ratings
- The best-rated products (4.5+) mostly carry moderate discounts (30-50%)
- Helps evaluate discount strategies
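
Claims about discount ranges are easier to verify with a bucketed summary than by reading the scatter plot. The sketch below groups ratings by discount bucket; the bin edges and toy values are assumptions chosen to mirror the ranges mentioned above.

```python
import pandas as pd

toy = pd.DataFrame({
    "discount_pct": [5, 15, 35, 45, 55, 75, 85, 40],
    "rating":       [4.0, 4.1, 4.5, 4.6, 4.2, 3.8, 3.9, 4.4],
})

# Bucket discounts into fixed ranges (bin edges are illustrative assumptions)
bins = [0, 30, 50, 70, 100]
labels = ["0-30%", "30-50%", "50-70%", "70-100%"]
toy["discount_bucket"] = pd.cut(
    toy["discount_pct"], bins=bins, labels=labels, include_lowest=True
)

# Count, mean, and median rating per discount bucket
summary = (
    toy.groupby("discount_bucket", observed=True)["rating"]
       .agg(["count", "mean", "median"])
)
print(summary)
```

Including the count column matters here: a bucket with only a handful of products can show an extreme mean by chance.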

### 2. Price Distribution by Category

In [None]:
# Price distribution by main category
amazon_df['main_category'] = amazon_df['category'].str.split('/').str[0]

plt.figure(figsize=(14, 8))
top_categories = amazon_df['main_category'].value_counts().nlargest(6).index
subset = amazon_df[amazon_df['main_category'].isin(top_categories)]

sns.boxplot(x='main_category', y='discounted_price', data=subset)
plt.title('Discounted Price Distribution by Main Category')
plt.xlabel('Main Category')
plt.ylabel('Discounted Price ($)')
plt.xticks(rotation=45)
plt.ylim(0, 500)  # Limit y-axis for better visualization
plt.tight_layout()
plt.savefig('../images/Discounted Price Distribution by Main Category.png')
plt.show()

![Price Distribution](../images/Discounted%20Price%20Distribution%20by%20Main%20Category.png)

**Key Insights:**
- Smart TVs show the widest price range
- Cables & Accessories have the most consistent pricing
- Helps understand market segments
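
Because the box plot's y-axis is clipped at 500, comparisons of price spread are safer when backed by a numeric summary. A sketch using the interquartile range per category, on made-up prices (the `iqr` helper and toy frame are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "main_category": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "discounted_price": [10, 12, 11, 13, 100, 400, 250, 900],
})

def iqr(series):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return series.quantile(0.75) - series.quantile(0.25)

# Median and spread of price per category, widest spread first
spread = (
    toy.groupby("main_category")["discounted_price"]
       .agg(median="median", iqr=iqr)
       .sort_values("iqr", ascending=False)
)
print(spread)
```

Unlike the clipped plot, this summary also accounts for products priced above the axis limit.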

## Price-Rating Relationships

In [None]:
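# Price vs Rating by Category
plt.figure(figsize=(12, 8))
# Keep the top categories and drop rows missing either variable
kde_data = amazon_df[amazon_df['main_category'].isin(top_categories)].dropna(
    subset=['discounted_price', 'rating'])
# hue= lets seaborn build the legend itself; label= on filled bivariate
# contours is not reliably picked up by plt.legend()
sns.kdeplot(x='discounted_price', y='rating', data=kde_data,
            hue='main_category', levels=5, fill=True, alpha=0.5)

plt.title('Price-Rating Density by Category')
plt.xlabel('Discounted Price ($)')
plt.ylabel('Rating')
plt.xlim(0, 500)  # Limit x-axis for better visualization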
plt.tight_layout()
plt.savefig('../images/Price-Rating Density by Category.png')
plt.show()

![Price-Rating Density](../images/Price-Rating%20Density%20by%20Category.png)

**Key Insights:**
- Rating distributions differ across price ranges
- Some categories have distinct price-rating patterns
- Helps understand price sensitivity per category

## Brand Analysis

In [None]:
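# Use the 'brand' column if the dataset has one; otherwise approximate the brand
# as the first word of product_name. This is a simplified approach - in reality,
# brand extraction would need more sophisticated NLP
if 'brand' not in amazon_df.columns:
    amazon_df['brand'] = amazon_df['product_name'].str.split().str[0]
brands = amazon_df['brand'].value_counts().head(10)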

plt.figure(figsize=(12, 8))
brands.plot(kind='barh')
plt.title('Top 10 Brands by Product Count')
plt.xlabel('Number of Products')
plt.ylabel('Brand')
plt.tight_layout()
plt.savefig('../images/Top 10 Brands by Product Count.png')
plt.show()

![Top Brands](../images/Top%2010%20Brands%20by%20Product%20Count.png)

**Key Insights:**
- Popular brands in the dataset
- Shows market leaders in Amazon's product listings
- Useful for brand-based recommendation strategies

## Price-Quality Matrix

In [None]:
# Create price-quality matrix
plt.figure(figsize=(12, 10))
scatter = plt.scatter(amazon_df['discounted_price'], 
                     amazon_df['rating'],
                     c=amazon_df['discount_pct'],
                     s=50,
                     alpha=0.6,
                     cmap='viridis')

plt.colorbar(scatter, label='Discount Percentage %')
plt.title('Price-Quality Matrix Colored by Discount Percentage')
plt.xlabel('Price ($)')
plt.ylabel('Rating (Quality Indicator)')
plt.xlim(0, 500)  # Limit for better visualization
plt.tight_layout()
plt.savefig('../images/Price-Quality Matrix with Popularity Indicators.png')
plt.show()

![Price-Quality Matrix](../images/Price-Quality%20Matrix%20with%20Popularity%20Indicators.png)

**Key Insights:**
- Visualizes price-quality relationship
- Discount patterns across price and quality
- Helps identify value-for-money products
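
One way to make "value-for-money" concrete is to split products into quadrants by price and rating. The sketch below uses median splits and hypothetical segment names; both are illustrative choices, not definitions used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "discounted_price": [10, 20, 200, 300, 15, 250],
    "rating":           [4.6, 3.5, 4.7, 3.2, 4.4, 4.5],
})

# Median splits are an illustrative threshold choice
price_cut = toy["discounted_price"].median()
rating_cut = toy["rating"].median()

cheap = toy["discounted_price"] <= price_cut
good = toy["rating"] >= rating_cut

# Assign each product to one of four price/rating quadrants
toy["segment"] = np.select(
    [cheap & good, ~cheap & good, cheap & ~good],
    ["value-for-money", "premium", "budget"],
    default="overpriced",
)
print(toy)
```

Other thresholds (for example per-category medians) may suit the data better, since price levels differ widely between categories.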

## Review Content Analysis

In [None]:
# Word cloud of review content
from wordcloud import WordCloud

# Combine all reviews
all_reviews = ' '.join(amazon_df['review_content'].dropna().astype(str))

# Generate and plot word cloud
wordcloud = WordCloud(width=800, height=400, 
                     background_color='white',
                     max_words=100,
                     colormap='viridis',
                     contour_width=1, 
                     contour_color='steelblue').generate(all_reviews)

plt.figure(figsize=(16, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Common Themes in Product Reviews', fontsize=20)
plt.tight_layout()
plt.savefig('../images/Common Themes in Product Reviews.png')
plt.show()

![Review Themes](../images/Common%20Themes%20in%20Product%20Reviews.png)

**Key Insights:**
- Common words and themes in reviews
- Product features frequently mentioned
- Customer concerns and priorities
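
A word cloud shows themes only visually; a plain frequency count gives the same information as numbers that can be compared or exported. A minimal sketch using the standard library, where `top_terms`, the tiny stop-word list, and the sample reviews are all illustrative assumptions:

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real analysis would use a fuller one
STOP_WORDS = {"the", "a", "an", "and", "is", "it", "for", "of", "to", "this", "very", "but"}

def top_terms(reviews, n=5):
    """Return the n most frequent non-stop-word tokens across reviews."""
    counts = Counter()
    for text in reviews:
        # Lowercase and keep alphabetic tokens (apostrophes allowed)
        tokens = re.findall(r"[a-z']+", str(text).lower())
        counts.update(t for t in tokens if t not in STOP_WORDS and len(t) > 2)
    return counts.most_common(n)

reviews = [
    "Good cable, charging is fast",
    "The cable stopped working, but charging was fast",
    "Fast delivery and good quality",
]
print(top_terms(reviews, n=3))
```

On the real data, the input would be `amazon_df['review_content'].dropna()`.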

## Feature Importance Analysis

In [None]:
# Define features for analysis
numeric_cols = ['discounted_price', 'actual_price', 'discount_pct']

# Prepare data
y = pd.to_numeric(amazon_df['rating'], errors='coerce').values  # non-numeric ratings become NaN
valid_indices = ~np.isnan(y)
y_clean = y[valid_indices]

# 1. Process numeric features
X_numeric = amazon_df[numeric_cols].fillna(0)[valid_indices]

# 2. Process categorical features
categorical_features = ['category']
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_categorical = encoder.fit_transform(amazon_df[categorical_features].fillna('Unknown')[valid_indices])
categorical_feature_names = encoder.get_feature_names_out(categorical_features)

# 3. Process text features (simplified - using only about_product)
# Limit features to control dimensionality
text_vectorizer = TfidfVectorizer(max_features=50, stop_words='english')
X_text = text_vectorizer.fit_transform(amazon_df['about_product'].fillna('')[valid_indices]).toarray()
text_feature_names = text_vectorizer.get_feature_names_out()

# Combine all features
X_combined = np.hstack([X_numeric, X_categorical, X_text])
all_feature_names = list(numeric_cols) + list(categorical_feature_names) + list(text_feature_names)

# Train model
model = RandomForestRegressor(random_state=42, n_estimators=100)
model.fit(X_combined, y_clean)

# Get top 20 features
importances = model.feature_importances_
indices = np.argsort(importances)[-20:]  # Get indices of top 20 features

# Plot
plt.figure(figsize=(12, 10))
plt.barh(range(len(indices)), importances[indices], align='center')
plt.yticks(range(len(indices)), [all_feature_names[i] for i in indices])
plt.title('Top 20 Features Influencing Product Rating')
plt.xlabel('Importance Score')
plt.tight_layout()
plt.savefig('../images/Top 20 Features Influencing Product Rating.png')
plt.show()

# Print top features
print("Top 20 features by importance:")
for i in indices[::-1]:
    print(f"{all_feature_names[i]}: {importances[i]:.4f}")

![Feature Importance](../images/Top%2020%20Features%20Influencing%20Product%20Rating.png)

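**Key Insights:**
- Ranks the features the random forest relies on most when predicting rating
- Price-related features rank among the most important
- Certain product categories carry more weight than others
- Terms from product descriptions also contribute to the predictions
- Importance scores measure predictive association, not causal influence on ratings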
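
Impurity-based `feature_importances_` are computed on the training data and can favor continuous or high-cardinality features, so a cross-check is useful. The sketch below shows permutation importance on a held-out split using synthetic data; the data, feature count, and hyperparameters are illustrative assumptions rather than the notebook's actual setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only feature 0 actually drives the target
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Shuffle each feature on held-out data and measure the drop in R^2
result = permutation_importance(
    model, X_test, y_test, n_repeats=5, random_state=42
)

# Report features from most to least important
for idx in np.argsort(result.importances_mean)[::-1]:
    print(f"feature_{idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```

Applied to the notebook's model, the same call would take a held-out slice of `X_combined` and `y_clean`, which requires splitting the data before fitting.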

## Conclusion

This exploration of the Amazon product dataset has revealed several key insights:

1. **Category Distribution**: Electronics and Computers dominate the dataset
2. **Rating Patterns**: Most products have high ratings (4+ stars)
3. **Price-Quality Relationship**: Price and quality don't always correlate directly
4. **Discount Strategies**: Discounting patterns vary by category and price point
5. **Feature Importance**: Price, category, and specific product features are the strongest predictors of ratings in the model

These insights will inform our vector search system design, helping create more effective product recommendations and search functionality.