# Amazon Best Sellers Web Scraping

As part of my web scraping learning journey, I use this notebook to scrape Amazon's Best Sellers pages across multiple categories and collect each product's ASIN, title, price, rating, and review count.

## Step 1: Import Required Libraries

In [4]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time, random

## Step 2: Define Extraction Functions

Now I will create helper functions to extract specific information from each product listing. These functions will handle cases where data might be missing.

In [62]:
# Extract the ASIN and build the product URL
def get_asin_url(s):
    try:
        asin = s.find(attrs={'data-asin': True})['data-asin']
        url = 'https://www.amazon.com/dp/' + asin
    except TypeError:  # find() returned None: no element carries data-asin
        asin = 'N/A'
        url = 'N/A'

    return [asin, url]

# Extract the rank badge text (e.g. '#11')
def get_rank(s):
    try:
        rank = s.find('span', class_='zg-bdg-text').text
    except AttributeError:  # find() returned None
        rank = 'N/A'

    return rank

# Extract the product title
def get_title(s):
    try:
        title = s.find('div', class_='_cDEzb_p13n-sc-css-line-clamp-3_g3dy1').text
    except AttributeError:  # find() returned None
        title = 'N/A'

    return title

# Extract the price text (e.g. '$14.39')
def get_price(s):
    try:
        price = s.find('span', class_='_cDEzb_p13n-sc-price_3mJ9Z').text
    except AttributeError:  # find() returned None
        price = 'N/A'

    return price

# Extract the review count (text of the first span with class 'a-size-small')
def get_review_count(s):
    try:
        review_count = s.find('span', class_='a-size-small').text
    except AttributeError:  # find() returned None
        review_count = 'N/A'

    return review_count

# Extract the rating text (e.g. '4.8 out of 5 stars')
def get_rating(s):
    try:
        rating = s.find('span', class_='a-icon-alt').text
    except AttributeError:  # find() returned None
        rating = 'N/A'

    return rating
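
The text-based helpers above share one pattern: find a tag, read its text, and fall back to 'N/A'. A minimal sketch of a single generic helper, assuming the node passed in behaves like a BeautifulSoup `Tag`; `safe_text` is a hypothetical name that is not part of the notebook, and the `.strip()` call is an addition the original helpers do not make:

```python
def safe_text(node, name, class_name, default='N/A'):
    """Return the stripped text of the first matching descendant, or `default`.

    `node` is assumed to expose BeautifulSoup's find(name, class_=...) method.
    """
    match = node.find(name, class_=class_name)
    if match is None:
        return default
    return match.text.strip()


# Hypothetical usage mirroring get_rank and get_rating:
# rank = safe_text(product, 'span', 'zg-bdg-text')
# rating = safe_text(product, 'span', 'a-icon-alt')
```

Checking for `None` explicitly avoids a `try`/`except` block altogether, which keeps unrelated errors visible.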

## Step 3: Set Up Proxy Configuration

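To reduce the chance of being blocked by Amazon, we send requests through a proxy picked at random from a pool of ten. Note that the cell below makes this choice once, when it runs, so every request in a scraping run goes through the same IP address until the cell is re-executed.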

- Get your 10 free proxies from [Webshare](https://www.webshare.io/).

In [63]:
import os
from dotenv import load_dotenv

# load environment variables
load_dotenv()

# Credentials
USERNAME = os.getenv("PROXY_USERNAME")
PASSWORD = os.getenv("PROXY_PASSWORD")

# Load the ten proxy addresses (PROXY_1 ... PROXY_10), skipping any that are unset
proxy_list = [os.getenv(f"PROXY_{i}") for i in range(1, 11)]
proxy_list = [p.strip() for p in proxy_list if p]

# Pick one at random
proxy_ip = random.choice(proxy_list)

proxies = {
    "http":  f"http://{USERNAME}:{PASSWORD}@{proxy_ip}/",
    "https": f"http://{USERNAME}:{PASSWORD}@{proxy_ip}/"
}
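
A minimal sketch of choosing a proxy for each request rather than once per run, assuming the same `USERNAME`, `PASSWORD`, and `proxy_list` values as above. `build_proxies` and `pick_proxies` are hypothetical helper names, not part of the notebook:

```python
import random


def build_proxies(username, password, proxy_ip):
    """Build the proxies mapping that requests expects for one proxy address."""
    proxy_url = f"http://{username}:{password}@{proxy_ip.strip()}/"
    return {"http": proxy_url, "https": proxy_url}


def pick_proxies(username, password, proxy_list, rng=random):
    """Pick a random configured proxy, skipping entries that are None or empty."""
    available = [p for p in proxy_list if p and p.strip()]
    if not available:
        raise ValueError("no proxy addresses configured")
    return build_proxies(username, password, rng.choice(available))
```

Calling `pick_proxies(USERNAME, PASSWORD, proxy_list)` inside the request loop and passing the result as `proxies=` would let each request go through its own randomly chosen proxy.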

## Step 4: Main Scraping Function

This function will iterate through our list of URLs, scrape each page, and compile the data into a DataFrame.

In [64]:
def get_data(URL_list):
    df = pd.DataFrame()

    for url in URL_list:
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
            "Referer": "https://www.google.com/",
        }

        try:
            webpage = requests.get(url, headers=headers, proxies=proxies, timeout=15)
            print("success" if webpage.status_code == 200 else f"failed ({webpage.status_code})")

            s = BeautifulSoup(webpage.content, 'html.parser')

            products = s.find('ol', class_='a-ordered-list a-vertical p13n-gridRow _cDEzb_grid-row_3Cywl').find_all('li')
            category = s.find_all('h1')[1].text.replace('Best Sellers in ', '')

            d = {'asin': [], 'url': [], 'category': [], 'rank': [], 'title': [], 'price': [], 'review_count': [], 'rating': []}

            for product in products:
                asin, url_link = get_asin_url(product)
                d['asin'].append(asin)
                d['url'].append(url_link)
                d['category'].append(category)
                d['rank'].append(get_rank(product))
                d['title'].append(get_title(product))
                d['price'].append(get_price(product))
                d['review_count'].append(get_review_count(product))
                d['rating'].append(get_rating(product))

            temp_df = pd.DataFrame(d)
            df = pd.concat([df, temp_df], ignore_index=True)

        except Exception as e:
            print(f"Error for URL {url}: {e}")

        time.sleep(random.uniform(1.5, 4.5))

    return df
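
In `get_data`, a non-200 response is still handed to the parser, which is why each `failed (429)` line in the logs is followed by a `'NoneType' object has no attribute 'find_all'` error. A minimal sketch of checking the status first and retrying with exponential backoff. `fetch_with_backoff` is a hypothetical helper, and `fetch` stands for any callable that returns an object with a `status_code` attribute:

```python
import random
import time


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until it returns status 200 or the attempts run out.

    Waits base_delay * 2**attempt seconds plus up to 1 s of random jitter
    between attempts. Returns the 200 response, or None if every attempt failed.
    """
    for attempt in range(max_attempts):
        response = fetch(url)
        if response.status_code == 200:
            return response
        if attempt < max_attempts - 1:
            # Back off before the next attempt; no wait after the last one.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
    return None
```

With requests, `fetch` could be `lambda u: requests.get(u, headers=headers, proxies=proxies, timeout=15)`, and the scraping loop would skip parsing whenever the helper returns `None`.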

## Step 5: First Scraping Run - All Categories

Let's start scraping! We'll attempt to scrape all 37 Amazon Best Seller categories.

In [33]:
URL = [
    'https://www.amazon.com/Best-Sellers-Amazon-Devices-Accessories/zgbs/amazon-devices/ref=zg_bs_nav_amazon-devices_0',
    'https://www.amazon.com/Best-Sellers-Amazon-Renewed/zgbs/amazon-renewed/ref=zg_bs_nav_amazon-renewed_0',
    'https://www.amazon.com/Best-Sellers-Appliances/zgbs/appliances/ref=zg_bs_nav_appliances_0',
    'https://www.amazon.com/Best-Sellers-Apps-Games/zgbs/mobile-apps/ref=zg_bs_nav_mobile-apps_0',
    'https://www.amazon.com/Best-Sellers-Arts-Crafts-Sewing/zgbs/arts-crafts/ref=zg_bs_nav_arts-crafts_0',
    'https://www.amazon.com/Best-Sellers-Audible-Books-Originals/zgbs/audible/ref=zg_bs_nav_audible_0',
    'https://www.amazon.com/Best-Sellers-Automotive/zgbs/automotive/ref=zg_bs_nav_automotive_0',
    'https://www.amazon.com/Best-Sellers-Baby/zgbs/baby-products/ref=zg_bs_nav_baby-products_0',
    'https://www.amazon.com/Best-Sellers-Beauty-Personal-Care/zgbs/beauty/ref=zg_bs_nav_beauty_0',
    'https://www.amazon.com/best-sellers-books-Amazon/zgbs/books/ref=zg_bs_nav_books_0',
    'https://www.amazon.com/best-sellers-camera-photo/zgbs/photo/ref=zg_bs_nav_photo_0',
    'https://www.amazon.com/best-sellers-music-albums/zgbs/music/ref=zg_bs_nav_music_0',
    'https://www.amazon.com/Best-Sellers-Cell-Phones-Accessories/zgbs/wireless/ref=zg_bs_nav_wireless_0',
    'https://www.amazon.com/Best-Sellers-Clothing-Shoes-Jewelry/zgbs/fashion/ref=zg_bs_nav_fashion_0',
    'https://www.amazon.com/Best-Sellers-Collectible-Coins/zgbs/coins/ref=zg_bs_nav_coins_0',
    'https://www.amazon.com/Best-Sellers-Computers-Accessories/zgbs/pc/ref=zg_bs_nav_pc_0',
    'https://www.amazon.com/Best-Sellers-Electronics/zgbs/electronics/ref=zg_bs_nav_electronics_0',
    'https://www.amazon.com/Best-Sellers-Entertainment-Collectibles/zgbs/entertainment-collectibles/ref=zg_bs_nav_entertainment-collectibles_0',
    'https://www.amazon.com/Best-Sellers-Gift-Cards/zgbs/gift-cards/ref=zg_bs_nav_gift-cards_0',
    'https://www.amazon.com/Best-Sellers-Grocery-Gourmet-Food/zgbs/grocery/ref=zg_bs_nav_grocery_0',
    'https://www.amazon.com/Best-Sellers-Handmade-Products/zgbs/handmade/ref=zg_bs_nav_handmade_0',
    'https://www.amazon.com/Best-Sellers-Health-Household/zgbs/hpc/ref=zg_bs_nav_hpc_0',
    'https://www.amazon.com/Best-Sellers-Home-Kitchen/zgbs/home-garden/ref=zg_bs_nav_home-garden_0',
    'https://www.amazon.com/Best-Sellers-Industrial-Scientific/zgbs/industrial/ref=zg_bs_nav_industrial_0',
    'https://www.amazon.com/Best-Sellers-Kitchen-Dining/zgbs/kitchen/ref=zg_bs_nav_kitchen_0',
    'https://www.amazon.com/best-sellers-movies-TV-DVD-Blu-ray/zgbs/movies-tv/ref=zg_bs_nav_movies-tv_0',
    'https://www.amazon.com/Best-Sellers-Musical-Instruments/zgbs/musical-instruments/ref=zg_bs_nav_musical-instruments_0',
    'https://www.amazon.com/Best-Sellers-Office-Products/zgbs/office-products/ref=zg_bs_nav_office-products_0',
    'https://www.amazon.com/Best-Sellers-Patio-Lawn-Garden/zgbs/lawn-garden/ref=zg_bs_nav_lawn-garden_0',
    'https://www.amazon.com/Best-Sellers-Pet-Supplies/zgbs/pet-supplies/ref=zg_bs_nav_pet-supplies_0',
    'https://www.amazon.com/best-sellers-software/zgbs/software/ref=zg_bs_nav_software_0',
    'https://www.amazon.com/Best-Sellers-Sports-Outdoors/zgbs/sporting-goods/ref=zg_bs_nav_sporting-goods_0',
    'https://www.amazon.com/Best-Sellers-Sports-Collectibles/zgbs/sports-collectibles/ref=zg_bs_nav_sports-collectibles_0',
    'https://www.amazon.com/Best-Sellers-Tools-Home-Improvement/zgbs/hi/ref=zg_bs_nav_hi_0',
    'https://www.amazon.com/Best-Sellers-Toys-Games/zgbs/toys-and-games/ref=zg_bs_nav_toys-and-games_0',
    'https://www.amazon.com/Best-Sellers-Unique-Finds/zgbs/boost/ref=zg_bs_nav_boost_0',
    'https://www.amazon.com/best-sellers-video-games/zgbs/videogames/ref=zg_bs_nav_videogames_0',
]

df = get_data(URL)

success
success
failed (429)
Error for URL https://www.amazon.com/Best-Sellers-Appliances/zgbs/appliances/ref=zg_bs_nav_appliances_0: 'NoneType' object has no attribute 'find_all'
success
success
failed (429)
Error for URL https://www.amazon.com/Best-Sellers-Audible-Books-Originals/zgbs/audible/ref=zg_bs_nav_audible_0: 'NoneType' object has no attribute 'find_all'
success
success
success
failed (429)
Error for URL https://www.amazon.com/best-sellers-books-Amazon/zgbs/books/ref=zg_bs_nav_books_0: 'NoneType' object has no attribute 'find_all'
success
success
success
success
failed (429)
Error for URL https://www.amazon.com/Best-Sellers-Collectible-Coins/zgbs/coins/ref=zg_bs_nav_coins_0: 'NoneType' object has no attribute 'find_all'
success
success
success
failed (429)
Error for URL https://www.amazon.com/Best-Sellers-Gift-Cards/zgbs/gift-cards/ref=zg_bs_nav_gift-cards_0: 'NoneType' object has no attribute 'find_all'
failed (429)
Error for URL https://www.amazon.com/Best-Sellers-Grocery-G

### Inspect the Results

Let's take a quick look at the data we've collected so far.

In [40]:
df.sample(2)

|     | asin | url | category | rank | title | price | review_count | rating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 610 | B001GAOTSW | https://www.amazon.com/dp/B001GAOTSW | Office Products | #11 | Pilot G2 Premium Gel Roller Pens 0.7, Fine Poi... | $14.39 | 50732 | 4.8 out of 5 stars |
| 224 | B07R53W4P6 | https://www.amazon.com/dp/B07R53W4P6 | Camera & Photo Products | #15 | Binoculars for Kids, Girls or Boys Real Kids B... | $18.74 | 1010 | 4.6 out of 5 stars |


### Save First Batch

Let's save this data as a backup before continuing.

In [35]:
df.to_csv('best_sellers_1.csv')

## Step 6: Retry Failed URLs - Round 1

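Several URLs failed in the first attempt. The failures visible in the log are HTTP 429 (Too Many Requests) responses, meaning Amazon rate-limited those requests. Let's retry them to collect more data.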

In [46]:
failed_urls_1 = [
    "https://www.amazon.com/Best-Sellers-Appliances/zgbs/appliances/ref=zg_bs_nav_appliances_0",
    "https://www.amazon.com/Best-Sellers-Audible-Books-Originals/zgbs/audible/ref=zg_bs_nav_audible_0",
    "https://www.amazon.com/best-sellers-books-Amazon/zgbs/books/ref=zg_bs_nav_books_0",
    "https://www.amazon.com/Best-Sellers-Collectible-Coins/zgbs/coins/ref=zg_bs_nav_coins_0",
    "https://www.amazon.com/Best-Sellers-Gift-Cards/zgbs/gift-cards/ref=zg_bs_nav_gift-cards_0",
    "https://www.amazon.com/Best-Sellers-Grocery-Gourmet-Food/zgbs/grocery/ref=zg_bs_nav_grocery_0",
    "https://www.amazon.com/Best-Sellers-Home-Kitchen/zgbs/home-garden/ref=zg_bs_nav_home-garden_0",
    "https://www.amazon.com/Best-Sellers-Patio-Lawn-Garden/zgbs/lawn-garden/ref=zg_bs_nav_lawn-garden_0",
    "https://www.amazon.com/Best-Sellers-Tools-Home-Improvement/zgbs/hi/ref=zg_bs_nav_hi_0",
    "https://www.amazon.com/Best-Sellers-Unique-Finds/zgbs/boost/ref=zg_bs_nav_boost_0",
    "https://www.amazon.com/best-sellers-video-games/zgbs/videogames/ref=zg_bs_nav_videogames_0"
]

df_2 = get_data(failed_urls_1)

failed (429)
Error for URL https://www.amazon.com/Best-Sellers-Appliances/zgbs/appliances/ref=zg_bs_nav_appliances_0: 'NoneType' object has no attribute 'find_all'
failed (429)
Error for URL https://www.amazon.com/Best-Sellers-Audible-Books-Originals/zgbs/audible/ref=zg_bs_nav_audible_0: 'NoneType' object has no attribute 'find_all'
success
success
success
failed (429)
Error for URL https://www.amazon.com/Best-Sellers-Grocery-Gourmet-Food/zgbs/grocery/ref=zg_bs_nav_grocery_0: 'NoneType' object has no attribute 'find_all'
failed (429)
Error for URL https://www.amazon.com/Best-Sellers-Home-Kitchen/zgbs/home-garden/ref=zg_bs_nav_home-garden_0: 'NoneType' object has no attribute 'find_all'
success
failed (429)
Error for URL https://www.amazon.com/Best-Sellers-Tools-Home-Improvement/zgbs/hi/ref=zg_bs_nav_hi_0: 'NoneType' object has no attribute 'find_all'
failed (429)
Error for URL https://www.amazon.com/Best-Sellers-Unique-Finds/zgbs/boost/ref=zg_bs_nav_boost_0: 'NoneType' object has no at
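
The retry rounds in Steps 6 to 8 copy the failed URLs out of the log by hand. A minimal sketch of collecting them automatically, assuming a hypothetical `scrape_one(url)` callable that returns the scraped rows for one page, or `None` when the page could not be scraped:

```python
def scrape_in_rounds(urls, scrape_one, max_rounds=4):
    """Scrape every URL, retrying the failures in later rounds.

    Returns (results, remaining): results maps each successful URL to
    whatever scrape_one returned; remaining lists URLs that never succeeded.
    """
    results = {}
    remaining = list(urls)
    for _ in range(max_rounds):
        if not remaining:
            break
        failed = []
        for url in remaining:
            rows = scrape_one(url)
            if rows is None:
                failed.append(url)   # try again in the next round
            else:
                results[url] = rows
        remaining = failed
    return results, remaining
```

Adding a pause between rounds would be a natural extension, since the failures here come from rate limiting.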

### Check Second Batch Results

In [48]:
df_2.head(2)

|     | asin | url | category | rank | title | price | review_count | rating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1668205874 | https://www.amazon.com/dp/1668205874 | Books | #1 |  | $22.38 | 108 | 4.9 out of 5 stars |
| 1 | 593296966 | https://www.amazon.com/dp/0593296966 | Books | #2 |  | $29.40 | 42 | 4.4 out of 5 stars |


### Save Second Batch

In [49]:
df_2.to_csv('best_sellers_2.csv')

## Step 7: Retry Failed URLs - Round 2

Six URLs still failed. Let's give them another try.

In [50]:
failed_urls_2 = [
    "https://www.amazon.com/Best-Sellers-Appliances/zgbs/appliances/ref=zg_bs_nav_appliances_0",
    "https://www.amazon.com/Best-Sellers-Audible-Books-Originals/zgbs/audible/ref=zg_bs_nav_audible_0",
    "https://www.amazon.com/Best-Sellers-Grocery-Gourmet-Food/zgbs/grocery/ref=zg_bs_nav_grocery_0",
    "https://www.amazon.com/Best-Sellers-Home-Kitchen/zgbs/home-garden/ref=zg_bs_nav_home-garden_0",
    "https://www.amazon.com/Best-Sellers-Tools-Home-Improvement/zgbs/hi/ref=zg_bs_nav_hi_0",
    "https://www.amazon.com/Best-Sellers-Unique-Finds/zgbs/boost/ref=zg_bs_nav_boost_0"
]

df_3 = get_data(failed_urls_2)

success
success
success
success
failed (429)
Error for URL https://www.amazon.com/Best-Sellers-Tools-Home-Improvement/zgbs/hi/ref=zg_bs_nav_hi_0: 'NoneType' object has no attribute 'find_all'
success


### Check Third Batch Results

In [52]:
df_3.sample(2)

|     | asin | url | category | rank | title | price | review_count | rating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 128 | B00N31CQ5A | https://www.amazon.com/dp/B00N31CQ5A | Unique Finds | #9 | RIVER OF GOODS Stained Glass Bird Accent Lamp ... | $68.99 | 544 | 4.4 out of 5 stars |
| 84 | B008JA73RG | https://www.amazon.com/dp/B008JA73RG | Grocery & Gourmet Food | #25 | V8 Energy Peach Mango Energy Drink, 8 fl oz Ca... |  | 51576 | 4.6 out of 5 stars |


### Save Third Batch

In [53]:
df_3.to_csv('best_sellers_3.csv')

## Step 8: Final Retry - Last Remaining URL

One more URL to complete our dataset!

In [55]:
failed_urls_3 = [
    "https://www.amazon.com/Best-Sellers-Tools-Home-Improvement/zgbs/hi/ref=zg_bs_nav_hi_0",
]

df_4 = get_data(failed_urls_3)

success


### Verify Final Batch

In [58]:
df_4.sample(2)

|     | asin | url | category | rank | title | price | review_count | rating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 18 | B09V366BDY | https://www.amazon.com/dp/B09V366BDY | Tools & Home Improvement | #19 | KSIPZE 100ft Led Strip Lights RGB Music Sync C... | $9.98 | 32837 | 4.4 out of 5 stars |
| 7 | B00UXG4WR8 | https://www.amazon.com/dp/B00UXG4WR8 | Tools & Home Improvement | #8 | everydrop by Whirlpool Ice and Water Refrigera... | $57.00 | 105209 | 4.7 out of 5 stars |


### Save Fourth Batch

In [59]:
df_4.to_csv('best_sellers_4.csv')

## Step 9: Combine All Data

Now let's merge all four batches of data into a single comprehensive dataset.

In [60]:
df_final = pd.concat([df, df_2, df_3, df_4], ignore_index=True)

### Check Total Dataset Size

In [61]:
df_final.shape

(1104, 8)

### Exporting Combined Dataset

In [62]:
df_final.to_csv("Best-Sellers-Amazon.csv")

## Step 10: Data Cleaning - Load and Inspect

Now that we have all the raw data, let's load it and start the cleaning process.

In [125]:
df = pd.read_csv('Best-Sellers-Amazon.csv')

### Remove Unnecessary Columns

The 'Unnamed: 0' column is the old DataFrame index: `to_csv` writes the index by default, and `read_csv` reads it back as an unnamed column. It isn't needed.

In [126]:
df.drop(columns=['Unnamed: 0'], inplace=True)
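
A minimal sketch of avoiding the manual drop, using a small in-memory CSV in place of the real file. `index_col=0` tells `read_csv` to use the first column as the index, and `dtype={"asin": str}` keeps all-digit ASINs such as `0593296966` as text so the leading zero survives. The sample rows are illustrative stand-ins:

```python
import io

import pandas as pd

# Stand-in for a file written by df.to_csv(...) with the default index=True.
csv_text = (
    ",asin,category\n"
    "0,0593296966,Books\n"
    "1,1668205874,Books\n"
)

# Default read: the saved index comes back as 'Unnamed: 0', asin becomes an integer.
default_read = pd.read_csv(io.StringIO(csv_text))

# index_col=0 restores the index; dtype keeps asin as a string.
clean_read = pd.read_csv(io.StringIO(csv_text), index_col=0, dtype={"asin": str})

print(list(default_read.columns))
print(list(clean_read.columns))
print(clean_read.loc[0, "asin"])
```

Writing the file with `index=False`, as the final export in Step 14 does, is another way to keep the extra column from appearing.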

In [127]:
df.shape

(1104, 8)

### Sample the Data

Let's look at a random sample to understand the data quality.

In [99]:
df.sample(10)

|     | asin | url | category | rank | title | price | review_count | rating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 510 | B0BZYCJK89 | https://www.amazon.com/dp/B0BZYCJK89 | Kitchen & Dining | #1 | Owala FreeSip Insulated Stainless Steel Water ... | EUR 25.71 | 92263.0 | 4.7 out of 5 stars |
| 351 | B0DLBTPDCS | https://www.amazon.com/dp/B0DLBTPDCS | Computers & Accessories | #22 | Apple 2024 Mac mini Desktop Computer with M4 c... |  | 1672.0 | 4.8 out of 5 stars |
| 516 | B07YP2VH4B | https://www.amazon.com/dp/B07YP2VH4B | Kitchen & Dining | #7 | KitchenAid Classic Multifunction Can Opener an... | EUR 12.85 | 92591.0 | 4.6 out of 5 stars |
| 428 | B0CLHTKY3V | https://www.amazon.com/dp/B0CLHTKY3V | Handmade Products | #9 | Moisturizing Tallow Lip Balm – Grass-Fed Beef ... | $13.99 | 1024.0 | 4.7 out of 5 stars |
| 833 | B0DT2GNZ74 | https://www.amazon.com/dp/B0DT2GNZ74 | Collectible Coins | #24 | Aizics Mint President Donald Trump 45 47 2025 ... | $12.98 | 176.0 | 4.8 out of 5 stars |
| 836 | B002NV5N5Q | https://www.amazon.com/dp/B002NV5N5Q | Collectible Coins | #27 | The Last 25 Years of Lincoln Wheat Penny Colle... |  | 771.0 | 4.5 out of 5 stars |
| 1005 | B01GQ5GQEG | https://www.amazon.com/dp/B01GQ5GQEG | Grocery & Gourmet Food | #16 | Goldfish Crackers Big Smiles Variety Pack with... |  | 98942.0 | 4.8 out of 5 stars |
| 416 | B00VL7A5HO | https://www.amazon.com/dp/B00VL7A5HO | Entertainment Collectibles | #27 |  |  |  |  |
| 291 | B0C8HHV9DK | https://www.amazon.com/dp/B0C8HHV9DK | Cell Phones & Accessories | #22 | Anker iPhone 17/16 Charger, 2-Pack 20W Fast US... |  | 21886.0 | 4.6 out of 5 stars |
| 855 | B079YMX2J6 | https://www.amazon.com/dp/B079YMX2J6 | Gift Cards | #16 | Uber eGift Card | $15.00 | 24416.0 | 4.7 out of 5 stars |


## Step 11: Identify Data Quality Issues

After inspecting the sample, we can see the dataset has several quality issues that need to be addressed.

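- The data has many NaN values and some wrong entries. The NaN values include the 'N/A' placeholders written by the extraction functions, which `read_csv` treats as missing by default.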

### Check Missing Values

In [128]:
df.isnull().sum()

asin              0
url               0
category          0
rank              0
title           232
price           480
review_count     22
rating           33
dtype: int64

- There are no missing values in the 'asin', 'url', 'category', and 'rank' columns.

### Examine Rating Column Issues

In [129]:
df.rating.unique()

array(['4.4 out of 5 stars', '4.7 out of 5 stars', '4.6 out of 5 stars',
       '4.2 out of 5 stars', '4.3 out of 5 stars', '4.5 out of 5 stars',
       '4.1 out of 5 stars', '4.0 out of 5 stars', '3.8 out of 5 stars',
       '3.9 out of 5 stars', '3.7 out of 5 stars', '1.0 out of 5 stars',
       '2.8 out of 5 stars', '2.5 out of 5 stars', '2.4 out of 5 stars',
       nan, '3.5 out of 5 stars', '2.6 out of 5 stars',
       '2.0 out of 5 stars', '3.0 out of 5 stars', '5.0 out of 5 stars',
       '4.8 out of 5 stars', '4.9 out of 5 stars', '3.6 out of 5 stars',
       '3.3 out of 5 stars', '3.2 out of 5 stars'], dtype=object)

- The rating column contains NaN alongside strings such as '4.5 out of 5 stars', so the numeric rating still has to be extracted from the text.
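
A minimal sketch of pulling the number out of these rating strings with a regular expression. `parse_rating` is a hypothetical helper that is not part of the notebook, and it assumes every valid rating follows the 'X out of 5 stars' pattern seen in the output:

```python
import re

_RATING_RE = re.compile(r"^\s*(\d+(?:\.\d+)?) out of 5 stars\s*$")


def parse_rating(value):
    """Return the numeric rating from text like '4.5 out of 5 stars', else None."""
    if not isinstance(value, str):
        return None  # covers NaN (a float) and None
    match = _RATING_RE.match(value)
    return float(match.group(1)) if match else None
```

It could be applied column-wide with `df['rating'].apply(parse_rating)`.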

### Examine Review Count Column Issues

In [130]:
df.review_count.unique()

array(['262,732', '42,101', '89,267', '29,335', '173,673', '64,517',
       '98,231', '12,148', '12,719', '8,284', '15,582', '33,703',
       '32,841', '33,501', '9,910', '55,419', '3,138', '3,058', '47,372',
       '40,067', '124,349', '26,263', '570,506', '20,454', '7,291',
       '1,613', '28,608', '882', '7,259', '304,665', '31,856', '14,484',
       '17,180', '19,000', '56,653', '4,446', '3,671', '450', '15,856',
       '2,801', '3,470', '9,644', '3,911', '3,111', '29,650', '6,571',
       '1,679', '4,114', '6,154', 'Amazon Renewed', '17,197', '5,329',
       '163', '65,043', '11,463', '1,750', '1,868', '2,097', '877',
       '12,456', 'Mojang', 'DZENDEV LIMITED', 'HyperSim Interactive',
       'RobTop Games', 'Serra Ayan', 'DEVABRIKOS LIMITED',
       'Prism Pioneers', 'RC Studios', 'Ringirout', 'Swifte games',
       'Broken Glass Games', 'Open World Offline Games',
       'AppSynergy Creations', 'EpicByte Studio', 'The Games Forest',
       'inovelapps', 'The Game Weaver', 'Red

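- The review_count column contains NaN, counts such as '1,234 ratings', and names such as 'Eric Trump' and 'Andy Weir'. The names are presumably authors or developers, picked up because `get_review_count` takes the first `span` with class `a-size-small` in each listing.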
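
A minimal sketch of turning these values into integers while rejecting the non-numeric ones. `parse_review_count` is a hypothetical helper, and the accepted formats (plain digits, comma-grouped digits, and an optional trailing 'ratings') are assumptions based on the values shown above:

```python
import re

# Digits with optional thousands separators, optionally followed by 'rating(s)'.
_COUNT_RE = re.compile(r"^(\d{1,3}(?:,\d{3})*|\d+)(?:\s+ratings?)?$")


def parse_review_count(value):
    """Return an int for text like '262,732' or '882'; None for anything else."""
    if not isinstance(value, str):
        return None  # covers NaN (a float) and None
    match = _COUNT_RE.match(value.strip())
    if not match:
        return None  # e.g. a developer or author name such as 'Mojang'
    return int(match.group(1).replace(",", ""))
```

Rows where the helper returns `None` can then be treated as missing instead of carrying a name in a numeric column.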

### Examine Price Column Issues

In [131]:
df.price.unique()

array(['$11.99', nan, '$80.00', '$507.40', '$115.00', '$284.95', '$58.99',
       '$38.99', '$6.99', '$9.99', '$14.99', '$2.99', '$7.99', '$4.99',
       '$8.99', '$13.99', '$12.99', '$13.89', '$19.99', '$2.97', '$5.99',
       '$9.97', '$20.99', '$9.69', '$11.31', '$8.49', '$5.98', '$11.24',
       '$13.49', '$15.31', '$6.92', '$9.49', '$3.99', '$3.69', '$0.58',
       '$7.55', '$15.78', '$16.99', '$31.98', '$9.59', '$23.99', '$29.99',
       '$11.97', '$29.95', '$10.92', '$18.99', '$52.99', '$11.49',
       '$41.98', '$9.94', '$44.97', '$16.97', '$36.75', '$28.22',
       '$51.77', '$35.00', '$34.23', '$65.47', '$47.20', '$19.97',
       '$28.49', 'EUR\xa011.14', 'EUR\xa015.39', 'EUR\xa08.55',
       'EUR\xa015.34', 'EUR\xa08.56', 'EUR\xa016.29', 'EUR\xa016.24',
       'EUR\xa013.56', 'EUR\xa018.86', 'EUR\xa011.12', 'EUR\xa05.90',
       'EUR\xa03.43', 'EUR\xa021.42', 'EUR\xa05.14', 'EUR\xa017.13',
       '$35.99', '$28.78', '$33.89', '$99.74', '$149.00', '$205.00',
       '$21.12', 

- The price column contains NaN, USD prices such as '$12.34', and non-USD prices such as 'EUR\xa085.72' and 'GBP\xa09.14', where `\xa0` is a non-breaking space.
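
A minimal sketch of splitting each price string into a currency code and a numeric amount, so the original currency is kept alongside the value. `split_price` is a hypothetical helper, and it assumes the only currency markers are the three seen in this dataset:

```python
import re

# Currency marker, optional whitespace (\s also matches the non-breaking space), amount.
_PRICE_RE = re.compile(r"^\s*(\$|EUR|GBP)\s*([\d,]+(?:\.\d+)?)\s*$")


def split_price(value):
    """Return (currency, amount) for a price string, or (None, None) if unparseable."""
    if not isinstance(value, str):
        return (None, None)  # covers NaN (a float) and None
    match = _PRICE_RE.match(value)
    if not match:
        return (None, None)
    currency = "USD" if match.group(1) == "$" else match.group(1)
    amount = float(match.group(2).replace(",", ""))
    return (currency, amount)
```

Keeping the currency in its own column makes it possible to redo the conversion later with different exchange rates.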

## Step 12: Data Cleaning Strategy

Since we have URLs for all products, we could potentially re-scrape missing data from the product pages. However, for now, we'll focus on cleaning what we have.

### Remove Rows with Missing Values

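Let's drop every row that has a missing value in any column. This is a blunt approach that discards more than half of the 1,104 rows, but it leaves only complete records.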

In [132]:
df.dropna(inplace=True)

In [133]:
df.shape

(439, 8)
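
A minimal sketch, on a small made-up frame, of counting the missing cells per column before dropping anything, and of restricting `dropna` to chosen columns with its `subset` parameter. The rows below are illustrative, not scraped data:

```python
import pandas as pd

# Made-up rows that imitate the missing-value pattern.
toy = pd.DataFrame({
    "asin": ["A1", "A2", "A3", "A4"],
    "title": ["Pen", None, "Lamp", "Mug"],
    "price": ["$1.00", "$2.00", None, None],
})

missing_per_column = toy.isna().sum()        # missing cells per column
strict = toy.dropna()                        # keep fully populated rows only
keep_titled = toy.dropna(subset=["title"])   # require a title, tolerate a missing price

print(missing_per_column.to_dict())
print(len(strict), len(keep_titled))
```

Whether a looser rule is appropriate depends on which columns the later analysis actually needs.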

### Sample Cleaned Data

In [106]:
df.sample(5)

|     | asin | url | category | rank | title | price | review_count | rating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 570 | B0777S56RB | https://www.amazon.com/dp/B0777S56RB | Musical Instruments | #1 | FogWorx Extreme High Density Fog Juice - Quart... | $17.99 | 21262 | 4.6 out of 5 stars |
| 772 | B09GWCRSQV | https://www.amazon.com/dp/B09GWCRSQV | Toys & Games | #23 | Crayola Washable Kids Paint (6ct), Essential P... | EUR 4.53 | 17124 | 4.7 out of 5 stars |
| 327 | B0D4YCXF8Q | https://www.amazon.com/dp/B0D4YCXF8Q | Clothing, Shoes & Jewelry | #28 | AUTOMET Long Sleeve Shirts Basic Tops | $9.99 | 2028 | 4.5 out of 5 stars |
| 826 | B07RYNCNB9 | https://www.amazon.com/dp/B07RYNCNB9 | Collectible Coins | #17 | 2 oz .999 Pure Copper Medallion (Lincoln Wheat) | $9.99 | 146 | 4.7 out of 5 stars |
| 233 | B07GF4JCDY | https://www.amazon.com/dp/B07GF4JCDY | Camera & Photo Products | #24 | Adorrgon 12x42 HD Binoculars for Adults High P... | $41.43 | 22067 | 4.4 out of 5 stars |


In [134]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 439 entries, 0 to 1101
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   asin          439 non-null    object
 1   url           439 non-null    object
 2   category      439 non-null    object
 3   rank          439 non-null    object
 4   title         439 non-null    object
 5   price         439 non-null    object
 6   review_count  439 non-null    object
 7   rating        439 non-null    object
dtypes: object(8)
memory usage: 30.9+ KB


- I noticed that the wrong entries appear only in rows that also have a missing value in at least one of four columns: title, price, review_count, or rating. Dropping the rows with missing values therefore removed the wrong entries as well.

In [135]:
df['price'].unique() # check unique values in price column

array(['$11.99', '$80.00', '$507.40', '$115.00', '$284.95', '$38.99',
       '$2.97', '$5.99', '$9.97', '$20.99', '$9.69', '$11.31', '$8.49',
       '$5.98', '$11.24', '$13.49', '$15.31', '$6.92', '$7.99', '$9.49',
       '$9.99', '$8.99', '$3.99', '$6.99', '$3.69', '$0.58', '$7.55',
       '$15.78', '$14.99', '$16.99', '$31.98', '$9.59', '$23.99',
       '$29.99', '$11.97', '$29.95', '$10.92', '$18.99', '$52.99',
       '$11.49', '$41.98', '$9.94', '$44.97', '$16.97', '$36.75',
       '$28.22', '$51.77', '$35.00', '$34.23', '$65.47', '$47.20',
       '$19.97', '$28.49', '$13.89', 'EUR\xa011.14', 'EUR\xa015.39',
       'EUR\xa08.55', 'EUR\xa015.34', 'EUR\xa08.56', 'EUR\xa016.29',
       'EUR\xa016.24', 'EUR\xa013.56', 'EUR\xa018.86', 'EUR\xa011.12',
       'EUR\xa05.90', 'EUR\xa03.43', 'EUR\xa021.42', 'EUR\xa05.14',
       'EUR\xa017.13', '$35.99', '$28.78', '$33.89', '$99.74', '$149.00',
       '$205.00', '$21.12', '$19.99', '$13.58', '$56.51', '$28.48',
       '$18.74', '$15.98', '$9

- Price values such as GBP\xa05.95 and EUR\xa012.34 need to be converted to USD.

In [137]:
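# Fixed exchange rates used in this notebook (not live rates)
EUR_TO_USD = 1.17
GBP_TO_USD = 1.34

def convert_to_usd(price):
    """Convert a '$', 'EUR', or 'GBP' price string to a USD float with 2 decimals."""
    if pd.isna(price):
        return None

    # Drop thousands separators; strip() below also removes the non-breaking space (\xa0)
    price = price.replace(',', '')
    if '$' in price:
        return float(price.replace('$', '').strip())
    elif 'EUR' in price:
        return round(float(price.replace('EUR', '').strip()) * EUR_TO_USD, 2)
    elif 'GBP' in price:
        return round(float(price.replace('GBP', '').strip()) * GBP_TO_USD, 2)
    return None

# Convert prices
df['price'] = df['price'].apply(convert_to_usd)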

In [141]:
df['price']

0        11.99
32       80.00
42      507.40
44      115.00
48      284.95
         ...  
1094      9.99
1095     12.34
1096      8.99
1100      9.99
1101      5.49
Name: price, Length: 439, dtype: float64

In [None]:
df['review_count'].unique() # check unique values in review_count column

array(['262,732', '17,180', '3,911', '29,650', '6,154', '5,329', '25,259',
       '1,841', '5,290', '12,491', '3,535', '25,935', '83,492', '58,732',
       '17,143', '14,903', '4,001', '5,105', '56,330', '8,391', '11,418',
       '39,327', '2,805', '6,649', '2,211', '42,606', '100,123', '9,072',
       '43,448', '60,805', '115,630', '18,014', '7,544', '14,133',
       '21,405', '102,783', '82,700', '58,534', '60,925', '22,603',
       '57,479', '77,400', '32,074', '28,098', '6,535', '45,429',
       '28,324', '234,201', '102,994', '83,991', '72,738', '46,403',
       '102,198', '98,086', '24,671', '4,077', '6,741', '29,784',
       '29,622', '39,537', '38,462', '67,593', '5,892', '795', '54,497',
       '177,870', '40,254', '48,398', '113,612', '25,549', '120,913',
       '12,948', '34,241', '6,797', '136,371', '137,398', '30,615',
       '105,765', '69,016', '65,308', '45,972', '40,501', '11,946',
       '32,276', '3,152', '8,494', '21,889', '9,454', '10,118', '14,708',
       '1,450'

In [144]:
df['rating'].unique() # check unique values in rating column

array(['4.4 out of 5 stars', '4.1 out of 5 stars', '4.0 out of 5 stars',
       '3.7 out of 5 stars', '4.5 out of 5 stars', '4.8 out of 5 stars',
       '4.3 out of 5 stars', '4.6 out of 5 stars', '4.7 out of 5 stars',
       '4.9 out of 5 stars', '3.8 out of 5 stars', '4.2 out of 5 stars',
       '3.3 out of 5 stars', '3.5 out of 5 stars', '3.2 out of 5 stars',
       '3.6 out of 5 stars', '2.8 out of 5 stars'], dtype=object)


## Step 13: Category Analysis and Filtering

Let's analyze how many products we have per category and filter out categories with too few products.

### View Category Distribution

In [55]:
df['category'].value_counts()

category
Kitchen & Dining                26
Camera & Photo Products         26
Clothing, Shoes & Jewelry       26
Toys & Games                    26
Sports & Outdoors               25
Arts, Crafts & Sewing           24
Musical Instruments             23
Cell Phones & Accessories       23
Office Products                 22
Baby                            18
Beauty & Personal Care          18
Automotive                      17
Industrial & Scientific         16
Home & Kitchen                  15
Sports Collectibles             15
Tools & Home Improvement        15
Health & Household              15
Gift Cards                      14
Computers & Accessories         13
Collectible Coins               12
Electronics                      7
Unique Finds                     7
Handmade Products                6
Patio, Lawn & Garden             6
Video Games                      6
Amazon Renewed                   5
Pet Supplies                     5
Appliances                       4
Entertainme

- After dropping the incomplete rows, the remaining entries look valid, although `review_count` and `rating` are still stored as strings.

### Filter Categories with Sufficient Data

We'll keep only categories with at least 10 products to ensure meaningful analysis.

In [145]:
# number of rows per category
category_counts = df['category'].value_counts()

# categories with 10 or more rows
valid_categories = category_counts[category_counts >= 10].index

# Filter dataframe to keep only valid categories
df = df[df['category'].isin(valid_categories)]

print(f"Removed {len(category_counts) - len(valid_categories)} categories with less than 10 rows")
print(f"Remaining categories: {len(valid_categories)}")
print(f"Remaining rows: {df.shape[0]}")

Removed 10 categories with less than 10 rows
Remaining categories: 20
Remaining rows: 389
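
The same filter can also be written without building the list of valid categories first. A minimal sketch on a made-up frame, using `groupby(...).transform('size')` to attach each row's category size; the threshold of 2 stands in for the notebook's 10:

```python
import pandas as pd

# Made-up rows; not scraped data.
toy = pd.DataFrame({
    "category": ["Books", "Books", "Toys", "Baby", "Baby", "Baby"],
    "asin": ["A1", "A2", "A3", "A4", "A5", "A6"],
})

min_rows = 2
# transform('size') returns, for every row, the number of rows in its category.
category_size = toy.groupby("category")["asin"].transform("size")
filtered = toy[category_size >= min_rows]

print(sorted(filtered["category"].unique()))
```

Both versions keep the same rows; the `value_counts` approach above has the advantage of making the removed-category count easy to print.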


### Final Dataset Shape

In [146]:
df.shape

(389, 8)

### Final Data Summary

In [147]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 389 entries, 90 to 1101
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   asin          389 non-null    object 
 1   url           389 non-null    object 
 2   category      389 non-null    object 
 3   rank          389 non-null    object 
 4   title         389 non-null    object 
 5   price         389 non-null    float64
 6   review_count  389 non-null    object 
 7   rating        389 non-null    object 
dtypes: float64(1), object(7)
memory usage: 27.4+ KB
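
The summary shows `rank` still stored as `object`, so sorting on it would compare strings. A minimal sketch, on a made-up frame, of converting values such as `#11` to integers so that rows sort in numeric order:

```python
import pandas as pd

# Made-up rows; not scraped data.
toy = pd.DataFrame({
    "category": ["Books", "Books", "Books"],
    "rank": ["#11", "#2", "#1"],
})

# Strip the leading '#' and convert to int; as strings, '#11' would sort before '#2'.
toy["rank"] = toy["rank"].str.lstrip("#").astype(int)
toy = toy.sort_values(["category", "rank"]).reset_index(drop=True)

print(toy["rank"].tolist())
```

This assumes every rank has the `#N` form; a placeholder such as 'N/A' would make `astype(int)` raise an error.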


## Step 14: Export Final Clean Dataset

Perfect! Our cleaned dataset is ready for analysis. Let's save it as our final output.

In [148]:
df.to_csv("Best-Sellers-Amazon-Final.csv", index=False)

---