# Here is an outline of my work:
- Install and import the libraries.
- Download and parse the Best Sellers HTML page source using requests and Beautiful Soup to get the URL of each item category (topic).
- Repeat the above step for each topic, using its corresponding URL.
- Extract information from each page and append it to Python dictionaries.
- Save the data for each page to a corresponding CSV file using the pandas library.


By the end of the project, we'll create a csv file in the following format:

```
Topic,Topic_url,Item_description,Rating out of 5,Minimum_price,Maximum_price,Review,Item Url
Amazon Devices & Accessories,https://www.amazon.com/Best-Sellers/zgbs/amazon-devices/ref=zg_bs_nav_0/131-6756172-7735956,Fire TV Stick 4K streaming device with Alexa Voice Remote | Dolby Vision | 2018 release,4.7,39.9,0.0,615699,"https://images-na.ssl-images-amazon.com/images/I/51CgKGfMelL._AC_UL200_SR200,200_.jpg"
Amazon Devices & Accessories,https://www.amazon.com/Best-Sellers/zgbs/amazon-devices/ref=zg_bs_nav_0/131-6756172-7735956,Fire TV Stick (3rd Gen) with Alexa Voice Remote (includes TV controls) | HD streaming device | 2021 release,4.7,39.9,0.0,1844,"https://images-na.ssl-images-amazon.com/images/I/51KKR5uGn6L._AC_UL200_SR200,200_.jpg"
Amazon Devices & Accessories,https://www.amazon.com/Best-Sellers/zgbs/amazon-devices/ref=zg_bs_nav_0/131-6756172-7735956,"Amazon Smart Plug, works with Alexa – A Certified for Humans Device",4.7,24.9,0.0,425090,"https://images-na.ssl-images-amazon.com/images/I/41uF7hO8FtL._AC_UL200_SR200,200_.jpg"
Amazon Devices & Accessories,https://www.amazon.com/Best-Sellers/zgbs/amazon-devices/ref=zg_bs_nav_0/131-6756172-7735956,Fire TV Stick Lite with Alexa Voice Remote Lite (no TV controls) | HD streaming device | 2020 release,4.7,29.9,0.0,151007,"https://images-na.ssl-images-amazon.com/images/I/51Da2Z%2BFTFL._AC_UL200_SR200,200_.jpg"

```

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import os

In [2]:
url ="https://www.amazon.com/Best-Sellers/zgbs/ref=zg_bs_unv_ac_0_ac_1"

HEADERS ={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}


In [3]:
response = requests.get(url, headers=HEADERS)

In [4]:
print(response)

<Response [200]>
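A single `requests.get` call has no timeout by default and can fail transiently. The sketch below is one possible way to add a timeout and a few retries. `fetch_html` and its `retries`, `delay`, and `get` parameters are hypothetical names that are not part of this notebook; the `get` parameter exists only so the function can be tried without network access.

```python
import time


def fetch_html(url, headers=None, retries=3, delay=2.0, get=None):
    """Hypothetical helper: return the page HTML, retrying on non-200 responses."""
    if get is None:
        # Imported lazily so a stand-in `get` can be supplied for offline use.
        import requests
        get = requests.get

    last_status = None
    for attempt in range(retries):
        response = get(url, headers=headers, timeout=10)
        last_status = response.status_code
        if last_status == 200:
            return response.text
        # Wait a little longer after each failed attempt, except after the last one.
        if attempt < retries - 1:
            time.sleep(delay * (attempt + 1))
    raise RuntimeError(
        'Failed to load page {} (last status {})'.format(url, last_status)
    )
```

With this in place, `page_contents = fetch_html(url, headers=HEADERS)` could stand in for the separate request and status check.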


In [7]:
page_contents=response.text

In [9]:
with open("bestseller_new.html","w",encoding='UTF-8') as f:
    f.write(page_contents)

## Use Beautiful Soup to parse and extract information

- Parse and explore the structure of the downloaded web page using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract information from the page into lists and dictionaries.
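As a small illustration of these steps, the sketch below parses a hand-written HTML fragment and pulls a title and a link out of each entry with `find` and `find_all`. The markup, the `nav-group` class name, and the two paths are simplified assumptions made up for the example; the real page uses generated class names.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the department list.
sample_html = """
<div class="nav-group">
  <div role="treeitem"><a href="/example/appliances">Appliances</a></div>
  <div role="treeitem"><a href="/example/automotive">Automotive</a></div>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# find() returns the first matching tag; find_all() returns every match.
group = soup.find('div', {'class': 'nav-group'})

topics = []
for item in group.find_all('div', {'role': 'treeitem'}):
    link = item.find('a')
    # Reading the title and the URL from the same entry keeps them paired.
    topics.append({'title': item.text.strip(), 'url': link['href']})

print(topics)
```

Collecting both fields inside one loop is a design choice that guarantees the title and URL lists have the same length.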


In [10]:
content = BeautifulSoup(page_contents,"html.parser")

##### Getting information out of a topic page

**scrape_products_pages() returns a DataFrame containing the titles and URLs of all the Best Sellers departments.**

In [13]:
def scrape_products_pages():
    items_url = 'https://www.amazon.com/Best-Sellers/zgbs/ref=zg_bs_unv_ac_0_ac_1'
    # This request only checks that the page is reachable; the helpers below
    # read the titles and URLs from the already-parsed global `content`.
    response = requests.get(items_url, headers=HEADERS)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(items_url))
    items_dict = {
        'title': get_page_title(),
        'url': get_page_urls()
    }
    return pd.DataFrame(items_dict)

***get_page_title() and get_page_urls() get the titles and URLs of all the Best Sellers departments; both are called inside scrape_products_pages().***

In [18]:
def get_page_title():
    selection_class = '_p13n-zg-nav-tree-all_style_zg-browse-group__88fbz'
    doc = content.find("div", {"class": selection_class})
    header_link_tags = doc.find_all('div', {'role': 'treeitem'})

    # Collect the visible text of every department entry.
    topic_titles = []
    for tag in header_link_tags:
        topic_titles.append(tag.text.strip())
    return topic_titles

def get_page_urls():
    selection_class_2 = '_p13n-zg-nav-tree-all_style_zg-browse-group__88fbz'
    doc2 = content.find("div", {"class": selection_class_2})
    header_url_tags = doc2.find_all('a')

    topic_urls = []
    base_url = 'https://www.amazon.com/'
    for tag in header_url_tags:
        # Strip any leading '/' from the href so the joined URL has a single slash.
        topic_urls.append(base_url + tag['href'].lstrip('/'))
    return topic_urls
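Joining a base URL and an `href` by string concatenation depends on whether the `href` starts with a slash. As an alternative sketch, the standard library's `urllib.parse.urljoin` handles both relative and already-absolute links. `absolute_url` is a hypothetical helper name and the example paths are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.amazon.com/'


def absolute_url(href, base=BASE_URL):
    """Hypothetical helper: resolve an href against the site's base URL."""
    return urljoin(base, href)


# A root-relative href is resolved against the host.
print(absolute_url('/Best-Sellers/zgbs/amazon-devices'))
# An href that is already absolute is returned unchanged.
print(absolute_url('https://www.amazon.com/gp/help'))
```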

**scrape_products() is the function that starts scraping all the departments and creates a data folder containing one CSV file per department.**

In [24]:

def scrape_products():
    print('Scrapping list of products')
    items_df=scrape_products_pages()
    os.makedirs('data',exist_ok=True)
    for index,row in items_df.iterrows():
        print('Scrapping Best Seller Departments for "{}"'.format(row['title']))
        scrape_product(row['url'],'data/{}.csv'.format(row['title']))
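Because the department title is used directly as the file name, a title containing a path separator or another reserved character could produce an invalid path. The sketch below shows one possible way to guard against that. `safe_filename` and `csv_path` are hypothetical helpers, and the set of replaced characters is an assumption rather than a complete rule for every file system.

```python
import os
import re


def safe_filename(title):
    """Hypothetical helper: replace characters that are unsafe in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]+', '_', title).strip()
    # Fall back to a placeholder if nothing usable is left.
    return cleaned or 'untitled'


def csv_path(title, folder='data'):
    """Build the CSV path for a department title inside `folder`."""
    return os.path.join(folder, safe_filename(title) + '.csv')


print(csv_path('Arts, Crafts & Sewing'))
print(csv_path('A/B: test'))  # made-up title containing reserved characters
```

Titles such as "Arts, Crafts & Sewing" pass through unchanged, so files written earlier under the original naming would still be found.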
        

In [20]:
def scrape_product(item_url, path):
    # Skip departments that were already scraped in a previous run.
    if os.path.exists(path):
        print("The file {} already exists.Skipping.....".format(path))
        return
    i_doc = get_item_page(item_url)
    tags = get_page(i_doc)
    products_df = get_products_info(tags)
    # index=False leaves the DataFrame index out of the CSV.
    products_df.to_csv(path, index=False)

#### get_item_page(item_url) takes the URL of each department and parses the page with BeautifulSoup; get_page(item_doc) extracts the product tags from the parsed HTML

In [62]:

def get_item_page(item_url):
    response = requests.get(item_url, headers=HEADERS)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(item_url))
    item_doc = BeautifulSoup(response.text, 'html.parser')
    return item_doc

def get_page(item_doc):
    s_class = 'p13n-gridRow _p13n-zg-list-grid-desktop_style_grid-row__3Cywl'
    docu = item_doc.find('div', {'class': s_class})
    # Pass attrs as a dict mapping the attribute name to the expected value.
    p1_tags = docu.find_all('div', {'class': 'a-column a-span12 a-text-center _p13n-zg-list-grid-desktop_style_grid-column__2hIsc'})
    return p1_tags
    

**These functions take the list of product tags from a department page and the index of a product within that list.**

In [50]:
def get_item_url(p1_tags, index):
    base_url = 'https://www.amazon.com/'
    # Use the href of the first 'a-link-normal' anchor.
    href = p1_tags[index].find_all('a', attrs={'class': 'a-link-normal'})[0]['href']
    # Strip any leading '/' so the joined URL has a single slash.
    return base_url + href.lstrip('/')


def get_item_desc(p1_tags, index):
    # The description is the text of the second 'a-link-normal' anchor.
    return p1_tags[index].find_all('a', attrs={'class': 'a-link-normal'})[1].text.strip()


def get_item_reviews(p1_tags, index):
    review_tag = p1_tags[index].find('span', attrs={'class': 'a-size-small'})
    # Return None when no review tag is found.
    return review_tag.text.strip() if review_tag is not None else None


def get_item_rating(p1_tags, index):
    rating_tag = p1_tags[index].find('span', attrs={'class': 'a-icon-alt'})
    # Return None when no rating tag is found.
    return rating_tag.text.strip() if rating_tag is not None else None


def get_item_price(p1_tags, index):
    # The price can appear under either of two class names; try both in order.
    price_tag_1 = p1_tags[index].find('span', attrs={'class': '_p13n-zg-list-grid-desktop_price_p13n-sc-price__3mJ9Z'})
    price_tag_2 = p1_tags[index].find('span', attrs={'class': 'p13n-sc-price'})
    if price_tag_1 is not None:
        return price_tag_1.text.strip()
    if price_tag_2 is not None:
        return price_tag_2.text.strip()
    return None
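The sample CSV at the top stores ratings, prices, and review counts as numbers, while the helper functions above return raw text. The sketch below shows one possible conversion. The three function names are hypothetical, and the input formats (`'4.7 out of 5 stars'`, `'615,699'`, `'$24.99 - $39.99'`) are assumptions about what the page text looks like, not something confirmed by this notebook.

```python
import re


def parse_rating(text):
    """Return the first number in text such as '4.7 out of 5 stars', or None."""
    if not text:
        return None
    match = re.search(r'\d+(?:\.\d+)?', text)
    return float(match.group()) if match else None


def parse_review_count(text):
    """Return the digits of text such as '615,699' as an int, or None."""
    if not text:
        return None
    digits = re.sub(r'[^\d]', '', text)
    return int(digits) if digits else None


def parse_price_range(text):
    """Return (minimum, maximum) prices; maximum is 0.0 for a single price."""
    if not text:
        return (0.0, 0.0)
    found = re.findall(r'\d[\d,]*(?:\.\d+)?', text)
    prices = [float(p.replace(',', '')) for p in found]
    if not prices:
        return (0.0, 0.0)
    if len(prices) == 1:
        return (prices[0], 0.0)
    return (min(prices), max(prices))


print(parse_rating('4.7 out of 5 stars'))
print(parse_review_count('615,699'))
print(parse_price_range('$24.99 - $39.99'))
```

Using `0.0` as the maximum for a single price mirrors the `Maximum_price` column in the sample rows; returning `None` there would be an equally reasonable choice.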

**get_products_info(prod_tags) collects the information for every product in a department and returns a DataFrame of all the products in that department.**

In [69]:
def get_products_info(prod_tags):
    item_page_dict = {
        'Name': [],
        'Reviews': [],
        'Ratings': [],
        'Price': [],
        'Url': [],
    }
    for item in range(len(prod_tags)):
        try:
            # Extract every field before appending so the lists stay the same length.
            row = {
                'Name': get_item_desc(prod_tags, item),
                'Reviews': get_item_reviews(prod_tags, item),
                'Ratings': get_item_rating(prod_tags, item),
                'Price': get_item_price(prod_tags, item),
                'Url': get_item_url(prod_tags, item),
            }
        except Exception:
            # Skip a product whose markup does not match the expected layout.
            continue
        for key, value in row.items():
            item_page_dict[key].append(value)

    return pd.DataFrame(item_page_dict)
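Each department is saved to its own CSV file. If a single table with a `Topic` column is wanted, the per-department files could be stacked afterwards. The sketch below is one way to do that with pandas. `combine_department_csvs` is a hypothetical helper, and taking the department name from the file name is an assumption based on how the files are named in this notebook.

```python
import glob
import os

import pandas as pd


def combine_department_csvs(folder='data'):
    """Hypothetical helper: stack every CSV in `folder` into one DataFrame."""
    frames = []
    for path in sorted(glob.glob(os.path.join(folder, '*.csv'))):
        df = pd.read_csv(path)
        # Use the file name without its extension as the department name.
        topic = os.path.splitext(os.path.basename(path))[0]
        df.insert(0, 'Topic', topic)
        frames.append(df)
    if not frames:
        # No CSV files found: return an empty frame instead of failing.
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

Calling `combine_department_csvs('data')` after the scrape has finished would return one DataFrame with a row per product across all departments.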

In [70]:
scrape_products()

Scrapping list of products
Scrapping Best Seller Departments for "Amazon Devices & Accessories"
The file data/Amazon Devices & Accessories.csv already exists.Skipping.....
Scrapping Best Seller Departments for "Amazon Explore"
The file data/Amazon Explore.csv already exists.Skipping.....
Scrapping Best Seller Departments for "Amazon Launchpad"
The file data/Amazon Launchpad.csv already exists.Skipping.....
Scrapping Best Seller Departments for "Appliances"
The file data/Appliances.csv already exists.Skipping.....
Scrapping Best Seller Departments for "Apps & Games"
The file data/Apps & Games.csv already exists.Skipping.....
Scrapping Best Seller Departments for "Arts, Crafts & Sewing"
The file data/Arts, Crafts & Sewing.csv already exists.Skipping.....
Scrapping Best Seller Departments for "Audible Books & Originals"
The file data/Audible Books & Originals.csv already exists.Skipping.....
Scrapping Best Seller Departments for "Automotive"
The file data/Automotive.csv already exists.Ski