# Amazon-best-seller-web-scraping


 ![amazon](https://i.imgur.com/HXfMkKV.png)

[Amazon.com](https://www.amazon.com/) is an American multinational technology company specializing in e-commerce, cloud computing, and artificial intelligence. Its platform is among the best in the industry and offers a wide variety of items for purchase.
 
Amazon lists its best sellers in alphabetical order on the [Amazon Best Sellers](https://www.amazon.com/gp/bestsellers/?ref_=nav_cs_bestsellers_22595f4f23134e4aa687cca616dd2701) page. The page provides a list of item categories grouped by department (about 40 in total). In this project, we retrieve Amazon best-seller items across these categories using web scraping. To do that, we use the Python libraries [requests](https://docs.python-requests.org/projects/requests-html/en/latest/) and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to fetch the web pages, parse them, and extract the information we need.




 ![amazon](https://i.imgur.com/iGwAboG.png)


Here is an outline of the steps we will follow:
- Install and import libraries
- Download and parse the Best Sellers HTML page source code using `requests` and `BeautifulSoup` to get the item category (topic) URLs
- Repeat step 2 for each item topic, using the corresponding URL
- Extract information from each page
- Combine the information extracted from each page into a Python dictionary
- Save the data to a CSV file using the Pandas library

By the end of the project, we’ll create a CSV file in the following format:

```
Topic,Topic_url,Item_description,Rating out of 5,Minimum_price,Maximum_price,Review,Item Url
Amazon Devices & Accessories,https://www.amazon.com/Best-Sellers/zgbs/amazon-devices/ref=zg_bs_nav_0/131-6756172-7735956,Fire TV Stick 4K streaming device with Alexa Voice Remote | Dolby Vision | 2018 release,4.7,39.9,0.0,615699,"https://images-na.ssl-images-amazon.com/images/I/51CgKGfMelL._AC_UL200_SR200,200_.jpg"
Amazon Devices & Accessories,https://www.amazon.com/Best-Sellers/zgbs/amazon-devices/ref=zg_bs_nav_0/131-6756172-7735956,Fire TV Stick (3rd Gen) with Alexa Voice Remote (includes TV controls) | HD streaming device | 2021 release,4.7,39.9,0.0,1844,"https://images-na.ssl-images-amazon.com/images/I/51KKR5uGn6L._AC_UL200_SR200,200_.jpg"
Amazon Devices & Accessories,https://www.amazon.com/Best-Sellers/zgbs/amazon-devices/ref=zg_bs_nav_0/131-6756172-7735956,"Amazon Smart Plug, works with Alexa – A Certified for Humans Device",4.7,24.9,0.0,425090,"https://images-na.ssl-images-amazon.com/images/I/41uF7hO8FtL._AC_UL200_SR200,200_.jpg"
Amazon Devices & Accessories,https://www.amazon.com/Best-Sellers/zgbs/amazon-devices/ref=zg_bs_nav_0/131-6756172-7735956,Fire TV Stick Lite with Alexa Voice Remote Lite (no TV controls) | HD streaming device | 2020 release,4.7,29.9,0.0,151007,"https://images-na.ssl-images-amazon.com/images/I/51Da2Z%2BFTFL._AC_UL200_SR200,200_.jpg"

```


How to Run the Code

You can execute the code using the “Run” button at the top of the page and selecting “Run on Binder”. You can make changes and save your version of the notebook in Jovian by executing the following cells.


Note: The Best Sellers page has 40 item categories, and each category lists its top 100 items across 2 pages (50 items per page). Due to captcha problems, a few pages could not be accessed.

In [8]:
!pip install jovian --upgrade --quiet

In [9]:
import jovian

In [10]:
# Execute this to save new versions of the notebook
jovian.commit(project="amazon-best-seller-web-scraping")

[jovian] Attempting to save notebook..
[jovian] Updating notebook "landryroni/amazon-best-seller-web-scraping" on https://jovian.ai
[jovian] Uploading notebook..
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.ai/landryroni/amazon-best-seller-web-scraping


'https://jovian.ai/landryroni/amazon-best-seller-web-scraping'

## Install and import libraries

Let’s start with the necessary libraries.

We install them with the `pip` command, then import the packages we need to scrape data from the website.

In [11]:
!pip install requests pandas bs4  --upgrade --quiet

In [13]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

The libraries are now downloaded, installed, and imported.

## Download and parse the Best Sellers HTML page source code using requests and BeautifulSoup to get the item category (topic) URLs

We use the `get` function from the `requests` library to download the page. We define a [User-Agent](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent) header string, which lets servers and network peers identify the application, operating system, vendor, and/or version of the requesting client. Sending a browser-like User-Agent helps the request avoid being flagged as a scraper.

In [14]:
url ="https://www.amazon.com/Best-Sellers/zgbs/ref=zg_bs_unv_ac_0_ac_1"

HEADERS ={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}

In [15]:
response = requests.get(url, headers=HEADERS)

In [16]:
print(response)

<Response [200]>


`requests.get` returns a response object containing the data from the web page. The `.status_code` property is used to check whether the request was successful.
A successful response has an [HTTP response](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) code between 200 and 299.
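The check described above can be sketched without touching the network. The helper name `is_success` is hypothetical and not part of `requests`; it only encodes the 200 to 299 range:

```python
def is_success(status_code: int) -> bool:
    """Return True when the HTTP status code is in the 2xx (success) range."""
    return 200 <= status_code <= 299


# Example codes: 200 OK, 204 No Content, 404 Not Found, 503 Service Unavailable
for code in (200, 204, 404, 503):
    print(code, is_success(code))
```

With a real response, this would be called as `is_success(response.status_code)`. `requests` also offers `response.raise_for_status()`, which raises `requests.HTTPError` for 4xx and 5xx codes.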

In [17]:
response.text[:500]

'<!doctype html><html lang="en-us" class="a-no-js" data-19ax5a9jf="dingo"><!-- sp:feature:head-start -->\n<head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/>\n\n<script type=\'text/javascript\'>var ue_t0=ue_t0||+new Date();</script>\n<!-- sp:feature:cs-optimization -->\n<meta http-equiv=\'x-dns-prefetch-control\' content=\'on\'>\n<link rel="dns-prefetch" href="https://images-na.ssl-images-amazon.com">\n<link rel="dns-prefetch" href="https://m.media-amazon.com">\n<link rel="dn'

`response.text` holds the page contents we just downloaded; we can also check its length with `len(response.text)`. Here we print only the first 500 characters of the page content.

Let's save the contents to a file with the [html](https://en.wikipedia.org/wiki/HTML#:~:text=The%20HyperText%20Markup%20Language%2C%20or,displayed%20in%20a%20web%20browser.) extension.

In [18]:
with open("bestseller.html","w") as f:
    f.write(response.text)

We can view the file using the "File > Open" menu option within Jupyter and clicking on bestseller.html in the list of files displayed. 

![File](https://i.imgur.com/sv8WP4r.png)

We can also pay attention to the file size. For this task, a file of close to 250 kB means the page content was downloaded successfully, whereas a file of about 6.6 kB means the download did not return the real page content. A failure can be caused by a captcha or by other security checks the website applies to the request.
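The size check can be turned into a small heuristic. This is only a sketch: the function name `looks_like_full_page`, the 100,000-byte threshold (chosen to sit between the two sizes mentioned above), and the `"captcha"` marker are all assumptions, not values taken from Amazon or from this notebook's code.

```python
def looks_like_full_page(html: str, min_bytes: int = 100_000) -> bool:
    """Heuristic: a real best-seller page is large and does not mention a captcha.

    Both the size threshold and the "captcha" marker are assumptions.
    """
    size = len(html.encode("utf-8"))
    return size >= min_bytes and "captcha" not in html.lower()


# A tiny page that mentions a captcha is rejected
print(looks_like_full_page("<html>Please solve this CAPTCHA</html>"))
# A large page without the marker is accepted
print(looks_like_full_page("<html>" + "x" * 200_000 + "</html>"))
```

In the notebook this could be applied to `response.text` before writing the file.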

Here is what we see when we open the file by clicking on it:
 
 
 ![Bestseller](https://i.imgur.com/6YJgwI7.png)

While this looks similar to the original web page, note that it's simply a copy. You will notice that none of the links or buttons work. To view or edit the source code of the file, click "File > Open" within Jupyter, then select the file bestseller.html from the list and click the "Edit" button

![source code](https://i.imgur.com/iAXyCOW.png)

In [19]:
with open("bestseller.html","r") as f:
    html_content = f.read()

In [20]:
html_content[:500]

'<!doctype html><html lang="en-us" class="a-no-js" data-19ax5a9jf="dingo"><!-- sp:feature:head-start -->\n<head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/>\n\n<script type=\'text/javascript\'>var ue_t0=ue_t0||+new Date();</script>\n<!-- sp:feature:cs-optimization -->\n<meta http-equiv=\'x-dns-prefetch-control\' content=\'on\'>\n<link rel="dns-prefetch" href="https://images-na.ssl-images-amazon.com">\n<link rel="dns-prefetch" href="https://m.media-amazon.com">\n<link rel="dn'

We have just read the file content and printed the first 500 characters.
Now we parse the web page using `BeautifulSoup` and check the type of the result.

In [21]:
content = BeautifulSoup(html_content,"html.parser")

In [22]:
type(content)

bs4.BeautifulSoup

Let's access the parent tag and find all the child tags that hold the information we need.

In [23]:
doc = content.find("ul",{"id":"zg_browseRoot"})
hearder_link_tags = doc.find_all("li")

Here we collect the URL and title of each item topic (category), then store them in a dictionary.

In [24]:
# Find the description (title) and URL of each item category in any department
topics_link = []
for tag in hearder_link_tags[1:]:
    topics_link.append({
        "title": tag.text.strip(),
        "url": tag.find("a")["href"]})
# Store the results in a dictionary of lists, one list per key
table_topics = {k: [d.get(k) for d in topics_link]
                for k in set().union(*topics_link)}
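The dictionary comprehension above turns a list of small dictionaries into one dictionary of lists. A minimal sketch on made-up data (the titles and `example.com` URLs are placeholders, not real topics) shows what each part does:

```python
# Small sample in the same shape as `topics_link` (values are made up)
sample_links = [
    {"title": "Books", "url": "https://example.com/books"},
    {"title": "Toys", "url": "https://example.com/toys"},
]

# set().union(*sample_links) collects every key that appears in any dictionary
keys = set().union(*sample_links)

# For each key, gather that key's value from every dictionary, preserving order
sample_table = {k: [d.get(k) for d in sample_links] for k in keys}

print(sorted(keys))
print(sample_table["title"])
print(sample_table["url"])
```

Because the lists keep the original order, `sample_table["title"][i]` and `sample_table["url"][i]` always describe the same topic.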

Let's print all the category topics found on the Best Sellers page.

In [25]:
table_topics

{'url': ['https://www.amazon.com/Best-Sellers/zgbs/amazon-devices/ref=zg_bs_nav_0/138-0735230-1420505',
  'https://www.amazon.com/Best-Sellers-Amazon-Launchpad/zgbs/boost/ref=zg_bs_nav_0/138-0735230-1420505',
  'https://www.amazon.com/Best-Sellers-Appliances/zgbs/appliances/ref=zg_bs_nav_0/138-0735230-1420505',
  'https://www.amazon.com/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_nav_0/138-0735230-1420505',
  'https://www.amazon.com/Best-Sellers-Arts-Crafts-Sewing/zgbs/arts-crafts/ref=zg_bs_nav_0/138-0735230-1420505',
  'https://www.amazon.com/Best-Sellers-Audible-Audiobooks/zgbs/audible/ref=zg_bs_nav_0/138-0735230-1420505',
  'https://www.amazon.com/Best-Sellers-Automotive/zgbs/automotive/ref=zg_bs_nav_0/138-0735230-1420505',
  'https://www.amazon.com/Best-Sellers-Baby/zgbs/baby-products/ref=zg_bs_nav_0/138-0735230-1420505',
  'https://www.amazon.com/Best-Sellers-Beauty/zgbs/beauty/ref=zg_bs_nav_0/138-0735230-1420505',
  'https://www.amazon.com/best-sellers-books-Amazon/z

In [26]:
len(table_topics["url"])

40

We obtain 40 topics (categories) from the Best Sellers page.

## Repeat step 2 for each item category obtained using the corresponding URL

Let's import the `time` library. To avoid having several page requests blocked by a captcha, we pause for a few seconds between page requests using the `sleep` function from the [time](https://docs.python.org/3/library/time.html) library.

In [35]:
import time 
import numpy as np
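This notebook uses fixed pauses between requests. One possible variation, shown here only as a sketch and not used by the code in this notebook, is to add a random component to the pause. The helper name `jittered_delay` and the example values are assumptions.

```python
import random
import time


def jittered_delay(base_seconds: float, jitter_seconds: float) -> float:
    """Return a delay drawn uniformly from [base, base + jitter] seconds."""
    return base_seconds + random.uniform(0, jitter_seconds)


delay = jittered_delay(10, 5)  # somewhere between 10 and 15 seconds
print(f"would sleep for {delay:.1f} s")
# Inside a scraping loop the pause itself would be: time.sleep(delay)
```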

Here we define a function `fetch` that downloads and parses a single page URL, and a function `parse_page` that applies it to the pages of every topic found on the Best Sellers page.

In [36]:
def fetch(url):
    '''Download a page with requests.get and parse it with BeautifulSoup.
    The request headers are defined inside the function.
    Argument:
    -url(string): URL of the web page to download and parse
    Return:
    -doc(bs4 ResultSet): the parent tags, parsed from the page, that contain the information we need'''
    HEADERS= {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}
    response = requests.get(url,headers= HEADERS)
    if response.status_code != 200:
        print("Status code:", response.status_code)
        raise Exception("Failed to link to web page " + url)
    page_content  = BeautifulSoup(response.text,"html.parser")
    doc = page_content.findAll('div', attrs={'class':'a-section a-spacing-none aok-relative'})
    return doc

In [111]:
def parse_page(table_topics,pageNo):
    """Take all topic categories and the number of pages to parse per topic, download each page with a GET request,
    then parse it with BeautifulSoup. Pages that come back empty are recorded so they can be retried later.
    Argument:
    -table_topics(dict): dictionary containing topic descriptions and URLs
    -pageNo(int): number of pages to parse per topic
    Return:
    -article_tags(list): successfully parsed page contents, each entry being a BeautifulSoup result
    -t_description(list): topic descriptions of the successfully parsed pages
    -t_url(list): topic URLs of the successfully parsed pages
    -fail_tags(list): URLs of the pages that failed the first parsing attempt
    -failed_topic(list): topic descriptions of the pages that failed the first parsing attempt
    """
    article_tags,t_description, t_url,fail_tags,failed_topic =[],[],[],[],[]
    for i in range(0,len(table_topics["url"])):
         # take the url
        topic_url = table_topics["url"][i]
        topics_description =  table_topics["title"][i]
        try:
            for j in range(1,pageNo+1):
                ref = topic_url.find("ref")
                url = topic_url[:ref]+"ref=zg_bs_pg_"+str(j)+"?_encoding=UTF8&pg="+str(j)
                time.sleep(10)
                # use requests to obtain the HTML page content
                doc = fetch(url)
                if len(doc)==0:
                    print("failed to parse page{}".format(url))
                    fail_tags.append(url)
                    failed_topic.append(topics_description)
                else:
                    print("Sucsessfully parse:",url)
                    article_tags.append(doc)
                    t_description.append(topics_description)
                    t_url.append(topic_url) 
        except Exception as e:
            print(e)
    return article_tags,t_description,t_url,fail_tags,failed_topic
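The URL rewriting inside `parse_page` can be looked at in isolation. The sketch below uses a hypothetical helper name, `build_page_url`, and repeats the same slicing: everything from the first `"ref"` onward is replaced with the page-specific suffix. The fallback for a URL without `"ref"` is an added assumption, since `str.find` returns -1 in that case and the original slice would silently drop the last character.

```python
def build_page_url(topic_url: str, page: int) -> str:
    """Rebuild a topic URL so that it points at the given results page."""
    ref = topic_url.find("ref")
    if ref == -1:
        # No "ref" marker found: keep the whole URL and add a slash (assumption)
        base = topic_url.rstrip("/") + "/"
    else:
        base = topic_url[:ref]
    return f"{base}ref=zg_bs_pg_{page}?_encoding=UTF8&pg={page}"


topic_url = ("https://www.amazon.com/Best-Sellers/zgbs/amazon-devices/"
             "ref=zg_bs_nav_0/138-0735230-1420505")
print(build_page_url(topic_url, 1))
print(build_page_url(topic_url, 2))
```

For the topic URL shown, the two results have the same form as the page URLs that `parse_page` prints while it runs.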

Here we define `reparse_failed_page`, which makes a second attempt to fetch and parse the pages that failed on the first attempt. We first tried a `while` loop that repeated requests until every page parsed successfully, but requests kept failing even with a sleep time, so we defined this function instead.

In [112]:
def reparse_failed_page(fail_page_url,failed_topic):
    """Take the URLs and topic descriptions of the pages that could not be accessed (due to captcha) during the first
     parsing pass, and try to fetch and parse those pages a second time.
     Return the parsed page contents, their topic descriptions and URLs, plus the URLs and topics that failed again.
    Argument:
    -fail_page_url(list): URLs of the pages that failed the first parsing attempt
    -failed_topic(list): topic descriptions of the pages that failed the first parsing attempt
    Return:
    -article_tag2(list): successfully parsed page contents, each entry being a BeautifulSoup result
    -topic_d(list): topic descriptions of the successfully parsed pages
    -topic_url(list): URLs of the successfully parsed pages
    -fail_p(list): URLs of the pages that failed again
    -fail_t(list): topic descriptions of the pages that failed again
    """
    print("check if there is any failed pages,then print number:",len(fail_page_url))
    article_tag2, topic_url, topic_d, fail_p, fail_t = [],[],[],[],[]
    try:
        for i in range(len(fail_page_url)):
            time.sleep(20)
            doc = fetch(fail_page_url[i])
            if len(doc)==0:
                print("page{}failed again".format(fail_page_url[i]))
                fail_p.append(fail_page_url[i])
                fail_t.append(failed_topic[i])
            else:
                article_tag2.append(doc)
                topic_url.append(fail_page_url[i])
                topic_d.append(failed_topic[i])
    except Exception as e:
        print(e)
    return article_tag2,topic_d,topic_url,fail_p,fail_t
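The function above makes exactly one extra attempt. A more general variant, given here only as a sketch under assumptions, retries for a bounded number of rounds. The names `retry_failed`, `fetch_fn`, `max_rounds`, and `pause_seconds` are hypothetical; `fetch_fn` stands in for `fetch`, and a fake fetcher is used so the example runs without network access.

```python
import time
from typing import Callable, Dict, List, Tuple


def retry_failed(urls: List[str],
                 fetch_fn: Callable[[str], list],
                 max_rounds: int = 3,
                 pause_seconds: float = 20.0) -> Tuple[Dict[str, list], List[str]]:
    """Retry each URL for up to max_rounds rounds; an empty result counts as a failure."""
    parsed = {}
    pending = list(urls)
    for _ in range(max_rounds):
        if not pending:
            break
        still_failing = []
        for url in pending:
            time.sleep(pause_seconds)  # pause before every request
            doc = fetch_fn(url)
            if len(doc) == 0:
                still_failing.append(url)
            else:
                parsed[url] = doc
        pending = still_failing
    return parsed, pending


# Fake fetcher: returns nothing on the first call for a URL, content afterwards
calls = {}

def fake_fetch(url):
    calls[url] = calls.get(url, 0) + 1
    return ["item"] if calls[url] >= 2 else []


parsed, failed = retry_failed(["page-a", "page-b"], fake_fetch,
                              max_rounds=3, pause_seconds=0)
print(parsed)
print(failed)
```

Bounding the number of rounds avoids the endless loop that an unbounded `while` retry could fall into when a page keeps returning a captcha.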

Here we define the `parse` function, which runs both parsing passes to obtain as many parsed pages as possible.

In [113]:
def parse(table_topics,pageNo):
    """Take table_topics and the number of pages to parse for each topic URL. The main purpose of this function is
     to make two attempts so that as many pages as possible are parsed. It combines the results of the first
     and second parsing passes.
     Argument:
     -table_topics(dict): dictionary containing topic descriptions and URLs
     -pageNo(int): number of pages to parse per topic
     Return:
     -all_arcticle_tag(list): all successfully parsed page contents, each entry being a BeautifulSoup result
     -all_topics_description(list): topic descriptions of all successfully parsed pages
     -all_topics_url(list): topic URLs of all successfully parsed pages
    """
    article_tags,t_description,t_url,fail_tags,failed_topic = parse_page(table_topics,pageNo)
    if len(fail_tags)!=0:
        article_tags2,t_description2,t_url2,fail_tags2,failed_topic2 = reparse_failed_page(fail_tags,failed_topic)
        all_arcticle_tag = [*article_tags,*article_tags2]
        all_topics_description = [*t_description,*t_description2]
        all_topics_url = [*t_url,*t_url2]
        #return all_arcticle_tag,all_topics_description,all_topics_url
    else:
        print("successfully parsed all pages")
        all_arcticle_tag =   article_tags
        all_topics_description =t_description
        all_topics_url = t_url
       # return article_tags,t_description,t_url,fail_tags,failed_topic
    return all_arcticle_tag,all_topics_description,all_topics_url

In [65]:
all_arcticle_tag,all_topics_description,all_topics_url = parse(table_topics,2)

Sucsessfully parse: https://www.amazon.com/Best-Sellers/zgbs/amazon-devices/ref=zg_bs_pg_1?_encoding=UTF8&pg=1
Sucsessfully parse: https://www.amazon.com/Best-Sellers/zgbs/amazon-devices/ref=zg_bs_pg_2?_encoding=UTF8&pg=2
Sucsessfully parse: https://www.amazon.com/Best-Sellers-Amazon-Launchpad/zgbs/boost/ref=zg_bs_pg_1?_encoding=UTF8&pg=1
Sucsessfully parse: https://www.amazon.com/Best-Sellers-Amazon-Launchpad/zgbs/boost/ref=zg_bs_pg_2?_encoding=UTF8&pg=2
Sucsessfully parse: https://www.amazon.com/Best-Sellers-Appliances/zgbs/appliances/ref=zg_bs_pg_1?_encoding=UTF8&pg=1
Sucsessfully parse: https://www.amazon.com/Best-Sellers-Appliances/zgbs/appliances/ref=zg_bs_pg_2?_encoding=UTF8&pg=2
Sucsessfully parse: https://www.amazon.com/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_pg_1?_encoding=UTF8&pg=1
Sucsessfully parse: https://www.amazon.com/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_pg_2?_encoding=UTF8&pg=2
Sucsessfully parse: https://www.amazon.com/Best-Seller

Sucsessfully parse: https://www.amazon.com/best-sellers-software/zgbs/software/ref=zg_bs_pg_2?_encoding=UTF8&pg=2
Sucsessfully parse: https://www.amazon.com/Best-Sellers-Sports-Outdoors/zgbs/sporting-goods/ref=zg_bs_pg_1?_encoding=UTF8&pg=1
Sucsessfully parse: https://www.amazon.com/Best-Sellers-Sports-Outdoors/zgbs/sporting-goods/ref=zg_bs_pg_2?_encoding=UTF8&pg=2
Sucsessfully parse: https://www.amazon.com/Best-Sellers-Sports-Collectibles/zgbs/sports-collectibles/ref=zg_bs_pg_1?_encoding=UTF8&pg=1
Sucsessfully parse: https://www.amazon.com/Best-Sellers-Sports-Collectibles/zgbs/sports-collectibles/ref=zg_bs_pg_2?_encoding=UTF8&pg=2
Sucsessfully parse: https://www.amazon.com/Best-Sellers-Home-Improvement/zgbs/hi/ref=zg_bs_pg_1?_encoding=UTF8&pg=1
Sucsessfully parse: https://www.amazon.com/Best-Sellers-Home-Improvement/zgbs/hi/ref=zg_bs_pg_2?_encoding=UTF8&pg=2
Sucsessfully parse: https://www.amazon.com/Best-Sellers-Toys-Games/zgbs/toys-and-games/ref=zg_bs_pg_1?_encoding=UTF8&pg=1
Sucses

Let's check how many pages we successfully parsed out of 80.

In [66]:
len(all_arcticle_tag)

76

## Extract information from each page

Here we define functions to extract information such as item description, rating, maximum price, minimum price, review count, and image URL from each page. Each piece of information is extracted from the corresponding HTML element of the parsed page content.

What is an HTML element?

An HTML element consists of a tag name, attributes, and child nodes (which can include text nodes and other elements). Data can be extracted from an element, and an element can also be used to manipulate the HTML.

What are HTML tags, attributes, and child nodes?

One easy way to understand what an HTML tag is: ask how a browser knows what content to display, how to display it, and where to display it, or what distinguishes a title from the main text or a body paragraph. Most of this is done with HTML tags, which are commands in a web page that tell the browser to do something. A tag is written between angle brackets, with an opening form such as `<p>` and a closing form such as `</p>`.

HTML attributes (`href`, `src`, `class`, `id`, `alt`, etc.) modify an HTML element. They are written inside the opening tag to control the element's behavior and provide additional information about the element. Examples of attributes: `class="intro"`, `id="firstname"`.
HTML child nodes (or children) are the elements nested directly inside a given element. For example, `<head>` and `<body>` are children of the `<html>` element.
Many different tags and attributes can be used in a single HTML page. To find the HTML elements that contain the information we need, we right-click the relevant part of the page and select "Inspect". The picture below shows an example of locating the tags that hold an item's price.

![price info](https://i.imgur.com/te6WBzI.png)
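To make tags and attributes concrete, here is a minimal sketch that lists them for a tiny made-up snippet. It uses Python's standard-library `html.parser` instead of BeautifulSoup so that it runs with no extra installation; the class name `TagLister` and the HTML snippet are illustrative assumptions, not markup copied from Amazon.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Record every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.seen.append((tag, dict(attrs)))


snippet = '<div class="price-box"><span class="a-color-price" id="p1">$39.99</span></div>'
lister = TagLister()
lister.feed(snippet)
for tag, attributes in lister.seen:
    print(tag, attributes)
```

Here `<span>` is a child of `<div>`, `class` and `id` are attributes, and the price text is a text node inside the `<span>`.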

Here we define the `get_topic_url_item_description` function to extract the item description.

In [67]:
def get_topic_url_item_description(doc,topic_description,topic_url):
    """Take a parent tag, a topic description, and a topic URL as input. After finding the item name tag,
    return the item name (description) together with its topic (category) and the topic URL.
    Argument:
    -doc(BeautifulSoup element): parent tag
    -topic_description(string): topic name or category
    -topic_url(string): topic url
    Return:
    -item_description(string): item name
    -topic_description(string): corresponding topic
    -topic_url(string): corresponding topic url"""
    name = doc.find("span", attrs={'class':'zg-text-center-align'})
    try:
        item_description = name.find_all('img', alt=True)[0]["alt"]
    except:
        item_description = ''
    return item_description,topic_description,topic_url  

Here we define `get_item_price` to extract the item's minimum and maximum price.

In [68]:
def get_item_price(d):
    """Take a parent tag as input and look for the corresponding child tag (item price).
    Return the minimum and maximum price of the item, or 0.0 for both when no price is found.
    When a single price is listed, it is returned as the minimum price and the maximum price is 0.0.
    Argument:
    -d(BeautifulSoup element): parent tag
    Return:
    -min_price(float): item minimum price
    -max_price(float): item maximum price
    """
    p = d.find("span",attrs={"class":"a-size-base a-color-price"})
    try :
        if "-" in p.text :
            min_price = float(((p.text).split("-")[0]).replace("$",""))
            max_price = float((((p.text).split("-")[1]).replace(",","")).replace("$",""))
        else :
            min_price = float(((p.text[:5]).replace(",","")).strip().replace("$",""))
            max_price = 0.0
    except:
        min_price = 0.0
        max_price = 0.0
    return min_price,max_price
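One detail of `get_item_price` is worth noting: the slice `p.text[:5]` keeps only the first five characters, so a price string such as `$39.99` would be read as `39.9`. In addition, the lower bound of a range is converted without removing thousands separators, so a range starting at `$1,299.00` would fall into the `except` branch. A possible alternative, sketched here with a hypothetical helper `parse_price_range` that works on plain text, uses a regular expression instead of slicing:

```python
import re


def parse_price_range(text: str):
    """Return (min_price, max_price); max_price is 0.0 when only one price is present."""
    # Capture every number that follows a dollar sign, allowing thousands separators
    amounts = re.findall(r"\$\s*(\d[\d,]*(?:\.\d+)?)", text)
    prices = [float(a.replace(",", "")) for a in amounts]
    if not prices:
        return 0.0, 0.0
    if len(prices) == 1:
        return prices[0], 0.0
    return prices[0], prices[1]


for sample in ("$39.99", "$1,299.00 - $1,499.00", "no price listed"):
    print(repr(sample), parse_price_range(sample))
```

It keeps the notebook's convention of returning 0.0 for a missing value, and could be called as `parse_price_range(p.text)` once `p` is known not to be `None`.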

Here we define the `get_item_rate` and `get_item_review` functions to extract the item's rating and number of customer reviews.

In [69]:
def get_item_rate(d):
    """Take a parent tag as input and look for the corresponding child tag (rating).
    Return the item rating out of 5, or 0.0 when no rating is found.
    Argument:
    -d(BeautifulSoup element): parent tag
    Return:
    -rating(float): item rating out of 5
    """
    rate = d.find("span",attrs={"class":"a-icon-alt"})
    try :
        rating = float(rate.text[:3])
    except:
        rating = 0.0
    return rating

def get_item_review(d):
    """Take a parent tag as input and look for the corresponding child tag (customer reviews).
    Return the number of customer reviews, or 0 when it cannot be found.
    Argument:
    -d(BeautifulSoup element): parent tag
    Return:
    -review(int): number of customer reviews for the item
    """
    review = d.find("a",attrs ={"class":"a-size-small a-link-normal"})
    try :
        review = int((review.text).replace(",",""))
    except:
        review = 0
    return review
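The string handling inside these two functions can be tried on plain text. The sketch below assumes the rating text looks like `4.7 out of 5 stars` and the review text like `615,699`; both formats, and the helper names `parse_rating` and `parse_review_count`, are assumptions for illustration.

```python
def parse_rating(text: str) -> float:
    """Take the leading number from text such as '4.7 out of 5 stars'."""
    try:
        return float(text.split()[0])
    except (IndexError, ValueError):
        return 0.0


def parse_review_count(text: str) -> int:
    """Convert text such as '615,699' to an integer, ignoring thousands separators."""
    try:
        return int(text.strip().replace(",", ""))
    except ValueError:
        return 0


print(parse_rating("4.7 out of 5 stars"))
print(parse_review_count("615,699"))
print(parse_rating(""), parse_review_count("n/a"))
```

Splitting on whitespace rather than slicing the first three characters also copes with a rating written with more or fewer digits.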

Here we define the `get_item_url` function to extract the item's image URL.

In [70]:
def get_item_url(d):
    """Take a parent tag as input and look for the corresponding child tag (image).
    Return the item image URL, or 'No image' when no image is found.
    Argument:
    -d(BeautifulSoup element): parent tag
    Return:
    -img(string): item image url
    """
    image = d.findAll("img", src = True)
    try:
        img = image[0]["src"]
    except:
        img = 'No image'
    return img

The extracted item data is stored directly in practical data types so that it is convenient for later analysis.

The item description and image URL are stored as strings.

The item price and rating are stored as floats, and the customer review count as an integer.

## Combine the information extracted from each page into a Python dictionary


Here we define a function called `get_info` to collect all the item data we need as lists and store them in a dictionary.

In [101]:
def get_info(article_tags,t_description,t_url):
    """Take a list of page contents (each entry is a BeautifulSoup result used to find the parent tags), a list of
     topic descriptions, and a list of topic URLs. Return a dictionary of lists holding, for each item: its topic,
     the topic URL, the item description, minimum price (and maximum price if it exists), rating, customer review
     count, and image URL.
    Argument:
    -article_tags(list): list containing all page contents, each entry being a BeautifulSoup result
    -t_description(list): list containing topic descriptions
    -t_url(list): list containing topic URLs
    Return:
    -dictionary(dict): dictionary containing all item data taken from each parsed topic page
    """
    topic_description, topics_url, item, item_url = [],[],[],[]
    minimum_price, maximum_price, rating, costuomer_review = [],[],[],[]
    
    for idx in range(0,len(article_tags)):
        doc = article_tags[idx]#.findAll('div', attrs={'class':'a-section a-spacing-none aok-relative'})
        for d in doc :
            names,topic_name,topic_url = get_topic_url_item_description(d,t_description[idx],t_url[idx])
            min_price,max_price = get_item_price(d)
            rate = get_item_rate(d)
            review = get_item_review(d)
            url = get_item_url(d)
            ####put each item data inside corresponding list
            item.append(names)
            topic_description.append(topic_name)
            topics_url.append(topic_url)
            minimum_price.append(min_price)
            maximum_price.append(max_price)
            rating.append(rate)
            costuomer_review.append(review)
            item_url.append(url)
    return {
           "Topic": topic_description,
           "Topic_url": topics_url,
           "Item_description": item,
           "Rating out of 5": rating,
           "Minimum_price": minimum_price,
           "Maximum_price": maximum_price,
           "Review" :costuomer_review,
           "Item Url" : item_url}

In [102]:
data = get_info(all_arcticle_tag,all_topics_description,all_topics_url)

After two parsing attempts we have extracted data from as many pages as we could access. We now store it in a DataFrame using the `Pandas` library.

## Save the information data to CSV file Using Pandas library

Let's load the obtained data into a pandas DataFrame.

In [103]:
dataframe = pd.DataFrame(data)

Let's display the result and check its shape (number of rows, number of columns).

In [104]:
dataframe

Unnamed: 0,Topic,Topic_url,Item_description,Rating out of 5,Minimum_price,Maximum_price,Review,Item Url
0,Amazon Devices & Accessories,https://www.amazon.com/Best-Sellers/zgbs/amazo...,Fire TV Stick 4K streaming device with Alexa V...,4.7,39.90,0.00,615786,https://images-na.ssl-images-amazon.com/images...
1,Amazon Devices & Accessories,https://www.amazon.com/Best-Sellers/zgbs/amazo...,Fire TV Stick (3rd Gen) with Alexa Voice Remot...,4.7,39.90,0.00,1884,https://images-na.ssl-images-amazon.com/images...
2,Amazon Devices & Accessories,https://www.amazon.com/Best-Sellers/zgbs/amazo...,Echo Dot (3rd Gen) - Smart speaker with Alexa ...,4.7,39.90,0.00,1124438,https://images-na.ssl-images-amazon.com/images...
3,Amazon Devices & Accessories,https://www.amazon.com/Best-Sellers/zgbs/amazo...,Fire TV Stick Lite with Alexa Voice Remote Lit...,4.7,29.90,0.00,151044,https://images-na.ssl-images-amazon.com/images...
4,Amazon Devices & Accessories,https://www.amazon.com/Best-Sellers/zgbs/amazo...,"Amazon Smart Plug, works with Alexa – A Certif...",4.7,24.90,0.00,425111,https://images-na.ssl-images-amazon.com/images...
...,...,...,...,...,...,...,...,...
3795,Video Games,https://www.amazon.com/best-sellers-video-game...,PowerA Joy Con Comfort Grips for Nintendo Swit...,4.7,9.88,0.00,21960,https://images-na.ssl-images-amazon.com/images...
3796,Video Games,https://www.amazon.com/best-sellers-video-game...,Super Mario 3D All-Stars (Nintendo Switch),4.8,58.00,0.00,15109,https://images-na.ssl-images-amazon.com/images...
3797,Video Games,https://www.amazon.com/best-sellers-video-game...,"Quest Link Cable 16ft, VOKOO Oculus Quest Link...",4.3,29.90,0.00,3576,https://images-na.ssl-images-amazon.com/images...
3798,Video Games,https://www.amazon.com/best-sellers-video-game...,"RUNMUS K8 Gaming Headset for PS4, Xbox One, PC...",4.5,21.80,0.00,47346,https://images-na.ssl-images-amazon.com/images...


In [105]:
dataframe.shape

(3800, 8)

We have a dataframe with 3,800 rows and 8 columns, which matches 76 parsed pages of 50 items each.

Let's save dataframe as a csv file using `pandas`.

In [106]:
dataframe.to_csv('AmazonBestSeller.csv', index=None)

The CSV file can be opened by clicking File > Open.

![csv](https://i.imgur.com/w50umFK.png)

Let's open our CSV file, read its lines, and print the first 5 lines of data.

In [107]:
with open("AmazonBestSeller.csv","r") as f:
    data = f.readlines()

In [108]:
data[:5]

['Topic,Topic_url,Item_description,Rating out of 5,Minimum_price,Maximum_price,Review,Item Url\n',
 'Amazon Devices & Accessories,https://www.amazon.com/Best-Sellers/zgbs/amazon-devices/ref=zg_bs_nav_0/138-0735230-1420505,Fire TV Stick 4K streaming device with Alexa Voice Remote | Dolby Vision | 2018 release,4.7,39.9,0.0,615786,"https://images-na.ssl-images-amazon.com/images/I/51CgKGfMelL._AC_UL200_SR200,200_.jpg"\n',
 'Amazon Devices & Accessories,https://www.amazon.com/Best-Sellers/zgbs/amazon-devices/ref=zg_bs_nav_0/138-0735230-1420505,Fire TV Stick (3rd Gen) with Alexa Voice Remote (includes TV controls) | HD streaming device | 2021 release,4.7,39.9,0.0,1884,"https://images-na.ssl-images-amazon.com/images/I/51KKR5uGn6L._AC_UL200_SR200,200_.jpg"\n',
 'Amazon Devices & Accessories,https://www.amazon.com/Best-Sellers/zgbs/amazon-devices/ref=zg_bs_nav_0/138-0735230-1420505,Echo Dot (3rd Gen) - Smart speaker with Alexa - Charcoal,4.7,39.9,0.0,1124438,"https://images-na.ssl-images-amazon
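Reading raw lines is enough for a quick look, but splitting them on commas by hand would break, because some descriptions and image URLs contain commas inside quotes. As a sketch, Python's standard `csv` module handles the quoting. The sample text below is the header plus one data row copied from the CSV format shown at the start of this notebook.

```python
import csv
import io

# Header and one data row, copied from the sample CSV output in this notebook
sample = (
    'Topic,Topic_url,Item_description,Rating out of 5,Minimum_price,Maximum_price,Review,Item Url\n'
    'Amazon Devices & Accessories,'
    'https://www.amazon.com/Best-Sellers/zgbs/amazon-devices/ref=zg_bs_nav_0/131-6756172-7735956,'
    '"Amazon Smart Plug, works with Alexa – A Certified for Humans Device",4.7,24.9,0.0,425090,'
    '"https://images-na.ssl-images-amazon.com/images/I/41uF7hO8FtL._AC_UL200_SR200,200_.jpg"\n'
)

# A plain split on commas produces too many pieces for the data row
naive_parts = sample.splitlines()[1].split(",")
print(len(naive_parts))

# csv.DictReader respects the quotes and maps each value to its column name
rows = list(csv.DictReader(io.StringIO(sample)))
first = rows[0]
print(len(first))
print(first["Item_description"])
print(first["Minimum_price"], first["Review"])
```

The same reader works on the real file by passing the object returned by `open("AmazonBestSeller.csv", newline="")` in place of the `io.StringIO` object.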

In [109]:
jovian.commit()

[jovian] Attempting to save notebook..
[jovian] Updating notebook "landryroni/amazon-best-seller-web-scraping" on https://jovian.ai
[jovian] Uploading notebook..
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.ai/landryroni/amazon-best-seller-web-scraping


'https://jovian.ai/landryroni/amazon-best-seller-web-scraping'

## Summary

In this project, we:

- Installed and imported the libraries
- Downloaded and parsed the Best Sellers HTML page source using requests and BeautifulSoup to get the URL of each item category (topic)
- Repeated step 2 for each topic, using its corresponding URL
- Extracted information from each page
- Combined the information extracted from each page into Python dictionaries
- Saved the data to a CSV file using the pandas library

The resulting CSV file has the following format:


```
Topic,Topic_url,Item_description,Rating out of 5,Minimum_price,Maximum_price,Review,Item Url
Amazon Devices & Accessories,https://www.amazon.com/Best-Sellers/zgbs/amazon-devices/ref=zg_bs_nav_0/131-6756172-7735956,Fire TV Stick 4K streaming device with Alexa Voice Remote | Dolby Vision | 2018 release,4.7,39.9,0.0,615699,"https://images-na.ssl-images-amazon.com/images/I/51CgKGfMelL._AC_UL200_SR200,200_.jpg"
Amazon Devices & Accessories,https://www.amazon.com/Best-Sellers/zgbs/amazon-devices/ref=zg_bs_nav_0/131-6756172-7735956,Fire TV Stick (3rd Gen) with Alexa Voice Remote (includes TV controls) | HD streaming device | 2021 release,4.7,39.9,0.0,1844,"https://images-na.ssl-images-amazon.com/images/I/51KKR5uGn6L._AC_UL200_SR200,200_.jpg"
Amazon Devices & Accessories,https://www.amazon.com/Best-Sellers/zgbs/amazon-devices/ref=zg_bs_nav_0/131-6756172-7735956,"Amazon Smart Plug, works with Alexa – A Certified for Humans Device",4.7,24.9,0.0,425090,"https://images-na.ssl-images-amazon.com/images/I/41uF7hO8FtL._AC_UL200_SR200,200_.jpg"
Amazon Devices & Accessories,https://www.amazon.com/Best-Sellers/zgbs/amazon-devices/ref=zg_bs_nav_0/131-6756172-7735956,Fire TV Stick Lite with Alexa Voice Remote Lite (no TV controls) | HD streaming device | 2020 release,4.7,29.9,0.0,151007,"https://images-na.ssl-images-amazon.com/images/I/51Da2Z%2BFTFL._AC_UL200_SR200,200_.jpg"
```
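
Rows in this format can be produced by flattening the per-page results and writing them out with one column per dictionary key. The sketch below illustrates the idea with `itertools.chain` and the standard-library `csv` module, which is swapped in here for pandas only to keep the example dependency-free. The names `page_results` and `item` are hypothetical, and the "one list of dictionaries per page" layout is an assumption about the intermediate data, not the notebook's exact structure. The two rows are taken from the sample above.

```python
import csv
import io
from itertools import chain

COLUMNS = ["Topic", "Topic_url", "Item_description", "Rating out of 5",
           "Minimum_price", "Maximum_price", "Review", "Item Url"]

TOPIC = "Amazon Devices & Accessories"
TOPIC_URL = ("https://www.amazon.com/Best-Sellers/zgbs/amazon-devices/"
             "ref=zg_bs_nav_0/131-6756172-7735956")
IMG = "https://images-na.ssl-images-amazon.com/images/I/"


def item(description, rating, min_price, max_price, reviews, image):
    """Build one row dictionary keyed by the CSV column names (hypothetical helper)."""
    values = [TOPIC, TOPIC_URL, description, rating,
              min_price, max_price, reviews, IMG + image]
    return dict(zip(COLUMNS, values))


# Assumed intermediate structure: one list of item dictionaries per scraped page.
page_results = [
    [item("Fire TV Stick 4K streaming device with Alexa Voice Remote | Dolby Vision | 2018 release",
          4.7, 39.9, 0.0, 615699, "51CgKGfMelL._AC_UL200_SR200,200_.jpg")],
    [item("Fire TV Stick (3rd Gen) with Alexa Voice Remote (includes TV controls) | HD streaming device | 2021 release",
          4.7, 39.9, 0.0, 1844, "51KKR5uGn6L._AC_UL200_SR200,200_.jpg")],
]

# Flatten the per-page lists into a single list of rows.
all_items = list(chain.from_iterable(page_results))

# The buffer stands in for open("AmazonBestSeller.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(all_items)

csv_text = buffer.getvalue()
print(csv_text)
```

The writer quotes only the fields that need it, which is why the image URLs (containing `,`) appear in double quotes while the other columns do not.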

The result is promising, and the project covers techniques that will be useful in future web scraping projects with Python, requests, and BeautifulSoup. There is still room for improvement, such as making the scraping code more robust. Some pages fail to parse because Amazon returns a captcha; suggestions for handling this are welcome. Please feel free to leave a comment to help improve this work.


## Future Work

- Additional work can be done to avoid captchas and get access to all of the pages.
- Explore other, more complex websites.
- Explore how to scrape data using Selenium and Scrapy.
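
As a starting point for the captcha item, one option is to detect a captcha page and retry after a growing delay. The sketch below does not bypass captchas, and retrying is not guaranteed to help. The marker strings are assumptions about what the robot-check page contains and should be adjusted after inspecting a page that actually failed to parse. `fetch_with_retries` and `looks_like_captcha` are hypothetical helpers, and `fetch` can be any callable that returns page HTML, such as a small wrapper around `requests.get`.

```python
import time

# Assumed markers: phrases expected on the robot-check page. They are not
# taken from the notebook and may change over time.
CAPTCHA_MARKERS = (
    "Enter the characters you see below",
    "api-services-support@amazon.com",
)


def looks_like_captcha(html):
    """Return True if the page text contains any of the assumed captcha markers."""
    return any(marker in html for marker in CAPTCHA_MARKERS)


def fetch_with_retries(fetch, url, max_attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-captcha page or attempts run out.

    Returns the page HTML, or None if every attempt hit a captcha.
    """
    for attempt in range(max_attempts):
        html = fetch(url)
        if not looks_like_captcha(html):
            return html
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
    return None


# Demo with a stand-in fetcher: the first call returns a captcha page,
# the second a normal page. Delays are recorded instead of slept.
responses = iter([
    "<html>Enter the characters you see below</html>",
    "<html>Best Sellers</html>",
])
delays = []
page = fetch_with_retries(
    lambda url: next(responses),
    "https://www.amazon.com/gp/bestsellers/",
    sleep=delays.append,
)
print(page, delays)
```

Passing `fetch` and `sleep` as parameters keeps the retry logic testable without network access; in the scraper itself, pages that still return `None` could be logged and skipped.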

![next](https://i.imgur.com/RIq5fYV.png)

## References

Here are some links to learn more about the libraries used:
- [requests](https://docs.python-requests.org/projects/requests-html/en/latest/)
- [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [HTML source code](https://en.wikipedia.org/wiki/HTML#:~:text=The%20HyperText%20Markup%20Language%2C%20or,displayed%20in%20a%20web%20browser.)
- [pandas](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)
- [itertools.chain](https://docs.python.org/3/library/itertools.html#itertools.chain)

Here are some tutorials on web scraping:

- [Web scraping project from scratch](https://www.youtube.com/watch?v=RKsLLG-bzEY&t=1104s)
- [Amazon web scraping](https://www.datacamp.com/community/tutorials/amazon-web-scraping-using-beautifulsoup)
- [Web scraping with Yahoo Finance](https://towardsdatascience.com/web-scraping-yahoo-finance-477fe3daa852)


In [94]:
jovian.submit(assignment="zerotoanalyst-project1")

[jovian] Attempting to save notebook..
[jovian] Updating notebook "landryroni/amazon-best-seller-web-scraping" on https://jovian.ai
[jovian] Uploading notebook..
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.ai/landryroni/amazon-best-seller-web-scraping
[jovian] Submitting assignment..
[jovian] Verify your submission at https://jovian.ai/learn/zero-to-data-analyst-bootcamp/assignment/project-1-web-scraping-with-python


In [None]:
jovian.commit(files=["AmazonBestSeller.csv"])

[jovian] Attempting to save notebook..
