# Script for extracting product list for a seller

The script takes as input a .txt file containing a list of brands offered by a seller, and then iterates through all the brands in the list.

For each brand in the brand list, it finds all the products offered by the given seller. The details are stored in a .jsonl file. 

In [85]:
import urllib.request # For URL encoding
import requests # For making requests to download a webpage's content
from selectorlib import Extractor # For extracting specific fields from the downloaded webpage
import json 
import random 
from time import sleep
import os
import jsonlines
import pandas as pd
import datetime
import re

#### Step 1: Read Brand List 

**NOTE:** Before running this, change the path variable 'brands' to point to the Brand List file. The brand list should be in .txt format with each line containing a brand name. 

The following code loads a brand list file, and reads all its brands into a list. 

In [87]:
!ls ../DATASET/BrandLists/

Appario_Brand_List.txt       URL_Encode_List.py
Cloudtail_Brand_List_A.txt   keywords.txt
Cloudtail_Brand_List_B.txt   text.txt
Cloudtail_Brand_List_Top.txt


In [96]:
# brands_path = './../DATASETS/BrandLists/CloudtailBrandListTop.txt'
brands_path = './../DATASET/BrandLists/Appario_Brand_List.txt'

# The with-block closes the file automatically once all brands are read
with open(brands_path, 'r') as brands:
    # Remove surrounding spaces and the trailing newline (\n) from each brand name
    brand_list = [b.strip(" \n") for b in brands]

print('Brand List: ', brand_list[:10])
print('Brand Count: ', len(brand_list))

Brand List:  ['10.or', '100FIT', '1KLICK', '2010KHARIDO', '3M', '4+D+%28LABEL%29', '4D', '5E', 'A-DATA', 'A.W.Faber-Castell']
Brand Count:  973
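
Entries such as `4+D+%28LABEL%29` in the output above suggest that the brand names in the file are already URL-encoded, which matters because they are later appended directly to the search URL. The following is a minimal sketch of how raw names could be encoded with the standard library. The helper name `encode_brand` and the raw name `4 D (LABEL)` are assumptions for illustration and are not taken from `URL_Encode_List.py`.

```python
from urllib.parse import quote_plus, unquote_plus


def encode_brand(raw_name):
    """URL-encode a raw brand name (hypothetical helper, not part of the original script)."""
    # quote_plus turns spaces into '+' and reserved characters into %XX escapes.
    return quote_plus(raw_name.strip())


# Assumed raw names, for illustration only.
raw_brands = ['10.or', '4 D (LABEL)', 'A.W.Faber-Castell']
encoded_brands = [encode_brand(name) for name in raw_brands]
print(encoded_brands)

# Decoding gives back a readable name, e.g. for logs.
print(unquote_plus('4+D+%28LABEL%29'))
```

Names made up only of letters, digits, dots and hyphens pass through unchanged, so encoding an already clean list is harmless.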


#### Step 2: Define Headers

Each header carries a different user agent string, which is sent with the request to the website being scraped. We use multiple user agents so that if a request is rejected, we can retry with a different one.

To create more headers, simply copy any one of the existing headers and replace the 'user-agent' string with a new 'user-agent' string, which can be found online. (Eg. https://developer.chrome.com/multidevice/user-agent)
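
Since every header in the list below differs only in its 'user-agent' value, the copy-and-replace step can also be done in code. This is a sketch, not the notebook's own method: `BASE_HEADER` and `make_headers` are hypothetical names, and the shared field values are copied from the header list in this notebook.

```python
# Fields shared by every header (values copied from the list in this notebook).
BASE_HEADER = {
    'dnt': '1',
    'upgrade-insecure-requests': '1',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'referer': 'https://www.amazon.com/',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
}


def make_headers(user_agents):
    """Build one header dict per user-agent string (hypothetical helper)."""
    # dict(...) copies BASE_HEADER, so the shared template is never modified.
    return [dict(BASE_HEADER, **{'user-agent': ua}) for ua in user_agents]


user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:70.0) Gecko/20100101 Firefox/70.0',
]
extra_headers = make_headers(user_agents)
print(len(extra_headers), extra_headers[0]['user-agent'])
```

The resulting list has the same shape as `headers`, so it could be appended to it with `headers.extend(extra_headers)` after the cell below has run.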

In [89]:
headers = [
           {
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://www.amazon.com/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
           },
           {
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://www.amazon.com/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
           },
           {
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://www.amazon.com/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
           },
           {
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://www.amazon.com/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
           },
           {
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://www.amazon.com/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
           },
           {
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1 Safari/605.1.15',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://www.amazon.com/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
           },
           {
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://www.amazon.com/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
           },
           {
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://www.amazon.com/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
           },
           {
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://www.amazon.com/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
           },
           {
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:70.0) Gecko/20100101 Firefox/70.0',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://www.amazon.com/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
           },
           {
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://www.amazon.com/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
           },
           {
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://www.amazon.com/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
           },
           {
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://www.amazon.com/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
           },
           {
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36 OPR/68.0.3618.165',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://www.amazon.com/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
           },
           {
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Trident/7.0; rv:11.0) like Gecko',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://www.amazon.com/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
           },
           {
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36 Edg/83.0.478.37',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://www.amazon.com/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
           }
]

#### Step 3: Read Extractor Files

The extractor (.yml) files contain *css id* information about the fields which we intend to extract from the scraped website. Here, the two extractor files are:
##### 1. product_list.yml
From the scraped webpage, this extractor file extracts the main *css division* which contains all the individual (child) products. Once the main div is extracted, it extracts all the child divisions (products) contained in it.
##### 2. nextpg.yml
Extracts the 'next' button from the webpage, to check whether it is disabled. If it is disabled, we have reached the end of the product list for the current brand, and we move on to the next brand to continue scraping.

In [90]:
e = Extractor.from_yaml_file('./Extractor/product_list.yml')
l = Extractor.from_yaml_file('./Extractor/nextpg.yml')

#### Step 4: Define scrape function
**NOTE:** Set the variables MAX_TRIALS and ERROR_THRESHHOLD according to your preferences.

A high MAX_TRIALS slows down the scraping, because pages that genuinely contain no data are also requested multiple times, but it reduces the chance of missing data.
A low ERROR_THRESHHOLD also slows down the scraping, as the VPN will need to be changed more often. However, it reduces the chance of missing data due to errors.

The function scrape(url) downloads the webpage at the given url (here: product list pages) using the requests module, and looks for products on the page. If it finds any product, it returns the html text of the page. If no product is found, it randomly selects a new header and retries until the limit MAX_TRIALS is reached, at which point it concludes that the page does not contain any data.

These multiple trials are required because Amazon often blocks a user for repeatedly making requests using the same user agent.
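
The retry idea described above can be sketched in isolation, without any network access. This is an illustration under stated assumptions, not the notebook's scrape function: `fetch_with_rotation`, `fake_fetch`, and the injected callables `fetch` and `has_products` are hypothetical stand-ins for the `requests` call and the extractor check.

```python
import random


def fetch_with_rotation(url, header_pool, fetch, has_products, max_trials=15):
    """Try up to max_trials times, picking a random header for each attempt.

    fetch(url, header) and has_products(html) are hypothetical callables that
    stand in for the real download and the product check.
    Returns the html text, or None if every attempt came back without products.
    """
    for trial in range(1, max_trials + 1):
        header = random.choice(header_pool)
        html = fetch(url, header)
        if has_products(html):
            return html
    return None


def fake_fetch(url, header):
    # Stand-in fetcher: pretends the site only answers the 'good' user agent.
    if header['user-agent'] == 'good':
        return '<div class="product">item</div>'
    return ''


pool = [{'user-agent': 'good'}]
html = fetch_with_rotation('https://example.com', pool, fake_fetch, bool)
print(html)
```

Unlike the function defined below, this sketch returns `None` instead of the string 'False' when nothing is found, which avoids confusing a failed page with page text.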

In [101]:
MAX_TRIALS = 15 # Set the max number of trials to perform here.
ERROR_COUNT = 1 # Used for keeping a count of errors; when the count reaches a multiple of
                # the threshold, the user is asked to change the vpn

ERROR_THRESHHOLD = 10 # Number of pages with missed information after which vpn change is required


def scrape(url):
    '''
    This function downloads the webpage at the given url using requests module.

    Parameters:
    url (string): URL of webpage to scrape
    Returns:
    string: If the URL contains products, returns the html of the webpage as text, else returns 'False'.
    '''
    # Only ERROR_COUNT is reassigned here; the other settings are only read.
    global ERROR_COUNT

    # Download the page using requests
    print("Downloading %s" % url)
    trial = 0
    while True:
        if ERROR_COUNT % ERROR_THRESHHOLD == 0:
            _ = input('Please Change VPN and press any key to continue')
            ERROR_COUNT += 1
        if trial == MAX_TRIALS:
            print("Max trials exceeded yet no Data found on this page!")
            ERROR_COUNT += 1
            return 'False'
        trial = trial + 1
        print("Trial no:", trial)

        # Get the html data from the url. A network error does not use up a trial:
        # we wait and request the same page again.
        while True:
            try:
                r = requests.get(url, headers=random.choice(headers), timeout=15)
                break
            except requests.exceptions.RequestException as err:
                # RequestException is the base class of HTTPError, ConnectionError and Timeout
                print('Error Detected: ', err)
                print('Retrying after 30 seconds')
                sleep(30)

        # We use product_list.yml extractor to extract the product details from the html data text
        data = e.extract(r.text)
        # If the products div in the scraped html is not empty, return html text.
        # If the products div in the scraped html is empty, retry with new user agent.
        if data['products'] is not None:
            return r.text
        print("Retrying with new user agent!")

#### Step 5: Initialise path of output file

**NOTE:** Set the file name according to what is being scraped.

Eg: SCRAPED_PRODUCT_LIST_APPARIO or SCRAPED_PRODUCT_LIST_CLOUDTAIL
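
Opening a file in append mode fails with `FileNotFoundError` if its parent folder does not exist yet, so it can help to create the output folder while building the path. A minimal sketch, assuming a hypothetical helper `build_output_path`. The demo writes into a temporary folder rather than `./ScriptOutput/DATASET/`.

```python
import os
import tempfile


def build_output_path(file_name, base_dir='./ScriptOutput/DATASET'):
    """Return '<base_dir>/<file_name>.jsonl', creating base_dir if needed (hypothetical helper)."""
    os.makedirs(base_dir, exist_ok=True)
    return os.path.join(base_dir, file_name + '.jsonl')


# Demo in a throwaway directory so nothing is created next to the notebook.
demo_dir = os.path.join(tempfile.mkdtemp(), 'DATASET')
demo_path = build_output_path('SCRAPED_PRODUCT_LIST_APPARIO', base_dir=demo_dir)
print(demo_path)
```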

In [92]:
FileName = input('Enter a Filename for output file!\n')

outfile_path = str('./ScriptOutput/DATASET/' + str(FileName) + '.jsonl')    

Enter a Filename for output file!
ProductListAppario_NewWithHtmlPages


#### Step 6: Enter Seller Name

Enter Seller Name which the brand list is associated with.
Eg: Cloudtail India or Appario Retail Pvt Ltd

In [93]:
seller = input('Enter Seller Name!\n')

Enter Seller Name!
Appario Retail Private Ltd


#### Step 7: Defining Functions to clean the data.

In [94]:
def CleanRating(s):
    '''
    Here, the input is rating in a string format, eg: "3.3 out of 5 stars".
    The function converts it to a float, eg: 3.3
    Returns None if the input is None or cannot be parsed.
    '''
    if s is not None:
        try:
            return float(s.split(' ')[0])
        except (ValueError, AttributeError):
            return None
    else:
        return None

def CleanRatingCount(s):
    '''
    Here, the input is RatingCount in a string format, eg: "336 ratings".
    The function converts it to a float, eg: 336.0
    A missing rating count is treated as 0.
    '''
    if s is not None:
        return float(s.split(' ')[0].replace(',', ''))
    else:
        return float(0)
    
def CleanAmazonPrice(s):
    '''
    Here, the input is AmazonPrice in a string format, eg: "₹ 336.00".
    The function converts it to a float, eg: 336.0
    Returns None if the input is None.
    '''
    if s is not None:
        print(s)
        # Remove the currency symbol, thousands separators and stray '\x' / 'a' characters
        s = s.replace('₹', '').replace(',', '').replace(r'\x', '').replace('a', '')
        return float(s.strip().split(' ')[0])
    else:
        return s
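
All three cleaning functions above follow the same idea: take the leading number out of a display string. As an alternative sketch (not the notebook's method), the same idea can be written once with a regular expression. The helper name `first_number` is hypothetical, and the sample strings are made up in the style of the docstrings above.

```python
import re

# One or more digits, optionally followed by a decimal part.
_NUMBER = re.compile(r'\d+(?:\.\d+)?')


def first_number(text):
    """Return the first number found in text as a float, or None (hypothetical helper)."""
    if text is None:
        return None
    # Drop thousands separators before searching, e.g. "1,336" becomes "1336".
    match = _NUMBER.search(text.replace(',', ''))
    return float(match.group()) if match else None


print(first_number('3.3 out of 5 stars'))
print(first_number('1,336 ratings'))
print(first_number('₹\xa0336.00'))
print(first_number('Currently unavailable'))
```

Because it returns `None` for text without digits, this variant does not raise on unexpected input, at the cost of silently skipping values it cannot parse.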

#### Step 8: Begin Main Scraping

##### NOTE: CHANGE THE URL BASED ON THE SELLER

Eg:

Cloudtail URLs -> https://www.amazon.in/s?i=merchant-items&me=AT95IG9ONZD7S&rh=p_4%3AAmazon

Appario URLs -> https://www.amazon.in/s?i=merchant-items&me=A14CZOWI0VEHLG&rh=p_4%3AAmazon

Note that both URLs search for 'Amazon' brand products: the first on Cloudtail's storefront, the second on Appario's. Only the value of the `me` parameter differs between the two sellers.
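
To avoid editing the URL string by hand when switching sellers, the URL can be assembled from its parts. This is a sketch with hypothetical names (`SELLER_IDS`, `build_search_url`). The two `me` values are the ones shown in the example URLs above, and the brand is assumed to be already URL-encoded, as in the brand list.

```python
# 'me' values taken from the example URLs above.
SELLER_IDS = {
    'Cloudtail': 'AT95IG9ONZD7S',
    'Appario': 'A14CZOWI0VEHLG',
}


def build_search_url(seller_id, encoded_brand, page=1):
    """Build the storefront search URL for one brand and page (hypothetical helper)."""
    url = ("https://www.amazon.in/s?i=merchant-items&me=" + seller_id
           + "&rh=p_4%3A" + encoded_brand)
    # The first page has no page parameter; later pages append "&dc&page=<number>".
    if page > 1:
        url += "&dc&page=" + str(page)
    return url


print(build_search_url(SELLER_IDS['Appario'], '10.or'))
print(build_search_url(SELLER_IDS['Cloudtail'], 'Amazon', page=2))
```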

In [None]:
with open(outfile_path,'a') as outfile:
    for b in brand_list:
        pg_number = 1
        
        while True:
                
            # The first page has no page parameter; later pages append "&dc&page=<number>"
            if pg_number == 1:
                url = str("https://www.amazon.in/s?i=merchant-items&me=A14CZOWI0VEHLG&rh=p_4%3A"+str(b))
            else:
                url = str("https://www.amazon.in/s?i=merchant-items&me=A14CZOWI0VEHLG&rh=p_4%3A"+str(b)+"&dc&page="+str(pg_number))

            data_text = scrape(url)

            # Case 1: Scraped page does not contain any products
            if data_text == 'False': 
                pass

            # Case 2: Scraped page contains products
            else: 
                # Extract all product details in a dict 'data' using the extractor file
                data = e.extract(data_text) 

                # Save html text to file
                html_files_path = str('./ScriptOutput/HTML/'+ str(FileName) + '/' + str(b) +'/Page_'+str(pg_number)+'.html')
                os.makedirs(os.path.dirname(html_files_path), exist_ok=True) # Create the folder for our html data
                # utf-8 avoids encoding errors on systems whose default encoding cannot store '₹'
                with open(html_files_path, 'w', encoding='utf-8') as file:
                    file.write(data_text)

                # data['products'] is a list with one dict of details per product on the scraped page
                for product in data['products']: 
                    product['Rating'] = CleanRating(product['Rating'])
                    product['RatingCount'] = CleanRatingCount(product['RatingCount'])
                    product['AmazonPrice'] = CleanAmazonPrice(product['AmazonPrice'])
                    product['SearchUrl'] = url
                    product['Brand'] = b
                    product['Seller'] = seller
                    date = datetime.datetime.now()
                    product['Timestamp'] = date.strftime("%c")
                    product['ProductPageUrl'] = str('https://www.amazon.in' + str(product['ProductPageUrl']))
                    print("Saving Product: %s"%product['Title'])
                    print(product)
                    json.dump(product,outfile)
                    outfile.write("\n")
                          
            # If the page had no products, or the next page is not available, break and go to next brand
            if data_text == 'False':
                break
            elif l.extract(data_text)['last'] == 'Next →':
                break
            else:
                pg_number += 1 # Incrementing page number

Downloading https://www.amazon.in/s?i=merchant-items&me=A14CZOWI0VEHLG&rh=p_4%3A10.or
Trial no: 1
Retrying with new user agent!
Trial no: 2
Retrying with new user agent!
Trial no: 3
Retrying with new user agent!
Trial no: 4
Retrying with new user agent!
Trial no: 5
Retrying with new user agent!
Trial no: 6
Retrying with new user agent!
Trial no: 7
Retrying with new user agent!
Trial no: 8
Retrying with new user agent!
Trial no: 9
Retrying with new user agent!
Trial no: 10
Retrying with new user agent!
Trial no: 11
Retrying with new user agent!
Trial no: 12
Retrying with new user agent!
Trial no: 13
Retrying with new user agent!
Trial no: 14
Retrying with new user agent!
Trial no: 15
Retrying with new user agent!
Max trials exceeded yet no Data found on this page!
Downloading https://www.amazon.in/s?i=merchant-items&me=A14CZOWI0VEHLG&rh=p_4%3A100FIT
Trial no: 1
129
Saving Product: 100FIT Tempered Glass for Vivo Y19/U20 (11D)-Edge to Edge Full Screen Coverage
{'Title': '100FIT Tempered G

Retrying with new user agent!
Trial no: 2
Retrying with new user agent!
Trial no: 3
Retrying with new user agent!
Trial no: 4
Retrying with new user agent!
Trial no: 5
Retrying with new user agent!
Trial no: 6
Retrying with new user agent!
Trial no: 7
Retrying with new user agent!
Trial no: 8
Retrying with new user agent!
Trial no: 9
Retrying with new user agent!
Trial no: 10
Retrying with new user agent!
Trial no: 11
Retrying with new user agent!
Trial no: 12
Retrying with new user agent!
Trial no: 13
Retrying with new user agent!
Trial no: 14
Retrying with new user agent!
Trial no: 15
Retrying with new user agent!
Max trials exceeded yet no Data found on this page!
Downloading https://www.amazon.in/s?i=merchant-items&me=A14CZOWI0VEHLG&rh=p_4%3A1KLICK
Trial no: 1
480
Saving Product: 1KLICK G7 Optical Gaming Mouse (Black)
{'Title': '1KLICK G7 Optical Gaming Mouse (Black)', 'Label': None, 'Rating': 3.6, 'RatingCount': 46.0, 'AmazonPrice': 480.0, 'ProductPageUrl': 'https://www.amazon.in/

Retrying with new user agent!
Trial no: 2
Retrying with new user agent!
Trial no: 3
Retrying with new user agent!
Trial no: 4
Retrying with new user agent!
Trial no: 5
Retrying with new user agent!
Trial no: 6
Retrying with new user agent!
Trial no: 7
Retrying with new user agent!
Trial no: 8
Retrying with new user agent!
Trial no: 9
Retrying with new user agent!
Trial no: 10
Retrying with new user agent!
Trial no: 11
Retrying with new user agent!
Trial no: 12
Retrying with new user agent!
Trial no: 13
Retrying with new user agent!
Trial no: 14
Retrying with new user agent!
Trial no: 15
Retrying with new user agent!
Max trials exceeded yet no Data found on this page!
Downloading https://www.amazon.in/s?i=merchant-items&me=A14CZOWI0VEHLG&rh=p_4%3A4+D+%28LABEL%29


#### Step 9: Read .jsonl File
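
The output file is in JSON Lines format: the main loop writes one JSON object per line with `json.dump` followed by a newline. The cell below reads it with the `jsonlines` package. As a sketch of what that involves, the same round trip can be done with only the standard library. The helper name `read_jsonl` and the two demo records are made up for illustration.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts (hypothetical helper)."""
    records = []
    with open(path, encoding='utf-8') as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


# Write two made-up records the same way the main loop does, then read them back.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.jsonl')
with open(demo_path, 'a', encoding='utf-8') as outfile:
    for product in [{'Title': 'Example A', 'AmazonPrice': 480.0},
                    {'Title': 'Example B', 'AmazonPrice': None}]:
        json.dump(product, outfile)
        outfile.write("\n")

records = read_jsonl(demo_path)
print(len(records), records[0]['Title'])
```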

In [84]:
# To read a different file, replace outfile_path with e.g. './ScriptOutput/DATASET/test.jsonl'
# The with-block closes the file once every record has been read
with open(outfile_path) as ProductListFile:
    reader = jsonlines.Reader(ProductListFile)
    ProductList = [item for item in reader.iter()]

df = pd.DataFrame(ProductList)
print(df.count())
df.head()

Title                     7
Brand                     7
Rating                    7
RatingCount               7
AnsweredQuestionsCount    7
MRP                       4
AmazonPrice               7
Savings                   4
ShortDescription          7
ProductDescription        0
BestSellerRank            0
DateFirstAvailable        0
Breadcrumbs               0
Seller                    7
FullfilledBy              7
Availability              7
ProductPageUrl            7
ASIN                      7
DiscountPercentage        4
Keywords                  0
dtype: int64


| | Title | Brand | Rating | RatingCount | AnsweredQuestionsCount | MRP | AmazonPrice | Savings | ShortDescription | ProductDescription | BestSellerRank | DateFirstAvailable | Breadcrumbs | Seller | FullfilledBy | Availability | ProductPageUrl | ASIN | DiscountPercentage | Keywords |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Fire TV Stick streaming media player with Alex... | Brand: Amazon | 4.2 | 26423.0 | 1000.0 | | 3999.0 | | #1 best-selling streaming media player, with a... | | | | | Cloudtail India | Fulfilled by Amazon | Available | https://www.amazon.in/Amazon-FireTVStick-Alexa... | B0791YHVMK | | |
| 1 | Echo Dot (3rd Gen) – New and improved smart sp... | Brand: Amazon | 4.3 | 26950.0 | 1000.0 | 4499.0 | 3499.0 | 1000.0 | Our most popular smart speaker with 360 degree... | | | | | Cloudtail India | Fulfilled by Amazon | Available | https://www.amazon.in/All-new-Echo-Dot-3rd-Gen... | B07PFFMP9P | 22.0 | |
| 2 | Echo Dot (3rd Gen) – New and improved smart sp... | Amazon | 4.3 | 26950.0 | 1000.0 | 4499.0 | 3499.0 | 1000.0 | Our most popular smart speaker with 360 degree... | | | | | Cloudtail India | Fulfilled by Amazon | Available | https://www.amazon.in/All-new-Echo-Dot-3rd-Gen... | B07PKXJN7J | 22.0 | |
| 3 | All-New Alexa Voice Remote with Power and Volu... | Brand: Amazon | 4.1 | 1286.0 | 675.0 | | 1999.0 | | Compatible with Fire TV Stick(2nd Generation),... | | | | | Cloudtail India | Fulfilled by Amazon | Available | https://www.amazon.in/Amazon-FireTV-Stick-Alex... | B07B6NCTWB | | |
| 4 | Echo Dot (3rd Gen) – New and improved smart sp... | Amazon | 4.3 | 26950.0 | 1000.0 | 4499.0 | 3499.0 | 1000.0 | Our most popular smart speaker with 360 degree... | | | | | Cloudtail India | Fulfilled by Amazon | Available | https://www.amazon.in/C78MP8/dp/B07PGL2ZSL/ref... | B07PGL2ZSL | 22.0 | |
