In [29]:
import requests
from lxml import html
import re
import json
import csv
import click

In [2]:
response = requests.get(url="http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
                       headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36"})
response

<Response [200]>

In [3]:
print(response.request.headers)

{'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}


Earlier, when parsing an HTML file, we used ```lxml.etree```. Here the response from the site arrives as a string of HTML, so we use a different module, ```lxml.html```, and call its ```fromstring()``` function to parse that text into an element tree:

In [4]:
tree = html.fromstring(response.text)
tree

<Element html at 0x29b5b072d68>

## Scraping the contents

- book title
- book cost
- book stock
- book description

In [5]:
title = tree.xpath("//div[contains(@class,'product_main')]/h1/text()")[0]
price = tree.xpath("//div[contains(@class,'product_main')]/p[1]/text()")[0]
stock = tree.xpath("//div[contains(@class,'product_main')]/p[2]/text()")[1].strip()
description = tree.xpath("//div[contains(@id,'product_description')]/following-sibling::p/text()")[0]

print(title,"\n")
print(price,"\n")
print(stock,"\n")
print(description,"\n")

A Light in the Attic 

Â£51.77 

In stock (22 available) 

It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think so

Cleaning up the XPath code above:

In [6]:
## Store the main div, from which most of the data is scraped, in one variable
product_main = tree.xpath("//div[contains(@class,'product_main')]")[0]


## Now query relative to 'product_main' to scrape the data we want:

title = product_main.xpath(".//h1/text()")[0]
price = product_main.xpath(".//p[1]/text()")[0]
stock = product_main.xpath(".//p[2]/text()")[1].strip()
description = tree.xpath("//div[contains(@id,'product_description')]/following-sibling::p/text()")[0]
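
Indexing an XPath result with `[0]` or `[1]` raises `IndexError` as soon as a page lacks the element or arranges its whitespace differently. A minimal sketch of a more forgiving lookup follows. The helper name `first_text` is hypothetical, and the plain lists are stand-ins for what a `.../text()` query returns, so the example runs without a network request.

```python
def first_text(nodes, default=None):
    """Return the first non-blank string in an XPath text() result, stripped."""
    for node in nodes:
        text = str(node).strip()
        if text:
            return text
    return default


# Stand-ins for XPath results (assumed shapes, not fetched from the site).
stock_nodes = ["\n    ", "\n    In stock (22 available)\n"]
missing_nodes = []

print(first_text(stock_nodes))           # In stock (22 available)
print(first_text(missing_nodes, "n/a"))  # n/a
```

With such a helper, `first_text(product_main.xpath(".//p[2]/text()"))` would no longer depend on the stock text sitting at index 1.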

Storing the data in a single dictionary:

In [7]:
book_information = {
    "title":title,
    "price":price,
    "stock":stock,
    "description":description
}

book_information

{'title': 'A Light in the Attic',
 'price': 'Â£51.77',
 'stock': 'In stock (22 available)',
 'description': "It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put 

## Cleaning the data

In [15]:
regex = re.compile(r"\d+")  # raw string, so \d is not treated as a string escape
stock_num = regex.findall(stock)[0]
stock_num

'22'

In [16]:
book_information = {
    "title":title,
    "price":price,
    "stock":stock_num,
    "description":description
}

book_information

{'title': 'A Light in the Attic',
 'price': 'Â£51.77',
 'stock': '22',
 'description': "It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your
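
The price still shows `Â£` instead of `£`. That pattern is what UTF-8 bytes look like when they are decoded as Latin-1, which is probably what happened when `response.text` guessed the page encoding. The sketch below repairs the string after the fact and converts price and stock to numbers. The variable names are illustrative, and the input strings are copied from the output above. Parsing `response.content` (bytes) instead of `response.text` may avoid the problem at the source, but that is an assumption not tested here.

```python
import re

raw_price = "Â£51.77"                    # value as printed above
raw_stock = "In stock (22 available)"

# Undo the mis-decoding: turn the characters back into bytes, then read them as UTF-8.
fixed_price = raw_price.encode("latin-1").decode("utf-8")

# Pull out the numeric parts and convert them to numbers.
price_value = float(re.search(r"\d+(?:\.\d+)?", fixed_price).group())
stock_value = int(re.search(r"\d+", raw_stock).group())

print(price_value, stock_value)  # 51.77 22
```

Storing numbers rather than strings makes later sorting and arithmetic on the scraped data straightforward.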

## Writing the data to a CSV or JSON file

**Writing to a JSON file:**

In [19]:
def write_to_json(data, filename):
    '''
    Takes the scraped data as a dictionary
    and exports it to a JSON file
    '''
    # The with block closes the file even if writing fails
    with open(filename, mode="w") as json_file:
        json.dump(data, json_file)

In [20]:
write_to_json(book_information, "data.json")
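
One way to check the export is to read the file back and compare it with the original dictionary. The following is a small round-trip sketch. It writes to a temporary directory rather than to `data.json`, and the shortened `record` is illustrative rather than the full scraped dictionary.

```python
import json
import os
import tempfile

record = {"title": "A Light in the Attic", "stock": "22"}

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.json")

    with open(path, mode="w", encoding="utf-8") as json_file:
        # indent=2 is optional; it only makes the file easier to read.
        json.dump(record, json_file, indent=2)

    with open(path, encoding="utf-8") as json_file:
        loaded = json.load(json_file)

print(loaded == record)  # True
```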

**Writing to a CSV file:**

In [27]:
def write_to_csv(filename,data):
    '''
    Takes the data as a dictionary
    and stores it in a CSV file
    '''
    # newline="" stops the csv module from writing blank rows on Windows
    with open(filename, mode="w", newline="") as f:
        headers = ["title","price","stock","description"]
        writer = csv.DictWriter(f, fieldnames=headers)
        writer.writeheader()
        writer.writerow(data)

In [28]:
write_to_csv("data.csv",book_information)
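
The CSV export can be checked the same way, by reading the file back with `csv.DictReader`. This is a sketch under the same assumptions as before: a temporary file instead of `data.csv`, and an illustrative row with a shortened description.

```python
import csv
import os
import tempfile

headers = ["title", "price", "stock", "description"]
# Illustrative row; the description is shortened here.
row = {
    "title": "A Light in the Attic",
    "price": "51.77",
    "stock": "22",
    "description": "It's hard to imagine a world without A Light in the Attic.",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.csv")

    # newline="" lets the csv module control line endings itself.
    with open(path, mode="w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=headers)
        writer.writeheader()
        writer.writerow(row)

    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

print(len(rows), rows[0]["title"])  # 1 A Light in the Attic
```

Every value comes back as a string, so numeric columns such as the stock count need converting again after reading.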