# Web Scraping Tutorial 

> In this tutorial, we will scrape product information from the e-commerce website newegg.com.

### Import the necessary libraries

> The libraries must be installed before they can be imported, so run the following commands the first time you work through this tutorial:

<pre><code>$ pip install beautifulsoup4
$ pip install lxml
$ pip install requests
</code></pre>

> Link to the Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/


In [4]:
from bs4 import BeautifulSoup
import requests

### Decide on a webpage for scraping

In [5]:
url = "https://www.newegg.com/Product/ProductList.aspx?Submit=StoreIM&Depa=1&Category=38"
# url = "https://www.newegg.com/Video-Cards-Video-Devices/Category/ID-38?Tpk=graphics%20cards"
# url = "https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&DEPA=0&Order=BESTMATCH&Description=mobile&N=-1&isNodeId=1"

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}

> Send the headers with every request: many servers identify and block requests that appear to come from a script. Remember to be nice to the server.
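
> As a minimal sketch of what the header does, the snippet below builds a request without sending it, so nothing touches the network. It assumes `requests` is installed; `example_url` and the shortened `User-Agent` value are made up for illustration. Note that the second positional argument of `requests.get` is `params`, so the headers dictionary has to be passed by keyword.

```python
import requests

# Illustrative values only, not the ones used in the tutorial cells
example_url = "https://www.newegg.com/Product/ProductList.aspx?Category=38"
example_headers = {"User-Agent": "Mozilla/5.0 (tutorial example)"}

# Passed by keyword, the dict becomes HTTP headers on the outgoing request.
with_headers = requests.Request("GET", example_url, headers=example_headers).prepare()
print(with_headers.headers["User-Agent"])

# Passed as params (what a second positional argument to requests.get means),
# the dict is appended to the query string instead.
as_params = requests.Request("GET", example_url, params=example_headers).prepare()
print(as_params.url)
```

> Comparing the two printed values shows why the keyword matters: only the first request actually carries the browser-like `User-Agent` header.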

### Make a request to the webpage using its URL and convert response to a Beautiful Soup object

In [6]:
# headers must be passed by keyword; the second positional argument is params
response = requests.get(url, headers=headers)

soup = BeautifulSoup(response.content, "lxml")
# print(soup)

> Before going ahead, let's see what `response` and `soup` look like.
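
> To get a feel for the soup object without hitting the network, here is a small sketch that parses a made-up HTML string. `sample_html` is hypothetical, and the built-in `html.parser` is used so the snippet runs even without `lxml`. With a real response you would also look at `response.status_code` and `response.content`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.content
sample_html = "<html><head><title>Demo Store</title></head><body><p>Hello</p></body></html>"

sample_soup = BeautifulSoup(sample_html, "html.parser")

print(type(sample_soup).__name__)  # class of the parsed document
print(sample_soup.prettify())      # indented view of the parsed tree
```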

### Try out basic commands (refer to the documentation for more)

In [12]:
#Basic Traversal

# title of the page
print("Title Tag:", soup.title)

# get attributes:
print("Tag Name:", soup.title.name)

# get values:
print("Title String:",soup.title.string)

# # beginning navigation:
# print(soup.title.parent.name)

# print(soup.h1)

# # getting specific values:
# print(soup.p)

# all p tags
# print(soup.find_all('p'))

#iterate using loop
# for paragraph in soup.find_all('p'):
#     print(str(paragraph.text))

# for url in soup.find_all('a'):
#     print(url.get('href'))

Title Tag: <title>Computer Parts, PC Components, Laptop Computers, LED LCD TV, Digital Cameras and more - Newegg.com</title>
Tag Name: title
Title String: Computer Parts, PC Components, Laptop Computers, LED LCD TV, Digital Cameras and more - Newegg.com
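
> The commented-out commands in the cell above can be tried offline as well. The sketch below runs them against a small, made-up page (`sample_html` is an assumption, not Newegg's markup) so you can see what each call returns.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for illustration
sample_html = """
<html><head><title>Demo Store</title></head>
<body>
  <h1>Graphics Cards</h1>
  <p>First paragraph</p>
  <p>Second paragraph</p>
  <a href="/item/1">Item 1</a>
  <a href="/item/2">Item 2</a>
</body></html>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# navigate upwards: the tag that encloses <title>
parent_name = sample_soup.title.parent.name

# text of every <p> tag
paragraphs = [p.text for p in sample_soup.find_all("p")]

# href attribute of every <a> tag
links = [a.get("href") for a in sample_soup.find_all("a")]

print(parent_name)
print(paragraphs)
print(links)
```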


### Get hold of a product

> We'll first extract information from a single product, then run a loop to automate it.

In [13]:
containers = soup.findAll("div", {"class":"item-container"})

In [14]:
print("Total Records:", len(containers))

# print(containers[0])
# paste the output into a text editor (e.g. Sublime Text) and analyze the structure

Total Records: 32
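
> Beautiful Soup offers more than one way to select tags by class. The sketch below, run on made-up markup that only mimics the listing structure, shows that the attribute dictionary used above, the `class_` keyword, and a CSS selector via `select` all find the same containers.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the listing structure
sample_html = """
<div class="item-container"><a class="item-title">Card A</a></div>
<div class="item-container"><a class="item-title">Card B</a></div>
<div class="ad-banner">Not a product</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

by_dict = sample_soup.find_all("div", {"class": "item-container"})
by_keyword = sample_soup.find_all("div", class_="item-container")
by_selector = sample_soup.select("div.item-container")

print(len(by_dict), len(by_keyword), len(by_selector))
```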


### Find out which aspects of the product we are interested in

In [15]:
# """
# What information do we need?
# Information that repeats everywhere
# Brand of the product
# Names
# Price?
# Shipping
# """

In [16]:
container = containers[0]

#Brand of the product
# print(container.a)
# print(container.div)

# print(container.div.div.a)

# # read an attribute by indexing the tag like a dictionary
# print(container.div.div.a.img["title"])

# print(container.div.a)

# print(container)

In [17]:
#title of the product
title_container = container.findAll("a", {"class":"item-title"})

print(title_container)
# print(title_container[0].text)

[<a class="item-title" href="https://www.newegg.com/Product/Product.aspx?Item=N82E16814487265&amp;ignorebbr=1" title="View Details"><i class="icon-premier icon-premier-xsm"></i>EVGA GeForce GTX 1070 SC GAMING ACX 3.0 Black Edition, 08G-P4-5173-KR, 8GB GDDR5, LED, DX12 OSD Support (PXOC)</a>]


In [58]:
#shipping price
shipping_container = container.findAll("li", {"class":"price-ship"})

print(shipping_container[0].text)
# print(shipping_container[0].text.strip())
shipping = shipping_container[0].text.strip()


        $4.99 Shipping
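
> The shipping text is still a string with surrounding whitespace. If you want a number, a helper along these lines could work. `parse_shipping` is a hypothetical name, and the "Free Shipping" wording is an assumption about how the site labels free delivery.

```python
import re

def parse_shipping(text):
    """Return the shipping cost as a float, 0.0 for free shipping, None if unknown."""
    cleaned = text.strip()
    if "free" in cleaned.lower():  # assumed wording for free delivery
        return 0.0
    match = re.search(r"\$([\d,]+(?:\.\d+)?)", cleaned)
    return float(match.group(1).replace(",", "")) if match else None

print(parse_shipping("\n        $4.99 Shipping\n    "))
print(parse_shipping("Free Shipping"))
```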
    


In [59]:
#price

# print(container.div.findAll("div", {"class":"item-action"})[0].findAll("li", {"class":"price-current"})[0].sup.text)
# look up the price element once, then read the dollar (<strong>) and cent (<sup>) parts
price_li = container.div.findAll("div", {"class":"item-action"})[0].findAll("li", {"class":"price-current"})[0]
price_container_d = price_li.strong.text
price_container_c = price_li.sup.text
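
> Not every listing is guaranteed to show a price, so the chained lookups above can fail. As a sketch, the hypothetical helper `extract_price` below joins the two parts into a float and returns `None` when either part is missing. The markup is made up, and the real page's structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the second product has no price
sample_html = """
<div class="item-container">
  <ul><li class="price-current">$<strong>1,399</strong><sup>.99</sup></li></ul>
</div>
<div class="item-container">
  <ul><li class="price-current"></li></ul>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

def extract_price(container):
    """Join the dollar and cent parts into a float, or return None if either is missing."""
    price_li = container.find("li", {"class": "price-current"})
    if price_li is None or price_li.strong is None or price_li.sup is None:
        return None
    return float((price_li.strong.text + price_li.sup.text).replace(",", ""))

prices = [extract_price(c) for c in sample_soup.find_all("div", {"class": "item-container"})]
print(prices)
```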

### Let's put all this into a loop

In [20]:
for container in containers:
    brand = container.div.div.a.img["title"]
   
    title_container = container.findAll("a", {"class":"item-title"})
    product_name = title_container[0].text
    
    shipping_container = container.findAll("li", {"class":"price-ship"})
    shipping = shipping_container[0].text.strip()
    
#     print("Brand:" + brand)
#     print("Product Name:" + product_name)
#     print("Shipping:" + shipping)
    
#     try:
#         price_container_d = container.div.findAll("div", {"class":"item-action"})[0].findAll("li", {"class":"price-current"})[0].strong.text
#         price_container_c = container.div.findAll("div", {"class":"item-action"})[0].findAll("li", {"class":"price-current"})[0].sup.text
#         print("Price:"+price_container_d+price_container_c)
#     except (AttributeError, IndexError):
#         print("Price: None")
    
   
    
#     print()
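
> The loop above assumes every container has a brand image, a title, and a shipping entry. One way to make it more forgiving is to move the lookups into a function that falls back to `None`. This is a sketch under assumptions: `parse_container` is a hypothetical helper, and the `item-info` and `item-branding` class names in the sample markup are invented.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the second product lacks a brand and shipping entry
sample_html = """
<div class="item-container">
  <div class="item-info">
    <div class="item-branding"><a href="#"><img title="EVGA"></a></div>
    <a class="item-title">EVGA GeForce GTX 1070</a>
    <ul><li class="price-ship">$4.99 Shipping</li></ul>
  </div>
</div>
<div class="item-container">
  <div class="item-info">
    <a class="item-title">Unbranded Card</a>
  </div>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

def parse_container(container):
    """Collect brand, product name and shipping, using None for missing fields."""
    try:
        brand = container.div.div.a.img["title"]
    except (AttributeError, TypeError, KeyError):
        brand = None
    title_tag = container.find("a", {"class": "item-title"})
    ship_tag = container.find("li", {"class": "price-ship"})
    return {
        "brand": brand,
        "product_name": title_tag.text.strip() if title_tag is not None else None,
        "shipping": ship_tag.text.strip() if ship_tag is not None else None,
    }

records = [parse_container(c) for c in sample_soup.find_all("div", {"class": "item-container"})]
for record in records:
    print(record)
```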
    

### Exporting to a CSV file

In [61]:
# export to a CSV file

filename = "all_products.csv"

# named csv_header so it does not overwrite the request headers dict defined earlier
csv_header = "Brand,Product Name,Shipping\n"

# the with block closes the file even if a row raises an error
with open(filename, "w", encoding="utf-8") as f:
    f.write(csv_header)

    for container in containers:
        brand = container.div.div.a.img["title"]

        title_container = container.findAll("a", {"class":"item-title"})
        product_name = title_container[0].text

        shipping_container = container.findAll("li", {"class":"price-ship"})
        shipping = shipping_container[0].text.strip()

        # commas in the product name would split the column, so swap them for "|"
        f.write(brand+","+product_name.replace(",", "|")+","+shipping+"\n")
#         print("Brand:" + brand)
#         print("Product Name:" + product_name)
#         print("Shipping:" + shipping)
#         print()
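
> The export above swaps commas in the product name for `|` so the columns stay aligned. An alternative is Python's standard `csv` module, which quotes such fields instead of altering them. The sketch below writes to an in-memory buffer rather than a file, and the two rows are made-up sample data.

```python
import csv
import io

# Made-up rows; the first product name contains a comma
rows = [
    ("EVGA", "GeForce GTX 1070, 8GB GDDR5", "$4.99 Shipping"),
    ("MSI", "Radeon RX 580", "Free Shipping"),
]

buffer = io.StringIO()  # in-memory stand-in for a file opened with newline=""
writer = csv.writer(buffer)
writer.writerow(["Brand", "Product Name", "Shipping"])
writer.writerows(rows)

# Reading the text back recovers the original fields, commas included.
buffer.seek(0)
read_back = list(csv.reader(buffer))
print(read_back[1])
```

> With a real file you would pass `open(filename, "w", newline="")` to `csv.writer` in place of the buffer.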