#Extracting Basic Product Details from an E-commerce Website
The objective is to obtain raw data to which Regular Expressions (RE) can then be applied. RE is one of the oldest and most powerful NLP tools for Information Retrieval and Feature Engineering.

##1. Installing and Loading Required Packages

In [166]:
# Import required packages
from bs4 import BeautifulSoup # Parse HTML and XML documents
import requests # Allow HTTP requests
import pandas as pd # Data analysis

In [None]:
# Selenium and ChromeDriver for connecting to the Chrome browser
# Selenium's main purpose is to drive a browser and automate browser activities
# Note that you can also perform this extraction without Selenium and ChromeDriver (by using the requests module)
!pip install selenium
!apt-get update
!apt install chromium-chromedriver

from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')               # run Chrome without a visible window
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
# Pass the options object via `options=`; the older `chrome_options=` keyword is deprecated
driver = webdriver.Chrome(options=chrome_options)



##2. Parse the HTML Document

We are going to extract information on laptops. Before that, let's understand how a BeautifulSoup object stores HTML content and the basic functions to extract information from it.

You can get the same soup object using the Selenium driver that we just set up above or by using the requests module.

In [167]:
# Using Selenium Driver
url = "https://www.flipkart.com/"
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser') # Name the parser explicitly to avoid a bs4 warning
type(soup)

bs4.BeautifulSoup

In [168]:
# Using requests module
# The User-Agent header identifies your browser to the server. Check yours here - http://www.xhaus.com/headers
headers = {"User-agent": 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
type(soup)

bs4.BeautifulSoup

A few basic functions and attributes for viewing and retrieving data from a bs4 object are prettify, find, find_all, and text. You can extract information by class, id, or tag (a, div, ul, li, etc.).

For the full documentation, see https://www.crummy.com/software/BeautifulSoup/bs4/doc/
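
As a quick illustration of these functions, here is a minimal sketch that runs on a small hand-written HTML string rather than on the Flipkart page. The HTML, the `demo_soup` name, and the class and id values are all made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration; it is not taken from Flipkart
html = """
<html><head><title>Demo Shop</title></head>
<body>
  <div id="menu"><a href="/">Home</a><a class="nav" href="/cart">Cart</a></div>
  <ul class="specs"><li>8 GB RAM</li><li>512 GB SSD</li></ul>
</body></html>
"""

demo_soup = BeautifulSoup(html, 'html.parser')

title_text = demo_soup.title.text                          # text inside the <title> tag
first_link = demo_soup.find('a')                           # find returns the first match only
all_hrefs = [a['href'] for a in demo_soup.find_all('a')]   # find_all returns every match
nav_links = demo_soup.find_all('a', class_='nav')          # filter by class
menu = demo_soup.find(id='menu')                           # filter by id
spec_items = [li.text for li in demo_soup.find('ul').find_all('li')]

print(title_text, all_hrefs, spec_items)
```

Note that `find` returns `None` when nothing matches, so it is worth checking the result before reading `.text` from it.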

In [184]:
# View the source HTML content
# print(soup.prettify())

# Note: this prints the entire HTML of the Flipkart homepage, so it is intentionally commented out

In [170]:
soup.title #Print the title tag

<title>Online Shopping Site for Mobiles, Electronics, Furniture, Grocery, Lifestyle, Books &amp; More. Best Offers!</title>

In [171]:
# Find all anchor tags with links
soup.find_all('a')[0:5] # printing only the first five matches

[<a href="/"><img alt="Flipkart" class="_1e_EAo" src="//img1a.flixcart.com/www/linchpin/fk-cp-zion/img/flipkart-plus_4ee2f9.png" title="Flipkart" width="75"/></a>,
 <a class="_33x-Wq" href="/plus">Explore<!-- --> <span class="_2Ky4Ru">Plus</span><img src="//img1a.flixcart.com/www/linchpin/fk-cp-zion/img/plus_b13a8b.png" width="10"/></a>,
 <a class="_3Ep39l" href="/account/login?ret=/">Login</a>,
 <a class="_3ko_Ud" href="/viewcart?otracker=Cart_Icon_Click"><svg class="_2fcmoV" height="14" viewbox="0 0 16 16" width="14" xmlns="http://www.w3.org/2000/svg"><path class="_2JpNOH" d="M15.32 2.405H4.887C3 2.405 2.46.805 2.46.805L2.257.21C2.208.085 2.083 0 1.946 0H.336C.1 0-.064.24.024.46l.644 1.945L3.11 9.767c.047.137.175.23.32.23h8.418l-.493 1.958H3.768l.002.003c-.017 0-.033-.003-.05-.003-1.06 0-1.92.86-1.92 1.92s.86 1.92 1.92 1.92c.99 0 1.805-.75 1.91-1.712l5.55.076c.12.922.91 1.636 1.867 1.636 1.04 0 1.885-.844 1.885-1.885 0-.866-.584-1.593-1.38-1.814l2.423-8.832c.12-.433-.206-.86-.655-.86

Now let's get to the objective: extracting information from the first page of results for the 'laptop' search.

In [172]:
url = "https://www.flipkart.com/search?q=laptop&sid=6bo%2Cb5g&as=on&as-show=on&otracker=AS_QueryStore_OrganicAutoSuggest_1_7_na_na_na&otracker1=AS_QueryStore_OrganicAutoSuggest_1_7_na_na_na&as-pos=1&as-type=HISTORY&suggestionId=laptop%7CLaptops&requestId=c149720f-935a-493f-a36d-5f45ef55734d"
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser') # Name the parser explicitly to avoid a bs4 warning
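
The search URL above is just a base path plus a query string. The sketch below uses the standard library to build such a URL from a search term and to read the parameters back. It assumes that the `q` parameter alone is enough for a search, which has not been verified against Flipkart here; `build_search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_SEARCH_URL = "https://www.flipkart.com/search"

def build_search_url(query):
    """Build a search URL from a query term (assumes only `q` is required)."""
    return f"{BASE_SEARCH_URL}?{urlencode({'q': query})}"

search_url = build_search_url('laptop')
print(search_url)

# parse_qs turns a query string back into a dict of parameter lists,
# which is handy for inspecting a long URL copied from the browser
params = parse_qs(urlsplit(search_url).query)
print(params)
```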

##3. Inspect the Layout and Extract
Inspecting the web layout requires a basic understanding of HTML and DOM.
You can refer to the w3schools tutorial [here](https://www.w3schools.com/html/).

Here we are mainly interested in Product names and details. This image will provide a quick reference.

In [173]:
%%html
<iframe src="https://drive.google.com/file/d/1dD-enzt9jW-7xh0v08IWBjaWsfrvnT-e/preview" width="900" height="500"></iframe>

To get all the product names, we find the class of interest and call find_all, which returns every tag that carries that class.

In [174]:
# Get all the product names of laptops
soup.find_all(attrs = {'class': '_3wU53n'})

[<div class="_3wU53n">Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB SSD/Windows 10 Home/4 GB Graphics/NVIDIA Geforce GTX 1650...</div>,
 <div class="_3wU53n">Dell Vostro Core i3 10th Gen - (8 GB/1 TB HDD/Windows 10 Home) Vostro 3491 Thin and Light Laptop</div>,
 <div class="_3wU53n">HP 15s Core i5 10th Gen - (8 GB/512 GB SSD/Windows 10 Home) 15s-du2078TU Thin and Light Laptop</div>,
 <div class="_3wU53n">Dell Vostro Core i5 10th Gen - (8 GB/1 TB HDD/256 GB SSD/Windows 10 Home) Vostro 3491 Thin and Light L...</div>,
 <div class="_3wU53n">Asus Core i3 10th Gen - (4 GB/1 TB HDD/256 GB SSD/Windows 10 Home) X509JA-EJ654T Laptop</div>,
 <div class="_3wU53n">Asus Core i3 10th Gen - (4 GB/1 TB HDD/Windows 10 Home) X509JA-EJ485T Laptop</div>,
 <div class="_3wU53n">Asus VivoBook 14 Core i5 10th Gen - (8 GB/1 TB HDD/256 GB SSD/Windows 10 Home) X412FA-EK512T Thin and ...</div>,
 <div class="_3wU53n">Asus VivoBook 14 Core i5 10th Gen - (8 GB/1 TB HDD/256 GB SSD/Windows 10 Home) X412FA-EK513T Thin an

In [175]:
# Getting the text data inside product and price classes
products = [item.text for item in soup.find_all(attrs = {'class': '_3wU53n'})]
prices = [item.text for item in soup.find_all(attrs = {'class': '_1vC4OE _2rQ-NK'})]

In [176]:
list(zip(products, prices))

[('Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB SSD/Windows 10 Home/4 GB Graphics/NVIDIA Geforce GTX 1650...',
  '₹56,990'),
 ('Dell Vostro Core i3 10th Gen - (8 GB/1 TB HDD/Windows 10 Home) Vostro 3491 Thin and Light Laptop',
  '₹36,990'),
 ('HP 15s Core i5 10th Gen - (8 GB/512 GB SSD/Windows 10 Home) 15s-du2078TU Thin and Light Laptop',
  '₹49,990'),
 ('Dell Vostro Core i5 10th Gen - (8 GB/1 TB HDD/256 GB SSD/Windows 10 Home) Vostro 3491 Thin and Light L...',
  '₹52,990'),
 ('Asus Core i3 10th Gen - (4 GB/1 TB HDD/256 GB SSD/Windows 10 Home) X509JA-EJ654T Laptop',
  '₹39,990'),
 ('Asus Core i3 10th Gen - (4 GB/1 TB HDD/Windows 10 Home) X509JA-EJ485T Laptop',
  '₹32,990'),
 ('Asus VivoBook 14 Core i5 10th Gen - (8 GB/1 TB HDD/256 GB SSD/Windows 10 Home) X412FA-EK512T Thin and ...',
  '₹52,990'),
 ('Asus VivoBook 14 Core i5 10th Gen - (8 GB/1 TB HDD/256 GB SSD/Windows 10 Home) X412FA-EK513T Thin and ...',
  '₹52,990'),
 ('Asus VivoBook 14 Core i3 10th Gen - (4 GB/256 GB SSD/Windows 10 

In [177]:
# Each product's description is stored as an unordered list (<ul>).
# Hence we loop through every <ul> and join the text of the list items (<li>) inside it
product_details = []

for ul in soup.find_all(attrs = {'class': 'vFw0gD'}):
  details_list = []
  for li in ul.find_all('li'):
    details_list.append(li.text)
  product_details.append(', '.join(details_list))
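
The three lists built above line up row by row only if every product on the page has a name, a price, and a details list. Here is a minimal sketch of a more defensive variant that reads all three fields from one container per product, so a missing field becomes `None` instead of shifting the rows. The `product-card` wrapper class, the sample HTML, and the `text_or_none` helper are hypothetical; `_3wU53n`, `_1vC4OE`, and `vFw0gD` are the classes used above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: 'product-card' is an assumed wrapper class, not Flipkart's
sample_html = """
<div class="product-card">
  <div class="_3wU53n">Laptop A</div>
  <div class="_1vC4OE _2rQ-NK">₹36,990</div>
  <ul class="vFw0gD"><li>8 GB RAM</li><li>1 TB HDD</li></ul>
</div>
<div class="product-card">
  <div class="_3wU53n">Laptop B</div>
  <ul class="vFw0gD"><li>4 GB RAM</li></ul>
</div>
"""

def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None

card_soup = BeautifulSoup(sample_html, 'html.parser')
rows = []
for card in card_soup.find_all('div', class_='product-card'):
    details = card.find('ul', class_='vFw0gD')
    rows.append({
        'Product': text_or_none(card.find(class_='_3wU53n')),
        # class_ with a single name matches tags that carry that class among others
        'Price': text_or_none(card.find(class_='_1vC4OE')),
        'Details': ', '.join(li.text for li in details.find_all('li'))
                   if details is not None else None,
    })

print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`.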

##4. Store the Data

In [178]:
df = pd.DataFrame({
 'Product': products,
 'Price': prices,
 'Details': product_details
})

df.head()

Unnamed: 0,Product,Price,Details
0,Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB S...,"₹56,990","Intel Core i5 Processor (9th Gen), 8 GB DDR4 R..."
1,Dell Vostro Core i3 10th Gen - (8 GB/1 TB HDD/...,"₹36,990","Intel Core i3 Processor (10th Gen), 8 GB DDR4 ..."
2,HP 15s Core i5 10th Gen - (8 GB/512 GB SSD/Win...,"₹49,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ..."
3,Dell Vostro Core i5 10th Gen - (8 GB/1 TB HDD/...,"₹52,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ..."
4,Asus Core i3 10th Gen - (4 GB/1 TB HDD/256 GB ...,"₹39,990","Intel Core i3 Processor (10th Gen), 4 GB DDR4 ..."
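
Since the stated objective is to apply regular expressions to this raw data, here is a minimal sketch of that next step using Python's built-in `re` module. The helper names `parse_price` and `extract_ram_gb` are hypothetical, and the patterns assume the formats visible in the output above (`₹36,990`, `(8 GB/1 TB HDD/...`); other listings may need different patterns.

```python
import re

def parse_price(price_text):
    """Convert a price string such as '₹36,990' to an integer number of rupees."""
    digits = re.sub(r'[^\d]', '', price_text)   # drop the currency sign and commas
    return int(digits) if digits else None

def extract_ram_gb(product_name):
    """Return the first 'N GB' value after the opening parenthesis (assumed to be the RAM size)."""
    match = re.search(r'\((\d+)\s*GB', product_name)
    return int(match.group(1)) if match else None

sample_name = ('Dell Vostro Core i3 10th Gen - (8 GB/1 TB HDD/Windows 10 Home) '
               'Vostro 3491 Thin and Light Laptop')
price_value = parse_price('₹36,990')
ram_value = extract_ram_gb(sample_name)
print(price_value, ram_value)
```

With the DataFrame above, these helpers could be applied column-wise, for example `df['Price'].map(parse_price)`.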


In [181]:
from google.colab import drive
drive.mount('/drive')

Mounted at /drive


In [182]:
# index=False keeps the row index from being written as an extra unnamed column
df.to_csv('/drive/My Drive/Colab Notebooks/NLP/Web Scraping/laptop_details.csv', index=False)

##Entire modularized code for quick reference

In [183]:
def flipkart_extract():
  '''Extract product details from a fixed Flipkart search URL and return them as a pandas DataFrame.'''

  # Website connection and parser
  url = "https://www.flipkart.com/search?q=laptop&sid=6bo%2Cb5g&as=on&as-show=on&otracker=AS_QueryStore_OrganicAutoSuggest_1_7_na_na_na&otracker1=AS_QueryStore_OrganicAutoSuggest_1_7_na_na_na&as-pos=1&as-type=HISTORY&suggestionId=laptop%7CLaptops&requestId=c149720f-935a-493f-a36d-5f45ef55734d"
  headers = {"User-agent": 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'}
  page = requests.get(url, headers=headers)
  soup = BeautifulSoup(page.content, 'html.parser')

  # Inspection and extraction
  products = [item.text for item in soup.find_all(attrs = {'class': '_3wU53n'})]
  prices = [item.text for item in soup.find_all(attrs = {'class': '_1vC4OE _2rQ-NK'})]

  product_details = []
  for ul in soup.find_all(attrs = {'class': 'vFw0gD'}):
    details_list = []
    for li in ul.find_all('li'):
      details_list.append(li.text)
    product_details.append(', '.join(details_list))

  # Storing the data
  df = pd.DataFrame({
      'Product': products,
      'Price': prices,
      'Details': product_details
      })
  
  return df

flipkart_extract()

Unnamed: 0,Product,Price,Details
0,Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB S...,"₹56,990","Intel Core i5 Processor (9th Gen), 8 GB DDR4 R..."
1,Dell Vostro Core i3 10th Gen - (8 GB/1 TB HDD/...,"₹36,990","Intel Core i3 Processor (10th Gen), 8 GB DDR4 ..."
2,HP 15s Core i5 10th Gen - (8 GB/512 GB SSD/Win...,"₹49,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ..."
3,Dell Vostro Core i5 10th Gen - (8 GB/1 TB HDD/...,"₹52,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ..."
4,Asus Core i3 10th Gen - (4 GB/1 TB HDD/256 GB ...,"₹39,990","Intel Core i3 Processor (10th Gen), 4 GB DDR4 ..."
5,Asus Core i3 10th Gen - (4 GB/1 TB HDD/Windows...,"₹32,990","Intel Core i3 Processor (10th Gen), 4 GB DDR4 ..."
6,Asus VivoBook 14 Core i5 10th Gen - (8 GB/1 TB...,"₹52,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ..."
7,Asus VivoBook 14 Core i5 10th Gen - (8 GB/1 TB...,"₹52,990","Intel Core i5 Processor (10th Gen), 8 GB DDR4 ..."
8,Asus VivoBook 14 Core i3 10th Gen - (4 GB/256 ...,"₹38,990","Intel Core i3 Processor (10th Gen), 4 GB DDR4 ..."
9,Asus VivoBook 14 Core i3 10th Gen - (4 GB/256 ...,"₹35,990","Intel Core i3 Processor (10th Gen), 4 GB DDR4 ..."
