* The internet is a massive source of data. Unfortunately, the vast majority of it isn’t available in conveniently organized CSV files for download and analysis.
* If we want to capture data from many websites, the answer is web scraping.

### Web Scraping - Automatic Gathering of Information from the Web
* Web scraping means writing code that fetches information from the Web automatically.
* In today’s session we’re going to cover how to do web scraping with Python from scratch.

### Why would you scrape the Web?
* Why would someone collect such large amounts of data from websites? Some common reasons:
* To collect data from online shopping websites and use it to compare the prices of products.
* To collect email IDs and then send bulk emails.
* To collect data from social media websites such as Twitter to find out what’s trending.
* To collect a large set of data (statistics, general information, temperature, weather, etc.) from websites, which can be analyzed and used for R&D.
* To pick up song lyrics from a specific album without clicking around and copying and pasting by hand. A better way is to automate the extraction and pull the information out of the HTML.
* Automated job search: to collect information about job openings and interviews.

### Objective:
* The data on websites is unstructured. Web scraping helps collect this unstructured data and store it in a structured form.
* Case Study: Automatic Gathering of Information about Laptops from Flipkart

### Is it legal? 
* Some websites explicitly allow web scraping. Others explicitly forbid it. 
* Many websites don’t offer any clear guidance one way or the other.
* Before scraping any website, we should look for a terms and conditions page to see if there are explicit rules about scraping. If there are, we should follow them. If there are not, then it becomes more of a judgement call.

### Guidelines
* Remember, though, that web scraping consumes server resources for the host website. If we’re just scraping one page once, that isn’t going to cause a problem. But if our code is scraping 1,000 pages once every ten minutes, that could quickly get expensive for the website owner.

* It’s also a good idea to follow these best practices:

* Never scrape more frequently than you need to.
* Consider caching the content you scrape so that it’s only downloaded once as you work on the code you’re using to filter and analyze it, rather than re-downloading every time you run your code.
* Consider building pauses into your code using functions like time.sleep() to keep from overwhelming servers with too many requests in too short a timespan.
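
The caching and pausing advice can be sketched as follows. This is a minimal illustration, not a full solution: `fetch_cached`, the `scrape_cache` folder and the two-second default delay are all assumptions made for this example. The `fetch` argument is any function that downloads a page, so the sketch works with whichever HTTP library is in use.

```python
import hashlib
import time
from pathlib import Path


def fetch_cached(url, fetch, cache_dir="scrape_cache", delay=2.0):
    """Return the page body for `url`, downloading it at most once.

    `fetch` is any callable that takes a URL and returns the page text,
    for example `lambda u: requests.get(u, timeout=10).text`.
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # Hash the URL to get a safe, unique file name for the cached copy.
    path = cache / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")

    if path.exists():
        # Cache hit: no request is sent, so no pause is needed.
        return path.read_text(encoding="utf-8")

    html = fetch(url)                      # cache miss: contact the server once
    path.write_text(html, encoding="utf-8")
    time.sleep(delay)                      # pause before the next request
    return html
```

Calling `fetch_cached` twice with the same URL contacts the server only the first time; the second call reads the saved file.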

In [None]:
###  How to do Web Scraping?
#### To extract data using web scraping with Python, you need to follow these basic steps:

* Find the URL that you want to scrape
* Inspect the page
* Find the data you want to extract
* Write the code
* Run the code and extract the data
* Store the data in the required format

In [None]:
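# Step 1: Find the URL that you want to scrape
# We will scrape the Flipkart website to extract the Name, Price, and Rating of laptops.
# The search URL template is "https://www.flipkart.com/search?q={0}&page={1}",
# where {0} is the search term and {1} is the results page number.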
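
As a small sketch of how such a search URL can be assembled, the standard library's `urlencode` fills in the query parameters and escapes special characters such as spaces. The helper name `build_search_url` is made up for this example.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.flipkart.com/search"


def build_search_url(product_name, page):
    """Return the search URL for one page of results."""
    # urlencode escapes spaces and other special characters in the query.
    return BASE_URL + "?" + urlencode({"q": product_name, "page": page})


print(build_search_url("laptop", 1))
# https://www.flipkart.com/search?q=laptop&page=1
print(build_search_url("gaming laptop", 2))
# https://www.flipkart.com/search?q=gaming+laptop&page=2
```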

In [None]:
#Step 2: Inspect the page
# To inspect the page, just right click on the element and click on “Inspect”. You will see a “Browser Inspector Box” open.

In [None]:
#Step 3: Find the data you want to extract
# Here we are interested in the name, price, rating and specifications of several laptops.
# Specifications include CPU, RAM, OS, hard disk and display.
# So, we inspect the page to see under which tag the data we want to scrape is nested.
# We see that the name, price and rating are each nested in a “div” tag.

In [None]:
# Step 4: Write code to extract this information
# First, you’ll want to get the site’s HTML code into your Python script so that you can interact with it.
# For this task, you’ll use Python’s requests library.
# Type the following in your terminal to install it: pip3 install requests
# page = requests.get(URL)
# This code performs an HTTP GET request to the given URL.
# It retrieves the HTML data that the server sends back and stores that data in a Python object.
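
The request step might look like the sketch below. The helper names and the `USER_AGENT` string are illustrative assumptions, not part of the notebook. `prepare_request` builds the request without sending it, which makes it easy to check the URL and headers first. `fetch_html` actually contacts the server, so it is defined here but not called.

```python
import requests

# Hypothetical browser-like identifier; any realistic User-Agent string would do.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def prepare_request(url, user_agent=USER_AGENT):
    """Build the GET request without sending it, so it can be inspected."""
    request = requests.Request("GET", url, headers={"User-Agent": user_agent})
    return request.prepare()


def fetch_html(url, timeout=10):
    """Send the request and return the HTML text, raising on HTTP errors."""
    with requests.Session() as session:
        response = session.send(prepare_request(url), timeout=timeout)
    response.raise_for_status()      # 4xx/5xx responses become exceptions
    return response.text


prepared = prepare_request("https://www.flipkart.com/search?q=laptop&page=1")
print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])
```

Setting a timeout and calling `raise_for_status()` keeps a slow or failing server from silently producing an empty or misleading page.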

In [None]:
# Pick the relevant data
# You can parse an HTML response with Beautiful Soup and begin to pick out the relevant data.
# The data we want to extract is nested in <div> tags. So, we find the div tags with the respective class names,
# extract the data and store it in a variable. Refer to the code in the following cells.
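
To make the idea concrete without touching a live site, here is a minimal sketch that parses a small, invented HTML snippet. The class names `card`, `name`, `price` and `rating` are placeholders for this example. The real page uses its own, less readable class names, found with the browser inspector.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the structure of a product listing.
SAMPLE_HTML = """
<a class="card"><div class="name">Laptop A</div>
  <div class="price">₹61,990</div><div class="rating">4.3</div></a>
<a class="card"><div class="name">Laptop B</div>
  <div class="price">₹42,499</div></a>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None when the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_cards(html):
    """Extract one dictionary per product card."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("a", class_="card"):
        rows.append({
            "name": text_or_none(card.find("div", class_="name")),
            "price": text_or_none(card.find("div", class_="price")),
            "rating": text_or_none(card.find("div", class_="rating")),
        })
    return rows


for row in parse_cards(SAMPLE_HTML):
    print(row)
```

The second card has no rating, so `find` returns `None` for it. Checking for `None` before reading the text avoids an `AttributeError`.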

In [None]:
# Steps 5 and 6: Run the code and store the data in the required format, either a CSV file or a DataFrame

In [1]:
pip install fake-useragent



### Let us execute Steps 1 to 6 in a Python code

In [1]:
# Import Required Libraries
import requests      # send HTTP requests and receive the responses
import bs4           # needed later for the bs4.BeautifulSoup call
from bs4 import BeautifulSoup     # Python library for extracting data from HTML

from fake_useragent import UserAgent

import pandas as pd                       #Further Analysis of the extracted Data
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Initialization of the lists to store the extracted data
# The data that we extract is unstructured, so we create empty lists to store it in a structured form.
count=0                  # Initialize search row count
products=[]              #List to store name of the product
prices=[]                #List to store price of the product
ratings=[]               #List to store rating of the product
#specifications = []     #List to store specifications of the product
cpu = []                 #List to store CPU specifications of the product
ram = []                 #List to store RAM specifications of the product
os = []                  #List to store OS specifications of the product
hd = []                  #List to store HDD specifications of the product
display = []             #List to store Display specifications of the product

df=pd.DataFrame()        #Initialize Dataframe


In [3]:
# Creating a user agent (install the package first with: pip install fake-useragent)
# A user agent is the software, usually a browser, that acts on behalf of the user when talking to a web server.
# With each request it sends a User-Agent string that describes the browser, software, device type, etc.
# The web server can use this information to adapt the content to specific web browsers and operating systems.
# https://pypi.org/project/fake-useragent/    # read here
 
user_agent = UserAgent() # Dummy User Agent
print(user_agent)

<fake_useragent.fake.FakeUserAgent object at 0x00000219AAB1B5C8>


In [4]:
# Set the product name. We are searching for laptops,
# so the extracted data will be related to that product.
product_name = 'laptop'

In [5]:
# Build the search URL
# To extract data from multiple pages of the product listing we’re going to use a for loop.
# The range will specify the number of pages to be extracted.

url = "https://www.flipkart.com/search?q={0}&page={1}" 
print( url.format(product_name,1))          #run and check this  

https://www.flipkart.com/search?q=laptop&page=1


In [29]:
for i in range(1,4): # Limiting search to 3 pages due to multiple redirection issues for higher number of pages
    url = "https://www.flipkart.com/search?q={0}&page={1}" # Scrape data from Flipkart.com where 0 & 1 are place holder
    url = url.format(product_name,i)
    #print (url)
    
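    ## Getting the response from the page using the get method of the requests module
    # The header name must be "User-Agent"; a "user_agent" key is sent as an unrelated custom header
    page = requests.get(url, headers={"User-Agent": user_agent.chrome})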
    #print (page)
    ## Storing the content of the page in a variable
    
    html = page.content
    #print (html)
    
    # To Extract data from html file --- Creating BeautifulSoup object
    
    page_soup = bs4.BeautifulSoup(html, "html.parser")
       
    #print(page_soup.prettify())     #will show as a nested html file
    #it gives the visual representation of the parse tree created from the raw HTML content.
    
    # page_soup.find_all('p') returns every paragraph on the page; print the text of the first one
    print(page_soup.find_all('p')[0].get_text())

    ## Decoding the tags
    # Each product card is an <a> tag with class '_1fQZEK'
    
    for containers in page_soup.find_all('a', {'class':'_1fQZEK'}):    # loop over the product cards
        name=containers.find('div', attrs={'class':'_4rR01T'})
        price=containers.find('div', attrs={'class':'_30jeq3 _1_WHN1'})
        rating=containers.find('div', attrs={'class':'_3LWZlK'})       # None when a product has no rating
        specification = containers.find('div', attrs={'class':'fMghEO'})
        print(price, rating)
        
    

Best in the market!
<div class="_30jeq3 _1_WHN1">₹61,990</div> <div class="_3LWZlK">4.3</div>
<div class="_30jeq3 _1_WHN1">₹96,990</div> <div class="_3LWZlK">4.6</div>
<div class="_30jeq3 _1_WHN1">₹23,990</div> <div class="_3LWZlK">4<img class="_1wB99o" src="data:image/svg+xml;base64,..."/></div>

Terrific purchase
<div class="_30jeq3 _1_WHN1">₹42,499</div> None
<div class="_30jeq3 _1_WHN1">₹83,990</div> <div class="_3LWZlK">4.6</div>
<div class="_30jeq3 _1_WHN1">₹25,990</div> <div class="_3LWZlK">4</div>
<div class="_30jeq3 _1_WHN1">₹52,990</div> <div class="_3LWZlK">4.3<img class="_1wB99o" src="data:image/svg+xml;base64,..."/></div>

Perfect product!
<div class="_30jeq3 _1_WHN1">₹39,999</div> None
<div class="_30jeq3 _1_WHN1">₹1,21,500</div> <div class="_3LWZlK">4.4</div>
<div class="_30jeq3 _1_WHN1">₹36,990</div> <div class="_3LWZlK">4.3</div>
<div class="_30jeq3 _1_WHN1">₹48,990</div> <div class="_3LWZlK">4.3<img class="_1wB99o" src="data:image/svg+xml;base64,..."/></div>

In [None]:
     
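# Repeat the page loop, this time storing what we extract.
# The lists are reset so that re-running the cell does not duplicate rows.
count = 0
products, prices, ratings, cpu, ram, os, hd, display = ([] for _ in range(8))
search_url = "https://www.flipkart.com/search?q={0}&page={1}"

for i in range(1, 4):
    page = requests.get(search_url.format(product_name, i), headers={"User-Agent": user_agent.chrome})
    page_soup = bs4.BeautifulSoup(page.content, "html.parser")

    for containers in page_soup.find_all('a', {'class': '_1fQZEK'}):
        name = containers.find('div', attrs={'class': '_4rR01T'})
        price = containers.find('div', attrs={'class': '_30jeq3 _1_WHN1'})
        rating = containers.find('div', attrs={'class': '_3LWZlK'})
        specification = containers.find('div', attrs={'class': 'fMghEO'})
        if name is None or price is None or specification is None:
            continue  # skip cards with missing data

        ## Splitting integrated specification into individual CPU, RAM, OS, HDD and Display specifications
        # Assumption: each specification is an <li> inside the block, in the order CPU, RAM, OS, HDD, Display
        specs = [li.get_text(strip=True) for li in specification.find_all('li')]
        if len(specs) < 5:
            continue  # not enough specification lines to fill every column
        cput, ramt, ost, hdt, displayt = specs[:5]

        # Update the lists with the extracted data
        products.append(name.text)      # Add product name to list
        prices.append(price.text)       # Add price to list
        cpu.append(cput)                # Add CPU specifications to list
        ram.append(ramt)                # Add RAM specifications to list
        os.append(ost)                  # Add OS specifications to list
        hd.append(hdt)                  # Add HDD specifications to list
        display.append(displayt)        # Add Display specifications to list
        ratings.append(rating.text if rating is not None else 'NaN')  # Add Rating to list
        count = count + 1               # Increment row count

## Create a dataframe with structured data from all searched rows
df = pd.DataFrame({'Product Name':products,'CPU':cpu,'RAM':ram,'OS':os,"HD Capacity":hd,'Display':display,'Price':prices,'Rating':ratings})

print('No. of rows searched',count)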
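
Splitting the specification block by position is fragile, because a listing may omit a line or present the lines in a different order. One possible alternative, sketched below, assigns each line to a field by keyword. The sample lines are invented for illustration. The keywords `Processor`, `RAM`, `Operating` and `GB`/`TB` mirror the checks used later in this notebook, while `Display` is an assumed keyword.

```python
def classify_specs(spec_lines):
    """Map free-text specification lines to CPU/RAM/OS/HD/Display fields."""
    fields = {"CPU": None, "RAM": None, "OS": None, "HD": None, "Display": None}
    for line in spec_lines:
        if "Processor" in line:
            fields["CPU"] = line
        elif "RAM" in line:            # checked before storage: RAM lines also contain "GB"
            fields["RAM"] = line
        elif "Operating" in line:
            fields["OS"] = line
        elif "Display" in line:        # assumed keyword for the display line
            fields["Display"] = line
        elif "GB" in line or "TB" in line:
            fields["HD"] = line
    return fields


# Hypothetical specification lines for a single laptop.
sample = [
    "Intel Core i5 Processor (11th Gen)",
    "8 GB DDR4 RAM",
    "64 bit Windows 10 Operating System",
    "512 GB SSD",
    "39.62 cm (15.6 inch) Full HD Display",
    "1 Year Onsite Warranty",
]
print(classify_specs(sample))
```

Lines that match no keyword, such as the warranty line, are ignored, and a field stays `None` when its line is missing.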


In [None]:
#For extracting data from soup you need to specify the html tags you want to retrieve the data from.
#You could use inspect element on the webpage.

## Recap of the HTML tags
* p - A paragraph of text
* h1 - A top-level heading
* h2, h3 - A lower-level heading
* li - An item in a list
* img - An image
* tr - A row in a table
* td - A cell in a table
* a - A link
* div - A block of space on the page (generic)
* span - A portion of text on the page (generic)
* meta - Information about the page that is not shown

In [None]:
### find() and find_all() functions in Beautiful Soup
* To extract a single tag, use the find method, which returns the first matching tag (or None if there is none).
* To extract every instance of a tag on a page, use the find_all method, which returns a list.
* soup.find_all('p')    # returns a list of all paragraphs
* soup.find_all('p')[0].get_text()    # text of the first paragraph
 
* Classes and ids are used by CSS to determine which HTML elements to apply certain styles to.
* We can also use them when scraping to select the specific elements we want to scrape.
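
A short sketch of these calls on an invented snippet follows. The HTML, the class name `lead` and the id `intro` exist only for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only to illustrate the calls.
html = """
<div id="intro"><p class="lead">First paragraph</p></div>
<p>Second paragraph</p>
<p class="lead">Third paragraph</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")                      # first match only (or None)
all_p = soup.find_all("p")                    # list of every match
lead_p = soup.find_all("p", class_="lead")    # filter by CSS class
intro = soup.find(id="intro")                 # filter by id

print(first_p.get_text())                     # First paragraph
print(len(all_p))                             # 3
print([p.get_text() for p in lead_p])         # ['First paragraph', 'Third paragraph']
print(intro.p.get_text())                     # First paragraph
```

The keyword is spelled `class_` with a trailing underscore because `class` is a reserved word in Python.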

In [None]:
print(df.shape)
df.head() # Preview dataframe

In [None]:
df.tail() # Preview dataframe

In [None]:
df.isnull().sum() # Check for null values

In [None]:
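df.isna().sum() # isna() is an alias of isnull(); the placeholder text 'NaN' in Rating is not counted until the column is converted to float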
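
The difference between the text `'NaN'` and a real missing value can be seen in the small sketch below. The sample ratings are made up for illustration.

```python
import pandas as pd

ratings = pd.Series(["4.3", "NaN", "4.6"])          # hypothetical scraped values

text_missing = int(ratings.isna().sum())            # the text 'NaN' is not a missing value
numeric = pd.to_numeric(ratings, errors="coerce")   # unparseable text becomes a real NaN
numeric_missing = int(numeric.isna().sum())

print(text_missing, numeric_missing)                # 0 1
```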

In [None]:
df.info() # Dataframe Information

In [None]:
df.describe() # Describe Data before cleaning and dtype conversion

In [None]:
df.dtypes # Check data types of columns

In [None]:
# Identify rows with wrongly positioned data i.e. a particular data misplaced under a different column
a=df[(~df['CPU'].str.contains('Processor'))|(~df['RAM'].str.contains('RAM'))|(~df['OS'].str.contains('Operating'))|(~df['HD Capacity'].str.contains('GB|TB'))].index
a # Save index information of such rows
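
The same check can be written one column at a time, which some readers may find easier to follow and extend. The sketch below runs on a tiny invented frame. `df_demo` and `expected` are names introduced only for this example.

```python
import pandas as pd

# Hypothetical rows; the second one has its RAM and OS values swapped.
df_demo = pd.DataFrame({
    "CPU": ["Intel Core i5 Processor", "AMD Ryzen 5 Processor"],
    "RAM": ["8 GB DDR4 RAM", "64 bit Windows 10 Operating System"],
    "OS": ["64 bit Windows 10 Operating System", "8 GB DDR4 RAM"],
    "HD Capacity": ["512 GB SSD", "1 TB HDD"],
})

# Keyword (regular expression) that each column is expected to contain.
expected = {"CPU": "Processor", "RAM": "RAM", "OS": "Operating", "HD Capacity": "GB|TB"}

valid = pd.Series(True, index=df_demo.index)
for column, pattern in expected.items():
    # A row stays valid only if every column contains its keyword.
    valid = valid & df_demo[column].str.contains(pattern, na=False)

bad_rows = list(df_demo.index[~valid])
clean = df_demo[valid]
print(bad_rows, len(clean))                   # [1] 1
```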

In [None]:
df=df.drop(a, axis=0) # Drop rows with wrongly positioned data elements 

In [None]:
# Format the Price column: remove the ₹ symbol and the ',' thousands separators
df['Price'] = df['Price'].str.lstrip('₹')
df['Price'] = df['Price'].str.replace(',', '', regex=False)
df.head() # Check if formatting is correct
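
As a worked example of this cleaning step, the sketch below applies it to three price strings taken from the scraper output shown earlier.

```python
import pandas as pd

prices = pd.Series(["₹61,990", "₹1,21,500", "₹23,990"])   # values from the scraper output

cleaned = (
    prices.str.replace("₹", "", regex=False)   # drop the currency symbol
          .str.replace(",", "", regex=False)   # drop the digit-group separators
          .astype(float)                       # text to number
)
print(cleaned.tolist())                        # [61990.0, 121500.0, 23990.0]
```

Passing `regex=False` treats the symbols as literal characters, so no regular-expression escaping is needed.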

In [None]:
# Convert numeric columns in string format to float for mathematical and graphic operations
# astype with a column-to-dtype mapping returns a new frame in which Price and Rating are float
df = df.astype({'Price': float, 'Rating': float})

In [None]:
df.dtypes # Check data types of columns

In [None]:
df.describe() # Describe Data after cleaning and dtype conversion

In [None]:
# Save cleaned and processed data to a CSV file
df.to_csv('WebScrapingLaptops.csv',index=False)

In [None]:
# Univariate Analysis Plot Histograms and  BoxPlots

In [None]:
# Plot Histograms of Price and Rating
df[['Price', 'Rating']].hist(bins=20, figsize=(10, 4))


In [None]:
# Plot Distribution Plots of Price and Rating
columns=['Price','Rating']
for i in columns:
    sns.kdeplot(df[i], fill=True)   # 'fill' replaces the deprecated 'shade' argument
    plt.xlabel(i, fontsize=18)
    plt.ylabel('Density', fontsize=16)
    plt.show()

In [None]:
# Boxplot of Price  using Dataframe method
df.boxplot(column='Price',grid=True,figsize=(6,4))


In [None]:
# Box plot of Rating
df.boxplot(column='Rating',grid=True,figsize=(6,4))


In [None]:
# Bivariate Analysis
# Box plot of CPU and Price
plt.figure(figsize=(10,8))
sns.boxplot(y="CPU",x='Price',data=df)
plt.show()

In [None]:
# Box plot of RAM and Price
plt.figure(figsize=(10,8))
sns.boxplot(y="RAM",x='Price',data=df)
plt.show()

In [None]:
# Box plot of OS and Price
plt.figure(figsize=(10,8))
sns.boxplot(y="OS",x='Price',data=df)
plt.show()

In [None]:
# Box plot of HDD and Price
sns.boxplot(y="HD Capacity",x='Price',data=df)


In [None]:
# Box plot of Display and Price
sns.boxplot(y="Display",x='Price',data=df)


### Bar Graphs using Matplotlib

In [None]:
# Bar Graph - Processor Vs Price
# Using plt
plt.figure(figsize=(15,5))
plt.bar(df['CPU'],df['Price'],color='green')
plt.xticks(rotation=45)
plt.xlabel('Processor')
plt.ylabel('Price')
plt.title('Processor Vs Price')
plt.show()
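
Note that `plt.bar` draws one bar per row, so laptops that share a CPU label are drawn on top of one another and in effect only the tallest bar remains visible. If the goal is a typical price per category, one option is to aggregate first. The sketch below uses a small invented frame, and `df_demo` is a hypothetical name.

```python
import pandas as pd

# Hypothetical sample with repeated CPU labels.
df_demo = pd.DataFrame({
    "CPU": ["Core i5", "Core i5", "Ryzen 5"],
    "Price": [60000.0, 80000.0, 50000.0],
})

# One value per category: the mean price of all laptops with that CPU.
mean_price = df_demo.groupby("CPU")["Price"].mean()
print(mean_price.to_dict())                   # {'Core i5': 70000.0, 'Ryzen 5': 50000.0}
```

The result can then be plotted with `plt.bar(mean_price.index, mean_price.values)`, which gives exactly one bar per processor.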

In [None]:
# Bar Graph - RAM Vs Price
plt.figure(figsize=(15,5))
plt.bar(df['RAM'],df['Price'],color='fuchsia')
plt.xticks(rotation=45)
plt.xlabel('RAM Size')
plt.ylabel('Price')
plt.title('RAM Size Vs Price')
plt.show()

In [None]:
# Bar Graph - OS Vs Price
plt.figure(figsize=(15,5))
plt.bar(df['OS'],df['Price'],color='brown')
plt.xticks(rotation=0)
plt.xlabel('Operating System')
plt.ylabel('Price')
plt.title('Operating System Vs Price')
plt.show()

In [None]:
# Bar Graph - HDD Vs Price
plt.figure(figsize=(15,5))
plt.bar(df['HD Capacity'],df['Price'],color='lime')
plt.xticks(rotation=45)
plt.xlabel('Hard Disk Capacity')
plt.ylabel('Price')
plt.title('Hard Disk Capacity Vs Price')
plt.show()

In [None]:
# Bar Graph - Display Vs Price
plt.figure(figsize=(15,5))
plt.bar(df['Display'],df['Price'],color='tomato')
plt.xticks(rotation=45)
plt.xlabel('Display Size')
plt.ylabel('Price')
plt.title('Display Vs Price')
plt.show()

### BarPlots using Seaborn library
* Price versus Categorical Variables

In [None]:
# Bar Plot - Price Vs CPU
sns.barplot(x=df.Price, y=df.CPU)

In [None]:
# Bar Plot - Price Vs RAM
sns.barplot(x=df.Price, y=df.RAM)

In [None]:
# Bar Plot - Price Vs OS
sns.barplot(x=df['Price'], y=df['OS'])

In [None]:
# Bar Plot - Price Vs HDD
sns.barplot(x=df['Price'], y=df['HD Capacity'])

In [None]:
# Bar Plot - Price Vs Display
sns.barplot(x=df['Price'], y=df['Display'])

### BarPlots using Seaborn library
* Categorical Variables versus Price

In [None]:
# Bar Plot - CPU Vs Price
plt.figure(figsize=(12,5))
sns.barplot(x=df['CPU'], y=df['Price'])
plt.xticks(rotation=45)

In [None]:
# Bar Plot - RAM Vs Price
plt.figure(figsize=(12,5))
sns.barplot(x=df['RAM'], y=df['Price'])
plt.xticks(rotation=45)

In [None]:
# Bar Plot - OS Vs Price
plt.figure(figsize=(12,5))
sns.barplot(x=df['OS'], y=df['Price'])
plt.xticks(rotation=0)

In [None]:
# Bar Plot - HDD Vs Price
plt.figure(figsize=(12,5))
sns.barplot(x=df['HD Capacity'], y=df['Price'])
plt.xticks(rotation=45)

In [None]:
# Bar Plot - Display Vs Price
plt.figure(figsize=(12,5))
sns.barplot(x=df['Display'], y=df['Price'])
plt.xticks(rotation=45)

In [None]:
# Line Plot - Rating Vs Price (both are numeric variables)
plt.figure(figsize=(8,4))
sns.lineplot(x=df['Rating'], y=df['Price'])

In [None]:
### You learned how to:
* Inspect the HTML structure of your target site with your browser’s tools
* Decipher the data encoded in URLs
* Download the page’s HTML content using Python’s requests library
* Parse the downloaded HTML with Beautiful Soup to extract relevant information

In [None]:
# Source (URL truncated in the original): https://www.edureka.co/blog/web-scrap
from fake_useragent import UserAgent
ua = UserAgent()

ua.ie
# Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US);
ua.msie
# Mozilla/5.0 (compatible; MSIE 10.0; Macintosh; Intel Mac OS X 10_7_3; Trident/6.0)'
ua['Internet Explorer']
# Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; GTB7.4; InfoPath.2; SV1; .NET CLR 3.3.69573; WOW64; en-US)
ua.opera
# Opera/9.80 (X11; Linux i686; U; ru) Presto/2.8.131 Version/11.11
ua.chrome
# Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 (KHTML, like Gecko) Chrome/22.0.1216.0 Safari/537.2'
ua.google
# Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.13 (KHTML, like Gecko) Chrome/24.0.1290.1 Safari/537.13
ua['google chrome']
# Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11
ua.firefox
# Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1
ua.ff
# Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1
ua.safari
# Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25

# and the best one: a random user agent, weighted by real-world browser usage statistics
ua.random