# Web Scraping Using Python

+ Web scraping improves efficiency in industry by extracting large amounts of detail from websites automatically, using a program called a **scraper**. 
+ The extracted data is converted from web pages into a structured format, which is useful for further processing and storage. 
![Pic1.jpg](attachment:Pic1.jpg)
+ Figure: Web scraping  

## Following facts should be taken into consideration for web scraping: 
+ Read through the target website's terms and conditions to understand how you can legally use the data. 
+ Most websites prohibit you  from using their data for commercial purposes. 
+ Make sure you are not downloading data too rapidly, because a flood of requests may break the website. 
+ You could potentially get blocked from the website as well.  
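
The points above can be sketched in code. This is a minimal illustration, not a complete policy: it uses the standard library's `urllib.robotparser` to check a site's `robots.txt` rules and pauses between requests. The `robots.txt` content, the URLs, and the helper names `allowed_urls` and `fetch_politely` are hypothetical, and the `fetch` argument stands in for whatever download function you use. Checking `robots.txt` does not replace reading the terms and conditions.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one is served at https://<site>/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(urls, robots_txt, user_agent="*"):
    """Return only the URLs that robots_txt permits user_agent to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results


urls = [
    "https://example.com/index.html",
    "https://example.com/private/data.html",
]
permitted = allowed_urls(urls, ROBOTS_TXT)

# The lambda is a stand-in for a real download function such as requests.get
pages = fetch_politely(permitted, fetch=lambda url: f"<fetched {url}>", delay_seconds=0.1)
print(permitted)
```

A suitable delay depends on the site; one request per second or slower is a cautious starting assumption.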

## Different aspects of web scraping  
+ Web scraping involves two main components: 
    + Crawler 
    + Scraper 
    
![Pic2.jpg](attachment:Pic2.jpg)

## Crawler  
+ A crawler browses the internet to index and search relevant content. 
+ It follows links from page to page in a systematic order and performs extra steps on the data it collects. 
+ The extra  steps involved are as follows: 
    + Indexing 
    + Storing in databases 
+ Stored information in the database can be processed further to showcase on  the UI. 
+ With the help of that information, application development and product  development as well as maintenance get easier.  
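
The crawling loop described above can be sketched as follows. This is a simplified illustration under stated assumptions: the "website" is a hypothetical in-memory dictionary (`SITE`) so the example runs without network access, links are extracted with the standard library's `html.parser` rather than BeautifulSoup, and the "index" is a plain dictionary instead of a database.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical pages standing in for a real website
SITE = {
    "https://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "https://example.com/a.html": '<a href="/b.html">B</a>',
    "https://example.com/b.html": '<a href="/">Home</a>',
}


def crawl(start_url, get_html, max_pages=10):
    """Visit pages breadth-first and return {url: [absolute links found on that page]}."""
    index = {}
    queue = deque([start_url])
    while queue and len(index) < max_pages:
        url = queue.popleft()
        if url in index:
            continue  # already visited
        collector = LinkCollector()
        collector.feed(get_html(url))
        links = [urljoin(url, href) for href in collector.links]
        index[url] = links  # "indexing" step: remember what this page links to
        queue.extend(link for link in links if link not in index)
    return index


index = crawl("https://example.com/", get_html=lambda url: SITE.get(url, ""))
for url, links in index.items():
    print(url, "->", links)
```

The `max_pages` limit keeps the crawl bounded; a real crawler would also store the page content and respect the site's rate limits.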

## Scraper
+ It is a specialized tool used for accurate crawling and extracting data. 
+ It extracts values from a page's HTML and can store them in a particular file format such as JSON, XML, or CSV. With the help of a scraper, storing data in a spreadsheet becomes quick and easy. 
+ In the figure below, you can see that the scraper crawls the website and stores the data in a structured file format for easy readability. 
+ Scraping is also known as human mining as it generates data which  is easy to read.  

![Pic3.jpg](attachment:Pic3.jpg)
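
As a small illustration of the storage step, the sketch below writes the same scraped records as CSV and as JSON using only the standard library. The `records` list and the helper names are hypothetical; in practice the records would come from the scraper, and the text would be written to a file rather than kept in memory.

```python
import csv
import io
import json

# Hypothetical scraped records
records = [
    {"title": "First Paragraph", "link": "https://example.com/a.html"},
    {"title": "Second Paragraph", "link": "https://example.com/b.html"},
]


def to_csv_text(rows):
    """Serialize a list of dictionaries (all with the same keys) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json_text(rows):
    """Serialize a list of dictionaries to indented JSON text."""
    return json.dumps(rows, indent=2)


csv_text = to_csv_text(records)
json_text = to_json_text(records)
print(csv_text)
print(json_text)
```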

## 3 Popular Tools and Libraries used for Web Scraping in Python
+ **BeautifulSoup**:
    + BeautifulSoup is a Python parsing library that enables web scraping from HTML and XML documents.
    + BeautifulSoup automatically detects encodings and gracefully handles HTML documents even with special characters. 
    + We can navigate a parsed document and find what we need, which makes extracting data from web pages quick and painless.
+ **Scrapy**:
    + Scrapy is a Python framework for large scale web scraping. 
    + It gives you all the tools you need to efficiently extract data from websites, process them as you want, and store them in your preferred structure and format. 
+ **Selenium**: 
    + Selenium is another popular tool for automating browsers. It’s primarily used for testing in the industry but is also very handy for web scraping. 

## Details of web scraping  
+ Web scraping means to collect data from websites and  store it in a structured and organized manner. 
+ In the following example we will go in more detail about web scraping using a Python library. 
+ For the following  example we will make use of the **BeautifulSoup library**.  
![Pic6.jpg](attachment:Pic6.jpg)

# Required libraries

In [2]:
from bs4 import BeautifulSoup as bs
import requests

## Make sure all the files are available in the folder that you are working with and set your working directory

In [6]:
# for example: 
import os
#os.chdir('C:/Users/gawankar/OneDrive - The George Washington University/Desktop/Async_Material_Folder_06/Files')
os.getcwd()

'/Users/harshitaggarwal/Documents/GWU/Python/Week 6'

In [13]:
path = 'Users/harshitaggarwal/Documents/GWU/Python/Week 6/S_06_Working_Files/Files/'
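
The cells in this notebook build a `file:///` URL by joining strings, which is easy to get wrong when a folder name contains spaces or when the path is relative. A possible alternative, shown here as a sketch, is to let `pathlib` build the URL. The helper name `local_file_url` is hypothetical, and the file does not need to exist for the URL to be constructed.

```python
from pathlib import Path


def local_file_url(directory, filename):
    """Return an absolute file:// URL for filename inside directory."""
    # resolve() makes the path absolute; as_uri() percent-encodes spaces and similar characters
    return (Path(directory) / filename).resolve().as_uri()


url = local_file_url(".", "simple.html")
print(url)
```

The resulting string can be passed to `urllib.request.urlopen` in the same way as the hand-built URL.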

# Read the Basic of HTML. pdf file

In [None]:
# Open "Files" file organizer/folder
# Right-click on "simple.html"
# Right-click on document and "inspect"

In [3]:
import urllib.request
from bs4 import BeautifulSoup as bs
filename = "simple.html"
url = "file:///"+path+filename
request = urllib.request.Request(url)
response = urllib.request.urlopen(request)
data = response.read()
response.close()

# Create the soup
soup = bs(data, "html.parser")

# Print parse tree
print(soup.prettify())


In [None]:
# Parsing a basic html page

In [None]:
import urllib.request
from bs4 import BeautifulSoup as bs

filename = "demo1.html"
url = "file:///"+path+filename
request = urllib.request.Request(url)
response = urllib.request.urlopen(request)
data = response.read()
response.close()

# Create the soup
soup = bs(data, "html.parser")

In [None]:
soup

In [None]:
print(soup.prettify())
# prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string

In [None]:
list(soup.children)
# .children iterates over the direct children of a tag (or of the whole document).
# The children include text nodes such as newlines, not only Tag objects.

In [None]:
html=list(soup.children)[0]
html

In [None]:
list(html.children)

In [None]:
body = list(html.children)[3]

In [None]:
body

In [None]:
list(body.children)

In [None]:
p = list(body.children)[1]

In [None]:
p

In [None]:
data = p.get_text()

In [None]:
data
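
Picking children by position, as in `list(html.children)[3]`, depends on the exact whitespace in the file, because newlines between tags count as children too. The sketch below shows this on a small inline document (hypothetical, not one of the course files) and filters the children down to `Tag` objects.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical document used only for this illustration
html_doc = """<html>
<head><title>Demo</title></head>
<body>
<p>Hello, world.</p>
</body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")
body = soup.body  # attribute access returns the first <body> tag

all_children = list(body.children)  # tags and the newline strings between them
tag_children = [child for child in body.children if isinstance(child, Tag)]

print(len(all_children), len(tag_children))
print(tag_children[0].get_text())
```

Attribute access such as `soup.body.p` reaches the same paragraph without counting positions.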

# Finding all instances of a tag at once

In [None]:
from bs4 import BeautifulSoup as bs
import requests

In [14]:
import urllib.request
from bs4 import BeautifulSoup as bs

filename = "demo2.html"
url = "file:///"+path+filename
request = urllib.request.Request(url)
response = urllib.request.urlopen(request)
data = response.read()
response.close()

# Create the soup
soup = bs(data, "html.parser")

In [None]:
soup

In [None]:
soup.find_all('p')

In [16]:
soup.find_all('p')[0].get_text()

'First Paragraph'

In [None]:
soup.find('p')

In [24]:
soup.p.get('id')
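
To make the difference between these calls concrete, here is a minimal sketch on an inline document (hypothetical; the contents of `demo2.html` may differ). `find_all` returns a list of every match, `find` returns only the first match or `None`, and `.get()` returns `None` when an attribute is absent.

```python
from bs4 import BeautifulSoup

# Hypothetical document used only for this illustration
html_doc = '<div><p id="first">First Paragraph</p><p>Second Paragraph</p></div>'
soup = BeautifulSoup(html_doc, "html.parser")

paragraphs = soup.find_all("p")              # list of all <p> tags
texts = [p.get_text() for p in paragraphs]

first = soup.find("p")                       # first <p> tag only, same as soup.p
missing = soup.find("table")                 # no match: find returns None
second_id = paragraphs[1].get("id")          # attribute absent: get returns None

print(texts)
print(first.get("id"), missing, second_id)
```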

# Searching the tags by class or id

In [None]:
import urllib.request
from bs4 import BeautifulSoup as bs

filename = "demo3.html"
url = "file:///"+path+filename
request = urllib.request.Request(url)
response = urllib.request.urlopen(request)
data = response.read()
response.close()

# Create the soup
soup = bs(data, "html.parser")

In [None]:
print(soup.prettify())

In [None]:
soup.find_all('p',class_='outer-text')

In [None]:
soup.find_all('p',id='first')

# Using CSS Selectors
+ CSS (Cascading Style Sheets) is a declarative language that controls how webpages look in the browser.
+ BeautifulSoup has a .select() method which uses the SoupSieve package to run a CSS selector against a parsed document and return all the matching elements. 
+ Tag has a similar method which runs a CSS selector against the contents of a single tag.

In [None]:
soup.select("div p")

In [None]:
soup.select("div p.first-item")

In [None]:
soup.select("div p#first")

In [None]:
soup.select("body p.outer-text")
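
The searches by class, by id, and by CSS selector can be compared side by side on a small inline document. The HTML below is a hypothetical stand-in that reuses the class and id names from the cells above; the real `demo3.html` may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical document used only for this illustration
html_doc = """
<body>
  <div>
    <p class="inner-text first-item" id="first">First paragraph.</p>
    <p class="inner-text">Second paragraph.</p>
  </div>
  <p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
  <p class="outer-text"><b>Second outer paragraph.</b></p>
</body>
"""
soup = BeautifulSoup(html_doc, "html.parser")

by_class = soup.find_all("p", class_="outer-text")   # <p> tags carrying the class
by_id = soup.find_all("p", id="first")               # <p> tags with this id

nested = soup.select("div p")                        # <p> tags anywhere inside a <div>
nested_first = soup.select("div p.first-item")       # ...that also carry the class
outer = soup.select("body p.outer-text")             # same result as by_class here
one = soup.select_one("div p#first")                 # first match only, or None

print(len(by_class), len(nested), len(nested_first))
print(one.get_text())
```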

In [None]:
soup.find_all('a')

In [None]:
soup.find(id="link3")

In [None]:
my_links = soup.find_all('a')

In [None]:
links = []
for link in my_links:
    links.append(link.get('href'))

In [None]:
links
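
The `href` values collected this way are often relative (for example `page2.html`), and `link.get('href')` returns `None` for an `<a>` tag without an `href`. A possible follow-up step, sketched here with hypothetical URLs, is to drop the missing values and resolve the rest against the page's own address with `urllib.parse.urljoin`.

```python
from urllib.parse import urljoin

# Hypothetical page address and href values, as link.get('href') might return them
base_url = "https://example.com/articles/index.html"
hrefs = ["page2.html", "/about", "https://other.example.org/x", None]

# Skip missing hrefs and turn every remaining link into an absolute URL
absolute_links = [urljoin(base_url, href) for href in hrefs if href]
print(absolute_links)
```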

# Handling tables

In [None]:
import urllib.request
from bs4 import BeautifulSoup as bs

filename = "table.html"
url = "file:///"+path+filename
request = urllib.request.Request(url)
response = urllib.request.urlopen(request)
data = response.read()
response.close()

# Create the soup
soup = bs(data, "html.parser")

# ---- Now we start parsing the table

tables = soup.find_all('table')

for table in tables:
    rows = table.find_all('tr')
    for row in rows:
        cells = row.find_all('td')
        for cell in cells:
            # Print the text content of each table cell
            print(cell.get_text())

In [None]:
import urllib.request
from bs4 import BeautifulSoup as bs

filename = "table.html"
url = "file:///"+path+filename
request = urllib.request.Request(url)
response = urllib.request.urlopen(request)
data = response.read()
response.close()

# Create the soup
soup = bs(data, "html.parser")

# ---- Now we start parsing the table

tables = soup.find_all('table')

# One list per table; each table is a list of rows; each row is a list of cell texts
myLists = []

for table in tables:
    rows = table.find_all('tr')
    # print("number of rows =", len(rows))
    r = []
    for row in rows:
        cells = row.find_all('td')
        # print("number of cols =", len(cells))
        c = []
        for cell in cells:
            c.append(cell.get_text())
        # print(c)
        r.append(c)
    myLists.append(r)

print(myLists)

In [None]:
import pandas as pd
headings = ['Heading1', 'Heading2','Heading3']

In [None]:
df1 = pd.DataFrame(myLists[0], columns=headings)

In [None]:
print(df1)

In [None]:
df2 = pd.DataFrame(myLists[1], columns=headings)

In [None]:
print(df2)
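
The headings above are typed in by hand. When a table marks its header cells with `<th>`, the headings can be read from the page instead. The sketch below assumes such a table; the inline HTML is hypothetical, and `table.html` may or may not use `<th>` cells.

```python
from bs4 import BeautifulSoup

# Hypothetical table used only for this illustration
html_doc = """<table>
<tr><th>Heading1</th><th>Heading2</th><th>Heading3</th></tr>
<tr><td>a</td><td>b</td><td>c</td></tr>
<tr><td>d</td><td>e</td><td>f</td></tr>
</table>"""

soup = BeautifulSoup(html_doc, "html.parser")
table = soup.find("table")

# Header texts come from the <th> cells
headings = [th.get_text(strip=True) for th in table.find_all("th")]

rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has no <td> cells, so it is skipped
        rows.append(dict(zip(headings, cells)))

print(headings)
print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`.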

# Fetching a live web page with urllib.request

In [None]:
# Step 1
import urllib.request
url = "http://www.google.com/"
request = urllib.request.Request(url)
# Calling urlopen with this Request object returns 
# a response object for the URL requested. This 
# response is a file-like object, which means you 
# can for example call .read() on the response

print(request)

In [None]:
# Step 2
response = urllib.request.urlopen(request)
#print (response.info())

In [None]:
# Step 3
html = response.read()
#print (html)

In [None]:
# Step 4
response.close()
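
Steps 2 to 4 can also be written with a `with` block, which closes the response automatically even if reading fails. The sketch below shows the pattern on a temporary local file so that it runs without network access; the file name and its content are hypothetical, and the same pattern applies to an `http://` URL.

```python
import tempfile
import urllib.request
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    # Write a small hypothetical page so the example does not need the network
    page = Path(folder) / "page.html"
    page.write_text("<html><body><p>Hello</p></body></html>", encoding="utf-8")

    # The response is closed when the with block ends
    with urllib.request.urlopen(page.resolve().as_uri()) as response:
        html = response.read()  # bytes, exactly as with the step-by-step version

print(html.decode("utf-8"))
```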

## Example: 
![Pic8.png](attachment:Pic8.png)

In [None]:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
website = "https://www.cars.com/shopping/results/?stock_type=cpo&makes%5B%5D=mercedes_benz&models%5B%5D&list_price_max&maximum_distance=all&zip"

In [None]:
response=requests.get(website)

In [None]:
response.status_code

In [None]:
soup=bs(response.content,'html.parser')

In [None]:
soup

In [None]:
results=soup.find_all('div',{'class':'vehicle-card'})

In [None]:
len(results)

In [None]:
results[0].find('span',{'class':'primary-price'}).get_text()
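
Two details are worth handling before using a value extracted this way: `find` returns `None` when a card has no matching element, and `get_text()` returns a string rather than a number. The helpers below are a sketch; their names are hypothetical, and the assumption that a price looks like `$45,990` is an illustration, not something taken from the site.

```python
import re


def text_or_none(element):
    """Return the stripped text of a parsed element, or None if the element is missing."""
    return element.get_text(strip=True) if element is not None else None


def parse_price(text):
    """Convert a price string such as '$45,990' to a float, or None if no number is found."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text or "")
    if match is None:
        return None
    return float(match.group().replace(",", ""))


print(parse_price("$45,990"))
print(parse_price("Not Priced"))
print(parse_price(text_or_none(None)))
```

With the variables from the cells above, this could be applied as `parse_price(text_or_none(results[0].find('span', {'class': 'primary-price'})))`.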

# Case Study 1:  2020-21 NBA Player Stats: Per Game 
![Pic7.png](attachment:Pic7.png)

In [4]:
from bs4 import BeautifulSoup as bs
import requests
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline

In [5]:
url = 'https://www.basketball-reference.com/leagues/NBA_2020_per_game.html'
page = requests.get(url)
page

<Response [200]>

In [6]:
page.content



In [7]:
soup = bs(page.content,'html.parser')

In [8]:
print(soup.prettify())

<!DOCTYPE html>

<html class="no-js" data-root="/home/bbr/build" data-version="klecko-" itemscope="" itemtype="https://schema.org/WebSite" lang="en">
<head>
<meta charset="utf-8"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="width=device-width, initial-scale=1.0, maximum-scale=2.0" name="viewport">
<link href="https://d2p3bygnnzw9w3.cloudfront.net/req/202109021" rel="dns-prefetch"/>
<!-- Quantcast Choice. Consent Manager Tag v2.0 (for TCF 2.0) -->
<script async="true" type="text/javascript">
    (function() {
	var host = window.location.hostname;
	var element = document.createElement('script');
	var firstScript = document.getElementsByTagName('script')[0];
	var url = 'https://quantcast.mgr.consensu.org'
	    .concat('/choice/', 'XwNYEpNeFfhfr', '/', host, '/choice.js')
	var uspTries = 0;
	var uspTriesLimit = 3;
	element.async = true;
	element.type = 'text/javascript';
	element.src = url;
	
	firstScript.parentNode.insertBefore(ele

In [None]:
table = soup.find_all(class_ = "full_table" )

In [None]:
table

In [None]:
head = soup.find(class_= 'thead')
column_names_raw = head.text  # text of all header cells as one newline-separated string
column_names_raw

In [None]:
column_names_clean = column_names_raw.replace("\n",",").split(",")[2:-1]
column_names_clean

In [None]:
players = []
for i in range(len(table) ):
    player_ = []
    for td in table[i].find_all("td"):
        player_.append(td.text)
    players.append(player_)

df = pd.DataFrame(players, columns = column_names_clean).set_index("Player")
# Clean the players' names of the occasional '*' marker.
# regex=False treats '*' as a literal character rather than a regex quantifier.
df.index = df.index.str.replace('*', '', regex=False)

In [None]:
df
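
Every value collected with `td.text` is a string, so numeric columns in `df` hold text until they are converted. A possible conversion step is sketched below on a small hypothetical frame; the player names and values are made up, and `errors="coerce"` turns anything that is not a number into `NaN`.

```python
import pandas as pd

# Hypothetical rows shaped like the scraped table: every value is a string
sample = pd.DataFrame(
    {"Player": ["Player A", "Player B"], "Pos": ["C", "PG"], "PTS": ["10.9", ""]}
).set_index("Player")

# Convert the points column from text to numbers; blanks become NaN
sample["PTS"] = pd.to_numeric(sample["PTS"], errors="coerce")

print(sample.dtypes)
print(sample)
```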

In [None]:
df.to_csv ('2020 nba_data_per_game.csv', header=True)

In [None]:
mydata = pd.read_csv("2020 nba_data_per_game.csv")
mydata[0:5]

In [None]:
# Count how many players are listed at each position and plot the counts
pos_counts = mydata["Pos"].value_counts().head(10)
pos_counts.plot(kind='bar', color=list('rgbkymc'));

# Case Study 2: Amazon Product Reviews
![Pic4.jpg](attachment:Pic4.jpg)

In [None]:
from bs4 import BeautifulSoup as bs
import requests

In [None]:
link = 'https://www.amazon.in/OnePlus-Mirror-Black-128GB-Storage/product-reviews/B07DJHV6VZ/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews'

In [None]:
page = requests.get(link)

In [None]:
page
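
Some websites answer differently, or return an error page, when a client does not identify itself, so it can help to check the response status before parsing and to send a descriptive `User-Agent` header. The sketch below only builds a request object with the standard library's `urllib.request` and sends nothing; the URL, the helper name, and the user-agent string are hypothetical.

```python
import urllib.request


def build_request(url, user_agent):
    """Create a Request object that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_request("https://example.com/reviews", "my-course-scraper/0.1")

# urllib stores header names capitalized as "User-agent"
print(request.get_header("User-agent"))
```

With `requests`, the same header can be passed as `requests.get(link, headers={"User-Agent": "..."})`.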

In [None]:
page.content

In [None]:
soup = bs(page.content,'html.parser')

In [None]:
print(soup.prettify())
# Prettify() function in BeautifulSoup will enable us to view how the tags are nested in the document.

In [None]:
names = soup.find_all('span',class_='a-profile-name')

In [None]:
names

In [None]:
# Collect the text of every matched <span> tag
cust_name = [name.get_text() for name in names]
cust_name

In [None]:
cust_name.pop(0)
# pop(index) removes and returns the item at the given index; with no argument it removes the last item.
# Here pop(0) drops the first entry of the list.

In [None]:
cust_name

In [None]:
cust_name.pop(0)

In [None]:
cust_name

In [None]:
title = soup.find_all('a',class_='review-title-content')

In [None]:
title

In [None]:
# Collect the text of every review title link
review_title = [item.get_text() for item in title]
review_title

In [None]:
review_title[:] = [titles.lstrip('\n') for titles in review_title]
review_title

In [None]:
review_title[:] = [titles.rstrip('\n') for titles in review_title]
review_title

In [None]:
rating = soup.find_all('i',class_='review-rating')
rating

In [None]:
# Collect the text of every rating element
rate = [item.get_text() for item in rating]
rate

In [None]:
len(rate)

In [None]:
rate.pop(0)

In [None]:
rate.pop(0)

In [None]:
rate

In [None]:
review = soup.find_all("span",{"data-hook":"review-body"})
review

In [None]:
# Collect the text of every review body
review_content = [item.get_text() for item in review]
review_content

In [None]:
review_content[:] = [reviews.lstrip('\n') for reviews in review_content]
review_content

In [None]:
review_content[:] = [reviews.rstrip('\n') for reviews in review_content]
review_content

In [None]:
len(review_content)

In [None]:
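# A notebook cell displays only its last expression, so print each list explicitly
print(cust_name)
print(review_title)
print(rate)
print(review_content)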

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame()

In [None]:
df['Customer Name']=cust_name

In [None]:
df

In [None]:
df['Review title']=review_title
#df['Ratings']=rate
df['Reviews']=review_content

In [None]:
df

In [None]:
df.to_csv(r'reviews.csv',index=True)
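
Assigning a list to a DataFrame column raises a `ValueError` when its length differs from the number of rows, which can happen if the page yields a different number of names, titles, and reviews. One way to guard against that, sketched here with hypothetical sample lists, is to align the lists with `itertools.zip_longest`, which fills the gaps with `None`.

```python
from itertools import zip_longest

# Hypothetical scraped lists of unequal length
sample_names = ["Customer A", "Customer B", "Customer C"]
sample_titles = ["Great phone", "Okay"]
sample_reviews = ["Loved it", "It is fine", "Battery drains fast"]

# Build one dictionary per review; missing values become None
rows = [
    {"Customer Name": name, "Review title": title, "Reviews": body}
    for name, title, body in zip_longest(sample_names, sample_titles, sample_reviews)
]

for row in rows:
    print(row)
```

The resulting list can be passed to `pd.DataFrame(rows)`. Note that padding hides a mismatch rather than explaining it, so unequal lengths are still worth investigating.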