## Python Modules for Web Scraping

Web scraping is the process of constructing an agent that can automatically extract, parse, download and organize useful information from the web. In other words, instead of manually saving data from websites, web scraping software automatically loads and extracts data from multiple websites according to our requirements.

In this section, we discuss useful Python libraries for web scraping.

The first question to ask before getting started with any Python application is "Which libraries do I need?"

For web scraping there are a few different libraries to consider, including:

- Beautiful Soup
- Requests
- Scrapy
- Selenium
- Urllib3

### Requests
Requests is a simple and efficient Python HTTP library used for accessing web pages. With the help of Requests, we can get the raw HTML of web pages, which can then be parsed to retrieve the data. Before using Requests, let us install it.

#### Installation of Requests
To install Requests, run this command in your terminal of choice:

`pip install requests`



In [1]:
!pip install requests



### Example
In this example, we make an HTTP GET request for a web page. For this, we first need to import the requests library as follows −


In [2]:
import requests

In the following lines of code, we use requests to make an HTTP GET request for the URL https://authoraditiagarwal.com/.


In [3]:
url  = 'https://authoraditiagarwal.com/'
response = requests.get(url)


Now we can retrieve the content by using the `.text` property as follows −

In [5]:
response.text[:150]

'<!DOCTYPE html><html lang="en-US" id="html"><head><meta charset="UTF-8" /><meta http-equiv="X-UA-Compatible" content="IE=10" /><link rel="profile" hre'

Requests allows you to send HTTP/1.1 requests extremely easily. There’s no need to manually add query strings to your URLs, or to form-encode your POST data. Keep-alive and HTTP connection pooling are 100% automatic.

We will discuss this in more detail later; this was just an introduction to the Requests library. For more, see the quickstart guide:

https://docs.python-requests.org/en/master/user/quickstart/


### Urllib3
urllib3 is another Python library that can be used for retrieving data from URLs, similar to the requests library. You can read more about it in its technical documentation <a href='https://urllib3.readthedocs.io/en/latest/'>here</a>.

#### Installing
urllib3 can be installed with <a href = 'https://pip.pypa.io/'>pip</a>.

`$ python -m pip install urllib3`

In [7]:
!pip install urllib3



#### Example: Scraping using Urllib3 and BeautifulSoup

In the following example, we scrape a web page by using Urllib3 and BeautifulSoup. We use Urllib3 in place of the requests library to get the raw data (HTML) from the web page. Then we use BeautifulSoup to parse that HTML data.

Beautiful Soup is a Python package for parsing HTML and XML documents, including documents with malformed markup such as non-closed tags (it is named after "tag soup"). It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.

If you want to know more about Beautiful Soup, please refer to the documentation <a href = 'https://www.crummy.com/software/BeautifulSoup/bs4/doc/'>here</a>.

In [8]:
import urllib3
from bs4 import BeautifulSoup
http = urllib3.PoolManager()
response = http.request('GET', 'https://authoraditiagarwal.com')
soup = BeautifulSoup(response.data, 'lxml')
print (soup.title)
print (soup.title.text)

<title>Learn and grow together</title>
Learn and grow together


### Selenium
It is an open source automated testing suite for web applications across different browsers and platforms. It is not a single tool but a suite of software. We have selenium bindings for Python, Java, C#, Ruby and JavaScript. Here we are going to perform web scraping by using selenium and its Python bindings. 

Selenium Python bindings provide a convenient API to access Selenium WebDrivers like Firefox, IE, Chrome, Remote etc. The current supported Python versions are 2.7, 3.5 and above.

For documentation, please visit https://pypi.org/project/selenium/.

#### Installing Selenium

Using the pip command, we can install selenium either in our virtual environment or globally.

`pip install selenium`

In [9]:
!pip install selenium

Collecting selenium
  Downloading selenium-3.141.0-py2.py3-none-any.whl (904 kB)
Installing collected packages: selenium
Successfully installed selenium-3.141.0


As Selenium requires a driver to interface with the chosen browser, we need to download it. The following list shows different browsers and the links for downloading their drivers.

- Chrome https://sites.google.com/a/chromium.org/

- Edge https://developer.microsoft.com/

- Firefox https://github.com/

- Safari https://webkit.org/

#### Example

This example shows web scraping using Selenium. Selenium can also be used for testing, which is called Selenium testing.

After downloading the driver that matches the version of our browser, we can write the Python code.

First, we need to import webdriver from selenium as follows −

`from selenium import webdriver`

In [10]:
from selenium import webdriver

Now, provide the path of web driver which we have downloaded as per our requirement −

`path = r'C:\Users\nijat\Desktop\Data Science\Preparation For Interview\Technicl Skil\web Scraping\Chromedriver'`<br>
`browser = webdriver.Chrome(executable_path = path)`

In [11]:
path = r'C:\Users\nijat\Desktop\Data Science\Preparation For Interview\Technicl Skil\web Scraping\Chromedriver'
browser = webdriver.Chrome(executable_path = path)

Now, provide the url which we want to open in that web browser now controlled by our Python script.

In [12]:
browser.get('https://authoraditiagarwal.com/leadershipmanagement')

We can also select a particular element by providing its XPath, in the same way as with lxml.

In [13]:
browser.find_element_by_xpath('/html/body').click()

You can check the browser, controlled by Python script, for output.

### Scrapy
Scrapy is a fast, open-source web crawling framework written in Python, used to extract the data from the web page with the help of selectors based on XPath. Scrapy was first released on June 26, 2008 licensed under BSD, with a milestone 1.0 releasing in June 2015. It provides us all the tools we need to extract, process and structure the data from websites.

If you want to know more about Scrapy, visit the documentation <a href="https://docs.scrapy.org/en/latest/">here</a>.

#### Installing Scrapy
Using the pip command, we can install scrapy either in our virtual environment or globally.

`pip install scrapy`

We will be using `Beautiful Soup`. But before we start web scraping, let's discuss some important issues.


## Legality of Web Scraping

With Python, we can scrape any website or particular elements of a web page, but is it legal? Before scraping any website, we must understand the legality of web scraping. This chapter explains the concepts related to the legality of web scraping.

Generally, if you are going to use the scraped data for personal use, there may not be any problem. But if you are going to republish that data, you should first make a download request to the owner, or research the site's policies and the data you are going to scrape.

### Research Required Prior to Scraping
If we are targeting a website for scraping, we need to understand its scale and structure. The following are some of the points we need to analyze before starting web scraping.

#### 1. Analyzing robots.txt
Most publishers allow programmers to crawl their websites to some extent. In other words, publishers want specific portions of their websites to be crawled. To define this, websites publish rules stating which portions can be crawled and which cannot. Such rules are defined in a file called robots.txt.

robots.txt is a human-readable file that identifies the portions of the website that crawlers are allowed, and not allowed, to scrape. The file follows the Robots Exclusion Protocol, and publishers adapt the rules to their needs. We can check the robots.txt file of a particular website by appending a slash and robots.txt to its URL. For example, to check it for Google.com, we type https://www.google.com/robots.txt and get something like the following −
<code>
User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/static
Allow: /search/howsearchworks
Disallow: /sdch
Disallow: /groups
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&
Allow: /?hl=*&gws_rd=ssl$
and so on……..
</code>
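
Python's standard library includes `urllib.robotparser`, which can evaluate rules like these. The sketch below parses a shortened, hand-copied subset of the rules above (an assumption made so the example runs offline) instead of fetching the live file, and the `/maps` path is a hypothetical one that the subset does not mention. Note that this parser applies rules in file order, which may differ from how a particular search engine resolves overlapping `Allow` and `Disallow` lines.

```python
from urllib.robotparser import RobotFileParser

# A shortened subset of the rules shown above, kept inline so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /sdch
Disallow: /groups
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch(user_agent, url) reports whether the parsed rules permit crawling the URL.
search_allowed = parser.can_fetch("*", "https://www.google.com/search")
# Hypothetical path that none of the rules in the subset cover.
maps_allowed = parser.can_fetch("*", "https://www.google.com/maps")

print("May crawl /search:", search_allowed)
print("May crawl /maps:", maps_allowed)
```

To check a live site instead, `set_url()` followed by `read()` downloads the file before `can_fetch()` is called.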

#### 2. Analyzing Sitemap files
What should you do if you want to crawl a website for updated information? You could crawl every web page to get that updated information, but this would increase the server traffic of that website. That is why websites provide sitemap files, which help crawlers locate updated content without needing to crawl every web page. The sitemap standard is defined at http://www.sitemaps.org/protocol.html.
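
As a rough illustration of how a crawler might use a sitemap, the following sketch reads the page URLs and their last-modified dates from a sitemap document using the standard library. The XML content and the `example.com` URLs are hypothetical; a real sitemap would be downloaded from the target site.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content; a real one would be downloaded from the site.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2021-04-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/blog/</loc>
    <lastmod>2021-04-07</lastmod>
  </url>
</urlset>
"""

# Sitemap elements live in this XML namespace, so lookups must be qualified with it.
NAMESPACE = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_sitemap_entries(xml_text):
    """Return (url, last_modified) pairs found in a sitemap document."""
    root = ET.fromstring(xml_text)
    entries = []
    for url_node in root.findall("sm:url", NAMESPACE):
        loc = url_node.findtext("sm:loc", default="", namespaces=NAMESPACE)
        lastmod = url_node.findtext("sm:lastmod", default=None, namespaces=NAMESPACE)
        entries.append((loc.strip(), lastmod))
    return entries


entries = list_sitemap_entries(SITEMAP_XML)
for page_url, last_modified in entries:
    print(page_url, last_modified)
```

Comparing the `lastmod` values with the date of the previous crawl is one way to decide which pages need to be fetched again.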


#### 3. What is the Size of Website?
Does the size of a website, i.e. the number of its web pages, affect the way we crawl? Certainly yes. If we have only a few web pages to crawl, efficiency is not a serious issue. But if the website has millions of web pages, for example Microsoft.com, then downloading each web page sequentially would take several months, and efficiency becomes a serious concern.
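
A back-of-the-envelope calculation makes the difference concrete. The page counts and the two seconds per page used below are assumptions chosen for illustration, not measurements of any real site.

```python
def sequential_crawl_days(page_count, seconds_per_page):
    """Estimate how many days a one-page-at-a-time crawl would take."""
    seconds_per_day = 24 * 60 * 60
    return page_count * seconds_per_page / seconds_per_day


# Assumed figures: a small site of 60 pages and a large one of 5 million pages,
# each page taking about 2 seconds to download.
small_site_days = sequential_crawl_days(60, 2)
large_site_days = sequential_crawl_days(5_000_000, 2)

print(f"Small site: {small_site_days:.4f} days")
print(f"Large site: {large_site_days:.1f} days")
```

Under these assumptions the small site finishes in about two minutes, while the large one needs roughly four months, which is why large crawls are usually parallelized.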

#### 4. Checking Website’s Size
By checking the number of results from Google’s crawler, we can estimate the size of a website. The results can be filtered by using the keyword `site` in the Google search. For example, an estimate of the size of https://authoraditiagarwal.com/ is given below −

<img src = 'https://www.tutorialspoint.com/python_web_scraping/images/checking_the_size.jpg'>


You can see there are around 60 results, which means it is not a big website and crawling it would not lead to efficiency issues.

#### 5. Which technology is used by website?
Another important question is whether the technology used by a website affects the way we crawl. Yes, it does. But how can we check which technology a website uses? There is a Python library named builtwith that can identify the technology used by a website.

##### Example

In this example we are going to check the technology used by the website https://authoraditiagarwal.com with the help of the Python library builtwith. Before using this library, we need to install it as follows −

`pip install builtwith`

Now, with the help of the following simple lines of code, we can check the technology used by a particular website −

In [14]:
import builtwith
builtwith.parse('http://authoraditiagarwal.com')

{'web-servers': ['Apache'],
 'advertising-networks': ['Google AdSense'],
 'javascript-frameworks': ['Prototype', 'jQuery'],
 'ecommerce': ['WooCommerce'],
 'cms': ['WordPress'],
 'programming-languages': ['PHP'],
 'blogs': ['PHP', 'WordPress']}

#### 6. Who is the owner of website?

The owner of the website also matters because if the owner is known for blocking crawlers, then we must be careful while scraping data from the website. There is a protocol named Whois with which we can find out who owns a website.

Let's first install the python-whois package −

`pip install python-whois`

#### Example

In this example we are going to check the owner of a website, say `microsoft.com`, with the help of Whois.

In [16]:
import whois
print (whois.whois('microsoft.com'))

{
  "domain_name": [
    "MICROSOFT.COM",
    "microsoft.com"
  ],
  "registrar": "MarkMonitor, Inc.",
  "whois_server": "whois.markmonitor.com",
  "referral_url": null,
  "updated_date": [
    "2021-03-12 23:25:32",
    "2021-04-07 12:58:15"
  ],
  "creation_date": [
    "1991-05-02 04:00:00",
    "1991-05-01 21:00:00"
  ],
  "expiration_date": [
    "2022-05-03 04:00:00",
    "2022-05-02 00:00:00"
  ],
  "name_servers": [
    "NS1-205.AZURE-DNS.COM",
    "NS2-205.AZURE-DNS.NET",
    "NS3-205.AZURE-DNS.ORG",
    "NS4-205.AZURE-DNS.INFO",
    "ns1-205.azure-dns.com",
    "ns2-205.azure-dns.net",
    "ns4-205.azure-dns.info",
    "ns3-205.azure-dns.org"
  ],
  "status": [
    "clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited",
    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
    "clientUpdateProhibited https://icann.org/epp#clientUpdateProhibited",
    "serverDeleteProhibited https://icann.org/epp#serverDeleteProhibited",
    "serverTrans

### Data Extraction

Analyzing a web page means understanding its structure. Now, the question arises: why is this important for web scraping?

#### Web page Analysis
Web page analysis is important because without it we cannot know in which form (structured or unstructured) we will receive the data from that web page after extraction. We can do web page analysis in the following ways −

#### Viewing Page Source

This is a way to understand how a web page is structured by examining its source code. To do this, right-click the page and select the View page source option. We then get the data of interest from that web page in the form of HTML. The main concern is the whitespace and formatting, which make the raw source difficult to read.

#### Inspecting Page Source by Clicking Inspect Element Option

This is another way of analyzing a web page. The difference is that it resolves the issue of formatting and whitespace in the source code of the web page. You can do this by right-clicking and then selecting the Inspect or Inspect element option from the menu. It provides information about a particular area or element of that web page.

### Different Ways to Extract Data from Web Page

The following methods are mostly used for extracting data from a web page −

#### 1. Regular Expression
Regular expressions are a highly specialized programming language embedded in Python. We can use them through Python's re module. They are also called REs, regexes or regex patterns. With the help of regular expressions, we can specify rules for the possible set of strings we want to match in the data.

If you want to learn more about regular expressions in general, go to https://www.tutorialspoint.com/automata_theory/regular_expressions.htm, and if you want to know more about the re module or regular expressions in Python, you can follow the link https://www.tutorialspoint.com/python/python_reg_expressions.htm.

#### Example

In the following example, we are going to scrape data about India from http://example.webscraping.com after matching the contents of `<td>` with the help of regular expression.


In [None]:
import re
import urllib.request

response = urllib.request.urlopen('http://example.webscraping.com/places/default/view/India-102')

html = response.read()
text = html.decode()
re.findall('<td class="w2p_fw">(.*?)</td>',text)
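
Because the page above may not always be reachable, the same pattern can be tried offline. The sketch below applies it to a small, hypothetical HTML fragment shaped like the table cells being targeted; the cell values are made up for illustration.

```python
import re

# Hypothetical HTML fragment shaped like the table cells targeted above.
SAMPLE_HTML = """
<table>
  <tr><td class="w2p_fl">Country:</td><td class="w2p_fw">India</td></tr>
  <tr><td class="w2p_fl">Capital:</td><td class="w2p_fw">New Delhi</td></tr>
  <tr><td class="w2p_fl">Continent:</td><td class="w2p_fw">AS</td></tr>
</table>
"""

# (.*?) is a non-greedy group: it captures as little as possible up to the next </td>.
CELL_PATTERN = re.compile(r'<td class="w2p_fw">(.*?)</td>')

values = CELL_PATTERN.findall(SAMPLE_HTML)
print(values)
```

Only the value cells match, because the label cells carry a different class. A pattern like this breaks as soon as the markup changes (an extra attribute, different quoting), which is one reason a parser is usually preferred for anything beyond quick extractions.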

### Beautiful Soup

Suppose we want to collect all the hyperlinks from a web page. We can use a parser called BeautifulSoup, which is described in more detail at https://www.crummy.com/software/BeautifulSoup/bs4/doc/. In simple words, BeautifulSoup is a Python library for pulling data out of HTML and XML files. It is used together with requests because it cannot fetch a web page by itself; it needs an input document, such as the text of an HTTP response, to create a soup object. The following Python script retrieves the title of a web page; hyperlinks can be gathered from the same soup object.

#### Installing Beautiful Soup

Using the pip command, we can install Beautiful Soup either in our virtual environment or globally.

`pip install beautifulsoup4`

#### Example
Note that in this example, we are extending the above example implemented with the requests Python module. We use r.text to create a soup object, which is then used to fetch details such as the title of the web page.

First, we need to import necessary Python modules −
<code>
import requests
from bs4 import BeautifulSoup
</code>

In [20]:
import requests 
from bs4 import BeautifulSoup

In the following line of code, we use requests to make an HTTP GET request for the URL https://authoraditiagarwal.com/.

In [21]:
r = requests.get('https://authoraditiagarwal.com/')

Now we need to create a Soup object as follows −


In [22]:
soup = BeautifulSoup(r.text, 'lxml')
print (soup.title)
print (soup.title.text)

<title>Learn and grow together</title>
Learn and grow together
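
Collecting the hyperlinks works along the same lines. The sketch below runs on a small, hypothetical HTML string and uses Python's built-in `html.parser` backend so that it works offline and without lxml; with a live page, `r.text` would take the place of `SAMPLE_HTML`.

```python
from bs4 import BeautifulSoup

# Hypothetical page content used in place of a downloaded response.
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <a href="https://example.com/about">About</a>
    <a href="/contact">Contact</a>
    <a>No target</a>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
page_title = soup.title.text

# href=True keeps only the anchors that actually carry an href attribute.
links = [anchor["href"] for anchor in soup.find_all("a", href=True)]

print(page_title)
print(links)
```

Relative links such as `/contact` would still need to be joined with the page URL (for example with `urllib.parse.urljoin`) before they can be requested.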


### Lxml

Another Python library we are going to discuss for web scraping is lxml. It is a high-performance HTML and XML parsing library. It is comparatively fast and straightforward. You can read more about it at https://lxml.de/.

##### Installing lxml

`pip install lxml`

In [23]:
!pip install lxml



Now we need to provide the URL of the web page to scrape −

In [24]:
url = 'https://authoraditiagarwal.com/leadershipmanagement/'

Now we need to provide the path (Xpath) to particular element of that web page −

In [28]:
import requests
from lxml import html 
path = '//*[@id="panel-836-0-0-1"]/div/div/p[1]'
response = requests.get(url)
byte_string = response.content
source_code = html.fromstring(byte_string)
tree = source_code.xpath(path)
print(tree[0].text_content()) 

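IndexError: list index out of range

The `IndexError` shows that the XPath expression matched no elements on the page as fetched, so `tree` is an empty list and `tree[0]` fails. XPath expressions tied to generated ids such as `panel-836-0-0-1` break when the page layout changes, so check that the result list is non-empty before indexing it.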

## Data Processing

To process the data that has been scraped, we must store it on our local machine in a particular format such as a spreadsheet (CSV) or JSON, or sometimes in a database such as MySQL.

### 1. CSV and JSON Data Processing
First, we are going to write the information grabbed from a web page into a CSV file. Let us start with a simple example in which we first grab the information using the `BeautifulSoup` module, as we did earlier, and then use the Python csv module to write that text into a CSV file.

First, we need to import the necessary Python libraries as follows −

In [29]:
import requests
from bs4 import BeautifulSoup
import csv

In the following line of code, we use requests to make an HTTP GET request for the URL https://authoraditiagarwal.com/.

`r = requests.get('https://authoraditiagarwal.com/')`

Now, we need to create a Soup object as follows −

`soup = BeautifulSoup(r.text, 'lxml')`

Now, with the help of the next lines of code, we write the grabbed data into a CSV file named dataprocessing.csv.
<code>
f = csv.writer(open('dataprocessing.csv', 'w', newline=''))
f.writerow(['Title'])
f.writerow([soup.title.text])
</code>

Now let's put it all together and run the script.

In [32]:
import requests
from bs4 import BeautifulSoup
import csv
r = requests.get('https://authoraditiagarwal.com/')
soup = BeautifulSoup(r.text,'lxml')
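# 'w' creates the file if it does not exist yet; newline='' is what the csv module expects
f = csv.writer(open('dataprocessing.csv', 'w', newline=''))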
f.writerow(['Title'])
f.writerow([soup.title.text])

25

After running this script, the title of the web page is saved in the above-mentioned CSV file on your local machine.

Let's read the CSV file.

In [33]:
import pandas as pd
data = pd.read_csv('dataprocessing.csv')
data

Unnamed: 0,Title
0,Learn and grow together
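
Scraped data usually has more than one field per record. As a minimal sketch of that case, the example below writes a list of hypothetical records with `csv.DictWriter` and reads them back with `csv.DictReader`; the second record, the field names and the temporary file location are all assumptions made for illustration.

```python
import csv
import os
import tempfile

# Hypothetical scraped records; in practice these would come from parsed pages.
rows = [
    {"title": "Learn and grow together", "url": "https://authoraditiagarwal.com/"},
    {"title": "Example Domain", "url": "https://example.com/"},
]

# Write into a temporary directory so the example leaves the working directory untouched.
csv_path = os.path.join(tempfile.mkdtemp(), "pages.csv")

# The with-block closes the file, which guarantees the rows are flushed to disk.
with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back; each row becomes a dictionary keyed by the header names.
with open(csv_path, newline="", encoding="utf-8") as csv_file:
    loaded_rows = [dict(row) for row in csv.DictReader(csv_file)]

print(loaded_rows)
```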


Similarly, we can save the collected information in a JSON file. The following easy-to-understand Python script grabs the same information as the last script, but this time saves it in JSONFile.json by using the Python json module.

In [35]:
import requests
from bs4 import BeautifulSoup
import json
r = requests.get('https://authoraditiagarwal.com/')
soup = BeautifulSoup(r.text, 'lxml')
# json.dump serializes the title directly; calling json.dumps first would encode it twice
with open('JSONFile.json', 'wt') as outfile:
   json.dump(soup.title.text, outfile)

Let's read the JSON file.


In [37]:
import json

# Open the JSON file and load its contents (here, a single string)
with open('JSONFile.json') as f:
    data = json.load(f)
data

'Learn and grow together'

After running these scripts, the grabbed information, i.e. the title of the web page, is saved in the above-mentioned JSON file on your local machine.
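
JSON is most useful when each record keeps its field names. The following sketch stores a small dictionary and loads it back; the `link_count` value and the temporary file location are hypothetical and only serve the illustration.

```python
import json
import os
import tempfile

# Hypothetical record; link_count is a made-up value for illustration.
record = {
    "title": "Learn and grow together",
    "url": "https://authoraditiagarwal.com/",
    "link_count": 12,
}

json_path = os.path.join(tempfile.mkdtemp(), "page.json")

# json.dump writes the dictionary as a JSON object; indent makes the file readable.
with open(json_path, "w", encoding="utf-8") as outfile:
    json.dump(record, outfile, indent=2)

# json.load turns the JSON object back into a Python dictionary.
with open(json_path, encoding="utf-8") as infile:
    loaded_record = json.load(infile)

print(loaded_record)
```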

## Data Processing using AWS S3

To understand this, see https://www.tutorialspoint.com/python_web_scraping/python_web_scraping_data_processing.htm


## Data processing using MySQL

With the help of the following steps, we can scrape data and store it in a MySQL table −
- Step 1 − First, using MySQL, we need to create a database and a table in which to save our scraped data. For example, we create the table with the following query −
<code>
CREATE TABLE Scrap_pages (id BIGINT(7) NOT NULL AUTO_INCREMENT,
title VARCHAR(200), content VARCHAR(10000),PRIMARY KEY(id));
</code>

- Step 2 − Next, we need to deal with Unicode. Depending on its version and configuration, MySQL may not use a full Unicode character set by default. We can turn this on with the following commands, which change the default character set for the database, for the table and for both of the columns −

<code>
ALTER DATABASE scrap CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CONVERT TO CHARACTER SET utf8mb4 COLLATE
utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CHANGE title title VARCHAR(200) CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CHANGE content content VARCHAR(10000) CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
</code>

- Step 3 − Now, integrate MySQL with Python. For this, we need PyMySQL, which can be installed with the following command −

`pip install PyMySQL`

- Step 4 − Now our database named Scrap, created earlier, is ready to store the data scraped from the web in the table named Scrap_pages. In our example we are going to scrape data from Wikipedia and save it into our database.

First, we need to import the required Python modules.

<code>
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import pymysql
import re
</code>

Now, make a connection to the database from Python.

<code>
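# Adjust host, user and password to match your MySQL installation.
conn = pymysql.connect(host='127.0.0.1', user='root', passwd=None, db='mysql',
charset='utf8mb4')
cur = conn.cursor()
cur.execute("USE scrap")
random.seed()  # seed from the system time or the operating system's randomness source
def store(title, content):
   # PyMySQL quotes and escapes the bound values itself, so the placeholders stay unquoted
   cur.execute('INSERT INTO scrap_pages (title, content) VALUES (%s, %s)', (title, content))
   cur.connection.commit()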
</code>

Now, connect with Wikipedia and get data from it.

<code>
def getLinks(articleUrl):
   html = urlopen('http://en.wikipedia.org'+articleUrl)
   bs = BeautifulSoup(html, 'html.parser')
   title = bs.find('h1').get_text()
   content = bs.find('div', {'id':'mw-content-text'}).find('p').get_text()
   store(title, content)
   return bs.find('div', {'id':'bodyContent'}).findAll('a',href=re.compile('^(/wiki/)((?!:).)*$'))
links = getLinks('/wiki/Kevin_Bacon')
try:
   while len(links) > 0:
      newArticle = links[random.randint(0, len(links)-1)].attrs['href']
      print(newArticle)
      links = getLinks(newArticle)
</code>

Lastly, we need to close both cursor and connection.

<code>
finally:
   cur.close()
   conn.close()
</code>
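
The key idea in a `store` function is that the values are passed separately from the SQL text, so the database driver handles quoting and escaping. To show this without a running MySQL server, the sketch below swaps in Python's built-in `sqlite3` module as a stand-in. This is an assumption made for illustration only: the table definition is simplified, and sqlite3 uses `?` placeholders where PyMySQL uses `%s`.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the MySQL database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE scrap_pages ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT, content TEXT)"
)


def store(title, content):
    """Insert one scraped page; the driver escapes the bound values."""
    cur.execute(
        "INSERT INTO scrap_pages (title, content) VALUES (?, ?)",
        (title, content),
    )
    conn.commit()


# Quotes inside the values need no manual escaping.
store('He said "hello"', "First paragraph; it's stored safely.")

stored_rows = cur.execute("SELECT title, content FROM scrap_pages").fetchall()
print(stored_rows)

cur.close()
conn.close()
```

Building the statement with string formatting instead would both break on quotes like these and open the door to SQL injection from scraped text.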



## Processing Images and Videos

The web media content that we obtain during scraping can be images, audio and video files, in the form of non-web pages as well as data files. But can we trust the downloaded data, especially the extension of the data we are going to download and store on our computer? This makes it essential to know the type of data we are going to store locally.

### Getting Media Content from Web Page

In this section, we are going to learn how to download media content so that it correctly represents the media type, based on the information from the web server. We can do this with the Python requests module, as we did in the previous chapter.

First, we need to import necessary Python modules as follows −

In [39]:
import requests

Now, provide the URL of the media content we want to download and store locally.

In [59]:
url = "https://cdn.shortpixel.ai/client/to_avif,q_lossy,ret_img,w_300/https://authoraditiagarwal.com/wp-content/uploads/2020/06/evonne-yuwen-teoh-KYmH8ZqKjJ4-unsplash-copy-300x200.jpg"

Use the following code to create HTTP response object.

In [60]:
r = requests.get(url) 

With the help of the following lines of code, we can save the received content to a file named ThinkBig.png.

In [61]:
with open("ThinkBig.png",'wb') as f:
   f.write(r.content) 

Let's check the image.

<img src = 'ThinkBig.png'>

After running the above Python script, we get a file named ThinkBig.png containing the downloaded image.

## Extracting Filename from URL

After downloading content from a website, we also want to save it in a file with the file name found in the URL. But the URL may also contain additional fragments, so we need to find the actual file name within it.

With the help of the following Python script, using urlparse, we can extract the path that contains the file name from the URL −



In [51]:
import urllib3
from urllib.parse import urlparse
import os
url = "https://authoraditiagarwal.com/wpcontent/uploads/2018/05/MetaSlider_ThinkBig-1080x180.jpg"
a = urlparse(url)
a.path

'/wpcontent/uploads/2018/05/MetaSlider_ThinkBig-1080x180.jpg'
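
The path still contains the directory part. One way to isolate the file name is to pass the parsed path to `os.path.basename`, as sketched below; the query string and fragment appended to the URL here are hypothetical additions that show why parsing the URL first is useful.

```python
import os
from urllib.parse import urlparse

# Same URL as above, with a hypothetical query string and fragment appended.
url = (
    "https://authoraditiagarwal.com/wpcontent/uploads/2018/05/"
    "MetaSlider_ThinkBig-1080x180.jpg?ver=2#top"
)

# urlparse separates the path from the query and fragment, so they cannot leak into the name.
parsed = urlparse(url)
filename = os.path.basename(parsed.path)
stem, extension = os.path.splitext(filename)

print(filename)
print(stem, extension)
```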

### Information about Type of Content from URL

While fetching content from a web server with a GET request, we can also inspect the information the server provides about it. With the help of the following Python script, we can determine what the web server reports as the type of the content −

First, we need to import necessary Python modules as follows −

`import requests`

Now, we need to provide the URL of the media content we want to download and store locally.

`url = "https://authoraditiagarwal.com/wpcontent/uploads/2018/05/MetaSlider_ThinkBig-1080x180.jpg"`

Following line of code will create HTTP response object.

`r = requests.get(url, allow_redirects=True)`

Now, we can list the kinds of information about the content that the web server provides.

`for headers in r.headers: print(headers)`

Let's put it all together.


In [54]:
import requests
url = "https://cdn.shortpixel.ai/client/to_avif,q_lossy,ret_img,w_300/https://authoraditiagarwal.com/wp-content/uploads/2020/06/evonne-yuwen-teoh-KYmH8ZqKjJ4-unsplash-copy-300x200.jpg"
r = requests.get(url, allow_redirects=True)
for headers in r.headers:
    print(headers)

Date
Content-Type
Content-Length
Connection
Server
CDN-PullZone
CDN-Uid
CDN-RequestCountryCode
Access-Control-Allow-Origin
Access-Control-Allow-Headers
Access-Control-Expose-Headers
CDN-EdgeStorageId
Link
X-Tag
Pragma
Expires
Cache-Control
Last-Modified
CDN-CachedAt
CDN-RequestPullSuccess
CDN-RequestPullCode
CDN-RequestId
CDN-Cache


With the help of the following line of code, we can get a particular piece of information about the content, say Content-Type −

In [55]:
print (r.headers.get('content-type'))

image/jpeg
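
Since the server reports the media type, it can guide the choice of file extension instead of relying on the URL alone. The sketch below uses the standard `mimetypes` module for that mapping; the helper name `extension_for` and the `.bin` fallback are hypothetical choices, and the exact extension returned for a type can vary between Python versions.

```python
import mimetypes


def extension_for(content_type, default=".bin"):
    """Map a Content-Type header value to a file extension."""
    # Drop parameters such as "; charset=utf-8" before the lookup.
    mime_type = content_type.split(";")[0].strip().lower()
    return mimetypes.guess_extension(mime_type) or default


jpeg_extension = extension_for("image/jpeg")
png_extension = extension_for("image/png; charset=binary")
# A made-up media type falls back to the default extension.
unknown_extension = extension_for("application/x-not-a-real-type")

print(jpeg_extension, png_extension, unknown_extension)
```

With a live response, the argument would be `r.headers.get('content-type')`.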


With the help of the following line of code, we can get a particular piece of information about the content, say ETag −

In [62]:
print (r.headers.get('ETag'))

None


In [63]:
print (r.headers.get('content-length'))

11462


With the help of the following line of code, we can get a particular piece of information about the response, say Server −

In [64]:
print (r.headers.get('Server'))

BunnyCDN-MU1-675


## Generating Thumbnail for Images
A thumbnail is a very small representation of an image. A user may want to save only the thumbnail of a large image, or save both the image and its thumbnail. In this section we are going to create a thumbnail of the image named ThinkBig.png downloaded in the previous section “Getting media content from web page”.

For this Python script, we need to install the Python library named Pillow, a fork of the Python Imaging Library that has useful functions for manipulating images. It can be installed with the following command −

`pip install pillow`

In [65]:
!pip install pillow



The following Python script creates a thumbnail of the image and saves it to the current directory, prefixing the thumbnail file name with Th_.

In [66]:
import glob
from PIL import Image
for infile in glob.glob("ThinkBig.png"):
   img = Image.open(infile)
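   # Image.LANCZOS is the current name of the filter formerly exposed as Image.ANTIALIAS
   img.thumbnail((128, 128), Image.LANCZOS)
   # Skip files that are already thumbnails
   if not infile.startswith("Th_"):
      img.save("Th_" + infile, "png")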

Let's look at Th_ThinkBig.png.


<img src = 'Th_ThinkBig.png'>

### Screenshot from Website

In web scraping, a very common task is to take a screenshot of a website. To do this, we are going to use selenium and webdriver. The following Python script takes a screenshot of a website and saves it to the current directory.


In [74]:
from selenium import webdriver
path = r'C:\Users\nijat\Desktop\Data Science\Preparation For Interview\Technicl Skil\web Scraping\Chromedriver'
browser = webdriver.Chrome(executable_path = path)
browser.get('https://www.linkedin.com/in/nijatullah-mansoor-276976199/')
screenshot = browser.save_screenshot('screenshot.png')
browser.quit()

Now let's look at the screenshot.

<img src= 'screenshot.png'>

### Thumbnail Generation for Video

Suppose we have downloaded videos from a website and want to generate thumbnails for them, so that a specific video can be selected based on its thumbnail. To generate thumbnails for videos we need a simple tool called ffmpeg, which can be downloaded from www.ffmpeg.org. After downloading, we need to install it according to the specifications of our OS.

The following Python script generates a thumbnail of the video and saves it to our local directory.

<code>
import subprocess
video_MP4_file = r"C:\Users\gaurav\desktop\solar.mp4"
thumbnail_image_file = 'thumbnail_solar_video.jpg'
# Grab a single frame at the 20-second mark; -y overwrites an existing output file
subprocess.call(['ffmpeg', '-y', '-i', video_MP4_file, '-ss', '00:00:20.000',
   '-vframes', '1', thumbnail_image_file])
</code>

### Ripping an MP4 video to an MP3

Suppose you have downloaded a video file from a website, but you only need the audio from that file. This can be done in Python with the help of a library called moviepy, which can be installed with the following command −

`pip install moviepy`

Now, after successfully installing moviepy, we can convert an MP4 to MP3 with the following script.

<code>
import moviepy.editor as mp
clip = mp.VideoFileClip(r"C:\Users\gaurav\Desktop\1234.mp4")
clip.audio.write_audiofile("movie_audio.mp3")
</code>

## Dealing with Text

You can perform text analysis by using the Python library called the Natural Language Toolkit (NLTK). Before proceeding to the concepts of NLTK, let us understand the relation between text analysis and web scraping.

Analyzing the words in a text can tell us which words are important, which words are unusual, and how words are grouped. This analysis eases the task of web scraping.

#### Getting started with NLTK
The Natural Language Toolkit (NLTK) is a collection of Python libraries designed especially for identifying and tagging parts of speech found in natural language text such as English.

#### Installing NLTK
You can use the following command to install NLTK in Python −

`pip install nltk`


In [76]:
!pip install nltk



##### Downloading NLTK’s Data
After installing NLTK, we have to download its preset text repositories. But before downloading them, we need to import NLTK as follows −

`import nltk`

Now, with the help of following command NLTK data can be downloaded −

`nltk.download()`

Downloading all available NLTK packages will take some time, but it is always recommended to install all the packages.

In [None]:
import nltk
nltk.download()

#### Installing Other Necessary packages

We also need some other Python packages, such as gensim and pattern, for doing text analysis and for building natural language processing applications with NLTK.

`gensim` − A robust semantic modeling library which is useful for many applications. It can be installed by the following command −

`pip install gensim`

`pattern` − Used to make gensim package work properly. It can be installed by the following command −

`pip install pattern`

#### Tokenization

The process of breaking the given text into smaller units called tokens is called tokenization. These tokens can be words, numbers or punctuation marks. It is also called word segmentation.

<img src='https://www.tutorialspoint.com/python_web_scraping/images/tokenization.jpg'>

NLTK module provides different packages for tokenization. We can use these packages as per our requirement. Some of the packages are described here −

`sent_tokenize` − This function divides the input text into sentences. You can use the following command to import it −

`from nltk.tokenize import sent_tokenize`

`word_tokenize` − This function divides the input text into words. You can use the following command to import it −

`from nltk.tokenize import word_tokenize`

`WordPunctTokenizer` − This class divides the input text into words and treats punctuation marks as separate tokens. You can use the following command to import it −

`from nltk.tokenize import WordPunctTokenizer`
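As a minimal sketch of tokenization, the snippet below uses `WordPunctTokenizer`, which is regex-based and needs no downloaded NLTK data. The sample sentence is an assumption made for illustration. `sent_tokenize` and `word_tokenize` are called in a similar way, but they rely on tokenizer data fetched with `nltk.download()`.

```python
from nltk.tokenize import WordPunctTokenizer

# Sample text chosen for illustration only
text = "Web scraping is fun. Isn't it?"

# Splits the text into runs of word characters and runs of punctuation
tokens = WordPunctTokenizer().tokenize(text)
print(tokens)
```

Note that the apostrophe in "Isn't" becomes its own token, which is the behaviour that distinguishes this tokenizer from `word_tokenize`.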


### Stemming
In any language, there are different forms of a word. A language includes lots of variations due to grammatical reasons. For example, consider the words `democracy`, `democratic`, and `democratization`. For machine learning as well as for web scraping projects, it is important for machines to understand that these different words have the same base form. Hence, it can be useful to extract the base forms of words while analyzing text.

This can be achieved by stemming, which may be defined as the heuristic process of extracting the base forms of words by chopping off their ends.

NLTK module provides different packages for stemming. We can use these packages as per our requirement. Some of these packages are described here −

`PorterStemmer package` − Porter’s algorithm is used by this Python stemming package to extract the base form. You can use the following command to import this package −

`from nltk.stem.porter import PorterStemmer`

For example, given the word ‘writing’ as input, this stemmer outputs the word ‘write’.

`LancasterStemmer package` − Lancaster’s algorithm is used by this Python stemming package to extract the base form. You can use the following command to import this package −

`from nltk.stem.lancaster import LancasterStemmer`

For example, given the word ‘writing’ as input, this stemmer outputs the word ‘writ’.

`SnowballStemmer package` − Snowball’s algorithm is used by this Python stemming package to extract the base form. You can use the following command to import this package −

`from nltk.stem.snowball import SnowballStemmer`

For example, given the word ‘writing’ as input, this stemmer outputs the word ‘write’.
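The three stemmers can be compared side by side with a short sketch like the one below. The word list reuses the examples from this section, and the variable names are illustrative assumptions. None of these stemmers needs downloaded NLTK data for English.

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

# One instance of each stemmer; Snowball needs the language name
stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),
}

words = ["writing", "democratic", "democratization"]

# Apply every stemmer to every word so the results can be compared
stems = {
    name: [stemmer.stem(word) for word in words]
    for name, stemmer in stemmers.items()
}

for name, result in stems.items():
    print(name, result)
```

Running this makes the difference in aggressiveness visible: the stemmers do not always agree on the base form of the same word.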

#### Lemmatization
Another way to extract the base form of words is lemmatization, which normally aims to remove inflectional endings by using a vocabulary and morphological analysis. The base form of a word after lemmatization is called its lemma.

NLTK module provides following packages for lemmatization −

`WordNetLemmatizer package` − It extracts the base form of a word depending upon whether it is used as a noun or as a verb. You can use the following command to import this package −

`from nltk.stem import WordNetLemmatizer`
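The following is a small sketch of how the lemmatizer might be called. It assumes the WordNet data has already been fetched with `nltk.download()`. The helper name `lemmatize` and the sample words are illustrative, and the helper returns `None` when the data is missing instead of raising an error.

```python
from nltk.stem import WordNetLemmatizer

def lemmatize(word, pos="n"):
    """Return the lemma of word, or None if the WordNet data is not installed."""
    try:
        # pos is "n" for noun or "v" for verb
        return WordNetLemmatizer().lemmatize(word, pos=pos)
    except LookupError:
        # NLTK raises LookupError when a required corpus has not been downloaded
        return None

print(lemmatize("books"))             # treated as a noun
print(lemmatize("running", pos="v"))  # treated as a verb
```

Passing the part of speech matters: the same word can have different lemmas as a noun and as a verb.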

### Chunking
Chunking, which means dividing the data into small chunks, is one of the important processes in natural language processing. It is used to identify parts of speech and short phrases such as noun phrases. Chunking labels groups of tokens, and with its help we can get the structure of a sentence.

### Example
In this example, we are going to implement noun-phrase (NP) chunking by using the NLTK Python module. NP chunking is a category of chunking that finds the noun-phrase chunks in a sentence.

Steps for implementing noun phrase chunking
We need to follow the steps given below for implementing noun-phrase chunking −

Step 1 − Chunk grammar definition

In the first step we will define the grammar for chunking. It would consist of the rules which we need to follow.

Step 2 − Chunk parser creation

Now, we will create a chunk parser. It would parse the grammar and give the output.

Step 3 − The Output

In this last step, the output would be produced in a tree format.

First, we need to import the NLTK package as follows −

In [78]:
import nltk

Next, we need to define the sentence. Here DT is the determiner, VBP the verb, JJ the adjective, IN the preposition and NN the noun.

In [79]:
sentence = [("a", "DT"),("clever","JJ"),("fox","NN"),("was","VBP"),("jumping","VBP"),("over","IN"),("the","DT"),("wall","NN")]

Next, we give the grammar in the form of a regular expression.

In [5]:
grammar = "NP:{<DT>?<JJ>*<NN>}"
grammar

'NP:{<DT>?<JJ>*<NN>}'

The next line of code defines a parser for this grammar.

In [6]:
parser_chunking = nltk.RegexpParser(grammar)

In [82]:
parser_chunking

<chunk.RegexpParser with 1 stages>

Now, the parser will parse the sentence.

In [8]:
parser_chunking.parse(sentence)


Next, we store the parsed result in the variable `output`.

In [3]:
output = parser_chunking.parse(sentence)

With the help of following code, we can draw our output in the form of a tree as shown below.

In [2]:
output.draw()


<img src='https://www.tutorialspoint.com/python_web_scraping/images/phrase_chunking.jpg'>
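Because `draw()` opens a graphical window, it can be handy to read the chunks directly from the tree instead. The sketch below repeats the chunking steps in one runnable piece and collects the noun phrases as plain strings. The variable name `noun_phrases` is an assumption for illustration.

```python
import nltk

# Tagged sentence and grammar from the example above
sentence = [("a", "DT"), ("clever", "JJ"), ("fox", "NN"), ("was", "VBP"),
            ("jumping", "VBP"), ("over", "IN"), ("the", "DT"), ("wall", "NN")]
grammar = "NP: {<DT>?<JJ>*<NN>}"

parser_chunking = nltk.RegexpParser(grammar)
output = parser_chunking.parse(sentence)

# Keep only the subtrees labelled NP and join their words back together
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in output.subtrees(filter=lambda tree: tree.label() == "NP")
]
print(noun_phrases)  # ['a clever fox', 'the wall']
```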

#### Bag of Words (BoW) Model: Extracting and Converting Text into Numeric Form

Bag of Words (BoW), a useful model in natural language processing, is used to extract features from text. Once extracted, these features can be used to train machine learning algorithms, which cannot work on raw text directly.

Working of BoW Model
First, the model extracts a vocabulary from all the words in the documents. Then it builds a document-term matrix, in which each row is a document and each column counts how often one vocabulary word occurs in it. In this way, the BoW model represents a document as a bag of words only, and the order or structure of the words is discarded.

Example

Suppose we have the following two sentences −

Sentence1 − This is an example of Bag of Words model.

Sentence2 − We can extract features by using Bag of Words model.

Now, by considering these two sentences, we have the following 14 distinct words −

- This
- is
- an
- example
- bag
- of
- words
- model
- we
- can
- extract
- features
- by
- using
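The vocabulary and the resulting document-term matrix can be reproduced with plain Python. This is only a sketch of the idea under simple assumptions (lowercasing, splitting on letters, alphabetical vocabulary order), not the implementation used by any particular library. The names `tokenize`, `vocabulary` and `matrix` are illustrative.

```python
import re
from collections import Counter

sentences = [
    "This is an example of Bag of Words model.",
    "We can extract features by using Bag of Words model.",
]

def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

# Step 1: the vocabulary is the sorted set of all distinct words
vocabulary = sorted({word for s in sentences for word in tokenize(s)})

# Step 2: one row per sentence, one column per vocabulary word
matrix = []
for s in sentences:
    counts = Counter(tokenize(s))
    matrix.append([counts[word] for word in vocabulary])

print(len(vocabulary))  # 14
print(vocabulary)
for row in matrix:
    print(row)
```

The word "of" occurs twice in the first sentence, so its column holds 2 in the first row. Word order plays no role in the result.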

#### Building a Bag of Words Model with scikit-learn
Let us look into the following Python script, which builds a BoW model by using the `CountVectorizer` class from scikit-learn.

First, import the following package −



In [10]:
from sklearn.feature_extraction.text import CountVectorizer 

Next, define the set of sentences −

In [11]:
Sentences=['This is an example of Bag of Words model.', ' We can extract features by using Bag of Words model.']
vector_count = CountVectorizer()
features_text = vector_count.fit_transform(Sentences).todense()
print(vector_count.vocabulary_)

{'this': 10, 'is': 7, 'an': 0, 'example': 4, 'of': 9, 'bag': 1, 'words': 13, 'model': 8, 'we': 12, 'can': 3, 'extract': 5, 'features': 6, 'by': 2, 'using': 11}


### Python Web Scraping - Dynamic Websites

Web scraping is a complex task, and the complexity multiplies if the website is dynamic. According to the United Nations Global Audit of Web Accessibility, more than 70% of websites are dynamic in nature and rely on JavaScript for their functionality.

#### Dynamic Website Example

Let us look at an example of a dynamic website and see why it is difficult to scrape. Here we take the example of the search page of a website named http://example.webscraping.com/places/default/search. But how can we tell that this website is dynamic? It can be judged from the output of the following Python script, which tries to scrape data from the above-mentioned webpage −

In [None]:
import re
import urllib.request
response = urllib.request.urlopen('http://example.webscraping.com/places/default/search')
html = response.read()
text = html.decode()
re.findall('(.*?)',text)
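To illustrate the problem without a network request, the sketch below uses two hypothetical HTML strings. One is the page as a server might send it before any JavaScript runs, and the other is the same page after a browser has filled in the results. The markup, the `results` id and the helper `extract_result_links` are assumptions made for illustration and are not taken from the site above.

```python
import re

# Hypothetical HTML as delivered by the server: the results container is empty
static_html = """
<html><body>
<div id="results"></div>
<script src="/static/search.js"></script>
</body></html>
"""

# Hypothetical HTML after a browser has executed the script
rendered_html = static_html.replace(
    '<div id="results"></div>',
    '<div id="results"><a href="/view/1">Afghanistan</a>'
    '<a href="/view/2">Albania</a></div>',
)

def extract_result_links(html):
    """Return the link texts found inside the results container."""
    match = re.search(r'<div id="results">(.*?)</div>', html, re.DOTALL)
    if match is None:
        return []
    return re.findall(r"<a[^>]*>(.*?)</a>", match.group(1))

print(extract_result_links(static_html))    # []
print(extract_result_links(rendered_html))  # ['Afghanistan', 'Albania']
```

A scraper that only downloads the raw HTML sees the first string, so it finds nothing even though a visitor's browser shows results.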

#### Approaches for Scraping data from Dynamic Websites
We have seen that the scraper cannot scrape the information from a dynamic website because the data is loaded dynamically with JavaScript. In such cases, we can use the following two techniques for scraping data from JavaScript-dependent websites −

- Reverse Engineering JavaScript
- Rendering JavaScript

If you want to know more, please visit the following link.
https://www.tutorialspoint.com/python_web_scraping/python_web_scraping_dynamic_websites.htm


## Python Web Scraping - Form based Websites

These days the WWW (World Wide Web) is moving towards social media and user-generated content. So the question arises: how can we access information that lies behind a login screen? For this we need to deal with forms and logins.

In previous chapters, we worked with the HTTP GET method to request information. In this chapter, we will work with the HTTP POST method, which pushes information to a web server for storage and analysis.
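As a rough sketch of what a form submission looks like, the snippet below builds a POST request with the standard library but does not send it. The URL and the field names `username` and `password` are hypothetical. A real form defines its own field names in its HTML.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form fields; real names come from the form's <input> elements
form_fields = {"username": "alice", "password": "secret"}

# Form data is sent URL-encoded in the request body, not in the URL
body = urlencode(form_fields).encode("utf-8")

# Hypothetical login URL; the request is only constructed here, not sent
request = Request("http://example.com/login", data=body, method="POST")

print(request.get_method())  # POST
print(request.data)          # b'username=alice&password=secret'
```

Passing the request to `urllib.request.urlopen` would send it. With the Requests library, the equivalent call is `requests.post(url, data=form_fields)`.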

If you want to know more about this topic, please refer to the link below.

https://www.tutorialspoint.com/python_web_scraping/python_web_scraping_form_based_websites.htm



## Python Web Scraping - Processing CAPTCHA

#### What is CAPTCHA?
CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart, which makes clear that it is a test to determine whether the user is human or not.

A CAPTCHA is a distorted image that is usually hard for a computer program to read but that a human can still manage to understand. Most websites use CAPTCHAs to prevent bots from interacting with them.

If you want to know more about it, please refer to the following link.
https://www.tutorialspoint.com/python_web_scraping/python_web_scraping_processing_captcha.htm

### Python Web Scraping - Testing with Scrapers
In large web projects, automated testing of the website’s backend is performed regularly, but frontend testing is often skipped. The main reason is that a website is a mesh of various markup and programming languages. We can write unit tests for one language, but it becomes challenging when the interaction happens in another language. That is why we must have a suite of tests to make sure that our code performs as expected.
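A small unit test for a scraper helper might look like the sketch below. It uses Python's built-in `unittest` module and inline HTML, so it needs no network access. The function `extract_title` and the test class are hypothetical names introduced for illustration.

```python
import re
import unittest

def extract_title(html):
    """Return the text of the <title> tag, or None if there is none."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None

class ExtractTitleTest(unittest.TestCase):
    def test_title_found(self):
        html = "<html><head><title> Home </title></head></html>"
        self.assertEqual(extract_title(html), "Home")

    def test_title_missing(self):
        self.assertIsNone(extract_title("<html><body></body></html>"))

# Load and run the tests explicitly so this also works inside a notebook
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExtractTitleTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```

Testing the parsing logic against fixed HTML keeps the tests fast and repeatable, and a failure then points at the scraper code instead of a change on the live site.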

If you want to know more about this, please refer to the following link.

https://www.tutorialspoint.com/python_web_scraping/python_web_scraping_testing_with_scrapers.htm

### Useful Books on Python Web Scraping
<img src = 'https://images-na.ssl-images-amazon.com/images/I/51u9mDi83gL._SX404_BO1,204,203,200_.jpg'>

<img src = 'https://images-na.ssl-images-amazon.com/images/I/51KgwVgNVOL._SX379_BO1,204,203,200_.jpg'>

<img src='https://images-na.ssl-images-amazon.com/images/I/61X0QVbrUvL._SX404_BO1,204,203,200_.jpg'>
