# Web Scraping

#### Problem Statement

After completing the Aspect-Based Sentiment Analysis project, we identified the aspects of our products that cause customers the most problems.

Next we need to perform `"Competitor Analysis"`, that is, a `Competitor Price & Product Comparison`. For that we need to scrape reviews from `3rd party` websites.

### Technical Requirements


#### 1. Where to Extract

We held meetings with departments such as marketing and sales to:

- Identify their data needs.

- Work out where that data can be found and which sites have suitable data to extract.

#### 2. What to Extract

We took screenshots of the target web pages and marked the fields that need to be extracted.

#### 3. Extraction Scale

We determined the `No. of websites` to extract from and the `Frequency` of the extraction crawls. This helped us estimate the `Infrastructural Resources` needed (servers, data storage, etc.).

### Identify the structure of the HTML site

Using Chrome Developer Tools, inspect the structure of the site's HTML and identify the elements you need by their classes or IDs. Also check:

- Is it a static page or a dynamic page?

- Is the data spread across multiple pages of the website?

- Does the site offer RSS feeds?

- Does it contain a CAPTCHA or a login page?

### Identify the robots.txt file

It describes what a crawler should or shouldn’t crawl according to the Robots Exclusion Standard.

This file is usually available at the root of a website (for example, www.example.com/robots.txt).
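As a minimal sketch of how such a file can be checked from Python, the standard library's `urllib.robotparser` can be used. The robots.txt content, the bot name `my-bot` and the URLs below are made-up placeholders, and the rules are inlined so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a (hypothetical) bot may fetch each URL.
allowed = parser.can_fetch("my-bot", "https://www.example.com/reviews/1")
blocked = parser.can_fetch("my-bot", "https://www.example.com/private/data")
delay = parser.crawl_delay("my-bot")

print(allowed, blocked, delay)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` would download the file instead of parsing an inlined string.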

### Identify which library you are using to scrape

**Scrapy**

- `Full-fledged solution`: lets you write little code to create 'Spiders'.

- Handles `Downloading, Cleaning and Saving` data into a database, whereas BeautifulSoup is a smaller package that only helps you get information out of web pages.

- Lets you `Handle Errors` and supports `Resuming` a scrape from the last page it reached.

- Built-in `Login Form`, `Proxy` and `USER_AGENT` handling abilities.

- Has an `"Auto Throttling"` feature, which automatically adjusts the crawling speed based on the load of both the machine running Scrapy and the website being crawled.

- Can follow robots.txt automatically (controlled by the `ROBOTSTXT_OBEY` setting).


**BeautifulSoup**

- Best for RSS feeds or specific elements of a site.

- Suits a single static page with little JavaScript and no login page.

- Needs to be used along with other libraries such as `"Urllib"` and `"Regular Expressions"`.


**Selenium** 

- Used for heavily JavaScript-rendered pages or very sophisticated websites.

- Selenium WebDriver is a tool that automates web browsers.

# Working

#### Setup Environment

We used the PyCharm IDE and Python 3, and created a virtual environment (which isolates your project from the rest of your environment).
We installed libraries such as:

- BeautifulSoup

- Urllib

- Regular Expressions

- Scrapy

- NLTK

- Spacy

## Static pages, RSS Feeds -  BeautifulSoup

#### Urllib

The urllib package is used for loading and parsing URLs. It contains 5 modules.

1. `request`: Opens URLs.

2. `response`: Used internally.

3. `error`: Defines URLError and HTTPError.

4. `parse`: Has a variety of functions for breaking a URL into meaningful pieces such as scheme, port, host, etc.

5. `robotparser`: Used to inspect robots.txt files.

- Open the website URL using the `urllib.request` module.

- Read the HTML content into a variable and close the connection.

#### BeautifulSoup

You now have the web page, but you still need to extract the data. BeautifulSoup is a library that helps you extract data from the page.

- `Parse` the HTML content with BeautifulSoup using one of several parsers: `html.parser`, `lxml` or `html5lib`.

- Use the `find_all()` method to `filter` for the tags and attributes you want.

#### Regular Expressions

Used to scan the content of a website for strings that match a pattern.

### Multiple pages, Multiple sites - Web Crawlers 

# Scrapy

Using Scrapy we create crawlers called **"Spiders"**. When you pass a URL to a spider, it fetches the page source and searches it for the tags and attributes we specified.

- Scrapy is generally faster than BeautifulSoup when crawling many pages, because it sends requests asynchronously.

- It is a framework, whereas BeautifulSoup is just a library.

### Create Project

Open the terminal in PyCharm, navigate to the folder where you want to create the project, and type this command.

*folder> scrapy startproject ProjectName*

This creates a project folder with the following contents.

- spiders

- settings.py

- items.py

- middlewares.py

- pipelines.py

**Spiders folder::** We write our spiders inside this folder.

**settings.py::** This file lets you `customize` the behavior of all Scrapy components, including extensions, pipelines and spiders.

**items.py::** The main purpose of items.py is to define containers for the data you crawl. Scrapy items behave much like dictionaries. Spiders can also return the extracted data as plain Python dicts, but dicts lack structure, which can lead to errors in larger projects.

The Item class is used to define a `common output data format`. With the keys defined in the item class, it is easy to retrieve and access the data.


**middlewares.py::** When you send a request to a website, middleware lets you add things to that request, for example proxies.

### Create  Spider

We create crawlers (spiders) inside the spiders folder.

Import scrapy and create a class that inherits from `"scrapy.Spider"`. We don't need to write a lot of code because the parent class does most of the work.

Inside this class, write `parse` methods for the elements you want to extract. These parse methods tell Scrapy to go through the HTML source, extract the elements we want and return them as a dictionary.

Inside a spider we use `"yield"` rather than "return". Scrapy treats callbacks as generators, so a single `parse` call can yield many items and follow-up requests, whereas `return` would end the method after the first one.

### Selectors

Scrapy uses "Selectors" to extract the parts of an HTML document specified by either `XPath` or `CSS` expressions. We used the Google Chrome extension __Selector Gadget__ to identify the elements easily.

XPath is a language for selecting nodes in XML documents, and it can also be used on HTML documents. CSS is a language for applying styles to HTML documents, and its selectors can be reused to pick out elements.

- *response.css('title::text').extract()*

- *response.css('span::text').extract()*



We use XPath to get "href" attribute values. They can also be accessed using CSS, but it is easier with XPath.

- response.xpath("//title/text()").extract()

- response.xpath("//span[@class='class_name']/text()").extract()

### Item Containers

We put the extracted data into the item containers.

*Why do we need to put data inside containers? Can't we put it directly into the database?*

Yes, we can, but storing data directly into the database can cause problems such as `data inconsistencies` and errors.

Scrapy spiders can return extracted data as Python dictionaries, but the problem with dictionaries is that they `lack structure`. It is easy to make a typo in a key and return inconsistent data.

To avoid that, it is good practice to move scraped data into temporary containers first and then store them in the database.

These temporary containers are called "ITEMS". We use the items.py file to create them.

We create a class and declare its fields inside items.py, then import that class into the spider. Next, inside the spider, we create an instance of the class.
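The benefit of a structured container can be sketched with a plain dataclass. This assumes Scrapy 2.2 or later, which accepts dataclass objects as items; the classic alternative is a `scrapy.Item` subclass with `scrapy.Field()` attributes. The class name `ReviewItem` and its fields are hypothetical.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ReviewItem:
    """Hypothetical container for one scraped review (would live in items.py)."""
    product_name: str
    price: Optional[float] = None
    rating: Optional[float] = None
    review_text: str = ""


# Inside a spider, the parse method would build and yield one of these per review.
item = ReviewItem(product_name="Example Phone", price=199.0, rating=4.5)
record = asdict(item)  # plain dict, ready for a pipeline or a database insert

# A misspelled field name fails immediately instead of silently creating a new key.
try:
    ReviewItem(product_name="Example Phone", pricee=199.0)
    typo_rejected = False
except TypeError:
    typo_rejected = True

print(record, typo_rejected)
```

With a plain dict, the misspelled `pricee` key would have been stored without any warning, which is exactly the inconsistency the item container is meant to prevent.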

### Pipelines

After scraping the data we need to store it in a database such as MongoDB. This is done in pipelines.py. Data from the item containers passes through the pipelines before it is stored in the database.


Pipelines take the extracted data and perform operations such as:

- Cleaning HTML content.

- Validating scraped data (checking that the items contain certain fields).

- Checking for duplicates and dropping them.

- Storing scraped data in a database.



*Scraped Data → Item containers → Pipelines → MongoDB*
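A rough sketch of the validation and duplicate-dropping steps is shown below. It assumes dict-like items, and the class name and field names are hypothetical. `process_item(self, item, spider)` is the method Scrapy calls on a pipeline; the fallback `DropItem` class exists only so the sketch runs where Scrapy is not installed. The database write is left out.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    class DropItem(Exception):
        """Stand-in so this sketch also runs without Scrapy installed."""


class ValidateAndDedupePipeline:
    """Hypothetical pipeline: drop incomplete items and repeated reviews."""

    REQUIRED_FIELDS = ("product_name", "review_text")  # assumed field names

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        # Validation: every required field must be present and non-empty.
        missing = [f for f in self.REQUIRED_FIELDS if not item.get(f)]
        if missing:
            raise DropItem(f"missing fields: {missing}")

        # Duplicate check: the same product/review pair is kept only once.
        key = (item["product_name"], item["review_text"])
        if key in self.seen:
            raise DropItem("duplicate review")
        self.seen.add(key)

        # Returning the item passes it on to the next pipeline (e.g. storage).
        return item
```

A pipeline like this would be switched on through the `ITEM_PIPELINES` setting in settings.py.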

### MongoDB

MongoDB is a NoSQL database. It stores data in the form of documents (similar to JSON objects). It has built-in replication and sharding, and it can scale horizontally.

MongoDB comes with the Compass GUI and the Mongo shell; using them we can create databases and collections.

A collection is similar to a table, but it holds documents. A document is similar to a row in a table, but not identical: each document can have its own set of columns. In MongoDB columns are called fields, and fields are defined at the document level rather than for the whole collection. Collections are schemaless, whereas tables have a schema.

-----
### Inserting data into MongoDB using PyMongo

Establish a connection.

*conn = MongoClient('localhost', port-number)*

Create a database and insert data into MongoDB using

*insert_one() or insert_many()*

We use the find() method to query a collection and retrieve data.

### Running Spider

Open the terminal in PyCharm, make sure your environment is activated, navigate to the project folder that contains your spiders and type the command.

*folder> scrapy crawl spider_name*

# Challenges

### Auto Throttling

The main idea is the following: if a server needs `latency` seconds to respond, a client should send a request every `latency/N` seconds to have 'N' requests processed in parallel.

The AutoThrottle algorithm adjusts download delays based on the time elapsed between establishing the TCP connection and receiving the HTTP headers.

It is tuned in settings.py through parameters like

- AUTOTHROTTLE_ENABLED

- AUTOTHROTTLE_START_DELAY

- AUTOTHROTTLE_MAX_DELAY

- CONCURRENT_REQUESTS_PER_DOMAIN

- CONCURRENT_REQUESTS_PER_IP
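As an illustration, these parameters could appear in settings.py as follows. The values are illustrative only, not tuned recommendations for any particular site.

```python
# settings.py (fragment), illustrative values only
AUTOTHROTTLE_ENABLED = True          # switch the AutoThrottle extension on
AUTOTHROTTLE_START_DELAY = 5         # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60          # upper bound on the delay when latency is high
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # parallel requests allowed per domain
CONCURRENT_REQUESTS_PER_IP = 0       # 0 disables the per-IP limit, so the per-domain limit applies
```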

### Crawling multiple pages:

Pagination is the sequence of pages in a website. The links to these pages are usually contained in list (li) items or in the href attribute of < a > tags, and there is often a "Next" button that takes you to the next page.

We need to find that Next button in the source code, then use CSS selectors to extract the href attribute of the link that leads to the next page.

Scrapy has a handy method called `"response.follow"` that builds a request for the next page, which the spider then yields. It takes two main parameters:

1. The page (URL or link) you want to follow.

2. Callback (the method that should handle the next page, usually the parse function).

### Challenge - Websites having login forms

Many websites keep the content you want to scrape behind a login page. To access that information we need to know how to log in to the website using Scrapy.

To log in you need login credentials (username and password).

Notice how the URL changes before and after you log in, and inspect what the login request sends. To do that,

Go to login form → Enter Username and Password → Go to inspect tab → Go to Network tab → click on login request → Navigate to form data.

There you will find the CSRF token, username and password. In the Python code, pass the login page as the start URL and use CSS selectors to extract the CSRF token value. Then submit the CSRF token, username and password with Scrapy's `FormRequest.from_response` method. After logging in, call the scraper function as the callback.

### Challenge - Bypass Restrictions using USER_AGENT

When a browser such as Firefox or Chrome visits a website, it identifies itself to the site. That browser identity is called the "USER_AGENT".

Every request made by a browser contains a USER_AGENT header, and a scraper that uses the same USER_AGENT continuously is easy to detect as a bot. One way to make your requests look more like real traffic is to change the USER_AGENT you send.

One method is to use a USER_AGENT that the website already allows. Websites let Google's crawler in so that they appear in search results, so we replace our USER_AGENT with Google's and the website treats the request as coming from Google's bot.

Another method is to use multiple USER_AGENTS in rotation. We use `"scrapy-user-agents"` to do this. It ships with about 2200 different USER_AGENTS.
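The rotation idea itself can be sketched in plain Python, independent of any Scrapy plugin. The User-Agent strings below are placeholders and `header_rotation` is a hypothetical helper name; real values and the actual request call would be supplied by the scraper.

```python
from itertools import cycle

# Placeholder strings; real User-Agent values would go here.
USER_AGENTS = [
    "ExampleBrowser/1.0 (placeholder A)",
    "ExampleBrowser/2.0 (placeholder B)",
    "ExampleBrowser/3.0 (placeholder C)",
]


def header_rotation(user_agents):
    """Yield request-header dicts, cycling through the given User-Agent strings."""
    for agent in cycle(user_agents):
        yield {"User-Agent": agent}


headers = header_rotation(USER_AGENTS)

# Each outgoing request takes the next header set; after the last one it wraps around.
first_four = [next(headers)["User-Agent"] for _ in range(4)]
print(first_four)
```

Each dict produced this way can be passed as the `headers` argument of whatever HTTP client sends the request.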

### Challenge - Bypass restrictions using Proxies

An IP address is the address of your computer, just as your house and office have addresses. Websites recognize your IP address and may block it if you scrape a lot of data from them. The workaround is to send requests from other IP addresses instead of your own. Using a proxy is not in itself illegal.

With proxies we use many different IP addresses and put them in rotation, so each request we send to the website comes from a new IP address.

To rotate proxy IP addresses, install the `"scrapy-proxy-pool"` library, then go to settings.py and change the appropriate proxy settings.

### Challenge - Planning and defining objects

A common trap in web scraping is defining the data you want to collect based entirely on what is available on websites. You look at one website and decide what to collect, then look at another website and add some more fields. In the end you have lots of websites with different pieces of information.

This is an unsustainable approach. Simply adding attributes to your product type every time you see a new piece of information on a website leads to far too many fields to keep track of.

Not only that, every time you scrape a new website you are forced to perform a detailed comparison between its fields and the fields you have accumulated so far.

You also need to change your storage (database) structure every time, which results in a messy, hard-to-read dataset.

**Solution**

One of the best things you can do when deciding what to collect is to ignore the websites altogether. Don't start the project from what websites have, but from what you want.

Make a detailed analysis: what are your project goals, and is this data redundant?

### Challenge - Page redirects to other page

https://stackoverflow.com/questions/12737740/python-requests-and-persistent-sessions
https://requests.kennethreitz.org/en/master/user/advanced/#session-objects

Redirects allow a web server to point one URL to a piece of content at a different location. There are two types of redirects:

- `Server-side redirects`, where the URL changes before the page is loaded.

- `Client-side redirects`, where the page loads before redirecting to the new one.

I was trying to scrape some text but, as soon as the URL loaded, it automatically redirected to the login page, so BeautifulSoup was scraping the login page.

Then I figured out how to submit the login form using the requests module and retrieve the session key. But BeautifulSoup was still scraping just "You are being redirected."

Then I created a persistent session using `requests.Session()`, which persists cookies across all requests made from the session instance. I then saved the session to a cache file (opened the file in 'wb' mode and serialized the session with pickle).

### Problem - Request time out - Not regularly

The process flow is like this:

- Send a GET request to link A to retrieve page A.

- Gather all relevant links in page A and put them in a variable.

- Loop through that variable and gather the relevant information from each of the linked pages.

*Problem:* Requests time out because the process takes a long time. The website I am scraping stops responding because I am sending a lot of requests.

*Solution:* Implement an inner retry loop. Set the number of retries (5) and build a loop that runs your request. If the status code is 300 or above, wait and retry after 2, 3, 4, 5 and 6 seconds, up to the retry limit. If the status code is below 300, use "break" to leave the loop. This ensures you have tried enough times before giving up on server issues such as timeouts or pages that take too long to load.
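One possible shape for such a retry loop is sketched below. The function name `fetch_with_retry` and its parameters are hypothetical; `fetch` stands for any callable that returns an object with a `status_code` attribute (for example `requests.get`), and it is passed in so the loop can be tried without network access.

```python
import time


def fetch_with_retry(fetch, url, retries=5, first_wait=2, sleep=time.sleep):
    """Call fetch(url) until the status code is below 300 or the retries run out."""
    last_error = None
    for attempt in range(retries + 1):        # one initial try plus `retries` retries
        if attempt:
            sleep(first_wait + attempt - 1)   # waits 2, 3, 4, 5, 6 seconds with the defaults
        try:
            response = fetch(url)
        except OSError as exc:                # timeouts and connection errors
            last_error = exc
            continue
        if response.status_code < 300:
            return response                   # success: leave the loop
        last_error = RuntimeError(f"HTTP {response.status_code} for {url}")
    raise last_error                          # every attempt failed
```

With the requests library this could be called as `fetch_with_retry(functools.partial(requests.get, timeout=10), url)`, since the exceptions requests raises for timeouts and connection failures derive from `OSError`.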

### Handling website with reCAPTCHA

https://www.tutorialspoint.com/python_web_scraping/python_web_scraping_processing_captcha.htm


#### Tip
https://www.reddit.com/r/Python/comments/efd7tv/automating_an_insider_trading_dashboard_with/

#### Web Scraping vs Web Crawling

Web scraping extracts specific data from targeted web pages, for instance sales leads, real estate listings and product pricing.

In contrast, web crawling is what search engines do: it scans and indexes a whole website along with its internal links. A “crawler” navigates through web pages without a specific target.

#### API and Web Scraping

An API is like a channel for sending your data request to a web server and getting the desired data back. An API typically returns the data in JSON format over the HTTP protocol. Examples are the Facebook API, Twitter API and Instagram API.

However, that doesn’t mean you can get any data you ask for: an API exposes only what its provider chooses to offer. Web scraping works with whatever the website itself displays, because it lets you interact with the pages directly.

#### Difference between Span and Div tags in HTML

Both are used to group some HTML together and attach information to that chunk, most commonly with class and id.

The difference is that the **span** element is `"inline"` and is usually used for a small chunk of HTML inside a line (such as inside a paragraph), whereas the **div** (division) element is `"block-level"` and is used to group a larger chunk of code.


In settings.py you can set *BOT_NAME, USER_AGENT, concurrent requests*, etc.


*`BOT_NAME`* is our project name.

*`USER_AGENT`* identifies you to the website, that is, who is sending the request. Suppose you visit Google: with every request, the browser sends a USER_AGENT header that tells Google which browser is making the request.

*`ROBOTSTXT_OBEY`*: whether to obey or disobey the robots.txt file.

*`Concurrent Requests`*: Making many requests at once is called "Concurrent Requests". A request basically asks a website for a page, and each request returns the data once. Because we are scraping a lot of data, we send many requests. With the concurrent request settings we can adjust how many requests are sent at the same time.

`Proxy` basically means using a different IP address to get around a website's restrictions on web scraping. Whenever we add a proxy to a scraper, we do it through middleware.

#### Typical Web Connection

When you enter the website name "Google.com" → the browser checks its cache for the IP address → if the browser doesn't have it, it checks with the OS → if the OS doesn't have it either, a DNS request is sent to the router, which looks in its own cache → if the router doesn't have it, the request goes to the DNS servers maintained by the ISP → the ISP responds that "www.google.com" is mapped to "172.217.19.68"

#### Ethernet

Ethernet is a protocol used to communicate with machines in the same network.

#### TCP/IP

TCP (Transmission Control Protocol) facilitates communication between devices in a network. It is used together with the Internet Protocol (IP), so the pair is called TCP/IP.

TCP takes messages from an application/server and divides them into packets, which devices in the network such as switches and routers can forward to the destination.

IP and TCP are two separate protocols. IP is the part that handles addressing, that is, the IP address to which data is sent. TCP is responsible for reliable data delivery once that address is known.

TCP/IP divides the different communication tasks into layers, each with a different function. Data goes through four individual layers before it is received on the other end. There, TCP/IP goes through these layers in reverse order to reassemble the data and present it to the recipient.

## Automate the scrapers with Airflow

https://towardsdatascience.com/schedule-web-scrapers-with-apache-airflow-3c3a99a39974

## Visualizing the running scrapers

https://towardsdatascience.com/pyspider-a-practical-usage-on-competitor-monitoring-metrics-c934d55f9c9a

### Challenges and best practices

https://medium.com/velotio-perspectives/web-scraping-introduction-best-practices-caveats-9cbf4acc8d0f
https://hub.packtpub.com/4-common-challenges-web-scraping-handle/
https://towardsdatascience.com/a-minimalist-end-to-end-scrapy-tutorial-part-i-11e350bcdec0

## Scraping Tools

**BeautifulSoup** is a library for parsing HTML and XML documents. Requests combined with BeautifulSoup is best for small, static web pages with little JavaScript.


**Scrapy** is a web crawling framework that provides a complete toolset for scraping. In Scrapy we create spiders, which are Python classes that define how a particular site (or group of sites) will be scraped.

**Selenium** is used for heavily JavaScript-rendered pages or very sophisticated websites. Selenium WebDriver is a tool that automates web browsers. It is the tool to use when all other doors of web scraping are closed and you still want the data that matters to you.

### HTML Review

Tags are used to mark up the start and end of an HTML element. An attribute defines a property of an element.

`<!DOCTYPE html>` is the document type declaration. It lets the browser know which version of HTML you are using.

`<html></html>` Everything between these tags is the HTML document.

`<body></body>` The body tag contains the contents of the document that appear in the browser.

`<br>` Line break.

**Attributes** Tags can also have attributes, which are extra bits of information. Attributes appear inside opening tags.

### Urllib library

Mainly used for building, loading and parsing URLs. This package contains 5 modules.

1. `request`: Opens URLs.

2. `response`: Used internally. You don't work with it directly.

3. `error`: Exceptions raised by requests.

4. `parse`: Has a variety of functions for breaking a URL into meaningful pieces such as scheme, port, host, etc.

5. `robotparser`: Used to inspect robots.txt files.

### Urllib.request

urllib.request is a Python module for fetching URLs. It offers functions like 'urlopen' and supports fetching URLs over different schemes such as HTTP, HTTPS and FTP.

### Urllib.parse

This module is used to break URLs into components (scheme, network location, path, etc.) and to combine the components back into URLs.
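A short sketch of splitting and rebuilding a URL with `urllib.parse` follows. The URL is a made-up example.

```python
from urllib.parse import urlsplit, urlunsplit, urljoin, parse_qs

# Made-up URL used only for illustration.
url = "https://www.example.com:8080/products/list?page=2&sort=price#reviews"

parts = urlsplit(url)
print(parts.scheme)    # https
print(parts.hostname)  # www.example.com
print(parts.port)      # 8080
print(parts.path)      # /products/list

# The query string becomes a dict of lists.
query = parse_qs(parts.query)

# Combine the components back into the original URL.
rebuilt = urlunsplit(parts)

# Resolve a link found on the page against the page's own URL.
next_page = urljoin(url, "/products/list?page=3")
```

`urljoin` is the piece that matters most when crawling, because links in a page are often relative and need to be resolved against the page they were found on.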

### Urllib.error

**HTTPError and URLError**

The urllib.error module defines the exception classes for exceptions raised by urllib.request. urlopen raises URLError when it cannot handle a response.

HTTPError is the subclass of URLError raised in the specific case of HTTP URLs.

URLError is often raised because there is no network connection or the specified server doesn’t exist.

The HTTPError instance raised will have an integer ‘code’ attribute, which corresponds to the error sent by the server. Typical errors include ‘404’ (page not found), ‘403’ (request forbidden), ‘401’ (authentication required) and ‘500’ (internal server error).

```python
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    # urlopen is imported directly, so it is called without a module prefix
    html = urlopen('http://www.pythonscraping.com/pages/page1.html')
except HTTPError as e:
    # the server answered with an error status such as 404 or 500
    print(e)
    # return null, break, or do some other plan-B
else:
    print('Else condition')
# program continues,
# Note: if you return or break in the exception catch,
# you do not need to use 'else' statement
```

```
Else condition
```


```python
import requests
url = 'http://www.webscrapingfordatascience.com/basichttp/'
r = requests.get(url)
print(r.text)
```

```
Hello from the web!
```

```python
# response headers
r.headers
```

```
{'Date': 'Wed, 08 Jan 2020 19:13:16 GMT', 'Server': 'Apache/2.4.18 (Ubuntu)', 'Content-Length': '20', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'text/html; charset=UTF-8'}
```

## BeautifulSoup

We use BeautifulSoup to access the HTML content of a website. It works well for static websites and is a very powerful package for navigating an HTML document. To fetch the HTML we use urlopen from urllib.request (the requests package works too).

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

# urlopen function to open URL
# read() to get HTML content of the page.

html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')
bs.h1
```

```
<h1>An Interesting Title</h1>
```

```python
# you can do it without mentioning the read() method
# BeautifulSoup method transforms HTML content into beautifulSoup object
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html, 'html.parser')
bs.h1
```

```
<h1>An Interesting Title</h1>
```

```python
# h1 tag is nested deep inside beautifulsoup object. But it fetches it.
bs.html.body.h1
```

```
<h1>An Interesting Title</h1>
```

```python
bs.html.h1
```

```
<h1>An Interesting Title</h1>
```

### Parsers

The parser is what reads the HTML tags and identifies their inner elements. There are different parsers available with BeautifulSoup.

- html.parser

- lxml

- html5lib

#### HTML and LXML parsers

When you create a BeautifulSoup object, two arguments are passed in. The first is the HTML text the object is based on, and the second specifies the parser that you want BeautifulSoup to use in order to create that object.

html.parser is a parser that is included with Python 3 and requires no extra installations in order to use.

Another popular parser is lxml. lxml has some advantages over html.parser in that it is generally better at parsing “messy” or malformed HTML code. One of the disadvantages of lxml is that it has to be installed separately and depends on third-party C libraries to function. This can cause problems for portability and ease of use, compared to html.parser.

#### html5lib parser

html5lib is an extremely forgiving parser that takes even more initiative correcting broken HTML. It may be a good choice if you are working with messy or handwritten HTML sites.

## Advanced HTML Parsing using find_all()

Nearly every website you encounter contains stylesheets. CSS relies on the differentiation of HTML elements that might otherwise have the exact same markup in order to style them differently. For example, some tags might look like `<span class="green"></span>` and others like `<span class="red"></span>`.

Web scrapers can easily separate these two tags based on their class.

With a BeautifulSoup object, you can use the find_all function to extract a Python list of proper nouns found by selecting only the text within `<span class="green"></span>` tags.

Calling bs.find_all(tagName, tagAttributes) returns a list of all matching tags on the page, rather than just the first. After getting a list of names, the program iterates through all names in the list and prints name.get_text() in order to separate the content from the tags.
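The steps just described could look roughly like this. The HTML snippet is a made-up stand-in for a real page and is inlined so the example runs without fetching anything.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page.
html = """
<html><body>
<p><span class="red">Well, Prince,</span> said
<span class="green">Anna Pavlovna</span> to
<span class="green">Prince Vasili</span>.</p>
</body></html>
"""

bs = BeautifulSoup(html, "html.parser")

# Select every span whose class is "green", not just the first one.
name_list = bs.find_all("span", {"class": "green"})

# get_text() separates the content from the surrounding tags.
names = [name.get_text() for name in name_list]
for name in names:
    print(name)
```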


### When to use get_text() and when to preserve tags

.get_text() strips all tags from the document you are working with and returns a Unicode string containing the text only. For example, if you are working with a large block of text that contains many hyperlinks, paragraphs, and other tags, all those will be stripped away, and you’ll be left with a tagless block of text. 

Keep in mind that it’s much easier to find what you’re looking for in a BeautifulSoup object than in a block of text. Calling .get_text() should always be the last thing you do, immediately before you print, store, or manipulate your final data. In general, you should try to preserve the tag structure of a document as long as possible.

### find() and find_all() with BeautifulSoup

**find_all(tag, attributes, recursive, text, limit, keywords)**

**find(tag, attributes, recursive, text, keywords)**


find() and find_all() are the two functions used to filter HTML pages to find lists of desired tags, or a single tag, based on their various attributes.

In all likelihood, 95% of the time you will need to use only the first two arguments: *"tag and attributes"*. 

With the `tag` argument you can pass the string name of a tag or even a Python list of tag names. The `attributes` argument takes a Python dictionary of attributes and matches tags that contain any one of those attributes.

*find_all(['h1','h2','h3','h4','h5','h6'])*

*find_all('span', {'class':{'green', 'red'}})*

The **keyword** argument allows you to select tags that contain a particular attribute or set of attributes. For example:

*title = bs.find_all(id='title', class_='text')*

This returns all tags with the word "text" in the class attribute and "title" in the id attribute. Keep in mind that anything that can be done with keyword arguments can also be accomplished with regular expressions and lambda expressions.


Additionally, "class" is a reserved word in Python that cannot be used as a variable or argument name, so the first call below is a syntax error. Instead, add an underscore (class_) or, alternatively, enclose class in quotes.

*bs.find_all(class='green')*

*bs.find_all(class_='green')*

*bs.find_all('', {'class':'green'})*

### Web Crawlers

With the above methods you can scrape static pages. But when you want to scrape multiple pages, or even multiple sites, you need to use "Web Crawlers". Crawlers are used in situations where the data you need is not all on a single page. They crawl across the web.

A crawler scrapes the contents of a URL, examines that page for another URL and retrieves that page.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html, 'html.parser')
for link in bs.find_all('a'):
    if 'href' in link.attrs:
        print(link.attrs['href'])
```

```
/wiki/Wikipedia:Protection_policy#semi
#mw-head
#p-search
/wiki/Kevin_Bacon_(disambiguation)
/wiki/File:Kevin_Bacon_SDCC_2014.jpg
/wiki/Philadelphia
/wiki/Pennsylvania
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
#cite_note-1
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Holly_Near
http://baconbros.com/
#cite_note-2
#cite_note-actor-3
/wiki/Footloose_(1984_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Sleepers
/wiki/The_Woodsman_(2004_film)
/wiki/Animal_House
/wiki/Diner_(film)
/wiki/Tremors_(film)
/wiki/Crazy,_Stupid,_Love
/wiki/Friday_the_13th_(1980_film)
/wiki/Flatliners
/wiki/The_River_Wild
/wiki/Wild_Things_(film)
/wiki/Stir_of_Echoes
/wiki/Hollow_Man
/wiki/Frost/Nixon_(film)
/wiki/X-Men:_First_Class
/wiki/Black_Mass_(film)
/wiki/Patriots_Day_(film)
/wiki/Fox_Broadcasting_Company
/wiki/The_Following
/wiki/HBO
/wiki/Taking_Chance
/wiki/Golden_Globe_Award
/wiki/Screen_Actors_Guild_Award
/wiki/Primetime_Emmy_Award
/w
```

```python
# revising code with regular expressions to get only desired article links
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html, 'html.parser')
# keep only links inside the article body that start with /wiki/ and contain no colon
article_link = re.compile(r'^(/wiki/)((?!:).)*$')
for link in bs.find('div', {'id': 'bodyContent'}).find_all('a', href=article_link):
    if 'href' in link.attrs:
        print(link.attrs['href'])
```

```
/wiki/Kevin_Bacon_(disambiguation)
/wiki/Philadelphia
/wiki/Pennsylvania
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Holly_Near
/wiki/Footloose_(1984_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Sleepers
/wiki/The_Woodsman_(2004_film)
/wiki/Animal_House
/wiki/Diner_(film)
/wiki/Tremors_(film)
/wiki/Crazy,_Stupid,_Love
/wiki/Friday_the_13th_(1980_film)
/wiki/Flatliners
/wiki/The_River_Wild
/wiki/Wild_Things_(film)
/wiki/Stir_of_Echoes
/wiki/Hollow_Man
/wiki/Frost/Nixon_(film)
/wiki/Black_Mass_(film)
/wiki/Patriots_Day_(film)
/wiki/Fox_Broadcasting_Company
/wiki/The_Following
/wiki/HBO
/wiki/Taking_Chance
/wiki/Golden_Globe_Award
/wiki/Screen_Actors_Guild_Award
/wiki/Primetime_Emmy_Award
/wiki/I_Love_Dick_(TV_series)
/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy
/wiki/The_Guardian
/wiki/Academy_Award
/wiki/Hollywood_Walk_of_Fame
/wiki
```

Now you see a list of all the article URLs that the Wikipedia article on Kevin Bacon links to.
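To see what the regular expression keeps and rejects, it can be tried on a handful of href values taken from the first output. This is only a small offline check of the pattern, not part of the original crawl.

```python
import re

# Same pattern as in the crawl: starts with /wiki/ and contains no colon afterwards.
ARTICLE_LINK = re.compile(r"^(/wiki/)((?!:).)*$")

hrefs = [
    "/wiki/Philadelphia",
    "/wiki/File:Kevin_Bacon_SDCC_2014.jpg",    # colon: a file page, not an article
    "#cite_note-1",                            # in-page anchor
    "http://baconbros.com/",                   # external site
    "/wiki/Wikipedia:Protection_policy#semi",  # colon: a project page
    "/wiki/Kyra_Sedgwick",
]

article_links = [href for href in hrefs if ARTICLE_LINK.match(href)]
print(article_links)
```

The negative lookahead `(?!:)` is what filters out namespaced pages such as `File:` and `Wikipedia:`.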

## Crawling an Entire Site

In the previous code we took a random walk through a website, going from link to link. But what if you need to systematically catalog or search every page on a site?

Crawling an entire site, especially a large one, is a memory-intensive process that is best suited to applications for which a database to store crawling results is readily available.

The general approach to an exhaustive site crawl is to start with a top-level page (such as the home page) and search for a list of all internal links on that page. Every one of those links is then crawled, and additional lists of links are found on each of them, triggering another round of crawling.

When crawling an entire site, we examine the tags and attributes and write rules with `Regular Expressions` to decide which pages to crawl.

For the random walk, a single function, getLinks, takes a URL and returns a list of all linked URLs in the same form. A main function calls getLinks with a starting article, chooses a random article link from the returned list and calls getLinks again, until you stop the program or until no article links are found.

To avoid crawling the same page twice, we normalized the discovered links and kept them in a running `set`. A set is similar to a list, but its elements have no specific order and only unique elements are stored.
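The role of the set can be sketched with a small in-memory link graph. The `SITE` dictionary and the `crawl` function are hypothetical stand-ins: in a real crawl, `get_links` would download a page and extract its internal links.

```python
from collections import deque

# Hypothetical link graph standing in for a real site: page -> links found on it.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/", "/c"],
    "/c": [],
}


def crawl(start, get_links):
    """Visit every page reachable from `start` exactly once, breadth first."""
    visited = set()            # running set of pages already crawled
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        if page in visited:    # already crawled: skip instead of fetching again
            continue
        visited.add(page)
        order.append(page)
        for link in get_links(page):
            if link not in visited:
                queue.append(link)
    return order


pages = crawl("/", lambda page: SITE.get(page, []))
print(pages)
```

Even though `/b` and `/c` are linked from several pages, each appears in the result once, because membership in the set is checked before a page is processed.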

## Scraping data from Dynamic Websites

A plain scraper cannot get the information from a dynamic website directly, because the data is loaded dynamically with JavaScript. In such cases, we can use the following two techniques for scraping data from dynamic JavaScript-dependent websites:

- Reverse Engineering JavaScript

- Rendering JavaScript

#### Selenium for client side rendering

https://www.pluralsight.com/guides/advanced-web-scraping-tactics-python-playbook