<a href="https://www.kaggle.com/code/adebayoolalekan/web-scraping-tutorial?scriptVersionId=206158025" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

### What is Web Scraping?
In this context, "scraping" means pulling something off the web. Two questions follow: what can we get from the web, and how do we get it? The answer to the first question is data. Almost every programming project needs data, often in large quantities.

The second question is trickier, because there are many ways to get data. In general, we may get data from a database, a data file, or other sources. But what if we need a large amount of data that is only available online? One option is to search for it manually (clicking around in a web browser) and save it by hand (copy-pasting into a spreadsheet or file). This is tedious and time-consuming. The other option is web scraping.

Web scraping, also called web data mining or web harvesting, is the process of building an agent that can automatically extract, parse, download, and organize useful information from the web. In other words, instead of saving data from websites by hand, web scraping software loads and extracts the data we need from multiple websites automatically.



**Web Crawling vs. Web Scraping**
The terms web crawling and web scraping are often used interchangeably, because both involve collecting data from web pages. However, they are different, as their definitions show.

Web crawling uses bots, known as crawlers, to discover pages and index the information on them. It is also called indexing. Web scraping, on the other hand, uses bots, known as scrapers, to extract specific information from pages automatically. It is also called data extraction.


**Uses of Web Scraping**
The uses of web scraping are as varied as the uses of the World Wide Web itself. A scraper can automate many of the things a person does in a browser, such as ordering food online, scanning a shopping website, or buying match tickets the moment they become available. Some of the important uses of web scraping are listed here:

* E-commerce Websites: Web scrapers can collect price data for a specific product from several e-commerce websites so that the prices can be compared.

* Content Aggregators: News aggregators, job aggregators, and similar services use web scraping widely to keep the data they show their users up to date.

* Marketing and Sales Campaigns: Web scrapers can collect contact details such as email addresses and phone numbers for sales and marketing campaigns.

* Search Engine Optimization (SEO): SEO tools such as SEMrush and Majestic use web scraping widely to tell businesses how they rank for the search keywords that matter to them.

* Data for Machine Learning Projects: Machine learning projects often rely on web scraping to collect the data they need.

**Components of a Web Scraper**
A web scraper consists of the following components:

* Web Crawler Module
The web crawler module is an essential component of a web scraper. It navigates the target website by making HTTP or HTTPS requests to its URLs, downloads the unstructured data (the HTML content), and passes it to the extractor, the next module.

* Extractor
The extractor processes the fetched HTML content and extracts the data into a semi-structured format. It is also called the parser module, and it can use different parsing techniques such as regular expressions, HTML parsing, DOM parsing, or artificial intelligence.

* Data Transformation and Cleaning Module
The extracted data is usually not ready to use as-is. It first passes through a cleaning module, which can use methods such as string manipulation or regular expressions. Note that extraction and transformation can also be performed in a single step.

* Storage Module
After the data has been extracted and cleaned, it needs to be stored. The storage module outputs the data in a standard format, for example as rows in a database or as a JSON or CSV file.

![](http://www.tutorialspoint.com/python_web_scraping/images/web_scraper.jpg)
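
The four components above can be sketched as four small functions. This is a minimal, standard-library-only illustration rather than a production design. The function names (`crawl`, `extract`, `clean`, `store`), the URL, and the sample HTML are made up for this example, and `crawl` returns a hard-coded page instead of making a real HTTP request so that the sketch runs offline.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content; a real crawler module would download this over HTTP.
SAMPLE_HTML = """
<ul>
  <li class="product"> Laptop  - $999.00 </li>
  <li class="product">Phone -   $599.50</li>
</ul>
"""


def crawl(url):
    """Crawler module: return the raw HTML for a URL (stubbed with sample data)."""
    return SAMPLE_HTML


class ProductParser(HTMLParser):
    """Extractor: collect the text inside every <li class="product">."""

    def __init__(self):
        super().__init__()
        self.in_product = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and dict(attrs).get("class") == "product":
            self.in_product = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_product = False

    def handle_data(self, data):
        if self.in_product:
            self.items[-1] += data


def extract(html):
    """Run the parser and return the raw text of each product item."""
    parser = ProductParser()
    parser.feed(html)
    return parser.items


def clean(raw_items):
    """Cleaning module: split 'name - $price' strings into typed records."""
    records = []
    for item in raw_items:
        name, price = item.split("-")
        records.append({"name": name.strip(), "price": float(price.strip().lstrip("$"))})
    return records


def store(records):
    """Storage module: serialize the records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = clean(extract(crawl("https://example.com/products")))
csv_text = store(records)
print(csv_text)
```

Keeping the stages separate means any one of them can be swapped out, for example replacing the stubbed `crawl` with a real HTTP request, without touching the others.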

**Requests**
Requests is a simple Python HTTP library for accessing web pages. With it, we can fetch the raw HTML of a web page, which can then be parsed to retrieve the data. If it is not already installed, it can be installed with `pip install requests`.

**Public APIs:**

* Websites with open APIs provide structured data such as JSON or XML.
* Examples:
    * JSONPlaceholder: A free fake API for testing and prototyping.
    * OpenWeatherMap: Provides weather data.
    * CoinGecko API: Provides cryptocurrency data.
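
As a rough sketch of working with such an API: when the response body is JSON, `response.json()` turns it into Python dictionaries and lists. The helper name `fetch_json` and the sample payload below are illustrative assumptions. The payload is hard-coded, shaped like a to-do item that a test API might return, so that the parsing step runs without network access.

```python
import json

import requests


def fetch_json(url, timeout=10):
    """GET a URL and decode its JSON body; raises on HTTP errors (4xx/5xx)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.json()


# Illustrative payload, hard-coded so this part runs offline.
sample_body = '{"userId": 1, "id": 1, "title": "write a scraper", "completed": false}'

todo = json.loads(sample_body)  # the same kind of object fetch_json() returns
print(todo["title"], todo["completed"])
```

With network access, a call such as `fetch_json("https://jsonplaceholder.typicode.com/todos/1")` should return a dictionary of the same shape.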

In [None]:
import requests

# Example URL. Note: this is a Wikipedia search page, so it returns HTML rather than JSON
url = 'https://en.wikipedia.org/wiki/Special:Search?go=Go&search=world+richest&ns0=1'

# Send a GET request
response = requests.get(url)

# Print the raw response body (bytes)
print(response.content)


**Urllib3**
urllib3 is another Python library for retrieving data from URLs, similar to the requests library. You can read more in its documentation at https://urllib3.readthedocs.io/en/latest/.
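
The code cells in this tutorial use requests. For comparison, here is a minimal sketch of the same kind of GET request made with urllib3. The function name `fetch_with_urllib3` is made up for this example, and decoding the body as UTF-8 is an assumption that will not hold for every page.

```python
import urllib3


def fetch_with_urllib3(url, timeout=10.0):
    """Return (status code, decoded body) for a GET request made with urllib3."""
    http = urllib3.PoolManager()  # manages and reuses connections
    response = http.request("GET", url, timeout=timeout)
    # response.data holds the raw bytes; UTF-8 is assumed here
    body = response.data.decode("utf-8", errors="replace")
    return response.status, body


# Example call (requires network access):
# status, body = fetch_with_urllib3("https://example.com")
```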

In [None]:
# Sending HTTP requests
# A basic HTTP GET request (made with requests, not urllib3):
url = 'https://example.com'

# Sending a GET request
response = requests.get(url)

# Check if the request was successful
print(response.status_code)  # 200 means successful


In [None]:
print(response.content)

**Handling HTTP Responses**
The response object contains several important attributes:

* response.text: Returns the response body decoded to a string (for a web page, the HTML).
* response.content: Returns the raw bytes of the content (useful for non-text content like images).
* response.status_code: The HTTP status code (e.g., 200 for success).
* response.headers: HTTP headers of the response.
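
To make the difference between these attributes concrete, the sketch below summarizes a response using only the four attributes listed above. The helper name `summarize_response` is hypothetical, and the response is a stand-in object built with `SimpleNamespace` rather than a real requests response, so the example runs without network access.

```python
from types import SimpleNamespace


def summarize_response(response):
    """Collect a few commonly inspected facts about an HTTP response."""
    return {
        "ok": 200 <= response.status_code < 300,
        "content_type": response.headers.get("Content-Type", ""),
        "n_bytes": len(response.content),  # length of the raw body in bytes
        "n_chars": len(response.text),     # length after decoding to str
    }


# Stand-in object exposing the same four attributes as a requests response.
fake_response = SimpleNamespace(
    status_code=200,
    headers={"Content-Type": "text/html; charset=utf-8"},
    content="caf\u00e9".encode("utf-8"),
    text="caf\u00e9",
)

summary = summarize_response(fake_response)
print(summary)
```

The byte count and character count differ because the accented character takes two bytes in UTF-8, which is why `response.content` and `response.text` are not interchangeable.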

In [None]:
# Print the HTML content of the page
print(response.text)

In [None]:
# Print HTTP headers
print(response.headers)

**Introduction to BeautifulSoup**
BeautifulSoup is used to parse and navigate HTML documents easily. 

It allows us to extract meaningful data from the HTML page.

**Parsing HTML**
You need to create a BeautifulSoup object from the HTML content returned by requests.

In [None]:
from bs4 import BeautifulSoup

# Parsing HTML content with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Pretty printing the HTML
print(soup.prettify())


**Navigating the HTML Tree**
You can navigate and search through the HTML document using various methods:

* find(): Finds the first matching tag.
* find_all(): Finds all matching tags.
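
The two methods also differ in what they return, which matters when nothing matches. The sketch below uses a small made-up HTML snippet and a separate variable name (`soup_demo`) so that it does not depend on, or overwrite, the page fetched earlier.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration.
html = """
<div>
  <h1>Title</h1>
  <p class="lead">First</p>
  <p class="lead">Second</p>
</div>
"""
soup_demo = BeautifulSoup(html, "html.parser")

first_p = soup_demo.find("p", class_="lead")    # a single Tag
all_p = soup_demo.find_all("p", class_="lead")  # a list-like ResultSet of Tags
missing = soup_demo.find("table")               # None, because nothing matches
none_found = soup_demo.find_all("table")        # empty result, because nothing matches

print(first_p.text)
print([p.text for p in all_p])
print(missing, len(none_found))
```

Because `find()` can return `None`, it is worth checking the result before reading attributes such as `.text` from it.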

In [None]:
# Finding the first <h1> tag
h1_tag = soup.find('h1')
print(h1_tag.text)

# Finding all <a> tags (links)
a_tags = soup.find_all('a')
for tag in a_tags:
    # .get() returns None instead of raising KeyError when a link has no href
    print(tag.text, tag.get('href'))  # Print the link text and URL

**Practical Examples**
Let's see how we can use requests and BeautifulSoup to extract specific data from a website.

* Extracting Headings from a Page
This example extracts all the headings ("h1", "h2", etc.) from a page.

In [None]:
# Extract all headings (h1, h2, h3, etc.)
for i in range(1, 7):
    headings = soup.find_all(f'h{i}')
    for heading in headings:
        print(f"H{i}: {heading.text}")


* Extracting Links from a Web Page
This example extracts all the hyperlinks (`<a>` tags) from the page and prints their text and href attributes.

In [None]:
# Extract all links from the page
links = soup.find_all('a')
for link in links:
    href = link.get('href')
    text = link.text
    print(f"Text: {text}, URL: {href}")
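
Many `href` values are relative (for example `/about`) or are not web links at all (for example `mailto:` addresses), so they usually need to be resolved against the page URL before they can be requested. Here is a minimal sketch using the standard library's `urljoin`. The base URL and the list of `href` values are hypothetical stand-ins for what `link.get('href')` might return.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical page URL and href values, as link.get('href') might return them.
base_url = "https://example.com/docs/"
hrefs = [
    "/about",
    "page2.html",
    "?page=2",
    "https://www.iana.org/domains/example",
    "mailto:someone@example.com",
    None,  # an <a> tag without an href attribute
]

absolute_links = []
for href in hrefs:
    if not href:
        continue  # skip missing or empty hrefs
    absolute = urljoin(base_url, href)  # resolve relative links against the page URL
    if urlparse(absolute).scheme in ("http", "https"):
        absolute_links.append(absolute)  # keep only web links

print(absolute_links)
```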


In [None]:
from bs4 import BeautifulSoup
import requests

In [None]:
url = "https://www.scrapethissite.com/pages/forms/"
page = requests.get(url)

In [None]:
soup = BeautifulSoup(page.text, "html.parser")

In [None]:
print(soup.prettify())

### Find And Find_All

In [None]:
soup.find("div")

In [None]:
soup.find_all("div", class_="col-md-12")

In [None]:
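soup.find_all("p", class_="lead")

In [None]:
# find_all() returns a list-like ResultSet, which has no .text attribute,
# so read .text from each element instead
[p.text.strip() for p in soup.find_all("p", class_="lead")]

In [None]:
soup.find("p", class_="lead").text.strip()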

In [None]:
soup.find_all("th")

In [None]:
soup.find("th").text.strip()

In [None]:
table = soup.find('table', {'class': 'table'})

In [None]:
table

In [None]:
import pandas as pd
# Initialize a list to store one dict per team row
data = []

# Extract the column headers from the table
headers = [th.text.strip() for th in table.find_all('th')]
headers

In [None]:
# Extract team data, one dict per table row
for row in table.find_all('tr', {'class': 'team'}):
    team_data = {}

    # Team name
    team_data['Team Name'] = row.find('td', {'class': 'name'}).text.strip()

    # Year
    team_data['Year'] = int(row.find('td', {'class': 'year'}).text.strip())

    # Wins
    team_data['Wins'] = int(row.find('td', {'class': 'wins'}).text.strip())

    # Losses
    team_data['Losses'] = int(row.find('td', {'class': 'losses'}).text.strip())

    # OT Losses (might be empty for some years)
    ot_losses = row.find('td', {'class': 'ot-losses'}).text.strip()
    team_data['OT Losses'] = int(ot_losses) if ot_losses else None

    # Win Percentage
    team_data['Win %'] = float(row.find('td', {'class': 'pct'}).text.strip())

    # Goals For
    team_data['Goals For'] = int(row.find('td', {'class': 'gf'}).text.strip())

    # Goals Against
    team_data['Goals Against'] = int(row.find('td', {'class': 'ga'}).text.strip())

    # Goal Differential
    team_data['+/-'] = int(row.find('td', {'class': 'diff'}).text.strip())

    data.append(team_data)

# Create DataFrame
df = pd.DataFrame(data)
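
The loop above repeats the same find, strip, and convert steps for every column, and it would raise an error if a cell were missing from a row. One possible way to factor that out is a small helper, sketched below. The name `cell_value` is hypothetical, and the row is made-up HTML (with one empty cell and one missing cell) rather than data from the real page.

```python
from bs4 import BeautifulSoup

# Made-up row: the "ot-losses" cell is empty and there is no "gf" cell at all.
row_html = """
<table><tr class="team">
  <td class="name"> Boston Bruins </td>
  <td class="wins">44</td>
  <td class="ot-losses"></td>
  <td class="pct text-success">0.55</td>
</tr></table>
"""


def cell_value(row, css_class, cast=str):
    """Return the converted text of the first matching <td>, or None if absent or empty."""
    cell = row.find("td", class_=css_class)
    if cell is None:
        return None
    text = cell.text.strip()
    return cast(text) if text else None


demo_row = BeautifulSoup(row_html, "html.parser").find("tr", class_="team")
record = {
    "Team Name": cell_value(demo_row, "name"),
    "Wins": cell_value(demo_row, "wins", int),
    "OT Losses": cell_value(demo_row, "ot-losses", int),
    "Win %": cell_value(demo_row, "pct", float),
    "Goals For": cell_value(demo_row, "gf", int),
}
print(record)
```

Missing and empty cells both come back as `None`, which pandas can then treat as missing values when the records are loaded into a DataFrame.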

In [None]:
data

In [None]:
df