### Data Collection, Web Scraping and APIs



Web scraping is the automated extraction of data from websites. A scraper fetches the HTML of a web page and parses it to pull out the relevant information.

Scraping makes it possible to collect large amounts of data. Most of this data arrives as unstructured HTML, which is then converted into structured data in a spreadsheet or a database so that it can be used in other applications.

Many large websites, such as Google, Twitter, Facebook and StackOverflow, provide APIs that let you access their data in a structured format. When a suitable API is available, it is usually a better option than scraping, because the data is already structured.

Web scraping has two parts: the **crawler** and the **scraper**.

The **crawler** is a program that browses the web, following links from page to page, to find the pages that contain the required data.

The **scraper** is a tool built to extract the data from those pages. Its design can vary greatly with the complexity and scope of the project, so that it can extract the data quickly and accurately.
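To make the split between the two parts concrete, here is a minimal sketch using only the standard library. The `PAGES` dictionary is a hypothetical in-memory stand-in for a website (a real crawler would download each page over HTTP), and the class and function names are illustrative assumptions.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "website": path -> HTML. A real crawler would
# download each page over HTTP instead of reading from this dict.
PAGES = {
    "/": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<h1>Page A</h1> <a href="/b">B</a>',
    "/b": '<h1>Page B</h1> <a href="/">home</a>',
}


class LinkAndHeadingParser(HTMLParser):
    """Collects link targets (for the crawler) and <h1> text (for the scraper)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.headings = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.headings.append(data.strip())


def crawl_and_scrape(start):
    """Crawler part: follow links breadth-first. Scraper part: keep <h1> text."""
    seen, queue, scraped = {start}, deque([start]), {}
    while queue:
        url = queue.popleft()
        parser = LinkAndHeadingParser()
        parser.feed(PAGES.get(url, ""))
        scraped[url] = parser.headings      # scraper output for this page
        for link in parser.links:           # crawler: schedule unseen links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return scraped


result = crawl_and_scrape("/")
print(result)
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other.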

In [1]:
# How Do Web Scrapers Work?

Web scrapers can extract either all the data on a site or only the specific data the user asks for. Specifying the data you need makes extraction quicker.

When a web scraper needs to scrape a site, it works through these steps:

- First, the URLs to scrape are provided.
- The scraper then loads all the HTML code for those pages. A more advanced scraper might also load the CSS and JavaScript.
- Next, the scraper extracts the required data from the HTML and outputs it in the format specified by the user.
- Usually this is an Excel spreadsheet or a CSV file, but the data can also be saved in other formats, such as a JSON file.
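The output step can be sketched with the standard library alone. The `rows` list below is hypothetical sample data standing in for whatever a scraper has already extracted, and the helper names `to_csv` and `to_json` are assumptions made for illustration.

```python
import csv
import io
import json

# Hypothetical rows, standing in for data a scraper has already extracted.
rows = [
    {"title": "Arrays", "url": "/arrays"},
    {"title": "Linked Lists", "url": "/linked-lists"},
]


def to_csv(records):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "url"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    """Serialize the same records to pretty-printed JSON text."""
    return json.dumps(records, indent=2)


csv_text = to_csv(rows)
json_text = to_json(rows)
print(csv_text)
print(json_text)
```

To save to disk instead, the same writer can be pointed at a file opened with `open("output.csv", "w", newline="")`.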

In [2]:
# Different Types of Web Scrapers

Web scrapers can be classified by several criteria, for example:

- Self-built or pre-built web scrapers
- Browser-extension or software web scrapers
- Cloud or local web scrapers

Self-built web scrapers require advanced programming knowledge, and the more features you want, the more knowledge you need.

Pre-built web scrapers can be downloaded and run easily. They also have more advanced options that you can customize.

Browser-extension web scrapers are extensions added to your browser. They are easy to run but limited: any advanced feature outside the scope of the browser cannot be run.

Software web scrapers do not have these limitations, because they are downloaded and installed on your computer. They are more complex and offer advanced features.

Cloud web scrapers run on the cloud, so they do not use your computer's resources and you can focus on other tasks while they run.

Local web scrapers, on the other hand, run on your computer using local resources.

In [4]:
# Why Python for Web Scraping?

- Python is easy to use and has a large collection of libraries for scraping.
- Scrapy is a very popular open-source web crawling framework written in Python.
- Beautiful Soup is another Python library that is well suited to web scraping.
 

In [5]:
# Code: Web Scraping in Python with BeautifulSoup

There are two main ways to extract data from a website:

- Use the website's API, if one exists. For example, Facebook has the Facebook Graph API, which allows retrieval of data posted on Facebook.
- Access the HTML of the web page and extract the useful data from it. This technique is called web scraping, web harvesting, or web data extraction.
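To illustrate why the API route tends to be simpler, the sketch below decodes a JSON body of the kind an API might return. The payload and its field names are made up for illustration and do not reflect any real API's response format.

```python
import json

# Hypothetical JSON body, in the general shape a web API might return.
api_body = '{"posts": [{"id": 1, "message": "Hello"}, {"id": 2, "message": "World"}]}'

# One call turns the text into nested Python dicts and lists.
data = json.loads(api_body)

# The fields are already named, so no HTML parsing is needed.
messages = [post["message"] for post in data["posts"]]
print(messages)
```

When the body comes from a `requests` call, `response.json()` performs the same decoding step.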

In [6]:
# Steps involved in web scraping:

1. Send an HTTP request to the URL of the web page you want to access. The server responds with the HTML content of the page. For this task we use requests, a third-party HTTP library for Python.

2. Parse the HTML. Because most HTML is nested, the data cannot be extracted reliably through simple string processing. We need a parser that builds a nested (tree) structure from the HTML. Several parser libraries are available; **html5lib** is the most lenient of them, because it parses a page the same way a web browser does, at the cost of speed.

3. Navigate and search the parsed tree to pull out the data. We use Beautiful Soup for this.
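The parse and extract steps can be tried without any network access by feeding Beautiful Soup a small HTML string. The snippet below is a sketch under that assumption: the HTML and its class name are invented for illustration, and it uses Python's built-in `html.parser` so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, standing in for the content returned by the server.
html = """
<div class="course">
  <h2>Data Structures</h2>
  <ul>
    <li>Arrays</li>
    <li>Linked Lists</li>
  </ul>
</div>
"""

# Parsing builds a tree of nested tag objects.
soup = BeautifulSoup(html, "html.parser")

# Extraction walks that tree instead of searching raw text.
course = soup.find("div", class_="course")
title = course.h2.get_text(strip=True)
topics = [li.get_text(strip=True) for li in course.find_all("li")]

# Every node knows its parent, which plain string processing cannot offer.
parent_of_first_topic = course.li.parent.name

print(title, topics, parent_of_first_topic)
```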

In [7]:
# Installing the required third-party libraries

In [14]:
# !pip install requests
# !pip install html5lib
# !pip install bs4

In [3]:
import html5lib

In [4]:
# Step 2: Accessing the HTML content from webpage 

import requests
url = "https://www.geeksforgeeks.org/data-structures/"
r = requests.get(url)
print(r.content)



> Above we got the raw HTML content of the web page. Note that r.content is of type bytes; r.text gives the same content decoded to a string (str).

In [None]:
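# Use BeautifulSoup to parse the HTML content of the page
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.content, 'html.parser')

# Find the element containing the data you want to extract.
# 'some-class' is a placeholder; replace it with a class that exists on the page.
element = soup.find('div', class_='some-class')

# find() returns None when nothing matches, so check before reading .text
data = element.text if element is not None else None

# Print the extracted data
print(data)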

### Example 2:

In [None]:
import requests
from bs4 import BeautifulSoup

In [8]:
url = "https://www.geeksforgeeks.org/data-structures/"

# Send a GET request to the URL
response = requests.get(url)

In [9]:
response

<Response [200]>

In [10]:
response.content



In [11]:
# Create a BeautifulSoup object by passing the response content and specifying the HTML parser
soup = BeautifulSoup(response.content, 'html.parser')

In [12]:
# Find specific elements on the page using BeautifulSoup's methods
# Here's an example of finding all the heading elements with class 'entry-title'
headings = soup.find_all('h2', class_='entry-title')

In [16]:
headings

[]

In [13]:
# Extract the text from the heading elements
heading_texts = [heading.text.strip() for heading in headings]

In [14]:
# Print the extracted heading texts
for text in heading_texts:
    print(text)

In [15]:
heading_texts

[]
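The empty lists above indicate that no `<h2>` element with class `entry-title` was found in the HTML that came back. A likely cause is that the page uses different class names, or that it builds its content with JavaScript after loading. One way to investigate is to list the tag and class combinations that are actually present. The sketch below does this on a small invented HTML string, so the class names in it are assumptions, not the real page's markup.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical HTML; the real page's markup will differ.
html = """
<div class="article"><h2 class="post-title">Arrays</h2></div>
<div class="article"><h2 class="post-title">Stacks</h2></div>
<h3 class="sidebar">Related</h3>
"""
soup = BeautifulSoup(html, "html.parser")

# A selector that matches nothing returns an empty list, not an error.
missing = soup.find_all("h2", class_="entry-title")

# Count every (tag, class) pair that is present, to find a usable selector.
class_counts = Counter(
    (tag.name, cls)
    for tag in soup.find_all(class_=True)
    for cls in tag.get("class", [])
)

# Retry with a class that the inventory shows is really there.
found = [h.get_text(strip=True) for h in soup.find_all("h2", class_="post-title")]

print(missing)
print(class_counts.most_common())
print(found)
```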

Note: Sometimes the server rejects the request with an error such as “Not accepted”. In that case, try sending a browser user agent with the request, as shown below. You can look up the user agent string for your device and browser here: https://deviceatlas.com/blog/list-of-user-agent-strings

        headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246"}

**Here the user agent is for the Edge browser on Windows 10. You can find your own browser's user agent at the link above.**

        r = requests.get(url=url, headers=headers)
        print(r.content)
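A slightly more defensive version of the same request might also set a timeout and check the status code, so that a rejected request fails loudly instead of returning an error page. This is a sketch rather than a required pattern: the helper name `fetch_html` and the 10-second timeout are assumptions. The dry run at the end builds the request without sending it, which is a convenient way to confirm which headers would go out.

```python
import requests

# Assumed values for illustration: the URL used earlier and a browser-like
# User-Agent string. Substitute your own as needed.
URL = "https://www.geeksforgeeks.org/data-structures/"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246"
}


def fetch_html(url, headers=None, timeout=10):
    """Return the page as text, raising requests.HTTPError on a 4xx/5xx status."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response.text


# Dry run: build the request without sending it, to inspect what would go out.
prepared = requests.Request("GET", URL, headers=HEADERS).prepare()
print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])

# To actually download the page (needs network access):
# html = fetch_html(URL, headers=HEADERS)
```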

In [17]:
# Step 3: Parsing the HTML content 