# Web Scraping

Web scraping is the process of automatically extracting data from websites. 

It involves retrieving the HTML content of a webpage, parsing the HTML to locate specific elements or patterns, and extracting the desired data.

<img src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fwww.quintersol.com%2Fwp-content%2Fuploads%2F2020%2F05%2Fhow_does_web_scraping_work.png&f=1&nofb=1&ipt=3fed2e7db03120b72bb4632effd2309a3d6d7f70f875063498f0ee78b791114a&ipo=images">



> Important: Please be aware that the following techniques may be illegal when used on websites that prohibit web scraping.
 

# Can you scrape every website?

* Scraping can cause traffic spikes that slow down or even crash a website's server, so not every website allows it.

* How do you know whether a website allows scraping?

* Check the website's `robots.txt` file, which lists the paths that automated clients may and may not access.


> **Try Google.com/robots.txt**

<img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*pe7HHTIwhbqJJcEfsYcfaA.png">


You can see that Google disallows automated access to many of its paths. However, it allows certain paths such as `/m/finance`, so if you want to collect finance information, that is a path its `robots.txt` permits you to scrape.
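
The same check can be done from Python with the standard library's `urllib.robotparser`. The sketch below parses a small, made-up `robots.txt` (the `SAMPLE_ROBOTS_TXT` text and the `example.com` URLs are assumptions for illustration, not Google's actual rules) so that it runs without a network connection.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, kept inline so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /m/finance
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# can_fetch(user_agent, url) reports whether the rules permit that URL.
finance_allowed = parser.can_fetch("*", "https://www.example.com/m/finance")
search_allowed = parser.can_fetch("*", "https://www.example.com/search")

print(finance_allowed, search_allowed)  # True False
```

To check a live site instead, call `parser.set_url("https://www.google.com/robots.txt")` followed by `parser.read()` before using `can_fetch`.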

## Step 1: Install necessary libraries
Ensure you have Python installed on your system. You also need the BeautifulSoup and requests libraries. Open your command prompt or terminal and run `pip install beautifulsoup4 requests`.

## Step 2: Import the libraries


In [6]:
import requests
from bs4 import BeautifulSoup

## Step 3: GET Request

To start web scraping, you need to send a GET request to the desired webpage using the requests library. This fetches the HTML content of the webpage. Here's an example:

In [7]:
url = 'https://www.example.com'  # Replace with the desired webpage URL
response = requests.get(url)
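
A request can fail or hang, so it may help to set a timeout and check the status code before parsing. The following is a minimal sketch; the helper name `fetch_html` and the `User-Agent` string are made up for illustration.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Fetch a page and return its HTML text, raising on HTTP errors."""
    # Hypothetical User-Agent value; replace it with one that identifies your script.
    headers = {"User-Agent": "my-scraper-example/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # Raises requests.HTTPError for 4xx/5xx responses
    return response.text
```

Calling `fetch_html('https://www.example.com')` inside a `try` block that catches `requests.RequestException` covers connection errors, timeouts, and HTTP error statuses in one place.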

## Step 4: Parse the HTML content
With the HTML content fetched, you need to parse it using BeautifulSoup. This library makes it easy to extract data from HTML. Here's an example:



In [10]:
soup = BeautifulSoup(response.content, 'html.parser')  # "lxml" also works as the parser if it is installed

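## Step 5: Find elements by tag name or CSS selector
Using BeautifulSoup, you can find specific elements on the webpage. The `find_all` method matches elements by tag name (and optionally by attributes), while the `select` method accepts CSS selectors. For example, to find all the links on the webpage, call `find_all` with the tag name `a`.

To see which tags and classes to target, open the page in your browser, right-click it, and choose Inspect.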

<img src="inspect.png">



In [11]:
links = soup.find_all('a')
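
To compare matching by tag name with matching by CSS selector, here is a small sketch that runs on an inline HTML snippet. The snippet and its class names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML so the example runs without a network request.
html = """
<div id="menu">
  <a class="nav" href="/home">Home</a>
  <a class="nav" href="/about">About</a>
</div>
<p><a href="https://www.example.com">External</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")               # Every <a> tag, matched by tag name
nav_links = soup.select("div#menu > a.nav")  # CSS selector: a.nav directly inside div#menu
first_nav = soup.select_one("a.nav")         # First match only, or None if nothing matches

print(len(all_links), len(nav_links), first_nav["href"])  # 3 2 /home
```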

## Step 6: Extract data from elements
Once you have found the desired elements, extract the data from them. You can access attributes like href or text to get the relevant information. Here's an example:


In [13]:
for link in links:
    href = link.get('href')  # .get() returns None instead of raising KeyError if href is missing
    text = link.text
    print(f'Link: {text}\nURL: {href}\n')

Link: More information...
URL: https://www.iana.org/domains/example



## Step 7: Handling pagination or multiple pages
If the data you want to scrape is spread across multiple pages or requires interaction, you need to handle pagination or navigate through different pages. This involves sending subsequent requests and parsing the HTML content of each page.
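
One common pattern is to follow a "next page" link until there is none left. The sketch below assumes the items are `li.item` elements and the link is marked `a.next`; both selectors, the `scrape_all_pages` helper, and the in-memory `PAGES` site are hypothetical. Real sites use different markup, so inspect the page first.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def scrape_all_pages(start_url, fetch, max_pages=50):
    """Follow "next" links from start_url and collect item texts from each page.

    `fetch` is any callable that takes a URL and returns HTML text,
    for example: lambda u: requests.get(u, timeout=10).text
    """
    items, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)  # Remember visited URLs to avoid looping forever
        soup = BeautifulSoup(fetch(url), "html.parser")
        items.extend(li.get_text(strip=True) for li in soup.select("li.item"))
        next_link = soup.select_one("a.next")  # Assumed markup for the next-page link
        href = next_link.get("href") if next_link else None
        url = urljoin(url, href) if href else None  # Stop when there is no next page
    return items


# Hypothetical two-page site held in memory so the sketch runs offline.
PAGES = {
    "https://shop.example/list?page=1":
        '<ul><li class="item">Apple</li><li class="item">Pear</li></ul>'
        '<a class="next" href="/list?page=2">Next</a>',
    "https://shop.example/list?page=2":
        '<ul><li class="item">Plum</li></ul>',
}

all_items = scrape_all_pages("https://shop.example/list?page=1", PAGES.__getitem__)
print(all_items)  # ['Apple', 'Pear', 'Plum']
```

Passing the fetch function in as a parameter keeps the loop independent of how pages are downloaded, which also makes it easy to add a delay between requests.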

## Step 8: Data storage
Finally, you may want to store the extracted data for further analysis or usage. You can save it to a file, database, or any other storage medium using appropriate methods based on your requirements.
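
As a minimal sketch of file storage, the example below writes the same records to CSV and to JSON using only the standard library. The `rows` list stands in for scraped results, and the file names `links.csv` and `links.json` are arbitrary choices.

```python
import csv
import json

# Sample rows standing in for scraped results.
rows = [
    {"text": "More information...", "url": "https://www.iana.org/domains/example"},
    {"text": "Home", "url": "https://www.example.com/"},
]

# CSV: one line per record, with a header row taken from the field names.
with open("links.csv", "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=["text", "url"])
    writer.writeheader()
    writer.writerows(rows)

# JSON: keeps the list-of-dictionaries structure as it is.
with open("links.json", "w", encoding="utf-8") as json_file:
    json.dump(rows, json_file, ensure_ascii=False, indent=2)
```

CSV suits flat, tabular records that you want to open in a spreadsheet, while JSON is a better fit when records contain nested fields.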

# Working Time

In [16]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")

print(soup.get_text())



Profile: Dionysus





Name: Dionysus

Hometown: Mount Olympus

Favorite animal: Leopard 

Favorite Color: Wine






In [27]:
# For image
import os

image_elements = soup.find_all('img')

In [33]:
import requests
from bs4 import BeautifulSoup
import os
from urllib.parse import urljoin

# Step 1: Send a GET request to the webpage
url = 'http://olympus.realpython.org/profiles/dionysus'  # Replace with the desired webpage URL
response = requests.get(url)

# Step 2: Parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Step 3: Find image elements
image_elements = soup.find_all('img')

# Step 4: Create the 'images' directory (no error if it already exists)
os.makedirs('images', exist_ok=True)

# Step 5: Download and save the images
for image in image_elements:
    image_url = image.get('src')
    if not image_url:  # Skip <img> tags that have no src attribute
        continue
    # urljoin leaves absolute URLs unchanged and resolves relative ones against the page URL
    image_source = urljoin(url, image_url)

    image_name = os.path.basename(image_source)
    image_path = os.path.join('images', image_name)  # Define the path where the image will be saved

    # Sending a GET request to the image URL and saving the content to a file
    image_response = requests.get(image_source)
    with open(image_path, 'wb') as file:
        file.write(image_response.content)

    print(f'Saved image: {image_name}')

print('All images have been saved.')


Saved image: dionysus.jpg
Saved image: grapes.png
All images have been saved.


# Try Next

In [19]:
import requests
from bs4 import BeautifulSoup
import csv
# Make a request
page = requests.get(
    "https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/")
soup = BeautifulSoup(page.content, 'html.parser')

# Create all_products as an empty list
all_products = []

# Extract each product's fields and store them in all_products
products = soup.select('div.thumbnail')
for product in products:
    name = product.select('h4 > a')[0].text.strip()
    description = product.select('p.description')[0].text.strip()
    price = product.select('h4.price')[0].text.strip()
    reviews = product.select('div.ratings')[0].text.strip()
    image = product.select('img')[0].get('src')

    all_products.append({
        "name": name,
        "description": description,
        "price": price,
        "reviews": reviews,
        "image": image
    })


keys = all_products[0].keys()

with open('products.csv', 'w', newline='') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(all_products)


# Further reading
* [Web Scraping Pagination: A Simple Web Scraper Tutorial](https://rayobyte.com/blog/web-scraping-pagination/)
* [Web Scraping Python Tutorial: How to Scrape Data from a Website](https://www.freecodecamp.org/news/web-scraping-python-tutorial-how-to-scrape-data-from-a-website/)
* [Beautiful Soup Web Scraping Tutorial in Python](https://realpython.com/beautiful-soup-web-scraper-python/)
* [Web Scraping for Beginners](https://www.sitepoint.com/web-scraping-for-beginners/)
* [Best Two Non-Coding Web Data Scraping Tools You Need in Your Toolkit](https://medium.datadriveninvestor.com/best-two-non-coding-web-data-scraping-tools-you-need-in-your-toolkit-be31c7c8693e)



# Tools you can use

<img src="https://www.upwork.com/mc/documents/Install-ParseHub.png">

# Some Chrome Extensions

<img src="https://lh3.googleusercontent.com/sJR755-g7jgngpgP2zddmlkwblliOh7cav72X-s9QrwlmqemAaUFLCZJw0J6EgNjXBwiOu0qXC8duFwVz58P0uSQ=w128-h128-e365-rj-sc0x00ffffff">

https://chrome.google.com/webstore/detail/grepsr-web-scraping-tool/hjdijkhlfpeafghibmiabeofkiicdnjm



----


<img src="https://static.dataminer.io/prod/img/dm/logo-long-h60.png"> 

https://chrome.google.com/webstore/detail/data-scraper-easy-web-scr/nndknepjnldbdbepjfgmncbggmopgden


---



<img src="https://lh3.googleusercontent.com/M2K0VbhN1C9reWRv8_65g6q3cPJnhBX9EKsYabhvY7Uu6hPOQyBphx3yrGow_nTsPspCwX69lfJb07Z1bSBbqmKToA=w128-h128-e365-rj-sc0x00ffffff">

https://chrome.google.com/webstore/detail/web-scraper-free-web-scra/jnhgnonknehpejjnehehllkliplmbmhn

https://www.scrapingbee.com/blog/web-scraping-tools/

# Web Scraping HTML Tables Without BeautifulSoup or Any Scraping Tool

In [35]:
# Import the pandas library
import pandas as pd
# Scrape the first table on the page and store it as a DataFrame named df_award1
df_award1 = pd.read_html('https://en.wikipedia.org/wiki/Grammy_Award_records')[0]
# View the first five rows of the DataFrame
df_award1.head()

|   | Rank | Artist | Awards |
|---|------|--------|--------|
| 0 | 1 | Beyoncé[a] | 32 |
| 1 | 2 | Georg Solti | 31 |
| 2 | 3 | Quincy Jones | 28 |
| 3 | 4 | Alison Krauss[b] | 27 |
| 4 | 4 | Chick Corea | 27 |


<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*hfButBVbwnRkkn5W5E5f1g.png">

Learn more here: [Web Scraping HTML Tables Without BeautifulSoup or Any Scraping Tool](https://python.plainenglish.io/web-scraping-html-tables-without-beautifulsoup-or-any-scraping-tool-b660803feca7)