Q1. What is Web Scraping? Why is it Used? Give three areas where Web Scraping is used to get data.

In [1]:
# Example of Web Scraping using Python

import requests
from bs4 import BeautifulSoup

# URL of the website to scrape
url = 'https://example.com'

# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, 'html.parser')

    # Example: Extracting all links from the page
    links = soup.find_all('a')
    for link in links:
        print(link['href'])
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")


https://www.iana.org/domains/example


Q2. What are the different methods used for Web Scraping?

In [6]:
import requests
from bs4 import BeautifulSoup

# URL of the news website
url = 'https://timesofindia.indiatimes.com/business/budget'

# Send a GET request to the URL
response = requests.get(url)

# Parse the HTML content of the page
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the titles of the latest articles
article_titles = soup.find_all('h2', class_='article-title')

# Print the extracted titles
for title in article_titles:
    print(title.text)


Q3. What is Beautiful Soup? Why is it used?

In [7]:
import requests
from bs4 import BeautifulSoup

# URL of the webpage to scrape
url = 'https://timesofindia.indiatimes.com/business/budget'

# Send a GET request to the URL
response = requests.get(url)

# Parse the HTML content of the page using Beautiful Soup
soup = BeautifulSoup(response.text, 'html.parser')

# Extract and print the title of the webpage
title = soup.title.text
print(f"Webpage Title: {title}")

# Extract and print all the hyperlinks (a tags) on the page
for link in soup.find_all('a'):
    print(link.get('href'))

# Extract and print the text from all paragraphs (p tags) on the page
paragraphs = soup.find_all('p')
for paragraph in paragraphs:
    print(paragraph.text)

# Extract and print the content of a specific div with class 'main-content'
main_content_div = soup.find('div', class_='main-content')
print(f"Main Content:\n{main_content_div.text}")


Webpage Title: Budget 2024: Union Budget 2024 News, Latest India Budget highlights and Live Updates
https://timesofindia.indiatimes.com/
https://timesofindia.indiatimes.com/us
https://epaper.indiatimes.com/timesepaper/publication-the-times-of-india,city-delhi.cms?redirectionSource=TOIWeb
https://timesofindia.indiatimes.com
https://timesofindia.indiatimes.com
https://timesofindia.indiatimes.com/business
https://timesofindia.indiatimes.com/business/budget
https://timesofindia.indiatimes.com/business/india-business
https://timesofindia.indiatimes.com/business/international-business
https://timesofindia.indiatimes.com/business/stock-market
https://timesofindia.indiatimes.com/videos/business
https://timesofindia.indiatimes.com/business/cryptocurrency
https://timesofindia.indiatimes.com/business/financial-calculators
https://timesofindia.indiatimes.com/toi-dialogues
https://timesofindia.indiatimes.com/business/credit-score-calculator-and-credit-card
https://timesofindia.indiatimes.com/busine

AttributeError: 'NoneType' object has no attribute 'text'

Q4. Why is flask used in this Web Scraping project?

In [None]:
"""
Flask is a micro web framework for Python, and it is commonly used for building web applications and APIs. In the context of a web scraping project, Flask might be used for various reasons:

Web Interface for Users: Flask allows you to create a simple web interface through which users can interact with your web scraping tool. Users can input parameters, initiate the scraping process, and view the results through a web browser.

API Endpoints: Flask makes it easy to create RESTful API endpoints. If your web scraping project involves serving data to other applications or users programmatically, Flask can provide a convenient way to expose these data endpoints.

Asynchronous Processing: Web scraping can sometimes be time-consuming, especially when scraping a large number of pages. Flask can be combined with asynchronous processing libraries (like Celery) to handle scraping tasks efficiently in the background without blocking the main application.

Integration with Databases: If your scraping project involves storing data in a database, Flask can be used to set up endpoints for storing and retrieving data. It provides a convenient way to integrate your scraping tool with a database.

Customization and Extensibility: Flask is lightweight and modular, allowing you to customize your application based on the specific needs of your web scraping project. You can easily add middleware, templates, and other extensions to enhance functionality.

Ease of Use: Flask is known for its simplicity and ease of use. It provides a straightforward way to set up a web server without a lot of boilerplate code, making it a good choice for small to medium-sized web scraping projects.
"""

Q5. Write the names of AWS services used in this project. Also, explain the use of each service.

In [8]:
"""
Amazon EC2 (Elastic Compute Cloud):

Use: Hosting the Flask application and running the web scraping scripts.
Explanation: EC2 provides scalable virtual servers in the cloud, making it suitable for hosting web applications and running scripts or services.
Amazon RDS (Relational Database Service):

Use: Storing scraped data in a relational database.
Explanation: RDS is a managed relational database service that supports multiple database engines. It can be used to store and manage structured data efficiently.
Amazon S3 (Simple Storage Service):

Use: Storing static assets, such as images or documents.
Explanation: S3 is a scalable object storage service. It can be used to store and retrieve any amount of data, making it suitable for hosting static assets used by the web application.
AWS Lambda:

Use: Running serverless functions for periodic scraping tasks.
Explanation: Lambda allows you to run code without provisioning or managing servers. You can use it for tasks like periodic web scraping or other asynchronous operations.
Amazon SQS (Simple Queue Service):

Use: Managing the queue for scraping tasks in a distributed system.
Explanation: SQS is a fully managed message queuing service that can be used to decouple and scale microservices, distributed systems, and serverless applications.
Amazon DynamoDB:

Use: Storing NoSQL data from scraping tasks.
Explanation: DynamoDB is a fast and flexible NoSQL database service that can be used to store and retrieve any amount of data and serve any level of request traffic.

"""

'\nAmazon EC2 (Elastic Compute Cloud):\n\nUse: Hosting the Flask application and running the web scraping scripts.\nExplanation: EC2 provides scalable virtual servers in the cloud, making it suitable for hosting web applications and running scripts or services.\nAmazon RDS (Relational Database Service):\n\nUse: Storing scraped data in a relational database.\nExplanation: RDS is a managed relational database service that supports multiple database engines. It can be used to store and manage structured data efficiently.\nAmazon S3 (Simple Storage Service):\n\nUse: Storing static assets, such as images or documents.\nExplanation: S3 is a scalable object storage service. It can be used to store and retrieve any amount of data, making it suitable for hosting static assets used by the web application.\nAWS Lambda:\n\nUse: Running serverless functions for periodic scraping tasks.\nExplanation: Lambda allows you to run code without provisioning or managing servers. You can use it for tasks lik