---

---

---

# What is Web Scraping?
- Web scraping is the automated process of gathering data from websites.
- It's like a bot that navigates through a webpage and collects data based on predefined instructions.
- This data can range from product prices to text content on articles, images, or structured data in tables.<br>
<img src="WebScraping_img/Web_Scraping.jpeg" width="500px" style="display: block; margin: auto;">
<br>

### How is Web Scraping different from Web Crawling?
- Web crawling, also known as spidering, is the process of systematically navigating the internet to discover and index web pages.
- Web crawlers (or spiders) start from a set of URLs, visit each page, extract links to other pages, and continue visiting new pages in a recursive manner.
- This enables the crawler to build an extensive index of web pages across a domain or even the entire internet.
- The main goal of web crawling is to find and catalog all accessible pages on the web.
- Crawled pages are often stored in a database or index for later retrieval and use, such as by search engines or content aggregation tools.
- Web Scraping targets particular data points within a webpage, such as prices, reviews, product listings, or other structured information.
- Scraping is focused on extracting certain elements or fields from a webpage, rather than exploring links or indexing the entire page.
<img src="WebScraping_img/2.png" width="500px" style="display: block; margin: auto;">
<br>
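
The contrast between the two can be sketched with a tiny in-memory example. The three-page `SITE` dictionary, the link pattern, and the `price` class are all hypothetical stand-ins; a real crawler would download pages and use a proper HTML parser rather than regular expressions.

```python
import re
from collections import deque

# Hypothetical three-page site held in memory; a real crawler would download each page.
SITE = {
    "/": '<a href="/phones">Phones</a> <a href="/about">About</a>',
    "/phones": '<span class="price">499</span> <a href="/">Home</a>',
    "/about": "<p>About us</p>",
}

LINK_PATTERN = re.compile(r'href="([^"]+)"')
PRICE_PATTERN = re.compile(r'<span class="price">(\d+)</span>')


def crawl(start):
    """Crawling: follow links breadth-first and return every page discovered."""
    seen, queue = [start], deque([start])
    while queue:
        page = queue.popleft()
        for link in LINK_PATTERN.findall(SITE.get(page, "")):
            if link not in seen:
                seen.append(link)
                queue.append(link)
    return seen


def scrape_prices(page):
    """Scraping: pull one specific field (the price) out of a single page."""
    return [int(price) for price in PRICE_PATTERN.findall(SITE.get(page, ""))]


discovered = crawl("/")
prices = scrape_prices("/phones")
print(discovered)
print(prices)
```

The crawler's output is a list of pages, while the scraper's output is a list of values taken from one page.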

### How does Web Scraping work?
1. HTTP Requests:
    - Web scrapers send HTTP (or HTTPS) requests to servers to retrieve the HTML source of a webpage.
    - GET and POST are the most commonly used request types.
2. Parsing HTML:
    - The script navigates through the received HTML structure to identify and extract data of interest.
    - This involves extracting only the required specific data.
3. Storage:
    - After extraction, data is cleaned and stored in the desired format.
    - Data is usually stored in a database, CSV file, or spreadsheet for further analysis.
<img src="WebScraping_img/4.png" width="500px" style="display: block; margin: auto;">
<br>
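
The three steps above can be sketched with the standard library alone. This is a minimal illustration, not a full scraper: the `HTML` string is a hardcoded stand-in for what a GET request would return, and the `title` class and `TitleParser` name are assumptions made for the example.

```python
import csv
import io
from html.parser import HTMLParser

# Step 1 stand-in: in practice this string would come from an HTTP GET request.
HTML = """
<html><body>
  <h2 class="title">First post</h2>
  <p>Some text</p>
  <h2 class="title">Second post</h2>
</body></html>
"""


class TitleParser(HTMLParser):
    """Step 2: collect the text of every <h2 class="title"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())


parser = TitleParser()
parser.feed(HTML)

# Step 3: store the cleaned rows as CSV (an in-memory buffer stands in for a file).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["title"])
writer.writerows([title] for title in parser.titles)
csv_text = buffer.getvalue()
print(csv_text)
```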

### Types of Web Scraping
1. HTML Parsing:
    - HTML parsing is the most common form of web scraping.
    - It involves analyzing a web page’s HTML structure to extract relevant data.
    - Works well for websites with static content or basic HTML structures.
    - Example: Extracting blog titles, author names, and publication dates from a blog page.
2. Document Object Model (DOM) Parsing:
    - Focuses on navigating the DOM structure of a website.
    - The DOM structure refers to the hierarchy of elements of the webpage.
    - Works best with complex or dynamic websites where content might change upon certain events, such as clicking or scrolling.
<img src="WebScraping_img/HTML_vs_DOM.png" width="500px" style="display: block; margin: auto;">
<br>
<img src="WebScraping_img/dom-tree.png" width="500px" style="display: block; margin: auto;">
<br>
3. Headless Browser Scraping:
    - Headless browser scraping involves using a browser in headless mode to render web pages like a real user.
    - There is no GUI involved in headless browsing. Nothing is displayed visually on the screen.
    - Works best for websites that rely heavily on JavaScript or AJAX to load content.
    - Puppeteer is a commonly used tool to work with headless browsers.
    - Example: Extracting real-time stock prices from a financial website.
4. API-based Scraping:
    - Many websites offer APIs (Application Programming Interfaces) for structured data access.
    - This can be a more efficient and ethical alternative to traditional scraping methods.
    - Example: Extracting user information, posts, and comments from a social media platform’s API.
5. Image and Multimedia Scraping:
    - Image scraping involves extracting images, videos, or other media files from web pages.
    - Scrapers target img tags or other media tags in HTML, and download the files directly.

### Ethical Considerations
- Ethical considerations in web scraping are essential to ensure that data collection practices are conducted responsibly and in line with legal and moral obligations.
- These considerations mainly revolve around respecting website policies, data privacy, intellectual property, and transparency with users.
1. Compliance with Website Terms of Service:
    - Most websites have Terms of Service (ToS) that outline acceptable behaviors, including whether web scraping is permitted.
    - Violating these terms can result in legal repercussions, as scraping without permission may be viewed as unauthorized access.
    - It’s crucial to review and abide by the website’s policies and request explicit permission for data access if the site prohibits scraping.
    - **What To Do:** Before starting any scraping activity, read the website’s ToS and Privacy Policy carefully. When in doubt, seek permission or use alternative, sanctioned APIs.
2. Respect for Data Ownership and Intellectual Property Rights:
    - The data on a website is generally owned by the website’s creators or operators.
    - Unauthorized replication or distribution may infringe on intellectual property rights.
    - **What To Do:** Use scraped data strictly for purposes that do not violate intellectual property laws and avoid redistributing content without permission.
3. Data Privacy and User Consent:
    - Websites may contain sensitive or personal information about users, such as names, email addresses, or comments.
    - Scraping such data without explicit user consent can be a privacy breach.
    - Regulations like the GDPR (European Union) and CCPA (California, USA) impose strict guidelines on handling personal data.
    - **What To Do:** Avoid scraping personal data unless you have explicit permission. If personal data is required, ensure compliance with relevant privacy laws.
4. Rate Limits and Server Overload:
    - Websites operate with limited server resources, and excessive scraping can strain servers, which can slow down performance for other users.
    - Ethical scrapers should honor the website’s robots.txt file, which often specifies crawling frequency and areas off-limits to automated access.
    - **What To Do:** Implement rate limiting and time intervals between requests to reduce the impact on the website’s server.
5. Transparency and Disclosure:
    - Ethical web scraping involves transparency about the intent and use of the data, especially if it’s for commercial purposes.
    - Using data without context or presenting scraped data as a comprehensive view of a company’s offerings can mislead users and harm the reputation of the data’s original source.
    - **What To Do:** If using scraped data for public purposes, clearly disclose its source, the data collection process, and any limitations.
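
The advice on robots.txt and rate limiting can be sketched with Python's built-in `urllib.robotparser`. The robots.txt content, the `my-scraper` user agent, the example URLs, and the `fetch_politely` helper are all hypothetical; a real scraper would download the site's actual robots.txt and pass in a function that performs the request.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def fetch_politely(urls, fetch, user_agent="my-scraper", delay=None):
    """Call fetch(url) for each allowed URL, pausing between consecutive requests."""
    if delay is None:
        # Use the site's Crawl-delay if it declares one, otherwise wait 1 second.
        delay = robots.crawl_delay(user_agent) or 1
    results = []
    for url in urls:
        if not robots.can_fetch(user_agent, url):
            continue  # skip areas marked off-limits
        if results:
            time.sleep(delay)  # wait before every request after the first
        results.append(fetch(url))
    return results


urls = ["https://example.com/products", "https://example.com/private/admin"]
# The lambda stands in for a real download function.
fetched = fetch_politely(urls, fetch=lambda url: f"fetched {url}")
print(fetched)
```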

### Advantages of Web Scraping
1. Efficient Data Collection and Processing:
    - Web scraping allows for the automated collection of data at a large scale, offering much higher speed and efficiency than manual collection.
    - Helps save considerable time and effort, enabling faster access to information.
    - This is particularly beneficial for industries that rely on large datasets, such as e-commerce, market research, and finance.
2. Real-Time Data Access:
    - Web scraping enables real-time data extraction, allowing companies to monitor data and respond to changes immediately.
    - Access to real-time data provides businesses with a competitive edge by allowing them to adjust strategies based on the latest trends.
3. Cost-Effective Market Research:
    - Compared to traditional data collection methods, such as surveys or purchasing datasets, web scraping offers a cost-effective way to collect market data.
    - Web scraping can gather data from various websites, blogs, social media, and online forums, providing a broader view of the market landscape.
4. Enhanced Decision-Making through Data-Driven Insights:
    - Access to data-driven insights enables organizations to make better, evidence-based decisions.
    - Web scraping helps compile data that is crucial for understanding consumer behavior, trends, and competitor activities.
    - Helps companies analyze historical data to identify trends and predict future behaviors, aiding long-term strategy planning.
5. Detecting and Analyzing Fraudulent Activities:
    - By monitoring patterns in online data, web scraping can help identify potentially fraudulent activities, such as fake reviews, counterfeit product listings, or misleading advertisements.
    - Companies can use web scraping to validate information about their own products and services by comparing data across different platforms, detecting inconsistencies that may indicate fraud.
6. Enhanced SEO and Content Strategy:
    - Web scraping can help companies analyze competitors' keywords, backlinks, and content strategies to improve their own SEO performance.
    - Understanding which content performs well on competitors' websites helps companies identify successful topics and formats and adapt them.

### Disadvantages of Web Scraping
1. Legal and Ethical Risks:
    - Many websites have terms of service that prohibit or limit data scraping.
    - Extracting data without permission can lead to copyright issues, potential lawsuits, or restrictions from the website owner.
    - Scraping personal information, even if publicly available, can raise privacy issues, especially under data protection laws like GDPR.
    - Companies can face penalties for scraping personal data without consent.
2. IP Blocking and Bot Detection:
    - Websites often deploy mechanisms like CAPTCHAs, rate limits, and IP blocking to detect and block scraping bots.
    - This can interrupt scraping processes, requiring continual adjustment to circumvent these systems.
    - Many scrapers use rotating proxies to avoid detection, which can be costly.
    - IPs can also quickly become blocked, rendering scraping scripts useless.
3. Data Accuracy and Consistency Issues:
    - Websites frequently update their layouts, URLs, or data structures.
    - These changes require scrapers to be reconfigured frequently, increasing maintenance time and cost.
    - Extracted data may contain inconsistencies, missing values, or irrelevant information that requires significant preprocessing before it becomes usable.
    - Cleaning and standardizing such data can be time-intensive.
    - Keeping the data current might require constant scraping and data refresh cycles.
4. Incompatibility with Dynamic and JavaScript-Heavy Content:
    - Many modern websites use JavaScript frameworks (like React or Angular) that load content dynamically.
    - Scraping such content requires additional tools like Selenium or Puppeteer, which increase complexity.
    - JavaScript-heavy pages can be slower to load and scrape, making data extraction more time-consuming and resource-demanding.
5. Environmental Impact:
    - Large-scale scraping operations consume substantial computational resources, which contributes to energy usage and, indirectly, environmental impact.
    - This inadvertently translates to carbon emissions, an increasingly important consideration for environmentally conscious organizations.

### Alternatives to Web Scraping
1. Public APIs:
    - Many websites offer public APIs that allow developers to access structured data directly.
    - APIs provide clean and organized data formats, eliminating the need for extensive parsing or cleaning.
    - Using an official API helps avoid legal risks associated with web scraping.
2. RSS Feeds:
    - RSS (Really Simple Syndication) feeds are a way to automatically receive updates from websites in a single feed.
    - RSS feeds are updated frequently, making it easy to access new content automatically.
    - Since RSS feeds are structured in XML, they’re easy to parse and don’t require complex scraping scripts.
3. Public Datasets:
    - Data portals provide clean, verified, and well-documented datasets, which are typically updated periodically.
    - Most data portals offer free access, with datasets available in formats like CSV, JSON, or Excel.
    - Using existing datasets reduces time spent on collection and cleaning.
4. Manual Data Collection:
    - No technical setup or coding is needed, making it accessible to anyone who can access the site.
    - Can be efficient for small, one-off datasets, with no need for dedicated tools or servers.
    - It often avoids triggering anti-scraping measures.
5. Licensed Partnerships with Data Owners:
    - Partnerships can unlock data that is not available publicly, providing a competitive edge.
    - Data is usually provided in structured formats and with reliable update frequencies, making it easy to integrate.
    - Since data is obtained through agreements, this avoids most compliance issues.
<br>
<br>
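
As a rough illustration of why RSS feeds are easy to work with, the sketch below parses a small feed with the standard library's `xml.etree.ElementTree`. The feed content and URLs are made up for the example; a real feed would be downloaded from the site.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed; a real one would be fetched from the website.
RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item><title>Post one</title><link>https://example.com/1</link></item>
    <item><title>Post two</title><link>https://example.com/2</link></item>
  </channel>
</rss>"""

root = ET.fromstring(RSS)

# Every <item> element is one entry in the feed.
items = [
    {"title": item.findtext("title"), "link": item.findtext("link")}
    for item in root.iter("item")
]
print(items)
```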

 ---

---

---

# Client-Server model/architecture
- The Client-Server model is a fundamental design framework for networked applications.
- It organizes interactions between two major entities: clients (requesters) and servers (providers of resources).
- This architecture underpins most modern networks, including the internet, web applications, email systems, and various enterprise systems.
#### **Client:**
- Device or an application that initiates requests for services or resources.
- Clients are typically end-user devices (e.g., smartphones, laptops) or software applications (e.g., browsers, email clients) that communicate over a network.
#### **Server:**
- A server is a dedicated system or application that listens for and fulfills requests from clients.
- Servers provide resources, data, or services to clients, typically through a network connection.
#### **Working:**
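- The client sends a request to the server over the network; the server processes it and sends back a response.
- Servers matter for websites and apps because they host the pages, data, and application logic that clients request.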

## HTTP Request and Response
### **HTTP:**
- The HTTP (Hypertext Transfer Protocol) is the foundation of data communication on the web.
- It facilitates the exchange of information between clients and servers.
- The HTTP request-response cycle is central to how web applications function.
#### **HTTP Request:**
- An HTTP request is a message sent by the client to the server to initiate an action or request a resource.
- The request message consists of the following components:
    - Request Line: Method / Verb, Address (URI), Version
    - Headers: used for conveying additional meta-data about the request
    - Body: contains the major contents of the request message
<img src="WebScraping_img/http_request.jpg" width="500px" style="display: block; margin: auto;">
<br>
<img src="WebScraping_img/request.png" width="500px" style="display: block; margin: auto;">
<br>
#### **HTTP Response:**
- An HTTP response is the message sent by the server back to the client after processing the request.
- The response message consists of the following components:
  - Status Line: Version, Status Code, Reason Phrase
  - Headers: used for conveying additional meta-data about the response
  - Body: contains the requested contents (for example, HTML for web pages or JSON for APIs)
<img src="WebScraping_img/http_response.jpg" width="500px" style="display: block; margin: auto;">
<br>
<img src="WebScraping_img/response.png" width="500px" style="display: block; margin: auto;">
<br>
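
To make the message layout concrete, here is a minimal sketch that splits a raw HTTP/1.1 request and response into their first line, headers, and body. The two messages are handwritten examples and `split_message` is a hypothetical helper; HTTP libraries do this parsing for you, far more robustly.

```python
# Handwritten example messages; lines are separated by CRLF and a blank
# line separates the headers from the body.
RAW_REQUEST = (
    "GET /search?q=phones HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "Accept: text/html\r\n"
    "\r\n"
)
RAW_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: application/json\r\n"
    "\r\n"
    '{"status": "ok"}'
)


def split_message(raw):
    """Return (first_line, headers, body) for a raw HTTP message."""
    head, _, body = raw.partition("\r\n\r\n")
    first_line, *header_lines = head.split("\r\n")
    headers = dict(line.split(": ", 1) for line in header_lines)
    return first_line, headers, body


request_line, request_headers, request_body = split_message(RAW_REQUEST)
method, target, request_version = request_line.split(" ")

status_line, response_headers, response_body = split_message(RAW_RESPONSE)
response_version, status_code, reason = status_line.split(" ", 2)

print(method, target, request_version)
print(response_version, status_code, reason)
```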
## HTTP Methods
- HTTP methods, also known as HTTP verbs, are a fundamental part of the HTTP protocol.
- They define the action to be performed on a resource identified by a URI.
- Each method has specific semantics and is used for different purposes in client-server communication.
1. GET:
    - Used to request data from a specified resource.
    - It is the most commonly used HTTP method.
    - **Application:** Retrieving web pages, images, or other resources from a server.
<img src="WebScraping_img/get.png" width="500px" style="display: block; margin: auto;">
<br>
2. POST:
    - Used to submit data to a specified resource for processing.
    - This often results in the creation of a new resource.
    - **Application:** Submitting forms or uploading files.
<img src="WebScraping_img/post.png" width="500px" style="display: block; margin: auto;">
<br>
3. PUT:
    - Used to update an existing resource or create a new resource if it does not exist.
    - It sends data to the server to replace the current representation of the resource.
    - **Application:** Updating user details or replacing an entire user.
<img src="WebScraping_img/put.png" width="500px" style="display: block; margin: auto;">
<br>
4. PATCH:
    - Used to apply partial modifications to a resource.
    - It sends a set of instructions to update the resource rather than replacing it entirely.
    - **Application:** Updating specific fields of a user, like changing a user’s email without altering other attributes.
<img src="WebScraping_img/patch.png" width="500px" style="display: block; margin: auto;">
<br>
5. DELETE:
    - Used to remove a specified resource from the server.
    - **Application:** Deleting user accounts, posts, or other resources.
<img src="WebScraping_img/delete.png" width="500px" style="display: block; margin: auto;">
<br>
Summary
<img src="WebScraping_img/3.png" width="500px" style="display: block; margin: auto;">
<br>
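
The difference between PUT and PATCH is easy to miss, so here is a small sketch using an in-memory dictionary in place of a server. The `users` store and the `put` and `patch` functions are hypothetical and only mimic the semantics described above.

```python
# In-memory stand-in for the resources a server would hold.
users = {1: {"name": "Bruce", "email": "bruce@example.com"}}


def put(user_id, representation):
    """PUT: replace the whole resource with the representation sent."""
    users[user_id] = dict(representation)
    return users[user_id]


def patch(user_id, changes):
    """PATCH: change only the fields that were sent, keeping the rest."""
    users[user_id].update(changes)
    return users[user_id]


# PATCH keeps the name and only updates the email.
patched = dict(patch(1, {"email": "new@example.com"}))
# PUT replaces everything, so the email field is gone afterwards.
replaced = dict(put(1, {"name": "Bruce Wayne"}))

print(patched)
print(replaced)
```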
## HTTP Status Codes
- HTTP status codes are three-digit numbers sent by a server in response to a client's request made to the server.
- These codes are crucial for understanding the results of HTTP requests.
- They provide important feedback to clients about the success or failure of requests and help developers diagnose issues.
- Properly using and interpreting HTTP status codes is essential for effective communication between clients and servers in the web ecosystem.
1. 1xx - Informational:
    - These codes indicate that the request has been received and the process is continuing.
    - They are rarely used in web applications but are important for certain protocols.
    - **Examples:** 100 Continue, 101 Switching Protocols
2. 2xx - Success:
    - These codes indicate that the client's request was successfully received, understood, and accepted.
    - **Examples:** 200 OK, 201 Created, 202 Accepted
3. 3xx - Redirection:
    - These codes indicate that further action is needed to complete the request.
    - This usually involves redirection to another URI.
    - **Examples:** 300 Multiple Choices, 301 Moved Permanently, 302 Found
4. 4xx - Client Error:
    - These codes indicate that the client made an error, resulting in the request being unable to be fulfilled.
    - **Examples:** 400 Bad Request, 401 Unauthorized, 404 Not Found
5. 5xx - Server Error:
    - These codes indicate that the server failed to fulfill a valid request due to an error on the server side.
    - **Examples:** 502 Bad Gateway, 503 Service Unavailable
<img src="WebScraping_img/1.png" width="500px" style="display: block; margin: auto;">
<br>
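
Since the first digit determines the class of a status code, the grouping above can be sketched in a few lines. The `describe` helper is a hypothetical name; `http.HTTPStatus` is part of the standard library and supplies the reason phrases.

```python
from http import HTTPStatus

# The first digit of a status code identifies its class.
CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def describe(code):
    """Return a readable description such as '404 Not Found (Client Error)'."""
    category = CLASSES.get(code // 100, "Unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that are not registered in HTTPStatus have no standard phrase.
        phrase = "Unrecognized"
    return f"{code} {phrase} ({category})"


for code in (200, 301, 404, 503):
    print(describe(code))
```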
## Web Technologies
- Web technologies encompass a wide array of tools, languages, and protocols used to create, maintain, and manage websites and web applications.
- HTML provides the foundational structure, CSS handles visual presentation, and JavaScript adds dynamic behavior.
- Together, they enable developers to build rich, interactive user experiences on the web.
- Understanding these technologies is crucial for web scraping, as it allows developers to extract data from web pages effectively, even when dealing with dynamic content.
1. HTML:
    - HTML is the standard markup language used for creating web pages.
    - It provides the structure and layout of a web document by defining elements like headings, paragraphs, links, images, and other types of content using appropriate tags.
    - Tags can also be nested to create more complex structures.
2. CSS:
    - CSS is a stylesheet language used to describe the presentation of a document written in HTML.
    - It controls the layout, colors, fonts, and overall visual appearance of web pages.
    - CSS uses selectors to target HTML elements and apply styles.
3. JavaScript:
    - JavaScript is a high-level, dynamic programming language that enables interactivity and functionality on web pages.
    - It allows developers to create rich user experiences through client-side scripting.
<br>
<br>

---

---

---

# Anaconda
- Anaconda is a popular open-source distribution of Python (and R) mainly used for data science, machine learning, and scientific computing.
- Anaconda includes Conda, a package, dependency, and environment manager, making it easy to install and manage libraries and environments.
- It comes with a large set of pre-installed data science packages, including popular libraries like NumPy, Pandas, Matplotlib, and Scikit-Learn; additional libraries such as TensorFlow can be installed with conda.
- Anaconda provides easy access to Jupyter Notebook, a powerful tool for interactive coding, data visualization, and exploratory analysis.
- Conda plays a role similar to pip for installing packages, but it also creates and manages isolated environments.
### Common Conda Commands
- conda update conda: **updates conda to the latest version**
- conda update --all: **updates all packages to the latest version**
- conda env list: **lists all available environments**
- conda create --name \<env_name>: **create a new environment**
- conda activate \<env_name>: **activates an environment**
- conda deactivate: **deactivates the current working environment**
- conda list: **lists all the packages installed in the current environment**
- conda env export --name \<env_name> --file environment.yml: **export an environment to a .yml file**
- conda env create --file environment.yml: **import the exported environment in another system**
- conda remove --name \<env_name> --all: **removes the provided environment**

### Setup
1. Create a project directory
2. Install Anaconda
    - <a href="https://www.anaconda.com/" target="_blank"><b>Visit Official Anaconda Website</b></a>
3. Open Anaconda Prompt
4. Create an Anaconda environment
5. Activate the created environment
6. Install necessary packages:
    - **pandas**
    - **numpy**
    - **requests**
    - **beautifulsoup4**
    - **lxml:** recommended for parsing XML/HTML content
    - **html5lib:** alternative parser for Beautiful Soup
    - **selenium**
    - **python-chromedriver-binary:** Chrome driver for Selenium
    - **python-geckodriver:** Firefox driver for Selenium
    - **webdriver-manager:** manages and downloads web drivers automatically (recommended, for auto-updates)
    - **jupyter:** installs the main components of the Jupyter ecosystem
    - **ipykernel:** to create a jupyter kernel for an environment
7. Create an appropriate Jupyter kernel:
    - **python -m ipykernel install --user --name=\<env_name> --display-name "\<Your Env Display Name>"**
8. Launch Jupyter and create new notebooks using the appropriate kernel
<br>
<br>
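
After the setup steps, one quick way to confirm the environment is ready is to check that the main packages can be found. This is only a sketch: the `missing_packages` helper is a hypothetical name, and the list uses import names, which can differ from install names (for example, `beautifulsoup4` is imported as `bs4`).

```python
from importlib.util import find_spec

# Import names for some of the packages listed above.
REQUIRED = ["pandas", "numpy", "requests", "bs4", "lxml", "html5lib", "selenium"]


def missing_packages(names):
    """Return the names that cannot be found in the active environment."""
    return [name for name in names if find_spec(name) is None]


print("Missing:", missing_packages(REQUIRED) or "none")
```

Running this in a notebook that uses the new kernel shows whether the kernel points at the intended environment.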

---

---

---

# Requests
- The Requests module is a powerful and user-friendly HTTP library for Python, designed to make it easier to send HTTP requests.
- It abstracts away the complexities of making requests and handling responses, allowing developers to focus on building applications.

### Key Features:
- **Simplicity:** Known for its clean and straightforward syntax, making it accessible for beginners and experienced developers alike.
- **Flexibility:** Supports a wide range of HTTP methods and allows for customization, such as adding headers, query parameters, and more.
- **User-Friendly:** Compared to the built-in urllib library, Requests provides a more intuitive API, which is particularly beneficial for rapid development and prototyping.
- **Automatic Content Decoding:** Requests automatically decodes the content based on the response headers, so you can work with the data directly without worrying about the encoding.
- **Session Management:** You can persist certain parameters across multiple requests using session objects, which can be useful for maintaining state.
- **Built-in Error Handling:** Requests comes with built-in mechanisms, such as its exception classes and `raise_for_status()`, to handle common HTTP errors.
- **Community & Support:** The Requests library has a large and active community. You can easily find tutorials, documentation, and support for any issues you encounter.

### Applications of Requests:
1. **Web Scraping:**
    - Requests is often the first step in web scraping.
    - It allows developers to send HTTP requests to retrieve the HTML content of web pages, which can then be parsed and analyzed using libraries like Beautiful Soup or lxml.
2. **API Interaction:**
    - Commonly used to interact with RESTful APIs.
    - It can send various types of HTTP requests (GET, POST, PUT, DELETE) to perform operations like retrieving data, creating new records, or updating existing ones.
3. **File Uploads:**
    - Requests supports file uploads, which is essential for applications where users need to submit files to a server.
4. **Session Management:**
    - Requests provides the ability to maintain sessions across multiple requests, which is useful for applications that require authentication.
    - A web application that requires users to log in can use a session to maintain the user's login state while making subsequent requests.
5. **Data Submission:**
    - Requests can be used to submit data to web servers, especially in web forms where users enter information.
    - Example: An application could use Requests to send user feedback or comments to a server.

### GET requests
#### Query Parameters:
- Query parameters are key-value pairs included in the URI as part of the request message.
- Mainly used to filter the data received from the server.
- In a URI, these are placed after the ? and separated by &.
- **Application:** Filter phones by a price range in an e-commerce website
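
The price-range example can be sketched with the standard library to show what the final URI looks like. The shop URL and the `min_price` and `max_price` parameter names are hypothetical; with Requests, the same dictionary would be passed through the `params` argument and the library builds the query string itself.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint and filter names for an e-commerce site.
base = "https://shop.example.com/phones"
filters = {"min_price": 10000, "max_price": 20000}

# Parameters go after the "?" and are joined with "&".
url = f"{base}?{urlencode(filters)}"
print(url)

# Reading the parameters back out of the URI.
parsed = parse_qs(urlsplit(url).query)
print(parsed)
```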

### POST requests
- Since POST requests are used to send data to the server to create a resource, the Requests library provides the data parameter for this.
- We can also pass JSON data directly using the json parameter
- These parameters accept dictionary-like objects as arguments
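
To see how the two parameters differ, the sketch below uses the standard library to approximate the request body produced in each case. It is an illustration of the two encodings under that assumption, not the library's internal code; the dictionary is the same test payload sent to httpbin in the cells that follow.

```python
import json
from urllib.parse import urlencode

data = {"username": "bruce", "password": "bruce123"}

# data=... sends the dictionary form-encoded (application/x-www-form-urlencoded).
form_body = urlencode(data)

# json=... sends the dictionary serialized as JSON (application/json).
json_body = json.dumps(data)

print(form_body, len(form_body))
print(json_body, len(json_body))
```

The two lengths line up with the Content-Length values that httpbin echoes back for the form and JSON requests.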

### Headers
- Headers enable the communication of additional information, such as authorization credentials, content types, and user-agent data, between clients and servers.
- Understanding and managing headers is essential for authenticating requests, specifying response formats, handling cookies, and personalizing user experiences.

#### Common types of Headers:
- **Authorization:** Used for passing authentication tokens.
- **Content-Type:** Specifies the format of the request data, such as JSON or XML.
- **User-Agent:** Provides information about the client making the request, such as the browser or app.
- **Accept:** Informs the server about the types of content the client can process, like JSON or HTML.
- **Custom Headers:** Headers that may be unique to an API or web application.

#### Practical tips for using Headers:
- Always check API documentation to see required headers for a particular request.
- Use headers for efficient server interaction because servers expect headers for security, data formatting, and client identification.
- Perform appropriate error handling because missing or incorrect headers can lead to HTTP errors like 401 Unauthorized or 415 Unsupported Media Type.
<br>
<br>

---

In [1]:
## Imports
import requests
## Version
print(requests.__version__)

2.32.3


In [2]:
# GET
# send a request to the GitHub API
# print the status code and the contents
# GitHub API: `https://api.github.com`
uri = "https://api.github.com"
response = requests.get(uri)

In [3]:
response

<Response [200]>

In [4]:
# 200 is success
response.status_code

200

In [5]:
# print the first 1000 bytes of the response
response.content[:1000]

b'{"current_user_url":"https://api.github.com/user","current_user_authorizations_html_url":"https://github.com/settings/connections/applications{/client_id}","authorizations_url":"https://api.github.com/authorizations","code_search_url":"https://api.github.com/search/code?q={query}{&page,per_page,sort,order}","commit_search_url":"https://api.github.com/search/commits?q={query}{&page,per_page,sort,order}","emails_url":"https://api.github.com/user/emails","emojis_url":"https://api.github.com/emojis","events_url":"https://api.github.com/events","feeds_url":"https://api.github.com/feeds","followers_url":"https://api.github.com/user/followers","following_url":"https://api.github.com/user/following{/target}","gists_url":"https://api.github.com/gists{/gist_id}","hub_url":"https://api.github.com/hub","issue_search_url":"https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}","issues_url":"https://api.github.com/issues","keys_url":"https://api.github.com/user/keys","label_sea

In [6]:
# use query parameters in a GET message
# search for all GitHub repositories that contain the word `requests` and the main language used is `python`
# GitHub repository search API: `https://api.github.com/search/repositories`
uri = "https://api.github.com/search/repositories"

In [7]:
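# note: Requests encodes "+" as %2B; write "requests language:python" to apply the language qualifier
params = {"q": "requests+language:python"}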

In [8]:
response = requests.get(uri, params=params)

In [9]:
response.status_code

200

In [10]:
response.content[:2000]

b'{"total_count":246,"incomplete_results":false,"items":[{"id":33210074,"node_id":"MDEwOlJlcG9zaXRvcnkzMzIxMDA3NA==","name":"secrules-language-evaluation","full_name":"SpiderLabs/secrules-language-evaluation","private":false,"owner":{"login":"SpiderLabs","id":508521,"node_id":"MDEyOk9yZ2FuaXphdGlvbjUwODUyMQ==","avatar_url":"https://avatars.githubusercontent.com/u/508521?v=4","gravatar_id":"","url":"https://api.github.com/users/SpiderLabs","html_url":"https://github.com/SpiderLabs","followers_url":"https://api.github.com/users/SpiderLabs/followers","following_url":"https://api.github.com/users/SpiderLabs/following{/other_user}","gists_url":"https://api.github.com/users/SpiderLabs/gists{/gist_id}","starred_url":"https://api.github.com/users/SpiderLabs/starred{/owner}{/repo}","subscriptions_url":"https://api.github.com/users/SpiderLabs/subscriptions","organizations_url":"https://api.github.com/users/SpiderLabs/orgs","repos_url":"https://api.github.com/users/SpiderLabs/repos","events_url":

In [11]:
## POST
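def get_data_format(rcvd_response):
	"""Print the format of the data the server received (httpbin echoes the request headers)."""
	json_response = rcvd_response.json()
	# Content-Type looks like "application/json"; keep only the part after the slash
	data_format = json_response['headers']['Content-Type'].split("/")[-1]
	print(f"Response data format: {data_format}")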

In [12]:
# send some data to a test server
# server address: `https://httpbin.org/post`
# https://httpbin.org/ is a test service that echoes back the requests it receives
uri = "https://httpbin.org/post"

In [13]:
# the data we are sending
data = {
	"username": "bruce",
	"password": "bruce123"
}

In [14]:
response = requests.post(uri, data=data)

In [15]:
response.status_code

200

In [16]:
# with the data parameter, the payload is sent as form data
# the server reports that it received the data as x-www-form-urlencoded
response.json()

{'args': {},
 'data': '',
 'files': {},
 'form': {'password': 'bruce123', 'username': 'bruce'},
 'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate, zstd',
  'Content-Length': '32',
  'Content-Type': 'application/x-www-form-urlencoded',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.32.3',
  'X-Amzn-Trace-Id': 'Root=1-67d006a0-7d3f9a995c90278f31b3b6d2'},
 'json': None,
 'origin': '152.58.92.51',
 'url': 'https://httpbin.org/post'}

In [17]:
response.json()['headers']['Content-Type'].split("/")[-1]

'x-www-form-urlencoded'

In [18]:
get_data_format(response)

Response data format: x-www-form-urlencoded


In [19]:
# APIs usually expect data in JSON format
# send data in JSON format using POST message
uri

'https://httpbin.org/post'

In [20]:
data

{'username': 'bruce', 'password': 'bruce123'}

In [21]:
# changing the data to json
response2 = requests.post(uri, json=data)

In [22]:
response2.status_code

200

In [23]:
# with the json parameter, the payload appears under both 'data' (raw body) and 'json' (parsed)
# the server reports that it received the data as JSON
response2.json()

{'args': {},
 'data': '{"username": "bruce", "password": "bruce123"}',
 'files': {},
 'form': {},
 'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate, zstd',
  'Content-Length': '45',
  'Content-Type': 'application/json',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.32.3',
  'X-Amzn-Trace-Id': 'Root=1-67d006a1-2e2700e12410b46f0151d04f'},
 'json': {'password': 'bruce123', 'username': 'bruce'},
 'origin': '152.58.92.51',
 'url': 'https://httpbin.org/post'}

In [24]:
get_data_format(response2)

Response data format: json
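
To see why the two requests above report different formats and `Content-Length` values, the request bodies can be reproduced with the standard library alone. This is only a sketch of the encoding step, assuming the same `data` dictionary; it approximates what `requests` does for `data=` and `json=` and does not send anything over the network.

```python
import json
from urllib.parse import urlencode

# same payload as in the cells above
data = {"username": "bruce", "password": "bruce123"}

# `data=` style: key=value pairs joined with "&" (x-www-form-urlencoded)
form_body = urlencode(data)

# `json=` style: the dictionary serialized as a JSON document
json_body = json.dumps(data)

print(form_body, len(form_body))
print(json_body, len(json_body))
```

The two lengths, 32 and 45, match the `Content-Length` headers echoed by httpbin in the two responses above.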


In [25]:
# PUT
# use PUT method to update data
# address: `https://httpbin.org/put`
uri = "https://httpbin.org/put"

In [26]:
data = {
	"param1": "value1"
}

In [27]:
response = requests.put(uri, data=data)

In [28]:
response.status_code

200

In [29]:
# with `data=`, the payload is sent as form data by default
# the server's echo confirms it arrived as x-www-form-urlencoded (see `form` and `Content-Type`)
response.json()

{'args': {},
 'data': '',
 'files': {},
 'form': {'param1': 'value1'},
 'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate, zstd',
  'Content-Length': '13',
  'Content-Type': 'application/x-www-form-urlencoded',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.32.3',
  'X-Amzn-Trace-Id': 'Root=1-67d006a2-152bf6d2128277be146b4b17'},
 'json': None,
 'origin': '152.58.92.51',
 'url': 'https://httpbin.org/put'}

In [30]:
# DELETE
# use DELETE method to delete a resource
# address: `https://httpbin.org/delete`
uri = "https://httpbin.org/delete"

In [31]:
# No parameters will be passed
response = requests.delete(uri)

In [32]:
response.status_code

200

In [33]:
# `form`, `data`, `json`, etc. are all empty because we did not send any data.
response.json()

{'args': {},
 'data': '',
 'files': {},
 'form': {},
 'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate, zstd',
  'Content-Length': '0',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.32.3',
  'X-Amzn-Trace-Id': 'Root=1-67d006a3-697c3cab5c8e70a22b3f5b96'},
 'json': None,
 'origin': '152.58.92.51',
 'url': 'https://httpbin.org/delete'}

In [34]:
# Headers
# use a GET request
# url: `https://httpbin.org/headers`
headers = {
	'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Accept': 'application/json',
	'Authorization': 'Bearer YOUR_ACCESS_TOKEN',
	'Content-Type': 'application/json',
	'X-Custom-Header': 'CustomValue',
}

In [35]:
uri = "https://httpbin.org/headers"

In [36]:
response = requests.get(uri, headers=headers)

In [37]:
response.status_code

200

In [38]:
# shows the headers sent by the server to us
response.headers

{'Date': 'Tue, 11 Mar 2025 09:47:17 GMT', 'Content-Type': 'application/json', 'Content-Length': '393', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}

In [39]:
type(response.headers)

requests.structures.CaseInsensitiveDict

In [40]:
dict(response.headers)

{'Date': 'Tue, 11 Mar 2025 09:47:17 GMT',
 'Content-Type': 'application/json',
 'Content-Length': '393',
 'Connection': 'keep-alive',
 'Server': 'gunicorn/19.9.0',
 'Access-Control-Allow-Origin': '*',
 'Access-Control-Allow-Credentials': 'true'}

In [41]:
# the echoed JSON body shows the headers we sent to the server
response.json()

{'headers': {'Accept': 'application/json',
  'Accept-Encoding': 'gzip, deflate, zstd',
  'Authorization': 'Bearer YOUR_ACCESS_TOKEN',
  'Content-Type': 'application/json',
  'Host': 'httpbin.org',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
  'X-Amzn-Trace-Id': 'Root=1-67d006a5-108578ff4e271a884dc8cbf2',
  'X-Custom-Header': 'CustomValue'}}

In [42]:
# show the headers we sent to the server in a formatted way
for header, value in response.json()['headers'].items():
	print(f"{header:<16}: {value}")

Accept          : application/json
Accept-Encoding : gzip, deflate, zstd
Authorization   : Bearer YOUR_ACCESS_TOKEN
Content-Type    : application/json
Host            : httpbin.org
User-Agent      : Mozilla/5.0 (Windows NT 10.0; Win64; x64)
X-Amzn-Trace-Id : Root=1-67d006a5-108578ff4e271a884dc8cbf2
X-Custom-Header : CustomValue


In [43]:
# Response Object
# address: `https://api.github.com`
# The most common attributes of the response object are:
    # `status_code`: shows the HTTP status code of the request
    # `text`: shows the content of the response as a string
    # `content`: shows the content of the response as binary data
    # `headers`: shows all the response headers
    # `json()`: parse the server's response as JSON
uri = "https://api.github.com"

In [44]:
response = requests.get(uri)

In [45]:
response

<Response [200]>

In [46]:
response.status_code

200

In [47]:
type(response.status_code)

int

In [48]:
if response.status_code == 200:
	print("Successful Request!")
else:
	print("Unsuccessful Request!")

Successful Request!


In [49]:
# return string form
response.text

'{"current_user_url":"https://api.github.com/user","current_user_authorizations_html_url":"https://github.com/settings/connections/applications{/client_id}","authorizations_url":"https://api.github.com/authorizations","code_search_url":"https://api.github.com/search/code?q={query}{&page,per_page,sort,order}","commit_search_url":"https://api.github.com/search/commits?q={query}{&page,per_page,sort,order}","emails_url":"https://api.github.com/user/emails","emojis_url":"https://api.github.com/emojis","events_url":"https://api.github.com/events","feeds_url":"https://api.github.com/feeds","followers_url":"https://api.github.com/user/followers","following_url":"https://api.github.com/user/following{/target}","gists_url":"https://api.github.com/gists{/gist_id}","hub_url":"https://api.github.com/hub","issue_search_url":"https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}","issues_url":"https://api.github.com/issues","keys_url":"https://api.github.com/user/keys","label_sear

In [50]:
# it is a string, not a dictionary
type(response.text)

str

In [51]:
# return binary form
response.content

b'{"current_user_url":"https://api.github.com/user","current_user_authorizations_html_url":"https://github.com/settings/connections/applications{/client_id}","authorizations_url":"https://api.github.com/authorizations","code_search_url":"https://api.github.com/search/code?q={query}{&page,per_page,sort,order}","commit_search_url":"https://api.github.com/search/commits?q={query}{&page,per_page,sort,order}","emails_url":"https://api.github.com/user/emails","emojis_url":"https://api.github.com/emojis","events_url":"https://api.github.com/events","feeds_url":"https://api.github.com/feeds","followers_url":"https://api.github.com/user/followers","following_url":"https://api.github.com/user/following{/target}","gists_url":"https://api.github.com/gists{/gist_id}","hub_url":"https://api.github.com/hub","issue_search_url":"https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}","issues_url":"https://api.github.com/issues","keys_url":"https://api.github.com/user/keys","label_sea

In [52]:
# it is a bytes object, neither a dictionary nor a string
type(response.content)

bytes
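
The relationship between `content`, `text`, and `json()` can be sketched without a network call: `content` holds the raw bytes, `text` is those bytes decoded into a string, and `json()` parses that string. The payload below is a shortened, hand-written stand-in for the GitHub response above, and UTF-8 is an assumed encoding.

```python
import json

# shortened stand-in for response.content (raw bytes)
raw = b'{"current_user_url":"https://api.github.com/user"}'

# decoding the bytes gives the equivalent of response.text
text = raw.decode("utf-8")

# parsing the string gives the equivalent of response.json()
parsed = json.loads(text)

print(type(raw).__name__, type(text).__name__, type(parsed).__name__)
print(parsed["current_user_url"])
```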

In [53]:
# shows the response headers, i.e. the headers the server sent back (the headers we sent are not included)
response.headers

{'Date': 'Tue, 11 Mar 2025 09:47:09 GMT', 'Cache-Control': 'public, max-age=60, s-maxage=60', 'Vary': 'Accept,Accept-Encoding, Accept, X-Requested-With', 'ETag': '"4f825cc84e1c733059d46e76e6df9db557ae5254f9625dfe8e1b09499c449438"', 'x-github-api-version-selected': '2022-11-28', 'Access-Control-Expose-Headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset', 'Access-Control-Allow-Origin': '*', 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains; preload', 'X-Frame-Options': 'deny', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '0', 'Referrer-Policy': 'origin-when-cross-origin, strict-origin-when-cross-origin', 'Content-Security-Policy': "default-src 'none'", 'Server': 'github.com', 'Content-Type': 'application/json; charset=utf

In [54]:
# CaseInsensitiveDict 
type(response.headers)

requests.structures.CaseInsensitiveDict

In [55]:
dict(response.headers)

{'Date': 'Tue, 11 Mar 2025 09:47:09 GMT',
 'Cache-Control': 'public, max-age=60, s-maxage=60',
 'Vary': 'Accept,Accept-Encoding, Accept, X-Requested-With',
 'ETag': '"4f825cc84e1c733059d46e76e6df9db557ae5254f9625dfe8e1b09499c449438"',
 'x-github-api-version-selected': '2022-11-28',
 'Access-Control-Expose-Headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset',
 'Access-Control-Allow-Origin': '*',
 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains; preload',
 'X-Frame-Options': 'deny',
 'X-Content-Type-Options': 'nosniff',
 'X-XSS-Protection': '0',
 'Referrer-Policy': 'origin-when-cross-origin, strict-origin-when-cross-origin',
 'Content-Security-Policy': "default-src 'none'",
 'Server': 'github.com',
 'Content-Type': 'application/jso

In [56]:
for header, value in dict(response.headers).items():
	print(f"{header:<35}: {value}")

Date                               : Tue, 11 Mar 2025 09:47:09 GMT
Cache-Control                      : public, max-age=60, s-maxage=60
Vary                               : Accept,Accept-Encoding, Accept, X-Requested-With
ETag                               : "4f825cc84e1c733059d46e76e6df9db557ae5254f9625dfe8e1b09499c449438"
x-github-api-version-selected      : 2022-11-28
Access-Control-Expose-Headers      : ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset
Access-Control-Allow-Origin        : *
Strict-Transport-Security          : max-age=31536000; includeSubdomains; preload
X-Frame-Options                    : deny
X-Content-Type-Options             : nosniff
X-XSS-Protection                   : 0
Referrer-Policy                    : origin-when-cross-ori

In [57]:
# parses the response body as JSON (here, the API's index of endpoint URLs), not the request headers
response.json()

{'current_user_url': 'https://api.github.com/user',
 'current_user_authorizations_html_url': 'https://github.com/settings/connections/applications{/client_id}',
 'authorizations_url': 'https://api.github.com/authorizations',
 'code_search_url': 'https://api.github.com/search/code?q={query}{&page,per_page,sort,order}',
 'commit_search_url': 'https://api.github.com/search/commits?q={query}{&page,per_page,sort,order}',
 'emails_url': 'https://api.github.com/user/emails',
 'emojis_url': 'https://api.github.com/emojis',
 'events_url': 'https://api.github.com/events',
 'feeds_url': 'https://api.github.com/feeds',
 'followers_url': 'https://api.github.com/user/followers',
 'following_url': 'https://api.github.com/user/following{/target}',
 'gists_url': 'https://api.github.com/gists{/gist_id}',
 'hub_url': 'https://api.github.com/hub',
 'issue_search_url': 'https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}',
 'issues_url': 'https://api.github.com/issues',
 'keys_url': '

In [58]:
dict(response.json())

{'current_user_url': 'https://api.github.com/user',
 'current_user_authorizations_html_url': 'https://github.com/settings/connections/applications{/client_id}',
 'authorizations_url': 'https://api.github.com/authorizations',
 'code_search_url': 'https://api.github.com/search/code?q={query}{&page,per_page,sort,order}',
 'commit_search_url': 'https://api.github.com/search/commits?q={query}{&page,per_page,sort,order}',
 'emails_url': 'https://api.github.com/user/emails',
 'emojis_url': 'https://api.github.com/emojis',
 'events_url': 'https://api.github.com/events',
 'feeds_url': 'https://api.github.com/feeds',
 'followers_url': 'https://api.github.com/user/followers',
 'following_url': 'https://api.github.com/user/following{/target}',
 'gists_url': 'https://api.github.com/gists{/gist_id}',
 'hub_url': 'https://api.github.com/hub',
 'issue_search_url': 'https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}',
 'issues_url': 'https://api.github.com/issues',
 'keys_url': '

In [59]:
# Working with a public API
# API: `https://jsonplaceholder.typicode.com/`
# endpoint: `posts`
# perform error handling: `raise_for_status()`
# dummy post url
url = "https://jsonplaceholder.typicode.com/posts"

In [60]:
# Make a GET request to retrieve a list of posts
# error handling by using try and except.
try:
    response = requests.get(url)
    # raise_for_status() raises an HTTPError for 4xx/5xx status codes and returns None otherwise
    response.raise_for_status()
# the except block runs only if the try block raises an error (e.g. a connection failure or an HTTPError)
except Exception as e:
    # Print the error if an exception occurs
    print(e)
else:
    # Print the status code
    status_code = response.status_code
    print(f"Status Code: {status_code}")

    # If the request was successful
    if status_code == 200:
        print("\nSuccessful GET request!")
        # Parse response as JSON
        posts = response.json()

        # Print the top 3 posts
        for i in range(3):
            print(f"\nPost {i + 1}:")
            print(posts[i])
    # run if the status code is other than 200
    else:
        print("Unsuccessful GET request!")

Status Code: 200

Successful GET request!

Post 1:
{'userId': 1, 'id': 1, 'title': 'sunt aut facere repellat provident occaecati excepturi optio reprehenderit', 'body': 'quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto'}

Post 2:
{'userId': 1, 'id': 2, 'title': 'qui est esse', 'body': 'est rerum tempore vitae\nsequi sint nihil reprehenderit dolor beatae ea dolores neque\nfugiat blanditiis voluptate porro vel nihil molestiae ut reiciendis\nqui aperiam non debitis possimus qui neque nisi nulla'}

Post 3:
{'userId': 1, 'id': 3, 'title': 'ea molestias quasi exercitationem repellat qui ipsa sit aut', 'body': 'et iusto sed quo iure\nvoluptatem occaecati omnis eligendi aut ad\nvoluptatem doloribus vel accusantium quis pariatur\nmolestiae porro eius odio et labore et velit aut'}
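
`raise_for_status()` treats status codes in the 4xx and 5xx ranges as errors. The hypothetical helper below (`would_raise` is not part of `requests`) sketches that rule using only the standard library, and shows why an explicit `status_code == 200` check is stricter than the exception check.

```python
from http import HTTPStatus


def would_raise(status_code: int) -> bool:
    """Return True for the 4xx/5xx codes that raise_for_status() treats as errors."""
    return 400 <= status_code < 600


# print each code with its standard reason phrase and whether it counts as an error
for code in (200, 201, 301, 404, 500):
    phrase = HTTPStatus(code).phrase
    print(f"{code} {phrase:<25} error: {would_raise(code)}")
```

A 201 response passes `raise_for_status()` but would fail a `== 200` comparison, so requests that create a resource need their own success check.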


In [61]:
# Make a POST request to submit a post
# include the following parameters:
    # `title`
    # `body`
    # `userId` 
new_post = {
	"title": "Sample Post",
	"body": "This is a sample post",
	"userId": 101
}

In [62]:
try:
    response = requests.post(url, data=new_post)
    response.raise_for_status()
except Exception as e:
    print(e)
else:
    status_code = response.status_code
    # A successful POST that creates a resource returns 201 (Created)
    if status_code == 201:
        print(f"Status Code: {status_code}")
        print("\nSuccessful POST request!")
        post = response.json()
        print("\nPost:")
        for header, value in post.items():
            print(f"{header:<15}: {value}")
    else:
        print(f"Status Code: {status_code}")
        print("\nUnsuccessful POST request!")

Status Code: 201

Successful POST request!

Post:
title          : Sample Post
body           : This is a sample post
userId         : 101
id             : 101


---

---

---

# Beautiful Soup
- Beautiful Soup is a Python library used to parse HTML and XML documents.
- It’s especially useful for web scraping because it helps navigate, search, and modify the HTML (or XML) content fetched from a webpage.
- It transforms complex HTML documents into a tree structure, where each node corresponds to elements such as tags, text, attributes, etc.
- This makes it easy to locate and extract specific information.

### Advantages of Beautiful Soup:
- **Easy to Learn and Use:** Has a user-friendly syntax that makes it easy for users to quickly locate and extract data from web pages.
- **Flexible Parsing:** Works with different parsers, such as the built-in Python parser or lxml, offering flexibility in terms of speed and error handling.
- **Handles Broken HTML:** Automatically fixes errors in the HTML structure, allowing users to scrape data from pages that other parsers might struggle with.
- **Efficient Navigation and Search Functions:** Provides intuitive functions like find, find_all, and select to search and navigate through HTML tags and CSS selectors.
- **Integration with Other Libraries:** Integrates smoothly with libraries like Requests, which retrieve web pages before they are parsed. Also works well with Pandas for data analysis or Selenium for JavaScript-heavy pages, making it a versatile choice for a complete web scraping workflow.
- **Well-Documented and Active Community:** Has comprehensive documentation and an active community that provides resources, tutorials, and troubleshooting support, making it accessible for new users.

### Applications of Beautiful Soup:
- **Price Comparison and Monitoring:** Widely used by e-commerce companies and consumers to scrape prices from various online stores.
- **Job Listings Aggregation:** Commonly used to scrape job listings from platforms like LinkedIn, Indeed, or company career pages. This can help create job aggregators that compile positions from various sources.
- **Market Research and Sentiment Analysis:** Companies often use web scraping to collect data from forums, blogs, and review sites to analyze customer sentiment about their products or their competitors.
- **Real Estate Listings:** Useful for gathering real estate listings from sites like Zillow or Realtor.com. Data on prices, locations, features, and property availability can be scraped and analyzed to identify trends, track prices, and help potential buyers and real estate investors make informed decisions.
- **Travel and Flight Price Tracking:** Used to monitor and compare prices across different airlines, hotels, and booking platforms. By gathering this data, users can develop apps to track flight and accommodation prices, helping travelers find the best deals.

---

In [63]:
## Imports
from bs4 import BeautifulSoup

In [64]:
## Create a `soup` object
with open("html-doc.html") as file:
    # html.parser is Python's built-in parser; other parsers (e.g. lxml) can be used instead
    soup = BeautifulSoup(file, "html.parser")

In [65]:
# This is an object of BeautifulSoup
type(soup)

bs4.BeautifulSoup

In [66]:
## Basics of the `soup` object
# `prettify()`
# individual tags:
    # `title`
    # `a`
    # `p`
# `text`
# `name`
# `parent`
# `children`
# `descendants`
# `get_text()`
# `find()`
# `find_all()`
# `get()` / square bracket notation

# prettify() re-indents the HTML so that its nesting is easier to read
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>



In [67]:
# access a particular tag of the document
soup.title

<title>The Dormouse's story</title>

In [68]:
# if there are multiple occurrences of a tag, the first one is returned
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [69]:
soup.p

<p class="title"><b>The Dormouse's story</b></p>

In [70]:
# a Tag object is not a string, so string methods cannot be applied to it directly
print(type(soup.title))
soup.title

<class 'bs4.element.Tag'>


<title>The Dormouse's story</title>

In [71]:
# `.text` returns the tag's content as a string, so all string operations work on it
print(type(soup.title.text))
soup.title.text

<class 'str'>


"The Dormouse's story"

In [72]:
print(soup.title.text.upper())
print(soup.title.text.split())
print(soup.title.text.lower())

THE DORMOUSE'S STORY
['The', "Dormouse's", 'story']
the dormouse's story


In [73]:
# name
soup.title

<title>The Dormouse's story</title>

In [74]:
# print the name of the tag
soup.title.name

'title'

In [75]:
# parent is used to get the parent tag of the current tag
soup.title.parent

<head><title>The Dormouse's story</title></head>

In [76]:
# parent of parent tag of current tag
soup.title.parent.parent

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p></body></html>

In [77]:
soup.body

<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p></body>

In [78]:
# children of the current tag; `.children` returns an iterator over the direct children, not a list of tags
soup.body.children

<list_iterator at 0x24bab3202b0>

In [79]:
# iterating on the list_iterator 
for child in soup.body.children:
	print(child) # three <p> tags; the blank lines are the newline strings between them



<p class="title"><b>The Dormouse's story</b></p>


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<p class="story">...</p>


In [80]:
# title is the only child of the head tag
for child in soup.head.children:
	print(child)

<title>The Dormouse's story</title>


In [81]:
# `.descendants` recursively yields every tag and string inside the tag, no matter how deeply nested
soup.body.descendants

<generator object Tag.descendants at 0x0000024BAAFB32A0>

In [82]:
for descendant in soup.body.descendants:
	print(descendant)



<p class="title"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
Elsie
,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Tillie
;
and they lived at the bottom of a well.


<p class="story">...</p>
...


In [83]:
# get_text() returns only the plain text (as a Python string), ignoring all tags
soup.body

<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p></body>

In [84]:
soup.body.get_text()

"\nThe Dormouse's story\nOnce upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.\n..."

In [85]:
# strip() removes leading and trailing whitespace, if any
print(soup.body.get_text().strip())

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...


In [86]:
type(soup.body.get_text())

str

In [87]:
# `.text` and get_text() return the same string
soup.title.get_text()

"The Dormouse's story"

In [88]:
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [89]:
# find() returns the first occurrence of a tag
soup.find('a')

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [90]:
# find_all() returns all occurrences of a tag
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [91]:
# It is ResultSet, a list of anchor ⚓ tags
type(soup.find_all('a'))

bs4.element.ResultSet

In [92]:
### accessing the value of an attribute
# method 1: square bracket notation (raises KeyError if the attribute is missing)
soup.a['id']

'link1'

In [93]:
### get()
# method 2 (preferable to method 1: returns None instead of raising an error if the attribute is missing)
soup.a.get('class')

['sister']

In [94]:
soup.a.get('href')

'http://example.com/elsie'

In [95]:
soup.a.get('id')

'link1'

---

---

---

# What is Selenium?
- Selenium is a powerful, open-source tool used for automating web browsers.
- It is often utilized for web scraping when interacting with dynamic websites that rely on JavaScript to load content, which static scraping libraries like Beautiful Soup or Requests cannot handle effectively.
- When scraping websites using Python, Selenium acts as a web driver, automating browser actions to interact with web pages like a human user.
- It can navigate to web pages, simulate user interactions (clicks, scrolls, form submissions), and extract data directly from rendered HTML.
- Selenium was originally developed for testing web applications. Over time, it became a popular tool for web scraping due to its ability to handle dynamic, JavaScript-heavy websites.

### Key Features:
- Dynamic Content Handling:
    - Can scrape data from JavaScript-heavy sites.
    - Waits for elements to load before interacting or extracting data.
- Interaction Simulation:
    - Handles tasks such as clicking buttons, filling forms, selecting dropdowns, and scrolling pages.
    - Useful for scraping data hidden behind user interactions.
- Cross-Browser Support:
    - Works with popular browsers like Chrome, Firefox, Edge, and Safari.
- Customizable Waits:
    - Implements explicit and implicit waits to ensure elements are fully loaded before actions are performed.

### Comparison
<img src="WebScraping_img/5.png" width="500px" style="display: block; margin: auto;">
<br>

### Advantages:
- Handles JavaScript and AJAX (Asynchronous JavaScript and XML)
- Simulates user behavior
- Allows scraping data behind login screens or requiring user interaction

### Disadvantages:
- Slower than static scraping methods since it requires a full browser environment
- Heavy CPU and memory usage
- Websites may block or detect Selenium bots

# Steps for Navigating Web Pages
1. Activate virtual environment
2. Install selenium
3. Download suitable web driver

## 1. Locating Elements
- Selenium provides the function **find_element** to find and locate elements of a webpage
- There are several strategies to identify an element of a webpage:
1. ID:
<img src="WebScraping_img/6.png" width="500px" style="display: block; margin: auto;">
<br>
2. Name:
<img src="WebScraping_img/7.png" width="500px" style="display: block; margin: auto;">
<br>
3. Class:
<img src="WebScraping_img/8.png" width="500px" style="display: block; margin: auto;">
<br>
4. Tag:
<img src="WebScraping_img/9.png" width="500px" style="display: block; margin: auto;">
<br>
5. XPath:
<img src="WebScraping_img/10.png" width="500px" style="display: block; margin: auto;">
<br>
6. Element Location Strategies:
<img src="WebScraping_img/11.png" width="500px" style="display: block; margin: auto;">
<br>
<img src="WebScraping_img/12.png" width="500px" style="display: block; margin: auto;">
<br>
## 2. XPath
- What is XPath?
    - XPath (XML Path Language) is a query language used to navigate and locate elements within XML or HTML documents.
    - Selenium uses XPath as one of its locator strategies to find elements on a webpage.
    - XPath is a powerful tool for locating elements in Selenium, offering considerable flexibility and precision.
    - It's a go-to solution when working with complex web pages or when other locators are insufficient.
### Advantages:
- Locate Elements Anywhere:
    - XPath can traverse the entire DOM, allowing you to locate elements in deep nested structures or without unique identifiers.
    - Works for all elements, even those without **id**, **name**, or **class**.
- Offers Rich Syntax: XPath supports a variety of conditions and operators, enabling users to
    - Match elements by attributes
    - Locate elements by position
    - Use partial matches
- Supports Text-Based Matching:
    - Locate elements based on visible text
- Supports Relative Paths: You can locate elements without specifying their full path in the DOM, making XPath expressions robust to changes
    - Relative: **//div\[@class='example']**
    - Absolute: **/html/body/div\[1]/div\[2]**
- Navigate the DOM Hierarchy
<img src="WebScraping_img/13.png" width="500px" style="display: block; margin: auto;">
<br>
- Independent of HTML Structure
    - XPath can navigate through the DOM and locate elements that might not be directly visible or styled.
    - Elements in the DOM can exist even if they are hidden from the user's view, such as elements with CSS properties like **display: none** or **visibility: hidden**
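
The kinds of expressions listed above can be tried out without a browser. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath: paths are written relative to the root element, text matching is written as `[.='...']`, and functions such as `contains()` are not available. The page is a hypothetical, well-formed snippet; with Selenium, the full expressions shown in the comments would be passed to `find_element` using the XPath strategy.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for this illustration.
PAGE = """
<html>
  <body>
    <div class="example">
      <p>First paragraph</p>
      <p>Second paragraph</p>
    </div>
    <div>
      <a href="/login">Log in</a>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)  # root is the <html> element

# Relative path, match by attribute (full XPath: //div[@class='example'])
example_div = root.find(".//div[@class='example']")

# Match by position: second <p> in that div (full XPath: //div[@class='example']/p[2])
second_p = example_div.find("p[2]")

# Match by text (full XPath: //a[text()='Log in'])
login_link = root.find(".//a[.='Log in']")

# Absolute-style path (full XPath: /html/body/div[1]), written relative to <html>
first_div = root.find("body/div[1]")

print(second_p.text)
print(login_link.get("href"))
print(first_div.get("class"))
```

The absolute-style path breaks as soon as another element is inserted before the first `div`, while the attribute-based relative path keeps working, which is the robustness argument made above.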

### Disadvantages:
- XPath can be slower than CSS Selectors, partly because its expressions may traverse the entire DOM.
- XPath expressions can be harder to read and maintain, especially for deeply nested elements.
- Some older browsers may have limited support for advanced XPath queries.

## 3. Interacting with Elements
1. Typing Input into Fields:
    - This is achieved using the **send_keys** function
    - It's used to simulate typing into an element, such as a text input field or a text area.
    - It allows users to send keystrokes or input text programmatically as if a user were typing on a keyboard.
    - Enables simulation of key presses like Enter, Tab, etc.
        - This is achieved by using the class **Keys** from the module **selenium.webdriver.common.keys**
2. Clearing Input Fields:
    - This is achieved using the **clear** function
    - It's used to clear the text content of a text input element on a web page
    - It ensures the field is empty before entering new data (which can be done using the send_keys function)
3. Clicking Buttons:
    - This is achieved using the **click** function
    - It's used to simulate a mouse click on a web element
    - Allows users to interact with clickable elements on a web page, such as buttons, links, checkboxes, or radio buttons.
4. Submitting Forms:
    - This is achieved using the **submit** function
    - Helps to automatically submit a form without explicitly clicking the "Submit" button
    - Executes the action associated with the \<form> tag, such as navigating to a new page or processing data.
    - Direct use of submit is not common in modern web development:
        - Most forms today rely on JavaScript events or custom logic
        - For such forms, using the click function on a "Submit" button or JavaScript execution may be more reliable.
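
The interactions above can be combined into one small helper. This is a minimal sketch rather than a definitive implementation: the function name `fill_and_click` and the locators in the usage comment are hypothetical, and each locator is assumed to be a `(strategy, value)` tuple such as `(By.NAME, "q")`.

```python
def fill_and_click(driver, field_locator, text, button_locator):
    """Clear a text field, type into it, then click a button.

    `driver` is assumed to be a Selenium WebDriver; each locator is a
    (strategy, value) tuple such as (By.ID, "search").
    """
    field = driver.find_element(*field_locator)
    field.clear()              # make sure no stale text remains
    field.send_keys(text)      # simulate typing
    driver.find_element(*button_locator).click()  # simulate a mouse click
    return field

# Hypothetical usage with a real browser session:
#   from selenium import webdriver
#   from selenium.webdriver.common.by import By
#   driver = webdriver.Chrome()
#   fill_and_click(driver, (By.NAME, "q"), "selenium", (By.ID, "search-button"))
```

Clicking the button instead of calling submit follows the note above about forms that rely on JavaScript events.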

## 4. Dropdown & Multiselect
1. Dropdown:
    - Need to identify the dropdown element as usual
    - Wrap the identified element under the class **Select**, imported from the module **selenium.webdriver.support.select**
    - There are 3 ways to select an option from the dropdown list:
        - **select_by_index:** provide the index of the option
        - **select_by_value:** provide the value attribute of the option
        - **select_by_visible_text:** provide the actual text value of the option
2. Multiselect:
    - Similar to dropdown as discussed above
    - There are multiple ways to select an option:
        - **select_by_index:** provide the index of the option
        - **select_by_value:** provide the value attribute of the option
        - **select_by_visible_text:** provide the actual text value of the option
    - Similarly, there are multiple ways to deselect an option:
        - **deselect_by_index:** provide the index of the option
        - **deselect_by_value:** provide the value attribute of the option
        - **deselect_by_visible_text:** provide the actual text value of the option
        - **deselect_all:** takes no arguments
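
As an illustration of the select and deselect methods, the sketch below resets a multiselect and then picks options by their visible text. The helper name `replace_selection` and the element ID in the usage comment are hypothetical; the `select` argument is assumed to be a `Select` object wrapping a `<select multiple>` element.

```python
def replace_selection(select, visible_texts):
    """Deselect everything in a multiselect, then select the given options.

    `select` is assumed to be a selenium.webdriver.support.select.Select
    wrapping a <select multiple> element.
    """
    select.deselect_all()
    for text in visible_texts:
        select.select_by_visible_text(text)
    # all_selected_options lists the currently selected <option> elements
    return [option.text for option in select.all_selected_options]

# Hypothetical usage:
#   from selenium.webdriver.common.by import By
#   from selenium.webdriver.support.select import Select
#   select = Select(driver.find_element(By.ID, "languages"))
#   replace_selection(select, ["Python", "Go"])
```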

## 5. Scrolling
- There are several ways of scrolling a webpage using Selenium:
    - Scrolling to a specific element
    - Scrolling vertically
    - Scrolling horizontally
    - Scrolling to the page height
    - Infinite scrolling
- Scrolling actions are mainly achieved using the **execute_script** method
    - This method executes JavaScript code within the context of the currently loaded webpage
    - It allows users to directly interact with and manipulate the Document Object Model (DOM) of the page
    - Helps interact with elements that might not be accessible using Selenium's standard methods
    - Helps handle dynamically loaded content on modern, JavaScript-heavy websites
    - Parameters:
        - **script:** String containing JavaScript code
        - **args:** Optional arguments to pass to the JavaScript code, usually Web Elements or other variables; inside the script they are available as **arguments[0]**, **arguments[1]**, and so on
<img src="WebScraping_img/14.png" width="500px" style="display: block; margin: auto;">
<br>
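
To show how **script** and **args** fit together, the scrolling variants listed above can be wrapped in small helpers. This is a sketch under assumptions: the helper names are hypothetical, and `driver` is assumed to be a Selenium WebDriver.

```python
def scroll_to_element(driver, element):
    """Scroll until `element` is inside the viewport."""
    # The element is passed as an argument and read as arguments[0] in JS
    driver.execute_script("arguments[0].scrollIntoView();", element)

def scroll_by(driver, x=0, y=0):
    """Scroll by x pixels horizontally and y pixels vertically."""
    driver.execute_script("window.scrollBy(arguments[0], arguments[1]);", x, y)

def scroll_to_bottom(driver):
    """Scroll to the current bottom of the page."""
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Hypothetical usage:
#   scroll_by(driver, y=500)    # down 500 pixels
#   scroll_by(driver, x=-200)   # left 200 pixels
```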

1. Scrolling to a specific element:
    - Identify the target element on the webpage
    - Call the JavaScript method **scrollIntoView** on that element
<img src="WebScraping_img/15.png" width="500px" style="display: block; margin: auto;">
<br>
2. Scrolling Vertically:
    - Use the JavaScript method **scrollBy**
    - Specify the number of pixels to scroll by as the second argument (the first argument is the horizontal offset)
    - A positive value scrolls down
    - A negative value scrolls up
<img src="WebScraping_img/16.png" width="500px" style="display: block; margin: auto;">
<br>
3. Scrolling Horizontally:
    - Similar to scrolling vertically
    - The number of pixels is provided as the first argument
    - A positive value scrolls to the right; a negative value scrolls to the left
<img src="WebScraping_img/17.png" width="500px" style="display: block; margin: auto;">
<br>
4. Scrolling to Page Height:
    - Use the **scrollTo** method
    - Pass **document.body.scrollHeight** as the vertical position instead of a fixed pixel value
    - It refers to the total height of the content in a webpage
<img src="WebScraping_img/18.png" width="500px" style="display: block; margin: auto;">
<br>
5. Infinite Scrolling:
    - Initially, the webpage loads a fixed amount of content.
    - As the user scrolls close to the bottom of the page, a JavaScript function triggers a request to load more content dynamically
    - The new content is added to the page, and the process repeats.
###### Algorithm:
- Get the height of the currently loaded page (h1)
- Start an infinite loop
- Inside the loop, scroll down to the bottom of the page (h1) and wait briefly so that new content can load
- Get the height of the page again (h2)
- If h1 is the same as h2, break out of the loop, as no new content has been loaded
- Otherwise, update h1 to h2 and continue the loop
<img src="WebScraping_img/19.png" width="1000px" style="display: block; margin: auto;">
<br>
<br>
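
The algorithm above can be sketched as follows. The function name `scroll_until_stable`, the `pause` duration, and the `max_rounds` safety cap are assumptions added for this example; suitable values depend on the website.

```python
import time

def scroll_until_stable(driver, pause=1.0, max_rounds=50):
    """Keep scrolling to the bottom until the page height stops growing.

    Returns the final page height. `max_rounds` is a safety cap added for
    this sketch so the loop cannot run forever on endless feeds.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)        # h1
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)                                   # let new content load
        new_height = driver.execute_script(get_height)      # h2
        if new_height == last_height:                       # nothing new was loaded
            break
        last_height = new_height
    return last_height
```

A fixed pause is the simplest choice here; on slow pages an explicit wait for new elements may be more reliable.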

---

## 1. Advanced Web Interactions Intro
- This section covers advanced web interactions that go beyond basic navigation and element manipulation.
- By mastering these techniques, developers will be able to handle real-world web scraping and automation challenges, including interacting with dynamic content, handling alerts, and managing iframes.

## 2. Explicit Waits
#### What is Explicit Wait?
- Type of wait that pauses the execution of the script until a specific condition is met or a specified maximum time is reached.
- Useful when dealing with dynamic web elements that take time to appear or become interactable on the page.
- Helps avoid exceptions such as **NoSuchElementException** or **ElementNotInteractableException**
- Improves script reliability by waiting only as long as necessary
- Optimizes test execution time compared to fixed delays (e.g., time.sleep())

#### How to Implement?
- Selenium provides the **WebDriverWait** class to implement explicit waits
- It works with expected conditions defined in the **selenium.webdriver.support.expected_conditions** module
- The script checks for the condition at regular intervals (default - 500ms) until it's met or the timeout occurs
<img src="WebScraping_img/20.png" width="1000px" style="display: block; margin: auto;">
<br>
- Parameters:
    - **driver:** The WebDriver instance controlling the browser
    - **timeout:** Maximum time (in seconds) to wait for the condition to be met
    - **poll_frequency:** How often (in seconds) the condition is checked (default: 0.5 seconds)
    - **ignored_exceptions:** A tuple of exceptions to ignore while waiting (optional)
- WebDriverWait often works in conjunction with **Expected Conditions** (EC) to define what to wait for:
<img src="WebScraping_img/21.png" width="800px" style="display: block; margin: auto;">
<br>
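
To make the polling behaviour concrete, here is a simplified pure-Python model of an explicit wait. It is not Selenium's implementation: the function name `poll_until` is hypothetical, and it raises Python's built-in `TimeoutError`, whereas Selenium raises its own `TimeoutException`.

```python
import time

def poll_until(condition, subject, timeout, poll_frequency=0.5):
    """Call condition(subject) repeatedly until it returns a truthy value.

    Simplified model of an explicit wait; it is NOT Selenium's implementation.
    Raises TimeoutError if the condition is still falsy after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition(subject)
        if result:
            return result            # proceed as soon as the condition holds
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_frequency)   # wait before checking again

# The real Selenium equivalent (element ID is hypothetical):
#   from selenium.webdriver.common.by import By
#   from selenium.webdriver.support.ui import WebDriverWait
#   from selenium.webdriver.support import expected_conditions as EC
#   WebDriverWait(driver, 10).until(
#       EC.presence_of_element_located((By.ID, "results")))
```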

## 3. Implicit Waits
#### What is Implicit Wait?
- Implicit Waits are a mechanism in Selenium to specify a default waiting time for the WebDriver when **searching for elements on a webpage**.
- If an element is not immediately found, the WebDriver waits for the specified duration before throwing a **NoSuchElementException**.
- This type of wait applies globally to **all element searches in the WebDriver instance**.
#### How It Works:
- We can set up the waiting duration using the **implicitly_wait** method of the webdriver instance.
- Makes Selenium scripts resilient to minor delays in the loading of web elements caused by network speed, animations, or dynamic content.
- Once set, it applies to all **find_element** and **find_elements** methods for the lifetime of the WebDriver instance.
- If the element is found before the timeout period, the script proceeds immediately. Otherwise, it waits until the timeout is reached and raises an exception if the element is still not found.
<img src="WebScraping_img/22.png" width="800px" style="display: block; margin: auto;">
<br>
#### Advantages:
- **Simplicity:** Easy to implement and applies globally, avoiding repetitive waits for every element.
- **Resilience:** Handles minor delays in loading dynamically generated elements, reducing script failures.
- **Better Control:** Provides a default buffer for all element searches without the need for explicit handling.
#### Disadvantages:
- Since it applies globally, it may not suit situations where different elements require different wait times.
- Mixing implicit waits with explicit waits can cause unpredictable behavior, as implicit waits can interfere with explicit wait polling mechanisms.
- Implicit waits only wait for an element to be present in the DOM; they cannot wait for other conditions such as visibility, clickability, page titles, or JavaScript execution.
#### Best Practices:
- Use either implicit waits or explicit waits in your script, but not both, to avoid conflicts.
- Set a reasonable timeout; very high implicit wait times (e.g., 60 seconds) can unnecessarily delay script execution whenever an element is missing.
- Implicit waits are suitable for simple scripts without complex wait conditions.
#### Implicit vs Explicit Wait:
<img src="WebScraping_img/23.png" width="1000px" style="display: block; margin: auto;">
<br>

## 4. Frames & IFrames
#### What is a Frame/IFrame?
- In web development, **frames** and **iframes** are HTML elements that allow you to embed one HTML document inside another.
#### Frame:
- Part of the **\<frame>** and **\<frameset>** HTML tags, which were used in early web development to divide the browser window into multiple sections, each capable of loading a separate HTML document.
- Now obsolete in HTML5, frames are rarely used. They were replaced by iframes and other modern layout techniques like CSS Grid and Flexbox.
#### Iframe (Inline Frame):
- An **\<iframe>** is an HTML element that embeds another HTML document within the current page.
- Changes in the parent page (like CSS or JavaScript) typically do not affect the iframe's content, and vice versa.
- Each iframe has its own DOM (Document Object Model), CSS, and JavaScript scope.
- Interaction between the parent and iframe is restricted if they originate from different domains for security reasons.
#### Working with Selenium:
- To interact with iframe content in Selenium, users must explicitly switch the context to the iframe using **driver.switch_to.frame()**
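- The frame to switch to can be specified in several ways:
    - ID or name (passed as a string)
    - Index (zero-based position of the frame on the page)
    - WebElement (located beforehand, e.g., via XPath or a CSS selector)
- After interacting with an iframe, switch back to the main page using **driver.switch_to.default_content()**, or move up one level with **driver.switch_to.parent_frame()**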
#### Best Practices:
- Ensure you know which frame or iframe contains the desired elements by inspecting the page source.
- Whenever possible, avoid switching by index so the script stays flexible if the page structure changes; prefer a stable ID, name, or a WebElement located via **XPath** or a CSS selector.
- Always switch back to the main content after interacting with a frame.
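
One way to make sure the script always switches back is a context manager. This is a minimal sketch: the name `inside_frame` and the frame and element IDs in the usage comment are hypothetical.

```python
from contextlib import contextmanager

@contextmanager
def inside_frame(driver, frame_reference):
    """Switch into a frame and always return to the main document.

    `frame_reference` may be an ID or name, an index, or a WebElement,
    exactly as accepted by driver.switch_to.frame().
    """
    driver.switch_to.frame(frame_reference)
    try:
        yield driver
    finally:
        # Runs even if the code inside the `with` block raises
        driver.switch_to.default_content()

# Hypothetical usage:
#   with inside_frame(driver, "payment-frame"):
#       driver.find_element(By.ID, "card-number").send_keys("4111...")
```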

## 5. Handling Alerts
#### What are Alerts:
- Alerts on the web are small, temporary messages or pop-ups that appear in a web browser to communicate information or request user actions.
- They are typically generated by JavaScript or built into the HTML/CSS structure of a webpage.
- Alerts are used for various purposes, including notifying users, obtaining confirmation, or prompting for input.
#### Types of Alerts:
1. JavaScript alerts:
    - Created using JavaScript's **alert()** function.
    - Displays a simple message to the user with a single "OK" button.
    - Blocks user interaction with the page until dismissed.
2. Confirmation Alerts:
    - Created using JavaScript's **confirm()** function.
    - Asks the user to confirm an action with "OK" and "Cancel" buttons.
    - Returns true for "OK" and false for "Cancel"
3. Prompt Alerts:
    - Created using JavaScript's **prompt()** function.
    - Requests user input and provides an input field along with "OK" and "Cancel" buttons.
    - Returns the input value for "OK" or null for "Cancel".
4. Browser-Based Alerts (Authentication Pop-Ups):
    - Appear when a website requests HTTP basic authentication.
    - Requires entering a username and password
5. Custom HTML Alerts (Modal Dialogs):
    - Designed using HTML, CSS, and JavaScript to create custom alert-style messages.
    - Offers more flexibility in design and functionality (e.g., styled dialog boxes with multiple buttons or inputs).
#### Handling Alerts with Selenium:
- To interact with JavaScript alerts (alert, confirm, and prompt) in Selenium, users must explicitly switch the context to the alert box using **driver.switch_to.alert**
    - Use the **accept()** method to click the "OK" button
    - Use the **dismiss()** method to click the "Cancel" button on confirmation pop-ups
    - Use the **text** attribute to retrieve the message displayed on the alert
    - Use the **send_keys()** method to input text into a prompt pop-up
- Custom HTML alerts are part of the page's DOM, so they are located and handled like any other web element
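
The alert methods above can be combined in one helper. This is a sketch under assumptions: the function name `resolve_alert` is hypothetical, and an alert is assumed to be open already when it is called.

```python
def resolve_alert(driver, accept=True, prompt_text=None):
    """Read the alert message, optionally type into a prompt, then close it.

    Assumes a JavaScript alert, confirm, or prompt dialog is currently open.
    """
    alert = driver.switch_to.alert     # a property, not a method call
    message = alert.text               # read the message before closing
    if prompt_text is not None:
        alert.send_keys(prompt_text)   # only meaningful for prompt() dialogs
    if accept:
        alert.accept()                 # "OK"
    else:
        alert.dismiss()                # "Cancel"
    return message
```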

---

## Best Practices and Optimization Intro
- This section provides strategies for writing maintainable, efficient, and robust Selenium scripts.
- Adhering to these best practices will improve the performance of test automation or web scraping projects and make the code easier to debug and maintain.
#### 1. Write Maintainable Code
1. Page Object Model (POM):
    - The Page Object Model is a design pattern that separates the code used to locate and interact with web elements from the test or scraping logic that uses it.
    - This improves code readability and reusability
    - Each webpage is represented by a class
    - Elements are defined as variables, and interactions (methods) are encapsulated within the class
2. Variable Names:
    - Avoid overly generic names (ex: element1, button1)
    - Use descriptive names to define web elements
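
A minimal Page Object sketch is shown below. The page, its locators, and the class name `LoginPage` are hypothetical; the locator strings `"id"` and `"css selector"` are the values of `By.ID` and `By.CSS_SELECTOR`, written out so the example has no imports.

```python
class LoginPage:
    """Page object for a hypothetical login page."""

    # Locators as (strategy, value) tuples with descriptive names.
    # "id" is the value of By.ID; "css selector" is By.CSS_SELECTOR.
    USERNAME_INPUT = ("id", "username")
    PASSWORD_INPUT = ("id", "password")
    LOGIN_BUTTON = ("css selector", "button[type='submit']")

    def __init__(self, driver):
        self.driver = driver

    def log_in(self, username, password):
        """Fill in both fields and click the login button."""
        self.driver.find_element(*self.USERNAME_INPUT).send_keys(username)
        self.driver.find_element(*self.PASSWORD_INPUT).send_keys(password)
        self.driver.find_element(*self.LOGIN_BUTTON).click()

# Hypothetical usage:
#   LoginPage(driver).log_in("alice", "s3cret")
```

If the page layout changes, only the locator constants need updating; callers of `log_in` stay the same.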
#### 2. Enhance Performance
1. Using Appropriate Waits:
    - Overuse of time.sleep() can slow down scripts unnecessarily
    - Prefer Explicit Waits or Implicit Waits over fixed delays for better performance
    - Use Explicit Waits for better control
2. Optimize Locator Strategies:
    - Use ID and NAME where possible, as they are typically fast and stable locators
    - Avoid using XPath unless necessary, as it can be slower
3. Reuse Browser Sessions:
    - Instead of launching a new browser for each test, reuse the browser session if applicable
    - For scraping, reduce browser overhead by running in headless mode
<img src="WebScraping_img/24.png" width="500px" style="display: block; margin: auto;">
<br>
#### 3. Robustness and Error Handling
1. Try-Except blocks:
    - Wrap code for critical interactions within try-except blocks to manage unexpected failures
    - Catch specific exceptions if possible
<img src="WebScraping_img/25.png" width="500px" style="display: block; margin: auto;">
<br>
2. Release Resources:
    - Always close the browser session at the end of the script to free resources
    - This can be achieved using driver.quit()
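
Both points can be combined in one wrapper: catch only the exceptions you expect, and quit the browser in a `finally` block. This is a sketch; the name `run_and_quit` and its parameters are hypothetical.

```python
def run_and_quit(driver, task, expected_errors=(), default=None):
    """Run task(driver), tolerate only the expected errors, always quit.

    `expected_errors` is a tuple of exception classes to swallow, for
    example (NoSuchElementException, TimeoutException) from
    selenium.common.exceptions. Any other exception propagates.
    """
    try:
        return task(driver)
    except expected_errors:
        return default
    finally:
        driver.quit()   # runs on success and on failure
```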
#### 4. Logging and Debugging
1. Logging:
    - Avoid printing directly to the console to debug; use logs instead
    - Use Python's logging module for better control
<img src="WebScraping_img/26.png" width="500px" style="display: block; margin: auto;">
<br>
2. Capture Screenshots:
    - Save a screenshot for debugging if the code fails
    - Use the save_screenshot method of the webdriver instance
3. Developer Tools:
    - Use the browser's Developer Tools to inspect element locators and understand dynamic content of the webpage
    - Gives a better understanding of the website structure and its respective code
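
Logging and screenshots work well together. In the sketch below, the logger name, the function `run_step`, and the default screenshot path are all hypothetical.

```python
import logging

logger = logging.getLogger("scraper")

def run_step(driver, name, step, screenshot_path="failure.png"):
    """Run step(driver), logging progress and saving a screenshot on failure."""
    logger.info("starting step: %s", name)
    try:
        result = step(driver)
    except Exception:
        driver.save_screenshot(screenshot_path)
        # logger.exception records the message together with the traceback
        logger.exception("step %s failed; screenshot: %s", name, screenshot_path)
        raise
    logger.info("finished step: %s", name)
    return result

# Hypothetical usage:
#   logging.basicConfig(level=logging.INFO)
#   run_step(driver, "open home page", lambda d: d.get("https://example.com"))
```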
#### 5. Security Considerations
1. Secure Sensitive Data:
    - Avoid exposing credentials in scripts
    - Use encrypted files or environment variables for storage
2. Respect Website Policies:
    - Check the website’s robots.txt and Terms of Service before scraping
    - Be ethical in your automation and scraping practices
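
Both considerations can be handled with the standard library alone. In this sketch the environment variable names, the function names, and the example URLs are hypothetical.

```python
import os
from urllib.robotparser import RobotFileParser

def load_credentials(prefix="SCRAPER"):
    """Read credentials from environment variables instead of the script."""
    return os.environ[f"{prefix}_USERNAME"], os.environ[f"{prefix}_PASSWORD"]

def is_allowed(robots_txt, user_agent, url):
    """Check a URL against the text content of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical usage:
#   rules = "User-agent: *\nDisallow: /private/\n"
#   is_allowed(rules, "my-bot", "https://example.com/private/page")
```

robots.txt only covers crawling rules; the Terms of Service still need to be read separately.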

---

## Action Chains:
- Action Chains in Selenium is a feature that allows automating complex user interactions such as mouse and keyboard events.
- It is implemented by the **ActionChains** class in Selenium, designed to handle actions like clicking, dragging, hovering, and sending keyboard input.
#### Working of Action Chains:
- An instance of ActionChains is created by passing the WebDriver instance.
- Use various methods provided by the class to define a sequence of actions.
- Call the **perform()** method to execute all actions in the defined sequence.
#### Common Methods:
- **click()**: Clicks on a specified web element.
- **click_and_hold()**: Clicks without releasing the mouse button.
- **double_click()**: Double-clicks on the element.
- **context_click()**: Right-clicks on the element.
- **move_to_element()**: Moves the mouse pointer to the element.
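
A short sketch of chaining: hover over a menu, then click an item that the hover reveals. The helper name `hover_then_click` and the element names are hypothetical; `actions` is assumed to be an `ActionChains` instance.

```python
def hover_then_click(actions, menu, item):
    """Hover over `menu`, then click `item`.

    `actions` is assumed to be an ActionChains instance. The calls are only
    queued; nothing happens in the browser until perform() is called.
    """
    actions.move_to_element(menu).click(item).perform()

# Hypothetical usage:
#   from selenium.webdriver import ActionChains
#   hover_then_click(ActionChains(driver), menu_element, logout_link)
```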
#### Loading a Webpage:
- We can run the JavaScript statement **return document.readyState** (via execute_script) to retrieve the current loading state of the web page.
- The values returned by this command can be:
    - **loading**: The document is still loading.
    - **interactive**: The document has been loaded but external resources like images and stylesheets might still be loading.
    - **complete**: The entire document, including external resources, has been fully loaded.
- This is commonly used in Selenium scripts to wait until the page is fully loaded before interacting with elements.
#### until():
- It takes a single parameter, **method**.
- This parameter must be a callable (function or lambda) that takes the **WebDriver** instance passed to WebDriverWait as input and returns a truthy value once the condition is met; until() then returns that value.
- If the callable doesn’t return a truthy value within the timeout period, a **TimeoutException** will be raised.
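
As an example of a custom callable for until(), the function below checks document.readyState. The function name `page_is_complete` is hypothetical.

```python
def page_is_complete(driver):
    """Return True once the document and its resources have finished loading.

    Suitable as the `method` argument of WebDriverWait.until(), which calls
    it with the driver until it returns a truthy value.
    """
    return driver.execute_script("return document.readyState") == "complete"

# Usage with a real driver:
#   from selenium.webdriver.support.ui import WebDriverWait
#   WebDriverWait(driver, 10).until(page_is_complete)
```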

## "This site can't be reached":
<img src="WebScraping_img/site-cant-be-reached.png" width="800px" style="display: block; margin: auto;">
<br>

#### Causes:
- The ChromeDriver or WebDriver version might not match the installed browser version
- A proxy or VPN might interfere with the request
- Firewall or corporate network restrictions may block access
- Websites often **restrict access to bots or automated tools like Selenium**
- The website might have anti-scraping mechanisms in place
- The certificate or SSL/TLS version might not be supported or configured correctly
#### Solutions:
- Ensure your WebDriver matches the browser version
- Configure browser **options in Selenium**
- Force the browser to use HTTP/1.1
- Use a **user-agent string to make requests** appear like they are coming from a browser
- Run the browser in **incognito mode**
## Selenium Options:
- In Selenium, **Options** is used to **customize and configure browser settings and behavior** when automating browsers
- Allows you to specify preferences, enable or disable features, and set options that are specific to the browser you are automating
- Each browser driver has its own Options class to provide these configurations
#### Common Use Cases:
- Running browser in headless mode (without GUI)
- Disabling browser notifications
- Setting the default download directory
- Adding browser extensions
#### Key Methods:
- **add_argument(arg)**: **Adds a command-line argument** to the browser
- **add_experimental_option(name, value)**: Adds experimental options or preferences
- **set_capability(name, value)**: Sets desired capabilities for the browser
#### Advantages:
- Helps customize browser behavior according to your testing needs
- **Suppress unnecessary UI elements like notifications or pop-ups**
- Ensure the browser runs with specific settings each time, promoting consistency
- Improves efficiency as we can run browsers in **headless mode to save resources** during testing
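
The sketch below applies several of the settings mentioned above to an Options object. It assumes Chrome: the command-line switches and the `download.default_directory` preference are Chrome-specific and may vary between browser versions, and the helper name `apply_scraping_options` is hypothetical.

```python
def apply_scraping_options(options, user_agent=None, download_dir=None):
    """Configure a Chrome Options object for scraping (assumed switches).

    `options` is assumed to be selenium.webdriver.chrome.options.Options.
    """
    options.add_argument("--headless=new")           # run without a GUI
    options.add_argument("--incognito")              # incognito mode
    options.add_argument("--disable-notifications")  # suppress notification prompts
    options.add_argument("--disable-http2")          # fall back to HTTP/1.1
    if user_agent:
        options.add_argument(f"--user-agent={user_agent}")
    if download_dir:
        options.add_experimental_option(
            "prefs", {"download.default_directory": download_dir})
    return options

# Hypothetical usage:
#   from selenium import webdriver
#   from selenium.webdriver.chrome.options import Options
#   driver = webdriver.Chrome(options=apply_scraping_options(Options()))
```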
## getBoundingClientRect():
- JavaScript method that returns the size of an element and its position relative to the viewport; in Selenium it is called by executing JavaScript through execute_script
- Useful when standard Selenium methods can't achieve certain tasks directly, such as **precise scrolling or positioning**
- This approach ensures that Selenium scripts can interact with dynamically loaded or partially visible elements more reliably
#### Advantages:
- **Helps in cases where the elements are not immediately visible in the viewport**
- Handles scenarios where the webpage layout changes, requiring precise adjustments
- Ensures elements are well-placed for user interaction or screenshots
- Works even when the standard Selenium methods (**scrollIntoView()**, **actions.move_to_element()**, etc.) don't work as intended
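
One possible use is scrolling an element to the vertical center of the viewport. This is a sketch, not the only approach: the names `CENTER_ELEMENT_JS` and `scroll_element_to_center` are hypothetical.

```python
CENTER_ELEMENT_JS = """
const rect = arguments[0].getBoundingClientRect();
// rect.top is measured from the top edge of the viewport
const offset = rect.top + rect.height / 2 - window.innerHeight / 2;
window.scrollBy(0, offset);
return offset;
"""

def scroll_element_to_center(driver, element):
    """Scroll so that `element` sits in the vertical middle of the viewport.

    Returns the number of pixels requested from scrollBy, as computed by the
    script (negative when the element was above the center).
    """
    return driver.execute_script(CENTER_ELEMENT_JS, element)
```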

---

---

---