# Data hunting and gathering (part 2)

<img style = "border-radius:20px;" src = "http://unadocenade.com/wp-content/uploads/2012/09/cavalls-de-valltorta.jpg">

# Session 2: Creating Our Own Web API - Scraping

In this session, we dive into the world of web scraping, focusing on advanced techniques and tools to extract data from dynamic web pages.

## Learning Objectives
- **Understanding HTML and CSS**: Grasp the basics of HTML structure and CSS styling to navigate web pages effectively.
- **XPath Selectors**: Learn how to use XPath selectors to pinpoint and extract specific content from web pages.
- **Scraping Dynamic Content with Selenium**: Understand how to scrape dynamically generated content, which standard scraping tools can't always handle.

## Additional Python Libraries
You will need to install these Python libraries for scraping tasks:

- **lxml**: A powerful library for processing XML and HTML in Python.
  - Install via pip: `pip install lxml`
- **selenium**: An automated web browser tool for testing web applications, also useful for complex scraping tasks.
  - Install via pip: `pip install selenium`

### Note
Ensure that all software and libraries are installed before the session. This will enable you to actively participate in the exercises and follow along with the scraping examples.

## 1. "Making Your Own API": Web Scraping

### Understanding Web Scraping
Web scraping becomes essential when data is available on the web but isn't accessible through an API, or the existing API lacks certain functionalities or has restrictive terms of service. In such scenarios, **Web Scraping** is the technique that enables automated extraction of this data, replicating the access a human would have visually.

### Why Web Scraping?
- **Data Accessibility**: Sometimes, the only way to access certain data is directly from the web pages where it is displayed.
- **Flexibility**: Web scraping allows you to tailor data extraction to specific needs, bypassing limitations of existing APIs.

### Preparing for Web Scraping: Understanding Web Page Structure
Before delving into scraping, it's crucial to have a basic understanding of web page structure and how data is stored and presented. This session covers:

#### Basic HTML and CSS Static Pages
- **HTML (HyperText Markup Language)**: The standard markup language used to create web pages. Understanding HTML is key to identifying the data you want to scrape.
- **CSS (Cascading Style Sheets)**: Used for describing the presentation of a document written in HTML. Knowing CSS helps in pinpointing specific elements on a page.

#### Dynamic HTML
- **Basic JavaScript Example Using jQuery**: Websites often use JavaScript to load data after the initial page has been delivered. Understanding how this works is crucial for scraping data from such dynamic pages.

### 1.1 Basic HTML + CSS 101

#### Understanding the Foundation of Web Pages

The most fundamental web pages are constructed using HTML and CSS. These technologies serve two primary purposes: **HTML (Hypertext Markup Language)** structures and stores the content, making it the primary target for web scraping, while **CSS (Cascading Style Sheets)** formats and styles the content, highlighting visual elements like fonts, colors, borders, and layout.

#### HTML: The Structure of the Web
HTML is a markup language typically rendered by web browsers. It uses 'tags' to define elements on a web page. A typical tag format includes a tag name, attributes (if any), and the content between opening and closing tags.

#### Key Components of an HTML File

- **DOCTYPE Declaration**: 
  - Begins with `<!DOCTYPE html>`, indicating the use of HTML5.
  - Earlier HTML versions had different DOCTYPEs.

- **HTML Tag**: 
  - The `html` tag (and its closing `/html` tag) encloses the entire web page content.

- **Head and Body**: 
  - The `head` section often includes the `title` tag, defining the webpage's name, links to CSS stylesheets, and JavaScript files for dynamic behavior.
  - The `body` contains the visible webpage content.

- **Common HTML Elements**:
  - **Headings and Paragraphs**: Use `h#` (where # is a number) for headings and `p` for paragraphs.
  - **Hyperlinks**: Defined with the `href` attribute in `a` (anchor) tags.
  - **Images**: Embedded using `img` tags with the `src` attribute. Note: `img` is self-closing.
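
To make the tag and attribute structure concrete, here is a minimal sketch that uses Python's standard-library `html.parser` to list every opening tag and its attributes. The page content and the `TagCollector` class name are invented for illustration; they are not part of the course files.

```python
from html.parser import HTMLParser

# A small page using the elements described above (hypothetical content)
PAGE = """<!DOCTYPE html>
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1>About HTML</h1>
    <p>Read more at <a href="https://www.w3.org/">the W3C</a>.</p>
    <img src="files/rubberduck.jpg">
  </body>
</html>"""


class TagCollector(HTMLParser):
    """Record every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))


collector = TagCollector()
collector.feed(PAGE)

# Print each tag name followed by its attribute dictionary
for tag, attrs in collector.tags:
    print(tag, attrs)
```

The `href` of the `a` tag and the `src` of the `img` tag show up as attributes, which is typically where a scraper finds links and image locations.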

#### Exercise: Build a Basic HTML Web Page

Let's put your HTML knowledge into practice:

- Create a file named 'example.html' in your favorite text editor.
- Build a basic HTML web page containing elements like `title`, `h1`, `p`, `img`, and `a` tags. Remember that nearly all tags need to be closed with a `/tag`.

This exercise aims to familiarize you with the basic structure of HTML and how various elements come together to form a web page.

If you prefer to skip this step, open the files folder and double-click "example.html". You can inspect the HTML code by running the following cell.

In [3]:
%%html

<!-- Start of the HTML head section -->
<head>
    <!-- Title of the webpage -->
    <title>
        Basic knowledge for web scraping.
    </title>	
</head>
<!-- Start of the HTML body section -->
<body>
    <!-- Header 1 indicating the subject of the content -->
    <h1>About HTML
    </h1>
    <!-- Paragraph explaining what HTML is and providing a link for further information -->
    <p>HTML (HyperText Markup Language) is the basic language for publishing content on the web. It is a tag-based language. You can read more about it at the <a href="http://www.w3.org/community/webed/wiki/HTML">World Wide Web Consortium.</a></p>
    
    <!-- Paragraph indicating that one of the following images is clickable -->
    <p> One of the following rubberduckies is clickable
    </p>
    <!-- Image of a rubber ducky; this one is not clickable -->
    <p>
        <img src = "files/rubberduck.jpg"/>
    
        <!-- Clickable image (hyperlinked) of a rubber ducky -->
        <a href="http://www.pinterest.com/misscannabliss/rubber-duck-mania/"><img src = "files/rubberduck.jpg"/></a>
    </p>
</body>

### Understanding Old Style vs. Current HTML Static Pages

#### Old Style HTML Static Pages
Old style HTML pages often relied on tables and lists for structuring content. 

- **Lists**:
  - **Ordered Lists (`ol`)**: Used for creating lists where order matters, with each item represented by `li` (list item).
  - **Unordered Lists (`ul`)**: Used for lists where order is not important, again using `li` for each item.

- **Tables**:
  - The `table` tag is used to create a table.
  - Each table row is marked with a `tr` tag.
  - Cells within each row are defined by `td` (table data) elements; the position of a cell in its row determines its column.
  - Tables may include a header (`thead`) and a body (`tbody`).
  - `th` elements are used similarly to `td` but for header cells.
  - To span a cell across multiple columns, set its `colspan` attribute to the number of columns to cover.
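
As a rough sketch of how this table structure maps to data, the snippet below reads a small table into Python lists with BeautifulSoup. The table contents are made up for illustration, and the example assumes `beautifulsoup4` is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical table, invented for this example
TABLE_HTML = """
<table>
  <thead>
    <tr><th>Player</th><th>Team</th></tr>
  </thead>
  <tbody>
    <tr><td>Ana</td><td>Reds</td></tr>
    <tr><td>Ben</td><td>Blues</td></tr>
    <tr><td colspan="2">Totals pending</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(TABLE_HTML, "html.parser")

# Header cells live in <th> elements
header = [th.get_text(strip=True) for th in soup.find_all("th")]

# Each <tr> in the body is one row; each <td> inside it is one cell
rows = []
for tr in soup.find("tbody").find_all("tr"):
    rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

print(header)
print(rows)
```

Note that the row using `colspan` yields a single cell, so rows in a scraped table do not always have the same length.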

#### Current HTML Static Pages
Modern HTML pages focus more on using containers and CSS for layout and styling.

- **Divisions (`div`)**: 
  - The `div` tag signifies a division and is used to define a block of content. It is a versatile container used in modern web design.

- **Spans (`span`)**: 
  - The `span` tag is used to highlight or style a specific part of a block of content. It's an inline container and is often used for small-scale modifications to text or other elements.

Both old and current styles of HTML have their uses, but modern practices favor the use of `div` and `span` along with CSS for more flexible and responsive design.


### Understanding CSS for Web Scraping

CSS, which stands for Cascading Style Sheets, is a stylesheet language used to describe the presentation and formatting of HTML documents. In web scraping, understanding CSS is crucial for effectively navigating and extracting data from web pages.

#### What is CSS?
- **CSS** is a language designed to style the content of HTML files. By using CSS, web developers define how various HTML elements should appear on a webpage.
- The term **"cascading"** refers to the priority given to specific style rules over more generic ones. This hierarchy is a fundamental aspect of CSS.

#### The Role of CSS in Web Scraping
- **Separation of Concerns**: CSS allows for a clear separation between the structure of HTML (content) and the style of the webpage (appearance). This separation makes webpages easier to design and maintain, and also easier to scrape.
- **Selectors and Properties**:
  - **Selectors** are patterns used to select the element(s) you want to style, or in the case of scraping, the elements you want to extract.
  - **Properties** are the aspects of the elements you want to style, such as color, font, width, height, and more.
- **Cascading Order**:
  - Styles are applied in order of specificity, with more specific selectors overriding more general ones. Inline styles (directly within an HTML element) have the highest specificity.

#### Example in Web Scraping
Consider a webpage with the following HTML and CSS:

```html
<!-- HTML Example -->
<div class="product-description">
    <p>Awesome product</p>
</div>
```

And the corresponding CSS:

```css
/* CSS Example */
.product-description p {
    color: blue;
}
```

In this example, the CSS targets a `p` element within a `div` of the class `product-description` and changes its text color to blue. Understanding how this CSS rule applies helps in scraping data accurately.
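
The same selector can be reused for extraction. The sketch below, assuming `beautifulsoup4` is installed, applies `.product-description p` to the snippet above; the second `div` with class `footer` is a hypothetical addition that shows what the selector leaves out.

```python
from bs4 import BeautifulSoup

HTML = """
<div class="product-description">
    <p>Awesome product</p>
</div>
<div class="footer">
    <p>Contact us</p>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Same selector as the CSS rule: <p> elements inside .product-description
matches = soup.select(".product-description p")
texts = [p.get_text(strip=True) for p in matches]

print(texts)
```

Only the paragraph inside the `product-description` block is returned; the footer paragraph is ignored.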

### Conclusion
For web scraping, CSS is not just about understanding webpage aesthetics; it's about comprehensively understanding the webpage's structure. This understanding is crucial for effective data extraction.

### Python Example: Web Scraping Using CSS Selectors

This example demonstrates how to scrape a website (Python's official blog) and extract specific content using CSS selectors in Python. We use the `requests` and `BeautifulSoup` libraries to accomplish this.

#### Script Explanation

1. **Import Libraries**:
   - `requests` for sending HTTP requests.
   - `BeautifulSoup` from `bs4` for parsing HTML content.

   ```python
   import requests
   from bs4 import BeautifulSoup
   ```

In [None]:
%pip install requests beautifulsoup4

In [None]:
import requests
from bs4 import BeautifulSoup

# URL of the website we want to scrape
# In this case, we are targeting the Python.org blog page
url = "https://www.python.org/blogs/"

# Send a GET request to the specified URL
# This request fetches the HTML content of the webpage
response = requests.get(url)

# Parse the HTML content of the page
# 'BeautifulSoup' is a Python library for parsing HTML documents
# It creates a parse tree from page source code that can be used to extract data easily
soup = BeautifulSoup(response.text, 'html.parser')

In [None]:
# Display the parsed document to see what `soup` contains:
soup

In [None]:
# Use a CSS selector to extract specific elements
# Here, we select all 'h2' elements (commonly used for titles/headings in HTML)
# 'select' is a method that finds all instances of a tag with the specified CSS path
titles = soup.select('h2')

# Iterate through the extracted titles and print them
# 'get_text()' extracts the text part of the HTML element, and 'strip()' removes leading/trailing whitespaces
for title in titles:
    print(title.get_text().strip())

# This script prints out all the text content of 'h2' tags found on the Python blog page
# It provides an example of how to extract and print specific parts of a webpage

### Web Scraping Exercise: Extracting News Headlines from BBC Technology

#### Objective
Write a Python script to scrape headlines from BBC's Technology news section and categorize them based on keywords.

#### Task Details

1. **Website to Scrape**:
   - Target the BBC's 'Technology' section: [BBC Technology News](https://www.bbc.co.uk/news/technology).

2. **Scraping Requirement**:
   - Scrape the main headlines from the page, typically found in `h3` tags or a specific class.

3. **Categorization**:
   - Categorize the headlines based on predefined keywords like 'Apple', 'Microsoft', 'Google', etc.
   - Count the number of headlines that fall into each category.

4. **Output**:
   - Print each headline along with its respective category.
   - Summarize with the count of headlines in each category.

In [9]:
import requests
from bs4 import BeautifulSoup

# URL of the BBC technology news section
url = "https://www.bbc.co.uk/news/technology"

# Send a GET request and parse the HTML content
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Define categories and associated keywords
categories = {
    'Apple': ['Apple', 'iPhone', 'iPad'],
    'Microsoft': ['Microsoft', 'Windows', 'Bill Gates'],
    'Google': ['Google', 'Android', 'Alphabet']
    # Add more categories as needed
}

# Function to determine the category of a headline
def categorize_headline(headline):
    # Logic to determine the category based on keywords
    # Return the category name if a keyword is found, else return 'Other'
    pass

# Scrape and process the headlines
# Look for 'h3' tags or other relevant tags
# Use the categorize_headline function to categorize each headline
# Print each headline and its category

# Print the count of headlines in each category

## 1.3 Selecting Elements with XPath

XPath, or XML Path Language, is a versatile and robust tool for navigating and selecting elements within HTML documents. While Beautiful Soup and requests are commonly used libraries for web scraping, XPath offers a unique and powerful approach to extracting data from web pages.

### What is XPath?

XPath was originally designed for navigating XML documents, but it is equally applicable to HTML, which shares a structural similarity with XML. XPath allows you to specify the precise location of elements or data within an HTML document using a concise and expressive syntax.

### Key Differentiators:

Here are some key differentiators that set XPath apart from other web scraping approaches:

1. **Granular Selection**: XPath provides granular control over element selection. Unlike Beautiful Soup, which often requires multiple iterations and filtering, XPath allows you to pinpoint elements directly based on their attributes, tags, or positions within the document.

2. **Hierarchical Navigation**: XPath excels at navigating the hierarchical structure of HTML documents. It enables you to traverse the document tree, moving up, down, or across branches with ease.

3. **Precise Queries**: With XPath, you can create precise queries to extract specific data. For example, you can target elements with specific attributes, such as selecting all `<a>` elements with a particular class or locating elements within specific parent elements.

4. **Text Extraction**: XPath's `text()` node test selects the text nodes of an element, so the text can be extracted in the same query that locates the element. This is particularly useful for scraping text data, such as headlines, paragraphs, or product descriptions.

### How to Use XPath:

To utilize XPath for web scraping, you typically follow these steps:

1. **Send an HTTP Request**: Use a library like requests to send an HTTP GET request to the webpage you want to scrape. This retrieves the HTML content of the page.

2. **Parse the HTML**: Once you have the HTML content, parse it with lxml, for example through its `lxml.html` module. This step constructs a tree representation of the webpage that you can navigate with XPath.

3. **Construct XPath Expressions**: Formulate XPath expressions that target the specific elements or data you wish to extract. XPath expressions can vary in complexity, allowing you to adapt to different webpage structures.

4. **Apply XPath Expressions**: Apply your XPath expressions to the parsed HTML document to select the desired elements or data. This process effectively filters the HTML content to capture only what you need.

5. **Retrieve and Process Data**: Retrieve the selected elements or data using the XPath queries and process them as needed for your scraping task.

In summary, XPath is a powerful tool for web scraping that offers precise and efficient element selection within HTML documents. While libraries like Beautiful Soup and requests are valuable, XPath provides an additional layer of control and flexibility, making it a valuable choice for advanced scraping projects.
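
The steps above can be sketched as follows, assuming `lxml` is installed. To keep the example runnable offline, a hard-coded byte string stands in for `response.content` from step 1; the page content and the `headline` class are invented for illustration.

```python
from lxml import html

# Step 1 stand-in: this byte string plays the role of response.content
PAGE = b"""
<html><body>
  <h3 class="headline">First story</h3>
  <h3 class="headline">Second story</h3>
  <p>Not a headline</p>
</body></html>
"""

# Step 2: parse the HTML into an element tree
tree = html.fromstring(PAGE)

# Steps 3 and 4: build an XPath expression and apply it to the tree
raw = tree.xpath('//h3[@class="headline"]/text()')

# Step 5: clean up the selected text nodes
headlines = [t.strip() for t in raw if t.strip()]

print(headlines)
```

With a live page, replace `PAGE` with the bytes returned by `requests.get(url).content` and adapt the XPath to that page's structure.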


### Understanding XPath Syntax

- **Absolute Path (`/`)**: 
  - Using a single slash indicates an absolute path from the root element.
  - Example: `xpath('/html/body/p')` selects all paragraph (`<p>`) elements directly under the `<body>` within the `<html>` root element.

- **Relative Path (`//`)**:
  - Double slashes select matching nodes at any depth, so the expression does not have to spell out the full path from the root.
  - Example: `xpath('//a/div')` finds all `<div>` elements that are direct children of an `<a>` tag, wherever that `<a>` appears in the document. Use `xpath('//a//div')` to match `<div>` descendants at any depth below `<a>`.

- **Wildcards (`*`)**:
  - The asterisk acts as a wildcard, representing any element.
  - Example: `xpath('//a/div/*')` selects all elements that are children of `<div>` tags under `<a>` tags, anywhere in the document.
  - Another example: `xpath('/*/*/div')` finds `<div>` elements that are grandchildren of the root element, that is, two levels below it.

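- **Selecting Specific Elements (Using Brackets)**:
  - If a step matches multiple elements, you can pick one by position using brackets. Positions start at 1.
  - Example: `xpath('//a/div[1]')` selects the first `<div>` child of each `<a>`; `xpath('//a/div[last()]')` selects the last `<div>` child of each `<a>`. To take the first match of the whole result set, wrap the path in parentheses: `xpath('(//a/div)[1]')`.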
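
The following sketch tries these path forms on a small, invented document, assuming `lxml` is installed. It uses `<div>` containers with `<a>` children rather than the `//a/div` shape above, but the syntax rules are the same.

```python
from lxml import html

# Hypothetical document, invented for this example
DOC = """
<html><body>
  <p>top-level paragraph</p>
  <div id="menu"><a href="/a">A</a><a href="/b">B</a></div>
  <div id="content"><p>nested paragraph</p><a href="/c">C</a></div>
</body></html>
"""

tree = html.fromstring(DOC)

# Absolute path: only <p> elements that are direct children of <body>
direct_p = [p.text for p in tree.xpath('/html/body/p')]

# Double slash: <p> elements at any depth
all_p = [p.text for p in tree.xpath('//p')]

# Wildcard: every child element of any <div>
div_children = [el.tag for el in tree.xpath('//div/*')]

# Position applies per parent: the first <a> child of each <div>
first_per_div = [a.text for a in tree.xpath('//div/a[1]')]

# Parentheses apply the position to the whole result set
first_overall = [a.text for a in tree.xpath('(//div/a)[1]')]

# last() picks the last <a> child of each <div>
last_per_div = [a.text for a in tree.xpath('//div/a[last()]')]

print(direct_p, all_p, div_children)
print(first_per_div, first_overall, last_per_div)
```

Comparing `first_per_div` with `first_overall` shows why the parentheses matter when you want a single element.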

### Working with Attributes

- **Selecting Attributes (`@`)**:
  - The `@` symbol is used to work with element attributes.
  - Example: `xpath('//@name')` selects all attributes named 'name' in the document.
  - To select `<div>` elements with a 'name' attribute: `xpath('//div[@name]')`.
  - To select `<div>` elements without any attributes: `xpath('//div[not(@*)]')`.
  - To find `<div>` elements with a specific 'name' attribute value: `xpath('//div[@name="chachiname"]')`.
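
A short sketch of these attribute queries, assuming `lxml` is installed. The document is invented for illustration and reuses the `chachiname` value from the text.

```python
from lxml import html

# Hypothetical document, invented for this example
DOC = """
<html><body>
  <div name="chachiname">named one</div>
  <div name="other">named two</div>
  <div>plain</div>
  <div class="box">classy</div>
</body></html>
"""

tree = html.fromstring(DOC)

# Selecting the attributes themselves returns their values
names = [str(n) for n in tree.xpath('//@name')]

# <div> elements that have a name attribute
with_name = [d.text for d in tree.xpath('//div[@name]')]

# <div> elements with no attributes at all
no_attrs = [d.text for d in tree.xpath('//div[not(@*)]')]

# <div> elements whose name attribute has a specific value
exact = [d.text for d in tree.xpath('//div[@name="chachiname"]')]

print(names, with_name, no_attrs, exact)
```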

### Utilizing Built-in Functions

- XPath comes with several built-in functions to aid in element selection.
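  - `contains()`: Tests whether a string contains a given substring. Example: `xpath('//*[contains(name(), "iv")]')` selects every element whose tag name contains "iv", such as `div`. Note the double quotes inside the expression, which avoid clashing with the single quotes around it.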
  - `count()`: Used for conditional selection based on child count. Example: `xpath('//*[count(div)=2]')`.
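
Here is a minimal sketch of both functions on an invented document, assuming `lxml` is installed.

```python
from lxml import html

# Hypothetical document, invented for this example
DOC = """
<html><body>
  <div id="two"><div>a</div><div>b</div></div>
  <div id="one"><div>c</div></div>
  <div id="zero">z</div>
  <p>menu</p>
</body></html>
"""

tree = html.fromstring(DOC)

# contains(): every element whose tag name contains "iv"
name_matches = tree.xpath('//*[contains(name(), "iv")]')
matched_tags = {el.tag for el in name_matches}

# count(): elements that have exactly two <div> children
two_div_children = [el.get("id") for el in tree.xpath('//*[count(div)=2]')]

print(len(name_matches), matched_tags)
print(two_div_children)
```

`count(div)` only counts direct children, so `<body>` (three `<div>` children) is not selected here.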

### Combining Paths and Selecting Relatives

- **Combining Paths (`|`)**:
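  - Use the pipe symbol to combine paths; the result is the union of the nodes matched by each path.
  - Example: `xpath('//div/p | //div/a')` selects elements matching either `div/p` or `div/a`. Note that an absolute form such as `/div/p|/div/a` only matches when `div` is the root element, which is not the case in an HTML page.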

- **Selecting Relatives**:
  - You can refer to various relational aspects like parent, ancestors, children, or descendants.
  - Example: `xpath('//div/div/parent::*')` selects the parent elements of `div/div` paths.
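
The union operator and the relative axes can be tried on a small, invented document as follows, assuming `lxml` is installed.

```python
from lxml import html

# Hypothetical document, invented for this example
DOC = """
<html><body>
  <div id="outer">
    <p>para</p>
    <a href="/x">link</a>
    <div id="inner">deep</div>
  </div>
</body></html>
"""

tree = html.fromstring(DOC)

# Union: <p> or <a> children of any <div>, returned in document order
union_tags = [el.tag for el in tree.xpath('//div/p | //div/a')]

# parent axis: the parent of every <div> nested directly inside a <div>
parent_ids = [el.get("id") for el in tree.xpath('//div/div/parent::*')]

# ancestor axis: every element above the inner <div>
ancestor_tags = [el.tag for el in tree.xpath('//div[@id="inner"]/ancestor::*')]

print(union_tags, parent_ids, ancestor_tags)
```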

Understanding XPath is essential for effective web scraping, as it allows precise targeting and extraction of data based on the structure of a webpage.

In [None]:
from lxml import html
import requests

# URL of the website we want to scrape
# For this example, we'll use a news website like the BBC technology page
url = "https://www.bbc.co.uk/news/technology"

# Send a GET request to the URL to fetch the webpage's HTML content
response = requests.get(url)

# Parse the HTML content of the webpage
# html.fromstring builds an lxml element tree from the raw response bytes
tree = html.fromstring(response.content)

# Use XPath to select specific elements
# Here we collect the text nodes inside every <h3> element, where headlines are assumed to live
# Note: the right XPath depends on the page's current HTML structure, so inspect the page first
headlines = tree.xpath('//h3//text()')

# Print each extracted headline, skipping whitespace-only text nodes
for headline in headlines:
    if headline.strip():
        print(headline.strip())

# This script prints the text found inside <h3> tags on the BBC Technology page
# It demonstrates how to use XPath for extracting specific information from a webpage

## 2.0. Starting with Selenium 

Selenium is a powerful tool primarily used for automating web browsers. It's widely utilized in areas such as web scraping, automated testing, and automating web-based administration tasks.

### Introduction to Selenium Without Manually Installing a Driver

Selenium controls each browser through a browser-specific driver, such as geckodriver for Firefox, chromedriver for Chrome, or msedgedriver for Edge. Older setups required you to download the matching driver yourself and place it on your PATH. Since Selenium 4.6, the bundled Selenium Manager locates or downloads a suitable driver automatically when none is found:

- **Chrome**: Calling `webdriver.Chrome()` is normally enough; you don't need to download and set up chromedriver separately.

- **Microsoft Edge**: Similarly, the Chromium-based Edge browser can be started with `webdriver.Edge()` without a manual msedgedriver installation.

A driver is still used behind the scenes; what changes is that you no longer configure it by hand. This streamlines browser automation, making it easier to set up, especially for beginners and those looking to quickly get automated browser interactions running.

## 2.1 Basic Concepts of Selenium WebDriver

### Understanding WebDriver

WebDriver is a key component of the Selenium suite. It acts as an interface to interact with the web browser, allowing you to control it programmatically. WebDriver can perform operations like opening web pages, clicking buttons, entering text in forms, and extracting data from web pages.

#### Key Functions of WebDriver
- **Opening a Web Page**: WebDriver can navigate to a specific URL.
- **Locating Elements**: It can find elements on a web page based on their attributes (like ID, name, XPath).
- **Interacting with Elements**: WebDriver can simulate actions like clicking buttons, typing text, and submitting forms.

### Interacting with Web Elements

You can locate and interact with elements on a web page using various methods provided by WebDriver. The choice of method depends on the attributes of the HTML elements you're targeting.

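- **`find_element(By.ID, value)`**: Locates an element by its unique ID.
- **`find_element(By.NAME, value)`**: Finds an element by its name attribute.
- **`find_element(By.XPATH, value)`**: Uses XPath queries to locate elements, providing a powerful way to navigate the DOM.

`By` is imported from `selenium.webdriver.common.by`. The older `find_element_by_id`, `find_element_by_name`, and `find_element_by_xpath` helpers were removed during the Selenium 4 release series, so they fail on current installations. Use `find_elements` (plural) to get every match as a list.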

In [None]:
### Selenium WebDriver Python Examples
#### Example 1: Opening a Web Page

#This example demonstrates how to open a web page using Selenium WebDriver.

from selenium import webdriver

# Initialize the Chrome WebDriver
driver = webdriver.Chrome()

# Open a web page
driver.get("https://www.python.org")

In [None]:
# Close the browser
driver.quit()

**Example 2: Web Scraping NBA Player Salaries**

This example demonstrates how to scrape NBA player salary data from a website using Selenium in Python. It's a practical illustration of how Selenium can be used to automate web browsing and extract specific data from web pages. The script visits the page for each NBA season, collects player names and their corresponding salaries, and combines this data into a single pandas DataFrame covering the seasons that start in 2017 and 2018. Widen the `range` in the loop to cover more seasons. This is a useful example for learning how to manage web elements, extract text, and handle data using pandas in Python.

In [None]:
# Importing the necessary libraries
from selenium import webdriver  # Used to automate web browser interaction
from selenium.webdriver.common.by import By  # Helps in locating elements on web pages
import pandas as pd  # Pandas library for data manipulation and analysis

# Creating an empty DataFrame with specified columns
# This DataFrame will be used to store the scraped data
df = pd.DataFrame(columns=['Player', 'Salary', 'Year'])

# Initializing the Chrome WebDriver
# This opens up a Chrome browser window for web scraping
driver = webdriver.Chrome()

# Looping through the years 2017 to 2018
for yr in range(2017, 2019):
    # Constructing the URL for each year by appending the year range to the base URL
    page_num = str(yr) + '-' + str(yr + 1) + '/'
    url = 'https://hoopshype.com/salaries/players/' + page_num
    driver.get(url)  # Navigating to the constructed URL in the browser
    
    # Finding all player name elements on the page using their XPATH
    # XPATH is a syntax used to navigate through elements and attributes in an XML document
    players = driver.find_elements(By.XPATH, '//td[@class="name"]')

    # Similarly, finding all salary elements on the page using their XPATH
    salaries = driver.find_elements(By.XPATH, '//td[@class="hh-salaries-sorted"]')
    
    # Extracting the text from each player element and storing in a list
    players_list = [player.text for player in players]

    # Extracting the text from each salary element and storing in a list
    salaries_list = [salary.text for salary in salaries]
    
    # Pairing each player's name with their salary using the zip function
    # [1:] skips the first element of each list, assumed to be the table's header row
    data_tuples = list(zip(players_list[1:], salaries_list[1:]))
    
    # Creating a temporary DataFrame for the current year
    # This DataFrame contains the player names, their salaries, and the year
    temp_df = pd.DataFrame(data_tuples, columns=['Player', 'Salary'])
    temp_df['Year'] = yr

    # Appending the temporary DataFrame to the master DataFrame
    # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
    # ignore_index=True ensures the index continues correctly in the master DataFrame
    df = pd.concat([df, temp_df], ignore_index=True)

# Quitting the WebDriver after completing the scraping
# quit() closes every browser window and ends the driver session, freeing its resources
driver.quit()
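
The scraped salaries are plain text, so a likely next step is converting them to numbers before any analysis. The sketch below assumes the text looks like `$37,457,154`; both that format and the sample values are assumptions made for illustration, and `parse_salary` is a hypothetical helper name.

```python
def parse_salary(text):
    """Turn a string such as '$37,457,154' into an int; return None if it has no digits."""
    digits = "".join(ch for ch in text if ch.isdigit())
    return int(digits) if digits else None


# Made-up sample values in the assumed format
sample = ["$37,457,154", "$1,000,000", ""]
parsed = [parse_salary(s) for s in sample]

print(parsed)
```

With the DataFrame from the example, the same function could be applied with `df['Salary'].map(parse_salary)`.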

## Business challenge: **Analyzing Barcelona's Rental Market: A Web Scraping and Data Visualization Project**

## Objective:
The goal is to develop a Python-based web scraper to extract rental property data from Idealista for different neighborhoods in Barcelona. This data will be analyzed using Power BI to uncover insights into the city's rental market.

## Scope:
- **Web Scraping**: Extract key data points such as rental prices, property size, number of bedrooms, and neighborhood locations from Idealista.
- **Data Analysis and Visualization**: Analyze the scraped data to identify trends and patterns, then visualize these findings using Power BI.

## Steps and Agile Methodology Application:

### 1. Project Initiation and Planning (Sprint 0)
- **Team Setup**: Form cross-functional teams with roles such as Scrum Master, Product Owner and Data Analysts.
- **Requirement Gathering**: Define the specific data points to be scraped from Idealista.
- **Tool Selection**: Choose appropriate tools for web scraping (e.g., Python with libraries like BeautifulSoup, Selenium) and for data visualization (Power BI).
- **Backlog Creation**: Create a product backlog comprising user stories (e.g., "As a data analyst, I want to scrape rental prices so that I can analyze the average rent in each neighborhood").

### 2. Sprint Execution
- **Sprint Planning**: Break down the backlog into smaller, manageable tasks to be completed in each sprint (e.g., setting up the scraping environment, designing the data model, etc.).
- **Daily Stand-ups**: Hold brief daily meetings to discuss progress, roadblocks, and next steps.
- **Development and Testing**: Perform iterative development, with regular testing to ensure data accuracy and reliability.
- **Sprint Review**: At the end of each sprint, review the work completed and demonstrate the functionality.
- **Sprint Retrospective**: Reflect on the sprint process to identify improvements for the next sprint.

### 3. Web Scraping Phase
- Implement web scraping scripts to extract the required data from Idealista.
- Ensure compliance with Idealista's web scraping policies and legal considerations.

### 4. Data Analysis and Visualization
- Clean and preprocess the scraped data for analysis.
- Use Power BI to create interactive dashboards and visualizations that highlight key aspects of the rental market in Barcelona.

### 5. Final Review and Presentation
- Compile the findings and insights into a comprehensive report.
- Present the data analysis and visualizations to stakeholders or in a class setting.

### Deliverables:
- Source code for the web scraping script.
- Power BI dashboard files.
- Final report detailing methodology, findings, and insights.

### Learning Outcomes:
- Practical application of web scraping and data analysis.
- Experience in using Agile methodology for project management.
- Enhanced collaboration and teamwork skills.
- Proficiency in using Python for data collection and Power BI for data visualization.


## Answers

In [None]:
import requests
from bs4 import BeautifulSoup

# URL of the BBC technology news section
url = "https://www.bbc.co.uk/news/technology"

# Send a GET request and parse the HTML content
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Define categories and associated keywords
categories = {
    'Apple': ['Apple', 'iPhone', 'iPad'],
    'Microsoft': ['Microsoft', 'Windows', 'Bill Gates'],
    'Google': ['Google', 'Android', 'Alphabet']
    # Add more categories as needed
}

# Function to determine the category of a headline
def categorize_headline(headline):
    for category, keywords in categories.items():
        for keyword in keywords:
            if keyword in headline:
                return category
    return 'Other'

# Scrape and process the headlines
# Look for 'h3' tags or other relevant tags
headlines = soup.find_all('h3')
category_counts = {category: 0 for category in categories.keys()}
category_counts['Other'] = 0

for h in headlines:
    headline_text = h.get_text().strip()
    category = categorize_headline(headline_text)
    category_counts[category] += 1
    print(f"Headline: {headline_text}\nCategory: {category}\n")

# Print the count of headlines in each category
print("Headline Counts by Category:")
for category, count in category_counts.items():
    print(f"{category}: {count}")