Absolutely, let's break down a web scraping in-class exercise tailored for an Intelligence Community (IC) perspective, focused on the Russia-Ukraine conflict.

**Title:** Divergent Perspectives: Web Scraping the Russia-Ukraine Conflict

**Objectives:**

*   **Understanding Media Bias:** Learn how different news outlets frame the same events through selective reporting and language.
*   **Intelligence Gathering:** Extract key information on troop movements, casualties, equipment, and political statements to enhance situational awareness.
*   **Critical Analysis:**  Develop skills to sift through narratives and identify information potentially used for propaganda or disinformation.

**Technical Setup**

*   **Programming Language:** Python (its popularity and ease of use make it ideal for in-class work)
*   **Libraries:**
    *   **Requests:** For fetching website HTML content.
    *   **Beautiful Soup 4:**  For parsing and extracting data from HTML.
    *   **Pandas:** For storing and manipulating scraped data (optional).

**Project Outline**

1.  **Website Selection**
    *   Students research and select three news websites:
        *   **Pro-Russian slant:**  Ex: TASS (Russian state-owned), RT, SouthFront
        *   **Pro-Ukrainian slant:**  Ex: The Kyiv Independent, Ukrinform
        *   **International focus:** Ex: BBC World News, Al Jazeera, Reuters
    
2.  **Target Elements Identification:**
    *   **Headlines:**  Capture overall messaging
    *   **Articles:** For in-depth analysis, identifying key events, locations, and figures.
    *   **Dates:** To timeline events and spot evolving narratives.
    *   **Author:** To attribute viewpoints and consider potential affiliations.

3.  **Web Scraping Script Development:**
    *   **Inspect webpage structure:** Students use browser developer tools to pinpoint the HTML tags containing the target elements.
    *   **Write targeted Python scripts:** Employ Requests and Beautiful Soup to:
        *   Fetch HTML from each site.
        *   Parse the HTML, isolating target elements based on tags/classes.
        *   Extract the text content.

4.  **Data Storage and Analysis**
    *   **Storage:** Save the results:
        *   Simple: CSV or text files.
        *   Advanced (optional): Use Pandas dataframes for easier manipulation.
    *   **Analysis:** Guide students to perform:
        *   **Comparative Word Analysis:** Word clouds, keyword frequencies to pinpoint differing terminology
        *   **Sentiment Analysis:**  Tools to estimate positive/negative tone in the coverage. 
        *   **Timeline Visualizations:** Plotting events according to their reporting dates.
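
A minimal sketch of the comparative word analysis step, using only the standard library. The outlet names, the headlines, and the tiny stopword set are made-up placeholders rather than scraped data; real input would come from the scraper's output.

```python
import re
from collections import Counter

# Hypothetical placeholder headlines; real input would come from the scraper.
headlines_by_outlet = {
    "outlet_a": ["Special operation continues in the east", "Operation targets depots"],
    "outlet_b": ["Invasion enters new phase", "Invasion forces shell the east"],
}

# Deliberately tiny stopword set, just enough for this illustration
STOPWORDS = {"the", "in", "a", "of", "and", "to"}

def term_frequencies(texts):
    """Count lowercase word tokens across a list of strings, skipping stopwords."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts

frequencies = {
    outlet: term_frequencies(texts) for outlet, texts in headlines_by_outlet.items()
}

# Terms that appear in one outlet's headlines and never in the other's
only_a = set(frequencies["outlet_a"]) - set(frequencies["outlet_b"])
only_b = set(frequencies["outlet_b"]) - set(frequencies["outlet_a"])

for outlet, counts in frequencies.items():
    print(outlet, counts.most_common(3))
print("only in outlet_a:", sorted(only_a))
print("only in outlet_b:", sorted(only_b))
```

Terms that show up for one outlet and never for the other are a starting point for the class discussion on differing terminology, not a conclusion in themselves.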

**In-Class Discussion:** 

*   **How does each outlet frame the same event?** Look for specific word choices, omitted details, and the prominence given to particular narratives.
*   **Identifying potential disinformation:** Does any outlet spread demonstrably false information or promote unsubstantiated claims?
*   **IC Applications:**  How can similar techniques be used in real-world IC monitoring of evolving international situations?

**Ethical Considerations**

Emphasize the importance of responsible web scraping, respecting website terms of service, and avoiding excessive requests that could overload servers.
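
One way to put these points into practice is to check `robots.txt` rules and pause between requests. The sketch below uses the standard library's `urllib.robotparser` and parses a made-up `robots.txt` string so it runs offline; the rules, the `classroom-scraper` user agent name, and the example URLs are all assumptions. Note that `robots.txt` is separate from a site's terms of service, which still need to be read by a person.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 1
"""

USER_AGENT = "classroom-scraper"  # assumed name; identify your scraper honestly

robots = RobotFileParser()
robots.parse(SAMPLE_ROBOTS_TXT.splitlines())

def allowed(url):
    """Return True if the parsed robots.txt rules permit fetching this URL."""
    return robots.can_fetch(USER_AGENT, url)

def polite_delay(default_seconds=1.0):
    """Use the site's Crawl-delay if it declares one, otherwise a default pause."""
    declared = robots.crawl_delay(USER_AGENT)
    return float(declared) if declared is not None else default_seconds

urls = ["https://example.com/news/1", "https://example.com/private/draft"]
for url in urls:
    if allowed(url):
        print("would fetch:", url)
        time.sleep(polite_delay())  # pause before the next request
    else:
        print("skipping (disallowed):", url)
```

Against a live site, `robots.set_url(...)` followed by `robots.read()` would load the real file in place of the sample string.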

**Let me know if you'd like assistance crafting the actual Python code examples or want to explore more advanced functionalities within this exercise!** 


1. Setup

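In [4]:
# If an import fails with "ModuleNotFoundError: No module named 'bs4'",
# install the packages first: pip install requests beautifulsoup4 pandas
import requests
from bs4 import BeautifulSoup
import pandas as pd  # Optional for advanced data handling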

2. Basic News Article Scraping

In [None]:
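def scrape_article(article_url):
    # A timeout keeps a slow site from hanging the notebook
    response = requests.get(article_url, timeout=10)
    response.raise_for_status()  # Raise an error if the request fails

    soup = BeautifulSoup(response.content, 'html.parser')

    # Placeholder selectors: each site uses its own tags and classes,
    # so confirm them with the browser developer tools.
    headline_tag = soup.find('h1', class_='article-title')
    body_tag = soup.find('div', class_='article-body')
    date_tag = soup.find('time')

    # find() returns None when nothing matches, so guard before reading text
    headline = headline_tag.get_text(strip=True) if headline_tag else ''
    article_text = body_tag.get_text(strip=True, separator=' ') if body_tag else ''
    date = date_tag.get_text(strip=True) if date_tag else ''

    return {'headline': headline, 'text': article_text, 'date': date}

# Example usage: replace these site home pages with actual article URLs
pro_russian_article = scrape_article('https://tass.com')
pro_ukrainian_article = scrape_article('https://kyivindependent.com')
international_article = scrape_article('https://bbc.com/news')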


3. Scraping Multiple Articles

In [None]:
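from urllib.parse import urljoin

def scrape_news_section(section_url):
    articles = []
    response = requests.get(section_url, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.content, 'html.parser')
    # 'article-link' is a placeholder class; href=True skips anchors without a link
    article_links = soup.find_all('a', class_='article-link', href=True)

    for link in article_links:
        # Section pages often use relative links, so resolve them against the page URL
        article_url = urljoin(section_url, link['href'])
        article_data = scrape_article(article_url)
        # The keyword-filtering versions of scrape_article return None for skipped articles
        if article_data is not None:
            articles.append(article_data)

    return articles

# Example (replace the placeholder URL with a real section page)
ukraine_section_data = scrape_news_section('https://example.com/ukraine-news')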


4. Storing Data (Optional - Using Pandas)

In [None]:
df = pd.DataFrame(ukraine_section_data)
df.to_csv('ukraine_news_data.csv', index=False)


5. Handling Pagination

Often, news sections span multiple pages. Here's how to modify the code to navigate through them:

In [None]:
def scrape_news_section_paginated(section_url, max_pages=3):
    articles = []

    for page_num in range(1, max_pages + 1):
        page_url = f"{section_url}?page={page_num}"  # Assume pagination structure

        # Reuse scrape_news_section from section 3 to fetch and parse this page
        page_articles = scrape_news_section(page_url)

        # Extend with this page's results (extending the list with itself would only duplicate it)
        articles.extend(page_articles)

    return articles


Key Changes:

Loop over pages: We introduce a loop iterating through page numbers.
Pagination URL patterns: Adjust page_url to match how your target website handles pagination (e.g., some use '/page/2' instead of '?page=2').
Optional max_pages: Limit scraping to a certain number of pages.
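
The two pagination patterns mentioned above can be sketched as a small URL builder. The function name `build_page_url`, the `style` parameter, and both URL shapes are illustrative assumptions; check how the target site actually numbers its pages before relying on either.

```python
def build_page_url(section_url, page_num, style="query"):
    """Return the URL for one page of a paginated section.

    style="query" -> https://site/section?page=2
    style="path"  -> https://site/section/page/2
    Both styles are assumptions; check the target site's real URLs first.
    """
    base = section_url.rstrip("/")
    if style == "query":
        return f"{base}?page={page_num}"
    if style == "path":
        return f"{base}/page/{page_num}"
    raise ValueError(f"unknown pagination style: {style}")

section = "https://example.com/ukraine-news"  # placeholder URL
for style in ("query", "path"):
    print([build_page_url(section, n, style) for n in range(1, 4)])
```

This simple version assumes the section URL has no query string of its own; a URL that already contains `?` would need `urllib.parse` to merge parameters correctly.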

6. Filtering Articles by Keywords

In [None]:
def scrape_article(article_url, keywords=('troops', 'equipment')):  # Add keywords (tuple avoids a mutable default)
    # ... (rest of the article scraping logic) ...

    # Match against a lowercased copy so the stored text keeps its original case
    text_lower = article_text.lower()
    if any(word.lower() in text_lower for word in keywords):
        return {'headline': headline, 'text': article_text, 'date': date}
    else:
        return None  # Skip articles that don't match keywords


Key Changes:

keywords parameter: The function now takes keywords to filter on.
Filtering logic: We check if any keyword appears in the article text.
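
The same filtering logic can also be applied after scraping, to records that are already collected. The sketch below runs on made-up sample records in the shape `scrape_article` returns; the helper names `matches_keywords` and `filter_articles` are hypothetical.

```python
def matches_keywords(article, keywords):
    """True if any keyword occurs in the article text, ignoring case."""
    text = article["text"].lower()
    return any(keyword.lower() in text for keyword in keywords)

def filter_articles(articles, keywords):
    """Keep matching articles; drop None entries left by a filtering scraper."""
    return [a for a in articles if a is not None and matches_keywords(a, keywords)]

# Made-up sample records in the same shape scrape_article returns
sample = [
    {"headline": "A", "text": "Troops moved overnight.", "date": "2024-01-01"},
    {"headline": "B", "text": "Talks resumed on Tuesday.", "date": "2024-01-02"},
    None,
]

kept = filter_articles(sample, ["troops", "equipment"])
print([a["headline"] for a in kept])
```

Filtering after the fact keeps the full scraped set available, so the keyword list can change without fetching pages again. Plain substring matching also hits longer words that contain a keyword, which is one reason to move to the regular expressions in section 8.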

7. Advanced Sentiment Analysis

Let's introduce a popular sentiment analysis library:

In [None]:
import nltk  # You might need to install: pip install nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # Download sentiment lexicon

# Inside your analysis section:
sia = SentimentIntensityAnalyzer()

for article in ukraine_section_data:
    sentiment_scores = sia.polarity_scores(article['text'])
    article['sentiment'] = sentiment_scores['compound']  # Store compound score


8. Keyword Filtering with Regular Expressions

Regular expressions (regex) provide a powerful way to define complex search patterns:

In [None]:
import re

def scrape_article(article_url, keyword_regex):
    # ... (rest of article scraping logic) ...

    if re.search(keyword_regex, article_text):
        return {'headline': headline, 'text': article_text, 'date': date}
    else:
        return None 
 

Example Regex Patterns:

Specific equipment: r'\btank\b|\bmissile\b|\bartillery\b' (Word boundaries with 'or' conditions)
Phrases: r'civilian casualties'
Names (case-insensitive): r'(?i)putin|zelenskyy'

Resources:

Regex Tutorial: https://www.regular-expressions.info/
Regex Tester: https://regex101.com/ (Great for building and testing your patterns)
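
The three example patterns above can be tried against a few sentences before they go into the scraper. The sample sentences below are invented for illustration only.

```python
import re

patterns = {
    "equipment": r"\btank\b|\bmissile\b|\bartillery\b",
    "phrase": r"civilian casualties",
    "names": r"(?i)putin|zelenskyy",
}

# Invented sentences, one per behaviour worth checking
samples = [
    "A missile struck the depot.",
    "Tanks were seen near the border.",  # capitalized plural: \btank\b does not match
    "Officials reported civilian casualties.",
    "PUTIN spoke on Monday.",            # (?i) makes the names pattern case-insensitive
]

for text in samples:
    hits = [name for name, pattern in patterns.items() if re.search(pattern, text)]
    print(f"{text!r} -> {hits}")
```

The second sentence shows a common surprise: the equipment pattern is case-sensitive and bounded to the singular form, so a pattern such as `r'(?i)\btanks?\b'` would be needed to catch it.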

9. Visualizing Sentiment Analysis

In [None]:
import matplotlib.pyplot as plt

# Gather sentiment scores from your scraped data
sentiment_scores = [article['sentiment'] for article in ukraine_section_data]

# Histogram of sentiment
plt.hist(sentiment_scores)
plt.xlabel('Sentiment Score (Compound)')
plt.ylabel('Number of Articles')
plt.title('Sentiment Distribution in Ukraine News Coverage')
plt.show()


Enhancements:

Seaborn: Consider the Seaborn library (https://seaborn.pydata.org/) for more aesthetically pleasing visualizations.
Box Plots: Show distributions across different news websites to compare sentiment trends.
Time Series: Plot average sentiment over time to see if it correlates with real-world events.
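
Before drawing box plots per website, the scores need to be grouped by outlet. A minimal sketch with the standard library follows; the records are made up, and the `source` field is an assumption, since the scraper above does not store which site an article came from.

```python
from collections import defaultdict
from statistics import mean

# Made-up records; 'source' is an assumed extra field added when scraping,
# and 'sentiment' is the compound score stored in section 7.
records = [
    {"source": "outlet_a", "date": "2024-01-01", "sentiment": 0.4},
    {"source": "outlet_a", "date": "2024-01-02", "sentiment": 0.2},
    {"source": "outlet_b", "date": "2024-01-01", "sentiment": -0.5},
    {"source": "outlet_b", "date": "2024-01-02", "sentiment": -0.1},
]

# Collect one list of scores per outlet
scores_by_source = defaultdict(list)
for record in records:
    scores_by_source[record["source"]].append(record["sentiment"])

mean_by_source = {source: mean(scores) for source, scores in scores_by_source.items()}
print(mean_by_source)

# One list per outlet is the input shape a box plot needs, for example:
# plt.boxplot(list(scores_by_source.values()))
```
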

Important Considerations

Regex Complexity: Be careful – overly complex regexes can slow down your scraper.
Visualization Interpretation: Use visualizations in conjunction with critical reading of the scraped articles. Sentiment analysis won't always capture the nuanced meaning of language.

Further Exploration

Would you be interested in any of the following?

Word Clouds for Keyword Visualization: See which words are the most frequent within positively and negatively scored articles.
Topic Modeling: Identify underlying themes within the corpus of articles (more advanced technique).