# Section 10. Working with Web Data Practice

#### Instructor: Pierre Biscaye

The purpose of this notebook is to give you opportunities and challenges to practice applying the skills developed in the other notebooks.

The content of this notebook is taken from UC Berkeley D-Lab's Python Web APIs [course](https://github.com/dlab-berkeley/Python-Web-APIs) and their Python Web Scraping [course](https://github.com/dlab-berkeley/Python-Web-Scraping).

In [None]:
# Import required libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from datetime import datetime
from pynytimes import NYTAPI

In [None]:
import configparser
import os
from getpass import getpass

def get_api_key(api_name):
    config_file_path = os.path.expanduser("~/.notebook-api-keys")
    config = configparser.ConfigParser(interpolation=None)  # Disable interpolation to avoid issues with special characters
    
    # Try reading the existing config file
    if os.path.exists(config_file_path):
        config.read(config_file_path)
    
    # Check if API key is present
    if config.has_option("API_KEYS", api_name):
        # Ask if the user wants to update the key
        update_key = input(f"An API key for {api_name} already exists. Do you want to update it? (y/n): ").lower()
        if update_key == 'n':
            return config.get("API_KEYS", api_name)
    
    # If no key exists or user opts to update, prompt for the new key
    api_key = getpass(f"Enter your {api_name} API key: ")

    # Save the API key in the config file
    if not config.has_section("API_KEYS"):
        config.add_section("API_KEYS")
    config.set("API_KEYS", api_name, api_key)
    
    with open(config_file_path, "w") as f:
        config.write(f)
    
    return api_key

# Example usage to retrieve the NYT API key
api_key = get_api_key("NYT")

print("NYT API key retrieved successfully.")


In [None]:
# Initialize an NYTAPI client object using your API key
nyt = NYTAPI(api_key, parse_dates=True)

## 1. Challenge: Find the top stories for a section

- Choose a section of the NYT. Grab the top stories and store them in a list. Here are the sections:
```arts```, ```automobiles```, ```books```, ```business```, ```fashion```, ```food```, ```health```, ```home```, ```insider```, ```magazine```, ```movies```, ```national```, ```nyregion```, ```obituaries```, ```opinion```, ```politics```, ```realestate```, ```science```, ```sports```, ```sundayreview```, ```technology```, ```theater```, ```tmagazine```, ```travel```, ```upshot```, and ```world```
- How many stories are in the section?
- What is the title of the first story?
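
Once the stories are in a list, the two questions above come down to `len()` and indexing. Here is a minimal sketch, assuming each story is a dictionary with a `"title"` key; `sample_stories` and `summarize_stories` are made-up names for illustration, not part of `pynytimes`.

```python
# Hypothetical sample data shaped like a list of story dictionaries
sample_stories = [
    {"title": "Example headline one", "section": "science"},
    {"title": "Example headline two", "section": "science"},
]


def summarize_stories(stories):
    """Return the number of stories and the title of the first one (or None)."""
    first_title = stories[0]["title"] if stories else None
    return len(stories), first_title


count, first_title = summarize_stories(sample_stories)
print(f"{count} stories; first title: {first_title}")

# With a live API key, the list would come from the client instead, e.g.:
# stories = nyt.top_stories(section="science")
```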

In [None]:
# Your code here

## 2. Challenge: Article Searching

- Retrieve a set of NYT articles for a query of your choice. Restrict the number of results so it does not run too long or exhaust your API limits.
- Use a relevant time interval when constructing your `dates` dictionary.
- Use `type_of_material` and `section_name` as keys in your `options` dictionary.
    - For `type_of_material` values refer to this [list](https://github.com/michadenheijer/pynytimes/blob/main/VALID_SEARCH_OPTIONS.md#type-of-material-values).
    - For `section_name` values refer to this [list](https://github.com/michadenheijer/pynytimes/blob/main/VALID_SEARCH_OPTIONS.md#section-name-values).
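
One possible way to assemble the search arguments is sketched below. The query, date range, result count, and option values are illustrative assumptions; check the linked lists for the values that are actually valid. The API call itself is left commented out so the cell runs without a key.

```python
from datetime import datetime

# Assumed time window; pick one that suits your query
dates = {
    "begin": datetime(2024, 10, 1),
    "end": datetime(2024, 11, 30),
}

# Assumed filter values; confirm them against the linked lists
options = {
    "sort": "newest",
    "type_of_material": ["News"],
    "section_name": ["Science"],
}

# Keep `results` small to stay within API limits
search_kwargs = {
    "query": "climate",
    "results": 20,
    "dates": dates,
    "options": options,
}

# With a live API key:
# articles = nyt.article_search(**search_kwargs)
```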

In [None]:
# Your code here

## 3. Challenge: Most Positive, Most Negative

What are the three most positive and the three most negative texts in the NYT database of articles around the time of the 2024 election? Tip: try using the `sort_values()` method on the "sentiment" column in your DataFrame!
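
The sorting step can be sketched on a toy DataFrame. The texts and scores below are made up and stand in for the real articles; in the challenge, the "sentiment" column would first have to be computed, for example from VADER's compound score.

```python
import pandas as pd

# Made-up texts with precomputed scores, standing in for the real articles
toy = pd.DataFrame({
    "text": [
        "great news", "terrible loss", "mixed results",
        "solid win", "minor setback", "total disaster",
    ],
    "sentiment": [0.90, -0.80, 0.10, 0.50, -0.30, -0.95],
})

# Sort once from highest to lowest score
ranked = toy.sort_values("sentiment", ascending=False)

most_positive = ranked.head(3)             # three highest scores
most_negative = ranked.tail(3).iloc[::-1]  # three lowest, most negative first

print(most_positive)
print(most_negative)
```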

In [None]:
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import numpy as np 

df = pd.read_csv("Data/election2024_articles.csv")

# Your code here

## 4. Challenge: Web Scraping Find All

Use BeautifulSoup to find all the HTML tags that appear in the main menu of the Illinois State Senate website.
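
To see how `find_all()` behaves before pointing it at the live site, here is a small sketch on an inline HTML string. The markup and the `mainmenu` class name are assumptions made for illustration; inspect the real page to find the tag and class it actually uses.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page's structure may differ
sample_html = """
<html><body>
  <a class="mainmenu" href="/default.asp">Home</a>
  <a class="mainmenu" href="/legislation/">Legislation</a>
  <a class="footer" href="/contact.asp">Contact</a>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Keep only the <a> tags whose class is "mainmenu"
menu_tags = sample_soup.find_all("a", class_="mainmenu")
menu_labels = [tag.text for tag in menu_tags]

print(len(menu_tags), menu_labels)
```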

In [None]:
# Import required libraries
from bs4 import BeautifulSoup
from datetime import datetime
import requests
import time

In [None]:
# Make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp')
# Read the content of the server’s response
src = ...
# Parse the response into an HTML tree
soup = ...

In [None]:
# Your code here

## 5. Challenge: Extract specific attributes

Extract all `href` attributes for each HTML tag in the main menu.
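
Attribute values can be read from a tag much like dictionary entries. The sketch below uses made-up markup with an assumed `mainmenu` class, and `urljoin` shows one way to turn relative links into absolute ones.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup; the real page's structure may differ
sample_html = """
<a class="mainmenu" href="/default.asp">Home</a>
<a class="mainmenu" href="/legislation/">Legislation</a>
<a class="mainmenu">No link here</a>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
menu_tags = sample_soup.find_all("a", class_="mainmenu")

# .get() returns None when the attribute is missing, so filter those out
hrefs = [tag.get("href") for tag in menu_tags if tag.get("href") is not None]

# Resolve relative links against the page they came from
base_url = "http://www.ilga.gov/senate/default.asp"
absolute_links = [urljoin(base_url, href) for href in hrefs]

print(hrefs)
print(absolute_links)
```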

In [None]:
# Your code here

## 6. Challenge: Modularize Your Code

Turn the code we created to extract information about each senator for the 98th senate into a function that accepts a URL, scrapes the URL for its senators, and returns a list of tuples containing information about each senator. In other words, make the code flexible to allow scraping of multiple senates rather than one specific instance.

In [None]:
def get_members(url):
    # Your code here

    return members

In [None]:
# Test your code!
url = 'http://www.ilga.gov/senate/default.asp?GA=98'
senate_members = get_members(url)
len(senate_members)

## 7. Challenge: Scrape All Bills

Create a dictionary which maps a senator's district (the key) onto a list of bills (the value) coming from that district. You can do this by looping over all of the senate members in `senate_members` and calling `get_bills()` for each of their associated bill URLs. For practice, just loop over the first 10 senators.

**NOTE:** please call `time.sleep(1)` on each iteration of the loop, so that we don't overload the state's website.
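
The general shape of such a rate-limited loop is sketched below. It assumes each member is a tuple of `(name, district, party, bills_url)`; that layout, along with `collect_bills` and `fake_get_bills`, is hypothetical and used only for illustration. The stub stands in for your own `get_bills()` so the example runs offline.

```python
import time


def collect_bills(members, fetch_bills, limit=10, delay=1.0):
    """Map each district to its list of bills, pausing between requests."""
    bills_by_district = {}
    for name, district, party, bills_url in members[:limit]:
        bills_by_district[district] = fetch_bills(bills_url)
        time.sleep(delay)  # be polite to the server
    return bills_by_district


# Hypothetical stand-ins so the sketch runs without any network access
def fake_get_bills(url):
    return [f"bill from {url}"]


fake_members = [
    ("Senator A", 1, "D", "url-1"),
    ("Senator B", 2, "R", "url-2"),
]

# delay=0 only because nothing is actually requested here
result = collect_bills(fake_members, fake_get_bills, delay=0)
print(result)
```

Against the real site, keep the default `delay=1.0` and pass your own `get_bills` function in place of the stub.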

In [None]:
# Copy get_bills() code here

In [None]:
bills_dict = ...
for member in senate_members[...]:
    # your code here
    time.sleep(1)