# Section 4. Working with Web Data Practice

#### Instructor: Pierre Biscaye

The purpose of this notebook is to give you opportunities and challenges to practice applying the skills developed in the other notebooks.

The content of this notebook is taken from UC Berkeley D-Lab's Python Web APIs [course](https://github.com/dlab-berkeley/Python-Web-APIs) and their Python Web Scraping [course](https://github.com/dlab-berkeley/Python-Web-Scraping).


In [None]:
# Import required libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup
import requests
import time
import re

# NY Times API Practice

In [None]:
from pynytimes import NYTAPI
import configparser
import os
from getpass import getpass

In [None]:
# Load or update your NYT key
from api_utils import get_api_key
nyt_key = get_api_key("NYT")
nyt = NYTAPI(nyt_key, parse_dates=True)

## 1. Challenge: Find the top stories for a section

- Choose a section. Grab the top stories and store them in a list.
- How many stories are in the section?
- What is the title of the first story?

In [None]:
# your code
section = ...
top_stories = ...
print(f"There are {len(top_stories)} {section} stories.")

In [None]:
# Grab first story
top_story = ...
top_story_title = ...
top_story_title

| Attribute      | Data Type | Definition      |
| ----------- | ----------- | ----------- |
| url      | string       | Article's URL.       |
| adx_keywords   | string        | Semicolon-separated list of keywords.        |
| subsection   | string        | Article's subsection (e.g. Politics). Can be empty. |
| column   | string        | Deprecated. Set to null.        |
| eta_id   | integer        | Deprecated. Set to 0.|
| section   | string        | Article's section (e.g. Sports).        |
| id   | integer        | Asset ID number (e.g. 100000007772696).        |
| asset_id   | integer        | Asset ID number (e.g. 100000007772696).        |
| nytdsection   | string        | Article's section|
| byline   | string        | Article's byline (e.g. By Thomas L. Friedman).        |
| type   | string        | Asset type (e.g. Article, Interactive, ...).        |
| title   | string        | Article's headline (e.g. When the Cellos Play, the Cows Come Home).        |
| abstract   | string        | Brief summary of the article.|
| published_date   | string        | When the article was published on the web (e.g. 2021-04-19).        |
| source   | string        | Publisher (e.g. New York Times).        |
| updated   | string        | When the article was last updated (e.g. 2021-05-12 06:32:03).|
| des_facet   | array        | Array of description facets (e.g. Quarantine (Life and Culture)).        |
| org_facet   | array        | Array of organization facets (e.g. Sullivan Street Bakery).        |
| per_facet   | array        | Array of person facets (e.g. Bittman, Mark).        |
| geo_facet   | array        | Array of geographic facets (e.g. Canada).        |
| media   | array        | Array of images.        |
| media.type   | string        | Asset type (e.g. image).        |
| media.subtype   | string        | Asset subtype (e.g. photo).        |
| media.caption   | string        | Media caption        |
| media.copyright   | string        | Media credit        |
| media.approved_for_syndication   | boolean        | Whether media is approved for syndication.        |
| media.media-metadata   | array        | Media metadata (url, width, height, ...).        |
| media.media-metadata.url   | string        | Image's URL.        |
| media.media-metadata.format   | string        | Image's crop name     |
| media.media-metadata.height   | integer        | Image's height |
| media.media-metadata.width   | integer        | Image's width      |
| uri   | string        | An article's globally unique identifier.      |
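Reading these attributes might look like the sketch below. It assumes each story is returned as a Python dictionary keyed by the attribute names in the table, with array attributes arriving as lists of dictionaries. The `story` dictionary here is made up for illustration, so check the keys of your own results before relying on it.

```python
# Made-up story for illustration; a real response will carry more keys.
story = {
    "title": "Example headline",
    "section": "Sports",
    "byline": "By Example Author",
    "media": [
        {
            "type": "image",
            "caption": "Example caption",
            "media-metadata": [
                {"url": "https://example.org/small.jpg", "height": 75, "width": 75},
                {"url": "https://example.org/large.jpg", "height": 293, "width": 440},
            ],
        }
    ],
}

title = story["title"]

# .get() avoids a KeyError when an optional field (here, subsection) is missing.
subsection = story.get("subsection", "")

# Collect every image URL, however many media entries the story has.
image_urls = [
    meta["url"]
    for item in story.get("media", [])
    for meta in item.get("media-metadata", [])
]

print(title)
print(image_urls)
```

Using `.get()` with a default for optional or nested fields keeps the code from failing on stories that lack images or a subsection.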

## 2. Challenge: Article Searching

- Retrieve a set of NYT articles for a query of your choice. Restrict the number of results so it does not run too long or exhaust your API limits.
- Use a relevant time interval in constructing your `dates` dictionary.
- Use `type_of_material` and `section_name` as keys in your `options` dictionary.
    - For `type_of_material` values refer to this [list](https://github.com/michadenheijer/pynytimes/blob/main/VALID_SEARCH_OPTIONS.md#type-of-material-values).
    - For `section_name` values refer to this [list](https://github.com/michadenheijer/pynytimes/blob/main/VALID_SEARCH_OPTIONS.md#section-name-values).
- How many articles did you retrieve? What is the title of the first article?

In [None]:
# Example: replace YEAR, MONTH, DAY and the `...` placeholders with your own values
query = "your query"
begin = datetime(YEAR, MONTH, DAY)
end = datetime(YEAR, MONTH, DAY)
date_dict = {"begin": begin, "end": end}

options_dict = {
    "sort": "oldest",
    "sources": ["New York Times", "AP"],
    "type_of_material": [...],
    "section_name": [...],
}

articles = ...

In [None]:
# your code

## 3. Challenge: Most Positive, Most Negative

What are the three most positive and the three most negative texts in the NYT database of articles around the time of the 2024 election? Tip: try using the `sort_values()` method on the "sentiment" column in your df!
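As a minimal sketch of the `sort_values()` tip, the toy DataFrame below stands in for the real data. The column names `title` and `sentiment` and all the values are assumptions made for illustration; the columns in the election CSV may be named differently.

```python
import pandas as pd

# Toy data standing in for the real articles; titles and scores are invented.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "sentiment": [0.10, -0.80, 0.95, -0.20, 0.60],
})

# Highest scores first, then keep the top three rows.
most_positive = toy.sort_values("sentiment", ascending=False).head(3)

# Lowest scores first, then keep the top three rows.
most_negative = toy.sort_values("sentiment", ascending=True).head(3)

print(most_positive["title"].tolist())
print(most_negative["title"].tolist())
```

Sorting once in each direction and taking `head(3)` keeps the two results separate, which makes it easy to display the titles side by side.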

In [None]:
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import numpy as np 

df = pd.read_csv("Data/election2024_articles.csv")

In [None]:
# your code


In [None]:
# Titles with most positive texts


In [None]:
# Titles with most negative texts


# IMF API Practice

## 4. Challenge: Scraping country-level time series

Adapt our code for the IMF API to pull all the time series for a given country. Select two indicators of interest and plot them over time. 

In [None]:
# code here

# Web Scraping Practice: CERDI Doctoral Students

## 5. Challenge: Scraping the Data

Suppose we want to extract information on members of the CERDI team. Let's say we're specifically interested in the current doctoral students.

Visit the [CERDI team page](https://cerdi.uca.fr/version-francaise/unite/lequipe#/) and view the page source to see how the information is structured.

Your first task is to scrape and soup the content of the Team/Equipe page.

In [None]:
# your code

## 6. Challenge: Identify the number of doctoral students

Based on reviewing the source code, identify where in the soup the doctoral students are listed. Look at tags and attributes.

*Hint*: Some tags do not have attributes, but may have strings associated with them.  
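*Hint*: The tag `ul` identifies an unordered list, and each `li` inside it is one item of that list.

*Hint*: Calling `find_next('ul')` on the tag that holds the 'Doctorants' label tells BeautifulSoup: "Start at the 'Doctorants' label, then move forward through the page until you hit the first `<ul>` tag." This is more flexible than hardcoding the exact path to the list.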
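The sketch below shows `find_next('ul')` on a small, made-up HTML snippet. The `<h3>` labels and the list contents are assumptions for illustration; the real CERDI page may use different tags around the 'Doctorants' label, so inspect the source first.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the real page's tags and labels may differ.
html = """
<div>
  <h3>Chercheurs</h3>
  <ul><li>Researcher One</li></ul>
  <h3>Doctorants</h3>
  <p>Some text between the label and the list.</p>
  <ul>
    <li>Student One</li>
    <li>Student Two</li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the tag whose string is exactly the label, then jump to the next <ul>.
label = soup.find("h3", string="Doctorants")
student_list = label.find_next("ul")

# Each <li> inside that list is one student.
students = [li.get_text(strip=True) for li in student_list.find_all("li")]
print(len(students), students)
```

Note that the `<p>` sitting between the label and the list does not matter: `find_next` skips over it, which is why this approach tolerates small changes in page layout.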

In [None]:
# your code

## 7. Challenge: Scraping Function

Write a function that extracts information about each doctoral student. In particular, save the name, email, and URL associated with each doctoral student. Then run the function and export the results to a CSV file.
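The export step could be sketched as follows, using only the standard library. The `students` records, the `export_students` helper, and the output path are all hypothetical; in the challenge the records would come from your own scraping function.

```python
import csv
import os
import tempfile

# Hypothetical records; in the challenge these would come from your scraper.
students = [
    {"name": "Student One", "email": "one@example.org", "url": "https://example.org/one"},
    {"name": "Student Two", "email": "two@example.org", "url": "https://example.org/two"},
]

def export_students(rows, path):
    """Write a list of dicts to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "email", "url"])
        writer.writeheader()
        writer.writerows(rows)

# Write to the system temp folder so the example does not clutter your project.
out_path = os.path.join(tempfile.gettempdir(), "students_example.csv")
export_students(students, out_path)

# Read the file back to confirm the export worked.
with open(out_path, newline="", encoding="utf-8") as f:
    rows_back = list(csv.DictReader(f))
print(rows_back)
```

If the records are already in a pandas DataFrame, `DataFrame.to_csv(path, index=False)` achieves the same result in one line.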

In [None]:
# your function

In [None]:
# Test your code!


## 8. Challenge: Scrape Images

Write a new function to download the vignettes (profile pictures) associated with each student. The file name for each should be the student's name, with '_' instead of spaces.
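The file-naming rule can be sketched on its own, before any downloading. The helper name `vignette_filename` and the `.jpg` default extension are assumptions for illustration; use whatever extension the image URLs on the page actually carry.

```python
def vignette_filename(name, extension=".jpg"):
    """Build a file name from a student's name, using '_' instead of spaces."""
    # split() with no argument also trims the ends and collapses repeated spaces.
    cleaned = "_".join(name.split())
    return cleaned + extension

print(vignette_filename("Marie Curie"))  # prints Marie_Curie.jpg
```

Joining on `name.split()` rather than calling `name.replace(" ", "_")` avoids double underscores when a scraped name contains stray or repeated whitespace.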


In [None]:
# Your function

In [None]:
# run your function