# Homework: Intro Scraping Practice

In this assignment, we'll practice our scraping skills by examining the "Published Reproductions" section of Soma's Investigate.ai project: https://investigate.ai/

### 0) Setup

Import `requests`, `BeautifulSoup`, and `pandas`. Remember that, even though we installed the library with `pip install beautifulsoup4`, the import statement uses a slightly different name.

In [51]:
import requests
from bs4 import BeautifulSoup
import pandas as pd



### 1) Grab the HTML for https://investigate.ai/

Use `requests` to get the HTML, assigning it to a variable.

In [52]:
http_response = requests.get("https://investigate.ai/")

# display the response object; <Response [200]> means the request succeeded
http_response


<Response [200]>

In [53]:
# store the HTML of the page in a variable
html = http_response.text

### 2) Use `BeautifulSoup` to convert the HTML into its DOM representation

In [54]:
# convert the HTML to its DOM representation, naming the parser explicitly
# (without it, BeautifulSoup guesses a parser and emits a warning)
soup = BeautifulSoup(html, "html.parser")

### 3) Use `.select(...)` to select *just* the "Published reproductions" section


You'll want to use "View Source" or pop open the Element Inspector to figure out which element to target.

Reminder: An element's `class` attribute can contain *multiple* classes, separated by spaces. For example: `<div class="potato hamburger">Hello</div>` has two classes, `potato` and `hamburger`. A CSS selector for *either* class, `soup.select(".potato")` *or* `soup.select(".hamburger")`, will match that element.

Hint: Look for a `div` with a particular class.
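To make the reminder concrete, here is a minimal sketch of how class selectors behave on a multi-class element. The HTML snippet and the variable names (`sample_html`, `sample_soup`, and so on) are made up for illustration and are not taken from investigate.ai.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only to illustrate class matching
sample_html = (
    '<div class="potato hamburger">Hello</div>'
    '<div class="potato">Bye</div>'
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Either class on its own matches the first div
potato_matches = sample_soup.select(".potato")
hamburger_matches = sample_soup.select(".hamburger")

# Chaining the classes (no space) requires an element to have both
both_matches = sample_soup.select(".potato.hamburger")

print(len(potato_matches))     # 2: both divs have the class "potato"
print(len(hamburger_matches))  # 1: only the first div has "hamburger"
print(both_matches[0].text)    # Hello
```

This is why targeting just one of the classes on the section's `div` is usually enough here.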

In [55]:
# select the div whose classes include "section-projects" (.select() returns a list)
projects_section = soup.select(".section-projects")
projects_section

[<div class="content section-projects">
 <div class="columns">
 <div class="column col-sm-auto col-8 col-mx-auto text-center">
 <h3>Published reproductions</h3>
 <p>Data science and machine learning can be used anywhere! From a small visualization at The Upshot to a year-long investigations by Reveal, let's try to put these new skills in context.</p>
 </div>
 <div class="column col-12">
 <div class="columns">
 <div class="column col-4 col-lg-6 col-sm-12">
 <article class="card">
 <div class="card-header">
 <h5><a href="/nyt-takata-airbags/">Searching for faulty airbags in vehicle complaints</a></h5>
 </div>
 <div class="card-body">
 <p>The National Highway Transportation Safety Administration receives thousands and thousands of vehicle complaints each year. Can we train a computer to filter out leads on Takata airbag malfunctions?</p>
 </div>
 <div class="card-footer">
                         The New York Times
                     </div>
 </article>
 </div>
 <div class="column col-4 

### 4) Use `projects_section.select(...)` to select elements that represent a single project

Assign the list to a variable named `project_els`.
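One thing to watch for: `.select(...)` returns a list of elements, and a list has no `.select(...)` method of its own. A possible way around this, sketched below on a made-up HTML snippet (the markup and names such as `sample_html` and `section` are assumptions for illustration), is to grab a single element with `.select_one(...)` and then select inside it.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that loosely imitates the page structure
sample_html = """
<div class="content section-projects">
  <article class="card"><h5><a href="/a/">Project A</a></h5></article>
  <article class="card"><h5><a href="/b/">Project B</a></h5></article>
</div>
<article class="card"><h5><a href="/c/">Not a project</a></h5></article>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# select_one returns the first matching element (or None), not a list
section = sample_soup.select_one(".section-projects")

# Selecting inside that element only finds cards within the section
cards = section.select("article.card")
titles = [card.select_one("h5").text.strip() for card in cards]

print(titles)  # ['Project A', 'Project B']
```

Looping over the list and calling `.select(...)` on each item, or using a single descendant selector such as `.section-projects article.card`, should give the same elements.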

In [59]:
# select class "content section-projects" section
projects_section = soup.select(".section-projects")

# Assign the list to a variable named `project_els`.
project_els = []

for project in projects_section:
    # project.select("article.card") returns a list, so use .extend() to add its elements to project_els
    project_els.extend(project.select("article.card"))

print(project_els)


[<article class="card">
<div class="card-header">
<h5><a href="/nyt-takata-airbags/">Searching for faulty airbags in vehicle complaints</a></h5>
</div>
<div class="card-body">
<p>The National Highway Transportation Safety Administration receives thousands and thousands of vehicle complaints each year. Can we train a computer to filter out leads on Takata airbag malfunctions?</p>
</div>
<div class="card-footer">
                        The New York Times
                    </div>
</article>, <article class="card">
<div class="card-header">
<h5><a href="/latimes-crime-classification/">Building a crime classification engine</a></h5>
</div>
<div class="card-body">
<p>Using machine learning as an investigative tool to cast light on years of underreporting by the Los Angeles Police Department.</p>
</div>
<div class="card-footer">
                        Los Angeles Times
                    </div>
</article>, <article class="card">
<div class="card-header">
<h5><a href="/caixin-museum-word-

### 5) Count the number of matching elements, using `len`

Does it match the number of projects you see on the page? (It should.)

In [57]:
# project_els should now be a flat list containing all article.card elements
print(len(project_els))  # this should match the number of projects on the page

26


### 6) For each project, print its title, publisher, summary, and link

You'll want to construct a `for` loop. In each iteration of the loop, print out something that looks like this:

```
Title: Searching for faulty airbags in vehicle complaints

Publisher: The New York Times

Summary: The National Highway Transportation Safety Administration receives thousands and thousands of vehicle complaints each year. Can we train a computer to filter out leads on Takata airbag malfunctions?

Link: https://investigate.ai/nyt-takata-airbags/
---
```
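The `href` values on the page are relative paths such as `/nyt-takata-airbags/`, so they need the site's address in front of them to become full links. One way to do that, sketched here with the standard library's `urljoin` (the constant name `BASE_URL` is just an illustrative choice), is:

```python
from urllib.parse import urljoin

BASE_URL = "https://investigate.ai/"  # illustrative constant name

# A relative link like the ones found in the project cards
relative_link = "/nyt-takata-airbags/"
absolute_link = urljoin(BASE_URL, relative_link)
print(absolute_link)  # https://investigate.ai/nyt-takata-airbags/

# urljoin leaves links that are already absolute untouched
already_absolute = urljoin(BASE_URL, "https://example.com/page/")
print(already_absolute)  # https://example.com/page/
```

Plain string concatenation also works for this page, but `urljoin` avoids doubled or missing slashes.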

In [58]:
for project in project_els:
    # Assuming each piece of information is uniquely identifiable within the project element
    title = project.find("h5").text.strip()
    publisher = project.find(class_="card-footer").text.strip()
    summary = project.find(class_="card-body").text.strip()
    link = project.find("a")['href'].strip()  # Assuming the link is in an href attribute

    print(f"Title: {title}\nPublisher: {publisher}\nSummary: {summary}\nLink: 'https://investigate.ai{link}'\n")

Title: Searching for faulty airbags in vehicle complaints
Publisher: The New York Times
Summary: The National Highway Transportation Safety Administration receives thousands and thousands of vehicle complaints each year. Can we train a computer to filter out leads on Takata airbag malfunctions?
Link: 'https://investigate.ai/nyt-takata-airbags/'

Title: Building a crime classification engine
Publisher: Los Angeles Times
Summary: Using machine learning as an investigative tool to cast light on years of underreporting by the Los Angeles Police Department.
Link: 'https://investigate.ai/latimes-crime-classification/'

Title: Chinese museum analysis
Publisher: Caixin
Summary: A word-count analysis of the names of around 4500 museums in China.
Link: 'https://investigate.ai/caixin-museum-word-count/'

Title: Analyzing online safety through app store reviews
Publisher: The Washington Post
Summary: After downloading over a hundred thousand reviews of "random chat apps," how to find reports of bu

### 7) Now, let's do the same thing, but storing the info in a `pandas` `DataFrame`

Specifically, a `DataFrame` with the columns `title`, `publisher`, `summary`, and `link`.

In [63]:
# Initialize an empty list to store project details
projects_data = []

for project in project_els:
    # Extract project details
    title = project.find("h5").text.strip()
    publisher = project.find(class_="card-footer").text.strip()
    summary = project.find(class_="card-body").text.strip()
    link = project.find("a")['href'].strip()  # Assuming the link is in an href attribute
    
    # Append project details to the list as a dictionary
    projects_data.append({
        "Title": title,
        "Publisher": publisher,
        "Summary": summary,
        "Link": f'https://investigate.ai{link}'
    })

# Convert the list of dictionaries into a DataFrame
projects_df = pd.DataFrame(projects_data)

# Display the DataFrame
projects_df

| | Title | Publisher | Summary | Link |
|---|---|---|---|---|
| 0 | Searching for faulty airbags in vehicle compla... | The New York Times | The National Highway Transportation Safety Adm... | https://investigate.ai/nyt-takata-airbags/ |
| 1 | Building a crime classification engine | Los Angeles Times | Using machine learning as an investigative too... | https://investigate.ai/latimes-crime-classific... |
| 2 | Chinese museum analysis | Caixin | A word-count analysis of the names of around 4... | https://investigate.ai/caixin-museum-word-count/ |
| 3 | Analyzing online safety through app store reviews | The Washington Post | After downloading over a hundred thousand revi... | https://investigate.ai/wapo-app-reviews/ |
| 4 | Uncovering abusive doctors that were allowed t... | Atlanta Journal-Constitution | How to comb through 100,000 disciplinary docum... | https://investigate.ai/ajc-doctors-abuse/ |
| 5 | Analyzing the tone of Trump's speeches | The New York Times | Standard sentiment analysis scores a document ... | https://investigate.ai/upshot-trump-emolex/ |
| 6 | Detecting special interest model legislation i... | USA Today, The Arizona Republic, and the Cente... | Special interest groups use model legislation ... | https://investigate.ai/azcentral-text-reuse-mo... |
| 7 | Detecting bots in FCC comment submissions | | The comment period on the FCC's net neutrality... | https://investigate.ai/fcc-comments/ |
| 8 | Figuring out what Democratic candidates care a... | Bloomberg | In the wide field of Democratic presidential c... | https://investigate.ai/bloomberg-tweet-topics/ |
| 9 | What does Trump tweet about? | The New York Times | What does Trump tweet about? An analysis of ov... | https://investigate.ai/nyt-trump-tweets/ |


In [74]:
projects_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Title           26 non-null     object
 1   Publisher       26 non-null     object
 2   Summary         26 non-null     object
 3   Link            26 non-null     object
 4   Summary_Length  26 non-null     int64 
dtypes: int64(1), object(4)
memory usage: 1.1+ KB


### 8) Using that `DataFrame`, calculate:

Who are the most-featured publishers?

In [65]:
projects_df['Publisher'].value_counts().head(2)

Publisher
The New York Times    3
ProPublica            3
Name: count, dtype: int64

Which project has the longest summary, in number of characters of text? How long is it?

In [71]:
# Add a new column to the DataFrame that contains the length of the summary
projects_df['Summary_Length'] = projects_df['Summary'].apply(len)


In [70]:
# Display the DataFrame to check if the new column has been added
projects_df

| | Title | Publisher | Summary | Link | Summary_Length |
|---|---|---|---|---|---|
| 0 | Searching for faulty airbags in vehicle compla... | The New York Times | The National Highway Transportation Safety Adm... | https://investigate.ai/nyt-takata-airbags/ | 198 |
| 1 | Building a crime classification engine | Los Angeles Times | Using machine learning as an investigative too... | https://investigate.ai/latimes-crime-classific... | 126 |
| 2 | Chinese museum analysis | Caixin | A word-count analysis of the names of around 4... | https://investigate.ai/caixin-museum-word-count/ | 67 |
| 3 | Analyzing online safety through app store reviews | The Washington Post | After downloading over a hundred thousand revi... | https://investigate.ai/wapo-app-reviews/ | 143 |
| 4 | Uncovering abusive doctors that were allowed t... | Atlanta Journal-Constitution | How to comb through 100,000 disciplinary docum... | https://investigate.ai/ajc-doctors-abuse/ | 87 |
| 5 | Analyzing the tone of Trump's speeches | The New York Times | Standard sentiment analysis scores a document ... | https://investigate.ai/upshot-trump-emolex/ | 193 |
| 6 | Detecting special interest model legislation i... | USA Today, The Arizona Republic, and the Cente... | Special interest groups use model legislation ... | https://investigate.ai/azcentral-text-reuse-mo... | 149 |
| 7 | Detecting bots in FCC comment submissions | | The comment period on the FCC's net neutrality... | https://investigate.ai/fcc-comments/ | 157 |
| 8 | Figuring out what Democratic candidates care a... | Bloomberg | In the wide field of Democratic presidential c... | https://investigate.ai/bloomberg-tweet-topics/ | 126 |
| 9 | What does Trump tweet about? | The New York Times | What does Trump tweet about? An analysis of ov... | https://investigate.ai/nyt-trump-tweets/ | 63 |


In [72]:
# Get the project with the longest summary with the idxmax() method
projects_df.loc[projects_df['Summary_Length'].idxmax()]

Title                            Bias in the jury selection process
Publisher                                               APM Reports
Summary           When selecting a jury, both the defense and th...
Link                  https://investigate.ai/apm-reports-jury-bias/
Summary_Length                                                  243
Name: 21, dtype: object

How many times longer is it than the average summary?

In [75]:
# Ratio of the longest summary's length to the average summary length
longest_summary_length = projects_df['Summary_Length'].max()
average_summary_length = projects_df['Summary_Length'].mean()

ratio = longest_summary_length / average_summary_length
print(f"it is {ratio:.2f} times longer than the average summary")


it is 1.82 times longer than the average summary

