## Scraping information about famous actresses from Wikipedia

Inspired by the Oscars this month, I will practice scraping a webpage by gathering information about actresses.

##### What process will I follow to accomplish this?

Step 1: Import packages

Step 2: Get the data

Follow the DigitalOcean guide: https://www.digitalocean.com/community/tutorials/how-to-scrape-web-pages-with-beautiful-soup-and-python-3

And the Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Focus on this list of 21st-century American actresses: https://en.wikipedia.org/wiki/Category:21st-century_American_actresses

##### Step A: Collect our page, parse it, and set it up as a BeautifulSoup object

In [3]:
# Import libraries
import csv

import requests
from bs4 import BeautifulSoup

# Request the category page for 21st-century American actresses
page = requests.get('https://en.wikipedia.org/wiki/Category:21st-century_American_actresses')

# Stop here with an error if the request failed (e.g. a 404 or 500 response)
page.raise_for_status()

# Parse the HTML into a BeautifulSoup object
soup = BeautifulSoup(page.text, 'html.parser')
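
To get a feel for what a BeautifulSoup object offers before working with the live page, here is a minimal sketch that parses a small hand-written HTML string. The markup and the names `sample_html` and `sample_soup` are made up for illustration and only loosely imitate a category page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on a category page (not real Wikipedia HTML)
sample_html = """
<html>
  <head><title>Category:Example actresses</title></head>
  <body>
    <div class="mw-category">
      <ul>
        <li><a href="/wiki/Amy_Adams" title="Amy Adams">Amy Adams</a></li>
        <li><a href="/wiki/Amy_Acker" title="Amy Acker">Amy Acker</a></li>
      </ul>
    </div>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# .title is the <title> tag; get_text() returns the text inside it
page_title = sample_soup.title.get_text()

# find() returns the first matching tag (or None if nothing matches);
# find_all() returns a list of every matching tag
category_div = sample_soup.find(class_='mw-category')
link_tags = category_div.find_all('a')
link_count = len(link_tags)

print(page_title)
print(link_count)
```

Because `find()` returns `None` when nothing matches, it is worth checking its result before calling `find_all()` on it when the page layout is not guaranteed.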

##### Step B: Collect the data we would like

For this step we will be collecting:
- Actresses' names
- Links to their Wikipedia pages

In [4]:
# Create a file to write to and add the header row
# The with block closes the file when we are done, so every row is saved to disk
# newline='' stops the csv module from writing blank lines between rows on Windows
with open('actress-names.csv', 'w', newline='', encoding='utf-8') as csv_file:
    f = csv.writer(csv_file)
    f.writerow(['Name', 'Link'])

    # Find the first element with the mw-category class
    actress_name_list = soup.find(class_='mw-category')

    # Find every <a> tag inside that element
    actress_name_list_items = actress_name_list.find_all('a')

    # Loop over the links, write each actress's name and link to the CSV, and print them
    # .contents[0] is the first child of the <a> tag, which is the link text
    for actress_name in actress_name_list_items:
        name = actress_name.contents[0]
        link = 'https://www.wikipedia.org' + actress_name.get('href')
        f.writerow([name, link])
        print(name)
        print(link)

Mariann Aalda
https://www.wikipedia.org/wiki/Mariann_Aalda
Aaliyah
https://www.wikipedia.org/wiki/Aaliyah
Caroline Aaron
https://www.wikipedia.org/wiki/Caroline_Aaron
Rose Abdoo
https://www.wikipedia.org/wiki/Rose_Abdoo
Paula Abdul
https://www.wikipedia.org/wiki/Paula_Abdul
Donzaleigh Abernathy
https://www.wikipedia.org/wiki/Donzaleigh_Abernathy
Whitney Able
https://www.wikipedia.org/wiki/Whitney_Able
Amy Acker
https://www.wikipedia.org/wiki/Amy_Acker
Allegra Acosta
https://www.wikipedia.org/wiki/Allegra_Acosta
Anabelle Acosta
https://www.wikipedia.org/wiki/Anabelle_Acosta
Ava Acres
https://www.wikipedia.org/wiki/Ava_Acres
Isabella Acres
https://www.wikipedia.org/wiki/Isabella_Acres
Tatum Adair
https://www.wikipedia.org/wiki/Tatum_Adair
Amy Adams
https://www.wikipedia.org/wiki/Amy_Adams
Brooke Adams (actress)
https://www.wikipedia.org/wiki/Brooke_Adams_(actress)
Catlin Adams
https://www.wikipedia.org/wiki/Catlin_Adams
Jane Adams (actress)
https://www.wikipedia.org/wiki/Jane_Adams_(actr
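
The loop above builds each link by gluing a fixed prefix onto the `href`. As an alternative sketch, the standard library's `urllib.parse.urljoin` can resolve the `href` against the page it was scraped from. The helper name `absolute_link` and the constant `BASE_URL` are my own for illustration. Note that this resolves to the `en.wikipedia.org` host the category page was fetched from, so the links differ slightly from the ones printed above.

```python
from urllib.parse import urljoin

# URL of the page the links were scraped from
BASE_URL = 'https://en.wikipedia.org/wiki/Category:21st-century_American_actresses'


def absolute_link(href, base=BASE_URL):
    """Resolve an href against the page it came from (hypothetical helper)."""
    return urljoin(base, href)


# A site-relative href picks up the scheme and host of the base page
relative_example = absolute_link('/wiki/Amy_Adams')

# An href that is already absolute is returned unchanged
absolute_example = absolute_link('https://example.org/some/page')

print(relative_example)
print(absolute_example)
```

The benefit over string concatenation is that hrefs which are already absolute, or which start with `//`, are handled without special cases.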

##### Step 3: Practice getting the birthdate from one actress's Wikipedia page

Now that we know how to pull a list of actresses from one category page on Wikipedia, let's try to get the information we want from an individual actress's page.

In [5]:
# Collect Meryl Streep's page
page_meryl = requests.get('https://en.wikipedia.org/wiki/Meryl_Streep')

# Parse the HTML into a BeautifulSoup object
soup_meryl = BeautifulSoup(page_meryl.text, 'html.parser')

# Get Meryl Streep's birthdate:
# Pull all content from the bday span
meryl_birthdate = soup_meryl.find(class_='bday')
birthday = meryl_birthdate.contents[0]
print(birthday)

1949-06-22
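
The birthdate comes back as text in `YYYY-MM-DD` form. If we want to do anything date-like with it, a minimal sketch is to convert it with `datetime.date.fromisoformat` (Python 3.7+). The helper `age_on` and the reference date used below are arbitrary choices for illustration, not part of the scraping itself.

```python
from datetime import date

# The value printed above, copied in so this sketch runs on its own
birthday_text = '1949-06-22'

# Convert the ISO-formatted string into a date object
birth_date = date.fromisoformat(birthday_text)


def age_on(birth, today):
    """Return the number of completed years between birth and today."""
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)


# Arbitrary reference date, chosen only to make the example reproducible
reference_day = date(2020, 1, 1)

print(birth_date.year)
print(age_on(birth_date, reference_day))
```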


In [6]:
# Get Meryl Streep's name:
# Pull all content from the firstHeading <h1> tag
meryl_name = soup_meryl.find(class_='firstHeading')
meryl = meryl_name.contents[0]
print(meryl)

Meryl Streep


In [7]:
# Find every tag with scope="row" (the row headers of the tables on the page)
allrows = soup_meryl.find_all(scope="row")
allrows

[<th scope="row">Born</th>,
 <th scope="row">Alma mater</th>,
 <th scope="row">Occupation</th>,
 <th scope="row">Years active</th>,
 <th scope="row"><span style="white-space:nowrap;">Works</span></th>,
 <th scope="row"><span class="nowrap">Spouse(s)</span></th>,
 <th scope="row"><span class="nowrap">Partner(s)</span></th>,
 <th scope="row">Children</th>,
 <th scope="row">Awards</th>,
 <th scope="row">Website</th>,
 <th class="navbox-group" scope="row" style="background: #EEDD82;width:1%">1928–1950</th>,
 <th class="navbox-group" scope="row" style="background: #EEDD82;width:1%">1951–1975</th>,
 <th class="navbox-group" scope="row" style="background: #EEDD82;width:1%">1976–2000</th>,
 <th class="navbox-group" scope="row" style="background: #EEDD82;width:1%">2001–present</th>,
 <th class="navbox-group" scope="row" style="background: #EEDD82;width:1%">1936–1950</th>,
 <th class="navbox-group" scope="row" style="background: #EEDD82;width:1%">1951–1975</th>,
 <th class="navbox-group" scope="
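
The list above mixes the infobox rows (Born, Occupation, and so on) with row headers from the navigation boxes further down the page. One way to narrow it down, sketched here on simplified, hand-written markup, is to search only inside the infobox table and pair each `<th scope="row">` with the `<td>` next to it. The markup, the placeholder values, and the helper name `infobox_rows` are assumptions for illustration; the real page is larger and may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup with placeholder values (not the real page)
infobox_html = """
<table class="infobox">
  <tr><th scope="row">Born</th><td>placeholder birth details</td></tr>
  <tr><th scope="row">Occupation</th><td>placeholder occupation</td></tr>
  <tr><th scope="row"><span class="nowrap">Spouse(s)</span></th><td>placeholder spouse</td></tr>
</table>
<table class="navbox">
  <tr><th class="navbox-group" scope="row">1928–1950</th><td>placeholder</td></tr>
</table>
"""


def infobox_rows(soup):
    """Map each infobox row label to the text of the <td> beside it."""
    rows = {}
    infobox = soup.find('table', class_='infobox')
    if infobox is None:
        return rows
    for header in infobox.find_all('th', scope='row'):
        value_cell = header.find_next_sibling('td')
        if value_cell is not None:
            label = header.get_text(strip=True)
            rows[label] = value_cell.get_text(' ', strip=True)
    return rows


infobox_soup = BeautifulSoup(infobox_html, 'html.parser')
rows = infobox_rows(infobox_soup)

for label, value in rows.items():
    print(label, '->', value)
```

Restricting the search to the infobox is what keeps the navbox headers out of the result.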

In [8]:
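# Create a file to write to; the with block closes it so the rows are saved to disk
with open('actress-birthdates.csv', 'w', newline='') as csv_file:
    f_meryl = csv.writer(csv_file)

    # Add the header row
    f_meryl.writerow(['Name', 'Birthdate'])

    # Add a row with Meryl's information
    # writerow returns the number of characters written
    chars_written = f_meryl.writerow([meryl, birthday])

print(chars_written)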

25
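
Before repeating this for every actress in the list, it helps to wrap the name and birthdate lookups in one function that copes with pages lacking a `bday` span. The sketch below is one possible shape for that, run on hand-written HTML snippets rather than live pages. The function name `extract_name_and_birthdate` and both snippets are assumptions for illustration.

```python
from bs4 import BeautifulSoup


def extract_name_and_birthdate(html):
    """Return (name, birthdate) from page HTML; either is None if not found."""
    soup = BeautifulSoup(html, 'html.parser')
    heading = soup.find(class_='firstHeading')
    bday = soup.find(class_='bday')
    name = heading.get_text(strip=True) if heading is not None else None
    birthdate = bday.get_text(strip=True) if bday is not None else None
    return name, birthdate


# Hypothetical snippets using the same class names as the cells above
page_with_bday = (
    '<h1 class="firstHeading">Meryl Streep</h1>'
    '<span class="bday">1949-06-22</span>'
)
page_without_bday = '<h1 class="firstHeading">Example Page</h1>'

print(extract_name_and_birthdate(page_with_bday))
print(extract_name_and_birthdate(page_without_bday))
```

Returning `None` for a missing field, instead of letting `.contents[0]` raise an error, means one incomplete page would not stop a loop over the whole list.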