# Web Scraping Development
## IMDB Actor Filmography

## Objectives
* To learn more about web scraping
* To pull an actor's acting credits from their filmography
* To create a DataFrame containing each film title and its 'href' link, for use in a later detailed data pull
* The example used throughout pulls actor filmography data from IMDB

In [2]:
# Install packages, if necessary
# pip install requests
# pip install beautifulsoup4

In [3]:
# Load libraries and URL:
import requests
import re
from bs4 import BeautifulSoup
import pandas as pd
# import numpy as np
# import seaborn as sns

# Example URL is for Tom Cruise:
url = 'https://www.imdb.com/name/nm0000129/'

# Load URL and confirm success by printing the first 100 bytes of the response:
r = requests.get(url)
print(r.content[:100])

# Parse HTML with BeautifulSoup:
soup = BeautifulSoup(r.content, 'html.parser')

b'\n\n\n\n<!DOCTYPE html>\n<html\n    xmlns:og="http://ogp.me/ns#"\n    xmlns:fb="http://www.facebook.com/200'
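
Printing the first 100 bytes shows that something came back, but an error page would print just as happily. A hedged alternative is to check the HTTP status explicitly. In this sketch, `fetch_html` is a hypothetical helper name and the 10-second timeout is an arbitrary choice; `raise_for_status()` is the standard `requests` call that raises on 4xx/5xx responses.

```python
import requests

def fetch_html(url, timeout=10):
    """Hypothetical helper: download a page and fail loudly on HTTP errors."""
    # timeout stops the call from hanging indefinitely on a stalled connection.
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx status codes; does nothing otherwise.
    response.raise_for_status()
    return response.content
```

The returned bytes could be passed to `BeautifulSoup(..., 'html.parser')` in the same way as `r.content`.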


In [21]:
# Pretty-print the HTML for manual review, if necessary:
# print(soup.prettify())

In [8]:
# Manual review of the prettified HTML suggests every actor credit is:
# - a <div> tag
# - with an 'id' attribute that starts with 'actor-'
# Note: the code can be modified at this point to scrape other types of credits

# Within 'soup', find all <div> tags whose 'id' attribute starts with 'actor-':
films = soup.find_all('div', id=re.compile('^actor-'))
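
The note about scraping other credit types can be sketched by building the pattern from a parameter. This assumes other sections use the same `<type>-` id prefix convention, which this notebook only confirms for `actor-`. The `credit_pattern` name and the sample ids are made up for illustration.

```python
import re

def credit_pattern(credit_type):
    """Hypothetical helper: regex matching ids that start with '<credit_type>-'."""
    # re.escape guards against regex metacharacters in the credit type.
    return re.compile('^' + re.escape(credit_type) + '-')

# Illustrative ids only, not copied from a live page.
sample_ids = ['actor-tt0000001', 'producer-tt0000002', 'actor-tt0000003']

actor_ids = [i for i in sample_ids if credit_pattern('actor').search(i)]
producer_ids = [i for i in sample_ids if credit_pattern('producer').search(i)]
print(actor_ids)
print(producer_ids)
```

The compiled pattern can be passed as the `id` argument of `soup.find_all('div', id=...)` in place of `re.compile('^actor-')`.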

In [9]:
# Review parsing accuracy:
for i in films:
    print(i.text)         # All text in the row: year, title, status, role
    print(i.get('href'))  # Prints None: the <div> has no 'href'; the link is on its child <a>



 2021

Mission: Impossible 7
(announced)

Ethan Hunt

None


 

Luna Park
(announced)


None


 

Untitled Tom Cruise/SpaceX Project
(announced)


None


 2022

Mission: Impossible 8
(announced)

Ethan Hunt

None


 

Live Die Repeat and Repeat
(pre-production)

Cage (rumored)

None


 2021

Top Gun: Maverick
(post-production)

Maverick

None


 2018

Mission: Impossible - Fallout

Ethan Hunt

None


 2017

American Made

Barry Seal

None


 2017

The Mummy

Nick Morton

None


 2016

Jack Reacher: Never Go Back

Jack Reacher

None


 2015

Mission: Impossible - Rogue Nation

Ethan Hunt

None


 2014

Edge of Tomorrow

Cage

None


 2013/I

Oblivion

Jack

None


 2012

Jack Reacher

Reacher

None


 2012

Rock of Ages

Stacee Jaxx

None


 2011

Mission: Impossible - Ghost Protocol

Ethan Hunt

None


 2010

Knight and Day

Roy Miller

None


 2008

Valkyrie

Colonel Claus von Stauffenberg

None


 2008

Tropic Thunder

Les Grossman - Grossman's Office

None


 2007

Lions for Lambs
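
Each `None` in the output above comes from calling `.get('href')` on the `<div>` itself, which carries no `href` attribute; the link sits on the `<a>` tag nested inside it. A small sketch of the difference, using made-up HTML that only mimics the structure assumed in this notebook (the class name, title and id are invented):

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration, not copied from a live page.
html = '''
<div id="actor-tt0000001">
  <span class="year_column">2018</span>
  <b><a href="/title/tt0000001/">Example Film</a></b>
</div>
'''
row = BeautifulSoup(html, 'html.parser').find('div')

div_href = row.get('href')     # None: the <div> has no href attribute
link_text = row.a.text         # row.a is the first <a> inside the <div>
link_href = row.a.get('href')  # the href lives on the <a> tag
print(div_href, link_text, link_href)
```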

In [10]:
# Create a new list and append [title, href] for each credit,
# taken from the first <a> tag inside each <div>:
filmsarray = []
for film in films:
    filmsarray.append([film.a.text, film.a.get('href')])

# Display the list for review:
filmsarray

[['Mission: Impossible 7', '/title/tt9603212/'],
 ['Luna Park', '/title/tt1123441/'],
 ['Untitled Tom Cruise/SpaceX Project', '/title/tt12273460/'],
 ['Mission: Impossible 8', '/title/tt9603208/'],
 ['Live Die Repeat and Repeat', '/title/tt5617712/'],
 ['Top Gun: Maverick', '/title/tt1745960/'],
 ['Mission: Impossible - Fallout', '/title/tt4912910/'],
 ['American Made', '/title/tt3532216/'],
 ['The Mummy', '/title/tt2345759/'],
 ['Jack Reacher: Never Go Back', '/title/tt3393786/'],
 ['Mission: Impossible - Rogue Nation', '/title/tt2381249/'],
 ['Edge of Tomorrow', '/title/tt1631867/'],
 ['Oblivion', '/title/tt1483013/'],
 ['Jack Reacher', '/title/tt0790724/'],
 ['Rock of Ages', '/title/tt1336608/'],
 ['Mission: Impossible - Ghost Protocol', '/title/tt1229238/'],
 ['Knight and Day', '/title/tt1013743/'],
 ['Valkyrie', '/title/tt0985699/'],
 ['Tropic Thunder', '/title/tt0942385/'],
 ['Lions for Lambs', '/title/tt0891527/'],
 ['Mission: Impossible III', '/title/tt0317919/'],
 ['War of the

In [15]:
# Convert the list to a DataFrame with named columns:
pdfilms = pd.DataFrame(filmsarray, columns=['Title', 'href'])
pdfilms

| | Title | href |
|---|---|---|
| 0 | Mission: Impossible 7 | /title/tt9603212/ |
| 1 | Luna Park | /title/tt1123441/ |
| 2 | Untitled Tom Cruise/SpaceX Project | /title/tt12273460/ |
| 3 | Mission: Impossible 8 | /title/tt9603208/ |
| 4 | Live Die Repeat and Repeat | /title/tt5617712/ |
| 5 | Top Gun: Maverick | /title/tt1745960/ |
| 6 | Mission: Impossible - Fallout | /title/tt4912910/ |
| 7 | American Made | /title/tt3532216/ |
| 8 | The Mummy | /title/tt2345759/ |
| 9 | Jack Reacher: Never Go Back | /title/tt3393786/ |
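
The `href` values are site-relative, so a later detailed data pull would need them resolved against the site root first. A minimal sketch using the standard library's `urljoin`; `BASE_URL` and `to_absolute` are hypothetical names, and the two rows are copied from the output above.

```python
from urllib.parse import urljoin

# Site root taken from the actor URL used earlier in this notebook.
BASE_URL = 'https://www.imdb.com'

def to_absolute(href, base=BASE_URL):
    """Hypothetical helper: resolve a site-relative href against the site root."""
    return urljoin(base, href)

rows = [['Mission: Impossible 7', '/title/tt9603212/'],
        ['Luna Park', '/title/tt1123441/']]
absolute_rows = [[title, to_absolute(href)] for title, href in rows]
print(absolute_rows)
```

On the DataFrame, the same function could be applied with `pdfilms['href'].map(to_absolute)`.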


## Repeat to Check
Repeat the same exercise with another actor, in this case Tom Hanks.

Note: Tom Hanks' filmography section begins with Producer credits rather than Actor credits. This originally caused an issue and forced me to rework the search so that it returns Actor credits only.
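
Because the filter keys on the `id` prefix rather than on section order, it should return only acting credits even when another section comes first. A sketch with made-up HTML follows. The `producer-` id is an assumption about how other sections are labelled, and `extract_credits` is a hypothetical wrapper around the steps used in this notebook.

```python
import re
from bs4 import BeautifulSoup

def extract_credits(html, credit_type='actor'):
    """Return [title, href] pairs for <div>s whose id starts with '<credit_type>-'."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = soup.find_all('div', id=re.compile('^' + re.escape(credit_type) + '-'))
    # Skip any matching <div> without an <a> child to avoid an AttributeError.
    return [[row.a.text, row.a.get('href')] for row in rows if row.a is not None]

# Made-up HTML: the producer row deliberately comes before the actor row.
html = '''
<div id="producer-tt0000002"><b><a href="/title/tt0000002/">Produced Film</a></b></div>
<div id="actor-tt0000001"><b><a href="/title/tt0000001/">Acted Film</a></b></div>
'''
actor_credits = extract_credits(html)
print(actor_credits)
```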

In [17]:
# Repeat for another actor to test:
# Example URL is for Tom Hanks:
url_test = 'https://www.imdb.com/name/nm0000158/'

# Load URL and confirm success:
r_test = requests.get(url_test)
# Verify by printing first 100 characters:
# print(r_test.content[:100])

# Parse HTML with BeautifulSoup:
soup_test = BeautifulSoup(r_test.content, 'html.parser')

In [18]:
# Prettify HTML for manual review:
# print(soup_test.prettify())

# Within 'soup_test', find all <div> tags whose 'id' attribute starts with 'actor-':
films_test = soup_test.find_all('div', id=re.compile('^actor-'))

# Build a list of [title, href] pairs from the first <a> tag in each <div>:
filmsarray_test = []
for film in films_test:
    filmsarray_test.append([film.a.text, film.a.get('href')])
filmsarray_test

[['Untitled Elvis Presley Project', '/title/tt3704428/'],
 ['A Man Called Ove', '/title/tt7405458/'],
 ['In the Garden of Beasts', '/title/tt2123969/'],
 ['BIOS', '/title/tt3420504/'],
 ['News of the World', '/title/tt6878306/'],
 ['Greyhound', '/title/tt6048922/'],
 ['A Beautiful Day in the Neighborhood', '/title/tt3224458/'],
 ['Toy Story 4', '/title/tt1979376/'],
 ['The Post', '/title/tt6294822/'],
 ['The David S. Pumpkins Halloween Special', '/title/tt7452910/'],
 ['The Circle', '/title/tt4287320/'],
 ['Inferno', '/title/tt3062096/'],
 ['Sully', '/title/tt3263904/'],
 ['Maya & Marty', '/title/tt5543284/'],
 ['A Hologram for the King', '/title/tt2980210/'],
 ['Ithaca', '/title/tt3501590/'],
 ['Bridge of Spies', '/title/tt3682448/'],
 ['Carly Rae Jepsen: I Really Like You', '/title/tt7094450/'],
 ['Toy Story That Time Forgot', '/title/tt3473654/'],
 ['Saving Mr. Banks', '/title/tt2140373/'],
 ['Toy Story of Terror', '/title/tt2446040/'],
 ['Captain Phillips', '/title/tt1535109/'],
 [

In [19]:
# Convert the list to a DataFrame with named columns:
pdfilms_test = pd.DataFrame(filmsarray_test, columns=['Title', 'href'])
pdfilms_test

| | Title | href |
|---|---|---|
| 0 | Untitled Elvis Presley Project | /title/tt3704428/ |
| 1 | A Man Called Ove | /title/tt7405458/ |
| 2 | In the Garden of Beasts | /title/tt2123969/ |
| 3 | BIOS | /title/tt3420504/ |
| 4 | News of the World | /title/tt6878306/ |
| ... | ... | ... |
| 86 | Happy Days | /title/tt0070992/ |
| 87 | Taxi | /title/tt0077089/ |
| 88 | Bosom Buddies | /title/tt0080202/ |
| 89 | The Love Boat | /title/tt0075529/ |
