# Scrape for Movie Data Set 

This notebook is part of my Stanley Tucci movie exploration project. It walks through scraping the data we want from Wikipedia. The data is cleaned further and analyzed in subsequent notebooks.

### Get Base URLs for Actor/Movie Genre/Production Company We Want

Change this URL to the Wikipedia page you want to explore for this project.

In [2]:
main_wiki_page = 'https://en.wikipedia.org/wiki/Stanley_Tucci'

### Import Packages

In [1]:
from bs4 import BeautifulSoup as bs
import requests

### Read in Wiki Page with Beautiful Soup and Select Film Table

In [4]:
r = requests.get(main_wiki_page)

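# Convert to a Beautiful Soup object (naming the parser avoids bs4's guessed-parser warning)
soup = bs(r.content, 'html.parser')

# Pretty-print the HTML for inspection
contents = soup.prettify()
#print(contents)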

Now that we have loaded the Wikipedia page, we want to choose the URLs of the movies/TV shows to explore. In this example, I only explore the films that Stanley Tucci acted in; a later iteration could expand to his many TV (and Emmy-winning) performances.

In [5]:
## Get only the Film Table from the HTML Page

table = soup.find_all('table', class_="wikitable")[0]   # Only use the first table

Within this table, I only select films that have a Wikipedia page. Films without one could be explored in another iteration of the project and would require more independent research on each film.

In [6]:
# Film titles are italicized; select only the links inside the italics
movies = table.select("i a")

### Get Information Table For Each Film 

This section pulls the info box for each film and applies the following cleaning:

1. Remove all reference markers (e.g. [1])
2. Make long strings into lists (production team, starring, etc.)
3. Convert numerical strings to numeric values
4. Convert the release date to a datetime object

In [84]:
## Extract a cell's value; non-breaking spaces (\xa0) become regular spaces
def get_content_value(row_data):
    if row_data.find('li'):
        return [li.get_text(" ", strip=True).replace("\xa0", " ") for li in row_data.find_all('li')]
    elif row_data.find('br'):
        return [text.replace("\xa0", " ") for text in row_data.stripped_strings]
    else:
        return row_data.get_text(" ", strip=True).replace("\xa0", " ")

# Remove reference markers (<sup> tags) from the parsed page, in place
def clean_tags(soup):
    for tag in soup.find_all("sup"):
        tag.decompose()


## Take a film's URL and return its info box as a dictionary
def get_info_box(url):
    r = requests.get(url)

    # Convert to a Beautiful Soup object
    soup = bs(r.content, 'html.parser')

    # Remove references
    clean_tags(soup)

    # Get just the info box of the movie (None if the page has no info box)
    info_box = soup.find(class_='infobox vevent')
    info_rows = info_box.find_all('tr')

    movie_info = {}

    for index, row in enumerate(info_rows):
        if index == 0:
            # The first row holds the film title
            movie_info['title'] = row.find('th').get_text(" ", strip=True)
        else:
            # Remaining rows are header (th) / value (td) pairs
            header = row.find('th')
            if header:
                content_key = header.get_text(" ", strip=True)
                content_value = get_content_value(row.find('td'))
                movie_info[content_key] = content_value

    return movie_info

Check that this works on one example:

In [85]:
base_path = 'https://en.wikipedia.org/'  # defined here so this cell runs on its own
get_info_box(base_path + movies[0]['href'])

{'title': "Prizzi's Honor",
 'Directed by': 'John Huston',
 'Produced by': 'John Foreman',
 'Screenplay by': ['Richard Condon', 'Janet Roach'],
 'Based on': ["Prizzi's Honor", 'by Richard Condon'],
 'Starring': ['Jack Nicholson', 'Kathleen Turner'],
 'Music by': 'Alex North',
 'Cinematography': 'Andrzej Bartkowiak',
 'Edited by': ['Kaja Fehr', 'Rudi Fehr'],
 'Production company': 'ABC Motion Pictures',
 'Distributed by': ['20th Century Fox',
  '(U.S.)',
  'Producers Sales Organization',
  '(International)'],
 'Release date': 'June 14, 1985',
 'Running time': '130 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$16 million',
 'Box office': '$26.6 million'}
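
The URL passed to get_info_box is built by plain string concatenation. Because base_path ends in a slash and Wikipedia hrefs begin with one, the result contains a double slash. Below is a minimal sketch of a cleaner join using the standard library's urljoin. The helper name full_url and the sample href are illustrative and not part of the notebook.

```python
from urllib.parse import urljoin

BASE = "https://en.wikipedia.org/"

def full_url(href, base=BASE):
    # urljoin resolves an absolute path such as "/wiki/..." against the site root
    return urljoin(base, href)

sample_href = "/wiki/Prizzi%27s_Honor"   # illustrative href
concatenated = BASE + sample_href        # what plain concatenation produces
joined = full_url(sample_href)

print(concatenated)
print(joined)
```

The joined form has a single slash after the host, so it does not depend on the server tolerating a doubled one.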

In [86]:
movie_info_list = []
base_path = 'https://en.wikipedia.org/'
for index, movie in enumerate(movies) :
    try:
        relative_path = movie['href']
        full_path = base_path + relative_path
        title = movie['title']
        
        movie_info_list.append(get_info_box(full_path))
        
    except Exception as e:
        print(movie.get_text())
        print(e)
        print()

Monkey Shines
'NoneType' object has no attribute 'find_all'

A Modern Affair
'NoneType' object has no attribute 'find_all'

The Life and Death of Peter Sellers
'NoneType' object has no attribute 'find'



In [100]:
movie_info_list

[{'title': "Prizzi's Honor",
  'Directed by': 'John Huston',
  'Produced by': 'John Foreman',
  'Screenplay by': ['Richard Condon', 'Janet Roach'],
  'Based on': ["Prizzi's Honor", 'by Richard Condon'],
  'Starring': ['Jack Nicholson', 'Kathleen Turner'],
  'Music by': 'Alex North',
  'Cinematography': 'Andrzej Bartkowiak',
  'Edited by': ['Kaja Fehr', 'Rudi Fehr'],
  'Production company': 'ABC Motion Pictures',
  'Distributed by': ['20th Century Fox',
   '(U.S.)',
   'Producers Sales Organization',
   '(International)'],
  'Release date': 'June 14, 1985',
  'Running time': '130 minutes',
  'Country': 'United States',
  'Language': 'English',
  'Budget': '$16 million',
  'Box office': '$26.6 million',
  'Running time (int)': 130},
 {'title': "Who's That Girl",
  'Directed by': 'James Foley',
  'Produced by': ['Rosilyn Heller', 'Bernard Williams'],
  'Written by': ['Andrew Smith', 'Ken Finkleman'],
  'Starring': ['Madonna',
   'Griffin Dunne',
   'Haviland Morris',
   'John McMartin',
 

First Iteration:
These three films will not be assessed. A Modern Affair doesn't have an info box. In the next iteration, I will look more closely at the HTML for Monkey Shines and The Life and Death of Peter Sellers to work out how to copy their info box information.
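
To make the skipped films easy to revisit, the failures could be collected rather than only printed. The following is a rough sketch under that assumption. scrape_all and fake_fetch are hypothetical names, and fake_fetch stands in for get_info_box so the example runs without network access.

```python
def scrape_all(links, fetch):
    """links: iterable of (title, url) pairs; fetch: function url -> dict."""
    results, failures = [], {}
    for title, url in links:
        try:
            results.append(fetch(url))
        except Exception as exc:
            # Keep the error next to the title so it can be investigated later
            failures[title] = f"{type(exc).__name__}: {exc}"
    return results, failures

def fake_fetch(url):
    # Stand-in for get_info_box: fails the way a page without an info box does
    if url.endswith("missing"):
        raise AttributeError("'NoneType' object has no attribute 'find_all'")
    return {"title": url.rsplit("/", 1)[-1]}

results, failures = scrape_all(
    [("Film A", "https://example.org/Film_A"),
     ("Film B", "https://example.org/missing")],
    fake_fetch,
)
print(failures)
```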


#### Finish First Round of Cleaning the Data

In [101]:
#print([movie.get('Running time', 'N/A') for movie in movie_info_list])

# Convert a string such as "85 minutes" to the integer 85
def minute_to_integer(running_time):
    if running_time == "N/A":
        return None

    value = int(running_time.split(" ")[0])
    return value

for movie in movie_info_list:
    movie["Running time (int)"] = minute_to_integer(movie.get('Running time', 'N/A'))
    

In [104]:
print([movie.get('Running time (int)','N/A') for movie in movie_info_list])

[130, 94, 124, 85, 89, 113, 106, 93, 87, 105, 99, 89, 141, 101, 126, None, 88, 101, 89, 87, 107, 96, 92, 103, 95, 96, 101, 116, 104, 104, 108, 103, 97, 85, 117, 105, 135, 107, 128, 106, 90, 110, 109, 116, 110, 80, 100, 81, 120, 104, 93, 123, 135, 92, 119, 109, 124, 142, 89, 125, 115, 106, 128, 146, 82, 126, 92, 107, 165, 92, 117, 123, 99, 129, 137, 90, 129, 149, 105, 106, 91, 93, 110, 98, 90, 118, 93, 106, None, None, None]
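
The None entries above correspond to films with no 'Running time' key. A slightly more defensive variant is sketched here, assuming a value might also arrive as a list or carry extra text around the number. parse_minutes is a hypothetical name, not a function from the notebook.

```python
import re

def parse_minutes(running_time):
    # Lists can appear when an info box cell holds several lines; use the first
    if isinstance(running_time, list):
        running_time = running_time[0] if running_time else None
    if not running_time:
        return None
    # Take the first run of digits, e.g. "130" from "130 minutes"
    match = re.search(r"\d+", running_time)
    return int(match.group()) if match else None

print(parse_minutes("130 minutes"))
print(parse_minutes(["93 minutes", "(US)"]))
print(parse_minutes("N/A"))
```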


In [115]:
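# Convert Budget and Box office strings to numbers
import re

amounts = r"thousand|million|billion"
number = r"\d+(,\d{3})*\.*\d*"    # e.g. 16, 26.6 or 1,200,000

# "$16 million" or a range such as "$12–15 million"
word_re = rf"\${number}(-|\sto\s|–)?({number})?\s({amounts})"
# Plain dollar figure such as "$1,200,000"
value_re = rf"\${number}"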

def word_to_value(word):
    value_dict = {"thousand": 1000, "million": 1000000, "billion": 1000000000}
    return value_dict[word]

def parse_word_syntax(string):
    value_string = re.search(number, string).group()
    value = float(value_string.replace(",", ""))
    word = re.search(amounts, string, flags=re.I).group().lower()
    word_value = word_to_value(word)
    return value*word_value

def parse_value_syntax(string):
    value_string = re.search(number, string).group()
    value = float(value_string.replace(",", ""))
    return value

def budget_box_to_integer(value):
    if value == "N/A":
        return None
    
    if isinstance(value,list):
        value = value[0]
        
    word_syntax = re.search(word_re,value,flags= re.I)
    value_syntax = re.search(value_re, value)
    
    if word_syntax:
        return parse_word_syntax(word_syntax.group())
    elif value_syntax:
        return parse_value_syntax(value_syntax.group())
    else:
        return None

In [119]:
for movie in movie_info_list:
    movie["Box office (float)"] = budget_box_to_integer(movie.get('Box office', 'N/A'))
    movie["Budget (float)"] = budget_box_to_integer(movie.get('Budget','N/A'))
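
To see how the two patterns behave, here is a condensed restatement of the same logic for single strings, run on a few illustrative inputs. parse_money and the sample strings are assumptions made for the sketch. The regular expressions are copied from the cell above.

```python
import re

AMOUNTS = r"thousand|million|billion"
NUMBER = r"\d+(,\d{3})*\.*\d*"
WORD_RE = rf"\${NUMBER}(-|\sto\s|–)?({NUMBER})?\s({AMOUNTS})"
VALUE_RE = rf"\${NUMBER}"
SCALE = {"thousand": 1_000, "million": 1_000_000, "billion": 1_000_000_000}

def parse_money(text):
    word = re.search(WORD_RE, text, flags=re.I)   # "$16 million" style
    plain = re.search(VALUE_RE, text)             # "$1,200,000" style
    match = word or plain
    if not match:
        return None
    # The first number in the match is used, so a range yields its lower bound
    value = float(re.search(NUMBER, match.group()).group().replace(",", ""))
    if word:
        unit = re.search(AMOUNTS, word.group(), flags=re.I).group().lower()
        value *= SCALE[unit]
    return value

samples = ["$16 million", "$12–15 million", "$1,200,000", "unknown"]
examples = {s: parse_money(s) for s in samples}
print(examples)
```

Note that a range such as "$12–15 million" is reduced to its first figure. Whether the lower bound is the right choice depends on the later analysis.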
   

In [121]:
# Convert to Date Time 

from datetime import datetime

dates = [movie.get('Release date', 'N/A') for movie in movie_info_list]

def clean_date(date):
    return date.split("(")[0].strip()

def date_conversion(date):
    if isinstance(date, list):
        date = date[0]
        
    if date == "N/A":
        return None
        
    date_str = clean_date(date)

    fmts = ["%B %d, %Y", "%d %B %Y"]
    for fmt in fmts:
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError:
            # Format did not match; try the next one
            pass
    return None

In [122]:
for movie in movie_info_list:
    movie['Release date (datetime)'] = date_conversion(movie.get('Release date', 'N/A'))
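
As a quick check of the date handling, the sketch below applies the same steps to two date styles and to a missing value. to_datetime is a hypothetical standalone restatement, and the second sample string is illustrative. Month names are parsed with the current locale, assumed here to be English.

```python
from datetime import datetime

def to_datetime(raw):
    if isinstance(raw, list):
        raw = raw[0]
    # Drop everything from the first "(" on, e.g. "( 2008-12-19 ) (United States)"
    cleaned = raw.split("(")[0].strip()
    for fmt in ("%B %d, %Y", "%d %B %Y"):
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    return None

us_style = to_datetime(["December 19, 2008 ( 2008-12-19 ) (United States)"])
uk_style = to_datetime("14 June 1985")
unparsed = to_datetime("N/A")
print(us_style, uk_style, unparsed)
```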
    

In [123]:
movie_info_list[50]

{'title': 'The Tale of Despereaux',
 'Directed by': ['Sam Fell', 'Rob Stevenhagen'],
 'Produced by': ['Gary Ross', 'Allison Thomas'],
 'Screenplay by': 'Gary Ross',
 'Story by': ['Will McRobb', 'Chris Viscardi'],
 'Based on': ['The Tale of Despereaux', 'by', 'Kate DiCamillo'],
 'Starring': ['Matthew Broderick',
  'Robbie Coltrane',
  'Dustin Hoffman',
  'Richard Jenkins',
  'Kevin Kline',
  'Frank Langella',
  'William H. Macy',
  'Tracey Ullman',
  'Emma Watson'],
 'Narrated by': 'Sigourney Weaver',
 'Music by': 'William Ross',
 'Cinematography': 'Brad Blackbourn',
 'Edited by': 'Mark Solomon',
 'Production company': ['Universal Pictures',
  'Relativity Media',
  'Larger Than Life Productions'],
 'Distributed by': 'Universal Pictures',
 'Release date': ['December 19, 2008 ( 2008-12-19 ) (United States)'],
 'Running time': '93 minutes',
 'Country': ['United Kingdom', 'United States'],
 'Language': 'English',
 'Budget': '$60 million',
 'Box office': '$86.9 million',
 'Running time (int)

### Get Plot Overview For Each Film

To find the plot, I use a different setup: the wikipedia package for Python.

Cleaning done in this section:

1. Look for a section named Plot or Synopsis; if neither exists, take the introductory paragraph
2. Remove all newline (\n) and apostrophe (') characters from the plot, keeping all other punctuation
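
The two steps above can be sketched on a plain string, without any network call. This is an illustration only: extract_plot is a hypothetical helper, it assumes section headings look like "== Plot ==" on their own line, and it joins lines with a space rather than removing newlines outright.

```python
import re

def extract_plot(content):
    """Return the Plot or Synopsis section of a page's plain text, else the intro."""
    # Split on level-2 headings such as "== Plot ==" (not "=== Sub ===")
    parts = re.split(r"\n==\s*([^=\n]+?)\s*==\n", content)
    intro, rest = parts[0], parts[1:]
    sections = dict(zip(rest[0::2], rest[1::2]))
    for name in ("Plot", "Synopsis"):
        if name in sections:
            text = sections[name]
            break
    else:
        text = intro
    # Collapse newlines and repeated whitespace, then drop apostrophes
    return " ".join(text.split()).replace("'", "")

page = "Intro text.\n\n\n== Plot ==\nA chef's story.\nSecond line.\n\n\n== Cast ==\nStanley Tucci"
with_plot = extract_plot(page)
intro_only = extract_plot("Just an intro.\nMore.")
print(with_plot)
print(intro_only)
```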

In [72]:
movie_plot_text = {}

def get_plot_text(movie_list):
    import wikipedia
    
    #get title of wikipedia page interested in
    for index, movie in enumerate(movie_list):
        try:
            wiki = wikipedia.page(movie['title'])
            
            #extract plain text 
            text = wiki.content
            
            # Replace '==' with '' (an empty string)
            text = text.replace('==', '')
            
            #Get Sections of the Wikipedia page
            split_text = text.split('\n\n\n')
            
            #Default to the first section after the introduction
            plot = split_text[1]
            
            #Clean up Plot Text 
            
            for index, text in enumerate(split_text):
                if 'Plot' in text[:5] :
                    plot = text[6:]
                    break
                
                if 'Synopsis' in text[:10]:
                    plot = text[10:]
                    break
                    
                if index == len(split_text) - 1:
                    plot = split_text[0]
            
            
            # Remove newlines ('\n') and apostrophes from the plot text
            plot = plot.replace('\n', '')
            plot = plot.replace('\'','')
            
              
            #Finish cleaning of plot and return in the dictionary
            movie_plot_text[movie['title']] = plot
            
            
        except Exception as e:
            print(movie.get_text())
            print(e)
            print()
            
    return movie_plot_text
    
    
    

In [74]:
plot_text_dict = get_plot_text(movies)

Julie & Julia
Page id "julien julian" does not match any pages. Try another id!

The Silence
Page id "the silence (2013 film)" does not match any pages. Try another id!

The Witches
Page id "the witches 2021 film" does not match any pages. Try another id!



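Three movies have titles that the wikipedia package could not match to a page. The error for Julie & Julia shows that the lookup was attempted with a different title ("julien julian") than the one requested.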

### Rotten Tomatoes Dataset


In [126]:
import requests
import urllib.parse
import os

In [133]:
base_url = 'http://www.omdbapi.com/?'  # the API key is added from the OMDB_API_KEY environment variable in get_omdb_info


In [134]:
#Taken From GitHub 
def get_omdb_info(title):
    #base_url = "http://www.omdbapi.com/?"
    parameters = {"apikey": os.environ['OMDB_API_KEY'], 't': title}
    params_encoded = urllib.parse.urlencode(parameters)
    full_url = base_url + params_encoded
    return requests.get(full_url).json()

def get_rotten_tomato_score(omdb_info):
    ratings = omdb_info.get('Ratings', [])
    for rating in ratings:
        if rating['Source'] == 'Rotten Tomatoes':
            return rating['Value']
    return None

In [135]:
for movie in movie_info_list:
    title = movie['title']
    omdb_info = get_omdb_info(title)
    movie['imdb'] = omdb_info.get('imdbRating', None)
    movie['metascore'] = omdb_info.get('Metascore', None)
    movie['rotten_tomatoes'] = get_rotten_tomato_score(omdb_info)

KeyError: 'OMDB_API_KEY'
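
The KeyError above means the OMDB_API_KEY environment variable was not set in the session that ran the cell. Here is a small sketch of failing earlier with a clearer message and of building the request URL in one place. get_api_key and build_omdb_url are hypothetical helpers, and "demo" is a placeholder rather than a real key.

```python
import os
from urllib.parse import urlencode

OMDB_URL = "http://www.omdbapi.com/?"

def get_api_key(env=os.environ):
    # Read the key from the environment instead of hard-coding it
    key = env.get("OMDB_API_KEY")
    if not key:
        raise RuntimeError("Set the OMDB_API_KEY environment variable first")
    return key

def build_omdb_url(title, api_key):
    # urlencode escapes spaces and characters such as "&" in the title
    return OMDB_URL + urlencode({"apikey": api_key, "t": title})

example_url = build_omdb_url("Julie & Julia", "demo")
print(example_url)
```

In a notebook, the variable can be set before launching Jupyter, or assigned to os.environ in a cell that is not committed.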

### Save DataSets as a JSON File

In [62]:
import json 

def save_data(title,data):
    with open(title, 'w', encoding = 'utf-8') as f:
        json.dump(data,f,ensure_ascii = False, indent=2)

In [63]:
import json 

def load_data(title):
    with open(title,encoding = 'utf-8') as f:
        return json.load(f)
    

In [124]:
## Save data with pickle 
import pickle 

def save_data_pickle(name, data):
    with open(name, 'wb') as f:
        pickle.dump(data, f)
        
def load_data_pickle(name):
    with open(name, 'rb') as f:
        return pickle.load(f)

In [125]:
save_data_pickle("tucci_movie_data_cleaned_more.pickle", movie_info_list)
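
By this point the cleaned list contains datetime objects, which json.dump cannot serialize by default. That may be why pickle is used for the final save, though this is an assumption. If a JSON file is still wanted, one option is to write dates as ISO strings and convert them back on load. The sketch below uses hypothetical helpers (encode_extra, to_json, from_json) and a one-film sample.

```python
import json
from datetime import datetime

def encode_extra(obj):
    # Called by json for objects it cannot serialize natively
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Cannot serialize {type(obj).__name__}")

def to_json(data):
    return json.dumps(data, ensure_ascii=False, indent=2, default=encode_extra)

def from_json(text, date_keys=("Release date (datetime)",)):
    movies = json.loads(text)
    for movie in movies:
        for key in date_keys:
            if movie.get(key):
                # Turn the ISO string back into a datetime object
                movie[key] = datetime.fromisoformat(movie[key])
    return movies

sample = [{"title": "The Tale of Despereaux",
           "Release date (datetime)": datetime(2008, 12, 19)}]
restored = from_json(to_json(sample))
print(restored == sample)
```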
