# Scrape for Movie Data Set 

This notebook is part of my Stanley Tucci movie exploration project. It walks through the process of scraping the data we want from Wikipedia. The data is cleaned further and analyzed in subsequent notebooks.

### Get Base URLs for Actor/Movie Genre/Production Company We Want

Set the URL below to the Wikipedia page you want to explore for this project.

In [2]:
main_wiki_page = 'https://en.wikipedia.org/wiki/Stanley_Tucci'

### Import Packages

In [1]:
from bs4 import BeautifulSoup as bs
import requests

### Read in Wiki Page with Beautiful Soup and Select Film Table

In [4]:
r = requests.get(main_wiki_page)

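# Convert to a Beautiful Soup object (the parser is named explicitly to avoid a bs4 warning)
soup = bs(r.content, 'html.parser')

# Pretty-print the HTML for inspection
contents = soup.prettify()
#print(contents)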

Now that the Wikipedia page is loaded, we need to choose the URLs of the movies and TV shows to explore. In this example, I only explore the films that Stanley Tucci acted in; a later iteration could expand this to include his many TV (and Emmy-winning) performances.

In [5]:
## Get only the Film Table from the HTML Page

table = soup.find_all('table', class_="wikitable")[0]   # Only use the first table

Within this table, I only select films that have a Wikipedia page. Films without a Wikipedia page could be explored in another iteration of the project, but they would require more independent research on each film.

In [6]:
# Film titles are italicized, so select only the links that sit inside italics
movies = table.select("i a")
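
To illustrate what the `"i a"` selector keeps, here is a minimal sketch on a hand-written table. The HTML snippet and the unlinked film are made up for illustration; only the selector itself comes from the cell above.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table: one linked film title, one unlinked film title.
html = """
<table class="wikitable">
  <tr><td><i><a href="/wiki/Prizzi%27s_Honor" title="Prizzi's Honor">Prizzi's Honor</a></i></td>
      <td><a href="/wiki/John_Huston" title="John Huston">John Huston</a></td></tr>
  <tr><td><i>An Unlinked Film</i></td><td>Someone</td></tr>
</table>
"""

table = BeautifulSoup(html, 'html.parser').find('table', class_='wikitable')

# "i a" matches <a> tags nested inside <i> tags, so the director link
# (not italicized) and the unlinked title (no <a>) are both left out.
links = table.select('i a')
hrefs = [a['href'] for a in links]
print(hrefs)
```

Only the italicized link survives, which is why films without their own Wikipedia page drop out of `movies`.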

### Get Information Table For Each Film 

This section extracts the info box of each film. It also includes the following cleaning steps:

1. Remove all reference markers (for example, `[1]`)
2. Turn long strings into lists (production team, starring, etc.)
3. Convert numeric strings to integer values
4. Convert the release date to a datetime object
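
The sample output further down still shows reference markers and string dates, so steps 1, 3 and 4 may be applied later. As a rough sketch of how they could look, the helpers below use only the standard library. All four function names are hypothetical, and the formats they handle (`$16 million [1]`, `130 minutes`, `June 14, 1985`) are taken from the sample output; other pages may use formats these helpers do not cover.

```python
import re
from datetime import datetime

def remove_references(text):
    """Drop bracketed reference markers such as ' [1]'."""
    return re.sub(r'\s*\[\d+\]', '', text).strip()

def money_to_int(text):
    """Convert '$16 million [1]' to an integer number of dollars."""
    match = re.search(r'\$([\d,.]+)\s*(million|billion)?', remove_references(text))
    if match is None:
        return None
    number = float(match.group(1).replace(',', ''))
    scale = {'million': 1_000_000, 'billion': 1_000_000_000}.get(match.group(2), 1)
    return int(round(number * scale))

def minutes_to_int(text):
    """Convert '130 minutes' to the integer 130."""
    match = re.search(r'(\d+)\s*min', text)
    return int(match.group(1)) if match else None

def parse_release_date(text):
    """Convert 'June 14, 1985' to a datetime object (English month names assumed)."""
    return datetime.strptime(remove_references(text), '%B %d, %Y')

budget = money_to_int('$16 million [1]')
running_time = minutes_to_int('130 minutes')
released = parse_release_date('June 14, 1985')
```

These could be applied to the `Budget`, `Box office`, `Running time` and `Release date` entries of each dictionary once the info boxes are collected.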

In [19]:
## Return the text of a table cell, replacing non-breaking spaces (\xa0) with regular spaces.
## Cells that contain list items are returned as a list of strings.
def get_content_value(row_data):
    if row_data.find('li'):
        return [li.get_text(" ", strip=True).replace("\xa0", " ") for li in row_data.find_all('li')]
    else:
        return row_data.get_text(" ", strip=True).replace("\xa0", " ")

## Take the URL of a film page and return its info box as a dictionary
def get_info_box(url):
    r = requests.get(url)

    # Convert to a Beautiful Soup object
    soup = bs(r.content, 'html.parser')

    # Select just the info box of the movie and its rows
    info_box = soup.find(class_='infobox vevent')
    info_rows = info_box.find_all('tr')

    movie_info = {}

    for index, row in enumerate(info_rows):
        if index == 0:
            # The first row holds the film title
            movie_info['title'] = row.find('th').get_text(" ", strip=True)
        elif index == 1:
            # The second row is skipped (usually the poster image)
            continue
        else:
            content_key = row.find('th').get_text(" ", strip=True)
            content_value = get_content_value(row.find('td'))
            movie_info[content_key] = content_value

    return movie_info

To show that this works, here is the result for the first film in the list:

In [21]:
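# base_path must be defined before this cell runs; each href already starts with '/wiki/'
base_path = 'https://en.wikipedia.org'
get_info_box(base_path + movies[0]['href'])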

{'title': "Prizzi's Honor",
 'Directed by': 'John Huston',
 'Produced by': 'John Foreman',
 'Screenplay by': 'Richard Condon Janet Roach',
 'Based on': "Prizzi's Honor by Richard Condon",
 'Starring': ['Jack Nicholson', 'Kathleen Turner'],
 'Music by': 'Alex North',
 'Cinematography': 'Andrzej Bartkowiak',
 'Edited by': 'Kaja Fehr Rudi Fehr',
 'Production company': 'ABC Motion Pictures',
 'Distributed by': '20th Century Fox (U.S.) Producers Sales Organization (International)',
 'Release date': 'June 14, 1985',
 'Running time': '130 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$16 million [1]',
 'Box office': '$26.6 million [2]'}

In [22]:
movie_info_list = []
base_path = 'https://en.wikipedia.org'  # no trailing slash: each href already starts with '/'
for movie in movies:
    try:
        relative_path = movie['href']
        full_path = base_path + relative_path

        movie_info_list.append(get_info_box(full_path))

    except Exception as e:
        # Print the film title and the error, then move on to the next film
        print(movie.get_text())
        print(e)
        print()

Monkey Shines
'NoneType' object has no attribute 'find_all'

A Modern Affair
'NoneType' object has no attribute 'find_all'

The Life and Death of Peter Sellers
'NoneType' object has no attribute 'find'

The Wind Rises
'NoneType' object has no attribute 'get_text'



First iteration: the four films listed above, whose info boxes could not be parsed, will not be assessed.
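
Reading the errors against `get_info_box`, a missing `find_all` means the page has no `infobox vevent` element, while a missing `find` or `get_text` means a row lacks a `td` or `th` cell. A later iteration could guard against both cases. The sketch below is one possible variant, not the approach used in this notebook: `parse_info_box` is a hypothetical helper that takes HTML text rather than a URL so it can be tried offline, and the sample HTML is made up.

```python
from bs4 import BeautifulSoup

def parse_info_box(html):
    """Return the info box as a dict, or None if the page has no info box."""
    soup = BeautifulSoup(html, 'html.parser')
    info_box = soup.find(class_='infobox')
    if info_box is None:
        return None

    movie_info = {}
    for row in info_box.find_all('tr'):
        header, data = row.find('th'), row.find('td')
        if header is not None and data is None and 'title' not in movie_info:
            # The first header-only row is treated as the title
            movie_info['title'] = header.get_text(' ', strip=True)
        elif header is not None and data is not None:
            movie_info[header.get_text(' ', strip=True)] = data.get_text(' ', strip=True)
        # Rows missing a header (or later header-only rows) are skipped
    return movie_info

# Hypothetical info box with an image-only row and a header-only row
sample_html = """
<table class="infobox vevent">
  <tr><th>Prizzi's Honor</th></tr>
  <tr><td><img src="poster.jpg"></td></tr>
  <tr><th>Directed by</th><td>John Huston</td></tr>
  <tr><th>Section header</th></tr>
</table>
"""
parsed = parse_info_box(sample_html)
missing = parse_info_box('<p>No info box here</p>')
```

For brevity this variant does not split list cells the way `get_content_value` does.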


### Get Plot Overview For Each Film

To find the plot, I use a different setup: the `wikipedia` package for Python.

Cleaning done in this section:

1. Look for a section named Plot or Synopsis; if neither exists, take the introductory paragraph instead
2. Remove all `\n` and `'` characters from the plot, while keeping all other punctuation
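
The two steps above can be tried without any network access on a plain-text string. The sketch below is an alternative formulation that splits on level-2 headings with a regular expression instead of slicing fixed character counts; `extract_plot` is a hypothetical name, the sample text is made up, and the `== Heading ==` layout is assumed to match the plain text returned by the package. Unlike the cell below, it replaces newlines with a space so that words from adjacent lines do not run together.

```python
import re

def extract_plot(content):
    """Return the Plot or Synopsis section of a page, or the intro if neither exists."""
    # Split at level-2 headings such as '== Plot =='; deeper headings stay in the body.
    parts = re.split(r'\n==\s*([^=\n]+?)\s*==\n', content)
    intro = parts[0]
    # parts alternates: intro, heading, body, heading, body, ...
    sections = dict(zip(parts[1::2], parts[2::2]))

    for name in ('Plot', 'Synopsis'):
        if name in sections:
            text = sections[name]
            break
    else:
        text = intro

    # Collapse newlines and repeated whitespace, then drop apostrophes
    return ' '.join(text.split()).replace("'", '')

sample = "Intro text.\n\n\n== Plot ==\nA man's story.\nIt ends.\n\n\n== Cast ==\nSomeone"
plot = extract_plot(sample)
fallback = extract_plot("Just an intro.\n\n\n== Cast ==\nSomeone")
```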

In [72]:
import wikipedia

movie_plot_text = {}

def get_plot_text(movie_list):
    for movie in movie_list:
        try:
            # Look up the Wikipedia page by the title of the link
            wiki = wikipedia.page(movie['title'])

            # Extract the plain text and strip the '==' heading markers
            text = wiki.content.replace('==', '')

            # Split the page into sections
            split_text = text.split('\n\n\n')

            # Default to the introductory paragraph if no Plot or Synopsis section exists
            plot = split_text[0]

            for section in split_text:
                if 'Plot' in section[:5]:
                    plot = section[6:]
                    break

                if 'Synopsis' in section[:10]:
                    plot = section[10:]
                    break

            # Remove newlines and apostrophes
            plot = plot.replace('\n', '')
            plot = plot.replace('\'', '')

            # Store the cleaned plot in the dictionary
            movie_plot_text[movie['title']] = plot

        except Exception as e:
            # Print the film title and the error, then move on to the next film
            print(movie.get_text())
            print(e)
            print()

    return movie_plot_text

In [74]:
plot_text_dict = get_plot_text(movies)

Julie & Julia
Page id "julien julian" does not match any pages. Try another id!

The Silence
Page id "the silence (2013 film)" does not match any pages. Try another id!

The Witches
Page id "the witches 2021 film" does not match any pages. Try another id!



For three films, the lookup by link title did not resolve to a Wikipedia page, so no plot text was collected for them.
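
The error messages show a rewritten search term ("julien julian" rather than "Julie & Julia"), which is consistent with the `wikipedia` package's auto-suggest feature changing the title before the lookup. One possible workaround, sketched under that assumption and not verified against these three pages, is to derive the exact page title from each link's `href` and call `wikipedia.page` with `auto_suggest=False`. Both helper names are hypothetical.

```python
from urllib.parse import unquote

def title_from_href(href):
    """Turn a relative link such as '/wiki/Julie_%26_Julia' into 'Julie & Julia'."""
    slug = href.split('/wiki/', 1)[-1]   # keep the part after '/wiki/'
    slug = slug.split('#', 1)[0]         # drop any '#section' fragment
    return unquote(slug).replace('_', ' ')

def fetch_page_exact(href):
    """Fetch a page by its exact title, without letting the package rewrite it."""
    import wikipedia  # imported here so title_from_href works without the package
    return wikipedia.page(title_from_href(href), auto_suggest=False)

example_title = title_from_href('/wiki/Julie_%26_Julia')
```

With this in place, `fetch_page_exact(movie['href'])` could stand in for `wikipedia.page(movie['title'])` for the films that failed.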

### Save DataSets as a JSON File

In [62]:
import json 

# Write data to the JSON file named by title, keeping non-ASCII characters readable
def save_data(title, data):
    with open(title, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

In [63]:
import json

# Read the JSON file named by title back into Python objects
def load_data(title):
    with open(title, encoding='utf-8') as f:
        return json.load(f)
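
As a quick check that the two helpers round-trip the scraped data, the sketch below saves a small sample and loads it again. The helpers are repeated so the example runs on its own; the file name `movie_info_sample.json` and the sample record are placeholders, not the files used in this project.

```python
import json
import os
import tempfile

def save_data(title, data):
    with open(title, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

def load_data(title):
    with open(title, encoding='utf-8') as f:
        return json.load(f)

# Placeholder record shaped like one entry of movie_info_list
sample = [{'title': "Prizzi's Honor", 'Starring': ['Jack Nicholson', 'Kathleen Turner']}]

# Write to the system temp directory so the example leaves the project folder untouched
path = os.path.join(tempfile.gettempdir(), 'movie_info_sample.json')
save_data(path, sample)
restored = load_data(path)
print(restored == sample)
```

Note that `json.dump` cannot serialize `datetime` objects by default, so release dates converted to datetimes would need to be turned back into strings before saving.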
    