# Web scraping Disney movies collection data 

In [956]:
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as bs
import re

In [957]:
#load the page
r = requests.get("https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films")

#convert to bs object
soup = bs(r.content)


In [958]:
#find all headings with size 3 to get the years in string
soup.find_all("h3")

[<h3><span id="1930s.E2.80.931940s"></span><span class="mw-headline" id="1930s–1940s">1930s–1940s</span></h3>,
 <h3><span class="mw-headline" id="1950s">1950s</span></h3>,
 <h3><span class="mw-headline" id="1960s">1960s</span></h3>,
 <h3><span class="mw-headline" id="1970s">1970s</span></h3>,
 <h3><span class="mw-headline" id="1980s">1980s</span></h3>,
 <h3><span class="mw-headline" id="1990s">1990s</span></h3>,
 <h3><span class="mw-headline" id="2000s">2000s</span></h3>,
 <h3><span class="mw-headline" id="2010s">2010s</span></h3>,
 <h3><span class="mw-headline" id="2020s">2020s</span></h3>,
 <h3><span class="mw-headline" id="Undated_films">Undated films</span></h3>,
 <h3 class="vector-menu-heading" id="p-personal-label">
 <span class="vector-menu-heading-label">Personal tools</span>
 </h3>,
 <h3 class="vector-menu-heading" id="p-namespaces-label">
 <span class="vector-menu-heading-label">Namespaces</span>
 </h3>,
 <h3 class="vector-menu-heading" id="p-views-label">
 <span class="vecto

We will get a value of None in the first header3 because we have two span elements inside this first header. So we need to manually replace the first value of the list above with the respective year, i.e., 1930s-1940s.

In [959]:
#collecting the number of years first
#create an emtpy list of years
years_list = []
for years in soup.find_all("h3"):
    
    if years.string is not None:
        years = years.string.replace('s','')
        years_list.append(years)
    
    print(years)

<h3><span id="1930s.E2.80.931940s"></span><span class="mw-headline" id="1930s–1940s">1930s–1940s</span></h3>
1950
1960
1970
1980
1990
2000
2010
2020
Undated film
<h3 class="vector-menu-heading" id="p-personal-label">
<span class="vector-menu-heading-label">Personal tools</span>
</h3>
<h3 class="vector-menu-heading" id="p-namespaces-label">
<span class="vector-menu-heading-label">Namespaces</span>
</h3>
<h3 class="vector-menu-heading" id="p-views-label">
<span class="vector-menu-heading-label">Views</span>
</h3>
<h3>
<label for="searchInput">Search</label>
</h3>
<h3 class="vector-menu-heading" id="p-navigation-label">
<span class="vector-menu-heading-label">Navigation</span>
</h3>
<h3 class="vector-menu-heading" id="p-interaction-label">
<span class="vector-menu-heading-label">Contribute</span>
</h3>
<h3 class="vector-menu-heading" id="p-tb-label">
<span class="vector-menu-heading-label">Tools</span>
</h3>
<h3 class="vector-menu-heading" id="p-coll-print_export-label">
<span class="vect

In [960]:
#printing the list
years_list

['1950',
 '1960',
 '1970',
 '1980',
 '1990',
 '2000',
 '2010',
 '2020',
 'Undated film']

In [961]:
#insert the years 1930s-1940s as a first index value
years_list.insert(0,'1930-1940')
years_list.insert(9,'Upcoming')

In [962]:
#printing the years list
years_list

['1930-1940',
 '1950',
 '1960',
 '1970',
 '1980',
 '1990',
 '2000',
 '2010',
 '2020',
 'Upcoming',
 'Undated film']

In [963]:
for tag in soup.find_all("href"):
    print(tag)

Now upon inspecting the table we want to extract, we have links for each of the movie title and that will provide us with more information about each movie. We need to loop through each of these movie title and get the link and access the page for that movie.

In [964]:
#create empty list of links to movie title
data = {} 

init = 0
#loop through the table tags
for table in soup.find_all("table"):
    movies =[]
    #loop through each table to get all the links
    for link in table.find_all("i"):
        
        #try and except to ignore the errors in title extraction
        try:
            #get the link of movie title
            title_page = 'https://en.wikipedia.org' + link.a['href']
            #add it to the list of movies
            movies.append(title_page)
        except:
            pass
     
    
    try:
        data[years_list[init]] = movies
        init = init +1
    except:
        pass

    

In [965]:
#we can check the data for upcoming films 
data['Upcoming']

['https://en.wikipedia.org/wiki/Pinocchio_(2022_live-action_film)',
 'https://en.wikipedia.org/wiki/Hocus_Pocus_2',
 'https://en.wikipedia.org/wiki/Strange_World_(film)',
 'https://en.wikipedia.org/wiki/Disenchanted_(film)',
 'https://en.wikipedia.org/wiki/Haunted_Mansion_(2023_film)',
 'https://en.wikipedia.org/wiki/The_Little_Mermaid_(2023_film)',
 'https://en.wikipedia.org/wiki/Elemental_(2023_film)']

## Using Pandas library to get the table



In [966]:
#taking Turning red movie page for an example
page = requests.get("https://en.m.wikipedia.org/wiki/Turning_Red")

In [967]:
#reading the content of the page directly using pandas read_html and putting it to a dataframe
df_movie= pd.read_html(page.content)[0]

In [968]:
df_movie.set_index(df_movie.columns[0]).transpose()

Turning Red,Official promotional poster,Directed by,Screenplay by,Story by,Produced by,Starring,Cinematography,Edited by,Music by,Productioncompanies,Distributed by,Release dates,Running time,Country,Language,Budget,Box office
Turning Red.1,Official promotional poster,Domee Shi,Julia Cho Domee Shi,Domee Shi Julia Cho Sarah Streicher,Lindsey Collins,Rosalie Chiang Sandra Oh Ava Morse Hyein Park ...,Mahyar Abousaeedi Jonathan Pytko,Nicholas C. Smith Steve Bloom,Ludwig Göransson,Walt Disney Pictures Pixar Animation Studios,Walt Disney StudiosMotion Pictures,"March 1, 2022El Capitan Theatre) March 11, 2022",100 minutes[1],United States,English,$175 million[2],$19.9 million[3]


The problem here is that we cannot get the names separated with commas or any such information that needs to be punctuated for that matter. 

## Using Beautiful Soup to extract tables

In [990]:
"""
   This function will find the column name based on the movie details by matching it with the dictionary of columns created below. 
   This will check for a matching word in the list of all columns ce
   
   Inputs: Word --> A word from the movie details like "Directed by"
   
"""
def key_finder(word, keys_list):
    final_word = ""
    init_num = 0 
    for prop in keys_list:
        count =0

        for letter in word:
            if letter in prop:
                count +=1

        if count > init_num:
            init_num = count 
            final_word = prop

    return final_word

In [997]:

#find the first table of the movie page

def movie_details_collector(page_content):
    table = page_content.find("table")
    
    #set the key to initially to an empty value
    key = '' 
    
    dict_movie_props = {'Title':'','Based on':'','Directed by':'','Written by':'','Screenplay by':'','Story by':'','Produced by':'','Starring':'',\
                     'Cinematography':'','Edited by':'','Music by':'', 'Productioncompanies':'',\
                     'Distributed by':'','Narrated by':'','Created by':'','Genre':'', 'Color process':'',\
                     'Release dates':'','Running time':'','Country':'','Language':'','Budget':'','Box office':''}
    
    keys_list = [cols for cols in dict_movie_props.keys()]
    #initialize 
    num = 1
    init = 0

    #iterate through the table to find th and td tagsf
    for items in table.find_all(["th","td"]): 

        #get the name of the movie
        if init == 0:
            dict_movie_props['Title'] = items.get_text()
            init += 1
            #print(items.get_text(), num)
        else:

            #if the item is odd we get the th tag's text as keys of a dictionary
            if num % 2 != 0:
                #print(items.get_text(), num)
                key = key_finder(items.get_text()[:10], keys_list)
                if key in keys_list:
                    key = key
                else:
                    key = ''

            #else for even value of num we get the td tag values
            else:
                #print(items.get_text(), num)
                value = items.get_text().replace("\n",",")
                #if value[0] == ',':
                #    value = value[1:] 
                if num !=2:
                    dict_movie_props[key] = value
                else:
                    pass
                #print("key = "+ key)

        #print(dict_movie_props)
        num += 1

    return dict_movie_props

In [998]:
"""
 A function to add numpy null values where the information is provided for a property of a movie
"""
def add_null_values(dict_of_props):
    for det in dict_of_props.items():
        if det[1] == '':
            dict_of_props[det[0]] = np.NaN

    return dict_of_props


In [999]:
def create_list_of_values(dict_movie_prop): 
    values_list = []
    for values in dict_movie_prop.values():
        values_list.append(values)
    
    return values_list

Now we will put everything together and call the appropriate functions to loop through each page of movies collected earlier.

## Collecting Details Finally


Note: Nikki: Wild Dog of the North needs to be removed due to technical issues as cited on its wikipedia page. Tall Tale from 1995 also needs to be removed for issues with the page. 

In [1000]:
%%capture
#Create a loop that can go through each of the movie page and collect the relevant data

#create empty dictionary with the keys having different properties of a movie
dataframe_columns = ['Title','Based on','Directed by','Written by','Screenplay by','Story by','Produced by','Starring',\
                         'Cinematography','Edited by','Music by', 'Productioncompanies',\
                         'Distributed by','Narrated by','Created by','Genre','Color process',\
                         'Release dates','Running time','Country','Language','Budget','Box office']

#collecting columns names
keys_list = dataframe_columns

#create a dataframe with just the columns
df= pd.DataFrame(columns = dataframe_columns)

#loop through each of the movies link available in dictionary of data
#for link in data.values():
for items in data['2000']:
    page = requests.get(items)
    page_content = bs(page.content)

    #to check if the movie page is having any issues


    if "Please help improve this article by adding citations to reliable sources." in page_content.table.get_text():
        pass
    else:
        dict_movie_details = movie_details_collector(page_content)

        #get the null values added for empty values
        dict_movie_details = add_null_values(dict_movie_details)


        df = df.append(pd.DataFrame([create_list_of_values(dict_movie_details)], 
             columns=dataframe_columns), 
             ignore_index=True)



ValueError: 23 columns passed, passed data had 24 columns

In [1001]:
df

Unnamed: 0,Title,Based on,Directed by,Written by,Screenplay by,Story by,Produced by,Starring,Cinematography,Edited by,...,Narrated by,Created by,Genre,Color process,Release dates,Running time,Country,Language,Budget,Box office
0,The Tigger Movie,Characters createdby A. A. Milne,Jun Falkenstein,,Jun Falkenstein,Eddie Guzelian,Cheryl Abood,",Jim Cummings,Nikita Hopkins,Ken Sansom,John F...",,",Makoto Arai,Robert Fisher, Jr.,",...,John Hurt,,,,"February 11, 2000",78 minutes[1],,English,$15 million[4][5][6]–$30 million[7],$96.2 million[7]
1,Whispers: An Elephant's Tale,,Dereck Joubert,,",Dereck Joubert,Jordan Moffet,Holly Goldberg S...",",Dereck Joubert,Beverly Joubert,",",Beverly Joubert,Dereck Joubert,",",Angela Bassett,Joanna Lumley,Anne Archer,Debi...",Dereck Joubert,Nena Olwage,...,,,,,",March 10, 2000 (2000-03-10),",72 minutes,United States,English,"$4,000,000 (estimated)[1]","$500,000 (USA) (30 November 2000)[1]"
2,Dinosaur,,",Ralph Zondag,Eric Leighton,",,",John Harrison,Robert Nelson Jacobs,",",John Harrison,Robert Nelson Jacobs,Thom Enriq...",Pam Marsden,",D. B. Sweeney,Alfre Woodard,Ossie Davis,Max C...",",David Hardberger,S. Douglas Smith,",H. Lee Peterson,...,,,,,",May 19, 2000 (2000-05-19) (United States),",82 minutes[1],United States,English,$127.5 million[1],$349.8 million[1]
3,Fantasia 2000,,",Don Hahn,Pixote Hunt,Hendel Butoy,Eric Goldbe...",,,,",Roy E. Disney,Donald W. Ernst,",",James Levine,Steve Martin,Itzhak Perlman,Quin...",Tim Suhrstedt,",Jessica Ambinder-Rojas,Lois Freeman-Fox,",...,,,,,",December 17, 1999 (1999-12-17) (Premiere),,Ja...",75 minutes,United States,English,$80–$85 million[1][2],$90.9 million[1]
4,The Kid,,Jon Turteltaub,Audrey Wells,,,",Hunt Lowry,Christina Steinberg,Jon Turteltaub,",",Bruce Willis,Spencer Breslin,Emily Mortimer,L...",Peter Menzies Jr.,",Peter Honess,David Rennie,",...,,,,,",July 7, 2000 (2000-07-07),",104 minutes[2],United States,English,$65 million[3],$110.3 million[3]
5,The Little Mermaid II: Return to the Sea,The Little Mermaidby Hans Christian Andersen,Jim Kammerud,",Elizabeth Anderson,Temple Mathews,Elise D'Hae...",,,",Leslie Hough,David Lovegren,",",Jodi Benson,Tara Charendoff,Samuel E. Wright,...",,,...,,,,,",September 19, 2000 (2000-09-19),",75 minutes[1],,English,,
6,Remember the Titans,,Boaz Yakin,Gregory Allen Howard,,,Jerry BruckheimerChad Oman,",Denzel Washington,Will Patton,Donald Faison,N...",Philippe Rousselot,Michael Tronick,...,,,,,",September 29, 2000 (2000-09-29),",113 minutes,United States,English,$30 million[1],$136.8 million[1]
7,102 Dalmatians,The Hundred and One Dalmatiansby Dodie Smith,Kevin Lima,,",Kristen Buckley,Brian Regan,Bob Tzudiker,Noni...",",Kristen Buckley,Brian Regan,",Edward S. Feldman,",Glenn Close,Ioan Gruffudd,Alice Evans,Tim McI...",Adrian Biddle,Gregory Perler,...,,,,,",November 22, 2000 (2000-11-22) (United States),",100 minutes,United States,English,$85 million[1],$183.6 million[1]
8,The Emperor's New Groove,Original storyby Roger AllersMatthew Jacobs,Mark Dindal,,David Reynolds,",Chris Williams,Mark Dindal,",Randy Fullmer,",David Spade,John Goodman,Eartha Kitt,Patrick ...",,Pamela Ziegenhagen-Shefland,...,,,,,",December 10, 2000 (2000-12-10) (premiere),Dec...",78 minutes,United States,English,$100 million,$169.6 million
9,Recess: School's Out,Recessby Paul GermainJoe Ansolabehere,Chuck Sheetz,,Jonathan Greenberg,",Paul Germain,Joe Ansolabehere,Jonathan Greenb...",",Joe Ansolabehere,Paul Germain,Toshio Suzuki (...",",Andrew Lawrence,Rickey D'Shon Collins,Jason D...",,Tony Mizgalski,...,,,,,",February 10, 2001 (2001-02-10) (premiere),,Fe...",83 minutes[1],United States,English,$23 million[2],$44.5 million[2]


In [977]:
df.drop(df.index, inplace=True)

In [979]:
page = requests.get("https://en.m.wikipedia.org/wiki/Lilo_%26_Stitch")
page_content = bs(page.content)
#page = requests.get("https://en.m.wikipedia.org/wiki/Raya_and_the_Last_Dragon")
#page_content = bs(page.content)

In [985]:
page_content.table.get_text()

'This article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed.Find sources:\xa0"Lilo & Stitch"\xa0–\xa0news\xa0· newspapers\xa0· books\xa0· scholar\xa0· JSTOR (August 2020) (Learn how and when to remove this template message)'