# Web scraping - Disney movies
__In this project we will scrape data about Disney movies from this [Wikipedia page](https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films).__<br/><br/>
To complete the project we must carry out four tasks:
- [X] Extract information about each movie. 
- [X] Extract information about all movies together. 
- [X] Clean the data. 
- [X] Keep the data organized in JSON and CSV files. 

<br/>Of course we will start with the important imports:

In [1]:
import requests
import json 
from bs4 import BeautifulSoup as bs
import os
from tqdm import tqdm
import pandas as pd

Now we are ready to start!

### Task one: Extract information about each movie.
In this section we will implement a function that extracts information about a movie from its URL. The function returns a dictionary with all the properties listed in the movie's infobox.

In [2]:
def get_info(url): 
    
    #First, fetch the page and parse its HTML.
    ret_json = {}
    req = requests.get(url)
    soup = bs(req.text, "html.parser")
    info_box = soup.find("table", class_="infobox vevent")
    
    #Get the title
    ret_json['Title'] = info_box.find("th").text
    
    #Iterate over the info box
    for line in info_box.find_all("tr")[2:]:
        if line.find("td") and line.find("th"):          
            info_array = []
            
            for info in line.find_all("td"):
                for tag in str(info).split('<br/>'): #Split on line breaks and <li> items so that each value becomes a separate entry
                    tag = bs(tag, 'html.parser')
                    if tag.find('li'):
                        for li in tag.find_all('li'):
                            info_array.append(li.text)
                    else:
                        info_array.append(tag.text)
            
            #Strip every substring listed in 'removes' ('US', newlines and the citation markers [1]-[7]) from each value
            removes = ['US','\n'] + ['['+str(i)+']' for i in range(1,8)] 
            
            for remove in removes:
                info_array = list(map(lambda x: x.replace(remove,''),info_array))
            
            #Replace non-breaking spaces ('\xa0') with regular spaces
            info_array = list(map(lambda x: x.replace('\xa0',' '),info_array))
            
            #For 'Budget' & 'Box office', keep only strings that contain a currency sign or the words 'million' or 'billion'
            if line.find("th").text in ['Budget','Box office']:
                info_array = list(filter(lambda x: '¥' in x or '₹' in x or '$' in x or 'million' in x or 'billion' in x ,info_array))
            
            #For 'Running time', keep only strings that contain 'min'
            if line.find("th").text == 'Running time':
                info_array = list(filter(lambda x: 'min' in x.lower() ,info_array))
                
            #After a quick look at the data, I realized that all I needed was the first object in the array in 'Budget','Box office' & 'Based on' 
            if line.find("th").text in ['Budget','Box office','Based on']:
                info_array = info_array[0:1]
            
            ret_json[line.find("th").text] = info_array
             
    return ret_json
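
The fixed `removes` list in `get_info` only covers the citation markers `[1]` to `[7]`. As a minimal sketch, assuming a marker is always a number in square brackets, a regular expression can strip any of them. The helper name `strip_citations` is hypothetical and not part of the notebook, and the extra `strip()` call is an addition that the original code does not perform.

```python
import re

# Matches Wikipedia-style citation markers such as [1] or [23] (assumed format).
CITATION = re.compile(r"\[\d+\]")

def strip_citations(values):
    """Remove citation markers, newlines and non-breaking spaces from each string."""
    cleaned = []
    for value in values:
        value = CITATION.sub("", value)
        value = value.replace("\n", "").replace("\xa0", " ")
        cleaned.append(value.strip())
    return cleaned

print(strip_citations(["$20\xa0million[12]", "Jack Clayton\n"]))
```

The sketch deliberately leaves the `'US'` removal out, since that rule is specific to the money fields.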

In [3]:
#The purpose of this function is to print the dictionary more beautifully.

def pretty(d, indent=0):
    for key, value in d.items():
        print('\t' * indent + str(key) + ':')
        if isinstance(value, dict):
            pretty(value, indent+1)
        else:
            print('\t' * (indent+1) + str(value))

In [4]:
#We will just check on a random URL that the function does work.

url = "https://en.wikipedia.org/wiki/Something_Wicked_This_Way_Comes_(film)"
pretty(get_info(url)) 

Title:
	Something Wicked This Way Comes
Directed by:
	['Jack Clayton']
Written by:
	['Ray Bradbury']
Based on:
	['Something Wicked This Way Comes']
Produced by:
	['Peter Douglas']
Starring:
	['Jason Robards', 'Jonathan Pryce', 'Diane Ladd', 'Pam Grier']
Cinematography:
	['Stephen H. Burum']
Edited by:
	['Barry Mark Gordon', 'Art J. Nelson']
Music by:
	['James Horner']
Productioncompanies :
	['Walt Disney Productions', 'The Bryna Company']
Distributed by:
	['Buena Vista Distribution']
Release date:
	['April 29, 1983 (1983-04-29) (United States)']
Running time:
	['95 minutes']
Country:
	['United States']
Language:
	['English']
Budget:
	['$20 million']
Box office:
	['$8.4 million']


Nice! The first task is complete ✅

### Task two: Extract information about all movies together.
In this section we will extract all the links to Disney movies from [this page](https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films) and apply the function we implemented earlier to each of them.

In [5]:
#First we fetch and parse the HTML of the list page

main_web_path = "https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films"
req = requests.get(main_web_path)
soup = bs(req.text, "html.parser")
path = "https://en.wikipedia.org"
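
The site root stored in `path` is later joined to each relative `href` by string concatenation. A hedged alternative, assuming the links are ordinary relative or absolute URLs, is `urllib.parse.urljoin` from the standard library, which also copes with an `href` that is already absolute. The sample links below are illustrative values only.

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films"

# A relative link of the kind found in the film tables (illustrative value).
relative = "/wiki/Something_Wicked_This_Way_Comes_(film)"
# An already absolute link is returned unchanged.
absolute = "https://en.wikipedia.org/wiki/Fantasia_(1940_film)"

print(urljoin(base, relative))
print(urljoin(base, absolute))
```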

In [6]:
#Collect all the movie tables in one list:

tables = soup.find_all("table")[1:11]
info_dict = []

#Iterate over the tables, extracting each movie's URL and name

for table in tqdm(tables):
    info_lines = table.find_all("tr")[1:]
    for info_line in info_lines:
        #In this section I differentiate between movie names with a link and movie names without a link
        #If the movie name is without a link it should be stored as a 'Title' and that's it
        if info_line.find_all("td")[1].find("a"):
            link = path + info_line.find_all("td")[1].find("a")['href'] 
            info_dict.append(get_info(link))
        else:
            title = info_line.find_all("td")[1].text
            info_dict.append({'Title': title})   

100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [04:05<00:00, 24.55s/it]
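
The run above takes about four minutes, and a single page without an infobox would raise an exception and stop the loop part-way. The following is a minimal sketch of one way to guard each call, assuming it is acceptable to record only the URL for a page that fails. `safe_get_info` and its `scraper` parameter are hypothetical names; in the notebook, `scraper` would be `get_info`.

```python
def safe_get_info(url, scraper):
    """Call scraper(url); on failure return a stub entry instead of raising."""
    try:
        return scraper(url)
    except (AttributeError, IndexError, KeyError) as error:
        # AttributeError is what a missing infobox produces (None has no .find).
        return {"URL": url, "Error": type(error).__name__}

def broken_scraper(url):
    # Stand-in for get_info on a page without an infobox.
    info_box = None
    return {"Title": info_box.find("th").text}

print(safe_get_info("https://example.org/no-infobox", broken_scraper))
```

Stub entries carry no `'Title'` key, so they would have to be filtered out or retried before the cleaning step.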


In [7]:
#We will check the first scraped movie to see that the loop worked.

pretty(info_dict[0])

Title:
	Academy Award Review of Walt Disney Cartoons
Productioncompany :
	['Walt Disney Productions']
Release date:
	['May 19, 1937 (1937-05-19)']
Running time:
	['41 minutes (74 minutes 1966 release)']
Country:
	['United States']
Language:
	['English']
Box office:
	['$45.472']


Well done! We finished task number two ✅

In [8]:
#Implementing two functions for saving data in a JSON file and loading data from a JSON file.

def save_data(title, data):
    with open(title, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
        
def load_data(title):
    with open(title, encoding="utf-8") as f:
        return json.load(f)
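
As a quick check of the two helpers, here is a self-contained round trip through a temporary file. The sample record is made up for illustration; the point is that `ensure_ascii=False` together with `encoding='utf-8'` keeps characters such as `₹` readable in the file.

```python
import json
import os
import tempfile

def save_data(title, data):
    with open(title, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

def load_data(title):
    with open(title, encoding='utf-8') as f:
        return json.load(f)

# Made-up record containing a non-ASCII currency sign.
sample = [{"Title": "Example", "Budget": ["₹50 million"]}]

with tempfile.TemporaryDirectory() as folder:
    file_path = os.path.join(folder, "sample.json")
    save_data(file_path, sample)
    with open(file_path, encoding="utf-8") as f:
        raw_text = f.read()
    restored = load_data(file_path)

print("₹" in raw_text)     # the sign is stored as-is, not as an escape sequence
print(restored == sample)  # the data survives the round trip
```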

In [9]:
#Saving the unclean data (for now) in the JSON file.

save_data('disney_data.json',info_dict)

### Task three: Clean the data!
<font size="1.5"> "Better without data than unclean data." <br/>(Ancient Chinese proverb)</font> 
<br/><br/>In this section we will make slight improvements to the data.




In [10]:
movie_info_list = load_data('disney_data.json')

In [11]:
def RepresentsInt(s): #This function checks whether the variable is integer or not
    try: 
        int(s)
        return True
    except ValueError:
        return False

In [12]:
def covert_to_date_time(date): #This function converts a date string to a 'year-month-day' string
    
    months = {'January':1, 'February':2, 'March':3, 'April':4, 'May':5, 'June':6, 'July':7, 'August':8, 'September':9, 'October':10, 'November':11, 'December':12}
    
    date = date.replace(',','')
    date = date.split(' ')
    
    if(date[0]==''): #If there is no date
        return 'None-None-None'
    if(len(date)==2):  #If there is only a year
        year = date[0]
        return str(year)+'-None-None'
    
    month = None
    day = None
    year = None
    
    #We now distinguish which of the three members of the array is a year, which is a day, and which represents a month.
    if date[0] in months: 
        month = months[date[0]]
        if RepresentsInt(date[1]):
            if int(date[1]) < 1000:
                day = int(date[1])
            else:
                year = int(date[1])
    
    if date[1] in months:
        month = months[date[1]]
        if RepresentsInt(date[0]):
            day = int(date[0])
    
    #If the year has not been found yet, take it from the third token (added after an error during the data run)
    if RepresentsInt(date[2]) and year is None:
        year = int(date[2])
    
    return '-'.join([str(year), str(month), str(day)])
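
For comparison, the same conversion can be sketched with `datetime.strptime` from the standard library. This is not the notebook's method, only an alternative under stated assumptions: the two formats listed are the ones that appear in the sample output plus a day-first variant, month names are English (the default locale), and anything else returns `None`. The name `parse_release_date` is hypothetical.

```python
import re
from datetime import datetime

# Candidate formats; this list is an assumption, extend it as the data requires.
FORMATS = ("%B %d, %Y", "%d %B %Y")

def parse_release_date(text):
    """Return an ISO 'YYYY-MM-DD' string, or None if no format matches."""
    # Drop parenthesised parts such as '(1937-05-19)' or '(United States)'.
    cleaned = re.sub(r"\([^)]*\)", "", text).strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None

print(parse_release_date("May 19, 1937 (1937-05-19)"))
print(parse_release_date("December 24, 1970 (1970-12-24) (United States)"))
```

Unlike the hand-written function, this version zero-pads the month and day and rejects impossible dates such as February 30.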

In [13]:
def str_to_money(money): #This function converts the money string to numeric value (in dollars)
    money = str(money)
    
    convert = 1
    
    #Deleting all the strings that interfere with our conversion of the string to integer
    money = money.replace('US','')
    money = money.replace('–',' ')
    money = money.replace('–',' ')
    money = money.replace('-',' ')
    money = money.strip() 
    
    sign = '$'
    
    if '₹' in money: #Rupees - then convert to dollars (1 ₹ ≈ 0.013 $)
        sign = '₹'
        money = money.replace('$','')
        convert = 0.013
    elif '¥' in money: #Yen - then convert to dollars (1 ¥ ≈ 0.0091 $)
        convert = 0.0091
        sign = '¥'
    elif '$' in money: #A small change that makes a difference: from '$ ' to '$'
        money = money.replace('$ ','$')
    
    #Isolating the amount of money from the rest of the string
    if sign in money:
        money = money.split(sign)[1]
    else:
        money = money.split(' ')[0]
    
    #Convert money to millions and billions (if any) and numerical value.
    if 'million' in money:
        money = money.replace(',','')
        money = float(money.split(' ')[0]) * 1000000
    elif 'billion' in money:
        money = money.replace(',','')
        money = float(money.split(' ')[0]) * 1000000000
    else:
        money = money.replace(',','')
        money = float(money.split(' ')[0])
    
    return money * convert
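
The core of the conversion, taking the first number and scaling it by "million" or "billion", can also be sketched with a regular expression. This is a simplified illustration, not a replacement for `str_to_money`: it assumes the amount is the first number in the string and it ignores currency conversion entirely. `parse_amount` and `SCALE` are hypothetical names.

```python
import re

# Multipliers for the scale words the notebook handles.
SCALE = {"million": 1_000_000, "billion": 1_000_000_000}

def parse_amount(text):
    """Parse strings like '$20 million' into a float; for a range, use the first number."""
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)", text)
    if match is None:
        return None
    value = float(match.group(1).replace(",", ""))
    scale = re.search(r"(million|billion)", text)
    if scale:
        value *= SCALE[scale.group(1)]
    return value

print(parse_amount("$20 million"))
print(parse_amount("$1,234,567"))
print(parse_amount("unknown"))
```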

In [15]:
#In addition to the JSON file we also build a dataframe with columns of the movie name and the production company
df = pd.DataFrame({'Title': [], 'Production Company': []})

In [16]:
for movie in movie_info_list: #Now it's time to build the datasets
    
    #Isolating the amount of minutes from the rest of the string
    if 'Running time' in movie:
        if isinstance(movie['Running time'],list):
            movie['Running time'] = movie['Running time'][-1]
        
        movie['Running time (Minutes)'] = int(movie['Running time'].split(' ')[0].split('–')[0])
        movie.pop('Running time', None) #Now removing the string of the time
    
    #Use our function to convert the date to a normal structure (including handling lists)
    if 'Release date' in movie:
        if isinstance(movie['Release date'],list):
            movie['Release Date (date-time)'] = []
            
            for date in movie['Release date']:
                movie['Release Date (date-time)'].append(covert_to_date_time(date))
        else:
            movie['Release Date (date-time)'] = covert_to_date_time(movie['Release date'])
    
    #Use our function to convert the money to a numeric value (including handling lists)
    if "Box office" in movie:    
        if isinstance(movie['Box office'],list):        
            new_box_office = []
            
            for Box_office in movie['Box office']:
                money = str_to_money(Box_office)
                new_box_office.append(money)
            movie['Box office'] = new_box_office
        else:
            money = str_to_money(movie["Box office"])
            movie['Box office'] = money
    
    #Use our function to convert the money to a numeric value (including handling lists)
    if 'Budget' in movie:
        if movie["Budget"] != 'unknown':
            if isinstance(movie['Budget'],list):
                new_Budget = []
                
                for Budget in movie['Budget']:
                    new_Budget.append( str_to_money(Budget) )
                movie['Budget'] = new_Budget
            else:
                movie['Budget'] = str_to_money(movie["Budget"])
   
    #Now it remains to find the production company for the csv file.
    #(thanks to Wikipedia and its unclean data - the production company's string comes in three forms)
    if 'Production company' in movie:
        df.loc[len(df)] = [movie["Title"], movie['Production company']]
    elif 'Productioncompany ' in movie:
        df.loc[len(df)] = [movie["Title"], movie['Productioncompany ']]
    elif 'Productioncompanies ' in movie:
        df.loc[len(df)] = [movie["Title"], movie['Productioncompanies ']]
    else:
        df.loc[len(df)] = [movie["Title"], '']
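
The three spellings of the production-company key differ only in spacing and in singular versus plural. One possible way to handle them in a single place, sketched here with a hypothetical helper `find_production_company`, is to normalize each key before comparing. The sample dictionaries are made up for illustration.

```python
def find_production_company(movie):
    """Return the production company value, whatever spelling the key has."""
    for key, value in movie.items():
        # 'Productioncompany ' and 'Production company' both become 'productioncompany'.
        normalized = key.replace(" ", "").lower()
        if normalized in ("productioncompany", "productioncompanies"):
            return value
    # Mirror the notebook's fallback of an empty string.
    return ""

print(find_production_company({"Title": "Pinocchio", "Productioncompany ": ["Walt Disney Productions"]}))
print(find_production_company({"Title": "Untitled"}))
```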
    
    



In [17]:
#Check that everything is working properly
pretty(movie_info_list[100])

Title:
	The Aristocats
Directed by:
	['Wolfgang Reitherman']
Story by:
	['Ken Anderson', 'Larry Clemmons', 'Eric Cleworth', 'Vance Gerry', 'Julius Svendsen', 'Frank Thomas', 'Ralph Wright']
Based on:
	['"The Aristocats"']
Produced by:
	['Winston Hibler', 'Wolfgang Reitherman']
Starring:
	['Phil Harris', 'Eva Gabor', 'Sterling Holloway', 'Scatman Crothers', 'Paul Winchell', 'Lord Tim Hudson', 'Thurl Ravenscroft', 'Dean Clark', 'Liz English', 'Gary Dubin']
Edited by:
	['Tom Acosta']
Music by:
	['George Bruns']
Productioncompany :
	['Walt Disney Productions']
Distributed by:
	['Buena Vista Distribution']
Release date:
	['December 11, 1970 (1970-12-11) (premiere)', 'December 24, 1970 (1970-12-24) (United States)']
Country:
	['United States']
Language:
	['English']
Budget:
	[4000000.0]
Box office:
	[191000000.0]
Running time (Minutes):
	79
Release Date (date-time):
	['1970-12-11', '1970-12-24']
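
The `Running time (Minutes)` value is obtained in the loop by splitting the string on spaces and taking the first piece. A slightly more defensive variant, sketched below under the assumption that the number of interest is the one directly followed by "min", uses a regular expression and returns `None` instead of raising when nothing matches. The name `running_time_minutes` is hypothetical. Note that for a range such as "60–70 minutes" this sketch would pick 70, whereas the notebook's code picks the first number.

```python
import re

def running_time_minutes(text):
    """Return the first number that is directly followed by 'min', or None."""
    match = re.search(r"(\d+)\s*min", text.lower())
    return int(match.group(1)) if match else None

print(running_time_minutes("95 minutes"))
print(running_time_minutes("41 minutes (74 minutes 1966 release)"))
```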


Awesome! The third task is complete ✅

In [18]:
df = df.dropna()
df.head()

| | Title | Production Company |
|---|---|---|
| 0 | Academy Award Review of Walt Disney Cartoons | [Walt Disney Productions] |
| 1 | Snow White and the Seven Dwarfs | [Walt Disney Productions] |
| 2 | Pinocchio | [Walt Disney Productions] |
| 3 | Fantasia | [Walt Disney Productions] |
| 4 | The Reluctant Dragon | [Walt Disney Productions] |


### Task four: Keep the data organized in json and csv files.
The final task is very simple: save the clean data in the appropriate files ✅

In [19]:
save_data('disney_data.json',movie_info_list)
df.to_csv('movie_data.csv')
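
`df.to_csv` writes each production-company list as its Python representation, for example `['Walt Disney Productions']`. If plain text cells are preferred, one option is to join the list into a single string before writing. The sketch below uses the standard `csv` module and an in-memory buffer so that it runs on its own; the rows and the `'; '` separator are assumptions for illustration.

```python
import csv
import io

# Illustrative rows in the same shape as the notebook's dataframe.
rows = [
    ("Pinocchio", ["Walt Disney Productions"]),
    ("Something Wicked This Way Comes", ["Walt Disney Productions", "The Bryna Company"]),
    ("Untitled", ""),
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Title", "Production Company"])
for title, companies in rows:
    # Join lists into one string; leave plain strings untouched.
    if isinstance(companies, list):
        companies = "; ".join(companies)
    writer.writerow([title, companies])

csv_text = buffer.getvalue()
print(csv_text)
```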

<h2 align="center">Looks good! Now we can analyze the data as we wish.</h2>

![](https://media1.giphy.com/media/w87qIKJBwmffO/200w.webp?cid=ecf05e47u8ph8v3bc00vlfnj8yrna56l1i5s00fhmiq0csy0&rid=200w.webp&ct=g)