# Web-scraping to Collect Data for the Best 250 Movies of All Time

## Calling the Necessary Libraries for Web-scraping

In [1]:
import pandas as pd
import requests 
from bs4 import BeautifulSoup

## Constructing the Necessary Functions 

In [2]:
# The following function recognizes whether a string is a movie runtime. If the string starts with a 
# single digit followed by 'h' (as in '2h 57m'), the function returns True. Otherwise, including for 
# an empty string, it returns False.

def Movie_Length(x):
    # The length check avoids an IndexError on empty or one-character strings.
    return len(x)>=2 and x[0] in '0123456789' and x[1]=='h'
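
As a possible follow-up to `Movie_Length`, the sketch below converts a runtime string into a number of minutes, which is easier to analyze than text. The helper name `runtime_to_minutes` and its regular expression are assumptions made for illustration and are not part of the original notebook. It assumes runtimes look like '2h 57m', as in the collected data.

```python
import re

# Hypothetical helper (not part of the original notebook): converts a runtime
# such as '2h 57m' into minutes. Assumes the 'xh ym' format seen in the data.
_RUNTIME = re.compile(r'^\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*$')

def runtime_to_minutes(text):
    match = _RUNTIME.match(text)
    if match is None or not any(match.groups()):
        return None  # not a recognizable runtime, e.g. 'Not available!'
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return 60 * hours + minutes

print(runtime_to_minutes('2h 57m'))          # 177
print(runtime_to_minutes('Not available!'))  # None
```

Returning `None` for unrecognized strings keeps the placeholder rows distinguishable from real runtimes.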

## Web-scraping Job:

### Extracting the Name, Production Year, and URL of the Best 250 Movies of All Time

1.  Using the website https://www.listchallenges.com/reel-stats-statistical-top-250-movies-of-all-time, we 
    first extract the names and production years of the best 250 movies of all time. The web scraping is performed
    with the 'requests' library (to download the pages) and the 'BeautifulSoup' library (to parse the HTML).

In [3]:
# Extracting the name and production year of each of the 250 best movies of all time.
# The list is spread over 7 pages. The code assumes each item has the form 'Name (Year)'.

movie_names=[]
movie_years=[]
for m in range(1,8):
    url='https://www.listchallenges.com/reel-stats-statistical-top-250-movies-of-all-time/list/' + str(m)
    page=requests.get(url)
    print(m,page)
    soup=BeautifulSoup(page.content, 'html.parser')

    page_cont=soup.find_all('div',class_="item-name")
    for k in page_cont:
        # Remove the layout whitespace once, then split 'Name (Year)' at the parenthesis.
        text=k.get_text().replace('\t','').replace('\r','').replace('\n','')
        movie_names.append(text.split('(')[0].strip())
        movie_years.append(int(text.split('(')[1].replace(')','')))

1 <Response [200]>
2 <Response [200]>
3 <Response [200]>
4 <Response [200]>
5 <Response [200]>
6 <Response [200]>
7 <Response [200]>


In [4]:
# We make sure that we have collected the names and production years of all 250 best movies of all time. 

print(len(movie_names))
print(len(movie_years))

250
250
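
The extraction above splits each item at the first '(' and therefore assumes that no title contains a parenthesis of its own. One way to relax that assumption is sketched below. The helper name `split_name_year` and the sample strings are made up for illustration and are not part of the original notebook.

```python
import re

# Hypothetical helper: splits 'Name (Year)' at the LAST parenthesized
# four-digit number, so parentheses inside the title are left alone.
_NAME_YEAR = re.compile(r'^(.*)\((\d{4})\)\s*$', re.DOTALL)

def split_name_year(text):
    cleaned = ' '.join(text.split())  # collapse tabs, newlines, repeated spaces
    match = _NAME_YEAR.match(cleaned)
    if match is None:
        return cleaned, None  # no year found; keep the name as-is
    return match.group(1).strip(), int(match.group(2))

# Made-up sample input with the kind of whitespace padding the cell strips out.
print(split_name_year('\r\n\tThe Godfather (1972)\r\n'))  # ('The Godfather', 1972)
```

Returning `None` for the year, instead of raising an error, lets a single malformed item be inspected later without stopping the whole loop.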


2. From the same website, we extract the rottentomatoes.com url of each of the 250 movies. These urls are used later to collect more information about each movie from https://www.rottentomatoes.com/.

In [5]:
# Extracting the rottentomatoes.com address of each of the 250 best movies of all time.

movie_urls=[]
for m in range(1,8):
    url='https://www.listchallenges.com/reel-stats-statistical-top-250-movies-of-all-time/list/' + str(m)
    page=requests.get(url)
    print(m,page)
    soup=BeautifulSoup(page.content, 'html.parser')

    page_cont_url=soup.find_all('div',class_="rt-score")
    for k in page_cont_url:
        try:
            h=k.find('a').get('href')
        except AttributeError:
            # k.find('a') returns None when the movie has no rottentomatoes.com link.
            h='The url does not exist!'
        movie_urls.append(h)

1 <Response [200]>
2 <Response [200]>
3 <Response [200]>
4 <Response [200]>
5 <Response [200]>
6 <Response [200]>
7 <Response [200]>


In [6]:
# A few of the movies do not have a record on rottentomatoes.com. For those, the url is recorded as 
# 'The url does not exist!'.

len(movie_urls)

250
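
The length check confirms that there is one entry per movie, but not how many of those entries are placeholders. A minimal sketch of such a count follows. The helper name `looks_like_url` and the two-item sample list are assumptions for illustration; in the notebook, the same test would be applied to `movie_urls`.

```python
from urllib.parse import urlparse

# Hypothetical helper: True when the string parses as an http(s) url with a host.
def looks_like_url(value):
    parts = urlparse(value)
    return parts.scheme in ('http', 'https') and bool(parts.netloc)

# Small made-up sample standing in for the scraped 'movie_urls' list.
sample_urls = [
    'http://www.rottentomatoes.com/m/12911',
    'The url does not exist!',
]
missing = [u for u in sample_urls if not looks_like_url(u)]
print(len(missing))  # 1
```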

### Extracting the Movie Info from rottentomatoes.com for Each of the Best 250 Movies of All Time

Using the rottentomatoes.com website, we extract the following information about each of the 250 best movies of all time:

1. *Genre* of the movie (collected in the 'genre' list).
2. *Box Office* of the movie, i.e. the amount, in US dollars, that the movie earned from ticket sales in cinemas (collected in the 'box_office' list).
3. *Runtime*, i.e. the length of the movie (collected in the 'duration' list).
4. *Critic Ratings*, i.e. the score, in percent, that the movie received from the critics of 
   rottentomatoes.com (collected in the 'Critic_Rating' list).
5. *Audience Ratings*, i.e. the score, in percent, that the movie received from the rottentomatoes.com 
   audience that watched the movie (collected in the 'Audience_Rating' list).

In [7]:
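# Using the movie_urls we constructed above, we scrape the rottentomatoes.com page of each of the 250 movies.

genre=[]
box_office=[]
duration=[]
Critic_Rating=[]
Audience_Rating=[]
for m in movie_urls:
    try:
        page=requests.get(m)
    except requests.exceptions.RequestException:
        # The url is missing or unreachable. Record placeholders and move on, so that
        # the page of the previous movie is not parsed a second time.
        genre.append('The genre is not known!')
        box_office.append('Not available!')
        duration.append('Not available!')
        Critic_Rating.append('Not available!')
        Audience_Rating.append('Not available!')
        continue
    soup=BeautifulSoup(page.content, 'html.parser')
    cont=soup.find_all('div',class_="meta-value genre")
    gross=soup.find_all('div',class_="meta-value")
    rating=soup.find_all('span', class_="mop-ratings-wrap__percentage")

    try:
        y=cont[0].get_text().replace('\n','').replace(' ','').strip()
    except IndexError:
        y='The genre is not known!'
    genre.append(y)

    # The box office and the runtime share the 'meta-value' class, so each entry is
    # tested for a dollar sign (box office) and for the runtime pattern (Movie_Length).
    z='Not available!'
    w='Not available!'
    for k in gross:
        text=k.get_text().strip()
        if '$' in text:
            z=text.replace(' ','').replace('$','').replace(',','')
        if text and Movie_Length(text):
            w=text

    box_office.append(z)
    duration.append(w)

    # The first percentage on the page is taken as the critic score, the second as the audience score.
    try:
        t_one=int(rating[0].get_text().strip().replace('%',''))
    except (IndexError, ValueError):
        t_one='Not available!'
    try:
        t_two=int(rating[1].get_text().strip().replace('%',''))
    except (IndexError, ValueError):
        t_two='Not available!'

    Critic_Rating.append(t_one)
    Audience_Rating.append(t_two)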
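
The box office values are stored as text with a magnitude suffix, such as '134.8M' or '271.7K'. A later analysis step will probably need plain numbers, so the sketch below shows one possible conversion. The helper name `box_office_to_dollars` and the suffix table, including the 'B' entry, are assumptions for illustration and are not part of the original notebook.

```python
# Hypothetical helper: converts strings such as '134.8M' or '271.7K' to dollars.
_MULTIPLIERS = {'K': 1_000, 'M': 1_000_000, 'B': 1_000_000_000}

def box_office_to_dollars(text):
    text = text.strip().upper()
    if not text:
        return None
    multiplier = _MULTIPLIERS.get(text[-1], 1)
    number = text[:-1] if text[-1] in _MULTIPLIERS else text
    try:
        return round(float(number) * multiplier)
    except (ValueError, OverflowError):
        return None  # e.g. 'Not available!'

print(box_office_to_dollars('134.8M'))          # 134800000
print(box_office_to_dollars('271.7K'))          # 271700
print(box_office_to_dollars('Not available!'))  # None
```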

## Transforming  the Collected Data into a Pandas Dataframe

In [8]:
# Defining the column names of the dataframe (Dict_one) and the matching lists of data (Dict_two).

Dict_one=['Movie Name','Movie Year','Movie url','Genre','Runtime','Box Office','Critic Ratings','Audience Ratings']
Dict_two=[movie_names,movie_years,movie_urls,genre,duration,box_office,Critic_Rating,Audience_Rating]
print(len(Dict_one))
print(len(Dict_two))

8
8
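
The check above confirms that there are as many column names as data lists. A complementary check, sketched below, is that every data list has the same number of entries, since lists of unequal length cannot be assigned as columns of one dataframe. The helper name `common_length` and the miniature data are made up for illustration; in the notebook, the same call would take `Dict_one` and `Dict_two`.

```python
# Hypothetical helper: raises if the data lists do not all have the same length,
# and returns that common length otherwise.
def common_length(names, columns):
    if len(names) != len(columns):
        raise ValueError('number of column names and data lists differ')
    lengths = {name: len(col) for name, col in zip(names, columns)}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'columns have different lengths: {lengths}')
    return next(iter(lengths.values()), 0)

# Made-up miniature data standing in for Dict_one and Dict_two.
names = ['Movie Name', 'Movie Year']
columns = [['The Godfather', '12 Angry Men'], [1972, 1957]]
print(common_length(names, columns))  # 2
```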


In [9]:
# Putting all the collected data into the dataframe 'df'.

df=pd.DataFrame()
for k in range(len(Dict_one)):
    df[Dict_one[k]]=Dict_two[k]

In [10]:
# Displaying the first few rows of the dataframe.

df.head()

| | Movie Name | Movie Year | Movie url | Genre | Runtime | Box Office | Critic Ratings | Audience Ratings |
|---|---|---|---|---|---|---|---|---|
| 0 | The Godfather | 1972 | http://www.rottentomatoes.com/m/12911 | drama,crime | 2h 57m | 134.8M | 98 | 98 |
| 1 | 12 Angry Men | 1957 | http://www.rottentomatoes.com/m/18108 | drama | 1h 35m | Not available! | 100 | 97 |
| 2 | The Godfather: Part II | 1974 | http://www.rottentomatoes.com/m/12926 | drama,crime | 3h 20m | Not available! | 98 | 97 |
| 3 | Seven Samurai | 1954 | http://www.rottentomatoes.com/m/16992 | action | 3h 28m | 271.7K | 100 | 97 |
| 4 | Schindler's List | 1993 | http://www.rottentomatoes.com/m/12903 | history,drama | 3h 15m | 96.6M | 97 | 97 |


In [11]:
# Saving the dataframe into a csv file for the next step of the process.

df.to_csv('250-Best-Movies.csv')