# Instructions
         
For Part A, you need to scrape an IMDb web page to find the top movies sorted by user votes. For each movie, you need to pull:
- movie_id
- rank
- title 
- runtime
- year
- rating
- votes

The URL of a page that lists movies released between 2018 and 2020, sorted by number of votes, is: 

https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2020-12-31&sort=num_votes,desc

Please click the URL and investigate how you can pull movie_id, rank, title,... from the webpage. 


**You need to write code after where I have <span style="color:red">'''  Your code here ...    '''.</span>**

***
Now let's look at the read_m_from_html_string(url, num_of_m=50) function in detail. The parameter “num_of_m” represents the number of top movies you want to retrieve. For example, read_m_from_html_string(url, 500) means that we want to extract the top 500 movies released between 2018 and 2020, sorted by users' votes.

This function returns a list of dictionaries. Each dictionary represents one of the top movies, which could look like the following:

{
  
    'movie_id': 'tt7286456',
    'rank': '1.',
    'title': 'Joker',
    'runtime': '2h 2m',
    'year': '2019',
    'rating': '8.4',
    'votes': '1,421,777',
}


After you implement “read_m_from_html_string”, which will return a list of top movies, you need to export the movies list to a csv file.
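
As a rough sketch of that export step, a list of dictionaries in the shape shown above can be turned into a DataFrame and written out as CSV. The `sample_movies` list and the variable names here are illustrative assumptions, not part of the assignment code.

```python
import pandas as pd

# Hypothetical sample in the shape described above (one movie only).
sample_movies = [
    {"movie_id": "tt7286456", "rank": "1.", "title": "Joker",
     "runtime": "2h 2m", "year": "2019", "rating": "8.4", "votes": "1,421,777"},
]

# Each dictionary becomes one row; the keys become the column names.
df_sample = pd.DataFrame(sample_movies)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = df_sample.to_csv(index=False)
print(csv_text)
```

Passing a file name as the first argument to `to_csv` writes the same text to disk.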


***

After you are done scraping the needed data, you should clean and transform it as needed to make it ready for enriching the given "Movies.csv" dataset.
***

Finally, export the enriched dataset to a CSV file:
Use the following naming convention: Project_3_PartA_Lastname.csv




In [1]:
import warnings
warnings.filterwarnings('ignore')
from bs4 import BeautifulSoup
import pandas as pd

***

## read_m_from_html_string

Inside this function, you need to write your code to pull the movies information from the provided Movies 500 HTML String text file.

For each movie, you need to pull :
- movie_id
- rank
- title 
- runtime
- year
- rating
- votes

As an example of how to pull data from the web page HTML string, I have included the code to pull the movie_id.
You need to include your own code to pull the other movie information (title, rank, year, ...). None of the collected fields should have missing values.

The URL of a page that lists movies released between 2018 and 2020, sorted by number of votes, is: 

https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2020-12-31&sort=num_votes,desc

Please click the URL and investigate how you can pull movie_id, rank, title,... from the webpage using the Inspect option.
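
To make the inspection step concrete, here is a minimal sketch of how `find` locates a tag by name and class, run on a small hand-written HTML fragment. The fragment is a simplified assumption that only reuses class names from this notebook; the real page is nested more deeply.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for a single movie entry.
html = """
<ul>
  <li class="ipc-metadata-list-summary-item">
    <a class="ipc-title-link-wrapper" href="/title/tt7286456/?ref_=sr_t_1">
      <h3 class="ipc-title__text">1. Joker</h3>
    </a>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find(tag_name, class_value) returns the first matching tag.
item = soup.find("li", "ipc-metadata-list-summary-item")
href = item.find("a", "ipc-title-link-wrapper").attrs["href"]
heading = item.find("h3", "ipc-title__text").text

# The heading holds "rank. title"; split only on the first period.
rank, title = heading.split(".", 1)
print(href, rank.strip(), title.strip())
```

Splitting on the first period only keeps titles that themselves contain periods intact.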



In [2]:
# This function reads a number of movies from the HTML string of a URL. The default number is 50.
def read_m_from_html_string(url, num_of_m=50):
    
    print(url)
    
    with open('TopVoted_500_Movies_HTML.txt', 'r', encoding="utf8") as file:
        html_string = file.read()   # read the HTML file as a string
        # I have included the Movies 500 HTML String.txt file in the project folder. Please take a look.
    
    # create a soup object
    soup = BeautifulSoup(html_string, "html.parser")
    
    '''
    Click the URL and investigate how you can pull movie_id, rank, title,... from the webpage.
    To investigate the html of a web page , For example:
    URL: https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2020-12-31&sort=num_votes,desc
    Right-click anywhere on the webpage, and at the very bottom of the menu that pops up, 
    you will see "Inspect", Click on it.
    '''
    '''
    Fetch the div that contains all the movies, using the find and find_all functions.
    For example, find_all('div') returns every div on the page. Both functions also accept
    a second parameter: in the code below, 'div' is the tag name and
    'ipc-page-grid__item ipc-page-grid__item--span-2' is the value of the tag's class attribute.
    Writing soup.find('div', 'ipc-page-grid__item ipc-page-grid__item--span-2') therefore says
    explicitly: I want to find a div with
    class = 'ipc-page-grid__item ipc-page-grid__item--span-2'.
    
    Since only one div with this class is expected in the page, we can use find rather than find_all.
    find_all() returns a list of matching tags, while find() returns only the first match.
    '''
    movie_list = soup.find('div', 'ipc-page-grid__item ipc-page-grid__item--span-2') 
    # this div contains all the listed movies in the requested html web page string.
    
    list_movies = [] # initialize the function's return value, which is a list of movies. 
                     # This list will contain the scraped data transformed to a structured format.
    
    # Use count to track the number of movies processed. It starts at 0: no movie has been processed yet.
    count = 0
    
    # Each movie is listed in an li element with class 'ipc-metadata-list-summary-item'.
    divs=  movie_list.find_all('li','ipc-metadata-list-summary-item') # To find all the listed movies in the page.
    for d in divs:
        dict_each_movie = {}  # initialize the movie dictionary to store the movie information.

        # Pulling the movie_id
        try:
            movie_id = d.find('a', 'ipc-title-link-wrapper').attrs['href']
            movie_id = movie_id[7:16]
            
        except:
            movie_id = ""
        finally:
            dict_each_movie["movie_id"] = movie_id
            print(movie_id)
            
        # Pulling the rank
        '''  Your code here ...    '''
        try:
            rank = d.find('h3', 'ipc-title__text').text
            rank = rank.split(' ')[0]
        except:
            rank = ""
        finally:
            dict_each_movie["rank"] = rank
            print(rank)
            
        # Pulling the title
        '''  Your code here ...    '''
        try:
            title = d.find('h3', 'ipc-title__text').text
            title = title.split(".",1)[1]
            title = title.strip() # strip surrounding whitespace here (could be deferred to the cleaning part)
            
        except:
            title = ""
        finally:
            dict_each_movie["title"] = title
            print(title)
            
        # Pulling the runtime
        '''  Your code here ...    '''
        try:
            rt = d.find_all('span', 'sc-479faa3c-8 bNrEFi dli-title-metadata-item')
            runtime = ''
            for r in rt:
                runtime = runtime + "," + r.text
            runtime = runtime.split(",")[2]
            
        except:
            runtime = ""
        finally:
            dict_each_movie["runtime"] = runtime
            print(runtime)

        # Pulling the year
        '''  Your code here ...    ''' 
        try:
            yr = d.find_all('span', 'sc-479faa3c-8 bNrEFi dli-title-metadata-item')
            year = ''
            for y in yr:
                year = year + "," + y.text
            year = year.split(",")[1]
            
        except:
            year = ""
        finally:
            dict_each_movie["year"] = year
            print(year)
 
        # Pulling the rating
          # the rating out of 10
        '''  Your code here ...    '''   
        try:
            rating = d.find('span', 'ipc-rating-star ipc-rating-star--base ipc-rating-star--imdb ratingGroup--imdb-rating').get('aria-label')
            rating = rating.split(':')[1]
            rating = rating.strip() # strip surrounding whitespace here (could be deferred to the cleaning part)
        except:
            rating = ""
        finally:
            dict_each_movie["rating"] = rating
            print(rating)

        # Pulling the votes
        '''  Your code here ...    '''
        try:
            votes = d.find('div', 'sc-21df249b-0 jmcDPS').text
            votes = votes[5:]
            
        except:
            votes = ""
        finally:
            dict_each_movie["votes"] = votes
            print(votes)

        list_movies.append(dict_each_movie)  # To add the movie information to the movies list.

        count +=1
        print('===============================')
        print()
        if count == num_of_m:
            break # to exit from the loop.

    return list_movies
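
The fixed slice `[7:16]` used for movie_id assumes that every identifier is exactly nine characters long. A possible alternative, sketched here under the assumption that links have the form `/title/tt<digits>/...`, is a regular expression that captures the identifier whatever its length. The helper name `extract_movie_id` is hypothetical.

```python
import re

def extract_movie_id(href):
    """Return the 'tt...' identifier found in an IMDb title link, or '' if none is found."""
    match = re.search(r"/title/(tt\d+)", href)
    return match.group(1) if match else ""

# Illustrative links; the second one has a longer identifier.
print(extract_movie_id("/title/tt7286456/?ref_=sr_t_1"))
print(extract_movie_id("/title/tt10872600/"))
```

Returning an empty string on failure matches the fallback used in the `except` branches above.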


###  Call statement to scrape the TopVoted 500 movies
##### read_m_from_html_string(url,500)

In [3]:
url = "https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2020-12-31&sort=num_votes,desc"

Movies_list = read_m_from_html_string(url,500)  #to read the topVoted 500 movies
Movies_list

https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2020-12-31&sort=num_votes,desc
tt7286456
1.
Joker
2h 2m
2019
8.4
1,422,218

tt4154796
2.
Avengers: Endgame
3h 1m
2019
8.4
1,227,564

tt4154756
3.
Avengers: Infinity War
2h 29m
2018
8.4
1,167,536

tt6751668
4.
Parasite
2h 12m
2019
8.5
903,939

tt1825683
5.
Black Panther
2h 14m
2018
7.3
820,837

tt7131622
6.
Once Upon a Time in Hollywood
2h 41m
2019
7.6
809,786

tt8946378
7.
Knives Out
2h 10m
2019
7.9
747,547

tt8579674
8.
1917
1h 59m
2019
8.2
650,257

tt4633694
9.
Spider-Man: Into the Spider-Verse
1h 57m
2018
8.4
641,118

tt5463162
10.
Deadpool 2
1h 59m
2018
7.6
624,815

tt4154664
11.
Captain Marvel
2h 3m
2019
6.8
597,379

tt1727824
12.
Bohemian Rhapsody
2h 14m
2018
7.9
574,079

tt6723592
13.
Tenet
2h 30m
2020
7.3
568,337

tt6644200
14.
A Quiet Place
1h 30m
2018
7.5
567,953

tt6966692
15.
Green Book
2h 10m
2018
8.2
542,702

tt6320628
16.
Spider-Man: Far from Home
2h 9m
2019
7.4
537,741

tt1270797
17.
Venom
1h 

50,990

tt2850386
320.
The Croods: A New Age
1h 35m
2020
6.9
50,770

tt5177088
321.
The Girl in the Spider's Web
1h 55m
2018
6.1
50,657

tt7734218
322.
Stuber
1h 33m
2019
6.2
50,570

tt7315484
323.
The Silence
1h 30m
2019
5.3
49,588

tt5932728
324.
The Professor and the Madman
2h 4m
2019
7.2
49,542

tt7153766
325.
Unsane
1h 38m
2018
6.4
49,141

tt7139936
326.
A Rainy Day in New York
1h 32m
2019
6.5
49,116

tt5691670
327.
Under the Silver Lake
2h 19m
2018
6.5
48,997

tt1102427
328.
The Babysitter: Killer Queen
1h 41m
2020
5.8
48,993

tt7060344
329.
Raatchasan
2h 50m
2018
8.3
48,903

tt2837574
330.
The Old Man & the Gun
1h 33m
2018
6.7
48,871

tt8236336
331.
The Report
1h 59m
2019
7.2
48,591

tt7772580
332.
The Perfection
1h 30m
2018
6.2
48,431

tt7431594
333.
Race 3
2h 40m
2018
1.9
48,185

tt6212478
334.
American Animals
1h 56m
2018
7.0
48,143

tt5397194
335.
Anon
1h 40m
2018
6.1
47,688

tt1620680
336.
A Wrinkle in Time
1h 49m
2018
4.3
47,167

tt5208252
337.
Operation Finale
2h 2m
2018


[{'movie_id': 'tt7286456',
  'rank': '1.',
  'title': 'Joker',
  'runtime': '2h 2m',
  'year': '2019',
  'rating': '8.4',
  'votes': '1,422,218'},
 {'movie_id': 'tt4154796',
  'rank': '2.',
  'title': 'Avengers: Endgame',
  'runtime': '3h 1m',
  'year': '2019',
  'rating': '8.4',
  'votes': '1,227,564'},
 {'movie_id': 'tt4154756',
  'rank': '3.',
  'title': 'Avengers: Infinity War',
  'runtime': '2h 29m',
  'year': '2018',
  'rating': '8.4',
  'votes': '1,167,536'},
 {'movie_id': 'tt6751668',
  'rank': '4.',
  'title': 'Parasite',
  'runtime': '2h 12m',
  'year': '2019',
  'rating': '8.5',
  'votes': '903,939'},
 {'movie_id': 'tt1825683',
  'rank': '5.',
  'title': 'Black Panther',
  'runtime': '2h 14m',
  'year': '2018',
  'rating': '7.3',
  'votes': '820,837'},
 {'movie_id': 'tt7131622',
  'rank': '6.',
  'title': 'Once Upon a Time in Hollywood',
  'runtime': '2h 41m',
  'year': '2019',
  'rating': '7.6',
  'votes': '809,786'},
 {'movie_id': 'tt8946378',
  'rank': '7.',
  'title': 'K

In [4]:
# to convert the movies list of dics to dataframe
df_movies = pd.DataFrame(Movies_list)
df_movies

Unnamed: 0,movie_id,rank,title,runtime,year,rating,votes
0,tt7286456,1.,Joker,2h 2m,2019,8.4,1422218
1,tt4154796,2.,Avengers: Endgame,3h 1m,2019,8.4,1227564
2,tt4154756,3.,Avengers: Infinity War,2h 29m,2018,8.4,1167536
3,tt6751668,4.,Parasite,2h 12m,2019,8.5,903939
4,tt1825683,5.,Black Panther,2h 14m,2018,7.3,820837
...,...,...,...,...,...,...,...
495,tt9072352,496.,Relic,1h 29m,2020,6.0,29282
496,tt1006569,497.,Antebellum,1h 45m,2020,5.8,29199
497,tt8652728,498.,Waves,2h 15m,2019,7.5,29103
498,tt7748244,499.,Mortal World,1h 47m,2018,7.6,29052


***
#  To export the collected movies to IMDb_TopVoted.csv file.


In [5]:
df_movies.to_csv('IMDb_TopVoted_Group9.csv', index = False)

# Importing the given dataset "Movies.csv" to Pandas DataFrame called df1

In [6]:
# Importing the csv file to df1 and print the df1.

'''  Your code here ...    '''
df1 = pd.read_csv("Movies.csv", encoding= "ISO-8859-1")


# Import the scraped data from the IMDb_TopVoted.csv file to Pandas DataFrame called df2

In [7]:
# You need to import the collected dataset "IMDb_TopVoted.csv" and print df2.
# To handle Latin characters that the csv file may contain
# without issue, use encoding= "ISO-8859-1" with pd.read_csv()
# Example: df1 = pd.read_csv('thefilename.csv', encoding= "ISO-8859-1") 
# Using encoding= "ISO-8859-1" will avoid UnicodeDecodeError exceptions.

'''  Your code here ...    '''
df2 = pd.read_csv("IMDb_TopVoted_Group9.csv", encoding= "ISO-8859-1")
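
One caveat worth checking: ISO-8859-1 can decode any byte sequence, so reading with it never raises an error, but accented characters only come back correctly when the file was written in that same encoding. The sketch below illustrates this with an in-memory CSV; the title used is an arbitrary example, not taken from the dataset.

```python
import io
import pandas as pd

# Illustrative one-row frame with an accented character.
df_demo = pd.DataFrame({"title": ["Amélie"]})

# Encode the CSV text as UTF-8 bytes, as a file written with the default encoding would be.
raw = df_demo.to_csv(index=False).encode("utf-8")

# Reading with the matching encoding returns the original text.
same = pd.read_csv(io.BytesIO(raw), encoding="utf-8")

# Reading with a different encoding succeeds silently but garbles the accent.
other = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1")

print(same.loc[0, "title"], other.loc[0, "title"])
```

If the scraped titles contain non-ASCII characters, it may be safer to pass the same `encoding` value to both `to_csv` and `read_csv`.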


# Data cleansing and transformation for df2.

In [8]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   movie_id  500 non-null    object 
 1   rank      500 non-null    float64
 2   title     500 non-null    object 
 3   runtime   500 non-null    object 
 4   year      500 non-null    int64  
 5   rating    500 non-null    float64
 6   votes     500 non-null    object 
dtypes: float64(2), int64(1), object(4)
memory usage: 27.5+ KB


In [9]:
# Cleaning and transforming df2
 # rank, year, and votes should have a numeric integer data type.
 # The runtime column should be renamed to runtimeMinutes and its values should be in minutes, 
 # for example: 2h 2m should be 122
    
'''  Your code here ...    '''
df2['rank'] = df2['rank'].astype('int')

df2['year'] = df2['year'].astype('int')

df2['votes'] = df2['votes'].replace(',','',regex=True)
df2['votes'] = df2['votes'].astype('int')

import re

def runtime_to_min(runtime):
    # Convert a runtime string such as '2h 2m', '2h' or '45m' to total minutes.
    # Anything that does not match this pattern is mapped to 0.
    match = re.fullmatch(r'\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*', str(runtime))
    if not match:
        return 0
    hours, minutes = match.groups()
    return int(hours or 0) * 60 + int(minutes or 0)

# Apply the conversion to each value instead of assigning through df['runtime'][i],
# which is chained assignment and may not update the DataFrame.
df2['runtime'] = df2['runtime'].apply(runtime_to_min).astype('int')
df2.rename(columns={"runtime":"runtimeMinutes"},inplace=True)
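
For comparison, the conversions above can also be written with vectorized string methods. This is a sketch on a small made-up frame, not a replacement for the required solution; the `sample` data and the regular expression are assumptions about the value formats ('2h 2m', '3h', '45m', or empty).

```python
import pandas as pd

# Hypothetical values covering the runtime formats seen in the scraped data.
sample = pd.DataFrame({
    "runtime": ["2h 2m", "3h", "45m", ""],
    "votes": ["1,422,218", "903,939", "29,052", "7"],
})

# Capture the optional hour and minute numbers into two columns (0 and 1).
parts = sample["runtime"].str.extract(r"(?:(\d+)h)?\s*(?:(\d+)m)?")
hours = pd.to_numeric(parts[0]).fillna(0)
minutes = pd.to_numeric(parts[1]).fillna(0)
sample["runtimeMinutes"] = (hours * 60 + minutes).astype(int)

# Remove the thousands separators before converting votes to integers.
sample["votes"] = sample["votes"].str.replace(",", "", regex=False).astype(int)

print(sample[["runtimeMinutes", "votes"]])
```

Missing hour or minute parts come back as NaN from `str.extract`, which is why they are filled with 0 before the arithmetic.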

# Enrich the given dataset (df1) by merging it with the scraped data (df2).

In [10]:
# Merge the two dataframes into one dataframe called df.
'''  Your code here ...    '''
df = pd.merge(df1, df2)
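
Called without `on=`, `pd.merge` joins on every column name the two frames share, so an unexpected shared column can silently drop rows. A minimal sketch with an explicit key is shown below; the two small frames and the choice of `movie_id` as the only key are assumptions, since the columns of Movies.csv are not shown here.

```python
import pandas as pd

# Hypothetical stand-ins for df1 (given dataset) and df2 (scraped data).
left = pd.DataFrame({"movie_id": ["tt1", "tt2", "tt3"],
                     "genres": ["Drama", "Action", "Comedy"]})
right = pd.DataFrame({"movie_id": ["tt1", "tt2"],
                      "rank": [1, 2]})

# Join on the identifier only; validate raises an error if a key is duplicated on either side.
merged = pd.merge(left, right, on="movie_id", how="inner", validate="one_to_one")
print(merged)
```

An inner join keeps only movies present in both frames, which is also the default behavior of `pd.merge`.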


# Data cleansing and transformation for df.

In [11]:
#cleaning ratingcategory column
df['ratingCategory'] = df['ratingCategory'].replace(['Approved', 'Unrated'],['G', 'Not Rated'])
df['ratingCategory'] = df['ratingCategory'].fillna('Not Rated')

In [12]:
#cleaning genres column
df['genres']=df['genres'].replace(r'\N','-')

# Rearrange the dataset fields to be listed in the following order: 
movie_id , rank , title ,  originalTitle ,  description ,
          year ,  votes , rating ,  runtimeMinutes ,  ratingCategory ,  genres

In [13]:
# Rearrange the dataset fields.
'''  Your code here ...    '''
df = df[['movie_id' , 'rank' , 'title' , 'originalTitle' , 'description' , 'year' , 'votes' , 'rating' , 'runtimeMinutes' , 'ratingCategory' , 'genres']]


# Export the enriched dataset to a CSV file:

In [14]:
# Use the following naming convention: 
#  Project_3_PartA_Lastname.csv
'''  Your code here ...    '''
df.to_csv('Project_3_PartA_Group9.csv', index =False, encoding= "cp1252")
