# Web scraping project
This project scrapes data about the top 50 films released from 1950 to 2012, ranked by number of votes on IMDb (source: https://www.imdb.com/search/title/?sort=num_votes,desc&start=1&title_type=feature&year=1950,2012). Using this information, the project investigates the following three questions:
1. Which director has the highest number of movies in the top 50 rated movies?
2. Which genre has the highest rating among the top 50 rated movies?
3. What is the gross expenditure of the lowest-rated movies as compared with the highest-rated ones?



In [1]:
# Import libraries
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd

# Open the output CSV; newline='' stops the csv module writing blank rows on Windows
file = open('moviescrape1.csv', 'w', newline='')
writer = csv.writer(file)
writer.writerow(['Movie title', 'Director', 'Ratings', 'Genre', 'Gross expenditure'])
# Download the results page and parse the HTML
page_to_scrape = requests.get("https://www.imdb.com/search/title/?sort=num_votes,desc&start=1&title_type=feature&year=1950,2012")
soup = BeautifulSoup(page_to_scrape.text, 'html.parser')

In [2]:
# The movie title and director name are both hyperlinks, so take the text of the first <a> in each tag
# Title
movie = soup.find_all('h3', class_='lister-item-header')
movienames = []
for link in movie:
    movienames.append(link.a.text)
print(movienames)
# Director name (only the first linked name in each paragraph is kept)
director = soup.find_all('p', class_='')
directornames = []
for link in director:
    directornames.append(link.a.text)
print(directornames)

['The Shawshank Redemption', 'The Dark Knight', 'Inception', 'Fight Club', 'Forrest Gump', 'Pulp Fiction', 'The Matrix', 'The Lord of the Rings: The Fellowship of the Ring', 'The Godfather', 'The Lord of the Rings: The Return of the King', 'The Dark Knight Rises', 'The Lord of the Rings: The Two Towers', 'Seven', 'Django Unchained', 'Gladiator', 'Batman Begins', 'Inglourious Basterds', 'The Silence of the Lambs', 'Saving Private Ryan', 'Avengers Assemble', 'Star Wars: Episode IV - A New Hope', "Schindler's List", 'The Prestige', 'Shutter Island', 'The Departed', 'Avatar', 'The Green Mile', 'Star Wars: Episode V - The Empire Strikes Back', 'The Godfather Part II', 'Memento', 'Back to the Future', 'Titanic', 'GoodFellas', 'Leon', 'American Beauty', 'Pirates of the Caribbean: The Curse of the Black Pearl', 'WALL·E', 'American History X', 'Kill Bill: Vol. 1', 'V for Vendetta', 'Terminator 2: Judgment Day', 'The Truman Show', 'The Usual Suspects', 'The Lion King', 'Iron Man', 'Star Wars: Ep

In [4]:
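# Rating
ratinginfo = soup.find_all('div', class_='inline-block ratings-imdb-rating')
ratings = []
for value in ratinginfo:
    rating = value.text.strip()
    ratings.append(rating)
print(ratings)
# Genre
genreinfo = soup.find_all('span', class_='genre')
genre = []
for gen in genreinfo:
    genrestrip = gen.text.strip()
    genre.append(genrestrip)
print(genre)
# Gross expenditure
# NOTE: this takes the second <span> in the paragraph. The values it returns
# (e.g. 2,766,904) may be the vote count rather than a monetary figure, so
# check the page markup before interpreting this column.
spendinfo = soup.find_all('p', class_='sort-num_votes-visible')
cost = []
for value in spendinfo:
    valuestrip = value.find_all('span')
    # Paragraphs with fewer than two spans are skipped, which shortens this list
    if len(valuestrip) > 1:
        gross_expenditure = valuestrip[1].text.strip()
        cost.append(gross_expenditure)
print(cost)
# Data scrape complete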

['9.3', '9.0', '8.8', '8.8', '8.8', '8.9', '8.7', '8.8', '9.2', '9.0', '8.4', '8.8', '8.6', '8.4', '8.5', '8.2', '8.3', '8.6', '8.6', '8.0', '8.6', '9.0', '8.5', '8.2', '8.5', '7.9', '8.6', '8.7', '9.0', '8.4', '8.5', '7.9', '8.7', '8.5', '8.3', '8.1', '8.4', '8.5', '8.2', '8.2', '8.6', '8.2', '8.5', '8.5', '7.9', '8.3', '8.3', '8.2', '8.3', '8.4']
['Drama', 'Action, Crime, Drama', 'Action, Adventure, Sci-Fi', 'Drama', 'Drama, Romance', 'Crime, Drama', 'Action, Sci-Fi', 'Action, Adventure, Drama', 'Crime, Drama', 'Action, Adventure, Drama', 'Action, Drama, Thriller', 'Action, Adventure, Drama', 'Crime, Drama, Mystery', 'Drama, Western', 'Action, Adventure, Drama', 'Action, Crime, Drama', 'Adventure, Drama, War', 'Crime, Drama, Thriller', 'Drama, War', 'Action, Sci-Fi', 'Action, Adventure, Fantasy', 'Biography, Drama, History', 'Drama, Mystery, Sci-Fi', 'Mystery, Thriller', 'Crime, Drama, Thriller', 'Action, Adventure, Fantasy', 'Crime, Drama, Fantasy', 'Action, Adventure, Fantasy', 'Cr
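The cell above picks a value out of the `sort-num_votes-visible` paragraph by its position among the `<span>` tags, which depends on the order of the spans. As a rough sketch, the values could instead be looked up by the label that precedes them. The `SAMPLE` markup and the `labelled_values` helper below are hypothetical: the markup is an assumption about how such a paragraph might be laid out, with made-up numbers, and is not copied from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with made-up values; the real page may differ.
SAMPLE = """
<p class="sort-num_votes-visible">
  <span class="text-muted">Votes:</span>
  <span name="nv">1,000</span>
  <span class="ghost">|</span>
  <span class="text-muted">Gross:</span>
  <span name="nv">$2.50M</span>
</p>
"""


def labelled_values(p_tag):
    """Map each label span (e.g. 'Votes:') to the text of the span that follows it."""
    values = {}
    for label in p_tag.find_all("span", class_="text-muted"):
        value = label.find_next_sibling("span")
        if value is not None:
            key = label.get_text(strip=True).rstrip(":")
            values[key] = value.get_text(strip=True)
    return values


sample_soup = BeautifulSoup(SAMPLE, "html.parser")
fields = labelled_values(sample_soup.find("p", class_="sort-num_votes-visible"))
print(fields)
```

Looking values up by label makes it easier to confirm which figure ends up in which column before it is written to the CSV.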

In [5]:
# Loop through the lists in parallel, using zip() to combine them
# (zip stops at the shortest list, so all five lists must be the same length)
for title, director_name, rate, genres, gross in zip(movienames, directornames, ratings, genre, cost):
    print(title + "-" + director_name + "-" + rate + "-" + genres + "-" + gross)
    # Write each item onto a new row
    writer.writerow([title, director_name, rate, genres, gross])
file.close()

The Shawshank Redemption-Frank Darabont-9.3-Drama-2,766,904
The Dark Knight-Christopher Nolan-9.0-Action, Crime, Drama-2,740,409
Inception-Christopher Nolan-8.8-Action, Adventure, Sci-Fi-2,431,827
Fight Club-David Fincher-8.8-Drama-2,203,546
Forrest Gump-Robert Zemeckis-8.8-Drama, Romance-2,152,218
Pulp Fiction-Quentin Tarantino-8.9-Crime, Drama-2,124,095
The Matrix-Lana Wachowski-8.7-Action, Sci-Fi-1,970,820
The Lord of the Rings: The Fellowship of the Ring-Peter Jackson-8.8-Action, Adventure, Drama-1,927,122
The Godfather-Francis Ford Coppola-9.2-Crime, Drama-1,925,794
The Lord of the Rings: The Return of the King-Peter Jackson-9.0-Action, Adventure, Drama-1,898,687
The Dark Knight Rises-Christopher Nolan-8.4-Action, Drama, Thriller-1,753,834
The Lord of the Rings: The Two Towers-Peter Jackson-8.8-Action, Adventure, Drama-1,713,480
Seven-David Fincher-8.6-Crime, Drama, Mystery-1,711,128
Django Unchained-Quentin Tarantino-8.4-Drama, Western-1,612,938
Gladiator-Ridley Scott-8.5-Action,
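Because `zip` stops at the shortest input, a single missing value in one of the scraped lists would silently drop rows or pair values with the wrong film. A minimal sketch of a length check that could run before writing the CSV follows. The `check_aligned` helper and the sample lists are hypothetical and use made-up data.

```python
def check_aligned(columns):
    """Return the shared length of the lists, or raise ValueError if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)


# Made-up data: the second list is one item short.
titles = ["Film A", "Film B", "Film C"]
costs = ["1,000", "2,000"]

# zip pairs only the first two titles and drops "Film C" without any warning.
rows = list(zip(titles, costs))

try:
    check_aligned({"titles": titles, "costs": costs})
    aligned = True
except ValueError as error:
    aligned = False
    print(error)
```

Calling a check like this on the five scraped lists before the loop would turn a silent misalignment into a visible error.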

## Data analysis
This section will answer the following questions:
1. Which director has the highest number of movies in the top 50 rated movies?
2. Which genre has the highest rating among the top 50 rated movies?
3. What is the gross expenditure of the lowest-rated movies as compared with the highest-rated ones?

In [6]:
df1 = pd.read_csv("moviescrape1.csv", encoding='ISO-8859-1')
df1.head()
# 1. Which director has the highest number of movies in the top 50 rated movies?
# Count the rows per director and sort so the most frequent director comes first
directorcount = df1.groupby(by=["Director"], as_index=True)['Director'].count()
directorcount = directorcount.sort_values(ascending=False)
directorcount.head()

print('The director with the highest number of films in the top 50 is Christopher Nolan with a total of 6 films')

The director with the highest number of films in the top 50 is Christopher Nolan with a total of 6 films
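The answer above is typed into the `print` call by hand. As a sketch of one alternative, the name and count could be read from the data so the message cannot drift from the table. The `films` DataFrame below is a made-up stand-in for the scraped CSV, and the variable names are illustrative.

```python
import pandas as pd

# Made-up stand-in for the scraped table.
films = pd.DataFrame(
    {"Director": ["Director A", "Director B", "Director A", "Director C", "Director A"]}
)

# value_counts() counts the rows per director, most frequent first.
counts = films["Director"].value_counts()
top_director = counts.idxmax()
top_count = int(counts.max())

print(f"The director with the most films in the list is {top_director} with a total of {top_count} films")
```

If two directors tie for first place, `idxmax` reports only one of them, so inspecting `counts.head()` is still worthwhile.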


In [7]:
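# 2. Which genre has the highest rating among the top 50 rated movies?
df2 = df1.copy(deep=True)
# Keep only the first listed genre of each film, e.g. 'Action, Crime, Drama' becomes 'Action'
df2['Genre'] = df2['Genre'].str.split(',', n=1).str[0]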

genrecount=df2.groupby(by=["Genre"],as_index=True)['Ratings'].mean()
genrecount=genrecount.sort_values(ascending=False)
genrecount.head()

print('Based on a single category, the genre with the highest rating is Crime, with an average of 8.71')




Based on a single category, the genre with the highest rating is Crime, with an average of 8.71
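The result above uses only the first genre listed for each film. A possible variation, sketched here under that caveat, is to count a film under every genre it lists by splitting the string and using `DataFrame.explode`. The `films` table is made-up sample data, not the scraped results.

```python
import pandas as pd

# Made-up sample data in the same shape as the 'Genre' and 'Ratings' columns.
films = pd.DataFrame(
    {
        "Genre": ["Crime, Drama", "Drama", "Action, Crime"],
        "Ratings": [9.0, 8.0, 7.0],
    }
)

# Turn each comma-separated string into a list, then give each genre its own row.
exploded = (
    films.assign(Genre=films["Genre"].str.split(","))
    .explode("Genre")
    .reset_index(drop=True)
)
exploded["Genre"] = exploded["Genre"].str.strip()

# Mean rating per genre, highest first.
genre_mean = exploded.groupby("Genre")["Ratings"].mean().sort_values(ascending=False)
print(genre_mean)
```

With this approach a film contributes its rating to each of its genres, so the averages answer a slightly different question from the single-category result.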


In [8]:
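# 3. What is the gross expenditure of the lowest-rated movies as compared with the highest-rated ones?
# Convert the comma-formatted strings to numbers first; sorting the raw strings would compare them as text
df1['Gross expenditure'] = df1['Gross expenditure'].str.replace(',', '').astype(float)
df1 = df1.sort_values(by='Gross expenditure', ascending=False)
# head() and tail() each return 5 rows, i.e. 10% of the 50 films
df_top_rate = df1.head()
df_bot_rate = df1.tail()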
print(f"The average rating of a top-rated movie (i.e. one within the top 10% of the movie list) is {round(df_top_rate['Ratings'].mean(),2)}. Here the average gross \nexpenditure is \u00A3{round(df_top_rate['Gross expenditure'].mean(),2)}.")
print(f'In comparison, the average rating of a bottom-rated movie (i.e. one within the bottom 10% of the movie list) is {round(df_bot_rate["Ratings"].mean(),2)}. Here the \naverage gross expenditure is \u00A3{round(df_bot_rate["Gross expenditure"].mean(),2)}.')


The average rating of a top-rated movie (i.e. one within the top 10% of the movie list) is 8.94. Here the average gross 
expenditure is £2458980.8.
In comparison, the average rating of a bottom-rated movie (i.e. one within the bottom 10% of the movie list) is 8.3. Here the 
average gross expenditure is £1070620.2.
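The 'Gross expenditure' values are read from the CSV as comma-formatted text, and text sorts character by character rather than by magnitude. The short sketch below shows the difference. The `parse_count` helper is hypothetical and the list mixes one value from the output above with illustrative ones.

```python
def parse_count(text):
    """Convert a comma-formatted string such as '1,070,620' to an int."""
    return int(text.replace(",", ""))


# Illustrative values; note that one has fewer digits than the others.
values = ["2,766,904", "987,654", "1,070,620"]

# Sorted as text: '9' comes after '2' and '1', so the smallest number lands first.
as_text = sorted(values, reverse=True)

# Sorted by numeric value: the order matches the actual magnitudes.
as_numbers = sorted(values, key=parse_count, reverse=True)

print(as_text)
print(as_numbers)
```

When every value has the same number of digits the two orders happen to agree, which can hide the problem until a shorter value appears.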


<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Movie title        5 non-null      object 
 1   Director           5 non-null      object 
 2   Ratings            5 non-null      float64
 3   Genre              5 non-null      object 
 4   Gross expenditure  5 non-null      float64
dtypes: float64(2), object(3)
memory usage: 240.0+ bytes
