# An Analysis of Movie Performance

In this part, you’ll gather data about popular movies and award winners. The goal is to build a dataset that you’ll later use to analyze what makes a movie successful and how awards and box office performance relate to one another.

In [1]:
import pandas as pd
import requests
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import json
import time
from pathlib import Path 

### Part 1: Data Gathering
1. Scrape Best Picture Data.  
    * Scrape the [Best Picture Wikipedia page](https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture).  
    * Extract for each movie:  
        * Year  
        * Film Title  
        * Winner (Yes/No)  
    * Data cleaning tips:  
        * Ensure that the year and film title columns are clean and consistent (no footnotes, parentheses, etc.).
    * Save the results as best_picture.csv.  
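
One way to approach the cleaning tip above is a small text helper applied to each scraped cell. The following is only a minimal sketch and is not part of the original notebook: `clean_cell` is a hypothetical name, and the patterns assume that footnotes show up as bracketed markers such as `[a]` and that unwanted notes appear as a trailing parenthetical.

```python
import re

def clean_cell(text: str) -> str:
    """Strip bracketed footnote markers and a trailing parenthetical note."""
    text = re.sub(r"\[[^\]]*\]", "", text)        # drop markers like [a] or [12]
    text = re.sub(r"\s*\([^)]*\)\s*$", "", text)  # drop a trailing "(...)" note
    return " ".join(text.split())                 # collapse leftover whitespace

print(clean_cell("Wings [a]"))
print(clean_cell("1927/28 (1st)"))
```

Stripping trailing parentheses is a judgment call: a title such as *Birdman or (The Unexpected Virtue of Ignorance)* legitimately ends in one, so it is worth inspecting the results before applying the second substitution to film titles.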

In [2]:
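URL = 'https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture'

# Identify the client with a User-Agent header
headers = {
    "User-Agent": "Movie_Agent"
}

resp = requests.get(URL, headers=headers, timeout=30)
resp.raise_for_status()  # Stop early on an HTTP error instead of parsing an error page

# Name the parser explicitly so results don't depend on which parsers are installed
soup = BeautifulSoup(resp.text, 'html.parser')
#soup.prettify()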

In [4]:
# Winner rows are the ones highlighted with a yellow background
winners = soup.find_all('tr', attrs={'style' : 'background:#FAEB86'})
winning_titles = [winner.td.text.strip() for winner in winners]
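
The lookup above matches the `style` attribute as an exact string, so a change in spacing or letter case in the page markup would cause winners to be missed. As a hedged alternative, the sketch below tests for the highlight color more loosely. `is_winner_style` is a hypothetical helper, and it assumes the winner highlight color remains `#FAEB86`.

```python
from typing import Optional

WINNER_COLOR = "#faeb86"  # assumed highlight color, taken from the selector above

def is_winner_style(style: Optional[str]) -> bool:
    """Return True when an inline style string mentions the winner color."""
    if not style:
        return False
    # Ignore case and spacing so 'background: #FAEB86;' also matches
    return WINNER_COLOR in style.replace(" ", "").lower()

print(is_winner_style("background:#FAEB86"))
print(is_winner_style(None))
```

With Beautiful Soup, the value passed in would be `row.get('style')`, which returns `None` for rows that have no `style` attribute.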

In [5]:
# Get all wikitables
all_tables = soup.find_all('table', attrs={'class' : 'wikitable'})

# Find just the tables with movie data by filtering on the 'Year of Film Release' column header
# This excludes the 2 wikitables at the very bottom of the page ('Age superlatives' and 'Production companies and distributors with multiple nominations and wins')
movie_tables = [table for table in all_tables if 'Year of Film Release' in table.find('tr').text]

In [6]:
# Create empty list to store dictionaries of movie data, ex. [{'Title': 'Wings', 'Year': '1928', 'Winner': 'Yes'}]
movie_info = []

# Iterate through all tables to extract movie data
for table in movie_tables:
    
    # Find all 'tr' tags ('table row'), skipping the first one since it just contains column headers
    rows = table.find_all('tr')[1:]
    
    # Iterate through all rows of the table to find year, movie, and winner status
    for row in rows:
        
        # If the row contains a 'th' ('table header') tag, extract the year and store it in a variable
        if row.th is not None:
            awards_year = row.th.a.text

            # If the year spans two calendar years (e.g. '1927/28'), keep only the later one ('1928')
            if '/' in awards_year:
                awards_year = awards_year[:2] + awards_year[-2:]
        
        # Get the movie title in this row, if there is one
        title = row.td.text.strip() if row.td is not None else ''

        # Get the winner status by seeing if the background is yellow
        winner = 'Yes' if row.get('style') == 'background:#FAEB86' else 'No'
        
        # If this row has a movie title, append the movie info to the movie_info list
        if title != '':
            movie_info_dict = {'Title': title, 'Year': awards_year, 'Winner': winner}
            movie_info.append(movie_info_dict)
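
The slash handling in the loop above can also be written as a standalone function, which makes it easier to test on its own. This is a sketch rather than the notebook's own code: `normalize_year` is a hypothetical name, and it assumes split labels look like `1927/28`, with the second year possibly abbreviated.

```python
def normalize_year(label: str) -> str:
    """Return the later year of a split label such as '1927/28'."""
    if "/" not in label:
        return label
    first, second = label.split("/", 1)
    # Borrow the leading digits from the first year when the second is abbreviated
    if len(second) < len(first):
        second = first[: len(first) - len(second)] + second
    return second

print(normalize_year("1927/28"))
print(normalize_year("2023"))
```

Like the slicing in the loop, this does not handle an abbreviated label that crosses a century boundary.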

In [7]:
# Convert movie_info to a pandas DataFrame
movie_info_df = pd.DataFrame(movie_info)
movie_info_df.tail(20)

| | Title | Year | Winner |
|---|---|---|---|
| 591 | Oppenheimer | 2023 | Yes |
| 592 | American Fiction | 2023 | No |
| 593 | Anatomy of a Fall | 2023 | No |
| 594 | Barbie | 2023 | No |
| 595 | The Holdovers | 2023 | No |
| 596 | Killers of the Flower Moon | 2023 | No |
| 597 | Maestro | 2023 | No |
| 598 | Past Lives | 2023 | No |
| 599 | Poor Things | 2023 | No |
| 600 | The Zone of Interest | 2023 | No |
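
Beyond eyeballing the last rows, a quick sanity check is to confirm that every year has exactly one row marked as the winner. The sketch below runs on a few toy records in the same shape as `movie_info`. `years_without_single_winner` is a hypothetical helper, and the one-winner-per-year rule is an assumption to verify against the page rather than a guarantee.

```python
from collections import Counter

# Toy records in the same shape as movie_info
records = [
    {"Title": "Oppenheimer", "Year": "2023", "Winner": "Yes"},
    {"Title": "Barbie", "Year": "2023", "Winner": "No"},
    {"Title": "Past Lives", "Year": "2023", "Winner": "No"},
]

def years_without_single_winner(rows):
    """Return the years whose number of 'Yes' rows is not exactly one."""
    years = {row["Year"] for row in rows}
    wins = Counter(row["Year"] for row in rows if row["Winner"] == "Yes")
    return sorted(year for year in years if wins[year] != 1)

print(years_without_single_winner(records))  # → []
```

Any year that shows up in the result points to a row where the year carry-over or the winner detection needs a closer look.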


In [8]:
# Write the Best Picture DataFrame to a CSV file in the data folder
filepath = Path('../data/best_picture.csv')
filepath.parent.mkdir(parents=True, exist_ok=True)
movie_info_df.to_csv(filepath, index=False)
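
To confirm that the file reloads as expected, it can help to read it back once after writing. The sketch below does a round trip in a temporary directory with two toy rows standing in for `movie_info_df`. The temporary path and the toy rows are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy rows standing in for movie_info_df
df = pd.DataFrame(
    [
        {"Title": "Oppenheimer", "Year": "2023", "Winner": "Yes"},
        {"Title": "Barbie", "Year": "2023", "Winner": "No"},
    ]
)

with tempfile.TemporaryDirectory() as tmp:
    filepath = Path(tmp) / "data" / "best_picture.csv"
    filepath.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(filepath, index=False)

    # dtype=str keeps Year as text; by default read_csv would parse it as an integer
    reloaded = pd.read_csv(filepath, dtype=str)

print(list(reloaded.columns))
print(len(reloaded))
```

Whether `Year` should stay a string or become an integer depends on how the later analysis uses it. The point of the check is to make that choice explicit instead of relying on the default type inference.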