# An Analysis of Movie Performance

In this part, you’ll gather data about popular movies and award winners. The goal is to build a dataset that you’ll later use to analyze what makes a movie successful and how awards and box office performance relate to one another.

In [None]:
import requests
import pandas as pd
import re

### Part 1: Data Gathering
1. Scrape Best Picture Data.  
    

In [None]:
url = "https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture"
headers = {
    "User-Agent": "MyAwardScraper"
}
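# A timeout keeps the cell from hanging if the server does not respond
response = requests.get(url, headers=headers, timeout=30)
response.status_code  # 200 means the page was fetched successfully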

In [None]:
# Install html5lib first so it is available when bs4 is imported
!pip install html5lib
from bs4 import BeautifulSoup

In [None]:
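# Create the BeautifulSoup object, naming the parser explicitly to avoid a warning
soup = BeautifulSoup(response.text, "html.parser")
print(soup.prettify()[:2000])  # preview only the start of the page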

* Scrape the [Best Picture Wikipedia page](https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture).  
    

In [None]:
# Locate the first table that carries the "wikitable" class
tag = soup.find("table", {"class": "wikitable"})
tag

* Extract for each year:  
        * Year  
        * Film Title  
        * Winner (Yes/No)  
    

In [None]:
from io import StringIO

pd.set_option('display.max_colwidth', None)

# Parse every matching table on the page into its own DataFrame
all_tables = []
for table in soup.find_all('table', attrs={'class': 'wikitable sortable sticky-header'}):
    df = pd.read_html(StringIO(str(table)))[0]
    all_tables.append(df)

# Concatenate once, after the loop, rather than on every iteration
academy_awards_full = pd.concat(all_tables, ignore_index=True)

# Keep the first four-digit year found in each cell
academy_awards_full['year'] = (
    academy_awards_full['Year of Film Release']
    .astype(str)
    .str.extract(r"(\d{4})", expand=False)
)
academy_awards_full = academy_awards_full.drop(columns="Year of Film Release")

# Assign the converted column back; nullable Int64 tolerates rows with no year
academy_awards_full['year'] = pd.to_numeric(academy_awards_full['year']).astype('Int64')
academy_awards_full

* Data cleaning tips:  
        * Ensure that year and film title columns are clean and consistent (no footnotes, parentheses, etc.).
        * Save the results as best_picture.csv.  
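
The cleaning tips above could be sketched with a few small helpers. This is a minimal sketch, not the assignment's reference solution: the helper names `strip_footnotes`, `strip_parentheses`, and `first_year` are hypothetical, and the patterns assume footnotes appear as bracketed markers such as `[a]` or `[12]`.

```python
import re

# Assumption: footnote markers look like [a], [12], or [note 3]
FOOTNOTE = re.compile(r"\[[^\]]*\]")
# Parenthetical notes, including any whitespace directly before them
PARENTHETICAL = re.compile(r"\s*\([^)]*\)")


def strip_footnotes(value):
    """Remove bracketed footnote markers and collapse extra whitespace."""
    text = FOOTNOTE.sub("", str(value))
    return " ".join(text.split())


def strip_parentheses(value):
    """Remove parenthetical notes and collapse extra whitespace."""
    text = PARENTHETICAL.sub("", str(value))
    return " ".join(text.split())


def first_year(value):
    """Return the first four-digit number in the value as an int, or None."""
    match = re.search(r"\d{4}", str(value))
    return int(match.group()) if match else None
```

Each helper can be applied to a column with `Series.map`, after which `DataFrame.to_csv("best_picture.csv", index=False)` writes the file. Apply `strip_parentheses` with care on the title column, since some titles legitimately contain parentheses.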

In [1]:
import json

2. Gather Movie Data via TMDB API  
   

a. Set up the API    
    * Create a free [TMDB account](https://developer.themoviedb.org/docs/getting-started)  
    * Generate an API key and review their documentation, especially:  
        * /discover/movie  
        * /movie/{movie_id}  
        * /search/movie  
   

In [2]:
with open('keys_api.json') as fi:
    credentials = json.load(fi)

 b. Collect top movies (2015-2024)  
    For each year from 2015 to 2024:  
        * Query TMDB for the top 100 movies (by vote count).  
        * For each movie, gather:  
            * Title  
            * Release Year  
            * Genre(s)  
            * Vote Average  
            * Vote Count  
            * Budget  
            * Revenue  
            * TMDB ID  
        * Store all results in a single DataFrame and export to movies_2015_2024.csv.
        * Hint: TMDB rate limits are generous for free accounts, but you should still pause between requests (e.g., time.sleep(0.25)). 
        * Some Oscar films may not appear in the top 100 by vote count. For any that are missing, use the /search/movie endpoint to find and add them.  
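
The collection steps above can be sketched as follows. This is a rough outline under stated assumptions rather than a definitive implementation: the helper names (`get_json`, `top_movie_ids`, `movie_row`, `search_movie_id`) are hypothetical, authentication is assumed to use the v3 `api_key` query parameter, and `session` is assumed to be a `requests.Session()`.

```python
import time

API_ROOT = "https://api.themoviedb.org/3"
PAGE_SIZE = 20  # /discover/movie returns 20 results per page


def get_json(session, path, api_key, **params):
    """GET a TMDB endpoint and return the decoded JSON body."""
    response = session.get(
        f"{API_ROOT}{path}", params={"api_key": api_key, **params}, timeout=30
    )
    response.raise_for_status()
    return response.json()


def top_movie_ids(session, api_key, year, n=100, pause=0.25):
    """Return the TMDB IDs of the n most-voted movies released in `year`."""
    ids = []
    pages = -(-n // PAGE_SIZE)  # ceiling division
    for page in range(1, pages + 1):
        data = get_json(
            session, "/discover/movie", api_key,
            primary_release_year=year, sort_by="vote_count.desc", page=page,
        )
        ids.extend(movie["id"] for movie in data["results"])
        time.sleep(pause)  # pause between requests, as the hint suggests
    return ids[:n]


def movie_row(details):
    """Flatten a /movie/{movie_id} response into one flat record."""
    release_date = details.get("release_date") or ""
    return {
        "tmdb_id": details["id"],
        "title": details["title"],
        "release_year": int(release_date[:4]) if release_date else None,
        "genres": ", ".join(g["name"] for g in details.get("genres", [])),
        "vote_average": details.get("vote_average"),
        "vote_count": details.get("vote_count"),
        "budget": details.get("budget"),
        "revenue": details.get("revenue"),
    }


def search_movie_id(session, api_key, title, year=None):
    """Return the ID of the first /search/movie hit for a title, or None."""
    params = {"query": title}
    if year is not None:
        params["primary_release_year"] = year
    results = get_json(session, "/search/movie", api_key, **params)["results"]
    return results[0]["id"] if results else None
```

For each ID returned by `top_movie_ids`, calling `get_json(session, f"/movie/{movie_id}", api_key)` and passing the result to `movie_row` yields one row; sleeping between those detail requests as well keeps the request rate low. Collecting the rows for 2015 through 2024 into `pd.DataFrame(rows)` and calling `.to_csv("movies_2015_2024.csv", index=False)` produces the export. Budget and revenue come from the detail endpoint, which is why the discover results alone are not enough.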