Cleaning data for our Oscar Analysis
This requires the following dependencies:
- Selenium Geckodriver
- pipenv (optional, but required if you want to follow the installation steps directly)
From here, you can install the Python dependencies:
Scraping the Oscar data
The code we used to scrape the Oscar data is
To analyze Best Picture results, run:
python diversity_analysis/oscar_results.py "Best Picture" -o data/best_picture.csv
We initially looked at the directing and acting categories as well, before deciding to focus specifically on Best Picture nominations.
Formatting the IMDb data
IMDb posts data on its movies online. We specifically focused on its
title.basics.tsv.gz file, which contains basic information about movies, including the genre of the movies.
After downloading and uncompressing this data, we ran the following command to keep only the rows whose titleType matches "movie" and write them out as a CSV file:
xsv input data.tsv --no-quoting | xsv search "movie" -s titleType > imdb_movie_data.csv
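If xsv is not available, roughly the same filtering can be sketched with Python's standard csv module. This is only an illustration, not part of the original pipeline: the function name filter_movies and the SAMPLE rows are made up, and SAMPLE uses a trimmed subset of the IMDb columns. One deliberate difference: xsv search matches its pattern anywhere in the field, while this sketch compares titleType for exact equality.

```python
import csv
import io

# Made-up rows with a trimmed subset of the IMDb columns, for illustration only.
SAMPLE = (
    "tconst\ttitleType\tprimaryTitle\tgenres\n"
    "tt0000001\tshort\tExample Short\tDocumentary,Short\n"
    "tt0000002\tmovie\tExample \"Quoted\" Film\tDrama,Romance\n"
    "tt0000003\ttvMovie\tExample TV Film\tComedy\n"
)


def filter_movies(tsv_lines, out_file):
    """Copy rows whose titleType is exactly 'movie' from TSV input to CSV output.

    Returns the number of rows written (excluding the header).
    """
    # QUOTE_NONE plays the role of xsv's --no-quoting: double quotes inside
    # a title are read as ordinary characters rather than as field quoting.
    reader = csv.DictReader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    writer = csv.DictWriter(
        out_file,
        fieldnames=reader.fieldnames,
        extrasaction="ignore",  # tolerate rows with stray extra fields
        lineterminator="\n",
    )
    writer.writeheader()
    kept = 0
    for row in reader:
        if row["titleType"] == "movie":
            writer.writerow(row)  # the writer quotes commas and quotes as needed
            kept += 1
    return kept


# In-memory demonstration; real use would pass open file objects instead.
demo_out = io.StringIO()
demo_kept = filter_movies(io.StringIO(SAMPLE), demo_out)
```

For the real file, the two arguments would be file objects opened with newline="" and encoding="utf-8", for example data.tsv for reading and imdb_movie_data.csv for writing.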
From here, we joined the Oscar data to the IMDb data to get the genres for each movie. The script we used to do this is
(Note that this requires obtaining an API key for the Open Movie Database and storing that key with the environment variable
This process involved some manual work to handle false positives and false negatives, so the script is fairly clunky. However, I've tried to make it at least somewhat replicable.
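To give a sense of what the join involves, here is a minimal sketch that matches on title and year. It is not the script referred to above and does not call the Open Movie Database. The Oscar column names "film" and "year" are assumptions (the real CSV may use different headers, and "year" is assumed to be the release year rather than the ceremony year); primaryTitle, startYear, and genres are columns from the IMDb file.

```python
def normalize(title):
    """Lowercase and collapse whitespace so minor formatting differences still match."""
    return " ".join(title.lower().split())


def join_genres(oscar_rows, imdb_rows):
    """Attach IMDb genres to Oscar rows by matching on (title, year).

    Returns (matched, unmatched). Rows with no candidate or with several
    candidates land in unmatched, to be resolved by hand.
    """
    index = {}
    for row in imdb_rows:
        key = (normalize(row["primaryTitle"]), row["startYear"])
        # Keep every candidate: several IMDb titles can share a name and
        # year, which is one way false positives arise.
        index.setdefault(key, []).append(row)

    matched, unmatched = [], []
    for row in oscar_rows:
        # "film" and "year" are hypothetical Oscar column names.
        candidates = index.get((normalize(row["film"]), row["year"]), [])
        if len(candidates) == 1:
            matched.append({**row, "genres": candidates[0]["genres"]})
        else:
            unmatched.append(row)  # missing or ambiguous: needs manual review
    return matched, unmatched
```

Returning the unmatched rows separately, instead of guessing, keeps the manual step explicit: each row in that list is either a title that IMDb spells differently or one with more than one plausible match.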
From here, our actual analysis is in