# Analysis of Oscar Award-Winning Movies

We aim to analyze American films that have received recognition at the Oscars and the Golden Globes. We would like to explore what characterizes a critically acclaimed or Oscar-winning film by analyzing a winning film’s directing, ensemble, and audience reception. In response to the popular opinion that award shows often nominate films that are not box-office hits (but are instead lauded as high culture), we plan to analyze the correlation between films rated highly by the public and award-winning movies. We are interested in answering questions such as: Is there a difference between films that win Oscars and films that win Golden Globes? Do award-winning films favor a certain demographic range in casting? Which genres appear most often among award nominations? Is there a measurable difference between female and male representation in acclaimed movies, and if so, do the roles conform to certain stereotypes? Do award-winning movies focus on only a narrow set of themes? Through this study, we hope to better understand how art is judged and how award nominations and wins affect film viewership.

For data sources, we plan to gather basic information about all the movies we are studying from IMDb, including cast, genre, release date, directors, and writers. We will combine this data with critics’ and audience reviews and ratings from Rotten Tomatoes. Together, these give us a holistic picture of each movie’s basic characteristics, as well as its reception by the general public and at awards shows. We also plan to analyze the scripts of these movies, as found on IMDb, for more qualitative and subtle characteristics, such as the number and content of each character’s lines, broken down by the character’s key demographic information. One quality we will record for each movie is whether it passes the Bechdel Test, a popular metric for the significance of female characters in a film. Collecting this data will require using IMDb’s API, as well as web scraping and preprocessing the text of film script files. Towards the end of the project, we hope to employ machine learning to add a predictive element to our analyses, extrapolating from our findings to predict the strength of a particular film’s reception.

## Data Collection

See **movie_analysis.py**.
The universal IMDb movie id was used to identify films from varying sources. Example: Contagion (2011), IMDb id: tt1598778, IMDb url: www.imdb.com/title/tt1598778/
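
Because every source is keyed on the IMDb id, it can help to normalize ids before joining data sets. The snippet below is a minimal sketch, not code taken from movie_analysis.py: the helper name `extract_imdb_id` is hypothetical, and the pattern assumes title ids look like `tt` followed by at least seven digits, as in the Contagion example.

```python
import re

# Assumption: title ids are "tt" followed by 7 or more digits (e.g. tt1598778).
IMDB_ID_PATTERN = re.compile(r"tt\d{7,}")


def extract_imdb_id(text):
    """Return the first IMDb title id found in text, or None if there is none."""
    match = IMDB_ID_PATTERN.search(text)
    return match.group(0) if match else None


contagion_id = extract_imdb_id("www.imdb.com/title/tt1598778/")
print(contagion_id)
```

The same helper works on a bare id, so it can be applied to an id column regardless of whether a source stores ids or full URLs.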

Sources: 
1. **Pre-existing data sets from IMDb (e.g., BestPictureAcademyAwards.csv, Top1000Actors.csv)**
2. **The Movie Database API (information collected from IMDb)** <p>https://developers.themoviedb.org/3/getting-started/introduction </p>
<p>Sample API calls: </p>
3. **See BechdelTest.py** 
 <p> Webscraping for Bechdel Rankings from: https://bechdeltest.com/. </p>
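
The sample API calls for source 2 are not listed here. As a hedged illustration only, the sketch below builds a request URL for The Movie Database's v3 `/find` endpoint, which looks a title up by an external id such as an IMDb id. It is not necessarily the call made in movie_analysis.py; the function name `build_find_url` and the `YOUR_API_KEY` placeholder are assumptions for the example.

```python
from urllib.parse import urlencode

TMDB_BASE_URL = "https://api.themoviedb.org/3"


def build_find_url(imdb_id, api_key):
    """Build a TMDb v3 /find URL that looks a title up by its IMDb id."""
    query = urlencode({"api_key": api_key, "external_source": "imdb_id"})
    return f"{TMDB_BASE_URL}/find/{imdb_id}?{query}"


# "YOUR_API_KEY" is a placeholder; a real key is needed to send the request.
url = build_find_url("tt1598778", "YOUR_API_KEY")
print(url)
```

Only the URL is constructed here, so the snippet runs without network access; sending it with an HTTP client and a valid key is left to the caller.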


## Data Format

**Movie DataFrame** 
<p> 


| Fields       | Description                                |
|--------------|--------------------------------------------|
| ID           | unique IMDb id                             |
| Title        | Movie title                                |
| Cast         | List of tuples (actor name, rank in movie) |
| Budget       | Movie budget                               |
| Keywords     | List                                       |
| Bechdel Pass | T/F                                        |
| IMDb Rating  |                                            |
| Runtime      |                                            |
| Year         | Year of Release                            |
| Genres       |                                            |
| Num Votes    |                                            |
| Release Date | Full release date                          |
| Directors    |                                            |
| Award Year   | Year of award nom. (1 yr after release)    |
| Winner       | 1 for winner, 0 for nomination             |

</p>

**Actor DataFrame** 

| Fields     | Description                                  |
|------------|----------------------------------------------|
| Actor name |                                              |
| Gender     | 1 for female, 2 for male, 0 for undocumented |


## Descriptive Statistics

Running the following cell will construct all dataframes required. 

In [None]:
import pandas as pd
from movie_analysis import MovieAnalyzer

# build the movie and actor dataframes
result = MovieAnalyzer().make_dataframes()
movies = result[0]
actors = result[1]

# print the first rows of each dataframe
print(movies.head())
print(actors.head())

## Data Analysis, Visualizations, and Insights

### Some Preliminary Graphs... 

#### Analyzing Gender Over Time:

In [49]:
import numpy as np
import matplotlib.pyplot as plt

# counts the female cast members in one movie's cast list
# each cast entry is a tuple whose first element is the actor's name
def count_females(cast_list):
    female_count = 0
    for item in cast_list:
        act_ser = actors[actors['Name'] == item[0]]
        # skip actors missing from the actor dataframe; Gender == 1 means female
        if not act_ser.empty and act_ser['Gender'].iloc[0] == 1:
            female_count += 1
    return female_count

# get number of female cast members per movie
movies['no_of_females'] = movies['Cast'].apply(count_females)
print(movies)

                                  Title  \
ID                                        
tt0031210                  Dark Victory   
tt0031381            Gone with the Wind   
tt0031385            Goodbye, Mr. Chips   
tt0031593                   Love Affair   
tt0031679  Mr. Smith Goes to Washington   
...                                 ...   
tt6155172                          Roma   
tt6266538                          Vice   
tt6294822                      The Post   
tt6966692                    Green Book   
tt7349662                BlacKkKlansman   

                                                        Cast    Budget  \
ID                                                                       
tt0031210  [(Bette Davis, 1), (George Brent, 2), (Humphre...         0   
tt0031381  [(Vivien Leigh, 10), (Clark Gable, 11), (Olivi...   4000000   
tt0031385  [(Robert Donat, 1), (Greer Garson, 2), (Terry ...         0   
tt0031593  [(Irene Dunne, 1), (Charles Boyer, 2), (Maria ...         0
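
To look at these per-film counts over time, one option is to average them by award year. The following is a minimal sketch under stated assumptions: the column names `Award Year` and `no_of_females` follow the tables and code above, while the helper name `females_by_award_year` and the toy rows are illustrative only and are not values from the real data set.

```python
import pandas as pd


def females_by_award_year(movies_df):
    """Mean number of female cast members per award year, sorted by year."""
    return movies_df.groupby("Award Year")["no_of_females"].mean().sort_index()


# Toy rows for illustration only; these are not values from the real dataset.
toy = pd.DataFrame({
    "Award Year": [1940, 1940, 2019],
    "no_of_females": [2, 4, 5],
})
trend = females_by_award_year(toy)
print(trend)
```

With the real data, `females_by_award_year(movies).plot(kind='line')` would draw the trend, assuming `no_of_females` has already been computed as in the cell above.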

#### Analyzing Topics of Films In Relation to Bechdel Scores:

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

stopwords = set(STOPWORDS)  # default English stop words shipped with wordcloud


# flattens stringified keyword lists such as "['a', 'b']" into one space-separated string
def keywords_to_text(keyword_rows):
    keywords = []
    for row in keyword_rows:
        # strip the surrounding brackets and the quotes around each keyword
        keywords.extend(row[1:-1].replace("'", "").split(", "))
    return " ".join(keywords)


# creates word clouds of the keywords of films that pass the Bechdel test and of films that don't
def bechdel_keywords_cloud(df):
    # divide full movie data into two data frames, one for films that pass the Bechdel test, one for those that fail
    # films with no Bechdel scores are not included
    bechdel_pass = df[df['Bechdel Pass'] == True]   # all the films that pass the Bechdel test
    bechdel_fail = df[df['Bechdel Pass'] == False]  # all the films that fail the Bechdel test

    # generate a word cloud of keywords of all films that pass the Bechdel test
    pass_text = keywords_to_text(bechdel_pass['Keywords'])
    pass_keyword_cloud = WordCloud(stopwords=stopwords, max_font_size=40, relative_scaling=1, max_words=400, background_color="white", colormap='magma').generate(pass_text)
    plt.figure()  # open a new figure before drawing so no empty figure is left over
    plt.imshow(pass_keyword_cloud, interpolation='bilinear')
    plt.axis("off")

    # generate a word cloud of keywords of all films that fail the Bechdel test
    fail_text = keywords_to_text(bechdel_fail['Keywords'])
    fail_keyword_cloud = WordCloud(stopwords=stopwords, max_font_size=30, max_words=400, background_color="white", relative_scaling=1).generate(fail_text)
    plt.figure()
    plt.imshow(fail_keyword_cloud, interpolation='bilinear')
    plt.axis("off")

    plt.show()

bechdel_keywords_cloud(movies)
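
The keyword handling above assumes that each `Keywords` entry is a stringified Python list, for example `"['virus', 'pandemic']"`. If that assumption holds, the standard library's `ast.literal_eval` can parse the entry directly, which also copes with keywords that contain commas or apostrophes. This is a sketch of an alternative, not the project's code; the helper name `parse_keywords` and the sample strings are hypothetical.

```python
import ast


def parse_keywords(raw):
    """Parse a stringified list like "['a', 'b']" into a list of strings."""
    # Already a list (e.g. the data was never written to CSV): return a copy.
    if isinstance(raw, list):
        return list(raw)
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # malformed or empty entry
    if isinstance(parsed, (list, tuple)):
        return [str(item) for item in parsed]
    return []


# Sample string for illustration only.
print(parse_keywords("['virus', 'pandemic', \"doctor's dilemma\"]"))
```

Returning an empty list for unparseable entries keeps one bad row from stopping the whole word cloud, at the cost of silently dropping that row's keywords.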


#### Analyzing Film Genres in Relation to Bechdel Scores:

In [None]:
# creates frequency graphs for genres of films that pass the Bechdel test and those that don't
def bechdel_genre_graphs(df):
    # divides full movie data into two data frames, one for films that pass the Bechdel test, one for those that fail
    # films with no Bechdel scores are not included
    bechdel_pass = df[df['Bechdel Pass'] == True]   # all the films that pass the Bechdel test
    bechdel_fail = df[df['Bechdel Pass'] == False]  # all the films that fail the Bechdel test

    # each Genres entry is a comma-separated string, so split it into individual genres
    pass_genres = []
    for row in bechdel_pass['Genres']:
        pass_genres.extend(row.split(", "))

    plt.figure()  # open a new figure before plotting so no empty figure is left over
    pd.Series(pass_genres).value_counts().plot(kind='bar', colormap='plasma', title='Genres of Films that Pass the Bechdel Test')

    fail_genres = []
    for row in bechdel_fail['Genres']:
        fail_genres.extend(row.split(", "))

    plt.figure()
    pd.Series(fail_genres).value_counts().plot(kind='bar', colormap='viridis', title='Genres of Films that Fail the Bechdel Test')
    plt.show()

bechdel_genre_graphs(movies)
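
Raw counts can be hard to compare when the pass and fail groups contain different numbers of films. One possible refinement is to compare each genre's share within its own group. The snippet below is a sketch only: it assumes `Genres` entries are comma-separated strings as in the cell above, the helper name `genre_shares` is hypothetical, and the input rows are toy values rather than results from the data set.

```python
from collections import Counter


def genre_shares(genre_rows):
    """Map each genre to its share of all genre tags found in genre_rows."""
    counts = Counter(
        genre.strip()
        for row in genre_rows
        for genre in row.split(",")
        if genre.strip()
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {genre: n / total for genre, n in counts.items()}


# Toy input for illustration only; not values from the real dataset.
shares = genre_shares(["Drama, Romance", "Drama", "Comedy, Drama"])
print(shares)
```

With the real data, the function could be called once on `movies[movies['Bechdel Pass'] == True]['Genres']` and once on the failing group, and the two dictionaries compared side by side.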

## Future Plans

## Sources and Acknowledgements 