# How does cinema view the world?

This project explores how cities and countries are depicted in cinema.

We approach the topic from several angles.

Contents:

0. [Data preprocessing](#data-preprocessing)
1. [General analysis by city/country](#general-analysis-by-location)
2. [Genre distributions & Bias](#genre-distributions--bias)
3. [Exploring how different countries view each other](#exploring-how-different-countries-view-each-other)
4. [Exploring how locations changed their view in time](#exploring-how-locations-changed-their-view-in-time)
5. [Character depiction and stereotypes](#character-depiction-and-stereotypes)

## Data preprocessing

### Imports

Here we import the required libraries and helper functions we will need for the analysis.

In [13]:
import os

import pandas as pd
import numpy as np

# We use Google Maps to get the coordinates of the cities/countries.
import googlemaps

# We use our own helper functions to load the data and get an embedding.
from helpers import load_data, get_embedding

# To track progress we use the tqdm package.
import tqdm

# To load our generated movie analysis we will use the json package.
import json

from sklearn.preprocessing import MinMaxScaler

# We hide warnings to make the notebook a bit cleaner.
import warnings
warnings.filterwarnings("ignore")

### Data loading

For this project we use the [CMU Movie Summary Corpus](https://www.cs.cmu.edu/~ark/personas/), which contains plot summaries of 42,306 movies. The dataset also contains metadata about the movies and their actors.

We will use the following files from the dataset:
- `plot_summaries.txt` - which contains the plot summaries.
- `character.metadata.tsv` - which contains information about the characters and actors that play in a movie.
- `movie.metadata.tsv` - which contains information about movies.


Because our project explores how *cinema views the world*, our main tool is the location of a movie: we analyze how it affects the story and the characters, and what biases can be found.

#### Location information
The dataset does not provide the plot location, so we extracted location information using the newly released [JSON mode of the OpenAI API](https://platform.openai.com/docs/guides/text-generation/json-mode).

Using the OpenAI API, we extracted the following information from each movie summary: the cities and countries associated with the plot, and the nationality and alignment of each character.

Example output for the movie ***Pest from the West***:
```json
{
   "cities": [
      "Mexico City"
   ],
   "countries": [
      "Mexico"
   ],
   "characters": {
      "Keaton": {
         "nationality": "USA",
         "alignment": "good"
      }
   }
}
```

This data resides in the `movie_analysis.json` file. The code that helped us generate these results resides in `calculate-locations.ipynb`.
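
As a rough illustration of how per-movie records like the one above can be turned into a location-to-movies mapping, the sketch below inverts a small hand-written sample. The layout used here (a dictionary keyed by movie ID) and the function name `group_movies_by_location` are assumptions made for this example; the actual structure of `movie_analysis.json` may differ.

```python
import json
from collections import defaultdict

# Hypothetical sample mimicking the per-movie output shown above.
# Assumption: records are keyed by a movie ID; the real file layout may differ.
SAMPLE_ANALYSIS = json.loads("""
{
  "movie_1": {"cities": ["Mexico City"], "countries": ["Mexico"], "characters": {}},
  "movie_2": {"cities": ["Paris"], "countries": ["France"], "characters": {}},
  "movie_3": {"cities": ["Paris", "Mexico City"], "countries": ["France", "Mexico"], "characters": {}}
}
""")


def group_movies_by_location(analysis, field):
    """Map each location in `field` ("cities" or "countries") to its movie IDs."""
    grouped = defaultdict(list)
    for movie_id, record in analysis.items():
        # A missing field is treated as empty; set() drops repeated names.
        for location in sorted(set(record.get(field, []))):
            grouped[location].append(movie_id)
    return dict(grouped)


cities_to_movies = group_movies_by_location(SAMPLE_ANALYSIS, "cities")
countries_to_movies = group_movies_by_location(SAMPLE_ANALYSIS, "countries")
```

Keeping the grouping in one function means the same logic serves both cities and countries, which makes it easy to compare the two views of the data.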

#### Embeddings
We have also computed semantic embeddings of all summaries so that we can measure similarity between movies, or between a term and a movie. To calculate these embeddings we used the [OpenAI Embeddings API](https://platform.openai.com/docs/guides/embeddings). Each embedding vector is `1536`-dimensional.

The embeddings are stored in the `embeddings.npy` file. The code that helped us generate these results resides in `calculate-embeddings.ipynb`.
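
One common way to compare such vectors is cosine similarity. The sketch below shows the idea on toy 3-dimensional vectors; the function name `cosine_similarity_to_all` and the sample data are illustrative assumptions, and the notebook does not state which similarity metric the project actually uses.

```python
import numpy as np


def cosine_similarity_to_all(query, matrix):
    """Cosine similarity between one vector and every row of a 2-D matrix."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    # Normalise to unit length so the dot product equals the cosine.
    # Assumption: no vector is all zeros (that would divide by zero).
    query_unit = query / np.linalg.norm(query)
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix_unit @ query_unit


# Toy 3-dimensional stand-ins; the real vectors have 1536 dimensions.
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [3.0, 3.0, 0.0],
])
toy_query = np.array([1.0, 0.0, 0.0])

similarities = cosine_similarity_to_all(toy_query, toy_embeddings)
most_similar_index = int(np.argmax(similarities))
```

The same function would work for a term-to-movie comparison, provided the term is embedded with the same model as the summaries.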


#### TMDB Dataset

We plan to use the [TMDB dataset](https://www.themoviedb.org) to obtain reliable user scores for movies. It can be accessed through an easy-to-use Python library, so we do not expect data-access problems.

In [14]:
DATA_PATH = 'data/'

# We load the data using our helper function.
loaded_data = load_data(DATA_PATH)

# We initialize the Google Maps client using our API key.
gmaps = googlemaps.Client(key=os.environ['GOOGLE_MAPS_API_KEY'])

# We extract the variables from the loaded data.
character_metadata = loaded_data['character_metadata'] # The metadata of the characters.
movie_metadata = loaded_data['movie_metadata'] # The metadata of the movies.
plot_summaries = loaded_data['plot_summaries'] # The plot summaries of the movies.
embeddings = loaded_data['embeddings'] # The embeddings of the movies as a numpy array.
combined_plot_summaries = loaded_data['combined_plot_summaries'] # The movie summaries combined with their embeddings.
city_country_analysis = loaded_data['city_country_analysis'] # The analysis of the cities and countries.
cities = city_country_analysis['cities'] # A list of all the cities.
countries = city_country_analysis['countries'] # A list of all the countries.
cities_movies = city_country_analysis['cities_movies'] # A dictionary mapping cities to movies.
countries_movies = city_country_analysis['countries_movies'] # A dictionary mapping countries to movies.
embeddings_of_movies_in_cities = city_country_analysis['embeddings_of_movies_in_cities'] # A dictionary mapping cities to embeddings of movies.
embeddings_of_movies_in_countries = city_country_analysis['embeddings_of_movies_in_countries'] # A dictionary mapping countries to embeddings of movies.
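
To illustrate how coordinates might be read from a geocoding response, the sketch below pulls the latitude and longitude out of the first result. The helper name `extract_coordinates` and the hand-written response are assumptions for this example; with a configured client the result list would come from a call such as `gmaps.geocode("Mexico City")`.

```python
def extract_coordinates(geocode_results):
    """Return (lat, lng) of the first geocoding result, or None if there is none."""
    if not geocode_results:
        return None
    location = geocode_results[0]["geometry"]["location"]
    return location["lat"], location["lng"]


# Hand-written stand-in for a geocoding response, trimmed to the fields used here.
fake_results = [
    {"geometry": {"location": {"lat": 19.4326, "lng": -99.1332}}}
]

mexico_city_coordinates = extract_coordinates(fake_results)
no_match_coordinates = extract_coordinates([])
```

Returning `None` for an empty result list lets the caller decide how to handle locations that cannot be geocoded.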

## General analysis by location

## Genre distributions & Bias

## Exploring how different countries view each other

## Exploring how locations changed their view in time

## Character depiction and stereotypes