# Preliminary EDA: West Coast Datathon 2020

The goal of this notebook is to find questions that the available data sets can answer and to identify useful supplemental data sets that are available online.

Author: Marlin Figgins

Date: Sept. 28, 2020.

## References

```
@misc{kawakatsu2020emergence,
      title={Emergence of Hierarchy in Networked Endorsement Dynamics}, 
      author={Mari Kawakatsu and Philip S. Chodrow and Nicole Eikmeier and Daniel B. Larremore},
      year={2020},
      eprint={2007.04448},
      archivePrefix={arXiv},
      primaryClass={cs.SI}
}
```

In [23]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Reviewing the Oscar Dataset

We can see that the main variables are the category for which the individual was nominated, the corresponding film, and whether or not they were a winner, as well as the year for the film.

In [24]:
# Load in the Oscars data set 
Oscars_df = pd.read_csv("../data/the_oscar_award.csv")
Oscars_df.head(5)

Unnamed: 0,year_film,year_ceremony,ceremony,category,name,film,winner
0,1927,1928,1,ACTOR,Richard Barthelmess,The Noose,False
1,1927,1928,1,ACTOR,Emil Jannings,The Last Command,True
2,1927,1928,1,ACTRESS,Louise Dresser,A Ship Comes In,False
3,1927,1928,1,ACTRESS,Janet Gaynor,7th Heaven,True
4,1927,1928,1,ACTRESS,Gloria Swanson,Sadie Thompson,False


### Reviewing movie_industry.csv
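We can see that each row describes one film: its budget and gross, the production company and country, the director, star, and writer, the genre and rating, the release date and year, the runtime, and the score together with the number of votes behind it.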

In [25]:
Movie_df = pd.read_csv("../data/movie_industry.csv", encoding='latin-1')
Movie_df.head(5)

Unnamed: 0,budget,company,country,director,genre,gross,name,rating,released,runtime,score,star,votes,writer,year
0,8000000.0,Columbia Pictures Corporation,USA,Rob Reiner,Adventure,52287414.0,Stand by Me,R,1986-08-22,89,8.1,Wil Wheaton,299174,Stephen King,1986
1,6000000.0,Paramount Pictures,USA,John Hughes,Comedy,70136369.0,Ferris Bueller's Day Off,PG-13,1986-06-11,103,7.8,Matthew Broderick,264740,John Hughes,1986
2,15000000.0,Paramount Pictures,USA,Tony Scott,Action,179800601.0,Top Gun,PG,1986-05-16,110,6.9,Tom Cruise,236909,Jim Cash,1986
3,18500000.0,Twentieth Century Fox Film Corporation,USA,James Cameron,Action,85160248.0,Aliens,R,1986-07-18,137,8.4,Sigourney Weaver,540152,James Cameron,1986
4,9000000.0,Walt Disney Pictures,USA,Randal Kleiser,Adventure,18564613.0,Flight of the Navigator,PG,1986-08-01,90,6.9,Joey Cramer,36636,Mark H. Baker,1986


## QUESTION: What can profits, Oscars, and IMDb tell us about networks of prestige in the film community?

Are concepts of prestige (realized via awards) well represented by IMDb ratings and profits?

## Do we have all the data present to do this?

- We have yearly time steps $t$ that mark when awards are given
    - `the_oscar_award.csv -> year_ceremony`.
- We have the release dates for films
    - `movie_industry.csv -> year`.
- We have nominations by name and film, along with the winners and when the awards were given
    - `the_oscar_award.csv -> name, film, winner`
- We have the quality of each movie as rated on IMDb
    - `movie_industry.csv -> score`
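
To check how much of this data actually lines up, one option is to join the two tables on film title and year and measure the match rate. The sketch below uses small made-up tables with the same column names as the real files; the `title_key` column and the `normalize_title` helper are hypothetical additions, and exact matching on title and year is an assumption that will likely miss films whose titles or years are recorded differently in the two sources.

```python
import pandas as pd

# Toy stand-ins for the two tables; every row here is made up for illustration.
oscars = pd.DataFrame({
    "year_film": [1986, 1986, 1987],
    "year_ceremony": [1987, 1987, 1988],
    "name": ["Person A", "Person B", "Person C"],
    "film": ["Film X", " film y", "Film Z"],
    "winner": [True, False, False],
})
movies = pd.DataFrame({
    "name": ["Film X", "Film Y", "Film W"],
    "year": [1986, 1986, 1986],
    "score": [8.1, 6.9, 7.8],
    "votes": [1000, 500, 250],
})


def normalize_title(titles):
    """Hypothetical helper: trim whitespace and lowercase titles before matching."""
    return titles.str.strip().str.lower()


oscars["title_key"] = normalize_title(oscars["film"])

# Rename the year column so both tables share the same join keys.
movies_keyed = movies.rename(columns={"year": "year_film"})
movies_keyed["title_key"] = normalize_title(movies_keyed["name"])

# Left join keeps every nomination; `_merge` records whether a film was found.
merged = oscars.merge(
    movies_keyed[["title_key", "year_film", "score", "votes"]],
    on=["title_key", "year_film"],
    how="left",
    indicator=True,
)

# Fraction of nominations with a matching row in the movie table.
match_rate = (merged["_merge"] == "both").mean()
```

A low `match_rate` on the real files would suggest that the overlap between the two data sets (for example, the years each one covers) needs to be checked before building anything on the join.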

## What is the mathematical framework here?

The method here is primarily adapted from Kawakatsu et al. (2020), which used a network model to construct hierarchies of prestige from network data on math PhDs and the prestige of various universities. I believe we can adapt aspects of this method to build statistical models of prestige from ratings and profits and then test whether they are good predictors of Oscar awards.

We have a weighted undirected network at each time step, represented by a matrix $A(t)$ whose entry $a_{ij}$ is the weight of the endorsement between individuals $i$ and $j$. We say there is an endorsement between individuals $i$ and $j$ if they have collaborated in year $t$. There are several ways to quantify the weights on these relationships.

1. **Profit**: The net sales across all collaborations in that year (indicative of consumer choice).
2. **IMDb ratings**: The average quality of productions at that time step as reported by IMDb
    (be wary of sample size -> `movie_industry.csv -> votes`).

**Oscar nomination status**: Binary based on Oscar nomination at time step $t$. 

- We can check Oscar status after the fact to see if it's well explained by prestige as generated by either profit or IMBD ratings.
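
As a rough illustration of the two weightings, the sketch below builds a symmetric collaboration matrix for one year from rows shaped like `movie_industry.csv`. Several things here are assumptions rather than decisions made in this notebook: the function name `collaboration_network` is hypothetical, only the `director`, `star`, and `writer` columns are treated as collaborators, profit is taken to be `gross - budget`, and the toy rows are made up.

```python
import itertools

import numpy as np
import pandas as pd

# Toy rows that mimic the columns of movie_industry.csv; all values are made up.
toy_movies = pd.DataFrame({
    "director": ["D1", "D2", "D1"],
    "star": ["S1", "S1", "S2"],
    "writer": ["W1", "D2", "W1"],
    "budget": [10.0, 5.0, 20.0],
    "gross": [30.0, 4.0, 50.0],
    "score": [8.0, 6.0, 7.0],
    "year": [1986, 1986, 1986],
})


def collaboration_network(df, year, weight="profit"):
    """Return (people, A) for one year; A is symmetric with a zero diagonal."""
    rows = df[df["year"] == year]
    people = sorted(set(rows["director"]) | set(rows["star"]) | set(rows["writer"]))
    position = {person: k for k, person in enumerate(people)}
    total = np.zeros((len(people), len(people)))
    count = np.zeros_like(total)
    for row in rows.itertuples(index=False):
        # A person who fills two roles on a film is counted once.
        crew = sorted({row.director, row.star, row.writer})
        # Assumption: profit is gross minus budget.
        value = row.gross - row.budget if weight == "profit" else row.score
        for a, b in itertools.combinations(crew, 2):
            i, j = position[a], position[b]
            total[i, j] += value
            total[j, i] += value
            count[i, j] += 1
            count[j, i] += 1
    if weight == "profit":
        # Profit weighting: sum over all shared films in the year.
        return people, total
    # Score weighting: average over shared films; unconnected pairs stay at 0.
    average = np.divide(total, count, out=np.zeros_like(total), where=count > 0)
    return people, average


people, A_profit = collaboration_network(toy_movies, 1986, weight="profit")
_, A_score = collaboration_network(toy_movies, 1986, weight="score")
```

Note that profit weights can be negative when a film loses money, which matters for any ranking method that expects non-negative edge weights.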
  
### Dynamic model of endorsements.

We say that endorsements evolve over time as follows

$$
A(t+1) = \lambda A(t) +(1-\lambda) \Delta(t),
$$

where $\Delta(t)$ is the matrix describing new endorsements at time $t$. Therefore, the endorsement network at time $t+1$ is a weighted average of the old network and the endorsements at the most recent time step.
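
The update rule is a one-line computation on matrices. A minimal sketch with NumPy follows; the function name `update_network` and the toy matrices are illustrative only.

```python
import numpy as np


def update_network(A, Delta, lam):
    """One step of A(t+1) = lam * A(t) + (1 - lam) * Delta(t)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * np.asarray(A, dtype=float) + (1.0 - lam) * np.asarray(Delta, dtype=float)


# Toy example: an existing network and one batch of new endorsements.
A_now = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
Delta_now = np.array([[0.0, 0.0],
                      [2.0, 0.0]])
A_next = update_network(A_now, Delta_now, lam=0.75)
```

Unrolling the recursion shows that $\Delta(t-k)$ enters $A(t+1)$ with weight $(1-\lambda)\lambda^k$, so $\lambda$ controls how quickly old endorsements are forgotten.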

### Selecting endorsements.

Using the network $A(t)$, we can construct the new endorsements network $\Delta(t)$ by computing a score $s_i$ for each individual from its connections using methods like PageRank...
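
For concreteness, here is a minimal power-iteration sketch of PageRank on a weight matrix. It is one possible scoring method, not necessarily the one we will end up using. It assumes that all weights are non-negative and that $a_{ij}$ is read as weight flowing from $i$ to $j$; the function name `pagerank_scores` and the damping value 0.85 are illustrative choices.

```python
import numpy as np


def pagerank_scores(A, damping=0.85, tol=1e-10, max_iter=1000):
    """Power-iteration PageRank; assumes non-negative weights in A."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    out_weight = A.sum(axis=1, keepdims=True)
    # Row-stochastic transition matrix; rows with no outgoing weight jump uniformly.
    safe_out = np.where(out_weight > 0, out_weight, 1.0)
    P = np.where(out_weight > 0, A / safe_out, 1.0 / n)
    s = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        s_next = damping * (s @ P) + (1.0 - damping) / n
        converged = np.abs(s_next - s).sum() < tol
        s = s_next
        if converged:
            break
    return s


# Toy example: individuals 0 and 1 both endorse individual 2.
A_toy = np.array([[0.0, 0.0, 1.0],
                  [0.0, 0.0, 1.0],
                  [0.0, 0.0, 0.0]])
scores = pagerank_scores(A_toy)
```

In this toy network the individual who receives both endorsements ends up with the highest score, and the scores sum to one.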

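We assume that each potential endorsement of $j$ by $i$ has an innate utility of the form

$$
u_{ij} = \beta_1 (s_i - s_j)^2 + \beta_0 s_j.
$$

This is supposed to encode the tendency for higher-ranked individuals to be endorsed (the $\beta_0 s_j$ term, with $\beta_0 > 0$) and for individuals to endorse those of similar standing (the squared-difference term, with $\beta_1 < 0$). The linear term is written in terms of $s_j$ because a term that depends only on $s_i$ would cancel when the probabilities of $i$'s choices are normalized.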

*Note:* I think we can motivate the use of another function or statistical model based on the patterns we observe in our actual data.

Once these utilities are generated, we can compute the probability of individual $i$ endorsing $j$ as 

$$
p_{ij} \propto e^{u_{ij}}.
$$

We then draw endorsements to form our matrix $\Delta(t)$. 
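
The selection step can be sketched as follows: compute the utilities from the scores, normalize each row with a softmax so that $p_{ij}$ sums to one over $j$, and sample endorsements row by row. Several choices here are assumptions made for illustration: the linear term is applied to the score of the endorsed individual $j$ (so that it affects whom $i$ chooses after normalization), self-endorsement is forbidden, and each individual makes a fixed number of endorsements per step (`n_per_individual`, a hypothetical parameter). The function names are also illustrative.

```python
import numpy as np


def endorsement_probabilities(s, beta_1, beta_0):
    """Row-normalized p[i, j] proportional to exp(u[i, j]); needs at least 2 individuals."""
    s = np.asarray(s, dtype=float)
    diff = s[:, None] - s[None, :]               # diff[i, j] = s_i - s_j
    # Assumption: the linear term uses the score of the endorsed individual j.
    u = beta_1 * diff**2 + beta_0 * s[None, :]
    np.fill_diagonal(u, -np.inf)                 # assumption: no self-endorsement
    u = u - u.max(axis=1, keepdims=True)         # subtract row max for numerical stability
    p = np.exp(u)
    return p / p.sum(axis=1, keepdims=True)


def draw_endorsements(p, rng, n_per_individual=1):
    """Sample Delta, where Delta[i, j] counts endorsements of j made by i."""
    n = p.shape[0]
    Delta = np.zeros((n, n))
    for i in range(n):
        targets = rng.choice(n, size=n_per_individual, p=p[i])
        Delta[i] += np.bincount(targets, minlength=n)
    return Delta


# Toy example with three individuals and made-up scores and coefficients.
toy_scores = np.array([0.0, 1.0, 2.0])
p = endorsement_probabilities(toy_scores, beta_1=-1.0, beta_0=1.0)
rng = np.random.default_rng(0)
Delta = draw_endorsements(p, rng, n_per_individual=5)
```

Because the draws are directed (from $i$ to $j$), using them in an undirected network would require an extra step such as symmetrizing `Delta`; that choice is left open here.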

Source code for the original paper: https://github.com/PhilChodrow/prestige_reinforcement