In [1]:
import pandas as pd
import numpy as np
from imdb import Cinemagoer

In [2]:
ia = Cinemagoer()

ia.get_movie_infoset()

['airing',
 'akas',
 'alternate versions',
 'awards',
 'connections',
 'crazy credits',
 'critic reviews',
 'episodes',
 'external reviews',
 'external sites',
 'faqs',
 'full credits',
 'goofs',
 'keywords',
 'list',
 'locations',
 'main',
 'misc sites',
 'news',
 'official sites',
 'parents guide',
 'photo sites',
 'plot',
 'quotes',
 'recommendations',
 'release dates',
 'release info',
 'reviews',
 'sound clips',
 'soundtrack',
 'synopsis',
 'taglines',
 'technical',
 'trivia',
 'tv schedule',
 'video clips',
 'vote details']
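The names listed above are the infosets that can be requested for a movie. As a minimal sketch (not necessarily how the plots in this notebook were collected), a helper could ask for the `plot` infoset only. `fetch_plots` and `main` are hypothetical names introduced here. The ID `0147800` is the `Cinemagoer_id` of *10 things i hate about you* from the movies table, zero-padded to seven digits on the assumption that it matches the IMDb ID format.

```python
def fetch_plots(client, imdb_id):
    """Return the plot summaries for one movie as a list of strings.

    `client` is expected to behave like a Cinemagoer() instance.
    """
    # Request only the 'plot' infoset to keep the download small.
    movie = client.get_movie(imdb_id, info=["plot"])
    # Movie objects support dict-style .get(); fall back to an empty list.
    return list(movie.get("plot") or [])


def main():
    # Imported here so the helper above can be exercised without the package;
    # running main() needs the cinemagoer package and network access.
    from imdb import Cinemagoer

    ia = Cinemagoer()
    plots = fetch_plots(ia, "0147800")
    print(len(plots), "plot summaries found")
```

Calling `main()` performs the actual request. Passing the client in as an argument keeps the helper easy to try with a stand-in object when IMDb is unreachable.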

### 1. Data: [Cornell Movie-Dialogs Corpus](https://convokit.cornell.edu/documentation/movie.html) & [Cinemagoer](https://cinemagoer.readthedocs.io/en/latest/index.html)

[Cornell's Page on This](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html)

[Cinemagoer](https://cinemagoer.readthedocs.io/en/latest/index.html)

[Our Compiled Data for Download](https://uillinoisedu-my.sharepoint.com/:f:/g/personal/mengyue4_illinois_edu/ErWZnueG65RPrwWrxzms3DUBWE5y55pf1IomDARFoBG02w?e=uuMaeg)

The Cornell Movie-Dialogs Corpus is a large, metadata-rich collection of fictional conversations extracted from raw movie scripts: 220,579 conversational exchanges between 10,292 pairs of movie characters in 617 movies.

Dataset details:
1. Speaker-level information:

Speakers in this dataset are movie characters. The speaker index from the original data release serves as the speaker name. For each character, the data also provide the following speaker-level metadata:

    speaker_id: speaker id

    character_name: name of the character in the movie

    movie_idx: index of the movie this character appears in

    movie_name: title of the movie

    gender: gender of the character (“?” for unlabeled cases)

    credit_pos: position in the movie credits (unlabeled cases, originally “?”, are converted to 0)
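Because unlabeled values are encoded as `?` (gender) and `0` (credit position), it may help to turn them into real missing values before computing any statistics. A minimal sketch on toy rows copied from the speakers preview in this notebook; `speakers_demo` and `missing_counts` are illustrative names, not part of the dataset:

```python
import pandas as pd

# Toy rows copied from the speakers preview in this notebook.
speakers_demo = pd.DataFrame({
    "speaker_id": ["u0", "u2", "u3"],
    "character_name": ["BIANCA", "CAMERON", "CHASTITY"],
    "gender": ["f", "m", "?"],
    "credit_pos": [4, 3, 0],
})

# Turn the placeholder codes into proper missing values.
speakers_demo["gender"] = speakers_demo["gender"].mask(
    speakers_demo["gender"] == "?"
)
speakers_demo["credit_pos"] = speakers_demo["credit_pos"].mask(
    speakers_demo["credit_pos"] == 0
)

# Count how many characters lack each label.
missing_counts = speakers_demo[["gender", "credit_pos"]].isna().sum()
print(missing_counts)
```

With missing values marked explicitly, aggregations such as a mean credit position skip the unlabeled characters instead of treating them as position 0.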

2. Utterance-level information:

For each utterance, the data provide:

    utterances_id: index of the utterance

    speaker: id of the speaker who authored the utterance

    belonging_id: id of the first utterance in the conversation this utterance belongs to

    reply_to: id of the utterance this utterance replies to (None if the utterance is not a reply; the processed CSV shown below stores this as end)

    text: textual content of the utterance

    movie_idx: index of the movie in which this utterance occurs
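The `belonging_id` and `reply_to` fields are enough to rebuild each conversation in speaking order. The sketch below uses rows copied from the utterances preview in this notebook and assumes each conversation is a simple chain, with at most one reply per utterance. `build_conversations` is a hypothetical helper, not part of the dataset or of ConvoKit.

```python
from collections import defaultdict

# Rows copied from the utterances preview; "end" marks an utterance with no parent.
rows = [
    {"utterances_id": "L1045", "belonging_id": "L1044",
     "text": "They do not!", "speaker": "u0", "reply_to": "L1044"},
    {"utterances_id": "L1044", "belonging_id": "L1044",
     "text": "They do to!", "speaker": "u2", "reply_to": "end"},
    {"utterances_id": "L985", "belonging_id": "L984",
     "text": "I hope so.", "speaker": "u0", "reply_to": "L984"},
    {"utterances_id": "L984", "belonging_id": "L984",
     "text": "She okay?", "speaker": "u2", "reply_to": "end"},
]


def build_conversations(rows):
    """Group utterances by conversation and order them via reply_to links."""
    by_conv = defaultdict(list)
    for row in rows:
        by_conv[row["belonging_id"]].append(row)

    conversations = {}
    for conv_id, members in by_conv.items():
        by_id = {m["utterances_id"]: m for m in members}
        # Map each parent id to the utterance that replies to it
        # (assumption: at most one reply per utterance).
        child_of = {m["reply_to"]: m for m in members}
        ordered = []
        current = by_id.get(conv_id)  # a conversation starts at its own id
        # The length check guards against accidental cycles in the data.
        while current is not None and len(ordered) < len(members):
            ordered.append(current)
            current = child_of.get(current["utterances_id"])
        conversations[conv_id] = ordered
    return conversations


conversations = build_conversations(rows)
for conv_id, turns in conversations.items():
    print(conv_id, [turn["text"] for turn in turns])
```

For the full table, `utterances.to_dict("records")` yields rows in the same shape as `rows` above.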

3. Movie-level information:

Conversations are indexed by the id of their first utterance (belonging_id above), while movie metadata is stored once per movie. For each movie, the data provide:

    movie_idx: index of the movie

    movie_name: title of the movie

    release_year: year of movie release

    rating: IMDb rating of the movie

    votes: number of IMDb votes

    genre: a list of genres this movie belongs to

    Cinemagoer_id: id from Cinemagoer, scraped from IMDb

    multiple_plot: multiple versions of the plot summary from IMDb
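When the movie table is read back from CSV, `genre` and `multiple_plot` arrive as plain strings (for example `'comedy', 'romance'`), so they need to be parsed before they can be used as lists. A minimal sketch, assuming the stored text is a valid Python literal as the preview in this notebook suggests; `parse_genre` and `parse_plots` are hypothetical helper names:

```python
import ast


def parse_genre(raw):
    """Turn a string like "'comedy', 'romance'" into ['comedy', 'romance']."""
    raw = raw.strip()
    if not raw:
        return []
    # Wrap in brackets (if missing) so the text is a valid list literal.
    if not raw.startswith("["):
        raw = "[" + raw + "]"
    return list(ast.literal_eval(raw))


def parse_plots(raw):
    """Turn the text of a stored list of plot summaries into a real list."""
    return list(ast.literal_eval(raw))


print(parse_genre("'comedy', 'romance'"))
print(parse_plots('["First summary.", "Second summary."]'))
```

On the real table this could be applied with `movies['genre'].apply(parse_genre)`. `ast.literal_eval` only evaluates literals, which makes it a safer choice than `eval` for text loaded from a file.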

### 2. This Part's Goal:

~~1. restrict the year range of our movies to only as early as 2000~~

2. From the scripts (plots) and conversations, look for <mark>shared similarities</mark> and <mark>differences</mark> in topics and word use, to understand what makes a great movie in terms of plot.

3. We plan to use *Word Frequency Analysis* and *Text Similarity Analysis*.
   The frequency analysis examines the most frequently used words or phrases to identify common elements in successful movie scripts.
   The similarity analysis measures how similar different movie scripts are to each other, to find patterns or trends.
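To make the two planned analyses concrete, here is a small standard-library sketch. It counts words with a `Counter` and compares texts by cosine similarity over those counts (a bag-of-words measure). The notebook does not commit to a specific similarity measure at this point, so the cosine choice, the simple regex tokenizer, and the function names are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(texts, stopwords=frozenset()):
    """Count words across several texts, skipping any stopwords."""
    counts = Counter()
    for text in texts:
        counts.update(t for t in tokenize(text) if t not in stopwords)
    return counts


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two lines taken from the utterances preview in this notebook.
line_a = "They do not!"
line_b = "They do to!"

freq = word_frequencies([line_a, line_b])
similarity = cosine_similarity(
    word_frequencies([line_a]), word_frequencies([line_b])
)
print(freq.most_common(2))
print(round(similarity, 3))
```

Raw counts favor very common words, so a stopword list or TF-IDF weighting would be a natural next refinement.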

### 3. Read-in & Preprocess on Cornell

In [4]:
from pathlib import Path

# Folder with the processed Cornell CSVs (local path; change it for your machine)
data_dir = Path(r'C:\Users\98768\Desktop\is310\project\movie-corpus\processed')
movies = pd.read_csv(data_dir / 'movies.csv')
speakers = pd.read_csv(data_dir / 'speakers.csv')
utterances = pd.read_csv(data_dir / 'utterances.csv')

print('movies:', movies.columns.values)
print('speakers:', speakers.columns.values)
print('utterances:', utterances.columns.values)

movies: ['movie_idx' 'movie_name' 'release_year' 'rating' 'votes' 'genre'
 'Cinemagoer_id' 'multiple_plot']
speakers: ['speaker_id' 'character_name' 'movie_idx' 'movie_name' 'gender'
 'credit_pos']
utterances: ['utterances_id' 'belonging_id' 'text' 'speaker' 'reply_to' 'movie_idx']


In [7]:
utterances

| | utterances_id | belonging_id | text | speaker | reply_to | movie_idx |
|---|---|---|---|---|---|---|
| 0 | L1045 | L1044 | They do not! | u0 | L1044 | m0 |
| 1 | L1044 | L1044 | They do to! | u2 | end | m0 |
| 2 | L985 | L984 | I hope so. | u0 | L984 | m0 |
| 3 | L984 | L984 | She okay? | u2 | end | m0 |
| 4 | L925 | L924 | Let's go. | u0 | L924 | m0 |
| ... | ... | ... | ... | ... | ... | ... |
| 304708 | L666371 | L666369 | Lord Chelmsford seems to want me to stay back ... | u9030 | L666370 | m616 |
| 304709 | L666370 | L666369 | I'm to take the Sikali with the main column to... | u9034 | L666369 | m616 |
| 304710 | L666369 | L666369 | Your orders, Mr Vereker? | u9030 | end | m616 |
| 304711 | L666257 | L666256 | Good ones, yes, Mr Vereker. Gentlemen who can ... | u9030 | L666256 | m616 |


In [6]:
speakers

| | speaker_id | character_name | movie_idx | movie_name | gender | credit_pos |
|---|---|---|---|---|---|---|
| 0 | u0 | BIANCA | m0 | 10 things i hate about you | f | 4 |
| 1 | u2 | CAMERON | m0 | 10 things i hate about you | m | 3 |
| 2 | u3 | CHASTITY | m0 | 10 things i hate about you | ? | 0 |
| 3 | u4 | JOEY | m0 | 10 things i hate about you | m | 6 |
| 4 | u5 | KAT | m0 | 10 things i hate about you | f | 2 |
| ... | ... | ... | ... | ... | ... | ... |
| 9030 | u9029 | CREALOCK | m616 | zulu dawn | ? | 0 |
| 9031 | u9033 | STUART SMITH | m616 | zulu dawn | ? | 0 |
| 9032 | u9028 | COGHILL | m616 | zulu dawn | ? | 0 |
| 9033 | u9031 | MELVILL | m616 | zulu dawn | ? | 0 |


In [5]:
movies

| | movie_idx | movie_name | release_year | rating | votes | genre | Cinemagoer_id | multiple_plot |
|---|---|---|---|---|---|---|---|---|
| 0 | m0 | 10 things i hate about you | 1999 | 6.9 | 62847 | 'comedy', 'romance' | 147800 | ['A high-school boy, Cameron, cannot date Bian... |
| 1 | m1 | 1492: conquest of paradise | 1992 | 6.2 | 10421 | 'adventure', 'biography', 'drama', 'history' | 103594 | ["Christopher Columbus' discovery of the Ameri... |
| 2 | m2 | 15 minutes | 2001 | 6.1 | 25854 | 'action', 'crime', 'drama', 'thriller' | 179626 | ['A homicide detective and a fire marshal must... |
| 3 | m3 | 2001: a space odyssey | 1968 | 8.4 | 163227 | 'adventure', 'mystery', 'sci-fi' | 62622 | ['After uncovering a mysterious artifact burie... |
| 4 | m4 | 48 hrs. | 1982 | 6.9 | 22289 | 'action', 'comedy', 'crime', 'drama', 'thriller' | 83511 | ['A hard-nosed cop reluctantly teams up with a... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 612 | m612 | watchmen | 2009 | 7.8 | 135229 | 'action', 'crime', 'fantasy', 'mystery', 'sci-... | 409459 | ['In a version of 1985 where superheroes exist... |
| 613 | m613 | xxx | 2002 | 5.6 | 53505 | 'action', 'adventure', 'crime' | 295701 | ['The US government recruits extreme sports at... |
| 614 | m614 | x-men | 2000 | 7.4 | 122149 | 'action', 'sci-fi' | 16026746 | ["A band of mutants use their uncanny gifts to... |
| 615 | m615 | young frankenstein | 1974 | 8.0 | 57618 | 'comedy', 'sci-fi' | 72431 | ['An American grandson of the infamous scienti... |
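The three tables are linked through `movie_idx`, so it is worth checking that every utterance points at a movie that exists in `movies`. In the previews above, utterances and speakers reference `m616` while the last movies row shown is `m615`. Whether the full movies table actually lacks `m616` is not established here. The sketch below shows one way to check with a left merge and `indicator=True`, using toy frames that mirror those previews. The `*_demo` names and `orphans` are illustrative.

```python
import pandas as pd

# Toy frames mirroring the previews above (not the full data).
movies_demo = pd.DataFrame({
    "movie_idx": ["m0", "m615"],
    "movie_name": ["10 things i hate about you", "young frankenstein"],
})
utterances_demo = pd.DataFrame({
    "utterances_id": ["L1045", "L1044", "L666369"],
    "movie_idx": ["m0", "m0", "m616"],
})

# A left merge keeps every utterance; the _merge column added by
# indicator=True records whether a matching movie row was found.
checked = utterances_demo.merge(
    movies_demo[["movie_idx"]].drop_duplicates(),
    on="movie_idx",
    how="left",
    indicator=True,
)

# Movie ids used by utterances but absent from the movie table.
orphans = checked.loc[checked["_merge"] == "left_only", "movie_idx"].unique()
print(sorted(orphans))
```

Running the same merge on the real `utterances` and `movies` frames would show whether any conversations would lose their movie-level metadata when the plots are joined in.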


### 4. Add in the Plot from Cinemagoer