## Table of Contents <a class="anchor" id="top"></a>
* [Data Preparation](#Data Prep)
* [Entity Resolution](#Entity)
* [Relation Extraction](#Relation)
* [Query System](#Query)

## Data Prep <a class="anchor" id="Data Prep"></a>
[[back to top]](#top)

In [58]:
%load_ext autoreload
%autoreload 2

# standard library imports
import os
import re
from collections import Counter

# third-party imports
import nltk
import numpy as np
import pandas as pd

# project modules: modeling functions & utilities
from pronounResolution import pronResolution_base, pronResolution_nnMod, pronResolution_nn, pronEval
from relationExtract import simpleRE, REEval, getRelations, extract_relation_categories

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [59]:
import ast  # literal_eval parses the stringified columns without executing arbitrary code

# os.listdir returns names in arbitrary order; sort so that files[1] is reproducible
files = sorted(x for x in os.listdir('prep_scripts') if '_gapi' in x)
df = pd.read_csv('prep_scripts/' + files[1])[['speaker', 'dialogue', 'sentences', 'sentiment', 'entities', 'tokens']]

# these columns are stored as Python-literal strings; convert them back to lists/dicts
for col in ['tokens', 'sentiment', 'entities']:
    df[col] = df[col].apply(ast.literal_eval)
df['speaker'] = df['speaker'].str.strip()
df.head()

Unnamed: 0,speaker,dialogue,sentences,sentiment,entities,tokens
0,Announcer,[first lines; announcement over speaker] Repor...,[{'content': u'[first lines; announcement over...,"{'score': -0.2, 'magnitude': 1.6}","[{'name': 'lines', 'salience': 0.35250518, 'me...","[{'pos': 'PUNCT', 'content': '[', 'begin': 0, ..."
1,narrator,the Avengers are in the process of infiltratin...,[{'content': u'the Avengers are in the process...,"{'score': 0.1, 'magnitude': 0.1}","[{'name': 'Avengers', 'salience': 0.47595453, ...","[{'pos': 'DET', 'content': 'the', 'begin': 0, ..."
2,Tony Stark,Shit!,"[{'content': u'Shit!', 'begin': 0, 'score': -0...","{'score': -0.6, 'magnitude': 0.6}",[],"[{'pos': 'X', 'content': 'Shit', 'begin': 0, '..."
3,Steve Rogers,"Language! JARVIS, what's the view from upstairs?","[{'content': u'Language!', 'begin': 0, 'score'...","{'score': 0, 'magnitude': 0.1}","[{'name': 'Language', 'salience': 0.7599061, '...","[{'pos': 'NOUN', 'content': 'Language', 'begin..."
4,JARVIS,The central building is protected by some kind...,[{'content': u'The central building is protect...,"{'score': 0.7, 'magnitude': 1.5}","[{'name': 'building', 'salience': 0.47500995, ...","[{'pos': 'DET', 'content': 'The', 'begin': 0, ..."


In [60]:
cList = list(df.speaker.unique())
cCount = Counter(df.speaker)

# total sentiment score for each line
df['total_sent'] = df['sentiment'].apply(lambda x: x['score'] * x['magnitude'])
# summed sentiment per speaker
cDict = dict(df.groupby('speaker').total_sent.sum())

# number of pronouns for each line
df['num_pron'] = df['tokens'].apply(lambda x: sum(t['pos'] == 'PRON' for t in x))

# set nearby speakers: for each line, the speakers from charRange lines ahead
# down to charRange lines behind (NaN where the window runs past either end)
charRange = 10
# np.dstack needs a sequence (not a generator) in current NumPy
nearbyList = np.dstack([df.speaker.shift(i).values for i in range(-charRange, charRange + 1)])[0]
# DataFrame.set_value was removed in pandas 1.0; assign the per-row arrays directly
df['nearbyChars'] = list(nearbyList)

df.head()

Unnamed: 0,speaker,dialogue,sentences,sentiment,entities,tokens,total_sent,num_pron,nearbyChars
0,Announcer,[first lines; announcement over speaker] Repor...,[{'content': u'[first lines; announcement over...,"{'score': -0.2, 'magnitude': 1.6}","[{'name': 'lines', 'salience': 0.35250518, 'me...","[{'pos': 'PUNCT', 'content': '[', 'begin': 0, ...",-0.32,3,"[Tony Stark, Clint Barton, narrator, Natasha R..."
1,narrator,the Avengers are in the process of infiltratin...,[{'content': u'the Avengers are in the process...,"{'score': 0.1, 'magnitude': 0.1}","[{'name': 'Avengers', 'salience': 0.47595453, ...","[{'pos': 'DET', 'content': 'the', 'begin': 0, ...",0.01,0,"[Steve Rogers, Tony Stark, Clint Barton, narra..."
2,Tony Stark,Shit!,"[{'content': u'Shit!', 'begin': 0, 'score': -0...","{'score': -0.6, 'magnitude': 0.6}",[],"[{'pos': 'X', 'content': 'Shit', 'begin': 0, '...",-0.36,0,"[narrator, Steve Rogers, Tony Stark, Clint Bar..."
3,Steve Rogers,"Language! JARVIS, what's the view from upstairs?","[{'content': u'Language!', 'begin': 0, 'score'...","{'score': 0, 'magnitude': 0.1}","[{'name': 'Language', 'salience': 0.7599061, '...","[{'pos': 'NOUN', 'content': 'Language', 'begin...",0.0,1,"[Steve Rogers, narrator, Steve Rogers, Tony St..."
4,JARVIS,The central building is protected by some kind...,[{'content': u'The central building is protect...,"{'score': 0.7, 'magnitude': 1.5}","[{'name': 'building', 'salience': 0.47500995, ...","[{'pos': 'DET', 'content': 'The', 'begin': 0, ...",1.05,1,"[Strucker, Steve Rogers, narrator, Steve Roger..."
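
The `nearbyChars` column above is a sliding window over the speaker column. As a rough, dependency-free sketch of the same idea (the function name `nearby_speakers` and the toy speaker list are illustrative assumptions, and `None` stands in for the `NaN` padding that `shift` produces), the window can be built like this:

```python
def nearby_speakers(speakers, radius):
    """For each line, list the speakers from `radius` lines ahead down to `radius` lines behind."""
    n = len(speakers)
    windows = []
    for row in range(n):
        window = []
        # Offsets run from +radius (later lines) to -radius (earlier lines),
        # the same order that shift(-radius) ... shift(radius) yields.
        for offset in range(radius, -radius - 1, -1):
            j = row + offset
            # None marks positions that fall outside the script.
            window.append(speakers[j] if 0 <= j < n else None)
        windows.append(window)
    return windows


# Toy input: the first five speakers shown in the table above.
speakers = ["Announcer", "narrator", "Tony Stark", "Steve Rogers", "JARVIS"]
windows = nearby_speakers(speakers, radius=1)
for name, window in zip(speakers, windows):
    print(name, "->", window)
```

Each window includes the current speaker at its center, so any downstream step that treats the window as a candidate list may want to account for that.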


## Task 1. Entity Resolution <a class="anchor" id="Entity"></a>
[[back to top]](#top)

In [61]:
# The return value is discarded; the resolver appears to annotate each row's tokens in place
df.apply(lambda x: pronResolution_nnMod(cCount, x, absolute=False), axis=1)
df.head(20)

Unnamed: 0,speaker,dialogue,sentences,sentiment,entities,tokens,total_sent,num_pron,nearbyChars
0,Announcer,[first lines; announcement over speaker] Repor...,[{'content': u'[first lines; announcement over...,"{'score': -0.2, 'magnitude': 1.6}","[{'name': 'lines', 'salience': 0.35250518, 'me...","[{'pos': 'PUNCT', 'content': '[', 'begin': 0, ...",-0.32,3,"[Tony Stark, Clint Barton, narrator, Natasha R..."
1,narrator,the Avengers are in the process of infiltratin...,[{'content': u'the Avengers are in the process...,"{'score': 0.1, 'magnitude': 0.1}","[{'name': 'Avengers', 'salience': 0.47595453, ...","[{'pos': 'DET', 'content': 'the', 'begin': 0, ...",0.01,0,"[Steve Rogers, Tony Stark, Clint Barton, narra..."
2,Tony Stark,Shit!,"[{'content': u'Shit!', 'begin': 0, 'score': -0...","{'score': -0.6, 'magnitude': 0.6}",[],"[{'pos': 'X', 'content': 'Shit', 'begin': 0, '...",-0.36,0,"[narrator, Steve Rogers, Tony Stark, Clint Bar..."
3,Steve Rogers,"Language! JARVIS, what's the view from upstairs?","[{'content': u'Language!', 'begin': 0, 'score'...","{'score': 0, 'magnitude': 0.1}","[{'name': 'Language', 'salience': 0.7599061, '...","[{'pos': 'NOUN', 'content': 'Language', 'begin...",0.0,1,"[Steve Rogers, narrator, Steve Rogers, Tony St..."
4,JARVIS,The central building is protected by some kind...,[{'content': u'The central building is protect...,"{'score': 0.7, 'magnitude': 1.5}","[{'name': 'building', 'salience': 0.47500995, ...","[{'pos': 'DET', 'content': 'The', 'begin': 0, ...",1.05,1,"[Strucker, Steve Rogers, narrator, Steve Roger..."
5,Thor,Loki's scepter must be here. Strucker couldn't...,"[{'content': u""Loki's scepter must be here."", ...","{'score': 0.1, 'magnitude': 0.6}","[{'name': 'scepter', 'salience': 0.54812527, '...","[{'pos': 'NOUN', 'content': 'Loki', 'begin': 0...",0.06,1,"[Fortress Soldier, Strucker, Steve Rogers, nar..."
6,narrator,Natasha knocks out some soldiers,[{'content': u'Natasha knocks out some soldier...,"{'score': 0.2, 'magnitude': 0.2}","[{'name': 'Natasha Romanoff', 'salience': 0.76...","[{'pos': 'NOUN', 'content': 'Natasha', 'begin'...",0.04,0,"[Strucker, Fortress Soldier, Strucker, Steve R..."
7,Natasha Romanoff,"At long last is lasting a little long, boys.",[{'content': u'At long last is lasting a littl...,"{'score': 0.4, 'magnitude': 0.4}","[{'name': 'boys', 'salience': 1, 'meta': {}, '...","[{'pos': 'ADP', 'content': 'At', 'begin': 0, '...",0.16,0,"[Fortress Soldier, Strucker, Fortress Soldier,..."
8,narrator,as some soldiers shoot at him,"[{'content': u'as some soldiers shoot at him',...","{'score': 0.2, 'magnitude': 0.2}","[{'name': 'Soldiers', 'salience': 1, 'meta': {...","[{'pos': 'ADP', 'content': 'as', 'begin': 0, '...",0.04,1,"[Strucker, Fortress Soldier, Strucker, Fortres..."
9,Clint Barton,Yeah. I think we lost the element of surprise.,"[{'content': u'Yeah.', 'begin': 0, 'score': 0....","{'score': 0.1, 'magnitude': 0.3}","[{'name': 'surprise', 'salience': 0.64406866, ...","[{'pos': 'X', 'content': 'Yeah', 'begin': 0, '...",0.03,2,"[Fortress Soldier, Strucker, Fortress Soldier,..."


In [6]:
pronEval([df, df], numExamples=2)


******** line 766 ********
764. Kurt:
Oh, no.

765. narrator:
back with Scott and the ants

=> 766. Scott Lang:
=> I'm employing the bullet ants. Hapanera-clamda-mana-merna. I don't remember what it's called but I feel bad for this guy. [using the ants Scott takes down one of the security guards with Luis also punching him]

767. Luis:
See, that's what I'm talkin’ bout. That's what I call it, an unfortunate casualty, in a very serious operation, you know? [Hope then comes along and enters the room and places the signal decoy]

768. Kurt:
Signal decoy in place. Mean pretty lady did good, Scott.

******** test model 1: line 766 ********
6 pronouns resolved
1. I => ['Peggy Carter']
2. I => ['Pym Tech Employee']
3. what => ['Hope van Dyne']
4. it => ['Cop on Speaker']
5. I => ['Hideous Rabbit']
6. him => ['Pym Tech Employee']

how many are correctly identified? 2

******** line 766 ********
764. Kurt:
Oh, no.

765. narrator:
back with Scott and the ants

=> 766. Scott Lang:
=> I'm employi

## Task 2. Relation Extraction <a class="anchor" id="Relation"></a>
[[back to top]](#top)

In [48]:
df.apply(lambda x: pronResolution_nnMod(cCount, x), axis=1)
df.head()

Unnamed: 0,speaker,dialogue,sentences,sentiment,entities,tokens,total_sent,num_pron,nearbyChars
0,Announcer,[first lines; announcement over speaker] Repor...,[{'content': u'[first lines; announcement over...,"{'magnitude': 1.6, 'score': -0.2}","[{'salience': 0.35250518, 'type': 'OTHER', 'me...","[{'content': '[', 'pos': 'PUNCT', 'label': 'P'...",-0.32,3,"[Tony Stark, Clint Barton, narrator, Natasha R..."
1,narrator,the Avengers are in the process of infiltratin...,[{'content': u'the Avengers are in the process...,"{'magnitude': 0.1, 'score': 0.1}","[{'salience': 0.47595453, 'type': 'PERSON', 'm...","[{'content': 'the', 'pos': 'DET', 'label': 'DE...",0.01,0,"[Steve Rogers, Tony Stark, Clint Barton, narra..."
2,Tony Stark,Shit!,"[{'content': u'Shit!', 'begin': 0, 'score': -0...","{'magnitude': 0.6, 'score': -0.6}",[],"[{'content': 'Shit', 'pos': 'X', 'label': 'ROO...",-0.36,0,"[narrator, Steve Rogers, Tony Stark, Clint Bar..."
3,Steve Rogers,"Language! JARVIS, what's the view from upstairs?","[{'content': u'Language!', 'begin': 0, 'score'...","{'magnitude': 0.1, 'score': 0}","[{'salience': 0.7599061, 'type': 'OTHER', 'men...","[{'content': 'Language', 'pos': 'NOUN', 'label...",0.0,1,"[Steve Rogers, narrator, Steve Rogers, Tony St..."
4,JARVIS,The central building is protected by some kind...,[{'content': u'The central building is protect...,"{'magnitude': 1.5, 'score': 0.7}","[{'salience': 0.47500995, 'type': 'LOCATION', ...","[{'content': 'The', 'pos': 'DET', 'label': 'DE...",1.05,1,"[Strucker, Steve Rogers, narrator, Steve Roger..."


In [62]:
df['relations'] = df.apply(lambda x:extract_relation_categories(cList, x), axis=1)
df.head()

15
{'pos': 'PRON', 'content': 'They', 'begin': 34, 'index': 9, 'lemma': 'They', 'label': 'NSUBJ', 'char': ['Tony Stark', 'Steve Rogers', 'Bruce Banner']}
16
{'pos': 'PRON', 'content': 'They', 'begin': 10, 'index': 5, 'lemma': 'They', 'label': 'NSUBJ', 'char': ['Ultron', 'Vision', 'Clint Barton']}
16
{'pos': 'PRON', 'content': 'them', 'begin': 74, 'index': 18, 'lemma': 'them', 'label': 'DOBJ', 'char': ['Ultron', 'Vision', 'Clint Barton']}
17
{'pos': 'PRON', 'content': 'They', 'begin': 0, 'index': 1, 'lemma': 'They', 'label': 'NSUBJ', 'char': ['Steve Rogers', 'Strucker', 'Wanda Maximoff']}
20
{'pos': 'PRON', 'content': 'them', 'begin': 50, 'index': 12, 'lemma': 'them', 'label': 'NSUBJ', 'char': ['Steve Rogers', 'Ulysses Klaue', 'Tony Stark']}
21
{'pos': 'PRON', 'content': 'them', 'begin': 16, 'index': 3, 'lemma': 'them', 'label': 'DOBJ', 'char': ['Peggy Carter', 'Ultron']}
23
{'pos': 'PRON', 'content': 'they', 'begin': 10, 'index': 4, 'lemma': 'they', 'label': 'NSUBJ', 'char': ['Pietro M

Unnamed: 0,speaker,dialogue,sentences,sentiment,entities,tokens,total_sent,num_pron,nearbyChars,relations
0,Announcer,[first lines; announcement over speaker] Repor...,[{'content': u'[first lines; announcement over...,"{'score': -0.2, 'magnitude': 1.6}","[{'name': 'lines', 'salience': 0.35250518, 'me...","[{'pos': 'PUNCT', 'content': '[', 'begin': 0, ...",-0.32,3,"[Tony Stark, Clint Barton, narrator, Natasha R...","[{'ent2': 'Clint Barton', 'men2': ['We'], 'lin..."
1,narrator,the Avengers are in the process of infiltratin...,[{'content': u'the Avengers are in the process...,"{'score': 0.1, 'magnitude': 0.1}","[{'name': 'Avengers', 'salience': 0.47595453, ...","[{'pos': 'DET', 'content': 'the', 'begin': 0, ...",0.01,0,"[Steve Rogers, Tony Stark, Clint Barton, narra...",
2,Tony Stark,Shit!,"[{'content': u'Shit!', 'begin': 0, 'score': -0...","{'score': -0.6, 'magnitude': 0.6}",[],"[{'pos': 'X', 'content': 'Shit', 'begin': 0, '...",-0.36,0,"[narrator, Steve Rogers, Tony Stark, Clint Bar...",
3,Steve Rogers,"Language! JARVIS, what's the view from upstairs?","[{'content': u'Language!', 'begin': 0, 'score'...","{'score': 0, 'magnitude': 0.1}","[{'name': 'Language', 'salience': 0.7599061, '...","[{'pos': 'NOUN', 'content': 'Language', 'begin...",0.0,1,"[Steve Rogers, narrator, Steve Rogers, Tony St...",
4,JARVIS,The central building is protected by some kind...,[{'content': u'The central building is protect...,"{'score': 0.7, 'magnitude': 1.5}","[{'name': 'building', 'salience': 0.47500995, ...","[{'pos': 'DET', 'content': 'The', 'begin': 0, ...",1.05,1,"[Strucker, Steve Rogers, narrator, Steve Roger...","[{'ent2': 'Thor', 'men2': ['we'], 'line': 4, '..."


In [64]:
REEval([df], 5)


******** line 705 ********
703. Clint Barton:
[to Natasha] We got a window. Four, three...give 'em hell. [Natasha drops out of the Quinjet on a bike and rides towards the truck and picks up Steve's shield]

704. Natasha Romanoff:
I'm always picking up after you boys.

=> 705. Clint Barton:
=> They're heading under the overpass, I've got no shot.

706. Natasha Romanoff:
Which way?

707. Clint Barton:
Hard right... Now. [Natasha heads over the truck, she throws Steve back his shield and he uses it to knock off Ultron from him]

******** test model 1: line 705 ********
1 relations identified
entities: Sam Wilson => Pietro Maximoff-['They']
relation: belong to same team
category: 6. same team mentioning

how many are correctly identified? 1

******** line 837 ********
835. Ultron:
You're stalling to protect the people.

836. Tony Stark:
Well, that is the mission. Did you forget?

=> 837. Ultron:
=> I've moved beyond your mission. I'm free. [suddenly the Vibranium core he's placed beneath 

## Putting Everything Together, a Simple Query System <a class="anchor" id="Query"></a>
[[back to top]](#top)

In [68]:
def checkQuery(relationList, ent1, ent2, relationClass):
    for relation in relationList:
        if ent1 in relation['ent1'] and ent2 in relation['ent2'] and relationClass == relation['class']:
            return True
    return False

def printAnswer(row):
    print('Movie: {}, Line {}'.format(row.movie, row.lineNum))
    print('{}: {}'.format(row.speaker, row.dialogue))
    print()
    
def queryScore(relationList, query, relationClass):
    # lowercase the query so it is comparable with the lowercased relation words
    querySet = set(query.lower().split())
    resultScore = 0

    for relation in relationList:
        relationSet = set()
        # ent1/ent2 hold either a single name or a collection of names
        for key in ('ent1', 'ent2'):
            ents = relation[key]
            if isinstance(ents, str):
                ents = [ents]
            for ent in ents:
                relationSet |= set(ent.lower().split())

        relationSet |= set(relation['relation'].lower().split())
        relationSet |= set(relationClass[relation['class']].lower().split())

        # shared words over the combined size of both sets (at most 0.5)
        total = len(relationSet) + len(querySet)
        tempScore = len(relationSet & querySet) / total if total else 0

        resultScore = max(resultScore, tempScore)

    return resultScore

#Simple Query System

print('Select the movies of your interest:')
print('***Enter all to use all movies')
print('***Enter n, m, x, y (numbers separated by commas) for specific selections')
print('***Enter random, n for n random selections\n')

files = [x for x in os.listdir('prep_scripts') if '_gapi' in x]
for i, fileName in enumerate(files):
    print('{}. {}'.format(i+1, re.split(r'_tw_|_imsdb_', fileName)[0]))


x = input()


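# resolve the selection into a list of files
try:
    if 'random' in x:
        # "random, n": pick n distinct files at random
        queryFiles = np.random.choice(files, int(x.split(',')[-1]), replace=False)
    elif x.strip() != 'all':
        # "n, m, ...": 1-based indices into the list printed above
        queryFiles = np.array(files)[[int(select) - 1 for select in x.split(',')]]
    else:
        # use all files
        queryFiles = files
except (ValueError, IndexError):
    print('\nunexpected input, will use all movie files\n')
    queryFiles = files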

import ast  # literal_eval parses the stringified columns without executing arbitrary code

frames = []
charSet = set()
charRange = 10

for fileName in queryFiles:
    df = pd.read_csv('prep_scripts/' + fileName)[['speaker', 'dialogue', 'sentences', 'sentiment', 'entities', 'tokens']]
    # convert the Python-literal strings back to lists/dicts
    for col in ['tokens', 'sentiment', 'entities']:
        df[col] = df[col].apply(ast.literal_eval)
    df['total_sent'] = df['sentiment'].apply(lambda x: x['score'] * x['magnitude'])
    df['movie'] = re.split(r'_tw_|_imsdb_', fileName)[0]
    df['lineNum'] = df.index + 1

    # speakers from charRange lines ahead down to charRange lines behind
    nearbyList = np.dstack([df.speaker.shift(k).values for k in range(-charRange, charRange + 1)])[0]
    # DataFrame.set_value was removed in pandas 1.0; assign the per-row arrays directly
    df['nearbyChars'] = list(nearbyList)

    cList = list(df.speaker.unique())
    cDict = dict(df.groupby('speaker').total_sent.sum())

    #resolve entities
    df.apply(lambda x: pronResolution_nnMod(cList, x), axis=1)

    #extract relations
    df['relations'] = df.apply(lambda x: extract_relation_categories(x), axis=1)

    # keep only the lines that have at least one relation
    frames.append(df[df.relations.notnull()])
    charSet |= set(df.speaker.unique())

df_data = pd.concat(frames)

relationClasses = getRelations()
    
print('Type end to finish at any time')
print('Choose one of the following:')
print('1. Structured search')
print('2. Free form query')
searchType = int(input()) - 1

#relationList = df_data[df_data.hasRelation == True]['relations'].values

if not searchType:
    
    while True:
        print('Characters: ')
        print(charSet)
        print('\nRelations:')
        for k, v in relationClasses.items():
            print('{}. {}'.format(k+1, v))
        print('What relation are you looking for?')
        ent1 = input('Entity 1:')
        if ent1 == 'end':
            break
        ent2 = input('Entity 2:')
        if ent2 == 'end':
            break
        relationClass = int(input('Relation category: '))-1

        qMatch = df_data.relations.apply(lambda x: checkQuery(x, ent1, ent2, relationClass))
        if sum(qMatch) == 0:
            print('nothing found\n')
        else:
            df_data[qMatch].apply(lambda x: printAnswer(x), axis=1)

else:
    while True:
        query = input('Enter query')
        if query == 'end':
            break
        df = df_data.copy()
        df['queryScore'] = df.relations.apply(lambda x: queryScore(x, query, relationClasses))
        df = df.sort_values(by='queryScore', ascending=False).head().copy()
        df.apply(lambda x: printAnswer(x), axis=1)

Select the movies of your interest:
***Enter all to use all movies
***Enter n, m, x, y (numbers separated by commas) for specific selections
***Enter random, n for n random selections

1. ant-man
2. avengers_age_of_ultron
3. captain_america_civil_war
4. captain_america_the_first_avenger
5. captain_america_the_winter_soldier
6. fantastic_four
7. iron_man_3
8. lego_marvel_super_heroes
9. spider-man
10. the_amazing_spider-man_2
11. the_amazing_spider-man
12. the_avengers
13. the_wolverine
14. thor_the_dark_world
15. thor
16. x-men_apocalypse
17. x-men_days_of_future_past
18. x-men
19. x-men_the_last_stand
all
Type end to finish at any time
Choose one of the following:
1. Structured search
2. Free form query
2
Enter querycaptain america helps thor fight loki and iron man
Movie: lego_marvel_super_heroes, Line 348
Captain America: Colonel Fury, sir, Loki jumped into a Vortex and vanished.

Movie: lego_marvel_super_heroes, Line 386
Loki: Oh and so am I, brother! I intend to get my revenge on 
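
The free-form search ranks lines by word overlap between the query and the words attached to each extracted relation. The following is a minimal sketch of that scoring on made-up data. The helper name `overlap_score` and the sample relation words are assumptions for illustration and are not taken from the extracted relations. The sketch also lowercases both sides before comparing.

```python
def overlap_score(relation_words, query):
    """Shared words divided by the combined size of both word sets."""
    relation_set = {word.lower() for word in relation_words}
    query_set = set(query.lower().split())
    total = len(relation_set) + len(query_set)
    if total == 0:
        return 0.0
    return len(relation_set & query_set) / total


# Hypothetical relation words (entity names plus a relation phrase).
relation_words = ["Captain", "America", "Loki", "fight"]
query = "captain america helps thor fight loki and iron man"

score = overlap_score(relation_words, query)
print(round(score, 4))
```

Because the denominator is the sum of the two set sizes rather than the size of their union, this is not the Jaccard index: two identical word sets score 0.5, which is the maximum. Scores are therefore only meaningful relative to one another, which is all the ranking needs.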