There are no restrictions or requirements on what you analyze. To get you started, here are a few ideas:

- What are the topics of the script? You could break them down by character. Do these topics evolve over time (e.g., by act)?
- Who talks (spends time) with whom? Does it change over time? What fraction of the talk does each speaker contribute?
- What is the mood of each speaker (e.g., the average sentiment of the words they utter)? Does it change over time? Does it depend on who they are talking to or who or what they are talking about?
- Who or what does each speaker talk about? What is the speaker's sentiment toward each entity (i.e., what is the sentiment of the words that appear near it in the dialogue)? You will probably want to do coreference resolution (to identify which pronoun matches which noun) when identifying who talks about what or whom. In dialogues, participants agree about the antecedents of pronouns, so you may want to process consecutive utterances from the various dialogue participants as a single unit of text for the purpose of coreference resolution.
- What are the Named Entities (e.g., people, places, organizations) that appear in your script? Are there any generalizations about when or where they appear?
- What are the similarities between the characters in the script (e.g., defined in terms of vector similarity between each character’s ‘corpus’)? Which characters are the most and least similar to each other, and do these results have an intuitive explanation?
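
As one illustration of the last idea, the sketch below computes cosine similarity between bag-of-words vectors built from each character's tokens. The character names and token lists are made-up toy data, and `cosine_similarity` is a helper written for this example, not part of the notebook; raw counts are only one of several reasonable choices of vector (TF-IDF or embeddings would be alternatives).

```python
import math
from collections import Counter

def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two bag-of-words Counters."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty corpus is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical toy corpora: character name -> list of tokens
toy_corpora = {
    "A": ["chef", "go", "away", "chef"],
    "B": ["chef", "go", "long"],
    "C": ["beer", "vodka"],
}
vectors = {name: Counter(tokens) for name, tokens in toy_corpora.items()}

sim_ab = cosine_similarity(vectors["A"], vectors["B"])  # some shared words
sim_ac = cosine_similarity(vectors["A"], vectors["C"])  # no shared words
```

With real data, each character's Counter would be built from all of that character's tokens, and the pairwise similarities could then be ranked to find the most and least similar pairs.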

http://www.klintonbicknell.com/ling400fall2017/hw/hw4.html

In [43]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#enable IPython to display matplotlib graphs
%matplotlib inline
import csv

In [44]:
# read in the data and check its dimensions
data = pd.read_csv('All-seasons.csv')
data.shape

(70896, 4)

In [45]:
data.head(10)

Unnamed: 0,Season,Episode,Character,Line
0,10,1,Stan,"You guys, you guys! Chef is going away. \n"
1,10,1,Kyle,Going away? For how long?\n
2,10,1,Stan,Forever.\n
3,10,1,Chef,I'm sorry boys.\n
4,10,1,Stan,"Chef said he's been bored, so he joining a gro..."
5,10,1,Chef,Wow!\n
6,10,1,Mrs. Garrison,Chef?? What kind of questions do you think adv...
7,10,1,Chef,What's the meaning of life? Why are we here?\n
8,10,1,Mrs. Garrison,I hope you're making the right choice.\n
9,10,1,Cartman,I'm gonna miss him. I'm gonna miss Chef and I...


In [46]:
data.tail(10)

Unnamed: 0,Season,Episode,Character,Line
70886,9,14,Randy,"Oh. Well, tell you what: let's leave the car h..."
70887,9,14,Stan,All right!\n
70888,9,14,Randy,Come on! Or maybe I'll have three beers. \n
70889,9,14,Stan,That's probably okay if you spread it out.\n
70890,9,14,Randy,Well how about four?\n
70891,9,14,Stan,I think you're pushing it.\n
70892,9,14,Randy,How about twenty?\n
70893,9,14,Stan,That's not disciprine.\n
70894,9,14,Randy,Right right. Does vodka count?\n
70895,9,14,Stan,Dad!\n
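
With the data loaded, the question of what fraction of the talk each speaker contributes can be approached by counting lines per character. The sketch below does this for the ten speakers shown by `data.head(10)` above, using only the standard library; `speaker_shares` is a helper name chosen for this example, and counting lines (rather than words) is an assumption about how "talk" is measured.

```python
from collections import Counter

# Speakers of the first ten rows shown by data.head(10)
speakers = ["Stan", "Kyle", "Stan", "Chef", "Stan", "Chef",
            "Mrs. Garrison", "Chef", "Mrs. Garrison", "Cartman"]

def speaker_shares(speaker_list):
    """Map each speaker to their fraction of the lines in speaker_list."""
    counts = Counter(speaker_list)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

shares = speaker_shares(speakers)
```

On the full DataFrame, `data['Character'].value_counts(normalize=True)` should give the same kind of result in one call.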


## Parsing Dialogue Text Using Stanford CoreNLP
- Sentence splitting, tokenization, normalization (lemmatization)
- Stop word removal

# Before calling the server from Python, start it in a terminal from the directory where the Stanford CoreNLP source folder is located.
# Source folder: C:\Users\Dingding\stanford-corenlp-full-2017-06-09
# Command line: java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
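
Because the notebook only works once the server is running, it can help to check that something is listening on the port before sending 70,896 requests. The sketch below is a generic TCP check using the standard library; `port_is_open` is a hypothetical helper, and an open port only suggests, rather than proves, that the CoreNLP server is the process behind it.

```python
import socket

def port_is_open(host="localhost", port=9000, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds, else False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts, and name-resolution errors
        return False

server_ready = port_is_open()
```

If `server_ready` is `False`, start the server with the command line above before running the cells that follow.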

In [47]:
from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000')

In [48]:
# read the Line column into a list of strings, one per line of dialogue:
lines=data['Line'].tolist()
lines

['You guys, you guys! Chef is going away. \n',
 'Going away? For how long?\n',
 'Forever.\n',
 "I'm sorry boys.\n",
 "Chef said he's been bored, so he joining a group called the Super Adventure Club. \n",
 'Wow!\n',
 'Chef?? What kind of questions do you think adventuring around the world is gonna answer?!\n',
 "What's the meaning of life? Why are we here?\n",
 "I hope you're making the right choice.\n",
 "I'm gonna miss him.  I'm gonna miss Chef and I...and I don't know how to tell him! \n",
 'Dude, how are we gonna go on? Chef was our fuh...f-ffriend. \n',
 'And we will all miss you, Chef,  but we know you must do what your heart tells you..\n',
 'Bye-bye!\n',
 'Good-bye!\n',
 'So long!\n',
 'So long, Chef!\n',
 'Good-bye, Chef!\n',
 'Good-bye, Chef! Have a great time with the Super Adventure Club!\n',
 'Good-bye! ..\n',
 'Draw two card, fatass.\n',
 'Reverse to you, Jew. \n',
 "I'll get it. \n",
 'Hello there, children!\n',
 "He's back!\n",
 'Yeah!\n',
 'All right! \n',
 "Chef! I ca

In [52]:
# tokenize and lemmatize each line of dialogue
list_parsed_lines = []
for line in lines:
    this_parsed_line = nlp.annotate(line, properties={
        'annotators': 'tokenize,ssplit,pos,lemma',  # lemma depends on ssplit and pos
        'outputFormat': 'json'
    })
    list_parsed_lines.append(this_parsed_line)
list_parsed_lines[0:3]  # one dictionary per line of dialogue

[{'sentences': [{'index': 0,
    'tokens': [{'after': ' ',
      'before': '',
      'characterOffsetBegin': 0,
      'characterOffsetEnd': 3,
      'index': 1,
      'lemma': 'you',
      'originalText': 'You',
      'pos': 'PRP',
      'word': 'You'},
     {'after': '',
      'before': ' ',
      'characterOffsetBegin': 4,
      'characterOffsetEnd': 8,
      'index': 2,
      'lemma': 'guy',
      'originalText': 'guys',
      'pos': 'NNS',
      'word': 'guys'},
     {'after': ' ',
      'before': '',
      'characterOffsetBegin': 8,
      'characterOffsetEnd': 9,
      'index': 3,
      'lemma': ',',
      'originalText': ',',
      'pos': ',',
      'word': ','},
     {'after': ' ',
      'before': ' ',
      'characterOffsetBegin': 10,
      'characterOffsetEnd': 13,
      'index': 4,
      'lemma': 'you',
      'originalText': 'you',
      'pos': 'PRP',
      'word': 'you'},
     {'after': '',
      'before': ' ',
      'characterOffsetBegin': 14,
      'characterOffsetEnd': 18

In [60]:
len(list_parsed_lines) #70,896 dictionary objects for 70,896 lines of dialogue

70896
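
The nested structure shown above (a `'sentences'` list, each with a `'tokens'` list of dictionaries) can be flattened into lemmas with a small helper. The sketch below assumes that a failed or timed-out request might come back as something other than a dictionary, which is an assumption worth checking against your client version; `lemmas_from_annotation` and `toy_result` are names invented for this example.

```python
def lemmas_from_annotation(parsed):
    """Return the lemmas in one annotation result, or [] if it is not a dict."""
    if not isinstance(parsed, dict):
        return []  # assumption: non-dict results signal a failed request
    return [tok["lemma"]
            for sentence in parsed.get("sentences", [])
            for tok in sentence.get("tokens", [])]

# Toy result mirroring the shape of the output shown above
toy_result = {"sentences": [{"index": 0, "tokens": [
    {"index": 1, "word": "You", "lemma": "you", "pos": "PRP"},
    {"index": 2, "word": "guys", "lemma": "guy", "pos": "NNS"},
]}]}

toy_lemmas = lemmas_from_annotation(toy_result)
failed_lemmas = lemmas_from_annotation("request timed out")
```

Counting how many entries of `list_parsed_lines` are not dictionaries would be a quick way to see whether any lines need to be re-sent.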

In [63]:
# collect the lemmas for each line of dialogue,
# drop stop words and punctuation,
# and append each line's token list to list_token_sets

# stop words come from nltk (run nltk.download('stopwords') once if the corpus is missing)
from nltk.corpus import stopwords

stopset = ['I','a','and','to','the','in','of','my','for','with','that','as','at','from','is','on','have','me','be','an','it','this',
          ',', '.', '?', '!', '?!','??','!!','-',':',';']
# a set makes the membership test below fast for every token
cachedStopWords = set(stopset + stopwords.words("english"))

list_token_sets = []
for line_dict in list_parsed_lines:
    tokens_this_line = []
    for s in line_dict['sentences']:
        lemmas_this_s = [token['lemma'] for token in s['tokens']]
        tokens_this_line += [lemma for lemma in lemmas_this_s if lemma not in cachedStopWords]
    list_token_sets.append(tokens_this_line)  # list of token lists, one per line of dialogue

print(list_token_sets[0:10])  # print the first 10 elements to check

[['guy', 'guy', 'chef', 'go', 'away'], ['go', 'away', 'long'], ['forever'], ['sorry', 'boy'], ['chef', 'say', 'bore', 'join', 'group', 'call', 'Super', 'adventure', 'Club'], ['wow'], ['chef', 'kind', 'question', 'think', 'adventuring', 'around', 'world', 'gon', 'na', 'answer'], ['meaning', 'life'], ['hope', 'make', 'right', 'choice'], ['gon', 'na', 'miss', 'gon', 'na', 'miss', 'chef', 'i.', 'know', 'tell']]
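
A frequency count over the filtered tokens is a quick way to see what survived stop word removal, including leftovers such as `'gon'`, `'na'`, and `'i.'` in the output above. The sketch below counts tokens for the first and tenth lines copied from that output; whether such leftovers should be added to the stop list is a judgment call left open here.

```python
from collections import Counter

# First and tenth token lists copied from the printed output above
sample_token_lists = [
    ['guy', 'guy', 'chef', 'go', 'away'],
    ['gon', 'na', 'miss', 'gon', 'na', 'miss', 'chef', 'i.', 'know', 'tell'],
]

# Flatten the lists and count how often each token occurs
token_counts = Counter(token for line in sample_token_lists for token in line)
most_frequent = token_counts.most_common(5)
```

Applied to the whole of `list_token_sets`, the same count gives a first, rough view of the script's most common content words.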


In [73]:
len(list_token_sets) #match original records in data

70896

In [76]:
# save the list object to a local file using pickle
# (the file is binary despite its .txt extension)
import pickle

with open("parsed_Line.txt", "wb") as f:   # pickling
    pickle.dump(list_token_sets, f)

with open("parsed_Line.txt", "rb") as f:   # unpickling
    test = pickle.load(f)

assert test == list_token_sets  # round trip succeeded

In [66]:
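# add list_token_sets to the DataFrame as a new column 'parsed_Line'
# (a Series keeps one token list per row; np.asarray on lists of unequal length
#  raises an error in recent NumPy versions unless dtype=object is given)
data['parsed_Line'] = pd.Series(list_token_sets, index=data.index)
data.head(10)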

Unnamed: 0,Season,Episode,Character,Line,parsed_Line
0,10,1,Stan,"You guys, you guys! Chef is going away. \n","[guy, guy, chef, go, away]"
1,10,1,Kyle,Going away? For how long?\n,"[go, away, long]"
2,10,1,Stan,Forever.\n,[forever]
3,10,1,Chef,I'm sorry boys.\n,"[sorry, boy]"
4,10,1,Stan,"Chef said he's been bored, so he joining a gro...","[chef, say, bore, join, group, call, Super, ad..."
5,10,1,Chef,Wow!\n,[wow]
6,10,1,Mrs. Garrison,Chef?? What kind of questions do you think adv...,"[chef, kind, question, think, adventuring, aro..."
7,10,1,Chef,What's the meaning of life? Why are we here?\n,"[meaning, life]"
8,10,1,Mrs. Garrison,I hope you're making the right choice.\n,"[hope, make, right, choice]"
9,10,1,Cartman,I'm gonna miss him. I'm gonna miss Chef and I...,"[gon, na, miss, gon, na, miss, chef, i., know,..."


In [69]:
#write parsed data to local csv file:
data.to_csv('South_Park_Dialogues_Parsed.csv', sep=',', header=True, index=False)
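
One thing to keep in mind when reading this file back: a CSV cell can only hold text, so each token list is stored as its string representation. The sketch below shows the round trip on two toy rows using the standard library's `csv` module (standing in for pandas so the example is self-contained) and recovers the lists with `ast.literal_eval`.

```python
import ast
import csv
import io

# Two toy rows shaped like the parsed DataFrame
rows = [
    {"Character": "Stan", "parsed_Line": ["guy", "guy", "chef", "go", "away"]},
    {"Character": "Kyle", "parsed_Line": ["go", "away", "long"]},
]

# Write to an in-memory CSV; non-string values are stored via str()
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Character", "parsed_Line"])
writer.writeheader()
writer.writerows(rows)

# Read back: parsed_Line is now a string such as "['go', 'away', 'long']"
buffer.seek(0)
raw_rows = list(csv.DictReader(buffer))

# literal_eval safely turns that string back into a Python list
restored = [
    {"Character": r["Character"],
     "parsed_Line": ast.literal_eval(r["parsed_Line"])}
    for r in raw_rows
]
```

With pandas, passing `converters={'parsed_Line': ast.literal_eval}` to `pd.read_csv` should achieve the same thing when reloading `South_Park_Dialogues_Parsed.csv`.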