# Natural Language Processing

Natural Language Processing (NLP) focuses on computer understanding of human language. The future of text-based NLP may include more accurate search, universal translation, text summarization, conversational coding, androids, and more.

Many challenges currently stand in the way of these goals. A short list includes:
1. Polysemy - words can have multiple meanings; which meaning is the correct one?
2. Fluidity of syntax and grammar - what rules to use to break down a sentence?
3. Errors - misspellings or incorrect grammar can derail brittle analysis
4. Semantics - meaning can change on a word and sentence level
5. Context - how much is plainly written and how much must be inferred?
6. Evolution of language - rarely are languages in stasis.


The basic outline for NLP flows through four **very broad** tiers/categories of increasing scope and difficulty:
1. Morphological processing - What are the discrete units (tokens) of meaning? In English this is relatively trivial because "words" are distinctly separated; however, words can be broken into subwords. For example, "incoherently" has a prefix "in-", a root "coherent", and a suffix "-ly". Each part changes the meaning and usage. Tokens can have multiple meanings (polysemy), and the exact meaning and type (e.g. noun, verb, etc.) of a word may be ambiguous at this point.

2. Syntax/Grammar processing - What is the structure of the sentence? Do the words interact correctly? By looking at which tokens are in a string as well as what *order* they are in, we can determine relationships between tokens based on rules of definitions (lexicon) and syntax (grammar/structure). This processing can convert a sentence like "The large cat chased the rat" into a formal notation such as "Article Adjective Noun Verb Article Noun", or further into "Noun-Phrase Verb Noun-Phrase" (see the Lkit PDF linked below for a tree visualization). Grammar can disambiguate the meanings of "brush" in the sentences "**Brush** your hair" (verb) vs. "Hand me the **brush**" (noun).

3. Semantic Analysis - What is the meaning of a string (sentence) of tokens (words)? The relationship of words in the syntactic framework allows us to disambiguate the meaning of the words. For example, in the sentence "He put a carrot on the plate and then ate **it**", we need semantics to determine what "it" is - in this case "it" is the carrot, not the plate.

4. Pragmatic (contextual) analysis - What is the meaning with respect to the entire context? There are many phrases that are still ambiguous after semantic analysis, such as "put the apple in the basket on the shelf.", which can have two meanings:
 - put the apple which is *currently in the basket* on the shelf
 - put the apple into the basket which is *currently on the shelf*

   Although this is a trivial example, the "correct" answer depends on the current state of the apple and basket, which may have been determined in previous sentences. Humor and sarcasm are extremely advanced forms of contextual understanding: "Trump is definitely the best president ever" can mean completely opposite things depending on who is saying it, and when they say it. Interpreting it requires both an understanding of the current state of a broad range of topics and the history of the person saying it.
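As a rough illustration of tiers 1 and 2, the toy sketch below tags each word of the example sentence from a tiny hand-written lexicon and then collapses article/adjective/noun runs into noun phrases. The `LEXICON` table and the `tag` and `chunk` helpers are hypothetical names made up for this example; a real tagger would need to handle ambiguity and unknown words.

```python
import re

# Hypothetical mini-lexicon covering only the example sentence.
LEXICON = {
    "the": "Article",
    "large": "Adjective",
    "cat": "Noun",
    "rat": "Noun",
    "chased": "Verb",
}

def tag(sentence):
    """Split on whitespace and look each lowercased word up in LEXICON."""
    return [LEXICON[word.lower()] for word in sentence.split()]

def chunk(tags):
    """Collapse 'Article? Adjective* Noun' runs into a single Noun-Phrase."""
    joined = " ".join(tags)
    pattern = r"(?:Article )?(?:Adjective )*Noun"
    return re.sub(pattern, "Noun-Phrase", joined).split()

tags = tag("The large cat chased the rat")
print(tags)
print(chunk(tags))
```

Running this reproduces the two notations from tier 2: the flat tag sequence first, then the phrase-level sequence.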

A few interesting links that helped me create this document (as I am still learning!)

[Algorithmia - What is NLP?](https://blog.algorithmia.com/introduction-natural-language-processing-nlp/)

[Lkit NLP intro](https://www.scm.tees.ac.uk/isg/aia/nlp/NLP-overview.pdf) 

[Zareen Syed's slideshare](https://www.slideshare.net/zareen/challenges-in-nlp)

[tutorialspoint intro to NLP](https://www.tutorialspoint.com/artificial_intelligence/artificial_intelligence_natural_language_processing.htm)

[Analytics Vidhya guide to NLP](https://www.analyticsvidhya.com/blog/2017/01/ultimate-guide-to-understand-implement-natural-language-processing-codes-in-python/)



## Important steps breakdown

### Preprocessing (even more than before!!!)
#### Remove Noise
- remove sentences tagged as scene action
- remove language stopwords such as "is", "a", "this". These are very common words that do not help in determining the context of other words.

#### Lexicon Normalization
- compressing multiple representations of the same word into one with stemming (strip suffixes)
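A toy sketch of suffix stripping is shown below. The `SUFFIXES` tuple and the `crude_stem` function are hypothetical and far simpler than a real stemmer such as the Porter algorithm; they only illustrate how several surface forms can collapse to one key.

```python
# Hypothetical suffix list, checked in order so that "es" is tried before "s".
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["talk", "talks", "talked", "talking"]:
    print(w, "->", crude_stem(w))
```

The `min_stem` guard keeps very short words such as "is" from being reduced to a single letter.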

### Potential outputs at this stage:

- Statistical: word and sentence counts per character
- tf-idf: term frequency inverse document frequency. This finds the frequency of words in a subset, and normalizes it by the frequency of the same word in the entire set. It finds the relative importance of a word in the subset vs the whole. It can be used to determine if a word is more frequent in a specific episode than it is in the whole show, or if a word is used more frequently by a specific character than all the characters.

Either of the above can be used to create word cloud outputs per character, episode, etc.
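A minimal sketch of the statistical counts mentioned above, using only the standard library. The `lines` sample is a few (character, line) pairs taken from the pilot transcript for illustration; the real analysis would read them from the transcript file.

```python
from collections import Counter, defaultdict

lines = [
    ("Rick", "I got a surprise for you, Morty."),
    ("Morty", "What?! A bomb?!"),
    ("Morty", "Jessica? From my math class?"),
]

line_counts = Counter()
word_counts = defaultdict(Counter)

for character, line in lines:
    line_counts[character] += 1
    # Lowercase and strip surrounding punctuation from each whitespace-separated word.
    words = [w.strip(".,?!").lower() for w in line.split()]
    word_counts[character].update(w for w in words if w)

print(line_counts)
print(word_counts["Morty"].most_common(3))
```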

### Advanced syntactic/semantic processing
#### Object Standardization
- Many domain-specific words/acronyms are not in standard lexical dictionaries.

#### Syntactic Parsing
- analysis of grammar and arrangement. We want to tag words with their relationship to the other words in a sentence.



Sentiment analysis:

In [1]:
import pandas as pd
import re

In [2]:
df=pd.read_csv('clean_RandMtranscript.csv')

In [3]:
df.columns

Index(['Sentence_id', 'Season', 'Episode', 'Episode_num', 'Episode_id',
       'Character', 'Line'],
      dtype='object')

In [4]:
df.Character.value_counts()

Scene Action:                      1608
Rick:                               930
Morty:                              681
Jerry:                              521
Summer:                             356
Beth:                               335
 Rick:                              284
 Morty:                             220
 Summer:                            100
Pickle Rick:                         83
 Jerry:                              75
 Beth:                               55
Unity :                              54
All Ricks:                           36
Mr. Needful:                         34
Birdperson:                          34
Tiny Rick:                           33
Rick :                               32
Dr. Bloom:                           31
Arthrisha:                           31
 Hemorrhage:                         28
President:                           27
Jessica:                             26
Principal Vagina:                    26
Testicle Monster A:                  26


In [5]:
df['Character']=df.Character.str.strip()
df.Character.value_counts()

Scene Action:               1608
Rick:                       1215
Morty:                       901
Jerry:                       596
Summer:                      456
Beth:                        390
Pickle Rick:                  83
Unity :                       54
All Ricks:                    37
Mr. Goldenfold:               36
Birdperson:                   34
Mr. Needful:                  34
Jessica:                      34
Rick :                        33
Tiny Rick:                    33
Principal Vagina:             31
Dr. Bloom:                    31
Arthrisha:                    31
Hemorrhage:                   28
President:                    27
Testicle Monster A:           26
All:                          26
Rick 1:                       25
Cornvelious Daniel:           25
Fart:                         25
Cop Morty:                    25
Tammy:                        24
Dr. Wong:                     23
Agency Director:              23
Annie:                        22
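The counts above still list `Rick :` separately from `Rick:`, because `str.strip()` only removes leading and trailing whitespace. One possible clean-up, sketched here on plain strings with a hypothetical `clean_label` helper, drops the trailing colon and any whitespace around it:

```python
import re

def clean_label(label):
    """Trim outer whitespace and remove a trailing colon plus spaces before it."""
    return re.sub(r"\s*:\s*$", "", label.strip())

raw = ["Rick:", " Rick:", "Rick :", "Unity :", "Pickle Rick:"]
print([clean_label(r) for r in raw])
```

On the DataFrame this could be applied with `df['Character'].apply(clean_label)`.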
          

In [6]:
df[(df.Character.str.contains('Morty'))]


Unnamed: 0,Sentence_id,Season,Episode,Episode_num,Episode_id,Character,Line
2,3,1,Pilot,1,1,Morty:,"What, Rick? What’s going on?"
4,5,1,Pilot,1,1,Morty:,It's the middle of the night. What are you tal...
6,7,1,Pilot,1,1,Morty:,Ow! Ow! You're tugging me too hard!
10,11,1,Pilot,1,1,Morty:,"Yeah, Rick... I-it's great. Is this the surprise?"
12,13,1,Pilot,1,1,Morty:,What?! A bomb?!
14,15,1,Pilot,1,1,Morty:,T-t-that's absolutely crazy!
16,17,1,Pilot,1,1,Morty:,Jessica? From my math class?
18,19,1,Pilot,1,1,Morty:,Ohh...
20,21,1,Pilot,1,1,Morty:,Whhhh-wha?
22,23,1,Pilot,1,1,Morty:,"No, you can't! Jessica doesn't even know I e..."


In [7]:
df[(~df.Character.str.contains('except'))]

Unnamed: 0,Sentence_id,Season,Episode,Episode_num,Episode_id,Character,Line
0,1,1,Pilot,1,1,Scene Action:,[Open to Morty’s room]
1,2,1,Pilot,1,1,Rick:,Morty! You gotta come on. Jus'... you gotta ...
2,3,1,Pilot,1,1,Morty:,"What, Rick? What’s going on?"
3,4,1,Pilot,1,1,Rick:,"I got a surprise for you, Morty."
4,5,1,Pilot,1,1,Morty:,It's the middle of the night. What are you tal...
5,6,1,Pilot,1,1,Rick:,
6,7,1,Pilot,1,1,Morty:,Ow! Ow! You're tugging me too hard!
7,8,1,Pilot,1,1,Rick:,"We gotta go, gotta get outta here, come on. Go..."
8,9,1,Pilot,1,1,Scene Action:,[Cut to Rick's ship]
9,10,1,Pilot,1,1,Rick:,"What do you think of this... flying vehicle,..."


In [8]:
df[(df.Character.str.contains('Morty')) &
   (~df.Character.str.contains('except'))]


Unnamed: 0,Sentence_id,Season,Episode,Episode_num,Episode_id,Character,Line
2,3,1,Pilot,1,1,Morty:,"What, Rick? What’s going on?"
4,5,1,Pilot,1,1,Morty:,It's the middle of the night. What are you tal...
6,7,1,Pilot,1,1,Morty:,Ow! Ow! You're tugging me too hard!
10,11,1,Pilot,1,1,Morty:,"Yeah, Rick... I-it's great. Is this the surprise?"
12,13,1,Pilot,1,1,Morty:,What?! A bomb?!
14,15,1,Pilot,1,1,Morty:,T-t-that's absolutely crazy!
16,17,1,Pilot,1,1,Morty:,Jessica? From my math class?
18,19,1,Pilot,1,1,Morty:,Ohh...
20,21,1,Pilot,1,1,Morty:,Whhhh-wha?
22,23,1,Pilot,1,1,Morty:,"No, you can't! Jessica doesn't even know I e..."


In [9]:
caseexcept=(~df.Character.str.contains('[eE]xcept'))

In [10]:
dfMorty=df[(df.Character.str.contains('Morty')) & caseexcept]

In [11]:
len(dfMorty)

1075

In [12]:
dfRick=df[(df.Character.str.contains('Rick')) & caseexcept]

In [13]:
dfRick.head()

Unnamed: 0,Sentence_id,Season,Episode,Episode_num,Episode_id,Character,Line
1,2,1,Pilot,1,1,Rick:,Morty! You gotta come on. Jus'... you gotta ...
3,4,1,Pilot,1,1,Rick:,"I got a surprise for you, Morty."
5,6,1,Pilot,1,1,Rick:,
7,8,1,Pilot,1,1,Rick:,"We gotta go, gotta get outta here, come on. Go..."
9,10,1,Pilot,1,1,Rick:,"What do you think of this... flying vehicle,..."


In [14]:
len(dfRick)

1639

In [15]:
dfBeth=df[(df.Character.str.contains('Beth')) & caseexcept]

In [16]:
dfBeth.head()

Unnamed: 0,Sentence_id,Season,Episode,Episode_num,Episode_id,Character,Line
46,47,1,Pilot,1,1,Beth:,"Morty, are you getting sick?"
52,53,1,Pilot,1,1,Beth:,Dad?
55,56,1,Pilot,1,1,Beth:,Jerry!
61,62,1,Pilot,1,1,Beth:,"Oh, dad…"
111,112,1,Pilot,1,1,Beth:,Scalpel.


In [17]:
len(dfBeth)

412

In [18]:
dfJerry=df[(df.Character.str.contains('Jerry')) & caseexcept]

In [19]:
len(dfJerry)

640

In [20]:
dfSummer=df[(df.Character.str.contains('Summer')) & caseexcept]

In [21]:
len(dfSummer)

516

In [22]:
dfRick.Character.value_counts()

Rick:                            1215
Pickle Rick:                       83
All Ricks:                         37
Tiny Rick:                         33
Rick :                             33
Rick 1:                            25
Doofus Rick:                       18
Cop Rick:                          15
Evil Rick:                         14
Rick 2:                            13
Young Rick:                         9
Rick Council 1:                     8
Rick 30:                            7
Teacher Rick:                       6
Rick D716:                          6
Alternate Rick:                     5
Rick D716-B:                        5
Rick K-22:                          5
Commander in Chief Rick/Rick:       4
Juggling Rick:                      4
Rick/Rick:                          4
Guard Rick:                         4
All Ricks :                         3
Cornvelious Daniel/Rick:            3
Rick J-22:                          3
Other Rick:                         3
Rick 4:     

In [23]:
dfMorty.Character.value_counts()

Morty:                         901
Cop Morty:                      25
All Mortys:                     18
Morty :                         15
Morty 2:                        14
Candidate Morty:                12
Morty 1:                        10
Mechanical Morty:                9
Campaign Manager Morty:          8
Morty 30:                        6
Lizard Morty:                    4
Morty and Summer:                3
Evil Morty:                      3
Morty Mart Morty:                3
"Lawyer" Morty:                  3
Religious Morty:                 3
Mortytown Loco:                  2
Candidate Morty :                2
Morty 23:                        2
Morty K-22:                      2
Fat Morty:                       2
Glasses Morty:                   2
Morty 1, 2 and 3:                2
Summer 1 & Morty 2:              1
Morty 1 & Summer 2:              1
Tall Morty:                      1
All religious Mortys:            1
Oh! Morty:                       1
Hammerhead Morty:   
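Because `str.contains` matches substrings, `dfMorty` also picks up labels such as `Morty and Summer:` and `Mortytown Loco:`. If only the main character is wanted, an exact match on the cleaned label is one option. The sketch below works on a plain list, and `is_main_character` is a hypothetical helper.

```python
def is_main_character(label, name):
    """True only when the label, minus colon and spaces, equals name exactly."""
    return label.strip().rstrip(":").strip() == name

labels = ["Morty:", "Morty :", "Cop Morty:", "Morty and Summer:", "Mortytown Loco:"]
print([lab for lab in labels if is_main_character(lab, "Morty")])
```

Whether the variants (Cop Morty, Evil Morty, and so on) belong in the same group is a judgment call that depends on the question being asked.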

## Noise removal - remove stop words

Stop words are commonly used, usually low-context words such as "is", "the", "of", "in", etc. If they were not removed, they would likely dominate any statistical analysis that we could do on characters.

Natural Language Toolkit (NLTK) is a commonly used NLP toolkit for Python. It includes a list of stop words and filtering abilities. Here is an example usage from [pythonspot](https://pythonspot.com/nltk-stop-words/).

Although I already split the data just to get an idea of what is going on, it probably is better to do all this noise removal before splitting, and then recreate splits afterwards.


In [24]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import string

In [25]:
list(string.punctuation)

['!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 '{',
 '|',
 '}',
 '~']

In [26]:
data = "All work and no play make's jack dull boy. All work and no play makes jack a dull boy."
stopWords = set(stopwords.words('english'))
words = word_tokenize(data)
wordsFiltered = []
 
for w in words:
    w=w.lower()
    print(w)
    if w not in stopWords and w not in list(string.punctuation):
        wordsFiltered.append(w)
 
print(wordsFiltered)



all
work
and
no
play
make
's
jack
dull
boy
.
all
work
and
no
play
makes
jack
a
dull
boy
.
['work', 'play', 'make', "'s", 'jack', 'dull', 'boy', 'work', 'play', 'makes', 'jack', 'dull', 'boy']


In [27]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [28]:
df['Line_no_stopwords'] = df['Line'].apply(lambda x: ' '.join([word for word in word_tokenize(x.lower())
                                                               if word not in list(string.punctuation)]))
# First pass: only punctuation is removed here. The stop-word filter
# (word not in stopWords) is added once the text has been cleaned up below.


In [29]:
df

Unnamed: 0,Sentence_id,Season,Episode,Episode_num,Episode_id,Character,Line,Line_no_stopwords
0,1,1,Pilot,1,1,Scene Action:,[Open to Morty’s room],open to morty ’ s room
1,2,1,Pilot,1,1,Rick:,Morty! You gotta come on. Jus'... you gotta ...,morty you got ta come on jus ... you got ta co...
2,3,1,Pilot,1,1,Morty:,"What, Rick? What’s going on?",what rick what ’ s going on
3,4,1,Pilot,1,1,Rick:,"I got a surprise for you, Morty.",i got a surprise for you morty
4,5,1,Pilot,1,1,Morty:,It's the middle of the night. What are you tal...,it 's the middle of the night what are you tal...
5,6,1,Pilot,1,1,Rick:,,
6,7,1,Pilot,1,1,Morty:,Ow! Ow! You're tugging me too hard!,ow ow you 're tugging me too hard
7,8,1,Pilot,1,1,Rick:,"We gotta go, gotta get outta here, come on. Go...",we got ta go got ta get outta here come on got...
8,9,1,Pilot,1,1,Scene Action:,[Cut to Rick's ship],cut to rick 's ship
9,10,1,Pilot,1,1,Rick:,"What do you think of this... flying vehicle,...",what do you think of this ... flying vehicle m...


In [30]:
for w in word_tokenize(df.Line_no_stopwords[2]):
    if w not in list(string.punctuation):
        print(w)

what
rick
what
’
s
going
on


This is not quite right. In the example above, the apostrophe in "What’s" is a typographic (curly) quote, which is not in `string.punctuation`, so it survives as its own token. Converting everything to ASCII first should get rid of these characters.

I'll use unidecode to convert all non-ASCII Unicode characters to close approximations in ASCII.

In [31]:
from unidecode import unidecode
unidecode(df.Line[2])

"  What, Rick? What's going on?"

In [32]:
df['Line']=df.Line.apply(lambda x: unidecode(x))

In [33]:
df.Line[2]

"  What, Rick? What's going on?"

In [34]:
word_tokenize(df.Line[2])

['What', ',', 'Rick', '?', 'What', "'s", 'going', 'on', '?']
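If pulling in `unidecode` is not an option, the curly quotes that caused the trouble can also be mapped by hand with `str.translate`. This is only a sketch: the `QUOTE_MAP` table is a hypothetical name and covers just four characters, whereas `unidecode` handles far more.

```python
# Map typographic quotes to their ASCII counterparts.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})

def ascii_quotes(text):
    """Replace curly quotes with straight ones, leaving other characters alone."""
    return text.translate(QUOTE_MAP)

print(ascii_quotes("What, Rick? What\u2019s going on?"))
```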

In [35]:
word_tokenize(df.Line[1])

['Morty',
 '!',
 'You',
 'got',
 'ta',
 'come',
 'on',
 '.',
 'Jus',
 "'",
 '...',
 'you',
 'got',
 'ta',
 'come',
 'with',
 'me',
 '.']

In [36]:
word_tokenize(df.Line[4])

['It',
 "'s",
 'the',
 'middle',
 'of',
 'the',
 'night',
 '.',
 'What',
 'are',
 'you',
 'talking',
 'about',
 '?']

In [37]:
word_tokenize("this this' this's this'this thisn't")

['this', 'this', "'", 'this', "'s", "this'this", 'this', "n't"]

Try again with the same tokenization, this time also removing stop words.

In [38]:
df['Line_no_stopwords'] = df['Line'].apply(lambda x: ' '.join([word for word in word_tokenize(x.lower())
                                                               if word not in stopWords and word not in list(string.punctuation)]))


In [39]:
df[['Line','Line_no_stopwords']]

Unnamed: 0,Line,Line_no_stopwords
0,[Open to Morty's room],open morty 's room
1,Morty! You gotta come on. Jus'... you gotta ...,morty got ta come jus ... got ta come
2,"What, Rick? What's going on?",rick 's going
3,"I got a surprise for you, Morty.",got surprise morty
4,It's the middle of the night. What are you tal...,'s middle night talking
5,,
6,Ow! Ow! You're tugging me too hard!,ow ow 're tugging hard
7,"We gotta go, gotta get outta here, come on. Go...",got ta go got ta get outta come got surprise m...
8,[Cut to Rick's ship],cut rick 's ship
9,"What do you think of this... flying vehicle,...",think ... flying vehicle morty built outta stu...


In [40]:
df.Line_no_stopwords[df.Line_no_stopwords.str.contains("'")]

0                                      open morty 's room
2                                           rick 's going
4                                 's middle night talking
6                                  ow ow 're tugging hard
8                                        cut rick 's ship
10                   yeah rick ... i-it 's great surprise
13      're gon na drop get whole fresh start morty cr...
14                           t-t-that 's absolutely crazy
15      come morty take easy morty 's gon na good righ...
17      drop bomb know want somebody know want thing '...
19                                  jessica 's gon na eve
21                                      's surprise morty
22      ca n't jessica n't even know exist -- forget c...
23      i-i get 're trying say morty listen 'm ... n't...
25      you-you n't worry getting jessica anything sh-...
26                     n't care jessica y-yyyyyyyyyyou --
27                                   know morty 're right
31      'm tak

All common contractions are part of the stop list; however, `word_tokenize` splits them into fragments such as `'s`, `'re`, and `n't`, which the stop list does not match. Either way, I can remove the tokens that contain apostrophes.

In [41]:
for word in df['Line_no_stopwords'].iloc[25].split():
    print(word)
    if word.find("'")>-1:
        print('apostrophe')

you-you
n't
apostrophe
worry
getting
jessica
anything
sh-sh-she
--
's
apostrophe
morty


In [42]:
df['Line_no_stopwords']=df['Line_no_stopwords'].apply(lambda x: ' '.join(word for word in x.split()
                                                if word.find("'")==-1))

In [43]:
df

Unnamed: 0,Sentence_id,Season,Episode,Episode_num,Episode_id,Character,Line,Line_no_stopwords
0,1,1,Pilot,1,1,Scene Action:,[Open to Morty's room],open morty room
1,2,1,Pilot,1,1,Rick:,Morty! You gotta come on. Jus'... you gotta ...,morty got ta come jus ... got ta come
2,3,1,Pilot,1,1,Morty:,"What, Rick? What's going on?",rick going
3,4,1,Pilot,1,1,Rick:,"I got a surprise for you, Morty.",got surprise morty
4,5,1,Pilot,1,1,Morty:,It's the middle of the night. What are you tal...,middle night talking
5,6,1,Pilot,1,1,Rick:,,
6,7,1,Pilot,1,1,Morty:,Ow! Ow! You're tugging me too hard!,ow ow tugging hard
7,8,1,Pilot,1,1,Rick:,"We gotta go, gotta get outta here, come on. Go...",got ta go got ta get outta come got surprise m...
8,9,1,Pilot,1,1,Scene Action:,[Cut to Rick's ship],cut rick ship
9,10,1,Pilot,1,1,Rick:,"What do you think of this... flying vehicle,...",think ... flying vehicle morty built outta stu...


It also looks like "--" and "..." show up and can be removed.

In [44]:
for word in df.Line_no_stopwords.iloc[1].split():
    
    if word.find('--')==-1 and word.find('...')==-1:
        print(word)

morty
got
ta
come
jus
got
ta
come


In [45]:
df['Line_no_stopwords']=df['Line_no_stopwords'].apply(lambda x: ' '.join(word for word in x.split()
                                                if word.find("--")==-1 and word.find("...")==-1))
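The clean-up steps so far (lowercasing, stop words, punctuation, apostrophe fragments, `--` and `...`) could also be folded into a single pass. The sketch below assumes the line has already been lowercased and tokenized; `STOP_WORDS` is a tiny hypothetical stand-in for NLTK's English list and `keep_token` is an illustrative helper.

```python
import string

# Tiny stand-in for stopwords.words('english'); a set gives fast lookups.
STOP_WORDS = {"you", "me", "with", "on", "the", "a", "is"}
PUNCTUATION = set(string.punctuation)

def keep_token(token):
    """Return True for tokens that survive every filter used above."""
    return (
        token not in STOP_WORDS
        and token not in PUNCTUATION
        and "'" not in token
        and "--" not in token
        and "..." not in token
    )

tokens = ["morty", "!", "you", "got", "ta", "come", "on", ".", "jus", "'",
          "...", "you", "got", "ta", "come", "with", "me", "."]
print(" ".join(t for t in tokens if keep_token(t)))
```

Using sets for the stop words and punctuation avoids rebuilding `list(string.punctuation)` for every token.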

Some lines have no words anymore!

In [46]:
df[df.Line_no_stopwords=='']

Unnamed: 0,Sentence_id,Season,Episode,Episode_num,Episode_id,Character,Line,Line_no_stopwords
5,6,1,Pilot,1,1,Rick:,,
29,30,1,Pilot,1,1,Morty:,,
32,33,1,Pilot,1,1,Rick:,,
38,39,1,Pilot,1,1,Morty:,It was?,
51,52,1,Pilot,1,1,Jerry:,What?,
97,98,1,Pilot,1,1,Frank:,,
116,117,1,Pilot,1,1,Tom:,,
120,121,1,Pilot,1,1,Tom:,,
162,163,1,Pilot,1,1,Jerry:,,
177,178,1,Pilot,1,1,Morty:,What?!,


In [47]:
idxnowords=df[df.Line_no_stopwords==''].index
df.drop(idxnowords,inplace=True)
# df.iloc[idxnowords]

Before further processing, I can do some rudimentary word frequency stats.

In [48]:
caseexcept=(~df.Character.str.contains('[eE]xcept'))
dfRick=df[(df.Character.str.contains('Rick')) & caseexcept]
dfMorty=df[(df.Character.str.contains('Morty')) & caseexcept]
dfSummer=df[(df.Character.str.contains('Summer')) & caseexcept]
dfJerry=df[(df.Character.str.contains('Jerry')) & caseexcept]
dfBeth=df[(df.Character.str.contains('Beth')) & caseexcept]

In [49]:
dfJerry.Line_no_stopwords.loc[115].replace('``','').split()

['manager',
 'gave',
 'hour',
 'lunch',
 'thought',
 'hey',
 'swing',
 'wife',
 'works']

In [50]:
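def wordagg(lines):
    """Flatten a Series of cleaned lines into one list of words."""
    words=[]
    for line in lines:
        # drop `` (the tokenizer's opening-quote token) before splitting
        words.extend(line.replace('``','').split())
    return words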

    

In [51]:
Rickwords=wordagg(dfRick.Line_no_stopwords)
len(Rickwords)

13697

In [52]:
Rickwords[:10]

['morty', 'got', 'ta', 'come', 'jus', 'got', 'ta', 'come', 'got', 'surprise']

In [53]:
Mortywords=wordagg(dfMorty.Line_no_stopwords)
len(Mortywords)

5733

In [54]:
Jerrywords=wordagg(dfJerry.Line_no_stopwords)
len(Jerrywords)

3275

In [55]:
Jerrywords

['see',
 'new',
 'episode',
 'singing',
 'show',
 'tonight',
 'guys',
 'think',
 'gon',
 'na',
 'best',
 'singer',
 'damn',
 'beth',
 'okay',
 'due',
 'respect',
 'rick',
 'talking',
 'respect',
 'due',
 'son',
 'supposed',
 'pass',
 'classes',
 'keep',
 'dragging',
 'high-concept',
 'sci-fi',
 'rigamarole',
 'real',
 'knock',
 'knock',
 'manager',
 'gave',
 'hour',
 'lunch',
 'thought',
 'hey',
 'swing',
 'wife',
 'works',
 'well',
 'lunch',
 'mean',
 'one',
 'three',
 'meals',
 'existed',
 'millennia',
 'well',
 'yeah',
 'horses',
 'okay',
 'let',
 'rehash',
 'fight',
 'sense',
 'busy',
 'way',
 'whoa',
 'floor',
 'kind',
 'literature',
 'really',
 'nice-looking',
 'nursing',
 'home',
 'hey',
 'honey',
 'crazy',
 'idea',
 'bad',
 'pitch',
 'let',
 'put',
 'dad',
 'let',
 'put',
 'dad',
 'nursing',
 'home',
 'told',
 'ordering',
 'something',
 'valentine',
 'day',
 'importantly',
 'father',
 'horrible',
 'influence',
 'son',
 'since',
 'fighting',
 'ever',
 'affair',
 'guy',
 'come',


In [56]:
Summerwords=wordagg(dfSummer.Line_no_stopwords)
len(Summerwords)

2293

In [57]:
Bethwords=wordagg(dfBeth.Line_no_stopwords)
len(Bethwords)

1955

Use `set` to count distinct words and see how varied each person's vocabulary is. "Lexical richness" can be calculated as the number of unique words divided by the total number of words. The larger the share of unique words, the richer the vocabulary.

In [58]:
def richness(text):
    return len(set(text))/len(text)

In [59]:
richness(Rickwords)

0.26823391983646055

In [60]:
richness(Mortywords)

0.28274899703471135

In [61]:
richness(Jerrywords)

0.41557251908396947

In [62]:
richness(Summerwords)

0.4443959877889228
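One caveat: this ratio tends to fall as a text gets longer, because common words keep repeating while new words arrive more slowly, so part of the gap between Rick and Summer may simply reflect how much more Rick says. A rough way to compare speakers on equal footing is to compute richness on samples of the same length. The `richness_at` helper below is a hypothetical sketch, shown on toy word lists rather than the transcript.

```python
def richness(words):
    """Unique words divided by total words (type-token ratio)."""
    return len(set(words)) / len(words)

def richness_at(words, n):
    """Richness of the first n words, so speakers are compared at equal length."""
    if n > len(words):
        raise ValueError("need at least n words")
    return richness(words[:n])

long_speaker = ["oh", "morty", "go", "oh", "morty", "get", "oh", "morty"]
short_speaker = ["oh", "dad", "go", "oh"]

print(richness(long_speaker), richness(short_speaker))
print(richness_at(long_speaker, 4), richness_at(short_speaker, 4))
```

In this toy case the longer speaker looks less rich overall, yet the two are identical once both are cut to four words.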

In [63]:
from nltk import FreqDist
FreqDist(Rickwords).most_common(10)

[('morty', 593),
 ('oh', 184),
 ('know', 181),
 ('na', 152),
 ('get', 147),
 ('got', 140),
 ('gon', 138),
 ('right', 133),
 ('yeah', 128),
 ('go', 119)]

In [64]:
FreqDist(Mortywords).most_common(10)

[('rick', 285),
 ('know', 152),
 ('oh', 139),
 ('like', 70),
 ('na', 64),
 ('right', 61),
 ('get', 60),
 ('gon', 58),
 ('okay', 57),
 ('man', 56)]

In [65]:
FreqDist(Summerwords).most_common(10)

[('oh', 56),
 ('grandpa', 45),
 ('god', 41),
 ('rick', 40),
 ('dad', 35),
 ('yeah', 32),
 ('like', 29),
 ('morty', 24),
 ('going', 21),
 ('go', 20)]

In [66]:
FreqDist(Bethwords).most_common(10)

[('jerry', 66),
 ('dad', 36),
 ('oh', 34),
 ('know', 24),
 ('like', 22),
 ('okay', 21),
 ('want', 20),
 ('morty', 18),
 ('go', 16),
 ('get', 16)]

In [67]:
FreqDist(Jerrywords).most_common(10)

[('well', 44),
 ('know', 42),
 ('oh', 38),
 ('like', 33),
 ('rick', 32),
 ('okay', 30),
 ('hey', 30),
 ('uh', 29),
 ('beth', 28),
 ('got', 26)]

I can make word clouds with this!

In [68]:
Rickfreq=FreqDist(Rickwords).most_common(100)
Mortyfreq=FreqDist(Mortywords).most_common(20)
Summerfreq=FreqDist(Summerwords).most_common(10)
Bethfreq=FreqDist(Bethwords).most_common(10)
Jerryfreq=FreqDist(Jerrywords).most_common(10)

In [69]:
for item in Rickfreq[:15]:
    print(item)

('morty', 593)
('oh', 184)
('know', 181)
('na', 152)
('get', 147)
('got', 140)
('gon', 138)
('right', 133)
('yeah', 128)
('go', 119)
('like', 114)
('rick', 97)
('time', 88)
('one', 87)
('buurp', 83)


In [70]:
Rickfreq[-1][1]

19

In [71]:
Rickfreq[90:99]

[('new', 20),
 ('y-you', 20),
 ('lot', 20),
 ('gun', 20),
 ('planet', 20),
 ('ta', 19),
 ('anything', 19),
 ('crap', 19),
 ('idea', 19)]

In [72]:
len(Rickfreq)

100

The D3 word cloud function I have takes a .js file with the format

`gword=[{text: 'firstword', size: 100},{text: 'nextword', size: 50}];`

I need to convert this list into a js object, and convert the word count into a scalable word size.

In previous work, the font scale has been 80 to 20.

The word count is linearly normalized to the range 0 to 1, then scaled between the two font sizes.
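Written out, this scaling is a min-max normalization followed by a linear map onto the font range. The `font_size` function below is a hypothetical standalone version of that step; the guard for equal counts is an addition, since the formula would otherwise divide by zero.

```python
def font_size(count, lowest, highest, bigsize=80, smallsize=20):
    """Map a word count linearly onto the range [smallsize, bigsize]."""
    if highest == lowest:
        # All counts equal: no spread to scale, so use the midpoint.
        return (bigsize + smallsize) / 2
    fraction = (count - lowest) / (highest - lowest)  # 0 for lowest, 1 for highest
    return fraction * (bigsize - smallsize) + smallsize

# Counts taken from Rick's top-100 list: highest 593 ('morty'), lowest 19.
print(font_size(593, 19, 593))
print(font_size(184, 19, 593))
print(font_size(19, 19, 593))
```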

In [73]:
bigsize=80
smallsize=20
def jsviewout(freqlist,bigsize,smallsize):
    if len(freqlist)< 100:
        return ('Error, List must have 100 or more counts')
    else:
        highest=freqlist[0][1]
        lowest=freqlist[-1][1]
        for k in freqlist[:100]:
            fsize=k[1]
#             print(fsize)
            fsize=(fsize - lowest) / (highest - lowest)
            fsize= fsize * (bigsize-smallsize) + smallsize
            print(k[0],fsize)



In [74]:
jsviewout(Rickfreq,bigsize,smallsize)

morty 80.0
oh 37.247386759581886
know 36.933797909407666
na 33.90243902439025
get 33.379790940766554
got 32.64808362369338
gon 32.4390243902439
right 31.916376306620208
yeah 31.393728222996515
go 30.45296167247387
like 29.930313588850176
rick 28.153310104529616
time 27.21254355400697
one 27.10801393728223
buurp 26.689895470383277
hey 26.689895470383277
summer 26.27177700348432
come 26.167247386759584
well 25.853658536585364
jerry 25.54006968641115
let 25.43554006968641
take 25.226480836236934
little 25.226480836236934
bleep 24.808362369337978
think 24.494773519163765
really 24.181184668989545
back 24.181184668989545
uh 24.181184668989545
good 24.076655052264808
whoa 23.972125435540068
look 23.65853658536585
want 23.449477351916375
okay 23.2404181184669
mean 23.13588850174216
thing 23.031358885017422
god 23.031358885017422
listen 22.926829268292682
us 22.822299651567945
guys 22.822299651567945
see 22.717770034843205
could 22.50871080139373
say 22.40418118466899
would 22.40418118466899
r

In [75]:
def jswrite(freqlist,fname,bigsize,smallsize):
    """Write the top 100 (word, count) pairs to fname as a JS array for the word cloud."""
    if len(freqlist)< 100:
        return ('Error, List must have 100 or more counts')
    elif type(fname) is not str or fname[-3:] !='.js':
        return ('Error, filename incorrect format, make string with ".js" filetype')
    highest=freqlist[0][1]
    lowest=freqlist[-1][1]
    entries=[]
    for word,count in freqlist[:100]:
        # linear scaling of the count onto the font-size range
        fsize=(count - lowest) / (highest - lowest)
        fsize= fsize*(bigsize-smallsize) + smallsize
        entries.append("{text: '"+word+"', size: "+str(fsize)+"}")
    # the with-block closes the file, so the contents are flushed to disk
    with open(fname,'w') as fhand:
        fhand.write("gword =[" + ",\n".join(entries) + "\n];\n")



In [76]:
fname='test.js'

In [77]:
jswrite(Rickfreq,fname,bigsize,smallsize)
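Concatenating strings works here because apostrophes were stripped earlier, but a word containing a quote or backslash would produce invalid JavaScript. One alternative, sketched below, is to let `json.dumps` do the escaping; a JSON array is also a valid JavaScript literal. The `to_js` function is hypothetical and returns the text instead of writing a file.

```python
import json

def to_js(freqlist, bigsize=80, smallsize=20, varname="gword"):
    """Build 'varname = [...];' from (word, count) pairs sorted by count, descending."""
    highest = freqlist[0][1]
    lowest = freqlist[-1][1]
    spread = (highest - lowest) or 1  # avoid dividing by zero when counts are equal
    entries = [
        {"text": word,
         "size": (count - lowest) / spread * (bigsize - smallsize) + smallsize}
        for word, count in freqlist
    ]
    return varname + " = " + json.dumps(entries, indent=1) + ";\n"

sample = [("morty", 593), ("don't", 184), ("idea", 19)]
print(to_js(sample))
```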

In [78]:
Rickfreq=FreqDist(Rickwords)
Mortyfreq=FreqDist(Mortywords)
Summerfreq=FreqDist(Summerwords)
Bethfreq=FreqDist(Bethwords)
Jerryfreq=FreqDist(Jerrywords)

In [79]:
jswrite(Rickfreq.most_common(100),'rickfreq.js',bigsize,smallsize)
jswrite(Mortyfreq.most_common(100),'mortyfreq.js',bigsize,smallsize)
jswrite(Summerfreq.most_common(100),'summerfreq.js',bigsize,smallsize)
jswrite(Bethfreq.most_common(100),'bethfreq.js',bigsize,smallsize)
jswrite(Jerryfreq.most_common(100),'jerryfreq.js',bigsize,smallsize)

At this point I created a folder called `word_cloud` where I put the d3 functions. I need to move all .js files to that location.

In [80]:
import glob
import os
for file in glob.glob('*.js'):
    os.rename(file,os.path.join('word_cloud',file))

I have a gword.htm file that needs to be slightly modified for each file to show correctly.

In [81]:
gwordfile=os.path.join('word_cloud','gword.htm')
with open(gwordfile) as file:
    gword=file.read()

In [82]:
htmnames=['rickfreq','mortyfreq','summerfreq','bethfreq','jerryfreq']
for name in htmnames:
    newdoc=gword.replace("gword.js",name+'.js')
    htmhand=open(os.path.join('word_cloud',name+'.htm'),'w')
    htmhand.write(newdoc)
    htmhand.close()


These are pretty cool, but there are more powerful ways to see which words are important to specific characters. Specifically, we need to discount words that are common among all the characters. One of the most commonly used statistics for this is [term frequency-inverse document frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), which discounts words by their frequency in the entire corpus of text. To define things clearly, a corpus is a collection of documents, and a document is a collection of terms, or words. The key idea is to build a word list from the whole corpus, and then compare the frequency of a specific word in one document with its base frequency in the corpus.

For example, in this case all of the lines a character says make up their "document". We take the count of a specific word a character says and divide it by the total count of all words they say, which gives us a "term frequency". The linear word size normalization above does roughly this, but not exactly: our current word clouds display the importance of a word scaled relative to the character's most frequent word.

"Inverse document frequency" is generally a logarithmically scaled inverse of the fraction of "documents" that contain the specific term.

In [83]:
corpus=Rickwords+Mortywords+Summerwords+Bethwords+Jerrywords
corpusfreq=FreqDist(corpus)
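
For comparison with the pooled word counts built here, the textbook version of tf-idf counts *documents* rather than words in its idf term. The sketch below is a minimal illustration only: the three toy documents and the helper name `textbook_tfidf` are made up for the example, and it assumes the common `ln(N / df)` form of idf, where `N` is the number of documents and `df` is the number of documents containing the term.

```python
from collections import Counter
from math import log

def textbook_tfidf(docs):
    """Return {doc_name: {term: tf-idf}} for a dict of tokenized documents.

    tf  = count of the term in the document / total terms in the document
    idf = ln(number of documents / number of documents containing the term)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            term: (count / total) * log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores

# Hypothetical toy "documents", one per speaker
toy_docs = {
    "a": ["oh", "morty", "morty", "go"],
    "b": ["oh", "rick", "geez"],
    "c": ["oh", "dad", "go"],
}
toy_scores = textbook_tfidf(toy_docs)
```

A term that every speaker uses, like "oh" in the toy data, gets an idf of ln(1) = 0 and drops out entirely, while a term unique to one speaker keeps the largest weight.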

With a dict of counts, I can use the lists generated above to search the dict, pull values, and create calculated fields in terms of the total corpus.

In [84]:
Rickfreq.most_common(100)[0][0]

'morty'

In [85]:
ccount=corpusfreq.get(Rickfreq.most_common(100)[0][0])
ccount

697

In [86]:
dcount=Rickfreq.most_common(100)[0][1]
dcount

593

In [87]:
sum(Rickfreq.values())

13697

In [88]:
sum(corpusfreq.values())

26953

In [89]:
dcount/ccount

0.8507890961262554

This means that Rick accounts for 85% of the times "Morty" is said by a main character. That definitely makes sense. But it is worth noting that Rick also says about half of all the words in the corpus (13697 of 26953)!


In [90]:
# from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# countvect=CountVectorizer()

# rickcountvect=countvect.fit_transform(Rickwords).toarray()

I am defining a tweaked version of tf-idf. The tf part is kept normal, but I am changing the idf to be in the same terms as tf, i.e. it is the natural log of the total word count in the corpus divided by the specific word's count in the corpus. Note how this is an inverse frequency.
$$\ln\left(\frac{\sum All_{corpus}}{\sum Word_{corpus}}\right)$$

The natural log dampens the very large ratios produced by rare words, such as words that are only said by a single person.
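
To make the formula concrete, here is the calculation for Rick's "morty" using the counts printed in the cells above. This is only a worked example; the variable names are chosen for illustration.

```python
from math import log

word_count_char = 593       # "morty" in Rick's lines
total_words_char = 13697    # all of Rick's words
word_count_corpus = 697     # "morty" across all main characters
total_words_corpus = 26953  # all words in the corpus

tf = word_count_char / total_words_char          # term frequency
ratio = total_words_corpus / word_count_corpus   # inverse corpus frequency, before the log
score = tf * log(ratio)                          # tweaked tf-idf score

print(f"tf={tf:.6f} ratio={ratio:.6f} score={score:.6f}")
```

Without the log, the raw ratio would dominate the product for any word that appears only a handful of times in the corpus.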

In [91]:
charsum=sum(Rickfreq.values())
corpsum=sum(corpusfreq.values())
for word in Rickfreq.most_common(10):
    wname=word[0]
    wcount=word[1]
    ccount=corpusfreq.get(wname)
    tf=wcount/charsum
    idf=corpsum/ccount
    print(wname, word[1],ccount,tf,idf,tf*idf)

morty 593 697 0.04329415200408849 38.6700143472023 1.674185479148059
oh 184 451 0.01343359859823319 59.76274944567628 0.8028287871799982
know 181 417 0.013214572534131561 64.63549160671462 0.8541303921161821
na 152 260 0.011097320581149157 103.66538461538461 1.1504080062450508
get 147 262 0.010732277140979777 102.87404580152672 1.1040727701558317
got 140 233 0.010221216324742644 115.67811158798283 1.1823710025784913
gon 138 237 0.010075198948674893 113.72573839662448 1.1458094399309469
right 133 235 0.009710155508505512 114.6936170212766 1.1136928571095706
yeah 128 246 0.009345112068336133 109.5650406504065 1.0238975836498527
go 119 211 0.008688033876031247 127.739336492891 1.1098036827519915


In [92]:
from math import log
def tfidf(charfreq,corpusfreq,numwords):
    """Score a character's `numwords` most common words with the tweaked tf-idf."""
    charsum=sum(charfreq.values())      # total words said by the character
    corpsum=sum(corpusfreq.values())    # total words in the whole corpus
    charrank=charfreq.most_common(numwords)
    dflist=[]
    for word in charrank:
        wname=word[0]
        wcount=word[1]
        ccount=corpusfreq.get(wname)
        tf=wcount/charsum
        # Raw inverse ratio; the log is applied only when computing the score
        idf=corpsum/ccount
        score=tf*log(idf)
        dflist.append({'word':wname,
                      'char_count':wcount,
                      'corpus_count':ccount,
                      'term_freq':tf,
                      'inv_doc_freq':idf,
                      'tfidf':score})
    df_freq = pd.DataFrame(dflist, columns = ['word', 'char_count', 
                                              'corpus_count','term_freq', 
                                             'inv_doc_freq','tfidf'])
    return df_freq

In [93]:
rick=tfidf(Rickfreq,corpusfreq,100)
rick.sort_values('tfidf',ascending=False).head(10)

Unnamed: 0,word,char_count,corpus_count,term_freq,inv_doc_freq,tfidf
0,morty,593,697,0.043294,38.670014,0.158243
2,know,181,417,0.013215,64.635492,0.055088
1,oh,184,451,0.013434,59.762749,0.054949
3,na,152,260,0.011097,103.665385,0.051505
4,get,147,262,0.010732,102.874046,0.049728
5,got,140,233,0.010221,115.678112,0.048559
6,gon,138,237,0.010075,113.725738,0.047694
7,right,133,235,0.00971,114.693617,0.046048
8,yeah,128,246,0.009345,109.565041,0.043889
9,go,119,211,0.008688,127.739336,0.042137


In [94]:
summer=tfidf(Summerfreq,corpusfreq,100)
summer.sort_values('tfidf',ascending=False).head(10)

Unnamed: 0,word,char_count,corpus_count,term_freq,inv_doc_freq,tfidf
1,grandpa,45,70,0.019625,385.042857,0.116834
0,oh,56,451,0.024422,59.762749,0.099896
2,god,41,148,0.017881,182.114865,0.093062
4,dad,35,124,0.015264,217.362903,0.082143
3,rick,40,457,0.017444,58.978118,0.071124
5,yeah,32,246,0.013956,109.565041,0.065542
6,like,29,268,0.012647,100.570896,0.058314
8,going,21,99,0.009158,272.252525,0.051348
13,mom,17,47,0.007414,573.468085,0.047091
9,go,20,211,0.008722,127.739336,0.042303


Note: I think I need to combine "got ta" and "gon na" again, and I also need to combine "grampa" and "grandpa".

In [95]:
morty=tfidf(Mortyfreq,corpusfreq,100)
morty.sort_values('tfidf',ascending=False).head(10)

Unnamed: 0,word,char_count,corpus_count,term_freq,inv_doc_freq,tfidf
0,rick,285,457,0.049712,58.978118,0.202685
1,know,152,417,0.026513,64.635492,0.110527
2,oh,139,451,0.024246,59.762749,0.099174
3,like,70,268,0.01221,100.570896,0.056299
9,man,56,107,0.009768,251.897196,0.054008
4,na,64,260,0.011163,103.665385,0.051811
5,right,61,235,0.01064,114.693617,0.050458
8,okay,57,175,0.009942,154.017143,0.050081
12,geez,46,61,0.008024,441.852459,0.048872
6,get,60,262,0.010466,102.874046,0.048493


In [96]:
jerry=tfidf(Jerryfreq,corpusfreq,100)
jerry.sort_values('tfidf',ascending=False).head(10)

Unnamed: 0,word,char_count,corpus_count,term_freq,inv_doc_freq,tfidf
0,well,44,186,0.013435,144.908602,0.066855
8,beth,28,47,0.00855,573.468085,0.054305
1,know,42,417,0.012824,64.635492,0.053462
2,oh,38,451,0.011603,59.762749,0.047461
7,uh,29,138,0.008855,195.311594,0.046706
6,hey,30,168,0.00916,160.434524,0.046515
3,like,33,268,0.010076,100.570896,0.046461
5,okay,30,175,0.00916,154.017143,0.046141
4,rick,32,457,0.009771,58.978118,0.039838
12,look,23,119,0.007023,226.495798,0.038083


In [97]:
beth=tfidf(Bethfreq,corpusfreq,100)
beth.sort_values('tfidf',ascending=False).head(10)

Unnamed: 0,word,char_count,corpus_count,term_freq,inv_doc_freq,tfidf
0,jerry,66,149,0.03376,180.892617,0.175479
1,dad,36,124,0.018414,217.362903,0.099098
2,oh,34,451,0.017391,59.762749,0.071137
6,want,20,136,0.01023,198.183824,0.054109
5,okay,21,175,0.010742,154.017143,0.054107
4,like,22,268,0.011253,100.570896,0.051887
3,know,24,417,0.012276,64.635492,0.051177
17,leave,12,32,0.006138,842.28125,0.041347
8,go,16,211,0.008184,127.739336,0.039693
18,father,11,31,0.005627,869.451613,0.03808


To fix the 'got ta' problem, I can modify the underlying dataframe, then re-run my custom function wordagg, the NLTK FreqDist, and then the tfidf function.

In [98]:
df['Line_no_stopwords']=df.Line_no_stopwords.str.replace('got ta','gotta')
df['Line_no_stopwords']=df.Line_no_stopwords.str.replace('gon na','gonna')
# df.Line_no_stopwords[df.Line_no_stopwords.str.contains('grampa')]
df['Line_no_stopwords']=df.Line_no_stopwords.str.replace('grampa','grandpa')
df.Line_no_stopwords.loc[3638]


'come let help grandpa'
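
The three replacements above could also be driven from a single mapping, which makes it easier to add more variants later. This is a minimal sketch on plain strings rather than on the dataframe; the `REPLACEMENTS` dict and the `normalize_line` helper are hypothetical names, not part of the notebook, and like `str.replace` they match substrings rather than whole tokens.

```python
# Hypothetical mapping of split or variant tokens to one canonical form
REPLACEMENTS = {
    "got ta": "gotta",
    "gon na": "gonna",
    "grampa": "grandpa",
}

def normalize_line(line, replacements=REPLACEMENTS):
    """Apply every replacement in turn to a single cleaned line of dialogue."""
    for old, new in replacements.items():
        line = line.replace(old, new)
    return line

example = normalize_line("got ta go gon na help grampa")
```

Applied to a pandas column, the same helper could be passed to `Series.apply`.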

In [99]:
caseexcept=(~df.Character.str.contains('[eE]xcept'))
dfRick=df[(df.Character.str.contains('Rick')) & caseexcept]
dfMorty=df[(df.Character.str.contains('Morty')) & caseexcept]
dfSummer=df[(df.Character.str.contains('Summer')) & caseexcept]
dfJerry=df[(df.Character.str.contains('Jerry')) & caseexcept]
dfBeth=df[(df.Character.str.contains('Beth')) & caseexcept]

In [100]:
Rickwords=wordagg(dfRick.Line_no_stopwords)
len(Rickwords)

13539

In [101]:
Mortywords=wordagg(dfMorty.Line_no_stopwords)
len(Mortywords)

5671

In [102]:
Summerwords=wordagg(dfSummer.Line_no_stopwords)
len(Summerwords)

2281

In [103]:
Bethwords=wordagg(dfBeth.Line_no_stopwords)
len(Bethwords)

1941

In [104]:
Jerrywords=wordagg(dfJerry.Line_no_stopwords)
len(Jerrywords)

3256

In [105]:
Rickfreq=FreqDist(Rickwords)
Mortyfreq=FreqDist(Mortywords)
Summerfreq=FreqDist(Summerwords)
Bethfreq=FreqDist(Bethwords)
Jerryfreq=FreqDist(Jerrywords)
corpus=Rickwords+Mortywords+Summerwords+Bethwords+Jerrywords
corpusfreq=FreqDist(corpus)

In [106]:
rick=tfidf(Rickfreq,corpusfreq,100)
rick.sort_values('tfidf',ascending=False).head(10)

Unnamed: 0,word,char_count,corpus_count,term_freq,inv_doc_freq,tfidf
0,morty,593,697,0.043799,38.289813,0.159657
2,know,181,417,0.013369,64.0,0.055599
1,oh,184,451,0.01359,59.175166,0.055456
3,get,147,262,0.010858,101.862595,0.050201
4,gonna,138,237,0.010193,112.607595,0.04815
5,right,133,235,0.009823,113.565957,0.046488
6,yeah,128,246,0.009454,108.487805,0.044308
7,got,120,205,0.008863,130.185366,0.043155
8,go,119,211,0.008789,126.483412,0.042542
9,like,114,268,0.00842,99.58209,0.038741


In [107]:
summer=tfidf(Summerfreq,corpusfreq,100)
summer.sort_values('tfidf',ascending=False).head(10)

Unnamed: 0,word,char_count,corpus_count,term_freq,inv_doc_freq,tfidf
1,grandpa,53,81,0.023235,329.481481,0.134708
0,oh,56,451,0.024551,59.175166,0.100179
2,god,41,148,0.017975,180.324324,0.093374
4,dad,35,124,0.015344,215.225806,0.082424
3,rick,40,457,0.017536,58.398249,0.071325
5,yeah,32,246,0.014029,108.487805,0.065749
6,like,29,268,0.012714,99.58209,0.058496
8,going,21,99,0.009206,269.575758,0.051527
13,mom,17,47,0.007453,567.829787,0.047265
9,go,20,211,0.008768,126.483412,0.042439


In [108]:
morty=tfidf(Mortyfreq,corpusfreq,100)
morty.sort_values('tfidf',ascending=False).head(10)

Unnamed: 0,word,char_count,corpus_count,term_freq,inv_doc_freq,tfidf
0,rick,285,457,0.050256,58.398249,0.204404
1,know,152,417,0.026803,64.0,0.111471
2,oh,139,451,0.024511,59.175166,0.100016
3,like,70,268,0.012344,99.58209,0.056792
8,man,56,107,0.009875,249.420561,0.0545
4,right,61,235,0.010756,113.565957,0.050904
7,okay,57,175,0.010051,152.502857,0.050529
10,geez,46,61,0.008111,437.508197,0.049326
5,get,60,262,0.01058,101.862595,0.048919
6,gonna,58,237,0.010227,112.607595,0.048314


In [109]:
jerry=tfidf(Jerryfreq,corpusfreq,100)
jerry.sort_values('tfidf',ascending=False).head(10)

Unnamed: 0,word,char_count,corpus_count,term_freq,inv_doc_freq,tfidf
0,well,44,186,0.013514,143.483871,0.067111
8,beth,28,47,0.0086,567.829787,0.054537
1,know,42,417,0.012899,64.0,0.053647
2,oh,38,451,0.011671,59.175166,0.047623
7,uh,29,138,0.008907,193.391304,0.046891
6,hey,30,168,0.009214,158.857143,0.046695
3,like,33,268,0.010135,99.58209,0.046632
5,okay,30,175,0.009214,152.502857,0.046319
4,rick,32,457,0.009828,58.398249,0.039973
11,look,23,119,0.007064,224.268908,0.038236


In [110]:
beth=tfidf(Bethfreq,corpusfreq,100)
beth.sort_values('tfidf',ascending=False).head(10)

Unnamed: 0,word,char_count,corpus_count,term_freq,inv_doc_freq,tfidf
0,jerry,66,149,0.034003,179.114094,0.176409
1,dad,36,124,0.018547,215.225806,0.099629
2,oh,34,451,0.017517,59.175166,0.071477
6,want,20,136,0.010304,196.235294,0.054398
5,okay,21,175,0.010819,152.502857,0.05439
4,like,22,268,0.011334,99.58209,0.052149
3,know,24,417,0.012365,64.0,0.051424
16,leave,12,32,0.006182,834.0,0.041584
8,go,16,211,0.008243,126.483412,0.039898
17,father,11,31,0.005667,860.903226,0.038299


In [111]:
beth[['word','tfidf']].head()

Unnamed: 0,word,tfidf
0,jerry,0.176409
1,dad,0.099629
2,oh,0.071477
3,know,0.051424
4,like,0.052149


In [112]:
beth.sort_values('tfidf',ascending=False).tfidf.iloc[0]

0.17640881913861872

In [113]:
for index, row in beth[:10].iterrows():
    print(row['tfidf'])

0.17640881913861872
0.09962944802881181
0.0714771081052535
0.05142359299362809
0.05214920718302296
0.054389927849528154
0.05439788170450718
0.03380386919070741
0.039897876737206864
0.03811334200018445


In [114]:
len(beth)

100

I need to refactor the jswrite function above to work with dataframes.

In [115]:
def jswrite_df(freqlist,fname,bigsize,smallsize):
    if len(freqlist)< 100:
        return ('Error, List must have 100 or more counts')
    elif type(fname) is not str or fname[-3:] !='.js':
        return ('Error, filename incorrect format, make string with ".js" filetype')
    else:
        # Range of tfidf scores, used to scale the font sizes linearly
        highest=freqlist.tfidf.max()
        lowest=freqlist.tfidf.min()
        # The with block closes (and flushes) the file before it is moved later
        with open(fname,'w') as fhand:
            fhand.write("gword =[")
            first = True
            for index, k in freqlist[:100].iterrows():
                if not first : fhand.write(",\n")
                first = False
                fsize=k['tfidf']
                word=k['word']
                # Min-max scale the score into the range [smallsize, bigsize]
                fsize=(fsize - lowest) / (highest - lowest)
                fsize= fsize*(bigsize-smallsize) + smallsize
                fhand.write("{text: '"+word+"', size: "+str(fsize)+"}")
            fhand.write("\n];\n")
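
The font-size scaling inside `jswrite_df` is a plain min-max normalization, which can be sketched on its own. The helper name `scale_sizes` and the guard for identical scores are assumptions added for this illustration; the function above has no such guard.

```python
def scale_sizes(scores, bigsize, smallsize):
    """Linearly map scores onto font sizes between smallsize and bigsize."""
    lowest, highest = min(scores), max(scores)
    span = highest - lowest
    if span == 0:
        # All scores identical: avoid dividing by zero and use the midpoint size
        return [(bigsize + smallsize) / 2 for _ in scores]
    return [
        (s - lowest) / span * (bigsize - smallsize) + smallsize
        for s in scores
    ]

sizes = scale_sizes([0.0, 0.5, 1.0], bigsize=100, smallsize=20)
```

The lowest score always lands on `smallsize` and the highest on `bigsize`, so every cloud uses the full size range regardless of how the raw scores are spread.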


In [116]:
bigsize=100
smallsize=20
jswrite_df(rick,'ricktfidf.js',bigsize,smallsize)


In [117]:
jswrite_df(morty,'mortytfidf.js',bigsize,smallsize)
jswrite_df(summer,'summertfidf.js',bigsize,smallsize)
jswrite_df(beth,'bethtfidf.js',bigsize,smallsize)
jswrite_df(jerry,'jerrytfidf.js',bigsize,smallsize)


Once again, I move the files into the word_cloud directory, and make a new .htm file for each one based on the gword.htm file.

In [118]:
import glob
import os
for file in glob.glob('*.js'):
    os.rename(file,os.path.join('word_cloud',file))

In [119]:
gword

'<!DOCTYPE html>\n<meta charset="utf-8">\n<script src="d3.v3.js"></script>\n<script src="d3.layout.cloud.js"></script>\n<script src="gword.js"></script>\n<body>\n<script>\n  var fill = d3.scale.category20();\n\n  d3.layout.cloud().size([700, 700])\n      .words(gword)\n      .rotate(function() { return ~~(Math.random() * 2) * 90; })\n      .font("Impact")\n      .fontSize(function(d) { return d.size; })\n      .on("end", draw)\n      .start();\n\n  function draw(words) {\n    d3.select("body").append("svg")\n        .attr("width", 700)\n        .attr("height", 700)\n      .append("g")\n        .attr("transform", "translate(350,350)")\n      .selectAll("text")\n        .data(words)\n      .enter().append("text")\n        .style("font-size", function(d) { return d.size + "px"; })\n        .style("font-family", "Impact")\n        .style("fill", function(d, i) { return fill(i); })\n        .attr("text-anchor", "middle")\n        .attr("transform", function(d) {\n          return "translate

I have a gword.htm file that needs to be slightly modified for each file to show correctly.

In [120]:
htmnames=['ricktfidf','mortytfidf','summertfidf','bethtfidf','jerrytfidf']
for name in htmnames:
    newdoc=gword.replace("gword.js",name+'.js')
    with open(os.path.join('word_cloud',name+'.htm'),'w') as htmhand:
        htmhand.write(newdoc)


# Checkpoint

Here is a good place to checkpoint and save data!
We need to save the following things: 
1. The full newly cleaned dataframe `df`
2. The individual character dataframes `dfRick`,`dfMorty` etc.
3. The word cloud dataframes `rick`,`morty` etc.

In [131]:
df.to_csv('NLPclean_RandM_data.csv',index_label=False)
dfRick.to_csv('dfRick.csv',index_label=False)
dfMorty.to_csv('dfMorty.csv',index_label=False)
dfSummer.to_csv('dfSummer.csv',index_label=False)
dfJerry.to_csv('dfJerry.csv',index_label=False)
dfBeth.to_csv('dfBeth.csv',index_label=False)
rick.to_csv('rick_100cloud.csv',index_label=False)
morty.to_csv('morty_100cloud.csv',index_label=False)
summer.to_csv('summer_100cloud.csv',index_label=False)
beth.to_csv('beth_100cloud.csv',index_label=False)
jerry.to_csv('jerry_100cloud.csv',index_label=False)



## And now, we reload and move forward!

In [130]:
df=pd.read_csv('NLPclean_RandM_data.csv')
df.head()

Unnamed: 0,Sentence_id,Season,Episode,Episode_num,Episode_id,Character,Line,Line_no_stopwords
0,1,1,Pilot,1,1,Scene Action:,[Open to Morty's room],open morty room
1,2,1,Pilot,1,1,Rick:,Morty! You gotta come on. Jus'... you gotta ...,morty gotta come jus gotta come
2,3,1,Pilot,1,1,Morty:,"What, Rick? What's going on?",rick going
3,4,1,Pilot,1,1,Rick:,"I got a surprise for you, Morty.",got surprise morty
4,5,1,Pilot,1,1,Morty:,It's the middle of the night. What are you tal...,middle night talking


After removing stop words, we need to normalize the lexicon. One way to do this is with lemmatization, a procedure that reduces a word to its root form. It will turn things like "multiplying" into "multiply".

In [138]:
from nltk.stem.wordnet import WordNetLemmatizer 
from nltk import download
lem = WordNetLemmatizer()
download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/devinmccormack/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [163]:
from nltk import pos_tag
download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/devinmccormack/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [157]:
lem.lemmatize(df.Line_no_stopwords.str.split()[14][1])

'absolutely'

In [192]:
pos_tag(word_tokenize(df.Line[127]))

[('We', 'PRP'), ("'re", 'VBP'), ('losing', 'VBG'), ('him', 'PRP'), ('.', '.')]

In [194]:
lem.lemmatize('losing','v')

'lose'

We need to take these lists and convert the long-form part-of-speech tag into a simple tag for lemmatization. Essentially, we look at the first letter of the tag: V (verb), N (noun), and R (adverb) map to the lowercase letters `v`, `n`, and `r`, while J (adjective) maps to `a`, which is the code WordNet uses for adjectives. Everything else can just be treated as a noun.
[Check this stackoverflow post for more info](https://stackoverflow.com/questions/15586721/wordnet-lemmatization-and-pos-tagging-in-python).
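
The mapping described above can be sketched as a small pure-Python helper. This is one possible version, not necessarily the one used in the rest of the notebook: the name `wordnet_pos` is hypothetical, and the tagged sentence is copied from the `pos_tag` output shown earlier.

```python
def wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet part-of-speech letter."""
    first = treebank_tag[:1].upper()
    if first == "J":
        return "a"            # WordNet uses 'a' for adjectives
    if first in ("V", "N", "R"):
        return first.lower()  # verb, noun, adverb
    return "n"                # default: treat everything else as a noun

# Tagged sentence copied from the pos_tag output above
tagged = [('We', 'PRP'), ("'re", 'VBP'), ('losing', 'VBG'), ('him', 'PRP'), ('.', '.')]
simple = [(word, wordnet_pos(tag)) for word, tag in tagged]
```

Each `(word, letter)` pair can then be passed to `lem.lemmatize(word, letter)`, in the same way as `lem.lemmatize('losing','v')` above.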