# Natural Language Processing

Natural Language Processing (NLP) focuses on computer understanding of human language. The future of text-based NLP may include more accurate search, universal translation, text summarization, conversational coding, androids, etc.

There are many challenges that stand in the way of these things at the moment. A short list of these challenges includes:
1. Polysemy - words can have multiple meanings; which meaning is the correct one?
2. Fluidity of syntax and grammar - what rules should be used to break down a sentence?
3. Errors - misspellings or incorrect grammar can derail brittle analysis
4. Semantics - meaning can change on a word and sentence level
5. Context - how much is plainly written and how much must be inferred?
6. Evolution of language - languages are rarely in stasis.


The basic outline for NLP flows through four **very broad** tiers/categories of increasing scope and difficulty:
1. Morphological processing - What are the discrete units (tokens) of meaning? In English this is relatively trivial because "words" are distinctly separated; however, words are themselves built from subword units. For example, "incoherently" has a prefix "in-", a root "coherent", and a suffix "-ly". Each part changes the meaning and usage. Tokens can have multiple meanings (polysemy), and the exact meaning and type (e.g. noun, verb, etc.) of a word may be ambiguous at this point.

2. Syntax/Grammar processing - What is the structure of the sentence? Do the words interact correctly? By looking at what tokens are in a string as well as what *order* they are in, we can determine relationships between tokens based on rules of definitions (lexicon) and syntax (grammar/structure). This processing can convert a sentence like "The large cat chased the rat" into a formal notation such as "Article Adjective Noun Verb Article Noun", or further into "Noun-Phrase Verb Noun-Phrase" (see the Lkit PDF linked below for a tree visualization). Grammar can disambiguate the meanings of "brush" in the sentences "**Brush** your hair" (verb) vs. "Hand me the **brush**" (noun).

3. Semantic analysis - What is the meaning of a string (sentence) of tokens (words)? The relationship of words in the syntactic framework allows us to disambiguate the meaning of the words. For example, in the sentence "He put a carrot on the plate and then ate **it**", we need semantics to determine what "it" is - in this case "it" is a carrot, not a plate.

4. Pragmatic (contextual) analysis - What is the meaning with respect to the entire context? There are many phrases that are still ambiguous after semantic analysis, such as "put the apple in the basket on the shelf.", which can have two meanings:
 - put the apple which is *currently in the basket* on the shelf
 - put the apple into the basket which is *currently on the shelf*

   Although this is a trivial example, the "correct" answer depends on the current state of the apple and basket, which may have been determined in previous sentences. Humor and sarcasm are extremely advanced forms of contextual understanding: "Trump is definitely the best president ever" can mean completely opposite things depending on who is saying it, and when they say it. It would require both understanding of the current state of a broad range of topics, as well as the history of the person saying it.
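
To make the second tier a little more concrete, here is a minimal sketch of the lexicon lookup that turns "The large cat chased the rat" into a tag sequence. The `LEXICON` dictionary and the `tag` function are hypothetical names invented for this illustration; a real tagger would use context to pick between the two entries for "brush" rather than taking the first one.

```python
# Toy lexicon: each word maps to the parts of speech it can take.
# This dictionary is a made-up example, not a real NLP resource.
LEXICON = {
    "the": ["Article"],
    "large": ["Adjective"],
    "cat": ["Noun"],
    "rat": ["Noun"],
    "chased": ["Verb"],
    "brush": ["Noun", "Verb"],  # ambiguous without grammatical context
}

def tag(sentence):
    """Return the first listed tag for each word, or 'Unknown'."""
    tags = []
    for word in sentence.lower().split():
        tags.append(LEXICON.get(word, ["Unknown"])[0])
    return tags

print(" ".join(tag("The large cat chased the rat")))
# prints: Article Adjective Noun Verb Article Noun
```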

A few interesting links that helped me create this document (as I am still learning!)

[Algorithmia - What is NLP?](https://blog.algorithmia.com/introduction-natural-language-processing-nlp/)

[Lkit NLP intro](https://www.scm.tees.ac.uk/isg/aia/nlp/NLP-overview.pdf) 

[Zareen Syed's slideshare](https://www.slideshare.net/zareen/challenges-in-nlp)

[tutorialspoint intro to NLP](https://www.tutorialspoint.com/artificial_intelligence/artificial_intelligence_natural_language_processing.htm)

[Analytics Vidhya guide to NLP](https://www.analyticsvidhya.com/blog/2017/01/ultimate-guide-to-understand-implement-natural-language-processing-codes-in-python/)



## Important steps breakdown

### Preprocessing (even more than before!!!)
#### Remove Noise
- remove scene action tagged sentences
- remove language stop words such as "is", "a", "this". These are very common words that do not help in determining the context of other words.

#### Lexicon Normalization
- compressing multiple representations of the same word into one with stemming (strip suffixes)
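
As a rough illustration of what "strip suffixes" means, here is a deliberately crude sketch. The suffix list and the `toy_stem` function are assumptions made up for this example; established stemmers (such as the Porter stemmer) apply many more rules.

```python
SUFFIXES = ("ing", "ly", "ed", "es", "s")  # checked in this order

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["talking", "talked", "talks", "talk"]:
    print(w, "->", toy_stem(w))
```

All four forms collapse to "talk", which is the point of normalization. The same rules turn "makes" into "mak", which shows why naive suffix stripping is only a starting point.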

### Potential outputs at this stage:

- Statistical: word and sentence counts per character
- tf-idf: term frequency inverse document frequency. This finds the frequency of words in a subset, and normalizes it by the frequency of the same word in the entire set. It finds the relative importance of a word in the subset vs the whole. It can be used to determine if a word is more frequent in a specific episode than it is in the whole show, or if a word is used more frequently by a specific character than all the characters.

Either of the above can be used to create word cloud outputs per character, episode, etc.
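
A minimal sketch of the tf-idf idea, using only the standard library. It assumes one common variant: term frequency is the word's share of a subset's tokens, and inverse document frequency is `log(N / df)`, where `df` is the number of subsets containing the word. The toy token lists are invented for illustration and are not taken from the transcript.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: dict of name -> list of tokens. Returns name -> {word: score}."""
    n_docs = len(docs)
    # Document frequency: in how many subsets does each word appear?
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        scores[name] = {
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores

# Hypothetical per-character token lists
docs = {
    "rick": ["morty", "portal", "gun", "morty"],
    "morty": ["rick", "oh", "geez", "rick"],
    "jerry": ["oh", "well", "beth"],
}
scores = tf_idf(docs)
print(scores["rick"])
```

A word that appears in every subset gets a score of zero under this variant, which is how very common words are pushed down.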

### Advanced syntactic/semantic processing
#### Object Standardization
- Many domain-specific words/acronyms are not in standard lexical dictionaries.

#### Syntactic Parsing
- analysis of grammar and arrangement. We want to tag words with their relationship to the other words in a sentence.



Sentiment analysis:

In [1]:
import pandas as pd
import re

In [2]:
df=pd.read_csv('clean_RandMtranscript.csv')

In [3]:
df.columns

Index(['Sentence_id', 'Season', 'Episode', 'Episode_num', 'Episode_id',
       'Character', 'Line'],
      dtype='object')

In [4]:
df.Character.value_counts()

Scene Action:                                                           1608
Rick:                                                                    930
Morty:                                                                   681
Jerry:                                                                   521
Summer:                                                                  356
Beth:                                                                    335
 Rick:                                                                   284
 Morty:                                                                  220
 Summer:                                                                 100
Pickle Rick:                                                              83
 Jerry:                                                                   75
 Beth:                                                                    55
Unity :                                                                   54

In [5]:
df['Character']=df.Character.str.strip()
df.Character.value_counts()

Scene Action:                                                           1608
Rick:                                                                   1215
Morty:                                                                   901
Jerry:                                                                   596
Summer:                                                                  456
Beth:                                                                    390
Pickle Rick:                                                              83
Unity :                                                                   54
All Ricks:                                                                37
Mr. Goldenfold:                                                           36
Mr. Needful:                                                              34
Jessica:                                                                  34
Birdperson:                                                               34

In [6]:
df[(df.Character.str.contains('Morty'))]


Unnamed: 0,Sentence_id,Season,Episode,Episode_num,Episode_id,Character,Line
2,3,1,Pilot,1,1,Morty:,"What, Rick? What’s going on?"
4,5,1,Pilot,1,1,Morty:,It's the middle of the night. What are you tal...
6,7,1,Pilot,1,1,Morty:,Ow! Ow! You're tugging me too hard!
10,11,1,Pilot,1,1,Morty:,"Yeah, Rick... I-it's great. Is this the surprise?"
12,13,1,Pilot,1,1,Morty:,What?! A bomb?!
14,15,1,Pilot,1,1,Morty:,T-t-that's absolutely crazy!
16,17,1,Pilot,1,1,Morty:,Jessica? From my math class?
18,19,1,Pilot,1,1,Morty:,Ohh...
20,21,1,Pilot,1,1,Morty:,Whhhh-wha?
22,23,1,Pilot,1,1,Morty:,"No, you can't! Jessica doesn't even know I e..."


In [7]:
df[(~df.Character.str.contains('except'))]

Unnamed: 0,Sentence_id,Season,Episode,Episode_num,Episode_id,Character,Line
0,1,1,Pilot,1,1,Scene Action:,[Open to Morty’s room]
1,2,1,Pilot,1,1,Rick:,Morty! You gotta come on. Jus'... you gotta ...
2,3,1,Pilot,1,1,Morty:,"What, Rick? What’s going on?"
3,4,1,Pilot,1,1,Rick:,"I got a surprise for you, Morty."
4,5,1,Pilot,1,1,Morty:,It's the middle of the night. What are you tal...
5,6,1,Pilot,1,1,Rick:,
6,7,1,Pilot,1,1,Morty:,Ow! Ow! You're tugging me too hard!
7,8,1,Pilot,1,1,Rick:,"We gotta go, gotta get outta here, come on. Go..."
8,9,1,Pilot,1,1,Scene Action:,[Cut to Rick's ship]
9,10,1,Pilot,1,1,Rick:,"What do you think of this... flying vehicle,..."


In [8]:
df[(df.Character.str.contains('Morty')) &
   (~df.Character.str.contains('except'))]


Unnamed: 0,Sentence_id,Season,Episode,Episode_num,Episode_id,Character,Line
2,3,1,Pilot,1,1,Morty:,"What, Rick? What’s going on?"
4,5,1,Pilot,1,1,Morty:,It's the middle of the night. What are you tal...
6,7,1,Pilot,1,1,Morty:,Ow! Ow! You're tugging me too hard!
10,11,1,Pilot,1,1,Morty:,"Yeah, Rick... I-it's great. Is this the surprise?"
12,13,1,Pilot,1,1,Morty:,What?! A bomb?!
14,15,1,Pilot,1,1,Morty:,T-t-that's absolutely crazy!
16,17,1,Pilot,1,1,Morty:,Jessica? From my math class?
18,19,1,Pilot,1,1,Morty:,Ohh...
20,21,1,Pilot,1,1,Morty:,Whhhh-wha?
22,23,1,Pilot,1,1,Morty:,"No, you can't! Jessica doesn't even know I e..."


In [9]:
caseexcept=(~df.Character.str.contains('[eE]xcept'))

In [10]:
dfMorty=df[(df.Character.str.contains('Morty')) & caseexcept]

In [11]:
len(dfMorty)

1075

In [12]:
dfRick=df[(df.Character.str.contains('Rick')) & caseexcept]

In [13]:
dfRick.head()

Unnamed: 0,Sentence_id,Season,Episode,Episode_num,Episode_id,Character,Line
1,2,1,Pilot,1,1,Rick:,Morty! You gotta come on. Jus'... you gotta ...
3,4,1,Pilot,1,1,Rick:,"I got a surprise for you, Morty."
5,6,1,Pilot,1,1,Rick:,
7,8,1,Pilot,1,1,Rick:,"We gotta go, gotta get outta here, come on. Go..."
9,10,1,Pilot,1,1,Rick:,"What do you think of this... flying vehicle,..."


In [14]:
len(dfRick)

1639

In [15]:
dfBeth=df[(df.Character.str.contains('Beth')) & caseexcept]

In [16]:
dfBeth.head()

Unnamed: 0,Sentence_id,Season,Episode,Episode_num,Episode_id,Character,Line
46,47,1,Pilot,1,1,Beth:,"Morty, are you getting sick?"
52,53,1,Pilot,1,1,Beth:,Dad?
55,56,1,Pilot,1,1,Beth:,Jerry!
61,62,1,Pilot,1,1,Beth:,"Oh, dad…"
111,112,1,Pilot,1,1,Beth:,Scalpel.


In [17]:
len(dfBeth)

412

In [18]:
dfJerry=df[(df.Character.str.contains('Jerry')) & caseexcept]

In [19]:
len(dfJerry)

640

In [20]:
dfSummer=df[(df.Character.str.contains('Summer')) & caseexcept]

In [21]:
len(dfSummer)

516

In [22]:
dfRick.Character.value_counts()

Rick:                            1215
Pickle Rick:                       83
All Ricks:                         37
Rick :                             33
Tiny Rick:                         33
Rick 1:                            25
Doofus Rick:                       18
Cop Rick:                          15
Evil Rick:                         14
Rick 2:                            13
Young Rick:                         9
Rick Council 1:                     8
Rick 30:                            7
Rick D716:                          6
Teacher Rick:                       6
Rick K-22:                          5
Rick D716-B:                        5
Alternate Rick:                     5
Rick/Rick:                          4
Juggling Rick:                      4
Guard Rick:                         4
Commander in Chief Rick/Rick:       4
Rick 4:                             3
All Ricks :                         3
Rick J-22:                          3
Cornvelious Daniel/Rick:            3
Other Rick: 

In [23]:
dfMorty.Character.value_counts()

Morty:                         901
Cop Morty:                      25
All Mortys:                     18
Morty :                         15
Morty 2:                        14
Candidate Morty:                12
Morty 1:                        10
Mechanical Morty:                9
Campaign Manager Morty:          8
Morty 30:                        6
Lizard Morty:                    4
Morty and Summer:                3
Morty Mart Morty:                3
"Lawyer" Morty:                  3
Religious Morty:                 3
Evil Morty:                      3
Morty K-22:                      2
Morty 1, 2 and 3:                2
Morty 23:                        2
Glasses Morty:                   2
Candidate Morty :                2
Mortytown Loco:                  2
Fat Morty:                       2
Other Morty:                     1
Morty & Summer:                  1
Morty 2, 3 and 4:                1
Morty Doll:                      1
Summer 1 & Morty 2:              1
Beth, Summer, and Mo
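
The counts above still list variants such as `Rick :` and `Morty :` separately from `Rick:` and `Morty:`, because `str.strip()` only removes whitespace at the ends of the label. One possible cleanup, sketched here with a hypothetical helper `normalize_speaker`, is to drop the trailing colon together with any spaces around it:

```python
import re

def normalize_speaker(label):
    """Collapse whitespace/colon variants of a speaker label."""
    # Drop surrounding whitespace, then a trailing colon and any spaces before it.
    return re.sub(r"\s*:\s*$", "", label.strip())

for raw in ["Rick:", " Rick:", "Rick :", "Unity :"]:
    print(repr(raw), "->", normalize_speaker(raw))
```

With pandas this could be applied as `df['Character'] = df.Character.map(normalize_speaker)`. I have not run this against the transcript, so treat it as a suggestion rather than part of the analysis.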

## Noise removal - remove stop words

Stop words are commonly used and usually low-context words in text, such as "is", "the", "of", "in", etc. If they were not removed, they would likely dominate any statistical analysis that we could do on characters.

Natural Language Toolkit (NLTK) is a commonly used NLP toolkit for Python. It includes a list of stop words and filtering abilities. Here is an example usage from [pythonspot](https://pythonspot.com/nltk-stop-words/).

Although I already split the data just to get an idea of what is going on, it probably is better to do all this noise removal before splitting, and then recreate splits afterwards.


In [24]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import string

In [25]:
list(string.punctuation)

['!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 '{',
 '|',
 '}',
 '~']

In [26]:
data = "All work and no play make's jack dull boy. All work and no play makes jack a dull boy."
stopWords = set(stopwords.words('english'))
words = word_tokenize(data)
wordsFiltered = []
 
for w in words:
    w=w.lower()
    print(w)
    if w not in stopWords and w not in list(string.punctuation):
        wordsFiltered.append(w)
 
print(wordsFiltered)



all
work
and
no
play
make
's
jack
dull
boy
.
all
work
and
no
play
makes
jack
a
dull
boy
.
['work', 'play', 'make', "'s", 'jack', 'dull', 'boy', 'work', 'play', 'makes', 'jack', 'dull', 'boy']


In [27]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [28]:
df['Line_no_stopwords'] = df['Line'].apply(lambda x: ' '.join([word for word in word_tokenize(x.lower())
                                                               if word not in list(string.punctuation)]))
# Note: this first pass removes only punctuation; stop words are not filtered yet.


In [29]:
df

Unnamed: 0,Sentence_id,Season,Episode,Episode_num,Episode_id,Character,Line,Line_no_stopwords
0,1,1,Pilot,1,1,Scene Action:,[Open to Morty’s room],open to morty ’ s room
1,2,1,Pilot,1,1,Rick:,Morty! You gotta come on. Jus'... you gotta ...,morty you got ta come on jus ... you got ta co...
2,3,1,Pilot,1,1,Morty:,"What, Rick? What’s going on?",what rick what ’ s going on
3,4,1,Pilot,1,1,Rick:,"I got a surprise for you, Morty.",i got a surprise for you morty
4,5,1,Pilot,1,1,Morty:,It's the middle of the night. What are you tal...,it 's the middle of the night what are you tal...
5,6,1,Pilot,1,1,Rick:,,
6,7,1,Pilot,1,1,Morty:,Ow! Ow! You're tugging me too hard!,ow ow you 're tugging me too hard
7,8,1,Pilot,1,1,Rick:,"We gotta go, gotta get outta here, come on. Go...",we got ta go got ta get outta here come on got...
8,9,1,Pilot,1,1,Scene Action:,[Cut to Rick's ship],cut to rick 's ship
9,10,1,Pilot,1,1,Rick:,"What do you think of this... flying vehicle,...",what do you think of this ... flying vehicle m...


In [30]:
for w in word_tokenize(df.Line_no_stopwords[2]):
    if w not in list(string.punctuation):
        print(w)

what
rick
what
’
s
going
on


This is not quite right. In the specific example above, the curly apostrophe (’) in "What’s" is not part of `string.punctuation`, so it survives as its own token and the contraction is split incorrectly. I probably need to convert everything to ASCII first to get rid of these non-standard characters.

I'll use unidecode to convert all non-ASCII Unicode characters to close approximations in ASCII.

In [31]:
from unidecode import unidecode
unidecode(df.Line[2])

"  What, Rick? What's going on?"

In [32]:
df['Line']=df.Line.apply(lambda x: unidecode(x))

In [33]:
df.Line[2]

"  What, Rick? What's going on?"

In [34]:
word_tokenize(df.Line[2])

['What', ',', 'Rick', '?', 'What', "'s", 'going', 'on', '?']
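
For reference, the specific problem seen here (the curly apostrophe, U+2019) can also be handled without a third-party package. The following is a narrower stdlib-only sketch, not what this notebook uses; the `SMART_PUNCT` table and `ascii_punct` function are names made up for the example, and the table covers only a handful of characters, whereas unidecode handles far more.

```python
# Map a few common "smart" punctuation marks to ASCII equivalents.
SMART_PUNCT = str.maketrans({
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2026": "...",  # horizontal ellipsis
})

def ascii_punct(text):
    """Replace the characters listed in SMART_PUNCT; leave everything else."""
    return text.translate(SMART_PUNCT)

print(ascii_punct("What, Rick? What\u2019s going on?"))
# prints: What, Rick? What's going on?
```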

In [35]:
word_tokenize(df.Line[1])

['Morty',
 '!',
 'You',
 'got',
 'ta',
 'come',
 'on',
 '.',
 'Jus',
 "'",
 '...',
 'you',
 'got',
 'ta',
 'come',
 'with',
 'me',
 '.']

In [36]:
word_tokenize(df.Line[4])

['It',
 "'s",
 'the',
 'middle',
 'of',
 'the',
 'night',
 '.',
 'What',
 'are',
 'you',
 'talking',
 'about',
 '?']

In [37]:
word_tokenize("this this' this's this'this thisn't")

['this', 'this', "'", 'this', "'s", "this'this", 'this', "n't"]

Try again with the same tokenization, this time filtering out stop words as well as punctuation.

In [38]:
df['Line_no_stopwords'] = df['Line'].apply(lambda x: ' '.join([word for word in word_tokenize(x.lower())
                                                               if word not in stopWords and word not in list(string.punctuation)]))


In [39]:
df[['Line','Line_no_stopwords']]

Unnamed: 0,Line,Line_no_stopwords
0,[Open to Morty's room],open morty 's room
1,Morty! You gotta come on. Jus'... you gotta ...,morty got ta come jus ... got ta come
2,"What, Rick? What's going on?",rick 's going
3,"I got a surprise for you, Morty.",got surprise morty
4,It's the middle of the night. What are you tal...,'s middle night talking
5,,
6,Ow! Ow! You're tugging me too hard!,ow ow 're tugging hard
7,"We gotta go, gotta get outta here, come on. Go...",got ta go got ta get outta come got surprise m...
8,[Cut to Rick's ship],cut rick 's ship
9,"What do you think of this... flying vehicle,...",think ... flying vehicle morty built outta stu...


In [40]:
df.Line_no_stopwords[df.Line_no_stopwords.str.contains("'")]

0                                      open morty 's room
2                                           rick 's going
4                                 's middle night talking
6                                  ow ow 're tugging hard
8                                        cut rick 's ship
10                   yeah rick ... i-it 's great surprise
13      're gon na drop get whole fresh start morty cr...
14                           t-t-that 's absolutely crazy
15      come morty take easy morty 's gon na good righ...
17      drop bomb know want somebody know want thing '...
19                                  jessica 's gon na eve
21                                      's surprise morty
22      ca n't jessica n't even know exist -- forget c...
23      i-i get 're trying say morty listen 'm ... n't...
25      you-you n't worry getting jessica anything sh-...
26                     n't care jessica y-yyyyyyyyyyou --
27                                   know morty 're right
31      'm tak

All common contractions are part of the stop list; however, NLTK's `word_tokenize` splits them into pieces such as `'s`, `n't` and `'re`, which do not match stop-list entries like `it's`. Either way, I can remove the tokens that contain apostrophes.

In [41]:
for word in df['Line_no_stopwords'].iloc[25].split():
    print(word)
    if word.find("'")>-1:
        print('apostrophe')

you-you
n't
apostrophe
worry
getting
jessica
anything
sh-sh-she
--
's
apostrophe
morty


In [42]:
df['Line_no_stopwords']=df['Line_no_stopwords'].apply(lambda x: ' '.join(word for word in x.split()
                                                if word.find("'")==-1))

In [43]:
df

Unnamed: 0,Sentence_id,Season,Episode,Episode_num,Episode_id,Character,Line,Line_no_stopwords
0,1,1,Pilot,1,1,Scene Action:,[Open to Morty's room],open morty room
1,2,1,Pilot,1,1,Rick:,Morty! You gotta come on. Jus'... you gotta ...,morty got ta come jus ... got ta come
2,3,1,Pilot,1,1,Morty:,"What, Rick? What's going on?",rick going
3,4,1,Pilot,1,1,Rick:,"I got a surprise for you, Morty.",got surprise morty
4,5,1,Pilot,1,1,Morty:,It's the middle of the night. What are you tal...,middle night talking
5,6,1,Pilot,1,1,Rick:,,
6,7,1,Pilot,1,1,Morty:,Ow! Ow! You're tugging me too hard!,ow ow tugging hard
7,8,1,Pilot,1,1,Rick:,"We gotta go, gotta get outta here, come on. Go...",got ta go got ta get outta come got surprise m...
8,9,1,Pilot,1,1,Scene Action:,[Cut to Rick's ship],cut rick ship
9,10,1,Pilot,1,1,Rick:,"What do you think of this... flying vehicle,...",think ... flying vehicle morty built outta stu...


It also looks like "--" and "..." show up and can be removed.

In [44]:
for word in df.Line_no_stopwords.iloc[1].split():
    
    if word.find('--')==-1 and word.find('...')==-1:
        print(word)

morty
got
ta
come
jus
got
ta
come


In [45]:
df['Line_no_stopwords']=df['Line_no_stopwords'].apply(lambda x: ' '.join(word for word in x.split()
                                                if word.find("--")==-1 and word.find("...")==-1))

Before further processing, I can do some rudimentary word frequency stats.

In [46]:
dfRick=df[(df.Character.str.contains('Rick')) & caseexcept]
dfMorty=df[(df.Character.str.contains('Morty')) & caseexcept]
dfSummer=df[(df.Character.str.contains('Summer')) & caseexcept]
dfJerry=df[(df.Character.str.contains('Jerry')) & caseexcept]
dfBeth=df[(df.Character.str.contains('Beth')) & caseexcept]

In [47]:
def wordagg(df):
    dflist=[]
    for line in df:
        dflist.extend(line.split())
    return dflist

    

In [48]:
Rickwords=wordagg(dfRick.Line_no_stopwords)
len(Rickwords)

13757

In [49]:
Mortywords=wordagg(dfMorty.Line_no_stopwords)
len(Mortywords)

5761

In [50]:
Jerrywords=wordagg(dfJerry.Line_no_stopwords)
len(Jerrywords)

3301

In [51]:
Summerwords=wordagg(dfSummer.Line_no_stopwords)
len(Summerwords)

2310

In [52]:
Bethwords=wordagg(dfBeth.Line_no_stopwords)
len(Bethwords)

1965

Use `set` to count distinct words and see how varied each person's vocabulary is. "Lexical richness" can be calculated as the number of unique words divided by the total number of words. The more unique words spoken relative to the total, the richer the vocabulary.

In [53]:
def richness(text):
    return len(set(text))/len(text)

In [54]:
richness(Rickwords)

0.2671367303918005

In [55]:
richness(Mortywords)

0.28154834230168374

In [56]:
richness(Jerrywords)

0.41260224174492577
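
One caveat when comparing these numbers: this ratio tends to fall as a text gets longer, because common words keep repeating. Rick has 13757 tokens and Jerry 3301, so part of the gap may reflect length rather than vocabulary. A possible way to compare on a more equal footing is to measure richness over the same number of tokens for each character. The helper `richness_at` below is a hypothetical addition, shown on toy lists:

```python
def richness(tokens):
    """Unique tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens)

def richness_at(tokens, n):
    """Richness over the first n tokens only."""
    if len(tokens) < n:
        raise ValueError("not enough tokens for this sample size")
    return richness(tokens[:n])

# Toy data: the same three-word vocabulary, but one text is twice as long.
long_text = ["a", "b", "a", "b", "a", "b", "c", "a"]
short_text = ["a", "b", "c", "a"]

print(richness(long_text), richness(short_text))
print(richness_at(long_text, 4), richness_at(short_text, 4))
```

For the real lists, `n` could be set to the length of the shortest one, for example `min(len(Rickwords), len(Jerrywords))`. Taking the first `n` tokens is the simplest choice; sampling at random would be another option.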

In [57]:
Rickwords.count('morty')/len(Rickwords)

0.04310532819655448

In [108]:
from nltk import FreqDist
FreqDist(Rickwords).most_common(10)

[('morty', 593),
 ('oh', 184),
 ('know', 181),
 ('na', 152),
 ('get', 147),
 ('got', 140),
 ('gon', 138),
 ('right', 133),
 ('yeah', 128),
 ('go', 119)]

In [109]:
FreqDist(Mortywords).most_common(10)

[('rick', 285),
 ('know', 152),
 ('oh', 139),
 ('like', 70),
 ('na', 64),
 ('right', 61),
 ('get', 60),
 ('gon', 58),
 ('okay', 57),
 ('man', 56)]

In [110]:
FreqDist(Summerwords).most_common(10)

[('oh', 56),
 ('grandpa', 45),
 ('god', 41),
 ('rick', 40),
 ('dad', 35),
 ('yeah', 32),
 ('like', 29),
 ('morty', 24),
 ('going', 21),
 ('go', 20)]

In [111]:
FreqDist(Bethwords).most_common(10)

[('jerry', 66),
 ('dad', 36),
 ('oh', 34),
 ('know', 24),
 ('like', 22),
 ('okay', 21),
 ('want', 20),
 ('morty', 18),
 ('go', 16),
 ('get', 16)]

In [113]:
FreqDist(Jerrywords).most_common(10)

[('well', 44),
 ('know', 42),
 ('oh', 38),
 ('like', 33),
 ('rick', 32),
 ('okay', 30),
 ('hey', 30),
 ('uh', 29),
 ('beth', 28),
 ('``', 26)]

I can make word clouds with this!

In [175]:
Rickfreq=FreqDist(Rickwords).most_common(100)
Mortyfreq=FreqDist(Mortywords).most_common(20)
Summerfreq=FreqDist(Summerwords).most_common(10)
Bethfreq=FreqDist(Bethwords).most_common(10)
Jerryfreq=FreqDist(Jerrywords).most_common(10)

In [115]:
for item in Rickfreq[:15]:
    print(item)

('morty', 593)
('oh', 184)
('know', 181)
('na', 152)
('get', 147)
('got', 140)
('gon', 138)
('right', 133)
('yeah', 128)
('go', 119)
('like', 114)
('rick', 97)
('time', 88)
('one', 87)
('buurp', 83)


In [103]:
Rickfreq[-1][1]

19

In [117]:
Rickfreq[90:99]

[('yes', 21),
 ('new', 20),
 ('y-you', 20),
 ('lot', 20),
 ('gun', 20),
 ('planet', 20),
 ('ta', 19),
 ('anything', 19),
 ('crap', 19)]

In [106]:
len(Rickfreq)

100

The D3 word cloud function I have takes a .js file with the format

`gword=[{text: 'firstword', size: 100},{text: 'nextword', size: 50}];`

I need to convert this list into a js object, and convert the word count into a scalable word size.

In previous work, the font scale has been 80 to 20.

The word count is linearly normalized to the range 0 to 1, then scaled between the two font sizes.
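
Written out, this is a min-max normalization followed by a linear map onto the font range. The symbols are labels chosen for this note: $c$ is a word's count, $c_{\min}$ and $c_{\max}$ are the lowest and highest counts in the list, and $s_{\text{small}}$ and $s_{\text{big}}$ are the two font sizes.

$$
\text{size}(c) = \frac{c - c_{\min}}{c_{\max} - c_{\min}} \left( s_{\text{big}} - s_{\text{small}} \right) + s_{\text{small}}
$$

As a check against Rick's list, where "oh" has a count of 184, the highest count is 593 and the lowest is 19:

$$
\text{size}(184) = \frac{184 - 19}{593 - 19} \times (80 - 20) + 20 \approx 37.25
$$

The formula is undefined when $c_{\max} = c_{\min}$, that is, when every word in the list has the same count.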

In [170]:
bigsize=80
smallsize=20
def jsviewout(freqlist,bigsize,smallsize):
    if len(freqlist)< 100:
        return ('Error, List must have 100 or more counts')
    else:
        highest=freqlist[0][1]
        lowest=freqlist[-1][1]
        for k in freqlist[:100]:
            fsize=k[1]
#             print(fsize)
            fsize=(fsize - lowest) / (highest - lowest)
            fsize= fsize * (bigsize-smallsize) + smallsize
            print(k[0],fsize)



In [171]:
jsviewout(Rickfreq,bigsize,smallsize)

morty 80.0
oh 37.247386759581886
know 36.933797909407666
na 33.90243902439025
get 33.379790940766554
got 32.64808362369338
gon 32.4390243902439
right 31.916376306620208
yeah 31.393728222996515
go 30.45296167247387
like 29.930313588850176
rick 28.153310104529616
time 27.21254355400697
one 27.10801393728223
buurp 26.689895470383277
hey 26.689895470383277
summer 26.27177700348432
come 26.167247386759584
well 25.853658536585364
jerry 25.54006968641115
let 25.43554006968641
take 25.226480836236934
little 25.226480836236934
bleep 24.808362369337978
think 24.494773519163765
`` 24.285714285714285
really 24.181184668989545
back 24.181184668989545
uh 24.181184668989545
good 24.076655052264808
whoa 23.972125435540068
look 23.65853658536585
want 23.449477351916375
okay 23.2404181184669
mean 23.13588850174216
thing 23.031358885017422
god 23.031358885017422
listen 22.926829268292682
us 22.822299651567945
guys 22.822299651567945
see 22.717770034843205
could 22.50871080139373
say 22.40418118466899
wou

In [172]:
def jswrite(freqlist,fname,bigsize,smallsize):
    if len(freqlist)< 100:
        return ('Error, List must have 100 or more counts')
    elif type(fname) is not str or fname[-3:] !='.js':
        return ('Error, filename incorrect format, make string with ".js" filetype')
    else:
        highest=freqlist[0][1]
        lowest=freqlist[-1][1]
        # the with-block closes the file so the output is flushed to disk
        with open(fname,'w') as fhand:
            fhand.write("gword =[")
            first = True
            for k in freqlist[:100]:
                if not first : fhand.write(",\n")
                first = False
                word=k[0]
                fsize=k[1]
                # normalize the count to 0..1, then scale to the font range
                fsize=(fsize - lowest) / (highest - lowest)
                fsize= fsize*(bigsize-smallsize) + smallsize
                fhand.write("{text: '"+word+"', size: "+str(fsize)+"}")
            fhand.write("\n];\n")



In [173]:
fname='test.js'

In [174]:
jswrite(Rickfreq,fname,bigsize,smallsize)
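
Building the JavaScript by string concatenation works here because tokens containing apostrophes were removed earlier; a word like `it's` would otherwise end the single-quoted string early. A possible alternative, sketched below under that concern, is to let `json.dumps` handle the quoting. The function name `to_gword_js` is hypothetical, it returns the source text instead of writing a file, and it assumes the list is sorted by descending count, as `most_common` returns it.

```python
import json

def to_gword_js(freqlist, bigsize=80, smallsize=20):
    """Build the `gword = [...]` JavaScript source as a string."""
    highest = freqlist[0][1]
    lowest = freqlist[-1][1]
    span = (highest - lowest) or 1  # avoid dividing by zero when all counts match
    entries = [
        {"text": word,
         "size": (count - lowest) / span * (bigsize - smallsize) + smallsize}
        for word, count in freqlist
    ]
    # JSON object syntax is also valid JavaScript object-literal syntax.
    return "gword = " + json.dumps(entries, indent=1) + ";\n"

# Made-up counts, including a word with an apostrophe
print(to_gword_js([("morty", 593), ("it's", 306), ("ta", 19)]))
```

The keys come out double-quoted (`"text"`, `"size"`), which JavaScript accepts in an object literal.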

In [176]:
Rickfreq=FreqDist(Rickwords).most_common(100)
Mortyfreq=FreqDist(Mortywords).most_common(100)
Summerfreq=FreqDist(Summerwords).most_common(100)
Bethfreq=FreqDist(Bethwords).most_common(100)
Jerryfreq=FreqDist(Jerrywords).most_common(100)

In [177]:
jswrite(Rickfreq,'rickfreq.js',bigsize,smallsize)
jswrite(Mortyfreq,'mortyfreq.js',bigsize,smallsize)
jswrite(Summerfreq,'summerfreq.js',bigsize,smallsize)
jswrite(Bethfreq,'bethfreq.js',bigsize,smallsize)
jswrite(Jerryfreq,'jerryfreq.js',bigsize,smallsize)

At this point I created a folder called `word_cloud` where I put the d3 functions. I need to move all .js files to that location.

In [180]:
import glob
import os
for file in glob.glob('*.js'):
    os.rename(file,os.path.join('word_cloud',file))

I have a gword.htm file that needs to be slightly modified for each file to show correctly.

In [184]:
gwordfile=os.path.join('word_cloud','gword.htm')
with open(gwordfile) as file:
    gword=file.read()

In [189]:
htmnames=['mortyfreq','summerfreq','bethfreq','jerryfreq']
for name in htmnames:
    newdoc=gword.replace("rickfreq",name)
    htmhand=open(os.path.join('word_cloud',name+'.htm'),'w')
    htmhand.write(newdoc)
    htmhand.close()


In [63]:
# from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# countvect=CountVectorizer()

# rickcountvect=countvect.fit_transform(Rickwords).toarray()

In [64]:
# rickcountvect

After removing stop words, we need to normalize the lexicon. One way to do this is with lemmatization, a procedure to obtain the root form (lemma) of a word. It will turn things like "multiplying" into "multiply".

In [65]:
# from nltk.stem.wordnet import WordNetLemmatizer 
# lem = WordNetLemmatizer()

In [66]:
# df['Line_no_stopwords'].apply(lambda x: ' '.join(lem.lemmatize(w, pos='v') for w in x.split()))