# Natural Language Processing

Natural Language Processing (NLP) focuses on computer understanding of human language. The future of text-based NLP may include more accurate search, universal translation, text summarization, conversational coding, androids, and more.

Many challenges currently stand in the way of these things. A short list of these challenges includes:
1. Polysemy - words can have multiple meanings; which meaning is the correct one?
2. Fluidity of syntax and grammar - what rules should be used to break down a sentence?
3. Errors - misspellings or incorrect grammar can derail brittle analysis
4. Semantics - meaning can change on a word and sentence level
5. Context - how much is plainly written and how much must be inferred?
6. Evolution of language - languages are rarely in stasis.


The basic outline for NLP flows through four **very broad** tiers/categories of increasing scope and difficulty:
1. Morphological processing - What are the discrete units (tokens) of meaning? In English this is relatively trivial because words are distinctly separated, but words can be broken down further into subwords. For example, "incoherently" has a prefix "in-", a stem "coherent", and a suffix "-ly". Each part changes the meaning and usage. Tokens can have multiple meanings (polysemy), and the exact meaning and type (e.g. noun, verb, etc.) of a word may still be ambiguous at this point.

2. Syntax/Grammar processing - What is the structure of the sentence? Do the words interact correctly? By looking at which tokens are in a string as well as what *order* they are in, we can determine relationships between tokens based on rules of definition (lexicon) and syntax (grammar/structure). This processing can convert a sentence like "The large cat chased the rat" into a formal notation such as "Article Adjective Noun Verb Article Noun", or further into "Noun-Phrase Verb Noun-Phrase" (see the Lkit PDF for a tree visualization). Grammar can disambiguate the meanings of "brush" in the sentences "**Brush** your hair" (verb) vs. "Hand me the **brush**" (noun).

3. Semantic analysis - What is the meaning of a string (sentence) of tokens (words)? The relationship of words in the syntactic framework allows us to disambiguate the meaning of the words. For example, in the sentence "He put a carrot on the plate and then ate **it**", we need semantics to determine what "it" refers to: in this case "it" is the carrot, not the plate.

4. Pragmatic (contextual) analysis - What is the meaning with respect to the entire context? Many phrases are still ambiguous after semantic analysis, such as "Put the apple in the basket on the shelf", which can have two meanings:
 - put the apple which is *currently in the basket* on the shelf
 - put the apple into the basket which is *currently on the shelf*

   Although this is a trivial example, the "correct" answer depends on the current state of the apple and basket, which may have been determined in previous sentences. Humor and sarcasm are extremely advanced forms of contextual understanding: "Trump is definitely the best president ever" can mean completely opposite things depending on who is saying it, and when they say it. It would require both understanding of the current state of a broad range of topics, as well as the history of the person saying it.
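
The first two tiers above can be illustrated with a minimal sketch. It assumes toy, hand-written affix lists and a tiny lexicon: `PREFIXES`, `SUFFIXES`, `LEXICON`, and all function names are hypothetical and introduced only for illustration. Real systems rely on far larger resources and usually on statistical models.

```python
# Toy affix lists; hypothetical and far from complete.
PREFIXES = ("in", "un", "re")
SUFFIXES = ("ly", "ness", "ing")

# Hypothetical mini-lexicon mapping each word to a single part of speech.
LEXICON = {
    "the": "Article",
    "large": "Adjective",
    "cat": "Noun",
    "rat": "Noun",
    "chased": "Verb",
}


def split_affixes(word):
    """Tier 1: split a word into (prefix, stem, suffix) using the toy lists."""
    prefix = next((p for p in PREFIXES if word.startswith(p)), "")
    stem = word[len(prefix):]
    suffix = next((s for s in SUFFIXES if stem.endswith(s)), "")
    if suffix:
        stem = stem[: -len(suffix)]
    return prefix, stem, suffix


def tag(sentence):
    """Tier 2: look up a part-of-speech tag for each whitespace-separated token."""
    return [LEXICON.get(token.lower(), "Unknown") for token in sentence.split()]


def chunk_noun_phrases(tags):
    """Tier 2: collapse each 'Article Adjective* Noun' run into 'Noun-Phrase'."""
    chunks, i = [], 0
    while i < len(tags):
        if tags[i] == "Article":
            j = i + 1
            while j < len(tags) and tags[j] == "Adjective":
                j += 1
            if j < len(tags) and tags[j] == "Noun":
                chunks.append("Noun-Phrase")
                i = j + 1
                continue
        chunks.append(tags[i])
        i += 1
    return chunks


print(split_affixes("incoherently"))       # ('in', 'coherent', 'ly')
tags = tag("The large cat chased the rat")
print(" ".join(tags))                      # Article Adjective Noun Verb Article Noun
print(" ".join(chunk_noun_phrases(tags)))  # Noun-Phrase Verb Noun-Phrase
```

This naive approach breaks quickly: `split_affixes("inside")` would wrongly strip "in", and a one-tag-per-word lexicon cannot tell the verb "brush" from the noun "brush", which is exactly why the later tiers are needed.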

A few interesting links that helped me create this document (as I am still learning!):

[Algorithmia - What is NLP?](https://blog.algorithmia.com/introduction-natural-language-processing-nlp/)

[Lkit NLP intro](https://www.scm.tees.ac.uk/isg/aia/nlp/NLP-overview.pdf) 

[Zareen Syed's slideshare](https://www.slideshare.net/zareen/challenges-in-nlp)

[tutorialspoint intro to NLP](https://www.tutorialspoint.com/artificial_intelligence/artificial_intelligence_natural_language_processing.htm)

[Analytics Vidhya guide to NLP](https://www.analyticsvidhya.com/blog/2017/01/ultimate-guide-to-understand-implement-natural-language-processing-codes-in-python/)



## Important steps breakdown

### Preprocessing (even more than before!!!)
#### Remove Noise
- remove sentences tagged as scene action
- remove language stopwords such as "is", "a", "this". These are super common words that do not help in determining the context of other words.
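
A minimal sketch of both noise-removal steps, assuming rows arrive as (character, line) pairs shaped like the transcript's `Character` and `Line` columns. The `STOPWORDS` set and the `remove_noise` helper are hypothetical; a real stopword list would be much longer.

```python
# Assumed stopword list for illustration; real lists are much longer.
STOPWORDS = {"is", "a", "this", "the", "of"}

# Sample (character, line) pairs shaped like the transcript rows.
rows = [
    ("Scene Action:", "[Open to Morty's room]"),
    ("Rick:", "I got a surprise for you, Morty."),
    ("Morty:", "Is this the surprise?"),
]


def remove_noise(rows):
    """Drop scene-action rows, then strip punctuation and stopwords from each line."""
    cleaned = []
    for character, line in rows:
        if character.startswith("Scene Action"):
            continue  # stage directions are not dialogue
        tokens = [word.strip(".,!?").lower() for word in line.split()]
        cleaned.append((character, [t for t in tokens if t and t not in STOPWORDS]))
    return cleaned


for character, tokens in remove_noise(rows):
    print(character, tokens)
```

Lowercasing before the stopword check is a design choice here so that "Is" and "is" are treated the same.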

#### Lexicon Normalization
- compressing multiple representations of the same word into one with stemming (strip suffixes)
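
As a rough illustration of suffix stripping, the sketch below uses an assumed, very short suffix list. `crude_stem` is a hypothetical helper, not an established stemming algorithm; in practice a tested stemmer such as the Porter stemmer would be a better choice.

```python
# Assumed suffix list, checked in order; not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["play", "plays", "played", "playing"]
print([crude_stem(w) for w in words])  # ['play', 'play', 'play', 'play']
```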

### Potential outputs at this stage:

- Statistical: word and sentence counts per character
- tf-idf: term frequency-inverse document frequency. This takes the frequency of a word in a subset and normalizes it by how common the same word is across the entire set, giving the relative importance of a word in the subset vs. the whole. It can be used to determine whether a word is more frequent in a specific episode than it is in the whole show, or whether a specific character uses a word more frequently than the characters overall.

Either of the above can be used to create word cloud outputs per character, episode, etc.
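
To make the tf-idf idea concrete, here is a small sketch using only the standard library. It assumes the common weighting tf × log(N / df), which is one of several variants and may not match the exact formula used elsewhere in this project. The token lists in `docs` are made up for illustration and are not taken from the transcript.

```python
import math
from collections import Counter

# Hypothetical token lists, one per character; not taken from the transcript.
docs = {
    "Rick": ["morty", "portal", "gun", "morty"],
    "Morty": ["rick", "geez", "rick", "portal"],
    "Beth": ["jerry", "horse", "surgery", "jerry"],
}


def tf_idf(docs):
    """Return {doc: {term: score}} with tf = count / doc length and idf = log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        scores[name] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


scores = tf_idf(docs)
for name, terms in scores.items():
    top = max(terms, key=terms.get)
    print(name, top, round(terms[top], 3))
```

A word like "portal" that appears in more than one character's lines gets a lower score than a word unique to one character, which is what makes the scores useful for weighting word clouds.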

### Advanced syntactic/semantic processing
#### Object Standardization
- Many domain-specific words/acronyms are not in standard lexical dictionaries, so they need to be mapped to a standard form (for example with a lookup table).
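
One simple way to handle this is a lookup table, sketched below. The `LOOKUP` entries and the `standardize` helper are hypothetical and purely illustrative; a real table would be built from the vocabulary of the text being analyzed.

```python
# Hypothetical lookup table; the entries are illustrative only.
LOOKUP = {
    "gotta": "got to",
    "outta": "out of",
    "dunno": "do not know",
}


def standardize(text):
    """Replace each whitespace-separated token found in LOOKUP with its expansion."""
    return " ".join(LOOKUP.get(token.lower(), token) for token in text.split())


print(standardize("We gotta get outta here"))  # We got to get out of here
```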

#### Syntactic Parsing
- analysis of grammar and arrangement. We want to tag words with their relationship to the other words in a sentence.



Sentiment analysis:

    # Load the cleaned transcript
    import pandas as pd

    df = pd.read_csv('clean_RandMtranscript.csv')

    # List the available columns
    df.columns

Output:

    Index(['Sentence_id', 'Season', 'Episode', 'Episode_num', 'Episode_id',
           'Character', 'Line'],
          dtype='object')

    # Rows whose Character field contains 'Morty'
    df[(df.Character.str.contains('Morty'))]

Output:

| | Sentence_id | Season | Episode | Episode_num | Episode_id | Character | Line |
|---|---|---|---|---|---|---|---|
| 2 | 3 | 1 | Pilot | 1 | 1 | Morty: | (rubs his eyes) What, Rick? What’s going on? |
| 4 | 5 | 1 | Pilot | 1 | 1 | Morty: | It's the middle of the night. What are you tal... |
| 6 | 7 | 1 | Pilot | 1 | 1 | Morty: | Ow! Ow! You're tugging me too hard! |
| 10 | 11 | 1 | Pilot | 1 | 1 | Morty: | Yeah, Rick... I-it's great. Is this the surprise? |
| 12 | 13 | 1 | Pilot | 1 | 1 | Morty: | What?! A bomb?! |
| 14 | 15 | 1 | Pilot | 1 | 1 | Morty: | T-t-that's absolutely crazy! |
| 16 | 17 | 1 | Pilot | 1 | 1 | Morty: | Jessica? From my math class? |
| 18 | 19 | 1 | Pilot | 1 | 1 | Morty: | Ohh... |
| 20 | 21 | 1 | Pilot | 1 | 1 | Morty: | Whhhh-wha? |
| 22 | 23 | 1 | Pilot | 1 | 1 | Morty: | No, you can't! (shoves Rick away) Jessica does... |


    # Exclude rows whose Character field contains 'except'
    df[(~df.Character.str.contains('except'))]

Output:

| | Sentence_id | Season | Episode | Episode_num | Episode_id | Character | Line |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Pilot | 1 | 1 | Scene Action: | [Open to Morty’s room] |
| 1 | 2 | 1 | Pilot | 1 | 1 | Rick: | (stumbles in drunkenly, and turns on the light... |
| 2 | 3 | 1 | Pilot | 1 | 1 | Morty: | (rubs his eyes) What, Rick? What’s going on? |
| 3 | 4 | 1 | Pilot | 1 | 1 | Rick: | I got a surprise for you, Morty. |
| 4 | 5 | 1 | Pilot | 1 | 1 | Morty: | It's the middle of the night. What are you tal... |
| 5 | 6 | 1 | Pilot | 1 | 1 | Rick: | (spills alcohol on Morty's bed) Come on, I got... |
| 6 | 7 | 1 | Pilot | 1 | 1 | Morty: | Ow! Ow! You're tugging me too hard! |
| 7 | 8 | 1 | Pilot | 1 | 1 | Rick: | We gotta go, gotta get outta here, come on. Go... |
| 8 | 9 | 1 | Pilot | 1 | 1 | Scene Action: | [Cut to Rick's ship] |
| 9 | 10 | 1 | Pilot | 1 | 1 | Rick: | (Rick drives through the night sky) What do yo... |


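    # Find leftover rows where a scene transition ended up in the Character field.
    # The pattern is a raw string so that '\[' is passed to the regex engine
    # as an escaped literal bracket.
    df[(df.Character.str.contains('Morty')) &
       (~df.Character.str.contains('except')) &
       (df.Character.str.contains(r'\['))].head()

Output:

| | Sentence_id | Season | Episode | Episode_num | Episode_id | Character | Line |
|---|---|---|---|---|---|---|---|
| 2910 | 2911 | 1 | Close_Rick-counters_of_the_Rick_Kind | 10 | 10 | [Trans. Evil Rick’s lab. Rick is tied to an ex... | Ahh, isn’t it beautiful? |
| 7041 | 7042 | 3 | Pickle_Rick | 3 | 24 | [Transition to Beth, Summer, and Morty sitting... | "Dr. WONG / FAMILY THERAPY / COPROPHRAGIA RECO... |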


    # Keep separate frames for rows matching 'Morty' and rows matching 'Rick'
    dfMorty = df[(df.Character.str.contains('Morty')) &
                 (~df.Character.str.contains('except'))]

    dfRick = df[(df.Character.str.contains('Rick')) &
                (~df.Character.str.contains('except'))]

    dfRick.head()

Output:

| | Sentence_id | Season | Episode | Episode_num | Episode_id | Character | Line |
|---|---|---|---|---|---|---|---|
| 1 | 2 | 1 | Pilot | 1 | 1 | Rick: | (stumbles in drunkenly, and turns on the light... |
| 3 | 4 | 1 | Pilot | 1 | 1 | Rick: | I got a surprise for you, Morty. |
| 5 | 6 | 1 | Pilot | 1 | 1 | Rick: | (spills alcohol on Morty's bed) Come on, I got... |
| 7 | 8 | 1 | Pilot | 1 | 1 | Rick: | We gotta go, gotta get outta here, come on. Go... |
| 9 | 10 | 1 | Pilot | 1 | 1 | Rick: | (Rick drives through the night sky) What do yo... |


    # Rows whose Character field contains 'Beth'
    df[(df.Character.str.contains('Beth')) &
       (~df.Character.str.contains('except'))]

Output:

| | Sentence_id | Season | Episode | Episode_num | Episode_id | Character | Line |
|---|---|---|---|---|---|---|---|
| 46 | 47 | 1 | Pilot | 1 | 1 | Beth: | Morty, are you getting sick? (Morty lifts his ... |
| 52 | 53 | 1 | Pilot | 1 | 1 | Beth: | Dad? |
| 55 | 56 | 1 | Pilot | 1 | 1 | Beth: | Jerry! |
| 61 | 62 | 1 | Pilot | 1 | 1 | Beth: | (tears up in happiness) Oh, dad… |
| 111 | 112 | 1 | Pilot | 1 | 1 | Beth: | (performing heart surgery on a horse) Scalpel. |
| 114 | 115 | 1 | Pilot | 1 | 1 | Beth: | Jerry? |
| 118 | 119 | 1 | Pilot | 1 | 1 | Beth: | Jerry, please tell me you're here for an incre... |
| 122 | 123 | 1 | Pilot | 1 | 1 | Beth: | Okay, I only ask, Jerry, because, as you know,... |
| 124 | 125 | 1 | Pilot | 1 | 1 | Beth: | Excuse me? |
| 128 | 129 | 1 | Pilot | 1 | 1 | Beth: | Hey, Tom! We know when we're losing him. WE CA... |
