# Section 40: Foundations of Natural Language Processing

## Questions

- Regex: what is it, and when do we use it?

## Learning Objectives

- Introduce the field of Natural Language Processing
- Learn about the extensive preprocessing involved with text data
- Touch on related tasks such as sentiment analysis


## NLP & Word Vectorization

> **_Natural Language Processing_**, or **_NLP_**, is the study of how computers can interact with humans through the use of human language.  Although this is a field that is quite important to Data Scientists, it does not belong to Data Science alone.  NLP has been around for quite a while, and sits at the intersection of *Computer Science*, *Artificial Intelligence*, *Linguistics*, and *Information Theory*. 

## Where is NLP Used?
- Reviews (e.g. Amazon)
- Stock market trading

- **Demonstration:**
    - [Google Duplex AI Assistant](https://youtu.be/D5VN56jQMWM)

### Working with Text Data

- Preparing text data requires more processing than typical numeric or tabular data.
- In addition to cleaning the text itself to remove meaningless words, we have to convert our text data into numeric form for our machine learning models to analyze.
- Text data must be cleaned and vectorized before we can use it.
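
As a rough illustration of what "vectorized" means here, the sketch below turns two short made-up sentences into count vectors using only the standard library. The example sentences and the `vectorize` helper are assumptions for illustration and are not part of the lesson's code.

```python
from collections import Counter

# Two toy documents (made up for illustration)
docs = ["the cat sat on the mat", "the dog sat"]

# Tokenize naively on whitespace and build a sorted vocabulary
tokenized = [doc.lower().split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})

def vectorize(tokens, vocab):
    """Return one count per vocabulary word (hypothetical helper)."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

vectors = [vectorize(tokens, vocab) for tokens in tokenized]
print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Each document becomes a list of numbers with one position per vocabulary word, which is the form a machine learning model can consume.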


## NLP with NLTK

### NLP Vocabulary
- Corpus
    - Body of text
    
- Bag of Words
    - Collection of all words from a corpus.

    
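- Stopwords
    - Commonly used words that carry little information (e.g. "the", "was", "he").

- Tokenization
    - Splitting a long string into a list of individual words (tokens).
    
- Stemming
    - Reducing a word to its core stem by stripping endings (e.g. "democrats" to "democrat").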

<img src="https://raw.githubusercontent.com/learn-co-students/dsc-nlp-and-word-vectorization-online-ds-ft-100719/master/images/new_stemming.png" width=40%>

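- Lemmatization
    - Reducing a word to its simplest dictionary form, called a lemma (e.g. "ran" and "running" to "run").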

|   Word   |  Stem | Lemma |
|:--------:|:-----:|:-----:|
|  Studies | Studi | Study |
| Studying | Study | Study |

## Context-Free Grammars and POS Tagging

<img src="https://raw.githubusercontent.com/jirvingphd/dsc-context-free-grammars-and-POS-tagging-online-ds-ft-100719/master/images/new_LevelsOfLanguage-Graph.png">

#### Syntax and Meaning Can be Difficult for Computers 

In English, sentences consist of a **_Noun Phrase_** followed by a **_Verb Phrase_**, which may optionally be followed by a **_Prepositional Phrase_**.

This ***seems simple, but it gets more tricky*** when we realize that there is a recursive structure to these phrases.

- A noun phrase may consist of multiple smaller noun phrases, and in some cases, even a verb phrase. 
- Similarly, a verb phrase can consist of multiple smaller verb phrases and noun phrases, which can themselves be made up of smaller noun phrases and verb phrases. 


This leads to levels of **_ambiguity_** that can be troublesome for computers. NLTK's documentation explains this by examining the classic Groucho Marx joke:

> ***"While hunting in Africa, I shot an elephant in my pajamas. How he got into my pajamas, I don't know."***



<img src="https://raw.githubusercontent.com/jirvingphd/dsc-context-free-grammars-and-POS-tagging-online-ds-ft-100719/master/images/parse_tree.png">

## Feature Engineering for Text Data


* Do we remove stop words or not?    
* Do we stem or lemmatize our text data, or leave the words as is?   
* Is basic tokenization enough, or do we need to support special edge cases through the use of regex?  
* Do we use the entire vocabulary, or just limit the model to a subset of the most frequently used words? If so, how many?  
* Do we engineer other features, such as bigrams, or POS tags, or Mutual Information Scores?   
* What sort of vectorization should we use in our model? Boolean Vectorization? Count Vectorization? TF-IDF? More advanced vectorization strategies such as Word2Vec?  
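
Two of the choices above, engineering bigrams and limiting the vocabulary to the most frequent words, can be sketched with the standard library alone. The token list and the `top_n` cutoff below are made up for illustration.

```python
from collections import Counter

# Toy token list (made up for illustration)
tokens = ["fake", "news", "media", "fake", "news"]

# Bigrams: pair each token with the token that follows it
bigrams = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigrams)

# Limit the vocabulary to the top_n most frequent words (hypothetical cutoff)
top_n = 2
limited_vocab = [word for word, count in Counter(tokens).most_common(top_n)]

print(bigram_counts.most_common(1))  # [(('fake', 'news'), 2)]
print(limited_vocab)                 # ['fake', 'news']
```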


In [1]:
# !pip install -U fsds_100719
from fsds_100719.imports import *

fsds_1007219  v0.7.21 loaded.  Read the docs: https://fsds.readthedocs.io/en/latest/ 


Handle,Package,Description
dp,IPython.display,Display modules with helpful display and clearing commands.
fs,fsds_100719,Custom data science bootcamp student package
mpl,matplotlib,Matplotlib's base OOP module with formatting artists
plt,matplotlib.pyplot,Matplotlib's matlab-like plotting module
np,numpy,scientific computing with Python
pd,pandas,High performance data structures and tools
sns,seaborn,High-level data visualization library based on matplotlib


[i] Pandas .iplot() method activated.


### MacBeth

In [16]:
# import requests
# macbeth = requests.get('http://www.gutenberg.org/cache/epub/2264/pg2264.txt').text
# macbeth[:500]

In [17]:
# print(macbeth[14000:18000])

In [18]:
# text = macbeth.split('David Reed')[-1]
# print(text[:500])

___

# Capstone Excerpt:

In [4]:
df = pd.read_csv('https://raw.githubusercontent.com/jirvingphd/capstone-project-using-trumps-tweets-to-predict-stock-market/master/data/trump_tweets_12012016_to_01012020.csv')
df.head()

Unnamed: 0,source,text,created_at,retweet_count,favorite_count,is_retweet,id_str
0,Twitter Media Studio,https://t.co/EVAEYD1AgV,01-01-2020 03:12:07,25016,108830,False,1212209862094012416
1,Twitter for iPhone,HAPPY NEW YEAR!,01-01-2020 01:30:35,85409,576045,False,1212184310389850119
2,Twitter for iPhone,Our fantastic First Lady! https://t.co/6iswto4WDI,01-01-2020 01:22:28,27567,132633,False,1212182267113680896
3,Twitter for iPhone,RT @DanScavino: https://t.co/CJRPySkF1Z,01-01-2020 01:18:47,10796,0,True,1212181341078458369
4,Twitter for iPhone,RT @SenJohnKennedy: I think Speaker Pelosi is ...,01-01-2020 01:17:43,8893,0,True,1212181071988703232


In [3]:
corpus = list(df['text'].values)
corpus[:10]

['https://t.co/EVAEYD1AgV',
 'HAPPY NEW YEAR!',
 'Our fantastic First Lady! https://t.co/6iswto4WDI',
 'RT @DanScavino: https://t.co/CJRPySkF1Z',
 'RT @SenJohnKennedy: I think Speaker Pelosi is having 2nd thoughts about impeaching the President. The Senate should get back to work on USM…',
 'Thank you Steve. The greatest Witch Hunt in U.S. history! https://t.co/I3bSNVp6gC',
 'RT @ThisWeekABC: Sen. Ron Johnson says charges against Pres. Trump are "pretty thin gruel" and Speaker Nancy Pelosi\'s decision to withhold…',
 "RT @SenJohnKennedy: The Senate needs to reauthorize the Violence Against Women Act and I am proud to cosponsor @SenJoniErnst's bill that g…",
 'RT @LindseyGrahamSC: To our Iraqi allies:This is your moment to convince the American people the US-Iraq relationship is meaningful to yo…',
 'RT @LindseyGrahamSC: President Trump unlike President Obama will hold you accountable for threats against Americans and hit you where it…']

## Tweet Natural Language Processing

To prepare Donald Trump's tweets for modeling, **it is essential to preprocess the text** and simplify its contents.
<br><br>
1. **At a minimum, things like:**
    - punctuation
    - numbers
    - upper vs lowercase letters<br>
    ***must*** be addressed before any initial analyses. I refer to this initial cleaning as **"minimal cleaning"** of the text content.<br>
    
> Version 1 of the tweet processing removes these items, along with any URLs in a tweet. The resulting data column is referred to here as `content_min_clean`.

<br><br>
2. It is **always recommended** that we go a step beyond this and<br> remove **commonly used words that contain little information** <br>for our machine learning algorithms. Words like "the", "was", "he", "she", "it", etc.<br> are called **"stopwords"**, and it is critical to address them as well.

> Version 2 of the tweet processing removes these items, and the resulting data column is referred to here as `cleaned_stopped_content`.

<br>

3. Additionally, many analyses **need the text tokenized** into a list of words<br> rather than kept in a natural sentence format. The result is a list of words (**tokens**) separated by ",", which tells the algorithm what should be considered one word.<br><br>For the tweet processing, I used a version of tokenization called regular-expression tokenization (NLTK's `regexp_tokenize`), <br>which uses a pattern of letters and symbols (the `expression`) <br>that indicates what combination of alphanumeric characters should be considered a single token.<br><br>The pattern I used was `"([a-zA-Z]+(?:'[a-z]+)?)"`, which allows for words such as "can't" that contain "'" in the middle of the word. This process was actually applied in order to produce Versions 1 and 2 of the tweets, but the resulting text was put back into sentence form. 

> Version 3 of the tweets keeps the text in its regexp-tokenized form and is referred to as `cleaned_stopped_tokens`.
<br>

4. While not always required, it is often a good idea to reduce similar words down to a shared core.
There are often **multiple variants of the same word with the same/similar meaning**,<br> where one may be plural **(e.g. "democrat" and "democrats")**, or the form of the word may differ **(e.g. "run", "running").**<br> Simplifying words down to the basic core word (or word *stem*) is referred to as **"stemming"**. <br><br> A more advanced form of this also handles words that are just in a **different tense**, such as **"ran", "run", "running"**. This process is called **"lemmatization"**, where the words are reduced to their simplest form, called "**lemmas**".<br>  

> Version 4 of the tweets is reduced down to word lemmas, further aiding the algorithm in learning the meaning of the texts.


#### EXAMPLE TWEETS AND PROCESSING STEPS:

**TWEET FROM 08-25-2017 12:25:10:**
* **["content"] column:**<p><blockquote>***"Strange statement by Bob Corker considering that he is constantly asking me whether or not he should run again in '18. Tennessee not happy!"***
    
    
* **["content_min_clean"] column:**<p><blockquote>***"strange statement by bob corker considering that he is constantly asking me whether or not he should run again in  18  tennessee not happy "***
    
    
* **["cleaned_stopped_content"] column:**<p><blockquote>***"strange statement bob corker considering constantly asking whether run tennessee happy"***
    
    
* **["cleaned_stopped_tokens"] column:**<p><blockquote>***"['strange', 'statement', 'bob', 'corker', 'considering', 'constantly', 'asking', 'whether', 'run', 'tennessee', 'happy']"***
    
    
* **["cleaned_stopped_lemmas"] column:**<p><blockquote>***"strange statement bob corker considering constantly asking whether run tennessee happy"***

# Practicing Text Preprocessing with Trump's Tweets

### Make a Bag-of-Words Frequency Distribution 

In [6]:
from nltk import FreqDist
corpus[0]

'https://t.co/EVAEYD1AgV'

In [8]:
# Note: passing a single string makes FreqDist count characters, not words
freq= FreqDist(','.join(corpus))
freq.most_common(100)

[(' ', 355279),
 ('e', 193636),
 ('t', 151004),
 ('a', 133814),
 ('o', 132306),
 ('n', 115782),
 ('i', 111789),
 ('r', 107277),
 ('s', 97228),
 ('h', 76548),
 ('l', 67415),
 ('d', 57543),
 ('u', 46550),
 ('c', 45746),
 ('m', 43554),
 ('p', 37247),
 ('g', 35477),
 ('y', 33375),
 ('.', 32244),
 ('w', 30005),
 ('f', 29211),
 ('b', 22246),
 ('T', 19911),
 ('v', 17186),
 (',', 14545),
 ('k', 13783),
 ('/', 13588),
 ('S', 12999),
 ('A', 12922),
 ('R', 12294),
 ('C', 10866),
 ('I', 10743),
 (':', 9821),
 ('N', 9009),
 ('!', 8861),
 ('D', 8787),
 ('@', 8646),
 ('M', 8063),
 ('E', 8008),
 ('P', 7726),
 ('W', 7249),
 ('O', 6828),
 ('H', 6614),
 ('B', 6115),
 ('G', 5937),
 ('F', 5701),
 ('L', 4774),
 ('U', 4765),
 ('0', 4659),
 ('x', 4068),
 ('J', 4060),
 ('’', 3445),
 ('1', 3191),
 ('j', 3084),
 ('…', 2956),
 ('2', 2812),
 ('-', 2672),
 ('K', 2622),
 ('V', 2460),
 ('Y', 2194),
 (';', 2105),
 ('z', 2082),
 ('&', 2079),
 ('“', 1993),
 ('”', 1890),
 ('5', 1848),
 ('#', 1785),
 ('3', 1730),
 ('4', 1
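
The counts above are per character because iterating over a Python string yields one character at a time. The sketch below shows the difference on two tweets from the corpus, using `collections.Counter` as a stand-in for `FreqDist` (which is a `Counter` subclass) and a plain whitespace split as a stand-in for a real tokenizer.

```python
from collections import Counter

# Two tweets taken from the corpus shown earlier
tweets = ["HAPPY NEW YEAR!", "Our fantastic First Lady!"]

# Counting over a string iterates character by character
char_counts = Counter(",".join(tweets))

# Counting over a list of tokens gives word frequencies instead
word_counts = Counter(" ".join(tweets).split())

print(char_counts["A"])      # 2
print(word_counts["HAPPY"])  # 1
```

To get word frequencies, the text has to be tokenized into a list before it is counted.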

In [9]:
from nltk import word_tokenize
tokens = word_tokenize(','.join(corpus))
freq=FreqDist(tokens)
freq.most_common(100)

[('the', 15560),
 (',', 14160),
 ('.', 13708),
 (':', 9462),
 ('to', 9308),
 ('!', 8861),
 ('@', 8646),
 ('and', 8497),
 ('of', 7176),
 ('a', 5658),
 ('is', 5096),
 ('in', 4996),
 ('https', 4265),
 ('for', 4081),
 ('RT', 3819),
 ('’', 3445),
 ('on', 3150),
 ('I', 3132),
 ('that', 3031),
 ('are', 2825),
 ('with', 2650),
 ('...', 2620),
 ('be', 2501),
 ('will', 2486),
 ('our', 2418),
 ('The', 2367),
 ('have', 2116),
 (';', 2105),
 ('&', 2079),
 ('amp', 2070),
 ('“', 1993),
 ('it', 1932),
 ('”', 1890),
 ('you', 1834),
 ('was', 1789),
 ('#', 1785),
 ('at', 1621),
 ('has', 1600),
 ('they', 1553),
 ('s', 1517),
 ('great', 1501),
 ('President', 1492),
 ('not', 1415),
 ('we', 1389),
 ('by', 1375),
 ('this', 1336),
 ('all', 1304),
 ('t', 1289),
 ('(', 1181),
 (')', 1174),
 ('Trump', 1154),
 ('Democrats', 1145),
 ('people', 1124),
 ('very', 1084),
 ('-', 1079),
 ('We', 1072),
 ('who', 1040),
 ('?', 1007),
 ('realDonaldTrump', 996),
 ('from', 979),
 ('my', 966),
 ('as', 956),
 ('he', 943),
 ('the

### Removing Stopwords

In [10]:
## Make a list of stopwords to remove
from nltk.corpus import stopwords
import string

In [19]:
# Get all the stop words in the English language
stopwords_list = stopwords.words('english')
stopwords_list+=string.punctuation
print(stopwords_list)
stopwords_list.remove('until')
stopwords_list.extend(['“','...','”'])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [20]:
## Commentary on not always accepting what is or isn't in stopwords
'until' in stopwords_list

False

In [21]:
stopped_tokens = [w.lower() for w in tokens if w.lower() not in stopwords_list]
freq = FreqDist(stopped_tokens)
freq.most_common(100)

[('https', 4265),
 ('rt', 3819),
 ('’', 3445),
 ('great', 2552),
 ('amp', 2070),
 ('president', 1605),
 ('people', 1309),
 ('trump', 1193),
 ('democrats', 1166),
 ('realdonaldtrump', 1045),
 ('country', 947),
 ('news', 931),
 ('thank', 929),
 ('big', 832),
 ('fake', 802),
 ('new', 791),
 ('many', 749),
 ('today', 747),
 ('get', 741),
 ('would', 714),
 ('border', 711),
 ('america', 699),
 ('never', 676),
 ('time', 666),
 ('u.s.', 625),
 ('american', 611),
 ('much', 594),
 ('want', 591),
 ('one', 588),
 ('years', 587),
 ('media', 582),
 ('good', 567),
 ('united', 543),
 ('even', 525),
 ('house', 523),
 ('states', 509),
 ('back', 492),
 ('``', 491),
 ('done', 485),
 ("'s", 479),
 ('must', 478),
 ('make', 478),
 ('china', 474),
 ('like', 465),
 ('going', 460),
 ('vote', 458),
 ('nothing', 455),
 ('dems', 452),
 ('job', 440),
 ('impeachment', 435),
 ('jobs', 431),
 ('state', 413),
 ('day', 407),
 ('first', 406),
 ('us', 404),
 ('bad', 402),
 ('whitehouse', 398),
 ('made', 391),
 ('military'

In [24]:
from nltk import word_tokenize
from ipywidgets import interact

@interact
def tokenize_tweet(i=(0,len(corpus)-1)):
    from nltk.corpus import stopwords
    import string
    from nltk import word_tokenize,regexp_tokenize
    
    print(f"- Tweet #{i}:\n")
    print(corpus[i],'\n')
    tokens = word_tokenize(corpus[i])

    # Get all the stop words in the English language
    stopwords_list = stopwords.words('english')
    stopwords_list += string.punctuation
    stopped_tokens = [w.lower() for w in tokens if w.lower() not in stopwords_list]
    
    print(tokens,end='\n\n')
    print(stopped_tokens)


In [27]:
## Get FreqDist for Cleaned Text Data
corpus[:20]

['https://t.co/EVAEYD1AgV',
 'HAPPY NEW YEAR!',
 'Our fantastic First Lady! https://t.co/6iswto4WDI',
 'RT @DanScavino: https://t.co/CJRPySkF1Z',
 'RT @SenJohnKennedy: I think Speaker Pelosi is having 2nd thoughts about impeaching the President. The Senate should get back to work on USM…',
 'Thank you Steve. The greatest Witch Hunt in U.S. history! https://t.co/I3bSNVp6gC',
 'RT @ThisWeekABC: Sen. Ron Johnson says charges against Pres. Trump are "pretty thin gruel" and Speaker Nancy Pelosi\'s decision to withhold…',
 "RT @SenJohnKennedy: The Senate needs to reauthorize the Violence Against Women Act and I am proud to cosponsor @SenJoniErnst's bill that g…",
 'RT @LindseyGrahamSC: To our Iraqi allies:This is your moment to convince the American people the US-Iraq relationship is meaningful to yo…',
 'RT @LindseyGrahamSC: President Trump unlike President Obama will hold you accountable for threats against Americans and hit you where it…',
 'RT @LindseyGrahamSC: Very proud of President @r

### Comparing Phases of Preprocessing/Tokenization

In [None]:
# def clean_text(text,exclude_words=['until']):
#     from nltk.corpus import stopwords
#     import string
#     from nltk import word_tokenize,regexp_tokenize
#     ## tokenize text
#     tokens = word_tokenize(text)
#     # Get all the stop words in the English language
#     stopwords_list = stopwords.words('english')
#     stopwords_list += string.punctuation
#     stopped_tokens = [w.lower() for w in tokens if w not in stopwords_list]
#     return stopped_tokens

In [None]:
from nltk import word_tokenize
from ipywidgets import interact

@interact
def tokenize_tweet(i=(0,len(corpus)-1)):
    from nltk.corpus import stopwords
    import string
    from nltk import word_tokenize,regexp_tokenize
    
    print(f"- Tweet #{i}:\n")
    print(corpus[i],'\n')
    tokens = word_tokenize(corpus[i])

    # Get all the stop words in the English language
    stopwords_list = stopwords.words('english')
    stopwords_list += string.punctuation
    stopped_tokens = [w.lower() for w in tokens if w.lower() not in stopwords_list]
    
    print(tokens,end='\n\n')
    print(stopped_tokens)

## Regular Expressions

- Best regexp resource and tester: https://regex101.com/

    - Make sure to check "Python" under Flavor menu on left side.

In [25]:
text =  corpus[6615]
text

'I will be in Green Bay Wisconsin on Saturday April 27th at the Resch Center — 7:00pm (CDT). Big crowd expected! #MAGA https://t.co/BPYK8PF0O8'

In [26]:
text2=corpus[7347]
text2

'RT @real_defender: @realDonaldTrump Protecting America and putting Americans first. Thank you Mr. President!'

In [28]:
from nltk import regexp_tokenize
pattern = r"([a-zA-Z]+(?:'[a-z]+)?)"
regexp_tokenize(text,pattern)

['I',
 'will',
 'be',
 'in',
 'Green',
 'Bay',
 'Wisconsin',
 'on',
 'Saturday',
 'April',
 'th',
 'at',
 'the',
 'Resch',
 'Center',
 'pm',
 'CDT',
 'Big',
 'crowd',
 'expected',
 'MAGA',
 'https',
 't',
 'co',
 'BPYK',
 'PF',
 'O']

In [29]:
print('[i] Word Tokenize:',end='\n'+'---'*20+'\n')
print(word_tokenize(text))

print('\n[i] Regexp Tokenize:',end='\n'+'---'*20+'\n')
print(regexp_tokenize(text,pattern))

[i] Word Tokenize:
------------------------------------------------------------
['I', 'will', 'be', 'in', 'Green', 'Bay', 'Wisconsin', 'on', 'Saturday', 'April', '27th', 'at', 'the', 'Resch', 'Center', '—', '7:00pm', '(', 'CDT', ')', '.', 'Big', 'crowd', 'expected', '!', '#', 'MAGA', 'https', ':', '//t.co/BPYK8PF0O8']

[i] Regexp Tokenize:
------------------------------------------------------------
['I', 'will', 'be', 'in', 'Green', 'Bay', 'Wisconsin', 'on', 'Saturday', 'April', 'th', 'at', 'the', 'Resch', 'Center', 'pm', 'CDT', 'Big', 'crowd', 'expected', 'MAGA', 'https', 't', 'co', 'BPYK', 'PF', 'O']
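
The regexp output drops the digits from "27th" and "7:00pm" because the pattern only matches runs of letters. A minimal sketch with `re.findall`, used here as a stand-in for `regexp_tokenize`, makes this visible on a made-up sentence. The second pattern, which also accepts digits, is a hypothetical variant and not the one used in this notebook.

```python
import re

# Pattern from the notebook: letters, optionally followed by an apostrophe and more letters
pattern = r"([a-zA-Z]+(?:'[a-z]+)?)"

# Made-up sentence for illustration
sample = "I can't be there on April 27th at 7:00pm"

letters_only = re.findall(pattern, sample)
print(letters_only)
# ['I', "can't", 'be', 'there', 'on', 'April', 'th', 'at', 'pm']

# Hypothetical variant that also keeps digits
pattern_with_digits = r"[a-zA-Z0-9]+(?:'[a-z]+)?"
with_digits = re.findall(pattern_with_digits, sample)
print(with_digits)
# ['I', "can't", 'be', 'there', 'on', 'April', '27th', 'at', '7', '00pm']
```

Whether digits should survive tokenization depends on the task, so neither pattern is the right one in general.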


In [30]:
def clean_text(text,regex=True):
    from nltk.corpus import stopwords
    import string
    from nltk import word_tokenize,regexp_tokenize

    ## tokenize text
    if regex:
        pattern = r"([a-zA-Z]+(?:'[a-z]+)?)"
        tokens= regexp_tokenize(text,pattern)
    else:
        tokens = word_tokenize(text)
    # Get all the stop words in the English language
    stopwords_list = stopwords.words('english')
    stopwords_list += string.punctuation
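    # Note: the membership test uses the original-case token, so capitalized
    # stopwords ("The", "My") slip through; test `w.lower()` instead to drop them
    stopped_tokens = [w.lower() for w in tokens if w not in stopwords_list]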
    return stopped_tokens
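
One detail worth checking in `clean_text`: the filter tests the original-case token against a lowercase stopword list, so capitalized stopwords are lowercased but not removed. The sketch below compares the two checks using a small made-up stopword set in place of NLTK's list.

```python
# Small made-up subset standing in for NLTK's English stopword list
stopwords_demo = {"the", "my", "i", "is"}
tokens = ["The", "President", "is", "My", "friend"]

# Check on the original-case token (as in clean_text)
case_sensitive = [w.lower() for w in tokens if w not in stopwords_demo]

# Check on the lowercased token
case_insensitive = [w.lower() for w in tokens if w.lower() not in stopwords_demo]

print(case_sensitive)    # ['the', 'president', 'my', 'friend']
print(case_insensitive)  # ['president', 'friend']
```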

In [None]:
# @interact
# def regexp_tokenize_tweet(i=(0,len(corpus)-1)):
#     print(f"- Tweet #{i}:\n")
#     print(corpus[i],'\n')
#     from nltk import regexp_tokenize
#     pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
#     tokens= regexp_tokenize(corpus[i],pattern)

#     # It is usually a good idea to lowercase all tokens during this step, as well
#     stopped_tokens = [w.lower() for w in tokens if w not in stopwords_list]
#     print(tokens,end='\n\n')
#     return print(stopped_tokens)

In [34]:
import re

def find_urls(string): 
    return re.findall(r"(http[s]?://\w*\.\w*/+\w+)",string)

def find_hashtags(string):
    return re.findall(r'\#\w*',string)

def find_retweets(string):
    return re.findall(r'RT [@]?\w*:',string)

def find_mentions(string):
    return re.findall(r'\@\w*',string)

In [35]:
find_urls(text)

['https://t.co/BPYK8PF0O8']

In [None]:
find_mentions(text2)
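
For reference, here is a self-contained sketch of what these helpers return. The functions are copied from the cell above, `text2` is the retweet shown earlier, and `short_text` is a shortened version of the Green Bay tweet.

```python
import re

def find_urls(string):
    return re.findall(r"(http[s]?://\w*\.\w*/+\w+)", string)

def find_hashtags(string):
    return re.findall(r'\#\w*', string)

def find_retweets(string):
    return re.findall(r'RT [@]?\w*:', string)

def find_mentions(string):
    return re.findall(r'\@\w*', string)

text2 = ('RT @real_defender: @realDonaldTrump Protecting America and '
         'putting Americans first. Thank you Mr. President!')
short_text = 'Big crowd expected! #MAGA https://t.co/BPYK8PF0O8'

print(find_mentions(text2))       # ['@real_defender', '@realDonaldTrump']
print(find_retweets(text2))       # ['RT @real_defender:']
print(find_hashtags(short_text))  # ['#MAGA']
print(find_urls(short_text))      # ['https://t.co/BPYK8PF0O8']
```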

### Stemming/Lemmatization

In [36]:

from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize('feet')) # foot
print(lemmatizer.lemmatize('running')) # 'running': the default pos='n' treats the word as a noun; pass pos='v' to get 'run'

foot
running


In [37]:
text_in =  corpus[6615]


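def process_tweet(text,as_lemmas=False,as_tokens=True):
    # Strip URLs, retweet prefixes, and hashtags (strings are immutable, so no copy is needed)
    for x in find_urls(text):
        text = text.replace(x,'')
        
    for x in find_retweets(text):
        text = text.replace(x,'')    
        
    for x in find_hashtags(text):
        text = text.replace(x,'')    

    if as_tokens:
        text = clean_text(text)

    if as_lemmas:
        from nltk.stem.wordnet import WordNetLemmatizer
        lemmatizer = WordNetLemmatizer()
        # Lemmatize word by word; passing the whole tweet to lemmatize() treats it as a single word
        if as_tokens:
            text = [lemmatizer.lemmatize(w) for w in text]
        else:
            text = ' '.join(lemmatizer.lemmatize(w) for w in text.split())
    
    # Return an empty string when nothing is left after cleaning
    if len(text)==0:
        text=''
            
    return text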

In [38]:
@interact
def show_processed_text(i=(0,len(corpus)-1)):
    text_in = corpus[i]#.copy()
    print(text_in)
    text_out = process_tweet(text_in)
    print(text_out)
    text_out2 = process_tweet(text_in,as_lemmas=True)
    print(text_out2)


In [39]:
corpus[:6]

['https://t.co/EVAEYD1AgV',
 'HAPPY NEW YEAR!',
 'Our fantastic First Lady! https://t.co/6iswto4WDI',
 'RT @DanScavino: https://t.co/CJRPySkF1Z',
 'RT @SenJohnKennedy: I think Speaker Pelosi is having 2nd thoughts about impeaching the President. The Senate should get back to work on USM…',
 'Thank you Steve. The greatest Witch Hunt in U.S. history! https://t.co/I3bSNVp6gC']

## Text Classification

> Potential Tasks: Classify Android vs iPhone tweets (from the period when Android tweets still exist)

In [40]:
df['datetime'] = pd.to_datetime(df['created_at'])
df

Unnamed: 0,source,text,created_at,retweet_count,favorite_count,is_retweet,id_str,datetime
0,Twitter Media Studio,https://t.co/EVAEYD1AgV,01-01-2020 03:12:07,25016,108830,False,1212209862094012416,2020-01-01 03:12:07
1,Twitter for iPhone,HAPPY NEW YEAR!,01-01-2020 01:30:35,85409,576045,False,1212184310389850119,2020-01-01 01:30:35
2,Twitter for iPhone,Our fantastic First Lady! https://t.co/6iswto4WDI,01-01-2020 01:22:28,27567,132633,False,1212182267113680896,2020-01-01 01:22:28
3,Twitter for iPhone,RT @DanScavino: https://t.co/CJRPySkF1Z,01-01-2020 01:18:47,10796,0,True,1212181341078458369,2020-01-01 01:18:47
4,Twitter for iPhone,RT @SenJohnKennedy: I think Speaker Pelosi is ...,01-01-2020 01:17:43,8893,0,True,1212181071988703232,2020-01-01 01:17:43
...,...,...,...,...,...,...,...,...
14061,Twitter for Android,The President of Taiwan CALLED ME today to wis...,12-03-2016 00:44:20,24700,111106,False,804848711599882240,2016-12-03 00:44:20
14062,Twitter for iPhone,Thank you Ohio! Together we made history – and...,12-02-2016 02:45:18,17283,72196,False,804516764562374656,2016-12-02 02:45:18
14063,Twitter for iPhone,Heading to U.S. Bank Arena in Cincinnati Ohio ...,12-01-2016 22:52:10,5564,31256,False,804458095569158144,2016-12-01 22:52:10
14064,Twitter for Android,Getting ready to leave for the Great State of ...,12-01-2016 14:38:09,9834,57249,False,804333771021570048,2016-12-01 14:38:09


In [41]:
df = df.set_index('datetime').sort_index()
df

Unnamed: 0_level_0,source,text,created_at,retweet_count,favorite_count,is_retweet,id_str
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2016-12-01 14:37:57,Twitter for iPhone,My thoughts and prayers are with those affecte...,12-01-2016 14:37:57,12077,65724,False,804333718999539712
2016-12-01 14:38:09,Twitter for Android,Getting ready to leave for the Great State of ...,12-01-2016 14:38:09,9834,57249,False,804333771021570048
2016-12-01 22:52:10,Twitter for iPhone,Heading to U.S. Bank Arena in Cincinnati Ohio ...,12-01-2016 22:52:10,5564,31256,False,804458095569158144
2016-12-02 02:45:18,Twitter for iPhone,Thank you Ohio! Together we made history – and...,12-02-2016 02:45:18,17283,72196,False,804516764562374656
2016-12-03 00:44:20,Twitter for Android,The President of Taiwan CALLED ME today to wis...,12-03-2016 00:44:20,24700,111106,False,804848711599882240
...,...,...,...,...,...,...,...
2020-01-01 01:17:43,Twitter for iPhone,RT @SenJohnKennedy: I think Speaker Pelosi is ...,01-01-2020 01:17:43,8893,0,True,1212181071988703232
2020-01-01 01:18:47,Twitter for iPhone,RT @DanScavino: https://t.co/CJRPySkF1Z,01-01-2020 01:18:47,10796,0,True,1212181341078458369
2020-01-01 01:22:28,Twitter for iPhone,Our fantastic First Lady! https://t.co/6iswto4WDI,01-01-2020 01:22:28,27567,132633,False,1212182267113680896
2020-01-01 01:30:35,Twitter for iPhone,HAPPY NEW YEAR!,01-01-2020 01:30:35,85409,576045,False,1212184310389850119


In [42]:
df['clean_text'] = df['text'].apply(process_tweet)
df

Unnamed: 0_level_0,source,text,created_at,retweet_count,favorite_count,is_retweet,id_str,clean_text
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2016-12-01 14:37:57,Twitter for iPhone,My thoughts and prayers are with those affecte...,12-01-2016 14:37:57,12077,65724,False,804333718999539712,"[my, thoughts, prayers, affected, tragic, stor..."
2016-12-01 14:38:09,Twitter for Android,Getting ready to leave for the Great State of ...,12-01-2016 14:38:09,9834,57249,False,804333771021570048,"[getting, ready, leave, great, state, indiana,..."
2016-12-01 22:52:10,Twitter for iPhone,Heading to U.S. Bank Arena in Cincinnati Ohio ...,12-01-2016 22:52:10,5564,31256,False,804458095569158144,"[heading, u, s, bank, arena, cincinnati, ohio,..."
2016-12-02 02:45:18,Twitter for iPhone,Thank you Ohio! Together we made history – and...,12-02-2016 02:45:18,17283,72196,False,804516764562374656,"[thank, ohio, together, made, history, real, w..."
2016-12-03 00:44:20,Twitter for Android,The President of Taiwan CALLED ME today to wis...,12-03-2016 00:44:20,24700,111106,False,804848711599882240,"[the, president, taiwan, called, me, today, wi..."
...,...,...,...,...,...,...,...,...
2020-01-01 01:17:43,Twitter for iPhone,RT @SenJohnKennedy: I think Speaker Pelosi is ...,01-01-2020 01:17:43,8893,0,True,1212181071988703232,"[i, think, speaker, pelosi, nd, thoughts, impe..."
2020-01-01 01:18:47,Twitter for iPhone,RT @DanScavino: https://t.co/CJRPySkF1Z,01-01-2020 01:18:47,10796,0,True,1212181341078458369,
2020-01-01 01:22:28,Twitter for iPhone,Our fantastic First Lady! https://t.co/6iswto4WDI,01-01-2020 01:22:28,27567,132633,False,1212182267113680896,"[our, fantastic, first, lady]"
2020-01-01 01:30:35,Twitter for iPhone,HAPPY NEW YEAR!,01-01-2020 01:30:35,85409,576045,False,1212184310389850119,"[happy, new, year]"


In [43]:
android = df.groupby('source').get_group('Twitter for Android')
android.index

DatetimeIndex(['2016-12-01 14:38:09', '2016-12-03 00:44:20',
               '2016-12-03 01:41:30', '2016-12-03 03:06:41',
               '2016-12-03 16:37:27', '2016-12-04 05:13:58',
               '2016-12-04 11:41:47', '2016-12-04 11:49:06',
               '2016-12-04 11:57:41', '2016-12-04 12:05:35',
               ...
               '2017-03-05 11:40:20', '2017-03-07 12:04:13',
               '2017-03-07 12:13:59', '2017-03-07 13:13:20',
               '2017-03-07 13:41:58', '2017-03-07 13:46:28',
               '2017-03-07 14:14:03', '2017-03-08 12:11:25',
               '2017-03-25 14:37:52', '2017-03-25 14:41:14'],
              dtype='datetime64[ns]', name='datetime', length=364, freq=None)

In [44]:
iphone = df.groupby('source').get_group('Twitter for iPhone').loc[:android.index[-1]]
iphone

Unnamed: 0_level_0,source,text,created_at,retweet_count,favorite_count,is_retweet,id_str,clean_text
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2016-12-01 14:37:57,Twitter for iPhone,My thoughts and prayers are with those affecte...,12-01-2016 14:37:57,12077,65724,False,804333718999539712,"[my, thoughts, prayers, affected, tragic, stor..."
2016-12-01 22:52:10,Twitter for iPhone,Heading to U.S. Bank Arena in Cincinnati Ohio ...,12-01-2016 22:52:10,5564,31256,False,804458095569158144,"[heading, u, s, bank, arena, cincinnati, ohio,..."
2016-12-02 02:45:18,Twitter for iPhone,Thank you Ohio! Together we made history – and...,12-02-2016 02:45:18,17283,72196,False,804516764562374656,"[thank, ohio, together, made, history, real, w..."
2016-12-03 19:09:40,Twitter for iPhone,State Treasurer John Kennedy is my choice for ...,12-03-2016 19:09:40,9800,39057,False,805126876779913216,"[state, treasurer, john, kennedy, choice, us, ..."
2016-12-03 19:13:01,Twitter for iPhone,Our great VPE @mike_pence is in Louisiana camp...,12-03-2016 19:13:01,9224,39351,False,805127720749383680,"[our, great, vpe, mike, pence, louisiana, camp..."
...,...,...,...,...,...,...,...,...
2017-03-24 12:14:32,Twitter for iPhone,After seven horrible years of ObamaCare (skyro...,03-24-2017 12:14:32,12566,68241,False,845247455868391425,"[after, seven, horrible, years, obamacare, sky..."
2017-03-24 12:23:00,Twitter for iPhone,The irony is that the Freedom Caucus which is ...,03-24-2017 12:23:00,13364,61991,False,845249587178819584,"[the, irony, freedom, caucus, pro, life, plann..."
2017-03-24 17:03:46,Twitter for iPhone,Today I was pleased to announce the official a...,03-24-2017 17:03:46,12933,66692,False,845320243614547968,"[today, i, pleased, announce, official, approv..."
2017-03-24 17:59:42,Twitter for iPhone,Today I was thrilled to announce a commitment ...,03-24-2017 17:59:42,20212,89339,False,845334323045765121,"[today, i, thrilled, announce, commitment, bil..."


In [46]:
len(android), len(iphone)

(364, 240)

In [47]:
df_corpus = pd.concat([iphone,android],axis=0)
df_corpus['source'].value_counts()

Twitter for Android    364
Twitter for iPhone     240
Name: source, dtype: int64

### Vectorization 

- Count vectorization
- Term Frequency-Inverse Document Frequency (TF-IDF)
    - Weights each term by how often it appears in one document relative to how many documents contain it
    
    
**_Term Frequency_** is calculated with the following formula:

$$ \text{Term Frequency}(t) = \frac{\text{number of times } t \text{ appears in a document}}{\text{total number of terms in the document}} $$ 

**_Inverse Document Frequency_** is calculated with the following formula:

$$ \text{IDF}(t) = \log_e\left(\frac{\text{total number of documents}}{\text{number of documents containing } t}\right) $$

The **_TF-IDF_** value for a given word in a given document is found by multiplying the two.
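
The two formulas can be worked through directly in plain Python. The three token lists below are made up, and the function names are hypothetical. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their values will differ from this literal version.

```python
import math

# Toy corpus of already-tokenized documents (made up for illustration)
docs = [
    ["great", "news", "today"],
    ["fake", "news"],
    ["great", "great", "jobs"],
]

def term_frequency(term, doc):
    # Times the term appears in the document / total terms in the document
    return doc.count(term) / len(doc)

def inverse_document_frequency(term, docs):
    # Natural log of (total documents / documents containing the term).
    # Assumes the term appears in at least one document.
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)

# "fake" appears in only one document, so it scores higher than "news"
print(round(tf_idf("fake", docs[1], docs), 4))  # 0.5493
print(round(tf_idf("news", docs[1], docs), 4))  # 0.2027
```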


## Questions/Topics 
- Next time: vectorization
- Vs Embeddings

In [48]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

In [49]:
vectorizer.fit_transform(df_corpus['clean_text'].values)

AttributeError: 'list' object has no attribute 'lower'
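
The error arises because `TfidfVectorizer` expects each document to be a string by default and lowercases it, while the `clean_text` column holds lists of tokens. One possible fix, sketched below on a toy stand-in for that column, is to join each token list back into a single string before vectorizing. The `clean_text` list here is made up to mirror the column's shape, including the empty string stored for tweets with no remaining tokens.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for df_corpus['clean_text']: token lists, plus '' for an empty tweet
clean_text = [["happy", "new", "year"], ["our", "fantastic", "first", "lady"], ""]

# Join each token list into one string so the default analyzer can process it
docs = [" ".join(tokens) for tokens in clean_text]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

print(X.shape)  # (3, 7)
print(sorted(vectorizer.vocabulary_))
```

Another option would be to keep the lists and pass a callable `analyzer` that returns the tokens unchanged. Joining is shown here because it needs no extra configuration.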