# Text Processing Exercise 

In this exercise, you will learn some building blocks for text processing: how to normalize, tokenize, stem, and lemmatize tweets from Twitter.

### Fetch Data from the online resource

First, we will use the `get_tweets()` function from the `exercise_helper` module to get all the tweets from the Twitter page https://twitter.com/AIForTrading1. This page belongs to a Twitter account created especially for this course. It contains 28 tweets, and our goal is to get all 28 of them. The `get_tweets()` function uses the `requests` library and BeautifulSoup to fetch the tweets from the page. In a later lesson we will learn how to use the `requests` library and BeautifulSoup to get data from websites. For now, we will just use this function to get the tweets we want.

In [2]:
import exercise_helper

all_tweets = exercise_helper.get_tweets()

print(all_tweets)

['The Long-Term Stock Exchange Is Worth a Shot', 'Predicting Stock Performance with Natural Language Deep Learning', 'Comcast Acquiring Time Warner Cable In All Stock Deal Worth $45.2 Billion', 'Facebook stock drops more than 20% after revenue forecast misses', 'Facebook Buying WhatsApp for $16B in Cash and Stock Plus $3B in RSUs', 'Netflix’s ‘death cross’ is the third for FAANG stocks and Nasdaq Composite is next', 'After Yesterday’s Signs of Recovery, Crypto Markets See Drastic Losses', 'MF Sees Australia Risks Tilt to Downside on China, Trade War', 'Bitcoin Cash Clash Is Costing Billions With No End in Sight', 'SEC Crypto Settlements Spur Expectations of Wider ICO Crackdown', 'Nissan’s Drama Looks a Lot Like a Palace Coup', 'Yahoo Finance has apparently killed its API', 'Tesla Tanks After Goldman Downgrades to Sell', 'Goldman Sachs to Open a Bitcoin Trading Operation', 'Tax-Free Bitcoin-To-Ether Trading in US to End Under GOP Plan', 'Goldman Sachs Is Setting Up a Cryptocurrency Trad

### Normalization
Text normalization is the process of transforming text into a single canonical form.

There are many normalization techniques; in this exercise we focus on two. First, we'll convert the text to lowercase; second, we'll remove all punctuation characters from the text.

#### TODO: Part 1

Convert text to lowercase.

Use Python's built-in string method `.lower()` to convert each tweet in `all_tweets` to lowercase.

In [3]:
# your code goes here
all_tweets = [tweet.lower() for tweet in all_tweets]
all_tweets

['the long-term stock exchange is worth a shot',
 'predicting stock performance with natural language deep learning',
 'comcast acquiring time warner cable in all stock deal worth $45.2 billion',
 'facebook stock drops more than 20% after revenue forecast misses',
 'facebook buying whatsapp for $16b in cash and stock plus $3b in rsus',
 'netflix’s ‘death cross’ is the third for faang stocks and nasdaq composite is next',
 'after yesterday’s signs of recovery, crypto markets see drastic losses',
 'mf sees australia risks tilt to downside on china, trade war',
 'bitcoin cash clash is costing billions with no end in sight',
 'sec crypto settlements spur expectations of wider ico crackdown',
 'nissan’s drama looks a lot like a palace coup',
 'yahoo finance has apparently killed its api',
 'tesla tanks after goldman downgrades to sell',
 'goldman sachs to open a bitcoin trading operation',
 'tax-free bitcoin-to-ether trading in us to end under gop plan',
 'goldman sachs is setting up a cryp

#### Part 2 

Here, we use Python's regular expression module, `re`, to remove punctuation characters.

The easiest way to remove specific punctuation characters is with a regular expression. You can replace every match of a pattern with a space:

```python
re.sub(pattern, ' ', text) 
```

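This replaces every match of the pattern in the text with a space.

The pattern we use is `[^a-zA-Z0-9]`, which matches any character that is not an ASCII letter or digit. That includes punctuation, symbols such as `$` and `%`, and whitespace.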

In [4]:
import re

# Replace every character that is not an ASCII letter or digit with a space.
all_tweets = [re.sub(r'[^a-zA-Z0-9]', ' ', tweet) for tweet in all_tweets]

print(all_tweets[0])

the long term stock exchange is worth a shot
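
To see what this pattern does to symbols, here is a small sketch on one of the lowercased tweets from above. The variable names `sample`, `cleaned`, and `collapsed` are illustrative and not part of the exercise. Because every non-alphanumeric character becomes a space, `20%` followed by a space leaves two consecutive spaces; `.split()` later treats such a run as a single separator.

```python
import re

sample = "facebook stock drops more than 20% after revenue forecast misses"

# Every character that is not an ASCII letter or digit becomes a space.
cleaned = re.sub(r"[^a-zA-Z0-9]", " ", sample)

# Optional: collapse runs of whitespace into single spaces.
collapsed = " ".join(cleaned.split())

print(repr(cleaned))
print(repr(collapsed))
```

Collapsing the whitespace is not required for the rest of the exercise, since `.split()` with no arguments already ignores repeated spaces.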


### NLTK: Natural Language ToolKit

NLTK is a leading platform for building Python programs to work with human language data. It has a suite of tools for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. 

Let's import NLTK. 

In [6]:
import os 
import nltk 
nltk.data.path.append(os.path.join(os.getcwd(), "nltk_data"))

#### TODO: Part 1

NLTK provides the `TweetTokenizer` class, which splits tweets into tokens.

This makes tokenizing tweets much easier and faster.

You can pass `preserve_case=False` to `TweetTokenizer` to convert the tokens to lowercase. In the cell below, tokenize each tweet in `all_tweets`.

In [7]:
from nltk.tokenize import TweetTokenizer

#  your code goes here
tk = TweetTokenizer(preserve_case=False)

tokens = [tk.tokenize(tweet) for tweet in all_tweets]
tokens[0]

['the', 'long', 'term', 'stock', 'exchange', 'is', 'worth', 'a', 'shot']
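
The tweets here were already stripped of punctuation, so the benefit of `TweetTokenizer` over a plain `.split()` is not visible in the output above. As a rough illustration on a made-up raw tweet (the text in `raw` is hypothetical, not from the course account), the tokenizer should keep the hashtag and the emoticon as single tokens and separate the trailing `!` from the word before it:

```python
from nltk.tokenize import TweetTokenizer

# Hypothetical raw tweet, used only for illustration.
raw = "Bitcoin is up again! #crypto :-)"

tk = TweetTokenizer(preserve_case=False)
tweet_tokens = tk.tokenize(raw)

# Compare with naive whitespace splitting, which leaves "again!" glued together.
split_tokens = raw.lower().split()

print(tweet_tokens)
print(split_tokens)
```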

#### Part 2

NLTK adds more modularity for tokenization.

For example, stop words are words that carry little meaning for text analysis. They are very common words such as "the", "and", and "if". Ideally, we want to remove these words from our tokenized lists.

NLTK ships a list of these words in `nltk.corpus.stopwords`, which you need to download first with `nltk.download`.

Let's print the English stop words to see what they are.

In [8]:
from nltk.corpus import stopwords
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/darren/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

### TODO: 

Print the stop words in English.

In [9]:
# your code is here
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

#### TODO: Part 3 

In the cell below use the `.split()` method to split each tweet into a list of words and remove the stop words from all the tweets.

In [10]:
## your code is here 
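tokens_stop = []

# Build the stop word set once; set lookups are fast and this avoids
# reloading the list for every word.
stop_words = set(stopwords.words("english"))

for tweet in all_tweets:
    words = tweet.split()
    tokens_stop.append([w for w in words if w not in stop_words])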
    
print(tokens_stop[0])

['long', 'term', 'stock', 'exchange', 'worth', 'shot']
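
The filtering step itself does not depend on NLTK. A minimal sketch, assuming a tiny hand-written stop word set (`tiny_stop_words` is a stand-in for the full NLTK list, which is much longer):

```python
# Stand-in for set(stopwords.words("english")); the real list is much longer.
tiny_stop_words = {"the", "is", "a"}

tweet = "the long term stock exchange is worth a shot"

# Keep only the words that are not in the stop word set.
filtered = [w for w in tweet.split() if w not in tiny_stop_words]

print(filtered)
```

Using a `set` rather than a `list` for the stop words keeps each membership check fast, which matters once the number of tweets grows.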


### Stemming
Stemming is the process of reducing words to their word stem, base or root form.

### TODO:

In the cell below, use the `PorterStemmer` class from the NLTK library to stem all the tweets.

In [11]:
from nltk.stem.porter import PorterStemmer

stemmed = []

# Create the stemmer once and reuse it for every word.
stemmer = PorterStemmer()

for tweet in tokens_stop:
    stemmed.append([stemmer.stem(word) for word in tweet])
    
print(stemmed[0])

['long', 'term', 'stock', 'exchang', 'worth', 'shot']
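
Note that a stem is not always a dictionary word: `exchange` becomes `exchang` in the output above. A small sketch of the same stemmer on a few words taken from the tweets (the `words` list is chosen here for illustration only):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Words picked from the tweets above, for illustration only.
words = ["exchange", "stocks", "losses", "predicting"]

# Map each word to its Porter stem.
stems = {word: stemmer.stem(word) for word in words}

print(stems)
```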


### Lemmatizing
#### Part 1

Lemmatization is the process of grouping together the inflected forms of a word so they can be analyzed as a single item.

To reduce words to their root form, you can use the `WordNetLemmatizer` class.

For more information about lemmatizing in NLTK, see the NLTK documentation: https://www.nltk.org/api/nltk.stem.html

If you would like to understand more about stemming and lemmatizing, take a look at the following source:
https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

In [12]:
nltk.download('wordnet')  # download the WordNet data used by the lemmatizer

[nltk_data] Downloading package wordnet to /Users/darren/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

### TODO:

In the cell below, use the `WordNetLemmatizer` class to lemmatize all the tweets.

In [13]:
from nltk.stem.wordnet import WordNetLemmatizer

# your code goes here
lemmatized = []

# Create the lemmatizer once and reuse it for every word.
lemmatizer = WordNetLemmatizer()

for tweet in stemmed:
    lemmatized.append([lemmatizer.lemmatize(word) for word in tweet])
    
print(lemmatized[0])

['long', 'term', 'stock', 'exchang', 'worth', 'shot']


#### TODO: Part 2

In the cell below, lemmatize verbs by specifying `pos`. Pass `pos='v'` to `WordNetLemmatizer().lemmatize`; without it, the lemmatizer treats every word as a noun.

In [14]:
from nltk.stem.wordnet import WordNetLemmatizer

lemmatized = []

# Create the lemmatizer once; pos='v' tells it to treat each word as a verb.
lemmatizer = WordNetLemmatizer()

for tweet in stemmed:
    lemmatized.append([lemmatizer.lemmatize(word, pos='v') for word in tweet])
    
print(lemmatized[0])

['long', 'term', 'stock', 'exchang', 'worth', 'shoot']
