# SI 370: Natural Language Processing (Part 1)

#### NOTES
- uses: translation, topic modeling, sentiment analysis, chatbots, customer service, advertising, marketing, classifying documents
- mix of language, ML, AI
- NLP = computational linguistics
- natural language is often ambiguous, so it can be hard for computers to understand what's going on
- text data comes in large volumes
- text is highly unstructured

- NLTK and spaCy are python NLP packages

- tokenization: split the longer sentences into smaller parts called tokens (sentences --> words)
    - splitting a string on spaces is one example, but it has a lot of limitations (e.g. punctuation stays attached to words), so just use a library tokenizer
    
- stop words are words that appear a lot but aren't super important (think "and"), we can remove them
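
To make the limitation of splitting on spaces concrete, here is a minimal sketch using only the standard library. The regular expression is an illustrative assumption, not what NLTK or spaCy do internally; it only shows that punctuation needs separate handling.

```python
import re

text = "This is a sentence, awaiting to be tokenized..."

# Naive approach: split on whitespace. Punctuation stays glued to words.
naive_tokens = text.split()

# Slightly better: separate runs of word characters from runs of punctuation.
# This is still far simpler than a real library tokenizer.
regex_tokens = re.findall(r"\w+|[^\w\s]+", text)

print(naive_tokens)
print(regex_tokens)
```

In the naive result, `'sentence,'` and `'tokenized...'` are single tokens, so they would not match the plain words `sentence` or `tokenized` in a vocabulary.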

In [1]:
import pandas as pd

In [2]:
import nltk
import gensim

## Exercise 1: Please provide an example of an NLP task that you would want to perform on any text dataset you are interested in. (2pts)

I would love to do a sentiment analysis of Trump's tweets and look for the keywords that are most often viewed negatively or positively. I think the results would be interesting.

## Basic text preprocessing

Let's open our first NLP dataset, a Twitter sentiment dataset.

In [3]:
link='https://raw.githubusercontent.com/vineetdhanawat/twitter-sentiment-analysis/master/datasets/Sentiment%20Analysis%20Dataset.csv'
# df = pd.read_csv(link,encoding="ISO-8859-1")

In [4]:
df = pd.read_csv(link,encoding="ISO-8859-1")

In [5]:
df.head()

|  | ItemID | Sentiment | SentimentText |
|---|---|---|---|
| 0 | 1 | 0 | is so sad for my APL frie... |
| 1 | 2 | 0 | I missed the New Moon trail... |
| 2 | 3 | 1 | omg its already 7:30 :O |
| 3 | 4 | 0 | .. Omgaga. Im sooo im gunna CRy. I'... |
| 4 | 5 | 0 | i think mi bf is cheating on me!!! ... |


In [6]:
df.shape

(1048575, 3)

#### Prerequisites
We have already installed the following NLTK resources using `nltk.download()`:

1. stopwords
2. punkt
3. wordnet

These are required for text preprocessing with the nltk package.
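
If the resources are not installed yet, a sketch like the following could fetch them. The helper name `download_nltk_resources` is hypothetical; `nltk.download(name, quiet=True)` is the real NLTK call and returns `True` on success. Newer NLTK releases may also ask for `punkt_tab` when `word_tokenize` is called; if a `LookupError` names a missing resource, download that one as well.

```python
# Resource names taken from the list above.
RESOURCES = ["stopwords", "punkt", "wordnet"]

def download_nltk_resources(resources=RESOURCES):
    """Download each resource and return {name: True/False} (hypothetical helper)."""
    import nltk  # imported here so the sketch can be read without NLTK installed
    return {name: nltk.download(name, quiet=True) for name in resources}

# Uncomment to run (needs an internet connection the first time):
# print(download_nltk_resources())
```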

### Tokenizing sentences

Tokenization is usually the first step: you split longer text into smaller parts called tokens. Passages are tokenized into sentences, and sentences into words.

In [7]:
# load the tokenization function from nltk
from nltk import word_tokenize
sent = 'This is a sentence, awaiting to be tokenized...'

In [8]:
# split the string into a list of tokens using the word_tokenize function
tokens = word_tokenize(sent)
print(sent)
print(tokens)

# re-join the tokens into a single string
sent = ' '.join(tokens)
print(sent)

This is a sentence, awaiting to be tokenized...
['This', 'is', 'a', 'sentence', ',', 'awaiting', 'to', 'be', 'tokenized', '...']
This is a sentence , awaiting to be tokenized ...
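
The two levels (passage to sentences, sentences to words) can be sketched with the standard library alone. The passage and both regular expressions are assumptions for illustration; the sentence rule here breaks on abbreviations such as "Dr.", which is one reason to prefer a library tokenizer in practice.

```python
import re

passage = "I missed the trailer. Is it online yet? I hope so!"  # hypothetical text

# Step 1: passage -> sentences (naive rule: split on whitespace after . ! or ?).
sentences = re.split(r"(?<=[.!?])\s+", passage)

# Step 2: each sentence -> word and punctuation tokens.
tokenized = [re.findall(r"\w+|[^\w\s]+", s) for s in sentences]

for sentence, tokens in zip(sentences, tokenized):
    print(sentence, "->", tokens)
```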


### Removing stopwords

Stopwords are words that appear frequently but contribute little to the meaning of a phrase. Removing them is advisable in many tasks.

In [9]:
# load list of stopwords from nltk
from nltk.corpus import stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [10]:
# convert the stopword list to a set for fast membership tests
S = set(stopwords.words('english'))
for sent in df['SentimentText'][:3].values:
    sent = sent.strip().lower()
    print("Original sentence: ",sent)
    tokens = word_tokenize(sent)
    tokens_stop_removed = []
    for token in tokens:
        if token not in S:
            tokens_stop_removed.append(token)
    sent_stop_removed = ' '.join(tokens_stop_removed)
    print("Stopword removed: ",sent_stop_removed)

Original sentence:  is so sad for my apl friend.............
Stopword removed:  sad apl friend ... ... ... ... .
Original sentence:  i missed the new moon trailer...
Stopword removed:  missed new moon trailer ...
Original sentence:  omg its already 7:30 :o
Stopword removed:  omg already 7:30 :


Note that the code above lower-cases each sentence before removing stopwords. Without that step, capitalized stopwords like 'I' would remain in the output, because NLTK's stopword list is all lower-case and the membership test is case-sensitive. Lower-casing is therefore an important preprocessing step in many tasks.
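
A small sketch of the case-sensitivity issue, using a hypothetical six-word stopword set in place of NLTK's full list and a hypothetical helper `remove_stopwords`:

```python
# Hypothetical, tiny stopword set; NLTK's real list is much longer but also lower-case.
STOPWORDS = {"i", "the", "is", "so", "for", "my"}

def remove_stopwords(tokens, lowercase=True):
    """Drop stopwords, optionally lower-casing the tokens first."""
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in STOPWORDS]

tokens = ["I", "missed", "the", "New", "Moon", "trailer"]
print(remove_stopwords(tokens, lowercase=False))  # 'I' survives
print(remove_stopwords(tokens, lowercase=True))   # 'i' is removed
```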

### Stemming and lemmatizing

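Stemming strips affixes (for the Porter stemmer, mostly suffixes) to reduce a word to a root form, which is not always a dictionary word. Lemmatization is related to stemming, but maps each word to its canonical dictionary form (its lemma). Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is given.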

In [11]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()

In [12]:
words = ["game","gaming","gamed","games","runs","ran",'running']
for word in words:
    print(word,'->',ps.stem(word))

game -> game
gaming -> game
gamed -> game
games -> game
runs -> run
ran -> ran
running -> run


In [13]:
from nltk.stem import WordNetLemmatizer 
lemma = WordNetLemmatizer()

In [14]:
words = ["dogs","corpora","ran","games","dice"]
for word in words:
    print(word,'->',lemma.lemmatize(word))

dogs -> dog
corpora -> corpus
ran -> ran
games -> game
dice -> dice
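
The difference between the two approaches can be sketched without NLTK: stemming applies string rules, while lemmatization looks words up. Everything below is a toy illustration; the suffix list and lookup table are hypothetical and are not the Porter algorithm or WordNet.

```python
# Toy illustration only: NOT the Porter algorithm and NOT WordNet.
SUFFIXES = ("ing", "ed", "s")  # hypothetical, tiny rule set
LEMMA_TABLE = {"ran": "run", "corpora": "corpus", "dogs": "dog"}  # hypothetical lookup

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Return the dictionary form if the word is in the table, else the word itself."""
    return LEMMA_TABLE.get(word, word)

for w in ["games", "gaming", "ran", "corpora"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The rule-based stemmer cannot relate `ran` to `run` because no suffix rule applies, whereas a lookup can. The stemmer also produces non-words such as `gam`, much as the Porter stemmer produced `alreadi` and `cri` in this notebook.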


## Exercise 2: Create a new column "PreprocessedText" on the dataframe "sample", where the column contains the text that has been (1) lower-cased, (2) tokenized, (3) stopword-removed, (4) lemmatized, (5) stemmed, and (6) re-joined into a string (4pts)

In [15]:
sample = df[:100].copy()  # copy so the new column is not set on a slice of df
# type your code here

preprocessed_text = []

for line in sample['SentimentText'].values:
    line = line.strip().lower()             # (1) lower-case
    tokens = word_tokenize(line)            # (2) tokenize
    tokens_out = []
    for token in tokens:
        if token not in S:                  # (3) remove stopwords
            token = lemma.lemmatize(token)  # (4) lemmatize
            token = ps.stem(token)          # (5) stem
            tokens_out.append(token)
    line_out = ' '.join(tokens_out)         # (6) re-join into a string
    preprocessed_text.append(line_out)

sample['PreprocessedText'] = preprocessed_text

sample.head()

|  | ItemID | Sentiment | SentimentText | PreprocessedText |
|---|---|---|---|---|
| 0 | 1 | 0 | is so sad for my APL frie... | sad apl friend ... ... ... ... . |
| 1 | 2 | 0 | I missed the New Moon trail... | miss new moon trailer ... |
| 2 | 3 | 1 | omg its already 7:30 :O | omg alreadi 7:30 : |
| 3 | 4 | 0 | .. Omgaga. Im sooo im gunna CRy. I'... | .. omgaga . im sooo im gunna cri . 've dentist... |
| 4 | 5 | 0 | i think mi bf is cheating on me!!! ... | think mi bf cheat ! ! ! t_t |


## Applying text to machine learning

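Now we will try sentiment analysis based on the text data that we have.
To do so, we have to convert the text into a structured, numeric form.

Bag of words (BOW): each sentence is converted into a vector whose length equals the vocabulary size, where each entry counts how often the corresponding word occurs in the sentence.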
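
As a rough sketch of the bag-of-words idea, the following builds count vectors by hand for a hypothetical three-document corpus. It is not how `CountVectorizer` is implemented; it only mirrors the result: one column per vocabulary word, one count per column.

```python
from collections import Counter

docs = ["so sad so tired", "happy day", "sad day"]  # hypothetical mini corpus

# Build the vocabulary: one index per distinct word, assigned in sorted order.
all_words = sorted({word for doc in docs for word in doc.split()})
vocab = {word: idx for idx, word in enumerate(all_words)}

def to_bow(doc):
    """Return a count vector of length len(vocab) for one document."""
    counts = Counter(doc.split())
    # vocab preserves insertion order, which matches the index order here
    return [counts.get(word, 0) for word in vocab]

print(vocab)
for doc in docs:
    print(doc, "->", to_bow(doc))
```

Word order is discarded: `"so sad so tired"` and `"tired so so sad"` produce the same vector.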

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

In [17]:
# transform sentences into array
all_sentences = sample['SentimentText']
vectorizer.fit(all_sentences)
X = vectorizer.transform(all_sentences)
arr = X.toarray()
print(arr.shape)

(100, 555)


In [18]:
arr[0]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

Let's take a look at how each word has been assigned to an index:

In [19]:
vocab = vectorizer.vocabulary_

# print first 10 instances of vocabulary
for i,(k,v) in enumerate(vocab.items()):
    if i<10:
        print(k,v)

is 243
so 419
sad 388
for 155
my 315
apl 33
friend 159
missed 300
the 459
new 323


In [20]:
# we can limit the number of features (only the most frequent terms are kept)
vectorizer = CountVectorizer(max_features=100)
X = vectorizer.fit_transform(all_sentences)
arr = X.toarray()
print(arr.shape)

(100, 100)
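
Capping the vocabulary can be pictured as keeping only the most frequent terms across the whole corpus. The sketch below approximates that idea with a hypothetical corpus and cap; scikit-learn's own tie-breaking and tokenization may differ.

```python
from collections import Counter

docs = ["so sad so tired", "happy day", "sad day so", "sad"]  # hypothetical corpus
max_features = 3  # hypothetical cap

# Count how often each term occurs across all documents.
term_counts = Counter(word for doc in docs for word in doc.split())

# Keep the max_features most frequent terms, then index them alphabetically.
kept_terms = sorted(term for term, _ in term_counts.most_common(max_features))
vocab = {term: idx for idx, term in enumerate(kept_terms)}

print(term_counts)
print(vocab)
```

Rare words such as `tired` and `happy` fall outside the vocabulary, so each vector shrinks from five entries to three.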


In [21]:
# target values y
y = sample['Sentiment']
print(y)

0     0
1     0
2     1
3     0
4     0
     ..
95    0
96    0
97    0
98    0
99    0
Name: Sentiment, Length: 100, dtype: int64


Now that we can transform each sentence into a vector, we can try classification using models covered earlier in the course.

In [22]:
# split dataset into train and test
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)

In [23]:
# load random forest model
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()

In [24]:
# train model
rf.fit(X_train,y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [25]:
# test model and get accuracy
from sklearn.metrics import accuracy_score

y_pred = rf.predict(X_test)
acc = accuracy_score(y_true=y_test, y_pred=y_pred)
print(acc)

0.7
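
Accuracy is the fraction of test predictions that match the true labels. With `test_size=0.2` on the 100-row sample, the test set has 20 rows, so accuracy can only move in steps of 0.05. The sketch below uses hypothetical labels (not the notebook's actual split) to compute accuracy by hand and compare it with a majority-class baseline.

```python
from collections import Counter

# Hypothetical labels for a 20-item test set (0 = negative, 1 = positive).
y_true = [0] * 14 + [1] * 6
y_pred = [0] * 12 + [1] * 2 + [0] * 4 + [1] * 2

# Accuracy: share of positions where the prediction equals the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Baseline: always predict the most common class in y_true.
majority_class, majority_count = Counter(y_true).most_common(1)[0]
baseline = majority_count / len(y_true)

print(accuracy, baseline)
```

In this made-up example the model's accuracy equals the baseline, which is why a single accuracy figure on an imbalanced sample is worth checking against the class distribution.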


Let's try printing the feature importance of the model.

In [26]:
import numpy as np
# Check feature importance
feat_importance = rf.feature_importances_

idx2word = {v:k for k,v in vectorizer.vocabulary_.items()}

# get the feature importances for the top-10 features
n=10
for idx in feat_importance.argsort()[::-1][:n]:
    print(idx,feat_importance[idx],idx2word[idx])

93 0.08926803102347022 what
72 0.05833382740807037 see
86 0.053160761590403775 tonight
52 0.038130296917699 my
83 0.036227231083201444 to
39 0.034536118258779606 its
28 0.03225013031154172 has
61 0.032094465588886616 on
79 0.02998369725702587 the
26 0.02594745949722944 gonna
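
The expression `argsort()[::-1][:n]` returns the indices of the `n` largest values, from largest to smallest. A plain-Python equivalent, with hypothetical importances and a hypothetical index-to-word map, looks like this:

```python
# Hypothetical importances and index-to-word map (not the notebook's values).
importances = [0.10, 0.40, 0.05, 0.30, 0.15]
idx2word = {0: "the", 1: "sad", 2: "to", 3: "love", 4: "omg"}

n = 3
# Indices sorted from most to least important, truncated to the top n.
top_indices = sorted(range(len(importances)),
                     key=lambda i: importances[i],
                     reverse=True)[:n]

for idx in top_indices:
    print(idx, importances[idx], idx2word[idx])
```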


### Improving from BOW: n-grams

Instead of single words, n-grams (runs of n consecutive words) capture more precise phrases and expressions.
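
A minimal sketch of n-gram extraction, using the first tweet from the dataset after lower-casing and dropping punctuation. The helper `ngrams` is written for illustration and is not a library function.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "is so sad for my apl friend".split()

for n in (1, 2, 3):
    print(n, ngrams(tokens, n))

# The vocabulary grows as the n-gram range widens.
vocab_1 = set(ngrams(tokens, 1))
vocab_1_to_3 = vocab_1 | set(ngrams(tokens, 2)) | set(ngrams(tokens, 3))
print(len(vocab_1), len(vocab_1_to_3))
```

A bigram such as `so sad` carries information that the separate counts of `so` and `sad` do not, at the cost of a much larger and sparser feature space.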

In [27]:
vectorizer = CountVectorizer(ngram_range=(1,3))  # count phrases of 1 to 3 consecutive words (unigrams, bigrams, trigrams)
X = vectorizer.fit_transform(all_sentences)
arr = X.toarray()
print(arr.shape)

(100, 2276)


In [28]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
rf = RandomForestClassifier()
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
acc = accuracy_score(y_true=y_test, y_pred=y_pred)
print(acc)

0.7




In [29]:
vocab = vectorizer.vocabulary_

# print first 10 instances of vocabulary
for i,(k,v) in enumerate(vocab.items()):
    if i<10:
        print(k,v)

is 885
so 1682
sad 1568
for 523
my 1208
apl 118
friend 541
is so 906
so sad 1691
sad for 1573


## Exercise 3: Demonstrate how to change the hyperparameters from the NLP side of the model. (2pts)

Changeable hyperparameters
- level of n-grams to use
- vocabulary size
- size of data samples (WARNING: total data is 1M lines, so please do not use the whole dataset or it may crash the servers. ~10000 lines are acceptable)

In [44]:
# please write code here

# widen the n-gram range from (1,3) to (1,5)
vectorizer = CountVectorizer(ngram_range=(1,5))
X = vectorizer.fit_transform(all_sentences)
arr = X.toarray()
print(arr.shape)

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
rf = RandomForestClassifier()
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
acc = accuracy_score(y_true=y_test, y_pred=y_pred)
print(acc)

vocab = vectorizer.vocabulary_
# vocab
# print first 10 instances of vocabulary
for i,(k,v) in enumerate(vocab.items()):
    if i<10:
        print(k,v)

(100, 3682)
0.85
is 1407
so 2699
sad 2522
for 810
my 1928
apl 188
friend 841
is so 1445
so sad 2716
sad for 2529




## Exercise 4: Try testing the model with sample sentences that you put in. What do you think the output result means? Can you see if the model seems to perform well? Under what situations do you think the model's performance is good or bad? (2pts)

In [50]:
vectorizer = CountVectorizer(ngram_range=(1,2))
X = vectorizer.fit_transform(all_sentences)
arr = X.toarray()
print(arr.shape)

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3)
rf = RandomForestClassifier()
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
acc = accuracy_score(y_true=y_test, y_pred=y_pred)
print(acc)

pos_test_sentence = 'I love everything happy and good and fun!'
pos_vec = vectorizer.transform([pos_test_sentence.lower()])
pos_result = rf.predict_proba(pos_vec)
print(pos_result)

neg_test_sentence = 'I hate everything and it sucks and it is bad'
neg_vec = vectorizer.transform([neg_test_sentence.lower()])
neg_result = rf.predict_proba(neg_vec)
print(neg_result)

(100, 1441)
0.7333333333333333
[[0.9 0.1]]
[[0.9 0.1]]




I wrote one clearly negative sentence and one clearly positive sentence, and the model gave both a 90% probability of being negative, so it is not a great model, at least on my examples.

- The left number is the predicted probability of class 0, which means negative.
- The right number is the predicted probability of class 1, which means positive.

(The columns follow the order of `rf.classes_`.)
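
One plausible explanation for the identical outputs, not verified against this notebook's model, is that most words in the two test sentences never occur in the 100 training tweets. A vectorizer ignores words outside its vocabulary, so both sentences could map to nearly the same (mostly zero) vector. The sketch below shows the effect with a hypothetical four-word vocabulary and a hypothetical `vectorize` helper.

```python
# Hypothetical vocabulary learned from a small training set.
vocab = {"sad": 0, "missed": 1, "omg": 2, "friend": 3}

def vectorize(sentence):
    """Count only words present in vocab; unseen words are silently dropped."""
    vec = [0] * len(vocab)
    for word in sentence.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1
    return vec

pos_vec = vectorize("I love everything happy and good and fun")
neg_vec = vectorize("I hate everything and it sucks and it is bad")

print(pos_vec)
print(neg_vec)
print(vectorize("so sad my friend missed it"))
```

When two inputs produce the same vector, any classifier must give them the same prediction, so a larger training sample with a richer vocabulary would be the first thing to try.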