
# 1. Data Description

# 1.1 Number of Words
One of the most basic features we can extract is the number of words in each tweet. The intuition is that negative tweets generally contain fewer words than positive ones.

To do this, we simply use the split function in Python. Note that splitting on a single space also counts the empty strings produced by consecutive spaces, so the counts below can be higher than the number of visible words:

In [None]:
import pandas as pd
data_url = "https://raw.githubusercontent.com/unt-iialab/info5731-spring2022/main/lecture-examples/train_E6oV3lV.csv"
train = pd.read_csv(data_url)
train['word_count'] = train['tweet'].apply(lambda x: len(str(x).split(" ")))
train[['tweet','word_count']].head()

Unnamed: 0,tweet,word_count
0,@user when a father is dysfunctional and is s...,21
1,@user @user thanks for #lyft credit i can't us...,22
2,bihday your majesty,5
3,#model i love u take with u all the time in ...,17
4,factsguide: society now #motivation,8
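
The counts above include empty strings wherever a tweet has consecutive or leading spaces. As a minimal sketch (the helper names and the sample string are illustrative assumptions, not part of the original notebook), the two splitting behaviours can be compared directly:

```python
def word_count(text: str) -> int:
    """Count whitespace-separated tokens, ignoring runs of spaces."""
    return len(str(text).split())


def word_count_single_space(text: str) -> int:
    """Mirror the cell above: split on one space, keeping empty strings."""
    return len(str(text).split(" "))


# Hypothetical tweet with two leading spaces.
sample = "  bihday your majesty"
print(word_count(sample), word_count_single_space(sample))
```

If whitespace-robust counts are wanted, `word_count` could be passed to `train['tweet'].apply(...)` instead of the lambda.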


# 1.2 Number of characters
This feature is also based on the previous feature intuition. Here, we calculate the number of characters in each tweet. This is done by calculating the length of the tweet.

In [None]:
train['char_count'] = train['tweet'].str.len() ## this also includes spaces
train[['tweet','char_count']].head()

Unnamed: 0,tweet,char_count
0,@user when a father is dysfunctional and is s...,102
1,@user @user thanks for #lyft credit i can't us...,122
2,bihday your majesty,21
3,#model i love u take with u all the time in ...,86
4,factsguide: society now #motivation,39


# 1.3 Average Word Length
We will also extract another feature which will calculate the average word length of each tweet. This can also potentially help us in improving our model.

Here, we simply take the sum of the lengths of all the words and divide it by the number of words in the tweet:

In [None]:
def avg_word(sentence):
  words = str(sentence).split()
  if not words:  # avoid dividing by zero on an empty tweet
    return 0.0
  return sum(len(word) for word in words) / len(words)

train['avg_word'] = train['tweet'].apply(lambda x: avg_word(x))
train[['tweet','avg_word']].head()

Unnamed: 0,tweet,avg_word
0,@user when a father is dysfunctional and is s...,4.555556
1,@user @user thanks for #lyft credit i can't us...,5.315789
2,bihday your majesty,5.666667
3,#model i love u take with u all the time in ...,4.928571
4,factsguide: society now #motivation,8.0



# 1.4 Number of stopwords
Generally, while solving an NLP problem, the first thing we do is to remove the stopwords. But sometimes calculating the number of stopwords can also give us some extra information which we might have been losing before.

Here, we have imported stopwords from NLTK, which is a basic NLP library in Python.

In [None]:
from nltk.corpus import stopwords

import nltk
nltk.download('stopwords')
stop = stopwords.words('english')

train['stopwords'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x in stop]))
train[['tweet','stopwords']].head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Unnamed: 0,tweet,stopwords
0,@user when a father is dysfunctional and is s...,10
1,@user @user thanks for #lyft credit i can't us...,5
2,bihday your majesty,1
3,#model i love u take with u all the time in ...,5
4,factsguide: society now #motivation,1


# 1.5 Number of special characters
One more interesting feature which we can extract from a tweet is the number of hashtags or mentions present in it. This also helps in extracting extra information from our text data.

Here, we make use of the startswith() method because hashtags (or mentions) always appear at the beginning of a word.

In [None]:
train['hashtags'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x.startswith('#')]))
train[['tweet','hashtags']].head()

Unnamed: 0,tweet,hashtags
0,@user when a father is dysfunctional and is s...,1
1,@user @user thanks for #lyft credit i can't us...,3
2,bihday your majesty,0
3,#model i love u take with u all the time in ...,1
4,factsguide: society now #motivation,1
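
The same idea extends to mentions. The sketch below uses a hypothetical helper, `count_prefixed`, that counts tokens starting with any given prefix; the sample tweet is made up for illustration:

```python
def count_prefixed(text: str, prefix: str) -> int:
    """Count whitespace-separated tokens that start with `prefix`."""
    return sum(1 for token in text.split() if token.startswith(prefix))


# Illustrative tweet, not taken from the dataset.
tweet = "@user @user thanks for #lyft credit"
print(count_prefixed(tweet, "#"), count_prefixed(tweet, "@"))
```

A separate mentions column could then be built with `train['tweet'].apply(lambda x: count_prefixed(x, '@'))`.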


# 1.6 Number of numerics
Just like we calculated the number of words, we can also calculate the number of numerics which are present in the tweets. It does not have a lot of use in our example, but it is still a useful feature to compute in similar exercises.

In [None]:
train['numerics'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x.isdigit()]))
train[['tweet','numerics']].head()

Unnamed: 0,tweet,numerics
0,@user when a father is dysfunctional and is s...,0
1,@user @user thanks for #lyft credit i can't us...,0
2,bihday your majesty,0
3,#model i love u take with u all the time in ...,0
4,factsguide: society now #motivation,0


# 1.7 Number of Uppercase words
Anger or rage is quite often expressed by writing in UPPERCASE words, so the number of uppercase words can be a useful feature.

In [None]:
train['upper'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x.isupper()]))
train[['tweet','upper']].head()

Unnamed: 0,tweet,upper
0,@user when a father is dysfunctional and is s...,0
1,@user @user thanks for #lyft credit i can't us...,0
2,bihday your majesty,0
3,#model i love u take with u all the time in ...,0
4,factsguide: society now #motivation,0


# **2. Basic Pre-processing**

So far, we have learned how to extract basic features from text data. Before diving into text and feature extraction, our first step should be cleaning the data to obtain better features. We will achieve this by doing some of the basic pre-processing steps on our training data.

# 2.1 Lower case
The first pre-processing step which we will do is to transform our tweets into lowercase.

This avoids having multiple copies of the same words.

For example, while calculating the word count, ‘Analytics’ and ‘analytics’ will be taken as different words.

In [None]:
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x.lower() for x in x.split()))
train['tweet'].head()

Unnamed: 0,tweet
0,@user when a father is dysfunctional and is so...
1,@user @user thanks for #lyft credit i can't us...
2,bihday your majesty
3,#model i love u take with u all the time in ur...
4,factsguide: society now #motivation


# 2.2 Removing Punctuation
The next step is to remove punctuation, as it doesn’t add any extra information while treating text data.

Therefore removing all instances of it will help us reduce the size of the training data.

In [None]:
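train['tweet'] = train['tweet'].str.replace(r'[^\w\s]', '', regex=True)  # regex=True is required in pandas 2.0+
train['tweet'].head()

Unnamed: 0,tweet
0,@user when a father is dysfunctional and is so...
1,@user @user thanks for #lyft credit i can't us...
2,bihday your majesty
3,#model i love u take with u all the time in ur...
4,factsguide: society now #motivation


With regex=True, all the punctuation, including ‘#’ and ‘@’, is removed from the training data. The output shown above still contains ‘@user’ and ‘#lyft’ because it was produced without that flag: in pandas 2.0 and later, str.replace treats the pattern as a literal string by default, so nothing was replaced.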
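
To check what the pattern does independently of pandas, here is a minimal sketch using Python's built-in re module. The helper name `strip_punctuation` and the sample string are assumptions made for illustration:

```python
import re

# Anything that is neither a word character nor whitespace counts as punctuation.
PUNCT = re.compile(r"[^\w\s]")


def strip_punctuation(text: str) -> str:
    """Delete every character matched by PUNCT."""
    return PUNCT.sub("", text)


print(strip_punctuation("@user thanks for #lyft credit, can't use!"))
```

Note that the apostrophe in "can't" is removed as well, which is why later outputs show "cant" and "dont".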

# 2.3 Removal of Stop Words

As we discussed earlier, stop words (or commonly occurring words) should be removed from the text data.

For this purpose, we can either create a list of stopwords ourselves or we can use predefined libraries.

Here, we use NLTK (Natural Language Toolkit).

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
train['tweet'].head()

Unnamed: 0,tweet
0,@user father dysfunctional selfish drags kids ...
1,@user @user thanks #lyft credit can't use caus...
2,bihday majesty
3,#model love u take u time urð±!!! ððð...
4,factsguide: society #motivation


# 2.4 Common word removal
Previously, we removed words that are common in English in general. We can also remove the words that occur most frequently in our own text data.

First, let’s check the 10 most frequently occurring words in our text data and then decide whether to remove or retain them.

In [None]:
freq = pd.Series(' '.join(train['tweet']).split()).value_counts()[:10]
freq

Unnamed: 0,count
@user,17291
&amp;,1574
day,1454
#love,1449
happy,1328
-,1244
u,1116
love,1112
i'm,992
like,920


In [None]:
freq = list(freq.index)
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
train['tweet'].head()

Unnamed: 0,tweet
0,father dysfunctional selfish drags kids dysfun...
1,thanks #lyft credit can't use cause offer whee...
2,bihday majesty
3,#model take time urð±!!! ðððð ð...
4,factsguide: society #motivation


# 2.5 Rare words removal
Similarly, just as we removed the most common words, this time let’s remove rarely occurring words from the text. Because they’re so rare, the association between them and other words is dominated by noise. Alternatively, you can replace rare words with a more general form, which will then have a higher count.


In [None]:
freq = pd.Series(' '.join(train['tweet']).split()).value_counts()[-10:]
freq

Unnamed: 0,count
nouveau,1
"cops""",1
"smth,",1
opinion?,1
#capitalist;,1
architecture.,1
#socialist;,1
#environmentalist;,1
#coalitionist;,1
chisolm.,1


In [None]:
freq = list(freq.index)
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
train['tweet'].head()

0    father dysfunctional selfish drags kids dysfun...
1    thanks lyft credit cant use cause dont offer w...
2                                       bihday majesty
3                              model take urð ðððð ððð
4                        factsguide society motivation
Name: tweet, dtype: object

All these pre-processing steps are essential and help us in reducing our vocabulary clutter so that the features produced in the end are more effective.
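
The common-word and rare-word steps can be combined into one pass. The sketch below is a variation on the cells above, not the notebook's exact method: it uses collections.Counter, drops the `n_common` most frequent tokens, and drops tokens seen at most `max_rare_count` times (both parameter names are hypothetical):

```python
from collections import Counter


def remove_by_frequency(docs, n_common=1, max_rare_count=1):
    """Drop the most common tokens and tokens that occur too rarely."""
    counts = Counter(token for doc in docs for token in doc.split())
    common = {token for token, _ in counts.most_common(n_common)}
    rare = {token for token, count in counts.items() if count <= max_rare_count}
    drop = common | rare
    return [" ".join(t for t in doc.split() if t not in drop) for doc in docs]


# Toy corpus: 'happy' occurs 4 times, 'day' 3 times, 'rainy' and 'off' once.
docs = ["happy day happy", "happy rainy day", "happy day off"]
print(remove_by_frequency(docs))
```

Using a count threshold for rare words, rather than the last 10 entries of a sorted list, removes every token at that frequency instead of an arbitrary subset of them.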

# 2.6 Spelling correction

We’ve all seen tweets with a plethora of spelling mistakes. Our timelines are often filled with hastily sent tweets that are barely legible at times.

In that regard, spelling correction is a useful pre-processing step because this also will help us in reducing multiple copies of words. For example, “Analytics” and “analytcs” will be treated as different words even if they are used in the same sense.

In [None]:
from textblob import TextBlob
train['tweet'][:5].apply(lambda x: str(TextBlob(x).correct()))

Unnamed: 0,tweet
0,father dysfunctional selfish drags kiss dysfun...
1,thanks #left credit can't use cause offer whee...
2,midday majesty
3,#model take time or±!!! ðððð ð¦...
4,factsguide: society #motivation


# 2.7 Tokenization
Tokenization refers to dividing the text into a sequence of words or sentences. In our example, we have used the textblob library to first transform our tweets into a blob and then converted them into a series of words.

In [None]:
import nltk
nltk.download('punkt_tab')

#punkt is a pre-trained tokenizer model that helps split text into sentences and words.

In [None]:
print(train['tweet'][1])
TextBlob(train['tweet'][1]).words


thanks #lyft credit can't use cause offer wheelchair vans pdx. #disapointed #getthanked


WordList(['thanks', 'lyft', 'credit', 'ca', "n't", 'use', 'cause', 'offer', 'wheelchair', 'vans', 'pdx', 'disapointed', 'getthanked'])

# 2.8 Stemming
Stemming refers to the removal of suffixes, like “ing”, “ly”, “s”, etc. by a simple rule-based approach. For this purpose, we will use PorterStemmer from the NLTK library.

In the following example, dysfunctional has been transformed into dysfunct, among other changes.

In [None]:
from nltk.stem import PorterStemmer
st = PorterStemmer()
train['tweet'][:5].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))

0        father dysfunct selfish drag kid dysfunct run
1    thank lyft credit cant use caus dont offer whe...
2                                       bihday majesti
3                              model take urð ðððð ððð
4                              factsguid societi motiv
Name: tweet, dtype: object
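
To make the "simple rule-based approach" concrete, here is a deliberately naive suffix stripper. It is a toy sketch and not the Porter algorithm; the function name `toy_stem` and the `min_stem` parameter are assumptions for illustration:

```python
SUFFIXES = ("ing", "ly", "s")  # the suffixes named in the text above


def toy_stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix if at least `min_stem` characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


print([toy_stem(w) for w in ["running", "quickly", "kids", "is"]])
```

The toy version turns "running" into "runn", whereas PorterStemmer has extra rules (for example, for doubled consonants) and returns "run". This is the kind of detail a real stemmer handles for you.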

# 2.9 Lemmatization
Stemming strips suffixes by rule, which can leave a truncated stem that is not a real word, while lemmatization reduces words to their base or dictionary form (the lemma).

Lemmatization is usually a more effective option than stemming because it uses a vocabulary and morphological analysis to find the lemma, rather than just stripping suffixes. Therefore, we usually prefer lemmatization over stemming.

In [None]:
from textblob import Word
import nltk
nltk.download('wordnet')

train['tweet'] = train['tweet'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
train['tweet'].head()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


0    father dysfunctional selfish drag kid dysfunct...
1    thanks lyft credit cant use cause dont offer w...
2                                       bihday majesty
3                              model take urð ðððð ððð
4                        factsguide society motivation
Name: tweet, dtype: object

# **3. RegEx**

A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.

RegEx can be used to check if a string contains the specified search pattern.

In [None]:
import re  # re is Python's built-in regular expression package

**findall**	Returns a list containing all matches

**search**	Returns a Match object if there is a match anywhere in the string

**split**	Returns a list where the string has been split at each match

**sub**	Replaces one or many matches with a string

# **The findall() Function**
The findall() function returns a list containing all matches.

In [None]:
import re

txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)

# similar to Ctrl+F: returns every occurrence of the pattern

['ai', 'ai']


# **The search() Function**
The search() function searches the string for a match, and returns a Match object if there is a match.

If there is more than one match, only the first occurrence of the match will be returned.

In [None]:
import re

txt = "The rain in Spain"
x = re.search(r"\s", txt)

print("The first white-space character is located in position:", x.start())

The first white-space character is located in position: 3


# **The split() Function**

The split() function returns a list where the string has been split at each match:

In [None]:
txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)

['The', 'rain', 'in', 'Spain']


# **The sub() Function**
The sub() function replaces the matches with the text of your choice.

In [None]:
txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)

The9rain9in9Spain


# **Compile Function**
Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.

In [None]:
p = re.compile('[a-e]')

print(p.findall("Aye, said Mr. Gibenson Stark"))

['e', 'a', 'd', 'b', 'e', 'a']
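
A compiled pattern is most useful when the same expression is applied many times, as with tweets. The following sketch compiles a hashtag pattern once and reuses it for both extraction and removal; the pattern, the name `HASHTAG`, and the sample tweet are illustrative assumptions:

```python
import re

# '#' followed by one or more word characters.
HASHTAG = re.compile(r"#\w+")

tweet = "thanks #lyft credit #disapointed #getthanked"

tags = HASHTAG.findall(tweet)              # extract every hashtag
without_tags = HASHTAG.sub("", tweet).split()  # remove them, then tokenize

print(tags)
print(without_tags)
```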


# **3. Advanced Text Processing**

Up to this point, we have done all the basic pre-processing steps in order to clean our data. Now, we can finally move on to extracting features using NLP techniques.


# 3.1 N-grams
N-grams are contiguous sequences of N words used together. N-grams with N=1 are called unigrams. Similarly, bigrams (N=2), trigrams (N=3) and so on can also be used.

Unigrams do not usually contain as much information as bigrams and trigrams. The basic principle behind n-grams is that they capture the language structure, like what letter or word is likely to follow the given one. The longer the n-gram (the higher the n), the more context you have to work with. Optimum length really depends on the application: if your n-grams are too short, you may fail to capture important differences. On the other hand, if they are too long, you may fail to capture the “general knowledge” and only stick to particular cases.

So, let’s quickly extract trigrams from our tweets using the ngrams function of the textblob library.


In [36]:
TextBlob(train['tweet'][0]).ngrams(3)

[WordList(['father', 'dysfunctional', 'selfish']),
 WordList(['dysfunctional', 'selfish', 'drags']),
 WordList(['selfish', 'drags', 'kids']),
 WordList(['drags', 'kids', 'dysfunction']),
 WordList(['kids', 'dysfunction', 'run'])]
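
For reference, the sliding-window idea behind n-grams can be sketched in plain Python. This is a minimal illustration rather than textblob's implementation, and the function name `ngrams` is chosen here only for the example:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "father dysfunctional selfish drags kids".split()
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

A sequence of k tokens yields k - n + 1 n-grams, so five tokens give four bigrams and three trigrams.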

# 3.2 Term frequency

Term frequency is often defined as the ratio of the count of a word in a sentence to the length of the sentence.

In its simplest form, which is what we use below, it is just the raw count:

**TF = number of occurrences of the term in the document.**

But it has been observed that if a word X occurs 1 time in document A and 10 times in document B, it's generally not true that X is 10 times more relevant in B than in A. The difference in relevance is generally smaller than the ratio of the counts. Hence it is good to apply the following transformation to TF:

$$
\mathrm{TF}_{\text{scaled}} =
\begin{cases}
1 + \log(\mathrm{TF}) & \text{if } \mathrm{TF} > 0 \\
0 & \text{if } \mathrm{TF} = 0
\end{cases}
$$
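
The log transformation can be sketched as follows. The natural logarithm is assumed here (the text does not specify a base), and the function name `log_scaled_tf` is hypothetical:

```python
import math
from collections import Counter


def log_scaled_tf(tokens):
    """Map each term to 1 + ln(count); absent terms implicitly score 0."""
    counts = Counter(tokens)
    return {term: 1 + math.log(count) for term, count in counts.items()}


tf = log_scaled_tf("a a a b".split())
print(tf)
```

With this scaling, a word that occurs 10 times scores 1 + ln(10), roughly 3.3, rather than 10 times the score of a word that occurs once.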

To understand more about Term Frequency, have a look at [this article](https://www.analyticsvidhya.com/blog/2015/04/information-retrieval-system-explained/). More about [term frequency](https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/).

Below, I have tried to show you the term frequency table of a tweet.

In [41]:
# pd.value_counts is deprecated in recent pandas; use Series.value_counts instead
tf1 = (train['tweet'][1:2]).apply(lambda x: pd.Series(x.split(" ")).value_counts()).sum(axis = 0).reset_index()
tf1.columns = ['words','tf']
tf1



Unnamed: 0,words,tf
0,thanks,1
1,#lyft,1
2,credit,1
3,can't,1
4,use,1
5,cause,1
6,offer,1
7,wheelchair,1
8,vans,1
9,pdx.,1


# 3.3 Inverse Document Frequency
The intuition behind inverse document frequency (IDF) is that a word is not of much use to us if it’s appearing in all the documents.

Therefore, the IDF of each word is the log of the ratio of the total number of rows to the number of rows in which that word is present.

IDF = log(N/n), where, N is the total number of rows and n is the number of rows in which the word was present.

So, let’s calculate IDF for the same tweets for which we calculated the term frequency.

**The higher the IDF value, the rarer the word.**

In [42]:
import numpy as np

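# Note: str.contains does substring (regex) matching, so 'use' also matches 'used' or 'because';
# the document counts here are therefore approximate.
for i,word in enumerate(tf1['words']):
  tf1.loc[i, 'idf'] = np.log(train.shape[0]/(len(train[train['tweet'].str.contains(word)])))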

tf1

Unnamed: 0,words,tf,idf
0,thanks,1,4.597751
1,#lyft,1,9.273691
2,credit,1,7.327781
3,can't,1,3.753564
4,use,1,3.374707
5,cause,1,5.610129
6,offer,1,6.522155
7,wheelchair,1,9.273691
8,vans,1,8.426393
9,pdx.,1,8.762865
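
As a complement, here is a minimal sketch of IDF = log(N/n) that matches whole tokens instead of substrings. The function name `idf` and the three-document toy corpus are assumptions for illustration:

```python
import math


def idf(term, docs):
    """log(N / n), where n counts documents containing `term` as a whole token."""
    n_containing = sum(1 for doc in docs if term in doc.split())
    if n_containing == 0:
        raise ValueError(f"{term!r} does not occur in any document")
    return math.log(len(docs) / n_containing)


# Toy corpus: 'use' is a token in one document, 'lyft' in two.
docs = ["thanks lyft credit", "cant use lyft", "user guide"]
print(idf("use", docs), idf("lyft", docs))
```

Substring matching would count "user guide" as containing "use" and so report a lower IDF for that word.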


# 3.4 Term Frequency – Inverse Document Frequency (TF-IDF)

TF-IDF is the multiplication of the TF and IDF which we calculated above.

In [43]:
tf1['tfidf'] = tf1['tf'] * tf1['idf']
tf1

Unnamed: 0,words,tf,idf,tfidf
0,thanks,1,4.597751,4.597751
1,#lyft,1,9.273691,9.273691
2,credit,1,7.327781,7.327781
3,can't,1,3.753564,3.753564
4,use,1,3.374707,3.374707
5,cause,1,5.610129,5.610129
6,offer,1,6.522155,6.522155
7,wheelchair,1,9.273691,9.273691
8,vans,1,8.426393,8.426393
9,pdx.,1,8.762865,8.762865


We can see that TF-IDF has penalized words like ‘can’t’ and ‘use’ because they are commonly occurring words. However, it has given a high weight to rare words such as ‘#lyft’ and ‘wheelchair’, which are more distinctive and therefore more likely to be useful as features.

We don’t have to calculate TF and IDF separately and then multiply them to obtain TF-IDF. Instead, sklearn provides TfidfVectorizer to obtain it directly:

In [46]:
from sklearn.feature_extraction.text import TfidfVectorizer
myvocabulary = ['life', 'learning', 'game']
corpus = {1: "The game of life is a game of everlasting learning", 2: "The unexamined life is not worth living", 3: "Never stop learning"}
tfidf = TfidfVectorizer(vocabulary = myvocabulary, max_features=1000, lowercase=True, analyzer='word',
 stop_words= 'english', ngram_range = (1,3))
tfs = tfidf.fit_transform(corpus.values())
feature_names = tfidf.get_feature_names_out()
corpus_index = [n for n in corpus]
import pandas as pd
df = pd.DataFrame(tfs.T.todense(), index=feature_names, columns=corpus_index)
print(df)

                 1    2    3
life      0.334907  1.0  0.0
learning  0.334907  0.0  1.0
game      0.880724  0.0  0.0
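
The sklearn values differ from the plain log(N/n) formula because TfidfVectorizer, by default, uses a smoothed IDF, ln((1 + N) / (1 + n)) + 1, and then scales each document vector to unit L2 length. The sketch below reproduces the first column of the table under those defaults. It simplifies tokenization to a lowercase whitespace split, which is sufficient here because the vocabulary is fixed; the function name `tfidf_rows` is hypothetical:

```python
import math
from collections import Counter

vocabulary = ["life", "learning", "game"]
docs = [
    "The game of life is a game of everlasting learning",
    "The unexamined life is not worth living",
    "Never stop learning",
]


def tfidf_rows(docs, vocabulary):
    """Smoothed-IDF, L2-normalized TF-IDF restricted to `vocabulary`."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(docs)
    idf = {}
    for term in vocabulary:
        df = sum(1 for tokens in tokenized if term in tokens)
        idf[term] = math.log((1 + n_docs) / (1 + df)) + 1
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[term] * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return rows


rows = tfidf_rows(docs, vocabulary)
print([round(v, 6) for v in rows[0]])
```

For the first document this gives about 0.334907 for "life" and "learning" and 0.880724 for "game", in line with the table above.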


# 3.5 Bag of Words
Bag of Words (BoW) is a representation of text that describes the occurrence of words within the text data, ignoring their order. The intuition behind this is that two similar text fields will contain similar kinds of words, and will therefore have a similar bag of words. Further, it assumes that we can learn something about the meaning of a document from its words alone.

To gain a better understanding of this, you can refer to [this article](https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/).

For implementation, sklearn provides a separate function for it as shown below:

In [48]:
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(max_features=50, lowercase=True, ngram_range=(1,3),analyzer = "word")
train_bow = bow.fit_transform(train['tweet'])
bow.get_feature_names_out()

array(['back', 'beautiful', 'best', 'bihday', 'bull', 'can', 'day',
       'days', 'family', 'father', 'first', 'friday', 'friends', 'fun',
       'get', 'go', 'going', 'good', 'got', 'great', 'healthy', 'it',
       'know', 'life', 'make', 'me', 'morning', 'music', 'need', 'new',
       'one', 'people', 'positive', 'really', 'see', 'smile', 'summer',
       'take', 'thankful', 'time', 'today', 'tomorrow', 'us', 'wait',
       'want', 'way', 'weekend', 'work', 'world', 'you'], dtype=object)
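
What CountVectorizer produces can be illustrated with a small hand-rolled version. This is a simplified sketch (unigrams only, whitespace tokenization, no feature limit), and the function name `bag_of_words` is an assumption:

```python
from collections import Counter


def bag_of_words(docs):
    """Return a sorted vocabulary and one count vector per document."""
    vocab = sorted({token for doc in docs for token in doc.split()})
    vectors = []
    for doc in docs:
        counts = Counter(doc.split())
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


vocab, vectors = bag_of_words(["good day good", "bad day"])
print(vocab)
print(vectors)
```

Each document becomes a vector with one slot per vocabulary word, so word order is lost but word frequency is kept.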

# 3.6 Sentiment Analysis

If you recall, our problem was to detect the sentiment of the tweet. So, before applying any ML/DL models (which can use the sentiment detected by the textblob library as a separate feature), let’s check the sentiment of the first few tweets.


In [49]:
train['tweet'][:5].apply(lambda x: TextBlob(x).sentiment)

Unnamed: 0,tweet
0,"(-0.5, 1.0)"
1,"(0.2, 0.2)"
2,"(0.0, 0.0)"
3,"(0.0, 0.0)"
4,"(0.0, 0.0)"


Above, you can see that it returns a tuple representing the polarity and subjectivity of each tweet. Here, we only extract polarity, as it indicates the sentiment: values nearer to 1 mean a positive sentiment and values nearer to -1 mean a negative sentiment. This can also work as a feature for building a machine learning model.

In [50]:
train['sentiment'] = train['tweet'].apply(lambda x: TextBlob(x).sentiment[0] )
train[['tweet','sentiment']].head()

Unnamed: 0,tweet,sentiment
0,father dysfunctional selfish drags kids dysfun...,-0.5
1,thanks #lyft credit can't use cause offer whee...,0.2
2,bihday majesty,0.0
3,#model take time urð±!!! ðððð ð...,0.0
4,factsguide: society #motivation,0.0


# 3.7 Word Embeddings


Word Embedding is the representation of text in the form of vectors. The underlying idea here is that similar words will have a small distance between their vectors.

Word2Vec models require a lot of text, so we can either train one on our training data or use pre-trained word vectors, such as those released by Google or trained on Wikipedia.

Here, we will use pre-trained word vectors which can be downloaded from the [GloVe website](https://nlp.stanford.edu/projects/glove/). There are vectors of different dimensions (50, 100, 200, 300) trained on wiki data. For this example, I have downloaded the 100-dimensional version of the model.

You can refer to an article [here](https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/) to understand different forms of word embeddings.

The first step here is to convert the GloVe file into the word2vec format.

In [None]:
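from gensim.scripts.glove2word2vec import glove2word2vec
# Note: this script is deprecated in gensim 4.0+, where
# KeyedVectors.load_word2vec_format(glove_input_file, binary=False, no_header=True) reads the GloVe file directly.
glove_input_file = 'glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)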

Now, we can load the above word2vec file as a model.

In [None]:
from gensim.models import KeyedVectors # load the Stanford GloVe model
filename = 'glove.6B.100d.txt.word2vec'
model = KeyedVectors.load_word2vec_format(filename, binary=False)
model['go']
model['away']
(model['go'] + model['away'])/2

By averaging the vectors of ‘go’ and ‘away’, we have converted the phrase into a single vector, which can now be used as a feature in any modelling technique.
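
The averaging step does not depend on gensim. The sketch below shows it with plain Python lists, together with cosine similarity, a common way to compare such vectors. The 3-dimensional vectors are made-up stand-ins for the 100-dimensional GloVe vectors, and both function names are assumptions:

```python
import math


def average_vectors(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, not real GloVe values.
toy = {"go": [1.0, 0.0, 2.0], "away": [3.0, 2.0, 0.0]}
phrase = average_vectors([toy["go"], toy["away"]])
print(phrase, cosine_similarity(phrase, toy["go"]))
```

Averaging gives every tweet a fixed-length vector regardless of its word count, at the cost of ignoring word order.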