# Text Processing: Stemming and Lemmatization

*Original Dataset: https://www.kaggle.com/uciml/sms-spam-collection-dataset/home*

For this exercise, 100 SMS messages have been parsed and labeled as "spam" or "ham". The dataframe also contains the original text of each message. We have converted the dataframe into a dictionary for this exercise (execute the first two cells).

The dictionary has 100 entries, keyed by the integers 0 to 99. Each value is itself a dictionary with two string fields, `class` and `text`. `class` contains either "spam" or "ham", based on the category of the SMS, and `text` contains the original text message.
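To make the shape concrete, here is a minimal sketch of a dictionary with the same structure. The name `sample_msgs` and its keys are illustrative only; the two texts are messages from the dataset shown further down.

```python
# Illustrative two-entry dictionary with the same shape as `msgs`
sample_msgs = {
    0: {"class": "ham", "text": "Ok lar... Joking wif u oni..."},
    1: {"class": "ham", "text": "U dun say so early hor... U c already then say..."},
}

# Each key maps to an inner dictionary with "class" and "text" entries
for key, record in sample_msgs.items():
    print(key, record["class"], record["text"])

# Collect only the message texts, lowercased
sample_texts = [record["text"].lower() for record in sample_msgs.values()]
print(sample_texts)
```

Iterating over `.values()` is enough when only the inner records are needed, which is the case for the tasks below.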

In [1]:
import pandas as pd

df = pd.read_csv("/dsa/data/DSA-8410/spam.csv", encoding='latin1')
mini_df = df[['v1', 'v2']][:100]
mini_df.columns = ['class', 'text']

mini_df.to_csv('messages.csv', index=False)

In [2]:
df = pd.read_csv('messages.csv')
msgs = df.T.to_dict()

In [3]:
msgs #<--- Nested dictionary

{0: {'class': 'ham',
  'text': 'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'},
 1: {'class': 'ham', 'text': 'Ok lar... Joking wif u oni...'},
 2: {'class': 'spam',
  'text': "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"},
 3: {'class': 'ham',
  'text': 'U dun say so early hor... U c already then say...'},
 4: {'class': 'ham',
  'text': "Nah I don't think he goes to usf, he lives around here though"},
 5: {'class': 'spam',
  'text': "FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, å£1.50 to rcv"},
 6: {'class': 'ham',
  'text': 'Even my brother is not like to speak with me. They treat me like aids patent.'},
 7: {'class': 'ham',
  'text': "As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as

**Task 1.** Create a list of strings from this dictionary with the `text` values, and convert all of the strings into lowercase. Print out the first five (5) items from your list.

In [4]:
# Importing Libraries

# upgrading nltk

#! pip install nltk -U

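import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords

# Fetch the stop-word list. word_tokenize and WordNetLemmatizer rely on
# additional NLTK data; if it is missing, NLTK raises a LookupError that
# names the resource to download.
nltk.download("stopwords")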

from nltk.stem import PorterStemmer
porter = PorterStemmer()

from nltk.stem import WordNetLemmatizer
wordnet = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to /home/lcmhng/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
# Your code goes here
#---------------------

texts = []

for text in msgs.values():
    # print(text['text'])
    texts.append(text['text'])

In [6]:
texts = [text.lower() for text in texts]

print(texts[:5])

['go until jurong point, crazy.. available only in bugis n great world la e buffet... cine there got amore wat...', 'ok lar... joking wif u oni...', "free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005. text fa to 87121 to receive entry question(std txt rate)t&c's apply 08452810075over18's", 'u dun say so early hor... u c already then say...', "nah i don't think he goes to usf, he lives around here though"]


**Task 2.** Use the `nltk` package's tokenize functionality on each of the strings in your list. The result should be a list of lists. Print out the first five (5) items from your list.

In [7]:
# Your code goes here
#---------------------

texts = [word_tokenize(text) for text in texts]

print(texts[:5])

[['go', 'until', 'jurong', 'point', ',', 'crazy', '..', 'available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', '...', 'cine', 'there', 'got', 'amore', 'wat', '...'], ['ok', 'lar', '...', 'joking', 'wif', 'u', 'oni', '...'], ['free', 'entry', 'in', '2', 'a', 'wkly', 'comp', 'to', 'win', 'fa', 'cup', 'final', 'tkts', '21st', 'may', '2005.', 'text', 'fa', 'to', '87121', 'to', 'receive', 'entry', 'question', '(', 'std', 'txt', 'rate', ')', 't', '&', 'c', "'s", 'apply', '08452810075over18', "'s"], ['u', 'dun', 'say', 'so', 'early', 'hor', '...', 'u', 'c', 'already', 'then', 'say', '...'], ['nah', 'i', 'do', "n't", 'think', 'he', 'goes', 'to', 'usf', ',', 'he', 'lives', 'around', 'here', 'though']]


**Task 3.** Remove the stopwords, punctuation, and numbers from your list (list of lists). Punctuation and numbers can be detected with the string method `isalpha()`, called on each token (`string.punctuation` is a constant holding punctuation characters, not a function). If the result is `False`, remove that token from the list.

In [8]:
# Your code goes here
#---------------------


# Numbers and punctuation

new_sen=[]

for sen in texts:
    words = [word for word in sen if word.isalpha()]
    new_sen.append(words)
    
texts = new_sen

#stop words

stop_words = stopwords.words("english")

new_sen=[]

for sen in texts:
    words = [word for word in sen if word not in stop_words]
    new_sen.append(words)
    
texts = new_sen


In [9]:
texts[:5]

[['go',
  'jurong',
  'point',
  'crazy',
  'available',
  'bugis',
  'n',
  'great',
  'world',
  'la',
  'e',
  'buffet',
  'cine',
  'got',
  'amore',
  'wat'],
 ['ok', 'lar', 'joking', 'wif', 'u', 'oni'],
 ['free',
  'entry',
  'wkly',
  'comp',
  'win',
  'fa',
  'cup',
  'final',
  'tkts',
  'may',
  'text',
  'fa',
  'receive',
  'entry',
  'question',
  'std',
  'txt',
  'rate',
  'c',
  'apply'],
 ['u', 'dun', 'say', 'early', 'hor', 'u', 'c', 'already', 'say'],
 ['nah', 'think', 'goes', 'usf', 'lives', 'around', 'though']]
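The two filtering passes above can also be combined into one. The sketch below is only an illustration: `demo_stop_words` is a tiny, hypothetical stop-word set standing in for NLTK's `stopwords.words("english")`, and the tokens are those of message 4 from Task 2.

```python
# Hypothetical, tiny stop-word set for illustration only;
# the notebook uses stopwords.words("english") from NLTK instead.
demo_stop_words = {"i", "he", "to", "do", "here"}

# Tokens of message 4 as produced in Task 2
tokens = ['nah', 'i', 'do', "n't", 'think', 'he', 'goes', 'to', 'usf', ',',
          'he', 'lives', 'around', 'here', 'though']

# Keep a token only if it is purely alphabetic and not a stop word.
# str.isalpha() is False for tokens such as ',' and "n't".
cleaned = [tok for tok in tokens if tok.isalpha() and tok not in demo_stop_words]
print(cleaned)
```

Storing the stop words in a `set` rather than a list makes each membership test a constant-time lookup, which matters once the corpus grows beyond a few hundred messages.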

**Task 4.** Use the `nltk` package's `PorterStemmer` to stem the cleaned-text list from **Task 3**. Store the stemmed-word list in a new variable and keep the **Task 3** result intact, since later tasks reuse the cleaned-text list.

In [10]:
# Your code goes here
#---------------------

stemmed_word=[]

for sen in texts:
    stems = [porter.stem(word) for word in sen]
    stemmed_word.append(stems)
    

**Task 5.** Use the `nltk` package's `WordNetLemmatizer` to find the lemma (or root word) of each word in the cleaned-text list from **Task 3**. Treat every word as a verb. Store the lemmatized-word list in a new variable and keep the **Task 3** result intact, since later tasks reuse the cleaned-text list. Treating every word as a verb keeps the problem simple; alternatively, a POS tagger could be applied to infer the part of speech of each word.

In [11]:
# Your code goes here
#---------------------

lemmatize_words=[]

for sen in texts:
    lemmas = [wordnet.lemmatize(word, pos="v") for word in sen]
    lemmatize_words.append(lemmas)

**Task 6.** For each lemma from **Task 5**, calculate how many times it occurs across all of the messages. Sort the lemmas in descending order by total number of occurrences, and print out the top ten (10) words and their counts.

In [12]:
# Your code goes here
#---------------------

# Count occurrences with a dictionary keyed by lemma: the first time a lemma
# is seen its count is set to 1, and every later occurrence increments it.

lemma_dict = {}

for sen in lemmatize_words:
    for word in sen:
        if word in lemma_dict:
            lemma_dict[word] += 1
        else:
            lemma_dict[word] = 1

In [13]:
len(lemma_dict)

532

In [14]:
new_dict = {k: v for k, v in sorted(lemma_dict.items(), key=lambda item: item[1], reverse=True)}

words = new_dict.keys()
occurance = new_dict.values()

df = pd.DataFrame()

df['Words'] = words
df['Occurances'] = occurance

df.head()

|   | Words | Occurances |
|---|-------|------------|
| 0 | u     | 17         |
| 1 | call  | 15         |
| 2 | go    | 11         |
| 3 | get   | 11         |
| 4 | free  | 10         |
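The counting and sorting steps of Task 6 can also be done with `collections.Counter` from the standard library. This is a minimal sketch, not the notebook's method; `sample_lemmas` holds two lemma lists from this dataset as sample input.

```python
from collections import Counter

# Two lemma lists from this dataset, used here as sample input
sample_lemmas = [
    ['ok', 'lar', 'joke', 'wif', 'u', 'oni'],
    ['u', 'dun', 'say', 'early', 'hor', 'u', 'c', 'already', 'say'],
]

# Counter tallies every word across all messages
lemma_counts = Counter(word for sen in sample_lemmas for word in sen)

# most_common(n) returns the n most frequent (word, count) pairs, highest first
top_two = lemma_counts.most_common(2)
print(top_two)  # [('u', 3), ('say', 2)]
```

For the full exercise, `most_common(10)` on a counter built from all lemma lists would give the top ten words directly.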


**Task 7.** From the result we got from **Task 6**, remove all of the words with a length of 1 and select the top hundred (100) most frequent terms from it. We will use this list of words in our next task.

In [15]:
# Your code goes here
#---------------------

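# Note: the filter below drops words that occur only once, not words of
# length 1, and its result is overwritten by the next line. Because df is
# already sorted, df.head(100) simply keeps the 100 most frequent words,
# including single-character ones such as 'u'.
df_100 = df[df.Occurances != 1]
df_100 = df.head(100)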

df_100

top_100 = df_100['Words'].to_list()
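Task 7 asks for words longer than one character, so the filter should test word length rather than count. A minimal sketch of that filter follows. The helper `top_terms` and the data in `demo_counts` are hypothetical: four counts echo the Task 6 output, and the entry for `"c"` is made up for illustration.

```python
from collections import Counter

# Illustrative counts; in the notebook these would come from Task 6
demo_counts = Counter({"u": 17, "call": 15, "go": 11, "c": 4, "free": 10})

def top_terms(counts, n=100, min_len=2):
    """Return up to n most frequent words that have at least min_len characters."""
    kept = [word for word, _ in counts.most_common() if len(word) >= min_len]
    return kept[:n]

print(top_terms(demo_counts, n=3))  # ['call', 'go', 'free']
```

Filtering before truncating matters here: taking the top 100 first and then dropping short words would leave fewer than 100 terms.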

**Task 8.** For each message (use the lemma-list we created for **Task 5**), calculate the number of times each word from **Task 7** (top-100 words) occurs in that message. 
Create a **Data-Matrix** using your calculations. Each row should correspond to a message, and each column should correspond to a word from the list we got in **Task 7**. Each cell should correspond to how many times that particular word (from column) occurs for that specific message (from row).

You can use a pandas `DataFrame` to store your **Data-Matrix**. Print the first five rows of the Data-Matrix.

In [16]:
# Setting up a blank dictionary to hold the values
# Will update this and append to created dataframe

word_dict = {}
for word in top_100:
    word_dict[word] = 0

In [17]:
# Your code goes here
#---------------------

# Collect one row of counts per message, then build the data frame once.
# (DataFrame.append was removed in pandas 2.0, and appending row by row is slow.)
rows = []

for sen in lemmatize_words:
    for word in sen:
        if word in word_dict:
            # Update the word dictionary value by 1 if the word is in the sentence
            word_dict[word] += 1
    # Store a copy of the counts for this message, then reset each value to 0
    rows.append(dict(word_dict))

    for key in word_dict:
        word_dict[key] = 0

word_df = pd.DataFrame(rows, columns=top_100)


In [18]:
word_df.head()

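|   | u | call | go | get | free | like | ok | sorry | txt | already | ... | value | network | months | update | gon | stuff | anymore | enough | today | urgent |
|---|---|------|----|-----|------|------|----|-------|-----|---------|-----|-------|---------|--------|--------|-----|-------|---------|--------|-------|--------|
| 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |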


In [19]:
# Confirming with first five sentences

for sen in lemmatize_words[:5]:
    print(sen)

['go', 'jurong', 'point', 'crazy', 'available', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'get', 'amore', 'wat']
['ok', 'lar', 'joke', 'wif', 'u', 'oni']
['free', 'entry', 'wkly', 'comp', 'win', 'fa', 'cup', 'final', 'tkts', 'may', 'text', 'fa', 'receive', 'entry', 'question', 'std', 'txt', 'rate', 'c', 'apply']
['u', 'dun', 'say', 'early', 'hor', 'u', 'c', 'already', 'say']
['nah', 'think', 'go', 'usf', 'live', 'around', 'though']
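The data matrix from Task 8 can be cross-checked with a short standard-library sketch. The helper `count_matrix` and the small `demo_vocab` are hypothetical names introduced for illustration; the two documents are lemma lists printed above.

```python
from collections import Counter

def count_matrix(docs, vocab):
    """Build one row per document holding the count of each vocabulary word."""
    matrix = []
    for doc in docs:
        counts = Counter(doc)  # missing words count as 0
        matrix.append([counts[word] for word in vocab])
    return matrix

# Lemma lists of messages 1 and 3 from the output above
demo_docs = [
    ['ok', 'lar', 'joke', 'wif', 'u', 'oni'],
    ['u', 'dun', 'say', 'early', 'hor', 'u', 'c', 'already', 'say'],
]
demo_vocab = ['u', 'call', 'ok', 'already']

matrix = count_matrix(demo_docs, demo_vocab)
print(matrix)  # [[1, 0, 1, 0], [2, 0, 0, 1]]
```

These rows agree with the `u`, `call`, `ok`, and `already` columns of rows 1 and 3 in the data matrix. Passing `matrix` and `demo_vocab` to `pd.DataFrame(matrix, columns=demo_vocab)` would give the same layout as `word_df`.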
