<img src="https://nyp-aicourse.s3.ap-southeast-1.amazonaws.com/agods/nyp_ago_logo.png" width='400'/>

# Text Processing using Python
Python is widely used for text preprocessing. Libraries such as NLTK (http://www.nltk.org/) provide ready-made tools that we can call from a Python script.

We will walk through a sample program that illustrates the main processing steps.


# Import the libraries and download the necessary packages

In [None]:
import nltk
import re
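#the downloads only need to run once per environment
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('punkt_tab')  #newer NLTK releases load the tokenizer data from 'punkt_tab'
nltk.download('stopwords')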

# Read in file for analysis
The file is expected at datasets/hotel2.txt, relative to the folder that contains this notebook.
To analyse a different file, change the value of filename. The code reads the first line of the file and saves it to 'text'.

In [None]:
filename = 'datasets/hotel2.txt'
#the with-block closes the file even if reading fails
with open(filename, 'r', encoding='UTF8') as fp:
    text = fp.readline()
print(text)

Alternatively, you can read the entire file and take the first entry, as shown below.

In [None]:
with open(filename, 'r', encoding='UTF8') as fp:
    text2 = fp.read()
#print(text2)

#split into lines; unlike readline(), the first entry has no trailing newline
text3 = text2.split('\n')
print(text3[0])

# Tokenization
Run the code below to tokenize the text and to count the sentences and words it contains.

In [None]:
#sentence tokenizer
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text)
print('number of sentences: '+str(len(sentences)))

#word tokenizer
from nltk.tokenize import word_tokenize
words1 = word_tokenize(text) 
words2 = text.split()
print('number of words1: '+str(len(words1)))
print('first 80 words1 '+str(words1[:80]))
print('number of words2: '+str(len(words2)))
print('first 80 words2 '+str(words2[:80]))

# Exercise

Examine the output above and describe the difference between words1 and words2

# Normalization
Normalise each word to lowercase and remove some of the special characters.

In [None]:
#lowercase and remove punctuation
processed1 = []
for w in words1:
    #processed1.append(re.sub(r'([^\s\w]|_)+', '', w).lower())  #alternative: removes all punctuation and underscores (digits are kept)
    processed1.append(re.sub(r'[-(),.]', '', w).lower())   #removes only - ( ) , . so a word like i'd is kept

processed2 = []
for w in words2:
    #processed2.append(re.sub(r'([^\s\w]|_)+', '', w).lower())  #alternative: removes all punctuation and underscores (digits are kept)
    processed2.append(re.sub(r'[-(),.]', '', w).lower())   #removes only - ( ) , . so a word like i'd is kept
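To see what the two regular expressions do, here is a small sketch on a hand-written token list. The sample tokens are made up for illustration and are not taken from hotel2.txt.

```python
import re

# hypothetical sample tokens, chosen only to illustrate the two patterns
sample = ["I'd", "Well-kept", "(Room", "101),", "check_in", "..."]

# pattern used in this notebook: strips only - ( ) , .
narrow = [re.sub(r'[-(),.]', '', w).lower() for w in sample]

# commented-out alternative: strips every character that is not a letter,
# digit or whitespace, and also strips underscores
broad = [re.sub(r'([^\s\w]|_)+', '', w).lower() for w in sample]

print(narrow)
print(broad)
```

The narrow pattern keeps the apostrophe in i'd and the underscore in check_in, while the broad pattern removes both. With either pattern, a token made up only of removed characters becomes an empty string.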


Write code to print the first 10 words in processed1 and processed2

<details>
<summary>
    Click here to see code
</summary>

```
print('first 10 processed1 '+str(processed1[:10]))
print('first 10 processed2 '+str(processed2[:10]))

```

</details>

In [None]:
# insert code here


Tokens that consisted only of the removed characters (for example '.' or ',') are now empty strings. We can remove them as below

In [None]:
processed1 = list(filter(None, processed1))
processed2 = list(filter(None, processed2))
print(len(processed1))
print(len(processed2))

print(processed1)
print(processed2)
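As a minimal sketch of how the filter call above works, the example below applies it to a made-up list and compares it with an equivalent list comprehension. The token values are assumptions for illustration only.

```python
# hypothetical tokens, two of which are empty strings
tokens = ['great', '', 'hotel', '', 'stay']

# filter(None, ...) keeps only truthy items, so empty strings are dropped
cleaned = list(filter(None, tokens))

# the same result written as a list comprehension
cleaned_alt = [t for t in tokens if t]

print(cleaned)
```

Both forms keep the order of the remaining words.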

# Remove Stopwords
Use the stopwords list from nltk to remove stopwords from processed2 and store the remaining words in clean_words

In [None]:
#remove stop words
from nltk.corpus import stopwords
clean_words = processed2[:]
sw = stopwords.words('english')
print(sw)
#sw.append('')  # include blank so that it will be removed
for word in processed2:
    if word in sw:
        clean_words.remove(word)
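The cell above is safe because it loops over processed2 while removing from a separate copy. The sketch below, on a made-up token list and a made-up mini stop word list, shows a list comprehension that gives the same kind of result, and what can go wrong when items are removed from the very list that is being looped over.

```python
# hypothetical mini stop word list and tokens, for illustration only
stop_words = {'the', 'was', 'and', 'a'}
tokens = ['the', 'the', 'room', 'was', 'clean', 'and', 'quiet']

# safe: build a new list and leave the original untouched
kept = [t for t in tokens if t not in stop_words]

# unsafe: removing from the list that the loop is iterating over
buggy = tokens[:]
for t in buggy:
    if t in stop_words:
        buggy.remove(t)   # shifts the remaining items, so the loop skips one

print(kept)
print(buggy)
```

In the unsafe version one 'the' survives, because each removal shifts the later items forward and the loop steps over the next one.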

Check the content of clean_words by printing each word with its frequency

In [None]:
print()
print('---Print list of words with frequency (after normalisation and stop words removal)---')
#count word frequency
freq = nltk.FreqDist(clean_words)
i = 0
top5original = []
#display items in desc order
for key,val in freq.most_common():
    print(str(key) + ":" + str(val))
    # get the first 5 top frequency words
    if i < 5:
        top5original.append(str(key))
        i += 1
print()
print('---Print top 5 words from the file ---')
print('top 5 :' + str(top5original))
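nltk.FreqDist is a subclass of Python's collections.Counter, so the counting step can be sketched with the standard library alone. The word list below is made up for illustration.

```python
from collections import Counter

# hypothetical cleaned words
words = ['room', 'clean', 'room', 'staff', 'clean', 'room']

counts = Counter(words)

# most_common(n) returns the n most frequent (word, count) pairs
top2 = [word for word, count in counts.most_common(2)]

print(counts.most_common())
print(top2)
```

In the same way, passing 5 to most_common on the FreqDist would return the top 5 pairs directly, without the manual counter i.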

# Stemming using Porter
Apply Porter stemming to clean_words and store the new list in stemmed

In [None]:
#stemming  - english
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('working'))

#add in stemming - check bed, sound
stemmed = []
for c in clean_words:
    stemmed.append(stemmer.stem(c))


# Exercise
Write some code to print the 5 most frequent stemmed words

<details>
<summary>
    Click here to see code
</summary>

```
sfreq = nltk.FreqDist(stemmed)
i = 0
top5 = []
#display items in desc order
for key,val in sfreq.most_common():
    print(str(key) + ":" + str(val))
    # get the first 5 top frequency words
    if i < 5:
        top5.append(str(key))
        i += 1
print()
print('---Print top 5 stemmed words ---')
print('top 5 :'+str(top5))

```

</details>

In [None]:
# insert code here


# Analyse top 5 words by retrieving their bigrams

In [None]:
#create bigram and search for top 5 words
bigrams = list(nltk.ngrams(processed2,2))

print(bigrams[:10])

print()
print('---Print top 5 stemmed words and its bigrams ---')

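#top5 is created in the solution to the stemming exercise above, so run that first
#'k in x' is a substring test, so a stem also matches the longer unstemmed words that contain it
for k in top5:
    print(k + ": "+str([s for s in bigrams if any(k in x for x in s)]))
    print()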
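For intuition, a bigram list is every pair of neighbouring tokens. A minimal sketch with the standard library, on a made-up sentence, builds the pairs with zip and then selects the pairs that contain a given word using the same substring test as the cell above.

```python
# hypothetical tokens
tokens = ['the', 'room', 'was', 'very', 'clean']

# pair each token with its right-hand neighbour
bigrams = list(zip(tokens, tokens[1:]))

# keep the pairs in which either word contains 'room'
matches = [pair for pair in bigrams if any('room' in w for w in pair)]

print(bigrams)
print(matches)
```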

# Exercise
Try writing code to print the top 5 bigrams after normalisation, removal of stop words and stemming

<details>
<summary>
    Click here to see code
</summary>

```
bigrams2=list(nltk.ngrams(stemmed,2))
print()
print('---Print list of bigrams with frequency (after normalisation, stop words removal and stemming)---')
#count word frequency
freq = nltk.FreqDist(bigrams2)
i = 0
top5original = []
#display items in desc order
for key,val in freq.most_common():
    print(str(key) + ":" + str(val))
    # get the first 5 top frequency words
    if i < 5:
        top5original.append(str(key))
        i += 1
print()
print('---Print top 5 bigrams ---')
print('top 5 :' + str(top5original))
```

</details>

In [None]:
#insert code here


# Use lemmatization instead of stemming

In [None]:
# import these modules
from nltk.stem import WordNetLemmatizer
  
lemmatizer = WordNetLemmatizer()
  
print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))

Try to write some code to lemmatize clean_words and to print the 5 most frequent lemmatized words. Examine whether the results differ from stemming.

<details>
<summary>
    Click here to see code
</summary>

```
lemmatized = []
for c in clean_words:
    lemmatized.append(lemmatizer.lemmatize(c))

sfreq = nltk.FreqDist(lemmatized)
i = 0
top5 = []
#display items in desc order
for key,val in sfreq.most_common():
    print(str(key) + ":" + str(val))
    # get the first 5 top frequency words
    if i < 5:
        top5.append(str(key))
        i += 1
print()
print('---Print top 5 lemmatized words ---')
print('top 5 :'+str(top5))



```

</details>

In [None]:
# insert code here

# Word Representation 
First, we will use the bag-of-words method to create the frequency matrix from the stemmed data. We can do this using scikit-learn's CountVectorizer.

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Create data in dataframe for analysis
data1 = " ".join(stemmed)
df1 = pd.DataFrame({'data': [data1]})
print(df1)


In [None]:
vectorizer=CountVectorizer()
#pass the 'data' column, which holds one document per row
doc_vec = vectorizer.fit_transform(df1['data'])
print(vectorizer.get_feature_names_out())
print(doc_vec)

In [None]:
df_bow = pd.DataFrame(doc_vec.toarray(),columns=vectorizer.get_feature_names_out())
df_bow.head()

Let us now create the term frequency-inverse document frequency (TF-IDF) matrix using scikit-learn's TfidfVectorizer

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize
vectorizer = TfidfVectorizer()
#pass the 'data' column, which holds one document per row
doc_vec = vectorizer.fit_transform(df1['data'])

print(vectorizer.get_feature_names_out())
print(doc_vec)


In [None]:
df_tfidf = pd.DataFrame(doc_vec.toarray(),columns=vectorizer.get_feature_names_out())
df_tfidf.head()
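To make the TF-IDF weights less of a black box, here is a minimal sketch that computes them by hand with the standard library. It assumes scikit-learn's default settings for TfidfVectorizer (smoothed idf, L2-normalised rows, raw term counts). The function name tfidf and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Smoothed TF-IDF with L2-normalised rows; docs is a list of token lists."""
    n_docs = len(docs)
    vocab = sorted({t for doc in docs for t in doc})
    # document frequency: in how many documents each term appears
    df = Counter(t for doc in docs for t in set(doc))
    # smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, idf, rows

# a single document: every idf is 1, so the row is just the normalised counts
vocab1, idf1, rows1 = tfidf([['room', 'clean', 'room']])

# two documents: 'clean' appears in both and gets a lower idf than 'room'
vocab2, idf2, rows2 = tfidf([['room', 'clean', 'room'], ['staff', 'clean']])

print(idf1)
print(idf2)
```

With only one document, as in the cell above, every term gets the same idf, so the TF-IDF row is simply the word counts rescaled to unit length. The inverse document frequency only starts to separate terms once there are several documents.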

# Word Embedding using Word2Vec
Word embedding is a language modeling technique that maps words to vectors of real numbers. It represents words or phrases as points in a vector space with many dimensions.

We will go through a sample program to generate word vectors using Word2Vec.

Let us first process the entire text file instead of a single row. You can try to write the code for this

<details>
<summary>
    Click here to see code
</summary>

```

filename = 'datasets/hotel2.txt'
fp = open(filename, 'r', encoding='UTF8')
text = fp.read()
print(text)
fp.close()

data = [] 
# iterate through each sentence in the file 
for i in sent_tokenize(text): 
    temp = [] 
      
    # tokenize the sentence into words 
    for j in i.split():
        temp.append(re.sub('[-(),.]', '', j).lower()) 
  
    data.append(temp) 

#remove stop words
from nltk.corpus import stopwords
sw = stopwords.words('english')
sw.append('')  # include blank so that it will be removed
sw.append('the')

clean_data = []
for sentence in data:
    # build a new list; removing from the list being looped over would skip words
    temp = [word for word in sentence if word not in sw]
    clean_data.append(temp)

print(clean_data)

```

</details>

In [None]:
#insert code here


## Train Word2Vec Model
Instantiate Word2Vec and pass in the reviews that we processed in the previous step. Each list within the main list contains the tokens from one sentence of the reviews. Word2Vec uses all these tokens to internally create a vocabulary.

The result is a learned vector for each word, also known as an embedding. You can think of an embedding as a set of features that describe the target word. For example, the word king may be described by gender, age, the type of people the king associates with, etc.

In [None]:
import gensim 
from gensim.models import Word2Vec 

# build vocabulary and train model
# Create CBOW model 
# vector_size: size of dense vector to represent each token or word 
# window: maximum distance between target word and its neighboring word
# min_count: minimum frequency count of words
# epochs: number of iterations (epochs) over the data
model = gensim.models.Word2Vec(clean_data, min_count = 3,  
                              vector_size = 100, window = 5) # pass sg = 1 for skip-gram

model.save('mymodel')
new_model = gensim.models.Word2Vec.load('mymodel')

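Call the **most_similar** function with a word to get the 10 most similar words in the vocabulary. The word must occur at least min_count times in the data, otherwise it is not in the vocabulary and a KeyError is raised.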


In [None]:
w1 = "service"
model.wv.most_similar (w1)

You can use Word2Vec to compute similarity between two words in the vocabulary by using the **similarity** function

In [None]:
#excellent is expected to score as similar to best (the exact value depends on the training data)
model.wv.similarity(w1="best", w2="excellent")

In [None]:
#worst is expected to score as less similar to best (the exact value depends on the training data)
model.wv.similarity(w1="best", w2="worst")

We can also view the 100-dimensional vector created for a word

In [None]:
model.wv.get_vector('excellent')
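The similarity scores above are cosine similarities between word vectors. As a minimal sketch of that measure, the helper below computes it by hand for short made-up vectors. The function name cosine_similarity and the vectors are assumptions for illustration and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# vectors pointing the same way, at right angles, and in opposite directions
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(same_direction, orthogonal, opposite)
```

The score depends only on the direction of the vectors, not their length, and ranges from -1 (opposite) through 0 (unrelated) to 1 (same direction).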

# Exercise
Try to perform BOW or TF-IDF on the entire dataset. The code to convert the dataset to a dataframe has been written for you in the cell below. Try to limit the vocabulary to the 1000 most frequent terms by using max_features

<details>
<summary>
    Click here to see code
</summary>

```

vectorizer=CountVectorizer(max_features=1000)
doc_vec = vectorizer.fit_transform(clean_data_2.iloc[:,0])
print(vectorizer.get_feature_names_out())
print(doc_vec)


df_bow = pd.DataFrame(doc_vec.toarray(),columns=vectorizer.get_feature_names_out())
df_bow.head()

```

</details>

In [None]:
#one row per sentence, with the cleaned tokens joined back into a single string
clean_data_2 = pd.DataFrame({'data': [" ".join(tokens) for tokens in clean_data]})
print(clean_data_2)

In [None]:
#insert code here


In [None]:
#insert code here


# Additional Resources 1 - Synonyms

Sample code to find synonyms in WordNet 

In [None]:
#synonyms definition
#Synset is a special kind of a simple interface that is present in NLTK to look up words in WordNet.
#The code below gives the definition for NLP
from nltk.corpus import wordnet
syn = wordnet.synsets("NLP")
print(syn[0].definition())

#synonymous words
#The code below gives the synonyms for computer
synsets = wordnet.synsets("Computer")
print(synsets)
synonyms = []
#collect the lemma names from every synset of the word
for synset in synsets:
    for lemma in synset.lemmas():
        synonyms.append(lemma.name())
print(synonyms)