
Nemat_Allah_Aloush_J41332c_MLT_2022_Task_4
* Name: Nemat Allah Aloush
* ISU group: J41332c
* ISU number: 336092

### Importing packages

In [None]:
import nltk
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
# Download the NLTK resources (corpora, tokenizers, taggers) used in this notebook
nltk.download("book")
nltk.download('punkt')

### Part 1: Reading data file

In [None]:
# Mounting to google drive
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
# Reading the file (the "Alice in Wonderland" story)
filepath = nltk.data.find('/content/drive/My Drive/Machine Learning Techniques 2022/Alice_in_wonder_land.txt')
# Use a context manager so the file is closed after reading
with open(filepath, 'r') as f:
    textfile = f.read()

In [None]:
# The file contains one paragraph before the story and another one after the end of the story.
# First, split the content at "THE END " to drop the paragraph that follows the story.
splitted = textfile.split("THE END ", 1) 
# Then split the remaining text at the first "[Illustration]" to drop the paragraph that precedes the story.
story = splitted[0].split("[Illustration]",1)
story = story[1] # 'story' now holds only the story itself; it is the text analyzed in the following steps.
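
The trimming step above can be sketched on a toy string. This is only an illustration: `extract_body` is a hypothetical helper and the sample text is made up. Unlike the cell above, it looks for the start marker first and then for the end marker after it, and it raises an error when a marker is missing instead of failing later with an `IndexError`.

```python
def extract_body(text, start_marker, end_marker):
    """Return the text between the first start_marker and the first end_marker after it."""
    # str.partition returns (before, separator, after); an empty separator
    # means the marker was not found.
    _, found_start, rest = text.partition(start_marker)
    if not found_start:
        raise ValueError(f"start marker {start_marker!r} not found")
    body, found_end, _ = rest.partition(end_marker)
    if not found_end:
        raise ValueError(f"end marker {end_marker!r} not found")
    return body


# Made-up sample text standing in for the real file content
sample = "Preface text [Illustration] Once upon a time. THE END Licence text"
body = extract_body(sample, "[Illustration]", "THE END")
print(repr(body))
```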

### Part 2: Cleaning Text Data

In [None]:
# split into words
tokens = word_tokenize(story)
# convert to lower case
tokens = [w.lower() for w in tokens]
# remove punctuation from each word
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]
# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]
# filter out stop words
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]
# lemmatization
wn = nltk.WordNetLemmatizer()
lemmas = [wn.lemmatize(word) for word in words]
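
To see what the punctuation and `isalpha()` filters do, here is a small sketch on a made-up, already-tokenized list (the tokens are an assumption, not taken from the story). It uses only the standard library, so it does not reproduce the NLTK tokenizer or the lemmatizer.

```python
import string

# Translation table that deletes every ASCII punctuation character
table = str.maketrans('', '', string.punctuation)

# Hypothetical tokens, made up for illustration
tokens = ["alice", "'s", "rabbit-hole", ",", "1865", "down"]

stripped = [t.translate(table) for t in tokens]
# Tokens that were pure punctuation are now empty strings; isalpha()
# rejects those as well as tokens that contain digits.
words = [t for t in stripped if t.isalpha()]

print(stripped)
print(words)
```

Two side effects are worth keeping in mind: a hyphenated token is merged into a single word, and a clitic such as `'s` survives as the one-letter token `s` unless the stop-word list removes it.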

### Part 3: Top 10 most important (in terms of TF-IDF metric) words from each chapter in the text and naming each chapter according to the identified tokens

##### Splitting the lemmas array to individual chapters

In [None]:
# find the indices of the occurrences of the word 'chapter' to locate the start and end of each chapter
occurences =[i for i,x in enumerate(lemmas) if x=='chapter']
occurences.append(len(lemmas))  # the end of the 'lemmas' list is the end of the last chapter

In [None]:
indices=[]
# The first 12 occurrences of the word 'chapter' belong to the table of contents, so we skip them
chapter_points=occurences[12:] 
# build a list that holds, for each chapter, the indices of its start and end
for i in range (len(chapter_points)-1):
  indices.append((chapter_points[i],chapter_points[i+1]))
indices

[(59, 1011),
 (1011, 1944),
 (1944, 2736),
 (2736, 3876),
 (3876, 4837),
 (4837, 6005),
 (6005, 7057),
 (7057, 8183),
 (8183, 9238),
 (9238, 10185),
 (10185, 11038),
 (11038, 11976)]

In [None]:
# 'chapters' variable contains lists of lemmas for each chapter individually 
chapters = [lemmas [s:e] for s,e in indices]
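
The boundary logic above can be checked on a toy token list. This is a minimal sketch with made-up tokens; it pairs each boundary with the next one using `zip` instead of an index loop, which is an equivalent formulation rather than the code used in this notebook.

```python
# Hypothetical token stream; 'chapter' marks the start of each chapter
tokens = ["chapter", "rabbit", "hole", "chapter", "pool", "tear", "chapter", "race"]

starts = [i for i, tok in enumerate(tokens) if tok == "chapter"]
# The end of the last chapter is the end of the token list
bounds = starts + [len(tokens)]

# zip pairs each boundary with the following one: (start, end) per chapter
spans = list(zip(bounds, bounds[1:]))
chapters = [tokens[s:e] for s, e in spans]

print(spans)
print(chapters)
```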

In [None]:
# delete the word 'alice' from each chapter (so that 'alice' does not show up among the most important words)
chapters_without_alice = [[word for word in chapter if word != 'alice'] for chapter in chapters]

In [None]:
# for each chapter, join the tokens together 
chapters_paragraphs=[' '.join(i) for i in chapters_without_alice]

##### Calculating TF-IDF and finding top 10 words for each document

In [None]:
# Defining the model
tfidf = TfidfVectorizer() # Convert a collection of raw documents to a matrix of TF-IDF features.
X_tfidf = tfidf.fit_transform(chapters_paragraphs).toarray() # Learn vocabulary and idf, return document-term matrix
vocab = tfidf.vocabulary_  # A mapping of terms to feature indices.
reverse_vocab = {v:k for k,v in vocab.items()}
feature_names = tfidf.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
df_tfidf = pd.DataFrame(X_tfidf, columns = feature_names)

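idx = X_tfidf.argsort(axis=1)   # Column indices of each row, sorted by ascending TF-IDF score
tfidf_max10 = idx[:,-10:]       # Last 10 columns = top 10 terms (the highest score comes last)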
df_tfidf['top10'] = [[reverse_vocab.get(item) for item in row] for row in tfidf_max10 ]
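
To make the metric itself concrete, here is a pure-Python sketch of TF-IDF on three made-up documents. It assumes the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that `TfidfVectorizer` applies with its default settings as I understand them. The helper names `tfidf_vector` and `top_terms` are hypothetical.

```python
import math
from collections import Counter

# Made-up documents standing in for the chapters
docs = [
    "rabbit hole rabbit door",
    "queen king queen court",
    "rabbit queen tea",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term
df = Counter(term for doc in tokenized for term in set(doc))


def tfidf_vector(doc):
    """Return {term: weight} for one document, L2-normalized."""
    tf = Counter(doc)
    # Raw term count times the smoothed inverse document frequency
    weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


def top_terms(doc, k):
    """Return the k terms with the highest weight, best first."""
    vec = tfidf_vector(doc)
    return sorted(vec, key=vec.get, reverse=True)[:k]


for i, doc in enumerate(tokenized, start=1):
    print("Document", i, top_terms(doc, 2))
```

In the third document, `tea` outranks `rabbit` and `queen` even though all three occur once, because `tea` appears in no other document. The same effect pushes chapter-specific words such as character names to the top of each chapter's list.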

In [None]:
# Printing the top 10 words for each chapter (listed from lowest to highest TF-IDF score, because argsort sorts in ascending order)
for (i, item) in enumerate(df_tfidf['top10'] , start=1):
    print('Chapter', i, item)

Chapter 1 ['bottle', 'see', 'either', 'way', 'like', 'eat', 'key', 'door', 'bat', 'little']
Chapter 2 ['go', 'mabel', 'foot', 'said', 'dear', 'cat', 'swam', 'little', 'pool', 'mouse']
Chapter 3 ['dinah', 'bird', 'know', 'thimble', 'dry', 'lory', 'prize', 'dodo', 'said', 'mouse']
Chapter 4 ['one', 'chimney', 'glove', 'fan', 'bottle', 'puppy', 'rabbit', 'window', 'little', 'bill']
Chapter 5 ['well', 'little', 'father', 'size', 'youth', 'egg', 'pigeon', 'serpent', 'said', 'caterpillar']
Chapter 6 ['cook', 'like', 'wow', 'duchess', 'pig', 'mad', 'baby', 'footman', 'cat', 'said']
Chapter 7 ['clock', 'draw', 'tea', 'time', 'twinkle', 'hare', 'march', 'said', 'dormouse', 'hatter']
Chapter 8 ['executioner', 'procession', 'five', 'cat', 'soldier', 'gardener', 'king', 'hedgehog', 'said', 'queen']
Chapter 9 ['say', 'never', 'went', 'queen', 'moral', 'duchess', 'gryphon', 'mock', 'said', 'turtle']
Chapter 10 ['whiting', 'beautiful', 'join', 'soup', 'dance', 'lobster', 'said', 'gryphon', 'mock', 't

Some verbs are not reduced to their lemmas because the `WordNetLemmatizer` from the nltk library treats every word as a noun unless a POS tag is given. The correct way to call it is therefore to pass the POS tag of the word together with the word itself, as done later in Part 4.
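
One way to supply that POS tag is to map the Penn Treebank tags produced by `nltk.pos_tag` to the single-letter WordNet categories (`'n'`, `'v'`, `'a'`, `'r'`). The sketch below assumes exactly that mapping. `penn_to_wordnet` is a hypothetical helper and the tagged pairs are made up.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter."""
    if tag.startswith("VB"):
        return "v"  # verb (VB, VBD, VBG, VBN, VBP, VBZ)
    if tag.startswith("JJ"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # everything else is treated as a noun, the lemmatizer's default


# Hypothetical (word, tag) pairs in the shape returned by nltk.pos_tag
tagged = [("went", "VBD"), ("little", "JJ"), ("rabbits", "NNS"), ("quickly", "RB")]
with_wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(with_wordnet_pos)
```

Each resulting pair could then be passed to the lemmatizer as `wn.lemmatize(word, pos)`.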

**Based on these results**, here are suggested titles for each chapter:

1. The door key.
2. Swimming in the pool.
3. Who gets the prize?
4. A little bill.
5. What does the Caterpillar say?
6. Time to cook.
7. Tea time.
8. Who is under sentence of execution?
9. What would the Mock Turtle say?
10. A beautiful dance.
11. In the court!
12. The King and the jury's final decision.




### Part 4: Top 10 most used verbs in sentences with Alice

##### Finding the verbs in sentences with Alice

To find the most common verbs, we need the POS tag of each token in each sentence. Hence, instead of splitting the story directly into tokens as in the previous parts, here we first split the story into sentences and then find the POS tag of each token in each sentence.

In [None]:
# splitting the story into sentences
sentences = nltk.sent_tokenize(story)

In [None]:
# Keep only the sentences that contain the word 'alice' (case-insensitive); the verbs are extracted from these sentences.
sentences_with_alice = [sen for sen in sentences if 'alice' in sen.lower()]
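
The filter above is a substring test, so it would also accept a sentence that contains a longer word with "alice" inside it. If that matters, a word-boundary regular expression is a stricter alternative. This is a sketch on made-up sentences, not a change to the results reported here.

```python
import re

# \b anchors the match at word boundaries; IGNORECASE replaces the .lower() call
alice_pattern = re.compile(r"\balice\b", re.IGNORECASE)

# Hypothetical sentences, made up for illustration
sentences = [
    "Alice was beginning to get very tired.",
    "There was no malice in the remark.",
    "\"Who cares for you?\" said Alice.",
    "The Rabbit hurried on.",
]

with_alice = [s for s in sentences if alice_pattern.search(s)]
substring_matches = [s for s in sentences if "alice" in s.lower()]

print(len(with_alice), len(substring_matches))
```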

In [None]:
# Tagging the sentences
sentences_tagged=[] # one list of (word, tag) pairs per sentence
for sentence in sentences_with_alice: 
  # Tokenizing each sentence individually
  wordsList = nltk.word_tokenize(sentence) 
  # Removing punctuation from each word
  stripped_2 = [w.translate(table) for w in wordsList]
  # Removing stop words and non-alphabetic tokens.
  # Note: the tokens are not lowercased here, so the comparison with the lowercase
  # stop-word list is case-sensitive and capitalized stop words (e.g. 'The') are kept.
  wordsList = [w for w in stripped_2 if w not in stop_words and w.isalpha()]
  # Tagging each remaining token with its part of speech (POS)
  tagged = nltk.pos_tag(wordsList) 
  sentences_tagged.append(tagged) 

In [None]:
# Finding the verbs in the tagged sentences
# All the tags that are related to verb: VB, VBD, VBG, VBN, VBP, VBZ
verb_tags=['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']
verbs =[]
# keeping only the verbs 
for sen in sentences_tagged:
  verbs.append([verb for verb,tag in sen if tag in verb_tags])
all_verbs=[j for i in verbs for j in i] # This variable contains all the verbs in one list
#Finding the lemma for each verb in order to find the most frequent verb
verbs_lemmas = [wn.lemmatize(word,'v') for word in all_verbs] 

##### Finding the 10 most frequent verbs

In [None]:
# Creating a dataframe out of the list of verbs_lemmas
df_all_words = pd.DataFrame(verbs_lemmas, columns=['verb'])
# Grouping the verbs and finding the frequency of each verb
df_all_words['counts'] = df_all_words.groupby(['verb'])['verb'].transform('count')
# Sorting the verbs by their frequencies
df_all_words = df_all_words.sort_values(by=['counts', 'verb'], ascending=[False, True]).reset_index()
# Finding the 10 most frequent verbs (one row per verb)
df_words = df_all_words.groupby('verb').first().sort_values(by='counts', ascending=False).reset_index()
print("Most 10 frequent verbs:")
print(df_words.head(10))

Most 10 frequent verbs:
    verb  index  counts
0    say     14     295
1     go     27      91
2  think     13      64
3    get      1      61
4   look     20      49
5   know     92      46
6  begin      0      42
7    see     23      38
8   come     76      33
9   make      7      33
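
The same frequency ranking can be obtained without pandas. As a minimal sketch, `collections.Counter` counts the items of a list and `most_common(k)` returns the `k` most frequent ones. The sample list below is made up and is not the real `verbs_lemmas`.

```python
from collections import Counter

# Hypothetical lemmatized verbs, made up for illustration
sample_lemmas = ["say", "go", "say", "think", "go", "say"]

counts = Counter(sample_lemmas)
# most_common(k) returns (item, count) pairs sorted by descending count
top = counts.most_common(2)
print(top)
```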


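**As a result**, the most frequent verb in sentences that mention Alice is by far "say" (295 occurrences). It is followed by "go", "think", and "get" (91, 64, and 61 occurrences). Less frequent are "look", "know", "begin", "see", "come", and "make" (between 49 and 33 occurrences each).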