<a href="https://colab.research.google.com/github/kSahatova/ITMO_MLTech/blob/main/Lab5/MLTech_Lab5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Task Assignment

1.	Download Alice in Wonderland by Lewis Carroll from Project Gutenberg's website: http://www.gutenberg.org/files/11/11-0.txt
2.	Perform any necessary preprocessing on the text, including converting to lower case, removing stop words, removing numbers and other non-alphabetic characters, and lemmatization.
3.	Find the top 10 most important words (for example, by TF-IDF) in each chapter of the text, excluding "Alice". How would you name each chapter according to the identified tokens?
4.	Find the top 10 most used verbs in sentences that mention Alice. What does Alice do most often?



In [54]:
import requests
import re

import numpy as np
import pandas as pd

import nltk

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

from nltk.tokenize import TreebankWordTokenizer, WhitespaceTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tag import pos_tag

from sklearn.feature_extraction.text import TfidfVectorizer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [2]:
def clean_text(text):
    # replace control characters, the stray 'â' left by mis-decoded quotes, and underscores with spaces
    text = re.sub(r'[\x00-\x1f\x7f-\x9fâ_]', ' ', text)
    # remove numbers
    text = re.sub(r'[0-9]', ' ', text)
    # convert text to lower case
    text = text.lower()
    text = text.split(" ")
    # remove stop words
    text = [word for word in text if word not in stop_words and word != 'alice']
    # lemmatization
    text = [lemmatizer.lemmatize(token) for token in text]
    # remove other non-alphabetic characters
    text = [token for token in text if token.isalpha()] 

    text = " ".join(text)
    return text
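
One thing to be aware of in `clean_text`: because the text is split on single spaces, a token such as `said,` keeps its punctuation and is later discarded by the `isalpha()` filter. A minimal sketch of an alternative tokenization is given below. It assumes a tiny hypothetical stop-word set (`DEMO_STOP_WORDS`) in place of NLTK's list and skips lemmatization, so it is an illustration rather than a drop-in replacement.

```python
import re

# Hypothetical, tiny stop-word set used only for this illustration;
# the notebook itself uses NLTK's English stop-word list.
DEMO_STOP_WORDS = {"the", "a", "and", "to", "of"}


def simple_tokens(text, stop_words=DEMO_STOP_WORDS, excluded=("alice",)):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words and t not in excluded]


sample = "Alice said, 'Oh dear!' and ran to the rabbit-hole."
print(simple_tokens(sample))
```

With this approach `said,` contributes `said` instead of being dropped. The trade-off is that contractions such as `don't` are split into `don` and `t`, so a real pipeline would need an extra rule for apostrophes.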

In [55]:
url = 'http://www.gutenberg.org/files/11/11-0.txt'

In [56]:
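text = requests.get(url)

# keep the text after the second occurrence of the first chapter's title (index 2)
# and before "THE END", i.e. the body of the book without header and footer
text = re.split(r'Down the Rabbit-Hole', text.text)[2]
text = re.split(r'THE END', text)[0]

# split the body into chapters on the "CHAPTER" headings
text = re.split(r'CHAPTER', text)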

In [57]:
text_wo_alice = []
stop_words = stopwords.words("english")
lemmatizer = WordNetLemmatizer()


for chapter in text:
  chapter = clean_text(chapter)
  text_wo_alice.append(chapter)

In [58]:
vectorizer=TfidfVectorizer(use_idf=True)
tfIdf = vectorizer.fit_transform(text_wo_alice)
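
To make the weighting less of a black box, here is a pure-Python sketch of what the vectorizer computes under its default settings as I understand them (smoothed idf `ln((1 + n) / (1 + df)) + 1`, raw term counts, L2 normalization per document). The helper `tfidf_rows` and the toy "chapters" are made up for illustration; tokenization here is a plain whitespace split, which is simpler than the vectorizer's own token pattern.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {word: weight} dict per document (smoothed idf, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each word occurs
    df = Counter(word for tokens in tokenized for word in set(tokens))
    idf = {w: math.log((1 + n_docs) / (1 + d)) + 1 for w, d in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {w: count * idf[w] for w, count in tf.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        rows.append({w: v / norm for w, v in raw.items()})
    return rows


# Toy "chapters" (invented for illustration, not taken from the book)
toy_chapters = ["said rabbit said hole", "said queen said croquet", "said turtle soup"]
for row in tfidf_rows(toy_chapters):
    print(sorted(row.items(), key=lambda kv: kv[1], reverse=True))
```

Note that a word occurring in every document still gets an idf of 1 rather than 0, so a very frequent word can outrank rarer ones. In the toy data, "said" stays on top wherever it occurs twice.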

In [59]:
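# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
words = vectorizer.get_feature_names_out()
for i in range(len(text_wo_alice)):
    # densify the chapter's row once instead of once per word
    row = tfIdf[i].toarray().ravel()
    result_list = [[words[k], float(row[k])] for k in range(len(words))]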
        
    print('The most 10 important words in chapter %d:'%(i+1))
    print(', '.join(list(map(lambda el: el[0], 
                         sorted(result_list, key = lambda l: l[1], reverse = True)[:10]))))

The most 10 important words in chapter 1:
little, eat, like, either, think, bottle, marked, one, rabbit, nothing
The most 10 important words in chapter 2:
mouse, little, pool, swam, cried, said, oh, like, must, cat
The most 10 important words in chapter 3:
said, dodo, mouse, dry, bird, old, soon, one, know, crowded
The most 10 important words in chapter 4:
little, puppy, rabbit, one, said, room, bill, fan, heard, grow
The most 10 important words in chapter 5:
said, caterpillar, pigeon, hookah, little, tried, one, green, father, side
The most 10 important words in chapter 6:
said, footman, baby, cat, duchess, like, little, cook, mad, know
The most 10 important words in chapter 7:
said, dormouse, march, hatter, hare, clock, thing, know, time, tea
The most 10 important words in chapter 8:
said, queen, soldier, three, hedgehog, king, cat, executioner, gardener, head
The most 10 important words in chapter 9:
mock, said, turtle, moral, gryphon, duchess, went, queen, never, old
The most 10 im
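
The ranking step above sorts the full vocabulary for every chapter. As a hedged alternative, the top entries can be picked with `heapq.nlargest`, which is documented as equivalent to sorting in descending order and slicing. The helper name `top_k_words` and the toy weights are assumptions for illustration only.

```python
import heapq


def top_k_words(words, weights, k=10):
    """Return the k words with the largest weights, highest first."""
    best = heapq.nlargest(k, zip(weights, words), key=lambda pair: pair[0])
    return [word for _, word in best]


# Toy values, not real TF-IDF scores from the book
demo_words = ["cat", "queen", "said", "tea"]
demo_weights = [0.10, 0.45, 0.80, 0.30]
print(top_k_words(demo_words, demo_weights, k=2))
```

In the notebook this could be called with the vocabulary and one dense row of the TF-IDF matrix per chapter.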

In [60]:
vectorizer=TfidfVectorizer(use_idf=False)
tfIdf = vectorizer.fit_transform(text_wo_alice)

In [61]:
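# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
words = vectorizer.get_feature_names_out()
for i in range(len(text_wo_alice)):
    # densify the chapter's row once instead of once per word
    row = tfIdf[i].toarray().ravel()
    result_list = [[words[k], float(row[k])] for k in range(len(words))]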
        
    print('The most 10 important words in chapter %d:'%(i+1))
    print(', '.join(list(map(lambda el: el[0], 
                         sorted(result_list, key = lambda l: l[1], reverse = True)[:10]))))

The most 10 important words in chapter 1:
little, like, think, one, thought, eat, found, get, nothing, said
The most 10 important words in chapter 2:
little, mouse, said, like, must, cried, time, went, could, go
The most 10 important words in chapter 3:
said, mouse, dodo, one, dry, know, long, soon, would, bird
The most 10 important words in chapter 4:
little, said, one, rabbit, get, heard, came, quite, thought, go
The most 10 important words in chapter 5:
said, caterpillar, little, one, got, pigeon, tried, felt, good, like
The most 10 important words in chapter 6:
said, like, little, cat, could, much, went, would, get, know
The most 10 important words in chapter 7:
said, march, dormouse, hatter, hare, thing, time, know, went, little
The most 10 important words in chapter 8:
said, queen, three, head, one, went, came, like, king, looked
The most 10 important words in chapter 9:
said, mock, turtle, went, never, gryphon, duchess, little, make, queen
The most 10 important words in chapter 

In [62]:
url = 'http://www.gutenberg.org/files/11/11-0.txt'
text = requests.get(url)

res = re.split(r'Down the Rabbit-Hole', text.text)[2]
res = re.split(r'THE END', res)[0]

# replace with a space (not an empty string) so words on adjacent lines are not glued together
res = re.sub(r'[\x00-\x1f\x7f-\x9fâ_]', ' ', res)

In [63]:
split_regex = re.compile(r'[.!?…]')  # split sentences on '.', '!', '?' and '…'
# strip each piece and drop the empty ones
for_Alice = [s.strip() for s in split_regex.split(res) if s.strip()]

In [64]:
# keep only the sentences that mention Alice
sentences_w_alice = [sentence for sentence in for_Alice if 'Alice' in sentence]
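
The two steps above (splitting on end-of-sentence marks, then keeping sentences that mention Alice) can be sketched as one small function. The names `SENTENCE_END`, `ALICE` and `sentences_mentioning_alice` are hypothetical, and the demo string is invented. Splitting on punctuation alone is a rough heuristic: abbreviations such as "Mr." would also end a sentence.

```python
import re

SENTENCE_END = re.compile(r"[.!?…]+")
ALICE = re.compile(r"\bAlice\b")


def sentences_mentioning_alice(text):
    """Split text on sentence-ending punctuation and keep sentences naming Alice."""
    parts = (part.strip() for part in SENTENCE_END.split(text))
    return [part for part in parts if part and ALICE.search(part)]


demo = "Alice was tired. The Rabbit ran off! Where had Alice's sister gone? Nobody knew…"
print(sentences_mentioning_alice(demo))
```

The word-boundary pattern matches both "Alice" and "Alice's" while ignoring longer words that merely contain the same letters.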


In [65]:
final_text = []
for sentence in sentences_w_alice:    
  sentence = clean_text(sentence)
  final_text.append(sentence)
  
final_text = ' '.join(final_text)

In [66]:
tokens = WhitespaceTokenizer().tokenize(final_text) 

In [67]:
unique_tokens = []
for token in tokens:
    if token not in unique_tokens:
        unique_tokens.append(token)

In [68]:
number_of_unique_tokens = []
for token in unique_tokens:
    number_of_unique_tokens.append(tokens.count(token))

In [69]:
unique_tokens_ = []
for i in range(len(unique_tokens)):
    unique_tokens_.append([unique_tokens[i], number_of_unique_tokens[i]])


In [70]:
sorted_unique_tokens = sorted(unique_tokens_, key = lambda i: i[1], reverse = True)

In [71]:
for_pos_tag = []
for i in sorted_unique_tokens:
    for_pos_tag.append(i[0])
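
The counting and sorting cells above can also be expressed with `collections.Counter`. This is a sketch, not the notebook's own code; `tokens_by_frequency` is a hypothetical helper, and the demo tokens are invented. Since `most_common()` keeps first-seen order among equal counts, it should give the same ordering as the stable sort used above.

```python
from collections import Counter


def tokens_by_frequency(tokens):
    """Return the distinct tokens, most frequent first."""
    return [token for token, _ in Counter(tokens).most_common()]


demo_tokens = "said went said thought went said looked".split()
print(tokens_by_frequency(demo_tokens))
```

A single pass over the tokens replaces the repeated `tokens.count(...)` calls, which matters once the token list gets long.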

In [72]:
tagged_sorted_unique_tokens = nltk.pos_tag(for_pos_tag)

In [73]:
final_verb = []
for i in range(len(for_pos_tag)):
    if tagged_sorted_unique_tokens[i][1] in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']:
        final_verb.append(for_pos_tag[i])
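
Since every Penn Treebank verb tag used above begins with `VB`, the filter can be written as a prefix check. The sketch below uses hand-written `(word, tag)` pairs standing in for the tagger's output, so the tags are illustrative and `keep_verbs` is a hypothetical helper.

```python
# Hand-written (word, tag) pairs standing in for nltk.pos_tag output;
# the tags here are illustrative, not produced by the tagger.
demo_tagged = [("said", "VBD"), ("little", "JJ"), ("thought", "VBD"),
               ("rabbit", "NN"), ("went", "VBD"), ("go", "VB")]


def keep_verbs(tagged, limit=10):
    """Keep words whose Penn Treebank tag starts with 'VB', preserving order."""
    return [word for word, tag in tagged if tag.startswith("VB")][:limit]


print(keep_verbs(demo_tagged, limit=3))
```

Because the input is already sorted by frequency, the first `limit` verbs are the most frequent ones.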

In [74]:
print('The most important verbs in sentences with Alice are the following:')
print(', '.join(final_verb[:10]))

The most important verbs in sentences with Alice are the following:
said, thought, went, looked, got, say, began, think, see, go
