# Assignment 4

## Task 1

**Description**

Download Alice in Wonderland by Lewis Carroll.

In [73]:
import os.path

filename = f"../data/Alice-in-Wonderland.txt"
alice_is_downloaded = os.path.isfile(filename)
if not alice_is_downloaded:
    from urllib import request

    url = "http://www.gutenberg.org/files/11/11-0.txt"
    request.urlretrieve(url, filename)

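# Read with an explicit encoding so the result does not depend on the platform default.
with open(filename, "r", encoding="utf-8") as f: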
    alice_corpus = ' '.join(f.readlines())

print(f"Alice corpus has {len(alice_corpus)} chars")

Alice corpus has 168183 chars
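
One detail of the loading step is worth a note: `' '.join(f.readlines())` keeps each line's trailing newline and adds a space after it, so the reported character count includes one extra space per line break. A minimal sketch on an in-memory file (the sample text is made up for illustration) shows the difference from `f.read()`:

```python
import io

# Made-up two-line sample standing in for the downloaded file.
sample = "Down the\nRabbit-Hole\n"

# readlines() keeps the trailing "\n" of every line, so joining with " "
# inserts an extra space at the start of every line but the first.
joined = " ".join(io.StringIO(sample).readlines())

# read() returns the text unchanged.
plain = io.StringIO(sample).read()

print(repr(joined))
print(repr(plain))
print(len(joined) - len(plain))
```

This does not affect the later steps much, since whitespace tokens are filtered out during preprocessing.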


## Task 2

**Description**

Perform any necessary preprocessing on the text, including converting to lower case, removing stop words, numbers and other non-alphabetic characters, and so on.

In [40]:
import spacy

spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm")

**Process chapters into individual documents**

In [142]:
chapter_content = dict()

chapter_names = {
    "I": "Down the Rabbit-Hole",
    "II": "The Pool of Tears",
    "III": "A Caucus-Race and a Long Tale",
    "IV": "The Rabbit Sends in a Little Bill",
    "V": "Advice from a Caterpillar",
    "VI": "Pig and Pepper",
    "VII": "A Mad Tea-Party",
    "VIII": "The Queen’s Croquet-Ground",
    "IX": "The Mock Turtle’s Story",
    "X": "The Lobster Quadrille",
    "XI": "Who Stole the Tarts?",
    "XII": "Alice’s Evidence"
}
chapter_numbers = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X", "XI", "XII"]

chapter_beginnings = {
    "I": "Alice was beginning to get very tired of sitting by her sister on the",
    "II": "“Curiouser and curiouser!” cried Alice (she was so much surprised, that",
    "III": "They were indeed a queer-looking party that assembled on the bank—the",
    "IV": "It was the White Rabbit, trotting slowly back again, and looking",
    "V": "The Caterpillar and Alice looked at each other for some time in",
    "VI": "For a minute or two she stood looking at the house, and wondering what",
    "VII": "There was a table set out under a tree in front of the house, and the",
    "VIII": "A large rose-tree stood near the entrance of the garden: the roses",
    "IX": "“You can’t think how glad I am to see you again, you dear old thing!”",
    "X": "The Mock Turtle sighed deeply, and drew the back of one flapper across",
    "XI": "The King and Queen of Hearts were seated on their throne when they",
    "XII": "“Here!” cried Alice, quite forgetting in the flurry of the moment how"
}
# chapter_beginnings = {k: v.lower() for k, v in chapter_beginnings.items()}

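def find_start_of_next_chapter(corpus, current_pos=0, last_chapter=None):
    """Return the index where the chapter after `last_chapter` starts (-1 if not found)."""
    if last_chapter is None:
        # No chapter seen yet: look for the opening line of chapter I.
        next_chapter = chapter_numbers[0]
    else:
        next_chapter = chapter_numbers[chapter_numbers.index(last_chapter) + 1]
    return corpus.find(chapter_beginnings[next_chapter], current_pos)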

def preprocess_chapter(chapter):
    return chapter.lower().replace("\n", " ")

prev_chapter_start = None
for i, number in enumerate(chapter_numbers):
    next_chapter_start = alice_corpus.find(chapter_beginnings[number])
    if prev_chapter_start is not None:
        chapter_content[chapter_numbers[i - 1]] = \
                preprocess_chapter(alice_corpus[prev_chapter_start:next_chapter_start])
    prev_chapter_start = next_chapter_start

chapter_content[chapter_numbers[-1]] = preprocess_chapter(alice_corpus[prev_chapter_start:alice_corpus.find("THE END")])

chapter_doc = { k:nlp(v) for k, v in chapter_content.items() }
alice_doc = nlp('\n'.join(chapter_content.values()))
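
The splitting idea used in this cell (locate each chapter's opening line, then slice between consecutive start positions) can be sketched on a toy string. This is only an illustration under assumptions: `split_into_chapters` and the toy corpus are hypothetical names and data, not part of the notebook.

```python
def split_into_chapters(corpus, openings):
    """Split `corpus` into chapters given each chapter's opening line, in order.

    `openings` maps a chapter label to the first line of that chapter.
    Returns a dict mapping each label to the chapter text.
    """
    labels = list(openings)
    starts = []
    pos = 0
    for label in labels:
        # Search from the previous chapter's start so chapters stay in order.
        pos = corpus.find(openings[label], pos)
        if pos == -1:
            raise ValueError(f"opening line of chapter {label} not found")
        starts.append(pos)

    # Each chapter ends where the next one starts; the last one runs to the end.
    ends = starts[1:] + [len(corpus)]
    return {label: corpus[s:e] for label, s, e in zip(labels, starts, ends)}


# Toy corpus, made up for illustration.
toy = "Intro. First line here. More text. Second line here. The rest."
chapters = split_into_chapters(toy, {"I": "First line", "II": "Second line"})
print(chapters)
```

Raising an error when an opening line is missing avoids silently slicing with `-1`, which is what `str.find` returns on failure.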

**Remove stop words, punctuation, non-words**

In [93]:
# Each filter returns True when the token should be dropped.
def filter_stopwords(token):
    return token.is_stop

def filter_non_alphabetic(token):
    return not token.is_alpha

def filter_punct(token):
    return token.is_punct

def filter_space(token):
    return token.is_space

filters = [
    filter_stopwords,
    filter_non_alphabetic,
    filter_punct,
    filter_space
]

def apply_filters(token):
    for f in filters:
        if f(token):
            return None
    return token

def filter_doc(doc):
    filtered_doc = []
    for token in doc:
        if apply_filters(token) is not None:
            filtered_doc.append(token)
    return filtered_doc

def lemmatize(doc):
    return [token.lemma_ for token in doc]

In [92]:
filtered_chapter_doc = { k: filter_doc(v) for k, v in chapter_doc.items() }

In [102]:
preprocessed_chapter_doc = { k: ' '.join(lemmatize(v)) for k, v in filtered_chapter_doc.items() }

preprocessed_corpus = ' '.join(preprocessed_chapter_doc.values())
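
The filter-chain pattern above (drop a token as soon as any predicate matches) can be tried without spaCy. The following is a rough sketch on plain strings; the stop-word list and function names are hypothetical and much simpler than spaCy's token attributes, and no lemmatization is done.

```python
# Hypothetical, tiny stop-word list used only for this illustration.
STOP_WORDS = {"the", "a", "was", "to", "of", "and"}

def is_stop(word):
    return word in STOP_WORDS

def is_non_alphabetic(word):
    return not word.isalpha()

# A token is dropped as soon as any predicate returns True.
DROP_IF = [is_stop, is_non_alphabetic]

def keep(word):
    return not any(pred(word) for pred in DROP_IF)

def preprocess(text):
    # Lower-case, split on whitespace, then apply the filter chain.
    return [w for w in text.lower().split() if keep(w)]

tokens = preprocess("Alice was beginning to get very tired of chapter 1 !")
print(tokens)
```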

## Task 3

**Description**

Find top 10 most important words (TF-IDF, for instance) from each chapter in the text (not "Alice"). How would you name each chapter according to the identified tokens?

**Get normalized TF-IDF scores**

In [113]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import normalize
import pandas as pd

cvect = CountVectorizer()
counts = cvect.fit_transform(preprocessed_chapter_doc.values())
normalized_counts = normalize(counts, norm='l1', axis=1)

tfidf = TfidfVectorizer(smooth_idf=False)
tfs = tfidf.fit_transform(preprocessed_chapter_doc.values())
new_tfs = normalized_counts.multiply(tfidf.idf_)

feature_names = tfidf.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
corpus_index = [n for n in preprocessed_corpus]
df = pd.DataFrame(new_tfs.T.todense(), index=feature_names, columns=chapter_numbers)
# df = pd.DataFrame(new_tfs.T.todense(), index=feature_names, columns=corpus_index)
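
To make the scoring explicit: the cell above multiplies L1-normalized term counts by the unsmoothed IDF, which scikit-learn documents as ln(n / df) + 1 when `smooth_idf=False`. A minimal pure-Python sketch of the same computation, on made-up "chapters" (all names and data here are assumptions for illustration):

```python
import math
from collections import Counter

# Made-up, already preprocessed "chapters".
docs = {
    "I": "rabbit hole fall rabbit".split(),
    "II": "pool tear mouse".split(),
    "III": "mouse race prize mouse".split(),
}

n_docs = len(docs)

# Document frequency: in how many chapters each term occurs.
df_counts = Counter(term for words in docs.values() for term in set(words))

# Unsmoothed IDF: ln(n / df) + 1.
idf = {term: math.log(n_docs / df) + 1 for term, df in df_counts.items()}

def tfidf_scores(words):
    counts = Counter(words)
    total = sum(counts.values())
    # Term frequency is the count divided by the chapter length (L1 norm).
    return {term: (count / total) * idf[term] for term, count in counts.items()}

scores = {chapter: tfidf_scores(words) for chapter, words in docs.items()}
top_word = {chapter: max(s, key=s.get) for chapter, s in scores.items()}
print(top_word)
```

A term that appears in several chapters ("mouse" here) gets a lower IDF than a term confined to one chapter, which is why words like "say" can still rank high only when they are very frequent within a chapter.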

**Extract the most important ones for each chapter**

In [135]:
def get_most_important_words_in_chapter(chapter, df, k=10):
    # Drop "alice" first so that exactly k words are returned even when it is not near the top.
    scores = df[chapter].drop("alice", errors="ignore")
    return list(scores.sort_values(ascending=False)[:k].index)

In [None]:
for chapter in chapter_numbers:
    print(f"Chapter {chapter} - \t {get_most_important_words_in_chapter(chapter, df)}")

Chapter I - 	 ['eat', 'fall', 'think', 'bat', 'little', 'key', 'door', 'find', 'go', 'way']
Chapter II - 	 ['mouse', 'swam', 'little', 'say', 'oh', 'cat', 'pool', 'cry', 'fan', 'mabel']
Chapter III - 	 ['say', 'mouse', 'dodo', 'prize', 'lory', 'dry', 'race', 'thimble', 'know', 'bird']
Chapter IV - 	 ['window', 'puppy', 'bill', 'little', 'rabbit', 'fan', 'grow', 'say', 'glove', 'bottle']
Chapter V - 	 ['caterpillar', 'say', 'serpent', 'pigeon', 'youth', 'egg', 'size', 'father', 'think', 'hookah']
Chapter VI - 	 ['footman', 'say', 'cat', 'baby', 'mad', 'duchess', 'grin', 'wow', 'grunt', 'go']
Chapter VII - 	 ['hatter', 'dormouse', 'say', 'hare', 'march', 'twinkle', 'tea', 'time', 'remark', 'treacle']
Chapter VIII - 	 ['queen', 'say', 'hedgehog', 'king', 'gardener', 'rose', 'soldier', 'go', 'look', 'executioner']
Chapter IX - 	 ['say', 'turtle', 'mock', 'gryphon', 'moral', 'duchess', 'queen', 'go', 'think', 'school']
Chapter X - 	 ['turtle', 'gryphon', 'mock', 'dance', 'say', 'lobster', '

**Actual chapter names**

In [141]:
for number, name in chapter_names.items():
    print(f"Chapter {number}: \t{name}")

Chapter I: 	Down the Rabbit-Hole
Chapter II: 	The Pool of Tears
Chapter III: 	A Caucus-Race and a Long Tale
Chapter IV: 	The Rabbit Sends in a Little Bill
Chapter V: 	Advice from a Caterpillar
Chapter VI: 	Pig and Pepper
Chapter VII: 	A Mad Tea-Party
Chapter VIII: 	The Queen’s Croquet-Ground
Chapter IX: 	The Mock Turtle’s Story
Chapter X: 	The Lobster Quadrille
Chapter XI: 	Who Stole the Tarts?
Chapter XII: 	Alice’s Evidence


The names are actually pretty reasonable. In fiction, chapter names often do not reflect the chapter's content, which is not surprising: the chapter name is probably one of the first things a reader sees when opening a new chapter. It should intrigue the reader, and it probably should not capture the whole essence of the chapter in a few words, as that may reduce the reader's engagement.

## Task 4

**Description**

Find the top 10 most used verbs in sentences with Alice. What does Alice do most often?

**Iterate over sentences and collect verbs that have Alice as subject**

The following counters only take into account sentences where Alice is named explicitly. Implicit references to her are a lot harder to resolve, but resolving them would perhaps not change the results much.

In [165]:
from collections import defaultdict

def span_contains(key, sentence):
    for word in sentence:
        if key == word.text:
            return True
    return False

verbs_count = defaultdict(int)
verbs_alice_as_subject = defaultdict(int)

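# The corpus is lower-cased and chapter I opens with "Alice", so the first token is "alice".
alice = alice_doc[0]
for sentence in alice_doc.sents:
    # Skip sentences that do not mention "alice" at all.
    if not span_contains(alice.text, sentence):
        continue

    for word in sentence:
        if word.pos_ != "VERB":
            continue
        verbs_count[word.text] += 1

        # Count the verb only if "alice" is one of its direct syntactic children.
        if span_contains(alice.text, list(word.children)):
            verbs_alice_as_subject[word.text] += 1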

In [171]:
def get_top_verbs(counter, k=10):
    # Sort the words by their counts, highest first, and keep the first k.
    return sorted(counter, key=counter.get, reverse=True)[:k]

**Simple counter results**

The simple counter includes all verbs from the sentences in which the word "alice" appears.

In [172]:
get_top_verbs(verbs_count)

['said',
 'thought',
 'could',
 'would',
 'went',
 '’s',
 'looked',
 'began',
 'think',
 'see']
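
As a side note, the `defaultdict(int)` plus `sorted` combination used for counting could also be written with `collections.Counter`. A small sketch, assuming a made-up list of verb occurrences in place of the real tokens:

```python
from collections import Counter

# Made-up verb occurrences standing in for the tokens collected per sentence.
verbs_seen = ["said", "thought", "said", "went", "said", "thought", "looked"]

verb_counter = Counter(verbs_seen)

# most_common(k) returns the k (word, count) pairs with the highest counts.
top_two = [word for word, _ in verb_counter.most_common(2)]
print(top_two)
```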

**Language-aware counter results**

The more complicated counter takes into account only verbs that have Alice as a direct dependent (typically as the subject). That results in "more correct" words being chosen.

In [173]:
get_top_verbs(verbs_alice_as_subject)

['said',
 'thought',
 'replied',
 'began',
 'went',
 'looked',
 'felt',
 'cried',
 'heard',
 'asked']
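
The counter above accepts "alice" as any direct child of the verb. A stricter variant, sketched below as an assumption rather than what the notebook ran, would also require the child's dependency label to be `nsubj`, the label spaCy's English models use for nominal subjects. The helper name `has_named_subject` is hypothetical, and the stand-in objects only mimic the token attributes the check needs.

```python
from types import SimpleNamespace

def has_named_subject(verb, name):
    """Return True if `verb` has a nominal-subject child whose text is `name`."""
    return any(
        child.dep_ == "nsubj" and child.text == name
        for child in verb.children
    )

# Stand-in objects mimicking the few token attributes used above
# (text, dep_, children); real code would pass spaCy tokens instead.
alice_subj = SimpleNamespace(text="alice", dep_="nsubj", children=[])
alice_obj = SimpleNamespace(text="alice", dep_="dobj", children=[])
said = SimpleNamespace(text="said", children=[alice_subj])
told = SimpleNamespace(text="told", children=[alice_obj])

print(has_named_subject(said, "alice"))
print(has_named_subject(told, "alice"))
```

With real tokens this would exclude cases where Alice is the object of the verb, so the resulting top-10 list could differ from the one shown above.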

It seems that Alice does a lot of talking and thinking, which is probably a good thing.

## Task 5

**Description**

Find the top 100 most used verbs in sentences with Alice. Get word2vec visualization of them. Compare word embeddings and write conclusions.
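
One common way to compare word embeddings is cosine similarity between their vectors. The sketch below only illustrates the measure: the three-dimensional vectors are made up, whereas real word2vec vectors would come from a trained model and typically have far more dimensions.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors, not taken from any trained model.
toy_vectors = {
    "said": [1.0, 0.0, 0.0],
    "replied": [1.0, 1.0, 0.0],
    "went": [0.0, 0.0, 1.0],
}

print(cosine_similarity(toy_vectors["said"], toy_vectors["replied"]))
print(cosine_similarity(toy_vectors["said"], toy_vectors["went"]))
```

Values close to 1 indicate vectors pointing in nearly the same direction, and values near 0 indicate unrelated directions.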