### Task 2:
1. Download Alice in Wonderland by Lewis Carroll from Project Gutenberg's website http://www.gutenberg.org/files/11/11-0.txt
2. Perform any necessary preprocessing on the text, including converting to lower case, removing stop words and numbers / non-alphabetic characters, and lemmatization.
3. Find the top 10 most important words (for example, by TF-IDF score) in each chapter of the text, excluding "Alice". How would you name each chapter based on the identified tokens?
4. Find the Top 10 most used verbs in sentences with Alice. What does Alice do most often?


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import re
import requests
import nltk
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [None]:
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

#### Download text

In [None]:
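url = "http://www.gutenberg.org/files/11/11-0.txt"
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early on an HTTP error instead of analysing an error page
text = response.text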

In [None]:
text[:1000]

'*** START OF THE PROJECT GUTENBERG EBOOK 11 ***\n\n[Illustration]\n\n\n\n\nAlice’s Adventures in Wonderland\n\nby Lewis Carroll\n\nTHE MILLENNIUM FULCRUM EDITION 3.0\n\nContents\n\n CHAPTER I.     Down the Rabbit-Hole\n CHAPTER II.    The Pool of Tears\n CHAPTER III.   A Caucus-Race and a Long Tale\n CHAPTER IV.    The Rabbit Sends in a Little Bill\n CHAPTER V.     Advice from a Caterpillar\n CHAPTER VI.    Pig and Pepper\n CHAPTER VII.   A Mad Tea-Party\n CHAPTER VIII.  The Queen’s Croquet-Ground\n CHAPTER IX.    The Mock Turtle’s Story\n CHAPTER X.     The Lobster Quadrille\n CHAPTER XI.    Who Stole the Tarts?\n CHAPTER XII.   Alice’s Evidence\n\n\n\n\nCHAPTER I.\nDown the Rabbit-Hole\n\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into\nthe book her sister was reading, but it had no pictures or\nconversations in it, “and what is the use of a book,” thought Alice\n“without pictures or conver
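The preview above starts with a `*** START OF ...` marker line. A minimal sketch of trimming the book to its body, assuming the file also carries a matching `*** END OF ...` line followed by text that should not be analysed (that part is not visible in the preview, so it is an assumption). `strip_gutenberg_markers` is a hypothetical helper, not part of the notebook.

```python
import re

def strip_gutenberg_markers(raw):
    """Return the text between the '*** START OF' line and the '*** END OF' line.

    If a marker is missing, that side of the text is left untouched.
    """
    start = re.search(r"\*\*\* ?START OF[^\n]*", raw)  # the whole marker line
    end = re.search(r"\*\*\* ?END OF", raw)
    begin = start.end() if start else 0
    finish = end.start() if end else len(raw)
    return raw[begin:finish].strip()

# Toy input standing in for the downloaded file
sample = (
    "*** START OF THE PROJECT GUTENBERG EBOOK 11 ***\n"
    "CHAPTER I.\nDown the Rabbit-Hole\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK 11 ***\n"
    "Trailing text that should not be analysed.\n"
)
body = strip_gutenberg_markers(sample)
print(body)
```

If the assumption holds, calling this on `text` before splitting into chapters would keep any trailing material out of the last chapter.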

#### Text preprocessing

In [None]:
def preprocess_text(text):
    """Split the raw book text into a list of chapter bodies."""

    # Skip everything before the first table-of-contents entry
    start_chapter = 'CHAPTER I.     Down the Rabbit-Hole'
    start_match = re.search(re.escape(start_chapter), text)

    if start_match:
        text = text[start_match.start():]

    # Split on chapter headings such as "CHAPTER IV."
    chapter_pattern = r'CHAPTER [IVXLCDM]+\.'
    chapters_raw = re.split(chapter_pattern, text)

    # Drop empty pieces and the short table-of-contents titles;
    # real chapters are far longer than 100 characters
    chapters = []
    for chapter in chapters_raw:
        chapter = chapter.strip()
        if len(chapter) > 100:
            chapters.append(chapter)
    return chapters
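The length check works because the table-of-contents entries are short. A possible alternative, sketched here on a toy string, is to match only headings that sit alone on their own line, which is how the real chapter headings appear in the preview printed earlier (`CHAPTER I.\nDown the Rabbit-Hole`), while the contents entries are indented and followed by a title. `split_on_headings` and `toy_book` are hypothetical names used for illustration.

```python
import re

# A heading is "CHAPTER <roman numeral>." with nothing else on the line
HEADING = re.compile(r"^CHAPTER [IVXLCDM]+\.[ \t\r]*$", flags=re.MULTILINE)

def split_on_headings(book):
    """Split a book on stand-alone chapter headings and drop the front matter."""
    parts = HEADING.split(book)
    # parts[0] is everything before the first heading (title page, contents)
    return [part.strip() for part in parts[1:]]

toy_book = (
    "Contents\n\n"
    " CHAPTER I.     Down the Rabbit-Hole\n"
    " CHAPTER II.    The Pool of Tears\n\n"
    "CHAPTER I.\nDown the Rabbit-Hole\n\nAlice was beginning to get very tired.\n\n"
    "CHAPTER II.\nThe Pool of Tears\n\nCuriouser and curiouser!\n"
)
toy_chapters = split_on_headings(toy_book)
print(len(toy_chapters))
```

This avoids the magic number 100, at the cost of depending on the heading layout staying the same.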

In [None]:
def clean_and_lemmatize(text):
    # Converting to lower case
    text = text.lower()

    # Replace everything except ASCII letters and whitespace (this already
    # covers digits and punctuation), then collapse runs of whitespace
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    text = re.sub(r'\s+', ' ', text)

    # Tokenization
    words = word_tokenize(text)

    # Removing stop words
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]

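    # Create lemmatizer
    lemmatizer = WordNetLemmatizer()

    # Map a word's Penn Treebank tag to the WordNet POS the lemmatizer expects.
    # Each word is tagged in isolation, without sentence context, which is
    # likely why some inflected forms (e.g. "peeped", "ran") stay unchanged.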
    def get_wordnet_pos(word):
        tag = nltk.pos_tag([word])[0][1]
        if tag.startswith('J'):  # adj
            return 'a'
        elif tag.startswith('V'):  # verbs
            return 'v'
        elif tag.startswith('N'):  # nouns
            return 'n'
        elif tag.startswith('R'):  # adverbs
            return 'r'
        else:
            return 'n'

    words = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in words]
    return ' '.join(words)
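The tag-to-POS mapping inside `clean_and_lemmatize` can also be written as a lookup table. The sketch below uses a hypothetical helper `penn_to_wordnet` and a hand-written list of tagged tokens standing in for the output of `nltk.pos_tag` on a whole sentence, so it runs without any NLTK data.

```python
# First letter of the Penn Treebank tag -> WordNet POS letter
PENN_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS letter, falling back to noun."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1], default)

# Hand-written stand-in for nltk.pos_tag(tokens); the tags are illustrative
tagged = [("alice", "NN"), ("peeped", "VBD"), ("sleepy", "JJ"),
          ("suddenly", "RB"), ("oh", "UH")]
wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_pos)
```

Tagging a full token list in one `nltk.pos_tag` call, before stop words are removed, gives the tagger sentence context and is much faster than one call per word. It would change the lemmatized output, so the results shown in this notebook would need to be regenerated.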

In [None]:
# Preprocessing chapters
chapters = preprocess_text(text)
print(f"Number of chapters: {len(chapters)}")
cleaned_chapters = [clean_and_lemmatize(chapter) for chapter in chapters]

Number of chapters: 12


In [None]:
print(cleaned_chapters[0][:500])
print(type(cleaned_chapters[0]))

rabbit hole alice begin get tire sit sister bank nothing twice peeped book sister reading picture conversation use book thought alice without picture conversation consider mind well could hot day make feel sleepy stupid whether pleasure make daisy chain would worth trouble get pick daisy suddenly white rabbit pink eye ran close nothing remarkable alice think much way hear rabbit say oh dear oh dear shall late thought afterwards occur ought wonder time seem quite natural rabbit actually take watc
<class 'str'>



In [None]:
def analyze_chapters_tfidf(chapters):
    # Set a TF-IDF vectorizer object
    vectorizer = TfidfVectorizer(
        stop_words=None,
        max_df=0.8
    )

    tfidf = vectorizer.fit_transform(chapters)
    feature_names = vectorizer.get_feature_names_out()
    exclude_words = ['alice']
    chapter_keywords = []

    for i in range(len(chapters)):
        row = tfidf[i].toarray()[0]
        scores = list(zip(feature_names, row))

        # Exclude unwanted words before slicing so each chapter still yields 10 keywords
        scores = [(w, s) for w, s in scores if w not in exclude_words]
        top_words = [w for w, s in sorted(scores, key=lambda x: x[1], reverse=True)[:10]]

        chapter_keywords.append(top_words)

        print(f"\nChapter {i+1} top words:")
        print(", ".join(top_words))

    return chapter_keywords
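To make the scores less of a black box, here is a small pure-Python sketch of the weighting that, as far as I know, `TfidfVectorizer` applies with its default settings: raw term counts times a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. `tfidf_rows` and the toy documents are hypothetical, and tokenisation is simplified to whitespace splitting.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenised for term in set(tokens))
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                   for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

toy_docs = ["rabbit hole rabbit alice", "mouse pool tear alice", "hatter tea hare alice"]
rows = tfidf_rows(toy_docs)
top_term = max(rows[0], key=rows[0].get)
print(top_term)
```

In the toy data "alice" occurs in every document, so it gets the lowest weight in each row. In the notebook it does not appear at all, because `max_df=0.8` removes terms found in more than 80% of the chapters.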

In [None]:
chapter_keywords = analyze_chapters_tfidf(cleaned_chapters)


Chapter 1 top words:
bat, eat, door, rabbit, key, fall, either, bottle, hole, dinah

Chapter 2 top words:
mouse, pool, swam, cat, dear, cry, fan, mabel, tear, dog

Chapter 3 top words:
mouse, dodo, prize, race, lory, dry, thimble, bird, cause, dinah

Chapter 4 top words:
bill, rabbit, window, puppy, grow, fan, glove, bottle, chimney, room

Chapter 5 top words:
caterpillar, pigeon, serpent, egg, youth, size, father, hookah, change, bit

Chapter 6 top words:
footman, cat, baby, mad, duchess, grin, sneeze, wow, pig, cook

Chapter 7 top words:
hatter, dormouse, hare, march, tea, twinkle, draw, remark, treacle, clock

Chapter 8 top words:
queen, hedgehog, king, gardener, soldier, cat, five, executioner, procession, three

Chapter 9 top words:
turtle, mock, gryphon, duchess, moral, queen, school, sigh, remark, chin

Chapter 10 top words:
turtle, mock, gryphon, dance, lobster, soup, beautiful, oop, soo, join

Chapter 11 top words:
king, hatter, court, dormouse, witness, jury, queen, juror, o

#### Example chapter names:
* Chapter 1 - Rabbit Door Key
* Chapter 2 - Mouse Pool Tears
* Chapter 3 - Mouse Dodo Race
* Chapter 4 - Rabbit Bill Window
* Chapter 5 - Caterpillar Serpent Egg
* Chapter 6 - Mad Cat Duchess
* Chapter 7 - Hatter Hare Tea
* Chapter 8 - Queen King Croquet
* Chapter 9 - Mock Turtle Gryphon
* Chapter 10 - Lobster Dance Soup
* Chapter 11 - King Court Jury
* Chapter 12 - Jury Dream Sister

In [None]:
def extract_verbs_with_alice(text):
    # Divide into sentences
    sentences = sent_tokenize(text)

    # Looking for sentences containing "Alice"
    alice_sentences = [sent for sent in sentences if 'alice' in sent.lower()]

    # Combine sentences into one text for processing
    alice_text = ' '.join(alice_sentences)

    # Preprocessing
    alice_text = re.sub("[^a-zA-Z ]+", " ", alice_text)
    alice_text = re.sub(" +", " ", alice_text)

    # Tokenize and clean
    tokens = word_tokenize(alice_text)
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token.lower() not in stop_words]

    # Extract verbs
    tags = nltk.pos_tag(tokens)
    verbs = [verb for verb, tag in tags if tag in ['VB', 'VBG', 'VBD', 'VBN', 'VBP', 'VBZ']]

    # Lemmatize verbs
    lemmatizer = WordNetLemmatizer()
    verbs = [lemmatizer.lemmatize(verb.lower(), 'v') for verb in verbs]

    return verbs
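The substring test `'alice' in sent.lower()` also matches words that merely contain those letters. A possible refinement, sketched with a hypothetical helper `mentions_alice`, is a case-insensitive whole-word match:

```python
import re

# \b marks a word boundary, so "malice" does not count as a mention
ALICE = re.compile(r"\balice\b", flags=re.IGNORECASE)

def mentions_alice(sentence):
    """Return True if the sentence contains 'Alice' as a whole word."""
    return ALICE.search(sentence) is not None

examples = [
    "Alice's sister sat on the bank.",
    "He spoke without malice.",
    "Oh dear! said ALICE.",
]
flags = [mentions_alice(s) for s in examples]
print(flags)
```

For this book the difference is probably small, so treat it as a robustness tweak rather than a correction of the counts.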

In [None]:
def analyze_alice_actions(chapters):
    all_verbs = []
    for chapter in chapters:
        chapter_verbs = extract_verbs_with_alice(chapter)
        all_verbs.extend(chapter_verbs)

    # Count verb frequency
    verb_counts = Counter(all_verbs)
    top_10_verbs = verb_counts.most_common(10)

    print("TOP 10 VERBS IN SENTENCES WITH ALICE:\n")
    for i, (verb, count) in enumerate(top_10_verbs, 1):
        print(f"{i}. {verb}: {count} times")
    return top_10_verbs

In [None]:
top_verbs = analyze_alice_actions(chapters)

TOP 10 VERBS IN SENTENCES WITH ALICE:

1. say: 294 times
2. go: 91 times
3. think: 66 times
4. get: 57 times
5. look: 49 times
6. begin: 43 times
7. know: 38 times
8. see: 37 times
9. make: 35 times
10. come: 34 times
