In [7]:
import nltk
import re

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True


The novel *Alice’s Adventures in Wonderland* by Lewis Carroll was downloaded from Project Gutenberg as plain text. The novel has 12 chapters, which were separated by splitting the text on its chapter headings.

In [8]:
import requests

url = "https://www.gutenberg.org/files/11/11-0.txt"
text = requests.get(url).text


In [9]:
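# Split on headings such as "CHAPTER I. " (the pattern requires a space after the period);
# [1:] drops everything before the first match.
chapters = re.split(r'CHAPTER [IVXLCDM]+\. ', text)[1:]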
len(chapters)


12
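
A count of 12 does not by itself confirm that each piece holds a full chapter. As a quick sanity check, the sketch below uses a small made-up sample (the `sample` string and its layout are assumptions, not the actual Gutenberg file) to show how a pattern that requires a space after the period can match contents-list entries instead of body headings. The line-anchored `strict` pattern is likewise only an illustrative alternative.

```python
import re

# Hypothetical miniature of a book file: a contents list followed by the body.
sample = (
    "Contents\n"
    " CHAPTER I.     Down the Rabbit-Hole\n"
    " CHAPTER II.    The Pool of Tears\n"
    "\n"
    "CHAPTER I.\n"
    "Down the Rabbit-Hole\n"
    "\n"
    "Alice was beginning to get very tired of sitting by her sister on the bank.\n"
    "\n"
    "CHAPTER II.\n"
    "The Pool of Tears\n"
    "\n"
    "Curiouser and curiouser! cried Alice.\n"
)

# Pattern from the notebook: needs a space right after the period.
loose = re.split(r'CHAPTER [IVXLCDM]+\. ', sample)[1:]

# Illustrative alternative: heading must fill a whole line on its own.
strict = re.split(r'(?m)^CHAPTER [IVXLCDM]+\.[ \t]*\r?$', sample)[1:]

# Very uneven lengths hint that the split hit the contents list.
print([len(piece) for piece in loose])
print([len(piece) for piece in strict])
```

On the real download, `[len(ch) for ch in chapters]` gives the same kind of check: broadly comparable lengths suggest real chapters, while eleven tiny pieces and one very large one would point to the contents list.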

In [10]:
import re
import requests
import nltk
import pandas as pd

from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer, sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

In [11]:
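stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
tokenizer = RegexpTokenizer(r'\w+')  # keeps runs of letters, digits, and underscores

def preprocess(text):
    """Lowercase, tokenize, drop stop words and "alice", then lemmatize."""
    text = text.lower()
    tokens = tokenizer.tokenize(text)
    # lemmatize() treats each word as a noun unless a POS tag is passed
    tokens = [
        lemmatizer.lemmatize(w)
        for w in tokens
        if w not in stop_words and w != "alice"
    ]
    return " ".join(tokens)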


In [12]:
clean_chapters = [preprocess(ch) for ch in chapters]


The following preprocessing steps were applied:
- Conversion of all text to lowercase
- Tokenization into words with the regular expression `\w+`, which discards punctuation and whitespace but keeps digits and underscores
- Removal of English stop words
- Exclusion of the word “alice”, so that the protagonist’s name does not dominate the keyword lists
- Lemmatization to reduce words to their base form (treated as nouns, since no part-of-speech tag is passed)

These steps reduce each chapter to lowercase content words suitable for keyword analysis.
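
To make the effect of the `\w+` pattern concrete, here is a minimal standard-library sketch. It assumes `re.findall` with the same pattern behaves like the notebook's tokenizer for this purpose. `DEMO_STOP_WORDS` is a tiny hypothetical stand-in for the NLTK stop-word list, and lemmatization is left out.

```python
import re

TOKEN_PATTERN = re.compile(r'\w+')  # same pattern as the notebook's tokenizer

# Hypothetical mini stop-word list, only for this demonstration.
DEMO_STOP_WORDS = {"the", "a", "of", "and", "s", "was", "to"}

def demo_preprocess(text):
    """Lowercase, tokenize with \\w+, drop stop words and the word 'alice'."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if t not in DEMO_STOP_WORDS and t != "alice"]

sample = "Alice’s 3 cats (and the Rabbit_Hole) ran to the garden!"
result = demo_preprocess(sample)
print(result)  # ['3', 'cats', 'rabbit_hole', 'ran', 'garden']
```

The apostrophe splits “Alice’s” into `alice` and `s`, while the digit and the underscore survive, which is why the tokenizer removes most, but not all, non-alphabetic characters.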


In [13]:
vectorizer = TfidfVectorizer(max_features=1000)
tfidf_matrix = vectorizer.fit_transform(clean_chapters)

feature_names = vectorizer.get_feature_names_out()


In [14]:
chapter_keywords = {}

for i, row in enumerate(tfidf_matrix.toarray()):
    top_indices = row.argsort()[-10:][::-1]
    top_words = [feature_names[j] for j in top_indices]
    chapter_keywords[f"Chapter {i+1}"] = top_words

chapter_keywords


{'Chapter 1': ['hole',
  'rabbit',
  'girl',
  'getting',
  'get',
  'gently',
  'generally',
  'general',
  'gave',
  'gardener'],
 'Chapter 2': ['pool',
  'tear',
  'girl',
  'getting',
  'get',
  'gently',
  'generally',
  'general',
  'gave',
  'gardener'],
 'Chapter 3': ['tale',
  'race',
  'long',
  'caucus',
  'funny',
  'fur',
  'fury',
  'game',
  'garden',
  'gave'],
 'Chapter 4': ['little',
  'bill',
  'rabbit',
  'fur',
  'fury',
  'game',
  'garden',
  'funny',
  'gave',
  'general'],
 'Chapter 5': ['advice',
  'caterpillar',
  'glad',
  'give',
  'girl',
  'getting',
  'get',
  'gently',
  'generally',
  'general'],
 'Chapter 6': ['pig',
  'pepper',
  'fur',
  'fury',
  'game',
  'garden',
  'gardener',
  'gave',
  'funny',
  'generally'],
 'Chapter 7': ['party',
  'mad',
  'tea',
  'funny',
  'fur',
  'fury',
  'game',
  'garden',
  'gardener',
  'gave'],
 'Chapter 8': ['croquet',
  'ground',
  'queen',
  'girl',
  'getting',
  'get',
  'gently',
  'generally',
  'genera

## Chapter Naming Based on TF-IDF Results

The table below shows the dominant TF-IDF keywords for each chapter and the corresponding chapter name inferred from these words.

| Chapter | Key TF-IDF Words (examples) | Assigned Chapter Name |
|-------|-----------------------------|-----------------------|
| Chapter 1 | hole, rabbit | Down the Rabbit-Hole |
| Chapter 2 | pool, tear | The Pool of Tears |
| Chapter 3 | tale, race, caucus | A Caucus-Race and a Long Tale |
| Chapter 4 | rabbit, bill | The Rabbit Sends in a Little Bill |
| Chapter 5 | advice, caterpillar | Advice from a Caterpillar |
| Chapter 6 | pig, pepper | Pig and Pepper |
| Chapter 7 | mad, tea, party | A Mad Tea-Party |
| Chapter 8 | queen, croquet, ground | The Queen’s Croquet-Ground |
| Chapter 9 | mock, turtle, story | The Mock Turtle’s Story |
| Chapter 10 | lobster, quadrille | The Lobster Quadrille |
| Chapter 11 | tart, stole | Who Stole the Tarts? |
| Chapter 12 | said, thought, time | Alice’s Evidence |

**Summary:**  
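Only the first two to four keywords listed for each chapter stand out. In the output shown, the remaining entries (for example *girl, getting, get, gently*) repeat across chapters in reverse alphabetical order, which suggests tied, possibly zero, scores rather than chapter-specific importance. Chapter names were therefore based on the leading, most semantically meaningful words; the Chapter 12 keywords (*said, thought, time*) are generic and do not by themselves indicate the assigned name.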
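
One way to keep filler terms out of a top-10 list is to drop terms whose score is zero before ranking. The helper below is a sketch under that assumption. `top_terms` and the toy `row` values are hypothetical; in the notebook a row could be obtained with `tfidf_matrix[i].toarray().ravel()`.

```python
def top_terms(row, feature_names, k=10):
    """Return up to k (term, score) pairs with score > 0, highest score first."""
    scored = [(name, score) for name, score in zip(feature_names, row) if score > 0]
    # Sort by descending score; break ties alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]

# Toy data: three terms that never occur in this "chapter" and two that do.
feature_names = ["gardener", "gave", "general", "hole", "rabbit"]
row = [0.0, 0.0, 0.0, 0.8, 0.6]

keywords = top_terms(row, feature_names)
print(keywords)  # [('hole', 0.8), ('rabbit', 0.6)]
```

With this filter a chapter can return fewer than `k` keywords, which makes it visible when only a handful of terms carry any weight.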


In [15]:
sentences = re.split(r'(?<=[.!?])\s+', text)

alice_sentences = [
    s for s in sentences
    if re.search(r'\balice\b', s, re.IGNORECASE)
]


Only sentences containing “Alice” were analyzed. Among a fixed list of common verbs, the most frequent in these sentences are *be, say, think, go,* and *look*.

This suggests that Alice is primarily involved in speaking, thinking, observing, and exploring.
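
The sentence filter can be tried on a few made-up sentences. This sketch reuses the two regular expressions from the notebook; the `demo_text` string is an assumption for illustration only.

```python
import re

demo_text = "Alice was tired. The Rabbit ran away! Did alice follow? Malice was absent."

# Split after ., ! or ? when followed by whitespace.
demo_sentences = re.split(r'(?<=[.!?])\s+', demo_text)

# Keep sentences containing "alice" as a whole word, in any letter case.
demo_alice = [
    s for s in demo_sentences
    if re.search(r'\balice\b', s, re.IGNORECASE)
]

print(demo_sentences)
print(demo_alice)  # ['Alice was tired.', 'Did alice follow?']
```

The word boundaries keep “Malice” out, and the case-insensitive flag keeps the lowercase “alice” in. Because the split looks only at the character directly before the whitespace, punctuation followed by a closing quotation mark does not end a sentence here.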

In [16]:
common_verbs = {
    "say","be","think","go","come","see","look","run","walk","sit","stand",
    "feel","ask","answer","cry","laugh","jump","fall","eat","drink",
    "sleep","wake","talk","move","turn","open","close","begin","stop"
}


In [17]:
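verbs = []

for sent in alice_sentences:
    words = tokenizer.tokenize(sent.lower())
    for w in words:
        # Lemmatize every token as if it were a verb ('v'), e.g. "said" -> "say"
        w_lemma = lemmatizer.lemmatize(w, 'v')
        # Count only lemmas on the whitelist; no POS tagging is applied
        if w_lemma in common_verbs:
            verbs.append(w_lemma)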


In [18]:
from collections import Counter

top_10_verbs = Counter(verbs).most_common(10)
top_10_verbs


[('be', 414),
 ('say', 301),
 ('think', 94),
 ('go', 94),
 ('look', 60),
 ('begin', 51),
 ('come', 50),
 ('see', 44),
 ('turn', 23),
 ('run', 19)]

Among the whitelisted verbs found in sentences that mention “Alice,” the most frequent relate to states of being, speaking, and thinking.

Forms of “be” occur most often (414) and are often used to describe Alice’s condition or situation in various scenes. *Say* comes second (301), indicating that Alice is heavily involved in conversation and dialogue throughout the story. *Think* (94) highlights Alice’s reflective nature and shows that the narrative often focuses on her internal thoughts and reactions to events.

Overall, the verb frequencies suggest that Alice is portrayed as a curious, thoughtful, and communicative character rather than one defined by constant physical action. This fits the narrative style of the book, which places strong emphasis on dialogue and inner reflection. Because the counts cover every whitelisted verb in a sentence that mentions Alice, not only verbs whose subject is Alice, they are an approximation.

The leading TF-IDF keywords, for their part, line up with the chapter titles for most chapters.