### Abdullah Nasih Jasir (Абдулла Насих Джасир)
---

# **ALICE NLP**
---

In [1]:
import pandas as pd

# Data Wrangling

---
#### **1. Download Alice in Wonderland by Lewis Carroll from Project Gutenberg's website http://www.gutenberg.org/files/11/11-0.txt**
---

In [2]:
import requests
import re

# FETCH THE TEXT FROM URL
url = "https://www.gutenberg.org/files/11/11-0.txt"
response = requests.get(url)
text_content = response.text

# CHAPTERS SPLIT
chapters = re.split(r'(CHAPTER\s+[IVXLCDM]+\.\s+[^\n]*)\s*\n', text_content)
chapter_titles = chapters[1::2]
chapter_contents = chapters[2::2]

# TAKE IT TO DATAFRAME
raw_df = pd.DataFrame({"chapter_title": chapter_titles, "content": chapter_contents})
print(raw_df.head(12))

                                        chapter_title content
0               CHAPTER I.     Down the Rabbit-Hole\r        
1                  CHAPTER II.    The Pool of Tears\r        
2      CHAPTER III.   A Caucus-Race and a Long Tale\r        
3   CHAPTER IV.    The Rabbit Sends in a Little Bi...        
4          CHAPTER V.     Advice from a Caterpillar\r        
5                     CHAPTER VI.    Pig and Pepper\r        
6                    CHAPTER VII.   A Mad Tea-Party\r        
7         CHAPTER VIII.  The Queen’s Croquet-Ground\r        
8            CHAPTER IX.    The Mock Turtle’s Story\r        
9              CHAPTER X.     The Lobster Quadrille\r        
10              CHAPTER XI.    Who Stole the Tarts?\r        
11                  CHAPTER XII.   Alice’s Evidence\r        


Because the split also matches the contents list, its twelve entries end up as separate rows with an empty `content` column, so we delete them.

In [3]:
# DROP ROWS
raw_df = raw_df.drop(raw_df.index[0:12])

# RESET THE INDEX
raw_df = raw_df.reset_index(drop=True)

print(raw_df.head())

                                       chapter_title  \
0               CHAPTER I.\r\nDown the Rabbit-Hole\r   
1                 CHAPTER II.\r\nThe Pool of Tears\r   
2    CHAPTER III.\r\nA Caucus-Race and a Long Tale\r   
3  CHAPTER IV.\r\nThe Rabbit Sends in a Little Bi...   
4          CHAPTER V.\r\nAdvice from a Caterpillar\r   

                                             content  
0  Alice was beginning to get very tired of sitti...  
1  “Curiouser and curiouser!” cried Alice (she wa...  
2  They were indeed a queer-looking party that as...  
3  It was the White Rabbit, trotting slowly back ...  
4  The Caterpillar and Alice looked at each other...  
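
The hard-coded `raw_df.index[0:12]` works for this file, but it would silently drop real chapters if the contents list had a different length. Below is a minimal sketch of an alternative, assuming the contents-list matches are exactly the ones with no text after them. `split_chapters` and the `sample` text are hypothetical names made up for illustration:

```python
import re

# Same pattern as in cell 2
CHAPTER_RE = re.compile(r'(CHAPTER\s+[IVXLCDM]+\.\s+[^\n]*)\s*\n')

def split_chapters(text):
    """Return (title, content) pairs, skipping headings with no body text."""
    parts = CHAPTER_RE.split(text)
    titles, contents = parts[1::2], parts[2::2]
    pairs = []
    for title, content in zip(titles, contents):
        if content.strip():  # contents-list entries have nothing after them
            # Collapse "\r\n" and runs of spaces inside the title
            pairs.append((" ".join(title.split()), content.strip()))
    return pairs

# Made-up miniature of the file layout: a contents list, then two chapters
sample = (
    "Contents\r\n\r\n"
    " CHAPTER I.     Down the Rabbit-Hole\r\n"
    " CHAPTER II.    The Pool of Tears\r\n"
    "\r\n\r\n"
    "CHAPTER I.\r\nDown the Rabbit-Hole\r\n\r\n"
    "Alice was beginning to get very tired.\r\n\r\n"
    "CHAPTER II.\r\nThe Pool of Tears\r\n\r\n"
    "Curiouser and curiouser!\r\n"
)

chapters = split_chapters(sample)
for title, content in chapters:
    print(title, "->", content)
```

The resulting pairs could be passed to `pd.DataFrame(chapters, columns=["chapter_title", "content"])` in place of the drop-and-reset step.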


# PREPROCESSING DATA

---
#### **2. Perform any necessary preprocessing on the text, including converting to lower case, removing stop words, numbers / non-alphabetic characters, lemmatization.**
---

In [4]:
preprocessed_df = raw_df.copy()

In [5]:
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# LOWERCASING
preprocessed_df["content"] = preprocessed_df["content"].apply(lambda x: x.lower())

# TOKENIZATION
preprocessed_df["content"] = preprocessed_df["content"].apply(word_tokenize)

# REMOVING NOISE
def remove_noise(words):
    cleaned_words = []
    for word in words:
        cleaned_word = re.sub(r'[^A-Za-z\s]+', '', word)
        cleaned_word = cleaned_word.lower()
        if cleaned_word:
            cleaned_words.append(cleaned_word)
    return cleaned_words

preprocessed_df["content"] = preprocessed_df["content"].apply(remove_noise)

# STOPWORD REMOVAL
stop_words = set(stopwords.words('english'))
def remove_stopwords(words):
    return [word for word in words if word.lower() not in stop_words]

preprocessed_df["content"] = preprocessed_df["content"].apply(remove_stopwords)

# LEMMATIZATION
def lemmatize_words(words):
    return [lemmatizer.lemmatize(word) for word in words]

preprocessed_df["content"] = preprocessed_df["content"].apply(lemmatize_words)

# REJOINING TOKENS
preprocessed_df["content"] = preprocessed_df["content"].apply(lambda words: ' '.join(words))
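
One side effect of `remove_noise` shows up later in the notebook: because the regex deletes hyphens outright, hyphenated words are fused into single tokens such as `breadandbutter` and `rabbithole`. Below is a small sketch of one possible alternative, assuming you would rather keep the parts as separate words. `remove_noise_keep_parts` is a hypothetical name, and a plain `str.split` stands in for `word_tokenize` so the example runs without NLTK data:

```python
import re

def remove_noise(words):
    """Notebook version: delete every non-letter character."""
    cleaned = (re.sub(r'[^A-Za-z\s]+', '', w).lower() for w in words)
    return [w for w in cleaned if w]

def remove_noise_keep_parts(words):
    """Variant: turn runs of non-letters into spaces, then split on whitespace."""
    out = []
    for w in words:
        out.extend(re.sub(r'[^A-Za-z]+', ' ', w).lower().split())
    return out

tokens = "bread-and-butter in the rabbit-hole , 42 times".split()
fused = remove_noise(tokens)
parted = remove_noise_keep_parts(tokens)

print(fused)
print(parted)
```

The variant also splits on apostrophes, so it can leave one-letter fragments that may need to be filtered out afterwards. Which behaviour is preferable depends on whether `rabbithole` or `rabbit` plus `hole` is more useful for the later TF-IDF step.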

# EXPLORATORY DATA

---
#### **3. Find Top 10 most important (for example, in terms of TF-IDF metric) words from each chapter in the text (not "Alice"); how would you name each chapter according to the identified tokens?**
---

In [6]:
exploratory_df = preprocessed_df.copy()

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

# REMOVE THE ALICE WORD
exploratory_df["content_cleaned"] = exploratory_df["content"].apply(lambda x: re.sub(r'\balice\b', '', x))
exploratory_df = exploratory_df.drop(columns=["content"])

# VECTORIZE USING TFIDFVECTORIZER
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(exploratory_df["content_cleaned"])

# CONVERT TFIDF TO DATAFRAME
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
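
To make the scores easier to interpret, here is a pure-Python sketch of what `TfidfVectorizer` computes with its default settings, as I understand them: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The three toy documents and the `tfidf_rows` helper are made up for illustration, and sklearn's own tokenisation rules are ignored here:

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: score} dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf (assumed to match sklearn's default smooth_idf=True)
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        raw = {t: c * idf[t] for t, c in Counter(tokens).items()}
        # L2 normalisation so every row has unit length
        norm = math.sqrt(sum(v * v for v in raw.values()))
        rows.append({t: v / norm for t, v in raw.items()})
    return rows

docs = [
    "said mouse mouse pool",
    "said hatter hatter tea",
    "said queen queen king",
]
rows = tfidf_rows(docs)
top = sorted(rows[0].items(), key=lambda kv: kv[1], reverse=True)
for term, score in top:
    print(f"{term}: {score:.4f}")
```

A word that appears in every document gets the minimum idf of 1, but it can still rank highly when its raw count is large. That is probably why `said` appears in many of the per-chapter lists.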

TOP 10 WORDS AND A SUITABLE TITLE FOR EACH CHAPTER

In [8]:
# CHAPTER 1
chapter_1 = tfidf_df.iloc[0]
top_10_words1 = chapter_1.nlargest(10)

print(top_10_words1)

little    0.173424
bat       0.171089
door      0.154574
key       0.151132
eat       0.143506
like      0.127178
think     0.127178
way       0.127178
either    0.123005
see       0.115616
Name: 0, dtype: float64


CHAPTER 1:  
Based on those words (the 10 words with the highest TF-IDF scores in that chapter), I would title this chapter "The Little Door and the Key".

In [9]:
# CHAPTER 2
chapter_2 = tfidf_df.iloc[1]
top_10_words2 = chapter_2.nlargest(10)

print(top_10_words2)

mouse     0.315662
little    0.189146
pool      0.169681
swam      0.159762
cat       0.157831
dear      0.154499
said      0.133515
foot      0.129849
mabel     0.127809
go        0.122388
Name: 1, dtype: float64


CHAPTER 2:  
Based on those words (the 10 words with the highest TF-IDF scores in that chapter), I would title this chapter "Little Mouse in the Pool".

In [10]:
# CHAPTER 3
chapter_3 = tfidf_df.iloc[2]
top_10_words3 = chapter_3.nlargest(10)

print(top_10_words3)

mouse      0.401868
said       0.366934
dodo       0.319406
prize      0.185958
lory       0.159703
dry        0.141075
thimble    0.123972
know       0.118714
bird       0.114819
dinah      0.105521
Name: 2, dtype: float64


CHAPTER 3:  
Based on those words (the 10 words with the highest TF-IDF scores in that chapter), I would title this chapter "The Mouse's Prize".

In [11]:
# CHAPTER 4
chapter_4 = tfidf_df.iloc[3]
top_10_words4 = chapter_4.nlargest(10)

print(top_10_words4)

bill       0.215152
window     0.210644
little     0.201710
rabbit     0.190681
puppy      0.184313
bottle     0.135677
chimney    0.135677
fan        0.135677
glove      0.135677
one        0.128361
Name: 3, dtype: float64


CHAPTER 4:  
Based on those words (the 10 words with the highest TF-IDF scores in that chapter), I would title this chapter "The Little Rabbit and a Puppy".

In [12]:
# CHAPTER 5
chapter_5 = tfidf_df.iloc[4]
top_10_words5 = chapter_5.nlargest(10)

print(top_10_words5)

caterpillar    0.456587
said           0.435911
pigeon         0.288889
serpent        0.288889
egg            0.144444
youth          0.144444
size           0.114750
father         0.103375
little         0.092212
well           0.083829
Name: 4, dtype: float64


CHAPTER 5:  
Based on those words (the 10 words with the highest TF-IDF scores in that chapter), I would title this chapter "The Caterpillar and the Serpent".

In [13]:
# CHAPTER 6
chapter_6 = tfidf_df.iloc[5]
top_10_words6 = chapter_6.nlargest(10)

print(top_10_words6)

said       0.374080
cat        0.338714
footman    0.274285
baby       0.215929
mad        0.190743
duchess    0.165527
pig        0.157040
wow        0.137143
like       0.127346
cook       0.121382
Name: 5, dtype: float64


CHAPTER 6:  
Based on those words (the 10 words with the highest TF-IDF scores in that chapter), I would title this chapter "The Madness of Baby Cat".

In [14]:
# CHAPTER 7
chapter_7 = tfidf_df.iloc[6]
top_10_words7 = chapter_7.nlargest(10)

print(top_10_words7)

hatter      0.466133
dormouse    0.431742
said        0.382525
hare        0.266249
march       0.266249
twinkle     0.148954
time        0.110219
tea         0.098877
draw        0.095943
clock       0.093096
Name: 6, dtype: float64


CHAPTER 7:  
Based on those words (the 10 words with the highest TF-IDF scores in that chapter), I would title this chapter "The Hatter and the Dormouse".

In [15]:
# CHAPTER 8
chapter_8 = tfidf_df.iloc[7]
top_10_words8 = chapter_8.nlargest(10)

print(top_10_words8)

queen          0.450239
said           0.332164
hedgehog       0.221839
king           0.211481
gardener       0.177471
soldier        0.151466
cat            0.150672
five           0.133363
executioner    0.133103
procession     0.133103
Name: 7, dtype: float64


CHAPTER 8:  
Based on those words (the 10 words with the highest TF-IDF scores in that chapter), I would title this chapter "The King and the Queen".

In [16]:
# CHAPTER 9
chapter_9 = tfidf_df.iloc[8]
top_10_words9 = chapter_9.nlargest(10)

print(top_10_words9)

said       0.413998
turtle     0.411420
mock       0.395596
gryphon    0.284062
duchess    0.204999
moral      0.187724
queen      0.164630
went       0.094421
never      0.079894
say        0.078445
Name: 8, dtype: float64


CHAPTER 9:  
Based on those words (the 10 words with the highest TF-IDF scores in that chapter), I would title this chapter "The Queen Turtle".

In [17]:
# CHAPTER 10
chapter_10 = tfidf_df.iloc[9]
top_10_words10 = chapter_10.nlargest(10)

print(top_10_words10)

turtle       0.419633
mock         0.379024
gryphon      0.376653
said         0.279597
dance        0.231962
lobster      0.231962
beautiful    0.162439
soup         0.162439
join         0.160589
whiting      0.142746
Name: 9, dtype: float64


CHAPTER 10:  
Based on those words (the 10 words with the highest TF-IDF scores in that chapter), I would title this chapter "The Dance of the Turtle and Gryphon".

In [18]:
# CHAPTER 11
chapter_11 = tfidf_df.iloc[10]
top_10_words11 = chapter_11.nlargest(10)

print(top_10_words11)

king              0.407538
hatter            0.366727
said              0.320623
court             0.296537
dormouse          0.256998
witness           0.230191
queen             0.116798
juror             0.115096
officer           0.115096
breadandbutter    0.098846
Name: 10, dtype: float64


CHAPTER 11:  
Based on those words (the 10 words with the highest TF-IDF scores in that chapter), I would title this chapter "The King's Court".

In [19]:
# CHAPTER 12
chapter_12 = tfidf_df.iloc[11]
top_10_words12 = chapter_12.nlargest(10)

print(top_10_words12)

said      0.468572
king      0.395265
jury      0.200168
queen     0.148752
sister    0.140117
dream     0.135959
would     0.119440
slate     0.113300
rabbit    0.109187
fit       0.105541
Name: 11, dtype: float64


CHAPTER 12:  
Based on those words (the 10 words with the highest TF-IDF scores in that chapter), I would title this chapter "The Queen's Dream".
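
The twelve cells in this section differ only in the row index, so the same lists could be produced in one loop. Below is a minimal sketch, assuming `tfidf_df` has one row per chapter and one column per word, as built in cell 7. The tiny `tfidf_df` here is made-up data so the example runs on its own, and `top_words_per_chapter` is a hypothetical helper name:

```python
import pandas as pd

def top_words_per_chapter(tfidf_df, n=10):
    """Map chapter number (1-based) to its n highest-scoring words."""
    return {
        i + 1: tfidf_df.iloc[i].nlargest(n).index.tolist()
        for i in range(len(tfidf_df))
    }

# Made-up scores standing in for the real TF-IDF matrix
tfidf_df = pd.DataFrame(
    {"mouse": [0.1, 0.4], "said": [0.2, 0.3], "door": [0.5, 0.0]}
)

top_words = top_words_per_chapter(tfidf_df, n=2)
for chapter, words in top_words.items():
    print(f"Chapter {chapter}: {', '.join(words)}")
```

With the real matrix, calling `top_words_per_chapter(tfidf_df)` would give all twelve top-10 lists at once, which makes them easier to compare side by side.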

----
#### **4. Find the Top 10 most used verbs in sentences with Alice. What does Alice do most often?**
----

In [20]:
import nltk
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords
from collections import Counter

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

# KEEP ONLY THE SENTENCES THAT MENTION ALICE
sentences_with_alice = [sentence for sentence in re.split(r'(?<=[.!?])\s+', text_content) if 'Alice' in sentence]
sentences_with_alice

[nltk_data] Downloading package punkt to C:\Users\Abdullah
[nltk_data]     NJ\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Abdullah NJ\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to C:\Users\Abdullah
[nltk_data]     NJ\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['*** START OF THE PROJECT GUTENBERG EBOOK 11 ***\r\n[Illustration]\r\n\r\n\r\n\r\n\r\nAlice’s Adventures in Wonderland\r\n\r\nby Lewis Carroll\r\n\r\nTHE MILLENNIUM FULCRUM EDITION 3.0\r\n\r\nContents\r\n\r\n CHAPTER I.',
 'Alice’s Evidence\r\n\r\n\r\n\r\n\r\nCHAPTER I.',
 'Down the Rabbit-Hole\r\n\r\n\r\nAlice was beginning to get very tired of sitting by her sister on the\r\nbank, and of having nothing to do: once or twice she had peeped into\r\nthe book her sister was reading, but it had no pictures or\r\nconversations in it, “and what is the use of a book,” thought Alice\r\n“without pictures or conversations?”\r\n\r\nSo she was considering in her own mind (as well as she could, for the\r\nhot day made her feel very sleepy and stupid), whether the pleasure of\r\nmaking a daisy-chain would be worth the trouble of getting up and\r\npicking the daisies, when suddenly a White Rabbit with pink eyes ran\r\nclose by her.',
 'There was nothing so _very_ remarkable in that; nor did Alice th

After checking the `sentences_with_alice` list, I found that it is still noisy and contains different forms of the same verb (e.g. said/say, do/did). So I clean it first, before finding the 10 most frequent verbs in sentences with Alice.
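
Part of the noise comes from the Project Gutenberg header, which ends up in the first "sentence" of the output above. Below is a sketch of trimming it before splitting, assuming the file marks the body with `*** START OF ...` and `*** END OF ...` lines. The start marker is visible in the output above, but the end marker is an assumption about this file. `strip_gutenberg` and the `sample` string are hypothetical:

```python
import re

# Marker lines; the END form is assumed to mirror the START form
START_RE = re.compile(r'\*\*\* ?START OF [^\n]*\*\*\*')
END_RE = re.compile(r'\*\*\* ?END OF [^\n]*\*\*\*')

def strip_gutenberg(text):
    """Keep only the text between the START and END marker lines, if present."""
    start = START_RE.search(text)
    if start:
        text = text[start.end():]
    end = END_RE.search(text)
    if end:
        text = text[:end.start()]
    return text.strip()

# Made-up miniature of the file layout
sample = (
    "The Project Gutenberg eBook of ...\r\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK 11 ***\r\n"
    "Alice was beginning to get very tired.\r\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK 11 ***\r\n"
    "License text ...\r\n"
)
body = strip_gutenberg(sample)
print(body)
```

The title page and contents list would still remain after this step, so the first one or two "sentences" may still need to be dropped by hand.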

In [21]:
# INITIATE LEMMATIZER
lemmatizer = WordNetLemmatizer()

# LOWERCASING
sentences_with_alice = [sentence.lower() for sentence in sentences_with_alice]

# TOKENIZATION
sentences_with_alice = [word_tokenize(sentence) for sentence in sentences_with_alice]

# REMOVING NOISE
def remove_noise(words):
    cleaned_words = []
    for word in words:
        cleaned_word = re.sub(r'[^A-Za-z\s]+', '', word)
        cleaned_word = cleaned_word.lower()
        if cleaned_word:
            cleaned_words.append(cleaned_word)
    return cleaned_words

sentences_with_alice = [remove_noise(sentence) for sentence in sentences_with_alice]

# STOPWORD REMOVAL
stop_words = set(stopwords.words('english'))
def remove_stopwords(words):
    return [word for word in words if word.lower() not in stop_words]

sentences_with_alice = [remove_stopwords(sentence) for sentence in sentences_with_alice]

# LEMMATIZATION
def lemmatize_words(words):
    return [lemmatizer.lemmatize(word, pos='v') for word in words]

sentences_with_alice = [lemmatize_words(sentence) for sentence in sentences_with_alice]

# REJOINING TOKENS
sentences_with_alice = [' '.join(words) for words in sentences_with_alice]
sentences_with_alice

['start project gutenberg ebook illustration alice adventure wonderland lewis carroll millennium fulcrum edition content chapter',
 'alice evidence chapter',
 'rabbithole alice begin get tire sit sister bank nothing twice peep book sister read picture conversations use book think alice without picture conversations consider mind well could hot day make feel sleepy stupid whether pleasure make daisychain would worth trouble get pick daisies suddenly white rabbit pink eye run close',
 'nothing remarkable alice think much way hear rabbit say oh dear',
 'shall late think afterwards occur ought wonder time seem quite natural rabbit actually take watch waistcoatpocket look hurry alice start feet flash across mind never see rabbit either waistcoatpocket watch take burn curiosity run across field fortunately time see pop large rabbithole hedge',
 'another moment go alice never consider world get',
 'rabbithole go straight like tunnel way dip suddenly suddenly alice moment think stop find fall 

Note the POS tag passed to the lemmatizer above. With `pos='v'`, `lemmatize` treats every word as a verb, so forms such as "said" are reduced to "say". Without this argument, the lemmatizer defaults to treating every word as a noun, which is not what I want here.

In [None]:
# TOKENIZE
all_verbs = []
for sentence in sentences_with_alice:
    words = word_tokenize(sentence)
    pos_tags = pos_tag(words)
    
    # EXTRACT THE VERBS (ALL PENN TREEBANK VERB TAGS, E.G. VB, VBD, VBG, START WITH 'V')
    verbs = [word.lower() for word, tag in pos_tags if tag.startswith('V') and word.isalpha()]
    all_verbs.extend(verbs)

# FIND THE TOP 10 MOST COMMON VERBS
top_verbs = Counter(all_verbs).most_common(10)

print("Top 10 most used verbs in sentences with 'Alice':")
for verb, freq in top_verbs:
    print(f"{verb}: {freq}")

Top 10 most used verbs in sentences with 'Alice':
say: 289
go: 90
think: 75
get: 50
see: 37
come: 36
make: 35
know: 35
take: 31
begin: 24


Based on these counts, the thing that Alice does most often is "say" (yapping, or talking).