# Assignment II: Text Preprocessing

## Question 1

Please use the `gutenberg` corpus provided in `nltk` and extract the text written by Lewis Carroll, i.e., `fileid == 'carroll-alice.txt'`, as your corpus data.

With this corpus data, please perform text preprocessing on the **sentences** of the corpus.

In particular, please:

- pos-tag all the sentences to get the parts-of-speech of each word
- lemmatize all words using `WordNetLemmatizer` in NLTK on a sentential basis

Please provide your output as shown below:

- it is a data frame
- the column `alice_sents` includes the original sentence texts
- the column `alice_sents_pos` includes the annotated version of the sentences, with each token formatted as `word/postag`
- the column `sents_lem` includes the lemmatized version of the sentences


```{note}
Please note that the lemmatized form of the BE verbs (e.g., *was*) should be *be*. This is a quick way to check whether your lemmatization works.
```
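
The check described in the note can be automated once the lemmatized sentences exist. The following is a minimal sketch and not part of the assignment: the helper name `leftover_be_forms` and the list of BE forms are assumptions made for illustration, and the function expects whitespace-separated lemma strings like those in `sents_lem`. A hit does not always signal a failure (for example, *being* tagged as a noun legitimately stays *being*), so the result is best treated as a list of cases to inspect.

```python
# Inflected BE forms that a verb-aware lemmatizer is expected to map to "be".
# (Illustrative list; contracted forms such as "'s" are ambiguous and left out.)
BE_FORMS = {"am", "is", "are", "was", "were", "been", "being"}

def leftover_be_forms(lemmatized_sentences):
    """Return (sentence_index, token) pairs where an inflected BE form survived."""
    hits = []
    for i, sentence in enumerate(lemmatized_sentences):
        for token in sentence.split():
            if token.lower() in BE_FORMS:
                hits.append((i, token))
    return hits

# Toy input: the first string is lemmatized correctly, the second is not.
sample = ["Alice be begin to get very tired", "There was nothing so VERY remarkable"]
print(leftover_be_forms(sample))
```
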


##**Load packages**##

In [1]:
import nltk
nltk.download(['gutenberg', 'punkt','averaged_perceptron_tagger', 'treebank', 'wordnet', 'omw-1.4'])
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import gutenberg
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
import pandas as pd

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


##**Extract the text 'carroll-alice.txt'**##

In [2]:
# Load and extract the text of "Alice's Adventures in Wonderland" by Lewis Carroll
alice_text = gutenberg.raw('carroll-alice.txt')

# View the text
print(alice_text[:300])

[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversatio


##**Extract sentences and store them in 'sents'**##

In [3]:
# Tokenize the text into sentences
alice_sents = sent_tokenize(alice_text)

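# sent_tokenize can return items that still contain a blank line
# (e.g., the title and chapter headings), so join the sentences with
# blank lines and split again on "\n\n" to break those items apart.
# Note: this reassigns `alice_text`, which is reused in Question 3.
alice_text = "\n\n".join(alice_sents)
alice_sents = alice_text.split("\n\n")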

# Create a DataFrame
alice_sents_df = pd.DataFrame(alice_sents, columns=['sents'])
print(alice_sents_df)

                                                  sents
0     [Alice's Adventures in Wonderland by Lewis Car...
1                                            CHAPTER I.
2                                  Down the Rabbit-Hole
3     Alice was beginning to get very tired of sitti...
4     So she was considering in her own mind (as wel...
...                                                 ...
1704  But her sister sat still just as she left her,...
1705  First, she dreamed of little Alice herself, an...
1706  The long grass rustled at her feet as the Whit...
1707  So she sat on, with closed eyes, and half beli...
1708  Lastly, she pictured to herself how this same ...

[1709 rows x 1 columns]
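
The join-and-split step matters because an item returned by the sentence tokenizer can still contain a blank line, which appears to be the case for the title and chapter heading here. A minimal sketch with hand-written strings shows how the round trip breaks such items apart; the list `raw_sents` is made up for illustration and stands in for the output of `sent_tokenize`.

```python
# Stand-in for sent_tokenize output: two items still contain a blank line.
raw_sents = [
    "[Title line]\n\nCHAPTER I.",
    "Down the Rabbit-Hole\n\nAlice was beginning to get very tired.",
    "Oh dear!",
]

# Join with blank lines, then split on blank lines again.
resplit = "\n\n".join(raw_sents).split("\n\n")

print(len(raw_sents), len(resplit))
for s in resplit:
    print(repr(s))
```

Single newlines inside a sentence are not affected by this step, which is why some rows in the data frame still contain `\n`.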


##**Pos-tag sentences and store them in 'sents_pos'**

In [4]:
# Function to tokenize and tag
def tokenize_and_tag(sentence):
    tokens = word_tokenize(sentence)  # Tokenize the sentence
    tagged = nltk.pos_tag(tokens)  # Tag each token
    return ' '.join([f'{word}/{tag}' for word, tag in tagged])

# Apply the function to each sentence
alice_sents_df['sents_pos'] = alice_sents_df['sents'].apply(tokenize_and_tag)

# Display the result
print(alice_sents_df['sents_pos'])

0       [/JJ Alice/NNP 's/POS Adventures/NNS in/IN Won...
1                                    CHAPTER/NN I/PRP ./.
2                           Down/IN the/DT Rabbit-Hole/JJ
3       Alice/NNP was/VBD beginning/VBG to/TO get/VB v...
4       So/IN she/PRP was/VBD considering/VBG in/IN he...
                              ...                        
1704    But/CC her/PRP$ sister/NN sat/VBD still/RB jus...
1705    First/RB ,/, she/PRP dreamed/VBD of/IN little/...
1706    The/DT long/JJ grass/NN rustled/VBD at/IN her/...
1707    So/IN she/PRP sat/VBD on/IN ,/, with/IN closed...
1708    Lastly/RB ,/, she/PRP pictured/VBD to/TO herse...
Name: sents_pos, Length: 1709, dtype: object
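
Because each token is stored as `word/postag`, the annotated string can be turned back into (word, tag) pairs when needed. The following is a minimal sketch, assuming tokens are separated by whitespace and that a tag never contains a slash; the helper name `parse_tagged` is hypothetical. Splitting on the last slash keeps words that themselves contain `/` intact.

```python
def parse_tagged(annotated):
    """Split a 'word/TAG word/TAG ...' string into (word, tag) tuples."""
    pairs = []
    for item in annotated.split():
        word, _, tag = item.rpartition("/")  # split on the LAST slash only
        pairs.append((word, tag))
    return pairs

print(parse_tagged("Oh/UH dear/NN !/."))
print(parse_tagged("and/or/CC"))
```
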


##**Prepare for word lemmatization**

In [5]:
# Initialize the WordNetLemmatizer
wnl = WordNetLemmatizer()

# Function to convert NLTK POS tags to WordNet POS tags
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wn.ADJ
    elif treebank_tag.startswith('V'):
        return wn.VERB
    elif treebank_tag.startswith('N'):
        return wn.NOUN
    elif treebank_tag.startswith('R'):
        return wn.ADV
    else:
        return wn.NOUN  # Default to noun if not found

##**Lemmatize words in sentences and store them in 'sents_lem'**

In [6]:
# Function to tokenize, tag, and lemmatize
def lemmatize_with_pos(sentence):
    tokens = word_tokenize(sentence)  # Tokenize the sentence
    tagged = pos_tag(tokens)  # Tag each token with POS
    lemmatized = [wnl.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged]  # Lemmatize with POS
    return ' '.join(lemmatized)

# Apply the function to each sentence
alice_sents_df['sents_lem'] = alice_sents_df['sents'].apply(lemmatize_with_pos)

# Display the result
print(alice_sents_df['sents_lem'])

0       [ Alice 's Adventures in Wonderland by Lewis C...
1                                             CHAPTER I .
2                                    Down the Rabbit-Hole
3       Alice be begin to get very tired of sit by her...
4       So she be consider in her own mind ( as well a...
                              ...                        
1704    But her sister sit still just a she leave her ...
1705    First , she dream of little Alice herself , an...
1706    The long grass rustle at her foot a the White ...
1707    So she sit on , with closed eye , and half bel...
1708    Lastly , she picture to herself how this same ...
Name: sents_lem, Length: 1709, dtype: object
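
In the output above, *as* appears as *a* (for example, `just a she leave her`). A likely cause is that tags outside the four mapped categories fall back to noun, so function words are lemmatized as if they were plural nouns. One possible alternative, sketched below under stated assumptions, is to leave such tokens unchanged. The names `wordnet_pos_or_none` and `lemmatize_tagged` are hypothetical, the single-letter constants are the values of `wn.ADJ`, `wn.VERB`, `wn.NOUN`, and `wn.ADV`, and `toy_lemmatize` is a stand-in for `wnl.lemmatize` so that the example runs without NLTK data.

```python
# Values of wn.ADJ, wn.VERB, wn.NOUN, wn.ADV in nltk.corpus.wordnet.
TAG_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def wordnet_pos_or_none(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS, or None if there is no match."""
    return TAG_PREFIX_TO_WORDNET.get(treebank_tag[:1])

def lemmatize_tagged(tagged, lemmatize):
    """Lemmatize (word, tag) pairs, leaving words with unmapped tags unchanged."""
    out = []
    for word, tag in tagged:
        pos = wordnet_pos_or_none(tag)
        out.append(lemmatize(word, pos) if pos else word)
    return " ".join(out)

def toy_lemmatize(word, pos):
    """Stand-in for wnl.lemmatize: a tiny lookup table, for illustration only."""
    table = {("sat", "v"): "sit", ("left", "v"): "leave"}
    return table.get((word, pos), word)

tagged = [("sister", "NN"), ("sat", "VBD"), ("just", "RB"),
          ("as", "IN"), ("she", "PRP"), ("left", "VBD"), ("her", "PRP")]
print(lemmatize_tagged(tagged, toy_lemmatize))
```

With NLTK available, `wnl.lemmatize` can be passed in place of `toy_lemmatize`. Note that this changes some lemmas and would therefore shift the frequency counts in Question 2 slightly.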


##**View the final output DataFrame**

In [7]:
alice_sents_df[:21]

|   | sents | sents_pos | sents_lem |
|---|-------|-----------|-----------|
| 0 | [Alice's Adventures in Wonderland by Lewis Car... | [/JJ Alice/NNP 's/POS Adventures/NNS in/IN Won... | [ Alice 's Adventures in Wonderland by Lewis C... |
| 1 | CHAPTER I. | CHAPTER/NN I/PRP ./. | CHAPTER I . |
| 2 | Down the Rabbit-Hole | Down/IN the/DT Rabbit-Hole/JJ | Down the Rabbit-Hole |
| 3 | Alice was beginning to get very tired of sitti... | Alice/NNP was/VBD beginning/VBG to/TO get/VB v... | Alice be begin to get very tired of sit by her... |
| 4 | So she was considering in her own mind (as wel... | So/IN she/PRP was/VBD considering/VBG in/IN he... | So she be consider in her own mind ( as well a... |
| 5 | There was nothing so VERY remarkable in that; ... | There/EX was/VBD nothing/NN so/RB VERY/RB rema... | There be nothing so VERY remarkable in that ; ... |
| 6 | Oh dear! | Oh/UH dear/NN !/. | Oh dear ! |
| 7 | I shall be late!' | I/PRP shall/MD be/VB late/RB !/. '/'' | I shall be late ! ' |
| 8 | (when she thought it over afterwards, it\noccu... | (/( when/WRB she/PRP thought/VBD it/PRP over/I... | ( when she think it over afterwards , it occur... |
| 9 | In another moment down went Alice after it, ne... | In/IN another/DT moment/NN down/RP went/VBD Al... | In another moment down go Alice after it , nev... |


## Question 2

Based on the output of **Question 1**, please create a lemma frequency list of the corpus, `carroll-alice.txt`, using the lemmatized forms and including only lemmas that:
- consist only of alphabetic characters or hyphens
- are at least 5 characters long

The casing is irrelevant (i.e., case normalization is needed).

The expected output is provided as follows (showing the top 20 lemmas and their frequencies).


##**Load packages**

In [8]:
import pandas as pd
from collections import Counter

##**Extract lemmas and convert them to lowercase**

In [9]:
# Prepare a placeholder for lemmas
all_lemmas = []

# Extract all lemmas
for sentence in alice_sents_df['sents_lem']:
    lemmas = sentence.split()
    all_lemmas.extend(lemmas)

# Convert lemmas to lowercase
all_lemmas = [lemma.lower() for lemma in all_lemmas]

##**Filter lemmas based on criteria**

In [10]:
# Keep lemmas that are at least 5 characters long AND consist only of letters or hyphens
filtered_lemmas = [
    lemma for lemma in all_lemmas
    if len(lemma) >= 5 and all(char.isalpha() or char == "-" for char in lemma)
]
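
The same two criteria can also be written as a single regular expression. This is a sketch of an alternative, not the method used above: `LEMMA_PATTERN` and `keep_lemma` are names made up here, and the pattern accepts only ASCII letters, whereas `str.isalpha()` also accepts accented and other Unicode letters. Both versions would accept a string made only of hyphens if it were long enough.

```python
import re

# Five or more characters, each an ASCII letter or a hyphen.
LEMMA_PATTERN = re.compile(r"[a-z-]{5,}")

def keep_lemma(lemma):
    """Return True if the lowercased lemma satisfies both criteria."""
    return LEMMA_PATTERN.fullmatch(lemma.lower()) is not None

candidates = ["Alice", "rabbit-hole", "be", "don't", "chapter", "1865s", "--"]
print([w.lower() for w in candidates if keep_lemma(w)])
```
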

##**Count frequencies and return a DataFrame of top 20 lemmas**

In [11]:
# Count lemma frequencies
lemma_counts = Counter(filtered_lemmas)

# Create and return the DataFrame from the top 20 lemmas
alice_df = pd.DataFrame(lemma_counts.most_common(20), columns=['LEMMA', 'FREQ'])
alice_df

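|   | LEMMA   | FREQ |
|---|---------|------|
| 0 | alice   | 396  |
| 1 | little  | 128  |
| 2 | think   | 109  |
| 3 | about   | 94   |
| 4 | begin   | 91   |
| 5 | would   | 90   |
| 6 | there   | 87   |
| 7 | could   | 86   |
| 8 | again   | 83   |
| 9 | herself | 83   |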


## Question 3

Please identify the top verbs that co-occur with the name *Alice* in the text, with *Alice* being the **subject** of the verb.

Please use the `en_core_web_sm` model in `spacy` for English dependency parsing.

To simplify the task, please identify all the verbs that have a dependency relation of `nsubj` with the noun `Alice` (where `Alice` is the **dependent**, and the verb is the **head**).

The expected output is provided below (showing the top 20 heads of `Alice` for the `nsubj` dependency relation).

##**Load packages**

In [12]:
import spacy

# Load the English web small model
nlp = spacy.load("en_core_web_sm")

##**Parse text and find verbs that head "Alice"**

In [13]:
# Process the text with spaCy
# (`alice_text` here is the version re-joined from sentences in In [3])
doc = nlp(alice_text)

# Collect the head of every "Alice" token that bears the nsubj relation
alice_verbs = []
for token in doc:
    if token.lower_ == "alice" and token.dep_ == "nsubj":
        # Retrieve the head verb associated with this token and append its text (the actual verb)
        alice_verbs.append(token.head.text)

##**Count frequencies and return a DataFrame of top 20 verbs**

In [14]:
# Count verb frequencies
verb_counts = Counter(alice_verbs)

# Create and return the DataFrame from the top 20 verbs
alice_nsubj_df = pd.DataFrame(verb_counts.most_common(20), columns=['nsubj-head', 'FREQ'])
alice_nsubj_df

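|   | nsubj-head | FREQ |
|---|------------|------|
| 0 | said       | 124  |
| 1 | thought    | 19   |
| 2 | replied    | 13   |
| 3 | was        | 11   |
| 4 | began      | 8    |
| 5 | went       | 7    |
| 6 | looked     | 7    |
| 7 | felt       | 5    |
| 8 | like       | 5    |
| 9 | think      | 4    |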
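
The table above counts surface forms, so *thought* and *think* are listed separately, and the heads are not checked for part of speech. A possible refinement, sketched here and not part of the expected output, is to count head lemmas and keep only heads tagged as verbs or auxiliaries. The function relies only on the spaCy token attributes `lower_`, `dep_`, `head`, `pos_`, and `lemma_`; the function name `count_subject_heads` and the `SimpleNamespace` stand-in tokens are assumptions so that the example runs without a model.

```python
from collections import Counter
from types import SimpleNamespace

def count_subject_heads(tokens, name="alice", allowed_pos=("VERB", "AUX")):
    """Count head lemmas of tokens that match `name` and bear the nsubj relation."""
    counts = Counter()
    for token in tokens:
        if token.lower_ == name and token.dep_ == "nsubj":
            head = token.head
            if head.pos_ in allowed_pos:
                counts[head.lemma_.lower()] += 1
    return counts

# Stand-in tokens mimicking the spaCy attributes used above (illustration only).
def fake_subject(head_lemma, head_pos, dep="nsubj"):
    head = SimpleNamespace(lemma_=head_lemma, pos_=head_pos)
    return SimpleNamespace(lower_="alice", dep_=dep, head=head)

tokens = [
    fake_subject("say", "VERB"),
    fake_subject("say", "VERB"),
    fake_subject("think", "VERB"),
    fake_subject("be", "AUX"),
    fake_subject("like", "ADP"),              # non-verbal head: skipped
    fake_subject("see", "VERB", dep="dobj"),  # Alice is not the subject: skipped
]
print(count_subject_heads(tokens).most_common())
```

With a parsed document, `count_subject_heads(doc)` can be called directly on the `doc` object created above, since iterating over a spaCy `Doc` yields its tokens.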
