<a href="https://colab.research.google.com/github/letizia-z/Vocab-growth-through-short-stories/blob/main/Acquire_L2_words_from_consuming_content.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

|     Course                     | Academic Year |
|    :---                        |     ---:      |
| Programming for the Humanities |  *2023/2024*  |

*This has been my first programming course.*

# HOW MANY WORDS CAN I ACQUIRE FROM CONSUMING CONTENT IN L2

### Project Description

This project uses Python to perform linguistic analysis on three short stories. The goal is to extract useful information regarding the complexity and variety of the language used in the texts, as well as the number of occurrences of each word. This tool is particularly useful for those who want to deepen their linguistic understanding and improve their vocabulary through the consumption of content in the target language.

* **Input data:** three short stories in English
* **Output data:** sentence, word, and syllable count; Flesch-Kincaid score and reading difficulty; vocabulary variety; possible encounter with new words and passively learnable words (i.e., those that exceed a certain number of occurrences within the texts).


### Name and URL of programs/notebooks reused in the project

|Name|URL|
| :---        |    :---  |
|*Python*|*Notebooks from lectures, especially "09_NLP"*|
||*https://www.datacamp.com/tutorial/sort-a-dictionary-by-value-python*|
|*Spacy*|*https://spacy.io/api/doc*|
||https://stackoverflow.com/questions/405161/detecting-syllables-in-a-word|
|*Matplotlib.pyplot* |*https://matplotlib.org/stable/api/pyplot_summary.html*|
||*https://stackoverflow.com/questions/66446687/how-do-i-make-a-dashed-horizontal-line-with-matplotlib* |
|*Artificial Intelligence*|*ChatGPT, Gemini*|

*This code was developed with the assistance of artificial intelligence (AI) tools to generate a starting point or suggestions. However, the final code has been reviewed, modified, and adapted according to my specific needs, and represents the result of personal work. Any similarities with other works are purely coincidental and unintentional. I have taken all necessary precautions to ensure that the code presented here does not constitute plagiarism and respects copyright laws.*


---

# 1. INTRODUCTION

Anyone who has seriously studied a foreign language has likely experienced the so-called ***“language learning plateau”*** — that moment when you already have enough vocabulary to understand and express more or less everything you want, making it increasingly difficult to learn new words.
To overcome this plateau, reading or generally increasing your consumption of content in the target language is often recommended.
With this in mind, I decided to analyze some texts myself to see **how effective content consumption really is**, in this case with reference to the **English language**.


## 1.1 SOME BASIC CONCEPTS

Before diving into the actual project, it's important to provide two key pieces of information.

### 1.1.1 CEFR LEVELS
Each language proficiency level includes an approximate number of words that are known and usable by the speaker. Therefore, depending on one's starting level, there will be differences in the time needed to understand a text and in the number of words that can be learned from it.

| LEVEL | WORDS KNOWN (approx.) | STUDY HOURS (approx.) |
| --- | --- | --- |
| A1 | 700 | 100 |
| A2 | 1500 | 180/200 |
| B1 | 2500 | 350/400 |
| B2 | 4000 | 500/600 |
| C1 | 8000 | 700/800 |
| C2 | 16000 | 1000/1200 |

### 1.1.2 VOCABULARY ACQUISITION

Simply encountering a word in a text is obviously not enough to learn it. When it comes to learning a new word, we can take two main approaches:

* **Active learning**, where the student who comes across a new term makes a conscious effort to remember it (for example, by using flashcards)
* **Passive acquisition**, which relies primarily on repeated exposure to the same word, ideally in different contexts

Most studies in this field focus on first language (L1) acquisition rather than second language (L2) learning. This is partly because it is still unclear how many exposures are needed to acquire a word, as this also depends on individual cognitive abilities.

According to Uchihara et al.:

> *“the number of encounters necessary to learn words rang\[es] from 6, 10, 12, to more than 20 times. \[That is to say,] the number of encounters necessary for learning of vocabulary to occur during meaning-focussed input remains unclear”*

Therefore, for the purposes of my project, I decided to assume that the **minimum number of exposures required for passive vocabulary acquisition is 12**, based in part on a study by Holly L. Storkel et al. on L1 acquisition in children.



---

# 2. THE SHORT STORIES

First, I selected three short stories that I was unfamiliar with, written by authors from different time periods, genders, and styles. The idea behind this choice was that **greater variety** would allow for the encounter of the largest possible number of different words. This is ideal from the perspective of *active vocabulary study*, but it could be problematic for *passive acquisition*, since a wider vocabulary range would likely result in fewer words reaching the 12-occurrence threshold.

The short stories analyzed are:

* *“The Yellow Wallpaper”* by C. P. Gilman (1892)
* *“Hills Like White Elephants”* by E. Hemingway (1927)
* *“A Good Man is Hard to Find”* by F. O’Connor (1953)



> Note: make sure to manually add the three text files to your own Colab runtime before running the cells below.



## 2.1 IMPORTING AND OPENING THE FILES

First, we'll need to open the files of the selected short stories, so we can begin analyzing them. To make sure I’ve opened the correct files, I’ll also print the first 100 characters of each one.

To distinguish between the three texts, we’ll add the initial of each author’s last name to the variable names:

* **O** = *“A Good Man is Hard to Find”* by F. O’Connor (1953)
* **H** = *“Hills Like White Elephants”* by E. Hemingway (1927)
* **G** = *“The Yellow Wallpaper”* by C. P. Gilman (1892)


In [1]:
def readFile(filePath):
  with open(filePath, 'r', encoding='utf-8') as file:
    return file.read()


filePathO = 'AGoodManIsHardToFind_OConnor1953.txt'
filePathH = 'HillsLikeWhiteElephants_Hemingway1927.txt'
filePathG = 'TheYellowWallpaper_Gillman1892.txt'

rawTextO = readFile(filePathO)
rawTextH = readFile(filePathH)
rawTextG = readFile(filePathG)

print(str(rawTextO[:100]) + '\n')
print(str(rawTextH[:100]) + '\n')
print(str(rawTextG[:100]) + '\n')


﻿A GOOD MAN IS HARD TO FIND
Flannery O’Connor, 1953
The grandmother didn’t want to go to Florida. Sh

﻿HILLS LIKE WHITE ELEPHANTS
Ernest Hemingway, 1927 
The hills across the valley of the Ebro were lon

﻿THE YELLOW WALLPAPER
Charlotte Perkins Gillman, 1892
It is very seldom that mere ordinary people li



### 2.1.1 EXTRACTING THE TITLE

I also decided to take advantage of the formatting of these files (with the title written in uppercase) to create a function that extracts only the title of the short story. This way, we can easily refer back to it in later stages, especially when displaying the results of the various analysis steps.

In [2]:
def extractTitle(filePath):
  with open(filePath, 'r', encoding='utf-8') as file:
    for line in file:
      strippedLine = line.strip() # remove blank spaces at the beginning and at the end of the line
      if strippedLine.isupper(): # the title is supposedly in uppercase
        return strippedLine
  return 'Title not found'  # in case the title isn't in uppercase as expected

titleO = extractTitle(filePathO)
print(str(titleO) + '\n')

titleH = extractTitle(filePathH)
print(str(titleH) + '\n')

titleG = extractTitle(filePathG)
print(str(titleG) + '\n')

﻿A GOOD MAN IS HARD TO FIND

﻿HILLS LIKE WHITE ELEPHANTS

﻿THE YELLOW WALLPAPER



## 2.2 PREPROCESSING

One last necessary step is preprocessing the text by making slight modifications to simplify the subsequent analysis:

* Convert the entire text to **lowercase**: this ensures that, when occurrences are counted, identical words are counted together (1) rather than being split into separate groups because of capitalization
* Remove **apostrophes**: I encountered issues related to apostrophes during tokenization, so I decided to remove them immediately, verifying that this neither affected tokenization nor influenced the later counts (2)

In [3]:
def preprocess(text):
  text = text.lower() #(1)
  text = text.replace('’', '') #(2)
  return text

textO = preprocess(rawTextO)
textH = preprocess(rawTextH)
textG = preprocess(rawTextG)

print(str(textO[:100]) + '\n')
print(str(textH[:100]) + '\n')
print(str(textG[:100]) + '\n')

﻿a good man is hard to find
flannery oconnor, 1953
the grandmother didnt want to go to florida. she 

﻿hills like white elephants
ernest hemingway, 1927 
the hills across the valley of the ebro were lon

﻿the yellow wallpaper
charlotte perkins gillman, 1892
it is very seldom that mere ordinary people li




---
# 3. FLESCH-KINCAID READABILITY

The first thing we want to do is determine which text would be best to read first, moving from the easiest to the most difficult in order to **gradually build our vocabulary**.
To do this, for English we can use the Flesch-Kincaid Grade Level Formula:

$$
0.39 \cdot \frac{\text{total words}}{\text{total sentences}} + 11.8 \cdot \frac{\text{total syllables}}{\text{total words}} - 15.59
$$
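
As a quick sanity check of the formula, here is a minimal sketch with made-up counts (100 words, 10 sentences, 150 syllables). The function name `fkGrade` is only illustrative and is not part of the notebook.

```python
def fkGrade(totalWords, totalSentences, totalSyllables):
  # Flesch-Kincaid Grade Level computed from the three raw counts
  wordsPerSentence = totalWords / totalSentences
  syllablesPerWord = totalSyllables / totalWords
  return 0.39 * wordsPerSentence + 11.8 * syllablesPerWord - 15.59

# Made-up counts, chosen only to keep the arithmetic easy to follow:
# 0.39 * 10 + 11.8 * 1.5 - 15.59 = 3.9 + 17.7 - 15.59 = 6.01
toyGrade = fkGrade(totalWords=100, totalSentences=10, totalSyllables=150)
print(round(toyGrade, 2))
```

Longer sentences (more words per sentence) and longer words (more syllables per word) both push the grade up.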


## 3.1 CALCULATING THE VALUES

From the formula, we see that we need to calculate three values:

* Total sentences (`totalSentences`)
* Total words (`totalWords`)
* Total syllables (`totalSyllables`)

To do this, we’ll use the `spaCy` library, downloading its English language model.

In [4]:
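!pip install spacy
# download the small English model (can be skipped if it is already installed)
!python -m spacy download en_core_web_sm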

import spacy
nlp = spacy.load('en_core_web_sm')



### 3.1.1 TOTAL WORDS
While drafting the project, I decided to start with word tokenization, so that I could immediately spot any potential errors that might also affect later stages of the analysis. In fact, this turned out to be one of the steps where I encountered the most challenges.

Knowing that I would eventually need proper word tokenization for later steps, I decided to create a function dedicated solely to that task. I then obtained the total number of tokens simply by printing the result with a `len()` call outside the function.

However, during this first tokenization attempt, I noticed two main issues:

1. A `\ufeff` character (BOM – Byte Order Mark) appeared at the beginning of the text
2. Punctuation and line breaks were being counted as tokens, even though I only wanted to include **words and numbers**

<div style="text-align: center">
<img src=pics/token_ufeff.png width=75%/>
</div>

To address these issues:

* I started tokenization from the first word after any potential BOM

  * Using `.remove` would not be suitable, as it would also remove the first word after the BOM — in this case, the word "A" (3)
* I defined the function so that it would only add to the token list those strings that consist entirely of letters (4)

> *Note:* `token.is_alpha` filters out every token that contains an apostrophe, including the author's name ("O’Connor"). Removing apostrophes during the **preprocessing phase** (see 2.2 Preprocessing) already takes care of this.
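
As an aside, the BOM can also be dropped at read time rather than inside the tokenizer. The sketch below is an alternative, not what this notebook does: it writes a small temporary file (a hypothetical stand-in for one of the stories) and reads it back with Python's `utf-8-sig` codec, which discards a leading BOM if one is present.

```python
import os
import tempfile

# Hypothetical stand-in for a story file saved with a BOM
with tempfile.NamedTemporaryFile('w', encoding='utf-8-sig', suffix='.txt', delete=False) as tmp:
  tmp.write('A GOOD MAN IS HARD TO FIND\n')
  samplePath = tmp.name

with open(samplePath, 'r', encoding='utf-8') as file:
  withBom = file.read()      # plain utf-8 keeps the BOM as '\ufeff'

with open(samplePath, 'r', encoding='utf-8-sig') as file:
  withoutBom = file.read()   # utf-8-sig strips it while decoding

os.remove(samplePath)

print(withBom.startswith('\ufeff'))     # True
print(withoutBom.startswith('\ufeff'))  # False
```

With this approach the first word ("A") is left untouched, because only the BOM itself is removed.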

The final function is therefore as follows:

In [5]:
def tokenizeWords(text):
  if text.startswith('\ufeff'): #(3)
    text = text[1:]

  doc = nlp(text)
  words = []
  for token in doc:
    if token.is_alpha: #(4)
      words.append(token.text)
  return words

tokensO = tokenizeWords(textO)
totalWordsO = len(tokensO)
print(tokensO)
print('Total words: ' + str(totalWordsO)+ '\n')

tokensH = tokenizeWords(textH)
totalWordsH = len(tokensH)
print(tokensH)
print('Total words: ' + str(totalWordsH) + '\n')

tokensG = tokenizeWords(textG)
totalWordsG = len(tokensG)
print(tokensG)
print('Total words: ' + str(totalWordsG)+ '\n')

['a', 'good', 'man', 'is', 'hard', 'to', 'find', 'flannery', 'oconnor', 'the', 'grandmother', 'did', 'nt', 'want', 'to', 'go', 'to', 'florida', 'she', 'wanted', 'to', 'visit', 'some', 'of', 'her', 'connections', 'in', 'east', 'tennessee', 'and', 'she', 'was', 'seizing', 'at', 'every', 'chance', 'to', 'change', 'baileys', 'mind', 'bailey', 'was', 'the', 'son', 'she', 'lived', 'with', 'her', 'only', 'boy', 'he', 'was', 'sitting', 'on', 'the', 'edge', 'of', 'his', 'chair', 'at', 'the', 'table', 'bent', 'over', 'the', 'orange', 'sports', 'section', 'of', 'the', 'journal', 'now', 'look', 'here', 'bailey', 'she', 'said', 'see', 'here', 'read', 'this', 'and', 'she', 'stood', 'with', 'one', 'hand', 'on', 'her', 'thin', 'hip', 'and', 'the', 'other', 'rattling', 'the', 'newspaper', 'at', 'his', 'bald', 'head', 'here', 'this', 'fellow', 'that', 'calls', 'himself', 'the', 'misfit', 'is', 'aloose', 'from', 'the', 'federal', 'pen', 'and', 'headed', 'toward', 'florida', 'and', 'you', 'read', 'here', 

This way, we also see that contractions survive as separate tokens even without the apostrophe (e.g., *n't* shows up as `'nt'`, as in `'did', 'nt'`).

In this regard, the only two letters that could pose issues are **'d'** (from *would*, *had*) and especially **'s'**. After checking, I observed the following:

* **'d'** is always treated as a separate token
* **'s'** is treated as a separate token **only** when it follows *wh-* or *th-* words. In contrast, in words like *its*, *lets*, or proper nouns, it’s interpreted as a plural, third person singular verb, or pronoun — and thus **merged with the preceding word**

  * Regarding this, I figured that distinguishing the **Saxon genitive** from a plural word wasn’t particularly necessary for the purpose of estimating **reading difficulty**, since it’s one of the first things learners pick up and doesn't have a meaningful standalone form
  * The same applies to **verbs**, especially since the lemmatization step (*see 4.1 Lemmatization*) reduces every form of *to be* to a single lemma that occurs often enough to be counted anyway, and plurals are lemmatized to their singular form regardless

In short, I decided these distinctions weren’t relevant enough to justify more complex filtering at this stage.

### 3.1.2 TOTAL SYLLABLES
To count the syllables, I used an additional spaCy pipeline called `spacy_syllables`.

In [6]:
!pip install spacy spacy_syllables

import spacy_syllables
nlp.add_pipe('syllables')

Collecting spacy_syllables
  Downloading spacy_syllables-3.0.2-py3-none-any.whl.metadata (5.3 kB)
Collecting pyphen>=0.10.0 (from spacy_syllables)
  Downloading pyphen-0.17.2-py3-none-any.whl.metadata (3.2 kB)
Downloading spacy_syllables-3.0.2-py3-none-any.whl (5.1 kB)
Downloading pyphen-0.17.2-py3-none-any.whl (2.1 MB)
Installing collected packages: pyphen, spacy_syllables
Successfully installed pyphen-0.17.2 spacy_syllables-3.0.2


<spacy_syllables.SpacySyllables at 0x7eb91afeab90>


This function relies on the `._.syllables_count` extension attribute (5), which `spacy_syllables` adds to each token, to compute the number of syllables.

In [7]:
def countSyllables(text):
  doc = nlp(text)
  totalSyllables = 0
  for token in doc:
    if token.is_alpha: # only count alphabet characters as tokens
      count = token._.syllables_count #(5)
      totalSyllables += count
  return totalSyllables

totalSyllablesO = countSyllables(textO)
print(titleO + ': '  + str(totalSyllablesO) + ' syllables in total \n')

totalSyllablesH = countSyllables(textH)
print(titleH + ': '  + str(totalSyllablesH) + ' syllables in total \n')

totalSyllablesG = countSyllables(textG)
print(titleG + ': '  + str(totalSyllablesG) + ' syllables in total \n')

﻿A GOOD MAN IS HARD TO FIND: 8126 syllables in total 

﻿HILLS LIKE WHITE ELEPHANTS: 1684 syllables in total 

﻿THE YELLOW WALLPAPER: 7541 syllables in total 


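For intuition about what is being counted, here is a rough cross-check that does not need spaCy. It is a deliberately naive heuristic (one syllable per group of consecutive vowels), not the method used by `spacy_syllables`, which relies on the `pyphen` package; the function name `naiveSyllables` is hypothetical.

```python
import re

def naiveSyllables(word):
  # Count groups of consecutive vowels (treating 'y' as a vowel); at least one per word
  groups = re.findall(r'[aeiouy]+', word.lower())
  return max(1, len(groups))

for word in ['good', 'grandmother', 'florida', 'white']:
  print(word, naiveSyllables(word))
```

A silent final *e* (as in *white*) is counted as an extra syllable here, so this heuristic tends to overestimate; it is only meant to show the idea, not to replace the pipeline component.
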

### 3.1.3 TOTAL SENTENCES

Finally, we simply apply the `doc.sents` property to split the text into individual sentences. These sentences are added to the `sentences` list, and the function returns only the length of that list, as the actual sentence content is not needed in this case.

In the function `countSentences(text)`:

1. spaCy processes the text by creating a `doc` object, which is essentially a [sequence of tokens](https://spacy.io/api/doc)
2. The `doc.sents` property splits the text into sentences
3. The `sent.text` attribute is used to access the individual sentences
4. The sentences are added to the `sentences` list using the `.append()` method
5. The function returns only the `len` of the list — i.e., the number of sentences detected in the text


In [8]:
def countSentences(text):
  doc = nlp(text)
  sentences = []
  for sent in doc.sents:
    if sent.text:
      sentences.append(sent.text)
  return len(sentences)

totalSentencesO = countSentences(textO)
print(titleO + ': '  + str(totalSentencesO) + ' sentences in total \n')

totalSentencesH = countSentences(textH)
print(titleH + ': '  + str(totalSentencesH) + ' sentences in total \n')

totalSentencesG = countSentences(textG)
print(titleG + ': '  + str(totalSentencesG) + ' sentences in total \n')

﻿A GOOD MAN IS HARD TO FIND: 420 sentences in total 

﻿HILLS LIKE WHITE ELEPHANTS: 160 sentences in total 

﻿THE YELLOW WALLPAPER: 376 sentences in total 



## 3.2 READING DIFFICULTY RANKING

Once all the required values have been calculated, we simply plug them into the **Flesch-Kincaid formula** to create a ranking of the texts — from the easiest to the most complex.


### 3.2.1 CALCULATING READABILITY

By applying the formula *(see section 3. Flesch-Kincaid Readability)*, the result should be a **value between 0 and 18**, which corresponds to increasing levels of difficulty based on the U.S. school grade system.

| Value | School Level | Student Age Range | Notes |
| :---  |     :---     |       :---        | :---  |
|0-1    | Pre-Kindergarten -- 1st grade | 3-7 | Basic level for those who just learned to read books. |
|1-5    | 1st grade -- 5th grade | 7-11 | Very easy to read. |
|5-11   | 5th grade -- 11th grade | 11-17 | Average level. Good for the majority of marketing materials. |
|11-18  | 11th grade -- 18th grade | 17 and above | The text is for skilled readers. For example, an academic paper. |


In [9]:
def calcReadability(totalSentences, totalWords, totalSyllables):
  readability = 0.39 * (totalWords/totalSentences) + 11.8 * (totalSyllables/totalWords) - 15.59
  return f'{readability:.3f}'

print('FLESCH-KINCAID VALUE: \n')

readabilityO = calcReadability(totalSentencesO, totalWordsO, totalSyllablesO)
print(titleO + ': '  + readabilityO + '\n')

readabilityH = calcReadability(totalSentencesH, totalWordsH, totalSyllablesH)
print(titleH + ': '  + readabilityH + '\n')

readabilityG = calcReadability(totalSentencesG, totalWordsG, totalSyllablesG)
print(titleG + ': '  + readabilityG + '\n')

FLESCH-KINCAID VALUE: 

﻿A GOOD MAN IS HARD TO FIND: 5.081

﻿HILLS LIKE WHITE ELEPHANTS: 1.538

﻿THE YELLOW WALLPAPER: 5.272



### 3.2.2 CONVERTING READABILITY INTO DIFFICULTY LEVEL

We can then create an additional function that automatically tells us the corresponding difficulty level based on the previously calculated Flesch-Kincaid score.


In [10]:
def difficulty(readability):
  score = float(readability) # calcReadability returns a formatted string, so convert it back to a number
  if score < 0 or score >= 19: # outside the 0-18 range covered by the scale
    level = 'Attention! Value not included in the Flesch-Kincaid Scale'
  elif score < 2:
    level = 'Absolute Beginner'
  elif score < 6:
    level = 'Beginner'
  elif score < 12:
    level = 'Intermediate'
  else:
    level = 'Advanced'

  return level

print('LEVEL OF DIFFICULTY: \n')

difficultyO = difficulty(readabilityO)
print(titleO + ': '  + str(difficultyO) + '\n')

difficultyH = difficulty(readabilityH)
print(titleH + ': '  + str(difficultyH) + '\n')

difficultyG = difficulty(readabilityG)
print(titleG + ': '  + str(difficultyG) + '\n')

LEVEL OF DIFFICULTY: 

﻿A GOOD MAN IS HARD TO FIND: Beginner

﻿HILLS LIKE WHITE ELEPHANTS: Absolute Beginner

﻿THE YELLOW WALLPAPER: Beginner



## 3.3 INITIAL RESULTS: Readability Comparison

Finally, we compile all the values calculated so far into a single block, so we can **compare** them and determine the best reading order for a more gradual and effective learning experience.


In [11]:
def partialInfos(title, totalSents, totalWords, totalSyllables, readability, difficulty):
  print('\"' + title + '\"')
  print('Number of sentences: ' + str(totalSents))
  print('Number of words: ' + str(totalWords))
  print('Number of syllables: ' + str(totalSyllables))
  print('Flesch-Kincaid value: ' + str(readability))
  print('Level of difficulty: ' + difficulty)
  return '\n'

partialInfosO = partialInfos(titleO, totalSentencesO, totalWordsO, totalSyllablesO, readabilityO, difficultyO)
print(partialInfosO)

partialInfosH = partialInfos(titleH, totalSentencesH, totalWordsH, totalSyllablesH, readabilityH, difficultyH)
print(partialInfosH)

partialInfosG = partialInfos(titleG, totalSentencesG, totalWordsG, totalSyllablesG, readabilityG, difficultyG)
print(partialInfosG)

"﻿A GOOD MAN IS HARD TO FIND"
Number of sentences: 420
Number of words: 6589
Number of syllables: 8126
Flesch-Kincaid value: 5.081
Level of difficulty: Beginner


"﻿HILLS LIKE WHITE ELEPHANTS"
Number of sentences: 160
Number of words: 1466
Number of syllables: 1684
Flesch-Kincaid value: 1.538
Level of difficulty: Absolute Beginner


"﻿THE YELLOW WALLPAPER"
Number of sentences: 376
Number of words: 6139
Number of syllables: 7541
Flesch-Kincaid value: 5.272
Level of difficulty: Beginner




From these results, we can see that **the best order** in which to read these texts is the following:
1. Hills Like White Elephants (1.538)
2. A Good Man is Hard to Find (5.081)
3. The Yellow Wallpaper (5.272)
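
The same ranking can be derived programmatically. A minimal sketch, with the scores typed in by hand from the Flesch-Kincaid output in section 3.2.1; `scores` and `readingOrder` are illustrative names, not variables from the notebook.

```python
# Scores copied by hand from the Flesch-Kincaid output in section 3.2.1
scores = {
  'A Good Man is Hard to Find': 5.081,
  'Hills Like White Elephants': 1.538,
  'The Yellow Wallpaper': 5.272,
}

# Sort (title, score) pairs by score, easiest first
readingOrder = sorted(scores.items(), key=lambda item: item[1])

for position, (title, score) in enumerate(readingOrder, start=1):
  print(str(position) + '. ' + title + ' (' + str(score) + ')')
```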

---

# 4. TEXT ANALYSIS

At this point, we can finally move on to analyzing the actual content of the texts. Specifically, we can:

* Build a kind of **author vocabulary** starting from `lemmas`

  * A *lemma* is the “base form” of a word — the form you'd find in a dictionary

* Analyze **lexical variety** (`vocab` / `totalWords` \* 100)

  * `vocab` works like `lemmas`, but filters out repeated occurrences
  * `totalWordsX = len(tokensX)`

* Count the **number of occurrences of each word** (`lemmas`) to identify which ones we could potentially learn through passive exposure
  *(see section 1.1.2 Vocabulary Acquisition)*

  * The result will be a dictionary of the form `{lemma: number of occurrences}`
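
Before working on the real texts, the three steps can be previewed on a hand-written list of lemmas. The list below is invented for illustration, and `toyLemmas`, `toyVocab`, `toyVariety`, and `toyCounts` are hypothetical names.

```python
toyLemmas = ['the', 'hill', 'be', 'white', 'the', 'hill', 'be', 'long']

# 1. Vocabulary: unique lemmas, alphabetically sorted
toyVocab = sorted(set(toyLemmas))

# 2. Lexical variety: unique lemmas as a percentage of all words
toyVariety = len(toyVocab) / len(toyLemmas) * 100

# 3. Occurrences: {lemma: number of occurrences}
toyCounts = {}
for lemma in toyLemmas:
  toyCounts[lemma] = toyCounts.get(lemma, 0) + 1

print(toyVocab)
print(f'{toyVariety:.2f}%')
print(toyCounts)
```

Here 5 distinct lemmas out of 8 words give a variety of 62.50%; on real texts the percentage is much lower because common words repeat far more often.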


## 4.1 LEMMATIZATION

To create a list of the vocabulary used by the author from our tokens, we first want to convert each individual word to its **base form**, also known as the *lemma*.

> Since the function takes about a minute to return results for each text, I chose to separate them into individual blocks.

In [12]:
def lemmatize(tokens):
  lemmas = []
  for token in tokens:
    lemma = nlp(token)[0].lemma_ # nlp(token) returns a Doc; [0] picks its first (normally only) token so that we can read its lemma
    lemmas.append(lemma)
  return lemmas

lemmasH = lemmatize(tokensH)
print(lemmasH)

['hill', 'like', 'white', 'elephant', 'ern', 'hemingway', 'the', 'hill', 'across', 'the', 'valley', 'of', 'the', 'ebro', 'be', 'long', 'and', 'white', 'on', 'this', 'side', 'there', 'be', 'no', 'shade', 'and', 'no', 'tree', 'and', 'the', 'station', 'be', 'between', 'two', 'line', 'of', 'rail', 'in', 'the', 'sun', 'close', 'against', 'the', 'side', 'of', 'the', 'station', 'there', 'be', 'the', 'warm', 'shadow', 'of', 'the', 'build', 'and', 'a', 'curtain', 'make', 'of', 'string', 'of', 'bamboo', 'bead', 'hung', 'across', 'the', 'open', 'door', 'into', 'the', 'bar', 'to', 'keep', 'out', 'fly', 'the', 'american', 'and', 'the', 'girl', 'with', 'he', 'sit', 'at', 'a', 'table', 'in', 'the', 'shade', 'outside', 'the', 'build', 'it', 'be', 'very', 'hot', 'and', 'the', 'express', 'from', 'barcelona', 'would', 'come', 'in', 'forty', 'minute', 'it', 'stop', 'at', 'this', 'junction', 'for', 'two', 'minute', 'and', 'go', 'on', 'to', 'madrid', 'what', 'should', 'we', 'drink', 'the', 'girl', 'ask', 's

In [13]:
lemmasO = lemmatize(tokensO)
print(lemmasO)

['a', 'good', 'man', 'be', 'hard', 'to', 'find', 'flannery', 'oconnor', 'the', 'grandmother', 'do', 'not', 'want', 'to', 'go', 'to', 'florida', 'she', 'want', 'to', 'visit', 'some', 'of', 'she', 'connection', 'in', 'east', 'tennessee', 'and', 'she', 'be', 'seize', 'at', 'every', 'chance', 'to', 'change', 'bailey', 'mind', 'bailey', 'be', 'the', 'son', 'she', 'live', 'with', 'she', 'only', 'boy', 'he', 'be', 'sit', 'on', 'the', 'edge', 'of', 'his', 'chair', 'at', 'the', 'table', 'bent', 'over', 'the', 'orange', 'sport', 'section', 'of', 'the', 'journal', 'now', 'look', 'here', 'bailey', 'she', 'say', 'see', 'here', 'read', 'this', 'and', 'she', 'stand', 'with', 'one', 'hand', 'on', 'she', 'thin', 'hip', 'and', 'the', 'other', 'rattle', 'the', 'newspaper', 'at', 'his', 'bald', 'head', 'here', 'this', 'fellow', 'that', 'call', 'himself', 'the', 'misfit', 'be', 'aloose', 'from', 'the', 'federal', 'pen', 'and', 'head', 'toward', 'florida', 'and', 'you', 'read', 'here', 'what', 'it', 'say', 

In [15]:
lemmasG = lemmatize(tokensG)
print(lemmasG)

['the', 'yellow', 'wallpaper', 'charlotte', 'perkin', 'gillman', 'it', 'be', 'very', 'seldom', 'that', 'mere', 'ordinary', 'people', 'like', 'john', 'and', 'myself', 'secure', 'ancestral', 'hall', 'for', 'the', 'summer', 'a', 'colonial', 'mansion', 'a', 'hereditary', 'estate', 'I', 'would', 'say', 'a', 'haunt', 'house', 'and', 'reach', 'the', 'height', 'of', 'romantic', 'felicity', 'but', 'that', 'would', 'be', 'ask', 'too', 'much', 'of', 'fate', 'still', 'I', 'will', 'proudly', 'declare', 'that', 'there', 'be', 'something', 'queer', 'about', 'it', 'else', 'why', 'should', 'it', 'be', 'let', 'so', 'cheaply', 'and', 'why', 'have', 'stand', 'so', 'long', 'untenante', 'john', 'laugh', 'at', 'I', 'of', 'course', 'but', 'one', 'expect', 'that', 'in', 'marriage', 'john', 'be', 'practical', 'in', 'the', 'extreme', 'he', 'have', 'no', 'patience', 'with', 'faith', 'an', 'intense', 'horror', 'of', 'superstition', 'and', 'he', 'scoff', 'openly', 'at', 'any', 'talk', 'of', 'thing', 'not', 'to', 'b

> Note how the *nt* from negations (e.g., in `lemmasO`) has indeed been converted into *not* during lemmatization.

## 4.2 AUTHOR'S VOCABULARY

To define the author's vocabulary, we remove all repeated occurrences of the same word by converting the list into a `set()`. We can also apply the `sorted()` function, which returns the elements of the set as an alphabetically ordered list.

From an active learning perspective, this gives us a ready-made list of vocabulary items we could choose to study **before reading** the text.
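
One detail worth knowing: `sorted()` compares strings by character code, so an uppercase lemma such as `'I'` comes before all lowercase ones, which is why `'I'` is the first item in the vocabulary printed for this section. A minimal sketch on a few sample lemmas; passing `key=str.lower` is an optional variation, not something the notebook does.

```python
sample = ['a', 'I', 'about', 'a', 'I']

defaultOrder = sorted(set(sample))                      # uppercase sorts first
caseInsensitiveOrder = sorted(set(sample), key=str.lower)

print(defaultOrder)          # ['I', 'a', 'about']
print(caseInsensitiveOrder)  # ['a', 'about', 'I']
```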


In [16]:
def listVocab(lemmas):
  vocab = sorted(set(lemmas))
  return vocab

vocabH = listVocab(lemmasH)
vocabCountH = len(vocabH)
print(titleH + ': ' + str(vocabCountH) + ' vocabs in total')
print(str(vocabH) + '\n')

vocabO = listVocab(lemmasO)
vocabCountO = len(vocabO)
print(titleO + ': ' + str(vocabCountO) + ' vocabs in total')
print(str(vocabO) + '\n')

vocabG = listVocab(lemmasG)
vocabCountG = len(vocabG)
print(titleG + ': ' + str(vocabCountG) + ' vocabs in total')
print(str(vocabG) + '\n')

﻿HILLS LIKE WHITE ELEPHANTS: 267 vocabs in total
['I', 'a', 'about', 'absinthe', 'across', 'afraid', 'afterward', 'again', 'against', 'air', 'all', 'along', 'american', 'amuse', 'an', 'and', 'ani', 'another', 'any', 'anybody', 'anything', 'around', 'ask', 'at', 'away', 'awfully', 'back', 'bag', 'bamboo', 'bank', 'bar', 'barcelona', 'barroom', 'be', 'bead', 'because', 'beer', 'before', 'between', 'beyond', 'big', 'blow', 'bother', 'bright', 'brightly', 'bring', 'brown', 'build', 'but', 'call', 'can', 'care', 'carry', 'cerveza', 'close', 'cloud', 'color', 'come', 'cool', 'could', 'country', 'course', 'curtain', 'cut', 'damp', 'day', 'del', 'do', 'door', 'doorway', 'down', 'drank', 'drink', 'dry', 'ebro', 'elephant', 'else', 'end', 'ern', 'especially', 'ever', 'every', 'everything', 'everywhere', 'express', 'far', 'feel', 'field', 'fine', 'finish', 'five', 'fly', 'for', 'forty', 'four', 'from', 'get', 'girl', 'glass', 'go', 'good', 'grain', 'ground', 'guess', 'hand', 'happy', 'hat', 'have

### 4.2.1 LEXICAL VARIETY

In [17]:
def vocabVariety(vocabCount, totalWords):
  variety = (vocabCount / totalWords) * 100
  return f'{variety:.2f}'

varietyH = vocabVariety(vocabCountH, totalWordsH)
print(titleH + ': ' + varietyH + '% \n')

varietyO = vocabVariety(vocabCountO, totalWordsO)
print(titleO + ': ' + varietyO + '% \n')

varietyG = vocabVariety(vocabCountG, totalWordsG)
print(titleG + ': ' + varietyG + '% \n')

﻿HILLS LIKE WHITE ELEPHANTS: 18.21% 

﻿A GOOD MAN IS HARD TO FIND: 16.53% 

﻿THE YELLOW WALLPAPER: 16.47% 



We can see that the percentage of vocabulary variety is not particularly high, which suggests that there will likely be some words that are learnable through **passive acquisition** — even though we don't know exactly how many yet. That will become clear in the next step, when we calculate word occurrences.


## 4.3 OCCURRENCES

To get the total occurrences of each individual word, we simply sum all the duplicates within the `lemmas` lists (not `vocab`, which has already been filtered).

### 4.3.1 COUNTING AND SORTING OCCURRENCES

To count occurrences, we create a function that takes the list of lemmas (`['sequence', 'of', 'objects']`) and builds a dictionary (`{key: value}`) in which each lemma is a key and its number of occurrences is the value.

Therefore, another way to get the author's vocabulary would be to first count occurrences and then extract only the keys. However, this would break the logical flow of the process (Lemmas → Vocabulary → Passively Acquirable Words).

> I am aware that my choice to count occurrences on lemmas rather than on words is debatable, especially due to irregular verbs, where past forms should ideally be counted separately for passive learning. However, it is also important to remember that:
* The reader should be able to understand the verb variant from context (or at least from a first translation) and register its base meaning.
* We are still analyzing very short texts, so making too many distinctions would be unnecessary and would result in very few occurrences per word.

Inside the same function, we can also reorder the words in **descending order of occurrences**:

* Since `lemmasCount` is a dictionary, we need to add `.items()` to make everything work — this converts the dictionary into tuples `(key, value)`.
* The `sorted()` function orders the elements, but we want to specify more precisely how:

  * `key=lambda` starts a small anonymous function expression. The lambda receives one or more arguments and returns a value.
  * `item: item[1]` tells `sorted()` to sort by the **value** (which is the second element), not the key (which is `item[0]`).
  * `reverse=True` ensures the order is descending.
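
The sorting step described above can be sketched as follows. The counts are a handful of values taken from the Hemingway output in this section, and `sortByCount` is a hypothetical helper name; the counting cell itself returns the dictionary in insertion order, unsorted.

```python
def sortByCount(lemmasCount):
  # .items() yields (lemma, count) tuples; item[1] is the count
  ranked = sorted(lemmasCount.items(), key=lambda item: item[1], reverse=True)
  return dict(ranked)  # dictionaries keep insertion order (Python 3.7+)

sampleCounts = {'hill': 6, 'like': 10, 'the': 127, 'girl': 22, 'it': 57}
print(sortByCount(sampleCounts))
```

Converting the sorted tuples back with `dict()` keeps the `{lemma: number of occurrences}` shape while preserving the new order.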

In [18]:
def countLemmas(lemmas):
  lemmasCount = {}
  for lemma in lemmas:
    if lemma in lemmasCount:
      lemmasCount[lemma] += 1
    else:
      lemmasCount[lemma] = 1

  return lemmasCount

lemmasCountH = countLemmas(lemmasH)
print(str(lemmasCountH) + '\n')

lemmasCountO = countLemmas(lemmasO)
print(str(lemmasCountO) + '\n')

lemmasCountG = countLemmas(lemmasG)
print(lemmasCountG)

{'hill': 6, 'like': 10, 'white': 7, 'elephant': 5, 'ern': 1, 'hemingway': 1, 'the': 127, 'across': 6, 'valley': 2, 'of': 24, 'ebro': 2, 'be': 38, 'long': 2, 'and': 48, 'on': 11, 'this': 4, 'side': 5, 'there': 4, 'no': 7, 'shade': 3, 'tree': 4, 'station': 6, 'between': 1, 'two': 9, 'line': 2, 'rail': 1, 'in': 9, 'sun': 2, 'close': 1, 'against': 3, 'warm': 2, 'shadow': 2, 'build': 2, 'a': 7, 'curtain': 8, 'make': 4, 'string': 2, 'bamboo': 1, 'bead': 6, 'hung': 1, 'open': 1, 'door': 1, 'into': 2, 'bar': 3, 'to': 28, 'keep': 1, 'out': 6, 'fly': 1, 'american': 1, 'girl': 22, 'with': 12, 'he': 13, 'sit': 3, 'at': 21, 'table': 8, 'outside': 1, 'it': 57, 'very': 1, 'hot': 2, 'express': 1, 'from': 4, 'barcelona': 1, 'would': 7, 'come': 8, 'forty': 1, 'minute': 4, 'stop': 3, 'junction': 1, 'for': 8, 'go': 5, 'madrid': 1, 'what': 7, 'should': 2, 'we': 27, 'drink': 7, 'ask': 5, 'she': 17, 'have': 20, 'take': 5, 'off': 2, 'hat': 1, 'put': 5, 'pretty': 1, 'man': 11, 'say': 36, 'let': 4, 'beer': 8, '

Even with just these few words, we can already get an idea of which vocabulary items are the most frequent in the language — and therefore, which ones we likely won't have much difficulty learning.
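
With the counts available, the 12-exposure threshold from section 1.1.2 can be applied directly. A minimal sketch, assuming a hypothetical helper `passiveWords` and a constant `MIN_EXPOSURES`; the sample counts are copied from the Hemingway output above.

```python
MIN_EXPOSURES = 12  # threshold assumed in section 1.1.2

def passiveWords(lemmasCount, minExposures=MIN_EXPOSURES):
  # Keep only the lemmas that occur at least minExposures times
  return {lemma: count for lemma, count in lemmasCount.items() if count >= minExposures}

sampleCounts = {'hill': 6, 'on': 11, 'with': 12, 'he': 13, 'girl': 22, 'the': 127}
print(passiveWords(sampleCounts))
```

In this sample, *with*, *he*, *girl*, and *the* reach the threshold, while *hill* and *on* fall short.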