<a href="https://colab.research.google.com/github/letizia-z/letizia-z/blob/main/Acquire_L2_words_from_consuming_content.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

|     Course                     | Academic Year |
|    :---                        |     ---:      |
| Programming for the Humanities |  *2023/2024*  |

*This has been my first programming course.*

# HOW MANY WORDS CAN I ACQUIRE FROM CONSUMING CONTENT IN L2

### Project Description

This project uses Python to perform linguistic analysis on three short stories. The goal is to extract useful information regarding the complexity and variety of the language used in the texts, as well as the number of occurrences of each word. This tool is particularly useful for those who want to deepen their linguistic understanding and improve their vocabulary through the consumption of content in the target language.

* **Input data:** three short stories in English
* **Output data:** sentence, word, and syllable count; Flesch-Kincaid score and reading difficulty; vocabulary variety; possible encounter with new words and passively learnable words (i.e., those that exceed a certain number of occurrences within the texts).


### Name and URL of programs/notebooks reused in the project

|Name|URL|
| :---        |    :---  |
|*Python*|*Notebooks from lectures, especially "09_NLP"*|
||*https://www.datacamp.com/tutorial/sort-a-dictionary-by-value-python*|
|*Spacy*|*https://spacy.io/api/doc*|
||https://stackoverflow.com/questions/405161/detecting-syllables-in-a-word|
|*Matplotlib.pyplot* |*https://matplotlib.org/stable/api/pyplot_summary.html*|
||*https://stackoverflow.com/questions/66446687/how-do-i-make-a-dashed-horizontal-line-with-matplotlib* |
|*Artificial Intelligence*|*ChatGPT, Gemini*|

*This code was developed with the assistance of artificial intelligence (AI) tools to generate a starting point or suggestions. However, the final code has been reviewed, modified, and adapted according to my specific needs, and represents the result of personal work. Any similarities with other works are purely coincidental and unintentional. I have taken all necessary precautions to ensure that the code presented here does not constitute plagiarism and respects copyright laws.*


---

# 1. INTRODUCTION

Anyone who has seriously studied a foreign language has likely experienced the so-called ***“language learning plateau”*** — that moment when you already have enough vocabulary to understand and express more or less everything you want, making it increasingly difficult to learn new words.
To overcome this plateau, reading or generally increasing your consumption of content in the target language is often recommended.
With this in mind, I decided to analyze some texts myself to see **how effective content consumption really is**, in this case with reference to the **English language**.


## 1.1 SOME BASIC CONCEPTS

Before diving into the actual project, it's important to provide two key pieces of information.

### 1.1.1 CEFR LEVELS
Each language proficiency level includes an approximate number of words that are known and usable by the speaker. Therefore, depending on one's starting level, there will be differences in the time needed to understand a text and in the number of words that can be learned from it.

| LEVEL | WORDS | HOURS |
| --- | --- | --- |
| A1 | 700 | 100 |
| A2 | 1500 | 180/200 |
| B1 | 2500 | 350/400 |
| B2 | 4000 | 500/600 |
| C1 | 8000 | 700/800 |
| C2 | 16000 | 1000/1200 |

### 1.1.2 VOCABULARY ACQUISITION

Simply encountering a word in a text is obviously not enough to learn it. When it comes to learning a new word, we can take two main approaches:

* **Active learning**, where the student who comes across a new term makes a conscious effort to remember it (for example, by using flashcards)
* **Passive acquisition**, which relies primarily on repeated exposure to the same word, ideally in different contexts

Most studies in this field focus on first language (L1) acquisition rather than second language (L2) learning. This is partly because it is still unclear how many exposures are needed to acquire a word, as this also depends on individual cognitive abilities.

According to Uchihara et al.:

> *“the number of encounters necessary to learn words rang\[es] from 6, 10, 12, to more than 20 times. \[That is to say,] the number of encounters necessary for learning of vocabulary to occur during meaning-focussed input remains unclear”*

Therefore, for the purposes of my project, I decided to assume that the **minimum number of exposures required for passive vocabulary acquisition is 12**, based in part on a study by Holly L. Storkel et al. on L1 acquisition in children.
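
To make the threshold concrete, here is a minimal sketch of how it could be applied to a list of words. The function name `passivelyLearnable`, the constant `MIN_EXPOSURES`, and the toy word list are assumptions made for illustration only; they are not part of the analysis that follows.

```python
from collections import Counter

MIN_EXPOSURES = 12  # threshold assumed in this project

def passivelyLearnable(words, minExposures=MIN_EXPOSURES):
  """Return {word: count} for the words seen at least minExposures times."""
  counts = Counter(words)
  return {word: n for word, n in counts.items() if n >= minExposures}

# Toy input (hypothetical): only 'the' and 'hill' reach the threshold
sample = ['the'] * 15 + ['hill'] * 12 + ['elephant'] * 3
print(passivelyLearnable(sample))
```

With this kind of filter, a word seen 3 times would count as "encountered" but not as "passively learnable".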



---

# 2. THE SHORT STORIES

First, I selected three short stories that I was unfamiliar with, written by authors from different time periods, genders, and styles. The idea behind this choice was that **greater variety** would expose the reader to the largest possible number of different words. This is ideal from the perspective of *active vocabulary study*, but it could be problematic for *passive acquisition*, since a wider vocabulary range would likely result in fewer words reaching the 12-occurrence threshold.

The short stories analyzed are:

* *“The Yellow Wallpaper”* by C. P. Gilman (1892)
* *“Hills Like White Elephants”* by E. Hemingway (1927)
* *“A Good Man is Hard to Find”* by F. O’Connor (1953)



> Note: before running the notebook, make sure to manually upload the three text files to your own Colab runtime, inside a `short_stories/` folder



## 2.1 IMPORTING AND OPENING THE FILES

First, we'll need to open the files of the selected short stories, so we can begin analyzing them. To make sure I’ve opened the correct files, I’ll also print the first 100 characters of each one.

To distinguish between the three texts, we’ll add the initial of each author’s last name to the variable names:

* **O** = *“A Good Man is Hard to Find”* by F. O’Connor (1953)
* **H** = *“Hills Like White Elephants”* by E. Hemingway (1927)
* **G** = *“The Yellow Wallpaper”* by C. P. Gilman (1892)


In [None]:
def readFile(filePath):
  with open(filePath, 'r', encoding='utf-8') as file:
    return file.read()


filePathO = 'short_stories/AGoodManIsHardToFind_OConnor1953.txt'
filePathH = 'short_stories/HillsLikeWhiteElephants_Hemingway1927.txt'
filePathG = 'short_stories/TheYellowWallpaper_Gillman1892.txt'

rawTextO = readFile(filePathO)
rawTextH = readFile(filePathH)
rawTextG = readFile(filePathG)

print(str(rawTextO[:100]) + '\n')
print(str(rawTextH[:100]) + '\n')
print(str(rawTextG[:100]) + '\n')


A GOOD MAN IS HARD TO FIND
Flannery O’Connor, 1953
The grandmother didn’t want to go to Florida. Sh

﻿HILLS LIKE WHITE ELEPHANTS
Ernest Hemingway, 1927 
The hills across the valley of the Ebro were lon

﻿THE YELLOW WALLPAPER
Charlotte Perkins Gillman, 1892
It is very seldom that mere ordinary people li



### 2.1.1 EXTRACTING THE TITLE

I also decided to take advantage of the formatting of these files (with the title written in uppercase) to create a function that extracts only the title of the short story. This way, we can easily refer back to it in later stages, especially when displaying the results of the various analysis steps.

In [None]:
def extractTitle(filePath):
  with open(filePath, 'r', encoding='utf-8') as file:
    for line in file:
      strippedLine = line.strip() # remove whitespace at the beginning and at the end of the line
      if strippedLine.isupper(): # the title is expected to be in uppercase
        return strippedLine
  return 'Title not found'  # in case the title isn't in uppercase as expected

titleO = extractTitle(filePathO)
print(str(titleO) + '\n')

titleH = extractTitle(filePathH)
print(str(titleH) + '\n')

titleG = extractTitle(filePathG)
print(str(titleG) + '\n')

A GOOD MAN IS HARD TO FIND

﻿HILLS LIKE WHITE ELEPHANTS

﻿THE YELLOW WALLPAPER



## 2.2 PREPROCESSING

One last necessary step is preprocessing the text by making slight modifications to simplify the subsequent analysis:

* Convert the entire text to **lowercase**: this ensures that during occurrence counting, identical words are counted together (1), rather than being treated as separate words due to capitalization
* Remove **apostrophes**: I encountered issues related to apostrophes during tokenization, so I decided to remove them immediately, verifying that this neither affected tokenization nor influenced the later counts (2)

In [None]:
def preprocess(text):
  text = text.lower() #(1)
  text = text.replace('’', '') #(2)
  return text

textO = preprocess(rawTextO)
textH = preprocess(rawTextH)
textG = preprocess(rawTextG)

print(str(textO[:100]) + '\n')
print(str(textH[:100]) + '\n')
print(str(textG[:100]) + '\n')

a good man is hard to find
flannery oconnor, 1953
the grandmother didnt want to go to florida. she 

﻿hills like white elephants
ernest hemingway, 1927 
the hills across the valley of the ebro were lon

﻿the yellow wallpaper
charlotte perkins gillman, 1892
it is very seldom that mere ordinary people li
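
The `preprocess` function above removes only the typographic apostrophe (`’`), which is the one used in these three files. As a hedged sketch, a variant that also handles the straight ASCII apostrophe (`'`) could look like the following; the name `preprocessAnyApostrophe` and the sample sentence are hypothetical and not used elsewhere in the project.

```python
def preprocessAnyApostrophe(text):
  """Lowercase the text and remove both curly and straight apostrophes."""
  text = text.lower()
  for apostrophe in ('’', "'"):  # typographic and ASCII apostrophes
    text = text.replace(apostrophe, '')
  return text

# Toy sentence mixing both apostrophe styles
print(preprocessAnyApostrophe("The grandmother didn’t and couldn't"))
```

This would only matter for texts from other sources, where the apostrophe style may differ.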




---
# 3. FLESCH-KINCAID READABILITY

The first thing we want to do is determine which text would be best to read first, moving from the easiest to the most difficult in order to **gradually build our vocabulary**.
To do this, for English we can use the Flesch-Kincaid Grade Level Formula:

$$
0.39 \cdot \frac{\text{total words}}{\text{total sentences}} + 11.8 \cdot \frac{\text{total syllables}}{\text{total words}} - 15.59
$$
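
As a quick sketch, the formula translates directly into a small Python function. The function name `fleschKincaidGrade` and the sample counts are assumptions for illustration; the actual counts are computed in the steps below.

```python
def fleschKincaidGrade(totalWords, totalSentences, totalSyllables):
  """Flesch-Kincaid Grade Level from the three raw counts."""
  if totalWords == 0 or totalSentences == 0:
    raise ValueError('totalWords and totalSentences must be greater than zero')
  wordsPerSentence = totalWords / totalSentences
  syllablesPerWord = totalSyllables / totalWords
  return 0.39 * wordsPerSentence + 11.8 * syllablesPerWord - 15.59

# Hypothetical counts: 100 words, 10 sentences, 150 syllables
grade = fleschKincaidGrade(totalWords=100, totalSentences=10, totalSyllables=150)
print(round(grade, 2))
```

The result roughly corresponds to a US school grade, so a lower value indicates an easier text.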


## 3.1 CALCULATING THE VALUES

From the formula, we see that we need to calculate three values:

* Total sentences (`totalSentences`)
* Total words (`totalWords`)
* Total syllables (`totalSyllables`)

To do this, we’ll use the `spaCy` library, downloading its English language model.

In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm

import spacy
nlp = spacy.load('en_core_web_sm')


### 3.1.1 TOTAL WORDS
While drafting the project, I decided to start with word tokenization, so that I could immediately spot any potential errors that might also affect later stages of the analysis. In fact, this turned out to be one of the steps where I encountered the most challenges.

Knowing that I would eventually need proper word tokenization for later steps, I decided to create a function dedicated solely to that task. I then obtained the total number of tokens simply by printing the result with a `len()` call outside the function.

However, during this first tokenization attempt, I noticed two main issues:

1. A `\ufeff` character (BOM – Byte Order Mark) appeared at the beginning of the text
2. Punctuation and line breaks were being counted as tokens, even though I only wanted to include **words and numbers**

<div style="text-align: center">
<img src=pics/token_ufeff.png width=75%/>
</div>

To address these issues:

* I started tokenization from the first word after any potential BOM

  * Using `.remove` would not be suitable, as it would also remove the first word after the BOM — in this case, the word "A" (3)
* I defined the function so that it would only add to the token list those strings that consist entirely of letters (4)

> *Note:* the use of `token.is_alpha` filters out all tokens containing apostrophes, including the author's name ("O’Connor"). This problem was already resolved during the **preprocessing phase** (see 2.2).
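
As an aside on point (3), the BOM can also be dropped at read time: Python's built-in `utf-8-sig` codec skips a leading BOM if one is present. The sketch below uses a temporary toy file (not one of the project files) to compare the two encodings; the variable names are hypothetical.

```python
import os
import tempfile

# Write a small UTF-8 file that starts with a BOM (toy content)
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt', delete=False) as tmp:
  tmp.write('\ufeffHILLS LIKE WHITE ELEPHANTS\n')
  tmpPath = tmp.name

# 'utf-8' keeps the BOM as the first character of the text
with open(tmpPath, 'r', encoding='utf-8') as file:
  withBom = file.read()

# 'utf-8-sig' silently removes a leading BOM, if present
with open(tmpPath, 'r', encoding='utf-8-sig') as file:
  withoutBom = file.read()

os.remove(tmpPath)

print(withBom.startswith('\ufeff'))
print(withoutBom.startswith('\ufeff'))
```

This project keeps the explicit `startswith('\ufeff')` check inside the tokenization function instead, which works on text that has already been read.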

The final function is therefore as follows:

In [None]:
def tokenizeWords(text):
  if text.startswith('\ufeff'): #(3)
    text = text[1:]

  doc = nlp(text)
  words = []
  for token in doc:
    if token.is_alpha: #(4)
      words.append(token.text)
  return words

tokensO = tokenizeWords(textO)
totalWordsO = len(tokensO)
print(tokensO)
print('Total words: ' + str(totalWordsO)+ '\n')

tokensH = tokenizeWords(textH)
totalWordsH = len(tokensH)
print(tokensH)
print('Total words: ' + str(totalWordsH) + '\n')

tokensG = tokenizeWords(textG)
totalWordsG = len(tokensG)
print(tokensG)
print('Total words: ' + str(totalWordsG)+ '\n')

This way, we also see that contractions are still split into their parts (e.g., *n't* shows up as `'nt'`), since spaCy recognizes them as separate tokens even without the apostrophe.

In this regard, the only two letters that could pose issues are **'d'** (from *would*, *had*) and especially **'s'**. After checking, I observed the following:

* **'d'** is always treated as a separate token
* **'s'** is treated as a separate token **only** when it follows *wh-* or *th-* words. In contrast, in words like *its*, *lets*, or proper nouns, it’s interpreted as a plural, third person singular verb, or pronoun — and thus **merged with the preceding word**

  * Regarding this, I figured that distinguishing the **Saxon genitive** from a plural word wasn’t particularly necessary for the purpose of estimating **reading difficulty**, since it’s one of the first things learners pick up and doesn't have a meaningful standalone form
  * The same applies to **verbs**, especially since in the lemmatization step (*see 4.1 Lemmatization*) the verb *to be* will appear often enough to be counted anyway, and plurals will be lemmatized to their singular form regardless

In short, I decided these distinctions weren’t relevant enough to justify more complex filtering at this stage.

### 3.1.2 TOTAL SYLLABLES
To count the syllables, I used `spacy_syllables`, an additional package that adds a syllable-counting component to the spaCy pipeline.
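
For intuition about what syllable counting involves, here is a rough plain-Python heuristic that counts groups of consecutive vowels. It is only an illustrative sketch under simplifying assumptions: it is not the method used by `spacy_syllables`, it is not used in this project, and the function name `countSyllablesHeuristic` is hypothetical.

```python
import re

def countSyllablesHeuristic(word):
  """Rough syllable estimate: count groups of consecutive vowels."""
  word = word.lower()
  vowelGroups = re.findall(r'[aeiouy]+', word)
  count = len(vowelGroups)
  # A final 'e' is usually silent ('like', 'white'), except in '-le' endings ('people')
  if word.endswith('e') and not word.endswith('le') and count > 1:
    count -= 1
  return max(count, 1)  # every word has at least one syllable

for word in ['hills', 'like', 'white', 'elephants']:
  print(word, countSyllablesHeuristic(word))
```

A heuristic like this misjudges many irregular English words, which is one reason to rely on a dedicated component instead.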

In [None]:
!pip install spacy spacy_syllables

import spacy_syllables
nlp.add_pipe('syllables')


This function relies on the `._.syllables_count` extension attribute (5), which `spacy_syllables` adds to each token, to compute the number of syllables.