## Tutorial for the Intended Usage of the Feature Extraction Pipeline API

##### Please ensure that you have spaCy and the "es_core_news_md" pipeline installed

In [1]:
# !python -m spacy download es_core_news_md

In [1]:
import pprint
from utils import read_corpus
from features import feature_pipeline

# Setup pretty printer
p = pprint.PrettyPrinter(indent=4, width=140)

### Introduction
This notebook illustrates and explains the main workflows supported by the feature extraction pipeline API, which computes statistical/numerical features from raw texts in our collected corpus of Spanish texts. Treat this notebook as a secondary resource for understanding the API; the primary resource is the source code in `features.py`.

Let's begin by broadly describing the stages of the feature extraction pipeline. To calculate any statistical feature from a raw, unprocessed text, the following steps must occur in sequence:
1. Initialize the pipeline object
2. Pre-process (clean up) the text to remove stray whitespace, numerals, characters, etc.
3. Extract fundamental attributes of the text (e.g., tokens, POS tags, lemmas) using spaCy
4. Use the extracted attributes to calculate statistical features of the text (e.g., total number of tokens, type-token ratio, pronoun density)
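
To make stage 4 concrete, here is a minimal sketch of one of the features named above, the type-token ratio (distinct tokens divided by total tokens). `type_token_ratio` is a hypothetical helper written for this illustration; it is not the implementation in `features.py`, which may treat punctuation or casing differently.

```python
def type_token_ratio(tokens):
    """Return the number of distinct tokens divided by the total number of tokens."""
    if not tokens:
        return 0.0  # avoid dividing by zero on an empty text
    return len(set(tokens)) / len(tokens)


# Hand-written token list, standing in for the output of stage 3
example_tokens = ["la", "primavera", "es", "muy", "agradable", "y", "hermosa", "la", "primavera"]
ttr = type_token_ratio(example_tokens)
print(ttr)
```

A ratio close to 1 means almost every token is distinct, while a lower ratio indicates more repetition.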

It is not possible to calculate a statistical feature without first extracting the fundamental attributes that the feature depends on. In other words, we cannot reach stage 4 without completing stages 1, 2 and 3. However, the API provides some shortcuts, so we almost never need to call the `.preprocess()` method explicitly in our code. Depending on our usage, we can sometimes even skip writing the spaCy step. The key point to remember is that the unprocessed text must be passed as an argument at some stage of the pipeline; which stage is flexible.

Through this tutorial we will understand these shortcuts and learn how to best apply them in workflows.

### Corpus Reading
Let's begin by loading the corpus and seeing a broad view of what it looks like:

In [2]:
corpus = read_corpus()

for k, v in corpus.items():
    print(f"{k}: {len(v)}")

A2: 62
B1: 42
A1: 94
B2: 205


Let's pick a text from the corpus for the purposes of this demo:

In [3]:
unprocessed_text = corpus["A1"][81]["content"]
print(unprocessed_text)

CAPÍtULO 12

Hoy es sábado. El día tan esperado de la final. Ha venido mucho 
público a ver el partido.

Laura y Mónica saludan con la mano a su tutor Roberto, a Ángela 
y a Carmen que están unas gradas más atrás con otros profesores 
de otros colegios.

Guillermo hoy está convocado por el entrenador, como suplente. 
Laura y Mónica están en la entrada de los vestuarios dando ánimos 
a sus compañeros.

—¡Guille! —dice Mónica a su amigo —¡A ver si marcas un gol!
—No sé si jugaré. Estoy de suplente.
—Yo creo que sí que jugarás... —dice Martín.
—Bueno, tenemos que ir a cambiarnos —dice Sergio.
—Oye, Laura —dice Mónica—. ¿Aquella chica no es...?
—¿Quién?
—Allá...
—No, no la veo.
—Ahora no se ve... Quizás me he equivocado.
Martín mira hacia donde señalan sus amigas. Él sí la ve. O no. 

No, no puede ser ella. Martín se pone nervioso.
El árbitro señala el principio del partido. El equipo del Peralta tiene suerte y a los pocos minutos marca un gol. Al equipo de 
Barcelona le cuesta organizar s

We can see that this piece of text is formatted somewhat "irregularly". It has hard line breaks in the middle of grammatical sentences. It also has a title at the top, which is not a necessary component of the content of the text. In order to derive syntactic attributes like POS tags and dependency parses, we will need to convert this text into a more standard form that can be interpreted by spaCy.

### Text Preprocessing
Let's create a pipeline for cleaning up this text and extracting important attributes and features from it.

In [6]:
# This step passes the un-processed text to the pipeline and automatically cleans it up using the .preprocess() method
pipe = feature_pipeline(unprocessed_text)
print(pipe.text)

capítulo hoy es sábado. el día tan esperado de la final. ha venido mucho público a ver el partido. laura y mónica saludan con la mano a su tutor roberto, a ángela y a carmen que están unas gradas más atrás con otros profesores de otros colegios. guillermo hoy está convocado por el entrenador, como suplente. laura y mónica están en la entrada de los vestuarios dando ánimos a sus compañeros. —¡guille! —dice mónica a su amigo —¡a ver si marcas un gol! —no sé si jugaré. estoy de suplente. —yo creo que sí que jugarás... —dice martín. —bueno, tenemos que ir a cambiarnos —dice sergio. —oye, laura —dice mónica—. ¿aquella chica no es...? —¿quién? —allá... —no, no la veo. —ahora no se ve... quizás me he equivocado. martín mira hacia donde señalan sus amigas. él sí la ve. o no. no, no puede ser ella. martín se pone nervioso. el árbitro señala el principio del partido. el equipo del peralta tiene suerte y a los pocos minutos marca un gol. al equipo de barcelona le cuesta organizar su juego. además

The text looks much more standardized now, which makes it easier for downstream functions to extract things from it in a consistent manner.
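
As a rough idea of what this kind of clean-up involves, the sketch below lowercases the text, drops numerals and collapses line breaks and repeated whitespace. `toy_preprocess` is a hypothetical stand-in written for this illustration; the actual `.preprocess()` method in `features.py` may apply different or additional rules.

```python
import re


def toy_preprocess(text):
    """Hypothetical stand-in for .preprocess(); not the code in features.py."""
    text = re.sub(r"\d+", " ", text)  # drop numerals
    text = re.sub(r"\s+", " ", text)  # collapse line breaks and runs of whitespace
    return text.strip().lower()


raw_excerpt = (
    "CAPÍtULO 12\n\nHoy es sábado. El día tan esperado de la final. "
    "Ha venido mucho \npúblico a ver el partido."
)
cleaned_excerpt = toy_preprocess(raw_excerpt)
print(cleaned_excerpt)
```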

In the above cell we only called the pipeline object constructor, but it automatically gave us a cleaned text. That is because the constructor calls the `.preprocess()` method whenever a text is supplied to the object. This behaviour is functionally equivalent to the following code:

In [6]:
pipe = feature_pipeline()
cleaned_text = pipe.preprocess(unprocessed_text)
print(cleaned_text)

la primavera la primavera principia el veintiuno de marzo y dura hasta el veintiuno de junio. la primavera es muy agradable y hermosa. las flores crecen. los árboles y los campos se cubren de verdura y los pájaros cantan en ellos. todos los hombres, las mujeres y los niños están alegres. algunas veces hace frío en abril y aún en mayo. algunas veces, pero no frecuentemente, hay nieve y hielo en abril, y entonces muchas flores y plantas se mueren.


As a rule of thumb, it is easiest to pass the unprocessed text to the pipeline at initialization: the constructor resets all of the attributes to empty lists, giving the pipeline a blank slate for each new text.

### Extracting Attributes of the Text using spaCy
Now that we have cleaned up the text into a standard form, spaCy will be able to derive some fundamental syntactic attributes from the text. Some attributes of the text that we might be interested in are the sentences, tokens and POS tags. We can try accessing them, but they won't be accessible at this stage since by default the pipeline constructor does not execute any of the spaCy methods.

In [7]:
# The below calls to access class attributes will just return empty lists since those items have not been extracted yet
print(pipe.sentences)
print(pipe.tokens)
print(pipe.pos_tags)

[]
[]
[]


These attributes must be extracted from the text using spaCy's Spanish pipeline. It is recommended that we generate the attribute lists as and when we need them, since extracting all of the attributes for every text can be a bit slow (although the pre-processing of the text is typically the slowest step in the pipeline). We can extract some attributes as follows:

In [8]:
pipe.get_sentences()  # populates the pipe.sentences attribute
pipe.get_tokens()  # populates the pipe.tokens attribute

# Print out the attributes that were populated
p.pprint(pipe.sentences)
print()
print(pipe.tokens)

[   'la primavera la primavera principia el veintiuno de marzo y dura hasta el veintiuno de junio.',
    'la primavera es muy agradable y hermosa.',
    'las flores crecen.',
    'los árboles y los campos se cubren de verdura y los pájaros cantan en ellos.',
    'todos los hombres, las mujeres y los niños están alegres.',
    'algunas veces hace frío en abril y aún en mayo.',
    'algunas veces, pero no frecuentemente, hay nieve y hielo en abril, y entonces muchas flores y plantas se mueren.']

['la', 'primavera', 'la', 'primavera', 'principia', 'el', 'veintiuno', 'de', 'marzo', 'y', 'dura', 'hasta', 'el', 'veintiuno', 'de', 'junio', '.', 'la', 'primavera', 'es', 'muy', 'agradable', 'y', 'hermosa', '.', 'las', 'flores', 'crecen', '.', 'los', 'árboles', 'y', 'los', 'campos', 'se', 'cubren', 'de', 'verdura', 'y', 'los', 'pájaros', 'cantan', 'en', 'ellos', '.', 'todos', 'los', 'hombres', ',', 'las', 'mujeres', 'y', 'los', 'niños', 'están', 'alegres', '.', 'algunas', 'veces', 'hace', 'frío

Calling the methods above populates the respective attributes with lists, and the methods also return those lists. We can therefore assign the output to a variable and access it that way as well.

In [9]:
tags = pipe.get_pos_tags()
print(tags == pipe.pos_tags)  # check if they are the same
print(tags)

True
['DET', 'NOUN', 'DET', 'NOUN', 'ADJ', 'DET', 'NUM', 'ADP', 'INTJ', 'CCONJ', 'ADJ', 'ADP', 'DET', 'NUM', 'ADP', 'INTJ', 'PUNCT', 'DET', 'NOUN', 'AUX', 'ADV', 'ADJ', 'CCONJ', 'ADJ', 'PUNCT', 'DET', 'NOUN', 'AUX', 'PUNCT', 'DET', 'NOUN', 'CCONJ', 'DET', 'NOUN', 'PRON', 'AUX', 'ADP', 'NOUN', 'CCONJ', 'DET', 'NOUN', 'AUX', 'ADP', 'PRON', 'PUNCT', 'DET', 'DET', 'NOUN', 'PUNCT', 'DET', 'NOUN', 'CCONJ', 'DET', 'NOUN', 'VERB', 'ADJ', 'PUNCT', 'DET', 'NOUN', 'VERB', 'NOUN', 'ADP', 'NOUN', 'CCONJ', 'ADV', 'ADP', 'NOUN', 'PUNCT', 'DET', 'NOUN', 'PUNCT', 'CCONJ', 'ADV', 'ADV', 'PUNCT', 'AUX', 'NOUN', 'CCONJ', 'NOUN', 'ADP', 'NOUN', 'PUNCT', 'CCONJ', 'ADV', 'DET', 'NOUN', 'CCONJ', 'NOUN', 'PRON', 'AUX', 'PUNCT']


Putting it all together, let's create a new pipeline object and give it the text to automatically pre-process. We can then just call a `.get_*` method to access the attribute using spaCy, completely eliminating the explicit writing of the pre-processing step.

In [10]:
pipe = feature_pipeline(unprocessed_text)
print(pipe.get_noun_chunks())

['la primavera', 'la primavera', 'la primavera', 'las flores', 'los árboles', 'los campos', 'se', 'verdura', 'los pájaros', 'ellos', 'los hombres', 'las mujeres', 'los niños', 'algunas veces', 'frío', 'abril', 'mayo', 'algunas veces', 'nieve', 'hielo', 'abril', 'muchas flores', 'plantas', 'se']


Alternatively, we could also create a blank pipeline object and pass in the text through the `.get_*` method. This is functionally the same as the above. The only difference is that the text pre-processing will occur at the spaCy stage instead of the constructor stage.

In [11]:
pipe = feature_pipeline()
print(pipe.get_noun_chunks(unprocessed_text))

['la primavera', 'la primavera', 'la primavera', 'las flores', 'los árboles', 'los campos', 'se', 'verdura', 'los pájaros', 'ellos', 'los hombres', 'las mujeres', 'los niños', 'algunas veces', 'frío', 'abril', 'mayo', 'algunas veces', 'nieve', 'hielo', 'abril', 'muchas flores', 'plantas', 'se']
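
The two entry points shown above (text at the constructor, or text at a `.get_*` method) can be pictured with a small stand-in class. Everything in this sketch is an assumption made for illustration: `ToyPipeline` is hypothetical, whitespace splitting replaces spaCy, and the real class in `features.py` may be organised differently.

```python
class ToyPipeline:
    """Hypothetical sketch of the 'text at any stage' pattern; not features.py."""

    def __init__(self, text=None):
        # Start from a blank slate, then clean the text if one was supplied
        self.text = ""
        self.tokens = []
        if text is not None:
            self.preprocess(text)

    def preprocess(self, text):
        # Simplified clean-up: collapse whitespace and lowercase
        self.text = " ".join(text.split()).lower()
        return self.text

    def get_tokens(self, text=None):
        # A text passed here is cleaned first; otherwise the stored text is used
        if text is not None:
            self.preprocess(text)
        self.tokens = self.text.split()  # stand-in for spaCy tokenization
        return self.tokens


raw = "Las flores\ncrecen."
at_init = ToyPipeline(raw).get_tokens()   # text supplied to the constructor
late_pipe = ToyPipeline()
at_getter = late_pipe.get_tokens(raw)     # text supplied to the getter
print(at_init == at_getter)
```

In this sketch the getter both stores the list on the object and returns it, mirroring the behaviour of the `.get_*` methods described earlier.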


Here is a list of all of the spaCy features supported by our pipeline so far:
* sentences: \
    extraction function - `pipe.get_sentences()`, \
    attribute - `pipe.sentences`
* tokens: \
    extraction function - `pipe.get_tokens()`, \
    attribute - `pipe.tokens`
* lemmas: \
    extraction function - `pipe.get_lemmas()`, \
    attribute - `pipe.lemmas`
* POS tags: \
    extraction function - `pipe.get_pos_tags()`, \
    attribute - `pipe.pos_tags`
* morphology tags: \
    extraction function - `pipe.get_morphology()`, \
    attribute - `pipe.morphs`
* dependency parses: \
    extraction function - `pipe.get_dependency_parses()`, \
    attribute - `pipe.parses`
* noun phrase chunks: \
    extraction function - `pipe.get_noun_chunks()`, \
    attribute - `pipe.noun_chunks`

What if we want to extract all of the spaCy features in one go, instead of calling each of the `.get_*` methods one by one? We can do that either by calling the `.full_spacy()` method, which extracts all of these features, or by initializing the pipeline object with the flag `full_spacy=True`.

In [12]:
pipe = feature_pipeline(unprocessed_text, full_spacy=True)
# Equivalent to:
# pipe = feature_pipeline()
# pipe.full_spacy(unprocessed_text)  # does not return any outputs, saves directly to attributes

# All of the following items will automatically be extracted as part of the spaCy pipeline:
print(pipe.text)
print()
print(pipe.parses)

# Commented out for brevity
# p.pprint(pipe.sentences)
# print(pipe.tokens)
# print(pipe.lemmas)
# print(pipe.pos_tags)
# print(pipe.morphs)
# print(pipe.noun_chunks)

la primavera la primavera principia el veintiuno de marzo y dura hasta el veintiuno de junio. la primavera es muy agradable y hermosa. las flores crecen. los árboles y los campos se cubren de verdura y los pájaros cantan en ellos. todos los hombres, las mujeres y los niños están alegres. algunas veces hace frío en abril y aún en mayo. algunas veces, pero no frecuentemente, hay nieve y hielo en abril, y entonces muchas flores y plantas se mueren.

['det', 'iobj', 'det', 'ROOT', 'amod', 'det', 'nsubj', 'case', 'compound', 'cc', 'conj', 'case', 'det', 'obl', 'case', 'compound', 'punct', 'det', 'nsubj', 'cop', 'advmod', 'ROOT', 'cc', 'conj', 'punct', 'det', 'nsubj', 'ROOT', 'punct', 'det', 'nsubj', 'cc', 'det', 'nsubj', 'obj', 'ROOT', 'case', 'obj', 'cc', 'det', 'nsubj', 'conj', 'case', 'obl', 'punct', 'det', 'det', 'nsubj', 'punct', 'det', 'appos', 'cc', 'det', 'conj', 'cop', 'ROOT', 'punct', 'det', 'obl', 'ROOT', 'obj', 'case', 'obl', 'cc', 'conj', 'case', 'nmod', 'punct', 'det', 'obl', 

### Numerical/Statistical Feature Extraction
The most important aspect of the feature extraction pipeline is its ability to derive statistical/numerical features from the text given to it. For a comprehensive guide to all of the features that this pipeline can compute, please see the project report (TODO: LINK TO PROJECT REPORT). \
\
Using the pipeline that we created and the attributes that we extracted in the previous cell, here is how we can derive some features from a text using the pipeline:

In [13]:
num_tokens = pipe.num_tokens()  # internally accesses pipe.tokens
log_op_density = pipe.logical_operators()  # internally accesses pipe.tokens
(
    fh_score,
    syls_per_sent,
) = pipe.fernandez_huerta_score()  # internally accesses pipe.tokens and pipe.sentences

print(num_tokens)
print(log_op_density)
print(fh_score)
print(syls_per_sent)

91
0.12345679012345678
86.16
21.285714285714285


It is important to note that any of the statistical feature functions can be called directly without needing to run any of the spaCy extractors first. As long as a feature pipeline object has been created, calling any of the feature functions will automatically extract the spaCy features necessary for computing the desired statistical feature. For example:

In [14]:
# Explicitly specifying full_spacy=False for demonstration purposes (default behaviour)
pipe = feature_pipeline(unprocessed_text, full_spacy=False)
# The above step simply cleans up the text. No spaCy features are extracted

# All of these methods try to access their required spaCy attributes, and if
# they find that the attribute does not yet exist the necessary method will
# be called internally to generate those attributes.

num_tokens = pipe.num_tokens()  # internally accesses pipe.tokens
log_op_density = pipe.logical_operators()  # internally accesses pipe.tokens
(
    fh_score,
    syls_per_sent,
) = pipe.fernandez_huerta_score()  # internally accesses pipe.tokens and pipe.sentences

print(num_tokens)
print(log_op_density)
print(fh_score)
print(syls_per_sent)

91
0.12345679012345678
86.16
21.285714285714285
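
One way to picture this on-demand behaviour is a feature method that checks whether its required attribute is still empty and, if so, runs the extractor itself. The sketch below is only an illustration under stated assumptions: `LazyPipeline`, `type_count` and the `tokenizer_calls` counter are hypothetical names, and whitespace splitting stands in for spaCy.

```python
class LazyPipeline:
    """Hypothetical sketch of lazy attribute extraction; not the code in features.py."""

    def __init__(self, text=""):
        self.text = text
        self.tokens = []
        self.tokenizer_calls = 0  # only here to show how often extraction runs

    def get_tokens(self):
        # Whitespace splitting stands in for spaCy's tokenizer
        self.tokenizer_calls += 1
        self.tokens = self.text.split()
        return self.tokens

    def num_tokens(self):
        # Extract tokens only if no earlier call has populated them
        if not self.tokens:
            self.get_tokens()
        return len(self.tokens)

    def type_count(self):
        # Reuses the tokens cached by an earlier feature call, if any
        if not self.tokens:
            self.get_tokens()
        return len(set(self.tokens))


lazy_pipe = LazyPipeline("la primavera es muy agradable y hermosa")
n_tokens = lazy_pipe.num_tokens()   # triggers tokenization
n_types = lazy_pipe.type_count()    # reuses the cached tokens
print(n_tokens, n_types, lazy_pipe.tokenizer_calls)
```

The design choice illustrated here is that the expensive extraction step runs at most once per text, no matter how many feature methods depend on it.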


Alternatively, the point at which the text is provided to the pipeline can be moved. That is, the unprocessed text does not have to be provided at the initialization stage; it can be passed directly to a feature function instead. The text will automatically be cleaned up, the necessary attributes will be extracted using spaCy, and then the statistic will be calculated.

In [15]:
pipe = feature_pipeline()

# Saves the cleaned text to pipe.text and the extracted list of tokens to pipe.tokens
num_tokens = pipe.num_tokens(text=unprocessed_text)
print(pipe.text)
print()
print(pipe.tokens)
print()

# No need to pass any arguments, it internally accesses pipe.tokens
log_op_density = pipe.logical_operators()

# No need to pass any arguments, it internally accesses pipe.tokens
# and extracts pipe.sentences from the cleaned up pipe.text
fh_score, syls_per_sent = pipe.fernandez_huerta_score()

print(num_tokens)
print(log_op_density)
print(fh_score)
print(syls_per_sent)

la primavera la primavera principia el veintiuno de marzo y dura hasta el veintiuno de junio. la primavera es muy agradable y hermosa. las flores crecen. los árboles y los campos se cubren de verdura y los pájaros cantan en ellos. todos los hombres, las mujeres y los niños están alegres. algunas veces hace frío en abril y aún en mayo. algunas veces, pero no frecuentemente, hay nieve y hielo en abril, y entonces muchas flores y plantas se mueren.

['la', 'primavera', 'la', 'primavera', 'principia', 'el', 'veintiuno', 'de', 'marzo', 'y', 'dura', 'hasta', 'el', 'veintiuno', 'de', 'junio', '.', 'la', 'primavera', 'es', 'muy', 'agradable', 'y', 'hermosa', '.', 'las', 'flores', 'crecen', '.', 'los', 'árboles', 'y', 'los', 'campos', 'se', 'cubren', 'de', 'verdura', 'y', 'los', 'pájaros', 'cantan', 'en', 'ellos', '.', 'todos', 'los', 'hombres', ',', 'las', 'mujeres', 'y', 'los', 'niños', 'están', 'alegres', '.', 'algunas', 'veces', 'hace', 'frío', 'en', 'abril', 'y', 'aún', 'en', 'mayo', '.', 

Finally, let's extract all of the available statistical features in one go. Accomplishing this is as simple as creating a pipeline object and calling the `.feature_extractor()` method. We explicitly only write two lines, and the unprocessed text can be supplied to the pipeline at either line, but under the hood all 4 stages of processing the text will take place. \
(If the object is initialized with a text, `.feature_extractor()` does not require any arguments. Otherwise, if the object is initialized *without* a text, `.feature_extractor()` must be given a text in order to perform pre-processing and spaCy attribute extraction.)
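
Many of the entries in the output further down are keyed by POS tag, and their values look like each tag's share of all tags. The sketch below shows how such proportions could be computed and merged with other features into a single dictionary. It is an assumption-laden illustration: `pos_proportions` and `toy_feature_extractor` are hypothetical helpers, the `num_tokens` key is illustrative, and the real `.feature_extractor()` may compute these values differently.

```python
from collections import Counter


def pos_proportions(pos_tags):
    """Map each POS tag to its share of all tags (hypothetical helper)."""
    total = len(pos_tags)
    if total == 0:
        return {}
    counts = Counter(pos_tags)
    return {tag: counts[tag] / total for tag in sorted(counts)}


def toy_feature_extractor(tokens, pos_tags):
    """Collect several features into one dictionary (illustrative only)."""
    features = pos_proportions(pos_tags)
    features["num_tokens"] = len(tokens)
    return features


# Hand-written tokens and tags, standing in for spaCy output
toy_tokens = ["las", "flores", "crecen", "."]
toy_tags = ["DET", "NOUN", "VERB", "PUNCT"]
toy_features = toy_feature_extractor(toy_tokens, toy_tags)
print(toy_features)
```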

In [4]:
pipe = feature_pipeline(
    dep_parse_flag=True,
    dep_parse_classpath="/Users/eun-youngchristinapark/Documents/stanza_corenlp/*",
    result_root="/Users/eun-youngchristinapark/MDS-CAPSTONE/wn-mcr-transform/wordnet_spa",
)
pipe.corenlp_client.start()

2021-06-07 18:55:05 INFO: Using CoreNLP default properties for: spanish.  Make sure to have spanish models jar (available for download here: https://stanfordnlp.github.io/CoreNLP/) in CLASSPATH
2021-06-07 18:55:16 INFO: Starting server with command: java -Xmx5G -cp /Users/eun-youngchristinapark/Documents/stanza_corenlp/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 30000 -threads 5 -maxCharLength 100000 -quiet False -serverProperties spanish -annotators depparse -preload -outputFormat serialized


In [5]:
p.pprint(pipe.feature_extractor(unprocessed_text))

# ALTERNATIVELY:
# pipe = feature_pipeline(unprocessed_text)
# pipe.feature_extractor()
pipe.corenlp_client.stop()

{   'ADJ': 0.03607103218645949,
    'ADP': 0.1076581576026637,
    'ADV': 0.04827968923418424,
    'AUX': 0.03995560488346282,
    'CCONJ': 0.030521642619311874,
    'CONJ': 0.0,
    'CONTENT': 0.6620111731843575,
    'DET': 0.10932297447280799,
    'EOL': 0.0,
    'FUNCTION': 0.33798882681564246,
    'INTJ': 0.009433962264150943,
    'NOUN': 0.15038845726970032,
    'NUM': 0.002774694783573807,
    'PART': 0.0011098779134295228,
    'PRON': 0.0560488346281909,
    'PROPN': 0.06437291897891231,
    'PUNCT': 0.20199778024417314,
    'SCONJ': 0.019422863485016647,
    'SPACE': 0.0,
    'SYM': 0.003329633740288568,
    'VERB': 0.1193118756936737,
    'X': 0.0,
    'avg_ambiguation_all_words': 3.546268656716418,
    'avg_ambiguation_content_words': 4.128846153846154,
    'avg_degree_of_abstraction': 7.398465724900336,
    'avg_dep_tree_depth': 2.7049180327868854,
    'avg_rank_of_lemmas_in_freq_list': 522.8107658157603,
    'avg_sent_length': 10.123595505617978,
    'fernandez_huerta_score