# Text Pre-Processing for Spanish

<div class="admonition note" name="html-admonition" style="background: lightblue; padding: 10px">
<p class="title">Note</p>
This section, "Working in Languages Beyond English," is co-authored with <a href="http://www.quinndombrowski.com/">Quinn Dombrowski</a>, the Academic Technology Specialist at Stanford University and a leading voice in multilingual digital humanities. I'm grateful to Quinn for helping expand this textbook to serve languages beyond English. 
</div>

This lesson is for anyone who wants to try the [TF-IDF](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Text-Analysis/TF-IDF.html) or [topic modeling](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Text-Analysis/Topic-Modeling.html) lessons on Spanish texts. Before continuing with those lessons, you need to create a *lemmatized derivative* of your original Spanish text, in which every word is replaced with its dictionary form. Word-count based methods work much better on this derivative, because all the inflected forms of a word (for example, `predicaban` and `predicar`) are counted together.

## Install spaCy

In [None]:
!pip install -U spacy

## Download Language Model

In [None]:
!python -m spacy download es_core_news_md

## Import Libraries

In [4]:
import spacy

## Load Language Model
Once the model is downloaded, we need to load it. There are two ways to load a spaCy language model.

1. We can import the model as a module and then load it from the module.

In [5]:
import es_core_news_md
nlp = es_core_news_md.load()

2. We can load the model by name.

In [6]:
#nlp = spacy.load('es_core_news_md')

If you have just downloaded the model for the first time, it’s advisable to use Option 1, which lets you use the model immediately. With Option 2, you’ll likely need to restart your Jupyter kernel first (which you can do by clicking Kernel -> Restart Kernel… in the JupyterLab menu).
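If you are unsure whether the download step succeeded, a quick check like the one below may help before you try either option. It is only a sketch: `model_is_importable` is a hypothetical helper name, and it relies solely on Python's standard `importlib` module, not on any spaCy function.

```python
import importlib
import importlib.util

def model_is_importable(package_name):
    """Return True if Python can find a top-level package with this name."""
    # Refresh the import caches so packages installed during this session are seen
    importlib.invalidate_caches()
    return importlib.util.find_spec(package_name) is not None

if model_is_importable('es_core_news_md'):
    print('Model package found; Option 1 (importing the module) should work.')
else:
    print('Model package not found; try running the download cell again.')
```

This only confirms that the package is visible to Python. It does not load the model or verify that it matches your installed spaCy version.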

## Process Document
To create a derivative text file that we can use with TF-IDF, topic modeling, or other word-count based methods, we need to use spaCy to *lemmatize* the text, replacing each word with its dictionary form. The result will be an ungrammatical text that will produce better results than the original version when used with word-count methods.

The example text for Spanish is *Oasis en la vida* by Juana Manuela Gorriti [from Project Gutenberg](http://www.gutenberg.org/ebooks/62564).

Here we open the text and process it with the Spanish spaCy model.

In [7]:
filepath = '../texts/es.txt'
# Open and read the text; the with-block closes the file when done
with open(filepath, encoding='utf-8') as f:
    text = f.read()
# Process text with spaCy
document = nlp(text)
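Plain-text files sometimes begin with an invisible byte-order mark (BOM). When a file is read with `encoding='utf-8'`, that character stays in the text and can end up attached to the first token. The sketch below shows one possible way around this, using Python's built-in `utf-8-sig` codec. The helper name `read_text` and the temporary sample file are assumptions made for illustration, not part of the original lesson.

```python
import os
import tempfile

def read_text(path):
    """Read a UTF-8 text file, dropping a leading BOM if one is present."""
    # 'utf-8-sig' decodes like 'utf-8' but strips an initial U+FEFF character
    with open(path, encoding='utf-8-sig') as f:
        return f.read()

# Demonstrate the difference on a small temporary file that starts with a BOM
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'sample.txt')
    with open(sample_path, 'w', encoding='utf-8') as f:
        f.write('\ufeffINTRODUCCION.')

    with open(sample_path, encoding='utf-8') as f:
        with_bom = f.read()          # BOM is kept as the first character
    without_bom = read_text(sample_path)  # BOM is removed

print(repr(with_bom))
print(repr(without_bom))
```

Files without a BOM are read identically by both codecs, so switching to `utf-8-sig` should be harmless when you are not sure which kind of file you have.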

Then we loop through each token in the original text, take its lowercased lemma, and write it to our new lemmatized derivative text file, inserting a space after each token.

In [8]:
outname = filepath.replace('.txt', '-lemmatized.txt')

# Create a lemmatized version of the original text file
with open(outname, 'w', encoding='utf8') as out:
    for token in document:
        # Get the lemma for each token
        out.write(token.lemma_.lower())
        # Insert white space between each token
        out.write(' ')
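The loop above writes every token, including punctuation marks and the line breaks that spaCy treats as whitespace tokens. Many downstream tools discard these anyway, but if you would rather leave them out of the derivative file, a filter along the following lines is one option. This is a minimal sketch: `lemmas_for_counting` is a hypothetical helper, and the stand-in tokens exist only so the example runs without a model. It assumes tokens that expose `lemma_`, `is_space`, and `is_punct`, as spaCy tokens do.

```python
from types import SimpleNamespace

def lemmas_for_counting(tokens):
    """Return lowercased lemmas, skipping whitespace and punctuation tokens."""
    return [
        token.lemma_.lower()
        for token in tokens
        if not (token.is_space or token.is_punct)
    ]

# Stand-in tokens for illustration; with spaCy you would pass `document` instead
demo_tokens = [
    SimpleNamespace(lemma_='el', is_space=False, is_punct=False),
    SimpleNamespace(lemma_='Idea', is_space=False, is_punct=False),
    SimpleNamespace(lemma_=',', is_space=False, is_punct=True),
    SimpleNamespace(lemma_='\n', is_space=True, is_punct=False),
    SimpleNamespace(lemma_='predicar', is_space=False, is_punct=False),
]

filtered = lemmas_for_counting(demo_tokens)
print(' '.join(filtered))
```

With a real document, the joined string could be written to `outname` in a single `out.write(...)` call in place of the token-by-token loop.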

## Examine Differences
The last code cell in this section prints each original token, a hyphen, and then the lemmatized form that was written to the derivative text document that you'll use for TF-IDF and topic modeling. It's a good idea to skim this output so you can see whether there are places where the model consistently makes mistakes.

For instance, an earlier version of spaCy often associated the Spanish preposition `para` ("for") with the verb `parar` ("to stop"). If you just took that derivative file and used it for TF-IDF or topic modeling without realizing what was happening, you might reach the surprising conclusion that "stop" is a very frequent word in your text, when actually it's a lemmatization problem.
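Reading a token-by-token listing of a whole novel is slow, so it may help to tally how often each word is mapped to a different lemma and inspect the most frequent mappings first. The sketch below is one way to do that with the standard library. The helper `changed_pair_counts` is a hypothetical name, and the sample pairs are a handful taken from this lesson's example text, not a full run.

```python
from collections import Counter

def changed_pair_counts(pairs):
    """Count (word, lemma) pairs in which the lemma differs from the word."""
    return Counter(
        (word.lower(), lemma.lower())
        for word, lemma in pairs
        if word.lower() != lemma.lower()
    )

# Sample pairs for illustration; with spaCy you could build the list as
# [(t.text, t.lemma_) for t in document if t.is_alpha]
sample_pairs = [
    ('las', 'el'),
    ('los', 'el'),
    ('los', 'el'),
    ('de', 'de'),
    ('predicaban', 'predicar'),
    ('constituídos', 'constituer'),
]

pair_counts = changed_pair_counts(sample_pairs)
for (word, lemma), count in pair_counts.most_common(3):
    print(f'{word} -> {lemma}: {count}')
```

A frequent mapping that looks wrong, such as `para` to `parar`, is a sign that the lemma's counts in TF-IDF or topic modeling should be treated with caution.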

In [9]:
for token in document:
    print(token.text + ' - ' + token.lemma_)

﻿INTRODUCCION - ﻿INTRODUCCION
. - .



 - 



ECONOMÍA - economía
POLÍTICA - político
. - .


 - 


El - el
sombrío - sombrío
Prudhon - Prudhon
, - ,
imbuído - imbuído
, - ,
sin - sin
duda - duda
, - ,
en - en
las - el
ideas - idea
de - de
los - el
Santos - Santos

 - 

Padres - Padres
de - de
la - el
Iglesia - Iglesia
que - que
predicaban - predicar
el - el
desden - desden
por - por
los - el
bienes - bien

 - 

terrenales - terrenal
, - ,
decía - decir
que - que
la - el
pobreza - pobreza
es - ser
una - uno
ley - ley
de - de
nuestra - nuestro
naturaleza - naturaleza
, - ,

 - 

ley - ley
bajo - bajo
la - el
cual - cual
hemos - haber
sido - ser
constituídos - constituer
, - ,
de - de
donde - donde
se - él
deduce - deducir
que - que
el - el

 - 

pauperismo - pauperismo
es - ser
mal - mal
que - que
no - no
tiene - tener
remedio - remedio
ni - ni
cura - cura
. - .


 - 


Muy - mucho
desconsolados - desconsolado
debieron - deber
quedar - quedar
los - el
menesterosos - menesteroso
con - co