# NLP I: Uses and operations of NLTK in English

In this notebook we put text tokenisation into practice.

Tokenisation is the division of a text into smaller pieces called tokens. A text can be tokenised into words or into sentences, although word tokenisation is more common.
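
As a rough illustration of the two granularities, the sketch below splits a short text using only regular expressions from the standard library. The patterns and the helper names `split_sentences` and `split_words` are assumptions made for this example; NLTK's tokenisers apply more elaborate rules.

```python
import re

text = "Me he comprado un coche rojo. Ahora tenemos que encontrar un seguro."

def split_sentences(text):
    # Naive rule (assumption): a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def split_words(text):
    # Naive rule (assumption): a word is a run of word characters; punctuation is dropped.
    return re.findall(r"\w+", text)

sentences = split_sentences(text)
words = split_words(text)

print(sentences)
print(words)
```

Sentence tokenisation here yields two pieces, while word tokenisation yields one token per word with the full stops removed.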

## Libraries and installation
### NLTK

First we need to import the NLTK library.

In [1]:
import nltk

## Working with the data

We start by defining a short sample text so that each step is easy to follow.

In [2]:
frase = 'Me he comprado un coche rojo. Ahora tenemos que encontrar un seguro de coches a todo riesgo'

### Word tokenisation

We will use NLTK's `word_tokenize` function, which we import in the next cell, and apply it to the sample text defined above.

Here is a brief explanation of the methods used. `.lower()` converts every token to lowercase so that all words share the same form. `.isalpha()` returns `True` only if a token consists entirely of alphabetic characters, so filtering on it discards punctuation marks, numbers, symbols and so on.
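
The effect of these two string methods can be seen on a small hand-written token list. The list `raw_tokens` is made up for this illustration and is not the output of any tokeniser.

```python
# Hypothetical tokens, chosen to show which kinds of token survive the filter.
raw_tokens = ["Me", "he", "comprado", "2", "coches", ".", "Más", "todo-riesgo"]

# Keep only purely alphabetic tokens, then lowercase them.
normalised = [tok.lower() for tok in raw_tokens if tok.isalpha()]

print(normalised)
```

Accented letters count as alphabetic, so `Más` is kept, whereas a hyphenated token such as `todo-riesgo` is discarded along with the number and the full stop.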

#### NLTK Word Tokenize

We import `word_tokenize` from the `nltk.tokenize` module to generate the tokens of our text.

Note that we will use the Spanish tokeniser, since the text we analyse is in Spanish.

In [3]:
from nltk.tokenize import word_tokenize

### Getting the tokens

To get the tokens we simply call `word_tokenize(t, i)`, where:
* **t** is the text to tokenise
* **i** is the language, in our case `"spanish"`.

In [4]:
tokens = word_tokenize(frase, "spanish") 

tokens = [word.lower() for word in tokens if word.isalpha()]

print(tokens)

['me', 'he', 'comprado', 'un', 'coche', 'rojo', 'ahora', 'tenemos', 'que', 'encontrar', 'un', 'seguro', 'de', 'coches', 'a', 'todo', 'riesgo']


### Stop words

Stop words are words that carry little meaning for our analysis, e.g. articles, conjunctions, determiners and auxiliary verbs.

First we import the **stopwords** corpus from NLTK.

In [5]:
from nltk.corpus import stopwords

We can easily see the words contained within stopwords by executing the following command `stopwords.words('spanish')`.

In [6]:
print(stopwords.words('spanish'))

['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'del', 'se', 'las', 'por', 'un', 'para', 'con', 'no', 'una', 'su', 'al', 'lo', 'como', 'más', 'pero', 'sus', 'le', 'ya', 'o', 'este', 'sí', 'porque', 'esta', 'entre', 'cuando', 'muy', 'sin', 'sobre', 'también', 'me', 'hasta', 'hay', 'donde', 'quien', 'desde', 'todo', 'nos', 'durante', 'todos', 'uno', 'les', 'ni', 'contra', 'otros', 'ese', 'eso', 'ante', 'ellos', 'e', 'esto', 'mí', 'antes', 'algunos', 'qué', 'unos', 'yo', 'otro', 'otras', 'otra', 'él', 'tanto', 'esa', 'estos', 'mucho', 'quienes', 'nada', 'muchos', 'cual', 'poco', 'ella', 'estar', 'estas', 'algunas', 'algo', 'nosotros', 'mi', 'mis', 'tú', 'te', 'ti', 'tu', 'tus', 'ellas', 'nosotras', 'vosotros', 'vosotras', 'os', 'mío', 'mía', 'míos', 'mías', 'tuyo', 'tuya', 'tuyos', 'tuyas', 'suyo', 'suya', 'suyos', 'suyas', 'nuestro', 'nuestra', 'nuestros', 'nuestras', 'vuestro', 'vuestra', 'vuestros', 'vuestras', 'esos', 'esas', 'estoy', 'estás', 'está', 'estamos', 'estáis', 'están', 'e

To remove the stop words from the text, we check each token against that list and drop the ones that appear in it.

In [7]:
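# Load the stop word list once instead of on every iteration
spanish_stopwords = stopwords.words('spanish')

# Work on a copy so that the original token list is left untouched
clean_tokens = tokens[:]

for token in tokens:
    if token in spanish_stopwords:
        clean_tokens.remove(token)

print(clean_tokens)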

['comprado', 'coche', 'rojo', 'ahora', 'encontrar', 'seguro', 'coches', 'riesgo']
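
The same filtering can be written more compactly with a set and a list comprehension. The sketch below is self-contained and uses `sample_stopwords`, a hypothetical hand-picked subset of Spanish stop words, in place of NLTK's full list.

```python
# Hypothetical, hand-picked subset of Spanish stop words (not NLTK's full list).
sample_stopwords = {"me", "he", "un", "que", "de", "a", "todo", "tenemos"}

tokens = ["me", "he", "comprado", "un", "coche", "rojo", "ahora", "tenemos",
          "que", "encontrar", "un", "seguro", "de", "coches", "a", "todo", "riesgo"]

# Set membership tests are fast, and the comprehension preserves token order.
clean_tokens = [tok for tok in tokens if tok not in sample_stopwords]

print(clean_tokens)
```

With the real list, `set(stopwords.words('spanish'))` would take the place of `sample_stopwords`. This variant also avoids removing items from a list while looping over a copy of it.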


### Stemming

Stemming strips verb tenses, genders, plurals and other inflections by cutting each word down to its stem, which improves the counting and grouping of words in the analysed texts.

In our case, for Spanish, we will use the **Snowball** algorithm. We import `SnowballStemmer` from the **nltk.stem** package.

In [8]:
from nltk.stem import SnowballStemmer

As this stemmer is multi-language, we will have to specify which language we want to use.

You can consult all the available languages, along with more documentation at: https://www.nltk.org/_modules/nltk/stem/snowball.html

In [9]:
spanish_stemmer = SnowballStemmer('spanish')

Next, we pass the tokens from which we removed the stop words to the stemmer (any list of tokens works, including one that still contains stop words).

In [10]:
stem_tokens = []

for token in clean_tokens:
    stem_tokens.append(spanish_stemmer.stem(token))
    
stem_tokens

['compr', 'coch', 'roj', 'ahor', 'encontr', 'segur', 'coch', 'riesg']
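
To see why stemming helps with counting and grouping, the sketch below pairs each token with its stem and collects the surface forms that share a stem. Both lists are copied from the notebook outputs above, so the block runs without NLTK. The name `forms_by_stem` is an assumption made for this example.

```python
from collections import defaultdict

# Tokens and stems copied from the notebook outputs above.
clean_tokens = ['comprado', 'coche', 'rojo', 'ahora', 'encontrar', 'seguro', 'coches', 'riesgo']
stem_tokens = ['compr', 'coch', 'roj', 'ahor', 'encontr', 'segur', 'coch', 'riesg']

# Map each stem to the surface forms that produced it.
forms_by_stem = defaultdict(list)
for token, stem in zip(clean_tokens, stem_tokens):
    forms_by_stem[stem].append(token)

for stem, forms in forms_by_stem.items():
    print(stem, len(forms), forms)
```

Here `coche` and `coches` end up under the single stem `coch`, so they are counted as two occurrences of the same word rather than as two different words.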

### Lemmatisation

Put very simply, lemmatisation gives us the dictionary form (lemma) of a word, for example:

* Verbs: Eating -> Eat
* Plurals: Tables -> Table

With this we can classify words more accurately than with stemming.

To do this in Spanish we use the spaCy library, since NLTK does not support lemmatisation in Spanish.
Installing spaCy is simple: run the following commands in an **Anaconda Prompt** terminal:
* `conda install -c conda-forge spacy`
* `python -m spacy download es_core_news_sm`

Once installed, import the library with `import spacy` and load the Spanish package with `spacy.load('es_core_news_sm')`.

In [11]:
import spacy
nlp = spacy.load('es_core_news_sm')

Once the library has been imported and the Spanish model loaded, we proceed to obtain the lemmas.

In [12]:
lem_tokens = []

separator = ' '

for token in nlp(separator.join(clean_tokens)):
    lem_tokens.append(token.lemma_)
    
print(lem_tokens)

['comprado', 'coche', 'rojo', 'ahora', 'encontrar', 'seguro', 'coche', 'riesgo']
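
As a quick comparison of the two techniques, the sketch below prints each stem next to its lemma and then counts the lemmas. Both lists are copied from the notebook outputs above, so the block runs without NLTK or spaCy. The name `lemma_counts` is an assumption made for this example.

```python
from collections import Counter

# Stems and lemmas copied from the notebook outputs above.
stem_tokens = ['compr', 'coch', 'roj', 'ahor', 'encontr', 'segur', 'coch', 'riesg']
lem_tokens = ['comprado', 'coche', 'rojo', 'ahora', 'encontrar', 'seguro', 'coche', 'riesgo']

# Side-by-side view: stems are truncated strings, lemmas are readable words.
for stem, lemma in zip(stem_tokens, lem_tokens):
    print(f"{stem:<10}{lemma}")

# Counting lemmas merges 'coche' and 'coches' into a single entry.
lemma_counts = Counter(lem_tokens)
print(lemma_counts.most_common(1))
```

Both approaches merge `coche` and `coches`, but the lemma stays a readable word. In the output above `comprado` is returned unchanged, so results depend on the model and on the context it is given.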
