# NLP of Multilingual Texts using TextBlob (English, Spanish and Chinese)

https://github.com/vitojph/kschool-nlp-18/blob/master/notebooks/textblob.ipynb

Debugged/Extended by:
* Jon Chun
* 20221212

In [1]:
!pip list

Package                       Version
----------------------------- ----------------------
absl-py                       1.3.0
aeppl                         0.0.33
aesara                        2.7.9
aiohttp                       3.8.3
aiosignal                     1.3.1
alabaster                     0.7.12
albumentations                1.2.1
altair                        4.2.0
appdirs                       1.4.4
arviz                         0.12.1
astor                         0.8.1
astropy                       4.3.1
astunparse                    1.6.3
async-timeout                 4.0.2
atari-py                      0.2.9
atomicwrites                  1.4.1
attrs                         22.1.0
audioread                     3.0.0
autograd                      1.5
Babel                         2.11.0
backcall                      0.2.0
beautifulsoup4                4.6.3
bleach                        5.0.1
blis                          0.7.9
bokeh                         2.3.3
branca

In [2]:
!pip install -U textblob

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting textblob
  Downloading textblob-0.17.1-py2.py3-none-any.whl (636 kB)
     |████████████████████████████████| 636 kB 29.9 MB/s
Installing collected packages: textblob
  Attempting uninstall: textblob
    Found existing installation: textblob 0.15.3
    Uninstalling textblob-0.15.3:
      Successfully uninstalled textblob-0.15.3
Successfully installed textblob-0.17.1


In [3]:
!pip show textblob

Name: textblob
Version: 0.17.1
Summary: Simple, Pythonic text processing. Sentiment analysis, part-of-speech tagging, noun phrase parsing, and more.
Home-page: https://github.com/sloria/TextBlob
Author: Steven Loria
Author-email: sloria1@gmail.com
License: MIT
Location: /usr/local/lib/python3.8/dist-packages
Requires: nltk
Required-by: 


In [7]:
!pip install -U nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [8]:
!pip show nltk

Name: nltk
Version: 3.7
Summary: Natural Language Toolkit
Home-page: https://www.nltk.org/
Author: NLTK Team
Author-email: nltk.team@gmail.com
License: Apache License, Version 2.0
Location: /usr/local/lib/python3.8/dist-packages
Requires: joblib, click, tqdm, regex
Required-by: textblob


In [9]:
import nltk

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# install the requirements
# pip install textblob

# `textblob`: another module for NLP tasks (`NLTK` + `pattern`)

[textblob](http://textblob.readthedocs.org/) is a Python text-processing library for Natural Language Processing tasks such as morphological analysis, entity extraction, sentiment analysis, machine translation, etc.

It is built on top of two other well-known Python libraries: [NLTK](http://www.nltk.org/) and [pattern](http://www.clips.ua.ac.be/pages/pattern-en). The main advantage of [textblob](http://textblob.readthedocs.org/) is that it lets you combine both tools behind a simpler interface.

We will follow [this tutorial](http://textblob.readthedocs.org/en/dev/quickstart.html) to learn how to use some of its most notable features.

The first step is to import the `TextBlob` object, which gives access to all the tools it includes.

In [4]:
from textblob import TextBlob

Let's create our first *textblob* example through the `TextBlob` object. Think of these *textblobs* as a kind of Python string, analyzed and enriched with some extra features.

In [57]:
texto = """In new lawsuits brought against the ride-sharing companies Uber and Lyft, the top prosecutors in Los Angeles 
and San Francisco counties make an important point about the lightly regulated sharing economy. The consumers who 
participate deserve a very clear picture of the risks they're taking"""

print(texto) 

t = TextBlob(texto)

In new lawsuits brought against the ride-sharing companies Uber and Lyft, the top prosecutors in Los Angeles 
and San Francisco counties make an important point about the lightly regulated sharing economy. The consumers who 
participate deserve a very clear picture of the risks they're taking


In [58]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [59]:
# Parse into sentences

print(t.sentences)

[Sentence("In new lawsuits brought against the ride-sharing companies Uber and Lyft, the top prosecutors in Los Angeles 
and San Francisco counties make an important point about the lightly regulated sharing economy."), Sentence("The consumers who 
participate deserve a very clear picture of the risks they're taking")]


In [60]:
print("We have", len(t.sentences), "sentences.\n")

for sentence in t.sentences:
    print(sentence)
    print("-" * 75)

We have 2 sentences.

In new lawsuits brought against the ride-sharing companies Uber and Lyft, the top prosecutors in Los Angeles 
and San Francisco counties make an important point about the lightly regulated sharing economy.
---------------------------------------------------------------------------
The consumers who 
participate deserve a very clear picture of the risks they're taking
---------------------------------------------------------------------------


In [64]:
print("We have", len(t.sentences), "sentences.\n")

oracion_ls = []

for sentence in t.sentences:
    print(sentence)
    sentence_str = str(sentence)
    print(type(sentence_str))
    print(sentence_str)
    oracion_ls.append(sentence_str)
    print("-" * 75)

print(f'\n\nORACION:\n\n{oracion_ls}')

We have 2 sentences.

In new lawsuits brought against the ride-sharing companies Uber and Lyft, the top prosecutors in Los Angeles 
and San Francisco counties make an important point about the lightly regulated sharing economy.
<class 'str'>
In new lawsuits brought against the ride-sharing companies Uber and Lyft, the top prosecutors in Los Angeles 
and San Francisco counties make an important point about the lightly regulated sharing economy.
---------------------------------------------------------------------------
The consumers who 
participate deserve a very clear picture of the risks they're taking
<class 'str'>
The consumers who 
participate deserve a very clear picture of the risks they're taking
---------------------------------------------------------------------------


ORACION:

['In new lawsuits brought against the ride-sharing companies Uber and Lyft, the top prosecutors in Los Angeles \nand San Francisco counties make an important point about the lightly regulated sharin

## Processing sentences, words, and entities

We can split our sample text into sentences and words simply by accessing the `.sentences` and `.words` properties. We print them to the screen:

In [65]:
# print the sentences
for sentence in t.sentences:
    print(sentence)
    print("-" * 75)

# and the words
print(t.words)
print(texto.split())

In new lawsuits brought against the ride-sharing companies Uber and Lyft, the top prosecutors in Los Angeles 
and San Francisco counties make an important point about the lightly regulated sharing economy.
---------------------------------------------------------------------------
The consumers who 
participate deserve a very clear picture of the risks they're taking
---------------------------------------------------------------------------
['In', 'new', 'lawsuits', 'brought', 'against', 'the', 'ride-sharing', 'companies', 'Uber', 'and', 'Lyft', 'the', 'top', 'prosecutors', 'in', 'Los', 'Angeles', 'and', 'San', 'Francisco', 'counties', 'make', 'an', 'important', 'point', 'about', 'the', 'lightly', 'regulated', 'sharing', 'economy', 'The', 'consumers', 'who', 'participate', 'deserve', 'a', 'very', 'clear', 'picture', 'of', 'the', 'risks', 'they', "'re", 'taking']
['In', 'new', 'lawsuits', 'brought', 'against', 'the', 'ride-sharing', 'companies', 'Uber', 'and', 'Lyft,', 'the', 'top', 'p
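
The difference between `.words` and `str.split()` in the output above comes down to punctuation: `split()` only cuts on whitespace, so `Lyft,` keeps its comma. A rough sketch with a regular expression shows the idea. The pattern `WORD_RE` is an assumption made for illustration, not TextBlob's actual tokenizer, which among other things also splits `they're` into `they` and `'re`.

```python
import re

sample = "companies Uber and Lyft, the top prosecutors"

# Whitespace splitting keeps punctuation glued to the neighbouring word.
whitespace_tokens = sample.split()

# Hypothetical pattern: runs of word characters, optionally joined by hyphens.
# Punctuation is simply dropped, as in the `.words` output.
WORD_RE = re.compile(r"\w+(?:-\w+)*")
regex_tokens = WORD_RE.findall(sample)

print(whitespace_tokens)
print(regex_tokens)
```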

The `.noun_phrases` property gives access to the list of entities (strictly speaking, noun phrases) contained in our *textblob*. This is how it works.

In [66]:
nltk.download('brown')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

In [67]:
print("the sample text contains", len(t.noun_phrases), "entities")
for element in t.noun_phrases:
    print("-", element)

the sample text contains 8 entities
- new lawsuits
- uber
- lyft
- top prosecutors
- los angeles
- san francisco
- important point
- clear picture


In [21]:
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [22]:
# playing with lemmas, singulars and plurals
for word in t.words:
    if word.endswith("s"):
        print(word.lemmatize(), word, word.singularize())
    else:
        print(word.lemmatize(), word, word.pluralize())

In In Ins
new new news
lawsuit lawsuits lawsuit
brought brought broughts
against against againsts
the the thes
ride-sharing ride-sharing ride-sharings
company companies company
Uber Uber Ubers
and and ands
Lyft Lyft Lyfts
the the thes
top top tops
prosecutor prosecutors prosecutor
in in ins
Los Los Lo
Angeles Angeles Angele
and and ands
San San Sans
Francisco Francisco Franciscoes
county counties county
make make makes
an an some
important important importants
point point points
about about abouts
the the thes
lightly lightly lightlies
regulated regulated regulateds
sharing sharing sharings
economy economy economies
The The Thes
consumer consumers consumer
who who whoes
participate participate participates
deserve deserve deserves
a a some
very very veries
clear clear clears
picture picture pictures
of of ofs
the the thes
risk risks risk
they they they
're 're 'res
taking taking takings


In [25]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [26]:
# how can we make lemmatization smarter?
for item in t.tags:
    if item[1] == "NN":
        print(item[0], "-->", item[0].pluralize())
    elif item[1] == "NNS":
        print(item[0], "-->", item[0].singularize())
    else:
        print(item[0], item[0].lemmatize())

In In
new new
lawsuits --> lawsuit
brought brought
against against
the the
ride-sharing ride-sharing
companies --> company
Uber Uber
and and
Lyft Lyft
the the
top top
prosecutors --> prosecutor
in in
Los Los
Angeles Angeles
and and
San San
Francisco Francisco
counties --> county
make make
an an
important important
point --> points
about about
the the
lightly lightly
regulated regulated
sharing sharing
economy --> economies
The The
consumers --> consumer
who who
participate participate
deserve deserve
a a
very very
clear clear
picture --> pictures
of of
the the
risks --> risk
they they
're 're
taking taking


## Syntactic parsing

Other parsers can be used, but by default the `.parse()` method calls the morphosyntactic analyzer from the `pattern.en` module, which you already know.

In [27]:
# syntactic parsing
print(t.parse())

In/IN/B-PP/B-PNP new/JJ/B-NP/I-PNP lawsuits/NNS/I-NP/I-PNP brought/VBN/B-VP/I-PNP against/IN/B-PP/B-PNP the/DT/B-NP/I-PNP ride-sharing/JJ/I-NP/I-PNP companies/NNS/I-NP/I-PNP Uber/NNP/I-NP/I-PNP and/CC/O/O Lyft/NNP/B-NP/O ,/,/O/O the/DT/B-NP/O top/JJ/I-NP/O prosecutors/NNS/I-NP/O in/IN/B-PP/B-PNP Los/NNP/B-NP/I-PNP Angeles/NNP/I-NP/I-PNP and/CC/I-NP/I-PNP San/NNP/I-NP/I-PNP Francisco/NNP/I-NP/I-PNP counties/NNS/I-NP/I-PNP make/VB/B-VP/O an/DT/B-NP/O important/JJ/I-NP/O point/NN/I-NP/O about/IN/B-PP/O the/DT/O/O lightly/RB/B-VP/O regulated/VBN/I-VP/O sharing/VBG/I-VP/O economy/NN/B-NP/O ././O/O
The/DT/B-NP/O consumers/NNS/I-NP/O who/WP/O/O participate/VB/B-VP/O deserve/VBP/I-VP/O a/DT/B-NP/O very/RB/I-NP/O clear/JJ/I-NP/O picture/NN/I-NP/O of/IN/B-PP/B-PNP the/DT/B-NP/I-PNP risks/NNS/I-NP/I-PNP they/PRP/I-NP/I-PNP '/POS/O/O re/NN/B-NP/O taking/VBG/B-VP/O
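
Each token in the output above is written as `word/POS/chunk/PNP`, with one sentence per line. The sketch below turns such a string into tuples with plain string methods. The helper name `split_parse` is hypothetical, and the sample is a shortened copy of the first output line.

```python
from typing import List, Tuple


def split_parse(parsed: str) -> List[List[Tuple[str, ...]]]:
    """Split `word/POS/chunk/PNP` output into one list of tuples per line."""
    sentences = []
    for line in parsed.splitlines():
        # Split from the right so that only the last three slashes are used.
        tokens = [tuple(item.rsplit("/", 3)) for item in line.split()]
        if tokens:
            sentences.append(tokens)
    return sentences


sample = "In/IN/B-PP/B-PNP new/JJ/B-NP/I-PNP lawsuits/NNS/I-NP/I-PNP ././O/O"
for word, pos, chunk, pnp in split_parse(sample)[0]:
    print(f"{word:10} {pos:4} {chunk:5} {pnp}")
```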


## Machine translation


Any text processed with `TextBlob` gives access to a fairly good machine translator through the `.translate` method. Note how we use it. The target language must be specified. The source language can in principle be predicted from the input text, but with textblob 0.17.1 the calls below pass `from_lang` explicitly, because omitting it raised the `AttributeError` recorded in the cell's comments.

In [37]:
# from Chinese to English and Spanish
oracion_zh = "中国探月工程 亦稱嫦娥工程，是中国启动的第一个探月工程，于2003年3月1日正式启动"
t_zh = TextBlob(oracion_zh)
print(t_zh.translate(from_lang="zh-CN", to="en"))
print(t_zh.translate(from_lang="zh-CN", to="es"))

oracion_ru = "В 1943 году была отправлена в США, где выступала в защиту британской «белой книги», после чего работала в Канаде и Индии."
t_ru = TextBlob(oracion_ru)
print(t_ru.translate(from_lang="ru", to="en"))
print(t_ru.translate(from_lang="ru", to="es"))

print("--------------")

t_es = TextBlob(
    "La deuda pública ha marcado nuevos récords en España en el tercer trimestre"
)

# ERROR:  ---> 17 print(t_es.translate(to="el"))
#         AttributeError: 'list' object has no attribute 'strip'
#         textblob-0.17.1
#         fix: add [from_lang="es",] in calls to t_es.translate() 
#         20221212

print(t_es.translate(from_lang="es", to="el"))
print(t_es.translate(from_lang="es", to="ru"))
print(t_es.translate(from_lang="es", to="eu"))
print(t_es.translate(from_lang="es", to="fi"))
print(t_es.translate(from_lang="es", to="fr"))
print(t_es.translate(from_lang="es", to="nl"))
print(t_es.translate(from_lang="es", to="gl"))
print(t_es.translate(from_lang="es", to="ca"))
print(t_es.translate(from_lang="es", to="zh"))
print(t_es.translate(from_lang="es", to="la"))
print(t_es.translate(from_lang="es", to="cs"))

# it doesn't work as well with slang
print("--------------")
t_ita = TextBlob("Sono andato a Milano e mi sono divertito un bordello.")
# the sentence is Italian, so the source language is "it", not "es"
print(t_ita.translate(from_lang="it", to="en"))
# print(t_ita.translate(to="es"))
print(t_es)

China Lunar Exploration Project is also known as Chang'e Project. It is the first lunar exploration project launched by China. It was officially launched on March 1, 2003
El Proyecto de Exploración Lunar de China también se conoce como Proyecto Chang'e. Es el primer proyecto de exploración lunar lanzado por China. Se lanzó oficialmente el 1 de marzo de 2003
In 1943 she was sent to the United States, where she spoke in defense of the British White Book, after which she worked in Canada and India.
En 1943 fue enviada a los Estados Unidos, donde habló en defensa del libro blanco británico, después de lo cual trabajó en Canadá e India.
--------------
Το δημόσιο χρέος σημείωσε νέα αρχεία στην Ισπανία το τρίτο τρίμηνο
Государственный долг отметил новые записи в Испании в третьем квартале
Zor publikoak erregistro berriak markatu ditu Espainian hirugarren hiruhilekoan
Julkinen velka on merkinnyt uusia tietueita Espanjassa kolmannella vuosineljänneksellä
La dette publique a marqué de nouveaux r
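
The repeated `translate` calls in the cell above differ only in the target code, so they can be driven from a list. The sketch below assumes a hypothetical `translate_all` helper that accepts any callable taking a target language code. A stand-in function replaces the real network call so the example runs offline, and a failure for one language is recorded instead of aborting the loop.

```python
from typing import Callable, Dict, Iterable


def translate_all(translate_fn: Callable[[str], str],
                  targets: Iterable[str]) -> Dict[str, str]:
    """Call translate_fn once per target code and collect the results."""
    results = {}
    for code in targets:
        try:
            results[code] = str(translate_fn(code))
        except Exception as exc:  # keep going if one target fails
            results[code] = f"<failed: {exc}>"
    return results


def fake_translate(to: str) -> str:
    """Stand-in for a real translation call; knows two canned targets only."""
    canned = {"fr": "texte en français", "nl": "tekst in het Nederlands"}
    if to not in canned:
        raise ValueError(f"unsupported target {to!r}")
    return canned[to]


for code, text in translate_all(fake_translate, ["fr", "nl", "xx"]).items():
    print(code, "->", text)
```

With the notebook's objects, the callable could be `lambda code: t_es.translate(from_lang="es", to=code)`.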

## WordNet

`textblob`, or more precisely any object of the `Word` class, gives us access to WordNet information.

In [38]:
# WordNet
from textblob import Word
from textblob.wordnet import VERB

# which synsets does "car" have?
word = Word("car")
print(word.synsets)

# get the synsets of the word "hack" as a verb
print(Word("hack").get_synsets(pos=VERB))

# print the list of definitions of "car"
print(Word("car").definitions)

# walk the hypernym hierarchy
for s in word.synsets:
    print(s.hypernym_paths())

[Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')]
[Synset('chop.v.05'), Synset('hack.v.02'), Synset('hack.v.03'), Synset('hack.v.04'), Synset('hack.v.05'), Synset('hack.v.06'), Synset('hack.v.07'), Synset('hack.v.08')]
['a motor vehicle with four wheels; usually propelled by an internal combustion engine', 'a wheeled vehicle adapted to the rails of railroad', 'the compartment that is suspended from an airship and that carries personnel and the cargo and the power plant', 'where passengers ride up and down', 'a conveyance for passengers or freight on a cable railway']
[[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('artifact.n.01'), Synset('instrumentality.n.03'), Synset('container.n.01'), Synset('wheeled_vehicle.n.01'), Synset('self-propelled_vehicle.n.01'), Synset('motor_vehicle.n.01'), Synset('car.n.01')], [Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset
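
A hypernym path is the chain of increasingly general synsets that leads from the root `entity.n.01` down to the synset in question. The toy sketch below hard-codes the first path printed above as a child-to-parent dictionary. This is a simplification: real synsets can have several hypernyms, which is why `hypernym_paths()` returns a list of paths.

```python
# Child -> parent links copied from the first path printed for car.n.01.
PARENT = {
    "car.n.01": "motor_vehicle.n.01",
    "motor_vehicle.n.01": "self-propelled_vehicle.n.01",
    "self-propelled_vehicle.n.01": "wheeled_vehicle.n.01",
    "wheeled_vehicle.n.01": "container.n.01",
    "container.n.01": "instrumentality.n.03",
    "instrumentality.n.03": "artifact.n.01",
    "artifact.n.01": "whole.n.02",
    "whole.n.02": "object.n.01",
    "object.n.01": "physical_entity.n.01",
    "physical_entity.n.01": "entity.n.01",
}


def hypernym_path(name):
    """Return the path from the root down to `name`."""
    path = [name]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path[::-1]


print(" > ".join(hypernym_path("car.n.01")))
```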

## Sentiment analysis

In [39]:
# sentiment analysis
opinion1 = TextBlob("This new restaurant is great. I had so much fun!! :-P")
print(opinion1.sentiment)

opinion2 = TextBlob("Google News to close in Spain.")
print(opinion2.sentiment)

# subjectivity ranges from 0 to 1
# polarity ranges from -1 to 1

print(opinion1.sentiment.polarity)

if opinion1.sentiment.subjectivity > 0.5:
    print("Hey, this is an opinion")

Sentiment(polarity=0.5387784090909091, subjectivity=0.6011363636363636)
Sentiment(polarity=0.0, subjectivity=0.0)
0.5387784090909091
Hey, this is an opinion
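
The polarity and subjectivity scores printed above can be turned into readable labels. The sketch below assumes a hypothetical `describe` helper and reuses the 0.5 subjectivity cut-off from the cell above. The `Sentiment` tuple is a local stand-in with the same field names, so the example runs without TextBlob.

```python
from collections import namedtuple

# Local stand-in mirroring the fields of the tuple printed above.
Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])


def describe(sentiment, subjectivity_cutoff=0.5):
    """Return a short label such as 'positive opinion'."""
    if sentiment.polarity > 0:
        tone = "positive"
    elif sentiment.polarity < 0:
        tone = "negative"
    else:
        tone = "neutral"
    kind = "opinion" if sentiment.subjectivity > subjectivity_cutoff else "statement"
    return f"{tone} {kind}"


print(describe(Sentiment(0.5387784090909091, 0.6011363636363636)))
print(describe(Sentiment(0.0, 0.0)))
```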


### Exercise 1

Try analyzing different English sentences (combining verbs that signal subjective information, words with different emotional weight, adding emoticons, etc.) to see whether you can work out how this sentiment analyzer behaves.

In [None]:
# write your code here

`TextBlob` provides access to [other kinds of sentiment analyzers](https://textblob.readthedocs.io/en/dev/advanced_usage.html#sentiment-analyzers), for example a classifier based on *Naive Bayes*. Try it out:

In [42]:
%whos

Variable             Type        Data/Info
------------------------------------------
NaiveBayesAnalyzer   ABCMeta     <class 'textblob.en.senti<...>ents.NaiveBayesAnalyzer'>
TextBlob             type        <class 'textblob.blob.TextBlob'>
VERB                 str         v
Word                 type        <class 'textblob.blob.Word'>
b1                   TextBlob    I havv goood speling!
b2                   TextBlob    Miy naem iz Jonh!
b3                   TextBlob    Boyz dont cri
b4                   TextBlob    psicological posesion achifmen comitment
element              Word        clear picture
item                 tuple       n=2
nltk                 module      <module 'nltk' from '/usr<...>ckages/nltk/__init__.py'>
opinion1             TextBlob    This new restaurant is gr<...>. I had so much fun!! :-P
opinion2             TextBlob    Google News to close in Spain.
oracion_ru           str         В 1943 году была отправле<...>аботала в Канаде и Индии.
oracion_zh          

In [51]:
oracion = []

for sentence in t.sentences:
    oracion.append(str(sentence))
    print(f'Added: {str(sentence)}')

Added: i


In [49]:
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


True

In [50]:
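from textblob.sentiments import NaiveBayesAnalyzer

# Build the analyzer once: every new NaiveBayesAnalyzer() has to train on the
# movie_reviews corpus before its first use, which is slow.
analyzer = NaiveBayesAnalyzer()

# `oracion` is the list of sentence strings built earlier. The loop variable
# needs its own name: `for oracion in oracion` rebinds the list to a string.
for sentence_str in oracion:
    blob = TextBlob(sentence_str, analyzer=analyzer)
    print(blob.sentiment)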

Sentiment(classification='pos', p_pos=0.5, p_neg=0.5)
Sentiment(classification='pos', p_pos=0.5, p_neg=0.5)
Sentiment(classification='pos', p_pos=0.5, p_neg=0.5)
Sentiment(classification='pos', p_pos=0.5, p_neg=0.5)
Sentiment(classification='pos', p_pos=0.5, p_neg=0.5)
Sentiment(classification='pos', p_pos=0.5, p_neg=0.5)
Sentiment(classification='pos', p_pos=0.5, p_neg=0.5)
Sentiment(classification='pos', p_pos=0.5, p_neg=0.5)
Sentiment(classification='pos', p_pos=0.5, p_neg=0.5)
Sentiment(classification='pos', p_pos=0.5, p_neg=0.5)
Sentiment(classification='pos', p_pos=0.5, p_neg=0.5)
Sentiment(classification='pos', p_pos=0.5, p_neg=0.5)


KeyboardInterrupt: ignored

## Other curiosities

In [41]:
# spelling correction
b1 = TextBlob("I havv goood speling!")
print(b1.correct())

b2 = TextBlob("Miy naem iz Jonh!")
print(b2.correct())

b3 = TextBlob("Boyz dont cri")
print(b3.correct())

b4 = TextBlob("psicological posesion achifmen comitment")
print(b4.correct())

I have good spelling!
In name in On!
Boy dont cry
psychological position achifmen commitment
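
According to the TextBlob documentation, `correct()` is based on Peter Norvig's "How to Write a Spelling Corrector". The sketch below illustrates that idea under strong assumptions: `COUNTS` is a tiny made-up frequency table (the real corrector uses a much larger vocabulary), and only candidates one edit away are considered.

```python
import string

# Hypothetical word frequencies, for illustration only.
COUNTS = {"have": 50, "good": 40, "spelling": 5, "cry": 8, "boy": 12}


def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    """Return the most frequent known word within one edit, else the input."""
    if word in COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in COUNTS]
    return max(candidates, key=COUNTS.get) if candidates else word


for w in ["havv", "goood", "speling", "cri", "achifmen"]:
    print(w, "->", correct(w))
```

A word with no known neighbour is returned unchanged, which mirrors `achifmen` surviving in the output above.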


## To infinity and beyond

This brief overview only covers what `TextBlob` offers by default. If you need to customize the tools, take a look at [the advanced documentation](http://textblob.readthedocs.org/en/dev/advanced_usage.html#advanced).