# __**BLOCK 2**__

# 2. Natural Language Processing for a piece of news
## Morphological analysis and translation

##### The code in this notebook was used to create part 1 of the Dash interface. See the "4_Dash_framework_part_1.ipynb" notebook.

The text itself is analyzed with the **TextBlob** module.

**TextBlob** is a Python text-processing library that supports Natural Language Processing tasks such as morphological analysis, entity extraction, opinion analysis, and translation.

This part of the project uses only morphological analysis and translation.

**TextBlob** is built on top of two well-known Python libraries, NLTK and pattern. Its main advantage is that it combines both tools behind a friendlier interface.

TextBlob's default models work only with English text.

**To use TextBlob, run the code below to install the package**

In [None]:
#Install the textblob package and the NLTK corpora it relies on

!pip install textblob
!python -m textblob.download_corpora

In [None]:
from textblob import TextBlob
from collections import Counter
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [None]:
text_news="""Robbie Rotten, has died aged 43 from cancer. Stefansson was best known for his role on the children's show,
which was produced from 2004 to 2014. He was initially diagnosed with pancreatic cancer in 2016, but said it had been
removed with surgery. He often shared his treatment and progress with fans online on social media - announcing in March 
the disease had returned and was inoperable. \"It's not until they tell you you're going to die soon that you realise 
how short life is. Time is the most valuable thing in life because it never comes back. And whether you spend it in the 
arms of a loved one or alone in a prison cell, life is what you make of it. Dream big,\" he posted to Twitter. In June 
his wife Steinunn Olina revealed the father-of-four's cancer was in its final stages. All done with chemo for now.
#allover #happy #happyplantdad A post shared by  Stefán Karl Stefánsson (@stefanssonkarl) on 
Feb 22, 2017 at 12:56pm PST End of Instagram post  by stefanssonkarl A post on Tuesday paid tribute to her husband, 
and said the family would follow his wishes to scatter his remains at sea. \"Stefan's family wants to express their 
gratitude for the support and warmth received in recent years, and to express their deepest sympathy to the many friends 
and fans of Stefan Karl,\" she posted. The actor spent his early career working in film and theatre - playing the title 
role in How the Grinch Stole Christmas in the US from 2008-2015. LazyTown was produced from 2004-2007, with more episodes
made in 2013 and 2014. It followed the life of a pink-haired eight-year-old girl named Stephanie and her superhero 
companion Sportacus, who attempt to liven up an inactive town she moved to. The character of Robbie Rotten was the 
show's villain and attempted to thwart their schemes, preferring to stay unhealthy. The show, produced in Iceland, was
translated into dozens of languages and aired in more than 180 countries worldwide. Fans who grew up with the show 
shared their sadness at the news of Stefansson's death on social media. This has been literally the saddest day on the 
internet since Satoru Iwata passed away. Rest in Piece, Stefan Karl (aka Robbie Rotten). End of Twitter post  by 
@mayhem_crimson I spent many hours with my younger siblings watching you years ago and ive lost count of how many times 
you made us laugh and smile uncontrollably. Thank you for all the joy you brought us and people everywhere. Rest in 
Peace Robbie Rotten. pic.twitter.com/0lXwMc7miX End of Twitter post  by @smg4official Farewell, Robbie Rotten. Hope you
finally get to be lazy without those pesky kids exercising around you 24/7. You were a fantastic, funny character. 
RIP Stefan Karl Stefansson pic.twitter.com/A9GazKA0sL End of Twitter post  by @EpicVoiceGuy The character's popularity
has continued online in recent years, where his exuberant facial expressions are frequently used in memes. Michael Cohen
admitted violating campaign finance laws, but Mr Trump says he has"""

## 2.1 Preparing the data

In [None]:
text_news_2 = TextBlob(text_news)
#Number of words
print('The number of words in the text is', (len(text_news_2.words)))
#Number of sentences
print('The number of sentences in the text is',len(text_news_2.sentences))
#Print the text split into sentences
print(text_news_2.sentences)

In [None]:
#The text is parsed: it is split into sentences and words, and each word receives a part-of-speech tag
words_analysis = text_news_2.parse().split()
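
As a rough illustration of what the nested loops in the next section rely on, the sketch below splits a small hand-written tagged string by hand. It assumes the parser output uses a `word/POS/chunk/PNP` layout with one sentence per line. The sample string and the helper `split_tagged` are hypothetical and are not part of TextBlob, and words that themselves contain a slash are ignored here.

```python
# Hypothetical parser output, assumed to be in "word/POS/chunk/PNP" form.
tagged = "He/PRP/B-NP/O posted/VBD/B-VP/O to/TO/B-PP/B-PNP Twitter/NNP/B-NP/I-PNP ././O/O"

def split_tagged(tagged_text):
    """Return a list of sentences, each a list of [word, tag, chunk, pnp] tokens."""
    return [[token.split("/") for token in sentence.split()]
            for sentence in tagged_text.splitlines() if sentence]

sentences = split_tagged(tagged)
# The tag is the second element of each token.
tags = [token[1] for sentence in sentences for token in sentence]
print(tags)
```

The result is a list of sentences, each sentence a list of tokens, and each token a list whose element at index 1 is the part-of-speech tag.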

## 2.2 Morphological classification of the words

In [None]:
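#The part-of-speech tag of every word is collected and counted
class_word=[]
for sentence in words_analysis:
    for token in sentence:
        #Each token is a list whose second element is the part-of-speech tag
        class_word.append(token[1])
count_class_word=Counter(class_word)

#Punctuation tags are deleted from the counter.
#A raw string avoids the invalid escape sequence "\º".
symbols = r'''@"{}()[].,:;+-*/&|'<>=~#$%€\ºª_?¿!¡'''

for x in symbols:
    if x in count_class_word:
        del count_class_word[x]
count_class_word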
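
The counting and filtering step can be tried on its own with a handful of made-up tags. This is only a sketch: the tag list is hypothetical, and instead of listing every punctuation symbol it assumes that punctuation tags are the ones containing no letters.

```python
from collections import Counter

# Hypothetical tags, as they might come out of a parsed sentence.
tags = ["NNP", "NNP", ",", "VBZ", "VBN", "VBN", "CD", "IN", "NN", "."]
tag_counts = Counter(tags)

# Keep only tags that contain at least one letter; punctuation tags have none.
word_tag_counts = Counter({tag: n for tag, n in tag_counts.items()
                           if any(ch.isalpha() for ch in tag)})
print(word_tag_counts.most_common())
```

Filtering by "contains a letter" is a design shortcut. The explicit symbol list used above is stricter and may suit texts with unusual tags better.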

## 2.3 Representing the data

In [None]:
#The items of count_class_word are put in a list and then plotted.
list_class_word=list(count_class_word.items())
#The data is sorted by count, in descending order
list_class_word_sorted=sorted(list_class_word, key=lambda x: x[1], reverse=True)
#A description is assigned to each tag
tag_words={'CC':'coordinating conjunction', 'CD':'cardinal digit', 'DT':'determiner',
          'EX':'existential there', 'FW':'foreign word', 'IN':'preposition/subordinating conjunction',
          'JJ':'adjective', 'JJR':'adjective, comparative', 'JJS':'adjective, superlative',
          'LS':'list marker','MD':'modal', 'NN':'noun, singular', 'NNS':'noun plural',
          'NNP':'proper noun, singular', 'NNPS':'proper noun, plural','PDT':'predeterminer',
          'POS':'possessive ending','PRP':'personal pronoun','PRP$':'possessive pronoun',
          'RB':'adverb','RBR':'adverb, comparative','RBS':'adverb, superlative', 'RP':'particle',
          'TO':'to','UH':'interjection','VB':'verb, base form','VBD':'verb, past tense',
          'VBG':'verb, gerund/present participle','VBN':'verb, past participle',
          'VBP':'verb, sing. present, non-3rd person','VBZ':'verb, 3rd person sing. present',
          'WDT':'wh-determiner','WP':'wh-pronoun','WP$':'possessive wh-pronoun','WRB':'wh-adverb'}
#The figure is created before plotting so that the size applies to the bar chart
plt.figure(figsize=(15,15))
#Labels and values are both taken from the sorted list so that they stay aligned;
#tags missing from tag_words are shown as-is
plt.bar([tag_words.get(tag, tag) for tag, _ in list_class_word_sorted], [count for _, count in list_class_word_sorted])
plt.xticks(rotation='vertical')
plt.show()
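
When plotting counts against descriptive labels, both sequences need to come from the same sorted list, otherwise bars end up under the wrong label. The sketch below shows the idea without matplotlib, using a text bar chart. The counts, the shortened `tag_names` mapping, and the unknown tag `XX` are all made up for illustration.

```python
# Hypothetical counts; tag_names is a small subset of the full tag table.
tag_counts = {"IN": 4, "NNP": 9, "XX": 1, "VBD": 6}
tag_names = {"IN": "preposition/subordinating conjunction",
             "NNP": "proper noun, singular",
             "VBD": "verb, past tense"}

# Sort once, then derive labels and heights from the same sorted pairs.
pairs = sorted(tag_counts.items(), key=lambda item: item[1], reverse=True)
labels = [tag_names.get(tag, tag) for tag, _ in pairs]  # unknown tags keep their code
heights = [count for _, count in pairs]

for label, height in zip(labels, heights):
    print(f"{label:<40} {'#' * height}")
```

Falling back to the raw tag with `dict.get(tag, tag)` avoids `None` labels when the tagger emits a tag that the description table does not cover.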

For more information about the word classes, see the POS tag list:

**CC** coordinating conjunction<br>
**CD** cardinal digit<br>
**DT** determiner<br>
**EX** existential: 'there' (like: "there is" ... think of it like "there exists")<br>
**FW** foreign word<br>
**IN** preposition/subordinating conjunction<br>
**JJ** adjective: 'big'<br>
**JJR** adjective, comparative: 'bigger'<br>
**JJS** adjective, superlative: 'biggest'<br>
**LS** list marker: '1)'<br>
**MD** modal: 'could', 'will'<br>
**NN** noun, singular: 'desk'<br>
**NNS** noun plural: 'desks'<br>
**NNP** proper noun, singular: 'Harrison'<br>
**NNPS** proper noun, plural: 'Americans'<br>
**PDT** predeterminer: 'all the kids'<br>
**POS** possessive ending: "parent's"<br>
**PRP** personal pronoun: 'I', 'he', 'she'<br>
**PRP\$** possessive pronoun: 'my', 'his', 'hers'<br>
**RB** adverb: 'very', 'silently'<br>
**RBR** adverb, comparative: 'better'<br>
**RBS** adverb, superlative: 'best'<br>
**RP** particle: 'give up'<br>
**TO** to: 'to go to the store'<br>
**UH** interjection: 'errrrrrrrm'<br>
**VB** verb, base form: 'take'<br>
**VBD** verb, past tense: 'took'<br>
**VBG** verb, gerund/present participle: 'taking'<br>
**VBN** verb, past participle: 'taken'<br>
**VBP** verb, sing. present, non-3rd person: 'take'<br>
**VBZ** verb, 3rd person sing. present: 'takes'<br>
**WDT** wh-determiner: 'which'<br>
**WP** wh-pronoun: 'who', 'what'<br>
**WP\$** possessive wh-pronoun: 'whose'<br>
**WRB** wh-adverb: 'where', 'when'
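
Because related tags in this list share a prefix (for example all noun tags start with `NN`), fine-grained counts can be folded into broader word classes. The sketch below is one possible way to do that. The `COARSE` table, the helper `coarse_class`, and the sample counts are hypothetical and not part of the notebook's pipeline.

```python
from collections import Counter

# Assumed prefix-to-class table; every other tag falls into "other".
COARSE = [("NN", "noun"), ("VB", "verb"), ("JJ", "adjective"), ("RB", "adverb")]

def coarse_class(tag):
    """Map a fine-grained tag to a broad word class using its prefix."""
    for prefix, name in COARSE:
        if tag.startswith(prefix):
            return name
    return "other"

# Hypothetical fine-grained counts.
fine_counts = Counter({"NN": 5, "NNP": 7, "VBD": 4, "VBZ": 2, "JJS": 1, "RB": 3, "PRP$": 2})

coarse_counts = Counter()
for tag, count in fine_counts.items():
    coarse_counts[coarse_class(tag)] += count
print(coarse_counts)
```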

## 2.4 Translating the text

In [None]:
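#The text is translated from English ("en") to Galician ("gl").
#Note: translate() calls an online translation service and was removed in TextBlob 0.18,
#so this cell needs an older TextBlob version and an internet connection.
text_news_2 = TextBlob(text_news)
print(text_news_2.translate(from_lang="en", to="gl"))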
