# One Language Two Text 
## Natural Language Pipeline

### Set Up
Load a separate spaCy model for each language.

In [1]:
import spacy
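# The 'en' and 'es' shortcut names work in spaCy 2.x. In spaCy 3+ load the full
# package names instead, e.g. 'en_core_web_sm' and 'es_core_news_sm'.
# The old parse/tag/entity flags are dropped: the loaded pipeline already
# includes the tagger, parser, and entity recognizer.
nlp_en = spacy.load('en')
nlp_sp = spacy.load('es')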

Take a string and convert it into a Doc object

In [2]:
Alice_text = '''Presently she began again. ‘I wonder if I shall fall right THROUGH the earth! How funny it’ll seem to come out among the people that walk with their heads downward! The Antipathies, I think--’ (she was rather glad there WAS no one listening, this time, as it didn’t sound at all the right word) ‘--but I shall have to ask them what the name of the country is, you know. Please, Ma’am, is this New Zealand or Australia?’ (and she tried to curtsey as she spoke--fancy CURTSEYING as you’re falling through the air! Do you think you could manage it?) ‘And what an ignorant little girl she’ll think me for asking! No, it’ll never do to ask: perhaps I shall see it written up somewhere.’ '''
Alice_spacy = nlp_en(Alice_text)
Alice_spacy[:100] # first 100 tokens (not displayed: a cell only echoes its last expression)
len(Alice_spacy) # total number of tokens in the Doc

154

In [3]:
Quixote_text = '''Con estas razones perdía el pobre caballero el juicio, y desvelábase por entenderlas y desentrañarles el sentido, que no se lo sacara ni las entendiera el mesmo Aristóteles, si resucitara para sólo ello. No estaba muy bien con las heridas que don Belianís daba y recebía, porque se imaginaba que, por grandes maestros que le hubiesen curado, no dejaría de tener el rostro y todo el cuerpo lleno de cicatrices y señales. Pero, con todo, alababa en su autor aquel acabar su libro con la promesa de aquella inacabable aventura, y muchas veces le vino deseo de tomar la pluma y dalle fin al pie de la letra, como allí se promete; y sin duda alguna lo hiciera, y aun saliera con ello, si otros mayores y continuos pensamientos no se lo estorbaran. Tuvo muchas veces competencia con el cura de su lugar (que era hombre docto, graduado en Sigüenza), sobre cuál había sido mejor caballero: Palmerín de Ingalaterra, o Amadís de Gaula; mas maese Nicolás, barbero del mismo pueblo, decía que ninguno llegaba al Caballero del Febo, y que si alguno se le podía comparar, era don Galaor, hermano de Amadís de Gaula, porque tenía muy acomodada condición para todo; que no era caballero melindroso, ni tan llorón como su hermano, y que en lo de la valentía no le iba en zaga.'''
Quixote_spacy = nlp_sp(Quixote_text)
Quixote_spacy[:100]

Con estas razones perdía el pobre caballero el juicio, y desvelábase por entenderlas y desentrañarles el sentido, que no se lo sacara ni las entendiera el mesmo Aristóteles, si resucitara para sólo ello. No estaba muy bien con las heridas que don Belianís daba y recebía, porque se imaginaba que, por grandes maestros que le hubiesen curado, no dejaría de tener el rostro y todo el cuerpo lleno de cicatrices y señales. Pero, con todo, alababa en su autor aquel acabar su libro con la promesa de aquella inacabable aventura

In [4]:
apples = nlp_en(u'I like apples')
oranges = nlp_en(u'I like oranges')
apples.similarity(oranges)

0.9314620260888731

### Tokens

In [5]:
[obj.text for obj in Alice_spacy.sents] # at the sentence level
[token for token in Alice_spacy] # at the word level

[Presently,
 she,
 began,
 again,
 .,
 ‘,
 I,
 wonder,
 if,
 I,
 shall,
 fall,
 right,
 THROUGH,
 the,
 earth,
 !,
 How,
 funny,
 it,
 ’ll,
 seem,
 to,
 come,
 out,
 among,
 the,
 people,
 that,
 walk,
 with,
 their,
 heads,
 downward,
 !,
 The,
 Antipathies,
 ,,
 I,
 think--’,
 (,
 she,
 was,
 rather,
 glad,
 there,
 WAS,
 no,
 one,
 listening,
 ,,
 this,
 time,
 ,,
 as,
 it,
 did,
 n’t,
 sound,
 at,
 all,
 the,
 right,
 word,
 ),
 ‘,
 --but,
 I,
 shall,
 have,
 to,
 ask,
 them,
 what,
 the,
 name,
 of,
 the,
 country,
 is,
 ,,
 you,
 know,
 .,
 Please,
 ,,
 Ma’am,
 ,,
 is,
 this,
 New,
 Zealand,
 or,
 Australia?’,
 (,
 and,
 she,
 tried,
 to,
 curtsey,
 as,
 she,
 spoke,
 --,
 fancy,
 CURTSEYING,
 as,
 you,
 ’re,
 falling,
 through,
 the,
 air,
 !,
 Do,
 you,
 think,
 you,
 could,
 manage,
 it,
 ?,
 ),
 ‘,
 And,
 what,
 an,
 ignorant,
 little,
 girl,
 she,
 ’ll,
 think,
 me,
 for,
 asking,
 !,
 No,
 ,,
 it,
 ’ll,
 never,
 do,
 to,
 ask,
 :,
 perhaps,
 I,
 shall,
 see,
 it,
 written,


In [None]:
[obj.text for obj in Quixote_spacy.sents] # at the sentence level
[token for token in Quixote_spacy] # at the word level

### POS tags

In [None]:
[(token, token.pos_) for token in Alice_spacy]

In [None]:
[(token, token.pos_) for token in Quixote_spacy]

### Named Entity Recognition
#### The entity visualizer "ent" highlights named entities and their labels in a text.

In [6]:
from spacy import displacy
displacy.render(Alice_spacy, style='ent', jupyter=True)

In [7]:
displacy.render(Quixote_spacy, style='ent', jupyter=True)

#### The ".ents" property iterates over the entities in the document and yields named-entity Span objects.

In [8]:
for ent in Alice_spacy.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Antipathies 169 180 ORG
New Zealand 393 404 GPE


#### The ".ent_iob_" attribute provides a named entity tag of "B", "I" or "O".
"B" means the token begins an entity,  
"I" means it is inside an entity,  
"O" means it is outside an entity, and  
"" means no entity tag is set.  


In [9]:
for ent in Quixote_spacy.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Aristóteles 161 172 PER
No 204 206 LOC
Belianís 247 255 ORG
Sigüenza 832 840 LOC
Palmerín de Ingalaterra 882 905 PER
Amadís de Gaula 909 924 LOC
Nicolás 936 943 PER
Caballero del Febo 1000 1018 MISC
don Galaor 1062 1072 PER
Amadís de Gaula 1085 1100 LOC


In [None]:
[(token, token.ent_iob_) for token in Alice_spacy]

In [None]:
[(token, token.ent_iob_) for token in Quixote_spacy]
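
To see how the "B", "I" and "O" tags relate to the entity spans printed earlier, here is a minimal sketch that groups tagged tokens back into entities. The helper name `iob_to_spans` and the example tags are hypothetical and hand-written for illustration; they are not output from the models above.

```python
def iob_to_spans(tagged_tokens):
    """Group (text, iob, label) triples into (entity_text, label) spans."""
    spans = []
    words, label = [], None
    for text, iob, ent_label in tagged_tokens:
        if iob == "B" or (iob == "I" and not words):
            # A new entity starts here: close any entity that is still open.
            if words:
                spans.append((" ".join(words), label))
            words, label = [text], ent_label
        elif iob == "I":
            # Continue the entity that is currently open.
            words.append(text)
        else:
            # "O" or "": the token is outside any entity.
            if words:
                spans.append((" ".join(words), label))
            words, label = [], None
    if words:
        spans.append((" ".join(words), label))
    return spans


# Hand-written tags (an assumption for illustration, not real model output).
example = [
    ("is", "O", ""),
    ("this", "O", ""),
    ("New", "B", "GPE"),
    ("Zealand", "I", "GPE"),
    ("or", "O", ""),
    ("Australia", "B", "GPE"),
]
print(iob_to_spans(example))
```

With a real Doc, the same helper could be fed `(t.text, t.ent_iob_, t.ent_type_) for t in Alice_spacy`, and the result should line up with what `.ents` yields.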

### <span style="color:blue">Your turn </span> / <span style="color:red"> ¡Te toca!</span>
Build a function that converts a plain text into a pandas DataFrame. It should call all the functions needed to build your NLP pipeline: tokenization, part-of-speech tagging, and named entity recognition.

In [None]:
def text_processor(text, language="en"):
    
    # tokenization
    
    # part of speech tagging
    
    # named entity recognition
    
    
    return text
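
One possible starting point for the skeleton above is to collect the token attributes used earlier in this notebook into one row per token. This is only a sketch: the helper name `doc_to_rows` and the column names are assumptions, and the stand-in tokens are hand-written so that the cell runs without loading a model.

```python
from types import SimpleNamespace


def doc_to_rows(doc):
    """Return one dict per token with its text, POS tag, and entity tags."""
    return [
        {
            "token": tok.text,
            "pos": tok.pos_,
            "ent_iob": tok.ent_iob_,
            "ent_type": tok.ent_type_,
        }
        for tok in doc
    ]


# Stand-in tokens with hand-written values (hypothetical, not model output).
fake_doc = [
    SimpleNamespace(text="New", pos_="PROPN", ent_iob_="B", ent_type_="GPE"),
    SimpleNamespace(text="Zealand", pos_="PROPN", ent_iob_="I", ent_type_="GPE"),
    SimpleNamespace(text="is", pos_="AUX", ent_iob_="O", ent_type_=""),
]
rows = doc_to_rows(fake_doc)
print(rows[0])
```

Inside `text_processor`, the rows could come from `doc_to_rows(nlp_en(text))` or `doc_to_rows(nlp_sp(text))` depending on `language`, and `pd.DataFrame(rows)` would turn them into a DataFrame with one column per key.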

Test your function on the following documents

In [1]:
spn_newspaper_articles = open("Data/nacion_articles.txt").read()
#print(spn_newspaper_articles[:300])

In [2]:
eng_dialogue = open("Data/EngCorpus.txt").read()
#print(eng_dialogue[:300])

In [3]:
cs_interview = open("Data/Spanish_in_Texas_subset.txt").read()
#print(cs_interview[:300])

In [4]:
cs_book = open("Data/CodeswitchedBook.txt").read()
#print(cs_book[:300])

Explore these documents. How many words are in each? What is the distribution of POS tags? How accurate are the tokenization and the POS tagging for cs_interview and cs_book?
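
The questions above can be approached with a few lines of counting. The sketch below is one way to do it, under stated assumptions: the helper names `pos_distribution` and `tagging_accuracy` are hypothetical, the sample tags are hand-written, and measuring accuracy assumes you have annotated a sample of tokens by hand to compare against.

```python
from collections import Counter


def pos_distribution(tagged_tokens):
    """Map each POS tag to (count, share of all tokens), most frequent first."""
    counts = Counter(pos for _, pos in tagged_tokens)
    total = sum(counts.values())
    return {pos: (n, n / total) for pos, n in counts.most_common()}


def tagging_accuracy(predicted, gold):
    """Share of tokens whose predicted tag equals the hand-annotated tag."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and aligned")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Hand-written (token, POS) pairs for illustration only.
sample = [
    ("I", "PRON"),
    ("like", "VERB"),
    ("apples", "NOUN"),
    ("and", "CCONJ"),
    ("oranges", "NOUN"),
]
dist = pos_distribution(sample)
for pos, (n, share) in dist.items():
    print(f"{pos:<6}{n:>3}{share:>8.1%}")

# Compare predicted tags with a (hypothetical) hand-annotated gold standard.
accuracy = tagging_accuracy(["NOUN", "VERB", "ADJ", "NOUN"],
                            ["NOUN", "VERB", "NOUN", "NOUN"])
print(accuracy)
```

For a real document, the pairs could be built as `[(t.text, t.pos_) for t in nlp_en(eng_dialogue)]`, and the total token count is simply the length of that list.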

## Language Identification