# United Nations Parallel Corpora Analysis using SpaCy and Camel
### Kinan Al-Mouk / kim47@pitt.edu 

## Table of Contents 
[`Imports`](#Importing-Tools)

[`Uploading and Processing Data`](#Uploading-and-Processing-Data-using-Spacy)
    
   - [`English`](#English)
       - [Reading File](#Reading-in-English-File)
       - [Downloading SpaCy English Module](#Downloading-SpaCy-object-for-English-processing)
       - [Word Tokenizing](#Word-Tokenizing-English-Document)
       - [Sentence Tokenizing](#Sentence-Tokenizing-English-Document)
       - [POS Tagging](#POS-Tagging-English-Document)
       - [Dependency Tagging](#Dependency-Tagging-English-Document)
           - Image Rendering
       - [Stop Word Tagging](#Eng-Stop-Word-Tagging)
       - [Entity Tagging](#English-Entity-Tagging)
   - [`Spanish`](#Spanish)
       - [Reading File](#Reading-in-Spanish-File)
       - [Word Tokenizing](#Word-Tokenizing-Spanish-Document)
       - [Sentence Tokenizing](#Sentence-Tokenizing-Spanish-Document)
       - [POS Tagging](#POS-Tagging-Spanish-Document)
       - [Dependency Tagging](#Dependency-Tagging-Spanish-Document)
           - Image Rendering
       - [Entity Tagging](#Spanish-Entity-Tagging)
   - [`French`](#French)
       - [Reading File](#Reading-in-French-File)
       - [Word Tokenizing](#Word-Tokenizing-French-Document)
       - [Sentence Tokenizing](#Sentence-Tokenizing-French-Document)
       - [POS Tagging](#POS-Tagging-French-Document)
       - [Dependency Tagging](#Dependency-Tagging-French-Document)
           - Image Rendering
       - [Entity Tagging](#French-Entity-Tagging)
   - [`Russian`](#Russian)
       - [Reading File](#Reading-in-Russian-File)
       - [Word Tokenizing](#Word-Tokenizing-Russian-Document)
       - [Sentence Tokenizing](#Sentence-Tokenizing-Russian-Document)
       - [POS Tagging](#POS-Tagging-Russian-Document)
       - [Dependency Tagging](#Dependency-Tagging-Russian-Document)
       - [Entity Tagging](#Russian-Entity-Tagging)
   - [`Chinese`](#Chinese)
       - [Reading File](#Reading-in-Mandarin-File)
       - [Word Tokenizing](#Word-Tokenizing-Mandarin-Document)
       - [Sentence Tokenizing](#Sentence-Tokenizing-Mandarin-Document)
       - [POS Tagging](#POS-Tagging-Mandarin-Document)
       - [Dependency Tagging](#Dependency-Tagging-Mandarin-Document)
       - [Entity Tagging](#Mandarin-Entity-Tagging)
   - [`Arabic`](#Arabic)
  
[`Creating DataFrame for Analyzation`](#Creating-DataFrame-for-Analyzation)
  


# Importing Tools

In [1]:
import nltk 
import pickle
from time import time
import numpy as np 
import pandas as pd 

In [2]:
import spacy

I am using [SpaCy](https://spacy.io/) for NLP of **English, Spanish, French, Russian**, and **Mandarin Chinese** to keep tokenization accurate for each language, since NLTK is primarily used for English-language NLP.

However, there is no [SpaCy](https://spacy.io/) module that allows me to process the **Arabic** files. At the moment I am still looking into other resources.

# Uploading and Processing Data using Spacy

## English

### Reading in English File

In [3]:
# Loading in English Files
start = time()
f = open('data/sixway/english.100k', 'r') # Reading in English File
english = f.read()
print("Data loaded in:", (time()-start), "seconds.")
f.close()

Data loaded in: 0.14672303199768066 seconds.
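The loading cell above relies on the platform's default encoding and a manual `close()`. Below is a minimal sketch of an alternative, assuming the corpus files are UTF-8 encoded (the notebook does not state the encoding). `read_corpus` is a hypothetical helper name, and the temporary file only stands in for `data/sixway/english.100k`.

```python
import tempfile
from pathlib import Path


def read_corpus(path, encoding="utf-8"):
    """Return the full text of a corpus file (hypothetical helper)."""
    # The context manager closes the file even if read() raises.
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# Demonstration on a temporary file standing in for a real corpus file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.100k"
    demo_path.write_text(
        "RESOLUTION 918 (1994)\nRESOLUCIÓN 918 (1994)\n", encoding="utf-8"
    )
    demo_text = read_corpus(demo_path)

print(len(demo_text.splitlines()), "lines read")
```

Passing the encoding explicitly matters most for the non-English files, where a wrong default could silently corrupt accented or non-Latin characters.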


### Downloading SpaCy object for English processing

In [4]:
english_samp = english[:1000000] # 1000000 is the character limit for SpaCy processing
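Slicing at a fixed character offset can cut the last sentence in half. The sketch below splits a text into chunks that stay under a limit while breaking only at line boundaries. `chunk_by_lines` is a hypothetical helper, and the 1,000,000 default simply mirrors the limit mentioned in the cell above (SpaCy exposes it as `nlp.max_length`).

```python
def chunk_by_lines(text, max_chars=1_000_000):
    """Yield chunks made of whole lines, each at most max_chars long.

    A single line longer than max_chars is yielded on its own and
    would still exceed the limit; this sketch does not split it.
    """
    chunk, size = [], 0
    for line in text.splitlines(keepends=True):
        # Start a new chunk when adding this line would pass the limit.
        if chunk and size + len(line) > max_chars:
            yield "".join(chunk)
            chunk, size = [], 0
        chunk.append(line)
        size += len(line)
    if chunk:
        yield "".join(chunk)


sample = "first line\nsecond line\nthird line\n"
chunks = list(chunk_by_lines(sample, max_chars=25))
print(chunks)
```

Each chunk could then be passed to `nlp` separately, so no text is discarded and no sentence is split mid-line.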

In [5]:
start = time()
nlp = spacy.load('en_core_web_sm')
eng_doc = nlp(english_samp)

print("English document processed in:", (time()-start), "seconds.")

English document processed in: 89.45646619796753 seconds.


Here I am using the `time` function to see how long [SpaCy](https://spacy.io/)'s pipeline takes to process a document. The English document usually takes about 86.43 seconds to complete; the run above took 89.46 seconds.
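Since the same `start = time()` / `print(...)` pattern repeats in almost every cell, it could be wrapped once. This is only a sketch: `timed` is a hypothetical helper, `perf_counter` is swapped in for `time` because it is meant for measuring intervals, and `sleep` stands in for a slow call such as `nlp(english_samp)`.

```python
from contextlib import contextmanager
from time import perf_counter, sleep


@contextmanager
def timed(label, results=None):
    """Print how long the enclosed block took; optionally record it."""
    start = perf_counter()
    try:
        yield
    finally:
        elapsed = perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label} in: {elapsed:.2f} seconds.")


timings = {}
with timed("Demo step processed", timings):
    sleep(0.01)  # stand-in for a slow call such as nlp(english_samp)
```

Collecting the timings in a dictionary also makes it easy to compare languages afterwards instead of reading them off individual cell outputs.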

### Word Tokenizing English Document

In [6]:
start = time()

eng_tokens = [token for token in eng_doc]  # Tokenizing Document

print("English document word tokenized in:", (time()-start), "seconds.")

English document word tokenized in: 0.25989675521850586 seconds.


In [7]:
print('English Word Tokens Count:', len(eng_tokens), '\n \n', eng_tokens[:20])

English Word Tokens Count: 177808 
 
 [RESOLUTION, 918, (, 1994, ), 
, Adopted, by, the, Security, Council, at, its, 3377th, meeting, ,, on, 17, May, 1994]


[SpaCy's Tokenizer](https://spacy.io/api/tokenizer) segments text into words, punctuation marks, etc., and creates a `Doc` object with the discovered segment boundaries. The **global** and **language-specific** tokenizer data is supplied via the language data in `spacy/lang`.

### Sentence Tokenizing English Document

In [8]:
start = time()
eng_sent = [sent.text for sent in eng_doc.sents]
print("English document sentence tokenized in:", (time()-start), "seconds.")

English document sentence tokenized in: 0.11670398712158203 seconds.


In [9]:
len(eng_sent)

6662

In [163]:
example_eng_sent = eng_sent[1]
print(example_eng_sent)

Demands that all parties to the conflict immediately cease hostilities, agree to a cease-fire, and bring an end to the mindless violence and carnage engulfing Rwanda;
2.
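The example sentence ends with a dangling `2.`: the number of the next paragraph was attached to the end of this sentence. A minimal sketch of cleaning that up with a regular expression follows. `strip_trailing_number` is a hypothetical helper, the sentence is abbreviated, and the pattern only covers the `newline + digits + period` case seen in the output above.

```python
import re

# A newline followed by a paragraph number such as "2." at the very end.
TRAILING_NUMBER = re.compile(r"\s*\n\d+\.\s*$")


def strip_trailing_number(sentence):
    """Remove a dangling paragraph number left at the end of a sentence."""
    return TRAILING_NUMBER.sub("", sentence)


raw = "Demands that all parties to the conflict immediately cease hostilities;\n2."
cleaned = strip_trailing_number(raw)
print(cleaned)
```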


### POS Tagging English Document

In [51]:
start = time()
eng_pos = [token.pos_ for token in eng_doc] 
print("English document POS tagged in:", (time()-start), "seconds.")

English document POS tagged in: 0.056739091873168945 seconds.


In [52]:
print('\nList of POS tags:', set(eng_pos), '\n\nPOS count:', len(set(eng_pos)))


List of POS tags: {'INTJ', 'NOUN', 'X', 'NUM', 'DET', 'PRON', 'PART', 'ADP', 'VERB', 'PROPN', 'SPACE', 'ADJ', 'CCONJ', 'SYM', 'ADV', 'AUX', 'SCONJ', 'PUNCT'} 

POS count: 18


The [SpaCy](https://spacy.io/) library is quite easy to use. All that is necessary is to instantiate a [SpaCy](https://spacy.io/) `Doc` object, then iterate over it and read each token's `pos_` attribute to get its **POS** tag.

### Dependency Tagging English Document

In [53]:
from spacy import displacy

In [54]:
start = time()
eng_dep = [token.dep_ for token in eng_doc] 
print("English document DEP tagged in:", (time()-start), "seconds.")

English document DEP tagged in: 0.08782196044921875 seconds.


In [55]:
eng_dep[:5]

['nsubj', 'nummod', 'punct', 'npadvmod', 'punct']

In [75]:
len(set(eng_dep))

45

[SpaCy](https://spacy.io/)'s **dependency parser** comes with a visualization module, displaCy. We can pass a `Doc` or a list of `Doc` objects to displaCy and run `displacy.render` to generate the raw markup. I am using the example sentence from above, `example_eng_sent`, because it is one of the shortest sentences I managed to find.

In [56]:
sentence_doc = nlp(example_eng_sent)
displacy.render(sentence_doc, style='dep')

### Stop Word Tagging

In [57]:
start = time()
eng_stop = [token for token in eng_doc if not token.is_stop]  # Keeps only the tokens that are NOT stop words
print("English document STOP tagged in:", (time()-start), "seconds.")

English document STOP tagged in: 0.09779000282287598 seconds.


### English Entity Tagging

In [58]:
start = time()
eng_ent = []
eng_label = []
for ent in eng_doc.ents:
    eng_ent.append(ent.text)
    eng_label.append(ent.label_)
print("English document entity tagged in:", (time()-start), "seconds.")

English document entity tagged in: 0.06624889373779297 seconds.


In [59]:
print(set(eng_label))

{'ORG', 'LAW', 'WORK_OF_ART', 'FAC', 'LOC', 'LANGUAGE', 'ORDINAL', 'PERCENT', 'CARDINAL', 'NORP', 'DATE', 'PRODUCT', 'TIME', 'EVENT', 'MONEY', 'QUANTITY', 'PERSON', 'GPE'}


In [60]:
eng_ents = np.column_stack((eng_ent, eng_label))

In [61]:
print(eng_ents[:10], '\n\nLength of English Entities:', len(eng_ents))

[['918' 'CARDINAL']
 ['1994' 'DATE']
 ['the Security Council' 'ORG']
 ['3377th' 'ORDINAL']
 ['17 May 1994' 'DATE']
 ['The Security Council' 'ORG']
 ['Rwanda' 'GPE']
 ['resolution 872' 'LAW']
 ['1993' 'DATE']
 ['5 October 1993' 'DATE']] 

Length of English Entities: 12124
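To see which labels dominate, the `(text, label)` pairs can be tallied with `collections.Counter`. The sketch below uses only the ten pairs printed above as sample data, so the counts describe that sample, not the full document.

```python
from collections import Counter

# The first ten (text, label) pairs printed above.
sample_ents = [
    ("918", "CARDINAL"),
    ("1994", "DATE"),
    ("the Security Council", "ORG"),
    ("3377th", "ORDINAL"),
    ("17 May 1994", "DATE"),
    ("The Security Council", "ORG"),
    ("Rwanda", "GPE"),
    ("resolution 872", "LAW"),
    ("1993", "DATE"),
    ("5 October 1993", "DATE"),
]

label_counts = Counter(label for _, label in sample_ents)
for label, count in label_counts.most_common():
    print(f"{label:<10}{count}")
```

The same line would work on the full lists, for example `Counter(eng_label)`.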


## Spanish

### Reading in Spanish File

In [62]:
# Loading in Spanish Files
start = time()

f = open('data/sixway/spanish.100k', 'r') # Reading in Spanish File
spanish100 = f.read()
print("Data loaded in:", (time()-start), "seconds.")
f.close()

Data loaded in: 0.22364211082458496 seconds.


In [63]:
# Creating Smaller Slice to Analyze using SpaCy
spanish_samp = spanish100[:1000000]

### Downloading SpaCy object for Spanish processing

In [24]:
start = time()

nlp = spacy.load('es_core_news_sm')
es_doc = nlp(spanish_samp)

print("Spanish document processed in:", (time()-start), "seconds.")

Spanish document processed in: 110.16744899749756 seconds.


### Word Tokenizing Spanish Document

In [25]:
start = time()
es_tokens = [token for token in es_doc]
print("Spanish document tokenized in:", (time()-start), "seconds.")

Spanish document tokenized in: 0.4779846668243408 seconds.


In [26]:
print('Spanish Word Tokens Count:', len(es_tokens), '\n \n', es_tokens[:20])

Spanish Word Tokens Count: 176209 
 
 [RESOLUCIÓN, 918, (, 1994, ), 
, Aprobada, por, el, Consejo, de, Seguridad, en, su, 3377ª, sesión, ,, celebrada, el, 17]


### Sentence Tokenizing Spanish Document

In [27]:
start = time()
es_sent = [sent.text for sent in es_doc.sents]
print("Spanish document sentence tokenized in:", (time()-start), "seconds.")

Spanish document sentence tokenized in: 0.21952581405639648 seconds.


In [28]:
len(es_sent)

4861

In [64]:
example_es_sent = es_sent[1]
print(example_es_sent)

Exige que todas las partes en el conflicto pongan fin inmediatamente a las hostilidades, convengan en una cesación del fuego y pongan término a la violencia y la carnicería insensatas en que está sumida Rwanda;
2.


#### Comparing SpaCy English Tokenizer with Spanish Tokenizer

In [71]:
print(eng_sent[1],'\n\n',es_sent[1])

Demands that all parties to the conflict immediately cease hostilities, agree to a cease-fire, and bring an end to the mindless violence and carnage engulfing Rwanda;
2. 

 Exige que todas las partes en el conflicto pongan fin inmediatamente a las hostilidades, convengan en una cesación del fuego y pongan término a la violencia y la carnicería insensatas en que está sumida Rwanda;
2.
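Index `1` happens to line up here, but the sentence counts differ (6,662 English versus 4,861 Spanish), so the same index will not match in general. Below is a sketch of pairing the texts line by line instead. It assumes the corpus files are line-aligned, which the notebook does not state. `align_lines` is a hypothetical helper, and the demo strings are abbreviated openings of the two documents.

```python
def align_lines(*texts):
    """Pair up the n-th line of each text; raise if line counts differ."""
    split = [t.splitlines() for t in texts]
    counts = {len(lines) for lines in split}
    if len(counts) != 1:
        raise ValueError(f"line counts differ: {sorted(counts)}")
    return list(zip(*split))


# Abbreviated openings of the English and Spanish documents.
eng_demo = "RESOLUTION 918 (1994)\nAdopted by the Security Council"
es_demo = "RESOLUCIÓN 918 (1994)\nAprobada por el Consejo de Seguridad"

pairs = align_lines(eng_demo, es_demo)
for eng_line, es_line in pairs:
    print(eng_line, "|", es_line)
```

The explicit check on line counts is there so that a misaligned file fails loudly instead of silently pairing unrelated lines.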


### POS Tagging Spanish Document

In [30]:
start = time()
es_pos = [token.pos_ for token in es_doc] 
print("Spanish document POS tagged in:", (time()-start), "seconds.")

Spanish document POS tagged in: 0.07349586486816406 seconds.


In [72]:
print('\nList of POS tags:', set(es_pos), '\n\nPOS count:', len(set(es_pos)))


List of POS tags: {'INTJ', 'NOUN', 'NUM', 'DET', 'PRON', 'PART', 'ADP', 'VERB', 'PROPN', 'SPACE', 'ADJ', 'CCONJ', 'AUX', 'ADV', 'SYM', 'SCONJ', 'PUNCT'} 

POS count: 17


In [73]:
print(set(es_pos))

{'INTJ', 'NOUN', 'NUM', 'DET', 'PRON', 'PART', 'ADP', 'VERB', 'PROPN', 'SPACE', 'ADJ', 'CCONJ', 'AUX', 'ADV', 'SYM', 'SCONJ', 'PUNCT'}


### Dependency Tagging Spanish Document

In [74]:
start = time()
es_dep = [token.dep_ for token in es_doc] 
print("Spanish document DEP tagged in:", (time()-start), "seconds.")

Spanish document DEP tagged in: 0.09531593322753906 seconds.


In [76]:
es_dep[:5]

['nsubj', 'nummod', 'punct', 'appos', 'punct']

In [77]:
len(set(es_dep))

31

In [79]:
print(set(eng_dep)-set(es_dep))

{'npadvmod', 'attr', 'pcomp', 'quantmod', 'csubjpass', 'agent', 'dobj', 'acomp', 'expl', 'nsubjpass', 'preconj', 'intj', 'prep', 'neg', 'auxpass', 'predet', 'pobj', 'relcl', 'dative', 'poss', 'oprd', 'prt', 'meta'}
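The set difference above only runs in one direction (English minus Spanish). The sketch below reports both directions plus the overlap. `compare_label_sets` is a hypothetical helper, and the two demo sets are small subsets of the labels seen in the outputs of this notebook, not the full label sets.

```python
def compare_label_sets(a, b):
    """Return the labels shared by both sets and those unique to each."""
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Small subsets of the labels that appear in the outputs above.
eng_demo = {"nsubj", "nummod", "punct", "det", "advcl", "npadvmod", "dobj", "pobj"}
es_demo = {"nsubj", "nummod", "punct", "det", "advcl", "obl", "expl:pass"}

comparison = compare_label_sets(eng_demo, es_demo)
for name, labels in comparison.items():
    print(name, sorted(labels))
```

Labels such as `obl` and `expl:pass` follow Universal Dependencies naming, while `dobj` and `pobj` come from a different scheme, so part of the gap seems to reflect annotation conventions rather than the languages themselves.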


In [80]:
sentence_doc = nlp(example_es_sent)
displacy.render(sentence_doc, style='dep')

### Spanish Entity Tagging

In [81]:
start = time()
es_ent = []
es_label = []
for ent in es_doc.ents:
    es_ent.append(ent.text)
    es_label.append(ent.label_)
print("Spanish document entity tagged in:", (time()-start), "seconds.")

Spanish document entity tagged in: 0.07843494415283203 seconds.


In [82]:
print(set(es_label))

{'MISC', 'LOC', 'ORG', 'PER'}
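The Spanish model reports four labels while the English model above reported eighteen, so the label sets cannot be compared directly. One option is to collapse the English labels onto the four coarse ones. The mapping below is my own assumption for illustration and not something SpaCy provides; numeric and temporal labels such as `DATE` have no obvious counterpart, so they are dropped.

```python
# Hypothetical mapping from fine-grained English labels to the four
# coarse labels used by the Spanish model (an assumption, not a SpaCy API).
COARSE_LABELS = {
    "PERSON": "PER",
    "GPE": "LOC", "LOC": "LOC", "FAC": "LOC",
    "ORG": "ORG",
    "NORP": "MISC", "EVENT": "MISC", "LAW": "MISC",
    "WORK_OF_ART": "MISC", "PRODUCT": "MISC", "LANGUAGE": "MISC",
}


def to_coarse(entities):
    """Map (text, label) pairs to coarse labels, dropping unmapped ones."""
    return [
        (text, COARSE_LABELS[label])
        for text, label in entities
        if label in COARSE_LABELS
    ]


eng_demo_ents = [
    ("the Security Council", "ORG"),
    ("Rwanda", "GPE"),
    ("17 May 1994", "DATE"),
    ("resolution 872", "LAW"),
]
coarse_ents = to_coarse(eng_demo_ents)
print(coarse_ents)
```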


In [84]:
es_ents = np.column_stack((es_ent, es_label))

In [85]:
print(es_ents[:10], '\n\nLength of Spanish Entities:', len(es_ents))

[['Aprobada' 'LOC']
 ['Consejo de Seguridad' 'ORG']
 ['Consejo de Seguridad' 'ORG']
 ['Rwanda' 'LOC']
 ['Misión de Asistencia de las Naciones Unidas a Rwanda' 'LOC']
 ['UNAMIR' 'ORG']
 ['UNAMIR' 'ORG']
 ['UNAMIR' 'ORG']
 ['Presidente del Consejo' 'MISC']
 ['S/PRST/1994/16' 'MISC']] 

Length of Spanish Entities: 9826


## French

### Reading in French File

In [88]:
# Loading in French Files
start = time()
f = open('data/sixway/french.100k', 'r') # Reading in French File
french100 = f.read()
print("Data loaded in:", (time()-start), "seconds.")
f.close()

Data loaded in: 0.21050190925598145 seconds.


In [89]:
# Creating Smaller Slice to Analyze using SpaCy
french_samp = french100[:1000000]

### Downloading SpaCy object for French processing

In [90]:
start = time()
nlp = spacy.load('fr_core_news_sm')
fr_doc = nlp(french_samp)

print("French document processed in:", (time()-start), "seconds.")

French document processed in: 120.03544187545776 seconds.


### Word Tokenizing French Document

In [91]:
start = time()
fr_tokens = [token for token in fr_doc]
print("French document tokenized in:", (time()-start), "seconds.")

French document tokenized in: 0.7552700042724609 seconds.


### Sentence Tokenizing French Document

In [92]:
start = time()
fr_sent = [sent.text for sent in fr_doc.sents]
print("French document sentence tokenized in:", (time()-start), "seconds.")

French document sentence tokenized in: 0.2048640251159668 seconds.


In [93]:
len(fr_sent)

9440

In [98]:
example_fr_sent = fr_sent[4]
print(example_fr_sent)

Exige que toutes les parties au conflit cessent immédiatement les hostilités, acceptent un cessez-le-feu et mettent fin à la violence et au carnage insensés dans lesquels est plongé le Rwanda;


In [100]:
print(example_eng_sent,'\n\n',example_es_sent, '\n\n',example_fr_sent)

Demands that all parties to the conflict immediately cease hostilities, agree to a cease-fire, and bring an end to the mindless violence and carnage engulfing Rwanda;
2. 

 Exige que todas las partes en el conflicto pongan fin inmediatamente a las hostilidades, convengan en una cesación del fuego y pongan término a la violencia y la carnicería insensatas en que está sumida Rwanda;
2. 

 Exige que toutes les parties au conflit cessent immédiatement les hostilités, acceptent un cessez-le-feu et mettent fin à la violence et au carnage insensés dans lesquels est plongé le Rwanda;


### POS Tagging French Document

In [101]:
start = time()
fr_pos = [token.pos_ for token in fr_doc] 
print("French document POS tagged in:", (time()-start), "seconds.")

French document POS tagged in: 0.07364201545715332 seconds.


In [102]:
print('\nList of POS tags:', set(fr_pos), '\n\nPOS count:', len(set(fr_pos)))


List of POS tags: {'NOUN', 'X', 'NUM', 'DET', 'PRON', 'ADP', 'VERB', 'PROPN', 'SPACE', 'ADJ', 'CCONJ', 'AUX', 'ADV', 'SYM', 'SCONJ', 'PUNCT'} 

POS count: 16


In [103]:
print(set(fr_pos))

{'NOUN', 'X', 'NUM', 'DET', 'PRON', 'ADP', 'VERB', 'PROPN', 'SPACE', 'ADJ', 'CCONJ', 'AUX', 'ADV', 'SYM', 'SCONJ', 'PUNCT'}


### Dependency Tagging French Document

In [104]:
start = time()
fr_dep = [token.dep_ for token in fr_doc] 
print("French document DEP tagged in:", (time()-start), "seconds.")

French document DEP tagged in: 0.10315871238708496 seconds.


In [105]:
fr_dep[:5]

['ROOT', 'nummod', 'punct', 'nummod', 'punct']

In [106]:
len(set(fr_dep))

36

In [108]:
print(set(fr_dep)-set(eng_dep))

{'aux:pass', 'aux:tense', 'flat:name', 'flat:foreign', 'expl:pass', 'nsubj:pass', 'iobj', 'expl:subj', 'fixed', 'obl:arg', 'cop', 'expl:comp', 'acl:relcl', 'obl:mod', 'obj', 'obl:agent', 'vocative'}


In [109]:
sentence_doc = nlp(example_fr_sent)
displacy.render(sentence_doc, style='dep')

### French Entity Tagging

In [115]:
start = time()
fr_ent = []
fr_label = []
for ent in fr_doc.ents:
    fr_ent.append(ent.text)
    fr_label.append(ent.label_)
print("French document entity tagged in:", (time()-start), "seconds.")

French document entity tagged in: 0.06374764442443848 seconds.


In [116]:
print(set(fr_label))

{'MISC', 'ORG', 'LOC', 'PER'}


In [117]:
fr_ents = np.column_stack((fr_ent, fr_label))
print(fr_ents[:10], '\n\nLength of French Entities:', len(fr_ents))

[['RESOLUTION 918' 'MISC']
 ['Conseil de sécurité' 'ORG']
 ['Conseil de sécurité' 'ORG']
 ['Rwanda' 'LOC']
 ['Mission des Nations Unies' 'MISC']
 ['Rwanda' 'LOC']
 ['MINUAR' 'LOC']
 ['MINUAR' 'MISC']
 ['MINUAR' 'MISC']
 ['Président' 'PER']] 

Length of French Entities: 7853


## Russian

### Reading in Russian File

In [118]:
# Loading in Russian Files
start = time()
f = open('data/sixway/russian.100k', 'r') # Reading in Russian File
russian100 = f.read()
print("Data loaded in:", (time()-start), "seconds.")
f.close()

Data loaded in: 0.15739893913269043 seconds.


In [119]:
# Creating Smaller Slice to Analyze using SpaCy
russian_samp = russian100[:1000000]

### Downloading SpaCy object for Russian processing

In [120]:
start = time()
nlp = spacy.load('ru_core_news_sm')
ru_doc = nlp(russian_samp)
print("Russian document processed in:", (time()-start), "seconds.")

Russian document processed in: 130.43276715278625 seconds.


### Word Tokenizing Russian Document

In [128]:
start = time()

ru_tokens = [token for token in ru_doc]  # Tokenizing Document

print("Russian document word tokenized in:", (time()-start), "seconds.")

Russian document word tokenized in: 0.3548762798309326 seconds.


In [129]:
print('Russian Word Tokens Count:', len(ru_tokens), '\n \n', ru_tokens[:20])

Russian Word Tokens Count: 153964 
 
 [РЕЗОЛЮЦИЯ, 918, (, 1994, ), ,, 
, принятая, Советом, Безопасности, на, его, 3377-м, заседании, 17, мая, 1994, года, 
, Совет]


### Sentence Tokenizing Russian Document

In [130]:
start = time()
ru_sent = [sent.text for sent in ru_doc.sents]
print("Russian document sentence tokenized in:", (time()-start), "seconds.")

Russian document sentence tokenized in: 0.2581510543823242 seconds.


In [132]:
len(ru_sent)

9310

In [164]:
example_ru_sent = ru_sent[13]
print(example_ru_sent)

требует, чтобы все стороны в конфликте немедленно прекратили военные действия, договорились о прекращении огня и положили конец бессмысленному насилию и резне, охватившим Руанду;



In [165]:
print(example_eng_sent,'\n\n',example_es_sent, '\n\n',example_fr_sent, '\n\n',example_ru_sent)

Demands that all parties to the conflict immediately cease hostilities, agree to a cease-fire, and bring an end to the mindless violence and carnage engulfing Rwanda;
2. 

 Exige que todas las partes en el conflicto pongan fin inmediatamente a las hostilidades, convengan en una cesación del fuego y pongan término a la violencia y la carnicería insensatas en que está sumida Rwanda;
2. 

 Exige que toutes les parties au conflit cessent immédiatement les hostilités, acceptent un cessez-le-feu et mettent fin à la violence et au carnage insensés dans lesquels est plongé le Rwanda; 

 требует, чтобы все стороны в конфликте немедленно прекратили военные действия, договорились о прекращении огня и положили конец бессмысленному насилию и резне, охватившим Руанду;



### POS Tagging Russian Document

In [166]:
start = time()
ru_pos = [token.pos_ for token in ru_doc] 
print("Russian document POS tagged in:", (time()-start), "seconds.")

Russian document POS tagged in: 0.0551297664642334 seconds.


In [167]:
print('\nList of POS tags:', set(ru_pos), '\n\nPOS count:', len(set(ru_pos)))


List of POS tags: {'INTJ', 'NOUN', 'X', 'NUM', 'DET', 'PRON', 'PART', 'ADP', 'VERB', 'ADJ', 'PROPN', 'SPACE', 'CCONJ', 'ADV', 'SYM', 'AUX', 'SCONJ', 'PUNCT'} 

POS count: 18


In [168]:
print(set(ru_pos))

{'INTJ', 'NOUN', 'X', 'NUM', 'DET', 'PRON', 'PART', 'ADP', 'VERB', 'ADJ', 'PROPN', 'SPACE', 'CCONJ', 'ADV', 'SYM', 'AUX', 'SCONJ', 'PUNCT'}


### Dependency Tagging Russian Document

In [123]:
start = time()
ru_dep = [token.dep_ for token in ru_doc] 
print("Russian document DEP tagged in:", (time()-start), "seconds.")

Russian document DEP tagged in: 0.07619786262512207 seconds.


In [169]:
ru_dep[:5]

['nsubj', 'appos', 'punct', 'parataxis', 'punct']

In [170]:
len(set(ru_dep))

40

In [171]:
print(set(ru_dep)-set(eng_dep))

{'obl', 'aux:pass', 'nsubj:pass', 'flat:foreign', 'flat:name', 'iobj', 'fixed', 'obj', 'cop', 'list', 'orphan', 'nummod:gov', 'acl:relcl', 'csubj:pass', 'flat', 'nummod:entity', 'obl:agent', 'discourse'}


### Russian Entity Tagging

In [209]:
start = time()
ru_ent = []
ru_label = []
for ent in ru_doc.ents:
    ru_ent.append(ent.text)
    ru_label.append(ent.label_)
print("Russian document entity tagged in:", (time()-start), "seconds.")

Russian document entity tagged in: 0.1517488956451416 seconds.


In [210]:
print(set(ru_label))

{'ORG', 'LOC', 'PER'}


In [211]:
ru_ents = np.column_stack((ru_ent, ru_label))

In [212]:
print(ru_ents[:10], '\n\nLength of Russian Entities:', len(ru_ents))

[['Советом Безопасности' 'ORG']
 ['Совет Безопасности' 'ORG']
 ['Руанде' 'LOC']
 ['Миссию Организации Объединенных Наций' 'ORG']
 ['Руанде' 'LOC']
 ['МООНПР' 'ORG']
 ['МООНПР' 'ORG']
 ['Председателя' 'PER']
 ['Совета Безопасности' 'ORG']
 ['Организации Объединенных Наций' 'ORG']] 

Length of Russian Entities: 5827


## Mandarin

### Reading in Mandarin File

In [172]:
# Loading in Mandarin Files
start = time()
f = open('data/sixway/mandarin.100k', 'r') # Reading in Mandarin File
mandarin100 = f.read()
print("Data loaded in:", (time()-start), "seconds.")
f.close()

Data loaded in: 0.11664915084838867 seconds.


In [173]:
# Creating Smaller Slice to Analyze using SpaCy
mandarin_samp = mandarin100[:1000000]

### Downloading SpaCy object for Mandarin processing

In [175]:
start = time()
nlp = spacy.load('zh_core_web_sm')
zh_doc = nlp(mandarin_samp)
print("Mandarin document processed in:", (time()-start), "seconds.")

Mandarin document processed in: 881.6232738494873 seconds.


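SpaCy's Mandarin module takes about 881 seconds (nearly 15 minutes) to process the sample, roughly seven to ten times longer than the other four languages, which makes it painfully slow to work with.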

### Word Tokenizing Mandarin Document

In [176]:
start = time()
zh_tokens = [token for token in zh_doc]
print("Mandarin document tokenized in:", (time()-start), "seconds.")


Mandarin document tokenized in: 2.3030920028686523 seconds.


In [178]:
print('Mandarin Word Tokens Count:', len(zh_tokens), '\n \n', zh_tokens[:20])

Mandarin Word Tokens Count: 553748 
 
 [第918, (, 1994, )号, 决议, 
, 1994年, 5月, 17日, 安全, 理事会, 第3377, 次, 会议, 通过, 
, 安全, 理事会, ，, 
]


### Sentence Tokenizing Mandarin Document

In [179]:
start = time()
zh_sent = [sent.text for sent in zh_doc.sents]
print("Mandarin document sentence tokenized in:", (time()-start), "seconds.")

Mandarin document sentence tokenized in: 0.40075111389160156 seconds.


In [180]:
len(zh_sent)

23363

In [189]:
example_zh_sent = zh_sent[6]
print(example_zh_sent)

要求卢旺达各当事方严格尊重联合国及其他在卢旺达服务的组织的人员和房地，并避免对从事人道主义和维持和平工作的人员进行任何恐吓或暴力行为；
12.


In [190]:
print(example_eng_sent,'\n\n',example_es_sent, '\n\n',
      example_fr_sent, '\n\n',example_ru_sent, '\n\n', example_zh_sent)

Demands that all parties to the conflict immediately cease hostilities, agree to a cease-fire, and bring an end to the mindless violence and carnage engulfing Rwanda;
2. 

 Exige que todas las partes en el conflicto pongan fin inmediatamente a las hostilidades, convengan en una cesación del fuego y pongan término a la violencia y la carnicería insensatas en que está sumida Rwanda;
2. 

 Exige que toutes les parties au conflit cessent immédiatement les hostilités, acceptent un cessez-le-feu et mettent fin à la violence et au carnage insensés dans lesquels est plongé le Rwanda; 

 требует, чтобы все стороны в конфликте немедленно прекратили военные действия, договорились о прекращении огня и положили конец бессмысленному насилию и резне, охватившим Руанду;
 

 要求卢旺达各当事方严格尊重联合国及其他在卢旺达服务的组织的人员和房地，并避免对从事人道主义和维持和平工作的人员进行任何恐吓或暴力行为；
12.


### POS Tagging Mandarin Document

In [191]:
start = time()
zh_pos = [token.pos_ for token in zh_doc] 
print("Mandarin document POS tagged in:", (time()-start), "seconds.")

Mandarin document POS tagged in: 0.15354585647583008 seconds.


In [194]:
print('\nList of POS tags:', set(zh_pos), '\n\nPOS count:', len(set(zh_pos)))


List of POS tags: {'INTJ', 'NOUN', 'X', 'NUM', 'DET', 'PRON', 'PART', 'ADP', 'VERB', 'PROPN', 'SPACE', 'CCONJ', 'ADV', 'ADJ', 'SCONJ', 'PUNCT'} 

POS count: 16


### Dependency Tagging Mandarin Document

In [195]:
start = time()
zh_dep = [token.dep_ for token in zh_doc] 
print("Mandarin document DEP tagged in:", (time()-start), "seconds.")

Mandarin document DEP tagged in: 0.20015501976013184 seconds.


In [196]:
zh_dep[:5]

['dep', 'punct', 'dep', 'nsubj', 'dep']

In [197]:
len(set(zh_dep))

44

In [198]:
print(set(zh_dep)-set(eng_dep))

{'advmod:dvp', 'name', 'aux:asp', 'amod:ordmod', 'mark:clf', 'advmod:rcomp', 'nmod:topic', 'compound:nn', 'aux:ba', 'nmod:poss', 'cop', 'aux:prtmod', 'advcl:loc', 'nmod:prep', 'aux:modal', 'nmod:tmod', 'compound:vc', 'parataxis:prnmod', 'nmod:assmod', 'etc', 'nmod:range', 'advmod:loc', 'discourse'}


### Mandarin Entity Tagging

In [199]:
start = time()
zh_ent = []
zh_label = []
for ent in zh_doc.ents:
    zh_ent.append(ent.text)
    zh_label.append(ent.label_)
print("Mandarin document entity tagged in:", (time()-start), "seconds.")

Mandarin document entity tagged in: 0.551123857498169 seconds.


In [200]:
print(set(zh_label))

{'ORG', 'LOC', 'LAW', 'WORK_OF_ART', 'FAC', 'PERCENT', 'ORDINAL', 'QUANTITY', 'CARDINAL', 'NORP', 'DATE', 'EVENT', 'PRODUCT', 'TIME', 'MONEY', 'LANGUAGE', 'PERSON', 'GPE'}


In [201]:
zh_ents = np.column_stack((zh_ent, zh_label))

In [203]:
print(zh_ents[:10], '\n\nLength of Mandarin Entities:', len(zh_ents))

[[')号' 'PERSON']
 ['1994年5月17日' 'DATE']
 ['卢旺达' 'GPE']
 ['联合国卢旺达援助团(联卢援助团)' 'ORG']
 ['1993年10月5日第872(1993' 'DATE']
 [')号' 'PERSON']
 ['1994年7月29日' 'DATE']
 [')号' 'PERSON']
 ['1994年4月21日第912' 'DATE']
 [')号' 'PERSON']] 

Length of Mandarin Entities: 35936
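Several of the entities above are clearly spurious: `)号` is tagged `PERSON` four times in the first ten results. Below is a rough heuristic sketch for filtering such cases: an entity is kept only if its first character is a letter, digit, or CJK character. `looks_plausible` is a hypothetical helper, and the rule would not catch malformed spans such as `1993年10月5日第872(1993`.

```python
def looks_plausible(entity_text):
    """Heuristic: keep entities that start with a letter, digit, or CJK character."""
    return bool(entity_text) and entity_text[0].isalnum()


# The first ten (text, label) pairs printed above.
zh_sample = [
    (")号", "PERSON"),
    ("1994年5月17日", "DATE"),
    ("卢旺达", "GPE"),
    ("联合国卢旺达援助团(联卢援助团)", "ORG"),
    ("1993年10月5日第872(1993", "DATE"),
    (")号", "PERSON"),
    ("1994年7月29日", "DATE"),
    (")号", "PERSON"),
    ("1994年4月21日第912", "DATE"),
    (")号", "PERSON"),
]

kept = [(text, label) for text, label in zh_sample if looks_plausible(text)]
print(len(zh_sample) - len(kept), "entities dropped;", len(kept), "kept")
```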


## Arabic

SpaCy does not have a module for processing Arabic. After looking into other resources, I decided to go with Camel.

In [205]:
# Loading in Arabic Files
start = time()
f = open('data/sixway/arabic.100k', 'r') # Reading in Arabic File
arabic100 = f.read()
print("Data loaded in:", (time()-start), "seconds.")
f.close()

# Creating Smaller Slice to Analyze
arabic_samp = arabic100[:1000000]

Data loaded in: 0.1333611011505127 seconds.


# Creating DataFrame for Analyzation

In [222]:
data = {'Language': ['English', 'Spanish', 'French', 'Russian', 'Mandarin', 'Arabic'],
        'Document' : [english_samp, spanish_samp, french_samp, russian_samp, mandarin_samp, arabic_samp],
        'Word Tokens' : [eng_tokens, es_tokens, fr_tokens, ru_tokens, zh_tokens, 'null'],
        'Word Token Length' : [len(eng_tokens), len(es_tokens), len(fr_tokens), len(ru_tokens), len(zh_tokens), 'null'],
        'Sentence Tokens' : [eng_sent, es_sent, fr_sent, ru_sent, zh_sent, 'null'],
        'Sentence Token Length' : [len(eng_sent), len(es_sent), len(fr_sent), len(ru_sent), len(zh_sent), 'null'],
        'POS Set' : [set(eng_pos), set(es_pos), set(fr_pos), set(ru_pos), set(zh_pos), 'null'],
        'POS Count' : [len(set(eng_pos)), len(set(es_pos)), len(set(fr_pos)), len(set(ru_pos)), len(set(zh_pos)), 'null'],
        'Dependency': [set(eng_dep), set(es_dep), set(fr_dep), set(ru_dep), set(zh_dep), 'null'],
        'Dependency Count' : [len(set(eng_dep)), len(set(es_dep)), len(set(fr_dep)), len(set(ru_dep)), len(set(zh_dep)), 'null'],
        'Entity Label List' :[set(eng_label), set(es_label), set(fr_label), set(ru_label), set(zh_label), 'null'],
        'Entity Label Count' : [len(set(eng_label)), len(set(es_label)), len(set(fr_label)), len(set(ru_label)), len(set(zh_label)), 'null'],
        'Entity Count' : [len(eng_ent), len(es_ent), len(fr_ent), len(ru_ent), len(zh_ent), 'null'],  # Total entities for every language
       }

In [223]:
sixway_df = pd.DataFrame(data)

In [224]:
sixway_df

| | Language | Document | Word Tokens | Word Token Length | Sentence Tokens | Sentence Token Length | POS Set | POS Count | Dependency | Dependency Count | Entity Label List | Entity Label Count | Entity Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | English | RESOLUTION 918 (1994)\nAdopted by the Security... | [RESOLUTION, 918, (, 1994, ), \n, Adopted, by,... | 177808.0 | [RESOLUTION 918 (1994)\nAdopted by the Securit... | 6662.0 | {INTJ, NOUN, X, NUM, DET, PRON, PART, ADP, VER... | 18.0 | {npadvmod, advcl, attr, acl, det, pcomp, quant... | 45.0 | {ORG, LAW, WORK_OF_ART, FAC, LOC, LANGUAGE, OR... | 18.0 | 12124.0 |
| 1 | Spanish | RESOLUCIÓN 918 (1994)\nAprobada por el Consejo... | [RESOLUCIÓN, 918, (, 1994, ), \n, Aprobada, po... | 176209.0 | [RESOLUCIÓN 918 (1994)\nAprobada por el Consej... | 4861.0 | {INTJ, NOUN, NUM, DET, PRON, PART, ADP, VERB, ... | 17.0 | {expl:pass, advcl, det, acl, obl, conj, dep, c... | 31.0 | {MISC, LOC, ORG, PER} | 4.0 | 9826.0 |
| 2 | French | RESOLUTION 918 (1994)\nAdoptée par le Conseil ... | [RESOLUTION, 918, (, 1994, ), \n, Adoptée, par... | 181823.0 | [RESOLUTION 918 (1994)\n, Adoptée, par le Cons... | 9440.0 | {NOUN, X, NUM, DET, PRON, ADP, VERB, PROPN, SP... | 16.0 | {aux:tense, expl:pass, advcl, det, acl, obl:ag... | 36.0 | {MISC, ORG, LOC, PER} | 4.0 | 7853.0 |
| 3 | Russian | РЕЗОЛЮЦИЯ 918 (1994),\nпринятая Советом Безопа... | [РЕЗОЛЮЦИЯ, 918, (, 1994, ), ,, \n, принятая, ... | 153964.0 | [РЕЗОЛЮЦИЯ 918 (1994),\n, принятая Советом Без... | 9310.0 | {INTJ, NOUN, X, NUM, DET, PRON, PART, ADP, VER... | 18.0 | {advcl, det, acl, orphan, obl:agent, obl, conj... | 40.0 | {ORG, LOC, PER} | 3.0 | 5827.0 |
| 4 | Mandarin | 第918(1994)号决议\n1994年5月17日安全理事会第3377次会议通过\n安全理事... | [第918, (, 1994, )号, 决议, \n, 1994年, 5月, 17日, 安全... | 553748.0 | [第918(1994)号决议\n1994年5月17日安全理事会第3377次会议通过\n安全理... | 23363.0 | {INTJ, NOUN, X, NUM, DET, PRON, PART, ADP, VER... | 16.0 | {advmod:dvp, name, aux:asp, amod:ordmod, mark:... | 44.0 | {ORG, LOC, LAW, WORK_OF_ART, FAC, PERCENT, ORD... | 18.0 | 35936.0 |
| 5 | Arabic | القرار ٨١٩ )٤٩٩١(\nالذي اتخذه مجلس اﻷمن في جلس... | | | | | | | | | | | |
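The `data` dictionary repeats the same expressions once per language, which makes slips easy to introduce. Below is a sketch of building one summary row per language from a single function, using `None` instead of the string `'null'` for results that are unavailable. `summarize` is a hypothetical helper, and the demo inputs are toy lists, not the real documents.

```python
def summarize(language, tokens, sents, pos, dep, ent_labels):
    """Build one summary row; pass None for results that are unavailable."""

    def count(items, unique=False):
        if items is None:
            return None
        return len(set(items)) if unique else len(items)

    return {
        "Language": language,
        "Word Token Length": count(tokens),
        "Sentence Token Length": count(sents),
        "POS Count": count(pos, unique=True),
        "Dependency Count": count(dep, unique=True),
        "Entity Label Count": count(ent_labels, unique=True),
        "Entity Count": count(ent_labels),  # total, not unique
    }


rows = [
    summarize("Demo", ["a", "b", "c"], ["a b c"],
              ["DET", "NOUN", "NOUN"], ["det", "nsubj", "ROOT"], ["ORG", "ORG"]),
    summarize("Arabic", None, None, None, None, None),
]
for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)`, where `None` is treated as a missing value rather than as text.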
