# United Nations Parallel Corpora Analysis
#### Kinan Al-Mouk - kim47@pitt.edu - March 16th, 2022

# Table of Contents
[Uploading and Analyzing Raw Data](#Uploading-and-Analyzing-Raw-Data)

- [English](#English)

- [Spanish](#Spanish)

- [French](#French)

- [Russian](#Russian)

- [Arabic](#Arabic)

- [Mandarin](#Mandarin)

[Creating Initial DataFrame for Analysis](#DataFrame-Construction)

# Uploading and Analyzing Raw Data

In [1]:
import nltk 
from time import time
import numpy as np 
import pandas as pd 

# English

### Reading in English File

In [2]:
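start = time()
# Reading in English File; the context manager closes it automatically.
# The encoding is made explicit (assumed here to be UTF-8).
with open('data/sixway/english.100k', 'r', encoding='utf-8') as f:
    english100 = f.read()
print("Data loaded in:", (time()-start), "seconds.")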

Data loaded in: 0.15114116668701172 seconds.
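
The same read-and-time pattern is repeated for each of the six languages. As a minimal sketch of how it could be factored out, the block below wraps it in one function. The name `load_corpus` is hypothetical (it is not part of the notebook), and UTF-8 is an assumed encoding for the corpus files.

```python
from pathlib import Path
from time import perf_counter


def load_corpus(path, encoding="utf-8"):
    """Read a whole text file and return (text, seconds_elapsed).

    Hypothetical helper; the UTF-8 default is an assumption about the data.
    """
    start = perf_counter()
    text = Path(path).read_text(encoding=encoding)
    return text, perf_counter() - start


if __name__ == "__main__":
    # Paths follow the notebook's layout; files that are missing are skipped.
    for name in ["english", "spanish", "french", "russian", "arabic", "mandarin"]:
        path = Path("data/sixway") / f"{name}.100k"
        if path.exists():
            text, seconds = load_corpus(path)
            print(f"{name}: {len(text)} characters loaded in {seconds:.3f} seconds.")
```

`perf_counter` is used instead of `time` because it is intended for measuring short intervals; either works for a rough reading like the one above.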


### Word Tokenizing English File

In [3]:
# Word Tokenization
start = time()
small_en_words = nltk.word_tokenize(english100)
print("English word tokenized in:", (time()-start), "seconds.")
small_en_words_len = len(small_en_words)
print("Word Token Count for English File:", small_en_words_len)

English word tokenized in: 20.7223219871521 seconds.
Word Token Count for English File: 3105868


### Sentence Tokenizing English File

In [4]:
# Sentence Tokenization
start = time()
small_en_sents = nltk.sent_tokenize(english100)
print("English sentence tokenized in:", (time()-start), "seconds.")
small_en_sents_len = len(small_en_sents)
print("Sentence Token Count for English File:", small_en_sents_len)


English sentence tokenized in: 5.488654136657715 seconds.
Sentence Token Count for English File: 108375
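
Each tokenizing cell follows the same shape: start a timer, call a tokenizer, print the elapsed time and the token count. A small sketch of that shape as a reusable function is given below. The name `timed` is hypothetical, and `str.split` is only a stand-in so the sketch runs without NLTK or its data; in the notebook the callable would be `nltk.word_tokenize` or `nltk.sent_tokenize`.

```python
from time import perf_counter


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, seconds_elapsed)."""
    start = perf_counter()
    result = func(*args, **kwargs)
    return result, perf_counter() - start


# Stand-in tokenizer: plain whitespace splitting, NOT what the notebook uses.
sample = "RESOLUTION 918 (1994) Adopted by the Security Council"
tokens, seconds = timed(str.split, sample)
print("Tokenized in:", seconds, "seconds.")
print("Token count:", len(tokens))
```

Passing the tokenizer as an argument keeps the timing logic in one place, so a call such as `timed(nltk.word_tokenize, english100)` would slot into the cells above unchanged.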


# Spanish

### Reading in Spanish File

In [6]:
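start = time()
# Reading in Spanish File; the context manager closes it automatically.
# The encoding is made explicit (assumed here to be UTF-8).
with open('data/sixway/spanish.100k', 'r', encoding='utf-8') as f:
    spanish100 = f.read()
print("Data loaded in:", (time()-start), "seconds.")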

Data loaded in: 0.24316692352294922 seconds.


### Word Tokenizing Spanish File

In [7]:
start = time()
small_es_words = nltk.word_tokenize(spanish100)
print("Spanish word tokenized in:", (time()-start), "seconds.")
small_es_words_len = len(small_es_words)
print("Word Token Count for Spanish File:", small_es_words_len)

Spanish word tokenized in: 24.900392055511475 seconds.
Word Token Count for Spanish File: 3504309


### Sentence Tokenizing Spanish File

In [8]:
# Sentence Tokenization
start = time()
small_es_sents = nltk.sent_tokenize(spanish100)
print("Spanish sentence tokenized in:", (time()-start), "seconds.")
small_es_sents_len = len(small_es_sents)
print("Sentence Token Count for Spanish File:", small_es_sents_len)


Spanish sentence tokenized in: 6.110598802566528 seconds.
Sentence Token Count for Spanish File: 102386


# French

### Reading in French File

In [9]:
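start = time()
# Reading in French File; the context manager closes it automatically.
# The encoding is made explicit (assumed here to be UTF-8).
with open('data/sixway/french.100k', 'r', encoding='utf-8') as f:
    french100 = f.read()
print("Data loaded in:", (time()-start), "seconds.")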

Data loaded in: 0.25275087356567383 seconds.


### Word Tokenizing French File

In [10]:
start = time()
small_fr_words = nltk.word_tokenize(french100)
print("French word tokenized in:", (time()-start), "seconds.")
small_fr_words_len = len(small_fr_words)
print("Word Token Count for French File:", small_fr_words_len)

French word tokenized in: 25.011726140975952 seconds.
Word Token Count for French File: 3456688


### Sentence Tokenizing French File

In [11]:
# Sentence Tokenization
start = time()
small_fr_sents = nltk.sent_tokenize(french100)
print("French sentence tokenized in:", (time()-start), "seconds.")
small_fr_sents_len = len(small_fr_sents)
print("Sentence Token Count for French File:", small_fr_sents_len)


French sentence tokenized in: 6.825227975845337 seconds.
Sentence Token Count for French File: 107730


# Russian

### Reading in Russian File

In [12]:
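start = time()
# Reading in Russian File; the context manager closes it automatically.
# The encoding is made explicit (assumed here to be UTF-8).
with open('data/sixway/russian.100k', 'r', encoding='utf-8') as f:
    russian100 = f.read()
print("Data loaded in:", (time()-start), "seconds.")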

Data loaded in: 0.21496081352233887 seconds.


### Word Tokenizing Russian File

In [13]:
start = time()
small_ru_words = nltk.word_tokenize(russian100)
print("Russian word tokenized in:", (time()-start), "seconds.")
small_ru_words_len = len(small_ru_words)
print("Word Token Count for Russian File:", small_ru_words_len)

Russian word tokenized in: 26.897891759872437 seconds.
Word Token Count for Russian File: 2857554


### Sentence Tokenizing Russian File

In [14]:
# Sentence Tokenization
start = time()
small_ru_sents = nltk.sent_tokenize(russian100)
print("Russian sentence tokenized in:", (time()-start), "seconds.")
small_ru_sents_len = len(small_ru_sents)
print("Sentence Token Count for Russian File:", small_ru_sents_len)

Russian sentence tokenized in: 8.108860969543457 seconds.
Sentence Token Count for Russian File: 108311


# Arabic

### Reading in Arabic File

In [15]:
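start = time()
# Reading in Arabic File; the context manager closes it automatically.
# The encoding is made explicit (assumed here to be UTF-8).
with open('data/sixway/arabic.100k', 'r', encoding='utf-8') as f:
    arabic100 = f.read()
print("Data loaded in:", (time()-start), "seconds.")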

Data loaded in: 0.16884827613830566 seconds.


### Word Tokenizing Arabic File

In [16]:
start = time()
small_ar_words = nltk.word_tokenize(arabic100)
print("Arabic word tokenized in:", (time()-start), "seconds.")
small_ar_words_len = len(small_ar_words)
print("Word Token Count for Arabic File:", small_ar_words_len)

Arabic word tokenized in: 20.76917004585266 seconds.
Word Token Count for Arabic File: 2564054


### Sentence Tokenizing Arabic File

In [24]:
# Sentence Tokenization
start = time()
small_ar_sents = nltk.sent_tokenize(arabic100)
print("Arabic sentence tokenized in:", (time()-start), "seconds.")
small_ar_sents_len = len(small_ar_sents)
print("Sentence Token Count for Arabic File:", small_ar_sents_len)

Arabic sentence tokenized in: 3.8507251739501953 seconds.
Sentence Token Count for Arabic File: 78687


# Mandarin

### Reading in Mandarin File

In [25]:
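start = time()
# Reading in Mandarin File; the context manager closes it automatically.
# The encoding is made explicit (assumed here to be UTF-8).
with open('data/sixway/mandarin.100k', 'r', encoding='utf-8') as f:
    mandarin100 = f.read()
print("Data loaded in:", (time()-start), "seconds.")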

Data loaded in: 0.09038591384887695 seconds.


### Word Tokenizing Mandarin File

In [26]:
start = time()
small_zh_words = nltk.word_tokenize(mandarin100)
print("Mandarin word tokenized in:", (time()-start), "seconds.")
small_zh_words_len = len(small_zh_words)
print("Word Token Count for Mandarin File:", small_zh_words_len)

Mandarin word tokenized in: 9.78513479232788 seconds.
Word Token Count for Mandarin File: 375501


### Sentence Tokenizing Mandarin File

In [36]:
# Sentence Tokenization
start = time()
small_zh_sents = nltk.sent_tokenize(mandarin100)
print("Mandarin sentence tokenized in:", (time()-start), "seconds.")
small_zh_sents_len = len(small_zh_sents)
print("Sentence Token Count for Mandarin File:", small_zh_sents_len)

Mandarin sentence tokenized in: 4.953838109970093 seconds.
Sentence Token Count for Mandarin File: 27314


# DataFrame Construction

In [45]:
data = {'Language': ['English', 'Spanish', 'French', 'Russian', 'Arabic', 'Mandarin'], 
        'Text': [english100, spanish100, french100, russian100, arabic100, mandarin100],
        'Word Tokens' : [small_en_words, small_es_words, small_fr_words, small_ru_words, small_ar_words, small_zh_words],
        'Word Tokens Len' : [small_en_words_len, small_es_words_len, small_fr_words_len, small_ru_words_len, small_ar_words_len, small_zh_words_len],  
        'Sentence Tokens' : [small_en_sents, small_es_sents, small_fr_sents, small_ru_sents, small_ar_sents, small_zh_sents],  
        'Sentence Tokens Len' : [small_en_sents_len, small_es_sents_len, small_fr_sents_len, small_ru_sents_len, small_ar_sents_len, small_zh_sents_len] } 


In [50]:
sixway_df = pd.DataFrame(data)

In [52]:
sixway_df['Average Sentence Length'] = sixway_df['Word Tokens Len']/sixway_df['Sentence Tokens Len']
sixway_df

| | Language | Text | Word Tokens | Word Tokens Len | Sentence Tokens | Sentence Tokens Len | Average Sentence Length |
|---|---|---|---|---|---|---|---|
| 0 | English | RESOLUTION 918 (1994)\nAdopted by the Security... | [RESOLUTION, 918, (, 1994, ), Adopted, by, the... | 3105868 | [RESOLUTION 918 (1994)\nAdopted by the Securit... | 108375 | 28.658528 |
| 1 | Spanish | RESOLUCIÓN 918 (1994)\nAprobada por el Consejo... | [RESOLUCIÓN, 918, (, 1994, ), Aprobada, por, e... | 3504309 | [RESOLUCIÓN 918 (1994)\nAprobada por el Consej... | 102386 | 34.226447 |
| 2 | French | RESOLUTION 918 (1994)\nAdoptée par le Conseil ... | [RESOLUTION, 918, (, 1994, ), Adoptée, par, le... | 3456688 | [RESOLUTION 918 (1994)\nAdoptée par le Conseil... | 107730 | 32.086587 |
| 3 | Russian | РЕЗОЛЮЦИЯ 918 (1994),\nпринятая Советом Безопа... | [РЕЗОЛЮЦИЯ, 918, (, 1994, ), ,, принятая, Сове... | 2857554 | [РЕЗОЛЮЦИЯ 918 (1994),\nпринятая Советом Безоп... | 108311 | 26.38286 |
| 4 | Arabic | القرار ٨١٩ )٤٩٩١(\nالذي اتخذه مجلس اﻷمن في جلس... | [القرار, ٨١٩, ), ٤٩٩١, (, الذي, اتخذه, مجلس, ا... | 2564054 | [القرار ٨١٩ )٤٩٩١(\nالذي اتخذه مجلس اﻷمن في جل... | 78687 | 32.585484 |
| 5 | Mandarin | 第918(1994)号决议\n1994年5月17日安全理事会第3377次会议通过\n安全理事... | [第918, (, 1994, ), 号决议, 1994年5月17日安全理事会第3377次会... | 375501 | [第918(1994)号决议\n1994年5月17日安全理事会第3377次会议通过\n安全理... | 27314 | 13.747565 |
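
The Average Sentence Length column is the word-token count divided by the sentence-token count. As a quick cross-check that does not need pandas, the sketch below recomputes it from the counts printed in the earlier cell outputs; the dictionary layout and variable names are illustrative only.

```python
# (word tokens, sentence tokens) copied from the cell outputs above.
counts = {
    "English": (3105868, 108375),
    "Spanish": (3504309, 102386),
    "French": (3456688, 107730),
    "Russian": (2857554, 108311),
    "Arabic": (2564054, 78687),
    "Mandarin": (375501, 27314),
}

# Average sentence length = word tokens per sentence token.
average_sentence_length = {
    lang: words / sents for lang, (words, sents) in counts.items()
}

# Print from shortest to longest average sentence.
for lang, avg in sorted(average_sentence_length.items(), key=lambda kv: kv[1]):
    print(f"{lang:<9}{avg:.6f}")
```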


## Sixway Analysis 
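 - **Mandarin** has the fewest Word Tokens at **375,501**, whereas **English, Spanish, and French** each have over **3 million**. The same pattern holds for the Sentence Tokens count (**27,314** for Mandarin versus roughly **102,000 to 108,000** for English, Spanish, French, and Russian). The reason has not been confirmed yet. One likely factor is that `nltk.word_tokenize` and `nltk.sent_tokenize` were called with their default English settings, which rely on whitespace and Western punctuation, while Chinese text is written without spaces between words. The token preview in the table is consistent with this: a whole line such as `1994年5月17日安全理事会第3377次会议通过` comes out as a single token.
 - **Mandarin** has the lowest values for Word Tokens Len, Sentence Tokens Len, and Average Sentence Length.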
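
One possible way to probe the low Mandarin counts is to compare whitespace splitting with a crude per-character count on a short sample. The sketch below is an illustration under stated assumptions, not the notebook's method. The first sample is the start of the Mandarin `Text` preview shown in the table, the second sample string is made up for the demonstration, and `split_zh_sentences` is a hypothetical helper that splits on full-width terminators only.

```python
import re

# Start of the Mandarin 'Text' preview from the table above.
sample = "第918(1994)号决议\n1994年5月17日安全理事会第3377次会议通过"

# Whitespace splitting yields one chunk per line, since the text has no spaces.
whitespace_tokens = sample.split()

# Crude contrast: treat every CJK ideograph as its own token.
# This is NOT real word segmentation, only an upper-bound style count.
cjk_chars = re.findall(r"[\u4e00-\u9fff]", sample)

print("Whitespace tokens:", len(whitespace_tokens))
print("CJK characters:", len(cjk_chars))


def split_zh_sentences(text):
    """Split after the full-width terminators 。！？ (hypothetical helper)."""
    parts = re.split(r"(?<=[。！？])", text)
    return [part.strip() for part in parts if part.strip()]


# Made-up two-sentence string, not taken from the corpus.
demo = "安全理事会通过了决议。会议结束。"
print("Sentences:", split_zh_sentences(demo))
```

If the gap between the two counts is as large on the full file as it is on this sample, a dedicated Chinese word segmenter would be worth trying before comparing Mandarin against the other five languages.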