# Extract sentence data from raw data

This project covers 21 European languages: Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek. The goal is to build a language detection engine that identifies the language of a given text.

The raw data consists of 21 folders, one per language, and each folder contains thousands of text files. All files contain document (`<CHAPTER id>`), speaker (`<SPEAKER id name language>`), and paragraph (`<P>`) mark-up, each on a separate line. Some HTML entities and noisy characters have not been removed from the data. In this part, we extract sentence data from the raw data. The key steps are:

#### 1. Strip empty lines and their correspondences 
#### 2. Remove lines with XML-Tags (starting with "<") 
#### 3. Lowercase the text
#### 4. Split the text into sentences at periods
#### 5. Remove noisy characters and punctuation
#### 6. Collapse runs of whitespace into a single space
#### 7. Keep only sentences longer than one character (length > 1)
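
Steps 1 and 2 can be illustrated on a small in-memory sample. This is only a sketch: the sample lines and the helper name `keep_text_lines` are made up for illustration and are not part of the dataset or of the notebook code.

```python
def keep_text_lines(lines):
    """Drop empty lines and mark-up lines (those starting with '<')."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("<"):
            continue
        kept.append(line)
    return kept

# Hypothetical sample written in the style of the raw files
sample = [
    '<CHAPTER ID="006-01">',
    '<SPEAKER ID="1" NAME="President">',
    '',
    'The sitting is resumed. Thank you.',
    '<P>',
    '   ',
    'We now continue with the vote.',
]
text_lines = keep_text_lines(sample)
print(text_lines)
```

Only the two lines of running text survive; the mark-up lines and the blank lines are skipped.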

In [1]:
import numpy as np
import random
import matplotlib.pyplot as plt
import pickle
import os
import re
import string
import codecs

Here are the two-letter abbreviation labels for the 21 European languages:

In [2]:
labels = ['bg', 'cs', 'da', 'de', 'el', 'en', 'es', 'et', 
          'fi', 'fr', 'hu', 'it', 'lt', 'lv', 'nl', 'pl', 
          'pt', 'ro', 'sk', 'sl', 'sv']

Here are two functions that extract clean text from all files of one language. We split every paragraph into sentences and store the sentences in a set, so that no sentence is stored twice. In the `extract_clean_sentence` function, we open each text file with the `codecs` package and `errors='ignore'`, because the files contain some noisy bytes (e.g. byte 0x95) that the 'utf-8' codec cannot decode. With this setting, such bytes are skipped instead of raising an error.
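
The effect of `errors='ignore'` can be seen on a single byte string. The following is a minimal sketch with a made-up input; it decodes the bytes directly rather than reading a file.

```python
# 0x95 is not a valid start byte in UTF-8, so strict decoding fails
raw = b"price \x95 100 euro"  # made-up example input

try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With errors="ignore" the undecodable byte is silently dropped
lenient = raw.decode("utf-8", errors="ignore")

print(strict_ok)
print(repr(lenient))
```

Note that the dropped byte leaves two adjacent spaces behind, which is one reason step 6 collapses whitespace. In Python 3, the built-in `open` also accepts `encoding` and `errors` arguments, so the same behaviour is available without `codecs`.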

In [3]:
def extract_clean_sentence(file_path, language):
    '''
    Goal: extract clean sentences from all files of one language. Each line is split into
          sentences, and every sentence longer than one character is stored in one set.
    @param file_path: path of the directory that contains all language folders.
           language: a language label, which is also the name of the language folder.
    @return clean_sentences: a set of strings (sentences).
    '''
    clean_sentences = set() # use a set to make sure there are no duplicate sentences
    language_dir = os.path.join(file_path, language) # os.path.join keeps the path portable
    for file in os.listdir(language_dir):
        # codecs.open with errors='ignore' skips bytes that are not valid UTF-8
        # (e.g. byte 0x95) instead of raising a UnicodeDecodeError
        with codecs.open(os.path.join(language_dir, file), 'r', encoding='utf-8', errors='ignore') as f:
            for line in f:
                line = line.strip() # 1. strip whitespace so that empty lines can be skipped
                if not line or line.startswith('<'): # 2. skip empty lines and XML tags e.g. <CHAPTER ID="006-01">
                    continue
                for sentence in line.split('.'): # 4. split the text into sentences by period
                    sentence = clean_sentence(sentence.strip()) # 3, 5, 6. lowercase, remove noise, collapse spaces
                    if len(sentence) > 1: # 7. control the length of the sentence
                        clean_sentences.add(sentence)
    # store the sentences of this language in <language>_sentence.pkl
    with open(language + '_sentence.pkl', 'wb') as f:
        pickle.dump(clean_sentences, f)
    return clean_sentences

def clean_sentence(sentence):
    '''
    Goal: lowercase the sentence and remove punctuation, digits, tabs and line breaks.
    @param sentence: a string which may contain punctuation
    @return output: a string (the cleaned sentence).
    '''
    # 3, 5. lowercase the text, then keep only runs of characters that are not tabs, digits,
    # line breaks, en dashes or ASCII punctuation (re.escape keeps the character class valid)
    pattern = r"[^\t\d\r\n–{}]+".format(re.escape(string.punctuation))
    output = "".join(re.findall(pattern, sentence.lower()))
    output = output.strip()
    output = re.sub(r"\s+", " ", output) # 6. transform long space to ' '
    return output
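
To make steps 3 to 7 concrete, here is a small self-contained sketch that runs them on one made-up line. The helper `clean` is a hypothetical stand-in for `clean_sentence`: it removes the unwanted characters with `re.sub` on an escaped character class rather than using the `re.findall` pattern above, and is meant to be equivalent only for inputs like this one.

```python
import re
import string

# Characters to drop: tabs, digits, line breaks, the en dash and ASCII punctuation
_DROP = re.compile(r"[\t\d\r\n–" + re.escape(string.punctuation) + r"]")

def clean(sentence):
    """Lowercase, remove unwanted characters, collapse whitespace."""
    out = _DROP.sub("", sentence.lower()).strip()
    return re.sub(r"\s+", " ", out)

line = "Hello, World. hello   world. A. It costs 100 – 200 euro."  # made-up input
sentences = set()
for part in line.split("."):   # split into sentences by period
    cleaned = clean(part)
    if len(cleaned) > 1:       # drop empty and one-character results
        sentences.add(cleaned) # the set removes duplicates
print(sorted(sentences))
```

The first two sentences collapse to the same string and are stored once, "A" is dropped by the length check, and the digits and the en dash disappear from the last sentence.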

For each of the 21 European languages, we store the sentences in a `<language>_sentence.pkl` file (for example `en_sentence.pkl`).

In [5]:
for i in labels:
    # extract_clean_sentence already writes the sentences to <label>_sentence.pkl,
    # so the file does not need to be opened and written a second time here
    extract_clean_sentence('../txt/', i)
    print('Finish', i)

Finish bg
Finish cs
Finish da
Finish de
Finish el
Finish en
Finish es
Finish et
Finish fi
Finish fr
Finish hu
Finish it
Finish lt
Finish lv
Finish nl
Finish pl
Finish pt
Finish ro
Finish sk
Finish sl
Finish sv
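
Once the loop has finished, the stored sets can be read back with `pickle.load`. The sketch below assumes the `.pkl` files sit in the directory that is passed in. `load_sentences` is a hypothetical helper, and the demo writes made-up data to a temporary directory so that it runs on its own.

```python
import os
import pickle
import tempfile

def load_sentences(language, directory="."):
    """Load the set of sentences stored in <directory>/<language>_sentence.pkl."""
    path = os.path.join(directory, language + "_sentence.pkl")
    with open(path, "rb") as f:
        return pickle.load(f)

# Self-contained demo with made-up data in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    demo = {"hello world", "it costs euro"}
    with open(os.path.join(tmp, "en_sentence.pkl"), "wb") as f:
        pickle.dump(demo, f)
    loaded = load_sentences("en", tmp)

print(len(loaded))
```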
