<a href="https://colab.research.google.com/github/mterion/Transformers-for-NLP-2nd-Edition/blob/main/Chapter06/chap06_clean.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
import requests
import tarfile
import io

# URL of the TAR file
tar_url = "https://statmt.org/europarl/v7/fr-en.tgz"

# Download the TAR file and fail early on an HTTP error
response = requests.get(tar_url)
response.raise_for_status()
tar_file = io.BytesIO(response.content)

# Extract the contents of the TAR file
with tarfile.open(fileobj=tar_file, mode="r:gz") as tar:
    tar.extractall()

print("TAR file downloaded and extracted successfully.")

TAR file downloaded and extracted successfully.


In [6]:
#Pre-Processing datasets for Machine Translation
#Copyright 2020, Denis Rothman, MIT License
#Denis Rothman modified the code for educational purposes.
#Reference:
#Jason Brownlee PhD, ‘How to Prepare a French-to-English Dataset for Machine Translation’
# https://machinelearningmastery.com/prepare-french-english-dataset-machine-translation/

import pickle
from pickle import dump

# load doc into memory
def load_doc(filename):
	# open the file read-only in text mode ('rt'); the with block closes it automatically
	with open(filename, mode='rt', encoding='utf-8') as file:
		# read all text
		return file.read()
  

The file extension .tgz typically indicates a TAR file that has been compressed using the GZIP compression algorithm. It is commonly referred to as a "tarball" and is often used in Unix-like systems.

- .tar files are uncompressed archives that bundle multiple files together, similar to a folder/directory structure.
- .gz or .gzip files are compressed files using the GZIP algorithm.

When both these formats are combined, it results in a .tgz or .tar.gz file, which is a TAR archive that has been compressed with GZIP.
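
As a minimal sketch of the two layers described above, the following builds a tiny .tgz in memory and reads it back with the standard `tarfile` module. The member name `hello.txt` and its content are made up for illustration; nothing here touches the Europarl archive.

```python
import io
import tarfile

# Build a small gzip-compressed TAR archive in memory (hypothetical content).
payload = b"bonjour\n"
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="hello.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# Outer layer: every gzip stream starts with the magic bytes 0x1f 0x8b.
raw = buffer.getvalue()
is_gzip = raw[:2] == b"\x1f\x8b"

# Reading back: "r:gz" undoes the GZIP layer, then walks the TAR members.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r:gz") as tar:
    member_names = tar.getnames()
    extracted = tar.extractfile("hello.txt").read()

print(is_gzip, member_names, extracted)
```

The same `mode="r:gz"` is what the download cell at the top of the notebook passes to `tarfile.open`.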

In [7]:
# Me: 
filename = 'europarl-v7.fr-en.en'
doc = load_doc(filename)
type(doc)

str

In [12]:
# Me:
lines = doc.splitlines()
  # Splits the string into a list of lines at line boundaries; the line-break characters are not kept in the results.
    # '\n' (newline): the line break on Unix-like systems such as Linux and macOS.
    # '\r' (carriage return): the line break on classic Mac OS, before Mac OS X.
    # '\r\n' (carriage return + newline): the two-character line break used on Windows.
  # splitlines() recognizes all three, and also a few other boundaries such as '\x0b', '\x0c', '\x85' and '\u2028'.
first_lines = lines[:20]
print(first_lines)

['Resumption of the session', 'I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.', "Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.", 'You have requested a debate on this subject in the course of the next few days, during this part-session.', "In the meantime, I should like to observe a minute' s silence, as a number of Members have requested, on behalf of all the victims concerned, particularly those of the terrible storms, in the various countries of the European Union.", "Please rise, then, for this minute' s silence.", "(The House rose and observed a minute' s silence)", 'Madam President, on a point of order.', 'You will be aware from the press and television that there have been a num

In [17]:
# split a loaded document into sentences
def to_sentences(doc):
	return doc.strip().split('\n')
 
mySentences = to_sentences(doc)
print(mySentences[:10])
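  # The first ten sentences are identical with both methods.
    # Caution: splitlines() also breaks on characters such as '\x0c', '\x85' and '\u2028', so on the full file it may return more lines than split('\n').
    # split('\n') is the safer choice here, because the English and French files must stay aligned line by line.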

['Resumption of the session', 'I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.', "Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.", 'You have requested a debate on this subject in the course of the next few days, during this part-session.', "In the meantime, I should like to observe a minute' s silence, as a number of Members have requested, on behalf of all the victims concerned, particularly those of the terrible storms, in the various countries of the European Union.", "Please rise, then, for this minute' s silence.", "(The House rose and observed a minute' s silence)", 'Madam President, on a point of order.', 'You will be aware from the press and television that there have been a num
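
The two splitting methods agree on the first sentences, but they are not equivalent in general. The sketch below uses a made-up `sample` string (not taken from the corpus) that contains a form feed and a Unicode line separator inside its lines, to show where the results can diverge.

```python
# Hypothetical sample: three '\n'-terminated lines, two of which contain
# characters that str.splitlines() also treats as line boundaries.
sample = "first line\nsecond\x0cstill second\nthird\u2028still third\n"

by_newline = sample.strip().split("\n")   # what to_sentences() does
by_splitlines = sample.splitlines()       # what the earlier cell does

print(len(by_newline), by_newline)
print(len(by_splitlines), by_splitlines)
```

Here `split('\n')` returns 3 items and `splitlines()` returns 5. If such characters occur in only one language file, the extra splits would shift every later sentence out of alignment with its translation.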

In [18]:
# shortest and longest sentence lengths
def sentence_lengths(sentences):
	lengths = [len(s.split()) for s in sentences]
	return min(lengths), max(lengths)
 
sentence_lengths(mySentences)

(0, 668)
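
A minimum of 0 means that at least one line contains no tokens, because `''.split()` returns an empty list. A small sketch with made-up sentences shows how such lines could be counted; `empty_count` is a name introduced here for illustration only.

```python
# Hypothetical mini-corpus with one empty line and one whitespace-only line.
sentences = [
    "Resumption of the session",
    "",
    "   ",
    "Madam President, on a point of order.",
]

# Same length computation as sentence_lengths(): whitespace tokens per line.
lengths = [len(s.split()) for s in sentences]

# Lines without any token are the ones responsible for min == 0.
empty_count = sum(1 for n in lengths if n == 0)

print(lengths, empty_count)
```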

In [40]:
# clean lines
import re
import string
import unicodedata
def clean_lines(lines):
	cleaned = list()
	# prepare regex for char filtering
	re_print = re.compile('[^%s]' % re.escape(string.printable))
    # The compiled pattern matches every character that is NOT a printable ASCII character,
      # so re_print.sub('', w) deletes those characters from w.
    # re.compile(): compiles a regular expression into a pattern object that can be reused for matching.
    # [^...]: a negated character set. The ^ right after the opening bracket negates the set,
      # so it matches any single character that is not listed in it.
    # '[^%s]' % re.escape(string.printable): %-formatting inserts the escaped string.printable in place of %s.
    # string.printable: a constant of the string module holding all printable ASCII characters:
      # digits, lowercase and uppercase letters, punctuation, and whitespace.
    # re.escape(): backslash-escapes characters that have a special meaning in a regex (such as ], ^, \ and -),
      # so that every character of string.printable is taken literally inside the set.

	# prepare translation table for removing punctuation
	table = str.maketrans('', '', string.punctuation)
    # used to create a translation table that can be used with the str.translate() method.
	for line in lines:
		# normalize unicode characters
		line = unicodedata.normalize('NFD', line).encode('ascii', 'ignore')
      # NFD = canonical decomposition: an accented character is split into its base letter plus a combining mark (é -> e + U+0301).
      # Encoding to ASCII with the 'ignore' error handler then drops the combining mark (and any other non-ASCII character) and keeps the base letter.
      # encode() always returns a bytes object, not a str.
		line = line.decode('UTF-8')
      # decode() turns the bytes back into a str.
      # The bytes are pure ASCII at this point, and ASCII is a subset of UTF-8, so decoding as UTF-8 cannot fail.
		# tokenize on white space
		line = line.split()
		# convert to lower case
		line = [word.lower() for word in line]
		# remove punctuation from each token
		line = [word.translate(table) for word in line]
      # translate() method is a string method that performs character-level translation or deletion based on a translation table.
		# remove non-printable chars from each token
		line = [re_print.sub('', w) for w in line]
      # applies the compiled pattern re_print to each word w in the line and deletes every match
		# remove tokens with numbers in them
		line = [word for word in line if word.isalpha()]
      # isalpha() returns True only if the string is non-empty and every character is a letter, so tokens containing digits and empty tokens are dropped
		# store as string
		cleaned.append(' '.join(line)) # appends a string to the cleaned list after joining the elements of the line list with a space separator
	return cleaned
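
The normalization step inside `clean_lines` is easiest to see on French text. The sketch below isolates it in a helper called `to_ascii`, a name introduced here for illustration; the sample words are made up and written with escapes so that the code points are unambiguous.

```python
import unicodedata

def to_ascii(text):
    # NFD splits each accented character into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # 'ignore' silently drops everything that has no ASCII representation.
    return decomposed.encode("ascii", "ignore").decode("ascii")

# "déjà reportée": the accents are dropped, the base letters survive.
accented = to_ascii("d\u00e9j\u00e0 report\u00e9e")

# "cœur": the ligature U+0153 has no decomposition, so it disappears entirely.
ligature = to_ascii("c\u0153ur")

print(accented, "|", ligature)
```

The second case is worth keeping in mind for the French file: characters without a decomposition are removed rather than replaced, which changes the word.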

In [39]:
# Me
import re
import string
import unicodedata
mySubset = mySentences[:10]
my_re_print = re.compile('[^%s]' % re.escape(string.printable))
print("my_re_print: ", my_re_print)

table = str.maketrans('', '', string.punctuation)
print("table: ", table)

line = mySubset[4]
print("mySubset[4]: ", line)
  # normalize unicode characters
line = unicodedata.normalize('NFD', line).encode('ascii', 'ignore')
print("unicodedata.normalize('NFD', line).encode('ascii', 'ignore') (which is a byte string due to ascii encoding) : ", line)
line = line.decode('UTF-8')
print("UTF-8: ", line)

# tokenize on white space
line = line.split()
print("line.split(): ", line)

# convert to lower case
line = [word.lower() for word in line]
print("[word.lower() for word in line]: ", line)

# remove punctuation from each token
line = [word.translate(table) for word in line]
print("[word.translate(table) for word in line]: ", line)

# remove non-printable chars from each token
line = [my_re_print.sub('', w) for w in line]
print("[my_re_print.sub('', w) for w in line]: ", line)

# remove tokens with numbers in them
line = [word for word in line if word.isalpha()]

# store as string
my_cleaned = list()
my_cleaned.append(' '.join(line))
print(my_cleaned)

my_re_print:  re.compile('[^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"\\#\\$%\\&\'\\(\\)\\*\\+,\\-\\./:;<=>\\?@\\[\\\\\\]\\^_`\\{\\|\\}\\~\\ \\\t\\\n\\\r\\\x0b\\\x0c]')
table:  {33: None, 34: None, 35: None, 36: None, 37: None, 38: None, 39: None, 40: None, 41: None, 42: None, 43: None, 44: None, 45: None, 46: None, 47: None, 58: None, 59: None, 60: None, 61: None, 62: None, 63: None, 64: None, 91: None, 92: None, 93: None, 94: None, 95: None, 96: None, 123: None, 124: None, 125: None, 126: None}
mySubset[4]:  In the meantime, I should like to observe a minute' s silence, as a number of Members have requested, on behalf of all the victims concerned, particularly those of the terrible storms, in the various countries of the European Union.
unicodedata.normalize('NFD', line).encode('ascii', 'ignore') (which is a byte string due to ascii encoding) :  b"In the meantime, I should like to observe a minute' s silence, as a number of Members have requested, on behalf of a
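
The inspection cell above walks one sentence through the pipeline. As a complementary sketch, the following applies only the last three filters to a short, made-up token list so that the effect of each step can be checked by eye. The name `non_printable` is introduced here and plays the role of `re_print`.

```python
import re
import string

# Same translation table and pattern as in clean_lines().
table = str.maketrans("", "", string.punctuation)
non_printable = re.compile("[^%s]" % re.escape(string.printable))

# Hypothetical tokens, already lowercased and split on whitespace.
tokens = ["minute'", "s", "part-session.", "1999,", "caf\u00e9", "(the", "-"]

# 1. Delete punctuation characters from each token.
no_punct = [t.translate(table) for t in tokens]

# 2. Delete characters outside string.printable (here the accented e).
printable_only = [non_printable.sub("", t) for t in no_punct]

# 3. Keep only tokens made entirely of letters; "" and "1999" are dropped.
alpha_only = [t for t in printable_only if t.isalpha()]

print(no_punct)
print(printable_only)
print(alpha_only)
```

Deleting the hyphen joins the two halves of "part-session", which explains the token "partsession" in the cleaned output.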

In [41]:
# load English data
filename = 'europarl-v7.fr-en.en'
doc = load_doc(filename)
sentences = to_sentences(doc)
minlen, maxlen = sentence_lengths(sentences)
print('English data: sentences=%d, min=%d, max=%d' % (len(sentences), minlen, maxlen))
cleanf=clean_lines(sentences)
filename = 'English.pkl'
with open(filename, 'wb') as outfile: # 'wb' = write binary: pickle produces bytes, so the file must be opened in binary mode
	pickle.dump(cleanf, outfile) # dump() serializes the list (converts it into bytes) and writes it to the file
print(filename," saved")

# load French data
filename = 'europarl-v7.fr-en.fr'
doc = load_doc(filename)
sentences = to_sentences(doc)
minlen, maxlen = sentence_lengths(sentences)
print('French data: sentences=%d, min=%d, max=%d' % (len(sentences), minlen, maxlen))
cleanf=clean_lines(sentences)
filename = 'French.pkl'
with open(filename, 'wb') as outfile:
	pickle.dump(cleanf, outfile)
print(filename," saved")

English data: sentences=2007723, min=0, max=668
English.pkl  saved
French data: sentences=2007723, min=0, max=693
French.pkl  saved


In [42]:


from pickle import load
from pickle import dump
from collections import Counter
 
# load a clean dataset
def load_clean_sentences(filename):
	# 'rb' = read binary: the pickle file is read as raw bytes, which load() turns back into the original Python object
	with open(filename, 'rb') as infile:
		return load(infile)

# save a list of clean sentences to file
def save_clean_sentences(sentences, filename):
	# dump() (from the pickle module) serializes the sentences object into a byte stream and writes it to the file
	with open(filename, 'wb') as outfile:
		dump(sentences, outfile)
	print('Saved: %s' % filename)
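
To make the binary read and write modes concrete, here is a minimal round-trip sketch. It writes a made-up list of sentences to a temporary file and reads it back; the file name `demo.pkl` is an assumption chosen for this example.

```python
import os
import pickle
import tempfile

# Hypothetical cleaned sentences.
sentences = ["resumption of the session", "madam president on a point of order"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.pkl")

    # 'wb': pickle.dump() writes bytes, so the file is opened in binary mode.
    with open(path, "wb") as outfile:
        pickle.dump(sentences, outfile)

    # 'rb': pickle.load() reads those bytes and rebuilds the list.
    with open(path, "rb") as infile:
        restored = pickle.load(infile)

print(restored == sentences)
```

Using `with` blocks guarantees that the file is flushed and closed before it is read again.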
 

In [43]:
# create a frequency table for all words
def to_vocab(lines):
	vocab = Counter()
	for line in lines:
		tokens = line.split()
		vocab.update(tokens)
	return vocab
 
# remove all words with a frequency below a threshold
def trim_vocab(vocab, min_occurrence):
	tokens = [k for k,c in vocab.items() if c >= min_occurrence]
	return set(tokens)
 
# mark all OOV with "unk" for all lines
def update_dataset(lines, vocab):
	new_lines = list()
	for line in lines:
		new_tokens = list()
		for token in line.split():
			if token in vocab:
				new_tokens.append(token)
			else:
				new_tokens.append('unk')
		new_line = ' '.join(new_tokens)
		new_lines.append(new_line)
	return new_lines
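
The three functions above can be followed on a toy corpus. The sketch below repeats their logic inline on three made-up lines with a threshold of 2, so that the counts are small enough to verify by hand; the variable names `kept` and `marked` are introduced for this example.

```python
from collections import Counter

# Hypothetical cleaned lines.
lines = [
    "the session is resumed",
    "the session is closed",
    "the debate is closed",
]

# to_vocab(): count every whitespace-separated token.
vocab = Counter()
for line in lines:
    vocab.update(line.split())

# trim_vocab(): keep tokens that occur at least twice.
kept = {token for token, count in vocab.items() if count >= 2}

# update_dataset(): replace every out-of-vocabulary token with 'unk'.
marked = [
    " ".join(token if token in kept else "unk" for token in line.split())
    for line in lines
]

print(sorted(kept))
print(marked)
```

The words "resumed" and "debate" occur only once, so they are the two tokens replaced by "unk".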

In [44]:
# load English dataset
filename = 'English.pkl'
lines = load_clean_sentences(filename)
# calculate vocabulary
vocab = to_vocab(lines)
print('English Vocabulary: %d' % len(vocab))
# reduce vocabulary
vocab = trim_vocab(vocab, 5)
print('New English Vocabulary: %d' % len(vocab))
# mark out of vocabulary words
lines = update_dataset(lines, vocab)
# save updated dataset
filename = 'english_vocab.pkl'
save_clean_sentences(lines, filename)
# spot check
for i in range(20):
	print("line",i,":",lines[i])
 
# load French dataset
filename = 'French.pkl'
lines = load_clean_sentences(filename)
# calculate vocabulary
vocab = to_vocab(lines)
print('French Vocabulary: %d' % len(vocab))
# reduce vocabulary
vocab = trim_vocab(vocab, 5)
print('New French Vocabulary: %d' % len(vocab))
# mark out of vocabulary words
lines = update_dataset(lines, vocab)
# save updated dataset
filename = 'french_vocab.pkl'
save_clean_sentences(lines, filename)
# spot check
for i in range(20):
	print("line",i,":",lines[i])

English Vocabulary: 105357
New English Vocabulary: 41746
Saved: english_vocab.pkl
line 0 : resumption of the session
line 1 : i declare resumed the session of the european parliament adjourned on friday december and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period
line 2 : although as you will have seen the dreaded millennium bug failed to materialise still the people in a number of countries suffered a series of natural disasters that truly were dreadful
line 3 : you have requested a debate on this subject in the course of the next few days during this partsession
line 4 : in the meantime i should like to observe a minute s silence as a number of members have requested on behalf of all the victims concerned particularly those of the terrible storms in the various countries of the european union
line 5 : please rise then for this minute s silence
line 6 : the house rose and observed a minute s silence
line 7 : madam president o