# Text Processing

Parts of this tutorial have been extracted from the assignment prepared by **Marijn Schraagen** for the data analytics course that I taught last year.

In [1]:
# import required packages
import pandas as pd
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
import re

## Tokenization

In [None]:
myDoc = 'Data has been coined "The oil of the 21st century".'
myDoc += ' Businesses and organizations have realized that in order to thrive in the data driven economy,'
myDoc += ' have to adopt modern data management solutions that will allow them to innovate and generate high-quality'
myDoc += ' added value services.'
# TODO: Tokenize the document myDoc and display the set of tokens.
# Use the default tokenization (tokens are separated by spaces and special characters).



In [None]:
# TODO: Tokenize the document myDoc and display the set of tokens.
# This time tokens are separated by spaces and special characters, and the hyphen counts as a special character.
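
The difference between the two tokenization exercises can be illustrated with plain regular expressions. This is only a sketch using the standard `re` module, not NLTK's own tokenizer, and the shortened sample text and variable names are assumptions made for illustration. The same patterns could also be passed to the `RegexpTokenizer` imported above.

```python
import re

# Shortened sample text (an assumption, not the full myDoc from the exercise).
my_doc = 'Data has been coined "The oil of the 21st century". They generate high-quality services.'

# Variant 1: hyphenated words stay together; every other special character
# becomes a token of its own.
keep_hyphen = re.findall(r"\w+(?:-\w+)*|[^\w\s]", my_doc)

# Variant 2: the hyphen is treated like any other special character,
# so "high-quality" is split into "high", "-", "quality".
split_hyphen = re.findall(r"\w+|[^\w\s]", my_doc)

print(keep_hyphen)
print(split_hyphen)
```

Only the pattern changes between the two variants, which makes it easy to experiment with other separator rules.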


## Stemming

In [None]:
sentence = "It is important to be very pythonly while you are pythoning with python."
sentence += " All pythoners have pythoned poorly at least once."
# TODO: run Porter's stemmer to find the stem of each word in the sentence
ps = PorterStemmer()





## Minimum Edit Distance (Levenshtein)

In [None]:
def ldist(s, t):
    """
        ldist(s, t) -> int
        Returns the Levenshtein distance between the strings
        s and t, where insertions and deletions cost 1 and
        substitutions cost 2.
        For all i and j, dist[i][j] will contain the Levenshtein
        distance between the first i characters of s and the
        first j characters of t
    """

    rows = len(s)+1
    cols = len(t)+1
    dist = [[0] * cols for _ in range(rows)]

    # source prefixes can be transformed into empty strings
    # by deletions:
    for i in range(1, rows):
        dist[i][0] = i

    # target prefixes can be created from an empty source string
    # by inserting the characters
    for i in range(1, cols):
        dist[0][i] = i

    for col in range(1, cols):
        for row in range(1, rows):
            if s[row-1] == t[col-1]:
                c = 0
            else:
                c = 2                                        # substitution cost
            dist[row][col] = min(dist[row-1][col] + 1,       # deletion
                                 dist[row][col-1] + 1,       # insertion
                                 dist[row-1][col-1] + c)     # substitution

    # print the full distance matrix, one row per line
    for r in range(rows):
        print(dist[r])

    # index the last cell explicitly: the loop variables row and col
    # are undefined when s or t is empty
    return dist[rows-1][cols-1]

print(ldist("Intention", "execution"))

In [None]:
# TODO: Modify the ldist function to accept a new parameter which specifies the substitution cost
# Test your implementation with the call mod_ldist("Intention", "execution", 1) -- You should get 5
# and mod_ldist("Intention", "execution", 2) -- You should get 8
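
One possible way to add the substitution-cost parameter is sketched below. It assumes that insertions and deletions keep a cost of 1, and the parameter name `sub_cost` is a hypothetical choice, not something fixed by the exercise.

```python
def mod_ldist(s, t, sub_cost=1):
    """Levenshtein distance with a configurable substitution cost.

    Insertions and deletions cost 1 (an assumption carried over from ldist).
    """
    rows, cols = len(s) + 1, len(t) + 1
    dist = [[0] * cols for _ in range(rows)]

    # Turning a prefix of s into the empty string takes i deletions.
    for i in range(1, rows):
        dist[i][0] = i
    # Building a prefix of t from the empty string takes j insertions.
    for j in range(1, cols):
        dist[0][j] = j

    for i in range(1, rows):
        for j in range(1, cols):
            c = 0 if s[i - 1] == t[j - 1] else sub_cost
            dist[i][j] = min(dist[i - 1][j] + 1,      # deletion
                             dist[i][j - 1] + 1,      # insertion
                             dist[i - 1][j - 1] + c)  # substitution or match
    return dist[-1][-1]


print(mod_ldist("Intention", "execution", 1))  # the exercise expects 5
print(mod_ldist("Intention", "execution", 2))  # the exercise expects 8
```

With `sub_cost=2` a substitution is never cheaper than a deletion followed by an insertion, which is why the two calls give different results.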


In [None]:
# TODO: Now consider the following words: 
w1 = 'AGGCTATCACCTGACCTCCAGGCCGATGCCC'
w2 = 'TAGCTATCACGACCGCGGTCGATTTGCCCGAC'
w3 = 'AGCTATCACGACCGCGGTCGATTTGCCCGACCC'
# Compute the pairwise minimum edit distance
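
A sketch of the pairwise computation follows. It assumes a substitution cost of 1, because the exercise does not say which cost to use. The helper `edit_distance` is a hypothetical two-row variant of the dynamic program that keeps only the previous row in memory and returns the same value as the full matrix.

```python
from itertools import combinations


def edit_distance(s, t, sub_cost=1):
    """Levenshtein distance that keeps only two rows of the matrix."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]  # i deletions turn s[:i] into the empty string
        for j, ct in enumerate(t, start=1):
            c = 0 if cs == ct else sub_cost
            curr.append(min(prev[j] + 1,       # deletion
                            curr[j - 1] + 1,   # insertion
                            prev[j - 1] + c))  # substitution or match
        prev = curr
    return prev[-1]


words = {
    "w1": "AGGCTATCACCTGACCTCCAGGCCGATGCCC",
    "w2": "TAGCTATCACGACCGCGGTCGATTTGCCCGAC",
    "w3": "AGCTATCACGACCGCGGTCGATTTGCCCGACCC",
}

# One entry per unordered pair of words.
pairwise = {(a, b): edit_distance(words[a], words[b])
            for a, b in combinations(words, 2)}

for pair, distance in pairwise.items():
    print(pair, distance)
```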


## Document Retrieval 

In [None]:
D1 = 'Businesses and organizations have realized that in order to thrive in the data driven economy,'
D1 += ' have to adopt modern data management solutions that will allow them to innovate and generate high-quality'
D1 += ' added value services.'
D2 = 'Successful organizations most likely have data management technology powering every process.'
D3 = 'Data management solutions and systems that help to deal with and manage the full data lifecycle needs of your company.'
D3 += ' Manage the development and execution of architectures, policies, practices and procedures with the data management systems found on bobsguide.'
# TODO: For each document, create a dictionary that maps each term to its frequency in the document (stop words shouldn't be included).
# Compute the normalized document-term matrix and display the values for the terms
# organizations, data, management, solutions, systems
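
The term-frequency step could look like the sketch below. Two things here are assumptions: the tiny hand-picked stop word list, and the normalization, which divides each count by the number of remaining tokens in the document (other normalizations, such as dividing by the most frequent term's count, are also common). The function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: a very small stop word list, for illustration only.
STOP_WORDS = {"a", "and", "every", "have", "in", "most", "of", "that", "the", "to", "with"}


def term_counts(doc, stop_words=STOP_WORDS):
    """Count lowercase word tokens in doc, skipping stop words."""
    tokens = re.findall(r"[a-z0-9]+", doc.lower())
    return Counter(tok for tok in tokens if tok not in stop_words)


def normalized_tf(doc, stop_words=STOP_WORDS):
    """Divide each term count by the number of non-stop-word tokens."""
    counts = term_counts(doc, stop_words)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}


docs = {
    "D2": "Successful organizations most likely have data management technology powering every process.",
    "toy": "Data management systems manage data.",  # made-up example document
}
terms = ["organizations", "data", "management", "solutions", "systems"]

# One row of the document-term matrix per document.
for name, text in docs.items():
    tf = normalized_tf(text)
    print(name, [round(tf.get(term, 0.0), 3) for term in terms])
```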



In [None]:
# TODO: Using D1, D2 and D3, compute the TF-IDF matrix for the terms in the previous example
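
TF-IDF comes in several variants. The sketch below assumes one common definition, tf = count / document length and idf = log(N / df), where N is the number of documents and df the number of documents containing the term. Stop words are not removed here to keep the example short, and the toy documents and function names are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(doc):
    """Lowercase the document and extract alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", doc.lower())


def tf_idf(docs, terms):
    """Return {doc_name: {term: tf * idf}} for the requested terms."""
    n_docs = len(docs)
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    matrix = {name: {} for name in docs}
    for term in terms:
        # Document frequency: in how many documents does the term occur?
        df = sum(1 for c in counts.values() if c[term] > 0)
        idf = math.log(n_docs / df) if df else 0.0
        for name, c in counts.items():
            total = sum(c.values())
            matrix[name][term] = (c[term] / total) * idf if total else 0.0
    return matrix


toy_docs = {"a": "data systems", "b": "data solutions", "c": "management"}
for name, row in tf_idf(toy_docs, ["data", "systems", "management"]).items():
    print(name, {term: round(value, 3) for term, value in row.items()})
```

Under this definition a term that occurs in every document gets an idf of log(1) = 0, so it contributes nothing to distinguishing the documents.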



# Building a Language Identification Classifier

The goal of this tutorial is to use what we learned during the text processing lecture to build a language identification classifier for four languages: Dutch, English, German and French. Character trigrams (3-shingles) have proved to be a good basis for comparing blocks of text by their local structure, which is a good indicator of the language used.

To build the model, a corpus of 28 Wikipedia documents is provided: 20 training documents (5 per language) and 8 test documents (2 per language). A file with the basic structure of the code is provided as an attachment (da_textmining_2020.r). The algorithm you should implement is as follows:

1. Read the training data from the text files, perform data cleaning and split it into words (tokens).
2. Compute a vector of all **trigrams** for each language.
3. Compute a **frequency table** and store the top 300 most frequent trigrams for each language.
4. For each individual test document, compute the 300 most frequent trigrams.
5. Implement a rank comparison algorithm to compare the 300 test trigrams for each test document to each of the four language models.
6. For each test document, present the results of the comparison in sorted order to the user.

Test your implementation on the provided test corpus. Based on the comparison scores, which language is closest to Dutch, and which language is furthest away?

Analyze how much data you need. Gradually reduce the number of words used to construct the training and test language models and check the performance. Extract words from the middle of the document (for example, starting at the 100th word) to reduce the influence of any Wikipedia-specific phrases. Make a table showing the number of words used vs. the number of classification errors.
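
Taking a window of words from the middle of a document is a simple slice. In this sketch `sample_words`, `size` and `start` are hypothetical names, and `start` is a zero-based index, so 99 corresponds to the 100th word.

```python
def sample_words(words, size, start=99):
    """Return up to `size` words, beginning at zero-based index `start`."""
    return words[start:start + size]


# Made-up document of 1000 placeholder words.
document = [f"w{i}" for i in range(1, 1001)]

# Shrink the sample step by step, as the experiment suggests.
for size in (500, 200, 50):
    sample = sample_words(document, size)
    print(size, len(sample), sample[0], sample[-1])
```

If a document has fewer than `start + size` words, the slice simply returns fewer words, so it is worth checking the sample length when filling in the table.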

### Implementation details
**Read the training data from the text files, perform data cleaning and split it into words (tokens):**
The original assignment was written in R, where the readr library provides the read_file function for reading files and gsub handles substitutions; in Python, the built-in open() and re.sub play the same roles. After loading the file, perform the following preprocessing steps:
* convert to lowercase
* strip newline characters (\n)
* strip everything that is not an alphanumeric character, a space or an apostrophe ('); you can use regular expressions for this
* collapse multiple spaces into one (gsub in R, re.sub in Python)
* extract a list of words from the document
* store all words from all documents of one language together in a vector
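
The preprocessing steps above can be sketched in Python as follows. This is one possible reading of the steps, not the reference solution: the helper `clean_words` is a hypothetical name, the files are assumed to be UTF-8 encoded, and `str.isalnum` is used instead of a regular expression so that accented letters such as é or ü are kept while the underscore (needed later as a boundary character) is removed.

```python
import re


def clean_words(text):
    """Apply the cleaning steps to a string and return its list of words."""
    text = text.lower()
    # Replace line breaks with spaces so words on adjacent lines stay separate.
    text = text.replace("\r", " ").replace("\n", " ")
    # Keep only alphanumeric characters, spaces and apostrophes.
    text = "".join(ch for ch in text if ch.isalnum() or ch in " '")
    # Collapse runs of spaces into a single space.
    text = re.sub(r" +", " ", text).strip()
    return text.split(" ") if text else []


def get_words(filename):
    """Read a file (assumed to be UTF-8) and return its cleaned word list."""
    with open(filename, encoding="utf-8") as handle:
        return clean_words(handle.read())


print(clean_words("Hello,\r\nWorld!  It's  fine."))
```

Keeping the cleaning logic in its own function makes it possible to try it on short strings before running it on the corpus files.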

In [2]:
# TODO: write a function that receives a filename, reads all of its contents, converts all letters to lowercase and 
# removes the '\r', '\n' and any extra spaces and returns the list of words 
def get_words (filename):    
    

**Compute a vector of all trigrams for each language:**
Write a function that computes all trigrams for a single word. Then, apply this function to the full list of words for a language. The function to compute trigrams should first add the boundary character _ (underscore) at the start and end of the word. Then, using a for loop over the length of the word, extract all trigrams (3-shingles).
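
A minimal sketch of the trigram extraction described above. The helper `all_trigrams`, which applies the function to a whole word list, is a hypothetical name.

```python
def get_trigrams(word):
    """Return all 3-shingles of word after padding it with underscores."""
    padded = "_" + word + "_"
    trigrams = []
    # A padded word of length n contains n - 2 trigrams.
    for i in range(len(padded) - 2):
        trigrams.append(padded[i:i + 3])
    return trigrams


def all_trigrams(words):
    """Concatenate the trigrams of every word in a list."""
    result = []
    for word in words:
        result.extend(get_trigrams(word))
    return result


print(get_trigrams("the"))             # ['_th', 'the', 'he_']
print(all_trigrams(["a", "data"]))     # ['_a_', '_da', 'dat', 'ata', 'ta_']
```

Thanks to the boundary characters, even a one-letter word yields a trigram, and word beginnings and endings are represented explicitly.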

In [3]:
def get_trigrams(word):
    
    

**Compute a frequency table and store the top 300 most frequent trigrams for each language:** Write a function to compute a frequency table. Sort (most frequent on top) and select the top 300 most frequent trigrams. Note: convert the result to a data frame to enable rank comparison in the next step.
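
The frequency table can be sketched with `collections.Counter`. This version returns a plain list of (trigram, count) pairs ordered from most to least frequent, so the rank of a trigram is its position in the list. Returning a list instead of a data frame, and the extra parameter `n`, are assumptions made to keep the sketch free of third-party dependencies.

```python
from collections import Counter


def get_top300(words, n=300):
    """Return the n most frequent trigrams of a word list as (trigram, count) pairs."""
    counts = Counter()
    for word in words:
        padded = "_" + word + "_"
        # Count every 3-shingle of the padded word.
        counts.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return counts.most_common(n)


top = get_top300(["the", "the", "then"], n=3)
for rank, (trigram, count) in enumerate(top):
    print(rank, trigram, count)
```

If a table is preferred for the rank comparison, the returned list of pairs can be converted to a pandas data frame afterwards.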

In [4]:
import collections
def get_top300 (words):
   

**For each individual test document, compute the 300 most frequent trigrams:** 

In [26]:
# TODO: Compute the comparison score between the reference and test models
# If a trigram exists in both models, add the difference between the ranks
# of the trigram in the two models to the comparison score
# If a trigram exists in the test model but not in the reference model, add a penalty of 500 to the comparison score
def compare_model(model_ref, model_test):
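
One way to read the rank comparison is sketched below. It assumes each model is a list of trigrams ordered from most to least frequent, and that "difference between the ranks" means the absolute difference. Both are interpretations, and the `penalty` parameter is a hypothetical addition.

```python
def compare_model(model_ref, model_test, penalty=500):
    """Sum of rank differences between a test model and a reference model.

    Both models are lists of trigrams, most frequent first.
    A lower score means the test document is closer to the reference language.
    """
    # Map each reference trigram to its rank for constant-time lookup.
    ref_rank = {trigram: rank for rank, trigram in enumerate(model_ref)}
    score = 0
    for rank, trigram in enumerate(model_test):
        if trigram in ref_rank:
            score += abs(ref_rank[trigram] - rank)
        else:
            score += penalty  # trigram unknown to the reference model
    return score


reference = ["_th", "the", "he_"]
print(compare_model(reference, ["the", "_th", "xyz"]))  # 1 + 1 + 500 = 502
print(compare_model(reference, reference))              # 0
```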
    

Get the names of the training files and read the contents of each file. 
Create a reference model for each language

In [9]:
import os
# TODO: Get the name of the files in the following path

path = 'txt/Training/'


# Now: create a dictionary that maps each language to the names of the files
# written in that language
# Read the contents of all the documents of a given language into a single list or vector
# (you will need 4 lists or vectors)
# Create the reference models of the languages by storing the 300 most frequent trigrams of each language
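
Grouping the training files by language might look like the sketch below. It assumes the training files follow the same naming pattern as the test files shown in the expected output, that is, a language code followed by a hyphen (for example nl-Recht.txt). The function name is hypothetical.

```python
import os
from collections import defaultdict


def group_by_language(filenames):
    """Map each language code to the sorted list of its .txt files."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        if name.endswith(".txt") and "-" in name:
            language = name.split("-", 1)[0]  # text before the first hyphen
            groups[language].append(name)
    return dict(groups)


path = "txt/Training/"
if os.path.isdir(path):  # only runs when the corpus is actually present
    for language, names in group_by_language(os.listdir(path)).items():
        print(language, names)
```

Each list of file names can then be read, cleaned and concatenated into one word list per language before the reference models are built.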



Use the same pipeline as before (preprocessing, create a word list, compute trigrams, select top 300, convert to data frame) but now for an individual test file (instead of all files for a language as before).

In [31]:
# TODO: Get the name of the files in the following path
test_path = 'txt/Testing/'
# TODO: for each file, find the 300 most frequent trigrams and compare the model to the reference models of the languages
# Print the comparison scores in ascending order to see which language is closest to the test file
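
The final ranking step can be sketched as follows, using toy models instead of the real corpus. The helper names, the toy trigram lists and the use of the absolute rank difference are assumptions. The scores are returned as a dictionary sorted in ascending order, matching the shape of the expected output.

```python
def rank_distance(model_ref, model_test, penalty=500):
    """Sum of absolute rank differences, with a penalty for unknown trigrams."""
    ref_rank = {trigram: rank for rank, trigram in enumerate(model_ref)}
    return sum(abs(ref_rank[trigram] - rank) if trigram in ref_rank else penalty
               for rank, trigram in enumerate(model_test))


def rank_languages(ref_models, model_test):
    """Score a test model against every language, closest language first."""
    scores = {language: rank_distance(model_ref, model_test)
              for language, model_ref in ref_models.items()}
    # Dictionaries keep insertion order, so sorting the items sorts the output.
    return dict(sorted(scores.items(), key=lambda item: item[1]))


# Toy reference models with three trigrams each (made up for illustration).
ref_models = {"nl": ["_de", "de_", "en_"], "en": ["_th", "the", "he_"]}
print(rank_languages(ref_models, ["the", "_th", "en_"]))
```

The first key of the returned dictionary is the predicted language for the test document.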




The output that you will get should be close to:

de-Recht.txt

{'de': 57210, 'nl': 98233, 'en': 112190, 'fr': 113559}

de-Wirtschaftswissenschaft.txt

{'de': 62659, 'nl': 99068, 'en': 110925, 'fr': 112940}

en-Economics.txt

{'en': 59830, 'fr': 95597, 'nl': 108235, 'de': 110558}

en-Law.txt

{'en': 63615, 'fr': 96542, 'nl': 108886, 'de': 109644}

fr-Droit.txt

{'fr': 53981, 'en': 102131, 'de': 112984, 'nl': 116002}

fr-Sciences_economiques.txt

{'fr': 50437, 'en': 95457, 'de': 108292, 'nl': 111584}

nl-Economie.txt

{'nl': 59802, 'de': 94938, 'en': 101698, 'fr': 112395}

nl-Recht.txt

{'nl': 65473, 'de': 98252, 'en': 104456, 'fr': 113297}