## Text analysis

Writing texts inevitably produces errors, and some of them are quite simple: typos, duplicated words and wrong punctuation. Typos can mostly be caught with online dictionaries, but detecting duplicated words and punctuation errors requires dedicated programs. We want to implement a simple check for duplicated words and punctuation errors in Python.

Let's assume these kinds of errors:

``` 
... we often often make ...
```

or 

```
... here. this ...
```

In `data/faulty_text.txt` we have added a few errors.

**Your task:**

Write a Python script `text_analysis.py` containing a function `detect_errors` which takes a filename as an argument and uses the `logging` module to report the errors (`logging.error`). Test your function with `data/faulty_text.txt`.

In [1]:
# example to split a string into words taking into account
# (removing) punctuation.
# For time reasons, we will not treat 'regular expressions' in class
# but you should look them up yourself! You should know them from
# Linux already.

import re # module to handle regular expressions in a Python program

s = "Here some text with double (double!) words words. It also contains puctuation!"

# split s into its words without the punctuation marks; note that
# you might end up with empty strings in the word list!
words = re.split(r'\W+', s.rstrip())

print(words)

['Here', 'some', 'text', 'with', 'double', 'double', 'words', 'words', 'It', 'also', 'contains', 'puctuation', '']
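
The empty string at the end of the list appears because the text ends with a punctuation mark. As a small sketch (not the only way to do it), the empty entries can either be filtered out after splitting, or avoided altogether by collecting the words directly with `re.findall`:

```python
import re

s = "Here some text with double (double!) words words. It also contains puctuation!"

# option 1: split on non-word characters and drop the empty strings
words_split = [w for w in re.split(r'\W+', s) if w]

# option 2: collect all runs of word characters directly
words_found = re.findall(r'\w+', s)

# both options give the same list of 12 words without empty strings
print(words_found)
```

Both variants keep the words in their original order, so word positions can still be counted from the resulting list.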


 * also check for duplicated words across line breaks (possibly with empty lines in between):
   ```
   ... test
   
   test ...
   ```
   Do the same for the punctuation tests!
 * also keep track of line numbers and word positions

My program gave these results:

```
ERROR:root:line 1 word #6: often
ERROR:root:line 2 word #6: here
ERROR:root:line 2 word #7: this
ERROR:root:lines 5+6: words
ERROR:root:lines 8+10: test
ERROR:root:line 11 word #1: this
```

 * it is not necessary to reproduce the results in detail, but you should address the same errors

In [2]:
import logging
import re

# Configure the logging
logging.basicConfig(level=logging.ERROR)

def detect_errors(filename):
    words_no_punc = []  # list without punctuation
    words_with_punc = []  # list with punctuation
    temp, index_j, index_i = '', 0, 0  # initialize variables for tracking

    # opening the text file and reading the lines separately (with error handling)
    try:
        with open(filename, 'r') as f:
            data = f.read().splitlines()
    except IOError:
        print('Error reading the file!')
        return

    # split lines into words and add to the list
    for line in data:
        words_no_punc.append(re.split(r'\W+', line.rstrip()))  # remove punctuation and split words
        words_with_punc.append(line.split())  # split words including punctuation

    # loop through all the words to check for repeated words
    for i in range(len(words_no_punc)):
        for j in range(len(words_no_punc[i])):
            # check if this isn't the first line
            if i != 0:
                # check if the current word matches the previous word with some specific conditions
                if (words_no_punc[i][j] == temp) and (temp != ''):
                    if j == 0:
                        logging.error(f"same word in line {index_i+1}+{i+1} word {index_j+1}+{j+1}: {temp}")  # log the error
                    else:
                        logging.error(f"same word in line {i+1} word {index_j+1}+{j+1}: {temp}")  # log the error
            else:
                if j != 0:
                    if (words_no_punc[i][j] == temp) and (temp != ''):
                        logging.error(f"same word in line {i+1} word {index_j+1}+{j+1}: {temp}")  # log the error
            # save the last checked word for comparison in the next iteration
            if words_no_punc[i][j] != "":
                temp, index_j, index_i = words_no_punc[i][j], j, i

    # loop through the words with punctuation to find punctuation errors
    dot = False
    for i in range(len(words_with_punc)):
        for j in range(len(words_with_punc[i])):
            # check if the previous word had a dot and the current word is lowercase
            if (dot is True) and (words_with_punc[i][j].islower()):
                logging.error(f"punctuation error in line {i+1} word {j+1}: {words_with_punc[i][j]}")
            # check if the current word contains a dot
            if "." in words_with_punc[i][j]:
                dot = True
            # reset dot flag if the current word does not contain a dot
            elif words_with_punc[i][j] != "":
                dot = False

detect_errors("data/faulty_text.txt")

ERROR:root:same word in line 1 word 6+7: often
ERROR:root:same word in line 2 word 6+7: here
ERROR:root:same word in line 5+6 word 5+1: words
ERROR:root:same word in line 8+10 word 3+1: test
ERROR:root:punctuation error in line 2 word 8: this
ERROR:root:punctuation error in line 11 word 1: this
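
The index bookkeeping in the solution above can be simplified by first flattening the text into one list of tokens that remember their line number and word position. The following is only a sketch of that idea: the helper name `find_duplicates` and the sample lines are made up for illustration, and it works on a list of lines instead of a file. Like the solution above, it compares words case-sensitively.

```python
import logging
import re

logging.basicConfig(level=logging.ERROR)

def find_duplicates(lines):
    """Return (line1, pos1, line2, pos2, word) for each directly repeated word.

    Hypothetical helper for illustration; not part of the exercise.
    """
    # flatten the text into (line number, word position, word) triples;
    # empty lines simply contribute no tokens
    tokens = []
    for line_no, line in enumerate(lines, start=1):
        for pos, word in enumerate(re.findall(r'\w+', line), start=1):
            tokens.append((line_no, pos, word))

    # compare every token with its direct successor, also across lines
    duplicates = []
    for (l1, p1, w1), (l2, p2, w2) in zip(tokens, tokens[1:]):
        if w1 == w2:
            duplicates.append((l1, p1, l2, p2, w1))
    return duplicates

# made-up sample text with one duplicate inside a line and one across lines
sample = ["we often often make", "some test", "", "test here"]
for l1, p1, l2, p2, word in find_duplicates(sample):
    logging.error(f"same word in lines {l1}+{l2} words {p1}+{p2}: {word}")
```

Because empty lines produce no tokens, a duplicate separated by a blank line is found without any special case.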


---

## Language detection of text files 

Language detection of written texts can be very complex, but we want to implement an easy-to-understand solution. The analysis is based on letter frequencies (see this [Wikipedia article](https://en.wikipedia.org/wiki/Letter_frequency)): for an unknown text, the letter frequencies are calculated and compared with some predefined statistics:

```Python
wikipedia_stats_english = { 'e': 0.1270, 't': 0.09056, 'a': 0.08167, 'o': 0.07507, 'i': 0.06966, 'n': 0.06749 }
wikipedia_stats_german  = { 'e': 0.1639, 'n': 0.0978, 's': 0.0727, 'r': 0.0700, 'i': 0.0655, 'a': 0.0651 }
wikipedia_stats_italian = { 'e': 0.1179, 'a': 0.1174, 'i': 0.1128, 'o': 0.0983, 'n': 0.0688, 'l': 0.0651 }
```

I only used these three languages, since other European languages can easily be identified by their special characters!

In the folder `texts` we have a few text files for which you should decide which language is used.

In [3]:
!ls texts

test01.txt  test03.txt	test05.txt  test07.txt
test02.txt  test04.txt	test06.txt


### Read the text file

Start with the text-file `test01.txt` as a development example. 


**Your task:**

Define a function `read_file_to_letters` in the module `letters` which takes a filename as an argument. It should return the letter frequencies of the text in the given file.

In [6]:
import letters

filename = 'texts/test01.txt'
letter_frequencies = letters.read_file_to_letters(filename)
if letter_frequencies:
    print("Letter frequencies in the text:")
    for letter, freq in letter_frequencies.items():
        print(f"{letter}: {freq:.4f}")

Letter frequencies in the text:
a: 0.0645
b: 0.0102
c: 0.0204
d: 0.0238
e: 0.1210
f: 0.0249
g: 0.0271
h: 0.0577
i: 0.0905
j: 0.0000
k: 0.0045
l: 0.0509
m: 0.0339
n: 0.0701
o: 0.0701
p: 0.0283
q: 0.0023
r: 0.0633
s: 0.0600
t: 0.0860
u: 0.0294
v: 0.0147
w: 0.0192
x: 0.0000
y: 0.0249
z: 0.0023
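
The module `letters` itself is not shown here. A minimal sketch of what `read_file_to_letters` could look like is given below. It assumes that only the letters `a` to `z` are counted (case-insensitively), that the files are UTF-8 encoded, and that `None` is returned if the file cannot be read or contains no letters. The helper `count_letters` is a made-up name used to keep the counting separate from the file handling.

```python
import string
from collections import Counter

def count_letters(text):
    """Return the relative frequency of each letter a-z in text (hypothetical helper)."""
    # count only the ASCII letters, ignoring case, digits, punctuation and umlauts
    counts = Counter(c for c in text.lower() if c in string.ascii_lowercase)
    total = sum(counts.values())
    if total == 0:
        return None
    # include letters that never occur, with a frequency of 0.0
    return {letter: counts[letter] / total for letter in string.ascii_lowercase}

def read_file_to_letters(filename):
    """Read a text file and return its letter frequencies, or None on failure."""
    try:
        # the encoding is an assumption about the test files
        with open(filename, 'r', encoding='utf-8') as f:
            text = f.read()
    except OSError:
        print('Error reading the file!')
        return None
    return count_letters(text)
```

Returning a dictionary with all 26 letters matches the output shown above, where letters such as `j` and `x` appear with a frequency of zero.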


### Compare the languages

The next task is to compare your letter frequencies with the predefined statistics and decide how well your data fits. For each letter in the statistics, simply compute the distance `d` (the absolute difference) between your value and the given value. The value `1-d` then measures the fit for an individual letter. For the whole set of letters, the minimum of all individual `1-d` values describes how well a language fits the data: the larger this value, the more likely it is that the text is written in that language.

**Your task:**

Define a function `test_language` in the module `letters` which takes your letter frequencies and a predefined statistic as arguments and returns the minimum value of `1-d` over all letters in the predefined statistic.

**Hints:**
 * use __only__ the letters defined in the predefined statistics for the analysis
 * call the function from the main script

### Decision of languages 

Based on the previous tasks, we now need to decide which language fits best.

**Your task:**

Define a function `decide_language` in the module `letters` which takes your letter frequency and returns a language name.

**Hints:**
 * call the function from the main script
 * `test_language` is no longer needed in your main script; remove the `import` of this function
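
One possible way to make the decision is to compute the fit for every language and return the name with the largest value. The sketch below does this with the minimum of `1-d` described above, written inline to keep the block self-contained. The dictionary `LANGUAGES`, the optional `languages` parameter and the capitalized language names are assumptions for illustration. The English table uses the six most frequent letters e, t, a, o, i, n from the Wikipedia article.

```python
# predefined statistics; the grouping into one dictionary is an assumption
LANGUAGES = {
    'English': {'e': 0.1270, 't': 0.09056, 'a': 0.08167,
                'o': 0.07507, 'i': 0.06966, 'n': 0.06749},
    'German':  {'e': 0.1639, 'n': 0.0978, 's': 0.0727,
                'r': 0.0700, 'i': 0.0655, 'a': 0.0651},
    'Italian': {'e': 0.1179, 'a': 0.1174, 'i': 0.1128,
                'o': 0.0983, 'n': 0.0688, 'l': 0.0651},
}

def decide_language(letter_frequencies, languages=LANGUAGES):
    """Return the name of the language whose statistics fit best."""
    best_name, best_fit = None, float('-inf')
    for name, stats in languages.items():
        # same measure as in test_language: minimum of 1-d over the given letters
        fit = min(1 - abs(letter_frequencies.get(letter, 0.0) - expected)
                  for letter, expected in stats.items())
        if fit > best_fit:
            best_name, best_fit = name, fit
    return best_name

# the relevant values from the test01.txt output above
measured = {'a': 0.0645, 'e': 0.1210, 'i': 0.0905, 'l': 0.0509, 'n': 0.0701,
            'o': 0.0701, 'r': 0.0633, 's': 0.0600, 't': 0.0860}

print(decide_language(measured))  # English
```

With only six letters per language, the decision can be close for short texts, so the result is a best guess rather than a certainty.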

### Batch-check 

**Your task:**

Check all given files in `texts/` for the language used.

In [7]:
# The glob-module allows you to use Linux style
# pathname expansion
import glob

# generate a list of files matching the Unix-pattern
# texts/*.txt. 
datapath = "texts"
filelist = glob.glob(f"{datapath}/*.txt")

# print the result
print(filelist)

['texts/test04.txt', 'texts/test05.txt', 'texts/test01.txt', 'texts/test02.txt', 'texts/test03.txt', 'texts/test07.txt', 'texts/test06.txt']


In [8]:
import glob
import letters

def main():
    # process a single file
    filename = 'texts/test01.txt'
    letter_frequencies = letters.read_file_to_letters(filename)
    
    if letter_frequencies:
        print("Letter frequencies in the text:")
        for letter, freq in letter_frequencies.items():
            print(f"{letter}: {freq:.4f}")
        
        detected_language = letters.decide_language(letter_frequencies)
        print(f"\nDetected Language: {detected_language}")

    # process multiple files in a directory
    datapath = "texts"
    filelist = sorted(glob.glob(f"{datapath}/*.txt"))

    for filename in filelist:
        print(f"\nfilename: {filename}")
        letter_frequencies = letters.read_file_to_letters(filename)
        
        if letter_frequencies:
            detected_language = letters.decide_language(letter_frequencies)
            print(f"Detected Language: {detected_language}")

if __name__ == "__main__":
    main()


Letter frequencies in the text:
a: 0.0645
b: 0.0102
c: 0.0204
d: 0.0238
e: 0.1210
f: 0.0249
g: 0.0271
h: 0.0577
i: 0.0905
j: 0.0000
k: 0.0045
l: 0.0509
m: 0.0339
n: 0.0701
o: 0.0701
p: 0.0283
q: 0.0023
r: 0.0633
s: 0.0600
t: 0.0860
u: 0.0294
v: 0.0147
w: 0.0192
x: 0.0000
y: 0.0249
z: 0.0023

Detected Language: English

filename: texts/test01.txt
Detected Language: English

filename: texts/test02.txt
Detected Language: German

filename: texts/test03.txt
Detected Language: German

filename: texts/test04.txt
Detected Language: Italian

filename: texts/test05.txt
Detected Language: Italian

filename: texts/test06.txt
Detected Language: English

filename: texts/test07.txt
Detected Language: English


---