# Text cleaning heuristics
This script shows how to clean the newspaper corpus very conservatively. We make no attempt to correct OCR mistakes; the aim is only to remove obvious OCR noise and text that is not in Swedish.

In [1]:
import re
import fasttext

The input file is a plain-text file with an empty line between documents.

In [2]:
with open('/data/corpus-v2/small_news.txt') as f:
    contents = f.read()

docs = contents.split('\n\n')

We use the [fasttext language identification model](https://fasttext.cc/docs/en/language-identification.html) to tell whether a document is in Swedish.

In [3]:
model_path = "/home/elena/Workspace/news/lid.176.bin"
model = fasttext.load_model(model_path)



In [4]:
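def lang_id(doc):
    # fasttext's predict() expects a single line, so replace newlines with spaces.
    text = doc.replace("\n", " ")
    # predict() returns (labels, probabilities); the top label looks like "__label__sv".
    lang = model.predict(text)[0][0]
    return "sv" in lang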
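
The check above accepts the top label however uncertain the model is. The following is a minimal sketch of a stricter variant, not part of the original script. It assumes a hypothetical cut-off `min_prob` (0.5 is an untuned placeholder) and a `predict` argument that behaves like `model.predict`, returning a tuple of labels and a sequence of probabilities. The `fake_predict` stub only stands in for the real model so that the example runs without `lid.176.bin`.

```python
def is_swedish(doc, predict, min_prob=0.5):
    """Return True if the top label is Swedish with probability >= min_prob.

    `predict` is assumed to behave like fasttext's `model.predict`:
    it takes a single-line string and returns (labels, probabilities).
    `min_prob` is a hypothetical threshold, not a value from the original script.
    """
    # fasttext rejects input containing newlines, so flatten the document first.
    text = doc.replace("\n", " ")
    labels, probs = predict(text)
    if len(labels) == 0:
        return False
    return labels[0] == "__label__sv" and float(probs[0]) >= min_prob


def fake_predict(text):
    """Stand-in for model.predict, used only to make this sketch runnable."""
    if "och" in text.split():
        return ("__label__sv",), [0.93]
    return ("__label__en",), [0.41]


print(is_swedish("Det regnar\noch blåser", fake_predict))  # True
print(is_swedish("lll | ll !", fake_predict))              # False
```

With the real model loaded, the call would be `is_swedish(doc, model.predict)`.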

We count the digits, letters, whitespace characters and everything else (mostly punctuation) in the text. Digits and "other" characters are treated as junk, i.e. not text. If the ratio of junk to letters is higher than a threshold (here 0.5), or if the document contains no letters at all, we discard the document as waste.

In [5]:
def count_char(doc):
    numbers = 0
    letters = 0
    spaces = 0
    other = 0
    for c in doc:
        if c.isalpha():
            letters += 1
        elif c.isdigit():
            numbers += 1
        elif c.isspace():
            spaces += 1
        else:
            other += 1
    return numbers, letters, spaces, other

In [6]:
def letter_ratio(doc):
    numbers, letters, spaces, other = count_char(doc)
    if letters == 0:
        # No letters at all: treat the document as waste.
        return False
    junk = numbers + other
    return junk / letters <= 0.5
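
To make the threshold concrete, here is a small worked example. The helper `junk_ratio` is a hypothetical name introduced only for this illustration; it computes the same ratio as above but returns the number instead of a boolean. The two sample strings are taken from the removed documents shown further down.

```python
def junk_ratio(doc):
    """Return (digits + other symbols) / letters, or None if there are no letters."""
    letters = sum(c.isalpha() for c in doc)
    digits = sum(c.isdigit() for c in doc)
    spaces = sum(c.isspace() for c in doc)
    # Everything that is not a letter, digit or whitespace counts as "other".
    other = len(doc) - letters - digits - spaces
    if letters == 0:
        return None
    return (digits + other) / letters


noisy = 'HII !\nl !\nli !\ni !\nl !'
clean = "De här två utgör ett bra lås ."

print(junk_ratio(noisy))            # 0.625 (5 junk characters / 8 letters): above 0.5, rejected
print(round(junk_ratio(clean), 3))  # 0.045 (1 junk character / 22 letters): kept
```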

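We collapse runs of three or more punctuation marks, each optionally followed by a space, into a single mark. Because a repeated capture group keeps only its final match, the mark that survives is the last one in the run, not the first.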

In [7]:
def reduce_punct(doc):
    # Caveat: [^a-zA-Z0-9\s] also matches non-ASCII letters such as å, ä and ö.
    doc = re.sub(r'([^a-zA-Z0-9\s] ?){3,}', r'\g<1>', doc)
    return doc
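
A few sample inputs show how the pattern behaves in practice. `reduce_punct` below repeats the function from the cell above. `reduce_punct_unicode` is a hypothetical alternative, not part of the original script, sketched on the assumption that Swedish letters should never count as punctuation.

```python
import re


def reduce_punct(doc):
    # Same pattern as in the notebook cell.
    return re.sub(r'([^a-zA-Z0-9\s] ?){3,}', r'\g<1>', doc)


def reduce_punct_unicode(doc):
    # Hypothetical variant: \w covers Unicode letters and digits, so å, ä and ö
    # are no longer treated as punctuation. The underscore is added back
    # explicitly because \w also matches it.
    return re.sub(r'((?:[^\w\s]|_) ?){3,}', r'\g<1>', doc)


print(reduce_punct("a.,;b"))          # a;b       (the last mark of the run survives)
print(reduce_punct("Hej !!! du"))     # Hej ! du
print(reduce_punct("sååå"))           # så        (three å in a row are collapsed too)
print(reduce_punct_unicode("sååå"))   # sååå
print(reduce_punct_unicode("a.,;b"))  # a;b
```

Whether the second behaviour is preferable depends on the corpus: in OCR noise, a run of three non-ASCII letters may itself be junk.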

The following function deals with repeated characters. We distinguish between OCR material and natively digital material. In OCR material, a run of four or more identical letters is most likely an OCR mistake, since newspapers and other publications rarely contain colloquial language, so we simply remove the whole run. In digital material, especially social media text, repetitions can be an intentional feature of informal language, for example "sååååååå kul!", so instead of deleting the run we replace it with a single occurrence of the letter.

In [8]:
def reduce_seq(doc, ocr):
    if ocr:
        doc = re.sub(r'([a-zA-ZäöåÅÖÄ])\1{3,}', '', doc)
    else:
        doc = re.sub(r'([a-zA-ZäöåÅÖÄ])\1{3,}', r'\g<1>', doc)
    return doc
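
The two modes can be compared on the example from the text and on a line of typical OCR noise. This sketch restates the function with the same regular expression; the precompiled `REPEATED` constant and the keyword-style `ocr=` calls are conveniences of the example only.

```python
import re

# One letter followed by at least three copies of itself, i.e. four or more in a row.
REPEATED = re.compile(r'([a-zA-ZäöåÅÖÄ])\1{3,}')


def reduce_seq(doc, ocr):
    # OCR text: drop the run entirely; digital text: keep one copy of the letter.
    return REPEATED.sub('' if ocr else r'\g<1>', doc)


print(repr(reduce_seq("sååååååå kul!", ocr=False)))  # 'så kul!'
print(repr(reduce_seq("sååååååå kul!", ocr=True)))   # 's kul!'
print(repr(reduce_seq("lllll !", ocr=True)))         # ' !'
print(repr(reduce_seq("lll !", ocr=True)))           # 'lll !' (only three letters: untouched)
```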

In [9]:
clean_docs = []
waste = []

for doc in docs:
    if lang_id(doc) and letter_ratio(doc):
        doc = reduce_punct(doc)
        doc = reduce_seq(doc, True)
        clean_docs.append(doc)
    else:
        waste.append(doc)
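
When inspecting the waste, it can help to know which check rejected each document. The following is a rough sketch of that idea, not part of the original script: `triage` is a hypothetical helper, and the two lambda predicates are toy stand-ins for `lang_id` and `letter_ratio` so that the example runs without the fasttext model.

```python
def triage(docs, checks):
    """Split docs into kept documents and (reason, document) rejections.

    `checks` is a list of (name, predicate) pairs; a document is rejected
    by the first predicate that returns False.
    """
    kept, rejected = [], []
    for doc in docs:
        failed = next((name for name, ok in checks if not ok(doc)), None)
        if failed is None:
            kept.append(doc)
        else:
            rejected.append((failed, doc))
    return kept, rejected


# Toy predicates standing in for lang_id and letter_ratio.
checks = [
    ("language", lambda d: "och" in d.split()),
    ("letter_ratio", lambda d: sum(c.isalpha() for c in d) > len(d) / 2),
]
kept, rejected = triage(["Katt och hund", "och 1234567890", "lll !"], checks)
print(kept)      # ['Katt och hund']
print(rejected)  # [('letter_ratio', 'och 1234567890'), ('language', 'lll !')]
```

In the notebook itself, the list would be `[("language", lang_id), ("letter_ratio", letter_ratio)]`.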

This is an example of the kind of documents that are removed by these heuristics. They definitely look like junk!

In [11]:
waste[:50]

['Ring 08 / 5015 60 för annonser under denna rubrik lllll !\nllll !\nllllllll !\nllll !\nllilll !\nllilllillliill ] l !\nlll !',
 'T « rp ?\nv5t » r : « \' s sknrtter it C & s ?\nt » tfof ÄftsaMwSeö och FolUpdi . ita8 " Vi 3 » ia « !\nt opp int dst togs för IV 2 : » teta « 9ia : t HflM * s - ttlckoma t « f ?',
 "Genoasen unc CcnowintiPn iJic ii He » ! ulw & gt ; n i !\nif Soel - irt « l » r Uf » I Hel Ku .\nWf dri I * n . ciA « 1 . t ur .\nK » t . ti - r tVaUl » & gt ; U . U ' fl litftf Cf . t .",
 "lll | ll !\nllli !\nIIIIIIIIIIIHUII ' llinillllli !\nii !",
 'HII !\nl !\nli !\ni !\nl !',
 "COMANCHE MOON ( DE ) h 6 år Joakim Lövgren : 4 U 14 . 1m 890 . 621 Löv Jo J 21 / 11 - 9 9 / 2140 d Ilag ti 18 , 68 70 ' 18 : 17 3 - 3 - 4 Löv Jo A 5 / 12 - 9 11 / 2140 3 14 , 0a cc 14 , 10 125 ' ia . n n _ n _ n - - - - * , 19 : 0 .\nDe här tvä utgör ett bra lås .\nSer ut som att det är de två som gör upp om det .\nOm ingen annan kan överraska ?",
 'Strukna , V5 - 1 : nr 2 och 5 .\nV5 - 2 : nr 8 och

In [12]:
with open("/data/home/elena/Workspace/corpus_cleaning/removed.txt", "w") as f:
    for doc in waste:
        f.write(doc+"\n\n")
        
with open("/data/home/elena/Workspace/corpus_cleaning/clean_text.txt", "w") as f:
    for doc in clean_docs:
        f.write(doc+"\n\n")