The aim is to become familiar with regular expressions and how they are implemented in Python.

RegEx is widely used because it is very well suited to text processing: it can perform advanced searches. RegEx is used in search engines and in search-and-replace functions. Working with RegEx is definitely an experience of its own, but once you see the range of tasks it can solve, you realize that it is an incredibly good tool.

This notebook does not try to teach you everything about RegEx. It is an introduction, and only a few of the possibilities are illustrated below.

Besides RegEx, this notebook contains many loops, so you can also see how to write those.

In [2]:
# import libraries
import re
from pathlib import Path
import os

In [8]:
# Path to the data folder
input_dir = Path.cwd() / '../data/txt_files/grimm'
# Build a list of file names with os.listdir()
files = os.listdir(input_dir)

# Empty list for the texts
texts = []
# Loop over the file list and store each text in the list 'texts'
for i in files:
    with open(input_dir / i, 'r', encoding='utf-8-sig') as f:
        text = f.read()
        texts.append(text)

In [9]:
texts[0]

'ASHPUTTEL\n\n\n\n\n\nThe wife of a rich man fell sick; and when she felt that her end drew\n\nnigh, she called her only daughter to her bed-side, and said, ‘Always be\n\na good girl, and I will look down from heaven and watch over you.’ Soon\n\nafterwards she shut her eyes and died, and was buried in the garden;\n\nand the little girl went every day to her grave and wept, and was always\n\ngood and kind to all about her. And the snow fell and spread a beautiful\n\nwhite covering over the grave; but by the time the spring came, and the\n\nsun had melted it away again, her father had married another wife. This\n\nnew wife had two daughters of her own, that she brought home with her;\n\nthey were fair in face but foul at heart, and it was now a sorry time\n\nfor the poor little girl. ‘What does the good-for-nothing want in the\n\nparlour?’ said they; ‘they who would eat bread should first earn it;\n\naway with the kitchen-maid!’ Then they took away her fine clothes, and\n\ngave her an ol

# Clean the text
Text can be cleaned in several ways; the method below is one of them.

We use the re module, which was imported above (import re).

The RegEx pattern is '\b\S+\b'.

\b : \b matches the position at a word boundary.

\S : \S matches any non-whitespace character.

\+ : + matches the preceding token one or more times, as many times as possible. The plus is said to be greedy.

\b : \b matches the position at a word boundary.

Applied to a text, the pattern drops punctuation at the start and end of each whitespace-separated token but keeps punctuation inside a token, such as the hyphen in bed-side.
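To make the effect of the pattern concrete, here is a minimal sketch on a single made-up sentence. The sentence and the variable names `sample`, `tokens` and `cleaned` are assumptions for illustration and do not come from the data files.

```python
import re

# Hypothetical sample sentence, not taken from the Grimm files
sample = "‘Always be a good girl,’ she said; the kitchen-maid wept."

# Lowercase first, then keep each run of non-whitespace characters
# that starts and ends at a word boundary
tokens = re.findall(r'\b\S+\b', sample.lower())

# Join the tokens back into one string separated by single spaces
cleaned = ' '.join(tokens)

print(tokens)
print(cleaned)  # always be a good girl she said the kitchen-maid wept
```

The quotation marks, comma, semicolon and full stop disappear because they sit outside the word boundaries, while the hyphen in kitchen-maid survives because it sits between two word characters.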

In [10]:
clean_texts = []
for text in texts:
    text_lower_string = text.lower()
    # re.findall returns a list of words
    text_clean_list = re.findall(r'\b\S+\b', text_lower_string)
    # ' '.join combines the word list into a single string
    text = ' '.join(text_clean_list)
    # append adds the string to the list clean_texts
    clean_texts.append(text) 

In [11]:
clean_texts[0]

'ashputtel the wife of a rich man fell sick and when she felt that her end drew nigh she called her only daughter to her bed-side and said always be a good girl and i will look down from heaven and watch over you soon afterwards she shut her eyes and died and was buried in the garden and the little girl went every day to her grave and wept and was always good and kind to all about her and the snow fell and spread a beautiful white covering over the grave but by the time the spring came and the sun had melted it away again her father had married another wife this new wife had two daughters of her own that she brought home with her they were fair in face but foul at heart and it was now a sorry time for the poor little girl what does the good-for-nothing want in the parlour said they they who would eat bread should first earn it away with the kitchen-maid then they took away her fine clothes and gave her an old grey frock to put on and laughed at her and turned her into the kitchen there

# Comparisons
In literature, comparisons are often used to make a point clearer by creating a mental image of what is being described. Comparisons also make the text more vivid and engaging.

With regex, extracting examples of comparisons from Grimms' fairy tales becomes a manageable task, because we can find text strings that follow the pattern of a typical comparison.

We can illustrate it as follows. We are looking for phrases of the form "as a ...".

The RegEx pattern can be written as:

'as\sa\s\w+'

The letters 'as' are followed by \s, which matches a whitespace character, then by 'a', then by another \s, and finally by \w, which matches a word character, and +, which means "one or more of the preceding token".

Two limitations are worth knowing. The pattern does not match "as an ..."; making the n optional, as in 'as\san?\s\w+', would cover both forms. The pattern also has no \b before 'as', so it matches inside words as well, for example at the end of "was a little".
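The following sketch compares the pattern from the text with a stricter variant on one made-up sentence. The stricter pattern, the sentence and the names `loose` and `strict` are assumptions for illustration; the notebook itself uses only the first pattern.

```python
import re

# Hypothetical sample sentence
sample = "he was a little tired, as white as a sheet and as cold as an icicle"

# Pattern from the text: also matches the 'as' at the end of 'was'
loose = re.findall(r'as\sa\s\w+', sample)

# Assumed refinement: \b requires 'as' to start a word, n? also allows 'an'
strict = re.findall(r'\bas\san?\s\w+', sample)

print(loose)   # ['as a little', 'as a sheet']
print(strict)  # ['as a sheet', 'as an icicle']
```

The hit 'as a little' comes from "was a little" and is not a comparison, which suggests that some of the results in the output of the notebook cell may have the same origin.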

In [14]:
comparisons = []
for text in clean_texts:
    comparison = re.findall(r'as\sa\s\w+', text)
    comparisons.append(comparison)
    
comparisons

[[],
 ['as a narrow', 'as a golden'],
 ['as a white'],
 [],
 ['as a jingling'],
 [],
 [],
 ['as a cock'],
 ['as a sack', 'as a road'],
 ['as a chandelier'],
 ['as a countryman'],
 ['as a little', 'as a feast'],
 ['as a white'],
 ['as a beautiful', 'as a stone', 'as a large', 'as a costly'],
 ['as a tub'],
 ['as a garden', 'as a crack', 'as a feast'],
 ['as a dear', 'as a great'],
 ['as a widow', 'as a little', 'as a thousand'],
 ['as a mouse'],
 ['as a little'],
 ['as a great'],
 ['as a garden',
  'as a pleasure',
  'as a rose',
  'as a poor',
  'as a bear',
  'as a savage'],
 ['as a peasant', 'as a king'],
 ['as a real', 'as a magical'],
 ['as a lovely'],
 ['as a few'],
 [],
 [],
 ['as a parlour',
  'as a little',
  'as a courtyard',
  'as a garden',
  'as a park',
  'as a little'],
 ['as a great', 'as a reward'],
 [],
 [],
 ['as a gentle'],
 ['as a poor'],
 ['as a man', 'as a fine', 'as a goose'],
 ['as a daughter', 'as a good'],
 ['as a court', 'as a boy', 'as a burning', 'as a roar

Zipping the file names with the comparisons gives a list of pairs, so we can see each fairy tale together with its comparisons.
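As a small sketch of the pairing step, the example below zips two short made-up lists and then drops the tales without any hits. The names `files_demo`, `comparisons_demo`, `pairs` and `non_empty` are hypothetical and stand in for the real `files` and `comparisons` lists.

```python
# Hypothetical stand-ins for the lists 'files' and 'comparisons'
files_demo = ['A.txt', 'B.txt', 'C.txt']
comparisons_demo = [[], ['as a stone'], ['as a rose', 'as a king']]

# zip pairs the items by position: first with first, second with second
pairs = list(zip(files_demo, comparisons_demo))

# Optional: keep only the tales where at least one comparison was found
non_empty = {name: found for name, found in pairs if found}

print(pairs)
print(non_empty)  # {'B.txt': ['as a stone'], 'C.txt': ['as a rose', 'as a king']}
```

The pairing relies on both lists being in the same order, which holds here because `comparisons` is built by looping over the texts in file order.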

In [18]:
list(zip(files,comparisons))

[('ASHPUTTEL.txt', []),
 ('BRIAR ROSE.txt', ['as a narrow', 'as a golden']),
 ('CAT AND MOUSE IN PARTNERSHIP.txt', ['as a white']),
 ('CAT-SKIN.txt', []),
 ('CLEVER ELSIE.txt', ['as a jingling']),
 ('CLEVER GRETEL.txt', []),
 ('CLEVER HANS.txt', []),
 ('DOCTOR KNOWALL.txt', ['as a cock']),
 ('FREDERICK AND CATHERINE.txt', ['as a sack', 'as a road']),
 ('FUNDEVOGEL.txt', ['as a chandelier']),
 ('HANS IN LUCK.txt', ['as a countryman']),
 ('HANSEL AND GRETEL.txt', ['as a little', 'as a feast']),
 ('IRON HANS.txt', ['as a white']),
 ('JORINDA AND JORINDEL.txt',
  ['as a beautiful', 'as a stone', 'as a large', 'as a costly']),
 ('KING GRISLY-BEARD.txt', ['as a tub']),
 ('LILY AND THE LION.txt', ['as a garden', 'as a crack', 'as a feast']),
 ('LITTLE RED-CAP [LITTLE RED RIDING HOOD].txt', ['as a dear', 'as a great']),
 ('MOTHER HOLLE.txt', ['as a widow', 'as a little', 'as a thousand']),
 ('OLD SULTAN.txt', ['as a mouse']),
 ('RAPUNZEL.txt', ['as a little']),
 ('RUMPELSTILTSKIN.txt', ['as a 

# Find a text excerpt based on a keyword and an interval
We want to find the word 'king' and words related to it, and we need some context, because we want to look into the text and see exactly how 'king' is used.

For this we use ., which matches any character except a newline, and {0,30}, which repeats it between 0 and 30 times, so we get up to 30 characters before the letters 'king'. The \b in front of 'king' ensures that we only find words that begin with 'king' and not words that merely contain it, such as 'looking'. After 'king', another .{0,30} gives us up to 30 more characters.
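The sketch below shows the role of \b and of the context window on one made-up sentence. The sentence, the shorter window of 10 characters and the names `anywhere`, `at_start` and `windows` are assumptions chosen to keep the output readable.

```python
import re

# Hypothetical sample sentence
sample = "she was looking for the king of the land who ruled the whole kingdom"

# Without \b: every word that contains 'king' anywhere
anywhere = re.findall(r'\w*king\w*', sample)

# With \b: only words that begin with 'king'
at_start = re.findall(r'\bking\w*', sample)

# Context window: up to 10 characters on each side of 'king'
windows = re.findall(r'.{0,10}\bking.{0,10}', sample)

print(anywhere)  # ['looking', 'king', 'kingdom']
print(at_start)  # ['king', 'kingdom']
print(windows)   # ['g for the king of the la', 'the whole kingdom']
```

The second window is shorter on the right because the string ends after 'kingdom'; the lower bound of 0 in {0,10} is what allows a match there at all.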

In [21]:
re.findall(r'.{0,30}\bking.{0,30}', clean_texts[20])

[' her that he one day told the king of the land who used to come ',
 'in gold out of straw now this king was very fond of money and wh',
 's all spun into gold when the king came and saw this he was grea',
 'orning all was done again the king was greatly delighted to see ',
 'e spun the heap into gold the king came in the morning and findi',
 'ive him all the wealth of the kingdom if he would let her off bu']

Put the regex pattern in a loop to get an overview of how the word is used.

In [23]:
contexts1 = []
for text in clean_texts:
    context = re.findall(r'.{0,30}\bking.{0,30}', text)
    contexts1.append(context)

list(zip(files,contexts1))

[('ASHPUTTEL.txt',
  [' for now it happened that the king of that land held a feast whi',
   ' we are going to dance at the king’s feast then she did as she w',
   ' safe at home in the dirt the king’s son soon came up to her and',
   'she wanted to go home and the king’s son said i shall go and tak',
   'ondered at her beauty but the king’s son who was waiting for her',
   'she wanted to go home and the king’s son followed here as before',
   't without being seen then the king’s son lost sight of her and c',
   ' wonder at her beauty and the king’s son danced with nobody but ',
   'she wanted to go home and the king’s son would go with her and s',
   ' and went the next day to the king his father and said i will ta',
   'd on the shoe and went to the king’s son then he took her for hi',
   'lood came and took her to the king’s son and he set her as his b']),
 ('BRIAR ROSE.txt',
  ['briar rose a king and queen once upon a time re',
   'n those days fairies now this king and queen had 

# Treasure hunt for proper nouns
Find the words that begin with a capital letter but never occur in lowercase.

Many capitalized words are capitalized only because they follow a period, so they are not what I would call "genuine" capitalized words.

To filter these "non-genuine" words out of the list, we can loop over them and check whether each word also occurs in lowercase elsewhere in the texts. If it does, it is not genuine.

Specifically, we loop through the list of words with a capital first letter. We convert each word to lowercase with .lower(), and if the lowercase form does not occur anywhere in the texts, we add the word to our new list of capitalized words.

Note: we combine all the texts in the list texts into a single string with ' '.join(), which places a space between the texts. The in test is a substring test, so a word also counts as found in lowercase when it only occurs inside a longer word.
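The filter can be sketched on a few made-up sentences. The second variant, which compares against a set of whole lowercase words instead of searching the joined string, is an assumed alternative and not the method used in the notebook; all names and sample sentences here are hypothetical.

```python
import re

# Hypothetical sample texts
sample_texts = [
    "Then the king spoke, and then Gretel listened.",
    "Soon Hans came home, and soon Gretel swept the ashes.",
    "Ash laughed.",
]
all_text = ' '.join(sample_texts)

# Candidates: a capital letter followed by one or more word characters
candidates = re.findall(r'[A-Z]\w+', all_text)

# Approach from the text: substring test against the joined string
by_substring = sorted({w for w in candidates if w.lower() not in all_text})

# Assumed alternative: compare against whole words that start in lowercase
lower_words = set(re.findall(r'\b[a-z]\w*', all_text))
by_word = sorted({w for w in candidates if w.lower() not in lower_words})

print(by_substring)  # ['Gretel', 'Hans']
print(by_word)       # ['Ash', 'Gretel', 'Hans']
```

The substring test drops 'Ash' because 'ash' occurs inside 'ashes', while the whole-word test keeps it. Which behaviour is preferable depends on the texts; the set lookup is also faster on large collections.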

In [27]:
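upper_case = []
# Join all texts once so the combined string is not rebuilt for every word
all_text = ' '.join(texts)
for text in texts:
    # Words that start with a capital letter followed by one or more word characters
    upper_case_words = re.findall(r'[A-Z]\w+', text)
    for word in upper_case_words:
        # Keep the word only if its lowercase form appears nowhere in the texts
        if word.lower() not in all_text:
            upper_case.append(word)
set(upper_case)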

{'ADVENTURES',
 'ASHPUTTEL',
 'Aha',
 'Ashputtel',
 'BANDY',
 'BENJAMIN',
 'Bewailing',
 'Blackbird',
 'Bravo',
 'CATHERINE',
 'CHANTICLEER',
 'CROOK',
 'Catherine',
 'Caw',
 'Chanticleer',
 'Christendom',
 'Christmas',
 'Cinderella',
 'Coxcomb',
 'Crabb',
 'Curdken',
 'Dobbin',
 'Dummling',
 'ELSIE',
 'Elsie',
 'FREDERICK',
 'FUNDEVOGEL',
 'Falada',
 'Finally',
 'Fourthly',
 'Frederick',
 'Fundevogel',
 'GRETEL',
 'GRISLY',
 'German',
 'Gothel',
 'Grete',
 'Gretel',
 'Grisly',
 'Growler',
 'HANS',
 'HANSEL',
 'HOLLE',
 'HUNCHBACK',
 'Hans',
 'Hansel',
 'Hark',
 'Hearken',
 'Heinel',
 'Heinrich',
 'Holiness',
 'Holle',
 'Hullo',
 'Hurrah',
 'ICHABOD',
 'Ilsabill',
 'Influenced',
 'JEMMY',
 'JEREMIAH',
 'JOHN',
 'JORINDA',
 'JORINDEL',
 'Jip',
 'Jorinda',
 'Jorindel',
 'KNOWALL',
 'KORBES',
 'Kate',
 'Kehrewit',
 'Knowall',
 'Korbes',
 'Kywitt',
 'LANGUAGES',
 'MRS',
 'Marleen',
 'Mortal',
 'Mrs',
 'Oho',
 'PARTLET',
 'Partlet',
 'Prithee',
 'ROLAND',
 'RUMPELSTILTSKIN',
 'RUMPLESTILTSK

# Find text excerpts based on two keywords and an interval
The last example finds text excerpts that lie between two selected words and are no longer than a chosen length.

This can be relevant, for example, if you want to identify passages where two important characters or concepts appear near each other.

What is new here is the question mark after the plus, which makes the quantifier lazy: .+? matches as few characters as possible, so each match stops at the nearest occurrence of the second word.
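The difference between the greedy and the lazy quantifier can be seen on one made-up sentence. The sentence, the length limit of 20 characters and the names `greedy`, `lazy` and `short` are assumptions for illustration.

```python
import re

# Hypothetical sample sentence
sample = "Gretel called Hans, and Hans called Gretel back before Hans left"

# Greedy: .+ runs as far as possible, so the match ends at the LAST 'Hans'
greedy = re.findall(r'\bGretel.+\bHans\w*', sample)

# Lazy: .+? stops at the NEAREST 'Hans', which gives two separate excerpts
lazy = re.findall(r'\bGretel.+?\bHans\w*', sample)

# Keep only the excerpts that are at most 20 characters long
short = [excerpt for excerpt in lazy if len(excerpt) <= 20]

print(greedy)  # ['Gretel called Hans, and Hans called Gretel back before Hans']
print(lazy)    # ['Gretel called Hans', 'Gretel back before Hans']
print(short)   # ['Gretel called Hans']
```

Note that the length test is applied to each excerpt string, not to the list of excerpts as a whole.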

In [None]:
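contexts2 = []
# Loop over the raw (uncleaned) texts in 'texts', where names keep their capitals
for text in texts:
    # '.' does not match a newline, so each excerpt stays within one line of text
    context = re.findall(r'\bGretel.+?\bHans\w*|\bHans.+?\bGretel\w*', text)
    contexts2.append(context)

# Set a maximum length for each excerpt, from the first word to the second
contexts_within_interval = [
    [excerpt for excerpt in context if len(excerpt) <= 100]
    for context in contexts2
]

list(zip(files, contexts_within_interval))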