# Lab 1 - regular expressions

The task description is available [here](README.md); the original is at https://github.com/apohllo/pjn/blob/master/1-regexp.md

**Solutions by**: Marcin Przewięźlikowski

https://github.com/mprzewie/nlp_course

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
import json
import regex 
from pprint import pprint

# Regular expressions (aka regexps)

## Task

A dataset containing texts of Polish statutory law is available at [http://apohllo.pl/text/ustawy.tar.gz](http://apohllo.pl/text/ustawy.tar.gz).

In [None]:
DATA_DIR = Path("../data")
if not DATA_DIR.exists():
    download_url = "http://apohllo.pl/text/ustawy.tar.gz" 
    !curl -O {download_url}
    !mkdir {DATA_DIR}
    !tar -xvzf ustawy.tar.gz -C {DATA_DIR}
    !rm ustawy.tar.gz

In [None]:
files = [f for f in DATA_DIR.rglob("**/*") if f.is_file()]
len(files)

It contains texts of Polish bills, e.g.:

```
Tekst ustawy przyjęty przez Senat bez poprawek
 
USTAWA
z
dnia 8 listopada 2013 r.
 
o
zmianie niektórych ustaw w związku z realizacją ustawy budżetowej[1])
 
Art.
1. 
W
ustawie z dnia 4 marca 1994 r. o zakładowym funduszu świadczeń socjalnych (Dz. U.
z 2012 r. poz. 592, z późn. zm.[2]))
po art. 5b dodaje się art. 5c w brzmieniu:
„Art. 5c. W 2014 r. przez
przeciętne wynagrodzenie miesięczne w gospodarce narodowej, o którym mowa w art.
5 ust. 2, należy rozumieć przeciętne wynagrodzenie miesięczne w gospodarce narodowej
w drugim półroczu 2010 r. ogłoszone przez Prezesa Głównego Urzędu Statystycznego
na podstawie art. 5 ust. 7.”.
```

The task is to:

Find all external references to bills, e.g. **ustawie z dnia 
   4 marca 1994 r. o zakładowym funduszu świadczeń socjalnych (Dz. U.  z 2012 r. poz. 592)**.
   The result should be aggregated by bill ID (year and position) and sorted by reference count in
   descending order. The reference format should include:
   * the title of the regulation (if present)
   * the year of the regulation
   * the number of the Journal of Laws of the Republic of Poland (*Dziennik Ustaw*) - if applicable
   * the position of the regulation
   

In [None]:
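# A Journal of Laws citation block: "Dz. U." followed by one or more
# "(z YYYY r.) Nr N, poz. P" items joined by ",", "i" or "oraz".
regexp1_1 = (
    r"Dz[.]\s*U[.]"
    r"(((\s*z\s*\d{4}\s*r[.])?\s*(Nr\s*(\d+),\s*poz[.]\s*(\d+))\s*(,|i|oraz)?)+)"
)

# One year together with all "Nr N, poz. P" items listed under it.
regexp1_2 = r"(z\s*(\d{4})\s*r[.]\s*((Nr\s*\d+\s*,\s*poz[.]\s*\d+\s*(,|i)?\s*)+))"

# A single (journal number, position) pair.
regexp1_3 = r"Nr\s*(\d+)\s*,\s*poz[.]\s*(\d+)"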

In [None]:
def get_references(text):
    result = []
    journal_references = [jr[0] for jr in regex.findall(regexp1_1, text)]
    for jr in journal_references:
        years_refs = regex.findall(regexp1_2, jr)
        for yr in years_refs:
            year = yr[1]
            refs = regex.findall(regexp1_3, yr[2])
            result.extend([(year, *ref) for ref in refs])
    return result
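
The three-stage extraction above can also be written as a single pass over a citation block. The following is only a sketch of that idea, not the method used in this notebook: `TOKEN` and `parse_citation_block` are hypothetical names, the sample string is made up for illustration, and the standard-library `re` module is used instead of `regex` so the snippet runs without extra dependencies.

```python
import re

# One token is either a year marker ("z 2004 r.") or an item ("Nr 64, poz. 593").
TOKEN = re.compile(
    r"z\s+(?P<year>\d{4})\s*r\."
    r"|Nr\s*(?P<nr>\d+)\s*,\s*poz\.\s*(?P<poz>\d+)"
)


def parse_citation_block(block):
    """Return (year, journal number, position) triples found in one citation block."""
    year = None  # stays None until the first year marker is seen
    triples = []
    for match in TOKEN.finditer(block):
        if match.group("year") is not None:
            # A year marker applies to every item that follows it.
            year = match.group("year")
        else:
            triples.append((year, match.group("nr"), match.group("poz")))
    return triples


# Made-up sample block, for illustration only.
block = "Dz. U. z 2004 r. Nr 64, poz. 593 i Nr 116, poz. 1203 oraz z 2005 r. Nr 10, poz. 71"
print(parse_citation_block(block))
```

Carrying the most recent year forward avoids the nested `findall` calls, at the cost of trusting that every item belongs to the year marker that precedes it.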

In [None]:
references = []

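for file in files:
    # File names contain the bill's own year and number in the form YYYY_N.
    year, nr = regex.findall(r"(\d{4})_(\d+)", str(file))[0]
    with file.open() as f:
        text = f.read()
    # Citations without an explicit year are assumed to refer to the year of the citing bill.
    text = regex.sub(r"Dz[.]\s*U[.]\s*Nr", f"Dz. U. z {year} r. Nr", text)
    references.extend(get_references(text))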

In [None]:
from collections import Counter

# Counter tallies all references in one pass instead of calling list.count() per reference.
refs_and_counts = list(Counter(references).items())

In [None]:
sorted(refs_and_counts, key=lambda rc: -rc[1])

In [None]:
plt.hist([np.log2(rc[1]) for rc in refs_and_counts], bins=15)
plt.title("Histogram of logarithms of numbers of references to a regulation")
plt.show()

Find all internal references to regulations, e.g.  **art.  5 ust. 2**, **art. 5 ust. 7**, etc. The result should
   exclude the internal numbering of the bill (e.g. **Art. 1.** W ustawie ...).
   The result should be aggregated by regulation ID (as described below) and, within each bill, sorted by
   reference count in descending order. The bills should be sorted by descending number of internal references. 
   The reference format should include all elements necessary to identify the regulation, e.g.:
   * art. 1, ust. 2 - if an article inside the regulation is referenced,
   * ust. 2 - if a paragraph inside the same article is referenced,
   * etc.
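
One possible way to approach this task is sketched below. It is a simplified illustration under several assumptions, not a complete solution: only lowercase `art.` and `ust.` are recognised (capitalised `Art.` is treated as the bill's own numbering), finer units such as `pkt` or `lit.` are ignored, and quoted amendment text is not removed first. `INTERNAL_REF` and `internal_references` are hypothetical names, and the sample text is made up.

```python
import re
from collections import Counter

# Either "art. N" optionally followed by "ust. M", or a standalone "ust. M".
# The match is case-sensitive on purpose, so "Art. 1." is skipped.
INTERNAL_REF = re.compile(
    r"\bart\.\s*(\d+[a-z]*)(?:\s*ust\.\s*(\d+[a-z]*))?"
    r"|\bust\.\s*(\d+[a-z]*)"
)


def internal_references(text):
    """Count normalised internal references such as 'art. 5 ust. 2' or 'ust. 3'."""
    refs = []
    for match in INTERNAL_REF.finditer(text):
        art, ust_after_art, ust_alone = match.groups()
        if art is not None:
            ref = f"art. {art}"
            if ust_after_art is not None:
                ref += f" ust. {ust_after_art}"
        else:
            ref = f"ust. {ust_alone}"
        refs.append(ref)
    return Counter(refs)


# Made-up sample, for illustration only.
sample = (
    "Art. 1. Kwota, o której mowa w art. 5 ust. 2, jest ustalana na podstawie "
    "art. 5 ust. 7 oraz ust. 3. Przepis art. 5 ust. 2 stosuje się odpowiednio."
)
counts = internal_references(sample)
print(counts.most_common())
```

`Counter.most_common()` already returns the references sorted by descending count, which is the ordering the task asks for within a single bill.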
   

In [None]:
sc1 = "„"  # opening Polish quotation mark
sc2 = "”"  # closing Polish quotation mark
regex_quotation = rf"{sc1}\s*[^{sc1}]*{sc2}"

In [None]:
def prune_quotation(text):
    regex_quotation = rf"{sc1}\s*[^{sc1}]*{sc2}"
    while True:
        new_text = regex.sub(regex_quotation, "(...)", text)
        if new_text == text:
            break
        text = new_text
    return text
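
The repeat-until-stable idea used in `prune_quotation` can be tried out on a small nested example. The sketch below is a variation rather than the exact function above: it excludes both quote characters from the body, so each pass removes only innermost quotations, and it uses the standard-library `re` module. The names `INNERMOST` and `strip_quotations` and the sample string are assumptions made for illustration.

```python
import re

OPEN, CLOSE = "„", "”"

# A quotation that contains no further quote characters, i.e. an innermost one.
INNERMOST = re.compile(f"{OPEN}[^{OPEN}{CLOSE}]*{CLOSE}")


def strip_quotations(text, placeholder="(...)"):
    """Replace quotations with a placeholder, working from the inside out."""
    while True:
        new_text, replaced = INNERMOST.subn(placeholder, text)
        if replaced == 0:
            return new_text
        text = new_text


print(strip_quotations("a „b „c” d” e"))
```

The first pass turns the inner quotation into the placeholder, and the second pass removes the outer quotation that now contains only plain text.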

In [None]:
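# Inside a character class "|" is a literal, so the endings are written as [eua].
regexp2_1 = r"brzmieni[eua]:\s*.\s*[^”]*”"
regexp2_2 = r"(?<!brzmieni[eua]:\s*.\s*)Art[.]\s*(\d+)"
# The comma between "art. N" and "ust. M" is optional in the bill texts.
regexp2_3 = r"art[.]\s*(\d+)\s*,?\s*ust[.]\s*(\d+)"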

In [None]:
def get_internal_references(text):
    # Assumption: look for "art. N ust. M" references (regexp2_3) outside quoted amendment text.
    return regex.finditer(regexp2_3, prune_quotation(text))

In [None]:
# for file in files[10:20]:
file = files[15]
with file.open() as f:
    text = f.read()

# print([text[r.start()-10:r.end()+10] for r in get_internal_references(text)][0])
prune_quotation(text)

In [None]:
regex.findall(r"brzmienie:\s*.\s*", text)

In [None]:
# regex.split(regexp1, xd)

In [None]:
text.split("Art")[4]

In [None]:
exp = r"[Aa]rt[.]\s*\d+\s*"

[text[m.start():m.end()+10] for m in regex.finditer(exp, text)]


Count all occurrences of the word **ustawa** in all inflected forms (*ustawa*, *ustawie*, *ustawę*, etc.),
   and all spelling forms (*ustawa*, *Ustawa*, *USTAWA*), excluding other words with the same prefix (e.g. *ustawić*).

In [None]:
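# Case-insensitive match of the stem "ustaw" plus a noun ending, delimited by word
# boundaries, so that words such as "ustawić" are not counted.
regulation_regex = r"(?i)\bustaw(?:a|y|ie|ę|ą|o|om|ami|ach)?\b"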

In [None]:
num_regulations = 0

for file in files:
    with file.open() as f:
        text = f.read()
    num_regulations += len(regex.findall(regulation_regex, text))
    
num_regulations
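
A pattern for this task is easy to get subtly wrong, so it helps to check it on a few hand-picked words before running it over the whole corpus. The snippet below is a small sanity check under stated assumptions: the ending list covers the singular and plural noun forms of *ustawa*, the names `USTAWA` and `count_ustawa` and the sample sentence are made up, and the standard-library `re` module is used, whose `\b` is Unicode-aware for `str` patterns in Python 3.

```python
import re

# Stem "ustaw" plus an optional noun ending, delimited by word boundaries.
USTAWA = re.compile(r"\bustaw(?:a|y|ie|ę|ą|o|om|ami|ach)?\b", re.IGNORECASE)


def count_ustawa(text):
    """Count inflected forms of 'ustawa' regardless of letter case."""
    return len(USTAWA.findall(text))


# Made-up sample: three noun forms, plus two words that share the prefix.
sample = "USTAWA z dnia 8 listopada. W ustawie zmienia się ustawę; nie można ustawić ani ustawienia."
print(USTAWA.findall(sample))
print(count_ustawa(sample))
```

The trailing `\b` is what rejects *ustawić* and *ustawienia*: after the stem or any listed ending, the next character is still a letter, so no word boundary is found.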

## Hints

* Some programming languages allow Unicode classes in regular expressions, e.g.
  * `\p{L}` - letters from any alphabet (e.g. a, ą, ć, ü, カ)
  * `\p{Ll}` - small letters from any alphabet
  * `\p{Lu}` - capital letters from any alphabet
* Not all regular expression engines support Unicode classes, e.g. `re` from Python does not.
  Yet you can use the `regex` library (`pip install regex`), which has many more features.
* Regular expressions can include positive and negative lookahead and lookbehind constructions, e.g.
  * *positive lookahead* - `(\w+)(?= has a cat)` will match the string `Ann has a cat`, but the match covers `Ann` only.
  * *negative lookbehind* - `(?<!New )(York)` will match the `York` in `Yorkshire` but not the one in `New York`.
* `\b` matches a word boundary. Regexp `fish` will match `jellyfish`, but `\bfish\b` will only match `fish`.
  In the case of Python you should use either `'\\bfish\\b'` or `r'\bfish\b'`.
* `\b` is dependent on what is understood by "word". For instance in Ruby Polish diacritics are not treated as parts of
  a word, thus `\bpsu\b` will match both `psu` and `psuć`, since `ć` is a non-word letter in Ruby.
* Some languages, e.g. Ruby, support a regexp match operator as well as regexp literals (`=~` and `/fish/` respectively 
  in the case of Ruby and Perl). Notably Python does not support either.
* You should be very careful when copying regexps from the Internet - different languages and even different versions of the
  same language may interpret them differently, so make sure to always test them on a large set of diversified examples.
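
Several of the hints above can be checked directly in Python. The snippet below is a short illustration using the standard-library `re` module with raw strings; the variable names are arbitrary, and the Unicode property classes (`\p{L}` and friends) are left out because they require the third-party `regex` package.

```python
import re

# Positive lookahead: " has a cat" must follow, but it is not part of the match.
lookahead_hit = re.search(r"(\w+)(?= has a cat)", "Ann has a cat").group(1)

# Negative lookbehind: skip "York" when it is directly preceded by "New ".
lookbehind_hits = re.findall(r"(?<!New )York", "New York and Yorkshire")

# Word boundaries: "fish" inside "jellyfish" is not a whole word.
word_hits = re.findall(r"\bfish\b", "jellyfish and fish")

# In Python 3, "ć" counts as a word character for str patterns,
# so "psu" inside "psuć" is not followed by a word boundary.
polish_hits = re.findall(r"\bpsu\b", "psu psuć")

print(lookahead_hit, lookbehind_hits, word_hits, polish_hits)
```

The last case behaves differently from the Ruby example in the hints, which is exactly why patterns should be re-tested when moved between languages.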

In [None]:
import re

s = "12-34-56-78-89-10111"
# A repeated capturing group keeps only its last repetition.
x = re.match(r"(\d{1,2}-){2}(\d{1,2}-){2}", s)
x.groups()