# Pahlavi Corpus Builder

Links:
- [extracting MS Word data](https://towardsdatascience.com/how-to-extract-data-from-ms-word-documents-using-python-ed3fbb48c122)
- [navigating MS Word XML data](https://virantha.com/2013/08/16/reading-and-writing-microsoft-word-docx-files-with-python/)

Because some of this code uses the walrus operator, it requires Python 3.8 or later; see this [explanation of how to deal with installation](https://github.com/pickettj/jupyter_notebooks/blob/master/README.md).


In [1]:
from platform import python_version
print(python_version())

3.8.5


In [2]:
import os, zipfile, re, glob, nltk, pickle
import pandas as pd
from bs4 import BeautifulSoup
from collections import defaultdict

# explanation of default dict: https://www.geeksforgeeks.org/defaultdict-in-python/

In [3]:
#set home directory path
hdir = os.path.expanduser('~')

#pahlavi corpus directory
pah_path = hdir + "/Box/Notes/Digital_Humanities/Corpora/pahlavi_corpus/"

#pickle path
pickle_path = hdir + "/Box/Notes/Digital_Humanities/Corpora/pickled_tokenized_cleaned_corpora"

### Glob the corpus

In [4]:
pah_files = glob.glob(os.path.join(pah_path, '*.docx'))

pah_xml_corpus = {}
for longname in pah_files:
    # a .docx file is a zip archive; the main text lives in word/document.xml
    with zipfile.ZipFile(longname) as document:
        txt = document.read('word/document.xml')
    # key each document by its file name without the extension
    short = os.path.splitext(os.path.basename(longname))[0]
    pah_xml_corpus[short] = txt
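
To make the zip step concrete, here is a minimal sketch that builds a stand-in for a `.docx` file in memory and reads `word/document.xml` back out. The XML content and the names `buffer` and `raw_xml` are invented for illustration; real Word files contain far more markup.

```python
import io
import zipfile

# Build a tiny stand-in for a .docx file in memory (hypothetical content).
# A real .docx is a zip archive whose main text lives in word/document.xml.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "word/document.xml",
        "<w:document><w:body><w:p><w:r><w:t>abar</w:t></w:r></w:p></w:body></w:document>",
    )

# Read the raw XML back out of the archive.
buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    raw_xml = archive.read("word/document.xml")

print(type(raw_xml).__name__)  # bytes
print(raw_xml[:12])
```

Note that `read` returns `bytes`, not `str`, so the values stored in `pah_xml_corpus` are raw bytes until they are parsed.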

### Assemble simple corpus divided by MS Word paragraph breaks

This [new version](https://github.com/pickettj/pahlavi_digital_projects/issues/3) takes advantage of the "[walrus operator](https://medium.com/better-programming/what-is-the-walrus-operator-in-python-5846eaeb9d95#:~:text=Nov%2010%2C%202019%20%C2%B7%202%20min,would%20utilize%20a%20similar%20statement.)," which allows the "assignment and return of a value on the same expression."

Essentially, the walrus operator [takes something like this](https://realpython.com/lessons/assignment-expressions/):

```python
walrus = False
print (walrus)
```

And consolidates it into this:

```python
print (walrus := False)
```

Here's a version of the code below as a single expression:

```python
pahlavi_corpus = {
    name: [
        t for p in BeautifulSoup(src).find_all("w:p") if len(t := p.get_text()) > 0
    ] for name, src in pah_xml_corpus.items()
}

```
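
As a smaller illustration of why the walrus operator helps inside a comprehension, the sketch below filters plain strings. The list `raw_paragraphs` is made up, and `strip()` stands in for `get_text()` as the call whose result is needed both in the filter and in the output.

```python
# Hypothetical paragraph texts, standing in for p.get_text() results.
raw_paragraphs = ["  abar  ", "", "   ", "ud ka"]

# Without the walrus operator, strip() is called twice per kept item.
twice = [p.strip() for p in raw_paragraphs if len(p.strip()) > 0]

# With it, the stripped value is bound to `t` in the filter and reused.
once = [t for p in raw_paragraphs if len(t := p.strip()) > 0]

print(once)  # ['abar', 'ud ka']
```

Both lists are identical; the second form simply avoids computing the same value twice.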


For posterity, previous version:

```python
pahlavi_corpus = {}
for work in pah_xml_corpus:
    tree = BeautifulSoup(pah_xml_corpus[work])
    paras = tree.find_all("w:p")
    document = {}
    for i in range(len(paras)):
        if len(paras[i].get_text()) > 0:
            document[i] = paras[i].get_text()
    pahlavi_corpus[work] = document
```

In [5]:
pahlavi_corpus = {}
for name, src in pah_xml_corpus.items():
    tree = BeautifulSoup(src)
    paras = tree.find_all("w:p")
    document = [t for p in paras if len(t := p.get_text()) > 0]
    pahlavi_corpus[name] = document
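
For comparison, the same paragraph split can be sketched with only the standard library. This is a swapped-in technique, not what the notebook uses: it relies on `xml.etree.ElementTree` instead of BeautifulSoup, and the sample XML is invented. It assumes the usual WordprocessingML namespace URI for the `w:` prefix.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the "w:" prefix in Word's document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical document: two text paragraphs around an empty one.
sample_xml = (
    f'<w:document xmlns:w="{W}"><w:body>'
    "<w:p><w:r><w:t>pad nām</w:t></w:r><w:r><w:t> ī yazdān</w:t></w:r></w:p>"
    "<w:p/>"
    "<w:p><w:r><w:t>ud ka</w:t></w:r></w:p>"
    "</w:body></w:document>"
)

root = ET.fromstring(sample_xml)

# Join the text of every run inside each w:p and drop empty paragraphs.
paragraphs = [
    t for p in root.iter(f"{{{W}}}p") if len(t := "".join(p.itertext())) > 0
]

print(paragraphs)  # ['pad nām ī yazdān', 'ud ka']
```

A paragraph in Word is often split into several runs, which is why the text of all child elements is joined before the emptiness check.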

In [6]:
# Example:
pahlavi_corpus["nĪrang ī āb"]

['nĪrang ī āb',
 'Also WD.89',
 '[ML / 4050 / 303v]',
 '[TUL 11263_294r]',
 '[DZ 4010_ 292r]',
 '[Nik 4040_ 294v]',
 ' Nīrang ī āb ud pādyāb yaštan',
 'fradom kār ēn kū ōy kū āb ud  pādyāb // kunēd naxust xwēš-tan pad baršnūm bē abāyēd šustan ',
 'ud ka-š // 3 3 \\\\ 3 šabag dāšt bawēd āb pad karbās  ī pad-pādyāb pālūdan ud pad  ǰāmag ī // \\\\ pad-pādyāb andar kunišn ',
 'gōmēz  az gāw ī gušn ka nē ān ī wādag  // šāyēd \\\\ bē kunišn ',
 'ud andar ǰāmag ī pad-pādy<āb  xūbīhā andar kunišn ',
 'u-š sar  // \\\\ bē nihumbišn ud az xrafstar ud abārīg  rēmanīh pad pahrēz dārišn ',
 'ud aw>ēšān kē āb  // \\\\ ud g[ōmēz yazēnd yaštan tan p<ad baršnūm bē šōyišn ',
 'ka 3 3 3 > šabag xub [dāšt // ēg-išān 30 <gām frāz kunišn ',
 'yašt ī 3 paywa \\\\ nd abāg kas ī h[u-xēmtar ī awest<wārtar ī // rāst>-Abestāg \\\\ tar (narm-Abestāg-tar) xūb-n[ērangtar ud dēn-āgāhtar [ML_304r] bē < kunišn // ',
 'ud ān kas kē zōdīh k>unēd šab<[īg kustīg nōg ud pad-pādyāb < ōy-iz kē rāspīg // ud srōš>-barišnīh kun\

### Extracting the Line Numbers

In [7]:
# currently does not work for works that lack line numbers, e.g. nĪrang ī āb

num_pattern = re.compile(r'^(.*(?:\.[0-9]{1,3}){1,3})?(.*)')

pahlavi_corpus_lines = {}

for name, doc in pahlavi_corpus.items():
    match = [ret.groups() for text in doc if (ret := num_pattern.match(text)) is not None]

    if all(num is None for num, text in match):
        # this doc doesn't use line numbers, make our own
        pahlavi_corpus_lines[name] = {str(i+1): text for i, (_, text) in enumerate(match)}
        continue

    segment = {}
    para, line = None, None

    for i, (num, text) in enumerate(match+[('-end-', '')]):
        if num is not None:
            if i > 0:
                store = '-start-' if para is None else para
                segment[store] = line
            para, line = num, None

        if line is None:
            line = text
        else:
            line += '\n' + text

    pahlavi_corpus_lines[name] = segment
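
To see what `num_pattern` captures, the sketch below runs it on a few hypothetical paragraphs. Only `'Also WD.89'` comes from the sample output above; the other strings and the siglum `AWN` are invented for illustration.

```python
import re

# Same pattern as in the cell above.
num_pattern = re.compile(r'^(.*(?:\.[0-9]{1,3}){1,3})?(.*)')

# Hypothetical paragraphs; only 'Also WD.89' appears in the corpus sample.
samples = ["1.2 pad nām", "AWN.1.2 ud ka", "pad nām ī yazdān", "Also WD.89"]

results = [num_pattern.match(text).groups() for text in samples]
for num, rest in results:
    print(repr(num), repr(rest))

# '1.2' ' pad nām'
# 'AWN.1.2' ' ud ka'
# None 'pad nām ī yazdān'
# 'Also WD.89' ''
```

The last case shows why works without line numbers are a problem: any paragraph containing a dot followed by digits, such as a cross-reference, is treated as a line number, so the "no line numbers" branch is never reached for that work.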
    

In [15]:
#pahlavi_corpus_lines.keys()
#pahlavi_corpus_lines["Dēnkard 4"]

### Flat Indexing

In [8]:
# list of tuples

#doc = pahlavi_corpus_lines["ARDĀ WIRĀZ"]
#sum([[(ln, pos, tok) for pos, tok in enumerate(line.split())] for ln, line in doc.items()], [])

# any advantage to using nltk.word_tokenize() instead of split()?

# one (title, line, index, token) tuple per whitespace-separated token
pahlavi_flat_corpus = [
    (work, ln, pos, tok)
    for work, doc in pahlavi_corpus_lines.items()
    for ln, line in doc.items()
    for pos, tok in enumerate(line.split())
]
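
The shape of the flat index is easier to see on a toy input. In this sketch `toy_lines` is a made-up miniature of `pahlavi_corpus_lines` (work title, then line identifier, then text), and each token becomes one `(work, line, position, token)` tuple.

```python
# Hypothetical miniature of pahlavi_corpus_lines: {work: {line_id: text}}.
toy_lines = {
    "work A": {"1.1": "pad nām ī yazdān", "1.2": "ud ka"},
    "work B": {"1": "abar"},
}

# One tuple per whitespace-separated token, in document order.
toy_flat = [
    (work, ln, pos, tok)
    for work, doc in toy_lines.items()
    for ln, line in doc.items()
    for pos, tok in enumerate(line.split())
]

for row in toy_flat:
    print(row)
```

Each row maps directly onto the `title`, `line`, `index`, and `token` columns used when the list is passed to a dataframe.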

In [17]:
#pahlavi_flat_corpus

### Pass to Dataframe

In [9]:
# pass to dataframe: pd.DataFrame([(1,2,3), (2,3,4)], columns=['a', 'b', 'c'])

#pd.DataFrame(pahlavi_flat_corpus, columns=['title', 'line', 'index', 'token'])
pd.DataFrame(pahlavi_flat_corpus, columns=['title', 'line', 'index', 'token']).to_csv(os.path.join(pickle_path,r'pahlavi_corpus.csv'), index=False)