# Clean incipits
---
To prepare for the `incipit` -> `canonical story id` mapping, what essential information should we pull from the existing CSVs? Here we drop the columns we do not need and the empty (NaN) rows, then strip English notes and punctuation from the incipits. The results are written to `stripped_incipits.csv` and `stripped_incipits.txt` for later analysis.

In [8]:
import pandas as pd
import numpy as np
import string
import re

In [9]:
df = pd.read_csv('data/story_instance.csv')
df.columns

Index(['Manuscript', 'Canonical Story ID', 'Canonical Story Title',
       'Folio Start', 'Column Start', 'Line Start', 'Folio End', 'Column End',
       'Line End', 'Miracle Number', 'Number of Paintings', 'Incipit',
       'Macomber Incipit', 'Confidence Score', 'Notes',
       'Best Incipit Tool Match', 'Story Incomplete', 'Blank TM folios',
       'Ethiopic Story Number', 'Story Variation', 'High Confidence Not IT',
       'Princeton Catalog Folios', 'Princeton Catalog Titles',
       'Body of story start folio & line', 'Macomber Incipit.1',
       '(test on whether there are two incipits in the ITool on the same folio)',
       'Test for whether the incipit is not unique',
       'New mss (column for sorting)', 'Miracles sequence number',
       'Folio Start Number', 'Folio Start Letter',
       'Temporary English Translation for TGS 1994, to be moved when ID'd'],
      dtype='object')

In [10]:
# useful columns
useful_columns = ['Canonical Story ID', 'Incipit']
df = df[useful_columns]
# useful rows: drop only the rows where both the ID and the incipit are missing
df = df.dropna(how="all", axis=0)
# quick renaming
df.columns = ['canon_id', 'incipit']

In [11]:
# collect English letters, ASCII punctuation, the ellipsis, and Ethiopic punctuation
remove = set(string.ascii_lowercase)
remove.update(string.punctuation)
remove.update(['…', '፡', '።', '፨'])
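The list above names three Ethiopic marks explicitly. Whether the incipits contain other marks (for example the Ethiopic comma or semicolon) is not checked here. As a rough sketch, assuming the marks of interest sit in the code point range U+1360 to U+1368, the standard library's `unicodedata` module can enumerate them by Unicode category; `ethiopic_punct` is a hypothetical name, not part of the notebook.

```python
import unicodedata

# Assumption: U+1360..U+1368 covers the Ethiopic punctuation marks of interest
ethiopic_punct = {
    chr(cp) for cp in range(0x1360, 0x1369)
    if unicodedata.category(chr(cp)).startswith('P')  # keep punctuation categories only
}

# Print each mark's code point and official Unicode name for inspection
for mark in sorted(ethiopic_punct):
    print(f'U+{ord(mark):04X}', unicodedata.name(mark))
```

The resulting set could be merged into `remove` if those extra marks turn out to appear in the data.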

In [12]:
def clean_incipit(incipit):
    if not pd.isna(incipit):
        # drop punctuation and English letters, then trim leading/trailing whitespace
        stripped = ''.join([c for c in incipit if c.lower() not in remove]).strip()
        stripped = re.sub(r'[ ]+', ' ', stripped) # collapse runs of spaces into a single space
        cleaned = np.nan if stripped == '' else stripped
    else:
        cleaned = np.nan
    return cleaned

df['clean_incipit'] = df['incipit'].apply(clean_incipit)
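To see the cleaning step in isolation, here is a small stand-alone sketch. `REMOVE` and `strip_incipit` are hypothetical names that mirror the cell above without depending on the notebook state, and the sample string is made up for illustration.

```python
import re
import string

# Stand-alone copy of the removal characters, kept in a set for fast lookups
REMOVE = set(string.ascii_lowercase) | set(string.punctuation) | {'…', '፡', '።', '፨'}

def strip_incipit(text):
    """Drop Latin letters and punctuation; return None if nothing is left."""
    kept = ''.join(c for c in text if c.lower() not in REMOVE)
    kept = re.sub(r' +', ' ', kept).strip()  # collapse runs of spaces, then trim
    return kept or None

# Made-up sample: Ethiopic words followed by an English note
sample = 'ተብህለ፡ ከመ፡ ሀሎ፡ ፩ደብር፡ (folio damaged) ...'
print(strip_incipit(sample))
print(strip_incipit('see note.'))
```

Note that the Ethiopic numeral `፩` is kept, since it is neither a Latin letter nor one of the listed punctuation marks.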

In [13]:
df.head()

,canon_id,incipit,clean_incipit
1,207,ንወጥን፡ በረድኤተ፡ እግዚአብሔር፡ ወበአኰቴተ፡ ስብሐቲሁ፨ ወንዜንወክሙ፡ ...,ንወጥን በረድኤተ እግዚአብሔር ወበአኰቴተ ስብሐቲሁ ወንዜንወክሙ ኦአኃውየ ...
2,159,ተብህለ፡ ከመ፡ ሀሎ፡ ፩ደብር፡ ዘደናግል፡ በውስተ፡ አሐቲ፡ ሀገር፨ ወሀለ...,ተብህለ ከመ ሀሎ ፩ደብር ዘደናግል በውስተ አሐቲ ሀገር ወሀለወት ውስተ ው...
3,160,ተብህለ፡ ከመ፡ ሀሎ፡ ፩ብእሲ፡ በሀገረ፡ እልፍንድር፡ ወኮነ፡ ይፈቅድ፡ ከ...,ተብህለ ከመ ሀሎ ፩ብእሲ በሀገረ እልፍንድር ወኮነ ይፈቅድ ከመ ይኩን ብዙ...
4,156,ተብህለ፡ ከመ፡ ሀሎ፡ ፩ወሬዛ፡ ዘሠናይ፡ አርአያሁ፨ ወያፈቅራ፡ ለእግዝእት...,ተብህለ ከመ ሀሎ ፩ወሬዛ ዘሠናይ አርአያሁ ወያፈቅራ ለእግዝእትነ በኵሉ ል...
5,172,ወነበረ፡ ፩ብእሲ፡ ባዕል፡ ወነጋዲ፡ በሀገረ፡ ኒቆንያ፨ ወቦቱ፡ ምግባረ፡ ...,ወነበረ ፩ብእሲ ባዕል ወነጋዲ በሀገረ ኒቆንያ ወቦቱ ምግባረ ሠናይ ወይምሕ...


In [14]:
df.to_csv('output/stripped_incipits.csv', index=False)
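# write UTF-8 explicitly so the Ethiopic text does not depend on the platform default
with open('output/stripped_incipits.txt', 'w', encoding='utf-8') as f:
    # one cleaned incipit per line; rows without a usable incipit are skipped
    f.writelines(x + '\n' for x in df['clean_incipit'].dropna())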
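As a quick sanity check on the text export, the sketch below writes a toy column to a temporary file and reads it back. The frame and its values are illustrative only, and the temporary directory stands in for `output/`. Because rows with a missing incipit are skipped, line numbers in the `.txt` file need not match row positions in the CSV.

```python
import os
import tempfile
import pandas as pd

# Toy frame standing in for the notebook's df (hypothetical values)
toy = pd.DataFrame({
    'canon_id': ['159', '160', '999'],
    'clean_incipit': ['ተብህለ ከመ ሀሎ', 'ተብህለ ከመ', None],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'stripped_incipits.txt')
    lines = toy['clean_incipit'].dropna().tolist()
    # Explicit UTF-8 on both write and read keeps the round trip platform-independent
    with open(path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(lines) + '\n')
    with open(path, encoding='utf-8') as f:
        read_back = f.read().splitlines()

print(read_back == lines)
```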