# Data Preprocessing

## Poemhunter data

In [None]:
import io
import json
import re  # used below for cleaning and tokenizing the poems
import pandas as pd
import numpy as np

file_ph="scrapy_poemhunter/top_poems.jl"
file_poets="scrapy_poemhunter/poets.jl"

Read in poems of top poets.

In [10]:
# Escape literal newlines so that each record parses as valid JSON
def json_escape(text):
    return text.replace('\n', "\\n")

def read_to_df(file):
    records = []
    with open(file, "r") as f:
        for line in f:
            records.append(json.loads(json_escape(line.strip())))

    df = pd.DataFrame(records)
    # Join each poem into one string (instead of a list containing lines)
    df["poem"] = df["poem"].apply("\n".join)

    return df
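
As an alternative sketch (not the approach used in this notebook), pandas can parse JSON Lines input directly with `pd.read_json(..., lines=True)`. The example assumes each record has `author`, `title`, and a `poem` field holding a list of lines, as in the scraped data; the helper name `read_jl` and the sample records are hypothetical.

```python
import io
import pandas as pd

def read_jl(source):
    """Read a JSON Lines source and join each poem's lines into one string."""
    df = pd.read_json(source, lines=True)
    # Each poem arrives as a list of lines; join it into a single string
    df["poem"] = df["poem"].apply("\n".join)
    return df

# Hypothetical in-memory sample standing in for a .jl file
sample = io.StringIO(
    '{"author": "A", "title": "T1", "poem": ["line one", "line two"]}\n'
    '{"author": "B", "title": "T2", "poem": ["only line"]}\n'
)
df_sample = read_jl(sample)
print(df_sample.shape)
```

Assigning the whole `poem` column at once avoids writing through `df.iloc[i]`, which pandas may treat as a copy.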
        

In [11]:
df_poets = read_to_df(file_poets)
df_poets.shape

(140320, 3)

In [13]:
df_poems = read_to_df(file_ph)
df_poems.shape

(14388, 3)

In [21]:
df_poems.head()

| | author | poem | title |
|---|---|---|---|
| 0 | Edgar Allan Poe | From childhood's hour I have not been\nAs othe... | Alone - Poem by Edgar Allan Poe |
| 1 | Ambrose Bierce | In contact, lo! the flint and steel,\nBy sharp... | Alone - Poem by Ambrose Bierce |
| 2 | Amy Louise Kerswell | Anger is bubbling away at me.\nBurning a whole... | Anger Rages Inside - Poem by Amy Louise Kerswell |
| 3 | Sara Teasdale | I am alone, in spite of love,\nIn spite of all... | Alone - Poem by Sara Teasdale |
| 4 | Davina Caddell | His anger is a hard summer storm,\nUnpredictab... | Anger - Poem by Davina Caddell |


#### Join data sets

In [217]:
df_ph = pd.concat([df_poets, df_poems])

#### Remove duplicates

In [218]:
df_ph = df_ph.drop_duplicates()
print(df_ph.shape)

(149496, 3)


Almost 150k poems remain after removing duplicates.

In [219]:
df_poets.head()

| | author | poem | title |
|---|---|---|---|
| 0 | Sylvia Plath | Stalemated their armies stood, with tottering ... | The Snowman On The Moor - Poem by Sylvia Plath |
| 1 | Sylvia Plath | No map traces the street\nWhere those two slee... | The Sleepers - Poem by Sylvia Plath |
| 2 | Sylvia Plath | When night comes black\nSuch royal dreams beck... | The Shrike - Poem by Sylvia Plath |
| 3 | Khalil Gibran | Your children are not your children.\nThey are... | Your Children - Poem by Khalil Gibran |
| 4 | Maya Angelou | A free bird leaps on the back\nOf the wind and... | I know why the caged bird sings - Poem by Maya... |


#### Drop non-English poems
One poem per poet was inspected to check whether that poet's poems are in English. The authors listed below were identified as non-English and are removed.

In [220]:
df_ph.shape

(149496, 3)

In [221]:
authors = "Abdul Wahab, Ahmad Faraz, Alfonsina Storni, Andre Marie de Chenier, Antonio Machado, Ashok Chakradhar, \
Bijay Kant Dubey, Christian Winther, Donald Bruce Dawe, Edith Wharton, Emil Aarestrup, Emile Verhaeren, Francisco Balagtas, \
Gajanan Mishra, Geoffrey Chaucer, Goswami Tulsidas, Hans Christian Andersen, Harivansh Rai Bachchan, Hasmukh Amathalal, \
Henry VIII, King of England, Hiren Bhattacharyya, Jacques Prevert, James Arlington Wright, Jasimuddin, Jhaverchand Meghani, \
John Milton, Kamini Roy, Kumar Vishwas, Kuvempu, Michael Madhusudan Dutta, Paul Valery, Rahat Indori, Ronjoy Brahma, \
Ruben Dario, Rudra Mohammad Shahidullah, Sayeed Abubakar, Shakti Chattopadhay, Sophus Niels Christen Claussen, \
Victor Marie Hugo, Viggo Stuckenberg, Wallace Stevens"
authors = authors.split(", ")
len(authors)

42
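
Note that splitting on `", "` also breaks the single name "Henry VIII, King of England" into two entries, so the 42 items above correspond to 41 names, and that author would not be matched if stored under the full name. A minimal sketch of one way around this, assuming the names are kept in a Python list instead of one string; the shortened list `excluded` and the toy frame `df_toy` are hypothetical stand-ins:

```python
import pandas as pd

# Hypothetical excerpt of the exclusion list, stored as a list so that
# names containing a comma stay intact
excluded = [
    "Ahmad Faraz",
    "Henry VIII, King of England",
    "Paul Valery",
]

# Toy frame standing in for df_ph
df_toy = pd.DataFrame({
    "author": ["Paul Valery", "Emily Dickinson", "Henry VIII, King of England"],
    "title": ["t1", "t2", "t3"],
})

# Keep only rows whose author is not in the exclusion list
df_kept = df_toy[~df_toy["author"].isin(excluded)].reset_index(drop=True)
print(df_kept)
```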

In [222]:
# Keep only rows whose author is not in the exclusion list
df_ph = df_ph[~df_ph.author.isin(authors)]

In [223]:
df_ph.shape

(99347, 3)

About 100k poems are left after removing non-English writers.

In [224]:
len(df_ph.groupby('author').first())

4023

In [225]:
df_ph.author.value_counts().head()

RoseAnn V. Shawiak       22667
Lawrence S. Pertillar    17874
Sandra Feldman            2524
Emily Dickinson           1232
Muzahidul Reza            1225
Name: author, dtype: int64

A few poets have a very large number of poems.

#### Take max 500 poems per author.

In [226]:
df_ph = df_ph.groupby('author').head(500).reset_index(drop=True)

In [227]:
df_ph.shape

(54870, 3)

After keeping at most the first 500 poems of each author, to reduce bias towards any single author, we are left with about 55k poems.
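
To illustrate what `groupby('author').head(n)` does, here is a small sketch on a toy frame (the data and the name `max_per_author` are made up for the example). It keeps the first `n` rows of each author in their existing order; it does not rank poems in any way.

```python
import pandas as pd

# Toy frame standing in for df_ph
df_toy = pd.DataFrame({
    "author": ["A", "A", "A", "B", "A", "B"],
    "title": ["a1", "a2", "a3", "b1", "a4", "b2"],
})

max_per_author = 2  # the notebook uses 500

# First `max_per_author` rows per author, original row order preserved
df_capped = df_toy.groupby("author").head(max_per_author).reset_index(drop=True)
print(df_capped)
```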

#### Next, reduce the alphabet size of the poems.

In [228]:
poems_joined = "\n\n".join(df_ph.poem)

In [229]:
len(poems_joined)

83532115

The joined text originally contains about 83.5 million characters.

In [230]:
len(''.join(set(poems_joined)))

1234

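The data contains 1234 distinct characters, including many non-Latin scripts and unusual symbols. So let's keep only poems written in the English alphabet plus some punctuation.

Drop poems that have too many special symbols, that is, poems where more than 5% of the characters are removed by the cleaning.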

In [231]:
for idx, row in df_ph.iterrows():
    poem = row.poem
    # Remove all characters not listed here
    poem_clean = re.sub(r'[^a-zA-Z0-9 \t\n.,"\';:\-_<>!?()“”‘’—]', '', poem)
    length_score = len(poem_clean) / (len(poem) + 0.01)  # Fraction of the poem that survived cleaning

    poem_clean = re.sub('[“”‘’]', '"', poem_clean)  # No need for 4 different quotation marks

    if length_score < 0.95:
        poem_clean = ""  # Mark for removal (dropping rows here messed up the indexing)

    df_ph.loc[idx, "poem"] = poem_clean  # Write the cleaned poem back without chained assignment

# Remove poems with empty bodies
df_ph = df_ph[df_ph.poem != ""]
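
The cleaning rule above can also be sketched as a standalone function, which makes it easy to try on single strings. The names `DISALLOWED`, `clean_poem`, and `min_ratio` are hypothetical; the character set and the 0.95 threshold are the ones used in the loop.

```python
import re

# Matches every character that is NOT in the allowed set
DISALLOWED = re.compile(r"[^a-zA-Z0-9 \t\n.,\"';:\-_<>!?()“”‘’—]")

def clean_poem(poem, min_ratio=0.95):
    """Return the cleaned poem, or "" if too much of it was removed."""
    cleaned = DISALLOWED.sub("", poem)
    # Fraction of characters kept; 0.01 guards against empty poems
    ratio = len(cleaned) / (len(poem) + 0.01)
    if ratio < min_ratio:
        return ""
    # Map the four typographic quotation marks to a plain double quote
    return re.sub("[“”‘’]", '"', cleaned)

print(repr(clean_poem("Hello world, this is a poem!")))
print(repr(clean_poem("привет мир")))
```

With pandas this could be applied as `df_ph["poem"].apply(clean_poem)`, followed by the same filter on empty bodies.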

Check alphabet size

In [236]:
alphabet = ''.join(set("\n\n".join(df_ph.poem)))
print(len(alphabet))
print("".join(sorted(alphabet)))

80
	
 !"'(),-.0123456789:;<>?ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz—


Only 80 distinct characters are left, which should make it much easier to get reliable results.

In [233]:
df_ph.shape

(54676, 3)

This drops 194 poems, mostly ones written largely in non-Latin scripts.

#### Check how long most lines of the poems are

In [163]:
poems_joined = "\n\n".join(df_ph.poem)

In [164]:
line = poems_joined.split('\n')

In [166]:
lines_len = [len(v) for v in line]

In [168]:
np.percentile(lines_len, 99)

70.0

99 percent of the lines are at most 70 characters long.
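
As a small illustration of how `np.percentile` relates to the share of lines under a length threshold, here is a sketch on made-up lines (the sample and the choice of the 75th percentile are only for the example):

```python
import numpy as np

# Made-up lines standing in for the poem lines
lines = ["short", "a bit longer line", "x" * 70, "y" * 120]
lengths = np.array([len(s) for s in lines])

# Length below which 75 percent of the sample falls (linear interpolation)
threshold = np.percentile(lengths, 75)

# Fraction of lines that are at most `threshold` characters long
share_within = (lengths <= threshold).mean()
print(threshold, share_within)
```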

#### Write the data to a file

Write the cleaned poems to a file.

In [169]:
with open("poems_clean.txt", "w", encoding="utf-8") as g:
    g.write(poems_joined)

In [170]:
len(poems_joined)

83268427

54676 poems were written, using about 83 million characters in total from an alphabet of 80 symbols.

## Subwords

#### Tokenize data

Add white space around special symbols.

In [264]:
df_ph_bpe = df_ph.copy()
df_ph_bpe.shape

(54676, 3)

In [None]:
for idx, row in df_ph_bpe.iterrows():
    poem = row.poem
    poem_tok = re.sub(r'([\t\n.,"\';:\-_<>!?()“”‘’—])', r' \1 ', poem)  # Add whitespace around special symbols
    poem_tok = re.sub(' {2,}', ' ', poem_tok)  # Collapse runs of spaces into one

    df_ph_bpe.loc[idx, "poem"] = poem_tok  # Write the tokenized poem back without chained assignment
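
To see what this tokenization does to a single line, here is a minimal sketch using the same two substitutions (the names `SPECIAL` and `tokenize` are hypothetical, and the typographic quotes are left out because they were already replaced during cleaning):

```python
import re

# Symbols that should become separate tokens
SPECIAL = re.compile(r"([\t\n.,\"';:\-_<>!?()—])")

def tokenize(text):
    """Put spaces around special symbols, then collapse repeated spaces."""
    spaced = SPECIAL.sub(r" \1 ", text)
    return re.sub(" {2,}", " ", spaced)

example = tokenize("Hello, world!")
print(repr(example))  # 'Hello , world ! '
```

Note the trailing space after the last symbol; it is harmless for later whitespace-based splitting.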


In [None]:
df_ph_bpe.iloc[1].poem

In [None]:
poems_joined_bpe = "\n\n".join(df_ph_bpe.poem)

### BPE

Byte-pair encoding (BPE) is used to split words into subwords.

The following command was used for BPE training:
- subword-nmt-master\learn_bpe.py -i data\poems_clean.txt -o bpe_trained -s 10000
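
To give an idea of what BPE training does, below is a toy sketch of the merge-learning loop. It is not the `subword-nmt` implementation; the function names and the `</w>` end-of-word marker are assumptions for illustration, and `num_merges` plays the role that `-s 10000` presumably plays in the command above.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(words, num_merges):
    """Return the list of merges, most frequent pair first."""
    # Each word starts as a tuple of characters plus an end-of-word marker
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges

merges = learn_bpe(["ab", "ab", "ab", "cb"], num_merges=2)
print(merges)
```

Each merge adds one new subword symbol, so the number of merges controls the size of the subword vocabulary.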