# Evaluation Script 

This notebook implements two evaluation metrics for generated poems (haiku):

- **Metric 1:** assesses the quality of the text
  - GRUEN: https://aclanthology.org/2020.findings-emnlp.9.pdf
  - GitHub: https://github.com/WanzhengZhu/GRUEN

- **Metric 2:** assesses the structure of the text
  - Mean Syllable count for each line (should be 5-7-5 for Haiku)
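
As a rough illustration of Metric 2, the sketch below averages per-line syllable counts over a set of poems and reports the share that matches 5-7-5 exactly. The function names and the sample counts are assumptions made for this example; they are not taken from the notebook.

```python
from statistics import mean

HAIKU_PATTERN = (5, 7, 5)  # target syllables for lines 1, 2, 3


def mean_syllables_per_line(counts):
    """Average syllable count for each of the three lines.

    counts: list of (line1, line2, line3) syllable-count tuples.
    """
    return tuple(mean(poem[i] for poem in counts) for i in range(3))


def exact_match_rate(counts, pattern=HAIKU_PATTERN):
    """Fraction of poems whose counts equal the pattern exactly."""
    return sum(tuple(poem) == pattern for poem in counts) / len(counts)


# Made-up counts for four poems, for illustration only.
sample = [(5, 7, 5), (4, 7, 5), (5, 7, 4), (5, 7, 5)]
print(mean_syllables_per_line(sample))
print(exact_match_rate(sample))
```

The mean alone can hide poems that miss the pattern, so the exact-match rate is a useful companion figure.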


# Installs and Setup

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
cd /content/drive/MyDrive/Haiku-Generation

/content/drive/MyDrive/Haiku-Generation


In [None]:
cd GRUEN 

/content/drive/MyDrive/Haiku-Generation/GRUEN


### GRUEN installs




In [None]:
'''Need to run for 1st time setup'''
# !git clone https://github.com/WanzhengZhu/GRUEN.git
# cd GRUEN
# ! pip install -r requirements.txt
# !pip install --upgrade --no-cache-dir gdown
# !gdown --id 1S-l0L_YOzn5KhYHdB8iS37qKwuUhHP0G
# !unzip cola_model.zip

'No need to run'

In [None]:
%%capture
!pip install transformers
!pip install wmd
!python -m spacy download en_core_web_md

import nltk
nltk.download('punkt')

from greun import *

import pandas as pd
import numpy as np
from tqdm import tqdm

### Syllables install

In [41]:
%%capture
# for phonemizer
!pip install phonemizer
!sudo apt-get install festival

from phonemizer import phonemize
from phonemizer.separator import Separator

# Steps to run via terminal 

- Ensure all installs and setup steps above have been completed
- Make sure the GRUEN folder is accessible (the script imports `greun.py`)
- Pass the path to the CSV data file on the command line
  - The CSV must have 3 columns, named `sent_1`, `sent_2`, and `sent_3`, each holding one line of the corresponding poem
  - Example invocation:

```
python evaluation.py '/content/drive/MyDrive/output.csv'
```

- The script prints the mean, median, and quantiles of the per-poem scores.
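
`evaluation.py` itself is not shown in this notebook, so the following is only a hypothetical sketch of the input check and the summary step described above, written with the standard library. The names `load_poems` and `summarize` are assumptions, and the scoring step is left out.

```python
import csv
import statistics

REQUIRED_COLUMNS = ("sent_1", "sent_2", "sent_3")


def load_poems(csv_path):
    """Read the CSV and join the three line columns into one string per poem."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [col for col in REQUIRED_COLUMNS if col not in header]
        if missing:
            raise ValueError(f"CSV is missing required columns: {missing}")
        return [" ".join(row[col] for col in REQUIRED_COLUMNS) for row in reader]


def summarize(scores):
    """Return the mean, median, and quartile cut points of a list of scores."""
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        # quantiles() needs at least two data points
        "quartiles": tuple(statistics.quantiles(scores, n=4)),
    }
```

A script built this way would call `load_poems(sys.argv[1])`, score the returned poems, and print `summarize(scores)`.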


# Trial GRUEN (Github reference)

Check that all installs work:

In [None]:
candidates = [
    "All I need is faith to believe it",
    "I can feel my heart is beating faster",
    "'cause I'm so selfishly in love with you",
    "All I see is this big bright world",
    "I dont't think you are worth waiting with"
]

In [None]:
gruen_score = get_gruen(candidates)

100%|██████████| 5/5 [00:01<00:00,  4.30it/s]
Evaluating: 100%|██████████| 5/5 [00:05<00:00,  1.01s/it]


In [None]:
print(
    sorted(list(zip(candidates, gruen_score)),
            key=lambda x: x[1],
            reverse=True))

[('All I see is this big bright world', 0.8570365309715271), ('All I need is faith to believe it', 0.8371034860610962), ('I can feel my heart is beating faster', 0.8120250403881073), ("I dont't think you are worth waiting with", 0.7390467375975708), ("'cause I'm so selfishly in love with you", 0.6723689770002683)]


# Required Functions: Syllables

In [172]:
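def get_phonemes(line, char="|"):
  '''
  Phonemize text with the festival backend.

  Arguments:
    line - text (or sequence of texts) to phonemize
    char (str) - separator placed between syllables

  Returns:
    Phonemized text with words separated by spaces and syllables by char,
    or "" if phonemization fails.
  '''
  try:
    phn = phonemize(line, language='en-us', backend='festival', with_stress=False,
        separator=Separator(phone=None, word=' ', syllable=char), strip=True)
  except Exception:
    # Fall back to an empty string, which syllable_count scores as 1 syllable
    phn = ""
  return phn


def syllable_count(sen, char="|"):
  '''
  Count the syllables in a phonemized line.

  Arguments:
    sen (str) - phonemized text; words split by spaces, syllables by char
    char (str) - syllable separator used when phonemizing
  '''
  return sum(len(ph.split(char)) for ph in sen.split(" "))


def get_data_syllables(data):
  '''
  Get the syllable count for each line of text in data.

  Arguments:
    data - iterable of text lines (e.g. one DataFrame column)
  '''
  # Phonemize the whole column in one call, then count syllables per line
  phonemes_data = np.apply_along_axis(func1d=get_phonemes, arr=np.array(list(data)), axis=0)
  return [syllable_count(sen) for sen in phonemes_data]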
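
To make the counting rule concrete, here is a small self-contained sketch that applies the same split-and-sum logic to phoneme strings copied from the preprocessed data shown later in this notebook. The function is re-declared under the assumed name `count_syllables` so that the example runs without phonemizer or festival installed.

```python
def count_syllables(phonemized, char="|"):
    """Count syllables in a phonemized line.

    Words are separated by spaces; syllables inside a word by `char`.
    """
    return sum(len(word.split(char)) for word in phonemized.split(" "))


# Phoneme strings for one haiku from the preprocessed data.
lines = [
    "daem ay rih|liy mihs",
    "ahs ihn aw|er er|liy deyz",
    "ihn aw|er behst deyz",
]
print([count_syllables(line) for line in lines])  # [5, 7, 5]
```

Note that an empty string still counts as one syllable under this rule, because splitting `""` yields a single empty token.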

# Output Evaluation 

## HaikuRNN output 

In [113]:
path = '/content/drive/MyDrive/CIS530-Project/Code/HaikuRNN_output/haiku_charrnn_final_output.csv'
data = pd.read_csv(path)
data.drop(data.columns[0],axis=1,inplace=True)
data.head()

Unnamed: 0,sent_1,sent_2,sent_3
0,flowers was,a straight of a child and shadow,on a cold street stars
1,a shadow and the colours,of a complex shaped to see the,stars and sheets and the streets are so sad an...
2,and the sound of the straight strange thinks t...,is a stream that i saw,the world to see the stars of the wind
3,the sun is a silence the streets of a breeze,in the windows on the stars and street and shaped,to start a barnes of street and the candle shade
4,that standing at the sun and the can of strand,and so the summer stars,and stars are still there


In [None]:
data = pd.Series(list(data[['sent_1', 'sent_2', 'sent_3']].astype(str).values)).apply(lambda x: ' '.join(x)).to_list()

In [None]:
data[:5]

['flowers was a straight of a child and shadow on a cold street stars',
 'a shadow and the colours of a complex shaped to see the stars and sheets and the streets are so sad and stared',
 'and the sound of the straight strange thinks that this is a stream that i saw the world to see the stars of the wind',
 'the sun is a silence the streets of a breeze in the windows on the stars and street and shaped to start a barnes of street and the candle shade',
 'that standing at the sun and the can of strand and so the summer stars and stars are still there']

In [None]:
# get_gruen raised an error when given too many data points at once
# gruen_score = get_gruen(data)

In [114]:
pd.DataFrame(sorted(list(zip(data, gruen_score)),
        key=lambda x: x[1],
        reverse=True),columns=['haiku','gruen_score']).to_csv('/content/drive/MyDrive/CIS530-Project/Code/HaikuRNN_output/haiku_charrnn_gruen_score.csv',index=False)

In [120]:
# pd.Series(gruen_score).describe()

In [154]:
for i in tqdm(data.columns[:3]):
  data[str(i)+"_syllable"] = get_data_syllables(data[i])
  
data[[col for col in data.columns if 'syllable' in col]].describe()

Unnamed: 0,sent_1_syllable,sent_2_syllable,sent_3_syllable
count,1458.0,1458.0,1458.0
mean,7.58642,8.715364,8.039095
std,3.964462,3.161504,4.051045
min,2.0,2.0,1.0
25%,5.0,7.0,5.0
50%,5.0,7.0,6.0
75%,10.0,10.0,11.0
max,35.0,26.0,31.0


## Spacy Output 

In [202]:
path = '/content/drive/MyDrive/CIS530-Project/Code/Spacy_output/haiku_spacy_output_.csv'
data = pd.read_csv(path)
data.drop(data.columns[0],axis=1,inplace=True)
data.head()

Unnamed: 0,sent_1,sent_2,sent_3
0,motherlode copy,into river below dam,around til i crash
1,wind's cold frostbite,nostalgia sinks in again,ears are icicles
2,of the dawn heralds,moonlight is melting fleeting,now i am awake
3,now clear sky purple,happily alive and well,bare and blackened
4,flowers in the field,retail stores furlough thousands,sign from the cosmos


In [178]:
data = pd.Series(list(data[['sent_1', 'sent_2', 'sent_3']].astype(str).values)).apply(lambda x: ' '.join(x)).to_list()

#### GRUEN 

In [None]:
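gruen_score = []
index = []
# Score in batches of 10 so that one failing batch does not abort the whole run
for i in tqdm(range(0,len(data),10)):
  batch = data[i:i+10]
  try:
    gruen_score.extend(get_gruen(batch))
    # Record only indices that exist (the last batch may be shorter than 10)
    index.extend(range(i,i+len(batch)))
  except Exception:
    # Skip batches that GRUEN fails on
    pass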
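
The batching pattern above can be sketched in isolation as follows. This is a minimal illustration, not the notebook's code: `score_in_batches` and `toy_score` are hypothetical names, and `toy_score` is a stand-in for `get_gruen` that fails on empty strings so the skip path can be seen.

```python
def score_in_batches(texts, score_fn, batch_size=10):
    """Score texts batch by batch, skipping batches on which score_fn raises.

    Returns (scores, kept_indices), aligned one-to-one.
    """
    scores, kept = [], []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        try:
            batch_scores = list(score_fn(batch))
        except Exception as err:
            print(f"skipping batch starting at {start}: {err}")
            continue
        scores.extend(batch_scores)
        kept.extend(range(start, start + len(batch)))
    return scores, kept


def toy_score(batch):
    """Hypothetical stand-in for get_gruen; raises on empty text."""
    if any(not text for text in batch):
        raise ValueError("empty text")
    return [len(text) / 100 for text in batch]


texts = ["now i am awake", "ears are icicles", "", "sign from the cosmos", "bare and blackened"]
scores, kept = score_in_batches(texts, toy_score, batch_size=2)
print(kept)
```

Keeping the indices of the surviving batches is what allows the scores to be matched back to their poems afterwards.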

In [189]:
pd.DataFrame(pd.Series(gruen_score).describe(),columns=['gruen_score']).reset_index()

Unnamed: 0,index,gruen_score
0,count,1500.0
1,mean,0.292812
2,std,0.131685
3,min,0.089615
4,25%,0.19031
5,50%,0.260279
6,75%,0.375368
7,max,0.793858


In [40]:
data = list(np.array(data)[sorted(set(index))])
pd.DataFrame(sorted(list(zip(data, gruen_score)),
        key=lambda x: x[1],
        reverse=True),columns=['haiku','gruen_score']).to_csv('/content/drive/MyDrive/CIS530-Project/Code/Spacy_output/haiku_spacy_gruen_score.csv',index=False)

#### Phonemes

In [211]:
haiku = pd.read_csv('/content/drive/MyDrive/CIS530-Project/Code/Spacy_output/haiku_spacy_gruen_score.csv')
haiku.head()

Unnamed: 0,haiku,gruen_score
0,slow steady breathing in universal chaos b...,0.793858
1,behind a mountain concentric ripples shimmer...,0.779875
2,for a better home tomorrow i will forget e...,0.768513
3,conversations go boundless possibilities w...,0.739246
4,right before our eyes black cumulonimbus clo...,0.720661


In [212]:
data['haiku'] = pd.Series(list(data[['sent_1', 'sent_2', 'sent_3']].astype(str).values)).apply(lambda x: ' '.join(x))
data = data.merge(haiku,on='haiku')

In [213]:
for i in tqdm(data.columns[:3]):
  data[str(i)+"_syllable"] = get_data_syllables(data[i])

100%|██████████| 3/3 [00:44<00:00, 14.74s/it]


In [214]:
data[[col for col in data.columns if 'syllable' in col]].describe()

Unnamed: 0,sent_1_syllable,sent_2_syllable,sent_3_syllable
count,1500.0,1500.0,1500.0
mean,4.914667,6.781333,4.920667
std,0.420519,0.534517,0.387909
min,3.0,4.0,3.0
25%,5.0,7.0,5.0
50%,5.0,7.0,5.0
75%,5.0,7.0,5.0
max,7.0,8.0,8.0


In [215]:
data.head()

Unnamed: 0,sent_1,sent_2,sent_3,haiku,gruen_score,sent_1_syllable,sent_2_syllable,sent_3_syllable
0,motherlode copy,into river below dam,around til i crash,motherlode copy into river below dam aroun...,0.251227,5,7,5
1,wind's cold frostbite,nostalgia sinks in again,ears are icicles,wind's cold frostbite nostalgia sinks in aga...,0.458207,4,7,5
2,of the dawn heralds,moonlight is melting fleeting,now i am awake,of the dawn heralds moonlight is melting fle...,0.42945,5,7,5
3,now clear sky purple,happily alive and well,bare and blackened,now clear sky purple happily alive and well ...,0.212137,5,7,4
4,flowers in the field,retail stores furlough thousands,sign from the cosmos,flowers in the field retail stores furlough ...,0.246135,5,7,5


# Preprocessed Clean Data GRUEN

In [None]:
import os
main_path = '/content/drive/MyDrive/CIS530-Project/Data/Cleanup/csv_data/Final range/'
path = os.listdir(main_path)

data = pd.concat([pd.read_csv(os.path.join(main_path,i)) for i in path])
print(data.shape)

# other data
path = '/content/drive/MyDrive/CIS530-Project/Data/Cleanup/cleaned-data_firstrange.csv'
df = pd.read_csv(path)
df.drop(df.columns[0],axis=1,inplace=True)
data.columns  = df.columns
data = pd.concat([data,df])
print(data.shape)

path = '/content/drive/MyDrive/CIS530-Project/Data/Cleanup/cleaned-data_1.csv'
df = pd.read_csv(path)
df.drop(df.columns[0],axis=1,inplace=True)
df.columns = data.columns 
data = pd.concat([data,df])
print(data.shape)

data.head()

(40668, 11)
(88157, 11)
(99387, 11)


Unnamed: 0,sent_1,sent_2,sent_3,source,topic,sent_1_syllable,sent_2_syllable,sent_3_syllable,sent_1_phoneme,sent_2_phoneme,sent_3_phoneme
0,"Damn, I really miss.",Us in our early days.,In our best days.,twaiku,best days,5.0,7.0,5.0,daem ay rih|liy mihs,ahs ihn aw|er er|liy deyz,ihn aw|er behst deyz
1,I've never wanted.,To go to Utah more in.,My entire life.,twaiku,never wanted,5.0,7.0,5.0,ayv neh|ver waan|taxd,tax gow tax yuw|tao maor ihn,may axn|tay|er layf
2,ALL people have the.,Capability to change.,"They just choose not, too.",twaiku,capability to,5.0,7.0,5.0,aol piy|paxl hhaev dhax,key|pax|bih|lax|tiy tax cheynjh,dhey jhahst chuwz naat tuw
3,Southern comfort and.,Tea with a bit of honey.,And a dash of ice.,twaiku,and tea,5.0,7.0,5.0,sah|dhern kahm|fert aend,tiy wihdh ax biht ahv hhah|niy,aend ax daesh ahv ays
4,We are both courses.,But is structured as a pimp.,Now Newt inquired.,twaiku,is,5.0,7.0,5.0,wiy aar bowth kaor|saxz,baht ihz strahk|cherd aez ax pihmp,naw nuwt axn|kway|erd


In [None]:
# data.to_csv('/content/drive/MyDrive/CIS530-Project/Data/preprocessed_cleaned_data_99k.csv',index=False,header=True)

In [None]:
data['poem'] = pd.Series(list(data[['sent_1', 'sent_2', 'sent_3']].astype(str).values)).apply(lambda x: ' '.join(x))

In [None]:
gruen_score = get_gruen(list(data['poem']))

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 99387/99387 [8:12:27<00:00,  3.36it/s]
Evaluating:  21%|██        | 62066/294381 [8:00:43<29:59:22,  2.15it/s]


KeyboardInterrupt: ignored

In [None]:
import pickle
# open() takes the path first and the mode second
with open('/content/drive/MyDrive/CIS530-Project/Data/pickle_gruen.pkl', 'wb') as f:
  pickle.dump(gruen_score, f)
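
For reference, a minimal round-trip sketch of saving and reloading a list of scores with `pickle`. The score values and the temporary directory are made up for illustration; only the file name mirrors the one used above.

```python
import os
import pickle
import tempfile

scores = [0.79, 0.46, 0.25]  # made-up values, for illustration only

# Write to a temporary directory rather than Google Drive (an assumption for this example).
path = os.path.join(tempfile.mkdtemp(), "pickle_gruen.pkl")

with open(path, "wb") as handle:  # binary write mode
    pickle.dump(scores, handle)

with open(path, "rb") as handle:  # binary read mode
    restored = pickle.load(handle)

print(restored == scores)
```

Saving intermediate results this way can spare a long scoring run from being repeated after an interruption.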

In [None]:
data['gruen_score'] = gruen_score
data.to_csv('/content/drive/MyDrive/CIS530-Project/Data/data_99k_gruen.csv',index=False,header=True)