## Data preprocessing

The dataset consists of 6494 headlines, with a mean length of 70 characters and a maximum length of 168 characters. I decided to pad all the headlines to the same length (168) with the special character "~" and to prepend the special character "^" to each one, so each padded record is 169 characters long. This implementation only supports fixed-length chunks, and for the model to learn the semantic rules of an Italian phrase, each training example should contain one complete phrase.

In [8]:
# load the dataset (the CSV has no header row)
import pandas as pd

df = pd.read_csv('lercio_headlines.csv', header=None)

# length of the longest headline, in characters
# (named max_len to avoid shadowing the built-in max)
max_len = df[0].str.len().max()
print(max_len)

# right-pad the headlines with "~" so they all have the same length
df[0] = df[0].str.pad(max_len, side='right', fillchar='~')

# insert the start character "^" at the beginning of each headline
df[0] = "^" + df[0]

# create a txt file with the headlines, written back to back
# (no newline is added between them)
with open('lercio_padded.txt', 'w') as f:
    for line in df[0]:
        f.write(line)

168
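
The record layout produced by this padding can be checked without pandas. The sketch below is a minimal illustration, not the notebook's code: `pad_headline`, `split_records` and `strip_markers` are hypothetical helper names, the two headlines are toy data, and it assumes the padded headlines are written back to back with no separator, as in the cell above, so that every record is exactly `max_len + 1` characters long.

```python
START, PAD = "^", "~"

def pad_headline(text: str, max_len: int) -> str:
    """Prefix the start marker and right-pad with PAD to max_len + 1 characters."""
    return START + text.ljust(max_len, PAD)

def split_records(blob: str, max_len: int) -> list[str]:
    """Cut a concatenated string back into fixed-length records."""
    size = max_len + 1  # one extra character for the start marker
    return [blob[i:i + size] for i in range(0, len(blob), size)]

def strip_markers(record: str) -> str:
    """Recover the headline text from a padded record."""
    return record[len(START):].rstrip(PAD)

headlines = ["Ciao mondo", "Titolo di prova"]  # toy data, not from the dataset
max_len = max(len(h) for h in headlines)
blob = "".join(pad_headline(h, max_len) for h in headlines)
records = split_records(blob, max_len)
recovered = [strip_markers(r) for r in records]
```

Note that `strip_markers` would also remove a genuine trailing "~" from a headline, so this only round-trips cleanly if the padding character never appears at the end of the real text.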


## Training and validation setup

The original experiments with this model were run on the Shakespeare dataset, which is approximately 3 times smaller than the Lercio dataset (without padding). Since the original experiments used 2000 epochs, I used 6000 epochs of training as the baseline.

The other parameters I experimented with are the hidden size of the GRU units and the number of layers, as well as the temperature used in the generation phase.
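
Temperature controls how sharply the generator commits to its most likely next character. The sketch below assumes the common char-RNN scheme in which the output scores (logits) are divided by the temperature before a softmax, and the next character is sampled from the resulting distribution. Whether `generate` does exactly this is an assumption, and `softmax_with_temperature` and `sample_index` are illustrative names.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temperature, rng=random):
    """Sample one index according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # toy scores for three candidate characters
cold = softmax_with_temperature(logits, 0.3)  # peaked distribution
hot = softmax_with_temperature(logits, 1.5)   # flatter distribution
```

With a low temperature such as 0.3, most of the probability mass lands on the highest-scoring character, which tends to give safer and more repetitive text. Higher values flatten the distribution and make unusual characters more likely.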

It is difficult to systematically assess the quality of the generated headlines, in particular their semantic accuracy. To check the correctness of individual words, I used a dataset found on GitHub (https://github.com/napolux/paroleitaliane). The "parole.txt" file contains a list of Italian words (almost complete in my opinion, around 1 million words), including compound words, names, surnames, cities and locations, verbs, adjectives, adverbs, etc.

The idea is to use that file to calculate the percentage of correct words generated by the model for different hidden sizes, numbers of layers and numbers of epochs.

In [1]:
# open the file "parole.txt" and read it into a set for fast membership tests

with open('./dataset/parole.txt', 'r') as f:
    real_words = {p.strip().lower() for p in f}

len(real_words)

952734

In [3]:
from generate import generate

def percetage_correct_words(real_words, model_path, temperature, num_titles, len_titles):
    """Return the fraction of generated words that appear in real_words."""
    total_words = 0
    wrong_words = 0
    for _ in range(num_titles):
        title = generate(model_path, temperature, len_titles)
        # remove the leading ^ character
        title = title[1:]
        # remove the ~ padding characters
        title = title.replace('~', '')
        # count the words that are not in the word list
        for word in title.split():
            if word.lower() not in real_words:
                wrong_words += 1
            total_words += 1

    # avoid a division by zero if no words were generated
    if total_words == 0:
        return 0.0
    return 1 - wrong_words / total_words
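
One thing to keep in mind when reading this score: `str.split()` leaves punctuation attached to tokens, so a token such as `"casa,"` is looked up with its comma and may be counted as wrong. A possible variant, shown here as a sketch and not as the method used for the results in this notebook, strips leading and trailing punctuation before the lookup. `normalize_token` and `fraction_known` are hypothetical names, the set of stripped characters is an assumption, and the small vocabulary is toy data.

```python
import string

# ASCII punctuation plus a few typographic marks (an assumed, non-exhaustive set)
STRIP_CHARS = string.punctuation + "«»“”‘’…"

def normalize_token(token: str) -> str:
    """Lowercase a token and strip punctuation from both ends."""
    return token.strip(STRIP_CHARS).lower()

def fraction_known(title: str, vocabulary) -> float:
    """Fraction of non-empty normalized tokens found in vocabulary."""
    tokens = [normalize_token(t) for t in title.split()]
    tokens = [t for t in tokens if t]  # drop tokens that were only punctuation
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in vocabulary)
    return known / len(tokens)

vocab = {"la", "casa", "è", "bella"}  # toy vocabulary
score = fraction_known("La casa, è bella!", vocab)
```

This only strips characters at the edges of a token, so elided forms with an internal apostrophe are still looked up as a single token.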


In [13]:
percetage_correct_words(real_words, './models/lercio_E6000_H200_L1.pt', 0.3, 100, 200)


0.8213157138753232
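
To compare configurations as described earlier, the same call can be repeated over a grid of settings. The sketch below is hedged: it assumes checkpoints follow the `lercio_E{epochs}_H{hidden}_L{layers}.pt` naming seen in the call above, the grid values are placeholders and not the configurations actually trained, and `score_fn` stands in for the scoring function so that the loop can be shown without loading a model. `model_path` and `sweep` are hypothetical helper names.

```python
from itertools import product

def model_path(epochs, hidden, layers, root="./models"):
    """Build a checkpoint path following the assumed naming pattern."""
    return f"{root}/lercio_E{epochs}_H{hidden}_L{layers}.pt"

def sweep(score_fn, epochs_grid, hidden_grid, layers_grid):
    """Score every combination in the grid; keys are (epochs, hidden, layers)."""
    results = {}
    for epochs, hidden, layers in product(epochs_grid, hidden_grid, layers_grid):
        results[(epochs, hidden, layers)] = score_fn(model_path(epochs, hidden, layers))
    return results

# Stub scorer for illustration only: it returns the path length, not a real score
demo = sweep(len, [6000], [200], [1, 2])
```

In the notebook, `score_fn` could be something like `lambda p: percetage_correct_words(real_words, p, 0.3, 100, 200)`, provided the corresponding checkpoints exist.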