The corpus I used was the open-access SCEPA corpus (Small Corpus of English Political Apologies), which was downloaded from here. The corpus is in XML format and contains political apologies/excuses from politicians in the US, UK and Canada. There is also a substantial amount of metadata associated with each apology, such as the date, the author, the author's gender and country, link(s) to the source of the apology, the reason for the apology, and the communicative tactics used. For the purpose of sentence generation, however, only the textual data of each apology was required. After processing, the 232 apologies in the corpus yield 1,220 sentences, for a total of 22,538 words.
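Extracting just the textual data from the XML files can be done with the standard library. The sketch below is illustrative only: the element names ("apology" and "text") are assumptions, since the actual SCEPA schema may use different tags for the apology text and its metadata.

```python
import xml.etree.ElementTree as ET

def extract_apology_texts(source):
    """Collect the text of each apology, skipping the metadata elements.

    `source` is a path or file object. The tag names "apology" and
    "text" are assumed here and may differ in the real SCEPA schema.
    """
    tree = ET.parse(source)
    texts = []
    for apology in tree.getroot().iter("apology"):
        node = apology.find("text")
        if node is not None and node.text:
            texts.append(node.text.strip())
    return texts
```

Metadata elements such as date or author are simply never visited, so only the apology text survives for downstream sentence splitting and tokenisation.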
The sentences generated by the model are stored in the generated_sentence.txt file. In general, the quality of the sentences is not great, but this is to be expected given the small size of the corpus: with so little data, there is only a narrow selection of candidate next tokens at each iteration. I would have liked to use a much larger corpus, but I liked the genre of the one I chose and was also constrained by the processing power available to me.
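The effect of corpus size on next-token choice can be seen in a minimal bigram Markov sketch (an assumption about the approach, written here with whitespace tokenisation and "&lt;s&gt;"/"&lt;/s&gt;" boundary markers; the actual model may differ):

```python
import random
from collections import defaultdict

def build_bigram_model(sentences):
    """Map each token to the list of tokens that follow it in the corpus.

    With a small corpus most of these successor lists are short, so the
    generated sentences tend to echo the source text closely.
    """
    model = defaultdict(list)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for current, nxt in zip(tokens, tokens[1:]):
            model[current].append(nxt)
    return model

def generate_sentence(model, rng=random):
    """Walk the chain from <s>, sampling successors until </s> is drawn."""
    token, out = "<s>", []
    while True:
        token = rng.choice(model[token])
        if token == "</s>":
            return " ".join(out)
        out.append(token)
```

On a tiny corpus, each `model[token]` list holds only a handful of options, which is exactly the limited "selection of choices for the next token" described above.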
There is also substantial variation in the length of the generated sentences; this could be controlled by adding a constraint to the while loop used to generate them. However, I wanted sentence length to be random as well, so I left it unconstrained.
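The length constraint mentioned above could be added by bounding the while loop, roughly as follows (a sketch, not the original implementation; `model` is assumed to map each token to a list of possible successors, with "&lt;s&gt;" and "&lt;/s&gt;" as boundary markers):

```python
import random

def generate_capped(model, max_tokens, rng=random):
    """Markov walk with a hard cap on sentence length.

    The cap on len(out) is the extra while-loop constraint; without it,
    sentence length is left entirely to chance.
    """
    token, out = "<s>", []
    while len(out) < max_tokens:   # the added length constraint
        token = rng.choice(model[token])
        if token == "</s>":        # natural sentence end reached first
            break
        out.append(token)
    return " ".join(out)
```

Leaving the cap out (or setting it very high) recovers the fully random lengths chosen for the reported output.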