Simple Bigram Language Model in Python

Description of Corpus Used

The corpus I used was the open-access SCEPA corpus (Small Corpus of English Political Apologies). The corpus was downloaded from here. The corpus is in XML format and contains political apologies/excuses from politicians in the US, UK and Canada. Each apology also carries substantial metadata, such as the date, the author, the author's gender and country, link(s) to the source of the apology, the reason for the apology, and the communicative tactics used. For the purpose of sentence generation, however, only the textual data of each apology was required. The corpus contains 232 apologies which, after processing, yield 1,220 sentences for a total of 22,538 words.
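Extracting just the textual data from an XML corpus like this can be done with Python's standard library. The sketch below assumes each apology sits in an element named apology; the actual SCEPA schema may use different element names, so treat this as illustrative only.

```python
import xml.etree.ElementTree as ET

def extract_apology_texts(path):
    """Collect the plain text of every apology element in the corpus file.

    NOTE: the element name 'apology' is an assumption; adjust it to match
    the real tag names in the SCEPA XML schema.
    """
    tree = ET.parse(path)
    texts = []
    for node in tree.getroot().iter("apology"):
        # itertext() flattens any nested markup into plain text;
        # split/join normalises runs of whitespace.
        texts.append(" ".join("".join(node.itertext()).split()))
    return texts
```

From here, the metadata elements can simply be ignored, since only the apology text feeds the bigram counts.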

Appraisal of Generated Sentences

The sentences generated by the model are stored in the generated_sentence.txt file. In general, the quality of the sentences is not great, but this is to be expected with such a small corpus: at each iteration there was only a small selection of candidate next tokens to choose from. I would have liked to use a much larger corpus, but I liked the genre of the one I chose and was also constrained by the processing power available to me.

There is also substantial variation in the length of the generated sentences; this could be changed by adding a constraint to the while loop used in generation. However, I wanted sentence length to be random as well, so I left the loop unconstrained.
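The generation loop described above can be sketched as follows. This is not the repository's implementation, just a minimal bigram model under the usual assumptions: sentences are bracketed with start/end markers, and generation samples successors until the end marker is drawn. The max_len cap is my own safeguard against a pathologically long walk; dropping it gives the fully unconstrained loop described here.

```python
import random
from collections import defaultdict

def build_bigrams(sentences):
    """Map each token to the list of tokens that follow it in the corpus.

    Keeping duplicates in the follower lists means random.choice samples
    each successor in proportion to its observed bigram frequency.
    """
    followers = defaultdict(list)
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            followers[prev].append(nxt)
    return followers

def generate(followers, max_len=50):
    """Start at <s> and sample successors until </s> is drawn.

    Sentence length is therefore random; max_len is only a safety cap
    and is an addition of this sketch, not part of the original model.
    """
    token, out = "<s>", []
    while len(out) < max_len:
        token = random.choice(followers[token])
        if token == "</s>":
            break
        out.append(token)
    return " ".join(out)
```

Because every training sentence ends in the end marker, the walk terminates with probability 1 even without the cap, which is why leaving the loop unconstrained is workable in practice.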
