Simple Bigram Language Model in Python

Description of Corpus Used

The corpus I used was the open-access SCEPA corpus (Small Corpus of English Political Apologies). The corpus was downloaded from here. The corpus is in XML format and contains political apologies/excuses from politicians in the US, UK and Canada. Each apology also carries substantial metadata, such as the date, the author, the author's gender and country, link(s) to the source of the apology, the reason for the apology, and the communicative tactics used. For the purpose of sentence generation, however, only the textual data of each apology was required. The corpus contains 232 apologies which, after processing, yield 1,220 sentences for a total of 22,538 words.
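Extracting just the textual data from an XML corpus like this can be done with Python's standard library. The sketch below assumes each apology sits in an element named apology; the actual SCEPA schema may use different element names, so treat this as illustrative only.

```python
import xml.etree.ElementTree as ET

def extract_apology_texts(path):
    """Collect the plain text of every apology element in the corpus file.

    NOTE: the element name 'apology' is an assumption; adjust it to match
    the real tag names in the SCEPA XML schema.
    """
    tree = ET.parse(path)
    texts = []
    for node in tree.getroot().iter("apology"):
        # itertext() flattens any nested markup into plain text;
        # split/join normalises runs of whitespace.
        texts.append(" ".join("".join(node.itertext()).split()))
    return texts
```

From here, the metadata elements can simply be ignored, since only the apology text feeds the bigram counts.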

Appraisal of Generated Sentences

The sentences generated by the model are stored in the generated_sentence.txt file. In general, the quality of the sentences is not great, but this is to be expected with such a small corpus: at each iteration there was only a small selection of candidate next tokens to choose from. I would have liked to use a much larger corpus, but I liked the genre of the one I chose and was also constrained by the processing power available to me.

There is also substantial variation in the length of the generated sentences; this could be changed by adding a constraint to the while loop used in generation. However, I wanted sentence length to be random as well, so I left the loop unconstrained.
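The generation loop described above can be sketched as follows. This is not the repository's implementation, just a minimal bigram model under the usual assumptions: sentences are bracketed with start/end markers, and generation samples successors until the end marker is drawn. The max_len cap is my own safeguard against a pathologically long walk; dropping it gives the fully unconstrained loop described here.

```python
import random
from collections import defaultdict

def build_bigrams(sentences):
    """Map each token to the list of tokens that follow it in the corpus.

    Keeping duplicates in the follower lists means random.choice samples
    each successor in proportion to its observed bigram frequency.
    """
    followers = defaultdict(list)
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            followers[prev].append(nxt)
    return followers

def generate(followers, max_len=50):
    """Start at <s> and sample successors until </s> is drawn.

    Sentence length is therefore random; max_len is only a safety cap
    and is an addition of this sketch, not part of the original model.
    """
    token, out = "<s>", []
    while len(out) < max_len:
        token = random.choice(followers[token])
        if token == "</s>":
            break
        out.append(token)
    return " ".join(out)
```

Because every training sentence ends in the end marker, the walk terminates with probability 1 even without the cap, which is why leaving the loop unconstrained is workable in practice.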
