luonglearnstocode/Seinfeld-text-corpus

Seinfeld text corpus

Seinfeld is my favorite TV show. I wrote this notebook to scrape the scripts of all Seinfeld episodes from seinology.com and merge them into a single text corpus that I could train a language model on. I hope you find it useful. Any feedback is appreciated.


corpus.txt: the corpus, 717576 words across 64919 lines of Seinfeld scripts, ready for training a language model.
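
As a quick sanity check before training, the line and word counts can be recomputed locally. This is a minimal sketch, not the code from the notebook: the helper name `corpus_stats` is hypothetical, it assumes corpus.txt is in the current working directory, and it counts words as whitespace-separated tokens, which may not match how the figures above were produced.

```python
from pathlib import Path


def corpus_stats(text: str) -> tuple[int, int]:
    """Return (line_count, word_count) for a block of text.

    Words are whitespace-separated tokens. This is an assumption and may
    differ from the tokenization behind the README's figures.
    """
    lines = text.splitlines()
    words = text.split()
    return len(lines), len(words)


if __name__ == "__main__":
    # Assumed location: corpus.txt in the current working directory.
    path = Path("corpus.txt")
    if path.exists():
        text = path.read_text(encoding="utf-8", errors="replace")
        n_lines, n_words = corpus_stats(text)
        print(f"{n_lines} lines, {n_words} words")
```

If the numbers differ slightly from those listed above, a different word-splitting rule or trailing blank lines are the likely causes.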