We'll be using some features from fastai, so let's import it:

In [2]:
from fastai.text.all import *

Download the public domain book War and Peace from Project Gutenberg:

In [4]:
download_data(url='https://www.gutenberg.org/files/2600/2600-0.txt', fname='war_and_peace.txt')

Path('war_and_peace.txt')

Let's check that it downloaded to the same directory as our notebook:

In [6]:
!ls

war_and_peace.ipynb  war_and_peace.txt


Let's open the text file and read the whole thing into our `text` variable:

In [7]:
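# The file is UTF-8 encoded, so say so explicitly instead of relying on the platform default
with open('war_and_peace.txt', 'r', encoding='utf-8') as f:
    text = f.read()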

Let's see how long it is:

In [8]:
len(text)

3227619

Now we have a very long string containing the whole book.

Let's take a look at the beginning of the string:

In [11]:
text[:1000]

'\ufeff\nThe Project Gutenberg EBook of War and Peace, by Leo Tolstoy\n\nThis eBook is for the use of anyone anywhere at no cost and with almost\nno restrictions whatsoever. You may copy it, give it away or re-use\nit under the terms of the Project Gutenberg License included with this\neBook or online at www.gutenberg.org\n\n\nTitle: War and Peace\n\nAuthor: Leo Tolstoy\n\nTranslators: Louise and Aylmer Maude\n\nPosting Date: January 10, 2009 [EBook #2600]\n\nLast Updated: January 21, 2019\n\nLanguage: English\n\nCharacter set encoding: UTF-8\n\n*** START OF THIS PROJECT GUTENBERG EBOOK WAR AND PEACE ***\n\n\n\n\nAn Anonymous Volunteer, and David Widger\n\n\n\n\n\n\nWAR AND PEACE\n\n\nBy Leo Tolstoy/Tolstoi\n\n\n\n\n\n    CONTENTS\n\n\n    BOOK ONE: 1805\n\n    CHAPTER I\n\n    CHAPTER II\n\n    CHAPTER III\n\n    CHAPTER IV\n\n    CHAPTER V\n\n    CHAPTER VI\n\n    CHAPTER VII\n\n    CHAPTER VIII\n\n    CHAPTER IX\n\n    CHAPTER X\n\n    CHAPTER XI\n\n    CHAPTER XII\n\n    CHAPTER XI
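The `'\ufeff'` at the very start of the output above is a Unicode byte order mark (BOM), not part of the book. It disappears here anyway once the preamble is cut. As a minimal sketch, decoding with Python's `utf-8-sig` codec would drop it at read time. The sample bytes below are made up for illustration and are not read from the downloaded file.

```python
import codecs

# Made-up sample: a UTF-8 BOM followed by the first words of the header
raw_bytes = codecs.BOM_UTF8 + 'The Project Gutenberg EBook'.encode('utf-8')

# Plain 'utf-8' keeps the BOM as the first character of the string
with_bom = raw_bytes.decode('utf-8')

# 'utf-8-sig' strips a leading BOM if one is present
without_bom = raw_bytes.decode('utf-8-sig')

print(repr(with_bom[:4]))
print(repr(without_bom[:4]))
```

The same codec name can be passed as `encoding='utf-8-sig'` to `open`.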

We don't want to include this section (before the actual book starts) in our data, so let's cut it out.

The actual story starts with the string "BOOK ONE: 1805", so let's find the index of that substring so we can cut out everything before it.

Let's be careful and check for all occurrences of that substring in the text.

In [43]:
substring = re.compile('BOOK ONE: 1805')
# finditer yields one match object per non-overlapping occurrence
matches = list(substring.finditer(text))
matches

[<re.Match object; span=(697, 711), match='BOOK ONE: 1805'>,
 <re.Match object; span=(7257, 7271), match='BOOK ONE: 1805'>]

We can see that the substring appeared twice in the book. Let's find the match that actually comes at the beginning of the story.

In [45]:
start1, start2 = [match.start() for match in matches]
start1, start2

(697, 7257)
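The search for every occurrence can be wrapped in a small reusable helper. This is only a sketch: `find_all` is a hypothetical name, and the sample string is a made-up miniature of the book's layout (a table of contents entry followed by the real heading).

```python
import re

def find_all(haystack, needle):
    """Return the start index of every non-overlapping occurrence of needle."""
    # re.escape ensures the needle is matched literally, even if it
    # contains characters that are special in regular expressions
    return [m.start() for m in re.finditer(re.escape(needle), haystack)]

# Made-up miniature: the heading appears in the contents and again in the body
sample = 'CONTENTS\nBOOK ONE: 1805\nCHAPTER I\n\nBOOK ONE: 1805\n\nCHAPTER I\nWell, Prince'
starts = find_all(sample, 'BOOK ONE: 1805')
print(starts)
```

Printing a slice of the text at each returned index, as the following cells do, is still the safest way to decide which occurrence is the real start.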

Let's try the first match:

In [47]:
text[start1:start1+1000]

'BOOK ONE: 1805\n\n    CHAPTER I\n\n    CHAPTER II\n\n    CHAPTER III\n\n    CHAPTER IV\n\n    CHAPTER V\n\n    CHAPTER VI\n\n    CHAPTER VII\n\n    CHAPTER VIII\n\n    CHAPTER IX\n\n    CHAPTER X\n\n    CHAPTER XI\n\n    CHAPTER XII\n\n    CHAPTER XIII\n\n    CHAPTER XIV\n\n    CHAPTER XV\n\n    CHAPTER XVI\n\n    CHAPTER XVII\n\n    CHAPTER XVIII\n\n    CHAPTER XIX\n\n    CHAPTER XX\n\n    CHAPTER XXI\n\n    CHAPTER XXII\n\n    CHAPTER XXIII\n\n    CHAPTER XXIV\n\n    CHAPTER XXV\n\n    CHAPTER XXVI\n\n    CHAPTER XXVII\n\n    CHAPTER XXVIII\n\n\n    BOOK TWO: 1805\n\n    CHAPTER I\n\n    CHAPTER II\n\n    CHAPTER III\n\n    CHAPTER IV\n\n    CHAPTER V\n\n    CHAPTER VI\n\n    CHAPTER VII\n\n    CHAPTER VIII\n\n    CHAPTER IX\n\n    CHAPTER X\n\n    CHAPTER XI\n\n    CHAPTER XII\n\n    CHAPTER XIII\n\n    CHAPTER XIV\n\n    CHAPTER XV\n\n    CHAPTER XVI\n\n    CHAPTER XVII\n\n    CHAPTER XVIII\n\n    CHAPTER XIX\n\n    CHAPTER XX\n\n    CHAPTER XXI\n\n\n    BOOK THREE: 1805\n\n    CH

This is still just the table of contents.

In [48]:
text[start2:start2+1000]

'BOOK ONE: 1805\n\n\n\n\n\nCHAPTER I\n\n“Well, Prince, so Genoa and Lucca are now just family estates of the\nBuonapartes. But I warn you, if you don’t tell me that this means war,\nif you still try to defend the infamies and horrors perpetrated by that\nAntichrist—I really believe he is Antichrist—I will have nothing\nmore to do with you and you are no longer my friend, no longer my\n‘faithful slave,’ as you call yourself! But how do you do? I see I\nhave frightened you—sit down and tell me all the news.”\n\nIt was in July, 1805, and the speaker was the well-known Anna Pávlovna\nSchérer, maid of honor and favorite of the Empress Márya Fëdorovna.\nWith these words she greeted Prince Vasíli Kurágin, a man of high\nrank and importance, who was the first to arrive at her reception. Anna\nPávlovna had had a cough for some days. She was, as she said, suffering\nfrom la grippe; grippe being then a new word in St. Petersburg, used\nonly by the elite.\n\nAll her invitations without exception, 

Now that looks like the start of the story itself.

Now let's remove everything up to that point in the text string:

In [49]:
text = text[start2:]

In [50]:
text[:1000]

'BOOK ONE: 1805\n\n\n\n\n\nCHAPTER I\n\n“Well, Prince, so Genoa and Lucca are now just family estates of the\nBuonapartes. But I warn you, if you don’t tell me that this means war,\nif you still try to defend the infamies and horrors perpetrated by that\nAntichrist—I really believe he is Antichrist—I will have nothing\nmore to do with you and you are no longer my friend, no longer my\n‘faithful slave,’ as you call yourself! But how do you do? I see I\nhave frightened you—sit down and tell me all the news.”\n\nIt was in July, 1805, and the speaker was the well-known Anna Pávlovna\nSchérer, maid of honor and favorite of the Empress Márya Fëdorovna.\nWith these words she greeted Prince Vasíli Kurágin, a man of high\nrank and importance, who was the first to arrive at her reception. Anna\nPávlovna had had a cough for some days. She was, as she said, suffering\nfrom la grippe; grippe being then a new word in St. Petersburg, used\nonly by the elite.\n\nAll her invitations without exception, 

Let's do the same for the end of the text string, which has extra content that is not part of the story.

In [51]:
text[-1000:]

'laf.org/donate\n\n\nSection 5. General Information About Project Gutenberg-tm electronic\nworks.\n\nProfessor Michael S. Hart is the originator of the Project Gutenberg-tm\nconcept of a library of electronic works that could be freely shared\nwith anyone. For thirty years, he produced and distributed Project\nGutenberg-tm eBooks with only a loose network of volunteer support.\n\n\nProject Gutenberg-tm eBooks are often created from several printed\neditions, all of which are confirmed as Public Domain in the U.S. unless\na copyright notice is included. Thus, we do not necessarily keep eBooks\nin compliance with any particular paper edition.\n\n\nMost people start at our Web site which has the main PG search facility:\n\n     http://www.gutenberg.org\n\nThis Web site includes information about Project Gutenberg-tm, including\nhow to make donations to the Project Gutenberg Literary Archive\nFoundation, how to help produce our new eBooks, and how to subscribe to\nour email newsletter to h

The text that signals the end of the story is "End of the Project Gutenberg EBook", so let's remove everything from that substring onward:

In [52]:
substring = re.compile('End of the Project Gutenberg EBook')
# finditer yields one match object per non-overlapping occurrence
matches = list(substring.finditer(text))
matches

[<re.Match object; span=(3201664, 3201698), match='End of the Project Gutenberg EBook'>]

In [54]:
start = matches[0].start()
start

3201664

Here is the very end of the story:

In [56]:
text[start-1000:start]

'e motion of the planets, so in history the difficulty of recognizing\nthe subjection of personality to the laws of space, time, and cause\nlies in renouncing the direct feeling of the independence of one’s own\npersonality. But as in astronomy the new view said: “It is true that we\ndo not feel the movement of the earth, but by admitting its immobility\nwe arrive at absurdity, while by admitting its motion (which we do not\nfeel) we arrive at laws,” so also in history the new view says: “It is\ntrue that we are not conscious of our dependence, but by admitting our\nfree will we arrive at absurdity, while by admitting our dependence on\nthe external world, on time, and on cause, we arrive at laws.”\n\nIn the first case it was necessary to renounce the consciousness of an\nunreal immobility in space and to recognize a motion we did not feel;\nin the present case it is similarly necessary to renounce a freedom\nthat does not exist, and to recognize a dependence of which we are not\nconsc

So let's cut out everything after that from our text string:

In [57]:
text = text[:start]
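Both trimming steps can be combined into one function. The sketch below uses a hypothetical helper, `trim_to_story`, and assumes that the last occurrence of the start marker is the real beginning of the story (as it was here, where the earlier occurrence was in the table of contents). That assumption may not hold for other Project Gutenberg files.

```python
def trim_to_story(raw, start_marker, end_marker):
    """Keep the text from the last start_marker up to the first end_marker after it."""
    # Assumption: earlier occurrences of the start marker belong to front matter
    start = raw.rfind(start_marker)
    if start == -1:
        raise ValueError(f'start marker not found: {start_marker!r}')
    # Only look for the end marker after the chosen start
    end = raw.find(end_marker, start)
    if end == -1:
        raise ValueError(f'end marker not found: {end_marker!r}')
    return raw[start:end]

# Made-up miniature of the file layout
sample = ('CONTENTS\nBOOK ONE: 1805\n\n'
          'BOOK ONE: 1805\nstory text\n'
          'End of the Project Gutenberg EBook\nlicense')
story = trim_to_story(sample, 'BOOK ONE: 1805', 'End of the Project Gutenberg EBook')
print(repr(story))
```

Raising an error when a marker is missing avoids silently slicing with `-1`, which would otherwise drop the last character or keep the wrong part of the text.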

Let's check the beginning and end of our text string to make sure we have the whole story:

In [58]:
text[:1000]

'BOOK ONE: 1805\n\n\n\n\n\nCHAPTER I\n\n“Well, Prince, so Genoa and Lucca are now just family estates of the\nBuonapartes. But I warn you, if you don’t tell me that this means war,\nif you still try to defend the infamies and horrors perpetrated by that\nAntichrist—I really believe he is Antichrist—I will have nothing\nmore to do with you and you are no longer my friend, no longer my\n‘faithful slave,’ as you call yourself! But how do you do? I see I\nhave frightened you—sit down and tell me all the news.”\n\nIt was in July, 1805, and the speaker was the well-known Anna Pávlovna\nSchérer, maid of honor and favorite of the Empress Márya Fëdorovna.\nWith these words she greeted Prince Vasíli Kurágin, a man of high\nrank and importance, who was the first to arrive at her reception. Anna\nPávlovna had had a cough for some days. She was, as she said, suffering\nfrom la grippe; grippe being then a new word in St. Petersburg, used\nonly by the elite.\n\nAll her invitations without exception, 

In [59]:
text[-1000:]

'e motion of the planets, so in history the difficulty of recognizing\nthe subjection of personality to the laws of space, time, and cause\nlies in renouncing the direct feeling of the independence of one’s own\npersonality. But as in astronomy the new view said: “It is true that we\ndo not feel the movement of the earth, but by admitting its immobility\nwe arrive at absurdity, while by admitting its motion (which we do not\nfeel) we arrive at laws,” so also in history the new view says: “It is\ntrue that we are not conscious of our dependence, but by admitting our\nfree will we arrive at absurdity, while by admitting our dependence on\nthe external world, on time, and on cause, we arrive at laws.”\n\nIn the first case it was necessary to renounce the consciousness of an\nunreal immobility in space and to recognize a motion we did not feel;\nin the present case it is similarly necessary to renounce a freedom\nthat does not exist, and to recognize a dependence of which we are not\nconsc

Now we have the story and only the story in our `text` string.

Let's split the text into chapters. Each chapter is separated by six newline characters, so we can split the string on that:

In [66]:
chapters = text.split('\n'*6)

In [124]:
[ch[:60] for ch in chapters][:10]

['“Well, Prince, so Genoa and Lucca are now just family estate',
 'Anna Pávlovna’s drawing room was gradually filling. The high',
 'Anna Pávlovna’s reception was in full swing. The spindles hu',
 'Just then another visitor entered the drawing room: Prince A',
 '“And what do you think of this latest comedy, the coronation',
 'Having thanked Anna Pávlovna for her charming soiree, the gu',
 'The rustle of a woman’s dress was heard in the next room. Pr',
 'The friends were silent. Neither cared to begin talking. Pie',
 'It was past one o’clock when Pierre left his friend. It was ',
 'Prince Vasíli kept the promise he had given to Princess Drub']

The last one doesn't look like a chapter:

In [75]:
chapters[-1]

'\n\n\n\n\n'

So let's remove it:

In [76]:
chapters.pop()

'\n\n\n\n\n'

In [79]:
chapters[-1][:100]

'CHAPTER XII\n\nFrom the time the law of Copernicus was discovered and proved, the mere\nrecognition of '

Now that looks like the final chapter.
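The leftover entry appears because the text ends with a run of newlines that is not a multiple of six. As a sketch, the split and the cleanup can be done together by discarding any section that contains only whitespace. `split_sections` is a hypothetical helper and the sample text is made up.

```python
def split_sections(story, separator='\n' * 6):
    """Split on the separator and discard sections that are only whitespace."""
    return [section for section in story.split(separator) if section.strip()]

# Made-up miniature: two chapters under one book heading, with 11 trailing newlines
sample = ('BOOK ONE: 1805' + '\n' * 6 +
          'CHAPTER I\n\nFirst.' + '\n' * 6 +
          'CHAPTER II\n\nSecond.' + '\n' * 11)

# A plain split leaves the five newlines that remain after the last separator
leftover = sample.split('\n' * 6)[-1]

sections = split_sections(sample)
print(repr(leftover))
print(sections)
```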

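Let's also remove the book headings, since this should be a list of chapters only. The headings are the only entries shorter than 100 characters, so we can find them by length: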

In [89]:
[ch for ch in chapters if len(ch)<100]

['BOOK ONE: 1805',
 'BOOK TWO: 1805',
 'BOOK THREE: 1805',
 'BOOK FOUR: 1806',
 'BOOK FIVE: 1806 - 07',
 'BOOK SIX: 1808 - 10',
 'BOOK SEVEN: 1810 - 11',
 'BOOK EIGHT: 1811 - 12',
 'BOOK NINE: 1812',
 'BOOK TEN: 1812',
 'BOOK ELEVEN: 1812',
 'BOOK TWELVE: 1812',
 'BOOK THIRTEEN: 1812',
 'BOOK FOURTEEN: 1812',
 'BOOK FIFTEEN: 1812 - 13',
 'FIRST EPILOGUE: 1813 - 20',
 'SECOND EPILOGUE']

In [90]:
# Keep everything the previous cell did not list (the exact complement of len(ch)<100)
chapters = [ch for ch in chapters if len(ch)>=100]
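Filtering by length works for this book, but it would also drop any very short chapter. A more explicit alternative, sketched here, matches the headings by their shape. The pattern `HEADING` and the helper `is_book_heading` are hypothetical, written to fit the seventeen headings listed above, and the sample list is made up.

```python
import re

# Hypothetical pattern: "BOOK <WORD>" or "<WORD> EPILOGUE",
# optionally followed by ": " and a year or year range
HEADING = re.compile(r'(BOOK [A-Z]+|[A-Z]+ EPILOGUE)(: [\d\s-]+)?')

def is_book_heading(section):
    """True if the whole section is a book or epilogue heading."""
    return HEADING.fullmatch(section.strip()) is not None

# Made-up miniature, including a chapter shorter than 100 characters
sections = [
    'BOOK ONE: 1805',
    'CHAPTER I\n\nWell, Prince',
    'BOOK FIVE: 1806 - 07',
    'FIRST EPILOGUE: 1813 - 20',
    'SECOND EPILOGUE',
    'CHAPTER II\n\nShort.',
]
chapters_only = [s for s in sections if not is_book_heading(s)]
print(chapters_only)
```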

In [123]:
[ch[:60] for ch in chapters][:10]

['“Well, Prince, so Genoa and Lucca are now just family estate',
 'Anna Pávlovna’s drawing room was gradually filling. The high',
 'Anna Pávlovna’s reception was in full swing. The spindles hu',
 'Just then another visitor entered the drawing room: Prince A',
 '“And what do you think of this latest comedy, the coronation',
 'Having thanked Anna Pávlovna for her charming soiree, the gu',
 'The rustle of a woman’s dress was heard in the next room. Pr',
 'The friends were silent. Neither cared to begin talking. Pie',
 'It was past one o’clock when Pierre left his friend. It was ',
 'Prince Vasíli kept the promise he had given to Princess Drub']

The chapter titles aren't really part of the story either, so let's remove those.

In [98]:
ch1 = chapters[0][:60]
ch1

'CHAPTER I\n\n“Well, Prince, so Genoa and Lucca are now just fa'

We can see that each chapter begins with "CHAPTER #\n\n", where "#" is the chapter number written as a Roman numeral. We want to cut out that whole substring for each chapter.

Let's make a function that will remove the title from a given chapter:

In [113]:
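def remove_chapter_title(chapter):
    # Match a title such as "CHAPTER XII" followed by a blank line
    chapter_title = re.compile(r'CHAPTER \w+\n\n')
    match = chapter_title.search(chapter)
    # Return the chapter unchanged if no title is found
    if match is None:
        return chapter
    return chapter[match.end():]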

Let's test it on the first chapter:

In [118]:
remove_chapter_title(chapters[0])[:60]

'“Well, Prince, so Genoa and Lucca are now just family estate'

And let's apply it to all the chapters:

In [120]:
chapters = [remove_chapter_title(chapter) for chapter in chapters]

In [122]:
[ch[:60] for ch in chapters][:10]

['“Well, Prince, so Genoa and Lucca are now just family estate',
 'Anna Pávlovna’s drawing room was gradually filling. The high',
 'Anna Pávlovna’s reception was in full swing. The spindles hu',
 'Just then another visitor entered the drawing room: Prince A',
 '“And what do you think of this latest comedy, the coronation',
 'Having thanked Anna Pávlovna for her charming soiree, the gu',
 'The rustle of a woman’s dress was heard in the next room. Pr',
 'The friends were silent. Neither cared to begin talking. Pie',
 'It was past one o’clock when Pierre left his friend. It was ',
 'Prince Vasíli kept the promise he had given to Princess Drub']

It looks like it worked.

Now let's split each chapter into paragraphs. Paragraphs are separated by two newline characters, so we can split each chapter on that:

In [127]:
paragraphs = chapters[0].split('\n'*2)
len(paragraphs)

42

In [131]:
paragraphs[0]

'“Well, Prince, so Genoa and Lucca are now just family estates of the\nBuonapartes. But I warn you, if you don’t tell me that this means war,\nif you still try to defend the infamies and horrors perpetrated by that\nAntichrist—I really believe he is Antichrist—I will have nothing\nmore to do with you and you are no longer my friend, no longer my\n‘faithful slave,’ as you call yourself! But how do you do? I see I\nhave frightened you—sit down and tell me all the news.”'

Let's also replace all the remaining single newline characters with spaces:

In [130]:
paragraphs[0].replace('\n', ' ')

'“Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news.”'

Let's apply these changes in one step to the whole collection of chapters:

In [135]:
chapters = [[p.replace('\n', ' ') for p in ch.split('\n'*2)] for ch in chapters]
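The one-liner above can also be written as a named function, which leaves room for a little extra cleanup. This is a sketch rather than a drop-in replacement: `to_paragraphs` is a hypothetical helper, and unlike the cell above it also collapses repeated whitespace and skips empty paragraphs, which is an added assumption about what is wanted.

```python
def to_paragraphs(chapter):
    """Split a chapter on blank lines and unwrap each paragraph onto one line."""
    paragraphs = []
    for block in chapter.split('\n\n'):
        # str.split() with no arguments splits on any run of whitespace,
        # so joining with single spaces replaces newlines and collapses extra spaces
        unwrapped = ' '.join(block.split())
        if unwrapped:  # skip blocks that were empty or whitespace-only
            paragraphs.append(unwrapped)
    return paragraphs

# Made-up sample with hard-wrapped lines and an extra blank line between paragraphs
sample = ('It was in July, 1805, and the\nspeaker was Anna.'
          '\n\n\n\n'
          'All her invitations\nwere written in French.')
result = to_paragraphs(sample)
print(result)
```

Applied to the whole book, this would be `[to_paragraphs(ch) for ch in chapters]`, run in place of the cell above and on the chapter strings before they are split.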

Now we can index into chapters/paragraphs:

In [143]:
chapters[5][8]

'Two footmen, the princess’ and his own, stood holding a shawl and a cloak, waiting for the conversation to finish. They listened to the French sentences which to them were meaningless, with an air of understanding but not wishing to appear to do so. The princess as usual spoke smilingly and listened with a laugh.'

Now let's join it all back into one long string, but with some helpful markers to show where chapters start and end, and also where paragraph breaks happen.

In [165]:
paragraphs = []
for chapter in chapters:
    paragraphs.append('[start-chapter]')
    for paragraph in chapter:
        paragraphs.append(paragraph)
    paragraphs.append('[end-chapter]')

In [166]:
len(paragraphs)

12070

In [167]:
paragraphs[:3]

['[start-chapter]',
 '“Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news.”',
 'It was in July, 1805, and the speaker was the well-known Anna Pávlovna Schérer, maid of honor and favorite of the Empress Márya Fëdorovna. With these words she greeted Prince Vasíli Kurágin, a man of high rank and importance, who was the first to arrive at her reception. Anna Pávlovna had had a cough for some days. She was, as she said, suffering from la grippe; grippe being then a new word in St. Petersburg, used only by the elite.']

In [168]:
' [paragraph-break] '.join(paragraphs[:3])

'[start-chapter] [paragraph-break] “Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news.” [paragraph-break] It was in July, 1805, and the speaker was the well-known Anna Pávlovna Schérer, maid of honor and favorite of the Empress Márya Fëdorovna. With these words she greeted Prince Vasíli Kurágin, a man of high rank and importance, who was the first to arrive at her reception. Anna Pávlovna had had a cough for some days. She was, as she said, suffering from la grippe; grippe being then a new word in St. Petersburg, used only by the elite.'

In [169]:
text = ' [paragraph-break] '.join(paragraphs)
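One way to check that the markers preserve the structure is to parse the marked-up string back into chapters and paragraphs. The sketch below does this on a made-up two-chapter example. `parse_marked_text` is a hypothetical helper, and it assumes the marker strings never occur inside the book's own text.

```python
PARAGRAPH_BREAK = ' [paragraph-break] '

def parse_marked_text(marked):
    """Rebuild a list of chapters (each a list of paragraphs) from the marked-up string."""
    chapters, current = [], None
    for piece in marked.split(PARAGRAPH_BREAK):
        if piece == '[start-chapter]':
            current = []  # begin collecting a new chapter
        elif piece == '[end-chapter]' and current is not None:
            chapters.append(current)
            current = None
        elif current is not None:
            current.append(piece)
    return chapters

# Made-up miniature, built the same way as the notebook builds `text`
original = [['First paragraph.', 'Second paragraph.'], ['Only paragraph.']]
flat = []
for chapter in original:
    flat.extend(['[start-chapter]', *chapter, '[end-chapter]'])
marked = PARAGRAPH_BREAK.join(flat)

restored = parse_marked_text(marked)
print(restored == original)
```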

In [172]:
text[:1000]

'[start-chapter] [paragraph-break] “Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news.” [paragraph-break] It was in July, 1805, and the speaker was the well-known Anna Pávlovna Schérer, maid of honor and favorite of the Empress Márya Fëdorovna. With these words she greeted Prince Vasíli Kurágin, a man of high rank and importance, who was the first to arrive at her reception. Anna Pávlovna had had a cough for some days. She was, as she said, suffering from la grippe; grippe being then a new word in St. Petersburg, used only by the elite. [paragraph-break] All her invitations withou