<a href="https://colab.research.google.com/github/mkane968/Extracted-Features/blob/master/notebooks/1_Text_Sectioning_%26_Disaggregation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Sectioning and Disaggregation

## Introduction

Use this code to clean, section, and disaggregate texts and corpora. 

**Why Perform Text Sectioning?** 

Dividing texts into sections (for example, chapters or chunks of N length) is valuable as a precursor to topic modeling and other forms of computational analysis which perform more accurately when applied to groups of segmented documents from longer texts. 

**Why Disaggregate Texts?** 

The process of disaggregating the words in texts (in this case, by alphabetizing them) also creates data sets that can be shared freely where original texts cannot be due to copyright restrictions. 

*Input/Output Specifications:* 

This code requires plain txt files as input, either those from this repository's sample_data folder or those from a local machine. It returns csv files with disaggregated text grouped by chapter or chunk of n length.

## Upload and Add Text Files To Pandas DataFrame
In this section, text files are loaded into a Pandas DataFrame. Pandas is a fast and relatively easy way to work with large datasets. Though data frames are typically associated with numeric data, Pandas also offers many functions for [working with textual data](https://www.tutorialspoint.com/python_pandas/python_pandas_working_with_text_data.htm).

In [5]:
#Import os and glob
import glob
import os

#Import pandas
import pandas as pd

#Import nltk for tokenization 
import nltk
from nltk.tokenize.treebank import TreebankWordDetokenizer
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [6]:
#Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [9]:
#Select all files to upload
from google.colab import files

uploaded = files.upload()

Saving 1969_RICHMOND-RICHMOND_PHOENIXSHIP.txt to 1969_RICHMOND-RICHMOND_PHOENIXSHIP.txt
Saving 1969_KAMIN_EARTHRIM.txt to 1969_KAMIN_EARTHRIM.txt
Saving 1969_TUBB_TOYMAN.txt to 1969_TUBB_TOYMAN.txt
Saving 1969_KOONTZ_FEARTHATMAN.txt to 1969_KOONTZ_FEARTHATMAN.txt
Saving 1971_KAMIN_THEHERODMEN.txt to 1971_KAMIN_THEHERODMEN.txt
Saving 1971_RACKHAM_DARKPLANET.txt to 1971_RACKHAM_DARKPLANET.txt
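
Outside Colab, `files.upload()` is not available. The following is a minimal sketch, not part of the original notebook, of how plain txt files in a local folder could be read into the same shape as `uploaded` (a dictionary mapping each filename to its raw bytes). The function name `read_txt_folder` is hypothetical.

```python
import glob
import os


def read_txt_folder(folder):
    """Return {filename: raw bytes} for every .txt file directly inside `folder`."""
    texts = {}
    # Sort the paths so the row order is reproducible across runs
    for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        with open(path, "rb") as handle:
            texts[os.path.basename(path)] = handle.read()
    return texts
```

Assuming the folder exists, `uploaded = read_txt_folder("sample_data")` could then stand in for the upload step, and the `pd.DataFrame.from_dict(uploaded, orient='index')` call in the next cell would work unchanged. The exact path to the sample data on your machine is an assumption.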


In [10]:
#Add files into dataframe
import pandas as pd

books = pd.DataFrame.from_dict(uploaded, orient='index')
books.head()

Unnamed: 0,0
1969_RICHMOND-RICHMOND_PHOENIXSHIP.txt,b'\xef\xbb\xbfPHOENIX SHIP\r\n Walt and Leigh ...
1969_KAMIN_EARTHRIM.txt,b'\xef\xbb\xbfEARTHRIM\r\nNICK KAMIN\r\nThe ma...
1969_TUBB_TOYMAN.txt,b'\xef\xbb\xbfTOYMAN\r\nE.C.TUBB\r\nTHEY CALLE...
1969_KOONTZ_FEARTHATMAN.txt,b'\xef\xbb\xbfThe galaxy had forgotten war and...
1971_KAMIN_THEHERODMEN.txt,b'\xef\xbb\xbfTHE\r\nHEROD MEN\r\nNICK KAMIN\r...


In [13]:
#Reset index and add column names to make wrangling easier
books = books.reset_index()
books.columns = ["Title", "Text"]
books

Unnamed: 0,Title,Text
0,1969_RICHMOND-RICHMOND_PHOENIXSHIP.txt,b'\xef\xbb\xbfPHOENIX SHIP\r\n Walt and Leigh ...
1,1969_KAMIN_EARTHRIM.txt,b'\xef\xbb\xbfEARTHRIM\r\nNICK KAMIN\r\nThe ma...
2,1969_TUBB_TOYMAN.txt,b'\xef\xbb\xbfTOYMAN\r\nE.C.TUBB\r\nTHEY CALLE...
3,1969_KOONTZ_FEARTHATMAN.txt,b'\xef\xbb\xbfThe galaxy had forgotten war and...
4,1971_KAMIN_THEHERODMEN.txt,b'\xef\xbb\xbfTHE\r\nHEROD MEN\r\nNICK KAMIN\r...
5,1971_RACKHAM_DARKPLANET.txt,b'\xef\xbb\xbfDARK PLANET\r\nJOHN RACKHAM\r\nS...


In [None]:
#Set dataframes as Colab data tables
from google.colab import data_table
data_table.enable_dataframe_formatter()

## Perform Minimal Cleaning and Set Parameters for Sectioning 
Several basic cleaning steps are applied: the file extension is removed from titles, and encoding and newline characters are removed from texts. Parameters are then set for the part(s) of each text to include in sectioning. In the SciFi Corpus project, "START OF BOOK" and "END OF BOOK" tags were added to delineate the body of each text. Code in this section removes any text outside those start and end tags (e.g., title page, copyright page, and other paratext).

In [14]:
books_cleaned = books.copy()

In [15]:
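#Remove .txt extension from titles (the dot is escaped and the pattern anchored so only the extension matches;
#replacing with an empty string avoids leaving a trailing space in the title)
books_cleaned['Title'] = books_cleaned['Title'].str.replace(r'\.txt$', '', regex=True)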
books_cleaned.head()

Unnamed: 0,Title,Text
0,1969_RICHMOND-RICHMOND_PHOENIXSHIP,b'\xef\xbb\xbfPHOENIX SHIP\r\n Walt and Leigh ...
1,1969_KAMIN_EARTHRIM,b'\xef\xbb\xbfEARTHRIM\r\nNICK KAMIN\r\nThe ma...
2,1969_TUBB_TOYMAN,b'\xef\xbb\xbfTOYMAN\r\nE.C.TUBB\r\nTHEY CALLE...
3,1969_KOONTZ_FEARTHATMAN,b'\xef\xbb\xbfThe galaxy had forgotten war and...
4,1971_KAMIN_THEHERODMEN,b'\xef\xbb\xbfTHE\r\nHEROD MEN\r\nNICK KAMIN\r...


In [16]:
#Decode bytes to text; 'utf-8-sig' also strips the leading byte order mark (b'\xef\xbb\xbf')
books_cleaned['Text'] = books_cleaned['Text'].apply(lambda x: x.decode('utf-8-sig', errors="ignore"))

#Collapse newlines, carriage returns, and other runs of whitespace into single spaces
books_cleaned['Text'] = books_cleaned['Text'].str.replace(r'\s+', ' ', regex=True)
books_cleaned

Unnamed: 0,Title,Text
0,1969_RICHMOND-RICHMOND_PHOENIXSHIP,﻿PHOENIX SHIP Walt and Leigh Richmond phoenix ...
1,1969_KAMIN_EARTHRIM,﻿EARTHRIM NICK KAMIN The man who stopped the w...
2,1969_TUBB_TOYMAN,"﻿TOYMAN E.C.TUBB THEY CALLED THEIR PLANET TOY,..."
3,1969_KOONTZ_FEARTHATMAN,﻿The galaxy had forgotten war and evil—until t...
4,1971_KAMIN_THEHERODMEN,﻿THE HEROD MEN NICK KAMIN Planned death vs. un...
5,1971_RACKHAM_DARKPLANET,﻿DARK PLANET JOHN RACKHAM Step Two was the Spa...


In [17]:
#Split book on start of book tag, keep text only after start of book tag
start = books_cleaned["Text"].str.split("START OF", expand = True)
books_cleaned['Text'] = start[1]

In [None]:
#For Project Gutenberg books, remove the rest of the start-of-ebook label left behind by the split above
books_cleaned['Text'] = books_cleaned['Text'].str.replace('THIS PROJECT GUTENBERG EBOOK', '', regex=False)
books_cleaned

In [19]:
#Split book on end of book tag, keep only text before end of book tag
end = books_cleaned["Text"].str.split("END OF", expand = True)
books_cleaned['Text'] = end[0]
books_cleaned

Unnamed: 0,Title,Text
0,1969_RICHMOND-RICHMOND_PHOENIXSHIP,BOOK Lullaby for our Space Children Parameter...
1,1969_KAMIN_EARTHRIM,"BOOK CHAPTER 1 “About that shoulder,” the doc..."
2,1969_TUBB_TOYMAN,BOOK CHAPTER 1 Tor thirty hours the sun had a...
3,1969_KOONTZ_FEARTHATMAN,BOOK PART 1 PURPOSE And ye shall seek a new o...
4,1971_KAMIN_THEHERODMEN,BOOK CHAPTER 1 He stepped onto the morning ba...
5,1971_RACKHAM_DARKPLANET,BOOK CHAPTER 1 He stood up to his knees in ho...


In [20]:
#Check that text is cleaned and sectioned
books_cleaned.iloc[0]['Text']

' BOOK Lullaby for our Space Children Parameter, perimeter, and pi— There’s a trace in the space past the sky that is I There’s a me in the lee of this starred infinity That is out to prove the ethic that the universe is free. We’re a shout in the snout of eternities of doubt— We’re a spit in the mitt as we take our aim to hit—in the eye— The multitude of factors that will try to nullify. Our parameters, perimeters, and pi. Astronomy and chemistry and math— If you know where to go and your slipstick’s not too slow (don’t be slow!) Where electrons meet the nucleus of mass And protons go along the selfsame path Where the multiples of decimals that mark the whirling sphericals Indicate there may be trouble coming past— It’s a laugh ... if you’re fast With astronomy and chemistry and math. Diameter, circumference, and sphere— Space may not yet have noticed, but we’re here—Space, we’re here! With spectrographic dazzle and a certain yen to travel And the love of work that brings the concepts
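
The cells above split on the shorter strings "START OF" and "END OF", which is why each text still begins with the leftover word "BOOK". As an alternative, here is a minimal sketch (not the notebook's own method) that cuts on the full tags with plain Python. The function name `extract_body` is hypothetical, and it assumes each tag appears once per text.

```python
def extract_body(text, start_tag="START OF BOOK", end_tag="END OF BOOK"):
    """Return the text between the first start tag and the next end tag."""
    _, found_start, after_start = text.partition(start_tag)
    if not found_start:
        # No start tag: return None so the book can be flagged and checked by hand
        return None
    # If the end tag is missing, partition() keeps everything after the start tag
    body, _, _ = after_start.partition(end_tag)
    return body.strip()
```

It could be applied with `books_cleaned['Text'].apply(extract_body)`. Returning `None` for untagged texts is a design choice made here for illustration; it makes missing tags visible rather than silently keeping paratext.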

## Section Texts By Chapter Headings
When working with texts with clearly delineated chapters, using chapter headings is a relatively easy way to section texts into segments of (relatively) the same size. After checking the chapter counts for each text to confirm whether sectioning by chapter is a useful procedure, this code iterates through the texts and splits them each time it encounters a new "chapter" heading. From here, the text from each chapter is appended to a new dataframe and denoted by book and chapter number. 

In [21]:
#Count number of chapters in each text
chapter_counts = books_cleaned['Text'].str.count('CHAPTER')

#Append chapter counts to dataframe
books_cleaned["Chapters"] = chapter_counts
books_cleaned

Unnamed: 0,Title,Text,Chapters
0,1969_RICHMOND-RICHMOND_PHOENIXSHIP,BOOK Lullaby for our Space Children Parameter...,10
1,1969_KAMIN_EARTHRIM,"BOOK CHAPTER 1 “About that shoulder,” the doc...",10
2,1969_TUBB_TOYMAN,BOOK CHAPTER 1 Tor thirty hours the sun had a...,11
3,1969_KOONTZ_FEARTHATMAN,BOOK PART 1 PURPOSE And ye shall seek a new o...,32
4,1971_KAMIN_THEHERODMEN,BOOK CHAPTER 1 He stepped onto the morning ba...,14
5,1971_RACKHAM_DARKPLANET,BOOK CHAPTER 1 He stood up to his knees in ho...,12


In [22]:
#Make new cell each time new chapter starts 
new = books_cleaned["Text"].str.split("CHAPTER", expand = True).set_index(books_cleaned['Title'])
new

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,23,24,25,26,27,28,29,30,31,32
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1969_RICHMOND-RICHMOND_PHOENIXSHIP,BOOK Lullaby for our Space Children Parameter...,1 His name was Stanley Thomas Arthur Reginald...,"2 Professor Mallard stood, cloaked but unhood...","3 Stan arrived at Termdock, White Sands, and ...","4 Stan reached Orbdock, Mars, still preoccupi...",5 The silence went on and on; and Stan waited...,"6 “They’ve spotted us,” Stan said furiously. ...",7 Stan walked into Weed’s office with his hea...,8 Time. Time was the factor both at the Ace Y...,9 The general was seated on the far side of t...,...,,,,,,,,,,
1969_KAMIN_EARTHRIM,BOOK,"1 “About that shoulder,” the doctor was sayin...",2 She had been sitting at the bar when he ent...,"3 Quinn answered the phone. “Hello, Doctor,” ...",4 Standard glowered and weighed his trembling...,5 Standard was suffocating in a fine brown ha...,6 The tip of the electron pencil shimmered a ...,7 The streets were quiet in the working class...,"8 Four people in the back were playing Quod, ...",9 The morning broke slowly in a blast of red ...,...,,,,,,,,,,
1969_TUBB_TOYMAN,BOOK,1 Tor thirty hours the sun had arched across ...,"2 Leon Hurl, Stockholder of Toy, woke two hou...","3 Dumarest sighed, stretched, jerked fully aw...","4 Battle had been done on a table, men attack...","5 There were guards at the gate, a small knot...","6 A man lay whimpering, crying, the tears str...",7 Leon Hurl lifted the delicate porcelain of ...,7 Dividend Day and all Toy threw a party. The...,"9 Legrain paced the floor, scowling, obviousl...",...,,,,,,,,,,
1969_KOONTZ_FEARTHATMAN,BOOK PART 1 PURPOSE And ye shall seek a new o...,1 When he woke from a featureless dream of si...,"2 Like a grotesquely misshapen fruit, the bod...",3 Hurkos came padding down the narrow corrido...,"4 The thunders, as soon as Sam had thrown the...","5 Gnossos tore his hand out of the machine, r...","6 “Me?” “Well, not really from your mind. Thr...","7 In their wandering, they came across many t...","8 The Inferno was a bar. But more than a bar,...","9 The water, chemicals, and lubricants flowed...",...,7 Coro quickly wiped the perspiration that ha...,"8 Coro used the medikit preparedermics, injec...",9 Food-slugs as large as houses lay pulsating...,"10 “Just as I thought,” Coro said. “They woul...",11 Like a needle sinking through a jar of Sty...,12 The Inferno was just as he remembered it. ...,13 The Central Being was overwhelmed by Hope....,"14 Sam gritted his teeth, fought against clos...",15 Raceship had settled in the vast wild game...,16 Buronto stepped further into the chamber. ...
1971_KAMIN_THEHERODMEN,BOOK,1 He stepped onto the morning balcony and let...,2 The cold evening was approaching by the tim...,3 “What do you think?” she asked as the drive...,4 ArchCommodore Gudtsler was in uncommonly go...,5 For the third consecutive day the morning w...,6 “We must leave this world o£ corruption and...,"7 “Feels good out today,” Matter said. “And I...",8 Sergeant Kulcheski saw them coming up the f...,9 The black and purple uniform abraded nerve ...,...,,,,,,,,,,
1971_RACKHAM_DARKPLANET,BOOK,"1 He stood up to his knees in hot mud, the we...","2 “Stephen Query, Instrumentman, First Class....","3 We’ve blown free! he thought, pushing his a...","4 He took her shoulders, pushed her away to s...",5 5 “It was a monster!” Lieutenant Evans babb...,"6 Evans seemed to wilt a little, and Query co...",7 Query got slowly to his feet. In the face o...,8 Query shook himself free of his stunned hor...,"9 The wordless chanting, the thudding primiti...",...,,,,,,,,,,


In [23]:
#Flatten dataframe so each chapter is on own row, designated by book and chapter 
chapters_df = new.stack().reset_index()
chapters_df.columns = ["Book", "Chapter", "Text"]
chapters_df

Unnamed: 0,Book,Chapter,Text
0,1969_RICHMOND-RICHMOND_PHOENIXSHIP,0,BOOK Lullaby for our Space Children Parameter...
1,1969_RICHMOND-RICHMOND_PHOENIXSHIP,1,1 His name was Stanley Thomas Arthur Reginald...
2,1969_RICHMOND-RICHMOND_PHOENIXSHIP,2,"2 Professor Mallard stood, cloaked but unhood..."
3,1969_RICHMOND-RICHMOND_PHOENIXSHIP,3,"3 Stan arrived at Termdock, White Sands, and ..."
4,1969_RICHMOND-RICHMOND_PHOENIXSHIP,4,"4 Stan reached Orbdock, Mars, still preoccupi..."
...,...,...,...
90,1971_RACKHAM_DARKPLANET,8,8 Query shook himself free of his stunned hor...
91,1971_RACKHAM_DARKPLANET,9,"9 The wordless chanting, the thudding primiti..."
92,1971_RACKHAM_DARKPLANET,10,10 Query was so shaken as to be speechless fo...
93,1971_RACKHAM_DARKPLANET,11,11 He raised himself on an arm to look down i...


In [24]:
#Tidying the DF
#Combine book and chapter labels into one column
chapters_df['Book + Chapter'] = chapters_df['Book'].astype(str) + '_Chapter_' + chapters_df['Chapter'].astype(str)

#Keep only the combined label and the text, with the label as first column
#(this also drops the separate Book and Chapter columns)
chapters_df = chapters_df.reindex(columns=["Book + Chapter", "Text"])
chapters_df

Unnamed: 0,Book + Chapter,Text
0,1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_0,BOOK Lullaby for our Space Children Parameter...
1,1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_1,1 His name was Stanley Thomas Arthur Reginald...
2,1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_2,"2 Professor Mallard stood, cloaked but unhood..."
3,1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_3,"3 Stan arrived at Termdock, White Sands, and ..."
4,1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_4,"4 Stan reached Orbdock, Mars, still preoccupi..."
...,...,...
90,1971_RACKHAM_DARKPLANET _Chapter_8,8 Query shook himself free of his stunned hor...
91,1971_RACKHAM_DARKPLANET _Chapter_9,"9 The wordless chanting, the thudding primiti..."
92,1971_RACKHAM_DARKPLANET _Chapter_10,10 Query was so shaken as to be speechless fo...
93,1971_RACKHAM_DARKPLANET _Chapter_11,11 He raised himself on an arm to look down i...
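
Splitting on the bare string "CHAPTER" also splits wherever that word occurs in running prose, and the chapter number used in the label is the column position rather than the number printed in the heading. The sketch below is one possible alternative, assuming headings have the form "CHAPTER" followed by an Arabic numeral, as in the sample texts shown above. The names `CHAPTER_HEADING` and `split_chapters` are hypothetical.

```python
import re

# Assumed heading form: the word CHAPTER followed by a number, e.g. "CHAPTER 12"
CHAPTER_HEADING = re.compile(r"\bCHAPTER\s+(\d+)\b")


def split_chapters(text):
    """Return (front_matter, [(chapter_number, chapter_text), ...])."""
    # With one capturing group, re.split alternates: text, number, text, number, ...
    parts = CHAPTER_HEADING.split(text)
    front_matter = parts[0].strip()
    chapters = [
        (int(number), body.strip())
        for number, body in zip(parts[1::2], parts[2::2])
    ]
    return front_matter, chapters
```

Note that printed chapter numbers are not always unique within a book: in the wide table above, the Koontz text restarts its numbering partway through, so a label built from the printed number alone could collide.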


## Section Chapters by Chunks of N Length
Though chapter headings are useful for splitting texts into roughly equal segments, chapter lengths can still vary widely, especially in large corpora. To segment texts further, each chapter can be divided into chunks of N tokens.

In [25]:
#Create new df to work with chunks
new_chapters_df = chapters_df.copy()

#Get number of words in each chapter (helps to determine chunk length)
ch_words = new_chapters_df["Text"].apply(lambda x: len(str(x).split(' ')))

#Append word counts to dataframe
new_chapters_df["Word Count"] = ch_words
new_chapters_df

Unnamed: 0,Book + Chapter,Text,Word Count
0,1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_0,BOOK Lullaby for our Space Children Parameter...,350
1,1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_1,1 His name was Stanley Thomas Arthur Reginald...,4592
2,1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_2,"2 Professor Mallard stood, cloaked but unhood...",2782
3,1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_3,"3 Stan arrived at Termdock, White Sands, and ...",3187
4,1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_4,"4 Stan reached Orbdock, Mars, still preoccupi...",4506
...,...,...,...
90,1971_RACKHAM_DARKPLANET _Chapter_8,8 Query shook himself free of his stunned hor...,3460
91,1971_RACKHAM_DARKPLANET _Chapter_9,"9 The wordless chanting, the thudding primiti...",3435
92,1971_RACKHAM_DARKPLANET _Chapter_10,10 Query was so shaken as to be speechless fo...,3580
93,1971_RACKHAM_DARKPLANET _Chapter_11,11 He raised himself on an arm to look down i...,3919


In [26]:
#Tokenize Text
new_chapters_df['Tokens'] = new_chapters_df.apply(lambda row: nltk.word_tokenize(row['Text']), axis=1)
new_chapters_df

Unnamed: 0,Book + Chapter,Text,Word Count,Tokens
0,1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_0,BOOK Lullaby for our Space Children Parameter...,350,"[BOOK, Lullaby, for, our, Space, Children, Par..."
1,1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_1,1 His name was Stanley Thomas Arthur Reginald...,4592,"[1, His, name, was, Stanley, Thomas, Arthur, R..."
2,1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_2,"2 Professor Mallard stood, cloaked but unhood...",2782,"[2, Professor, Mallard, stood, ,, cloaked, but..."
3,1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_3,"3 Stan arrived at Termdock, White Sands, and ...",3187,"[3, Stan, arrived, at, Termdock, ,, White, San..."
4,1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_4,"4 Stan reached Orbdock, Mars, still preoccupi...",4506,"[4, Stan, reached, Orbdock, ,, Mars, ,, still,..."
...,...,...,...,...
90,1971_RACKHAM_DARKPLANET _Chapter_8,8 Query shook himself free of his stunned hor...,3460,"[8, Query, shook, himself, free, of, his, stun..."
91,1971_RACKHAM_DARKPLANET _Chapter_9,"9 The wordless chanting, the thudding primiti...",3435,"[9, The, wordless, chanting, ,, the, thudding,..."
92,1971_RACKHAM_DARKPLANET _Chapter_10,10 Query was so shaken as to be speechless fo...,3580,"[10, Query, was, so, shaken, as, to, be, speec..."
93,1971_RACKHAM_DARKPLANET _Chapter_11,11 He raised himself on an arm to look down i...,3919,"[11, He, raised, himself, on, an, arm, to, loo..."


In [27]:
#Define chunking function
def split(list_a, chunk_size):
  for i in range(0, len(list_a), chunk_size):
    yield list_a[i:i + chunk_size]

#Set desired size of chunks
chunk_size = 1000

#Create new list for chunked sentences
chunked_ch = []

#Perform chunking function on each row of tokens
s = new_chapters_df['Tokens']
for content in s:
  chunks = list(split(content, chunk_size))
  #Add to new list
  chunked_ch.append(chunks)
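
To make the behavior of the chunking function concrete, here is a small worked example on a list of ten items. It also shows that the final chunk can be much shorter than the others. The second function, `split_merging_short_tail`, and its `min_size` parameter are hypothetical additions, sketched here as one way to fold a very short final chunk into the one before it.

```python
def split(list_a, chunk_size):
    """Yield consecutive slices of list_a, each at most chunk_size items long."""
    for i in range(0, len(list_a), chunk_size):
        yield list_a[i:i + chunk_size]


def split_merging_short_tail(list_a, chunk_size, min_size):
    """Like split(), but fold a final chunk shorter than min_size into the previous one."""
    chunks = list(split(list_a, chunk_size))
    if len(chunks) > 1 and len(chunks[-1]) < min_size:
        chunks[-2] = chunks[-2] + chunks[-1]
        chunks.pop()
    return chunks


tokens = list(range(10))
print(list(split(tokens, 4)))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(split_merging_short_tail(tokens, 4, min_size=3))
# [[0, 1, 2, 3], [4, 5, 6, 7, 8, 9]]
```

Whether short final chunks matter depends on the analysis; merging them trades strictly equal chunk sizes for fewer very small documents.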


In [28]:
#Create dictionary to associate chunks with titles
keys = new_chapters_df['Book + Chapter']
values = chunked_ch

#Pair each title with its list of chunks
res = dict(zip(keys, values))

In [29]:
#Add chunks to new dataframe
chunked_ch_df = pd.DataFrame.from_dict(res, orient='index')
chunked_ch_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_0,"[BOOK, Lullaby, for, our, Space, Children, Par...",,,,,,,,
1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_1,"[1, His, name, was, Stanley, Thomas, Arthur, R...","[understanding, of, the, society, and, possibl...","[one, of, those, relegated, to, merely, fine, ...","[The, cold, air, bit, into, his, lungs, ,, and...","[vast, listening, audience, ., “, The, Belters...","[Tom, ’, s, neck, ., “, What, shall, I, break,...",,,
1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_2,"[2, Professor, Mallard, stood, ,, cloaked, but...","[no, longer, with, us, ., He, has, not, been, ...","[required, information, to, be, filtered, thro...","[shook, him, ,, and, he, drew, his, cloak, of,...",,,,,
1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_3,"[3, Stan, arrived, at, Termdock, ,, White, San...","[,, but, usually, go, first, to, the, spincent...","[to, the, Belt, ., I, ’, ll, have, a, berth, f...","[in, annoyance, ., “, A, lack, of, perception,...",,,,,
1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_4,"[4, Stan, reached, Orbdock, ,, Mars, ,, still,...","[”, “, He, ’, s, my, passenger., ”, Stan, grin...","[,, and, would, build, them, to, tremendous, v...","[I, ’, m, going, to, put, us, under, drive, to...","[The, other, looked, at, him, queerly, ., “, Y...","[but—well, ,, you, got, off, the, ship, withou...",,,


In [30]:
#Reset dataframe index and rename columns
chunked_ch_df = chunked_ch_df.stack().reset_index()
chunked_ch_df.columns = ["Title","Chunk","Text"]
chunked_ch_df

Unnamed: 0,Title,Chunk,Text
0,1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_0,0,"[BOOK, Lullaby, for, our, Space, Children, Par..."
1,1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_1,0,"[1, His, name, was, Stanley, Thomas, Arthur, R..."
2,1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_1,1,"[understanding, of, the, society, and, possibl..."
3,1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_1,2,"[one, of, those, relegated, to, merely, fine, ..."
4,1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_1,3,"[The, cold, air, bit, into, his, lungs, ,, and..."
...,...,...,...
386,1971_RACKHAM_DARKPLANET _Chapter_12,0,"[12, “, I, don, ’, t, see, why, not, !, ”, Eva..."
387,1971_RACKHAM_DARKPLANET _Chapter_12,1,"[s, all, peaceful, and, quiet, here, ., Beauti..."
388,1971_RACKHAM_DARKPLANET _Chapter_12,2,"[was, gone, ., ., ., there, was, a, moment, of..."
389,1971_RACKHAM_DARKPLANET _Chapter_12,3,"[we, had, to, put, these, suits, on, ., That, ..."
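
The steps above go from a dictionary to a wide table (one column per chunk, padded with empty cells) and then stack it into one row per chunk. As a sketch of an alternative, the long format can be built directly as a list of records. The function name `chunk_records` is hypothetical.

```python
def chunk_records(titles, token_lists, chunk_size):
    """Return one (title, chunk_number, tokens) tuple per chunk, in reading order."""
    records = []
    for title, tokens in zip(titles, token_lists):
        for start in range(0, len(tokens), chunk_size):
            chunk_number = start // chunk_size
            records.append((title, chunk_number, tokens[start:start + chunk_size]))
    return records
```

The result could be passed to `pd.DataFrame(records, columns=["Title", "Chunk", "Text"])`. Because no dictionary is involved, two rows that happen to share the same title would both be kept, whereas dictionary keys must be unique.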


In [31]:
#Tidying the DF
#Combine book and chunk labels into one column
chunked_ch_df['Book + Chunk'] = chunked_ch_df['Title'].astype(str) + ' Chunk ' + chunked_ch_df['Chunk'].astype(str)

#Detokenize text
detokenizer = TreebankWordDetokenizer()
chunked_ch_df['Text'] = chunked_ch_df['Text'].apply(detokenizer.detokenize)

#Keep only the combined label and the text, with the label as first column
#(this also drops the separate Title and Chunk columns)
chunked_ch_df = chunked_ch_df.reindex(columns=["Book + Chunk", "Text"])

#Print cleaned df
chunked_ch_df

Unnamed: 0,Book + Chunk,Text
0,1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_0 ...,"BOOK Lullaby for our Space Children Parameter,..."
1,1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_1 ...,1 His name was Stanley Thomas Arthur Reginald ...
2,1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_1 ...,understanding of the society and possible usef...
3,1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_1 ...,one of those relegated to merely fine courses ...
4,1969_RICHMOND-RICHMOND_PHOENIXSHIP _Chapter_1 ...,"The cold air bit into his lungs, and he fasten..."
...,...,...
386,1971_RACKHAM_DARKPLANET _Chapter_12 Chunk 0,12 “ I don ’ t see why not! ” Evans declared p...
387,1971_RACKHAM_DARKPLANET _Chapter_12 Chunk 1,"s all peaceful and quiet here . Beautiful, if ..."
388,1971_RACKHAM_DARKPLANET _Chapter_12 Chunk 2,was gone . . . there was a moment of bright ra...
389,1971_RACKHAM_DARKPLANET _Chapter_12 Chunk 3,we had to put these suits on . That ’ s the fi...
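
In the detokenized output above, typographic quotes and apostrophes stay separated from the surrounding words (for example "don ’ t"). One possible workaround, sketched here as an assumption rather than a tested fix for this corpus, is to convert typographic quotes to their ASCII equivalents before tokenizing, since the tokenizer's contraction handling appears to expect ASCII apostrophes. The names `QUOTE_MAP` and `straighten_quotes` are hypothetical.

```python
# Map typographic (curly) quotes to their ASCII equivalents
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def straighten_quotes(text):
    """Return `text` with curly quotes replaced by straight ASCII quotes."""
    return text.translate(QUOTE_MAP)
```

This changes the characters in the text, so it is only appropriate when the distinction between curly and straight quotes does not matter for later analysis.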


## Section Texts By Chunks of N Length
When working with texts *without* discernible chapter headings, or with chapter headings too infrequent to split texts into meaningful segments, texts can instead be sectioned into chunks of N tokens, where N is a variable that can be set below. After checking the word counts for each text to determine an appropriate chunk size, this code iterates through the tokenized texts and starts a new chunk every N tokens. From here, the text from each chunk is appended to a new dataframe and denoted by book and chunk number.

In [32]:
#Get number of words in each book (helps to determine chunk length)
words = books_cleaned["Text"].apply(lambda x: len(str(x).split(' ')))

#Append word counts to dataframe
books_cleaned["Word Count"] = words
books_cleaned

Unnamed: 0,Title,Text,Chapters,Word Count
0,1969_RICHMOND-RICHMOND_PHOENIXSHIP,BOOK Lullaby for our Space Children Parameter...,10,38411
1,1969_KAMIN_EARTHRIM,"BOOK CHAPTER 1 “About that shoulder,” the doc...",10,53193
2,1969_TUBB_TOYMAN,BOOK CHAPTER 1 Tor thirty hours the sun had a...,11,44993
3,1969_KOONTZ_FEARTHATMAN,BOOK PART 1 PURPOSE And ye shall seek a new o...,32,46293
4,1971_KAMIN_THEHERODMEN,BOOK CHAPTER 1 He stepped onto the morning ba...,14,50032
5,1971_RACKHAM_DARKPLANET,BOOK CHAPTER 1 He stood up to his knees in ho...,12,39522
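
A quick way to judge a chunk size is to estimate how many chunks it would produce. The sketch below uses a hypothetical helper, `expected_chunks`, and the word count shown above for the last book.

```python
import math


def expected_chunks(n_tokens, chunk_size):
    """Number of chunks produced when n_tokens are cut every chunk_size tokens."""
    return math.ceil(n_tokens / chunk_size)


# 39,522 whitespace-separated words cut into chunks of 500
print(expected_chunks(39522, 500))  # 80
```

Treat this as a lower bound: the word counts here come from splitting on spaces, while the chunking below operates on NLTK tokens, which count punctuation marks separately. In the output further down, the same book ends at chunk number 96, i.e. 97 chunks of up to 500 tokens.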


In [33]:
#Tokenize Text
books_cleaned['Text'] = books_cleaned['Text'].astype(str)
books_cleaned['Tokens'] = books_cleaned.apply(lambda row: nltk.word_tokenize(row['Text']), axis=1)
books_cleaned

Unnamed: 0,Title,Text,Chapters,Word Count,Tokens
0,1969_RICHMOND-RICHMOND_PHOENIXSHIP,BOOK Lullaby for our Space Children Parameter...,10,38411,"[BOOK, Lullaby, for, our, Space, Children, Par..."
1,1969_KAMIN_EARTHRIM,"BOOK CHAPTER 1 “About that shoulder,” the doc...",10,53193,"[BOOK, CHAPTER, 1, “, About, that, shoulder, ,..."
2,1969_TUBB_TOYMAN,BOOK CHAPTER 1 Tor thirty hours the sun had a...,11,44993,"[BOOK, CHAPTER, 1, Tor, thirty, hours, the, su..."
3,1969_KOONTZ_FEARTHATMAN,BOOK PART 1 PURPOSE And ye shall seek a new o...,32,46293,"[BOOK, PART, 1, PURPOSE, And, ye, shall, seek,..."
4,1971_KAMIN_THEHERODMEN,BOOK CHAPTER 1 He stepped onto the morning ba...,14,50032,"[BOOK, CHAPTER, 1, He, stepped, onto, the, mor..."
5,1971_RACKHAM_DARKPLANET,BOOK CHAPTER 1 He stood up to his knees in ho...,12,39522,"[BOOK, CHAPTER, 1, He, stood, up, to, his, kne..."


In [34]:
#Define chunking function
def split(list_a, chunk_size):
  for i in range(0, len(list_a), chunk_size):
    yield list_a[i:i + chunk_size]

#Set desired size of chunks
chunk_size = 500

#Create new list for chunked sentences
chunked_sentences = []

#Perform chunking function on each row of tokens
s = books_cleaned['Tokens']
for content in s:
  chunks = list(split(content, chunk_size))
  #Check that text is being chunked correctly
  print(chunks[0])
  #Add to new list
  chunked_sentences.append(chunks)


['BOOK', 'Lullaby', 'for', 'our', 'Space', 'Children', 'Parameter', ',', 'perimeter', ',', 'and', 'pi—', 'There', '’', 's', 'a', 'trace', 'in', 'the', 'space', 'past', 'the', 'sky', 'that', 'is', 'I', 'There', '’', 's', 'a', 'me', 'in', 'the', 'lee', 'of', 'this', 'starred', 'infinity', 'That', 'is', 'out', 'to', 'prove', 'the', 'ethic', 'that', 'the', 'universe', 'is', 'free', '.', 'We', '’', 're', 'a', 'shout', 'in', 'the', 'snout', 'of', 'eternities', 'of', 'doubt—', 'We', '’', 're', 'a', 'spit', 'in', 'the', 'mitt', 'as', 'we', 'take', 'our', 'aim', 'to', 'hit—in', 'the', 'eye—', 'The', 'multitude', 'of', 'factors', 'that', 'will', 'try', 'to', 'nullify', '.', 'Our', 'parameters', ',', 'perimeters', ',', 'and', 'pi', '.', 'Astronomy', 'and', 'chemistry', 'and', 'math—', 'If', 'you', 'know', 'where', 'to', 'go', 'and', 'your', 'slipstick', '’', 's', 'not', 'too', 'slow', '(', 'don', '’', 't', 'be', 'slow', '!', ')', 'Where', 'electrons', 'meet', 'the', 'nucleus', 'of', 'mass', 'And'

In [36]:
#Create dictionary to associate chunks with titles
keys = books_cleaned['Title']
values = chunked_sentences

#Pair each title with its list of chunks
res = dict(zip(keys, values))

In [37]:
#Add chunks to new dataframe
chunked_df = pd.DataFrame.from_dict(res, orient='index')
chunked_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,124,125,126,127,128,129,130,131,132,133
1969_RICHMOND-RICHMOND_PHOENIXSHIP,"[BOOK, Lullaby, for, our, Space, Children, Par...","[was, old, enough, to, ask, ., “, Ruffians, an...","[deep, ,, unreasoning, yearning, ,, was, to, e...","[the, Mentor, had, assured, him, ,, his, stran...","[will, I, be, kept, or, dropped, ?, I, assume,...","[of, the, questions—and, yet, ,, they, didn, ’...","[than, any, he, ’, d, ever, known, ;, to, the,...","[a, minute, ,, feeling, it, wash, over, him, ,...","[slowly, to, give, gravity, when, not, under, ...","[may, hurt, a, parent, to, spank, a, child, ,,...",...,,,,,,,,,,
1969_KAMIN_EARTHRIM,"[BOOK, CHAPTER, 1, “, About, that, shoulder, ,...","[he, said, coldly, ,, “, is, because, I, was, ...","[you, ’, d, better, damn, well, fix, this, arm...","[”, “, That, would, have, been, most, painful....","[chromium, plate, and, simulated, reptile, ski...","[t, work, for, us, ,, either., ”, “, I, know, ...","[unstable, citizens., ”, “, That, ’, s, quite,...","[the, breakfast, list, had, already, been, rem...","[off, me, and, that, ’, s, beginning, to, make...","[The, appointment, was, two, weeks, off, ., He...",...,"[but, the, world, was, at, peace, ., “, Sure, ...","[while, their, bodies, slowly, rust, away, ?, ...","[the, stream/lake, ,, throwing, water, on, eac...","[missing, ., None, of, the, guards, spoke, ., ...","[recognize, who, you, were., ”, “, It, ’, s, u...","[than, humor, ., “, I, won, ’, t, mislead, you...","[he, stood, ,, he, felt, the, weight, of, his,...","[you, want, done, ,, and, I, ’, ll, do, it, .,...","[around, her, and, the, light, was, like, an, ...","[,, numb, with, agony, ., Everything, had, dis..."
1969_TUBB_TOYMAN,"[BOOK, CHAPTER, 1, Tor, thirty, hours, the, su...","[gasped, ., “, Water., ”, Dumarest, rose, ,, c...","[to, a, short, e, ,, Earl, ,, ”, he, said, ser...","[last, of, his, meat, ., “, You, were, unfortu...","[the, close-packed, suns, of, the, center, ., ...","[,, lifted, it, ,, thrust, down, with, the, bl...","[”, r, Dumarest, hesitated, ,, then, mentally,...","[dead, yet, ,, ”, reminded, Dumarest, ., “, Li...","[wheeling, stars, ., Dawn, came, with, a, flus...","[., Dumarest, fitted, one, into, his, crude, w...",...,,,,,,,,,,
1969_KOONTZ_FEARTHATMAN,"[BOOK, PART, 1, PURPOSE, And, ye, shall, seek,...","[automatically, sifted, through, the, readings...","[was, void, of, brand, ,, model, ,, and, make,...","[., ., ., ., ., He, was, in, a, great, cathedr...","[”, “, Yes, ., How, could, I, hear, your, drea...","[years, before, ,, man, had, tried, to, make, ...","[happened, to, you, in, your, lifetime, ,, you...","[out, of, hyperspace, and, into, Real, Space, ...","[curses, ,, his, eyes, two, fiery, droplets, w...","[abruptly, stuck, out, to, Hurkos, as, a, sign...",...,,,,,,,,,,
1971_KAMIN_THEHERODMEN,"[BOOK, CHAPTER, 1, He, stepped, onto, the, mor...","[., They, seemed, alien, ,, like, small, invad...","[of, the, finest, and, most, accommodating, pl...","[“, I, ’, ll, put, a, hundred, on, the, knuckl...","[pickup, truck, ahead, of, him, grew, progress...","[metal, oddments, were, smeared, around, him, ...","[and, I, could, have, made, it, ,, he, thought...","[Philip, would, have, enjoyed, the, sadistic, ...","[wasn, ’, t, mine, anyway, ,, ”, Matter, said,...","[All, right, ,, ”, Matter, said, ., “, I, can,...",...,"[,, slapping, his, bodystocking, across, his, ...","[is, unimportant., ”, “, Let, ’, s, hope, so, ...","[the, black, jetchoppers, flying, in, from, th...",,,,,,,


In [38]:
#Reset dataframe index and rename columns
chunked_df = chunked_df.stack().reset_index()
chunked_df.columns = ["Title","Chunk","Text"]
chunked_df

Unnamed: 0,Title,Chunk,Text
0,1969_RICHMOND-RICHMOND_PHOENIXSHIP,0,"[BOOK, Lullaby, for, our, Space, Children, Par..."
1,1969_RICHMOND-RICHMOND_PHOENIXSHIP,1,"[was, old, enough, to, ask, ., “, Ruffians, an..."
2,1969_RICHMOND-RICHMOND_PHOENIXSHIP,2,"[deep, ,, unreasoning, yearning, ,, was, to, e..."
3,1969_RICHMOND-RICHMOND_PHOENIXSHIP,3,"[the, Mentor, had, assured, him, ,, his, stran..."
4,1969_RICHMOND-RICHMOND_PHOENIXSHIP,4,"[will, I, be, kept, or, dropped, ?, I, assume,..."
...,...,...,...
676,1971_RACKHAM_DARKPLANET,93,"[not, ?, Til, try, to, understand, ., The, min..."
677,1971_RACKHAM_DARKPLANET,94,"[can, ’, t, just, barge, in, like, this, ., We..."
678,1971_RACKHAM_DARKPLANET,95,"[“, Eldredge, !, ”, Evans, roared, ., “, Use, ..."
679,1971_RACKHAM_DARKPLANET,96,"[to, be, lifted, off, ,, damnit, !, ”, “, We, ..."
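
The reshaping in the cell above (one row per book, one column per chunk, converted to one row per book-chunk pair) can be illustrated on a toy dataframe. This is a minimal sketch, not the notebook's data: `wide`, `long_df`, and the `BOOK_A`/`BOOK_B` labels are hypothetical, and the explicit `dropna()` is an assumption added so that the empty cells of shorter books are discarded regardless of the installed pandas version.

```python
import pandas as pd

# Hypothetical wide frame: index = title, columns = chunk number.
# BOOK_B is shorter, so its second chunk is missing (None).
wide = pd.DataFrame(
    {0: ["a b", "e f"], 1: ["c d", None]},
    index=["BOOK_A", "BOOK_B"],
)

# stack() moves the chunk columns into a second index level;
# dropna() discards the empty cells left by shorter books;
# reset_index() turns both index levels back into ordinary columns.
long_df = wide.stack().dropna().reset_index()
long_df.columns = ["Title", "Chunk", "Text"]

print(long_df)
```

Each surviving cell of the wide frame becomes one row, labeled with its title and chunk number, which is the shape the tidying step in the next cell expects.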


In [39]:
#Tidying the DF
#Combine book and chunk labels into one column
chunked_df['Book + Chunk'] = chunked_df['Title'].astype(str) + ' Chunk ' + chunked_df['Chunk'].astype(str)

#Remove individual book and chunk columns (drop returns a new dataframe, so reassign it)
chunked_df = chunked_df.drop(columns=['Title', 'Chunk'])

#Detokenize text: join each chunk's list of tokens back into a single string
detokenizer = TreebankWordDetokenizer()
chunked_df['Text'] = chunked_df['Text'].apply(detokenizer.detokenize)

#Reindex so book + chunk is first column 
column_names = ["Book + Chunk", "Text"]
chunked_df = chunked_df.reindex(columns=column_names)

#Print cleaned df
chunked_df

Unnamed: 0,Book + Chunk,Text
0,1969_RICHMOND-RICHMOND_PHOENIXSHIP Chunk 0,"BOOK Lullaby for our Space Children Parameter,..."
1,1969_RICHMOND-RICHMOND_PHOENIXSHIP Chunk 1,was old enough to ask . “ Ruffians and ne ’ er...
2,1969_RICHMOND-RICHMOND_PHOENIXSHIP Chunk 2,"deep, unreasoning yearning, was to explore the..."
3,1969_RICHMOND-RICHMOND_PHOENIXSHIP Chunk 3,"the Mentor had assured him, his strange, nearl..."
4,1969_RICHMOND-RICHMOND_PHOENIXSHIP Chunk 4,will I be kept or dropped? I assume that you d...
...,...,...
676,1971_RACKHAM_DARKPLANET Chunk 93,not? Til try to understand . The mind flow fro...
677,1971_RACKHAM_DARKPLANET Chunk 94,can ’ t just barge in like this . We ’ re all ...
678,1971_RACKHAM_DARKPLANET Chunk 95,"“ Eldredge! ” Evans roared . “ Use your eyes, ..."
679,1971_RACKHAM_DARKPLANET Chunk 96,"to be lifted off, damnit! ” “ We have no choic..."


## Download CSV Output of Aggregated and Disaggregated Texts 

At this point, you have three dataframes containing segmented texts (chapters, chapter chunks, and chunks) that are ready for further analysis. All three, along with the dataframe containing the full texts, can be downloaded as csv files. Depending on the nature of your texts and your planned analysis, you may need to disaggregate the data before sharing it: some analyses, such as topic modeling, work well with "bag of words" data, and copyrighted texts cannot be shared in their original form. Disaggregation, the breakdown of data into smaller (disordered) parts, is accomplished here by alphabetizing the words in each chapter or chunk. The cells that follow first download the aggregated texts, then disaggregate each dataframe and download the results as csvs. 
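
As a minimal sketch of what alphabetization does to a passage, the example below applies the same split, sort, and join steps to a single invented sentence. The helper name `disaggregate` and the sample text are hypothetical and are not part of the notebook. Note that Python's default string sort is case-sensitive (capitalized words come before lowercase ones), and that punctuation stays attached to whatever token it was joined to.

```python
from collections import Counter


def disaggregate(text: str) -> str:
    """Return the whitespace-separated tokens of `text` in sorted order."""
    return " ".join(sorted(text.split()))


sample = "the ship left the dock at Dawn"
scrambled = disaggregate(sample)
print(scrambled)  # prints "Dawn at dock left ship the the"

# Word order is lost, but every token and its frequency is preserved,
# which is why bag-of-words methods still work on the output.
assert Counter(scrambled.split()) == Counter(sample.split())
```

Because only word order is discarded, frequency-based analyses see the same counts in the disaggregated text as in the original passage.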


In [40]:
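#Download CSVs of aggregated texts
#files.download is Colab-specific; import it here in case it was not imported earlier
from google.colab import files
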
#Download CSV with aggregated full texts 
books_agg = books_cleaned[['Title', 'Text']]
books_agg.to_csv('full_texts_agg_output.csv', encoding = 'utf-8-sig')
files.download('full_texts_agg_output.csv')

#Download CSV with aggregated chapters 
chapters_df.to_csv('chapters_agg_output.csv', encoding = 'utf-8-sig') 
files.download('chapters_agg_output.csv')

#Download CSV with aggregated chapter chunks 
chunked_ch_df.to_csv('chapter_chunks_agg_output.csv', encoding = 'utf-8-sig') 
files.download('chapter_chunks_agg_output.csv')

#Download CSV with aggregated chunks
chunked_df.to_csv('chunks_agg_output.csv', encoding = 'utf-8-sig') 
files.download('chunks_agg_output.csv')


In [41]:
## Disaggregate data in each dataframe
#Note: chapters_df, chunked_ch_df, and chunked_df are overwritten in place,
#so run the aggregated download cell before this one

#Alphabetize words in each full text (on a copy, so books_agg keeps the original order)
books_bow = books_agg.copy()
books_bow['Text'] = books_bow['Text'].apply(lambda x: ' '.join(sorted(x.split())))

#Alphabetize words in each chapter
chapters_df['Text'] = chapters_df['Text'].apply(lambda x: ' '.join(sorted(x.split())))

#Alphabetize words in each chapter chunk 
chunked_ch_df['Text'] = chunked_ch_df['Text'].apply(lambda x: ' '.join(sorted(x.split())))

#Alphabetize words in each chunk 
chunked_df['Text'] = chunked_df['Text'].apply(lambda x: ' '.join(sorted(x.split())))

In [42]:
#Download CSVs of disaggregated texts

#Download CSV with disaggregated full texts 
books_bow.to_csv('full_texts_bow_output.csv', encoding = 'utf-8-sig')
files.download('full_texts_bow_output.csv')

#Download CSV with disaggregated chapters 
chapters_df.to_csv('chapters_bow_output.csv', encoding = 'utf-8-sig') 
files.download('chapters_bow_output.csv')

#Download disaggregated chapter chunks to csv
chunked_ch_df.to_csv('chapter_chunks_bow_output.csv', encoding = 'utf-8-sig') 
files.download('chapter_chunks_bow_output.csv')

#Download disaggregated chunks to csv
chunked_df.to_csv('chunks_bow_output.csv', encoding = 'utf-8-sig') 
files.download('chunks_bow_output.csv')
