<a href="https://colab.research.google.com/github/lclark7/stylometry-of-baen-books/blob/main/Stylometric_Comparison_of_Baen_Books_Science_Fiction_Authors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stylometric Comparison of Baen Books Science Fiction Authors

This Colab notebook works through the examples for my Stylometric Comparison of Baen Books Science Fiction Authors project. It gives step-by-step instructions on how I conducted the comparison, using Plotly to graph the results. A full explanation of the project can be found on my WordPress site, and all of this information is also available on my GitHub.


First, we need to install Plotly. The cell below does this with pip.

In [None]:
!pip install plotly



Now we need to import the modules Python needs to read and process the files in this dataset. We also import the Plotly modules now so we do not have to do so later. The same cell downloads NLTK's punkt tokenizer data, which TextBlob relies on to split text into words.

In [None]:
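# standard library
import glob
import os

# data handling, plotting and text processing
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import nltk
from textblob import TextBlob

# tokenizer data used when splitting text into words
nltk.download('punkt')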

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Now it is time to upload the corpus we will be using. The dataset for this computational analysis came from Baen Books' Free Online Library. Each of these books was downloaded in EPUB format and then converted to TXT. The files were then put into a compressed ZIP file to reduce their size for upload to the notebook session. The next line of code simply unzips that file for use.

*Note: This corpus does not include every book from the Baen Books Free Library. Titles with more than two authors were excluded to reduce confusion and to save time when conducting this analysis. As of April 2022, there are 75 free ebooks available from Baen Books. This dataset comprises only 49 ebooks, some of which appear more than once because more than one author in the dataset worked on them.

In [None]:
!unzip "/content/BaenBooksCorpus2.zip"

Archive:  /content/BaenBooksCorpus2.zip
   creating: BaenBooksCorpus2/Andre_Norton/
  inflating: BaenBooksCorpus2/Andre_Norton/Star_Soldiers.txt  
  inflating: BaenBooksCorpus2/Andre_Norton/Time_Traders.txt  
   creating: BaenBooksCorpus2/Charles_Gannon/
  inflating: BaenBooksCorpus2/Charles_Gannon/Fire_with_Fire.txt  
   creating: BaenBooksCorpus2/Dave_Freer/
  inflating: BaenBooksCorpus2/Dave_Freer/The_Forlorn.txt  
   creating: BaenBooksCorpus2/David_Drake/
  inflating: BaenBooksCorpus2/David_Drake/In_the_Heart_of_Darkness.txt  
  inflating: BaenBooksCorpus2/David_Drake/Northworld_Trilogy.txt  
  inflating: BaenBooksCorpus2/David_Drake/Old_Nathan.txt  
  inflating: BaenBooksCorpus2/David_Drake/Redliners.txt  
  inflating: BaenBooksCorpus2/David_Drake/Seas_of_Venus.txt  
  inflating: BaenBooksCorpus2/David_Drake/Starliner.txt  
  inflating: BaenBooksCorpus2/David_Drake/The_Sea_Hag.txt  
  inflating: BaenBooksCorpus2/David_Drake/The_Tank_Lords.txt  
  inflating: BaenBooksCorpus2/David
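
The same extraction can also be done from Python instead of the shell command, which may be handy outside Colab. The following is a minimal sketch using the standard library's zipfile module. The function name extract_corpus is only illustrative and is not used anywhere else in this notebook.

```python
import zipfile
from pathlib import Path


def extract_corpus(zip_path, dest_dir="."):
    """Unzip zip_path into dest_dir and return the extracted .txt paths."""
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        # keep only the book files, skipping directory entries
        names = [n for n in archive.namelist() if n.endswith(".txt")]
    return sorted(dest / n for n in names)


# In Colab this could be called with the path used above, e.g.
# extract_corpus("/content/BaenBooksCorpus2.zip", "/content")
```

Returning the list of extracted files gives a quick way to confirm that every book arrived before moving on.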

Now we are simply going to get a list of all the text files and print the resulting list of files to make sure it worked.

In [None]:
files = glob.glob("/content/BaenBooksCorpus2/*/*.txt")
print(files)

['/content/BaenBooksCorpus2/Ellen_Guon/Bedlam_Boyz.txt', '/content/BaenBooksCorpus2/Ryk_Spoor/Phoenix_Rising.txt', '/content/BaenBooksCorpus2/Ryk_Spoor/Grand_Central_Arena.txt', '/content/BaenBooksCorpus2/Ryk_Spoor/Boundary.txt', '/content/BaenBooksCorpus2/Ryk_Spoor/Digital_Knight.txt', '/content/BaenBooksCorpus2/Charles_Gannon/Fire_with_Fire.txt', '/content/BaenBooksCorpus2/John_Joseph_Adams/Selections_from_The_Improbable_Adventures_of_Sherlock_Holmes.txt', '/content/BaenBooksCorpus2/John_Joseph_Adams/Selections_from_By_Blood_We_Live.txt', '/content/BaenBooksCorpus2/John_Joseph_Adams/Selections_from_Brave_New_Worlds_ Dystopian_Stories.txt', '/content/BaenBooksCorpus2/John_Joseph_Adams/Selections_From_The_Living_Dead_2.txt', '/content/BaenBooksCorpus2/Michael_Williamson/Freehold.txt', '/content/BaenBooksCorpus2/David_Drake/Starliner.txt', '/content/BaenBooksCorpus2/David_Drake/In_the_Heart_of_Darkness.txt', '/content/BaenBooksCorpus2/David_Drake/The_Tank_Lords.txt', '/content/BaenBooks
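
Because each author has their own folder, the author and file name can be read straight off each path. Here is a small sketch, assuming the folder layout shown in the output above. The helper name split_author_and_title is hypothetical, and it uses pathlib rather than the os.path calls that appear in the next code cell.

```python
from pathlib import Path


def split_author_and_title(path):
    """Return (author_folder, file_name) for a path like .../Author/Book.txt."""
    p = Path(path)
    return p.parent.name, p.name


example = "/content/BaenBooksCorpus2/Ryk_Spoor/Boundary.txt"
print(split_author_and_title(example))  # prints ('Ryk_Spoor', 'Boundary.txt')
```

Note that glob does not return files in any guaranteed order, as the listing above shows. Wrapping the call in sorted() gives a stable order from run to run.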

This next bit of code is essential. It goes through each file, counts the frequency of each keyword, and tabulates the results in a dataframe.

In [None]:
# Set up the dictionary 
dd ={
     "Author":[],
     "Path":[]
}

# a list of supposedly the most common 100 words
fwords = ["the","at","there","some","my","of","be","use","her","than","and","this","an","would","first","a","have","each","make","water","to","from","which","like","been","in","or","she","him","call","is","one","do","into","who","you","had","how","time","oil","that","by","their","has","its","it","word","if","look","now","he","but","will","two","find","was","not","up","more","long","for","what","other","write","down","on","all","about","go","day","are","were","out","see","did","as","we","many","number","get","with","when","then","no","come","his","your","them","way","made","they","can","these","could","may","I","said","so","people","part"]

# make an empty list for each word as a placeholder
for fw in fwords:
  dd[fw] = []

# just in case
files = glob.glob("/content/BaenBooksCorpus2/*/*.txt")


for file in files:
  # print current file to keep an eye on progress
  print(file)

  # get the author and filename from the path and add to dd
  dd['Author'].append(os.path.dirname(file).split("/")[-1])
  dd['Path'].append(os.path.basename(file))

  # read in the book's text
  with open(file,errors="surrogateescape") as f:
    text = f.read()

  # blobify and convert all words to lowercase
  blob = TextBlob(text)
  word_list = []
  for w in blob.words:
    word_list.append(w.lower())
  
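  # get the frequency (per 10,000 words) for each frequent word
  # note: word_list is all lowercase, so the capitalized entry "I" never
  # matches and its column is always 0.0 (see the table below)
  for word in fwords:
    fr = (word_list.count(word) / len(word_list)) * 10000
    dd[word].append(fr)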

df = pd.DataFrame(dd)


/content/BaenBooksCorpus2/Ellen_Guon/Bedlam_Boyz.txt
/content/BaenBooksCorpus2/Ryk_Spoor/Phoenix_Rising.txt
/content/BaenBooksCorpus2/Ryk_Spoor/Grand_Central_Arena.txt
/content/BaenBooksCorpus2/Ryk_Spoor/Boundary.txt
/content/BaenBooksCorpus2/Ryk_Spoor/Digital_Knight.txt
/content/BaenBooksCorpus2/Charles_Gannon/Fire_with_Fire.txt
/content/BaenBooksCorpus2/John_Joseph_Adams/Selections_from_The_Improbable_Adventures_of_Sherlock_Holmes.txt
/content/BaenBooksCorpus2/John_Joseph_Adams/Selections_from_By_Blood_We_Live.txt
/content/BaenBooksCorpus2/John_Joseph_Adams/Selections_from_Brave_New_Worlds_ Dystopian_Stories.txt
/content/BaenBooksCorpus2/John_Joseph_Adams/Selections_From_The_Living_Dead_2.txt
/content/BaenBooksCorpus2/Michael_Williamson/Freehold.txt
/content/BaenBooksCorpus2/David_Drake/Starliner.txt
/content/BaenBooksCorpus2/David_Drake/In_the_Heart_of_Darkness.txt
/content/BaenBooksCorpus2/David_Drake/The_Tank_Lords.txt
/content/BaenBooksCorpus2/David_Drake/Redliners.txt
/content/B
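
The loop above calls word_list.count once per keyword, so each book is scanned 100 times. A possible alternative, sketched here under some assumptions, counts every word in a single pass with collections.Counter. The function name frequencies_per_10k is hypothetical, and the regular-expression tokenizer is a simplified stand-in for TextBlob, so its counts will not match blob.words exactly (for example around contractions or accented letters).

```python
import re
from collections import Counter


def frequencies_per_10k(text, keywords):
    """Return {keyword: occurrences per 10,000 words} for one text."""
    # simplified tokenizer (an assumption, not TextBlob): runs of letters and apostrophes
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {k: 0.0 for k in keywords}
    # lowercase each keyword too, so an entry such as "I" can still match
    return {k: 10000 * counts[k.lower()] / total for k in keywords}


sample = "The cat and the dog"
print(frequencies_per_10k(sample, ["the", "and", "I"]))
# prints {'the': 4000.0, 'and': 2000.0, 'I': 0.0}
```

Unlike the cell above, this version lowercases the keywords as well as the text, so it would produce non-zero values for "I" and therefore different numbers from the table shown in this notebook.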

Now we are going to generate a scatterplot from that dataframe. First, though, we display the dataframe as a table to make sure the information from the text files was processed correctly.

In [None]:
df

Unnamed: 0,Author,Path,the,at,there,some,my,of,be,use,...,they,can,these,could,may,I,said,so,people,part
0,Ellen_Guon,Bedlam_Boyz.txt,518.988468,79.335817,33.183731,17.41828,23.139613,178.251306,38.650783,2.542815,...,43.990693,24.792443,5.848474,48.313478,1.525689,0.0,71.580232,21.486784,11.188384,1.398548
1,Ryk_Spoor,Phoenix_Rising.txt,520.357617,49.335636,31.401465,16.667444,26.93459,241.411266,49.268966,3.200149,...,44.468742,23.401092,6.06695,29.601381,5.800271,0.0,27.267939,33.13488,11.667211,6.200289
2,Ryk_Spoor,Grand_Central_Arena.txt,444.965928,52.630985,32.774493,22.056501,28.600117,231.959926,57.313056,4.794891,...,44.395054,32.153978,7.333363,23.974457,7.784647,0.0,34.63604,35.595018,14.779548,5.189765
3,Ryk_Spoor,Boundary.txt,486.140584,62.864721,38.72679,25.066313,17.771883,227.586207,68.501326,6.366048,...,52.586207,28.647215,7.228117,32.692308,4.045093,0.0,25.066313,35.543767,13.129973,6.034483
4,Ryk_Spoor,Digital_Knight.txt,425.102442,60.988816,35.345791,21.138169,86.371945,194.748378,60.122497,3.638537,...,34.479473,28.675139,9.009711,33.43989,5.024647,0.0,30.754303,40.110542,17.759527,5.717701
5,Charles_Gannon,Fire_with_Fire.txt,507.991751,58.831179,24.027892,9.908409,16.348875,198.168183,46.074103,3.529871,...,51.399872,22.170065,6.254683,16.039237,7.988655,0.0,6.688176,42.296522,2.724813,4.644567
6,John_Joseph_Adams,Selections_from_The_Improbable_Adventures_of_S...,546.47361,52.014776,38.639548,22.716657,84.285168,279.393656,44.796399,2.335357,...,19.744385,17.833638,9.129124,32.907308,7.430682,0.0,40.7626,32.907308,3.609189,3.396883
7,John_Joseph_Adams,Selections_from_By_Blood_We_Live.txt,587.003468,56.933447,32.065964,10.906791,32.720372,233.623454,36.42868,2.399494,...,54.752089,12.651877,7.852889,32.502236,3.926445,0.0,39.482582,35.992409,6.76221,3.708309
8,John_Joseph_Adams,Selections_from_Brave_New_Worlds_ Dystopian_St...,500.032951,58.488204,30.150257,14.992751,43.824964,187.491762,42.177409,2.965599,...,56.346382,24.713325,3.130355,18.782127,1.812311,0.0,42.836431,31.13879,19.276394,4.118888
9,John_Joseph_Adams,Selections_From_The_Living_Dead_2.txt,596.556254,54.424322,29.323449,9.618091,55.362672,194.942291,32.138501,2.111288,...,75.771793,26.039223,5.86469,16.655719,2.815051,0.0,36.12649,27.446749,8.679741,2.580464


In [None]:
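# rebuild the dataframe from dd (same as at the end of the frequency cell)
df = pd.DataFrame(dd)
# plot each book's frequency of "there" against "their", colored and shaped by author
fig = px.scatter(df, x="there", y="their", color="Author", hover_name="Path", symbol="Author")
# enlarge the markers and give them a dark outline
fig.update_traces(marker=dict(size=20,
                              line=dict(width=2,
                                        color='DarkSlateGrey')),
                  selector=dict(mode='markers'))
fig.show()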

This next part is not crucial to the comparison, but it does allow you to upload the graph and dataframe above to Plotly. This makes it possible to obtain an embed code for placing the interactive graph on a website.

To do this you will need a free Plotly account. You have to provide your Plotly username and the API key associated with your account. Once your credentials have been processed, assign the file a name, and it should appear under "My Files" in your Plotly account.

*Note: I have starred out my own username and API key for this demonstration. For more information on this method, see [How To Create a Plotly Visualization And Embed It On Websites](https://towardsdatascience.com/how-to-create-a-plotly-visualization-and-embed-it-on-websites-517c1a78568b) on Medium.

In [None]:
!pip install chart_studio
import chart_studio
username = '*******' # your username
api_key = '********************' # your api key - go to profile > settings > regenerate key
chart_studio.tools.set_credentials_file(username=username, api_key=api_key)
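
If you would rather not type your credentials into the notebook itself, one option is to read them from environment variables. This is only a sketch: the variable names PLOTLY_USERNAME and PLOTLY_API_KEY and the helper read_credentials are assumptions made for illustration, not names that Plotly or chart_studio define.

```python
import os


def read_credentials(env=None):
    """Return (username, api_key) from environment variables, or raise if missing."""
    env = os.environ if env is None else env
    # hypothetical variable names chosen for this example
    username = env.get("PLOTLY_USERNAME")
    api_key = env.get("PLOTLY_API_KEY")
    if not username or not api_key:
        raise KeyError("Set PLOTLY_USERNAME and PLOTLY_API_KEY first")
    return username, api_key
```

The returned pair could then be passed to chart_studio.tools.set_credentials_file(username=..., api_key=...) in the same way as the hard-coded values in the cell above.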

In [None]:
import chart_studio.plotly as py
py.plot(fig, filename = 'baen-books-theretheir-graph', auto_open=True)