In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt





NLP - WRITER ANALYZER


The goal is to answer, once and for all, whether Kit Marlowe was Shakespeare (and vice versa). I will import both authors' complete works and build a "Marlowe detector" from about 70% of Marlowe's work. I will test it on unseen, confirmed Marlowe plays to check that it works, then run it on some of Shakespeare's work to see how it scores.

Source texts as .txt files from Project Gutenberg/z-Library (all are public domain).
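Project Gutenberg files wrap the actual text in a license header and footer, usually delimited by `*** START OF ...` and `*** END OF ...` marker lines. A minimal sketch for cutting that out, assuming those markers are present (the exact wording varies between files, so the patterns below are an assumption to check against each download; `strip_gutenberg_boilerplate` is a name made up for this notebook):

```python
import re

# Assumed marker format; verify against the actual files before trusting it.
START_RE = re.compile(r"^\*\*\*\s*START OF.*$", re.IGNORECASE | re.MULTILINE)
END_RE = re.compile(r"^\*\*\*\s*END OF.*$", re.IGNORECASE | re.MULTILINE)


def strip_gutenberg_boilerplate(raw_text):
    """Return the text between the START and END marker lines, if present."""
    start = START_RE.search(raw_text)
    end = END_RE.search(raw_text)
    # Fall back to the whole text when a marker is missing
    begin = start.end() if start else 0
    finish = end.start() if end else len(raw_text)
    return raw_text[begin:finish].strip()
```

Files from other sources will not have these markers and pass through unchanged, so they still need manual cleaning.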

## Phase 1 - Proof of Concept 

I will build the style basis from every Marlowe play I could get my hands on (Faustus was listed about five times for some reason). I will then compare that basis against another Marlowe play that was not part of it, and against one Shakespeare play, King Lear.
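A rough sketch of how the plays could be split into a style basis and a held-out set. The function name `split_plays`, the 0.7 default, and the fixed seed are all assumptions for illustration; duplicates (like the repeated Faustus listings) are dropped before splitting:

```python
import random


def split_plays(play_files, train_fraction=0.7, seed=42):
    """Split play filenames into (style_basis, held_out) lists."""
    unique = sorted(set(play_files))  # drop duplicate listings
    if len(unique) < 2:
        raise ValueError("need at least two distinct plays to hold one out")
    # Shuffle with a fixed seed so the split is reproducible
    random.Random(seed).shuffle(unique)
    cutoff = max(1, round(len(unique) * train_fraction))
    cutoff = min(cutoff, len(unique) - 1)  # always keep one play held out
    return unique[:cutoff], unique[cutoff:]
```

With seven distinct plays this puts five in the basis and holds out two.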


## Phase 2 - Cleaner Version with More Writing as Basis 

Once the process has been figured out, I will use the complete collection of Shakespeare's works, which is much easier to find, as the style basis to check again whether Shakespeare was Marlowe. This will be run on a few of Marlowe's individual plays and should hopefully give scores similar to Phase 1.


## Phase 3 - The Workspace

There will be a basic UI where you supply the source texts as .txt files, either by uploading them or by pointing to a folder location. The text you want to scan is supplied as well. The software will build a new style blueprint from the source texts and then compare the scanned text against it.
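One possible shape for the "style blueprint" and the comparison step, as a sketch only. It assumes the blueprint is just the relative frequency of the most common words and that similarity is measured with cosine similarity; neither choice is settled, and `build_profile` and `cosine_similarity` are names made up here:

```python
import math
import re
from collections import Counter


def build_profile(text, top_n=50):
    """Map the top_n most common words to their relative frequency."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.most_common(top_n)}


def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two word-frequency profiles (0.0 to 1.0)."""
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[w] * profile_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
    norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile matches nothing
    return dot / (norm_a * norm_b)
```

Common function words dominate the top of the frequency list, which is usually what stylometry wants, but this is far cruder than the methods in the tutorials linked later in the notebook.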


## Phase 4 - Advanced Product

I will build a database of prolific writers' styles (Kit Marlowe, William Shakespeare, Mark Twain, Agatha Christie, Stephen King, etc.). All you have to do is input the text you want to examine, and the software will show you how similar it is to each writer's style, in descending order.

There will be a default set of authors to check against, and you can choose which ones to compare to if you don't want to run all of them. (This may also affect runtime, although that hardly matters if a full run takes less than a couple of seconds.)
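The ranking and the optional author filter could look roughly like this. It assumes a similarity score per author has already been computed somewhere else; `rank_authors` and its arguments are hypothetical names:

```python
def rank_authors(scores, selected=None):
    """Return (author, score) pairs sorted from most to least similar.

    scores:   dict mapping author name -> similarity score
    selected: optional iterable of author names; None means all authors
    """
    if selected is not None:
        wanted = set(selected)
        scores = {a: s for a, s in scores.items() if a in wanted}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Passing `selected=["Kit Marlowe", "William Shakespeare"]` would restrict the output to those two profiles and skip the rest.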

Will still have an option to create a new writer style profile and add it to saved authors list. 

Maybe make the UI visually cool



# Challenges to consider

Cleaning all the texts to remove legal text/publisher data

Cleaning any odd symbols (will be done with the stemmer etc)

Character names will come up. Consider swapping with just the name CHARACTER or ACTOR for all of them

Will the sonnets/poetry need to be eliminated/stemmed?
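For the character-name problem, a first sketch could replace every known name with a single placeholder. This assumes the list of names is supplied by hand for each play; `mask_character_names` is a made-up helper, and pulling names automatically from speech prefixes is left open:

```python
import re


def mask_character_names(text, names, placeholder="CHARACTER"):
    """Replace whole-word occurrences of the given names with a placeholder."""
    names = [n for n in names if n]
    if not names:
        return text
    # Longest names first so "King Lear" is matched before "Lear"
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(n) for n in ordered)
    pattern = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    return pattern.sub(placeholder, text)
```

Matching is case-insensitive so that all-caps speech prefixes and in-line mentions are both caught. Names that double as ordinary words would be masked too, so the list needs some care.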



In [2]:
def stemmer(text):
    """Crude stemmer: strip one common suffix, then truncate to 8 characters."""
    stemmed = []
    for word in str(text).split():
        # Strip at most one suffix: 'ed' or 'ly' (2 chars), else 'ing' (3 chars)
        if word.endswith(('ed', 'ly')):
            word = word[:-2]
        elif word.endswith('ing'):
            word = word[:-3]
        # Keep only the first 8 characters of whatever is left
        stemmed.append(word[:8])
    return " ".join(stemmed)


In [3]:
stemmer("hello my excellent reading leavered wryly")

'hello my excellen read leaver wry'

In [4]:
from string import ascii_lowercase

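def cr(mes, rotation=4):
    """Caesar-rotate the letters of `mes` (lowercased); other characters pass through."""
    rotated = ascii_lowercase[rotation:] + ascii_lowercase[:rotation]
    # Map each letter to its rotated counterpart (a dict, not a set of booleans)
    cipher = {o: n for o, n in zip(ascii_lowercase, rotated)}

    encoded = []
    for char in mes.lower():
        # Spaces, digits and punctuation are not in the cipher and stay as they are
        encoded.append(cipher.get(char, char))
    return "".join(encoded)


In [5]:
cr("HELLO MY dEAR man ")

'lipps qc hiev qer '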

In [6]:

#from https://programminghistorian.org/en/lessons/introduction-to-stylometry-with-python
#a major resource



papers = {
    'Madison': [10, 14, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48],
    'Hamilton': [1, 6, 7, 8, 9, 11, 12, 13, 15, 16, 17, 21, 22, 23, 24,
                 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 59, 60,
                 61, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77,
                 78, 79, 80, 81, 82, 83, 84, 85],
    'Jay': [2, 3, 4, 5],
    'Shared': [18, 19, 20],
    'Disputed': [49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 62, 63],
    'TestCase': [64]
}

In [1]:

#from https://programminghistorian.org/en/lessons/introduction-to-stylometry-with-python
#a major resource




# A function that compiles all of the text files associated with a single author into a single string
def read_files_into_string(filenames):
    strings = []
    for filename in filenames:
        with open(f'data/federalist_{filename}.txt') as f:
            strings.append(f.read())
    return '\n'.join(strings)

In [8]:
# Make a dictionary out of the authors' corpora
federalist_by_author = {}
for author, files in papers.items():
    federalist_by_author[author] = read_files_into_string(files)

FileNotFoundError: [Errno 2] No such file or directory: 'data/federalist_10.txt'
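The FileNotFoundError above means the Federalist text files were never placed in a `data/` folder next to the notebook, and the KeyErrors in the following cells are a knock-on effect because the dictionary never got filled. A small sketch for checking up front which files are missing, assuming the same `data/federalist_{n}.txt` naming as `read_files_into_string` (`find_missing_files` is a made-up helper, not part of the tutorial):

```python
from pathlib import Path


def find_missing_files(papers, template="data/federalist_{}.txt"):
    """Return the expected file paths that do not exist on disk."""
    missing = []
    for numbers in papers.values():
        for number in numbers:
            path = Path(template.format(number))
            if not path.is_file():
                missing.append(str(path))
    return missing
```

Running `find_missing_files(papers)` before building the corpora gives the full list in one go instead of failing on the first file.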

In [9]:
for author in papers:
    print(federalist_by_author[author][:100])

KeyError: 'Madison'

In [10]:
# Load nltk
import nltk
%matplotlib inline

# Compare the disputed papers to those written by everyone,
# including the shared ones.
authors = ("Hamilton", "Madison", "Disputed", "Jay", "Shared")

# Transform the authors' corpora into lists of word tokens
federalist_by_author_tokens = {}
federalist_by_author_length_distributions = {}
for author in authors:
    tokens = nltk.word_tokenize(federalist_by_author[author])

    # Filter out punctuation
    federalist_by_author_tokens[author] = ([token for token in tokens
                                            if any(c.isalpha() for c in token)])

    # Get a distribution of token lengths
    token_lengths = [len(token) for token in federalist_by_author_tokens[author]]
    federalist_by_author_length_distributions[author] = nltk.FreqDist(token_lengths)
    federalist_by_author_length_distributions[author].plot(15,title=author)

KeyError: 'Hamilton'

# Background on the theory itself

Via [Wikipedia](https://en.wikipedia.org/wiki/Marlovian_theory_of_Shakespeare_authorship) for now:

The Marlovian theory of Shakespeare authorship holds that the Elizabethan poet and playwright Christopher Marlowe was the main author of the poems and plays attributed to William Shakespeare. Further, the theory says Marlowe did not die in Deptford on 30 May 1593, as the historical records state, but that his death was faked.

Marlovians (as those who subscribe to the theory are usually called) base their argument on supposed anomalies surrounding Marlowe's reported death[1] and on the significant influence which, according to most scholars, Marlowe's works had on those of Shakespeare.[2] They also point out the coincidence that, despite their having been born only two months apart, the first time the name William Shakespeare is known to have been connected with any literary work was with the publication of Venus and Adonis just a week or two after the death of Marlowe.

The argument against this is that Marlowe's death was accepted as genuine by sixteen jurors at an inquest held by the Queen's personal coroner,[3] that everyone apparently thought that he was dead at the time, and that there is a complete lack of direct evidence supporting his survival beyond 1593.[4] While there are similarities between their works,[5] Marlowe's style,[6] vocabulary,[7] imagery,[8] and his apparent weaknesses—particularly in the writing of comedy[9]—are said to be too different from Shakespeare's to be compatible with the claims of the Marlovians. The convergence of documentary evidence of the type used by academics for authorial attribution—title pages, testimony by other contemporary poets and historians, and official records—sufficiently establishes Shakespeare of Stratford's authorship for the overwhelming majority of Shakespeare scholars and literary historians,[10] who consider the Marlovian theory, like all other alternative theories of Shakespeare authorship, a fringe theory.[11]

# ALT Uses

- Identifying the author of a book
- Identifying the author of online text from a "digital fingerprint" (possibly evil uses, to be honest)
- Authenticating a user: if text in the same style turns up across several instances in a system, it was likely written by user X

# ALT PREBUILT
https://freelancedatascientist.net/fast-stylometry-tutorial/

In [1]:
!pip install faststylometry

Collecting faststylometry
  Downloading faststylometry-0.5.tar.gz (7.1 kB)
Building wheels for collected packages: faststylometry
  Building wheel for faststylometry (setup.py) ... done
  Created wheel for faststylometry: filename=faststylometry-0.5-py3-none-any.whl size=8478 sha256=13c0965e88b742f263fd037a875c2ff10677ce56e5ad3d2e9b400d65fe8b051e
  Stored in directory: /Users/nicholaswertz/Library/Caches/pip/wheels/ca/53/85/ef936668cea2afa13c65db00d8cfa05079600d1261bc054a8a
Successfully built faststylometry
Installing collected packages: faststylometry
Successfully installed faststylometry-0.5
