Author Identification with NLTK

This project identifies the author of a given book using techniques implemented with the Natural Language Toolkit (NLTK) in Python. Identification relies on four tests: Jaccard's probability, a vocabulary test, a word length test, and non-indexed word frequency. It is based on a project from the book Real-World Python: A Hacker's Guide to Solving Problems with Code.

Installation

Clone the repository:

```
git clone https://github.com/kamlocicho/author-identification.git
```

Navigate into the project directory:

```
cd author-identification
```

Open the Jupyter notebook find_author.ipynb to access and run the code.

Usage

Place the book text files you want to analyze in the data directory.
Open the Jupyter notebook find_author.ipynb.
Run the cells in the notebook sequentially and follow the instructions provided within it.
The notebook walks through each test used to identify the author of the provided book.

Methodology

Jaccard's Probability

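Jaccard's probability (more commonly called Jaccard similarity or the Jaccard index) compares the sets of unique words used by different authors. The number of words two texts share is divided by the number of distinct words that appear in either text, so a score closer to 1 means the two vocabularies overlap more.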
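
The calculation can be sketched as follows. This is a minimal illustration and not the notebook's code: the function names `tokenize` and `jaccard_similarity` and the sample sentences are made up here, and a regular expression stands in for an NLTK tokenizer so the snippet runs without any downloads.

```python
import re


def tokenize(text):
    """Lowercase the text and return its alphabetic words.

    Assumption: a simple regex stands in for an NLTK tokenizer.
    """
    return re.findall(r"[a-z]+", text.lower())


def jaccard_similarity(words_a, words_b):
    """Return |A intersect B| / |A union B| for the unique words of two texts."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty texts: define the similarity as 0
    return len(set_a & set_b) / len(union)


# Hypothetical sample sentences, used only to exercise the function.
known = tokenize("The hound bayed across the moor at night")
unknown = tokenize("At night the hound ran across the moor")
score = jaccard_similarity(known, unknown)
print(f"Jaccard similarity: {score:.2f}")
```

In an authorship setting, the unknown text would be scored against each candidate author, and the highest score would point to the most likely author.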

Vocabulary Test

The vocabulary test measures the diversity of vocabulary used by different authors. It compares the number of unique words used by each author to determine the authorship.
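
One possible way to compare unique-word counts is sketched below. The helper names, the optional `sample_size` cap (added here so that texts of different lengths are compared over the same number of words), and the sample texts are all assumptions for illustration. The notebook may compare vocabularies differently.

```python
import re


def unique_word_count(text, sample_size=None):
    """Count distinct words, optionally looking only at the first sample_size words."""
    words = re.findall(r"[a-z]+", text.lower())
    if sample_size is not None:
        words = words[:sample_size]
    return len(set(words))


def closest_by_vocabulary(unknown_text, known_texts, sample_size=None):
    """Return the author whose unique-word count is nearest to the unknown text's."""
    target = unique_word_count(unknown_text, sample_size)
    return min(
        known_texts,
        key=lambda author: abs(
            unique_word_count(known_texts[author], sample_size) - target
        ),
    )


# Hypothetical sample texts keyed by a made-up author label.
known_texts = {
    "author_a": "the cat sat on the mat the cat sat",
    "author_b": "a quick brown fox jumps over one lazy dog",
}
unknown_text = "the dog sat on the log the dog sat"
best_match = closest_by_vocabulary(unknown_text, known_texts)
print(best_match)
```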

Word Length Test

This test analyzes the average word length used by different authors. It compares the average word length in characters to identify patterns unique to specific authors.
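
A minimal sketch of the average word length comparison follows. The function name `average_word_length` and the sample texts are hypothetical, and the regex tokenizer is a stand-in for whatever tokenization the notebook uses.

```python
import re


def average_word_length(text):
    """Return the mean number of characters per alphabetic word."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0  # avoid dividing by zero on empty input
    return sum(len(word) for word in words) / len(words)


# Hypothetical sample texts keyed by a made-up author label.
samples = {
    "author_a": "a bb ccc",
    "author_b": "lengthy elaborate vocabulary",
}
averages = {author: average_word_length(text) for author, text in samples.items()}

for author, value in averages.items():
    print(f"{author}: {value:.2f} characters per word")
```

The unknown text's average would then be compared with each author's average. On short samples the averages are noisy, so longer texts give a more reliable signal.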

Non-indexed Word Frequency

Non-indexed word frequency examines the occurrence of less common words (excluding stop words and highly frequent words) to identify distinctive patterns in each author's writing style.
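
The filtering described above can be sketched as below. Everything named here is an assumption for illustration: the small `STOP_WORDS` set (NLTK ships a fuller list in `nltk.corpus.stopwords`), the `skip_top` parameter that drops the most frequent remaining words, and the sample sentence.

```python
import re
from collections import Counter

# Illustrative stop-word list only; a real run would use a much larger one.
STOP_WORDS = frozenset(
    {"the", "a", "an", "and", "of", "to", "in", "on", "at", "was", "it"}
)


def rare_word_frequencies(text, stop_words=STOP_WORDS, skip_top=0):
    """Return relative frequencies of words left after filtering.

    Stop words are removed first, then the skip_top most frequent
    remaining words are dropped.
    """
    words = [
        word
        for word in re.findall(r"[a-z]+", text.lower())
        if word not in stop_words
    ]
    counts = Counter(words)
    for word, _ in counts.most_common(skip_top):
        del counts[word]
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


# Hypothetical sample sentence.
text = (
    "The moor was dark and the moor was cold, "
    "and the hound crossed the moor at night"
)
freqs = rare_word_frequencies(text, skip_top=1)
print(sorted(freqs))
```

Frequency profiles built this way for each known author can be compared with the profile of the unknown text, for example by checking which words they share and how close their frequencies are.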
