A deep learning computational literature library for large text genre classification. This repository includes instructions and tools for downloading and building the Gutenberg Dataset for genre identification, deep learning models in python and machine learning algorithms in R.


ai-lit - Deep Learning Tools for Computational Literature

This repository contains all of the models, infrastructure, and data-parsing tools needed to run deep learning experiments on the Project Gutenberg Literature Classification dataset.

Table of Contents

  1. Installing Necessary Tools
  2. Gutenberg Dataset
  3. Running a Model

Installing Necessary Tools

Python

ai-lit is implemented in Python 3.5.3.

Below are the Python packages needed to run ai-lit. These can be installed through pip; it is best to install them in a virtual environment.

  • nltk 3.1

  • gensim 1.0.1

  • glob2 0.5

  • matplotlib 1.5.1

  • numpy 1.14.2

  • pandas 0.17.1

  • scikit-learn 0.17

  • tensorflow or tensorflow-gpu 1.3.0
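
Since version mismatches are a common source of breakage with pins this old, a small sketch like the following can confirm that the listed packages import and report their versions (note that scikit-learn imports as sklearn; the expected versions are copied from the list above):

```python
# Sketch: check that the packages listed above import, and report the
# versions they expose. Expected versions are taken from the list above.
import importlib

REQUIRED = [
    ("nltk", "3.1"),
    ("gensim", "1.0.1"),
    ("glob2", "0.5"),
    ("matplotlib", "1.5.1"),
    ("numpy", "1.14.2"),
    ("pandas", "0.17.1"),
    ("sklearn", "0.17"),       # pip package name: scikit-learn
    ("tensorflow", "1.3.0"),   # or tensorflow-gpu
]

def check_packages(required):
    """Return (found, missing): found maps module name to the version it
    reports, missing lists modules that failed to import."""
    found, missing = {}, []
    for name, _expected in required:
        try:
            mod = importlib.import_module(name)
            found[name] = getattr(mod, "__version__", "unknown")
        except ImportError:
            missing.append(name)
    return found, missing

if __name__ == "__main__":
    found, missing = check_packages(REQUIRED)
    for name, version in sorted(found.items()):
        print("%-12s %s" % (name, version))
    if missing:
        print("missing: %s" % ", ".join(missing))
```

Running the script inside the virtual environment prints each resolved version and flags anything still missing.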

R

ai-lit also includes modules written in R 3.3.2.

Below are the R libraries needed to run the ai-lit R models. These can be installed through the R package manager.

  • caret 3.3.3

  • class 3.3.2

  • dplyr 3.3.2

  • jsonlite 3.3.3

  • magrittr 3.3.2

  • naivebayes 3.3.3

  • plyr 3.3.2

  • randomForest 3.3.3

  • tm 3.3.2

  • xgboost 3.3.3

Gutenberg Dataset

The Project Gutenberg dataset is hosted by Project Gutenberg itself and can be downloaded by following the directions posted here. The instructions for downloading the data used in the Genre Identification paper are as follows:

  1. run wget -w 2 -m -H "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en" in the dataset folder. Note: this download can take two days and requires ~11 GB of storage space.

  2. download the XML/RDF catalog rdf-files.tar.zip here.

  3. extract the XML/RDF library to the same directory as the downloaded dataset

The downloading process should yield a folder structure like the following:

GutenbergDataset
|-->cache
|-->www.gutenberg.org
|--><mirror name of download mirror>

The name of the mirror does not matter; the data will be extracted from whichever mirror folder is present.
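
Before building the dataset, it can help to confirm the download produced the layout above. A minimal sketch, assuming only what this README states (a cache folder, a www.gutenberg.org folder, and at least one mirror folder of any name):

```python
# Sketch: sanity-check the GutenbergDataset folder layout described above.
import os

def check_dataset_layout(root):
    """Return a list of human-readable problems with the dataset folder;
    an empty list means the layout matches the expected structure."""
    problems = []
    if not os.path.isdir(os.path.join(root, "cache")):
        problems.append("missing 'cache' folder")
    if not os.path.isdir(os.path.join(root, "www.gutenberg.org")):
        problems.append("missing 'www.gutenberg.org' folder")
    known = {"cache", "www.gutenberg.org"}
    # Any other directory is treated as the download mirror.
    mirrors = [d for d in os.listdir(root)
               if os.path.isdir(os.path.join(root, d)) and d not in known]
    if not mirrors:
        problems.append("no mirror folder found")
    return problems
```

Calling check_dataset_layout on the GutenbergDataset root should return an empty list once all three pieces are in place.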

After the dataset has been downloaded, it must be built into a common format and then compiled into TFRecord files, which the ai-lit TensorFlow models read. Run the following python script to build the common dataset format and the TFRecords for the different data representations.

  • run python3 build_dataset.py <folder of the Gutenberg dataset>
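
The genre labels come from the XML/RDF catalog downloaded in step 2. The exact fields build_dataset.py extracts are not documented here, but as a rough illustration, a title and its subject headings can be pulled from a single catalog record with the standard library (the namespaces below are the ones used by the rdf-files catalog; the repository's actual parsing may differ):

```python
# Sketch: extract the title and subject headings from one Project
# Gutenberg RDF catalog record. Illustrative only; build_dataset.py
# may read different fields.
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
}

def parse_record(rdf_xml):
    """Return (title, subjects) for a single RDF record string."""
    root = ET.fromstring(rdf_xml)
    title_el = root.find(".//dcterms:title", NS)
    title = title_el.text if title_el is not None else None
    subjects = [v.text for v in
                root.findall(".//dcterms:subject//rdf:value", NS)]
    return title, subjects
```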

Running a Model

Python

The python experiments are all run within Jupyter Notebooks. The experiments are found here. The models are run using TensorFlow and are configured using TF flags.
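
TF flags are command-line options defined alongside the model and read through a global FLAGS object. The same pattern can be sketched with the standard library's argparse; the flag names below are hypothetical examples, not the actual flags defined in ai-lit:

```python
# Illustration only: ai-lit configures its models through TensorFlow's
# flag mechanism; this argparse sketch mirrors that pattern with
# hypothetical flag names.
import argparse

def build_flags():
    parser = argparse.ArgumentParser(description="example model flags")
    parser.add_argument("--batch_size", type=int, default=32,
                        help="number of records per training batch")
    parser.add_argument("--learning_rate", type=float, default=0.001,
                        help="optimizer learning rate")
    return parser

if __name__ == "__main__":
    flags = build_flags().parse_args()
    print(flags.batch_size, flags.learning_rate)
```

Within the notebooks, flag values are set before the model is constructed, so each experiment records its own configuration.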

R

The R scripts can be run directly using R. RStudio can make it easier to run and analyze the R machine learning models.
