This repository contains the models, infrastructure, and data parsing tools needed to run deep learning experiments on the Project Gutenberg Literature Classification dataset.
ai-lit is implemented using Python 3.5.3.
Below are the Python packages needed to run ai-lit. They can be installed through pip, ideally inside a virtual environment.
- nltk 3.1
- gensim 1.0.1
- glob2 0.5
- matplotlib 1.5.1
- numpy 1.14.2
- pandas 0.17.1
- scikit-learn 0.17
- tensorflow or tensorflow-gpu 1.3.0
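One convenient way to install these pinned versions is to capture them in a `requirements.txt` file. The file below is an illustration built from the list above, not a file shipped with the repository:

```
nltk==3.1
gensim==1.0.1
glob2==0.5
matplotlib==1.5.1
numpy==1.14.2
pandas==0.17.1
scikit-learn==0.17
tensorflow==1.3.0  # or tensorflow-gpu==1.3.0 for GPU machines
```

It can then be installed into a fresh virtual environment with `python3 -m venv env`, `source env/bin/activate`, and `pip install -r requirements.txt`.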
ai-lit also has modules written in R 3.3.2.
Below are the R libraries needed to run the ai-lit R models. They can be installed through the R package manager.
- caret 3.3.3
- class 3.3.2
- dplyr 3.3.2
- jsonlite 3.3.3
- magrittr 3.3.2
- naivebayes 3.3.3
- plyr 3.3.2
- randomForest 3.3.3
- tm 3.3.2
- xgboost 3.3.3
The Project Gutenberg dataset is hosted by Project Gutenberg. The dataset can be downloaded by following the directions posted here. The instructions for downloading the data used in the Genre Identification paper are as follows:
- run `wget -w 2 -m -H "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en"` in the dataset folder. Note: this download can take two days and requires ~11GB of storage space.
- download the XML/RDF catalog `rdf-files.tar.zip` here.
- extract the XML/RDF catalog to the same directory as the downloaded dataset.
The download process should yield a folder structure like the following:

```
GutenbergDataset
|-->cache
|-->www.gutenberg.org
|--><mirror name of download mirror>
```

The name of the mirror folder does not matter; the data will be extracted from any mirror folder name.
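Because the mirror folder name varies by download, any loader has to discover it at run time rather than hard-code it. A minimal sketch of how such a lookup might work, assuming everything other than the known `cache` and `www.gutenberg.org` folders is a mirror (the function name and this heuristic are illustrative, not the repository's actual logic):

```python
from pathlib import Path

# Folders that are always present in the dataset root but are not mirrors
# (hypothetical assumption based on the structure shown above).
NON_MIRRORS = {"cache", "www.gutenberg.org"}

def find_mirror_dirs(dataset_root):
    """Return the names of sub-folders that look like download mirrors."""
    root = Path(dataset_root)
    return sorted(d.name for d in root.iterdir()
                  if d.is_dir() and d.name not in NON_MIRRORS)
```

With this, a build script can process whichever mirror folder `wget` happened to create.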
After the dataset has been downloaded, the data must be built into a common format and then compiled into TFRecord files, which are read by the ai-lit TensorFlow models. Run the following Python script to build the common dataset format and the TFRecords for the different data representations.
- run `python3 build_dataset.py <folder of the Gutenberg dataset>`
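Building the common format involves separating each book's body text from the Project Gutenberg license boilerplate. Raw Gutenberg text files typically delimit the body with `*** START OF ...` and `*** END OF ...` marker lines, though the exact wording varies between files. A simplified sketch of how such stripping could work (this is an illustration, not the repository's actual `build_dataset.py` logic, and it only handles the common marker form):

```python
def strip_gutenberg_boilerplate(raw_text):
    """Return the text between the Project Gutenberg START and END markers.

    Falls back to the full text when the markers are missing.
    """
    lines = raw_text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith("*** START OF"):
            start = i + 1  # body begins after the START marker line
        elif line.startswith("*** END OF"):
            end = i        # body ends before the END marker line
            break
    return "\n".join(lines[start:end]).strip()
```

A real pipeline would also need to handle older marker variants and files without markers at all.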
The Python experiments all run within Jupyter notebooks, which are found here. The models are run using TensorFlow and are configured using TF flags.
The R scripts can be run directly using R. RStudio can make it easier to run and analyze the R machine learning models.