The CLTK has a distributed infrastructure that lets you download official CLTK corpora as well as corpora shared by other users. For full docs, see <http://docs.cltk.org/en/latest/importing_corpora.html>.

To get started, open the Terminal, change to your `~/cltk` directory (see notebook 1, "CLTK Setup", for instructions), and run `jupyter notebook`. Then go to <http://localhost:8888>.

# See what corpora are available

First we need to "import" the right part of the CLTK library. Think of this as pulling just the book you need off the shelf and having it ready to read.

In [5]:
# Import CorpusImporter, the class used to list and download corpora

from cltk.corpus.utils.importer import CorpusImporter

In [6]:
# See https://github.com/cltk for all official corpora

my_latin_downloader = CorpusImporter('latin')

# Now 'my_latin_downloader' is the variable by which we call the CorpusImporter

In [3]:
my_latin_downloader.list_corpora

['latin_text_perseus',
 'latin_treebank_perseus',
 'latin_text_latin_library',
 'phi5',
 'phi7',
 'latin_proper_names_cltk',
 'latin_models_cltk',
 'latin_pos_lemmata_cltk',
 'latin_treebank_index_thomisticus',
 'latin_lexica_perseus',
 'latin_training_set_sentence_cltk',
 'latin_word2vec_cltk',
 'latin_text_antique_digiliblt',
 'latin_text_corpus_grammaticorum_latinorum',
 'latin_text_poeti_ditalia',
 'latin_text_tesserae']
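
The list mixes plain-text corpora with models, treebanks, and lexica. As a small sketch, you could pick out the text corpora by the `<language>_text_` prefix visible in the names above. The helper name `text_corpora` is made up for this example, and a hard-coded sample of names stands in for `my_latin_downloader.list_corpora` so the snippet runs without the CLTK installed.

```python
def text_corpora(corpus_names, language):
    """Return only the names that start with '<language>_text_'."""
    prefix = language + "_text_"
    return [name for name in corpus_names if name.startswith(prefix)]


# A few names copied from the output above; with the CLTK installed you
# could pass my_latin_downloader.list_corpora instead.
sample = [
    "latin_text_perseus",
    "latin_treebank_perseus",
    "latin_text_latin_library",
    "phi5",
    "latin_models_cltk",
]
print(text_corpora(sample, "latin"))
```

Corpora such as `phi5` do not follow this naming pattern, so the filter is only a convenience, not a complete classification.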

# Import several corpora

In [4]:
my_latin_downloader.import_corpus('latin_text_latin_library')
my_latin_downloader.import_corpus('latin_models_cltk')

You can verify the files were downloaded in the Terminal with `$ ls -l ~/cltk_data/latin/text/latin_text_latin_library/`
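
If you would rather stay in the notebook, the same check can be sketched in Python. The helper names `corpus_dir` and `is_downloaded` are hypothetical, and the `~/cltk_data/<language>/<type>/<corpus name>` layout is an assumption read off the paths used in this notebook.

```python
from pathlib import Path


def corpus_dir(language, corpus_type, corpus_name, data_root=None):
    """Build the directory where a downloaded corpus is expected to live.

    Assumes the layout <data_root>/<language>/<corpus_type>/<corpus_name>,
    with data_root defaulting to ~/cltk_data.
    """
    root = Path(data_root) if data_root is not None else Path.home() / "cltk_data"
    return root / language / corpus_type / corpus_name


def is_downloaded(path):
    """Return True if the directory exists and contains at least one entry."""
    path = Path(path)
    return path.is_dir() and any(path.iterdir())


latin_library = corpus_dir("latin", "text", "latin_text_latin_library")
print(latin_library, is_downloaded(latin_library))
```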

In [5]:
# Let's get some Greek corpora, too

my_greek_downloader = CorpusImporter('greek')
my_greek_downloader.import_corpus('greek_models_cltk')
my_greek_downloader.list_corpora

['greek_software_tlgu',
 'greek_text_perseus',
 'phi7',
 'tlg',
 'greek_proper_names_cltk',
 'greek_models_cltk',
 'greek_treebank_perseus',
 'greek_treebank_gorman',
 'greek_lexica_perseus',
 'greek_training_set_sentence_cltk',
 'greek_word2vec_cltk',
 'greek_text_lacus_curtius',
 'greek_text_first1kgreek',
 'greek_text_tesserae']

In [6]:
my_greek_downloader.import_corpus('greek_text_lacus_curtius')

Downloaded 100% 2.98 MiB | 12.26 MiB/s 

Likewise, verify with `ls -l ~/cltk_data/greek/text/greek_text_lacus_curtius/plain/`

In [7]:
my_greek_downloader.import_corpus('greek_text_first1kgreek')

Downloaded 100% 182.99 MiB | 7.56 MiB/s 

In [8]:
!ls -l ~/cltk_data/greek/text/greek_text_first1kgreek/

total 4320
-rwxr-xr-x    1 aleedom  staff  1955024 Sep 24 09:35 #gelasius-kg.xml#
-rw-r--r--    1 aleedom  staff   126919 Sep 24 09:35 Committing Issues using GitHub.docx
-rwxr-xr-x    1 aleedom  staff    19777 Sep 24 09:35 Greek-works.txt
-rw-r--r--    1 aleedom  staff     1658 Sep 24 09:35 README.md
-rwxr-xr-x    1 aleedom  staff     1889 Sep 24 09:35 cselstats.pl
drwxr-xr-x  186 aleedom  staff     5952 Sep 24 09:35 data
-rwxr-xr-x    1 aleedom  staff     2414 Sep 24 09:35 greek-justwork.txt
-rwxr-xr-x    1 aleedom  staff     3249 Sep 24 09:35 greek.txt
-rw-r--r--    1 aleedom  staff    19125 Sep 24 09:35 license.md
-rw-r--r--    1 aleedom  staff    58346 Sep 24 09:35 new_edition_metadata.csv
-rw-r--r--    1 aleedom  staff      697 Sep 24 09:35 pages.sh
-rwxr-xr-x    1 aleedom  staff     1901 Sep 24 09:35 pnumber.xsl
drwxr-xr-x    4 aleedom  staff      128 Sep 24 09:35 save
drwxr-xr-x   49 aleedom

# Convert TEI XML texts

Here we'll convert the First 1K Years' Greek corpus from TEI XML to plain text.

In [1]:
from cltk.corpus.greek.tei import onekgreek_tei_xml_to_text

In [2]:
# If you get the error 'Install `bs4` and `lxml` to parse these TEI files.',
# uncomment and run the next line to install both packages:
# !pip install bs4 lxml
import bs4
import lxml
onekgreek_tei_xml_to_text()

In [3]:
# Count the converted plaintext files

!ls -l ~/cltk_data/greek/text/greek_text_first1kgreek_plaintext/ | wc -l

     975
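
Note that `ls -l` prints a leading `total` line, so the `wc -l` figure is one higher than the number of directory entries. A rough Python equivalent that counts the files directly is sketched below. The helper name `count_files` is made up for this example, and unlike `ls` it also counts hidden files.

```python
from pathlib import Path


def count_files(directory):
    """Count regular files directly inside `directory` (0 if it is missing)."""
    directory = Path(directory).expanduser()
    if not directory.is_dir():
        return 0
    return sum(1 for entry in directory.iterdir() if entry.is_file())


plaintext_dir = "~/cltk_data/greek/text/greek_text_first1kgreek_plaintext"
print(count_files(plaintext_dir))
```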


# Import local corpora

In [9]:
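# Local corpora such as 'phi5' need a second argument: the path to your own copy.
# Without it, the call fails with the error shown below.
my_latin_downloader.import_corpus('phi5')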

AttributeError: 'NoneType' object has no attribute 'endswith'

In [None]:
my_latin_downloader.import_corpus('phi7', '~/cltk/corpora/PHI7/')

In [7]:
my_greek_downloader.import_corpus('tlg', '~/cltk/corpora/TLG_E/')

In [12]:
!ls -l /home/kyle/cltk_data/originals/

total 204
drwxr-xr-x 2 kyle kyle  32768 Mar 30  2014 phi5
drwxr-xr-x 2 kyle kyle  24576 Mar 30  2014 phi7
drwxr-xr-x 2 kyle kyle 151552 Mar 30  2014 tlg
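
Because a wrong or missing path only surfaces as an error inside the importer, it can help to check the directory first. The wrapper below is a minimal sketch, not part of the CLTK: `import_local_corpus` is a hypothetical helper, and it assumes only that the importer object has the two-argument `import_corpus(name, path)` call used in the cells above.

```python
from pathlib import Path


def import_local_corpus(importer, corpus_name, local_path):
    """Check that `local_path` is a directory, then hand it to the importer.

    `importer` is any object with an `import_corpus(name, path)` method,
    such as the `my_latin_downloader` object created earlier.
    """
    path = Path(local_path).expanduser()  # turn '~' into the home directory
    if not path.is_dir():
        raise FileNotFoundError(f"No such corpus directory: {path}")
    importer.import_corpus(corpus_name, str(path))
    return path
```

With the CLTK installed, a call would look like `import_local_corpus(my_latin_downloader, 'phi7', '~/cltk/corpora/PHI7/')`.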
