Yule's K values generated for the pre-generated version of the HTRC genre-separated corpus (https://sharc.hathitrust.org/genre).
The CSVs contain the HTID of the volume followed by the K calculated for that volume. The single quote (') was used as the text separator.
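For reference, a minimal sketch of how Yule's K is conventionally computed from a volume's tokens is below. The exact tokenization and normalization used by the code in this repository may differ; the function name and example here are illustrative only.

```python
from collections import Counter

def yules_k(tokens):
    """Standard Yule's K: 10^4 * (sum_i i^2 * V_i - N) / N^2,
    where V_i is the number of word types occurring exactly i times
    and N is the total token count."""
    freqs = Counter(tokens)                     # type -> frequency
    n = sum(freqs.values())                     # total tokens N
    vi = Counter(freqs.values())                # frequency i -> number of types V_i
    s2 = sum(i * i * v for i, v in vi.items())  # sum of i^2 * V_i
    return 10_000 * (s2 - n) / (n * n)

# Tiny example "volume"
print(yules_k("the cat sat on the mat the cat".split()))
```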
The code is messy but fairly straightforward. The basic process is to call one of the three functions in yule_htrc.py -- non_fuzzy, as_single_corpus, or fuzzy_restrictions. The differences between the three are briefly covered in the accompanying blog post. Each takes three arguments: the path to the data to be used (.tsv files; child folders will also be searched), the path to the metadata CSV, and the path to the contextual correction CSV. A sketch of an invocation follows.
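A hypothetical call, assuming the three paths are passed positionally in the order described above (check yule_htrc.py for the actual parameter names and order); the file paths are placeholders:

```python
from yule_htrc import non_fuzzy  # or as_single_corpus, fuzzy_restrictions

non_fuzzy(
    "path/to/fiction_tsvs/",    # folder of .tsv volumes; child folders are also gathered
    "path/to/metadata.csv",     # HTRC metadata CSV
    "path/to/corrections.csv",  # contextual correction CSV
)
```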
An explanation of why I did this can be found in the blog post here: http://cmessner.com/blog/?p=127
This repository now also contains code for deduplicating the dataset using K. The HTID clumps produced by deduplicating the fiction dataset can be found in the fiction_duplicated_clumps CSV; taking one entry from each clump yields a deduplicated dataset. A sketch of that selection step is shown below.
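A minimal sketch of picking one representative per clump, assuming each row of the clumps CSV is a single clump of duplicate HTIDs (the actual filename and column layout should be checked against the file in this repository):

```python
import csv

kept = []
with open("fiction_duplicated_clumps.csv", newline="") as f:
    for row in csv.reader(f):
        htids = [h for h in row if h.strip()]  # drop empty cells
        if htids:
            kept.append(htids[0])              # keep one HTID per clump

print(f"{len(kept)} volumes retained after deduplication")
```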
More on this here: http://cmessner.com/blog/?p=209