city-directory-entry-parser

city-directory-entry-parser parses lines from OCR’d New York City directories into separate fields, such as names, occupations, and addresses.

city-directory-entry-parser is part of NYPL’s NYC Space/Time Directory project.

For more tools that are used to turn digitized city directories into datasets, see Space/Time’s City Directories repository.

This module relies on the sklearn-crfsuite implementation of a conditional random fields algorithm.
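To give a feel for what a conditional random fields model consumes, here is a minimal sketch of turning a directory line into per-token feature dictionaries, which is the input shape that sklearn-crfsuite models are trained on (one list of dicts per line). The `tokenize` and `token_features` helpers and the feature names are assumptions made for illustration; they are not the features this module actually computes.

```python
import re


def tokenize(line):
    # Split into word tokens and standalone punctuation marks.
    return re.findall(r"\w+|[^\w\s]", line)


def token_features(tokens, i):
    # Hypothetical feature set: the token itself plus its neighbours.
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_digit": tok.isdigit(),
        "is_title": tok.istitle(),
        "is_punct": not tok[0].isalnum(),
        "position": i,
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


def line_features(line):
    # Returns the tokens and one feature dict per token.
    tokens = tokenize(line)
    return tokens, [token_features(tokens, i) for i in range(len(tokens))]


tokens, feats = line_features("Calder William W, clerk, 206 W. 24th")
for tok, f in zip(tokens, feats):
    print(tok, f)
```

A sequence model can then learn, for instance, that a lowercase word between two commas tends to be an occupation, while a number at the start of a comma-separated group tends to begin an address.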

Example

Input:

"Calder William W, clerk, 206 W. 24th"

Output:

{
  "subjects": [
    "Calder William W"
  ],
  "occupations": [
    "clerk"
  ],
  "addresses": [
    [
      "206 W . 24th"
    ]
  ]
}

If the output contains an address field, nyc-street-normalizer can be used to turn this abbreviated address into a full address (e.g. 668 Sixth av. → 668 Sixth Avenue).
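The output is plain JSON, so it can be consumed with the standard library. The sketch below reads the example output shown above and pulls out each field, treating every field as optional since not every entry will contain an address. The `summarize` helper is hypothetical and not part of this module.

```python
import json

# The example output from above, as a JSON string.
raw = """
{
  "subjects": ["Calder William W"],
  "occupations": ["clerk"],
  "addresses": [["206 W . 24th"]]
}
"""


def summarize(parsed):
    # Every field is treated as optional; missing ones become empty lists.
    subjects = parsed.get("subjects", [])
    occupations = parsed.get("occupations", [])
    # "addresses" is a list of lists, so flatten it by one level.
    addresses = [part for addr in parsed.get("addresses", []) for part in addr]
    return {
        "subjects": subjects,
        "occupations": occupations,
        "addresses": addresses,
    }


summary = summarize(json.loads(raw))
print(summary)
```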

Prerequisites

city-directory-entry-parser depends on the following Python modules:

  • numpy
  • sklearn
  • nltk
  • scipy
  • sklearn_crfsuite
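Before training, it can help to confirm that these modules are importable in the current environment. The following is a small sketch using only the standard library; the `missing_modules` helper is hypothetical. Note that the names above are import names: on PyPI, `sklearn` is distributed as `scikit-learn` and `sklearn_crfsuite` as `sklearn-crfsuite`.

```python
from importlib.util import find_spec

# Import names of the modules listed above.
REQUIRED = ["numpy", "sklearn", "nltk", "scipy", "sklearn_crfsuite"]


def missing_modules(names=REQUIRED):
    # find_spec returns None when a top-level module cannot be located.
    return [name for name in names if find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```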

Installation & usage

From Python:

import json

from cdparser import Classifier, Features, LabeledEntry, Utils

## Create a classifier object and load some labeled data from a CSV
classifier = Classifier.Classifier()
classifier.load_training("/full/path/to/training/nypl-labeled-train.csv")

## Optionally, load validation dataset
classifier.load_validation("/full/path/to/validation/nypl-labeled-validate.csv")

## Train your classifier (with default settings)
classifier.train()

## Create an entry object from string
entry = LabeledEntry.LabeledEntry("Cappelmann Otto, grocer, 133 VVashxngton, & liquors, 170 Greenwich, h. 109 Cedar")

## Pass the entry to the classifier
classifier.label(entry)

## Export the labeled entry as JSON
json.dumps(entry.categories)

From bash (using parse.py):

cat /path/to/nypl-1851-1852-entries-sample.txt | python3 parse.py --training /path/to/nypl-labeled-70-training.csv
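For labeling many lines from within Python rather than from bash, a loop along the following lines may be a reasonable starting point. It relies only on the `classifier.label(entry)` call and the `entry.categories` attribute shown in the Python example above; the `label_lines` helper and its `make_entry` parameter are hypothetical, and this is not a description of what parse.py does internally.

```python
import json


def label_lines(lines, classifier, make_entry):
    """Label each non-empty line and return one JSON string per entry.

    classifier: a trained object exposing label(entry), as in the example above.
    make_entry: a callable that builds an entry from a string, for example
                LabeledEntry.LabeledEntry (an assumption based on the example).
    """
    results = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        entry = make_entry(text)
        classifier.label(entry)
        results.append(json.dumps(entry.categories))
    return results


# Possible usage with the objects from the Python example above:
#   with open("/path/to/nypl-1851-1852-entries-sample.txt") as f:
#       for row in label_lines(f, classifier, LabeledEntry.LabeledEntry):
#           print(row)
```

Passing the classifier and the entry constructor in as arguments keeps the loop easy to test with stand-in objects.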
