A collection of utilities for natural language processing: scripts for downloading, scraping, and scrubbing open NLP data from the web, some code for pattern extraction, and a bit of visualization.
Work in progress!
Currently contains scripts for:
- downloading Google's n-grams and syntactic n-grams (see also All our n-gram... and Syntactic Ngrams over Time) and extracting patterns from them using Pig ≥ 0.12 and CPython UDFs, running either locally or on a Hadoop cluster
- visualizing n-grams as mini-graphs: looking-at-ngrams
- scraping a Twitter user's timeline (twitter) or feeding a .csv file to SQLite and the like (utils)
- web-as-corpus: a minimal showcase using the Wordnik API
See the subdirectories for instructions on using each module.
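
As a rough illustration of what "feeding a .csv file to SQLite" can look like, here is a minimal sketch using only the Python standard library. It is not the code from utils: the function name `csv_to_sqlite`, the table name `tokens`, and the sample data are all hypothetical, and it assumes the first CSV row holds the column names.

```python
import csv
import io
import sqlite3


def csv_to_sqlite(csv_file, conn, table):
    """Load rows from an open CSV file object into a new SQLite table.

    Hypothetical helper for illustration; assumes the first row is a header.
    """
    reader = csv.reader(csv_file)
    header = next(reader)  # column names come from the first row
    # Quote identifiers so unusual column or table names do not break the SQL.
    columns = ", ".join('"{}"'.format(name.replace('"', '""')) for name in header)
    placeholders = ", ".join("?" for _ in header)
    quoted_table = '"{}"'.format(table.replace('"', '""'))
    conn.execute("CREATE TABLE {} ({})".format(quoted_table, columns))
    # Remaining rows are inserted as-is; values stay text because the
    # columns are declared without a type.
    conn.executemany(
        "INSERT INTO {} VALUES ({})".format(quoted_table, placeholders), reader
    )
    conn.commit()


# Demo with in-memory sample data instead of a real file.
sample = io.StringIO("word,count\nhello,3\nworld,5\n")
conn = sqlite3.connect(":memory:")
csv_to_sqlite(sample, conn, "tokens")
rows = conn.execute('SELECT word, "count" FROM tokens ORDER BY word').fetchall()
print(rows)
```

For a real file, pass `open(path, newline="")` as `csv_file` and a connection to an on-disk database instead of `:memory:`.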