Collins's NLP Lab II: PROBABILISTIC CONTEXT-FREE GRAMMARS

The goal is to generate parse trees for English sentences, which in this case happen to be trivia questions. My job was to estimate the probabilities of the rules that govern how the tree branches, using smoothed counts from the training data, and then to implement the CKY dynamic programming algorithm, which recursively finds the maximum-probability subtree for each span of the sentence given those estimated probabilities.

Author: Amandine Lee

Email: amandine.m.lee@gmail.com

Files

PYTHON FILES CREATED BY ME, i.e. the most important parts:

replace_rare_tree.py - Can be imported for its functions or run as a script on the given data files. Takes a file of JSON nested lists representing parse trees (the training data) and a text file with the counts output by count_cfg_freq.py. Tallies the words that occur with a given tag fewer than 5 times and creates a new training JSON file with those words replaced by 'RARE'.
prob_generator.py - A class that stores the counts of the different rules seen in the training data, and can be called to calculate probabilities from those counts.
cky_algo.py - Script that implements the CKY algorithm, computing the most probable parse tree for each sentence in a newline-separated file, and writes the JSON-encoded trees to a file.
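
As a rough illustration of the CKY step, here is a minimal sketch. It is not the code in cky_algo.py: the function name `cky`, the dictionaries `q_unary` and `q_binary`, and the toy grammar are all assumptions made for the example, and it skips the mapping of unseen words to 'RARE'.

```python
import json


def cky(words, q_unary, q_binary, start="S"):
    """Return (probability, tree) of the most probable parse, or (0.0, None).

    q_unary maps (X, word) to q(X -> word) and q_binary maps (X, Y, Z) to
    q(X -> Y Z). Both are hypothetical stand-ins for the real probabilities.
    """
    n = len(words)
    pi = {}  # pi[i, j, X]: best probability of X spanning words[i..j]
    bp = {}  # bp[i, j, X]: the word (when i == j) or the split (s, Y, Z)

    # Base case: spans of length one use the word-level rules.
    for i, word in enumerate(words):
        for (X, w), q in q_unary.items():
            if w == word and q > 0.0:
                pi[i, i, X] = q
                bp[i, i, X] = word

    # Recursive case: build each longer span from two adjacent shorter ones.
    for length in range(1, n):
        for i in range(n - length):
            j = i + length
            for (X, Y, Z), q in q_binary.items():
                for s in range(i, j):
                    p = q * pi.get((i, s, Y), 0.0) * pi.get((s + 1, j, Z), 0.0)
                    if p > pi.get((i, j, X), 0.0):
                        pi[i, j, X] = p
                        bp[i, j, X] = (s, Y, Z)

    def build(i, j, X):
        # Follow the back-pointers to rebuild the tree as nested lists.
        if i == j:
            return [X, bp[i, j, X]]
        s, Y, Z = bp[i, j, X]
        return [X, build(i, s, Y), build(s + 1, j, Z)]

    if n == 0 or (0, n - 1, start) not in pi:
        return 0.0, None
    return pi[0, n - 1, start], build(0, n - 1, start)


# Toy grammar, invented for this example only.
q_unary = {("DET", "the"): 1.0, ("NOUN", "dog"): 0.5,
           ("NOUN", "cat"): 0.5, ("VERB", "saw"): 1.0}
q_binary = {("S", "NP", "VP"): 1.0, ("NP", "DET", "NOUN"): 1.0,
            ("VP", "VERB", "NP"): 1.0}

prob, tree = cky("the dog saw the cat".split(), q_unary, q_binary)
print(prob)
print(json.dumps(tree))
```

The table `pi` is filled from short spans to long ones, so every subproblem is solved before it is needed. On long sentences the products get very small, so summing log probabilities instead may be the safer choice.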

GIVEN PYTHON FILES:

count_cfg_freq.py - Takes a JSON tree file and outputs the count of each item along with its type: NONTERMINAL, UNARYRULE, or BINARYRULE.
eval_parser.py - Compares two files (for the development set) and reports how accurate your parses are.
pretty_print_tree.py - Makes indented versions of trees. Takes files in single-line tree format.
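
The counts produced by count_cfg_freq.py are what the rule probabilities are estimated from. A minimal sketch of that estimate follows, assuming each count line reads `<count> <TYPE> <symbols...>`. That line layout, the function names `load_counts` and `rule_probability`, and the sample lines are assumptions for illustration, not taken from the real files.

```python
from collections import defaultdict


def load_counts(lines):
    """Split count lines into nonterminal, unary-rule and binary-rule tallies.

    Assumes each line looks like '<count> <TYPE> <symbols...>'.
    """
    nonterminal = defaultdict(int)
    unary = defaultdict(int)
    binary = defaultdict(int)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        count, kind, symbols = int(fields[0]), fields[1], tuple(fields[2:])
        if kind == "NONTERMINAL":
            nonterminal[symbols[0]] += count
        elif kind == "UNARYRULE":
            unary[symbols] += count
        elif kind == "BINARYRULE":
            binary[symbols] += count
    return nonterminal, unary, binary


def rule_probability(rule, rule_counts, nonterminal_counts):
    """Maximum-likelihood estimate: count(X -> ...) / count(X)."""
    parent_count = nonterminal_counts.get(rule[0], 0)
    if parent_count == 0:
        return 0.0
    return rule_counts.get(rule, 0) / parent_count


# Sample lines invented for this example.
sample = ["4 NONTERMINAL NP", "3 BINARYRULE NP DET NOUN", "1 UNARYRULE NP it"]
nonterminal, unary, binary = load_counts(sample)
print(rule_probability(("NP", "DET", "NOUN"), binary, nonterminal))
print(rule_probability(("NP", "it"), unary, nonterminal))
```

Rules that never appear in the counts get probability 0.0 here, which is why rare words are replaced before counting.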

GIVEN TEXT FILES:

parse_train.dat - Each line represents a sentence, parsed into its tree and stored as a nested JSON list. In each list, the first element is the node label, the second the left branch, and the third the right branch, until the tree terminates in a terminal (an actual word) and its tag, stored as ["TAG", "word"].
cfg.counts - Original counts from training data. Each line represents one piece of data, as: <nonterminal/terminal symbols...>
parse_dev.dat - Each line is a sentence to be analyzed.
parse_dev.key - The correct trees, stored in JSON format.
parse_test.dat - More single-line sentences, the test file.
tree.example - A single tree in JSON as an example
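
To make the tree format concrete, here is a small sketch that reads one JSON-encoded tree and collects its (tag, word) pairs in sentence order. The helper name `leaves` and the sample line are invented for the example, and it assumes every leaf is a two-element list whose second element is a string.

```python
import json


def leaves(tree):
    """Return the (tag, word) pairs of a nested-list tree, in sentence order."""
    # A leaf is ["TAG", "word"]: two elements, the second one a string.
    if len(tree) == 2 and isinstance(tree[1], str):
        return [(tree[0], tree[1])]
    pairs = []
    for child in tree[1:]:  # tree[0] is the node label
        pairs.extend(leaves(child))
    return pairs


# Hypothetical line in the style of the training data.
line = '["S", ["NP", ["DET", "the"], ["NOUN", "dog"]], ["VERB", "barks"]]'
pairs = leaves(json.loads(line))
print(pairs)
print(" ".join(word for _, word in pairs))
```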

GENERATED TEXT FILES:

new.counts - Counts with RARE type
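
The RARE counts come from a training file in which infrequent words have been replaced. A sketch of that replacement on nested-list trees is shown here. It is an illustration, not the code in replace_rare_tree.py: the function names are invented, and it assumes the tally is kept per (tag, word) pair.

```python
from collections import Counter


def count_words(tree, counts):
    """Tally every (tag, word) leaf of a nested-list tree into counts."""
    if len(tree) == 2 and isinstance(tree[1], str):
        counts[(tree[0], tree[1])] += 1
    else:
        for child in tree[1:]:
            count_words(child, counts)


def replace_rare(tree, counts, threshold=5, rare="RARE"):
    """Return a copy of tree with words seen fewer than threshold times replaced."""
    if len(tree) == 2 and isinstance(tree[1], str):
        tag, word = tree
        return [tag, word if counts[(tag, word)] >= threshold else rare]
    return [tree[0]] + [replace_rare(c, counts, threshold, rare) for c in tree[1:]]


# Two tiny hypothetical trees; threshold lowered to 2 so the effect is visible.
trees = [
    ["S", ["NOUN", "dogs"], ["VERB", "bark"]],
    ["S", ["NOUN", "dogs"], ["VERB", "sleep"]],
]
counts = Counter()
for t in trees:
    count_words(t, counts)
replaced = [replace_rare(t, counts, threshold=2) for t in trees]
print(replaced)
```

Because the rare words share one token, a word never seen in training can be looked up as 'RARE' at parse time instead of getting probability zero.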

