Propensity score matching for feature selection

This script calculates significance scores for text features using the method described in:

Michael J. Paul. Feature Selection as Causal Inference: Experiments with Text Classification. 21st Conference on Computational Natural Language Learning (CoNLL 2017). Vancouver, August 2017.

Input format

The input should be a text file containing one document per line. On each line, the first token should be a binary integer label (0 or 1), and the remaining tokens are the word tokens of the document. Tokens are whitespace-separated and read as-is, so any preprocessing such as punctuation removal and lowercasing should be done before running this script.
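As a minimal illustration, a preprocessing sketch like the following could produce a file in this format. The raw documents, the file name myfile.txt, and the tokenization choices here are assumptions for the example, not part of this repository:

    import re

    # Hypothetical raw data: (label, text) pairs; in practice, load these from your own source.
    raw_docs = [
        (1, "The treatment group showed improvement."),
        (0, "No effect was observed in the control group."),
    ]

    with open("myfile.txt", "w") as out:
        for label, text in raw_docs:
            # Lowercase and strip punctuation before writing, since tokens are read as-is.
            tokens = re.findall(r"[a-z0-9]+", text.lower())
            out.write(str(label) + " " + " ".join(tokens) + "\n")

Each resulting line begins with the 0/1 label followed by the document's tokens.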

Output format

The output will be written to a file with the same name as the input, with ".out" appended to the filename. Each line of the file contains a word followed by the log p-value calculated by the script. The words are sorted by log p-value, where lower (i.e., more negative) values indicate higher significance.
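For example, a short sketch like the one below could read the output back and keep only the most significant words. The output file name and the top-k cutoff are illustrative assumptions:

    # Read the ranked output and keep the k most significant words (lowest log p-values).
    k = 100  # illustrative cutoff, not prescribed by the script

    selected = []
    with open("myfile.txt.out") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue
            word, log_p = parts[0], float(parts[1])
            selected.append((word, log_p))

    # The file is already sorted, but sorting again makes the cutoff explicit.
    selected.sort(key=lambda pair: pair[1])
    top_words = [word for word, _ in selected[:k]]
    print(top_words[:10])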

Running the script

The script takes three command line arguments. The first is the name of the input file. The second is the regularization parameter, lambda in the paper; I recommend a value of 1. The third is the threshold for matching, tau in the paper; a very high value, such as 100000, effectively disables the threshold.

The command to run the script will thus look something like:

python propensity.py myfile.txt 1.0 100000

and the output in this example will be written to myfile.txt.out.

This script is quite slow to run, and it doesn't scale to large numbers of features. For bag-of-words experiments, I prune the vocabulary so that it contains only a few thousand word types (see the sketch below). Improving the efficiency would help make this more useful.
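One possible way to do this pruning is by word frequency; the sketch below is an assumption about how to prepare the data, not something provided by this repository, and the cutoff and file names are illustrative:

    from collections import Counter

    max_vocab = 3000  # keep only a few thousand word types, as suggested above

    # Count word frequencies across the corpus (skipping the leading label token).
    counts = Counter()
    with open("myfile.txt") as f:
        lines = f.readlines()
    for line in lines:
        counts.update(line.split()[1:])

    vocab = {w for w, _ in counts.most_common(max_vocab)}

    # Rewrite the corpus, dropping out-of-vocabulary tokens but keeping the labels.
    with open("myfile_pruned.txt", "w") as out:
        for line in lines:
            tokens = line.split()
            label, words = tokens[0], [w for w in tokens[1:] if w in vocab]
            out.write(label + " " + " ".join(words) + "\n")

The pruned file can then be passed to propensity.py in place of the original input.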
