karlstratos/cca
Author: Karl Stratos (stratos@cs.columbia.edu)
Release version: 1.0
Requirements: python (2.7), numpy, scipy, sparsesvd, Matlab

This program is an implementation of canonical correlation analysis (CCA) in
the context of deriving word embeddings. A theoretical justification of this
implementation is provided in:

  A spectral algorithm for learning class-based n-gram models of natural
  language. Karl Stratos, Do-kyum Kim, Michael Collins, and Daniel Hsu.
  In Proceedings of UAI (2014).

v------------------------------------------------------------------------------v
|                                     Setup                                     |
^------------------------------------------------------------------------------^

First, make sure your machine has all the required programs listed above.
Also, to be able to run Matlab, you need to change the line in the call_matlab
function in src/call_matlab.py to the path to Matlab on your machine. For
example, for me it's:

  matlab = '/Applications/MATLAB_R2013b.app/bin/matlab'

The easiest way to check that everything is in order is to run debug.py:

  $ python debug.py

v------------------------------------------------------------------------------v
|                              Preparing input data                             |
^------------------------------------------------------------------------------^

We assume a raw (but properly tokenized) text corpus as input. There is no
restriction such as 'one sentence per line'---we don't need sentence
boundaries. But sentence boundaries can be incorporated as special tokens. For
example, there is a toy corpus input/example/example.corpus:

  the dog saw the cat
  the dog barked
  the cat meowed

You can put boundary markers, as in:

  _START_ the dog saw the cat _END_
  _START_ the dog barked _END_
  _START_ the cat meowed _END_

v------------------------------------------------------------------------------v
|                           Step 1: Deriving statistics                         |
^------------------------------------------------------------------------------^

In step 1, we extract co-occurrence statistics. For example, running:

  python cca.py --corpus input/example/example.corpus --cutoff 1

will create a directory input/example/example.cutoff1.window3/ that contains
statistics of example.corpus. The command line arguments for step 1 are the
following:

  --corpus CORPUS    count words from this corpus
  --cutoff CUTOFF    cut off words appearing <= this number
  --vocab VOCAB      size of the vocabulary
  --window WINDOW    size of the sliding window
  --want WANT        want words in this file
  --rewrite          rewrite the (processed) corpus, not statistics

In particular, you can decide the context (window)---the default is 3, i.e.,
previous/next words. You can control the size of the vocabulary by discarding
rare words (cutoff) or by using only a restricted set of words (vocab). Rare
words are all replaced by a special token "<?>". (A sketch of this counting
procedure is given after step 2 below.)

v------------------------------------------------------------------------------v
|                          Step 2: Deriving embeddings Ur                       |
^------------------------------------------------------------------------------^

In step 2, we run Matlab to perform SVD on the statistics from step 1.
Running:

  python cca.py --stat input/example/example.cutoff1.window3/ --m 2 --kappa 2

will create a directory output/example.cutoff1.window3.m2.kappa2.matlab.out/
that contains the word embedding file Ur:

  4 the -2.3410244894135657e-01 -9.7221193337649348e-01
  3 <?> -8.6218169891930729e-01 -5.0659916901690338e-01
  2 dog -9.3955297838817597e-01 3.4240356423657153e-01
  2 cat -9.6347323867084655e-01 2.6780462722871301e-01

where the format of each line is <frequency>, <word>, <val_1>, <val_2>, ...,
<val_m>. Also, the rows are ordered in decreasing frequency. The command line
arguments for step 2 are the following:

  --stat STAT        directory containing statistics
  --m M              number of dimensions
  --kappa KAPPA      smoothing parameter
  --quiet            quiet mode
  --no_matlab        do not call matlab - use python sparsesvd

In particular, m is the dimensionality of CCA, and kappa is a "pseudocount".
The value of kappa needs to be tuned for the given corpus. Try experimenting
with 50, 100, 200, ... (or, if your data is huge like Google Ngram, 1000,
2000, ...) until the performance on your problem stops improving. Matlab's SVD
is very fast, so you can try many parameter values with ease.
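For concreteness, here is the sketch promised after step 1: the kind of
sliding-window counting that step 1 performs. This is not the repository's
actual code---the function name, the dictionary-based counters, and the
position-tagged context encoding are illustrative assumptions.

  # Sketch of step-1-style counting (illustrative, not the actual cca.py code).
  # Words appearing <= cutoff times are replaced by "<?>", and a window of 3
  # covers the previous and next word around each position.
  from collections import Counter

  def count_statistics(tokens, cutoff=1, window=3):
      unigrams = Counter(tokens)

      def norm(w):  # replace rare words with the special token
          return w if unigrams[w] > cutoff else '<?>'

      half = (window - 1) // 2  # window 3 -> one word on each side
      word_count, pair_count = Counter(), Counter()
      for i in range(len(tokens)):
          w = norm(tokens[i])
          word_count[w] += 1
          for j in range(max(0, i - half), min(len(tokens), i + half + 1)):
              if j != i:  # tag each context word with its relative position
                  pair_count[(w, '%s<->%d' % (norm(tokens[j]), j - i))] += 1
      return word_count, pair_count

  word_count, pair_count = count_statistics(
      open('input/example/example.corpus').read().split())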
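Likewise, here is a minimal sketch of the step 2 computation, assuming (as in
the UAI 2014 paper) that each co-occurrence count is scaled by the inverse
square roots of its smoothed marginal counts before a rank-m truncated SVD.
Treating kappa as an additive pseudocount on the marginals is an assumption
here, and scipy's svds stands in for Matlab/sparsesvd; the exact normalization
in cca.py and the Matlab script may differ.

  # Sketch of step-2-style CCA via truncated SVD (illustrative, not the
  # actual cca.py/Matlab code).
  import numpy as np
  from collections import Counter
  from scipy.sparse import csc_matrix, diags
  from scipy.sparse.linalg import svds

  def cca_embeddings(word_count, pair_count, m=2, kappa=2):
      context_count = Counter()
      for (w, c), n in pair_count.items():
          context_count[c] += n

      words = sorted(word_count, key=word_count.get, reverse=True)
      contexts = sorted(context_count)
      w_index = {w: i for i, w in enumerate(words)}
      c_index = {c: i for i, c in enumerate(contexts)}

      rows, cols, vals = [], [], []
      for (w, c), n in pair_count.items():
          rows.append(w_index[w])
          cols.append(c_index[c])
          vals.append(float(n))
      C = csc_matrix((vals, (rows, cols)), shape=(len(words), len(contexts)))

      # Omega = diag(#(w)+kappa)^(-1/2) * C * diag(#(c)+kappa)^(-1/2)
      scale_w = 1.0 / np.sqrt(np.array([word_count[w] for w in words]) + kappa)
      scale_c = 1.0 / np.sqrt(np.array([context_count[c] for c in contexts])
                              + kappa)
      Omega = diags(scale_w) * C * diags(scale_c)

      U, s, Vt = svds(Omega, k=m)  # top-m singular triplets
      return words, U  # row i of U: m-dimensional embedding of words[i]

  words, U = cca_embeddings(word_count, pair_count, m=2, kappa=2)

Here words is ordered by decreasing frequency, so the rows of U line up with
the rows of the Ur file above (up to the sign and scaling of the singular
vectors).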
v------------------------------------------------------------------------------v
|                            Optional post processing                           |
^------------------------------------------------------------------------------^

Depending on your problem, it might be a good idea to use only the top
subspace of your word embeddings. You can derive lower-dimensional embeddings
via principal component analysis (PCA), e.g.:

  python src/pca.py --embedding_file output/example.cutoff1.window3.m2.kappa2.matlab.out/Ur --pca_dim 1

Now you have a file Ur.pca1 that looks like:

  4 the 0.906265637029
  3 <?> 0.20812022154
  2 dog -0.585143449361
  2 cat -0.529242409207
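The PCA step itself is standard. Here is a minimal sketch, assuming plain
centering followed by projection onto the top principal components (the actual
src/pca.py may differ in details such as sign conventions):

  # Sketch of PCA dimensionality reduction (illustrative, not the actual
  # src/pca.py code): center the embedding matrix and keep the directions
  # of largest variance.
  import numpy as np

  def pca_reduce(embeddings, pca_dim=1):
      X = embeddings - embeddings.mean(axis=0)  # center each dimension
      U, s, Vt = np.linalg.svd(X, full_matrices=False)
      return X.dot(Vt[:pca_dim].T)  # project onto top pca_dim components

  reduced = pca_reduce(U, pca_dim=1)  # e.g., the U matrix from the sketch above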