justinaL/CoVeR

CoVeR

Implementation of CoVeR (Covariate-Specific Vector Representations with Tensor Decomposition), a model that learns word embeddings together with a set of covariate vectors that transform the base embeddings into covariate-specific ones. CoVeR extends GloVe (Global Vectors for Word Representation), proposed by Pennington J., Socher R. and Manning C.D. (2014).
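
For orientation, the objective that CoVeR minimises can be written roughly as below. This is a sketch based on my reading of Tian et al. (2018), not something taken from CoVerModel.py, and the notation is an assumption: X_{ijk} is the co-occurrence count of words i and j in corpus k, v_i is the base embedding of word i, c_k is the vector for covariate k, b_{ik} and b_{jk} are bias terms, f is the GloVe weighting function, V is the vocabulary size and K the number of covariates.

```latex
J = \sum_{i,j=1}^{V} \sum_{k=1}^{K} f(X_{ijk})
    \left( (c_k \odot v_i)^{\top} (c_k \odot v_j) + b_{ik} + b_{jk} - \log X_{ijk} \right)^{2}
```

With a single covariate whose vector is all ones, this reduces to the usual GloVe least-squares objective.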

I used this model to analyse differences in word usage between male and female speech in the Old Bailey Voice Corpus from 1780 to 1880, differences which suggest that the same words carried different meanings for the two groups. You will need tf_glove.py from Simon G. (2017) before running CoVerModel.py; you can get it from:

https://github.com/GradySimon/tensorflow-glove

CoVerModel.py -- actual code implementing CoVeR model (Tian et al., 2018)

main.py -- example usage of CoVerModel.py on the Old Bailey Voice Corpus, split into female and male corpora

Prerequisites

  • tensorflow v1.0.0
  • pandas v0.20.1
  • spacy v2.0.11, with en_core_web_md v2.0.0 installed
  • numpy v1.15.0
  • scikit-learn v0.19.2

Data set

You can find the Old Bailey Voice Corpus here:

https://github.com/sharonhoward/voa

The file I used in main.py is the unzipped version of OBV2/obv_words_v2_28-01-2017.tsv.zip. It is the updated version from May 2017.

How to use

>>> from CoVerModel import CoVeRModel
>>> cover = CoVeRModel(embedding_size=300, context_size=10)
>>> cover.fit_corpora(corpora)
>>> cover.train()
>>> cover.embeddings
array([[-0.97056437, -0.4124898 ,  0.74619746, ...,  2.2409537 ,
        -2.3252304 ,  0.0969246 ],
       [-0.6628446 , -0.92629707,  2.0215068 , ...,  1.9445996 ,
        -1.5850129 , -0.1194554 ],
       ...,
       [-0.72753906,  0.49558735,  1.2009518 , ..., -0.04062271,
        -0.49638033,  0.16031623]], dtype=float32)
>>> cover.covariates
array([[ 5.10783792e-01,  1.36624873e-01,  2.43794426e-01,
        ...
        -3.18439603e-01, -3.65320116e-01, -5.38883567e-01],
       [ 1.18549293e-09, -2.79872656e-01, -2.56543934e-01,
        ...,
         2.83558279e-01, -3.57488275e-01, -6.27053738e-01]], dtype=float32)
>>> cover.generate_tsne()

[Image: female_full]
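
The two arrays above can be combined to get one set of embeddings per covariate. The sketch below assumes that each row of cover.covariates is an elementwise scaling vector for the base embeddings, which is how I read Tian et al. (2018); the README does not confirm this, and the helper names covariate_specific and cosine are hypothetical. It uses small toy arrays in place of the trained ones.

```python
import numpy as np

def covariate_specific(embeddings, covariates, k):
    """Scale every base embedding elementwise by the k-th covariate vector.

    Assumption: covariates[k] acts as a diagonal transform on the embeddings.
    """
    return embeddings * covariates[k]

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy stand-ins for cover.embeddings (3 words x 4 dims)
# and cover.covariates (2 covariates x 4 dims).
embeddings = np.array([[1.0, 2.0, 0.0, -1.0],
                       [0.5, -1.0, 2.0, 0.0],
                       [0.0, 1.0, 1.0, 1.0]], dtype=np.float32)
covariates = np.array([[1.0, 0.5, 2.0, 1.0],
                       [0.0, 1.0, 1.0, -1.0]], dtype=np.float32)

first = covariate_specific(embeddings, covariates, 0)
second = covariate_specific(embeddings, covariates, 1)

# Per-word similarity between the two covariate-specific vectors;
# lower values mean the word's representation differs more across corpora.
shift = [cosine(first[i], second[i]) for i in range(len(embeddings))]
```

Which row corresponds to which corpus depends on the order of the corpora passed to fit_corpora, so check that order before labelling the results.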

Parameters

The parameters for creating a CoVeRModel are more or less the same as in tf_glove.py, with embedding_size and context_size as compulsory arguments.

fit_corpora expects a nested iterable: a list of corpora, where each corpus is a list of sentences and each sentence is a list of token strings, as below:

[
  [["This","is","a","sentence","in","1st","corpus","."],
   ["Second","sentence","in","1st","corpus","."]],

  [["Sentence","1","in","2nd","corpus","."],
   ["Sentence","2","in","2nd","corpus","."]]
]

For full implementation with Old Bailey Voice Corpus, see main.py.
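
As a rough illustration of how such a structure could be built from labelled text, here is a minimal sketch. It is not what main.py does: the regex tokenizer is a simple stand-in for spaCy, and the names tokenize, build_corpora and the "f"/"m" labels are hypothetical.

```python
import re
from collections import defaultdict

# Words (letters, digits, underscore) or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split a sentence into word and punctuation tokens (stand-in for spaCy)."""
    return TOKEN_RE.findall(text)

def build_corpora(rows, labels):
    """Group (label, sentence) pairs into one list of tokenized sentences per label.

    The output order follows `labels`, so corpus k belongs to labels[k].
    """
    grouped = defaultdict(list)
    for label, text in rows:
        grouped[label].append(tokenize(text))
    return [grouped[label] for label in labels]

# Hypothetical input: one (covariate label, sentence) pair per row.
rows = [
    ("f", "This is a sentence in 1st corpus."),
    ("m", "Sentence 1 in 2nd corpus."),
    ("f", "Second sentence in 1st corpus."),
    ("m", "Sentence 2 in 2nd corpus."),
]
corpora = build_corpora(rows, labels=["f", "m"])
```

The resulting corpora list has the same shape as the example above and could then be passed to fit_corpora.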

References
