word2diavec

Final project for LING-472 Python for Computational Linguistics

This project examines vector space similarity for dialectal second person plural pronouns in American English. It trains gensim FastText and Word2Vec models on tweets, then calculates cosine similarity between models and tokens.

Data was gathered from Twitter using an implementation of Tweepy, originally written for LING-452 American Dialects as tweet_sentiment_to_csv.py but rewritten for this project.

Required libraries:

gensim
scikit-learn
NLTK
Tweepy (tweet_fetcher.py only)

Runs in Python 3.6+.

word2diavec.py

word2diavec.py evaluates two vector space models trained on the same data by averaging the cosine similarities of their vectors and by scoring them on the analogy metrics proposed by Tal Linzen (2016). It accepts either two pretrained models or a dataset, which is passed to vector_training.py for training.

In normal mode, the script first saves the average cosine similarity between the two models' vectors over their shared vocabulary. Then, using a list of analogies in the form a:a* :: b:b*, it asks each model to supply the missing word b* under each of the metrics listed below, e.g., talk:talking :: swim:swimming. A model receives one point for each analogy where it correctly provides b* and no points otherwise.

Metrics used:

"vanilla" offset method
only-b: nearest neighbor cosine similarities of b
ignore-a: most similar to both a* and b
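
The three metrics can be sketched in plain Python as follows. This is a minimal illustration, not the code in word2diavec.py: the toy vectors, the function names (`cosine`, `nearest`, `linzen_answers`, `score`), and the choice to exclude a, a*, and b from the candidates under every metric are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest(vectors, target, exclude):
    """Word whose vector is most similar to target, skipping excluded words."""
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

def linzen_answers(vectors, a, a_star, b):
    """Return the (vanilla, only-b, ignore-a) answers for a : a* :: b : ?"""
    va, vs, vb = vectors[a], vectors[a_star], vectors[b]
    exclude = {a, a_star, b}  # assumption: query words are never returned
    vanilla_target = [s - x + y for x, s, y in zip(va, vs, vb)]  # a* - a + b
    ignore_a_target = [s + y for s, y in zip(vs, vb)]            # a* + b
    return (
        nearest(vectors, vanilla_target, exclude),
        nearest(vectors, vb, exclude),  # only-b: nearest neighbor of b
        nearest(vectors, ignore_a_target, exclude),
    )

def score(vectors, analogies):
    """Fraction of analogies answered correctly under each metric."""
    totals = [0, 0, 0]
    for a, a_star, b, b_star in analogies:
        answers = linzen_answers(vectors, a, a_star, b)
        for i, answer in enumerate(answers):
            totals[i] += answer == b_star  # one point per correct answer
    return tuple(t / len(analogies) for t in totals)

# Hypothetical 3-d vectors: (talk-ness, swim-ness, "-ing"-ness)
VECTORS = {
    "talk": [1.0, 0.0, 0.0],
    "talking": [1.0, 0.0, 1.0],
    "swim": [0.0, 1.0, 0.0],
    "swimming": [0.0, 1.0, 1.0],
    "water": [0.0, 0.9, 0.1],
}
ANALOGIES = [("talk", "talking", "swim", "swimming")]

print(linzen_answers(VECTORS, "talk", "talking", "swim"))
print(score(VECTORS, ANALOGIES))
```

In this toy space, only-b picks "water" because it is simply the closest neighbor of "swim", which shows how that metric ignores the a:a* relation entirely.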

It also returns the cosine similarities between models for a list of target words, as well as the top three most similar words for each model.
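
As a rough sketch of the between-model comparison, the snippet below computes the cosine similarity of each shared word's vector in two models, then averages over the shared vocabulary. The dictionaries standing in for the models and the function name `cross_model_similarity` are hypothetical, and the sketch assumes both models use the same vector dimensionality.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def cross_model_similarity(model_a, model_b, targets=()):
    """Average similarity over the shared vocabulary, plus per-target values."""
    shared = sorted(set(model_a) & set(model_b))
    sims = {w: cosine(model_a[w], model_b[w]) for w in shared}
    average = sum(sims.values()) / len(sims) if sims else 0.0
    per_target = {w: sims[w] for w in targets if w in sims}
    return average, per_target

# Hypothetical word -> vector mappings standing in for the two trained models
ft_vectors = {"you": [1.0, 0.0], "yall": [0.0, 1.0], "yinz": [1.0, 1.0]}
w2v_vectors = {"you": [1.0, 0.0], "yall": [0.0, -1.0], "youse": [1.0, 1.0]}

average, per_target = cross_model_similarity(ft_vectors, w2v_vectors, ["yall", "yinz"])
print(average, per_target)
```

Words missing from either model (here "yinz" and "youse") are skipped rather than treated as zero vectors, so they do not drag the average toward zero.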

Explore mode works the same way, except that instead of returning the scored Linzen metrics, it returns a dictionary that maps each analogy to a tuple of the model's answers under each metric, e.g., {'a : a* :: b : ': (vanilla word, only-b word, ignore-a word)}.

Output consists of the average Linzen scores and individual cosine similarities in .txt format.

Example usage with loaded models, not in Explore mode: word2diavec.py --ft fasttext.model --w2v word2vec.model --in text_input.txt --out text_output.txt --an word-test.v1.txt --l

In Explore mode: word2diavec.py --e --in text_input.txt --out text_output.txt

Output consists of a .txt file with the results of the four metrics.

vector_training.py

This script uses the gensim implementations of FastText and word2vec to train models on a corpus of tweets. Its custom tokenization function joins multi-word pronouns (e.g., 'you guys') into a single token ('youguys') and treats each emoji as a separate token, so that "yall😭", "yall😂", and "yall😭😭😭😂😂😂" are counted as variants of the word "y'all" rather than as unique tokens.
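
A tokenizer along these lines might look like the sketch below. It is not the function in vector_training.py: the `MULTIWORD` table, the emoji code point ranges, and the name `tokenize` are assumptions for illustration, and punctuation and apostrophe handling are left out.

```python
import re

# Assumed phrase table; the real script may merge a different set of pronouns.
MULTIWORD = {"you guys": "youguys", "you all": "youall"}

# Rough emoji ranges (pictographs plus miscellaneous symbols); not exhaustive.
EMOJI = "\U0001F300-\U0001FAFF\u2600-\u27BF"

# A token is either one emoji or a run of characters that are neither
# whitespace nor emoji.
TOKEN_RE = re.compile(f"[{EMOJI}]|[^\\s{EMOJI}]+")

def tokenize(text):
    """Lowercase, merge multi-word pronouns, and split emojis into own tokens."""
    text = text.lower()
    for phrase, merged in MULTIWORD.items():
        text = re.sub(rf"\b{re.escape(phrase)}\b", merged, text)
    return TOKEN_RE.findall(text)

tokens = tokenize("Yall😭😭😂 miss you guys")
print(tokens)
```

Because each emoji becomes its own token, "yall" is counted once no matter how many emojis follow it.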

It can also retrain preexisting models for 10 additional epochs.

Example usage: vector_training.py --ft fasttext.model --w2v word2vec.model --tsv tweet_data.tsv

Output consists of trained FastText and word2vec .model files.

tweet_fetcher.py

This script uses an implementation of Joshua Roesslein's Tweepy (v.3.7.0) to download tweets & user metadata from Twitter search queries, exclude those not tweeted in certain areas, and save the data in .tsv format.

Originally written for LING-447 as tweet_sentiment_to_csv.py, it has been almost entirely rewritten to improve performance and usefulness.

To use this Twitter scraper, you will need free API keys, available from Twitter's Developer website. Once connected, the scraper retrieves a user-specified number of tweets per user-specified search term. It then filters tweets based on a user-provided list of locations, which can be supplemented by user-provided stopwords/tokens, keeping only tweets whose location fields contain a desired location and no stopwords.
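
The location filter described above can be sketched as a simple predicate. The function name `keep_tweet`, the example lists, and the use of case-insensitive substring matching are assumptions; tweet_fetcher.py may match locations differently.

```python
def keep_tweet(location, targets, stopwords=()):
    """Keep a tweet if its location field names a target and no stopword."""
    loc = location.lower()
    has_target = any(t.lower() in loc for t in targets)
    has_stopword = any(s.lower() in loc for s in stopwords)
    return has_target and not has_stopword

# Hypothetical target and stopword lists
targets = ["pittsburgh", "pa"]
stopwords = ["paris"]

for location in ["Pittsburgh, PA", "Paris, France", "Austin, TX"]:
    print(location, keep_tweet(location, targets, stopwords))
```

The "Paris, France" case shows why stopwords are useful with substring matching: the short target "pa" matches inside "paris", and the stopword removes that false positive.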

Output consists of tweet metadata and tweets in .tsv format.

Example usage: tweet_fetcher.py --keys api_key.txt --output tweet_data.tsv --geo geotargets.txt --toks tok_list.txt --num 1000

Results

Overall performance of the models on the test data was quite poor, likely due in part to the small sample size after filtering. In addition, tweet_fetcher.py did not properly save tweets or tweet metadata containing line breaks, which left a large swath of the data unusable.
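
One common way to keep fields with embedded line breaks intact in a .tsv file is to let Python's csv module quote them. The sketch below illustrates that general approach with made-up rows and hypothetical helper names (`write_tweets`, `read_tweets`); it is not the code in tweet_fetcher.py.

```python
import csv
import io

def write_tweets(rows, handle):
    """Write rows as tab-separated values, quoting fields that need it."""
    writer = csv.writer(handle, delimiter="\t")
    writer.writerows(rows)

def read_tweets(handle):
    """Read tab-separated rows back, restoring quoted line breaks and tabs."""
    return [row for row in csv.reader(handle, delimiter="\t")]

# Made-up rows: one tweet contains a line break, another contains a tab.
rows = [
    ["user1", "Pittsburgh, PA", "yinz ready?\nlet's go"],
    ["user2", "Austin, TX", "hey\tyall"],
]

# StringIO stands in for a file opened with open(path, "w", newline="").
buffer = io.StringIO(newline="")
write_tweets(rows, buffer)
buffer.seek(0)
restored = read_tweets(buffer)
print(restored == rows)
```

When writing to a real file, opening it with newline="" keeps the csv module in charge of line endings, so quoted line breaks survive the round trip.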

The FastText model fared best on 2/3 Linzen (2016) metrics on Mikolov et al. (2013)'s data, vs. Word2Vec. ALL is the average of all metrics, while SYN is the average of the syntactic categories only.

Average cosine similarity between models: -0.0400

Linzen offset metrics
                  FT        w2v
VANILLA (ALL)     0.0174    0.0012
VANILLA (SYN)     0.0315    0.0017
ONLY-B (ALL)      0.0491    0.0055
ONLY-B (SYN)      0.0852    0.0058
IGNORE-A (ALL)    0.1055    0.0138
IGNORE-A (SYN)    0.1891    0.0215

When tested on pronoun_analogies.txt, scores were also quite low:

Linzen offset metrics
                  FT        w2v
VANILLA           0.0100    0.0300
ONLY-B            0.0250    0.0000
IGNORE-A          0.0100    0.0000

Cosine similarities between the two models' vectors for the same pronouns were close to zero, showing the models' representations to be fairly unrelated:

Pronoun     cos sim.
you         -0.0604
youse       -0.0246
yinz         0.0100
yall         0.0236
youguys     -0.0315
youall       0.0731

Within each model, cosine similarity between pronouns was higher:

                        FastText
pronoun     youse   yinz    yall    youguys youall
you         0.6966  0.6781  0.7798  0.7917  0.7266
youse       ------  0.6954  0.5937  0.5641  0.5214
yinz        ------  ------  0.6704  0.6168  0.5422
yall        ------  ------  ------  0.7290  0.5993
youguys     ------  ------  ------  ------  0.7523

                        word2vec
pronoun     youse   yinz    yall    youguys youall
you         0.5651  0.6444  0.7628  0.7167  0.6401
youse       ------  0.7007  0.4959  0.4523  0.3748
yinz        ------  ------  0.6420  0.5716  0.5222
yall        ------  ------  ------  0.5969  0.5192
youguys     ------  ------  ------  ------  0.5969

See report for a breakdown of scores by category and a further explanation of processes and shortcomings.

Future goals

I hope to add data visualization tools to this repository in the near future, such as implementations of mapping programs or vector space visualizers, to make the results more accessible to non-specialists and more thoroughly explore the "dialect" component of the project.
