In this project, we use a novel non-parametric skip-gram model to capture dialectal changes in English at multiple geographic resolutions. This repository contains the IDs of the tweets we used to train the model. You are free to crawl the data using these IDs and preprocess it with our tools to replicate our research results.
| Count | USA | UK | Total |
|---|---:|---:|---:|
| Tweets | 2,075,394 | 1,088,232 | 3,163,626 |
| Tokens | 41,637,107 | 22,012,953 | 63,650,060 |
| Terms | 865,784 | 469,570 | 1,167,790 |
Note: the CMU geo data contains only 378K tweets.
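Crawling tweets from IDs is usually done in batches. The sketch below shows one way to group the IDs before passing them to a crawler of your choice. It assumes the IDs are available as one ID per line of text, which is an assumption about the file layout, and the helper name `batch_ids` and the default batch size of 100 are illustrative, not part of this repository. Check the limits of whichever API or tool you use.

```python
from typing import Iterable, Iterator, List


def batch_ids(ids: Iterable[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield lists of at most `batch_size` tweet IDs.

    Hypothetical helper: blank lines and duplicate IDs are skipped,
    and the original order of first occurrences is preserved.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    seen = set()
    batch: List[str] = []
    for raw in ids:
        tweet_id = raw.strip()
        # Skip empty lines and IDs we have already emitted.
        if not tweet_id or tweet_id in seen:
            continue
        seen.add(tweet_id)
        batch.append(tweet_id)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit the final, possibly shorter, batch.
    if batch:
        yield batch
```

Because `batch_ids` accepts any iterable of strings, an open file handle can be passed directly, for example `for batch in batch_ids(open(path)): ...`, where `path` points at your local copy of the ID file.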
To use our model implementation, visit the DialectGram GitHub page. The GitHub repository contains four models:
- baseline models: frequency model and syntactic model
- GEODIST model: region-specific embeddings
- DialectGram model: a novel approach to composing dialect-sensitive word embeddings, based on Adaptive Skip-gram.
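Adaptive Skip-gram (Bartunov et al., listed in the acknowledgements) places a stick-breaking prior over the senses of each word, which is what lets the number of senses be inferred rather than fixed in advance. The sketch below illustrates only the stick-breaking arithmetic that turns a sequence of break fractions into sense weights. It is not code from the DialectGram repository, and the function name `stick_breaking_weights` is hypothetical.

```python
from typing import List


def stick_breaking_weights(betas: List[float]) -> List[float]:
    """Convert break fractions in [0, 1] into sense weights.

    The k-th weight is betas[k] times the length of stick still
    remaining after the first k breaks.
    """
    weights: List[float] = []
    remaining = 1.0  # length of the stick not yet assigned to a sense
    for beta in betas:
        if not 0.0 <= beta <= 1.0:
            raise ValueError("each break fraction must lie in [0, 1]")
        weights.append(beta * remaining)
        remaining *= 1.0 - beta
    return weights


# Three senses; the last break takes all of the remaining stick.
print(stick_breaking_weights([0.5, 0.5, 1.0]))  # [0.5, 0.25, 0.25]
```

In the full model the break fractions are random variables rather than fixed numbers; here they are passed in directly to keep the example deterministic.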
You can play with our demo on the website: demo
We would like to acknowledge the following resources, which we drew on when implementing our models:
- Eisenstein, Jacob, et al. "A latent variable model for geographic lexical variation." Proceedings of the 2010 conference on empirical methods in natural language processing. Association for Computational Linguistics, 2010.
- Bartunov, Sergey, et al. "Breaking sticks and ambiguities with adaptive skip-gram." Artificial Intelligence and Statistics. 2016.
- Bamman, David, Chris Dyer, and Noah A. Smith. "Distributed representations of geographically situated language." Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2014.
- Kulkarni, Vivek, Bryan Perozzi, and Steven Skiena. "Freshman or fresher? quantifying the geographic variation of internet language." arXiv preprint arXiv:1510.06786 (2015).
- Python Implementation of AdaGram
- Julia Implementation of AdaGram
Jiang, Hang*; Haoshen Hong*; Yuxing Chen*; and Vivek Kulkarni. 2019. DialectGram: Automatic Detection of Dialectal Changes with Multi-geographic Resolution Analysis. To appear in Proceedings of the Society for Computation in Linguistics. New Orleans: Linguistic Society of America.
@inproceedings{Jiang:Hong:Chen:2020:SCiL,
Author = {Jiang, Hang and Hong, Haoshen and Chen, Yuxing and Kulkarni, Vivek},
Title = {DialectGram: Automatic Detection of Dialectal Changes with Multi-geographic Resolution Analysis},
Booktitle = {Proceedings of the Society for Computation in Linguistics},
Location = {New Orleans},
Publisher = {Linguistic Society of America},
Address = {Washington, D.C.},
Year = {2020}}