We construct an embedding of Twitter accounts in order to visualize clusters. To do so, we apply techniques normally used to construct word embeddings. As far as we know, we are the first to use the method in this way.
- iterate over all accounts and count co-occurrences: for each account, note which other accounts it retweets besides @hgmaassen (a binary yes/no per retweeted account), then count the retweeted accounts pair-wise in a 2D matrix
- Pointwise Mutual Information to normalize counts and construct a vector space
- choose N accounts, i.e. the ones with the highest total count, and apply PCA to project them onto a 2D plane for visualization
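
The three steps can be sketched roughly as follows. This is a minimal illustration on made-up toy data, not the code from the notebook: the function names, the input dictionary `retweets_by_user`, and the account names are all hypothetical. It also assumes the positive variant of PMI (negative and undefined values are set to 0), which is common for word embeddings but may differ from what the notebook does, and it computes PCA through an SVD of the centered vectors.

```python
import numpy as np

def cooccurrence_counts(retweets_by_user, accounts):
    """Count, for each pair of accounts, how many users retweeted both."""
    index = {name: i for i, name in enumerate(accounts)}
    counts = np.zeros((len(accounts), len(accounts)))
    for retweeted in retweets_by_user.values():
        # Binary choice: each retweeted account counts once per user.
        ids = sorted({index[a] for a in retweeted if a in index})
        for i in ids:
            for j in ids:
                if i != j:
                    counts[i, j] += 1
    return counts

def ppmi(counts):
    """Positive PMI: log(observed / expected), clipped at 0 (an assumption)."""
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    expected = row * col / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts / expected)
    pmi[~np.isfinite(pmi)] = 0.0  # pairs that never co-occur
    return np.maximum(pmi, 0.0)

def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    centered = vectors - vectors.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :2] * s[:2]

# Toy data: user -> set of accounts they retweeted (hypothetical names).
retweets_by_user = {
    "u1": {"a", "b"},
    "u2": {"a", "b"},
    "u3": {"c", "d"},
    "u4": {"c", "d"},
    "u5": {"a", "b", "c"},
}
accounts = ["a", "b", "c", "d"]

counts = cooccurrence_counts(retweets_by_user, accounts)
vectors = ppmi(counts)

# Keep the N accounts with the highest total count, then project to 2D.
n = 3
top = np.argsort(-counts.sum(axis=1), kind="stable")[:n]
coords = pca_2d(vectors[top])
```

Each row of `coords` is the 2D position of one of the selected accounts and can be passed to any scatter plot routine. Whether the selected accounts keep all PMI columns or only those of the other selected accounts is a design choice; the sketch keeps all columns.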
This results in an image in which points that lie closer together represent accounts whose retweeters show similar retweet behaviour.
See 2_create_vis.ipynb for more details.
If you want to dig deeper into the (NLP) topic, see "Improving Distributional Similarity with Lessons Learned from Word Embeddings" by Omer Levy, Yoav Goldberg, and Ido Dagan, TACL 2015.
I am not sure whether I should write about or experiment with the method further. If you have an opinion on it, write me an email.