# bigpicture-twitterdiscourse

Companion repository to the blog post *The big picture of public discourse on Twitter by clustering metadata*, where we analyze Swedish Twitter data from 2015.

To illustrate what we did, three Python scripts are provided. The first two require the networkx library; the third requires gensim.

You also need a working installation of Infomap (the standalone version, available from http://www.mapequation.org/code.html). We use the standalone version because we have been unable to make the Python Infomap library produce the same results as the standalone executable.

The first script, `0_make_mentiongraph.py`, only needs to be run if you want to regenerate the graph from raw tweets. It processes a large number of tweets into a networkx graph, which is stored as a pickle file. Example call:

```
python 0_make_mentiongraph.py tweet_dir my_graph
```

... where `tweet_dir` (in this example) is the name of a directory containing user tweets, with one .txt file per account holding that user's tweets, and `my_graph` is the prefix of the pickle file that will store the generated graph.

We will provide "ready-made" pickle files, so you should hopefully be able to skip this step.
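To show roughly what this step involves, here is a sketch of building a mention graph with networkx. The mention regex, the file-naming scheme, and the edge-weight scheme below are assumptions for illustration, not the repository's exact code:

```python
import os
import pickle
import re
import sys

import networkx as nx

# Hypothetical mention pattern: "@" followed by word characters.
MENTION_RE = re.compile(r"@(\w+)")

def build_mention_graph(tweet_dir):
    """Build an undirected graph with one node per account and edge
    weights counting how often two accounts mention each other."""
    g = nx.Graph()
    for fname in os.listdir(tweet_dir):
        if not fname.endswith(".txt"):
            continue
        user = fname[:-4]  # account name taken from the file name (assumption)
        with open(os.path.join(tweet_dir, fname), encoding="utf-8") as f:
            for line in f:
                for mentioned in MENTION_RE.findall(line):
                    if mentioned == user:
                        continue  # ignore self-mentions
                    # Increment the edge weight, starting from 0 if absent.
                    w = g.get_edge_data(user, mentioned, {"weight": 0})["weight"]
                    g.add_edge(user, mentioned, weight=w + 1)
    return g

if __name__ == "__main__":
    graph = build_mention_graph(sys.argv[1])
    with open(sys.argv[2] + ".pickle", "wb") as out:
        pickle.dump(graph, out)
```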

The second script, `1_pickle_to_communities.py`, performs the actual community detection using Infomap.

NOTE: for this script to work, you must first change the path to the Infomap executable in the code!

```
python 1_pickle_to_communities.py my_graph.pickle
```

If you download the pickle file linked from the blog post, you should be able to do:

```
python 1_pickle_to_communities.py undirected_g_2015.pickle
```

directly on that file. Note, however, that Python 3 is probably needed: this pickle format does not appear to be supported by Python 2.
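The general shape of this step can be sketched as follows: load the pickled networkx graph, export it to a format the standalone Infomap can read, and invoke the executable. The Infomap path, the Pajek export, and the flag names here are assumptions (which is also why the script requires you to edit its Infomap path):

```python
import pickle
import subprocess

import networkx as nx

# Hypothetical location -- change to your installation, just as you must
# edit the path inside 1_pickle_to_communities.py.
INFOMAP_PATH = "/path/to/Infomap"

def load_graph(pickle_path):
    """Load the networkx graph stored by the first script."""
    with open(pickle_path, "rb") as f:
        return pickle.load(f)

def export_pajek(g, net_path):
    """Write the graph in Pajek format, one input format Infomap accepts."""
    nx.write_pajek(g, net_path)

def run_infomap(net_path, out_dir):
    """Run the standalone Infomap executable; flag names vary between
    Infomap versions, so treat these as placeholders."""
    subprocess.run(
        [INFOMAP_PATH, net_path, out_dir, "--undirected", "--tree"],
        check=True,
    )
```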

The third script, `2_content_analysis.py`, calculates the most distinctive words for each of the largest communities (using TF-IDF) and prints some information about each cluster.

```
python 2_content_analysis.py tweet_dir my_graph_trees
```

... where `tweet_dir` is again the path to the directory of user tweet files, and `my_graph_trees` is the directory generated by Infomap in the previous step, which contains the community decomposition of the graph in two separate files.
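The repository's script uses gensim for this step; purely to illustrate the idea, here is a minimal, library-free TF-IDF sketch. Treating each community's pooled tweets as one document, and assuming tokenization has already happened, the "most distinctive" words are those with the highest TF-IDF weight:

```python
import math
from collections import Counter

def distinctive_words(community_tokens, topn=5):
    """community_tokens: one token list per community (each list is the
    pooled, tokenized tweets of that community's members -- an assumption
    about how communities map to documents). Returns, per community, its
    topn words by TF-IDF weight."""
    n_docs = len(community_tokens)
    # Document frequency: in how many communities does each word occur?
    df = Counter()
    for tokens in community_tokens:
        df.update(set(tokens))
    results = []
    for tokens in community_tokens:
        tf = Counter(tokens)
        # Term frequency times inverse document frequency.
        scores = {w: (c / len(tokens)) * math.log(n_docs / df[w])
                  for w, c in tf.items()}
        results.append(sorted(scores, key=scores.get, reverse=True)[:topn])
    return results
```

Words occurring in every community get an IDF of zero, so the output naturally favors words that set one community apart from the others.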