Companion repository to blog post "The big picture of public discourse" where we analyze Swedish Twitter data from 2015
0_make_mentiongraph.py
1_pickle_to_communities.py
2_content_analysis.py
README.md
dependencies.txt

bigpicture-twitterdiscourse

Companion repository to the blog post "The big picture of public discourse on Twitter by clustering metadata", in which we analyze Swedish Twitter data from 2015.

To illustrate what we did, three Python scripts are provided. The first two require the networkx library, and the third requires the gensim library.

You also need a working installation of the standalone version of Infomap (http://www.mapequation.org/code.html). We use the standalone version because we have been unable to make the Python Infomap library generate the same results.

The first script, 0_make_mentiongraph.py, is not meant to be run unless necessary. It is used to process a large number of tweets to generate a networkx graph, which is stored as a pickle file. Example call:

python 0_make_mentiongraph.py tweet_dir my_graph

... where "tweet_dir" (in this example) is the name of a directory containing user tweets, one .txt file per account holding the tweets for that user, and "my_graph" is the prefix of the pickle file that will store the generated graph.
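As a rough illustration of this step, the sketch below counts mentions between accounts from such a directory. It is not the code in 0_make_mentiongraph.py. It assumes that each file name (minus ".txt") is the account name, that each line of a file is one tweet, and that a mention is any "@name" token; the function name mention_edges is hypothetical.

```python
import os
import re
from collections import Counter

# Assumption: a mention is "@" followed by word characters.
MENTION_RE = re.compile(r"@(\w+)")


def mention_edges(tweet_dir):
    """Count undirected mention pairs across all .txt files in tweet_dir.

    Assumes the file name stem is the account name and that each line
    is one tweet (both are assumptions, not taken from the repository).
    """
    edges = Counter()
    for filename in sorted(os.listdir(tweet_dir)):
        if not filename.endswith(".txt"):
            continue
        user = filename[: -len(".txt")].lower()
        path = os.path.join(tweet_dir, filename)
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                for mentioned in MENTION_RE.findall(line):
                    mentioned = mentioned.lower()
                    if mentioned == user:
                        continue  # ignore self-mentions
                    # Sort the pair so (a, b) and (b, a) share one key.
                    edges[tuple(sorted((user, mentioned)))] += 1
    return edges
```

The resulting counts could then be loaded into a networkx graph with `G.add_edge(u, v, weight=w)` for each pair and stored with `pickle.dump`.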

We will provide "ready-made" pickle files, so you should be able to skip this step.

The second script, 1_pickle_to_communities.py, does the actual community detection analysis by using Infomap.

NOTE! For this script to work, you must change the path to the Infomap executable in the code!

python 1_pickle_to_communities.py my_graph.pickle
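For orientation, here is a minimal sketch of how a weighted edge list could be handed to the standalone Infomap binary from Python. It is an assumption about the general approach, not a copy of 1_pickle_to_communities.py: the helper names, the plain link-list input format ("source target weight" with integer node ids) and the placeholder path are all hypothetical choices made for this example.

```python
def write_link_list(edges, path):
    """Write edges as 'source target weight' lines with integer node ids.

    edges maps (u, v) name pairs to weights. Returns the name -> id mapping
    so that Infomap's output can be translated back to account names.
    """
    ids = {}
    with open(path, "w", encoding="utf-8") as fh:
        for (u, v), weight in sorted(edges.items()):
            for name in (u, v):
                ids.setdefault(name, len(ids) + 1)  # ids start at 1
            fh.write(f"{ids[u]} {ids[v]} {weight}\n")
    return ids


def infomap_command(infomap_path, network_file, out_dir):
    """Build the basic call: Infomap <network_file> <out_directory>."""
    return [infomap_path, network_file, out_dir]


# Hypothetical usage; replace the path with your own Infomap executable:
# import subprocess
# cmd = infomap_command("/path/to/Infomap", "my_graph.txt", "my_graph_trees")
# subprocess.run(cmd, check=True)
```

Depending on your Infomap version, extra options may be needed (for example to declare the input format or to treat links as undirected); `Infomap --help` lists what your build supports.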

If you download the pickle file linked from the blog post, you should be able to do:

python 1_pickle_to_communities.py undirected_g_2015.pickle

Note, however, that Python 3 is probably needed for this (I think this pickle format is not supported by Python 2).
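If you are unsure which Python version a given pickle file needs, you can inspect its protocol number. Python 2 can only read pickle protocols 0 to 2; protocol 3 and later require Python 3. The sketch below relies on the fact that pickles written with protocol 2 or higher start with the byte 0x80 followed by the protocol number. The function name pickle_protocol is made up for this example.

```python
def pickle_protocol(path):
    """Return the protocol number declared at the start of a pickle file.

    Returns None when there is no protocol header, which is the case for
    protocols 0 and 1 (both readable by Python 2).
    """
    with open(path, "rb") as fh:
        head = fh.read(2)
    if len(head) == 2 and head[0] == 0x80:
        return head[1]
    return None
```

A return value of 3 or more means the file cannot be loaded by Python 2.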

The third script, 2_content_analysis.py, calculates the most distinctive words for each of the largest communities (using TF-IDF) and gives some information on each cluster.

python 2_content_analysis.py tweet_dir my_graph_trees

... where "tweet_dir" is again the path to the directory of user tweet files, and "my_graph_trees" is a directory that has been generated by Infomap in the previous step and which contains the community decomposition of the graph in two separate files.
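To give an idea of what "most distinctive words (using TF-IDF)" means here, the sketch below treats the pooled tweets of each community as one document and ranks words by term frequency times inverse document frequency. It is a plain-Python stand-in written for illustration, not the gensim-based code in 2_content_analysis.py; the function name, the input layout and the exact weighting formula are assumptions.

```python
import math
from collections import Counter


def distinctive_words(community_tokens, top_n=3):
    """Rank words per community by a simple TF-IDF score.

    community_tokens maps a community id to the list of all tokens
    from that community's tweets (one "document" per community).
    """
    n_docs = len(community_tokens)
    # Document frequency: in how many communities does each word occur?
    doc_freq = Counter()
    for tokens in community_tokens.values():
        doc_freq.update(set(tokens))

    result = {}
    for cid, tokens in community_tokens.items():
        tf = Counter(tokens)
        scores = {
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in tf.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[cid] = [word for word, _ in ranked[:top_n]]
    return result
```

Words that occur in every community get a score of zero with this weighting, so common function words drop to the bottom of each ranking.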