# Practice Session 04: Networks from text

In this session we will learn to construct a network from a set of implicit relationships. The relationships that we will study are between accounts in Twitter, a micro-blogging service.

We will create two networks: one directed and one undirected.

* In the **directed mention network**, there is a link of weight *w* from account *x* to account *y* if account *x* has re-tweeted (re-posted) or mentioned account *y* *w* times.

* In the **undirected co-mention network**, there is a link of weight *w* between accounts *x* and *y* if both accounts have been mentioned together in *w* tweets.
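As a small worked illustration of the two definitions, the sketch below builds both edge sets from two made-up tweets. The account names and the `toy_tweets` list are hypothetical, and the mentions are given directly rather than extracted from text.

```python
# Hypothetical toy data: (author, accounts mentioned in one tweet).
toy_tweets = [
    ("alice", ["bob", "carol"]),
    ("alice", ["bob"]),
]

directed = {}    # (source, target) -> number of mentions
undirected = {}  # (user1, user2) with user1 < user2 -> number of tweets

for author, mentions in toy_tweets:
    # Directed network: one edge from the author to each mentioned account.
    for mention in mentions:
        key = (author, mention)
        directed[key] = directed.get(key, 0) + 1
    # Undirected network: one edge per pair of accounts mentioned together.
    for mention1 in mentions:
        for mention2 in mentions:
            if mention1 < mention2:
                key = (mention1, mention2)
                undirected[key] = undirected.get(key, 0) + 1

print(directed)
print(undirected)
```

Here `alice` mentions `bob` in two tweets, so the directed edge has weight 2, while `bob` and `carol` appear together in one tweet, which gives an undirected edge of weight 1.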

The input material you will use is a file named `CovidLockdownCatalonia.json.gz`, available in the `data/` directory. This is a gzip-compressed file, which you can decompress either from your Python code (as detailed below) or using the `gunzip` command. The file contains about 35,500 messages ("tweets") posted between March 13th, 2020, and March 14th, 2020, containing a hashtag or keyword related to COVID-19, and posted by a user declaring a location in Catalonia. Each tweet is contained in a single line.

*Advice: be careful if you want to open `CovidLockdownCatalonia.json`: it takes a long time to load and can freeze your computer. It is better to look at `CovidLockdownCatalonia_sample-10-lines.json` instead, which contains only the first 10 lines (line breaks have been added to these lines to make them easier to read).*

The tweets are in a format known as [JSON](https://en.wikipedia.org/wiki/JSON#Example). Python's JSON library takes care of translating it into a dictionary.
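As a minimal illustration, `json.loads` turns one line of JSON text into nested Python dictionaries. The tweet below is a made-up, heavily shortened record (real tweets carry many more fields); the field names `user`, `screen_name` and `full_text` are the ones used later in this session.

```python
import json

# A made-up, heavily shortened tweet; real records have many more fields.
line = '{"full_text": "RT @Jordi: check this post by @Xavier", "user": {"screen_name": "example_user"}}'

tweet = json.loads(line)             # str -> dict
print(type(tweet).__name__)          # dict
print(tweet["user"]["screen_name"])  # nested objects become nested dicts
print(tweet["full_text"])
```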

**How was this file obtained?** This file was obtained from [CrisisNLP](https://crisisnlp.qcri.org/covid19), a website that provides collections of COVID-19 tweets. However, it only provides the identifier of each tweet, known as a tweet-id. To recover the entire tweet, a process commonly known as *re-hydration* was used, which involves querying a Twitter API with the tweet-id to obtain the tweet. This can be done with a little bit of programming or with software such as [twarc](https://github.com/DocNow/twarc#dehydrate).

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

Author: <font color="blue">Your name here</font>

E-mail: <font color="blue">Your e-mail here</font>

Date: <font color="blue">The current date here</font>

# 1. Create the directed mention network

Create the **directed mention network**, which has a weighted edge (source, target, weight) if user *source* mentioned user *target* at least once; with *weight* indicating the number of mentions.

Create two files: one containing all edges, and one containing all edges with *weight* greater than or equal to 2.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [None]:
import io
import json
import gzip
import csv
import re

In [None]:
# Leave this code as-is

# Input file
COMPRESSED_INPUT_FILENAME = "data/CovidLockdownCatalonia.json.gz"

# These are the output files, leave as-is
OUTPUT_ALL_EDGES_FILENAME = "delivery/CovidLockdownCatalonia.csv"
OUTPUT_FILTERED_EDGES_FILENAME = "delivery/CovidLockdownCatalonia-min-weight-filtered.csv"
OUTPUT_CO_MENTIONS_FILENAME = "delivery/CovidLockdownCatalonia-co-mentions.csv"

## 1.1. Extract mentions

The `extract_mentions(text)` function is used to extract mentions, so that if we give, for instance `RT @Jordi: check this post by @Xavier`, it returns the list `["Jordi", "Xavier"]`.

Note that you need an `import re` command together with the other imports. The `re` module provides utilities for working with regular expressions, which are used here to extract user names from patterns beginning with @.

You can now print all the people mentioned in a tweet by doing:

```python
mentions = extract_mentions(message)
for mention in mentions:
    print("%s mentioned %s" % (author, mention))
```

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [None]:
# Leave this code as-is

def extract_mentions(text):
    return re.findall("@([a-zA-Z0-9_]{5,20})", text)

print(extract_mentions("RT @Jordi: check this post by @Xavier"))
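The pattern has a few consequences that are worth keeping in mind when reading the results. The sketch below repeats the function so that it is self-contained and tries it on made-up strings; how to treat these cases (for instance, whether to lower-case names) is not specified in this session, so take it only as an illustration of what the regular expression does.

```python
import re

def extract_mentions(text):
    # Same pattern as in the session: "@" followed by 5 to 20 allowed characters.
    return re.findall("@([a-zA-Z0-9_]{5,20})", text)

# Names shorter than five characters are not matched at all.
short = extract_mentions("thanks @abcd and @Xavier")

# Any "@" followed by enough allowed characters matches, e.g. in an e-mail address.
email = extract_mentions("write to someone@example.org")

# Matching is case-sensitive, so these are two different strings.
mixed = extract_mentions("@Xavier and @xavier")

print(short, email, mixed)
```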

## 1.2. Count mentions

We do not need to uncompress this file (it is about 236 MB uncompressed, but only 31 MB compressed); we can read it directly while it is compressed.

```python
with gzip.open(COMPRESSED_INPUT_FILENAME, "rt", encoding="utf-8") as input_file:
    for line in input_file:
        tweet = json.loads(line)
        author = tweet["user"]["screen_name"]
        message = tweet["full_text"]
        print("%s: '%s'" % (author, message))
```
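To try this reading pattern without touching the real dataset, the following sketch writes a tiny gzip-compressed file with two made-up tweets into a temporary directory and reads it back line by line. The file name `toy.json.gz` and the records are hypothetical.

```python
import gzip
import json
import os
import tempfile

# Two made-up tweets, one JSON document per line (hypothetical data).
records = [
    {"user": {"screen_name": "example_user"}, "full_text": "hello @Jordi"},
    {"user": {"screen_name": "other_user"}, "full_text": "no mentions here"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.json.gz")

    # Write the toy file in text mode ("wt"); it is compressed on the fly.
    with gzip.open(path, "wt", encoding="utf-8") as output_file:
        for record in records:
            output_file.write(json.dumps(record) + "\n")

    # Read it back line by line without decompressing it on disk.
    authors = []
    with gzip.open(path, "rt", encoding="utf-8") as input_file:
        for line in input_file:
            tweet = json.loads(line)
            authors.append(tweet["user"]["screen_name"])

print(authors)
```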

To count how many times a mention happens, you will keep a dictionary:

```python
mentions_counter = {}
```

Each key in the dictionary will be a tuple `(author, mention)` where `author` is the username of the person who writes the message, and `mention` the username of someone who is mentioned in the message. To update the dictionary, use this code while you are reading the input file:

```python
for mention in mentions:
    key = (author, mention)
    if key in mentions_counter:
        mentions_counter[key] += 1
    else:
        mentions_counter[key] = 1
```
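An equivalent way to keep these counts, shown here only as an alternative sketch, is `collections.Counter` from the standard library, which treats missing keys as zero. The `toy_tweets` list is hypothetical and stands in for the parsed input file.

```python
from collections import Counter

# Hypothetical (author, mentions) pairs standing in for the parsed tweets.
toy_tweets = [
    ("alice", ["bob", "carol"]),
    ("dave", ["bob"]),
    ("alice", ["bob"]),
]

mentions_counter = Counter()
for author, mentions in toy_tweets:
    for mention in mentions:
        # Missing keys count as 0, so no "if key in ..." test is needed.
        mentions_counter[(author, mention)] += 1

# The heaviest edge, as a list of ((author, mention), weight) pairs.
heaviest = mentions_counter.most_common(1)
print(heaviest)
```

Since `Counter` is a subclass of `dict`, code that iterates over the keys of a plain dictionary works with it unchanged.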

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to read the compressed input file and create the mentions_counter dictionary.</font>

Print the number of times the account `joanmariapique` mentioned `catalangov`. It should be 9.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to print the number of times the account `joanmariapique` mentioned `catalangov`.</font>

Now we write a file with all the edges in this graph (Source, Target, Weight) as a tab-separated file.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [None]:
# Leave this code as-is

with io.open(OUTPUT_ALL_EDGES_FILENAME, "w") as output_file:
    writer = csv.writer(output_file, delimiter='\t', quotechar='"', lineterminator='\n')
    writer.writerow(["Source", "Target", "Weight"])
    for key in mentions_counter:
        author = key[0]
        mention = key[1]
        weight = mentions_counter[key]
        writer.writerow([author, mention, weight])
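The same writer pattern can be combined with a condition on the weight. The sketch below is one possible way to do this, using made-up counts and an in-memory buffer instead of a file on disk; `toy_counter` and `MIN_WEIGHT` are names introduced here for illustration.

```python
import csv
import io

# Hypothetical counts standing in for mentions_counter.
toy_counter = {("alice", "bob"): 3, ("alice", "carol"): 1, ("dave", "bob"): 2}
MIN_WEIGHT = 2  # threshold, named here for illustration

# In-memory buffer instead of a file, so the sketch leaves nothing on disk.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='\t', quotechar='"', lineterminator='\n')
writer.writerow(["Source", "Target", "Weight"])
for (author, mention), weight in toy_counter.items():
    # Keep only edges whose weight reaches the threshold.
    if weight >= MIN_WEIGHT:
        writer.writerow([author, mention, weight])

print(buffer.getvalue())
```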

<font size="+1" color="red">Replace this cell with your code to create a file named `OUTPUT_FILTERED_EDGES_FILENAME` containing all (author, mention) pairs with a value greater than or equal to 2.</font>

# 2. Visualize the directed mention network

Open the **filtered** edge file in Cytoscape, by importing its CSV file. You may have to set the delimiter to "Tab" in the advanced options, when importing.

The file is large so if you want to see all details while zooming out you may have to set ``View > Always show Graphic Details``. Note this makes the program run slower.

Style the network:

* Run "Tools > Analyze Network ..." (as a directed graph)
* Style nodes by setting their size proportional to their in-degree
* Style edges by setting their width and color (darker=more) using the *weight* attribute.
* Style edges by setting Target Arrow Shape: select column *interaction*, set Mapping Type to *Discrete Mapping*, and map the value *interacts with* to *Arrow*.

*Tip*: to count nodes in Cytoscape, hold shift while clicking and select the nodes. In the lower-right corner you should see a count of nodes and edges.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Save the image as mentions.png and replace this cell with `![Mentions graph](mentions.png)` to display your graph.</font>

<font size="+1" color="red">Replace this cell with a brief commentary of what you see in this graph. What is the size of the largest connected component, both as a number of nodes and as a percentage of the nodes in the graph? What is the size of the second largest connected component?</font>

Keep only the largest connected component, deleting the rest of the nodes (you can hold shift while drawing a rectangle, to select some nodes).
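To cross-check the component sizes that Cytoscape reports, the (weakly) connected components can also be computed directly from an edge list. The sketch below is a minimal breadth-first search over a made-up edge list that ignores edge direction; `toy_edges` and `component_sizes` are names introduced here for illustration, and this is not a description of what Cytoscape does internally.

```python
from collections import defaultdict, deque

# Hypothetical edge list; direction is ignored for weak connectivity.
toy_edges = [("a", "b"), ("b", "c"), ("d", "e"), ("f", "f")]

# Undirected adjacency: every endpoint becomes a key.
neighbors = defaultdict(set)
for source, target in toy_edges:
    neighbors[source].add(target)
    neighbors[target].add(source)

def component_sizes(neighbors):
    """Return the sizes of all connected components, largest first."""
    seen = set()
    sizes = []
    for start in neighbors:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for other in neighbors[node]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        sizes.append(size)
    return sorted(sizes, reverse=True)

sizes = component_sizes(neighbors)
share = 100.0 * sizes[0] / sum(sizes)
print(sizes, "largest component: %.1f%% of nodes" % share)
```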

Run the ClusterMaker2 plug-in to create a clustering (affinity propagation clustering) of this graph using the *weight* edge attribute. Color nodes according to their cluster, using a discrete mapping. Note that if you right-click on "Mapping type" when creating a discrete mapping, you can use an automatic mapping generator that you can fine-tune later.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Save the image as mentions-largest-cc.png and replace this cell with `![Mentions graph largest connected component](mentions-largest-cc.png)` to display your graph.</font>

Look at the Results Panel of the network analyzer. There is interesting information here, particularly the node degree distribution.
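For comparison with the plots in the Results Panel, a degree distribution can be computed from the edge dictionary with a few lines of Python. This is a sketch on made-up edges that counts unweighted degrees (number of distinct neighbours); whether the network analyzer uses exactly the same convention is an assumption.

```python
from collections import Counter

# Hypothetical weighted edges standing in for the filtered mention network.
toy_counter = {("alice", "bob"): 3, ("dave", "bob"): 2, ("alice", "carol"): 2}

in_degree = Counter()
out_degree = Counter()
for source, target in toy_counter:
    out_degree[source] += 1  # one more distinct account mentioned by source
    in_degree[target] += 1   # one more distinct account mentioning target

# Degree distribution: how many nodes have each in-degree value.
# Nodes with in-degree 0 (here alice and dave) do not appear in in_degree.
in_degree_distribution = Counter(in_degree.values())
print(sorted(in_degree_distribution.items()))
```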

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Save the plots of the in-degree distribution and the out-degree distribution, and replace this cell with `![Mentions in-degree distribution](mentions-in-degree-distribution.png)` and `![Mentions out-degree distribution](mentions-out-degree-distribution.png)` to display them.</font>

<font size="+1" color="red">Replace this cell by a brief commentary, in your own words, of what you see in this graph. What type of graph is it? Which kinds of nodes are high-centrality nodes? Include any aspects that you find relevant from the results of the network analyzer (e.g., number of nodes, edges, connected components, characteristic path lengths, average degrees, etc.) and indicate why those aspects are relevant.</font>

# 3. Create the undirected co-mention network

The **undirected co-mention network** connects two accounts if they are both mentioned in the same tweet. The weight of the edge is the number of tweets in which the accounts are co-mentioned.

If the mentions in a tweet are in the list ``mentions``, you can iterate through all pairs of co-mentioned accounts like this:

```python
for mention1 in mentions:
    for mention2 in mentions:
        if mention1 < mention2:
            key = (mention1, mention2)
```
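One detail of the nested loops is that an account mentioned twice in the same tweet produces the same pair more than once. The sketch below contrasts this with an alternative that deduplicates the list first and uses `itertools.combinations`. Whether repeated mentions should be collapsed is a design choice not settled here, and the expected counts in this session may have been obtained without deduplication. The `mentions` list is hypothetical.

```python
from itertools import combinations

mentions = ["Xavier", "Jordi", "Xavier", "Montse"]  # hypothetical list with a repeat

# Nested loops as in the session: a repeated mention yields repeated pairs.
loop_pairs = []
for mention1 in mentions:
    for mention2 in mentions:
        if mention1 < mention2:
            loop_pairs.append((mention1, mention2))

# Alternative: deduplicate and sort first, then take each unordered pair once.
unique_pairs = list(combinations(sorted(set(mentions)), 2))

print(loop_pairs)
print(unique_pairs)
```

Both approaches produce the same set of pairs, each with the lexicographically smaller name first; they differ only in how often a pair is emitted for a single tweet.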

Read the input file again to create a dictionary `co_mentions_counter` whose keys are tuples `(user1, user2)` in which `user1` lexicographically precedes `user2` (`user1 < user2`), and whose values are the number of times `user1` and `user2` have appeared together in a tweet.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to create the `co_mentions_counter`.</font>

Print the number of times the accounts `agriculturacat` and `uniopagesos` have been mentioned together. It should be 8.
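Because each key is stored with the lexicographically smaller name first, a lookup must order the two names in the same way. A minimal sketch of such a lookup on made-up counts follows; `co_mention_weight` and `toy_co_mentions` are hypothetical names introduced for illustration.

```python
# Hypothetical counts; keys are stored with user1 < user2.
toy_co_mentions = {("alice", "bob"): 4, ("bob", "carol"): 1}

def co_mention_weight(counter, user_a, user_b):
    """Look up an undirected pair regardless of the order of the arguments."""
    key = (user_a, user_b) if user_a < user_b else (user_b, user_a)
    # Pairs that were never co-mentioned have weight 0.
    return counter.get(key, 0)

print(co_mention_weight(toy_co_mentions, "bob", "alice"))
print(co_mention_weight(toy_co_mentions, "alice", "carol"))
```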

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to print the number of times the accounts `agriculturacat` and `uniopagesos` have been mentioned together.</font>

Now create a file named `OUTPUT_CO_MENTIONS_FILENAME` containing co-mentions in tab-separated columns Source, Target, Weight.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to create the co-mentions file.</font>


# 4. Visualize the undirected co-mention network in Cytoscape

Open the `OUTPUT_CO_MENTIONS_FILENAME` file in Cytoscape.

Style the network so that line widths are larger for edges with large weights, and node sizes are larger for nodes with large degrees. Remember you need to run the network analyzer first.

Use "Layout > Prefuse Force Directed Layout > All Nodes > Weight" to create a layout by edge weight.

Run a clustering algorithm within Cytoscape to create a clustering of this graph using the *Weight* attribute as weight, and assign colors to the different clusters.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Save the image as co-mentions.png and replace this cell with `![Co-mentions graph](co-mentions.png)` to display your graph.</font>

<font size="+1" color="red">Replace this cell by a brief commentary, in your own words, of what you see in this graph. What type of graph is it? Which kinds of nodes are high-centrality nodes? Include any aspects that you find relevant from the results of the network analyzer (e.g., number of nodes, edges, connected components, characteristic path lengths, average degrees, etc.) and indicate why those aspects are relevant.</font>


# DELIVER (individually)

Deliver a zip file containing:

* Your code as a Python notebook (a `.ipynb` file).
   * Remove all unnecessary elements
   * Add comments when needed
* Any png files that you inserted in the notebook

<font size="-1" color="gray">(Remove this cell when delivering.)</font>


<font size="+2" color="#003300">I hereby declare that, except for the code provided by the course instructors, all of my code, text, and figures were produced by myself.</font>
