# Practice Session 04: Networks from text

Author: <font color="blue">Josip Hanak</font>

E-mail: <font color="blue">josip.hanak@fer.hr</font>

Date: <font color="blue">7/10/2022</font>

# 1. Create the directed mention network

Create the **directed mention network**, which has a weighted edge (source, target, weight) if user *source* mentioned user *target* at least once; with *weight* indicating the number of mentions.

Create two files: one containing all edges, and one containing only the edges with *weight* greater than or equal to 3.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [38]:
import io
import json
import gzip
import csv
import re

from IPython.display import Image

In [39]:
# Leave this code as-is

# Input file
COMPRESSED_INPUT_FILENAME = "CovidLockdownCatalonia.json.gz"

# These are the output files, leave as-is
OUTPUT_ALL_EDGES_FILENAME = "CovidLockdownCatalonia.csv"
OUTPUT_FILTERED_EDGES_FILENAME = "CovidLockdownCatalonia-min-weight-filtered.csv"
OUTPUT_CO_MENTIONS_FILENAME = "CovidLockdownCatalonia-co-mentions.csv"

## 1.1. Extract mentions

In [11]:
# Leave this code as-is

def extract_mentions(text):
    return re.findall("@([a-zA-Z0-9_]{5,20})", text)

print(extract_mentions("RT @Marco: check this @Charles post by @Xavier"))

['Marco', 'Charles', 'Xavier']
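Before counting, it may help to check what the pattern accepts. The sketch below repeats `extract_mentions` (written as a raw string, an editorial choice that matches the same text) and uses made-up handles. Because of the `{5,20}` quantifier, handles shorter than five characters are skipped and longer ones are cut at 20 characters.

```python
import re

def extract_mentions(text):
    # Same pattern as in the cell above, written as a raw string
    return re.findall(r"@([a-zA-Z0-9_]{5,20})", text)

# Hypothetical handles, chosen only to illustrate the length limits
short_result = extract_mentions("thanks @Bob and @Alice")
long_result = extract_mentions("@" + "x" * 25)

print(short_result)         # '@Bob' has only 3 characters, so it is not matched
print(len(long_result[0]))  # the match stops after 20 characters
```

Any account with a very short handle would therefore be missing from the network built in the following sections.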


## 1.2. Count mentions

We do not need to uncompress this file (it is about 236 MB uncompressed, but only 31 MB compressed); we can read it directly while it is compressed.

```python
with gzip.open(COMPRESSED_INPUT_FILENAME, "rt", encoding="utf-8") as input_file:
    for line in input_file:
        tweet = json.loads(line)
        author = tweet["user"]["screen_name"]
        message = tweet["full_text"]
        print("%s: '%s'" % (author, message))
```
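To try the reading loop without the full dataset, a minimal sketch could first write a tiny compressed file and then read it back. The two tweets, the screen names, and the file name `sample.json.gz` are made up for illustration; only the `user.screen_name` and `full_text` fields used in this practice are included.

```python
import gzip
import json
import os
import tempfile

# Two made-up tweets with only the fields used in this practice
sample_tweets = [
    {"user": {"screen_name": "alice_example"}, "full_text": "hello @bob_example"},
    {"user": {"screen_name": "bob_example"}, "full_text": "no mentions here"},
]

# Write them as gzip-compressed JSON lines to a temporary file
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as output_file:
    for tweet in sample_tweets:
        output_file.write(json.dumps(tweet) + "\n")

# Read the file back line by line without uncompressing it on disk
authors = []
with gzip.open(sample_path, "rt", encoding="utf-8") as input_file:
    for line in input_file:
        tweet = json.loads(line)
        authors.append(tweet["user"]["screen_name"])

print(authors)
```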

To count how many times each mention happens, you will keep a dictionary:

```python
mentions_counter = {}
```

Each key in the dictionary will be a tuple `(author, mention)` where `author` is the username of the person who writes the message, and `mention` the username of someone who is mentioned in the message. To update the dictionary, use this code while you are reading the input file:

```python
for mention in mentions:
    key = (author, mention)
    if key in mentions_counter:
        mentions_counter[key] += 1
    else:
        mentions_counter[key] = 1
```
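The same update can also be sketched with `collections.Counter` from the standard library, which treats missing keys as zero. This is an alternative, not what the course code uses; the author and mention names below are hypothetical stand-ins for parsed tweets.

```python
from collections import Counter

mentions_counter = Counter()

# Hypothetical (author, mentions) pairs standing in for parsed tweets
parsed = [
    ("alice_example", ["bob_example", "carol_example"]),
    ("alice_example", ["bob_example"]),
]

for author, mentions in parsed:
    for mention in mentions:
        # Missing keys start at 0, so no membership test is needed
        mentions_counter[(author, mention)] += 1

print(mentions_counter[("alice_example", "bob_example")])
```

A `Counter` is a `dict` subclass, so code that iterates over keys or calls `.items()` works with it unchanged.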

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to read the compressed input file and create the mentions_counter dictionary.</font>

In [17]:
mentions_counter = dict()
with gzip.open(COMPRESSED_INPUT_FILENAME, "rt", encoding="utf-8") as input_file:
    for line in input_file:
        tweet = json.loads(line)
        author = tweet["user"]["screen_name"]
        message = tweet["full_text"]
        mentions = extract_mentions(message)
        for mention in mentions:
            key = (author, mention)
            if key in mentions_counter:
                mentions_counter[key] += 1
            else:
                mentions_counter[key] = 1

In [19]:
print(mentions_counter[("joanmariapique", "catalangov")])

9


<font size="+1" color="red">Replace this cell with your code to print all the pairs of accounts (u,v) in which account u mentioned account v more than 5 times, and the number of times u mentioned v.</font>

In [24]:
for k, v in mentions_counter.items():
    if v > 5:
        print(k[0], "mentioned", v, "times", k[1])

wualaswold1 mentioned 9 times updayESP
RedPillDetox mentioned 7 times TomthunkitsMind
SpanishDan1 mentioned 8 times fascinatorfun
OriolQuintanaMa mentioned 6 times informativost5
emocionycambio mentioned 16 times DrTedros
flower302 mentioned 8 times realDonaldTrump
SpanishDan1 mentioned 6 times g_gosden
SpanishDan1 mentioned 6 times miffythegamer
LuisVies mentioned 6 times CMichaelGibson
FXstreetNews mentioned 11 times HareshMenghani
FXstreetNews mentioned 8 times eren_fxstreet
enricgari mentioned 6 times VilaWeb
BCN_Mobilitat mentioned 8 times TMBinfo
MargaXrepublica mentioned 6 times MargaXrepublica
IsabelPerez770 mentioned 6 times PabloFuente
emocionycambio mentioned 7 times CoronavirusESP
jrdelbrio mentioned 6 times ELSOOrg
Txesnut1 mentioned 11 times DrEricDing
joanmariapique mentioned 9 times catalangov
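The listing above follows dictionary order. If a ranking is preferred, one possible sketch sorts the frequent pairs by weight first. The small dictionary below copies three pairs from the output above and adds one made-up low-weight pair so that the filter has something to discard.

```python
# Three pairs copied from the output above, plus one made-up low-weight pair
mentions_counter = {
    ("joanmariapique", "catalangov"): 9,
    ("emocionycambio", "DrTedros"): 16,
    ("FXstreetNews", "HareshMenghani"): 11,
    ("user_example_a", "user_example_b"): 2,
}

# Keep pairs with more than 5 mentions, then sort from heaviest to lightest
frequent = [(pair, count) for pair, count in mentions_counter.items() if count > 5]
frequent.sort(key=lambda item: item[1], reverse=True)

for (source, target), count in frequent:
    print(source, "mentioned", target, count, "times")
```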


Now we write a file `OUTPUT_ALL_EDGES_FILENAME` with **all** the edges in this graph `(Source, Target, Weight)` as a tab-separated file, and a file `OUTPUT_FILTERED_EDGES_FILENAME` with only the edges of weight greater than or equal to 3.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [30]:
# Leave this code as-is

with io.open(OUTPUT_ALL_EDGES_FILENAME, "w") as output_file:
    writer = csv.writer(output_file, delimiter='\t', quotechar='"', lineterminator='\n')
    writer.writerow(["Source", "Target", "Weight"])
    for key in mentions_counter:
        author = key[0]
        mention = key[1]
        weight = mentions_counter[key]
        writer.writerow([author, mention, weight])

<font size="+1" color="red">Replace this cell with your code to create a file named `OUTPUT_FILTERED_EDGES_FILENAME` containing all (author, mention) pairs with a value greater or equal to 3.</font>

In [33]:
# Use the filename constant defined at the top of the notebook, not a hard-coded local path
with io.open(OUTPUT_FILTERED_EDGES_FILENAME, "w") as output_file:
    writer = csv.writer(output_file, delimiter='\t', quotechar='"', lineterminator='\n')
    writer.writerow(["Source", "Target", "Weight"])
    for key in mentions_counter:
        author = key[0]
        mention = key[1]
        weight = mentions_counter[key]
        if weight>=3:
            writer.writerow([author, mention, weight])
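Since the two edge files differ only in the weight threshold, the writing step could be wrapped in one helper. The function name `write_edges`, its `min_weight` parameter, and the toy counts are assumptions made for this sketch; it writes to a temporary file and reads the result back as a sanity check.

```python
import csv
import os
import tempfile

def write_edges(filename, counter, min_weight=1):
    """Write (Source, Target, Weight) rows with weight >= min_weight; return the row count."""
    rows_written = 0
    with open(filename, "w", encoding="utf-8", newline="") as output_file:
        writer = csv.writer(output_file, delimiter="\t", quotechar='"', lineterminator="\n")
        writer.writerow(["Source", "Target", "Weight"])
        for (source, target), weight in counter.items():
            if weight >= min_weight:
                writer.writerow([source, target, weight])
                rows_written += 1
    return rows_written

# Made-up counts for illustration
toy_counter = {("user_a1", "user_b1"): 1, ("user_a1", "user_c1"): 3, ("user_b1", "user_c1"): 5}
toy_path = os.path.join(tempfile.mkdtemp(), "toy-edges.csv")
kept = write_edges(toy_path, toy_counter, min_weight=3)

# Read the file back to confirm the header and the number of rows
with open(toy_path, encoding="utf-8", newline="") as input_file:
    rows = list(csv.reader(input_file, delimiter="\t"))
print(kept, rows[0])
```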

# 2. Visualize the directed mention network

## 2.1. Visualize two largest connected components


Open the **filtered** edge file in Cytoscape, by importing its CSV file. You may have to set the delimiter to "Tab" in the advanced options, when importing.

The file is large so if you want to see all details while zooming out you may have to set ``View > Always show Graphic Details``. Note this makes the program run slower.

(a) **Keep only the two largest connected components of the graph.**

(b) Style the network:

* Run `Tools > Analyze Network ...`
* Style nodes by setting their width proportional to their in-degree
* Style edges by setting their color so that blue means smaller edge betweenness and red means larger edge betweenness
* Style edges to add arrows at the end of each edge

Save the image as `mentions-two-cc.png`; the next cell should display it.

*Tip*: to count nodes in Cytoscape, hold shift while clicking and select the nodes. In the lower-right corner you should see a count of nodes and edges.
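The counts read off Cytoscape can be cross-checked in Python. The following is a minimal sketch using only the standard library, on a made-up edge list; it ignores edge direction, finds connected components with breadth-first search, and computes the diameter by running a search from every node, which is only practical for small graphs such as the filtered one.

```python
from collections import defaultdict, deque

def connected_components(edges):
    """Return components (as sets of nodes), largest first, ignoring edge direction."""
    neighbors = defaultdict(set)
    for source, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)
    seen, components = set(), []
    for start in neighbors:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for other in neighbors[node]:
                if other not in component:
                    component.add(other)
                    queue.append(other)
        seen |= component
        components.append(component)
    return sorted(components, key=len, reverse=True), neighbors

def diameter(component, neighbors):
    """Longest shortest path inside one component (BFS from every node)."""
    longest = 0
    for start in component:
        distance, queue = {start: 0}, deque([start])
        while queue:
            node = queue.popleft()
            for other in neighbors[node]:
                if other not in distance:
                    distance[other] = distance[node] + 1
                    queue.append(other)
        longest = max(longest, max(distance.values()))
    return longest

# Made-up edges: one path of four nodes and one separate pair
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")]
components, neighbors = connected_components(toy_edges)
largest = components[0]
total_nodes = len(neighbors)
share = 100 * len(largest) / total_nodes
print(len(largest), round(share, 1), diameter(largest, neighbors))
```

With the real data, `toy_edges` would be replaced by the `(author, mention)` keys of the filtered edges.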

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [36]:
# Adjust width/height as needed

Image(url="mentions-two-cc.png", width=1200)

<font size="+1" color="red">Replace this cell with a brief commentary of what you see in this graph. What is the diameter of the largest connected component, disregarding edge direction? What is the size of the largest connected component, both as a number of nodes and as a percentage of the nodes in the graph? What is the size of the second largest connected component?</font>

In [37]:
Apologies for displaying only the largest component: I do have both components, but Jupyter is not rendering the image, so I will attach it in the zip file.
I cannot find a clear meaning in the network, because the data (mentions between users I do not know) has no context for me.
The only interesting observation is that SpanishDan1 mentions only English accounts.

## 2.2. Cluster the largest connected component


Keep only the largest connected component, deleting the rest of the nodes (you can hold shift while drawing a rectangle, to select some nodes).

Run the ClusterMaker2 plug-in to create a clustering (affinity propagation clustering) of this graph using the *weight* edge attribute. Color nodes according to their cluster, using a discrete mapping. Note that if you right-click on "Mapping type" when creating a discrete mapping, you can use an automatic mapping generator that you can fine-tune later.

Export the image as `mentions-largest-cc.png`; the next cell should display it.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [38]:
# Adjust width/height as needed

Image(url="mentions-largest-cc.png", width=1200)

<font size="+1" color="red">Replace this cell by a brief commentary, in your own words, of what you see in this graph. What type of graph is it? Which kinds of edges have high edge betweenness, disregarding edge direction? Include any aspects that you find relevant.</font>

In [None]:
ClusterMaker2 did not work for me, even after multiple attempts, so I could not produce this clustering.

## 2.3. Examine degree distributions

Go back to the **first network containing all nodes and connected components**, and look at the Results Panel of the network analyzer. From there, when `Node Table` is selected in the panel below, you can click on `Node degree distribution ...` and obtain in-degree and out-degree plots. 

Export the distributions as `mentions-indegree.png` and `mentions-outdegree.png`; the next cell should display them.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [40]:
# Adjust width/height as needed

display(Image(url="mentions-indegree.png", width=400))

display(Image(url="mentions-outdegree.png", width=400))

<font size="+1" color="red">Replace this cell by a brief commentary, in your own words, about these degree distributions</font>

In [None]:
The in-degree and out-degree distributions are quite similar. Both are unimodal and appear to decay roughly exponentially.
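One way to examine the shape of these distributions outside Cytoscape is to tabulate them from the edge list. The sketch below uses made-up directed edges and the standard library only; with the real data, the edges would be the keys of `mentions_counter`. Plotting the resulting counts on log-linear and log-log axes can help tell an exponential decay (roughly straight on log-linear axes) from a power law (roughly straight on log-log axes).

```python
from collections import Counter

# Made-up directed edges (source, target) for illustration
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]

in_degree, out_degree = Counter(), Counter()
for source, target in toy_edges:
    out_degree[source] += 1
    in_degree[target] += 1

nodes = set(in_degree) | set(out_degree)
# Map each degree value to the number of nodes that have it
in_degree_histogram = Counter(in_degree[node] for node in nodes)
out_degree_histogram = Counter(out_degree[node] for node in nodes)

print(sorted(in_degree_histogram.items()))
print(sorted(out_degree_histogram.items()))
```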

# 3. Create the undirected co-mention network

The **undirected co-mention network** connects two accounts if they are both mentioned in the same tweet. The weight of the edge is the number of tweets in which the accounts are co-mentioned.

Suppose the mentions in a tweet are in the list ``mentions``; then you can iterate through all pairs of co-mentioned accounts like this:

```python
for mention1 in mentions:
    for mention2 in mentions:
        if mention1 < mention2:
            key = (mention1, mention2)
```
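One detail of the nested loops is worth checking: if a tweet mentions the same account twice, the pair is produced twice for that tweet. A possible alternative, sketched below with made-up handles, deduplicates the mentions and uses `itertools.combinations`, so each pair is produced once per tweet. The source does not say which behaviour the course expects, so treat this as an illustration of the difference, not a replacement.

```python
from itertools import combinations

# Made-up mentions list in which one account is mentioned twice
mentions = ["bbbbb_example", "aaaaa_example", "bbbbb_example"]

# Nested loops, as above: the repeated mention yields the pair twice
nested_pairs = [(m1, m2) for m1 in mentions for m2 in mentions if m1 < m2]

# Deduplicate and sort first, then take each unordered pair exactly once
unique_pairs = list(combinations(sorted(set(mentions)), 2))

print(len(nested_pairs), len(unique_pairs))
```

Because the set is sorted before pairing, the first element of every pair still precedes the second lexicographically.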

Read the input file again to create a dictionary `co_mentions_counter` in which keys are tuples (user1, user2) in which user1 lexicographically precedes user2 (user1 < user2), and values are the number of times user1 and user2 have appeared together in a tweet.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to create the `co_mentions_counter`.</font>

In [42]:
co_mentions_counter = dict()
with gzip.open(COMPRESSED_INPUT_FILENAME, "rt", encoding="utf-8") as input_file:
    for line in input_file:
        tweet = json.loads(line)
        author = tweet["user"]["screen_name"]
        message = tweet["full_text"]
        mentions = extract_mentions(message)
        for mention1 in mentions:
            for mention2 in mentions:
                if mention1 < mention2:
                    key = (mention1, mention2)
                    if key in co_mentions_counter:
                        co_mentions_counter[key] += 1
                    else:
                        co_mentions_counter[key] = 1

Print the number of times the accounts `agriculturacat` and `uniopagesos` have been mentioned together. It should be 8.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [43]:
print(co_mentions_counter["agriculturacat", "uniopagesos"])

8


<font size="+1" color="red">Replace this cell with your code to print all pairs of accounts that have been mentioned 30 times or more.</font>

In [46]:
for k, v in co_mentions_counter.items():
    # "30 times or more" includes pairs with exactly 30 co-mentions
    if v >= 30:
        print(k[0], "mentioned", v, "times", k[1])

QuimTorraiPla mentioned 92 times govern
elnacionalcat mentioned 90 times joseantich
QuimTorraiPla mentioned 59 times tjparfitt
emergenciescat mentioned 31 times govern
josepcosta mentioned 49 times sanchezcastejon
gencat mentioned 105 times govern
mossos mentioned 44 times semgencat
QuimTorraiPla mentioned 75 times emergenciescat
Antoni_Gelonch mentioned 106 times sanchezcastejon


Now create a file named `OUTPUT_CO_MENTIONS_FILENAME` containing co-mentions in tab-separated columns `Source, Target, Weight`.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to create the co-mentions file.</font>

In [50]:
# Use the filename constant defined at the top of the notebook, not a hard-coded local path
with io.open(OUTPUT_CO_MENTIONS_FILENAME, "w") as output_file:
    writer = csv.writer(output_file, delimiter='\t', quotechar='"', lineterminator='\n')
    writer.writerow(["Source", "Target", "Weight"])
    for key in co_mentions_counter:
        source = key[0]
        target = key[1]
        weight = co_mentions_counter[key]
        writer.writerow([source, target, weight])

# 4. Visualize the undirected co-mention network in Cytoscape


Open the `OUTPUT_CO_MENTIONS_FILENAME` file in Cytoscape.

**Select nodes having degree (in + out) greater than or equal to 10.** You can do that with the `Filter` panel on the left, then create a new graph from the selected nodes.
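The same selection can be approximated in Python before importing, which keeps the file loaded into Cytoscape small. This is a sketch under stated assumptions: the helper name `filter_by_min_degree` is made up, degree is taken as the number of distinct co-mentioned accounts in the full graph, and the toy data uses a threshold of 2 instead of 10 to stay readable.

```python
from collections import Counter

def filter_by_min_degree(co_mentions, min_degree):
    """Keep edges whose two endpoints both have degree >= min_degree in the full graph."""
    degree = Counter()
    for user1, user2 in co_mentions:
        degree[user1] += 1
        degree[user2] += 1
    return {
        (user1, user2): weight
        for (user1, user2), weight in co_mentions.items()
        if degree[user1] >= min_degree and degree[user2] >= min_degree
    }

# Made-up co-mention counts; a threshold of 2 keeps the toy example small
toy = {("a", "b"): 4, ("a", "c"): 1, ("b", "c"): 2, ("c", "d"): 7}
kept = filter_by_min_degree(toy, min_degree=2)
print(sorted(kept))
```

Here node `d` has degree 1, so the edge `("c", "d")` is dropped even though its weight is the largest.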

Use `Layout > Prefuse Force Directed Layout > All Nodes > Weight` to create a layout by edge weight.

Style the network so that:

* All nodes have the same size
* Edges have width proportional to weight.
* Edges are black for small weight, and red for large weight

Export the image as `co-mentions-min-degree-10.png`; the next cell should display it.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [51]:
# Adjust width/height as needed

Image(url="co-mentions-min-degree-10.png", width=1200)

<font size="+1" color="red">Replace this cell by a brief commentary, in your own words, of what you see in this graph. Which kinds of connected components does it have? Are connected components sparse or dense? Is there any specially dense sub-graph within the largest connected component, what is it? Include any aspects that you find relevant.</font>

In [None]:
The connected components are very dense: almost all of their nodes are connected to each other, forming nearly
complete graphs. The most densely connected component is the largest one, which contains co-mentions of news organisations
such as theguardian and the Washington Post.
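The statement that a component is nearly complete can be quantified with the density of an undirected graph, $2m / (n(n-1))$ for $n$ nodes and $m$ edges, which equals 1 for a complete graph. A minimal sketch with hypothetical node and edge counts (not measured from this network):

```python
def density(node_count, edge_count):
    """Fraction of possible undirected edges that are present: 2m / (n (n - 1))."""
    if node_count < 2:
        return 0.0
    return 2 * edge_count / (node_count * (node_count - 1))

# Hypothetical counts: a complete graph on 5 nodes, and a path on 5 nodes
complete_density = density(5, 10)
path_density = density(5, 4)
print(complete_density, path_density)
```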


# DELIVER (individually)

Deliver a zip file containing:

* Your code as a Python notebook (a `.ipynb` file).
   * Remove all unnecessary elements
   * Add comments when needed
* Any png files that you inserted in the notebook

## Extra points available

For more learning and extra points, create a file `account-type.csv` containing the type of account of the top 50 accounts with the most mentions. You can use types "journalist", "media", "politician", "government institution", "individual", "health-related", etc. which you should categorize manually. Create a visualization either including only these 50 accounts, or including more accounts but highlighting these top 50 with colors. Use broad categories as needed and do not worry if there are some ambiguities in the categorization, e.g., if you are not 100% sure on whether someone should be in one category or another; just do your best.
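As a starting point for this task, the most mentioned accounts could be ranked from `mentions_counter` and written to a template whose type column is then filled in by hand. The helper name `top_mentioned`, the column names `Account`, `Mentions`, and `Type`, the tab delimiter, and the toy counts are all assumptions of this sketch; the real task would use `n = 50` and write to `account-type.csv`.

```python
import csv
import io
from collections import Counter

def top_mentioned(mentions_counter, n):
    """Return the n accounts with the most incoming mentions as (account, total) pairs."""
    totals = Counter()
    for (_author, mention), weight in mentions_counter.items():
        totals[mention] += weight
    return totals.most_common(n)

# Made-up counts for illustration
toy_counter = {("u_one", "u_two"): 3, ("u_three", "u_two"): 2, ("u_one", "u_three"): 4}
top = top_mentioned(toy_counter, 2)

# Write a template with an empty Type column to be filled in by hand
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["Account", "Mentions", "Type"])
for account, total in top:
    writer.writerow([account, total, ""])
print(buffer.getvalue())
```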

**Note:** if you go for the extra points, add ``<font size="+2" color="blue">Additional results: account types</font>`` at the top of your notebook.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>


<font size="+2" color="#003300">I hereby declare that, except for the code provided by the course instructors, all of my code, text, and figures were produced by myself.</font>
