Signed Triads in Social Media
=============================

By **Guangyi Zhang** (guaz@kth.se)

Please click
[HERE](https://drive.google.com/file/d/1TrxhdSxsU1qKk_SywKf2nUA8mUAO4CG4/view?usp=sharing)
to watch the accompanying video.

Introduction
------------

This project aims to verify the friend-foe motifs in a large-scale
signed social network.

A signed network is a graph that contains both positive and negative
links. The sign of a link carries rich semantics in different
applications. For example, in a social network, positive links can
indicate friendly relationships, while negative ones indicate
antagonistic interactions.

In online discussion sites such as Slashdot, users can tag other users
as “friends” and “foes”. Such sites provide exemplary datasets for
studying an online signed network. In this notebook we explore a dataset
from Epinions, which contains about 119,217 nodes and 841,200 edges,
along with millions of motifs (the header of the file downloaded below
reports 131,828 nodes and 841,372 edges). Epinions is the trust network
of the Epinions product review web site, where users can indicate their
trust or distrust of the reviews of others. We analyze the network data
in an undirected representation.

References:

Leskovec, Jure, Daniel Huttenlocher, and Jon Kleinberg. "Signed networks
in social media." Proceedings of the SIGCHI conference on human factors
in computing systems. 2010.

Regarding the motifs, we investigate several interesting triads that are
related to *structural balance theory* in an online signed social
network. Structural balance theory originated in social psychology in
the mid-20th century, and considers the possible ways in which triangles
on three individuals can be signed.

The different types of triads, shown in the figure below, are named by
their number of positive edges:

-   T3: “the friend of my friend is my friend”
-   T1: “the friend of my enemy is my enemy,” “the enemy of my friend is
    my enemy” and “the enemy of my enemy is my friend”
-   T2 and T0: these do not quite make sense in a social network. For
    example, two friends of mine are unlikely to be enemies of each
    other.

Our goal is to compare the numbers of the different triads in the chosen
dataset.

![triads](https://drive.google.com/uc?export=view&id=1QY9ouqxbVqpH3KLl-x-QyR72yxtE0vSX)
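
The naming above can be sketched in a few lines of plain Scala. This is only an illustration, not part of the original notebook: the names `TriadType`, `classify` and `sampleTriads` are hypothetical, and signs are assumed to be encoded as `+1` and `-1`, as in the dataset used below.

```scala
// Hypothetical helper names; signs are +1 (friend/trust) and -1 (foe/distrust).
final case class TriadType(positiveEdges: Int) {
  // A triad is balanced when the product of its three signs is positive,
  // i.e. when it has an odd number of positive edges (T3 or T1).
  def balanced: Boolean = positiveEdges % 2 == 1
  def label: String = s"T$positiveEdges"
}

def classify(s1: Int, s2: Int, s3: Int): TriadType = {
  val signs = Seq(s1, s2, s3)
  require(signs.forall(s => s == 1 || s == -1), "signs must be +1 or -1")
  TriadType(signs.count(_ == 1))
}

// One sample triad per type.
val sampleTriads = Seq((1, 1, 1), (1, 1, -1), (1, -1, -1), (-1, -1, -1))
sampleTriads.foreach { case (a, b, c) =>
  val t = classify(a, b, c)
  println(s"($a, $b, $c) -> ${t.label}, balanced = ${t.balanced}")
}
```

Under this encoding, T3 and T1 come out as balanced and T2 and T0 as imbalanced, which matches the grouping used when the counts are summed later on.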

Download dataset
----------------

In [None]:
pwd

  

>     /databricks/driver

In [None]:
wget http://snap.stanford.edu/data/soc-sign-epinions.txt.gz

  

>     --2020-11-24 16:15:45--  http://snap.stanford.edu/data/soc-sign-epinions.txt.gz
>     Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
>     Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.
>     HTTP request sent, awaiting response... 200 OK
>     Length: 2972840 (2.8M) [application/x-gzip]
>     Saving to: ‘soc-sign-epinions.txt.gz’
>
>     2020-11-24 16:15:45 (7.59 MB/s) - ‘soc-sign-epinions.txt.gz’ saved [2972840/2972840]

In [None]:
ls -l

  

>     total 2924
>     drwxr-xr-x  2 root root    4096 Jan  1  1970 conf
>     -rw-r--r--  1 root root     733 Nov 24 15:24 derby.log
>     drwxr-xr-x 10 root root    4096 Nov 24 15:24 eventlogs
>     drwxr-xr-x  2 root root    4096 Nov 24 16:15 ganglia
>     drwxr-xr-x  2 root root    4096 Nov 24 16:04 logs
>     -rw-r--r--  1 root root 2972840 Dec  3  2009 soc-sign-epinions.txt.gz

In [None]:
gunzip soc-sign-epinions.txt.gz

In [None]:
ls -l

  

>     total 11000
>     drwxr-xr-x  2 root root     4096 Jan  1  1970 conf
>     -rw-r--r--  1 root root      733 Nov 24 15:24 derby.log
>     drwxr-xr-x 10 root root     4096 Nov 24 15:24 eventlogs
>     drwxr-xr-x  2 root root     4096 Nov 24 16:15 ganglia
>     drwxr-xr-x  2 root root     4096 Nov 24 16:04 logs
>     -rw-r--r--  1 root root 11243141 Dec  3  2009 soc-sign-epinions.txt

In [None]:
head soc-sign-epinions.txt

  

>     # Directed graph: soc-sign-epinions
>     # Epinions signed social network
>     # Nodes: 131828 Edges: 841372
>     # FromNodeId	ToNodeId	Sign
>     0	1	-1
>     1	128552	-1
>     2	3	1
>     4	5	-1
>     4	155	-1
>     4	558	1

In [None]:
mkdir -p epinions
mv soc-sign-epinions.txt epinions/

In [None]:
ls -l /dbfs/FileStore
mv epinions /dbfs/FileStore/

  

>     total 33
>     drwxrwxrwx 2 root root   24 May  1  2018 datasets_magellan
>     drwxrwxrwx 2 root root 4096 Nov 24 11:14 DIGSUM-files
>     drwxrwxrwx 2 root root 4096 Nov 24 11:14 import-stage
>     drwxrwxrwx 2 root root 4096 Nov 24 11:14 jars
>     drwxrwxrwx 2 root root 4096 Nov 24 11:14 plots
>     drwxrwxrwx 2 root root 4096 Nov 24 11:14 shared_uploads
>     drwxrwxrwx 2 root root 4096 Nov 24 11:14 simon_temp_files_feel_free_to_delete_any_time
>     drwxrwxrwx 2 root root 4096 Nov 24 11:14 tables
>     drwxrwxrwx 2 root root 4096 Nov 24 11:14 timelinesOfInterest
>     mv: preserving permissions for ‘/dbfs/FileStore/epinions/soc-sign-epinions.txt’: Operation not permitted
>     mv: preserving permissions for ‘/dbfs/FileStore/epinions’: Operation not permitted

In [None]:
ls /

  

[TABLE]

In [None]:
ls /FileStore

  

[TABLE]

In [None]:
ls file:/databricks/driver

  

[TABLE]

In [None]:
//%fs mv file:///databricks/driver/epinions /FileStore/

In [None]:
ls /FileStore/epinions/

  

[TABLE]

  

Preprocess dataset
------------------

In [None]:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

import org.graphframes._

// This import is needed to use the $-notation
import spark.implicits._

  

>     import org.apache.spark.sql._
>     import org.apache.spark.sql.functions._
>     import org.graphframes._
>     import spark.implicits._

In [None]:
val df = spark.read.format("csv")
  // The file has no header row; lines starting with "#" are skipped as comments.
  .option("inferSchema", "true")
  .option("comment", "#")
  .option("sep", "\t")
  .load("/FileStore/epinions")

  

>     df: org.apache.spark.sql.DataFrame = [_c0: int, _c1: int ... 1 more field]

In [None]:
df.count()

  

>     res1: Long = 841372

In [None]:
df.rdd.getNumPartitions 

  

>     res36: Int = 3

In [None]:
df.head(3)

  

>     res37: Array[org.apache.spark.sql.Row] = Array([0,1,-1], [1,128552,-1], [2,3,1])

In [None]:
df.printSchema()

  

>     root
>      |-- _c0: integer (nullable = true)
>      |-- _c1: integer (nullable = true)
>      |-- _c2: integer (nullable = true)

In [None]:
val newNames = Seq("src", "dst", "rela")
val e = df.toDF(newNames: _*)

  

>     newNames: Seq[String] = List(src, dst, rela)
>     e: org.apache.spark.sql.DataFrame = [src: int, dst: int ... 1 more field]

In [None]:
e.printSchema()

  

>     root
>      |-- src: integer (nullable = true)
>      |-- dst: integer (nullable = true)
>      |-- rela: integer (nullable = true)

In [None]:
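// Vertex DataFrame. The edge list uses node id 0 and the file header reports
// 131828 nodes, so ids are assumed to be 0..131827 (the upper bound is exclusive).
val v = spark.range(0, 131828).toDF("id")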

  

>     v: org.apache.spark.sql.DataFrame = [id: bigint]
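
As an alternative to hard-coding an id range, the vertex set can be derived from the edge list itself. The following is a minimal plain-Scala sketch on a toy edge list, not the notebook's own code; `SignedEdge`, `toyEdges` and `vertexIds` are hypothetical names, and the three edges are the first rows of the file shown earlier.

```scala
// Toy edge list in the same (src, dst, sign) layout as the Epinions file.
final case class SignedEdge(src: Int, dst: Int, sign: Int)

val toyEdges = Seq(
  SignedEdge(0, 1, -1),
  SignedEdge(1, 128552, -1),
  SignedEdge(2, 3, 1)
)

// Every id that appears as a source or a destination becomes a vertex.
val vertexIds: Seq[Int] =
  toyEdges.flatMap(edge => Seq(edge.src, edge.dst)).distinct.sorted

println(vertexIds.mkString(", "))
```

In the notebook, the same idea could presumably be written against the edge DataFrame as `e.select($"src".as("id")).union(e.select($"dst".as("id"))).distinct()`. This has not been run here, and the resulting `id` column would be an `int` rather than a `bigint`.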

In [None]:
val g = GraphFrame(v, e)

  

>     g: org.graphframes.GraphFrame = GraphFrame(v:[id: bigint], e:[src: int, dst: int ... 1 more field])

In [None]:
g.edges.take(3)

  

>     res15: Array[org.apache.spark.sql.Row] = Array([0,1,-1], [1,128552,-1], [2,3,1])

  

Count triads
------------

In [None]:
// val results = g.triangleCount.run()

  

We cannot use the convenient `triangleCount()` API because it does not
take the sign of edges into account. We need to write our own code to
find triads.

First, a triad should be undirected, but our graph consists of only
directed edges.

One strategy is to keep only bi-directional edges whose two directions
carry the same sign. But we need to examine how large a proportion of
edges we would lose.
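
The strategy can be illustrated on a toy edge list in plain Scala before applying it to the real graph. This is a sketch under stated assumptions: `DirectedSignedEdge`, `directed`, `signOf` and `undirected` are hypothetical names, and the toy data is invented for illustration.

```scala
final case class DirectedSignedEdge(src: Int, dst: Int, sign: Int)

val directed = Seq(
  DirectedSignedEdge(1, 2, 1), DirectedSignedEdge(2, 1, 1),   // reciprocated, same sign
  DirectedSignedEdge(2, 3, 1), DirectedSignedEdge(3, 2, -1),  // reciprocated, signs differ
  DirectedSignedEdge(3, 4, -1)                                // not reciprocated
)

// Index signs by (src, dst) so that the reverse direction can be looked up.
val signOf: Map[(Int, Int), Int] =
  directed.map(d => (d.src, d.dst) -> d.sign).toMap

// Keep one undirected edge (smaller id first) per reciprocated same-sign pair.
val undirected: Seq[(Int, Int, Int)] = directed.collect {
  case DirectedSignedEdge(s, d, sign) if s < d && signOf.get((d, s)).contains(sign) =>
    (s, d, sign)
}

println(undirected)
```

Only the pair between 1 and 2 survives: the pair between 2 and 3 is dropped because its signs conflict, and the edge from 3 to 4 is dropped because it has no reverse edge.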

In [None]:
// Search for pairs of vertices with edges in both directions between them, i.e., find undirected or bidirected edges.
val pair = g.find("(a)-[e1]->(b); (b)-[e2]->(a)")
println(pair.count())
val filtered = pair.filter("e1.rela == e2.rela")
println(filtered.count())

  

>     259751
>     254345
>     pair: org.apache.spark.sql.DataFrame = [a: struct<id: bigint>, e1: struct<src: int, dst: int ... 1 more field> ... 2 more fields]
>     filtered: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: struct<id: bigint>, e1: struct<src: int, dst: int ... 1 more field> ... 2 more fields]

  

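Fortunately, the signs rarely disagree: of the 259,751 matches with
edges in both directions, 254,345 (about 97.9%) carry the same sign, so
only about 2.1% are lost to sign conflicts. Edges that have no reverse
edge are not matched by this motif at all.

This also makes sense for this dataset, because if A trusts B, then it
is quite unlikely that B distrusts A.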

In order to count different triads, first we have to find all triads.

In [None]:
val triad = g.find("(a)-[eab]->(b); (b)-[eba]->(a); (b)-[ebc]->(c); (c)-[ecb]->(b); (c)-[eca]->(a); (a)-[eac]->(c)")
println(triad.count())

  

>     3314925
>     triad: org.apache.spark.sql.DataFrame = [a: struct<id: bigint>, eab: struct<src: int, dst: int ... 1 more field> ... 7 more fields]
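
The motif above requires edges in both directions between each pair of vertices, but it does not by itself compare the signs of the two directions. A minimal sketch of a predicate that would enforce this is shown below. The names `edgePairs` and `sameSign` are hypothetical, while the edge names and the `rela` column are the ones used in this notebook.

```scala
// Pairs of edge names as used in the triad motif: (forward, reverse).
val edgePairs = Seq("eab" -> "eba", "ebc" -> "ecb", "eca" -> "eac")

// Build a SQL-style predicate requiring both directions to carry the same sign.
val sameSign: String =
  edgePairs.map { case (fwd, rev) => s"$fwd.rela = $rev.rela" }.mkString(" AND ")

println(sameSign)
// In the notebook one could then try: triad.filter(sameSign).count()
```

Whether applying this filter changes the counts reported below has not been checked here.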

  

After finding all triads, we find each type by filtering.

In [None]:
val t111 = triad.filter("eab.rela = 1 AND eab.rela = ebc.rela AND ebc.rela = eca.rela")
println(t111.count())

  

>     3232357
>     t111: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: struct<id: bigint>, eab: struct<src: int, dst: int ... 1 more field> ... 7 more fields]

In [None]:
val t000 = triad.filter("eab.rela = -1 AND eab.rela = ebc.rela AND ebc.rela = eca.rela")
println(t000.count())

  

>     1610
>     t000: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: struct<id: bigint>, eab: struct<src: int, dst: int ... 1 more field> ... 7 more fields]

In [None]:
val t110 = triad.filter("eab.rela + ebc.rela + eca.rela = 1")
println(t110.count())

  

>     62634
>     t110: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: struct<id: bigint>, eab: struct<src: int, dst: int ... 1 more field> ... 7 more fields]

In [None]:
val t001 = triad.filter("eab.rela + ebc.rela + eca.rela = -1")
println(t001.count())

  

>     18324
>     t001: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: struct<id: bigint>, eab: struct<src: int, dst: int ... 1 more field> ... 7 more fields]

In [None]:
val n111 = t111.count()
val n001 = t001.count()
val n000 = t000.count() 
val n110 = t110.count()
val imbalanced = n000 + n110
val balanced = n111 + n001

  

>     n111: Long = 3232357
>     n001: Long = 18324
>     n000: Long = 1610
>     n110: Long = 62634
>     imbalanced: Long = 64244
>     balanced: Long = 3250681

  

As we can see, balanced triads (3,250,681, about 98% of all matches) far
outnumber imbalanced ones (64,244), which is consistent with structural
balance theory.
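
The share of each triad type can be worked out directly from the counts printed above. The following plain-Scala sketch only redoes that arithmetic; `triadCounts`, `total` and `balancedShare` are hypothetical names, and the labels assume the T3/T2/T1/T0 naming from the introduction.

```scala
// Counts copied from the notebook output above.
val triadCounts: Map[String, Long] = Map(
  "T3 (+++)" -> 3232357L,
  "T2 (++-)" -> 62634L,
  "T1 (+--)" -> 18324L,
  "T0 (---)" -> 1610L
)
val total: Long = triadCounts.values.sum

// Print each type with its count and percentage, largest first.
triadCounts.toSeq.sortBy(-_._2).foreach { case (name, n) =>
  println(f"$name%-9s $n%9d  ${100.0 * n / total}%6.2f%%")
}

// Balanced triads are those with an odd number of positive edges.
val balancedShare: Double =
  (triadCounts("T3 (+++)") + triadCounts("T1 (+--)")).toDouble / total
println(f"balanced share: ${100 * balancedShare}%.2f%%")
```

The four counts add up to the 3,314,925 matches found by the triad motif, so every match falls into exactly one type.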

Duplicates
----------

Here we run a few tests on duplicated motifs, using a small example graph.

In [None]:
val g: GraphFrame = examples.Graphs.friends

  

>     g: org.graphframes.GraphFrame = GraphFrame(v:[id: string, name: string ... 1 more field], e:[src: string, dst: string ... 1 more field])

In [None]:
display(g.edges)

  

[TABLE]

In [None]:
val motifs = g.find("(a)-[e]->(b); (b)-[e2]->(a)")
motifs.show()

  

>     +----------------+--------------+----------------+--------------+
>     |               a|             e|               b|            e2|
>     +----------------+--------------+----------------+--------------+
>     |    [b, Bob, 36]|[b, c, follow]|[c, Charlie, 30]|[c, b, follow]|
>     |[c, Charlie, 30]|[c, b, follow]|    [b, Bob, 36]|[b, c, follow]|
>     +----------------+--------------+----------------+--------------+
>
>     motifs: org.apache.spark.sql.DataFrame = [a: struct<id: string, name: string ... 1 more field>, e: struct<src: string, dst: string ... 1 more field> ... 2 more fields]

  

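As shown above, a pair of bi-directional edges is reported twice, once
for each assignment of its endpoints to `a` and `b`. By the same
argument, a triad on three distinct vertices is reported once per
ordering of `a`, `b` and `c`, i.e. six times. However, this does not
matter in our project as long as every triad is over-counted by the same
factor, because the ratios between different triads then remain the same.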
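
A brute-force check on a toy triangle suggests how often a single undirected triad matches a six-edge motif like the one used for counting. It is written in plain Scala rather than GraphFrames, so it only mimics the matching rule; `undirectedSigns`, `bothDirections`, `nodes`, `matches` and `unique` are hypothetical names.

```scala
// Toy undirected signed triangle on vertices 1, 2, 3, stored in both directions
// as required by the notebook's bi-directional motif.
val undirectedSigns = Map((1, 2) -> 1, (2, 3) -> 1, (1, 3) -> -1)
val bothDirections: Map[(Int, Int), Int] =
  undirectedSigns ++ undirectedSigns.map { case ((u, w), s) => (w, u) -> s }

val nodes = Seq(1, 2, 3)

// Every assignment of (a, b, c) whose six directed edges all exist is a match,
// mirroring what the six-edge motif asks for.
val matches = for {
  a <- nodes; b <- nodes; c <- nodes
  if Seq((a, b), (b, a), (b, c), (c, b), (c, a), (a, c))
    .forall(pair => bothDirections.contains(pair))
} yield (a, b, c)

// Imposing an order on the ids keeps exactly one representative per triangle.
val unique = matches.filter { case (a, b, c) => a < b && b < c }

println(s"matches = ${matches.size}, unique = ${unique.size}")
```

In the notebook, an analogous de-duplication could presumably be expressed as `triad.filter("a.id < b.id AND b.id < c.id")`, which has not been run here.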