Exploring the GQA Scene Graph Dataset Structure and Properties
==============================================================

-   This project aims to explore the scene graphs in the Genome Question
    Answering (GQA) dataset \[1\].

-   The structure, properties, and motifs of the ground truth data will
    be analysed.

**Adam Dahlgren, Pavlo Melnyk, Emanuel Sanchez Aimar**

Graph structure
---------------

-   We want to extract the names of objects we see in the images and use
    their IDs as vertices.
-   For one object category, we will have multiple IDs and hence
    multiple vertices. In contrast, one vertex will represent an object
    category in the merged graph.
-   Object attributes are used as part of the vertices (in some graph
    representations we exploit).
-   The edge properties are the names of the relations (provided in
    JSON-files).

Loading data
------------

-   We read the scene graph data as JSON files. Below is the example
    JSON object given by the GQA website, for scene graph 2407890.

In [None]:
sc = spark.sparkContext
# Had to change weather 'none' to '"none"' for the string to parse
json_example_str = '{"2407890": {"width": 640,"height": 480,"location": "living room","weather": "none","objects": {"271881": {"name": "chair","x": 220,"y": 310,"w": 50,"h": 80,"attributes": ["brown", "wooden", "small"],"relations": {"32452": {"name": "on","object": "275312"},"32452": {"name": "near","object": "279472"}}}}}}'
json_rdd = sc.parallelize([json_example_str])
example_json_df = spark.read.json(json_rdd, multiLine=True)
example_json_df.show()

  

>     +--------------------+
>     |             2407890|
>     +--------------------+
>     |[480, living room...|
>     +--------------------+

In [None]:
example_json_df.first()

  

>     Out[3]: Row(2407890=Row(height=480, location='living room', objects=Row(271881=Row(attributes=['brown', 'wooden', 'small'], h=80, name='chair', relations=Row(32452=None, 32452=Row(name='near', object='279472')), w=50, x=220, y=310)), weather='none', width=640))

  

### Reading JSON files

-   Due to issues with the JSON files and how Spark reads them, we need
    to parse the files using pure Python. Otherwise, we get stuck in a
    loop and finally crash the driver.

In [None]:
from graphframes import *
import json

In [None]:
# load train and validation graph data (the context managers close the files after loading):
with open("/dbfs/FileStore/shared_uploads/scenegraph_motifs/train_sceneGraphs.json") as f_train:
    train_scene_data = json.load(f_train)

with open("/dbfs/FileStore/shared_uploads/scenegraph_motifs/val_sceneGraphs.json") as f_val:
    val_scene_data = json.load(f_val)

  

  

### Parsing graph structure

-   We parse the JSON files in plain Python to obtain the vertices and
    edges of the graphs, given a vertex schema and an edge schema,
    respectively.

In [None]:
# Pythonic way of doing it, parsing a JSON graph representation.
# Creates vertices with the graph id, object name and id; optionally includes the attributes.
def json_to_vertices_edges(graph_json, scene_graph_id, include_object_attributes=False):
  vertices = []
  edges = []
  obj_id_to_name = {}
  
  vertex_ids = graph_json['objects']
  
  for vertex_id in vertex_ids:   
    vertex_obj = graph_json['objects'][vertex_id]
    name = vertex_obj['name']
    vertices_data = [scene_graph_id, vertex_id, name]
    
    if vertex_id not in obj_id_to_name:
      obj_id_to_name[vertex_id] = name
      
    if include_object_attributes:
      attributes = vertex_obj['attributes']  
      vertices_data.append(attributes)
      
    vertices.append(tuple(vertices_data))
    
    for relation in vertex_obj['relations']:
      src = vertex_id
      dst = relation['object']
      name = relation['name']
      edges.append([src, dst, name])
        
  for i in range(len(edges)):
    src_type = obj_id_to_name[edges[i][0]]
    dst_type = obj_id_to_name[edges[i][1]]
    edges[i].append(src_type)
    edges[i].append(dst_type)
    
  return (vertices, edges)

In [None]:
def parse_scene_graphs(scene_graphs_json, vertex_schema, edge_schema):  
  vertices = []
  edges = []
  
  # if vertex_schema has a field for attributes:
  include_object_attributes = len(vertex_schema) == 4
     
  for scene_graph_id in scene_graphs_json:
    vs, es = json_to_vertices_edges(scene_graphs_json[scene_graph_id], scene_graph_id, include_object_attributes)
    vertices += vs
    edges += es
    
  vertices = spark.createDataFrame(vertices, vertex_schema)
  edges = spark.createDataFrame(edges, edge_schema)
  
  return GraphFrame(vertices, edges)

In [None]:
from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType, StringType

# create schemas for scene graphs:
vertex_schema = StructType([
  StructField("graph_id", StringType(), False), StructField("id", StringType(), False), StructField("object_name", StringType(), False)
])

vertex_schema_with_attr  = StructType([
  StructField("graph_id", StringType(), False), 
  StructField("id", StringType(), False), 
  StructField("object_name", StringType(), False), 
  StructField("attributes", ArrayType(StringType()), True)
])

edge_schema = StructType([
  StructField("src", StringType(), False), StructField("dst", StringType(), False), StructField("relation_name", StringType(), False),
  StructField("src_type", StringType(), False), StructField("dst_type", StringType(), False)
])

In [None]:
# we will use the length of the vertex schemas to parse the graph from the JSON files appropriately:
len(vertex_schema), len(vertex_schema_with_attr)

  

>     Out[8]: (3, 4)

  

We add attributes to vertices and types to edges in the graph structure
-----------------------------------------------------------------------

-   If vertices have attributes, we can get more descriptive answers to
    our queries like "Objects of type 'person' are 15 times 'next-to'
    objects of type 'banana' ('yellow', 'small'); 10 times 'next-to'
    objects of type 'banana' ('green')".

-   We can do more interesting queries if the edges disclose what
    type/name the source and destination have.

-   For instance, it is then possible to group the edges not only by the
    ID but also by which type of objects they are connected to,
    answering questions like "How often are objects of type 'person' in
    the relation 'next-to' with objects of type 'banana'?".
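
To make the benefit of typed edges concrete, here is a minimal plain-Python sketch (no Spark required). The toy rows are invented for illustration and only mimic the column layout of `edge_schema`. `count_triples` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

# Toy edges in the (src, dst, relation_name, src_type, dst_type) layout of edge_schema.
# The ids and rows are made up for illustration; they are not taken from GQA.
toy_edges = [
    ("1", "2", "next-to", "person", "banana"),
    ("3", "4", "next-to", "person", "banana"),
    ("5", "6", "holding", "person", "banana"),
    ("7", "8", "next-to", "person", "apple"),
]

def count_triples(edges):
    """Count (src_type, relation_name, dst_type) triples, ignoring the object ids."""
    return Counter((src_type, rel, dst_type) for _, _, rel, src_type, dst_type in edges)

triple_counts = count_triples(toy_edges)
print(triple_counts[("person", "next-to", "banana")])  # prints 2
```

Because the object types are stored on the edge itself, the count needs no join against the vertex table. The `groupBy("src_type", "relation_name", "dst_type")` queries in this notebook follow the same idea.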

In [None]:
# TODO Perhaps merge train+val and produce results for all three (train, val, train+val)?
scene_graphs_train = parse_scene_graphs(train_scene_data, vertex_schema_with_attr, edge_schema)

In [None]:
scene_graphs_train_without_attributes = GraphFrame(scene_graphs_train.vertices.select('graph_id', 'id', 'object_name'), scene_graphs_train.edges)

In [None]:
scene_graphs_val = parse_scene_graphs(val_scene_data, vertex_schema_with_attr, edge_schema)

In [None]:
# person next-to banana (yellow, small) vs person next-to banana (green)
display(scene_graphs_train.find('(a)-[ab]->(b)').filter("(a.object_name = 'person') and (b.object_name = 'banana')"))

In [None]:
display(scene_graphs_train.vertices)

In [None]:
display(scene_graphs_val.vertices)

In [None]:
display(scene_graphs_train.edges)

  

[TABLE]

Truncated to 30 rows

In [None]:
display(scene_graphs_val.edges)

  

[TABLE]

Truncated to 30 rows

  

Analysis of original graph
--------------------------

-   The original graph consists of multiple graphs, each representing an
    image.

-   Number of objects per image (graph):

In [None]:
grouped_graphs = scene_graphs_train.vertices.groupBy('graph_id')
display(grouped_graphs.count().sort('count', ascending=False))

  

[TABLE]

Truncated to 30 rows

In [None]:
print("Graphs/Scenes/Images: {}".format(scene_graphs_train.vertices.select('graph_id').distinct().count()))
print("Objects: {}".format(scene_graphs_train.vertices.count()))
print("Relations: {}".format(scene_graphs_train.edges.count()))

  

>     Graphs/Scenes/Images: 74289
>     Objects: 1231134
>     Relations: 3795907

In [None]:
display(scene_graphs_train.degrees.sort(["degree"],ascending=[0]).limit(20))

  

[TABLE]

  

### Finding most common attributes

-   "Which object characteristics are the most common?"

In [None]:
from pyspark.sql.functions import explode
# the attributes are sequences: we need to split them;
# explode the attributes in the vertices graph:
explodedAttributes = scene_graphs_train.vertices.select("id", "object_name", explode(scene_graphs_train.vertices.attributes).alias("attribute"))
explodedAttributes.printSchema()
display(explodedAttributes)

  

[TABLE]

Truncated to 30 rows

  

-   Above we see the object-attribute pairs seen in the dataset.

#### Most used attributes

In [None]:
topAttributes = explodedAttributes.groupBy("attribute")
display(topAttributes.count().sort("count", ascending=False))

  

[TABLE]

Truncated to 30 rows

  

-   7 out of the top 10 attributes are colors, where `white` is seen
    92659 times, and `black` 59617 times.

-   We see a *long-tailed* distribution, with only 68 out of 617
    attributes being seen more than 1000 times in the dataset, and
    around 300 attributes seen fewer than 100 times (e.g.,
    `breakable` is seen 15 times, `wrist` 14 times, and `immature` 3
    times).
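
The long-tail figures quoted above come from thresholding the per-attribute counts. A minimal plain-Python sketch of that computation follows; the toy attribute list and the thresholds are invented for illustration, and `tail_summary` is a hypothetical helper.

```python
from collections import Counter

# Toy attribute annotations, one entry per (object, attribute) occurrence.
# Invented for illustration; the real counts come from explodedAttributes.
toy_attributes = ["white"] * 6 + ["black"] * 4 + ["wooden"] * 2 + ["breakable", "immature"]

def tail_summary(attributes, high, low):
    """Return (#distinct attributes, #seen more than `high` times, #seen fewer than `low` times)."""
    counts = Counter(attributes)
    n_high = sum(1 for c in counts.values() if c > high)
    n_low = sum(1 for c in counts.values() if c < low)
    return len(counts), n_high, n_low

summary = tail_summary(toy_attributes, high=3, low=2)
print(summary)  # prints (5, 2, 2)
```

On the real data, the same summary with `high=1000` and `low=100` would correspond to the numbers reported in the bullet list.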

### Finding most common objects

In [None]:
topObjects = scene_graphs_train.vertices.groupBy("object_name")
topObjects = topObjects.count()
display(topObjects.sort("count", ascending=False))

  

[TABLE]

Truncated to 30 rows

In [None]:
display(topObjects.sort("count", ascending=True))

  

[TABLE]

Truncated to 30 rows

  

-   Again, we see that a few object types account for most of the
    occurrences. Interestingly, `man` (31370) and `person` (20218) are
    seen roughly three and two times as often as `woman` (11355),
    respectively. Apparently, `window`s are really important in this
    dataset, coming out on top with 35907 occurrences.

-   The top 259 object types are seen more than 1000 times, and the
    object types after rank 819 are seen fewer than 100 times.

-   Looking at the tail end of the distribution, we see that `pikachu`
    is mentioned once, whereas, e.g., `warderobe` (5) and `robot` (8)
    are rarely seen, which was not expected.

-   The nature of the GQA dataset suggests its general-purpose
    applicability. However, the skewed distribution of object
    categories shown above implies otherwise.

### Finding most common object pairs

-   "What are the most common two adjacent object categories in the
    graphs?"

In [None]:
topPairs = scene_graphs_train.edges.groupBy("src_type", "dst_type")
display(topPairs.count().sort("count", ascending=False))

  

[TABLE]

Truncated to 30 rows

In [None]:
topPairs = scene_graphs_train.edges.groupBy("src_type", "relation_name", "dst_type")
display(topPairs.count().sort("count", ascending=False))

  

[TABLE]

Truncated to 30 rows

  

-   In the tables above, we see that the most common relations reflect
    spatial properties such as `to the right of`, with `windows`
    symmetrically related to each other accounting for 2 x 28944
    occurrences.

-   The most common relations are primarily between objects of the same
    category.

-   The first 'action'-encoding relation is seen in the 15th most common
    triple `man-wearing-shirt` (5254).

### Finding most common relations

-   Could we categorise the edges according to what semantic function
    they play?

-   For instance, filtering out all relations that are spatial
    (`behind`, `to the left of`, etc.).

-   Suggested categories: *spatial*, *actions*, and *semantic*
    relations.

In [None]:
topPairs = scene_graphs_train.edges.groupBy("relation_name")
display(topPairs.count().sort("count", ascending=False))

  

[TABLE]

Truncated to 30 rows

  

-   The most common relations are overwhelmingly spatial, with
    `to the left of` and `to the right of` accounting for 1.7 million
    occurrences each.

-   In contrast, the third most common relation `on` is seen "only"
    90804 times. Out of the top 30 relations, 23 are spatial. Common
    actions can be seen as few times as 28, as in the case of `opening`.

-   Some of these relations encode both spatial information and
    actions, such as `sitting on`.

-   This shows some ambiguity in how the relation names are chosen, and
    how they relate to properties such as `sitting`, `looking`, and
    `lying`, which can also be encoded as object attributes.

&nbsp;

-   Next, we filter out relations that begin with `to the` or end with
    `on`, `in`, or ` of` (e.g., `in front of`), in order to bring forth
    more of the non-spatial relations.

In [None]:
# Also possible to do:
# from pyspark.sql.functions import udf
#from pyspark.sql.types import BooleanType

#filtered_df = spark_df.filter(udf(lambda target: target.startswith('good'), 
#                                  BooleanType())(spark_df.target))

topPairs = scene_graphs_train.edges.filter("(relation_name NOT LIKE 'to the%') and (relation_name NOT LIKE '%on') and (relation_name NOT LIKE '%in') and (relation_name NOT LIKE '% of')").groupBy("src_type", "relation_name", "dst_type")
display(topPairs.count().sort("count", ascending=False))

  

[TABLE]

Truncated to 30 rows

  

-   We see in the pie chart above that once we filter out the most
    common spatial relations, the remainder is dominated by `wearing`
    and the occasional associative `of` (as in, e.g., `head-of-man`).

-   These relations make up almost half of the non-spatial relations.
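
The four `NOT LIKE` patterns used in the filter above can be mirrored with plain string checks, which makes it easier to see exactly which relation names survive. This is a sketch under the assumption that `%` matches any (possibly empty) sequence of characters, as in standard SQL `LIKE`. `is_kept` is a hypothetical helper and the relation list is a small hand-picked sample.

```python
def is_kept(relation_name):
    """Mirror the four NOT LIKE conditions of the Spark filter with plain string checks."""
    return not (
        relation_name.startswith("to the")   # NOT LIKE 'to the%'
        or relation_name.endswith("on")      # NOT LIKE '%on'
        or relation_name.endswith("in")      # NOT LIKE '%in'
        or relation_name.endswith(" of")     # NOT LIKE '% of'
    )

relations = ["to the left of", "on", "in", "in front of", "sitting on", "wearing", "of", "behind"]
kept = [r for r in relations if is_kept(r)]
print(kept)  # prints ['wearing', 'of', 'behind']
```

Note that `'%on'` also drops mixed relations such as `sitting on`, while the bare `of` is kept because `'% of'` requires a preceding space.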

TODO - Report statistics on attributes, similarly to how src/dst types
are used above. Correlation metrics between vertex types and attributes?

### Finding motifs

TODO - finding other interesting motifs that we can motivate from a
semantic perspective, e.g. looking at triangles, or more complex
child-parent tree relations (e.g. one parent with exactly two children)

In [None]:
# Use object names as vertex ids and relabel the edges so that src/dst match the new ids.
scene_graphs_train_without_attributes_graphid = GraphFrame(
  scene_graphs_train_without_attributes.vertices.selectExpr('object_name as id').distinct(),
  scene_graphs_train.edges.selectExpr('src_type as src', 'dst_type as dst', 'relation_name'))

In [None]:
# After the renaming above, the object name is stored in the `id` column.
motifs = scene_graphs_train_without_attributes_graphid.find("(a)-[ab]->(b); (b)-[bc]->(c)").filter("(a.id != b.id) and (a.id != c.id)")

display(motifs)

In [None]:
# TODO - display motifs in a nicer way:
#display(motifs.select('ab').rdd.map(lambda t: (t[-2], t[-3], t[-1])).toDF())

In [None]:
motifs_sorted = motifs.distinct()
display(motifs_sorted)

In [None]:
motifs_sorted.count()

  

-   Find *circular* motifs, i.e., motifs of type
    `A -> relation_ab -> B -> relation_bc -> C -> relation_ca -> A`:

In [None]:
circular_motifs = scene_graphs_train.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[ca]->(a)")

display(circular_motifs)

In [None]:
circular_motifs.count()

  

This gives us 7 million cycles of length 3. However, this is most likely
dominated by the most common spatial relations. In the cell below, we
filter out these spatial relations and count cycles again.

In [None]:
circular_motifs = scene_graphs_train.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[ca]->(a)").filter("(ab.relation_name NOT LIKE 'to the%') and (bc.relation_name NOT LIKE 'to the%') and (ca.relation_name NOT LIKE 'to the%') and (ab.relation_name NOT LIKE '% of') and (bc.relation_name NOT LIKE '% of') and (ca.relation_name NOT LIKE '% of')")

display(circular_motifs.select('ab', 'bc', 'ca'))

In [None]:
circular_motifs.count()

  

-   Without the most common spatial relations, we now have a
    significantly lower number of cycles of length 3: 18805.
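
The cycle search and the subsequent filtering can be illustrated on a toy edge list in plain Python. This is only a sketch of the idea, not how GraphFrames evaluates motifs internally. The edges are invented, and `three_cycles` and `is_spatial` are hypothetical helpers.

```python
# Toy directed edges (src, dst, relation_name); invented for illustration.
toy_edges = [
    ("a1", "b1", "to the left of"),
    ("b1", "c1", "to the left of"),
    ("c1", "a1", "to the right of"),
    ("m", "s", "wearing"),
    ("s", "h", "near"),
    ("h", "m", "holding"),
]

def three_cycles(edges):
    """Return all edge triples (ab, bc, ca) that chain a -> b -> c -> a."""
    by_src = {}
    for edge in edges:
        by_src.setdefault(edge[0], []).append(edge)
    found = []
    for ab in edges:
        for bc in by_src.get(ab[1], []):
            for ca in by_src.get(bc[1], []):
                if ca[1] == ab[0]:
                    found.append((ab, bc, ca))
    return found

def is_spatial(relation_name):
    # Same two patterns as the notebook filter: 'to the%' and '% of'.
    return relation_name.startswith("to the") or relation_name.endswith(" of")

all_cycles = three_cycles(toy_edges)
non_spatial = [c for c in all_cycles if not any(is_spatial(e[2]) for e in c)]
print(len(all_cycles), len(non_spatial))  # prints 6 3
```

Each of the two cycles is reported three times, once per starting vertex. A motif pattern with three named vertices matches every rotation in the same way, so the reported counts are best read as matches rather than as distinct cycles.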

&nbsp;

-   Find *symmetric* motifs, i.e., motifs of type
    `A -> relation_ab -> B -> relation_ab -> A`:

In [None]:
symmetric_motifs = scene_graphs_train.find("(a)-[ab]->(b); (b)-[ba]->(a)").filter("ab.relation_name LIKE ba.relation_name")

display(symmetric_motifs)


In [None]:
symmetric_motifs.count()

In [None]:
symmetric_motifs = scene_graphs_train.find("(a)-[ab]->(b); (b)-[ba]->(a)").filter("ab.relation_name LIKE ba.relation_name").filter("(ab.relation_name NOT LIKE 'near') and (ab.relation_name NOT LIKE '% of')")

display(symmetric_motifs.select('ab', 'ba'))


In [None]:
symmetric_motifs.count()

  

-   The symmetric relations that are spatial behave as expected, and
    removing the most common ones shows that we have 3693 such symmetric
    relations.

-   However, when looking at the filtered symmetric motifs we can see
    examples such as 'boy-wearing-boy' and 'hot dog-wrapped in-hot dog'.

-   These examples of symmetric action relations seem to poorly reflect
    the expected structure of a scene graph.

-   We assume that this is either an artifact of noise in the human
    annotations, or that the sought-after density of the graphs
    describing the images creates these kinds of errors.
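
To show how such suspicious symmetric triples can be surfaced, here is a small plain-Python sketch. The toy edges are invented and `symmetric_pairs` is a hypothetical helper. The exclusion of `near` and `% of` mirrors the filter used in the cell above.

```python
# Toy edges (src, dst, relation_name, src_type, dst_type); invented for illustration.
toy_edges = [
    ("1", "2", "near", "cup", "plate"),
    ("2", "1", "near", "plate", "cup"),
    ("3", "4", "wearing", "boy", "boy"),
    ("4", "3", "wearing", "boy", "boy"),
    ("5", "6", "wearing", "man", "shirt"),
]

def symmetric_pairs(edges):
    """Return the edges (a, b, r) for which the reverse edge (b, a, r) also exists."""
    present = {(src, dst, rel) for src, dst, rel, _, _ in edges}
    return [e for e in edges if (e[1], e[0], e[2]) in present]

pairs = symmetric_pairs(toy_edges)
# Drop the expected symmetric relations and list the remaining typed triples.
suspicious = sorted(
    {(e[3], e[2], e[4]) for e in pairs if e[2] != "near" and not e[2].endswith(" of")}
)
print(len(pairs), suspicious)  # prints 4 [('boy', 'wearing', 'boy')]
```

As with the motif query, every symmetric pair is counted once per direction.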

### Object ranking using PageRank

TODO - fix PageRank in terms of performance: perhaps remove spatial
edges (e.g. to the left/right of).

UPDATE 21-12-2020: sorting and grouping (groupBy) are unreliable
(they either finish within a couple of minutes or run extremely
slowly/never finish), possibly due to cluster overload; PageRank itself
returns approximately all ones (identical values) after the initial
run and sometimes produces meaningful results when re-run.

`scene_graph_without_attributes = GraphFrame(scene_graphs_train.vertices.drop('attributes'), scene_graphs_train.edges)`

`ranks = scene_graph_without_attributes.pageRank(resetProbability=0.15, tol=0.01)`

`display(ranks.vertices)`

In [None]:
# TODO - This does not really give us anything at the moment
# TODO - display the object names within one image and their corresponding PageRank values.

# Apparently, using the previous graph frame created for another task 
temp = GraphFrame(scene_graphs_train.vertices.select('graph_id', 'id', 'object_name'), scene_graphs_train.edges)

ranks = temp.pageRank(resetProbability=0.15, tol=0.01)
display(ranks.vertices)

In [None]:
sorted_ranks = ranks.vertices.sort('pagerank', ascending=False)
display(sorted_ranks)

In [None]:
val_graphs_without_attributes = GraphFrame(scene_graphs_val.vertices.select('graph_id', 'id', 'object_name'), scene_graphs_val.edges)
val_ranks = val_graphs_without_attributes.pageRank(resetProbability=0.15, tol=0.01)
display(val_ranks.vertices)

In [None]:
val_sorted_ranks = val_ranks.vertices.sort('pagerank', ascending=False)
display(val_sorted_ranks)

In [None]:
graph_pagerank_sums_objects = ranks.vertices.groupBy('object_name').sum('pagerank')
graph_pagerank_sums_objects.show()

In [None]:
graph_pagerank_sums_objects_sorted = graph_pagerank_sums_objects.sort('sum(pagerank)',ascending=False)

In [None]:
display(graph_pagerank_sums_objects_sorted)

  

[TABLE]

Truncated to 30 rows

  

-   Here we see that the summed (accumulated) PageRank per object
    category reflects the number of occurrences of each object (see the
    `topObjects` section), at least for the top 10 in this table.

-   This verifies that the most common objects are highly connected with
    others in their respective scene graphs.

-   We therefore conclude that they do not necessarily have a high
    information gain.

-   A high accumulated PageRank suggests that an object category is generic.

In [None]:
topObjects = topObjects.sort('object_name', ascending=False)
graph_pagerank_sums_objects_sorted = graph_pagerank_sums_objects_sorted.sort('object_name', ascending=False)

In [None]:
import pyspark.sql.functions as F
graph_pagerank_joined = graph_pagerank_sums_objects_sorted.join(topObjects, "object_name").withColumn('normalize(pagerank)', F.col('sum(pagerank)') / F.col('count'))
display(graph_pagerank_joined.sort('normalize(pagerank)', ascending=False))

  

  

-   We further normalise the PageRank values, i.e., divide by the number
    of occurrences per object category in the scenes.

-   We observe that, in contrast to the accumulated PageRank, the
    normalised values reflect the uniqueness of object categories: the
    fewer the occurrences, the higher the normalised PageRank.

-   For example, `wolves` occurs only once in the entire dataset, and
    its corresponding PageRank (accumulated equals normalised in this
    case) is the highest of all, followed by the glorious `pikachu`.
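
The difference between the accumulated and the normalised PageRank can be reproduced with a few lines of plain Python. The per-vertex values below are invented for illustration and are not actual PageRank output. `summarise` is a hypothetical helper.

```python
from collections import defaultdict

# Toy per-vertex PageRank values as (object_name, pagerank); invented for illustration.
toy_ranks = [
    ("window", 1.0), ("window", 1.0), ("window", 1.0), ("window", 1.0), ("window", 1.0),
    ("sky", 2.0), ("sky", 1.5),
    ("pikachu", 3.0),
]

def summarise(ranks):
    """Return {object_name: (count, summed PageRank, summed PageRank / count)}."""
    total = defaultdict(float)
    count = defaultdict(int)
    for name, rank in ranks:
        total[name] += rank
        count[name] += 1
    return {name: (count[name], total[name], total[name] / count[name]) for name in total}

summary = summarise(toy_ranks)
by_sum = sorted(summary, key=lambda n: summary[n][1], reverse=True)
by_mean = sorted(summary, key=lambda n: summary[n][2], reverse=True)
print(by_sum)   # prints ['window', 'sky', 'pikachu']
print(by_mean)  # prints ['pikachu', 'sky', 'window']
```

The normalised value is simply the mean PageRank per category, so a category with a single well-connected instance can top the normalised ranking while contributing little to the accumulated one.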

In [None]:
display(graph_pagerank_joined.sort('sum(pagerank)', ascending=False).limit(30))

  

-   In the above table, we see that the normalised PageRank for the top
    30 objects has a different ordering than the summed PageRanks.

-   For example, `sky` has the highest normalised PageRank, and the most
    common category `window` has the lowest.

-   This could be a reflection of the fact that `sky` most likely acts
    as an *anchor* object in the image, being a background to which
    everything else is related.

-   On the contrary, while `window` might be prevalent in many images,
    we assume that it acts as a foreground object more often than
    `sky` does.

-   Nonetheless, these conclusions correspond well to the above analysis
    of object generality.

In [None]:
# ranks.vertices.groupBy('graph_id').count().show()

graph_pagerank_sums = ranks.vertices.groupBy('graph_id').sum('pagerank')
graph_pagerank_sums.show()

  

>     +--------+------------------+
>     |graph_id|     sum(pagerank)|
>     +--------+------------------+
>     | 2394929|10.000000000001032|
>     | 2383053|21.000000000002167|
>     | 2381043|21.000000000002167|
>     | 2332729|11.000000000001137|
>     | 2366493|30.000000000003094|
>     | 2377871|26.000000000002682|
>     | 2350068|12.000000000001238|
>     | 1591809| 22.00000000000227|
>     | 1592764|21.000000000002167|
>     | 2411119|27.000000000002785|
>     | 2337597|23.000000000002373|
>     | 2406784| 9.000000000000929|
>     | 2319630| 28.00000000000289|
>     | 2365546|17.000000000001755|
>     | 2357438|20.000000000002064|
>     | 2395199|24.000000000002476|
>     | 2348208| 43.00000000000446|
>     | 2341350|16.000000000001652|
>     | 2397463|27.000000000002785|
>     | 2370847| 19.00000000000196|
>     +--------+------------------+
>     only showing top 20 rows

In [None]:
graph_val_pagerank_sums = val_ranks.vertices.groupBy('graph_id').sum('pagerank')
graph_val_pagerank_sums.show()

  

>     +--------+------------------+
>     |graph_id|     sum(pagerank)|
>     +--------+------------------+
>     | 2316914|13.999999999999774|
>     | 2336843|1.9999999999999671|
>     | 2409544|16.999999999999726|
>     | 2385432| 18.99999999999969|
>     | 2338120| 6.999999999999886|
>     | 2315739|  7.99999999999987|
>     | 2346829| 17.99999999999971|
>     | 2411941|11.999999999999806|
>     | 2371809|16.999999999999726|
>     | 2326319|30.999999999999478|
>     |    1669|20.999999999999655|
>     | 2318182|16.999999999999726|
>     | 2371850| 5.999999999999902|
>     | 2349988| 4.999999999999918|
>     | 2357784|20.999999999999655|
>     | 2331673|16.999999999999726|
>     | 2359775|16.999999999999726|
>     |  498144| 12.99999999999979|
>     | 2331455| 26.99999999999955|
>     | 2327585| 6.999999999999886|
>     +--------+------------------+
>     only showing top 20 rows

In [None]:
display(graph_val_pagerank_sums.sort('sum(pagerank)',ascending=False))

  

[TABLE]

Truncated to 30 rows

  

### Merging vertices

-   We use object names (object categories with or without attributes)
    instead of IDs as vertex identifier to merge all scene graphs (each
    with its `graph_id`) into one *meta-graph*.

-   This enables us to analyse, e.g., how object types relate to each
    other *in general*, and how connected components can be formed based
    on specific image contexts.

-   The key intuition is that it could allow us to detect connected
    components representing *scene categories* such as `traffic` or
    `bathroom`, i.e., meta-understanding of images as a whole.
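
The merging step can be sketched on two toy scene graphs in plain Python. The data is invented for illustration and `merge_by_name` is a hypothetical helper. The notebook itself performs the equivalent relabelling with `selectExpr` on the vertex and edge DataFrames.

```python
# Two toy scene graphs: vertices as (graph_id, object_id, object_name),
# edges as (src_id, dst_id, relation_name). Invented for illustration.
toy_vertices = [
    ("g1", "1", "car"), ("g1", "2", "road"),
    ("g2", "3", "car"), ("g2", "4", "road"), ("g2", "5", "sign"),
]
toy_edges = [("1", "2", "on"), ("3", "4", "on"), ("5", "4", "near")]

def merge_by_name(vertices, edges):
    """Collapse object instances into one vertex per object name and relabel the edges."""
    id_to_name = {obj_id: name for _, obj_id, name in vertices}
    merged_vertices = sorted(set(id_to_name.values()))
    merged_edges = [(id_to_name[src], id_to_name[dst], rel) for src, dst, rel in edges]
    return merged_vertices, merged_edges

merged_vertices, merged_edges = merge_by_name(toy_vertices, toy_edges)
print(merged_vertices)  # prints ['car', 'road', 'sign']
print(merged_edges)     # prints [('car', 'road', 'on'), ('car', 'road', 'on'), ('sign', 'road', 'near')]
```

The vertices are deduplicated, but the edges are kept as a multiset, so repeated triples such as `car-on-road` still carry frequency information in the meta-graph.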

In [None]:
merged_vertices = scene_graphs_val.vertices.selectExpr('object_name as id', 'attributes as attributes')
display(merged_vertices)

In [None]:
merged_vertices.count()

  

>     Out[83]: 174331

In [None]:
merged_vertices = merged_vertices.distinct()
display(merged_vertices)

In [None]:
merged_vertices.count()

  

>     Out[85]: 22243

  

-   We see that there are 22243 unique combinations of objects and
    attributes.

In [None]:
merged_vertices_without_attributes = merged_vertices.select('id').distinct()
display(merged_vertices_without_attributes)

  

[TABLE]

Truncated to 30 rows

In [None]:
merged_vertices_without_attributes.count()

  

>     Out[87]: 1536

In [None]:
merged_edges = scene_graphs_val.edges.selectExpr('src_type as src', 'dst_type as dst', 'relation_name as relation_name')
display(merged_edges)

  

[TABLE]

Truncated to 30 rows

In [None]:
merged_edges.count()


  

>     Out[89]: 534889

In [None]:
scene_graphs_merged = GraphFrame(merged_vertices, merged_edges)

In [None]:
display(scene_graphs_merged.vertices)

In [None]:
display(scene_graphs_merged.edges)

  

[TABLE]

Truncated to 30 rows

In [None]:
scene_graphs_merged_without_attributes = GraphFrame(merged_vertices_without_attributes, merged_edges)

In [None]:
display(scene_graphs_merged_without_attributes.vertices)

  

[TABLE]

Truncated to 30 rows

In [None]:
display(scene_graphs_merged_without_attributes.edges)

  

[TABLE]

Truncated to 30 rows

  

### Computing the Connected Components

-   Here we compute the connected components of the merged scene graphs
    (one with the object attributes included and the other without).

-   Before merging, the number of connected components should be at
    least the number of scene graphs, since each scene graph consists
    of at least one connected component.

-   In the merged graphs, we can expect a much smaller set of connected
    components, and we hypothesize that these could correspond to *scene
    categories* (image classes).

In [None]:
sc.setCheckpointDir("/tmp/scene-graph-motifs-connected-components")
connected_components = scene_graphs_merged.connectedComponents()
display(connected_components)

# displays the index of a component for a given object category

  

[TABLE]

Truncated to 30 rows

  

-   The connected components and their sizes are:

In [None]:
components_count = connected_components.groupBy('component')
display(components_count.count().sort("count", ascending=False))

  

[TABLE]

In [None]:
connected_components_without_attributes = scene_graphs_merged_without_attributes.connectedComponents()
display(connected_components_without_attributes)

  

[TABLE]

Truncated to 30 rows

In [None]:
components_count = connected_components_without_attributes.groupBy('component')
display(components_count.count().sort("count", ascending=False))

  

[TABLE]

  

-   These results indicate that the merged graph is too dense due to the
    generic relations (e.g., the spatial relations 'next-to' et al.)
    connecting all objects into one big chunk.

-   Removing some of the most frequently occurring relations could
    reveal a more interesting underlying graph structure.
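
The effect of generic relations on the connected components can be illustrated with a small union-find in plain Python. This is a toy sketch, not the GraphFrames `connectedComponents` algorithm. The vertices, edges, and relation names are invented, and `is_generic` is a hypothetical stand-in for whichever relations one decides to drop.

```python
def connected_components(vertices, edges):
    """Group vertices into weakly connected components with a small union-find."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for src, dst, _ in edges:
        parent[find(src)] = find(dst)

    groups = {}
    for v in vertices:
        groups.setdefault(find(v), set()).add(v)
    return sorted(sorted(g) for g in groups.values())

# Toy merged graph; invented for illustration.
vertices = ["car", "road", "sign", "sink", "mirror", "towel"]
edges = [
    ("car", "road", "on"), ("sign", "road", "near"),
    ("mirror", "sink", "above"), ("towel", "sink", "hanging on"),
    ("car", "mirror", "to the left of"),  # generic spatial edge bridging two scene types
]

def is_generic(relation_name):
    return relation_name.startswith("to the")

all_components = connected_components(vertices, edges)
filtered_components = connected_components(vertices, [e for e in edges if not is_generic(e[2])])
print(len(all_components), len(filtered_components))  # prints 1 2
```

In this toy example, a single generic edge is enough to fuse a traffic-like and a bathroom-like group into one component. Dropping it recovers the two groups.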

General discussion
==================

First we recap the main points of the results of our analysis.

Objects
-------

-   Interestingly, `man` (31370) and `person` (20218) are seen roughly
    three and two times as often as `woman` (11355), respectively.
-   The nature of the GQA dataset suggests its general-purpose
    applicability. However, the skewed distribution of object
    categories shown above implies otherwise.

Attributes
----------

-   Our analysis of the original dataset shows that a few of the most
    commonly annotated attributes account for the majority of all
    annotations.
-   Most of the common attributes are colors, `black` and `white` being
    the most common. `white` is seen 92659 times, and `black` 59617
    times.
-   We suspect that since the dataset is generated using human
    annotators, many of the less common annotations, such as attributes
    occurring fewer than 100 times, are more error-prone and might have
    a high noise-to-label ratio.

Relations
---------

-   The most common relations are overwhelmingly *spatial*, with
    `to the left of` and `to the right of` accounting for 1.7 million
    occurrences each.
-   The most common relations are primarily between objects of the same
    category.
-   For instance, `windows` are symmetrically related to each other,
    making up 2 x 28944 occurrences.
-   Some of these relations encode both *spatial* and *action* relation
    categories, e.g., `sitting on`.

PageRank
--------

-   Our PageRank results mainly reflect the number of occurrences of
    each object category.

To summarize, we see that GQA still has a lot of room for improvement in
terms of the distribution of objects and relations.