# Image Graph Dataset Exploration

We're going to take a look at a few examples of how we can explore the Image Graph dataset.

In [None]:
dataset = "ARCHDATASETURL"

## pandas

Next, we'll set up our environment so we can load our Image Graph dataset into [pandas](https://pandas.pydata.org) DataFrames. If you're unfamiliar with DataFrames but have worked with spreadsheets before, you should feel comfortable pretty quickly.

In [None]:
import pandas as pd

## Data Table Display

Colab includes an extension that renders pandas DataFrames into interactive displays that can be filtered, sorted, and explored dynamically. This can be very useful for taking a look at what each DataFrame provides!

Data table display for pandas DataFrames can be enabled by running:
```python
%load_ext google.colab.data_table
```
and disabled by running:
```python
%unload_ext google.colab.data_table
```

In [None]:
%load_ext google.colab.data_table

## Loading our ARCH Dataset as a DataFrame

Next, we'll create a pandas DataFrame from our dataset and show a preview of it using the Data Table Display.

In [None]:
image_graph = pd.read_csv(dataset, compression="gzip")
image_graph
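Before going further, it can help to check the shape, the column names, and any missing values. The following is a minimal sketch on a tiny stand-in DataFrame: the rows are hypothetical, and only the `source` and `url` column names are taken from the cells in this notebook.

```python
import pandas as pd

# Hypothetical stand-in rows; in the notebook you would inspect `image_graph` instead.
sample = pd.DataFrame(
    {
        "source": ["example.org/a", "example.org/a", "example.org/b"],
        "url": [
            "example.org/img/1.png",
            "example.org/img/1.png",
            "example.org/img/2.png",
        ],
    }
)

# Number of rows and columns, the column names, and missing values per column.
n_rows, n_cols = sample.shape
columns = list(sample.columns)
missing = sample.isna().sum()

print(n_rows, n_cols)
print(columns)
print(missing)
```

Running the same three checks on `image_graph` should show whether the columns used later in this notebook are present.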

## Examining the Image Graph




### What are the most frequent `source` and `url` combinations?

In [None]:
# Count each (source, url) pair and keep the ten most frequent.
top_links = image_graph[["source", "url"]].value_counts().head(10).reset_index()
top_links.columns = ["source", "url", "count"]
top_links
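A related question is how many distinct sources reference each image. The sketch below uses hypothetical rows rather than the real dataset, and the names `sample`, `pair_counts`, and `sources_per_image` are made up for illustration. It repeats the pair count from the cell above, then groups by `url`.

```python
import pandas as pd

# Hypothetical rows standing in for `image_graph`.
sample = pd.DataFrame(
    {
        "source": ["example.org/a"] * 3 + ["example.org/b"] * 3,
        "url": [
            "example.org/img/1.png",
            "example.org/img/1.png",
            "example.org/img/1.png",
            "example.org/img/1.png",
            "example.org/img/2.png",
            "example.org/img/2.png",
        ],
    }
)

# Same pair count as above: one row per (source, url) pair, most frequent first.
pair_counts = sample[["source", "url"]].value_counts().reset_index()
pair_counts.columns = ["source", "url", "count"]

# Number of distinct sources that reference each image URL.
sources_per_image = (
    sample.groupby("url")["source"].nunique().sort_values(ascending=False)
)

print(pair_counts)
print(sources_per_image)
```

The pair count rewards a single source repeating the same image many times, while the grouped count rewards images that appear across many sources, so the two rankings can differ.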

## Can we create a network graph visualization with the data we have?

Yes! We can take advantage of [NetworkX](https://networkx.org/documentation/stable/index.html) to create some basic graphs.

NetworkX is *really* powerful, so there is a lot more that can be done with it than what we're demonstrating here.

First we'll import `networkx` as well as `matplotlib.pyplot`.

In [None]:
import matplotlib.pyplot as plt
import networkx as nx

We can take advantage of [`from_pandas_edgelist`](https://networkx.org/documentation/stable/reference/generated/networkx.convert_matrix.from_pandas_edgelist.html) here, since `top_links` is already an edge table, to initialize our graph.


In [None]:
# Build an undirected graph; each edge stores its "count" as an attribute.
G = nx.from_pandas_edgelist(
    top_links, source="source", target="url", edge_attr="count"
)
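The graph above is undirected. If we read each row as pointing from a `source` to a `url`, a directed graph is another reasonable choice. That reading is an assumption about the data, not something the dataset requires. The sketch below uses a hypothetical three-row edge table and passes `create_using=nx.DiGraph`, then inspects the result.

```python
import networkx as nx
import pandas as pd

# Hypothetical edge table with the same columns as `top_links`.
edges = pd.DataFrame(
    {
        "source": ["example.org/a", "example.org/b", "example.org/b"],
        "url": [
            "example.org/img/1.png",
            "example.org/img/1.png",
            "example.org/img/2.png",
        ],
        "count": [3, 1, 2],
    }
)

# Directed graph: edges run from `source` to `url`, carrying `count`.
D = nx.from_pandas_edgelist(
    edges, source="source", target="url", edge_attr="count", create_using=nx.DiGraph
)

# Basic size checks and the number of incoming edges per node.
n_nodes = D.number_of_nodes()
n_edges = D.number_of_edges()
in_degree = dict(D.in_degree())

print(n_nodes, n_edges)
print(in_degree)
```

In a directed graph built this way, a node's in-degree is the number of distinct sources pointing at it.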

Now we'll lay out our graph and draw it!


In [None]:
# Position nodes with a force-directed layout; k sets the preferred distance between nodes.
pos = nx.spring_layout(G, k=15)
options = {
    "node_size": 1000,
    "node_color": "#bc5090",
    "node_shape": "o",
    "alpha": 0.5,
    "linewidths": 4,
    "font_size": 10,
    "font_color": "black",
    "width": 2,
    "edge_color": "grey",
}

plt.figure(figsize=(12, 12))

nx.draw(G, pos, with_labels=True, **options)

# Label each edge with its count.
labels = {e: G.edges[e]["count"] for e in G.edges}
nx.draw_networkx_edge_labels(G, pos, edge_labels=labels)

plt.show()
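One possible refinement is to make edge thickness reflect the count instead of relying only on labels. This is a sketch on a small hypothetical graph. The scaling formula (widths between 1 and 5) is an arbitrary choice for illustration, and `seed=42` only makes the layout repeatable.

```python
import matplotlib.pyplot as plt
import networkx as nx

# Hypothetical graph standing in for `G`; each edge has a "count" attribute.
G = nx.Graph()
G.add_edge("example.org/a", "example.org/img/1.png", count=3)
G.add_edge("example.org/b", "example.org/img/1.png", count=1)
G.add_edge("example.org/b", "example.org/img/2.png", count=2)

# Scale each edge width relative to the largest count.
counts = [G.edges[e]["count"] for e in G.edges]
max_count = max(counts)
widths = [1 + 4 * c / max_count for c in counts]

pos = nx.spring_layout(G, seed=42)

plt.figure(figsize=(6, 6))
nx.draw_networkx_nodes(G, pos, node_color="#bc5090", alpha=0.5)
nx.draw_networkx_labels(G, pos, font_size=8)
nx.draw_networkx_edges(G, pos, width=widths, edge_color="grey")
plt.axis("off")
plt.show()
```

Because `widths` is built in the same order as `G.edges`, each width lines up with the edge it describes.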