# Region Generation

This notebok will separate the data into all the updates for a particular project.

Take the tile_placements data and separate it out into frames. A frame is a subset of the data.
The data can be split up into frames based on time (e.g. 1 frame is 30 minutes of data) or by number of updates (e.g. 1 frame is 1 million updates). Every update can only belong to one frame.
After creating the frames, create a graph where every pixel is a node. A single pixel will be a vector of all the different updates that happened within that one frame.

We want to do a min-cut on the graph so that every graph partition represents one image. To select the edge weights, we want edges between pixels within the same image to have a large weight and edges between pixels of different images should have small weights. 

After we do graph partitions within one frame, we want to connect the frames together. 


The ultimate goal is to connect the frames that hold all the updates for a single project and to train a CNN on this data

## Progress Updates:
##### May 15, 2019  

To start out, we will try to split based on updates because there is a large surge of updates near the end, so if we split by time, then the frames near the end will have significantly more updates than frames in the beginning.
There are 16 million datapoints, so we will split 100,000 updates per frame, which will result in 160 frames.
Frames will be stored as CSV files into the folder ../data/frames

##### May 17, 2019

Heirustics: Same color and same user should have a weight of 0. 

In [3]:
import csv
import numpy as np
import networkx as nx
import math
import matplotlib.pyplot as plt

In [4]:
'''
    If fixed_framesize is True, then the each frame will contain the same number of updates equal to the framesize parameter
    The frametime parameter will be ignored.
    Default framesize is 1 million updates
    
    If fixed_framesize is False, then each frame will specify a certain timespan. The duration of one frame is the frametime parameter.
    The framesize parameter will be ignored.
    Default frametime is 3 hours (10800 seconds)
'''
def create_frames( fixed_framesize = True, framesize = 1000000, frametime = 10800, filename = "../data/sorted_tile_placements.csv"):
    # All the frames will be stored in data/frames
    filecount = 0

    num_updates = 0
    oldest_time = None
    newest_time = None
    with open(filename) as f:
        # Skip the header row
        next(f, None)
        reader = csv.reader(f)
        for r in reader:
            num_updates += 1
            time = int(r[0])
            
            if oldest_time == None or time < oldest_time:
                oldest_time = time
            if newest_time == None or time > newest_time:
                newest_time = time
            


    frames = list()

    num_updates -= 1 # Subtract one to account for the header
    print("Num updates: ", num_updates)
    with open(filename,'r') as file_in:

        # Skip the header row
        next(file_in, None)
        reader = csv.reader(file_in)
        rows = list(reader)
        
        if (fixed_framesize):
            while (filecount < int(num_updates / framesize) + 1 ):

                frames.append([])

                for i in range(filecount * framesize, (filecount * framesize) + framesize):
                    if (i < num_updates):
                        frames[filecount].append(rows[i])

                filecount += 1
        else:
            timespan = newest_time - oldest_time
            num_frames = int(timespan / frametime) + 1
            for i in range(num_frames):
                frames.append([])
            
            for update in rows:
                frames_index = int((int(update[0]) - oldest_time)/timespan)
                frames[frames_index].append(update)
                
            
    frames = np.array(frames)
    print("DONE")
    return frames


In [5]:
# Create a graph where every pixel within a frame has an edge to its r nearest neighbors and within time t
def create_graph(frames, r = 1, t = 1000):
    
    num_lines = 0
    
    # Create a list of networkx Graphs. One for every frame. 
    # Each pixel will be a node and there will be an edge between that node and its r nearest neighbors and within time t
    graphs = list()
    for frame in frames:
        G = nx.Graph()
        for i in range(0, len(frame)):
            pixel_1 = frame[i]
            for j in range(i+1, len(frame)):
                pixel_2 = frame[j]
                ts1 = int(pixel_1[0])
                ts2 = int(pixel_2[0])
                if ts2 - ts1 > t:
                    break
                x1 = int(pixel_1[2])
                y1 = int(pixel_1[3])
                x2 = int(pixel_2[2])
                y2 = int(pixel_2[3])
                user1 = pixel_1[1]
                user2 = pixel_2[1]
                color1 = pixel_1[4]
                color2 = pixel_2[4]
                if math.fabs(x1-x2) <= r and math.fabs(y1-y2):
                    pixel_tup1 = (ts1,user1,x1,y1,color1)
                    pixel_tup2 = (ts2,user2,x2,y2,color2)
                    G.add_edge(pixel_tup1, pixel_tup2, weight=1)
        graphs.append(G)
    return graphs

We will now add weights to all the edges.
Pixels that are in the same image should have larger weights.
Pixels that are in different images should have smaller weights.

every update is represented in different node.
We first assign weights to edges whose endpoints represent updates with same color

In [6]:
def add_edge_weights(graphs):
    for graph in graphs:
        all_edges = list(graph.edges())
        for edge in all_edges:
            node1 = edge[0]
            node2 = edge[1]
            if node1[1] == node2[1] or node1[4] == node2[4]:
                graph[node1][node2]['weight'] = 10

In [9]:
frames = create_frames()
G = create_graph(frames)

Num updates:  16559896
DONE


KeyboardInterrupt: 

In [7]:
#save all the graphs as pickles. Framing by updates.
for i in range(len(G)):
    filename = "../data/frame_pickles/frame_update"+str(i)+".gpickle"
    nx.readwrite.gpickle.write_gpickle(G[i],filename)

NameError: name 'G' is not defined

In [8]:
frames2 = create_frames(fixed_framesize = False)
G2 = create_graph(frames2)

FileNotFoundError: [Errno 2] No such file or directory: '../data/sorted_tile_placements.csv'

In [9]:
def draw_graph(filename):
    G = nx.readwrite.gpickle.read_gpickle(filename)
    print(list(G.nodes()))
    plt.subplot(121)
    nx.draw(G, with_labels=True, font_weight='bold')
    plt.show()

In [None]:
draw_graph("../data/frame_pickles/frame_update8.gpickle")

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

