# Lecture 12: Graph Theory (2 of 2)

### Please note: This lecture will be recorded and made available for viewing online. If you do not wish to be recorded, please adjust your camera settings accordingly. 

# Reminders/Announcements:
- Assignment 4 is due Thursday at 8pm.
- Quiz 1 key is available in your CoCalc projects.
- Final Project Topics are now available. Take a look!
- Please see Canvas announcement for comments on HW 4.
- Welcome to February!

## Graph Theory: Quick Review

Recall that a graph $G$ is a collection of vertices $V$ and edges $E$. Unless otherwise stated, there is nothing special about the edges. They
- are undirected
- carry no additional information

In [0]:
G = graphs.CycleGraph(5)
show(G)

## Weighted and Directed Graphs

Often it can be useful to add information to a graph. One can do this by weighting edges, or by directing edges. 

Think of the weighting of an edge as *some sort of cost*.

In [0]:
G = Graph()
G.weighted(True)

G.add_vertices([i for i in range(1,16)])

for i in G:
    for j in G:
        if i!=j:
            if i and j%i == 0:
                G.add_edge((i,j,i*j + i + j))
plot(G, layout = 'circular')

In [0]:
plot(G, layout = 'circular',edge_labels = True)

Think of the vertices as $15$ buildings. Think of the current edge set as the set of possible roads that you could construct between these buildings. The weight of an edge is the cost of paving that road. How can we connect all the buildings together in a way that minimizes our total cost?

This is given by a *minimum spanning tree*.
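To see what such a computation involves, here is a minimal plain-Python sketch of Kruskal's algorithm: sort the edges by cost and keep an edge whenever it joins two pieces that are not yet connected. This is only an illustration. It is not necessarily the algorithm Sage runs internally, and the function name `min_spanning_tree`, the helper `find`, and the four-building toy data are assumptions made up for this example.

```python
def min_spanning_tree(vertices, edges):
    """Kruskal's algorithm. Each edge is a (u, v, weight) triple."""
    # parent[v] points toward the representative of v's component.
    parent = {v: v for v in vertices}

    def find(v):
        # Walk up to the representative, shortening the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for u, v, weight in sorted(edges, key=lambda edge: edge[2]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # this edge does not close a cycle
            parent[root_u] = root_v   # merge the two components
            tree.append((u, v, weight))
    return tree


# Hypothetical toy data: four buildings and five candidate roads.
buildings = [1, 2, 3, 4]
roads = [(1, 2, 5), (1, 3, 1), (2, 3, 2), (2, 4, 7), (3, 4, 3)]

tree = min_spanning_tree(buildings, roads)
total_cost = sum(weight for _, _, weight in tree)
print(tree)        # [(1, 3, 1), (2, 3, 2), (3, 4, 3)]
print(total_cost)  # 6
```

A spanning tree on $n$ vertices always has $n-1$ edges, which is why three roads suffice for four buildings.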



In [0]:
G.min_spanning_tree()

In [0]:
plot(Graph(G.min_spanning_tree()), edge_labels = True,layout = 'circular')

In [0]:
totalCost = sum([edge[2] for edge in G.min_spanning_tree()])
totalCost

What happens if we alter the cost function?

In [0]:
G = Graph()
G.weighted(True)

G.add_vertices([i for i in range(1,16)])

for i in G:
    for j in G:
        if i!=j:
            if i and j%i == 0:
                G.add_edge((i,j,(15 - i)*(15-j)))
plot(G, layout = 'circular', edge_labels = True)

In [0]:
plot(Graph(G.min_spanning_tree()), edge_labels = True, layout = 'circular')

In [0]:
plot(Graph(G.min_spanning_tree()), edge_labels = True)

In a directed graph, an edge points *from one vertex to another*.

In [0]:
D = DiGraph()
D.add_vertices([1,2,3])
D.add_edges([(1,2),(2,3)])
show(D)

Note that you can now add *both* the edges $(i,j)$ and $(j,i)$.

In [0]:
D.add_edges([(2,1),(3,2)])
show(D)

Now vertices have both an *indegree* and an *outdegree*.

In [0]:
D.out_degree(1)

In [0]:
D.in_degree(1)

This leads to the concept of *sources* (vertices with indegree $0$) and *sinks* (vertices with outdegree $0$).
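As a rough illustration of these definitions outside of Sage, the plain-Python sketch below counts indegrees and outdegrees from a list of directed edges. A source is then a vertex with indegree 0 and a sink is a vertex with outdegree 0. The helper name `degrees` is an assumption for this example; the edge list is the small path digraph built earlier.

```python
def degrees(vertices, edges):
    """Return (in_degree, out_degree) dictionaries for a directed edge list."""
    in_degree = {v: 0 for v in vertices}
    out_degree = {v: 0 for v in vertices}
    for tail, head in edges:      # the edge points from tail to head
        out_degree[tail] += 1
        in_degree[head] += 1
    return in_degree, out_degree


vertices = [1, 2, 3]
edges = [(1, 2), (2, 3)]

in_degree, out_degree = degrees(vertices, edges)
sources = [v for v in vertices if in_degree[v] == 0]
sinks = [v for v in vertices if out_degree[v] == 0]
print(sources)  # [1]
print(sinks)    # [3]
```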

In [0]:
D = digraphs.RandomDirectedGNP(10, 1/7)
D.weighted(True)
plot(D, vertex_colors = {'green':D.sources(),'red':D.sinks()})

In [0]:
D.add_edge((8,2,5))
plot(D, edge_labels = True)

Digraphs are often used to model things such as electrical connectivity (i.e. you think of sources as sources of electrical power; the sinks are things that need to be powered).

## Brief Aside: Reading and Writing in Python

One of the most useful things that you can do in Python (in my opinion *the most* useful in terms of actual life) is manipulating files. Suppose you have a huge text file containing data about the world's countries, and you'd like to manipulate this data in Python. Do you have to go line by line and copy the data from that text file into your Python file by hand?

NO! Below I will give an example of how useful this can be. The *main* thing I want you to focus on is how to interact with files; I will do a bit of statistics with the data, because we might as well buy some milk while we're at the store, but you don't have to worry about that part too much yet.

The World Bank has tons of free data sets regarding economic/social/climate markers at the country level. Here is their page on annual real GDP growth: https://data.worldbank.org/indicator/NY.GDP.MKTP.KD.ZG?view=chart 

I've downloaded some of this data and put it in the directory of this lecture: realGDP.csv.

In [0]:
with open('realGDP.csv','r') as myFile:    # Please note the 'r' mode! This means "read": the file is opened read-only, so you cannot overwrite your data as you work with it.
    myData = myFile.readlines()

print(myData[16])
print(type(myData[16]))

## Note!
If you have a really big file, `readlines` may not be the right call, as your data may not fit in memory. In that case you could just iterate through the file line by line:

In [0]:
with open('realGDP.csv','r') as myFile:
    for line in myFile:
        if 'Australia' in line:
            print(line)

This will not be an issue in this class; you can use whatever method you are most comfortable with.

## ***** Participation Check ***************************
In the directory of this lecture file there is a file premierleague.csv which contains data for the 2018-2019 English Premier League season (American "soccer"/ English "football") obtained here: https://sports-statistics.com/sports-data/soccer-datasets/. 

Each line is formatted 

Division, Date, Home Team, Away Team, Home Team Goals, Away Team Goals, ...

(there is additional data that you don't need to worry about)

How many times did Arsenal play Chelsea in the 2018-2019 season? On which dates did they play? Who won the games? (Hint: you can use an `and` statement to string together two `in` statements)

In [0]:
#Your code here

Answer here:

## ***********************************************************

One of the most useful commands regarding data processing is `split()`. We are working with a *comma separated value* file, which means:
- Each line in the file contains one group of relevant data (in this case, one line = one country)
- The "columns" are separated by commas.

Accordingly, we want to take a comma separated string and "split" it up every time we see a comma:

In [0]:
myString = 'Australia, AUS, GDP, 124124, 123123, 198272'
myString.split(',')
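One detail worth noticing in the output above: splitting on `','` alone keeps the space that follows each comma. A small sketch of one way to tidy this up, calling `strip()` with no arguments to remove surrounding whitespace (the variable names here are just for illustration):

```python
my_string = 'Australia, AUS, GDP, 124124, 123123, 198272'

raw_fields = my_string.split(',')
clean_fields = [field.strip() for field in raw_fields]

print(raw_fields)    # ['Australia', ' AUS', ' GDP', ' 124124', ' 123123', ' 198272']
print(clean_fields)  # ['Australia', 'AUS', 'GDP', '124124', '123123', '198272']
```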

In general you can have Python split at any delimiter you choose:

In [0]:
myString = 'Australia & AUS & GDP & 124124 & 123123 & 198272'
myString.split('&')

The default use of split is to split on whitespace (which includes spaces, newlines, and tabs). For instance, this turns a sentence into the list of its words:

In [0]:
myStr = 'The quick brown fox jumped over the lazy dog'
myStr.split()

Let's apply this to our data:

In [0]:
myData = []
with open('realGDP.csv','r') as myFile:
    for line in myFile:
        myData.append(line.split(','))

In [0]:
ausData = myData[16]
ausData

In [0]:
ausData[25]

Maybe there is an even better way to store this information?

In [0]:
myData = dict()
lineCounter = 1
with open('realGDP.csv','r') as myFile:
    for line in myFile:
        if lineCounter >= 5:
            countryData = line.split(',')
            myData[countryData[0]] = countryData[1:]
        lineCounter += 1

In [0]:
myData['"Australia"']

One more try! (I am adding in these intermediate steps because in real life, data is *messy*. You need to be comfortable debugging like this if you want any sort of data-related career.) In this case we need to take care of these obnoxious quotation marks. The command here is `strip()`. It removes *leading and trailing characters* only; matching characters in the middle of the string are left alone.

In [0]:
myStr = '"Australia"'
myStr.strip('"')

In [0]:
myStr = '"Aust"ralia"'
myStr.strip('"')

In [0]:
myData = dict()
lineCounter = 1
with open('realGDP.csv','r') as myFile:
    for line in myFile:
        if lineCounter >= 5:
            countryData = line.split(',')
            myData[countryData[0].strip('"')] = [item.strip('"') for item in countryData[1:]]
        lineCounter += 1

In [0]:
myData['Mexico']
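A side note on the split-then-strip approach: it can misbehave when a quoted field itself contains a comma. Python's standard `csv` module handles quoting for you. The sketch below is an alternative, not what the lecture code does, and the two sample lines are hypothetical stand-ins written in the same quoted style as the file.

```python
import csv
import io

# Hypothetical sample lines; the numbers are made up for illustration.
sample = io.StringIO(
    '"Country Name","Country Code","1961"\n'
    '"Bahamas, The","BHS","10.5"\n'
)

rows = list(csv.reader(sample))
naive = '"Bahamas, The","BHS","10.5"'.split(',')

print(rows[1])  # ['Bahamas, The', 'BHS', '10.5']
print(naive)    # ['"Bahamas', ' The"', '"BHS"', '"10.5"']
```

With a real file you would pass the open file object to `csv.reader` instead of the `io.StringIO` wrapper used here.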

In practice it would be even better to consider formatting this as a dictionary of dictionaries, so that you could do something like 
```
myData['Australia'][1985]
```

but I'm done with data wrangling for now, so let's just use it.
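For reference, here is a minimal sketch of what that dictionary-of-dictionaries layout could look like. It assumes a header row whose later entries are years; the header, the rows, and every number below are hypothetical placeholders rather than values from realGDP.csv.

```python
# Hypothetical, already-cleaned rows (quotes stripped, split on commas).
header = ['Country Name', 'Country Code', '1985', '1986']
rows = [
    ['Australia', 'AUS', '5.2', '4.0'],
    ['Greece', 'GRC', '2.5', ''],      # an empty string marks missing data
]

year_columns = header[2:]
my_data = {}
for row in rows:
    country = row[0]
    # Map each year to its value, skipping years with no data.
    my_data[country] = {
        int(year): float(value)
        for year, value in zip(year_columns, row[2:])
        if value != ''
    }

print(my_data['Australia'][1985])  # 5.2
print(sorted(my_data['Greece']))   # [1985]
```

The payoff is that lookups are by year rather than by list position, so you no longer need to remember which index corresponds to which column.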

In [0]:
startIndex = 4
endIndex = -3
ausData = [float(gdp) for gdp in myData['Australia'][startIndex:endIndex]]
ausData

In [0]:
list_plot(ausData)

In [0]:
startIndex = 4
endIndex = -3
greData = [float(gdp) for gdp in myData['Greece'][startIndex:endIndex]]
greData

In [0]:
comparison = list_plot(ausData)
comparison += list_plot(greData, color = 'green')
show(comparison)

The code below finds a "line of best fit" for the data points above (don't worry too much about details)

In [0]:
var('a','b')
model(x) = a *x + b
greRegression = find_fit([(i,greData[i]) for i in range(len(greData))],model)

In [0]:
var('a','b')
model(x) = a *x + b
ausRegression = find_fit([(i,ausData[i]) for i in range(len(ausData))],model)

In [0]:
print(greRegression)
print(ausRegression)

In [0]:
comparison += plot(-0.16650865062862255*x + 7.612840847116432, (-1,60), color = 'green')
comparison += plot(-0.029220092028283018*x + 4.291826627149551, (-1,60), color = 'blue')

In [0]:
show(comparison)
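If you are curious what a "line of best fit" means, the usual choice is the least-squares line: the slope $a$ and intercept $b$ that minimize $\sum_i (a x_i + b - y_i)^2$. For a straight line these have the closed form $a = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2$ and $b = \bar{y} - a\bar{x}$. The plain-Python sketch below computes them directly. It illustrates the formula only; it makes no claim about how `find_fit` works internally, and the function name `best_fit_line` and the test data are made up for this example.

```python
def best_fit_line(ys):
    """Least-squares slope and intercept for the points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of products of deviations, and sum of squared x-deviations.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Points lying exactly on y = 2x + 1, so the fit should recover that line.
slope, intercept = best_fit_line([1.0, 3.0, 5.0, 7.0])
print(slope, intercept)  # 2.0 1.0
```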

We will see how to expedite this process when we get to the *Pandas* module.

## ***** Participation Check ***************************

Write a function *neighborhood* which 
- Takes as input a graph G, a vertex v of G, and an integer k.
- Returns a list of the vertices in G for which `G.distance(v,u)` is less than or equal to k. 

In [0]:
def neighborhood(G,v,k):
    #Your code here
    pass    # placeholder so the cell runs; replace with your solution

In [0]:
print(neighborhood(graphs.CycleGraph(10),0,3))

In [0]:
show(graphs.CycleGraph(10))

## ***********************************************************

## Let's apply this to graphs!
The 1138_bus.mtx file is a data file detailing connectivity in a power grid. It's not easy to figure out exactly what is being represented here, but basically:
- Nodes represent some form of power station
- Edges represent connections between these power stations.


In [0]:
with open('1138_bus.mtx','r') as myFile:
    data = myFile.readlines()

In [0]:
data

In [0]:
data[19]

In [0]:
G = Graph()
G.weighted(True)
G.add_vertices([i for i in range(1,1139)])
for line in data[14:]:
    splitLine = line.split()
    vertexA = int(splitLine[0])
    vertexB = int(splitLine[1])
    weight = float(splitLine[2])
    if vertexA != vertexB:
        G.add_edge((vertexA,vertexB, weight))


In [0]:
G.plot()

Is this even useful? Can we do anything with it?

In [0]:
G.connected_components_number()

What if we were worried about the grid disconnecting? Maybe, for instance, winter is coming, and we are worried about certain power lines failing due to some weather incident. Which power lines would be *most critical* to maintain?

In [0]:
G.bridges??

In [0]:
bridges = list(G.bridges())
print(bridges[0])

In [0]:
G.delete_edge((495,513))

In [0]:
G.connected_components_number()

In [0]:
G.add_edge((495,513,-46.2963))
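The experiment above is exactly the definition of a bridge: an edge whose deletion increases the number of connected components. The plain-Python sketch below turns that definition into a brute-force test by deleting each edge in turn and recounting components. It is meant only to make the idea concrete; it is far slower than a proper algorithm and is not how Sage computes bridges. The function names and the five-vertex toy graph are assumptions for this example, and the edge list is assumed to contain no repeated edges.

```python
def count_components(vertices, edges):
    """Count connected components of an undirected graph via depth-first search."""
    adjacency = {v: set() for v in vertices}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    components = 0
    for start in vertices:
        if start in seen:
            continue
        components += 1            # found a vertex in a new component
        seen.add(start)
        stack = [start]
        while stack:
            current = stack.pop()
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
    return components


def find_bridges(vertices, edges):
    """Return the edges whose removal increases the number of components."""
    base = count_components(vertices, edges)
    bridges = []
    for edge in edges:
        remaining = [e for e in edges if e != edge]
        if count_components(vertices, remaining) > base:
            bridges.append(edge)
    return bridges


# Hypothetical toy graph: a triangle 1-2-3 with a tail 3-4-5 attached.
vertices = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
print(find_bridges(vertices, edges))  # [(3, 4), (4, 5)]
```

The triangle edges are not bridges because each one lies on a cycle, so there is always another route around.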

Which power station is most important to maintain? This can be measured in *many* ways. Here is one crude way: which vertex is incident to the most bridges?

In [0]:
vertexDict = {vertex:0 for vertex in G}
for bridge in bridges:
    vertexDict[bridge[0]]+=1
    vertexDict[bridge[1]]+=1

In [0]:
max(vertexDict.values())

In [0]:
maxBridges = max(vertexDict.values())    # avoid hard-coding the maximum found above
for vertex in G:
    if vertexDict[vertex] == maxBridges:
        print(vertex)

In [0]:
G.edges_incident(375)

In [0]:
G.edges_incident(724)

Now we can get a local picture around vertex 724. Using your "neighborhood" function, we can get all the vertices close to 724 and plot the induced subgraph:

In [0]:
neighbors = neighborhood(G,724,2)
plot(G.subgraph(neighbors))

In [0]:
G.delete_vertex(724)
G.connected_components_number()

Perhaps it is worth paying a little more attention to vertex 724 to make sure it does not fail!

## Next Time: Number Theory/Cryptography