# Network Analysis
For this project, I used the `networkx` Python library to analyze two networks: the internal email network of a mid-sized company and a network of political blogs.

In [38]:
import networkx as nx
import pandas as pd
import operator

## 1. Email Network Analysis
For this part of the project, I analyzed an internal email communication network between employees of a mid-sized manufacturing company. 

Each node in the network represents an employee, and each directed edge between two nodes represents an individual email. The left node represents the sender and the right node represents the recipient. Each email was also assigned a timestamp. 

### Load Data
I modeled the network as a directed multigraph and made sure the node names were strings.

In [10]:
# Load email data
data = pd.read_csv('./assets/email_network.txt', sep='\t')

# Clean-up column headers
data = data.rename(columns={'#Sender': 'Sender', 'time':'Time'})
data['Sender'] = data['Sender'].astype(str)
data['Recipient'] = data['Recipient'].astype(str)

# Retrieve names of all senders
senders = data['Sender'].unique()
# Retrieve names of all recipients
recipients = data['Recipient'].unique()
# Combine into a single set (union, so employees who only sent or only received are kept)
both = set(senders) | set(recipients)

# Create empty network 
email_net = nx.MultiDiGraph()
# Add sender/recipient nodes to network 
email_net.add_nodes_from(both)

# Add edges to the network with time attribute
for idx in data.index:
    email_net.add_edge(data.loc[idx, 'Sender'], 
                       data.loc[idx, 'Recipient'],
                       time = data.loc[idx, 'Time'])
    
print('Total number of employees is: ', len(email_net.nodes()))
print('Total number of emails sent was: ', len(email_net.edges()))

Total number of employees is:  167
Total number of emails sent was:  82927
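
Because the network is a multigraph, repeated emails between the same pair of employees are kept as parallel edges, so the edge count is the number of emails rather than the number of distinct sender/recipient pairs. A minimal stdlib-only sketch of that distinction, using made-up records (the tuples below are hypothetical, not taken from the data set):

```python
from collections import Counter

# Hypothetical (sender, recipient, timestamp) records, for illustration only
toy_emails = [
    ('1', '2', 100),
    ('1', '2', 250),   # second email over the same directed pair
    ('2', '1', 300),   # opposite direction counts as a different pair
    ('1', '3', 400),
]

# Count emails per directed (sender, recipient) pair
pair_counts = Counter((sender, recipient) for sender, recipient, _ in toy_emails)

total_emails = sum(pair_counts.values())   # one multigraph edge per email
distinct_channels = len(pair_counts)       # edges a simple directed graph would keep

print('Emails:', total_emails)
print('Distinct sender/recipient pairs:', distinct_channels)
```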


### Network Connectivity
Assume that when an employee sends an email to another employee, a communication channel is created that lets the sender pass information to the receiver, but not vice versa. Under that assumption, can information travel from every employee to every other employee? In other words, is the network strongly connected?

In [12]:
# Part 1: Is the email network strongly connected? 
strong = nx.is_strongly_connected(email_net)

if strong:
    print('The email network is strongly connected.')
else:
    print('The email network is not strongly connected.')

The email network is not strongly connected.


Therefore, when the direction of each email is taken into account, at least one employee cannot reach every other employee through email.

Now, if we assume that a communication channel established by an email allows information to be exchanged both ways, is it possible for information to go from every employee to every other employee? In other words, is the network weakly connected?

In [13]:
# Part 2: Is it weakly connected? 
weak = nx.is_weakly_connected(email_net)

if weak:
    print('The email network is weakly connected.')
else:
    print('The email network is not weakly connected.')

The email network is weakly connected.


Therefore, all of the employees are connected through email, if the direction of the email isn't considered.
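
To make the difference between the two checks concrete, here is a small stdlib-only sketch on a toy graph. It is not how `networkx` implements these functions; the helper names and the three-node graph are assumptions made for illustration.

```python
from collections import deque

def reachable(adj, start):
    """Return the set of nodes reachable from `start` by following edges in `adj`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def is_strongly_connected_sketch(nodes, edges):
    # Strong connectivity: one node reaches everyone AND everyone reaches it,
    # which is checked by searching the graph and its reverse.
    forward = {n: set() for n in nodes}
    backward = {n: set() for n in nodes}
    for u, v in edges:
        forward[u].add(v)
        backward[v].add(u)
    start = nodes[0]
    everyone = set(nodes)
    return reachable(forward, start) == everyone and reachable(backward, start) == everyone

def is_weakly_connected_sketch(nodes, edges):
    # Weak connectivity: ignore direction by adding each edge both ways.
    undirected = {n: set() for n in nodes}
    for u, v in edges:
        undirected[u].add(v)
        undirected[v].add(u)
    return reachable(undirected, nodes[0]) == set(nodes)

toy_nodes = ['A', 'B', 'C']
toy_edges = [('A', 'B'), ('B', 'C')]   # A emails B, B emails C; nobody emails A

print(is_weakly_connected_sketch(toy_nodes, toy_edges))
print(is_strongly_connected_sketch(toy_nodes, toy_edges))
```

In the toy graph, adding a single edge from `'C'` back to `'A'` closes the cycle and makes the graph strongly connected as well.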

### Largest Connected Components
In a directed graph, like our email network, a weakly connected component (WCC) is a maximal subgraph in which every pair of nodes is connected by some path when edge directions are ignored. The largest weakly connected component is the WCC with the most nodes. For our network, it represents the largest group of employees connected through emails, regardless of who sent or received each email.

In [16]:
# Find largest weakly connected components
largest_weakly_connected = max(nx.weakly_connected_components(email_net), key=len)

print('The most employees weakly connected through email is: ', len(largest_weakly_connected)) 

The most employees weakly connected through email is:  167


A strongly connected component (SCC) of the email network is a maximal sub-network in which every employee can reach every other employee by following emails in the direction they were sent. The largest strongly connected component is the SCC with the most employees.

In [17]:
# Find largest strongly connected components
largest_strongly_connected = max(nx.strongly_connected_components(email_net), key=len)

print('The most employees strongly connected through email is: ', len(largest_strongly_connected)) 

The most employees strongly connected through email is:  126
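
One way to see what a strongly connected component is: two nodes belong to the same SCC exactly when each can reach the other. The sketch below applies that definition directly to a toy graph. It is a deliberately simple approach for illustration (library implementations use far more efficient algorithms), and the function names and graph are assumptions.

```python
def reachable(adj, start):
    """Return every node reachable from `start`, including `start` itself."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def scc_by_mutual_reachability(adj):
    nodes = set(adj) | {v for targets in adj.values() for v in targets}
    reach = {n: reachable(adj, n) for n in nodes}
    components, assigned = [], set()
    for n in sorted(nodes):
        if n in assigned:
            continue
        # n's component: nodes that n can reach and that can reach n back
        component = {m for m in reach[n] if n in reach[m]}
        components.append(component)
        assigned |= component
    return components

# Hypothetical toy graph: A -> B -> C -> A form a cycle; D only receives
toy = {'A': ['B'], 'B': ['C'], 'C': ['A', 'D'], 'D': []}

components = scc_by_mutual_reachability(toy)
largest = max(components, key=len)
print(sorted(largest))
```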


### Analysis of Largest Strongly Connected Component
Within the largest strongly connected component, every employee can reach every other employee by email, which makes it the most reliable part of the network for sharing information. I therefore analyzed a number of defining characteristics of this component.

For instance, how far apart are the employees in this component, on average? 

In [19]:
# Create subgraph from largest strongly connected component
email_sub_net = email_net.subgraph(largest_strongly_connected)

# Calculate average distance
avg_distance = nx.average_shortest_path_length(email_sub_net)

print('The average distance between employees is: ', round(avg_distance, 2))

The average distance between employees is:  1.65


So on average, the shortest chain of emails from one employee to another in this sub-network is about 1.65 emails long.
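
For a directed graph, the average distance is the sum of shortest-path lengths over all ordered pairs of distinct nodes, divided by the number of such pairs, n(n - 1). A minimal stdlib-only sketch of that calculation on a toy three-node cycle (the helper names and graph are assumptions for illustration):

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path length (in edges) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def average_distance_sketch(adj):
    # Assumes the graph is strongly connected, so every pair has a finite distance
    n = len(adj)
    total = sum(d for source in adj for d in bfs_distances(adj, source).values())
    return total / (n * (n - 1))

# Hypothetical directed cycle: A -> B -> C -> A
cycle = {'A': ['B'], 'B': ['C'], 'C': ['A']}

toy_avg = average_distance_sketch(cycle)
print(toy_avg)
```

Each node in the cycle is 1 step from one neighbor and 2 steps from the other, so the six ordered pairs sum to 9 and the average is 1.5.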

What is the largest possible distance between two employees? 

In [20]:
# Find max shortest path length between nodes 
max_dist = nx.diameter(email_sub_net)

print('The maximum shortest path between employees is: ', max_dist)

The maximum shortest path between employees is:  3


So any employee in the sub-network can reach any other employee through a chain of at most 3 emails.

An employee's eccentricity is the largest shortest-path distance from that employee to any other employee. Which employees in the email sub-network have an eccentricity equal to the sub-network diameter?

In [21]:
# Find nodes with eccentricity equal to diameter
set(nx.periphery(email_sub_net))

{'129', '134', '97'}

These are the employees on the periphery of the sub-network: each of them is 3 emails away from at least one other employee.

Which employees in the email sub-network have an eccentricity equal to the sub-network radius, the smallest eccentricity?

In [22]:
# Employees with eccentricity equal to radius
set(nx.center(email_sub_net))

{'38'}

This is the employee that takes the fewest emails to reach all of the other employees in the sub-network. 
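
Diameter, radius, periphery, and center all derive from the eccentricity of each node. The sketch below computes them for a small toy graph using only the standard library; the helper names and the four-node graph are assumptions for illustration, not the `networkx` implementation.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path length (in edges) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def eccentricities(adj):
    # Eccentricity of a node: its largest shortest-path distance to any other node
    return {node: max(bfs_distances(adj, node).values()) for node in adj}

# Hypothetical chain with links in both directions: A <-> B <-> C <-> D
chain = {'A': ['B'], 'B': ['A', 'C'], 'C': ['B', 'D'], 'D': ['C']}

ecc = eccentricities(chain)
diameter = max(ecc.values())   # largest eccentricity
radius = min(ecc.values())     # smallest eccentricity
periphery = {n for n, e in ecc.items() if e == diameter}
center = {n for n, e in ecc.items() if e == radius}

print(diameter, radius, sorted(periphery), sorted(center))
```

In this chain the two end nodes form the periphery and the two middle nodes form the center.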

## 2. Network of Political Blogs
For this part of the project, I analyzed the links between political blogs. I modeled them as a directed network in which each node corresponds to a blog and each edge corresponds to a link from one blog to another.

In [25]:
# Load data stored in Graph Modelling Language (GML) format
blog_sub_net = nx.read_gml('assets/blogs.gml')

### Scaled PageRank
Using the PageRank algorithm developed at Google, scaled to include a damping factor of 0.85, I determined the most influential blogs in the network. 

In [32]:
# Alpha value 
a = 0.85 
# Calculate page rank with given alpha 
scaled_page_rank = nx.pagerank(blog_sub_net, alpha=a) 

# Sort by rank, retrieve top 5 
top_5_tuples = sorted(scaled_page_rank.items(), 
               key=operator.itemgetter(1),
               reverse=True)[0:5]

# Extract just web sites from list 
pagerank_top_5_values = [t[0] for t in top_5_tuples]

pagerank_top_5_values

['dailykos.com',
 'atrios.blogspot.com',
 'instapundit.com',
 'blogsforbush.com',
 'talkingpointsmemo.com']

So, under PageRank's random-surfer model with a damping factor of 0.85, these are the five blogs a reader is most likely to land on in the blog network.
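
To show where the damping factor enters, here is a minimal power-iteration sketch of Scaled PageRank in plain Python. Each node's score is (1 - alpha)/N plus alpha times the rank passed along its incoming links, with each linking page splitting its rank evenly across its out-links. This is an illustrative sketch, not the `networkx` implementation; the function name, tolerance, and toy graph are assumptions.

```python
def pagerank_sketch(adj, alpha=0.85, tol=1e-10, max_iter=200):
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}   # start from a uniform distribution
    for _ in range(max_iter):
        # Rank sitting on pages with no out-links is spread evenly over all pages
        dangling = sum(rank[v] for v in nodes if not adj[v])
        new_rank = {v: (1 - alpha) / n + alpha * dangling / n for v in nodes}
        for u in nodes:
            for v in adj[u]:
                # u passes an equal share of its damped rank to each page it links to
                new_rank[v] += alpha * rank[u] / len(adj[u])
        converged = sum(abs(new_rank[v] - rank[v]) for v in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank

# Hypothetical toy web: A, B and D all link to C; C links back to A
toy_web = {'A': ['C'], 'B': ['C'], 'C': ['A'], 'D': ['C']}

toy_rank = pagerank_sketch(toy_web)
top_page = max(toy_rank, key=toy_rank.get)
print(top_page)
```

In the toy graph, `'C'` collects links from every other page and ends up with the highest score, while `'B'` and `'D'`, which nobody links to, keep only the baseline (1 - alpha)/N.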

### HITS Algorithm
The Hyperlink-Induced Topic Search (HITS) algorithm is a link analysis algorithm developed by Jon Kleinberg. It assigns two scores to each page: an authority score, which estimates the value of the page's content, and a hub score, which estimates the value of its links to other pages.
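
The two scores reinforce each other: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. A minimal sketch of that iteration in plain Python follows. It is for illustration only and is not the `networkx` implementation; the function name, fixed iteration count, and toy graph are assumptions.

```python
def hits_sketch(adj, iterations=50):
    # Assumes the graph has at least one edge, so the normalizing sums are non-zero
    nodes = set(adj) | {v for targets in adj.values() for v in targets}
    hub = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Authority update: add up the hub scores of every page linking in
        auth = {v: 0.0 for v in nodes}
        for u, targets in adj.items():
            for v in targets:
                auth[v] += hub[u]
        # Hub update: add up the authority scores of every page linked to
        hub = {u: sum(auth[v] for v in adj.get(u, ())) for u in nodes}
        # Normalize both score sets so each sums to 1
        auth_total, hub_total = sum(auth.values()), sum(hub.values())
        auth = {v: s / auth_total for v, s in auth.items()}
        hub = {v: s / hub_total for v, s in hub.items()}
    return hub, auth

# Hypothetical toy web: 'portal' links to both content pages, 'fan' to one
toy_web = {'portal': ['news', 'wiki'], 'fan': ['news'], 'news': [], 'wiki': []}

toy_hub, toy_auth = hits_sketch(toy_web)
best_hub = max(toy_hub, key=toy_hub.get)
best_authority = max(toy_auth, key=toy_auth.get)
print(best_hub, best_authority)
```

Note that a high hub score is not simply a count of outgoing links: links to strong authorities count for more than links to weak ones.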

Next, I applied the HITS algorithm to the blog network to find the 5 blogs with the highest hub scores, that is, the blogs whose outgoing links point to the strongest authorities in the network.

In [33]:
# Calculate hub / authority scores 
hub_score, authority_score = nx.hits(blog_sub_net)

# Sort by hub scores, retrieve top 5 
top_5_tuples = sorted(hub_score.items(), 
               key=operator.itemgetter(1),
               reverse=True)[0:5]

# Extract just web sites from list 
hub_top_5_values = [t[0] for t in top_5_tuples]

hub_top_5_values

['politicalstrategy.org',
 'madkane.com/notable.html',
 'liberaloasis.com',
 'stagefour.typepad.com/commonprejudice',
 'bodyandsoul.typepad.com']

I also found the 5 blogs with the highest authority scores, that is, the blogs that are linked to by the strongest hubs.

In [34]:
# Sort by authority scores, retrieve top 5 
top_5_tuples = sorted(authority_score.items(), 
               key=operator.itemgetter(1),
               reverse=True)[0:5]

# Extract just web sites from list 
authority_top_5_values = [t[0] for t in top_5_tuples]

authority_top_5_values

['dailykos.com',
 'talkingpointsmemo.com',
 'atrios.blogspot.com',
 'washingtonmonthly.com',
 'talkleft.com']

### Conclusion
I then compared the contents of each of the lists.   

In [35]:
# Comparing list of highest page ranks and hub list
set(pagerank_top_5_values) & set(hub_top_5_values)

set()

So there are no pages in both the top page rank list and the hub list.

In [36]:
# Comparing authority list and hub list
set(authority_top_5_values) & set(hub_top_5_values)

set()

Again, the authority and hub lists have no blogs in common.

In [37]:
# Comparing list of highest page ranks and authority list
set(pagerank_top_5_values) & set(authority_top_5_values)

{'atrios.blogspot.com', 'dailykos.com', 'talkingpointsmemo.com'}

However, the Scaled PageRank list and the HITS authority list of blogs share 3 blogs, indicating these blogs are especially influential. 