## Centrality measures 

### Problem Set
Centrality measures can be used to predict (positive or negative) outcomes for a node.
Your task in this week’s assignment is to identify an interesting set of network data that is available on the web
(either through web scraping or web APIs) that could be used for analyzing and comparing centrality measures across nodes.

As an additional constraint, there should be at least one categorical variable available for each node
(such as “Male” or “Female”; “Republican”, “Democrat,” or “Undecided”, etc.)

In addition to identifying your data source, you should create a high level plan that describes how you would load the data for analysis,
and describe a hypothetical outcome that could be predicted from comparing degree centrality across categorical groups.
For this week’s assignment, you are not required to actually load or analyze the data.  Please see also Project 1 below.
You may work in a small group on the assignment.   You should post your document to GitHub by end of day on Sunday.

## Introduction

In our project, we take a look at a network of stocks and institutional holders.  Institutional holders include mutual funds, pension funds, insurance companies, investment firms, private foundations, endowments, and other large entities that manage funds on behalf of others.  In our social network analysis, each node is either a stock or an institutional holder, and each edge links an institutional holder to a stock it holds.

To conduct our analysis, we use a Dow Jones Industrial Average dataset to extract the ticker information and use the yfinance package to pull the institutional holder dataframes.

The Dow Jones Industrial Average consists of 30 companies, each with an index weight.  The yfinance package is an open-source tool that uses Yahoo's publicly available APIs.  It offers a threaded, Pythonic way to download data from Yahoo Finance.

The analysis will begin with extraction, transformation, and loading of the desired data. Next, we create and perform basic exploration of the social network. Finally, we will perform an analysis by categorical groups.

### Required Packages

In [None]:
import yfinance as yf
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import plotly.express as px
from scipy import stats
from networkx import algorithms

### ETL

We begin by scraping the Dow Jones data from a webpage. Below is a view of the table.

In [None]:
# scraping company tickers and info
tickers_df = pd.read_html('https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average')[1]
tickers_df.head(5)

The subset we include is the tickers with the top 20 index weights.

In [None]:
ticker_list = list(tickers_df['Symbol'])
print(ticker_list)

Here, we create a `Ticker` object for each symbol using the yfinance package.  This lets us store the institutional holder dataframe for each ticker and concatenate them all into one dataframe.

We add a `comp` column to each dataframe that holds the ticker symbol it was pulled for.  This column maps every institutional holder row back to its ticker.

In [None]:
# create dictionary with the stored symbols as keys
ticker_dict = dict.fromkeys(ticker_list)

# loop through keys to create ticker class and store institutional holders df
for comp in ticker_dict:
    ticker = yf.Ticker(comp)
    inst_hol = ticker.institutional_holders
    inst_hol['comp'] = comp
    ticker_dict[comp] = inst_hol
    
# concatenate 30 dfs into 1    
institutional_holders = pd.concat(ticker_dict.values(), ignore_index=True)
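
The loop above assumes every symbol returns a dataframe. As a minimal sketch of a more defensive version, the block below skips symbols with no holder data before concatenating. It runs on made-up data: `fetch_holders` is a hypothetical stand-in for `yf.Ticker(symbol).institutional_holders`, and the symbols and fund names are invented. Whether yfinance ever returns `None` or an empty dataframe for a given symbol is an assumption here, not something checked against the library.

```python
import pandas as pd


def fetch_holders(symbol):
    """Hypothetical stand-in for yf.Ticker(symbol).institutional_holders."""
    fake = {
        "AAA": pd.DataFrame({"Holder": ["Fund X", "Fund Y"], "Value": [10.0, 5.0]}),
        "BBB": pd.DataFrame({"Holder": ["Fund X"], "Value": [7.0]}),
        "CCC": None,  # assumed case: no holder data available
    }
    return fake[symbol]


def collect_holders(symbols, fetch=fetch_holders):
    """Concatenate per-symbol holder frames, tagging each row with its symbol."""
    frames = []
    skipped = []
    for symbol in symbols:
        holders = fetch(symbol)
        if holders is None or holders.empty:
            skipped.append(symbol)  # remember what was dropped
            continue
        # assign() returns a copy, so the fetched frame is not modified
        frames.append(holders.assign(comp=symbol))
    combined = pd.concat(frames, ignore_index=True)
    return combined, skipped


combined, skipped = collect_holders(["AAA", "BBB", "CCC"])
print(combined)
print("skipped:", skipped)
```

Keeping the list of skipped symbols makes it easy to see how many of the tickers actually contribute edges to the network.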

Below is the entire institutional holder dataframe.

In [None]:
institutional_holders

### Social Network

From the institutional holder dataframe, we are able to construct a social network.

In [None]:
# creating network from df
# edge_attr=True keeps the remaining columns (such as Value) as edge attributes
g = nx.from_pandas_edgelist(institutional_holders, 'Holder', 'comp', edge_attr=True)
edgelist = nx.to_edgelist(g)

To make the nodes easy to tell apart, we color the companies blue and the institutional holders orange.  Additionally, we label each node with its type.

In [None]:
# assigning color and type to node
colors = []
for node in g:
    if node in institutional_holders['comp'].values:
        colors.append("blue")
        g.nodes[node]['type'] = 'stock'
    else:
        colors.append("orange")
        g.nodes[node]['type'] = 'holder'
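
Because every edge joins a holder to a stock, the network should be bipartite. The sketch below, on an invented three-edge toy graph, assigns the `type` attribute with `nx.set_node_attributes` and then checks that no edge joins two nodes of the same type. The toy names (`g_toy`, `Fund X`, `AAA`) are assumptions for illustration only.

```python
import networkx as nx
import pandas as pd

# toy edge list with the same column names as the holder dataframe
edges = pd.DataFrame({
    "Holder": ["Fund X", "Fund Y", "Fund X"],
    "comp": ["AAA", "AAA", "BBB"],
})
g_toy = nx.from_pandas_edgelist(edges, "Holder", "comp")

# a set gives fast membership tests when labelling nodes
stocks = set(edges["comp"])
node_types = {n: ("stock" if n in stocks else "holder") for n in g_toy}
nx.set_node_attributes(g_toy, node_types, name="type")

# colors follow the node order of the graph, as nx.draw expects
colors_toy = ["blue" if node_types[n] == "stock" else "orange" for n in g_toy]

# every edge should connect a stock to a holder
every_edge_crosses = all(node_types[u] != node_types[v] for u, v in g_toy.edges())
print(every_edge_crosses, nx.is_bipartite(g_toy))
```

If the check fails on real data, a holder name probably collides with a ticker symbol, which would merge two distinct entities into one node.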

### Exploration

Below we see basic information about the social network.

In [None]:
# storing basic network info in dictionary 
network_info = {}

network_info['connected'] = nx.is_connected(g)
network_info['diameter'] = nx.diameter(g)
network_info['num_of_nodes'] = g.number_of_nodes()
network_info['num_of_edges'] = g.number_of_edges()
network_info['avg_shortest_path'] = algorithms.average_shortest_path_length(g)

# output as table
pd.DataFrame.from_dict(network_info, orient='index', columns=['Values'])
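
Diameter and average shortest path length are only defined for a connected graph, and NetworkX raises an error when asked for them on a disconnected one. The following is a minimal sketch, on a toy graph with invented node names, that falls back to the largest connected component. The helper name `path_stats` is hypothetical.

```python
import networkx as nx

# toy graph with two components
g_toy = nx.Graph()
g_toy.add_edges_from([("Fund X", "AAA"), ("Fund X", "BBB"), ("Fund Y", "BBB")])
g_toy.add_edge("Fund Z", "CCC")  # a separate, smaller component


def path_stats(graph):
    """Path-based statistics, computed on the largest component if needed."""
    connected = nx.is_connected(graph)
    if connected:
        target = graph
    else:
        largest = max(nx.connected_components(graph), key=len)
        target = graph.subgraph(largest)
    return {
        "connected": connected,
        "nodes_used": target.number_of_nodes(),
        "diameter": nx.diameter(target),
        "avg_shortest_path": nx.average_shortest_path_length(target),
    }


toy_stats = path_stats(g_toy)
print(toy_stats)
```

Reporting `nodes_used` alongside the statistics makes clear how much of the network the path measures actually describe.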

#### Graph Network

Next, we take our first look at the network.

In [None]:
# view basic graph
plt.figure(figsize=(50,40))
nx.draw(g, with_labels=True,
        node_color=colors)

### Analysis 
Once we have loaded the dataframe of institutional holders for the 20 companies (a subset of the data), we analyze degree centrality.

### Degree
We begin by capturing the degree of each node in a dataframe along with the name of the node and type.  This shows us the number of connections a node has. 

In [None]:
# creating df with node names and type
# (the variable is not named `type`, which would shadow the Python built-in)
node_attrs = dict(g.nodes)
temp_dict = {}
for x in node_attrs:
    temp_dict[x] = node_attrs[x]['type']

type_df = pd.DataFrame.from_dict(temp_dict, orient='index', columns=['Type'])

# creating df with node names and degree
deg = dict(g.degree())
deg_df = pd.DataFrame.from_dict(deg, orient='index', columns=['Degree'])

# joining df
df = type_df.join(deg_df)

# describing degree values
df['Degree'].describe()
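
The raw degree counts above can be turned into degree centrality by dividing by the largest possible degree, n - 1, where n is the number of nodes. The sketch below shows this on a toy graph with invented node names and compares the manual calculation with `nx.degree_centrality`.

```python
import networkx as nx
import pandas as pd

g_toy = nx.Graph([("Fund X", "AAA"), ("Fund X", "BBB"), ("Fund Y", "BBB")])

# degree centrality = degree / (n - 1)
n = g_toy.number_of_nodes()
manual = {node: deg / (n - 1) for node, deg in g_toy.degree()}
builtin = nx.degree_centrality(g_toy)

# raw and normalized values side by side, indexed by node name
toy_df = pd.DataFrame({
    "Degree": dict(g_toy.degree()),
    "DegreeCentrality": builtin,
})
print(toy_df)
```

Normalizing does not change the ranking of nodes within one graph, but it makes values comparable across networks of different sizes.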

Once the degree is captured in our dataframe, we can visualize the distribution.

In [None]:
# plot histogram
fig = px.histogram(df, x = 'Degree', color='Type', title='Degree Values by Type')
fig.show()
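
A numeric summary per group can complement the histogram. As a small sketch on made-up values (the dataframe `toy_df` and its numbers are invented, only the column names `Type` and `Degree` match the notebook), a `groupby` gives the count, mean, median, and maximum degree for each node type.

```python
import pandas as pd

# invented degrees for a handful of nodes
toy_df = pd.DataFrame(
    {
        "Type": ["stock", "stock", "holder", "holder", "holder"],
        "Degree": [10, 10, 1, 2, 6],
    },
    index=["AAA", "BBB", "Fund X", "Fund Y", "Fund Z"],
)

# one row per node type, one column per statistic
summary = toy_df.groupby("Type")["Degree"].agg(["count", "mean", "median", "max"])
print(summary)
```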

### Second look at Network
Now that we have explored degree, we include it in our social network graph.

#### Node Size
The degree is the number of relationships a particular node has.  We convert the degrees into a dictionary and use a list comprehension to multiply each value by a constant, so that the differences in node size are visible in the plot.

#### Edge Weight
The edges represent the amount of money each holder has invested in each company.  These values are far too large to use directly, so we divide them by a large constant (50 billion here) to obtain a drawable edge width.

In [None]:
# view network graph with node size scaled by degree and edge width scaled by holding value
plt.figure(figsize=(50,40))
nx.draw(g, with_labels=True,
        node_color=colors,
        node_size=[v * 500 for v in dict(g.degree()).values()],
        width=[v[2]['Value'] / 50_000_000_000 for v in edgelist])
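
A fixed divisor works only while the holding values stay in the same range. One alternative, sketched here on a toy graph with invented values, is to scale each width relative to the largest holding so the thickest edge always has a chosen maximum width. The names `max_width` and `g_toy` are assumptions for illustration.

```python
import networkx as nx

g_toy = nx.Graph()
g_toy.add_edge("Fund X", "AAA", Value=40_000_000_000)
g_toy.add_edge("Fund X", "BBB", Value=10_000_000_000)
g_toy.add_edge("Fund Y", "BBB", Value=20_000_000_000)

# edge values in the same order that nx.draw iterates over edges
values = [data["Value"] for _, _, data in g_toy.edges(data=True)]

# scale so the largest holding is drawn with width max_width
max_width = 8.0
max_value = max(values)
widths = [max_width * v / max_value for v in values]

# node sizes proportional to degree, as in the cell above
node_sizes = [500 * deg for _, deg in g_toy.degree()]
print(widths, node_sizes)
```

The resulting lists can be passed to `nx.draw` as `width=widths` and `node_size=node_sizes`.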

### Eigenvector Centrality

Eigenvector centrality is a recursive version of degree centrality: the score of each node is recomputed as a weighted sum of the centralities of all nodes in that node's neighborhood.  It measures influence, that is, how well connected a node is to other well-connected nodes.

After computing the eigenvector values, we view their summary statistics.

In [None]:
# creating df for eigen values
eigen = dict(nx.eigenvector_centrality(g))
eigen_df = pd.DataFrame.from_dict(eigen, orient='index', columns=['Eigen'])

# joining to df
df = df.join(eigen_df)

# describing Eigen values
df['Eigen'].describe()
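
To make the recursive definition concrete, the sketch below computes eigenvector centrality by power iteration on a toy graph and compares the result with `nx.eigenvector_centrality`. It is an illustration, not a description of NetworkX internals. Each step adds a node's own score to the sum of its neighbors' scores (iterating with A + I rather than A), a choice made here because plain iteration with A can oscillate on bipartite graphs such as a holder-stock network. The function name and toy node names are hypothetical.

```python
import math

import networkx as nx


def power_iteration_centrality(graph, max_iter=1000, tol=1e-10):
    """Eigenvector centrality via power iteration with A + I."""
    scores = {node: 1.0 for node in graph}
    for _ in range(max_iter):
        # start from the node's own score (the "+ I" term)
        new_scores = dict(scores)
        # add each node's score to all of its neighbors
        for node in graph:
            for neighbor in graph[node]:
                new_scores[neighbor] += scores[node]
        # rescale to unit Euclidean length so values stay bounded
        norm = math.sqrt(sum(v * v for v in new_scores.values()))
        new_scores = {k: v / norm for k, v in new_scores.items()}
        change = sum(abs(new_scores[k] - scores[k]) for k in graph)
        scores = new_scores
        if change < tol:
            break
    return scores


g_toy = nx.Graph([("Fund X", "AAA"), ("Fund X", "BBB"), ("Fund Y", "BBB")])
sketch = power_iteration_centrality(g_toy)
reference = nx.eigenvector_centrality(g_toy, max_iter=1000, tol=1e-10)
print(sketch)
print(reference)
```

In this toy path graph, the two interior nodes (`Fund X` and `BBB`) score higher than the two end nodes, because they are connected to each other as well as to an end node.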

Again, we view the values by type of node.  Unlike the degree

In [None]:
# plot histogram
fig = px.histogram(df, x='Eigen', color='Type', title = 'Eigen Values by Type')
fig.show()

Now we look at the degree and eigenvector values of the ten nodes with the highest eigenvector values.

In [None]:
# looking at the df 
df.sort_values('Eigen', ascending=False).head(10)

### T-Test

A t-test is used to determine whether the means of two groups, here holder and stock nodes, differ significantly.

In [None]:
# separate df by holder and stock
holder = df[df['Type']=='holder']
stock = df[df['Type']=='stock']

# perform T-test on Eigen values of each holder and stock
stats.ttest_ind(stock['Eigen'], holder['Eigen'])

In [None]:
# perform T-test on Degree values of each holder and stock
stats.ttest_ind(stock['Degree'], holder['Degree'])
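
By default, `stats.ttest_ind` assumes both groups have equal variance. The stock and holder groups may differ in both size and spread, so Welch's t-test (`equal_var=False`) is a reasonable variant to compare against. The sketch below uses invented degree values, not the notebook's data, and only shows how the two calls differ.

```python
from scipy import stats

# invented degrees: few stocks with similar degree, more holders with varied degree
stock_degree = [10, 10, 9, 10, 10, 8]
holder_degree = [1, 1, 2, 3, 5, 8, 13, 21, 1, 1, 2, 2]

# Student's t-test: assumes equal variances in both groups
student = stats.ttest_ind(stock_degree, holder_degree)

# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(stock_degree, holder_degree, equal_var=False)

print("student:", student.statistic, student.pvalue)
print("welch:  ", welch.statistic, welch.pvalue)
```

When the two versions disagree noticeably, the equal-variance assumption is doing a lot of work and the Welch result is usually the safer one to report.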

## Conclusion

By looking at the results of the t-tests on the degree and eigenvector values, we see that there is a significant difference in the means across the two types of nodes, institutional holder and stock.

In our case, institutional holders have higher importance in the social network than the stocks.

Vanguard Group, Inc, Blackrock Inc., State Street Corporation, and Geode Capital Management, LLC are the most connected nodes, with the highest degree and eigenvector values.

Moreover, we could use the social network graph to design a recommender system that predicts institutional holders' investments in the same or different companies.