## Centrality measures

Centrality measures can be used to predict (positive or negative) outcomes for a node.
Your task in this week’s assignment is to identify an interesting set of network data that is available on the web
(either through web scraping or web APIs) that could be used for analyzing and comparing centrality measures across nodes.

As an additional constraint, there should be at least one categorical variable available for each node
(such as “Male” or “Female”; “Republican”, “Democrat,” or “Undecided”, etc.).

In addition to identifying your data source, you should create a high level plan that describes how you would load the data for analysis,
and describe a hypothetical outcome that could be predicted from comparing degree centrality across categorical groups.
For this week’s assignment, you are not required to actually load or analyze the data. Please see also Project 1 below.
You may work in a small group on the assignment. You should post your document to GitHub by end of day on Sunday.

### Required Packages

In [2]:
import yfinance as yf
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import plotly.express as px
from scipy import stats
from networkx import algorithms

## Data (subset)
**Dow Jones Industrial Average Equity Index**

We begin by scraping the Dow Jones company data, which we then query through the Yahoo Finance API.
We proceed to create a ticker object that allows access to the data of each Dow Jones company.
To get a sense of the data, we begin our analysis with only Apple (AAPL) and Microsoft (MSFT).

In [3]:
# scraping company tickers and info
tickers_df = pd.read_html('https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average')[1]
tickers_df.head(5)

| | Company | Exchange | Symbol | Industry | Date added | Notes | Index weighting |
|---|---|---|---|---|---|---|---|
| 0 | 3M | NYSE | MMM | Conglomerate | 1976-08-09 | As Minnesota Mining and Manufacturing | 3.02% |
| 1 | American Express | NYSE | AXP | Financial services | 1982-08-30 | | 3.60% |
| 2 | Amgen | NASDAQ | AMGN | Biopharmaceutical | 2020-08-31 | | 4.48% |
| 3 | Apple | NASDAQ | AAPL | Information technology | 2015-03-19 | | 3.25% |
| 4 | Boeing | NYSE | BA | Aerospace and defense | 1987-03-12 | | 3.96% |


In [4]:
# converting index weighting to numeric
tickers_df['Index weighting'] = tickers_df['Index weighting'].str.replace('%','')
tickers_df['Index weighting'] = pd.to_numeric(tickers_df['Index weighting'])

# storing the symbols of the top 20 index weighting
top20 = tickers_df.nlargest(20,'Index weighting')
top20_list = list(top20['Symbol'])
top20_list

['UNH',
 'GS',
 'HD',
 'MSFT',
 'MCD',
 'AMGN',
 'V',
 'CRM',
 'BA',
 'CAT',
 'HON',
 'AXP',
 'AAPL',
 'TRV',
 'JNJ',
 'MMM',
 'PG',
 'JPM',
 'NKE',
 'DIS']

In [5]:
# create dictionary with the stored symbols as keys
top20_dict = dict.fromkeys(top20_list)

# loop through keys to create ticker class and store institutional holders df
for comp in top20_dict:
    ticker = yf.Ticker(comp)
    inst_hol = ticker.institutional_holders
    inst_hol['comp'] = comp
    top20_dict[comp] = inst_hol
    
# concatenate 20 dfs into 1    
institutional_holders = pd.concat(top20_dict.values(), ignore_index=True)

Once a ticker object has been created, we can use it to access data for a stock such as AAPL, including its institutional holders.

In [None]:
institutional_holders

We add a column to this data frame that contains the company's ticker symbol (for example, AAPL for Apple). This column maps every holder row back to Apple and to the rest of the
companies we will be analyzing shortly. The assignment prompt also asks us to include categorical data in our analysis.

## Data Preparation


In [None]:
# creating network from df
# edge_attr=True keeps the remaining df columns as edge attributes
g = nx.from_pandas_edgelist(institutional_holders, 'Holder', 'comp', edge_attr=True)
edgelist = nx.to_edgelist(g)


We pass `edge_attr=True` so that the values corresponding to each edge are stored on the edge and can be inspected later.
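
As a minimal sketch of what `edge_attr=True` does, the toy example below builds a graph from a small, made-up data frame. The holder names and dollar values are hypothetical; only the column names (`Holder`, `comp`, `Value`) mirror the ones used in this notebook.

```python
import pandas as pd
import networkx as nx

# Hypothetical toy holdings; only the column names mirror the real data frame.
toy_holdings = pd.DataFrame({
    "Holder": ["Fund A", "Fund A", "Fund B"],
    "comp": ["AAPL", "MSFT", "AAPL"],
    "Value": [100.0, 250.0, 75.0],
})

# edge_attr=True copies every remaining column onto the edge as an attribute.
toy_g = nx.from_pandas_edgelist(toy_holdings, "Holder", "comp", edge_attr=True)

# Each edge now carries its own 'Value'.
fund_a_msft_value = toy_g["Fund A"]["MSFT"]["Value"]
edge_values = {(u, v): d["Value"] for u, v, d in toy_g.edges(data=True)}
```

Without `edge_attr=True`, the edges would still be created, but the `Value` lookup would raise a `KeyError`.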

### Change Node color
To begin our analysis, we color the nodes to distinguish the companies (blue) from the institutional holders (orange).

In [None]:
# assigning color and type to node
colors = []
for node in g:
    if node in institutional_holders['comp'].values:
        colors.append("blue")
        g.nodes[node]['type'] = 'stock'
    else:
        colors.append("orange")
        g.nodes[node]['type'] = 'holder'

## Exploration

In [None]:
# storing basic network info in dictionary 
network_info = {}

network_info['connected'] = nx.is_connected(g)
network_info['diameter'] = nx.diameter(g)
network_info['num_of_nodes'] = g.number_of_nodes()
network_info['num_of_edges'] = g.number_of_edges()
network_info['avg_shortest_path'] = algorithms.average_shortest_path_length(g)

# output as table
pd.DataFrame.from_dict(network_info, orient='index', columns=['Values'])

### Graph Network

In [None]:
# view basic graph
plt.figure(figsize=(50,40))
nx.draw(g, with_labels=True,
        node_color=colors)

## Strategy Degree Centrality
Once we have loaded a subset of the data (only two companies), we create a strategy to analyze degree centrality.

## Centrality Measures
We start by computing the degree distribution.
The number of neighbors that a node has is called its "degree", and the degree distribution can be computed across the entire graph.
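
The definition can be illustrated without any graph library. The sketch below counts degrees from a made-up edge list (the holder and company pairs are hypothetical) and then tallies how many nodes share each degree value. It assumes the edge list contains no duplicate pairs.

```python
from collections import Counter

# Hypothetical holder-company pairs, standing in for the real edge list.
toy_edges = [
    ("Fund A", "AAPL"),
    ("Fund A", "MSFT"),
    ("Fund B", "AAPL"),
    ("Fund C", "AAPL"),
]

# Degree of a node = number of edges that touch it.
degree = Counter()
for holder, company in toy_edges:
    degree[holder] += 1
    degree[company] += 1

# Degree distribution = how many nodes have each degree value.
degree_distribution = Counter(degree.values())
```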

### Degree

In [None]:
# creating df with node names and type
# (avoids naming a variable `type`, which would shadow the Python builtin)
node_types = nx.get_node_attributes(g, 'type')

type_df = pd.DataFrame.from_dict(node_types, orient='index', columns=['Type'])

# creating df with node names and degree
deg = dict(g.degree())
deg_df = pd.DataFrame.from_dict(deg, orient='index', columns=['Degree'])

# joining df
df = type_df.join(deg_df)

# describing degree values
df['Degree'].describe()


In [None]:
# plot histogram
fig = px.histogram(df, x = 'Degree', color='Type', title='Degree Values by Type')
fig.show()

#### Edges Weight
Each edge represents the amount of money a holder has invested in a company. These dollar amounts are far too large to use directly as line widths, so we divide them by a large constant to obtain the edge weight used in the plot.

#### Node Size
The degree is the number of relationships (edges) of a particular node.
We first convert the degree view into a dictionary.
We then use a list comprehension to multiply each value in the dictionary by a constant, so that the differences in node size are visible in the plot.
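
The two scaling steps described in this section can be sketched on toy numbers. The degrees and dollar values below are hypothetical; the multiplier of 500 and the divisor of 50 billion are the constants used in the plotting cell.

```python
# Hypothetical degrees and holding values (in dollars) for a toy graph.
toy_degree = {"Fund A": 2, "AAPL": 3, "MSFT": 1}
toy_values = [100e9, 25e9, 50e9]

NODE_SCALE = 500                # multiplier used in the plotting cell
EDGE_DIVISOR = 50_000_000_000   # divisor used in the plotting cell

# Node size grows linearly with degree.
node_sizes = [deg * NODE_SCALE for deg in toy_degree.values()]

# Edge width is the dollar value scaled down to a drawable line width.
edge_widths = [value / EDGE_DIVISOR for value in toy_values]
```

Both constants are display choices rather than properties of the data, so they may need adjusting for a different figure size.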


In [None]:
# view network graph with node size scaled by degree and edge width scaled by holding value
plt.figure(figsize=(50,40))
nx.draw(g, with_labels=True,
        node_color=colors,
        node_size=[v * 500 for v in dict(g.degree()).values()],
        width=[v[2]['Value'] / 50_000_000_000 for v in edgelist])

### Eigenvector Centrality

In [None]:
# creating df for eigenvector centrality values
eigen = dict(nx.eigenvector_centrality(g))
eigen_df = pd.DataFrame.from_dict(eigen, orient='index', columns=['Eigen'])

# joining to df
df = df.join(eigen_df)

# describing Eigen values
df['Eigen'].describe()

In [None]:
# plot histogram
fig = px.histogram(df, x='Eigen', color='Type', title = 'Eigen Values by Type')
fig.show()

In [None]:
# looking at the df 
df.sort_values('Eigen', ascending=False).head()

### T-Test

In [None]:
# separate df by holder and stock
holder = df[df['Type']=='holder']
stock = df[df['Type']=='stock']

# perform T-test on Eigen values of each holder and stock
stats.ttest_ind(holder['Eigen'], stock['Eigen'])
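
By default, `stats.ttest_ind` runs the standard independent two-sample t-test, which assumes that both groups have equal population variance. As a rough sketch of the statistic it reports, the helper below (a hypothetical function, not part of SciPy) computes the pooled t statistic by hand on made-up scores.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Student's two-sample t statistic, assuming equal population variances."""
    n_a, n_b = len(a), len(b)
    # Pooled sample variance, weighted by each group's degrees of freedom.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / standard_error

# Hypothetical centrality scores for two small groups.
t_stat = pooled_t_statistic([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

If the holder and stock groups turn out to have clearly different variances, passing `equal_var=False` to `stats.ttest_ind` switches to Welch's t-test, which drops the equal-variance assumption.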

### Degrees

In [None]:
# degree of every node, in node order
degrees = [d for _, d in g.degree()]

plt.figure()
plt.hist(degrees)
plt.show()

### Degree centrality

In [None]:
deg_cent = nx.degree_centrality(g)
deg_cent
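
For reference, `nx.degree_centrality` normalizes each node's degree by the maximum possible degree in a simple graph, n - 1, where n is the number of nodes. The sketch below checks this on a toy star graph, which is an illustrative assumption and not the holdings network.

```python
import networkx as nx

# Toy star graph: node 0 is connected to nodes 1, 2, and 3.
toy_g = nx.star_graph(3)
n = toy_g.number_of_nodes()

# Degree centrality as computed by networkx.
toy_cent = nx.degree_centrality(toy_g)

# The same values computed by hand: degree / (n - 1).
by_hand = {node: deg / (n - 1) for node, deg in toy_g.degree()}
```

Because the two measures differ only by the constant factor 1 / (n - 1), a scatter plot of degree against degree centrality for a single graph falls on a straight line.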

In [None]:
plt.figure()
plt.hist(list(deg_cent.values()))
plt.show()

### Scatter Plot
We plot the degree centrality distribution against the degree distribution.

In [None]:
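plt.figure()
# pair each node's degree with its degree centrality, in the same node order
plt.scatter([g.degree(n) for n in deg_cent], list(deg_cent.values()))
plt.show()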

### Path Finding
Calling `len` on the edges and nodes shows that there are 20 edges between 20 nodes in this subset of the data.