# Data 620 - Project 1

## Team No. 6

- Yohannes Deboch
- Sherranette Tinapunan

## Video Presentation

https://screencast-o-matic.com/watch/cqnth40IT5

## Requirements

For your first project, you are asked to:

- Identify and load a network dataset that has some categorical information available for each node.
- For each of the nodes in the dataset, calculate degree centrality and eigenvector centrality.
- Compare your centrality measures across your categorical groups.

## Data set

Source: https://www.kaggle.com/devisangeetha/divvy-bike-share-eda-network-analysis/data

> Divvy is Chicagoland’s bike share system, with 6,000 bikes available at 570+ stations across Chicago and Evanston. Divvy provides residents and visitors with a convenient, fun and affordable transportation option for getting around and exploring Chicago

- This project uses the columns 'gender', 'from_station_id', 'from_station_name', 'to_station_id', and 'to_station_name'.
- This is a directed network. 
- The nodes represent stations. 
- Each edge runs from a bike pickup station to a bike drop-off station.



## Load libraries



In [1]:
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
from scipy.stats import ttest_1samp
import numpy as np

##  Read the data

In [2]:
df = pd.read_csv("data.csv")
df.head()

Unnamed: 0,gender,from_station_id,from_station_name,to_station_id,to_station_name
0,Male,131,Lincoln Ave & Belmont Ave,303,Broadway & Cornelia Ave
1,Male,282,Halsted St & Maxwell St,22,May St & Taylor St
2,Male,327,Sheffield Ave & Webster Ave,225,Halsted St & Dickens Ave
3,Female,134,Peoria St & Jackson Blvd,194,State St & Wacker Dr
4,Female,320,Loomis St & Lexington St,134,Peoria St & Jackson Blvd


Create station_id to station_name mapping. This will be used later.

In [3]:
#Create station_id to station_name mapping
from_station_mapping = df[['from_station_id', 'from_station_name']]
to_station_mapping =  df[['to_station_id', 'to_station_name']]
from_station_mapping.columns = ['station_id', 'station_name']
to_station_mapping.columns = ['station_id', 'station_name']

#station_id to station_name mapping
station_name_id = pd.concat([from_station_mapping, to_station_mapping],ignore_index=True).drop_duplicates().reset_index(drop=True)

## Divide data set into male and female groups


In [4]:
df_male = df[df['gender']=='Male'].copy(deep=True)
df_female = df[df['gender']=='Female'].copy(deep=True)

In [5]:
df_male.head()

Unnamed: 0,gender,from_station_id,from_station_name,to_station_id,to_station_name
0,Male,131,Lincoln Ave & Belmont Ave,303,Broadway & Cornelia Ave
1,Male,282,Halsted St & Maxwell St,22,May St & Taylor St
2,Male,327,Sheffield Ave & Webster Ave,225,Halsted St & Dickens Ave
5,Male,332,Halsted St & Diversey Pkwy,319,Greenview Ave & Diversey Pkwy
6,Male,174,Canal St & Madison St,44,State St & Randolph St


In [6]:
df_female.head()

Unnamed: 0,gender,from_station_id,from_station_name,to_station_id,to_station_name
3,Female,134,Peoria St & Jackson Blvd,194,State St & Wacker Dr
4,Female,320,Loomis St & Lexington St,134,Peoria St & Jackson Blvd
11,Female,267,Lake Park Ave & 47th St,322,Kimbark Ave & 53rd St
13,Female,210,Ashland Ave & Division St,350,Ashland Ave & Chicago Ave
14,Female,332,Halsted St & Diversey Pkwy,350,Ashland Ave & Chicago Ave


## Assign weight to each link for male and female groups

- The weight describes how much traffic flows from a source (pickup) station to a target (drop-off) station.
- The number of trips is counted for each distinct 'source -> target' pair.
- The maximum count is determined for each group.
- The weight of each link is its count divided by the group's maximum count.
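
The counting and scaling steps above can be sketched in a vectorized form. This is a minimal illustration on made-up trips, not the notebook's own cells: the `trips` frame and its values are hypothetical, and only the column names `from_station_id` and `to_station_id` come from the data set.

```python
import pandas as pd

# Hypothetical trips, using the same id column names as the data set
trips = pd.DataFrame({
    "from_station_id": [195, 195, 195, 283, 283, 174],
    "to_station_id":   [174, 174, 174, 174, 174, 43],
})

# Count trips per distinct (source, target) pair
edge_counts = (
    trips.groupby(["from_station_id", "to_station_id"])
         .size()
         .reset_index(name="count")
)

# Scale by the busiest pair so that every weight falls in (0, 1]
edge_counts["weight"] = edge_counts["count"] / edge_counts["count"].max()
print(edge_counts)
```

Grouping on the two id columns directly avoids building a string key for every row, which may matter on a data set with many trips.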

In [7]:
#Generate a from-to id we can use to group by and join later 
df_male['from_to'] = df_male.apply(lambda row: str(row['from_station_id']) + '->' + str(row['to_station_id']), axis=1)

In [8]:
df_female['from_to'] = df_female.apply(lambda row: str(row['from_station_id']) + '->' + str(row['to_station_id']), axis=1)

In [9]:
#frequency of from-to
from_to_count_male = df_male['from_to'].value_counts()
from_to_count_female = df_female['from_to'].value_counts()

#Call reset_index() to convert series to dataframe
from_to_count_male = from_to_count_male.reset_index()
from_to_count_female = from_to_count_female.reset_index()

#rename columns
from_to_count_male.columns = ['from_to', 'count']
from_to_count_female.columns = ['from_to', 'count']

### Determine maximum 'source -> target' count for each male and female group

- Max 'source -> target' count for male is 1199
- Max 'source -> target' count for female is 271

In [10]:
max_count_male = from_to_count_male['count'].max()
max_count_female =from_to_count_female['count'].max()
max_count = max(max_count_male, max_count_female)

print(max_count_male)
print(max_count_female)

1199
271


### Calculate weight for each male and female group

- The weight of each link in each group is the count of the respective 'source -> target' pair divided by that group's maximum count (max_count_male or max_count_female).

In [11]:
#divide count by the maximum count
from_to_count_male['weight'] = from_to_count_male.apply(lambda row: row['count']/max_count_male, axis=1)
from_to_count_female['weight'] = from_to_count_female.apply(lambda row: row['count']/max_count_female, axis=1)

### Male 'source -> target' weights

In [12]:
#preview for male
from_to_count_male.head()

Unnamed: 0,from_to,count,weight
0,195->174,1199,1.0
1,283->174,1187,0.989992
2,195->91,1173,0.978315
3,174->43,1077,0.898249
4,195->192,1074,0.895746


### Female 'source -> target' weights

In [13]:
#preview for female
from_to_count_female.head()

Unnamed: 0,from_to,count,weight
0,177->35,271,1.0
1,35->177,267,0.98524
2,290->123,239,0.881919
3,284->255,238,0.878229
4,85->177,211,0.778598


In [14]:
#join from-to data to df_male and df_female
df_male2 = df_male.join(from_to_count_male.set_index('from_to'), on='from_to').copy(deep=True)
df_female2 = df_female.join(from_to_count_female.set_index('from_to'), on='from_to').copy(deep=True)

In [15]:
# drop duplicates to generate top 10 by weight
df_male3 = df_male2.drop_duplicates()
df_female3 = df_female2.drop_duplicates()

### Top 10 'source -> target' station pairs by weight

In [16]:
df_male3.sort_values(by=['weight'], ascending=False).head(n=10)  

Unnamed: 0,gender,from_station_id,from_station_name,to_station_id,to_station_name,from_to,count,weight
376,Male,195,Columbus Dr & Randolph St,174,Canal St & Madison St,195->174,1199,1.0
907,Male,283,LaSalle St & Jackson Blvd,174,Canal St & Madison St,283->174,1187,0.989992
99,Male,195,Columbus Dr & Randolph St,91,Clinton St & Washington Blvd,195->91,1173,0.978315
492,Male,174,Canal St & Madison St,43,Michigan Ave & Washington St,174->43,1077,0.898249
1372,Male,49,Dearborn St & Monroe St,174,Canal St & Madison St,49->174,1074,0.895746
1836,Male,195,Columbus Dr & Randolph St,192,Canal St & Adams St,195->192,1074,0.895746
1943,Male,52,Michigan Ave & Lake St,91,Clinton St & Washington Blvd,52->91,966,0.805671
1171,Male,43,Michigan Ave & Washington St,174,Canal St & Madison St,43->174,942,0.785655
1129,Male,91,Clinton St & Washington Blvd,43,Michigan Ave & Washington St,91->43,879,0.733111
888,Male,91,Clinton St & Washington Blvd,47,State St & Kinzie St,91->47,807,0.673061


In [17]:
df_female3.sort_values(by=['weight'], ascending=False).head(n=10) 

Unnamed: 0,gender,from_station_id,from_station_name,to_station_id,to_station_name,from_to,count,weight
7623,Female,177,Theater on the Lake,35,Streeter Dr & Illinois St,177->35,271,1.0
388,Female,35,Streeter Dr & Illinois St,177,Theater on the Lake,35->177,267,0.98524
4749,Female,290,Kedzie Ave & Palmer Ct,123,California Ave & Milwaukee Ave,290->123,239,0.881919
153,Female,284,Michigan Ave & Jackson Blvd,255,Indiana Ave & Roosevelt Rd,284->255,238,0.878229
7547,Female,85,Michigan Ave & Oak St,177,Theater on the Lake,85->177,211,0.778598
477,Female,255,Indiana Ave & Roosevelt Rd,90,Millennium Park,255->90,205,0.756458
7226,Female,177,Theater on the Lake,85,Michigan Ave & Oak St,177->85,203,0.749077
2362,Female,110,State St & Erie St,192,Canal St & Adams St,110->192,199,0.734317
6990,Female,76,Lake Shore Dr & Monroe St,35,Streeter Dr & Illinois St,76->35,197,0.726937
1576,Female,90,Millennium Park,255,Indiana Ave & Roosevelt Rd,90->255,194,0.715867


## Create graph object

- This is a directed graph.

In [18]:
#Create directed graph object
G_male = nx.from_pandas_edgelist(df_male2, source='from_station_id', target='to_station_id', edge_attr=['weight'], create_using=nx.DiGraph())
G_female = nx.from_pandas_edgelist(df_female2, source='from_station_id', target='to_station_id', edge_attr=['weight'], create_using=nx.DiGraph())

#w=G_female.edges(data=True)
#print(w)

### Male network

In [19]:
# nx.info() was removed in NetworkX 3.0; on newer versions, print(G_male) gives a short summary
print(nx.info(G_male))

Name: 
Type: DiGraph
Number of nodes: 300
Number of edges: 43029
Average in degree: 143.4300
Average out degree: 143.4300


### Female network

In [20]:
print(nx.info(G_female))

Name: 
Type: DiGraph
Number of nodes: 300
Number of edges: 33170
Average in degree: 110.5667
Average out degree: 110.5667


## Calculate in-degree centrality

> The in-degree centrality for a node v is the fraction of nodes its incoming edges are connected to.
The degree centrality values are normalized by dividing by the maximum possible degree in a simple graph n-1 where n is the number of nodes in G.

Source: https://networkx.github.io/documentation/networkx-1.10/reference/generated/networkx.algorithms.centrality.in_degree_centrality.html
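
To make the quoted definition concrete, the following sketch computes in-degree centrality by hand on a small, hypothetical graph (the station labels A to D and the edges are invented). It counts the distinct incoming edges of each node and divides by n - 1. This measure counts connections only, so the link weights computed earlier play no role in it.

```python
# Hypothetical directed edges (pickup station -> drop-off station)
edges = [("A", "B"), ("C", "B"), ("D", "B"), ("B", "A"), ("A", "C")]

nodes = sorted({station for edge in edges for station in edge})
n = len(nodes)

# Count distinct incoming edges per node; set() drops repeated pairs
in_degree = {node: 0 for node in nodes}
for source, target in set(edges):
    in_degree[target] += 1

# Normalize by the maximum possible in-degree in a simple graph, n - 1
in_degree_centrality = {node: deg / (n - 1) for node, deg in in_degree.items()}
print(in_degree_centrality)
```

Here station B receives bikes from all three other stations, so its centrality is 1.0, while D receives none and scores 0.0.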


In [21]:
#calculate in-degree centrality

in_deg_centrality_male = pd.DataFrame.from_dict(nx.in_degree_centrality(G_male), orient='index').reset_index()
in_deg_centrality_female = pd.DataFrame.from_dict(nx.in_degree_centrality(G_female), orient='index').reset_index()

#rename columns
in_deg_centrality_male.columns = ['station', 'in_degree_centrality']
in_deg_centrality_female.columns = ['station', 'in_degree_centrality']

In [22]:
#join
in_deg_centrality_male2 = in_deg_centrality_male.join(station_name_id.set_index('station_id'), on='station')

#join
in_deg_centrality_female2 = in_deg_centrality_female.join(station_name_id.set_index('station_id'), on='station')

### Male: Top 10 stations with most incoming connections for bike drop offs

In [23]:
#sort
in_deg_centrality_male = in_deg_centrality_male2.sort_values(by=['in_degree_centrality'], ascending=False)
in_deg_centrality_male.head(n=10)

Unnamed: 0,station,in_degree_centrality,station_name
81,69,0.765886,Damen Ave & Pierce Ave
167,81,0.762542,Daley Center Plaza
23,181,0.725753,LaSalle St & Illinois St
180,60,0.722408,Dayton St & North Ave
25,289,0.722408,Wells St & Concord Ln
15,91,0.715719,Clinton St & Washington Blvd
147,176,0.715719,Clark St & Elm St
99,331,0.712375,Halsted St & Blackhawk St
186,56,0.712375,Desplaines St & Kinzie St
109,37,0.70903,Dearborn St & Adams St


### Female: Top 10 stations with most incoming connections for bike drop offs

In [24]:
#sort
in_deg_centrality_female = in_deg_centrality_female2.sort_values(by=['in_degree_centrality'], ascending=False)
in_deg_centrality_female.head(n=10)

Unnamed: 0,station,in_degree_centrality,station_name
16,289,0.648829,Wells St & Concord Ln
115,268,0.638796,Lake Shore Dr & North Blvd
163,176,0.635452,Clark St & Elm St
45,141,0.625418,Clark St & Lincoln Ave
40,56,0.622074,Desplaines St & Kinzie St
29,94,0.618729,Clark St & Armitage Ave
107,61,0.615385,Wood St & Milwaukee Ave
35,81,0.615385,Daley Center Plaza
9,69,0.615385,Damen Ave & Pierce Ave
77,60,0.605351,Dayton St & North Ave


## Calculate eigenvector centrality for each station

> Eigenvector centrality computes the centrality for a node based on the centrality of its neighbors. For directed graphs this is “left” eigenvector centrality which corresponds to the in-edges in the graph.

Source: 
https://networkx.github.io/documentation/latest/reference/algorithms/generated/networkx.algorithms.centrality.eigenvector_centrality.html
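
The idea behind the quoted definition can be sketched with a plain power iteration. This is an illustration under stated assumptions, not NetworkX's exact implementation: the three stations, the edges, and the weights are hypothetical, and the fixed iteration count is chosen only because this small graph converges quickly. Each node repeatedly collects the weighted scores of the nodes that point at it, which is the "left" (in-edge) variant described above.

```python
import math

# Hypothetical weighted directed edges: (source, target, weight)
edges = [
    ("A", "B", 1.0),
    ("C", "B", 1.0),
    ("B", "A", 0.5),
    ("B", "C", 0.5),
    ("A", "C", 0.2),
]

nodes = sorted({s for s, _, _ in edges} | {t for _, t, _ in edges})
scores = {node: 1.0 for node in nodes}

for _ in range(200):
    # Each node sums the weighted scores of its in-neighbors
    updated = {node: 0.0 for node in nodes}
    for source, target, weight in edges:
        updated[target] += weight * scores[source]
    # Rescale to unit Euclidean length so the values stay bounded
    norm = math.sqrt(sum(value ** 2 for value in updated.values()))
    scores = {node: value / norm for node, value in updated.items()}

ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy graph B ranks first because both other stations send their heaviest links to it, and C ranks above A because it receives a link from A in addition to the one from B.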

In [25]:
# Eigenvector centrality
eigenvector_male = pd.DataFrame.from_dict(nx.eigenvector_centrality(G_male, weight='weight'), orient='index').reset_index()
eigenvector_female = pd.DataFrame.from_dict(nx.eigenvector_centrality(G_female, weight='weight'), orient='index').reset_index()

In [26]:
#Rename columns
eigenvector_male.columns = ['station', 'eigenvector_centrality']
eigenvector_female.columns = ['station', 'eigenvector_centrality']

In [27]:
#join
eigenvector_male2 = eigenvector_male.join(station_name_id.set_index('station_id'), on='station')

#join
eigenvector_female2 = eigenvector_female.join(station_name_id.set_index('station_id'), on='station')

### Top 10 stations by eigenvector centrality 

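These are stations that receive drop-offs from stations that are themselves highly central. Because edges point from the pickup station to the drop-off station, an incoming connection is a bike drop-off, and each connection counts in proportion to its link weight.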

In [28]:
#Top 10 stations based on eigenvector centrality
eigenvector_centrality_male = eigenvector_male2.sort_values(by=['eigenvector_centrality'], ascending=False)
eigenvector_centrality_female = eigenvector_female2.sort_values(by=['eigenvector_centrality'], ascending=False)

In [29]:
eigenvector_centrality_male.head(n=10)

Unnamed: 0,station,eigenvector_centrality,station_name
15,91,0.325697,Clinton St & Washington Blvd
8,174,0.322599,Canal St & Madison St
17,192,0.269942,Canal St & Adams St
47,43,0.207311,Michigan Ave & Washington St
85,283,0.198029,LaSalle St & Jackson Blvd
35,52,0.176277,Michigan Ave & Lake St
108,195,0.169213,Columbus Dr & Randolph St
110,47,0.164574,State St & Kinzie St
16,100,0.164552,Orleans St & Merchandise Mart Plaza
19,49,0.155411,Dearborn St & Monroe St


In [30]:
eigenvector_centrality_female.head(n=10)

Unnamed: 0,station,eigenvector_centrality,station_name
46,177,0.234861,Theater on the Lake
123,35,0.204619,Streeter Dr & Illinois St
163,176,0.183319,Clark St & Elm St
212,85,0.182209,Michigan Ave & Oak St
16,289,0.180938,Wells St & Concord Ln
115,268,0.180766,Lake Shore Dr & North Blvd
155,110,0.146966,State St & Erie St
275,76,0.136258,Lake Shore Dr & Monroe St
45,141,0.131976,Clark St & Lincoln Ave
11,140,0.131285,Dearborn Pkwy & Delaware Pl


## Degree centrality comparison

For each station, the absolute difference between the male and female in-degree centrality is computed.
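
As a compact illustration of this per-station comparison, the sketch below lines up two groups with `merge` and takes the absolute difference. The frames and their values are hypothetical, and the suffixed column names are chosen for this example only.

```python
import pandas as pd

# Hypothetical in-degree centrality values for three stations
male = pd.DataFrame({
    "station": [69, 81, 289],
    "in_degree_centrality": [0.77, 0.76, 0.72],
})
female = pd.DataFrame({
    "station": [289, 69, 81],
    "in_degree_centrality": [0.65, 0.62, 0.62],
})

# Align the two groups on station id; suffixes keep the columns apart
compare = male.merge(female, on="station", suffixes=("_m", "_f"))

# Absolute difference between the groups for each station
compare["difference"] = (
    compare["in_degree_centrality_m"] - compare["in_degree_centrality_f"]
).abs()
print(compare)
```

An inner merge keeps only stations present in both groups, so a station used by one group alone would drop out rather than produce a missing value.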


In [31]:
#Create dataframe that compares degree of centrality for each station for male and female groups

#rename columns
in_deg_centrality_female.columns = ['f_station', 'f_in_degree_centrality', 'f_station_name']

#rename columns
in_deg_centrality_male.columns = ['m_station', 'm_in_degree_centrality', 'm_station_name']

#join male and female centrality data by station id
in_deg_centrality_compare = in_deg_centrality_male.join(in_deg_centrality_female.set_index('f_station'), on='m_station')

#note: reset_index returns a new frame; the result is not assigned, so the original index is kept
in_deg_centrality_compare.reset_index(drop=True)

#drop repeated information
in_deg_centrality_compare = in_deg_centrality_compare.drop(['f_station_name'], axis=1)

#rename columns 
in_deg_centrality_compare.columns = ['station_id', 'm_in_degree_centrality', 'station_name', 'f_in_degree_centrality']

#Calculate difference between male and female in degree centrality for a given station
in_deg_centrality_compare['difference'] = in_deg_centrality_compare.apply(lambda row: abs(row['m_in_degree_centrality']-row['f_in_degree_centrality']), axis=1)

In [32]:
in_deg_centrality_compare[['station_name', 'difference']].head(n=10)

Unnamed: 0,station_name,difference
81,Damen Ave & Pierce Ave,0.150502
167,Daley Center Plaza,0.147157
23,LaSalle St & Illinois St,0.143813
180,Dayton St & North Ave,0.117057
25,Wells St & Concord Ln,0.073579
15,Clinton St & Washington Blvd,0.157191
147,Clark St & Elm St,0.080268
99,Halsted St & Blackhawk St,0.153846
186,Desplaines St & Kinzie St,0.090301
109,Dearborn St & Adams St,0.153846


In [33]:
#mean of difference
in_deg_centrality_compare['difference'].mean()

0.11114124157602424

## Test if mean difference of male and female degree centrality is zero

### Null Hypothesis: 
The mean absolute difference between male and female in-degree centrality across drop-off stations is zero. 

### Alternative Hypothesis: 
The mean absolute difference between male and female in-degree centrality across drop-off stations is not zero. 

#### scipy.stats.ttest_1samp

>Calculates the T-test for the mean of ONE group of scores. This is a two-sided test for the null hypothesis that the expected value (mean) of a sample of independent observations a is equal to the given population mean.

Source: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ttest_1samp.html
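
To show what the test statistic measures, here is a minimal sketch that computes the one-sample t statistic by hand: the sample mean minus the hypothesized mean, divided by the standard error. The helper `one_sample_t` and the four difference values are hypothetical, and the p-value step is omitted. The example also contrasts signed and absolute differences, since the differences in this notebook are absolute values and therefore never negative.

```python
import math
import statistics

def one_sample_t(sample, popmean=0.0):
    """t statistic for H0: the mean of `sample` equals `popmean`."""
    n = len(sample)
    # Standard error uses the sample standard deviation (n - 1 denominator)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - popmean) / standard_error

# Hypothetical signed differences (male minus female) for four stations
signed = [0.10, -0.20, 0.15, -0.05]
absolute = [abs(d) for d in signed]

t_signed = one_sample_t(signed)      # positive and negative gaps cancel out
t_absolute = one_sample_t(absolute)  # every gap counts as positive
print(t_signed, t_absolute)
```

With these made-up values the signed differences average to about zero, while their absolute values give a clearly positive statistic. A test on absolute differences therefore measures how far apart the groups are per station, not which group is higher.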
     

In [34]:
ttest_1samp(in_deg_centrality_compare['difference'], 0)

Ttest_1sampResult(statistic=44.125816540055204, pvalue=6.245978575816204e-141)

### Result of ttest

Since the p-value (6.245978575816204e-141) is far below 0.05, we reject the null hypothesis.

If the null hypothesis were true (mean of zero), a sample mean at least this far from zero would be extremely unlikely.

Hence we conclude that the mean absolute difference in in-degree centrality between male and female riders is greater than zero across drop-off stations.



## Eigenvector Centrality Comparison

For each station, calculate the difference in eigenvector centrality for male and female groups.

In [35]:
#Create dataframe that compares eigenvector centrality for each station for male and female groups

#rename columns
eigenvector_centrality_female.columns = ['f_station', 'f_eigenvector_centrality', 'f_station_name']

#rename columns
eigenvector_centrality_male.columns = ['m_station', 'm_eigenvector_centrality', 'm_station_name']

#join male and female centrality data by station id
eigenvector_centrality_compare = eigenvector_centrality_male.join(eigenvector_centrality_female.set_index('f_station'), on='m_station')

#note: reset_index returns a new frame; the result is not assigned, so the original index is kept
eigenvector_centrality_compare.reset_index(drop=True)

#drop repeated information
eigenvector_centrality_compare = eigenvector_centrality_compare.drop(['f_station_name'], axis=1)

#rename columns 
eigenvector_centrality_compare.columns = ['station_id', 'm_eigenvector_centrality', 'station_name', 'f_eigenvector_centrality']

#Calculate absolute difference between male and female eigenvector centrality for a given station
eigenvector_centrality_compare['difference'] = eigenvector_centrality_compare.apply(lambda row: abs(row['m_eigenvector_centrality']-row['f_eigenvector_centrality']), axis=1)

In [36]:
eigenvector_centrality_compare[['station_name', 'difference']].head(n=10)

Unnamed: 0,station_name,difference
15,Clinton St & Washington Blvd,0.198403
8,Canal St & Madison St,0.208735
17,Canal St & Adams St,0.165816
47,Michigan Ave & Washington St,0.120067
85,LaSalle St & Jackson Blvd,0.130452
35,Michigan Ave & Lake St,0.09167
108,Columbus Dr & Randolph St,0.081871
110,State St & Kinzie St,0.069427
16,Orleans St & Merchandise Mart Plaza,0.075436
19,Dearborn St & Monroe St,0.069839


In [38]:
#mean of difference
eigenvector_centrality_compare['difference'].mean()

0.02208014075840004

## Test if mean difference of male and female eigenvector centrality is zero

### Null Hypothesis: 
The mean absolute difference between male and female eigenvector centrality across drop-off stations is zero. 

### Alternative Hypothesis: 
The mean absolute difference between male and female eigenvector centrality across drop-off stations is not zero. 

#### scipy.stats.ttest_1samp

>Calculates the T-test for the mean of ONE group of scores. This is a two-sided test for the null hypothesis that the expected value (mean) of a sample of independent observations a is equal to the given population mean.

Source: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ttest_1samp.html

In [37]:
ttest_1samp(eigenvector_centrality_compare['difference'], 0)

Ttest_1sampResult(statistic=12.928256492844083, pvalue=3.020672599424717e-31)

### Result of ttest

Since the p-value (3.020672599424717e-31) is far below 0.05, we reject the null hypothesis.

If the null hypothesis were true (mean of zero), a sample mean at least this far from zero would be extremely unlikely.

Hence we conclude that the mean absolute difference in eigenvector centrality between male and female riders is greater than zero across drop-off stations.

