# DBBA Coursework 1

#### Name: Jingxiang Qi Student Number: S2590856

--------------------------------------------------------------------------------------------------------------------------------------------

### Instructions

#### Academic Misconduct

Please remember the good scholarly practice requirements of the University regarding work for
credit. You can find guidance at the School page

https://web.inf.ed.ac.uk/infweb/admin/policies/academic-misconduct

This also has links to the relevant University pages. 

**You are not allowed to collaborate with other students on this assignment or to ask or answer questions about the contents of the assignment. If you do not understand a specific question, ask Valerio and Ogy on Piazza.**

#### Submission Instructions

All the analysis must be done in this Jupyter Notebook and you should have a separate written report (without code) saved as a PDF. Please fill out the fields below with the necessary code (remember to comment your code well) and discussion where needed. Code will generally
not be marked, but it will be checked by the markers to ensure that all the analysis is properly
done and the work is yours (i.e. there was no plagiarism). Focus on analysing the results you obtain as this is the main part that will be marked. Report your findings in a PDF file where you do not include any code but just the figures obtained and the conclusions you draw, i.e. plots and analysis. You will have to submit your files (final Jupyter Notebook and PDF) on Learn. Name your files with your
student number. For instance, if your student number is S123456789, you must submit a file
S123456789.zip containing the python source code and answers to the questions (PDF).

#### General Instructions 
In this coursework, you will analyse a real-world temporal network based on what you have learned in
class. Many exercises will require you to discuss the results of your analysis; others will leave
you the choice of which algorithm to use for a particular task. This is by design because this
coursework assesses whether you understand network science and whether you can apply it to
real-world networks. For this reason, if you realise you need to make assumptions to answer a
question, do so and always, always motivate your assumptions and answers!

**Warning:** Some network metrics might require some time to compute. Please consider this when
doing the coursework and allow enough time to perform the required computations. Also
remember that you can use the School’s DICE machines, which can be left running!

--------------------------------------------------------------------------------------------------------------------------------------------

#### Assignment Premises

You have been hired as a data analyst in the newly founded investment company DBBA Capital and have been tasked with the analysis of the investment patterns of one of our major competitors: Fairholme Capital, managed by Bruce Berkowitz. 

DBBA Capital wants you to evaluate the investment patterns of Fairholme Capital in relation to other superinvestors and evaluate the change in investment patterns during the pandemic. They have provided you with data about different superinvestors and the companies they invested in for each quarter spanning from quarter 1 (Q1) of 2019 to quarter 2 (Q2) of 2023 (that you can find in the folder named "Assignment Data"). 

The first column of each file represents the investors and the remaining columns represent the companies each investor invested in. First, familiarise yourself with the data, and then follow the steps below to perform the necessary analysis.

**TIP** When you believe it might help, make use of the information you have on the portfolio composition to comment and discuss your results.

#### Part 1: Network Creation

**Task 1.1 (7 marks)**<br>
In the field below, load the first Excel dataset ("2019_Q1.xlsx") and create a network out of the investors and companies in the following manner:

- the nodes of the network are all the investors in the first column of the dataset
- two investors (nodes) are connected with an edge if they have invested in the same company (e.g. Christopher Bloomstran - Semper Augustus and David Abrams - Abrams Capital Management will be connected because they both invested in GOOGL). 
- if two investors have invested in more than one common company, do *not* assign multiple edges between them. Instead, assign the number of common companies they have invested in as a weight to the edge connecting them.

After you built the network, extract the largest connected component and plot it. Remember to add the edge weights in your plot.
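
The construction rule above can be sketched on a toy dataset. The investor names and tickers below are hypothetical placeholders rather than values from the assignment files, and the sketch assumes each portfolio is already available as a set of tickers.

```python
from itertools import combinations

import networkx as nx

# Hypothetical toy portfolios (not taken from the assignment data)
portfolios = {
    "Investor A": {"GOOGL", "AAPL", "BRK.B"},
    "Investor B": {"GOOGL", "AAPL"},
    "Investor C": {"BRK.B"},
    "Investor D": {"TSLA"},
}

G = nx.Graph()
G.add_nodes_from(portfolios)

# One edge per investor pair, weighted by the number of shared companies
for u, v in combinations(portfolios, 2):
    shared = portfolios[u] & portfolios[v]
    if shared:
        G.add_edge(u, v, weight=len(shared))

# Keep only the largest connected component (Investor D is isolated here)
largest_cc = max(nx.connected_components(G), key=len)
lcc = G.subgraph(largest_cc).copy()
```

Converting each portfolio to a set once, before the pairwise comparison, avoids rebuilding the same list for every pair of investors.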

In [None]:
import os
import random
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

# set up the network
def setup_network(filename):
    # load the dataset
    df = pd.read_excel(filename, header=None)
    investors = df.iloc[:, 0].tolist()
    # create a graph
    G = nx.Graph()

    # add nodes to the graph
    G.add_nodes_from(investors)

    # add edges to the graph
    for i in range(len(investors)):
        # get the name of investor_i
        investor_i = investors[i]
        # get the list of companies invested by investor_i
        invested_list_i = df.iloc[i, 1:].dropna().tolist()
        # compare this investor with the rest of investors
        for j in range(i+1, len(investors)):
            # get the name of investor_j
            investor_j = investors[j]
            # get the list of companies invested by investor_j
            invested_list_j = df.iloc[j, 1:].dropna().tolist()
            # compute the intersection of the two lists
            intersection_set = set(invested_list_i).intersection(set(invested_list_j))
            # if the intersection is not empty, add an edge between the two investors
            # set the weight of the edge to be the number of common companies they invested together
            if len(intersection_set) > 0:
                G.add_edge(investor_i, investor_j, weight=len(intersection_set))

    # extract the largest connected component
    largest_connected_component = max(nx.connected_components(G), key=len)
    # create a subgraph of the largest connected component
    subgraph = G.subgraph(largest_connected_component)
    return subgraph

# plot the network
def plot_network(G, title):
    # plot the graph
    plt.figure(figsize=(20, 20))
    
    pos = nx.spring_layout(
                            G, 
                            weight='weight',
                            k=6,
                            iterations=100
                        )
    nx.draw(
            G, pos, 
            with_labels=True, 
            node_color='lightblue', 
            node_size=200,
            width=0.15,
            font_size=10, 
            font_weight='light'
            )
    labels = nx.get_edge_attributes(G, 'weight')
    nx.draw_networkx_edge_labels(G, pos, edge_labels=labels, font_size=5, alpha=0.5)
    plt.title(title)
    plt.show()

whole_network = setup_network('./Assignment_Data/2019_Q1.xlsx')
plot_network(whole_network,'2019_Q1 Network')


**Note that "whole network" here and hereafter refers to the largest connected component of the network.**

**Task 1.2 (3 marks)** <br>
Obtain the ego-network of 'Bruce Berkowitz - Fairholme Capital' and plot it.
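
As a rough illustration of what a radius-1 ego network contains, the sketch below uses a small hypothetical graph. The helper `safe_ego_graph` is an assumption introduced here, not part of the coursework: it returns `None` when the requested node is absent, which could happen if an investor falls outside the largest connected component in some quarter.

```python
import networkx as nx

# Hypothetical toy graph (not from the assignment data)
G = nx.Graph()
G.add_weighted_edges_from([
    ("ego", "a", 3), ("ego", "b", 1), ("a", "b", 2), ("b", "c", 1), ("c", "d", 4),
])

def safe_ego_graph(graph, node, radius=1):
    """Return the ego network of `node`, or None if the node is not in the graph."""
    if node not in graph:
        return None
    return nx.ego_graph(graph, node, radius=radius)

ego = safe_ego_graph(G, "ego")
missing = safe_ego_graph(G, "not present")
```

The ego network keeps the ego, its direct neighbours, and the edges among those neighbours, so the ego is always connected to every other node in it.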

In [None]:

# setup the ego network
def setup_ego_network(ego_node, G):
    ego_network = nx.ego_graph(G, ego_node, radius=1)
    return ego_network

# plot the ego network
def plot_ego_network(ego_node, ego_network):
    # plot the ego network
    plt.figure(figsize=(10, 10))
    pos = nx.spring_layout(ego_network, weight='weight',k=6,iterations=100)
    nx.draw(ego_network, pos, with_labels=True, node_color='lightblue', node_size=200, width=0.2,font_size=6,font_weight='light')
    # show edge labels for edges 
    labels = nx.get_edge_attributes(ego_network, 'weight')
    nx.draw_networkx_edge_labels(ego_network, pos, edge_labels=labels, font_size=5)
    plt.title('Ego Network of ' + ego_node)
    plt.show()

ego_network = setup_ego_network('Bruce Berkowitz - Fairholme Capital',whole_network)
plot_ego_network('Bruce Berkowitz - Fairholme Capital', ego_network)



#### Part 2: Basic Network Analysis

**Task 2.1 (15 marks)** <br>
Now that you know how to build the network for a single quarter and get its largest connected component, repeat the procedure for all the other quarters. For both the whole network and the ego-network, produce a table with the summary statistics (i.e. mean, max, min, and standard deviation) of the following network quantities:

- Number of nodes
- Number of links
- Density
- Average clustering coefficient
- Average degrees
- Average strength
- Assortativity

If you need to make any assumption or decision regarding the metric to use to compute any of these quantities, clearly motivate it.
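
To make the definitions concrete, here is a minimal sketch on a hypothetical four-node weighted graph. It assumes the usual conventions for an undirected simple graph: density is 2m / (n(n-1)), average degree is 2m / n, and strength is the sum of the weights of a node's edges. The weighted clustering call is shown only as one possible alternative to the unweighted version.

```python
import networkx as nx
import numpy as np

# Hypothetical toy weighted graph (not from the assignment data)
G = nx.Graph()
G.add_weighted_edges_from([("a", "b", 2), ("a", "c", 1), ("b", "c", 3), ("c", "d", 1)])

n, m = G.number_of_nodes(), G.number_of_edges()

density = 2 * m / (n * (n - 1))             # fraction of possible edges present
avg_degree = 2 * m / n                      # each edge contributes to two degrees
strength = dict(G.degree(weight="weight"))  # weighted degree of each node
avg_strength = float(np.mean(list(strength.values())))

# Unweighted clustering ignores edge weights; the weighted variant scales
# each triangle by its (normalised) edge weights
clustering_unweighted = nx.average_clustering(G)
clustering_weighted = nx.average_clustering(G, weight="weight")
```

Whether to use the weighted or unweighted variant of clustering and assortativity is a modelling decision, and the choice should be stated alongside the results.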

In [None]:

# get the summary statistics data of the network
def get_network_summary_statistics(G, filename):
    # compute the summary statistics
    num_nodes = G.number_of_nodes()
    num_links = G.number_of_edges()
    density = nx.density(G)
    avg_clustering_coeff = nx.average_clustering(G)
    avg_degrees = np.mean(list(dict(G.degree()).values()))
    avg_strength = np.mean(list(dict(G.degree(weight='weight')).values()))
    assortativity = nx.degree_assortativity_coefficient(G, weight='weight')
    return {
        'Quarter': os.path.basename(filename).split('.')[0],
        'Num Nodes': num_nodes,
        'Num Links': num_links,
        'Density': density,
        'Avg Clustering Coefficient': avg_clustering_coeff,
        'Avg Degrees': avg_degrees,
        'Avg Strength': avg_strength,
        'Assortativity': assortativity
    }

# display the summary statistics in table
def summary_table(results, title):
    # create a table of the summary statistics
    df = pd.DataFrame(results)
    df = df[['Quarter', 'Num Nodes', 'Num Links', 'Density', 'Avg Clustering Coefficient', 'Avg Degrees', 'Avg Strength', 'Assortativity']]
    df_des = df.describe().drop('count').drop('25%').drop('50%').drop('75%')
    df.set_index('Quarter', inplace=True)
    # set the title of the table
    df = df.style.set_caption(title)
    # display the table
    display(df)
    display(df_des)

whole_network_results = []
ego_network_results = []
# get the summary statistics for all the files in the folder
folder_path = './Assignment_Data'
files = os.listdir(folder_path)
files = sorted(files)

for file in files:
    if file.endswith('.xlsx'):
        filename = os.path.join(folder_path, file)
        # Get the whole network summary statistics
        graph = setup_network(filename)
        whole_network_result = get_network_summary_statistics(graph, filename)
        whole_network_results.append(whole_network_result)
        # Get the ego network summary statistics
        ego_network = setup_ego_network('Bruce Berkowitz - Fairholme Capital', graph)
        ego_network_result = get_network_summary_statistics(ego_network, filename)
        ego_network_results.append(ego_network_result)

summary_table(whole_network_results, 'Whole Network Summary Statistics')
summary_table(ego_network_results, 'Ego Network Summary Statistics')


**Task 2.2 (10 marks)** <br>
Discuss why ego networks are useful for exploring the importance of individual nodes. Then, comment on the statistics you computed above and what information they give you about the investment patterns of Bruce Berkowitz - Fairholme Capital. Briefly discuss how the ego network statistics differ from the statistics obtained for the whole network, explaining whether the differences or similarities are expected or not. Motivate your answers. 

#### Part 3: Comparing Degree Distributions

**Task 3.1 (8 marks)** <br>
Choose a single temporal slice (i.e. quarter) and plot and analyse the total degree and strength distributions of both the whole network and the ego-network. Comment on the similarities/differences between these networks. 
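
Histograms can be sensitive to the choice of bins, so one complementary view is the empirical complementary cumulative distribution, P(X >= x). The sketch below is a suggestion rather than a required method. The helper `ccdf` is a hypothetical name, and a path graph stands in for the real network.

```python
import networkx as nx
import numpy as np

def ccdf(values):
    """Return the sorted unique values x and the fraction of observations >= x."""
    values = np.asarray(values)
    x = np.unique(values)
    p = np.array([(values >= xi).mean() for xi in x])
    return x, p

# Stand-in graph: a path on five nodes has degrees 1, 2, 2, 2, 1
G = nx.path_graph(5)
x, p = ccdf([d for _, d in G.degree()])
```

The same helper works for strengths by passing `G.degree(weight="weight")` instead. Plotting `p` against `x` on log-log axes is a common way to look for heavy tails.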

In [None]:

# plot the degree distribution
def plot_degree_distribution(G, title, ax):
    # get the degree information
    degree_values = list(dict(G.degree()).values())
    # plot the degree distribution
    ax.hist(degree_values, bins=20, color='lightblue',edgecolor='gray', alpha=0.7)
    ax.set_xlabel('Degree')
    ax.set_ylabel('Number of Nodes')
    ax.set_title(title + ' Degree Distribution')

# plot the strength distribution
def plot_strength_distribution(G, title, ax):
    # get the strength information
    strength = {}
    for node in G.nodes():
        total_strength = sum(G[node][neighbor]['weight'] for neighbor in G.neighbors(node))
        strength[node] = total_strength
    strength_values = list(strength.values())

    # plot the strength distribution
    ax.hist(strength_values, bins=20, color='lightblue',edgecolor='gray', alpha=0.7)
    ax.set_xlabel('Strength')
    ax.set_ylabel('Number of Nodes')
    ax.set_title(title + ' Strength Distribution')

# 2020_Q2
whole_network = setup_network('./Assignment_Data/2020_Q2.xlsx')
ego_network = setup_ego_network('Bruce Berkowitz - Fairholme Capital',whole_network)

# create a 2x2 grid of subplots
fig, axs = plt.subplots(2, 2, figsize=(15, 15))

plot_degree_distribution(whole_network, '2020_Q2 Whole Network', axs[0, 0])
plot_degree_distribution(ego_network, '2020_Q2 Ego Network', axs[0, 1])
plot_strength_distribution(whole_network, '2020_Q2 Whole Network', axs[1, 0])
plot_strength_distribution(ego_network, '2020_Q2 Ego Network', axs[1, 1])

# display the plot
plt.tight_layout()
plt.show()

**Task 3.2 (7 marks)** <br>
Based on degree distributions and the results you obtained, what type of network would you say the whole network and ego-network are (e.g. scale-free, random, etc.)? Could they have been generated by any of the models discussed in class? Motivate your answer.
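
A single random graph is a noisy reference point, so one option is to average over several G(n, m) samples with the same number of nodes and edges as the observed network. The sketch below assumes that approach. `random_baseline` is a hypothetical helper, and a Watts-Strogatz graph stands in for the real network.

```python
import networkx as nx
import numpy as np

def random_baseline(G, n_samples=20, seed=0):
    """Mean and std of average clustering over G(n, m) graphs matched to G."""
    n, m = G.number_of_nodes(), G.number_of_edges()
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(n_samples):
        # A fresh integer seed per sample keeps the whole run reproducible
        sample_seed = int(rng.integers(1_000_000_000))
        R = nx.gnm_random_graph(n, m, seed=sample_seed)
        values.append(nx.average_clustering(R))
    return float(np.mean(values)), float(np.std(values))

# Stand-in for the observed network (not the assignment data)
G = nx.connected_watts_strogatz_graph(60, 6, 0.05, seed=1)
observed = nx.average_clustering(G)
mean_c, std_c = random_baseline(G)
```

If the observed clustering sits many standard deviations above `mean_c`, a purely random model is an unlikely explanation for the structure. Path lengths are left out of this sketch because a sampled random graph is not guaranteed to be connected.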

In [None]:
# random network / small world network

# make a random whole network (seeded so the comparison is reproducible)
random_whole_network = nx.gnm_random_graph(len(whole_network.degree()), len(whole_network.edges()), seed=1234)
# assign random weights to the edges
for u, v in random_whole_network.edges():
    random_whole_network[u][v]['weight'] = random.randint(1, 20)

# make a random ego network (seeded so the comparison is reproducible)
random_ego_network = nx.gnm_random_graph(len(ego_network.degree()), len(ego_network.edges()), seed=1234)
# assign random weights to the edges
for u, v in random_ego_network.edges():
    random_ego_network[u][v]['weight'] = random.randint(1, 20)

# get the clustering coefficient and average shortest path length of the random whole network and random ego network
random_whole_clustering_coefficient = nx.average_clustering(random_whole_network)
random_whole_avg_shortest_path_length = nx.average_shortest_path_length(random_whole_network)
random_ego_clustering_coefficient = nx.average_clustering(random_ego_network)
random_ego_avg_shortest_path_length = nx.average_shortest_path_length(random_ego_network)

# get the clustering coefficient and average shortest path length of the whole network and ego network
whole_network_clustering_coefficient = nx.average_clustering(whole_network)
whole_network_shortest_path_length = nx.average_shortest_path_length(whole_network)
ego_network_clustering_coefficient = nx.average_clustering(ego_network)
ego_network_shortest_path_length = nx.average_shortest_path_length(ego_network)

# create a dictionary with the data
data = {'Network Type': ['Random Whole Network', 'Random Ego Network', 'Whole Network', 'Ego Network'],
    'Clustering Coefficient': [random_whole_clustering_coefficient, random_ego_clustering_coefficient, whole_network_clustering_coefficient, ego_network_clustering_coefficient],
    'Average Shortest Path Length': [random_whole_avg_shortest_path_length, random_ego_avg_shortest_path_length, whole_network_shortest_path_length, ego_network_shortest_path_length]}

# create a dataframe from the dictionary
df = pd.DataFrame(data)

# print the dataframe
display(df)


#### Part 4: Changes of the network statistics during the pandemic

**Task 4.1 (15 marks)** <br> Plot the temporal evolution of the quantities you computed in Part 2 for the ego network and the whole network, and compare the differences between the networks. For each quantity, discuss whether it can be used for analysing the investment patterns of Bruce Berkowitz - Fairholme Capital over time. Based on your discussion, choose the quantities that you find important. What information can you draw about the change of those network statistics during the pandemic?
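
Alongside the plots, a simple numerical comparison between periods may help support the discussion. The sketch below uses placeholder numbers, not results from the assignment data, and it assumes that the pandemic period starts in 2020_Q1. That cut-off is an analyst's choice and should be justified.

```python
import pandas as pd

# Placeholder values for illustration only (not results from the assignment data)
frame = pd.DataFrame({
    "Quarter": ["2019_Q3", "2019_Q4", "2020_Q1", "2020_Q2", "2020_Q3"],
    "Density": [0.40, 0.42, 0.50, 0.54, 0.52],
})

# Assumption: quarters from 2020_Q1 onwards count as the pandemic period.
# Labels of the form YYYY_Qn sort chronologically as plain strings.
frame["Period"] = frame["Quarter"].map(
    lambda q: "pandemic" if q >= "2020_Q1" else "pre-pandemic"
)

summary = frame.groupby("Period")["Density"].agg(["mean", "std"])
change = summary.loc["pandemic", "mean"] - summary.loc["pre-pandemic", "mean"]
```

The same grouping can be applied to any column of the per-quarter summary table, for both the whole network and the ego network.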

In [None]:

# create empty lists to store the summary statistics for each quarter
whole_network_results = []
ego_network_results = []

# get the summary statistics for all the files in the folder
folder_path = './Assignment_Data'
files = os.listdir(folder_path)
files = sorted(files)

for file in files:
    if file.endswith('.xlsx'):
        filename = os.path.join(folder_path, file)
        # setup the whole graph and ego network
        whole_network = setup_network(filename)
        ego_network = setup_ego_network('Bruce Berkowitz - Fairholme Capital', whole_network)

        # compute the summary statistics for the whole graph and ego network
        whole_network_summary = get_network_summary_statistics(whole_network, filename)
        ego_network_summary = get_network_summary_statistics(ego_network, filename)
        # append the summary statistics to the results lists
        whole_network_results.append(whole_network_summary)
        ego_network_results.append(ego_network_summary)

# display the dataframes
fig, axs = plt.subplots(4, 2, figsize=(15, 20))

whole_network_frame = pd.DataFrame(whole_network_results)
ego_network_frame = pd.DataFrame(ego_network_results)

# num_nodes
whole_network_frame.plot(ax=axs[0, 0],x='Quarter', y='Num Nodes', kind='line', title='Num Nodes', label='Whole Network', marker='.')
ego_network_frame.plot(ax=axs[0, 0],x='Quarter', y='Num Nodes', kind='line', title='Num Nodes', label='Ego Network', marker='.')
axs[0, 0].set_xticks(range(len(whole_network_frame['Quarter'])))
axs[0, 0].set_xticklabels(whole_network_frame['Quarter'],rotation=45)

# num_links
whole_network_frame.plot(ax=axs[0, 1],x='Quarter', y='Num Links', kind='line', title='Num Links', label='Whole Network', marker='.')
ego_network_frame.plot(ax=axs[0, 1],x='Quarter', y='Num Links', kind='line', title='Num Links', label='Ego Network', marker='.')
axs[0, 1].set_xticks(range(len(whole_network_frame['Quarter'])))
axs[0, 1].set_xticklabels(whole_network_frame['Quarter'],rotation=45)

# density
whole_network_frame.plot(ax=axs[1, 0],x='Quarter', y='Density', kind='line', title='Density', label='Whole Network', marker='.')
ego_network_frame.plot(ax=axs[1, 0],x='Quarter', y='Density', kind='line', title='Density', label='Ego Network', marker='.')
axs[1, 0].set_xticks(range(len(whole_network_frame['Quarter'])))
axs[1, 0].set_xticklabels(whole_network_frame['Quarter'],rotation=45)

# avg_clustering_coeff
whole_network_frame.plot(ax=axs[1, 1],x='Quarter', y='Avg Clustering Coefficient', kind='line', title='Avg Clustering Coefficient', label='Whole Network', marker='.')
ego_network_frame.plot(ax=axs[1, 1],x='Quarter', y='Avg Clustering Coefficient', kind='line', title='Avg Clustering Coefficient', label='Ego Network', marker='.')
axs[1, 1].set_xticks(range(len(whole_network_frame['Quarter'])))
axs[1, 1].set_xticklabels(whole_network_frame['Quarter'],rotation=45)

# avg_degrees
whole_network_frame.plot(ax=axs[2, 0],x='Quarter', y='Avg Degrees', kind='line', title='Avg Degrees', label='Whole Network', marker='.')
ego_network_frame.plot(ax=axs[2, 0],x='Quarter', y='Avg Degrees', kind='line', title='Avg Degrees', label='Ego Network', marker='.')
axs[2, 0].set_xticks(range(len(whole_network_frame['Quarter'])))
axs[2, 0].set_xticklabels(whole_network_frame['Quarter'],rotation=45)

# avg_strength
whole_network_frame.plot(ax=axs[2, 1],x='Quarter', y='Avg Strength', kind='line', title='Avg Strength', label='Whole Network', marker='.')
ego_network_frame.plot(ax=axs[2, 1],x='Quarter', y='Avg Strength', kind='line', title='Avg Strength', label='Ego Network', marker='.')
axs[2, 1].set_xticks(range(len(whole_network_frame['Quarter'])))
axs[2, 1].set_xticklabels(whole_network_frame['Quarter'],rotation=45)

# assortativity
whole_network_frame.plot(ax=axs[3, 0],x='Quarter', y='Assortativity', kind='line', title='Assortativity', label='Whole Network', marker='.')
ego_network_frame.plot(ax=axs[3, 0],x='Quarter', y='Assortativity', kind='line', title='Assortativity', label='Ego Network', marker='.')
axs[3, 0].set_xticks(range(len(whole_network_frame['Quarter'])))
axs[3, 0].set_xticklabels(whole_network_frame['Quarter'],rotation=45)

# set the last subplot to be invisible
axs[3, 1].set_visible(False)

plt.tight_layout()
plt.show()


**Task 4.2 (10 marks)** <br> Choose a suitable centrality measure that would give us important information about the nodes in the whole network, and clearly motivate your choice. Use this measure to find the 3 most central nodes for each quarter. Compare the centrality of Bruce Berkowitz - Fairholme Capital over time with that of the most central nodes. What can you conclude from this?
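
As one possible illustration, the sketch below ranks nodes by weighted eigenvector centrality on a hypothetical toy graph and expresses a chosen node's score relative to the top node. Eigenvector centrality is only one candidate measure; the node names and weights are invented for the example.

```python
import networkx as nx

# Hypothetical toy graph (not from the assignment data)
G = nx.Graph()
G.add_weighted_edges_from([
    ("hub", "a", 3), ("hub", "b", 3), ("hub", "c", 3), ("a", "b", 1), ("c", "d", 1),
])

# Weighted eigenvector centrality: a node scores highly when it is strongly
# connected to other high-scoring nodes
centrality = nx.eigenvector_centrality(G, weight="weight", max_iter=1000)

# Names of the three most central nodes, highest first
top3 = sorted(centrality, key=centrality.get, reverse=True)[:3]

# Score of a node of interest relative to the most central node
relative = centrality["d"] / centrality[top3[0]]
```

A relative score makes quarters easier to compare, because the raw eigenvector values depend on the size and normalisation of each quarter's network.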

In [None]:

# get the eigenvector centrality
def get_eigenvector_centrality(G, filename, first_n, specific_node_name):
    
    # calculate eigenvector centrality for whole_network
    eigenvector_centrality = nx.eigenvector_centrality_numpy(G, weight='weight')
    # sort the nodes by eigenvector centrality
    sorted_nodes = sorted(eigenvector_centrality.items(), key=lambda x: x[1], reverse=True)
    # get the top n nodes
    sorted_nodes_list = sorted_nodes[:first_n]
    # get the specific node
    specific_node = [node for node in sorted_nodes if node[0] == specific_node_name][0]
    sorted_nodes_list.append(specific_node)

    return {
        'Quarter': os.path.basename(filename).split('.')[0],
        'TOP 1 Node Name': sorted_nodes_list[0][0],
        'TOP 1 Node Centrality': sorted_nodes_list[0][1],
        'TOP 2 Node Name': sorted_nodes_list[1][0],
        'TOP 2 Node Centrality': sorted_nodes_list[1][1],
        'TOP 3 Node Name': sorted_nodes_list[2][0],
        'TOP 3 Node Centrality': sorted_nodes_list[2][1],
        specific_node_name : specific_node[1]
    }


whole_network_centrality_results = []
# read all the files in the folder
folder_path = './Assignment_Data'
files = os.listdir(folder_path)
files = sorted(files)

for file in files:
    if file.endswith('.xlsx'):
        filename = os.path.join(folder_path, file)
        # set up the whole network (largest connected component)
        whole_network = setup_network(filename)
        # compute the eigenvector centrality for the whole network
        whole_centrality = get_eigenvector_centrality(whole_network, filename, 3, 'Bruce Berkowitz - Fairholme Capital')
        # add the eigenvector centrality to the results list
        whole_network_centrality_results.append(whole_centrality)


# display the dataframe
whole_network_frame = pd.DataFrame(whole_network_centrality_results)
# calculate the difference between Bruce Berkowitz - Fairholme Capital and TOP 1 Node Centrality
whole_network_frame['Centrality Diff'] = whole_network_frame['TOP 1 Node Centrality'] - whole_network_frame['Bruce Berkowitz - Fairholme Capital']
df_whole = whole_network_frame.style.set_caption("Whole Network")
display(df_whole)

# plot Whole Network Eigenvector Centrality
plt.figure(figsize=(10, 8))

plt.plot(whole_network_frame['Quarter'], whole_network_frame['TOP 1 Node Centrality'], linestyle='--', marker='.', color='red')
plt.plot(whole_network_frame['Quarter'], whole_network_frame['TOP 2 Node Centrality'], linestyle='--', marker='.', color='green')
plt.plot(whole_network_frame['Quarter'], whole_network_frame['TOP 3 Node Centrality'], linestyle='--', marker='.', color='blue')
plt.plot(whole_network_frame['Quarter'], whole_network_frame['Bruce Berkowitz - Fairholme Capital'], linestyle='-', marker='.', color='orange')

plt.title('Whole Network Eigenvector Centrality')
plt.xlabel('Quarter')
plt.ylabel('Eigenvector Centrality')
plt.xticks(range(len(whole_network_frame['Quarter'])), whole_network_frame['Quarter'], rotation=45)
plt.legend(['TOP 1 Central Node', 'TOP 2 Central Node', 'TOP 3 Central Node', 'Bruce Berkowitz - Fairholme Capital'])

plt.show()


#### Part 5: Clustering and Modularity

**Task 5.1 (15 marks)** <br> Find the communities in each quarter in the whole network. To do so, use an algorithm of your choice, and justify your decision. Analyse how the communities evolve over time, focussing on the membership of Bruce Berkowitz - Fairholme Capital. Does this node fall in the same community with the same superinvestors across different quarters? What conclusions can you draw from this?
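
One way to track a node's community across time is to extract the community containing it in each snapshot and compare consecutive snapshots with the Jaccard index. The sketch below assumes Louvain as the detection algorithm. The helpers `community_of` and `jaccard` are hypothetical names, and two small barbell graphs stand in for consecutive quarters.

```python
import networkx as nx

def community_of(G, node, seed=1234):
    """Return the Louvain community (as a set) that contains `node`."""
    parts = nx.community.louvain_communities(G, weight="weight", seed=seed)
    return next(set(c) for c in parts if node in c)

def jaccard(a, b):
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Stand-ins for two consecutive quarters (not the assignment data):
# two 5-node cliques joined by one edge, then the same graph minus one node
q1 = nx.barbell_graph(5, 0)
q2 = q1.copy()
q2.remove_node(3)

c1 = community_of(q1, 0)
c2 = community_of(q2, 0)
similarity = jaccard(c1, c2)
```

Louvain is stochastic, so fixing the seed keeps the results repeatable. Community labels are arbitrary across runs, which is why membership sets are compared rather than label indices.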

In [None]:
# create a partition map
def create_partition_map(partition):
    partition_map = {}
    for idx, cluster_nodes in enumerate(partition):
        for node in cluster_nodes:
            partition_map[node] = idx
    return partition_map

# get the louvain communities
def get_louvain_communities(G, filename):
    # compute the louvain communities
    communities = nx.community.louvain_communities(G, weight='weight', seed = 1234)
    # compute the modularity of the partition
    modularity = nx.community.quality.modularity(G, communities)
    # create a partition map
    best_partition_map = create_partition_map(communities)

    return {
        'Quarter': os.path.basename(filename).split('.')[0],
        'Modularity': modularity,
        'Community': best_partition_map
    }

# plot the network, colouring each node by its community
def plot_mark_network(G, title, partition_map, ax):
    # colours are cycled so that more than six communities do not break the plot
    palette = ['red', 'blue', 'green', 'yellow', 'purple', 'orange']
    # build the colour list in G.nodes() order, which is the order nx.draw expects
    node_colors = [palette[partition_map[node] % len(palette)] for node in G.nodes()]

    ax.set_title(title)
    pos = nx.spring_layout(
                            G, 
                            weight='weight',
                            k=6,
                            iterations=100
                        )
    plt.sca(ax)
    nx.draw(
            G, pos, 
            node_color=node_colors, 
            node_size=20,
            width=0.1
            )  

# create empty lists to store the results
community_results = []
whole_network_centrality_results = []
plot_sequence = []

# setup for the plots
fig, axs = plt.subplots(5, 4, figsize=(20, 25))
fig_pos_x = 0
fig_pos_y = 0

# read all the files in the folder
folder_path = './Assignment_Data'
files = os.listdir(folder_path)
files = sorted(files)

for file in files:
    if file.endswith('.xlsx'):
        filename = os.path.join(folder_path, file)
        # set up the whole network (largest connected component)
        whole_network = setup_network(filename)
        # compute the louvain communities
        communities = get_louvain_communities(whole_network, filename)
        community_results.append(communities)
        # plot the network of this file
        plot_mark_network(whole_network, f'{os.path.basename(filename).split(".")[0]} Network', communities['Community'], axs[fig_pos_y, fig_pos_x])
        # compute the eigenvector centrality for the whole graph
        whole_centrality = get_eigenvector_centrality(whole_network, filename, 3, 'Bruce Berkowitz - Fairholme Capital')
        whole_network_centrality_results.append(whole_centrality)
        # advance to the next subplot position
        fig_pos_x += 1
        if fig_pos_x == 4:
            fig_pos_x = 0
            fig_pos_y += 1

# set the last subplot to be invisible
axs[4, 2].set_visible(False)
axs[4, 3].set_visible(False)
# display the network plots
plt.tight_layout()
plt.show()


# store the same community with Bruce Berkowitz - Fairholme Capital
nodes_in_same_community_results = []

for result in community_results:
    index = result['Quarter']
    ego_community = result['Community']['Bruce Berkowitz - Fairholme Capital']
    nodes_in_same_community = [node for node in result['Community'] if result['Community'][node] == ego_community]
    nodes_in_same_community_results.append({'Quarter': index,'Nodes': nodes_in_same_community})

# compare community results
compare_community_results = []

# iterate through each quarter's results
for i in range(len(whole_network_centrality_results)):
    # get the TOP 1 node name and the nodes in the same community for this quarter
    index = whole_network_centrality_results[i]['Quarter']
    top_1_node_name = whole_network_centrality_results[i]['TOP 1 Node Name']
    top_2_node_name = whole_network_centrality_results[i]['TOP 2 Node Name']
    top_3_node_name = whole_network_centrality_results[i]['TOP 3 Node Name']
    nodes_in_same_community = nodes_in_same_community_results[i]['Nodes']
    modularity = community_results[i]['Modularity']

    # check if the nodes name in the same community
    top_1_in = False
    top_2_in = False
    top_3_in = False

    if top_1_node_name in nodes_in_same_community:
        top_1_in = True
    if top_2_node_name in nodes_in_same_community:
        top_2_in = True
    if top_3_node_name in nodes_in_same_community:
        top_3_in = True

    # append the result to the list
    compare_community_results.append({
        'Quarter': index,
        'Modularity': modularity,
        'TOP 1 Node Name': top_1_node_name,
        'TOP 1 Node in Same Community': top_1_in,
        'TOP 2 Node Name': top_2_node_name,
        'TOP 2 Node in Same Community': top_2_in,
        'TOP 3 Node Name': top_3_node_name,
        'TOP 3 Node in Same Community': top_3_in
    })

# convert compare_community_results to a pandas dataframe
df = pd.DataFrame(compare_community_results)
# display whether each of the top 3 nodes is in the same community as Bruce Berkowitz - Fairholme Capital
display(df)

# setup a new figure
fig, axs = plt.subplots(1, 2, figsize=(20, 8))

# calculate the jaccard similarity
def jaccard_similarity(set1, set2):
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    return len(intersection) / len(union)

# calculate the jaccard similarity for each quarter
similarity_results = []
for i in range(len(nodes_in_same_community_results)-1):
    j = i+1
    community1 = nodes_in_same_community_results[i]['Nodes']
    community2 = nodes_in_same_community_results[j]['Nodes']
    quarter1 = nodes_in_same_community_results[i]['Quarter']
    quarter2 = nodes_in_same_community_results[j]['Quarter']
    similarity = jaccard_similarity(set(community1), set(community2))
    # append the result to the list
    similarity_results.append({
        'Quarters': f"{quarter1} and {quarter2}",
        'Jaccard Similarity': similarity
    })

# plot the jaccard similarity
jac_df = pd.DataFrame(similarity_results)
jac_df.plot(ax=axs[0],x='Quarters', y='Jaccard Similarity', kind='line', title='Jaccard Similarity', marker='.')
axs[0].set_xticks(range(len(jac_df['Quarters'])))
axs[0].set_xticklabels(jac_df['Quarters'],rotation=90)
axs[0].axhline(y=0.5, color='r', linestyle='--')

# plot the modularity
mod_df = pd.DataFrame(compare_community_results)
mod_df.plot(ax=axs[1],x='Quarter', y='Modularity', kind='line', title='Modularity', marker='.')
axs[1].set_xticks(range(len(mod_df['Quarter'])))
axs[1].set_xticklabels(mod_df['Quarter'],rotation=45)

plt.tight_layout()
plt.show()
            


#### Part 6: Report your findings

**Task 6.1 (10 marks)** <br> As any good DBBA Capital data analyst, at the end of your analysis you need to present your findings. Please write a brief (~250 words) report discussing how the portfolio of Fairholme Capital has changed compared with the rest of the funds in the dataset.

**REPORT**
