# Playground

## Graph, Nodes and Edges

In graph theory, a graph is a collection of nodes (also known as vertices) and edges. The nodes represent entities or elements, while the edges represent the connections or relationships between these entities. 

**Nodes**: Nodes, also referred to as vertices, are the fundamental units in a graph. They represent individual elements or entities within a network. Nodes can represent various things depending on the context of the graph. For example, in a social network graph, nodes can represent individuals, while in a transportation network, nodes can represent cities or intersections.

**Edges**: Edges are the connections or links between nodes in a graph. They represent the relationships or interactions between the entities represented by the nodes. Edges can be directed or undirected, depending on whether the relationship has a specific direction: directed edges are drawn with an arrow indicating the direction of the relationship, while undirected edges have none. What an edge means can vary widely depending on the context of your data and the problem you are solving. Here are a few examples of how edges can be defined:

1. **Social Networks**: If your nodes are individuals, an edge could represent a friendship, a follow, or a connection on a social media platform.

2. **Financial Networks**: If your nodes are banks, an edge could represent lending and borrowing relationships. In a financial transaction network, an edge could represent a transaction between two individuals.

3. **Communication Networks**: If your nodes are individual email addresses, an edge could represent the exchange of emails between them.

4. **Collaboration Networks**: If your nodes are scientists, an edge could represent co-authorship on a paper.

5. **Infrastructure Networks**: If your nodes are cities, an edge could represent a direct road or flight connection between them.

In your specific case, as you're dealing with a financial dataset (default/no default), the creation of edges depends on the context:

- If your data includes transactions, you could create edges between entities that have transactions between them.
- If your dataset includes information about co-signers or guarantors on loans, you could create edges between these connected individuals.
- If your entities belong to the same group (like same organization or family), you could draw edges between them.
- You could define edges based on similarity or proximity in the feature space. For instance, you could use a threshold value on a given feature or use a clustering method to group similar entities and then create edges within each group.

In order to create a meaningful graph, it's crucial to have a clear understanding of your data and the relationships you're trying to represent. It's important to choose an edge definition that makes sense for your specific problem and dataset.
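As a minimal sketch of the co-signer idea above, the snippet below builds a small undirected graph with NetworkX. The borrower IDs, the `cosigner_pairs` list, and the `default` labels are made-up illustrative values, not taken from any real dataset.

```python
import networkx as nx

# Hypothetical borrowers with a binary default label (illustrative values only)
borrowers = {"A": 0, "B": 1, "C": 0, "D": 0}

# Hypothetical co-signer relationships: each pair shares a loan guarantee
cosigner_pairs = [("A", "B"), ("B", "C")]

# Co-signing has no direction, so an undirected graph is used
G_example = nx.Graph()
for borrower_id, default_flag in borrowers.items():
    G_example.add_node(borrower_id, default=default_flag)
G_example.add_edges_from(cosigner_pairs, relation="cosigner")

print(G_example.number_of_nodes(), G_example.number_of_edges())
print(dict(G_example.degree()))
```

Borrower `D` has no co-signer, so it stays in the graph as an isolated node. Whether such nodes should be kept depends on the analysis.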

## Centrality measures: Introduction

Centrality measures are techniques used in network analysis to identify the most important nodes in a network. Common choices include degree, closeness, betweenness, and eigenvector centrality, along with the related PageRank, Katz centrality, and HITS scores. Here's a brief explanation of each:

1. **Degree Centrality**: It is simply the number of edges connected to a node. In directed networks, we can differentiate between in-degree and out-degree centralities.

   For an undirected graph, the degree centrality \(C_D(v)\) for a node \(v\) is calculated as:

   <div style="text-align: center; font-weight: bold;">
   \(C_D(v) = \text{deg}(v)\)
   </div>

   where \(\text{deg}(v)\) is the degree of the node \(v\), i.e., the number of edges incident upon \(v\).

   For a directed graph, we can define in-degree centrality and out-degree centrality. The in-degree centrality for a node \(v\) is the number of incoming edges to \(v\), and the out-degree centrality is the number of outgoing edges from \(v\).

   Reference: Freeman, L. C. (2002). Centrality in social networks: Conceptual clarification. Social network: critical concepts in sociology. Londres: Routledge, 1, 238-263.

2. **Closeness Centrality**: It measures how fast information can spread from a given node to other reachable nodes in the network.

   <div style="text-align: center; font-weight: bold;">
   \[C(x) = \frac{1}{\sum_{y}d(y, x)}\]
   </div>

   In this formula, \(d(y, x)\) is the shortest-path distance from \(y\) to \(x\), and the sum is over all nodes \(y\) in the same connected component as \(x\). The idea is that the more central a node is, the closer it is to all other nodes.

   Reference: Freeman, L. C. (2002). Centrality in social networks: Conceptual clarification. Social network: critical concepts in sociology. Londres: Routledge, 1, 238-263.

3. **Betweenness Centrality**: It is a measure of a node's centrality in a network equal to the number of shortest paths from all vertices to all others that pass through that node.

   The mathematical formula for betweenness centrality \(C_B(v)\) for a node \(v\) is:

   <div style="text-align: center; font-weight: bold;">
   \[C_B(v) =\sum_{s,t \in V} \frac{\sigma(s, t|v)}{\sigma(s, t)}\]
   </div>

   In this formula, \(V\) is the set of nodes, \(\sigma(s, t)\) is the total number of shortest paths from node \(s\) to node \(t\), and \(\sigma(s, t|v)\) is the number of those paths that pass through \(v\).

   Reference: Freeman, L. C. (1977). A set of measures of centrality based on betweenness. Sociometry, 35-41.

4. **Eigenvector Centrality**: A node is considered important if it is connected to other important nodes.

   The mathematical formula for eigenvector centrality \(C_E(v)\) for a node \(v\) is defined as:

   <div style="text-align: center; font-weight: bold;">
   \[C_E(v) = \frac{1}{\lambda} \sum_{t \in M(v)} C_E(t)\]
   </div>

   In this formula, \(M(v)\) is the set of the neighbors of \(v\), and \(\lambda\) is a constant. In other words, the eigenvector centrality for a node is the sum of the centrality scores of its neighbors, scaled by a constant factor.

   Eigenvector centrality is calculated iteratively from the centrality of the neighboring nodes, because the importance of a node is determined not only by how many connections it has, but also by how important those connections are.

   The calculation of eigenvector centrality is typically done through the power iteration method. Here's a simplified explanation of the process:

   1. Assign all nodes an initial centrality score. This could be a score of 1 for simplicity.

   2. For each node, calculate its new centrality score as the sum of the centrality scores of its neighbors from the previous iteration.

   3. Normalize the centrality scores so that their sum is 1. This is to prevent the scores from growing or shrinking exponentially in the next iterations.

   4. Repeat steps 2 and 3 until the scores converge, i.e., the scores do not change significantly from one iteration to the next.

   The result is a vector of centrality scores that is the principal eigenvector of the adjacency matrix of the graph, hence the name "eigenvector centrality".

   Eigenvector centrality is a measure of the influence of a node in a network. It assigns relative scores to all nodes in the network based on the concept that connections to high-scoring nodes contribute more to the score of the node in question than equal connections to low-scoring nodes.

   Reference: Bonacich, P. (1987). Power and centrality: A family of measures. American journal of sociology, 92(5), 1170-1182.

5. **PageRank**: PageRank is an algorithm used by Google Search to rank web pages in its search engine results. It works by counting the number and quality of links to a page to produce a rough estimate of how important the page is. The underlying assumption is that more important websites are likely to receive more links from other websites. The mathematical formula for PageRank \(PR(p)\) for a page \(p\) is:

   <div style="text-align: center; font-weight: bold;">
   \[PR(p) = (1 - d) + d \sum_{i \in M(p)} \frac{PR(i)}{L(i)}\]
   </div>

   In this formula, \(M(p)\) is the set of pages that link to \(p\), \(L(i)\) is the number of outbound links on page \(i\), and \(d\) is a damping factor, usually set to 0.85. The idea is that a page's PageRank is derived from the PageRanks of the pages that link to it. Each contributing page \(i\) splits its PageRank evenly across its \(L(i)\) outbound links, so a link from a page with few outbound links passes on a larger share than a link from a page with many.

   Reference: Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. [Link](http://dbpubs.stanford.edu:8090/aux/index-en.html)

6. **Katz centrality**: Katz centrality is a measure of centrality in a network that takes into account both the direct and indirect influence of a node's neighbors. It was introduced by Leo Katz in his paper "A New Status Index Derived from Sociometric Analysis", published in 1953.

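   The Katz centrality of a node \(i\) is defined recursively: it is the sum of the centralities of its neighbors, attenuated by a factor \(\beta\), plus a constant \(\alpha\). Unrolling the recursion shows that a walk of length \(k\) ending at \(i\) contributes with a weight proportional to \(\beta^k\), so distant nodes count less than direct neighbors. The formula for Katz centrality can be expressed as: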

   <div style="text-align: center; font-weight: bold;">
   \(C_{\text{Katz}}(i) = \sum_{j=1}^{n} \beta A_{ij} C_{\text{Katz}}(j) + \alpha\)
   </div>

   where:
   - \(C_{\text{Katz}}(i)\) represents the Katz centrality of node \(i\).
   - \(A_{ij}\) denotes the adjacency matrix element, indicating the presence or absence of an edge between nodes \(i\) and \(j\).
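   - \(\beta\) is an attenuation factor that controls the weight given to indirect paths. For the iteration to converge, it must satisfy \(|\beta| < \frac{1}{\lambda_{\text{max}}}\), where \(\lambda_{\text{max}}\) is the largest eigenvalue of the adjacency matrix.
   - \(\alpha\) is a constant term representing the node's intrinsic centrality.

   Note that NetworkX's `katz_centrality` uses the opposite naming: its `alpha` parameter is the attenuation factor (\(\beta\) here) and its `beta` parameter is the constant term (\(\alpha\) here).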

   The Katz centrality algorithm is iterative. Starting with an initial centrality value for each node, the centrality values are updated iteratively using the above formula until convergence is reached.

   Katz centrality is useful for identifying influential nodes in a network, as it considers both direct and indirect connections. Nodes with higher Katz centrality scores are considered more central or influential within the network.

   Reference: Katz, L. (1953). A new status index derived from sociometric analysis. Psychometrika, 18(1), 39-43.





7. **HITS Algorithm**: The HITS (Hyperlink-Induced Topic Search) algorithm, also known as hubs and authorities, is a link analysis algorithm for rating web pages, developed by Jon Kleinberg. The idea stems from an observation about the early web: certain pages, known as hubs, served as large directories. They were not authoritative sources themselves, but compiled broad catalogs of links that led users to other, authoritative pages.

   In the context of HITS, each node in a graph has two scores: an authority score and a hub score.

   The Authority Score a(i) of a node i is computed as the sum of the hub scores of each node j that points to i. This can be represented mathematically as:

   <div style="text-align: center; font-weight: bold;">
   \[
   a(i) = \sum_{j \in M(i)} h(j)
   \]
   </div>

   where \(M(i)\) is the set of nodes that point to \(i\) and \(h(j)\) is the hub score of node \(j\).

   The Hub Score \(h(i)\) of a node \(i\) is computed as the sum of the authority scores of each node \(j\) that \(i\) points to. This can be represented mathematically as:
   
   <div style="text-align: center; font-weight: bold;">
   \[
   h(i) = \sum_{j \in N(i)} a(j)
   \]
   </div>

   where N(i) is the set of nodes that i points to and a(j) is the authority score of node j.

   The authority and hub scores are calculated iteratively until they converge.

   The HITS algorithm was first proposed by Jon Kleinberg in his work:

   Reference: Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5), 604-632.
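The power-iteration steps described under eigenvector centrality can be sketched as follows. This is a minimal illustration, not how NetworkX implements it: `eigenvector_centrality_power_iteration` is a hypothetical helper, the toy graph is made up, and the plain iteration assumes a connected, non-bipartite graph with at least one edge (on bipartite graphs the scores can oscillate instead of converging).

```python
import networkx as nx
import numpy as np


def eigenvector_centrality_power_iteration(graph, max_iter=1000, tol=1e-10):
    """Illustrative power iteration (a hypothetical helper, not a NetworkX function)."""
    nodes = list(graph.nodes())
    # Unweighted adjacency matrix: every edge counts as 1
    adjacency = nx.to_numpy_array(graph, nodelist=nodes, weight=None)

    # Step 1: give every node the same initial score
    scores = np.ones(len(nodes)) / len(nodes)

    for _ in range(max_iter):
        # Step 2: new score = sum of the neighbors' previous scores
        new_scores = adjacency @ scores
        # Step 3: normalize so that the scores sum to 1
        new_scores = new_scores / new_scores.sum()
        # Step 4: stop once the scores no longer change noticeably
        converged = np.abs(new_scores - scores).sum() < tol
        scores = new_scores
        if converged:
            break

    return dict(zip(nodes, scores))


# Toy graph: a triangle (0, 1, 2) with one extra node attached to node 2
toy_graph = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])
scores = eigenvector_centrality_power_iteration(toy_graph)
print(scores)

# NetworkX normalizes to unit Euclidean length, so rescale before comparing
nx_scores = nx.eigenvector_centrality(toy_graph, max_iter=1000)
nx_total = sum(nx_scores.values())
print({node: value / nx_total for node, value in nx_scores.items()})
```

After rescaling, both sets of scores should agree closely, and node 2, which has the most connections and sits in the triangle, should rank highest.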

## Application in Python

Now, if you're looking to compute centrality measures using features as inputs, you could consider the features as attributes of the nodes or edges in your network. For instance, if you're predicting default/non-default, each node could be an individual (or entity), and the features could be attributes of those individuals. The edges could represent some relationship between them.

However, the computation of centrality doesn't typically involve the use of predictive features as inputs, but rather the structure of the network itself (who is connected to whom). If you have some prediction task related to the nodes of the network, the centrality measures can be used as features to help predict that task.

For example, if you are trying to predict default/non-default for individuals in a financial network, you could calculate the centrality measures for each individual (node) in the network. These centrality measures could then serve as input features to a machine learning model (along with any other features you have about the individuals) to predict default/non-default.

Here is a general way you might approach this using Python's NetworkX library:

```python
import networkx as nx

# Create a graph object
G = nx.Graph()

# Add nodes and edges to your graph using your data
# You can also add attributes to nodes and edges as keyword arguments if you have additional features
# For example:
# G.add_node(node_id, feature1=value1, feature2=value2)
# G.add_edge(node1_id, node2_id, feature1=value1, feature2=value2)

# Calculate centrality measures
degree_centrality = nx.degree_centrality(G)
closeness_centrality = nx.closeness_centrality(G)
betweenness_centrality = nx.betweenness_centrality(G)
eigenvector_centrality = nx.eigenvector_centrality(G)
page_rank = nx.pagerank(G, alpha=0.85)  # alpha is the damping parameter
katz_centrality = nx.katz_centrality(G, alpha=0.1, beta=1.0)
hubs, authorities = nx.hits(G)

# You can then add these as node attributes (which could serve as additional features for your prediction task)
for node_id in G.nodes():
    G.nodes[node_id]['degree_centrality'] = degree_centrality[node_id]
    G.nodes[node_id]['closeness_centrality'] = closeness_centrality[node_id]
    G.nodes[node_id]['betweenness_centrality'] = betweenness_centrality[node_id]
    G.nodes[node_id]['eigenvector_centrality'] = eigenvector_centrality[node_id]
```

These features can then be fed into a machine learning model to predict default/non-default. The exact method of doing this will depend on the specific machine learning model you are using.
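As a rough sketch of that last step, the snippet below collects centrality scores into a feature table and fits a logistic regression. The graph is NetworkX's built-in karate club network, and its `club` attribute is used purely as a stand-in for a default/no-default label. Both are assumptions for illustration and have nothing to do with credit data.

```python
import networkx as nx
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Stand-in graph and labels (illustrative only): club membership plays the role of 'Default'
G_demo = nx.karate_club_graph()
labels = pd.Series(
    {node: int(G_demo.nodes[node]["club"] == "Officer") for node in G_demo.nodes()},
    name="Default",
)

# One row per node, one column per centrality measure
features = pd.DataFrame({
    "degree_centrality": nx.degree_centrality(G_demo),
    "closeness_centrality": nx.closeness_centrality(G_demo),
    "betweenness_centrality": nx.betweenness_centrality(G_demo),
    "eigenvector_centrality": nx.eigenvector_centrality(G_demo, max_iter=1000),
})

# Align labels with the feature rows and fit a simple classifier
model = LogisticRegression(max_iter=1000)
model.fit(features, labels.loc[features.index])
print(model.score(features, labels.loc[features.index]))
```

The score printed here is computed on the training data, so it only confirms that the pipeline runs. A real evaluation would need a held-out set.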

## Edges based on feature similarity

Defining edges based on similarity or proximity in the feature space is a common technique used in the field of network science. Here is a basic example of how to create edges based on Euclidean distance, a common measure of proximity in the feature space:

```python
import networkx as nx
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.datasets import make_classification

# Create a synthetic dataset
X, y = make_classification(n_samples=100, n_features=5, n_informative=2, n_redundant=0, random_state=1)

# Convert feature matrix into a DataFrame
df = pd.DataFrame(X, columns=['Feature1', 'Feature2', 'Feature3', 'Feature4', 'Feature5'])
df['Default'] = y

# Create a graph
G = nx.Graph()

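# Add nodes to the graph, storing each row's values as node attributes
for index, row in df.iterrows():
    G.add_node(index, **row.to_dict())

# Compute pairwise Euclidean distances on the features only
# (the 'Default' target is excluded so it does not influence the edges)
distances = euclidean_distances(df.drop(columns='Default').values)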

# Define a threshold to decide if an edge should be added
threshold = distances.mean()

# Add edges between nodes that are closer than the threshold distance
for i in range(distances.shape[0]):
    for j in range(i+1, distances.shape[0]):  # we only need to look at half the matrix
        if distances[i, j] < threshold:
            G.add_edge(i, j)
```

This will create a graph where each node represents an individual, and an edge exists between any two individuals if the Euclidean distance between their feature vectors is less than the average Euclidean distance across all pairs of individuals.

Remember that the choice of distance metric (Euclidean in this case) and threshold is crucial and can greatly affect the resulting graph. You may want to experiment with different distance metrics (e.g., Manhattan, Cosine, etc.) and thresholds to find what works best for your specific case.

Also, this approach does not scale well for large datasets due to the computation of the pairwise distance matrix. For large datasets, consider using more scalable methods like Nearest Neighbors or approximate methods for large-scale similarity computations.
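One way to follow the nearest-neighbors suggestion is sketched below with scikit-learn's `NearestNeighbors`. Each node is connected to its `k` closest points instead of to every point within a threshold. The value `k = 5` and the names `X_demo` and `G_knn` are arbitrary choices for illustration.

```python
import networkx as nx
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

# Synthetic features, generated the same way as in the example above
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=2, n_redundant=0, random_state=1
)

k = 5  # number of neighbors per node (an arbitrary illustrative choice)

# Ask for k + 1 neighbors because each point is returned as its own nearest neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_demo)
distances_knn, indices_knn = nn.kneighbors(X_demo)

G_knn = nx.Graph()
G_knn.add_nodes_from(range(X_demo.shape[0]))
for i in range(X_demo.shape[0]):
    # Skip the first column (the point itself) and guard against self-loops
    for dist, j in zip(distances_knn[i, 1:], indices_knn[i, 1:]):
        if j != i:
            G_knn.add_edge(i, int(j), distance=float(dist))

print(G_knn.number_of_nodes(), G_knn.number_of_edges())
```

Because the graph is undirected, a node can end up with more than `k` neighbors when other nodes select it as one of their nearest points.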

## Edges for a weighted graph based on feature similarity

If you want to use a weighted graph, you can add weights to the edges that correspond to the similarity or proximity in the feature space. In the context of the previous example, you could use the inverse of the Euclidean distance as the edge weight. This means that nodes that are more similar (smaller distance) will have a stronger connection (larger weight).

Here's how you would modify the previous example to create a weighted graph:

```python
import networkx as nx
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.datasets import make_classification

# Create a synthetic dataset
X, y = make_classification(n_samples=100, n_features=5, n_informative=2, n_redundant=0, random_state=1)

# Convert feature matrix into a DataFrame
df = pd.DataFrame(X, columns=['Feature1', 'Feature2', 'Feature3', 'Feature4', 'Feature5'])
df['Default'] = y

# Create a graph
G = nx.Graph()

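# Add nodes to the graph, storing each row's values as node attributes
for index, row in df.iterrows():
    G.add_node(index, **row.to_dict())

# Compute pairwise Euclidean distances on the features only
# (the 'Default' target is excluded so it does not influence the edges)
distances = euclidean_distances(df.drop(columns='Default').values)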

# Define a threshold to decide if an edge should be added
threshold = distances.mean()

# Add edges between nodes that are closer than the threshold distance
# Use the inverse distance as the edge weight
for i in range(distances.shape[0]):
    for j in range(i+1, distances.shape[0]):  # we only need to look at half the matrix
        if distances[i, j] < threshold:
            weight = 1.0 / distances[i, j]  # inverse distance as weight
            G.add_edge(i, j, weight=weight)
```

In this example, the edge weight is the inverse of the Euclidean distance, so nodes that are closer in the feature space have a higher weight. You can use any function of the distance as the weight, depending on your specific requirements. Be careful with division by zero when using the inverse distance: two identical feature vectors have a distance of zero, so you may want to add a small constant to the denominator.
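The sketch below shows two ways to turn distances into weights without dividing by zero: adding a small constant to the denominator, and using a Gaussian kernel as an alternative function of the distance. The three points, `epsilon`, and the bandwidth `sigma` are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Three toy points; the first two are identical, so their distance is zero
points = np.array([[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]])
d = euclidean_distances(points)

# Option 1: inverse distance with a small constant in the denominator
epsilon = 1e-9  # illustrative value; it caps the weight of identical points at 1 / epsilon
inverse_weights = 1.0 / (d + epsilon)

# Option 2: Gaussian kernel, which stays in (0, 1] and equals 1 at zero distance
sigma = d[d > 0].mean()  # illustrative bandwidth: mean of the non-zero distances
gaussian_weights = np.exp(-(d ** 2) / (2 * sigma ** 2))

print(inverse_weights[0, 2], gaussian_weights[0, 2])
```

The Gaussian kernel avoids the very large weights that the inverse distance produces for near-duplicates, at the cost of having to choose a bandwidth.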

## From edge weight to centrality measures

Now that we have a graph, we can compute several centrality measures. Here's how you can compute degree, closeness, betweenness, and eigenvector centrality for the weighted graph. Note that the interpretation of these measures changes slightly when dealing with weighted graphs.

```python
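# Shortest-path measures interpret the edge attribute as a distance (cost),
# so convert the similarity weights (inverse distances) back into distances
for u, v, data in G.edges(data=True):
    data['distance'] = 1.0 / data['weight']

# Compute centrality measures
degree_centrality = nx.degree_centrality(G)
closeness_centrality = nx.closeness_centrality(G, distance='distance')
betweenness_centrality = nx.betweenness_centrality(G, weight='distance')
eigenvector_centrality = nx.eigenvector_centrality(G, weight='weight')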

# Add these as node attributes
for node_id in G.nodes():
    G.nodes[node_id]['degree_centrality'] = degree_centrality[node_id]
    G.nodes[node_id]['closeness_centrality'] = closeness_centrality[node_id]
    G.nodes[node_id]['betweenness_centrality'] = betweenness_centrality[node_id]
    G.nodes[node_id]['eigenvector_centrality'] = eigenvector_centrality[node_id]
```

Here's a brief explanation of each of these measures:

1. **Degree Centrality**: In a weighted graph, degree centrality is still based on the number of edges connected to a node (NetworkX's `degree_centrality` reports this count divided by the maximum possible degree, \(n - 1\)). Nodes with more connections therefore have a higher degree centrality. The edge weights are ignored, so the strength of the connections is not considered in this measure.

2. **Closeness Centrality**: This measure is based on the reciprocal of the sum of the shortest-path distances from a node to all other nodes in the graph, so a higher value implies that a node is more central. In a weighted graph, the attribute passed as `distance` is treated as the length of each edge, so larger values mean the nodes are farther apart. Weights that encode similarity, such as the inverse distance used here, should therefore be converted back into a distance (for example, `1 / weight`) before computing closeness. Otherwise, strongly connected nodes appear far apart.

3. **Betweenness Centrality**: This is a measure of the extent to which a node lies on paths between other nodes. Nodes with high betweenness centrality have a large influence on the transfer of items through the network, under the assumption that item transfer follows the shortest paths. When weights are considered, the shortest path is the one with the smallest total weight, so the weight is again interpreted as a distance or cost. With a proper distance attribute, nodes that lie on paths made of the closest (strongest) connections have higher betweenness centrality.

4. **Eigenvector Centrality**: This is a measure of the influence of a node in a network. It assigns relative scores to all nodes in the network based on the principle that connections to nodes with high score contribute more to the score of the node in question. In a weighted graph, connections to nodes with high scores and high edge weights contribute more to the score of the node.

Remember that each of these measures captures a different aspect of a node's centrality, and the appropriate measure to use depends on the specifics of your problem and dataset.
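NetworkX's shortest-path measures treat the edge attribute they are given as a cost, which matters when the weights encode similarity. The toy graph below (made-up nodes and weights) computes betweenness twice: once with the similarity passed directly, and once with a derived `distance` attribute.

```python
import networkx as nx

# Toy graph: 'weight' is a similarity (larger means more strongly connected)
G_toy = nx.Graph()
G_toy.add_edge("a", "b", weight=10.0)
G_toy.add_edge("b", "c", weight=5.0)
G_toy.add_edge("a", "c", weight=1.0)

# Derive a distance from the similarity so that strong ties become short edges
for u, v, data in G_toy.edges(data=True):
    data["distance"] = 1.0 / data["weight"]

# Similarity misread as a cost: the weak a-c tie looks like the cheapest edge
similarity_as_cost = nx.betweenness_centrality(G_toy, weight="weight")

# Distance used as the cost: paths follow the strong a-b and b-c ties
distance_as_cost = nx.betweenness_centrality(G_toy, weight="distance")

print(similarity_as_cost)
print(distance_as_cost)
```

With the similarity passed directly, node `c` should come out as the broker between `a` and `b`, even though its ties are the weakest. With the derived distance, node `b` takes that role.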

## Centrality for binary targets

With your data, which includes a binary target (default or no default) and many numerical features, you can explore several analytical routes.

The centrality measures we've computed can be used to examine the relationships between the nodes (which represent your entities - individuals or organizations) in the context of the target variable (default or no default). For instance, you might find that entities with high degree centrality are more (or less) likely to default, indicating a potential relationship between the entity's position in the network and their default risk.

Here's how you can explore these relationships:

```python
import matplotlib.pyplot as plt

# Convert node attributes to a DataFrame
df_node_attributes = pd.DataFrame(dict(G.nodes(data=True))).T

# Plot centrality measures against the default status
fig, axs = plt.subplots(2, 2, figsize=(12, 12))

# Degree centrality
axs[0, 0].scatter(df_node_attributes['Default'], df_node_attributes['degree_centrality'])
axs[0, 0].set_xlabel('Default')
axs[0, 0].set_ylabel('Degree Centrality')

# Closeness centrality
axs[0, 1].scatter(df_node_attributes['Default'], df_node_attributes['closeness_centrality'])
axs[0, 1].set_xlabel('Default')
axs[0, 1].set_ylabel('Closeness Centrality')

# Betweenness centrality
axs[1, 0].scatter(df_node_attributes['Default'], df_node_attributes['betweenness_centrality'])
axs[1, 0].set_xlabel('Default')
axs[1, 0].set_ylabel('Betweenness Centrality')

# Eigenvector centrality
axs[1, 1].scatter(df_node_attributes['Default'], df_node_attributes['eigenvector_centrality'])
axs[1, 1].set_xlabel('Default')
axs[1, 1].set_ylabel('Eigenvector Centrality')

plt.tight_layout()
plt.show()
```

You can also calculate correlations between these centrality measures and the target variable to see if there are any strong relationships.

```python
# Calculate correlations
correlations = df_node_attributes[['Default', 'degree_centrality', 'closeness_centrality', 'betweenness_centrality', 'eigenvector_centrality']].astype(float).corr()

# Display correlations with the target variable
print(correlations['Default'])
```

It's important to note that this is exploratory analysis and correlation does not imply causation. Further investigation would be needed to understand any potential causal relationships.
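A complementary check is to compare the average centrality per class. The sketch below does this on a tiny made-up table (the values are illustrative, not results) and also shows that the Pearson correlation with a binary target is the point-biserial correlation, computed here with SciPy's `pointbiserialr`.

```python
import pandas as pd
from scipy.stats import pointbiserialr

# Made-up centrality values for six nodes (illustrative only)
toy = pd.DataFrame({
    "Default": [0, 0, 0, 1, 1, 1],
    "degree_centrality": [0.10, 0.20, 0.15, 0.40, 0.35, 0.50],
})

# Average centrality for each class
group_means = toy.groupby("Default")["degree_centrality"].mean()
print(group_means)

# Point-biserial correlation and its p-value
r, p_value = pointbiserialr(toy["Default"], toy["degree_centrality"])

# The same coefficient via pandas' Pearson correlation
pearson_r = toy["Default"].corr(toy["degree_centrality"])
print(r, pearson_r, p_value)
```

The p-value gives a first indication of whether a difference between the classes could be due to chance, though with network data the observations are not independent, so it should be read with caution.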

In the long run, these network-based features can also be used in conjunction with your other numerical features to build a predictive model. By including the network features, the model may be able to capture complex relationships that are not captured by the numerical features alone.

## Example

### Create data

In [None]:
import networkx as nx
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.datasets import make_classification

# Create a synthetic dataset
X, y = make_classification(n_samples=2000, n_features=5, n_informative=2, n_redundant=0, random_state=1)

# Automatically name the feature columns based on the number of features
feature_names = [f"Feature{i}" for i in range(1, X.shape[1] + 1)]
# Convert feature matrix into a DataFrame
df = pd.DataFrame(X, columns=feature_names)

# Add the target column and keep the full column list (features + target) for later cells
df['Default'] = y
column_names = feature_names + ["Default"]
df


### Scale features

In [None]:
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Fit the scaler to the features (excluding the 'Default' column)
scaler.fit(df.drop('Default', axis=1))

# Transform the features
scaled_features = scaler.transform(df.drop('Default', axis=1))

# Create a new DataFrame for the scaled features
scaled_df = pd.DataFrame(scaled_features, columns=column_names[:-1])  # Exclude the 'Default' column name

# Add the 'Default' column back into the DataFrame
scaled_df['Default'] = df['Default']

print(scaled_df)
df = scaled_df 

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

def plot_feature_distributions(df_scaled, target_col, file_name = 'feature_distributions.pdf'):
    with PdfPages(file_name) as pdf:
        # For each feature
        for col in df_scaled.columns:
            # Exclude the target column
            if col != target_col:
                plt.figure(figsize=(10, 5))
                
                # Plot the distribution of this feature for each class
                sns.kdeplot(data=df_scaled, x=col, hue=target_col, fill=True)
                
                plt.title(f'Distribution of {col} by Class')
                
                # Save the current figure to the pdf, then close it to free memory
                pdf.savefig()
                plt.close()


# Call the function
plot_feature_distributions(scaled_df, 'Default')
import webbrowser

webbrowser.open_new(r'feature_distributions.pdf')


### Remove highly correlated features

In [None]:
import pandas as pd

def remove_highly_correlated_features(df, threshold=0.8):
    """
    Remove highly correlated features from a DataFrame based on a given threshold.

    Parameters:
    df (pandas.DataFrame): The input DataFrame.
    threshold (float): The threshold for high correlation. Default is 0.8.

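    Returns:
    df_filtered (pandas.DataFrame): The filtered DataFrame with highly correlated features removed.
    correlation_matrix (pandas.DataFrame): The correlation matrix of the features (target excluded).
    features_to_remove (list): The names of the features that were removed.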
    """

    features_to_remove = []

    # Exclude the "Default" column from correlation analysis
    df_subset = df.drop("Default", axis=1)

    # Calculate the correlation matrix
    correlation_matrix = df_subset.corr()

    # Identify highly correlated features
    for i in range(len(correlation_matrix.columns)):
        for j in range(i+1, len(correlation_matrix.columns)):
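            if abs(correlation_matrix.iloc[i, j]) > threshold:
                # Append the feature to the removal list (once, even if it correlates with several others)
                feature = correlation_matrix.columns[j]
                if feature not in features_to_remove:
                    features_to_remove.append(feature)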

    # Remove the highly correlated features
    df_filtered = df.drop(features_to_remove, axis=1)

    return df_filtered, correlation_matrix, features_to_remove


In [None]:
# Remove highly correlated features
df, correlation_matrix, features_to_remove = remove_highly_correlated_features(df, threshold=0.9)
print(correlation_matrix)
# Print the filtered DataFrame
print(df)

### Create Graph

In [None]:
# Create a graph
G = nx.Graph()


# Add nodes to the graph
for index, row in df.iterrows():
    G.add_node(index, attr_dict=row.to_dict())

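# Compute pairwise Euclidean distances on the features only
# (the 'Default' target is excluded so it does not influence the edges)
distances = euclidean_distances(df.drop(columns='Default').values)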

# Define a threshold to decide if an edge should be added
threshold = distances.mean()

# Add edges between nodes that are closer than the threshold distance
# Use the inverse distance as the edge weight
for i in range(distances.shape[0]):
    for j in range(i+1, distances.shape[0]):  # we only need to look at half the matrix
        if distances[i, j] < threshold:
            weight = 1.0 / distances[i, j]  # inverse distance as weight
            G.add_edge(i, j, weight=weight)


### Visualize graph

In [None]:
# Generate positions for all nodes in the graph G
pos = nx.spring_layout(G)

# Create a color map
color_map = []
for node in G:
    if G.nodes[node]['attr_dict']['Default'] == 1:
        color_map.append('red')
    else: 
        color_map.append('blue')

# Draw the graph with node colors
nx.draw_networkx_nodes(G, pos, node_color=color_map)
nx.draw_networkx_edges(G, pos, edge_color='grey')

plt.show()


In [None]:
"""
We need pygraphviz for that.
import networkx as nx
import matplotlib.pyplot as plt
import pygraphviz as pgv
from networkx.drawing.nx_agraph import graphviz_layout

# Convert the NetworkX graph G to a PyGraphviz graph
A = nx.nx_agraph.to_agraph(G)

# Create a layout for the graph
pos = graphviz_layout(G, prog='dot')

# Draw nodes with color coding for the 'Default' attribute
nx.draw_networkx_nodes(G, pos, node_color=color_map)
nx.draw_networkx_edges(G, pos, edge_color='grey')

# Show the plot
plt.show()
"""


### Visualize data

There are several ways you could visualize this dataset. Because it has 5 numerical features, you could use pairwise scatterplots to visualize the relationship between pairs of features. You could also use histograms or boxplots to visualize the distribution of each feature. Here are examples of how to create these visualizations using matplotlib and seaborn:

1. **Pairwise Scatterplots:**

```python
import seaborn as sns

# Pairplot of the dataset
sns.pairplot(df, hue='Default')
```
The above code will generate a pairwise scatterplot matrix. Each plot represents the relationship between two features, and data points are colored based on their 'Default' status. The diagonal of the matrix shows the distribution of each individual feature, split by 'Default' category.

2. **Boxplots:**

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Boxplots for each feature
plt.figure(figsize=(20,10))

for i, column in enumerate(df.columns[:-1]):  # excluding the 'Default' column
    plt.subplot(2, 3, i + 1)
    sns.boxplot(x='Default', y=column, data=df)

plt.tight_layout()
plt.show()
```
The boxplots visualize the distribution of each feature for each 'Default' status separately. You can observe the median, quartiles and possible outliers for each feature in each 'Default' category.

Remember, these are just simple ways to visualize your dataset and there are many other visualization techniques that can provide deeper insights depending on the nature of your data. For example, a correlation heatmap could be useful to identify highly correlated features.
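A correlation heatmap like the one mentioned above could be drawn as in the following sketch. It rebuilds a small synthetic dataset so that it runs on its own; `demo_df` and the color settings are illustrative choices.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import make_classification

# Small synthetic dataset, generated the same way as in the examples above
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=2, n_redundant=0, random_state=1
)
demo_df = pd.DataFrame(X_demo, columns=[f"Feature{i}" for i in range(1, 6)])
demo_df["Default"] = y_demo

# Pairwise Pearson correlations between all columns, including the target
corr = demo_df.corr()

# Draw the matrix as an annotated heatmap on a fixed [-1, 1] color scale
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation heatmap")
plt.tight_layout()
plt.show()
```

Fixing the color scale to the range from -1 to 1 keeps the colors comparable across datasets.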

In [None]:
import seaborn as sns

# Pairplot of the dataset
sns.pairplot(df, hue='Default')


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

def create_boxplots(df):
    """
    Create boxplots for each feature in the DataFrame.

    Parameters:
    df (pandas.DataFrame): The input DataFrame.

    Returns:
    None
    """
    # Determine the number of features and calculate the appropriate number of subplots
    num_features = len(df.columns) - 1  # excluding the 'Default' column
    num_rows = (num_features - 1) // 3 + 1
    num_cols = min(num_features, 3)

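    # Create subplots; squeeze=False always returns a 2D array of axes,
    # so flatten() also works when there is only a single subplot
    fig, axes = plt.subplots(num_rows, num_cols, figsize=(20, 10), squeeze=False)

    # Flatten the 2D axes array for simple indexing
    axes = axes.flatten()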

    # Plot boxplots for each feature
    for i, column in enumerate(df.columns[:-1]):  # excluding the 'Default' column
        ax = axes[i]
        sns.boxplot(x='Default', y=column, data=df, ax=ax)

    # Remove any unused subplots
    if num_features < len(axes):
        for j in range(num_features, len(axes)):
            axes[j].remove()

    # Adjust the layout
    plt.tight_layout()

    # Show the plot
    plt.show()

# Call the function to create boxplots for each feature
create_boxplots(df)

### Compute minimum spanning tree

In [None]:
import networkx as nx

# Compute the minimum spanning tree (uses the 'weight' edge attribute by default)
mst = nx.minimum_spanning_tree(G)

# Create an edge filter for the MST
def edge_filter(u, v):
    return mst.has_edge(u, v)

# Create a filtered subgraph of G that keeps only the MST edges
filtered_graph = G.edge_subgraph((u, v) for u, v in G.edges() if edge_filter(u, v))
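
To see what `nx.minimum_spanning_tree` keeps, here is a minimal sketch on a hand-made four-node graph. The graph `toy` and its weights are invented for illustration and are unrelated to `G`.

```python
import networkx as nx

# Toy weighted graph: the triangle A-B-C plus a pendant node D
toy = nx.Graph()
toy.add_weighted_edges_from([
    ("A", "B", 1),
    ("B", "C", 2),
    ("A", "C", 3),  # heaviest edge of the triangle, so the MST drops it
    ("C", "D", 4),  # only connection to D, so the MST must keep it
])

toy_mst = nx.minimum_spanning_tree(toy)

# Normalise edge orientation so the printed list is stable
mst_edges = sorted(tuple(sorted(edge)) for edge in toy_mst.edges())
total_weight = toy_mst.size(weight="weight")

print(mst_edges)
print(total_weight)
```

The tree connects all four nodes with three edges and leaves out A-C, the most expensive edge of the only cycle.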

In [None]:
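# Draw the full graph G; the MST edges (filtered_graph) are highlighted in the next cell
pos = nx.spring_layout(G)
nx.draw(G, pos, with_labels=True)
# Edge weights keyed by (u, v); collected here but not drawn
labels = nx.get_edge_attributes(G, 'weight')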

In [None]:
nx.draw_networkx_edges(filtered_graph, pos, edge_color='r', width=2)  # Highlight filtered edges in red
plt.show()

In [None]:
# Generate positions for all nodes in the graph G
pos = nx.spring_layout(G)

# Create a color map
color_map = []
for node in G:
    if G.nodes[node]['attr_dict']['Default'] == 1:
        color_map.append('red')
    else: 
        color_map.append('blue')

# Draw the graph with node colors
nx.draw_networkx_nodes(G, pos, node_color=color_map)
nx.draw_networkx_edges(G, pos, edge_color='grey')

plt.show()


### Centrality

In [None]:
nx.betweenness_centrality?

In [None]:
# From here on, G refers to the MST-filtered subgraph
G = filtered_graph
# Compute centrality measures
# Note: closeness and betweenness treat 'weight' as a distance,
# while eigenvector centrality and PageRank treat it as connection strength
degree_centrality = nx.degree_centrality(G)
closeness_centrality = nx.closeness_centrality(G, distance='weight')
betweenness_centrality = nx.betweenness_centrality(G, weight='weight')
eigenvector_centrality = nx.eigenvector_centrality(G, weight='weight')

# Compute more centrality measures
pagerank = nx.pagerank(G, weight='weight')
# HITS returns two dictionaries keyed by node: hub scores and authority scores
# (on an undirected graph the two coincide)
hubs, authorities = nx.hits(G, max_iter=1000)

# Compute other network measures
#avg_shortest_path_length = nx.average_shortest_path_length(G)
#density = nx.density(G)
#num_connected_components = nx.number_connected_components(G)
#avg_clustering_coefficient = nx.average_clustering(G)

# Add these as node attributes
for node_id in G.nodes():
    G.nodes[node_id]['degree_centrality'] = degree_centrality[node_id]
    G.nodes[node_id]['closeness_centrality'] = closeness_centrality[node_id]
    G.nodes[node_id]['betweenness_centrality'] = betweenness_centrality[node_id]
    G.nodes[node_id]['eigenvector_centrality'] = eigenvector_centrality[node_id]

    G.nodes[node_id]['pagerank'] = pagerank[node_id]
    G.nodes[node_id]['hub_score'] = hubs[node_id]
    G.nodes[node_id]['authority_score'] = authorities[node_id]
#    G.nodes[node_id]['average_shortest_path_length'] = avg_shortest_path_length
#    G.nodes[node_id]['density'] = density
#    G.nodes[node_id]['num_connected_components'] = num_connected_components
#    G.nodes[node_id]['average_clustering_coefficient'] = avg_clustering_coefficient

# Note: average shortest path length, density, number of connected components,
# and average clustering coefficient are properties of the network as a whole,
# not of individual nodes, so storing them per node would repeat one value on every node.
# They are more useful for comparing different graphs or subgraphs.
# (nx.average_shortest_path_length additionally requires a connected graph.)
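
The difference between per-node and whole-graph measures is easiest to see on a tiny example. This is a minimal sketch on a five-node star graph (one hub, four leaves), chosen only because its values can be checked by hand; it is not the notebook's graph.

```python
import networkx as nx

star = nx.star_graph(4)  # node 0 is the hub, nodes 1-4 are leaves

# Per-node measures: one value for each node
degree = nx.degree_centrality(star)
betweenness = nx.betweenness_centrality(star)
closeness = nx.closeness_centrality(star)

# Whole-graph measures: a single value for the entire network
density = nx.density(star)
avg_path_length = nx.average_shortest_path_length(star)

print("hub: ", degree[0], betweenness[0], closeness[0])
print("leaf:", degree[1], betweenness[1], closeness[1])
print("graph:", density, avg_path_length)
```

The hub scores the maximum on all three centralities because every shortest path between leaves passes through it, while density and average path length describe the star as a whole.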



In [None]:
# Convert node attributes to a DataFrame and flatten the column structure
df_node_attributes = pd.json_normalize(list(dict(G.nodes(data=True)).values()))

# Calculate correlations
correlations = df_node_attributes.astype(float).corr()

# Display correlations with the target variable sorted in descending order
print(correlations['attr_dict.Default'].sort_values(ascending=False))


In [None]:
correlations

In [None]:
import seaborn as sns

# Convert node attributes to a DataFrame and flatten the column structure
df_node_attributes = pd.json_normalize(list(dict(G.nodes(data=True)).values()))

# Calculate correlations
correlations = df_node_attributes.astype(float).corr()

# Mask the upper triangle (including the diagonal); the matrix is symmetric, so only the lower triangle is drawn
mask = np.triu(np.ones_like(correlations, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(correlations, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

plt.show()

In [None]:
# Call the function
import webbrowser
plot_feature_distributions(
    df_node_attributes,
    'attr_dict.Default',
    file_name="feature_distributions_with_centrality.pdf")

webbrowser.open_new(r'feature_distributions_with_centrality.pdf')

In [None]:
def plot_centrality_measures(df_node_attributes):
    """Scatter each centrality measure against the 'attr_dict.Default' target."""
    measures = [
        ('degree_centrality', 'Degree Centrality'),
        ('closeness_centrality', 'Closeness Centrality'),
        ('betweenness_centrality', 'Betweenness Centrality'),
        ('eigenvector_centrality', 'Eigenvector Centrality'),
    ]
    fig, axs = plt.subplots(2, 2, figsize=(12, 12))

    # One panel per measure, filled row by row
    for ax, (column, label) in zip(axs.flatten(), measures):
        ax.scatter(df_node_attributes['attr_dict.Default'], df_node_attributes[column])
        ax.set_xlabel('attr_dict.Default')
        ax.set_ylabel(label)

    plt.tight_layout()
    plt.show()


In [None]:
plot_centrality_measures(df_node_attributes)

In [None]:
import os

# Convert the Jupyter notebook to slides
os.system("jupyter nbconvert 0.1_data_preprocessing.ipynb --to slides")

In [None]:
import os

# Convert the HTML slides to PDF
os.system("pandoc 0.1_data_preprocessing.slides.html -t beamer -o your_notebook.pdf")

In [None]:
!brew install pandoc

## Predictive Model

Now that we have our dataset augmented with centrality measures, we can use it to train a predictive model. Let's use a simple logistic regression model from scikit-learn as an example.

Firstly, you should split your dataset into training and testing sets. This is a common practice in machine learning to evaluate how well your model can generalize to unseen data.

```python
from sklearn.model_selection import train_test_split

# Split the data into train and test sets
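# The target column is named 'attr_dict.Default' after flattening the node attributes
X_train, X_test, y_train, y_test = train_test_split(
    df_node_attributes.drop('attr_dict.Default', axis=1),  # features
    df_node_attributes['attr_dict.Default'],               # target
    test_size=0.2,
    random_state=42)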
```

Now, let's train a logistic regression model on the training set:

```python
from sklearn.linear_model import LogisticRegression

# Initialize a Logistic Regression model
model = LogisticRegression(max_iter=1000)

# Train the model
model.fit(X_train, y_train)
```

Now that the model is trained, you can use it to predict the 'Default' status on the test set:

```python
# Predict the 'Default' status on the test set
y_pred = model.predict(X_test)
```

Finally, we can evaluate the performance of the model by computing metrics like accuracy, precision, recall and the F1 score:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Compute evaluation metrics
print('Accuracy: ', accuracy_score(y_test, y_pred))
print('Precision: ', precision_score(y_test, y_pred))
print('Recall: ', recall_score(y_test, y_pred))
print('F1 score: ', f1_score(y_test, y_pred))
```
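
To make the four metrics concrete, they can be computed by hand from the counts of true/false positives and negatives. The sketch below uses made-up label lists and an illustrative helper name, `binary_metrics`; it is meant to show the definitions, not to replace the scikit-learn functions.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 4 actual defaults, 6 non-defaults
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

metrics = binary_metrics(y_true_toy, y_pred_toy)
print(metrics)
```

Here accuracy looks reasonable even though half of the actual defaults are missed, which is why recall and F1 are worth reporting alongside it.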

This will give you a basic understanding of the performance of your model. Note that logistic regression is a relatively simple model, and depending on your dataset, you might achieve better performance with more complex models, such as Random Forests, Gradient Boosting Machines or Neural Networks. Also, always consider performing model validation (e.g., k-fold cross-validation) and hyperparameter tuning for a more reliable and better performing model.
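
As a rough illustration of k-fold cross-validation, the sketch below scores a logistic regression on five stratified folds. It assumes synthetic data from `make_classification` in place of `df_node_attributes`, and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real feature matrix and target
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=42)

# Stratified folds keep the class ratio roughly equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One F1 score per fold; each fold is held out once while the rest is used for training
fold_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_toy, y_toy, cv=cv, scoring="f1"
)

print("F1 per fold:", fold_scores)
print("Mean and std:", fold_scores.mean(), fold_scores.std())
```

The spread across folds gives a sense of how much a single train/test split could over- or understate performance.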

In [None]:
df_node_attributes

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df_node_attributes.drop('attr_dict.Default', axis=1),
    df_node_attributes['attr_dict.Default'],
    test_size=0.2,
    random_state=42)

In [None]:
X_train

In [None]:
from sklearn.linear_model import LogisticRegression

# Initialize a Logistic Regression model
model = LogisticRegression(max_iter=1000)

# Train the model
model.fit(X_train, y_train)


In [None]:
# Predict the 'Default' status on the training set
y_pred_train = model.predict(X_train)


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Compute evaluation metrics
print('Accuracy Train: ', accuracy_score(y_train, y_pred_train))
print('Precision Train: ', precision_score(y_train, y_pred_train))
print('Recall Train: ', recall_score(y_train, y_pred_train))
print('F1 score Train: ', f1_score(y_train, y_pred_train))


In [None]:
# Predict the 'Default' status on the test set
y_pred = model.predict(X_test)
y_pred

In [None]:
# Import evaluation metrics from sklearn.metrics
# Accuracy: proportion of correct predictions out of all predictions
# Precision: proportion of true positives out of all predicted positives
# Recall (also known as sensitivity): proportion of true positives out of all actual positives
# F1 score: harmonic mean of precision and recall, more informative than accuracy when classes are imbalanced
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Compute and print accuracy of the model on test data
# This is the ratio of correctly predicted observations to the total observations
print('Accuracy: ', accuracy_score(y_test, y_pred))

# Compute and print precision of the model on test data
# This is the ratio of correctly predicted positive observations to the total predicted positives
print('Precision: ', precision_score(y_test, y_pred))

# Compute and print recall of the model on test data
# This is the ratio of correctly predicted positive observations to all observations in the actual positive class
print('Recall: ', recall_score(y_test, y_pred))

# Compute and print the F1 score of the model on test data
# The F1 score is the harmonic mean of precision and recall, used when we want a balance between the two
print('F1 score: ', f1_score(y_test, y_pred))


In [None]:
# Get feature names and corresponding coefficients
feature_names = X_train.columns
coefficients = model.coef_[0]

# Create a DataFrame for easy visualization
coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': coefficients
})

# Calculate the absolute values of the coefficients as a separate column
coef_df['AbsCoefficient'] = np.abs(coef_df['Coefficient'])

# Sort by absolute coefficient value in descending order
coef_df = coef_df.sort_values('AbsCoefficient', ascending=False)

print(coef_df)


In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix: \n', cm)


In [None]:
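from sklearn.metrics import roc_auc_score
# AUC-ROC should be computed from predicted probabilities (or scores), not hard 0/1 labels
y_score = model.predict_proba(X_test)[:, 1]
auc_roc = roc_auc_score(y_test, y_score)
print('AUC-ROC: ', auc_roc)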


In [None]:
from sklearn.metrics import classification_report
report = classification_report(y_test, y_pred)
print('Classification Report: \n', report)


In [None]:
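from sklearn.model_selection import cross_val_score

# Cross-validate on the full feature matrix and target
X = df_node_attributes.drop('attr_dict.Default', axis=1)
y = df_node_attributes['attr_dict.Default']
scores = cross_val_score(model, X, y, cv=5)
print('Cross Validation Score: ', np.mean(scores))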


In [None]:
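from sklearn.metrics import ConfusionMatrixDisplay

# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
plt.title('Confusion Matrix')
plt.show()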


In [None]:
from sklearn.metrics import RocCurveDisplay

# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
RocCurveDisplay.from_estimator(model, X_test, y_test)
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.show()

In [None]:
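from sklearn.metrics import PrecisionRecallDisplay

# plot_precision_recall_curve was removed in scikit-learn 1.2; PrecisionRecallDisplay replaces it
PrecisionRecallDisplay.from_estimator(model, X_test, y_test)
plt.title('Precision-Recall Curve')
plt.show()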


### Grid search and CV

Next, we fit two more complex models with scikit-learn, a Random Forest and a Gradient Boosting classifier, and apply validation techniques to both.

To make the results more robust, the code below adds simple hyperparameter tuning with GridSearchCV, which evaluates each candidate with k-fold cross-validation:

```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

# Initialize the models
rf = RandomForestClassifier(random_state=42)
gb = GradientBoostingClassifier(random_state=42)

# Create a list of models
models = [rf, gb]

# Define the grid of hyperparameters 'params'
params = [
    {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20, 30], 'min_samples_split': [2, 5, 10]},  # RandomForest
    {'n_estimators': [50, 100, 200], 'learning_rate': [0.01, 0.1, 0.2], 'max_depth': [3, 5, 7]}   # GradientBoosting
]

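# The loop variable is named 'estimator' so it does not overwrite the logistic regression 'model'
for estimator, param in zip(models, params):
    # GridSearchCV
    grid = GridSearchCV(estimator=estimator, param_grid=param, cv=5)  # 5-fold cross-validation
    grid.fit(X_train, y_train)

    # Print the model name and the best set of parameters found by GridSearch
    print(type(estimator).__name__, grid.best_params_)

    # Predict the 'Default' status on the test set
    y_pred = grid.predict(X_test)

    # Compute evaluation metrics
    print(classification_report(y_test, y_pred))

# After the loop, 'grid' holds the search for the last model (Gradient Boosting)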

```

Please note, running this code may take a while, because it tries all combinations of the provided hyperparameters. Also, keep in mind that GridSearchCV applies cross-validation for model validation.
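
To get a feel for the cost before running the search, the number of model fits can be counted from the grids. The sketch below copies the two grids from the code above and assumes GridSearchCV's default `refit=True`, which adds one final fit on the whole training set; the helper name `count_candidates` is illustrative.

```python
from itertools import product

# The two hyperparameter grids from the example above
param_grids = {
    "RandomForest": {
        "n_estimators": [50, 100, 200],
        "max_depth": [None, 10, 20, 30],
        "min_samples_split": [2, 5, 10],
    },
    "GradientBoosting": {
        "n_estimators": [50, 100, 200],
        "learning_rate": [0.01, 0.1, 0.2],
        "max_depth": [3, 5, 7],
    },
}
cv_folds = 5

def count_candidates(grid_spec):
    """Number of hyperparameter combinations in a grid (Cartesian product)."""
    return len(list(product(*grid_spec.values())))

fit_counts = {}
for name, grid_spec in param_grids.items():
    n_candidates = count_candidates(grid_spec)
    # Each candidate is fitted once per fold, plus one refit of the best candidate
    fit_counts[name] = n_candidates * cv_folds + 1
    print(f"{name}: {n_candidates} candidates, {fit_counts[name]} fits")
```

Adding one more value to any list multiplies the candidate count, so grids grow quickly; a randomized search is a common alternative when the grid becomes too large.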

As for Neural Networks, it's a bit more involved and typically requires more tuning and computational resources. scikit-learn does offer a simple `MLPClassifier` for multilayer perceptron (MLP) networks, but for more complex architectures, you'll want to look into deep learning libraries like TensorFlow or PyTorch.
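
For completeness, here is a minimal sketch of scikit-learn's `MLPClassifier`. It assumes synthetic data from `make_classification` rather than the notebook's features, and the layer size and iteration limit are arbitrary choices, not tuned values. Inputs are standardized first because MLPs are sensitive to feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix and target
X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Scale the features, then fit a small network with one hidden layer of 16 units
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=42),
)
mlp.fit(X_tr, y_tr)

mlp_accuracy = mlp.score(X_te, y_te)
print("Test accuracy:", mlp_accuracy)
```

Putting the scaler inside the pipeline means it is fitted on the training data only, which avoids leaking test-set statistics into the model.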



In [None]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

# Initialize the models
rf = RandomForestClassifier(random_state=42)
gb = GradientBoostingClassifier(random_state=42)

# Create a list of models
models = [rf, gb]

# Define the grid of hyperparameters 'params'
params = [
    {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20, 30], 'min_samples_split': [2, 5, 10]},  # RandomForest
    {'n_estimators': [50, 100, 200], 'learning_rate': [0.01, 0.1, 0.2], 'max_depth': [3, 5, 7]}   # GradientBoosting
]

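# The loop variable is named 'estimator' so it does not overwrite the logistic regression 'model'
for estimator, param in zip(models, params):
    # GridSearchCV
    grid = GridSearchCV(estimator=estimator, param_grid=param, cv=5)  # 5-fold cross-validation
    grid.fit(X_train, y_train)

    # Print the model name and the best set of parameters found by GridSearch
    print(type(estimator).__name__, grid.best_params_)

    # Predict the 'Default' status on the test set
    y_pred = grid.predict(X_test)

    # Compute evaluation metrics
    print(classification_report(y_test, y_pred))

# After the loop, 'grid' holds the search for the last model (Gradient Boosting)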



### Evaluation

In [None]:
from sklearn.metrics import classification_report

# Predict the 'Default' status on the test set
# ('grid' is the last search fitted in the loop above, i.e. Gradient Boosting)
y_pred = grid.best_estimator_.predict(X_test)

# Compute evaluation metrics
print(classification_report(y_test, y_pred))


### Interpretability

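Impurity-based feature importances (the `feature_importances_` attribute of tree ensembles) are derived from the training data, because they measure how much each feature contributed to the splits the model learned. Permutation importance, by contrast, measures how much a score drops when a feature's values are shuffled, so it can also be computed on held-out data such as the test set, which shows which features matter for generalization.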

After training your model, you would typically look at the feature importances to understand which features the model considers important. You can use this information to gain insights into your model and your data. For example, you may find that some features are not important and could be removed, or that some features are very important and perhaps you want to spend more time engineering related features.
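
To illustrate how an unimportant feature shows up, the sketch below builds a synthetic dataset in which only the first of three features determines the label, then computes permutation importance on held-out data. The data, names, and settings are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)  # only feature 0 carries signal; 1 and 2 are noise

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature in turn on the held-out set and record the drop in accuracy
result_toy = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)

# Feature indices ordered from most to least important
ranking = np.argsort(result_toy.importances_mean)[::-1]
print("Ranking:", ranking)
print("Mean importances:", result_toy.importances_mean)
```

Features whose importance is close to zero, like the two noise columns here, are candidates for removal.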



In [None]:
from sklearn.inspection import permutation_importance

# Compute permutation feature importance
result = permutation_importance(grid.best_estimator_, X_test, y_test, n_repeats=10, random_state=42)

# Create a DataFrame to visualize importance scores
importance_df = pd.DataFrame({
    'Feature': X_test.columns,
    'Permutation Importance': result.importances_mean,
    'Std': result.importances_std
})

# Sort by importance
importance_df = importance_df.sort_values('Permutation Importance', ascending=False)

print(importance_df)


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Get impurity-based feature importances (available on tree ensembles such as the tuned model)
importances = grid.best_estimator_.feature_importances_

# Create a DataFrame
importances_df = pd.DataFrame({'feature': X_train.columns, 'importance': importances})

# Sort by importance
importances_df = importances_df.sort_values('importance', ascending=False)

# Plot
importances_df.plot.bar(x='feature', y='importance')
plt.title('Feature Importance')
plt.ylabel('Importance')
plt.show()


In [None]:
import lime
import lime.lime_tabular

# Create a lime explainer object
explainer = lime.lime_tabular.LimeTabularExplainer(X_train.values,
                                                   feature_names=X_train.columns.values.tolist(), 
                                                   class_names=['Non-Default', 'Default'], 
                                                   verbose=True, 
                                                   mode='classification')

# Pick the observation in the test set for which an explanation is required
observation_1 = X_test.values[0]

# Get the explanation from the tuned model ('grid' is the last search fitted above, i.e. Gradient Boosting)
exp = explainer.explain_instance(observation_1, grid.best_estimator_.predict_proba, num_features=5)

exp.show_in_notebook(show_table=True)


In [None]:
"""
# Disabled cell: requires the shap package (pip install shap); remove the surrounding triple quotes to run
import shap

# Create object that can calculate shap values
explainer = shap.TreeExplainer(grid.best_estimator_)

# calculate shap values
shap_values = explainer.shap_values(X_test)

# plot
shap.summary_plot(shap_values, X_test, plot_type="bar")
"""