# Fraud Analysis in Government Agencies

## Loading the data
The dataset comes from a government agency responsible for overseeing tax payments by the citizens of a Latin American country.
##### Individuals: 6.7 MM
##### Relationships: 11 MM
##### Time horizon: Jan-2013 to Dec-2018 (5 years)

In [None]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import networkx as nx
import time

In [None]:
h = pd.read_csv('../data/paths.csv')
h.head()

We inspect the DataFrame's summary information

In [None]:
h.info()

We convert the DataFrame into an undirected graph

In [None]:
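G = nx.from_pandas_edgelist(h, source="NODE_SRC", target="NODE_TRG")
# nx.info() was removed in NetworkX 3.0, so the summary is printed explicitly.
print(f"Nodes: {G.number_of_nodes()}, edges: {G.number_of_edges()}")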
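
To make the conversion concrete, here is a minimal sketch on a made-up edge list. The four rows and the names `toy_edges` and `toy_graph` are illustrative assumptions; only the column names `NODE_SRC` and `NODE_TRG` come from the notebook.

```python
import pandas as pd
import networkx as nx

# Hypothetical toy edge list that reuses the column names of paths.csv.
toy_edges = pd.DataFrame({
    "NODE_SRC": [1, 1, 2, 4],
    "NODE_TRG": [2, 3, 3, 5],
})

# Without create_using, from_pandas_edgelist builds an undirected nx.Graph.
toy_graph = nx.from_pandas_edgelist(toy_edges, source="NODE_SRC", target="NODE_TRG")

print(toy_graph.number_of_nodes(), toy_graph.number_of_edges())
# In an undirected graph the edge 1-2 can be queried in either direction.
print(toy_graph.has_edge(2, 1))
```

Because the graph is undirected, a relationship recorded as source 1 and target 2 is the same edge as one recorded the other way around.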

We add the fraud flag to each node

In [None]:
fl = pd.read_csv('../data/nodesf.csv', index_col="id_node")
fl.head()

In [None]:
fl.info()

In [None]:
nx.set_node_attributes(G, dict(fl['fraude']), 'fraude')
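
One detail worth checking at this step: any node of the graph that does not appear in the flag table ends up without a `fraude` attribute. The sketch below shows this on a made-up three-node graph; `toy_graph`, `toy_flags`, and the choice of treating missing flags as 0 are assumptions for illustration.

```python
import networkx as nx

toy_graph = nx.Graph([(1, 2), (2, 3)])

# Hypothetical flag table: node 3 is deliberately absent.
toy_flags = {1: 1, 2: 0}
nx.set_node_attributes(toy_graph, toy_flags, "fraude")

# Nodes that received no flag simply lack the 'fraude' key.
missing = [n for n, data in toy_graph.nodes(data=True) if "fraude" not in data]

# Reading with .get() avoids a KeyError; the default of 0 is an assumption.
flags = {n: toy_graph.nodes[n].get("fraude", 0) for n in toy_graph}

print(missing)
print(flags)
```

Listing `missing` before any analysis shows how many nodes would silently be treated as non-fraudulent under that default.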

We visualize the network structure

In [None]:
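plt.figure(figsize=(80, 45))
# Fraud nodes are drawn in red, all others in blue; nodes without a flag count as non-fraud.
node_colors = ['r' if G.nodes[v].get('fraude') == 1
               else 'b' for v in G]
nx.draw_networkx(G, width=0.1, with_labels=False, node_color=node_colors)
plt.show()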

Define the function get_top_nodes, which returns the entries with the highest values in a dictionary

In [None]:
def get_top_nodes(cdict, num=5):
    # Sort the items by value, highest first, and keep the first `num` of them.
    ranked = sorted(cdict.items(), key=lambda x: x[1], reverse=True)
    return dict(ranked[:num])
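
For large dictionaries, sorting every entry just to keep a handful is more work than necessary. A possible alternative, sketched here under the assumption that a list of pairs is an acceptable return type, uses `heapq.nlargest` from the standard library. The name `top_n` and the scores are made up.

```python
import heapq


def top_n(scores, n=5):
    """Return the n (key, value) pairs with the largest values, highest first."""
    return heapq.nlargest(n, scores.items(), key=lambda kv: kv[1])


# Hypothetical centrality scores for four nodes.
toy_scores = {"a": 0.1, "b": 0.7, "c": 0.4, "d": 0.9}
print(top_n(toy_scores, 2))  # [('d', 0.9), ('b', 0.7)]
```

The result can be wrapped in `dict(...)` if the dictionary form used elsewhere in the notebook is preferred.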

#### Degree

Store the degree of each node in a dictionary

In [None]:
# G.degree() returns a view, so it is converted to a real dictionary here.
gdeg = dict(G.degree())

In [None]:
get_top_nodes(dict(gdeg))

In [None]:
nx.is_connected(G)

#### Degree Centrality

In [None]:
degree_centrality = nx.degree_centrality(G)
nx.set_node_attributes(G, degree_centrality, 'dc')
get_top_nodes(degree_centrality)
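
Degree centrality is the degree of a node divided by n - 1, the maximum degree possible in a simple graph with n nodes. The sketch below checks this on a small star graph; `toy_graph` and `manual` are illustrative names.

```python
import networkx as nx

# Star graph: node 0 is the hub, nodes 1..3 are leaves.
toy_graph = nx.star_graph(3)
n = toy_graph.number_of_nodes()

dc = nx.degree_centrality(toy_graph)

# Same quantity computed by hand: degree / (n - 1).
manual = {node: degree / (n - 1) for node, degree in toy_graph.degree()}

print(dc[0], dc[1])
```

The hub reaches the maximum value of 1.0 because it is linked to every other node, while each leaf scores 1/3.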

#### Betweenness

In [None]:
t0 = time.process_time()
betweenness_centrality = nx.betweenness_centrality(G)
nx.set_node_attributes(G, betweenness_centrality, 'bc')
t1 = time.process_time() - t0
print("Time elapsed: ", t1)

In [None]:
get_top_nodes(betweenness_centrality)
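
Exact betweenness runs a shortest-path search from every node, which becomes slow as the graph grows. NetworkX can also estimate it from a sample of `k` source nodes. The sketch below compares both on a small path graph; the values `k=4` and `seed=42` are arbitrary choices for illustration, not settings taken from the notebook.

```python
import networkx as nx

# Path graph 0-1-2-3-4-5-6: the middle node lies on the most shortest paths.
toy_graph = nx.path_graph(7)

exact = nx.betweenness_centrality(toy_graph)

# Estimate from k sampled source nodes; seed makes the sample reproducible.
approx = nx.betweenness_centrality(toy_graph, k=4, seed=42)

print(max(exact, key=exact.get))
print(approx)
```

The estimate trades accuracy for speed, so it is worth comparing the top-ranked nodes against the exact result on a subgraph before relying on it.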

#### Closeness

In [None]:
t0 = time.process_time()
closeness_centrality = nx.closeness_centrality(G)
nx.set_node_attributes(G, closeness_centrality, 'cc')
t1 = time.process_time() - t0
print("Time elapsed: ", t1)

In [None]:
get_top_nodes(closeness_centrality)
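
Closeness centrality of a node in a connected graph is (n - 1) divided by the sum of its shortest-path distances to all other nodes. A minimal check on a three-node path follows; `toy_graph` and `manual_end` are illustrative names.

```python
import networkx as nx

# Path graph 0-1-2: node 1 is one step from both other nodes.
toy_graph = nx.path_graph(3)
n = toy_graph.number_of_nodes()

cc = nx.closeness_centrality(toy_graph)

# By hand for node 0: distances are 1 (to node 1) and 2 (to node 2).
manual_end = (n - 1) / (1 + 2)

print(cc[1], cc[0])
```

The middle node scores 1.0 and each end node 2/3, which matches the manual calculation.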

#### Eigenvector Centrality

In [None]:
t0 = time.process_time()
eigenvector_centrality = nx.eigenvector_centrality_numpy(G)
nx.set_node_attributes(G, eigenvector_centrality, 'ec')
t1 = time.process_time() - t0
print("Time elapsed: ", t1)

In [None]:
get_top_nodes(eigenvector_centrality)

#### PageRank Centrality

In [None]:
t0 = time.process_time()
pagerank_centrality = nx.pagerank(G)
nx.set_node_attributes(G, pagerank_centrality, 'pr')
t1 = time.process_time() - t0
print("Time elapsed: ", t1)

In [None]:
get_top_nodes(pagerank_centrality)

## Graph Metrics

#### Shortest Paths
We build a process to identify the non-fraudulent clients who are connected to fraudulent ones

In [None]:
sub=fl.loc[fl.fraude==1]
sub.head()

In [None]:
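# Shortest-path length between every pair of fraud nodes that is connected.
dist = {}
for i in sub.index:
    for j in sub.index:
        if i < j:
            try:
                dist[i, j] = nx.shortest_path_length(G, i, j)
            except (nx.NetworkXNoPath, nx.NodeNotFound):
                # The pair lies in different components, or a node is missing from G.
                pass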

In [None]:
dist

In [None]:
list(nx.all_shortest_paths(G,511748, 981630))

In [None]:
list(nx.all_shortest_paths(G,4610143, 5211021))

In [None]:
list(nx.all_shortest_paths(G,3142363, 4808898))

In [None]:
list(nx.all_shortest_paths(G,5770584,6014218))
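
The individual path queries above can be generalized: collect the interior nodes of every shortest path between two fraud nodes, then drop the fraud nodes themselves. This is only one possible way to define "related" clients, sketched on a made-up graph; `toy_graph`, `fraud_nodes`, and `bridges` are hypothetical names.

```python
from itertools import combinations

import networkx as nx

# Hypothetical toy network: nodes 1, 4 and 6 are fraud; 6 is in a separate component.
toy_graph = nx.Graph([(1, 2), (2, 4), (1, 5), (5, 4), (6, 7)])
fraud_nodes = [1, 4, 6]

bridges = set()
for a, b in combinations(fraud_nodes, 2):
    try:
        for path in nx.all_shortest_paths(toy_graph, a, b):
            # Interior nodes of the path sit between the two fraud nodes.
            bridges.update(path[1:-1])
    except nx.NetworkXNoPath:
        continue  # the pair is in different components

# Keep only the non-fraud intermediaries.
bridges -= set(fraud_nodes)
print(sorted(bridges))  # [2, 5]
```

On the full graph the number of fraud pairs grows quadratically, so restricting the search to pairs already found in `dist` would likely be a sensible first filter.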

In [None]:
nx.is_connected(G)

In [None]:
nx.number_connected_components(G)
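
Beyond counting components, it can help to see how large each one is and how many fraud nodes it contains. The sketch below does this on a made-up graph; the flags and the default of 0 for unflagged nodes are assumptions.

```python
import networkx as nx

# Hypothetical toy graph with three components, one of them an isolated node.
toy_graph = nx.Graph([(1, 2), (2, 3), (4, 5)])
toy_graph.add_node(6)
nx.set_node_attributes(toy_graph, {1: 1, 4: 0, 5: 1}, "fraude")

summary = []
for component in nx.connected_components(toy_graph):
    # Unflagged nodes are counted as non-fraud (an assumption).
    n_fraud = sum(toy_graph.nodes[n].get("fraude", 0) for n in component)
    summary.append((len(component), n_fraud))

# Largest components first: (size, number of fraud nodes).
summary.sort(reverse=True)
print(summary)  # [(3, 1), (2, 1), (1, 0)]
```

Small components with a high share of fraud nodes are natural candidates for closer review.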

#### Density

In [None]:
nx.density(G)
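
For an undirected graph with n nodes and m edges, density is 2m / (n(n - 1)), the share of possible edges that actually exist. A quick check on a made-up graph, where `toy_graph` and `manual` are illustrative names:

```python
import networkx as nx

# Four nodes and three edges out of six possible pairs.
toy_graph = nx.Graph([(1, 2), (2, 3), (3, 4)])
n = toy_graph.number_of_nodes()
m = toy_graph.number_of_edges()

manual = 2 * m / (n * (n - 1))
print(manual, nx.density(toy_graph))  # 0.5 0.5
```

With millions of nodes and only a few edges per node, the density of a network like this one is expected to be very close to zero.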

#### Local Clustering Coefficient

In [None]:
nx.average_clustering(G)
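
The local clustering coefficient of a node is the fraction of pairs of its neighbors that are themselves connected, and `average_clustering` is the mean over all nodes. A worked example on a made-up graph, a triangle with one pendant node:

```python
import networkx as nx

# Triangle 1-2-3 plus node 4 hanging off node 3.
toy_graph = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4)])

local = nx.clustering(toy_graph)
average = nx.average_clustering(toy_graph)

# Node 3 has neighbors 1, 2, 4; only the pair (1, 2) is connected: 1 of 3 pairs.
print(local)
print(average)
```

Nodes 1 and 2 score 1.0, node 3 scores 1/3, and node 4 scores 0 because it has a single neighbor, giving an average of 7/12.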

Prepared by Luis Cajachahua under the MIT license (2022)