# Segregation Indexes
Following Bojanowski & Corten (2014), _Measuring Segregation in Social Networks_, we calculate several segregation indexes for our graphs over time. First, we import the data and the libraries used to build the functions in the Prerequisites section. Then we calculate the Freeman Segregation Index and the Spectral Segregation Index in the following sections. This notebook is divided into the following sections.

1. Prerequisites
2. Freeman Segregation
	- Basic Freeman Segregation
	- Global Freeman Segregation Index (for K groups)
	- Freeman Segregation Index for a specific group
	- Freeman Segregation Index for a specific group (taking weights into account)
3. Assortativity
4. Results
4.1. 3 Day Rolling Window
4. Conclusion


# 1. Prerequisites

In [2]:
# Mathematical and Data Management
import numpy as np
import pandas as pd

# Graph Management
import graph_tool.all as gt
import utils.Freeman as Fr
import utils.Proximity as Pr
import utils.Homophily as Ho

# Miscellaneous
from glob import glob
from tqdm import tqdm
import concurrent.futures
from functools import partial
from time import perf_counter
import os
import re

# Paths
path = r"/mnt/disk2/Data"
path_3_day = os.path.join(path,"3_Day_Graphs")
path_daily = os.path.join(path,"Daily_Graphs")



For the calculation of the segregation indexes, we define some notation based on (Bojanowski & Corten 2014).

We define a graph with node set

$$\mathbb{N}= \{1, \dots, i, \dots, N\}$$

and then define

$$\mathbb{G} = \{G_1, G_2,\dots, G_K\}$$

as the set of $K$ groups, in which every $G_k$ is a subset of $\mathbb{N}$ that contains all the nodes belonging to group $k$. We define $\eta_{k} = |G_k|$ as the number of nodes in group $G_k$.

Now we define the type vector as 

$$\textbf{t} = [t_1,\dots, t_i, \dots, t_N]$$

where $t_i \in \{1,\dots,K\}$. This vector matches every node with its corresponding group. Using this notation, we can define a type indicator vector for each group $k$ as follows:

$$\textbf{v}_k = [v_1, \dots, v_i, \dots, v_N]$$ 

where $v_i \in \{0,1\}$. This vector has one entry per node, and the entry for node $i$ is 1 if that node belongs to group $G_k$. Formally:

$$ v_i = \begin{cases} 1 &\text{ if }t_i = k \\ 0 &\text{ if }t_i \neq k \end{cases} $$

Now we define the types matrix $T_{N\times K}$, which records the group each node belongs to. Every column of the matrix corresponds to one _type indicator vector_ $\textbf{v}_k$.
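As a minimal sketch of this notation (not part of the `utils` modules used below), the types matrix can be built with NumPy from a type vector. The vector `t` and the value of `K` here are hypothetical toy values chosen only for illustration.

```python
import numpy as np

# Hypothetical type vector t for N = 6 nodes and K = 3 groups (labels 1..K).
t = np.array([1, 2, 2, 3, 1, 3])
K = 3

# Column k-1 of T is the indicator vector v_k: T[i, k-1] = 1 iff t_i = k.
T = (t[:, None] == np.arange(1, K + 1)[None, :]).astype(int)

# The group sizes eta_k are the column sums of T.
eta = T.sum(axis=0)

print(T)
print(eta)
```

Every row of `T` contains exactly one 1, because each node belongs to exactly one group.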

In the context of this research, we use a directed weighted graph. Our nodes are X (formerly Twitter) users, and user $i$ is related to user $j$ if $i$ retweeted a tweet of $j$ without comments. Formally, we describe the relation $R$ over $\mathbb{N}\times \mathbb{N}$, which induces our square adjacency matrix $X = [x_{ij}]_{N\times N}$.

For the segregation calculations we consider the graph either as weighted or as unweighted. When the weight of each edge is taken into account, the entries of the adjacency matrix are defined as follows:

$$x_{ij} = \dfrac{\text{\# Tweets from }j\text{ that }i\text{ Retweeted without comments}}{\text{\# of Retweets without comments from }i}$$

For the unweighted graph, we define our simple adjacency matrix as:

$$ x_{ij} = \begin{cases} 1 & \text{if } i\text{ Retweeted }j \\ 0 & \text{otherwise} \end{cases} $$
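The two adjacency matrices can be illustrated with a small sketch. It assumes a hypothetical count matrix `C`, where `C[i, j]` is the number of tweets by $j$ that $i$ retweeted without comments; the graphs loaded in this notebook already store their weights, so this is only meant to show the definitions at work.

```python
import numpy as np

# Hypothetical retweet counts: C[i, j] = tweets by j that i retweeted without comments.
C = np.array([[0, 3, 1],
              [2, 0, 0],
              [0, 0, 0]], dtype=float)

# Weighted case: divide each row by the total retweets made by user i.
# Users who retweeted nothing keep an all-zero row instead of dividing by zero.
row_totals = C.sum(axis=1, keepdims=True)
X_weighted = np.divide(C, row_totals, out=np.zeros_like(C), where=row_totals > 0)

# Unweighted case: 1 if i retweeted j at least once, 0 otherwise.
X_simple = (C > 0).astype(int)

print(X_weighted)
print(X_simple)
```

With this construction, each row of `X_weighted` sums to 1 for users who retweeted at least once and to 0 otherwise.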

Finally, we define the _Mixing Matrix_ ($M_{ghy}$), where $g$ and $h$ are two generic groups and $y$ indexes two layers. The first layer of the _Mixing Matrix_ is the _Contact Layer_, defined as follows (using either the weighted or the unweighted adjacency matrix):

$$M_{gh1} = \sum_{i\in G_g}\sum_{j\in G_h} x_{ij}$$

For the **unweighted** case, we can define the _No Contact Layer_ as follows:

$$M_{gh0} = \sum_{i\in G_g}\sum_{j\in G_h} (1-x_{ij})$$

In this matrix, $M_{gh1}$ shows the amount of attention that group $h$ gets from group $g$.

For convenience, we define the following notation:

- $M_{g+1} = \sum_{h=1}^K M_{gh1}$: sum of row $g$ (over all columns)

- $M_{+h1} = \sum_{g=1}^K M_{gh1}$: sum of column $h$ (over all rows)

- $M_{++1} = \sum_{g=1}^K \sum_{h=1}^K M_{gh1}$: sum of the whole layer
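The definitions above can be sketched in matrix form: with the types matrix $T$ and an adjacency matrix $X$, the contact layer is $T^\top X T$. The toy matrices `X` and `T` below are hypothetical, and the no-contact layer follows the formula exactly as written (so the pairs with $i = j$ are included in the sum).

```python
import numpy as np

# Hypothetical unweighted, directed adjacency matrix for 4 nodes.
X = np.array([[0, 1, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 0, 1, 0]])

# Types matrix: nodes 0 and 1 are in group 1, nodes 2 and 3 in group 2.
T = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]])

# Contact layer: M1[g, h] = sum of x_ij over i in G_g and j in G_h.
M1 = T.T @ X @ T

# No-contact layer (unweighted case): sum of (1 - x_ij) over the same pairs.
M0 = T.T @ (1 - X) @ T

row_sums = M1.sum(axis=1)   # M_{g+1}
col_sums = M1.sum(axis=0)   # M_{+h1}
total = M1.sum()            # M_{++1}

print(M1, M0, row_sums, col_sums, total, sep="\n")
```

In the unweighted case `total` equals the number of directed edges in the graph.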

In [2]:
master_id = pd.read_csv(os.path.join(path,"Master_Index.csv"), sep = ';')

# Indexes
date_range_daily = pd.date_range(start='2021-04-28', end='2021-06-29', freq='D')
date_range_3day = pd.date_range(start='2021-04-28', end='2021-06-27', freq='D')
categories = master_id['Political Affiliation'].unique().tolist()

group_index_3day = pd.MultiIndex.from_product([date_range_3day, categories], names=['Date', 'Political Label'])
group_index_daily = pd.MultiIndex.from_product([date_range_daily, categories], names=['Date', 'Political Label'])

individual_index = pd.MultiIndex.from_product([range(0,len(master_id)), categories], names=['Node', 'Political Label'])

# 3 Day DataFrames
global_segregation_3day = pd.DataFrame(index=date_range_3day).sort_index()
group_segregation_3day = pd.DataFrame(index=group_index_3day).sort_index()

# Daily Dataframes
global_segregation_daily = pd.DataFrame(index=date_range_daily).sort_index()
group_segregation_daily = pd.DataFrame(index=group_index_daily).sort_index()

# Individual Dataframes
individual_group_segregation = pd.DataFrame(index=individual_index).sort_index()
individual_node_segregation = master_id[['Political Affiliation']].rename(columns = {'Political Affiliation': 'Political Label'})

# Load graphs
files_3day = glob(os.path.join(path_3_day,"Graphs", "*.graphml"))
files_daily = glob(os.path.join(path_daily,"Graphs", "*.graphml"))

---
# 2. Freeman Segregation

### Basic Freeman Segregation

The basic segregation index proposed by Freeman (1978) compares the proportion of ties between two different groups with the proportion expected if ties were formed at random. This basic index is calculated for undirected and unweighted graphs, and it is the first approach to segregation in this family of indexes. We define $\eta_1$ as the number of nodes belonging to group $G_1$ and $\eta_2$ as the number belonging to group $G_2$. Recalling our notation, in the contact layer of the mixing matrix the number of between-group ties is the entry $M_{121}$. We divide this number by the total number of edges (note that the number of edges in the graph equals the sum of the whole contact layer, $M_{++1}$). This gives the proportion of between-group ties:

$$p = \frac{M_{121}}{M_{++1}}$$

Now we calculate the expected value of $p$, that is, the probability that a randomly chosen dyad joins a node from one group with a node from the other. Thinking of it as the ratio of favorable cases to total cases, the favorable cases are all the possible cross-group ties, which number $\eta_1 \eta_2$ (the number of nodes in group 1 multiplied by the number of nodes in group 2). The total cases are all the possible dyads in the graph, that is, $\binom{N}{2}$. Combining these two values gives:

$$\pi = \frac{\eta_1 \eta_2}{\frac{N(N-1)}{2}} = \frac{2\eta_1 \eta_2}{N(N-1)}$$

Finally, the Freeman Segregation Index is defined as:

$$S_{Freeman} = 1- \frac{p}{\pi}$$
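The formula can be checked on a toy graph with a short NumPy sketch. This is an illustration under the assumptions stated in the text (undirected, unweighted, two groups) and is not the implementation behind `Fr.Freeman_Classic`; the function name and the example graph are hypothetical.

```python
import numpy as np

def freeman_two_groups(A, t):
    """Basic Freeman index for an undirected, unweighted graph with two groups."""
    A = np.asarray(A)
    t = np.asarray(t)
    n = len(t)
    rows, cols = np.triu_indices(n, k=1)     # count each dyad once
    ties = A[rows, cols] > 0                 # dyads that are connected
    cross = t[rows] != t[cols]               # dyads that join different groups
    p = (ties & cross).sum() / ties.sum()    # observed share of between-group ties
    _, eta = np.unique(t, return_counts=True)
    if len(eta) != 2:
        raise ValueError("this sketch expects exactly two groups")
    pi = 2 * eta[0] * eta[1] / (n * (n - 1)) # expected share under random mixing
    return 1 - p / pi

# Toy graph: edges 0-1 and 2-3 are within groups, edge 1-2 crosses groups.
A = np.zeros((4, 4), dtype=int)
for i, j in [(0, 1), (2, 3), (1, 2)]:
    A[i, j] = A[j, i] = 1
t = [1, 1, 2, 2]

S = freeman_two_groups(A, t)
print(S)
```

Here $p = 1/3$ and $\pi = 2/3$, so the index is about $0.5$: the graph has half as many cross-group ties as random mixing would produce.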

In [3]:
def process_file(file, categories):
    results = []
    for pol in categories:
        g = gt.load_graph(file)
        date = file.split('/')[-1].split('.')[0].split('_')[-1]
        seg = Fr.Freeman_Classic(g, types=pol)
        results.append(((date, pol), seg))
    return results


# Run processing in parallel
with concurrent.futures.ProcessPoolExecutor() as executor:
    # Daily routine
    tic = perf_counter()
    futures = [executor.submit(process_file, file, categories) for file in files_daily]
    
    # Process results as they complete
    for future in tqdm(concurrent.futures.as_completed(futures), total=len(files_daily), desc="Daily rutine"):
        try:
            results = future.result()
            # Update group_segregation DataFrame with results
            for key, value in results:
                group_segregation_daily.loc[key, 'Classic Freeman'] = value
        except Exception as e:
            print(f'Generated an exception: {e}')
    toc = perf_counter()
    time = toc-tic

    print(f"Finished in {time//60:,.0f} minutes with {round(time%60,2):,.2f} seconds")
    # 3 Day routine
    tic = perf_counter()
    futures = [executor.submit(process_file, file, categories) for file in files_3day]
    
    # Process results as they complete
    for future in tqdm(concurrent.futures.as_completed(futures), total=len(files_3day), desc="3 Day rutine"):
        try:
            results = future.result()
            # Update group_segregation DataFrame with results
            for key, value in results:
                group_segregation_3day.loc[key, 'Classic Freeman'] = value
        except Exception as e:
            print(f'Generated an exception: {e}')
    toc = perf_counter()
    time = toc-tic

    print(f"Finished in {time//60:,.0f} minutes with {round(time%60,2):,.2f} seconds")

Daily rutine: 100%|██████████| 63/63 [01:35<00:00,  1.51s/it]


Finished in 1 minutes with 35.30 seconds


3 Day rutine: 100%|██████████| 61/61 [03:30<00:00,  3.46s/it]


Finished in 3 minutes with 30.82 seconds


---
### Global Freeman Segregation Index (for K groups)

For the Freeman Segregation Index, we use the formula from (Bojanowski & Corten 2014), in which they generalize this index to $K$ groups. The index is defined as follows.

Let $p$ be the proportion of _between_-group ties in the graph. This corresponds to the off-diagonal entries of the contact layer of $M$ (the diagonal contains the _within_-group ties).

$$p = \frac{\sum_{g,h:g\neq h}M_{gh1}}{\sum_{g=1}^K\sum_{h=1}^K M_{gh1}}$$

Now we define the expected proportion of between-group ties in a random graph. In the generalized case of $K$ groups, it is:

$$\pi = \frac{\left( \sum_{k=1}^K \eta_k\right)^2 - \sum_{k=1}^K \eta_k^2}{N(N-1)}$$

Finally, the Freeman Segregation Index is defined as:

$$S_{Freeman} = 1 -\frac{p}{\pi} = 1- \frac{pN(N-1)}{\left( \sum_{k=1}^K \eta_k\right)^2 - \sum_{k=1}^K \eta_k^2}$$

This index covers the **unweighted case**.
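A small sketch of the $K$-group formula, again on a hypothetical toy graph. It treats the graph as undirected and unweighted and is only an illustration of the equations above, not the code of `Fr.Freeman_Global`.

```python
import numpy as np

def freeman_global(A, t):
    """Freeman index generalized to K groups (undirected, unweighted sketch)."""
    A = np.asarray(A)
    t = np.asarray(t)
    n = len(t)
    rows, cols = np.triu_indices(n, k=1)       # count each dyad once
    ties = A[rows, cols] > 0
    cross = t[rows] != t[cols]
    p = (ties & cross).sum() / ties.sum()      # observed between-group share
    _, eta = np.unique(t, return_counts=True)  # group sizes eta_k
    pi = (eta.sum() ** 2 - (eta ** 2).sum()) / (n * (n - 1))
    return 1 - p / pi

# Toy graph with three groups of two nodes each.
A = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (2, 3), (4, 5), (1, 2)]:
    A[i, j] = A[j, i] = 1
t = [1, 1, 2, 2, 3, 3]

S_global = freeman_global(A, t)
print(S_global)
```

For this graph $p = 1/4$ and $\pi = (36 - 12)/30 = 0.8$, so the index is about $0.69$. With $K = 2$ the numerator of $\pi$ reduces to $2\eta_1\eta_2$, which recovers the basic index.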

In [4]:
# Storage in DataFrame
def process_file(file):
    results = []
    g = gt.load_graph(file)
    date = file.split('/')[-1].split('.')[0].split('_')[-1]
    seg = Fr.Freeman_Global(g,property_label = 'Political Label')
    results.append((date, seg))
    return results

# Run processing in parallel
with concurrent.futures.ProcessPoolExecutor() as executor:
    tic = perf_counter()
    # Daily routine
    futures = [executor.submit(process_file, file) for file in files_daily]
    
    # Process results as they complete
    for future in tqdm(concurrent.futures.as_completed(futures), total=len(files_daily), desc="Daily rutine"):
        try:
            results = future.result()
            # Update group_segregation DataFrame with results
            for key, value in results:
                global_segregation_daily.loc[key, 'Freeman Global'] = value
        except Exception as e:
            print(f'Generated an exception: {e}')
    toc = perf_counter()
    time = toc-tic

    print(f"Finished in {time//60:,.0f} minutes with {round(time%60,2):,.2f} seconds")

    tic = perf_counter()
    # 3 Day routine
    futures = [executor.submit(process_file, file) for file in files_3day]
    
    # Process results as they complete
    for future in tqdm(concurrent.futures.as_completed(futures), total=len(files_3day), desc="3 Day rutine"):
        try:
            results = future.result()
            # Update group_segregation DataFrame with results
            for key, value in results:
                global_segregation_3day.loc[key, 'Freeman Global'] = value
        except Exception as e:
            print(f'Generated an exception: {e}')
            
    toc = perf_counter()
    time = toc-tic

    print(f"Finished in {time//60:,.0f} minutes with {round(time%60,2):,.2f} seconds")

Daily rutine: 100%|██████████| 63/63 [00:23<00:00,  2.65it/s]


Finished in 0 minutes with 23.83 seconds


3 Day rutine: 100%|██████████| 61/61 [00:54<00:00,  1.12it/s]


Finished in 0 minutes with 54.43 seconds


---
### Freeman Segregation Index for a specific group

The Freeman Segregation Index is originally computed for the segregation between two groups. This function computes the index between one group and all the others using the _Basic Freeman Segregation Index_ formula, which gives a measure of how segregated one group is from all the others. In this case, our contact layer only considers two groups: the group $g$ for which the index is calculated and the group $-g$, which contains all the nodes that do not belong to $g$. Recall that our _Contact Matrix_ looks like this:

$$
M_{gh1} = 
\begin{bmatrix}
    M_{1,1,1} & M_{1,2,1} & \dots & M_{1,k,1} \\
    M_{2,1,1} & M_{2,2,1} & \dots & M_{2,k,1} \\
    \vdots & \vdots & \ddots & \vdots \\
    M_{k,1,1} & M_{k,2,1} & \dots & M_{k,k,1} \\
\end{bmatrix}
$$

For our calculation, we use another _Contact Matrix_ called "Me vs Others" and denoted $M^*$. This is a $2\times 2$ matrix, similar to the original _Contact Matrix_ but with only two groups, $g$ and $-g$. It is defined as follows:

$$
M^* = 
\begin{bmatrix}
    M^*_{gg} & M^*_{g-g} \\
    M^*_{-gg} & M^*_{-g-g} \\
\end{bmatrix}
$$

Where:
- $M^*_{gg} = M_{gg1}$
- $M^*_{g-g} = \sum_{h = 1}^k M_{gh1} - M_{gg1}$
- $M^*_{-gg} = \sum_{h = 1}^k M_{hg1} - M_{gg1}$
- $M^*_{-g-g} = \sum_{a} \sum_{b} \hat{M}_{ab}$

To calculate $M^*_{-g-g}$, we remove from the original _Contact Matrix_ the row and the column of group $g$; the result is denoted $\hat{M}$. This is what the contact matrix would be if the group did not exist. With this matrix we can count all the ties among nodes that are not in $g$, which is the sum of all its entries. Formally,
$$
\hat{M} = 
    \begin{bmatrix}
    a_{1,1} & \dots & a_{1,g-1} & a_{1,g+1} & \dots & a_{1,k} \\
    a_{2,1} & \dots & a_{2,g-1} & a_{2,g+1} & \dots & a_{2,k} \\
    \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
    a_{g-1,1} & \dots & a_{g-1,g-1} & a_{g-1,g+1} & \dots & a_{g-1,k} \\
    a_{g+1,1} & \dots & a_{g+1,g-1} & a_{g+1,g+1} & \dots & a_{g+1,k} \\
    \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
    a_{k,1} & \dots & a_{k,g-1} & a_{k,g+1} & \dots & a_{k,k} \\
\end{bmatrix}
$$

Now, for the Freeman Formula we compute both $P$ and $\pi$ and calculate $1-\frac{P}{\pi}$

$$P = \frac{M^*_{g-g}}{M^*_{++}}$$

$$\pi = \frac{2|G_g||G_{-g}|}{N(N-1)}$$

$$S_{Freeman}^g = 1- \frac{N(N-1)M^*_{g-g}}{2M^*_{++}|G_g||G_{-g}|}$$
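The collapse from the $K\times K$ contact layer to the $2\times 2$ matrix $M^*$ can be sketched as follows. The helper names, the contact layer `M` and the group sizes `eta` are hypothetical, and the index is computed exactly as the formulas in this section state; `Fr.Freeman_Groups` may differ in its details.

```python
import numpy as np

def collapse_to_two(M, g):
    """Collapse a K x K contact layer into the 2 x 2 'me vs others' matrix."""
    M = np.asarray(M, dtype=float)
    gg = M[g, g]                          # ties within g
    g_out = M[g, :].sum() - gg            # ties from g to the other groups
    g_in = M[:, g].sum() - gg             # ties from the other groups to g
    rest = M.sum() - gg - g_out - g_in    # ties among nodes outside g
    return np.array([[gg, g_out], [g_in, rest]])

def freeman_one_vs_others(M, eta, g):
    """Freeman index of group g against everyone else, following the text."""
    M_star = collapse_to_two(M, g)
    eta = np.asarray(eta)
    n = eta.sum()
    P = M_star[0, 1] / M_star.sum()
    pi = 2 * eta[g] * (n - eta[g]) / (n * (n - 1))
    return 1 - P / pi

# Hypothetical contact layer for K = 3 groups and their sizes.
M = np.array([[4, 1, 1],
              [1, 3, 0],
              [1, 0, 2]])
eta = np.array([3, 3, 2])

M_star = collapse_to_two(M, 0)
S_g = freeman_one_vs_others(M, eta, 0)
print(M_star)
print(S_g)
```

The entry `rest` equals the sum of $\hat{M}$, the contact layer with the row and column of $g$ removed, so the reduced matrix never has to be built explicitly.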

In [5]:
def process_file(file, categories):
    results = []
    for pol in categories:
        g = gt.load_graph(file)
        graph_name = file.split('/')[-1].split('.')[0].split('_')[-1]
        seg = Fr.Freeman_Groups(g, 'Political Label', pol)
        results.append(((graph_name, pol), seg))
    return results

# Run processing in parallel
with concurrent.futures.ProcessPoolExecutor() as executor:
    tic = perf_counter()
    # Daily routine
    futures = [executor.submit(process_file, file, categories) for file in files_daily]
    
    # Process results as they complete
    for future in tqdm(concurrent.futures.as_completed(futures), total=len(files_daily), desc="Daily rutine"):
        try:
            results = future.result()
            # Update group_segregation DataFrame with results
            for key, value in results:
                group_segregation_daily.loc[key, 'Freeman One vs Others'] = value
        except Exception as e:
            print(f'Generated an exception: {e}')
    toc = perf_counter()
    time = toc-tic

    print(f"Finished rutine in {time//60:,.0f} minutes with {round(time%60,2):,.2f} seconds")
    
    tic = perf_counter()
    # 3 Day routine
    futures = [executor.submit(process_file, file, categories) for file in files_3day]
    
    # Process results as they complete
    for future in tqdm(concurrent.futures.as_completed(futures), total=len(files_3day), desc="3 Day rutine"):
        try:
            results = future.result()
            # Update group_segregation DataFrame with results
            for key, value in results:
                group_segregation_3day.loc[key, 'Freeman One vs Others'] = value
        except Exception as e:
            print(f'Generated an exception: {e}')
    toc = perf_counter()
    time = toc-tic

    print(f"Finished rutine in {time//60:,.0f} minutes with {round(time%60,2):,.2f} seconds")

Daily rutine: 100%|██████████| 63/63 [01:39<00:00,  1.57s/it]


Finished rutine in 1 minutes with 39.38 seconds


3 Day rutine: 100%|██████████| 61/61 [03:55<00:00,  3.85s/it]


Finished rutine in 3 minutes with 55.01 seconds


---
Given a weighted graph in which the nonnegative weights of each individual's outgoing links sum to $1$, we can define the proximity of individual $j$ to group $k$ as:

$$Prox_{j\to k}=\frac{W_{jk}}{(T_k/ \sum_{m\in G} T_m)}$$

where $W_{jk}$ is the sum of all the weights that $j$ puts on members of group $k$ and, for each group $m$, $T_m$ denotes the total number of original tweets by members of group $m$ (tweets made in the time period in question, e.g. one day). $G$ denotes the set of groups in the population. The denominator thus captures the fraction of $j$'s outgoing mass (which equals 1) that would have gone to group $k$ if agent $j$ had simply distributed it uniformly at random among the $\sum_{m\in G} T_m$ original tweets written that day.
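A minimal sketch of this individual proximity, assuming a row-normalised weight matrix `X`, a type vector `t` and a vector `tweets` with the number of original tweets per user. All three are hypothetical toy inputs, and the function is not the one in `utils.Proximity`.

```python
import numpy as np

def individual_proximity(X, t, tweets, j, k):
    """Proximity of individual j to group k: W_jk over the tweet share of group k."""
    X = np.asarray(X, dtype=float)
    t = np.asarray(t)
    tweets = np.asarray(tweets, dtype=float)
    in_k = t == k
    W_jk = X[j, in_k].sum()                    # weight that j puts on members of k
    share_k = tweets[in_k].sum() / tweets.sum()  # T_k / sum_m T_m
    return W_jk / share_k

# Toy data: rows of X sum to 1 for retweet-active users.
X = np.array([[0.0, 0.25, 0.75],
              [1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
t = np.array([1, 1, 2])
tweets = np.array([2, 2, 4])

prox_to_2 = individual_proximity(X, t, tweets, j=0, k=2)
prox_to_1 = individual_proximity(X, t, tweets, j=0, k=1)
print(prox_to_2, prox_to_1)
```

A value above 1 means that $j$ devotes more attention to group $k$ than the random benchmark, and a value below 1 means less. In this example user 0 over-attends group 2 (1.5) and under-attends group 1 (0.5).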

In [6]:
# TODO: needs improvement
tic = perf_counter()
for file in tqdm(files_daily, desc="Individual proximity to group h"):
    g = gt.load_graph(file)
    den_dict = {cat: Pr.at_random_scenario(g, 'Political Label', cat, 'Proximity to Group') for cat in categories}

    def process_individual_segregation(params,den_dict):
        i, cat = params
        num = Pr.individual_proximity_to_h(g, i, 'Political Label', cat)
        date = g.gp['Date']
        den = den_dict[cat]
        seg = num/den
        return (i, cat), seg, date
    def main():
        params = [(i, cat) for i in range(len(master_id)) for cat in categories]

        # Wrap the function call to include den_dict
        individual_segregation_process = partial(process_individual_segregation, den_dict=den_dict)
        # Use ProcessPoolExecutor to parallelize the computation
        with concurrent.futures.ProcessPoolExecutor() as executor:
            # Map the function over every (node, category) pair
            results = executor.map(individual_segregation_process, params)

        # Populate the DataFrame with results
        for row_index, result, date in results:
            individual_group_segregation.loc[row_index, f'Proximity index on {date}'] = result

    if __name__ == '__main__':
        main()
toc = perf_counter()
time = toc-tic

print(f"Finished cell in {time//60:,.0f} minutes with {round(time%60,2):,.2f} seconds")

Individual proximity to group h: 100%|██████████| 63/63 [21:57<00:00, 20.91s/it]

Finished cell in 21 minutes with 57.36 seconds





In [7]:
tic = perf_counter()
for file in tqdm(files_daily, desc = "Proximity to Others"):
    g = gt.load_graph(file)
    den_dict = {cat: Pr.at_random_scenario(g, 'Political Label', cat, 'Proximity to Others') for cat in categories}

    def process_individual_segregation(i, den_dict):
        num = Pr.individual_proximity_to_others(g, i, 'Political Label')
        date = g.gp['Date']
        den = den_dict[g.vp['Political Label'][g.vertex(i)]]
        seg = num/den
        return i, seg, date
    def main():
        # Wrap the function call to include den_dict
        individual_segregation_process = partial(process_individual_segregation, den_dict=den_dict)
        # Use ProcessPoolExecutor to parallelize the computation
        with concurrent.futures.ProcessPoolExecutor() as executor:
            # Map the function over every node index
            results = executor.map(individual_segregation_process, range(len(master_id)))

        # Populate the DataFrame with results
        for row_index, result, date in results:
            individual_node_segregation.loc[row_index, f'Proximity to Others on {date}'] = result

    if __name__ == '__main__':
        main()
toc = perf_counter()
time = toc-tic

print(f"Finished cell in {time//60:,.0f} minutes with {round(time%60,2):,.2f} seconds")

Proximity to Others: 100%|██████████| 63/63 [05:38<00:00,  5.37s/it]

Finished cell in 5 minutes with 38.35 seconds





---
# 3. Proximity Index

### Index of attention that group $g$ devotes to others ($-g$)

Using the same philosophy as the Freeman index, here we calculate the ratio between the proportion of cross ties in the graph and the proportion expected in the random scenario. In this case, we take into account the weights and the directionality of the graph. To that end, consider a _Contact Layer_ in which the entry $M^*_{g-g}$ corresponds to the sum of all the weights that nodes from group $g$ devote to any other group $-g$. In that case, we define $P$ as:

$$P = \frac{M^*_{g-g}}{M^*_{++}}$$

Recalling the construction of the weights, the weights going out of a retweet-active node sum to one (the corresponding row of the weighted adjacency matrix sums to one).

For the expected value of $P$, which we call $\pi$, the cross-tie weight is calculated as the amount of weight that group $g$ would devote to the others ($-g$) at random. An edge is made between two nodes $i$ and $j$ if $i$ retweeted $j$, so the weight that $i$ is expected to devote to another user $j$ depends on the number of original tweets that $j$ made and that $i$ could retweet. We therefore take the expected total weight from $g$ to $-g$ to be the number of original tweets made by $-g$ nodes over the total number of tweets made that day.

We then define $T_i$ as the number of original tweets made by $i$, and the number of tweets made by group $g$ as
$$T^g = \sum_{i\in G_g} T_i$$

Consequently, the number of tweets made by groups other than $g$ is
$$T^{-g} = \sum_{i\notin G_g} T_i$$

Writing $T^{+} = T^g + T^{-g}$ for the total number of tweets, the expected proportion is
$$\pi = \frac{T^{-g}}{T^{+}}$$
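The ratio $P/\pi$ can be sketched directly from the formulas in this section. The inputs are hypothetical toy data, and the normalisation used inside `Pr.proximity_g_others` and `Pr.at_random_scenario` may differ, so treat this as an illustration of the definitions only.

```python
import numpy as np

def proximity_to_others(X, t, tweets, g):
    """Ratio P / pi for group g, following the formulas stated in the text."""
    X = np.asarray(X, dtype=float)
    t = np.asarray(t)
    tweets = np.asarray(tweets, dtype=float)
    in_g = t == g
    weight_out = X[np.ix_(in_g, ~in_g)].sum()   # weight that g devotes to -g
    P = weight_out / X.sum()                    # observed share of all weight
    pi = tweets[~in_g].sum() / tweets.sum()     # T^{-g} / T^{+}
    return P / pi

# Toy data: three users, the first two in group 1 and the third in group 2.
X = np.array([[0.0, 0.25, 0.75],
              [1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
t = np.array([1, 1, 2])
tweets = np.array([2, 2, 4])

ratio = proximity_to_others(X, t, tweets, g=1)
print(ratio)
```

Here group 1 sends $0.75$ of the total weight of $3$ to the other group, so $P = 0.25$, while the others wrote half of the tweets, so $\pi = 0.5$ and the ratio is about $0.5$.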

In [10]:
def process_file(file, categories):
    results = []
    for pol in categories:
        g = gt.load_graph(file)
        date = file.split('/')[-1].split('.')[0].split('_')[-1]
        num = Pr.proximity_g_others(g, 'Political Label', 'Normal Weight', pol)
        den = Pr.at_random_scenario(g, 'Political Label', pol, 'Proximity to Others')
        seg = num/den
        results.append(((date, pol), seg))
    return results

# Run processing in parallel
with concurrent.futures.ProcessPoolExecutor() as executor:
    # Daily routine
    tic = perf_counter()
    futures = [executor.submit(process_file, file, categories) for file in files_daily]
    
    # Process results as they complete
    for future in tqdm(concurrent.futures.as_completed(futures), total=len(files_daily), desc="Daily rutine"):
        try:
            results = future.result()
            # Update group_segregation DataFrame with results
            for key, value in results:
                group_segregation_daily.loc[key, 'Proximity to Others'] = value
        except Exception as e:
            print(f'Generated an exception: {e}')
    toc = perf_counter()
    time = toc-tic

    print(f"Finished rutine in {time//60:,.0f} minutes with {round(time%60,2):,.2f} seconds")
    # 3 Day routine
    tic = perf_counter()
    futures = [executor.submit(process_file, file, categories) for file in files_3day]
    
    # Process results as they complete
    for future in tqdm(concurrent.futures.as_completed(futures), total=len(files_3day),desc = "3 Day rutine"):
        try:
            results = future.result()
            # Update group_segregation DataFrame with results
            for key, value in results:
                group_segregation_3day.loc[key, 'Proximity to Others'] = value
        except Exception as e:
            print(f'Generated an exception: {e}')
    toc = perf_counter()
    time = toc-tic

    print(f"Finished rutine in {time//60:,.0f} minutes with {round(time%60,2):,.2f} seconds")

Daily rutine: 100%|██████████| 63/63 [00:23<00:00,  2.71it/s]


Finished rutine in 0 minutes with 23.38 seconds


3 Day rutine: 100%|██████████| 61/61 [01:03<00:00,  1.03s/it]

Finished rutine in 1 minutes with 3.05 seconds





In [11]:
def process_file(file, categories):
    results = []
    for pol in categories:
        g = gt.load_graph(file)
        graph_name = file.split('/')[-1].split('.')[0].split('_')[-1]
        num = Pr.proximity_g_others(g, 'Political Label', 'Normal Weight', pol, in_proximity=False)
        den = Pr.at_random_scenario(g, 'Political Label', pol, 'Proximity to Others')
        seg = num/den
        results.append(((graph_name, pol), seg))
    return results

# Run processing in parallel
with concurrent.futures.ProcessPoolExecutor() as executor:
    
    # Daily routine
    tic = perf_counter()
    futures = [executor.submit(process_file, file, categories) for file in files_daily]
    
    # Process results as they complete
    for future in tqdm(concurrent.futures.as_completed(futures), total=len(files_daily), desc="Daily rutine"):
        try:
            results = future.result()
            # Update group_segregation DataFrame with results
            for key, value in results:
                group_segregation_daily.loc[key, "Other's Proximity"] = value
        except Exception as e:
            print(f'Generated an exception: {e}')
    toc = perf_counter()
    time = toc-tic

    print(f"Finished rutine in {time//60:,.0f} minutes with {round(time%60,2):,.2f} seconds")
    # 3 Day routine
    tic = perf_counter()
    futures = [executor.submit(process_file, file, categories) for file in files_3day]
    
    # Process results as they complete
    for future in tqdm(concurrent.futures.as_completed(futures), total=len(files_3day), desc="3 Day rutine"):
        try:
            results = future.result()
            # Update group_segregation DataFrame with results
            for key, value in results:
                group_segregation_3day.loc[key, "Other's Proximity"] = value
        except Exception as e:
            print(f'Generated an exception: {e}')
    toc = perf_counter()
    time = toc-tic

    print(f"Finished rutine in {time//60:,.0f} minutes with {round(time%60,2):,.2f} seconds")

Daily rutine: 100%|██████████| 63/63 [00:24<00:00,  2.60it/s]


Finished rutine in 0 minutes with 24.42 seconds


3 Day rutine: 100%|██████████| 61/61 [01:06<00:00,  1.08s/it]

Finished rutine in 1 minutes with 6.15 seconds





---
### Index of attention that group $h$ devotes to a specific group ($k$)

$$Prox_{h\to k}=\frac{(W_{hk}/A_h)}{(T_k/ \sum_{m\in G} T_m)}$$

where $W_{hk}$ is the sum of all the weights that members of group $h$ put on members of group $k$ and $A_h$ is the number of retweet-active members of group $h$ (on the day in question), that is, the number of members of group $h$ who retweeted at least one original tweet of some member of the whole community on that day or, equivalently, the number of members of group $h$ whose rows in the adjacency matrix (of that day) sum to 1.
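A minimal sketch of $Prox_{h\to k}$ on hypothetical toy data. It counts a member of $h$ as retweet-active when its row in the weight matrix has a positive sum, which is an assumption of this sketch; `Pr.proximity_g_h` may be implemented differently.

```python
import numpy as np

def group_proximity(X, t, tweets, h, k):
    """Proximity of group h to group k: (W_hk / A_h) over the tweet share of k."""
    X = np.asarray(X, dtype=float)
    t = np.asarray(t)
    tweets = np.asarray(tweets, dtype=float)
    in_h, in_k = t == h, t == k
    W_hk = X[np.ix_(in_h, in_k)].sum()          # weight from members of h to members of k
    A_h = (X[in_h].sum(axis=1) > 0).sum()       # retweet-active members of h
    if A_h == 0:
        return float("nan")                     # no active member, index undefined
    share_k = tweets[in_k].sum() / tweets.sum() # T_k / sum_m T_m
    return (W_hk / A_h) / share_k

# Toy data: three users, the first two in group 1 and the third in group 2.
X = np.array([[0.0, 0.25, 0.75],
              [1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
t = np.array([1, 1, 2])
tweets = np.array([2, 2, 4])

prox_1_to_2 = group_proximity(X, t, tweets, h=1, k=2)
prox_1_to_1 = group_proximity(X, t, tweets, h=1, k=1)
print(prox_1_to_2, prox_1_to_1)
```

Dividing by $A_h$ turns the total weight into an average per active member, so the numerator is comparable with the tweet share in the denominator. In this example group 1 gives group 2 an average weight of $0.375$ against a benchmark of $0.5$, which yields $0.75$.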

In [12]:
def process_file(file, categories):
    results = []
    for pol_in in categories:
        for pol_out in categories:
            g = gt.load_graph(file)
            date = file.split('/')[-1].split('.')[0].split('_')[-1]
            num = Pr.proximity_g_h(g, 'Political Label', 'Normal Weight', pol_in, pol_out)
            den = Pr.at_random_scenario(g, 'Political Label', pol_out, 'Proximity to Group')
            seg = num/den
            results.append(((date, pol_out), pol_in, seg))
    return results

# Run processing in parallel
with concurrent.futures.ProcessPoolExecutor() as executor:
    # Daily routine
    tic = perf_counter()
    futures = [executor.submit(process_file, file, categories) for file in files_daily]
    
    # Process results as they complete
    for future in tqdm(concurrent.futures.as_completed(futures), total=len(files_daily), desc = "Daily rutine"):
        try:
            results = future.result()
            # Update group_segregation DataFrame with results
            for key, var, value in results:
                group_segregation_daily.loc[key, f'Proximity From {var} To'] = value
        except Exception as e:
            print(f'Generated an exception: {e}')
    toc = perf_counter()
    time = toc-tic

    print(f"Finished rutine in {time//60:,.0f} minutes with {round(time%60,2):,.2f} seconds")
    
    tic = perf_counter()
    futures = [executor.submit(process_file, file, categories) for file in files_3day]
    # Process results as they complete
    for future in tqdm(concurrent.futures.as_completed(futures), total=len(files_3day), desc="3 Day rutine"):
        try:
            results = future.result()
            # Update group_segregation DataFrame with results
            for key, var, value in results:
                group_segregation_3day.loc[key, f'Proximity From {var} To'] = value
        except Exception as e:
            print(f'Generated an exception: {e}')
    toc = perf_counter()
    time = toc-tic

    print(f"Finished rutine in {time//60:,.0f} minutes with {round(time%60,2):,.2f} seconds")

Daily rutine: 100%|██████████| 63/63 [01:29<00:00,  1.42s/it]


Finished rutine in 1 minutes with 29.89 seconds


3 Day rutine: 100%|██████████| 61/61 [04:50<00:00,  4.76s/it]

Finished rutine in 4 minutes with 50.24 seconds





In [13]:
def process_file(file, categories):
    results = []
    for pol1 in categories:
        for pol2 in categories:
            g = gt.load_graph(file)
            date = file.split('/')[-1].split('.')[0].split('_')[-1]
            num = Pr.proximity_g_h(g, 'Political Label', 'Normal Weight', pol1, pol2, in_proximity=False)
            den = Pr.at_random_scenario(g, 'Political Label', pol2, 'Proximity to Group')
            seg = num/den
            results.append(((date, pol2), pol1, seg))
    return results

# Run processing in parallel
with concurrent.futures.ProcessPoolExecutor() as executor:
    # Daily routine
    tic = perf_counter()
    futures = [executor.submit(process_file, file, categories) for file in files_daily]
    
    # Process results as they complete
    for future in tqdm(concurrent.futures.as_completed(futures), total=len(files_daily), desc="Daily rutine"):
        try:
            results = future.result()
            # Update group_segregation DataFrame with results
            for key, var, value in results:
                group_segregation_daily.loc[key, f'Proximity {var} Took From'] = value
        except Exception as e:
            print(f'Generated an exception: {e}')
    toc = perf_counter()
    time = toc-tic

    print(f"Finished rutine in {time//60:,.0f} minutes with {round(time%60,2):,.2f} seconds")
    # 3 Day routine
    tic = perf_counter()
    futures = [executor.submit(process_file, file, categories) for file in files_3day]
    
    # Process results as they complete
    for future in tqdm(concurrent.futures.as_completed(futures), total=len(files_3day), desc = "3 Day rutine"):
        try:
            results = future.result()
            # Update group_segregation DataFrame with results
            for key, var, value in results:
                group_segregation_3day.loc[key, f'Proximity {var} Took From'] = value
        except Exception as e:
            print(f'Generated an exception: {e}')
    toc = perf_counter()
    time = toc-tic

    print(f"Finished rutine in {time//60:,.0f} minutes with {round(time%60,2):,.2f} seconds")

Daily rutine: 100%|██████████| 63/63 [01:48<00:00,  1.72s/it]


Finished rutine in 1 minutes with 48.74 seconds


3 Day rutine: 100%|██████████| 61/61 [04:03<00:00,  3.99s/it]

Finished rutine in 4 minutes with 3.37 seconds





---
# 4. Assortativity

- **Assortativity** is the tendency of a network's nodes to attach to others that are similar in some way. Though the specific measure of similarity may vary, network theorists often examine assortativity in terms of a node's degree.

    The **assortativity coefficient** is the Pearson correlation coefficient of degree between pairs of linked nodes. Positive values of `r` indicate a correlation between nodes of similar degree, while negative values indicate relationships between nodes of different degree. In general, `r` lies between `−1` and `1`. When `r = 1`, the network is said to have perfect assortative mixing patterns, when `r = 0` the network is non-assortative, while at `r = −1` the network is completely disassortative.

    The *assortativity coefficient* is given by 

    $$
    r = \frac{\sum_{jk}{jk (e_{jk} - q_j q_k)}}{\sigma_{q}^{2}}
    $$

    In this equation:

    - $ \sum_{jk} $ denotes the summation over all degrees $ j $ and $ k $ in the network.
    - $ jk $ represents the product of degrees $ j $ and $ k $.
    - $ e_{jk} $ is the joint probability distribution of the remaining degrees of two connected vertices. In an undirected graph, this is symmetric and must satisfy the sum rules:
        - $ \sum_{jk}{e_{jk}} = 1 $, ensuring that the total probability is 1.
        - $ \sum_{j}{e_{jk}} = q_{k} $, linking it to the distribution of the remaining degree.
    - $ q_j $ and $ q_k $ are the probabilities that the vertex at one end of a randomly chosen edge has remaining degree $ j $ or $ k $, respectively.
    - $ \sigma_{q}^{2} $ is the variance of the distribution of the remaining degree.

    The term $ q_{k} $ represents the distribution of the *remaining degree*, which captures the number of edges leaving a node, excluding the edge that connects the pair in question. This distribution is derived from the degree distribution $ p_{k} $ as follows:

    $$
    q_{k} = \frac{(k+1)p_{k+1}}{\sum_{j \geq 1} j p_j}
    $$

    - Here, $ p_{k} $ is the degree distribution of the network, and $ p_{k+1} $ refers to the probability of a node having $ k+1 $ connections.


- **Categorical Assortativity (assortativity by attribute)** measures how often nodes with a certain categorical attribute, like color or type, connect to other nodes with the same attribute. For an undirected network it is given by:

    $$
    r = \frac{\sum_{i}{e_{ii}} - \sum_{i}{q_i^2}}{1 - \sum_{i}{q_i^2}}
    $$

    Where:

    - $ e_{ij} $ is the proportion of edges in the network that connect nodes of type $ i $ to nodes of type $ j $, so $ e_{ii} $ is the proportion of edges joining two nodes of type $ i $.
    - $ q_i = \sum_{j}{e_{ij}} $ is the proportion of edge ends attached to nodes of type $ i $, that is, the probability that one end of a randomly chosen edge is of type $ i $. In a directed network, $ q_i^2 $ is replaced by the product of the proportions of edges that start and that end at type $ i $.

    In this context:

    - A positive value of $ r $ indicates assortative mixing, where nodes tend to connect to others that are similar.
    - A negative value of $ r $ indicates disassortative mixing, where nodes tend to connect to others that are different.
    - A value of $ r $ close to 0 suggests no particular preference for nodes to connect to others based on the categorical attribute.
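
As a hedged illustration of the categorical coefficient, the sketch below builds the mixing matrix $ e_{ij} $ of a small undirected, unweighted toy graph and evaluates $ r = (\sum_i e_{ii} - \sum_i q_i^2) / (1 - \sum_i q_i^2) $. The function name `categorical_assortativity` and the toy edge lists are assumptions made for this example; this is not how `graph_tool` computes the value internally.

```python
import numpy as np

def categorical_assortativity(edges, labels):
    """Assortativity by attribute for an undirected, unweighted edge list."""
    cats = sorted(set(labels))
    idx = {c: i for i, c in enumerate(cats)}
    e = np.zeros((len(cats), len(cats)))
    for u, v in edges:
        i, j = idx[labels[u]], idx[labels[v]]
        # Count every undirected edge in both directions so e is symmetric
        e[i, j] += 1
        e[j, i] += 1
    e /= e.sum()            # e_ij: proportion of edge ends joining types i and j
    q = e.sum(axis=1)       # q_i: proportion of edge ends attached to type i
    return (np.trace(e) - np.sum(q**2)) / (1 - np.sum(q**2))

labels = ["A", "A", "B", "B"]                       # hypothetical node types
r_segregated = categorical_assortativity([(0, 1), (2, 3)], labels)  # only same-type ties
r_mixed = categorical_assortativity([(0, 2), (1, 3)], labels)       # only cross-type ties
print(r_segregated, r_mixed)
```

The two toy graphs are the extreme cases: ties only within types give perfectly assortative mixing, and ties only across types give perfectly disassortative mixing.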

In [14]:
# Storage in DataFrame
tic = perf_counter()
for file in tqdm(files_daily, desc="Assortativity Daily"):
    g = gt.load_graph(file)
    date = file.split('/')[-1].split('.')[0].split('_')[-1]
    for pol in categories:
        # Non weighted (gt.assortativity returns a tuple; the coefficient is its first element)
        seg_no_w = gt.assortativity(g, g.vp[pol])
        group_segregation_daily.loc[(date, pol), 'Non Weighted Assortativity'] = seg_no_w[0]
        
        # Weighted
        seg_w = gt.assortativity(g, g.vp[pol], eweight=g.ep['Normal Weight'])
        group_segregation_daily.loc[(date, pol), 'Normal Weighted Assortativity'] = seg_w[0]
        
        seg_w = gt.assortativity(g, g.vp[pol], eweight=g.ep['Number of rts'])
        group_segregation_daily.loc[(date, pol), 'Weighted Assortativity'] = seg_w[0]
        
    #Global
    seg = gt.assortativity(g, g.vp['Political Label'], eweight=g.ep['Normal Weight'])
    global_segregation_daily.loc[(date), 'Normal Weighted Assortativity'] = seg[0]
    
    seg = gt.assortativity(g, g.vp['Political Label'], eweight=g.ep['Number of rts'])
    global_segregation_daily.loc[(date), 'Weighted Assortativity'] = seg[0]
    
    seg = gt.assortativity(g, g.vp['Political Label'])
    global_segregation_daily.loc[(date), 'Non Weighted Assortativity'] = seg[0]
toc = perf_counter()
time = toc-tic

print(f"Finished assortativity in {time//60:,.0f} minutes with {round(time%60,2):,.2f} seconds")

Assortativity Daily: 100%|██████████| 63/63 [00:53<00:00,  1.19it/s]

Finished assortativity in 0 minutes with 53.11 seconds





In [15]:
# Storage in DataFrame
tic = perf_counter()
for file in tqdm(files_3day, desc= "Assortativity 3 Day"):
    g = gt.load_graph(file)
    date = file.split('/')[-1].split('.')[0].split('_')[-1]
    for pol in categories:
        # Non weighted (gt.assortativity returns a tuple; the coefficient is its first element)
        seg_no_w = gt.assortativity(g, g.vp[pol])
        group_segregation_3day.loc[(date, pol), 'Non Weighted Assortativity'] = seg_no_w[0]
        
        # Weighted
        seg_w = gt.assortativity(g, g.vp[pol], eweight=g.ep['Normal Weight'])
        group_segregation_3day.loc[(date, pol), 'Normal Weighted Assortativity'] = seg_w[0]
        
        seg_w = gt.assortativity(g, g.vp[pol], eweight=g.ep['Number of rts'])
        group_segregation_3day.loc[(date, pol), 'Weighted Assortativity'] = seg_w[0]
        
    #Global
    seg = gt.assortativity(g, g.vp['Political Label'], eweight=g.ep['Normal Weight'])
    global_segregation_3day.loc[(date), 'Normal Weighted Assortativity'] = seg[0]
    
    seg = gt.assortativity(g, g.vp['Political Label'], eweight=g.ep['Number of rts'])
    global_segregation_3day.loc[(date), 'Weighted Assortativity'] = seg[0]
    
    seg = gt.assortativity(g, g.vp['Political Label'])
    global_segregation_3day.loc[(date), 'Non Weighted Assortativity'] = seg[0]
toc = perf_counter()
time = toc-tic

print(f"Finished routine in {time//60:,.0f} minutes with {round(time%60,2):,.2f} seconds")

Assortativity 3 Day: 100%|██████████| 61/61 [02:13<00:00,  2.19s/it]

Finished routine in 2 minutes with 13.75 seconds





---
# 5. Homophily Index

Homophily refers to the tendency of individuals (or nodes in a network) to associate and bond with similar others. The similarity can be based on various attributes such as social characteristics, behaviors, or beliefs. In the context of a network, this implies that nodes are more likely to form connections with other nodes that belong to the same group or share similar attributes. 

**Measuring homophily.** We begin with some simple definitions that are important in measuring homophily and also in presenting the model.

Let $ N_i $ denote the number of type $ i $ individuals in the population, and let $ w_i = \frac{N_i}{N} $ be the relative fraction of type $ i $ in the population, where $ N = \sum_k N_k $.

Let $ s_i $ denote the average number of friendships that agents of type $ i $ have with agents who are of the same type, and let $ d_i $ be the average number of friendships that type $ i $ agents form with agents of types different from $ i $. Let $ t_i = s_i + d_i $ be the average total number of friendships that type $ i $ agents form.

The homophily index $ H_i $ measures the fraction of the ties of individuals of type $ i $ that are with that same type.

**Definition 1** The homophily index $ H_i $ is defined by

$$ H_i = \frac{s_i}{s_i + d_i} $$

The profile $ (s, d) $ exhibits *baseline homophily* for type $ i $ if $ H_i = w_i $.

The profile $ (s, d) $ exhibits *inbreeding homophily* for type $ i $ if $ H_i > w_i $.

Generally, there is a difficulty in simply measuring homophily according to $ H_i $. For example, consider a group that comprises 95% of a population. Suppose that its same-type friendships are 96% of its friendships. Compare this to a group that comprises 5% of a population and has 96% of its friendships being same-type. Although both have the same homophily index, they are very different in terms of how homophilous they are relative to how homophilous they could be. Comparing the homophily index, $ H_i $, to the baseline, $ w_i $, provides some information, but even that does not fully capture the idea of how biased a group is compared to how biased it could potentially be. To take care of this we use the inbreeding homophily index introduced by Coleman [Coleman J. (1958) *Human Organization* 17:28–36], which normalizes the homophily index by the potential extent to which a group could be biased.

**Definition 2** Coleman's inbreeding homophily index of type $i$ is

$$IH_i = \frac{H_i - w_i}{1 - w_i}$$

This index measures the amount of bias with respect to baseline homophily as it relates to the maximum possible bias (the term $ 1 - w_i $). It can be easily checked that we have inbreeding homophily for type $ i $ if and only if $ IH_i > 0 $, and inbreeding heterophily for type $ i $ if and only if $ IH_i < 0 $. The index of inbreeding homophily is 0 if there is pure baseline homophily, and 1 if a group completely inbreeds.
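
To make the two definitions concrete, here is a minimal sketch that evaluates $ H_i $ and $ IH_i $ for two hypothetical groups that share the same homophily index but differ in population share. The helper `homophily_indices` and all the numbers are assumptions chosen for illustration; they are not taken from `utils.Homophily` or from the data.

```python
def homophily_indices(same, diff, group_size, population):
    """Return (w_i, H_i, IH_i) from average same-type and different-type ties."""
    w = group_size / population     # baseline: relative size of the group
    H = same / (same + diff)        # homophily index, Definition 1
    IH = (H - w) / (1 - w)          # Coleman's inbreeding homophily, Definition 2
    return w, H, IH

# Hypothetical majority group: 95% of the population, 96% same-type ties
w_big, H_big, IH_big = homophily_indices(same=96, diff=4, group_size=95, population=100)
# Hypothetical minority group: 5% of the population, 96% same-type ties
w_small, H_small, IH_small = homophily_indices(same=96, diff=4, group_size=5, population=100)

print(f"majority: H={H_big:.2f}, IH={IH_big:.3f}")
print(f"minority: H={H_small:.2f}, IH={IH_small:.3f}")
```

Both groups have $ H_i = 0.96 $, yet the inbreeding index is about 0.2 for the majority and about 0.958 for the minority, which is the difference that the normalization by $ 1 - w_i $ is meant to expose.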

In [16]:
def process_file(file, categories):
    results = []
    g = gt.load_graph(file)
    date = file.split('/')[-1].split('.')[0].split('_')[-1]
    homophily_dict = Ho.homophily_index(graph = g, property_name = "Political Label")
    H = homophily_dict['H_i']
    IH = homophily_dict['IH_i']
    for pol in categories:
        results.append(((date, pol), H[pol], IH[pol]))
    return results

# Run processing in parallel
with concurrent.futures.ProcessPoolExecutor() as executor:
    # Daily routine
    tic = perf_counter()
    futures = [executor.submit(process_file, file, categories) for file in files_daily]
    
    # Process results as they complete
    for future in tqdm(concurrent.futures.as_completed(futures), total=len(files_daily)):
        try:
            results = future.result()
            # Update group_segregation DataFrame with results
            for key, H_value, IH_value in results:
                group_segregation_daily.loc[key, 'Homiphily Index'] = H_value
                group_segregation_daily.loc[key, 'Inbreeding Homiphily Index'] = IH_value
        except Exception as e:
            print(f'Generated an exception: {e}')
    toc = perf_counter()
    time = toc-tic

    print(f"Finished routine in {time//60:,.0f} minutes with {round(time%60,2):,.2f} seconds")
    
    # 3 Day routine
    tic = perf_counter()
    futures = [executor.submit(process_file, file, categories) for file in files_3day]
    
    # Process results as they complete
    for future in tqdm(concurrent.futures.as_completed(futures), total=len(files_3day)):
        try:
            results = future.result()
            # Update group_segregation DataFrame with results
            for key, H_value, IH_value in results:
                group_segregation_3day.loc[key, 'Homiphily Index'] = H_value
                group_segregation_3day.loc[key, 'Inbreeding Homiphily Index'] = IH_value
        except Exception as e:
            print(f'Generated an exception: {e}')
    toc = perf_counter()
    time = toc-tic

    print(f"Finished routine in {time//60:,.0f} minutes with {round(time%60,2):,.2f} seconds")

100%|██████████| 63/63 [00:05<00:00, 10.80it/s]


Finished routine in 0 minutes with 5.97 seconds


100%|██████████| 61/61 [00:16<00:00,  3.74it/s]

Finished routine in 0 minutes with 16.31 seconds





---
# 6. Spectral Segregation Index

Explanation pending.

In [None]:
# TODO: code pending

---
## Outputs

In [17]:
group_segregation_daily.head()

Date,Political Label,Classic Freeman,Freeman One vs Others,Proximity to Others,Other's Proximity,Proximity From Izquierda To,Proximity From Centro To,Proximity From Sin Clasificar To,Proximity From Derecha To,Proximity Izquierda Took From,Proximity Centro Took From,Proximity Sin Clasificar Took From,Proximity Derecha Took From,Non Weighted Assortativity,Normal Weighted Assortativity,Weighted Assortativity,Homiphily Index,Inbreeding Homiphily Index
2021-04-28,Centro,0.168064,0.168064,0.611952,1.427209,1.216957,2.500138,1.693752,0.656383,0.563294,2.500138,0.635724,0.194764,0.173848,0.229559,0.90942,0.491196,0.382319
2021-04-28,Derecha,0.867219,0.867219,0.229847,0.12853,0.060098,0.211499,0.320882,3.243938,0.045368,0.71278,0.13112,3.243938,0.86579,0.823512,0.947555,0.85232,0.811574
2021-04-28,Izquierda,0.493014,0.493014,0.608198,0.327515,1.404507,0.805096,0.984127,0.059713,1.404507,1.739356,1.097132,0.079099,0.491508,0.464214,0.639556,0.709069,0.34699
2021-04-28,Sin Clasificar,0.476025,0.476025,0.95933,0.738643,0.92645,0.767263,1.825639,0.145729,0.831026,2.044212,1.825639,0.356634,0.022981,0.057877,0.970115,0.062592,0.01012
2021-04-29,Centro,0.225944,0.225944,0.601121,1.384234,1.303452,2.668476,1.931755,0.435901,0.526207,2.668476,1.196969,0.219353,0.189107,0.241927,-1.321798,0.50339,0.397122


In [18]:
global_segregation_daily.head()

Unnamed: 0,Freeman Global,Normal Weighted Assortativity,Weighted Assortativity,Non Weighted Assortativity
2021-04-28,0.51816,0.47042,0.805816,0.49781
2021-04-29,0.535199,0.487073,0.90067,0.521871
2021-04-30,0.552819,0.484939,0.936285,0.489328
2021-05-01,0.535726,0.480245,1.198054,0.464897
2021-05-02,0.524035,0.465449,1.192093,0.461874


In [19]:
individual_group_segregation.sort_index(axis=1, inplace=True)
individual_group_segregation

Node,Political Label,Proximity index on 2021-04-28,Proximity index on 2021-04-29,Proximity index on 2021-04-30,Proximity index on 2021-05-01,Proximity index on 2021-05-02,Proximity index on 2021-05-03,Proximity index on 2021-05-04,Proximity index on 2021-05-05,Proximity index on 2021-05-06,Proximity index on 2021-05-07,...,Proximity index on 2021-06-20,Proximity index on 2021-06-21,Proximity index on 2021-06-22,Proximity index on 2021-06-23,Proximity index on 2021-06-24,Proximity index on 2021-06-25,Proximity index on 2021-06-26,Proximity index on 2021-06-27,Proximity index on 2021-06-28,Proximity index on 2021-06-29
0,Centro,,,,2.773702,1.340203,0.706182,2.840593,2.185737,3.483649,2.868721,...,0.000000,0.606910,4.814565,3.574741,,4.79683,0.000000,0.000000,,0.000000
0,Derecha,,,,0.000000,0.000000,0.000000,0.000000,0.309840,0.000000,0.282886,...,0.000000,0.000000,0.000000,0.000000,,0.00000,0.000000,0.000000,,0.000000
0,Izquierda,,,,0.000000,1.340299,1.545909,0.793379,0.678190,0.638786,0.624553,...,2.136299,1.446277,0.000000,0.438101,,0.00000,2.279116,2.105939,,2.217557
0,Sin Clasificar,,,,10.660952,0.000000,0.000000,0.000000,3.053932,0.000000,1.491261,...,0.000000,3.707400,0.000000,0.000000,,0.00000,0.000000,0.000000,,0.000000
1,Centro,4.865854,0.863818,1.143761,2.773702,3.573875,0.617910,0.464824,1.275013,2.239489,0.000000,...,0.000000,1.213820,0.000000,,,0.00000,,,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23485,Sin Clasificar,0.000000,1.963092,0.754598,0.947640,0.854420,1.113207,0.000000,0.454841,0.481270,2.354623,...,,,0.000000,,0.0,0.00000,0.000000,0.000000,,0.000000
23486,Centro,,0.000000,0.953134,0.000000,1.786937,0.000000,,,2.090190,,...,,1.618426,2.407283,,,0.00000,,,,0.000000
23486,Derecha,,0.000000,0.000000,0.000000,0.000000,0.000000,,,0.000000,,...,,0.000000,0.000000,,,0.00000,,,,0.000000
23486,Izquierda,,2.075816,1.557179,1.782254,1.191377,1.803561,,,0.766544,,...,,1.402451,1.096721,,,2.31540,,,,2.217557


In [21]:
individual_node_segregation.reset_index(names='ID', inplace=True)
individual_node_segregation.sort_index(axis=1, inplace=True)
individual_node_segregation

Unnamed: 0,ID,Political Label,Proximity to Others on 2021-04-28,Proximity to Others on 2021-04-29,Proximity to Others on 2021-04-30,Proximity to Others on 2021-05-01,Proximity to Others on 2021-05-02,Proximity to Others on 2021-05-03,Proximity to Others on 2021-05-04,Proximity to Others on 2021-05-05,...,Proximity to Others on 2021-06-20,Proximity to Others on 2021-06-21,Proximity to Others on 2021-06-22,Proximity to Others on 2021-06-23,Proximity to Others on 2021-06-24,Proximity to Others on 2021-06-25,Proximity to Others on 2021-06-26,Proximity to Others on 2021-06-27,Proximity to Others on 2021-06-28,Proximity to Others on 2021-06-29
0,0,Izquierda,,,,2.278356,0.567636,0.320637,1.263176,1.357992,...,0.000000,0.595645,1.837912,1.471984,,1.760225,0.000000,0.000000,,0.000000
1,1,Izquierda,1.968591,0.321588,0.430252,1.139178,1.513695,0.467596,0.206702,0.528108,...,0.000000,0.953032,0.000000,,,0.000000,,,0.000000,0.000000
2,2,Izquierda,0.393718,0.000000,0.339672,0.000000,0.592315,0.112223,0.275602,0.704144,...,,0.211785,1.837912,0.000000,0.000000,,,,,0.260188
3,3,Izquierda,0.787437,1.286351,0.430252,0.402063,0.000000,0.561115,0.757906,0.704144,...,0.752020,0.381213,1.225275,1.839980,0.349798,0.440056,0.000000,1.269472,,0.000000
4,4,Izquierda,0.904488,0.694630,0.910148,1.367014,0.891999,1.645938,1.263176,1.207104,...,0.846022,1.397780,1.102747,1.471984,0.925935,1.357888,0.979985,1.038659,1.269704,1.113027
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23482,23482,Izquierda,,,,0.000000,0.702787,0.748154,,,...,0.376010,0.953032,0.424134,0.735992,1.165992,0.000000,,0.761683,0.989380,
23483,23483,Izquierda,0.000000,0.482382,0.537815,0.455671,0.000000,0.000000,0.757906,1.056216,...,0.000000,1.906063,0.000000,0.000000,0.000000,0.586742,,,,1.821316
23484,23484,Izquierda,0.000000,,0.000000,0.000000,0.000000,0.000000,0.000000,,...,,0.000000,,0.000000,,,,,,
23485,23485,Izquierda,0.246074,0.701646,0.485768,0.506301,0.499519,0.885972,0.505270,0.359563,...,,,0.000000,,0.000000,0.586742,1.781790,0.000000,,0.000000


# CHECKPOINT: Save DataFrames

In [22]:
# Run to save
group_segregation_daily.to_pickle(os.path.join(path,"Segregation",'group_segregation_daily.pkl'))
global_segregation_daily.to_pickle(os.path.join(path,"Segregation",'global_segregation_daily.pkl'))

group_segregation_3day.to_pickle(os.path.join(path,"Segregation",'group_segregation_3day.pkl'))
global_segregation_3day.to_pickle(os.path.join(path,"Segregation",'global_segregation_3day.pkl'))

individual_group_segregation.to_pickle(os.path.join(path,"Segregation",'individual_group_segregation.pkl'))
individual_node_segregation.to_pickle(os.path.join(path,"Segregation",'individual_node_segregation.pkl'))

# CHECKPOINT: Vectorized Proximity

## Index of Proximity Between Groups

$$Prox_{j\rightarrow k}=\frac{W_{jk}}{(T_k/\sum_{m\in G} T_m)}$$
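
A small worked instance of this ratio may help fix ideas. The sketch assumes that $ W_{jk} $ is the share of node $ j $'s retweets directed at group $ k $ and that $ T_k $ is the number of tweets produced by group $ k $; every number below is hypothetical.

```python
import numpy as np

# Hypothetical retweets made by one node j toward each of four groups
rts_from_j = np.array([2.0, 0.0, 6.0, 2.0])
# Hypothetical tweets produced by each group (T_k)
tweets_per_group = np.array([100.0, 300.0, 400.0, 200.0])

W_j = rts_from_j / rts_from_j.sum()                    # W_jk: share of j's retweets going to k
baseline = tweets_per_group / tweets_per_group.sum()   # T_k / sum_m T_m: at-random share
prox_j = W_j / baseline                                # Prox_{j -> k}

print(prox_j)
```

Values above 1 mean that node $ j $ retweets group $ k $ more often than the group's share of tweets would suggest, values below 1 mean less often, and 0 means no retweets toward that group.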

### Measure $W_{jk}$

In [23]:
# Load graphs
files = glob(os.path.join(path_daily,"Graphs", "*.graphml"))
files = np.sort(files)

results = []
tic = perf_counter()
for file in tqdm(files, desc="Numerator calculation"):    
    # Load the graph
    g = gt.load_graph(file)
    graph_date = re.search(r"(\d{4}-\d{2}-\d{2})", file).group(1)

    # Number of vertices/individuals
    n_individuos = g.num_vertices()
    
    # Identify the unique political affiliations and assign an index to each
    political_labeling = np.array([g.vp["Political Label"][j] for j in range(n_individuos)]) 
    unique_affiliations = np.unique(political_labeling)
    affiliation_to_index = {affiliation: i for i, affiliation in enumerate(unique_affiliations)}
            
    # Count the number of rts from each individual to each political affiliation
    # The +2 adds the "Mismo" (same) and "Otros" (others) columns
    results_matrix = np.zeros((n_individuos, len(unique_affiliations) + 2))
    for e in g.edges():
        s = int(e.source())
        t = int(e.target())
        rts = g.ep['Number of rts'][e]
        affiliation_index = affiliation_to_index[political_labeling[t]]
        results_matrix[s, affiliation_index] += rts
        # RT to someone with the same political affiliation
        if political_labeling[s] == political_labeling[t]:
            results_matrix[s, len(unique_affiliations)] += rts
        # RT to someone with a different political affiliation
        else:
            results_matrix[s, len(unique_affiliations) + 1] += rts
    
    # Normalize the matrix as a share of each node's total outgoing RTs
    total_rts_por_nodo = results_matrix[:, 0:len(unique_affiliations)].sum(axis = 1, keepdims = True)
    total_rts_por_nodo2 = total_rts_por_nodo[:, [0]*results_matrix.shape[1]]
    # Compute W_jk
    with np.errstate(divide = 'ignore', invalid = 'ignore'):
        results_matrix_normalizada = np.divide(results_matrix, total_rts_por_nodo2)
    # Then replace the entries where total_rts_por_nodo2 is 0 with NaN
    # This covers both 0/0 and value/0 divisions
    results_matrix_normalizada[total_rts_por_nodo2 == 0] = np.nan

    # Build a dictionary to consolidate the results
    temp = {
        "Nodo_ID": list(range(n_individuos)),
        "Political_Affiliation": political_labeling,
        "Date": graph_date,
        "Total_RTs": total_rts_por_nodo.flatten()
    }

    additional_categories = np.array(["Mismo", "Otros"])

    # Concatenate unique_affiliations with additional_categories
    extended_affiliations = np.concatenate((unique_affiliations, additional_categories))

    # Add the RT columns for each political affiliation
    for i, affiliation in enumerate(extended_affiliations):
        temp[f"rts_j_{affiliation}"] = results_matrix[:, i]
        temp[f"W_j_{affiliation}"] = results_matrix_normalizada[:, i]

    df_temp = pd.DataFrame(temp)
    results.append(df_temp)

W_jk = pd.concat(results, ignore_index = True)
W_jk.to_pickle(path = os.path.join(path,"Segregation","W_jk.gzip"), compression = "gzip")
toc = perf_counter()
time = toc-tic

print(f"Finished routine in {time//60:,.0f} minutes with {round(time%60,2):,.2f} seconds")

Numerator calculation: 100%|██████████| 63/63 [01:37<00:00,  1.55s/it]


Finished routine in 2 minutes with 50.60 seconds
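
The per-edge Python loop in the cell above could, in principle, be replaced by a single NumPy accumulation. The following is only a sketch under stated assumptions: the arrays `sources`, `targets`, `weights`, and `group_index` are hypothetical inputs assumed to have been extracted from the graph beforehand, and the helper `rts_by_target_group` does not exist in the notebook.

```python
import numpy as np

def rts_by_target_group(sources, targets, weights, group_index, n_groups):
    """Sum edge weights per (source node, group of the target node)."""
    n_nodes = len(group_index)
    out = np.zeros((n_nodes, n_groups))
    # np.add.at accumulates correctly even when the same (row, column) pair repeats
    np.add.at(out, (sources, group_index[targets]), weights)
    return out

# Toy directed edge list: source node, target node, number of retweets
sources = np.array([0, 0, 1, 2, 2])
targets = np.array([1, 2, 2, 0, 0])
weights = np.array([3.0, 1.0, 2.0, 4.0, 1.0])
group_index = np.array([0, 1, 1])   # node 0 is in group 0; nodes 1 and 2 are in group 1

counts = rts_by_target_group(sources, targets, weights, group_index, n_groups=2)
print(counts)
```

`np.add.at` is used instead of plain fancy-index assignment because the latter would keep only one of the contributions when a source node retweets the same group through several edges.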


### Denominator

In [24]:
results = []
tic = perf_counter()
for file in tqdm(files, desc = "Denominator calculation"):  
    # Load the graph
    g = gt.load_graph(file)
    graph_date = re.search(r"(\d{4}-\d{2}-\d{2})", file).group(1)

    # Number of vertices/individuals
    n_individuos = g.num_vertices()

    # Compute the number of tweets per day for each political affiliation

    # Identify the unique political affiliations and assign an index to each
    political_labeling = np.array([g.vp["Political Label"][j] for j in range(n_individuos)]) 
    unique_affiliations = np.unique(political_labeling)
    affiliation_to_index = {affiliation: i for i, affiliation in enumerate(unique_affiliations)}
                
    # Sum the tweets of every individual by political affiliation
    results_matrix = np.zeros(len(unique_affiliations))
    for v in g.vertices():
        n = g.vp["Tweets"][v]
        pl = g.vp["Political Label"][v]
        affiliation_index = affiliation_to_index[pl]
        results_matrix[affiliation_index] += n

    # Compute the denominator for each affiliation
    total = results_matrix.sum()
    denominadores = results_matrix/total


    # Build the denominator for each individual
    denominador = np.zeros((n_individuos, 2))
    for v in g.vertices():
        pl = g.vp["Political Label"][v]
        affiliation_index = affiliation_to_index[pl]
        mismo = results_matrix[affiliation_index]/total
        otros = 1 - mismo
        denominador[int(v), :] = [mismo, otros]

    # Build a dictionary to consolidate the results
    # np.unique sorts the labels, so indices 0-3 are assumed to map to
    # Centro, Derecha, Izquierda and Sin Clasificar, in that order
    temp = {
        "Nodo_ID": list(range(n_individuos)),
        "Political_Affiliation": political_labeling,
        "Date": graph_date,
        "Denominador Centro": denominadores[0],
        "Denominador Derecha": denominadores[1],
        "Denominador Izquierda": denominadores[2],
        "Denominador Sin Clasificar": denominadores[3],
        "Denominador Mismo": denominador[:, 0].flatten(),
        "Denominador Otros": denominador[:, 1].flatten()
    }

    df_temp = pd.DataFrame(temp)
    results.append(df_temp)
denominador = pd.concat(results, ignore_index = True)
denominador.to_pickle(path = os.path.join(path,"Segregation","denominador.gzip"), compression = "gzip")
toc = perf_counter()
time = toc-tic

print(f"Finished routine in {time//60:,.0f} minutes with {round(time%60,2):,.2f} seconds")

Denominator calculation: 100%|██████████| 63/63 [01:02<00:00,  1.01it/s]


Finished routine in 1 minutes with 25.57 seconds


In [25]:
# Preparation for the proximity computation
num = W_jk[["W_j_Centro", "W_j_Derecha", "W_j_Izquierda", "W_j_Sin Clasificar", "W_j_Mismo", "W_j_Otros"]].values
dem = denominador.iloc[:, 3::].values
proximidad = pd.DataFrame(num/dem, columns = ["P_Centro", "P_Derecha", "P_Izquierda", "P_Sin Clasificar", "P_Mismo", "P_Otros"])
proximidad = pd.concat([W_jk.iloc[:, :4], proximidad], axis = 1)
proximidad.head(3)

Unnamed: 0,Nodo_ID,Political_Affiliation,Date,Total_RTs,P_Centro,P_Derecha,P_Izquierda,P_Sin Clasificar,P_Mismo,P_Otros
0,0,Izquierda,2021-04-28,0.0,,,,,,
1,1,Izquierda,2021-04-28,1.0,4.865854,0.0,0.0,0.0,0.0,1.968591
2,2,Izquierda,2021-04-28,20.0,0.973171,0.0,1.625942,0.0,1.625942,0.393718


In [26]:
proximidad.to_pickle(path = os.path.join(path,"Segregation","proximidad.gzip"), compression = "gzip")
proximidad.to_csv(os.path.join(path,"Segregation","proximidad.csv"),index=False, sep=';')

# Calculation Check

In [3]:
# Load DataFrames

W_jk = pd.read_pickle(os.path.join(path,"Segregation","W_jk.gzip"), compression = "gzip")
denominador = pd.read_pickle(os.path.join(path,"Segregation","denominador.gzip"), compression = "gzip")
proximidad = pd.read_pickle(os.path.join(path,"Segregation","proximidad.gzip"), compression = "gzip")

# Test values
grupo = 'Centro'
vertice = 3
fecha = '2021-05-04'

# Test graph
prueba = f"starting_{fecha}.graphml"
G = gt.load_graph(os.path.join(path_daily,"Graphs",prueba))

w_jk_grupo = Pr.individual_proximity_to_h(G,vertice,'Political Label',grupo)
w_jk_otros = Pr.individual_proximity_to_others(G,vertice,'Political Label')
den = Pr.at_random_scenario(G,'Political Label', grupo, 'Proximity to Group')

# Fernando's calculations
# Note: `den` is the at-random share of `grupo`, and it is reused for the "others" case below
print("Fernando's calculations")
print()
print(f"Proximity of node {vertice} on {fecha} to group {grupo}:")
print(f'Numerator:  {w_jk_grupo}')
print(f'Denominator:  {den}')
print(f"Proximity to {grupo}: {w_jk_grupo/den}")
print()
print(f"Proximity of node {vertice} on {fecha} to other groups")
print(f'Numerator:  {w_jk_otros}')
print(f'Denominator:  {den}')
print(f"Proximity to others: {w_jk_otros/den}")

print("\n"+"-"*100+"\n")

# Lucas's calculations
proximidad_grupo = proximidad[f"P_{grupo}"][(proximidad["Nodo_ID"] == vertice) & (proximidad["Date"] == fecha)].iloc[0]
W_jk_grupo = W_jk[f"W_j_{grupo}"][(W_jk["Nodo_ID"] == vertice) & (W_jk["Date"] == fecha)].iloc[0]
den = denominador[f"Denominador {grupo}"][(denominador["Nodo_ID"] == vertice) & (denominador["Date"] == fecha)].iloc[0]
print("Lucas's calculations")
print()
print(f"Proximity of node {vertice} on {fecha} to group {grupo} in the DataFrame")
print(f"Numerator: {W_jk_grupo}")
print(f"Denominator: {den}")
print(f"Proximity to {grupo}: {proximidad_grupo}")
print()

proximidad_otros = proximidad["P_Otros"][(proximidad["Nodo_ID"] == vertice) & (proximidad["Date"] == fecha)].iloc[0]
W_jk_otros = W_jk["W_j_Otros"][(W_jk["Nodo_ID"] == vertice) & (W_jk["Date"] == fecha)].iloc[0]

# Note: `den` is still "Denominador {grupo}", whereas the stored P_Otros was divided by
# "Denominador Otros", so the last two printed values are not expected to match
print(f"Proximity of node {vertice} on {fecha} to other groups in the DataFrame")
print(f"Numerator: {W_jk_otros}")
print(f"Denominator: {den}")
print(f"Proximity to others: {proximidad_otros}")
print(f"Numerator/Denominator: {W_jk_otros/den}")

Fernando's calculations

Proximity of node 3 on 2021-05-04 to group Centro:
Numerator:  0.3333333333333333
Denominator:  0.19557729900809864
Proximity to Centro: 1.7043559504292487

Proximity of node 3 on 2021-05-04 to other groups
Numerator:  0.3333333333333333
Denominator:  0.19557729900809864
Proximity to others: 1.7043559504292487

----------------------------------------------------------------------------------------------------

Lucas's calculations

Proximity of node 3 on 2021-05-04 to group Centro in the DataFrame
Numerator: 0.3333333333333333
Denominator: 0.19557729900809864
Proximity to Centro: 1.7043559504292487

Proximity of node 3 on 2021-05-04 to other groups in the DataFrame
Numerator: 0.3333333333333333
Denominator: 0.19557729900809864
Proximity to others: 0.757905711553235
Numerator/Denominator: 1.7043559504292487
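
The visual comparison of the two printouts can also be done programmatically. The sketch below is an assumption about how such a check might look: it hard-codes the values printed above for node 3 on 2021-05-04 instead of reading them from the DataFrames, and the helper name `check_match` is hypothetical.

```python
import math

def check_match(reference, candidate, rel_tol=1e-9):
    """True when two independently computed values agree within a relative tolerance."""
    return math.isclose(reference, candidate, rel_tol=rel_tol)

# Values copied from the printed output for node 3 on 2021-05-04
centro_ok = check_match(1.7043559504292487, 1.7043559504292487)  # loop-based vs. DataFrame, group Centro
otros_ok = check_match(1.7043559504292487, 0.757905711553235)    # loop-based vs. DataFrame, other groups

print(f"Centro matches: {centro_ok}")
print(f"Others match:   {otros_ok}")
```

With these numbers the Centro values agree, while the values for other groups do not, because the loop-based figure divides by the Centro denominator rather than by the one used for `P_Otros`.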


In [7]:
len(proximidad)

1479681