# Retweet Networks and Community Clusters

## Objective

## Tools/Technology Used

* **Pandas**: create the undirected weighted graph for the retweet network
* **GraphFrames**: Spark library used to detect community clusters with LPA (Label Propagation Algorithm) and to find influencers with PageRank
* **Seaborn**: visualize cluster strength in a bar chart
* **Gephi**: compute community clusters using modularity and find influencers through betweenness centrality and eigenvector centrality

## Import the necessary packages

In [11]:
import sys
import pandas as pd
import numpy as np
import re
import os
import pyspark
from pyspark.sql import *
spark = SparkSession.builder.appName('retweet graph').getOrCreate()
sc = spark.sparkContext

import warnings
warnings.filterwarnings('ignore')

## Read and Preprocess Data

We read the retweeted records from humans.csv, the input file for this analysis, which was generated after discarding the fake accounts and cleaning the tweets with standard NLP techniques. We select users by their screen_name if they have retweeted (indicated by the boolean is_retweet column), together with the post they retweeted (the text_retweet column). Then we apply the following preprocessing steps:
* *Filter out records if they are NAN*
* *Add a new column retweet_username to the dataframe which captures the user who had posted the tweet for the first time*
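
The second step relies on the `RT @user:` prefix that retweet texts carry. The following is a minimal standard-library sketch of that extraction on a few sample strings. The names `RETWEET_PATTERN` and `original_author` are illustrative assumptions, and the sketch returns `None` for non-retweets, whereas the notebook code below returns the string `'False'`.

```python
import re

# Hypothetical helper names; the pattern mirrors the "RT @user:" prefix convention.
RETWEET_PATTERN = re.compile(r"RT @(\S+):")

def original_author(text):
    """Return the screen name after 'RT @', or None if the text is not a retweet."""
    if not isinstance(text, str):  # e.g. NaN values read from a CSV file
        return None
    match = RETWEET_PATTERN.match(text)
    return match.group(1) if match else None

samples = [
    "RT @TheUSASingers: Here it is folks!",
    "Just a regular tweet",
    float("nan"),
]
authors = [original_author(text) for text in samples]
print(authors)
```

Only the first sample starts with the retweet prefix, so only it yields a screen name.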



In [None]:
# Helper used to filter out NaN records: NaN is the only value that is not equal to itself
def isNaN(num):
    return num != num

In [33]:
# Read input file and get the retweeted records in dataframe retweet_df
read_df = pd.read_csv('humans.csv', sep=';', index_col='id', usecols = ['id', 'screen_name', 'is_retweet', 'text_retweet'])
retweet_df = read_df[read_df.apply(lambda x: (x['is_retweet']==True) and (isNaN(x['text_retweet'])==False), axis=1)]
retweet_df = retweet_df.reset_index()
retweet_df.head(3)

Unnamed: 0,id,is_retweet,text_retweet,screen_name
0,40000,True,"RT @jdesmondharris: You don't have to write ""i...",SamanthaCorbin
1,40002,True,"RT @jdesmondharris: You don't have to write ""i...",dinosaur_m
2,40003,True,RT @TheUSASingers: Here it is folks!\n\nWe are...,CloeNana


In [37]:
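# Extract the screen_name of the author of the original post that has been retweeted
def get_retweet_user(text_retweet):
    try:
        return re.match(r"RT @(\S+):.*", text_retweet).group(1)
    except (AttributeError, TypeError):
        # No "RT @user:" prefix (match is None) or the text is not a string
        return 'False'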
    
retweet_df['retweet_username'] = retweet_df.apply(lambda x: get_retweet_user(x['text_retweet']), axis=1)
retweet_df = retweet_df.reset_index()

# Print the final input dataframe
retweet_df.head(3)

Unnamed: 0,level_0,index,id,is_retweet,text_retweet,screen_name,retweet_username
0,0,0,40000,True,"RT @jdesmondharris: You don't have to write ""i...",SamanthaCorbin,jdesmondharris
1,1,1,40002,True,"RT @jdesmondharris: You don't have to write ""i...",dinosaur_m,jdesmondharris
2,2,2,40003,True,RT @TheUSASingers: Here it is folks!\n\nWe are...,CloeNana,TheUSASingers


## Create the retweet graph and store it as a text file

We construct a weighted undirected graph for the retweet network: every time a user (field screen_name) retweets another user's post (field retweet_username), we draw an edge between the two users, and the number of times they have retweeted each other becomes the edge weight. To reduce clutter in the network, we drop nodes (users) whose total number of retweet interactions is at or below a threshold num. The threshold defaults to 5 and can be configured as the analyst needs.
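
To make the weighting rule concrete, here is a small sketch on made-up (retweeter, original author) pairs. The function name `weighted_edges`, the parameter `min_degree`, and the toy user names are assumptions for illustration. Unlike the notebook function below, it stores each undirected edge under a sorted pair of names rather than in first-seen order.

```python
from collections import Counter

def weighted_edges(pairs, min_degree=5):
    """Count undirected retweet edges between users whose degree exceeds min_degree."""
    # Weighted degree: every appearance as retweeter or as original author counts once.
    degree = Counter()
    for user, author in pairs:
        degree[user] += 1
        degree[author] += 1
    keep = {node for node, count in degree.items() if count > min_degree}

    weights = Counter()
    for user, author in pairs:
        if user == author:  # skip self loops
            continue
        if user in keep and author in keep:
            # Sorting the pair makes (a, b) and (b, a) the same undirected edge.
            weights[tuple(sorted((user, author)))] += 1
    return dict(weights)

toy_pairs = [
    ("ann", "bob"),
    ("bob", "ann"),
    ("ann", "bob"),
    ("cat", "ann"),
    ("dan", "dan"),
]
edges = weighted_edges(toy_pairs, min_degree=1)
print(edges)
```

With a threshold of 1, `cat` is dropped (degree 1), the `dan` self loop is ignored, and the three retweets between `ann` and `bob` collapse into one edge of weight 3.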

In [38]:
def retweetgraph(data, filename, num = 5):
    
    # Collect the (retweeter, original author) pairs in a list
    retweets = []
    for line in data:
        retweets.append([line[0], line[1]])
        
    # Define nodes
    nodes = dict()
    for line in retweets:
        if line[0] not in nodes:
            nodes[line[0]] = 0
        nodes[line[0]] += 1
        if line[1] not in nodes:
            nodes[line[1]] = 0
        nodes[line[1]] += 1

    # Keep only nodes whose weighted degree is greater than 'num'
    for i in list(nodes):
        if nodes[i] <= num:
            del nodes[i]
            
    # Define undirected weighted edges without self loop
    temp = dict()
    for retweet in retweets:
        if retweet[0] == retweet[1]:
            continue
        if retweet[0] in nodes and retweet[1] in nodes:
            if (retweet[0], retweet[1]) in temp:
                temp[(retweet[0], retweet[1])] += 1
            elif (retweet[1], retweet[0]) in temp:
                temp[(retweet[1], retweet[0])] += 1
            else:
                temp[(retweet[0], retweet[1])] = 1
    edges = list(temp.items())
    
    # Write the graph to a file as tab-separated lines: source, target, weight
    with open(filename, 'w') as f:
        for edge in edges:
            f.write('{}\t{}\t{}\n'.format(edge[0][0], edge[0][1], edge[1]))

In [40]:
# Create the graph (keeping nodes with weighted degree above 2) and save it as retweetgraph.txt
retweetgraph(retweet_df[retweet_df['retweet_username'] != 'False'][['screen_name', 'retweet_username']].values.tolist(), 'retweetgraph.txt', 2)

## Find communities and their influencers using GraphFrames and PySpark

GraphFrames is a package for Apache Spark that provides DataFrame-based graphs. We use the Label Propagation Algorithm to cluster nodes into communities and the PageRank algorithm to find the main influencers of these communities. The PySpark script retweet_graph_lpa.py implements both steps.

**Label Propagation Algorithm** (LPA) is a well-known algorithm for finding community structure in complex networks. In its original semi-supervised form, a (generally small) subset of the data points starts with labels (or classifications), and these labels are propagated to the unlabeled points over the course of the algorithm. In the community-detection variant that GraphFrames implements, every node starts in its own community and repeatedly adopts the label that is most common among its neighbors. LPA has advantages in its running time and in the amount of a priori information it needs about the network structure (no parameter has to be known beforehand). Its disadvantage is that it produces no unique solution, but an aggregate of many solutions.
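
The propagation idea can be sketched in plain Python on a toy graph. This sketch assumes the community-detection form of the algorithm, where every node starts with its own label and then adopts the most frequent label among its neighbors. The function name `label_propagation`, the in-place update in sorted node order, and the largest-label tie-break are illustrative choices; this is not the GraphFrames implementation, and a different tie-break can lead to a different partition.

```python
from collections import Counter

def label_propagation(adjacency, max_iter=5):
    """Toy label propagation: each node adopts its neighbors' most common label."""
    labels = {node: node for node in adjacency}  # every node starts as its own community
    for _ in range(max_iter):
        changed = False
        for node in sorted(adjacency):  # fixed order keeps the sketch deterministic
            neighbors = adjacency[node]
            if not neighbors:
                continue
            counts = Counter(labels[n] for n in neighbors)
            best = max(counts.values())
            # Arbitrary deterministic tie-break: the largest label among the most frequent.
            new_label = max(label for label, c in counts.items() if c == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:  # stop early once no label changes
            break
    return labels

# Two triangles (a, b, c) and (d, e, f) joined by the single edge c-d.
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e", "f"],
    "e": ["d", "f"],
    "f": ["d", "e"],
}
communities = label_propagation(toy_graph)
print(communities)
```

On this graph the two triangles end up with two different labels, one per community.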

**PageRank Algorithm** is used to calculate a centrality measure for the nodes in each of the clusters defined by the Label Propagation Algorithm. This lets us identify the nodes with high centrality. Intuitively, a high PageRank implies that the node (user) is a main influencer in their community. Although PageRank gives better results with more iterations, we run it for 5 iterations to reduce execution time.
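
As an illustration of how a fixed number of PageRank iterations ranks users, here is a minimal power-iteration sketch. It assumes directed edges from the retweeter to the original author, a damping factor of 0.85, and ranks normalized to sum to 1, so the values are not on the same scale as the GraphFrames output. The function name `pagerank` and the toy edges are hypothetical.

```python
def pagerank(edges, damping=0.85, max_iter=5):
    """Power-iteration PageRank on directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}  # uniform start
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread evenly over all nodes.
        dangling = sum(ranks[node] for node in nodes if not out_links[node])
        new_ranks = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for src in nodes:
            if out_links[src]:
                share = damping * ranks[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_ranks[dst] += share
        ranks = new_ranks
    return ranks

# Three users retweet "star"; u1 also retweets u2.
toy_edges = [("u1", "star"), ("u2", "star"), ("u3", "star"), ("u1", "u2")]
ranks = pagerank(toy_edges)
top_user = max(ranks, key=ranks.get)
print(top_user)
```

The account that is retweeted by everyone else collects the largest share of rank, which is the sense in which a high PageRank marks an influencer.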

#### Note: Put the GraphFrames jar file matching your PySpark version in the current directory
spark-submit --packages graphframes:graphframes:<graphframes version> retweet_graph_lpa.py

In [None]:
!spark-submit --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11 retweet_graph_lpa.py

In [42]:
# Show the first few rows of communities and their PageRank values
sdf = spark.read.parquet('retweet_lpa.parquet')
print('id=user, label=community number, pagerank=pagerank')
sdf.show()

id=user, label=community number, pagerank=pagerank
+-----+-----+---------------+-------------------+
|   id|label|           name|           pagerank|
+-----+-----+---------------+-------------------+
| 2024|81315|NareshC26858420| 0.3507565086074427|
| 2100|81292|  MyGrindelwald|0.35756531142158715|
| 2605|81315|  Mynation_Abhi| 0.3507565086074427|
| 3817|81315| suhas_sawant85| 0.3507565086074427|
| 6360| 3396|      firstpost| 0.4665061564478987|
| 8501|81255|    MichaelYRMW| 0.3507565086074427|
|11015| 3312| deekshabhushan| 0.3507565086074427|
|14021|61377| CapitolRomance| 0.3507565086074427|
|16553|82104|       oni_keji| 0.3507565086074427|
|17849|83111|   ranjiths1998| 0.3507565086074427|
|53329|84548|    sapphicgeek| 0.3507565086074427|
|69921|84504|    monachollet| 0.3507565086074427|
|70718|27257|DeadlineDominic|  35.07565086074427|
|81290|  737|     danaksegal| 0.5822558042883549|
|81292|  138|     maelynn777|  4.106361273264061|
|81494| 1626|   cohen_HR_Law| 0.6980054521288109|

## EDA with the Community Clusters
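
As a rough sketch of the kind of summary this section builds, the following counts community sizes and picks the highest-PageRank member of each community. It uses only a handful of (name, label, pagerank) rows copied from the sample output above, so the numbers describe that sample and not the full dataset. The variable names are illustrative assumptions.

```python
from collections import Counter

# (name, label, pagerank) rows copied from the sample output shown above
rows = [
    ("NareshC26858420", 81315, 0.3507565086074427),
    ("MyGrindelwald", 81292, 0.35756531142158715),
    ("Mynation_Abhi", 81315, 0.3507565086074427),
    ("suhas_sawant85", 81315, 0.3507565086074427),
    ("firstpost", 3396, 0.4665061564478987),
    ("DeadlineDominic", 27257, 35.07565086074427),
]

# Community size = number of users sharing the same label
community_sizes = Counter(label for _, label, _ in rows)
largest = community_sizes.most_common(10)

# Highest-PageRank member per community (the first one seen wins a tie)
top_member = {}
for name, label, rank in rows:
    if label not in top_member or rank > top_member[label][1]:
        top_member[label] = (name, rank)

print(largest[0])
print(top_member[27257])
```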

### Top 10 largest communities

### Main Influencers in these communities

## Visualize communities and their influencers with Gephi