# SYSM 6302 - Lab 5
Jonas Wagner - jrw200000

In [82]:
import networkx as nx
import networkx.algorithms.community as nx_comm
from numpy import zeros, dot, array
import pickle
import matplotlib.pyplot as plt
import json
import string
import time
import numpy as np
import re

#### Preprocessing

In [54]:
import os.path
from os import path
if not path.exists('raw_twitter.json'):
    print('need to extract .json.zip file')    
if not path.exists('small_raw_twitter.json'):
    print("small version of raw_twitter doesn't exist")

## Section 7.13: Modularity

The first function below calculates modularity for *directed* networks and also returns the maximum modularity value $Q_{\text{max}}$ (NetworkX's modularity function does not report the $Q_{\text{max}}$ value). The second function calculates scalar assortativity (NetworkX's assortativity functions differ from our book definition). 

In [6]:
def modularity(G,c):
    d = dict()
    for k,v in enumerate(c):
        for n in v:
            d[n] = k
    L = 0
    for u,v,data in G.edges.data():
        L += data['weight']
    Q, Qmax = 0,1
    for u in G.nodes():
        for v in G.nodes():
            if d[u] == d[v]:
                Auv = 0
                if G.has_edge(v,u):
                    Auv = G[v][u]['weight']
                Q += ( Auv - G.in_degree(u,weight='weight')*G.out_degree(v,weight='weight')/L )/L
                Qmax -= ( G.in_degree(u,weight='weight')*G.out_degree(v,weight='weight')/L )/L
    return Q, Qmax

def scalar_assortativity(G,d):
    x = zeros(G.number_of_nodes())
    for i,n in enumerate(G.nodes()):
        x[i] = d[n]

    A = array(nx.adjacency_matrix(G).todense().T)
    M = 2*A.sum().sum()
    ki = A.sum(axis=1) #row sum is in-degree
    ko = A.sum(axis=0) #column sum is out-degree
    mu = ( dot(ki,x)+dot(ko,x) )/M

    R, Rmax = 0, 0
    for i in range(G.number_of_nodes()):
        for j in range(G.number_of_nodes()):
            R += ( A[i,j]*(x[i]-mu)*(x[j]-mu) )/M
            Rmax += ( A[i,j]*(x[i]-mu)**2 )/M

    return R, Rmax

In [7]:
G = nx.read_weighted_edgelist('fifa1998.edgelist',create_using=nx.DiGraph)

c = {
    'group1': {'Argentina','Brazil','Chile','Mexico','Colombia','Jamaica','Paraguay'},
    'group2': {'Japan','SouthKorea'},
    'group3': {'UnitedStates'},
    'group4': {'Nigeria','Morocco','SouthAfrica','Cameroon','Tunisia','Iran','Turkey'},
    'group5': {'Scotland','Belgium','Austria','Germany','Denmark','Spain','France','GreatBritain','Greece','Netherlands','Norway','Portugal','Italy','Yugoslavia','Romania','Bulgaria','Croatia','Switzerland'}
    }
Q, Qmax = modularity(G,c.values())
print('FIFA exports by geographic region is assortatively mixed: %1.4f/%1.4f' % (Q,Qmax))

c = {
    'exporters': {'Argentina','Brazil','Chile','Colombia','Mexico','Scotland','Belgium','Austria','Denmark','France','Greece','Netherlands','Portugal','Yugoslavia','Croatia','Jamaica','Cameroon','Nigeria','Morocco','Tunisia'},
    'importers': {'Paraguay','SouthKorea','UnitedStates','SouthAfrica','Iran','Turkey','Germany','Spain','GreatBritain','Norway','Italy','Romania','Bulgaria','Switzerland','Japan'}
    }
Q, Qmax = modularity(G,c.values())
print('FIFA exports by importers/exporters is disassortatively mixed: %1.4f/%1.4f' % (Q,Qmax))

FIFA exports by geographic region is assortatively mixed: 0.1200/0.5505
FIFA exports by importers/exporters is disassortatively mixed: -0.0185/0.5748


#### Explanation of Modularity Results
FIFA exports by region show more assortative mixing than the exporter/importer partition does. Although I don't know much about FIFA (I'm definitely an American football guy), I expect there to be more connections between countries in the same region than between teams on different sides of the world, so it makes sense for the regional partition to be more assortative. On the other hand, the importer and exporter classes have no particular reason to connect within their own group (by definition, an export goes from an exporter to an importer), so the result is slightly disassortative mixing.
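
As a sanity check on how the sign of the modularity should be read, here is a minimal sketch on two toy undirected networks (not the FIFA data). It uses NetworkX's own `nx_comm.modularity` rather than the directed `modularity` function above, so it reports only $Q$ and not $Q_{\text{max}}$. The graphs and variable names are made up for illustration.

```python
import networkx as nx
import networkx.algorithms.community as nx_comm

# Two triangles joined by a single edge: most edges fall inside the groups.
G_assort = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
q_assort = nx_comm.modularity(G_assort, [{0, 1, 2}, {3, 4, 5}])

# Complete bipartite graph: every edge runs between the two groups.
G_disassort = nx.complete_bipartite_graph(3, 3)
q_disassort = nx_comm.modularity(G_disassort, [{0, 1, 2}, {3, 4, 5}])

print('two triangles: %1.4f' % q_assort)     # positive: assortative
print('bipartite:     %1.4f' % q_disassort)  # negative: disassortative
```

The importer/exporter split behaves like a weak version of the bipartite case: edges mostly cross between the two groups, which pushes $Q$ below zero.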

## Section 7.13: Assortativity

In [10]:
gdp = pickle.load(open('gdp.pkl','rb'))
life_expectancy = pickle.load(open('life_expectancy.pkl','rb'))
tariff = pickle.load(open('tariff.pkl','rb'))

G = nx.read_weighted_edgelist('world_trade_2014.edgelist',create_using=nx.DiGraph)

R, Rmax = scalar_assortativity(G,gdp)
print('Assortativity by GDP: %1.4f' % (R/Rmax))
R, Rmax =  scalar_assortativity(G,life_expectancy)
print('Assortativity by life expectancy: %1.4f' % (R/Rmax))
R, Rmax =  scalar_assortativity(G,tariff)
print('Assortativity by tariff: %1.4f' % (R/Rmax))

Assortativity by GDP: -0.0005
Assortativity by life expectancy: 0.1281
Assortativity by tariff: 0.1191


#### Explanation of Assortativity Results
The assortativity of trade by GDP is near zero, indicating neither assortative nor disassortative mixing between countries of different GDP values. My guess as to why is that high-GDP countries are unlikely to trade only with other high-GDP countries (it just wouldn't be smart), and similarly, low-GDP countries would likely not trade any more with other low-GDP countries than with higher-GDP ones.

There is a higher assortativity (the same as covariance/correlation) between life expectancy and the amount of trade between nations. This is possibly because nations with higher life expectancy trade more for luxury goods or technology, while nations with lower life expectancy do not trade as much.

There also appears to be a correlation between average tariff rates and the amount of trade between countries. This could possibly be because a country is more likely to place tariffs at a level comparable to those of the nations it trades with.

#### Algebraic Manipulation of Covariance
(*This could be done on paper... but latex is just better... I also may or may not have already done this for another class*)

Let, $$\mu = \frac{1}{2m} \sum_{l = 1}^{n} k_l x_l$$
The covariance is defined as:
$$R = \frac{1}{2m} \sum_{i=1}^{n} \sum_{j=1}^{n}
A_{ij} (x_i - \mu) (x_j - \mu)$$
This can then be expanded into
$$R = \frac{1}{2m} \sum_{i=1}^{n} \sum_{j=1}^{n}
A_{ij} (x_i x_j - x_i \mu - x_j \mu + \mu^2)$$
The definition of the $\mu$ can then be substituted in
$$R = \frac{1}{2m} \sum_{i=1}^{n} \sum_{j=1}^{n}
A_{ij} (x_i x_j 
- (x_i + x_j) (\frac{1}{2m} \sum_{l = 1}^{n} k_l x_l) 
+ (\frac{1}{2m} \sum_{l = 1}^{n} k_l x_l)^2)$$
The substituted factor $\frac{1}{2m} \sum_{l = 1}^{n} k_l x_l$ does not depend on $i$ or $j$, so it can be pulled out of the double sums. Writing it as $\mu$ again for brevity, the sum splits into four terms
$$R = \frac{1}{2m} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j\\
- \frac{\mu}{2m} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i\\
- \frac{\mu}{2m} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_j\\
+ \frac{\mu^2}{2m} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij}
$$
Each row or column of $A$ sums to the degree of that node, $\sum_{j} A_{ij} = k_i$ and $\sum_{i} A_{ij} = k_j$, and all of the entries together sum to $2m$, so
$$R = \frac{1}{2m} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j\\
- \mu \left( \frac{1}{2m} \sum_{i=1}^{n} k_i x_i \right)\\
- \mu \left( \frac{1}{2m} \sum_{j=1}^{n} k_j x_j \right)\\
+ \mu^2
$$
Both terms in parentheses are exactly the definition of $\mu$, so
$$R = \frac{1}{2m} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j - \mu^2 - \mu^2 + \mu^2$$
then eliminating the terms
$$R = \frac{1}{2m} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j - \mu^2$$
Finally, expanding $\mu^2$ back out as a double sum,
$$\mu^2 = \left( \frac{1}{2m} \sum_{l = 1}^{n} k_l x_l \right)^2
= \frac{1}{2m} \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{k_i k_j}{2m} x_i x_j$$
so the two double sums can be combined into
$$R = \frac{1}{2m} \sum_{i=1}^{n} \sum_{j=1}^{n}(A_{ij} - \frac{k_i k_j}{2m}) x_i x_j$$
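
A quick numerical check of the identity, as a sketch only: it compares the centered form of $R$ with the final form on an undirected example graph. The choice of the karate club graph and the random values `x` are my own assumptions, not part of the lab data.

```python
import networkx as nx
import numpy as np

G_check = nx.karate_club_graph()
A = nx.to_numpy_array(G_check, weight=None)  # symmetric 0/1 adjacency matrix
k = A.sum(axis=1)                            # degrees k_i
two_m = A.sum()                              # sum of all entries is 2m
x = np.random.default_rng(0).normal(size=len(k))  # arbitrary scalar values x_i
mu = k @ x / two_m

# Centered form: (1/2m) sum_ij A_ij (x_i - mu)(x_j - mu)
R_centered = (A * np.outer(x - mu, x - mu)).sum() / two_m
# Final form: (1/2m) sum_ij (A_ij - k_i k_j / 2m) x_i x_j
R_modularity = ((A - np.outer(k, k) / two_m) * np.outer(x, x)).sum() / two_m

print(np.isclose(R_centered, R_modularity))
```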

## Sections 11.2-11.11: Community Detection

# ***Need to do the KL algorithm example drawing***
Tips:
- Go in reverse from perfect to scrambled and make the cut set larger... then how do you make it smaller again...

#### Modularity Matrix Summation Proof

Let,
$$B_{ij} = A_{ij} - \frac{k_i k_j}{2m}$$

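When summing over all columns for a single row, the sum can be shown to be zero as follows. This uses the fact that the sum of any row or column of $A$ is the degree of that node, $\sum_{j} A_{ij} = k_i$, and that the degrees sum to $\sum_{j} k_j = 2m$:
$$\sum_{j=1}^n B_{ij} 
= \sum_{j=1}^n \left( A_{ij} - \frac{k_i k_j}{2m} \right)\\
= \sum_{j=1}^n A_{ij} - \frac{k_i}{2m} \sum_{j=1}^n k_j\\
= k_i - \frac{k_i}{2m} (2m) = 0
$$

Since $B$ is symmetric for an undirected network, the same argument shows that every column sums to zero as well.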
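
The zero row sum of $B$ can also be checked numerically. This is a minimal sketch on an example graph of my own choosing (the karate club graph), not on any of the lab networks.

```python
import networkx as nx
import numpy as np

A = nx.to_numpy_array(nx.karate_club_graph(), weight=None)  # 0/1 adjacency matrix
k = A.sum(axis=1)                 # degree of each node
B = A - np.outer(k, k) / k.sum()  # k.sum() equals 2m
row_sums = B.sum(axis=1)

print(np.allclose(row_sums, 0))
```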

## Community Detection in Practice
Settings:

In [58]:
# Settings for running sections of community detection
small = 'small_' # 'small_' if small... '' if full
raw_tweets_filename = small + 'raw_twitter'
hashtag_sets_filename = small + 'hashtag_sets'
edgelist_filename = small + 'edgelist'
htag_comms_filename = small + 'htag_communities'
#use small_raw_twitter and small_hashtag_sets for speed

# Run sections
json_2_raw_tweets = not path.exists(raw_tweets_filename + '.txt')
raw_tweets_2_hashtag_sets = not path.exists(hashtag_sets_filename + '.txt')
build_network = not path.exists(edgelist_filename + '.edgelist')
find_communities_original = False # Don't run... not needed
find_communities_w10_c10 = not path.exists(htag_comms_filename + '_w10_c10' + '.txt')

#### Twitter Network Conceptual Questions
Linking hashtags by how often they occur together in a tweet can describe the similarity between them, since they are used in the same message. This would portray a similar feeling/tone or related topics of interest.
An example of this would be the use of #fml and #disappointed. When they are used together in a tweet, they likely describe a feeling of despair or sadness about the contents of the tweet.
An example that is not as useful would be when two hashtags are used in a tweet as polar opposites, e.g. #Good vs #Bad, or #Fire vs #Water.

#### Functions for dealing with tweets
*** It is getting annoying to do something this complicated in a Jupyter notebook... I'm looking forward to doing it all with scripts for my project

In [70]:
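def readJSON2List(filename):
    """
    this reads a raw json file (one tweet object per line), selects only
    the english tweets, and saves their text to a list of strings
    """
    tweets = []
    # the with-block closes the file even if a line fails to parse
    with open(filename + '.json', 'r', encoding='utf-8') as fp:
        for line in fp:
            if len(line) > 2: # skip blank lines
                line_data = json.loads(line)
                # .get() avoids a KeyError on records without a 'lang' field
                if line_data.get('lang') == 'en': # only english
                    tweets.append(line_data['text'])

    return tweets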

In [72]:
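def readTxt2List(filename, deliminator = '\n uniqueDeliminator \n'):
    """
    reads a txt file written by writeList2txt back into a python list
    """
    with open(filename + '.txt', 'r', encoding = 'utf-8') as filehandle:
        contents = filehandle.read()
    # every item is followed by the deliminator, so drop the empty final piece
    return contents.split(deliminator)[:-1]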

In [77]:
def writeList2txt(tweets, filename, deliminator = '\n uniqueDeliminator \n'):
    """
    this writes a list of tweets to a txt file
    """
    with open(filename + '.txt', 'w', encoding = 'utf-8') as filehandle:
        for tweet in tweets:
            filehandle.write(tweet + deliminator)

In [104]:
def tweetTextAnalysis(tweet):
    """
    takes tweet text (as a string) and processes it...
    1) strip
    2) lowercase
    3) hashtag search
    4) return sorted list of hashtags (without #), or -1 if there are none
    """
    if '#' in tweet:
        tweet = tweet.strip()
        tweet = tweet.lower()
        regex = r"#(\w+)" # raw string so \w is passed to re unchanged
        hashtags = sorted(re.findall(regex, tweet))
        if len(hashtags) == 0:
            hashtags = -1
    else:
        hashtags = -1
    return hashtags

In [126]:
def findHashtagSets(tweets):
    """
    This interprets the list of tweets (english, after stripping the json stuff)
    and finds the co-occurring hashtags
    """
    hashtagSets = [] # start empty ([str] would put the str type itself in the list)
    for tweet in tweets:
        hashtags = tweetTextAnalysis(tweet)
        if type(hashtags) == list:
            if len(hashtags) > 1:
                hashtagSets.append(hashtags)
                print(hashtags)
    return hashtagSets

### Functions for building and analyzing networks

In [127]:
tweet = tweets[108]
hashtag = tweetTextAnalysis(tweet)
print(tweet, hashtag)

@elissakh the one and only Queen ❤️💝 #Toronto https://t.co/PZTZW5GCQk ['toronto']


### 1) Identifying co-occurring hashtags in data

1. Use JSON to output tweet data directly into a text file: raw_tweets.txt
2. Use raw_tweets.txt to produce a space-delimited list of the hashtags in each tweet: hashtag_sets.txt

#### Empty line explanation
All of the empty lines (that I didn't purposely add) within the hashtag_sets.txt file indicate that no hashtags were included in the respective tweet
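
To make the empty-line convention concrete, here is a small sketch that writes one line per tweet. The helper name `hashtag_line` and the example tweets are hypothetical, and it mirrors the strip/lowercase/regex steps used in `tweetTextAnalysis` rather than calling it.

```python
import re

def hashtag_line(tweet):
    """Hypothetical helper: the hashtag_sets line for a single tweet."""
    hashtags = sorted(re.findall(r"#(\w+)", tweet.strip().lower()))
    return ' '.join(hashtags)  # '' (an empty line) when there are no hashtags

example_tweets = [
    'Great game! #Sports #AaronRodgers',
    'no tags in this one',
    'just one tag #Toronto',
]
lines = [hashtag_line(t) for t in example_tweets]
print('\n'.join(lines))
```

The second tweet produces an empty line, which is how a tweet without hashtags shows up in the file.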



In [75]:
#JSON 2 Raw Tweets
if json_2_raw_tweets:
    tweets = readJSON2List(raw_tweets_filename)
    writeList2txt(tweets, raw_tweets_filename)
else:
    # the txt file already exists, so just re-read the tweets from the JSON
    tweets = readJSON2List(raw_tweets_filename)



In [128]:
# Raw tweets to hashtag pairs
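if raw_tweets_2_hashtag_sets:
    hashtags = findHashtagSets(tweets)
    # write one space-delimited line per hashtag set (not the raw tweets)
    hashtag_lines = [' '.join(h) for h in hashtags if isinstance(h, list)]
    writeList2txt(hashtag_lines, hashtag_sets_filename, deliminator = '\n')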





['나인뮤지스', '더쇼']
['aries', 'ascendant', 'leo', 'mediumcoeli']
['gmos', 'neonicotinoid', 'pesticides']
['etsymntt', 'pott']
['aaronrodgers', 'sports']
['filipina', 'philippines', 'pinay']
['camaradoalmirante', 'rio2016']
['lingerie', 'pantyhose', 'stockings']
['mtvstars', 'mtvstars']
['gameinsight', 'ipad', 'ipadgames']
['40', 'pizero']
['liesonsoundcloud', 'welcometweet']
['karachistandswit', 'mqm']
['job', 'jobs', 'lpn', 'tucson']
['csgo', 'csgofast', 'csgogiveaway', 'csgoskins']
['tommorowwhenigetthevampsalbumiwill', 'vampsalbumouttomorrow']
['blackfriday', 'kohlssweepstakes']
['cleveland', 'detroit', 'takeover']
['firefriday', 'teamoptimist']
['biafra', 'freebiafra', 'freennamdikanu']
['gratuitousfaggotry', 'yoloswagpovverbottomsforjesus']
['jobs', 'maintenance', 'manager', 'rugby']
['infection', 'undead', 'zombie']
['heath', 'sugar']
['forex', 'trading']
['mtvstars', 'mtvstars']
['carvsdal', 'happythanksgiving']
['naruto', 'sasusaku']
['cowboys', 'nflthanksgiving', 'thanksgiving']
[

### 2) Building a network from co-occurring hashtags

**Note:** The file hashtag_sets.txt can be interpreted as a hypergraph network, with each line (tweet) being a node and the hashtags being the groups that it belongs to. Analysis of this network doesn't tell us as much about which hashtags are used together (i.e. the hashtags aren't the nodes of the network, as we want them to be).

1. Create an empty weighted undirected network.
2. Read in each set of hashtags and create nodes (hashtags) and an edge linking each pair (or increase the weight of an existing edge).
3. Save the generated network as an .edgelist file.
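
The three steps above could be sketched roughly as follows. The function name `build_hashtag_network` and the example sets are my own placeholders, and dropping repeated hashtags within a tweet is an assumption about how duplicates such as `['mtvstars', 'mtvstars']` should be handled.

```python
from itertools import combinations
import networkx as nx

def build_hashtag_network(hashtag_sets):
    """Hypothetical helper: weighted undirected co-occurrence network."""
    G = nx.Graph()  # step 1: empty undirected network
    for hashtags in hashtag_sets:
        # set() drops repeats within a tweet, which avoids self-loops
        for u, v in combinations(sorted(set(hashtags)), 2):
            if G.has_edge(u, v):
                G[u][v]['weight'] += 1      # step 2: seen before, add weight
            else:
                G.add_edge(u, v, weight=1)  # step 2: new edge
    return G

example_sets = [
    ['forex', 'trading'],
    ['forex', 'trading'],
    ['cowboys', 'nflthanksgiving', 'thanksgiving'],
    ['mtvstars', 'mtvstars'],
]
G_example = build_hashtag_network(example_sets)
print(G_example['forex']['trading']['weight'])

# step 3 (left commented out so the sketch does not write files):
# nx.write_weighted_edgelist(G_example, 'example.edgelist')
```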

### 3) Detecting communities in the network

1. Use the nx_comm.label_propagation_communities to create a list of the community sets
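
A minimal sketch of this step on a toy network (the hashtags and weights here are made up for illustration). Note that `label_propagation_communities` takes only the graph, so the edge weights are not passed to it; `nx_comm.asyn_lpa_communities` is the NetworkX variant that accepts a `weight` argument.

```python
import networkx as nx
import networkx.algorithms.community as nx_comm

# Toy co-occurrence network: two separate groups of three hashtags each.
G_toy = nx.Graph()
G_toy.add_weighted_edges_from([
    ('forex', 'trading', 3), ('trading', 'stocks', 2), ('forex', 'stocks', 1),
    ('cowboys', 'nflthanksgiving', 2), ('nflthanksgiving', 'thanksgiving', 4),
    ('cowboys', 'thanksgiving', 1),
])

# Convert the returned iterable into a plain list of sets.
communities = [set(c) for c in nx_comm.label_propagation_communities(G_toy)]
comm_sizes = sorted(len(c) for c in communities)
print(communities)
```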

### 4) Finding the most meaningful communities
Analysis without preprocessing doesn't give very detailed information, so a few things should be done to make it more meaningful.

#### Methods:
1. Introducing a weighting threshold on network edges

Ignoring the low-weight edges essentially means that a certain number of tweets must share a pair of hashtags before the two are considered connected. Raising the threshold gradually restricts the linked hashtags to those that are used together very frequently. It also eliminates the connections of lesser-used hashtags, as they are less likely to have enough tweets to overcome the threshold.

2. Eliminating smaller connected components of the graph

Eliminating small connected components makes it easier to focus on the more prevalent hashtags and their relationships with each other. It also eliminates outlier hashtags that may be misspelled or just not very common. The downside (depending on whether a weighting threshold was already implemented) is that very niche hashtags, which may well be used often with each other but not with the main ones, would be eliminated.
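
Both methods could be sketched in one function along these lines. The name `prune_network` is hypothetical, and the default values of 10 are only my reading of the `_w10_c10` suffix in the settings cell (weight threshold 10, component size 10), not confirmed values.

```python
import networkx as nx

def prune_network(G, min_weight=10, min_component_size=10):
    """Hypothetical helper: apply a weight threshold, then drop small components."""
    # Method 1: keep only edges whose weight reaches the threshold.
    H = nx.Graph()
    H.add_weighted_edges_from(
        (u, v, w) for u, v, w in G.edges(data='weight', default=1)
        if w >= min_weight
    )
    # Method 2: keep only nodes in sufficiently large connected components.
    keep = set()
    for component in nx.connected_components(H):
        if len(component) >= min_component_size:
            keep |= component
    return H.subgraph(keep).copy()

# Toy example with small thresholds so the effect is visible.
G_toy = nx.Graph()
G_toy.add_weighted_edges_from([
    ('a', 'b', 5), ('b', 'c', 5),  # heavy path of three nodes: kept
    ('d', 'e', 5),                 # heavy but only two nodes: dropped
    ('a', 'x', 1),                 # light edge: dropped by the threshold
])
G_pruned = prune_network(G_toy, min_weight=2, min_component_size=3)
print(sorted(G_pruned.nodes()))
```

Applying the threshold first matters: removing light edges can split a large component into pieces that then fall below the size cutoff.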

In [None]:
# Weighting Threshold Implementation





In [None]:
# Smaller Component Elimination




#### Histogram plot of community sizes

In [None]:
# Code for when I get here:

# plt.hist(comm_sizes,20)
# plt.xlabel('community sizes')
# plt.show()

**Analysis of histogram:**

notes here

#### Analysis of certain community examples

notes here