# Text Analytics | BAIS:6100
# Module 11: Keyword Network Analysis

Instructor: Kang-Pyo Lee 

In [None]:
# ! pip install --user --upgrade matplotlib networkx

In [None]:
screen_name = "cnnbrk"

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', 150)

df = pd.read_csv("classdata/tweets/timeline_{}.csv".format(screen_name), sep="\t")
df

In [None]:
from IPython.display import Image
Image("classdata/images/keyword_network.png")

## Step 1: Calculate keyword frequencies and keyword co-occurrence frequencies

The keyword frequencies will be used as the node weights and the keyword co-occurrence frequencies as the edge weights. 
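
To make the two kinds of counts concrete before applying them to the tweets, here is a minimal sketch on a hypothetical three-document toy corpus. The names `docs`, `freq`, and `co_freq` are illustrative assumptions and are not used elsewhere in this notebook; the sketch uses `defaultdict(Counter)` as one convenient way to build a dictionary of dictionaries.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: each document is already tokenized and filtered.
docs = [
    ["biden", "trump", "election"],
    ["biden", "election"],
    ["trump", "court"],
]

freq = Counter()                # keyword -> number of documents containing it
co_freq = defaultdict(Counter)  # keyword -> {other keyword -> shared documents}

for tokens in docs:
    unique = set(tokens)        # count each keyword at most once per document
    freq.update(unique)
    for w1 in unique:
        for w2 in unique:
            if w1 != w2:
                co_freq[w1][w2] += 1

print(freq["biden"])                 # 2
print(co_freq["biden"]["election"])  # 2
```

Here `freq` would supply the node weights and `co_freq` the edge weights.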

In [None]:
import nltk

df["words"] = df.text.apply(lambda x: nltk.word_tokenize(x))
df[["text", "words"]]

In [None]:
from nltk.corpus import stopwords
import string

global_stopwords = stopwords.words("english")
local_stopwords = [c for c in string.punctuation] +\
                  ['’', '``', '…', '...', "''", '‘', '“', '”', "....", "'m", "'re", "'s", "'ve", 
                   'amp', 'https', "n't", 'rt', 'a…', 'co', 'i…', 't…']

In [None]:
from collections import Counter

###################################################################################
# The 'counter' object will have all the word count information. 
# The 'co_counter' object will have all the co-occurrence count information.
###################################################################################
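counter = Counter()
co_counter = dict()

# Build the stopword set once; a set lookup is much faster than
# concatenating and scanning two lists for every token.
all_stopwords = set(global_stopwords + local_stopwords)

for l in df.words:
    word_set = set()
    
    for item in l:
        word = item.lower()
        
        if word not in all_stopwords:
            word_set.add(word)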

    counter.update(word_set)
    
    ###################################################################################
    # Calculate co-occurrence count of two words and save it in 'co_counter'.
    # 'co_counter' is a dictionary of dictionaries: co_counter[word1][word2]
    # is the number of documents that contain both word1 and word2.
    ###################################################################################
    words = list(word_set)
    for word1 in words:
        if word1 not in co_counter:
            co_counter[word1] = dict()
        
        for word2 in words:
            ######################################
            # Skip if the two words are the same.
            ######################################
            if word1 == word2:
                continue
            
            if word2 not in co_counter[word1]:
                co_counter[word1][word2] = 1
            else:
                co_counter[word1][word2] += 1

In [None]:
counter.most_common(30)

In [None]:
co_counter["biden"]["trump"], co_counter["trump"]["biden"]

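The co-occurrence frequency of two keywords is symmetric: the loop above counts every pair in both directions, so `co_counter[a][b]` always equals `co_counter[b][a]`. 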
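
The symmetry can also be checked programmatically. The following is a small sketch, assuming data in the same dictionary-of-dictionaries shape as `co_counter`; the helper `is_symmetric` and the toy dictionaries are hypothetical and only for illustration.

```python
def is_symmetric(nested):
    """Return True if nested[a][b] == nested[b][a] for every stored pair."""
    return all(
        nested.get(b, {}).get(a) == count
        for a, inner in nested.items()
        for b, count in inner.items()
    )

# Hypothetical toy data shaped like co_counter.
toy = {
    "biden": {"trump": 3, "election": 2},
    "trump": {"biden": 3},
    "election": {"biden": 2},
}

# Same data with one direction altered, to show a failing case.
broken = {
    "biden": {"trump": 3},
    "trump": {"biden": 4},
}

print(is_symmetric(toy))     # True
print(is_symmetric(broken))  # False
```

Running `is_symmetric(co_counter)` on the real counts should return `True` if the counting loop ran as written.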

## Step 2: Create a graph object

In [None]:
import networkx as nx

G = nx.Graph()

networkx.Graph: https://networkx.github.io/documentation/stable/reference/classes/graph.html

## Step 3: Decide the number of nodes in the graph 

In [None]:
num_nodes = 30

Recall that nodes correspond to keywords. 

## Step 4: Define nodes and their weights for network visualization 

In [None]:
nodes = [item[0] for item in counter.most_common(num_nodes)]
node_weights = [item[1] * 10 for item in counter.most_common(num_nodes)]

Let's take the 30 most common keywords as nodes and their frequencies (the number of tweets containing each keyword) as node weights. The node weights are drawn later as node sizes, so they are multiplied by 10 here to make the nodes large enough to see clearly. 
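
The factor of 10 is a choice that suits this dataset. As an alternative sketch, assuming you would rather not tune a constant by hand, counts can be mapped linearly onto a fixed size range. The function `scale_sizes` and its default bounds are hypothetical, not part of networkx; the lower bound of 300 simply matches the default `node_size` used by `draw_networkx`.

```python
def scale_sizes(counts, min_size=300, max_size=3000):
    """Linearly map raw counts to node sizes between min_size and max_size."""
    lo, hi = min(counts), max(counts)
    if hi == lo:  # all counts equal: avoid division by zero
        return [float(max_size)] * len(counts)
    span = max_size - min_size
    return [min_size + (c - lo) / (hi - lo) * span for c in counts]

sizes = scale_sizes([5, 10, 25])
print(sizes)  # [300.0, 975.0, 3000.0]
```

With the real data, something like `scale_sizes([item[1] for item in counter.most_common(num_nodes)])` could replace the multiplication by 10.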

## Step 5: Add nodes to the graph

In [None]:
for word in nodes:
    G.add_node(word, weight=counter.get(word))

networkx.Graph.add_node: https://networkx.github.io/documentation/stable/reference/classes/generated/networkx.Graph.add_node.html

Add each node to `G`, setting its `weight` attribute to the keyword frequency. 
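
As a small illustration of how node attributes are stored and read back, here is a sketch on a separate toy graph. The graph `H` and the dictionary `toy_counts` are hypothetical; `add_nodes_from` is used here as a one-call equivalent of the loop, not as the method this notebook relies on.

```python
import networkx as nx

# Hypothetical keyword frequencies.
toy_counts = {"biden": 12, "trump": 9, "election": 4}

H = nx.Graph()
# Each item is a (node, attribute_dict) pair.
H.add_nodes_from((word, {"weight": count}) for word, count in toy_counts.items())

print(H.nodes["biden"]["weight"])            # 12
print(nx.get_node_attributes(H, "weight"))   # {'biden': 12, 'trump': 9, 'election': 4}
```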

In [None]:
G.nodes.data()     # Check what nodes there are in G

networkx.Graph.nodes: https://networkx.github.io/documentation/stable/reference/classes/generated/networkx.Graph.nodes.html

## Step 6: Add edges to the graph

In [None]:
for word1 in nodes:
    for word2 in nodes:
        if word1 != word2 and word2 in co_counter[word1]:
            G.add_edge(word1, word2, weight=co_counter[word1][word2])

networkx.Graph.add_edge: https://networkx.github.io/documentation/stable/reference/classes/generated/networkx.Graph.add_edge.html

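For each pair of keywords in `nodes`, add an edge to `G`, setting its `weight` attribute to the co-occurrence frequency. Recall that an edge between two nodes means the two keywords co-occur in at least one document, and that the weight of the edge is the number of documents in which they co-occur. Because `G` is an undirected graph, the pairs (word1, word2) and (word2, word1) refer to the same edge, so visiting a pair a second time simply sets the same weight again. 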
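
Since the graph is undirected, each unordered pair only needs to be visited once. The sketch below shows this variant on hypothetical toy data (`toy_nodes`, `toy_co`, and `H` are illustrative names), using `itertools.combinations` instead of the nested loop.

```python
from itertools import combinations

import networkx as nx

# Hypothetical keywords and co-occurrence counts shaped like co_counter.
toy_nodes = ["biden", "trump", "election"]
toy_co = {
    "biden": {"trump": 3, "election": 2},
    "trump": {"biden": 3},
    "election": {"biden": 2},
}

H = nx.Graph()
H.add_nodes_from(toy_nodes)

# combinations() yields each unordered pair exactly once.
for w1, w2 in combinations(toy_nodes, 2):
    weight = toy_co.get(w1, {}).get(w2)
    if weight:  # skip pairs that never co-occur
        H.add_edge(w1, w2, weight=weight)

print(H.number_of_edges())           # 2
print(H["trump"]["biden"]["weight"]) # 3
```

Note that "trump" and "election" never co-occur in the toy data, so no edge is created between them.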

In [None]:
G.edges.data()     # Check what edges there are in G

networkx.Graph.edges: https://networkx.github.io/documentation/stable/reference/classes/generated/networkx.Graph.edges.html

## Step 7: Define edges and their weights for network visualization 

In [None]:
edges = nx.get_edge_attributes(G, "weight").keys()
edges

networkx.classes.function.get_edge_attributes: https://networkx.github.io/documentation/stable/reference/generated/networkx.classes.function.get_edge_attributes.html

In [None]:
edge_weights = nx.get_edge_attributes(G, "weight").values()
edge_weights

In [None]:
edge_weights = [item / 10 for item in edge_weights]
edge_weights

The edge weights are drawn later as edge thickness, so they are divided by 10 here to keep the lines from becoming too thick to read. 
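
The drawing function pairs the i-th edge with the i-th width, so the two sequences must stay in the same order. One way to guarantee this, sketched here on a hypothetical toy graph `H`, is to collect both in a single pass over `G.edges(data="weight")`. The names `toy_edges` and `toy_widths` are illustrative.

```python
import networkx as nx

H = nx.Graph()
H.add_edge("biden", "trump", weight=30)
H.add_edge("biden", "election", weight=20)

toy_edges = []
toy_widths = []
# Each item is (node1, node2, weight), so edge and width are built together.
for u, v, w in H.edges(data="weight"):
    toy_edges.append((u, v))
    toy_widths.append(w / 10)

print(toy_edges)   # [('biden', 'trump'), ('biden', 'election')]
print(toy_widths)  # [3.0, 2.0]
```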

## Step 8: Plot the graph

Types of layouts:
- circular
- random
- spectral
- spring
- shell
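
Each layout in the list above corresponds to a networkx function that returns a dictionary mapping every node to an (x, y) position. The sketch below, on a hypothetical four-node path graph `H`, calls all five so the results can be compared; the `seed` values are arbitrary and only make the randomized layouts repeatable.

```python
import networkx as nx

H = nx.path_graph(4)  # small toy graph: 0 - 1 - 2 - 3

layouts = {
    "circular": nx.circular_layout(H),
    "random": nx.random_layout(H, seed=0),
    "spectral": nx.spectral_layout(H),
    "spring": nx.spring_layout(H, seed=0),
    "shell": nx.shell_layout(H),
}

for name, pos in layouts.items():
    # Round the coordinates only to keep the printout short.
    rounded = {node: tuple(round(float(c), 2) for c in xy) for node, xy in pos.items()}
    print(name, rounded)
```

Any of these dictionaries can be passed as the `pos` argument of `nx.draw_networkx`.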

In [None]:
from matplotlib import pyplot as plt

In [None]:
plt.figure(figsize=(10, 10))
nx.draw_networkx(G, pos=nx.circular_layout(G), 
                 nodelist=nodes, node_size=node_weights, edgelist=edges, width=edge_weights,
                 node_color="yellow", with_labels=True, font_size=10)
plt.draw()

networkx.drawing.nx_pylab.draw_networkx: https://networkx.github.io/documentation/stable/reference/generated/networkx.drawing.nx_pylab.draw_networkx.html

As node size represents keyword frequency, larger nodes mean that the keywords for those nodes are used more frequently. Likewise, as edge thickness represents co-occurrence frequency, thicker edges mean that the two keywords connected by those edges appear in the same documents, or tweets, more frequently.
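
When the plot gets crowded, the same reading can be done numerically by ranking nodes and edges by weight. This is a sketch on a hypothetical toy graph `H` with made-up weights; on the real data the same two `sorted` calls could be applied to `G`.

```python
import networkx as nx

H = nx.Graph()
for word, count in [("biden", 12), ("trump", 9), ("election", 4)]:
    H.add_node(word, weight=count)
H.add_edge("biden", "trump", weight=5)
H.add_edge("biden", "election", weight=3)
H.add_edge("trump", "election", weight=1)

# Most frequent keywords first, then most frequently co-occurring pairs first.
top_nodes = sorted(H.nodes(data="weight"), key=lambda item: item[1], reverse=True)
top_edges = sorted(H.edges(data="weight"), key=lambda item: item[2], reverse=True)

print(top_nodes[:2])  # [('biden', 12), ('trump', 9)]
print(top_edges[:2])  # [('biden', 'trump', 5), ('biden', 'election', 3)]
```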

In [None]:
plt.figure(figsize=(10, 10))
nx.draw_networkx(G, pos=nx.random_layout(G, seed=0), 
                 nodelist=nodes, node_size=node_weights, edgelist=edges, width=edge_weights,
                 node_color="yellow", with_labels=True, font_size=10)
plt.draw()

In [None]:
plt.figure(figsize=(10, 10))
nx.draw_networkx(G, pos=nx.spring_layout(G), 
                 nodelist=nodes, node_size=node_weights, edgelist=edges, width=edge_weights,
                 node_color="yellow", with_labels=True, font_size=10)
plt.draw()

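The spring layout uses a force-directed algorithm (Fruchterman-Reingold): edges pull connected nodes together like springs while all nodes repel one another, so closely related keywords end up near each other. By default it uses the `weight` edge attribute, so heavier edges pull harder. Because the starting positions are random, the picture changes from run to run unless a `seed` is passed to `nx.spring_layout`. 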
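
To get the same spring layout on every run, a fixed `seed` can be passed. The sketch below checks this on a hypothetical toy graph `H`; the seed value 42 and the edge weights are arbitrary.

```python
import networkx as nx
import numpy as np

H = nx.Graph()
H.add_weighted_edges_from([
    ("biden", "trump", 5),
    ("biden", "election", 3),
    ("trump", "court", 1),
])

# Two calls with the same seed should give identical positions.
pos_a = nx.spring_layout(H, seed=42)
pos_b = nx.spring_layout(H, seed=42)

same = all(np.allclose(pos_a[node], pos_b[node]) for node in H)
print(same)  # True
```

In the plotting cell, this would amount to replacing `nx.spring_layout(G)` with `nx.spring_layout(G, seed=42)`.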

## Exercises - Keyword Network Analysis