# Mapping Word Co-occurrences in the 2022 'State of the Union' Address
In this notebook, we map the co-occurrences of words within each sentence of the 2022 State of the Union address. For each sentence, after removing stopwords, we log each pairing of a word with every other word in the sentence; these pairings are the co-occurrences. After doing this for every sentence, we can build a matrix of these co-occurrences, which can then be used to map edges (co-occurrences) between nodes (words).
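
To make the idea concrete before working with the full speech, here is a minimal sketch on two invented, already-cleaned sentences. The names `toy_sentences` and `pair_counts` are illustrative only. Unlike the notebook cells further down, this sketch de-duplicates and sorts the words of each sentence so that every unordered pair has a single key.

```python
from collections import Counter
from itertools import combinations

## Toy sentences, already lowercased and stripped of stopwords (invented for illustration)
toy_sentences = [
    ["freedom", "triumph", "tyranny"],
    ["freedom", "tyranny"],
]

pair_counts = Counter()
for words in toy_sentences:
    ## Sorting means ('freedom', 'tyranny') and ('tyranny', 'freedom') share one key
    for pair in combinations(sorted(set(words)), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.items():
    print(pair, count)
```

Here 'freedom' and 'tyranny' appear together in both sentences, so that pair gets a count of 2 while the other pairs get 1. Those counts are what later become edge weights.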


One important convention if you are going to play around with this notebook:
- Comments with # are optional lines of code. To run that line, just remove the # and rerun the cell.  
- Comments with ## are just comments and should be left as is.

## Installs/ Downloads/ Imports


### Installs
- If you get install errors, run these commands in a terminal window instead
- These are only needed if you do not already have these packages installed
- Only need to run the first time you run the notebook

In [129]:
## For building and analyzing the network graph
#pip install networkx  

In [4]:
## For tokenizing the text and stopwords
#pip install nltk  

In [None]:
## For visualizing the graph
#pip install pyvis

### Imports

In [127]:

import nltk ## Natural Language Tool Kit must be imported for the below downloads to run
from nltk.tokenize import word_tokenize ## One available tokenizer in NLTK
from nltk.corpus import stopwords ## Stopwords are very common words (e.g. "the", "and") that carry little meaning for this analysis
import re ## Regex (Regular Expressions)
import pandas as pd ## Pandas (Panel Data)
import itertools ## For iterating over the data
import networkx as nx ## NetworkX is a Python library for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.
from networkx.algorithms.community import greedy_modularity_communities ## For finding communities in the graph
from pyvis.network import Network ## For visualizing the graph




### Downloads
- Make sure that you have run the imports before running these
- Only need to be run if you do not already have these downloaded
- Only need to be run the first time you run this notebook

In [None]:
# nltk.download('stopwords') # Dictionary of stopwords to be removed later
# nltk.download('punkt') # For tokenizing the file into sentences

## Make some preparations

In [128]:
## Setup stop words
stops = set(stopwords.words('english'))
#print (stops)

### Get data from file
We want to keep a copy of the original unaltered data.

In [130]:
## Read in file
path = "Data/Sotu22.txt"
with open(path, 'rb') as f:
  contents = f.read()
contents=contents.decode('utf-16')

### Clean up the data
We create two different versions of the data here:
1. 'contents' will be used to split into sentences and get the co-occurrences. For this purpose, we leave periods in to delineate sentences.

2. 'cont_for_tok' or 'contents for tokens' will be used to create a set of all words in the SOTU. This will be used in the creation of the co-occurrence matrix.
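
As a quick illustration of the two versions, here is a small sketch that applies the same kind of regular expressions to an invented sample string. The names `sample`, `with_periods` and `without_periods` are made up for this example.

```python
import re

## Invented sample text with punctuation, mixed case and a line break
sample = "Last year, COVID-19 kept us apart.\nThis year we're finally together!"

## Keep only letters, digits, whitespace and periods, then lowercase
keep_pattern = re.compile(r'[^a-zA-Z\s\d.]')
with_periods = keep_pattern.sub('', sample).lower()
with_periods = re.sub(r'\n', ' ', with_periods)

## Second version: periods replaced by spaces, ready to be split into words
without_periods = re.sub(r'\.', ' ', with_periods)

print(with_periods)
print(without_periods.split())
```

Note one side effect of stripping punctuation this way: the hyphen in "COVID-19" and the apostrophe in "we're" disappear, so they become "covid19" and "were".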

In [8]:
## Remove punctuation, but leave periods to define sentences. 
## Make all words lowercase, otherwise the system will identify different cases as different words
reg1 = re.compile(r'[^a-zA-Z\s\d.]') ## Raw string avoids invalid escape sequence warnings
contents = re.sub(reg1, '', contents).lower()

## Remove new line characters
reg2 = r'\n'
contents = re.sub(reg2, ' ', contents)
#contents[0:250]

'madam speaker madam vice president and our first lady and second gentleman. members of congress and the cabinet. justices of the supreme court. my fellow americans.  last year covid19 kept us apart. this year we are finally together again.  tonight w'

#### Split off a version of contents for tokenization

In [9]:
## Remove periods and create contents for tokens 
reg3 = r'\.'
cont_for_tok = re.sub(reg3, ' ', contents)
cont_for_tok[0:250] ## Periods have been removed

'madam speaker madam vice president and our first lady and second gentleman  members of congress and the cabinet  justices of the supreme court  my fellow americans   last year covid19 kept us apart  this year we are finally together again   tonight w'

In [10]:
## Tokenize cont_for_tok into individual words
cont_for_tok = set(word_tokenize(cont_for_tok)) 
len(cont_for_tok)

1860

## Create the Co-occurrence Matrix

### Create empty matrix

In [132]:
## Iterate through the tokenized words and remove stop words
wordsFiltered = []
for word in cont_for_tok:
    if word not in stops:
        wordsFiltered.append(word)
len(wordsFiltered)

1747
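
The same filtering can be written as a single expression. The sketch below uses a tiny stand-in stop list (`toy_stops`) rather than the NLTK one, and sorts the result. Sorting is my own addition: iterating over a Python set of strings can give a different order from run to run, so a sorted list keeps the matrix columns in a reproducible order.

```python
## Stand-in for the NLTK stop word set (assumption for illustration)
toy_stops = {"the", "of", "and"}

## Stand-in for the set of unique tokens
toy_vocab = {"justices", "of", "the", "supreme", "court"}

## Keep only non-stop words, in alphabetical order
toy_filtered = sorted(word for word in toy_vocab if word not in toy_stops)
print(toy_filtered)
```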

In [12]:
## Create columns
matrix = pd.DataFrame(columns=wordsFiltered)

## Add a column in the first position
## This will hold our words for co-occurrences
## The words in this column will match the words in the column labels
matrix.insert(0, 'co_occ', 0)
matrix.shape

(0, 1748)

In [13]:
## Create rows. DataFrame.append was removed in pandas 2.0, so we add all of the rows at once with reindex
## Note that there is one more column than there are rows. This is because we added the column 'co_occ' to accommodate the row words
matrix = matrix.reindex(range(len(wordsFiltered)), fill_value=0)
matrix['co_occ'] = wordsFiltered
matrix.shape

(1747, 1748)
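
An alternative layout, sketched here under my own assumptions rather than taken from the cells above, is to use the words as the row index instead of a separate 'co_occ' column. That gives a square matrix that can be addressed directly by (row word, column word). The names `toy_vocab` and `square` are hypothetical.

```python
import pandas as pd

## Hypothetical stand-in for wordsFiltered
toy_vocab = ["freedom", "triumph", "tyranny"]

## Square matrix of zeros with the same labels on rows and columns
square = pd.DataFrame(0, index=toy_vocab, columns=toy_vocab)

## A cell can then be updated by label without a boolean mask
square.loc["freedom", "tyranny"] += 1

print(square.shape)
print(square)
```

The rest of this notebook keeps the 'co_occ' column layout, so this is only a design option to be aware of.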

### Fill the matrix with co-occurences

In [14]:
## break text into a list of strings (sentences)
sentences = nltk.tokenize.sent_tokenize(contents)
sentences[0:3] ## Notice that we still have periods. We don't want these in the coming steps.

['madam speaker madam vice president and our first lady and second gentleman.',
 'members of congress and the cabinet.',
 'justices of the supreme court.']

In [15]:
## One more cleanup step, since we no longer need periods.
## If the periods are left in, the system will think that a word... 
## with a period attached is different than the same word without one.
sentences_filtered = []

for sent in sentences:
    temp_sent = ''
    ## Replace the periods with spaces
    ## We replace with a space because there are instances of 'word.word' in the text
    sent = re.sub(reg3, ' ', sent) 
    ## Tokenize each sentence to make iterating their words easier
    tokens = word_tokenize(sent)
    ## If the word is not in the stopwords list, add it to the temporary sentence, separated by a space
    for word in tokens:
        if word not in stops:
            temp_sent = temp_sent + word + ' '
    ## Once a sentence is complete, add it to the list of sentences.
    ## This keeps the sentences separated instead of one long string
    sentences_filtered.append(temp_sent.strip())
sentences_filtered[5:10] ## Notice that all periods have been removed

['year finally together ',
 'tonight meet democrats republicans independents ',
 'importantly americans ',
 'duty one another america american people constitution ',
 'unwavering resolve freedom always triumph tyranny ']
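
For readers without NLTK at hand, the same per-sentence cleanup can be sketched with plain string methods. This is a simplification: it splits on periods and whitespace instead of using sent_tokenize and word_tokenize, and `toy_stops` is a made-up stop list.

```python
## Made-up stop list and sample text (assumptions for illustration)
toy_stops = {"this", "we", "are", "last", "us"}
text = "last year covid19 kept us apart. this year we are finally together again."

sentences_clean = []
for sent in text.split('.'):
    ## Keep the words that are not stop words
    kept = [word for word in sent.split() if word not in toy_stops]
    ## Skip empty fragments, such as the one after the final period
    if kept:
        ## ' '.join avoids the trailing space left by repeated string concatenation
        sentences_clean.append(' '.join(kept))

print(sentences_clean)
```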

In [16]:
## 1. Iterate through each sentence
## 2. Compute the pairwise combinations of words within it
## 3. Add each co-occurrence to the matrix exactly once
for sent in sentences_filtered:
    tokens = word_tokenize(sent)
    for subset in itertools.combinations(tokens, 2):
        matrix.loc[matrix.co_occ == subset[0], subset[1]] += 1

In [17]:
## The matrix is too large to view here, but opening the CSV shows that we were successful
matrix.to_csv('Data/SOTU_Matrix.csv', index=False)
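
One detail worth knowing about the fill step: itertools.combinations yields each pair once, in the order the words appear in the sentence, so only the cell for (earlier word, later word) is incremented. If a symmetric matrix is wanted, both cells can be updated. The sketch below shows this on invented sentences, using a square word-by-word layout (an assumption, not the 'co_occ' layout used above).

```python
from itertools import combinations

import pandas as pd

## Invented, already-cleaned sentences
toy_sentences = ["freedom triumph tyranny", "tyranny freedom"]
toy_vocab = sorted({word for sent in toy_sentences for word in sent.split()})
counts = pd.DataFrame(0, index=toy_vocab, columns=toy_vocab)

for sent in toy_sentences:
    for first, second in combinations(sent.split(), 2):
        ## Update both directions so the matrix stays symmetric
        counts.loc[first, second] += 1
        counts.loc[second, first] += 1

print(counts)
```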

## Build the Graph

### Create a list of edges

In [19]:
## Collect the edges in a list of dictionaries, then build the dataframe in one step
## (DataFrame.append was removed in pandas 2.0)
## The source and target are the row and column labels of the matrix
## and the value at their intersection is the weight
edge_rows = []

for _, row in matrix.iterrows():
    for col in matrix.columns:
        ## Skip the first column, which holds the source word itself
        ## Without this we would end up with self referential rows
        if col == 'co_occ':
            continue
        ## Only keep non-zero values. This significantly cuts down on the size of the dataframe
        if row[col] != 0:
            edge_rows.append({'source': row['co_occ'],
                              'target': col,
                              'weight': row[col] / 2}) ## Cut the weight in half to account for reciprocal edges

edges = pd.DataFrame(edge_rows, columns=['source', 'target', 'weight'])

In [21]:
edges.to_csv('Data/edges.csv', index=False)
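
For comparison, pandas can reshape a matrix into an edge list without explicit loops. This is a sketch on a small invented symmetric matrix in the square word-by-word layout (an assumption); `counts`, `long_form` and `toy_edges` are illustrative names. It keeps each unordered pair once by requiring source < target alphabetically.

```python
import pandas as pd

toy_vocab = ["freedom", "triumph", "tyranny"]
## Invented symmetric co-occurrence counts
counts = pd.DataFrame(
    [[0, 1, 2],
     [1, 0, 1],
     [2, 1, 0]],
    index=toy_vocab, columns=toy_vocab,
)

## Turn the wide matrix into one row per (source, target) cell
long_form = (
    counts.rename_axis("source")
          .reset_index()
          .melt(id_vars="source", var_name="target", value_name="weight")
)

## Drop empty cells and keep only one direction of each pair
keep = (long_form["weight"] > 0) & (long_form["source"] < long_form["target"])
toy_edges = long_form[keep].reset_index(drop=True)
print(toy_edges)
```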

In [76]:
## Initialize the graph from the edges dataframe
G = nx.from_pandas_edgelist(edges, source = 'source', target = 'target', edge_attr = 'weight')
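
A small sketch of what this call does with reciprocal rows, using an invented edge list (`toy_edges`, `toy_G` and `strength` are illustrative names). Because the default graph is undirected, a row for ('tyranny', 'freedom') refers to the same edge as ('freedom', 'tyranny'), and as far as I can tell the later row replaces the weight instead of adding to it.

```python
import networkx as nx
import pandas as pd

## Invented edge list containing one reciprocal pair
toy_edges = pd.DataFrame({
    "source": ["freedom", "freedom", "tyranny"],
    "target": ["triumph", "tyranny", "freedom"],
    "weight": [1.0, 2.0, 5.0],
})

toy_G = nx.from_pandas_edgelist(toy_edges, source="source", target="target", edge_attr="weight")

## Two distinct edges remain; the reciprocal row overwrote the weight of the first
print(toy_G.number_of_edges())
print(toy_G["freedom"]["tyranny"]["weight"])

## Weighted degree: the sum of edge weights attached to each word
strength = dict(toy_G.degree(weight="weight"))
print(strength)
```

If the weights of reciprocal rows should be combined, it is safer to merge them in the edge list before building the graph.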

In [44]:
## Graph the entire network.
## The full network is too large to read easily, so we will break it into communities later
## This serves to show where we started from
net=Network(height="100%", width="100%", bgcolor="#222222", font_color="white")
net.from_nx(G)
net.show_buttons(filter_=['physics'])
net.show("Graphs/full_graph.html")

### Identify communities
There are many community detection algorithms available with networkx.  
I chose to go with Greedy Modularity as I have the most experience with it.  
Feel free to try out other algorithms!
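
To get a feel for what the algorithm returns, here is a minimal sketch on an invented graph: two tightly connected triangles joined by one weak edge. The node names and weights are made up for illustration.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

toy_G = nx.Graph()
toy_G.add_weighted_edges_from([
    ## First triangle, strong edges
    ("a", "b", 3), ("b", "c", 3), ("a", "c", 3),
    ## Second triangle, strong edges
    ("d", "e", 3), ("e", "f", 3), ("d", "f", 3),
    ## One weak bridge between the two groups
    ("c", "d", 1),
])

## Each community is returned as a frozenset of node names
toy_comms = greedy_modularity_communities(toy_G, weight="weight")
for comm in toy_comms:
    print(sorted(comm))
```

The two triangles come back as separate communities because merging them across the weak bridge would lower the modularity score.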

In [89]:

gmc = greedy_modularity_communities(G, weight='weight', resolution=1)

## This is how many communities were identified
len(gmc)

### Loop through the communities and graph each one

In [134]:
counter=1
for comm in gmc:
    temp_subgraph = G.subgraph(comm)
    net=Network(height="100%", width="100%", bgcolor="#222222", font_color="white")
    net.from_nx(temp_subgraph)

    ## The below blocks can be turned on or off to adjust functionality in the graph
    ## Turn this on to get a physics menu below the graph
    ## Must turn off the set_options block below
    #net.show_buttons(filter_=['physics'])
    
    ## Turn this on to get full control of the graph
    ## Must turn off the set_options block below
    #net.show_buttons()#(filter=['nodes, edges, layout, interaction, manipulation, physics, selection, renderer, physics'])

    ## This sets the settings that are available in the commented out line above.
    ## Remove this function if you want to start your settings from scratch.
    net.set_options(
        '''var options = {
      "nodes": {
        "color": {
          "border": "rgba(112,12,233,1)",
          "background": "rgba(161,88,252,1)",
          "highlight": {
            "border": "rgba(17,229,233,1)",
            "background": "rgba(100,255,254,1)"
          },
          "hover": {
            "background": "rgba(170,201,255,1)"
          }
        }
      },
      "edges": {
        "color": {
          "color": "rgba(125,92,175,1)",
          "highlight": "rgba(95,252,255,1)",
          "hover": "rgba(102,103,255,1)",
          "inherit": false
        },
        "smooth": false
      },
      "interaction": {
        "hover": true,
        "multiselect": true
      },
      "physics": {
        "forceAtlas2Based": {
          "gravitationalConstant": -83,
          "springLength": 135
        },
        "minVelocity": 0.75,
        "solver": "forceAtlas2Based"
      }
    }'''
    )

    filename = 'Graphs/community_' + str(counter) + '_graph.html'
    counter += 1
    net.show(filename)

#### Optionally, select a specific community to graph
- To do this, change the index in comm = gmc[0] in the block below
- gmc is sorted by default, with gmc[0] being the largest community and gmc[-1] being the smallest.
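
Before picking an index, it can help to look at the community sizes. The sketch below does this on an invented graph (a group of four fully connected nodes and a group of three, joined by one edge), then pulls out the largest community as a subgraph. The names `toy_G`, `toy_comms`, `sizes` and `largest` are illustrative.

```python
from itertools import combinations

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

toy_G = nx.Graph()
toy_G.add_edges_from(combinations("abcd", 2))  ## four nodes, all connected
toy_G.add_edges_from(combinations("efg", 2))   ## three nodes, all connected
toy_G.add_edge("d", "e")                       ## single bridge

toy_comms = greedy_modularity_communities(toy_G)

## Sizes come back largest first
sizes = [len(comm) for comm in toy_comms]
print(sizes)

## Index 0 is the largest community; subgraph keeps only its nodes and the edges between them
largest = toy_G.subgraph(toy_comms[0])
print(sorted(largest.nodes()), largest.number_of_edges())
```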

In [140]:
comm = gmc[0]
subgraph=G.subgraph(comm)

net=Network(height="100%", width="100%", bgcolor="#222222", font_color="white")
net.from_nx(subgraph)

## The below blocks can be turned on or off to adjust functionality in the graph
## Turn this on to get a physics menu below the graph
## Must turn off the set_options block below
#net.show_buttons(filter_=['physics'])

## Turn this on to get full control of the graph
## Must turn off the set_options block below
#net.show_buttons()#(filter=['nodes, edges, layout, interaction, manipulation, physics, selection, renderer, physics'])

## This sets the settings that are available in the commented out line above.
## Remove this function if you want to start your settings from scratch.
net.set_options(
    '''var options = {
  "nodes": {
    "color": {
      "border": "rgba(112,12,233,1)",
      "background": "rgba(161,88,252,1)",
      "highlight": {
        "border": "rgba(17,229,233,1)",
        "background": "rgba(100,255,254,1)"
      },
      "hover": {
        "background": "rgba(170,201,255,1)"
      }
    }
  },
  "edges": {
    "color": {
      "color": "rgba(125,92,175,1)",
      "highlight": "rgba(95,252,255,1)",
      "hover": "rgba(102,103,255,1)",
      "inherit": false
    },
    "smooth": false
  },
  "interaction": {
    "hover": true,
    "multiselect": true
  },
  "physics": {
    "forceAtlas2Based": {
      "gravitationalConstant": -83,
      "springLength": 135
    },
    "minVelocity": 0.75,
    "solver": "forceAtlas2Based"
  }
}'''
)


net.show("Graphs/chosen_community_graph.html")

#### An example of a graph with all options enabled

In [146]:
comm = gmc[0]
subgraph=G.subgraph(comm)

net=Network(height="100%", width="100%", bgcolor="#222222", font_color="white")
net.from_nx(subgraph)

net.show_buttons()

net.show("Graphs/full_options_community_graph.html")