# Paradigmatic Relation Network

The module developed below extracts paradigmatic relations between words in a corpus and generates
nodes and edges for graph visualization.


## Table of Contents
0. [Import Libraries](#0)
1. [Load Data](#1)
2. [Select Required Columns](#2)
3. [Pre-Process Text](#3)
4. [Create Corpus](#4)
5. [Vectorize Word-Doc Relations](#5)
6. [Get Occurrence and Co-Occurrence matrix](#6)
7. [Function to Find Index of element in np array](#7)
8. [Calculate number of documents containing specific word](#8)
9. [Functions to produce Mutual Information](#9)
10. [Calculate MI score for all pairs of words](#10)
11. [Calculate Paradigmatic Relations](#11)
12. [Save Output](#12)

## 0 Import Libraries <a class="anchor" id="0"></a>

In [2]:
import numpy as np
import pandas as pd
from tqdm import tqdm
from modules.tweetToWords import tweetToWords
from dateutil.parser import parse
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import csr_matrix
from itertools import combinations
from tqdm.contrib import tzip
import sys
import csv
from pathlib import Path
from sklearn.metrics.pairwise import cosine_similarity


## 1 Load Data <a class="anchor" id="1"></a>

In [3]:
# read the csv file
csv_file = pd.read_csv('realdonaldtrump.csv')
csv_file.head()

Unnamed: 0,id,link,content,date,retweets,favorites,mentions,hashtags
0,1698308935,https://twitter.com/realDonaldTrump/status/169...,Be sure to tune in and watch Donald Trump on L...,2009-05-04 13:54:25,510,917,,
1,1701461182,https://twitter.com/realDonaldTrump/status/170...,Donald Trump will be appearing on The View tom...,2009-05-04 20:00:10,34,267,,
2,1737479987,https://twitter.com/realDonaldTrump/status/173...,Donald Trump reads Top Ten Financial Tips on L...,2009-05-08 08:38:08,13,19,,
3,1741160716,https://twitter.com/realDonaldTrump/status/174...,New Blog Post: Celebrity Apprentice Finale and...,2009-05-08 15:40:15,11,26,,
4,1773561338,https://twitter.com/realDonaldTrump/status/177...,"""My persona will never be that of a wallflower...",2009-05-12 09:07:28,1375,1945,,


## 2 Select Required Columns <a class="anchor" id="2"></a>

In [4]:
# get selected cells from csv
data = csv_file[["content", "date", "retweets", "favorites"]]
data.head()

Unnamed: 0,content,date,retweets,favorites
0,Be sure to tune in and watch Donald Trump on L...,2009-05-04 13:54:25,510,917
1,Donald Trump will be appearing on The View tom...,2009-05-04 20:00:10,34,267
2,Donald Trump reads Top Ten Financial Tips on L...,2009-05-08 08:38:08,13,19
3,New Blog Post: Celebrity Apprentice Finale and...,2009-05-08 15:40:15,11,26
4,"""My persona will never be that of a wallflower...",2009-05-12 09:07:28,1375,1945


## 3 Pre-Process Text <a class="anchor" id="3"></a>

In [5]:
cleantext=[]
for item in tqdm(data['content']):
    words=tweetToWords(item)
    cleantext+=[words]
data['cleantext']=cleantext
data

100%|██████████| 43352/43352 [04:30<00:00, 160.16it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['cleantext']=cleantext


Unnamed: 0,content,date,retweets,favorites,cleantext
0,Be sure to tune in and watch Donald Trump on L...,2009-05-04 13:54:25,510,917,sure tune watch donald trump late night david ...
1,Donald Trump will be appearing on The View tom...,2009-05-04 20:00:10,34,267,donald trump appearing view tomorrow morning d...
2,Donald Trump reads Top Ten Financial Tips on L...,2009-05-08 08:38:08,13,19,donald trump read top ten financial tip late s...
3,New Blog Post: Celebrity Apprentice Finale and...,2009-05-08 15:40:15,11,26,new blog post celebrity apprentice finale less...
4,"""My persona will never be that of a wallflower...",2009-05-12 09:07:28,1375,1945,persona never wallflower rather build wall cli...
...,...,...,...,...,...
43347,Joe Biden was a TOTAL FAILURE in Government. H...,2020-06-17 19:00:32,23402,116377,joe biden total failure government bungled eve...
43348,Will be interviewed on @ seanhannity tonight a...,2020-06-17 19:11:47,11810,56659,interviewed seanhannity tonight enjoy
43349,pic.twitter.com/3lm1spbU8X,2020-06-17 21:27:33,4959,19344,
43350,pic.twitter.com/vpCE5MadUz,2020-06-17 21:28:38,4627,17022,
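
The `SettingWithCopyWarning` printed above is raised because `data` was produced by selecting columns from `csv_file`, so pandas cannot tell whether the assignment targets a view or a copy. A minimal sketch of one common way to avoid it is to take an explicit `.copy()` when selecting the columns. The `raw` and `subset` names and the sample rows are illustrative, not part of the notebook.

```python
import pandas as pd

# Illustrative stand-in for the loaded CSV (not the real data).
raw = pd.DataFrame({
    "content": ["Be sure to tune in", "New Blog Post"],
    "date": ["2009-05-04 13:54:25", "2009-05-08 15:40:15"],
    "retweets": [510, 11],
})

# An explicit copy makes `subset` an independent DataFrame,
# so adding a column to it does not touch `raw`.
subset = raw[["content", "date"]].copy()
subset["cleantext"] = [text.lower() for text in subset["content"]]

print(list(subset.columns))
print(list(raw.columns))
```

The original frame keeps its columns unchanged, and the new column exists only on the copy.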


In [6]:
data.cleantext[5]

'miss usa tara conner fired always believer second chance say donald trump'

### Fix Date DataType

In [7]:
dates = []
for item in tqdm(data["date"]):
    year = parse(item).year
    dates+=[year]
data['date']=dates
data

100%|██████████| 43352/43352 [00:03<00:00, 11414.29it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['date']=dates


Unnamed: 0,content,date,retweets,favorites,cleantext
0,Be sure to tune in and watch Donald Trump on L...,2009,510,917,sure tune watch donald trump late night david ...
1,Donald Trump will be appearing on The View tom...,2009,34,267,donald trump appearing view tomorrow morning d...
2,Donald Trump reads Top Ten Financial Tips on L...,2009,13,19,donald trump read top ten financial tip late s...
3,New Blog Post: Celebrity Apprentice Finale and...,2009,11,26,new blog post celebrity apprentice finale less...
4,"""My persona will never be that of a wallflower...",2009,1375,1945,persona never wallflower rather build wall cli...
...,...,...,...,...,...
43347,Joe Biden was a TOTAL FAILURE in Government. H...,2020,23402,116377,joe biden total failure government bungled eve...
43348,Will be interviewed on @ seanhannity tonight a...,2020,11810,56659,interviewed seanhannity tonight enjoy
43349,pic.twitter.com/3lm1spbU8X,2020,4959,19344,
43350,pic.twitter.com/vpCE5MadUz,2020,4627,17022,
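
The loop above calls `dateutil.parser.parse` once per row. As a hedged alternative, pandas can parse the whole column in one call with `pd.to_datetime` and expose the year through the `.dt` accessor. The two timestamps are copied from the tables above; the `frame` name is illustrative.

```python
import pandas as pd

frame = pd.DataFrame({"date": ["2009-05-04 13:54:25", "2020-06-17 21:28:38"]})

# Parse every timestamp in one vectorized call, then keep only the year.
frame["year"] = pd.to_datetime(frame["date"]).dt.year

print(frame["year"].tolist())  # [2009, 2020]
```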


## 4 Create Corpus <a class="anchor" id="4"></a>

In [8]:
corpus = list(data.cleantext)
corpus[:5]

['sure tune watch donald trump late night david letterman present top ten list tonight',
 'donald trump appearing view tomorrow morning discus celebrity apprentice new book think like champion',
 'donald trump read top ten financial tip late show david letterman funny',
 'new blog post celebrity apprentice finale lesson learned along way',
 'persona never wallflower rather build wall cling donald trump']

## 5 Vectorize Word-Doc Relations <a class="anchor" id="5"></a>

In [9]:
vectorizer = CountVectorizer(binary=True, min_df=2, max_df=0.9)
X = vectorizer.fit_transform(corpus)
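# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocabulary = vectorizer.get_feature_names_out().tolist()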
vocabulary[:15]

['aa',
 'aaa',
 'aaafivediamond',
 'aaanews',
 'aacrowellt',
 'aaron',
 'aaronmcallorum',
 'aaszkler',
 'ab',
 'abandon',
 'abandoned',
 'abbas',
 'abbott',
 'abbydnyc',
 'abc']
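
To make the vectorizer settings concrete, here is a small sketch on a made-up five-document corpus (the `docs` list and the `toy_vectorizer` name are assumptions for illustration). `binary=True` records presence rather than frequency, `min_df=2` drops words seen in fewer than two documents, and `max_df=0.9` drops words seen in more than 90% of documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "iran deal bad",
    "iran deal deal",   # "deal" is repeated on purpose
    "great deal today",
    "great day",
    "iran sanction",
]

toy_vectorizer = CountVectorizer(binary=True, min_df=2, max_df=0.9)
matrix = toy_vectorizer.fit_transform(docs)

# vocabulary_ maps each kept word to its column; sorting the keys gives column order.
kept_words = sorted(toy_vectorizer.vocabulary_)
print(kept_words)
print(matrix.toarray())
```

Only `deal`, `great` and `iran` survive the `min_df` filter, and the repeated `deal` in the second document is still stored as 1.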

## 6 Get Occurrence and Co-Occurrence matrix <a class="anchor" id="6"></a>

In [10]:
occurrence_matrix = np.array(X.toarray())
occurrence_matrix

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [11]:
# transform occ_matrix to sparse for optimization
sparse_W = csr_matrix(occurrence_matrix)
co_occurrence_matrix = sparse_W.transpose().dot(sparse_W)
co_occurrence_matrix


<14436x14436 sparse matrix of type '<class 'numpy.int64'>'
	with 2543244 stored elements in Compressed Sparse Column format>
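
The product of the transposed occurrence matrix with itself is what turns word-document presence into word-word co-occurrence counts. A toy sketch with a dense NumPy array (the 4x3 matrix is made up for illustration; the notebook uses the sparse equivalent):

```python
import numpy as np

# Toy binary document-term matrix: 4 documents (rows) x 3 words (columns).
X = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
])

# Entry (i, j) counts the documents in which word i and word j both appear.
co = X.T @ X
print(co)
```

The result is symmetric, and its diagonal holds the number of documents containing each word, which is the same quantity as the column sums of `X`.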

### Data Shape

In [12]:
# number of documents
N = occurrence_matrix.shape[0]
N

43352

## 7 Function to Find Index of element in np array <a class="anchor" id="7"></a>

In [13]:
def find_index(arr, val):
    base = np.array(arr)
    index = np.where(base == val)
    if len(index[0]) == 0:
        sys.exit("Word you entered wasn't found in any document")
    return index[0][0]
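
`find_index` converts the vocabulary to an array and scans it on every call. A possible alternative, sketched here with an illustrative three-word vocabulary, is to build a word-to-index dictionary once and look words up in constant time. The `vocab`, `word_to_index` and `lookup` names are hypothetical; a fitted `CountVectorizer` also exposes such a mapping as its `vocabulary_` attribute.

```python
vocab = ["abc", "iran", "wall"]  # illustrative vocabulary

# Build the word -> column index mapping once.
word_to_index = {word: i for i, word in enumerate(vocab)}

def lookup(word):
    """Return the column index of `word`, or raise KeyError if it is unknown."""
    try:
        return word_to_index[word]
    except KeyError:
        raise KeyError(f"{word!r} was not found in the vocabulary") from None

print(lookup("iran"))  # 1
```

Raising an exception instead of calling `sys.exit` keeps the notebook kernel alive when a word is missing.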

## 8 Calculate number of documents containing specific word <a class="anchor" id="8"></a>

In [14]:
occurrence_count = {}  # number of documents containing a word

# calculate number of documents containing a word
for w in tqdm(vocabulary):
    i = find_index(vocabulary, w)
    occurrence_count[w] = np.sum(occurrence_matrix[:, i])
[x for i, x in enumerate(occurrence_count) if i < 30]

100%|██████████| 14436/14436 [01:55<00:00, 125.10it/s]


['aa',
 'aaa',
 'aaafivediamond',
 'aaanews',
 'aacrowellt',
 'aaron',
 'aaronmcallorum',
 'aaszkler',
 'ab',
 'abandon',
 'abandoned',
 'abbas',
 'abbott',
 'abbydnyc',
 'abc',
 'abcfamily',
 'abcinsc',
 'abcnews',
 'abcpolitics',
 'abcsharktank',
 'abcworldnews',
 'abdel',
 'abdul',
 'abdullah',
 'abe',
 'abedin',
 'abedini',
 'aberdeen',
 'aberdeencc',
 'aberdeenshire']
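
Since the occurrence matrix is binary, the per-word document counts are simply its column sums. A minimal sketch on a toy matrix (the `vocab` list, `X` and `counts_by_word` are illustrative names) computes all of them in one call instead of one lookup per word:

```python
import numpy as np

vocab = ["abc", "iran", "wall"]  # illustrative vocabulary
X = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
])

# Sum each column: the number of documents that contain each word.
doc_counts = X.sum(axis=0)
counts_by_word = {word: int(count) for word, count in zip(vocab, doc_counts)}

print(counts_by_word)  # {'abc': 3, 'iran': 2, 'wall': 2}
```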

## 9 Functions to produce Mutual Information <a class="anchor" id="9"></a>

In [15]:
# calculate probabilities
def p(w1, present1, w2=None, present2=None, p_co=None):

    p_w1 = (occurrence_count[w1] + 0.5) / (N + 1)

    # if it's singular probability
    if w2 is None:
        if present1:
            return p_w1
        else:
            return 1 - p_w1

    p_w2 = (occurrence_count[w2] + 0.5) / (N + 1)

    if present1 and present2:  # p(w1 = 1, w2 = 1)
        return p_co
    elif present1 and not present2:  # p(w1 = 1, w2 = 0)
        return p_w1 - p_co
    elif not present1 and present2:  # p(w1 = 0, w2 = 1)
        return p_w2 - p_co
    elif not present1 and not present2:  # p(w1 = 0, w2 = 0)
        return 1 - (p_co + (p_w1 - p_co) + (p_w2 - p_co))


def mi(w1, w2, p_co):
    summation = 0
    for u in [False, True]:
        for v in [False, True]:
            numerator = p(w1, u, w2, v, p_co)
            denominator = p(w1, u) * p(w2, v)
            summation += numerator * np.log2(numerator / denominator)

    return summation
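
Read as a formula, the two functions above compute the mutual information between the binary presence variables of two words. The notation below is introduced here for readability: $X_w \in \{0, 1\}$ indicates whether word $w$ occurs in a document, $c(w)$ is the number of documents containing $w$, and $N$ is the number of documents.

$$
I(X_{w_1}; X_{w_2}) = \sum_{u \in \{0,1\}} \sum_{v \in \{0,1\}} p(X_{w_1}=u, X_{w_2}=v) \, \log_2 \frac{p(X_{w_1}=u, X_{w_2}=v)}{p(X_{w_1}=u) \, p(X_{w_2}=v)}
$$

The marginal is smoothed as

$$
p(X_w = 1) = \frac{c(w) + 0.5}{N + 1}, \qquad p(X_w = 0) = 1 - p(X_w = 1),
$$

and the joint probability $p(X_{w_1}=1, X_{w_2}=1)$ is supplied by the caller as `p_co`. The remaining three joint terms follow from it:

$$
\begin{aligned}
p(X_{w_1}=1, X_{w_2}=0) &= p(X_{w_1}=1) - p(X_{w_1}=1, X_{w_2}=1) \\
p(X_{w_1}=0, X_{w_2}=1) &= p(X_{w_2}=1) - p(X_{w_1}=1, X_{w_2}=1) \\
p(X_{w_1}=0, X_{w_2}=0) &= 1 - p(X_{w_1}=1) - p(X_{w_2}=1) + p(X_{w_1}=1, X_{w_2}=1)
\end{aligned}
$$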

## 10 Calculate MI score for all pairs of words <a class="anchor" id="10"></a>

In [16]:
# matrix to save the MI score of each word pair; only co-occurring pairs are filled in below
mi_matrix = np.zeros(shape=(len(vocabulary), len(vocabulary)))
sorted_matrix = co_occurrence_matrix.sorted_indices()
cx = sorted_matrix.tocoo()
for i, j, v in tzip(cx.row, cx.col, cx.data):  # TQDM lib used for exec time estimation
    p_co = (v + 0.25) / (N + 1)
    mi_value = mi(vocabulary[i], vocabulary[j], p_co)
    mi_matrix[i, j] = mi_value
mi_matrix

  0%|          | 0/2543244 [00:00<?, ?it/s]

array([[0.0010755 , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.00392653, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.00075952, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.00255143, 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.00197699,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.0013826 ]])
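
To see how the score behaves, the following self-contained sketch recomputes the same smoothed mutual information directly from document counts. The function name `mutual_information` and all counts are hypothetical; the smoothing constants (0.5, 0.25 and `N + 1`) are the ones used in the cells above.

```python
import numpy as np

def mutual_information(n_docs, count_1, count_2, count_both):
    """MI in bits between two binary word-presence variables, with add-0.5 style smoothing."""
    p1 = (count_1 + 0.5) / (n_docs + 1)
    p2 = (count_2 + 0.5) / (n_docs + 1)
    p11 = (count_both + 0.25) / (n_docs + 1)
    # Joint distribution over (word 1 present?, word 2 present?).
    joint = {
        (1, 1): p11,
        (1, 0): p1 - p11,
        (0, 1): p2 - p11,
        (0, 0): 1 - p1 - p2 + p11,
    }
    marginal_1 = {1: p1, 0: 1 - p1}
    marginal_2 = {1: p2, 0: 1 - p2}
    return sum(
        p_uv * np.log2(p_uv / (marginal_1[u] * marginal_2[v]))
        for (u, v), p_uv in joint.items()
    )

# Hypothetical counts for a collection of 1000 documents.
often_together = mutual_information(1000, 100, 100, 90)   # words almost always co-occur
rarely_together = mutual_information(1000, 100, 100, 10)  # co-occur about as often as chance

print(often_together > rarely_together)  # True
```

With 100 occurrences each in 1000 documents, independent words would be expected to co-occur in about 10 documents, so the second score is close to zero while the first is much larger.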

## 11 Calculate Paradigmatic Relations <a class="anchor" id="11"></a>

In [17]:
sparse_mi = csr_matrix(mi_matrix)
sim_matrix = cosine_similarity(sparse_mi)  # calculate pair similarity

def paradigmatic(word):
    """Return (index, similarity) pairs for words whose cosine similarity to `word` exceeds 0.3."""
    idx = find_index(vocabulary, word)
    sim = sim_matrix[idx, :]
    # ind = np.argpartition(sim, -10)[-10:]  # alternative: indexes of the top 10 similarities
    ind = [j for j, val in enumerate(sim) if val > 0.3]
    return [(j, round(sim[j], 5)) for j in ind]
paradigmatic('iran')

[(297, 0.33998), (6427, 1.0), (7360, 0.31548)]
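
Two words count as paradigmatically related here when their rows of MI scores point in a similar direction, that is, when they tend to co-occur with the same other words. In the output above, the entry with similarity 1.0 is presumably the query word matching itself. The sketch below reproduces the idea on made-up vectors with plain NumPy and filters out the self-match; the `words`, `vectors` and `neighbours` names are illustrative.

```python
import numpy as np

words = ["iran", "iraq", "golf"]  # illustrative labels
# Hypothetical rows of MI scores: one context vector per word.
vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
])

# Cosine similarity: dot products of the length-normalized rows.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
similarity = unit @ unit.T

def neighbours(index, threshold=0.3):
    """Indices whose similarity to `index` exceeds the threshold, excluding the word itself."""
    hits = np.flatnonzero(similarity[index] > threshold)
    return [int(j) for j in hits if j != index]

print([words[j] for j in neighbours(0)])  # ['iraq']
```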

## 12 Save Output <a class="anchor" id="12"></a>
#### Make directory available

In [18]:
output_dir = Path('./Paradigmatic')
output_dir.mkdir(parents=True, exist_ok=True)

#### Print Nodes and Edges

In [19]:
def print_nodes2():
    with open( "./Paradigmatic/nodes.csv", "w", newline='') as csv_out:
        writer = csv.writer(csv_out, delimiter=',')
        fields = ["Id", "Label"]
        writer.writerow(fields)
        # the position of a word in `vocabulary` is its node Id
        for i, w in enumerate(tqdm(vocabulary)):
            writer.writerow([i, w])
print_nodes2()

100%|██████████| 14436/14436 [01:21<00:00, 176.81it/s]


In [20]:
def print_edges2():
    with open("./Paradigmatic/edges.csv", "w", newline='') as csv_out:
        writer = csv.writer(csv_out, delimiter=',')
        fields = ["Source", "Target", "Type", "Weight"]
        writer.writerow(fields)
        # the position of a word in `vocabulary` is its node Id
        for source, word in enumerate(tqdm(vocabulary)):
            rows = [[source, target, "Undirected", weight] for target, weight in paradigmatic(word)]
            writer.writerows(rows)
print_edges2()


100%|██████████| 14436/14436 [06:36<00:00, 36.39it/s]
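
Because the similarity matrix is symmetric and `paradigmatic` is called for every word, the loop above appears to write each qualifying pair twice (once per direction) and to give each word a self-edge whenever its own similarity of 1.0 passes the threshold. If a single row per undirected pair is preferred, one option is to read only the upper triangle of the matrix. This is a sketch on a hypothetical 3x3 matrix, with an in-memory buffer standing in for `edges.csv`; `edge_rows` is an illustrative helper, not part of the notebook.

```python
import csv
import io
import numpy as np

# Hypothetical symmetric similarity matrix for three words.
similarity = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.4],
    [0.1, 0.4, 1.0],
])

def edge_rows(sim, threshold=0.3):
    """Yield one row per unordered pair (i < j) whose similarity exceeds the threshold."""
    upper = np.triu(sim, k=1)  # zero out the diagonal and the lower triangle
    for i, j in zip(*np.nonzero(upper > threshold)):
        yield [int(i), int(j), "Undirected", round(float(sim[i, j]), 5)]

buffer = io.StringIO()  # stands in for the edges.csv file handle
writer = csv.writer(buffer)
writer.writerow(["Source", "Target", "Type", "Weight"])
writer.writerows(edge_rows(similarity))

print(buffer.getvalue())
```

For the toy matrix this yields two edges, 0-1 and 1-2, with no self-loops and no mirrored duplicates.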
