# PREPARE INFOMEDIA DATA

Prepares the Infomedia data: extracts a network of named entities from the original data.

*Authors: Snorre Ralund, Mathieu Jacomy, Anders Munk.*

## How to use
1. Edit the settings below
2. Run the cells one after another, and check that there are no issues

If your data is big and the process takes a long time, you may have to run the notebook over multiple sessions. As you will see in the code, there are a few checkpoints where the data is saved, and from which you can restart directly. If you do that, you still need to run the settings cell and section 1, with all the installs and imports.

**THIS IS A VERY LONG PROCESS.** Extracting data with DaCy requires computing power, and it will take a long time if you have many documents. The bottleneck is step 11. It also requires **a lot of disk space** (from gigabytes to hundreds of gigabytes, depending on your corpus size).

In [None]:
settings = {}

# Which file contains the raw data?
settings['source file'] = "Infomedia raw data SAMPLE.csv"

## 1. Install and load DaCy and other libraries

DaCy is the library we use to extract data from Danish text. It is roughly a Danish counterpart to SpaCy, on which it is built. You may have to restart the kernel after installing it. We used the large model, but if your machine is limited you can use a smaller one, trading some accuracy for speed. Please check DaCy's documentation in case of installation problems. Relevant resources:
* [DaCy on PyPi](https://pypi.org/project/dacy/)
* [DaCy on SpaCy Universe](https://spacy.io/universe/project/dacy)
* [DaCy's GitHub repository](https://github.com/centre-for-humanities-computing/DaCy)

In [None]:
! pip install dacy[all] --quiet
! pip install dacy[large] --quiet
print("Done.")

In [None]:
# Note: on one system we had the following error in the next cell:
# ContextualVersionConflict: (click 8.1.3 (/opt/conda/lib/python3.10/site-packages), Requirement.parse('click<8.1.0'), {'spacy'})
# If that happens, uncomment and run the line below:
# !pip install 'click<8.1.0' --force-reinstall

In [None]:
# Just a check of the models available
import dacy
print("List of models available:")
for model in dacy.models():
    print("- "+model)
print("\nDone.")

In [None]:
# Load the large model (may be long)
import dacy
nlp = dacy.load('large')
print("Done.")

We also need to install a few other things like [NLTK](https://www.nltk.org/).

In [None]:
! pip install nltk --quiet
! python -m nltk.downloader stopwords
! python -m nltk.downloader punkt
! pip install networkx --quiet
! pip install gensim -U --quiet
! pip install python-louvain --quiet
! pip install scikit-learn --quiet
print("Done.")

In [None]:
# Other imports
import pandas as pd
import pickle
import scipy.sparse as sp
import numpy as np
import networkx as nx
import tqdm
import nltk
import shutil
import os
import json
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter
from community import community_louvain

## 2. Clean the data

The source material needs some cleaning.
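
As a minimal sketch of the main fix (not the notebook's full cleaning function), the snippet below re-inserts the space that goes missing where two paragraphs were joined. The helper name `split_glued_sentences` and the sample sentence are made up for illustration, and the pattern is a simplified version of the `re_error` expression used in this section.

```python
import re

# Simplified pattern: a lowercase letter or quote, then sentence punctuation,
# immediately followed by an uppercase letter or a digit (no space in between).
GLUED = re.compile(r'([a-zæøå"][.?!])([A-ZÆØÅ0-9])')

def split_glued_sentences(text):
    """Insert a space between two sentences that were joined without one."""
    return GLUED.sub(r'\1 \2', text)

sample = "Det regnede hele dagen.Næste morgen skinnede solen."
fixed = split_glued_sentences(sample)
print(fixed)
```

The full function additionally masks URLs before splitting, so that dots inside links are not mistaken for sentence boundaries.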

In [None]:
# Source file
df = pd.read_csv(settings['source file'])
df

In [None]:
# Add year column
df['year'] = df.publishdate.apply(lambda x: int(x.split('-')[0]))
print("Done.")

In [None]:
# Create cleaning function
# to fix a text error from Infomedia: paragraphs joined with no space
import re
# Raw strings avoid invalid-escape warnings; re_html matches URLs
re_html = re.compile(r'(?:(?:https?://)?www\.[^ ]+)|(?:[^ ]+\.com[^ ]+)|(?:[^ ]+\.org[^ ]+)|(?:[^ ]+\.dk[^ ]+)')
re_error = re.compile(r'(?:(?:[a-zæøå"][\.?!])|(?:[A-Z]{2}[\.?!]))([A-ZÆØÅ0-9])')

def clean_text_infomedia_error(text):
    if type(text)!=str:
        return text,[]
    l = re_error.finditer(text)
    indices = [m.start(0) for m in l]
    if len(indices)==0:
        return text,[]
    
    bits = []
    last = 0
    for i in indices:
        loc = list(text[i:i+3])
        dot = 0
        for lo in ['.','!','?']:
            try:
                dot = list(loc).index(lo)
                break
            except:
                pass
        idx = i+dot+1
        bits.append(text[last:idx])
        last = idx
    bits.append(text[last:])
    new_text = ' '.join(bits)
    text = new_text
    # remove links    
    links = re_html.findall(text)
    for link in set(links):
        text = text.replace(link.strip('.'),'LINK')
    # make space between sources
    sources = re.findall('/[^/ ]{1,}/',text)
    for source in sources:
        text = text.replace(source,' %s '%source)
    return text,links



In [None]:
# Test cleaning function (useful for debug and maintenance)
text = df.bodytext.sample(1).iloc[0]
new_text,links = clean_text_infomedia_error(text)
print("Links found (URLs): "+str(new_text.count('LINK')))
print("Clean text: "+new_text)
print("\nDone.")

In [None]:
# Clean the data (with links removed)
Links = []
texts = []
import tqdm
for text in tqdm.tqdm(df.bodytext.values):
    text,links = clean_text_infomedia_error(text)
    texts.append(text)
    Links.append(links)
df['clean_text'] = texts
df['links'] = Links
print("Done.")

In [None]:
# Add heading to full text
full = []
for i,j in df[['heading','clean_text']].fillna('').values:
    full.append(' .\t'.join([i,j]))
df['text'] = full

In [None]:
# Add the links back in
import re
with_links = []
for text,links in df[['text','links']].values:
    if len(links)>0:
        for link in links:
            # Plain string replace, so that backslashes in a URL are not read as regex escapes
            text = text.replace('LINK',link,1)
    with_links.append(text)
df['full_text'] = with_links
print("Done.")

In [None]:
# Monitor what the data look like at this stage
df.head(5)

In [None]:
# Save the clean data
df.to_csv('prep 02 - infomedia cleaned data.csv', index=False)

## 3. Tokenize text

In [None]:
# Load the clean data (uncomment if you restart from here)
# df = pd.read_csv('prep 02 - infomedia cleaned data.csv')

In [None]:
# Tokenize text
texts = []
for duid,text in tqdm.tqdm(df[['duid','full_text']].values):
    texts.append((duid,nltk.word_tokenize(text)))
print("Done.")

In [None]:
# Save the tokenized text
pickle.dump(texts,open('prep 03 - infomedia tokenized texts.pkl','wb'))

## 4. Build document-term matrix
We need this for deduplication.
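
To give an idea of what goes into the matrix, here is a small sketch of the n-gram expansion on two toy documents. The function name `ngrams_up_to` and the sample tokens are illustrative only; the notebook's own version also caps the vocabulary size and drops rare terms.

```python
from collections import Counter

def ngrams_up_to(tokens, n=2):
    """Return the tokens plus all contiguous n-grams up to length n, joined by '_'."""
    grams = list(tokens)
    for size in range(2, n + 1):
        grams += ['_'.join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]
    return grams

doc_a = ['klimaet', 'er', 'vigtigt']
doc_b = ['klimaet', 'er', 'varmt']

# One bag of terms per document: these are the rows of the document-term matrix.
bag_a = Counter(ngrams_up_to(doc_a, n=2))
bag_b = Counter(ngrams_up_to(doc_b, n=2))

# Terms present in both documents: the columns that make them look similar.
shared = sorted(set(bag_a) & set(bag_b))
print(shared)
```

Including n-grams makes the comparison sensitive to word order, which helps tell true duplicates apart from articles that merely share vocabulary.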

In [None]:
# Load tokenized text (uncomment if you restart from here)
# texts = pickle.load(open('prep 03 - infomedia tokenized texts.pkl','rb'))

In [None]:
# Make Document Term Matrix
def get_ngram(doc,n=2):
    grams = doc.copy()

    for gram in range(2,n+1):
        grams+=['_'.join(doc[i:i+gram]) for i in range(len(doc)+1-gram)]
    return grams
def docs_to_dtm(docs,max_words=100000,min_count=5,ngram=3):
    c = Counter()
    bows = [] 
    for doc in docs:
        doc = get_ngram(doc,n=ngram)
        d = Counter(doc)
        c.update(d)
        bows.append(d)
    # make index
    index = [w for w,count in c.most_common(max_words) if count>min_count]
    w2i = {w:num for num,w in enumerate(index)} 
    
    # initialize matrix
    X = sp.dok_matrix((len(docs),len(index)), dtype=np.int32)
    for num in range(len(bows)):
        bow = bows[num]
        for w,count in bow.items():
            
            try:
                X[num,w2i[w]]=count
            except:
                pass
    X = X.tocsr()
    return index,X
%time index,dtm = docs_to_dtm([list(j) for i,j in texts])
print("\nDone.")

In [None]:
# Save the matrix
pickle.dump([dtm,index],open('prep 04 - dtm index.pkl','wb'))

## 5. Compute TFIDF

In [None]:
# Load document-term matrix (uncomment if you restart from here)
# dtm,index = pickle.load(open('prep 04 - dtm index.pkl','rb'))

In [None]:
#### transform to TFIDF
def dtm_tfidf(dtm):
    # Document frequency
    df = np.asarray(dtm.sign().sum(axis=0))[0,:]
    # Inverse document frequency
    idf =-np.log(df/dtm.shape[0])
    # Combined term frequence and inverse document frequency
    tfidf = dtm.multiply(idf)
    return tfidf
tfidf = dtm_tfidf(dtm)
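
The weighting above can be checked by hand on a toy corpus. This is a plain-Python sketch using the same formula as `dtm_tfidf` (raw count times `-log(df / N)`); the three documents are made up.

```python
import math
from collections import Counter

docs = [
    ['regeringen', 'klima', 'klima'],
    ['regeringen', 'skat'],
    ['regeringen', 'klima'],
]

n_docs = len(docs)
# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))
# Inverse document frequency: -log(df / N)
idf = {term: -math.log(df / n_docs) for term, df in doc_freq.items()}

# Weight the raw counts of each document by the IDF of each term
tfidf = [{term: count * idf[term] for term, count in Counter(doc).items()} for doc in docs]
print(tfidf[0])
```

A term that appears in every document ('regeringen' here) gets a weight of zero, so it no longer contributes to the similarity between documents.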

In [None]:
# Save TFIDF
pickle.dump(tfidf,open('prep 05 - tfidf.pkl','wb'))

## 6. Compute similarity information

In [None]:
# Load data (uncomment if you restart from here)
# dtm,index = pickle.load(open('prep 04 - dtm index.pkl','rb'))
# tfidf = pickle.load(open('prep 05 - tfidf.pkl','rb'))

In [None]:
# Compute similarity matrix on raw counts (it is saved further down)
%time doc2doc = cosine_similarity(dtm)

In [None]:
# Compute similarity matrix on TFIDF (it is saved further down)
%time doc2doc_tfidf = cosine_similarity(tfidf)

In [None]:
# set similarity to oneself to 0
n_docs = doc2doc.shape[0]
doc2doc[np.arange(n_docs),np.arange(n_docs)] = 0
doc2doc_tfidf[np.arange(n_docs),np.arange(n_docs)] = 0
# locate the distribution of closest matches
top,top2 = [],[]
# tfidf score of the best. 
top3 = []
top4 = []
match = []
for i in range(doc2doc.shape[0]):
    a = doc2doc[i]
    a2 = doc2doc_tfidf[i]   
    best = a.argsort()[-1]
    #match.append((a[best],best))
    match.append((a[best],a2[best],best))
    
    top.append(a[best])
    top4.append(a2[best])
    best = a2.argsort()[-1]
    top3.append(a[best])
    top2.append(a2[best])
    

In [None]:
# Save
pickle.dump(doc2doc,open('prep 06 - similarity matrix.pkl','wb'))
pickle.dump(doc2doc_tfidf,open('prep 06 - similarity matrix tfidf.pkl','wb'))

## 7. Identify duplicates

Duplicates form groups: if A is a duplicate of B and B is a duplicate of C, all three belong to the same group. We identify these groups and save the result.
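
The grouping can be illustrated without networkx. The sketch below uses made-up document ids and a hand-written component search; the notebook itself relies on `nx.connected_components` and `nx.density`.

```python
from collections import defaultdict

# Hypothetical duplicate pairs: A~B and B~C passed the thresholds, A~C did not.
pairs = [('A', 'B'), ('B', 'C'), ('D', 'E')]

neighbours = defaultdict(set)
for a, b in pairs:
    neighbours[a].add(b)
    neighbours[b].add(a)

def connected_components(neighbours):
    """Group nodes that are reachable from one another."""
    seen, groups = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(group)
    return groups

def density(group, pairs):
    """Share of possible pairs inside the group that are actual duplicate pairs."""
    n = len(group)
    edges = sum(1 for a, b in pairs if a in group and b in group)
    return edges / (n * (n - 1) / 2)

groups = connected_components(neighbours)
for group in groups:
    print(sorted(group), round(density(group, pairs), 2))
```

The first group has a density below 1 because A and C are only linked through B, which is the situation described in the comments of the listing cell further down.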

In [None]:
# Load data (uncomment if you restart from here)
# texts = pickle.load(open('prep 03 - infomedia tokenized texts.pkl','rb'))
# doc2doc = pickle.load(open('prep 06 - similarity matrix.pkl','rb'))
# doc2doc_tfidf = pickle.load(open('prep 06 - similarity matrix tfidf.pkl','rb'))

In [None]:
# Build similarity network
g = nx.Graph()
#thres = 0.965 # threshold for DTM cosine similarity (i.e. unweighted)
len_thres = 0.9 # if the shorter document is at most 90% of the length of the longer one,
                # the pair is not counted as a duplicate
#thres2 = 0.99 # 
tf_cut = 0.93
tfidf_cut = 0.91
for i in tqdm.tqdm(np.arange(doc2doc.shape[0])):
    a = doc2doc[i]
    n = sum(map(len,texts[i][1]))
    a2 = doc2doc_tfidf[i]
    idx = np.arange(len(a))[(a>=tf_cut)&(a2>=tfidf_cut)]
    for j in idx:
        n2 = sum(map(len,texts[j][1]))
        # Account for length difference.
        l = sorted([n,n2])
        diff = l[0]/l[1]
        if i == j or diff<=len_thres:
            continue
        
        g.add_edge(i,j)
print("Done.")
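
The rule applied to each pair of documents in the loop above can be written as a single predicate. This is only a restatement for readability: the function name `is_duplicate` is hypothetical, the default values are the thresholds used in the cell, and lengths are assumed to be positive.

```python
def is_duplicate(sim_counts, sim_tfidf, len_a, len_b,
                 tf_cut=0.93, tfidf_cut=0.91, len_thres=0.9):
    """True if both similarities pass their cut and the lengths are close enough."""
    if sim_counts < tf_cut or sim_tfidf < tfidf_cut:
        return False
    shorter, longer = sorted([len_a, len_b])
    # The shorter text must be more than 90% as long as the longer one
    return shorter / longer > len_thres

print(is_duplicate(0.97, 0.95, 1000, 980))   # similar and same length
print(is_duplicate(0.97, 0.95, 1000, 500))   # similar, but one is half as long
print(is_duplicate(0.97, 0.80, 1000, 1000))  # TFIDF similarity too low
```

Requiring both similarities plus the length check is deliberately strict, so that an excerpt of an article is not treated as a duplicate of the full text.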

In [None]:
# Get articles that do not have duplicates
nondupes = set(np.arange(len(texts)))-set(g)
print("Documents with duplicates (nodes):",len(g),"Similarities (edges):",len(g.edges()),"Non-duplicates:",len(nondupes))

In [None]:
# The components, in the network, are the groups of duplicated articles
comps = list(nx.connected_components(g))
len(comps)

In [None]:
# List articles with duplication information.

# What the columns mean:
# duid: article id
# duplicate: 0 if it has no duplicates or is the "original", 1 else. I.e.: 0=keep, 1=remove.
# n_dupes: how many articles in the group of duplicates
# original: which article we consider the reference (it's arbitrary, but the same for the whole group)
# density: how many pairs of articles are duplicates of each other (from 0=0% to 1=100%)

# Explanation: if A is dupe of B and B is dupe of C then we see A as dupe of C,
#              but the A to C connection might not be considered duplication
#              according to the thresholds we used.
#              So a density below 1 tells us that such a thing happened.
#              That is not a problem, though.

dat = [{'duid':texts[i][0],'duplicate':0,'n_dupes':0,'original':texts[i][0]} for i in nondupes]

for comp in comps:
    # keep the largest.
    best = max(comp,key=lambda x: sum(map(len,texts[x][1])))
    dupes = set(comp)
    dupes.remove(best)
    n_dupes = len(dupes) # keep how many duplicate were removed
    dens = nx.density(nx.subgraph(g,comp))
    dat.append({'duid':texts[best][0],'duplicate':0,'n_dupes':n_dupes,'original':texts[best][0],'density':dens})
    for i in dupes:
        dat.append({'duid':texts[i][0],'duplicate':1,'n_dupes':n_dupes,'original':texts[best][0],'density':dens})
    
ddf = pd.DataFrame(dat)
ddf

In [None]:
# Save
ids = [i[0] for i in texts]
ddf.index = ddf.duid
ddf = ddf.loc[ids]
ddf = ddf.reset_index(drop=True)
ddf.to_csv('prep 07 - infomedia duplicates.csv',index=False)

## 8. Remove duplicates

We produce a file with no duplicates. This does not mean that the deduplicated file is always the right file to use. For instance if you want to count the number of occurrences of an expression, you may want to take into account that some articles have been published multiple times.

We will need the deduped file, however. Indeed, as our process looks into co-occurrence, duplicated articles would create artifacts by inflating the co-occurrence of the expressions they contain. Therefore, to improve the data, we use the deduplicated data.
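
A toy illustration of that inflation, with made-up entity lists. The helper `cooccurrence` is a simplified stand-in for the counting done in step 13.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs):
    """Count, for each pair of entities, the number of documents mentioning both."""
    counts = Counter()
    for entities in docs:
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts

article = ['Folketinget', 'Novo Nordisk']
other = ['Folketinget', 'DR']

# The same article published three times, versus once after deduplication
with_dupes = cooccurrence([article, article, article, other])
deduped = cooccurrence([article, other])

pair = ('Folketinget', 'Novo Nordisk')
print(with_dupes[pair], deduped[pair])
```

With the duplicates left in, the pair looks three times stronger than it is, although it comes from a single piece of writing.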

In [None]:
# Load data (uncomment if you restart from here)
# df = pd.read_csv('prep 02 - infomedia cleaned data.csv')
# ddf = pd.read_csv('prep 07 - infomedia duplicates.csv')

In [None]:
# Remove duplicates
after_dup = df[ddf.duplicate==0]

In [None]:
# Save
after_dup.to_csv('prep 08 - infomedia deduplicated.csv', index=False)

## 9. Detect language
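
The cells in this section combine off-the-shelf language detectors with two simple signals: the share of tokens that are Danish stopwords, and the share of characters that are æ, ø or å. The sketch below computes both signals on two short sentences. The stopword set is a tiny stand-in and the function name `danish_signals` is made up; the notebook merges several much longer lists.

```python
# Tiny stand-in stopword list, for illustration only
DANISH_STOPWORDS = {'og', 'i', 'det', 'at', 'en', 'er', 'på', 'til', 'ikke', 'der'}

def danish_signals(tokens):
    """Return (stopword share of tokens, share of æ/ø/å among characters)."""
    stop_share = sum(1 for t in tokens if t in DANISH_STOPWORDS) / len(tokens)
    raw = ''.join(tokens)
    special_share = sum(raw.count(ch) for ch in 'æøå') / len(raw)
    return stop_share, special_share

danish = 'det er ikke til at se på'.split()
english = 'it is not possible to see it'.split()

print(danish_signals(danish))
print(danish_signals(english))
```

These signals only act as a fallback: a document that a detector labels as non-Danish is still kept as Danish when its stopword share is high enough.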

In [None]:
# Load data (uncomment if you restart from here)
# texts = pickle.load(open('prep 03 - infomedia tokenized texts.pkl','rb'))

In [None]:
# Check stop words
import requests
stopwords = set(nltk.corpus.stopwords.words('danish'))
urls = ['https://gist.githubusercontent.com/berteltorp/0cf8a0c7afea7f25ed754f24cfc2467b/raw/305d8e3930cc419e909d49d4b489c9773f75b2d6/stopord.txt',
       'https://raw.githubusercontent.com/stopwords-iso/stopwords-da/master/stopwords-da.txt']
stopwords1 = set()
for url in urls:
    stopwords1.update(set(requests.get(url).text.split()))
from spacy.lang.da.stop_words import STOP_WORDS

stop = STOP_WORDS|stopwords|stopwords1
print(len(stop),len(stopwords1),len(stopwords),len(STOP_WORDS))
special_tokens = 'æøå'
stop_score = []
for d_id,text in texts:
    c = Counter(text)
    raw = ''.join(text)
    count = 0
    for i in special_tokens:
        count+=raw.count(i)
    n = sum(c.values())
    match = sum([c[i] for i in stop])
    p = match/n
    stop_score.append((p,count/len(raw)))
    

In [None]:
stops,special = zip(*stop_score)
stops = np.array(stops)

In [None]:
# language detectors
! pip install langdetect
from langdetect import detect as detect
! pip install language-detector
from language_detector import detect_language as detect2

In [None]:
# Detect languages
dat = []
for _,text in tqdm.tqdm(texts):
    s = ' '.join(text)
    
    try:
        lan = detect(s)
        lan2 = detect2(s)
    except:
        dat.append({'duid':_})
        continue
    dat.append({'lan':lan,'lan2':lan2,'duid':_})

In [None]:
# Integrate data into dataframe
import pandas as pd
ldf = pd.DataFrame(dat)
ldf['special_chr'] = special
ldf['stopword_p'] = stops
def language_decision(row,stopthres=0.15,special_thres = 0.005):
    lan = row['lan']
    if lan=='da':
        return lan
    if row['stopword_p']>=stopthres:
        return 'da'
    if row['stopword_p']>=(stopthres-0.05) and row['special_chr']>special_thres:
        return 'da'
    return lan
ldf['Language'] = ldf.apply(language_decision,axis=1)
ldf = ldf[sorted(ldf.columns)]
ldf

In [None]:
# Save
ldf.to_csv('prep 09 - infomedia language detect.csv',index=False)

## 10. Remove non-Danish documents

In [None]:
# Load data (uncomment if you restart from here)
# after_dup = pd.read_csv('prep 08 - infomedia deduplicated.csv')
# ldf = pd.read_csv('prep 09 - infomedia language detect.csv')

In [None]:
out = set()
out.update(set(ldf[ldf.Language!='da'].duid))
after_dup = after_dup[~after_dup.duid.isin(out)]

In [None]:
# Save
after_dup.to_csv('prep 10 - infomedia DK deduplicated.csv', index=False)
# Note: the files with upper-case names are the final output files
after_dup.to_csv('INFOMEDIA DEDUPLICATED.csv', index=False)

## 11. Extract entities
**THIS STEP IS TIME CONSUMING and requires a lot of disk space.**

This step will create a folder named "nlp_docs" containing a lot of data. You can delete it once step 11 is done to save disk space.
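
Because the step is so long, the work is written to disk in batches, and the ids of finished documents are logged so that an interrupted run can be resumed. The sketch below shows that pattern on plain strings in a temporary folder. All names in it (`process_in_batches`, `done.txt`, the batch size of 3) are illustrative and not the notebook's own.

```python
import os
import tempfile

def process_in_batches(items, work, out_dir, done_path, batch_size=3):
    """Process items in batches; skip ids already listed in the done-file."""
    done = set()
    if os.path.exists(done_path):
        with open(done_path) as f:
            done = set(map(int, f.read().split()))
    batch = {}

    def flush():
        if not batch:
            return  # nothing to write: avoid creating or overwriting files
        last_id = max(batch)
        with open(os.path.join(out_dir, str(last_id)), 'w') as f:
            f.write('\n'.join(batch[i] for i in sorted(batch)))
        # Only mark ids as done once their batch is safely on disk
        with open(done_path, 'a') as f:
            f.write(''.join('%d ' % i for i in sorted(batch)))
        batch.clear()

    for num, item in enumerate(items):
        if num in done:
            continue
        batch[num] = work(item)
        if len(batch) >= batch_size:
            flush()
    flush()  # last, possibly incomplete batch

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, 'batches')
    os.mkdir(out_dir)
    done_path = os.path.join(tmp, 'done.txt')
    texts = ['a', 'b', 'c', 'd', 'e']
    process_in_batches(texts, str.upper, out_dir, done_path)
    # A second run finds every id in the done-file and does no new work
    process_in_batches(texts, str.upper, out_dir, done_path)
    batch_files = sorted(os.listdir(out_dir))
    with open(done_path) as f:
        done_ids = sorted(map(int, f.read().split()))
    print(batch_files, done_ids)
```

Note that the setup cell below wipes both the folder and the done-file, so skip it when you want to resume an interrupted run.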

In [None]:
# Load data (uncomment if you restart from here)
# after_dup = pd.read_csv('prep 10 - infomedia DK deduplicated.csv')

In [None]:
# Files and folder setup
if os.path.exists('nlp_docs') and os.path.isdir('nlp_docs'):
    shutil.rmtree('nlp_docs')
if os.path.exists('prep 11 - done_dacy') and os.path.isfile('prep 11 - done_dacy'):
    os.remove('prep 11 - done_dacy')
! mkdir nlp_docs
open('prep 11 - done_dacy','w').close()

In [None]:
# Recover, in case something happened
done = set(map(int,open('prep 11 - done_dacy','r').read().split()))
len(done)

In [None]:
# Parse documents with DaCy. THIS MIGHT TAKE A LONG TIME
from spacy.tokens import DocBin
import tqdm

def save_batch(doc_bin, batch_ids, last_num, fdone):
    # Write one batch of parsed documents, then record their ids as done
    with open('nlp_docs/%d'%last_num,'wb') as f:
        f.write(doc_bin.to_bytes())
    for i in batch_ids:
        fdone.write('%d '%i)
    fdone.flush()

done = set(map(int,open('prep 11 - done_dacy','r').read().split()))
fdone = open('prep 11 - done_dacy','a')
temp_done = []
doc_bin = DocBin(store_user_data=True)
for num,doc in tqdm.tqdm(list(enumerate(after_dup.full_text.fillna('').values))):
    if num in done:
        continue
    doc_bin.add(nlp(doc))
    temp_done.append(num)
    if len(doc_bin)>=500:
        save_batch(doc_bin,temp_done,num,fdone)
        done.update(temp_done)
        doc_bin = DocBin(store_user_data=True)
        temp_done = []
# Save the last, incomplete batch. It is skipped when empty, so that it
# cannot overwrite the previous batch file with an empty one.
if len(doc_bin)>0:
    save_batch(doc_bin,temp_done,temp_done[-1],fdone)
    done.update(temp_done)
fdone.close()
print("Done.")
        

In [None]:
files = ['nlp_docs/'+i for i in sorted(os.listdir('nlp_docs/'),key=lambda x: int(x))]
entities = []
for filename in tqdm.tqdm(files):
    print(filename)
    with open(filename,'rb') as f:
        bytes_data = f.read()
        doc_bin = DocBin().from_bytes(bytes_data)
        parsed = list(doc_bin.get_docs(nlp.vocab))
        #docs+=parsed
        
    for doc in parsed:
        ents = []
        for ent in doc.ents:
            ents.append((ent.text,ent.label_))
        entities.append(ents)

In [None]:
# Save
json.dump(entities, open('prep 11 - dacy entities.js','w'))

## 12. Gather and clean named entities
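
As a quick illustration of what this step does, the sketch below applies the same label filter and the same `post_clean` function as the cell below to three made-up entities. The labels `PER` and `ORG` are used here only as examples of labels that are kept.

```python
DROPPED_LABELS = {'CARDINAL', 'DATE', 'TIME', 'MONEY', 'PERCENT', 'ORDINAL'}

def post_clean(ent):
    """Trim quotes and stray punctuation around an entity string."""
    ent = ent.strip('"').split('"')[0]
    return ent.rstrip('-').rstrip('.').lstrip(',.-').strip('"')

# Made-up (text, label) pairs in the shape produced by step 11
raw = [('"Mette Frederiksen".', 'PER'), (',Novo Nordisk-', 'ORG'), ('2019', 'DATE')]

cleaned = [(post_clean(text), label) for text, label in raw if label not in DROPPED_LABELS]
print(cleaned)
```

Numeric and temporal labels are dropped because they rarely make meaningful nodes in a network of named entities.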


In [None]:
# Load data (uncomment if you restart from here)
# after_dup = pd.read_csv('prep 10 - infomedia DK deduplicated.csv')
# entities = json.load(open('prep 11 - dacy entities.js','r'))

In [None]:
out = set(['CARDINAL','DATE','TIME','MONEY','PERCENT','ORDINAL'])
entities = [[(i,j) for i,j in doc if not j in out] for doc in entities]
def post_clean(ent):
    ent = ent.strip('"').split('"')[0]
    return ent.rstrip('-').rstrip('.').lstrip(',.-').strip('"')
entities2 = [[post_clean(i) for i,j in ents] for ents in entities]

In [None]:
json.dump(entities,  open('prep 12 - entities.js','w'))
json.dump(entities2, open('prep 12 - entities postcleaned.js','w'))

## 13. Extract named entities and build network

In [None]:
# Load data (uncomment if you restart from here)
# after_dup = pd.read_csv('prep 10 - infomedia DK deduplicated.csv')
# entities = json.load(open('prep 12 - entities.js','r'))

In [None]:
# Collect
from collections import Counter
def post_clean(ent):
    ent = ent.strip('"').split('"')[0]
    return ent.rstrip('-').rstrip('.').lstrip(',.-').strip('"')
c = Counter()
types = Counter()
c2 = Counter()
for ents in entities:
    for e,typ in ents:
        e = post_clean(e)
        if len(e)==0:
            continue
        types[typ]+=1
        c[e]+=1
        c2[e.lower()]+=1

In [None]:
# Check the types (monitoring)
types

In [None]:
# Remove duplicates from different spellings and lowercasing
g = nx.Graph()
for e in c:
    j = e.lower()
    
    if e[0].isupper():
        if c2[j]>c[e]:
            g.add_edge(j,e)

e2e = {}
for e in c:
    e2 = e.lower()
    if g.has_node(e2):
        ent = list(g[e2].keys())[0]
        e2e[e]=ent
def resolve_ent(e):
    if e in e2e:
        return e2e[e]
    return e
print(len(g),len(e2e))
del c,c2

In [None]:
# More cleaning
def post_clean(ent):
    ent = ent.strip('"').split('"')[0]
    return ent.rstrip('-').rstrip('.').lstrip(',.-').strip('"')
c = Counter()
for ents in tqdm.tqdm(entities):
    for e,typ in ents:
        e = post_clean(e)
        if len(e)==0:
            continue
        e = resolve_ent(e)
        c[e]+=1

In [None]:
# Threshold entities (keep)
cut = 10
keep = set([i for i in c if c[i]>cut])
print("Keep:", len(keep))

In [None]:
# Check matches and missings
ent_docs = []
matches = 0
missing = 0
for i in tqdm.tqdm(range(len(after_dup))):
    text = '%s'%(after_dup.iloc[i].text)
    ents = entities[i]
    ents = set([e for e,typ in ents])
    e_formats = set()
    for e in sorted(ents,key=lambda x: len(x),reverse=True):
        e_r = post_clean(e)
        if len(e_r)==0:
            continue
        e_r = resolve_ent(e_r)
        e_format = e_r.replace(' ','_')
        ci = text.count(e)
        matches+=ci
        e_formats.add(e_format)
#        if text.count(e)<1:
#            print(e,'missing')
#            print(1+'2')
#            break
        text = text.replace(e,e_format)
    doc = nltk.word_tokenize(text)
    doc = [post_clean(i).lower().strip('"') if not i in e_formats else post_clean(i) for i in doc]
    missing+=len(set(e_formats)-(set(doc)))
    ent_docs.append(doc)
print("Missing:",missing)
print("Matches:", matches)

In [None]:
# Count word occurrences
c_e = Counter()
for doc in ent_docs:
    for w in doc:
        c_e[w]+=1

In [None]:
ent_docs2 = []
for ents in tqdm.tqdm(entities):
    temp = []
    for e,typ in ents:
        e = post_clean(e)
        if len(e)==0:
            continue
        e = resolve_ent(e)
        temp.append(e)
    ent_docs2.append(temp)

In [None]:
all_docs = ent_docs+ent_docs2
import random
random.shuffle(all_docs)

In [None]:
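# Train and save word2vec entity embeddings
# Note: run_w2vec is not installed by the cells above; it is expected to be a local
# helper module that must be importable (e.g. run_w2vec.py next to this notebook)
from importlib import reload
import run_w2vec as W2V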
W2V = reload(W2V)
ent2v = W2V.run_w2vec(all_docs,phrases=False,emb_size=128)
import pickle
pickle.dump(ent2v, open('prep 13 - word2vec entities.pkl', 'wb'))

In [None]:
e2typ = {e:Counter() for e in keep}
edges = []
for ents in tqdm.tqdm(entities):
    for e,typ in ents:
        e = post_clean(e)
        if len(e)==0:
            continue
        e = resolve_ent(e)
        if not e in keep:
            continue
        e2typ[e][typ]+=1
    ents = [resolve_ent(post_clean(i)) for i,_ in ents]
    ents = [i for i in ents if len(i)>0]
    ents2 = []
    seen = set()
    for e in ents:
        if not e in seen:
            ents2.append(e)
            seen.add(e) # without this, repeated entities are never filtered out
    ents = [post_clean(i) for i in ents2 if i in keep]
    for i in range(len(ents)-1):
        n = ents[i]
        for j in range(i+1,len(ents)):
            n2 = ents[j]
            edges.append(tuple(sorted([n,n2])))
        

In [None]:
e2typ = {e:e2typ[e].most_common(1)[0][0] for e in e2typ}

In [None]:
n_docs = len(entities)
from collections import Counter
edge_c = Counter(edges)
pmis = {}
import numpy as np
alpha = 5 # smoothing term
for edge,count in edge_c.items():
    n,n2 = edge
    p = (c[n]+alpha)/n_docs
    p2 = (c[n2]+alpha)/n_docs
    m = count/n_docs
    pmis[edge] = m/(p*p2)

pmis = Counter(pmis)

In [None]:
# W2vec distance
import tqdm
edge2dist = {}
error= 0
for n,n2 in tqdm.tqdm(pmis):
    try:
        dist = ent2v.wv.distance(n,n2)
    except:
        error+=1
        dist = np.nan
    edge2dist[tuple(sorted([n,n2]))] = dist
print(error)

In [None]:
g = nx.Graph()
topn = 100000
for edge,pmi in pmis.most_common(topn):
    n,n2 = edge
    dist = edge2dist[edge]
    count = edge_c[edge]
    t,t2 = e2typ[n],e2typ[n2]
    g.add_node(n,**{'label':t,'n_docs':c[n]})
    g.add_node(n2,**{'label':t2,'n_docs':c[n2]})
    g.add_edge(n,n2,**{'w2vec_dist':dist,'pmi':pmi,'count':count})
                    

In [None]:
part = community_louvain.best_partition(g)

In [None]:
# calculate community degree to weigh labels
com2n = {p:[] for p in part.values()}
for n,p in part.items():
    g.nodes[n]['community'] = str(p)
    com2n[p].append(n)

In [None]:
for p,nodes in com2n.items():
    degs = np.array([len(g[n]) for n in nodes])
    ma = max(degs)
    m = np.mean(degs)
    degs_sqrt = np.sqrt(degs)
    rel_deg = degs/ma
    for n,rel_d in zip(nodes,rel_deg):
        g.nodes[n]['relative_degree'] = rel_d


In [None]:
# Save network
nx.write_graphml(g,'INFOMEDIA NER PMI NETWORK.graphml')

In [None]:
# Save entities
keep_entities = set(g)
pickle.dump(keep_entities, open('prep 13 - final_entities.pkl','wb'))

In [None]:
dat = []
for num,ents in enumerate(entities):
    seen = set()
    doc = after_dup.iloc[num].duid
    for e,typ in ents:
        e = post_clean(e)
        e = resolve_ent(e)
        if not e in seen:
            d = {'doc_id':doc,'entity':e,'type':typ,'in_network':e in keep_entities}
            dat.append(d)
            seen.add(e)


In [None]:
ent_df = pd.DataFrame(dat)
ent_df.to_csv('INFOMEDIA ENTITIES.csv', index=False)

## That's it
If everything worked up to this point, you're done, congratulations!

The output files you probably want to look at are the files in upper case starting with "INFOMEDIA". Those are:
* **INFOMEDIA DEDUPLICATED.csv** (the deduplicated list of articles, to ingest in Elasticsearch)
* **INFOMEDIA ENTITIES.csv** (the list of entities, where they appear, and whether or not we kept them in the network)
* **INFOMEDIA NER PMI NETWORK.graphml** (the network of entities connected by co-occurrence, weighted by positive PMI score)