# Word List

Use this notebook to inspect the words that are extracted from the PDF text and used to build the vectors, distances, clusters, and so on.

## Setup

In [1]:
from typing import Iterable

from config import text_file_info, word_vectors

import numpy as np
import pandas as pd

## Load vectors

In [2]:
files = (f[2] for f in text_file_info())
x, vectorizer = word_vectors(files)

In [3]:
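# Note: scikit-learn 1.2 removed get_feature_names() in favor of get_feature_names_out();
# this matters only if `vectorizer` is a scikit-learn vectorizer.
terms = vectorizer.get_feature_names()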

## Some tests

Some words must appear in the list. If any of them is missing, something has gone wrong upstream.

In [4]:
'atlas' in terms

True

In [5]:
'cms' in terms

True

In [6]:
'higgs' in terms

True

In [7]:
'mathusla' in terms

True

In [8]:
'llp' in terms

True

In [9]:
'lived' in terms

True
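The individual checks above can also be folded into a single cell that reports every missing word at once. This is a minimal sketch: `required_words` and `find_missing` are hypothetical names, not part of the notebook's `config` module, and the short `terms` list is only a stand-in for the real vocabulary.

```python
# Stand-in vocabulary; in the notebook this would be `terms` from the vectorizer.
terms = ['atlas', 'cms', 'higgs', 'lived', 'llp', 'mathusla']

# Words that must survive extraction and vectorization (same six as the cells above).
required_words = ['atlas', 'cms', 'higgs', 'mathusla', 'llp', 'lived']


def find_missing(required, vocabulary):
    """Return the required words that are absent from the vocabulary, in order."""
    vocab = set(vocabulary)  # set lookup avoids rescanning the list for each word
    return [w for w in required if w not in vocab]


missing = find_missing(required_words, terms)
assert not missing, f"Missing expected words: {missing}"
```

A failing assertion names the missing words directly, which is easier to scan than a column of `True`/`False` outputs.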

## Just dump the full list out for our edification

In [10]:
', '.join(terms)

'aab, aachen, aad, aaij, aaron, aartsen, abada, abazajian, abbott, abe, abelian, abi, ability, able, abreu, abs, absence, absolute, absorber, absorption, abundance, academia, academic, academy, acc, accel, accelerate, accelerated, accelerating, acceleration, accelerator, accelerators, acceptance, accepted, access, accessed, accessible, acciarri, accommodate, accomplished, according, account, accretion, accuracy, accurate, accurately, achievable, achieve, achieved, achieving, acoustic, acquisition, acrylic, act, action, actions, active, actively, activities, activity, acts, actual, adam, adams, adapted, adaptive, add, added, adding, addition, additional, additionally, address, addressed, addressing, adequate, adhikari, admx, adopted, adrian, adv, advance, advanced, advances, advancing, advantage, advantages, affect, affected, age, agencies, ago, agostini, agreement, aguilar, ahmed, aim, aimed, aiming, aims, aip, air, ait, akerib, akimov, alabama, alamos, albert, alberto, alessandro, ale

Write out the above to a "nicely formatted file" so that it can be included in the website.

In [11]:
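def split_by_lines(words: Iterable[str], max_line_len=80):
    """Yield lines of space-separated words, breaking once a line exceeds max_line_len."""
    line = ""
    for w in words:
        line = line + w
        if len(line) > max_line_len:
            yield line
            line = ""
        else:
            line = line + ' '
    # Emit the final partial line, which would otherwise be silently dropped
    if line:
        yield line.rstrip()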

with open('../web/word_list.md', 'w') as p:
    with open('word_list_template.md', 'r') as tplt:
        p.write(tplt.read())
    for l in split_by_lines(terms):
        p.write(l + '\n')
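As an alternative to a hand-written line splitter, the standard library's `textwrap.wrap` does the same job and keeps every line within the requested width. The sketch below is not what the notebook uses. The `words` list is a stand-in for `terms`, and the width of 40 is chosen only to make the wrapping visible.

```python
import textwrap

# Stand-in word list; in the notebook this would be `terms`.
words = ['aab', 'aachen', 'aad', 'aaij', 'aaron', 'aartsen', 'abada',
         'abazajian', 'abbott', 'abe', 'abelian', 'abi', 'ability', 'able']

# With both break options off, wrap() breaks only at whitespace, so no word is
# ever split. A line can exceed `width` only if a single word is longer than it.
lines = textwrap.wrap(' '.join(words), width=40,
                      break_long_words=False, break_on_hyphens=False)

for line in lines:
    print(line)
```

Unlike a splitter that yields only after a line has grown past the limit, this keeps lines at or under the width and never loses the last partial line.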

## Document Word Importance Ranking

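Rank words by their total weight. The cell below sums vector weights, which is not the same as counting the documents each word appears in. Note that `axis=1` sums across each row of `x`, so the result is per word only if the rows of `x` correspond to terms.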

In [12]:
total_importance = np.sum(x, axis=1)
d = [{
        'word': t,
        'importance': i[0,0]
     }
    for t, i in zip(terms, total_importance)
]
word_df = pd.DataFrame(d)

In [13]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(word_df.sort_values('importance', ascending=False))

| | word | importance |
|---|---|---|
| 1417 | flavor | 18.164105 |
| 1393 | fiducial | 17.982017 |
| 499 | cascades | 17.914882 |
| 484 | caputo | 17.586681 |
| 432 | brook | 16.64441 |
| 1414 | fixed | 16.636533 |
| 612 | cmb | 16.336574 |
| 1338 | extract | 16.179184 |
| 674 | comparisons | 16.178682 |
| 1416 | flat | 16.122574 |


In [14]:
total_importance[0][0,0]

11.461959407481958
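The choice of axis decides what the totals mean, so it is worth checking on a tiny example. The sketch below assumes a matrix with one row per document and one column per term. The notebook does not state the orientation of `x`, so this layout, the dense NumPy array, and the numbers are all made up for illustration. With a SciPy sparse matrix the sums may come back as 2-D objects, which is why the notebook indexes with `[0,0]`.

```python
import numpy as np

# Hypothetical 3-document x 4-term weight matrix (rows: documents, columns: terms).
terms = ['atlas', 'cms', 'higgs', 'llp']
x = np.array([
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.25, 0.25, 0.0, 0.5],
])

per_term_weight = x.sum(axis=0)   # one total per term (column)
per_doc_weight = x.sum(axis=1)    # one total per document (row)
doc_count = (x > 0).sum(axis=0)   # number of documents containing each term

for term, weight, count in zip(terms, per_term_weight, doc_count):
    print(f"{term}: total weight {weight:.2f}, appears in {count} document(s)")
```

Under this layout, `axis=0` yields a value per term and `axis=1` a value per document. Zipping the term list against per-document totals would pair each word with an unrelated document's sum, so confirming `x.shape` against `len(terms)` is a cheap sanity check.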