# Word List

Use this notebook to inspect the words that are extracted from the PDFs and used for vectorization, distance calculations, clustering, etc.

## Setup

In [1]:
from typing import Iterable

from config import text_file_info, word_vectors

import numpy as np
import pandas as pd

## Load vectors

In [2]:
files = (f[2] for f in text_file_info())
x, vectorizer = word_vectors(files)

In [4]:
terms = vectorizer.get_feature_names_out()

## Some tests

Some words must be in the list. If any of them is missing, we have messed up.

In [5]:
'atlas' in terms

True

In [6]:
'cms' in terms

True

In [7]:
'higgs' in terms

True

In [8]:
'mathusla' in terms

True

In [9]:
'llp' in terms

True

In [10]:
'lived' in terms

True
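The individual checks above can be collapsed into one cell that reports every missing word at once. This is a minimal sketch, assuming `terms` is any iterable of strings; the helper name `missing_terms` and the stand-in term list are hypothetical and are not part of the `config` module.

```python
from typing import Iterable, List


def missing_terms(required: Iterable[str], terms: Iterable[str]) -> List[str]:
    """Return the required words that are absent from terms, in input order."""
    known = set(terms)  # set lookup avoids rescanning the list for every word
    return [w for w in required if w not in known]


# Stand-in for vectorizer.get_feature_names_out(); illustrative values only.
example_terms = ["atlas", "cms", "higgs", "lived", "llp"]
required_words = ["atlas", "cms", "higgs", "mathusla", "llp", "lived"]

missing = missing_terms(required_words, example_terms)
print(missing)
```

In the notebook itself, an empty result from `missing_terms(required_words, terms)` would mean all the required words are present.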

## Just dump the full list out for our edification

In [11]:
', '.join(terms)

'aab, aachen, aad, aaij, aaron, aartsen, abada, abazajian, abbott, abe, abelian, abi, ability, able, abreu, abs, absence, absolute, absorber, absorption, abundance, academia, academic, academy, acc, accel, accelerate, accelerated, accelerating, acceleration, accelerator, accelerators, acceptance, accepted, access, accessed, accessible, acciarri, accommodate, accomplished, according, account, accretion, accuracy, accurate, accurately, achievable, achieve, achieved, achieving, acoustic, acquisition, acrylic, act, action, actions, active, actively, activities, activity, acts, actual, adam, adams, adapted, adaptive, add, added, adding, addition, additional, additionally, address, addressed, addressing, adequate, adhikari, admx, adopted, adrian, adv, advance, advanced, advancement, advances, advancing, advantage, advantages, affect, affected, age, agencies, ago, agostini, agreement, aguilar, ahmed, aim, aimed, aiming, aims, aip, air, ait, akerib, akimov, alabama, alamos, albert, alberto, al

Write out the above to a "nicely formatted file" so that it can be included in the website.

In [12]:
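def split_by_lines(words: Iterable[str], max_line_len: int = 80):
    """Yield space-joined lines, ending a line once it passes max_line_len."""
    line = ""
    for w in words:
        line = line + w
        if len(line) > max_line_len:
            yield line
            line = ""
        else:
            line = line + ' '
    # Emit the final partial line, which would otherwise be dropped.
    if line:
        yield line.rstrip()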

with open('../web/word_list.md', 'w') as p:
    with open('word_list_template.md', 'r') as tplt:
        p.write(tplt.read())
    for l in split_by_lines(terms):
        p.write(l + '\n')
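If the lines should never exceed the width, the standard library's `textwrap` module may be a simpler option than a hand-written splitter. This is a sketch under that assumption, not the notebook's method; the helper name `wrap_words` and the sample words are illustrative only.

```python
import textwrap
from typing import Iterable, List


def wrap_words(words: Iterable[str], width: int = 80) -> List[str]:
    """Join words with spaces and wrap them into lines of at most width characters."""
    text = " ".join(words)
    # Keep every word intact: never split long or hyphenated words.
    return textwrap.wrap(
        text,
        width=width,
        break_long_words=False,
        break_on_hyphens=False,
    )


# A few words taken from the start of the list above, for illustration.
example_words = ["aab", "aachen", "aad", "abelian", "ability", "absorber", "acceleration"]
lines = wrap_words(example_words, width=20)
for line in lines:
    print(line)
```

With `break_long_words=False`, a single word longer than `width` still gets a line of its own and exceeds the limit, so the bound holds only when every word fits.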

## Document Word Importance Ranking

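How important is each word overall? The ranking below sums each word's vector weights; it is not a count of the documents a word appears in.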

In [13]:
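# Sum the vector weights for each word.
# NOTE: axis=1 sums along each row, so this assumes x has one row per word.
# If word_vectors returns a documents x words matrix (the usual scikit-learn
# layout), the per-word totals would come from axis=0 instead.
total_importance = np.sum(x, axis=1)
d = [{
        'word': t,
        'importance': i[0,0]
     }
    for t, i in zip(terms, total_importance)
]
word_df = pd.DataFrame(d)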

In [14]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(word_df.sort_values('importance', ascending=False))

| | word | importance |
|---|---|---|
| 1430 | fluid | 18.124311 |
| 501 | case | 17.969338 |
| 1406 | finding | 17.957258 |
| 486 | carbon | 17.483814 |
| 972 | designed | 16.955399 |
| 1427 | flow | 16.623056 |
| 434 | brookhaven | 16.580091 |
| 623 | coil | 16.301829 |
| 685 | completion | 16.129828 |
| 1350 | facilitate | 16.072415 |


In [15]:
total_importance[0][0,0]

11.457497143448188
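A summed weight is not the same thing as the number of documents a word appears in. The following is a minimal sketch of a document-frequency count, assuming a dense documents x words NumPy array; the matrix values and the names `weights`, `doc_freq`, and `total_weight` are made up for illustration, and the matrix returned by `word_vectors` may be laid out differently.

```python
import numpy as np

# Toy documents x words weight matrix (3 documents, 4 words); values are made up.
weights = np.array([
    [0.5, 0.0, 0.2, 0.0],
    [0.0, 0.0, 0.7, 0.1],
    [0.3, 0.0, 0.4, 0.0],
])
words = ["atlas", "cms", "higgs", "llp"]

# Document frequency: how many rows (documents) have a nonzero weight in each column.
doc_freq = np.count_nonzero(weights, axis=0)

# Summed weight per word, shown alongside for comparison.
total_weight = weights.sum(axis=0)

for w, df, tw in zip(words, doc_freq, total_weight):
    print(f"{w}: documents={df}, total weight={tw:.2f}")
```

Putting both numbers side by side makes it easier to tell a word that is heavily weighted in a few documents from one that appears lightly in many.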