#### Pt 2 - Build an Inverted Index
This section builds the inverted index used by the word count Recommender System. It is recommended that you do not run the two `inverted_index(...)` calls below, as they take 15-20 minutes. The results have already been exported as two JSON files: 'assets/inverted_index_stem.json' for stemmed words and 'assets/inverted_index_lem.json' for lemmatized words.

In [19]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from tqdm import tqdm
import json

In [20]:
# load processed dataframe saved from load_data.ipynb
processed_df = pd.read_pickle('assets/processed_df.pkl')

In [21]:
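# Build an inverted index: word -> {document (NAICS code): number of times the word appears in that document}
def inverted_index(df, normalization):
    inv_index = dict()
    # Pair each NAICS code with its word sequence from the chosen column (e.g. 'stemmed' or 'lemmatized')
    for doc, words in tqdm(dict(zip(df['naics'],df[normalization])).items()):
        for word in words:
            # Note: words.count(word) rescans the whole document for every word it contains
            if word in inv_index:
                inv_index[word][doc] = words.count(word)
            else:
                inv_index[word] = {doc: words.count(word)}
    return inv_index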
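Because `words.count(word)` is called once per word, the work per document grows quadratically with its length, which is a likely contributor to the long run time. The sketch below is one possible alternative, not the code that produced the exported files: it counts each document once with `collections.Counter`. The function name `inverted_index_counter`, the `doc_col` parameter, and the toy data are hypothetical. It only indexes columns by name, so a DataFrame such as `processed_df` should work the same way as the plain dict used here; the `tqdm` progress bar is left out for brevity.

```python
from collections import Counter


def inverted_index_counter(df, normalization, doc_col="naics"):
    """Hypothetical faster variant: word -> {doc: count}, counting each document once."""
    inv_index = {}
    # dict(zip(...)) mirrors the original: a repeated document key keeps its last word list
    for doc, words in dict(zip(df[doc_col], df[normalization])).items():
        # Counter tallies every word in a single pass over the document
        for word, count in Counter(words).items():
            inv_index.setdefault(word, {})[doc] = count
    return inv_index


# Toy data (made up for illustration, not taken from processed_df)
toy_data = {
    "naics": [111, 222],
    "stemmed": [["farm", "crop", "farm"], ["crop", "retail"]],
}
toy_index = inverted_index_counter(toy_data, "stemmed")
print(toy_index)
# {'farm': {111: 2}, 'crop': {111: 1, 222: 1}, 'retail': {222: 1}}
```

For the same input, this variant is intended to return the same nested dictionary as `inverted_index`, so the JSON export cells would not need to change.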

In [22]:
# This code takes a very long time (15-20 minutes).
# Do not run it! Use the exported JSON files instead.

In [23]:
inverted_index_stem = inverted_index(processed_df, 'stemmed')

100%|██████████████████████████████████████████████████████████████████████████████| 1057/1057 [16:15<00:00,  1.08it/s]


In [None]:
inverted_index_lem = inverted_index(processed_df, 'lemmatized')

 34%|██████████████████████████▉                                                    | 360/1057 [07:17<06:51,  1.69it/s]

In [None]:
# export as json files so you do not have to run the code again.

In [None]:
with open(r'assets/inverted_index_stem.json', 'w') as f:
    json.dump(inverted_index_stem, f)

In [None]:
with open(r'assets/inverted_index_lem.json', 'w') as f:
    json.dump(inverted_index_lem, f)
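To reuse the exported files without rebuilding the index, they can be read back with `json.load`. The sketch below is a minimal example under assumptions: the helper name `load_inverted_index` is hypothetical, and the demo round-trips made-up data through a temporary file rather than the real assets. It also shows one thing to watch for: JSON object keys are always strings, so if the document keys were integers when the index was built, they come back as strings.

```python
import json
import os
import tempfile


def load_inverted_index(path):
    """Hypothetical helper: read an inverted index that was saved with json.dump."""
    with open(path, "r") as f:
        return json.load(f)


# Round-trip demo on made-up data written to a temporary file
toy_index = {"farm": {111: 2}, "crop": {111: 1, 222: 1}}
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_index.json")
    with open(toy_path, "w") as f:
        json.dump(toy_index, f)
    reloaded = load_inverted_index(toy_path)

# Integer document keys are written as JSON strings, so 111 comes back as "111"
print(reloaded["farm"])
# {'111': 2}
```

In the notebook, the equivalent call would be something like `inverted_index_stem = load_inverted_index('assets/inverted_index_stem.json')`, with document keys compared as strings afterwards.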