## Pt 4 - Our first Recommender System - Bag Of Words
With the data finallly prepared, I began exploring different methods for building a Recommender system, starting with the Bag of Words approach. When looking at the dataset, I noticed that each class code had several indicative words that might help determine which class it belongs to, so that is why I chose to test out this approach. However, I know that there are a lot of filler words that are difficult to parse out, as well as semantic terms like "musical instrument store", rather than just "musical", "instrument", and "store."

#### Load in and process data from previous steps

In [106]:
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk import word_tokenize
import json
import pandas as pd
import re
stop_words = set(stopwords.words("english"))

PATTERN_S = re.compile("\'s")  # matches `'s` from text  
PATTERN_RN = re.compile("\\r\\n\\b") #matches `\r` and `\n`
PATTERN_PUNC = re.compile(r"[^\w\s]") # matches all non 0-9 A-z whitespace 

def clean_text(text):
    """
    Series of cleaning. String to lower case, remove non words characters and numbers (punctuation, curly brackets etc).
        text (str): input text
    return (str): modified initial text
    """
    text = text.lower()  # lowercase text
    # replace the matched string with ' '
    text = re.sub(PATTERN_S, ' ', text)
    text = re.sub(PATTERN_RN, ' ', text)
    text = re.sub(PATTERN_PUNC, ' ', text)
    return text

def tokenizer(description, stop_words, normalization):
    
    if normalization == 'lemmatize':
        # tokenize and lemmatize text
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(w) for w in word_tokenize(description)]
        
    elif normalization == 'stem':
        # tokenize and stem text
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(w) for w in word_tokenize(description)]
    
   # remove tokens length of 2 or below and make all lowercase and remove stop words
    tokens = [w.lower() for w in tokens if (w.lower() not in stop_words) and (len(w) > 2) and(w.isalpha())]
    
    return tokens    
    
def process_query(query, normalization):
    
    stop_words = set(stopwords.words("english"))
    
    return tokenizer(clean_text(query), stop_words, normalization)

In [113]:
# this function returns the top n_rank docs based on word count from the inverted index
def retrieve_n_rank_docs(inverted_index, query, normalization, max_docs=-1):
    ret_docs = {}
    
    counts = {}
    query = process_query(query, normalization)
    
    for word in query:
        try:
            docs = inverted_index.get(word)
            for k, v in docs.items():
                if k in counts:
                    counts[k] += v
                else:
                    counts[k] = v

        except:
            pass
        break
    counts = sorted(counts.items(), key=lambda x: (x[1], -int(x[0][1:])), reverse=True)
    if max_docs > -1:
        ret_docs[' '.join(query)] = [x[0] for x in counts][:max_docs]
    else:
        ret_docs[' '.join(query)] = [x[0] for x in counts]
        
    return ret_docs

In [114]:
# IMPORT INVERTED_INDEXES
with open(r'assets/inverted_index_stem.json') as f:
    inverted_index_stem = json.load(f)
    
with open(r'assets/inverted_index_lem.json') as f:
    inverted_index_lem = json.load(f)

The [6-digit_2017_Codes.xlsx](https://www.census.gov/naics/2017NAICS/6-digit_2017_Codes.xlsx) file provides a NAICS code with the official "Title" of that NAICS code, rather than the description of the NAICS codes we collected before. We will use this file to neatly display class codes/titles so we know what industry we're looking at when exploring the data.

In [115]:
naics_titles = pd.read_excel('assets/6-digit_2017_Codes.xlsx')
naics_titles['naics'] = naics_titles['naics'].astype(str)

In [116]:
naics_titles.head()

Unnamed: 0,naics,title,Unnamed: 2
0,111110,Soybean Farming,
1,111120,Oilseed (except Soybean) Farming,
2,111130,Dry Pea and Bean Farming,
3,111140,Wheat Farming,
4,111150,Corn Farming,


### Text Normalization
There are two popular methods of normalizing text - lemmatization and stemming. Lemmatization takes the morphological base of a word. For example, studies and "studying", would be lemmatized to "study". Stemming takes the root of the word, for example, "studies" would become "studi" and "studying" would become "study". There are certain cases where one is more useful than the other, so I decided to try them both out. They produced very similar results, with lemmatizing getting a slightly higher score for Mean Average Precision. 

In [117]:
# Here we display the results of the query "Home improvement store" using stemming on a BoW model
stem_df = pd.DataFrame(retrieve_n_rank_docs(inverted_index_lem, 'Home improvement store', 'stem'))
stem_df.columns.values[0] = 'naics'
stem_df = stem_df.merge(naics_titles, on='naics', how='outer')
stem_df[['naics', 'title']].head(10)

Unnamed: 0,naics,title
0,321999,All Other Miscellaneous Wood Product Manufactu...
1,453998,All Other Miscellaneous Store Retailers (excep...
2,423220,Home Furnishing Merchant Wholesalers
3,454390,Other Direct Selling Establishments
4,333111,Farm Machinery and Equipment Manufacturing
5,321920,Wood Container and Pallet Manufacturing
6,423390,Other Construction Material Merchant Wholesalers
7,623990,Other Residential Care Facilities
8,236117,New Housing For-Sale Builders
9,442299,All Other Home Furnishings Stores


In [118]:
# Here we display the results of the query "musical instrument store" using stemming on a BoW model
stem_df = pd.DataFrame(retrieve_n_rank_docs(inverted_index_lem, 'musical instrument store', 'stem'))
stem_df.columns.values[0] = 'naics'
stem_df = stem_df.merge(naics_titles, on='naics', how='outer')
stem_df[['naics', 'title']].head(10)

Unnamed: 0,naics,title
0,453998,All Other Miscellaneous Store Retailers (excep...
1,511199,All Other Publishers
2,611610,Fine Arts Schools
3,512230,Music Publishers
4,511120,Periodical Publishers
5,511130,Book Publishers
6,621340,"Offices of Physical, Occupational and Speech T..."
7,453310,Used Merchandise Stores
8,512290,Other Sound Recording Industries
9,511140,Directory and Mailing List Publishers


At a high-level glance, these results don't seem too relevant to our queries. There are a few that seem relevant, but should probably be higher up on the list. Let's give lemmatization a try!

In [131]:
# Here we display the results of the query "musical instrument store" using lemmatize on a BoW model
lem_df = pd.DataFrame(retrieve_n_rank_docs(inverted_index_lem, 'Home improvement store', 'lemmatize'))
lem_df.columns.values[0] = 'naics'
lem_df = lem_df.merge(naics_titles, on='naics', how='outer')
lem_df[['naics', 'title']].head(10)

Unnamed: 0,naics,title
0,321999,All Other Miscellaneous Wood Product Manufactu...
1,453998,All Other Miscellaneous Store Retailers (excep...
2,423220,Home Furnishing Merchant Wholesalers
3,454390,Other Direct Selling Establishments
4,333111,Farm Machinery and Equipment Manufacturing
5,321920,Wood Container and Pallet Manufacturing
6,423390,Other Construction Material Merchant Wholesalers
7,623990,Other Residential Care Facilities
8,236117,New Housing For-Sale Builders
9,442299,All Other Home Furnishings Stores


In [120]:
lem_df = pd.DataFrame(retrieve_n_rank_docs(inverted_index_lem, 'musical instrument store', 'lemmatize'))
lem_df.columns.values[0] = 'naics'
lem_df = lem_df.merge(naics_titles, on='naics', how='outer')
lem_df[['naics', 'title']].head(10)

Unnamed: 0,naics,title
0,711510,"Independent Artists, Writers, and Performers"
1,711130,Musical Groups and Artists
2,711310,"Promoters of Performing Arts, Sports, and Simi..."
3,711320,"Promoters of Performing Arts, Sports, and Simi..."
4,711110,Theater Companies and Dinner Theaters
5,711219,Other Spectator Sports
6,339992,Musical Instrument Manufacturing
7,711410,"Agents and Managers for Artists, Athletes, Ent..."
8,711211,Sports Teams and Clubs
9,339999,All Other Miscellaneous Manufacturing


There wasn't really any difference for "Home improvement store" with lemmatizing, but it looks like there are some new results for "musical instrument store". This makes intuitive sense, as lemmatization is keeping the word "musical" rather than just taking the stem "music".
The results returned for the stemming example show industries that are broadly related to "music," however, the lemmatized example shows industries that are broadly related to things that are "musical". This clearly demonstrates the significance of lemmatizing vs. stemming words in NLP.

#### Testing
Finally, we will test different metrics of our recommender system to see how it performs across queries. We will measure Precision and Recall at N, Mean Average Precision at N (mAP@N), and Normalized Discounted Cumulative Gain (NDCG).

The maximum documents returned will be 10, due to the nature of how this recommender system should work. I thought for awhile about increasing the number of documents, but from a functional perspective, returning more than 10 documents to a user defeats the purpose of the Recommender system. You might as well just return all documents to the user, as they are only likely to read through the first few results anyway.

I also experimented with different values for N, but I ultimately concluded that 10 is the ideal number for N, to match the max_docs returned. In the case of this recommender system, N = 10 works because order is not of utmost importance. As long as the retrieved documents contain relevant documents, then the "goal" of this recommender system is complete. There is certainly room for improvemenent in that area, but as I said earlier, there is not always a single best matched NAICS code to a query, and therefore the cutoff should be measured within the range of the max docs returned.

In [122]:
# load the relevance judgments from Pt 3
relevance_judgments = pd.read_pickle('assets/relevant_naics_df.pkl')
relevance_judgments = dict(zip(relevance_judgments['query'], relevance_judgments['relevant_naics']))

In [123]:
relevance_judgments

{'Home improvement store': ['444110',
  '444120',
  '444130',
  '444190',
  '444210',
  '444220',
  '441110',
  '441120',
  '441210',
  '441222',
  '441228',
  '441310',
  '441320',
  '442110',
  '442210',
  '442291',
  '442299',
  '443141',
  '443142',
  '445110',
  '445120',
  '445210',
  '445220',
  '445230',
  '445291',
  '445292',
  '445299',
  '445310',
  '446110',
  '446120',
  '446130',
  '446191',
  '446199',
  '447110',
  '447190',
  '448110',
  '448120',
  '448130',
  '448140',
  '448150',
  '448190',
  '448210',
  '448310',
  '448320'],
 'Diesel fuel supplier': ['424710',
  '424720',
  '424110',
  '424120',
  '424130',
  '424210',
  '424310',
  '424320',
  '424330',
  '424340',
  '424410',
  '424420',
  '424430',
  '424440',
  '424450',
  '424460',
  '424470',
  '424480',
  '424490',
  '424510',
  '424520',
  '424590',
  '424610',
  '424690',
  '424810',
  '424820',
  '424910',
  '424920',
  '424930',
  '424940',
  '424950',
  '424990',
  '423110',
  '423120',
  '423130',
 

In [124]:
max_docs = 10
def create_testing_dicts(normalization='lemmatize'):
    ret_docs_dic = {}
    queries_dic = {}

    if normalization == 'lemmatize':
        invert_index = inverted_index_lem
    else:
        invert_index = inverted_index_stem
        
    for query, value in relevance_judgments.items():
        
        ret_docs = retrieve_n_rank_docs(invert_index, query, normalization, max_docs=max_docs)
        if list(ret_docs.keys())[0] not in ret_docs_dic:
            ret_docs_dic[list(ret_docs.keys())[0]] = list(ret_docs.values())[0]
            
        queries_dic[' '.join(process_query(query, normalization))] = value
    
        
    return ret_docs_dic, queries_dic

In [128]:
# function to calculate precision and recall
def calc_pre_rec_at_n(ret_docs, reljudges, n=-1):
    """
    Calculate precision and recall at n for each query in ret_docs
    """
    
    pre_at_n, rec_at_n = {}, {}
    
    count = 0
    for k, v in ret_docs.items():
        if n > -1 and n <= len(ret_docs):
            s1 = set(v[:n])
        else:
            s1 = set(v)
        s2 = reljudges[k]
        try:
            precision = len(s1.intersection(s2)) / len(s1)
        except:
            pass
        recall = len(s1.intersection(s2)) / len(reljudges[k])
        pre_at_n[k] = round(precision, 3)
        rec_at_n[k] = round(recall, 3)
        count += 1
    return pre_at_n, rec_at_n

In [129]:
#function to calculate avg precision and mAP
def calc_avg_pre(ret_docs, reljudges, cutoff=-1):
    """
    Calculate (mean) average precision for each query in ret_docs
    """
    
    avg_pre, mean_avg_pre = {}, None
    for k, v in ret_docs.items():
        total_rel = 0
        total = 0
        avg_prec = 0
        for i, doc in enumerate(v):
            if doc in reljudges[k] and cutoff == -1:
                total_rel += 1
                total += 1
                precision = total_rel/total
            elif doc in reljudges[k] and i+1 <= cutoff:
                total_rel += 1
                total += 1
                precision = total_rel/total
            else:
                total += 1
                precision = 0
            avg_prec += precision

        avg_pre[k] = round(avg_prec/len(reljudges[k]), 3)
    
    mean_avg_pre = round(sum(avg_pre.values()) / len(avg_pre), 3)
        
    return avg_pre, mean_avg_pre

In [130]:
import math

# function to calculate NDCG
def calc_NDCG_at_n(ret_docs, reljudges, n=-1, base=2):
    """
    Calculate NDCG at n for each query in ret_docs
    """
    
    ndcg = {}
    
    for k, v in ret_docs.items():
        
        counts = list(reversed([x for x in range(2,len(reljudges[k])+2)]))
        ideals = {reljudges[k][i]: counts[i] for i in range(len(reljudges[k]))}
        
        add_ons = {}
        if len(v) > len(reljudges[k]):
            for i in range(len(v)-len(reljudges[k])):
                add_ons[i] = 1
        ideals.update(add_ons)
        nums = list(map(ideals.get, v))
        
        systems = {}
        for i, doc in enumerate(v):
            if nums[i] == None:
                systems[doc] = 1
            else:
                systems[doc] = nums[i]
                
        ideal_order = {}
        if n != -1:
            for i, (key, value) in enumerate(ideals.items()):
                if i < n:
                    ideal_order[key] = value
        else:
            ideal_order = ideals
        
        add_ons = {}
        
        
        log = 0
        for i, (doc, rank) in enumerate(ideal_order.items()):
            if i >= len(v):
                break
            elif i < base:
                log += rank
            else:
                log += rank/math.log(i+1, base)

                
        system_order = {}
        if n != -1:
            for i, (key, value) in enumerate(systems.items()):
                if i < n:
                    system_order[key] = value
        else:
            system_order = systems
        
        
        
        system_log = 0
        for i, (doc, rank) in enumerate(system_order.items()):
            if i >= len(v):
                break
            elif i < base:
                system_log += rank
            else:
                system_log += rank/math.log(i+1, base)
        try:
            ndcg[k] = system_log / log
        except:
            ndcg[k] = 0
        
    
    return ndcg

## Stemming Test Results

In [132]:
# IMPORT INVERTED_INDEX
with open(r'assets/inverted_index_stem.json') as f:
    inverted_index_stem = json.load(f)

#### Precision/Recall

In [133]:
pre_at_n, rec_at_n = calc_pre_rec_at_n(create_testing_dicts(normalization='stem')[0], create_testing_dicts(normalization='stem')[1])

In [134]:
pre_at_n

{'home improv store': 0.1,
 'diesel fuel supplier': 0.2,
 'church': 0.2,
 'farm': 0.8,
 'seed supplier': 0.2,
 'account': 0.0,
 'truck compani': 0.3,
 'export': 0.0,
 'grain elev': 0.0,
 'popcorn store': 0.2,
 'agricultur servic': 0.7,
 'warehous': 0.0,
 'agricultur product': 0.7,
 'ranch': 1.0,
 'hold compani': 0.3,
 'farm equip supplier': 0.1,
 'store': 0.4,
 'groceri store': 0.0,
 'rice mill': 0.6,
 'food product supplier': 0.0,
 'account firm': 0.0,
 'produc market': 0.0,
 'pet suppli store': 0.2,
 'wholesal': 1.0,
 'produc wholesal': 0.0,
 'distribut servic': 0.0,
 'crop grower': 1.0,
 'addict treatment center': 1.0,
 'natur good store': 0.0,
 'orchard': 1.0,
 'lumber store': 0.2,
 'mine': 1.0,
 'transport servic': 0.6,
 'invest compani': 1.0,
 'fruit wholesal': 0.2,
 'real estat agenc': 1.0,
 'event venu': 0.0,
 'frozen dessert supplier': 0.7,
 'wine wholesal import': 0.3,
 'wineri': 1.0,
 'aerospac compani': 1.0,
 'cold storag facil': 0.1,
 'employ agenc': 0.3,
 'plant nurseri':

In [135]:
rec_at_n

{'home improv store': 0.023,
 'diesel fuel supplier': 0.028,
 'church': 0.041,
 'farm': 0.125,
 'seed supplier': 0.028,
 'account': 0.0,
 'truck compani': 0.06,
 'export': 0.0,
 'grain elev': 0.0,
 'popcorn store': 0.045,
 'agricultur servic': 0.109,
 'warehous': 0.0,
 'agricultur product': 0.109,
 'ranch': 0.156,
 'hold compani': 1.0,
 'farm equip supplier': 0.014,
 'store': 0.182,
 'groceri store': 0.0,
 'rice mill': 0.082,
 'food product supplier': 0.0,
 'account firm': 0.0,
 'produc market': 0.0,
 'pet suppli store': 0.091,
 'wholesal': 0.141,
 'produc wholesal': 0.0,
 'distribut servic': 0.0,
 'crop grower': 0.156,
 'addict treatment center': 0.128,
 'natur good store': 0.0,
 'orchard': 0.156,
 'lumber store': 0.028,
 'mine': 0.357,
 'transport servic': 0.12,
 'invest compani': 0.244,
 'fruit wholesal': 0.028,
 'real estat agenc': 0.417,
 'event venu': 0.0,
 'frozen dessert supplier': 0.096,
 'wine wholesal import': 0.042,
 'wineri': 0.096,
 'aerospac compani': 0.053,
 'cold stora

In [86]:
avg_pre, mean_avg_pre = calc_avg_pre(create_testing_dicts(normalization='stem')[0], create_testing_dicts(normalization='stem')[1])

In [87]:
avg_pre

{'home improv store': 0.002,
 'diesel fuel supplier': 0.005,
 'church': 0.007,
 'farm': 0.105,
 'seed supplier': 0.006,
 'account': 0.0,
 'truck compani': 0.046,
 'export': 0.0,
 'grain elev': 0.0,
 'popcorn store': 0.009,
 'agricultur servic': 0.106,
 'warehous': 0.0,
 'agricultur product': 0.106,
 'ranch': 0.156,
 'hold compani': 0.867,
 'farm equip supplier': 0.002,
 'store': 0.114,
 'groceri store': 0.0,
 'rice mill': 0.064,
 'food product supplier': 0.0,
 'account firm': 0.0,
 'produc market': 0.0,
 'pet suppli store': 0.036,
 'wholesal': 0.141,
 'produc wholesal': 0.0,
 'distribut servic': 0.0,
 'crop grower': 0.156,
 'addict treatment center': 0.128,
 'natur good store': 0.0,
 'orchard': 0.156,
 'lumber store': 0.005,
 'mine': 0.357,
 'transport servic': 0.078,
 'invest compani': 0.244,
 'fruit wholesal': 0.008,
 'real estat agenc': 0.417,
 'event venu': 0.0,
 'frozen dessert supplier': 0.071,
 'wine wholesal import': 0.025,
 'wineri': 0.096,
 'aerospac compani': 0.053,
 'cold s

In [88]:
mean_avg_pre

0.057

In [89]:
calc_NDCG_at_n(create_testing_dicts(normalization='stem')[0], create_testing_dicts(normalization='stem')[1], n=-1, base=2)

{'home improv store': 0.062387011628009956,
 'diesel fuel supplier': 0.08387095630361062,
 'church': 0.09034884889711046,
 'farm': 0.5178737847929384,
 'seed supplier': 0.1254333888595897,
 'account': 0.021394318627208773,
 'truck compani': 0.4425032659424683,
 'export': 0.025812192854352625,
 'grain elev': 0.20341748548710972,
 'popcorn store': 0.16539480028020273,
 'agricultur servic': 0.6202800711515095,
 'warehous': 0.20341748548710972,
 'agricultur product': 0.3756497272114066,
 'ranch': 0.3512314498260458,
 'hold compani': 0.9293020998846204,
 'farm equip supplier': 0.017505799826600107,
 'store': 0.2493704428958166,
 'groceri store': 0.0239570415260799,
 'rice mill': 0.6434691377083236,
 'food product supplier': 0.014547278280154294,
 'account firm': 0.021394318627208773,
 'produc market': 0.0239570415260799,
 'pet suppli store': 0.27225613333912835,
 'wholesal': 0.46171023274441414,
 'produc wholesal': 0.014547278280154294,
 'distribut servic': 0.021394318627208773,
 'crop grow

## Lemmatize

In [90]:
# IMPORT INVERTED_INDEX
with open(r'assets/inverted_index_lem.json') as f:
    inverted_index_lem = json.load(f)

In [91]:
pre_at_n, rec_at_n = calc_pre_rec_at_n(create_testing_dicts(normalization='lemmatize')[0], create_testing_dicts(normalization='lemmatize')[1])

In [92]:
pre_at_n

{'home improvement store': 0.1,
 'diesel fuel supplier': 0.2,
 'church': 0.2,
 'farm': 0.7,
 'seed supplier': 0.2,
 'accountant': 0.8,
 'trucking company': 0.875,
 'exporter': 1.0,
 'grain elevator': 0.0,
 'popcorn store': 0.2,
 'agricultural service': 0.6,
 'warehouse': 0.0,
 'agricultural production': 0.6,
 'ranch': 1.0,
 'holding company': 0.3,
 'farm equipment supplier': 0.2,
 'store': 0.4,
 'grocery store': 0.0,
 'rice mill': 0.6,
 'food product supplier': 0.0,
 'accounting firm': 0.4,
 'produce market': 0.0,
 'pet supply store': 0.2,
 'wholesaler': 1.0,
 'produce wholesaler': 0.0,
 'distribution service': 0.0,
 'crop grower': 1.0,
 'addiction treatment center': 1.0,
 'natural good store': 0.0,
 'orchard': 1.0,
 'lumber store': 0.2,
 'mine': 1.0,
 'transportation service': 0.8,
 'investment company': 0.9,
 'fruit wholesaler': 0.2,
 'real estate agency': 1.0,
 'event venue': 0.0,
 'frozen dessert supplier': 0.7,
 'wine wholesaler importer': 0.3,
 'winery': 1.0,
 'aerospace company'

In [93]:
rec_at_n

{'home improvement store': 0.023,
 'diesel fuel supplier': 0.028,
 'church': 0.041,
 'farm': 0.109,
 'seed supplier': 0.028,
 'accountant': 0.082,
 'trucking company': 0.14,
 'exporter': 0.024,
 'grain elevator': 0.0,
 'popcorn store': 0.045,
 'agricultural service': 0.094,
 'warehouse': 0.0,
 'agricultural production': 0.094,
 'ranch': 0.156,
 'holding company': 1.0,
 'farm equipment supplier': 0.028,
 'store': 0.182,
 'grocery store': 0.0,
 'rice mill': 0.082,
 'food product supplier': 0.0,
 'accounting firm': 0.082,
 'produce market': 0.0,
 'pet supply store': 0.091,
 'wholesaler': 0.141,
 'produce wholesaler': 0.0,
 'distribution service': 0.0,
 'crop grower': 0.156,
 'addiction treatment center': 0.128,
 'natural good store': 0.0,
 'orchard': 0.156,
 'lumber store': 0.028,
 'mine': 0.357,
 'transportation service': 0.16,
 'investment company': 0.22,
 'fruit wholesaler': 0.028,
 'real estate agency': 0.417,
 'event venue': 0.0,
 'frozen dessert supplier': 0.096,
 'wine wholesaler i

In [94]:
avg_pre, mean_avg_pre = calc_avg_pre(create_testing_dicts(normalization='lemmatize')[0], create_testing_dicts(normalization='lemmatize')[1])

In [95]:
avg_pre

{'home improvement store': 0.002,
 'diesel fuel supplier': 0.005,
 'church': 0.007,
 'farm': 0.073,
 'seed supplier': 0.006,
 'accountant': 0.082,
 'trucking company': 0.138,
 'exporter': 0.024,
 'grain elevator': 0.0,
 'popcorn store': 0.009,
 'agricultural service': 0.066,
 'warehouse': 0.0,
 'agricultural production': 0.066,
 'ranch': 0.156,
 'holding company': 0.867,
 'farm equipment supplier': 0.006,
 'store': 0.128,
 'grocery store': 0.0,
 'rice mill': 0.064,
 'food product supplier': 0.0,
 'accounting firm': 0.053,
 'produce market': 0.0,
 'pet supply store': 0.036,
 'wholesaler': 0.141,
 'produce wholesaler': 0.0,
 'distribution service': 0.0,
 'crop grower': 0.156,
 'addiction treatment center': 0.128,
 'natural good store': 0.0,
 'orchard': 0.156,
 'lumber store': 0.005,
 'mine': 0.357,
 'transportation service': 0.123,
 'investment company': 0.22,
 'fruit wholesaler': 0.008,
 'real estate agency': 0.417,
 'event venue': 0.0,
 'frozen dessert supplier': 0.071,
 'wine wholesal

In [96]:
mean_avg_pre

0.058

In [97]:
## BoW mAP results are going to be used later on the Charts.ipynb file
word_count_maps = []
for i in range(1, 11):
    word_count_maps.append(calc_avg_pre(create_testing_dicts(normalization='lemmatize')[0], create_testing_dicts(normalization='lemmatize')[1], cutoff=i)[1])

In [98]:
word_count_maps

[0.009, 0.017, 0.023, 0.029, 0.035, 0.04, 0.044, 0.05, 0.054, 0.058]

In [99]:
calc_NDCG_at_n(create_testing_dicts(normalization='lemmatize')[0], create_testing_dicts(normalization='lemmatize')[1], n=-1, base=2)

{'home improvement store': 0.062387011628009956,
 'diesel fuel supplier': 0.08387095630361062,
 'church': 0.09034884889711046,
 'farm': 0.47709029832625394,
 'seed supplier': 0.1254333888595897,
 'accountant': 0.8877119214641515,
 'trucking company': 0.8541001297672974,
 'exporter': 1.0,
 'grain elevator': 0.20341748548710972,
 'popcorn store': 0.16539480028020273,
 'agricultural service': 0.3551836963652799,
 'warehouse': 0.20341748548710972,
 'agricultural production': 0.3813826423749301,
 'ranch': 0.31597114306698204,
 'holding company': 0.9816035470353093,
 'farm equipment supplier': 0.07620562577841833,
 'store': 0.28003533954638904,
 'grocery store': 0.0239570415260799,
 'rice mill': 0.6434691377083236,
 'food product supplier': 0.014547278280154294,
 'accounting firm': 0.2991508324014225,
 'produce market': 0.0239570415260799,
 'pet supply store': 0.27225613333912835,
 'wholesaler': 0.4504652996471026,
 'produce wholesaler': 0.014547278280154294,
 'distribution service': 0.02139