# Searching for a Word in Business Names
In this notebook we will look at how to identify a certain type of business using only its name. For example, say we want to identify businesses that are likely grocery stores. Simply checking whether the business name contains "grocery" would be too narrow. We want to find other words in the business name data that have a similar meaning, and search for those too. Having these additional words will make our search more robust.

In [1]:
import csv
import re
import string
from collections import OrderedDict

import annoy
import spacy
from scipy.spatial.distance import cosine
from spacy.lemmatizer import Lemmatizer

First we will read in the data and store it in a dictionary of columns.  
The data came from the following kaggle dataset: https://www.kaggle.com/peopledatalabssf/free-7-million-company-dataset  
It contains about 7 million businesses from around the world.

I should be using something like pandas, but sometimes it's fun to just use the base Python csv module.

In [2]:
with open("companies_sorted.csv", "r") as bnm_csv:
    csv_reader = csv.DictReader(bnm_csv)
    bus_dat = OrderedDict()
    for idx, r in enumerate(csv_reader):
        if idx == 0:
            for col in r:
                bus_dat[col] = [r[col]]
        else:
            for col in r:
                bus_dat[col].append(r[col])
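
The column-wise loading above can also be written without the first-row special case. Below is a minimal sketch of that idea, run on a small in-memory sample rather than `companies_sorted.csv`; the sample rows and the helper name `read_columns` are made up for illustration.

```python
import csv
import io
from collections import OrderedDict


def read_columns(csv_file):
    """Read a CSV file object into an ordered dict of column name -> list of values."""
    reader = csv.DictReader(csv_file)
    columns = OrderedDict()
    for row in reader:
        for col, value in row.items():
            # setdefault creates the list the first time a column is seen
            columns.setdefault(col, []).append(value)
    return columns


# Hypothetical sample standing in for the real file
sample = io.StringIO("name,domain\nibm,ibm.com\ney,ey.com\n")
sample_cols = read_columns(sample)
print(list(sample_cols))    # column names
print(sample_cols["name"])  # values of one column
```

Passing an open file handle instead of the `io.StringIO` object should give the same column layout as `bus_dat`.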

In [3]:
[col_name for col_name in bus_dat]

['',
 'name',
 'domain',
 'year founded',
 'industry',
 'size range',
 'locality',
 'country',
 'linkedin url',
 'current employee estimate',
 'total employee estimate']

In [4]:
for i, j in zip(bus_dat["name"][0:5], bus_dat["domain"][0:5]):
    print(f"{i}, {j}")

ibm, ibm.com
tata consultancy services, tcs.com
accenture, accenture.com
us army, goarmy.com
ey, ey.com


We only need the "name" column, which holds the business name.

In [5]:
bus_name = bus_dat["name"]

In [6]:
len(bus_name)

7173426

Next, we will go through the list and split each name on whitespace.

In [7]:
bus_parts = [
    word for name_list in [name.split() for name in bus_name] for word in name_list
]

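# Punctuation plus the digits 0-9. Defined here but not applied below,
# so digits are left in the tokens.
punc_n_nums = string.punctuation + string.digits
# Strip punctuation, building the translation table once rather than once per word
punc_table = str.maketrans("", "", string.punctuation)
bus_parts = [s.translate(punc_table) for s in bus_parts]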
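
If digits should be removed as well, the same `str.translate` approach works with a table that covers both punctuation and digits. This is a sketch under that assumption, not what the cell above does; `strip_chars` is a hypothetical helper and the tokens are made up.

```python
import string

# Translation table that deletes every ASCII punctuation character and digit
PUNC_AND_DIGITS = str.maketrans("", "", string.punctuation + string.digits)


def strip_chars(token):
    """Remove ASCII punctuation and digits from a single token."""
    return token.translate(PUNC_AND_DIGITS)


tokens = ["c&s", "7-eleven", "shaw's", "1234"]
cleaned = [strip_chars(t) for t in tokens]
print(cleaned)
```

Tokens made up entirely of punctuation or digits become empty strings, so they would still need to be filtered out afterwards.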

In [8]:
bus_parts[100:110]

['promobroker',
 'agente',
 'de',
 'seguros',
 'y',
 'de',
 'fianzas',
 's',
 'a',
 'de']

In [9]:
len(bus_parts)

21455275

In [10]:
bus_parts = list(set([s.lower() for s in bus_parts if s != ""]))

We want to drop all stop words.

## Process Names with SpaCy
Next we will process the words using spaCy. There are a few things we want to do:
 - Remove stop words.
 - Lemmatize words.

After this we will deduplicate the words and then get the word vectors.

In [11]:
nlp = spacy.load("en_core_web_md")

In [12]:
proc_words = list(set([s for s in bus_parts if s not in nlp.Defaults.stop_words]))

Next we will lemmatize all of the word parts.

In [13]:
lemmatizer = Lemmatizer(nlp.vocab.lookups)

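This returns the lemma if the word is in the lookup table; otherwise it returns the word unchanged. Note that `spacy.lemmatizer.Lemmatizer` is the spaCy 2.x API and is not available in spaCy 3.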

In [14]:
lemmatizer.lookup("going"), lemmatizer.lookup("caring"), lemmatizer.lookup(
    "someCrazyWord"
)

('go', 'care', 'someCrazyWord')
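
Conceptually, a lookup lemmatizer behaves like a dictionary lookup with the word itself as the fallback. The sketch below illustrates that behavior with a tiny hand-written table; the table contents and the function name `lookup_lemma` are assumptions for illustration and are not spaCy's actual data or API.

```python
# Tiny hand-written lookup table (hypothetical, not spaCy's real table)
LEMMA_TABLE = {"going": "go", "caring": "care", "stores": "store"}


def lookup_lemma(word, table=LEMMA_TABLE):
    """Return the lemma for `word` if it is in the table, else the word itself."""
    return table.get(word, word)


print(lookup_lemma("going"), lookup_lemma("someCrazyWord"))
```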

In [15]:
proc_lemma = [lemmatizer.lookup(w) for w in proc_words]

In [16]:
# Drop duplicates
lemma_sub = list(set(proc_lemma))

In [17]:
len(lemma_sub)

2158126

Next, we will add all of our words to a dictionary where the lemma is the key and the value is a tuple of the lemma's index in `lemma_sub` and its word vector. Words without a vector are skipped.

In [18]:
bwd = {}
for idx, w in enumerate(lemma_sub):
    if nlp.vocab[w].has_vector:
        bwd[w] = (idx, nlp.vocab[w].vector)

Word vectors give us a way to represent words as points in a vector space. One of the most common models for creating them is Word2Vec. The closer two words are in that space, the more similar their meanings tend to be. Here we measure closeness with cosine similarity.

In [19]:
def word_cosine_similarity(w1, w2, model):
    return 1 - cosine(model.vocab[w1].vector, model.vocab[w2].vector)
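
For reference, `scipy.spatial.distance.cosine` returns the cosine distance, so `1 - cosine(u, v)` is the cosine similarity: the dot product of the two vectors divided by the product of their lengths. A plain-Python sketch with toy 2-d vectors (made up for illustration) shows the calculation:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors: same direction, orthogonal, opposite direction
same = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same, orthogonal, opposite)
```

Values near 1 mean the vectors point the same way, values near 0 mean they are unrelated, and negative values mean they point in opposing directions.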

Other text similarity measures, such as edit distance or Soundex, check whether two words are spelled or pronounced alike. Word vectors capture semantic similarity instead.

In [20]:
word_cosine_similarity("farmer", "framer", nlp)

0.06580065190792084

In [21]:
word_cosine_similarity("farmer", "agriculture", nlp)

0.5305896997451782

### Query Word Vectors with Annoy
Now we will query our word vectors using the package Annoy: https://github.com/spotify/annoy  
This is a fabulous approximate nearest neighbors package that I use a lot for querying word vectors.

In [36]:
aidx = annoy.AnnoyIndex(300, "angular")

In [37]:
for i in bwd.values():
    aidx.add_item(*i)

In [38]:
aidx.build(n_trees=300)

True

Now we can query the index and find the most similar words to our word of interest, "grocery".

In [39]:
groc_words = [lemma_sub[i] for i in aidx.get_nns_by_item(bwd["grocery"][0], 20)]
groc_words

['grocery',
 'grocer',
 'newsagent',
 'hypermarket',
 'healthfood',
 'minimart',
 'waitrose',
 'newsagency',
 'supermarket',
 'supercenters',
 'store',
 'delicatessen',
 'döner',
 'enoteca',
 'presliced',
 'bottleshop',
 'luncheonette',
 'knish',
 'bodega',
 'carryout']

Not all of these words would identify a grocery store exactly, but the list still gives us more alternatives than searching for "grocery" alone.
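
Annoy returns approximate neighbors. On a small vocabulary, the exact ranking it approximates can be computed by brute force, which is handy for sanity-checking results. The sketch below uses made-up 3-d vectors and a hypothetical helper `nearest_words`; it ranks by cosine similarity, which gives the same ordering as Annoy's "angular" distance.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nearest_words(query, vectors, n=3):
    """Return the n words whose vectors are most similar to the query word's vector."""
    ranked = sorted(
        vectors,
        key=lambda w: cosine_similarity(vectors[query], vectors[w]),
        reverse=True,
    )
    return ranked[:n]


# Made-up toy vectors, not real word embeddings
toy_vectors = {
    "grocery": [1.0, 0.9, 0.0],
    "supermarket": [0.9, 1.0, 0.1],
    "bodega": [0.7, 0.6, 0.3],
    "software": [0.0, 0.1, 1.0],
}
print(nearest_words("grocery", toy_vectors))
```

As with `get_nns_by_item`, the query word itself comes back as its own nearest neighbor.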

### Final Search Terms

Finally, let's grab any variations of these words that may exist in our pre-lemmatized data. We will add these variations to our search terms.

In [40]:
search_terms = {}
for w in groc_words:
    search_terms[w] = []

In [41]:
lemma_tuples = [(i, j) for i, j in zip(proc_lemma, proc_words) if i in groc_words]

In [42]:
lemma_tuples[0:4]

[('newsagent', 'newsagent'),
 ('bodega', 'bodegas'),
 ('hypermarket', 'hypermarket'),
 ('supermarket', 'supermarkets')]

We will append all of these variations into our search term dictionary.

In [43]:
for i, j in lemma_tuples:
    search_terms[i].append(j)

In [44]:
search_terms

{'grocery': ['groceries', 'grocery'],
 'grocer': ['grocers', 'grocer'],
 'newsagent': ['newsagent', 'newsagents'],
 'hypermarket': ['hypermarket', 'hypermarkets'],
 'healthfood': ['healthfood'],
 'minimart': ['minimart'],
 'waitrose': ['waitrose'],
 'newsagency': ['newsagency'],
 'supermarket': ['supermarkets', 'supermarket'],
 'supercenters': ['supercenters'],
 'store': ['stored', 'storing', 'store', 'stores'],
 'delicatessen': ['delicatessen', 'delicatessens'],
 'döner': ['döner'],
 'enoteca': ['enoteca'],
 'presliced': ['presliced'],
 'bottleshop': ['bottleshop'],
 'luncheonette': ['luncheonette'],
 'knish': ['knish'],
 'bodega': ['bodegas', 'bodega'],
 'carryout': ['carryout']}
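
The grouping of surface forms by lemma can also be done in one pass with `collections.defaultdict`, which removes the need to pre-create an empty list per key. A minimal sketch, using a few hand-picked pairs in place of `lemma_tuples`:

```python
from collections import defaultdict

# Hand-picked (lemma, word) pairs standing in for lemma_tuples
pairs = [
    ("bodega", "bodegas"),
    ("supermarket", "supermarkets"),
    ("bodega", "bodega"),
    ("supermarket", "supermarket"),
]

variants = defaultdict(list)
for lemma, word in pairs:
    # Missing keys start as an empty list, so no pre-initialization is needed
    variants[lemma].append(word)

print(dict(variants))
```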

I am actually going to drop the "store", "newsagent", and "newsagency" terms.

In [45]:
del search_terms["newsagent"]
del search_terms["newsagency"]
del search_terms["store"]

If I were doing this for a formal project, I would probably think through the best way to use these word variations to query the business names. Additionally, I would try to identify common misspellings of these words.
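
One possible starting point for finding misspellings is to compare each search term against the vocabulary of name parts with the standard library's `difflib`. This is only a sketch of the idea: the vocabulary is made up, `find_near_spellings` is a hypothetical helper, and the 0.8 cutoff is an arbitrary assumption that would need tuning on real data.

```python
import difflib


def find_near_spellings(term, vocabulary, cutoff=0.8):
    """Return words in `vocabulary` that are spelled similarly to `term`."""
    candidates = [w for w in vocabulary if w != term]
    # Ask for every match above the cutoff (n must be at least 1)
    n = max(1, len(candidates))
    return difflib.get_close_matches(term, candidates, n=n, cutoff=cutoff)


# Made-up vocabulary for illustration
vocab = ["grocery", "grocrey", "groceri", "bakery", "hardware"]
near = find_near_spellings("grocery", vocab)
print(near)
```

Comparing every term against millions of name parts this way would be slow, so in practice the vocabulary would likely need to be narrowed first, for example by first letter or length.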

In [46]:
grocery_name_parts = [p for var in search_terms.values() for p in var]

We want to make sure we only count a term when it appears as a whole word, not as part of a longer word.

In [47]:
grocery_pattern = r"|".join([f"^{i}$| {i}$| {i} |^{i} " for i in grocery_name_parts])
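
An alternative way to express the same whole-word constraint is to escape each term and wrap the alternation in lookarounds that require whitespace or a string boundary on each side. This is a sketch of that variant, not the pattern used in this notebook; `build_word_pattern` is a hypothetical helper and the sample names are made up.

```python
import re


def build_word_pattern(terms):
    """Compile a regex matching any term delimited by whitespace or string edges."""
    alternation = "|".join(re.escape(t) for t in terms)
    # (?<!\S) = not preceded by a non-space char; (?!\S) = not followed by one
    return re.compile(rf"(?<!\S)(?:{alternation})(?!\S)")


pattern = build_word_pattern(["grocery", "grocers", "bodega"])
names = ["ralphs grocery company", "grocerystore inc", "la bodega", "grocery's best"]
matches = [n for n in names if pattern.search(n)]
print(matches)
```

Compiling the pattern once also avoids re-parsing a long alternation for every business name.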

In [48]:
grocery_stores = [
    name for name in bus_dat["name"] if bool(re.search(grocery_pattern, name))
]

In [49]:
len(grocery_stores)

2296

In [50]:
grocery_stores[0:15]

['wm morrison supermarkets plc',
 'waitrose',
 'c&s wholesale grocers',
 'hannaford supermarkets',
 'woolworths supermarkets',
 'price chopper supermarkets',
 'ralphs grocery company',
 'shoprite supermarkets',
 "shaw's supermarkets",
 'save mart supermarkets',
 'southeastern grocers',
 'brookshire grocery company',
 'associated wholesale grocers',
 'grocery outlet',
 'supermarket']