# Exploring Word Distance With SIC Descriptions

In this notebook we look at some methods for exploring semantic similarities between sentences using word vectors.  
Rather than training our own word embeddings, we will use the pretrained fastText word2vec-style vectors, which were created using a skip-gram model.
We will apply them to Standard Industrial Classification (SIC) code descriptions.

In [1]:
import pandas as pd
import numpy as np
import spacy

Let's first load the word vectors we will be using. These come from the fastText English model.  
https://fasttext.cc/docs/en/english-vectors.html  
I performed some manipulation of the vectors in R using the `data.table` package. I was able to read in the data with `fread()` extremely fast, and could then export it as a CSV to use with `pandas`. This is probably something I could do in pure Python, but I still fall back on my R roots at times. I then pickled the data with Python so that it reads in more quickly.
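
For reference, the same conversion can probably be done without leaving Python. The sketch below assumes the standard fastText `.vec` text layout (a header line with the word count and dimension, then one word per line followed by its space-separated values). `load_vec` and the three-word sample are made up for illustration, and the real file may need extra handling.

```python
import csv
import io

import pandas as pd


def load_vec(source):
    """Read a fastText-style .vec file (path or file-like object) into a DataFrame.

    Hypothetical helper: the index holds the words, the columns hold the dimensions.
    """
    return pd.read_csv(
        source,
        sep=" ",
        header=None,             # the file has no column names
        skiprows=1,              # skip the "<n_words> <dim>" header line
        index_col=0,             # first field on each line is the word
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
        keep_default_na=False,   # keep words like "null" or "NA" as strings
    )


# Tiny made-up sample in the same layout, standing in for the real file.
sample = io.StringIO(
    "3 4\n"
    "the 0.1 0.2 0.3 0.4\n"
    "null 0.5 0.6 0.7 0.8\n"
    ", 0.9 1.0 1.1 1.2\n"
)
toy = load_vec(sample)
print(toy)
```

Passing `keep_default_na=False` matters here because pandas otherwise treats strings such as `null` and `NA` as missing values, which is one possible source of missing entries in the word index.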

In [2]:
# Word vectors, approx 1MM words with vectors 300 long.
wvecs = pd.read_pickle("../data/wiki-news-300d-1M.pkl")

In [3]:
wvecs.shape

(999994, 300)

In [4]:
wvecs.iloc[0:5,0:5]

0,1,2,3,4,5
",",0.1073,0.0089,0.0006,0.0055,-0.0646
the,0.0897,0.016,-0.0571,0.0405,-0.0696
.,0.0004,0.0032,-0.0204,0.0479,-0.045
and,-0.0314,0.0149,-0.0205,0.0557,0.0205
of,-0.0063,-0.0253,-0.0338,0.0178,-0.0966


I am not going to use any punctuation here, so we can drop these vectors. I am also going to convert all words to lowercase, so we do not need to keep capitalized words. The fastText model provides separate vectors for capitalized and lowercase forms of a word. For example...

In [5]:
print(wvecs.loc['working'].head())
print("\n")
print(wvecs.loc['Working'].head())
print("\n")
print(wvecs.loc['WORKING'].head())

1   -0.1140
2    0.0393
3   -0.0282
4   -0.0730
5   -0.1154
Name: working, dtype: float64


1   -0.0297
2   -0.0028
3    0.1812
4   -0.0450
5    0.0572
Name: Working, dtype: float64


1   -0.1876
2   -0.0591
3   -0.0959
4    0.0605
5   -0.0325
Name: WORKING, dtype: float64


In [6]:
wvecs.index.to_series().isna().sum()

5

In [7]:
# Some word values are NA, probably strange characters or something.
wvecs = wvecs[~wvecs.index.to_series().isna()]

In [8]:
# Flag any words with capital letters or non-word (\W) characters.
wvecs['cap_punct'] = wvecs.index.to_series().str.contains(r"[A-Z]|\W")

In [9]:
wvecs = wvecs[~wvecs['cap_punct']].drop('cap_punct', axis=1)

In [10]:
# Drop any remaining rows that contain missing vector values.
wvecs = wvecs.dropna(axis=0)

In [11]:
wvecs.shape

(258631, 300)
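
To make the effect of the `[A-Z]|\W` filter concrete, here is a small sketch on a made-up vocabulary (the words and the single `dim_1` column are hypothetical stand-ins for the real 300-dimensional table). It shows which kinds of words survive.

```python
import pandas as pd

# Hypothetical mini vocabulary; one column stands in for the 300 dimensions.
toy = pd.DataFrame(
    {"dim_1": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]},
    index=["working", "Working", "WORKING", ",", "e-mail", "co2"],
)

# True for any word containing an uppercase ASCII letter or a non-word character.
has_cap_or_punct = toy.index.str.contains(r"[A-Z]|\W")

kept = toy[~has_cap_or_punct]
print(kept.index.tolist())
```

Note that hyphenated words such as `e-mail` are dropped along with punctuation, while words containing digits or underscores are kept, because `\w` matches those characters.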

In [12]:
wvecs.iloc[0:5,0:5]

0,1,2,3,4,5
the,0.0897,0.016,-0.0571,0.0405,-0.0696
and,-0.0314,0.0149,-0.0205,0.0557,0.0205
of,-0.0063,-0.0253,-0.0338,0.0178,-0.0966
to,0.0495,0.0411,0.0041,0.0309,-0.0044
in,-0.0234,-0.0268,-0.0838,0.0386,-0.0321


In [13]:
import os
os.listdir("../docs/")

['sic_descriptions.xlsx']

Read in and clean the SIC data. For now we will just work with the major codes. Once we have a reliable pipeline for cleaning and processing the data, we can work with the 4-digit codes.

In the future I will use the following table for the main groups
https://github.com/saintsjd/sic4-list/blob/master/sic-codes.csv

In [14]:
sic = pd.read_excel("../docs/sic_descriptions.xlsx", sheet_name='4 digit')

In [15]:
sic.shape

(1208, 2)

In [16]:
sic.head()

,code,desc
0,111,Wheat Farming
1,112,Rice Farming
2,115,Corn Farming
3,116,Soybean Farming
4,119,Dry Pea and Bean Farming


In [17]:
sic['code'].value_counts()

7389    21
7999    11
7699    10
5812     7
1799     6
        ..
3365     1
3364     1
3363     1
3356     1
2048     1
Name: code, Length: 880, dtype: int64

For some reason there are records with duplicate SIC codes. For now we will keep the first record for each code and drop the rest.

In [18]:
sic = sic.drop_duplicates(subset='code')
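
Before dropping rows it can help to look at what is being discarded. A minimal sketch on made-up rows (the placeholder descriptions are not from the real spreadsheet): `duplicated(keep=False)` marks every row that shares a code, and `drop_duplicates` keeps the first occurrence by default.

```python
import pandas as pd

# Made-up rows: one code repeated three times, one unique code.
toy = pd.DataFrame({
    "code": [7389, 7389, 7389, 111],
    "desc": ["Description one", "Description two", "Description three", "Wheat Farming"],
})

# Every row whose code appears more than once, so the variants can be inspected.
dupes = toy[toy.duplicated(subset="code", keep=False)]
print(dupes)

# keep="first" is the default; it is spelled out here to make the choice explicit.
deduped = toy.drop_duplicates(subset="code", keep="first")
print(deduped)
```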

In [19]:
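import spacy
from nltk.corpus import stopwords

# Only the tagger and lemmatizer are needed, so skip the parser and NER.
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
# Requires the NLTK stopwords corpus: nltk.download('stopwords')
stpwrds = set(stopwords.words("english"))

def clean_series_text(col):
    """Lowercase and lemmatize each text in the series, dropping stopwords,
       punctuation, and pronouns. Returns one list of lemmas per text.
    """
    docs = nlp.pipe(col.str.lower().tolist(), batch_size=200)
    clean_words = [[word.lemma_ for word in doc
                    if word.lemma_ not in stpwrds and
                    word.pos_ not in ['PUNCT', 'PRON']] for doc in docs]
    return clean_words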

In [20]:
sic['desc_toks'] = clean_series_text(sic['desc'])

In [21]:
sic.head()

,code,desc,desc_toks
0,111,Wheat Farming,"[wheat, farming]"
1,112,Rice Farming,"[rice, farming]"
2,115,Corn Farming,"[corn, farming]"
3,116,Soybean Farming,"[soybean, farming]"
4,119,Dry Pea and Bean Farming,"[dry, pea, bean, farming]"


### Create Summed Word Vectors
For each description, we want a single vector: the sum of the word vectors of its individual tokens.

In [22]:
def get_word_vec(word, word_vecs):
    """Get the vector for a word, returning a zero vector if
       the word is not in word_vecs.
    """
    try:
        wv = word_vecs.loc[word,].values
    except KeyError:
        wv = np.zeros(word_vecs.shape[1])
    return wv

In [23]:
get_word_vec("apple", word_vecs=wvecs).shape

(300,)

In [24]:
get_word_vec("ADfDfdsA", word_vecs=wvecs)[0:10]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [25]:
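def sum_word_vecs(word_list, word_vecs=wvecs):
    """Sum the word vectors of all words in word_list"""
    if len(word_list) == 0:
        # np.vstack raises on an empty list, so return a zero vector instead.
        return np.zeros(word_vecs.shape[1])
    wvs = np.vstack([get_word_vec(word, word_vecs) for word in word_list])
    return np.sum(wvs, axis=0)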

With these functions we can create the word vector for each description.

In [26]:
sic['desc_wvec'] = sic['desc_toks'].apply(sum_word_vecs)
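
The `apply` above does one `.loc` lookup per token. A possible alternative, sketched below on a made-up two-dimensional vocabulary, is to let `reindex` fetch all tokens at once: unknown words come back as missing rows, which are filled with zeros before summing. `sum_vecs_reindex` is a hypothetical name, and the approach assumes the word index has no duplicate entries.

```python
import numpy as np
import pandas as pd

# Made-up 2-d vectors standing in for the 300-d fastText table.
toy_vecs = pd.DataFrame(
    [[1.0, 0.0], [0.0, 2.0]],
    index=["wheat", "farming"],
)


def sum_vecs_reindex(words, word_vecs):
    """Sum the vectors for `words`; out-of-vocabulary words count as zeros."""
    rows = word_vecs.reindex(index=list(words))  # missing words become NaN rows
    return rows.fillna(0.0).to_numpy().sum(axis=0)


print(sum_vecs_reindex(["wheat", "farming", "zzz"], toy_vecs))
print(sum_vecs_reindex([], toy_vecs))
```

An empty token list yields a zero vector here, so descriptions that lose all of their tokens during cleaning do not raise an error.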

In [27]:
sic.head()

,code,desc,desc_toks,desc_wvec
0,111,Wheat Farming,"[wheat, farming]","[0.1251, -0.024800000000000003, -0.03259999999..."
1,112,Rice Farming,"[rice, farming]","[0.03689999999999999, -0.030799999999999994, -..."
2,115,Corn Farming,"[corn, farming]","[0.0678, 0.1759, -0.025300000000000003, 0.1433..."
3,116,Soybean Farming,"[soybean, farming]","[0.14529999999999998, 0.14, 0.0725000000000000..."
4,119,Dry Pea and Bean Farming,"[dry, pea, bean, farming]","[-0.065, -0.018799999999999997, 0.058500000000..."


In [28]:
sic.iloc[0,3].shape

(300,)

In [29]:
sic = sic.set_index("code")

Now that we have the word vectors, we can find the nearest neighbors of a given point using approximate nearest neighbor search. For this we can use the Python package `annoy`.

In [30]:
from annoy import AnnoyIndex

In [31]:
ann_index = AnnoyIndex(300, metric='angular')
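
As far as I understand the Annoy documentation, the `angular` metric is the Euclidean distance between normalized vectors, i.e. `sqrt(2 * (1 - cos(u, v)))`. The NumPy sketch below (with made-up vectors and a hypothetical `angular_distance` helper) illustrates one consequence: the distance ignores vector length, so summing word vectors and averaging them should give the same neighbors.

```python
import numpy as np


def angular_distance(u, v):
    """sqrt(2 * (1 - cos(u, v))); undefined if either vector is all zeros."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    cos = np.clip(cos, -1.0, 1.0)  # guard against tiny floating point overshoot
    return np.sqrt(2.0 * (1.0 - cos))


# Made-up "summed" description vectors.
a = np.array([1.0, 2.0, 3.0])
b = np.array([3.0, 1.0, 0.5])

summed = angular_distance(a, b)
# Dividing by a token count (here 4 and 2) rescales each vector but not the angle.
averaged = angular_distance(a / 4, b / 2)
print(summed, averaged)
```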

In [32]:
sic_codes = sic.index.values

In [33]:
sic.index.value_counts()

2047    1
2821    1
2841    1
2836    1
2835    1
       ..
3446    1
3444    1
3443    1
3442    1
2048    1
Name: code, Length: 880, dtype: int64

In [34]:
sic.loc[119, "desc_wvec"][0:10]

array([-0.065 , -0.0188,  0.0585, -0.0697,  0.1298, -0.1754, -0.1045,
       -0.2171,  0.3343, -0.08  ])

In [35]:
for code in sic_codes:
    ann_index.add_item(code, sic.loc[code, "desc_wvec"])
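
Using the SIC code itself as the Annoy item id works, but the Annoy documentation notes that `add_item` allocates memory for `max(id) + 1` items, so four-digit ids reserve room for far more than 880 vectors. That is harmless at this size. A minimal sketch of an alternative, assuming we are happy to keep two lookup dictionaries (`code_to_id` and `id_to_code` are hypothetical names, and the codes are just a few examples from this notebook):

```python
# A few SIC codes from this notebook, standing in for sic.index.values.
codes = [111, 112, 115, 7389, 7999]

# Consecutive ids 0..n-1 for the index, plus the reverse lookup for query results.
code_to_id = {code: i for i, code in enumerate(codes)}
id_to_code = {i: code for code, i in code_to_id.items()}

print(code_to_id[7389])
print([id_to_code[i] for i in (0, 4)])
```

With this mapping, the loop would add each vector under `code_to_id[code]`, and the ids returned by a query would be translated back through `id_to_code`.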

In [36]:
# Build the annoy index
ann_index.build(n_trees=50)

True

In [37]:
ann_index.get_nns_by_item(119, n=6)

[119, 161, 134, 112, 171, 115]

In [38]:
sic.loc[119, "desc_wvec"][0:10]

array([-0.065 , -0.0188,  0.0585, -0.0697,  0.1298, -0.1754, -0.1045,
       -0.2171,  0.3343, -0.08  ])

In [39]:
sic.loc[[119, 161, 134, 112, 171, 115],'desc']

code
119                             Dry Pea and Bean Farming
161    Other Vegetable (Except Potato) and Melon Farming
134                                       Potato Farming
112                                         Rice Farming
171                    Berry (Except Strawberry) Farming
115                                         Corn Farming
Name: desc, dtype: object
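
Because Annoy is approximate, it can be worth checking a few queries against an exact search, which is cheap with only 880 descriptions. The sketch below ranks rows by cosine similarity on a made-up 2-d matrix. `exact_neighbors` is a hypothetical helper, and on the real data the matrix would be something like `np.vstack(sic['desc_wvec'].tolist())`.

```python
import numpy as np


def exact_neighbors(query, matrix, n=3):
    """Return the row positions of the n rows most cosine-similar to query.

    Assumes no row (and not the query) is an all-zero vector.
    """
    unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    sims = unit_rows @ unit_query
    # Sort by descending similarity; a stable sort keeps ties in row order.
    return np.argsort(-sims, kind="stable")[:n]


# Made-up vectors: row 1 points almost the same way as row 0, row 3 is opposite.
toy_matrix = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])
nearest = exact_neighbors(toy_matrix[0], toy_matrix, n=3)
print(nearest)
```

The result contains row positions, not SIC codes, so on the real data they would still need to be mapped back through `sic.index`.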

Let's try this with the SIC major (division) descriptions.

In [40]:
sic_div = pd.read_excel("../docs/sic_descriptions.xlsx", sheet_name='division')

In [41]:
sic_div

,division,desc
0,A,"Agriculture, Forestry, And Fishing"
1,B,Mining
2,C,Construction
3,D,Manufacturing
4,E,"Transportation, Communications, Electric, Gas,..."
5,F,Wholesale Trade
6,G,Retail Trade
7,H,"Finance, Insurance, And Real Estate"
8,I,Services
9,J,Public Administration


In [42]:
sic_div = sic_div.reset_index().rename(columns={"index" : "id"})

In [43]:
sic_div['desc_toks'] = clean_series_text(sic_div['desc'])

In [44]:
sic_div['desc_wvec'] = sic_div['desc_toks'].apply(sum_word_vecs)

In [45]:
div_index = AnnoyIndex(300, metric='angular')

In [46]:
for code in sic_div['id'].values:
    div_index.add_item(code, sic_div.loc[code, "desc_wvec"])

In [47]:
ann_index.get_nns_by_vector(div_index.get_item_vector(0), 6)

[919, 851, 132, 112, 913, 111]

In [48]:
sic.loc[[919, 851, 132, 112, 913, 111], "desc"]

code
919               Other Marine Fishing
851    Support Activities for Forestry
132                    Tobacco Farming
112                       Rice Farming
913                  Shellfish Fishing
111                      Wheat Farming
Name: desc, dtype: object

We can run the same lookup for every division by using `apply` on the `id` column.

In [49]:
sic_div['nns_codes'] = sic_div['id'].apply(
    lambda x: ann_index.get_nns_by_vector(div_index.get_item_vector(x), 6)
)

In [50]:
sic_div

,id,division,desc,desc_toks,desc_wvec,nns_codes
0,0,A,"Agriculture, Forestry, And Fishing","[agriculture, forestry, fishing]","[0.3647, 0.16940000000000002, 0.23909999999999...","[919, 851, 132, 112, 913, 111]"
1,1,B,Mining,[mining],"[0.1812, 0.0106, 0.0889, -0.27399999999999997,...","[1041, 1044, 1099, 1241, 1446, 1011]"
2,2,C,Construction,[construction],"[0.1244, 0.053, -0.0315, -0.0084, -0.1072, -0....","[1522, 8741, 1623, 1622, 1721, 1781]"
3,3,D,Manufacturing,[manufacturing],"[0.1513, -0.003, 0.0342, -0.0873, -0.0529, 0.0...","[3552, 3559, 3581, 3582, 3589, 3531]"
4,4,E,"Transportation, Communications, Electric, Gas,...","[transportation, communication, electric, gas,...","[0.1109, 0.3394, 0.33709999999999996, -0.15739...","[4922, 5088, 4489, 4499, 4785, 4449]"
5,5,F,Wholesale Trade,"[wholesale, trade]","[0.03900000000000001, -0.2362, 0.087, -0.0871,...","[5085, 7353, 5159, 5149, 5049, 5141]"
6,6,G,Retail Trade,"[retail, trade]","[0.0456, -0.1482, -0.06309999999999999, 0.0079...","[5461, 7353, 5085, 5149, 5113, 5091]"
7,7,H,"Finance, Insurance, And Real Estate","[finance, insurance, real, estate, ]","[0.4942, -0.2691, -0.2666, -0.0184999999999999...","[6162, 6515, 6519, 6531, 6331, 6311]"
8,8,I,Services,[service],"[0.0345, 0.1283, 0.1887, -0.0616, -0.1113, 0.0...","[8711, 4121, 8744, 8721, 8748, 8712]"
9,9,J,Public Administration,"[public, administration, ]","[0.1037, 0.0487, 0.1012, -0.05159999999999999,...","[9531, 8743, 8742, 9511, 9199, 7376]"
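
Two of the token lists above (divisions H and J) look like they end with a blank token, probably whitespace that survived the cleaning step. A blank token has no word vector, so it only adds zeros to the sum, but it can be removed with a small post-processing step. This is a sketch with a hypothetical helper name, and the trailing tokens in the example are my guess at what the real ones contain.

```python
def drop_blank_tokens(token_lists):
    """Remove tokens that are empty or consist only of whitespace."""
    return [[tok for tok in toks if tok.strip()] for toks in token_lists]


# Token lists modelled on the output above; the blank tokens are assumed.
toks = [
    ["finance", "insurance", "real", "estate", " "],
    ["public", "administration", ""],
    ["mining"],
]
cleaned = drop_blank_tokens(toks)
print(cleaned)
```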


In [51]:
sic.loc[[5085, 7353, 5159, 5149, 5049, 5141], "desc"]

code
5085             Industrial Supplies Merchant Wholesalers
7353                All Other Specialty Trade Contractors
5159    Other Farm Product Raw Material Merchant Whole...
5149    Other Grocery and Related Products Merchant Wh...
5049    Other Professional Equipment and Supplies Merc...
5141            General Line Grocery Merchant Wholesalers
Name: desc, dtype: object
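
Rather than looking up each list of codes by hand, the descriptions can be attached directly to the division table. A minimal sketch on a one-row toy table, using three code and description pairs from the output earlier in this notebook (`desc_by_code`, `div_toy`, and the `nns_desc` column are hypothetical names; on the real data `desc_by_code` would be `sic['desc']`):

```python
import pandas as pd

# Code -> description lookup, standing in for sic['desc'].
desc_by_code = pd.Series(
    {
        919: "Other Marine Fishing",
        851: "Support Activities for Forestry",
        132: "Tobacco Farming",
    },
    name="desc",
)

# One-row stand-in for sic_div with its list of neighbor codes.
div_toy = pd.DataFrame({"division": ["A"], "nns_codes": [[919, 851, 132]]})

# Look up every code in each list, keeping the neighbor order.
div_toy["nns_desc"] = div_toy["nns_codes"].apply(
    lambda codes: desc_by_code.loc[codes].tolist()
)
print(div_toy.loc[0, "nns_desc"])
```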

We can now use this process on any text related to a business.