# COICOP Category Comparison

Requires:

```
gensim==3.8.3
fuzzywuzzy==0.18.0
pandas==1.0.5
```

Also, you must download Google's pretrained Word2Vec model: [https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing)

Unzip the download and place `GoogleNews-vectors-negative300.bin` in this directory.

> Note: the pretrained model is 1.5 GB

In [1]:
import pandas as pd
from fuzzywuzzy import process
from gensim.models import Word2Vec



## Load COICOP Categories

In [2]:
coicop = pd.read_csv('CL_COICOP_20200710_190018.csv')

In [3]:
coicop.head(10)

| | Order | Level | Code | Parent | Code.1 | Parent.1 | Description | Remark |
|---|---|---|---|---|---|---|---|---|
| 0 | 1824814 | 1 | 10 | | TOTAL | | Total | |
| 1 | 1824815 | 1 | 20 | | TOT_X_CP041_042 | | Total except actual rents | |
| 2 | 1824816 | 1 | 30 | | CP00 | | All-items HICP | |
| 3 | 1824817 | 1 | 40 | | CP01 | | Food and non-alcoholic beverages | |
| 4 | 1824818 | 2 | 50 | 40.0 | CP011 | CP01 | Food | |
| 5 | 1824819 | 3 | 60 | 50.0 | CP0111 | CP011 | Bread and cereals | |
| 6 | 1824820 | 4 | 70 | 60.0 | CP01111 | CP0111 | Rice | |
| 7 | 1824821 | 5 | 80 | 70.0 | CP01111A | CP01111 | Long-grain rice (1 kg) | |
| 8 | 1824822 | 4 | 90 | 60.0 | CP01112 | CP0111 | Flours and other cereals | |
| 9 | 1824823 | 5 | 100 | 90.0 | CP01112A | CP01112 | Wheat flour (1 kg) | |


## Load Data Sample

In [4]:
df = pd.read_csv('output-0000.csv', 
                 names=['product','price','unit','category','city','store','date'])

In [5]:
df.head()

| | product | price | unit | category | city | store | date |
|---|---|---|---|---|---|---|---|
| 0 | Debi Lilly Hydrangea 3 Stem - colors may varyrrn | $8.99 | ($8.99/each) | Flowers | LA | Vons | 2017-01-10 02:00:24 |
| 1 | California Grown Deluxe Bouquet - colors may vary | $14.59 | ($14.59/each) | Flowers | LA | Vons | 2017-01-10 02:00:24 |
| 2 | Alstroemeria 9 Stem - colors may vary | $6.79 | ($6.79/each) | Flowers | LA | Vons | 2017-01-10 02:00:24 |
| 3 | 1.50 LB Cinnamon Sugar Tortilla Chips | $5.99 | ($3.99/lb) | Flowers | LA | Vons | 2017-01-10 02:00:24 |
| 4 | Lily Stargazer 3 Stem - colors may vary | $8.99 | ($8.99/each) | Flowers | LA | Vons | 2017-01-10 02:00:24 |


## Compare with string similarity (fuzzywuzzy)

In [6]:
categories = list(df.category.unique())
coicop_categories = list(coicop.Description.unique())

In [7]:
print(f"There are {len(categories)} categories in the sample and {len(coicop_categories)} in the COICOP database")

There are 731 categories in the sample and 734 in the COICOP database


In [8]:
categories[:10]

['Flowers',
 'Toddler-Foods',
 'Whiskey',
 'Cigarettes',
 'Other-Animal-Care',
 'Zinc',
 'Trash-Bags--Outside',
 'Hot-Dogs-Franks',
 'Southern-Foods',
 'Rice']

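We can demonstrate the matching by looking up the term `Flowers` in the COICOP descriptions. `process.extractOne` returns the best-matching description together with its similarity score (0 to 100):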

In [9]:
process.extractOne('Flowers',coicop_categories)

('Gardens, plants and flowers', 90)

Now we can build a dictionary that maps each sample category to its best-matching COICOP description. To save time, we run only the first 25 categories:

In [10]:
category_map = {}

for c in categories[:25]:
    category_map[c] = process.extractOne(c,coicop_categories)[0]

In [11]:
category_map

{'Flowers': 'Gardens, plants and flowers',
 'Toddler-Foods': 'Food',
 'Whiskey': 'White bread, loaf (1 kg)',
 'Cigarettes': 'Cigarettes',
 'Other-Animal-Care': 'Other meats',
 'Zinc': 'Beef, minced (1 kg)',
 'Trash-Bags--Outside': 'Tea',
 'Hot-Dogs-Franks': 'Aeroplanes, microlight aircraft, gliders, hang-gliders and hot-air balloons',
 'Southern-Foods': 'Food',
 'Rice': 'Rice',
 'Tofu-Meat-Alternatives': 'Meat',
 'Sherbet-Sorbet': "Men's T-shirt, short sleeves (1 piece)",
 'Ice': 'Long-grain rice (1 kg)',
 'Gourmet-Deli-Condiments': 'Sauces, condiments',
 'Yogurt': 'Yoghurt',
 'Trail-Mix': 'Tea',
 'Soup-Mix-Meals': 'Meat',
 'Vinegar': 'Wine',
 'Oatmeal-Hot-Cereal': 'Aeroplanes, microlight aircraft, gliders, hang-gliders and hot-air balloons',
 'Pastries-Croissants': 'Crisps',
 'Soup-Cups-Ramen': 'Garments',
 'Water': 'Mineral or spring waters',
 'Microwaveable-Soup': 'Vegetables',
 'Ready-to-Serve-Soup': 'Garments for infants (0 to 2 years) and children (3 to 13 years)',
 'Wipes-Diaper

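You can see that this fails relatively frequently: `Whiskey` maps to `White bread, loaf (1 kg)` and `Zinc` to `Beef, minced (1 kg)`. One approach is to keep a match only when its similarity score exceeds a threshold (here 85) and to leave the category unmapped otherwise: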

In [12]:
category_map = {}

for c in categories[:25]:
    match = process.extractOne(c,coicop_categories)
    if match[1] > 85:
        category_map[c] = match[0]
    else:
        category_map[c] = ''

In [13]:
category_map

{'Flowers': 'Gardens, plants and flowers',
 'Toddler-Foods': 'Food',
 'Whiskey': '',
 'Cigarettes': 'Cigarettes',
 'Other-Animal-Care': 'Other meats',
 'Zinc': '',
 'Trash-Bags--Outside': '',
 'Hot-Dogs-Franks': 'Aeroplanes, microlight aircraft, gliders, hang-gliders and hot-air balloons',
 'Southern-Foods': 'Food',
 'Rice': 'Rice',
 'Tofu-Meat-Alternatives': 'Meat',
 'Sherbet-Sorbet': '',
 'Ice': 'Long-grain rice (1 kg)',
 'Gourmet-Deli-Condiments': '',
 'Yogurt': 'Yoghurt',
 'Trail-Mix': '',
 'Soup-Mix-Meals': '',
 'Vinegar': '',
 'Oatmeal-Hot-Cereal': 'Aeroplanes, microlight aircraft, gliders, hang-gliders and hot-air balloons',
 'Pastries-Croissants': '',
 'Soup-Cups-Ramen': '',
 'Water': 'Mineral or spring waters',
 'Microwaveable-Soup': '',
 'Ready-to-Serve-Soup': 'Garments for infants (0 to 2 years) and children (3 to 13 years)',
 'Wipes-Diaper-Refills': ''}
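
Even with the threshold, some surviving matches only coincide on a fragment of a word, for example `Ice` inside `Long-grain rice (1 kg)`. As a minimal sketch of one possible extra filter, we could additionally require that the category and the matched description share at least one whole word. The helpers `words` and `shares_whole_word` below are hypothetical names introduced for illustration; they are not part of fuzzywuzzy.

```python
import re

def words(text):
    # Lowercase and split on anything that is not a letter or digit,
    # so 'Hot-Dogs-Franks' becomes {'hot', 'dogs', 'franks'}
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def shares_whole_word(category, description):
    # True only if at least one complete word appears in both strings
    return bool(words(category) & words(description))

# Pairs taken from the thresholded mapping above
examples = [
    ("Ice", "Long-grain rice (1 kg)"),
    ("Flowers", "Gardens, plants and flowers"),
    ("Hot-Dogs-Franks",
     "Aeroplanes, microlight aircraft, gliders, hang-gliders and hot-air balloons"),
]

checks = {category: shares_whole_word(category, description)
          for category, description in examples}
print(checks)
```

This rejects the `Ice` match and keeps `Flowers`, but it still accepts `Hot-Dogs-Franks` because both strings contain the word `hot`. It is also strict: spelling variants and plurals such as `Yogurt`/`Yoghurt` or `Toddler-Foods`/`Food` share no whole word and would be dropped, so it trades recall for precision.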

## Compare with [Word2Vec](https://en.wikipedia.org/wiki/Word2vec)

In [14]:
import gensim.models
from gensim.utils import simple_preprocess

def tidy_sentence(sentence, vocabulary):
    # Lowercase and tokenize, then keep only words the model has a vector for
    return [word for word in simple_preprocess(sentence) if word in vocabulary]

def compute_sentence_similarity(sentence_1, sentence_2, model_wv):
    # Note: this rebuilds the full vocabulary set on every call
    vocabulary = set(model_wv.index2word)
    tokens_1 = tidy_sentence(sentence_1, vocabulary)
    tokens_2 = tidy_sentence(sentence_2, vocabulary)
    # n_similarity raises an error if either token list is empty
    return model_wv.n_similarity(tokens_1, tokens_2)

# Load the pretrained GoogleNews vectors (binary word2vec format)
wv = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

We can try this out:

In [15]:
compute_sentence_similarity('Flowers', 'Garden plants', wv)

0.54233813

In [16]:
compute_sentence_similarity('Flowers', 'toilet paper', wv)

0.18995756
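
To make these scores easier to interpret: as far as we understand gensim's `n_similarity`, it averages the word vectors of each token list and returns the cosine similarity of the two averages. The sketch below reproduces that idea in plain Python. The two-dimensional `toy_vectors` are made up for illustration and have nothing to do with the real 300-dimensional GoogleNews vectors; `mean_vector`, `cosine` and `toy_n_similarity` are hypothetical helper names.

```python
import math

def mean_vector(vectors):
    # Component-wise average of a list of equal-length vectors
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def toy_n_similarity(tokens_1, tokens_2, vectors):
    # Average each token list, then compare the two averages
    return cosine(mean_vector([vectors[t] for t in tokens_1]),
                  mean_vector([vectors[t] for t in tokens_2]))

# Made-up 2-d vectors, purely for illustration
toy_vectors = {
    "flowers": [1.0, 0.0],
    "garden": [1.0, 1.0],
    "plants": [1.0, 0.0],
    "paper": [0.0, 1.0],
}

close = toy_n_similarity(["flowers"], ["garden", "plants"], toy_vectors)
far = toy_n_similarity(["flowers"], ["paper"], toy_vectors)
print(close, far)
```

Because the token vectors are averaged before comparison, long descriptions with many unrelated words tend to dilute the contribution of the one word that matters.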

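Below, we take the category `Flowers` and compare it against every COICOP category, keeping track of the best match seen so far.

Unfortunately, this process is quite slow, likely in part because every call to `compute_sentence_similarity` rebuilds the vocabulary set from `index2word`, but it seems to yield reasonable results.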

In [17]:
c = 'Flowers'

best = ('',0)

for c_ in coicop_categories:
    try:
        score = compute_sentence_similarity(c, c_, wv)
        if score > best[1]:
            best = (c_,score)
            print(f"New best:{best}")
    except Exception:
        # Skip descriptions that cannot be scored (e.g. no in-vocabulary tokens)
        continue

New best:('Total', 0.064134754)
New best:('Total except actual rents', 0.11047731)
New best:('All-items HICP', 0.24537386)
New best:('Food', 0.2550534)
New best:('Rice', 0.2682598)
New best:('Pizza and quiche', 0.2731087)
New best:('Pasta products and couscous', 0.29805464)
New best:('Pasta, without eggs (1 kg)', 0.34275413)
New best:('Other fresh, chilled or frozen edible meat', 0.35300505)
New best:('Eggs', 0.39279044)
New best:('Fruit', 0.45926985)
New best:('Root crops, non-starchy bulbs and mushrooms (fresh, chilled or frozen)', 0.46237177)
New best:('Dried vegetables, other preserved or processed vegetables', 0.46547884)
New best:('Dried vegetables', 0.4772619)
New best:('Other tubers and products of tuber vegetables', 0.48629558)
New best:('Garden furniture', 0.5104005)
New best:('Gardens, plants and flowers', 0.7861436)
New best:('Plants and flowers', 0.8398435)
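
The loop above can be wrapped into reusable functions. The following is a rough sketch under stated assumptions, not a tested replacement: `make_scorer` builds the vocabulary set once instead of on every comparison, and `best_match` picks the highest-scoring candidate. Both names are hypothetical. So that the sketch runs without the 1.5 GB model, it uses a fake `ToyVectors` class (with a word-overlap score standing in for real vector similarity) and a regex tokenizer; in the notebook we would instead pass `wv` and gensim's `simple_preprocess`.

```python
import re

def make_scorer(model_wv, tokenize):
    # Build the vocabulary set a single time and reuse it for every comparison
    vocabulary = set(model_wv.index2word)

    def score(sentence_1, sentence_2):
        tokens_1 = [w for w in tokenize(sentence_1) if w in vocabulary]
        tokens_2 = [w for w in tokenize(sentence_2) if w in vocabulary]
        if not tokens_1 or not tokens_2:
            return None  # nothing in the vocabulary to compare
        return model_wv.n_similarity(tokens_1, tokens_2)

    return score

def best_match(query, candidates, score):
    # Score every candidate, drop the ones that could not be scored,
    # and return the (candidate, score) pair with the highest score
    scored = [(c, score(query, c)) for c in candidates]
    scored = [(c, s) for c, s in scored if s is not None]
    return max(scored, key=lambda pair: pair[1], default=('', 0))

class ToyVectors:
    # Fake stand-in for the real KeyedVectors object (illustration only)
    def __init__(self, vocabulary):
        self.index2word = list(vocabulary)

    def n_similarity(self, tokens_1, tokens_2):
        # Word-overlap (Jaccard) score, NOT real word-vector similarity
        a, b = set(tokens_1), set(tokens_2)
        return len(a & b) / len(a | b)

def toy_tokenize(text):
    # Simple lowercase word tokenizer standing in for simple_preprocess
    return re.findall(r"[a-z]+", text.lower())

toy = ToyVectors(["flowers", "plants", "garden", "furniture", "rice"])
score = make_scorer(toy, toy_tokenize)
result = best_match(
    "Flowers",
    ["Rice", "Garden furniture", "Plants and flowers", "Total"],
    score,
)
print(result)
```

With the real model, the call would look like `best_match(c, coicop_categories, make_scorer(wv, simple_preprocess))`. Note that, unlike the loop above, this version does not print intermediate bests and would return a negative-scoring candidate if every score were below zero.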
