Library imports.

In [9]:
import itertools
import pathlib

import nltk
import pandas as pd
pd.set_option('display.max_columns', None)

Data loading. The **try-except** block prevents unnecessary reloading: if `raw_data` already exists in the session, the file is not read again. The *thousands* argument of `read_csv` ensures that numbers written with a thousands separator are parsed as numbers rather than as strings.

In [10]:
raw_data_path = pathlib.Path('Data', 'ILS_OM597.csv.gz')
try:
    # Referencing raw_data raises NameError if it has not been loaded yet
    raw_data.shape
except NameError:
    if raw_data_path.exists():
        raw_data = pd.read_csv(raw_data_path, thousands=',', compression='gzip')
    else:
        print(f'Data file does not exist at {raw_data_path}')
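
The guard can be illustrated without the data file. The sketch below assumes a hypothetical `load_data` function standing in for `pd.read_csv`. It catches `NameError` specifically, the exception Python raises when a name has not been defined yet, so that unrelated errors are not silently swallowed.

```python
load_log = []  # Records every call to the (hypothetical) expensive load


def load_data():
    """Stand-in for an expensive call such as pd.read_csv (hypothetical)."""
    load_log.append('loaded')
    return [1, 2, 3]


# Simulate running the same notebook cell three times
for _ in range(3):
    try:
        cached_data  # Raises NameError only if the name is not defined yet
    except NameError:
        cached_data = load_data()

print(len(load_log))  # Number of times the load actually ran
```

The load runs on the first pass only; later passes find the name already defined and skip it.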

Check dtypes

In [11]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7320458 entries, 0 to 7320457
Data columns (total 24 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   Invoice/Item Number    object 
 1   Date                   object 
 2   Store Number           int64  
 3   Store Name             object 
 4   Address                object 
 5   City                   object 
 6   Zip Code               float64
 7   Store Location         object 
 8   County Number          float64
 9   County                 object 
 10  Category               float64
 11  Category Name          object 
 12  Vendor Number          float64
 13  Vendor Name            object 
 14  Item Number            int64  
 15  Item Description       object 
 16  Pack                   int64  
 17  Bottle Volume (ml)     int64  
 18  State Bottle Cost      float64
 19  State Bottle Retail    float64
 20  Bottles Sold           int64  
 21  Sale (Dollars)         float64
 22  Volume Sold (Liter

Print the first 3 rows. The `Vendor Name` column has inconsistent casing: `CONSTELLATION BRANDS INC` is upper case, while `E & J Gallo Winery` is title case. We have not checked, but other text columns may have the same problem.

In [12]:
raw_data.head(3)

Unnamed: 0,Invoice/Item Number,Date,Store Number,Store Name,Address,City,Zip Code,Store Location,County Number,County,Category,Category Name,Vendor Number,Vendor Name,Item Number,Item Description,Pack,Bottle Volume (ml),State Bottle Cost,State Bottle Retail,Bottles Sold,Sale (Dollars),Volume Sold (Liters),Volume Sold (Gallons)
0,INV-09076500035,2017-12-07,5490,Rockwell Area Market,202 4th St North,Rockwell,50469.0,POINT (-93.188172 42.986351000000006),17.0,CERRO GORD,1012100.0,Canadian Whiskies,115.0,CONSTELLATION BRANDS INC,10548,Black Velvet Toasted Caramel,6,1750,12.99,19.49,1,19.49,1.75,0.46
1,INV-09092300032,2017-12-07,4997,Downtown Pantry,"218, 6th Ave #101",Des Moines,50309.0,POINT (-93.62452 41.585651),77.0,POLK,1012100.0,Canadian Whiskies,260.0,DIAGEO AMERICAS,11290,Crown Royal Canadian Whisky Mini,10,300,7.35,11.03,10,110.3,3.0,0.79
2,INV-09070400014,2017-12-07,2567,Hy-Vee Drugstore / Davenport,2200 West Kimberly,Davenport,52806.0,POINT (-90.608201 41.560663),82.0,SCOTT,1051100.0,American Brandies,205.0,E & J Gallo Winery,73755,E & J Apple,12,750,6.0,9.0,3,27.0,2.25,0.59


Title case selected columns

In [13]:
columns_to_titlecase = [
    'City', 
    'County', 
    'Category Name', 
    'Vendor Name', 
    'Item Description',
]

for column in columns_to_titlecase:
    raw_data[column] = raw_data[column].str.title()
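
Title casing is a blunt tool. pandas documents `.str.title()` as equivalent to Python's built-in `str.title()`, so the built-in method can be used to preview its effect. The values below are illustrative: only `'CERRO GORD'` and `'CONSTELLATION BRANDS INC'` come from the rows shown earlier, and the other two are made up to show edge cases.

```python
# Illustrative values; the last two are hypothetical edge cases
samples = [
    'CERRO GORD',
    'CONSTELLATION BRANDS INC',
    'COCKTAILS / RTD',
    "JACK DANIEL'S",
]

title_cased = {value: value.title() for value in samples}
for original, converted in title_cased.items():
    print(f'{original!r} -> {converted!r}')
```

Acronyms lose their capitalization (`RTD` becomes `Rtd`) and the letter after an apostrophe is capitalized (`Daniel'S`). That is harmless for grouping similar names, but it may matter if the values are later displayed.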

Inspecting the category names, it seems that some categories have distinct names but actually refer to the same group of products. For example, the categories `Cocktails / Rtd` and `Cocktails /Rtd` very likely refer to the same set of products.

In [16]:
raw_data['Category Name'].sort_values().unique()

array(['100% Agave Tequila', 'Aged Dark Rum', 'American Brandies',
       'American Cordials & Liqueur', 'American Cordials & Liqueurs',
       'American Distilled Spirit Specialty',
       'American Distilled Spirits Specialty', 'American Dry Gins',
       'American Flavored Vodka', 'American Schnapps',
       'American Sloe Gins', 'American Vodka', 'American Vodkas',
       'Blended Whiskies', 'Bottled In Bond Bourbon', 'Canadian Whiskies',
       'Cocktails / Rtd', 'Cocktails /Rtd', 'Coffee Liqueurs',
       'Corn Whiskies', 'Cream Liqueurs',
       'Delisted / Special Order Items', 'Distilled Spirits Specialty',
       'Flavored Gin', 'Flavored Rum', 'Gold Rum', 'Imported Brandies',
       'Imported Cordials & Liqueur', 'Imported Cordials & Liqueurs',
       'Imported Distilled Spirit Specialty',
       'Imported Distilled Spirits Specialty', 'Imported Dry Gins',
       'Imported Flavored Vodka', 'Imported Gins', 'Imported Schnapps',
       'Imported Vodka', 'Imported Vodkas', 'Iow

We will try to write an algorithm to detect *similar* strings on the basis of *edit distance*.

In [15]:
s1 = 'vodka is awesome'
s2 = 'vodkas'
s3 = 'momma'

for primary_string in [s1, s2, s3]:
    for comparison_string in [s1, s2, s3]:
        if primary_string != comparison_string:
            edit_distance = nltk.edit_distance(primary_string, comparison_string)
            print(f'{edit_distance} edits are required to convert "{primary_string}" to "{comparison_string}"')

10 edits are required to convert "vodka is awesome" to "vodkas"
14 edits are required to convert "vodka is awesome" to "momma"
10 edits are required to convert "vodkas" to "vodka is awesome"
4 edits are required to convert "vodkas" to "momma"
14 edits are required to convert "momma" to "vodka is awesome"
4 edits are required to convert "momma" to "vodkas"
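
To make the numbers above less opaque, here is a minimal sketch of how an edit distance can be computed. With its default arguments, `nltk.edit_distance` computes the Levenshtein distance: the smallest number of single-character insertions, deletions, and substitutions needed to turn one string into the other. The function below is an illustrative reimplementation written for this notebook, not NLTK's own code.

```python
def levenshtein(source, target):
    """Minimum number of single-character insertions, deletions and substitutions."""
    # previous_row[j] is the distance between the processed prefix of source and target[:j]
    previous_row = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current_row = [i]  # Turning source[:i] into '' takes i deletions
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current_row.append(min(
                previous_row[j] + 1,                      # delete source_char
                current_row[j - 1] + 1,                   # insert target_char
                previous_row[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous_row = current_row
    return previous_row[-1]


print(levenshtein('vodkas', 'momma'))
print(levenshtein('vodka is awesome', 'vodkas'))
```

Because the distance counts edits rather than a proportion, a fixed threshold is much stricter for long strings than for short ones.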


#### This is where we left off

In [8]:
column_name = 'Category Name'
min_distance = 3  # Two names count as related when their edit distance is at most this value

unique_values = raw_data[column_name].dropna().unique().tolist()

related_words = []  # Groups of names that appear to refer to the same category
grouped_words = []  # Names that have already been assigned to a group
for primary_word in unique_values:
    inner_list = [primary_word]
    for secondary_word in unique_values:
        # Skip names that are already grouped, and skip the primary word itself
        if secondary_word in grouped_words or secondary_word == primary_word:
            continue
        if nltk.edit_distance(primary_word, secondary_word) <= min_distance:
            inner_list.append(secondary_word)
            grouped_words.append(secondary_word)

    grouped_words.append(primary_word)
    if len(inner_list) > 1:
        related_words.append(inner_list)

related_words

name_mapper = {}
for sublist in related_words:
    name_mapper[sublist[0]] = sublist
name_mapper

[['American Vodkas', 'American Vodka'],
 ['Cocktails /Rtd', 'Cocktails / Rtd'],
 ['Flavored Rum', 'Flavored Gin'],
 ['Imported Vodkas', 'Imported Vodka'],
 ['American Cordials & Liqueur', 'American Cordials & Liqueurs'],
 ['Imported Cordials & Liqueurs', 'Imported Cordials & Liqueur'],
 ['Temporary & Specialty Packages',
  'Temporary &  Specialty Packages',
  'Temporary  & Specialty Packages'],
 ['American Distilled Spirit Specialty',
  'American Distilled Spirits Specialty'],
 ['Imported Distilled Spirit Specialty',
  'Imported Distilled Spirits Specialty']]
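
One possible next step, not yet part of the notebook, is to turn the groups into a flat replacement map. The sketch below assumes that the first name in each group is the one to keep and adds a hypothetical manual review list, since the output above also pairs `Flavored Rum` with `Flavored Gin`, which are plainly different products.

```python
# A subset of the groups from the output above
related_words = [
    ['American Vodkas', 'American Vodka'],
    ['Cocktails /Rtd', 'Cocktails / Rtd'],
    ['Flavored Rum', 'Flavored Gin'],
    ['Temporary & Specialty Packages',
     'Temporary &  Specialty Packages',
     'Temporary  & Specialty Packages'],
]

# Hypothetical manual review step: groups whose members are genuinely different
false_positives = [{'Flavored Rum', 'Flavored Gin'}]

replacement_map = {}
for group in related_words:
    if set(group) in false_positives:
        continue
    canonical = group[0]  # Assumption: keep the first name in each group
    for variant in group[1:]:
        replacement_map[variant] = canonical

# Names without an entry in the map are left unchanged
cleaned = [replacement_map.get(name, name)
           for name in ['American Vodka', 'Flavored Gin', 'Gold Rum']]
print(cleaned)
```

With a pandas Series, passing a dictionary like this to `Series.replace` (for example `raw_data[column_name].replace(replacement_map)`) should apply the same substitution to the whole column.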

**A Different Approach**: Your colleague, Natalie Patton, pointed out that the *difflib* library, which is included in the Python standard library, can be used for the similarity comparison task. An example follows.

In [10]:
demo_difflib = False

if demo_difflib:

    import difflib

    for current_word in unique_values:

        similar_words = difflib.get_close_matches(current_word, unique_values, cutoff = 0.9)
        similar_words = [word for word in similar_words if word != current_word]
        if similar_words:
            print(current_word)

            print('  Similar words:')
            for similar_word in similar_words:
                print(f'   -{similar_word}')
            print('-'*50)
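
To see what the `cutoff` value means, it helps to look at the score behind it. `get_close_matches` ranks candidates by `difflib.SequenceMatcher(...).ratio()`, which is `2 * M / T`, where `M` is the number of matching characters and `T` is the combined length of both strings. The short sketch below uses a few category names from the earlier output; the helper name `similarity` is made up for illustration.

```python
import difflib

# A few category names taken from the output earlier in the notebook
names = ['American Vodka', 'American Vodkas', 'Flavored Gin', 'Flavored Rum']


def similarity(first, second):
    """Similarity ratio between 0 and 1, as computed by difflib.SequenceMatcher."""
    return difflib.SequenceMatcher(None, first, second).ratio()


vodka_ratio = similarity('American Vodkas', 'American Vodka')
flavored_ratio = similarity('Flavored Rum', 'Flavored Gin')
print(f'{vodka_ratio:.3f} {flavored_ratio:.3f}')

# get_close_matches keeps candidates whose ratio is at least the cutoff, best first
matches = difflib.get_close_matches('Flavored Rum', names, cutoff=0.9)
print(matches)
```

Because the ratio is relative to string length, the rum and gin pair that the edit-distance threshold of 3 grouped together scores well below a cutoff of 0.9, while the singular and plural vodka names score above it.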

In [23]:
raw_data['Date'] = pd.to_datetime(raw_data['Date'])