## Same class and a working example

The provided .py file has the same class

In [1]:
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
# import nltk
# from nltk.corpus import stopwords
import glob
# stop = stopwords.words('english')
from collections import defaultdict

class String_Matcher():
    """
    Class uses the Levenstein distance to find similar strings.
    Quick and dirty analysis shows that order of words doesn't make much difference, as well as, a removal of punctuation and stop-words.
    However, light cleaning is possible via clean_column()

    How to use:
    sm = String_Matcher(filepath)
    l = sm.created_matched_list('item_title', 'ipod') will create a list with the top 10 matches for the query 'ipod' from the column df['item_title'] 
    matched_df = sm.create_matched_df('item_title') will create a dataframe with two columns. The first column are item titles, the second column are the top 10
    matches selected from the same column. 
    """

    def __init__(self, filepath):
        """
        filepath: string with a path to the folder containing .txt files to be processed
        """
        l = [pd.read_csv(filename, header = None, sep = '\t', names = ['item_id', 'site', 'category_id', 'item_title'])\
            for filename in glob.glob(filepath + '/*.txt')]
        self.df = pd.concat(l, axis = 0)


    def clean_column(self, col_name):
        """
        (Optional: doesn't seem to change the matching quality much)
        Takes column 'col_name' and creates a new column 'col_name_clean'
        where all characters are in the lower case and stop words (using nltk.corpus) are removed
        
        col_name: String
        """
        self.df[col_name + str('_clean')] = self.df[col_name].str.lower()
        self.df[col_name + str('_clean')].apply(lambda x: [item for item in x if item not in stop])
        return self.df


    def create_matched_list(self, col_name, str_to_match):
        """
        Returns top10 matches as a list of strings
        from a column df[col_name] for a provided string str_to_match
        One caveat: it looks across all the files (because they're concatenated into one dataframe) that might or might not
        be a good idea depending on the context (for files containing info about different categories, it's probably bad).
        But building a logic for separate files was taking time.
        
        col_name: string
        """
        res, choices, output = [], [], []
        for item in self.df[col_name].unique():  # create a list with string (columns values)
            choices.append(item)
        res = process.extract(str_to_match, choices, limit = 10)
        for line in res:
            output.append(line[0])
        return output


    def create_matched_df(self, col_name):
        """
         Returns pd.DataFrame with two columns: 'item_title' and 'matches.'
        For every item_title, there are 10 matches. IMPORTANT: it considers only unique values.
        If a dataframe consists of 10 titles but two of them are similar, it will return 90 (9 * 10) rows
        One caveat: it looks across all the files (because they're concatenated into one dataframe) that might or might not
        be a good idea depending on the context (for files containing info about different categories, it's probably bad).
        But building a logic with separate files was taking time.
        The provided score takes care of punctuation and word order.
        
        col_name: string
        """
        d = defaultdict(list)
        for k, v in enumerate(self.df[col_name].unique()):
            curr_res = []
            choices = list(self.df[col_name].iloc[:k]) + list(self.df[col_name].iloc[k + 1:])
            curr_res = process.extract(v, choices, limit = 10, scorer = fuzz.token_sort_ratio)
            for ix in range(len(curr_res)):
                d[v].append(curr_res[ix][0])
        df_from_d = pd.DataFrame.from_dict(d, orient = 'index', columns = ['Top1', 'Top2', 'Top3', 'Top4', 'Top5','Top6', 'Top7', 'Top8', 'Top9', 'Top10'])
        df_from_d_stacked = df_from_d.stack().reset_index()
        df_from_d_stacked.drop('level_1', axis = 1, inplace = True)
        df_from_d_stacked.rename(columns = {"level_0": "item_id", 0: "matches"}, inplace = True)
        return df_from_d_stacked
    
    
    def save_to_csv(self, df, filepath, filename):
        """
        Saves the provided dataset to the folder defined in 'filepath' with the name defined in 'filename'

        df: pd.DataFrame
        filepath: string
        filename: string
        """
        df.to_csv(filepath + "/" + filename)


Folder "small_files" contains four smaller (to make everything run faster) .txt files: cases_small.txt, cellphones_small.txt, laptops_small.txt, and mp3_small.txt. These are random rows taken from the original file.

In [2]:
# Instantiate a class with a path to data
sm = String_Matcher('/home/mkareev/small_files')

In [3]:
# self.df contains everything from the .txt files that exist in the folder
sm.df

Unnamed: 0,item_id,site,category_id,item_title
0,350946226735,0,20349,Connecticut Huskies Wordmark on BlackBerry Tor...
1,310801215722,0,20349,Der Delight Windows Bracket Case Battery Cover...
2,151052793175,0,20349,NFC Leather Housing Battery Flip Case Cover Sa...
3,141174411773,0,20349,FOR Sprint LG Optimus G LS970 HARD Protector S...
4,310647790715,0,20349,Skinit 505 Silhouettes Skin for LG Cosmos VN250
...,...,...,...,...
125,261207061911,0,73839,Apple iPod Nano 4th Gen 8GB Green *Worldwide ...
126,281252513057,0,73839,Apple iPod touch 4th Generation White (8 GB)
127,300952896992,0,73839,GPX ML861B (8 GB) Digital Media Player
128,171210028116,0,73839,Waterproof iPod shuffle Pink Earbuds Swim Kit ...


In [None]:
# We can clean the column if needed:
sm.clean_column('item_title')

In [5]:
# We can pass a query and return top 10 most similar items:
sm.create_matched_list('item_title', 'Optimus')

['LG Optimus L9 P769 - Black T-Mobile GOOD Works Great',
 '★ NEW LG Optimus L3 E400f (AU) Black ★Factory Unlocked ★',
 'FOR Sprint LG Optimus G LS970 HARD Protector Snap On Case Phone Cover Ninja',
 'HEAD CASE WATERCOLOURED ANIMALS BACK CASE COVER FOR LG OPTIMUS L3 II DUAL E435',
 'LG OPTIMUS G PRO E980 Pink Lace Case Hard Plastic Design Cover',
 'LG Optimus L7 P700 Black Protective Cover Case - Gold Chain',
 'For LG Optimus Elite-Matte Red Hard Case Skin Guard Phone Cover+Car Charger',
 'HEAD CASE DESIGNS SASSY GIRL HARD BACK CASE COVER FOR LG OPTIMUS L7 II DUAL P715',
 'HEAD CASE ROYALTY CHAIN SNAP-ON BACK CASE COVER FOR LG OPTIMUS L7 II DUAL P715',
 'Leather Case Pouch for Straight Talk/Net10 LG LG900g 900g Optimus Q, Optimus Net']

In [6]:
# We can find top 10 most similar items for every unique item:
sm.create_matched_df('item_title')

Unnamed: 0,item_id,matches
0,Connecticut Huskies Wordmark on BlackBerry Tor...,D Blue Mesh Cover Case For Blackberry Curve 83...
1,Connecticut Huskies Wordmark on BlackBerry Tor...,EGC HTC Radar 4G Black Hard Cover Case
2,Connecticut Huskies Wordmark on BlackBerry Tor...,"Case for BlackBerry Curve 8520 8530,9300 9330 ..."
3,Connecticut Huskies Wordmark on BlackBerry Tor...,Clear (White) Frame Tpu Silicone Case Skin Cov...
4,Connecticut Huskies Wordmark on BlackBerry Tor...,BlackBerry Curve 8530 (Sprint) Clean ESN/No Co...
...,...,...
7025,iPod Mini 6 Gig Silver,Apple iPod nano 7th Generation Silver 16 GB DE...
7026,iPod Mini 6 Gig Silver,Apple iPod nano 2nd Generation Silver (2 GB)-f...
7027,iPod Mini 6 Gig Silver,Note 2 Verizon. White. MINT
7028,iPod Mini 6 Gig Silver,Rio 600 (32 MB) Digital Media Player


As you see, the dataset has 706 rows but the above we're getting only 7030 (instead of 7060). It means that title of four items were exactly the same. As a result, we take this title only once and returned ten matches for it also only once.

Finally, we can save the resulting dataset:

In [7]:
result_df = sm.create_matched_df('item_title')

In [8]:
result_df

Unnamed: 0,item_id,matches
0,Connecticut Huskies Wordmark on BlackBerry Tor...,D Blue Mesh Cover Case For Blackberry Curve 83...
1,Connecticut Huskies Wordmark on BlackBerry Tor...,EGC HTC Radar 4G Black Hard Cover Case
2,Connecticut Huskies Wordmark on BlackBerry Tor...,"Case for BlackBerry Curve 8520 8530,9300 9330 ..."
3,Connecticut Huskies Wordmark on BlackBerry Tor...,Clear (White) Frame Tpu Silicone Case Skin Cov...
4,Connecticut Huskies Wordmark on BlackBerry Tor...,BlackBerry Curve 8530 (Sprint) Clean ESN/No Co...
...,...,...
7025,iPod Mini 6 Gig Silver,Apple iPod nano 7th Generation Silver 16 GB DE...
7026,iPod Mini 6 Gig Silver,Apple iPod nano 2nd Generation Silver (2 GB)-f...
7027,iPod Mini 6 Gig Silver,Note 2 Verizon. White. MINT
7028,iPod Mini 6 Gig Silver,Rio 600 (32 MB) Digital Media Player


In [9]:
sm.save_to_csv(result_df, '/home/mkareev/small_files', '12_18_merged.csv')