# Practical Business Python: Efficiently Cleaning Text with Pandas
Chris Moffitt. "Efficiently Cleaning Text with Pandas." _Practical Business Python_, 16 Feb. 2021, https://pbpython.com/text-cleaning.html.

In [32]:
import pandas as pd
import numpy as np
import sidetable

Demonstrate with an [open data set](https://data.iowa.gov/Sales-Distribution/Iowa-Liquor-Sales/m3tr-qhgy) of liquor sales from Iowa. Using the 2019 data, which is a large set: a 565 MB CSV with 24 columns and 2.3 million rows. Download [here](https://data.iowa.gov/Sales-Distribution/Iowa-Liquor-Sales/m3tr-qhgy).

In [33]:
df = pd.read_csv('2019_Iowa_Liquor_Sales.csv')
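With a file this size it can help to load only the columns needed for the cleaning exercise. A minimal sketch, assuming a tiny in-memory stand-in for the real CSV (the rows and the `small_df` name are illustrative only; the notebook itself loads every column):

```python
import io
import pandas as pd

# Tiny stand-in for the real file: real column names, illustrative rows.
sample_csv = io.StringIO(
    "Invoice/Item Number,Date,Store Name,Sale (Dollars)\n"
    "INV-1,01/02/2019,Sauce,224.64\n"
    "INV-2,01/02/2019,Hy-Vee Food Store / Dubuque,75.12\n"
)

# usecols skips columns that are not needed; parse_dates converts 'Date'
# from text to a datetime column while reading.
small_df = pd.read_csv(
    sample_csv,
    usecols=["Date", "Store Name", "Sale (Dollars)"],
    parse_dates=["Date"],
)
print(small_df.dtypes)
```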

In [3]:
df

Unnamed: 0,Invoice/Item Number,Date,Store Number,Store Name,Address,City,Zip Code,Store Location,County Number,County,...,Item Number,Item Description,Pack,Bottle Volume (ml),State Bottle Cost,State Bottle Retail,Bottles Sold,Sale (Dollars),Volume Sold (Liters),Volume Sold (Gallons)
0,INV-16681900011,01/02/2019,5286,Sauce,"108, College",Iowa City,52240.0,,52.0,JOHNSON,...,48099,Hennessy VS,24,200,6.24,9.36,24,224.64,4.80,1.26
1,INV-16681900027,01/02/2019,5286,Sauce,"108, College",Iowa City,52240.0,,52.0,JOHNSON,...,89191,Jose Cuervo Especial Reposado Mini,12,500,11.50,17.25,12,207.00,6.00,1.58
2,INV-16681900018,01/02/2019,5286,Sauce,"108, College",Iowa City,52240.0,,52.0,JOHNSON,...,8824,Lauder's,24,375,3.21,4.82,24,115.68,9.00,2.37
3,INV-16685400036,01/02/2019,2524,Hy-Vee Food Store / Dubuque,3500 Dodge St,Dubuque,52001.0,,31.0,DUBUQUE,...,35917,Five O'Clock Vodka,12,1000,4.17,6.26,12,75.12,12.00,3.17
4,INV-16690300035,01/02/2019,4449,Kum & Go #121 / Urbandale,12041 Douglas Pkwy,Urbandale,50322.0,,77.0,POLK,...,36304,Hawkeye Vodka,24,375,1.86,2.79,24,66.96,9.00,2.37
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2380340,INV-24281000038,12/31/2019,4557,Hometown Foods / Traer,420 Second St,Traer,50675.0,POINT (-92.468086 42.193386),86.0,TAMA,...,55106,Phillips Blackberry Brandy,12,750,4.76,7.14,4,28.56,3.00,0.79
2380341,INV-24284800010,12/31/2019,5514,Johncy's Liquor Store,585 Hwy 965 Ste D & E,North Liberty,52317.0,POINT (-91.60761 41.738129),52.0,JOHNSON,...,8828,Lauders,6,1750,11.18,16.77,6,100.62,10.50,2.77
2380342,INV-24274100040,12/31/2019,3932,Main Street Spirits / Mapleton,311 Main St,Mapleton,51034.0,POINT (-95.79375 42.165915),67.0,MONONA,...,80486,Tippy Cow Orange Cream,6,750,10.00,15.00,1,15.00,0.75,0.19
2380343,INV-24278900030,12/31/2019,5447,New Star / Raymond,101 Comerical Street,Raymond,50667.0,,7.0,BLACK HAWK,...,18048,Evan Williams Green Label,6,1750,11.50,17.25,6,103.50,10.50,2.77


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2380345 entries, 0 to 2380344
Data columns (total 24 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   Invoice/Item Number    object 
 1   Date                   object 
 2   Store Number           int64  
 3   Store Name             object 
 4   Address                object 
 5   City                   object 
 6   Zip Code               float64
 7   Store Location         object 
 8   County Number          float64
 9   County                 object 
 10  Category               float64
 11  Category Name          object 
 12  Vendor Number          float64
 13  Vendor Name            object 
 14  Item Number            int64  
 15  Item Description       object 
 16  Pack                   int64  
 17  Bottle Volume (ml)     int64  
 18  State Bottle Cost      float64
 19  State Bottle Retail    float64
 20  Bottles Sold           int64  
 21  Sale (Dollars)         float64
 22  Volume Sold (Liter

In [5]:
df.describe()

Unnamed: 0,Store Number,Zip Code,County Number,Category,Vendor Number,Item Number,Pack,Bottle Volume (ml),State Bottle Cost,State Bottle Retail,Bottles Sold,Sale (Dollars),Volume Sold (Liters),Volume Sold (Gallons)
count,2380345.0,2375581.0,2375581.0,2377427.0,2380344.0,2380345.0,2380345.0,2380345.0,2380345.0,2380345.0,2380345.0,2380345.0,2380345.0,2380345.0
mean,3903.901,51266.7,57.30555,1052185.0,264.7152,48361.75,12.42525,876.6395,10.34431,15.51888,11.2775,146.71,9.368987,2.46961
std,1138.921,988.187,27.27291,93298.29,137.0791,67082.07,8.108758,521.4263,8.568864,12.85549,31.31091,487.1768,38.24454,10.10336
min,2106.0,50002.0,1.0,1011100.0,33.0,159.0,1.0,20.0,0.89,1.34,1.0,1.34,0.02,0.0
25%,2624.0,50316.0,31.0,1012200.0,115.0,26828.0,6.0,750.0,5.5,8.25,3.0,33.75,1.5,0.39
50%,3952.0,51103.0,62.0,1031200.0,260.0,38177.0,12.0,750.0,8.25,12.38,6.0,75.36,4.8,1.26
75%,4971.0,52302.0,77.0,1062400.0,389.0,64864.0,12.0,1000.0,12.96,19.44,12.0,148.56,10.5,2.77
max,9042.0,57222.0,99.0,1901200.0,978.0,999292.0,48.0,6000.0,1749.12,2623.68,6750.0,78435.0,11812.5,3120.53


Use [sidetable](https://pypi.org/project/sidetable/) to summarize data.

In [6]:
df.stb.freq(['Store Name'], value='Sale (Dollars)', style=True, cum_cols=False)

| |Store Name|Sale (Dollars)|percent|
|:--|:-------------------------|:-------------|:------|
|0|Central City 2|11877164|3.40%|
|1|Hy-Vee #3 / BDI / Des Moines|11275152|3.23%|
|2|Hy-Vee Wine and Spirits / Iowa City|5001156|1.43%|
|3|Wilkie Liquors|3639515|1.04%|
|4|Lot-A-Spirits|3504665|1.00%|
|5|Costco Wholesale #788 / WDM|3178079|0.91%|
|6|Sam's Club 8162 / Cedar Rapids|3147579|0.90%|
|7|Benz Distributing|3082936|0.88%|
|8|Hy-Vee Food Store / Urbandale|3073798|0.88%|
|9|Sam's Club 6344 / Windsor Heights|2963108|0.85%|
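For reference, a rough plain-pandas sketch of the summary above, assuming the sidetable call sums the value column per group, sorts descending, and adds each group's share of the total. The `sales` frame and its numbers are made up for illustration:

```python
import pandas as pd

# Made-up miniature of the sales data.
sales = pd.DataFrame({
    "Store Name": ["A", "B", "A", "C"],
    "Sale (Dollars)": [100.0, 50.0, 25.0, 25.0],
})

# Total sales per store, largest first.
totals = (
    sales.groupby("Store Name")["Sale (Dollars)"]
    .sum()
    .sort_values(ascending=False)
    .reset_index()
)

# Each store's share of overall sales, in percent.
totals["percent"] = 100 * totals["Sale (Dollars)"] / totals["Sale (Dollars)"].sum()
print(totals)
```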


## Cleaning attempt no. 1
Use .loc with a boolean mask, built with the .str accessor (str.contains), over the 'Store Name' column.

First with regex matching (the default), then with regex turned off.

In [7]:
%%timeit
df.loc[df['Store Name'].str.contains('Hy-Vee', case=False), 'Store_Group_1'] = 'Hy-Vee'

1.14 s ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [8]:
%%timeit
df.loc[df['Store Name'].str.contains('Hy-Vee', case=False, regex=False), 'Store_Group_1'] = 'Hy-Vee'

684 ms ± 13.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [9]:
df['Store_Group_1'].value_counts(dropna=False)

NaN       1617777
Hy-Vee     762568
Name: Store_Group_1, dtype: int64
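What the `regex` flag changes, sketched on a handful of made-up store names (the `names` series is illustrative, not from the data set). With `regex=False` the pattern is treated as a literal substring, so something like `|` loses its special meaning:

```python
import pandas as pd

# Made-up names for illustration.
names = pd.Series(["Hy-Vee #3", "HY-VEE Wine", "Walmart 1", "Wal-Mart 2", "A|B Liquor"])

# Literal, case-insensitive substring match: no regex engine involved.
literal = names.str.contains("hy-vee", case=False, regex=False)

# Regex match (the default): '|' means "either alternative".
alternation = names.str.contains("Walmart|Wal-Mart", case=False)

# Same pattern taken literally: looks for the text 'Walmart|Wal-Mart' itself.
literal_pipe = names.str.contains("Walmart|Wal-Mart", case=False, regex=False)

print(literal.tolist(), alternation.tolist(), literal_pipe.tolist())
```

So `regex=False` is the cheaper choice whenever the search text is a plain string, and regex is only needed for patterns such as the Walmart alternation.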

## Cleaning attempt no. 2
Use np.select to run multiple searches and apply a specified value on match. Resources: [Article](https://www.dataquest.io/blog/tutorial-add-column-pandas-dataframe-based-on-if-else-condition/) on Dataquest, [presentation](https://docs.google.com/presentation/d/1X7CheRfv0n4_I21z4bivvsHt6IDxkuaiAuCclSzia1E/edit#slide=id.g635adc05c1_1_1840) by Nathan Cheever.

In [10]:
store_patterns = [
    (df['Store Name'].str.contains('Hy-Vee', case=False, regex=False), 'Hy-Vee'),
    (df['Store Name'].str.contains('Central City',
                                case=False,  regex=False), 'Central City'),
    (df['Store Name'].str.contains("Smokin' Joe's",
                                case=False,  regex=False), "Smokin' Joe's"),
    (df['Store Name'].str.contains('Walmart|Wal-Mart',
                                case=False), 'Wal-Mart'),
    (df['Store Name'].str.contains('Fareway Stores',
                                case=False,  regex=False), 'Fareway Stores'),
    (df['Store Name'].str.contains("Casey's",
                                case=False,  regex=False), "Casey's General Store"),
    (df['Store Name'].str.contains("Sam's Club", case=False,  regex=False), "Sam's Club"),
    (df['Store Name'].str.contains('Kum & Go',  regex=False, case=False), 'Kum & Go'),
    (df['Store Name'].str.contains('CVS',  regex=False, case=False), 'CVS Pharmacy'),
    (df['Store Name'].str.contains('Walgreens',  regex=False, case=False), 'Walgreens'),
    (df['Store Name'].str.contains('Yesway',  regex=False, case=False), 'Yesway Store'),
    (df['Store Name'].str.contains('Target Store',  regex=False, case=False), 'Target'),
    (df['Store Name'].str.contains('Quik Trip',  regex=False, case=False), 'Quik Trip'),
    (df['Store Name'].str.contains('Circle K',  regex=False, case=False), 'Circle K'),
    (df['Store Name'].str.contains('Hometown Foods',  regex=False,
                                case=False), 'Hometown Foods'),
    (df['Store Name'].str.contains("Bucky's", case=False,  regex=False), "Bucky's Express"),
    (df['Store Name'].str.contains('Kwik', case=False,  regex=False), 'Kwik Shop')
]

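It is easy to get conditions and values mismatched when using np.select, so each condition is paired with its value in a tuple to keep them together. The list of tuples then has to be split into two separate sequences using zip. np.select uses the first matching condition for each row, so the order of the list matters.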

In [11]:
store_criteria, store_values = zip(*store_patterns)
df['Store_Group_1'] = np.select(store_criteria, store_values, 'other')
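A small sketch of the same pairing-and-unzipping idea, using made-up names (`names`, `patterns`, and `groups` are illustrative). The second row matches both conditions, which shows that the first match in the list is the one np.select keeps:

```python
import numpy as np
import pandas as pd

# Made-up store names; the second one matches both patterns below.
names = pd.Series(["Kwik Shop #1", "Hy-Vee / Kwik Stop", "Sauce"])

# Each tuple keeps a condition next to the value it should produce.
patterns = [
    (names.str.contains("Hy-Vee", case=False, regex=False), "Hy-Vee"),
    (names.str.contains("Kwik", case=False, regex=False), "Kwik Shop"),
]

# zip(*...) turns the list of pairs into one tuple of conditions
# and one tuple of values, the two arguments np.select expects.
criteria, values = zip(*patterns)
groups = np.select(criteria, values, "other")
print(groups.tolist())
```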

Once again use sidetable to summarize:

In [12]:
df.stb.freq(['Store_Group_1'], value='Sale (Dollars)', style=True, cum_cols=False)

| |Store_Group_1|Sale (Dollars)|percent|
|:--|:-------------------------|:-------------|:------|
|0|Hy-Vee|126265195|36.16%|
|1|other|112733367|32.28%|
|2|Fareway Stores|23146939|6.63%|
|3|Wal-Mart|22641682|6.48%|
|4|Sam's Club|19604085|5.61%|
|5|Central City|14108944|4.04%|
|6|Casey's General Store|11351935|3.25%|
|7|Kum & Go|6019449|1.72%|
|8|Walgreens|2942270|0.84%|
|9|Target|2904611|0.83%|


Almost a third of revenue still lands in "other" accounts. If an account doesn't match any pattern, we can keep its 'Store Name' instead of lumping it together with 'other'. The following passes None as the np.select default and then uses combine_first to fill in the None values with 'Store Name':

In [28]:
%%timeit
df['Store_Group_1'] = np.select(store_criteria, store_values, None)
df['Store_Group_1'] = df['Store_Group_1'].combine_first(df['Store Name'])

419 ms ± 4.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
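The None-then-combine_first step, sketched on three made-up rows (the `df_small` frame and `Store_Group` column are illustrative names). Rows without a match come back from np.select as None, and combine_first fills exactly those from the original name:

```python
import numpy as np
import pandas as pd

# Made-up rows: two match the pattern, one does not.
df_small = pd.DataFrame({"Store Name": ["Hy-Vee #3", "Wilkie Liquors", "Hy-Vee Wine"]})
is_hyvee = df_small["Store Name"].str.contains("Hy-Vee", case=False, regex=False)

# Unmatched rows get None instead of a catch-all label.
df_small["Store_Group"] = np.select([is_hyvee], ["Hy-Vee"], None)

# combine_first keeps existing values and fills the missing ones
# from the other series, here the original store name.
df_small["Store_Group"] = df_small["Store_Group"].combine_first(df_small["Store Name"])
print(df_small)
```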


Now look at the data:

In [14]:
df.stb.freq(['Store_Group_1'], value='Sale (Dollars)', style=True, cum_cols=False)

| |Store_Group_1|Sale (Dollars)|percent|
|:--|:-------------------------|:-------------|:------|
|0|Hy-Vee|126265195|36.16%|
|1|Fareway Stores|23146939|6.63%|
|2|Wal-Mart|22641682|6.48%|
|3|Sam's Club|19604085|5.61%|
|4|Central City|14108944|4.04%|
|5|Casey's General Store|11351935|3.25%|
|6|Kum & Go|6019449|1.72%|
|7|Wilkie Liquors|3639515|1.04%|
|8|Lot-A-Spirits|3504665|1.00%|
|9|Costco Wholesale #788 / WDM|3178079|0.91%|


## Cleaning attempt no. 3

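Matt Harrison's generalize function. It loops over (search text, replacement) pairs, replaces matching rows with Series.where, and tracks which rows matched at least once.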

In [34]:
def generalize(ser, match_name, default=None, regex=False, case=False):
    """ Search a series for text matches.
    Based on code from https://www.metasnake.com/blog/pydata-assign.html

    ser: pandas series to search
    match_name: tuple containing text to search for and text to use for normalization
    default: If no match, use this to provide a default value, otherwise use the original text
    regex: Boolean to indicate if match_name contains a  regular expression
    case: Case sensitive search

    Returns a pandas series with the matched value

    """
    seen = None
    for match, name in match_name:
        mask = ser.str.contains(match, case=case, regex=regex)
        if seen is None:
            seen = mask
        else:
            seen |= mask
        ser = ser.where(~mask, name)
    if default:
        ser = ser.where(seen, default)
    else:
        ser = ser.where(seen, ser.values)
    return ser
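The function leans on Series.where, which keeps a value where the condition is True and replaces it where the condition is False. A minimal sketch of the two ways it is used above, on made-up names (`ser`, `replaced`, and `with_default` are illustrative):

```python
import pandas as pd

# Made-up names for illustration.
ser = pd.Series(["Hy-Vee #3", "Sauce", "Hy-Vee Wine"])
mask = ser.str.contains("Hy-Vee", case=False, regex=False)

# Keep rows where ~mask is True (no match); replace the matches.
replaced = ser.where(~mask, "Hy-Vee")

# Keep rows that matched something; replace everything else with a default.
with_default = replaced.where(mask, "other")

print(replaced.tolist(), with_default.tolist())
```

This is why the mask is negated inside the loop: the condition passed to where marks the rows to leave alone, not the rows to change.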

New pattern list to apply:

In [35]:
store_patterns_2 = [('Hy-Vee', 'Hy-Vee'), ("Smokin' Joe's", "Smokin' Joe's"),
                    ('Central City', 'Central City'),
                    ('Costco Wholesale', 'Costco Wholesale'),
                    ('Walmart', 'Walmart'), ('Wal-Mart', 'Walmart'),
                    ('Fareway Stores', 'Fareway Stores'),
                    ("Casey's", "Casey's General Store"),
                    ("Sam's Club", "Sam's Club"), ('Kum & Go', 'Kum & Go'),
                    ('CVS', 'CVS Pharmacy'), ('Walgreens', 'Walgreens'),
                    ('Yesway', 'Yesway Store'), ('Target Store', 'Target'),
                    ('Quik Trip', 'Quik Trip'), ('Circle K', 'Circle K'),
                    ('Hometown Foods', 'Hometown Foods'),
                    ("Bucky's", "Bucky's Express"), ('Kwik', 'Kwik Shop')]

Apply the patterns to the data:

In [17]:
df['Store_Group_2'] = generalize(df['Store Name'], store_patterns_2)

In [18]:
%%timeit
df['Store_Group_2'] = generalize(df['Store Name'], store_patterns_2)

11.9 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


That took a crap-ton longer: 11.9 s versus 419 ms for the timed np.select cell, roughly 28 times slower. I don't care how "elegant" it is, that's a hefty price to pay!

In [19]:
df.stb.freq(['Store_Group_2'], value='Sale (Dollars)', style=True, cum_cols=False)

| |Store_Group_2|Sale (Dollars)|percent|
|:--|:-------------------------|:-------------|:------|
|0|Hy-Vee|126265195|36.16%|
|1|Fareway Stores|23146939|6.63%|
|2|Walmart|22641682|6.48%|
|3|Sam's Club|19604085|5.61%|
|4|Central City|14108944|4.04%|
|5|Casey's General Store|11351935|3.25%|
|6|Kum & Go|6019449|1.72%|
|7|Costco Wholesale|5777393|1.65%|
|8|Wilkie Liquors|3639515|1.04%|
|9|Lot-A-Spirits|3504665|1.00%|


## What about data types?

Converting strings to the string data type doesn't seem to make a difference, but converting to a category type using astype can:

In [23]:
df['Store Name'] = df['Store Name'].astype('category')

In [27]:
%%timeit
store_criteria, store_values = zip(*store_patterns)
df['Store_Group_3'] = np.select(store_criteria, store_values, None)
df['Store_Group_3'] = df['Store_Group_3'].combine_first(df['Store Name'])

387 ms ± 3.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
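What the category conversion does, sketched on a made-up series of repeated names (`names` and `as_category` are illustrative). Each distinct name is stored once and the rows hold small integer codes, which, as far as I understand, is also why string methods have less work to do on such a column:

```python
import pandas as pd

# Three distinct made-up names repeated many times, like the real column.
names = pd.Series(["Hy-Vee Food Store / Dubuque", "Sauce", "Wilkie Liquors"] * 1000)
as_category = names.astype("category")

# Distinct values are stored once; rows refer to them by code.
print(len(as_category.cat.categories))

# Compare memory use of the plain and the categorical column.
object_bytes = names.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)

# The .str accessor still works on a categorical column of strings.
hyvee_rows = as_category.str.contains("Hy-Vee", regex=False).sum()
```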




## Lookup table
The data set has a lot of duplicated store names. Build a lookup table and run the resource-intensive function only once per unique string.

First convert back to string from category:


In [36]:
df['Store Name'] = df['Store Name'].astype('string')

Now build lookup table:

In [37]:
lookup_df = pd.DataFrame()
lookup_df['Store Name'] = df['Store Name'].unique()
lookup_df['Store_Group_5'] = generalize(lookup_df['Store Name'], store_patterns_2)

In [38]:
lookup_df

| |Store Name|Store_Group_5|
|:--|:-------------------------|:------------------------|
|0|Sauce|Sauce|
|1|Hy-Vee Food Store / Dubuque|Hy-Vee|
|2|Kum & Go #121 / Urbandale|Kum & Go|
|3|IDA Liquor|IDA Liquor|
|4|Lake View Foods|Lake View Foods|
|...|...|...|
|1754|Hometown Foods - Conrad|Hometown Foods|
|1755|Casey's General Store #72 / Tipton|Casey's General Store|
|1756|Hometown Foods - Hubbard|Hometown Foods|
|1757|Shortee's Pit Stop / Speedway Cafe|Shortee's Pit Stop / Speedway Cafe|


In [40]:
df = pd.merge(df, lookup_df, how='left')

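For some reason, the merge above blows up when I try to use %%timeit. A likely cause: %%timeit runs the cell body inside a function, so assigning to df makes it a local name there, and reading df on the right-hand side then fails before it has a value.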
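The lookup-table idea in miniature, as a sketch with made-up rows. The `normalize` function is a hypothetical stand-in for the expensive cleaning step, and the explicit `on=` and `validate=` arguments are my additions to make the join key and the one-row-per-name assumption visible:

```python
import pandas as pd

# Made-up sales rows with repeated store names.
sales = pd.DataFrame({
    "Store Name": ["Hy-Vee #3", "Sauce", "Hy-Vee #3", "Hy-Vee Wine", "Sauce"],
    "Sale (Dollars)": [10.0, 20.0, 30.0, 40.0, 50.0],
})

def normalize(name):
    """Hypothetical stand-in for an expensive cleaning function."""
    return "Hy-Vee" if "hy-vee" in name.lower() else name

# Run the cleaning once per distinct name, not once per row.
lookup = pd.DataFrame({"Store Name": sales["Store Name"].unique()})
lookup["Store_Group"] = lookup["Store Name"].map(normalize)

# Left merge brings the cleaned value back to every row; validate raises
# if the lookup ever contains a store name more than once.
merged = sales.merge(lookup, on="Store Name", how="left", validate="many_to_one")
print(merged)
```

Here the cleaning function runs 3 times instead of 5; on the real data that is 1,758 unique names instead of 2.3 million rows.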

## Summary
On the importance of data cleaning, see Randy Au's [newsletter](https://counting.substack.com/p/data-cleaning-is-analysis-not-grunt).

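In summary (these execution times are not the %%timeit results from the cells above; the timed np.select cell, for example, reuses masks that were already built, so it measures only the select and combine_first steps):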

|Solution                  |Execution time|Notes                    |
|:-------------------------|:-------------|:------------------------|
|np.select                 |13s           |Can work for non-text analysis|
|generalize                |15s           |Text only|
|Category Data and np.select|786ms        |Categorical data can get tricky when merging and joining|
|Lookup table and generalize|1.3s         |A lookup table can be maintained by someone else|
