# Cleaning Max's Data Set
Each column is checked in turn for missing values and typos. Rows with missing or otherwise unusable values are dropped. Cleaning effects are cumulative. Drop counts are recorded, and an overall drop rate is presented at the end. Typos are fixed by inspection whenever possible.

The key takeaways are that missing receipt values make a large share of the data unusable (604 of 2159 rows) and that nine assigned participants are missing from the data set.

In [1]:
import re
import datetime

import pandas as pd

In [2]:
%%time
DATA_PATH = '../Data/'
FILE_NAME = 'Max, Samantha, Maria data.xlsx'
SHEET = 'Max'

df = pd.read_excel(DATA_PATH + FILE_NAME, sheet_name=SHEET)
initial_row_count = df.shape[0]

Wall time: 1.06 s


First, column names are standardized with other data sets to provide consistency and readability.

In [3]:
column_names = ['ID', 'Session', 'Receipt', 'Date', 
                'Item', 'Item2', 'Uncertain', 'Unknown', 
                'Quantity', 'Hit', 'Miss', 'Category', 'Comment']
df.columns = column_names
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2159 entries, 0 to 2158
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   ID         2159 non-null   int64         
 1   Session    2159 non-null   int64         
 2   Receipt    1555 non-null   float64       
 3   Date       1718 non-null   datetime64[ns]
 4   Item       2159 non-null   object        
 5   Item2      38 non-null     object        
 6   Uncertain  37 non-null     object        
 7   Unknown    87 non-null     object        
 8   Quantity   274 non-null    float64       
 9   Hit        0 non-null      float64       
 10  Miss       0 non-null      float64       
 11  Category   2059 non-null   object        
 12  Comment    494 non-null    object        
dtypes: datetime64[ns](1), float64(4), int64(2), object(6)
memory usage: 219.4+ KB


### ID Column
Participant identification numbers. Max was assigned participants 129, 136, 144, 147, 151, 156, 160, 112, 117, and 120. Additionally, all three transcribers were assigned participants 121, 114, 137, 153, 141, 127, 130, 135, 148, and 158 to control for errors.

In [4]:
df.ID = df.ID.astype('uint8') # memory saving conversion

pids_assigned = ({129, 136, 144, 147, 151, 156, 160, 112, 117, 120} | 
                 {121, 114, 137, 153, 141, 127, 130, 135, 148, 158})

print(set(df.ID.unique()) ^ pids_assigned)

{147, 148, 151, 156, 158, 160, 112, 117, 120}


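Nine assigned participants are missing from the data set. Every ID in the symmetric difference belongs to the assigned set, so no unassigned participants appear in the data.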
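
The symmetric difference (`^`) mixes two different problems: assigned IDs that never appear and observed IDs that were never assigned. A minimal sketch of how the two could be reported separately follows. The name `pids_observed` is a hypothetical stand-in for `set(df.ID.unique())`, reconstructed here from the output above rather than read from the spreadsheet.

```python
# Assigned participant IDs, copied from the cell above.
pids_assigned = ({129, 136, 144, 147, 151, 156, 160, 112, 117, 120} |
                 {121, 114, 137, 153, 141, 127, 130, 135, 148, 158})

# Hypothetical stand-in for set(df.ID.unique()): the assigned IDs minus the
# nine IDs reported by the symmetric difference above.
pids_observed = pids_assigned - {147, 148, 151, 156, 158, 160, 112, 117, 120}

missing = sorted(pids_assigned - pids_observed)     # assigned but not in the data
unexpected = sorted(pids_observed - pids_assigned)  # in the data but not assigned

print('Missing:', missing)
print('Unexpected:', unexpected)
```

Two one-sided differences make it obvious which kind of mismatch occurred without having to cross-check each ID against the assignment list by hand.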

### Session Column
A record of when the transcription was performed. Transcription was divided into 6 sessions.

In [5]:
df.Session = df.Session.astype('uint8') # memory saving conversion

valid_sessions = [1, 2, 3, 4, 5, 6]

assert df.Session.isin(valid_sessions).all() 

### Receipt Column
Enumeration of grocery receipts per session. The df.info() output above shows missing values; these rows must be dropped because distinguishing between receipts is essential to this data set.

In [6]:
null_receipt_count = df.Receipt.isna().sum()
print(f'{null_receipt_count} rows missing Receipt values.')
df = df[df.Receipt.notna()].copy()  # copy so the assignment below does not trigger SettingWithCopyWarning
df.Receipt = df.Receipt.astype('uint8')

604 rows missing Receipt values.


This discards 604 of 2159 rows (28%). Recovering receipt numbers for these rows would greatly increase the quantity of usable data. Next, gaps in the receipt numbering are examined as further validation.

In [7]:
print('(ID, Session): [Receipts]')
for pid in df.ID.unique():
    for session in df.loc[df.ID == pid, 'Session'].unique():
        receipt_numbers = list(df.loc[(df.ID == pid) & (df.Session == session), 'Receipt'].unique())
        if receipt_numbers != list(range(1, len(receipt_numbers) + 1)):
            print(f'({pid}, {session}):', receipt_numbers)

(ID, Session): [Receipts]
(136, 1): [1, 2, 4, 5, 6, 7]


Receipt 3 of participant 136, session 1, is an empty receipt and is correctly absent from the data set.
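
To name the missing receipt numbers directly instead of reading them off the printed list, a small helper along these lines could be used. `missing_receipts` is a hypothetical function, not part of the notebook, and it assumes numbering starts at 1. It cannot detect receipts missing from the end of a session, because the highest observed number is taken as the upper bound.

```python
def missing_receipts(receipt_numbers):
    """Return the receipt numbers absent from the run 1..max(receipt_numbers)."""
    if not receipt_numbers:
        return []
    present = set(receipt_numbers)
    return [n for n in range(1, max(receipt_numbers) + 1) if n not in present]

# The one discontinuity reported above: participant 136, session 1.
print(missing_receipts([1, 2, 4, 5, 6, 7]))
```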

### Date Column
Records the purchase date printed on the receipt, if available. The approximate date range for data collection is 5/1/2020 to 12/31/2020.

In [8]:
# convert to datetime and discard the time component (normalize sets it to midnight)
df.Date = pd.to_datetime(df.Date, errors='coerce').dt.normalize()

assert df.Date.dropna().between(datetime.datetime(2020, 5, 1), datetime.datetime(2020, 12, 31)).all()
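
A minimal sketch of the date handling above on toy values, which are hypothetical and not taken from the data set. `Series.dt.normalize()` sets the time component to midnight while keeping the datetime dtype, and `Series.between` includes both endpoints by default.

```python
import pandas as pd

# Hypothetical timestamps, not taken from the data set.
raw = pd.Series([pd.Timestamp('2020-06-30 14:05:00'),
                 pd.Timestamp('2020-07-28'),
                 pd.NaT])

# Drop the time of day; missing dates stay NaT.
dates = raw.dt.normalize()

# Range check on the non-missing dates only, as in the assert above.
in_range = dates.dropna().between(pd.Timestamp('2020-05-01'),
                                  pd.Timestamp('2020-12-31')).all()
print(dates.tolist())
print(in_range)
```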

### Item Column
Item descriptions are essential, so rows with unidentifiable items are unusable and are dropped. Descriptions are also lowercased and stripped of punctuation, surrounding whitespace, and formatting.

In [9]:
df.Item = df.Item.str.lower().str.strip().astype('string')
df.Item.value_counts(dropna=False).head()

bananas        43
unknown        30
blueberries    18
eggs           16
milk (2%)      14
Name: Item, dtype: Int64

In [10]:
NULL_ITEM_DESC = r'unknown|n/a|missing'
null_item_count = df.Item.str.contains(NULL_ITEM_DESC).sum() + df.Item.isna().sum()
print('Additional items containing null-like language:')
display(df[df.Item.str.contains(NULL_ITEM_DESC)])

df = df[df.Item.notna()]
df = df[~df.Item.str.contains(NULL_ITEM_DESC)].copy()  # copy so later column assignments do not trigger SettingWithCopyWarning
print(f'{null_item_count} null item descriptions.')

Additional items containing null-like language:


| | ID | Session | Receipt | Date | Item | Item2 | Uncertain | Unknown | Quantity | Hit | Miss | Category | Comment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17 | 129 | 2 | 1 | 2020-06-30 | unknown | | | x | 2.0 | | | | unknown: "100 Cal Trop/Mixed" |
| 60 | 129 | 3 | 2 | 2020-07-28 | unknown | | | x | | | | | unknown: "GV SF PW SYR" |
| 196 | 129 | 5 | 1 | 2020-08-18 | unknown | | | x | | | | | unknown: "GV 100 WWWP;" Great Value _____? |
| 259 | 136 | 2 | 1 | 2020-08-14 | unknown | | | x | | | | | unknown: "NGVC 2020 Anniv;" "0084932700262" |
| 269 | 136 | 2 | 1 | 2020-08-14 | unknown | | | x | | | | | unknown: "DC $3.45 Natura;" "0084932700217" |
| 270 | 136 | 2 | 1 | 2020-08-14 | unknown | | | x | | | | | unknown: "DC $3.45 Natura;" "0084932700217" |
| 271 | 136 | 2 | 1 | 2020-08-14 | unknown | | | x | | | | | unknown: "DC $3.49 Lightl;" "0004345410080" |
| 272 | 136 | 2 | 1 | 2020-08-14 | unknown | | | x | | | | | unknown: "DC Free Natural;" "0084932700134" |
| 369 | 136 | 4 | 8 | 2020-09-10 | unknown | | | x | 2.0 | | | | unknown: "STO CARTS RN" Simple Truth Organic _... |
| 894 | 137 | 1 | 1 | 2020-07-29 | unknown | | | x | | | | | unknown: "OTBCTC24OZ" |


31 null item descriptions.
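
The value counts above show 30 items that are exactly "unknown", while 31 rows are dropped. `str.contains` with a regular expression matches anywhere in the string, in the same way as `re.search`, so descriptions that merely contain one of the null-like words are flagged as well. A minimal sketch using only the standard library, with hypothetical sample strings that are not taken from the data set:

```python
import re

NULL_ITEM_DESC = r'unknown|n/a|missing'  # pattern from the cell above

# Hypothetical item descriptions, not taken from the data set.
samples = ['unknown', 'unknown brand cereal', 'n/a', 'label missing', 'bananas']

# re.search succeeds if the pattern occurs anywhere in the string.
flagged = [s for s in samples if re.search(NULL_ITEM_DESC, s)]
print(flagged)
```

If only whole-string matches were wanted, anchoring the pattern (for example with `re.fullmatch`) would be one option; the notebook's broader match is kept here.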


Item descriptions are optionally formatted as "item (modifier)", where the modifier is usually an adjective such as a flavor, e.g. "ice cream (chocolate)". The reformat_modifier function removes this formatting by moving the modifier to the beginning of the text and dropping the parentheses.

In [11]:
paren = re.compile(r'\(.+\)')

def reformat_modifier(text):
    m = paren.search(text)
    if m:
        text = ' '.join([m.group(0)[1:-1], text])
        text = paren.sub('', text)
    return text

In [12]:
df.Item = (df.Item
           .apply(reformat_modifier)
           .str.replace(r'[/(),"&]', ' ', regex=True)
           .str.replace(r'?', '', regex=False)
           .str.replace(r"'s", '', regex=False)
           .str.replace(r"coupon", '', regex=False)
           .str.strip())
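
A worked example of the modifier reformatting, repeating the function from the cell above so that the snippet runs on its own. The substitution leaves a trailing space behind, which the `.str.strip()` at the end of the chain removes; the sketch calls `strip()` directly for the same effect. The greedy pattern assumes at most one parenthesised modifier per description, and a string with two separate groups would not be handled cleanly.

```python
import re

paren = re.compile(r'\(.+\)')

def reformat_modifier(text):
    m = paren.search(text)
    if m:
        # Prepend the modifier without its parentheses, then remove the original group.
        text = ' '.join([m.group(0)[1:-1], text])
        text = paren.sub('', text)
    return text

for raw in ['ice cream (chocolate)', 'milk (2%)', 'bananas']:
    print(repr(raw), '->', repr(reformat_modifier(raw).strip()))
```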

### Item2 Column
Provides an additional description of the grocery item, but with only 38 non-null values in the raw data it is too sparse to be useful and is dropped.

In [13]:
df = df.drop(columns='Item2')

### Uncertain Column
Denotes low confidence in transcription.

In [14]:
display(df[df.Uncertain.notna()])

| | ID | Session | Receipt | Date | Item | Uncertain | Unknown | Quantity | Hit | Miss | Category | Comment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 26 | 129 | 2 | 1 | 2020-07-02 | sandwich rolls | x | | | | | Grain | uncertain: "KK SNDWCH RL 15" |
| 70 | 129 | 3 | 2 | 2020-07-28 | herb butter | x | | | | | | uncertain: "BUTR HERB SL" |
| 151 | 129 | 4 | 2 | 2020-08-04 | frozen meal | x | | | | | Dish | uncertain: "LFC BOWL 11z;" Life Cuisine |
| 152 | 129 | 4 | 2 | 2020-08-04 | frozen meal | x | | | | | Dish | uncertain: "LFC BOWL 10.875z;" Life Cuisine |
| 176 | 129 | 5 | 1 | 2020-08-17 | sandwich rolls | x | | | | | Grain | uncertain: "KK SNDWCH RL 15" |
| 343 | 136 | 4 | 4 | 2020-09-04 | baby food | x | | | | | Dish | uncertain: "CMFRTS BABY" could be diapers, wip... |
| 357 | 136 | 4 | 6 | 2020-09-06 | bacon | x | | | | | Meat | uncertain: "PC HMPL BACON" assumed Hempler's B... |
| 362 | 136 | 4 | 7 | 2020-09-10 | red sugar | x | | | | | Seasoning | uncertain: "PC STO RED SUG" |
| 376 | 136 | 4 | 8 | 2020-09-10 | red sugar | x | | | | | Seasoning | uncertain: "PC STO RED SUG" & Duplicate of 4-7 |
| 435 | 136 | 5 | 5 | NaT | beef | x | | | | | Meat | uncertain: "HTGF BEEF" |


The transcription quality seems acceptable, so these rows are kept and only the flag column is dropped.

In [15]:
df = df.drop(columns='Uncertain')

### Unknown Column
Denotes very low confidence in transcription.

In [16]:
display(df[df.Unknown.notna()])

| | ID | Session | Receipt | Date | Item | Unknown | Quantity | Hit | Miss | Category | Comment |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 129 | 1 | 1 | 2020-06-19 | indian meal | x | 2.0 | | | | unknown: "MEAL INDIAN" |
| 25 | 129 | 2 | 1 | 2020-07-02 | bake shop item | x | 2.0 | | | Grain | unknown: "REDUCE BAKE SHOP" |
| 29 | 129 | 2 | 1 | 2020-07-02 | deli item | x | | | | | unknown: "LOL AM WHT END" |
| 39 | 129 | 2 | 1 | 2020-07-02 | produce item | x | 2.0 | | | | unknown: "REDUCED PRODUCE" |
| 44 | 129 | 3 | 1 | 2020-07-31 | dairy item | x | | | | Dairy | unknown: "GB WHT PCH 5.3Z" |
| 48 | 129 | 3 | 1 | 2020-07-31 | deli item | x | | | | | unknown: "DELI MI" |
| 52 | 129 | 3 | 1 | 2020-07-31 | produce item | x | | | | | unknown: "REDUCED PRODUCE" |
| 62 | 129 | 3 | 2 | 2020-07-28 | buttermilk product | x | | | | | unknown: "BUTTERMIL" |
| 65 | 129 | 3 | 2 | 2020-07-28 | lemon snack product | x | | | | Snack | unknown: "LEMON SNACK" |
| 118 | 129 | 4 | 1 | 2020-08-11 | bake shop item | x | | | | | unknown: "REDUCE BAKE SHOP" |


These item descriptions are too vague to be useful, so the rows are dropped.

In [17]:
unknown_count = df.Unknown.notna().sum()
print(f'{unknown_count} unknown items.')
df = df[df.Unknown.isna()]
df = df.drop(columns='Unknown')

26 unknown items.


### Quantity Column
An integer representing multiple purchases of the same item.

In [18]:
df.Quantity.value_counts(dropna=False)

NaN    1323
2.0     136
3.0      24
4.0       7
5.0       5
6.0       3
Name: Quantity, dtype: int64

The Quantity column data is sparse. To make future analysis easier, rows will be repeated according to their Quantity value. Each row will now represent a single item. Notice that this will expand the size of the data set.

In [19]:
print(f'{df.Quantity.fillna(1).sum() - df.shape[0]} rows added from expanding Quantity data.')
df = df.loc[df.index.repeat(df.Quantity.fillna(1))]
df = df.drop(columns='Quantity')

240.0 rows added from expanding Quantity data.
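
The expansion step can be illustrated on a toy frame. The rows below are hypothetical and not taken from the data set. A missing Quantity is treated as a single purchase, and the sketch casts the repeat counts to integers, which the notebook cell leaves to pandas.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the data set.
toy = pd.DataFrame({'Item': ['bananas', 'eggs', 'milk'],
                    'Quantity': [float('nan'), 2.0, 3.0]})

# Missing quantities count as one purchase; repeat counts are cast to int.
repeats = toy.Quantity.fillna(1).astype(int)

# Repeat each index label by its quantity, then select the repeated labels.
expanded = toy.loc[toy.index.repeat(repeats)].drop(columns='Quantity')

print(expanded.Item.tolist())
print(len(expanded) - len(toy), 'rows added')
```

Selecting by repeated labels relies on the index being unique, which holds in the notebook at this point because rows have only been filtered, never duplicated.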


### Hit Column & Miss Column
Both columns contain no data at this time (0 non-null values) and are dropped.

In [20]:
df = df.drop(columns=['Hit', 'Miss'])

### Category Column
Labels each grocery item by type.

In [21]:
df.Category = df.Category.astype('string')
df.Category.value_counts(dropna=False).head()

Fruit        241
Vegetable    202
Drink        159
Meat         133
Dairy        128
Name: Category, dtype: Int64

### Comment Column
Contains miscellaneous notes from the transcriber.

In [22]:
df.Comment.value_counts(dropna=False).head(10)

NaN                                         1352
Duplicate of 1-1                              49
category suggestion: "Petproduct"             44
category suggestion: "Cleaningproduct"        22
category suggestion: "Toiletries"             19
Duplicate of 6-1                              17
Duplicate of 1-2                              16
Duplicate of 1-5                              15
Duplicate of 3-1                              15
category suggestion: "Householdgoods" ??      14
Name: Comment, dtype: int64

In [23]:
duplicate_mask = df.Comment.str.contains(r'duplicate|repeat', case=False, na=False)
duplicate_drop_count = duplicate_mask.sum()
df = df[~duplicate_mask].copy()  # copy so the assignment below does not trigger SettingWithCopyWarning
df.Comment = df.Comment.astype('string')
print(f'{duplicate_drop_count} duplicate rows.')

201 duplicate rows.


### A Cleaned Data Set

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1537 entries, 0 to 2137
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   ID        1537 non-null   uint8         
 1   Session   1537 non-null   uint8         
 2   Receipt   1537 non-null   uint8         
 3   Date      1221 non-null   datetime64[ns]
 4   Item      1537 non-null   object        
 5   Category  1532 non-null   string        
 6   Comment   185 non-null    string        
dtypes: datetime64[ns](1), object(1), string(2), uint8(3)
memory usage: 64.5+ KB


In [25]:
total_drop = null_receipt_count + null_item_count + unknown_count + duplicate_drop_count
print(f'Total row reduction: {total_drop} ({total_drop / initial_row_count:.0%})')

Total row reduction: 862 (40%)
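
As a consistency check, the counts reported throughout the notebook can be reconciled with the final row count. The sketch below uses only figures printed above. Note that duplicates are counted after the Quantity expansion, so the 40% figure compares a mix of pre-expansion and post-expansion rows against the initial row count.

```python
initial_rows = 2159    # RangeIndex reported by the first df.info()
null_receipts = 604    # rows missing Receipt values
null_items = 31        # null item descriptions
unknown_items = 26     # rows flagged in the Unknown column
quantity_added = 240   # rows added by expanding Quantity
duplicates = 201       # rows marked as duplicates in Comment

# Rows remaining after every drop and the one expansion step.
final_rows = (initial_rows - null_receipts - null_items - unknown_items
              + quantity_added - duplicates)
print(final_rows)

# Total dropped rows relative to the initial row count.
total_drop = null_receipts + null_items + unknown_items + duplicates
print(f'{total_drop} ({total_drop / initial_rows:.0%})')
```

The result matches the 1537 entries reported by the final `df.info()`.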


In [26]:
df = df.reset_index(drop=True)
df.to_csv(f'{DATA_PATH}clean_{SHEET.lower()}.csv')