## Cleaning Maria's Data Set
Data columns are analyzed sequentially for missing values and typos. Rows with missing values or otherwise unusable data are dropped. Cleaning effects are cumulative. Drop counts are recorded and an overall drop rate is presented at the end. Typos are fixed by inspection whenever possible.

The key takeaway is that missing receipt values generate a large amount of unusable data.
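
The bookkeeping described above (cumulative drops, a recorded count per step, and an overall rate at the end) can be sketched on toy data. This is only an illustration: the helper `drop_rows`, the `drop_log` dictionary, and the toy frame are hypothetical names and are not part of this notebook, which records each count in its own variable instead.

```python
import pandas as pd


def drop_rows(df, keep_mask, label, log):
    """Keep rows where keep_mask is True and record how many were dropped."""
    log[label] = int((~keep_mask).sum())
    return df[keep_mask]


# Toy data, not the real sheet.
toy = pd.DataFrame({
    'Receipt': [1.0, None, 2.0, None],
    'Item': ['milk', 'eggs', None, 'bread'],
})
initial_rows = len(toy)
drop_log = {}

# Each step sees only the rows that survived the previous step.
toy = drop_rows(toy, toy.Receipt.notna(), 'missing receipt', drop_log)
toy = drop_rows(toy, toy.Item.notna(), 'missing item', drop_log)

total_dropped = sum(drop_log.values())
print(drop_log)
print(f'Drop rate: {total_dropped / initial_rows:.0%}')
```

Because the steps are cumulative, a row is counted only by the first step that removes it, so the per-step counts sum to the total reduction.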

In [1]:
import re
import datetime

import pandas as pd

In [2]:
%%time
DATA_PATH = '../Data/'
FILE_NAME = 'Max, Samantha, Maria data.xlsx'
SHEET = 'Maria'

df = pd.read_excel(DATA_PATH + FILE_NAME, sheet_name=SHEET)
initial_row_count = df.shape[0] # used at the end to compute drop rate


Wall time: 1.57 s


First, column names are standardized to match the other data sets for consistency and readability. The coupon column is dropped because it provides little information overall and is entirely absent from Max's data set.

In [3]:
df = df.drop(columns='coupon') 
column_names = ['ID', 'Session', 'Receipt', 'Date', 'Item', 'Item2', 'Uncertain', 
                'Unknown', 'Quantity', 'Hit', 'Miss', 'Category', 'Comment']
df.columns = column_names
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4324 entries, 0 to 4323
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   ID         4324 non-null   int64  
 1   Session    4324 non-null   int64  
 2   Receipt    2159 non-null   float64
 3   Date       2855 non-null   object 
 4   Item       4314 non-null   object 
 5   Item2      10 non-null     float64
 6   Uncertain  52 non-null     object 
 7   Unknown    95 non-null     object 
 8   Quantity   851 non-null    float64
 9   Hit        0 non-null      float64
 10  Miss       0 non-null      float64
 11  Category   4173 non-null   object 
 12  Comment    195 non-null    object 
dtypes: float64(5), int64(2), object(6)
memory usage: 439.3+ KB


### ID Column
Participant identification numbers. Maria was assigned participants identified by 128, 134, 143, 146, 150, 154, 159, 110, 115, and 119. Additionally, all three transcribers were assigned the collection 121, 114, 137, 153, 141, 127, 130, 135, 148, and 158, to control for errors.

In [4]:
df.ID = df.ID.astype('uint8') # memory saving conversion

pids_assigned = ({128, 134, 143, 146, 150, 154, 159, 110, 115, 119} |
                 {121, 114, 137, 153, 141, 127, 130, 135, 148, 158})

print(set(df.ID.unique()) ^ pids_assigned)

set()


All assigned participants are accounted for.
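
The check above relies on the set symmetric difference operator `^`, which is empty only when both sets contain exactly the same IDs. A small sketch with made-up IDs (the `observed` values are hypothetical) shows what a mismatch would look like and how to tell a missing ID from an unexpected one:

```python
assigned = {110, 115, 119}
observed = {110, 115, 120}  # hypothetical: 119 is missing, 120 is unexpected

mismatches = observed ^ assigned   # IDs present in exactly one of the two sets
missing = assigned - observed      # assigned but never transcribed
unexpected = observed - assigned   # transcribed but never assigned

print(mismatches, missing, unexpected)
```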

### Session Column
A record of when the transcription was performed. Transcription was divided into 6 sessions.

In [5]:
df.Session = df.Session.astype('uint8') # memory saving conversion

valid_sessions = [1, 2, 3, 4, 5, 6]

assert df.Session.isin(valid_sessions).all() 
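
The `uint8` conversions used for the ID, Session, and Receipt columns are safe only when every value lies in the range 0 to 255 and none is missing. A hedged sketch of a guard for that assumption follows; the helper `to_uint8` is hypothetical and is not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd


def to_uint8(series):
    """Cast to uint8 only if no value is missing or outside the uint8 range."""
    limits = np.iinfo(np.uint8)  # min 0, max 255
    if series.isna().any():
        raise ValueError('missing values cannot be stored as uint8')
    if not series.between(limits.min, limits.max).all():
        raise ValueError('values fall outside the uint8 range')
    return series.astype('uint8')


ids = pd.Series([110, 159, 128], dtype='int64')  # toy values
ids_small = to_uint8(ids)

# Bytes used by the values alone (index excluded): 8 per int64, 1 per uint8.
print(ids.memory_usage(index=False), ids_small.memory_usage(index=False))
```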

### Receipt Column
Enumeration of grocery receipts per session. The df.info() output above shows missing values. These rows must be dropped because distinguishing between receipts is essential to this data set.

In [6]:
null_receipt_count = df.Receipt.isna().sum()
print(f'{null_receipt_count} rows missing Receipt values.')
df = df[df.Receipt.notna()]

2165 rows missing Receipt values.


This discards roughly half of the data (2165 of 4324 rows). Recovering receipt numbers for these rows would greatly increase the quantity of usable data. Next, typos are discovered by examining discontinuities in the enumeration.

In [7]:
df.Receipt = df.Receipt.astype('uint8') # memory saving conversion

print('(ID, Session): [Receipts]')
for pid in df.ID.unique():
    for session in df.loc[df.ID == pid, 'Session'].unique():
        receipt_numbers = list(df.loc[(df.ID == pid) & (df.Session == session), 'Receipt'].unique())
        if receipt_numbers != list(range(1, len(receipt_numbers) + 1)):
            print(f'({pid}, {session}):', receipt_numbers)

(ID, Session): [Receipts]
(137, 1): [1, 11, 2]
(119, 6): [2, 3, 4, 5]
(130, 2): [1, 5, 2, 3, 4]


The tuple (119, 6, 1) refers to an empty receipt and so its absence is correct. The other two discontinuities correctly identify typos, which are fixed by inspection.

In [8]:
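typo_date = datetime.date(2020, 8, 3)
# (137, 1): receipt 11 is corrected to receipt 1; the fix is restricted to session 1
df.loc[(df.ID == 137) & (df.Session == 1) & (df.Receipt == 11), 'Receipt'] = 1
# (130, 2): rows dated 2020-08-03 are reassigned to receipt 1
df.loc[(df.ID == 130) & (df.Session == 2) & (df.Date == typo_date), 'Receipt'] = 1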
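
The discontinuity scan in this section can also be written with `groupby`. The sketch below runs on toy data, and the name `gap_report` is hypothetical. Unlike the loop above, it compares sets, so it reports missing or unexpected receipt numbers but ignores the order in which they appear.

```python
import pandas as pd

# Toy data mirroring two of the reported discontinuities.
toy = pd.DataFrame({
    'ID':      [137, 137, 137, 119, 119],
    'Session': [1,   1,   1,   6,   6],
    'Receipt': [1,   11,  2,   2,   3],
})

gap_report = {}
for (pid, session), receipts in toy.groupby(['ID', 'Session'])['Receipt']:
    seen = {int(r) for r in receipts}
    expected = set(range(1, len(seen) + 1))  # receipts should run 1..n
    if seen != expected:
        # Numbers that are either missing from 1..n or unexpectedly present.
        gap_report[(int(pid), int(session))] = sorted(seen ^ expected)

print(gap_report)
```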

### Date Column
Records purchase date on receipt if available. The approximate date range for data collection is 5/1/2020 to 12/31/2020.

In [9]:
# convert to datetime (unparseable values become NaT) and discard the time component
df.Date = pd.to_datetime(df.Date, errors='coerce').dt.normalize()

df[~df.Date.between(datetime.datetime(2020, 5, 1), datetime.datetime(2020, 12, 31))].dropna(subset=['Date'])

Unnamed: 0,ID,Session,Receipt,Date,Item,Item2,Uncertain,Unknown,Quantity,Hit,Miss,Category,Comment
3472,130,5,3,2002-09-21,carrots,,,,,,,vegetable,


A single date appears outside the expected range. It is corrected by inspection.

In [10]:
df.loc[df.Date == datetime.datetime(2002, 9, 21), 'Date'] = datetime.datetime(2020, 9, 21)

# some date typos that were discovered in a separate analysis, 
# but are now discarded by prior cleaning operations
#df.loc[df.Date == datetime.datetime(2002, 9, 10), 'Date'] = datetime.datetime(2020, 9, 10)
#df.loc[df.Date == datetime.datetime(2020, 4, 6), 'Date'] = datetime.datetime(2020, 6, 4)
#df.loc[df.Date == datetime.datetime(2020, 1, 7), 'Date'] = datetime.datetime(2020, 7, 1)

assert df.Date.dropna().between(datetime.datetime(2020, 5, 1), datetime.datetime(2020, 12, 31)).all()
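
The date handling in this section follows a coerce-then-range-check pattern, sketched here on toy strings. The explicit `format` argument and the sample values are assumptions made for the sketch and do not describe the real sheet. The time component is discarded with `dt.normalize()`.

```python
import pandas as pd

raw = pd.Series(['2020-08-03 14:25:00', '2002-09-21 00:00:00', 'not a date', None])

# Unparseable or missing entries become NaT; normalize() resets times to midnight.
dates = pd.to_datetime(raw, format='%Y-%m-%d %H:%M:%S', errors='coerce').dt.normalize()

start, end = pd.Timestamp(2020, 5, 1), pd.Timestamp(2020, 12, 31)
# NaT is never "between", so missing dates are excluded explicitly.
out_of_range = dates[dates.notna() & ~dates.between(start, end)]

print(out_of_range)
```

Only the 2002 entry is flagged. Missing dates are left alone, which matches the notebook's choice to keep rows that have no date.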

### Item Column
A string description of a grocery item, formatted as "ITEM (MODIFIER)". Item descriptions are essential data and unidentifiable items are unusable.

In [11]:
df.Item = df.Item.str.lower().str.strip().astype('string') # lowercase and strip white space
df.Item.value_counts(dropna=False).head(20)

strawberries            41
bananas                 38
blueberries             31
eggs                    24
cucumbers               22
raspberries             21
milk                    20
ice cream               18
peaches                 16
orange bell peppers     15
heavy whipping cream    14
cereal                  13
spring water            13
potatoes                13
oranges                 13
strawberry              12
sparkling water         11
bread                   11
tuna                    10
NaN                     10
Name: Item, dtype: Int64

10 null items to be dropped.

In [12]:
NULL_ITEM_DESC = r'unknown|n/a|missing'
# na=False treats missing items as non-matches, so the mask is strictly boolean
null_like = df.Item.str.contains(NULL_ITEM_DESC, na=False)
null_item_count = null_like.sum() + df.Item.isna().sum()
print('Additional items containing null-like language:')
display(df[null_like])

# drop missing items and items described with null-like language
df = df[df.Item.notna() & ~null_like]
print(f'{null_item_count} null item descriptions.')

Additional items containing null-like language:


Unnamed: 0,ID,Session,Receipt,Date,Item,Item2,Uncertain,Unknown,Quantity,Hit,Miss,Category,Comment
2842,119,3,2,2020-07-17,unknown,,,x,,,,,walmart


11 null item descriptions.
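
One detail of the pattern match is worth illustrating: on a column with missing values, `str.contains` returns a missing result for each missing item unless `na=False` is passed. A minimal sketch on toy data (the sample items are made up) that combines both kinds of unusable description into one mask:

```python
import pandas as pd

NULL_ITEM_DESC = r'unknown|n/a|missing'
items = pd.Series(['milk', 'unknown', None, 'n/a item', 'bread'], dtype='string')

# na=False counts a missing item as "no match" rather than as a missing result.
null_like = items.str.contains(NULL_ITEM_DESC, na=False)
unusable = items.isna() | null_like

clean = items[~unusable]
print(int(unusable.sum()), list(clean))
```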


### Item2 Column
Provides additional description of the grocery, but with only 10 non-null values it is too sparse to be useful.

In [13]:
df = df.drop(columns='Item2')

### Uncertain Column
Denotes low confidence in transcription.

In [14]:
display(df[df.Uncertain.notna()])

Unnamed: 0,ID,Session,Receipt,Date,Item,Uncertain,Unknown,Quantity,Hit,Miss,Category,Comment
1079,135,1,2,NaT,razor,x,,,,,,
1144,135,2,6,NaT,wrthr og hrd,x,,,,,,
1168,135,2,8,2020-08-14,tetro,x,,,,,,
2795,119,2,1,NaT,catfish nugget,x,,,,,fastfood,
3240,130,1,5,2020-07-25,frosting,x,,,,,dessert,
3244,130,1,5,2020-07-28,odor relief,x,,,,,,
3647,127,5,3,2020-08-13,half moon,x,,,,,,Grocery store unknwon


The transcription quality of these rows seems low, so they are dropped.

In [15]:
uncertain_count = df.Uncertain.notna().sum()
df = df[df.Uncertain.isna()]
df = df.drop(columns='Uncertain')
print(f'{uncertain_count} rows with high uncertainty.')

7 rows with high uncertainty.


### Unknown Column
Denotes very low confidence in transcription.

In [16]:
display(df[df.Unknown.notna()])

Unnamed: 0,ID,Session,Receipt,Date,Item,Unknown,Quantity,Hit,Miss,Category,Comment
858,159,3,3,NaT,conventional,x,2.0,,,,
859,159,3,3,NaT,masculine bi,x,2.0,,,,
1105,135,1,4,NaT,cg tg tl,x,,,,,
1106,135,1,4,NaT,sct 1000,x,,,,,
1107,135,1,4,NaT,tr cn smth28,x,,,,,
1382,137,4,2,2020-09-14,the original em,x,2.0,,,,
2474,128,1,5,NaT,organic preserves,x,,,,,
2488,128,1,7,NaT,morsh,x,2.0,,,,I am not sure about this item
2489,128,1,7,NaT,cfl,x,2.0,,,,
2490,128,1,7,NaT,amer,x,,,,,


These item descriptions are too vague to be useful, so the rows are dropped.

In [17]:
unknown_count = df.Unknown.notna().sum()
print(f'{unknown_count} unknown items.')
df = df[df.Unknown.isna()]
df = df.drop(columns='Unknown')

33 unknown items.


### Quantity Column
An integer count representing multiple purchases of the same item.

In [18]:
df.Quantity.value_counts(dropna=False)

NaN     1657
2.0      345
3.0       53
4.0       31
5.0        9
6.0        7
10.0       3
14.0       1
8.0        1
1.0        1
Name: Quantity, dtype: int64

Some of the larger quantities could be typos, but on inspection they seem plausible.

In [19]:
display(df[df.Quantity.isin([8, 10, 14])])

Unnamed: 0,ID,Session,Receipt,Date,Item,Quantity,Hit,Miss,Category,Comment
105,154,3,3,2020-09-15,spicy hot veggie juice,14.0,,,drink,
112,154,3,4,2020-09-15,tuna,10.0,,,meat (new category needed0,
2507,128,2,2,NaT,atkins cereal bar,8.0,,,snack,
3001,119,6,2,NaT,kool aid,10.0,,,drink,
3002,119,6,2,NaT,kool aid,10.0,,,drink,


The Quantity column data is sparse. To make future analysis easier, rows will be repeated according to their Quantity value. Each row will now represent a single item. Notice that this will expand the size of the data set.

In [20]:
print(f'{df.Quantity.fillna(1).sum() - df.shape[0]} rows added by expanding Quantity data.')
df = df.loc[df.index.repeat(df.Quantity.fillna(1))]
df = df.drop(columns='Quantity')

662.0 rows added by expanding Quantity data.
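
The expansion step can be sketched on a toy frame (the sample rows are made up). A missing Quantity is treated as 1, and each row is repeated according to its count by repeating its index label. The cast to `int` is a choice made for this sketch so that the repeat counts and the row total are whole numbers.

```python
import pandas as pd

toy = pd.DataFrame({
    'Item': ['tuna', 'milk', 'kool aid'],
    'Quantity': [3.0, None, 2.0],
})

# Missing quantities mean a single purchase.
repeats = toy.Quantity.fillna(1).astype(int)

# Repeating index labels and selecting with .loc duplicates the matching rows.
expanded = toy.loc[toy.index.repeat(repeats)].drop(columns='Quantity')
rows_added = len(expanded) - len(toy)

print(rows_added)
print(list(expanded.Item))
```

The expanded frame keeps the original index labels, so duplicated rows share a label until the index is reset.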


### Hit Column & Miss Column
Both columns contain no data at this time and are dropped.

In [21]:
df = df.drop(columns=['Hit', 'Miss'])

### Category Column
Labels grocery by type.

In [22]:
df.Category = df.Category.astype('string')
df.Category.value_counts(dropna=False).head()

fruit         369
vegetables    287
drink         258
vegetable     171
snack         163
Name: Category, dtype: Int64

### Comment Column
Contains miscellaneous notes from transcriber.

In [23]:
df.Comment.value_counts().head(10)

new category needed                               8
new category needed                               8
walmart receipt and prices are unknown. Unsure    6
unsure about category                             3
New category needed                               3
walmart receipt. Unknwon receipt                  3
does not specify                                  3
could not find item                               2
Could not find products                           2
dont know grocery store,couldnt find item         2
Name: Comment, dtype: int64

Value counts are small, so the comments are ignored for now.

In [24]:
df.Comment = df.Comment.astype('string')

### A Cleaned Data Set

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2770 entries, 0 to 4236
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   ID        2770 non-null   uint8         
 1   Session   2770 non-null   uint8         
 2   Receipt   2770 non-null   uint8         
 3   Date      1536 non-null   datetime64[ns]
 4   Item      2770 non-null   string        
 5   Category  2741 non-null   string        
 6   Comment   76 non-null     string        
dtypes: datetime64[ns](1), string(3), uint8(3)
memory usage: 116.3 KB


In [26]:
total_drop = null_receipt_count + null_item_count + uncertain_count + unknown_count
print(f'Total row reduction: {total_drop} ({total_drop / initial_row_count:.0%})')

Total row reduction: 2216 (51%)
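
As a consistency check, the counts printed throughout this notebook can be reconciled by hand. The sketch below only re-adds numbers already reported above; the dictionary and variable names are illustrative.

```python
initial_row_count = 4324

# Drop counts as printed by the cleaning cells above.
drops = {
    'missing receipt': 2165,
    'null item': 11,
    'uncertain': 7,
    'unknown': 33,
}
total_drop = sum(drops.values())

rows_before_expansion = initial_row_count - total_drop
rows_after_expansion = rows_before_expansion + 662  # rows added by Quantity expansion

print(total_drop, f'{total_drop / initial_row_count:.0%}')
print(rows_before_expansion, rows_after_expansion)
```

The final figure matches the 2770 entries reported by `df.info()`. The drop rate is measured against the initial row count, before the Quantity expansion adds rows back.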


In [27]:
df = df.reset_index(drop=True)
df.to_csv(f'{DATA_PATH}clean_{SHEET.lower()}.csv')