# Cleaning Samantha's Data Set
Data columns are analyzed sequentially for missing values and typos. Missing values and otherwise unusable data are dropped. Cleaning effects are cumulative: each step operates on the rows left by the previous one. Drop counts are recorded and an overall drop rate is presented at the end. Typos are fixed by inspection whenever possible.

The key takeaway is that missing receipt values generate a large amount of discarded data.

In [1]:
import re
import datetime

import pandas as pd

In [2]:
%%time
DATA_PATH = '../Data/'
FILE_NAME = 'Max, Samantha, Maria data.xlsx'
SHEET = 'Samantha'

df = pd.read_excel(DATA_PATH + FILE_NAME, sheet_name=SHEET)
initial_row_count = df.shape[0]

Wall time: 1.32 s


First, column names are standardized to match the other data sets for consistency and readability. The coupon column is deleted because it provides little information overall and is entirely absent from Max's data set.

In [3]:
df = df.drop(columns='Coupon (#)')
column_names = ['ID', 'Session', 'Receipt', 'Date', 
                'Item', 'Item2', 'Uncertain', 'Unknown', 
                'Quantity', 'Hit', 'Miss', 'Category', 'Comment']
df.columns = column_names
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3138 entries, 0 to 3137
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   ID         3138 non-null   int64         
 1   Session    3138 non-null   int64         
 2   Receipt    1822 non-null   object        
 3   Date       2509 non-null   datetime64[ns]
 4   Item       3069 non-null   object        
 5   Item2      135 non-null    object        
 6   Uncertain  111 non-null    object        
 7   Unknown    82 non-null     object        
 8   Quantity   525 non-null    object        
 9   Hit        27 non-null     float64       
 10  Miss       0 non-null      float64       
 11  Category   3003 non-null   object        
 12  Comment    284 non-null    object        
dtypes: datetime64[ns](1), float64(2), int64(2), object(8)
memory usage: 318.8+ KB


### ID Column
Participant identification numbers. Samantha was assigned participants identified by 131, 139, 145, 149, 152, 157, 162, 113, 118, and 126. Additionally, all three transcribers were assigned the collection 121, 114, 137, 153, 141, 127, 130, 135, 148, and 158 to control for errors.

In [5]:
df.ID = df.ID.astype('uint8') # memory saving conversion

pids_assigned = ({131, 139, 145, 149, 152, 157, 162, 113, 118, 126} | 
                 {121, 114, 137, 153, 141, 127, 130, 135, 148, 158})

print(set(df.ID.unique()) ^ pids_assigned)

set()


All assigned participants are accounted for.
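
The symmetric difference (`^`) is empty only when the two sets match exactly, so the check catches both unassigned IDs that appear in the sheet and assigned IDs that are absent from it. If the check ever printed a non-empty set, the two cases could be separated as in the minimal sketch below. The IDs are made up and `compare_ids` is a hypothetical helper, not part of the notebook.

```python
def compare_ids(observed, assigned):
    """Split a mismatch between two ID sets into its two directions."""
    observed, assigned = set(observed), set(assigned)
    return {
        'missing': sorted(assigned - observed),     # assigned but never transcribed
        'unexpected': sorted(observed - assigned),  # transcribed but never assigned
    }


# Toy IDs for illustration only, not taken from the data set
report = compare_ids(observed=[113, 118, 999], assigned=[113, 118, 126])
print(report)
```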

### Session Column
A record of when the transcription was performed. Transcription was divided into 6 sessions.

In [None]:
df.Session = df.Session.astype('uint8') # memory saving conversion

valid_sessions = [1, 2, 3, 4, 5, 6]

assert df.Session.isin(valid_sessions).all() 

### Receipt Column
Enumeration of grocery receipts per session. `df.info()` shows that some values are missing. These rows must be dropped because distinguishing between receipts is essential to this data set.

In [None]:
null_receipt_count = df.Receipt.isna().sum()
print(f'{null_receipt_count} rows missing Receipt values.')
df = df[df.Receipt.notna()]

This is a very large discard of data: per the `df.info()` output above, 1,316 of 3,138 rows (about 42%) have no Receipt value. Attributing receipt values to these rows would greatly increase the quantity of usable data. Next, a typo is discovered by attempting to convert Receipt to an integer data type. It is fixed by inspection.

In [None]:
typo = datetime.datetime(1900, 1, 1, 0, 0)
df.loc[df.Receipt == typo, 'Receipt'] = 1
df.Receipt = df.Receipt.astype('uint8')
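
A failed conversion only reports that a bad value exists. One way to locate such values before calling `astype` is to test each entry individually. The following is a sketch on a toy column, not the notebook's actual data, and `is_integer_like` is a hypothetical helper.

```python
import datetime
import numbers

import pandas as pd


def is_integer_like(value):
    """Return True for whole-number values, False for anything else."""
    if isinstance(value, bool):
        return False
    if isinstance(value, numbers.Integral):
        return True
    if isinstance(value, float):
        return value.is_integer()
    return False


# Toy column mimicking a Receipt value that was stored as a date
receipts = pd.Series([1, 2, datetime.datetime(1900, 1, 1), 3], dtype='object')

# Keep only the entries that would not convert cleanly to an integer
bad_values = receipts[~receipts.map(is_integer_like)]
print(bad_values)
```

The index of `bad_values` points at the rows to inspect by hand.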

Discontinuities in the enumeration are examined as further validation.

In [None]:
print('(ID, Session): [Receipts]')
for pid in df.ID.unique():
    for session in df.loc[df.ID == pid, 'Session'].unique():
        receipt_numbers = list(df.loc[(df.ID == pid) & (df.Session == session), 'Receipt'].unique())
        if receipt_numbers != list(range(1, len(receipt_numbers) + 1)):
            print(f'({pid}, {session}):', receipt_numbers)

The tuples (145, 3, 1) and (145, 3, 2) do not exist in the database, so their absence here is correct. The tuple (153, 6, 1) is labeled 153-6 with the receipt number missing; this could possibly be corrected in the database.
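
To make gaps like these easier to read, a printed receipt list can be reduced to the numbers that are actually missing. This sketch assumes receipts are meant to be numbered consecutively from 1 within each (ID, Session) pair. `missing_receipts` is a hypothetical helper and the inputs are toy lists.

```python
def missing_receipts(receipt_numbers):
    """Return receipt numbers absent from the run 1..max(receipt_numbers)."""
    present = set(receipt_numbers)
    if not present:
        return []
    return [n for n in range(1, max(present) + 1) if n not in present]


# Toy example: receipts 1 and 2 were never recorded for this session
gaps = missing_receipts([3, 4, 5])
print(gaps)

# A complete enumeration has no gaps
print(missing_receipts([1, 2, 3]))
```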

### Date Column
Records purchase date on receipt if available. The approximate date range for data collection is 5/1/2020 to 12/31/2020.

In [None]:
# convert to datetime and discard the time component
# (casting to a unit-less 'datetime64' is rejected by newer pandas versions)
df.Date = pd.to_datetime(df.Date, errors='coerce').dt.normalize()

assert df.Date.dropna().between(datetime.datetime(2020, 5, 1), datetime.datetime(2020, 12, 31)).all()
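
An `assert` stops the notebook without saying which rows are at fault. If the check fails, a mask like the one below could list the offending dates. This is a sketch on toy dates using the collection window stated above, and the variable names are illustrative.

```python
import pandas as pd

# Collection window stated in the text
START, END = pd.Timestamp('2020-05-01'), pd.Timestamp('2020-12-31')

# Toy dates: one missing value and one year typo
dates = pd.Series(pd.to_datetime(['2020-06-15', None, '2002-06-15']))

# Missing dates are allowed; only present dates outside the window are flagged
out_of_range = dates.notna() & ~dates.between(START, END)
print(dates[out_of_range])
```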

### Item Column
A description of a grocery item as a string formatted as "ITEM (MODIFIER)". Item descriptions are essential data, so unidentifiable items must be dropped.

In [None]:
df.Item = df.Item.str.lower().str.strip().astype('string') # lowercase and strip white space
df.Item.value_counts(dropna=False).head()

35 null items to be dropped.

In [None]:
NULL_ITEM_DESC = r'unknown|n/a|missing'
# na=False keeps the mask strictly boolean; missing items are counted separately
null_like_mask = df.Item.str.contains(NULL_ITEM_DESC, na=False)
null_item_count = null_like_mask.sum() + df.Item.isna().sum()
print('Additional items containing null-like language:')
display(df[null_like_mask])

df = df[df.Item.notna() & ~null_like_mask] # drop rows
print(f'{null_item_count} null item descriptions.')
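
Missing entries need care in this step because `str.contains` returns a missing value, not `False`, when the item itself is missing. Passing `na=False` keeps the mask strictly boolean. A sketch on toy item descriptions, reusing the pattern from the cell above:

```python
import pandas as pd

NULL_ITEM_DESC = r'unknown|n/a|missing'

# Toy item descriptions, including one true missing value
items = pd.Series(['milk (2%)', 'unknown item', None, 'n/a', 'bread'])

# na=False treats missing descriptions as "no match" so the mask has no NA
null_like = items.str.contains(NULL_ITEM_DESC, na=False)
toy_null_count = int(null_like.sum() + items.isna().sum())

# Keep rows that are neither missing nor null-like
kept = items[items.notna() & ~null_like]
print(toy_null_count, list(kept))
```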

### Item2 Column
Provides additional description of the grocery item but, with only 135 non-null values in the original 3,138 rows, is too sparse to be useful.

In [None]:
df = df.drop(columns='Item2')

### Uncertain Column
Denotes low confidence in transcription.

In [None]:
display(df[df.Uncertain.notna()])

The transcription quality seems acceptable, so the data will be kept, but question marks are removed from the item descriptions.

In [None]:
df.Item = df.Item.str.replace(r'?', '', regex=False)
df = df.drop(columns='Uncertain')

### Unknown Column
Denotes very low confidence in transcription.

In [None]:
display(df[df.Unknown.notna()])

These item descriptions are too vague to be useful, so the rows are dropped.

In [None]:
unknown_count = df.Unknown.notna().sum()
print(f'{unknown_count} unknown items.')
df = df[df.Unknown.isna()]
df = df.drop(columns='Unknown')

### Quantity Column
An integer representing multiple purchases of the same item.

In [None]:
df.Quantity.value_counts(dropna=False)

There is one obvious typo, as well as some large quantities that could be typos but, upon inspection, seem fine.

In [None]:
df.loc[df.Quantity == '??', 'Quantity'] = 1 # fix typo

# examine large quantities
display(df[df.Quantity.isin([7, 8, 11, 12, 14, 15])])

The Quantity column data is sparse. To make future analysis easier, rows will be repeated according to their Quantity value. Each row will now represent a single item. Notice that this will expand the size of the data set.

In [None]:
# a missing Quantity counts as a single item; repeat() needs integer counts
quantity = df.Quantity.fillna(1).astype(int)
print(f'{quantity.sum() - df.shape[0]} rows added from expanding Quantity data.')
df = df.loc[df.index.repeat(quantity)]
df = df.drop(columns='Quantity')
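
The effect of `index.repeat` is easiest to see on a small frame. In this sketch with toy rows, a missing Quantity counts as a single purchase, and the counts are cast to integers because `repeat` expects integer repeat counts.

```python
import pandas as pd

# Toy rows: bread was bought three times, milk has no recorded quantity
toy = pd.DataFrame({'Item': ['milk', 'bread', 'eggs'],
                    'Quantity': [None, 3, 1]})

repeats = toy.Quantity.fillna(1).astype(int)
rows_added = int(repeats.sum()) - len(toy)

# Each index label is repeated according to `repeats`, then used to select rows
expanded = toy.loc[toy.index.repeat(repeats)].drop(columns='Quantity')
print(rows_added)
print(expanded.Item.tolist())
```

The repeated rows keep their original index labels, which is why the notebook resets the index before saving.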

### Hit Column & Miss Column
Both columns contain little to no data at this time and are dropped.

In [None]:
df = df.drop(columns=['Hit', 'Miss'])

### Category Column
Labels grocery by type.

In [None]:
df.Category = df.Category.astype('string')
df.Category.value_counts(dropna=False).head()

### Comment Column
Contains miscellaneous notes from transcriber.

In [None]:
df.Comment.value_counts().head(10)

In [None]:
display(df[df.Comment.str.contains(r'uncertain', na=False)].sample(25))

"Uncertain" items seem useable, but "repeated" items are dropped.

In [None]:
duplicate_mask = df.Comment.str.contains(r'duplicate|repeat', case=False, na=False)
duplicate_drop_count = sum(duplicate_mask)
df = df[~duplicate_mask]
df.Comment = df.Comment.astype('string')
print(f'{duplicate_drop_count} repeated rows.')

### A Cleaned Data Set

In [None]:
df.info()

In [None]:
total_drop = null_receipt_count + null_item_count + unknown_count + duplicate_drop_count
print(f'Total row reduction: {total_drop} ({total_drop / initial_row_count:.0%})')
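
Each count is taken from the rows that survived the earlier steps, so no row is counted twice in the total. A per-step breakdown can make the overall rate easier to interpret. The sketch below uses placeholder numbers, not the notebook's results, and `drop_report` is a hypothetical helper.

```python
def drop_report(drop_counts, initial_rows):
    """Return (reason, count, share of initial rows) per step, plus a total."""
    report = [(reason, count, count / initial_rows)
              for reason, count in drop_counts.items()]
    total = sum(drop_counts.values())
    report.append(('total', total, total / initial_rows))
    return report


# Placeholder counts for illustration only
example = {'null receipt': 40, 'null item': 5, 'unknown': 3, 'repeated': 2}
for reason, count, share in drop_report(example, initial_rows=100):
    print(f'{reason}: {count} ({share:.0%})')
```

Note that in this notebook the repeated-row count is taken after the Quantity expansion, so it is measured in expanded rows.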

In [None]:
df = df.reset_index(drop=True)
df.to_csv(f'{DATA_PATH}clean_{SHEET.lower()}.csv')