## Fetch Rewards Coding Test
### Data Quality Issues

7/21/2021
Lili Teister

**Exercise:** https://fetch-hiring.s3.amazonaws.com/data-analyst/ineeddata-data-modeling/data-modeling.html

In [6]:
import sqlite3
import pandas as pd

In [7]:
def sqlite_connection(dbname):
    c = None
    try:
        c = sqlite3.connect(dbname)
        print(f"Database {dbname} connected with sqlite version {sqlite3.version}")
    except sqlite3.Error as err:
        print(err)
        
    return c

conn = sqlite_connection('fetchTest.db')

Database fetchTest.db connected with sqlite version 2.6.0


In exploring the items data as it relates to the brands data, I encountered some concerns with the quality of the brands data and whether or not it can be used to enhance analysis of purchased items.

While `brandId` is unique in the brands dataset, it is not present in the items data, so I explored using `barcode` as the join key instead. However, there appear to be duplicate barcodes shared by different brands, and some of those records look like test data.

In [62]:
sql = '''
    SELECT 
        br.*
    FROM 
        brandsDim br
    WHERE EXISTS (
        SELECT 1
        FROM brandsDim br1
        WHERE br.barcode = br1.barcode
        GROUP BY br1.barcode
        HAVING COUNT(br1.barcode) > 1
    )
    ORDER BY barcode
'''

pd.read_sql_query(sql, conn)

Unnamed: 0,barcode,category,categoryCode,name,topBrand,brandId,cpgId,cpg,brandCode
0,511111004790,Baking,,alexa,1.0,5c409ab4cd244a3539b84162,55b62995e4b0d8e685c14213,Cogs,ALEXA
1,511111004790,Condiments & Sauces,,Bitten Dressing,,5cdacd63166eb33eb7ce0fa8,559c2234e4b06aca36af13c6,Cogs,BITTEN
2,511111204923,Grocery,,Brand1,1.0,5c45f91b87ff3552f950f027,5c45f8b087ff3552f950f026,Cogs,0987654321
3,511111204923,Snacks,,CHESTER'S,,5d6027f46d5f3b23d1bc7906,5332f5fbe4b03c9a25efd0ba,Cogs,CHESTERS
4,511111305125,Baby,,Chris Image Test,,5c4699f387ff3577e203ea29,55b62995e4b0d8e685c14213,Cogs,CHRISIMAGE
5,511111305125,Magazines,,Rachael Ray Everyday,,5d642d65a3a018514994f42d,5d5d4fd16d5f3b23d1bc7905,Cogs,511111305125
6,511111504139,Beverages,,Chris Brand XYZ,,5a7e0604e4b0aedb3b84afd3,55b62995e4b0d8e685c14213,Cogs,CHRISXYZ
7,511111504139,Grocery,,Pace,0.0,5a8c33f3e4b07f0a2dac8943,5a734034e4b0d58f376be874,Cogs,PACE
8,511111504788,Baking,,test,,5c408e8bcd244a1fdb47aee7,59ba6f1ce4b092b29c167346,Cogs,TEST
9,511111504788,Condiments & Sauces,,The Pioneer Woman,,5ccb2ece166eb31bbbadccbe,559c2234e4b06aca36af13c6,Cogs,PIONEER WOMAN
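
As a complementary check, the duplicates can be summarized as one row per barcode rather than one row per brand. The following is a minimal sketch, assuming an in-memory stand-in for `brandsDim`: the first two rows copy values from the output above, while the third row and the names `demo`, `sql_dupe_summary`, and `dupe_rows` are hypothetical and exist only for illustration.

```python
import sqlite3

# In-memory stand-in for brandsDim. The first two rows copy values from the
# output above; the third row is a made-up non-duplicate for contrast.
demo = sqlite3.connect(":memory:")
demo.executescript(
    """
    CREATE TABLE brandsDim (barcode TEXT, name TEXT, brandId TEXT);
    INSERT INTO brandsDim VALUES
        ('511111004790', 'alexa', '5c409ab4cd244a3539b84162'),
        ('511111004790', 'Bitten Dressing', '5cdacd63166eb33eb7ce0fa8'),
        ('000000000000', 'Hypothetical Brand', 'hypothetical-id');
    """
)

# One row per barcode that maps to more than one brandId.
sql_dupe_summary = """
    SELECT
        barcode
        , COUNT(DISTINCT brandId) AS NBrands
        , GROUP_CONCAT(name, ' | ') AS BrandNames
    FROM
        brandsDim
    GROUP BY
        barcode
    HAVING
        COUNT(DISTINCT brandId) > 1
    ORDER BY
        barcode
"""

dupe_rows = demo.execute(sql_dupe_summary).fetchall()
for barcode, n_brands, names in dupe_rows:
    print(barcode, n_brands, names)
```

Run against the real `conn`, the same query would return one line per ambiguous barcode, which may be easier to share with stakeholders than the full brand rows.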


I also looked at how many items' barcodes matched any brands in the brands dataset:

In [54]:
sql = '''
    SELECT
        COUNT(DISTINCT r.id) AS NReceipts
        , COUNT(DISTINCT it.receiptId) AS NReceiptsWithItemsData
        , COUNT(DISTINCT it.generatedId) as Nitems
        , COUNT(DISTINCT IIF(br.barcode IS NULL, NULL, it.generatedId)) AS NItemsWithMatchedBarcodes
        , COUNT(DISTINCT IIF(it.brandCode IS NULL, NULL, it.generatedId)) AS NItemsWithBrandCodes
    FROM 
        receiptsFact r
    LEFT JOIN 
        receiptItemsDim it
        ON r.Id = it.receiptId
    LEFT JOIN 
        brandsDim br 
        ON it.barcode = br.barcode
    WHERE
        r.rewardsReceiptStatus = 'FINISHED'
        AND (it.deleted is null or it.deleted = 0)
    
'''

sql_brands = '''
    SELECT 
        br.name
        , COUNT(DISTINCT it.generatedId) as Items
    FROM 
        receiptsFact r
    JOIN 
        receiptItemsDim it
        ON r.Id = it.receiptId
    JOIN 
        brandsDim br 
        ON it.barcode = br.barcode
    WHERE
        r.rewardsReceiptStatus = 'FINISHED'
        AND (it.deleted is null or it.deleted = 0)
    GROUP BY 
        br.name
    ORDER BY Items DESC
'''

sql_brand_codes = '''
    SELECT 
        it.brandCode
        , COUNT(DISTINCT it.generatedId) as Items
    FROM 
        receiptsFact r
    JOIN 
        receiptItemsDim it
        ON r.Id = it.receiptId
    WHERE
        r.rewardsReceiptStatus = 'FINISHED'
        AND (it.deleted is null or it.deleted = 0)
        AND it.brandCode IS NOT NULL
    GROUP BY 
        it.brandCode
    ORDER BY Items DESC
'''



In [55]:
pd.read_sql_query(sql, conn)

Unnamed: 0,NReceipts,NReceiptsWithItemsData,Nitems,NItemsWithMatchedBarcodes,NItemsWithBrandCodes
0,518,516,5918,82,2350


In the brands data, `brandId` appears to be the unique key for each brand; however, `brandId` is not referenced in the data for items on receipts. The field `barcode` is in both sources, but it is not a unique key in the brands source, and relatively few items match a brand on barcode (82 of 5,918 items, or 1.4%). One thought is to use the `brandCode` from the items data instead, which is less sparse: it is populated for 2,350 of 5,918 items (39.7%).
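
The two coverage rates can be reproduced directly from the counts in the summary output. This is a small arithmetic sketch; the variable names are illustrative and not part of the original notebook. Rounded to one decimal place, 82 of 5,918 is about 1.4% and 2,350 of 5,918 is about 39.7%.

```python
# Counts copied from the summary query output above.
n_items = 5918
n_matched_barcodes = 82
n_with_brand_codes = 2350

# Share of items whose barcode matches any row in the brands data.
barcode_match_rate = n_matched_barcodes / n_items
# Share of items that carry a brandCode at all.
brand_code_rate = n_with_brand_codes / n_items

print(f"Items matched to a brand on barcode: {barcode_match_rate:.1%}")
print(f"Items with a brandCode: {brand_code_rate:.1%}")
```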

Before I can analyze the number of items purchased by brand for finished receipts, I have some follow-up questions regarding barcodes and brand identifiers on receipt items, summarized in my message to the stakeholders.

In [58]:
print("Number of items by brand from brands data:")
pd.read_sql_query(sql_brands, conn)

Number of items by brand from brands data:


Unnamed: 0,name,Items
0,Tostitos,23
1,Swanson,11
2,Cracker Barrel Cheese,10
3,Prego,7
4,Diet Chris Cola,7
5,Pepperidge Farm,5
6,V8,4
7,Rice A Roni,3
8,Quaker,3
9,Kraft,3


In [59]:
print("Number of items by brandCode on item:")
pd.read_sql_query(sql_brand_codes, conn)

Number of items by brandCode on item:


Unnamed: 0,brandCode,Items
0,HY-VEE,291
1,BEN AND JERRYS,162
2,PEPSI,89
3,KLEENEX,88
4,KNORR,78
...,...,...
214,BOTA BOX,1
215,BOAR'S HEAD,1
216,BIC,1
217,BANZA,1
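
Before relying on `brandCode` as a join key, it would be worth checking how many item brand codes actually exist in the brands data. The sketch below shows one way to measure that overlap, assuming an in-memory stand-in database: the table and column names mirror the queries above, but every row is hypothetical, and whether codes such as `PEPSI` appear in the real `brandsDim` is not confirmed by this notebook. It uses `CASE` rather than `IIF` only so that it also runs on older SQLite builds.

```python
import sqlite3

# In-memory stand-in for fetchTest.db. All rows are hypothetical; only the
# table and column names come from the queries above.
demo = sqlite3.connect(":memory:")
demo.executescript(
    """
    CREATE TABLE brandsDim (brandId TEXT, brandCode TEXT, name TEXT);
    CREATE TABLE receiptItemsDim (generatedId TEXT, brandCode TEXT);
    INSERT INTO brandsDim VALUES
        ('b1', 'PEPSI', 'Pepsi'),
        ('b2', 'KLEENEX', 'Kleenex');
    INSERT INTO receiptItemsDim VALUES
        ('i1', 'PEPSI'),
        ('i2', 'PEPSI'),
        ('i3', 'HY-VEE'),
        ('i4', NULL);
    """
)

# Of the items that carry a brandCode, count how many find a brand row.
sql_code_overlap = """
    SELECT
        COUNT(DISTINCT it.generatedId) AS NItemsWithBrandCodes
        , COUNT(DISTINCT CASE WHEN br.brandCode IS NOT NULL
                              THEN it.generatedId END) AS NItemsWithMatchedBrandCodes
    FROM
        receiptItemsDim it
    LEFT JOIN
        brandsDim br
        ON it.brandCode = br.brandCode
    WHERE
        it.brandCode IS NOT NULL
"""

n_with_codes, n_matched_codes = demo.execute(sql_code_overlap).fetchone()
print(n_with_codes, n_matched_codes)
```

If `brandCode` turns out not to be unique in the brands data either, the same duplicate check used for `barcode` would apply before using it as a key.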


----