# Intro

## Dataset
There are nine tables of sample data:

* Sales Receipts
* Pastry Inventory
* Sales Targets
* Customer
* Dates
* Product
* Sales Outlet
* Staff
* Generation

Source: [IBM/Kaggle](https://www.kaggle.com/datasets/ylchang/coffee-shop-sample-data-1113)

# Data Dictionary

## Library Imports

In [1]:
# data 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# sql 
import sqlite3

# system
from pathlib import Path

## Database Connection

In a real-world scenario, your dataset isn't always contained in a handy, cleaned CSV file; it often lives in a production database.

Here we create and populate a SQLite database to simulate real-world conditions.

We then use pandas to load the results of our SQL queries into a DataFrame for analysis.
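As a minimal sketch of that round trip, the example below writes two small frames to SQLite and reads a join back with pandas. The frames, their values, and the `load_tables` helper are made up for illustration and stand in for the CSV files; the database is in-memory rather than `coffee.db`. It uses `if_exists="replace"` because `if_exists="append"` adds the same rows again every time the loading cell is re-run.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-ins for customer.csv and the sales receipts file.
customers = pd.DataFrame(
    {"customer_id": [1, 2], "customer_name": ["Ann", "Bob"]}
)
sales = pd.DataFrame(
    {
        "transaction_id": [10, 11, 12],
        "customer_id": [1, 1, 3],  # customer 3 has no customer record
        "line_item_amount": [2.5, 3.0, 4.0],
    }
)


def load_tables(connection):
    """Write both frames to SQLite (hypothetical helper for this sketch)."""
    # "replace" drops and recreates each table, so re-running is safe.
    customers.to_sql("customers", connection, if_exists="replace", index=False)
    sales.to_sql("sales", connection, if_exists="replace", index=False)


connection = sqlite3.connect(":memory:")
load_tables(connection)
load_tables(connection)  # simulate running the notebook cell a second time

# Name the columns explicitly so customer_id appears only once.
query = """
SELECT s.transaction_id, s.customer_id, s.line_item_amount, c.customer_name
FROM sales s
LEFT JOIN customers c ON c.customer_id = s.customer_id
"""
joined = pd.read_sql(query, connection)
connection.close()

print(joined)
```

Because the tables are replaced rather than appended to, the join returns one row per sale no matter how often the load step runs. The sale with no matching customer keeps a missing `customer_name`, which is what a `LEFT JOIN` is expected to do.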

In [2]:
# create our initial db file
Path('coffee.db').touch()

# create a db connection
connection = sqlite3.connect("coffee.db")

# cursor: executes SQL statements and fetches their results
c = connection.cursor()

# create a sample table to verify
# c.execute('''CREATE TABLE sample_table (u_id int, email text)''')

# load the data into a dataframe
customers = pd.read_csv("customer.csv")

# write df to a sqlite table
# note: if_exists="append" adds the rows again each time this cell is re-run
customers.to_sql('customers', connection, if_exists="append", index=False) # 2246 rows

# data for our second table/df
sales = pd.read_csv("2019_04_sales_reciepts.csv")

# second table (same caveat about "append" applies)
sales.to_sql('sales', connection, if_exists='append', index=False) # 49894 rows

# verify
# fetchall returns a list of tuples, one tuple per row
# c.execute('''SELECT * FROM customers''').fetchall()

# verify second table
# c.execute('''SELECT *  FROM sales''').fetchall()

# join our two tables
c.execute('''SELECT * FROM sales s LEFT JOIN customers c ON c.customer_id = s.customer_id''')
c.fetchall()

[(7,
  '2019-04-01',
  '12:04:43',
  3,
  12,
  558,
  'N',
  1,
  1,
  52,
  1,
  2.5,
  2.5,
  'N',
  558,
  3,
  'Melissa Johnson',
  'Luke@eget.net',
  '2018-06-19',
  '816-924-9433',
  '1983-02-25',
  'F',
  1983),
 (7,
  '2019-04-01',
  '12:04:43',
  3,
  12,
  558,
  'N',
  1,
  1,
  52,
  1,
  2.5,
  2.5,
  'N',
  558,
  3,
  'Melissa Johnson',
  'Luke@eget.net',
  '2018-06-19',
  '816-924-9433',
  '1983-02-25',
  'F',
  1983),
 (7,
  '2019-04-01',
  '12:04:43',
  3,
  12,
  558,
  'N',
  1,
  1,
  52,
  1,
  2.5,
  2.5,
  'N',
  558,
  3,
  'Melissa Johnson',
  'Luke@eget.net',
  '2018-06-19',
  '816-924-9433',
  '1983-02-25',
  'F',
  1983),
 (7,
  '2019-04-01',
  '12:04:43',
  3,
  12,
  558,
  'N',
  1,
  1,
  52,
  1,
  2.5,
  2.5,
  'N',
  558,
  3,
  'Melissa Johnson',
  'Luke@eget.net',
  '2018-06-19',
  '816-924-9433',
  '1983-02-25',
  'F',
  1983),
 (11,
  '2019-04-01',
  '15:54:39',
  3,
  17,
  781,
  'N',
  1,
  1,
  27,
  2,
  7.0,
  3.5,
  'N',
  781,
  3,
  'Lu

In [3]:
# load our joined SQL tables into a dataframe
coffee = pd.read_sql('''SELECT * FROM sales s LEFT JOIN customers c ON c.customer_id = s.customer_id''', connection)
# SELECT * returns customer_id from both tables; keep only the first occurrence of each column name
coffee = coffee.loc[:, ~coffee.columns.duplicated()]
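To show what the `columns.duplicated()` mask does, here is a small sketch on two made-up tables in an in-memory database (the table contents are assumptions for illustration, not the coffee shop data). `SELECT *` over a join returns the join key once per table, and the mask keeps only the first occurrence of each column name.

```python
import sqlite3

import pandas as pd

connection = sqlite3.connect(":memory:")

# Two tiny hypothetical tables that share the customer_id key.
pd.DataFrame({"customer_id": [1, 2], "amount": [2.5, 3.5]}).to_sql(
    "sales", connection, index=False
)
pd.DataFrame({"customer_id": [1, 2], "gender": ["F", "M"]}).to_sql(
    "customers", connection, index=False
)

# SELECT * returns customer_id from both sides of the join.
raw = pd.read_sql(
    "SELECT * FROM sales s LEFT JOIN customers c ON c.customer_id = s.customer_id",
    connection,
)
connection.close()

# duplicated() marks every repeat of a column name after its first occurrence;
# negating the mask keeps the first copy (here, the one from sales).
deduped = raw.loc[:, ~raw.columns.duplicated()]

print(list(raw.columns))
print(list(deduped.columns))
```

The surviving `customer_id` is the one from the left-hand table, which stays populated even for rows with no matching customer. Listing the wanted columns in the `SELECT` clause instead of using `*` would avoid the duplicate in the first place.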

In [4]:
coffee

| | transaction_id | transaction_date | transaction_time | sales_outlet_id | staff_id | customer_id | instore_yn | order | line_item_id | product_id | ... | unit_price | promo_item_yn | home_store | customer_first-name | customer_email | customer_since | loyalty_card_number | birthdate | gender | birth_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 2019-04-01 | 12:04:43 | 3 | 12 | 558 | N | 1 | 1 | 52 | ... | 2.5 | N | 3.0 | Melissa Johnson | Luke@eget.net | 2018-06-19 | 816-924-9433 | 1983-02-25 | F | 1983.0 |
| 1 | 7 | 2019-04-01 | 12:04:43 | 3 | 12 | 558 | N | 1 | 1 | 52 | ... | 2.5 | N | 3.0 | Melissa Johnson | Luke@eget.net | 2018-06-19 | 816-924-9433 | 1983-02-25 | F | 1983.0 |
| 2 | 7 | 2019-04-01 | 12:04:43 | 3 | 12 | 558 | N | 1 | 1 | 52 | ... | 2.5 | N | 3.0 | Melissa Johnson | Luke@eget.net | 2018-06-19 | 816-924-9433 | 1983-02-25 | F | 1983.0 |
| 3 | 7 | 2019-04-01 | 12:04:43 | 3 | 12 | 558 | N | 1 | 1 | 52 | ... | 2.5 | N | 3.0 | Melissa Johnson | Luke@eget.net | 2018-06-19 | 816-924-9433 | 1983-02-25 | F | 1983.0 |
| 4 | 11 | 2019-04-01 | 15:54:39 | 3 | 17 | 781 | N | 1 | 1 | 27 | ... | 3.5 | N | 3.0 | Luke Patel | Herrod@Maecenas.us | 2018-11-02 | 653-218-9979 | 1991-07-29 | N | 1991.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 373345 | 763 | 2019-04-29 | 15:45:52 | 8 | 45 | 8030 | N | 1 | 1 | 44 | ... | 2.5 | N | 8.0 | Deirdre | Austin@Nullam.edu | 2018-08-23 | 383-091-4412 | 1994-01-10 | F | 1994.0 |
| 373346 | 763 | 2019-04-29 | 15:45:52 | 8 | 45 | 8030 | N | 1 | 5 | 75 | ... | 3.5 | N | 8.0 | Deirdre | Austin@Nullam.edu | 2018-08-23 | 383-091-4412 | 1994-01-10 | F | 1994.0 |
| 373347 | 763 | 2019-04-29 | 15:45:52 | 8 | 45 | 8030 | N | 1 | 5 | 75 | ... | 3.5 | N | 8.0 | Deirdre | Austin@Nullam.edu | 2018-08-23 | 383-091-4412 | 1994-01-10 | F | 1994.0 |
| 373348 | 763 | 2019-04-29 | 15:45:52 | 8 | 45 | 8030 | N | 1 | 5 | 75 | ... | 3.5 | N | 8.0 | Deirdre | Austin@Nullam.edu | 2018-08-23 | 383-091-4412 | 1994-01-10 | F | 1994.0 |


In [5]:
coffee.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 373350 entries, 0 to 373349
Data columns (total 22 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   transaction_id       373350 non-null  int64  
 1   transaction_date     373350 non-null  object 
 2   transaction_time     373350 non-null  object 
 3   sales_outlet_id      373350 non-null  int64  
 4   staff_id             373350 non-null  int64  
 5   customer_id          373350 non-null  int64  
 6   instore_yn           373350 non-null  object 
 7   order                373350 non-null  int64  
 8   line_item_id         373350 non-null  int64  
 9   product_id           373350 non-null  int64  
 10  quantity             373350 non-null  int64  
 11  line_item_amount     373350 non-null  float64
 12  unit_price           373350 non-null  float64
 13  promo_item_yn        373350 non-null  object 
 14  home_store           298224 non-null  float64
 15  customer_first-na