# Intro

## Dataset
There are nine tables of sample data:

* Sales Receipts
* Pastry Inventory
* Sales Targets
* Customer
* Dates
* Product
* Sales Outlet
* Staff
* Generation

Source: [IBM/Kaggle](https://www.kaggle.com/datasets/ylchang/coffee-shop-sample-data-1113)

# Data Dictionary - Customers

|Column Name| Description|
|-----------|-----------|
|**customer_id**|A unique id assigned upon registration to each customer|
|**home_store**|The store a customer was registered at or has set as their main location|
|**customer_first-name**|Customer's first and last name|
|**customer_email**|Customer's email address used for registration|
|**customer_since**|Date the customer registered or first started shopping (YYYY-MM-DD)|
|**loyalty_card_number**|A unique id used for a loyalty rewards program|
|**birthdate**|Customer's date of birth (YYYY-MM-DD)|
|**gender**|Customer's registered gender|
|**birth_year**|Customer's year of birth|

# Data Dictionary - Sales

|Column Name| Description|
|-----------|-----------|
|**transaction_id**|A unique id assigned to each purchase.|
|**transaction_date**|Calendar date when a transaction was made (YYYY-MM-DD)|
|**transaction_time**|Time when a transaction was made (HH:MM:SS)|
|**sales_outlet_id**|A unique id used to identify which store handled the sale|
|**staff_id**|A unique id assigned to each staff member; identifies who processed the transaction|
|**customer_id**|A unique id assigned upon registration to each customer|
|**instore_yn**|Whether the purchase was made in store rather than online (Y/N)|
|**order**|Unspecified|
|**line_item_id**|Unspecified|
|**product_id**|A unique id assigned to each product|
|**quantity**|The amount of each product purchased|
|**line_item_amount**|Unspecified|
|**unit_price**|Unit price for the product specified|
|**promo_item_yn**|Whether the product was part of a promotional campaign (Y/N)|
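
`line_item_amount` is listed as unspecified, but the sample rows shown in the join output further down suggest it may simply be `quantity * unit_price`. The sketch below checks that guess against the five rows visible in that output. This is an assumption about the column, not a documented definition, and five rows are not proof.

```python
import pandas as pd

# The five sales rows visible in the join output further down
# (product_id, quantity, line_item_amount, unit_price).
sample = pd.DataFrame(
    {
        "product_id": [52, 27, 46, 23, 34],
        "quantity": [1, 2, 2, 2, 1],
        "line_item_amount": [2.5, 7.0, 5.0, 5.0, 2.45],
        "unit_price": [2.5, 3.5, 2.5, 2.5, 2.45],
    }
)

# Assumption being tested: line_item_amount == quantity * unit_price.
expected = sample["quantity"] * sample["unit_price"]
matches = (expected - sample["line_item_amount"]).abs() < 1e-9

print(matches.all())
```

Running the same comparison on the full `sales` table would show whether the relationship holds everywhere or whether promotions change it.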


## Library Imports

In [1]:
# data 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# sql 
import sqlite3

#system
from pathlib import Path

## Database Connection

In a real-world scenario, your dataset isn't always contained in a handy, cleaned CSV file; it often lives in a production database.

Here we create and populate a SQLite database to simulate those conditions.

We then use pandas to load the results of our SQL queries into a dataframe for analysis.
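
One thing to watch for: the cell below writes its tables with `if_exists="append"`, so running it more than once adds the same rows again. A minimal sketch of the difference, using an in-memory database and made-up rows (the table names `customers_append` and `customers_replace` are hypothetical):

```python
import sqlite3
import pandas as pd

# Made-up rows standing in for customer.csv.
demo = pd.DataFrame({"customer_id": [1, 2, 3], "home_store": [3, 3, 8]})

conn = sqlite3.connect(":memory:")  # throwaway database, nothing written to disk


def row_count(table):
    """Return the number of rows in the given table."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


# "append" adds the rows again on every run.
demo.to_sql("customers_append", conn, if_exists="append", index=False)
demo.to_sql("customers_append", conn, if_exists="append", index=False)

# "replace" drops and recreates the table, so the count stays stable.
demo.to_sql("customers_replace", conn, if_exists="replace", index=False)
demo.to_sql("customers_replace", conn, if_exists="replace", index=False)

appended = row_count("customers_append")
replaced = row_count("customers_replace")
print(appended, replaced)  # 6 3

# Read a query result straight into a dataframe.
roundtrip = pd.read_sql("SELECT * FROM customers_replace", conn)
conn.close()
```

If the notebook is meant to be re-run from the top, `if_exists="replace"` (or deleting `coffee.db` first) keeps the row counts in line with the CSV files.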

In [2]:
# create our initial db file
Path('coffee.db').touch()

# create a db connection
connection = sqlite3.connect("coffee.db")

# cursor - database iterator
c = connection.cursor()

# create a sample table to verify
# c.execute('''CREATE TABLE sample_table (u_id int, email text)''')

# load the data into a dataframe
customers = pd.read_csv("customer.csv")

# write df to a sqlite table
customers.to_sql('customers', connection, if_exists="append", index=False) # 2246 results

# data for our second table/df
sales = pd.read_csv("2019_04_sales_reciepts.csv")

# second table
sales.to_sql('sales', connection, if_exists='append', index=False) # 49894 results

# verify
# fetchall returns a list of tuples
# c.execute('''SELECT *  FROM customers''').fetchall()

# verify second table
# c.execute('''SELECT *  FROM sales''').fetchall()

# join our two tables
c.execute('''SELECT * FROM sales s LEFT JOIN customers c ON c.customer_id = s.customer_id''')
c.fetchall()

[(7,
  '2019-04-01',
  '12:04:43',
  3,
  12,
  558,
  'N',
  1,
  1,
  52,
  1,
  2.5,
  2.5,
  'N',
  558,
  3,
  'Melissa Johnson',
  'Luke@eget.net',
  '2018-06-19',
  '816-924-9433',
  '1983-02-25',
  'F',
  1983),
 (11,
  '2019-04-01',
  '15:54:39',
  3,
  17,
  781,
  'N',
  1,
  1,
  27,
  2,
  7.0,
  3.5,
  'N',
  781,
  3,
  'Luke Patel',
  'Herrod@Maecenas.us',
  '2018-11-02',
  '653-218-9979',
  '1991-07-29',
  'N',
  1991),
 (19,
  '2019-04-01',
  '14:34:59',
  3,
  17,
  788,
  'Y',
  1,
  1,
  46,
  2,
  5.0,
  2.5,
  'N',
  788,
  3,
  'Hilel Ballard',
  'Rajah@risus.org',
  '2018-12-30',
  '263-826-9026',
  '1995-02-23',
  'N',
  1995),
 (32,
  '2019-04-01',
  '16:06:04',
  3,
  12,
  683,
  'N',
  1,
  1,
  23,
  2,
  5.0,
  2.5,
  'N',
  683,
  3,
  'Zephr Zimmerman',
  'Dacey@in.net',
  '2019-03-04',
  '741-320-7166',
  '1999-02-06',
  'F',
  1999),
 (33,
  '2019-04-01',
  '19:18:37',
  3,
  17,
  99,
  'Y',
  1,
  1,
  34,
  1,
  2.45,
  2.45,
  'N',
  99,
  3,
  '

In [3]:
# import our joined sql tables into a dataframe
coffee = pd.read_sql('''SELECT * FROM sales s LEFT JOIN customers c ON c.customer_id = s.customer_id''', connection)
# drop the repeated customer_id column that SELECT * returns from both tables
coffee = coffee.loc[:,~coffee.columns.duplicated()]
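
To make the de-duplication step less opaque, here is a small sketch of what `columns.duplicated()` returns on a frame with a repeated column name. The frame `joined` is a hypothetical stand-in for the joined result, reduced to four columns.

```python
import pandas as pd

# Small illustrative frame with a repeated column name,
# as a join with SELECT * can produce.
joined = pd.DataFrame(
    [[558, 2.5, 558, "Melissa Johnson"], [781, 3.5, 781, "Luke Patel"]],
    columns=["customer_id", "unit_price", "customer_id", "customer_first-name"],
)

# duplicated() marks every repeat of a name after its first occurrence.
mask = joined.columns.duplicated()
print(mask.tolist())  # [False, False, True, False]

# ~mask keeps only the first occurrence of each column name.
deduped = joined.loc[:, ~mask]
print(deduped.columns.tolist())
# ['customer_id', 'unit_price', 'customer-first-name' is kept as 'customer_first-name']
```

Because the first occurrence is kept, the surviving `customer_id` is the one from the `sales` side of the join, which is never null in a `LEFT JOIN` from `sales`.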

In [4]:
coffee

| |transaction_id|transaction_date|transaction_time|sales_outlet_id|staff_id|customer_id|instore_yn|order|line_item_id|product_id|...|unit_price|promo_item_yn|home_store|customer_first-name|customer_email|customer_since|loyalty_card_number|birthdate|gender|birth_year|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|7|2019-04-01|12:04:43|3|12|558|N|1|1|52|...|2.50|N|3.0|Melissa Johnson|Luke@eget.net|2018-06-19|816-924-9433|1983-02-25|F|1983.0|
|1|11|2019-04-01|15:54:39|3|17|781|N|1|1|27|...|3.50|N|3.0|Luke Patel|Herrod@Maecenas.us|2018-11-02|653-218-9979|1991-07-29|N|1991.0|
|2|19|2019-04-01|14:34:59|3|17|788|Y|1|1|46|...|2.50|N|3.0|Hilel Ballard|Rajah@risus.org|2018-12-30|263-826-9026|1995-02-23|N|1995.0|
|3|32|2019-04-01|16:06:04|3|12|683|N|1|1|23|...|2.50|N|3.0|Zephr Zimmerman|Dacey@in.net|2019-03-04|741-320-7166|1999-02-06|F|1999.0|
|4|33|2019-04-01|19:18:37|3|17|99|Y|1|1|34|...|2.45|N|3.0|Orlando Shields|Ivory@scelerisque.us|2017-10-01|747-164-4596|1967-01-29|M|1967.0|
|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|
|49889|753|2019-04-29|16:51:58|8|42|0|N|1|1|30|...|3.00|N| | | | | | | | |
|49890|756|2019-04-29|16:51:14|8|42|8412|Y|1|1|25|...|2.20|N|8.0|Malcolm|Cedric@neque.us|2019-01-08|193-832-1350|1953-09-16|M|1953.0|
|49891|759|2019-04-29|11:17:36|8|15|0|Y|1|1|31|...|2.20|N| | | | | | | | |
|49892|763|2019-04-29|15:45:52|8|45|8030|N|1|1|44|...|2.50|N|8.0|Deirdre|Austin@Nullam.edu|2018-08-23|383-091-4412|1994-01-10|F|1994.0|


In [5]:
coffee.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49894 entries, 0 to 49893
Data columns (total 22 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   transaction_id       49894 non-null  int64  
 1   transaction_date     49894 non-null  object 
 2   transaction_time     49894 non-null  object 
 3   sales_outlet_id      49894 non-null  int64  
 4   staff_id             49894 non-null  int64  
 5   customer_id          49894 non-null  int64  
 6   instore_yn           49894 non-null  object 
 7   order                49894 non-null  int64  
 8   line_item_id         49894 non-null  int64  
 9   product_id           49894 non-null  int64  
 10  quantity             49894 non-null  int64  
 11  line_item_amount     49894 non-null  float64
 12  unit_price           49894 non-null  float64
 13  promo_item_yn        49894 non-null  object 
 14  home_store           24852 non-null  float64
 15  customer_first-name  24852 non-null 
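
The `info()` output shows 49894 sales rows but only 24852 non-null `home_store` values, so roughly half of the sales rows found no customer in the `LEFT JOIN`. In the preview above, the rows with `customer_id` 0 are the ones with empty customer fields; that 0 may mark unregistered purchases, but this is a guess from the preview, not something the dataset documents. A minimal sketch of counting unmatched rows with pandas, using made-up frames:

```python
import pandas as pd

# Made-up rows; customer_id 0 has no entry in the customers table.
sales = pd.DataFrame({"transaction_id": [7, 11, 753], "customer_id": [558, 781, 0]})
customers = pd.DataFrame({"customer_id": [558, 781], "home_store": [3, 3]})

# how="left" mirrors the LEFT JOIN; indicator=True adds a _merge column
# that says whether each sales row found a matching customer.
merged = sales.merge(customers, on="customer_id", how="left", indicator=True)

unmatched = merged[merged["_merge"] == "left_only"]
print(len(unmatched), "of", len(merged), "rows have no customer record")
```

On the real `coffee` dataframe, `coffee["home_store"].isna().sum()` should give the same kind of count without re-running the join.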

## Data Cleaning - Column Rename

### Customer

|customer_id|home_store|customer_first-name|customer_email|customer_since|loyalty_card_number|birthdate|gender|birth_year|
|---|---|---|---|---|---|---|---|---|

### Sales

|transaction_id|transaction_date|transaction_time|sales_outlet_id|staff_id|customer_id|instore_yn|order|line_item_id|product_id|quantity|line_item_amount|unit_price|promo_item_yn|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
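
A rename pass could look like the sketch below. The new names are hypothetical suggestions, not names used elsewhere in this notebook; the point is only to show `DataFrame.rename` with a mapping.

```python
import pandas as pd

# Empty frame carrying a few of the original column names.
coffee_demo = pd.DataFrame(
    columns=["customer_first-name", "order", "instore_yn", "promo_item_yn"]
)

# Hypothetical new names; adjust to taste.
rename_map = {
    "customer_first-name": "customer_name",  # hyphen blocks attribute access
    "order": "order_number",                 # ORDER is a reserved word in SQL
    "instore_yn": "in_store",
    "promo_item_yn": "promo_item",
}

renamed = coffee_demo.rename(columns=rename_map)
print(renamed.columns.tolist())
# ['customer_name', 'order_number', 'in_store', 'promo_item']
```

Columns missing from the mapping are left unchanged, so the same `rename_map` can be applied to the full `coffee` dataframe.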