# Data Exploration and Pipeline

The database contains `Users`, `Orders`, and `Partners` tables. 

Partners are the companies who sell surplus items on the marketplace.
A cohort consists of customers who made their first order within the same month (M0). 
M1 retention is the share of customers who have made at least one purchase one month after their first purchase month.

Explore the data with SQL to investigate the following:

- The top 10 partners by sales
- Customers’ favourite partner segments (default offer types)
- The M1 retention for any given customer cohort

## Connect to the database 
The connection to the SQLite database is made through the `jupysql` Python library, which allows querying the database from a Jupyter notebook. An alternative tool is SQLMagic.

In [41]:
import sqlite3
con = sqlite3.connect("./data/mock_resq.db") 

%load_ext sql
%config SqlMagic.displaylimit = None

%sql sqlite:///data/mock_resq.db

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


### How many users are in the database

In [42]:
%%sql
SELECT COUNT(*) AS USER_COUNT
FROM USERS

USER_COUNT
358366


### How many countries do the users come from

In [43]:
%%sql
SELECT COUNT(DISTINCT COUNTRY) AS COUNTRY_COUNT
FROM USERS

COUNTRY_COUNT
111


### Which top 10 countries have the most users

In [44]:
%%sql 
SELECT COUNTRY, COUNT(*) AS USER_COUNT
FROM USERS 
GROUP BY COUNTRY
ORDER BY USER_COUNT DESC
LIMIT 10


country,USER_COUNT
FI,339573
SE,8961
EE,6505
DE,512
AX,434
ES,209
FR,200
GB,190
AT,182
NL,152


### Inference

-  There are **358,366** users in the database, from **111** different countries. 
-  The top 10 countries by user count are Finland (FI), Sweden (SE), Estonia (EE), Germany (DE), Åland (AX), Spain (ES), France (FR), Great Britain (GB), Austria (AT) and the Netherlands (NL). 
-  Nearly 95 percent of users (94.8%, i.e. **339,573** users) are from Finland, followed by Sweden with **8,961** users. 
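
The country shares quoted above can be re-derived from the query outputs. The snippet below is a minimal sketch that copies three of the counts by hand; the helper name `share_percent` is illustrative and not part of the notebook.

```python
# Counts copied from the USER_COUNT and top-10 country query outputs above.
total_users = 358366
user_counts = {"FI": 339573, "SE": 8961, "EE": 6505}


def share_percent(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {country: share_percent(n, total_users) for country, n in user_counts.items()}
print(shares)
```

Rounding to one decimal keeps the difference between 94 and 95 percent visible.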

## How many Providers are in the database

In [45]:
%%sql
SELECT COUNT(ID) AS PROVIDER_COUNT
FROM PROVIDERS

PROVIDER_COUNT
4337


## Do providers have multiple offer types?

In [46]:
%%sql
SELECT ID AS PROVIDER, COUNT(DEFAULTOFFERTYPE) AS OFFER_TYPE_COUNT
FROM PROVIDERS
GROUP BY ID
ORDER BY OFFER_TYPE_COUNT DESC
LIMIT 5

PROVIDER,OFFER_TYPE_COUNT
9222930112446389796,1
9217379655006460479,1
9215371507696178188,1
9214584721622525154,1
9212615296993900753,1


### How many providers are there per country

In [47]:
%%sql 
SELECT COUNTRY, COUNT(ID) AS PROVIDER_COUNT
FROM PROVIDERS
GROUP BY COUNTRY
ORDER BY PROVIDER_COUNT DESC

country,PROVIDER_COUNT
fin,4095
est,154
swe,84
pol,2
deu,2


## Inference

- There are **4,337** providers in the database, with each provider having exactly one offer type (meal, snack, dessert, ingredients, flowers etc.).  
- The providers are from Finland (FIN), Estonia (EST), Sweden (SWE), Poland (POL) and Germany (DEU). 
- Over **94 percent (4,095)** of the providers are from Finland, followed by Estonia with **154** providers. 
- Poland and Germany have two (2) providers each. 


`Questions:` 

Are **partners** the same as **providers**? If yes, why the difference in nomenclature between the database and the instructions? 

The **Users** table uses two-letter (alpha-2) [ISO-3166 country codes](https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes) (for example FI, SE, ES), while the **Providers** table uses three-letter (alpha-3) codes (for example FIN, SWE, POL). Is there a particular reason for this? 


All providers have exactly one offer type in the database, yet the offer is named both **defaultOfferType** and **partner segment**. Do providers have multiple offer types? Could the nomenclature be harmonised?
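
Until the codes are harmonised at the source, joining users and providers by country needs a mapping between the two code styles. The following is a minimal sketch covering only the five countries seen in the Providers output; the names `ALPHA2_TO_ALPHA3` and `to_alpha3` are hypothetical helpers, not part of the database or the notebook.

```python
from typing import Optional

# ISO 3166-1 alpha-2 -> alpha-3, limited to the provider countries listed above.
ALPHA2_TO_ALPHA3 = {
    "FI": "FIN",
    "EE": "EST",
    "SE": "SWE",
    "PL": "POL",
    "DE": "DEU",
}


def to_alpha3(code: str) -> Optional[str]:
    """Normalise a 2- or 3-letter country code to upper-case alpha-3.

    Returns None for codes outside the small mapping above.
    """
    code = code.strip().upper()
    if code in ALPHA2_TO_ALPHA3.values():
        return code  # already a known alpha-3 code (e.g. 'fin' -> 'FIN')
    return ALPHA2_TO_ALPHA3.get(code)


print(to_alpha3("FI"), to_alpha3("fin"), to_alpha3("GB"))
```

A full solution would load the complete ISO 3166-1 table rather than a hand-written subset.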


# `Now answering the Analyst's questions`

## Top 10 partners by sales

In [48]:
%%sql 
SELECT PROVIDERID, SUM(SALES) AS TOTAL_SALES
FROM ORDERS
GROUP BY PROVIDERID
ORDER BY TOTAL_SALES DESC
LIMIT 10

providerId,TOTAL_SALES
7198110370745783236,10917800
8312310143652755348,7467750
8097235958083241788,2383700
3865474760205653333,2223400
8084884958338058541,1868140
4734853230275691017,1702100
5305286819167536850,1690500
1066258454353124935,1568100
7642201963087705313,1472000
4014236829817167297,1457000


How many currencies are there in the database?

In [49]:
%%sql 
SELECT DISTINCT CURRENCY
FROM ORDERS

currency
eur
sek


There are only two currencies in the database: **Euro (EUR)** and **Swedish krona (SEK)**. 

`Question:` Is the conversion rate to the base currency (most likely Euro) saved when the order is paid? This would be useful for transforming the sales figures into a common currency for consistent reporting. 
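
If such a rate were available, the ranking could be computed in a single currency. The sketch below illustrates the idea on a tiny in-memory table; the `fx_rates` table, the made-up orders, and the rate of 11 SEK per EUR are all assumptions for illustration and do not come from `mock_resq.db`.

```python
import sqlite3

SEK_PER_EUR = 11.0  # hypothetical rate, for illustration only

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (providerId, currency, sales)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("A", "sek", 1100), ("A", "sek", 2200), ("B", "eur", 250), ("B", "eur", 100)],
)

# Hypothetical lookup table: units of each currency per one euro.
con.execute("CREATE TABLE fx_rates (currency, per_eur)")
con.executemany(
    "INSERT INTO fx_rates VALUES (?, ?)",
    [("eur", 1.0), ("sek", SEK_PER_EUR)],
)

# Convert each order to EUR before summing, then rank providers.
rows = con.execute(
    """
    SELECT o.providerId, ROUND(SUM(o.sales / r.per_eur), 2) AS total_sales_eur
    FROM orders o
    JOIN fx_rates r ON r.currency = o.currency
    GROUP BY o.providerId
    ORDER BY total_sales_eur DESC
    """
).fetchall()
print(rows)
```

In this toy data provider A has the larger raw total (3300 vs 350), but provider B ranks first once both are expressed in EUR.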

How about **top 10 partners by sales in the respective sales currencies** ?

In [50]:
%%sql 
SELECT PROVIDERID, CURRENCY, SUM(SALES) AS TOTAL_SALES
FROM ORDERS
GROUP BY PROVIDERID, CURRENCY
ORDER BY TOTAL_SALES DESC
LIMIT 10

providerId,currency,TOTAL_SALES
7198110370745783236,sek,10917800
8312310143652755348,sek,7467750
8097235958083241788,sek,2383700
3865474760205653333,sek,2223400
8084884958338058541,eur,1868140
4734853230275691017,sek,1702100
5305286819167536850,sek,1690500
1066258454353124935,sek,1568100
7642201963087705313,sek,1472000
4014236829817167297,sek,1457000


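The same providers are in the top 10. However, nine of the ten report their sales in SEK and only one in EUR, so the ranking is not comparable across currencies until the sales are converted to a common currency.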

## Customers’ favourite partner segments (default offer types).

In [51]:
%%sql 
SELECT P.DEFAULTOFFERTYPE AS PARTNER_SEGMENT, SUM(O.QUANTITY) AS SUM_ORDER_QUANTITY
FROM ORDERS O
JOIN PROVIDERS P ON O.PROVIDERID = P.ID
GROUP BY P.DEFAULTOFFERTYPE
ORDER BY SUM_ORDER_QUANTITY DESC
LIMIT 1

PARTNER_SEGMENT,SUM_ORDER_QUANTITY
meal,305254


The customers' favourite partner segment is **meal**, with **305,254** items ordered. 

## What is the M1 retention for any given customer cohort. 

Checking table definition and order table column types

In [54]:
%sql SELECT * FROM SQLITE_MASTER where TYPE='table'

type,name,tbl_name,rootpage,sql
table,orders,orders,2,"CREATE TABLE orders (id, createdAt, userId, quantity, refunded, currency, sales, providerId)"
table,providers,providers,4420,"CREATE TABLE providers (id, defaultOfferType, country, registeredDate)"
table,users,users,4465,"CREATE TABLE users (id, country, registeredDate)"


In [55]:
%sql SELECT NAME, TYPE FROM PRAGMA_TABLE_INFO('ORDERS')

name,type
id,
createdAt,
userId,
quantity,
refunded,
currency,
sales,
providerId,


No column types? This is quite strange, although SQLite does allow columns to be declared without a type. The date fields need to be converted before date and aggregation functions are applied.
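
To see what an untyped column means in practice, the sketch below recreates a typeless table in memory and inspects a stored value with `typeof()`, then converts it with SQLite's built-in `DATE()` and `strftime()` functions. The sample timestamp and its ISO-8601 format are assumptions; the actual format of `createdAt` in `mock_resq.db` should be checked first.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Columns declared without types, mirroring the schema shown above.
con.execute("CREATE TABLE orders (id, createdAt)")
# Assumed timestamp format: ISO-8601 text.
con.execute("INSERT INTO orders VALUES (1, '2023-01-15 10:30:00')")

row = con.execute(
    """
    SELECT typeof(createdAt),            -- storage class of the stored value
           DATE(createdAt),              -- calendar date part
           strftime('%Y-%m', createdAt)  -- year-month, handy as a cohort key
    FROM orders
    """
).fetchone()
print(row)  # ('text', '2023-01-15', '2023-01')
```

Because the value is stored as text, the date functions only work if the text is in a format SQLite recognises.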

Now creating cohorts and computing the M1 retention share.

A cohort consists of customers who made their first order within the same month (M0). 
M1 retention is the share of customers who have made at least one purchase one month after their first purchase month.

In [123]:
%%sql
WITH FIRSTORDERDATES AS (
    SELECT 
        USERID, 
        MIN(DATE(CREATEDAT)) AS FIRST_ORDER_DATE
    FROM ORDERS
    GROUP BY USERID
),
CUSTOMERCOUNT AS (
    SELECT COUNT(*) AS ALLCUSTOMERS FROM USERS
),
M1RETENTIONCUSTOMERS AS (
    SELECT 
        O.USERID, 
        O.CREATEDAT AS ORDER_DATE, 
        FOD.FIRST_ORDER_DATE, 
        ((strftime('%Y', O.CREATEDAT) - strftime('%Y', FOD.FIRST_ORDER_DATE)) * 12) + 
        (strftime('%m', O.CREATEDAT) - strftime('%m', FOD.FIRST_ORDER_DATE)) AS COHORT
    FROM ORDERS O
    LEFT JOIN FIRSTORDERDATES FOD ON O.USERID = FOD.USERID
    WHERE COHORT > 0
    ORDER BY O.USERID, O.CREATEDAT
)


SELECT 
    (CAST((SELECT COUNT(*) FROM M1RETENTIONCUSTOMERS) AS REAL) / 
    CAST((SELECT ALLCUSTOMERS FROM CUSTOMERCOUNT) AS REAL)) * 100 AS M1_SHARE_PERCENTAGE

M1_SHARE_PERCENTAGE
44.61890915990914


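`Answer` The query returns **44.62** percent: **159,899** order rows placed in a calendar month later than the customer's first-order month, divided by all **358,366** registered users. The numerator counts orders rather than distinct customers and includes every later month (not only M1), and the denominator includes users who never ordered, so the figure approximates M1 retention across all cohorts rather than giving the strict retention of a single cohort.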
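
As a cross-check, here is a minimal sketch of a per-cohort M1 retention computation that follows the definition stated earlier literally: distinct customers, an offset of exactly one calendar month, and the cohort's own size as the denominator. It runs on a tiny in-memory table with made-up orders; the rows and the ISO-8601 date format are assumptions, not data from `mock_resq.db`.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (userId, createdAt)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("u1", "2023-01-05"), ("u1", "2023-02-10"),  # returns in M1
        ("u2", "2023-01-20"), ("u2", "2023-03-02"),  # returns in M2 only
        ("u3", "2023-01-31"), ("u3", "2023-01-31"),  # two orders, both in M0
        ("u4", "2023-12-15"), ("u4", "2024-01-03"),  # M1 across a year boundary
        ("u5", "2023-12-01"),                        # never returns
    ],
)

rows = con.execute(
    """
    WITH first_orders AS (
        SELECT userId, MIN(DATE(createdAt)) AS first_order_date
        FROM orders
        GROUP BY userId
    ),
    order_offsets AS (
        SELECT o.userId,
               strftime('%Y-%m', f.first_order_date) AS cohort_month,
               (strftime('%Y', o.createdAt) - strftime('%Y', f.first_order_date)) * 12
               + (strftime('%m', o.createdAt) - strftime('%m', f.first_order_date))
                 AS month_offset
        FROM orders o
        JOIN first_orders f ON f.userId = o.userId
    )
    SELECT cohort_month,
           COUNT(DISTINCT userId) AS cohort_size,
           COUNT(DISTINCT CASE WHEN month_offset = 1 THEN userId END) AS m1_customers,
           ROUND(100.0 * COUNT(DISTINCT CASE WHEN month_offset = 1 THEN userId END)
                 / COUNT(DISTINCT userId), 1) AS m1_retention_pct
    FROM order_offsets
    GROUP BY cohort_month
    ORDER BY cohort_month
    """
).fetchall()

for cohort_month, cohort_size, m1_customers, pct in rows:
    print(cohort_month, cohort_size, m1_customers, pct)
```

Grouping by `cohort_month` yields one retention figure per cohort, and `COUNT(DISTINCT ...)` ensures a customer with several M1 orders is counted once.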