## 1. Data Exploration

Our sales department is interested in a summary of the collected data. Please generate a report including numbers and diagrams. Note that your audience are not data scientists, so take care to prepare insights that are as clear as possible. We are interested in the following:

1. Calculate the total number of customers in each section

2. Calculate the total number of customers in each section over time

3. Display the number of customers at checkout over time

4. Calculate the time each customer spent in the market

5. Calculate the total number of customers in the supermarket over time

6. Our business managers think that the first section customers visit follows a different pattern than the following ones. Plot the distribution of customers of their first visited section versus following sections (treat all sections visited after the first as “following”).

In [20]:
import pandas as pd

import datetime
from datetime import timedelta

In [2]:
# Read the data
monday = pd.read_csv('../data/monday.csv', sep=';', parse_dates=True)
tuesday = monday = pd.read_csv('../data/tuesday.csv', sep=';')
wednesday =  pd.read_csv('../data/wednesday.csv', sep=';')
thursday = pd.read_csv('../data/thursday.csv', sep=';')
friday = pd.read_csv('../data/friday.csv', sep=';')

In [3]:
monday.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4714 entries, 0 to 4713
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   timestamp    4714 non-null   object
 1   customer_no  4714 non-null   int64 
 2   location     4714 non-null   object
dtypes: int64(1), object(2)
memory usage: 110.6+ KB


In [18]:
monday['timestamp'] = pd.to_datetime(monday['timestamp'])

### 1. Calculate the total number of customers in each section

In [9]:
monday_customers = monday.groupby('location').nunique()
tuesday_customers = tuesday.groupby('location').nunique()
wednesday_customers = wednesday.groupby('location').nunique()
thursday_customers = thursday.groupby('location').nunique()
friday_customers = friday.groupby('location').nunique()

# Output w/o duplicates
print(monday_customers)
print(tuesday_customers)
print(wednesday_customers)
print(thursday_customers)
print(friday_customers)

          timestamp  customer_no
location                        
checkout        682         1420
dairy           527          751
drinks          456          581
fruit           521          827
spices          450          543
          timestamp  customer_no
location                        
checkout        682         1420
dairy           527          751
drinks          456          581
fruit           521          827
spices          450          543
          timestamp  customer_no
location                        
checkout        699         1526
dairy           543          804
drinks          483          652
fruit           562          884
spices          475          565
          timestamp  customer_no
location                        
checkout        693         1532
dairy           540          782
drinks          502          632
fruit           587          872
spices          497          613
          timestamp  customer_no
location                        
checkout  

### 2. Calculate the total number of customers in each section over time

In [10]:
total = monday_customers + tuesday_customers + wednesday_customers + thursday_customers + friday_customers
total[['customer_no']]

Unnamed: 0_level_0,customer_no
location,Unnamed: 1_level_1
checkout,7400
dairy,3849
drinks,3134
fruit,4284
spices,2897


### 3. Display the number of customers at checkout over time

In [11]:
total_checkout = total.loc[['checkout']][['customer_no']]
total_checkout

Unnamed: 0_level_0,customer_no
location,Unnamed: 1_level_1
checkout,7400


### 4. Calculate the time each customer spent in the market

In [14]:
max_time_monday = monday.groupby("customer_no")["timestamp"].max()
min_time_monday = monday.groupby("customer_no")["timestamp"].min()

print(max_time_monday, min_time_monday)

customer_no
1      2019-09-03 07:12:00
2      2019-09-03 07:17:00
3      2019-09-03 07:10:00
4      2019-09-03 07:12:00
5      2019-09-03 07:09:00
               ...        
1418   2019-09-03 21:43:00
1419   2019-09-03 21:43:00
1420   2019-09-03 21:46:00
1421   2019-09-03 21:48:00
1422   2019-09-03 21:47:00
Name: timestamp, Length: 1422, dtype: datetime64[ns] customer_no
1      2019-09-03 07:02:00
2      2019-09-03 07:02:00
3      2019-09-03 07:03:00
4      2019-09-03 07:06:00
5      2019-09-03 07:06:00
               ...        
1418   2019-09-03 21:37:00
1419   2019-09-03 21:42:00
1420   2019-09-03 21:43:00
1421   2019-09-03 21:46:00
1422   2019-09-03 21:46:00
Name: timestamp, Length: 1422, dtype: datetime64[ns]


In [21]:
diff = max_time_monday - min_time_monday
pd.DataFrame(diff)

Unnamed: 0_level_0,timestamp
customer_no,Unnamed: 1_level_1
1,0 days 00:10:00
2,0 days 00:15:00
3,0 days 00:07:00
4,0 days 00:06:00
5,0 days 00:03:00
...,...
1418,0 days 00:06:00
1419,0 days 00:01:00
1420,0 days 00:03:00
1421,0 days 00:02:00


### 5. Calculate the total number of customers in the supermarket over time

In [None]:
total[['customer_no']].sum()

### 6. Plot the distribution of customers of their first visited section versus following sections