## 1. Data Exploration

Our sales department is interested in a summary of the collected data. Please generate a report including numbers and diagrams. Note that your audience is not made up of data scientists, so present the insights as clearly as possible. We are interested in the following:

1. Calculate the total number of customers in each section

2. Calculate the total number of customers in each section over time

3. Display the number of customers at checkout over time

4. Calculate the time each customer spent in the market

5. Calculate the total number of customers in the supermarket over time

6. Our business managers think that the first section a customer visits follows a different pattern than the following ones. Plot the distribution of customers across sections for their first visited section versus the following sections (treat all sections visited after the first as “following”).

In [1]:
import pandas as pd

import datetime
from datetime import timedelta

In [2]:
# Read the data
monday = pd.read_csv('../data/monday.csv', sep=';')
tuesday = pd.read_csv('../data/tuesday.csv', sep=';')
wednesday =  pd.read_csv('../data/wednesday.csv', sep=';')
thursday = pd.read_csv('../data/thursday.csv', sep=';')
friday = pd.read_csv('../data/friday.csv', sep=';')

In [3]:
monday.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4884 entries, 0 to 4883
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   timestamp    4884 non-null   object
 1   customer_no  4884 non-null   int64 
 2   location     4884 non-null   object
dtypes: int64(1), object(2)
memory usage: 114.6+ KB


In [4]:
monday.describe()

| | customer_no |
|---|---|
| count | 4884.0 |
| mean | 718.274365 |
| std | 411.839636 |
| min | 1.0 |
| 25% | 366.0 |
| 50% | 720.0 |
| 75% | 1070.0 |
| max | 1447.0 |


In [5]:
monday['timestamp'] = pd.to_datetime(monday['timestamp'])
tuesday['timestamp'] = pd.to_datetime(tuesday['timestamp'])
wednesday['timestamp'] = pd.to_datetime(wednesday['timestamp'])
thursday['timestamp'] = pd.to_datetime(thursday['timestamp'])
friday['timestamp'] = pd.to_datetime(friday['timestamp'])

In [14]:
# Creating new column in each data set:
monday['weekday'] = 'monday'
tuesday['weekday'] = 'tuesday'
wednesday['weekday'] = 'wednesday'
thursday['weekday'] = 'thursday'
friday['weekday'] = 'friday'

# Combining all datasets together:
days = [monday, tuesday, wednesday, thursday, friday]
weekday = pd.concat(days)
weekday

| | timestamp | customer_no | location | weekday |
|---|---|---|---|---|
| 0 | 2019-09-02 07:03:00 | 1 | dairy | monday |
| 1 | 2019-09-02 07:03:00 | 2 | dairy | monday |
| 2 | 2019-09-02 07:04:00 | 3 | dairy | monday |
| 3 | 2019-09-02 07:04:00 | 4 | dairy | monday |
| 4 | 2019-09-02 07:04:00 | 5 | spices | monday |
| ... | ... | ... | ... | ... |
| 5120 | 2019-09-06 21:50:00 | 1500 | dairy | friday |
| 5121 | 2019-09-06 21:50:00 | 1507 | checkout | friday |
| 5122 | 2019-09-06 21:50:00 | 1508 | checkout | friday |
| 5123 | 2019-09-06 21:50:00 | 1509 | drinks | friday |


### 1. Calculate the total number of customers in each section

In [7]:
monday_customers = monday.groupby('location').nunique()
tuesday_customers = tuesday.groupby('location').nunique()
wednesday_customers = wednesday.groupby('location').nunique()
thursday_customers = thursday.groupby('location').nunique()
friday_customers = friday.groupby('location').nunique()

print(monday_customers.customer_no)
print(tuesday_customers.customer_no)
print(wednesday_customers.customer_no)
print(thursday_customers.customer_no)
print(friday_customers.customer_no)

location
checkout    1437
dairy        720
drinks       661
fruit        827
spices       584
Name: customer_no, dtype: int64
location
checkout    1420
dairy        751
drinks       581
fruit        827
spices       543
Name: customer_no, dtype: int64
location
checkout    1526
dairy        804
drinks       652
fruit        884
spices       565
Name: customer_no, dtype: int64
location
checkout    1532
dairy        782
drinks       632
fruit        872
spices       613
Name: customer_no, dtype: int64
location
checkout    1502
dairy        761
drinks       688
fruit        874
spices       633
Name: customer_no, dtype: int64
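
The five printed series are easier to compare side by side. The following is a minimal sketch, assuming a combined frame with the `location`, `customer_no` and `weekday` columns used in this notebook. The helper name `customers_per_section_by_day` is made up for illustration.

```python
import pandas as pd


def customers_per_section_by_day(df: pd.DataFrame) -> pd.DataFrame:
    """Distinct customers per section, one column per weekday (hypothetical helper)."""
    return (
        df.groupby(["location", "weekday"])["customer_no"]
        .nunique()                          # each customer counts once per section and day
        .unstack("weekday", fill_value=0)   # sections as rows, weekdays as columns
    )
```

Called on the combined frame, for example `customers_per_section_by_day(weekday)`, this should give one row per section and one column per day, which can be passed straight to `.plot(kind="bar")` for the report.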


### 2. Calculate the total number of customers in each section over time

In [8]:
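# Weekly total per section: the sum of the five daily counts.
# This aggregates over the whole week; it is not broken down by time of day.
total = monday_customers + tuesday_customers + wednesday_customers + thursday_customers + friday_customers
total[['customer_no']]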

| location | customer_no |
|---|---|
| checkout | 7417 |
| dairy | 3818 |
| drinks | 3214 |
| fruit | 4284 |
| spices | 2938 |
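
The task asks for the counts over time, while the cell above gives one weekly total per section. One possible way to add the time dimension is to count distinct customers per section in fixed time bins. This is only a sketch: the helper name `customers_per_section_over_time` and the 60-minute bin width are assumptions, and it treats every row as one observation of a customer in a section.

```python
import pandas as pd


def customers_per_section_over_time(df: pd.DataFrame, freq: str = "60min") -> pd.DataFrame:
    """Distinct customers seen in each section per time bin (hypothetical helper)."""
    # Work on a copy with parsed timestamps so the caller's frame is left untouched.
    df = df.assign(timestamp=pd.to_datetime(df["timestamp"]))
    return (
        df.groupby([pd.Grouper(key="timestamp", freq=freq), "location"])["customer_no"]
        .nunique()
        .unstack("location", fill_value=0)  # time bins as rows, sections as columns
    )
```

For a single day, `customers_per_section_over_time(monday).plot()` should draw one line per section. Run it per day rather than on the combined frame if the days are to be compared.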


### 3. Display the number of customers at checkout over time

In [9]:
# Weekly checkout total (a single aggregate, not a time series)
total_checkout = total.loc[['checkout'], ['customer_no']]
total_checkout

| location | customer_no |
|---|---|
| checkout | 7417 |
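
To show checkout activity over time rather than as a single number, the checkout rows can be resampled into time bins. This is a rough sketch, not a definitive implementation: the helper name `checkout_over_time` and the 60-minute default are assumptions.

```python
import pandas as pd


def checkout_over_time(df: pd.DataFrame, freq: str = "60min") -> pd.Series:
    """Distinct customers recorded at checkout per time bin (hypothetical helper)."""
    checkout = df.loc[df["location"] == "checkout", ["timestamp", "customer_no"]]
    checkout = checkout.assign(timestamp=pd.to_datetime(checkout["timestamp"]))
    # Bins without any checkout record are reported as 0.
    return checkout.set_index("timestamp")["customer_no"].resample(freq).nunique()
```

`checkout_over_time(monday).plot()` should then give a line chart of checkout customers per hour for Monday. A narrower bin such as `freq="15min"` shows short peaks more clearly.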


### 4. Calculate the time each customer spent in the market

In [10]:
# Last and first timestamp recorded for each customer on Monday
max_time_monday = monday.groupby("customer_no")["timestamp"].max()
min_time_monday = monday.groupby("customer_no")["timestamp"].min()

print(max_time_monday, min_time_monday)

customer_no
1      2019-09-02 07:05:00
2      2019-09-02 07:06:00
3      2019-09-02 07:06:00
4      2019-09-02 07:08:00
5      2019-09-02 07:05:00
               ...        
1443   2019-09-02 21:48:00
1444   2019-09-02 21:49:00
1445   2019-09-02 21:49:00
1446   2019-09-02 21:50:00
1447   2019-09-02 21:50:00
Name: timestamp, Length: 1447, dtype: datetime64[ns] customer_no
1      2019-09-02 07:03:00
2      2019-09-02 07:03:00
3      2019-09-02 07:04:00
4      2019-09-02 07:04:00
5      2019-09-02 07:04:00
               ...        
1443   2019-09-02 21:47:00
1444   2019-09-02 21:48:00
1445   2019-09-02 21:49:00
1446   2019-09-02 21:50:00
1447   2019-09-02 21:50:00
Name: timestamp, Length: 1447, dtype: datetime64[ns]


In [11]:
diff_monday = max_time_monday - min_time_monday
diff_monday

customer_no
1      0 days 00:02:00
2      0 days 00:03:00
3      0 days 00:02:00
4      0 days 00:04:00
5      0 days 00:01:00
             ...      
1443   0 days 00:01:00
1444   0 days 00:01:00
1445   0 days 00:00:00
1446   0 days 00:00:00
1447   0 days 00:00:00
Name: timestamp, Length: 1447, dtype: timedelta64[ns]

In [12]:
pd.DataFrame(diff_monday).describe()

| | timestamp |
|---|---|
| count | 1447 |
| mean | 0 days 00:06:26.371803731 |
| std | 0 days 00:06:20.300576298 |
| min | 0 days 00:00:00 |
| 25% | 0 days 00:02:00 |
| 50% | 0 days 00:04:00 |
| 75% | 0 days 00:08:00 |
| max | 0 days 00:51:00 |
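
The cells above cover Monday only. The same calculation can be wrapped in a small function and applied to any day, or to the combined frame. This is a sketch under the assumption that customer numbers restart every day, so the combined frame needs `weekday` as an additional grouping key. The name `time_in_market` and the `keys` parameter are made up for illustration.

```python
import pandas as pd


def time_in_market(df: pd.DataFrame, keys=("customer_no",)) -> pd.Series:
    """Time between the first and last record of each customer (hypothetical helper)."""
    df = df.assign(timestamp=pd.to_datetime(df["timestamp"]))
    span = df.groupby(list(keys))["timestamp"].agg(["min", "max"])
    # A customer with a single record gets a duration of zero.
    return (span["max"] - span["min"]).rename("time_in_market")
```

For the whole week, `time_in_market(weekday, keys=("weekday", "customer_no"))` keeps customers with the same number on different days apart. Dividing the result by `pd.Timedelta(minutes=1)` converts it to plain minutes, which is easier to read in a histogram.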


### 5. Calculate the total number of customers in the supermarket over time

In [13]:
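# Sum of the per-section counts. A customer is counted once for every section
# visited, so this is the number of section visits, not of unique customers.
total[['customer_no']].sum()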

customer_no    21671
dtype: int64
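
To follow the number of customers in the store over time, one option is to treat each customer as present from their first to their last recorded timestamp and count the overlaps minute by minute. This is a sketch under that assumption, meant for one day at a time. The helper name `customers_present` is hypothetical.

```python
import pandas as pd


def customers_present(df: pd.DataFrame) -> pd.Series:
    """Customers in the store at each minute of one day (hypothetical helper).

    Assumes a customer is present from the first to the last record, inclusive.
    """
    ts = pd.to_datetime(df["timestamp"]).dt.floor("min")
    span = ts.groupby(df["customer_no"]).agg(["min", "max"])
    minutes = pd.date_range(span["min"].min(), span["max"].max(), freq="min")
    # Cumulative number of customers who have entered / left up to each minute.
    entered = span["min"].value_counts().reindex(minutes, fill_value=0).cumsum()
    left = span["max"].value_counts().reindex(minutes, fill_value=0).cumsum()
    # A customer still counts as present during the minute of the last record.
    present = entered - left.shift(1, fill_value=0)
    return present.rename("customers_present")
```

`customers_present(monday).plot()` should show how the store fills and empties during Monday. Because customer numbers appear to restart each day, the function is meant to be run per day, not on the combined frame.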

### 6. Plot the distribution of customers of their first visited section versus following sections
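
One way to approach this is to mark each customer's earliest record as "first" and all later records as "following", then compare how the two groups are distributed across sections. The sketch below makes a few assumptions: the helper name `first_vs_following` and its `keys` parameter are made up, `checkout` is kept as a regular location, and ties on the same timestamp keep their original row order.

```python
import pandas as pd


def first_vs_following(df: pd.DataFrame, keys=("customer_no",)) -> pd.DataFrame:
    """Share of visits per section, first visit vs. following visits (hypothetical helper)."""
    keys = list(keys)
    df = df.assign(timestamp=pd.to_datetime(df["timestamp"]))
    df = df.sort_values(keys + ["timestamp"])
    # The earliest row of each customer is the first visit; all others are "following".
    is_first = df.groupby(keys).cumcount() == 0
    df["visit"] = is_first.map({True: "first", False: "following"})
    counts = df.groupby(["location", "visit"]).size().unstack("visit", fill_value=0)
    # Normalise each column so that "first" and "following" each sum to 1.
    return counts / counts.sum()
```

`first_vs_following(monday).plot(kind="bar")` should draw the two distributions next to each other, one pair of bars per section. For the combined frame, pass `keys=("weekday", "customer_no")` so that customers with the same number on different days are not mixed up.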