## Task
1. Identify all shops that are deemed to have conducted order brushing.
2. For each shop that is identified to have conducted order brushing, identify the buyers suspected to have conducted order brushing for that shop.

Definition of order brushing
- concentration_rate >= 3
- concentration_rate = num_orders_1hr / num_unique_buyer_1hr
- **suspicious buyers** are the buyers who contributed the highest proportion of orders to a shop
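
As a quick illustration of the definition above, here is a minimal sketch on a made-up one-hour slice of orders for a single shop. The `toy` frame and its values are hypothetical and are not taken from `order_brush_order.csv`.

```python
import pandas as pd

# Hypothetical orders placed at one shop within a single hour.
toy = pd.DataFrame({
    "orderid": [1, 2, 3, 4, 5, 6],
    "userid": [101, 101, 101, 101, 101, 202],
})

num_orders_1hr = toy["orderid"].count()          # 6 orders
num_unique_buyer_1hr = toy["userid"].nunique()   # 2 distinct buyers
concentration_rate = num_orders_1hr / num_unique_buyer_1hr

# The shop is flagged when the rate reaches the threshold of 3.
is_brushing = concentration_rate >= 3

# Share of the shop's orders per buyer; buyers tied for the top share are suspicious.
share = toy["userid"].value_counts(normalize=True)
suspicious_buyers = sorted(share[share == share.max()].index)
```

Here buyer 101 placed five of the six orders, so the rate is exactly 3 and buyer 101 is the only suspicious buyer.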

In [158]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [159]:
df = pd.read_csv('order_brush_order.csv')

In [160]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 222750 entries, 0 to 222749
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   orderid     222750 non-null  int64 
 1   shopid      222750 non-null  int64 
 2   userid      222750 non-null  int64 
 3   event_time  222750 non-null  object
dtypes: int64(3), object(1)
memory usage: 6.8+ MB


In [161]:
df.describe()

,orderid,shopid,userid
count,222750.0,222750.0,222750.0
mean,31300270000000.0,94331170.0,98028800.0
std,122277400000.0,56957900.0,68390480.0
min,31075200000000.0,10009.0,10007.0
25%,31203600000000.0,49802670.0,35081270.0
50%,31305610000000.0,90336360.0,93096250.0
75%,31406040000000.0,147505300.0,159061200.0
max,31507200000000.0,215435200.0,215526200.0


In [162]:
df.event_time.min(), df.event_time.max()

('2019-12-27 00:00:00', '2019-12-31 23:59:56')

In [163]:
df['event_time'] = pd.to_datetime(df['event_time'])

### Trying to define whether order brushing has occurred

Simplified approach
> Segregate orders into fixed 1-hour intervals
> Calculate the concentration rate for each 1-hour interval to detect whether order brushing has occurred

To-do
> Detect instantaneous concentration rate spikes
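
One possible way to approach the to-do item is a trailing window that ends at each order, instead of fixed clock-hour buckets. The sketch below is an assumption about how that could look, not the method used in this notebook; `max_rolling_concentration` is a hypothetical helper, the window is treated as inclusive of exactly one hour, and the toy data is made up.

```python
import pandas as pd

def max_rolling_concentration(shop_orders: pd.DataFrame, window: str = "1h") -> float:
    """Highest concentration rate over all trailing windows that end at an order."""
    ordered = shop_orders.sort_values("event_time")
    times = ordered["event_time"].to_list()
    users = ordered["userid"].to_list()
    width = pd.Timedelta(window)
    best, start = 0.0, 0
    for end in range(len(times)):
        # Move the left edge forward until the window spans at most `width`.
        while times[end] - times[start] > width:
            start += 1
        in_window = users[start:end + 1]
        best = max(best, len(in_window) / len(set(in_window)))
    return best

# Hypothetical burst that straddles a clock-hour boundary.
toy = pd.DataFrame({
    "userid": [7, 7, 7],
    "event_time": pd.to_datetime(
        ["2019-12-27 10:40", "2019-12-27 10:50", "2019-12-27 11:10"]
    ),
})

# Fixed hourly buckets split the burst into 2 orders and 1 order.
hourly = toy.groupby(toy["event_time"].dt.floor("h"))["userid"].agg(["size", "nunique"])
hourly_max = (hourly["size"] / hourly["nunique"]).max()

rolling_max = max_rolling_concentration(toy)
```

In this toy case the hourly buckets peak at a rate of 2, while the trailing window sees all three orders together and reaches 3.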

In [164]:
orders_per_hr = df.groupby(['shopid', pd.Grouper(key='event_time', freq='h')]).orderid.count()
unique_buyers_per_hr = df.groupby(['shopid', pd.Grouper(key='event_time', freq='h')]).userid.nunique()

In [165]:
concentration_rate_per_hr = orders_per_hr / unique_buyers_per_hr
concentration_rate_mask = concentration_rate_per_hr >= 3.0
order_brush_shops = concentration_rate_mask.index.get_level_values('shopid')[concentration_rate_mask].values

In [166]:
len(set(order_brush_shops))

194

In [167]:
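def extract_common_value(x):
    # Most frequent userid(s) across all of the shop's orders.
    mode_list = x.mode()
    if len(mode_list) <= 1:
        return mode_list[0]
    else:
        # Ties: join the tied userids in ascending order with '&'.
        return '&'.join([str(c) for c in sorted(mode_list)])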

In [168]:
suspicious = df[df['shopid'].isin(order_brush_shops)].groupby('shopid')['userid'].apply(extract_common_value)
suspicious = suspicious.to_frame()
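
Note that the cell above takes the mode over every order a flagged shop received, not only the orders inside the flagged hours. A possible refinement, sketched here on made-up data, is to keep only the orders that fall in flagged `(shopid, hour)` buckets before taking the mode. The `toy` frame and the `hour` column are hypothetical names introduced for this illustration.

```python
import pandas as pd

# Hypothetical orders for one shop: user 10 bursts within one hour,
# user 20 orders once a day and has the most orders overall.
toy = pd.DataFrame({
    "orderid": range(1, 9),
    "shopid": [1] * 8,
    "userid": [10, 10, 10, 20, 20, 20, 20, 30],
    "event_time": pd.to_datetime([
        "2019-12-27 10:05", "2019-12-27 10:15", "2019-12-27 10:45",
        "2019-12-28 09:00", "2019-12-29 09:00", "2019-12-30 09:00",
        "2019-12-31 09:00", "2019-12-31 15:00",
    ]),
})
toy["hour"] = toy["event_time"].dt.floor("h")

# Concentration rate per (shop, hour) bucket, as in the simplified approach.
per_hour = toy.groupby(["shopid", "hour"])["userid"].agg(orders="size", buyers="nunique")
flagged = per_hour[per_hour["orders"] / per_hour["buyers"] >= 3].reset_index()[["shopid", "hour"]]

# Keep only the orders that fall inside a flagged bucket.
in_flagged = toy.merge(flagged, on=["shopid", "hour"])

whole_mode = toy["userid"].mode().tolist()           # mode over all orders
flagged_mode = in_flagged["userid"].mode().tolist()  # mode within flagged hours
```

On this toy data the two approaches disagree: the whole-dataset mode picks user 20, while the flagged-hour mode picks user 10, who actually produced the burst.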

### Preparing for Submission

In [169]:
shops_all = df['shopid'].unique()

# Start every shop at userid 0; flagged shops are overwritten further down.
submission1 = pd.DataFrame({'shopid': shops_all, 'userid': 0})

submission1.set_index('shopid', inplace=True)

In [170]:
submission1.head()

shopid,userid
93950878,0
156423439,0
173699291,0
63674025,0
127249066,0


In [171]:
suspicious.head()

shopid,userid
10402,77819
10536,672345
42472,740844
42818,170385453
76934,190449497


In [172]:
submission1.update(suspicious)

In [173]:
submission1.userid.unique()

array([0, 9753706, 61893096, 181408876, 174145893, 114498557, 123158564,
       52867898, 81928284, 31916119, 107641182, 214432120, 67554410,
       192251866, 87846708, 18688337, 172106152, 170385453, 74027394,
       131515076, 122507717, 156614746, 144612139, 188025647, 201343856,
       6059093, 79419297, 32594, 194833170, 59725263, 205729485,
       '81928284&198558630', '23962466&24053233&60599168&71152760',
       '29857724&212200633', 108214177, 143847348, 157946285, 95058664,
       137245836, 2779333, '5085857&15425170&203554877', 170673735,
       31233680, 1762129, 186634585, 199382229, 157747326, 78903959,
       138388930, 89014205, '5307816&214808165', 148215831, 193415051,
       193338089, 33794624, 192785138, 29299481, 46361526, 116055684,
       556867, 194647522, 93783570, 86802680, 105935455,
       '92521144&130587573', 188187242, 214546342, 132704747, 15053804,
       128702876, '35639374&159315857', 10209247, 51759862,
       '16339607&212325226', 7670129, 80682

In [174]:
submission1 = submission1.reset_index()
print(submission1.shape)
submission1.to_csv('submission1.csv', index=False)

(18770, 2)
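
The `userid.unique()` output above mixes integers with `&`-joined strings, because `extract_common_value` returns an integer for a single mode and a string for ties. If a uniformly typed column is preferred, one option is to build it as strings from the start. The sketch below uses small made-up inputs in place of `shops_all` and `suspicious`, and mapping with a string default of `"0"` is an assumption, not what the notebook does.

```python
import pandas as pd

# Hypothetical stand-ins for `shops_all` and the `suspicious` result.
shops_all = pd.Series([1, 2, 3], name="shopid")
suspicious = pd.Series({2: "77819", 3: "5307816&214808165"}, name="userid")

submission = pd.DataFrame({"shopid": shops_all})
# Look up each shop's suspicious buyer(s); shops without a match get "0".
submission["userid"] = submission["shopid"].map(suspicious).fillna("0")

userids = submission["userid"].tolist()
```

Every value in the resulting column is a string, so the CSV is written from a single consistent type.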
