# Sorting Validation data

This notebook sorts the data by timestamp.

The input data is expected in `/data/recsys/part-XXXXX`. The sorted output is written to `/data/recsys/sorted/part-XXXXX.csv`, where each file contains all the interactions of one hour.

+ Min timestamp train: 1612396800 (04 February 2021 00:00:00 UTC)
+ Min timestamp val: 1614211200 (25 February 2021 00:00:00 UTC)
+ Max timestamp val: 1614815999 (03 March 2021 23:59:59 UTC)

To convert a partID to the hour it covers, use the following function:

In [1]:
from tqdm.notebook import trange
import numpy as np
import pandas as pd

In [2]:
def part_to_timestamp(partID):
    """Print the first and last second covered by the given hourly part."""
    # Part 0 starts at the minimum training timestamp (1612396800).
    min_h = pd.to_datetime(partID * 3600 + 1612396800, unit='s')
    max_h = pd.to_datetime((partID + 1) * 3600 + 1612396800 - 1, unit='s')

    print(f'Part {partID:03d} from time {min_h} to {max_h}')

Example:

In [3]:
part_to_timestamp(partID=0)
part_to_timestamp(partID=100)
part_to_timestamp(partID=503)
part_to_timestamp(partID=671)

Part 000 from time 2021-02-04 00:00:00 to 2021-02-04 00:59:59
Part 100 from time 2021-02-08 04:00:00 to 2021-02-08 04:59:59
Part 503 from time 2021-02-24 23:00:00 to 2021-02-24 23:59:59
Part 671 from time 2021-03-03 23:00:00 to 2021-03-03 23:59:59


## Read and sort the data

In [4]:
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd

cluster = LocalCluster(dashboard_address=':8788')
# cluster.scale(3)

client = Client(cluster)
client

Client   Scheduler: tcp://127.0.0.1:34515   Dashboard: http://127.0.0.1:8788/status
Cluster  Workers: 6   Cores: 24   Memory: 125.60 GiB


In [5]:
column_type = {
    'bert': str, 'hashtags': str, 'tweet_id': str, 'media': str, 'links': str, 'domains': str, 'type': str, 'language': str, 'timestamp': np.uint32,
    'AUTH_user_id': str, 'AUTH_follower_count': np.uint32, 'AUTH_following_count': np.uint32, 'AUTH_verified': bool, 'AUTH_account_creation': np.uint32,
    'READ_user_id': str, 'READ_follower_count': np.uint32, 'READ_following_count': np.uint32, 'READ_verified': bool, 'READ_account_creation': np.uint32,
    'auth_follows_read': bool,
    # Engagement timestamps are nullable: the field is empty when the engagement did not happen.
    'reply_timestamp': 'Int32', 'retweet_timestamp': 'Int32', 'quote_timestamp': 'Int32', 'like_timestamp': 'Int32',
}
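
The engagement timestamps use the nullable `'Int32'` dtype so that empty fields survive parsing as missing values. As a minimal sketch, assuming a made-up two-row sample with only three of the columns (the names `raw`, `sample_types` and `sample` are illustrative, not part of the notebook), this is how pandas reads such `\x01`-separated rows:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical two-row sample: timestamp, reply_timestamp, like_timestamp.
# An empty field means the engagement did not happen.
raw = "1614211200\x01\x011614211300\n1614211260\x011614211500\x01\n"

sample_types = {'timestamp': np.uint32, 'reply_timestamp': 'Int32', 'like_timestamp': 'Int32'}
sample = pd.read_csv(io.StringIO(raw), names=list(sample_types), header=None,
                     sep='\x01', dtype=sample_types)

print(sample.dtypes)
print(sample)
```

A plain `np.uint32` would fail on the empty fields, which is presumably why the four label columns are declared differently from `timestamp`.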

First, we separate the data by hour. Since **1612396800** is the minimum timestamp, the part of each interaction is:
$$\text{part} = \left\lfloor \dfrac{\text{timestamp} - 1612396800}{3600} \right\rfloor$$
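
As a quick worked example of the formula, the sketch below applies it to the three boundary timestamps listed at the top of the notebook. The names `MIN_TS`, `ts` and `part` are chosen here for illustration.

```python
import pandas as pd

MIN_TS = 1612396800  # minimum training timestamp, taken from the text above

# First training second, first validation second, last validation second.
ts = pd.Series([1612396800, 1614211200, 1614815999], dtype='uint32')

# Integer division implements the floor for non-negative values.
part = ((ts - MIN_TS) // 3600).astype('uint16')
print(part.tolist())  # [0, 504, 671]
```

The validation week therefore spans parts 504 to 671, that is 7 * 24 = 168 hourly files.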

Read the validation file and split it by hour:

In [6]:
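# Columns are separated by \x01 and the file has no header row.
df = dd.read_csv('/data/recsys/valid/part-00000.1', names=list(column_type.keys()), header=None, sep='\x01', dtype=column_type)
# Hour index relative to the minimum timestamp, computed column-wise instead of with a row-wise apply.
df['part'] = ((df['timestamp'] - 1612396800) // 3600).astype('uint16')
# Write one CSV per hour; the groupby-apply is used only for its side effect.
df.groupby('part').apply(lambda group: group.drop(columns=['part']).to_csv(f'/data/recsys/sorted/part-{group.name:05d}.csv',index=False), meta=df._meta).size.compute()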

0
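
To make the split easier to follow, here is a small pandas-only sketch of the same idea on toy data. It is an illustration under assumptions, not the Dask code used above: the toy rows are invented and a temporary directory stands in for `/data/recsys/sorted`.

```python
import tempfile
from pathlib import Path

import pandas as pd

MIN_TS = 1612396800

# Hypothetical toy data: three interactions spread over two hours.
toy = pd.DataFrame({'tweet_id': ['a', 'b', 'c'],
                    'timestamp': [MIN_TS + 10, MIN_TS + 3600, MIN_TS + 5]})
toy['part'] = (toy['timestamp'] - MIN_TS) // 3600

out_dir = Path(tempfile.mkdtemp())  # stand-in for /data/recsys/sorted
for part, group in toy.groupby('part'):
    # One file per hour, without the helper column.
    group.drop(columns=['part']).to_csv(out_dir / f'part-{part:05d}.csv', index=False)

written = sorted(p.name for p in out_dir.iterdir())
print(written)
```

Rows inside each file keep their original order at this point, so a separate sorting pass is still needed.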

Then we sort each file individually. Parts 504 to 671 cover the validation week:

In [13]:
for i in trange(504, 672):
    df = pd.read_csv(f'/data/recsys/sorted/part-{i:05d}.csv')
    df = df.sort_values('timestamp')
    df.to_csv(f'/data/recsys/sorted/part-{i:05d}.csv', index=False)
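
A cheap sanity check after this step is to confirm that a file really is in timestamp order. The helper below is a hypothetical sketch (`is_sorted_by_timestamp` is not part of the notebook) and is shown on toy data rather than on the real files:

```python
import pandas as pd


def is_sorted_by_timestamp(df):
    """Return True when the 'timestamp' column is non-decreasing."""
    return bool(df['timestamp'].is_monotonic_increasing)


toy = pd.DataFrame({'timestamp': [30, 10, 20]})
print(is_sorted_by_timestamp(toy))                           # False
print(is_sorted_by_timestamp(toy.sort_values('timestamp')))  # True
```

The same function could be applied to each `part-XXXXX.csv` after reading it with `pd.read_csv`.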

### Move

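Copy the sorted validation files from `/data/recsys/sorted` into `/data/recsys/test_csv_w4`, then write them without the four engagement columns into `/data/recsys/RecsysDocker/test_4` to replicate the Twitter Docker execution. <br>
Create `reals_4.csv` with the ground-truth labels to simulate the leaderboard score.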

In [10]:
# !mkdir /data/recsys/test_csv_w4
# !mkdir /data/recsys/RecsysDocker/test

In [18]:
_from, _to = 504, 504+7*24

In [21]:
for i in trange(_from, _to, leave=False):
    f = f"{i:05d}"
    !cp /data/recsys/sorted/part-{f}.csv /data/recsys/test_csv_w4

In [23]:
import pandas as pd
from tqdm.notebook import trange
# The last four columns are the engagement timestamps (reply, retweet, quote, like).
reals = []
for i in trange(_from, _to, leave=False):
    df = pd.read_csv(f'/data/recsys/test_csv_w4/part-{i:05d}.csv')
    reals.append(df.iloc[:, -4:])
    # Renumber parts from 0 and write the features only, in the original \x01-separated format.
    newi = i - _from
    df.iloc[:, :-4].to_csv(f'/data/recsys/RecsysDocker/test_4/part-{newi:05d}', sep='\x01', header=False, index=False)
# A label is 1 when the engagement timestamp is present and 0 otherwise.
reals = pd.concat(reals)
reals.notna().astype(int).to_csv('/data/recsys/RecsysDocker/reals_4.csv', index=False, header=False)
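
The last line turns engagement timestamps into binary labels. A minimal sketch on invented data (only two of the four label columns, with made-up values) shows the conversion:

```python
import pandas as pd

# Hypothetical engagement timestamps for three interactions;
# pd.NA marks an engagement that did not happen.
toy = pd.DataFrame({
    'reply_timestamp': pd.array([pd.NA, 1614211500, pd.NA], dtype='Int32'),
    'like_timestamp': pd.array([1614211300, pd.NA, pd.NA], dtype='Int32'),
})

# 1 when a timestamp is present, 0 when it is missing.
labels = toy.notna().astype(int)
print(labels.values.tolist())  # [[0, 1], [1, 0], [0, 0]]
```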

In [7]:
# !ls /data/recsys/test_csv_w4/