## Generating the Data
For the sake of simplicity, we'll assume that the time between consecutive activities is uniformly distributed between 0 and 19 days, and that the type of activity is always one of 'sign-up', 'sign-in', or 'purchase'. A sign-in can only occur after an initial sign-up, and a purchase always follows a sign-in.

We'll use 10,000 users to keep the run time short, although 10 million would be more realistic.
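
As a rough illustration, one literal reading of the rules above could be sketched for a single user as follows. The function name `simulate_user`, its parameters, and the interpretation that a purchase must come directly after a sign-in are assumptions made for this sketch. The generator cell below samples distinct days of the year instead, so this is not the code used in the rest of the notebook.

```python
import numpy as np

# Activity codes; 0 matches the sign-up slot used on the activityType axis,
# the 1/2 assignment is an assumption of this sketch.
SIGN_UP, SIGN_IN, PURCHASE = 0, 1, 2

def simulate_user(rng, max_days=365, max_gap=19):
    """Return a list of (day_offset, activity_type) for one hypothetical user."""
    events = []
    previous = None
    day = int(rng.integers(0, max_gap + 1))  # gap drawn from 0..19 inclusive
    while day < max_days:
        if previous is None:
            activity = SIGN_UP                # first activity is the sign-up
        elif previous == SIGN_IN:
            activity = int(rng.choice([SIGN_IN, PURCHASE]))
        else:
            activity = SIGN_IN                # a purchase needs a sign-in first
        events.append((day, activity))
        previous = activity
        day += int(rng.integers(0, max_gap + 1))
    return events

rng = np.random.default_rng(0)
events = simulate_user(rng)
print(events[:5])
```

A gap of 0 puts two activities on the same day, which a boolean (user, day, type) array can only represent if their types differ.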

In [1]:
import numpy as np
from datetime import date, timedelta

In [2]:
#%%timeit -r 1

total_user = 10000
# np.zeros starts every cell as False; np.ndarray would leave the memory uninitialized
tensor = np.zeros(shape=(total_user, 365, 3), dtype=bool)

# dim1 = userID (fixed)
# dim2 = activityDate (max 365)
# dim3 = activityType (max 3)

for user_id in range(total_user):
    # 0 to 99 activities per user, each on a distinct day of the year
    num_activities = np.random.choice(100)
    signed_up = False
    for activity_date in np.sort(np.random.choice(365, size=num_activities, replace=False)):
        if signed_up:
            # every later activity is randomly type 1 or type 2
            activity_type = np.random.choice([1, 2])
            tensor[user_id, activity_date, activity_type] = True
        else:
            # the earliest activity is always the sign-up (type 0)
            tensor[user_id, activity_date, 0] = True
            signed_up = True

2.77 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

## Prognosis

Most generated users have a sign-up date close to the beginning of the year, because the sign-up is always the earliest of a user's sampled dates.
We'll let that go for the sake of simplicity.
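
The skew can be checked with a small standalone simulation. This is a sketch that mirrors the sampling in the generation loop (0 to 99 distinct days per user, earliest day taken as the sign-up). The sample size of 2000 users, the 30-day threshold, and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sign_up_days = []
for _ in range(2000):
    k = int(rng.integers(0, 100))   # mirrors np.random.choice(100): 0..99 activities
    if k == 0:
        continue                    # a user with no activity never signs up
    days = rng.choice(365, size=k, replace=False)
    sign_up_days.append(days.min()) # sign-up is the earliest sampled day
sign_up_days = np.array(sign_up_days)

# Share of users who sign up within the first 30 days, and the median sign-up day
early_share = np.mean(sign_up_days < 30)
print(early_share, np.median(sign_up_days))
```

If sign-up days were spread evenly over the year, only about 8% of users would fall in the first 30 days, so a much larger share here reflects the skew.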

## Cleaning the Data

Now, let's convert the ndarray into a DataFrame-like table.

In [3]:
#%%timeit -r 1
table = []
for user_id in range(len(tensor)):
    for activity_date in range(len(tensor[user_id])):
        for activity_type in range(len(tensor[user_id][activity_date])):
            # keep only the (user, day, type) cells that are set
            if tensor[user_id, activity_date, activity_type]:
                # activity_date is a day offset from 2017-01-01
                row = [user_id, date(2017, 1, 1) + timedelta(days=activity_date), activity_type]
                table.append(row)

14 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
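
The triple loop visits every cell of the array, including the empty ones. A possible alternative, sketched here on a tiny stand-in array rather than the real `tensor`, is to let `np.argwhere` return only the coordinates of the `True` cells. The names `demo` and `demo_table` are hypothetical, and no timing is claimed for the full dataset.

```python
import numpy as np
from datetime import date, timedelta

# Small stand-in array so the sketch runs on its own (hypothetical data)
demo = np.zeros((3, 365, 3), dtype=bool)
demo[0, 10, 0] = True
demo[0, 42, 1] = True
demo[2, 5, 0] = True

start = date(2017, 1, 1)
# np.argwhere yields one (user_id, day_offset, activity_type) row per True cell,
# in row-major order, the same order in which the nested loops visit the cells.
demo_table = [
    [int(u), start + timedelta(days=int(d)), int(t)]
    for u, d, t in np.argwhere(demo)
]
print(demo_table)
```

The resulting rows have the same layout as `table`, so they could be passed to `pd.DataFrame` in the same way.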

In [4]:
import pandas as pd

In [25]:
df = pd.DataFrame(table, columns=['user_id', 'date', 'activity_type'])
df = df.sort_values('date')
df = df.reset_index(drop=True)  # drop=True discards the old row labels instead of keeping them as an 'index' column
df.head(10)

In [None]:
df.to_csv("data.csv")
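
By default `to_csv` also writes the row index as an unnamed first column, and dates come back as plain strings when the file is read again. A minimal round-trip sketch, using an in-memory buffer in place of `data.csv` and a two-row stand-in frame (both assumptions for illustration), shows one way to handle both:

```python
import io
import pandas as pd

# Two-row stand-in for df (hypothetical data)
demo_df = pd.DataFrame({
    'user_id': [0, 1],
    'date': pd.to_datetime(['2017-01-11', '2017-02-12']),
    'activity_type': [0, 1],
})

buffer = io.StringIO()               # stands in for the file "data.csv"
demo_df.to_csv(buffer, index=False)  # index=False omits the row-label column
buffer.seek(0)

# parse_dates converts the 'date' column back to datetime values on load
restored = pd.read_csv(buffer, parse_dates=['date'])
print(restored.dtypes)
```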