In [6]:
import pandas as pd 
import json

## Load the train set 
Because the training file is too large to load at once, the .jsonl file has to be processed in chunks. This preprocessing was already done by a third party, who published the resulting file as "final_output.parquet".

Details of the processing:
Instead of loading the entire file into memory, the file is read line by line and the rows are written out in batches as smaller Parquet files.

Each line represents a session containing multiple events. The function extracts the session_id and attaches it to each event.
Once a batch of rows has been collected, it is saved as a Parquet file, an efficient storage format. The batch is then reset and processing continues until the end of the file.

Afterwards, all Parquet files are merged into one single file, the "final_output.parquet". 
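
The batch processing described above can be sketched roughly as follows. This is a minimal illustration, not the code that produced "final_output.parquet": the function names, the batch size, and the `part_*.parquet` naming are assumptions, and the `session` / `events` keys are taken from the test-set loader further down in this notebook.

```python
import json
from pathlib import Path

import pandas as pd


def iter_event_batches(lines, batch_size=100_000):
    """Yield lists of flattened events, each holding at most batch_size rows."""
    batch = []
    for line in lines:
        record = json.loads(line)
        for event in record["events"]:
            # Copy the event and attach the session id of its parent record
            batch.append({**event, "session": record["session"]})
            if len(batch) >= batch_size:
                yield batch
                batch = []  # reset the batch and keep reading
    if batch:
        yield batch  # leftover rows at the end of the file


def jsonl_to_parquet_parts(jsonl_path, out_dir, batch_size=100_000):
    """Write one Parquet file per batch (hypothetical helper, needs pyarrow)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for i, batch in enumerate(iter_event_batches(f, batch_size)):
            part_path = out_dir / f"part_{i:05d}.parquet"
            pd.DataFrame(batch).to_parquet(part_path, index=False)
```

Only one batch is held in memory at a time, which is what keeps the peak memory use bounded regardless of the size of the input file.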


In [7]:
import dask.dataframe as dd 
# Using Dask instead of pandas allows processing very large files without loading everything into memory

parquet_file = "data/final_output.parquet"
data = dd.read_parquet(parquet_file)
print(type(data))


<class 'dask.dataframe.dask_expr._collection.DataFrame'>


In [8]:
data.head()

Unnamed: 0,aid,ts,type,session
0,1489275,1660039772288,clicks,5899776
1,1826552,1660043110728,clicks,5899776
2,1632206,1660048043858,clicks,5899776
3,1531634,1660048104470,clicks,5899776
4,1086210,1660039772327,clicks,5899777


In [9]:
train_df = data

## Load the test set

In [10]:
# Path to the test.jsonl file
test_file = "data/test.jsonl"
# Step 1: Read the JSONL file directly
with open(test_file, "r", encoding="utf-8") as f:
    data = [json.loads(line) for line in f]

# Step 2: Expand the nested 'events' column
flattened_data = []
for record in data:
    session_id = record["session"]
    for event in record["events"]:
        event["session"] = session_id  # Add session ID to each event
        flattened_data.append(event)

# Step 3: Convert to DataFrame
test_df = pd.DataFrame(flattened_data)
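
As a possible alternative to the explicit loop in Step 2, pandas can expand the nested list itself with `pd.json_normalize`. The sketch below uses two made-up records in the same shape as the lines of test.jsonl; the variable names `records` and `flat_df` are illustrative only.

```python
import pandas as pd

# Two toy records shaped like the lines of test.jsonl (values are made up)
records = [
    {"session": 1, "events": [{"aid": 10, "ts": 100, "type": "clicks"},
                              {"aid": 11, "ts": 200, "type": "carts"}]},
    {"session": 2, "events": [{"aid": 12, "ts": 300, "type": "clicks"}]},
]

# One row per event; the parent record's session id is repeated on each row
flat_df = pd.json_normalize(records, record_path="events", meta=["session"])
print(flat_df)
```

Unlike the loop above, this does not modify the event dictionaries in place. The `session` column produced through `meta` may come back with a generic object dtype, so a cast with `astype("int64")` may be needed afterwards.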

## Data understanding

In [11]:
train_df.tail()

Unnamed: 0,aid,ts,type,session
12243771,1117003,1660679104743,clicks,4200398
12243772,1117003,1660679120851,carts,4200398
12243773,24174,1660679253624,clicks,4200398
12243774,24174,1660679272894,carts,4200398
12243775,1391615,1660682317258,clicks,4200398


In [12]:
train_df.dtypes

aid                  int64
ts                   int64
type       string[pyarrow]
session              int64
dtype: object

In [None]:
"""from ydata_profiling import ProfileReport
df_sample = train_df.sample(frac=0.2).compute()

profile = ProfileReport(df_sample, title="Profiling-Report")
profile.to_file("Profiling-Report.html")
"""

'from ydata_profiling import ProfileReport\ndf_sample = train_df.sample(frac=0.2).compute()\n\nprofile = ProfileReport(df_sample, title="Profiling-Report")\nprofile.to_file("Profiling-Report.html")\n'

## Data preparation

In [13]:
df = train_df 
df.head()

Unnamed: 0,aid,ts,type,session
0,1489275,1660039772288,clicks,5899776
1,1826552,1660043110728,clicks,5899776
2,1632206,1660048043858,clicks,5899776
3,1531634,1660048104470,clicks,5899776
4,1086210,1660039772327,clicks,5899777


In [14]:
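# Map each event type to an interaction strength.
# The keys must match the labels in the 'type' column: 'clicks' and 'carts' appear
# in the outputs above, and 'orders' is assumed to follow the same plural form.
event_type_strength = {
    'clicks': 1.0,
    'carts': 2.0,
    'orders': 3.0,
}

# Check for unmapped types first (recommended).
# With Dask this expression is lazy, so the print below shows only its structure;
# call .compute() on it to see the actual values.
unknown_types = df.loc[~df['type'].isin(event_type_strength), 'type'].unique()
print("Unknown types found:", unknown_types)

# Apply the mapping; any unmapped type falls back to 0.0
df['eventStrength'] = df['type'].map(event_type_strength).fillna(0.0)

df.head()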


Unknown types found: Dask Series Structure:
npartitions=14
    string
       ...
     ...  
       ...
       ...
Dask Name: unique, 7 expressions
Expr=Unique(frame=Loc(frame=ReadParquetFSSpec(46fde82)['type'], iindexer=~ Isin(frame=ReadParquetFSSpec(46fde82)['type'], values=_DelayedExpr(Delayed('delayed-214f12dd7839a27aad04324384ee5121')))))


You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=('type', 'float64'))



Unnamed: 0,aid,ts,type,session,eventStrength
0,1489275,1660039772288,clicks,5899776,1.0
1,1826552,1660043110728,clicks,5899776,1.0
2,1632206,1660048043858,clicks,5899776,1.0
3,1531634,1660048104470,clicks,5899776,1.0
4,1086210,1660039772327,clicks,5899777,1.0
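
Because Dask evaluates lazily, printing `unknown_types` shows only the structure of the pending computation, as in the output above. The small pandas sketch below makes the intended result of the check visible. The toy frame and the `wishlist` label are made up, and the labels `carts` and `orders` are assumed to match the values in the `type` column (`carts` appears in the `tail()` output). On a Dask DataFrame, the equivalent would be to call `.compute()` on the `unique()` result, and passing `meta=('eventStrength', 'float64')` to `map` should silence the metadata warning.

```python
import pandas as pd

event_type_strength = {"clicks": 1.0, "carts": 2.0, "orders": 3.0}

# Toy data with one label that is missing from the mapping
toy = pd.DataFrame({"type": ["clicks", "carts", "orders", "wishlist"]})

# Labels that have no entry in the mapping
known = toy["type"].isin(list(event_type_strength))
unknown_types = toy.loc[~known, "type"].unique().tolist()
print("Unknown types found:", unknown_types)

# Unmapped labels become NaN after map() and are then set to 0.0
toy["eventStrength"] = toy["type"].map(event_type_strength).fillna(0.0)
print(toy)
```

If the check reports any unknown label, it is worth fixing the mapping before relying on `eventStrength`, since the `fillna(0.0)` fallback would otherwise silently give those events zero weight.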


Recommender systems have a problem known as user cold-start: it is hard to provide personalized recommendations for users who have consumed no items or only very few, because there is too little information to model their preferences.
For this reason, we keep in the dataset only users with at least 5 interactions.

In [15]:
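# Each session is treated as one user.
# The first groupby yields one row per (session, aid) pair; the second counts those
# pairs per session, i.e. the number of distinct items each session interacted with.
users_interactions_count_df = df.groupby(['session', 'aid']).size().groupby('session').size()
print('# users: %d' % len(users_interactions_count_df))
# Keep only the sessions with at least 5 distinct items
users_with_enough_interactions_df = users_interactions_count_df[users_interactions_count_df >= 5].reset_index()[['session']]
print('# users with at least 5 interactions: %d' % len(users_with_enough_interactions_df))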

# users: 12899779
# users with at least 5 interactions: 6091807
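
To make explicit what the two-step groupby counts, here is a toy pandas example with made-up values. It suggests that the threshold is applied to the number of distinct items per session, not to the raw number of events; all variable names are illustrative.

```python
import pandas as pd

# Session 1 has three events but only two distinct items (aid 10 appears twice)
toy = pd.DataFrame({
    "session": [1, 1, 1, 2, 2],
    "aid":     [10, 10, 11, 20, 21],
})

# Mirrors the notebook: one row per (session, aid) pair, then pairs per session
distinct_items = toy.groupby(["session", "aid"]).size().groupby(level="session").size()

# A more direct formulation of the same quantity
distinct_items_direct = toy.groupby("session")["aid"].nunique()

# Raw event counts, for comparison
raw_events = toy.groupby("session").size()

print(distinct_items.tolist(), distinct_items_direct.tolist(), raw_events.tolist())
```

A session that clicks the same item many times therefore still counts as having a single interaction under this definition.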


In [16]:
users_with_enough_interactions_df.head()

Unnamed: 0,session
0,0
1,1
2,2
3,3
4,4


In [19]:
# Merge to keep only the interactions of the retained sessions
filtered_df = df.merge(users_with_enough_interactions_df[["session"]], 
                        how="right", 
                        on="session")

print("# of interactions from users with at least 5 interactions: %d" % len(filtered_df))

# Save the filtered dataset as Parquet (efficient for large data).
# Dask's to_parquet takes write_index rather than pandas' index keyword,
# and writes a directory of part files under the given path.
filtered_df.to_parquet("filtered_data.parquet", write_index=False)

print("Filtered data saved as Parquet.")


# of interactions from users with at least 5 interactions: 191678570


MemoryError: Unable to allocate 510. MiB for an array with shape (66838606,) and data type int64
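
One option that might reduce memory pressure at this step is to replace the merge with a boolean filter on `session`, since only rows are being selected and no columns are added. The pandas sketch below shows the idea on made-up data; whether it avoids the error on the full Dask dataset is untested here, and the names `events`, `min_items`, and `keep_sessions` are illustrative.

```python
import pandas as pd

events = pd.DataFrame({
    "session": [1, 1, 2, 3, 3, 3],
    "aid":     [10, 11, 20, 30, 31, 32],
})
min_items = 2  # the notebook uses 5; lowered here to fit the toy data

# Distinct items per session, then the ids of the sessions to keep
items_per_session = events.groupby("session")["aid"].nunique()
keep_sessions = items_per_session[items_per_session >= min_items].index

# Boolean mask filter: no join, so row order and columns stay unchanged
filtered = events[events["session"].isin(keep_sessions)]
print(filtered)
```

The result contains the same rows an inner merge on `session` would keep, without building a joined table.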