# Reduce Memory Usage: 2 GB ==> 500 MB

This is a modified version of [@Mohamed Eltayeb](https://www.kaggle.com/mohammad2012191)'s notebook:

[https://www.kaggle.com/code/mohammad2012191/reduce-memory-usage-2gb-780mb](https://www.kaggle.com/code/mohammad2012191/reduce-memory-usage-2gb-780mb).

There are two steps to reduce memory usage:

- Don't load the `fullscreen`, `hq`, and `music` columns. These features are full of NA values.

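- Apply the `reduce_memory_usage` function, copied from Mohamed Eltayeb's notebook. It downcasts each numeric column to the smallest type that covers the column's value range and converts `object` columns to `category`. Thanks to the function's original author, [@ArjanGroen](https://www.kaggle.com/arjanso).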
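The first step can be checked before settling on a column list. Below is a minimal sketch, assuming a tiny inline CSV as a stand-in for `train.csv`. The sample rows and the `keep_cols` name are illustrative and not part of the original notebook.

```python
import io

import pandas as pd

# Tiny stand-in for train.csv (illustrative rows, not real competition data).
csv_text = """session_id,level,fullscreen,hq,music
1,0,,,
1,1,,,
2,0,,,
"""

# Read a limited number of rows first and measure how empty each column is.
sample = pd.read_csv(io.StringIO(csv_text), nrows=1000)
na_fraction = sample.isna().mean()

# Keep only the columns that contain at least one non-NA value in the sample.
keep_cols = na_fraction[na_fraction < 1.0].index.tolist()
print(keep_cols)

# Reload with usecols so the empty columns are never materialized.
slim = pd.read_csv(io.StringIO(csv_text), usecols=keep_cols)
```

A check on a sample only shows that a column is empty in the rows that were read, so it is worth confirming on the full file before dropping a column for good.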

The training data after memory reduction is available here: [https://www.kaggle.com/datasets/curiosity30/sp-reduce-mem-train](https://www.kaggle.com/datasets/curiosity30/sp-reduce-mem-train).

For ease of use, I have attached this dataset to the notebook, so you can copy the notebook and start working right away.

**Please upvote if this notebook helps you. Thank you for your support!**

# Code

In [1]:
import pandas as pd, numpy as np
from sklearn.model_selection import KFold, GroupKFold
import lightgbm as lgb
from xgboost import XGBClassifier
from sklearn.metrics import f1_score
import os
import gc

In [2]:
def reduce_memory_usage(df):
    """Downcast numeric columns and convert object columns to category.

    The dataframe is modified in place and also returned.
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('BEFORE: Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype.name

        # Datetime and category columns are left untouched.
        if col_type in ('datetime64[ns]', 'category'):
            continue

        # Text columns: store each distinct string only once.
        if col_type == 'object':
            df[col] = df[col].astype('category')
            continue

        c_min = df[col].min()
        c_max = df[col].max()

        if col_type[:3] == 'int':
            # Pick the narrowest signed integer type whose range covers the column.
            if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                df[col] = df[col].astype(np.int8)
            elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                df[col] = df[col].astype(np.int16)
            elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                df[col] = df[col].astype(np.int32)
            elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                df[col] = df[col].astype(np.int64)
        else:
            # Every other dtype is treated as float. This also catches unsigned-int
            # and bool columns, which the training data does not contain.
            # Caution: float16 keeps only about 3 significant decimal digits,
            # so values are rounded when they are downcast.
            if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                df[col] = df[col].astype(np.float16)
            elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                df[col] = df[col].astype(np.float32)
            # Otherwise the column keeps its current dtype.

    mem_usg = df.memory_usage().sum() / 1024**2
    print("AFTER: Memory usage became: ",mem_usg," MB")

    return df
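Because `reduce_memory_usage` picks a float type from the value range alone, columns can end up as `float16`, which trades precision for space. The sketch below estimates how much rounding a cast introduces. The helper name `max_roundtrip_error` and the sample values are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def max_roundtrip_error(values: pd.Series, dtype) -> float:
    """Largest absolute difference introduced by casting to `dtype` and back."""
    as_float64 = values.astype(np.float64)
    roundtrip = as_float64.astype(dtype).astype(np.float64)
    return float((as_float64 - roundtrip).abs().max())


# Made-up coordinate-like values, used only for illustration.
coords = pd.Series([0.5, 123.456, 1234.567, -987.654])

err16 = max_roundtrip_error(coords, np.float16)
err32 = max_roundtrip_error(coords, np.float32)
print(f"float16 max error: {err16:.4f}")
print(f"float32 max error: {err32:.6f}")
```

If the `float16` error is too large for a feature such as a coordinate, one option is to cast that column back to `float32` after calling the function, at the cost of a little extra memory.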

In [3]:
%%time
train_path = '/data/01_raw/train.csv'
# Every column except `fullscreen`, `hq`, and `music`.
train_cols = ['session_id', 'index', 'elapsed_time', 'event_name', 'name', 'level', 'page',
              'room_coor_x', 'room_coor_y', 'screen_coor_x', 'screen_coor_y', 'hover_duration',
              'text', 'fqid', 'room_fqid', 'text_fqid', 'level_group']
train = pd.read_csv(train_path, usecols=train_cols)
print(train.shape)

(13174211, 17)
CPU times: user 29.3 s, sys: 7.83 s, total: 37.2 s
Wall time: 50.8 s
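An alternative to downcasting after loading is to pass the target types to `pd.read_csv` through its `dtype` parameter, so the full-size frame is never built. This is a sketch under assumptions, not what the notebook does: the inline CSV rows are made up, only a few columns are shown, and `float32` is chosen here for the coordinate column.

```python
import io

import pandas as pd

# Made-up rows that mimic a few of the train.csv columns.
csv_text = """session_id,index,event_name,level,room_coor_x
20090312431273200,0,cutscene_click,0,-413.99
20090312431273200,1,person_click,0,-412.5
20090312431273200,2,navigate_click,1,
"""

# Target dtype per column; a missing key would fall back to pandas' inference.
dtypes = {
    "session_id": "int64",
    "index": "int16",
    "event_name": "category",
    "level": "int8",
    "room_coor_x": "float32",
}

df = pd.read_csv(io.StringIO(csv_text), usecols=list(dtypes), dtype=dtypes)
print(df.dtypes)
```

This approach requires knowing each column's value range in advance. An integer dtype also fails on columns with missing values, so such columns need a float or nullable type.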


In [4]:
train = reduce_memory_usage(train)
train.to_pickle('reduce_train.pkl')
print('OK!')

BEFORE: Memory usage of dataframe is 1708.69 MB
AFTER: Memory usage became:  477.4601535797119  MB
OK!
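Saving with `to_pickle` keeps the reduced dtypes, so the frame can be restored with `pd.read_pickle` without running the reduction again. A minimal round-trip sketch on a small made-up frame, written to a temporary directory rather than the notebook's working directory:

```python
import os
import tempfile

import pandas as pd

# Small made-up frame with already-reduced dtypes.
df = pd.DataFrame({
    "level": pd.Series([0, 1, 2], dtype="int8"),
    "event_name": pd.Series(["a", "b", "a"], dtype="category"),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "reduce_train.pkl")
    df.to_pickle(path)
    restored = pd.read_pickle(path)

# The dtypes survive the round trip, unlike with a CSV export.
same_dtypes = restored.dtypes.astype(str).tolist() == df.dtypes.astype(str).tolist()
print(restored.dtypes)
print(same_dtypes)
```

Pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.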


In [5]:
print(train.dtypes)

session_id           int64
index                int16
elapsed_time         int32
event_name        category
name              category
level                 int8
page               float16
room_coor_x        float16
room_coor_y        float16
screen_coor_x      float16
screen_coor_y      float16
hover_duration     float32
text              category
fqid              category
room_fqid         category
text_fqid         category
level_group       category
dtype: object
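The BEFORE and AFTER figures printed by `reduce_memory_usage` come from `df.memory_usage()` with its default `deep=False`, which counts only the 8-byte pointers for `object` columns and not the strings they point to. The saving from the `category` conversion is therefore likely larger than the printed numbers suggest. A small sketch with made-up strings, forcing `object` dtype explicitly:

```python
import pandas as pd

# Made-up repeated strings stored as Python objects.
words = pd.Series(
    ["cutscene_click", "person_click", "navigate_click"] * 1000,
    dtype=object,
)

shallow = words.memory_usage(deep=False)  # pointers only
deep = words.memory_usage(deep=True)      # pointers plus string payloads
as_category = words.astype("category").memory_usage(deep=True)

print(f"object, shallow: {shallow} bytes")
print(f"object, deep:    {deep} bytes")
print(f"category, deep:  {as_category} bytes")
```

Calling `df.memory_usage(deep=True).sum()` on the full frame gives a more realistic total, although it takes longer on large `object` columns.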
