## Efficient Memory Usage

##### Importing libraries

In [1]:
import pandas as pd
import numpy as np

##### Data

In [2]:
def getDataset(size):
    df = pd.DataFrame()      # Empty DataFrame
    df['roll no'] = np.arange(1, size+1)
    df['batch'] = np.random.choice(['B1', 'B2', 'B3', 'B4'], size)
    df['marks'] = np.random.randint(40, 100, size)
    df['age'] = np.random.randint(21, 24, size)
    df['pass'] = np.where(df['marks']>40, 'yes', 'no')
    df['average'] = df['marks'] / 5
    return df

In [3]:
data = getDataset(1000000) # 1 million data points
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 6 columns):
 #   Column   Non-Null Count    Dtype  
---  ------   --------------    -----  
 0   roll no  1000000 non-null  int32  
 1   batch    1000000 non-null  object 
 2   marks    1000000 non-null  int32  
 3   age      1000000 non-null  int32  
 4   pass     1000000 non-null  object 
 5   average  1000000 non-null  float64
dtypes: float64(1), int32(3), object(2)
memory usage: 34.3+ MB


For 1 million rows, the reported memory usage is ~34 MB. Let's try to reduce it.
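The `+` in `34.3+ MB` indicates that `info()` measured the `object` columns shallowly, counting only the pointers and not the Python strings they refer to. A minimal sketch of a per-column breakdown follows; the small `sample` frame and its size are hypothetical stand-ins for the dataset above, and the string column is forced to `object` dtype so the result does not depend on the pandas version's default string type.

```python
import numpy as np
import pandas as pd

# Hypothetical 1,000-row stand-in with one numeric and one string column.
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    'marks': rng.integers(40, 100, 1000).astype('int32'),
    'batch': pd.Series(rng.choice(['B1', 'B2', 'B3', 'B4'], 1000), dtype=object),
})

# Shallow: an object column is counted as one pointer per row.
shallow = sample.memory_usage(index=False)
# Deep: also counts the Python string objects behind those pointers.
deep = sample.memory_usage(index=False, deep=True)

print(shallow)
print(deep)
```

For numeric columns the two figures agree; for the string column the deep figure is noticeably larger, so the true footprint of the original frame is likely above the reported 34.3 MB.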

As a baseline, let's find the average marks scored by students in each batch and time the operation.

In [4]:
%timeit data.groupby(['batch'])['marks'].mean()

67.8 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


### Casting categorical values

In [5]:
type(data['batch'][0])

str

In [6]:
data = getDataset(1000000) 
data['batch'] = data['batch'].astype('category')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 6 columns):
 #   Column   Non-Null Count    Dtype   
---  ------   --------------    -----   
 0   roll no  1000000 non-null  int32   
 1   batch    1000000 non-null  category
 2   marks    1000000 non-null  int32   
 3   age      1000000 non-null  int32   
 4   pass     1000000 non-null  object  
 5   average  1000000 non-null  float64 
dtypes: category(1), float64(1), int32(3), object(1)
memory usage: 27.7+ MB
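The saving comes from how a `category` column is stored: each distinct label is kept once, and every row holds only a small integer code. The drop from 34.3 MB to 27.7 MB above is consistent with about 7 bytes saved per row (an 8-byte pointer replaced by a 1-byte code). A small sketch with made-up values:

```python
import pandas as pd

# Hypothetical six-row column with the same four labels as 'batch'.
batch = pd.Series(['B1', 'B3', 'B1', 'B4', 'B2', 'B3'], dtype=object)
as_cat = batch.astype('category')

print(as_cat.cat.categories.tolist())  # distinct labels, stored once
print(as_cat.cat.codes.tolist())       # one integer code per row
print(as_cat.cat.codes.dtype)          # int8 while the number of categories is small
```

This pays off when a column has few distinct values relative to its length; for a column of mostly unique strings the codes add overhead and save nothing.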


### Downcasting integer values

In [7]:
type(data['age'][0])

numpy.int32

In [8]:
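data['age'] = data['age'].astype('int8')           # ages 21-23 fit in int8 (-128 to 127)
# Caution: roll numbers run up to 1,000,000, far outside the int8 range, so this cast
# silently wraps them around; int32 is the smallest signed type that holds them.
data['roll no'] = data['roll no'].astype('int8')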
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 6 columns):
 #   Column   Non-Null Count    Dtype   
---  ------   --------------    -----   
 0   roll no  1000000 non-null  int8    
 1   batch    1000000 non-null  category
 2   marks    1000000 non-null  int32   
 3   age      1000000 non-null  int8    
 4   pass     1000000 non-null  object  
 5   average  1000000 non-null  float64 
dtypes: category(1), float64(1), int32(1), int8(2), object(1)
memory usage: 21.9+ MB
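Downcasting is only safe when every value fits in the target type, and `astype` does not check this for integer-to-integer casts. The sketch below shows one way to verify the range first; the helper `fits` is a hypothetical name introduced here, not part of pandas. It also shows `pd.to_numeric(..., downcast='integer')`, which picks the smallest signed integer type that holds the data.

```python
import numpy as np
import pandas as pd

def fits(series, dtype):
    """Return True if every value in `series` is representable in integer `dtype`."""
    info = np.iinfo(np.dtype(dtype))
    return bool(series.min() >= info.min and series.max() <= info.max)

age = pd.Series(np.array([21, 22, 23], dtype='int32'))
roll_no = pd.Series(np.arange(1, 1_000_001, dtype='int32'))

print(fits(age, 'int8'))      # ages 21-23 lie within -128..127
print(fits(roll_no, 'int8'))  # 1..1,000,000 does not

# Let pandas choose the smallest signed integer dtype that holds the values.
age_small = pd.to_numeric(age, downcast='integer')
roll_small = pd.to_numeric(roll_no, downcast='integer')
print(age_small.dtype, roll_small.dtype)
```

With these inputs `age` can drop to `int8`, while `roll no` has to stay at `int32`, so the memory figure reported for an `int8` roll number column comes at the cost of corrupted values.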


### Downcasting float values

In [9]:
type(data['average'][0])

numpy.float64

In [10]:
data['average'] = data['average'].astype('float16')

In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 6 columns):
 #   Column   Non-Null Count    Dtype   
---  ------   --------------    -----   
 0   roll no  1000000 non-null  int8    
 1   batch    1000000 non-null  category
 2   marks    1000000 non-null  int32   
 3   age      1000000 non-null  int8    
 4   pass     1000000 non-null  object  
 5   average  1000000 non-null  float16 
dtypes: category(1), float16(1), int32(1), int8(2), object(1)
memory usage: 16.2+ MB
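`float16` quarters the storage of `float64` but keeps only about three significant decimal digits, so the cast rounds most values. A minimal sketch of how to inspect the limits and the rounding error, using a few illustrative `marks / 5` values (chosen here, not taken from the dataset):

```python
import numpy as np

limits = np.finfo(np.float16)
print(limits.max)        # largest finite float16 value
print(limits.precision)  # approximate number of reliable decimal digits

# Illustrative averages in the range produced by marks / 5 (8.0 to 19.8).
average = np.array([8.0, 13.4, 19.8])
as_half = average.astype(np.float16)

# Absolute error introduced by the round trip through float16.
error = np.abs(as_half.astype(np.float64) - average)
print(as_half)
print(error.max())
```

Whether an error of this size is acceptable depends on how the column is used downstream; `float32` is a more conservative choice when the values feed further arithmetic.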


### Casting bool types

In [12]:
data['pass'].value_counts()

yes    983303
no      16697
Name: pass, dtype: int64

In [13]:
data['pass'] = data['pass'].map({'yes': True, 'no': False})
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 6 columns):
 #   Column   Non-Null Count    Dtype   
---  ------   --------------    -----   
 0   roll no  1000000 non-null  int8    
 1   batch    1000000 non-null  category
 2   marks    1000000 non-null  int32   
 3   age      1000000 non-null  int8    
 4   pass     1000000 non-null  bool    
 5   average  1000000 non-null  float16 
dtypes: bool(1), category(1), float16(1), int32(1), int8(2)
memory usage: 9.5 MB
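`map` with a dictionary yields a `bool` column only when every value appears as a key; any other label becomes `NaN` and the result falls back to `object` dtype. The sketch below uses a made-up series with a stray `'Yes'` to show the difference, along with a plain comparison that always returns `bool`:

```python
import pandas as pd

# Hypothetical values; the capitalised 'Yes' is not a key in the mapping.
status = pd.Series(['yes', 'no', 'yes', 'Yes'], dtype=object)

mapped = status.map({'yes': True, 'no': False})
print(mapped.dtype)         # object, because the unmapped label became NaN
print(mapped.isna().sum())  # number of labels the mapping missed

compared = status == 'yes'
print(compared.dtype)       # bool
```

The comparison treats every unexpected label as `False`, whereas `map` exposes it as a missing value, so checking `isna()` after `map` is a cheap way to catch dirty data before casting.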


In [14]:
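def setTypes(df):
    # Casts the columns of the passed DataFrame in place and returns it.
    df['batch'] = df['batch'].astype('category')
    df['pass'] = df['pass'].map({'yes': True, 'no': False})
    df['average'] = df['average'].astype('float16')
    df['age'] = df['age'].astype('int8')
    df['marks'] = df['marks'].astype('int8')       # marks 40-99 fit in int8
    # Caution: int8 cannot hold roll numbers above 127; this cast wraps them around.
    df['roll no'] = df['roll no'].astype('int8')
    
    return df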

In [15]:
data = getDataset(1000000)
data = setTypes(data)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 6 columns):
 #   Column   Non-Null Count    Dtype   
---  ------   --------------    -----   
 0   roll no  1000000 non-null  int8    
 1   batch    1000000 non-null  category
 2   marks    1000000 non-null  int8    
 3   age      1000000 non-null  int8    
 4   pass     1000000 non-null  bool    
 5   average  1000000 non-null  float16 
dtypes: bool(1), category(1), float16(1), int8(3)
memory usage: 6.7 MB


In [16]:
%timeit data.groupby(['batch'])['marks'].mean()

10.8 ms ± 312 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
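`%timeit` is an IPython magic, so the comparison cannot be run as-is in a plain script. A rough equivalent with the standard `timeit` module is sketched below, isolating the effect of the `category` cast on the group key. The frame size, the helper name `mean_by_batch`, and the repeat counts are choices made for this sketch, and the timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000  # smaller than the 1 million rows above, to keep the sketch quick
marks = rng.integers(40, 100, n)
labels = pd.Series(rng.choice(['B1', 'B2', 'B3', 'B4'], n), dtype=object)

as_object = pd.DataFrame({'batch': labels, 'marks': marks})
as_category = as_object.assign(batch=as_object['batch'].astype('category'))

def mean_by_batch(df):
    # observed=True is passed explicitly so the result does not depend on the
    # default for categorical keys, which is assumed to vary across pandas versions.
    return df.groupby('batch', observed=True)['marks'].mean()

# Best of 3 repeats, 5 calls per repeat.
t_object = min(timeit.repeat(lambda: mean_by_batch(as_object), number=5, repeat=3))
t_category = min(timeit.repeat(lambda: mean_by_batch(as_category), number=5, repeat=3))

print(f"object key:   {t_object / 5 * 1e3:.2f} ms per call")
print(f"category key: {t_category / 5 * 1e3:.2f} ms per call")
```

Both calls return the same means; only the time spent building the groups differs.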


#### By casting the datatypes, we brought memory usage down by ~80%, from 34.3 MB to 6.7 MB. The grouping operation also took ~84% less time (67.8 ms down to 10.8 ms per loop, roughly 6x faster).
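The percentages follow directly from the figures reported by `info()` and `%timeit` above. As a quick check of the arithmetic:

```python
# Figures copied from the notebook outputs above.
mem_before_mb, mem_after_mb = 34.3, 6.7
time_before_ms, time_after_ms = 67.8, 10.8

mem_reduction = 1 - mem_after_mb / mem_before_mb
time_reduction = 1 - time_after_ms / time_before_ms
speedup = time_before_ms / time_after_ms

print(f"memory reduction: {mem_reduction:.1%}")
print(f"time reduction:   {time_reduction:.1%}")
print(f"speedup:          {speedup:.1f}x")
```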