# <center>Memory optimization</center>

DataFrames are stored entirely in memory, so **memory optimization** is key when working with large datasets in pandas.<br>
Memory optimization best practices:<br>
 1. Drop unnecessary columns.<br>
 2. Convert object columns to a numeric or datetime datatype.<br>
 3. Use the categorical datatype for columns where the **number of unique values < rows / 2**.
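
The three practices can be sketched on a small synthetic DataFrame. The column names and values below (including the `notes` column) are hypothetical and only stand in for a larger dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for a larger dataset
df = pd.DataFrame({
    "date": ["2016-01-01", "2016-01-02"] * 500,
    "family": ["AUTOMOTIVE", "BEAUTY"] * 500,
    "sales": ["0.0", "12.5"] * 500,   # numbers stored as strings
    "notes": ["unused"] * 1000,       # a column we do not need
})

before = df.memory_usage(deep=True).sum()

# 1. Drop unnecessary columns
df = df.drop(columns=["notes"])

# 2. Convert object columns to datetime or numeric dtypes
df["date"] = pd.to_datetime(df["date"])
df["sales"] = pd.to_numeric(df["sales"])

# 3. Use the categorical dtype when unique values < rows / 2
if df["family"].nunique() < len(df) / 2:
    df["family"] = df["family"].astype("category")

after = df.memory_usage(deep=True).sum()
print(f"before: {before} bytes, after: {after} bytes")
```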

In [1]:
import pandas as pd
import numpy as np

In [2]:
retail_df = pd.read_csv("retail_2016_2017.csv")

In [3]:
retail_df

,id,date,store_nbr,family,sales,onpromotion
0,1945944,2016-01-01,1,AUTOMOTIVE,0.000,0
1,1945945,2016-01-01,1,BABY CARE,0.000,0
2,1945946,2016-01-01,1,BEAUTY,0.000,0
3,1945947,2016-01-01,1,BEVERAGES,0.000,0
4,1945948,2016-01-01,1,BOOKS,0.000,0
...,...,...,...,...,...,...
1054939,3000883,2017-08-15,9,POULTRY,438.133,0
1054940,3000884,2017-08-15,9,PREPARED FOODS,154.553,1
1054941,3000885,2017-08-15,9,PRODUCE,2419.729,148
1054942,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8


In [4]:
retail_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1054944 entries, 0 to 1054943
Data columns (total 6 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   id           1054944 non-null  int64  
 1   date         1054944 non-null  object 
 2   store_nbr    1054944 non-null  int64  
 3   family       1054944 non-null  object 
 4   sales        1054944 non-null  float64
 5   onpromotion  1054944 non-null  int64  
dtypes: float64(1), int64(3), object(2)
memory usage: 48.3+ MB


In [5]:
retail_df.memory_usage(deep=True)

Index               128
id              8439552
date           70681248
store_nbr       8439552
family         71480448
sales           8439552
onpromotion     8439552
dtype: int64
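
The `+` in the `48.3+ MB` reported by `info()` marks a lower bound: for object columns, only the 8-byte references are counted, not the strings they point to. With `deep=True` the strings themselves are measured, and the values above sum to roughly 168 MB. A minimal sketch of the difference, using a hypothetical small frame:

```python
import pandas as pd

# Hypothetical small frame: one string column (forced to object dtype)
# and one integer column
df = pd.DataFrame({
    "family": ["AUTOMOTIVE", "BABY CARE", "BEAUTY"] * 1000,
    "store_nbr": [1, 2, 3] * 1000,
}).astype({"family": object})

shallow = df.memory_usage()          # counts only the references for object columns
deep = df.memory_usage(deep=True)    # also counts the Python string objects

print(shallow)
print(deep)

# Total deep memory in MB (1 MB = 1024**2 bytes, as in info())
print(round(deep.sum() / 1024**2, 3))
```

The numeric column reports the same size either way; only the object column grows under `deep=True`.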

In [8]:
# date is stored as object dtype in the DataFrame; convert it to datetime
# (the unit is spelled out because pandas 2.x rejects a unit-less "datetime64")

retail_df = retail_df.astype({"date": "datetime64[ns]"})

In [9]:
retail_df.memory_usage(deep=True)

Index               128
id              8439552
date            8439552
store_nbr       8439552
family         71480448
sales           8439552
onpromotion     8439552
dtype: int64
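
As an alternative to converting after loading, the date column can be parsed while the file is read. The sketch below uses the `parse_dates` parameter of `pd.read_csv`. The in-memory CSV is a stand-in built from the first rows shown earlier, since the real file is not assumed to be available here.

```python
import io

import pandas as pd

# Stand-in for retail_2016_2017.csv, built from the first rows shown above
csv_text = """id,date,store_nbr,family,sales,onpromotion
1945944,2016-01-01,1,AUTOMOTIVE,0.0,0
1945945,2016-01-01,1,BABY CARE,0.0,0
"""

# parse_dates converts the listed columns to datetime during loading
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
print(df.dtypes)
```

`pd.to_datetime(df["date"])` is another way to do the same conversion on an already loaded column.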

In [23]:
# number of unique elements in the family column
no_of_unique = retail_df["family"].nunique()
no_of_unique

33

In [13]:

retail_df.shape

(1054944, 6)

In [24]:
# number of rows divided by 2
row_divide = retail_df.shape[0] / 2
row_divide

527472.0

In [22]:
no_of_unique < row_divide

True
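
The same check can be applied to every text column at once. The helper below is a hypothetical sketch (the name `categorical_candidates` is not part of pandas) that applies the rule of thumb from the best practices: unique values < rows / 2.

```python
import pandas as pd


def categorical_candidates(df: pd.DataFrame) -> list:
    """Return the text columns whose unique-value count is below rows / 2."""
    threshold = len(df) / 2
    return [
        col
        for col in df.columns
        if pd.api.types.is_string_dtype(df[col])
        and df[col].nunique() < threshold
    ]


# Hypothetical example frame
example = pd.DataFrame({
    "family": ["AUTOMOTIVE", "BABY CARE", "BEAUTY"] * 4,   # 3 unique values in 12 rows
    "id_text": [f"row-{i}" for i in range(12)],            # every value is unique
    "store_nbr": [1, 2, 3] * 4,                            # numeric, not considered
})

print(categorical_candidates(example))
```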

In [25]:
# check the dtypes before converting family to the categorical datatype

retail_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1054944 entries, 0 to 1054943
Data columns (total 6 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   id           1054944 non-null  int64         
 1   date         1054944 non-null  datetime64[ns]
 2   store_nbr    1054944 non-null  int64         
 3   family       1054944 non-null  object        
 4   sales        1054944 non-null  float64       
 5   onpromotion  1054944 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(3), object(1)
memory usage: 48.3+ MB


In [27]:
# convert the family column to the categorical datatype
retail_df = retail_df.astype({"family": "category"})

In [28]:
retail_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1054944 entries, 0 to 1054943
Data columns (total 6 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   id           1054944 non-null  int64         
 1   date         1054944 non-null  datetime64[ns]
 2   store_nbr    1054944 non-null  int64         
 3   family       1054944 non-null  category      
 4   sales        1054944 non-null  float64       
 5   onpromotion  1054944 non-null  int64         
dtypes: category(1), datetime64[ns](1), float64(1), int64(3)
memory usage: 41.3 MB
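
The drop from `48.3+ MB` to `41.3 MB` comes from how a categorical column is stored: each distinct string is kept once, and every row holds only a small integer code. With 33 categories, the codes fit in `int8`, so the column costs about one byte per row instead of an 8-byte reference plus the string. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical column with a handful of repeated labels
families = ["AUTOMOTIVE", "BABY CARE", "BEAUTY", "BEVERAGES", "BOOKS"]
as_object = pd.Series(families * 2000, dtype=object)
as_category = as_object.astype("category")

# Each row stores a small integer code that points into the categories
print(as_category.cat.codes.dtype)
print(list(as_category.cat.categories))

object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(f"object: {object_bytes} bytes, category: {category_bytes} bytes")
```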


### Note:
<span style ="color:purple">1. Pandas DataFrames are **data tables** with rows and columns.<br>
    2. Use **exploration methods** to quickly understand the data in a DataFrame.<br>
    3. You can easily **filter, sort, and modify** DataFrames with methods and functions.<br>
    4. **Memory optimization** is critical when working with large datasets in pandas.</span>