##### **Memory Optimization** </br> DataFrames are stored entirely in memory </br> `Large datasets can consume an excessive amount of memory`

In [11]:
import pandas as pd
import numpy as np

##### **Best Practices**
| Step | Description|
|---|----------------|
| 1 | Drop unnecessary columns (or avoid reading them)|
| 2 | Convert object types to numeric or datetime where possible|
| 3 | Downcast numeric data to the smallest appropriate bit size|
| 4 | Use the categorical data type `'category'` if `number of unique values < (rows / 2)`|
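
Several of these steps can be applied while the file is being read, so the oversized columns never reach memory. The sketch below is a minimal illustration on a tiny in-memory CSV; `csv_text` and `small_df` are hypothetical names, the sample rows are made up, and `int8` for `store_nbr` assumes store numbers stay below 128.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the retail CSV (hypothetical sample rows)
csv_text = """id,date,store_nbr,family,sales,onpromotion
1945944,2016-01-01,1,AUTOMOTIVE,0.0,0
1945945,2016-01-01,1,BABY CARE,0.0,0
1945946,2016-01-02,2,AUTOMOTIVE,12.5,3
"""

small_df = pd.read_csv(
    io.StringIO(csv_text),
    # step 1: list only the columns to keep, so 'id' is never loaded
    usecols=['date', 'store_nbr', 'family', 'sales', 'onpromotion'],
    # step 2: parse the date strings into a datetime column
    parse_dates=['date'],
    # steps 3 and 4: smaller integer types and a categorical column
    dtype={'store_nbr': 'int8', 'onpromotion': 'int16', 'family': 'category'},
)
print(small_df.dtypes)
```

Setting types at read time only works when the value ranges are already known; otherwise it is safer to load first, inspect, and convert afterward.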

In [12]:
path_retail = 'Pandas Course Resources/retail/retail_2016_2017.csv'
retail_df = pd.read_csv(path_retail)

retail_df.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,1945944,2016-01-01,1,AUTOMOTIVE,0.0,0
1,1945945,2016-01-01,1,BABY CARE,0.0,0
2,1945946,2016-01-01,1,BEAUTY,0.0,0
3,1945947,2016-01-01,1,BEVERAGES,0.0,0
4,1945948,2016-01-01,1,BOOKS,0.0,0


In [13]:
# assess memory usage in MB
retail_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1054944 entries, 0 to 1054943
Data columns (total 6 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   id           1054944 non-null  int64  
 1   date         1054944 non-null  object 
 2   store_nbr    1054944 non-null  int64  
 3   family       1054944 non-null  object 
 4   sales        1054944 non-null  float64
 5   onpromotion  1054944 non-null  int64  
dtypes: float64(1), int64(3), object(2)
memory usage: 167.8 MB


In [14]:
# assess memory usage in bytes
retail_df.memory_usage(deep=True).sum()
# 175920036

175920036
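
Before summing, it can help to look at the per-column breakdown to see which columns are worth optimizing first. This is a small sketch on a synthetic frame; `demo_df` and its values are hypothetical stand-ins for the retail data.

```python
import pandas as pd

# Small synthetic frame (hypothetical data) with similar column types
demo_df = pd.DataFrame({
    'id': range(1000),
    'date': ['2016-01-01'] * 1000,
    'family': ['AUTOMOTIVE', 'BABY CARE'] * 500,
    'sales': [0.0] * 1000,
})

# Bytes per column; deep=True also counts the contents of string columns
per_column = demo_df.memory_usage(deep=True)
print(per_column.sort_values(ascending=False))

# Convert the total to MB using 1 MB = 1024**2 bytes, the same unit info() reports
print(round(per_column.sum() / 1024**2, 2), 'MB')
```

String columns usually dominate this listing, which is why converting them tends to give the largest savings.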

##### **Drop Columns**

In [15]:
# if the data has an 'id' column that only duplicates the role of the DataFrame index, drop it
# inplace=True modifies retail_df directly instead of returning a new DataFrame
retail_df.drop(columns='id', inplace=True)
retail_df

Unnamed: 0,date,store_nbr,family,sales,onpromotion
0,2016-01-01,1,AUTOMOTIVE,0.000,0
1,2016-01-01,1,BABY CARE,0.000,0
2,2016-01-01,1,BEAUTY,0.000,0
3,2016-01-01,1,BEVERAGES,0.000,0
4,2016-01-01,1,BOOKS,0.000,0
...,...,...,...,...,...
1054939,2017-08-15,9,POULTRY,438.133,0
1054940,2017-08-15,9,PREPARED FOODS,154.553,1
1054941,2017-08-15,9,PRODUCE,2419.729,148
1054942,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8


In [17]:
# assess memory usage in bytes
retail_df.memory_usage(deep=True).sum()
# 175920036 before 'id' column drop
# 167480484 after 'id' column drop

167480484

##### **Convert Object Data Types**

In [19]:
# the earlier info() output (taken before the 'id' drop) showed
# 'date' and 'family' stored as object dtype, which is the most expensive

retail_df = retail_df.astype({'date': 'datetime64[ns]', 'family':'category'})
# assess memory usage in bytes
retail_df.memory_usage(deep=True).sum()
# 175920036 before 'id' column drop
# 167480484 after 'id' column drop
# 34816592 after objects converted to datetime64, category data types

34816592
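
The rule of thumb from the Best Practices table (unique values < rows / 2) can be checked before converting. The sketch below uses a hypothetical `family` Series with made-up labels and compares the memory of the two representations.

```python
import pandas as pd

# Hypothetical column: 10,000 rows but only 3 distinct labels
family = pd.Series(['AUTOMOTIVE', 'BEAUTY', 'BOOKS', 'BEAUTY'] * 2500)

# Rule of thumb: convert when unique values < rows / 2
worth_converting = family.nunique() < len(family) / 2

# Compare bytes used by the string column and its categorical version
as_strings = family.memory_usage(deep=True)
as_category = family.astype('category').memory_usage(deep=True)
print(worth_converting, as_strings, as_category)
```

A categorical column stores each distinct label once plus a small integer code per row, so the saving grows as the labels repeat more often.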

##### **Downcasting Numeric Data** </br> By default, pandas stores integers and floats as 64-bit so that it can handle almost any value
|Bit size|Signed integer range|
|--------|------------|
|8-bits| -128 to 127|
|16-bits| -32,768 to 32,767|
|32-bits| -2,147,483,648 to 2,147,483,647|
|64-bits| -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807|
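
The limits in this table can be read directly from NumPy, and `pd.to_numeric` with `downcast='integer'` can choose the smallest signed integer type for a column. A minimal sketch, using a hypothetical `counts` Series with made-up values:

```python
import numpy as np
import pandas as pd

# Signed integer limits, read from NumPy rather than memorized
for dtype in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(dtype)
    print(dtype.__name__, info.min, info.max)

# Hypothetical promotion counts; the largest value (672) exceeds the int8 range
counts = pd.Series([0, 1, 2, 148, 672])

# downcast='integer' picks the smallest signed integer dtype that holds every value
smallest = pd.to_numeric(counts, downcast='integer')
print(smallest.dtype)
```

Letting pandas choose avoids guessing, but the chosen type reflects only the values currently in the column; later data with larger values would need a wider type.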


In [20]:
retail_df['onpromotion'].value_counts()

onpromotion
0      643896
1       93647
2       44341
3       26985
4       20934
        ...  
600         1
306         1
672         1
474         1
425         1
Name: count, Length: 362, dtype: int64

In [22]:
retail_df['onpromotion'] = retail_df['onpromotion'].astype('int16')
retail_df.memory_usage(deep=True).sum()
# 175920036 before 'id' column drop
# 167480484 after 'id' column drop
# 34816592 after objects converted to datetime64, category data types
# 28486928 after int64 converted to int16

28486928
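
Since a fixed-width cast is only safe when every value fits the target type, a range check before `astype` is a cheap safeguard. The helper `fits_in` below is a hypothetical function written for this illustration, not part of pandas, and the sample values are made up.

```python
import numpy as np
import pandas as pd

def fits_in(series, dtype):
    """Return True if every value of an integer Series lies within dtype's range."""
    limits = np.iinfo(dtype)
    return bool(limits.min <= series.min() and series.max() <= limits.max)

# Hypothetical promotion counts, similar in scale to the column above
onpromotion = pd.Series([0, 1, 4, 148, 672], dtype='int64')

print(fits_in(onpromotion, np.int8))   # 672 is above the int8 maximum of 127
print(fits_in(onpromotion, np.int16))  # 672 is below the int16 maximum of 32,767

# Cast only after confirming the range
if fits_in(onpromotion, np.int16):
    onpromotion = onpromotion.astype('int16')
```

The same check explains the choice made in the cell above: the counts exceed the 8-bit range, so 16 bits is the smallest size that holds them.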

##### **When doing data analysis, have an analysis path and optimize only if required** </br> Don't spend analysis time focused on optimizing memory; do it afterward, especially if the data is required for a pipeline