
#Data Preparation

###Memory Optimization & Efficiency

https://www.youtube.com/watch?v=u4_c2LDi4b8

In [2]:
import pandas as pd
import numpy as np

In [3]:
def get_dataset(size):
  df = pd.DataFrame()
  df["position"] = np.random.choice(["left", "middle", "right"], size)
  df["age"] = np.random.randint(1, 50, size)
  df["team"] = np.random.choice(["red", "blue", "yellow", "green"], size)
  df["win"] = np.random.choice(["yes", "no"], size)
  df["prob"] = np.random.uniform(0, 1, size)
  return df

In [4]:
df = get_dataset(1_000_000)
df.info() # print memory usage

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 5 columns):
 #   Column    Non-Null Count    Dtype  
---  ------    --------------    -----  
 0   position  1000000 non-null  object 
 1   age       1000000 non-null  int64  
 2   team      1000000 non-null  object 
 3   win       1000000 non-null  object 
 4   prob      1000000 non-null  float64
dtypes: float64(1), int64(1), object(3)
memory usage: 38.1+ MB
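
The `+` in `38.1+ MB` means pandas counted only the pointers held by the `object` columns, not the strings they point to. A minimal sketch of how to get the full figure, using a small stand-in frame. The name `demo` is hypothetical, and the `team` column is forced to `object` dtype as an assumption, since newer pandas versions may infer a dedicated string dtype instead:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: one object (string) column and one integer column.
demo = pd.DataFrame({
    "team": pd.Series(
        np.random.choice(["red", "blue", "yellow", "green"], 1_000), dtype=object
    ),
    "age": np.random.randint(1, 50, 1_000),
})

shallow = demo.memory_usage()        # object columns: counts only the pointers
deep = demo.memory_usage(deep=True)  # also counts the Python strings themselves

print(shallow)
print(deep)

# info() can report the exact figure too, without the trailing "+".
demo.info(memory_usage="deep")
```

For numeric columns the two figures match; the gap appears only where the column holds Python objects.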


In [5]:
%timeit df["age_rank"] = df.groupby(["team","position"])["age"].rank()
%timeit df["prob_rank"] = df.groupby(["team","position"])["prob"].rank()
%timeit df["win_prob_rank"] = df.groupby(["team","position","win"])["prob"].rank()

681 ms ± 215 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
602 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
707 ms ± 9.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


###Efficiency with Strings

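Convert string (`object`) columns to the `category` dtype. Columns with only a few distinct values, such as `position` and `team`, then store each label once and keep a small integer code per row.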

In [6]:
df = get_dataset(1_000_000)
df["position"] = df["position"].astype("category")
df["team"] = df["team"].astype("category")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 5 columns):
 #   Column    Non-Null Count    Dtype   
---  ------    --------------    -----   
 0   position  1000000 non-null  category
 1   age       1000000 non-null  int64   
 2   team      1000000 non-null  category
 3   win       1000000 non-null  object  
 4   prob      1000000 non-null  float64 
dtypes: category(2), float64(1), int64(1), object(1)
memory usage: 24.8+ MB
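
To see why the conversion saves memory, it can help to look inside a categorical column. The sketch below uses a hypothetical six-row `teams` series (not the notebook's data) and prints the two pieces a categorical is built from:

```python
import pandas as pd

# Hypothetical toy column, stored as plain Python strings.
teams = pd.Series(["red", "blue", "red", "green", "blue", "red"], dtype=object)
as_cat = teams.astype("category")

print(as_cat.cat.categories)   # the distinct labels, stored once
print(as_cat.cat.codes)        # one integer per row, indexing into the labels
print(as_cat.cat.codes.dtype)  # a small integer type when there are few labels

# Compare the full (deep) memory footprint of both representations.
print(teams.memory_usage(deep=True), as_cat.memory_usage(deep=True))
```

The saving depends on the ratio of distinct values to rows: a column where nearly every value is unique gains little or nothing from `category`.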


###Efficiency with Integers

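Downcast from `int64` to `int8`.

Integer value ranges by type:
- `int8` can store integers from -128 to 127
- `int16` can store integers from -32,768 to 32,767
- `int32` can store integers from -2,147,483,648 to 2,147,483,647
- `int64` can store integers from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807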
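
The ranges in this list do not need to be memorized. As a small sketch, NumPy's `np.iinfo` reports them for any integer type:

```python
import numpy as np

# Print the smallest and largest value each signed integer type can hold.
for dtype in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(dtype)
    print(f"{info.dtype}: {info.min} to {info.max}")
```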

In [8]:
# Confirm every value fits in int8 (-128 to 127) before downcasting
assert df["age"].between(-128, 127).all()
df["age"] = df["age"].astype("int8")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 5 columns):
 #   Column    Non-Null Count    Dtype   
---  ------    --------------    -----   
 0   position  1000000 non-null  category
 1   age       1000000 non-null  int8    
 2   team      1000000 non-null  category
 3   win       1000000 non-null  object  
 4   prob      1000000 non-null  float64 
dtypes: category(2), float64(1), int8(1), object(1)
memory usage: 18.1+ MB
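
The range check matters because an integer cast that does not fit can wrap around silently instead of raising an error. The sketch below shows one way to guard the cast, plus `pd.to_numeric(..., downcast="integer")` as an alternative that picks the smallest fitting type. The `ages` series is hypothetical stand-in data, and the wrap-around is demonstrated with plain NumPy:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's "age" column.
ages = pd.Series(np.random.randint(1, 50, 1_000), dtype="int64")

# Guard: cast only when every value fits in the target type.
limits = np.iinfo(np.int8)
if limits.min <= ages.min() and ages.max() <= limits.max:
    ages_small = ages.astype("int8")
else:
    ages_small = ages  # leave the column untouched

# Alternative: let pandas choose the smallest integer type that fits.
ages_auto = pd.to_numeric(ages, downcast="integer")
print(ages_small.dtype, ages_auto.dtype)

# What the guard protects against: 300 does not fit in int8 and wraps around.
wrapped = np.array([300], dtype=np.int64).astype(np.int8)
print(wrapped)  # [44], because 300 - 256 = 44
```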


###Efficiency with Floats

In [10]:
df["prob"] = df["prob"].astype("float32")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 5 columns):
 #   Column    Non-Null Count    Dtype   
---  ------    --------------    -----   
 0   position  1000000 non-null  category
 1   age       1000000 non-null  int8    
 2   team      1000000 non-null  category
 3   win       1000000 non-null  object  
 4   prob      1000000 non-null  float32 
dtypes: category(2), float32(1), int8(1), object(1)
memory usage: 14.3+ MB
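
Unlike the integer downcast, going from `float64` to `float32` changes the stored values slightly, because `float32` carries fewer significant digits. A rough sketch of how to measure that loss on stand-in data (`probs` is a hypothetical series, not the notebook's column):

```python
import numpy as np
import pandas as pd

# Approximate number of reliable decimal digits for each float type.
print(np.finfo(np.float32).precision)
print(np.finfo(np.float64).precision)

# Hypothetical stand-in for the "prob" column.
probs = pd.Series(np.random.uniform(0, 1, 1_000))
probs32 = probs.astype("float32")

# Largest absolute difference introduced by the round trip through float32.
max_err = (probs32.astype("float64") - probs).abs().max()
print(max_err)
```

Whether an error of that size is acceptable depends on how the column is used afterwards; for probabilities shown to a few decimals it is usually negligible, while long running sums may accumulate it.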


###Casting to bool (True/False)

In [12]:
# Mapping the values 'yes' and 'no' to True and False
df["win"] = df["win"].map({"yes": True, "no": False})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 5 columns):
 #   Column    Non-Null Count    Dtype   
---  ------    --------------    -----   
 0   position  1000000 non-null  category
 1   age       1000000 non-null  int8    
 2   team      1000000 non-null  category
 3   win       1000000 non-null  bool    
 4   prob      1000000 non-null  float32 
dtypes: bool(1), category(2), float32(1), int8(1)
memory usage: 7.6 MB
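
The individual steps can be collected into one helper so that every freshly loaded frame gets the same treatment. This is a sketch under stated assumptions: `set_dtypes` is a hypothetical name, the `age` cast assumes values fit in `int8`, and the extra check on `win` is an addition, since `map` turns any value missing from the dictionary into `NaN`, which a plain cast to `bool` would then read as `True`.

```python
import numpy as np
import pandas as pd


def get_dataset(size):
    # Same shape of data as the notebook's generator.
    df = pd.DataFrame()
    df["position"] = np.random.choice(["left", "middle", "right"], size)
    df["age"] = np.random.randint(1, 50, size)
    df["team"] = np.random.choice(["red", "blue", "yellow", "green"], size)
    df["win"] = np.random.choice(["yes", "no"], size)
    df["prob"] = np.random.uniform(0, 1, size)
    return df


def set_dtypes(df):
    """Hypothetical helper: return a copy with the smaller dtypes applied."""
    df = df.copy()
    df["position"] = df["position"].astype("category")
    df["team"] = df["team"].astype("category")
    df["age"] = df["age"].astype("int8")      # assumes ages fit in -128..127
    df["prob"] = df["prob"].astype("float32")
    win = df["win"].map({"yes": True, "no": False})
    if win.isna().any():
        # A value other than "yes"/"no" was present; refuse to guess.
        raise ValueError("unexpected value in 'win' column")
    df["win"] = win.astype("bool")
    return df


raw = get_dataset(10_000)
small = set_dtypes(raw)

# Compare full memory footprints in bytes.
print(raw.memory_usage(deep=True).sum())
print(small.memory_usage(deep=True).sum())
print(small.dtypes)
```

Re-running the earlier `%timeit` cell on a frame returned by this helper is a direct way to check whether the groupby ranks get faster as well as smaller.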
