# Pandas Best Practices Demonstration

This Jupyter Notebook, `Pandas_1.ipynb`, showcases some best practices in pandas. Throughout the notebook, we explore various pandas techniques, with the aim of providing a practical guide to writing clean, memory-efficient, and maintainable Python code.

Let's dive in and start exploring Pandas best practices!

* Python can be very slow when you don't use the right tools and data types, especially when handling datasets, because in Python "everything is an object" and every value carries per-object overhead.

* There are ongoing efforts to improve the scalability and speed of pandas operations. One example is Modin: `modin.pandas` implements the pandas API and aims to speed up data loading and the `apply` function.

* PyArrow is the Python API for the Arrow C++ libraries; it exposes their functionality and provides interoperability with pandas and NumPy.
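
To make the "everything is an object" point concrete, here is a minimal sketch comparing the memory footprint of a plain Python list of integers with a NumPy array holding the same values. The exact byte counts depend on the interpreter (the sketch assumes CPython on a 64-bit platform), and the variable names are only illustrative.

```python
import sys

import numpy as np

n = 1_000

# A Python list stores pointers to separate int objects.
py_list = list(range(n))

# A NumPy array stores the raw 8-byte values contiguously.
np_array = np.arange(n, dtype=np.int64)

# Size of the list container plus the size of every int object it points to.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# Size of the array's data buffer only.
array_bytes = np_array.nbytes

print(f"list of Python ints: {list_bytes} bytes")
print(f"NumPy int64 array:   {array_bytes} bytes")
```

The list comes out several times larger because each element is a full Python object, which is the overhead that typed columns avoid.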


In [1]:
import pandas as pd
# import modin.pandas as pd
import numpy as np
import pyarrow as pa

In [2]:
pd.__version__, np.__version__, pa.__version__

('2.2.1', '1.26.4', '15.0.2')

## Loading data

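* Pandas lets you choose the engine that parses the file (`engine`) and the backend that stores the resulting columns (`dtype_backend`). By default, `read_csv` uses the C parser and NumPy-backed dtypes, but we can also use PyArrow for both, which is often faster and more memory efficient.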

#### We will time our code and check the memory usage as we go

In [3]:
%%time
df_np = pd.read_csv('data/GSS.csv')

CPU times: total: 203 ms
Wall time: 205 ms


In [9]:
%%time
# using PyArrow
df_ar = pd.read_csv('data/GSS.csv', dtype_backend='pyarrow', engine='pyarrow')

CPU times: total: 125 ms
Wall time: 38.2 ms


## Why PyArrow?

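- PyArrow enables faster conversion of dataframes between packages such as pandas and Polars (built on the Rust implementation of Arrow), because the data can be handed over in Arrow format as a block instead of being converted value by value.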

- PyArrow's native string type saves memory compared with the default pandas `object` strings.

- PyArrow doesn't cast integer columns with missing values to float columns, as the NumPy backend does.

In [6]:
%%time
gss_np = pd.read_csv('data/GSS.csv', index_col=0)
gss_np.memory_usage(deep=True).sum()

CPU times: total: 297 ms
Wall time: 308 ms


36076324

In [7]:
gss_np.info()

<class 'pandas.core.frame.DataFrame'>
Index: 64814 entries, 0 to 64813
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   YEAR      64814 non-null  int64  
 1   ID        64814 non-null  int64  
 2   AGE       64586 non-null  float64
 3   HRS1      37506 non-null  float64
 4   OCC       64814 non-null  int64  
 5   MAJOR1    64814 non-null  object 
 6   SEX       64814 non-null  object 
 7   RACE      64814 non-null  object 
 8   BORN      64814 non-null  object 
 9   INCOME    64814 non-null  object 
 10  INCOME06  64814 non-null  object 
 11  HONEST    64814 non-null  object 
 12  TICKET    64814 non-null  object 
dtypes: float64(2), int64(3), object(8)
memory usage: 6.9+ MB


In [8]:
%%time
gss = pd.read_csv('data/GSS.csv', index_col=0, dtype_backend='pyarrow', engine='pyarrow')

CPU times: total: 109 ms
Wall time: 125 ms


In [10]:
gss.memory_usage(deep=True).sum()

8611400

In [11]:
gss.info()

<class 'pandas.core.frame.DataFrame'>
Index: 64814 entries, 0 to 64813
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype          
---  ------    --------------  -----          
 0   YEAR      64814 non-null  int64[pyarrow] 
 1   ID        64814 non-null  int64[pyarrow] 
 2   AGE       64586 non-null  double[pyarrow]
 3   HRS1      37506 non-null  double[pyarrow]
 4   OCC       64814 non-null  int64[pyarrow] 
 5   MAJOR1    64814 non-null  string[pyarrow]
 6   SEX       64814 non-null  string[pyarrow]
 7   RACE      64814 non-null  string[pyarrow]
 8   BORN      64814 non-null  string[pyarrow]
 9   INCOME    64814 non-null  string[pyarrow]
 10  INCOME06  64814 non-null  string[pyarrow]
 11  HONEST    64814 non-null  string[pyarrow]
 12  TICKET    64814 non-null  string[pyarrow]
dtypes: double[pyarrow](2), int64[pyarrow](3), string[pyarrow](8)
memory usage: 8.2 MB


In [12]:
# NumPy integer types: np.iinfo reports the limits (min/max) of each one
import numpy as np
np.iinfo(np.int8), np.iinfo(np.int16), np.iinfo(np.int32), np.iinfo(np.int64)

(iinfo(min=-128, max=127, dtype=int8),
 iinfo(min=-32768, max=32767, dtype=int16),
 iinfo(min=-2147483648, max=2147483647, dtype=int32),
 iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64))
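
These limits can be used to pick the narrowest integer type for a column before casting. The helper below, `smallest_int_dtype`, is a hypothetical name introduced here for illustration; it is not part of NumPy or pandas.

```python
import numpy as np


def smallest_int_dtype(lo: int, hi: int) -> np.dtype:
    """Return the narrowest NumPy integer dtype that can hold every value in [lo, hi]."""
    # Prefer unsigned types when there are no negative values.
    candidates = (
        (np.uint8, np.uint16, np.uint32, np.uint64)
        if lo >= 0
        else (np.int8, np.int16, np.int32, np.int64)
    )
    for candidate in candidates:
        info = np.iinfo(candidate)
        if info.min <= lo and hi <= info.max:
            return np.dtype(candidate)
    raise OverflowError(f"no NumPy integer dtype can hold [{lo}, {hi}]")


print(smallest_int_dtype(1972, 2018))  # uint16
print(smallest_int_dtype(-5, 100))     # int8
```

For a real column, `lo` and `hi` would come from the column's own minimum and maximum.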

#### Pandas Practice

- Chaining:
it makes the code more readable, like a set of steps or a recipe, with one operation per line.
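
As a small sketch of the idea, the chain below runs on a made-up four-row frame rather than the GSS data. The column names mirror the dataset, but the values and the `DECADE` column are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from GSS.csv.
df = pd.DataFrame(
    {
        "YEAR": [1972, 1972, 2018, 2018],
        "AGE": [23.0, 70.0, None, 37.0],
    }
)

mean_age_by_decade = (
    df
    .dropna(subset=["AGE"])                          # step 1: drop rows without an age
    .assign(DECADE=lambda d: d["YEAR"] // 10 * 10)   # step 2: derive the decade
    .groupby("DECADE")["AGE"]                        # step 3: group ages by decade
    .mean()                                          # step 4: average within each group
)

print(mean_age_by_decade)
```

Each line is one transformation, so a step can be commented out or reordered without touching intermediate variables.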

### Hints on transformation

#### Int types

- With NumPy integer types, pandas will not raise an integer overflow error when casting; instead, the dataframe cells will silently hold inaccurate (wrapped, possibly negative) values.

We will see this shortly.
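
The wrap-around can be reproduced with plain NumPy on two year values. The guard function `checked_cast` is a hypothetical helper written for this sketch, not a NumPy or pandas API.

```python
import numpy as np

years = np.array([1972, 2018], dtype=np.int64)

# astype uses unsafe casting by default: out-of-range values wrap modulo 256.
wrapped = years.astype(np.int8)
print(wrapped)  # [-76 -30]


def checked_cast(values: np.ndarray, dtype) -> np.ndarray:
    """Cast to an integer dtype, raising instead of silently wrapping."""
    info = np.iinfo(dtype)
    if values.min() < info.min or values.max() > info.max:
        raise OverflowError(
            f"values span [{values.min()}, {values.max()}], "
            f"outside the {np.dtype(dtype).name} range [{info.min}, {info.max}]"
        )
    return values.astype(dtype)


print(checked_cast(years, np.uint16))  # [1972 2018]
```

Checking the range before casting gives NumPy-backed data the same fail-loudly behaviour that the PyArrow casts show.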

In [14]:
# this one-liner:
# gss.select_dtypes(int).describe()

# is equivalent to the chained form:
(
    gss
    .select_dtypes(int)
    .describe()
)
# chaining: cast to pyarrow types, then select and describe them
type_map = {'YEAR': 'uint16[pyarrow]', 'ID': 'uint16[pyarrow]', 'OCC': 'uint16[pyarrow]'}
(gss
 .astype(type_map)
 .select_dtypes(['uint16'])
 .describe()
)

Unnamed: 0,YEAR,ID,OCC
count,64814.0,64814.0,64814.0
mean,1994.93918,1151.810211,6418.583284
std,13.465368,828.030233,4618.278478
min,1972.0,1.0,1.0
25%,1984.0,507.0,613.0
50%,1996.0,1029.5,9999.0
75%,2006.0,1570.0,9999.0
max,2018.0,4510.0,9999.0


In [15]:
# chaining
# casting YEAR to uint8 raises an error (1972 does not fit in 0 to 255)
# use 'integer' to see all int-like columns
type_map_2 = {'YEAR': 'uint8[pyarrow]', 'ID': 'uint16[pyarrow]', 'OCC': 'uint16[pyarrow]'}
(gss
 .astype(type_map_2) 
 .select_dtypes(['integer'])  
 .describe()
)

ArrowInvalid: Integer value 1972 not in range: 0 to 255: Error while type casting for column 'YEAR'

In [16]:
# NumPy int8: out-of-range years silently wrap around to negative values
(gss
 .astype({'YEAR': 'int8'})
 .describe()
)

Unnamed: 0,YEAR,ID,AGE,HRS1,OCC
count,64814.0,64814.0,64586.0,37506.0,64814.0
mean,-53.06082,1151.810211,46.099356,41.303711,6418.583284
std,13.465368,828.030233,17.534703,14.171808,4618.278478
min,-76.0,1.0,18.0,0.0,1.0
25%,-64.0,507.0,31.0,37.0,613.0
50%,-52.0,1029.5,44.0,40.0,9999.0
75%,-42.0,1570.0,59.0,48.0,9999.0
max,-30.0,4510.0,89.0,89.0,9999.0


In [17]:
# pyarrow int8: out-of-range years raise ArrowInvalid instead of wrapping
(gss
 .astype({'YEAR': 'int8[pyarrow]'})
 .describe()
)

ArrowInvalid: Integer value 1972 not in range: -128 to 127: Error while type casting for column 'YEAR'

In [18]:
(gss_np
.select_dtypes('float'))

Unnamed: 0,AGE,HRS1
0,23.0,
1,70.0,
2,48.0,
3,27.0,
4,61.0,
...,...,...
64809,37.0,36.0
64810,75.0,36.0
64811,67.0,
64812,72.0,


In [None]:
# cast HRS1 to pyarrow int

In [None]:
# where are the missing values? 
# let's query

(gss
  .query('HRS1.isna()')
)



In [None]:
(gss
  .query('AGE.isna()')
)