# Pandas Best Practices Demonstration

This Jupyter Notebook, `Pandas_1.ipynb`, showcases some best practices in pandas. Throughout the notebook, we will explore pandas techniques that aim to provide a practical guide to writing clean, memory-efficient, and maintainable Python code.

Let's dive in and start exploring Pandas best practices!

* Python can be very slow when you don't use the right tools and data types, especially when you handle large datasets, because in Python "everything is an object".

* There are ongoing efforts to improve the scalability and speed of pandas operations. One example is Modin: `modin.pandas` implements the pandas API and aims to speed up data loading and the `apply` function.

* PyArrow is the Python API for the Arrow C++ libraries; it provides interoperability with pandas and NumPy.
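
To make the first point concrete, the sketch below stores the same 1,000 integers once as Python objects and once as a fixed-width `int64` column, then compares their memory. The Series names are illustrative, and the exact byte counts depend on the Python and pandas versions.

```python
import pandas as pd

# The same 1,000 integers, stored two ways (names are illustrative).
as_objects = pd.Series(range(1000), dtype="object")  # one Python int object per cell
as_int64 = pd.Series(range(1000), dtype="int64")     # one 8-byte slot per cell

# deep=True also counts the Python objects that an object column points to.
object_bytes = as_objects.memory_usage(deep=True)
int64_bytes = as_int64.memory_usage(deep=True)

print(f"object: {object_bytes:,} bytes")
print(f"int64:  {int64_bytes:,} bytes")
```

The object column pays for a pointer plus a full Python object per value, which is why choosing a concrete dtype matters for larger datasets.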


In [None]:
import pandas as pd
# import modin.pandas as pd
import numpy as np
import pyarrow as pa

In [None]:
pd.__version__, np.__version__, pa.__version__

## Loading data

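* pandas lets you choose both the engine that parses the file (`engine`) and the backend that stores the resulting columns (`dtype_backend`). By default, the C parser is used and the columns are NumPy-backed, but we can also use PyArrow, which is often faster and more memory efficient.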

#### We will time our code and check the memory usage as we go

In [None]:
%%time
df_np = pd.read_csv('data/GSS.csv')


In [None]:
%%time
# using the PyArrow dtype backend (the parser engine is still the default)
df_ar = pd.read_csv('data/GSS.csv', dtype_backend='pyarrow')

## Why PyArrow?

- PyArrow enables faster conversion of DataFrames between packages such as pandas and Polars (built on the Rust implementation of Arrow), because the Arrow data can be handed over as a blob instead of being converted value by value.

- PyArrow's native string type saves memory compared with the default pandas string storage.

- Unlike NumPy, PyArrow doesn't cast integer columns that contain missing values to float columns.

In [None]:
%%time
gss_np = pd.read_csv('data/GSS.csv', index_col=0)
gss_np.memory_usage(deep=True)  # append .sum() for the total

In [None]:
gss_np.info()

In [None]:
%%time
gss = pd.read_csv('data/GSS.csv', index_col=0, dtype_backend='pyarrow', engine='pyarrow')

In [None]:
gss.memory_usage(deep=True)  # append .sum() for the total

In [None]:
gss.info()

In [None]:
# NumPy integer types: use np.iinfo to inspect the limits of each one
import numpy as np
np.iinfo(np.int8), np.iinfo(np.int16), np.iinfo(np.int32), np.iinfo(np.int64)

#### Pandas Practice

- Chaining: it makes the code more readable, as a set of steps or a recipe, one line at a time.
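
As a small illustration of the chaining style, the sketch below runs a filter, a derived column, and an aggregation as one chain. The data is hypothetical and only borrows column names in the spirit of the GSS dataset.

```python
import pandas as pd

# Hypothetical survey-like data, used only to illustrate the style.
df = pd.DataFrame(
    {
        "YEAR": [1972, 1972, 2018, 2018],
        "AGE": [23, 70, 43, 31],
        "HRS1": [40, 0, 45, 38],
    }
)

result = (
    df
    .query("HRS1 > 0")                              # keep respondents who worked
    .assign(DECADE=lambda d: d["YEAR"] // 10 * 10)  # derive a new column
    .groupby("DECADE", as_index=False)["HRS1"]
    .mean()                                         # average hours per decade
    .rename(columns={"HRS1": "MEAN_HRS"})
)
print(result)
```

Each line is one step, so a step can be commented out or reordered without touching intermediate variables.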

### Hints on transformation

#### Int types

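- With NumPy integer types, pandas will not raise an integer overflow error when values are cast to a type that is too small; instead, the DataFrame cells will silently hold inaccurate (often negative) values.

We will see this shortly.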
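
The wrap-around is easy to reproduce on a two-element Series. The sketch below also includes `safe_downcast`, a hypothetical helper (not part of pandas) that checks the target range with `np.iinfo` before casting.

```python
import numpy as np
import pandas as pd

years = pd.Series([1972, 2018], dtype="int64")

# int8 holds -128..127, so these values do not fit; astype wraps them silently.
wrapped = years.astype("int8")
print(wrapped.tolist())


def safe_downcast(s: pd.Series, dtype: str) -> pd.Series:
    """Hypothetical helper: cast only if every value fits the target integer dtype."""
    limits = np.iinfo(np.dtype(dtype))
    if s.min() < limits.min or s.max() > limits.max:
        raise OverflowError(
            f"values outside [{limits.min}, {limits.max}] for {dtype}"
        )
    return s.astype(dtype)


# uint16 covers 0..65535, which is wide enough for calendar years.
ok = safe_downcast(years, "uint16")
print(ok.dtype)
```

Calling `safe_downcast(years, "int8")` raises `OverflowError` instead of returning wrapped values.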

In [None]:
# this one-liner
# gss.select_dtypes(int).describe()

# is equivalent to
(
    gss
    .select_dtypes(int)
    .describe()
)
# chaining a cast to PyArrow types
type_map = {'YEAR': 'uint16[pyarrow]', 'ID': 'uint16[pyarrow]', 'OCC': 'uint16[pyarrow]' }
(gss
 .astype(type_map)
 .select_dtypes(['uint16'])
 .describe()
)

In [None]:
# chaining
# use 'integer' to see all int-like columns
type_map_2 = {'YEAR': 'uint8[pyarrow]', 'ID': 'uint16[pyarrow]', 'OCC': 'uint16[pyarrow]' }
(gss
 .astype(type_map_2) 
 .select_dtypes(['integer'])  
 .describe()
)

In [None]:
# NumPy int8 (range -128 to 127), which is too small for YEAR
(gss
 .astype({'YEAR': 'int8'})
 .describe()
)

In [None]:
# PyArrow int8 (same range), for comparison
(gss
 .astype({'YEAR': 'int8[pyarrow]'})
 .describe()
)

In [None]:
(gss_np
.select_dtypes('float'))

In [None]:
# where are the missing values? 
# let's query

(gss
  .query('HRS1.isna()')
)



In [None]:
(gss
  .query('AGE.isna()')
)
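
Before querying a single column, it can help to count the missing values in every column at once. The sketch below uses a small hypothetical frame rather than the GSS data; passing `engine="python"` to `query` is an assumption made here so the method call inside the expression does not depend on `numexpr`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in two columns.
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "AGE": [23.0, np.nan, 43.0, 31.0],
        "HRS1": [40.0, np.nan, np.nan, 38.0],
    }
)

# Missing values per column, largest first.
missing_counts = df.isna().sum().sort_values(ascending=False)
print(missing_counts)

# Rows where HRS1 is missing.
missing_hrs = df.query("HRS1.isna()", engine="python")
print(missing_hrs)
```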