# Pandas Best Practices Demonstration

This Jupyter Notebook, `Pandas_1.ipynb`, showcases some best practices in pandas. Throughout the notebook, we explore pandas techniques with the aim of providing a practical guide to writing clean, memory-efficient, and maintainable Python code.

Let's dive in and start exploring Pandas best practices!

## Summary
- Use PyArrow.
- Review the data types after loading the dataframe.
- Look up function docs in place (in the notebook) where possible.
- Check memory usage.
- Chain your transformations.
- Split the transformation maps/dictionaries out of the chain.
- Use `.query()`.
- Define your filtering conditions as variables.

## Introduction

* Python can be very slow when you don't use the right tools and data types, especially when handling datasets, because in Python "everything is an object".

* There are continuous efforts to increase the scalability and speed of pandas operations. One example is Modin: `modin.pandas` implements the pandas API to speed up data loading and the `apply` function.

* PyArrow is now supported as a backend to speed up in-memory operations.
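To make the "everything is an object" point concrete, here is a minimal sketch that stores the same integers as Python objects and as a NumPy `int64` column and compares their memory footprint. The data is made up for illustration, and the exact byte counts depend on your platform and pandas version.

```python
import pandas as pd

values = list(range(100_000))

# The same numbers, stored two ways
as_objects = pd.Series(values, dtype="object")  # each cell points to a Python int object
as_int64 = pd.Series(values, dtype="int64")     # one contiguous buffer of 8-byte integers

# deep=True also counts the Python objects behind an object column
object_bytes = as_objects.memory_usage(deep=True)
int64_bytes = as_int64.memory_usage(deep=True)

print(f"object: {object_bytes:,} bytes")
print(f"int64:  {int64_bytes:,} bytes")
```

The object column should come out several times larger, which is why reviewing dtypes after loading is worth the effort.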

### What is PyArrow 

* PyArrow is a Python library that provides a Python API for the functionality of the Arrow C++ libraries.

### What is Arrow C++

* Apache Arrow is a development platform for in-memory analytics
* Arrow contains columnar vector and table-like containers supporting flat or nested types

In [None]:
%load_ext memory_profiler

In [None]:
import pandas as pd
# import modin.pandas as pd
import numpy as np
import pyarrow as pa

In [None]:
pd.__version__, np.__version__, pa.__version__

## DataFrame methods used
- `memory_usage` : Returns the memory usage of each column in bytes 
- `info` :  Prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.
- `describe`: Generates descriptive statistics.
- `select_dtypes` : Returns a subset of the DataFrame’s columns based on the column dtypes.
- `query`: Query the columns of a DataFrame with a boolean expression.
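Before applying these methods to the real data, here is a quick sketch of each one on a tiny, hypothetical DataFrame (not the GSS dataset used below).

```python
import pandas as pd

# Hypothetical toy data for illustration only
df = pd.DataFrame({
    "age": [23, 35, 61, 19],
    "hours": [40.0, 37.5, 20.0, 15.0],
    "city": ["Cairo", "Lyon", "Oslo", "Lima"],
})

bytes_per_column = df.memory_usage(deep=True)  # Series indexed by column name, plus 'Index'
numeric_only = df.select_dtypes("number")      # keeps only 'age' and 'hours'
summary = df.describe()                        # count, mean, std, min, quartiles, max
adults_part_time = df.query("age >= 21 and hours < 40")

df.info()  # prints dtypes, non-null counts, and memory usage
```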

## Loading data

* pandas lets you choose both the parser engine (`engine`) and the data type backend (`dtype_backend`) when loading data. By default, `read_csv` uses its C parser and NumPy-backed dtypes, but we can also use PyArrow for both, which is often faster and more memory-efficient.

#### We will time our code and check the memory usage as we go

In [None]:
%%time
df_np = pd.read_csv('data/GSS.csv')

In [None]:
%%time
# using PyArrow
df_ar = pd.read_csv('data/GSS.csv', dtype_backend='pyarrow', engine='pyarrow')

In [None]:
%%memit
gss_np = pd.read_csv('data/GSS.csv')

In [None]:
%%memit
# using PyArrow
df_ar = pd.read_csv('data/GSS.csv', dtype_backend='pyarrow', engine='pyarrow')

## Why PyArrow?

- PyArrow enables faster conversion of dataframes between packages such as pandas and Polars (which is built on a Rust implementation of Arrow), because the data can be handed over as Arrow memory.

- PyArrow's native string type saves memory compared with the default pandas string storage.

- Unlike NumPy, PyArrow does not cast integer columns with missing values to float columns.
- The pandas 2.1 release notes announced a plan for PyArrow to become a required dependency with pandas 3.0 ([docs](https://pandas.pydata.org/docs/whatsnew/v2.1.0.html#pyarrow-will-become-a-required-dependency-with-pandas-3-0)).

In [None]:
%%time
gss_np = pd.read_csv('data/GSS.csv', index_col=0)
gss_np.memory_usage(deep=True).sum() # ~35 MB

In [None]:
gss_np.info()

- To select columns by data type, use `select_dtypes` and pass one of the following:
    - a specific type like `int8`
    - a generic type like `integer`
    - or `np.number`
- For more details, check the docs with the command `df.select_dtypes?`
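The three kinds of selector can be compared side by side. This sketch uses a hypothetical NumPy-backed DataFrame with made-up column names.

```python
import numpy as np
import pandas as pd

# Hypothetical columns with different NumPy dtypes
df = pd.DataFrame({
    "small": np.array([1, 2, 3], dtype="int8"),
    "big": np.array([10, 20, 30], dtype="int64"),
    "ratio": [0.1, 0.2, 0.3],
    "label": ["a", "b", "c"],
})

only_int8 = df.select_dtypes("int8").columns.tolist()        # specific type
any_integer = df.select_dtypes("integer").columns.tolist()   # generic integer types
any_number = df.select_dtypes(np.number).columns.tolist()    # integers and floats

print(only_int8, any_integer, any_number)
```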

In [None]:
gss_np.select_dtypes?

In [None]:
%%time
gss = pd.read_csv('data/GSS.csv', index_col=0, dtype_backend='pyarrow', engine='pyarrow')

In [None]:
gss.info()

In [None]:
# NumPy has several fixed-width integer types
# use np.iinfo to get the limits (min and max) of each one
import numpy as np
np.iinfo(np.int8), np.iinfo(np.int16), np.iinfo(np.int32), np.iinfo(np.int64)

#### Pandas Practice

- Chaining: it makes the code more readable, like a set of steps or a recipe with one operation per line.
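As a minimal sketch of the idea, the chain below filters, derives a column, sorts, and truncates in one readable expression. The DataFrame and the `hours_per_day` column are hypothetical.

```python
import pandas as pd

# Hypothetical toy data for illustration only
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Chen", "Dina"],
    "age": [34, 19, 52, 41],
    "hours": [40, 20, 45, 38],
})

result = (
    df
    .query("age >= 21")                              # 1. keep adults
    .assign(hours_per_day=lambda d: d["hours"] / 5)  # 2. derive a new column
    .sort_values("age", ascending=False)             # 3. oldest first
    .head(2)                                         # 4. keep the top two rows
)

print(result)
```

Each step returns a new DataFrame, so the original `df` is left unchanged and any step can be commented out while debugging.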

In [None]:

# nlargest and nsmallest vs sort_values
gss['AGE'].nlargest(3)


In [None]:
gss['AGE'].nsmallest(3)

In [None]:
# chaining does not work with inplace=True, because in-place methods return None
# inplace=True is also not recommended: it rarely saves memory and can slow down the code
(gss['AGE']
 .sort_values(ascending=False)
 .head(3)
 )

In [None]:
#%%time
(gss
 .nlargest(3, 'AGE', keep='all') 
 )

In [None]:
#%%time
(gss
 .nsmallest(3, 'AGE', keep='all') 
 )
# keep='all' to show all rows with the same value

### Hints on transformation

#### Int types

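- With NumPy-backed dtypes, pandas does not raise an integer overflow error when you cast to a type that is too small. Instead, the values silently wrap around, so cells can end up holding inaccurate (often negative) values.

We will see this shortly.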

In [None]:
# remember this line
# gss.select_dtypes(int).describe()

# it is equivalent to
(
    gss
    .select_dtypes(int)
    .describe()
)


In [None]:
# chaining casting into pyarrow types
type_map = {'YEAR': 'uint16[pyarrow]', 'ID': 'uint16[pyarrow]', 'OCC': 'uint16[pyarrow]' }
(gss
 .astype(type_map)
 .select_dtypes(['uint16'])
 .describe()
)

In [None]:
# chaining
# casting YEAR to uint8 raises an error, so uint16 is used
# use 'integer' to see all int-like columns
type_map_2 = {'YEAR': 'uint16[pyarrow]', 'ID': 'uint16[pyarrow]', 'OCC': 'uint16[pyarrow]' }
(gss
 .astype(type_map_2) 
 .select_dtypes(['integer'])  
 .describe()
)

In [None]:
# numpy
# this contains inaccuracy in the data
(gss
 .astype({'YEAR': 'int8'})
 .describe()
)

In [None]:
# pyarrow
(gss
 .astype({'YEAR': 'int8[pyarrow]'})
 .describe()
)

In [None]:
(gss_np
.select_dtypes('float')
)

In [None]:
# cast HRS1 and AGE to pyarrow int8
casting_types = {'HRS1': 'int8[pyarrow]','AGE': 'int8[pyarrow]'}
(gss
 .astype(casting_types)
 .select_dtypes('integer')
 .describe()
)

In [None]:
casting_types = {'HRS1': 'int8[pyarrow]','AGE': 'int8[pyarrow]'}
(gss
 .astype(casting_types)
 .memory_usage(deep=True)
 .sum()
)

#### Finding values and NAs

`.query('expression')` is more readable and easier to chain.

In [None]:
(gss
  .query('AGE < 20')
)

In [None]:
# where are the missing values? 
# let's query

(gss
  .query('HRS1.isna()')
)


In [None]:
(gss
  .query('AGE.isna()')
)

In [None]:
# let's see the missing values using a different method
gss[gss['HRS1'].isna()]
#gss[gss['AGE'].isna()]

In [None]:
# if you prefer [] over query,
# follow Python best practice and store the condition in a named variable

NA_HR_filter = gss['HRS1'].isna()

gss[NA_HR_filter]

In [None]:
# for adding more than one condition 
NA_HR_filter = gss['HRS1'].isna()
NA_AGE_filter = gss['AGE'].isna()

NA_AGE_HRS_filter = NA_HR_filter & NA_AGE_filter

gss[NA_AGE_HRS_filter]
# should be equivalent to (gss
#  .query('AGE.isna() and HRS1.isna()')
#)
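The equivalence between named boolean masks and `.query()` can be checked on a small example. The sketch below uses a hypothetical DataFrame that reuses the `AGE` and `HRS1` column names. Passing `engine='python'` is an assumption made here so the method calls inside the query string are evaluated the same way regardless of whether `numexpr` is installed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: only the row at index 1 is missing both values
df = pd.DataFrame({
    "AGE": [25, np.nan, 40, np.nan],
    "HRS1": [40, np.nan, np.nan, 35],
})

# Named conditions, combined with & (element-wise AND)
na_age_filter = df["AGE"].isna()
na_hrs_filter = df["HRS1"].isna()
na_age_hrs_filter = na_age_filter & na_hrs_filter

by_mask = df[na_age_hrs_filter]
by_query = df.query("AGE.isna() and HRS1.isna()", engine="python")

print(by_mask.equals(by_query))
```

Named masks are handy when the same condition is reused several times, while `.query()` reads more naturally inside a chain.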