<a href="https://colab.research.google.com/github/kla55/Pandas_2_exploration/blob/main/Pandas_2_0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installations

In [None]:
# Pip install pandas 2.0.3
!pip install pandas==2.0.3



# Imports

In [None]:
import pandas as pd
import numpy as np

# Null Types

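This section compares how pandas handles missing values with and without the PyArrow backend. With the default NumPy backend, an integer column is implicitly converted to float as soon as a missing value appears, whether it is present in the original data or introduced during a calculation. That is because NumPy integers cannot represent a missing value, so pandas falls back to np.nan, which is a floating-point value. Object columns can hold None or np.nan, and date-related types use pd.NaT. With a PyArrow-backed dtype such as int64[pyarrow], the column keeps its integer type and the missing value is shown as <NA>.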

In [None]:
pd.Series([1, 2, 3, None])

0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64

In [None]:
df2 = pd.DataFrame({'a':[1,2,3, None]}, dtype='int64[pyarrow]')
print(df2.dtypes)
print(df2)

a    int64[pyarrow]
dtype: object
      a
0     1
1     2
2     3
3  <NA>
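As a minimal sketch of the sentinel values described above, the snippet below builds one small Series per dtype family and checks which missing-value marker pandas uses for each. The variable names are illustrative, and the object Series is created with an explicit dtype=object so the result does not depend on the string inference of the installed pandas version.

```python
import numpy as np
import pandas as pd

# Integers plus a missing value: the NumPy backend falls back to float64 and np.nan
floats = pd.Series([1, 2, 3, None])

# Object column (explicit dtype, an assumption of this sketch): None is kept as-is
objects = pd.Series(["a", None], dtype=object)

# Datetime column: the missing value becomes pd.NaT
dates = pd.Series([pd.Timestamp("2023-01-01"), None])

print(floats.dtype, repr(floats.iloc[3]))
print(objects.dtype, repr(objects.iloc[1]))
print(dates.dtype, repr(dates.iloc[1]))

# isna() treats all three sentinels as missing
print(floats.isna().sum(), objects.isna().sum(), dates.isna().sum())
```

Whatever the sentinel, isna() is the portable way to test for missing values, since np.nan does not compare equal to itself.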


# String type

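With the default NumPy backend, a column of strings in pandas is stored as an array of PyObject pointers, with the actual string data scattered across the heap. This increases memory consumption and makes it hard to predict, and the problem grows with the size of the data. The cells below load the same CSV with the NumPy backend and with the PyArrow backend, then compare memory usage and the speed of a string operation.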

In [None]:
df = pd.read_csv('/content/winemag-data_first150k.csv')

In [None]:
df.dtypes

Unnamed: 0       int64
country         object
description     object
designation     object
points           int64
price          float64
province        object
region_1        object
region_2        object
variety         object
winery          object
dtype: object

In [None]:
df.memory_usage(deep=True).sum()

119855745

In [None]:
df_arrow = pd.read_csv('/content/winemag-data_first150k.csv', dtype_backend="pyarrow", engine="pyarrow")


In [None]:
df_arrow.dtypes

                int64[pyarrow]
country        string[pyarrow]
description    string[pyarrow]
designation    string[pyarrow]
points          int64[pyarrow]
price          double[pyarrow]
province       string[pyarrow]
region_1       string[pyarrow]
region_2       string[pyarrow]
variety        string[pyarrow]
winery         string[pyarrow]
dtype: object

In [None]:
df_arrow.memory_usage(deep=True).sum()

54651346

In [None]:
df_arrow.memory_usage(deep=True).sum()/df.memory_usage(deep=True).sum()

0.455976023510596

In [None]:
%time df.country.str.startswith('France').sum()

CPU times: user 64.6 ms, sys: 888 µs, total: 65.5 ms
Wall time: 121 ms


21098

In [None]:
%time df_arrow.country.str.startswith('France').sum()

CPU times: user 2.12 ms, sys: 719 µs, total: 2.84 ms
Wall time: 5.26 ms


21098
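To try the memory comparison without the wine CSV, here is a small sketch on synthetic data. The helper name string_memory and the sample values are made up for illustration, the Arrow part only runs if pyarrow is installed, and the exact byte counts will vary by platform and library version.

```python
import pandas as pd

try:
    import pyarrow  # noqa: F401  (only needed for the Arrow-backed dtype)
    HAS_PYARROW = True
except ImportError:
    HAS_PYARROW = False


def string_memory(values, dtype):
    """Deep memory usage in bytes of `values` stored as a Series of `dtype`."""
    series = pd.Series(values, dtype=dtype)
    return int(series.memory_usage(index=False, deep=True))


# Hypothetical sample: a few country names repeated many times
sample = ["US", "France", "Spain", "Italy"] * 25_000

# NumPy object column: one pointer per row plus a Python str object each
object_bytes = string_memory(sample, object)
print("object bytes:", object_bytes)

if HAS_PYARROW:
    # Arrow-backed column: offsets plus a contiguous character buffer
    arrow_bytes = string_memory(sample, "string[pyarrow]")
    print("arrow bytes: ", arrow_bytes)
    print("arrow / object =", arrow_bytes / object_bytes)
```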

# Copy on Write

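With Copy-on-Write enabled, DataFrame and Series methods no longer create a copy of the underlying data until it is actually needed, that is, until one of the objects sharing the data is modified. The first cell below shows the default behaviour, where modifying subset also changes df. The second cell enables Copy-on-Write, so df is left untouched.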

In [None]:
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
subset = df["foo"]
subset.iloc[0] = 100
df

   foo  bar
0  100    4
1    2    5
2    3    6


In [None]:
pd.options.mode.copy_on_write = True
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
subset = df["foo"]
subset.iloc[0] = 100
df

   foo  bar
0    1    4
1    2    5
2    3    6


In [None]:
subset

0    100
1      2
2      3
Name: foo, dtype: int64
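The "until needed" part can be made visible with a short sketch. It assumes a pandas version that has the mode.copy_on_write option, and it uses np.shares_memory as one possible way to check whether two objects still point at the same buffer; it is an illustration, not a description of pandas internals.

```python
import numpy as np
import pandas as pd

with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
    subset = df["foo"]

    # Before any write, subset still points at df's data (no copy made yet)
    shared_before = np.shares_memory(subset.to_numpy(), df["foo"].to_numpy())

    # The first write triggers the copy, so df is left untouched
    subset.iloc[0] = 100
    shared_after = np.shares_memory(subset.to_numpy(), df["foo"].to_numpy())

print("shares memory before write:", shared_before)
print("shares memory after write: ", shared_after)
print(df["foo"].tolist(), subset.tolist())
```

Using option_context instead of setting pd.options globally keeps the experiment from affecting the rest of the notebook.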

# More NumPy dtypes for indices

You can now choose lower-memory NumPy dtypes for an index. For example, an index can use 32-bit integers, which halves its memory compared with 64-bit integers, previously the only option. The cells below show the default int64 index and then an index that uses 8-bit integers.

In [None]:
pd.Index([1, 2, 3])

Index([1, 2, 3], dtype='int64')

In [None]:
pd.Index([1, 2, 3], dtype=np.int8)

Index([1, 2, 3], dtype='int8')
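The memory saving can be checked directly with the nbytes attribute. The following is a small sketch that assumes pandas 2.0 or later (earlier versions cast integer indexes to 64-bit); the index size is arbitrary.

```python
import numpy as np
import pandas as pd

n = 1_000_000  # illustrative size

idx64 = pd.Index(np.arange(n), dtype="int64")
idx32 = pd.Index(np.arange(n), dtype="int32")

print(idx64.dtype, idx64.nbytes)
print(idx32.dtype, idx32.nbytes)
print("int32 / int64 =", idx32.nbytes / idx64.nbytes)
```

The values have to fit into the chosen dtype: int8, for instance, only covers -128 to 127, so it suits short indexes like the three-element one above.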

# Pandas vs Polars comparison

In [None]:
!pip install polars



In [None]:
import polars as pl
import time

In [None]:
df_arrow.head()

Unnamed: 0,Unnamed: 1,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude


In [None]:
# Pandas read
s = time.time()
df_arrow = pd.read_csv('/content/winemag-data_first150k.csv', dtype_backend="pyarrow", engine="pyarrow")
e = time.time()
pa_time = e - s
print("PyArrow Time = {}".format(pa_time))

# Polars read
s = time.time()
df_pl = pl.read_csv('/content/winemag-data_first150k.csv')
e = time.time()
pl_time = e - s
print("PyPolars Time = {}".format(pl_time))

PyArrow Time = 0.1571650505065918
PyPolars Time = 0.19959807395935059


In [None]:
# Pandas filter and select
s = time.time()
df_arrow[df_arrow['country'] == "US"]['price'].mean()
e = time.time()
pa_time = e - s
print("PyArrow Time = {}".format(pa_time))

# Polars filter and select
s = time.time()
df_pl.filter(pl.col("country") == "US").select(pl.col('price').mean())
e = time.time()
pl_time = e - s
print("Polars Time = {}".format(pl_time))



PyArrow Time = 0.1018519401550293
Polars Time = 0.06182575225830078


In [None]:
# Pandas Groupby Functions
s = time.time()
Function_1 = df_arrow.groupby(['country'])['points'].agg('count')   # Function 1
Function_2 = df_arrow.groupby(['country'])['points'].agg('mean')    # Function 2
e = time.time()
pa_time = e - s
print("PyArrow Time = {}".format(pa_time))


# Polars Groupby Functions
# Recent Polars releases spell this group_by; older ones use groupby
s = time.time()
Function_1 = df_pl.group_by('country').agg(pl.col('points').count())  # Function 1
Function_2 = df_pl.group_by('country').agg(pl.col('points').mean())   # Function 2
e = time.time()
pl_time = e - s
print("Polars Time = {}".format(pl_time))


PyArrow Time = 0.041661739349365234
Polars Time = 0.04564213752746582


In [None]:
cols = ['country', 'points']  # columns to be used for sorting

# Sorting in Pandas
s = time.time()
df_arrow.sort_values(by=cols, ascending=True)
e = time.time()
pa_time = e - s
print("PyArrow Time = {}".format(pa_time))

# Sorting in Polars
s = time.time()
df_pl.sort(cols, descending=False)
e = time.time()
pl_time = e - s
print("Polars Time = {}".format(pl_time))

PyArrow Time = 0.12908339500427246
Polars Time = 0.11353874206542969
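The timings above come from a single run each, so differences of a few hundredths of a second may be noise. One way to make such comparisons steadier is to repeat each operation and keep the fastest run. The helper below is a hypothetical sketch (best_of is not part of pandas or Polars), and the workload is a stand-in so the snippet runs on its own.

```python
import time


def best_of(func, repeats=5):
    """Run func several times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload; in this notebook it could be, for example,
# lambda: df_arrow.sort_values(by=cols, ascending=True)
elapsed = best_of(lambda: sorted(range(100_000), reverse=True))
print("best of 5 = {:.6f} s".format(elapsed))
```

time.perf_counter is used here instead of time.time because it is a monotonic, high-resolution clock intended for measuring short durations.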
