# Pandas vs Polars: Performance and Memory Comparison

[Link to the MonShotData post](https://www.monshotdata.com/p/pandas-vs-polars)

In [None]:
!pip install polars

In [1]:
import polars as pl
import pandas as pd

Download the dataset here: [dataset.csv](https://drive.google.com/file/d/18dMgeRtL2SrXFY1At8dsNzURPcozN4lb/view?usp=sharing)

## Reading the CSV

In [2]:
%timeit pd.read_csv("dataset.csv")

1.34 s ± 7.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [3]:
%timeit pl.read_csv("dataset.csv")

57 ms ± 270 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
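
The `%timeit` magic used throughout is only available in IPython/Jupyter. As a rough sketch of how similar numbers could be collected from a plain script, the standard-library `timeit` module can be wrapped as below. The helper name `time_call` and its defaults are illustrative assumptions, not part of the original notebook, and the workload is a stand-in for a call such as `pl.read_csv("dataset.csv")`.

```python
import statistics
import timeit


def time_call(func, repeat=7, number=1):
    """Return (mean, std dev) of per-call seconds, in the spirit of %timeit."""
    # timeit.repeat returns one total (in seconds) per run of `number` calls.
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_call = [total / number for total in totals]
    return statistics.mean(per_call), statistics.pstdev(per_call)


# Stand-in workload; in the notebook this would be something like
# lambda: pl.read_csv("dataset.csv")
mean_s, std_s = time_call(lambda: sum(range(100_000)), repeat=7, number=10)
print(f"{mean_s * 1e3:.3f} ms ± {std_s * 1e3:.3f} ms per loop")
```

Here `repeat` plays the role of the "7 runs" and `number` the "loops each" in the outputs shown in this notebook.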


In [4]:
df_pd = pd.read_csv("dataset.csv")
df_pl = pl.read_csv("dataset.csv")

## Writing to CSV

In [5]:
%timeit df_pd.to_csv("dataset_dummy_pandas.csv")

5.01 s ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [6]:
%timeit df_pl.write_csv("dataset_dummy_polars.csv")

167 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Memory usage

In [7]:
df_pl.estimated_size()  # in bytes

437379072

In [8]:
df_pd.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4096000 entries, 0 to 4095999
Data columns (total 9 columns):
 #   Column              Dtype  
---  ------              -----  
 0   Name                object 
 1   Company_Name        object 
 2   Employee_Job_Title  object 
 3   Employee_City       object 
 4   Employee_Country    object 
 5   Employee_Salary     int64  
 6   Employment_Status   object 
 7   Employee_Rating     float64
 8   Credits             int64  
dtypes: float64(1), int64(2), object(6)
memory usage: 1.5 GB
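
The two figures above are reported in different units (bytes for Polars, a rounded "GB" for pandas), so a small conversion helps compare them. The sketch below uses only the two numbers printed in the cells above; the helper `to_gb` is hypothetical, and the 1024-based conversion is an assumption about how pandas labels its "GB" figure.

```python
POLARS_BYTES = 437_379_072  # df_pl.estimated_size() from the cell above
PANDAS_GB = 1.5             # rounded value printed by df_pd.info(memory_usage="deep")


def to_gb(n_bytes, base=1024):
    """Convert bytes to gigabytes; base=1024 is an assumption (use 1000 for decimal GB)."""
    return n_bytes / base**3


polars_gb = to_gb(POLARS_BYTES)
ratio = PANDAS_GB / polars_gb

print(f"Polars: {polars_gb:.2f} GB")
print(f"pandas / Polars: about {ratio:.1f}x")
```

Because the pandas figure is rounded to one decimal place, the ratio is only approximate: on this dataset the pandas DataFrame takes somewhere between three and four times the memory of the Polars one.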


## Selecting columns

In [9]:
%timeit df_pd[["Name", "Employee_Rating"]]

11 ms ± 207 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [10]:
%timeit df_pl[["Name", "Employee_Rating"]]

3.41 μs ± 44.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


## Filtering

In [11]:
%timeit df_pd[df_pd.Credits > 2]

57 ms ± 130 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [12]:
%timeit df_pl.filter(pl.col('Credits') > 2)

5.01 ms ± 386 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## Grouping

In [13]:
%timeit df_pd.groupby("Company_Name").Employee_Salary.mean().reset_index()

102 ms ± 718 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [14]:
%timeit df_pl.group_by("Company_Name").agg(pl.col("Employee_Salary").mean())

31.1 ms ± 570 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Sorting

In [15]:
%timeit df_pd.sort_values("Employee_Salary")

348 ms ± 1.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [16]:
%timeit df_pl.sort("Employee_Salary")

83.2 ms ± 2.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
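
To put the timings side by side, the sketch below computes the pandas-to-Polars speedup for each operation. The values are the `%timeit` means copied from the outputs in this notebook, converted to seconds; the names `TIMINGS` and `speedups` are illustrative and do not appear in the original.

```python
# (pandas seconds, Polars seconds), copied from the %timeit outputs above.
TIMINGS = {
    "read CSV": (1.34, 57e-3),
    "write CSV": (5.01, 167e-3),
    "select columns": (11e-3, 3.41e-6),
    "filter": (57e-3, 5.01e-3),
    "group by": (102e-3, 31.1e-3),
    "sort": (348e-3, 83.2e-3),
}


def speedups(timings):
    """Map each operation to pandas_time / polars_time."""
    return {op: pandas_s / polars_s for op, (pandas_s, polars_s) in timings.items()}


for op, factor in speedups(TIMINGS).items():
    print(f"{op:<15} {factor:>8.1f}x")
```

These ratios describe this dataset (4,096,000 rows, 9 columns) on the machine used for the notebook; other data shapes and hardware may give different factors.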
