# Part 1: data wrangling

Lino Galiana  
2025-03-19

Data scientists are often associated with building
artificial intelligence models, but training and using
these models is not necessarily what
their daily work looks like.

In practice,
gathering heterogeneous data sources, then structuring
and harmonizing them for exploratory analysis
prior to modeling or visualization,
represents a significant part of data scientists’ work.
In many environments, this is even the essence of the role.
Developing relevant models requires careful thinking about the data,
an essential step that should not be overlooked.

This course,
like many introductory resources on
data science (Wickham, Çetinkaya-Rundel, and Grolemund 2023; VanderPlas 2016; McKinney 2012),
therefore devotes a lot of content to data manipulation, an
essential skill for data scientists.

Programming tools
organized around tabular datasets
have become the data scientist's main tools.
Being able to apply a set of standard operations
to a dataset, whatever its nature,
makes programmers more efficient than
repeating these operations manually, as in Excel.

All the dominant programming languages in the data science ecosystem
are based on the dataframe principle.
It is even a central object in some languages,
notably R.
The logic of [`SQL`](https://fr.wikipedia.org/wiki/Structured_Query_Language),
a language for declaring data operations that has been around for over fifty years,
provides a relevant framework for performing standardized operations
on data (creating new columns, selecting subsets of rows, etc.).

However, the dataframe only recently became established in Python,
thanks to the `Pandas` package created
by [Wes McKinney](https://fr.wikipedia.org/wiki/Wes_McKinney).
The rise of the `Pandas` library (downloaded over 5 million times
per day in 2023) is largely responsible for Python’s success
in the data science ecosystem. In just a few years,
it has completely changed how code is written for data analysis
in Python, a very flexible language.
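
The standard operations mentioned above (creating new columns, selecting subsets of rows) can be sketched with `Pandas` as follows. This is only a minimal illustration: the table, its column names and its values are invented for the example, and the SQL equivalents in the comments are indicative rather than exact translations.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame(
    {
        "city": ["Paris", "Lyon", "Marseille", "Lille"],
        "population": [2_100_000, 520_000, 870_000, 235_000],
        "area_km2": [105.4, 47.9, 240.6, 34.8],
    }
)

# Create a new column (roughly: SELECT *, population / area_km2 AS density)
df["density"] = df["population"] / df["area_km2"]

# Select a subset of rows (roughly: WHERE population > 500000)
large = df[df["population"] > 500_000]

# Select a subset of columns (roughly: SELECT city, density)
result = large[["city", "density"]]
print(result)
```

Each step takes a dataframe and returns a dataframe, which is what makes these operations easy to chain and to reapply to another dataset.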

This part of the course is a general introduction
to the rich ecosystem of
data manipulation with Python.
These chapters cover both data retrieval
and the restructuring and analysis
of that data.

## Summary of this section

`Pandas` has become essential in the `Python` ecosystem for *data science*.
`Pandas` itself is built on top of the `Numpy` package, so understanding `Numpy`
helps in getting comfortable with `Pandas`. `Numpy` is a low-level library
for storing and manipulating data.
`Numpy` is at the heart of the *data science* ecosystem because most libraries, even those
handling unstructured objects,
use objects built from `Numpy`[1].
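
The link between the two libraries can be seen in a short sketch. The dataframe below is invented for the example, and the behavior shown assumes the default `Numpy`-backed column types (not the `Arrow`-backed ones).

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe, invented for illustration only
df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]})

# A column can be exposed as a Numpy array
x = df["x"].to_numpy()
print(type(x))  # <class 'numpy.ndarray'>

# Numpy functions can be applied directly to Pandas columns
df["log_y"] = np.log10(df["y"])

# Numpy reductions work on the extracted array
total = np.sum(x)
```

In day-to-day use, `Pandas` methods are usually enough, but knowing that arrays sit underneath helps explain why vectorized operations are preferred over explicit loops.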

The `Pandas` approach, which provides a unified entry point for manipulating
datasets of very different natures,
has been extended to geographic objects with `Geopandas`.
This allows geographic data to be manipulated as if
it were classic structured data. Geographic data and
cartographic representation are becoming increasingly common with
the rise of open geographic data and geolocated *big data*.

However, structured data imported from flat files
is not the only data source. APIs and *web scraping*
allow for flexible downloading or extraction
of data from web pages or specialized portals. These data, particularly
those obtained through *web scraping*, often require a bit more data
cleaning work, especially with character strings.
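
As a hedged illustration of this kind of string cleaning, here is a minimal sketch using the `.str` accessor of `Pandas`. The raw values are invented to mimic what scraped text can look like, and the regular expression is just one possible rule for this toy case.

```python
import pandas as pd

# Hypothetical scraped values, invented for illustration only
raw = pd.Series(["  Paris\n", "LYON ", "marseille", " Lille (59)  "])

clean = (
    raw.str.strip()  # remove leading and trailing whitespace
    .str.replace(r"\s*\(\d+\)$", "", regex=True)  # drop a trailing "(59)"-style code
    .str.title()  # harmonize capitalization
)
print(clean.tolist())
```

Chaining the string methods keeps each cleaning rule visible and easy to adjust when a new irregularity shows up in the data.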

The `Pandas` ecosystem thus represents a Swiss army knife
for data analysis. This is why this course
will cover it extensively.
Before trying to implement an *ad hoc* solution, it is
often useful to ask the following question: *“Could I do this
with the basic functionalities of `Pandas`?”* Asking this question can
spare you a laborious detour and save a lot of time.

However, `Pandas` is not
suitable for handling large volumes of data.
To process such data, several alternatives are recommended:
`Polars` or `Dask`, which adopt the logic of `Pandas` but
optimize how it runs; `Spark`, if you have suitable infrastructure, generally in
big data environments; or
`DuckDB`, if you are willing to write SQL queries rather than use a high-level library.

## Exercises

This section provides both detailed tutorials
and guided exercises.

You can view them on this site or use one of the
badges at the beginning of each chapter, for example
the ones below, which open
the [Pandas exercises chapter](02b_pandas_TP/):

<div class="badge-container"><a href="https://github.com/linogaliana/python-datascientist-notebooks/blob/main/notebooks/en/manipulation/index.ipynb" target="_blank" rel="noopener"><img src="https://img.shields.io/static/v1?logo=github&label=&message=View%20on%20GitHub&color=181717" alt="View on GitHub"></a>
<a href="https://datalab.sspcloud.fr/launcher/ide/vscode-python?autoLaunch=true&name=«index»&init.personalInit=«https%3A%2F%2Fraw.githubusercontent.com%2Flinogaliana%2Fpython-datascientist%2Fmain%2Fsspcloud%2Finit-vscode.sh»&init.personalInitArgs=«en/manipulation%20index%20correction»" target="_blank" rel="noopener"><img src="https://custom-icon-badges.demolab.com/badge/SSP%20Cloud-Lancer_avec_VSCode-blue?logo=vsc&logoColor=white" alt="Onyxia"></a>
<a href="https://datalab.sspcloud.fr/launcher/ide/jupyter-python?autoLaunch=true&name=«index»&init.personalInit=«https%3A%2F%2Fraw.githubusercontent.com%2Flinogaliana%2Fpython-datascientist%2Fmain%2Fsspcloud%2Finit-jupyter.sh»&init.personalInitArgs=«en/manipulation%20index%20correction»" target="_blank" rel="noopener"><img src="https://img.shields.io/badge/SSP%20Cloud-Lancer_avec_Jupyter-orange?logo=Jupyter&logoColor=orange" alt="Onyxia"></a>
<a href="https://colab.research.google.com/github/linogaliana/python-datascientist-notebooks-colab//en/blob/main//notebooks/en/manipulation/index.ipynb" target="_blank" rel="noopener"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a><br></div>

## Going further

This course does not really address issues of volume or speed of
computation.
`Pandas` can show its limits in this area with large datasets
(several gigabytes).

It is therefore worth looking at:

-   The book [Modern Pandas](https://tomaugspurger.github.io/modern-1-intro.html)
    for additional insights on performance with `Pandas`;
-   The question of
    [sparse objects](https://chrisalbon.com/machine_learning/vectors_matrices_and_arrays/create_a_sparse_matrix/);
-   The *packages* [`Dask`](https://dask.org/) or [`Polars`](https://ssphub.netlify.app/post/polars/) to speed up computations;
-   [`DuckDB`](https://duckdb.org/docs/api/python/overview.html) for very efficient SQL queries;
-   [`PySpark`](https://spark.apache.org/docs/latest/api/python/index.html) for very large datasets.

### References

Here is a selected bibliography of books
that complement the chapters in the “Manipulation” section of this course:

McKinney, Wes. 2012. *Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython*. O’Reilly Media, Inc.

VanderPlas, Jake. 2016. *Python Data Science Handbook: Essential Tools for Working with Data*. O’Reilly Media, Inc.

Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. *R for Data Science*. O’Reilly Media, Inc.

[1] Some libraries are gradually moving away
from `Numpy`, which is not always the best fit for
certain types of data. The `Arrow` framework is becoming
the underlying layer used by more and more data science libraries.
[This blog post](https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i) provides a detailed explanation of this topic.