# Introduction

Lino Galiana  
2025-03-19

# 1. Introduction

<div class="alert alert-danger" role="alert">
<h3 class="alert-heading"><i class="fa-solid fa-triangle-exclamation"></i> Important</h3>

This site gathers all the content of the course
***Python for Data Science***, which I have been teaching
at [ENSAE](https://www.ensae.fr/courses/python-pour-le-data-scientist-pour-leconomiste/) since 2018.
This course was previously taught by [Xavier Dupré](http://www.xavierdupre.fr/app/ensae_teaching_cs/helpsphinx3/td_2a.html).
About 170 students take this course each year. Since 2024, an English version equivalent to the French one has been introduced gradually. It is intended to serve as an introductory course in *data science* for European statistical institutes,
thanks to a [European call for projects](https://cros.ec.europa.eu/dashboard/aiml4os).

This site ([pythonds.linogaliana.fr/](https://pythonds.linogaliana.fr)) is the main entry point for the course. It centralizes
all the content created during the course for practical work or provided additionally for continuing education purposes.
This course is *open source*,
and I welcome suggestions for improvement on [`GitHub`](https://github.com/linogaliana/python-datascientist) or through the comments at the bottom of each page. As `Python` is a living and dynamic language, practices evolve, and this course continuously adapts to the changing *data science* ecosystem while trying to distinguish lasting changes in practice from passing trends.

Additional elements are available in
the [introductory slides](https://slidespython.linogaliana.fr/).
More advanced elements are present in another course dedicated
to deploying *data science* projects
that I teach with Romain Avouac
in the final year of ENSAE ([ensae-reproductibilite.github.io/website](https://ensae-reproductibilite.github.io/website)).

</div>

<div class="alert alert-info" role="alert">
<h3 class="alert-heading"><i class="fa-solid fa-comment"></i> Website Architecture</h3>

This course features
tutorials and complete exercises.
Each page is structured around
a concrete problem and presents a
general approach to solving it.

You can navigate the site architecture via the table of contents
or through the links to previous or subsequent content at the end of each
page. Some sections, notably the one dedicated to modeling,
offer extended examples to illustrate the approach in more detail.

</div>

`Python`, with its recognizable logo,
is a language that has been around for over thirty years
but experienced a renaissance during the 2010s
thanks to the surge of interest in
*data science*.

`Python`, more than any other
programming language, brings together diverse communities: statisticians, developers,
application or IT infrastructure managers,
high school students (`Python` has been part of the French baccalaureate program
for several years), and researchers
in both theoretical and applied fields.

Unlike many programming languages that have a fairly homogeneous community,
`Python` has managed to bring together a wide range of users thanks to a few central principles: the readability
of the language, the ease of using modules,
the simplicity of integrating it with more performant languages
for specific tasks, the vast amount of documentation
available online…
Being the second-best language for any given
task
can thus be a source of success when competitors do not offer
a similarly broad range of advantages.

The success of `Python`, owing to its nature as a
Swiss Army knife language, is inseparable
from the emergence of the *data scientist* profile, a role
able to contribute at every stage of extracting value from data.
Davenport and Patil (2012), in the *Harvard Business Review*,
called it the *“sexiest job of the 21st century”*
and, ten years later, provided a comprehensive overview of the evolving
skills expected of a *data scientist* in the same journal (Davenport and Patil 2022). It is not only *data scientists*
who are expected to use `Python`: within the ecosystem
of data-related jobs (*data scientist*, *data engineer*, *ML engineer*…),
`Python` serves as a common language enabling communication between these
interdependent profiles.

The richness of `Python` allows it to be used in all phases of data processing, from retrieving and structuring data from
various sources to extracting value from it.
Through the lens of *data science*, we will see that `Python` is
an excellent candidate to assist *data scientists* in all
aspects of data work.

This course introduces various tools for connecting
data and theory using `Python`. However, this course
goes beyond a simple introduction to the language and provides
more advanced elements, especially on the recent
changes in working methods brought about by *data science*.

# 2. Why Use `Python` for Data Analysis?

`Python` first became known in the world of *data science* for having
provided, early on, the tools needed to train *machine learning* algorithms on various types of data. Indeed,
the success of [`Scikit Learn`](https://scikit-learn.org/stable/)[1],
[`Tensorflow`](https://www.tensorflow.org/)[2], or more
recently [`PyTorch`](https://pytorch.org/)[3] in the *data science* community has greatly contributed to the adoption of `Python`. However,
reducing `Python` to a few *machine learning* libraries would be limiting, as it is
truly a Swiss Army knife for *data scientists*,
*social scientists*, or economists. The *success story* of `Python`
is not just about having provided *machine learning* libraries at an opportune time: this
language has real advantages for new data practitioners.

The appeal of `Python` is its central role in a
larger ecosystem of powerful, flexible, and *open-source* tools.
Like `R`, it belongs to the class
of languages that can be used daily for a wide variety of tasks.
In many areas explored in this course, `Python` is, by far,
the programming language offering the most complete and accessible ecosystem.

Beyond *machine learning*, which we have already discussed, `Python` is
indispensable when it comes to retrieving data via
APIs or *web scraping*[4], two approaches that we will explore
in the first part of the course. In the fields of tabular data analysis[5],
web content publishing, or graphic production, `Python` offers an ecosystem
increasingly similar to that of `R`, thanks to the growing investment of [`Posit`](https://posit.co/),
the company behind the major `R` libraries for *data science*, in the
`Python` community.
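
To make the idea of retrieving data through an API more concrete, here is a minimal sketch of the step that usually follows a request: turning a JSON response into records that can be filtered. The payload is hard-coded and entirely invented (the field names `station` and `bikes_available` are assumptions, not the format of any real API), and the helper `parse_records` is hypothetical. Only the standard library is used so that the example runs anywhere.

```python
import json

# Hypothetical payload, shaped like what a JSON API might return.
# Field names and values are invented for illustration only.
payload = """
{
  "results": [
    {"station": "A", "bikes_available": 4},
    {"station": "B", "bikes_available": 0},
    {"station": "C", "bikes_available": 11}
  ]
}
"""


def parse_records(raw: str) -> list[dict]:
    """Decode a JSON string and return the list stored under 'results'."""
    return json.loads(raw)["results"]


records = parse_records(payload)

# Keep only the stations that still have at least one bike available
non_empty = [r["station"] for r in records if r["bikes_available"] > 0]
print(non_empty)
```

In practice, the JSON string would come from an HTTP request rather than a literal, and the records would typically be loaded into a tabular structure for analysis.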

Nevertheless, these elements are not meant to feed the
sterile debate of `R` vs `Python`.
These two languages have many more points of convergence than divergence,
making it very simple to transpose good practices from one
language to the other. This is a point that is discussed more extensively
in the advanced course I teach with Romain Avouac in the final year
at ENSAE: [ensae-reproductibilite.github.io/website](https://ensae-reproductibilite.github.io/website).

Ultimately, data scientists and researchers in social sciences or
economics will use `R` or `Python` almost interchangeably, alternating between the two.
This course
will regularly draw analogies with `R` to help
those who are discovering `Python` but are already familiar with `R`
better understand certain points.

# 3. Course Objectives

## 3.1 Introducing the Approach to *Data Science*

This course is aimed at practitioners of *data science*,
understood here in a broad sense as the **combination of techniques from mathematics, statistics, and computer science to produce useful knowledge from data**.
*Data science* is not only a scientific discipline: it also aims to provide a set of tools to meet operational objectives. Learning its main tool, the `Python` language, is therefore also an opportunity to discuss the rigorous scientific approach to adopt when working with data. This course aims to present how to handle a dataset, the problems encountered along the way, the solutions to overcome them, and the implications of these solutions. It is therefore not just a course on a technical tool, detached from scientific issues.

<div class="alert alert-success" role="alert">
<h3 class="alert-heading"><i class="fa-solid fa-lightbulb"></i> Is a Mathematical Background Required for This Course?</h3>

This course assumes a desire to use `Python` intensively for data analysis within a rigorous statistical framework. It only briefly touches on the statistical or algorithmic foundations behind some of the techniques discussed, which are often the subject of dedicated teachings, particularly at ENSAE.

Not knowing these concepts does not prevent understanding the content of this website, as more advanced concepts are generally presented separately in dedicated boxes. The ease of using `Python` avoids the need to program a model oneself, which makes it possible to apply models without being an expert. Knowledge of models will be more necessary for interpreting results.

However, even though it is relatively easy to use complex models with `Python`, it is very useful to have some background on them before embarking on a modeling approach. This is one of the reasons why modeling comes later in this course: in addition to involving advanced statistical concepts, it is necessary to have understood the stylized facts in our data to produce relevant modeling. A thorough understanding of data structure and its alignment with model assumptions is essential for building high-quality models.

</div>

## 3.2 Reproducibility

This course places a central emphasis on the concept of reproducibility. This requirement is reflected in various ways throughout the course, primarily by ensuring that all examples and exercises can be tested using `Jupyter` notebooks[6].

The entire content of the website is reproducible in various computing environments. It is, of course, possible to copy and paste the code snippets present on this site, using the button above the code examples:

[1] Library developed by the French public research laboratories of INRIA since 2007.

[2] Library initially used by Google for its internal needs and made public in 2015. Although less used now, this library had a significant influence in the 2010s by promoting the use of neural networks in research and operational applications.

[3] Library developed by Meta since 2018 and affiliated since 2022 with the [*PyTorch foundation*](https://pytorch.org/foundation).

[4] In these two areas, the most serious competitor to `Python`
is `Javascript`. However, the community around this language is more focused
on web development issues than on *data science*.

[5] Tabular data are structured data, organized,
as their name indicates, in a table format that allows matching
observations with variables. This structuring differs from other types
of more complex data: free texts, images, sounds, videos… In the domain of unstructured data,
`Python` is the hegemonic language for analysis. In the domain of tabular data, `Python`’s competitive advantage is less pronounced, particularly compared to `R`,
but these two languages offer a core set of fairly similar functionalities. We will
regularly draw parallels between these two languages
in the chapters dedicated to the `Pandas` library.

[6] A *notebook* is an interactive environment for writing and running code live. It combines, in a single document, text and executable code whose outputs are displayed once computed. This is extremely convenient for learning the `Python` language. For more details, see the [official Jupyter documentation](https://jupyter.org/documentation).

```python
x = "Try to copy-paste me"
```

However, since this site presents many examples, the back-and-forth between a Python testing environment and this site could be cumbersome. Each chapter is therefore easily retrievable as a `Jupyter` notebook via buttons at the beginning of each page. For example, here are those buttons for the `Numpy` tutorial:

<div class="badge-container"><a href="https://github.com/linogaliana/python-datascientist-notebooks/blob/main/notebooks/en/manipulation/01_numpy.ipynb" target="_blank" rel="noopener"><img src="https://img.shields.io/static/v1?logo=github&label=&message=View%20on%20GitHub&color=181717" alt="View on GitHub"></a>
<a href="https://datalab.sspcloud.fr/launcher/ide/vscode-python?autoLaunch=true&name=«01_numpy»&init.personalInit=«https%3A%2F%2Fraw.githubusercontent.com%2Flinogaliana%2Fpython-datascientist%2Fmain%2Fsspcloud%2Finit-vscode.sh»&init.personalInitArgs=«en/manipulation%2001_numpy%20correction»" target="_blank" rel="noopener"><img src="https://custom-icon-badges.demolab.com/badge/SSP%20Cloud-Lancer_avec_VSCode-blue?logo=vsc&logoColor=white" alt="Onyxia"></a>
<a href="https://datalab.sspcloud.fr/launcher/ide/jupyter-python?autoLaunch=true&name=«01_numpy»&init.personalInit=«https%3A%2F%2Fraw.githubusercontent.com%2Flinogaliana%2Fpython-datascientist%2Fmain%2Fsspcloud%2Finit-jupyter.sh»&init.personalInitArgs=«en/manipulation%2001_numpy%20correction»" target="_blank" rel="noopener"><img src="https://img.shields.io/badge/SSP%20Cloud-Lancer_avec_Jupyter-orange?logo=Jupyter&logoColor=orange" alt="Onyxia"></a>
<a href="https://colab.research.google.com/github/linogaliana/python-datascientist-notebooks-colab//en/blob/main//notebooks/en/manipulation/01_numpy.ipynb" target="_blank" rel="noopener"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a><br></div>

Recommendations regarding
the preferred environments for using
these notebooks are deferred to the next chapter.

The requirement for reproducibility is also evident
in the choice of examples used in this course.
All content on this site relies on open data, whether it is French data (mainly
from the centralizing platform [`data.gouv`](https://www.data.gouv.fr) or the
website of [Insee](https://www.insee.fr)) or American data. Results are therefore reproducible for someone
with an identical environment[1].
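
Since reproducibility depends on having an identical environment, one simple habit is to record which versions a result was produced with. The sketch below is one possible way to do this with the standard library only. The helper `describe_environment` is hypothetical, and the package names passed to it are merely examples; packages that are not installed are reported as `None`.

```python
import platform
from importlib import metadata


def describe_environment(packages: list[str]) -> dict:
    """Return the Python version, the platform, and the installed version of each package."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


env = describe_environment(["numpy", "pandas"])
print(env)
```

Saving this dictionary alongside the results makes it easier to understand later why two runs of the same notebook might differ.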

<div class="alert alert-info" role="alert">
<h3 class="alert-heading"><i class="fa-solid fa-comment"></i> Note</h3>

American researchers have discussed a reproducibility crisis in the field
of *machine learning* (Kapoor and Narayanan 2022). Issues with the scientific
publishing ecosystem and the economic stakes behind academic publications
in the field of *machine learning* are prominent factors that may explain this.

However, academic teaching also bears a responsibility
in this area. Students and researchers are not trained in these topics, and if they
do not adopt this requirement early in their careers, they may not be encouraged to do so later. For this reason, in addition to training in `Python` and *data science*, this course
introduces the use of
the version control software `Git` in a dedicated section.
All student projects must be *open source*, which is one of the best ways
for a teacher to ensure that students produce quality code.

</div>

## 3.3 Assessment

ENSAE students are assessed through
an in-depth project.
Details about the course assessment, as well as a
list of previously completed projects, are available in the
[Assessment](annexes/evaluation) section.

# 4. Course Outline

This course is an introduction to the issues of *data science* through the learning of the `Python` language. As the term *“data science”* suggests, a significant part of this course is dedicated to working with data: retrieval, structuring, exploration, and linking. This is the subject of the first part of the course,
[“Manipulating Data”](../../content/manipulation/index.qmd), which serves as the foundation for the rest of the course. Unfortunately, many programs in *data science*, applied statistics, or social and economic sciences overlook this part of the data scientist’s work, sometimes referred to as [“data wrangling”](https://en.wikipedia.org/wiki/Data_wrangling)
or [*“feature engineering”*](https://en.wikipedia.org/wiki/Feature_engineering). Yet, in addition to taking up a significant portion of the data scientist’s time, it is essential for building a relevant model.

The goal of this part is to illustrate the challenges related to retrieving various types of data sources and their exploitation using `Python`. The examples will be varied to illustrate the richness of the data that can be analyzed with `Python`: municipal $CO_2$ emission statistics, real estate transaction data, energy diagnostics of housing, Vélib station attendance data…

The second part is dedicated to producing visualizations with `Python`. After retrieving and cleaning data, one generally wants to synthesize it through tables, graphics, or maps. This part is a brief introduction to this topic ([“Communicating with `Python`”](../../content/visualisation/index.qmd)). As it is quite introductory, its goal is mainly to provide some concepts that will be consolidated later.

The third part is dedicated to modeling through the example of electoral science ([“Modeling with `Python`”](../../content/modelisation/index.qmd)). The goal of this part is to illustrate the scientific approach of *machine learning* and the related methodological and technical choices, and to introduce issues that will be covered later in the university curriculum.

The fourth part of the course takes a step aside to focus on specific issues related to the exploitation of textual data. This is the chapter on [“Introduction to *Natural Language Processing (NLP)* with `Python`”](../../content/nlp/index.qmd). This research field being particularly active, it is only an introduction to the subject. For further reading, refer to Russell and Norvig (2020), chapter 24.

# References

Davenport, Thomas H, and DJ Patil. 2012. “Data Scientist, the Sexiest Job of the 21st Century.” *Harvard Business Review* 90 (5): 70–76. <https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century>.

———. 2022. “Is Data Scientist Still the Sexiest Job of the 21st Century?” *Harvard Business Review* 90.

Kapoor, Sayash, and Arvind Narayanan. 2022. “Leakage and the Reproducibility Crisis in ML-Based Science.” arXiv. <https://doi.org/10.48550/ARXIV.2207.07048>.

Russell, Stuart J., and Peter Norvig. 2020. *Artificial Intelligence: A Modern Approach (4th Edition)*. Pearson. <http://aima.cs.berkeley.edu/>.

[1] Opening chapters as *notebooks* in standardized environments, as will be proposed starting from the next chapter, ensures that you have a controlled environment. Personal installations of `Python` are likely to have undergone modifications that can alter your environment and cause unexpected and hard-to-understand errors: this is not a recommended use for this course. As you will discover in the next chapter, *cloud* environments offer comfort regarding environment standardization.