# Introduction

Lino Galiana  
2025-10-07

> **Skills you will acquire in this chapter**
>
> -   Understand why Python has become essential in the fields of *data science* and *data engineering*, thanks to its simplicity, readability, and strong community ecosystem  
> -   Recognize how Python can support you through every stage of a data-driven project—from structuring and exploring data to modeling and communicating results  
> -   Appreciate the value of a reproducible, rigorous, and scientific approach to *data science* and *data engineering*, and learn how to support this using tools like Jupyter notebooks, open data, and standardized environments  
> -   Grasp the purpose and teaching goals of the course, as well as its main themes  

# 1. Introduction

This course brings together all the material from the
***Python for Data Science*** class I’ve been teaching at [ENSAE](https://www.ensae.fr/courses/python-pour-le-data-scientist-pour-leconomiste/) since 2020[1]. Each year, around 190 students take this course. In 2024, an English version—equivalent to the French original—was gradually introduced. It is designed as an introductory *data science* course for European statistical institutes, following a [European call for projects](https://cros.ec.europa.eu/dashboard/aiml4os).

The site ([pythonds.linogaliana.fr](https://pythonds.linogaliana.fr)) serves as the main hub for the course. It gathers all course content, including practical assignments and additional materials aimed at continuing education. The course is *open source*, and I welcome suggestions for improvement either on [`GitHub`](https://github.com/linogaliana/python-datascientist) or in the comments section at the bottom of each page.

Because `Python` is a living, fast-evolving language, the course is continuously updated to reflect the changing *data science* ecosystem. At the same time, it strives to differentiate lasting trends from short-lived fads.

You can find more information in the [introductory slides](https://slidespython.linogaliana.fr/). More advanced topics are covered in another course focused on deploying *data science* projects to production, which I co-teach with Romain Avouac in the final year at ENSAE ([ensae-reproductibilite.github.io/website](https://ensae-reproductibilite.github.io/website)).

> **Website Architecture**
>
> This course offers comprehensive tutorials and exercises that can be read directly on the site or edited and run in an interactive `Jupyter Notebook` environment (see the [next chapter](../../content/getting-started/01_environment.qmd) for details).
>
> Each page is built around a concrete problem and introduces a general approach to solving it. All examples are based on *open data* and are fully reproducible.
>
> You can navigate the site using the table of contents or the previous/next links at the bottom of each page. Some sections - such as the one on modeling - include highlighted examples that illustrate the methodology and present different possible approaches to solving the same problem.

# 2. Why `Python`?

`Python` is a language that has been around for over thirty years. But it was in the 2010s that it experienced a major resurgence, driven by the growing popularity of *data science*.

More than any other language, `Python` brings together a wide range of communities: statisticians, application developers, IT infrastructure managers, high school students (it has been part of the French baccalaureate curriculum for several years), and researchers in both theoretical and applied fields.

Unlike many programming languages that cater to relatively homogeneous communities, `Python` has succeeded in uniting diverse users thanks to a few key principles: its readable syntax, the simplicity of using modules, the ease of integration with more powerful languages for specific tasks, and the vast amount of online documentation. Sometimes, being the second-best tool for a task—while offering a broader set of advantages—can be the key to success.

`Python`’s success story is closely tied to the rise of the *data scientist* role, a profile capable of working across the entire data processing pipeline. In the *Harvard Business Review*, Davenport and Patil (2012) famously called it “the sexiest job of the 21st century.” A decade later, the same authors revisited how expectations for *data scientists* had evolved (Davenport and Patil 2022).

But it’s not only *data scientists* who need to use `Python`. In the broader ecosystem of data-related professions (*data scientists*, *data engineers*, *ML engineers*, and more), `Python` serves as a common language, enabling collaboration among interdependent roles.

This course introduces various tools that use `Python` to connect data with theoretical concepts from statistics and the economic and social sciences. However, it goes beyond a simple introduction to the language: it regularly reflects on both the strengths and the limitations of `Python` in meeting operational and scientific needs.

# 3. Why use `Python` for data analysis?

This question is slightly different: if `Python` is already a popular language for learning programming due to its ease of use, how did it also become the dominant language in the *data* and AI ecosystem?

Python first gained traction in the *data science* world for offering tools to train *machine learning* algorithms, even before such approaches became mainstream. Of course, the success of libraries like [`Scikit-Learn`](https://scikit-learn.org/stable/), [`TensorFlow`](https://www.tensorflow.org/), and more recently [`PyTorch`](https://pytorch.org/), played a major role in `Python`’s adoption by the *data science* community[2]. However, reducing `Python` to a handful of *machine learning* libraries would be overly simplistic. It is truly a Swiss Army knife for *data scientists*, social scientists, economists, and data practitioners of all kinds. Its success is not only due to offering the right tools at the right time, but also because the language itself offers real advantages for newcomers to data work.

What makes `Python` appealing is its central role in a broader ecosystem of powerful, flexible, and open-source tools. Like `R`, it belongs to a category of languages suitable for everyday use across a wide variety of tasks. In many of the fields covered in this course, `Python` has by far the richest and most accessible ecosystem. Unlike other popular languages such as `JavaScript` or `Rust`, it has a very gentle learning curve, allowing users to write high-quality code quickly - provided they learn the right habits, which this course (and the companion course on [production workflows](https://ensae-reproductibilite.github.io/)) aims to instill.

Beyond AI projects, `Python` is also indispensable for retrieving data via APIs or through *web scraping*[3], two techniques introduced early in the course. In areas like tabular data analysis[4], web publishing or data visualization, `Python` now offers an ecosystem comparable to that of `R`, thanks in part to growing investment from [`Posit`](https://posit.co/), which has ported many of `R`’s most successful libraries, such as [ggplot](https://ggplot2.tidyverse.org/), to `Python`.

> **Note 3.1: Why discuss AI so little in a `Python` course?**
>
> While a significant portion of this course covers *machine learning* and related algorithms, I tend to resist the current trend - especially strong since the release of `ChatGPT` in late 2022 - of labeling everything as “AI”.
>
> First, because the term is vague, overused, and often exploited for marketing purposes, capitalizing on its symbolic power drawn from science fiction to sell miracle solutions or stoke fear.
>
> Second, because the term “AI” covers a vast range of possible methods, depending on how broadly we define it. The sections on modeling and NLP in this course, which are the closest to the AI field, focus on learning-based methods. But as definitions from Russell and Norvig (2020) or the [European AI Act](https://artificialintelligenceact.eu/fr/article/3/) show, artificial intelligence encompasses much more:
>
> > The study of \[intelligent\] agents that perceive their environment and act upon it. Each such agent is implemented by a function that maps perceptions to actions. We study different ways to define this function, such as production systems, reactive agents, logical planners, neural networks, and decision-theoretic systems.
> >
> > Russell and Norvig (2020)
>
> > “AI system” means a machine-based system designed to operate with varying levels of autonomy and capable of adapting after deployment. It infers, based on its inputs, how to generate outputs—such as predictions, content, recommendations, or decisions—that can influence physical or virtual environments.
> >
> > [European AI Act](https://artificialintelligenceact.eu/fr/article/3/)
>
> For more on this topic, see a presentation I gave in 2024 (in French):
>
> <details>
> <summary><p>Scroll the <em>slides</em> or open in <a
> href="https://linogaliana.github.io/20241015-prez-ia-masa/#/title-slide/">full
> screen</a>.</p></summary>
> <div class="sourceCode">
> <iframe class="sourceCode" src="https://linogaliana.github.io/20241015-prez-ia-masa/#/title-slide/"></iframe>
>
> </div>
> </details>
>
> Finally, there’s also a pedagogical reason. Since 2023, “AI” has largely become synonymous with generative AI. But to understand how this radically different paradigm works - and to implement meaningful, value-driven generative AI projects - one must first understand the foundations and limitations of the *machine learning* approach. Otherwise, there’s a risk of building overly complex solutions for simple problems or misjudging the value of generative models compared to more traditional methods. Since this is an introductory course, I’ve chosen to focus on *machine learning* and introductory NLP, deep enough to be meaningful, while leaving it to the curious to explore generative AI further on their own.

That said, this course does not aim to stir up the sterile debate between `R` and `Python`. The two languages share far more than they differ, and best practices are often transferable between them. This idea is explored more deeply in the advanced course I co-teach with Romain Avouac ([ensae-reproductibilite.github.io/website](https://ensae-reproductibilite.github.io/website)).

In practice, data scientists and researchers in social sciences or economics increasingly use `R` and `Python` interchangeably. This course will frequently draw analogies between the two, to help learners already familiar with `R` transition smoothly to `Python`.

# 4. Why learn `Python` when code-generating AIs exist?

Code assistants like `Copilot` and `ChatGPT` have fundamentally transformed software development. These tools are now part of the everyday toolkit of a *data scientist*, offering remarkable convenience by generating `Python` code from more or less well-specified instructions. Trained on massive amounts of publicly available code, and often fine-tuned for solving development tasks, they can be extremely helpful. The concept of *vibe coding* pushes this even further: large language models (LLMs) are given the initiative to write and run code themselves, with direct access to the computational resources they need and without a human acting as intermediary.

So, if AIs can now generate code, why should we still learn how to code?

Because coding is not just about writing lines of code. It’s about understanding a problem, crafting a step-by-step strategy to tackle it, considering multiple solutions and trade-offs (e.g., speed, simplicity), testing and debugging. Code is a means to an engineering end. While AIs are very good at generating code, relating problems to known patterns, and even translating solutions across languages into `Python`, that’s only part of the picture.

Working with data is first and foremost an engineering process. Code is not the goal—it’s the tool that supports structured reasoning toward solving real-world problems. Just like an engineer designs a bridge to meet a practical need, a data scientist begins with an operational goal—such as building a recommendation system, evaluating the impact of a product launch, or forecasting sales—and reformulates it into an analytical task. This means translating scientific or business ideas into a set of questions, then breaking those down into logical steps, each of which can be executed by a computer.
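To make this decomposition concrete, here is a minimal sketch of how a goal such as forecasting sales might be broken into small, executable steps. The sales figures, the function names, and the choice of a naive moving average are all assumptions made for illustration; they are not the method used later in the course.

```python
from statistics import mean

# Hypothetical monthly sales (invented for illustration); None marks a missing month
raw_sales = [120, 135, None, 150, 160, 158, 170]


def clean(series):
    """Step 1: structure the data by dropping missing observations."""
    return [value for value in series if value is not None]


def forecast_next(series, window=3):
    """Step 2: model the next value as the average of the last `window` points."""
    return mean(series[-window:])


def mean_absolute_error(series, window=3):
    """Step 3: evaluate the model by replaying the forecast on past data."""
    errors = [
        abs(series[i] - forecast_next(series[:i], window))
        for i in range(window, len(series))
    ]
    return mean(errors)


sales = clean(raw_sales)
prediction = forecast_next(sales)
error = mean_absolute_error(sales)
print(f"forecast={prediction:.1f}, backtest MAE={error:.1f}")
```

Each function answers one question (what do we keep, how do we predict, how wrong are we), which is the kind of structure that also makes a request to a code assistant easier to formulate and to verify.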

In this context, an LLM can act as a valuable assistant—but only when the problem is well formulated. If the task is vague or ill-defined, the model’s answers will be approximate or even useless. On standard problems, the results may appear accurate. But for more specific, non-standard tasks, it often becomes necessary to iterate, refine the prompt, reframe the problem… and sometimes still fail to get a satisfactory result. Not because the model is poor, but because good problem formulation—the essence of problem engineering—makes all the difference[5].

Models also lag behind current practice, since they are trained on code written in the past. For instance, [`uv`](https://docs.astral.sh/uv/) saw rapid adoption in 2025, as did [`ruff`](https://docs.astral.sh/ruff/) the year before. It will still be some time before generative AIs suggest `uv` as an environment manager on their own, rather than [`poetry`](https://python-poetry.org/). Generative AIs therefore do not exempt us from keeping an active technical watch and staying alert to changes in practices.

# 5. Course Objectives

## 5.1 Introducing the data science approach

This course is intended for practitioners of *data science*, understood here in the broadest sense as the **combination of techniques from mathematics, statistics, and computer science to extract useful knowledge from data**.

Since *data science* is not only an academic discipline but also a practical field aimed at achieving operational goals, learning its main tool—namely, the `Python` programming language—goes hand in hand with adopting a rigorous, scientific approach to data.

The objective of this course is to explore how to approach a dataset, identify and address common challenges, develop appropriate solutions, and reflect on their broader implications. It is therefore not merely a course about a technical tool, disconnected from scientific reasoning, but one rooted in understanding data through both technical and conceptual lenses.

> **Do I need a math background for this course?**
>
> This course assumes you are interested in using data-intensive `Python` within a rigorous statistical framework. It does not delve deeply into the statistical or algorithmic foundations of the techniques presented - many of which are covered in dedicated courses, particularly at ENSAE.
>
> That said, not being familiar with these concepts should not prevent you from following this course. More advanced ideas are typically introduced separately, in dedicated callout boxes. Thanks to `Python`’s ease of use, you will not need to implement complex models from scratch - making it possible to apply techniques even if you are not an expert in the underlying theory. What *is* important, however, is having enough understanding to correctly interpret the results.
>
> Still, while `Python` makes it relatively easy to run sophisticated models, it is very helpful to have some perspective before diving into modeling. That explains why modeling appears later in the course: in addition to relying on advanced statistical concepts, effective modeling also requires a solid understanding of the data. You need to identify key patterns and assess whether your data fits the assumptions of the model. Without this foundation, it is difficult to build models that are truly meaningful or reliable.

## 5.2 Reproducibility

This course places strong emphasis on reproducibility. This principle is reflected in several ways. First and foremost, by ensuring that all examples and exercises can be run and tested using `Jupyter` *notebooks*[6].

All content on the website is designed to be reproducible across different computing environments. Of course, you’re free to copy and paste code snippets directly from the site using the button available at the top of each code block.

[1] This course was originally taught by [Xavier Dupré](http://www.xavierdupre.fr/app/ensae_teaching_cs/helpsphinx3/td_2a.html).

[2] [`Scikit-Learn`](https://scikit-learn.org/stable/) is a library developed since 2007 by French public research labs at INRIA. Open source from the outset, it is now maintained by [`:probabl.`](https://probabl.ai/), a startup created to manage the `Scikit` ecosystem, bringing together some of the INRIA research teams behind the core of the modern *machine learning* stack.

[`TensorFlow`](https://www.tensorflow.org/) was developed internally at Google and made public in 2015. Although now less widely used - partly due to the rise of `PyTorch` - it played a major role in popularizing neural networks in both research and production during the 2010s.

[`PyTorch`](https://pytorch.org/) was developed by Meta (then Facebook), first released in 2016, and has been governed by the [*PyTorch Foundation*](https://pytorch.org/foundation) since 2022. It is now the most widely used framework to train neural networks.

[3] In the domains of API access and *web scraping*, `JavaScript` is `Python`’s most serious competitor. However, its community is more focused on web development than on *data science*.

[4] Tabular data refers to structured data organized in tables that map observations to variables. This structure contrasts with unstructured data like free text, images, audio, or video. In unstructured data analysis, `Python` dominates. For tabular data, `Python`’s advantage is less clear, especially compared to `R`, but both languages now offer similar capabilities. We will regularly draw parallels between them in chapters on the `Pandas` library.

[5] On this topic, see Thomas Wolf’s blog post [*The Einstein AI model*](https://thomwolf.io/blog/scientific-ai.html). Although the post focuses on disruptive innovation and pays less attention to incremental progress, it’s insightful in understanding that LLMs—despite bold predictions from tech influencers—are still just tools. They may excel at standardized tasks, but for now, they remain assistants.

[6] Jupyter notebooks are interactive documents that allow you to combine code, text, and visualizations in a single file. They’re widely used in data science and education to make code both readable and executable.

```python
x = "Try to copy-paste me"
```

However, since this site includes many examples, constantly switching between a `Python` environment and the website can become tedious. To make things easier, each chapter can be downloaded as a `Jupyter` *notebook* using the buttons provided at the top of each page.

For example, here are the buttons for the first chapter on `Pandas`:

<div class="badge-container"><a href="https://github.com/linogaliana/python-datascientist-notebooks/blob/main/notebooks/en/manipulation/02_pandas_intro.ipynb" target="_blank" rel="noopener"><img src="https://img.shields.io/static/v1?logo=github&label=&message=View%20on%20GitHub&color=181717" alt="View on GitHub"></a>
<a href="https://datalab.sspcloud.fr/launcher/ide/vscode-python?autoLaunch=true&name=«02_pandas_intro»&init.personalInit=«https%3A%2F%2Fraw.githubusercontent.com%2Flinogaliana%2Fpython-datascientist%2Fmain%2Fsspcloud%2Finit-vscode.sh»&init.personalInitArgs=«en/manipulation%2002_pandas_intro%20correction»" target="_blank" rel="noopener"><img src="https://custom-icon-badges.demolab.com/badge/SSP%20Cloud-Lancer_avec_VSCode-blue?logo=vsc&logoColor=white" alt="Onyxia"></a>
<a href="https://datalab.sspcloud.fr/launcher/ide/jupyter-python?autoLaunch=true&name=«02_pandas_intro»&init.personalInit=«https%3A%2F%2Fraw.githubusercontent.com%2Flinogaliana%2Fpython-datascientist%2Fmain%2Fsspcloud%2Finit-jupyter.sh»&init.personalInitArgs=«en/manipulation%2002_pandas_intro%20correction»" target="_blank" rel="noopener"><img src="https://img.shields.io/badge/SSP%20Cloud-Lancer_avec_Jupyter-orange?logo=Jupyter&logoColor=orange" alt="Onyxia"></a>
<a href="https://colab.research.google.com/github/linogaliana/python-datascientist-notebooks-colab//en/blob/main//notebooks/en/manipulation/02_pandas_intro.ipynb" target="_blank" rel="noopener"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a><br></div>

Recommendations on the best environments for using these notebooks are provided in the next chapter.

The focus on reproducibility is also reflected in the choice of examples used throughout the course. All content on this site is based on *open data*, sourced either from French platforms - primarily the centralized portal [`data.gouv`](https://www.data.gouv.fr), which aggregates public datasets from French institutions, and [Insee](https://www.insee.fr), France’s national institute for statistics and economic studies - or from U.S. datasets. This ensures that results are reproducible for anyone working in an identical environment[1].
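Since reproducibility depends on working in an identical environment, one lightweight habit is to record which interpreter and package versions produced a result. The sketch below uses only the standard library; the helper name `describe_environment` and the two package names are assumptions chosen for illustration, and a package that is not installed is simply reported as `None`.

```python
import platform
import sys
from importlib import metadata


def describe_environment(packages):
    """Return the Python version, the platform and the installed versions of `packages`."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


# Example package names; replace them with the libraries your analysis relies on
report = describe_environment(["pandas", "scikit-learn"])
print(report)
```

Saving such a report next to your outputs makes it easier to tell whether two diverging results come from the code or from the environment.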

> **Note**
>
> American researchers have described a reproducibility crisis in the field of *machine learning*. The distortions of the scientific publishing ecosystem - combined with the economic incentives driving academic publications in *machine learning* - are often cited as major contributing factors.
>
> However, university education also bears a share of the responsibility. Students and researchers are rarely trained in the principles of reproducibility, and if these practices are not introduced early in their careers, they are unlikely to adopt them later. This is why, in addition to teaching `Python` and *data science*, this course includes a dedicated section on using version control with `Git`.
>
> All student projects are required to be *open source*—one of the most effective ways for instructors to encourage high-quality, transparent, and reproducible code.

## 5.3 Assessment

Students at ENSAE complete the course by working on an in-depth project. Details on how the course is assessed, along with a list of past student projects, can be found in the [Evaluation](../../content/annexes/evaluation) section.

# 6. Course Outline

This course serves as an introduction to the core challenges of *data science* through learning the `Python` programming language. As the term *“data science”* implies, a significant portion of the course is dedicated to working directly with data: retrieving it, structuring it, exploring it, and combining it.

These topics are covered in the first part of the course, [“Manipulating Data”](../../content/manipulation/index.qmd), which lays the foundation for everything that follows. Unfortunately, many training programs in *data science*, applied statistics, or the economic and social sciences tend to overlook this crucial aspect of a data scientist’s work—often referred to as [“data wrangling”](https://en.wikipedia.org/wiki/Data_wrangling) or [*“feature engineering”*](https://en.wikipedia.org/wiki/Feature_engineering). And yet, not only does it represent a large share of the day-to-day work in data science, it’s also essential for building relevant and accurate models.
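As a rough illustration of what structuring and combining data involves, here is a minimal sketch using only the standard library. The two small tables, their column names, and the city identifiers are invented for the example; the course itself relies on dedicated libraries such as `Pandas` for this kind of work.

```python
import csv
import io

# Hypothetical raw extracts (invented for illustration), as they might arrive from two sources
emissions_csv = "city_code,co2_tonnes\ncity_a,5200\ncity_b,2100\ncity_c,3300\n"
population_csv = "city_code,population\ncity_a,2100000\ncity_b,520000\n"


def read_table(text):
    """Structure raw CSV text as a list of dictionaries, one per observation."""
    return list(csv.DictReader(io.StringIO(text)))


emissions = read_table(emissions_csv)
population = {
    row["city_code"]: int(row["population"]) for row in read_table(population_csv)
}

# Combine both sources and derive a new variable; cities without a match are set aside
combined, unmatched = [], []
for row in emissions:
    code = row["city_code"]
    if code in population:
        per_capita = float(row["co2_tonnes"]) / population[code]
        combined.append({"city_code": code, "co2_per_capita": per_capita})
    else:
        unmatched.append(code)

print(combined)
print("No population figure for:", unmatched)
```

Even on a toy example, the join raises a typical wrangling question: what should happen to observations that appear in one source but not the other?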

The goal of this first section is to highlight the challenges involved in accessing and leveraging different types of data sources using `Python`. The examples are diverse, reflecting the variety of data that can be analyzed with `Python`: municipal $CO_2$ emissions in France, real estate transaction records, housing energy performance diagnostics, bike-sharing data from the Velib system, and more.

The second part of the course focuses on creating visualizations with `Python`. Once your data has been cleaned and processed, you’ll typically want to summarize it—through tables, graphs, or maps. This part, [“Communicating with Python”](../../content/visualization/index.qmd), offers a concise introduction to the topic. While somewhat introductory, it provides essential concepts that will be reinforced later in the course.

The third part centers on modeling, using electoral science as the main example ([“Modeling with Python”](../../content/modelisation/index.qmd)). This section introduces the scientific reasoning behind *machine learning*, explores both methodological and technical choices, and sets the stage for deeper topics addressed later in the program.

The fourth part takes a step back to focus on the specific challenges of working with text data. This is the [“Introduction to Natural Language Processing (NLP) with Python”](../../content/NLP/index.qmd) chapter. Given that NLP is a rapidly evolving field, this section serves only as an introduction. For more advanced coverage, see Russell and Norvig (2020), chapter 24.

The course also includes a section on version control with `Git` ([Discover `Git`](../../content/git/index.qmd)). Why include this in a course about `Python`? Because learning `Git` helps you write better code, collaborate effectively, and test or share your work in reproducible environments. This is especially valuable in a world where platforms like [`GitHub`](https://github.com/) act as professional showcases, and where companies and public institutions increasingly expect their *data scientists* to be proficient with `Git`.

For more advanced applications, including deployment and reproducibility, refer to the companion course on [putting data science projects into production](https://ensae-reproductibilite.github.io/).

# References

Davenport, Thomas H, and DJ Patil. 2012. “Data Scientist, the Sexiest Job of the 21st Century.” *Harvard Business Review* 90 (5): 70–76. <https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century>.

———. 2022. “Is Data Scientist Still the Sexiest Job of the 21st Century?” *Harvard Business Review* 90.

Russell, Stuart J., and Peter Norvig. 2020. *Artificial Intelligence: A Modern Approach (4th Edition)*. Pearson. <http://aima.cs.berkeley.edu/>.

[1] Opening the chapters as *notebooks* in standardized environments - something explained in the next chapter - ensures you are working in a controlled setup. Personal `Python` installations often involve tweaks and adjustments that can alter your environment and lead to unexpected, hard-to-diagnose errors. For this reason, such local setups are not recommended for this course. As you’ll see in the next chapter, *cloud-based* environments offer the advantage of consistent, preconfigured setups that greatly improve reliability and ease of use.