# Part 4: Natural Language Processing (NLP)

Lino Galiana  
2025-10-07

# 1. Introduction

The previous sections focused on acquiring cross-functional skills for working with data. Naturally, we have so far mostly focused on structured data—modest in size but already rich in analytical potential. This new section turns to a subject that, at first glance, may seem unlikely to be handled by computers—a topic of centuries-old philosophical debate, from Plato to Saussure: the richness of **human language**.

By distinguishing “language” from “tongue” (defining the former as the capacity to express and communicate thought through signs, and the latter as the conventional implementation of that capacity), we align ourselves with the field of linguistics and treat language as data.
This opens the door to statistical and algorithmic analysis. Yet, even if statistical regularities exist, how can computers—ultimately limited to just *0* and *1*—grasp such a complex object as language, which takes humans years to understand and master?[1]

# 2. Natural Language Processing

Natural Language Processing (NLP) refers to the set of techniques that allow computers to understand, analyze, synthesize, and generate human language[2].

NLP is a field at the intersection of statistics and linguistics that has grown significantly in recent years, in academia, in operational settings, and in industry.
Some applications of these techniques have become essential in our daily lives, such as search engines, machine translation, and more recently, chatbots, whose development has accelerated rapidly since the launch of `ChatGPT` at the end of November 2022.

# 3. Section Summary

This part of the course is dedicated to text data analysis, using literary examples (📖) for fun. It serves as a gradual introduction to the topic, focusing on the basic concepts needed to later understand more advanced principles and sophisticated techniques[3]. This section mainly covers:

-   The challenges of cleaning textual fields
    and frequency analysis. This is somewhat *old school* NLP but understanding it is essential to progress further;
-   Language modeling using several approaches.

Before diving into the topic of *embeddings*, it’s important to understand the contributions and limitations of concepts like
the bag of words or TF-IDF (*term frequency - inverse document frequency*). One of the main benefits of large language models, namely a rich context window that lets them better grasp textual nuances and the speaker’s intent, becomes clearer once the limitations of traditional NLP are understood.
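To make these two notions concrete, here is a minimal sketch using only the standard library. The toy corpus is made up for illustration, and the weighting shown is just one common TF-IDF variant (relative term frequency multiplied by the logarithm of the inverse document frequency); libraries often use smoothed versions of this formula.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Bag of words: each document becomes a multiset of tokens
bags = [Counter(doc.split()) for doc in corpus]

# Document frequency: number of documents in which each term appears
doc_freq = Counter(term for bag in bags for term in bag)


def tf_idf(bag, doc_freq, n_docs):
    """Weight term counts by their rarity across the corpus (one simple variant)."""
    total = sum(bag.values())
    return {
        term: (count / total) * math.log(n_docs / doc_freq[term])
        for term, count in bag.items()
    }


weights = [tf_idf(bag, doc_freq, len(corpus)) for bag in bags]

# A term present in every document ("the") gets a zero weight,
# while a term specific to one document ("mat") gets the highest one.
print(weights[0])

# Limitation: word order is lost, so these two sentences are indistinguishable
same_bag = Counter("dog bites man".split()) == Counter("man bites dog".split())
print(same_bag)
```

The last comparison illustrates the main limitation mentioned above: a bag of words ignores word order and context, which is precisely what contextual models try to capture.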

As an introductory perspective, this course focuses on frequency-based approaches, especially the bag-of-words approach, to ease into the later exploration of the Pandora’s box that is *embeddings*.

## 3.1 Text Cleaning and Frequency Analysis

`Python` is an excellent tool for text data analysis.
Basic methods for transforming textual data or dictionaries, combined with specialized libraries
such as `NLTK` and `SpaCy`, make it possible to perform normalization and text data analysis
very efficiently. `Python` is much better equipped than `R` for text data analysis.
There is a wealth of online resources on this subject, and the best way to learn remains hands-on practice with a corpus to clean.

This section first revisits how to structure and clean a textual corpus
through the *bag of words* approach.
It aims to demonstrate how to turn a corpus into a tool suitable for
statistical analysis:

-   It first introduces the challenges of text data cleaning
    through [an analysis of *The Count of Monte Cristo*](../../content/NLP/01_intro.qmd) by Alexandre Dumas,
    which shows how to quickly summarize the information available
    in a large volume of text data (as illustrated by word clouds);
-   It then offers a series of exercises on text cleaning based on the
    works of Edgar Allan Poe, Mary Shelley, and H.P. Lovecraft, aiming to highlight
    the specificity of each author’s vocabulary. These exercises are
    available [in the second chapter](../../content/NLP/02_exoclean.qmd) of the section.

This frequency-based analysis provides perspective on the nature of text data and on recurring issues in the dimensionality reduction of natural language corpora. Just as descriptive statistics naturally lead to modeling, this frequency approach quickly leads to a desire to identify the rules underlying our text corpora.
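As a rough illustration of what cleaning and frequency analysis involve, the sketch below lowercases a text, keeps alphabetic tokens, removes stopwords, and counts what remains. The sentence and the stopword list are made up for this example (the sentence is not a quotation from the novel); in practice, `NLTK` or `SpaCy` provide far more complete tokenizers and stopword lists.

```python
import re
from collections import Counter

# Tiny hand-picked stopword list, for illustration only
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "was", "he", "his", "on"}


def clean(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    # [a-z]+ only matches unaccented ASCII letters: good enough for this
    # English example, but it would split accented words in other languages
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]


# Made-up sentence used as a stand-in for a real corpus
text = "The Count arrived in Paris, and the Count was feared: his revenge was patient."

tokens = clean(text)
frequencies = Counter(tokens)

# Most frequent terms after cleaning: the raw material of a word cloud
print(frequencies.most_common(3))
```

Without the stopword filter, the most frequent tokens would be words such as "the" or "was", which say nothing about the content. This is why cleaning comes before any frequency analysis.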

## 3.2 Language Modeling

The remainder of this section introduces the challenges of language modeling, a topic that is currently very popular due to the success of `ChatGPT`. However, before delving into large language models (LLMs), those neural networks with billions of parameters trained on massive volumes of data, it’s important to first understand some preliminary modeling techniques.

We begin by exploring an alternative approach that takes into account the context in which a word appears. The introduction of
*Latent Dirichlet Allocation* (LDA) serves as an opportunity to present document modeling through *topics*. However, this approach has fallen out of favor in comparison to methods related to the concept of *embedding*.

Toward the end of this course section, we will introduce the challenge of transforming textual fields
into numeric vectors. To do so, we will present the principle
behind `Word2Vec`, which allows us, for instance,
to identify that `Man` and `Woman` are semantically close
even though they are distinct words.
This chapter serves as a bridge to the concept of *embedding*, a major recent revolution in NLP. It enables the comparison of corpora
not only by syntactic similarity (e.g., do they share common words?)
but also by semantic similarity (e.g., do they share a theme or meaning?). Covering `Word2Vec` will give curious learners a solid foundation to then explore transformer-based models, which are now the benchmark in NLP.
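The difference between the two kinds of similarity can be sketched with a toy example. The word vectors below are hypothetical values written by hand for illustration; they are not the output of `Word2Vec`, whose vectors are learned from a corpus and typically have hundreds of dimensions. Averaging word vectors to represent a document is also just one simple choice among others.

```python
import math

# Hypothetical 3-dimensional word vectors, hand-written for illustration
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "rules": [0.7, 0.9, 0.0],
    "reigns": [0.8, 0.9, 0.1],
    "banana": [0.0, 0.1, 0.9],
    "ripens": [0.1, 0.0, 0.8],
}


def jaccard(tokens_a, tokens_b):
    """Word-overlap similarity: share of distinct words the two documents have in common."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b)


def mean_vector(tokens):
    """Represent a document as the average of its word vectors."""
    vectors = [toy_vectors[tok] for tok in tokens]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)


doc_a = ["king", "rules"]
doc_b = ["queen", "reigns"]
doc_c = ["banana", "ripens"]

lexical_ab = jaccard(doc_a, doc_b)  # no word in common
semantic_ab = cosine(mean_vector(doc_a), mean_vector(doc_b))
semantic_ac = cosine(mean_vector(doc_a), mean_vector(doc_c))

print(lexical_ab, round(semantic_ab, 2), round(semantic_ac, 2))
```

With these made-up vectors, the first two documents share no words yet end up close in the vector space, while the third stays far away. Learned embeddings aim to produce this kind of behavior on real vocabulary.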

# To Go Further

Research in the field of NLP is highly active. It is therefore advisable
to stay curious and explore additional resources, as no single source
can compile all knowledge—especially in a field as dynamic as NLP.

To deepen the skills discussed in this course, I strongly recommend
this [course by `HuggingFace`](https://huggingface.co/course/chapter1/2?fw=pt).

To understand the internal architecture of an LLM,
this [post by Sebastian Raschka](https://magazine.sebastianraschka.com/p/understanding-encoder-and-decoder)
is very helpful.

These chapters only scratch the surface of NLP use cases for *data scientists*.
For instance, in public statistics, one major NLP use case is automatic classification:
converting free-text survey answers into predefined categories of a nomenclature.
This is a multi-level classification problem adapted to the needs of public statistics, which relies heavily on standardized nomenclatures.

Here is an example from a project on automated job classification using the PCS (socio-professional categories) typology,
based on a model trained with the `Fasttext` library:

```python
import requests
import pandas as pd

activite = "data scientist"
urlApe = (
    "https://codification-ape-test.lab.sspcloud.fr/"
    f"predict?nb_echos_max=3&prob_min=0&text_feature={activite}"
)

try:
    # Query the API
    resp = requests.get(urlApe, timeout=10)
    resp.raise_for_status()  # raises an error for 4xx/5xx HTTP status codes
    data = resp.json()

    # Extract the confidence index (IC)
    IC = data.pop("IC", None)

    # Convert the remaining predictions to a DataFrame
    df = pd.DataFrame(data.values())
    df["indice_confiance"] = IC

    print(df)

except requests.exceptions.RequestException as e:
    print("Error during the API call:", e)
    df = pd.DataFrame()  # empty DataFrame on failure

except (ValueError, KeyError) as e:
    print("Error while parsing the data:", e)
    df = pd.DataFrame()
```

When this page was rendered, the test API was unavailable, so the call returned an error:

```
Error during the API call: 503 Server Error: Service Temporarily Unavailable for url: https://codification-ape-test.lab.sspcloud.fr/predict?nb_echos_max=3&prob_min=0&text_feature=data%20scientist
```

[1] While writing these lines, I asked `ChatGPT` to tell me a joke.

> Why do programmers hate the beach?
> Because the sand keeps changing paradigms!

Not understanding it, I asked for an explanation. Here’s what I got:

> Of course, I’d be happy to explain! This joke plays on the double meaning of the word “paradigm” and is a pun involving programming and the beach.
>
> In programming, a “paradigm” refers to a model or method of programming, such as object-oriented or functional programming. Programmers are often familiar with different paradigms.
>
> On the other hand, in a beach context, “paradigm” is humorously misused to mean “tide”, alluding to the natural shifting of sand and water.
>
> The joke plays on the humorous frustration programmers might feel when adapting to new programming paradigms, just like how sand constantly shifts with the tides. By combining both meanings of “paradigm,” the joke creates an absurd and amusing scenario that will resonate with those familiar with programming and beaches.

In the end, I found the explanation funnier than the joke itself.

[2] The concept of “natural language processing” has tended to replace the more specific term “text mining.” Text mining refers to the search for, or mining of, information within a set of texts. While it is one application area of NLP, it is not the only one. Advances in computing power and in the formalization of language have expanded the range of fields where computational linguistics is applied.

[3] For example, the concept of *embedding* (the transformation of a text field into a multidimensional numeric vector) is central in NLP today but is only briefly mentioned here.