# Numpy, the foundation of data science

Lino Galiana  
2025-10-07

<div class="badge-container"><div class="badge-text">If you want to try the examples in this tutorial:</div><a href="https://github.com/linogaliana/python-datascientist-notebooks/blob/main/notebooks/en/manipulation/01_numpy.ipynb" target="_blank" rel="noopener"><img src="https://img.shields.io/static/v1?logo=github&label=&message=View%20on%20GitHub&color=181717" alt="View on GitHub"></a>
<a href="https://datalab.sspcloud.fr/launcher/ide/vscode-python?autoLaunch=true&name=«01_numpy»&init.personalInit=«https%3A%2F%2Fraw.githubusercontent.com%2Flinogaliana%2Fpython-datascientist%2Fmain%2Fsspcloud%2Finit-vscode.sh»&init.personalInitArgs=«en/manipulation%2001_numpy%20correction»" target="_blank" rel="noopener"><img src="https://custom-icon-badges.demolab.com/badge/SSP%20Cloud-Lancer_avec_VSCode-blue?logo=vsc&logoColor=white" alt="Onyxia"></a>
<a href="https://datalab.sspcloud.fr/launcher/ide/jupyter-python?autoLaunch=true&name=«01_numpy»&init.personalInit=«https%3A%2F%2Fraw.githubusercontent.com%2Flinogaliana%2Fpython-datascientist%2Fmain%2Fsspcloud%2Finit-jupyter.sh»&init.personalInitArgs=«en/manipulation%2001_numpy%20correction»" target="_blank" rel="noopener"><img src="https://img.shields.io/badge/SSP%20Cloud-Lancer_avec_Jupyter-orange?logo=Jupyter&logoColor=orange" alt="Onyxia"></a>
<a href="https://colab.research.google.com/github/linogaliana/python-datascientist-notebooks-colab//en/blob/main//notebooks/en/manipulation/01_numpy.ipynb" target="_blank" rel="noopener"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a><br></div>


# 1. Introduction

This chapter serves as an introduction to *Numpy* to ensure that the basics of vector calculations with `Python` are mastered. The first part of the chapter presents small exercises to practice some basic functions of `Numpy`. The end of the chapter presents more in-depth practical exercises using `Numpy`.

It is recommended to regularly refer to the [*numpy cheatsheet*](https://www.datacamp.com/community/blog/python-numpy-cheat-sheet) and the [official documentation](https://numpy.org/doc/stable/) if you have any doubts about a function.

In this chapter, we will adhere to the convention of importing `Numpy` as follows:

In [None]:
import numpy as np

We will also set the seed of the random number generator to obtain reproducible results:

In [None]:
import numpy as np
rng = np.random.default_rng(seed=12345)

> **Caution**
>
> Historically, random numbers were generated with the legacy functions of the `numpy.random` module (for example `np.random.normal`, seeded globally with `np.random.seed`). However, the authors of `Numpy` [now recommend](https://numpy.org/doc/stable/reference/random/index.html) using generators (`np.random.default_rng`) instead. The examples in this tutorial adopt this practice.

# 2. Concept of *array*

In the world of data science, as will be discussed in more depth in the upcoming chapters, the central object is the two-dimensional data table. The first dimension corresponds to rows and the second to columns. If we only consider one dimension, we refer to a variable (a column) of our data table. It is therefore natural to link data tables to the mathematical objects of matrices and vectors.

`NumPy` (`Numerical Python`) is the foundational building block for handling lists of numbers or strings as vectors and matrices. It provides this type of object, along with the associated standardized operations, which do not exist in the base `Python` language.

The central object of `NumPy` is the **`array`**, which is a multidimensional data table. A `Numpy` array can be one-dimensional and considered as a vector (`1d-array`), two-dimensional and considered as a matrix (`2d-array`), or, more generally, take the form of a multidimensional object (`Nd-array`), a sort of nested table.

Simple arrays (one- or two-dimensional) are easy to represent and cover most of the use cases related to `Numpy`. We will discover in the next chapter on `Pandas` that, in practice, we usually don’t use `Numpy` directly since it is a low-level library. A `Pandas` *DataFrame* is built from a collection of one-dimensional arrays (the variables of the table), which makes it possible to perform operations that are consistent with each variable's type (and optimized for it). Some `Numpy` knowledge is nevertheless useful for understanding the logic of vector manipulation, which makes data processing more readable, efficient, and reliable.

Compared to a list:

-   an *array* can only contain one type of data (`integer`, `string`, etc.), whereas a list can mix types;
-   operations implemented by `Numpy` are more efficient and require less memory.
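
The two differences above can be illustrated with a minimal sketch. The variable names are hypothetical, and the exact byte counts for the list depend on the Python version, so only the order of magnitude matters here:

```python
import sys
import numpy as np

# A list keeps each element's own type...
mixed = [1, 2.5, 3]
list_types = [type(v).__name__ for v in mixed]   # ['int', 'float', 'int']

# ...whereas an array converts everything to a single common dtype
arr = np.array(mixed)
print(list_types, arr.dtype)                      # the array is float64

# Rough memory comparison for 1000 integers
values = list(range(1000))
arr_int = np.arange(1000, dtype=np.int64)

# For the list: the container plus every Python int object it points to
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
print(arr_int.nbytes, list_bytes)                 # 8 bytes per element for the array

# Vectorized operation: no explicit loop is needed
doubled = arr_int * 2
```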

Geographical data will constitute a slightly more complex construction than a traditional `DataFrame`. The geographical dimension takes the form of a deeper table, at least two-dimensional (coordinates of a point). However, geographical data manipulation libraries will handle this increased complexity.

## 2.1 Creating an array

We can create an array in several ways. To create an array from a list, simply use the `np.array` function:

In [None]:
np.array([1,2,5])

array([1, 2, 5])

It is possible to add a `dtype` argument to constrain the array type:

In [None]:
np.array([["a","z","e"],["r","t"],["y"]], dtype="object")

array([list(['a', 'z', 'e']), list(['r', 't']), list(['y'])], dtype=object)

There are also convenient functions for creating arrays:

-   Logical sequences: `np.arange` (evenly spaced sequence, end bound excluded) or `np.linspace` (a given number of evenly spaced points between two bounds, both included);
-   Constant arrays: arrays filled with zeros, ones, or a chosen value: `np.zeros`, `np.ones`, or `np.full`;
-   Random sequences: random number generation methods: `rng.uniform`, `rng.normal`, etc. where `rng` is a random number generator;
-   Identity matrix: `np.eye`.

This gives, for logical sequences:

In [None]:
np.arange(0,10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [None]:
np.arange(0,10,3)

array([0, 3, 6, 9])

In [None]:
np.linspace(0, 1, 5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])

For an array initialized to 0:

In [None]:
np.zeros(10, dtype=int)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

or initialized to 1:

In [None]:
np.ones((3, 5), dtype=float)

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

or even initialized to 3.14:

In [None]:
np.full((3, 5), 3.14)

array([[3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14]])

Finally, to create the matrix $I_3$:

In [None]:
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
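
Random sequences are listed above but not illustrated. Here is a minimal sketch, assuming a generator created as in the introduction; the parameter values are arbitrary choices made for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=12345)

# 5 draws from a uniform distribution on [0, 1)
u = rng.uniform(low=0, high=1, size=5)

# A 2x3 array of normal draws; note that `scale` is the
# standard deviation, not the variance
z = rng.normal(loc=10, scale=3, size=(2, 3))

# 4 random integers between 0 (included) and 10 (excluded)
k = rng.integers(low=0, high=10, size=4)

print(u.shape, z.shape, k.shape)
```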

> **Exercise 1**
>
> Generate:
>
> -   $X$ a random variable, 1000 repetitions of a $U(0,1)$ distribution
> -   $Y$ a random variable, 1000 repetitions of a normal distribution with zero mean and variance equal to 2
> -   Verify the variance of $Y$ with `np.var`

# 3. Indexing and slicing

## 3.1 Logic illustrated with a one-dimensional array

The simplest structure is the one-dimensional array:

In [None]:
x = np.arange(10)
print(x)

[0 1 2 3 4 5 6 7 8 9]

Indexing in this case is similar to that of a list:

-   The first element has index 0
-   The $n^{th}$ element is accessible at position $n-1$

The logic for accessing elements is as follows:

``` python
x[start:stop:step]
```

With a one-dimensional array, the slicing operation (keeping a slice of the array) is very simple. For example, to keep the first *K* elements of an array, you would do:

``` python
x[:K]
```

The $K^{th}$ element itself is selected using:

``` python
x[K-1]
```

To select only one element, you would do:

In [None]:
x = np.arange(10)
x[2]

np.int64(2)

The slicing syntax used to select particular indices of a list also works with arrays. Arrays additionally accept a list of positions as an index, for example `x[[1, 4]]`.
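
As a short illustration of the `start:stop:step` logic, here is a sketch on a small array (the selections are arbitrary examples):

```python
import numpy as np

x = np.arange(10)

every_other = x[2:8:2]   # from index 2 to 7, every second element
last_three = x[-3:]      # negative indices count from the end
reversed_x = x[::-1]     # a negative step walks the array backwards
picked = x[[1, 4]]       # a list of positions selects those elements

print(every_other, last_three, reversed_x, picked)
```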

> **Exercise 2**
>
> Take `x = np.arange(10)` and…
>
> -   Select elements 0, 3, 5 from `x`
> -   Select even elements
> -   Select all elements except the first
> -   Select the first 5 elements

In [None]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

The same logic applies to multidimensional *arrays*. Indexing then takes place at several levels. Take, for example, a 2-dimensional array (a matrix of sorts):

In [None]:
x = np.array([[1, 2, 3], [4, 5, 6]], np.int32)

If we want to select the 2nd row, 3rd column (the element with value 6), we do

In [None]:
x[1, 2]

np.int32(6)

Now, to select a complete column (e.g. the 2nd), we give its position as the second index (index 1, since indexing starts from 0) and use `:` on the first dimension (a shortened version of `0:N`) to keep every row:

In [None]:
x[:,1]

array([2, 5], dtype=int32)

The principle is generalized, but becomes more complex, for nested *arrays*. Fortunately, these are objects we rarely manipulate directly, as most of our numerical data are flat arrays (a value - the observation - is the intersection of a row - the individual - and a column - the variable).

## 3.2 Regarding performance

A key element in the performance of `Numpy` compared to lists is that slicing an array does not return a copy of the selected elements (a copy costs memory and time) but simply a view of them.

When it is necessary to make a copy, for example to avoid altering the underlying array, you can use the `copy` method:

``` python
x_sub_copy = x[:2, :2].copy()
```
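
The difference between a view and a copy can be checked with a minimal sketch. The variable names are hypothetical; `np.shares_memory` is used here only to make the behaviour visible:

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)

# Basic slicing returns a view: it shares memory with x
view = x[:2, :2]
view[0, 0] = 99            # this also modifies x
print(x[0, 0])             # 99

# An explicit copy is independent from x
x_sub_copy = x[:2, :2].copy()
x_sub_copy[0, 0] = -1      # x is left untouched
print(x[0, 0])             # still 99

print(np.shares_memory(x, view), np.shares_memory(x, x_sub_copy))
```

This is why modifying a slice in place can silently alter the original array.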

It is also possible, and more practical, to select data based on logical conditions (an operation called a ***boolean mask***). This functionality will mainly be used to perform data filtering operations.

For simple comparison operations, logical comparators may be sufficient. These comparisons also work on multidimensional arrays thanks to broadcasting, which we will discuss later:

In [None]:
x = np.arange(10)
x2 = np.array([[-1,1,-2],[-3,2,0]])
print(x)
print(x2)

[0 1 2 3 4 5 6 7 8 9]
[[-1  1 -2]
 [-3  2  0]]

In [None]:
x==2
x2<0

array([[ True, False,  True],
       [ True, False, False]])

To select the observations that satisfy the logical condition, pass the boolean array as an index, as in `x[x > 6]`: only the elements for which the condition is `True` are kept.
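
A minimal sketch of boolean masking on the two arrays defined above (the conditions are arbitrary examples):

```python
import numpy as np

x = np.arange(10)
x2 = np.array([[-1, 1, -2], [-3, 2, 0]])

# The mask is a boolean array with the same shape as x
mask = x > 6
large = x[mask]

# On a 2-d array, masking returns the selected values as a flat 1-d array
negatives = x2[x2 < 0]

# Conditions are combined with & (and) and | (or), with parentheses
combined = x[(x > 2) & (x % 3 == 0)]

print(large, negatives, combined)
```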

> **Exercise 3**
>
> Given
>
> ``` python
> x = rng.normal(size=10000)
> ```
>
> 1.  Keep only the values whose absolute value is greater than 1.96
> 2.  Count the number of values greater than 1.96 in absolute value and their proportion in the whole set
> 3.  Sum the absolute values of all observations greater (in absolute value) than 1.96 and relate them to the sum of the values of `x` (in absolute value)

Whenever possible, it is recommended to use `numpy`’s logical functions, which are optimized and handle dimensions properly. Among them are:

-   `count_nonzero`;
-   `isnan`;
-   `any` or `all`, especially with the `axis` argument;
-   `np.array_equal` to check element-by-element equality.
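
To give an idea of how these functions behave, here is a small sketch on toy arrays chosen for illustration. With `axis=0` the function is applied down each column, and with `axis=1` along each row:

```python
import numpy as np

m = np.array([[0, 2, 0],
              [0, 0, 5]])
v = np.array([1.0, np.nan])

n_nonzero = np.count_nonzero(m)                # over the whole array
nonzero_by_col = np.count_nonzero(m, axis=0)   # one count per column

any_positive_by_row = (m > 0).any(axis=1)      # one boolean per row
all_zero_by_col = (m == 0).all(axis=0)         # one boolean per column

missing = np.isnan(v)                          # element-wise test for NaN
same = np.array_equal(m, m.copy())             # same shape and same values

print(n_nonzero, nonzero_by_col, any_positive_by_row, all_zero_by_col, missing, same)
```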

Let’s create `x`, a two-dimensional array, and `y`, a one-dimensional array with a missing value.

In [None]:
# Assuming rng has been created beforehand
x = rng.normal(0, size=(3, 4))
y = np.array([np.nan, 0, 1])

> **Exercise 4**
>
> 1.  Use `count_nonzero` on `y`
> 2.  Use `isnan` on `y` and count the number of non-NaN values
> 3.  Check if `x` has at least one positive value in its entirety, by rows and then by columns.
>
> <details>
>
> <summary>
>
> Hint
>
> </summary>
>
> Take a look at the `axis` parameter by researching online. For example, [here](https://www.sharpsightlabs.com/blog/numpy-axes-explained/).
>
> </details>

# 4. Manipulating an array

## 4.1 Manipulation functions

`Numpy` provides standardized methods and functions for modifying an array. The table below shows some of them:

| Operation | Implementation |
|------------------------------|------------------------------------------|
| Flatten an array | `x.flatten()` (method) |
| Transpose an array | `x.T` (attribute) or `np.transpose(x)` (function) |
| Append elements to the end | `np.append(x, [1,2])` |
| Insert elements at a given position (at positions 1 and 2) | `np.insert(x, [1,2], 3)` |
| Delete elements (at positions 0 and 3) | `np.delete(x, [0,3])` |
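
The operations of the table can be tried on small arrays. This sketch uses arbitrary values; note that these functions return a new array and leave the original unchanged:

```python
import numpy as np

x = np.array([10, 20, 30, 40])
m = np.array([[1, 2, 3], [4, 5, 6]])

flat = m.flatten()                   # 2-d array to 1-d array
transposed = m.T                     # shape (2, 3) becomes (3, 2)

appended = np.append(x, [1, 2])      # values added at the end
inserted = np.insert(x, [1, 2], 3)   # value 3 inserted before positions 1 and 2
deleted = np.delete(x, [0, 3])       # elements at positions 0 and 3 removed

print(flat, transposed.shape, appended, inserted, deleted)
print(x)                             # x itself has not been modified
```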

To combine arrays, you can use, depending on the case, the functions `np.concatenate` or `np.vstack`, or the object `np.r_` (row-wise concatenation), and the functions `np.hstack` or `np.column_stack`, or the object `np.c_` (column-wise concatenation).
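
A minimal sketch of these options on two small matrices chosen for illustration:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# Row-wise: b is stacked below a, giving shape (4, 2)
rows = np.vstack([a, b])
rows_concat = np.concatenate([a, b], axis=0)
rows_r = np.r_[a, b]                 # np.r_ is used with square brackets

# Column-wise: b is placed to the right of a, giving shape (2, 4)
cols = np.hstack([a, b])
cols_concat = np.concatenate([a, b], axis=1)
cols_c = np.c_[a, b]                 # np.c_ is used with square brackets
cols_stack = np.column_stack([a, b])

print(rows.shape, cols.shape)
```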

In [None]:
x = rng.normal(size = 10)

To sort an array, use `np.sort`

In [None]:
x = np.array([7, 2, 3, 1, 6, 5, 4])

np.sort(x)

array([1, 2, 3, 4, 5, 6, 7])

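If you want to perform a partial reordering to find the *k* smallest values in an `array` without fully sorting it, use `partition`. With `np.partition(x, k)`, the element at index *k* is placed where it would be in the sorted array, all smaller values come before it and all larger values after it; the order inside each part is not guaranteed: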

In [None]:
np.partition(x, 3)

array([1, 2, 3, 4, 5, 6, 7])
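
The property that `partition` guarantees can be checked with a short sketch. It deliberately avoids relying on the order inside each part, which may differ between `Numpy` versions:

```python
import numpy as np

x = np.array([7, 2, 3, 1, 6, 5, 4])
k = 3

part = np.partition(x, k)

# The element at index k is the one a full sort would put there
pivot_ok = part[k] == np.sort(x)[k]

# The first k elements are the k smallest values, in some order
smallest = set(part[:k].tolist())

# argpartition returns positions in x instead of values
idx = np.argpartition(x, k)[:k]
smallest_from_idx = sorted(x[idx].tolist())

print(part, smallest, idx)
```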

For classical descriptive statistics, `Numpy` offers a number of already implemented functions, which can be combined with the `axis` argument.

In [None]:
x = rng.normal(0, size=(3, 4))

> **Exercise 5**
>
> 1.  Sum all the elements of an `array`, the elements by row, and the elements by column. Verify the consistency.
> 2.  Write a function `statdesc` to return the following values: mean, median, standard deviation, minimum, and maximum. Apply it to `x` using the *axis* argument.

# 5. Broadcasting

Broadcasting refers to a set of rules for applying operations to arrays of different dimensions. In practice, it generally consists of applying a single operation to all members of a `numpy` array.

The difference can be understood from the following example. Broadcasting allows the scalar `5` to be stretched into a one-dimensional array of length 3, so that `a + 5` gives the same result as `a + b`:

In [None]:
a = np.array([0, 1, 2])
b = np.array([5, 5, 5])

a + b
a + 5

array([5, 6, 7])

Broadcasting can be very practical for efficiently performing operations on data with a complex structure. For more details, visit [here](https://jakevdp.github.io/PythonDataScienceHandbook/02.05-computation-on-arrays-broadcasting.html) or [here](https://stackoverflow.com/questions/47435526/what-is-the-meaning-of-axis-1-in-keras-argmax).
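
Broadcasting also applies between arrays of different shapes, not only with scalars. Here is a minimal sketch with arbitrary values, where a row vector and a column vector are combined with a matrix:

```python
import numpy as np

m = np.arange(6).reshape(2, 3)     # [[0, 1, 2], [3, 4, 5]]
row = np.array([10, 20, 30])       # shape (3,)
col = np.array([[100], [200]])     # shape (2, 1)

by_row = m + row     # row is repeated on each of the 2 rows
by_col = m + col     # col is repeated on each of the 3 columns
grid = row + col     # both are stretched to shape (2, 3)

# A common use: subtract each column's mean from that column
centered = m - m.mean(axis=0)

print(by_row, by_col, grid, centered, sep="\n")
```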

## 5.1 Application: programming your own k-nearest neighbors

> **Exercise 6 (a bit more challenging)**
>
> 1.  Create `X`, a two-dimensional array (i.e., a matrix) with 10 rows and 2 columns. The numbers in the array are random.
> 2.  Import the `matplotlib.pyplot` module as `plt`. Use `plt.scatter` to plot the data as a scatter plot.
> 3.  Construct a 10x10 matrix storing, at element $(i,j)$, the Euclidean distance between points $X[i,]$ and $X[j,]$. To do this, you will need to work with dimensions by creating nested arrays using `np.newaxis`:
>     1.  First, use `X1 = X[:, np.newaxis, :]` to transform the matrix into a nested array. Check the dimensions.
>     2.  Create `X2` of dimension `(1, 10, 2)` using the same logic.
>     3.  Deduce, for each point, the distance with other points for each coordinate. Square this distance.
>     4.  At this stage, you should have an array of dimension `(10, 10, 2)`. The reduction to a matrix is obtained by summing over the last axis. Check the help of `np.sum` on how to sum over the last axis.
>     5.  Finally, apply the square root to obtain a proper Euclidean distance.
> 4.  Verify that the diagonal elements are zero (distance of a point to itself…).
> 5.  Now, for each point, rank the other points from closest to farthest. Use `np.argsort` to get the ranking of the closest points for each row.
> 6.  We are interested in the k-nearest neighbors. For now, set k=2. Use `argpartition` to reorder each row so that the 2 closest neighbors of each point come first, followed by the rest of the row.
> 7.  Use the code snippet below to graphically represent the nearest neighbors.

<details><summary>A hint for graphically representing the nearest neighbors</summary>

``` python
plt.scatter(X[:, 0], X[:, 1], s=100)

# draw lines from each point to its two nearest neighbors
K = 2

for i in range(X.shape[0]):
    for j in nearest_partition[i, :K+1]:
        # plot a line from X[i] to X[j]
        # use some zip magic to make it happen:
        plt.plot(*zip(X[j], X[i]), color='black')
```

</details>


Did I invent this challenging exercise? Not at all, it comes from the book [*Python Data Science Handbook*](https://jakevdp.github.io/PythonDataScienceHandbook/02.08-sorting.html#Example:-k-Nearest-Neighbors). But if I had told you this immediately, would you have tried to answer the questions?

Moreover, it would not be a good idea to generalize this algorithm to large datasets. The complexity of our approach is $O(N^2)$, since it computes the distance between every pair of points. The tree-based algorithms implemented by `Scikit-Learn` scale closer to $O(N \log N)$.

Additionally, computing distance matrices on a GPU (graphics card) would be faster. In this regard, the library [faiss](https://github.com/facebookresearch/faiss), or dedicated frameworks for computing distances between high-dimensional vectors like [ChromaDB](https://www.trychroma.com/), offer much better performance than `Numpy` for this specific problem.

# 6. Additional Exercises

`Google` became famous thanks to its `PageRank` algorithm. Based on the links between websites, this algorithm assigns each site an importance score, which is used to evaluate its centrality in a network. The objective of this exercise is to use `Numpy` to implement such an algorithm from an adjacency matrix that links the sites together.

> **Understanding the principle of the <code>PageRank</code> algorithm**
>
> 1.  Create the following matrix with `Numpy`. Call it `M`:
>
> $$
> \begin{bmatrix}
> 0 & 0 & 0 & 0 & 1 \\
> 0.5 & 0 & 0 & 0 & 0 \\
> 0.5 & 0 & 0 & 0 & 0 \\
> 0 & 1 & 0.5 & 0 & 0 \\
> 0 & 0 & 0.5 & 1 & 0
> \end{bmatrix}
> $$
>
> 2.  To represent this minimalist *web* visually,
>     convert it to a `networkx` object (a library specialized
>     in network analysis) and use the `draw` function
>     of this package.
>
> This is the transpose of the adjacency matrix
> that links the sites together. For example,
> site 1 (first column) is referenced by
> sites 2 and 3. It only references site 5.
>
> 3.  Using the English Wikipedia page for `PageRank`, test
>     the algorithm on your matrix.

Site 1 is quite central because it is referenced twice. Site 5 is also central since it is referenced by site 1.