# Tutorial - NumPy and Pandas  

### NumPy arrays

This tutorial presents some simple examples of NumPy and Pandas data containers, so you can get familiar with the different ways in which data sets can be managed in Python. Let us start with NumPy. A typical way to import it is:

In [1]:
import numpy as np

In Mathematics, a **vector** is a sequence of numbers, and a **matrix** is a rectangular arrangement of numbers. In NumPy, vectors are called one-dimensional (1d) **arrays**, and matrices are called two-dimensional (2d) arrays. Arrays of more than two dimensions can be managed without pain.

Creating a 1d array in NumPy is easy:

In [2]:
arr1 = np.array(range(10))
arr1

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Numeric and string arrays can be created in the same way with the NumPy function `array`. The terms of a 1d array can be taken from a range or a list. If the elements of the list have different types, they are converted to a common type when the array is created.
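
The conversion to a common type can be checked with the attribute `dtype`. The following is a minimal sketch; the names `mixed_num` and `mixed_str` are only illustrative and do not appear elsewhere in this tutorial.

```python
import numpy as np

# Integers mixed with a float: all the terms are converted to float
mixed_num = np.array([1, 2, 3.5])
print(mixed_num, mixed_num.dtype)

# Numbers mixed with a string: all the terms are converted to strings
mixed_str = np.array([1, 2.5, 'a'])
print(mixed_str, mixed_str.dtype)
```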

A 2d array can be directly created from a list of lists of equal length. The terms are entered row-by-row:

In [3]:
arr2 = np.array([[0, 7, 2, 3], [3, 9, -5, 1]])
arr2

array([[ 0,  7,  2,  3],
       [ 3,  9, -5,  1]])

Although we visualize a vector as a column (or as a row) and a matrix as a rectangular arrangement, with rows and columns, it is not so in the computer. The 1d array is just a sequence of elements of the same type, neither horizontal nor vertical. It has one **axis**, which is the 0-axis.

In a similar way, a 2d array is a sequence of 1d arrays of the same length and type. It has two axes. When we visualize it as rows and columns, `axis=0` runs down the rows (from one row to the next), while `axis=1` runs across the columns (from one column to the next).

The number of terms stored along an axis is the **dimension** of that axis. The dimensions are collected in the attribute `shape`:

In [4]:
arr1.shape

(10,)

In [5]:
arr2.shape

(2, 4)
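
To see how the axes relate to the shape, it may help to sum along each axis. This is just a sketch: the method `sum` is not covered in this tutorial, and the names `col_totals`, `row_totals` and `arr3` are illustrative.

```python
import numpy as np

arr2 = np.array([[0, 7, 2, 3], [3, 9, -5, 1]])

# Summing along axis 0 collapses the rows: one total per column (3, 16, -3, 4)
col_totals = arr2.sum(axis=0)
print(col_totals.shape)

# Summing along axis 1 collapses the columns: one total per row (12, 8)
row_totals = arr2.sum(axis=1)
print(row_totals.shape)

# An array with three axes has a shape with three dimensions
arr3 = np.zeros((2, 3, 4))
print(arr3.shape)
```

The length of each result matches the dimension of the axis that was not collapsed.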

### NumPy functions

NumPy incorporates vectorized forms of the mathematical functions of the package `math`. A **vectorized function** is one that, when applied to an array, returns an array with the same shape, whose terms are the values of the function on the corresponding terms of the original array. For instance, the NumPy square root function takes the square root of every term of a numeric array:

In [6]:
np.sqrt(arr1)

array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712, 3.        ])

Functions defined in terms of vectorized functions are automatically vectorized. For instance:

In [7]:
def f(t): return 1/(1 + np.exp(t))
f(arr2)

array([[5.00000000e-01, 9.11051194e-04, 1.19202922e-01, 4.74258732e-02],
       [4.74258732e-02, 1.23394576e-04, 9.93307149e-01, 2.68941421e-01]])
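
For comparison, the functions of the package `math` take one number at a time, so getting the same result requires an explicit loop. A minimal sketch (the names `roots_loop` and `roots_vec` are illustrative):

```python
import math
import numpy as np

arr1 = np.array(range(10))

# math.sqrt is not vectorized: it has to be applied term by term
roots_loop = [math.sqrt(x) for x in arr1]

# np.sqrt is vectorized: a single call handles the whole array
roots_vec = np.sqrt(arr1)

# Both approaches give the same values
print(np.allclose(roots_loop, roots_vec))
```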

### Subsetting arrays

**Subsetting** a 1d array is done as for a list:

In [8]:
arr1[:3]

array([0, 1, 2])

The same applies to two-dimensional arrays, but we need two indexes within the square brackets. The first index selects the rows (`axis=0`), and the second index the columns (`axis=1`):

In [9]:
arr2[:1, 1:]

array([[7, 2, 3]])
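
A few more patterns may make the role of the two indexes clearer. This is a sketch using the same `arr2`; the names `term`, `first_row` and `first_col` are illustrative.

```python
import numpy as np

arr2 = np.array([[0, 7, 2, 3], [3, 9, -5, 1]])

# Two integers select a single term (second row, third column)
term = arr2[1, 2]
print(term)

# A single index selects along axis 0, returning a complete row as a 1d array
first_row = arr2[0]
print(first_row)

# A colon in the first place keeps all the rows, so this extracts the first column
first_col = arr2[:, 0]
print(first_col)
```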

When a comparison involving an array is evaluated, it returns a Boolean array with the same shape:

In [10]:
arr1 > 3

array([False, False, False, False,  True,  True,  True,  True,  True,
        True])

In [11]:
arr2 > 2

array([[False,  True, False,  True],
       [ True,  True, False, False]])

Subsets of an array can be extracted by means of an expression. The expression is evaluated, returning a Boolean array called a **Boolean mask**. The terms for which the mask is true are selected:

In [12]:
arr1[arr1 > 3]

array([4, 5, 6, 7, 8, 9])

Note that this is the same as:

In [13]:
arr1[[False, False, False, False,  True,  True,  True,  True,  True, True]]

array([4, 5, 6, 7, 8, 9])

Boolean masks can also be used to filter out rows or columns of a matrix. For instance, you can select the rows of `arr2` for which the first column is positive, keeping all the columns except the first one:

In [14]:
arr2[arr2[:, 0] > 0, 1:]

array([[ 9, -5,  1]])

Under the hood, NumPy evaluates the expression and uses the resulting Boolean mask to select the rows:

In [15]:
arr2[:, 0] > 0

array([False,  True])

In [16]:
arr2[[False,  True], 1:]

array([[ 9, -5,  1]])
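
The same idea works in two other situations, sketched below with illustrative names (`selected`, `cols`). A mask with the same shape as the 2d array returns the selected terms as a 1d array, and a mask placed in the second position filters columns instead of rows.

```python
import numpy as np

arr2 = np.array([[0, 7, 2, 3], [3, 9, -5, 1]])

# A 2d mask picks individual terms, returned row by row as a 1d array
selected = arr2[arr2 > 2]
print(selected)

# A 1d mask in the second position selects columns:
# here, the columns whose term in the first row exceeds 2
cols = arr2[:, arr2[0] > 2]
print(cols)
```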

### Pandas series

**Pandas** is typically imported as:

In [17]:
import pandas as pd

Pandas provides two data container types, series (one-dimensional) and data frames (two-dimensional). A **series** is composed of a 1d array of **values** and a list containing the names of the values of the series, called the **index**. These two components can be extracted as the attributes `values` and `index`.

Let me illustrate this with a simple example. I'll create a short series directly, something you rarely do in data science, where the data are usually imported from external data files. A Pandas series can be created, for instance, from a range, with the Pandas function `Series`:

In [18]:
s1 = pd.Series(range(10))
s1

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

Now, the values of the series are extracted as:

In [19]:
s1.values

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int64)

As shown above, when a series is printed, the index appears on the left. Since the index of `s1` has not been specified, a range of consecutive integers has been assigned as the index.

In [20]:
s1.index

RangeIndex(start=0, stop=10, step=1)

Instead of a range, a list or a vector can be used to provide the values of a series. In a list, the elements can have different types but, like NumPy, Pandas converts them to a common type, as shown in the following example. Now, instead of letting Pandas create the index automatically, as a `RangeIndex`, I specify an index directly:

In [21]:
s2 = pd.Series([1, 5, 'Messi'], index = ['a', 'b', 'c'])
s2

a        1
b        5
c    Messi
dtype: object

Now the index is a plain `Index`:

In [22]:
s2.index

Index(['a', 'b', 'c'], dtype='object')
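
Since the index contains the names of the values, a value can be retrieved by its name. Selection by name is not discussed in this section, so take the following as a hedged sketch of how `s2` could be queried; the names `last` and `first_two` are illustrative.

```python
import pandas as pd

s2 = pd.Series([1, 5, 'Messi'], index=['a', 'b', 'c'])

# A single index label returns the corresponding value
last = s2['c']
print(last)

# A list of labels returns a shorter series
first_two = s2[['a', 'b']]
print(first_two)

# The two components of the series
print(list(s2.index))
print(s2.values)
```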

### Pandas data frames

A Pandas **data frame** can be seen as a collection of series with the same index (hence, with the same length). Data frames can be built in many ways with the Pandas function `DataFrame`, for instance from a dictionary of vector-like objects of the same length, as in:

In [23]:
df = pd.DataFrame({'v1': range(0, 5),
    'v2': ['a', 'b', 'c', 'd', 'e'],
    'v3': np.repeat(-1.3, 5)})

Like a series, a data frame has the attributes `values` and `index`:

In [24]:
df.values

array([[0, 'a', -1.3],
       [1, 'b', -1.3],
       [2, 'c', -1.3],
       [3, 'd', -1.3],
       [4, 'e', -1.3]], dtype=object)

In [25]:
df.index

RangeIndex(start=0, stop=5, step=1)

Without an explicit specification, the index is created automatically, as a `RangeIndex`. In this example, since the columns have different data types, `df.values` takes `object` type. The third component of the data frame is a list with the column names, which can be extracted as the attribute `columns`:

In [26]:
df.columns

Index(['v1', 'v2', 'v3'], dtype='object')

Other attributes of a data frame are `shape` and `dtypes`. `shape` is the shape of the array of values: 

In [27]:
df.shape

(5, 3)

`dtypes` contains the data types of the columns:

In [28]:
df.dtypes

v1      int64
v2     object
v3    float64
dtype: object

Note that the data type of the second column, for which you might have expected `str`, is reported as `object`. Don't worry about this: you can still apply string functions to this column, as will be seen later in this course.

Having rows and columns, a data frame looks like a 2d array with row and column names. Indeed, we can create data frames in this way:

In [29]:
pd.DataFrame(arr2)

   0  1  2  3
0  0  7  2  3
1  3  9 -5  1


But not all data frames are so simple. While a NumPy 2d array has a single data type, in a Pandas data frame every column has its own data type.
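
Row and column names can be supplied when the data frame is created from an array, through the parameters `index` and `columns` of `DataFrame`. In this sketch, the labels `r1`, `r2`, `c1`, ... and the name `df2` are made up for illustration.

```python
import numpy as np
import pandas as pd

arr2 = np.array([[0, 7, 2, 3], [3, 9, -5, 1]])

# Same values as arr2, now with explicit row and column names
df2 = pd.DataFrame(arr2, index=['r1', 'r2'], columns=['c1', 'c2', 'c3', 'c4'])
print(df2)

# The array has one data type; the data frame reports one per column
print(arr2.dtype)
print(df2.dtypes)
```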

### Exploring Pandas objects

The methods `head` and `tail` extract the first and the last rows of a data frame, respectively. The default number of rows extracted is 5, but you may pass a custom number.

In [30]:
df.head(2)

   v1 v2   v3
0   0  a -1.3
1   1  b -1.3
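
For completeness, here is a short sketch of `tail` and of the default number of rows, on the same data frame. The names `last_two` and `default_head` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'v1': range(0, 5),
    'v2': ['a', 'b', 'c', 'd', 'e'],
    'v3': np.repeat(-1.3, 5)})

# The last two rows keep their original index labels (3 and 4)
last_two = df.tail(2)
print(last_two)

# Without an argument, head returns up to five rows
default_head = df.head()
print(len(default_head))
```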


The content of a data frame can also be explored with the method `info`. It reports the dimensions of the data frame, together with the data type and the number of non-missing values of every column.

In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   v1      5 non-null      int64  
 1   v2      5 non-null      object 
 2   v3      5 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 248.0+ bytes


The method `describe` returns a conventional statistical summary. The columns of type `object` are omitted, except when all the columns have that type. In that case, the report contains counts (`count`, `unique`, `top` and `freq`) instead of means and quantiles. This method also works for series.

In [32]:
df.describe()

             v1    v3
count       5.0   5.0
mean        2.0  -1.3
std    1.581139   0.0
min         0.0  -1.3
25%         1.0  -1.3
50%         2.0  -1.3
75%         3.0  -1.3
max         4.0  -1.3
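
The two other cases mentioned above, a series and a data frame without numeric columns, can be sketched as follows. The names `num_summary` and `text_summary` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'v1': range(0, 5),
    'v2': ['a', 'b', 'c', 'd', 'e'],
    'v3': np.repeat(-1.3, 5)})

# describe applied to a single numeric column (a series)
num_summary = df['v1'].describe()
print(num_summary)

# describe applied to a data frame that only contains a text column
text_summary = df[['v2']].describe()
print(text_summary)
```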


### Subsetting in Pandas

Pandas offers multiple ways for subsetting series and data frames. Suppose that you wish to select a subset of complete columns from a data frame. You can specify this with a list containing the names of those columns:

In [33]:
df[['v1', 'v2']]

   v1 v2
0   0  a
1   1  b
2   2  c
3   3  d
4   4  e
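
Note the difference between passing a column name and passing a list of names, even if the list has a single element. A minimal sketch, with the illustrative names `col_series` and `col_frame`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'v1': range(0, 5),
    'v2': ['a', 'b', 'c', 'd', 'e'],
    'v3': np.repeat(-1.3, 5)})

# A single name returns the column as a series
col_series = df['v1']
print(type(col_series))

# A list of names returns a data frame, here with one column
col_frame = df[['v1']]
print(type(col_frame), col_frame.shape)
```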


To select a collection of complete rows, we use a slice, as for a list or a 1d array (based on the row number):

In [34]:
df[1:3]

   v1 v2   v3
1   1  b -1.3
2   2  c -1.3
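
Another way to select rows by position is the indexer `iloc`, which is not covered in this section. The sketch below only checks that, for this data frame, it returns the same rows as the slice above; the names `by_slice` and `by_iloc` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'v1': range(0, 5),
    'v2': ['a', 'b', 'c', 'd', 'e'],
    'v3': np.repeat(-1.3, 5)})

# Rows in positions 1 and 2 (the end of the slice is excluded)
by_slice = df[1:3]

# The same selection with the position-based indexer iloc
by_iloc = df.iloc[1:3]

print(by_slice.equals(by_iloc))
```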


As for a NumPy array, a comparison involving a Pandas object returns a Boolean Pandas object with the same shape:

In [35]:
df['v1'] > 2

0    False
1    False
2    False
3     True
4     True
Name: v1, dtype: bool

In data science, you often use a Boolean mask to extract rows from a data frame. By entering an expression within the brackets, you select the rows for which the expression is true:

In [36]:
df[df['v1'] > 2]

   v1 v2   v3
3   3  d -1.3
4   4  e -1.3
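
Masks can be combined, and the row filter can be paired with a column selection. The sketch below assumes the indexer `loc` and the operator `&`, neither of which is introduced in this section; the names `mask`, `subset` and `both` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'v1': range(0, 5),
    'v2': ['a', 'b', 'c', 'd', 'e'],
    'v3': np.repeat(-1.3, 5)})

# The mask is a Boolean series with the same index as df
mask = df['v1'] > 2

# loc takes the mask for the rows and a list of names for the columns
subset = df.loc[mask, ['v1', 'v2']]
print(subset)

# Two conditions combined with & (each one in parentheses)
both = df[(df['v1'] > 0) & (df['v1'] < 3)]
print(both)
```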
