**Introductory and intermediate computing for Data Science [Barcelona School of Economics]**

`Instructor:` Maxim Fedotov  
`Program:` M.Sc. in Data Science Methodology

# Class 5

## Numpy package

The "numpy" package provides efficient data structures and utilities for mathematical computations that involve collections of objects.

Let's first import the package:

In [1]:
import numpy as np  # the name on the right is just a short name that I define for the notebook
np.random.seed(1337)  # you will see later, why it is here (setting random state for the whole document)

Note that in the text I use the full name of the package. In fact, for serious applications short names are not always recommended. One reason becomes apparent when you check the type of any numpy object: even though you defined a short name for the package in the current script, the names of its classes still contain the full name of the package.
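
A quick way to see this for yourself is sketched below. The variable names are only illustrative; the check relies on nothing beyond `numpy` itself.

```python
import numpy as np

example_array = np.array([1, 2, 3])  # a throwaway array, used only for this check
print(type(example_array))           # the printed type uses the full package name, not "np"

# build the same name by hand from the class attributes
type_name = f"{type(example_array).__module__}.{type(example_array).__name__}"
print(type_name)
```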

### Numpy array

A basic building block for the computations that we obtain from the package is `numpy.ndarray` (simply called a numpy array). You can create a numpy array using the `numpy.array(...)` function. Another helpful function is `numpy.arange(...)`, which gives an array of evenly spaced values (feel free to check out its arguments).

In [2]:
powers = np.array([1, 2, 3])
add = np.arange(3, 6)
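
As a small illustration of the `numpy.arange(...)` arguments, here is a sketch with three typical calls. The variable names are made up for this example.

```python
import numpy as np

evens = np.arange(0, 10, 2)      # start=0, stop=10 (excluded), step=2
first_five = np.arange(5)        # a single argument is treated as stop, start defaults to 0
quarters = np.arange(0, 1, 0.25) # non-integer steps are allowed as well
print(evens, first_five, quarters)
```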

The main advantage of `numpy` is that it provides the tools to vectorize operations, i.e. to apply them to whole arrays at once instead of looping over elements.

In [3]:
2**powers + add

array([ 5,  8, 13])

The core of the package is largely written in C. This means that well-optimized code with proper use of vectorized operations can give substantial gains in computation time. You should always try to use this feature as much as possible.

Just a small example for you:

In [4]:
import time

# A list comprehension implementation of the inner product operation. Note that I still use
# numpy arrays as arguments, but this won't help: we still need a proper vectorization.
def inner_product_list_implementation(a: np.ndarray, b: np.ndarray):
    if a.size != b.size:
        raise ValueError("The two iterables must have the same size")
    if a.ndim != 1 or b.ndim != 1:
        raise NotImplementedError()  # NotImplemented is a constant and cannot be raised
    return sum([x * y for x, y in zip(a, b)])  # sum of the element-wise products

def measure_time(func, *args, **kwargs):  # utility to measure computation time
    start = time.time()
    func(*args, **kwargs)
    end = time.time()
    print("--- time: %f ---" % (end - start))
    
    
a = np.random.normal(0, 1, 1000)
b = np.random.normal(0, 1, 1000)
print("For a simple implementation through a list comprehension:")
measure_time(inner_product_list_implementation, a, b)
print("For a numpy implementation:")
measure_time(np.inner, a, b)
# the time difference depends on the size of collections involved in computations

For a simple implementation through a list comprehension:
--- time: 0.000286 ---
For a numpy implementation:
--- time: 0.000025 ---
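
A single call measured with `time.time()` can be noisy. As a sketch of a more stable comparison, one could repeat each call many times with the standard `timeit` module. The helper `inner_product_loop` and the repetition count of 200 are choices made for this example only, and the printed timings will differ between machines.

```python
import timeit
import numpy as np

x = np.random.normal(0, 1, 1000)
y = np.random.normal(0, 1, 1000)

def inner_product_loop(u, v):
    # plain-Python inner product: multiply pairwise, then sum
    return sum(ui * vi for ui, vi in zip(u, v))

# total time over 200 repetitions of each implementation
loop_time = timeit.timeit(lambda: inner_product_loop(x, y), number=200)
numpy_time = timeit.timeit(lambda: np.inner(x, y), number=200)
print(f"loop: {loop_time:.4f} s, numpy: {numpy_time:.4f} s (200 runs each)")

# both implementations agree up to floating point error
same_result = bool(np.isclose(inner_product_loop(x, y), np.inner(x, y)))
print(same_result)
```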


There are also many other ways to create specific arrays. The examples are presented below:

In [5]:
zeros_array = np.zeros(4)
print("The array consists of 4 zeros:", zeros_array)

ones_array = np.ones(4)
print("The array consists of 4 ones:", ones_array)

identity_matrix = np.eye(3)  # note that, despite the name, it is a numpy array in Python
print("The identity matrix of size 3 x 3:", identity_matrix, sep="\n")

The array consists of 4 zeros: [0. 0. 0. 0.]
The array consists of 4 ones: [1. 1. 1. 1.]
The identity matrix of size 3 x 3:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]


Note that when you use the `print` function the output looks like a list, but if you just evaluate a numpy array in a notebook cell, you will instantly see the difference.

In [6]:
zeros_array

array([0., 0., 0., 0.])

As you can see from the identity matrix example above, numpy arrays can be multidimensional. Try to create a 2-dimensional array of zeros using the `numpy.zeros(...)` function (note that you should use `np` instead of `numpy` in this notebook).

In [None]:
# try here


You can also create a multidimensional array manually:

In [7]:
three_dimensional_array = np.array([[[1, 2], 
                                     [3, 4]], 
                                    [[5, 6], 
                                     [7, 8]]])

# ndim is a useful property of the numpy array
print('The number of dimensions of the array is:', three_dimensional_array.ndim)

# size too
print('The number of elements in the array is:', three_dimensional_array.size)

# you can also access shape of an array, it will correspond (sort of) to a mathematical "size of a matrix"
print('Shape of the array is:', three_dimensional_array.shape) # shape property returns a tuple

The number of dimensions of the array is: 3
The number of elements in the array is: 8
Shape of the array is: (2, 2, 2)


There are also some useful methods to transform the arrays. These are just some examples of them, which are the most popular:

In [8]:
# you could also specify more dimensions or use an exact value instead of -1; 
# reshape figures out the size of a dimension itself if you put -1.
print("We can reshape the array like this:", reshaped_array := three_dimensional_array.reshape(-1, 4), sep="\n")  
    
print("You can also transpose an array:", reshaped_array.transpose(), sep="\n")

print("The flattened version of the array is:", three_dimensional_array.ravel())

We can reshape the array like this:
[[1 2 3 4]
 [5 6 7 8]]
You can also transpose an array:
[[1 5]
 [2 6]
 [3 7]
 [4 8]]
The flattened version of the array is: [1 2 3 4 5 6 7 8]


Try to transpose a flat array. What do you get and why? How do you think you can fix that?

In [None]:
# try here:


You can subset elements from the array by specifying proper *indices* or *slices* inside square brackets right after the corresponding identifier:

In [9]:
slice_it = np.arange(25).reshape(5, 5)

print("Use indices to select elements:", slice_it[0, 0], sep="\n")
print("Use slices to select elements:", slice_it[:3, 1:3], sep="\n")
print("Use slices to select elements:", slice_it[:3, :], sep="\n")  

Use indices to select elements:
0
Use slices to select elements:
[[ 1  2]
 [ 6  7]
 [11 12]]
Use slices to select elements:
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]


Any proper slice will do: `start:end` or `start:end:step` (where each part is in fact optional). In addition, iterable objects that contain valid indices will also do:

In [10]:
print(slice_it[:, [1, 2]])
print(slice_it[(3, 4), [1, 2]])  # note that if you use several iterables as indices, 
                                 # numpy will try to broadcast them against each other, i.e. this 
                                 # gives you the elements at indices (3, 1) and (4, 2)
try:
    slice_it[(1, 3, 4), [1, 2]] 
except IndexError as error_text:
    print("You cannot specify two iterables of different sizes!", f"Error text: '{error_text}'", sep="\n")

[[ 1  2]
 [ 6  7]
 [11 12]
 [16 17]
 [21 22]]
[16 22]
You cannot specify two iterables of different sizes!
Error text: 'shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (2,) '


You can see that numpy arrays differ from the basic sequence types in this aspect too. For example, you cannot select elements of a list with an iterable of indices.
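
The difference can be checked directly. The sketch below uses a made-up list `plain_list` and shows the error a list raises, together with two ways around it.

```python
import numpy as np

plain_list = [10, 20, 30, 40]
try:
    plain_list[[0, 2]]                       # a list accepts only integers or slices
except TypeError as error_text:
    print(f"Lists reject iterable indices: '{error_text}'")

picked_from_list = [plain_list[i] for i in [0, 2]]  # explicit loop over the indices
picked_from_array = np.array(plain_list)[[0, 2]]    # numpy handles the index list itself
print(picked_from_list, picked_from_array)
```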

Note that even though the `slice_it` array is two-dimensional, we can specify a one-dimensional index to select a whole row:

In [11]:
slice_it[3]

array([15, 16, 17, 18, 19])

### Producing realizations of (pseudo-)random variables

The module that is responsible for random variable simulations is called `numpy.random`. You can find some of the available distributions below.

In [12]:
n_obs = 10

print(f"This is {n_obs} realizations of U[2, 5]:\n", np.random.uniform(low=2, high=5, size=n_obs))
print(f"This is {n_obs} realizations of a r.v. uniformly distributed on integer values in [2, 5):\n", 
      np.random.randint(low=2, high=5, size=n_obs))
print(f"This is {n_obs} realizations of a gaussian r.v. with mean 2 and st.dev. 4:\n", 
      np.random.normal(loc=2, scale=4, size=n_obs))

This is 10 realizations of U[2, 5]:
 [2.98181964 4.7262762  4.31786114 3.15410594 2.36080876 2.94094805
 2.61743937 3.35732478 2.05391737 2.09398573]
This is 10 realizations of a r.v. uniformly distributed on integer values in [2, 5):
 [2 2 3 2 4 4 3 4 2 4]
This is 10 realizations of a gaussian r.v. with mean 2 and st.dev. 4:
 [ 6.45121614  0.39676198  1.95631332  3.43166157  7.35690004  2.06571451
 -1.98160019 -0.90015354  7.42159615  0.53410108]


Note that these realizations are not truly random. In fact, nothing that your computer computes is truly random. These procedures are called "pseudo-random" because a deterministic algorithm produces them from an initial state, which by default is derived from the state of your system.

So, to make your code reproducible, it is good practice to specify a concrete value of a "seed" like this:

In [13]:
np.random.seed(1337)

People who run your script will then be able to check the reliability of your results: they will get the same results in the parts of your code that involve randomness (through numpy).
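
To illustrate, here is a minimal sketch showing that resetting the seed reproduces the same draws. The `numpy.random.default_rng(...)` generator at the end is my addition and is not used elsewhere in this notebook; it is an alternative that keeps the random state in a local object instead of the global one.

```python
import numpy as np

np.random.seed(1337)
first_draw = np.random.normal(size=3)

np.random.seed(1337)                 # resetting the seed restarts the same sequence
second_draw = np.random.normal(size=3)
print(np.array_equal(first_draw, second_draw))

# Alternative (an assumption of this sketch, not part of the notebook's workflow):
# a local generator that does not touch the global random state
rng = np.random.default_rng(1337)
local_draw = rng.normal(size=3)

np.random.seed(1337)                 # restore the state set in the cell above
```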

## JSON

JSON is a convenient format for storing and exchanging data. You will meet it quite often if you do web scraping, but its use is much broader than that. The format has a specific syntax. A simple example is presented below:

```{JSON}
{"holder": "Foo", 
 "account_no": 1337, 
 "balance": 100500}
```

However, it is not the only possible structure. It can also be a dictionary where values are lists, a nested dictionary, a list of dictionaries and similar structures. Note that strings in JSON must be enclosed in double quotes; single quotes `'` are not allowed.

In particular, these structures can be easily converted to Python objects. The `json` package gives us this functionality.

In [14]:
import json

Let's have a look at how we can convert a simple string written in a JSON format into a pythonic object:

In [15]:
json_str = '{"variable1": [4.45654794846065, 4.278310755227896, 3.009643269934777], "variable2": [3, 3, 3]}'
data = json.loads(json_str)
data

{'variable1': [4.45654794846065, 4.278310755227896, 3.009643269934777],
 'variable2': [3, 3, 3]}

The `json.loads(...)` function "loads" a string, i.e. transforms it to a dictionary in this case. 
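
The opposite direction and the quoting rule can be sketched as follows. The `nested` dictionary and the holder names in it are hypothetical; `json.dumps(...)` is the string counterpart of `json.loads(...)`.

```python
import json

# a nested structure: a dictionary whose value is a list of dictionaries
nested = {"holders": [{"name": "Foo", "balance": 100500}, {"name": "Bar", "balance": 0}]}

as_text = json.dumps(nested)        # Python object -> JSON string
round_trip = json.loads(as_text)    # JSON string -> Python object
print(as_text)

try:
    json.loads("{'holder': 'Foo'}") # single quotes are not valid JSON
except json.JSONDecodeError as error_text:
    single_quote_error = str(error_text)
    print("Invalid JSON:", single_quote_error)
```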

We can also write Python objects to text files with the "json" extension. We can use a *context manager*, which allows us to open files and work with them in different modes. To define a context manager, we use the keyword `with` together with the `open(...)` function. In the `open(...)` function we specify a path to a file and a mode in which we want to open it, e.g. "w" for writing (which first truncates an existing file), "a" to append content, "r" to read, and others.

In [17]:
with open("data.json", "w") as file:
    json.dump(data, file)

You could do it in another way using just the `open(...)` function. But you **have to** close a file after use, otherwise you can encounter some problems.

In [18]:
file = open("data.json", "w")
json.dump(data, file)
file.close()

So, use of a *context manager* is preferable: it closes the file for you, even if an error occurs inside the block.

You can also read files with json.

In [19]:
with open("data.json", "r") as file:
    data_new = json.load(file)

In [20]:
data_new

{'variable1': [4.45654794846065, 4.278310755227896, 3.009643269934777],
 'variable2': [3, 3, 3]}

Just as a comment, you can also read any text file line by line:

In [21]:
with open("data.json", "r") as file:  # our file consists of only one line, but you get the idea
    data_txt = []
    for line in file:
        data_txt.append(line)

You could also use functions like `file.readline(...)`, `file.readlines(...)`, `file.seek(...)` and others to work with files. Feel free to find out what they do.
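
As a starting point, here is a small sketch of these three methods on a temporary three-line file. The file content and the variable names are made up for the example.

```python
import os
import tempfile

# write a small three-line file to a temporary location
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\nsecond\nthird\n")
    tmp_path = tmp.name

with open(tmp_path, "r") as file:
    first_line = file.readline()   # reads one line, including the trailing "\n"
    remaining = file.readlines()   # reads all remaining lines into a list
    file.seek(0)                   # moves the reading position back to the start
    first_again = file.readline()

os.remove(tmp_path)                # clean up the temporary file
print(repr(first_line), remaining, repr(first_again))
```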

## Example: data simulation

Let's consider an application of `numpy`. In theoretical studies, you might want to simulate some data to test your algorithm and obtain some quantitative performance metrics for it under some theoretical assumptions.

In general, let's suppose that we want to generate an output according to the following linear model:
$$y = X \cdot \beta + \varepsilon$$
where $y \in \mathbb{R}^{n}$ is an output variable, $X \in \mathbb{R}^{n \times p}$ is a design matrix whose columns contain the covariates (variables) and whose rows correspond to observations, $\beta \in \mathbb{R}^{p}$ is a vector of coefficients, and $\varepsilon \sim \mathcal{N}(0_n, I_n)$ is a noise term.

But first, we need to assume something about the distribution of $X$. For our exercise, let's create a utility to simulate a design matrix and make it as abstract as we can. The following function is going to accept dictionaries that specify feature distributions and parameters for these distributions, as well as encodings for some features (we will get to that later).

In [22]:
from typing import Callable

def simulate_X(n: int, vars_dists: dict[str, Callable], dists_params: dict[str, dict], 
               encodings: dict[str, Callable]):
    # just for educational purposes, let's have our simulations in two formats: a dictionary and a numpy array
    # you can easily save the dictionary to a .json file if you wish
    X_dict = {}
    X_array = np.empty((n, 0))  # an array with n rows and no columns yet
    
    for varname in vars_dists:
        # use the function argument here, not a global variable with a similar name
        var_simulation = vars_dists[varname](**dists_params[varname], size=n)
        X_dict[varname] = var_simulation
        if varname in encodings:
            encoded_result, _ = encodings[varname](X_dict[varname])
            X_array = np.hstack((X_array, encoded_result))
        else:
            X_array = np.hstack((X_array, var_simulation.reshape(n, -1)))
    return X_dict, X_array

Sometimes we deal with variables that take three or more values from a discrete set. For example, these could be different cities / countries, occupations, car models and so on. To work with them, we usually need to transform (encode) them to some numerical features. One possible way of doing that is to use One Hot Encoding. Namely, if a variable can take $v$ possible values, we map it to $v$ indicators that take values 0 or 1 (note that to use them later in a statistical model with an intercept you will have to drop one of them to get rid of *multicollinearity*).

Let's implement a function which allows us to do that:

In [23]:
def one_hot_encoding(array: np.ndarray):
    unique_elements = np.unique(array)
    encoded_result = np.zeros((array.size,  unique_elements.size))
    for i in range(array.size):
        encoded_result[i, :] = array[i] == unique_elements
    return encoded_result, unique_elements
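
The loop over rows can also be replaced by a single broadcast comparison. The function `one_hot_encoding_vectorized` below is a sketch of that idea under the same conventions as above (it is my variant, not part of the original notebook), and the `jobs` array is a toy input.

```python
import numpy as np

def one_hot_encoding_vectorized(array: np.ndarray):
    unique_elements = np.unique(array)  # sorted unique values, shape (v,)
    # array[:, None] has shape (n, 1); comparing it with the (v,) vector of unique
    # values broadcasts to an (n, v) boolean matrix, which we convert to floats
    encoded_result = (array[:, None] == unique_elements).astype(float)
    return encoded_result, unique_elements

jobs = np.array(["tester", "data analyst", "tester"])
encoded, categories = one_hot_encoding_vectorized(jobs)
print(categories)
print(encoded)
```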

Now we can create our dataset. We will specify the following variables:
* age – integer number, age of a person
* female – binary, an indicator of sex
* occupation – one of the pre-specified values from the `occupations` list taken with equal probabilities

In [24]:
occupations = ["data analyst", "data engineer", "software engineer", "tester"]

var_dists = {"age": np.random.randint, "female": np.random.binomial, "occupation": np.random.choice}
dists_params = {"age": {"low": 18, "high": 65}, "female": {"n": 1, "p": 0.5}, "occupation": {"a": occupations}}

X_dict, X = simulate_X(100, vars_dists=var_dists, dists_params=dists_params, 
                       encodings={"occupation": one_hot_encoding})

X[:3, :]

array([[41.,  1.,  0.,  1.,  0.,  0.],
       [46.,  0.,  0.,  1.,  0.,  0.],
       [58.,  1.,  0.,  0.,  0.,  1.]])

Now we can finally move to creating our output variable according to the linear model:
$$y = X \cdot \beta + \varepsilon$$

Luckily, we can use the matrix algebra operations that `numpy` provides us with. Here we use, for example, a `.dot(...)` method of a numpy array.

In [25]:
beta = np.append(np.array([0.5, 0]), np.arange(0, 1, 0.25))
y = X.dot(beta) + np.random.normal(0, 1, size=X.shape[0])
y[:10]

array([21.66133414, 24.27141737, 30.10000886, 28.63220201, 20.53774962,
       28.88624474, 21.61354079, 18.85999115, 20.44635657, 14.5289177 ])

We could also multiply 2-d arrays, or a 2-d array and a 1-d array, using the `@` operator, which is available since Python 3.5. Keep in mind that for two one-dimensional arrays it returns their inner product, i.e. a scalar rather than an array.

In [26]:
(X @ beta)[:10]  # it is different from the above because there is no noise 

array([20.75, 23.25, 29.75, 29.  , 21.75, 29.25, 22.  , 18.75, 19.5 ,
       13.  ])
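
To make the behaviour of `@` with one-dimensional arrays concrete, here is a small sketch with toy arrays `u`, `v` and `M` (names chosen for this example only).

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])
M = np.arange(6).reshape(2, 3)

inner = u @ v            # two 1-d arrays: the inner product, a scalar
product = M @ u          # 2-d array times 1-d array: a 1-d array of length 2
outer = np.outer(u, v)   # the outer product needs np.outer (or explicit reshaping)
print(inner, product, outer.shape)
```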

## Other linear algebra tools: numpy.linalg

Also, feel free to check out the `linalg` module of numpy, which has more linear algebra related functions.

In [27]:
sigma = np.ones((3, 3))
sigma[np.diag_indices_from(sigma)] = np.arange(1, 4)
print("We have the following matrix:\n", sigma)
print("Determinant:", np.linalg.det(sigma))
print("Inverse:\n", np.linalg.inv(sigma))
print("Eigenvalues and eigenvectors:", *np.linalg.eig(sigma), sep='\n')
print("Solve a simple linear system >> sigma * x = (0, 1, 2)^T:", np.linalg.solve(sigma, np.arange(3)), sep='\n')

We have the following matrix:
 [[1. 1. 1.]
 [1. 2. 1.]
 [1. 1. 3.]]
Determinant: 2.0
Inverse:
 [[ 2.5 -1.  -0.5]
 [-1.   1.   0. ]
 [-0.5  0.   0.5]]
Eigenvalues and eigenvectors:
[4.21431974 0.32486913 1.46081113]
[[-0.39711255 -0.88765034 -0.23319198]
 [-0.52065737  0.42713229 -0.73923874]
 [-0.75578934  0.17214786  0.63178128]]
Solve a simple linear system >> sigma * x = (0, 1, 2)^T:
[-2.  1.  1.]
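
A useful habit is to verify such results numerically. The sketch below rebuilds the same matrix and checks the inverse, the eigen-decomposition and the solution of the linear system with `numpy.allclose(...)`, which compares arrays up to floating point error.

```python
import numpy as np

sigma = np.ones((3, 3))
sigma[np.diag_indices_from(sigma)] = np.arange(1, 4)

sigma_inv = np.linalg.inv(sigma)
eigenvalues, eigenvectors = np.linalg.eig(sigma)
x = np.linalg.solve(sigma, np.arange(3))

inverse_ok = np.allclose(sigma @ sigma_inv, np.eye(3))
# eigenvectors are stored by columns, so column j is scaled by eigenvalue j
eigen_ok = np.allclose(sigma @ eigenvectors, eigenvectors * eigenvalues)
solution_ok = np.allclose(sigma @ x, np.arange(3))
print(inverse_ok, eigen_ok, solution_ok)
```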


## Pandas

One particularly helpful package when you work with data that has a table-type structure is `pandas`. Let's import it:

In [28]:
import pandas as pd

It defines `pandas.DataFrame` class, so we can create dataframes. To create a dataframe, you can call a class constructor `pandas.DataFrame(...)` and pass some data into it. In fact, there are different ways to create dataframes. You can explore them in the corresponding docstring by executing `help(pandas.DataFrame)`.

Recall that we had some data in a dictionary format:

In [29]:
# let's create a utility to truncate our dictionary so you can see the structure
def truncate_dict(dictionary: dict, start: int, end: int):
    truncated_dictionary = {}
    for key in dictionary:
        truncated_dictionary[key] = dictionary[key][start:end]
    return truncated_dictionary

truncate_dict(X_dict, 0, 10)

{'age': array([41, 46, 58, 57, 43, 57, 44, 36, 38, 26]),
 'female': array([1, 0, 1, 0, 1, 0, 0, 0, 0, 1]),
 'occupation': array(['data engineer', 'data engineer', 'tester', 'software engineer',
        'data engineer', 'tester', 'data analyst', 'tester',
        'software engineer', 'data analyst'], dtype='<U17')}

Now let's pass it to the pandas DataFrame constructor to see what it gives us.

In [30]:
data = pd.DataFrame(X_dict)
data.head(10)

   age  female         occupation
0   41       1      data engineer
1   46       0      data engineer
2   58       1             tester
3   57       0  software engineer
4   43       1      data engineer
5   57       0             tester
6   44       0       data analyst
7   36       0             tester
8   38       0  software engineer
9   26       1       data analyst


Dataframes also have useful methods. You can see how we use the `.head(...)` method in the above cell. There are also other methods and properties of dataframes that help us explore their contents, like:

In [31]:
print('Explore the "tail":', data.tail(), sep='\n')
print('Explore what columns the dataframe has:', data.columns)  # notice the specific type of the output
print('Explore the shape and size:', str(data.shape) + ", " + str(data.size))

Explore the "tail":
    age  female     occupation
95   34       0   data analyst
96   45       0  data engineer
97   45       0   data analyst
98   21       1         tester
99   59       0   data analyst
Explore what columns the dataframe has: Index(['age', 'female', 'occupation'], dtype='object')
Explore the shape and size: (100, 3), 300


You can access a column of a dataframe using `[]` or as a property, like this:

In [32]:
print('Access the "age" column:', data["age"], sep='\n', end='\n\n')
print('Access it as a property of the dataset:', data.age, sep='\n')

Access the "age" column:
0     41
1     46
2     58
3     57
4     43
      ..
95    34
96    45
97    45
98    21
99    59
Name: age, Length: 100, dtype: int64

Access it as a property of the dataset:
0     41
1     46
2     58
3     57
4     43
      ..
95    34
96    45
97    45
98    21
99    59
Name: age, Length: 100, dtype: int64


But note that access as a property works only for existing columns. To create a new column, provide its name in `[]` right after the name of the dataframe and assign an iterable of a valid size to it.

In [33]:
data['wage'] = y  # this is the output we have simulated before
data.head()

   age  female         occupation       wage
0   41       1      data engineer  21.661334
1   46       0      data engineer  24.271417
2   58       1             tester  30.100009
3   57       0  software engineer  28.632202
4   43       1      data engineer  20.537750


Note that you can pick specific fields or slices from a dataframe using the properties `.loc[...]` and `.iloc[...]`. The former selects fields based on index labels, i.e. names of rows and names of columns (even numeric ones). The latter uses integer positions: for example, `.iloc[1:3, 3]` takes the elements that lie in the intersection of the 2nd and the 3rd rows with the 4th column. Be careful with slices: **for** `.loc[...]` **both start and end are included** (unlike usual Python slices), but **for** `.iloc[...]` **start is included and end is excluded**.

The following two lines give the same result.

In [34]:
data.iloc[1:10, 0:3]
data.loc[1:9, ["age", "female", "occupation"]]

   age  female         occupation
1   46       0      data engineer
2   58       1             tester
3   57       0  software engineer
4   43       1      data engineer
5   57       0             tester
6   44       0       data analyst
7   36       0             tester
8   38       0  software engineer
9   26       1       data analyst
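
The distinction between labels and positions is easier to see when the row labels are not `0, 1, 2, ...`. The sketch below uses a toy dataframe with made-up labels `10, 20, 30, 40`.

```python
import pandas as pd

toy = pd.DataFrame({"age": [41, 46, 58, 57]}, index=[10, 20, 30, 40])

by_label = toy.loc[10:30, "age"]   # labels 10, 20 and 30: the end label is included
by_position = toy.iloc[0:3, 0]     # positions 0, 1 and 2: the end position is excluded
print(by_label.tolist(), by_position.tolist())
```

Both selections return the same three values, although the slice bounds look very different.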


It is possible (and quite useful) to use boolean iterables for subsetting. For example:

In [36]:
print("For example, we can get all the observations with age between 22 and 40 and wage greater than 20:", 
      data.loc[data.age.between(22, 40) & data.wage.gt(20), :], sep='\n', end='\n\n')

print("For example, we can get all the data engineers and data analysts:", 
      data.loc[data.occupation.eq("data engineer") | (data.occupation == "data analyst"),  # parentheses are crucial
               ["age", "wage", "occupation"]], 
      sep='\n')

For example, we can get all the observations with age between 22 and 40 and wage greater than 20:
    age  female         occupation       wage
8    38       0  software engineer  20.446357
19   40       1  software engineer  20.589102
26   37       0  software engineer  20.641106
72   38       1      data engineer  20.617162
77   38       0       data analyst  20.264496
78   39       1  software engineer  21.157447

For example, we can get all the data engineers and data analysts:
    age       wage     occupation
0    41  21.661334  data engineer
1    46  24.271417  data engineer
4    43  20.537750  data engineer
6    44  21.613541   data analyst
9    26  14.528918   data analyst
10   27  14.987409  data engineer
11   24  12.530042  data engineer
14   42  21.001997   data analyst
15   19   8.620657   data analyst
23   29  14.897551  data engineer
24   19  10.941216   data analyst
27   64  32.583040   data analyst
28   35  17.749267   data analyst
31   

Note the use of element-wise logical operations `&` and `|`. You might also need `~` which is element-wise negation.
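
A short sketch of the negation operator on a toy dataframe (the data here is made up for the example):

```python
import pandas as pd

toy = pd.DataFrame({"age": [19, 30, 45],
                    "occupation": ["tester", "data analyst", "tester"]})

is_tester = toy.occupation == "tester"                      # a boolean series
not_tester = toy.loc[~is_tester, :]                         # everyone who is not a tester
# comparisons must be wrapped in parentheses when combined with & or |
young_non_tester = toy.loc[~is_tester & (toy.age < 40), :]
print(not_tester)
```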

We can also try to find out what is the type of a column:

In [37]:
type(data['wage'])

pandas.core.series.Series

That is a pandas series object. Series are designed to store a sequence of elements, and they provide mathematical and statistical methods to explore this data (as pandas dataframes do).

In [38]:
print("Get the maximum wage:", data['wage'].max())
print("Get the mean wage:", data['wage'].mean())
print("Get the min wage:", data['wage'].min())

Get the maximum wage: 34.90045780193024
Get the mean wage: 20.140629344684374
Get the min wage: 8.62065713769996


Note that you should always try to vectorize mathematical operations with series. The pandas objects allow for that; in fact, proper vectorization is computationally more efficient than iterating over elements. As a simple example, let's take the square of age (note: the dependence of wage on age is often claimed to be quadratic in labour economics):

In [39]:
data["age_sq"] = data.age**2
data["wage_log"] = np.log(data.wage)

You can use the `.agg(...)` method to apply different aggregating functions to different columns like this:

In [40]:
data.agg({"age": "mean", "wage": "max"})  # function names given as strings

age     39.160000
wage    34.900458
dtype: float64

This is a powerful tool which you will most likely use later too.

Several technical, but useful methods would be `.replace(...)`, `.drop(...)`, and `.rename(...)`.

Let's suppose that we want to replace some values with NaNs (just as an example):

In [41]:
data['occupation'] = data.occupation.replace(["tester", "software engineer"], np.nan)

And now we can drop all the rows that contain NaNs:

In [42]:
data.dropna(inplace=True)

You can try to get the index of the new dataframe using the `.index` property of the dataframe:

In [None]:
# try here


You can see that after we dropped some rows, the indices of the remaining rows did not change. Now, if we tried to slice the dataframe with the `.loc[]` property, we could run into problems (whereas `.iloc[]` would work). To avoid them, it is recommended to reset the index after introducing changes to the original dataset (if you do not need the old indices).

In [43]:
data.reset_index(drop=True, inplace=True)  # drop=True discards the old indices instead of keeping them as a column
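
The kind of problem meant here can be reproduced on a toy dataframe. In the sketch below (made-up data), the label `1` disappears after dropping a row, so `.loc[1]` fails while `.iloc[1]` still works; resetting the index with `drop=True` restores consecutive labels.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"occupation": ["tester", np.nan, "data analyst", "tester"]})
toy = toy.dropna()
remaining_labels = toy.index.tolist()   # the label 1 is gone
print(remaining_labels)

try:
    toy.loc[1]
except KeyError:
    print("Label 1 no longer exists, although position 1 does:", toy.iloc[1, 0])

toy_reset = toy.reset_index(drop=True)  # new labels 0, 1, 2; the old ones are discarded
print(toy_reset.index.tolist())
```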

You can drop columns (or rows) using the following method:

In [44]:
data.drop(columns=['wage'], inplace=True)

We can also rename columns:

In [45]:
data.rename(columns={'age': 'experience', 'age_sq': 'experience_sq'}, inplace=True)

Note that you could also rename your row indices.

It is very important to pay attention to the `inplace` parameters in pandas methods you use, because it defines whether a method mutates an initial object or returns a new one.

In addition, whenever you want to create a new dataframe from an existing one, consider using `.copy()` method that returns a copy of a dataframe.

In [47]:
data_new = data.loc[0:100, ["wage_log", "experience", "occupation"]].copy()

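The reason lies in how Python handles assignments and how pandas handles selections. A plain assignment only creates a second name for the same object, and a selection from a dataframe may share data with the original. Changes made through the new variable could then mutate the original object, which an explicit `.copy()` avoids.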
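
The difference between a second name and a real copy can be sketched with a toy dataframe (all names below are made up for the example):

```python
import pandas as pd

original = pd.DataFrame({"experience": [1, 2, 3]})

alias = original               # a second name for the same object
independent = original.copy()  # a separate object with its own data

alias.loc[0, "experience"] = 100        # also visible through `original`
independent.loc[1, "experience"] = -1   # does not affect `original`
print(original.experience.tolist())
print(independent.experience.tolist())
```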

You can simply save your data to a file using `.to_...` methods, like:

In [None]:
data.to_csv('class_5_data.csv')

To read this data later, use functions of form `pandas.read_...`:

In [None]:
data_read = pd.read_csv('class_5_data.csv')
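
One detail worth knowing: by default `.to_csv(...)` also writes the row index, which comes back as an extra unnamed column when the file is read. The sketch below shows this on a toy dataframe and uses in-memory text buffers (`io.StringIO`) instead of files, which is a choice made only to keep the example self-contained.

```python
import io
import pandas as pd

toy = pd.DataFrame({"experience": [41, 46], "female": [1, 0]})

with_index = io.StringIO()
toy.to_csv(with_index)                  # the row index is written as an unnamed first column
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()
print(columns_with_index)

without_index = io.StringIO()
toy.to_csv(without_index, index=False)  # skip the row index when writing
without_index.seek(0)
restored = pd.read_csv(without_index)
print(restored.columns.tolist())
```

Alternatively, `pandas.read_csv(...)` accepts `index_col=0` to treat the first column as the index when reading.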

## Afterword

Of course, pandas and numpy are massive packages. This notebook introduces their basic objects, properties and methods, so that you can get acquainted with them and build on that knowledge further.

Some of the topics that we have not covered here but that may interest you in pandas (so feel free to check them out):

* multi-indexing
* plotting with pandas (and in general with `matplotlib`)
* mapping functions to series (but always try to vectorize computations, then you won't need it often); one of the use cases is "renaming" values (use a dictionary)
* datetime format

Of course, you should not stop there. Your journey has just begun. Good luck!