NumPy and Basic Pandas

Introduction

Now that we have introduced the fundamentals of Python, it's time to learn about NumPy and Pandas.

NumPy

NumPy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays. It also has strong integration with Pandas, which is another powerful tool for manipulating financial data.

Python packages like NumPy and Pandas contain classes and methods which we can use by importing the package:


In [None]:
import numpy as np

Basic NumPy Arrays

A NumPy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. Here we make an array by passing a list of Apple stock prices:

In [None]:
price_list = [143.73, 145.83, 143.68, 144.02, 143.5, 142.62]
price_array = np.array(price_list)
print(price_array, type(price_array))

[143.73 145.83 143.68 144.02 143.5  142.62] <class 'numpy.ndarray'>


#Example

In [None]:
price = [156.73, 123.43, 193.65, 150.19, 167.5, 196.78]
prices_array = np.array(price)
print(prices_array, type(prices_array))

[156.73 123.43 193.65 150.19 167.5  196.78] <class 'numpy.ndarray'>


Notice that the type of array is "ndarray" which is a multi-dimensional array. If we pass np.array() a list of lists, it will create a 2-dimensional array.

In [None]:
Ar = np.array([[1,3],[2,4]])
print(Ar, type(Ar))

[[1 3]
 [2 4]] <class 'numpy.ndarray'>


#Example

In [None]:
Arr = np.array([[3,2],[8,6],[5,4]])
print(Arr, type(Arr))

[[3 2]
 [8 6]
 [5 4]] <class 'numpy.ndarray'>


We get the dimensions of an ndarray using the .shape attribute:

In [None]:
print(Ar.shape)

(2, 2)


#Example

In [None]:
print(Arr.shape)

(3, 2)
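
To make the "grid of values indexed by a tuple" description concrete, here is a minimal sketch that reuses the same values as Ar. The variable name grid is only illustrative and is not part of the original notebook.

```python
import numpy as np

# Same values as Ar above; `grid` is an illustrative name
grid = np.array([[1, 3], [2, 4]])

print(grid.ndim)     # number of dimensions
print(grid.shape)    # (rows, columns)
print(grid.dtype)    # one element type shared by every entry
print(grid[0, 1])    # element at row 0, column 1
print(grid[(1, 0)])  # the same kind of lookup with an explicit tuple
```

The exact dtype printed depends on the platform, but every entry of the array shares it.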


If we create a 2-dimensional array (i.e. a matrix), each row can be accessed by index:

In [None]:
print(Ar[0])
print(Ar[1])

[1 3]
[2 4]


#Example

In [None]:
print(Arr[1])
print(Arr[2])

[8 6]
[5 4]


If we want to access the matrix by column instead:

In [None]:
print('the first column: ', Ar[:,0])
print('the second column: ', Ar[:,1])

the first column:  [1 2]
the second column:  [3 4]


#Example

In [None]:
print('the first column: ', Arr[:,0])
print('the second column: ', Arr[:,1])

the first column:  [3 8 5]
the second column:  [2 6 4]
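
Row and column selections can also be combined in a single index. The following sketch, which rebuilds the 3×2 array from the example and uses illustrative variable names, shows a few such combinations.

```python
import numpy as np

Arr = np.array([[3, 2], [8, 6], [5, 4]])

first_two_rows = Arr[:2]              # rows 0 and 1, all columns
last_row_second_col = Arr[-1, 1]      # single element from the last row
second_col_of_first_two = Arr[:2, 1]  # column 1 restricted to rows 0 and 1

print(first_two_rows)
print(last_row_second_col)
print(second_col_of_first_two)
```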


Array Functions

NumPy has built-in functions that let us perform calculations on arrays. For example, we can apply the natural logarithm to each element of an array:

In [None]:
print(np.log(price_array))

[4.96793654 4.98244156 4.9675886  4.96995218 4.96633504 4.96018375]


#Example

In [None]:
print(np.log(prices_array))

[5.05452458 4.81567419 5.26605241 5.01190116 5.12098335 5.28208635]


Other functions return a single value:


In [None]:
# Rounded to two decimals to keep the output short
print(np.round(np.mean(price_array), 2))
print(np.round(np.std(price_array), 2))
print(np.round(np.sum(price_array), 2))
print(np.max(price_array))

143.9
0.97
863.38
145.83


#Example

In [None]:
print(np.mean(prices_array))
print(np.std(prices_array))
print(np.sum(prices_array))
print(np.max(prices_array))

164.71333333333334
25.338842821951353
988.28
196.78


The functions above return the mean, standard deviation, total and maximum value of an array.
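
As a rough check of what these functions compute, the sketch below reproduces the mean and standard deviation by hand for the Apple prices. By default np.std returns the population standard deviation (ddof=0). The names manual_mean and manual_std are illustrative.

```python
import numpy as np

price_array = np.array([143.73, 145.83, 143.68, 144.02, 143.5, 142.62])

# Mean: total divided by the number of elements
manual_mean = np.sum(price_array) / len(price_array)

# Population standard deviation: root of the mean squared deviation
manual_std = np.sqrt(np.mean((price_array - manual_mean) ** 2))

print(np.isclose(manual_mean, np.mean(price_array)))
print(np.isclose(manual_std, np.std(price_array)))

# Passing ddof=1 gives the sample standard deviation, which is slightly larger
print(np.std(price_array, ddof=1) > np.std(price_array))
```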

Pandas

Pandas is one of the most powerful tools for dealing with financial data. First we need to import Pandas:


In [None]:
import pandas as pd

Series

A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floats, Python objects, etc.).

We create a Series by calling pd.Series(data), where data can be a dictionary, an array or just a scalar value.

In [None]:
price = [143.73, 145.83, 143.68, 144.02, 143.5, 142.62]
s = pd.Series(price)
s

0    143.73
1    145.83
2    143.68
3    144.02
4    143.50
5    142.62
dtype: float64

#Example

In [None]:
prices = [156.73, 123.43, 193.65, 150.19, 167.5, 196.78]
z = pd.Series(prices)
z

0    156.73
1    123.43
2    193.65
3    150.19
4    167.50
5    196.78
dtype: float64
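
The cells above build a Series from a list. As a small sketch of the other two inputs mentioned earlier, a dictionary and a scalar, consider the following (the labels and the value 100.0 are made up for illustration):

```python
import pandas as pd

# From a dictionary: the keys become the index labels
from_dict = pd.Series({'a': 143.73, 'b': 145.83, 'c': 143.68})

# From a scalar: the value is repeated for every label in the given index
from_scalar = pd.Series(100.0, index=['a', 'b', 'c'])

print(from_dict)
print(from_scalar)
```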

We can customize the indices of a new Series:

In [None]:
s = pd.Series(price,index = ['a','b','c','d','e','f'])
s

a    143.73
b    145.83
c    143.68
d    144.02
e    143.50
f    142.62
dtype: float64

#Example

In [None]:
z = pd.Series(prices,index = ['i','ii','iii','iv','v','vi'])
z

i      156.73
ii     123.43
iii    193.65
iv     150.19
v      167.50
vi     196.78
dtype: float64

Or we can change the indices of an existing Series:

In [None]:
s.index = [6,5,4,3,2,1]
s

6    143.73
5    145.83
4    143.68
3    144.02
2    143.50
1    142.62
dtype: float64

#Example

In [None]:
z.index = [1,2,3,4,5,6]
z

1    156.73
2    123.43
3    193.65
4    150.19
5    167.50
6    196.78
dtype: float64

A Series is like a list in that it can be sliced by position:


In [None]:
print(s[1:])
print(s[:-2])

5    145.83
4    143.68
3    144.02
2    143.50
1    142.62
dtype: float64
6    143.73
5    145.83
4    143.68
3    144.02
dtype: float64


#Example

In [None]:
print(z[1:])
print(z[:-3])

2    123.43
3    193.65
4    150.19
5    167.50
6    196.78
dtype: float64
1    156.73
2    123.43
3    193.65
dtype: float64


A Series is also like a dictionary: its values can be set or fetched by index label:

In [None]:
print(s[4])
s[4] = 0
print(s)

143.68
6    143.73
5    145.83
4      0.00
3    144.02
2    143.50
1    142.62
dtype: float64


#Example

In [None]:
print(z[3])
z[3] = 0
print(z)

193.65
1    156.73
2    123.43
3      0.00
4    150.19
5    167.50
6    196.78
dtype: float64
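
The dictionary analogy extends a little further. The sketch below uses string labels (chosen here only to keep labels and positions visually distinct) to show membership tests, a lookup with a default, and assignment to a new label.

```python
import pandas as pd

s = pd.Series([143.73, 145.83, 143.68], index=['a', 'b', 'c'])

print('b' in s)          # membership tests the index labels, like dict keys
print(s.get('z', -1.0))  # .get returns the default when the label is missing
s['d'] = 144.02          # assigning to a new label adds an entry
print(s)
```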


A Series can also have a name attribute, which is used as the column label when we build a Pandas DataFrame from several Series.


In [None]:
s = pd.Series(price, name = 'Apple Price List')
print(s)
print(s.name)

0    143.73
1    145.83
2    143.68
3    144.02
4    143.50
5    142.62
Name: Apple Price List, dtype: float64
Apple Price List


#Example

In [None]:
z = pd.Series(prices, name = 'coffe Price List')
print(z)
print(z.name)

0    156.73
1    123.43
2    193.65
3    150.19
4    167.50
5    196.78
Name: coffe Price List, dtype: float64
coffe Price List
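
To see where the name ends up, here is a hedged sketch that places two named Series side by side with pd.concat, which is one of several ways to build a DataFrame. The names 'Apple' and 'Coffee' and the variable df are illustrative.

```python
import pandas as pd

apple = pd.Series([143.73, 145.83, 143.68], name='Apple')
coffee = pd.Series([156.73, 123.43, 193.65], name='Coffee')

# Each Series becomes one column, labeled with its name
df = pd.concat([apple, coffee], axis=1)

print(df)
print(list(df.columns))
```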


Time Index

Pandas has a built-in function specifically for creating date indices: pd.date_range(). We use it to create a new index for our Series:

In [None]:
time_index = pd.date_range('2017-01-01',periods = len(s),freq = 'D')
print(time_index)
s.index = time_index
print(s)

DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06'],
              dtype='datetime64[ns]', freq='D')
2017-01-01    143.73
2017-01-02    145.83
2017-01-03    143.68
2017-01-04    144.02
2017-01-05    143.50
2017-01-06    142.62
Freq: D, Name: Apple Price List, dtype: float64


#Example

In [None]:
time_index = pd.date_range('2021-02-01',periods = len(z),freq = 'D')
print(time_index)
z.index = time_index
print(z)

DatetimeIndex(['2021-02-01', '2021-02-02', '2021-02-03', '2021-02-04',
               '2021-02-05', '2021-02-06'],
              dtype='datetime64[ns]', freq='D')
2021-02-01    156.73
2021-02-02    123.43
2021-02-03    193.65
2021-02-04    150.19
2021-02-05    167.50
2021-02-06    196.78
Freq: D, Name: coffe Price List, dtype: float64
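
The freq argument is not limited to calendar days. As a small sketch, the cell below uses 'B' (business days), which may suit price data better because it skips weekends. The start date and the name business_days are illustrative choices.

```python
import pandas as pd

# 2017-01-02 is a Monday; 'B' generates weekdays only
business_days = pd.date_range('2017-01-02', periods=6, freq='B')

print(business_days)
print(business_days.dayofweek.tolist())  # Monday=0 ... Friday=4
```

The sixth date falls on the following Monday because the weekend is skipped.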


Series elements are usually accessed with the iloc[] and loc[] indexers. iloc[] selects elements by integer position, and loc[] selects them by index label.

iloc[] is necessary when the index of a Series consists of integers. Take our previously defined Series as an example:


In [None]:
s.index = [6,5,4,3,2,1]
print(s)
print(s[1])

6    143.73
5    145.83
4    143.68
3    144.02
2    143.50
1    142.62
Name: Apple Price List, dtype: float64
142.62


#Example

In [None]:
z.index = [1,2,3,4,5,6]
print(z)
print(z[1])

1    156.73
2    123.43
3    193.65
4    150.19
5    167.50
6    196.78
Name: coffe Price List, dtype: float64
156.73


If we intended to take the second element of the Series, this would be a mistake: because the index consists of integers, s[1] looks up the label 1 rather than the second position. To access the element we want, we use iloc[]:

In [None]:
print(s.iloc[1])

145.83


#Example

In [None]:
print(z.iloc[1])

123.43
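
The difference between the two indexers is easiest to see side by side. This sketch uses a short Series with a descending integer index, so that label 1 and position 1 point at different entries.

```python
import pandas as pd

s = pd.Series([143.73, 145.83, 143.68], index=[3, 2, 1])

print(s.loc[1])   # label 1 is the last entry
print(s.iloc[1])  # position 1 is the second entry
```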


While working with time series data, we often use time as the index. Pandas provides us with various methods to access the data by time index.

In [None]:
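# time_index now holds the 2021 dates from the example above, so rebuild the 2017 index
s.index = pd.date_range('2017-01-01', periods=len(s), freq='D')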
print(s['2017-01-03'])

143.68


#Example

In [None]:
z.index = pd.date_range('2021-02-01', periods=len(z), freq='D')
print(z['2021-02-03'])

193.65


We can even access a range of dates:


In [None]:
print(s['2017-01-02':'2017-01-05'])

2017-01-02    145.83
2017-01-03    143.68
2017-01-04    144.02
2017-01-05    143.50
Freq: D, Name: Apple Price List, dtype: float64


#Example

In [None]:
print(z['2021-02-02':'2021-02-06'])

2021-02-02    123.43
2021-02-03    193.65
2021-02-04    150.19
2021-02-05    167.50
2021-02-06    196.78
Freq: D, Name: coffe Price List, dtype: float64
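
Note that a slice by date label includes both endpoints, whereas a positional slice excludes the stop position. The sketch below rebuilds the Apple Series to contrast the two; the names by_label and by_position are illustrative.

```python
import pandas as pd

s = pd.Series([143.73, 145.83, 143.68, 144.02, 143.5, 142.62],
              index=pd.date_range('2017-01-01', periods=6, freq='D'))

by_label = s.loc['2017-01-02':'2017-01-04']  # both end dates are included
by_position = s.iloc[1:3]                    # the stop position is excluded

print(len(by_label), len(by_position))
```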


Square brackets also accept a Boolean condition, which lets us filter a Series by value:


In [None]:
print(s[s < np.mean(s)])
print(s[(s > np.mean(s)) & (s < np.mean(s) + 1.64*np.std(s))])

2017-01-01    143.73
2017-01-03    143.68
2017-01-05    143.50
2017-01-06    142.62
Name: Apple Price List, dtype: float64
2017-01-04    144.02
Name: Apple Price List, dtype: float64


#Example

In [None]:
print(z[z > np.mean(z)])
print(z[(z < np.mean(z)) & (z < np.mean(z) + 1.64*np.std(z))])

2021-02-03    193.65
2021-02-05    167.50
2021-02-06    196.78
Name: coffe Price List, dtype: float64
2021-02-01    156.73
2021-02-02    123.43
2021-02-04    150.19
Name: coffe Price List, dtype: float64


As demonstrated, we can use logical operators like & (and), | (or) and ~ (not) to group multiple conditions.
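
The cells above only exercise &. As a brief sketch of the other two operators, the following uses the example prices with made-up thresholds (130, 150 and 190). Each condition needs its own parentheses when it is written inline, because these operators bind more tightly than the comparisons.

```python
import pandas as pd

z = pd.Series([156.73, 123.43, 193.65, 150.19, 167.5, 196.78])

low = z < 130
high = z > 190

print(z[low | high])         # either condition holds
print(z[~high])              # negation: everything not above 190
print(z[(z > 150) & ~high])  # above 150 but not above 190
```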