# Python Data Structures for Data Science: Numpy and Pandas

In the last practical, we took a look at some basic concepts in Python. We covered:
1. Basic data types like numerics and strings.
2. Compound data types like lists, tuples, and dictionaries.
3. Control flow devices like if, else, for, and while.

These concepts are powerful, but most data scientists and analysts do not write for loops over lists to understand their data. Instead, they use higher-level objects to structure their data, namely numpy arrays and pandas dataframes. In this practical, we introduce these libraries and objects, giving a glimpse into how data analysis is done in practice.

For this practical, we generally follow the standard tutorials for these libraries found here:
- [Numpy](https://numpy.org/devdocs/user/quickstart.html)
- [Pandas](https://pandas.pydata.org/docs/user_guide/index.html)

When first encountering numpy arrays, it is natural to wonder why we even need them. After all, they look a lot like the ordinary lists we have already learned about. As we will see, they have several advantages. The first we will look at is speed. Let's run a small experiment: calculating the mean of one million numbers, first by iterating over a list and then by using a numpy array method.

In [1]:
import numpy as np
import time

In [2]:
heights = np.random.normal(size=1000000)  # this is a numpy array
heights_list = heights.tolist()  # this is a list

In [3]:
start_time = time.perf_counter()
sum_nums = 0
for i in heights_list:
    sum_nums = sum_nums + i
print('Mean: ', sum_nums / len(heights_list))
end_time = time.perf_counter()
list_time = end_time - start_time
print('Elapsed time: ', list_time)

Mean:  0.0017307978739592488
Elapsed time:  0.07947445799982233


In [4]:
start_time = time.perf_counter()
print('Mean: ', heights.mean())
end_time = time.perf_counter()
numpy_time = end_time - start_time
print('Elapsed time: ', numpy_time)

Mean:  0.001730797873959119
Elapsed time:  0.0011659589999908349


In [5]:
numpy_time / list_time

0.014670864442931355

There is in fact a big speed advantage in this case: the numpy version took about 1.5% of the time the loop needed, roughly a 68-fold speed-up!
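
A single `time.perf_counter()` measurement can be noisy. As a rough sketch (not part of the original experiment), the standard-library `timeit` module can repeat each computation several times. The names `loop_mean`, `values`, and `values_list`, the fixed seed, and the repetition count of 5 are illustrative choices, and the exact timings will differ from machine to machine.

```python
import timeit
import numpy as np

# Illustrative data; the fixed seed is an assumption made for reproducibility.
rng = np.random.default_rng(0)
values = rng.normal(size=1_000_000)   # numpy array
values_list = values.tolist()         # equivalent plain list

def loop_mean(nums):
    """Mean computed with an explicit Python loop."""
    total = 0.0
    for x in nums:
        total += x
    return total / len(nums)

# Run each version 5 times and keep the fastest run to reduce noise.
list_best = min(timeit.repeat(lambda: loop_mean(values_list), number=1, repeat=5))
numpy_best = min(timeit.repeat(lambda: values.mean(), number=1, repeat=5))

print('Loop mean:  ', loop_mean(values_list))
print('Numpy mean: ', values.mean())
print('Speed-up factor: ', list_best / numpy_best)
```

Taking the minimum over several runs is a common way to limit the influence of other processes on the measurement.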

Another advantage is the large number of built-in functions and methods. These allow us to transform our data quickly and with very few lines of code:

In [6]:
test_np_array = np.array([1, 2, 3, 4, 5])

In [9]:
# log transform
my_logs = np.log(test_np_array)
my_logs

array([0.        , 0.69314718, 1.09861229, 1.38629436, 1.60943791])

In [10]:
# exponential transform
np.exp(my_logs)

array([1., 2., 3., 4., 5.])

In [13]:
# square roots
my_sqrts = np.sqrt(test_np_array)
my_sqrts

array([1.        , 1.41421356, 1.73205081, 2.        , 2.23606798])

In [14]:
np.square(my_sqrts)

array([1., 2., 3., 4., 5.])

In [16]:
# element-wise product of two arrays
my_sqrts * my_logs

array([0.        , 0.98025814, 1.9028523 , 2.77258872, 3.59881258])

Yet another advantage is that arrays provide a natural setting for linear algebra, something lists do not offer. We now demonstrate this connection and introduce the numpy library and its core objects more systematically.
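
To make the contrast with lists concrete, here is a minimal sketch of how the `+` and `*` operators behave on each type. The variable names `list_a`, `list_b`, `arr_a`, and `arr_b` are made up for illustration.

```python
import numpy as np

list_a = [1, 2, 3]
list_b = [4, 5, 6]
arr_a = np.array(list_a)
arr_b = np.array(list_b)

# For lists, + means concatenation and * means repetition.
print(list_a + list_b)   # [1, 2, 3, 4, 5, 6]
print(list_a * 2)        # [1, 2, 3, 1, 2, 3]

# For arrays, the same operators act element by element.
print(arr_a + arr_b)     # [5 7 9]
print(arr_a * 2)         # [2 4 6]
```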

The main object in numpy is the "ndarray" (n-dimensional array) or just "array". Let's see some examples:

In [19]:
# ndarrays can behave like vectors!

vec_1 = np.array([1, 2, 3, 4])
vec_2 = np.array([2, 4, 6, 8])

print('Vector sum: ', vec_1 + vec_2)
print('Element-wise product: ', vec_1 * vec_2)
print('Dot product: ', np.dot(vec_1, vec_2))

Vector sum:  [ 3  6  9 12]
Element-wise product:  [ 2  8 18 32]
Dot product:  60
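
The dot product is the sum of the element-wise product: 2 + 8 + 18 + 32 = 60. The short sketch below checks this and shows that the `@` operator gives the same result for one-dimensional arrays. The name `elementwise` is introduced here only for illustration.

```python
import numpy as np

vec_1 = np.array([1, 2, 3, 4])
vec_2 = np.array([2, 4, 6, 8])

# Multiply matching entries, then add them up.
elementwise = vec_1 * vec_2
print(elementwise.sum())       # 60

# Two equivalent ways to get the dot product directly.
print(np.dot(vec_1, vec_2))    # 60
print(vec_1 @ vec_2)           # 60
```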


In [23]:
# ndarrays can behave like matrices!
mat_1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
mat_2 = np.array([[1, 1], [2, 2], [3, 3]])
mat_3 = np.array([[2, 4, 6], [3, 6, 9], [4, 8, 12]])

print(mat_1)
print(mat_2)
print(mat_3)

[[1 2 3]
 [4 5 6]
 [7 8 9]]
[[1 1]
 [2 2]
 [3 3]]
[[ 2  4  6]
 [ 3  6  9]
 [ 4  8 12]]


In [25]:
# Matrix sums
mat_1 + mat_3

array([[ 3,  6,  9],
       [ 7, 11, 15],
       [11, 16, 21]])

In [26]:
# Matrix multiplication
mat_1 @ mat_2

array([[14, 14],
       [32, 32],
       [50, 50]])
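
Matrix multiplication only works when the number of columns on the left matches the number of rows on the right. As a small sketch, the `shape` attribute makes this easy to check: a (3, 3) matrix times a (3, 2) matrix gives a (3, 2) result, while reversing the order fails.

```python
import numpy as np

mat_1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
mat_2 = np.array([[1, 1], [2, 2], [3, 3]])

print(mat_1.shape)             # (3, 3)
print(mat_2.shape)             # (3, 2)
print((mat_1 @ mat_2).shape)   # (3, 2)

# Reversing the order pairs 2 columns with 3 rows, so numpy raises an error.
try:
    mat_2 @ mat_1
except ValueError as err:
    print('Shapes do not align:', err)
```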

In [30]:
# ndarrays can be higher-order things (tensors)
# note: the nested list must be wrapped in np.array to become an ndarray
ten_1 = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 1, 1], [2, 2, 2]]])
ten_1

array([[[1, 2, 3],
        [4, 5, 6]],

       [[1, 1, 1],
        [2, 2, 2]]])
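
As a small sketch, wrapping a nested list like the one above in `np.array` gives a three-dimensional ndarray whose `ndim` and `shape` attributes describe its structure. The name `ten` is used here only for illustration.

```python
import numpy as np

ten = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 1, 1], [2, 2, 2]]])

print(type(ten))     # <class 'numpy.ndarray'>
print(ten.ndim)      # 3 (number of axes)
print(ten.shape)     # (2, 2, 3): 2 blocks, each with 2 rows and 3 columns
print(ten[0, 1, 2])  # 6: block 0, row 1, column 2
```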