# Part 7 - Using NumPy 
by Kaan Kabalak @ witfuldata.com

## The Amazing World of Data Science Libraries

All right, now we are talking! This part is our first step into the world of data science.

NumPy (short for Numerical Python) is one of the most popular data science libraries. What are the others? Let's see:

* Matplotlib (for data visualization)
* Pandas (for data manipulation and analysis)
* SciPy (for statistical and scientific computations, short for Scientific Python)
* Scikit-learn (ready-to-use algorithms for data preparation and machine learning)

Why did we start with NumPy?

Because much of what these libraries do is built on top of NumPy.

In this part, we are going to go through some of the core NumPy functionalities that are often used by data science libraries. We are also going to take a first look at the statistical concept of descriptive statistics through NumPy's statistical functions. 

Please note that, like previous parts of this tutorial, this part focuses only on the parts of NumPy that you need in order to understand how other libraries build on it. This is not a tutorial that aims to teach every aspect of NumPy.

Let's see what NumPy has to offer!

## Installing & Importing Libraries

I am not going to go over how to install Python libraries. There are many tutorials about it. It is something simple that takes only a few minutes at most, and if you installed Anaconda as I suggested earlier, you do not have to install NumPy separately. Anaconda already comes with NumPy installed.


After you install a library you can import it with the import keyword. Here we will import numpy as np:

In [1]:
import numpy as np

## NumPy Arrays

NumPy arrays are list-like collection variables that can hold single or multiple values. They look like Python's lists but are different in various ways. We will take a look at some of those differences but first let's see how these arrays are defined. 

A NumPy array is defined by calling NumPy's array function and passing a list to it as an argument. Like this:

In [2]:
a_array = np.array([17, 54, 62, 75, 86])
type(a_array)

numpy.ndarray

Another important aspect of NumPy arrays is dimension. 

One-dimensional arrays are defined by passing a single list to np.array as the argument.

In [3]:
# One dimensional array
oned_array = np.array([1, 2, 3, 4])
oned_array

array([1, 2, 3, 4])

Before we move on to defining multi-dimensional arrays, let's take a look at the attributes of NumPy arrays. Attributes are accessed like methods, but without the parentheses ().

In [4]:
# Check the dimension
oned_array.ndim

1

In [5]:
# Check the shape (a one-dimensional array has a single axis, so the tuple has only one entry)
oned_array.shape

(4,)

In [6]:
# Check the size (The total number of elements)
oned_array.size

4

We can define two-dimensional arrays by passing a list of lists to np.array as the argument.

In [7]:
# Two-dimensional array
twod_array = np.array(
    [
     [2, 4, 6, 8],
     [1, 3, 5, 7], 
     [9, 5, 7, 3]
     ]
    )
twod_array

array([[2, 4, 6, 8],
       [1, 3, 5, 7],
       [9, 5, 7, 3]])

In [8]:
# Check the dimension
twod_array.ndim

2

In [9]:
# Check the shape (the number of rows, the number of columns)
twod_array.shape

(3, 4)

In [10]:
# Check the size
twod_array.size

12

Like two dimensional arrays, we can also define three dimensional arrays with a list of list of lists (try to read it fast :))

In [11]:
threed_array = np.array([[[0, 1, 2, 3],
                           [4, 5, 6, 7]],

                          [[0, 1, 2, 3],
                           [4, 5, 6, 7]],

                          [[0, 1, 2, 3],
                           [4, 5, 6, 7]]])
threed_array

array([[[0, 1, 2, 3],
        [4, 5, 6, 7]],

       [[0, 1, 2, 3],
        [4, 5, 6, 7]],

       [[0, 1, 2, 3],
        [4, 5, 6, 7]]])

Here we can see that the rows and columns are grouped together in 3 groups. There are 2 rows and 4 columns in each group.

In [12]:
# Check the dimension
threed_array.ndim

3

In [13]:
# Check the shape
threed_array.shape # 3 groups, 2 rows and 4 columns in each group

(3, 2, 4)

In [14]:
# Check the size
threed_array.size

24
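The three attributes are tied together: ndim is the length of the shape tuple, and size is the product of the numbers in it. Here is a minimal sketch to check this. The names demo_array, ndim_from_shape and size_from_shape are only for illustration and are not used elsewhere in this tutorial.

```python
import numpy as np

# Illustrative array with the same layout as threed_array: 3 groups, 2 rows, 4 columns
demo_array = np.array([[[0, 1, 2, 3], [4, 5, 6, 7]],
                       [[0, 1, 2, 3], [4, 5, 6, 7]],
                       [[0, 1, 2, 3], [4, 5, 6, 7]]])

# ndim is simply the length of the shape tuple
ndim_from_shape = len(demo_array.shape)

# size is the product of the axis lengths: 3 * 2 * 4
size_from_shape = int(np.prod(demo_array.shape))

print(demo_array.ndim, ndim_from_shape)
print(demo_array.size, size_from_shape)
```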

## Reshaping arrays

We can also reshape arrays. The size of the reshaped array has to be the same as the size of the original array. In other words, the total number of elements must be the same in both. 

In [15]:
twod_array = np.array([[2, 4, 6, 8],[1, 3, 5, 7], [9, 5, 7, 3]]) # Array with 3 rows and 4 columns
twod_array

array([[2, 4, 6, 8],
       [1, 3, 5, 7],
       [9, 5, 7, 3]])

In [16]:
res_two = twod_array.reshape(4, 3) # Reshape into 4 rows and 3 columns
res_two

array([[2, 4, 6],
       [8, 1, 3],
       [5, 7, 9],
       [5, 7, 3]])

In [17]:
res_three = threed_array.reshape(2, 3, 4) # Reshape into 2 groups, 3 rows and 4 columns in each group
res_three

array([[[0, 1, 2, 3],
        [4, 5, 6, 7],
        [0, 1, 2, 3]],

       [[4, 5, 6, 7],
        [0, 1, 2, 3],
        [4, 5, 6, 7]]])
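Two details of reshape are worth trying out. First, you can pass -1 for one dimension and let NumPy work it out from the total size. Second, a shape with a different total size raises a ValueError. The sketch below is a small illustration with made-up variable names (demo, inferred, flat); it is not part of the original notebook.

```python
import numpy as np

# Same values as twod_array: 3 rows, 4 columns, 12 elements in total
demo = np.array([[2, 4, 6, 8], [1, 3, 5, 7], [9, 5, 7, 3]])

# -1 asks NumPy to infer that dimension from the total size (12 / 2 = 6)
inferred = demo.reshape(2, -1)
print(inferred.shape)

# reshape(-1) flattens to one dimension; elements keep their row-by-row order
flat = demo.reshape(-1)
print(flat)

# A shape whose product is not 12 is rejected
try:
    demo.reshape(5, 3)
except ValueError as err:
    print("reshape failed:", err)
```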

## Practical Array Definitions

We can define arrays in a very practical manner, without writing the values manually, by using the np.arange function.

In [18]:
# Use np.arange to define an array with values between the range of 0-50, increasing by 10 at every step
rn_array = np.arange(0, 50, 10) # Start from 0, finish before 50, add 10 at every step
rn_array

array([ 0, 10, 20, 30, 40])

In [19]:
# Start from 15, finish before 135, add 15 at every step
rntwo_array = np.arange(15, 135, 15)
rntwo_array

array([ 15,  30,  45,  60,  75,  90, 105, 120])
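np.arange also accepts fewer arguments, and it combines nicely with reshape. The following is a small sketch of those variations. The variable names (first_five, two_to_five, grid) are only illustrative.

```python
import numpy as np

# With one argument, arange starts at 0 and uses a step of 1
first_five = np.arange(5)
print(first_five)

# With two arguments, the step defaults to 1; the stop value is still excluded
two_to_five = np.arange(2, 6)
print(two_to_five)

# arange produces 12 values here (0, 5, ..., 55), which reshape turns into a 3 x 4 grid
grid = np.arange(0, 60, 5).reshape(3, 4)
print(grid)
```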

## Array Operations (Broadcasting)

Here is the point where NumPy arrays behave very differently from Python lists. When you add two lists together, Python just concatenates them: it joins one list to the end of the other. It won't return the sum of the elements. Also, there are many arithmetic operations that you cannot do with Python lists without writing loops. Let's see how Python lists can be limited for arithmetic operations.

In [20]:
# Adding lists together
b_list = [1, 2, 3, 4]

c_list = [5, 6, 7, 8]

b_list + c_list

[1, 2, 3, 4, 5, 6, 7, 8]

To add the elements of c_list to b_list, we would have to write a loop (which is time consuming and inefficient in computational terms):

In [21]:
# Define an empty list
sum_list = []
# Walk through both lists in parallel; zip pairs up the elements that share the same index
for x, y in zip(b_list, c_list):
    sum_list.append(x + y)
sum_list

[6, 8, 10, 12]

Unlike Python's lists, NumPy arrays are designed to make mathematical operations easy. Arithmetic operators are applied element by element: each element of one array is combined with the element at the same position in the other array. When the two operands have different shapes (for example, an array and a single number), NumPy stretches the smaller one to match the larger one. This is called broadcasting. To understand this better, let's repeat the operation in the code block above, but this time with NumPy arrays:

In [22]:
# Add an array to another
b_vector = np.array([1, 2, 3, 4])

c_vector = np.array([5, 6, 7, 8])
b_vector + c_vector

array([ 6,  8, 10, 12])

In [23]:
# Multiply arrays
b_mdarr = np.array([[1, 2, 3, 4],
                    [5, 6, 7, 8]])

c_mdarr = np.array([[12, 13, 14, 15],
                    [11, 17, 18, 19]])

b_mdarr * c_mdarr

array([[ 12,  26,  42,  60],
       [ 55, 102, 126, 152]])

In [24]:
# Use array variables in operations
fr_arr = np.array([5, 6, 7, 8])
sn_arr = fr_arr / 2
sn_arr

array([2.5, 3. , 3.5, 4. ])
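Operations also work when the two operands have different but compatible shapes, because NumPy stretches the smaller operand to match the larger one. Here is a minimal sketch with a single number, a row and a column. The names matrix, row and column are illustrative and not part of the original notebook.

```python
import numpy as np

matrix = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8]])        # shape (2, 4)

# Single number: applied to every element
scaled = matrix * 10
print(scaled)

# Row of 4 values, shape (4,): stretched down across both rows
row = np.array([100, 200, 300, 400])
plus_row = matrix + row
print(plus_row)

# Column of 2 values, shape (2, 1): stretched across all four columns
column = np.array([[10], [20]])
plus_column = matrix + column
print(plus_column)
```

If the shapes are not compatible (for example, adding a row of 3 values to this 2 x 4 array), NumPy raises a ValueError instead of guessing.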

## Descriptive Statistical Functions

Descriptive Statistics are statistical measures that help us understand how our data is structured.
We will get into descriptive statistics in future parts, but it would be beneficial to take a look at some NumPy functions that help us describe our data.  

### Max, Min and Range

First, we have the max and min methods, which return the maximum and minimum values of an array. The np.ptp function (short for "peak to peak") returns the range (max - min) of an array.

In [25]:
x_arr = np.array([3, 10, 20, 35, 40, 58, 67, 74, 87, 96, 120])

x_arr.max()

120

In [26]:
x_arr.min()

3

In [27]:
np.ptp(x_arr)

117

### Percentiles

What are percentiles? The nth percentile is the value below which roughly n percent of all values fall. For example, when we check the 25th percentile, we are looking for the value that is larger than about 25 percent of all values. If no element sits exactly at that position, NumPy by default interpolates between the two nearest values, which is why a result does not have to be an element of the array. Here is an example:

In [28]:
x_arr = np.array([3, 10, 20, 35, 40, 58, 67, 74, 87, 96, 120])

print(np.percentile(x_arr, 10))   # Value that is (or would be) larger than 10 percent of all values
print(np.percentile(x_arr, 25))   # Value that is (or would be) larger than 25 percent of all values
print(np.percentile(x_arr, 50))   # Value that is (or would be) larger than 50 percent of all values
print(np.percentile(x_arr, 75))   # Value that is (or would be) larger than 75 percent of all values
print(np.percentile(x_arr, 100))  # The maximum: no value in the array is larger

10.0
27.5
58.0
80.5
120.0
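Notice that 27.5 and 80.5 are not elements of the array. By default, np.percentile interpolates linearly between the two neighbouring values. The sketch below reproduces that default by hand. percentile_linear is a hypothetical helper written only for this illustration; it is not a NumPy function.

```python
import numpy as np

values = np.array([3, 10, 20, 35, 40, 58, 67, 74, 87, 96, 120])

def percentile_linear(data, q):
    """Percentile q (0-100) using linear interpolation between neighbours."""
    ordered = np.sort(data)
    # Fractional position of the percentile among the sorted values
    position = (len(ordered) - 1) * q / 100
    lower = int(np.floor(position))
    upper = int(np.ceil(position))
    fraction = position - lower
    # Move 'fraction' of the way from the lower neighbour to the upper one
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# Position 2.5 lies halfway between the values 20 and 35
manual_25 = percentile_linear(values, 25)
print(manual_25, np.percentile(values, 25))
```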


### Mean & Median

We can also take a look at measures like the mean and the median. These statistical measures are explained as follows:

* <strong> Mean </strong> : The arithmetic average of the array. Calculated by dividing the sum of all elements by the number of elements.

* <strong> Median </strong> : The middle value of the sorted data. In other words, the value that is larger than 50% of all values. If there are two values in the middle, then their average is taken as the median.


In [29]:
x_arr = np.array([3, 10, 20, 35, 40, 58, 67, 74, 87, 96, 120])

In [30]:
# Mean
np.mean(x_arr)

55.45454545454545

Let's manually calculate the mean of x_arr:

In [31]:
# The sum of all values
sum_val = np.sum(x_arr)

# The number of all elements (array length)
arr_num = len(x_arr)

# Divide
arr_mean = sum_val / arr_num
arr_mean

55.45454545454545

As you can see, the result is the same as the one produced by the np.mean( ) function.

Now let's take a look at the median of our array.

In [32]:
# Median
np.median(x_arr)

58.0
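Just like the mean, the median can be calculated by hand: sort the values, then take the middle one, or the average of the two middle ones when the number of elements is even. Here is a minimal sketch. manual_median is a hypothetical helper name used only for this illustration.

```python
import numpy as np

def manual_median(data):
    """Median via sorting: middle value, or mean of the two middle values."""
    ordered = np.sort(data)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of elements: there is a single middle value
        return float(ordered[middle])
    # Even number of elements: average the two values in the middle
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_values = np.array([3, 10, 20, 35, 40, 58, 67, 74, 87, 96, 120])   # 11 elements
even_values = np.array([10, 15, 20, 30, 38, 43, 45, 50, 60, 70])      # 10 elements

print(manual_median(odd_values), np.median(odd_values))
print(manual_median(even_values), np.median(even_values))
```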

One important thing about mean and median is how they relate to outliers. Outliers can be understood as extremely low or high values in your data (in this case, the data in your array). The mean is sensitive to outliers. This means that if you have a very high or very low value in your data, it can distort the mean and give you the wrong idea about the situation.

Median, on the other hand, is not so sensitive to outliers. 

If you have outliers in your data, it is better to use the median. 

Let's see an example to understand better. 

In [33]:
# An array without an outlier
nrm_array = np.array([10, 15, 20, 30, 38, 43, 45, 50, 60, 70])
print("The mean of the array is {},\nThe median of the array is {}".format(np.mean(nrm_array), np.median(nrm_array)))

The mean of the array is 38.1,
The median of the array is 40.5


In [34]:
# An array with an outlier (890)
nrm_array = np.array([10, 15, 20, 30, 38, 43, 45, 50, 60, 70, 890])
print("The mean of the array is {},\nThe median of the array is {}".format(np.mean(nrm_array), np.median(nrm_array)))

The mean of the array is 115.54545454545455,
The median of the array is 43.0


As you can see, the mean changed drastically while the median stayed mostly the same. This makes the median much more reliable when outlier values are present.

### Variance and Standard Deviation

* <strong> Variance </strong> : The average of the squared distances of each element to the mean. It is calculated by subtracting the mean value of the array from each element, squaring each result and then summing all of the squared results. After that, you divide this sum by the length of your array.

The differences (the results of the subtraction) are squared so that negative values do not cancel out positive ones. Think of it like this: if an element is smaller than the mean, the result of the subtraction is negative, while for elements larger than the mean it is positive. If you added the raw differences up, the negative and positive distances would cancel each other out (in fact, they always sum to zero). In the end, you would have the wrong idea about the distances of elements to the mean. To prevent this, you square the result of the subtraction (element value - mean value), because the square of a negative value is always positive.

Variance shows how much our data varies within itself. Why is this important? 

Because it tells you how well a randomly picked value will represent the mean of your data. Low variance means that the values sit close to the mean, so there is a higher probability that a randomly picked value will represent it.

* <strong> Standard Deviation </strong> : Standard deviation is the square root of variance. When we square the distances for variance, we also change the scale: variance is expressed in squared units of the data. Standard deviation undoes this by taking the square root. Standard deviation is used to describe the spread of data more often than variance on its own, because it is expressed in the same unit as the data itself, whereas variance is in squared units and is therefore harder to compare with the data.
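The reasoning about squaring can be checked with a few lines. This is a minimal sketch with illustrative variable names (values, deviations, squared). It shows that the raw distances to the mean cancel out, while the squared distances do not.

```python
import numpy as np

values = np.array([3, 10, 20, 35, 40, 58, 67, 74, 87, 96, 120])

# Distance of every element to the mean (negative below the mean, positive above)
deviations = values - np.mean(values)

# Raw distances cancel out: their sum is zero, up to floating-point rounding
print(np.isclose(np.sum(deviations), 0.0))

# Squared distances are never negative, so nothing cancels
squared = deviations ** 2
print(np.all(squared >= 0))

variance = np.mean(squared)     # average squared distance (squared units)
std_dev = np.sqrt(variance)     # back on the scale of the data
print(variance, std_dev)
```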


In [35]:
x_arr = np.array([3, 10, 20, 35, 40, 58, 67, 74, 87, 96, 120])

# Variance
np.var(x_arr)

1274.611570247934

Let's have some fun and see how variance can be calculated manually!

In [36]:
# The distances of values to the mean
distances_x = x_arr - np.mean(x_arr)

# The square of the distances (to make negative values neutral)
distances_sq_x = distances_x ** 2

# The sum of the squared distances
sum_dsq_x = np.sum(distances_sq_x)

# Divide the result by the number of elements
arr_variance = sum_dsq_x/len(x_arr)
arr_variance

1274.611570247934

The maximum value of x_arr is 120, but the variance we got here is 1274, more than 10 times larger when compared to 120! As I have explained earlier, this is the reason standard deviation is used more commonly than variance. Standard deviation is the square root of variance and is much closer to the unit of the original data.

In [37]:
# Standard deviation
arr_std = np.std(x_arr)
arr_std

35.70170262393565

In [38]:
print("The square root of the variance of the array is: {}".format(np.sqrt(np.var(x_arr))))
print("The standard deviation of the array is: {}".format(np.std(x_arr)))

The square root of the variance of the array is: 35.70170262393565
The standard deviation of the array is: 35.70170262393565


### Covariance and Correlation

<strong>Covariance</strong> helps us measure how two variables are related to each other. For each observation, we multiply the distance of one variable (in this case x_arr) to its mean by the distance of the other variable (in this case y_arr) to its mean. We then add up these products and divide the sum by the total number of observations (elements) minus 1 (the length of the arrays - 1).

Why do we subtract 1 from the number of observations?

We will go through tutorials which focus more intensely on statistical topics. I will explain this in those parts, because it would take a lot of space in a tutorial that is intended to get you started. For now, let's just say it is not crucial that you understand this. The more important thing is what covariance tells you about the relation between variables and how it's the building block of correlation coefficients (which you will see a few blocks below). If you understood that, don't worry much about the rest. You got the essence of it all fine.

Note that covariance may have a negative value, which indicates a negative relation between the variables (one decreases while the other increases, and vice versa). You should also know that the covariance of a variable with itself is that variable's variance. Keep in mind that np.cov divides by n - 1 by default while np.var divides by n, so np.cov(x_arr) matches np.var(x_arr, ddof=1) rather than np.var(x_arr).



In [39]:
x_arr = np.array([3, 10, 20, 35, 40, 58, 67, 74, 87, 96, 120])
y_arr = np.array([7, 13, 25, 38, 42, 59, 68, 79, 88, 98, 127])

In [40]:
np.cov(x_arr, y_arr)

array([[1402.07272727, 1407.62727273],
       [1407.62727273, 1417.07272727]])

This result shows us the covariance of each variable with itself (which is its variance, computed here with the n - 1 divisor) and its covariance with the other variable.

The resulting array can be read like this:

[cov(x_arr, x_arr), cov(x_arr, y_arr)

cov(x_arr, y_arr), cov(y_arr, y_arr)
]

In this case this means that:

* The covariance of x_arr with itself (its variance, using the n - 1 divisor) is 1402.07272727 (indicated in top left). This is why it differs from the 1274.61 returned by np.var earlier, which divides by n.
* The covariance between x_arr and y_arr is 1407.62727273 (indicated in top right and bottom left)
* The covariance of y_arr with itself (its variance, using the n - 1 divisor) is 1417.07272727 (indicated in bottom right)
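Here is a minimal sketch comparing the two divisors. np.var divides by n unless ddof=1 is passed, whereas np.cov divides by n - 1 unless ddof=0 is passed. This accounts for the gap between 1274.61 (from np.var earlier) and 1402.07 (top left of the matrix above). The variable names are illustrative.

```python
import numpy as np

x = np.array([3, 10, 20, 35, 40, 58, 67, 74, 87, 96, 120])

population_var = np.var(x)             # divides by n (default ddof=0)
sample_var = np.var(x, ddof=1)         # divides by n - 1
cov_diagonal = np.cov(x, x)[0, 0]      # np.cov divides by n - 1 by default

print(population_var)
print(sample_var)

# The diagonal of the covariance matrix matches the n - 1 version of the variance
print(np.isclose(sample_var, cov_diagonal))

# Asking np.cov to divide by n instead recovers the default np.var result
cov_diagonal_n = np.cov(x, x, ddof=0)[0, 0]
print(np.isclose(cov_diagonal_n, population_var))
```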


In [41]:
# Distance of x_arr to its mean
x_dist = x_arr - np.mean(x_arr)

# Distance of y_arr to its mean
y_dist = y_arr - np.mean(y_arr)

# Multiply the distances (take their product)
multip_dist = x_dist * y_dist

# Sum of the product of distances
sum_mdist = np.sum(multip_dist)

# Divide by number of elements - 1
cov_manu_xy = sum_mdist / (len(x_arr) - 1)
cov_manu_xy

1407.6272727272728

To make covariance easier to interpret we can use the np.corrcoef( ) function. This returns the <strong>Pearson Correlation Coefficient</strong>. Like covariance, it is a measure of how the values of two variables are related to each other. It is, however, easier to read than covariance, whose magnitude depends on the scale of the original data and does not come in a standardized form.

For example, the covariance of the x and y variables was 1407, which is more than 10 times larger than the maximum values of both arrays. It is hard to tell whether a covariance of 1407 is large or small relative to the range of values in the arrays. There is no standard. Because of this, we cannot meaningfully compare the covariance of x and y with (for example) the covariance of x and z.

The Pearson Correlation Coefficient, on the other hand, is standardized. Its value is always between -1 and 1.

* -1 indicates a perfect negative correlation (one value decreases while the other increases)
* 1 indicates a perfect positive correlation (one value increases while the other also increases)
* Values close to 0 indicate little or no linear relation between the variables


It is possible to compare the correlation coefficient between variables due to this standardized form. We can, for example, compare the correlation coefficient of x and y with the correlation coefficient of x and z.

In [42]:
np.corrcoef(x_arr,y_arr)

array([[1.        , 0.99863396],
       [0.99863396, 1.        ]])

The resulting array can be read like this:

[corrcoef(x_arr, x_arr), corrcoef(x_arr, y_arr)

corrcoef(x_arr, y_arr), corrcoef(y_arr, y_arr)
]

The correlation coefficient of a variable with itself is always 1.

Here, the correlation coefficient between x_arr and y_arr is about 0.9986, which indicates a very strong positive correlation between the two because it is very close to 1.

The Pearson Correlation Coefficient is calculated by dividing the covariance of two variables (in this case x_arr and y_arr) by the product (multiplication) of the standard deviations of these two variables.

##### Note: I will not walk through a manual calculation for the corrcoef( ) function here. If you try it yourself, use the same divisor for the covariance and the standard deviations: np.cov divides by n - 1 by default while np.std divides by n, so pass ddof=1 to np.std. With consistent divisors, the manual result matches NumPy's result up to floating-point rounding.
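For readers who want to try the formula anyway, here is a minimal sketch. It assumes that the covariance and both standard deviations use the same divisor (n - 1, via ddof=1). Under that assumption, the result should agree with np.corrcoef up to floating-point rounding. The variable names are illustrative.

```python
import numpy as np

x = np.array([3, 10, 20, 35, 40, 58, 67, 74, 87, 96, 120])
y = np.array([7, 13, 25, 38, 42, 59, 68, 79, 88, 98, 127])

# Covariance between x and y (np.cov divides by n - 1 by default)
cov_xy = np.cov(x, y)[0, 1]

# Standard deviations with ddof=1 so that they also divide by n - 1
std_x = np.std(x, ddof=1)
std_y = np.std(y, ddof=1)

# Pearson correlation: covariance divided by the product of the standard deviations
manual_r = cov_xy / (std_x * std_y)
numpy_r = np.corrcoef(x, y)[0, 1]

print(manual_r, numpy_r)
print(np.isclose(manual_r, numpy_r))
```

Mixing the divisors (for example, np.cov with its default and np.std with its default) would give a slightly different number, so keeping them consistent is the important part.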