# From Zero to Hero - Data Analysis with Python - NumPy

This document reviews the basic and most commonly used syntax for data manipulation and analysis in Python, progressing from the most basic concepts to the more complex ones. It also aims to encourage any developer to contribute and add value to this knowledge base.

This section reviews the basic syntax of the Python **NumPy module** for data processing.

### Content:
* [Introduction](#intro)
* [Basic creations](#basics)
    * [Ndarray composition](#composition)
    * [Arithmetic operations](#arithmetic)
    * [Indexing](#indexing)
    * [Axes/dimensions transposition](#transposition)
    * [Axis](#axis)
* [Ndarray Operations](#operation)
    * [Unary Functions](#u_func)
    * [Binary Functions](#b_func)
    * [Ndarrays based on a condition (np.where)](#np_where)
    * [Mathematical and statistical functions](#math)
    * [Sorting ndarrays](#sort)
    * [Grouping functions](#group)
    
    
    



<a id="intro"></a>
### Introduction

In Python, lists serve the purpose of arrays, but they are slow to process for numerical work.
NumPy aims to provide an array object that can be much faster than traditional Python lists (a figure of up to 50x is often cited; the actual gain depends on the operation and the data size).
The array object in NumPy is called ndarray, and NumPy provides many supporting functions that make working with it easy. Arrays are used very frequently in data science, where speed and resource usage matter.

**Basic features of ndarray**
<ul>
<li>An ndarray can contain elements of <b>ANY TYPE</b></li>
<li>All elements of an ndarray must have <b>THE SAME TYPE.</b></li>
<li>The size of an ndarray (number of elements) is defined at creation time and cannot be changed.</li>
<li>But the organization of these elements between different dimensions can be modified.</li>
<li>Basic usage of any NumPy element</li>
</ul>    
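
The two type rules and the fixed-size rule listed above can be checked with a short sketch. The variable names (`mixed`, `flat`, `grid`) are hypothetical and used only for illustration.

```python
import numpy as np

# Mixing integers and a float: NumPy stores everything with one common dtype.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)             # float64

# Reshaping reorganizes the same elements; the total count stays fixed.
flat = np.arange(6)
grid = flat.reshape(2, 3)
print(flat.size, grid.size)    # 6 6
print(grid.shape)              # (2, 3)
```

The integer inputs are upcast to float so that every element shares the same type.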
    
Remember that NumPy is not part of the Python standard library, so it must ALWAYS be imported before use, either as a whole module (usually aliased as np) or component by component.

In [93]:
import numpy as np

<a id="basics"></a>
### Basic creation of ndarrays

In [94]:
v_array_1 = np.arange(5, 10) 
v_array_1

array([5, 6, 7, 8, 9])

In [95]:
array_uni = np.array([1, 2, 3, 4, 5])                  # Unidimensional
array_uni

array([1, 2, 3, 4, 5])

In [96]:
array_multi = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])   # Multidimensional
array_multi

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

<a id="composition"></a>
#### Ndarray composition
* **dtype:** Type of the ndarray content.
* **ndim:** Number of dimensions / axes of the ndarray.
* **shape:** Structure / shape of the ndarray, that is, number of elements in each of the axes / dimensions.
* **size:** Total number of elements in the ndarray.

In [97]:
v_array = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

In [98]:
v_array.dtype  # data type (a single one for the whole array)

dtype('int32')

In [99]:
v_array.ndim  # number of dimensions

2

In [100]:
v_array.shape # shape and dimensions

(3, 4)
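
The list above also mentions **size**, which the cells do not show. The following sketch adds it and relates it to **shape**. Note that the `int32` reported earlier depends on the platform and NumPy version; other setups commonly report `int64`. The name `v_float` is hypothetical.

```python
import numpy as np

v_array = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# size is the total number of elements: the product of the shape entries.
print(v_array.size)                              # 12
print(v_array.size == np.prod(v_array.shape))    # True

# The dtype can also be chosen explicitly instead of relying on the default.
v_float = v_array.astype(np.float64)
print(v_float.dtype)                             # float64
```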

<a id="arithmetic"></a>
#### Arithmetic operations

In [101]:
v_array = np.array([10, 20, 30, 40, 50, 60], dtype=np.float64)

In [102]:
v_array + v_array  # sum of arrays (element by element)

array([ 20.,  40.,  60.,  80., 100., 120.])

In [103]:
v_array - v_array  # subtraction of arrays (element by element)

array([0., 0., 0., 0., 0., 0.])

In [104]:
v_array * v_array  # array multiplication (element by element)

array([ 100.,  400.,  900., 1600., 2500., 3600.])
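
Arithmetic also works between an array and a single number, which is applied to every element. This is a small sketch reusing the same values; the result names (`halved`, `shifted`, `ratio`) are illustrative only.

```python
import numpy as np

v_array = np.array([10, 20, 30, 40, 50, 60], dtype=np.float64)

halved = v_array / 2          # every element divided by the scalar
shifted = v_array + 1         # every element increased by one
ratio = v_array / v_array     # array division (element by element)

print(halved)    # [ 5. 10. 15. 20. 25. 30.]
print(shifted)   # [11. 21. 31. 41. 51. 61.]
print(ratio)     # [1. 1. 1. 1. 1. 1.]
```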

<a id="indexing"></a>
#### Indexing

In [105]:
v_array = np.array([[[1, 2, 3], [5, 6, 7]], [[10, 11, 12], [14, 15, 16]]])
v_array

array([[[ 1,  2,  3],
        [ 5,  6,  7]],

       [[10, 11, 12],
        [14, 15, 16]]])

In [106]:
v_array[1]  # Recursive indexing - first level

array([[10, 11, 12],
       [14, 15, 16]])

In [107]:
v_array[1][0]  # Recursive indexing - second level

array([10, 11, 12])

In [108]:
v_array[1][0][2]  # Recursive indexing - third level

12
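
The chained brackets above can also be written as a single comma-separated index, and any position can be replaced by a slice. A minimal sketch on the same array:

```python
import numpy as np

v_array = np.array([[[1, 2, 3], [5, 6, 7]], [[10, 11, 12], [14, 15, 16]]])

# One index per axis, separated by commas: same element as v_array[1][0][2].
print(v_array[1, 0, 2])       # 12

# A colon selects everything along that axis.
first_rows = v_array[:, 0, :]     # first row of each 2x3 block
print(first_rows)                 # [[ 1  2  3]
                                  #  [10 11 12]]

# Slices select ranges: last two columns of the first block.
last_cols = v_array[0, :, 1:]
print(last_cols)                  # [[2 3]
                                  #  [6 7]]
```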

In [109]:
v_array_rnd = np.random.randn(4, 4)
v_array_rnd

array([[-0.93373938,  0.34571805, -0.10253557,  0.65452356],
       [-0.46643791, -1.36174858,  0.09067704, -0.0126964 ],
       [-0.58842914,  0.37864838, -1.60463872,  0.72362846],
       [-0.62003125, -0.9852809 ,  1.49544358, -0.59755879]])

In [110]:
v_array_rnd[v_array_rnd < 0]  # Boolean indexing over values

array([-0.93373938, -0.10253557, -0.46643791, -1.36174858, -0.0126964 ,
       -0.58842914, -1.60463872, -0.62003125, -0.9852809 , -0.59755879])
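
Because the random values above change on every run, the following sketch repeats the Boolean indexing idea on a small fixed array (the names `values`, `mask` and `clipped` are hypothetical). It also shows that a mask can be used for assignment and that conditions can be combined with `&`.

```python
import numpy as np

values = np.array([[-1.5, 0.3], [2.0, -0.7]])

# The comparison itself produces a Boolean array of the same shape.
mask = values < 0
print(mask)                   # [[ True False]
                              #  [False  True]]

# Assigning through the mask changes only the selected elements.
clipped = values.copy()
clipped[mask] = 0.0
print(clipped)                # [[0.  0.3]
                              #  [2.  0. ]]

# Conditions are combined with & (and) or | (or); parentheses are required.
small = values[(values > -1) & (values < 1)]
print(small)                  # [ 0.3 -0.7]
```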

<a id="transposition"></a>
#### Axes/dimensions transposition

In [111]:
v_array = np.arange(20)
v_array

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

In [112]:
v_array_resh = v_array.reshape(4, 5) # Axes/dimensions modification
v_array_resh

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

In [113]:
v_array_resh.T  # Axes/dimensions transposition

array([[ 0,  5, 10, 15],
       [ 1,  6, 11, 16],
       [ 2,  7, 12, 17],
       [ 3,  8, 13, 18],
       [ 4,  9, 14, 19]])
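
A few details of `reshape` and `.T` can be sketched as follows: one dimension may be given as -1 so NumPy infers it, transposition swaps the shape, and a shape that does not match the number of elements raises an error. The name `v_auto` is illustrative.

```python
import numpy as np

v_array = np.arange(20)
v_array_resh = v_array.reshape(4, 5)

# -1 asks NumPy to infer that dimension from the total size (20 / 4 = 5).
v_auto = v_array.reshape(-1, 4)
print(v_auto.shape)             # (5, 4)

# Transposing a (4, 5) array gives a (5, 4) array.
print(v_array_resh.T.shape)     # (5, 4)

# 20 elements cannot fill a 3 x 7 grid, so reshape raises ValueError.
try:
    v_array.reshape(3, 7)
except ValueError as err:
    print("reshape failed:", err)
```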

<a id="axis"></a>
#### Axis

In [114]:
v_array_resh.sum(axis=1)  # Apply the function to each row (one result per row)

array([10, 35, 60, 85])

In [115]:
v_array_resh.sum(axis=0)  # Apply the function to each column (one result per column)

array([30, 34, 38, 42, 46])
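
One way to remember the axis argument is that the given axis is the one that disappears from the result. A short sketch on the same 4 x 5 array, where `keepdims=True` is shown as an optional way to keep the reduced axis with length 1:

```python
import numpy as np

v_array_resh = np.arange(20).reshape(4, 5)

# No axis: every element is reduced to a single number.
print(v_array_resh.sum())                              # 190

# axis=0 removes the first axis (4 rows), leaving 5 column totals.
print(v_array_resh.sum(axis=0).shape)                  # (5,)

# axis=1 removes the second axis (5 columns), leaving 4 row totals.
print(v_array_resh.sum(axis=1).shape)                  # (4,)

# keepdims=True keeps the reduced axis as a dimension of length 1.
print(v_array_resh.sum(axis=1, keepdims=True).shape)   # (4, 1)
```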

<a id="operation"></a>
### Ndarray operations
NumPy offers so-called "universal functions" (ufuncs) that operate on arrays element by element. Depending on the number of input arrays, there are two types of universal functions.

<a id="u_func"></a>
#### Unary functions
Functions that require a single ndarray as a parameter.

* **abs, fabs:** Absolute value.
* **sqrt:** Square root (equivalent to array ** 0.5).
* **square:** Power squared (equivalent to array ** 2).
* **exp:** Exponential (e raised to the power of each element).
* **log, log10, log2, log1p:** Logarithms in different bases.
* **sign:** Sign (+ = 1 / - = -1 / 0 = 0).
* **ceil:** Ceiling.
* **floor:** Floor.
* **rint:** Round to the nearest integer.
* **modf:** Returns two arrays, one with the fractional part and the other with the integer part.
* **isnan:** Returns a Boolean array indicating whether the value is NaN or not.
* **isfinite, isinf:** Returns a Boolean array indicating if the value is finite or infinite.
* **cos, cosh, sin, sinh, tan, tanh:** Trigonometric and hyperbolic functions.
* **arccos, arccosh, arcsin, arcsinh, arctan, arctanh:** Inverse trigonometric and inverse hyperbolic functions.
* **logical_not:** Boolean inverse of all array values (equivalent to ~array for Boolean arrays).

In [116]:
v_array = np.array([1,2,3,4,5])
np.abs(v_array)

array([1, 2, 3, 4, 5])

In [117]:
np.cosh(v_array)

array([ 1.54308063,  3.76219569, 10.067662  , 27.30823284, 74.20994852])

In [118]:
np.sign(v_array)

array([1, 1, 1, 1, 1])
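
Since the array above contains only positive integers, `abs` and `sign` leave little to see. The sketch below applies a few of the listed unary functions to hypothetical values that include negatives, fractions and NaN.

```python
import numpy as np

values = np.array([-2.5, -0.5, 0.0, 1.5, 4.0])

print(np.abs(values))     # [2.5 0.5 0.  1.5 4. ]
print(np.sign(values))    # [-1. -1.  0.  1.  1.]
print(np.floor(values))   # [-3. -1.  0.  1.  4.]
print(np.sqrt(np.array([4.0, 9.0, 16.0])))   # [2. 3. 4.]

# modf returns two arrays: fractional parts first, integer parts second.
frac, whole = np.modf(np.array([1.25, -2.5, 3.0]))
print(frac)               # [ 0.25 -0.5   0.  ]
print(whole)              # [ 1. -2.  3.]

# isnan flags the missing values element by element.
nan_flags = np.isnan(np.array([1.0, np.nan, 3.0]))
print(nan_flags)          # [False  True False]
```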

<a id="b_func"></a>
#### Binary functions
Functions that receive two arrays as parameters.

* **add:** Addition of the elements of the two arrays (equivalent to array1 + array2).
* **subtract:** Subtract the elements of the two arrays (equivalent to array1 - array2).
* **multiply:** Multiply the elements of the two arrays (equivalent to array1 * array2).
* **divide, floor_divide:** Divide the elements of the two arrays (equivalent to array1 / (or //) array2).
* **power:** Raises the elements of the first array to the powers of the second (equivalent to array1 ** array2).
* **maximum, fmax:** Calculates the maximum of the two arrays (element by element). fmax ignores NaN.
* **minimum, fmin:** Calculates the minimum of the two arrays (element by element). fmin ignores NaN.
* **mod:** Calculates the remainder of the division of the two arrays (equivalent to array1 % array2).
* **greater, greater_equal, less, less_equal, equal, not_equal:** Comparisons on the elements of both ndarrays (element by element).
* **logical_and, logical_or, logical_xor:** Boolean operations on the elements of both ndarrays (element by element).

In [119]:
v_array_1 = np.random.randn(3, 3)
v_array_1

array([[-0.35565431,  2.01523612,  1.19744644],
       [-1.08602917, -2.71230386,  0.20383996],
       [-0.89435243, -0.00877193, -1.1563722 ]])

In [120]:
v_array_2 = np.random.randn(3, 3)
v_array_2

array([[ 0.31574901,  1.07836019,  1.12311761],
       [ 1.28384525,  0.86019888, -0.39492825],
       [ 1.21877452, -1.06649056,  0.3741595 ]])

In [121]:
np.minimum(v_array_1, v_array_2)

array([[-0.35565431,  1.07836019,  1.12311761],
       [-1.08602917, -2.71230386, -0.39492825],
       [-0.89435243, -1.06649056, -1.1563722 ]])

In [122]:
np.divide(v_array_1, v_array_2)

array([[-1.12638295,  1.86879685,  1.06618081],
       [-0.84591906, -3.15311253, -0.51614429],
       [-0.73381287,  0.00822504, -3.09058626]])

In [123]:
np.floor_divide(v_array_1, v_array_2)

array([[-2.,  1.,  1.],
       [-1., -4., -1.],
       [-1.,  0., -4.]])
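
The difference between `maximum`/`minimum` and `fmax`/`fmin` only shows up when NaN is present. A minimal sketch with two hypothetical arrays `a` and `b`:

```python
import numpy as np

a = np.array([1.0, np.nan, 3.0])
b = np.array([2.0, 5.0, np.nan])

# maximum propagates NaN: if either element is NaN, the result is NaN.
max_result = np.maximum(a, b)
print(max_result)             # [ 2. nan nan]

# fmax and fmin ignore NaN when the other element is a number.
fmax_result = np.fmax(a, b)
fmin_result = np.fmin(a, b)
print(fmax_result)            # [2. 5. 3.]
print(fmin_result)            # [1. 5. 3.]
```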

<a id="np_where"></a>
#### Ndarrays based on a condition (np.where)
With the **np.where** function it is possible to build an output array from two inputs. A Boolean mask decides, element by element, whether the output takes the element of the first ndarray (True) or of the second (False).

In [124]:
np.where(v_array_1 < v_array_2, v_array_1, v_array_2)

array([[-0.35565431,  1.07836019,  1.12311761],
       [-1.08602917, -2.71230386, -0.39492825],
       [-0.89435243, -1.06649056, -1.1563722 ]])

In [125]:
np.where(v_array_1 < v_array_2, np.where(v_array_1 < 0, 0, v_array_1), v_array_2)

array([[ 0.        ,  1.07836019,  1.12311761],
       [ 0.        ,  0.        , -0.39492825],
       [ 0.        , -1.06649056,  0.        ]])
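
The inputs of `np.where` do not have to be full arrays: plain scalars work too, as the nested call above already does with `0`. The following sketch uses small fixed values (hypothetical names) and also shows the one-argument form, which returns the positions where the condition holds.

```python
import numpy as np

values = np.array([3, -1, 0, 7, -4])

# Replace negative entries with 0 and keep the rest.
non_negative = np.where(values < 0, 0, values)
print(non_negative)           # [3 0 0 7 0]

# Both branches can be scalars, for example to build labels.
labels = np.where(values > 0, "pos", "non-pos")
print(labels)                 # ['pos' 'non-pos' 'non-pos' 'pos' 'non-pos']

# With only a condition, np.where returns a tuple of index arrays.
negative_idx = np.where(values < 0)
print(negative_idx[0])        # [1 4]
```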

<a id="math"></a>
#### Mathematical and statistical functions

* **sum:** Sum of elements.
* **mean:** Arithmetic mean of the elements.
* **median:** Median of the elements.
* **std:** Standard deviation of the elements.
* **var:** Variance of the elements.
* **min:** Minimum value of the elements.
* **max:** Maximum value of the elements.
* **argmin:** Index of the minimum value.
* **argmax:** Index of the maximum value.
* **cumsum:** Cumulative sum of the elements.
* **cumprod:** Cumulative product of the elements.

All these functions accept an optional parameter called **axis**. If this parameter is omitted, the function is applied to the global set of ndarray elements. If it is included, for a two-dimensional ndarray it can take two values:

 * Value 0: The function is applied down the rows, producing one result per column.
 * Value 1: The function is applied across the columns, producing one result per row.

In [126]:
np.sum(v_array_1)

-2.7969613609321984

In [127]:
np.sum(v_array_1, axis=0)

array([-2.3360359 , -0.70583966,  0.2449142 ])

In [128]:
np.sum(v_array_1, axis=1)

array([ 2.85702826, -3.59449306, -2.05949656])
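
The cells above only exercise `sum`. As a sketch, here are a few of the other listed functions on a small fixed array (the name `data` is hypothetical), so the results can be checked by hand.

```python
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6]])

print(data.mean())            # 3.5
print(data.mean(axis=0))      # [2.5 3.5 4.5]  (one mean per column)
print(np.median(data))        # 3.5

# Without an axis, argmax indexes into the flattened array.
print(data.argmax())          # 5
print(data.argmax(axis=1))    # [2 2]  (position of the maximum in each row)

# Cumulative functions keep the shape and accumulate along the axis.
print(data.cumsum(axis=1))    # [[ 1  3  6]
                              #  [ 4  9 15]]
print(data.cumprod(axis=1))   # [[  1   2   6]
                              #  [  4  20 120]]
```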

<a id="sort"></a>
#### Sorting ndarrays

In [129]:
v_array = np.random.randn(3, 4)
v_array

array([[ 0.60887207,  0.46840229, -1.48846128, -0.59804317],
       [ 0.30192688, -1.01995864,  1.11447234,  0.37833779],
       [-0.76578679, -0.63732591, -0.92591639,  0.81245276]])

In [130]:
np.sort(v_array) 

array([[-1.48846128, -0.59804317,  0.46840229,  0.60887207],
       [-1.01995864,  0.30192688,  0.37833779,  1.11447234],
       [-0.92591639, -0.76578679, -0.63732591,  0.81245276]])

In [131]:
np.sort(v_array, axis=0)

array([[-0.76578679, -1.01995864, -1.48846128, -0.59804317],
       [ 0.30192688, -0.63732591, -0.92591639,  0.37833779],
       [ 0.60887207,  0.46840229,  1.11447234,  0.81245276]])
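
Two related points are worth a sketch: `np.sort` returns a sorted copy (by default along the last axis), while the `sort` method sorts in place, and `np.argsort` returns the positions that would sort the array. The names below are illustrative.

```python
import numpy as np

v = np.array([3, 1, 2])

sorted_copy = np.sort(v)      # new array; v itself is unchanged
print(sorted_copy)            # [1 2 3]
print(v)                      # [3 1 2]

# Reversing the sorted result gives descending order.
descending = np.sort(v)[::-1]
print(descending)             # [3 2 1]

# argsort returns the indices that put the elements in order.
order = np.argsort(v)
print(order)                  # [1 2 0]
print(v[order])               # [1 2 3]

# The method form modifies the array in place and returns None.
in_place = v.copy()
in_place.sort()
print(in_place)               # [1 2 3]
```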

<a id="group"></a>
#### Grouping functions
NumPy can treat the elements of an ndarray as a set and offers the following set operations on it.

* **unique:** Computes the unique set of items without duplicates.
* **intersect1d:** Calculates the intersection of the elements of two arrays.
* **union1d:** Calculates the union of the elements of two arrays.
* **in1d:** Calculates a Boolean array that indicates whether each element of the first array is contained in the second (recent NumPy releases recommend **isin** for new code).
* **setdiff1d:** Calculates the set difference: the elements of the first array that are not in the second.
* **setxor1d:** Calculates the symmetric difference between both sets.

In [132]:
v_array_1 = np.array([6, 0, 0, 0, 3, 2, 5, 6])
v_array_1

array([6, 0, 0, 0, 3, 2, 5, 6])

In [133]:
v_array_2 = np.array([7, 4, 3, 1, 2, 6, 5])
v_array_2

array([7, 4, 3, 1, 2, 6, 5])

In [134]:
np.unique(v_array_1)

array([0, 2, 3, 5, 6])

In [135]:
np.union1d(v_array_1, v_array_2)

array([0, 1, 2, 3, 4, 5, 6, 7])

In [136]:
np.in1d(v_array_1, v_array_2)

array([ True, False, False, False,  True,  True,  True,  True])
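
The remaining functions from the list can be sketched on the same two arrays. The result names (`common`, `only_first`, `exclusive`) are hypothetical.

```python
import numpy as np

v_array_1 = np.array([6, 0, 0, 0, 3, 2, 5, 6])
v_array_2 = np.array([7, 4, 3, 1, 2, 6, 5])

# Elements present in both arrays (sorted, without duplicates).
common = np.intersect1d(v_array_1, v_array_2)
print(common)                 # [2 3 5 6]

# Elements of the first array that are not in the second.
only_first = np.setdiff1d(v_array_1, v_array_2)
print(only_first)             # [0]

# Elements that appear in exactly one of the two arrays.
exclusive = np.setxor1d(v_array_1, v_array_2)
print(exclusive)              # [0 1 4 7]
```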