# Statistical and Mathematical Functions

This section covers functions for computing statistics and performing mathematical operations on NumPy arrays.

## Documentation

<a href="https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html">https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html</a>

<a href="https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.math.html">https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.math.html</a>

<a href="https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.statistics.html">https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.statistics.html</a>

<a href="https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.linalg.html">https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.linalg.html</a>

<a href="https://en.wikipedia.org/wiki/Feature_scaling">https://en.wikipedia.org/wiki/Feature_scaling</a>


In [1]:
import numpy as np

In [2]:
np.random.seed(0)
A = np.random.randint(0, 10, [2, 3])
A

array([[5, 0, 3],
       [3, 7, 9]])

## SUM

In [3]:
A.sum()

27

In [4]:
A[0].sum()

8

In [6]:
A.sum(axis=0) # axis=0 sums down the rows (one result per column), axis=1 sums across the columns (one result per row); higher axis numbers address further dimensions

array([ 8,  7, 12])

In [7]:
A.sum(axis=1)

array([ 8, 19])

Operating along an axis is very useful on real datasets (e.g. computing the mean of each row).
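
As a minimal sketch of the row and column means mentioned above, the example below uses a small made-up dataset (the name `data` and its values are illustrative assumptions, not part of the original notebook):

```python
import numpy as np

# Hypothetical dataset: 3 samples (rows) x 2 features (columns).
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])

col_means = data.mean(axis=0)  # one mean per column: [ 2., 20.]
row_means = data.mean(axis=1)  # one mean per row: [ 5.5, 11., 16.5]

print(col_means)
print(row_means)
```

The result always loses the axis that was reduced: a (3, 2) array gives 2 values for axis=0 and 3 values for axis=1.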

In [8]:
A.cumsum() # cumulative sum (computed over the flattened array)

array([ 5,  5,  8, 11, 18, 27])

In [9]:
A.prod()

0

In [10]:
A.prod(axis=1)

array([  0, 189])

In [11]:
A.cumprod(axis=0) # cumulative product down each column

array([[ 5,  0,  3],
       [15,  0, 27]])

In [12]:
A.min()

0

In [13]:
A.min(axis=0)

array([3, 0, 3])

In [15]:
A

array([[5, 0, 3],
       [3, 7, 9]])

In [16]:
A.argmin(axis=1) # the min of the first row is 0, at index 1
                 # the min of the second row is 3, at index 0

array([1, 0])
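
When no axis is given, argmin returns a position in the flattened array. The sketch below shows one possible way to turn that position back into a (row, column) pair with np.unravel_index; the variable names are illustrative.

```python
import numpy as np

A = np.array([[5, 0, 3],
              [3, 7, 9]])

flat_index = A.argmin()  # position in the flattened array
row, col = np.unravel_index(flat_index, A.shape)  # back to 2-D coordinates

print(flat_index, row, col, A[row, col])
```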

# sort and argsort

sort() sorts an array in place.
argsort() returns the indices that would put the array in sorted order.

In [21]:
B = np.array([2, 4, -4, 5])
B2 = B.copy()
B2.sort()
B2

array([-4,  2,  4,  5])

In [22]:
B = np.array([2, 4, -4, 5])
B2 = B.copy()
B2.argsort()

array([2, 0, 1, 3])
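
The indices returned by argsort can be used to reorder the array itself, or any other array aligned with it. The following is a small sketch; the `labels` array is a made-up companion array used only for illustration.

```python
import numpy as np

B = np.array([2, 4, -4, 5])
labels = np.array(["a", "b", "c", "d"])  # hypothetical labels aligned with B

order = B.argsort()          # indices that sort B in ascending order
ascending = B[order]         # same values as np.sort(B)
descending = B[order[::-1]]  # reverse the indices to get largest first
sorted_labels = labels[order]  # reorder the companion array the same way

print(ascending, descending, sorted_labels)
```

Note that np.sort(B) returns a sorted copy and leaves B untouched, whereas B.sort() modifies B in place.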

## General purpose mathematical functions

They are called as np.function_name() and are applied element-wise to the array.

In [23]:
np.exp(A)

array([[1.48413159e+02, 1.00000000e+00, 2.00855369e+01],
       [2.00855369e+01, 1.09663316e+03, 8.10308393e+03]])
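
np.exp is only one of many element-wise functions. As a brief sketch, here are a few others applied to the same array (the selection is arbitrary):

```python
import numpy as np

A = np.array([[5, 0, 3],
              [3, 7, 9]])

roots = np.sqrt(A)             # element-wise square root
logs = np.log1p(A)             # log(1 + x), which is defined for the 0 entry
sines = np.sin(A)              # element-wise sine, input in radians
roundtrip = np.log(np.exp(A))  # log undoes exp, up to floating-point rounding

print(roots)
print(roundtrip)
```

Each call returns a new array with the same shape as the input.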

## Statistics



In [24]:
A.mean()

4.5

In [26]:
A.std() # standard deviation

2.9297326385411577

In [27]:
A.var() # variance

8.583333333333334
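
By default, std and var divide by the number of elements N (population statistics). The sketch below shows the ddof parameter, which switches the divisor to N - ddof, so ddof=1 gives the sample variance.

```python
import numpy as np

A = np.array([[5, 0, 3],
              [3, 7, 9]])

pop_var = A.var()            # divides by N = 6 (ddof=0 is the default)
sample_var = A.var(ddof=1)   # divides by N - 1 = 5
sample_std = A.std(ddof=1)   # square root of the sample variance

print(pop_var, sample_var, sample_std)
```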

In [29]:
# ndarray has no median() method (A.median() fails), so use np.median() instead
np.median(A)

4.0

# corrcoef

In [31]:
np.corrcoef(A)
# row 1 correlated with row 1 | row 1 with row 2
# row 2 correlated with row 1 | row 2 with row 2

array([[ 1.        , -0.56362148],
       [-0.56362148,  1.        ]])

In [32]:
np.corrcoef(A)[0, 1] # correlation coefficient between row 1 and row 2

-0.5636214801906779
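
To make the value above less of a black box, here is a sketch that recomputes the Pearson correlation coefficient by hand (covariance of the centred rows divided by the product of their norms) and compares it with np.corrcoef. The variable names are illustrative.

```python
import numpy as np

A = np.array([[5, 0, 3],
              [3, 7, 9]])

x, y = A[0], A[1]
xc = x - x.mean()  # centre each row on its own mean
yc = y - y.mean()

# Pearson r = sum(xc * yc) / sqrt(sum(xc^2) * sum(yc^2))
r = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

print(r, np.corrcoef(A)[0, 1])
```

np.corrcoef treats each row as one variable by default, which is why a (2, 3) input yields a 2 x 2 matrix.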

# Unique

In [33]:
np.unique(A)

array([0, 3, 5, 7, 9])

In [34]:
np.unique(A, return_counts=True) # very useful when analysing a dataset: shows each unique value and how many times it occurs

(array([0, 3, 5, 7, 9]), array([1, 2, 1, 1, 1]))

## Unique + argsort

Combining these two functions lets us rank the values of an array by how often they occur, which shows which value is the most frequent in the dataset.

In [38]:
A = np.random.randint(0, 10, [5, 5])
A

array([[3, 5, 2, 4, 7],
       [6, 8, 8, 1, 6],
       [7, 7, 8, 1, 5],
       [9, 8, 9, 4, 3],
       [0, 3, 5, 0, 2]])

In [39]:
np.unique(A, return_counts=True)

(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([2, 2, 2, 3, 2, 3, 2, 3, 4, 2]))

In [41]:
values, counts = np.unique(A, return_counts=True)
counts

array([2, 2, 2, 3, 2, 3, 2, 3, 4, 2])

In [42]:
counts.argsort()

array([0, 1, 2, 4, 6, 9, 3, 5, 7, 8])

In [43]:
values[counts.argsort()]

array([0, 1, 2, 4, 6, 9, 3, 5, 7, 8])

In [44]:
for i in counts.argsort():
    print(f'val : {values[i]} count {counts[i]}')

val : 0 count 2
val : 1 count 2
val : 2 count 2
val : 4 count 2
val : 6 count 2
val : 9 count 2
val : 3 count 3
val : 5 count 3
val : 7 count 3
val : 8 count 4


In [45]:
# exercise: print the same listing as above using zip()

for i, j in zip(values[counts.argsort()], counts[counts.argsort()]):
    print(f'val : {i} count {j}')

val : 0 count 2
val : 1 count 2
val : 2 count 2
val : 4 count 2
val : 6 count 2
val : 9 count 2
val : 3 count 3
val : 5 count 3
val : 7 count 3
val : 8 count 4
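
The listings above go from least to most frequent. As a sketch of the reverse view, the example below reuses the same 5 x 5 array, picks the most frequent value with argmax, and reverses the argsort result to list values from most to least frequent. The order among values with equal counts is not guaranteed.

```python
import numpy as np

A = np.array([[3, 5, 2, 4, 7],
              [6, 8, 8, 1, 6],
              [7, 7, 8, 1, 5],
              [9, 8, 9, 4, 3],
              [0, 3, 5, 0, 2]])

values, counts = np.unique(A, return_counts=True)

most_common = values[counts.argmax()]  # value with the highest count
order = counts.argsort()[::-1]         # indices from highest to lowest count

print(f'most common : {most_common} ({counts.max()} times)')
for value, count in zip(values[order], counts[order]):
    print(f'val : {value} count {count}')
```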


# NaN (Not a Number)

Datasets often contain missing values, stored here as np.nan, and they get in the way of statistical computations.

In [46]:
A = np.random.randn(5, 5)
A[1, 3] = np.nan
A[4, 2] = np.nan
A

array([[ 1.41437719, -0.12405066,  2.00815709,  0.22988654,  0.60489373],
       [ 1.62715982,  1.59456053,  0.23043417,         nan, -0.96898025],
       [ 0.59124281, -0.7827755 , -0.44423283, -0.34518616, -0.88180055],
       [-0.44265324, -0.5409163 , -1.32322737, -0.11279892,  0.90734594],
       [ 0.81526991,  0.22909795,         nan,  0.47752547,  1.29269823]])

If we now apply a regular statistical function to this array, the result is NaN:

In [47]:
A.mean()

nan

In [48]:
np.nanmean(A) # nanmean ignores NaN values, so we can still compute the mean

0.2633055476891779
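
nanmean has several NaN-aware siblings. The sketch below uses a small hand-written array (an illustrative assumption, chosen so the results are easy to check) to show a few of them, including the axis argument.

```python
import numpy as np

# Small hypothetical array with two missing values.
A = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan]])

total = np.nansum(A)               # 1 + 3 + 4 + 5, NaN entries are skipped
mean = np.nanmean(A)               # mean of the 4 valid entries
largest = np.nanmax(A)             # max over the valid entries
col_means = np.nanmean(A, axis=0)  # per-column mean, ignoring NaN
spread = np.nanstd(A)              # standard deviation of the valid entries

print(total, mean, largest, col_means, spread)
```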

## Count NaN values

In [49]:
np.isnan(A) # returns a boolean mask

array([[False, False, False, False, False],
       [False, False, False,  True, False],
       [False, False, False, False, False],
       [False, False, False, False, False],
       [False, False,  True, False, False]])

In [50]:
np.isnan(A).sum() # returns the number of NaN values in the array

2

In [53]:
np.isnan(A).sum() / A.size # returns the fraction of NaN values (0.08 means 8%); very useful when analysing data

0.08

## Replace NaN

In [55]:
A[np.isnan(A)] = 0 # replace every NaN with a default value (here 0)
A

array([[ 1.41437719, -0.12405066,  2.00815709,  0.22988654,  0.60489373],
       [ 1.62715982,  1.59456053,  0.23043417,  0.        , -0.96898025],
       [ 0.59124281, -0.7827755 , -0.44423283, -0.34518616, -0.88180055],
       [-0.44265324, -0.5409163 , -1.32322737, -0.11279892,  0.90734594],
       [ 0.81526991,  0.22909795,  0.        ,  0.47752547,  1.29269823]])
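
Zero is not always a sensible default. As one possible alternative, the sketch below fills each NaN with the mean of its column, computed with np.nanmean, and leaves the original array untouched. The array and its values are made up for illustration.

```python
import numpy as np

# Small hypothetical array with two missing values.
A = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 9.0, 6.0]])

col_means = np.nanmean(A, axis=0)  # per-column mean, ignoring NaN

# np.where keeps A where the mask is False and takes the column mean where it
# is True; col_means has shape (3,) and is broadcast across the rows.
filled = np.where(np.isnan(A), col_means, A)

# np.nan_to_num is a shortcut for the constant-value case
# (the nan= keyword requires NumPy 1.17 or later).
zero_filled = np.nan_to_num(A, nan=0.0)

print(filled)
print(zero_filled)
```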

# Matrices

In [56]:
A = np.ones((2, 3))
B = np.ones((3, 2))

A

array([[1., 1., 1.],
       [1., 1., 1.]])

In [57]:
B

array([[1., 1.],
       [1., 1.],
       [1., 1.]])

In [58]:
A.T # Transposed matrix

array([[1., 1.],
       [1., 1.],
       [1., 1.]])

In [60]:
A.dot(B) # matrix product: (2, 3) x (3, 2) gives (2, 2)

array([[3., 3.],
       [3., 3.]])

In [61]:
B.dot(A)

array([[2., 2., 2.],
       [2., 2., 2.],
       [2., 2., 2.]])
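
For 2-D arrays, the @ operator computes the same matrix product as dot. The sketch below also contrasts it with *, which multiplies element by element, and shows what happens when the inner dimensions do not match.

```python
import numpy as np

A = np.ones((2, 3))
B = np.ones((3, 2))

product = A @ B      # matrix product, same result as A.dot(B) for 2-D arrays
elementwise = A * A  # element-wise product, shapes must match or broadcast

try:
    A @ A  # (2, 3) @ (2, 3): inner dimensions 3 and 2 do not match
except ValueError as err:
    print("shape mismatch:", err)

print(product.shape, elementwise.shape)
```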

In [63]:
C = np.random.randint(0, 15, [4, 4])
C

array([[14,  4,  3, 10],
       [ 7, 13,  5,  5],
       [ 0,  1,  5,  9],
       [ 3,  0,  5, 14]])

## Determinant

In [66]:
np.linalg.det(C)

2131.9999999999995
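
The result is 2131.9999999999995 rather than 2132 because the determinant is computed in floating point. A small sketch of how one might compare or round it, using the same matrix C:

```python
import numpy as np

C = np.array([[14, 4, 3, 10],
              [7, 13, 5, 5],
              [0, 1, 5, 9],
              [3, 0, 5, 14]])

det = np.linalg.det(C)

print(det == 2132)            # exact comparison is unreliable with floats
print(np.isclose(det, 2132))  # tolerance-based comparison
print(round(det))             # rounding is safe here because C holds integers
```

The same caution applies when testing whether a matrix is singular: comparing the determinant to 0 with np.isclose is more robust than using ==.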

## Inverse matrix

In [67]:
np.linalg.inv(C) # TODO: also check np.linalg.pinv, which computes the
                 # (Moore-Penrose) pseudo-inverse of a matrix

array([[ 0.13133208, -0.05065666,  0.13320826, -0.16135084],
       [-0.11022514,  0.13180113, -0.27251407,  0.20684803],
       [ 0.20356473, -0.12851782,  0.8564728 , -0.65009381],
       [-0.10084428,  0.05675422, -0.33442777,  0.33818011]])
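
A quick way to sanity-check an inverse is to multiply it back with the original matrix. The sketch below does that, then shows np.linalg.solve for a linear system and np.linalg.pinv on a non-square matrix. The right-hand side `b` and the matrix `M` are made up for illustration.

```python
import numpy as np

C = np.array([[14, 4, 3, 10],
              [7, 13, 5, 5],
              [0, 1, 5, 9],
              [3, 0, 5, 14]])

C_inv = np.linalg.inv(C)
is_identity = np.allclose(C @ C_inv, np.eye(4))  # C times its inverse ~ identity

# To solve C x = b, np.linalg.solve avoids forming the inverse explicitly.
b = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical right-hand side
x = np.linalg.solve(C, b)

# pinv also accepts non-square (or singular) matrices, where inv would fail.
M = np.ones((2, 3))
M_pinv = np.linalg.pinv(M)  # shape (3, 2)

print(is_identity, np.allclose(C @ x, b), M_pinv.shape)
```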

In [70]:
np.linalg.eig(C) # compute eigenvalues and eigenvectors
                 # returns two arrays: the first holds the eigenvalues,
                 # the second holds the eigenvectors (one per column)

(array([23.5229133 ,  0.82013955,  8.23152911, 13.42541804]),
 array([[-0.65570878, -0.17405725, -0.70964039, -0.41341585],
        [-0.66700853,  0.30341209,  0.60503773, -0.83489958],
        [-0.1830856 , -0.86209061,  0.35590955,  0.21433137],
        [-0.30269669,  0.36666737,  0.06056604,  0.29341455]]))
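
Each column of the second array is the eigenvector that belongs to the eigenvalue at the same position in the first array. The sketch below checks the defining relation C v = λ v for every pair; the variable names are illustrative.

```python
import numpy as np

C = np.array([[14, 4, 3, 10],
              [7, 13, 5, 5],
              [0, 1, 5, 9],
              [3, 0, 5, 14]])

eigenvalues, eigenvectors = np.linalg.eig(C)

# Check C v = lambda v for each (eigenvalue, eigenvector column) pair.
for i, lam in enumerate(eigenvalues):
    v = eigenvectors[:, i]  # i-th column, not i-th row
    print(lam, np.allclose(C @ v, lam * v))

# The same check for all pairs at once: multiplying by eigenvalues
# scales column j of eigenvectors by eigenvalues[j] through broadcasting.
all_ok = np.allclose(C @ eigenvectors, eigenvectors * eigenvalues)
print(all_ok)
```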