# Python maths and stats

Now that we have some data in a structure, we may want to do something with it.

In [1]:
a = 3
b = 4

In [4]:
c = a + b

In [5]:
c = a ** 2 * (b/2)
c

18.0

In [7]:
import random

In [8]:
random

<module 'random' from 'C:\\Users\\stephen\\.conda\\envs\\intro_python_ds_tutorial\\lib\\random.py'>

In [9]:
random.random()

0.5694611342860146

In [12]:
num_values = 10

## Comprehensions
Comprehensions are an easy way to create lists and dictionaries from an iterable. They are a shorthand for repetition, and are often a little faster than building the same collection with an explicit loop.

In [14]:
random_list = [random.random() for i1 in range(num_values)]

In [19]:
random_list

[0.9716671019778833,
 0.26279986165919156,
 0.36322586081706254,
 0.5348368838784794,
 0.3909518189108391,
 0.09101825161821142,
 0.34484286470449854,
 0.349292828282158,
 0.6627023323330025,
 0.22297184862247832]

In [17]:
random_dict = {i1: random.random() for i1 in range(num_values)}

In [18]:
random_dict

{0: 0.9042000919519435,
 1: 0.8942589667469942,
 2: 0.23167992328766884,
 3: 0.9808603732021249,
 4: 0.1039129103046883,
 5: 0.3199693892078521,
 6: 0.4356067827673775,
 7: 0.7258926365233034,
 8: 0.8856385067307676,
 9: 0.30066585016537684}
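
To make the link with ordinary loops concrete, here is a minimal sketch that builds the same kind of list both ways, and adds a filter clause. The names `loop_list`, `comp_list`, `large_values` and `squares` are illustrative and not part of the original notebook.

```python
import random

num_values = 10

# Explicit loop: build the list one append at a time.
loop_list = []
for _ in range(num_values):
    loop_list.append(random.random())

# Equivalent comprehension: the same kind of list in a single expression.
comp_list = [random.random() for _ in range(num_values)]

# A comprehension can also filter items with a trailing `if` clause.
large_values = [val for val in comp_list if val > 0.5]

# Dictionary comprehension mapping each index to its square.
squares = {i: i ** 2 for i in range(5)}
print(squares)  # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
```

The loop and the comprehension produce lists of the same length. The comprehension simply states the rule for each item in one place.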

## Perform operations on a collection of data
The core of data science is performing an operation efficiently on many items in a collection. This can be done with loops, comprehensions or other constructs.

## Apply an operation
You may want to apply the same operation to each item and get back the same number of items as you put in.

In [22]:
[2 * val for val in random_list]

[1.9433342039557666,
 0.5255997233183831,
 0.7264517216341251,
 1.0696737677569588,
 0.7819036378216782,
 0.18203650323642284,
 0.6896857294089971,
 0.698585656564316,
 1.325404664666005,
 0.44594369724495664]

In [21]:
{key1: 2 * val1 for key1, val1 in random_dict.items()}

{0: 1.808400183903887,
 1: 1.7885179334939885,
 2: 0.4633598465753377,
 3: 1.9617207464042499,
 4: 0.2078258206093766,
 5: 0.6399387784157042,
 6: 0.871213565534755,
 7: 1.4517852730466068,
 8: 1.7712770134615352,
 9: 0.6013317003307537}

Another option is the `map` function, which applies a function to every item in a collection. This needs us to define a function. We can either do it explicitly with `def` or inline using a `lambda`.

In [24]:
def double(x):
    return 2 * x

In [25]:
map(double, random_list)

<map at 0x14dccf3bf10>

In [28]:
map(lambda x: 2 * x, random_list)

<map at 0x14dccf3bb80>

In Python, `map` does not evaluate the operation until the data is needed, so we get back a lazy `map` object (an iterator, closely related to a *generator*) rather than a list of results. This is a key idea in Python for data science: we don't do the calculation immediately; instead, we set up the calculation to run later. This has many advantages:

* set up a computing pipeline interactively
* send the computation to another computing resource
* run asynchronously, so other work can be done while waiting for results.

https://wiki.python.org/moin/Generators

We can force the calculation to happen by converting the `map` object to a list.

In [27]:
list(map(double, random_list))

[1.9433342039557666,
 0.5255997233183831,
 0.7264517216341251,
 1.0696737677569588,
 0.7819036378216782,
 0.18203650323642284,
 0.6896857294089971,
 0.698585656564316,
 1.325404664666005,
 0.44594369724495664]

In [29]:
list(map(lambda x: 2 * x, random_list))

[1.9433342039557666,
 0.5255997233183831,
 0.7264517216341251,
 1.0696737677569588,
 0.7819036378216782,
 0.18203650323642284,
 0.6896857294089971,
 0.698585656564316,
 1.325404664666005,
 0.44594369724495664]
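
The lazy behaviour of `map` can be observed directly. The sketch below is illustrative only: it uses a version of `double` that records each call in a hypothetical `calls` list, so we can see when the function actually runs.

```python
calls = []

def double(x):
    calls.append(x)  # record each time the function actually runs
    return 2 * x

lazy = map(double, [1, 2, 3])
calls_before = list(calls)     # nothing has been computed yet

first = next(lazy)             # computes only the first item
calls_after_one = list(calls)

rest = list(lazy)              # computes the remaining items
again = list(lazy)             # the iterator is now exhausted

print(calls_before)     # []
print(first)            # 2
print(calls_after_one)  # [1]
print(rest)             # [4, 6]
print(again)            # []
```

Note that a `map` object can only be consumed once. If the results are needed more than once, store them in a list.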

## Reductions
Another typical operation is to combine values in some way, for example finding the average or max/min.

In [30]:
max(random_list)

0.9716671019778833

In [33]:
max(random_dict.values())

0.9808603732021249

In [34]:
sum(random_list)

4.194309652803804

In [36]:
len(random_list)

10

In [35]:
sum(random_list) / len(random_list)

0.41943096528038043
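
Beyond the built-ins above, the standard library offers `functools.reduce` for a general reduction and the `statistics` module for common summary values. The following is a small sketch on made-up values, not the random data from the notebook.

```python
from functools import reduce
import statistics

values = [0.5, 1.5, 2.0, 4.0]

# reduce folds the list into one value by repeatedly applying a function.
total = reduce(lambda acc, x: acc + x, values, 0.0)
product = reduce(lambda acc, x: acc * x, values, 1.0)

# statistics.mean gives the same result as sum(values) / len(values) here.
mean = statistics.mean(values)

print(total)    # 8.0
print(product)  # 6.0
print(mean)     # 2.0
```

For simple sums and extremes, `sum`, `min` and `max` are clearer. `reduce` is useful when the combining step is something the built-ins do not cover.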

## Numerical python with numpy
Typically, substantial calculations will be done using a third-party library called **numpy** (a contraction of numerical python). This includes many common numerical operations, as well as important data structures for arrays and matrices.

https://numpy.org/doc/

In [37]:
import numpy

In [54]:
rand_arr = numpy.random.random((10,5))
rand_arr

array([[0.27540882, 0.1728077 , 0.07302226, 0.86702407, 0.445949  ],
       [0.50574239, 0.2167221 , 0.30911444, 0.34841866, 0.75210726],
       [0.65809158, 0.98781675, 0.61543413, 0.08595842, 0.81275611],
       [0.42632665, 0.02155948, 0.62862858, 0.29697226, 0.68810677],
       [0.58796238, 0.85551995, 0.01889157, 0.41252549, 0.55337425],
       [0.06958205, 0.72022917, 0.74007193, 0.6209007 , 0.33380298],
       [0.76662751, 0.59860868, 0.17873446, 0.65612747, 0.31933609],
       [0.83954676, 0.62165114, 0.73485465, 0.61533474, 0.7622486 ],
       [0.67925936, 0.34283712, 0.55859609, 0.97595897, 0.57209862],
       [0.54652386, 0.37314234, 0.23119516, 0.69793204, 0.30435198]])

This provides easier arithmetic and additional mathematical functionality.

In [55]:
rand_arr * 2

array([[0.55081765, 0.34561539, 0.14604452, 1.73404815, 0.891898  ],
       [1.01148477, 0.43344421, 0.61822887, 0.69683731, 1.50421452],
       [1.31618316, 1.97563349, 1.23086825, 0.17191683, 1.62551222],
       [0.85265331, 0.04311895, 1.25725715, 0.59394453, 1.37621354],
       [1.17592477, 1.71103991, 0.03778313, 0.82505098, 1.1067485 ],
       [0.1391641 , 1.44045834, 1.48014385, 1.24180141, 0.66760596],
       [1.53325502, 1.19721736, 0.35746892, 1.31225495, 0.63867219],
       [1.67909352, 1.24330228, 1.4697093 , 1.23066949, 1.52449719],
       [1.35851873, 0.68567425, 1.11719218, 1.95191795, 1.14419724],
       [1.09304773, 0.74628467, 0.46239032, 1.39586408, 0.60870396]])

In [56]:
other_arr = numpy.random.random((10,5))

In [57]:
rand_arr + other_arr

array([[0.73873527, 0.95261017, 1.06212582, 1.22365792, 0.96143735],
       [0.89529666, 0.36395568, 0.48783331, 1.12542163, 1.68971662],
       [0.66354301, 1.81137567, 1.23113574, 0.71777633, 0.91806817],
       [0.87202536, 1.01331768, 0.6554026 , 0.46405042, 1.43152808],
       [0.59072323, 1.58072994, 0.57587494, 1.20267834, 1.09904489],
       [0.74085597, 1.68620677, 1.3213275 , 1.46626942, 1.2679869 ],
       [1.36485134, 1.17376325, 0.56496619, 1.55167533, 0.84845468],
       [0.88431288, 1.4720335 , 1.50215801, 0.80815386, 1.24915577],
       [0.73093052, 1.27779258, 1.04626369, 1.91791762, 1.17778521],
       [1.18095451, 0.49181857, 0.52091646, 1.68126118, 0.45152973]])

In [60]:
numpy.mean(rand_arr + other_arr)

1.0541481246328877

In [61]:
numpy.std(rand_arr + other_arr)

0.39782207455322316
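
Reductions such as `numpy.mean` act on the whole array by default, but they also accept an `axis` argument to reduce along rows or columns. Here is a minimal sketch on a small hand-written array (the array `arr` is illustrative, chosen so the results are easy to check by hand).

```python
import numpy

arr = numpy.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

print(numpy.mean(arr))          # 3.5, mean of all six values
print(numpy.mean(arr, axis=0))  # [2.5 3.5 4.5], one mean per column
print(numpy.mean(arr, axis=1))  # [2. 5.], one mean per row
print(numpy.max(arr, axis=0))   # [4. 5. 6.], largest value in each column
```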

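The other common numpy data structure is a matrix, which corresponds to a mathematical matrix and its operations. So rather than multiplication being element-wise, it is proper matrix multiplication. Note that the numpy documentation no longer recommends `numpy.matrix` for new code and suggests regular arrays instead.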

In [68]:
rand_arr1 = numpy.random.random((5,5))
rand_mat1 = numpy.matrix(rand_arr1)
rand_arr2 = numpy.random.random((5,5))
rand_mat2 = numpy.matrix(rand_arr2)

In [69]:
rand_arr1 * rand_arr2

array([[0.37353874, 0.04665452, 0.72741746, 0.07932256, 0.56098741],
       [0.14732256, 0.0396383 , 0.34380106, 0.5578604 , 0.48125186],
       [0.29373476, 0.01709241, 0.3705256 , 0.19020217, 0.15731403],
       [0.5656448 , 0.38488233, 0.15785151, 0.02608633, 0.10769253],
       [0.03975572, 0.14039977, 0.01586128, 0.07130194, 0.24940553]])

In [74]:
rand_arr1[0,0] * rand_arr2[0,0]

0.3735387416011707

In [70]:
rand_mat1 * rand_mat2

matrix([[1.33123189, 0.8288002 , 1.54855534, 1.12129405, 1.38895845],
        [1.92294086, 1.51537319, 1.76297213, 1.4649116 , 1.72839313],
        [1.27616109, 0.89430863, 1.11393576, 0.82530049, 1.00852554],
        [1.21204335, 0.56448472, 1.3151479 , 1.01272881, 1.18243524],
        [0.61019255, 0.37742405, 0.59316195, 0.53095288, 0.63761302]])

In [72]:
rand_mat1[0,0] * rand_mat2[0,0] + rand_mat1[0,1] * rand_mat2[1,0] + rand_mat1[0,2] * rand_mat2[2,0] + rand_mat1[0,3] * rand_mat2[3,0] + rand_mat1[0,4] * rand_mat2[4,0] 

1.3312318888355474
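
Matrix multiplication is also available for ordinary arrays through the `@` operator, without converting to `numpy.matrix`. The sketch below uses small hand-written arrays (illustrative values, not the random data above) to contrast the two kinds of product.

```python
import numpy

a = numpy.array([[1.0, 2.0],
                 [3.0, 4.0]])
b = numpy.array([[5.0, 6.0],
                 [7.0, 8.0]])

elementwise = a * b  # [[5, 12], [21, 32]]: each entry multiplied by its partner
matmul = a @ b       # [[19, 22], [43, 50]]: rows of a combined with columns of b

# Entry (0, 0) of the matrix product is row 0 of a times column 0 of b.
manual = sum(a[0, k] * b[k, 0] for k in range(2))
print(manual)  # 19.0
```

Here `a @ b` gives the same result as `numpy.matmul(a, b)`.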