# Python for Machine Learning

## Maciej Szankin

# About me

<div style="float:left; padding:30px;">
    <img src="https://avatars0.githubusercontent.com/u/4785345?s=460&u=347ee5cbd03d6f3972af3b66ec530eef19f1af1e&v=4" style="width:300px;" />
</div>
<div style="float:left; padding:30px;">
    <h2>Maciej Szankin</h2>
    <h4>Deep Learning Software Engineer @ Intel</h4>
</div>

# Introduction - How do I participate?

TODO Presentation: add link

* Slack #ml
* Run Locally
    1. Install Python
        * Python
        * Anaconda
    2. Run...
        * ... as a script
        * ... in a REPL
        * ... in a Notebook
    
* Run Remotely
    * Google Colab
    * ML Server

## Introduction | Run Locally

* If you are on Linux or Mac, you should already have Python available
* For Windows users: https://www.python.org/downloads/windows/

__... but either way you should really use Anaconda__

* It's a Python distribution - it manages not only Python packages, but also additional libraries / drivers
* Get it from: https://www.anaconda.com/products/individual
* `conda install` vs `pip install`

* CPU-focused optimizations are included
* MKL support in most packages by default


* Intel® Distribution for Python* and Intel® Performance Libraries with Anaconda - https://software.intel.com/content/www/us/en/develop/articles/using-intel-distribution-for-python-with-anaconda.html
* Read more: https://www.anaconda.com/blog/tensorflow-cpu-optimizations-in-anaconda

* Run as a script
* or use Jupyter Notebook

```bash
pip install jupyter

# or even better:
conda install jupyter
```

## Introduction | Notebooks

```bash
cd ~/workspace/
jupyter notebook
```

## Introduction | Run Remotely

* Google Colab
* ML Server -  http://ml.eti.pg.gda.pl/

## Introduction | Run Remotely | Google Colab

* Available at https://colab.research.google.com/
* It's free! You only need a Google account to access it
* Offers different environments: CPU / GPU / TPU
* Resources are not guaranteed
* Avoid hogging resources - you can get lower priority in the future if the requested resources are not actively used!
* For an improved experience - Colab Pro - https://colab.research.google.com/signup

* You can execute bash commands in a cell.
* Prepend your command with ! to run it in a shell
* When running Google Colab you get your own virtualized environment - you can install packages
* The environment will be cleaned upon exiting - make your life easier and install all additional dependencies in the first cell.

* Example command to install Python's wget:

    ```bash
    !pip install wget
    ```
    Verify:

    ```python
    import wget
    wget.__version__
    ```

## Introduction | Run Remotely | ML Server

* Available at http://ml.eti.pg.gda.pl/
* JupyterHub-based preconfigured environment deployed for this School
* Server is located in Gdansk, Poland
* Use it if you can't use Google Colab

* You can execute bash commands in a cell.
* Installing new packages: 

<center><h2>You can't! Sorry! 💔</h2></center>

## Introduction | Notebook Tricks

* Command dialog - `cmd + shift + p` / `ctrl + shift + p`
    * `Esc` command mode
    * `m` / `y` to switch cell's type to markdown/code
    * `a` / `b` to insert a new cell above/below the current cell
    * `shift + tab` to show doc for the selected object
    * `ctrl + shift + -` to split cell in half at cursor's position
* Bash commands in a cell - `!<bash_command>`
* Cell with different kernel (`%%bash`/`%%HTML`/`%%python2`/`%%python3`/`%%ruby`/`%%perl`)
* Python variables in bash
* LaTeX - `$ formula $`

...
* Bash commands in a cell - `!<bash_command>`
  ```
  !pip list
  ```
* Cell with different kernel
  ```
  %%bash
  for i in {1..5}
  do echo $i;
  done;
  ```
* Python variables in bash
  ```
  foo = 'bar'
  !echo $foo
  ```
* LaTeX
  ```
  $P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}$
  ```

In [21]:
# code

__Markdown__

# 🐍 Python

<center>
    <img src="https://imgs.xkcd.com/comics/python.png" alt="Python: https://xkcd.com/353/" />
    <div><i>Source: <a href="https://xkcd.com/353/" target="_blank">XKCD 353</a></i></div>
</center>

* Object Oriented
* however #1 - no explicit encapsulation: "After all, we're all consenting adults here."
* however #2 - no class interface, only (multi)inheritance
* multi-paradigm
* interpreted
* strongly typed
* dynamically typed
* 🦆 duck typing
* garbage collector
* designed for code clarity
* object introspection
* interactive mode (terminal / IPython)
* interfaces to many popular programming languages:
    * C++
    * Java
    * .NET
    * and many others
* has its own package manager - pip & easy_install
* before writing a library - check if it exists

* We will do a general overview of Python; some programming experience is expected.
* Before we go into Python code, let's quickly go through some features that make Python great, especially for data science

* Object Oriented
     * Object-oriented programming is a programming paradigm that provides a means of structuring programs so that properties and behaviors are bundled into individual objects.
* however #1 - no explicit encapsulation: "After all, we're all consenting adults here."
    * Python does not have the `private` keyword, unlike some other object-oriented languages.
    * Instead, it relies on a convention: a class variable that should not be accessed directly is prefixed with an underscore. It's more of an agreement, and it can be violated.
* however #2 - no class interface, only (multi)inheritance
    * Python does not have a built-in equivalent of interfaces. Since Python supports multiple inheritance, you can easily emulate interfaces. ... Interfaces are concepts that belong to statically typed languages such as Java or C#, and do not really apply to a dynamic language such as Python.
* multi-paradigm
    * object oriented programming
    * structured programming
    * functional programming
* interpreted
* strongly typed
    * Python forbids operations that are not well-defined (for example, adding a number to a string) rather than silently attempting to make sense of them.
* dynamically typed
    * The type of a variable is not known until the code is run
* duck typing
    * "If it walks like a duck and it quacks like a duck, then it must be a duck" is the test used to determine if an object can be used for a particular purpose. With normal typing, suitability is determined by an object's type. In duck typing, an object's suitability is determined by the presence of certain methods and properties, rather than the type of the object itself.
* garbage collector
    * reference counting and a cycle-detecting garbage collector for memory management
* designed for code clarity
    * e.g. loops: `for element in collection`
* object introspection
    * By using introspection, we can dynamically examine Python objects. Code introspection is used to examine classes, methods, objects, modules and keywords, and to get information about them so that we can use it.
* interactive mode (terminal / IPython)
* interfaces to many popular programming languages:
    * C++
    * Java
    * .NET
    * and many others
* has its own package manager - pip & easy_install
* before writing a library - check if it exists
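
A few of the properties listed above (duck typing, strong and dynamic typing, and the underscore convention) can be seen in a short sketch. The names `Duck`, `Robot`, `make_it_quack` and `Account` are made up for illustration only.

```python
class Duck:
    def quack(self):
        return "Quack!"


class Robot:
    # Not related to Duck in any way, but it also has a quack() method
    def quack(self):
        return "Beep-quack!"


def make_it_quack(thing):
    # Duck typing: we never check the type, we only rely on quack() existing
    return thing.quack()


sounds = [make_it_quack(Duck()), make_it_quack(Robot())]
print(sounds)

# Strong typing: Python refuses to guess what number + string should mean
try:
    1 + "1"
except TypeError as err:
    print("TypeError:", err)

# Dynamic typing: the same name can be rebound to objects of different types
value = 42
print(type(value).__name__)   # int
value = "forty-two"
print(type(value).__name__)   # str


class Account:
    def __init__(self):
        # Leading underscore: "private" by convention only
        self._balance = 100


# Nothing stops us from reading it; it is an agreement, not enforcement
print(Account()._balance)     # 100
```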

## Python | List

A list can be repeated by multiplying it with an integer, but multiplying it by another list raises `TypeError: can't multiply sequence by non-int of type 'list'`.

```python
my_list = [0]*10
my_list = list(range(10))
my_list
my_list[2]
my_list[-1]
my_list[4:10]
my_list[4:]
my_list[:]

# 2D - we can create a 2D list
r = 2
c = 5
list2d = [[0]*c for _ in range(r)]  # this is called a list comprehension
list2d[0]
list2d[0][3]
# It's worth mentioning that people often call a 2D list a 2D array - it's not the same.
# A 2D list is more precisely a list of lists, which only sometimes happens to be rectangular.
# What is the difference between a 2D list and Python's array, you may ask?
# In 99% of cases you are fine with lists - if you haven't used Python's arrays so far,
# then most likely you just never needed to - and that's fine!
# Arrays are homogeneous by definition - all objects within an array must be of the same type.
# Lists are heterogeneous - they don't care about the data type.
# Arrays tend to be faster and more memory efficient, but some operations, like extending the array, can be slow.
#     The array module is used to interface with C code.
# Lists are heavier, as every single element is a Python object,
#     but on the other hand adding new elements happens in amortized constant time.
# We will not go into more detail on Python's array, as we will cover something more interesting in a few slides.
```
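
To make the list versus array distinction from the comments above concrete, here is a small sketch using the standard library's `array` module. It also shows why the list comprehension is used to build the 2D list. All variable names are illustrative.

```python
from array import array

# A list happily mixes types
mixed = [1, "two", 3.0]
mixed.append(None)

# An array is homogeneous: typecode 'i' means signed int
ints = array('i', [1, 2, 3])
ints.append(4)

try:
    ints.append(2.5)  # a float does not fit into an int array
except TypeError as err:
    print("TypeError:", err)

print(mixed)
print(ints.tolist())   # [1, 2, 3, 4]

# The comprehension creates independent rows;
# multiplying the outer list repeats the very same row object
rows = [[0] * 3 for _ in range(2)]
aliased = [[0] * 3] * 2
rows[0][0] = 1
aliased[0][0] = 1
print(rows)      # [[1, 0, 0], [0, 0, 0]]
print(aliased)   # [[1, 0, 0], [1, 0, 0]]
```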

## Python | Scope

```python
x = 4 # comment it in in the end

def fun():
    #global x
    #x = x+1
    return x+1

fun()
```
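
The effect of the commented-out lines can be sketched as three separate functions. The function names `read_only`, `broken` and `with_global` are hypothetical and exist only to show each case side by side.

```python
x = 4


def read_only():
    # Reading a global name needs no declaration
    return x + 1


def broken():
    # Assigning to x makes it local to this function,
    # so reading it before the assignment fails
    x = x + 1
    return x


def with_global():
    # 'global' tells Python to rebind the module-level x
    global x
    x = x + 1
    return x


print(read_only())        # 5

try:
    broken()
except UnboundLocalError as err:
    print("UnboundLocalError:", err)

print(with_global())      # 5
print(x)                  # 5, the module-level x was changed
```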

# NumPy

## NumPy

* linear algebra library for Python
* main building block for data-oriented libraries
* It's fast
* Even faster if you install it using Anaconda

```bash
pip install numpy
# or
conda install numpy
```

Now that we have refreshed our Python, it's time to learn about the first Python data analysis library in this lecture - NumPy.

* it's a linear algebra library for Python
* it's super important for data science - almost all Python data-oriented libraries rely on NumPy as their main building block.
* you may think that Python, as an interpreted language, is slow, but NumPy binds to C libraries, thanks to which it can be blazingly fast

```python
import numpy as np
```

## NumPy | Creating Arrays

```python
# single dimensional list/array
my_list = [0,1,2,3,4]
arr = np.array(my_list)
arr
# two dimensional
my_mat = [[1,2,3], [4,5,6], [7,8,9]]
np.array(my_mat)

# usually, however, you will use NumPy's built-in functions to create arrays a lot faster

np.arange(0,10) # similar to Python's range
np.arange(0,10, 2)

np.arange(0,10).reshape(2,5) # reshape

np.zeros(3)
np.zeros((5,5)) # rows, columns

np.ones((2,4))

np.linspace(0,10,6) # don't confuse it with arange: arange walks the range with a fixed step,
# while linspace gives us exactly N evenly spaced points in the given range
np.linspace(0,10,101)

np.eye(4) # NxN, as an identity matrix must be square;
# the identity matrix is useful when dealing with linear algebra problems

np.random.rand(2) # draws samples from a uniform distribution over 0 to 1
np.random.rand(2,2) # oddly enough, for more dimensions we do not pass a tuple, but just add another argument

np.random.randint(0,10)  # draws a single int from the given range
np.random.randint(0,10,(3,3)) # draws a 3x3 array from the given range

np.random.randn(4) # draws N samples from a normal distribution centered around 0
```
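
The difference between `arange` and `linspace`, and the different ways the random helpers take a shape, are easy to mix up. A minimal sketch with illustrative variable names:

```python
import numpy as np

# arange: fixed step, stop value excluded
steps = np.arange(0, 10, 2)
print(steps)     # [0 2 4 6 8]

# linspace: fixed number of points, stop value included
points = np.linspace(0, 10, 6)
print(points)    # [ 0.  2.  4.  6.  8. 10.]

# rand takes the dimensions as separate arguments,
# randint takes the shape as a single tuple
uniform = np.random.rand(2, 3)
integers = np.random.randint(0, 10, (2, 3))
print(uniform.shape, integers.shape)   # (2, 3) (2, 3)
```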

## NumPy | Data types

* NumPy arrays are homogeneous (elements of the same type)

```python
l = np.array([1,2])
print(l.dtype)

l = np.array([1.0,2.0])
print(l.dtype)

l = np.array([1.0,2.0], dtype=np.int16)
print(l.dtype)
```
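
Because an array has a single dtype, values are converted to fit it. The sketch below, with made-up values, shows what that means in practice when converting floats to integers and when mixing types.

```python
import numpy as np

floats = np.array([1.7, 2.2, -3.9])

# astype returns a new array; float -> int drops the fractional part
ints = floats.astype(np.int16)
print(ints, ints.dtype)    # [ 1  2 -3] int16

# Mixing ints and floats in one array upcasts everything to float
mixed = np.array([1, 2.5])
print(mixed.dtype)         # float64

# The dtype determines how many bytes each element takes
small = np.zeros(4, dtype=np.int16)
large = np.zeros(4, dtype=np.float64)
print(small.itemsize, large.itemsize)   # 2 8
```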

## NumPy | Operations

```python
np.random.seed(101)
arr = np.random.randint(0,10,(5))
print(arr)
arr.min()
arr.max()
arr.argmin()
arr.argmax()
arr.dtype
```

```
[1 6 7 9 8]
dtype('int32')
```

## NumPy | Indexing

```python
mat = np.arange(0,9).reshape(3, 3)
print(mat)

print(mat[0][1]) # we can use the same syntax as with Python's lists
print(mat[0,1])  # but comma-separated indices are the standard way of getting a value out of a NumPy array
print(mat[:,1])
print(mat[1,:])
print(mat[:2,1:])
```

## NumPy | Boolean Masking



```python

mat>4
mat[mat>4]

# we can also generate a mask by comparing two expressions element by element
(mat*2) == (mat**2)
```
```
==	np.equal		!=	np.not_equal
<	np.less		<=	np.less_equal
>	np.greater		>=	np.greater_equal
```

[[0 1 2]
 [3 4 5]
 [6 7 8]]


array([[ True, False,  True],
       [False, False, False],
       [False, False, False]])

## NumPy | Broadcasting


Arithmetic operations on arrays are usually done on corresponding elements. If two arrays are of exactly the same shape, then these operations are smoothly performed.

The term broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is "broadcast" across the larger array so that they have compatible shapes.

```python
# to avoid being a memory hog when dealing with larger matrices, NumPy slices are views that reference the original data, not copies

list_f = np.arange(0,11)
list_a = list_f[:5]
list_a[:] = 999
list_f

list_copy = list_f[:5].copy()
list_copy[:] = 33
list_copy
list_f

a = np.ones((3,3))
a * [1,2,3] * [[1], [2], [3]]
a * [2,2] # this will cause an error
```
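
The shape rule behind the example above can be sketched as follows: shapes are compared starting from the last dimension, and two sizes are compatible when they are equal or one of them is 1. The array names are illustrative.

```python
import numpy as np

a = np.ones((3, 3))
row = np.array([1, 2, 3])          # shape (3,), treated as (1, 3)
col = np.array([[1], [2], [3]])    # shape (3, 1)

by_row = a * row   # every row of a is multiplied by [1, 2, 3]
by_col = a * col   # every column of a is multiplied by [1, 2, 3]
print(by_row)
print(by_col)

# (3, 3) against (2,): the trailing sizes 3 and 2 differ and neither is 1
try:
    a * np.array([2, 2])
except ValueError as err:
    print("ValueError:", err)

# A slice is a view onto the same data; copy() gives independent data
base = np.arange(5)
view = base[:2]
view[:] = 99
print(base)        # [99 99  2  3  4]
```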

# Data Visualization

## Data Visualization | Matplotlib

* very (most?) popular plotting library for Python

PCA / t-SNE

# 🐼 Pandas

## Pandas

* open-source library built on top of NumPy
* fast analysis, data cleaning and preparation
* built-in visualization features


* Excel for Python

```bash
pip install pandas
# or
conda install pandas
```

```python
import pandas as pd

data_url = 'https://gist.githubusercontent.com/macsz/48c62a681de563e16f64438882694443/raw/0a058af5c0e249b77610cb56c804d77e1de7c2b9/data.csv'

df = pd.read_csv(data_url)
df

df['animal']

df[['uniq_id', 'animal']]  # mind it's a list

df['water_need'].max()

df.describe()

# Just like in NumPy, we can use boolean filters, or masks
df['water_need'] > 550

df[ df['water_need'] > 550 ]
```
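
If the CSV file is not reachable, the same filtering can be tried on a small in-memory frame. The rows below are made up for illustration; only the column names follow the example above.

```python
import pandas as pd

# Made-up rows; only the column names mirror the example above
df = pd.DataFrame({
    'uniq_id': [1001, 1002, 1003, 1004],
    'animal': ['elephant', 'zebra', 'lion', 'kangaroo'],
    'water_need': [600, 220, 390, 410],
})

mask = df['water_need'] > 400    # boolean Series, one value per row
thirsty = df[mask]               # keeps only the rows where the mask is True
print(thirsty)

# Conditions combine with & and |, each wrapped in parentheses
selected = df[(df['water_need'] > 300) & (df['animal'] != 'lion')]
print(selected['animal'].tolist())   # ['elephant', 'kangaroo']
```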

# Preprocessing with SciKit

# Machine Learning

# Graph Computing

## Graph Computing | Concept

* TensorFlow - define graph statically in a session before model can be run
* PyTorch - define, change and execute graph nodes as you go

**Both TensorFlow and PyTorch process any model as directed acyclic graph (DAG)**

<center>
    <img src="files/graphs.png" alt="DAG vs DCG" />
</center>

## Graph Computing | Example

$$ res = (15*5) / (15+5) $$

$$ a = 5 $$
$$ b = 15 $$
$$ prod = a * b $$
$$ sum = a + b $$
$$ res = prod / sum $$

<center>
    <img src="https://miro.medium.com/max/700/1*oo8djcq1ykZxxsSEo6jx2g.gif" alt="Graph processing order" />
    <div><i>Source: <a href="https://medium.com/@d3lm/understand-tensorflow-by-mimicking-its-api-from-scratch-faa55787170d" target="_blank">medium.com</a></i></div>
</center>
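
The processing order in the example above can be sketched in plain Python: each node is evaluated only after all of its inputs. This is an illustration of the idea, not how TensorFlow or PyTorch implement it; the `ops` dictionary layout is an assumption made for this sketch, and `graphlib` requires Python 3.9 or newer.

```python
import operator
from graphlib import TopologicalSorter

# Known input values
values = {'a': 5, 'b': 15}

# node name -> (function to apply, names of the input nodes)
ops = {
    'prod': (operator.mul, ('a', 'b')),
    'sum': (operator.add, ('a', 'b')),
    'res': (operator.truediv, ('prod', 'sum')),
}

# Map every node to the set of nodes it depends on
dependencies = {name: set(inputs) for name, (_, inputs) in ops.items()}

# static_order() yields every node after all of its dependencies
for name in TopologicalSorter(dependencies).static_order():
    if name in ops:
        function, inputs = ops[name]
        values[name] = function(*(values[i] for i in inputs))

print(values['res'])   # 3.75
```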

## Graph Framework

## Graph Framework | Operations

```python
class Operation(object):
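    def __init__(self, input_nodes=None):
        # Avoid a mutable default argument: one shared list would leak between instances
        self.input_nodes = input_nodes if input_nodes is not None else []
        self.output_nodes = []
        
        # Register this operation as a consumer of each of its inputs
        for node in self.input_nodes:
            node.output_nodes.append(self)
        
        _default_graph.operations.append(self)
    
    def compute(self):
        # Subclasses must override this; NotImplemented is a constant, not an exception
        raise NotImplementedError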
```

## Graph Framework | Operations | Add

```python
class Add(Operation):
    def __init__(self, x, y):
        super().__init__([x, y])
    
    def compute(self, x_var, y_var):
        self.inputs = [x_var, y_var]
        return x_var + y_var
```

## Graph Framework | Operations | Multiply

```python
class Mul(Operation):
    def __init__(self, x, y):
        super().__init__([x, y])
    
    def compute(self, x_var, y_var):
        self.inputs = [x_var, y_var]
        return x_var * y_var
```

## Graph Framework | Operations | MatMul

```python
class MatMul(Operation):
    def __init__(self, x, y):
        super().__init__([x, y])
    
    def compute(self, x_var, y_var):
        self.inputs = [x_var, y_var]
        return x_var.dot(y_var)
```

## Graph Framework | Placeholder

Placeholder - a special node which is used as data input

```python
class Placeholder(object):
    def __init__(self):
        self.output_nodes = []
        _default_graph.placeholders.append(self)
```

## Graph Framework | Variables

```python
class Variable(object):
    def __init__(self, initial_value=None):
        self.value = initial_value
        self.output_nodes = []
        
        _default_graph.variables.append(self)
```

## Graph Framework | Graph Object

```python
class Graph(object):
    def __init__(self):
        self.operations = []
        self.placeholders = []
        self.variables = []
    
    def set_as_default(self):
        global _default_graph
        _default_graph = self
```

## Graph Framework | Session

```python
class Session(object):
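    def run(self, operation, feed_dict=None):
        feed_dict = feed_dict or {}
        # Order the nodes so that inputs come before the operations that use them
        nodes_postorder = self._traverse_postorder(operation)
        
        for node in nodes_postorder:
            if isinstance(node, Placeholder):
                # Placeholders take their value from the feed dictionary
                node.output = feed_dict[node]
            elif isinstance(node, Variable):
                node.output = node.value
            else:
                # Operation: collect the already computed inputs and apply it
                node.inputs = [input_node.output for input_node in node.input_nodes]
                node.output = node.compute(*node.inputs)
            
            if isinstance(node.output, list):
                node.output = np.array(node.output)
        return operation.output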
                
    def _traverse_postorder(self, operation):
        nodes_postorder = []
        
        def recurse(node):
            if isinstance(node, Operation):
                for input_node in node.input_nodes:
                    recurse(input_node)
            nodes_postorder.append(node)
            
        recurse(operation)
        return nodes_postorder
```


## Graph Framework | Results

```python
g = Graph()
g.set_as_default()

A = Variable(10)
b = Variable(1)
x = Placeholder()

y = Mul(A, x)
z = Add(y, b)
```
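
The graph above is only defined; to get a value it still has to be run with an input for `x`. The following is a condensed, self-contained sketch of the same idea so that it can be executed on its own. It is a simplification under stated assumptions: the `Node` base class and the module-level `run` and `traverse_postorder` functions are hypothetical stand-ins for the `Graph` and `Session` classes, and the default-graph registration is left out.

```python
class Node:
    # Hypothetical common base class, introduced only for this sketch
    def __init__(self, input_nodes=()):
        self.input_nodes = list(input_nodes)
        self.output = None


class Variable(Node):
    def __init__(self, value):
        super().__init__()
        self.value = value


class Placeholder(Node):
    pass


class Add(Node):
    def __init__(self, x, y):
        super().__init__([x, y])

    def compute(self, x_val, y_val):
        return x_val + y_val


class Mul(Node):
    def __init__(self, x, y):
        super().__init__([x, y])

    def compute(self, x_val, y_val):
        return x_val * y_val


def traverse_postorder(node, ordered=None):
    # Inputs are appended before the node that consumes them
    ordered = [] if ordered is None else ordered
    for input_node in node.input_nodes:
        traverse_postorder(input_node, ordered)
    ordered.append(node)
    return ordered


def run(operation, feed_dict):
    for node in traverse_postorder(operation):
        if isinstance(node, Placeholder):
            node.output = feed_dict[node]
        elif isinstance(node, Variable):
            node.output = node.value
        else:
            inputs = [n.output for n in node.input_nodes]
            node.output = node.compute(*inputs)
    return operation.output


A = Variable(10)
b = Variable(1)
x = Placeholder()

z = Add(Mul(A, x), b)     # z = A * x + b

result = run(z, {x: 10})
print(result)             # 101
```

With the classes from the slides, the equivalent call would be `Session().run(z, feed_dict={x: 10})`.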

# TensorFlow

https://towardsdatascience.com/pytorch-vs-tensorflow-spotting-the-difference-25c75777377b

# BACKLOG