---

# Data Mining:<br>Statistical Modeling and Learning from Data

## Dr. Ciro Cattuto<br>Dr. Laetitia Gauvin<br>Dr. André Panisson

### Exercises - Numpy

---

In [None]:
import numpy as np

# Indexing/Slicing Review

* `s[i]` (indexing)
* `s[i:j]` (slicing)
* `s[i:j:k]` (step slicing)
* meaning of negative indices
* 0-based counting
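
As a quick illustration of the notation above, here is a small sketch on a string (the variable `s` is only an example value); the same rules apply to lists:

```python
s = "numpy"

print(s[0])     # first character: 'n'
print(s[-1])    # negative indices count from the end: 'y'
print(s[1:4])   # from index 1 up to, but not including, index 4: 'ump'
print(s[::2])   # every second character: 'nmy'
```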

## EXERCISE: Indexing Review

In [None]:
m = list(range(10))  # in Python 3, range() returns a lazy range object, so convert it to a list
m

In [None]:
# access the first position of the list
# YOUR CODE HERE

In [None]:
# access the last position of the list
# YOUR CODE HERE

The triple `[i:j:k]` corresponds to the arguments of a **slice** object:

### slice(start, stop[, step])
> Return a slice object representing the set of indices specified by range(start, stop, step). The start and step arguments default to None. Slice objects have read-only data attributes start, stop and step which merely return the argument values (or their default). They have no other explicit functionality; however they are used by Numerical Python and other third party extensions. Slice objects are also generated when extended indexing syntax is used. For example: a[start:stop:step] or a[start:stop, i].

http://docs.python.org/2/library/functions.html#slice
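
The attributes described in the quote can be inspected directly. The following is a minimal sketch; the variable name `s` and the sequence lengths passed to `indices` are arbitrary choices for illustration:

```python
s = slice(0, 5, 2)

# The arguments are stored as read-only attributes
print(s.start, s.stop, s.step)   # 0 5 2

# Omitted arguments default to None
short = slice(3)
print(short.start, short.stop, short.step)   # None 3 None

# indices(n) resolves the slice against a sequence of length n
print(s.indices(10))                       # (0, 5, 2)
print(slice(None, None, -1).indices(4))    # (3, -1, -1)
```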

For example, to return the elements at the first three even positions of a list (indices 0, 2 and 4), you can use:

In [None]:
m[slice(0,5,2)]

which is equivalent to

In [None]:
m[0:5:2]

## EXERCISE: Slicing Review

In [None]:
# access the first five elements of the list
# YOUR CODE HERE

In [None]:
# access the last five elements of the list
# YOUR CODE HERE

In [None]:
# access the list elements in reverse order
# YOUR CODE HERE

## pylab mode

These imports are done for you in `pylab` mode:

    import numpy as np
    import matplotlib.pyplot as plt
    
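To enable `pylab` mode, run the following magic command. It also imports the NumPy and matplotlib functions (`array`, `arange`, `plot`, ...) directly into the namespace, which is why the cells below call them without the `np.` or `plt.` prefix: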

In [None]:
%pylab inline

# NumPy

<http://www.numpy.org/>:

NumPy is the **fundamental package for scientific computing with Python**. It contains among other things:

* a powerful N-dimensional array object
* sophisticated (**broadcasting**) functions [what is *broadcasting*?]
* tools for integrating C/C++ and Fortran code
* useful linear algebra, Fourier transform, and random number capabilities

### ndarray.ndim, ndarray.shape

In [None]:
# 0-d array (a scalar)

a0 = array(5)
a0

In [None]:
a0.ndim, a0.shape

In [None]:
# 1-d array
a1 = array([1,2])
a1.ndim, a1.shape

In [None]:
# 2-d array
a2 = array(([1,2], [3,4]))
a2.ndim, a2.shape

In [None]:
a = arange(10)

In [None]:
a.dtype

### Array creation routines

In [None]:
a = array([1,2])
a

In [None]:
a = zeros((2,2))
a

In [None]:
a = ones((2,2))
a

In [None]:
a = empty((2,2))
a

In [None]:
a = eye(3,4,1)
a

In [None]:
a = identity(3)
a

In [None]:
a = diag(arange(4))
a

In [None]:
a = linspace(1,10,5) 
a

In [None]:
a = logspace(1,2,5) 
a

### Type hierarchy: 

<img src="http://docs.scipy.org/doc/numpy/_images/dtype-hierarchy.png">

In [None]:
a = arange(10, dtype=float)
a.dtype

In [None]:
a = arange(10, dtype=byte)
a.dtype

In [None]:
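# 128 is outside the int8 range (-128..127): older NumPy versions wrap it to -128,
# while NumPy 2.x raises an OverflowError
a[0] = 128
a[0]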

In [None]:
a1 = a.astype(int16)
a1[0] = 128
a1[0]
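
The value range of an integer dtype can be queried with `np.iinfo`. The sketch below (the array `values` is made up for illustration) also shows what an explicit cast with `astype` does to values that do not fit in the target type:

```python
import numpy as np

info = np.iinfo(np.int8)
print(info.min, info.max)          # -128 127

values = np.array([127, 128, 129], dtype=np.int16)

# astype performs a C-style cast: values outside the int8 range wrap around
wrapped = values.astype(np.int8)
print(wrapped)                     # [ 127 -128 -127]
```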

### reshape, transpose

In [None]:
a = arange(64)
a

In [None]:
# map a 0..63 1d array to a 8x8 2d array
a1 = a.reshape(8,8)
a1

In [None]:
a.shape = (8,8)
a

In [None]:
a.T

### stacking & concatenation

In [None]:
a = array([[1, 2], [3, 4]])
b = array([[5, 6]])
print(a.shape, b.shape)

In [None]:
x = concatenate((a, b), axis=0) # vertical stack
print(x, x.shape)

In [None]:
y = concatenate((a, b.T), axis=1) # horizontal
print(y, y.shape)

In [None]:
print(vstack((a,b)))
print(hstack((a,b.T)))

In [None]:
print(append(a, b, axis=0))
print(append(a, b.T, axis=1))

In [None]:
print(insert(a, 0, b, axis=0))
print(insert(a, 0, b, axis=1))

## NumPy operations

An example of [broadcasting](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html):

> The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations. There are, however, cases where broadcasting is a bad idea because it leads to inefficient use of memory that slows computation.

In [None]:
2*a

In [None]:
a+2

More broadcasting

<img src="http://scipy-lectures.github.io/_images/numpy_broadcasting.png" width="600">

In [None]:
a = array([range(0,3)]*4)
b = array([range(0,40,10)]*3).T
a, b

In [None]:
a+b

In [None]:
b + arange(0,3)

In [None]:
arange(0,40,10).reshape(4,1) + arange(0,3)

In [None]:
w = arange(0,3)
a = arange(0,12).reshape(4,3)
print(w)
print(a)
print(w * a)
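
The shape rule behind these examples can be checked directly. The following is a minimal sketch (the names `col`, `row` and `stretched` are chosen here for illustration): shapes are compared starting from the trailing axis, and two axes are compatible when they are equal or when one of them has size 1.

```python
import numpy as np

col = np.arange(0, 40, 10).reshape(4, 1)   # shape (4, 1)
row = np.arange(3)                         # shape (3,)

# (4, 1) and (3,) are compatible: the result has shape (4, 3)
total = col + row
print(total.shape)

# broadcast_to exposes the stretched operand as a read-only view;
# the zero stride along axis 0 shows that no data is copied
stretched = np.broadcast_to(row, (4, 3))
print(stretched.strides[0])

# Shapes (4,) and (3,) are not compatible: the trailing axes differ and neither is 1
try:
    np.arange(4) + np.arange(3)
except ValueError as err:
    print("incompatible shapes:", err)
```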

## Indexing/Slicing

In [None]:
a3 = arange(30) 
a3

In [None]:
print(a3[0])
print(a3[::-1])
print(a3[2:5])

In [None]:
print(a3[[2,3,4,6,5,2]])

In [None]:
np.mod(a3, 3)

Select numbers divisible by 3.

In [None]:
# list comprehension
[i for i in a3 if i % 3 == 0]

In [None]:
np.mod(a3, 3) == 0

In [None]:
divisible_by_3 = np.mod(a3, 3) == 0
a3[divisible_by_3]

2d, 3d slicing

In [None]:
a = arange(64).reshape(8,8)
a

In [None]:
a[0,:]

In [None]:
a[:,0]

In [None]:
a[:2,:2]

In [None]:
a[::2,::2]

In [None]:
b = arange(27).reshape(3,3,3)
b

In [None]:
b[0,:,:]

In [None]:
b[:,0,:]

In [None]:
b[:,:,0]

## Exercise:  Calculate a series that holds all the squares less than 100

In [None]:
# Use arange, np.sqrt, astype
# YOUR CODE HERE

# NumPy Functions

http://docs.scipy.org/doc/numpy/reference/routines.math.html

In [None]:
import random
a = array([random.randint(0, 10) for i in range(10)])

print(a)
print(a.min())
print(a.max())
print(a.mean())
print(a.std()) # standard deviation
print(a.sum())

In [None]:
b = arange(16).reshape(4, 4)
print(b)
print(b.T)
print(b.trace())
print(b.min())
print(b.min(axis=0))
print(b.min(axis=1))
print(b.ravel())

In [None]:
a = arange(0,3)
b = arange(1,4)
print(a, b)
print(np.dot(a,b))

### Matrices

In [None]:
a = matrix([[3, 2, -1], [2, -2, 4], [-1, .5, -1]])
print(a)
print(a.trace())
print(a.diagonal())
print(a.T) # matrix transpose
print(a.I) # matrix inverse
print(a.H) # matrix conjugate transpose

In [None]:
a = matrix(arange(0,6).reshape(2,3))
print(a * a)  # this raises a ValueError: a 2x3 matrix cannot be multiplied by itself, we need the transpose

In [None]:
print(a * a.T)
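
The NumPy documentation no longer recommends `np.matrix` for new code. As a sketch of the alternative, the same product can be written with plain arrays and the `@` operator; note that `*` on arrays is element-wise:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

# Matrix product of a (2x3) with its transpose (3x2) gives a 2x2 array
product = a @ a.T
print(product)

# On plain arrays, * multiplies element by element and keeps the 2x3 shape
elementwise = a * a
print(elementwise)
```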

## Exercise: Implement the matrix product using Python loops and compare the execution time with the NumPy implementation

In [None]:
a = arange(50).reshape(5,10)

def my_dot(a,b):
    
    for i in range(a.shape[0]):
        
    
    # YOUR CODE HERE

In [None]:
%timeit my_dot(a,a.T)

In [None]:
%timeit np.dot(a, a.T)  # a is a plain array here, so a*a.T would be an (invalid) element-wise product

## Exercise:  Reimplement the perceptron using NumPy (e.g., using the matrix product operation)

In [None]:
%%file data.csv
x1,x2,y
0.4946,5.7661,0
4.7206,5.7661,1
1.2888,5.3433,0
4.2898,5.3433,1
1.4293,4.5592,0
4.2286,4.5592,1
1.1921,5.8563,0
3.1454,5.8563,1
1.063,5.7357,0
5.1043,5.7357,1
1.5079,5.8622,0
3.9799,5.8622,1
0.2678,6.9931,0
4.5288,6.9931,1
0.9726,3.6268,0
4.106,3.6268,1
2.5389,3.3884,0
4.7555,3.3884,1
2.473,5.6404,0
4.7977,5.6404,1

In [None]:
data = np.genfromtxt('data.csv', delimiter=',', skip_header=1)

X = data[:,:-1]
y = data[:,-1]

print(X)
print(y)

In [None]:
c0 = y==0
c1 = y==1

plot(X[:,0][c0], X[:,1][c0], 'o', mec='r', mfc='none')
plot(X[:,0][c1], X[:,1][c1], 'o', mec='g', mfc='none')

In [None]:
class Perceptron:
        
    def predict(self, x):
        # YOUR CODE HERE

    def fit(self, X, y):
        self.w = zeros(X.shape[1])  # one weight per feature (column of X)
        # YOUR CODE HERE, using dot product

p = Perceptron()
p.fit(X, y)

for i,x in enumerate(X):
    if p.predict(x) != y[i]:
        print('FAIL')
        break
else:
    print('SUCCESS!')

## Exercise:  Read a digits image and split it up in different training examples

In [None]:
# scipy.misc.imread has been removed from SciPy, so the old import no longer works:
# from scipy import misc
# img = misc.imread('../data/digits.png', flatten=1)

from matplotlib.pyplot import imread
img = imread('../data/digits.png')

print(img.shape)
print(img.dtype)

imshow(img, cmap=cm.Greys_r)

In [None]:
data = []

# YOUR CODE HERE

In [None]:
print(data[20])
imshow(data[20], cmap=cm.Greys_r)

# Random Generators

In [None]:
from numpy import random
data = random.normal(size=1000, loc=2)
h=plt.hist(data, bins=50)

https://numpy.org/doc/stable/reference/random/index.html

In [None]:
from numpy import random
data = random.exponential(size=1000)
h = plt.hist(data, bins=50)
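
The documentation linked above recommends the `Generator` interface for new code. A minimal sketch of the same two draws with `np.random.default_rng` follows; the seed value and the variable names are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # seed chosen arbitrarily, for reproducibility

normal_samples = rng.normal(loc=2, scale=1, size=1000)
exp_samples = rng.exponential(scale=1, size=1000)

# Sample statistics should be close to the distribution parameters
print(normal_samples.mean(), normal_samples.std())
print(exp_samples.mean())
```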

## Exercise:  Create a set of 100 points that follow the function $f(x) = 0.5x + 1$ and add Gaussian white noise to the result.

In [None]:
X = linspace(1,10,100)

# YOUR CODE HERE

## Exercise: Use polynomial fitting (numpy.polyfit) to fit a line to the results and verify that the residual error is Gaussian white noise.

In [None]:
c = polyfit(X, y, 1)

# YOUR CODE HERE

# Linear Regression

## Exercise: Linear Regression with Ordinary Least Squares

Find the weight values $\mathbf{w}$ that minimize the error $E_{\rm in}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} {(\mathbf{w}^{\rm T} \mathbf{x}_n - y_n)^2}$, where $\mathbf{x}_n$ is the $n$-th row of $\mathbf{X}$ and $y_n$ is the $n$-th entry of $\mathbf{y}$.

For this, implement Linear Regression and use the Ordinary Least Squares (OLS) closed-form expression to find the estimated values of $\mathbf{w}$:

$$\mathbf{w} = (\mathbf{X}^{\rm T}\mathbf{X})^{-1} \mathbf{X}^{\rm T}\mathbf{y}$$

You can use the `np.linalg.inv` function to invert a matrix:

In [None]:
help(np.linalg.inv)
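
As a numerical check of the closed-form expression, here is a minimal sketch on noise-free toy data. The data values and variable names are made up for illustration, and `np.linalg.lstsq` is used only as an independent reference:

```python
import numpy as np

# Toy data (made up for illustration): y = 1 + 2*x exactly, no noise
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Closed form: w = (X^T X)^{-1} X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y
print(w)   # approximately [1. 2.]

# Same estimate without forming the inverse explicitly
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_lstsq)
```

Forming the inverse explicitly mirrors the formula; for ill-conditioned data, a least-squares solver is usually the numerically safer choice.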

The following function generates some data to test the linear regression:

In [None]:
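def generate_linear_regression(c):
    # c[0] is the intercept, c[1:] are the feature weights
    X = np.random.rand(100, len(c)-1)
    noise = np.random.normal(size=len(X), loc=0, scale=0.1)
    # add the intercept once, after summing the weighted features
    y = (c[0] + (X*c[1:]).sum(axis=1) + noise).reshape(100, 1)
    
    return X, y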

X, y = generate_linear_regression([2, 0.8])
scatter(X, y)

To implement the linear regression, follow these steps:

1. extend $\mathbf{X}$ to $d + 1$ dimensions, setting the first column to 1
2. calculate $\mathbf{w}$ using the OLS closed form

In [None]:
# YOUR CODE HERE

## Exercise: Linear Regression with nonlinear data

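Linear regression can still be used when the relation between $\mathbf{X}$ and $\mathbf{y}$ is nonlinear, because the model only needs to be linear in the weights $\mathbf{w}$.
The first step is to apply a nonlinear transformation to the data and append the result as new features.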

The following function creates nonlinear data:

In [None]:
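def generate_linear_regression_squared(c):
    # c[0] is the intercept, c[1:] are the weights of the squared features
    X = np.random.rand(100, len(c)-1)
    noise = np.random.normal(size=len(X), loc=0, scale=0.1)
    # add the intercept once, after summing the weighted squared features
    y = (c[0] + (c[1:]*X**2).sum(axis=1) + noise).reshape(100, 1)
    
    return X, y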

X, y = generate_linear_regression_squared([2, 2])
scatter(X, y)

To apply linear regression to nonlinear data, follow these steps:

1. extend $\mathbf{X}$ to add a nonlinear transformation of $\mathbf{X}$ (e.g., $\mathbf{Z} = [\mathbf{X}, \mathbf{X}^2]$).
2. extend $\mathbf{Z}$ to $d + 1$ dimensions, setting the first column to 1.
3. calculate $\mathbf{w}$ applying OLS to $\mathbf{Z}$.
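
The feature expansion in the steps above can be sketched on noise-free toy data. The data and the names `Z` and `Z1` are assumptions made for illustration; this is one possible way to organize the computation, not the only one:

```python
import numpy as np

# Toy data (made up for illustration): y = 2 + 2*x^2 exactly, no noise
X = np.linspace(0, 1, 20).reshape(-1, 1)
y = 2 + 2 * X[:, 0] ** 2

# Append the squared feature: Z = [X, X^2]
Z = np.hstack([X, X ** 2])

# Prepend a column of ones for the intercept
Z1 = np.hstack([np.ones((len(Z), 1)), Z])

# OLS closed form applied to the extended matrix
w = np.linalg.inv(Z1.T @ Z1) @ Z1.T @ y
print(w)   # approximately [2. 0. 2.]
```

The fitted weight of the linear term is close to zero because the toy data contain only a constant and a quadratic term.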

In [None]:
# YOUR CODE HERE