# Discussion 3

## NumPy

NumPy is a fundamental tool for machine learning. Most of the numerical data handling needed for ML projects can be done with NumPy.

NumPy is built around the n-dimensional array (a matrix is an array of arrays). Unlike a pandas DataFrame, a NumPy array holds elements of a single data type, so a numeric array will not have strings mixed in.

By convention, numpy is imported under the name np. This is not required, but it is almost always done.

In [1]:
import numpy as np

a = np.array([1, 2, 3])
b = np.asarray([4,5,6])
a, b

(array([1, 2, 3]), array([4, 5, 6]))

In [2]:
b = np.array([[1, 2, 3], [4, 5, 6]])
b

array([[1, 2, 3],
       [4, 5, 6]])

One of the most important attributes of a NumPy array is array.shape. It holds the dimensions of the array as a tuple.

In [3]:
print(a.shape, b.shape)

(3,) (2, 3)
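As a small illustrative sketch (the variable names here are chosen for this example only), shape can be read alongside the related attributes ndim and size:

```python
import numpy as np

vector = np.array([1, 2, 3])
matrix = np.array([[1, 2, 3], [4, 5, 6]])

# shape is a tuple with one entry per axis
print(vector.shape)   # (3,)
print(matrix.shape)   # (2, 3)

# ndim is the number of axes, size is the total number of elements
print(matrix.ndim, matrix.size)   # 2 6
```

Note that a 1-D array has shape (3,), not (1, 3) or (3, 1); this distinction matters for the reshape examples that follow.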


Likewise, when an array is not in the shape you need, you can use array.reshape to change it.

There are restrictions, though: you cannot pass just any shape, because the total number of elements must stay the same.

Link: https://numpy.org/doc/stable/reference/generated/numpy.reshape.html
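The element-count restriction can be sketched as follows. The array `values` is made up for this example; passing -1 for one axis asks NumPy to infer that axis from the remaining ones.

```python
import numpy as np

values = np.arange(1, 7)          # six elements: 1..6

grid = values.reshape(2, 3)       # 2 * 3 == 6, so this works
inferred = values.reshape(3, -1)  # -1 lets NumPy infer the axis: shape (3, 2)

try:
    values.reshape(4, 2)          # 4 * 2 == 8, which does not match 6 elements
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)
```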

In [5]:
c = b.reshape(3,2)
print(c, c.shape)

[[1 2]
 [3 4]
 [5 6]] (3, 2)


In [5]:
c = b.transpose()
c.shape

(3, 2)
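Both b.reshape(3,2) and b.transpose() produce an array of shape (3, 2), but the contents differ. A minimal sketch of the difference, reusing the same values as b:

```python
import numpy as np

b = np.array([[1, 2, 3], [4, 5, 6]])

# reshape keeps the elements in their original row-major order
reshaped = b.reshape(3, 2)     # rows: [1 2], [3 4], [5 6]

# transpose swaps the axes, so columns become rows (same as b.T)
transposed = b.transpose()     # rows: [1 4], [2 5], [3 6]

print(reshaped)
print(transposed)
print(np.array_equal(reshaped, transposed))   # False
```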

In [11]:
d = b.reshape(6)
print(d, d.shape)

[1 2 3 4 5 6] (6,)


Another reshape commonly used in machine learning tasks turns a 1-D array into a single column. Let's see what happens:

In [10]:
e = d.reshape(-1, 1)
print(e, e.shape)

[[1]
 [2]
 [3]
 [4]
 [5]
 [6]] (6, 1)


This is what is happening here: when you read a single feature into a NumPy array, you usually get a flat 1-D array of values.

The reshape above converts that one array of 6 values into 6 arrays containing 1 value each. In this shape, each inner array can be treated as one datapoint with a single feature, which is the 2-D layout (samples by features) that scikit-learn expects.
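To make the column-versus-row distinction concrete, here is a small sketch with a made-up feature array (`heights` is hypothetical data, not part of the notebook):

```python
import numpy as np

heights = np.array([150.0, 160.0, 170.0, 180.0])  # hypothetical single feature

as_column = heights.reshape(-1, 1)   # 4 samples with 1 feature each
as_row = heights.reshape(1, -1)      # 1 sample with 4 features

print(as_column.shape)   # (4, 1)
print(as_row.shape)      # (1, 4)
```

Which of the two is appropriate depends on whether the values are many observations of one feature or one observation of many features.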

### np array arithmetic

In [8]:
c = np.array([1, 2, 3])
c*2

array([2, 4, 6])

#### np array broadcasting

The term broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations. 

There are, however, cases where broadcasting is a bad idea because it leads to inefficient use of memory that slows computation.

Link: https://numpy.org/doc/stable/user/basics.broadcasting.html
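A short sketch of the broadcasting rule: shapes are compared from the trailing axis backwards, and two axes are compatible when they are equal or one of them is 1. The arrays `row` and `col` are invented for this example.

```python
import numpy as np

b = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])           # shape (3,)
col = np.array([[100], [200]])         # shape (2, 1)

row_sum = b + row   # row is added to every row of b
col_sum = b + col   # col is added to every column of b
print(row_sum)
print(col_sum)

try:
    b + np.array([1, 2, 3, 4])   # trailing axes 3 and 4 are incompatible
    broadcast_failed = False
except ValueError as err:
    broadcast_failed = True
    print(err)
```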

In [9]:
b, c

(array([[1, 2, 3],
        [4, 5, 6]]),
 array([1, 2, 3]))

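Broadcasting fails when the trailing dimensions do not match, for example when c has four elements:

In [12]:
c = np.array([1,2,3,4])

In [13]:
b+c

ValueError: operands could not be broadcast together with shapes (2,3) (4,) 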

Indexing a NumPy array works like indexing any multidimensional array.

To access a specific element, provide one index per axis, starting with the outermost axis (for a 2-D array, the row first and then the column). A colon selects everything along that axis.

In [12]:
d = np.array([[1, 2, 3], [4, 5, 6]])
print(d[0, 1])
print(d[:, 1])
for row in d:
    print(row)

2
[2 5]
[1 2 3]
[4 5 6]
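Beyond single elements and whole columns, a few other indexing forms come up often. This is a brief sketch using the same values as d; the variable names are chosen for illustration.

```python
import numpy as np

d = np.array([[1, 2, 3], [4, 5, 6]])

last_row = d[-1]         # negative indices count from the end: [4 5 6]
sub_block = d[:, 1:]     # all rows, columns 1 onward: [[2 3], [5 6]]
evens = d[d % 2 == 0]    # boolean mask returns a flat array: [2 4 6]

print(last_row, sub_block, evens, sep="\n")
```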


Matrix operations:

The two operations most often used with vectors and matrices are the dot product and matrix multiplication.

For two 1-D arrays, np.dot returns the scalar inner product. For two 2-D arrays it performs matrix multiplication, which is why both calls below print the same result. Link: https://numpy.org/doc/stable/reference/generated/numpy.dot.html

np.matmul (also available as the @ operator) returns the matrix product of the two arrays provided. The output is a matrix of the appropriate shape. This is not the cross product, which is a separate function, np.cross. Link: https://numpy.org/doc/stable/reference/generated/numpy.matmul.html

NumPy also provides efficient implementations of many more matrix operations, such as the transpose, and eigenvalues and eigenvectors through np.linalg.

In [17]:
e = np.array([[1, 2], [3, 4]])
f = np.array([[2, 0], [1, 2]])
print(np.dot(e, f))  # or e @ f

print(np.matmul(e, f))

[[ 4  4]
 [10  8]]
[[ 4  4]
 [10  8]]
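The following sketch separates the operations that are easy to confuse: the inner product of two vectors, the cross product, the matrix product, and the elementwise product. The vectors `u` and `v` are invented for this example.

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])

inner = np.dot(u, v)      # scalar: 1*4 + 2*5 + 3*6 = 32
cross = np.cross(u, v)    # vector perpendicular to u and v: [-3  6 -3]

e = np.array([[1, 2], [3, 4]])
f = np.array([[2, 0], [1, 2]])

product = e @ f           # matrix product: [[4 4], [10 8]]
elementwise = e * f       # elementwise product: [[2 0], [3 8]]

# eigenvalues of a diagonal matrix are its diagonal entries
eigenvalues = np.linalg.eigvals(np.diag([2.0, 3.0]))

print(inner, cross, product, elementwise, eigenvalues, sep="\n")
```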


Random:

NumPy can also generate random numbers.

np.random.rand draws from a uniform distribution over [0, 1). It takes no range arguments; to cover another interval, scale and shift the result.

In [18]:
np.random.rand()

0.6196524553757847

Passing dimensions to the same function generates an entire array of random numbers with the given shape.

In [21]:
np.random.rand(2,2)

array([[0.3799199 , 0.30460009],
       [0.46336676, 0.47060984]])

randint generates random integers in the given range; the lower bound is included and the upper bound is excluded.

In [23]:
np.random.randint(10, 100)

16


np.random also provides many common probability distributions, so you can draw random samples from the distribution you need.

link: https://numpy.org/doc/stable/reference/random/index.html


In [29]:
a = np.random.exponential()
a

0.5104959836159607
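The notebook uses the legacy np.random functions. As an alternative sketch, NumPy also offers a Generator object via np.random.default_rng, which makes results reproducible when a seed is given. The seed value 0 and the interval [5, 8) below are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # arbitrary seed, for reproducibility

uniform = rng.random((2, 2))                      # floats in [0, 1)
integers = rng.integers(10, 100, size=5)          # ints in [10, 100)
normal = rng.normal(loc=0.0, scale=1.0, size=3)   # standard normal samples
scaled = 5 + 3 * rng.random(4)                    # uniform floats in [5, 8)

# two generators created with the same seed produce the same numbers
same = np.random.default_rng(0).random() == np.random.default_rng(0).random()
print(uniform, integers, normal, scaled, same, sep="\n")
```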

## Scikit-learn

Scikit-learn provides ready-to-use implementations of the standard machine learning algorithms, which you can apply directly to your datasets.

sklearn also has a module named datasets, which contains several well-known datasets that are commonly used as examples and for testing.

In [14]:
from sklearn import datasets

iris = datasets.load_iris()

In [15]:
iris

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

Another important function in sklearn is train_test_split. Datasets are often not pre-divided into training and testing sets.

train_test_split randomly splits the datapoints into training and testing sets in whatever proportion you need; test_size=0.2 below holds out 20% of the data for testing.

In [16]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)

In [17]:
iris.data.shape

(150, 4)

In [18]:
X_train.shape, X_test.shape

((120, 4), (30, 4))
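Because the split is random, the shapes stay the same between runs but the selected rows do not. A sketch of two optional parameters that help here: random_state fixes the shuffle, and stratify keeps the class proportions equal in both sets. The value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

X_train, X_test, y_train, y_test = train_test_split(
    iris.data,
    iris.target,
    test_size=0.2,
    random_state=42,        # arbitrary seed so the split is repeatable
    stratify=iris.target,   # keep the same class balance in both sets
)

print(X_train.shape, X_test.shape)   # (120, 4) (30, 4)
print(np.bincount(y_test))           # 10 test samples per class
```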

You can find most of the classical ML models you want to build in sklearn. Different models live in different modules, but the building process is the same: create the model, fit it on the training data, then predict on new data.

In [19]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

In [20]:
print([i for i in zip(predictions, y_test)])

[(2, 2), (0, 0), (1, 1), (1, 2), (0, 0), (1, 1), (1, 1), (1, 1), (0, 0), (2, 2), (1, 1), (1, 1), (2, 2), (2, 2), (1, 1), (2, 2), (2, 2), (0, 0), (2, 2), (0, 0), (0, 0), (2, 2), (0, 0), (2, 2), (0, 0), (1, 1), (1, 1), (0, 0), (1, 1), (1, 1)]
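To illustrate that the fit and predict steps do not change from model to model, here is a sketch that runs a second classifier through the same loop. KNeighborsClassifier and the random_state values are choices made for this example, not part of the original notebook.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0
)

scores = {}
for model in (DecisionTreeClassifier(random_state=0), KNeighborsClassifier(n_neighbors=5)):
    model.fit(X_train, y_train)            # same call for every estimator
    predictions = model.predict(X_test)    # same call for every estimator
    scores[type(model).__name__] = model.score(X_test, y_test)  # mean accuracy

print(scores)
```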


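sklearn.metrics provides functions for computing error and score metrics. Note that mean squared error and mean absolute error treat the class labels as numbers, so for a classifier they are only a rough indicator of quality.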

Link: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics

In [22]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

# sklearn metrics take the true labels first, then the predictions
print('Mean squared error:', mean_squared_error(y_test, predictions))
print('Mean absolute error:', mean_absolute_error(y_test, predictions))

Mean squared error: 0.03333333333333333
Mean absolute error: 0.03333333333333333
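For classification, accuracy is usually a more natural summary than a squared or absolute error on the labels. A minimal sketch with made-up labels (`y_true` and `y_pred` here are illustrative, not the notebook's data):

```python
from sklearn.metrics import accuracy_score

y_true = [2, 0, 1, 2, 0, 1]   # hypothetical true labels
y_pred = [2, 0, 1, 1, 0, 1]   # hypothetical predictions, one mistake

accuracy = accuracy_score(y_true, y_pred)   # fraction of correct predictions
print(accuracy)   # 5 correct out of 6
```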


Confusion matrix: another important metric available in sklearn.metrics is the confusion matrix.
For a binary problem, a confusion matrix is a 2x2 matrix showing the counts of true negatives, false positives, false negatives and true positives. With more classes it has one row and one column per class; iris has 3 classes, so the matrix below is 3x3.

confusion_matrix(y_true, y_pred) returns a matrix whose i-th row and j-th column entry is the number of samples whose true label is the i-th class and whose predicted label is the j-th class.

Link: https://en.wikipedia.org/wiki/Confusion_matrix#:~:text=In%20predictive%20analytics%2C%20a%20table,false%20positives%2C%20and%20true%20negatives.

Link: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html


In [23]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, predictions)  # true labels first, then predictions

array([[ 9,  0,  0],
       [ 0, 11,  0],
       [ 0,  1,  9]])
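The argument order matters: the first argument defines the rows and the second the columns, so swapping them transposes the matrix. A small sketch with made-up labels shows this, along with how accuracy can be read off the diagonal:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2, 2, 2]   # hypothetical true labels
y_pred = [0, 0, 1, 2, 2, 2, 1]   # hypothetical predictions

cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class
print(cm)

swapped = confusion_matrix(y_pred, y_true)   # swapping arguments transposes the result
print(np.array_equal(swapped, cm.T))         # True

accuracy = cm.trace() / cm.sum()   # correct predictions lie on the diagonal
print(accuracy)
```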

Data preprocessing:

One of the most important steps is preprocessing your data, for example to fix scaling issues or to normalize it, so that a good model can be built from it. sklearn has a module named `preprocessing` with many utilities that make this process easier.

In [46]:
from sklearn.preprocessing import StandardScaler, Normalizer

print(X_train, X_test)

scaler = StandardScaler()
X_train_fitted = scaler.fit_transform(X_train)
X_test_fitted = scaler.transform(X_test)

print(X_train_fitted, X_test_fitted)

[[7.  3.2 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.  2.9 4.5 1.5]
 [6.7 2.5 5.8 1.8]
 [6.7 3.1 5.6 2.4]
 [6.1 2.9 4.7 1.4]
 [6.4 2.9 4.3 1.3]
 [5.4 3.4 1.7 0.2]
 [5.6 2.9 3.6 1.3]
 [5.4 3.9 1.3 0.4]
 [6.4 2.7 5.3 1.9]
 [7.6 3.  6.6 2.1]
 [6.2 2.9 4.3 1.3]
 [5.6 3.  4.1 1.3]
 [4.9 3.1 1.5 0.1]
 [5.9 3.  5.1 1.8]
 [4.8 3.4 1.9 0.2]
 [4.9 2.4 3.3 1. ]
 [7.7 3.8 6.7 2.2]
 [5.7 2.8 4.1 1.3]
 [5.5 2.5 4.  1.3]
 [6.3 2.5 5.  1.9]
 [5.  3.5 1.6 0.6]
 [5.8 2.8 5.1 2.4]
 [4.9 3.6 1.4 0.1]
 [5.1 3.5 1.4 0.2]
 [5.6 2.7 4.2 1.3]
 [6.3 2.3 4.4 1.3]
 [7.2 3.6 6.1 2.5]
 [6.6 2.9 4.6 1.3]
 [6.  3.4 4.5 1.6]
 [4.6 3.4 1.4 0.3]
 [5.1 3.8 1.5 0.3]
 [5.8 2.6 4.  1.2]
 [5.8 2.7 4.1 1. ]
 [7.3 2.9 6.3 1.8]
 [5.1 3.3 1.7 0.5]
 [5.  3.6 1.4 0.2]
 [4.6 3.2 1.4 0.2]
 [6.8 3.2 5.9 2.3]
 [6.7 3.1 4.7 1.5]
 [4.4 3.2 1.3 0.2]
 [4.8 3.4 1.6 0.2]
 [5.7 3.  4.2 1.2]
 [6.5 3.  5.2 2. ]
 [4.7 3.2 1.6 0.2]
 [4.6 3.1 1.5 0.2]
 [5.8 4.  1.2 0.2]
 [6.6 3.  4.4 1.4]
 [6.3 2.5 4.9 1.5]
 [5.5 2.6 4.4 1.2]
 [5.1 3.8 1.6 0.2]
 [5.  2.3 3.

In [47]:
X_train.max(), X_train.min()

(7.9, 0.1)

In [48]:
X_train_fitted.max(), X_train_fitted.min()

(3.009115301607578, -2.2719645695867072)
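As a sketch of what StandardScaler computes, the example below checks it against the formula (x - mean) / std applied per column, and shows that the test data is scaled with the statistics learned from the training data. The tiny arrays are made up for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])   # hypothetical data
X_test = np.array([[4.0, 40.0]])

scaler = StandardScaler().fit(X_train)   # learns per-column mean and std
scaled_train = scaler.transform(X_train)

# the same result by hand; np.std uses the population std (ddof=0) by default
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(scaled_train, manual))   # True

# the test set reuses the training mean and std, so it is not re-centred on itself
scaled_test = scaler.transform(X_test)
print(scaled_test)
```

Fitting on the training set only, as the notebook does, keeps information about the test set out of the model.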

In [49]:
print(X_train, X_test)

normalizer = Normalizer()
X_train_fitted = normalizer.fit_transform(X_train)
X_test_fitted = normalizer.transform(X_test)

print(X_train_fitted, X_test_fitted)

[[7.  3.2 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.  2.9 4.5 1.5]
 [6.7 2.5 5.8 1.8]
 [6.7 3.1 5.6 2.4]
 [6.1 2.9 4.7 1.4]
 [6.4 2.9 4.3 1.3]
 [5.4 3.4 1.7 0.2]
 [5.6 2.9 3.6 1.3]
 [5.4 3.9 1.3 0.4]
 [6.4 2.7 5.3 1.9]
 [7.6 3.  6.6 2.1]
 [6.2 2.9 4.3 1.3]
 [5.6 3.  4.1 1.3]
 [4.9 3.1 1.5 0.1]
 [5.9 3.  5.1 1.8]
 [4.8 3.4 1.9 0.2]
 [4.9 2.4 3.3 1. ]
 [7.7 3.8 6.7 2.2]
 [5.7 2.8 4.1 1.3]
 [5.5 2.5 4.  1.3]
 [6.3 2.5 5.  1.9]
 [5.  3.5 1.6 0.6]
 [5.8 2.8 5.1 2.4]
 [4.9 3.6 1.4 0.1]
 [5.1 3.5 1.4 0.2]
 [5.6 2.7 4.2 1.3]
 [6.3 2.3 4.4 1.3]
 [7.2 3.6 6.1 2.5]
 [6.6 2.9 4.6 1.3]
 [6.  3.4 4.5 1.6]
 [4.6 3.4 1.4 0.3]
 [5.1 3.8 1.5 0.3]
 [5.8 2.6 4.  1.2]
 [5.8 2.7 4.1 1. ]
 [7.3 2.9 6.3 1.8]
 [5.1 3.3 1.7 0.5]
 [5.  3.6 1.4 0.2]
 [4.6 3.2 1.4 0.2]
 [6.8 3.2 5.9 2.3]
 [6.7 3.1 4.7 1.5]
 [4.4 3.2 1.3 0.2]
 [4.8 3.4 1.6 0.2]
 [5.7 3.  4.2 1.2]
 [6.5 3.  5.2 2. ]
 [4.7 3.2 1.6 0.2]
 [4.6 3.1 1.5 0.2]
 [5.8 4.  1.2 0.2]
 [6.6 3.  4.4 1.4]
 [6.3 2.5 4.9 1.5]
 [5.5 2.6 4.4 1.2]
 [5.1 3.8 1.6 0.2]
 [5.  2.3 3.

In [50]:
X_train.max(), X_train.min()

(7.9, 0.1)

In [52]:
X_train_fitted.max(), X_train_fitted.min()

(0.8609385732675535, 0.014726598240177802)
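Unlike StandardScaler, which works column by column, Normalizer rescales each row (each sample) to unit norm. A minimal sketch with made-up rows:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0], [1.0, 0.0], [6.0, 8.0]])   # hypothetical samples

X_unit = Normalizer(norm="l2").fit_transform(X)   # divide each row by its L2 norm
print(X_unit)                                     # rows become [0.6 0.8], [1. 0.], [0.6 0.8]

row_norms = np.linalg.norm(X_unit, axis=1)        # every row now has length 1
print(row_norms)
```

The first and third rows point in the same direction, so they map to the same unit vector; information about their magnitude is discarded.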

When creating and training neural networks, a lot of preprocessing needs to be done before the data is fed to the network in batches.

Features also often need to be scaled so that the network can learn from them effectively. Preprocessing and data manipulation with numpy and scikit-learn will prove very helpful when diving into these more complex topics.

In [25]:
np.zeros((4,3))

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])