# Week 1 Notes

## 1.1 [Introduction to Machine Learning](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/01-intro/01-what-is-ml.md)

- Example: recommending a price for a car, given its properties such as: age, mileage, model, no. of doors, etc.
- Patterns can be found in the data which can be modelled
- Essence of ML: given data, have a model learn patterns in the data
- Data consists of:
    - Features: what we know about cars (i.e. its properties)
    - Target: what we want to predict (i.e. its price)
- User can provide information on a car they want to sell. This data is extracted. Price (target) is predicted and suggested to the user as an appropriate price.

## 1.2 [ML vs Rule-Based Systems](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/01-intro/02-ml-vs-rules.md)

This lesson will cover a comparison between rule-based and ML systems. The example used is an e-mail spam detection system.

We want to develop a spam detection system using a classifier.

One approach is to take all the spam messages and try to identify what makes a message spam. We could come up with explicit rules based on the sender, subject, etc. However, spam messages evolve, so the rules require frequent updates. This inevitably leads to a myriad of rules and logic that becomes a nightmare to maintain.

Alternatively, we can use Machine Learning. How do we do this? We get the data (all the e-mails), define and calculate features, train and use the model.
- Getting the data: we could have a spam button which users can press to indicate an e-mail is spam
- Defining features: we could start off with some rules, e.g. length of subject greater than 10, length of body greater than 10, sender has a certain domain, sender e-mail address has a certain pattern, and so on. These are just some examples of features; there can be more or different ones. Then for a given e-mail, we can encode the outcome of each rule in a binary vector such as $[1, 1, 0, 0, 1, 1]$. This would indicate that the length of the subject is greater than 10, the length of the body is greater than 10, and so on. Say that the user marked this e-mail as spam; then the target variable is also $1$. We do this for many e-mails and get a matrix with the features and a vector with the target variables.
- Train model: features and target variables go into ML model. We fit/train the model and determine some unknown coefficients that minimize the error between predictions and the target.
- Use model: now that the model has been trained, we can put new data into it and predict the target. 
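The feature-definition step above can be sketched in a few lines of plain Python. This is only an illustration: the `extract_features` helper, the `promo.example` domain, the "digits in the sender name" pattern and the two e-mails are all made up for the example and do not come from the course material.

```python
# Hypothetical feature rules, mirroring the kinds of rules listed above.
def extract_features(email):
    """Turn one e-mail (a dict) into a binary feature vector."""
    sender = email["sender"]
    local_part = sender.split("@")[0]
    return [
        int(len(email["subject"]) > 10),             # subject longer than 10 characters
        int(len(email["body"]) > 10),                # body longer than 10 characters
        int(sender.endswith("@promo.example")),      # sender domain (made-up domain)
        int(any(ch.isdigit() for ch in local_part)), # digits in sender name (made-up pattern)
    ]

# Two made-up e-mails; "is_spam" plays the role of the spam button.
emails = [
    {"sender": "win1000@promo.example", "subject": "You won a big prize!",
     "body": "Click here to claim it now.", "is_spam": 1},
    {"sender": "anna@work.example", "subject": "Lunch?",
     "body": "Noon at the usual place?", "is_spam": 0},
]

X = [extract_features(e) for e in emails]  # feature matrix: one row per e-mail
y = [e["is_spam"] for e in emails]         # target vector: 1 = spam, 0 = not spam

print(X)  # [[1, 1, 1, 1], [0, 1, 0, 0]]
print(y)  # [1, 0]
```

The rows of `X` and the entries of `y` are what would be passed to a model in the training step.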



## 1.3 [Supervised Machine Learning](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/01-intro/03-supervised-ml.md)

In the cars example in [1.1](#11-introduction-to-machine-learning) we provided the data and the target variable, and the ML model learned the patterns in the data. These patterns can then be used to generalize to new samples. In the e-mail spam example in [1.2](#12-ml-vs-rule-based-systems), we did exactly the same. When the target variable is given, this is known as _Supervised Machine Learning_.

The _feature matrix_ is a two-dimensional array in which the rows are our observations or _samples_ and the columns are _features_. It is designated with $\mathbf{X}$. The _target_ is a vector designated with $\mathbf{y}$. $\mathbf{X}$ is our input and $\mathbf{y}$ is our output.

If we have a model g:
$$
g(\mathbf{X}) \approx \mathbf{y}
$$

For _regression_ (car price prediction), we predict the price of a car. $g$ outputs a price.

For _classification_ (spam classification), we predict whether an e-mail is spam. $g$ outputs the probability of the e-mail being spam, which is then converted into a label of 1 or 0 (spam or not spam).

For _classification_, besides _binary classification_ (spam or not spam) there can also be _multi-class classification_, where we are predicting a category out of more than 2 categories (e.g. cat, bird, dog, horse, etc.)
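As a small illustration of how predicted probabilities relate to class labels, here is a sketch with hard-coded numbers. The probabilities are made up (no model is trained here) and the 0.5 threshold is an assumption, a common default rather than something prescribed by the course.

```python
import numpy as np

# Made-up model outputs: probability that each of four e-mails is spam.
spam_proba = np.array([0.92, 0.10, 0.55, 0.49])

# Binary classification: apply a threshold (0.5 is a common but arbitrary choice).
is_spam = (spam_proba >= 0.5).astype(int)
print(is_spam)  # [1 0 1 0]

# Multi-class classification: one probability per class, pick the largest.
classes = np.array(["cat", "bird", "dog", "horse"])
class_proba = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.1, 0.6, 0.1],
])
predicted = classes[class_proba.argmax(axis=1)]
print(predicted)  # ['cat' 'dog']
```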

_Ranking_ orders a list of items for you by assigning a score to each item. Recommender systems are based on this.
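A minimal sketch of the idea, assuming the scores have already been produced by some model. The item names and scores below are invented for illustration.

```python
import numpy as np

items = np.array(["item_a", "item_b", "item_c", "item_d"])  # hypothetical catalogue
scores = np.array([0.2, 0.9, 0.5, 0.7])                     # made-up model scores

order = np.argsort(-scores)  # indices from highest to lowest score
ranked = items[order]
print(ranked)  # ['item_b' 'item_d' 'item_c' 'item_a']
```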



## 1.4 [CRISP-DM](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/01-intro/04-crisp-dm.md)

_CRISP-DM_ is a methodology for organizing ML projects. It stands for Cross-Industry Standard Process for Data Mining. There are 6 steps.

<img src="image.png" alt="CRISP-DM" style="width: 600px;"/>

ML Project: understand the problem, collect the data, train the model, use it.

We will use the spam detection example.

E-mail &rarr; Model &rarr; spam / not spam.


### Step 1: Business Understanding

Goal is to identify the problem to solve. Understand if problem is important and how to measure success. Here we decide whether we need ML.

Spam detection: users complain about spam. What is the extent of the problem? One user or many? Do we need ML? Define the goal: e.g. reduce the amount of spam messages, reduce the number of complaints about spam. The goal should be measurable: reduce the amount of spam messages by 50%.

### Step 2: Data Understanding

Check what data is available, what is missing, and how to collect or acquire it.

Spam detection: spam button. Is the data behind this button good enough? Is it reliable? Do we track it correctly? Is the dataset large enough? Do we need to get more data?

### Step 3: Data Preparation

The data needs to be transformed such that it can be used in the ML model. We may need to clean the data, extract features, build pipelines, convert into tabular form.

### Step 4: Modeling

We now have the data in the right format. We try different models and we pick the best one. For example logistic regression, decision tree, neural network, etc. We may need to go back to data preparation to fix issues with the data or do more feature engineering.

### Step 5: Evaluation

Now we will assess how well the model is performing with respect to the goal we defined under Business Understanding. Have we reached the goal? Do our metrics improve? Did we solve/measure the right thing? Was the goal achievable? Do we need to update the goal?

### Step 6: Deployment

This step goes hand in hand with Evaluation. We deploy the model and monitor it when online. Usually we first deploy to a subset of users. 

This process is typically iterated over based on learnings.

It is a good idea to start simple and go quickly through all the steps. Then you iterate to introduce further improvements.

## 1.5 [The Modeling Step (Model Selection Process)](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/01-intro/05-model-selection.md)

The modeling step is where we try different models and choose the best model. This is where the actual ML happens.

Let's say we train in July using data $\mathbf{X}$ with output $\mathbf{y}$ and we set up model $g$. Then we evaluate in August and find out our model has an accuracy of 0.7. Ideally we would be able to travel to the future and get the data from then and set our model up with that data. That obviously cannot happen. Therefore we do something else. We split the data. For example we take a subset of 20% (representing August) and we 'hide' that data. We train only with the remaining 80% (July). We call the 80% the training data and the 20% the validation data.

So using the training data we get a model $g$ from training with $\mathbf{X}$ and $\mathbf{y}$.

The validation data also has data $\mathbf{X_v}$ and $\mathbf{y_v}$. We can use our model to make predictions: $g(\mathbf{X_v})=\mathbf{\hat{y}_v}$.

We compare the predictions $\mathbf{\hat{y}_v}$ with the actual values $\mathbf{y_v}$ to compute the accuracy, i.e. the fraction of predictions that are correct. We can do this for various models such as logistic regression, decision tree, neural network, etc. and compare their accuracies.
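As a sketch of how accuracy on the validation data could be computed, the snippet below compares two hard-coded label vectors. Both vectors are made up, and the names `y_val` and `y_pred_val` are illustrative stand-ins for $\mathbf{y_v}$ and $\mathbf{\hat{y}_v}$.

```python
import numpy as np

y_val = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])       # made-up true labels
y_pred_val = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])  # made-up predictions

# Fraction of predictions that match the true labels.
accuracy = (y_pred_val == y_val).mean()
print(accuracy)  # 0.7
```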

The problem with this approach is that one of the models might have gotten lucky on this particular validation set. In statistics this is called the multiple comparisons problem. ML models are probabilistic, so we need to guard against this.

To address this, we introduce a testing dataset. For example: 60% training, 20% validation, 20% test. The test data is hidden. 
We do the model selection: use $\mathbf{X}$ and $\mathbf{y}$ to get model $g$.
We validate it by using model $g$ and $\mathbf{X_v}$ to make predictions $\mathbf{\hat{y}_v}$. We select the best model. 
Then at the very end, we test the model to ensure it didn't get particularly lucky. 
It is basically an extra round of validation with $\mathbf{X_t}$ to make predictions $\mathbf{\hat{y}_t}$. 
The accuracy on the test data should be in line with the accuracy on the validation data.
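A 60/20/20 split can be sketched with NumPy as follows. This assumes a random shuffle of the rows, which is only one way to split (for time-ordered data such as the July/August example, a split by date may be more appropriate). The toy arrays and the seed are made up.

```python
import numpy as np

n = 10                              # toy dataset size
X = np.arange(n * 2).reshape(n, 2)  # made-up feature matrix
y = np.arange(n) % 2                # made-up target

rng = np.random.default_rng(42)     # seeded for reproducibility
idx = rng.permutation(n)            # shuffled row indices

n_val = int(0.2 * n)
n_test = int(0.2 * n)
n_train = n - n_val - n_test

idx_train = idx[:n_train]
idx_val = idx[n_train:n_train + n_val]
idx_test = idx[n_train + n_val:]

X_train, y_train = X[idx_train], y[idx_train]
X_val, y_val = X[idx_val], y[idx_val]
X_test, y_test = X[idx_test], y[idx_test]  # kept hidden until the very end

print(len(X_train), len(X_val), len(X_test))  # 6 2 2
```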



## 1.6 [Setting up the Environment](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/01-intro/06-environment.md)

I am using WSL2. 

I created a conda environment:

```bash
conda create -n ml-zoomcamp python=3.10
```

Then I activated it:

```bash
conda activate ml-zoomcamp
```

Subsequently I installed these libraries:

```bash
conda install numpy pandas scikit-learn seaborn jupyter
```

I had to install the VS Code Jupyter extension in WSL2.




## 1.7 [Introduction to NumPy](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/01-intro/07-numpy.md)

First import numpy:

In [None]:
import numpy as np

Check version:

In [None]:
np.__version__

'2.1.1'

### Create arrays

All zeros with size 3:

In [None]:
np.zeros(3)

array([0., 0., 0.])

All ones with size 10:

In [None]:
np.ones(10)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

Specify a size and number:

In [None]:
np.full(10, 3)

array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3])

Convert python list to array:

In [None]:
a = [1,2,4,1]

a_arr = np.array(a)

a_arr

array([1, 2, 4, 1])

Access third element (zero-indexed):

In [None]:
a_arr[2]

np.int64(4)

Change second element:

In [None]:
print(a_arr)

a_arr[1] = 999

print(a_arr)

[1 2 4 1]
[  1 999   4   1]


Create array based on range (0 up to, but not including, 10):

In [None]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

From 3 to 9:

In [None]:
np.arange(3, 10)

array([3, 4, 5, 6, 7, 8, 9])

Create array from 0 to 1 with 11 elements (i.e. in 10 steps):

In [None]:
np.linspace(0, 1, 11)

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])

### Multi-dimensional arrays

Create an array with 5 rows, 2 columns, filled with zeros:

In [None]:
np.zeros((5, 2))

array([[0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.]])

We can also create a multi-dimensional array from a list of lists:

In [None]:
a = [[1,2,2], [2,14, 4]]

a_arr = np.array(a)

a_arr

array([[ 1,  2,  2],
       [ 2, 14,  4]])

Access first row, second column:

In [None]:
a_arr[0, 1]

np.int64(2)

Replace with 20:

In [None]:
a_arr[0, 1] = 20

a_arr

array([[ 1, 20,  2],
       [ 2, 14,  4]])

Get first row and second row:

In [None]:
a_arr[0]

array([ 1, 20,  2])

In [None]:
a_arr[1]

array([ 2, 14,  4])

Change second row to something else:

In [None]:
a_arr[1] = [2,1,1]

In [None]:
a_arr

array([[ 1, 20,  2],
       [ 2,  1,  1]])

The same works for columns:

In [None]:
a_arr[:, 0]

array([1, 2])

In [None]:
a_arr[:, 0] = [10, 20]

In [None]:
a_arr

array([[10, 20,  2],
       [20,  1,  1]])

Randomly generated array with 5 rows and 2 columns:

In [None]:
np.random.seed(2)       # seed to make results reproducible
np.random.rand(5, 2)    # this is standard uniform distribution between 0 and 1

array([[0.4359949 , 0.02592623],
       [0.54966248, 0.43532239],
       [0.4203678 , 0.33033482],
       [0.20464863, 0.61927097],
       [0.29965467, 0.26682728]])

Make it between 0 and 100

In [None]:
100 * np.random.rand(5, 2)

array([[62.11338328, 52.91420943],
       [13.45799453, 51.35781213],
       [18.44398656, 78.53351478],
       [85.39752926, 49.42368374],
       [84.65614854,  7.9645477 ]])

Now from standard normal distribution:

In [None]:
np.random.randn(5, 2)

array([[ 0.26551159,  0.10854853],
       [ 0.00429143, -0.17460021],
       [ 0.43302619,  1.20303737],
       [-0.96506567,  1.02827408],
       [ 0.22863013,  0.44513761]])

Random integers between 0 and 100 (`high` is exclusive, so 100 itself is never drawn):

In [None]:
np.random.randint(low=0, high=100, size=(5, 2))

array([[68, 46],
       [70, 95],
       [83, 31],
       [66, 80],
       [52, 76]])

### Element-wise operations

Addition:

In [None]:
a = np.arange(5)

In [None]:
a + 10

array([10, 11, 12, 13, 14])

Multiplication:

In [None]:
a * 2

array([0, 2, 4, 6, 8])

Division:

In [None]:
a / 2

array([0. , 0.5, 1. , 1.5, 2. ])

It can be chained:

In [None]:
( (a + 10) / 2 ) * 4

array([20., 22., 24., 26., 28.])

Arrays can also be combined with each other element-wise:

In [None]:
(a + 3 * a ) / (a + 1)

array([0.        , 2.        , 2.66666667, 3.        , 3.2       ])

### Comparison operations

In [None]:
a > (10*a)

array([False, False, False, False, False])

In [None]:
a == a

array([ True,  True,  True,  True,  True])

In [None]:
a < 2

array([ True,  True, False, False, False])

Filter array on elements larger than 2:

In [None]:
a[a > 2]

array([3, 4])

### Summary operations

In [None]:
a.min()

np.int64(0)

In [None]:
a.sum()

np.int64(10)

In [None]:
a.max()

np.int64(4)

In [None]:
a.mean()

np.float64(2.0)

In [None]:
a.std()

np.float64(1.4142135623730951)

## 1.8 [Linear Algebra Refresher](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/01-intro/08-linear-algebra.md)

### Vector Operations

We have vectors $ \mathbf{u} $ and $ \mathbf{v} $ below:

$$ \mathbf{u} = \begin{bmatrix} 2 \\ 4 \\ 5 \\ 6 \end{bmatrix}, \mathbf{v} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 2 \end{bmatrix} $$

#### Multiplication with a scalar

We can multiply a vector by any scalar number, e.g.:

$$ 2 \mathbf{u} = 2 \begin{bmatrix} 2 \\ 4 \\ 5 \\ 6 \end{bmatrix} = \begin{bmatrix} 4 \\ 8 \\ 10 \\ 12 \end{bmatrix}$$

#### Addition of a vector and a scalar

$$ \mathbf{u} + 2 = \begin{bmatrix} 2 \\ 4 \\ 5 \\ 6 \end{bmatrix} + 2 = \begin{bmatrix} 4 \\ 6 \\ 7 \\ 8 \end{bmatrix}$$

#### Addition of vectors

We can add vectors:

$$ \mathbf{u} + \mathbf{v} = \begin{bmatrix} 2 \\ 4 \\ 5 \\ 6 \end{bmatrix} + \begin{bmatrix} 1 \\ 0 \\ 0 \\ 2 \end{bmatrix} = \begin{bmatrix} 3 \\ 4 \\ 5 \\ 8 \end{bmatrix}$$

#### Vector-vector multiplication (dot product or inner product)

$$
\mathbf{u}^T \mathbf{v} = \langle \mathbf{u}, \mathbf{v} \rangle = \sum_{i=1}^{n} u_i v_i = \begin{bmatrix} 2 & 4 & 5 & 6 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \\ 0 \\ 2 \end{bmatrix} = 2 \cdot 1 + 4 \cdot 0 + 5 \cdot 0 + 6 \cdot 2 = 14
$$

The first three expressions are different notations for the dot product (or inner product) of two vectors:
1. $ \mathbf{u}^T \mathbf{v} $: Matrix multiplication of the transpose of $ \mathbf{u} $ and $ \mathbf{v} $, which is the standard matrix form of the dot product.
2. $ \langle \mathbf{u}, \mathbf{v} \rangle $: Angle bracket notation commonly used in functional analysis and physics for the inner product.
3. $ \sum_{i=1}^{n} u_i v_i $: Explicit summation over the components, representing the dot product as the sum of element-wise products.



#### Matrix-vector multiplication

We have matrix $\mathbf{U}$ and vector $\mathbf{v}$ as below.

$$
\mathbf{U}=\begin{bmatrix} 2 & 4 & 5 & 5 \\ 1 & 2 & 1 & 2 \\ 3 & 1 & 2 & 1 \end{bmatrix}, \mathbf{v}=\begin{bmatrix} 1 \\ 0.5 \\ 2 \\ 1 \end{bmatrix}
$$

We want to multiply:
$$
\mathbf{U} \cdot \mathbf{v}
$$

The dimension of $\mathbf{U}$ is $3 \times 4$ and of $\mathbf{v}$ is $4 \times 1$. Therefore the matrix-vector product should be a vector with dimensions $3 \times 1$.

For each row of the matrix $\mathbf{U}$, we calculate the dot product with $\mathbf{v}$. Let $\mathbf{u}_i$ represent the i-th row of $\mathbf{U}$, where:
$$
\mathbf{u}_0=\begin{bmatrix} 2 \\ 4 \\ 5 \\ 5 \end{bmatrix}, 
\mathbf{u}_1=\begin{bmatrix} 1 \\ 2 \\ 1 \\ 2 \end{bmatrix}, 
\mathbf{u}_2=\begin{bmatrix} 3 \\ 1 \\ 2 \\ 1 \end{bmatrix}
$$

The result of the multiplication is the following vector, where each entry is the dot product of the corresponding row of $\mathbf{U}$ with $\mathbf{v}$:
$$
\mathbf{U} \mathbf{v} = \begin{bmatrix} \mathbf{u}_0^T \mathbf{v} \\ \mathbf{u}_1^T \mathbf{v} \\ \mathbf{u}_2^T \mathbf{v} \end{bmatrix}
$$
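As a worked example, plugging in the values of $\mathbf{U}$ and $\mathbf{v}$ defined at the start of this subsection gives:

$$
\mathbf{U} \mathbf{v} = \begin{bmatrix} 2 \cdot 1 + 4 \cdot 0.5 + 5 \cdot 2 + 5 \cdot 1 \\ 1 \cdot 1 + 2 \cdot 0.5 + 1 \cdot 2 + 2 \cdot 1 \\ 3 \cdot 1 + 1 \cdot 0.5 + 2 \cdot 2 + 1 \cdot 1 \end{bmatrix} = \begin{bmatrix} 19 \\ 6 \\ 8.5 \end{bmatrix}
$$

The result has dimensions $3 \times 1$, as expected.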

#### Matrix-matrix multiplication

Consider matrices:

$$
\mathbf{U}=\begin{bmatrix} 2 & 4 & 5 & 5 \\ 1 & 2 & 1 & 2 \\ 3 & 1 & 2 & 1 \end{bmatrix}, 
\mathbf{V}=\begin{bmatrix} 1 & 1 & 2 \\ 0 & 0.5 & 1 \\ 0 & 2 & 1 \\ 2 & 1 & 0\end{bmatrix} 
$$

We want to compute $\mathbf{U} \mathbf{V}$. Let's represent $\mathbf{V}$ as: 

$$\begin{bmatrix} \mathbf{v}_0 & \mathbf{v}_1 & \mathbf{v}_2 \end{bmatrix}$$

Then the matrix-multiplication becomes:
$$
\mathbf{U} \mathbf{V} = \begin{bmatrix} \mathbf{U} \mathbf{v}_0 & \mathbf{U} \mathbf{v}_1 & \mathbf{U} \mathbf{v}_2 \end{bmatrix}
$$

#### Identity Matrix

The identity matrix is a square matrix with ones on the diagonal and zeros everywhere else. Here's an example of a $4 \times 4$ identity matrix:

$$
\mathbf{I} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
$$

If a matrix is multiplied with an identity matrix of compatible size (from either side), the product is equal to that matrix:
$$
\mathbf{U} \mathbf{I} = \mathbf{I} \mathbf{U} = \mathbf{U}
$$

#### Matrix inverse

Let's say we have a matrix $\mathbf{A}$. Its inverse $\mathbf{A}^{-1}$ is defined such that, when multiplied with $\mathbf{A}$, the product is the identity matrix $\mathbf{I}$:

$$
\mathbf{A}^{-1} \mathbf{A} = \mathbf{I}
$$
$$
\mathbf{A} \mathbf{A}^{-1} = \mathbf{I}
$$

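Only square matrices can have an inverse, and not every square matrix has one: the matrix must be non-singular (its determinant must be non-zero).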
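A small worked example (the matrices below are chosen purely for illustration):

$$
\mathbf{A} = \begin{bmatrix} 2 & 1 \\ 1 & 1 \end{bmatrix}, \quad
\mathbf{A}^{-1} = \begin{bmatrix} 1 & -1 \\ -1 & 2 \end{bmatrix}
$$

$$
\mathbf{A} \mathbf{A}^{-1} = \begin{bmatrix} 2 \cdot 1 + 1 \cdot (-1) & 2 \cdot (-1) + 1 \cdot 2 \\ 1 \cdot 1 + 1 \cdot (-1) & 1 \cdot (-1) + 1 \cdot 2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = \mathbf{I}
$$

By contrast, the square matrix $\begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}$ has determinant $1 \cdot 4 - 2 \cdot 2 = 0$, so it has no inverse.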

### Linear Algebra in Numpy

#### Defining vectors in Numpy

In Numpy we can do this as follows:

In [None]:
import numpy as np


u = np.array([2, 4, 5, 6])
v = np.array([1, 0, 0, 2])

In [None]:
u

array([2, 4, 5, 6])

In [None]:
v

array([1, 0, 0, 2])

#### `.shape` and `.reshape()`

Note that although $\mathbf{u}$ and $\mathbf{v}$ are displayed like row vectors $( 1 \times n )$, they are one-dimensional arrays: their shape has a single entry and no second dimension:

In [None]:
u.shape

(4,)

We could explicitly turn this into $(1 \times n)$ (row vector) as follows:

In [None]:
u_row = u.reshape(-1, 4)
u_row

array([[2, 4, 5, 6]])

Or into a column vector $( m \times 1 )$:

In [None]:
u_col = u.reshape(4, -1)
u_col

array([[2],
       [4],
       [5],
       [6]])

The `.reshape()` function takes an array and reshapes it into the dimensions you want (provided that the number of elements of the resulting array matches that of the original). The `-1` indicates that you want Numpy to infer the size of that dimension from the total number of elements. By default, Numpy reads the elements from left to right, top to bottom, and fills the reshaped array in the same order.
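To make the role of `-1` concrete, here is a short sketch on a throwaway array (the name `arr6` is just for this example):

```python
import numpy as np

arr6 = np.arange(6)        # array([0, 1, 2, 3, 4, 5])

b = arr6.reshape(2, -1)    # NumPy infers 3 columns: 6 elements / 2 rows
print(b.shape)             # (2, 3)
print(b)
# [[0 1 2]
#  [3 4 5]]

c = arr6.reshape(-1, 2)    # NumPy infers 3 rows
print(c.shape)             # (3, 2)

# If the element count does not fit, NumPy raises a ValueError.
try:
    arr6.reshape(4, -1)    # 6 is not divisible by 4
    reshape_failed = False
except ValueError:
    reshape_failed = True
print(reshape_failed)      # True
```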

But for vector algebra, we can leave $\mathbf{u}$ and $\mathbf{v}$ as they are (no need to use `.reshape` to explicitly turn them into row or column vectors).

#### Multiplication with a scalar

In [None]:
u * 2

array([ 4,  8, 10, 12])

#### Adding a vector and a scalar

In [None]:
u + 2

array([4, 6, 7, 8])

#### Adding two vectors

In [None]:
u + v

array([3, 4, 5, 8])

#### Element-wise multiplication of two vectors

In [None]:
u * v

array([ 2,  0,  0, 12])

#### Dot (inner) product (3 ways of doing it)

In [None]:
np.dot(u, v)

np.int64(14)

In [None]:
u.dot(v)

np.int64(14)

In [None]:
u @ v

np.int64(14)

Here's a code implementation of the dot product (inefficient implementation, just for educational purposes):

In [None]:
def vector_vector_multiplication(u, v):
    
    # first check that shapes are compatible:
    assert u.shape == v.shape

    # get number of elements
    n = u.shape[0]

    result = 0

    # loop over elements and add their product to the result
    for i in range(n):
        result += u[i] * v[i]
    return result

vector_vector_multiplication(u, v)

np.int64(14)

#### Matrix-vector multiplication

In [None]:
U = np.array([
    [2, 4, 5, 6],
    [1, 2, 1, 2],
    [3, 1, 2, 1],
])

In [None]:
U @ v  # Preferred

array([14,  5,  5])

In [None]:
U.dot(v)

array([14,  5,  5])

In [None]:
np.dot(U, v)

array([14,  5,  5])

In [None]:
np.matmul(U, v)

array([14,  5,  5])

Code implementation of matrix-vector multiplication (just for educational purposes):

In [None]:
def matrix_vector_multiplication(U, v):
    
    assert (U.shape[1] == v.shape[0]) and len(v.shape) == 1
    

    num_rows = U.shape[0]
    result = np.zeros(num_rows)

    for i in range(num_rows):
        result[i] = vector_vector_multiplication(U[i], v)
    
    return result

matrix_vector_multiplication(U, v)


array([14.,  5.,  5.])

#### Matrix-matrix multiplication

In [None]:
V = np.array([
    [1, 1, 2],
    [0, 0.5, 1],
    [0, 2, 1],
    [2, 1, 0],
])

In [None]:
U @ V  # Preferred

array([[14. , 20. , 13. ],
       [ 5. ,  6. ,  5. ],
       [ 5. ,  8.5,  9. ]])

In [None]:
np.dot(U, V)

array([[14. , 20. , 13. ],
       [ 5. ,  6. ,  5. ],
       [ 5. ,  8.5,  9. ]])

In [None]:
U.dot(V)

array([[14. , 20. , 13. ],
       [ 5. ,  6. ,  5. ],
       [ 5. ,  8.5,  9. ]])

In [None]:
np.matmul(U, V)

array([[14. , 20. , 13. ],
       [ 5. ,  6. ,  5. ],
       [ 5. ,  8.5,  9. ]])

In [None]:
def matrix_matrix_multiplication(U, V):
    
    assert U.shape[1] == V.shape[0]

    num_rows = U.shape[0]
    num_cols = V.shape[1]
    
    result = np.zeros((num_rows, num_cols))
    
    for i in range(num_cols):
        result[:, i] = matrix_vector_multiplication(U, V[:, i])
    return result

matrix_matrix_multiplication(U, V)


array([[14. , 20. , 13. ],
       [ 5. ,  6. ,  5. ],
       [ 5. ,  8.5,  9. ]])

#### Identity Matrix

In [None]:
np.eye(10)

array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])

In [None]:
I = np.eye(3)

In [None]:
I @ U == U

array([[ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True]])

In [None]:
I = np.eye(4)

In [None]:
U @ I == U

array([[ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True]])

#### Matrix inverse

Let's take first 3 rows of V (to have a square matrix):

In [None]:
V_s = V[:3, :]

In [None]:
V_s_inv = np.linalg.inv(V_s)

In [None]:
V_s @ V_s_inv == np.eye(3)

array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]])

In [None]:
V_s_inv @ V_s == np.eye(3)

array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]])
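Exact `==` comparison of floating-point results can report `False` for entries that differ only by rounding error, so a tolerance-based check is a common alternative. A minimal sketch using `np.allclose`, with `V_s` redefined here so the snippet is self-contained:

```python
import numpy as np

V_s = np.array([
    [1, 1, 2],
    [0, 0.5, 1],
    [0, 2, 1],
])
V_s_inv = np.linalg.inv(V_s)

# Compare with a tolerance instead of exact equality.
left_ok = np.allclose(V_s @ V_s_inv, np.eye(3))
right_ok = np.allclose(V_s_inv @ V_s, np.eye(3))
print(left_ok, right_ok)  # True True
```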

## 1.9 [Introduction to Pandas](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/01-intro/09-pandas.md)

Pandas is a Python library for manipulating tabular data.

In [None]:
import pandas as pd

Our data:

In [None]:
data = [
    ['Nissan', 'Stanza', 1991, 138, 4, 'MANUAL', 'sedan', 2000],
    ['Hyundai', 'Sonata', 2017, None, 4, 'AUTOMATIC', 'Sedan', 27150],
    ['Lotus', 'Elise', 2010, 218, 4, 'MANUAL', 'convertible', 54990],
    ['GMC', 'Acadia',  2017, 194, 4, 'AUTOMATIC', '4dr SUV', 34450],
    ['Nissan', 'Frontier', 2017, 261, 6, 'MANUAL', 'Pickup', 32340],
]

columns = [
    'Make', 'Model', 'Year', 'Engine HP', 'Engine Cylinders',
    'Transmission Type', 'Vehicle_Style', 'MSRP'
]

### pd.DataFrame

Create `pd.DataFrame` from data:

In [None]:
df = pd.DataFrame(
    data=data,
    columns=columns,
)

In [None]:
df

|   | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |


We can also create it from a list of dictionaries, where the keys are the column names:

In [None]:
data = [
    {
        "Make": "Nissan",
        "Model": "Stanza",
        "Year": 1991,
        "Engine HP": 138.0,
        "Engine Cylinders": 4,
        "Transmission Type": "MANUAL",
        "Vehicle_Style": "sedan",
        "MSRP": 2000
    },
    {
        "Make": "Hyundai",
        "Model": "Sonata",
        "Year": 2017,
        "Engine HP": None,
        "Engine Cylinders": 4,
        "Transmission Type": "AUTOMATIC",
        "Vehicle_Style": "Sedan",
        "MSRP": 27150
    },
    {
        "Make": "Lotus",
        "Model": "Elise",
        "Year": 2010,
        "Engine HP": 218.0,
        "Engine Cylinders": 4,
        "Transmission Type": "MANUAL",
        "Vehicle_Style": "convertible",
        "MSRP": 54990
    },
    {
        "Make": "GMC",
        "Model": "Acadia",
        "Year": 2017,
        "Engine HP": 194.0,
        "Engine Cylinders": 4,
        "Transmission Type": "AUTOMATIC",
        "Vehicle_Style": "4dr SUV",
        "MSRP": 34450
    },
    {
        "Make": "Nissan",
        "Model": "Frontier",
        "Year": 2017,
        "Engine HP": 261.0,
        "Engine Cylinders": 6,
        "Transmission Type": "MANUAL",
        "Vehicle_Style": "Pickup",
        "MSRP": 32340
    }
]

In [None]:
df = pd.DataFrame(data=data)

In [None]:
df

|   | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |


To look at the first 3 rows:

In [None]:
df.head(3)

|   | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |


### pd.Series

Every column of a `pd.DataFrame` is a `pd.Series`:

In [None]:
df.Make

0     Nissan
1    Hyundai
2      Lotus
3        GMC
4     Nissan
Name: Make, dtype: object

In [None]:
type(df.Make)

pandas.core.series.Series

In [None]:
df["Make"]

0     Nissan
1    Hyundai
2      Lotus
3        GMC
4     Nissan
Name: Make, dtype: object

In [None]:
type(df["Make"])

pandas.core.series.Series

In [None]:
df["Engine HP"]

0    138.0
1      NaN
2    218.0
3    194.0
4    261.0
Name: Engine HP, dtype: float64

Select subset of DataFrame columns:

In [None]:
df[["Make", "Model", "MSRP"]]

|   | Make | Model | MSRP |
|---|---|---|---|
| 0 | Nissan | Stanza | 2000 |
| 1 | Hyundai | Sonata | 27150 |
| 2 | Lotus | Elise | 54990 |
| 3 | GMC | Acadia | 34450 |
| 4 | Nissan | Frontier | 32340 |


We can add a column to this DataFrame:

In [None]:
df["id"] = [i for i in range(1, df.shape[0]+1)]

In [None]:
df

|   | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP | id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 | 1 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 | 2 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 | 3 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 | 4 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 | 5 |


We can also add a column with `.assign()`, which returns a new DataFrame and therefore works well with method chaining:

In [None]:

df = (
    df
    .assign(
        id = [i*10 for i in range(1, df.shape[0]+1)] 
    )
)

To delete a column:

In [None]:
del df['id']

### Index

DataFrame has an index:

In [None]:
df.index

RangeIndex(start=0, stop=5, step=1)

All columns share this index:

In [None]:
df.Make.index

RangeIndex(start=0, stop=5, step=1)

### Accessing elements

We can access a row by its position with `iloc`:

In [None]:
df.iloc[1]

Make                   Hyundai
Model                   Sonata
Year                      2017
Engine HP                  NaN
Engine Cylinders             4
Transmission Type    AUTOMATIC
Vehicle_Style            Sedan
MSRP                     27150
Name: 1, dtype: object

In [None]:
df.iloc[2:4]

|   | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |


We can change the index:

In [None]:
df.index = ["a", "b", "c", "d", "e"]

In [None]:
df

|   | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| a | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| b | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| c | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| d | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| e | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |


Since we changed the index, this no longer works:

In [None]:
df.loc[1]

KeyError: 1

Instead:

In [None]:
df.loc["b"]

Make                   Hyundai
Model                   Sonata
Year                      2017
Engine HP                  NaN
Engine Cylinders             4
Transmission Type    AUTOMATIC
Vehicle_Style            Sedan
MSRP                     27150
Name: b, dtype: object

In [None]:
df.loc[["b", "c"]]

|   | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| b | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| c | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |


However, we can still refer to rows by their position using `iloc` instead of `loc`:

In [None]:
df.iloc[1]

Make                   Hyundai
Model                   Sonata
Year                      2017
Engine HP                  NaN
Engine Cylinders             4
Transmission Type    AUTOMATIC
Vehicle_Style            Sedan
MSRP                     27150
Name: b, dtype: object

In [None]:
df.iloc[[1,2]]

|   | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| b | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| c | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |


We can reset to the default index (sequential, starting from 0):

In [None]:
df.reset_index()

|   | index | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|---|
| 0 | a | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | b | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | c | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | d | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | e | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |


We can also drop the previous index:

In [None]:
df.reset_index(drop=True)

|   | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |


Original DataFrame did not change though:

In [None]:
df

|   | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| a | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| b | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| c | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| d | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| e | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |


This is because `reset_index()` returns a new DataFrame. What we need to do is to explicitly overwrite the old one as follows:

In [None]:
df = (
    df
    .reset_index(drop=True)
)

In [None]:
df

|   | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |


### Element-wise operations

In [None]:
df["Engine HP"] / 100

0    1.38
1     NaN
2    2.18
3    1.94
4    2.61
Name: Engine HP, dtype: float64

In [None]:
df["Engine HP"] * 100

0    13800.0
1        NaN
2    21800.0
3    19400.0
4    26100.0
Name: Engine HP, dtype: float64

In [None]:
df["Year"] > 2010

0    False
1     True
2    False
3     True
4     True
Name: Year, dtype: bool

### Filtering


In [None]:
mask = df["Year"] >= 2015
df[mask]

|   | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |


In [None]:
df[df["Year"] >= 2015]

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
1,Hyundai,Sonata,2017,,4,AUTOMATIC,Sedan,27150
3,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr SUV,34450
4,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340


In [None]:
df[df["Make"]=="Nissan"]

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
0,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
4,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340


In [None]:
df[(df["Make"]=="Nissan") & (df["Year"] >= 2015)]

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
4,Nissan,Frontier,2017,261.0,6,MANUAL,Pickup,32340
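
Conditions can also be combined with `|` (or) and negated with `~` (not). Each comparison needs its own parentheses because `&` and `|` bind more tightly than `==` and `>=` in Python. The sketch below uses an assumed frame called `cars` with only the two columns needed for the masks.

```python
import pandas as pd

# Assumed two-column frame with the same makes and years as in the notes.
cars = pd.DataFrame({
    "Make": ["Nissan", "Hyundai", "Lotus", "GMC", "Nissan"],
    "Year": [1991, 2017, 2010, 2017, 2017],
})

# OR: Nissan cars, or any car from 2015 onwards.
either = cars[(cars["Make"] == "Nissan") | (cars["Year"] >= 2015)]

# NOT: every car that is not a Nissan.
not_nissan = cars[~(cars["Make"] == "Nissan")]

# isin: rows whose make appears in a list of values.
selected = cars[cars["Make"].isin(["Lotus", "GMC"])]

print(either, not_nissan, selected, sep="\n\n")
```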


### String operations

In [None]:
df["Vehicle_Style"]

0          sedan
1          Sedan
2    convertible
3        4dr SUV
4         Pickup
Name: Vehicle_Style, dtype: object

In [None]:
df["Vehicle_Style"].str.lower()

0          sedan
1          sedan
2    convertible
3        4dr suv
4         pickup
Name: Vehicle_Style, dtype: object

In [None]:
df["Vehicle_Style"].str.replace(" ", "_")

0          sedan
1          Sedan
2    convertible
3        4dr_SUV
4         Pickup
Name: Vehicle_Style, dtype: object

In [None]:
df["Vehicle_Style"] = (
    df["Vehicle_Style"]
    .str.lower()
    .str.replace(" ", "_")
)

In [None]:
df

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,Vehicle_Style,MSRP
0,Nissan,Stanza,1991,138.0,4,MANUAL,sedan,2000
1,Hyundai,Sonata,2017,,4,AUTOMATIC,sedan,27150
2,Lotus,Elise,2010,218.0,4,MANUAL,convertible,54990
3,GMC,Acadia,2017,194.0,4,AUTOMATIC,4dr_suv,34450
4,Nissan,Frontier,2017,261.0,6,MANUAL,pickup,32340
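
The `.str` accessor is also available on an `Index`, so the same chain can normalize the column names themselves. This is a sketch on an assumed one-row frame; whether to rename the columns is a separate choice that the notes above do not make.

```python
import pandas as pd

# Assumed one-row frame that reuses three of the column names from the notes.
cars = pd.DataFrame({
    "Engine HP": [138.0],
    "Transmission Type": ["MANUAL"],
    "Vehicle_Style": ["sedan"],
})

# Lowercase every column label and replace spaces with underscores.
cars.columns = cars.columns.str.lower().str.replace(" ", "_")

print(list(cars.columns))
```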


### Summarizing operations

In [None]:
df.MSRP.max()

np.int64(54990)

In [None]:
df.MSRP.min()

np.int64(2000)

In [None]:
df.MSRP.mean()

np.float64(30186.0)

In [None]:
df.MSRP.median()

np.float64(32340.0)

In [None]:
df.describe()

Unnamed: 0,Year,Engine HP,Engine Cylinders,MSRP
count,5.0,4.0,5.0,5.0
mean,2010.4,202.75,4.4,30186.0
std,11.260551,51.29896,0.894427,18985.044904
min,1991.0,138.0,4.0,2000.0
25%,2010.0,180.0,4.0,27150.0
50%,2017.0,206.0,4.0,32340.0
75%,2017.0,228.75,4.0,34450.0
max,2017.0,261.0,6.0,54990.0


In [None]:
df.describe().round(2)

Unnamed: 0,Year,Engine HP,Engine Cylinders,MSRP
count,5.0,4.0,5.0,5.0
mean,2010.4,202.75,4.4,30186.0
std,11.26,51.3,0.89,18985.04
min,1991.0,138.0,4.0,2000.0
25%,2010.0,180.0,4.0,27150.0
50%,2017.0,206.0,4.0,32340.0
75%,2017.0,228.75,4.0,34450.0
max,2017.0,261.0,6.0,54990.0


Count unique values:

In [None]:
df.Make.nunique()

4

In [None]:
df.nunique()

Make                 4
Model                5
Year                 3
Engine HP            4
Engine Cylinders     2
Transmission Type    2
Vehicle_Style        4
MSRP                 5
dtype: int64

### Missing values

How many NaNs per column?

In [None]:
df.isna().sum()

Make                 0
Model                0
Year                 0
Engine HP            1
Engine Cylinders     0
Transmission Type    0
Vehicle_Style        0
MSRP                 0
dtype: int64

In [None]:
df.isnull().sum()  # .isnull() is an alias of .isna(), so the result is the same; I usually use .isna()

Make                 0
Model                0
Year                 0
Engine HP            1
Engine Cylinders     0
Transmission Type    0
Vehicle_Style        0
MSRP                 0
dtype: int64

NaNs per row:

In [None]:
df.isnull().sum(axis=1)

0    0
1    1
2    0
3    0
4    0
dtype: int64
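
The missing-value mask combines naturally with filtering. The sketch below, on an assumed three-row frame, counts all missing cells and selects the rows that contain at least one of them.

```python
import numpy as np
import pandas as pd

# Assumed frame with a single missing horsepower value, as in the notes.
cars = pd.DataFrame({
    "Make": ["Nissan", "Hyundai", "Lotus"],
    "Engine HP": [138.0, np.nan, 218.0],
})

# Sum per column, then sum again to get the total number of missing cells.
total_missing = int(cars.isna().sum().sum())

# any(axis=1) is True for a row when at least one of its cells is missing.
rows_with_missing = cars[cars.isna().any(axis=1)]

print(total_missing)
print(rows_with_missing)
```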

### Grouping operations

In [None]:
(
    df
    .groupby(by="Transmission Type")["MSRP"]
    .mean()
)

Transmission Type
AUTOMATIC    30800.000000
MANUAL       29776.666667
Name: MSRP, dtype: float64
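
Several statistics per group can be computed in one pass with `.agg()`. This is a minimal sketch on an assumed frame holding only the two columns involved; the choice of statistics is illustrative.

```python
import pandas as pd

# Assumed frame with the transmission types and prices from the notes.
cars = pd.DataFrame({
    "Transmission Type": ["MANUAL", "AUTOMATIC", "MANUAL", "AUTOMATIC", "MANUAL"],
    "MSRP": [2000, 27150, 54990, 34450, 32340],
})

# One row per transmission type, one column per requested statistic.
summary = (
    cars
    .groupby("Transmission Type")["MSRP"]
    .agg(["mean", "min", "max", "count"])
)

print(summary)
```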

### Getting a NumPy array from a pd.DataFrame column

In [None]:
df.MSRP.values

array([ 2000, 27150, 54990, 34450, 32340])
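
The pandas documentation recommends `.to_numpy()` over `.values` for this conversion. A short sketch on an assumed Series with the same prices:

```python
import numpy as np
import pandas as pd

# Assumed Series holding the MSRP values from the notes.
msrp = pd.Series([2000, 27150, 54990, 34450, 32340], name="MSRP")

# to_numpy() returns the underlying data as a NumPy array.
arr = msrp.to_numpy()

print(type(arr), arr)
```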

### Convert pd.DataFrame to a list of dictionaries

In [None]:
df.to_dict(orient="records")

[{'Make': 'Nissan',
  'Model': 'Stanza',
  'Year': 1991,
  'Engine HP': 138.0,
  'Engine Cylinders': 4,
  'Transmission Type': 'MANUAL',
  'Vehicle_Style': 'sedan',
  'MSRP': 2000},
 {'Make': 'Hyundai',
  'Model': 'Sonata',
  'Year': 2017,
  'Engine HP': nan,
  'Engine Cylinders': 4,
  'Transmission Type': 'AUTOMATIC',
  'Vehicle_Style': 'sedan',
  'MSRP': 27150},
 {'Make': 'Lotus',
  'Model': 'Elise',
  'Year': 2010,
  'Engine HP': 218.0,
  'Engine Cylinders': 4,
  'Transmission Type': 'MANUAL',
  'Vehicle_Style': 'convertible',
  'MSRP': 54990},
 {'Make': 'GMC',
  'Model': 'Acadia',
  'Year': 2017,
  'Engine HP': 194.0,
  'Engine Cylinders': 4,
  'Transmission Type': 'AUTOMATIC',
  'Vehicle_Style': '4dr_suv',
  'MSRP': 34450},
 {'Make': 'Nissan',
  'Model': 'Frontier',
  'Year': 2017,
  'Engine HP': 261.0,
  'Engine Cylinders': 6,
  'Transmission Type': 'MANUAL',
  'Vehicle_Style': 'pickup',
  'MSRP': 32340}]
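
The conversion also works in the other direction: a list of row dictionaries can be passed straight to the `pd.DataFrame` constructor. The sketch below does the round trip on an assumed two-row frame.

```python
import pandas as pd

# Assumed two-row frame; the values are taken from the notes.
cars = pd.DataFrame({"Make": ["Nissan", "Lotus"], "MSRP": [2000, 54990]})

# One dictionary per row, keyed by column name.
records = cars.to_dict(orient="records")

# Building a DataFrame from the records recovers the original table.
restored = pd.DataFrame(records)

print(records)
print(restored.equals(cars))
```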

## 1.10 [Summary](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/01-intro/10-summary.md)


Summary of previous sections. No notes made.



## 1.11 [Homework](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/01-intro/homework.md)

Go to [week_1_homework.ipynb](week_1_homework.ipynb).