<div class="alert alert-block alert-warning">
<h1><span style="color:green"> Under-Graduate Research Internship Program (UGRIP) - 2024 <br> Lab 01 - Part A </span></h1>

<h2><span style="color:green"> Lab-01 (Part-A) </span></h2> Machine Learning Fundamentals and Supervised Learning
</div>

---
---

# 1. Getting Started with Numpy
NumPy is a Python library that provides many numerical programming tools, such as matrix operations and vector processing. It gives you an enormous range of fast and efficient ways of creating arrays and manipulating the numerical data inside them. Arrays are the central data structure of the NumPy library. You may find this [tutorial](https://numpy.org/doc/stable/user/absolute_beginners.html) useful to get started with NumPy.

To use Numpy, you first need to import the `numpy` package:

In [None]:
import numpy as np

## 1.1 Creating NumPy Arrays

An array is a grid of values that can be indexed in various ways. The elements are all of the same type, referred to as the array dtype. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.

Programmatically, the NumPy ndarray class is used to represent both matrices and vectors. A vector is an array with a single dimension (there's no difference between row and column vectors), while a matrix refers to an array with two dimensions. For 3-D or higher dimensional arrays, the term tensor is also commonly used. For now, however, we will focus on vectors and matrices to build familiarity with the basic concepts.

To create a NumPy array, you can use the function `np.array()`.

All you need to do to create a simple array is pass a list to it. This creates a vector:

In [None]:
arr1 = np.array([12, 18, 16])            # Create a one-dimensional array, also called a vector
print(arr1)

Similarly, we can also create matrices, which are arrays with two dimensions (think of the matrices you're used to from math classes). To do this, we use the same `np.array()` function, but pass a nested list to it. Each list within the main list is one row of the matrix:

In [None]:
arr2 = np.array([[12, 4, 16],[4, 11, 2]])    # Create a two-dimensional array (2 rows and 3 columns)
print(arr2)

These are simplistic cases, however. In practice, we often have data stored in files and would like to load this data into our programs for computation and/or analysis. To do this, we can use the `np.loadtxt()` function:

In [None]:
data = np.loadtxt('sample.csv', delimiter=',')

In [None]:
data.ndim                           # Dimension of the array

In [None]:
data.shape                          # The matrix has 20 rows and 5 columns

In [None]:
data.size                           # The array has 100 elements
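If `sample.csv` is not at hand, the same attributes can be explored on an array built in memory. The sketch below assumes a hypothetical 20x5 array called `demo` (a name not used elsewhere in this lab) standing in for the loaded file:

```python
import numpy as np

# Hypothetical stand-in for the file contents: 100 values as 20 rows x 5 columns
demo = np.arange(100, dtype=float).reshape(20, 5)

print(demo.ndim)    # number of dimensions: 2
print(demo.shape)   # (rows, columns): (20, 5)
print(demo.size)    # total number of elements: 100
print(demo.dtype)   # element type: float64
```

Note that `size` is always the product of the entries in `shape`.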

NumPy also provides many functions to create special arrays, e.g. arrays filled with zeros, ones, or random values:

In [None]:
a = np.zeros((3,3))                     # Create a 3*3 zero matrix
print(a)

In [None]:
b = np.ones((1,2))                  # Create an array of all ones
print(b)

In [None]:
c = np.random.random((2,2))         # Create an array of random values
print(c)

Finally, NumPy can also create arrays that consist of sequences over some interval. For instance, you might want to generate an array consisting of the numbers 0, 1, 2, ..., 10, or perhaps the same range with a different interval, e.g. 0, 2, 4, ..., 10. NumPy allows you to do this using the `arange()` function, which takes three arguments:
- start: (optional) the number at the beginning of the sequence. Defaults to 0
- stop: (required) the end of the sequence. This value is exclusive, i.e. the sequence stops before reaching it
- step: (optional) the interval between consecutive numbers. Defaults to 1

In [None]:
sequence_0 = np.arange(5) #here, we supply only one argument which is the stop value. Start will default to 0
print(sequence_0)

sequence = np.arange(0, 10) #generate a sequence between 0 (inclusive) and 10 (exclusive)
print(sequence)

# now, we can specify a step to change the interval:
sequence_2 = np.arange(0, 10, 2)
print(sequence_2)
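As a small additional sketch (the variable names `countdown` and `evens` are only illustrative), the step can also be negative, and the stop value is excluded in either direction:

```python
import numpy as np

countdown = np.arange(10, 0, -2)   # start=10, stop=0 (excluded), step=-2
print(countdown)                   # [10  8  6  4  2]

# The stop value is never included, so the last element stops short of it
evens = np.arange(0, 10, 2)
print(evens[-1])                   # 8
```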



---



---



## 1.2 Indexing Numpy Arrays

We can access particular elements of an array via indexing. Index values start at 0 (which retrieves the first element) and end at n-1 (which retrieves the last element), where n is the size of the indexed dimension.

For instance, indexing a vector is quite straightforward:

In [None]:
a = np.array([1,1,2,3,5,8])
print(a[0]) #gives 1
print(a[5]) #gives 8 
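As with Python lists, negative indices count from the end of the array. A short sketch, using an illustrative array named `fib` with the same values as above:

```python
import numpy as np

fib = np.array([1, 1, 2, 3, 5, 8])
n = fib.size

print(fib[n - 1])   # last element via its positive index: 8
print(fib[-1])      # the same element counted from the end: 8
print(fib[-2])      # second-to-last element: 5
```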

In the case of 2D arrays, the same principle applies - keeping the per-axis sizes in mind:

In [None]:
arr2 = np.random.random((10,5))
print(arr2[0,0], arr2[0,1], arr2[1,0])    # Accessing elements of an array [row_index, column_index]
arr2[8,4] #works fine
#arr2[5,8] #fails because the second dimension has only 5 elements

This is powerful because it also lets us modify particular elements in the array:

In [None]:
arr2[2,0] = 10                              # Change an element of an array
print (arr2)

### Reshaping a matrix

Reshaping is an important operation in NumPy that lets you change the dimensions of an array. For example, a 4x4 matrix can be reshaped into an 8x2 or a 2x8 matrix, as long as the total number of elements (the product of the shape) stays the same. Consider the following:

In [None]:
x = np.arange(12)
print(x)
print(x.shape)

Let's now reshape it into a 3x4 matrix. The `reshape` function only changes the number of rows and columns; the values in the array stay the same:

In [None]:
print(x.reshape(3, 4))
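Two further behaviours of `reshape` are worth a quick sketch (the array name `values` is only illustrative): passing `-1` lets NumPy infer one dimension, and a shape with the wrong number of elements raises a `ValueError`:

```python
import numpy as np

values = np.arange(12)

# -1 asks NumPy to infer that dimension from the total size (12 / 2 = 6)
print(values.reshape(2, -1).shape)   # (2, 6)

# A shape whose product is not 12 is rejected
try:
    values.reshape(5, 3)
except ValueError as err:
    print("reshape failed:", err)
```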

We will revisit Reshaping later on as we look at some of the other libraries we're covering in this session.

### Slicing

Similar to Python lists, numpy arrays can be sliced. Since arrays may be multidimensional, you can specify a slice for each dimension of the array, providing many possibilities:

In [None]:
# Create the following rank 2 array with shape (3, 3)
# [[12  4 16]
#  [ 4 11  2]
#  [ 1  3 21]]
a = np.array([[12, 4, 16],[4, 11, 2], [1, 3, 21]])

# Use slicing to pull out the subarray consisting of the first 2 rows
# and columns 1 and 2; b is the following array of shape (2, 2):
# [[ 4 16]
#  [11  2]]
b = a[:2, 1:3]
print(b)

A slice of an array is a view into the same data, so modifying it will modify the original array.

In [None]:
print(a[0, 1])
b[0, 0] = 77    # b[0, 0] is the same piece of data as a[0, 1]
print(a[0, 1]) 
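When you need a slice that can be modified without touching the original, one option is to call `.copy()` on it. The sketch below uses illustrative names (`grid`, `view`, `independent`) and `np.shares_memory()` to show the difference:

```python
import numpy as np

grid = np.array([[12, 4, 16], [4, 11, 2], [1, 3, 21]])

view = grid[:2, 1:3]                  # basic slicing: shares memory with grid
independent = grid[:2, 1:3].copy()    # explicit copy: owns its own data

independent[0, 0] = -1                # does not touch grid
print(grid[0, 1])                     # still 4

print(np.shares_memory(grid, view))          # True
print(np.shares_memory(grid, independent))   # False
```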

You can also mix integer indexing with slice indexing. However, doing so will yield an array of lower rank than the original array. Note that this is quite different from the way that MATLAB handles array slicing:

In [None]:
# Create the following rank 2 (rank 2 means two dimensional) array with shape (3, 4)
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
print(a)

Two ways of accessing the data in the middle row of the array.
Mixing integer indexing with slices yields an array of lower rank,
while using only slices yields an array of the same rank as the
original array:

In [None]:
row_r1 = a[1, :]    # Rank 1 view of the second row of a  (rank 1 means one dimensional)
row_r2 = a[1:2, :]  # Rank 2 view of the second row of a  (rank 2 means two dimensional)
row_r3 = a[[1], :]  # Rank 2 copy of the second row of a  (indexing with a list returns a copy, not a view)
print(row_r1, row_r1.shape)
print(row_r2, row_r2.shape)
print(row_r3, row_r3.shape)

In [None]:
# We can make the same distinction when accessing columns of an array:
col_r1 = a[:, 1]
col_r2 = a[:, 1:2]
print(col_r1, col_r1.shape)
print(col_r2, col_r2.shape)

### Boolean array indexing

Boolean array indexing lets you pick out elements of an array which satisfy a given logical condition. Here is an example:

In [None]:
a = np.array([[1,2], [3, 4], [5, 6]])

bool_idx = (a > 2)  # Find the elements of a that are bigger than 2;
                    # this returns a numpy array of Booleans of the same
                    # shape as a, where each slot of bool_idx tells
                    # whether that element of a is > 2.

print(bool_idx)

In [None]:
# We use boolean array indexing to construct a rank 1 array
# consisting of the elements of a corresponding to the True values
# of bool_idx
print(a[bool_idx])

# We can do all of the above in a single concise statement:
print(a[a > 2])
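Boolean masks can also be combined and used for assignment. A minimal sketch, with `m` and `in_range` as illustrative names:

```python
import numpy as np

m = np.array([[1, 2], [3, 4], [5, 6]])

# Combine conditions with & (and) or | (or); the parentheses are required
in_range = (m > 2) & (m < 6)
print(m[in_range])        # [3 4 5]

# A Boolean mask can also appear on the left-hand side of an assignment
m[m % 2 == 0] = 0         # set every even element to zero
print(m)
```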

For brevity we have left out a lot of details about numpy array indexing; if you want to know more you should read the [documentation](https://numpy.org/doc/stable/reference/arrays.indexing.html).



---



---



## 1.2 Linear Algebra

Basic mathematical functions operate elementwise on arrays, and are available as both operator overloads and functions in the NumPy module:

In [None]:
a = np.array([1,2,3])
b = np.array([4,5,6])

print(a + b)              # Elementwise sum; both produce the array [5 7 9]
print(np.add(a, b))

In [None]:
print(a - b)              # Elementwise difference; both produce the array [-3 -3 -3]
print(np.subtract(a, b))

In [None]:
a = np.array([1,2,3])
b = np.array([4,5,6])
c = np.dot(a,b)             # Inner product of vectors
print(c)

In [None]:
a = [[1,0],[0,1]] 
b = [[4,1],[2,2]] 
c = np.matmul(a,b)          # Matrix multiplication 
print(c)

In [None]:
a = np.array([[1,2],[3,4]]) 
b = np.linalg.inv(a)        # Matrix inversion
print(a)
print(b)

In [None]:
print(a @ b) # You can also use @ to perform matrix multiplication
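Because of floating-point round-off, the product of a matrix and its inverse is usually only approximately the identity. One way to check this, sketched with the illustrative names `mat` and `mat_inv`, is `np.allclose()`:

```python
import numpy as np

mat = np.array([[1, 2], [3, 4]])
mat_inv = np.linalg.inv(mat)

product = mat @ mat_inv
# Compare with a tolerance rather than with ==, since entries may be
# off from 0 or 1 by a tiny rounding error
print(np.allclose(product, np.eye(2)))   # True
```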

In [None]:
a = np.array([[1,2], [3,4]]) 
print(np.linalg.det(a))   # Calculate the determinant of a matrix

In [None]:
a = np.arange(12).reshape(3,4)  # Create an array of integers in range 0-11 and reshape to 3x4 matrix
print(a)     
print('\n')
print(a.T)   #Transpose matrix

In [None]:
np.eye(5) #create a 5x5 identity matrix



---



---


## 1.3 Some Common Functions

In [None]:
arr = np.array([4,9,16])
print(np.sqrt(arr))  # square root
print(np.abs(arr))   # absolute value
print(np.sum(arr))   # sum
print(np.mean(arr))  # mean
print(np.max(arr))   # max

In [None]:
arr = np.array([9,-2,7,6,3,1,2])
arr.sort()           # Sorts an array into ascending order by default
print(arr)

In [None]:
np.any(arr>0)        # Determine if any element of an iterable is True

In [None]:
np.all(arr>0)        # Determine if all elements in the given iterable are true
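Many of these functions also accept an `axis` argument when applied to a matrix. The following sketch uses an illustrative 2x3 array named `table`:

```python
import numpy as np

table = np.array([[1, 2, 3],
                  [4, 5, 6]])

print(np.sum(table))           # sum of all elements: 21
print(np.sum(table, axis=0))   # collapse the rows, one sum per column: [5 7 9]
print(np.sum(table, axis=1))   # collapse the columns, one sum per row: [ 6 15]
print(np.mean(table, axis=0))  # column means: [2.5 3.5 4.5]
```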

---
---

# 2. Getting Started with Data Visualization using Matplotlib

Now that we can load and manipulate numeric data, the next logical step is being able to visualize that data. Matplotlib is a Python plotting library that will enable us to do just that.

In this section we give a brief introduction to the `matplotlib.pyplot` module, which provides a plotting system similar to that of MATLAB. With a few lines of code, we can generate plots, histograms, bar graphs, and scatter plots, etc. using Matplotlib.

We can load this library as follows:

In [None]:
import matplotlib.pyplot as plt

Note: If you are using a Jupyter notebook, run the `%matplotlib inline` magic after the imports. It tells IPython where (and how) to display plots: with inline plotting, plot graphics appear in the notebook directly below the cell that created them. This has an important implication for interactivity: commands in cells below the cell that outputs a plot will not affect that plot.


In [None]:
%matplotlib inline

### General Concepts

A Matplotlib figure can be split into different parts as below:

**Figure:** It is a whole figure which may contain one or more than one axes (plots). You can think of a Figure as a canvas which contains plots.

**Axes:** The area in which your plot appears. A Figure can contain many Axes. Each Axes contains two (or three, in the case of 3D) Axis objects, and has a title, an x-label, and a y-label.

**Axis:** The number-line-like objects that take care of generating the graph limits. (An Axis is the axis of the plot, the thing that gets ticks and tick labels.)

**Artist:** Everything which one can see on the figure is an artist like Text objects, Line2D objects, collection objects. Most Artists are tied to Axes.

The below figure describes the anatomy of a general Matplotlib figure.

![Image source :  https://matplotlib.org/3.1.1/gallery/showcase/anatomy.html](https://paper-attachments.dropbox.com/s_B536E0D4A8117651BB96F9D57C4499954C1D5F288484652137AACB534421E956_1576568028457_anatomy.png)
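To make these terms concrete, here is a minimal sketch that builds one Figure with one Axes and inspects the resulting objects. The title text and the variable names (`fig`, `ax`, `t`, `line`) are only illustrative:

```python
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()            # one Figure containing a single Axes
t = np.linspace(0, 1, 50)

line, = ax.plot(t, t**2)            # the plotted line is an Artist (a Line2D object)
ax.set_title('Anatomy demo')        # the title belongs to the Axes
ax.set_xlabel('x')                  # labels are attached to the x and y Axis
ax.set_ylabel('y')

print(type(fig).__name__)           # Figure
print(type(line).__name__)          # Line2D
plt.show()
```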



## 2.1 Plotting

The most important function in `matplotlib` is `plot()`, which allows you to plot 2D data. Here is a simple example:

In [None]:
import matplotlib.pyplot as plt
import numpy as np

data = np.arange(0,101)        # Compute data points that need to be plotted
plt.figure(facecolor='gray')   # Canvas background is gray
plt.plot(data)                 # Draw a line chart
plt.show()

We can also plot multiple lines on the same graph at once, and add other elements such as a title, legend, and axis labels:

In [None]:
data = np.arange(0, 1.1, 0.01)
plt.title('Title')              # Add title
plt.xlabel('x')                 # Add x axis name
plt.ylabel('y')                 # Add y axis name
plt.xticks([0, 0.5, 1.0])    
plt.yticks([0, 0.5, 1.0])
plt.plot(data, data**2)
plt.plot(data, data**3, linestyle='--')  # We can change line styles or marker type
plt.legend(['y=x^2','y=x^3'])   # Add legend
plt.show()

And we can go even further to add even more plots with different styles per plot and other bells and whistles:

In [None]:
x = np.arange(15)
# Syntax : plot(x, y, color='green', marker='o', linestyle='dashed'), where x and y are coordinates
# we can also specify a label per plot which legend() can use later by adding an argument label=...

plt.plot(x, x , color = 'black' , marker = 'o' , linestyle = 'dashed', label='y = x')
plt.plot(x, 2 * x , 'bo', label = 'y = 2x')  # plot with color blue and marker 'o'
plt.plot(x, 3 * x , '+r', label = 'y = 3x' )  # plot with color red and marker '+'
plt.plot(x, 4 * x , '^g', label = 'y = 4x' ) # plot with color green and marker '^'
plt.plot(x, 6 * x , '--b', label = 'y = 6x' ) # plot with color blue and dashed line style '--'
plt.plot(x, 7 * x, label = 'y = 7x')
plt.plot(x, 8 * x, label= 'y = 8x')
plt.legend(loc='upper left') # legend will appear on the top left of the figure.
plt.show()

We can also create multiple subplots within the same figure using `plt.subplots()`:

In [None]:
# Create multiple subgraphs
nums = np.arange(0,101)
fig, axes = plt.subplots(2,2)          # Set up a subplot grid with 2 rows and 2 columns
ax1 = axes[0,0]                       # Assign different variable names to each subplot
ax2 = axes[0,1]
ax3 = axes[1,0]
ax4 = axes[1,1]
ax1.plot(nums,nums)
ax2.plot(nums,-nums)
ax3.plot(nums,nums**2)
ax4.plot(nums,np.sqrt(nums))
plt.show()

You can read much more about the `subplots` function in the [documentation](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.subplot).

Thus far, we have only explored line plots. However, `matplotlib` also offers other plot types. For example,

### Scatter Plots

A scatter plot (also called a scatter, scatter graph, scatter chart, scatter gram, or scatter diagram) is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data.

In [None]:
x = np.arange(51)
y = np.random.rand(51)*10
plt.scatter(x,y, marker='*')  # Draw a scatter plot with '*' denoting each point. Play around with different characters!
plt.show()

### Bar Plots

A bar plot represents categorical data with rectangular bars, where the height of each bar is proportional to the value it represents. In the example below, each bar shows the number of students in a subject.

In [None]:
fig = plt.figure()
axs = fig.add_axes([0,0,2,2])
langs = ['CV', 'NLP', 'AI', 'Maths', 'Data Science']
students = [23,17,35,29,12]
axs.bar(langs, students)
plt.show()

### Histogram

A histogram is a graphical display of data using bars of different heights. In a histogram, each bar groups numbers into ranges. Taller bars show that more data falls in that range. A histogram displays the shape and spread of continuous sample data.

In [None]:
arr_random = np.random.randn(100)
plt.hist(arr_random,bins=8,color='g',alpha=0.7) # Plot a histogram
                                                # alpha is the parameter to control transparency, 
                                                # bins defines the number of equal-width 
                                                # bins in the range.
plt.show()
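To see the numbers behind such a plot, NumPy's `np.histogram()` computes the per-bin counts and the bin edges without drawing anything. A short sketch (the names `samples`, `counts`, and `edges` are illustrative, and the values differ on every run because the data is random):

```python
import numpy as np

samples = np.random.randn(100)

# np.histogram returns the count in each bin and the bin edges
counts, edges = np.histogram(samples, bins=8)

print(counts)         # 8 counts, one per bin
print(edges)          # 9 edges delimiting 8 equal-width bins
print(counts.sum())   # 100, since every sample falls in some bin
```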

# 3. Scikit Image

Scikit-image is a Python package dedicated to image processing. Underneath, it relies on NumPy arrays to represent images, so we will use it to experiment a little. 

In [None]:
from skimage import data

img = data.astronaut()

After executing the above, `img` contains the pixel values of the photo as a NumPy array. We can use Matplotlib to view the photo by using its `imshow()` function:

In [None]:
plt.imshow(img)
plt.show()

We can also examine the image data using some NumPy functions:

In [None]:
# Dimension of image: pixels in (r,c)
img_size = img.shape
print('Size of image: \n{} \n'.format(img_size))
dim1, dim2 = img.shape[0], img.shape[1]
num_channels = img.shape[2]

# RGB Colour image has three channels: R,G,B
print('No. of channels: \n{}'.format(num_channels))

As a challenge, can you use matplotlib to separately plot the three channels of the image in their own plots/windows? Use your knowledge of slicing and subplots here...

In [None]:
# Implement here, if you can!

# 4. Get Started with Scikit-learn
Scikit-learn, also known as `sklearn`, is one of the most famous Python modules in machine learning. Sklearn includes many subpackages designed for machine learning tasks, such as:

* Classification

* Regression

* Clustering

* Dimensionality reduction

* Model selection

* Data preprocessing

These subpackages contain Python implementations of various algorithms to perform these tasks. These implementations/classes are generally referred to as **estimators**.

The estimator you choose for your project will depend on the data set you have and the problem that you are trying to solve. The `Scikit-learn` documentation helpfully provides this diagram, shown below, to help you to determine which algorithm is right for your task.

![ml_map.png](https://scikit-learn.org/stable/_downloads/b82bf6cd7438a351f19fac60fbc0d927/ml_map.svg)

What makes Scikit-learn so straightforward to use is that regardless of the model or algorithm you are using, the code structure for model training and prediction is the same. We will explore this in depth as we begin our first foray into actual machine learning shortly.

---
---
---

# 4. Linear Regression

## 4.1 Simple Linear Regression

We will start with the most straightforward machine learning task, which is linear regression. Linear regression involves "regressing" one variable on another, which is just a fancy way of saying calculating one value (the "dependent variable") once you have another (the "independent variable"). The linear part refers to the assumption that the two quantities have a "linear" relationship, i.e. you can predict the value of the _dependent_ variable as a linear function of the _independent_ variable. If you were to plot the two variables on a graph, you would end up with a straight line with some slope.

More formally, linear regression's governing assumption is that the variables (y for dependent, x for independent usually) have the following relationship: 
$$
y = ax + b
$$
where $a$ is commonly known as the *slope*, and $b$ is commonly known as the *intercept*.

An example of such a linear relationship is the one between voltage and current in an electric circuit (via Ohm's law), or between the speed of a car under constant acceleration and the elapsed time.

For example, consider the following data, which is scattered about a line with a slope of 4 and an intercept of -6:

In [None]:
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = 4 * x - 6 + rng.randn(50)
plt.scatter(x, y)
plt.show()

From the figure, we can see that the data appears to follow a linear relationship. This means that we can indeed model it using a linear regression model. And once we have the linear regression model, we will be able to calculate `y` for any given value of `x`. 

To achieve this in practice, we can use Scikit-Learn's ``LinearRegression`` estimator. Let us begin by importing it, and creating an instance of it:

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)

Now, we will "fit" our estimator. This fit step corresponds to the training step of the standard machine learning pipeline. In this case, our estimator will now calculate the value of `a` and `b` in our equation above to build the model.

Ps: All Sklearn estimators, whether for classification, regression, etc., have a `fit()` function. Calling this function triggers the training step of the particular estimator. Afterwards, the estimator can be used on new, unseen data (we'll see that shortly).

In [None]:
model.fit(x[:, np.newaxis], y) # "fit"/train the estimator

Now, our estimator is fitted, which means it has calculated the best values of `a` and `b` based on the data we supplied. The slope (`a`) and intercept (`b`) of the data are contained in the model. For a linear regression model, we can access the values of these parameters via the attributes ``coef_`` and ``intercept_`` respectively.

Ps: Model parameters learned during fitting are always marked by a trailing underscore in Scikit-Learn e.g. ``coef_``, ``intercept_``, etc.

In [None]:
print("Model slope:    ", model.coef_[0])
print("Model intercept:", model.intercept_)

We see that the results are very close to the inputs, as we might hope. This suggests that our model is properly fitted/trained.
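As a cross-check, the same slope and intercept can be computed by hand. The sketch below assumes ordinary least squares (which is what `LinearRegression` fits) and uses the textbook closed-form estimates; the names `a_hat` and `b_hat` are introduced here only for illustration:

```python
import numpy as np

# Same synthetic data as above: slope 4, intercept -6, plus Gaussian noise
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = 4 * x - 6 + rng.randn(50)

# Least-squares estimates for y = a*x + b:
#   a = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   b = mean_y - a * mean_x
a_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b_hat = y.mean() - a_hat * x.mean()

print("Closed-form slope:    ", a_hat)
print("Closed-form intercept:", b_hat)
```

The printed values should agree with `model.coef_[0]` and `model.intercept_` up to floating-point precision.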

Now, let us see how well it does on other data outside of our training data:

In [None]:
xfit = np.linspace(0, 15, 1000)
yfit = model.predict(xfit[:, np.newaxis])

plt.scatter(x, y)
plt.plot(xfit, yfit)

## 4.2 Multiple Regression: A Practical Example

We have studied a very simple case of linear regression, where we have a single independent variable. In many cases in the real world, the dependent variable `y` is actually a function of multiple independent variables. For instance in the medical field, a person's body-mass index (BMI) is a function of _both_ their height and weight.

For this reason, it is common to represent linear regression (whether univariate or multivariate) as:

$$
    Y = A \cdot X + B
$$

This is almost identical to our initial equation, but note that Y, A, X and B are now all capital letters (for notational purposes). Mathematically, this means that they are now vectors. While this may sound complicated initially, it actually gives us more flexibility: in the univariate case, Y, A, X and B are single-element vectors (also called scalars), while in the multivariate case they are all vectors. So this form can be considered a more general representation.

For easy understanding, let us assume that we are trying to predict a single output value (i.e. `Y` is a scalar value, meaning `B` is also a scalar value), but we have multiple independent variables (i.e. `X` is a vector, meaning `A` is also a vector). Then in this case, we can use our knowledge of the dot product to express the above equation as:

$$
    y = a_1x_1 + a_2x_2 + a_3x_3 + ... + a_nx_n + b
$$

In other words, our aim becomes finding suitable values for the `A` vector, which contains multiple coefficients, as well as for `B` (in this case it's just one value).
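The equivalence between the expanded sum and the dot product can be checked on a single sample. All values below are hypothetical and chosen only for illustration, as are the names `coeffs`, `features`, and `intercept`:

```python
import numpy as np

coeffs = np.array([1.5, -2.0, 1.0])     # hypothetical a_1, a_2, a_3
features = np.array([2.0, 1.0, 3.0])    # hypothetical x_1, x_2, x_3
intercept = 0.5                         # hypothetical b

# Term-by-term sum: a_1*x_1 + a_2*x_2 + a_3*x_3 + b
y_long = (coeffs[0] * features[0]
          + coeffs[1] * features[1]
          + coeffs[2] * features[2]
          + intercept)

# The same computation written as a dot product
y_dot = np.dot(coeffs, features) + intercept

print(y_long, y_dot)   # 4.5 4.5
```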

``sklearn`` supports multivariate linear regression out of the box. However, unlike the univariate case, such regressions are more difficult to visualize graphically. We can still see one of these fits in action by building some example data with NumPy's `np.dot()` function:

In [None]:
rng = np.random.RandomState(1)
X = 10 * rng.rand(100, 3)
y = 0.5 + np.dot(X, [1.5, -2., 1.])

model.fit(X, y)
print(model.intercept_)
print(model.coef_)

Here the $y$ data is constructed from three random $x$ values, and the linear regression recovers the coefficients used to construct the data.

Now let us consider a practical exercise.

### California House Price Prediction

Let us consider the California House Price Prediction problem. Here, the aim is to predict the cost of a house given some of its properties e.g. number of bedrooms, the population of the area it is located in, etc. As expected, we have some training data, which consists of actual data samples (`X`) and the corresponding house price (`Y`). As ML practitioners we would like to use this provided data to build a model that we can then apply later on to new, unseen houses, possibly for our house financing startup.

Let us begin by loading the data. Luckily it is built into Sklearn, so we can use the ``sklearn`` ``datasets`` module for this:

In [None]:
from sklearn.datasets import fetch_california_housing 
california = fetch_california_housing() 

We can use `.keys()` to list the keys available on the returned object:

In [None]:
print(california.keys())  

For finding the shape of our dataset :

In [None]:
california.data.shape

This means it has 20,640 rows/observations and 8 independent variables/features.

For getting the feature names :

In [None]:
california.feature_names

Now, we are going to split the data into testing and training parts. The training part will be used to "fit" our estimator. The testing part will be used to "test" it i.e. see how good it is (more on this later). This means that the testing portion of the data is NEVER used in the training process.

Sklearn provides a handy function for this purpose called `train_test_split()` (in the `sklearn.model_selection` module). To use it, you need to pass 3 parameters:
- features: the NumPy array containing the features/predictors/independent variables
- target: the NumPy array containing the target/output/dependent variable
- test_size: a fraction between 0 and 1 specifying the proportion of the data to be put in the testing set. The training set size is automatically computed as the remainder, i.e. `1 - test_size`. Alternatively, you can specify the `train_size` argument, in which case the test set size is computed automatically.

This will split the data randomly according to the train_size/test_size specified. If you want the split to be reproducible (i.e. you want the data split the same way each time), you can supply an argument called `random_state` (more on this later).

Let us see this in action:

In [None]:
x = california.data
y = california.target


from sklearn.model_selection import train_test_split 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)
print("x_train shape : ",x_train.shape)
print("x_test shape : ",x_test.shape)
print("y_train shape : ",y_train.shape)
print("y_test shape : ",y_test.shape)
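The effect of `random_state` can be sketched on a small toy dataset. The arrays `features` and `target` and the seed value 42 are hypothetical choices made only for this illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features
features = np.arange(20).reshape(10, 2)
target = np.arange(10)

# The same random_state gives the same shuffle on every call
split_a = train_test_split(features, target, test_size=0.2, random_state=42)
split_b = train_test_split(features, target, test_size=0.2, random_state=42)

# Each result is [x_train, x_test, y_train, y_test]
print(split_a[1].shape)                         # test features: (2, 2)
print(np.array_equal(split_a[1], split_b[1]))   # True
```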

We can now fit our linear regression model :

In [None]:
linear_regression = LinearRegression() 
linear_regression.fit(x_train, y_train) 

Recall that we have 8 independent variables. Let's see how the model handles this internally:

In [None]:
linear_regression.coef_

As expected, our "slope" parameter is now a vector of 8 elements, where each element is attached to one of the independent variables/features.

Now, let us use our trained model to predict the house prices on our testing data. Recall that the testing data was not at all seen by our model when it was being trained. This means that we can use it to assess how well our model does at computing the house price for new houses it hasn't been trained on (which is really what we want this model for anyway):

In [None]:
y_prediction =  linear_regression.predict(x_test)

Plotting the results:

In [None]:
plt.scatter(y_test, y_prediction, c = 'red') 
plt.xlabel("Actual Price (in $100,000s)") 
plt.ylabel("Predicted Price (in $100,000s)") 
plt.title("Observed value vs predicted value") 
plt.show()

## 4.3 Model Evaluation

In Machine Learning, our aim is to build models that suitably approximate some process solely based on observations from data. As a result, we can always come up with models that we believe approximate the underlying process. However, we have to find some measurable or quantifiable way to put a number on the correctness of our model. This is where evaluation metrics come in.

An evaluation metric is basically a function that quantifies our model in the context of some task. This metric allows us to gauge our model's performance, and also compare various models on the same task. 

Evaluation metrics are selected based on the desired task or outcome. For instance, in our exercise, our model is predicting a numeric output. So one evaluation metric might be the average error it makes, i.e. on average, how far off the mark it is from the true value. Ideally, an evaluation metric should have a finite scale and be ordinal, preferably with a zero (to indicate a "perfect" model).

For the regression task, there are many common metrics in use e.g. the one we have described above (which is formally called Mean Absolute Error/MAE). A more popular, closely-related alternative is the Mean Squared Error/MSE, which is the average _squared_ error in the model's predictions. We will adopt the MSE in evaluating our model shortly.
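Both metrics are simple to compute by hand. The sketch below uses three hypothetical true values and predictions (not taken from the housing data) and also shows the root of the MSE, which is expressed in the same units as the target:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0])   # hypothetical true values
y_pred = np.array([2.5, 5.0, 4.0])   # hypothetical predictions

errors = y_pred - y_true             # [-0.5, 0.0, 2.0]
mae = np.mean(np.abs(errors))        # (0.5 + 0 + 2) / 3
mse = np.mean(errors ** 2)           # (0.25 + 0 + 4) / 3
rmse = np.sqrt(mse)                  # back in the units of the target

print("MAE: ", mae)
print("MSE: ", mse)
print("RMSE:", rmse)
```

Because the errors are squared, the single large error of 2.0 dominates the MSE far more than it dominates the MAE.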

Sklearn provides implementations of these evaluation metrics in the `sklearn.metrics` module. Let us see how we can score or evaluate our model based on the MSE metric:

In [None]:
from sklearn.metrics import mean_squared_error 
MSE = mean_squared_error(y_test, y_prediction) 
print("Mean Square Error : ", MSE)

Can you take a moment to intuit what this number means in practical terms?

## 5 Repurposing Linear Models for Classification via Logistic Regression

Although the name suggests regression, Logistic regression is a supervised classification algorithm in machine learning. It can be used to solve both binary classification problems, and multi-class classification problems. 

More formally, it can be viewed as an adaptation of linear regression to the case where the target variable is categorical (non-continuous) in nature. A binary logistic regression results in a binary outcome, i.e. the output class consists of only two categories.

To see how this works, let us revisit the Linear Regression Equation (with minor changes in notation):
$$ y = a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_n x_n $$

Here, $y$ is the dependent variable and $x_1, x_2, \ldots, x_n$ are the explanatory (input) variables.

Normally, $y$ is continuous. However, in a (binary) classification problem, we would like to get a discrete output ("label"). For simplicity, we will assume a binary case where our two labels are 0 and 1.

We can therefore interpret the output label in a probabilistic way. In other words, we can think of 0 meaning that there is 0 chance of something being true, and 1 as there being 100% certainty that it is true. For example, say we are building a dog vs cat classifier. In this case, we can assign label 0 to dog and 1 to cat. Intuitively you can see that label 0 more or less means that there is 0 chance the input is a cat, and 1 means that the given input indeed belongs to a cat.

However, recall that the output of our regression model is continuous. This means that, depending on the input, it might produce outputs below 0 or greater than 1, which are meaningless as probabilities. Therefore, we need to find a way to constrain the output of our model to lie between 0 and 1.

To achieve this, we can use a sigmoid function.


### The Sigmoid Function
The sigmoid function, also called the logistic function, gives an 'S'-shaped curve that can take any real-valued number and map it into a value strictly between 0 and 1 (both endpoints excluded). If the output of the sigmoid function is more than 0.5, we can classify the outcome as 1 or YES, and if it is less than 0.5, we can classify it as 0 or NO. This is demonstrated in the graph below:

<p align="center">
    <img src='https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/500px-Logistic-curve.svg.png'/>
</p>

P.S.: The sigmoid function is a member of a family called squashing functions, which "squash" their output values into some finite range.

As the figure shows, the output of the function asymptotically approaches 0 as its input tends to negative infinity, and approaches 1 as its input tends to positive infinity.

The Sigmoid Function is given by:
$$
p(x) = \frac{1}{1+e^{-x}}
$$
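As a quick check of the formula above, here is a minimal NumPy sketch of the sigmoid. The function name `sigmoid` and the sample inputs are our own choices for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array of numbers) into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A large negative input, zero, and a large positive input
z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))   # roughly [0.0067, 0.5, 0.9933]
```

Note that an input of 0 maps to exactly 0.5, which is why 0.5 is the natural decision threshold.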

If we apply the sigmoid function to the output of a linear regression model, we "squash" that output into the range between 0 and 1, which gives it the probabilistic interpretation we are looking for. This yields the expression for a logistic regression classifier:
$$ 
p(X) = \frac{1}{1+e^{-(a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_n x_n)}}
$$
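To make the expression concrete, the sketch below evaluates it by hand for two made-up observations. The coefficients `a0` and `a` are hypothetical values chosen only for illustration; they are not fitted to any data.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration (not fitted to any data)
a0 = -1.0                    # intercept a_0
a = np.array([2.0, -0.5])    # weights a_1 and a_2

# Two example observations, one per row, with features x_1 and x_2
X_demo = np.array([[1.0, 0.0],
                   [0.0, 4.0]])

z = a0 + X_demo @ a               # linear part: a_0 + a_1*x_1 + a_2*x_2
p = 1.0 / (1.0 + np.exp(-z))      # sigmoid squashes z into (0, 1)
labels = (p > 0.5).astype(int)    # threshold at 0.5 to get a class label

print(z)        # [ 1. -3.]
print(p)        # roughly [0.731, 0.047]
print(labels)   # [1 0]
```

A positive linear output gives a probability above 0.5 (label 1), and a negative one gives a probability below 0.5 (label 0).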

It turns out that sklearn actually also has an implementation of logistic regression, which we can use just as easily as the linear regression models we've seen thus far. We will explore a simple classification problem with our new friend, the logistic regression classifier to get some hands-on practice with it.

### Diabetes Prediction

Here, we are going to tackle the problem of diagnosing diabetes from the recorded details (formally, features) of many individuals, using a logistic regression classifier. For this task we will use the well-known Pima Indians Diabetes Database.

Let's first load the dataset using the NumPy `loadtxt()` function:

In [None]:
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'outcome']

# Load the dataset: skip the first row (the header); the file is comma-delimited
pima = np.loadtxt("https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv", skiprows=1, delimiter=',')

We can quickly look at the first five lines of the data. Can you use your knowledge of indexing/slicing to do this?

In [None]:
# Preview the first 5 lines of the loaded data 

### Splitting the Data Columns

Not all of the given columns are useful for our task. Furthermore, the outcome we're trying to predict is also part of the data in the file, so we need to separate it out.
The columns of interest are:
- features: 'pregnant', 'insulin', 'bmi', 'age', 'glucose', 'bp', 'pedigree'
- label: 'outcome'

Use your NumPy skills to index these columns into new variables X and y. X will be for the features, and y will be for the corresponding labels:

In [None]:
#split dataset in features and target variable
X = ... # Features
y = ... # Target variable

### Splitting Data

As usual, we are going to split our given dataset into training and testing sets. The training set will be used to fit/train our estimator, while the testing set will be used to evaluate/assess the performance of our classifier (just as we have seen before). 

Go ahead and complete the code segment below to achieve this with a 75%/25% training/testing split (you can revisit our example above to see how to do it):

In [None]:
# split X and y into training and testing sets with 75/25 train/test ratio
X_train, X_test, y_train, y_test = ...

### Model Development and Prediction

First, import the Logistic Regression module and create a Logistic Regression classifier object:

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(max_iter=1000)

Then, fit your model on the training portion of the dataset using `fit()`:

In [None]:
# fit the model with data using the fit() function

During the fitting process, the model learned the values of the intercept $a_0$ and the weights $a_1, \ldots, a_n$. It is now ready to make predictions via the `predict()` function, which returns one label for each observation passed into it; e.g. if you pass in 100 observations, you will receive 100 predicted labels.
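To see how fitting and predicting relate to the sigmoid, here is a small sketch on a synthetic dataset (not the diabetes data). The names `X_toy`, `y_toy`, and `toy_model` are made up for this illustration. It shows where the learned coefficients are stored, and that for a binary problem the labels from `predict()` match thresholding the class-1 probability from `predict_proba()` at 0.5.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic binary problem, used here only for illustration
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)

toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(X_toy, y_toy)

print(toy_model.intercept_)   # learned a_0
print(toy_model.coef_)        # learned a_1 ... a_n, shape (1, 4)

proba = toy_model.predict_proba(X_toy[:5])   # columns: P(class 0), P(class 1)
labels = toy_model.predict(X_toy[:5])        # one label per observation
print(proba)
print(labels)
```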

Use the fitted model to get the predictions on the <ins>testing portion</ins> of the data.

In [None]:
# get predictions for the test data points
y_pred = ...

Now we have the model's predicted outputs on data it hasn't seen before. Just like last time, we will now score or assess the model based on the ground-truth labels we already have. We will explore this further in the next section.

## 5.1 Model Evaluation for Classifiers

Classifiers are evaluated with different metrics than regression models. This makes sense because the output of a regression model is continuous, while the output of a classifier is discrete, so "correctness" means something quite different for classifiers.

An intuitive metric for classifiers is the notion of accuracy. Given a number of test instances N, accuracy simply refers to the proportion (or percentage) of cases where the model makes the right prediction. This isn't very different from how you are evaluated on exams in the real world 😁.
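Written as a formula, this definition reads as follows. The notation is our own: $y_i$ is the true label of the $i$-th test instance, $\hat{y}_i$ is the model's prediction for it, and $\mathbb{1}[\cdot]$ is 1 when the condition holds and 0 otherwise.

$$
\text{accuracy} = \frac{\text{number of correct predictions}}{N} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\hat{y}_i = y_i]
$$

For example, if the true labels are $(1, 0, 1, 1, 0)$ and the predictions are $(1, 0, 0, 1, 0)$, then 4 of the 5 predictions match, so the accuracy is $4/5 = 0.8$, or 80%.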

Can you use NumPy to implement a rudimentary accuracy function? You can do it with loops if you want, or you can use NumPy's advanced boolean capabilities to do it.

In [None]:
def simple_accuracy_function(ground_truth, predictions):
    ...

Luckily, Sklearn also has a straightforward implementation we can use for this purpose. Unsurprisingly, it is also in the `sklearn.metrics` module, and has the name `accuracy_score()`. It has the same signature as our simple version above, so you can use it fairly easily. 

Using this function, compute the accuracy of our classifier on the test set. If you implemented your own function above, compare the results. Are they the same? Can you also intuit or interpret these results?

In [None]:
# Use sklearn's accuracy_score() to compute the classifier accuracy.

### Confusion Matrix

You will notice that accuracy gives you a single number summarizing the classifier's overall performance. However, there are cases where it is important to understand the classifier's performance on a class-by-class basis. For instance, in a medical setting, it may be preferable for the model to wrongly flag a healthy person as sick (a false positive) than to miss someone who really is sick (a false negative).

We can perform this sort of class-level analysis using a confusion matrix. As the name suggests, it is a matrix. Less apparently, it shows the degree of "confusion" of the model. You can think of confusion as a phenomenon where the classifier _confuses_ something from one class/group as being a member of another. Being able to see this confusion allows you to analyze which (pairs of) classes the model has trouble distinguishing.

As usual, sklearn has a nice implementation of the confusion matrix calculation, called (shockingly!) `confusion_matrix()`, ready to use in the `sklearn.metrics` module. Go ahead and import the function, then display the confusion matrix of our classifier.

In [None]:
# print/show confusion matrix of logistic regression classifier

You can see that the confusion matrix is simply a NumPy array. Its shape is $2 \times 2$ because this is a binary classification problem, i.e. there are two classes, 0 and 1.
Diagonal values count correct predictions, while off-diagonal values count incorrect predictions.
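To make the layout concrete, here is a small sketch on made-up labels (the arrays `y_true_toy` and `y_pred_toy` are invented for illustration). In sklearn's convention, rows correspond to the true class and columns to the predicted class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up ground-truth labels and predictions for six observations
y_true_toy = np.array([0, 0, 1, 1, 1, 0])
y_pred_toy = np.array([0, 1, 1, 1, 0, 0])

cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)
# [[2 1]    row 0: true class 0 -> 2 predicted as 0, 1 predicted as 1
#  [1 2]]   row 1: true class 1 -> 1 predicted as 0, 2 predicted as 1

# For the binary case, unpack the four cells in row-major order
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)   # 2 1 1 2

# The diagonal holds the correct predictions, so accuracy is trace / total
print(np.trace(cm) / cm.sum())   # 4 correct out of 6
```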

### Visualizing Confusion Matrix using Heatmap

For better visual presentation, we can also visualize the confusion matrix as a heatmap. Sklearn provides a class called `ConfusionMatrixDisplay` in the `sklearn.metrics` module which can do this for us. Let us see how we can use its functionality in the cell below to get a nicer confusion matrix:

In [None]:
import matplotlib.pyplot as plt  # needed for plt.show() below
from sklearn.metrics import ConfusionMatrixDisplay
disp = ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
disp.figure_.suptitle("Confusion Matrix")
print(f"Confusion matrix:\n{disp.confusion_matrix}")

plt.show()

There are other evaluation metrics, such as precision, recall, AUROC, etc. We will revisit those in the third session of the day.

# Task:

For your task, you will build a classifier (based on logistic regression) for handwritten digit recognition. We will use the Sklearn handwritten digits dataset, which you can access using the `sklearn.datasets.load_digits()` function, just like you did for the California housing dataset.

Steps to follow:

1) Load the data and targets.
2) Visualize some random samples from the data to get a feel for how they look.
3) Create training and test splits.
4) Initialise a Logistic Regression model (set the maximum iterations to 3000).
5) Fit your model on the training data.
6) Make predictions on your test data.
7) Evaluate your model and report the classifier's accuracy and confusion matrix as a heatmap.

In [None]:
# Implement Here

What can you see from the Confusion Matrix?