
# Machine Learning terms and metrics

FMML Module 1, Lab 2

In this lab, we will walk through part of the ML pipeline using the California Housing dataset. The dataset has 20640 samples, each with 8 attributes such as the median income and the median house age of the block group. The task is to predict the median house value of each block group (district). We will use the scikit-learn library to load the data and perform some basic data preprocessing and model training. We will also show how to evaluate the model using some common metrics, split the data into training and testing sets, and use cross-validation to get a better estimate of the model's performance.

In [1]:
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)

In [2]:
dataset = datasets.fetch_california_housing()
# Dataset description
print(dataset.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per ce

Given below is the list of target values. These are the median house values associated with the 8 input features, and they are continuous. We should use regression models to predict such values, but for the sake of simplicity we will start with a classification model. To do this, we convert each target value to an integer and treat that integer as a class label. Note that the code below uses `astype(int)`, which truncates (drops the fractional part) rather than rounding to the nearest integer, so 0.923 becomes class 0.
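
The difference between truncating and rounding matters for which class a sample ends up in. The snippet below is a small illustrative sketch, not part of the lab's pipeline: the first three values are taken from the target printout in this lab, the last two are hypothetical values added to show edge cases.

```python
import numpy as np

# Median house values in units of $100,000.
# 4.526, 3.585 and 0.923 appear in the lab's target printout;
# 1.5 and 2.499 are hypothetical values chosen to show edge cases.
values = np.array([4.526, 3.585, 0.923, 1.5, 2.499])

truncated = values.astype(int)         # drops the fractional part (what the lab does)
rounded = np.rint(values).astype(int)  # rounds to the nearest integer instead

print("truncated:", truncated)
print("rounded:  ", rounded)
```

With truncation, class `k` collects all districts whose value lies in the interval from `k` up to (but not including) `k + 1`.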

In [3]:
print("Original target values:", dataset.target)

dataset.target = dataset.target.astype(int)

print("Target values after conversion:", dataset.target)
print("Input variables shape:", dataset.data.shape)
print("Output variables shape:", dataset.target.shape)

Original target values: [4.526 3.585 3.521 ... 0.923 0.847 0.894]
Target values after conversion: [4 3 3 ... 0 0 0]
Input variables shape: (20640, 8)
Output variables shape: (20640,)


The simplest model to use for classification is the K-Nearest Neighbors model. We will use this model to predict the house value with a K value of 1. We will also use the accuracy metric to evaluate the model.

In [4]:
def NN1(traindata, trainlabel, query):
    """
    This function takes in the training data, training labels and a query point
    and returns the predicted label for the query point using the nearest neighbour algorithm

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    query: numpy array of shape (d,) where d is the number of features

    returns: the predicted label for the query point which is the label of the training data which is closest to the query point
    """
    # subtract the query from every training row (NumPy broadcasts query across the n rows)
    diff = traindata - query
    sq = diff * diff  # square the differences
    dist = sq.sum(1)  # squared Euclidean distance to each training sample (no sqrt needed for argmin)
    label = trainlabel[np.argmin(dist)]
    return label


def NN(traindata, trainlabel, testdata):
    """
    This function takes in the training data, training labels and test data
    and returns the predicted labels for the test data using the nearest neighbour algorithm

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    testdata: numpy array of shape (m,d) where m is the number of test samples and d is the number of features

    returns: the predicted labels for the test data which is the label of the training data which is closest to each test point
    """
    predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
    return predlabel
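
To see the nearest neighbour rule at work on something small enough to check by hand, here is a minimal sketch on a toy dataset. The function name `nn_predict` and the toy arrays are hypothetical and only for illustration; the function computes all pairwise squared distances at once with broadcasting instead of looping over the test points.

```python
import numpy as np


def nn_predict(traindata, trainlabel, testdata):
    """1-nearest-neighbour prediction using one broadcasted distance matrix."""
    # (m, 1, d) - (1, n, d) -> (m, n, d): difference between every test/train pair
    diff = testdata[:, None, :] - traindata[None, :, :]
    dist = (diff ** 2).sum(axis=2)  # (m, n) squared Euclidean distances
    return trainlabel[np.argmin(dist, axis=1)]  # label of the closest training point


# Hypothetical toy data: two points of class 0 near the origin, one point of class 1.
toy_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
toy_label = np.array([0, 0, 1])
toy_test = np.array([[0.2, 0.1], [4.0, 4.5]])

toy_pred = nn_predict(toy_train, toy_label, toy_test)
print(toy_pred)  # [0 1]
```

The intermediate array has shape (m, n, d), so this vectorised form is only practical for small inputs. For the full housing data it would need several gigabytes, which is one reason to loop over the test points one at a time.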

We will also define a 'random classifier', which assigns a randomly chosen label to each sample.

In [5]:
def RandomClassifier(traindata, trainlabel, testdata):
    """
    This function takes in the training data, training labels and test data
    and returns the predicted labels for the test data using the random classifier algorithm

    In reality, we don't need these arguments but we are passing them to keep the function signature consistent with other classifiers

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    testdata: numpy array of shape (m,d) where m is the number of test samples and d is the number of features

    returns: the predicted labels for the test data which is a random label from the training data
    """

    classes = np.unique(trainlabel)
    rints = rng.integers(low=0, high=len(classes), size=len(testdata))
    predlabel = classes[rints]
    return predlabel

We need a metric to evaluate the performance of the model. Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm. We will use the accuracy metric to evaluate and compare the performance of the K-Nearest Neighbors model and the random classifier.

In [6]:
def Accuracy(gtlabel, predlabel):
    """
    This function takes in the ground-truth labels and predicted labels
    and returns the accuracy of the classifier

    gtlabel: numpy array of shape (n,) where n is the number of samples
    predlabel: numpy array of shape (n,) where n is the number of samples

    returns: the accuracy of the classifier which is the number of correct predictions divided by the total number of predictions
    """
    assert len(gtlabel) == len(
        predlabel
    ), "Length of the ground-truth labels and predicted labels should be the same"
    correct = (
        gtlabel == predlabel
    ).sum()  # count the number of times the groundtruth label is equal to the predicted label.
    return correct / len(gtlabel)
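
As a quick worked example of the accuracy computation, the sketch below uses small hypothetical label arrays. It also shows why accuracy should be compared against a baseline: on imbalanced labels, always predicting the most common class can already score well.

```python
import numpy as np

# Hypothetical ground-truth and predicted labels for four samples.
gt = np.array([0, 1, 2, 2])
pred = np.array([0, 1, 1, 2])
acc = np.mean(gt == pred)  # 3 of 4 predictions match
print("accuracy:", acc)  # 0.75

# Hypothetical imbalanced labels: three samples of class 0, one of class 1.
imbalanced_gt = np.array([0, 0, 0, 1])
majority_class = np.bincount(imbalanced_gt).argmax()  # most frequent label
majority_pred = np.full_like(imbalanced_gt, majority_class)
majority_acc = np.mean(imbalanced_gt == majority_pred)
print("majority-class baseline accuracy:", majority_acc)  # 0.75
```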

Let us make a function to split the dataset with the desired probability. We will use this function to split the dataset into training and testing sets. We will use the training set to train the model and the testing set to evaluate the model.

In [7]:
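def split(data, label, percent):
    """
    Randomly split the data and labels into two parts.

    percent: fraction (between 0 and 1) of samples that go, on average, into the first part
    returns: split1data, split1label, split2data, split2label
    """
    # generate a random number in [0, 1) for each sample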
    rnd = rng.random(len(label))
    split1 = rnd < percent
    split2 = rnd >= percent

    split1data = data[split1, :]
    split1label = label[split1]
    split2data = data[split2, :]
    split2label = label[split2]
    return split1data, split1label, split2data, split2label
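
Because each sample is assigned independently, a mask-based split like the one above hits the requested fraction only on average. The sketch below contrasts it with a permutation-based split that gives exact sizes. The permutation variant is an alternative shown for comparison, not what this lab uses, and all names in it are hypothetical.

```python
import numpy as np

rng_demo = np.random.default_rng(0)  # separate generator so the lab's rng is untouched
n = 1000
fraction = 0.2

# Mask-based split (same idea as split): the size of the first part is random.
mask = rng_demo.random(n) < fraction
n_masked = int(mask.sum())
print("mask-based first part size:", n_masked)

# Permutation-based split: shuffle the indices, then cut at an exact position.
perm = rng_demo.permutation(n)
n_first = int(round(fraction * n))
first_idx, second_idx = perm[:n_first], perm[n_first:]
print("permutation-based sizes:", len(first_idx), len(second_idx))
```

Either approach gives a valid random split. The exact-size version makes runs easier to compare when the split sizes themselves are being studied.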

We will reserve about 20% of our dataset as the test set (the split is random, so the exact share varies slightly). We will not change this portion throughout our experiments.

In [8]:
testdata, testlabel, alltraindata, alltrainlabel = split(
    dataset.data, dataset.target, 20 / 100
)
print("Number of test samples:", len(testlabel))
print("Number of train samples:", len(alltrainlabel))
print("Percent of test data:", len(testlabel) * 100 / len(dataset.target), "%")

Number of test samples: 4144
Number of train samples: 16496
Percent of test data: 20.07751937984496 %


## Experiments with splits

Let us reserve some of our train data as a validation set. Here we keep about 75% for training and about 25% for validation.

In [9]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 75 / 100)

What is the accuracy of our classifiers on the train dataset?

In [10]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Training accuracy using nearest neighbour algorithm:", trainAccuracy*100, "%")

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Training accuracy using random classifier: ", trainAccuracy*100, "%")

Training accuracy using nearest neighbour algorithm: 100.0 %
Training accuracy using random classifier:  16.4375808538163 %


For nearest neighbour, the train accuracy is 100%: every training sample is its own nearest neighbour, so it gets back its own label (this can only fail if two identical points carry different labels). The accuracy of the random classifier is close to 1/(number of classes), which is 1/6 ≈ 0.1667 in our case. This is because the random classifier picks a label uniformly at random for each sample, so the probability of picking the correct label is 1/(number of classes). Let us predict the labels for our validation set and get the accuracy. This accuracy is a good estimate of the accuracy of our model on unseen data.
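
The 1/(number of classes) figure holds even when the classes are not equally frequent, as long as the guesses are uniform. The simulation below is a minimal sketch of that claim. The class proportions in `probs` are hypothetical and are not the proportions of the housing labels.

```python
import numpy as np

rng_sim = np.random.default_rng(0)
n_classes = 6
n = 100_000

# Hypothetical, deliberately imbalanced class proportions (they sum to 1).
probs = np.array([0.25, 0.35, 0.20, 0.10, 0.05, 0.05])
labels = rng_sim.choice(n_classes, size=n, p=probs)

# Uniformly random guesses, as the random classifier makes.
guesses = rng_sim.integers(low=0, high=n_classes, size=n)

sim_acc = np.mean(labels == guesses)
print("simulated accuracy:", sim_acc)
print("1 / number of classes:", 1 / n_classes)
```

Whatever the true label is, a uniform guess matches it with probability 1/6, so the overall accuracy does not depend on `probs`.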

In [11]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour algorithm:", valAccuracy*100, "%")


valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier:", valAccuracy*100, "%")

Validation accuracy using nearest neighbour algorithm: 34.10852713178294 %
Validation accuracy using random classifier: 16.884689922480618 %


The validation accuracy of nearest neighbour is considerably lower than its train accuracy, while the validation accuracy of the random classifier is about the same as its train accuracy. Still, the validation accuracy of nearest neighbour is about twice that of the random classifier. Now let us try another random split and check the validation accuracy. We will see that the validation accuracy changes with the split, because it depends on which samples happen to land in the validation set; the smaller the validation set, the larger this variation. We can get a better estimate of the accuracy by using cross-validation.

In [12]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 75 / 100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour algorithm:", valAccuracy*100, "%")

Validation accuracy using nearest neighbour algorithm: 34.048257372654156 %


You can run the above cell multiple times to try with different random splits.
We notice that the accuracy is different for each run, but close together.
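
A rough way to reason about how much the accuracy can move from split to split is the binomial standard error, sqrt(p(1 - p)/n), where p is the true accuracy and n the number of validation samples. The sketch below assumes the validation samples are independent draws, which is only an approximation here. The value p = 0.34 is taken from the validation accuracies printed above, and the values of n are hypothetical sizes chosen for illustration.

```python
import math


def accuracy_standard_error(p, n):
    """Binomial standard error of an accuracy estimate from n independent samples."""
    return math.sqrt(p * (1 - p) / n)


# p = 0.34 is roughly the validation accuracy seen above; n values are illustrative.
for n in (40, 400, 4000):
    se = accuracy_standard_error(0.34, n)
    print(f"n = {n:5d}: standard error is about {100 * se:.2f} percentage points")
```

Under this approximation, the spread shrinks with the square root of the validation set size, so a validation set 100 times larger gives an estimate about 10 times tighter.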

Now let us compare it with the accuracy we get on the test dataset.

In [13]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)

print("Test accuracy:", testAccuracy*100, "%")

Test accuracy: 34.91795366795367 %


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Answer for both nearest neighbour and random classifier. You can note down the values from your experiments and plot a graph using <a href="https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py">plt.plot</a>. Also check extreme values for the split, like 99.9% or 0.1%.

> Exercise: Try to implement a 3 nearest neighbour classifier and compare the accuracy of the 1 nearest neighbour classifier and the 3 nearest neighbour classifier on the test dataset. You can use the KNeighborsClassifier class from the scikit-learn library to implement the K-Nearest Neighbors model. You can set the number of neighbors using the n_neighbors parameter. You can also use the accuracy_score function from the scikit-learn library to calculate the accuracy of the model.
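
As a starting point for the exercise, here is a minimal sketch of the scikit-learn calls it mentions. It runs on a small synthetic two-class dataset (two Gaussian blobs) rather than the housing data, so all arrays and the `rng_knn` generator are hypothetical. To complete the exercise, swap in the lab's train and test arrays.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

rng_knn = np.random.default_rng(0)

# Synthetic data (assumption for illustration): two overlapping Gaussian blobs.
X0 = rng_knn.normal(loc=0.0, scale=1.0, size=(200, 2))
X1 = rng_knn.normal(loc=2.0, scale=1.0, size=(200, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

# Shuffle once, then use 300 samples for training and 100 for testing.
perm = rng_knn.permutation(len(y))
train_idx, test_idx = perm[:300], perm[300:]

knn_scores = {}
for k in (1, 3):
    model = KNeighborsClassifier(n_neighbors=k)  # k nearest neighbours, majority vote
    model.fit(X[train_idx], y[train_idx])
    knn_scores[k] = accuracy_score(y[test_idx], model.predict(X[test_idx]))
    print(f"{k}-NN test accuracy:", knn_scores[k])
```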

## Multiple Splits

One way to get more accurate estimates for the test accuracy is by using <b>cross-validation</b>. Here, we will try a simple version, where we do multiple train/val splits and take the average of validation accuracies as the test accuracy estimation. Here is a function for doing this. Note that this function will take a long time to execute. You can reduce the number of splits to make it faster.

In [14]:
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
    """
    This function takes in the data, labels, split percentage, number of iterations and classifier function
    and returns the average accuracy of the classifier

    alldata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    alllabel: numpy array of shape (n,) where n is the number of samples
    splitpercent: float between 0 and 1, the fraction of the data to be used for training (e.g. 0.75)
    iterations: int which is the number of iterations to run the classifier
    classifier: function which is the classifier function to be used

    returns: the average accuracy of the classifier
    """
    accuracy = 0
    for ii in range(iterations):
        traindata, trainlabel, valdata, vallabel = split(
            alldata, alllabel, splitpercent
        )
        valpred = classifier(traindata, trainlabel, valdata)
        accuracy += Accuracy(vallabel, valpred)
    return accuracy / iterations  # average of all accuracies

In [15]:
avg_acc = AverageAccuracy(alltraindata, alltrainlabel, 75 / 100, 10, classifier=NN)
print("Average validation accuracy:", avg_acc*100, "%")
testpred = NN(alltraindata, alltrainlabel, testdata)

print("Test accuracy:", Accuracy(testlabel, testpred)*100, "%")

Average validation accuracy: 33.58463539517022 %
Test accuracy: 34.91795366795367 %


This is a very simple way of doing cross-validation (this variant is known as repeated random sub-sampling, or Monte Carlo cross-validation). There are many well-known algorithms for cross-validation, like k-fold cross-validation, leave-one-out etc. These will be covered in detail in a later module. For more information about cross-validation, check <a href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)">Cross-validation (Wikipedia)</a>.
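
For comparison with the repeated random splits used in this lab, the sketch below shows how k-fold cross-validation partitions the sample indices: the data is shuffled once and cut into k folds, and each fold serves as the validation set exactly once. The helper name `kfold_indices` is hypothetical and this is only an index-level illustration, not the implementation used in later modules.

```python
import numpy as np


def kfold_indices(n_samples, n_folds, rng):
    """Yield (train_idx, val_idx) pairs; every sample is in exactly one validation fold."""
    perm = rng.permutation(n_samples)        # shuffle the indices once
    folds = np.array_split(perm, n_folds)    # cut into n_folds nearly equal parts
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, val_idx


rng_cv = np.random.default_rng(0)
all_folds = list(kfold_indices(10, 5, rng_cv))
fold_sizes = [(len(tr), len(va)) for tr, va in all_folds]
print(fold_sizes)  # [(8, 2), (8, 2), (8, 2), (8, 2), (8, 2)]
```

Unlike repeated random splits, the validation sets here never overlap, so every sample contributes to the averaged validation accuracy exactly once.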

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?

2. Does it give more accurate estimate of test accuracy?
Testing or visualizing four or more features depends on the type of data you're working with and the goals of your analysis. Here are some common methods:

### 1. *Scatter Plot Matrix (Pair Plot)*
   - *What it is:* A grid of scatter plots showing relationships between each pair of features.
   - *When to use:* When you want to visualize pairwise relationships among features.
   - *Tools:* Seaborn’s pairplot() or Pandas scatter_matrix().

### 2. *Heatmap*
   - *What it is:* A visual representation of data where individual values are represented by colors.
   - *When to use:* To show the correlation between features.
   - *Tools:* Seaborn’s heatmap().

### 3. *Parallel Coordinates Plot*
   - *What it is:* A plot that represents each feature as a vertical line, with each data point as a line connecting them.
   - *When to use:* To visualize patterns and trends across multiple features.
   - *Tools:* pandas.plotting.parallel_coordinates() or Plotly.

### 4. *Principal Component Analysis (PCA)*
   - *What it is:* A dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional space.
   - *When to use:* To visualize how the features relate in a reduced dimension (2D or 3D).
   - *Tools:* Scikit-learn’s PCA() or Seaborn's pairplot() after applying PCA.

### 5. *Radial Plot (Spider Plot)*
   - *What it is:* A plot that represents each feature as a spoke on a wheel, with data points plotted as a polygon.
   - *When to use:* To compare multiple features at once.
   - *Tools:* Matplotlib polar axes (Matplotlib has no built-in radar_chart() function) or Plotly.

### 6. *3D Scatter Plot*
   - *What it is:* A three-dimensional scatter plot where each axis represents one feature, and color or size can represent additional features.
   - *When to use:* When you have three features to visualize simultaneously, and the fourth can be represented by color, size, or shape.
   - *Tools:* Matplotlib, Plotly.

### 7. *t-SNE or UMAP*
   - *What it is:* Non-linear dimensionality reduction techniques that preserve local structures.
   - *When to use:* When you want to visualize clusters or patterns in high-dimensional data.
   - *Tools:* Scikit-learn, UMAP-learn.

### 8. *Feature Importance Plot*
   - *What it is:* A bar plot that shows the relative importance of each feature for a specific model.
   - *When to use:* To determine which features contribute most to the model's predictions.
   - *Tools:* The feature_importances_ attribute of scikit-learn's tree-based models (plotted as a bar chart) or XGBoost’s plot_importance().

### 9. *Contour Plots*
   - *What it is:* A plot that shows the joint distribution of two features with additional dimensions encoded in color or contour lines.
   - *When to use:* To visualize how the distribution of one feature changes with respect to another while considering additional features.
   - *Tools:* Matplotlib’s contour() or contourf().

### 10. *Box Plot*
   - *What it is:* A graphical representation of data distribution through quartiles.
   - *When to use:* To compare distributions of multiple features.
   - *Tools:* Seaborn’s boxplot().

### Practical Example:
If you have four features (say, A, B, C, and D), you could:
1. Use a scatter plot matrix to see pairwise relationships.
2. Apply PCA to reduce dimensionality and visualize in 2D.
3. Use a heatmap to examine correlations.



3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
Creating features, also known as "feature engineering," is a critical part of data analysis and machine learning. Here are a few examples of how you might create new features based on existing data in different contexts:

### 1. *E-Commerce:*
   - *Existing Data:* Customer purchase history, product prices, and timestamps.
   - *New Features:*
     - *Average Order Value (AOV):* Average amount spent per order by each customer.
     - *Days Since Last Purchase:* Time since the customer last made a purchase.
     - *Purchase Frequency:* Number of purchases per unit of time (e.g., per month).
     - *Discount Rate:* Difference between original price and purchase price, indicating sensitivity to discounts.

### 2. *Real Estate:*
   - *Existing Data:* Property size, location, number of rooms, and age of the property.
   - *New Features:*
     - *Price Per Square Foot:* Total price divided by the area of the property.
     - *Room-to-Space Ratio:* Number of rooms divided by the square footage.
     - *Age of Property in Years:* Current year minus the year the property was built.
     - *Proximity to Amenities:* Distance to nearest school, park, or shopping center.

### 3. *Health & Fitness:*
   - *Existing Data:* Daily step count, calorie intake, hours of sleep, and workout type.
   - *New Features:*
     - *Active vs. Sedentary Ratio:* Ratio of active minutes to sedentary minutes per day.
     - *Calories Burned per Step:* Calculated using step count and personal metrics like weight.
     - *Sleep Efficiency:* Ratio of actual sleep time to time spent in bed.
     - *Workout Intensity Score:* Derived from the type of workout, duration, and heart rate.

### 4. *Financial Analysis:*
   - *Existing Data:* Stock prices, trading volume, and economic indicators.
   - *New Features:*
     - *Moving Average:* Average stock price over a specific period (e.g., 50-day moving average).
     - *Price Momentum:* Rate of change in stock price over time.
     - *Volatility Index:* Measure of the stock's price fluctuations over a period.
     - *Price-to-Earnings Growth (PEG) Ratio:* Price-to-earnings ratio divided by the growth rate of earnings.

### 5. *Social Media Analysis:*
   - *Existing Data:* Number of likes, comments, shares, and post frequency.
   - *New Features:*
     - *Engagement Rate:* Total engagement (likes, comments, shares) divided by the number of followers.
     - *Post Impact Score:* Weighted score based on engagement and the reach of a post.
     - *Time Between Posts:* Average time interval between posts.
     - *Sentiment Score:* Derived from natural language processing (NLP) analysis of comments.

### 6. *Retail:*
   - *Existing Data:* Product sales data, customer demographics, and store locations.
   - *New Features:*
     - *Sales Growth Rate:* Rate at which sales are increasing or decreasing over time.
     - *Customer Lifetime Value (CLV):* Predicted revenue a customer will generate over their relationship with the business.
     - *Seasonal Index:* Sales performance adjusted for seasonal patterns.
     - *Customer Segmentation:* Grouping customers based on purchasing behavior, demographics, or engagement.

### 7. *Marketing:*
   - *Existing Data:* Campaign budget, conversion rate, and ad impressions.
   - *New Features:*
     - *Return on Ad Spend (ROAS):* Revenue generated from ad campaigns divided by the ad spend.
     - *Cost per Acquisition (CPA):* Total cost divided by the number of new customers acquired.
     - *Click-Through Rate (CTR):* Number of clicks divided by the number of impressions.
     - *Customer Journey Duration:* Time taken from the first interaction to purchase.

Creating new features involves understanding the domain, identifying relationships or patterns in the data, and then crafting features that can help in making predictions or understanding the data better.

4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?
Yes, the features mentioned above can work for different classes beyond binary classes like 0 and 1. These features are generally applicable to multi-class classification, regression tasks, or other forms of analysis. Here’s how:

### 1. *Multi-Class Classification:*
   - *Example:* Predicting the category of a product (e.g., electronics, clothing, food).
   - *Feature Application:*
     - *Engagement Rate:* Useful across multiple product categories to see which types of products engage customers more.
     - *Price Per Square Foot (Real Estate):* Can differentiate between luxury, mid-range, and budget properties.
     - *Workout Intensity Score (Health & Fitness):* Can distinguish between different levels of workout intensity, which may correlate with various fitness goals or outcomes.

### 2. *Regression Tasks:*
   - *Example:* Predicting continuous outcomes, like house prices or stock returns.
   - *Feature Application:*
     - *Moving Average (Financial):* A good predictor of stock price trends, which is not limited to binary outcomes.
     - *Customer Lifetime Value (Retail):* A continuous feature predicting revenue generated by a customer.
     - *Sales Growth Rate (Retail):* Useful in predicting future sales, which is a continuous outcome.

### 3. *Multi-Class or Ordinal Classification:*
   - *Example:* Predicting academic grades (A, B, C, etc.) or customer satisfaction levels (Very Satisfied, Satisfied, Neutral, Unsatisfied, Very Unsatisfied).
   - *Feature Application:*
     - *Sentiment Score (Social Media):* Can distinguish between various levels of customer satisfaction.
     - *Proximity to Amenities (Real Estate):* Can affect property values across different classes (e.g., residential, commercial, industrial).
     - *Cost per Acquisition (Marketing):* Can be used to analyze the effectiveness of different marketing channels, each being a different class.

### 4. *Unsupervised Learning:*
   - *Example:* Clustering customers into different segments.
   - *Feature Application:*
     - *Customer Segmentation (Retail):* Can be used to group customers without predefined classes, and the resulting clusters can be more than two.
     - *Purchase Frequency (E-Commerce):* Useful for clustering customers based on how often they shop.
     - *Time Between Posts (Social Media):* Can help in clustering users by their activity levels.

### 5. *Anomaly Detection:*
   - *Example:* Detecting fraudulent transactions in a multi-class setting.
   - *Feature Application:*
     - *Volatility Index (Financial):* Can indicate unusual fluctuations in stock prices, which might be labeled differently in a multi-class context (e.g., minor fluctuation, significant fluctuation, etc.).
     - *Engagement Rate (Social Media):* Could flag unusual user activity that deviates from typical engagement patterns.

### *Why These Features Work Beyond 0 and 1:*
- *Scalability:* The features are continuous or categorical, making them adaptable to more than just binary classification.
- *Generalizability:* These features capture inherent properties of the data (e.g., trends, ratios, frequencies) that apply to various classes, not just binary ones.
- *Flexibility:* They can be used with different types of machine learning models (e.g., decision trees, neural networks, clustering algorithms) that support multi-class and continuous output.

### Adapting Features for Multi-Class Use:
- *Normalization/Scaling:* Ensure features are scaled appropriately, especially when dealing with multiple classes with different ranges.
- *Feature Selection:* Some features might be more relevant for certain classes, so using techniques like feature importance ranking can help refine them for multi-class problems.
- *Transformation:* Sometimes, you may need to create additional features or transform existing ones to capture relationships unique to multi-class settings.

In summary, the features you engineer should generally be applicable to different types of classes, not just binary ones. They should be tested and validated to ensure they provide meaningful insights or predictive power in the specific context you’re working with.

> Exercise: How does the accuracy of the 3 nearest neighbour classifier change with the number of splits? How is it affected by the split size? Compare the results with the 1 nearest neighbour classifier.