# .NET 💖 ML

- The .NET runtime is mature
- .NET has a decent interop story
- C# is cool
- Types and Numerics
- ML.NET
- Microsoft & Azure
- https://rubikscode.net/2021/10/25/using-huggingface-transformers-with-ml-net

## 🚀 ML.NET

- https://dot.net/ml
- https://github.com/dotnet/machinelearning
    - https://github.com/dotnet/machinelearning-samples
- https://www.nuget.org/packages/Microsoft.ML

.NET Break Sessions

In [None]:
#r "nuget: Microsoft.ML"

#### Overview

- ML.NET is a library to __build__, __train__, and __deploy__ machine learning models
- It doesn't come with pretrained models, but we can use pretrained models through its ONNX support

__Data__  ML.NET uses IDataView as the primary data structure for datasets
- We can load data from various sources (e.g., CSV, databases, in-memory collections) using mlContext.Data.LoadFrom... methods

__Transforms__ Data transformations are used to preprocess and prepare data for training
- Feature engineering: Normalization, encoding categorical variables, text featurization, etc
- Data cleaning: Handling missing values, filtering rows, etc
- Column operations: Concatenating, renaming, or dropping columns

__Model Training__ ML.NET supports various machine learning tasks:
- Classification: Binary and multiclass
- Regression: Predicting continuous values
- Clustering: Grouping similar data points
- Anomaly Detection: Identifying outliers
- Recommendation: Building recommendation systems

Once trained, a model can be evaluated, saved, loaded, and used to make predictions
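The train, evaluate, save, load, and predict flow can be sketched roughly as follows. This is a minimal sketch written as a notebook cell, not a recommended setup: the `Point` and `PointPrediction` classes, the synthetic `y = 2x + 1` data, the choice of the `Sdca` trainer, and the `model.zip` file name are all assumptions made for illustration.

```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical row types used only for this sketch
class Point
{
    public float X { get; set; }
    public float Y { get; set; } // label
}
class PointPrediction
{
    [ColumnName("Score")] public float Y { get; set; }
}

var ml = new MLContext(seed: 0);

// Synthetic data following y = 2x + 1
var points = Enumerable.Range(0, 50).Select(i => new Point { X = i, Y = 2 * i + 1 });
IDataView allData = ml.Data.LoadFromEnumerable(points);
var split = ml.Data.TrainTestSplit(allData, testFraction: 0.2);

var trainingPipeline = ml.Transforms.Concatenate("Features", nameof(Point.X))
    .Append(ml.Regression.Trainers.Sdca(labelColumnName: nameof(Point.Y)));

// Train on the training split, evaluate on the held-out split
var trainedModel = trainingPipeline.Fit(split.TrainSet);
var metrics = ml.Regression.Evaluate(trainedModel.Transform(split.TestSet), labelColumnName: nameof(Point.Y));
Console.WriteLine($"R^2: {metrics.RSquared:F3}, RMSE: {metrics.RootMeanSquaredError:F3}");

// Save to disk, load it back, and predict with the loaded model
ml.Model.Save(trainedModel, allData.Schema, "model.zip");
ITransformer loadedModel = ml.Model.Load("model.zip", out DataViewSchema inputSchema);

var engine = ml.Model.CreatePredictionEngine<Point, PointPrediction>(loadedModel);
Console.WriteLine($"Prediction for X = 10: {engine.Predict(new Point { X = 10 }).Y}");
```

The metric values depend on the trainer and the data, so treat the printed numbers as something to inspect rather than a fixed result.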

__Advanced Features__
- Model Explainability: Use Explainability methods to understand feature importance
- Cross-Validation: Evaluate model performance using cross-validation
- Time Series: Use specialized libraries for time-series forecasting
- ONNX Support: Export/import models in the ONNX format for interoperability
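As an illustration of the cross-validation item, here is a minimal sketch written as a notebook cell. The `Sample` class, the synthetic `y = 3x + 2` data, the min-max normalization step, and the `Sdca` trainer are assumptions chosen to keep the example short.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

// Hypothetical row type used only for this sketch
class Sample
{
    public float X { get; set; }
    public float Y { get; set; } // label
}

var cvContext = new MLContext(seed: 0);
var cvData = cvContext.Data.LoadFromEnumerable(
    Enumerable.Range(0, 100).Select(i => new Sample { X = i, Y = 3 * i + 2 }));

var cvPipeline = cvContext.Transforms.Concatenate("Features", nameof(Sample.X))
    .Append(cvContext.Transforms.NormalizeMinMax("Features"))
    .Append(cvContext.Regression.Trainers.Sdca(labelColumnName: nameof(Sample.Y)));

// Train and evaluate on 5 different train/validation partitions of the data
var folds = cvContext.Regression.CrossValidate(
    cvData, cvPipeline, numberOfFolds: 5, labelColumnName: nameof(Sample.Y));

foreach (var fold in folds)
    Console.WriteLine($"Fold {fold.Fold}: R^2 = {fold.Metrics.RSquared:F3}");
```

Each fold trains its own model, so the spread of the per-fold metrics gives a rough sense of how stable the pipeline is on this data.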

- https://devblogs.microsoft.com/dotnet/announcing-ml-net-2-0
- https://devblogs.microsoft.com/dotnet/announcing-ml-net-3-0
- https://github.com/dotnet/machinelearning/blob/main/docs/release-notes/4.0/release-4.0.md

#### Sparse Vectorization

The __"Bag of Words" (BoW)__ model is a way to represent text data in a numerical format that machine learning algorithms can understand. It's called a "bag" because it ignores the order of words and only focuses on the presence and frequency of words

- Tokenization: Split the text into individual words (or tokens)
- Counting Words: How many times each word appears in the text. This is where the "Bag of Words" comes into play. It creates a list of all unique words in your dataset and counts how often each word appears in each document (or text entry)
- Vector Representation: Converts these counts into a numerical vector. Each position in the vector corresponds to a specific word, and the value at that position represents how many times that word appeared in the text
- N-grams: Sequences of n words. For example, a 2-gram (bigram) would consider pairs of words like "I love", "love programming", etc. If you specify n-grams in ProduceWordBags, it will also count these sequences, not just individual words
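The steps above can be sketched by hand in plain C#, without ML.NET, to make the mechanics visible. The two sample texts and the whitespace-only tokenizer are assumptions for illustration; ML.NET's own tokenization and vocabulary ordering may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

string[] sampleTexts = { "I love programming", "I love love C#" };

// Tokenization: split each text on spaces
string[][] tokenized = sampleTexts
    .Select(t => t.Split(' ', StringSplitOptions.RemoveEmptyEntries))
    .ToArray();

// Vocabulary: every unique word across all texts
List<string> vocabulary = tokenized.SelectMany(tokens => tokens).Distinct().ToList();

// Vector representation: one count per vocabulary word, per text
int[][] counts = tokenized
    .Select(tokens => vocabulary.Select(word => tokens.Count(t => t == word)).ToArray())
    .ToArray();

Console.WriteLine($"Vocabulary: {string.Join(", ", vocabulary)}");
foreach (var vector in counts)
    Console.WriteLine(string.Join(", ", vector));

// N-grams: adjacent word pairs (bigrams) of the first text
List<string> bigrams = tokenized[0]
    .Zip(tokenized[0].Skip(1), (first, second) => $"{first} {second}")
    .ToList();
Console.WriteLine($"Bigrams: {string.Join(" | ", bigrams)}");
```

Because "love" occurs twice in the second text, its slot in that text's vector holds 2, which is the frequency information a Bag of Words keeps while discarding word order.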

In [None]:
using Microsoft.ML;
using Microsoft.ML.Data;
using System.Linq;

class TransformedData
{
    public float[] BagOfWords { get; set; }
}

var mlContext = new MLContext();

var data = new[]
{
    new { Text = "I love programming in C#" },
    new { Text = ".NET is the best runtime" },
    //new { Text = "I am liking machine learning" } // Uncomment this line to see the effect of adding more data 👈
};

// Loading data into IDataView
var dataView = mlContext.Data.LoadFromEnumerable(data);

// Pipeline for Bag of Words
var pipeline = mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "Text")
    .Append(mlContext.Transforms.Text.ProduceWordBags("BagOfWords", "Tokens")); // Map tokens to a vector of n-gram counts

// Fit and transform data
var transformer = pipeline.Fit(dataView);
var transformedData = transformer.Transform(dataView);
var embeddings = mlContext.Data.CreateEnumerable<TransformedData>(transformedData, reuseRowObject: false);

foreach (var embedding in embeddings)
    Console.WriteLine(string.Join(", ", embedding.BagOfWords));

- The length of the vector corresponds to the total number of unique words (or n-grams) in the vocabulary created from your dataset
- Each position in the vector represents a specific word or n-gram. For example:
    - Position 0 might correspond to "I"
    - Position 1 might correspond to "love"
    - Position 9 might correspond to ".NET"
    - And so on

A Bag of Words can record how many times each word appears rather than just whether it is present; in that case the vector above will contain values other than 0 and 1
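To see which position maps to which word, the slot names attached to the output column can be read back. The sketch below assumes the `GetSlotNames` extension from `Microsoft.ML.Data`, as used in the ML.NET samples; the `TextRow` class is hypothetical, and `ngramLength: 1` is chosen here so that only single words appear.

```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical row type used only for this sketch
class TextRow
{
    public string Text { get; set; }
}

var slotContext = new MLContext();
var rows = new[]
{
    new TextRow { Text = "I love programming in C#" },
    new TextRow { Text = ".NET is the best runtime" }
};
var rowsView = slotContext.Data.LoadFromEnumerable(rows);

// ngramLength: 1 keeps single words only (an assumption made for readability)
var bagView = slotContext.Transforms.Text
    .ProduceWordBags("BagOfWords", "Text", ngramLength: 1)
    .Fit(rowsView)
    .Transform(rowsView);

// Slot names are column annotations: one name per position of the vector
VBuffer<ReadOnlyMemory<char>> slotNames = default;
bagView.Schema["BagOfWords"].GetSlotNames(ref slotNames);

string[] names = slotNames.DenseValues().Select(n => n.ToString()).ToArray();
for (int i = 0; i < names.Length; i++)
    Console.WriteLine($"Position {i}: {names[i]}");
```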

#### Vector Search

<img src=images/vector-similarity.png>

- https://chatgpt.com/share/67332e70-3228-800b-828c-935003396ed4 Dot Product, Cosine and Euclidean Distances
- https://qdrant.tech/blog/what-is-vector-similarity
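For reference, the three measures named above can be written for two vectors $a$ and $b$ of length $n$ as follows. These are the standard textbook definitions, stated here as a quick reminder rather than taken from the linked pages.

```latex
% Dot product: larger when the vectors point the same way and are long
a \cdot b = \sum_{i=1}^{n} a_i b_i

% Cosine similarity: the dot product normalized by both lengths, in [-1, 1]
\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert},
\qquad \lVert a \rVert = \sqrt{\sum_{i=1}^{n} a_i^2}

% Euclidean distance: straight-line distance, 0 for identical vectors
d(a, b) = \lVert a - b \rVert = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}
```

Cosine similarity ignores vector length, which suits count vectors like Bag of Words, where a longer text would otherwise look different only because it has more words.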

In [None]:
var query = "I love programming in Java";
var queryData = new[] { new { Text = query } };
var queryDataView = mlContext.Data.LoadFromEnumerable(queryData);
var queryTransformedData = transformer.Transform(queryDataView);
var queryEmbedding = mlContext.Data.CreateEnumerable<TransformedData>(queryTransformedData, reuseRowObject: false);

foreach (var embedding in queryEmbedding)
    Console.WriteLine(string.Join(", ", embedding.BagOfWords));

In [None]:
float CosineSimilarity(float[] vectorA, float[] vectorB)
{
    float dotProduct = 0.0f;
    float magnitudeA = 0.0f;
    float magnitudeB = 0.0f;

    for (int i = 0; i < vectorA.Length; i++) // We can use Vectors / Tensors for better performance
    {
        dotProduct += vectorA[i] * vectorB[i];
        magnitudeA += vectorA[i] * vectorA[i];
        magnitudeB += vectorB[i] * vectorB[i];
    }

    magnitudeA = (float)Math.Sqrt(magnitudeA);
    magnitudeB = (float)Math.Sqrt(magnitudeB);

    if (magnitudeA == 0 || magnitudeB == 0)
        return 0;

    return dotProduct / (magnitudeA * magnitudeB);
}

var closestEmbedding = data.Zip(embeddings, (d, e) => new { d, e })
    .Select(x => new
    {
        Text = x.d.Text,
        Similarity = CosineSimilarity(queryEmbedding.First().BagOfWords, x.e.BagOfWords)
    })
    .OrderByDescending(x => x.Similarity);

foreach(var embedding in closestEmbedding)
    Console.WriteLine($"{embedding.Text}\tSimilarity:\t{embedding.Similarity}");
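The comment in `CosineSimilarity` mentions using vectors for better performance. One possible way to do that, sketched here with `System.Numerics.Vector<float>` from the base class library, is shown below. The helper names `DotSimd` and `CosineSimd` and the sample arrays are hypothetical, and any actual speed-up depends on the hardware and vector length.

```csharp
using System;

// Dot product that processes Vector<float>.Count elements per step.
// Types are fully qualified to avoid clashing with other libraries' Vector types.
float DotSimd(float[] a, float[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same length");

    int width = System.Numerics.Vector<float>.Count;
    float sum = 0f;
    int i = 0;

    // Full-width chunks
    for (; i <= a.Length - width; i += width)
        sum += System.Numerics.Vector.Dot(
            new System.Numerics.Vector<float>(a, i),
            new System.Numerics.Vector<float>(b, i));

    // Remaining tail elements
    for (; i < a.Length; i++)
        sum += a[i] * b[i];

    return sum;
}

float CosineSimd(float[] a, float[] b)
{
    float denominator = MathF.Sqrt(DotSimd(a, a)) * MathF.Sqrt(DotSimd(b, b));
    return denominator == 0f ? 0f : DotSimd(a, b) / denominator;
}

float[] x = { 1, 0, 1, 1, 0, 0, 1, 0, 1 };
float[] y = { 1, 0, 1, 0, 0, 0, 1, 0, 0 };
Console.WriteLine(CosineSimd(x, y)); // 3 / sqrt(15), roughly 0.7746
```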

#### Dense Vectorization

- __Features__ refer to the individual measurable properties or characteristics of the data that are used as input to a machine learning model. Features are the variables (or columns) in our dataset that the model uses to make predictions or decisions. They represent the information that the model will learn from to make accurate predictions
- __One Hot Encoding__ can be used to convert the categorical feature into a numerical representation. This is a common technique to handle categorical data in machine learning. The result is a dense vector
- __Concatenation of Features__ combines all the features (including the one-hot encoded ones) into a single feature vector; the Concatenate transform creates a dense vector where all the features are combined into one array
- __Feature Engineering__ is the process of selecting, transforming, and creating features (e.g., one-hot encoding, normalization, creating new features). Good feature engineering can significantly improve model performance.
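To make one-hot encoding and concatenation concrete, here is a hand-rolled sketch in plain C#. The fixed category order and the sample values are assumptions for illustration; ML.NET's `OneHotEncoding` determines its own category order from the training data.

```csharp
using System;
using System.Linq;

// Known categories, in an order assumed for this sketch
string[] categories = { "Urban", "Suburban", "Rural" };

// One-hot encoding: 1 in the slot of the matching category, 0 elsewhere
float[] OneHot(string value) =>
    categories.Select(c => c == value ? 1f : 0f).ToArray();

// Concatenation: numeric features and the encoded category in one array
float[] features = new[] { 3f, 1500f }   // NumberOfRooms, SquareFootage
    .Concat(OneHot("Suburban"))          // Location
    .Concat(new[] { 5f })                // AgeOfHouse
    .ToArray();

Console.WriteLine(string.Join(", ", features)); // 3, 1500, 0, 1, 0, 5
```

The resulting array has one slot per numeric feature plus one slot per category, which is the shape a trainer expects in its `Features` column.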

__Regression using ML.NET__

- https://learn.microsoft.com/en-us/dotnet/api/microsoft.ml.trainers
    - https://learn.microsoft.com/en-us/dotnet/api/microsoft.ml.trainers.lbfgspoissonregressiontrainer
        - https://en.wikipedia.org/wiki/Poisson_regression
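As a brief reminder of what Poisson regression assumes (following the standard formulation described on the linked Wikipedia page), the label is modeled as a count whose expected value is log-linear in the features. Here $x$ is the feature vector, and $w$ and $b$ are the learned weights and bias.

```latex
% The label is assumed Poisson-distributed with a feature-dependent mean
y \mid x \sim \mathrm{Poisson}\big(\lambda(x)\big),
\qquad \log \lambda(x) = w^{\top} x + b

% The prediction is the conditional mean, which is always positive
\hat{y} = \mathbb{E}[\, y \mid x \,] = e^{\, w^{\top} x + b}

% Training minimizes the negative log-likelihood (up to a constant in w, b)
\mathcal{L}(w, b) = \sum_{i} \Big( \lambda(x_i) - y_i \log \lambda(x_i) \Big)
```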

<img src=images/poisson-regression.png>

In [None]:
using Microsoft.ML;
using Microsoft.ML.Data;
using System.Collections.Generic;

class HousingData
{
    [LoadColumn(0)] public float NumberOfRooms { get; set; }
    [LoadColumn(1)] public float SquareFootage { get; set; }
    [LoadColumn(2)] public string Location { get; set; }
    [LoadColumn(3)] public float AgeOfHouse { get; set; }
    [LoadColumn(4)] public float Price { get; set; }
}
class HousingPrediction
{
    [ColumnName("Score")]
    public float Price { get; set; }
}

var housingData = new List<HousingData>
{
    new HousingData { NumberOfRooms = 3, SquareFootage = 1500, Location = "Urban", AgeOfHouse = 10, Price = 300000 },
    new HousingData { NumberOfRooms = 4, SquareFootage = 2000, Location = "Suburban", AgeOfHouse = 5, Price = 400000 },
    new HousingData { NumberOfRooms = 2, SquareFootage = 1000, Location = "Rural", AgeOfHouse = 20, Price = 200000 }
};

var mlContext = new MLContext();
var dataView = mlContext.Data.LoadFromEnumerable(housingData);
var pipeline = mlContext.Transforms
    .Categorical.OneHotEncoding("LocationEncoded", "Location") // Encode categorical feature
    .Append(mlContext.Transforms.Concatenate("Features",
        nameof(HousingData.NumberOfRooms), nameof(HousingData.SquareFootage),
        "LocationEncoded",
        nameof(HousingData.AgeOfHouse))) // Combine features into a dense vector
    .Append(mlContext.Regression.Trainers.LbfgsPoissonRegression(labelColumnName: nameof(HousingData.Price))); // Train a regression model

var model = pipeline.Fit(dataView);
var predictionEngine = mlContext.Model.CreatePredictionEngine<HousingData, HousingPrediction>(model);

var sampleHouse = new HousingData
{
    NumberOfRooms = 3,
    SquareFootage = 1600,
    Location = "Urban",
    AgeOfHouse = 8
};
var prediction = predictionEngine.Predict(sampleHouse);

Console.WriteLine($"Predicted Price: {prediction.Price}");

## 🧰 Accord.Net 🗃️

- https://accord-framework.net
- https://github.com/accord-net/framework Archived in 2020
- https://www.nuget.org/packages/Accord

__TF-IDF__ (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (the corpus). Use cases:
- Search Engines: To rank documents based on their relevance to a query
- Text Classification: To identify important features (words) for machine learning models
- Information Retrieval: To find the most relevant documents for a given search term
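As one concrete variant, the `Tfidf` class in this section uses the raw term count as the term frequency and a smoothed logarithmic inverse document frequency. With $N$ documents and $\mathrm{df}(t)$ the number of documents containing term $t$:

```latex
% Term frequency: how many times term t occurs in document d
\mathrm{tf}(t, d) = \text{count of } t \text{ in } d

% Smoothed inverse document frequency (natural logarithm)
\mathrm{idf}(t) = \ln \frac{N + 1}{\mathrm{df}(t) + 1}

% TF-IDF weight of term t in document d
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)

% Worked example with the four sample documents in this section (N = 4):
% "in" occurs in 3 documents, "love" in 1
\mathrm{idf}(\text{in}) = \ln \tfrac{5}{4} \approx 0.223,
\qquad
\mathrm{idf}(\text{love}) = \ln \tfrac{5}{2} \approx 0.916
```

A word that appears in most documents gets a weight near zero, while a rarer word is weighted more heavily. Other TF-IDF variants use different smoothing or normalization.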

In [None]:
#r "nuget: Accord.Math"

In [None]:
using Accord.Math;

class Tfidf
{
    string[] vocabulary;
    double[] idf;

    string[] documents;
    public int VocabularyLength => vocabulary.Length;

    public Tfidf(string[] documents)
    {
        var tokenizedDocs = documents.Select(d => d.Split(' ')).ToArray();
        vocabulary = tokenizedDocs.SelectMany(x => x).Distinct().ToArray();
        idf = new double[vocabulary.Length];

        for (int i = 0; i < vocabulary.Length; i++)
        {
            int docCount = tokenizedDocs.Count(d => d.Contains(vocabulary[i]));
            //idf[i] = Math.Log((double)documents.Length / (docCount + 1)); // to avoid dividing by zero
            idf[i] = Math.Log((double)(documents.Length + 1) / (docCount + 1)); // Laplace smoothing
        }

        this.documents = documents;
    }

    public Sparse<double>[] Transform()
    {
        var sparseVectors = new Sparse<double>[this.documents.Length];

        for (int i = 0; i < this.documents.Length; i++)
        {
            var tokens = this.documents[i].Split(' ');
            var indices = new List<int>();
            var values = new List<double>();

            for (int j = 0; j < vocabulary.Length; j++)
            {
                int count = tokens.Count(t => t == vocabulary[j]);
                if (count > 0)
                {
                    indices.Add(j);
                    values.Add(count * idf[j]);
                }
            }

            sparseVectors[i] = new Sparse<double>(indices.ToArray(), values.ToArray());
        }

        return sparseVectors;
    }
}

In [None]:
string[] documents = {
    "I love programming in C#",
    "C# is a great language",
    "I hate bugs in my code",
    "Debugging is essential in programming"
};

var tfidf = new Tfidf(documents);
var vectors = tfidf.Transform();

for (int i = 0; i < vectors.Length; i++)
    Console.WriteLine($"Document {i + 1}:\n\t{vectors[i]}");

In [None]:
using Accord.Math.Distances;

var cosine = new Cosine();
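// Pad both vectors to the vocabulary length so that their positions align
double similarity = cosine.Similarity(vectors[0].ToDense(tfidf.VocabularyLength), vectors[1].ToDense(tfidf.VocabularyLength));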
Console.WriteLine($"Cosine Similarity: {similarity}");

In [None]:
#r "nuget: Accord.MachineLearning"

In [None]:
using Accord.MachineLearning;

double[][] denseVectors = vectors
    .Select(v => v.ToDense(tfidf.VocabularyLength)) // Convert sparse to dense
    .ToArray();

denseVectors

In [None]:
// "I love programming in C#",
// "C# is a great language",
// "I hate bugs in my code",
// "Debugging is essential in programming"

var kmeans = new KMeans(k: 2); // 2 clusters
var clusters = kmeans.Learn(denseVectors);
int[] labels = clusters.Decide(denseVectors);
labels

We can build on this using Accord.NET (and other machine learning libraries):

- Dimensionality Reduction
    - Principal Component Analysis (PCA)
- Classification
    - Support Vector Machines
- Topic Modeling
    - Latent Dirichlet Allocation, to discover the topics present in documents
- Anomaly Detection
    - One-class SVM (Support Vector Machine)
- Search & Retrieval
- Recommendation System
    - Content-Based Filtering (TF-IDF)
- Text Summarization
- Feature Selection
    - TF-IDF

## 🧰 Math.NET 🗃️

- https://mathdotnet.com
    - https://numerics.mathdotnet.com
    - https://github.com/mathnet/mathnet-numerics
    - https://www.nuget.org/packages/MathNet.Numerics

- We can bring in more specialized numerical algorithms with this library
- Math.NET offers more optimized, high-performance sparse linear algebra for very large datasets

__Restart Kernel__

Each of these libraries ships its own Vector implementation, which is the problem with mixing and matching such libraries and their namespaces in one session
- How System.Numerics is solving this issue for future libraries

In [None]:
#r "nuget: MathNet.Numerics"

In [None]:
using MathNet.Numerics.LinearAlgebra;

double CalculateCosineSimilarity(Vector<double> vector1, Vector<double> vector2)
{
    double dotProduct = vector1.DotProduct(vector2);

    double magnitude1 = vector1.L2Norm();
    double magnitude2 = vector2.L2Norm();

    double cosineSimilarity = dotProduct / (magnitude1 * magnitude2);
    return cosineSimilarity;
}

var user1 = Vector<double>.Build.Dense(new double[] { 5, 4, 3, 0, 2 }); // User 1's ratings
var user2 = Vector<double>.Build.Dense(new double[] { 1, 5, 4, 0, 3 }); // User 2's ratings
double cosineSimilarity = CalculateCosineSimilarity(user1, user2);

if (cosineSimilarity > 0.8)
    Console.WriteLine("The users have very similar tastes!");
else if (cosineSimilarity > 0.5)
    Console.WriteLine("The users have somewhat similar tastes.");
else
    Console.WriteLine("The users have different tastes.");