<table align="left">
  <td>
    <a target="_blank" href="https://github.com/polyhedron-gdl/ml-for-finance-intro/blob/main/2025/01-notebooks/nb-lesson-1-1.ipynb">
        <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>


# Introduction to Machine Learning 

**with Applications in Banking and Finance**


In this notebook, you will learn about the main concepts and different types of machine learning. Together with a basic introduction to the relevant terminology, we will lay the groundwork for successfully using machine learning techniques
for practical problem solving.

We will cover the following topics:

- The general concepts of machine learning
- The three types of learning and basic terminology
- The building blocks for successfully designing machine learning systems
- Installing and setting up Python for data analysis and machine learning

## What is Machine Learning

Machine Learning (ML) is a subset of artificial intelligence (AI) that enables systems to learn and improve from experience without explicit programming. Over the past few decades, ML has evolved from a niche academic discipline into a critical technology that powers a wide range of applications, from autonomous vehicles to personalized recommendations on streaming platforms. Understanding ML is essential for anyone aiming to thrive in today's technology-driven world, as it is at the heart of innovations in numerous fields, including finance, healthcare, robotics, and natural language processing.

At its core, machine learning is a method of teaching computers to make decisions or predictions based on data. Unlike traditional programming, where explicit rules and instructions are coded to solve a problem, ML models identify patterns and relationships within data, using these insights to perform tasks such as classification, prediction, and clustering. For instance, instead of manually specifying rules to identify whether an email is spam or not, a machine learning model can analyze a large dataset of emails labeled as spam or non-spam, learning the distinguishing features on its own and applying this knowledge to new, unseen messages.

The origins of machine learning can be traced back to the mid-20th century, with pivotal contributions from fields such as mathematics, statistics, and computer science. Alan Turing’s seminal work in the 1950s laid the foundation for the idea that machines could learn and make decisions, while Arthur Samuel’s program for playing checkers marked one of the earliest practical applications of machine learning. Over time, advancements in computational power and the availability of large datasets accelerated the development of ML, leading to the creation of algorithms capable of solving increasingly complex problems.

Machine learning operates through the iterative process of training, validation, and testing. A model is trained on a set of data, which allows it to learn the relationships and patterns inherent in the data. The trained model is then validated and tested on separate datasets to assess its performance and generalization ability. The quality of the model depends heavily on the data used for training. This is why the phrase "garbage in, garbage out" is often used to emphasize the importance of high-quality, representative data in machine learning.
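The split into training, validation, and test data can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `train_val_test_split`, the 60/20/20 proportions, and the toy data are assumptions made for this example, not part of the course material.

```python
import random

def train_val_test_split(samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the samples and cut them into three disjoint subsets."""
    shuffled = list(samples)
    # A fixed seed makes the shuffle reproducible
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Toy dataset of (feature, label) pairs
data = [(x, 3 * x + 1) for x in range(10)]
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))  # 6 2 2
```

The model is fitted on `train` only; `val` is used to compare model choices, and `test` is kept aside for the final estimate of how well the model generalizes.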

The field of machine learning encompasses several subfields, each focusing on different types of tasks and learning paradigms.

- In supervised learning, the model is provided with labeled data and learns to map inputs to outputs, such as predicting house prices based on features like size, location, and age. 

- In contrast, unsupervised learning deals with unlabeled data, where the goal is to identify hidden patterns or structures, such as grouping customers with similar purchasing behaviors. 

- Reinforcement learning represents another major subfield, where an agent learns to make sequential decisions by interacting with an environment, receiving feedback in the form of rewards or penalties. This type of learning has been instrumental in achieving breakthroughs in areas like game-playing AI, as demonstrated by systems like AlphaGo.

Let's look in more detail at the differences between these three types of learning and at their characteristics.

## The three different types of machine learning

### Supervised Learning

The main goal in supervised learning is to learn a model from labeled training data that allows us to make predictions about unseen or future data. Here, the term "supervised" refers to a set of **training** examples (data inputs) where the
desired output signals (**labels**) are already known. The following figure summarizes a typical supervised learning workflow, where the labeled training data is passed to a machine learning algorithm for fitting a predictive model that can make
predictions on new, unlabeled data inputs:

![chapter-0-0_pic_0.png](./pic/chapter-0-0_pic_0.png)

A supervised learning task with discrete class labels, such as in the previous example, is also called a **classification
task**. 
A second type of supervised learning is the prediction of continuous outcomes, which is also called **regression analysis**. In
regression analysis, we are given a number of predictor (explanatory) variables and a continuous response variable (outcome), and we try to find a relationship between those variables that allows us to predict an outcome. Note that in the field of machine learning, the predictor variables are commonly called ***features***, and the response variables are usually referred to as ***target variables***.

![chapter-0-0_pic_1.png](./pic/chapter-0-0_pic_1.png)
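To make the classification case concrete, here is a minimal sketch of a supervised classifier. It uses a one-nearest-neighbour rule, chosen here only because it fits in a few lines; the loan data, the feature choice, and the function name `predict_nearest_neighbor` are made up for illustration.

```python
def predict_nearest_neighbor(train_features, train_labels, query):
    """Return the label of the training example closest to `query`."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    best_index = min(
        range(len(train_features)),
        key=lambda i: squared_distance(train_features[i], query),
    )
    return train_labels[best_index]

# Made-up labeled training data: (income in thousands, debt ratio) -> outcome
features = [(20, 0.9), (25, 0.8), (60, 0.2), (80, 0.1)]
labels = ["default", "default", "repaid", "repaid"]

# Predictions for two new, unlabeled inputs
print(predict_nearest_neighbor(features, labels, (70, 0.15)))  # repaid
print(predict_nearest_neighbor(features, labels, (22, 0.85)))  # default
```

The labels in the training data are what make this a supervised task: the prediction for a new input is derived from examples whose correct answer is already known.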

### Unsupervised Learning

In supervised learning, we know the right answer beforehand when we train a model. In **unsupervised learning**, however, we are dealing with ***unlabeled data*** or data of unknown structure. Using unsupervised learning techniques, we are able to explore the structure of our data to extract meaningful information without the guidance of a known outcome variable or reward function.

**Clustering** is an exploratory data analysis technique that allows us to organize a pile of information into meaningful subgroups (clusters) *without having any prior knowledge of their group memberships*. Each cluster that arises during the analysis defines a group of objects that share a certain degree of similarity but are more dissimilar to objects in other clusters, which is why clustering is also sometimes called unsupervised classification. Clustering is a great technique for structuring information and deriving meaningful relationships from data. For example, it allows marketers to discover customer groups based on their interests, in order to develop distinct marketing programs.

![chapter-0-0_pic_2.png](./pic/chapter-0-0_pic_2.png)
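As an illustration of clustering, the sketch below groups one-dimensional values with a bare-bones version of the k-means idea (assign each point to the nearest centre, then move each centre to the mean of its points). The spending figures, the starting centres, and the function name `kmeans_1d` are assumptions chosen for this example.

```python
def kmeans_1d(points, centers, iterations=10):
    """Cluster 1-D points around the given initial centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: abs(p - centers[k]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center)
        centers = [
            sum(c) / len(c) if c else centers[k]
            for k, c in enumerate(clusters)
        ]
    return centers, clusters

# Made-up monthly spending of six customers; no labels are provided
spend = [10, 12, 11, 95, 100, 105]
centers, clusters = kmeans_1d(spend, centers=[0, 50])
print(centers)   # [11.0, 100.0]
print(clusters)  # [[10, 12, 11], [95, 100, 105]]
```

No group membership was supplied: the two customer groups emerge from the similarity of the values alone.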

### Reinforcement Learning

Another type of machine learning is **reinforcement learning**. In reinforcement learning, the goal is to develop a system (***agent***) that improves its performance based on interactions with the environment. Since the information about the current state of the environment typically also includes a so-called **reward signal**, we can think of reinforcement learning as a field related to supervised learning. However, in reinforcement learning, this feedback is not the correct ground truth label or value, but a measure of how good the action was, as scored by a reward function. Through its interaction with the environment, an agent can then use reinforcement learning to learn a series of actions that maximizes this reward via an exploratory trial-and-error approach or deliberative planning. A popular example of reinforcement learning is a chess engine. Here, the agent decides upon a series of moves depending on the state of the board (the environment), and the reward can be defined as win or lose at the end of the game.
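The trial-and-error idea can be sketched with a much simpler problem than chess: an agent that repeatedly chooses between two actions and only observes the reward of the action it took. The epsilon-greedy rule used below is one common exploration strategy, picked here as an assumption for illustration; the reward values and the function name `run_bandit` are hypothetical.

```python
import random

def run_bandit(rewards, steps=500, epsilon=0.1, seed=0):
    """Learn the value of each action from reward feedback alone."""
    rng = random.Random(seed)
    estimates = [0.0] * len(rewards)   # current guess of each action's reward
    counts = [0] * len(rewards)        # how often each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action
            action = rng.randrange(len(rewards))
        else:
            # Exploit: take the action that currently looks best
            action = max(range(len(rewards)), key=lambda a: estimates[a])
        reward = rewards[action]       # feedback from the environment
        counts[action] += 1
        # Running average of the rewards observed for this action
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts

# Hypothetical environment: action 0 always pays 0.0, action 1 always pays 1.0
estimates, counts = run_bandit(rewards=[0.0, 1.0])
print(estimates, counts)
```

The agent is never told which action is correct. It discovers that the second action pays more only by occasionally exploring, and from then on chooses it most of the time.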

## A Practical Example

In [1]:
#
# STEP 1 - Import libraries
#
import tensorflow as tf
import numpy as np
from tensorflow import keras

In [2]:
#
# STEP 2 - Define a model
#
# A single Dense layer with one unit; the Input object declares the one-dimensional input
model = tf.keras.Sequential([keras.Input(shape=(1,)), keras.layers.Dense(units=1)])

In [3]:
#
# STEP 3 - Define a way to measure how good or bad the model's guesses are (loss) and a method to choose the next guess (optimizer)
#
# In this example, you want the program to find the relationship between two series of numbers
# (the relationship is Y=3X+1). When the computer is trying to learn that, it makes a guess, maybe
# Y=10X+10. The loss function compares the guessed answers with the known correct answers and measures
# how well or badly it did. Next, the model uses the optimizer function to make another guess. Based on
# the loss function's result, it tries to minimize the loss. At this point, maybe it will come up with
# something like Y=5X+5. While that's still pretty bad, it's closer to the correct result (the loss is lower).

model.compile(optimizer='sgd', loss='mean_squared_error')

In [None]:
#
# STEP 4 - Select data and labels
#
xs = np.array([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0], dtype=float)
ys = np.array([-2.0, 1.0, 4.0, 7.0, 10.0, 13.0], dtype=float)

In [None]:
#
# STEP 5 - Fit the model
#
model.fit(xs, ys, epochs=500)

In [None]:
#
# STEP 6 - Make a prediction
#
# Pass a NumPy array: recent Keras versions may reject a plain Python list here
print(model.predict(np.array([10.0])))

## Features and Labels

The data for supervised learning contains what are referred to as **features** and **labels**. The **labels** are the values of the target that is to be predicted. The **features** are the variables from which the predictions are to be made. For example, when predicting the price of a house, the **features** could be the square meters of living space, the number of bedrooms, the number of bathrooms, the size of the garage, and so on. The **label** would be the house price.

The data for unsupervised learning consists of features but no labels, because the model is used to identify patterns rather than to forecast something.


### The main difference between supervised and unsupervised learning: Labeled data

It is very important to stress that the main distinction between supervised and unsupervised methods is *the use of labeled datasets*. To put it simply, supervised learning uses labeled input and output data, while an unsupervised learning algorithm does not.

In supervised learning, the algorithm “learns” from the training dataset by iteratively making predictions on the data and adjusting for the correct answer. While supervised learning models tend to be more accurate than unsupervised learning models, they require upfront human intervention to label the data appropriately. For example, a supervised learning model can predict how long your commute will be based on the time of day, weather conditions and so on. But first, you’ll have to train it to know that rainy weather extends the driving time.

Unsupervised learning models, in contrast, work on their own to discover the inherent structure of unlabeled data. Note that they still require some human intervention for validating output variables. For example, an unsupervised learning model can identify that online shoppers often purchase groups of products at the same time. However, a data analyst would need to validate that it makes sense for a recommendation engine to group baby clothes with an order of diapers, applesauce and sippy cups.

## Types of Data

The success of AI applications hinges on the effective use of diverse data types, each serving unique roles in financial modeling, decision-making, and fraud detection. Among the most critical data types are numerical data, categorical data, text data, visual data, and audio data. Understanding their characteristics and applications is essential for leveraging AI's full potential in finance and banking.

- ***Numerical*** Data forms the backbone of AI applications in finance. Represented as measurable quantities, numerical data includes continuous variables such as account balances, stock prices, and interest rates, as well as discrete values like the number of transactions or loan approvals. It is widely used in tasks such as credit risk assessment, portfolio management, and algorithmic trading. Numerical data requires preprocessing techniques such as normalization or scaling to ensure compatibility with machine learning models, particularly those that rely on gradient-based optimization algorithms. Time series data, a subset of numerical data, is crucial for predicting trends in financial markets and assessing historical performance. The versatility and quantitative nature of numerical data make it indispensable in financial analytics.

- ***Categorical*** Data, representing distinct groups or classifications, is another cornerstone of AI in finance. Examples include transaction types, customer demographics, account statuses, and credit ratings. This type of data is pivotal in tasks such as customer segmentation, fraud detection, and regulatory compliance. For instance, identifying high-risk groups or patterns in categorical variables enables banks to tailor services and mitigate risks. However, categorical data often needs to be transformed into numerical representations using methods like one-hot encoding or embedding techniques for compatibility with machine learning models. The ability to extract meaningful insights from categorical data provides a qualitative dimension to financial analysis, complementing the quantitative insights derived from numerical data.

- ***Text*** Data introduces the complexity of unstructured information into AI-driven financial systems. Textual information such as emails, transaction descriptions, contracts, news articles, and social media posts is abundant in the financial domain. Text data is integral to applications like sentiment analysis, where AI systems assess market sentiment from news or social media, and compliance monitoring, where AI scans documents for irregularities or potential policy violations. Processing text data requires natural language processing (NLP) techniques such as tokenization, parsing, and word embeddings to convert unstructured text into structured forms suitable for analysis. These advancements allow financial institutions to extract actionable insights from large volumes of textual information, enhancing decision-making and risk management.

- ***Visual*** Data, encompassing images, charts, and other graphical representations, plays a specialized but critical role in finance and banking. The adoption of computer vision techniques has enabled the processing of visual data for tasks such as identity verification, where optical character recognition (OCR) systems analyze scanned documents or facial recognition systems authenticate users. Visual data also supports the analysis of financial charts and patterns, enhancing traders' ability to predict market movements. Additionally, AI systems use visual data in insurance to evaluate claims by assessing property damage through images or videos. By leveraging convolutional neural networks (CNNs) and similar technologies, financial institutions can derive insights from visual data that were previously inaccessible through manual processes.

- ***Audio*** Data represents an emerging frontier in AI applications for finance. Voice recordings from customer interactions, call center communications, and virtual assistants offer valuable information for enhancing customer service and fraud prevention. For example, voice biometrics is increasingly used to authenticate users based on their vocal patterns, adding an extra layer of security to financial transactions. Sentiment analysis of customer calls and real-time monitoring of audio data for specific keywords or suspicious tones also support improved customer experiences and compliance. Processing audio data typically involves advanced techniques such as feature extraction and speech-to-text conversion, enabling AI systems to integrate audio with other data types for comprehensive analysis.
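Two of the preprocessing steps mentioned above, scaling numerical data and one-hot encoding categorical data, can be sketched in a few lines. This is an illustration in plain Python; the account balances, the rating labels, and the function names `min_max_scale` and `one_hot_encode` are assumptions made for the example.

```python
def min_max_scale(values):
    """Rescale numbers linearly to the interval [0, 1].

    Assumes the values are not all identical (otherwise hi == lo).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot_encode(categories):
    """Turn each category into a 0/1 vector with one position per level."""
    levels = sorted(set(categories))
    encoded = [[1 if c == level else 0 for level in levels] for c in categories]
    return levels, encoded

# Made-up numerical feature: account balances
balances = [1000.0, 2500.0, 4000.0]
print(min_max_scale(balances))  # [0.0, 0.5, 1.0]

# Made-up categorical feature: credit ratings
ratings = ["AA", "BBB", "AA"]
levels, encoded = one_hot_encode(ratings)
print(levels)   # ['AA', 'BBB']
print(encoded)  # [[1, 0], [0, 1], [1, 0]]
```

After these transformations, both features are numeric and on comparable scales, which is the form most machine learning models expect.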


## Loss (Cost) Functions


### Linear Cost Function 

In Machine Learning a cost function or loss function is used to represent how far away a mathematical model is from the real data. One adjusts the mathematical model, usually by varying parameters within the model, so as to minimize the cost function. 

Let's take for example the simple case of a linear fitting. We want to find a relationship of the form 

\begin{equation}
y=\theta_0 +\theta_1x
\end{equation}

where the $\theta$s are the parameters that we want to find to give us the best fit to the data. We call this linear function $h_\theta(x)$ to emphasize the dependence on both the variable $x$ and the two parameters $\theta_0$ and $\theta_1$.


We want to measure how far away the data, the $y^{(n)}$s, are from the function $h_\theta(x)$. A common way to do this is via the quadratic *cost function*

\begin{equation}
J(\mathbf{\theta}) = \frac{1}{2N} \sum\limits_{n=1}^N \left[ h_\theta \left( x^{(n)} \right) - y^{(n)} \right]^2
\label{eq:ols}
\end{equation}

This is called *Ordinary Least Squares*.
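As a small worked example, the quadratic cost can be evaluated directly for the data of the practical example and for the guesses mentioned there ($Y=10X+10$, $Y=5X+5$, and the true relationship $Y=3X+1$). The function name `cost` is an assumption made for this sketch.

```python
def cost(theta0, theta1, xs, ys):
    """Quadratic cost J = 1/(2N) * sum of squared differences."""
    n = len(xs)
    squared_errors = ((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return sum(squared_errors) / (2 * n)

# Data from the practical example (generated by y = 3x + 1)
xs = [-1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
ys = [-2.0, 1.0, 4.0, 7.0, 10.0, 13.0]

for theta0, theta1 in [(10, 10), (5, 5), (1, 3)]:
    print(f"theta0={theta0}, theta1={theta1}: J={cost(theta0, theta1, xs, ys):.3f}")
```

The cost falls from about 261.6 for the first guess to about 30.3 for the second, and it is exactly zero for the true parameters, which is the behaviour a useful cost function should have.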

In this case, the minimum is easily found analytically: differentiate $\eqref{eq:ols}$ with respect to both $\theta$s and set the result to zero:

\begin{equation}
\begin{array}{lcl} 
\frac{\partial J}{\partial \theta_0} & = & \frac{1}{N} \sum\limits_{n=1}^N \left( \theta_0 + \theta_1 x^{(n)} - y^{(n)} \right) = 0 
\\ 
\frac{\partial J}{\partial \theta_1} & = & \frac{1}{N} \sum\limits_{n=1}^N x^{(n)} \left( \theta_0 + \theta_1 x^{(n)} - y^{(n)} \right) = 0 
\end{array}
\end{equation}

The solution is trivially obtained for both $\theta$s:

\begin{equation}
\begin{array}{lcl} 
\theta_0 = \frac{\left(\sum y \right) \left(\sum x^2 \right) -\left(\sum x \right) \left(\sum xy \right) }{N\left(\sum x^2 \right) - \left(\sum x \right)^2 } 
\\ 
\theta_1 = \frac{N\left(\sum xy \right) - \left(\sum y \right)\left(\sum x \right)}{N\left(\sum x^2 \right) - \left(\sum x \right)^2 }
\end{array}
\end{equation}
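The closed-form solution can be checked numerically on the data of the practical example. The sketch below uses the standard least-squares expressions, whose common denominator is $N\sum x^2 - \left(\sum x\right)^2$; the function name `fit_line` is an assumption made for this illustration.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = theta0 + theta1 * x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Common denominator: N * sum(x^2) - (sum(x))^2
    denominator = n * sum_xx - sum_x ** 2
    theta0 = (sum_y * sum_xx - sum_x * sum_xy) / denominator
    theta1 = (n * sum_xy - sum_y * sum_x) / denominator
    return theta0, theta1

# Data from the practical example (generated by y = 3x + 1)
xs = [-1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
ys = [-2.0, 1.0, 4.0, 7.0, 10.0, 13.0]

theta0, theta1 = fit_line(xs, ys)
print(theta0, theta1)  # 1.0 3.0
```

For these data the sums are $\sum x = 9$, $\sum y = 33$, $\sum x^2 = 31$ and $\sum xy = 102$, so the denominator is $6 \cdot 31 - 81 = 105$ and the formulas recover $\theta_0 = 1$ and $\theta_1 = 3$ exactly.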




## Optimizer

### Gradient Descent

>We can describe the main idea behind gradient descent as climbing down a hill until a local or global cost minimum is reached. In each iteration, we take a step in the opposite direction of the gradient, where the step size
is determined by the value of the **learning rate**, as well as the slope of the gradient.

The scheme works as follows: start with an initial guess for each parameter $\theta_k$. Then move $\theta_k$ a small step against the slope, that is, in the direction in which $J$ decreases:

\begin{equation}
\theta_k^{new} =\theta_k^{old}-\beta \frac{\partial J}{\partial \theta_k}
\end{equation}

**Update all $\theta_k$ simultaneously** and repeat until convergence. Here $\beta$ is the *learning rate* (also called the learning factor), which governs how far you move. If $\beta$ is too small, it will take a long time to converge; if it is too large, the update will overshoot and might not converge at all. 

The loss function $J$ is a function of all of the data points. In the above description of gradient descent we have used all of the data points simultaneously. This is called *batch gradient descent*. Rather than use all of the data in each parameter update, we can use a technique called *stochastic gradient descent*. It is like batch gradient descent except that each update uses only *one* data point, chosen at random. To see this, write the cost as a sum of per-point contributions:

\begin{equation}
J(\mathbf{\theta}) = \sum\limits_{n=1}^N J_n(\mathbf{\theta})
\end{equation}

Stochastic gradient descent means picking an $n$ at random and then updating according to 

\begin{equation}
\theta_k^{new} =\theta_k^{old}-\beta \frac{\partial J_n}{\partial \theta_k}
\end{equation}

Repeat, picking another data point at random, and so on.

An important parameter in Gradient Descent is the size of the steps, determined by
the **learning rate** hyperparameter. 

![chapter-2-3-pic_1.png](./pic/chapter-2-3-pic_1.png)

If the learning rate is too small, then the algorithm
will have to go through many iterations to converge, which will take a long time...

![chapter-2-3-pic_2.png](./pic/chapter-2-3-pic_2.png)

... on the other hand, if the learning rate is too high, you might jump across the valley. This might make the algorithm diverge, so that it fails to find a good solution. 

![chapter-2-3-pic_3.png](./pic/chapter-2-3-pic_3.png)

Gradient descent is one of the many algorithms that benefit from feature scaling.
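The effect of the learning rate can be reproduced with a short experiment. The sketch below applies batch gradient descent to the quadratic cost of the linear fit, with each step moving the parameters against the gradient; the three learning rates, the step counts, and the function name `gradient_descent` are assumptions chosen for illustration.

```python
def gradient_descent(xs, ys, learning_rate, steps):
    """Batch gradient descent for the fit y = theta0 + theta1 * x."""
    theta0, theta1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors on the whole dataset
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of J = 1/(2N) * sum(errors^2)
        grad0 = sum(errors) / n
        grad1 = sum(e * x for e, x in zip(errors, xs)) / n
        # Simultaneous update, stepping against the gradient
        theta0, theta1 = (theta0 - learning_rate * grad0,
                          theta1 - learning_rate * grad1)
    return theta0, theta1

# Data from the practical example (generated by y = 3x + 1)
xs = [-1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
ys = [-2.0, 1.0, 4.0, 7.0, 10.0, 13.0]

good = gradient_descent(xs, ys, learning_rate=0.05, steps=2000)
slow = gradient_descent(xs, ys, learning_rate=0.001, steps=100)
diverged = gradient_descent(xs, ys, learning_rate=0.5, steps=100)
print("suitable rate :", good)
print("too small     :", slow)
print("too large     :", diverged)
```

With the suitable rate the parameters approach $(1, 3)$. With the small rate they are still far from the minimum after 100 steps, and with the large rate every step overshoots by more than the previous one, so the values grow without bound.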

### Stochastic Gradient Descent

In the previous section, we learned how to minimize a cost function by taking a step
in the opposite direction of a cost gradient that is calculated from the whole training
dataset; this is why this approach is sometimes also referred to as batch gradient
descent. Now imagine that we have a very large dataset with millions of data
points, which is not uncommon in many machine learning applications. Running
batch gradient descent can be computationally quite costly in such scenarios, since
we need to reevaluate the whole training dataset each time that we take one step
toward the global minimum.

A popular alternative to the batch gradient descent algorithm is stochastic gradient
descent (SGD), which is sometimes also called iterative or online gradient descent.
Instead of updating the weights based on the sum of the accumulated errors over all
training examples, we update the weights incrementally for each training example:

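$$\Delta \mathbf{w} = \eta \left( y^{(i)} - \phi\left(z^{(i)} \right)\right)\mathbf{x}^{(i)}$$

Here $\mathbf{w}$ denotes the weights (the parameters $\theta$ of the previous section), $\eta$ is the learning rate (called $\beta$ above), $\phi\left(z^{(i)}\right)$ is the model's output for the $i$-th training example, and $y^{(i)}$ is its true label.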

Although SGD can be considered as an approximation of gradient descent, it
typically reaches convergence much faster because of the more frequent weight
updates. Since each gradient is calculated based on a single training example, the
error surface is noisier than in gradient descent, which can also have the advantage
that SGD can escape shallow local minima more readily if we are working with
nonlinear cost functions.
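A minimal sketch of stochastic gradient descent for the same linear fit is shown below. Each update uses a single, randomly chosen training example; the learning rate, the number of steps, the seed, and the function name `sgd` are assumptions made for this illustration.

```python
import random

def sgd(xs, ys, learning_rate=0.02, steps=5000, seed=0):
    """Stochastic gradient descent for the fit y = theta0 + theta1 * x."""
    rng = random.Random(seed)
    theta0, theta1 = 0.0, 0.0
    for _ in range(steps):
        # Pick one training example at random
        i = rng.randrange(len(xs))
        # Error of the current model on that single example
        error = theta0 + theta1 * xs[i] - ys[i]
        # Update both parameters using only this example's gradient
        theta0 -= learning_rate * error
        theta1 -= learning_rate * error * xs[i]
    return theta0, theta1

# Data from the practical example (generated by y = 3x + 1)
xs = [-1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
ys = [-2.0, 1.0, 4.0, 7.0, 10.0, 13.0]

theta0, theta1 = sgd(xs, ys)
print(theta0, theta1)
```

Because these data lie exactly on a line, the noisy single-example updates still settle very close to $\theta_0 = 1$ and $\theta_1 = 3$. On noisy real data, SGD typically keeps fluctuating around the minimum unless the learning rate is reduced over time.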

## Learning Tools

### Using Python for machine learning

Python is one of the most popular programming languages for data science and
thanks to its very active developer and open source community, a large number of
useful libraries for scientific computing and machine learning have been developed.
Although the performance of interpreted languages, such as Python, for
computation-intensive tasks is inferior to lower-level programming languages,
extension libraries such as **NumPy**, **Matplotlib** and **Pandas**, among others, have been developed that build
upon lower-layer Fortran and C implementations for fast vectorized operations
on multidimensional arrays.
For machine learning programming tasks, we will mostly refer to the **scikit-learn**
library, which is currently one of the most popular and accessible open source
machine learning libraries. In the later chapters, when we focus on a subfield
of machine learning called deep learning, we will use the latest version of the
**Keras** library, which specializes in training so-called deep neural network
models very efficiently. 

### Installing Python and Packages

To set up your Python environment, you first need Python on your machine. There are various Python distributions available, and we have chosen one that works very well for data science: **Anaconda**. Anaconda comes with its own Python distribution, which will be installed along with it. 

Data science often requires you to work with many scientific packages such as scipy and numpy, data manipulation packages such as pandas, and tools such as IDEs and the interactive Jupyter Notebook. With Anaconda, most of these packages come pre-installed. If you want to install a new package, you can do so with conda or with the pip installer program, which has been bundled with Python since version 3.4. More information about pip can be found [here](https://docs.python.org/3/installing/index.html). After we have successfully installed Python, we can execute pip from the terminal
to install additional Python packages:

**pip install SomePackage**

Already installed packages can be updated via the --upgrade flag:

**pip install SomePackage --upgrade**
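To see which packages are already installed before running pip, one option is to query the package metadata from Python itself. The sketch below uses the standard `importlib.metadata` module (available from Python 3.8); the helper name `installed_version` and the list of package names are assumptions chosen for this course.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None

# Packages referred to in this notebook
for name in ["numpy", "pandas", "matplotlib", "scikit-learn", "tensorflow"]:
    print(name, installed_version(name) or "not installed")
```

Any package reported as not installed can then be added with conda or pip as described above.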

To download the Anaconda distribution, go to the [official download page](https://www.anaconda.com/download/), select your platform, and then choose the installer that matches the version and architecture you need.

![chapter-0-0_pic_3.png](./pic/chapter-0-0_pic_3.png)

To test your installation, on Windows, click on Start and then Anaconda Navigator in the program list (or search for Anaconda in the search bar and select Anaconda Navigator). On a Mac, open the Finder and, in the Applications folder, double-click on Anaconda-Navigator.

![chapter-0-0_pic_4.png](./pic/chapter-0-0_pic_4.png)

**Package Managers**

Anaconda gives you two package managers: **pip** and **conda**. When a package isn't available through conda, you can use pip to install it. Note that using pip to install packages that are also available through conda may cause an installation error.

**Jupyter Notebook**

A notebook is a document like this one! A notebook integrates code and its output into a single document that combines visualizations, narrative text, mathematical equations, and other rich media.

In other words: it's a single document where you can run code, display the output, and also add explanations, formulas, charts, and make your work more transparent, understandable, repeatable, and shareable. As part of the open source Project Jupyter, Jupyter Notebooks are completely free. You can download the software on its own, or as part of the Anaconda data science toolkit.

### Google Colab

Although it is not essential to work in a Colab environment (all the course notebooks are designed to run without problems locally on your PC), it is useful to know some basic elements of the interaction with Colab. In particular, the cells below contain two examples of working with external files. The first shows how to load a text file from your local PC into the Google virtual machine. The second shows the opposite operation: we create a simple pandas dataframe in the Colab environment and export it in CSV format to the local machine.

#### How to Upload a File to Google Colab

In [None]:
if 'google.colab' in str(get_ipython()):
    from google.colab import files
    uploaded = files.upload()
    path = ''
else:
    path = './data/'

In [None]:
with open(path + "carroll-alice.txt", "r") as f:
    alice = f.read()
    
alice[:392]    

#### How to Download a File from Google Colab

In [None]:
import pandas as pd

cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4'],
        'Price': [22000,25000,27000,35000]
        }

df = pd.DataFrame(cars, columns= ['Brand', 'Price'])

In [None]:
if 'google.colab' in str(get_ipython()):
    # when running in the Google environment, first save to the virtual machine...
    df.to_csv('export_dataframe.csv', index=False, header=True)
    # ...then download to the local machine
    from google.colab import files
    files.download("export_dataframe.csv")
else:
    # when working locally, save directly with the usual method
    df.to_csv('./data/export_dataframe.csv', index=False, header=True)

### Packages for scientific computing

Throughout this course, we will use **NumPy**'s multidimensional arrays to store
and manipulate data. We will make use of **Pandas**, which is a library
built on top of NumPy that provides additional higher-level data manipulation
tools that make working with tabular data even more convenient. To augment your
learning experience and visualize quantitative data, which is often extremely useful
to make sense of it, we will use the very customizable **Matplotlib** library. 

## Summary

In this lesson, we explored machine learning at a very high level and familiarized
ourselves with the big picture and major concepts that we are going to explore in the
following chapters in more detail. 

We learned that:

- **Supervised learning** is composed of two important subfields: **classification** and **regression**. While classification models allow us to categorize objects into known classes, we can use regression analysis to predict the continuous outcomes of target variables; 

- **Unsupervised learning** offers useful techniques for discovering structures in unlabeled data;

- **How to set up a Python environment** and how to install and update the required packages to get ready to see machine learning in action.

## References

For this introduction you can refer to:

- John C. Hull, **Machine Learning in Business: An Introduction to the World of Data Science**, Amazon (2019)

- Paul Wilmott, **Machine Learning: An Applied Mathematics Introduction**, Panda Ohana Publishing (2019)

For the practical example, see the very simple introduction described [here](https://developers.google.com/codelabs/tensorflow-1-helloworld#5).