In [2]:
from IPython.display import HTML, Image
import pandas as pd

Machine learning has in many ways become a catch-all phrase for complex issues in research: "apply some machine learning and problem solved!". Recently there has also been much talk about machine learning algorithms being a black box, rumors of machine learning models coming to life and becoming sentient, and jobs becoming obsolete because of tools such as ChatGPT, which is itself built on an ML model. There's much to demystify here, so let's start from the very beginning.

### Early days of ML:
- teaching a computer to play checkers through trial and error (Arthur Samuel, IBM, 1959)


### What is Machine learning (ML)?

In [3]:
Image(url= "../img/ML1.png",width=500, height=1500)
# source: https://dataedo.com/cartoon/machine-learning

There are many definitions and ways to describe it...

-  type of technology that allows computers to learn and improve at tasks without being explicitly programmed to do so

- the application of statistical modelling to detect patterns and improve performance based on data and empirical information, without explicit programming commands (self-learning)
- machines can perform a task using <b>input data</b> rather than <b>input commands</b>
  * unlike traditional programming, where the programmer writes the rules that turn inputs into outputs, ML uses data as input to build a decision model
  
  * decisions are generated by deciphering relationships and patterns in the data using probabilistic reasoning, trial and error, and other computationally intensive techniques


- building mathematical models of data in order to understand the data
- learning: giving the models <i>tunable</i> (tweakable) parameters that can be adapted to the observed data
- an "idea that generic algorithms can tell you something interesting about your data without having to write any custom code specific to the problem"; instead, you feed your data to the algorithm and it builds its own logic based on it
- it involves feeding the computer a large amount of data and using the ML algorithm to identify patterns inside the data. Those patterns can then be used to make predictions and decisions about the data
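
As a minimal sketch of what "tunable parameters adapted to the observed data" can look like, the snippet below fits a straight line to a handful of made-up points. The data, the variable names and the choice of NumPy's `polyfit` are assumptions for illustration only; they are not part of the original material.

```python
import numpy as np

# Made-up observations for illustration: y is roughly 2 * x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# The "model" is a straight line: y = slope * x + intercept.
# slope and intercept are the tunable parameters. np.polyfit chooses the
# values that minimise the squared error on the observed data.
slope, intercept = np.polyfit(x, y, deg=1)


def predict(new_x):
    """Apply the learned parameters to an unseen input."""
    return slope * new_x + intercept


print(f"learned slope={slope:.2f}, intercept={intercept:.2f}")
print(f"prediction for x=5: {predict(5.0):.2f}")
```

Nobody wrote the rule "multiply by about two and add about one" by hand; the parameters were derived from the data, which is the point the definitions above are making.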

<i>Example</i>
- you want to teach a computer to recognize pictures of dogs. You would first show it thousands of pictures, some with dogs and some without, and tell it which ones actually contain dogs. The computer would then use this information to identify common features of dogs, such as their fur, ears, and snouts.

- Once the computer has learned these features, it can use them to identify whether new pictures it encounters are likely to be dogs or not. Over time, as it encounters more and more pictures, it can improve its accuracy and become better at recognizing dogs

- In essence, machine learning allows computers to learn from experience, much like how humans learn from experience.

### The Anatomy of Machine Learning

In [4]:
Image(url= "../img/babuska.png",width=500, height=1500)
# source: Machine Learning for Absolute Beginners, Oliver Theobald

- computer science (design and use of computers) 
- data science (methods for extracting knowledge and insights from data with computers)
- artificial intelligence (ability of machines to perform intelligent tasks, such as natural language processing (NLP) and perception) 
- machine learning (getting computers to learn from data instead of following explicitly programmed instructions)


In [5]:
Image(url= "../img/ML3.png",width=500, height=1500)
# source: Machine Learning for Absolute Beginners, Oliver Theobald

- ML overlaps with data mining in the sense that both look for patterns in datasets
- data mining focuses on analyzing input variables to predict a new output, whereas machine learning extends to analyzing both input and output variables


## Types of Machine Learning

The two main types of ML are:
1.  Supervised learning 

2. Unsupervised learning


The main difference between the two is what our data looks like before we put it into the algorithm. More specifically, whether we're using labels or not. Labels of what kind though?


### Example 1

Suppose you want to teach a computer to tell different types of fruit apart.

- <b>Supervised learning</b>
  * you would provide the computer with a labeled dataset, which includes examples of each type of fruit along with their labels (i.e., "apple," "orange," "banana")
  * the computer then uses this labeled dataset to learn the characteristics that distinguish each type of fruit, such as their color, shape, and texture. Once the computer has learned these characteristics, it can use them to classify new fruits that it encounters.
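
A minimal sketch of the supervised fruit example with scikit-learn follows. The two numeric features (a "redness" score and an elongation ratio), their values, and the choice of a nearest-neighbour classifier are all made up for illustration; real image data would need far richer features.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical features per fruit: [redness from 0 to 1, length / width].
X_train = [
    [0.90, 1.0], [0.85, 1.1],   # apples: red and round
    [0.50, 1.0], [0.55, 1.0],   # oranges: less red, round
    [0.20, 4.0], [0.25, 3.8],   # bananas: yellow and elongated
]
# The labels are what makes this supervised learning.
y_train = ["apple", "apple", "orange", "orange", "banana", "banana"]

# A 1-nearest-neighbour classifier labels a new fruit like the most
# similar fruit it saw during training.
classifier = KNeighborsClassifier(n_neighbors=1)
classifier.fit(X_train, y_train)

# Two new, unlabeled fruits.
new_fruits = [[0.88, 1.05], [0.22, 3.9]]
predictions = classifier.predict(new_fruits)
print(list(predictions))
```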

In [6]:
Image(url= "../img/SL1.PNG",width=500, height=1500)
# source: https://www.educba.com/what-is-supervised-learning/

- <b>Unsupervised learning</b>
  * you would provide the computer with an unlabeled dataset, which includes examples of different types of fruit but without any labels
  * the computer then uses clustering algorithms to identify patterns or similarities within the dataset, without being told explicitly what the different types of fruit are. Once the computer has identified these patterns, it can use them to group similar fruits together
  * for example, you might show the computer a dataset of images of different fruits, without any labels. The computer would then group similar fruits together based on their color, shape, or texture. It might put red, round fruits (the apples) in one cluster and yellow, curved fruits (the bananas) in another, without ever being told what they are.
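
The unsupervised version can be sketched in the same spirit. Here the fruits carry no labels at all, and k-means (one possible clustering algorithm, chosen here as an assumption) is asked to split them into two groups. The feature values are again invented for illustration.

```python
from sklearn.cluster import KMeans

# Hypothetical features per fruit: [redness from 0 to 1, length / width].
# No labels are given this time.
X = [
    [0.90, 1.0], [0.85, 1.1], [0.95, 1.0],   # red, round fruits
    [0.20, 4.0], [0.25, 3.8], [0.15, 4.1],   # yellow, elongated fruits
]

# Ask for two clusters; random_state makes the run repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# The cluster ids are arbitrary numbers (0 or 1). The algorithm groups
# similar fruits together but has no idea that one group is "apples".
print(list(cluster_ids))
```

Giving the groups meaningful names is still a human task, which is the key difference from the supervised case.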

In [7]:
Image(url= "../img/UL1.png",width=500, height=1500)
# source: https://www.educba.com/what-is-supervised-learning/

### Example 2
<b>Supervised approach </b> 

You're a real estate agent hiring new people who have no idea how to price apartments.
Drawing on your experience with apartment prices, you write a small app that helps them decide on an apartment's price based on its size, neighbourhood, etc., and on what other apartments matching the same criteria have sold for. 

Using that training data, we want to create a program that can estimate how much any other apartment in your area is worth:

In [8]:
data = {'Bedrooms': [3,2,1,3],
        'Sq. meters': [70, 40, 20, 60],
        'Neighbourhood': ['Lend', 'Geidorf', 'Strassgang', 'Gries'],
        'Price': [200000, 240000, 100000, 150000]}

df = pd.DataFrame(data)
df


| | Bedrooms | Sq. meters | Neighbourhood | Price |
|---|---|---|---|---|
| 0 | 3 | 70 | Lend | 200000 |
| 1 | 2 | 40 | Geidorf | 240000 |
| 2 | 1 | 20 | Strassgang | 100000 |
| 3 | 3 | 60 | Gries | 150000 |


In [9]:
prediction = {'Bedrooms': [3],
        'Sq. meters': [100],
        'Neighbourhood': ['Lend'],
        'Price': ['?']}

prediction = pd.DataFrame(prediction)
prediction

| | Bedrooms | Sq. meters | Neighbourhood | Price |
|---|---|---|---|---|
| 0 | 3 | 100 | Lend | ? |


This is the essence of supervised learning: deducing the answer to a new problem from previously solved (labelled) problems. The label in this case is the price.
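
One way this could look in code is sketched below, using scikit-learn's `LinearRegression` on the four apartments from the table. This is an assumption-laden illustration: the neighbourhood column is left out to keep the sketch short (it would need to be encoded as numbers first), and four examples are far too few for a trustworthy estimate.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# The same training data as above, without the neighbourhood column.
train = pd.DataFrame({
    "Bedrooms": [3, 2, 1, 3],
    "Sq. meters": [70, 40, 20, 60],
    "Price": [200000, 240000, 100000, 150000],
})

feature_columns = ["Bedrooms", "Sq. meters"]

# Fit a linear model: price as a weighted sum of the features.
model = LinearRegression()
model.fit(train[feature_columns], train["Price"])

# The apartment whose price we marked with "?".
new_apartment = pd.DataFrame({"Bedrooms": [3], "Sq. meters": [100]})
estimated_price = model.predict(new_apartment[feature_columns])[0]
print(f"Estimated price: {estimated_price:,.0f}")
```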

<b>Unsupervised approach </b> 

Imagine the same scenario, but we don't know the prices of any of the apartments. In this case we only know the size, the neighbourhood and the number of bedrooms. But we can still make use of that info.

What we could do is:

- have an algorithm identify market segments in the data (e.g. in one area buyers prefer smaller flats, in another bigger etc.)
- identify outliers (such as large mansions that are rare)
 

## What can we do with each type of learning?

Supervised learning: Models that can predict labels based on labeled training data

- <b>Classification:</b> Models that predict labels as two or more discrete categories. Some additional examples: sentiment analysis, spam detection
- <b>Regression:</b> Models that predict continuous labels (continuous numerical variables, which can take infinitely many values between any two values). Some additional examples: predicting housing or stock prices, or the weather

Unsupervised learning: Models that identify structure in unlabeled data

 - <b>Clustering:</b> Models that detect and identify distinct groups in the data, i.e. grouping similar data together based on their characteristics (features). We can use it to group similar documents and images or identify patterns in documents (text similarity)
 - <b>Dimensionality reduction:</b> Models that detect and identify lower-dimensional structure in higher-dimensional data. It is used to reduce the number of features in a dataset by transforming them into a lower-dimensional space while still preserving the important information. This can be used for data visualisation and for speeding up the training of ML algorithms. A widely used technique is principal component analysis (PCA)
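
As a small sketch of dimensionality reduction, the snippet below applies PCA to the iris dataset that ships with scikit-learn, reducing four measurements per flower to two. The choice of dataset and of two components is an assumption made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# 150 flowers, each described by 4 measurements (features).
X = load_iris().data

# Project the data onto the 2 directions that keep the most variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X.shape, "->", X_2d.shape)
# Fraction of the original variance that survives the projection.
variance_kept = pca.explained_variance_ratio_.sum()
print(f"variance kept: {variance_kept:.2%}")
```

With only two columns left, the data can be drawn as an ordinary scatter plot, which is why this technique is popular for visualisation.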


In [10]:
Image(url= "../img/ML.png",width=700, height=1500)
# source: https://www.cognub.com/index.php/cognitive-platform/

As you see from this image, there isn't just supervised and unsupervised learning. 

Other types of ML:

- semi-supervised learning - a machine learning method in which only a fraction of the input data comes with labels (outputs) and the rest is unlabeled. It is a mix of supervised and unsupervised learning.
- reinforcement learning - an algorithm learns to make decisions by interacting with its environment and receiving feedback in the form of rewards or punishments (e.g. teaching a computer to play chess)
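
To make the reward idea concrete, here is a deliberately tiny sketch: an agent repeatedly picks one of three slot machines and learns from the payouts which one is best. The payout probabilities, the epsilon-greedy strategy and all names are assumptions for illustration; chess-playing systems are vastly more complex.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Chance of a reward for each machine. The agent never sees these numbers.
true_reward_prob = [0.2, 0.5, 0.8]

estimates = [0.0, 0.0, 0.0]  # the agent's current guess per machine
pulls = [0, 0, 0]            # how often each machine was tried
epsilon = 0.1                # fraction of steps spent exploring

for step in range(5000):
    if rng.random() < epsilon:
        # Explore: try a random machine.
        action = rng.randrange(3)
    else:
        # Exploit: pick the machine that currently looks best.
        action = max(range(3), key=lambda a: estimates[a])

    # The environment answers with a reward (1) or nothing (0).
    reward = 1.0 if rng.random() < true_reward_prob[action] else 0.0

    # Update the running average reward for the chosen machine.
    pulls[action] += 1
    estimates[action] += (reward - estimates[action]) / pulls[action]

best_action = max(range(3), key=lambda a: estimates[a])
print("estimated rewards:", [round(e, 2) for e in estimates])
print("machine the agent prefers:", best_action)
```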


But let's stick to our first two types for now.

### ML in Python

Python has a wide range of libraries and tools available for ML. 

With these libraries, it's easy to build and train machine learning models to perform tasks such as predictive data analysis and natural language processing.

Some of the famous ones:
 - scikit-learn - https://scikit-learn.org/stable/ - used for shallow (classical) ML algorithms
 - TensorFlow - https://www.tensorflow.org/ - used for deep learning and more advanced work
 - Keras - https://keras.io/getting_started/intro_to_keras_for_researchers/ - used for deep learning and more advanced work
 - PyTorch - https://pytorch.org/ - used for deep learning and more advanced work; an independent framework and an alternative to TensorFlow
 
* for ML, Python is the go-to language because of the number of libraries
* however, if you are proficient in C or C++, those are also a good choice for ML: they compile to native code and can talk to the GPU directly (e.g. via CUDA), whereas Python is interpreted and hands the heavy computation over to compiled libraries


For our next class, please make sure you have scikit-learn installed :)
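
A quick way to check this is sketched below. Note that the package is installed under the name `scikit-learn` but imported as `sklearn`; the `pip` command in the message assumes you manage packages with pip (conda users would use their own installer).

```python
import importlib.util

# find_spec returns None when the package cannot be imported.
sklearn_installed = importlib.util.find_spec("sklearn") is not None

if sklearn_installed:
    import sklearn
    print("scikit-learn version:", sklearn.__version__)
else:
    print("scikit-learn is missing; try: pip install scikit-learn")
```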

### Next up (and throughout the course):
 - scikit-learn
 - model validation (evaluation)
 - feature engineering
 - basic classification and regression algorithms
 

### References

* Books: 
   - Python Data Science Handbook - https://www.oreilly.com/library/view/python-data-science/9781491912126/
   - Python Data Science Handbook, ML section - https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.00-Machine-Learning.ipynb#scrollTo=ItD8Vf0zgELe
   - Machine Learning for Absolute Beginners, Oliver Theobald (https://www.amazon.de/gp/product/B08RWBSKQB/ref=ppx_yo_dt_b_d_asin_title_o00?ie=UTF8&psc=1)
   - The Hundred Page Machine Learning Book, Andriy Burkov



* Medium guide - https://medium.com/@ageitgey/machine-learning-is-fun-80ea3ec3c471
* https://neptune.ai/blog/self-supervised-learning
