# Unsupervised Learning

## What is Machine Learning

Machine learning is a branch of artificial intelligence (AI) that gives systems the ability to learn automatically from experience, without being explicitly programmed, and can help solve complex problems. Machine learning algorithms build a model based on sample data, known as "training data", in order to make predictions or decisions. They are used in a wide variety of applications, such as medicine, email filtering, speech recognition, and computer vision, where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks.

## The three different types of machine learning

### Supervised Learning

The main goal in supervised learning is to learn a model from labeled training data that allows us to make predictions about unseen or future data. Here, the term "supervised" refers to a set of **training** examples (data inputs) where the
desired output signals (**labels**) are already known. The following figure summarizes a typical supervised learning workflow, where the labeled training data is passed to a machine learning algorithm for fitting a predictive model that can make
predictions on new, unlabeled data inputs:

![image.png](attachment:image.png)

A supervised learning task with discrete class labels, such as in the previous example, is also called a **classification
task**. 
A second type of supervised learning is the prediction of continuous outcomes, which is also called **regression analysis**. In
regression analysis, we are given a number of predictor (explanatory) variables and a continuous response variable (outcome), and we try to find a relationship between those variables that allows us to predict an outcome. Note that in the field of machine learning, the predictor variables are commonly called ***features***, and the response variables are usually referred to as ***target variables***.

![image.png](attachment:image.png)
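To make the distinction between classification and regression concrete, here is a minimal sketch using scikit-learn. The choice of the iris data set, the two linear models, and the synthetic line `3x + 2` with noise are assumptions made only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Classification: predict a discrete class label (the iris species).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
clf_accuracy = clf.score(X_test, y_test)  # fraction of correctly predicted labels
print(f"classification accuracy: {clf_accuracy:.2f}")

# Regression: predict a continuous target from a single feature.
# The data are synthetic (an assumption): a noisy line with slope 3 and intercept 2.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))
target = 3.0 * x[:, 0] + 2.0 + rng.normal(0, 1, size=100)
reg = LinearRegression().fit(x, target)
print(f"estimated slope: {reg.coef_[0]:.2f}, intercept: {reg.intercept_:.2f}")
```

In both cases the model is fitted on features together with known targets; only the type of target (class label versus continuous value) differs.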

### Unsupervised Learning

In supervised learning, we know the right answer beforehand when we train a model. In **unsupervised learning**, however, we are dealing with ***unlabeled data*** or data of unknown structure. Using unsupervised learning techniques, we are able to explore the structure of our data to extract meaningful information without the guidance of a known outcome variable or reward function.

**Clustering** is an exploratory data analysis technique that allows us to organize a pile of information into meaningful subgroups (clusters) *without having any prior knowledge of their group memberships*. Each cluster that arises during the analysis defines a group of objects that share a certain degree of similarity but are more dissimilar to objects in other clusters, which is why clustering is also sometimes called unsupervised classification. Clustering is a great technique for structuring information and deriving meaningful relationships from data. For example, it allows marketers to discover customer groups based on their interests, in order to develop distinct marketing programs.

![chapter-0-0_pic_2.png](./pic/chapter-0-0_pic_2.png)
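As a rough illustration of clustering, the sketch below groups synthetic two-dimensional points with K-Means from scikit-learn. The data set (`make_blobs`) and the choice of three clusters are assumptions for the example; in a real analysis the number of clusters is usually not known in advance.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic points drawn around three centers; the generated group ids
# are discarded because clustering does not use labels.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)  # one cluster index per sample

print(kmeans.cluster_centers_)  # coordinates of the three cluster centers
print(cluster_ids[:10])         # cluster assignments of the first ten points
```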

### Reinforcement Learning

Another type of machine learning is **reinforcement learning**. In reinforcement learning, the goal is to develop a system (***agent***) that improves its performance based on interactions with the environment. Since the information about the current state of the environment typically also includes a so-called **reward signal**, we can think of reinforcement learning as a field related to supervised learning. However, in reinforcement learning, this feedback is not the correct ground truth label or value, but a measure of how good the action was, as quantified by a reward function. Through its interaction with the environment, an agent can then use reinforcement learning to learn a series of actions that maximizes this reward via an exploratory trial-and-error approach or deliberative planning. A popular example of reinforcement learning is a chess engine. Here, the agent decides upon a series of moves depending on the state of the board (the environment), and the reward can be defined as a win or a loss at the end of the game.
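The trial-and-error idea can be sketched on a much simpler problem than chess: a three-armed bandit, where the environment has no state and each action pays a reward of 0 or 1. The epsilon-greedy strategy, the win probabilities, and all names below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden from the agent: probability that each action yields reward 1.
true_win_prob = np.array([0.2, 0.5, 0.8])
n_actions = len(true_win_prob)

value_estimate = np.zeros(n_actions)          # agent's running estimate per action
action_count = np.zeros(n_actions, dtype=int)
epsilon = 0.1                                  # fraction of steps spent exploring

for step in range(5000):
    if rng.random() < epsilon:
        action = int(rng.integers(n_actions))      # explore: random action
    else:
        action = int(np.argmax(value_estimate))    # exploit: best estimate so far
    reward = float(rng.random() < true_win_prob[action])
    action_count[action] += 1
    # Incremental mean: move the estimate toward the observed reward.
    value_estimate[action] += (reward - value_estimate[action]) / action_count[action]

best_action = int(np.argmax(value_estimate))
print("estimated values:", np.round(value_estimate, 2))
print("action preferred by the agent:", best_action)
```

The agent is never told which action is correct; it only observes rewards and gradually shifts toward the action that pays off most often.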

## Features and Labels

The data for supervised learning contains what are referred to as **features** and **labels**. The **labels** are the values of the target that is to be predicted. The **features** are the variables from which the predictions are to be made. For example, when predicting the price of a house, the **features** could be the square meters of living space, the number of bedrooms, the number of bathrooms, the size of the garage, and so on. The **label** would be the house price.

The data for unsupervised learning consists of features but no labels, because the model is used to identify patterns rather than to forecast something.
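In code, the split between features and labels often looks like the following sketch. The column names and the four rows of values are made up for illustration.

```python
import pandas as pd

# Hypothetical house data: three feature columns and one label column.
houses = pd.DataFrame({
    "living_space_m2": [70, 95, 120, 150],
    "bedrooms": [2, 3, 3, 4],
    "bathrooms": [1, 1, 2, 2],
    "price": [210_000, 265_000, 340_000, 415_000],
})

X = houses.drop(columns="price")  # features: what predictions are made from
y = houses["price"]               # label: the value to be predicted

print(X)
print(y)
```

For an unsupervised task, only `X` would be available and `y` would not exist.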


### The main difference between supervised and unsupervised learning: Labeled data

At the risk of being repetitive, I want to stress again that the main distinction between the two approaches is the use of labeled datasets. To put it simply, a supervised learning algorithm uses labeled input and output data, while an unsupervised learning algorithm does not.

In supervised learning, the algorithm “learns” from the training dataset by iteratively making predictions on the data and adjusting for the correct answer. While supervised learning models tend to be more accurate than unsupervised learning models, they require upfront human intervention to label the data appropriately. For example, a supervised learning model can predict how long your commute will be based on the time of day, weather conditions and so on. But first, you’ll have to train it to know that rainy weather extends the driving time.

Unsupervised learning models, in contrast, work on their own to discover the inherent structure of unlabeled data. Note that they still require some human intervention for validating output variables. For example, an unsupervised learning model can identify that online shoppers often purchase groups of products at the same time. However, a data analyst would need to validate that it makes sense for a recommendation engine to group baby clothes with an order of diapers, applesauce and sippy cups.
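The shopping example can be made concrete by counting how often pairs of products appear in the same order. The sketch below uses plain counting on a handful of invented orders; it is not a full association rule algorithm, and the `support` and `confidence` helpers are names introduced here for illustration.

```python
from collections import Counter
from itertools import combinations

# Invented orders, each a set of purchased products.
orders = [
    {"diapers", "applesauce", "sippy cup"},
    {"diapers", "baby clothes"},
    {"diapers", "applesauce"},
    {"coffee", "applesauce"},
    {"diapers", "baby clothes", "sippy cup"},
]

item_counts = Counter()
pair_counts = Counter()
for order in orders:
    item_counts.update(order)
    # Sort so that each pair is always stored in the same order.
    pair_counts.update(combinations(sorted(order), 2))


def support(pair):
    """Fraction of all orders that contain both items of the pair."""
    return pair_counts[pair] / len(orders)


def confidence(first, second):
    """Fraction of orders containing `first` that also contain `second`."""
    pair = tuple(sorted((first, second)))
    return pair_counts[pair] / item_counts[first]


print(support(("applesauce", "diapers")))
print(confidence("baby clothes", "diapers"))
```

Whether a frequent pair is a sensible recommendation is still a judgment the analyst has to make; the counts only show that the items co-occur.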

## Basics of Unsupervised Learning

As we have seen, unsupervised learning analyzes **data without labeled responses**, that is, data sets without predefined or known outcomes. The algorithms seek to identify patterns or groupings in the input data without any guidance or supervision. This form of learning is crucial when no prior knowledge is available, or when obtaining labeled data is impractical because labeling is often expensive and time-consuming. The primary goal is to unearth hidden patterns, intrinsic structures, or useful representations in such unlabeled data.

### Key Techniques in Unsupervised Learning

- **Clustering**: Clustering is perhaps the most well-known unsupervised learning technique. It involves grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups. Common clustering algorithms include K-Means, hierarchical clustering, and DBSCAN.

- **Dimensionality Reduction**: This technique is about reducing the number of random variables under consideration and can be divided into feature selection and feature extraction. Methods like Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and autoencoders are widely used for dimensionality reduction.

- **Association Rule Learning**: This technique is used to discover interesting relations between variables in large databases. It’s commonly used in market basket analysis where it reveals how items purchased by customers are related. The Apriori algorithm is a classic example of association rule learning.

- **Anomaly Detection**: Anomaly detection involves identifying unusual patterns that do not conform to expected behavior. It is widely used in fraud detection, system health monitoring, and outlier detection in data cleaning and preprocessing.
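Two of the techniques listed above can be sketched in a few lines with scikit-learn: dimensionality reduction with PCA and anomaly detection with an isolation forest. The iris data set, the two retained components, and the `contamination=0.05` setting are assumptions for the example; isolation forest is only one of several possible anomaly detectors.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# The class labels are deliberately discarded: both techniques use features only.
X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Dimensionality reduction: project the 4 features onto 2 principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print("variance explained per component:", pca.explained_variance_ratio_)

# Anomaly detection: fit_predict returns -1 for suspected anomalies, 1 otherwise.
detector = IsolationForest(contamination=0.05, random_state=0)
flags = detector.fit_predict(X_scaled)
n_flagged = int((flags == -1).sum())
print("samples flagged as anomalies:", n_flagged)
```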

### Applications of Unsupervised Learning

Unsupervised learning techniques are applied in numerous fields due to their ability to discover hidden patterns in unlabeled data.

- In **marketing**, clustering helps in customer segmentation by identifying groups of customers with similar behaviors or preferences.

- In **genomics**, it assists in understanding genetic structures and variations by clustering similar genetic patterns.

- The **finance** sector employs anomaly detection to identify fraudulent transactions.

- In **image processing**, unsupervised learning helps in image compression and segmentation.

- **Natural Language Processing** (NLP) utilizes unsupervised learning for topic modeling and word clustering.

### Challenges and Considerations in Unsupervised Learning

While unsupervised learning is powerful, it comes with its own set of challenges:

- **Interpretability**: The outcomes of unsupervised learning are often difficult to interpret. Since there are no predefined labels, the meaning and significance of the results can be ambiguous and require domain expertise for interpretation.

- **Evaluation Metrics**: Evaluating the performance of unsupervised learning models is challenging since there is no ground truth to compare against. Metrics such as silhouette score or Davies-Bouldin index are used in clustering, but they don’t always provide a clear indication of model performance.

- **Data Quality**: The quality of outcomes heavily depends on the quality of input data. Noisy, incomplete, or inconsistent data can lead to misleading patterns and results.
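The two clustering metrics mentioned above are available in scikit-learn as `silhouette_score` and `davies_bouldin_score`. The sketch below computes both for several candidate cluster counts on synthetic data; the data set and the range of `k` are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette lies in [-1, 1], higher is better;
    # Davies-Bouldin is non-negative, lower is better.
    scores[k] = (silhouette_score(X, labels), davies_bouldin_score(X, labels))
    print(f"k={k}: silhouette={scores[k][0]:.3f}, davies_bouldin={scores[k][1]:.3f}")
```

Both metrics only measure how compact and well separated the clusters are; they cannot tell whether the grouping is meaningful for the problem at hand.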

## Learning Tools

### Using Python for machine learning

Python is one of the most popular programming languages for data science and
thanks to its very active developer and open source community, a large number of
useful libraries for scientific computing and machine learning have been developed.
Although the performance of interpreted languages, such as Python, for
computation-intensive tasks is inferior to that of lower-level programming languages,
extension libraries such as **NumPy**, **Matplotlib** and **Pandas**, among others, have been developed that build
upon lower-layer Fortran and C implementations for fast vectorized operations
on multidimensional arrays.
For machine learning programming tasks, we will mostly refer to the **scikit-learn**
library, which is currently one of the most popular and accessible open source
machine learning libraries. In the later chapters, when we focus on a subfield
of machine learning called deep learning, we will use the latest version of the
**Keras** library, which specializes in training so-called deep neural network
models very efficiently. 

### Installing Python and Packages

To set up your Python environment, you first need Python on your machine. There are various Python distributions available, and we have chosen one that works very well for data science: **Anaconda**. Anaconda comes with its own Python distribution, which is installed along with it.

Data science often requires you to work with many scientific packages such as SciPy and NumPy, data manipulation packages such as pandas, and tools such as IDEs and the interactive Jupyter Notebook. With Anaconda, most of these come pre-installed. If you want to install a new package, you can do so with conda or with the pip installer program, which has been bundled with the standard Python installers since Python 3.4. More information about pip can be found [here](https://docs.python.org/3/installing/index.html). After we have successfully installed Python, we can execute pip from the terminal to install additional Python packages:

**pip install SomePackage**

Already installed packages can be updated via the --upgrade flag:

**pip install SomePackage --upgrade**
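After installing or upgrading packages, one way to check what is actually available in the active environment is to query the installed versions from Python. This is a small sketch using the standard library module `importlib.metadata` (Python 3.8 or later); the helper name `installed_version` and the list of packages are assumptions for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


for name in ["numpy", "pandas", "scikit-learn", "scipy", "matplotlib"]:
    found = installed_version(name)
    print(f"{name}: {found if found else 'not installed'}")
```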

To download Anaconda, go to the [official download page](https://www.anaconda.com/download/), select your platform, and choose the installer. You can choose which version you want and whether it is 32-bit or 64-bit.

![chapter-0-0_pic_3.png](./pic/chapter-0-0_pic_3.png)

To test your installation on Windows, click Start and then Anaconda Navigator in the program list (or search for Anaconda in the search bar and select Anaconda Navigator). On a Mac, open the Finder and, in the Applications folder, double-click Anaconda-Navigator.

![chapter-0-0_pic_4.png](./pic/chapter-0-0_pic_4.png)

**Package Managers**

Anaconda gives you two package managers: **pip** and **conda**. When a package isn't available through conda, you can use pip to install it. Note that using pip to install a package that conda also provides can lead to installation errors or conflicting versions, so prefer conda when the package is available there.

**Jupyter Notebook**

A notebook is a document like this one! A notebook integrates code and its output into a single document that combines visualizations, narrative text, mathematical equations, and other rich media.

In other words: it's a single document where you can run code, display the output, and also add explanations, formulas, charts, and make your work more transparent, understandable, repeatable, and shareable. As part of the open source Project Jupyter, Jupyter Notebooks are completely free. You can download the software on its own, or as part of the Anaconda data science toolkit.

### Google Colab

Although it is not essential to work in a Colab environment (all the course notebooks are designed to run without problems locally on your PC), it is useful to know some basics of interacting with Colab. In particular, the cells below give two examples of working with external files. The first shows how to load a text file from your local PC into the Google virtual machine. The second shows the opposite operation: we create a simple pandas DataFrame in the Colab environment and export it in CSV format to the local machine.

#### How to Upload a File to Google Colab

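In [1]:
# In Colab, open a file picker and upload the file into the virtual machine's
# working directory; when running locally, read it from ./data/ instead.
if 'google.colab' in str(get_ipython()):
    from google.colab import files
    uploaded = files.upload()
    path = ''
else:
    path = './data/'

In [2]:
# carroll-alice.txt must have been uploaded (Colab) or be present in ./data/ (local)
with open(path + "carroll-alice.txt", "r") as f:
    alice = f.read()

# show the first 392 characters of the text
alice[:392]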


#### How to Download a File from Google Colab

In [None]:
import pandas as pd

# a small example table to export
cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
        'Price': [22000, 25000, 27000, 35000]}

df = pd.DataFrame(cars, columns=['Brand', 'Price'])

In [3]:
# run the previous cell first so that df is defined
if 'google.colab' in str(get_ipython()):
    # in the Google environment, first save to the virtual machine...
    df.to_csv('export_dataframe.csv', index=False, header=True)
    # ...then download to the local machine
    from google.colab import files
    files.download("export_dataframe.csv")
else:
    # when working locally, save directly with the usual method
    import os
    os.makedirs('./data', exist_ok=True)  # make sure the target folder exists
    df.to_csv('./data/export_dataframe.csv', index=False, header=True)


## Data Science Python Libraries

As we delve into the multifaceted world of machine learning, two libraries stand out for their robustness and versatility: Scikit-learn and SciPy. Both are cornerstones in the Python ecosystem for data science and provide a suite of tools that are indispensable for machine learning practitioners.

**Scikit-learn**, commonly referred to as sklearn, is a specialized library that offers a wide array of machine learning algorithms and tools. It is built on top of libraries such as NumPy, SciPy, and matplotlib, which are workhorses for numerical computing and data visualization in Python. Scikit-learn simplifies complex processes, allowing for the easy implementation of many machine learning techniques. It encompasses algorithms for classification, regression, clustering, and dimensionality reduction, as well as utilities for model evaluation, data transformation, and data splitting. The library's consistency in API design makes it highly accessible for beginners, yet it remains powerful enough for seasoned practitioners to implement state-of-the-art machine learning models with only a few lines of code.
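The consistent API can be seen in a short sketch: transformers and estimators share `fit`-style methods, so they can be chained in a pipeline and used as one object. The particular steps chosen here (scaling, PCA, K-Means) and the iris data set are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)  # labels are not used

# Each step exposes the same fit/transform or fit/predict interface,
# so the whole chain behaves like a single estimator.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
cluster_ids = pipeline.fit_predict(X)

print(cluster_ids[:10])
print(pipeline.predict(X[:5]))  # assign samples to the fitted clusters
```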

**SciPy**, on the other hand, is a scientific computing library that provides more fundamental functionalities for mathematics, science, and engineering. It extends the capabilities of NumPy with additional modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers, and more. While SciPy is not exclusively a machine learning library, it forms the backbone of many higher-level machine learning operations that require scientific computations. Its modules are meticulously optimized for performance and are relied upon by researchers and developers for technical and scientific computing tasks that demand high precision and efficiency.
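As a small taste of these modules, the sketch below uses `scipy.optimize` to minimize a one-dimensional function and `scipy.integrate` to compute a definite integral. The function `f` and the integration bounds are chosen only for illustration.

```python
import numpy as np
from scipy import integrate, optimize


def f(x):
    """A simple parabola with its minimum at x = 2."""
    return (x - 2.0) ** 2 + 1.0


# Optimization: find the x that minimizes f.
result = optimize.minimize_scalar(f)
print(f"minimum near x = {result.x:.4f}, f(x) = {result.fun:.4f}")

# Integration: area under sin(x) between 0 and pi (analytically equal to 2).
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)
print(f"integral = {area:.6f} (estimated error {abs_error:.1e})")
```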

Together, Scikit-learn and SciPy form a potent duo, serving as the foundation upon which we can build, analyze, and deploy sophisticated machine learning models. Their contribution to the Python data science landscape is unparalleled, and a thorough understanding of both is crucial for anyone looking to make strides in the field of machine learning.

For those eager to explore the intricacies of Scikit-learn and SciPy, there is a wealth of resources available, ranging from official documentation to comprehensive textbooks and online courses.

To delve into **Scikit-learn**, the library's official documentation (https://scikit-learn.org/stable/documentation.html) is the definitive reference, offering detailed guides and tutorials on every aspect of the library. It includes user guides for different machine learning algorithms, information on model selection and evaluation, and practical examples to get your hands dirty. For a more structured learning experience, ***"Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow"*** by Aurélien Géron provides a deep dive into using Scikit-learn for practical machine learning. This book is well-regarded for its clear explanations and hands-on approach.

When it comes to **SciPy**, the official documentation (https://docs.scipy.org/doc/scipy/reference/) is again an excellent starting point. It provides detailed documentation of all modules and functions within the library. For a broader understanding, ***"Python for Data Analysis"*** by Wes McKinney offers insights into using SciPy alongside pandas, NumPy, and other data analysis tools. For those who prefer a more interactive approach, platforms like Coursera, edX, and Udemy offer courses on scientific computing with Python that include modules on SciPy.

In addition to these resources, communities such as Stack Overflow and GitHub provide forums where one can ask questions, share knowledge, and collaborate on projects. Journals such as the Journal of Machine Learning Research (JMLR) and conferences like SciPy and PyCon also publish papers and talks on the latest developments and applications of these libraries. These resources collectively provide a comprehensive ecosystem for learners to deepen their understanding and expertise in using Scikit-learn and SciPy for machine learning and scientific computing.

## Emerging Trends and Future Directions

Unsupervised learning is an area ripe for innovation and growth. Recent trends include the integration of unsupervised learning with deep learning techniques, such as deep neural networks and autoencoders. These approaches have shown promising results in complex tasks like feature learning, representation learning, and generative models.

Another exciting development is the use of unsupervised learning in reinforcement learning and transfer learning, where it helps in feature discovery and efficient learning in environments with sparse or no labels.

## Conclusion

Unsupervised learning is a dynamic and expansive field in machine learning. Its ability to work with unlabeled data makes it incredibly versatile and valuable across various domains. As data continues to grow in size and complexity, the role of unsupervised learning in extracting meaningful information and discovering hidden patterns becomes increasingly important. While it poses unique challenges in terms of interpretation and evaluation, advancements in algorithms and computational power continue to push the boundaries, making unsupervised learning an exciting field to watch in the coming years. The future of unsupervised learning, intertwined with developments in artificial intelligence, holds immense potential for innovation and discovery, making it a key pillar in the quest to harness the power of data.

## References and Credits