## Semi-Supervised Machine Learning
Semi-supervised learning combines supervised and unsupervised machine learning. It addresses a practical limitation of supervised learning, which needs labelled data: when the dataset is large, labelling every example can be costly.

The idea is to cluster the data using unsupervised learning and then use the available labelled examples to label the remaining unlabelled data.
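
The cluster-then-label idea can be sketched as follows. This is a minimal illustration rather than a reference method: the use of `KMeans`, the `make_blobs` data, the choice of 20 labelled rows, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data: two well-separated blobs (an assumption for this sketch).
X_demo, y_true = make_blobs(n_samples=300, centers=2, cluster_std=1.0, random_state=0)

# Pretend only the first 20 rows are labelled; -1 marks "no label".
y_known = np.full(len(X_demo), -1)
y_known[:20] = y_true[:20]

# Step 1: cluster all rows, ignoring the labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_demo)

# Step 2: give each cluster the majority label of its labelled members.
y_filled = y_known.copy()
for cluster in np.unique(cluster_ids):
    in_cluster = cluster_ids == cluster
    known = y_known[in_cluster & (y_known != -1)]
    if known.size == 0:
        continue  # no labelled example in this cluster; leave its rows unlabelled
    majority = np.bincount(known).argmax()
    y_filled[in_cluster & (y_known == -1)] = majority

print("Rows still unlabelled:", int(np.sum(y_filled == -1)))
```

The originally labelled rows are never overwritten; only rows marked -1 receive a cluster's majority label.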

Types of Semi-Supervised Learning:
- Inductive Learning (builds a model that learns from a labeled training set and generalizes to new, unseen data.)
- Transductive Learning (transfers information from the labeled training data to the unlabeled data that is already available, without aiming to generalize beyond it.)
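
The difference between the two settings can be seen on a single fitted model. The sketch below assumes scikit-learn's `LabelPropagation`; the dataset size and the split points (50 labeled, 100 unlabeled, 50 held out) are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelPropagation

# Small synthetic dataset (sizes are illustrative assumptions).
X_demo, y_demo = make_classification(n_samples=200, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=1)

# Hold out the last 50 rows as "new" data that is never seen during fitting.
X_fit, X_new = X_demo[:150], X_demo[150:]
y_fit = y_demo[:150].copy()

# Hide the labels of rows 50..149 by marking them with -1.
y_fit[50:] = -1

model = LabelPropagation()
model.fit(X_fit, y_fit)

# Transductive: labels inferred for the unlabeled rows present during fitting.
transduced = model.transduction_[50:]

# Inductive: predictions for rows the model has never seen.
induced = model.predict(X_new)

print(transduced.shape, induced.shape)
```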

Examples of Semi-Supervised Learning Algorithms:
- Self Training
- Label Propagation
- Graph-based semi-supervised machine learning
- Low-density Separation
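
As an illustration of the first item, self training can be sketched with scikit-learn's `SelfTrainingClassifier`, which wraps a supervised estimator and repeatedly adds its confident predictions as pseudo-labels. The base estimator, the 0.9 threshold, and the 100/200 labeled/unlabeled split are assumptions chosen for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=1)

# Keep labels for the first 100 rows; mark the remaining 200 as unlabeled (-1).
y_partial = y_demo.copy()
y_partial[100:] = -1

# Only predictions with probability >= 0.9 are accepted as pseudo-labels.
self_trainer = SelfTrainingClassifier(LogisticRegression(), threshold=0.9)
self_trainer.fit(X_demo, y_partial)

# transduction_ holds the labels used for the final fit;
# rows that never crossed the threshold stay at -1.
pseudo_labels = self_trainer.transduction_
n_added = int(np.sum(pseudo_labels[100:] != -1))
print(f"Pseudo-labeled {n_added} of 200 unlabeled rows")
```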

Applications of Semi-Supervised Learning:
- Speech Analysis
- Internet Content Classification
- Protein Sequence Classification
- Banking
- Text Document Classifier

Sources:
- https://www.geeksforgeeks.org/ml-semi-supervised-learning/
- https://towardsdatascience.com/semi-supervised-machine-learning-explained-c1a6e1e934c7
- https://machinelearningmastery.com/semi-supervised-learning-with-label-propagation/
- https://www.finsliqblog.com/ai-and-machine-learning/types-of-semi-supervised-algorithms/
- https://scikit-learn.org/stable/modules/semi_supervised.html

In [1]:
# import libraries
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelPropagation

Generate a dataset with the make_classification() function. It has 1500 rows: X holds the two predictor columns and y holds the target variable.

In [2]:
X, y = make_classification(n_samples=1500, n_features=2, n_informative=2, n_redundant=0, random_state=1)

In [3]:
X

array([[ 1.40458883, -1.6400002 ],
       [-0.49361702,  0.36713854],
       [ 0.82885965,  1.18265727],
       ...,
       [ 0.45901053, -1.31034764],
       [-1.73219597,  0.32685592],
       [-1.43388392, -0.49886316]])

Split the data into train and test sets (50:50), then fit a logistic regression on the full labeled training set as a supervised baseline.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=1, stratify=y)

In [5]:
lr = LogisticRegression()
lr.fit(X_train, y_train)

LogisticRegression()

In [6]:
lrPred = lr.predict(X_test)
lrScore = accuracy_score(y_test, lrPred)
lrScore

0.9093333333333333

Next, split the training dataset in half. The first half keeps its labels; the other half is treated as unlabeled.

In [7]:
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=.5, random_state=1, stratify=y_train)

In [8]:
# Each format string has one placeholder, so pass only the feature-array shape.
print("Labeled Train Dataset: {}".format(X_train_lab.shape))
print("Unlabeled Train Dataset: {}".format(X_test_unlab.shape))
print("Test Dataset: {}".format(X_test.shape))

Labeled Train Dataset: (375, 2)
Unlabeled Train Dataset: (375, 2)
Test Dataset: (750, 2)


In the case above:
- The supervised baseline (logistic regression) was trained on all 750 labeled training rows.
- The semi-supervised model has 375 labeled rows and 375 unlabeled rows.

The code below builds the input features for the semi-supervised model by concatenating the labeled and unlabeled rows into a single array, labeled rows first.

In [9]:
X_train_mixed = np.concatenate((X_train_lab, X_test_unlab))

The semi-supervised model is fed both labeled and unlabeled data. The unlabeled rows must be marked with -1, the value scikit-learn's semi-supervised estimators treat as "no label".

In [10]:
noLabel = [-1 for _ in range(len(y_test_unlab))]

Concatenate the labels of the labeled rows with noLabel (the -1 placeholders for the unlabeled rows) into one target array, in the same order as X_train_mixed.

In [11]:
y_train_mixed = np.concatenate((y_train_lab, noLabel))
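
The two concatenation steps can be checked on a toy example. The arrays below are hypothetical stand-ins for the labeled and unlabeled splits, and `np.full` is used here as an alternative way to build the -1 placeholders.

```python
import numpy as np

# Hypothetical tiny arrays standing in for the labeled and unlabeled splits.
X_lab = np.array([[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]])
y_lab = np.array([0, 1, 1])
X_unlab = np.array([[0.1, 0.8], [0.7, 0.2]])

# Stack features: labeled rows first, unlabeled rows after.
X_mixed = np.concatenate((X_lab, X_unlab))

# np.full builds the -1 placeholder array in a single call.
no_label = np.full(len(X_unlab), -1)
y_mixed = np.concatenate((y_lab, no_label))

print(X_mixed.shape)  # (5, 2)
print(y_mixed)        # [ 0  1  1 -1 -1]
```

The row order matters: position i in the target array must describe row i of the feature array.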

Train the semi-supervised model on the entire dataset

In [12]:
lp = LabelPropagation()
lp.fit(X_train_mixed, y_train_mixed)

LabelPropagation()

Predict and get the accuracy score of the model.

In [13]:
lpPred = lp.predict(X_test)

In [14]:
lpScore = accuracy_score(y_test, lpPred)
lpScore

0.916

Retrieve the estimated labels for every training row (labeled and unlabeled) through the .transduction_ attribute of the fitted LabelPropagation model.

In [15]:
transLabels = lp.transduction_
transLabels

array([1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1,
       1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1,
       1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1,
       0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1,
       0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0,
       1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0,
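
Because the true labels of the "unlabeled" half were held back in `y_test_unlab`, the quality of the transduced labels can be checked directly. The sketch below repeats the pipeline so that it runs on its own; `n_lab`, `trans_unlab`, and `trans_acc` are names introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import LabelPropagation

# Rebuild the same splits as in the cells above.
X, y = make_classification(n_samples=1500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.5, random_state=1, stratify=y)
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(
    X_train, y_train, test_size=.5, random_state=1, stratify=y_train)

X_train_mixed = np.concatenate((X_train_lab, X_test_unlab))
y_train_mixed = np.concatenate((y_train_lab, np.full(len(y_test_unlab), -1)))

lp = LabelPropagation()
lp.fit(X_train_mixed, y_train_mixed)

# The labeled rows come first, so the transduced labels for the
# unlabeled rows start at index len(y_train_lab).
n_lab = len(y_train_lab)
trans_unlab = lp.transduction_[n_lab:]

# Compare the inferred labels with the true labels that were hidden.
trans_acc = accuracy_score(y_test_unlab, trans_unlab)
print("Accuracy of transduced labels on the unlabeled half:", trans_acc)
```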

Another option is to fit a supervised model on the estimated labels and measure its accuracy on the test set.

In [16]:
lr = LogisticRegression()
lr.fit(X_train_mixed, transLabels)

LogisticRegression()

In [17]:
lrPred = lr.predict(X_test)

In [18]:
lrScore = accuracy_score(y_test, lrPred)
lrScore

0.9106666666666666
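
For a like-for-like comparison, a supervised model can also be trained on only the 375 labeled rows, which is the same labeled information the semi-supervised model received. This is a sketch of one possible baseline, with `lr_small` and `lr_small_score` as names introduced here; no result is quoted because it was not part of the original run.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Rebuild the same splits as in the cells above.
X, y = make_classification(n_samples=1500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.5, random_state=1, stratify=y)
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(
    X_train, y_train, test_size=.5, random_state=1, stratify=y_train)

# Supervised baseline that sees only the 375 labeled rows.
lr_small = LogisticRegression()
lr_small.fit(X_train_lab, y_train_lab)

lr_small_score = accuracy_score(y_test, lr_small.predict(X_test))
print("Logistic regression on 375 labeled rows:", lr_small_score)
```

Comparing this score with the LabelPropagation score shows how much, if anything, the unlabeled rows contributed.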