<a href="https://colab.research.google.com/github/gupta24789/multilabel-classification/blob/main/02_stackoverflow_scikit_multilearn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Solution for Multi-Label Problem
- Methods for solving Multi-label Classification Problems
    - Problem Transformation
    - Adapted Algorithm
    - Ensemble approaches

#### Problem Transformation
- Transform the multi-label problem into one or more single-label problems, using one of the following:
    - Binary Relevance: treats each label as a separate binary classification problem and trains one classifier per label.
    - Classifier Chains: the first classifier is trained on the input data alone; each subsequent classifier is trained on the input data plus the labels handled by all previous classifiers in the chain.
    - Label Powerset: transforms the problem into a multi-class problem, where a single multi-class classifier is trained on all unique label combinations found in the training data.
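
To make the transformations above concrete, here is a minimal sketch of the targets that Binary Relevance and Label Powerset would train on. The label matrix `Y` is made-up toy data, and the code only illustrates the idea with NumPy; it is not how scikit-multilearn implements these classes internally.

```python
import numpy as np

# Toy label matrix: 5 samples, 3 labels (hypothetical data for illustration).
Y = np.array([
    [1, 0, 0],
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
])

# Binary Relevance: one independent binary target vector per label column.
binary_targets = [Y[:, j] for j in range(Y.shape[1])]

# Label Powerset: each distinct row (label combination) becomes one class id.
combos, powerset_target = np.unique(Y, axis=0, return_inverse=True)
powerset_target = powerset_target.ravel()  # keep it 1-D across NumPy versions

print(len(binary_targets))   # number of binary problems
print(combos)                # unique label combinations seen in the data
print(powerset_target)       # multi-class target, one class id per sample
```

Indexing `combos[powerset_target]` recovers the original label matrix, which is how a predicted class id can be mapped back to a set of labels.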

#### Adapted Algorithm
Adapt the algorithm itself to perform multi-label classification directly, rather than transforming the problem into a set of single-label problems. MLkNN, a multi-label variant of k-nearest neighbors, is one example.
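
As a rough illustration of an estimator that consumes a label matrix directly, the sketch below fits scikit-learn's `KNeighborsClassifier` on a 2-D indicator target. The data is invented, and this is plain per-label neighbor voting, not the MLkNN method imported later in this notebook; it is only meant to show what "no transformation step" looks like in code.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (hypothetical): 6 samples, 2 features, 2 labels.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
Y = np.array([[1, 0], [1, 0], [1, 0],
              [0, 1], [1, 1], [0, 1]])

# KNeighborsClassifier accepts a 2-D label indicator matrix as-is,
# so the multi-label target is passed without any transformation.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, Y)

pred = knn.predict(np.array([[0.05, 0.05], [1.0, 1.0]]))
print(pred)  # one row of 0/1 labels per query point
```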

In [1]:
!pip install -q scikit-multilearn
!pip install -q neattext

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random

# Multi Label Pkgs
import skmultilearn
from skmultilearn.problem_transform import BinaryRelevance
from skmultilearn.problem_transform import ClassifierChain
from skmultilearn.problem_transform import LabelPowerset
from skmultilearn.adapt import MLkNN

## Sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB,MultinomialNB
from sklearn.metrics import accuracy_score,hamming_loss,classification_report

In [3]:
## set seed
random.seed(121)
np.random.seed(121)

In [4]:
df = pd.read_csv("https://raw.githubusercontent.com/gupta24789/multilabel-classification/main/data/stackoverflow_train.csv")
df.head(3)

Unnamed: 0,ID,TITLE,ABSTRACT,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology,Quantitative Finance
0,1,Reconstructing Subject-Specific Effect Maps,Predictive models allow subject-specific inf...,1,0,0,0,0,0
1,2,Rotation Invariance Neural Network,Rotation invariance and translation invarian...,1,0,0,0,0,0
2,3,Spherical polyharmonics and Poisson kernels fo...,We introduce and develop the notion of spher...,0,0,1,0,0,0


In [5]:
df['CONTEXT'] = df['TITLE'] + ". " + df['ABSTRACT']
df.drop(labels=['TITLE', 'ABSTRACT', 'ID'], axis=1, inplace=True)
df = df[['CONTEXT', 'Computer Science', 'Physics', 'Mathematics', 'Statistics',
                     'Quantitative Biology', 'Quantitative Finance',]]
df = df.reset_index(drop = True)

target_list = ['Computer Science', 'Physics', 'Mathematics', 'Statistics',
               'Quantitative Biology', 'Quantitative Finance']

df.head(3)

Unnamed: 0,CONTEXT,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology,Quantitative Finance
0,Reconstructing Subject-Specific Effect Maps. ...,1,0,0,0,0,0
1,Rotation Invariance Neural Network. Rotation...,1,0,0,0,0,0
2,Spherical polyharmonics and Poisson kernels fo...,0,0,1,0,0,0


## Split the data into train & test set

In [6]:
# Split Data
X_train,X_test,y_train,y_test = train_test_split(df['CONTEXT'],df[target_list],test_size=0.3,random_state=42)

In [7]:
y_train['Computer Science'].value_counts()

0    8621
1    6059
Name: Computer Science, dtype: int64

In [8]:
y_train['Physics'].value_counts()

0    10516
1     4164
Name: Physics, dtype: int64

In [9]:
y_train['Mathematics'].value_counts()

0    10773
1     3907
Name: Mathematics, dtype: int64

In [10]:
y_train['Statistics'].value_counts()

0    11028
1     3652
Name: Statistics, dtype: int64

In [11]:
y_train['Quantitative Biology'].value_counts()

0    14287
1      393
Name: Quantitative Biology, dtype: int64

In [12]:
y_train['Quantitative Finance'].value_counts()

0    14498
1      182
Name: Quantitative Finance, dtype: int64
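
The per-label counts above show a strong imbalance: Quantitative Finance has only 182 positives among 14,680 training rows. The same summary can be gathered in one pass instead of one cell per label. The sketch below uses a small made-up frame `y_toy` as a stand-in for `y_train`; the column names are borrowed from the notebook, the values are not.

```python
import pandas as pd

# Hypothetical stand-in for y_train: rows are samples, columns are labels.
y_toy = pd.DataFrame({
    "Computer Science": [1, 1, 0, 1],
    "Physics":          [0, 0, 1, 0],
    "Statistics":       [0, 1, 0, 1],
})

positives = y_toy.sum()                 # positive count per label
positive_rate = y_toy.mean()            # share of samples carrying each label
labels_per_sample = y_toy.sum(axis=1)   # number of labels on each sample

print(positives)
print(positive_rate.round(2))
print(labels_per_sample.mean())         # average number of labels per sample
```

Applied to the real data, `y_train.sum()` would replace the six separate `value_counts()` cells.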

## Text Processing

In [13]:
import neattext as nt
import neattext.functions as nfx

In [14]:
X_train.apply(lambda x:nt.TextFrame(x).noise_scan())

8037     {'text_noise': 9.961685823754788, 'text_length...
18469    {'text_noise': 8.805418719211824, 'text_length...
5028     {'text_noise': 6.837606837606838, 'text_length...
11658    {'text_noise': 8.045977011494253, 'text_length...
280      {'text_noise': 7.676348547717843, 'text_length...
                               ...                        
11284    {'text_noise': 8.375634517766498, 'text_length...
11964    {'text_noise': 9.623095429029672, 'text_length...
5390     {'text_noise': 7.76536312849162, 'text_length'...
860      {'text_noise': 9.69597370583402, 'text_length'...
15795    {'text_noise': 8.489795918367347, 'text_length...
Name: CONTEXT, Length: 14680, dtype: object

In [15]:
X_train.apply(lambda x:nt.TextExtractor(x).extract_stopwords())

8037     [show, that, are, more, than, and, that, these...
18469    [the, of, he, no, but, a, we, the, and, of, th...
5028     [the, for, four, the, that, does, not, to, a, ...
11658    [the, of, to, have, that, could, enough, to, t...
280      [of, the, of, is, to, the, from, a, to, on, th...
                               ...                        
11284    [for, in, the, the, that, the, of, the, to, be...
11964    [of, at, the, of, we, the, and, of, in, a, the...
5390     [for, by, in, are, that, can, be, at, if, are,...
860      [in, a, of, at, by, of, or, at, a, in, which, ...
15795    [the, using, is, one, of, the, most, to, this,...
Name: CONTEXT, Length: 14680, dtype: object

In [16]:
X_train.apply(nfx.remove_stopwords)

8037     Modelling Luminous-Blue-Variable Isolation. Ob...
18469    MUSE view 2-10: AGN ionization sparkling starb...
5028     Frobenius problem numerical semigroups. greate...
11658    Constraining contribution active galactic nucl...
280      Non-Parametric Calibration Probabilistic Regre...
                               ...                        
11284    Tensor Methods Nonlinear Matrix Completion. lo...
11964    Numerical Simulations Collisional Cascades Roc...
5390     Possible evidence spin-transfer torque induced...
860      Common Knowledge Logic Gossips. Gossip protoco...
15795    Trust-Based Collaborative Filtering: Tackling ...
Name: CONTEXT, Length: 14680, dtype: object

In [17]:
train_corpus = X_train.apply(nfx.remove_stopwords)
test_corpus = X_test.apply(nfx.remove_stopwords)

## TF-IDF Vectorizer

In [18]:
tfidf = TfidfVectorizer()

In [19]:
tfidf.get_params()

{'analyzer': 'word',
 'binary': False,
 'decode_error': 'strict',
 'dtype': numpy.float64,
 'encoding': 'utf-8',
 'input': 'content',
 'lowercase': True,
 'max_df': 1.0,
 'max_features': None,
 'min_df': 1,
 'ngram_range': (1, 1),
 'norm': 'l2',
 'preprocessor': None,
 'smooth_idf': True,
 'stop_words': None,
 'strip_accents': None,
 'sublinear_tf': False,
 'token_pattern': '(?u)\\b\\w\\w+\\b',
 'tokenizer': None,
 'use_idf': True,
 'vocabulary': None}

In [20]:
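# Build features
# Note: .toarray() converts the sparse TF-IDF matrix to a dense array, which is
# memory-hungry for a large vocabulary; test_features stays sparse
train_features = tfidf.fit_transform(train_corpus).toarray()
test_features = tfidf.transform(test_corpus)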

In [21]:
train_features.shape, test_features.shape

((14680, 47279), (6292, 47279))
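
A short sketch of why the dense conversion matters at this size. The three-document corpus is invented and only shows the return types; the last lines do the arithmetic for the training shape reported above, assuming 8 bytes per `float64` cell.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny hypothetical corpus, only to show the return types.
docs = [
    "rotation invariance neural network",
    "spherical harmonics poisson kernels",
    "neural network calibration regression",
]

X_sparse = TfidfVectorizer().fit_transform(docs)   # SciPy sparse matrix
X_dense = X_sparse.toarray()                       # NumPy array, zeros included

print(X_sparse.shape, X_sparse.nnz)   # only non-zero entries are stored
print(X_dense.size)                   # every cell is stored, zero or not

# Dense footprint for the (14680, 47279) training matrix at 8 bytes per cell.
rows, cols = 14680, 47279
dense_gb = rows * cols * 8 / 1e9
print(round(dense_gb, 2))
```

At roughly 5.5 GB for the dense training matrix, keeping the sparse output of `fit_transform` may be the more practical choice on memory-limited runtimes.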

In [22]:
train_features

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [23]:
dir(skmultilearn)

['__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 'adapt',
 'base',
 'problem_transform',
 'utils']

## Binary Relevance classification
  - Convert the multi-label problem into one independent binary classification problem per label

In [24]:
# Convert the multi-label problem into one binary problem per label
# and fit a separate MultinomialNB classifier for each
binary_rel_clf = skmultilearn.problem_transform.BinaryRelevance(MultinomialNB())
binary_rel_clf.fit(train_features,y_train)

In [25]:
# Predictions
br_prediction = binary_rel_clf.predict(test_features)
br_prediction

<6292x6 sparse matrix of type '<class 'numpy.int64'>'
	with 5230 stored elements in Compressed Sparse Column format>

In [26]:
# Subset accuracy: a sample counts as correct only if all six labels match
accuracy_score(y_test,br_prediction)

0.5516528925619835

In [27]:
# Hamming loss: fraction of individual label assignments that are wrong
# The lower the result, the better
hamming_loss(y_test,br_prediction)

0.10245814791269336
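
The two metrics above measure different things, which is why an accuracy near 0.55 can coexist with a Hamming loss near 0.10. The sketch below recomputes both by hand on a made-up pair of indicator matrices; for multi-label indicator inputs these are the same quantities that `accuracy_score` and `hamming_loss` report.

```python
import numpy as np

# Hypothetical ground truth and predictions: 4 samples, 3 labels.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 1],
                   [0, 1, 1],
                   [1, 0, 0],
                   [0, 0, 1]])

# Subset accuracy: share of rows where every label is predicted correctly.
subset_accuracy = np.mean(np.all(y_true == y_pred, axis=1))

# Hamming loss: share of individual cells that are predicted incorrectly.
hamming = np.mean(y_true != y_pred)

print(subset_accuracy)
print(hamming)
```

Two of the four rows contain a single wrong label each, so half the rows fail the exact-match test while only 2 of 12 cells are wrong.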

## Classifier Chains

- Preserves label correlations by passing earlier labels in the chain to later classifiers as extra features

In [28]:
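def build_model(model,mlb_estimator,xtrain,ytrain,xtest,ytest):
    """Wrap `model` in a multi-label estimator, fit it, and print test metrics."""
    # Create an instance of the problem-transformation wrapper
    clf = mlb_estimator(model)
    clf.fit(xtrain,ytrain)
    # Predict on the test features
    clf_predictions = clf.predict(xtest)
    # Subset accuracy (exact match) and Hamming loss (lower is better)
    acc = accuracy_score(ytest,clf_predictions)
    ham = hamming_loss(ytest,clf_predictions)
    # Note: the "hamming_score" key below holds the Hamming loss
    print({"accuracy:":acc,"hamming_score":ham})
    return clf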

In [29]:
clf_chain_model = build_model(MultinomialNB(),ClassifierChain,train_features,y_train,test_features,y_test)

{'accuracy:': 0.5684996821360457, 'hamming_score': 0.09877622377622378}
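
For intuition about what a classifier chain does, here is a hand-rolled sketch built from scikit-learn's `LogisticRegression`. The helper names `fit_chain` and `predict_chain` and the synthetic data are assumptions made for illustration; this is not scikit-multilearn's `ClassifierChain` implementation, only the general idea of feeding earlier labels to later links.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_chain(X, Y):
    """Fit one classifier per label; each link also sees the earlier labels."""
    models = []
    X_aug = X
    for j in range(Y.shape[1]):
        model = LogisticRegression(max_iter=1000).fit(X_aug, Y[:, j])
        models.append(model)
        # During training, the true label is appended for the next link.
        X_aug = np.column_stack([X_aug, Y[:, j]])
    return models

def predict_chain(models, X):
    """Predict label by label, feeding each prediction to the next link."""
    X_aug = X
    preds = []
    for model in models:
        p = model.predict(X_aug)
        preds.append(p)
        X_aug = np.column_stack([X_aug, p])
    return np.column_stack(preds)

# Synthetic data (hypothetical): label 1 depends partly on label 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y0 = (X[:, 0] > 0).astype(int)
y1 = ((X[:, 1] > 0) | (y0 == 1)).astype(int)
Y = np.column_stack([y0, y1])

models = fit_chain(X, Y)
Y_hat = predict_chain(models, X)
print(Y_hat.shape)
```

The second link is trained on five columns (four features plus label 0), which is where the label dependency can be picked up.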


## LabelPowerset

In [30]:
clf_labelP_model = build_model(MultinomialNB(),LabelPowerset,train_features,y_train,test_features,y_test)

{'accuracy:': 0.6028289891926255, 'hamming_score': 0.11024581479126934}


In [31]:
clf_labelP_model = build_model(LogisticRegression(),LabelPowerset,train_features,y_train,test_features,y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


{'accuracy:': 0.6760966306420851, 'hamming_score': 0.07994278448823904}


## Predict

In [39]:
# Pick a random test example and show its true labels
index = X_test.sample().index[0]
example = X_test.loc[index]
label = y_test.loc[index]
print("Example : ", example)
print("Label : ", label[label==1].index.tolist())

# Apply the same stopword removal used for the training corpus before vectorizing
labelfeatures = tfidf.transform([nfx.remove_stopwords(example)])
predict = clf_labelP_model.predict(labelfeatures).toarray()[0]
print("Prediction : ",[target_list[i] for i, value in enumerate(predict) if value==1])

Example :  Stochastic Canonical Correlation Analysis.   We tightly analyze the sample complexity of CCA, provide a learning algorithm
that achieves optimal statistical performance in time linear in the required
number of samples (up to log factors), as well as a streaming algorithm with
similar guarantees.

Label :  ['Computer Science', 'Statistics']
Prediction :  ['Computer Science', 'Statistics']
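
The mapping from a predicted indicator row back to label names can be pulled into a small helper. `decode_labels` is a hypothetical name introduced here; `target_list` repeats the label order used earlier in the notebook, and the input row is a hand-written example, not model output.

```python
target_list = ['Computer Science', 'Physics', 'Mathematics', 'Statistics',
               'Quantitative Biology', 'Quantitative Finance']

def decode_labels(indicator_row, names):
    """Return the names whose position holds a 1 in the indicator row."""
    return [name for name, flag in zip(names, indicator_row) if flag == 1]

# Hand-written indicator row with labels 0 and 3 switched on.
decoded = decode_labels([1, 0, 0, 1, 0, 0], target_list)
print(decoded)
```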
