# Text Classification - One-Class Classification

One-class algorithms are recognition-based: they aim to recognize data from one particular class and reject data from all other classes. They do this by learning a boundary that encloses the data belonging to the target class. When a new sample arrives, the algorithm only has to check whether it lies inside or outside that boundary, and classifies it accordingly as a member of the target class or as an outlier.
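
To make the boundary idea concrete, here is a minimal sketch on synthetic 2-D points. It is not the method used later in this notebook: the circular boundary, the 95% radius, and the names `centroid`, `radius`, and `classify` are assumptions chosen only for illustration.

```python
import numpy as np

# Synthetic "target class": 200 points clustered around the origin.
rng = np.random.default_rng(0)
target = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# Hypothetical boundary: a circle around the centroid whose radius
# covers 95% of the training points.
centroid = target.mean(axis=0)
distances = np.linalg.norm(target - centroid, axis=1)
radius = np.quantile(distances, 0.95)

def classify(sample):
    """Return 1 if the sample lies inside the boundary, otherwise -1."""
    dist = np.linalg.norm(np.asarray(sample) - centroid)
    return 1 if dist <= radius else -1

# The centroid is inside by construction; a point far away is outside.
inside = classify(centroid)
far_point = centroid + np.array([10.0 * radius, 0.0])
outside = classify(far_point)
print(inside, outside)
```

A real one-class model learns a more flexible boundary than a circle, but the prediction step is the same kind of inside/outside check.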

Things we are going to discuss:

1. Data Preparation 
2. Cleaning and Tokenization
3. Feature Extraction
4. Train a one-class classification model
5. Predict on test data

In [None]:
# Load packages
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.svm import OneClassSVM
from sklearn.utils import shuffle
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix 
from sklearn.metrics import classification_report 
from nltk.corpus import stopwords
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.stem.porter import PorterStemmer
import string
import spacy
from spacy.lang.en import English
# blank English pipeline; tokenizeText below uses it to split text into tokens
parser = English()

In [None]:
# load dataset
bbc_df = pd.read_csv('../input/bbc-text.csv')

In [None]:
bbc_df.head(10)

In [None]:
bbc_df.shape

In [None]:
bbc_df.info()

In [None]:
bbc_df['category'].unique()

In [None]:
bbc_df['category'].value_counts()

In [None]:
sns.countplot(x='category', data=bbc_df)

## Data preparation

Let's take the "sport" category as our training class for one-class classification and replace the category labels accordingly.

Since "sport" is our training class, we replace "sport" with 1 and replace "business", "politics", "tech" and "entertainment" with -1, because a one-class classification model predicts either 1 or -1. Here 1 is the target class and -1 is the outlier.

In [None]:
# change category labels
bbc_df['category'] = bbc_df['category'].map({'sport':1,'business':-1,'politics':-1,'tech':-1,'entertainment':-1})

In [None]:
# create a new dataset with only sport category data
sports_df = bbc_df[bbc_df['category'] == 1]

In [None]:
sports_df.shape

In [None]:
# create train and test data
# (the test set is the full dataset, so it also contains the training texts)
train_text = sports_df['text'].tolist()
train_labels = sports_df['category'].tolist()

test_text = bbc_df['text'].tolist()
test_labels = bbc_df['category'].tolist()
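
In the cell above, `test_text` is built from the full `bbc_df`, so it includes the sport texts used for training. One possible variation, sketched here on a tiny hypothetical DataFrame rather than the real data, is to hold out part of the sport rows so that the test set contains only unseen texts. The names `toy_df`, `train_text_alt`, `test_text_alt`, and `test_labels_alt` are made up for this sketch.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in for bbc_df after the label mapping (hypothetical rows).
toy_df = pd.DataFrame({
    "text": [f"sport story {i}" for i in range(8)]
            + [f"other story {i}" for i in range(4)],
    "category": [1] * 8 + [-1] * 4,
})

sport_rows = toy_df[toy_df["category"] == 1]
other_rows = toy_df[toy_df["category"] == -1]

# Hold out a quarter of the sport rows; they are never seen during training.
sport_train, sport_holdout = train_test_split(
    sport_rows, test_size=0.25, random_state=0
)

# Test set = held-out sport rows plus all non-sport rows.
holdout_test = pd.concat([sport_holdout, other_rows])

train_text_alt = sport_train["text"].tolist()
test_text_alt = holdout_test["text"].tolist()
test_labels_alt = holdout_test["category"].tolist()
```

With this split, scores on the test set are not inflated by texts the model was fitted on.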

## Data Cleaning and Tokenization

In [None]:
# stop words list
STOPLIST = set(stopwords.words('english') + list(ENGLISH_STOP_WORDS)) 
# special characters
SYMBOLS = list(string.punctuation) + ["-", "...", "“", "”", "''"]

In [None]:
# class for cleaning the text
class CleanTextTransformer(TransformerMixin):
    def transform(self, X, **transform_params):
        return [cleanText(text) for text in X]
    def fit(self, X, y=None, **fit_params):
        return self
    def get_params(self, deep=True):
        return {}

def cleanText(text):
    text = text.strip().replace("\n", " ").replace("\r", " ")
    text = text.lower()
    return text

In [None]:
# tokenizing the raw text
def tokenizeText(sample):
    
    tokens = parser(sample)
    
    # lemmatization
    lemmas = []
    for tok in tokens:
        lemmas.append(tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_)
    tokens = lemmas
    
    # remove stop words and special characters
    tokens = [tok for tok in tokens if tok.lower() not in STOPLIST]
    tokens = [tok for tok in tokens if tok not in SYMBOLS]
    
    # only take words with length greater than or equal to 3
    tokens = [tok for tok in tokens if len(tok) >= 3]
    
    # remove remaining tokens that are not alphabetic
    tokens = [tok for tok in tokens if tok.isalpha()]
    
    # stemming of words
    porter = PorterStemmer()
    tokens = [porter.stem(word) for word in tokens]
    
    return list(set(tokens))

In [None]:
# let's look at the tokens of one training text
tokenizeText(train_text[9])
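
The filtering steps inside `tokenizeText` can be tried in isolation. The sketch below is a simplified, dependency-free stand-in: it skips lemmatization and stemming, splits on whitespace instead of using spaCy, and uses a hypothetical mini stop list (`TOY_STOPLIST`) in place of the combined NLTK and scikit-learn lists.

```python
# Hypothetical mini stop list; the notebook uses a much larger one.
TOY_STOPLIST = {"the", "a", "in", "of", "and", "to"}

def filter_tokens(tokens):
    """Apply the stop-word, length, and alphabetic filters in order."""
    tokens = [tok.lower() for tok in tokens]
    # remove stop words
    tokens = [tok for tok in tokens if tok not in TOY_STOPLIST]
    # keep only tokens with at least 3 characters
    tokens = [tok for tok in tokens if len(tok) >= 3]
    # drop tokens that contain digits or punctuation
    tokens = [tok for tok in tokens if tok.isalpha()]
    return tokens

sample = "The striker scored 2 goals in the 90th minute !".split()
filtered = filter_tokens(sample)
print(filtered)  # ['striker', 'scored', 'goals', 'minute']
```

Unlike `tokenizeText`, this sketch keeps token order and duplicates, because it does not pass the result through `set`.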

## Feature Extraction

In [None]:
# getting features
vectorizer = HashingVectorizer(n_features=20,tokenizer=tokenizeText)

features = vectorizer.fit_transform(train_text).toarray()
features.shape
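
A short sketch of what `HashingVectorizer` does with `n_features=20`, using two made-up sentences and the default tokenizer instead of `tokenizeText`. Every token is hashed into one of 20 columns, so the output always has 20 columns regardless of vocabulary size, and different words can collide in the same column.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Two made-up documents, for illustration only.
docs = ["the striker scored two goals", "shares fell on the stock market"]

# HashingVectorizer is stateless: there is no vocabulary to learn,
# so transform() works without a prior fit().
hv = HashingVectorizer(n_features=20)
X = hv.transform(docs)

print(X.shape)  # (2, 20): one row per document, one column per hash bucket

# Transforming the same documents again gives the same features.
X_again = hv.transform(docs)
same = (X.toarray() == X_again.toarray()).all()
```

A larger `n_features` reduces collisions at the cost of wider feature vectors; 20 is a very small value for a news corpus, so it may be worth experimenting with it.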

## One-class SVM

One-class SVM is an unsupervised algorithm that learns a decision function for novelty detection: classifying new data as similar or different to the training set.
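
As a small illustration of this behaviour, the sketch below fits `OneClassSVM` on synthetic 2-D points instead of text features and then scores two points far from the training cloud. The data and the variable names are assumptions made for the example; the hyperparameters are the ones used in this notebook.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic training set: one cluster around the origin.
rng = np.random.default_rng(42)
X_train = rng.normal(loc=0.0, scale=1.0, size=(300, 2))

# Two points far away from anything seen during training.
novel = np.array([[8.0, 8.0], [-9.0, 7.5]])

# fit() takes only X; no labels are needed.
clf = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1).fit(X_train)

train_pred = clf.predict(X_train)   # values are 1 (inlier) or -1 (outlier)
novel_pred = clf.predict(novel)
novel_scores = clf.decision_function(novel)  # negative means outside

# Share of training points the model accepts as inliers.
inlier_rate = (train_pred == 1).mean()
print(novel_pred, inlier_rate)
```

According to the scikit-learn documentation, `nu` is an upper bound on the fraction of training errors, so with `nu=0.1` roughly a tenth of the training points can end up labelled -1.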

In [None]:
# OneClassSVM algorithm
clf = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
pipe_clf = Pipeline([('cleanText', CleanTextTransformer()), ('vectorizer', vectorizer), ('clf', clf)])

In [None]:
# fit OneClassSVM model 
# (the labels are passed for API consistency; OneClassSVM ignores them)
pipe_clf.fit(train_text, train_labels)

In [None]:
# validate OneClassSVM model with train set
preds_train = pipe_clf.predict(train_text)

print("accuracy:", accuracy_score(train_labels, preds_train))

In [None]:
# validate OneClassSVM model with test set
preds_test = pipe_clf.predict(test_text)
preds_test

In [None]:
results = confusion_matrix(test_labels, preds_test) 
print('Confusion Matrix :')
print(results) 
print('Accuracy Score :',accuracy_score(test_labels, preds_test)) 
print('Report : ')
print(classification_report(test_labels, preds_test)) 
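
To help read the confusion matrix, here is a sketch with a handful of made-up labels. The `labels=[-1, 1]` argument is added here to fix the row and column order explicitly: rows are actual classes, columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = sport (target class), -1 = outlier.
y_true = [1, 1, 1, -1, -1, -1, -1, -1]
y_pred = [1, 1, -1, -1, -1, -1, 1, 1]

# Row/column order is [-1, 1].
cm = confusion_matrix(y_true, y_pred, labels=[-1, 1])
print(cm)
# [[3 2]
#  [1 2]]

# With this order, ravel() yields tn, fp, fn, tp for the target class 1.
tn, fp, fn, tp = cm.ravel()
precision_sport = tp / (tp + fp)  # of texts predicted sport, how many are sport
recall_sport = tp / (tp + fn)     # of actual sport texts, how many were found
```

Because four of the five categories are mapped to -1, precision and recall for class 1 are usually more informative than accuracy alone.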

Let's check how the model performs on a single example.

In [None]:
# let's take one sample text from the dataset
test_text[3]

In [None]:
# check actual category
test_labels[3]

In [None]:
# let's predict the category of the sample text above
pipe_clf.predict([test_text[3]])

Our model predicted the sport category for this text, which is correct.