To create a classifier that predicts multiple features of the author of a given text, designed as a multi-label classification problem.
The goal is to build an NLP classifier that uses the input text to determine the label(s) of a blog post.

The dataset is taken from https://www.kaggle.com/rtatman/blog-authorship-corpus

In [28]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [29]:
import os 
os.chdir('/content/drive/MyDrive/Colab Notebooks')

#### Read the csv using pandas

In [30]:
import pandas as pd
import numpy as np

df = pd.read_csv('blogtext.csv')

Data set info

#### Get the names of the columns

In [31]:
df.head(5)

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


In [32]:
df.shape

(681284, 7)

In [33]:
df.columns

Index(['id', 'gender', 'age', 'topic', 'sign', 'date', 'text'], dtype='object')

EDA

Check for NULL values in data

In [34]:
df.isnull().sum()

id        0
gender    0
age       0
topic     0
sign      0
date      0
text      0
dtype: int64

Fitting the model on the full DataFrame crashes the Google Colab session,
so the data is limited to the first 10,000 rows.

In [35]:
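# .copy() gives an independent frame, so the column assignments below do not act on a slice of the original
df = df.head(10000).copy()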
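
Taking the first 10,000 rows keeps whatever ordering the CSV has, so the subset may be dominated by a small number of authors. A minimal sketch of one alternative, a seeded random sample, shown on a small made-up frame (the toy data and the names `toy`, `first_rows` and `subset` are assumptions for illustration, not part of the notebook):

```python
import pandas as pd

# Toy frame standing in for the blog corpus: rows are grouped by author id.
toy = pd.DataFrame({
    "id": [1] * 6 + [2] * 2 + [3] * 2,
    "text": [f"post {i}" for i in range(10)],
})

# head() keeps file order, so the first rows all come from author 1.
first_rows = toy.head(4)

# sample() draws rows at random; random_state makes the draw repeatable.
subset = toy.sample(n=4, random_state=1).reset_index(drop=True)

print(first_rows["id"].nunique())  # distinct authors in the head() subset
print(len(subset))                 # number of sampled rows
```

Whether a random sample changes the scores reported later has not been checked here.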

#### Perform data pre-processing

Clean the data by removing unwanted characters, extra spaces and stop words, and convert the text to lowercase.

In [36]:
# Keep only letters; every run of other characters becomes a single space
import re
df.text = df.text.apply(lambda x: re.sub(r'[^A-Za-z]+', ' ', x))

# Convert text to lowercase
df.text = df.text.apply(lambda x: x.lower())

# Strip unwanted spaces
df.text = df.text.apply(lambda x: x.strip())



In [37]:
# Remove stopwords
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stopwords = set(stopwords.words('english'))
df.text = df.text.apply(lambda x: ' '.join([word for word in x.split() if word not in stopwords]))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
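
The cleaning steps above can also be folded into one function, which makes them easy to try on a single string. A minimal sketch; the function name `clean_text` and the tiny stop-word set are assumptions for illustration (the notebook uses NLTK's English list):

```python
import re

# Small stand-in stop-word set; the notebook uses nltk's English stop words.
STOP_WORDS = {"the", "is", "a", "to", "and"}

def clean_text(text, stop_words=STOP_WORDS):
    """Keep letters only, lowercase, trim, and drop stop words."""
    text = re.sub(r"[^A-Za-z]+", " ", text)   # non-letters -> single space
    text = text.lower().strip()
    return " ".join(w for w in text.split() if w not in stop_words)

cleaned = clean_text("Testing!!! The URL is http://example.com, visit 2 see.")
print(cleaned)  # testing url http example com visit see
```

Note that digits and punctuation are dropped entirely, so URLs are reduced to their letter fragments.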


In [38]:
print(stopwords)

{'in', "you're", 'themselves', "doesn't", 'out', 'over', 'himself', 'didn', 'just', 'd', "haven't", 'mightn', 'haven', 'about', "you'll", 'what', 'of', 'which', 'off', 'ain', 'because', 'any', 'has', 'all', 'below', "should've", 'm', 'we', 'your', 'yours', 'further', 'so', 'my', 'both', 'needn', 'here', 'between', 'during', "needn't", 'then', "that'll", 'those', 'above', 'did', 'whom', 'if', 'theirs', 'than', 'again', 'him', "shan't", 'being', 'y', 'shouldn', 'at', "couldn't", 'doesn', 'them', 'had', 'she', 'up', 'does', 'hasn', 'when', 'other', "she's", 'now', 'do', "hadn't", 'on', 'as', 'should', 'wouldn', 'it', 'having', 'into', 'hadn', 'why', 'was', 'only', 'our', 'too', "mightn't", 'before', 'its', 't', 'few', "won't", 'some', "didn't", 'are', 'o', 'each', 'am', "isn't", 'mustn', 'there', 'his', 'such', 'don', 'itself', 'll', 'me', 'ma', 'down', "aren't", 'be', 'were', 'weren', 'a', "weren't", 'no', 'will', 'most', "you'd", 'isn', 'hers', "it's", 'where', 'they', 'her', 'shan', 'b

Check a sample value

In [39]:
df.text[10]

'ah korean language looks difficult first figure read hanguel korea surprisingly easy learn alphabet characters seems easy vocabulary starts oh backwards us sentence structure yikes luckily many options us slow witted foreigners take language course could list urllink joongang article says lot resources urllink well guy motivation jeon ji hyun latest something actually star movies cfs hear means commercial feature positive saw latest movie sunday night hard describe name english version windstruck korean version yeochinso short ne yeojachingu rul sogayhamnida like introduce girlfriend surprisingly titles make sense like website korean english looks quite good actually urllink movie shown theatres subtitles special times info urllink list many theatres seoul click urllink urllink great reason learn korean already married went foreigners well local korean national course korean take picture put urllink movie hof bar update bud mine passed urllink link giordano ad apparently aired korea n

#### Target/label merger and transformation

Merge all label columns into a single list per row

In [40]:
df['labels'] = df.apply(lambda row: [row['gender'], str(row['age']), row['topic'], row['sign']], axis=1)
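
As a quick check of what the merged column holds, the same merge can be written without a row-wise `apply`. A minimal sketch on a two-row toy frame (the values are taken from the `df.head(5)` output above; the column-slicing approach is an alternative, not the notebook's method):

```python
import pandas as pd

toy = pd.DataFrame({
    "gender": ["male", "male"],
    "age": [15, 33],
    "topic": ["Student", "InvestmentBanking"],
    "sign": ["Leo", "Aquarius"],
})

# Cast every label column to str (age is numeric), then turn each row into a list.
label_cols = ["gender", "age", "topic", "sign"]
rows_as_lists = toy[label_cols].astype(str).values.tolist()
toy["labels"] = pd.Series(rows_as_lists, index=toy.index)

print(toy["labels"].tolist())
```

Casting `age` to a string matters because the binarizer later treats every label as a category name.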

In [41]:
df.head(2)

Unnamed: 0,id,gender,age,topic,sign,date,text,labels
0,2059027,male,15,Student,Leo,"14,May,2004",info found pages mb pdf files wait untill team...,"[male, 15, Student, Leo]"
1,2059027,male,15,Student,Leo,"13,May,2004",team members drewes van der laag urllink mail ...,"[male, 15, Student, Leo]"


#### Using text and label info

In [42]:
df = df[['text','labels']]

In [43]:
df.head(5)

Unnamed: 0,text,labels
0,info found pages mb pdf files wait untill team...,"[male, 15, Student, Leo]"
1,team members drewes van der laag urllink mail ...,"[male, 15, Student, Leo]"
2,het kader van kernfusie op aarde maak je eigen...,"[male, 15, Student, Leo]"
3,testing testing,"[male, 15, Student, Leo]"
4,thanks yahoo toolbar capture urls popups means...,"[male, 33, InvestmentBanking, Aquarius]"


#### Train and test split

In [44]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.text.values, df.labels.values, test_size=0.20, random_state=1)

#### Vectorization

#### Create BOW

In [45]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)
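
To see what `binary=True` and `ngram_range=(1, 2)` produce, here is a minimal sketch on two made-up documents (the documents and the names `docs`, `vec` and `vocab` are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["blog post blog", "short post"]

# binary=True records presence (0/1) rather than counts;
# ngram_range=(1, 2) adds word pairs alongside single words.
vec = CountVectorizer(binary=True, ngram_range=(1, 2))
X = vec.fit_transform(docs)

vocab = sorted(vec.vocabulary_)   # feature names in column order
print(vocab)
print(X.toarray())
```

The word "blog" occurs twice in the first document but is still recorded as 1, and each adjacent word pair becomes its own feature, which is why the vocabulary grows quickly on real text.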

#### Checking feature names

In [46]:
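# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
vectorizer.get_feature_names()[:10]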

['aa',
 'aa amazing',
 'aa anger',
 'aa compared',
 'aa keeps',
 'aa sd',
 'aaa',
 'aaa come',
 'aaa rated',
 'aaa someone']

#### Get label counts

In [47]:
label_counts = dict()

for labels in df.labels.values:
    for label in labels:
        if label in label_counts:
            label_counts[label] += 1
        else:
            label_counts[label] = 1
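
The same counting can be done with the standard library's `Counter`. A minimal sketch on made-up label lists (the names `toy_labels` and `toy_counts` are assumptions for illustration):

```python
from collections import Counter
from itertools import chain

toy_labels = [
    ["male", "15", "Student", "Leo"],
    ["male", "33", "InvestmentBanking", "Aquarius"],
    ["male", "15", "Student", "Leo"],
]

# Flatten the per-row label lists and count every occurrence of each label.
toy_counts = Counter(chain.from_iterable(toy_labels))
print(toy_counts)
```

A `Counter` behaves like a dict, so `sorted(toy_counts.keys())` works the same way as with the hand-built dictionary.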

#### Mapping

In [48]:
label_counts

{'13': 42,
 '14': 212,
 '15': 602,
 '16': 440,
 '17': 1185,
 '23': 253,
 '24': 655,
 '25': 386,
 '26': 234,
 '27': 1054,
 '33': 136,
 '34': 553,
 '35': 2315,
 '36': 1708,
 '37': 33,
 '38': 46,
 '39': 79,
 '40': 1,
 '41': 20,
 '42': 14,
 '43': 6,
 '44': 3,
 '45': 16,
 '46': 7,
 'Accounting': 4,
 'Aquarius': 571,
 'Aries': 4198,
 'Arts': 45,
 'Automotive': 14,
 'Banking': 16,
 'BusinessServices': 91,
 'Cancer': 504,
 'Capricorn': 215,
 'Communications-Media': 99,
 'Consulting': 21,
 'Education': 270,
 'Engineering': 127,
 'Fashion': 1622,
 'Gemini': 150,
 'HumanResources': 2,
 'Internet': 118,
 'InvestmentBanking': 70,
 'Law': 11,
 'LawEnforcement-Security': 10,
 'Leo': 301,
 'Libra': 491,
 'Marketing': 156,
 'Museums-Libraries': 17,
 'Non-Profit': 71,
 'Pisces': 454,
 'Publishing': 4,
 'Religion': 9,
 'Sagittarius': 1097,
 'Science': 63,
 'Scorpio': 971,
 'Sports-Recreation': 80,
 'Student': 1137,
 'Taurus': 812,
 'Technology': 2654,
 'Telecommunications': 2,
 'Virgo': 236,
 'female': 4

#### Multi label binarizer

In [49]:
from sklearn.preprocessing import MultiLabelBinarizer

binarizer = MultiLabelBinarizer(classes=sorted(label_counts.keys()))
y_train = binarizer.fit_transform(y_train)
y_test = binarizer.transform(y_test)
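
A minimal sketch of what the binarizer does, on two made-up label lists (the toy labels and the name `mlb` are assumptions for illustration):

```python
from sklearn.preprocessing import MultiLabelBinarizer

toy_labels = [["male", "15", "Leo"], ["female", "15", "Aries"]]

# Fixing `classes` pins the column order, as the notebook does with label_counts.
classes = sorted({label for row in toy_labels for label in row})
mlb = MultiLabelBinarizer(classes=classes)
Y = mlb.fit_transform(toy_labels)

print(list(mlb.classes_))          # column order
print(Y)                           # one row per text, one column per label
print(mlb.inverse_transform(Y))    # back to tuples of label names
```

Each row of `Y` has a 1 in every column whose label applies to that text, which is the target format `OneVsRestClassifier` expects for multi-label training.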

#### 3. Design, train, tune and test the best text classifier.

Use a linear classifier of your choice and wrap it in OneVsRestClassifier, which trains one binary classifier per label.

In [50]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='lbfgs')
model = OneVsRestClassifier(model)

#### Fit the classifier

In [53]:
model.fit(X_train_bow, y_train)

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)

#### Predictions
- Calculate the predicted labels and the decision scores

In [54]:
predict_labels = model.predict(X_test_bow)
predict_scores = model.decision_function(X_test_bow)
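
The relationship between the two outputs can be checked on a small example. A minimal sketch, assuming made-up features and labels (the arrays `X` and `Y` and the name `clf` are illustrative, not the notebook's data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy features and a 2-label indicator matrix (made up for illustration).
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
              [0.0, 0.0], [2.0, 0.5], [0.5, 2.0]])
Y = np.array([[0, 1], [1, 0], [1, 1], [0, 0], [1, 0], [0, 1]])

clf = OneVsRestClassifier(LogisticRegression(solver="lbfgs")).fit(X, Y)

scores = clf.decision_function(X)   # one signed score per (sample, label)
labels = clf.predict(X)             # one 0/1 decision per (sample, label)

# With this estimator, predict() should mark a label wherever its score is positive.
print(np.array_equal(labels, (scores > 0).astype(int)))
```

Keeping the scores around makes it possible to try thresholds other than 0, for example to trade precision against recall on rare labels.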

#### Calculating inverse transform for predicted labels and test labels

In [55]:
pred_inversed = binarizer.inverse_transform(predict_labels)
y_test_inversed = binarizer.inverse_transform(y_test)

#### Print the true vs predicted labels for dataset

In [56]:
for i in range(10):
    print('Title:\t{}\nActual labels:\t{}\nPredict labels:\t{}\n\n'.format(
        X_test[i],
        ','.join(y_test_inversed[i]),
        ','.join(pred_inversed[i])
    ))

Title:	sucky mcsuckmeister nbsp freaking hate geoff nbsp gave decision love want get respect nbsp get hear loves every single day nbsp love end phone msn convos nbsp nbsp otherwise die nbsp never able fall love anyone else cause still hung geoff hate hate hate nbsp hate
Actual labels:	16,Capricorn,female,indUnk
Predict labels:	female


Title:	staunch religious conservative republicans much problem annoying pussy liberals republicans longer party devoted freedom mainly religious party governs us based religious philosiphies even besides religion longer believe gun rights made patriot act claim small government quite opposite back religion annoying radical christian government tell country sex bad forth john ashcroft like epitome religious authoritarian conservative ashcroft spearheaded huge anti porn campaign supported way tighter fcc regulations things majorly piss whole christian crusade abortion sex porn drugs name belong national policy enough annoying fundamentalists imposing hey c

#### 4. Display and explain in detail the classification report
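
A per-label report can be produced with scikit-learn's `classification_report`, which accepts multi-label indicator matrices. A minimal sketch on made-up arrays (the label names and values are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import classification_report

# Toy indicator matrices: 4 texts, 3 labels (values made up for illustration).
label_names = ["female", "male", "Student"]
y_true = np.array([[1, 0, 1], [0, 1, 0], [0, 1, 1], [1, 0, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [0, 1, 1], [0, 1, 0]])

# zero_division=0 reports 0 (without a warning) where a ratio is undefined.
report = classification_report(
    y_true, y_pred, target_names=label_names, zero_division=0
)
print(report)
```

On the notebook's data the equivalent call would presumably be `classification_report(y_test, predict_labels, target_names=binarizer.classes_)`; that call has not been run here.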


In [57]:
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, recall_score)

# Notes on the metrics below:
# - accuracy_score on multi-label indicator matrices is subset accuracy:
#   a sample counts as correct only if every one of its labels matches.
# - average_precision_score is designed for continuous scores; here it is
#   given the 0/1 predictions, so it does not measure ranking quality.
# - recall_score returns plain recall under the chosen averaging.

def display_metrics_micro(Ytest, Ypred):
    print('Accuracy score: ', accuracy_score(Ytest, Ypred))
    print('F1 score: Micro', f1_score(Ytest, Ypred, average='micro'))
    print('Average precision score: Micro', average_precision_score(Ytest, Ypred, average='micro'))
    print('Average recall score: Micro', recall_score(Ytest, Ypred, average='micro'))


def display_metrics_macro(Ytest, Ypred):
    print('Accuracy score: ', accuracy_score(Ytest, Ypred))
    print('F1 score: Macro', f1_score(Ytest, Ypred, average='macro'))
    print('Average recall score: Macro', recall_score(Ytest, Ypred, average='macro'))


def display_metrics_weighted(Ytest, Ypred):
    print('Accuracy score: ', accuracy_score(Ytest, Ypred))
    print('F1 score: weighted', f1_score(Ytest, Ypred, average='weighted'))
    print('Average precision score: weighted', average_precision_score(Ytest, Ypred, average='weighted'))
    print('Average recall score: weighted', recall_score(Ytest, Ypred, average='weighted'))
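
How the averaging modes differ is easiest to see on a tiny example. A minimal sketch on made-up indicator matrices (the values are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy indicator matrices: 2 texts, 3 labels (values made up for illustration).
y_true = np.array([[1, 1, 0], [1, 0, 1]])
y_pred = np.array([[1, 1, 0], [1, 0, 0]])

# Subset accuracy: a row counts only if every label matches -> 1 of 2 rows.
acc = accuracy_score(y_true, y_pred)

# Micro F1 pools all label decisions: TP=3, FP=0, FN=1 -> 2*3 / (2*3 + 0 + 1).
f1_micro = f1_score(y_true, y_pred, average="micro")

# Macro F1 averages the per-label F1 values (1, 1 and 0 here), so the
# never-predicted third label pulls the mean down.
f1_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(acc, f1_micro, f1_macro)
```

This is why subset accuracy can sit well below micro F1 on the same predictions, and why macro F1 drops when rare labels are never predicted.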
    

In [58]:
display_metrics_micro(y_test,predict_labels)

Accuracy score:  0.315
F1 score: Micro 0.6423666138404648
Average precision score: Micro 0.46044253475528474
Average recall score: Micro 0.532


In [59]:
display_metrics_macro(y_test,predict_labels)

Accuracy score:  0.315
F1 score: Macro 0.21713773753890891
Average recall score: Macro 0.1660267706887283




In [60]:
display_metrics_weighted(y_test,predict_labels)

Accuracy score:  0.315
F1 score: weighted 0.5944270384254596
Average precision score: weighted 0.514864020887684
Average recall score: weighted 0.532




#### Use a linear classifier and wrap it in OneVsRestClassifier to train it on every label
Models compared: logistic regression (LR), linear SVM, and multinomial Naive Bayes

In [61]:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

def build_model_train(X_train, y_train, X_valid=None, y_valid=None, C=1.0, model='lr'):
    """Fit a one-vs-rest classifier; `model` is 'lr', 'svm' or 'nbayes'."""
    # X_valid and y_valid are kept in the signature but are not used.
    if model == 'lr':
        estimator = LogisticRegression(C=C, penalty='l1', dual=False, solver='liblinear')
    elif model == 'svm':
        estimator = LinearSVC(C=C, penalty='l1', dual=False, loss='squared_hinge')
    elif model == 'nbayes':
        estimator = MultinomialNB(alpha=1.0)
    else:
        # Previously an unknown name was returned unchanged as a string.
        raise ValueError(f"Unknown model: {model!r}")

    classifier = OneVsRestClassifier(estimator)
    classifier.fit(X_train, y_train)
    return classifier

In [62]:
models = ['lr','svm','nbayes']
for name in models:
    print({name})
    # build_model_train already fits the classifier, so no second fit is needed
    clf = build_model_train(X_train_bow, y_train, model=name)
    Ypred = clf.predict(X_test_bow)
    print("\n")
    print(f"**displaying  metrics for the mode {clf}\n")
    display_metrics_micro(y_test, Ypred)
    print("\n")
    print("\n")
    display_metrics_macro(y_test, Ypred)
    print("\n")
    print("\n")
    display_metrics_weighted(y_test, Ypred)
    print("\n")
    print("\n")

{'lr'}


**displaying  metrics for the mode OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l1',
                                                 random_state=None,
                                                 solver='liblinear', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)

Accuracy score:  0.343
F1 score: Micro 0.6737785363772135
Average precision score: Micro 0.48826996756929547
Average recall score: Micro 0.592125




Accuracy score:  0.343
F1 score: Macro 0.34031652431543735
Average recall score: Macro 0.269684



Average recall score: weighted 0.592125




{'svm'}






**displaying  metrics for the mode OneVsRestClassifier(estimator=LinearSVC(C=1.0, class_weight=None, dual=False,
                                        fit_intercept=True, intercept_scaling=1,
                                        loss='squared_hinge', max_iter=1000,
                                        multi_class='ovr', penalty='l1',
                                        random_state=None, tol=0.0001,
                                        verbose=0),
                    n_jobs=None)

Accuracy score:  0.3275
F1 score: Micro 0.6675879223381469
Average precision score: Micro 0.475448392601962
Average recall score: Micro 0.603875




Accuracy score:  0.3275
F1 score: Macro 0.376787507460514
Average recall score: Macro 0.3069679492077122




Accuracy score:  0.3275
F1 score: weighted 0.6502124163652602
Average precision score: weighted 0.5330947516251924




Average recall score: weighted 0.603875




{'nbayes'}


**displaying  metrics for the mode OneVsRestClassifier(estimator=MultinomialNB(alpha=1.0, class_prior=None,
                                            fit_prior=True),
                    n_jobs=None)

Accuracy score:  0.0475
F1 score: Micro 0.4149056603773584
Average precision score: Micro 0.2778011298076923
Average recall score: Micro 0.274875




Accuracy score:  0.0475
F1 score: Macro 0.0632447435320839
Average recall score: Macro 0.046842438349092116




Accuracy score:  0.0475
F1 score: weighted 0.3243996975912275
Average precision score: weighted 0.37695397733766733
Average recall score: weighted 0.274875








This notebook solves a multi-label classification problem: predicting multiple features of the author of a given text.
The data was loaded, and basic EDA and data inspection were carried out.
The text was pre-processed by removing unnecessary characters and extra spaces, converting it to lowercase, and removing stop words; the cleaned text was then vectorized.
The data was split into train and test sets.
The labels were encoded with a multi-label binarizer, several classifiers were trained, predictions were made, and accuracy, F1, average precision and recall scores were calculated.

The LR classifier wrapped in OneVsRestClassifier gives the best accuracy (0.343) and micro F1 (0.674). LinearSVC is close behind and has the higher macro F1 (0.377 vs 0.340), while Naive Bayes performs far worse (accuracy 0.0475).
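
To compare the three models side by side, the printed scores can be collected into one table. A minimal sketch; the numbers are copied (rounded to four decimals) from the outputs above, and the column names are assumptions for illustration:

```python
import pandas as pd

# Scores copied from the printed results above, rounded to four decimals.
results = pd.DataFrame(
    {
        "accuracy": [0.343, 0.3275, 0.0475],
        "f1_micro": [0.6738, 0.6676, 0.4149],
        "f1_macro": [0.3403, 0.3768, 0.0632],
    },
    index=["lr", "svm", "nbayes"],
)

print(results)
print(results.idxmax())   # best model per metric
```

Which model counts as "best" depends on the metric: accuracy and micro F1 favor LR, while macro F1, which weights rare labels equally, favors the linear SVM.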