### Author Features Prediction

**Description:**

Classification is probably the most common task you will deal with in practice.
Text in the form of blogs, posts, articles, etc. is written every second. Predicting information
about a writer from the text alone, without knowing anything else about them, is a challenge.
We are going to create a classifier that predicts multiple features of the author of a given text.
We have framed it as a multi-label classification problem.

In [1]:
import pandas as pd
from warnings import filterwarnings
filterwarnings('ignore')

**1. Load the dataset (5 points)**

In [2]:
df = pd.read_csv('C:/Users/Ramyasridar/Downloads/NLP/project/blogtext.csv',nrows=3000)
df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


In [3]:
df.shape

(3000, 7)

In [4]:
df.isnull().sum()

id        0
gender    0
age       0
topic     0
sign      0
date      0
text      0
dtype: int64

In [5]:
df.columns

Index(['id', 'gender', 'age', 'topic', 'sign', 'date', 'text'], dtype='object')

Inference: The original dataset holds 681284 observations and 7 variables. As the dataset is large, we select the first 3000 rows for further analysis.

**2. Preprocess rows of the “text” column (7.5 points)**

**a. Remove unwanted characters**

**b. Convert text to lowercase**

**c. Remove unwanted spaces**

**d. Remove stopwords**
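The four steps above can be sketched as a single helper. This is a minimal illustration only: `clean_text` and `DEMO_STOPWORDS` are hypothetical names, and the tiny stopword set stands in for NLTK's full English list used in the cells below.

```python
import re

# Tiny illustrative stopword set (an assumption); the notebook uses NLTK's English list.
DEMO_STOPWORDS = {"the", "a", "is", "to", "and", "i", "has", "been"}

def clean_text(text, stop_words=DEMO_STOPWORDS):
    text = re.sub(r"[^A-Za-z]+", " ", text)  # a. replace runs of non-letters with one space
    text = text.lower()                      # b. convert to lowercase
    text = text.strip()                      # c. drop leading/trailing spaces
    # d. keep only the words that are not stopwords
    words = [w for w in text.split() if w not in stop_words]
    return " ".join(words)

print(clean_text("Info has been found (+/- 100 pages)!"))
```

Because the regular expression already collapses each run of non-letters into a single space, `strip()` only has to handle the two ends of the string.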

In [6]:
#!pip install spacy

In [7]:
import spacy
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
import re
import unicodedata

In [8]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Ramyasridar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [9]:
from nltk.corpus import stopwords

In [10]:
df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


In [11]:
df.text = df.text.apply(lambda x: re.sub('[^A-Za-z]+', ' ', x))

In [12]:
df.text = df.text.apply(lambda x: x.lower())

In [13]:
df.text = df.text.apply(lambda x: x.strip())

In [14]:
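# Use a separate name so the imported `stopwords` module is not overwritten
# (re-running this cell would otherwise fail)
stop_words = set(stopwords.words('english'))
df.text = df.text.apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))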

In [15]:
df.text.head()

0    info found pages mb pdf files wait untill team...
1    team members drewes van der laag urllink mail ...
2    het kader van kernfusie op aarde maak je eigen...
3                                      testing testing
4    thanks yahoo toolbar capture urls popups means...
Name: text, dtype: object

In [16]:
df.text[5]

'interesting conversation dad morning talking koreans put money invariably lot real estate cash cash would include short term investments one year well savings accounts reason real estate makes money lot money seen surveys seoul real estate rising per year long stretches even taking account crisis referred imf crisis although imf bailed korea compare korean corporate bonds fell modestly recovered local stock market represented kospi version dow jones index gone appreciably high points points see urllink link see real estate makes sense back conversation noted real big elite real estate investor billion usd see urllink converter properties dad seemed little flabbergasted heck need million dollars need much retire maybe lot risk take real estate south korean asset example north toots horn louder make move country usd worth cents also denominated imf crisis dropped vis vis usd also make bad investment fall victim scam latest urllink good morning city project toast saw lady tv lost everyth

Inference: The first five rows confirm that the text has been preprocessed.

**3. As we want to make this into a multi-label classification problem, you are required to merge all the label
   columns together, so that we have all the labels together for a particular sentence (7.5 points)**

   **a. Label columns to merge: “gender”, “age”, “topic”, “sign”**
    
   **b. After completing the previous step, there should be only two columns in your data frame i.e. “text” and “labels” as shown in the below image**


In [17]:
df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004",info found pages mb pdf files wait untill team...
1,2059027,male,15,Student,Leo,"13,May,2004",team members drewes van der laag urllink mail ...
2,2059027,male,15,Student,Leo,"12,May,2004",het kader van kernfusie op aarde maak je eigen...
3,2059027,male,15,Student,Leo,"12,May,2004",testing testing
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",thanks yahoo toolbar capture urls popups means...


In [18]:
df['labels'] = df.apply(lambda row: [row['gender'],str(row['age']),row['topic'],row['sign']],axis=1)

In [19]:
df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text,labels
0,2059027,male,15,Student,Leo,"14,May,2004",info found pages mb pdf files wait untill team...,"[male, 15, Student, Leo]"
1,2059027,male,15,Student,Leo,"13,May,2004",team members drewes van der laag urllink mail ...,"[male, 15, Student, Leo]"
2,2059027,male,15,Student,Leo,"12,May,2004",het kader van kernfusie op aarde maak je eigen...,"[male, 15, Student, Leo]"
3,2059027,male,15,Student,Leo,"12,May,2004",testing testing,"[male, 15, Student, Leo]"
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",thanks yahoo toolbar capture urls popups means...,"[male, 33, InvestmentBanking, Aquarius]"


Inference: The features 'gender', 'age', 'topic' and 'sign' are combined into a single feature named 'labels'. The original columns are then dropped and excluded from further analysis.

In [20]:
df = df[['text','labels']]

In [21]:
df.head()

Unnamed: 0,text,labels
0,info found pages mb pdf files wait untill team...,"[male, 15, Student, Leo]"
1,team members drewes van der laag urllink mail ...,"[male, 15, Student, Leo]"
2,het kader van kernfusie op aarde maak je eigen...,"[male, 15, Student, Leo]"
3,testing testing,"[male, 15, Student, Leo]"
4,thanks yahoo toolbar capture urls popups means...,"[male, 33, InvestmentBanking, Aquarius]"


**4. Separate features and labels, and split the data into training and testing (5 points)**

In [22]:
from sklearn.model_selection import train_test_split 

x = df.text.values
y = df.labels.values

X_train,X_test,y_train,y_test = train_test_split(x,y,random_state=10,test_size=0.2)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(2400,)
(600,)
(2400,)
(600,)


**5. Vectorize the features (5 points)**

**a. Create a Bag of Words using count vectorizer**

i. Use ngram_range=(1, 2)

ii. Vectorize training and testing features

**b. Print the term-document matrix**
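To make the effect of `ngram_range=(1, 2)` concrete, here is a pure-Python sketch of a binary bag of words over unigrams and bigrams. It is not how `CountVectorizer` is implemented; the helper names are hypothetical, and tokenising with `split()` is a simplification of the vectorizer's own token pattern.

```python
def unigrams_and_bigrams(text):
    """Return every single token plus every pair of adjacent tokens."""
    tokens = text.split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

def build_vocabulary(train_docs):
    """Map each term seen in the training documents to a column index."""
    terms = {t for doc in train_docs for t in unigrams_and_bigrams(doc)}
    return {term: i for i, term in enumerate(sorted(terms))}

def to_binary_row(doc, vocab):
    """Mark which vocabulary terms occur in the document (presence, not count)."""
    row = [0] * len(vocab)
    for term in unigrams_and_bigrams(doc):
        if term in vocab:          # terms unseen during fitting are ignored
            row[vocab[term]] = 1
    return row

train_docs = ["testing testing", "team members testing"]
vocab = build_vocabulary(train_docs)
matrix = [to_binary_row(d, vocab) for d in train_docs]

print(sorted(vocab, key=vocab.get))
print(matrix)
```

The vocabulary is built from the training documents only, which mirrors fitting on `X_train` and then transforming `X_test` with the same columns.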

In [23]:
from sklearn.feature_extraction.text import CountVectorizer

In [24]:
toNumeric = CountVectorizer()
toNumeric

CountVectorizer()

In [25]:
# Include 1 and 2-grams
toNumeric = CountVectorizer(ngram_range=(1, 2),binary=True)
toNumeric.fit(X_train)
len(toNumeric.get_feature_names())

180518

In [26]:
toNumeric.get_feature_names()

['aa',
 'aa anger',
 'aa compared',
 'aa nice',
 'aaa',
 'aaa take',
 'aaa travel',
 'aaaaaah',
 'aaaahh',
 'aaagh',
 'aaagh pero',
 'aaarrrggghhhhhhhhgggghhhhhh',
 'aaarrrggghhhhhhhhgggghhhhhh dropped',
 'aal',
 'aal eliminate',
 'aal esseneth',
 'aal lost',
 'aaldering',
 'aaldering urllink',
 'aand',
 'aand jim',
 'aar',
 'aar toy',
 'aarde',
 'aarde maak',
 'aargh',
 'aargh told',
 'aaron',
 'aaron burr',
 'aaron rowand',
 'aba',
 'aba foundation',
 'aba sessions',
 'aba sus',
 'aba therapy',
 'aba well',
 'abandon',
 'abandon days',
 'abandon every',
 'abandon horses',
 'abandon life',
 'abandon may',
 'abandoned',
 'abandoned although',
 'abandoned area',
 'abandoning',
 'abandoning isreal',
 'abandons',
 'abandons hate',
 'abated',
 'abated came',
 'abbreviate',
 'abbreviate fun',
 'abc',
 'abc last',
 'abc nightline',
 'abcnews',
 'abcnews com',
 'abcnews go',
 'abdicate',
 'abdicate overthrown',
 'abdomen',
 'abdomen thighs',
 'abdominal',
 'abdominal area',
 'abercrombie',
 '

In [27]:
X_train_dtm = toNumeric.fit_transform(X_train)
X_test_dtm = toNumeric.transform(X_test)

In [28]:
toNumeric.get_feature_names()[:5]

['aa', 'aa anger', 'aa compared', 'aa nice', 'aaa']

In [29]:
X_train_dtm.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [30]:
X_test_dtm.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

**6. Create a dictionary to get the count of every label i.e. the key will be label name and value will be the
total count of the label. Check below image for reference (5 points)**
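As an alternative to an explicit loop, the same kind of dictionary can be built with `collections.Counter`. The sketch below runs on a small hand-written list; `labels_column` and `label_counts_demo` are illustrative names, not variables from this notebook.

```python
from collections import Counter
from itertools import chain

# Toy stand-in for df.labels.values: one list of labels per row
labels_column = [
    ["male", "15", "Student", "Leo"],
    ["male", "15", "Student", "Leo"],
    ["male", "33", "InvestmentBanking", "Aquarius"],
]

# Flatten the per-row lists into one stream of labels and count each one
label_counts_demo = dict(Counter(chain.from_iterable(labels_column)))
print(label_counts_demo)
```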

In [31]:
df.head()

Unnamed: 0,text,labels
0,info found pages mb pdf files wait untill team...,"[male, 15, Student, Leo]"
1,team members drewes van der laag urllink mail ...,"[male, 15, Student, Leo]"
2,het kader van kernfusie op aarde maak je eigen...,"[male, 15, Student, Leo]"
3,testing testing,"[male, 15, Student, Leo]"
4,thanks yahoo toolbar capture urls popups means...,"[male, 33, InvestmentBanking, Aquarius]"


In [32]:
label_counts = dict()

for labels in df.labels.values:
    for label in labels:
        if label in label_counts:
            label_counts[label] += 1
        else:
            label_counts[label] = 1

In [33]:
label_counts

{'male': 2272,
 '15': 299,
 'Student': 403,
 'Leo': 55,
 '33': 94,
 'InvestmentBanking': 70,
 'Aquarius': 286,
 'female': 728,
 '14': 74,
 'indUnk': 452,
 'Aries': 1699,
 '25': 110,
 'Capricorn': 77,
 '17': 147,
 'Gemini': 21,
 '23': 93,
 'Non-Profit': 46,
 'Cancer': 76,
 'Banking': 16,
 '37': 19,
 'Sagittarius': 113,
 '26': 43,
 '24': 334,
 'Scorpio': 243,
 '27': 86,
 'Education': 118,
 '45': 14,
 'Engineering': 119,
 'Libra': 313,
 'Science': 33,
 '34': 6,
 '41': 14,
 'Communications-Media': 14,
 'BusinessServices': 21,
 'Sports-Recreation': 75,
 'Virgo': 39,
 'Taurus': 76,
 'Arts': 2,
 'Pisces': 2,
 '44': 3,
 '16': 25,
 'Internet': 20,
 'Museums-Libraries': 2,
 'Accounting': 2,
 '39': 32,
 '35': 1607,
 'Technology': 1607}

In [34]:
label_counts.keys()

dict_keys(['male', '15', 'Student', 'Leo', '33', 'InvestmentBanking', 'Aquarius', 'female', '14', 'indUnk', 'Aries', '25', 'Capricorn', '17', 'Gemini', '23', 'Non-Profit', 'Cancer', 'Banking', '37', 'Sagittarius', '26', '24', 'Scorpio', '27', 'Education', '45', 'Engineering', 'Libra', 'Science', '34', '41', 'Communications-Media', 'BusinessServices', 'Sports-Recreation', 'Virgo', 'Taurus', 'Arts', 'Pisces', '44', '16', 'Internet', 'Museums-Libraries', 'Accounting', '39', '35', 'Technology'])

**7. Transform the labels - (7.5 points)**

As we have noticed before, in this task each example can have multiple tags. To handle this kind of
prediction, we need to transform the labels into a binary form, so that each prediction is a mask of 0s and 1s.
For this purpose, it is convenient to use MultiLabelBinarizer from sklearn.

a. Convert your train and test labels using MultiLabelBinarizer
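The binary mask described above can be illustrated without scikit-learn. This is a hand-rolled sketch of the idea, with hypothetical helper names `fit_classes` and `binarize`, not the `MultiLabelBinarizer` implementation.

```python
def fit_classes(label_lists):
    """Collect every distinct label and fix a sorted column order."""
    return sorted({label for labels in label_lists for label in labels})

def binarize(label_lists, classes):
    """Turn each list of labels into a 0/1 row with one column per class."""
    index = {c: i for i, c in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[index[label]] = 1
        rows.append(row)
    return rows

y = [
    ["male", "15", "Student", "Leo"],
    ["male", "33", "InvestmentBanking", "Aquarius"],
]
classes = fit_classes(y)
Y = binarize(y, classes)

print(classes)
print(Y)
```

Every row contains exactly four 1s here, because each post carries one gender, one age, one topic and one sign.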

In [35]:
from sklearn.preprocessing import MultiLabelBinarizer

In [36]:
mlb = MultiLabelBinarizer(classes=sorted(label_counts.keys()))
y_train = mlb.fit_transform(y_train)
y_test = mlb.transform(y_test)

In [37]:
y_train

array([[0, 0, 0, ..., 0, 0, 1],
       [1, 0, 0, ..., 1, 1, 0],
       [0, 0, 0, ..., 1, 1, 0],
       ...,
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 1]])

In [38]:
mlb.classes_

array(['14', '15', '16', '17', '23', '24', '25', '26', '27', '33', '34',
       '35', '37', '39', '41', '44', '45', 'Accounting', 'Aquarius',
       'Aries', 'Arts', 'Banking', 'BusinessServices', 'Cancer',
       'Capricorn', 'Communications-Media', 'Education', 'Engineering',
       'Gemini', 'Internet', 'InvestmentBanking', 'Leo', 'Libra',
       'Museums-Libraries', 'Non-Profit', 'Pisces', 'Sagittarius',
       'Science', 'Scorpio', 'Sports-Recreation', 'Student', 'Taurus',
       'Technology', 'Virgo', 'female', 'indUnk', 'male'], dtype=object)

**8. Choose a classifier - (5 points)**

In this task, we suggest using the One-vs-Rest approach, which is implemented in the OneVsRestClassifier
class. In this approach, k classifiers (k = number of tags) are trained, one per tag. As a base classifier, use
LogisticRegression. It is one of the simplest methods, but it often performs well enough in text
classification tasks. Training might take some time because the number of classifiers is large.

a. Use a linear classifier of your choice, wrap it up in OneVsRestClassifier to train it on every label

b. As One-vs-Rest approach might not have been discussed in the sessions, we are providing you
with the code for that
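The decomposition behind One-vs-Rest can be sketched in a few lines: each column of the binarized label matrix becomes the target of its own binary classifier, while all classifiers share the same features. The function name and the toy data below are assumptions for illustration only.

```python
def one_vs_rest_targets(Y, classes):
    """Return {label: binary target vector}, one vector per column of Y."""
    return {label: [row[j] for row in Y] for j, label in enumerate(classes)}

# Toy label matrix: 3 samples, 3 labels (columns follow `classes`)
classes = ["female", "male", "Student"]
Y = [
    [0, 1, 1],
    [1, 0, 0],
    [0, 1, 0],
]

targets = one_vs_rest_targets(Y, classes)
for label, target in targets.items():
    # One binary problem per label: "does this sample carry the label or not?"
    print(label, target)
```

With 47 distinct labels in this notebook, this amounts to 47 separate binary problems, which is why fitting takes a while.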

In [39]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(solver='lbfgs')
clf = OneVsRestClassifier(clf)

In [40]:
clf.fit(X_train_dtm,y_train)

OneVsRestClassifier(estimator=LogisticRegression())

**9. Fit the classifier, make predictions and get the accuracy (5 points)**

a. Print the following

i. Accuracy score

ii. F1 score

iii. Average precision score

iv. Average recall score

v. Tip: Make sure you are familiar with all of them. How would you expect the things to
work for the multi-label scenario? Read about micro/macro/weighted averaging
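To see how the averaging modes differ, the sketch below computes subset accuracy and micro, macro and weighted F1 by hand on a toy label matrix. The data and helper names are made up for illustration; the formulas follow the standard definitions but this is not the scikit-learn code.

```python
def counts_for_label(y_true, y_pred, j):
    """True positives, false positives and false negatives for column j."""
    tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # define F1 as 0 when nothing to score

y_true = [[1, 0, 1], [1, 1, 0], [1, 1, 0], [1, 0, 0]]
y_pred = [[1, 0, 0], [1, 1, 0], [1, 0, 0], [1, 0, 0]]

per_label = [counts_for_label(y_true, y_pred, j) for j in range(3)]
supports = [tp + fn for tp, fp, fn in per_label]  # true occurrences per label

# Micro: pool the counts of all labels, then compute one F1
micro_f1 = f1(*(sum(c) for c in zip(*per_label)))
# Macro: compute F1 per label, then take the plain mean
macro_f1 = sum(f1(*c) for c in per_label) / len(per_label)
# Weighted: mean of per-label F1, weighted by each label's support
weighted_f1 = sum(f1(*c) * s for c, s in zip(per_label, supports)) / sum(supports)
# Subset accuracy: a sample is correct only if its whole row matches
subset_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(micro_f1, macro_f1, weighted_f1, subset_accuracy)
```

The rare third label is never predicted, so it drags the macro average down much more than the micro or weighted averages. The same pattern shows up when frequent labels are predicted well and rare ones are missed.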

In [41]:
y_pred = clf.predict(X_test_dtm)

In [42]:
y_pred

array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 1, 1, 0],
       [0, 0, 0, ..., 0, 0, 1],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1]])

In [43]:
y_pred_inversed = mlb.inverse_transform(y_pred)
y_test_inversed = mlb.inverse_transform(y_test)

In [44]:
for i in range(5):
    print('Title:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
        X_test[i],
        ','.join(y_test_inversed[i]),
        ','.join(y_pred_inversed[i])
    ))

Title:	love food network encompasses iron chef rules enjoy almost anything run e law order third watch murder wrote magnum pi l law also adore st elsewhere think show bravo love matlock really deal diagnosis murder tips cheesy scales much also love remington steele scarecrow mrs king bruce boxleitner sp pierce brosnan double dose hot course columbo rocks love man annoys suspects confessing also really loved cheesy mountie show due south rerunning tnt sure think right jesus watch way much television
True labels:	35,Aries,Technology,male
Predicted labels:	35,Aries,Technology,male


Title:	alpha omega returned alpha camp weekend thoroughly enjoyable pretty far melbie though upper plenty far hardly melway map eep place fantastic dorm rooms toilet really neat clean esp bathroom amenities needed great food plus real fireplace could ask spent n enjoying company learning holy spirit part alpha course taking church sorta like foundation back basics christianity kind thing real good stuff eg gre

In [45]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score

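def print_evaluation_scores(y_val, predicted):
    # For multi-label targets this is subset accuracy: every label of a sample must match
    print('Accuracy score: ', accuracy_score(y_val, predicted))
    print('F1 score: ', f1_score(y_val, predicted, average='micro'))
    # Note: average_precision_score summarises the precision-recall curve and is usually
    # given continuous scores; here it receives the hard 0/1 predictions
    print('Average precision score: ', average_precision_score(y_val, predicted, average='micro'))
    print('Average recall score: ', recall_score(y_val, predicted, average='micro'))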

In [46]:
print_evaluation_scores(y_test, y_pred)

Accuracy score:  0.54
F1 score:  0.735632183908046
Average precision score:  0.579383667766852
Average recall score:  0.6533333333333333
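The "Average precision score" printed above comes from `average_precision_score`, which summarises a precision-recall curve, so it is not the same quantity as micro-averaged precision. As a rough cross-check, and assuming the printed micro F1 and recall are exact, the micro precision they imply can be recovered by solving F1 = 2PR / (P + R) for P:

```python
# Values copied from the output above
micro_f1 = 0.735632183908046
micro_recall = 0.6533333333333333

# F1 = 2PR / (P + R)  =>  P = F1 * R / (2R - F1)
implied_precision = micro_f1 * micro_recall / (2 * micro_recall - micro_f1)
print(round(implied_precision, 4))
```

`sklearn.metrics.precision_score` with `average='micro'` would compute micro precision directly from the predictions. The figure here is only derived from the printed numbers, not re-measured.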


In [47]:
from sklearn.metrics import classification_report

In [48]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       1.00      0.07      0.12        15
           1       0.96      0.32      0.48        68
           2       0.00      0.00      0.00         9
           3       1.00      0.23      0.37        35
           4       1.00      0.07      0.13        14
           5       0.92      0.40      0.56        60
           6       0.00      0.00      0.00        30
           7       0.00      0.00      0.00        11
           8       0.00      0.00      0.00        11
           9       1.00      0.25      0.40        24
          10       0.00      0.00      0.00         2
          11       0.84      0.94      0.89       309
          12       0.00      0.00      0.00         3
          13       0.00      0.00      0.00         4
          14       0.00      0.00      0.00         4
          15       0.00      0.00      0.00         0
          16       0.00      0.00      0.00         1
          17       0.00    

In [49]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score

def display_metrics_micro(y_val, predicted):
    print('Accuracy score: ', accuracy_score(y_val, predicted))
    print('F1 score: Micro', f1_score(y_val, predicted, average='micro'))
    print('Average precision score: Micro', average_precision_score(y_val, predicted, average='micro'))
    print('Average recall score: Micro', recall_score(y_val, predicted, average='micro'))
    
    
def display_metrics_macro(y_val, predicted):
    print('Accuracy score: ', accuracy_score(y_val, predicted))
    print('F1 score: Macro', f1_score(y_val, predicted, average='macro'))
    print('Average recall score: Macro', recall_score(y_val, predicted, average='macro'))

def display_metrics_weighted(y_val, predicted):
    print('Accuracy score: ', accuracy_score(y_val, predicted))
    print('F1 score: weighted', f1_score(y_val, predicted, average='weighted'))
    print('Average precision score: weighted', average_precision_score(y_val, predicted, average='weighted'))
    print('Average recall score: weighted', recall_score(y_val, predicted, average='weighted'))

In [50]:
display_metrics_micro(y_test,y_pred)

Accuracy score:  0.54
F1 score: Micro 0.735632183908046
Average precision score: Micro 0.579383667766852
Average recall score: Micro 0.6533333333333333


In [51]:
display_metrics_macro(y_test,y_pred)

Accuracy score:  0.54
F1 score: Macro 0.23362146939402045
Average recall score: Macro 0.18562280828959574


In [52]:
display_metrics_weighted(y_test,y_pred)

Accuracy score:  0.54
F1 score: weighted 0.6686151046086233
Average precision score: weighted 0.6013572937816467
Average recall score: weighted 0.6533333333333333


**10. Print true label and predicted label for any five examples (7.5 points)**

In [53]:
y_test_inversed[:5]

[('35', 'Aries', 'Technology', 'male'),
 ('24', 'Scorpio', 'female', 'indUnk'),
 ('15', 'Aquarius', 'Student', 'female'),
 ('23', 'Taurus', 'indUnk', 'male'),
 ('24', 'Scorpio', 'female', 'indUnk')]

In [54]:
y_pred_inversed[:5]

[('35', 'Aries', 'Technology', 'male'),
 ('24', 'Scorpio', 'female', 'indUnk'),
 ('male',),
 ('indUnk', 'male'),
 ('female',)]

In [63]:
import random 

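def print_predicted(y_predicted, y_test=y_test, n=5):
    # Draw n random row indices; randrange excludes len(y_test), so every index is valid
    # (randint includes its upper bound and could raise an IndexError)
    j = [random.randrange(len(y_test)) for _ in range(n)]
    print(j)

    # Decode the binary masks back into label tuples once, outside the loop
    predicted_labels = mlb.inverse_transform(y_predicted)
    true_labels = mlb.inverse_transform(y_test)
    for k in j:
        print(predicted_labels[k])  # predicted labels
        print(true_labels[k])       # true labels
        print("-----------------------------------")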

In [64]:
print_predicted(y_predicted=y_pred, y_test = y_test , n = 5)

[95, 449, 13, 369, 270]
('35', 'Aries', 'Technology', 'male')
('35', 'Aries', 'Technology', 'male')
-----------------------------------
('female', 'indUnk')
('24', 'Scorpio', 'female', 'indUnk')
-----------------------------------
('15', 'Student', 'female')
('15', 'Aquarius', 'Student', 'male')
-----------------------------------
('male',)
('17', 'Sagittarius', 'Student', 'male')
-----------------------------------
('male',)
('41', 'Communications-Media', 'Libra', 'male')
-----------------------------------
