# HackLive 3: Guided Hackathon - NLP (Analytics Vidhya)

##### Click <a href='https://datahack.analyticsvidhya.com/contest/hacklive-3-guided-hackathon-text-classification/'>here</a> to go to the competition page

#### Author: Palakodeti Nagendra Deepak
* Performance metric used : Micro F1 score
* Public leaderboard score: 0.7745629898 
* Public leaderboard ranking: 7
* Private leaderboard score: 0.7775161860
* Private leaderboard ranking: 6

<h2> Problem statement </h2>

<h4> In the real world, many research institutes maintain huge archives of research papers, and tagging those papers manually is a tedious task. The objective of this ML problem is to automatically tag each research paper with one or more of the 25 possible tags. </h4>
    
**The possible tags are:**

[Analysis of PDEs, Applications, Artificial Intelligence, Astrophysics of Galaxies, Computation and Language, Computer Vision and Pattern Recognition, Cosmology and Nongalactic Astrophysics, Data Structures and Algorithms, Differential Geometry, Earth and Planetary Astrophysics, Fluid Dynamics, Information Theory, Instrumentation and Methods for Astrophysics, Machine Learning, Materials Science, Methodology, Number Theory, Optimization and Control, Representation Theory, Robotics, Social and Information Networks, Statistics Theory, Strongly Correlated Electrons, Superconductivity, Systems and Control]

<h2> Type of Machine Learning Problem </h2>

<h4> Since there are 25 different labels and each research paper may belong to one or more categories, this is a multilabel classification problem. </h4>

<h2> Performance metric </h2>

<h4> Micro F1 score </h4>
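
To make the metric concrete, here is a minimal sketch of how micro F1 pools counts over all labels before computing a single score. The toy arrays are made up for illustration and are not competition data; the manual formula is compared with scikit-learn's `f1_score(..., average='micro')`.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multilabel targets: 4 papers x 3 tags (illustrative values only)
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 1, 1]])

# Micro averaging pools true positives, false positives and
# false negatives over every (paper, tag) cell
tp = int(((y_true == 1) & (y_pred == 1)).sum())
fp = int(((y_true == 0) & (y_pred == 1)).sum())
fn = int(((y_true == 1) & (y_pred == 0)).sum())

micro_f1_manual = 2 * tp / (2 * tp + fp + fn)
micro_f1_sklearn = f1_score(y_true, y_pred, average='micro')
print(tp, fp, fn, micro_f1_manual, micro_f1_sklearn)
```

Because the counts are pooled, frequent tags influence the score more than rare ones.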

## Importing the data and necessary libraries

In [1]:
import re
import numpy as np
import pandas as pd
from scipy.sparse import hstack
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
df = pd.read_csv('/kaggle/input/analyticsvidhya-hacklive3/Train/Train.csv')
df.drop('id', axis=1, inplace=True)
df.head(2)

Unnamed: 0,ABSTRACT,Computer Science,Mathematics,Physics,Statistics,Analysis of PDEs,Applications,Artificial Intelligence,Astrophysics of Galaxies,Computation and Language,...,Methodology,Number Theory,Optimization and Control,Representation Theory,Robotics,Social and Information Networks,Statistics Theory,Strongly Correlated Electrons,Superconductivity,Systems and Control
0,a ever-growing datasets inside observational a...,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,we propose the framework considering optimal $...,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [3]:
print(df.shape)

(14004, 30)


<h3> There are 14004 research papers (rows) in total. The ABSTRACT column gives the gist of each paper; the columns Computer Science, Mathematics, Physics and Statistics give its primary domain; and the remaining 25 columns are the target columns (labels). </h3>

In [4]:
df.loc[0, 'ABSTRACT']

'a ever-growing datasets inside observational astronomy have challenged scientists inside many aspects, including an efficient and interactive data exploration and visualization. many tools have been developed to confront this challenge. however, they usually focus on displaying a actual images or focus on visualizing patterns within catalogs inside the predefined way. inside this paper we introduce vizic, the python visualization library that builds a connection between images and catalogs through an interactive map of a sky region. vizic visualizes catalog data over the custom background canvas with the help of a shape, size and orientation of each object inside a catalog. a displayed objects inside a map are highly interactive and customizable comparing to those inside a images. these objects should be filtered by or colored by their properties, such as redshift and magnitude. they also should be sub-selected with the help of the lasso-like tool considering further analysis with the

<h4> Above is a sample abstract of a research paper </h4>

In [5]:
test = pd.read_csv('/kaggle/input/analyticsvidhya-hacklive3/Test/Test.csv')
test.head(2)

Unnamed: 0,id,ABSTRACT,Computer Science,Mathematics,Physics,Statistics
0,9409,fundamental frequency (f0) approximation from ...,0,0,0,1
1,17934,"this large-scale study, consisting of 24.5 mil...",1,0,0,1


In [6]:
TARGET_COLS = ['Analysis of PDEs', 'Applications',
               'Artificial Intelligence', 'Astrophysics of Galaxies',
               'Computation and Language', 'Computer Vision and Pattern Recognition',
               'Cosmology and Nongalactic Astrophysics',
               'Data Structures and Algorithms', 'Differential Geometry',
               'Earth and Planetary Astrophysics', 'Fluid Dynamics',
               'Information Theory', 'Instrumentation and Methods for Astrophysics',
               'Machine Learning', 'Materials Science', 'Methodology', 'Number Theory',
               'Optimization and Control', 'Representation Theory', 'Robotics',
               'Social and Information Networks', 'Statistics Theory',
               'Strongly Correlated Electrons', 'Superconductivity',
               'Systems and Control']

<h3> Text preprocessing </h3>

In [7]:
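def remove_punctuations(x):
    x = str(x)
    # Replace word-joining characters with a space so the joined words stay separate
    for punct in "/-'":
        x = x.replace(punct, ' ')
    # Pad '&' with spaces; misspelled_words later turns it into 'and'
    for punct in '&':
        x = x.replace(punct, f' {punct} ')
    # Drop all remaining punctuation, including curly quotes and the ellipsis character
    for punct in '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '“”’' + '…':
        x = x.replace(punct, '')
    return x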

In [8]:
def clean_numbers(x):
    x = re.sub('[0-9]{5,}', '#####', x)
    x = re.sub('[0-9]{4}', '####', x)
    x = re.sub('[0-9]{3}', '###', x)
    x = re.sub('[0-9]{2}', '##', x)
    return x
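
As a quick illustration of what `clean_numbers` does, the sketch below repeats the function so that it runs on its own and applies it to a made-up sentence (the sample text is an assumption, not taken from the dataset).

```python
import re

def clean_numbers(x):
    # Same substitutions as the notebook cell above, repeated so this snippet is self-contained
    x = re.sub('[0-9]{5,}', '#####', x)
    x = re.sub('[0-9]{4}', '####', x)
    x = re.sub('[0-9]{3}', '###', x)
    x = re.sub('[0-9]{2}', '##', x)
    return x

sample = 'in 2019 we used 24 of 12345 samples and 7 models'
cleaned = clean_numbers(sample)
print(cleaned)  # in #### we used ## of ##### samples and 7 models
```

Numbers are bucketed by digit count and single digits are left untouched. Note that `TfidfVectorizer`'s default token pattern only keeps runs of two or more word characters, so the `#` placeholders would not become features unless a custom tokenizer or token pattern is used.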

In [9]:
# This is a generalized replacement of misspelled words that I use for all projects, so some words here may not actually appear in the abstracts
def misspelled_words(x):
    x = x.replace('colour', 'color').replace('centre', 'center').replace('didnt', 'did not').replace('doesnt', 'does not') \
        .replace('isnt', 'is not').replace('shouldnt', 'should not').replace('favourite', 'favorite').replace('travelling', 'traveling') \
        .replace('counselling', 'counseling').replace('theatre', 'theater').replace('cancelled', 'canceled').replace('labour', 'labor') \
        .replace('organisation', 'organization').replace('wwii', 'world war 2').replace('citicise', 'criticize') \
        .replace('instagram', 'social medium').replace('whatsapp', 'social medium').replace('WeChat', 'social medium') \
        .replace('snapchat', 'social medium').replace('Snapchat', 'social medium').replace('btech', 'B.Tech').replace('Quorans', 'Quora') \
        .replace('cryptocurrency', 'crypto currency').replace('cryptocurrencies', 'crypto currency').replace('behaviour', 'behavior') \
        .replace('analyse', 'analyze').replace('licence', 'license').replace('programme', 'program').replace('grey', 'gray') \
        .replace('realise', 'realize').replace('bcom', 'B.Com').replace('defence', 'defense').replace('mtech', 'M.Tech') \
        .replace('Btech', 'B.Tech').replace('honours', 'honors').replace('recognise', 'recognize').replace('programr', 'programmer') \
        .replace('programrs', 'programmer').replace('hasnt', 'has not').replace('litre', 'liter').replace('Isnt', 'is not') \
        .replace('learnt', 'learn').replace('favour', 'favor').replace('neighbour', 'neighbor').replace('demonetisation', 'demonetization') \
        .replace('₹', '').replace('&', 'and')
    return x

In [10]:
df["ABSTRACT"] = df["ABSTRACT"].apply(remove_punctuations)
df["ABSTRACT"] = df["ABSTRACT"].apply(clean_numbers)
df["ABSTRACT"] = df["ABSTRACT"].apply(misspelled_words)
test["ABSTRACT"] = test["ABSTRACT"].apply(remove_punctuations)
test["ABSTRACT"] = test["ABSTRACT"].apply(clean_numbers)
test["ABSTRACT"] = test["ABSTRACT"].apply(misspelled_words)

<h3> Splitting the data into train and validation (80:20) </h3>

In [11]:
train, val = train_test_split(df, test_size=0.2, random_state=0)
train.shape, val.shape

((11203, 30), (2801, 30))

<h3> Vectorizing train, validation and test datasets using the TF-IDF vectorizer </h3>

In [12]:
tfidfvec = TfidfVectorizer(min_df=3, max_features=None, ngram_range=(1, 2), strip_accents='unicode', stop_words='english')
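# Note: the vocabulary and IDF weights are fit on all of df (train + validation rows), so validation scores may be slightly optimistic
tfidfvec.fit(df['ABSTRACT'])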
train_vec = tfidfvec.transform(train['ABSTRACT'])
val_vec = tfidfvec.transform(val['ABSTRACT'])
test_vec = tfidfvec.transform(test['ABSTRACT'])
train_vec.shape, val_vec.shape, test_vec.shape

((11203, 76621), (2801, 76621), (6002, 76621))
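
To see how `ngram_range` and `min_df` shape the vocabulary, here is a small sketch on four made-up documents (the toy corpus is an assumption; the real abstracts give 76621 features as shown above).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative only); no stop words, to keep the counts easy to follow
docs = [
    'deep neural networks image classification',
    'neural networks text classification',
    'galaxy formation early universe',
    'early universe galaxy surveys',
]

# Unigrams only: one feature per distinct word
unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(docs)

# Unigrams + bigrams: adjacent word pairs are added as features
uni_bi = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)

# min_df=2 keeps only terms that occur in at least two documents
pruned = TfidfVectorizer(ngram_range=(1, 2), min_df=2).fit(docs)

n_unigrams = len(unigrams.vocabulary_)
n_uni_bi = len(uni_bi.vocabulary_)
n_pruned = len(pruned.vocabulary_)
print(n_unigrams, n_uni_bi, n_pruned)
print(sorted(pruned.vocabulary_))
```

Adding bigrams grows the feature space quickly, and `min_df` trims terms that are too rare to generalize.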

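<h5> After vectorizing, we stack the remaining 4 domain features (Computer Science, Mathematics, Physics, Statistics) onto the TF-IDF matrix and keep the result in CSR (compressed sparse row) format. Converting the stacked matrix to a dense NumPy array would need far more RAM than the sparse version and may not fit in memory, so it is important to pass the data to the model in CSR format. </h5>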

In [13]:
train_data = hstack((train_vec, train[['Computer Science', 'Mathematics', 'Physics', 'Statistics']]), format="csr", dtype='float64')
val_data = hstack((val_vec, val[['Computer Science', 'Mathematics', 'Physics', 'Statistics']]), format="csr", dtype='float64')
test_data = hstack((test_vec, test[['Computer Science', 'Mathematics', 'Physics', 'Statistics']]), format="csr", dtype='float64')
train_data.shape, val_data.shape, test_data.shape

((11203, 76625), (2801, 76625), (6002, 76625))
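
The memory argument can be checked with a rough sketch. The small random matrix below is a stand-in for the TF-IDF output (its size and density are assumptions), while the last line estimates the dense float64 size for the real training shape printed above.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, random as sparse_random

# Stand-in for a TF-IDF matrix: 1000 rows, 5000 features, 1% non-zero (assumed values)
n_rows, n_text_features = 1000, 5000
text_part = sparse_random(n_rows, n_text_features, density=0.01,
                          format='csr', random_state=0, dtype=np.float64)

# Stand-in for the 4 binary domain columns
domain_part = np.random.RandomState(0).randint(0, 2, size=(n_rows, 4))

# Same stacking pattern as in the notebook: result stays sparse (CSR)
stacked = hstack((text_part, csr_matrix(domain_part)), format='csr', dtype='float64')

# Memory held by the three CSR arrays vs. an equivalent dense float64 array
sparse_bytes = stacked.data.nbytes + stacked.indices.nbytes + stacked.indptr.nbytes
dense_bytes = stacked.shape[0] * stacked.shape[1] * 8
print(stacked.shape, sparse_bytes, dense_bytes)

# Dense float64 estimate for the real training matrix of shape (11203, 76625)
full_dense_gb = 11203 * 76625 * 8 / 1e9
print(round(full_dense_gb, 2), 'GB')
```

For the real training matrix alone, a dense float64 copy would take roughly 6.9 GB, before counting validation data, test data or any copies made during fitting.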

<h3> Using Grid search to find best hyperparameters </h3>
<h5> Note: Since there is only a single hyperparameter to tune, I used GridSearchCV. With more hyperparameters, RandomizedSearchCV is usually the wiser choice. </h5>

In [14]:
parameters = {
    'estimator__C': [10 ** x for x in range(-2, 3)]
}

estimator = OneVsRestClassifier(LogisticRegression(max_iter=500, n_jobs=-1))
model = GridSearchCV(estimator, parameters, scoring='f1_micro', cv=5, n_jobs=-1, refit=False)
model.fit(train_data, train[TARGET_COLS])
best_C = model.best_params_['estimator__C']
print('The best value of C is', best_C)

The best value of C is 100
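
For comparison, here is a minimal sketch of the RandomizedSearchCV alternative mentioned in the note. It runs on synthetic multilabel data from `make_multilabel_classification`, and the second hyperparameter (`class_weight`) and the search ranges are assumptions chosen for illustration, not settings used in this notebook.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multilabel data (stand-in for the TF-IDF features and the 25 tags)
X, Y = make_multilabel_classification(n_samples=200, n_features=20,
                                      n_classes=4, random_state=0)

# Distributions are sampled; lists are sampled uniformly
param_distributions = {
    'estimator__C': loguniform(1e-2, 1e2),           # same range as the grid above
    'estimator__class_weight': [None, 'balanced'],   # hypothetical second hyperparameter
}

search = RandomizedSearchCV(
    OneVsRestClassifier(LogisticRegression(max_iter=500)),
    param_distributions,
    n_iter=5,              # number of sampled configurations
    scoring='f1_micro',
    cv=3,
    random_state=0,
)
search.fit(X, Y)

sampled_C = search.best_params_['estimator__C']
print(search.best_params_)
```

The cost is fixed by `n_iter` rather than by the size of the grid, which is why random search scales better as hyperparameters are added.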


<h3> Applying the ML model with the best hyperparameter and predicting on validation data </h3>

In [15]:
clf = OneVsRestClassifier(LogisticRegression(C = best_C, max_iter=500, n_jobs=-1))
clf.fit(train_data, train[TARGET_COLS])
pred = clf.predict(val_data)
f1_score(val[TARGET_COLS], pred, average='micro')

0.7315175097276265

In [16]:
# A simple hack: for each label, search for the threshold that gives the best F1 score
def get_best_thresholds(true, preds):
    thresholds = [i/100 for i in range(100)]
    best_thresholds = []
    for idx in range(true.shape[1]):  # one threshold per label column (25 here)
        f1_scores = [f1_score(true[:, idx], (preds[:, idx] > thresh) * 1) for thresh in thresholds]
        best_thresh = thresholds[np.argmax(f1_scores)]
        best_thresholds.append(best_thresh)
    return best_thresholds

In [17]:
val_preds = clf.predict_proba(val_data)
best_thresholds = get_best_thresholds(val[TARGET_COLS].values, val_preds)
for i, thresh in enumerate(best_thresholds):
    val_preds[:, i] = (val_preds[:, i] > thresh) * 1
f1_score(val[TARGET_COLS], val_preds, average='micro')

0.7864363403710812

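<h4> As you can see above, finding an optimal threshold per label improved the validation micro F1 score noticeably, from 0.7315 to 0.7864. Because the thresholds were tuned on the same validation data used to compute this score, the gain is somewhat optimistic, but improvements like this can still move a submission up the rankings in competitions and hackathons. </h4>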
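
One way to get a less optimistic estimate of the gain is to tune thresholds on one part of the held-out data and score on another. The sketch below does this on synthetic probabilities; the data, the helper name `tune_thresholds` and the half/half split are all assumptions for illustration, not what was done in this notebook.

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_thresholds(y_true, y_prob, grid=None):
    """For each label column, pick the threshold in `grid` with the highest binary F1."""
    if grid is None:
        grid = np.arange(0.0, 1.0, 0.01)
    best = []
    for j in range(y_true.shape[1]):
        scores = [f1_score(y_true[:, j], (y_prob[:, j] > t).astype(int), zero_division=0)
                  for t in grid]
        best.append(float(grid[int(np.argmax(scores))]))
    return np.array(best)

# Synthetic data: 400 samples, 3 labels, about 20% positives per label
rng = np.random.default_rng(0)
n, k = 400, 3
y = (rng.random((n, k)) < 0.2).astype(int)
# Fake probabilities: positives score higher than negatives, but mostly below 0.5
prob = np.clip(0.15 + 0.25 * y + rng.normal(0, 0.1, size=(n, k)), 0, 1)

# Tune on the first half, evaluate on the second half
half = n // 2
thr = tune_thresholds(y[:half], prob[:half])
default_f1 = f1_score(y[half:], (prob[half:] > 0.5).astype(int),
                      average='micro', zero_division=0)
tuned_f1 = f1_score(y[half:], (prob[half:] > thr).astype(int),
                    average='micro', zero_division=0)
print(thr, default_f1, tuned_f1)
```

Threshold tuning helps most when the classifier's probabilities for positives tend to sit below 0.5, which is common for rare labels.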

<h3> Submitting the predictions </h3>

In [18]:
ss = pd.read_csv('../input/analyticsvidhya-hacklive3/SampleSubmission.csv')
preds_test = clf.predict_proba(test_data)

for i, thresh in enumerate(best_thresholds):
    preds_test[:, i] = (preds_test[:, i] > thresh) * 1

ss[TARGET_COLS] = preds_test
ss.to_csv('hacklive_submission.csv', index=False)

<h4> PS: This is my first ever hackathon participation and Kaggle notebook. Please share your feedback and upvote the notebook. </h4>