# Classification of Conversation Text

**Objective:** Classify conversation text into two categories (`Patient_Tag` 0 or 1) using the labelled dataset provided.


## Table of contents
1. Aim
2. Prerequisites
3. Data gathering
4. Data preparation
5. Modelling
6. Accuracy
7. Saving test predictions

## Prerequisites
All the required libraries are imported here.

In [1]:
import pandas as pd         
import re                     
import emoji          
import contractions    
from collections import Counter       

from nltk.corpus import stopwords    
from nltk.tokenize import WordPunctTokenizer  
from nltk.stem import WordNetLemmatizer   
from nltk.collocations import BigramCollocationFinder 
from nltk.metrics import BigramAssocMeasures

from sklearn.model_selection import train_test_split  
from sklearn.feature_extraction.text import TfidfVectorizer   
from sklearn import metrics  
from sklearn.pipeline import Pipeline  
from sklearn.svm import SVC, LinearSVC    
from sklearn.compose import ColumnTransformer

## Data

* Gathering the data
* Printing a few records
* Looking for missing records
* Understanding the missing records 

In [2]:
data=pd.read_csv('dataset/train.csv')
data.head(3)

Unnamed: 0,Source,Host,Link,Date(ET),Time(ET),time(GMT),Title,TRANS_CONV_TEXT,Patient_Tag
0,FORUMS,cafepharma.com,http://cafepharma.com/boards/threads/epstein.5...,6/15/2016,13:58:00,6/15/2016 23:28,Epstein,I don't disagree with you in principle. I'm ju...,0
1,FORUMS,www.patient.co.uk,http://www.patient.co.uk/forums/discuss/enlarg...,5/7/2016,0.820833333,42498.21667,Enlarged Heart.Thread Enlarged Heart,I am always dizzy I get dizzy standing up so I...,1
2,BLOG,http://abcnewsradioonline.com/entertainment-news,http://abcnewsradioonline.com/entertainment-ne...,4/14/2016,15:00:38,4/15/2016 0:30,Queen Latifah Joins American Heart Association...,Axelle/Bauer-Griffin/FilmMagic(NEW YORK) -- Qu...,0


In [3]:
data.isnull().sum()

Source               0
Host                59
Link                 0
Date(ET)             0
Time(ET)             0
time(GMT)          161
Title              216
TRANS_CONV_TEXT      1
Patient_Tag          0
dtype: int64

In [4]:
data.loc[(data['TRANS_CONV_TEXT'].isnull())]

Unnamed: 0,Source,Host,Link,Date(ET),Time(ET),time(GMT),Title,TRANS_CONV_TEXT,Patient_Tag
841,FORUMS,www.reddit.com,https://www.reddit.com/r/science/comments/4ogb...,6/16/2016,19:25:00,6/17/2016 4:55,Teenage weight is linked to risk of heart fail...,,0


## Data Preparation

* The text was cleaned before training, in the following steps:
    1. Removing emojis
    2. Expanding contractions
    3. Removing special characters and numbers
    4. Tokenizing the text
    5. Removing stop words (keeping `not` and `no`)
    6. Lemmatizing


* The 100 highest-scoring bigrams (chi-squared) were extracted from the cleaned text.


* The title and the conversation text were combined, and the cleaning and bigram extraction were applied to the combined text.
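
The cleaning steps listed above can be sketched without any third-party packages. This is a simplified illustration, not the notebook's implementation: the `CONTRACTIONS` map and `STOP_WORDS` set are tiny hypothetical stand-ins for the `contractions` package and the NLTK stop-word list, and lemmatization is omitted. Emojis disappear here because the letters-only regex removes them.

```python
import re

# Hypothetical, deliberately tiny stand-ins for the `contractions` package
# and the NLTK English stop-word list used in the notebook.
CONTRACTIONS = {"don't": "do not", "i'm": "i am", "can't": "cannot"}
STOP_WORDS = {"i", "am", "do", "with", "you", "in", "the", "a"}  # 'not' and 'no' are kept


def simple_clean(text):
    """Expand contractions, keep letters only, tokenize, drop stop words."""
    # Expand contractions word by word (case-insensitive lookup).
    words = [CONTRACTIONS.get(w.lower(), w) for w in text.split()]
    text = " ".join(words)
    # Replace everything that is not a letter with a space.
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # Split on whitespace to get tokens.
    tokens = text.split()
    # Drop stop words, comparing in lowercase as the notebook does.
    return " ".join(t for t in tokens if t.lower() not in STOP_WORDS)


print(simple_clean("I don't disagree with you in principle 😊 (2016)."))
```

The negation survives the stop-word filter, which is why `not` and `no` are removed from the stop-word list before cleaning.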

In [5]:
wordnet_lemmatizer = WordNetLemmatizer()   # lemmatizer
stop_words = stopwords.words('english')    # English stop words (the NLTK file id is lowercase)
# keep the negations 'not' and 'no' in the text
stop_words.remove('not')
stop_words.remove('no')

def preprocessing(myString):
    """to clean the text, to remove contraction"""
    myString = emoji.get_emoji_regexp().sub(r'', myString)  
    string_encode = myString.encode() 
    myString = string_encode.decode()   
    myString = contractions.fix(myString)   
    myString = re.sub('[^a-zA-Z]', ' ',myString) 
    myString = myString.strip()
    tokenizer = WordPunctTokenizer()   
    tokens = tokenizer.tokenize(myString)
    tokens = [t for t in tokens if t.lower() not in stop_words]
    tokens = [wordnet_lemmatizer.lemmatize(x) for x in tokens ]
    tokens = [wordnet_lemmatizer.lemmatize(x,pos='v') for x in tokens ]
    myString=' '.join(tokens)
    return myString

def get_bigrams(myString):
    """To create bigrams"""
    try:
        tokenizer = WordPunctTokenizer()   
        tokens = tokenizer.tokenize(myString)
        bigram_finder = BigramCollocationFinder.from_words(tokens)
        bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 100)
        tt=["%s %s" % bigram_tuple for bigram_tuple in bigrams]
        result = [x for x in tt ]
    except Exception as e:
        print(myString)
        result=[]
    return result

In [6]:
def prepare_data(dataframe,columns,final):
    """Combine title and conversation text, then clean it and extract bigrams."""
    # Work on an explicit copy so the assignments below do not raise SettingWithCopyWarning.
    df = dataframe[columns].dropna(subset=['TRANS_CONV_TEXT']).copy()
    # Prefix the title when it is present; rows without a title keep the text alone.
    title_prefix = df['Title'].apply(lambda t: t + '. ' if isinstance(t, str) else '')
    df['New TRANS_CONV_TEXT'] = title_prefix + df['TRANS_CONV_TEXT']
    df = df[final].copy()
    df['conv'] = df['New TRANS_CONV_TEXT'].apply(preprocessing)
    df['conv_bigrams'] = df['conv'].apply(get_bigrams)
    df['conv_bigrams_'] = [" ".join(x) for x in df['conv_bigrams']]
    return df

## Applying Data Preparation

In [7]:
dftrain=prepare_data(data,['Title','TRANS_CONV_TEXT','Patient_Tag'],['New TRANS_CONV_TEXT','Patient_Tag'])


In [8]:
dftrain.head()

Unnamed: 0,New TRANS_CONV_TEXT,Patient_Tag,conv,conv_bigrams,conv_bigrams_
0,Epstein. I don't disagree with you in principl...,0,Epstein not disagree principle say Entresto ma...,"[David let, Diovan discover, Ignorance frequen...",David let Diovan discover Ignorance frequently...
1,Enlarged Heart.Thread Enlarged Heart. I am alw...,1,Enlarged Heart Thread Enlarged Heart always di...,"[Christies cancer, England fantastic, Enlarged...",Christies cancer England fantastic Enlarged He...
2,Queen Latifah Joins American Heart Association...,0,Queen Latifah Joins American Heart Association...,"[ABC Radio, American Heart, Americans live, As...",ABC Radio American Heart Americans live Associ...
3,Bulaemia. I am 17 and I have been throwing up ...,1,Bulaemia throw year almost everyday throw bloo...,"[ANSWER Good, Ask school, Electrolyte imbalanc...",ANSWER Good Ask school Electrolyte imbalance F...
4,DIY Silver interconnects and RCAs???. Quote: O...,0,DIY Silver interconnect RCAs Quote Originally ...,"[Boyan Silyavski, DIY Silver, GROUNDED skip, O...",Boyan Silyavski DIY Silver GROUNDED skip Origi...


## Splitting Data into train and test

In [9]:
X_train, X_test, y_train, y_test = train_test_split(dftrain[['conv','conv_bigrams_']], dftrain['Patient_Tag'], 
                                                    test_size=0.20,
                                                    stratify=dftrain['Patient_Tag'],
                                                    random_state=0
                                                   )
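
`stratify=dftrain['Patient_Tag']` asks scikit-learn to keep the class proportions roughly equal in both parts of the split. The sketch below illustrates the idea with the standard library only. `stratified_split` is a hypothetical helper, not scikit-learn's implementation, and the toy labels are made up.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with each class split in the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # rounded per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [0] * 80 + [1] * 20   # toy data: 80% class 0, 20% class 1
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in test_idx))
```

Each class contributes the same fraction of its rows to the test part, so the minority class cannot end up missing from it by chance.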

## Linear SVC Model

The documents are vectorized with TF-IDF, which converts text into numeric vectors that the model can work with. Only the `conv_bigrams_` column is vectorized; the `ColumnTransformer` drops the `conv` column by default.
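
To make the vectorization step concrete, the sketch below computes TF-IDF weights by hand for a toy corpus. It assumes the smoothed formula that `TfidfVectorizer` uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization (`norm='l2'`). The two toy documents are made up, and only single words are counted, whereas the notebook uses `ngram_range=(1,2)`.

```python
import math
from collections import Counter

docs = ["heart pain heart", "heart news"]          # toy corpus (made up)
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(t for doc in tokenized for t in set(doc))
# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}


def tfidf_vector(tokens):
    """Raw term counts times idf, scaled to unit L2 length."""
    counts = Counter(tokens)
    weights = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


for doc, tokens in zip(docs, tokenized):
    print(doc, [round(w, 3) for w in tfidf_vector(tokens)])
```

A term that occurs in every document (`heart`) gets the minimum idf of 1, while rarer terms (`pain`, `news`) are weighted up.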

In [10]:
# Named column_transformer so it does not shadow the preprocessing() text-cleaning function.
column_transformer = ColumnTransformer(
    transformers=[
                  ('Text_features',TfidfVectorizer(ngram_range=(1,2), analyzer='word', norm='l2'),'conv_bigrams_'),
    ])

In [11]:
LinearSVC_pipeline = Pipeline(steps=[('preprocessing', column_transformer),
                                   ('LinearSVC', LinearSVC(class_weight='balanced',loss='hinge', max_iter=5000,C=1))
                                  ])

LinearSVC_pipeline = LinearSVC_pipeline.fit(X_train,y_train) 

In [12]:
acc=LinearSVC_pipeline.score(X_test, y_test)

In [13]:
acc

0.9008620689655172

## Accuracy of Linear SVC Model



In [14]:
LinearSVC_pred = LinearSVC_pipeline.predict(X_test)
print(metrics.classification_report(y_test,LinearSVC_pred))

              precision    recall  f1-score   support

           0       0.95      0.92      0.94       184
           1       0.74      0.81      0.77        48

    accuracy                           0.90       232
   macro avg       0.84      0.87      0.85       232
weighted avg       0.91      0.90      0.90       232
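
The `macro avg` and `weighted avg` rows can be reproduced from the per-class rows. The sketch below does this for precision, using the rounded values printed above, so the results agree with the report only up to rounding.

```python
# Per-class precision and support, copied from the report above.
precision = {0: 0.95, 1: 0.74}
support = {0: 184, 1: 48}

total = sum(support.values())
# Macro average: plain mean over classes, every class counts equally.
macro = sum(precision.values()) / len(precision)
# Weighted average: mean over classes, weighted by their support.
weighted = sum(precision[c] * support[c] for c in precision) / total

print(f"macro avg    {macro:.3f}")
print(f"weighted avg {weighted:.3f}")
```

The macro average is lower than the weighted one because it gives the smaller class 1, where precision is weaker, the same weight as class 0.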



## Test dataset

In [15]:
datatest=pd.read_csv('dataset/test.csv')

In [16]:
wordnet_lemmatizer = WordNetLemmatizer()   # lemmatizer
stop_words = stopwords.words('english')    # English stop words (the NLTK file id is lowercase)
# keep the negations 'not' and 'no' in the text
stop_words.remove('not')
stop_words.remove('no')

def preprocessing(myString):
    """to clean the text, to remove contraction"""
    myString = emoji.get_emoji_regexp().sub(r'', myString)  
    string_encode = myString.encode() 
    myString = string_encode.decode()   
    myString = contractions.fix(myString)   
    myString = re.sub('[^a-zA-Z]', ' ',myString) 
    myString = myString.strip()
    tokenizer = WordPunctTokenizer()   
    tokens = tokenizer.tokenize(myString)
    tokens = [t for t in tokens if t.lower() not in stop_words]
    tokens = [wordnet_lemmatizer.lemmatize(x) for x in tokens ]
    tokens = [wordnet_lemmatizer.lemmatize(x,pos='v') for x in tokens ]
    myString=' '.join(tokens)
    return myString

def get_bigrams(myString):
    """To create bigrams"""
    try:
        tokenizer = WordPunctTokenizer()   
        tokens = tokenizer.tokenize(myString)
        bigram_finder = BigramCollocationFinder.from_words(tokens)
        bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 100)
        tt=["%s %s" % bigram_tuple for bigram_tuple in bigrams]
        result = [x for x in tt ]
    except Exception as e:
        print(myString)
        result=[]
    return result

In [18]:
dftest=prepare_data(datatest,['Index','Title','TRANS_CONV_TEXT'],['Index','New TRANS_CONV_TEXT'])


# Predicting and saving predictions

In [19]:
X_result=dftest[['conv','conv_bigrams_']]
LinearSVC_model=LinearSVC_pipeline
SVC_pred_class = LinearSVC_model.predict(X_result)
dftest['Patient_Tag']=SVC_pred_class
testFile=dftest[['Index','Patient_Tag']]
testFile.to_csv('final.csv',index=False)

In [20]:
testFile.head(10)

Unnamed: 0,Index,Patient_Tag
0,1,0
1,2,1
2,3,0
3,4,1
4,5,0
5,6,0
6,7,0
7,8,1
8,9,0
9,10,0


In [21]:
testFile.tail()

Unnamed: 0,Index,Patient_Tag
566,567,0
567,568,0
568,569,0
569,570,1
570,571,1


# Data For transformer

In [22]:
from flair.data import Corpus
from flair.data import Sentence
from flair.datasets import CSVClassificationCorpus
from flair.embeddings import TransformerDocumentEmbeddings
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentRNNEmbeddings
from torch.optim.adam import Adam
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

import torch
from torch.optim.lr_scheduler import OneCycleLR

In [23]:
X_val_t, X_test_t, y_val_t, y_test_t = train_test_split(X_test, y_test, 
                                                test_size=0.20,
                                                random_state=0)
X_train_t=X_train[['conv']].copy()
y_train_t=y_train.copy()
X_train_t['Patient_Tag'] = y_train_t
X_test_t = X_test_t.assign(Patient_Tag=y_test_t)   # assign() returns a new frame, avoiding SettingWithCopyWarning
X_val_t = X_val_t.assign(Patient_Tag=y_val_t)
print(X_train_t.shape,X_test_t.shape,X_val_t.shape)

(924, 2) (47, 3) (185, 3)
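
The printed row counts follow from how `train_test_split` sizes a fractional `test_size`: the second part gets the ceiling of `test_size * n_samples` and the rest goes to the first part. A quick check, assuming that rounding rule and a cleaned training set of 1156 rows (924 + 232):

```python
import math


def split_sizes(n_samples, test_size):
    """Sizes of the (first, second) parts for a fractional test_size."""
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second


print(split_sizes(1156, 0.20))  # first split: rows for X_train and X_test
print(split_sizes(232, 0.20))   # second split: rows for X_val_t and X_test_t
```

The second split therefore leaves 185 rows for validation (dev) and only 47 rows for the final transformer test set.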



In [24]:
X_train_t[['conv','Patient_Tag']].to_csv('Training_data_zs/train.csv',index=False)
X_test_t[['conv','Patient_Tag']].to_csv('Training_data_zs/test.csv',index=False)
X_val_t[['conv','Patient_Tag']].to_csv('Training_data_zs/dev.csv',index=False)

## Transformers with Flair

In [25]:
torch.cuda.device_count()

0

## Data preparation for transformer

In [34]:
from flair.trainers import ModelTrainer
# this is the folder in which train, test and dev files reside
data_folder = './Training_data_zs'

# column format indicating which columns hold the text and label(s)
column_name_map = {0: "text", 1: "label_topic"}

label_type='Patient_Tag'
# load corpus containing training, test and dev data and if CSV has a header, you can skip it
corpus: Corpus = CSVClassificationCorpus(data_folder,
                                         column_name_map,
                                         skip_header=True,
                                         delimiter=',',
                                         label_type=label_type) 

2021-12-19 21:40:32,350 Reading data from Training_data_zs
2021-12-19 21:40:32,352 Train: Training_data_zs\train.csv
2021-12-19 21:40:32,353 Dev: Training_data_zs\dev.csv
2021-12-19 21:40:32,354 Test: Training_data_zs\test.csv


In [35]:
# 2. create the label dictionary
label_dict = corpus.make_label_dictionary(label_type=label_type)

2021-12-19 21:40:33,713 Computing label dictionary. Progress:


100%|███████████████████████████████████████████████████████████████████████████████| 924/924 [00:07<00:00, 116.02it/s]

2021-12-19 21:42:49,841 Corpus contains the labels: Patient_Tag (#924)
2021-12-19 21:42:49,842 Created (for label 'Patient_Tag') Dictionary with 2 tags: 0, 1





## 3. BERT

In [24]:
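# 3. initialize transformer document embeddings (many models are available)
import flair  # needed for flair.device; only flair submodules were imported earlier

flair.device = torch.device('cpu')  # run on CPU (torch.cuda.device_count() returned 0)
document_embeddings = TransformerDocumentEmbeddings('bert-base-uncased', fine_tune=True)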

# 4. create the text classifier
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict,label_type=label_type)

# 5. initialize the text classifier trainer with the AdamW optimizer
trainer = ModelTrainer(classifier, corpus, optimizer=torch.optim.AdamW)

In [25]:
trainer.train('resources_BERT/taggers/trec',
               learning_rate=3e-5, # use very small learning rate
               mini_batch_size=16,
               mini_batch_chunk_size=4, # optionally set this if transformer is too much for your machine
               max_epochs=5, # terminate after 5 epochs
               scheduler=OneCycleLR,
               embeddings_storage_mode='none',
               weight_decay=0.,
               )

2021-12-14 13:48:21,806 ----------------------------------------------------------------------------------------------------
2021-12-14 13:48:21,809 Model: "TextClassifier(
  (loss_function): CrossEntropyLoss()
  (document_embeddings): TransformerDocumentEmbeddings(
    (model): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(30522, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (token_type_embeddings): Embedding(2, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0): BertLayer(
            (attention): BertAttention(
              (self): BertSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_fe

2021-12-14 13:48:21,812 ----------------------------------------------------------------------------------------------------
2021-12-14 13:48:21,812 Corpus: "Corpus: 924 train + 185 dev + 47 test sentences"
2021-12-14 13:48:21,813 ----------------------------------------------------------------------------------------------------
2021-12-14 13:48:21,814 Parameters:
2021-12-14 13:48:21,816  - learning_rate: "3e-05"
2021-12-14 13:48:21,817  - mini_batch_size: "16"
2021-12-14 13:48:21,817  - patience: "3"
2021-12-14 13:48:21,818  - anneal_factor: "0.5"
2021-12-14 13:48:21,820  - max_epochs: "5"
2021-12-14 13:48:21,821  - shuffle: "True"
2021-12-14 13:48:21,821  - train_with_dev: "False"
2021-12-14 13:48:21,822  - batch_growth_annealing: "False"
2021-12-14 13:48:21,823 ----------------------------------------------------------------------------------------------------
2021-12-14 13:48:21,823 Model training base path: "resources_BERT\taggers\trec"
2021-12-14 13:48:21,824 -------------------

2021-12-14 19:06:51,726 epoch 5 - iter 30/58 - loss 0.00483408 - samples/sec: 0.29 - lr: 0.000001
2021-12-14 19:10:37,645 epoch 5 - iter 35/58 - loss 0.00423217 - samples/sec: 0.35 - lr: 0.000000
2021-12-14 19:14:42,077 epoch 5 - iter 40/58 - loss 0.00433103 - samples/sec: 0.33 - lr: 0.000000
2021-12-14 19:18:32,803 epoch 5 - iter 45/58 - loss 0.00402068 - samples/sec: 0.35 - lr: 0.000000
2021-12-14 19:22:31,889 epoch 5 - iter 50/58 - loss 0.00382390 - samples/sec: 0.33 - lr: 0.000000
2021-12-14 19:26:30,623 epoch 5 - iter 55/58 - loss 0.00377895 - samples/sec: 0.34 - lr: 0.000000
2021-12-14 19:28:51,329 ----------------------------------------------------------------------------------------------------
2021-12-14 19:28:51,376 EPOCH 5 done: loss 0.0037 - lr 0.0000000
2021-12-14 19:32:57,654 DEV : loss 0.11610954999923706 - f1-score (micro avg)  0.8919
2021-12-14 19:32:58,138 BAD EPOCHS (no improvement): 4
2021-12-14 19:32:59,170 ---------------------------------------------------------

{'test_score': 0.8936170212765957,
 'dev_score_history': [0.8324324324324325,
  0.854054054054054,
  0.8864864864864865,
  0.8972972972972972,
  0.8918918918918919],
 'train_loss_history': [0.09392895966813496,
  0.051846714226012884,
  0.02748573507872437,
  0.013488688543841698,
  0.0037132242208883165],
 'dev_loss_history': [tensor(0.0778),
  tensor(0.0685),
  tensor(0.0833),
  tensor(0.1085),
  tensor(0.1161)]}
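
The returned history is easier to read side by side. The sketch below uses the values printed above, rounded to four decimals, and finds the epoch with the best dev score. The variable names are illustrative.

```python
dev_score = [0.8324, 0.8541, 0.8865, 0.8973, 0.8919]
train_loss = [0.0939, 0.0518, 0.0275, 0.0135, 0.0037]
dev_loss = [0.0778, 0.0685, 0.0833, 0.1085, 0.1161]

# Epochs are numbered from 1, list positions from 0.
best_epoch = max(range(len(dev_score)), key=dev_score.__getitem__) + 1
lowest_dev_loss_epoch = min(range(len(dev_loss)), key=dev_loss.__getitem__) + 1

for epoch, (tl, dl, ds) in enumerate(zip(train_loss, dev_loss, dev_score), start=1):
    print(f"epoch {epoch}: train loss {tl:.4f}  dev loss {dl:.4f}  dev score {ds:.4f}")
print("best dev score in epoch", best_epoch)
print("lowest dev loss in epoch", lowest_dev_loss_epoch)
```

Training loss keeps falling while dev loss rises after epoch 2, which is a common sign of overfitting; the dev score still peaks at epoch 4.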

## 4. DistilBERT

In [103]:
# # 3. initialize transformer document embeddings (many models are available)
# document_embeddings = TransformerDocumentEmbeddings('distilbert-base-uncased', fine_tune=True)
# # flair.device = torch.device('cpu')
# # 4. create the text classifier
# classifier = TextClassifier(document_embeddings, label_dictionary=label_dict,label_type=label_type)

# # 5. initialize the text classifier trainer with Adam optimizer
# trainer = ModelTrainer(classifier, corpus, optimizer=Adam)

In [104]:
# # 6. start the training
# trainer.train('resources_distilbert/taggers/trec',
#               learning_rate=3e-5, # use very small learning rate
#               mini_batch_size=16,
#               mini_batch_chunk_size=4, # optionally set this if transformer is too much for your machine
#               max_epochs=5, # terminate after 5 epochs .86
#               )

## 5. ALBERT

In [26]:
# 3. initialize transformer document embeddings (many models are available)
#flair.device = torch.device('cpu')
document_embeddings = TransformerDocumentEmbeddings('albert-base-v1', fine_tune=True)

# 4. create the text classifier
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict,label_type=label_type)

# 5. initialize the text classifier trainer with the AdamW optimizer
trainer = ModelTrainer(classifier, corpus, optimizer=torch.optim.AdamW)

In [27]:
# 6. start the training
trainer.train('resources_albert/taggers/trec',
               learning_rate=3e-5, # use very small learning rate
               mini_batch_size=16,
               mini_batch_chunk_size=4, # optionally set this if transformer is too much for your machine
               max_epochs=5, # terminate after 5 epochs
               embeddings_storage_mode='none',
               weight_decay=0.,
               scheduler=OneCycleLR,
               )

2021-12-14 19:36:17,098 ----------------------------------------------------------------------------------------------------
2021-12-14 19:36:17,100 Model: "TextClassifier(
  (loss_function): CrossEntropyLoss()
  (document_embeddings): TransformerDocumentEmbeddings(
    (model): AlbertModel(
      (embeddings): AlbertEmbeddings(
        (word_embeddings): Embedding(30000, 128, padding_idx=0)
        (position_embeddings): Embedding(512, 128)
        (token_type_embeddings): Embedding(2, 128)
        (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): AlbertTransformer(
        (embedding_hidden_mapping_in): Linear(in_features=128, out_features=768, bias=True)
        (albert_layer_groups): ModuleList(
          (0): AlbertLayerGroup(
            (albert_layers): ModuleList(
              (0): AlbertLayer(
                (full_layer_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)

2021-12-14 21:57:13,276 epoch 4 - iter 5/58 - loss 0.04045948 - samples/sec: 0.36 - lr: 0.000009
2021-12-14 22:00:43,566 epoch 4 - iter 10/58 - loss 0.03100806 - samples/sec: 0.38 - lr: 0.000009
2021-12-14 22:04:15,442 epoch 4 - iter 15/58 - loss 0.02406540 - samples/sec: 0.38 - lr: 0.000008
2021-12-14 22:07:38,166 epoch 4 - iter 20/58 - loss 0.02504253 - samples/sec: 0.39 - lr: 0.000007
2021-12-14 22:10:26,488 epoch 4 - iter 25/58 - loss 0.02396558 - samples/sec: 0.48 - lr: 0.000007
2021-12-14 22:14:22,159 epoch 4 - iter 30/58 - loss 0.02305727 - samples/sec: 0.34 - lr: 0.000006
2021-12-14 22:18:31,982 epoch 4 - iter 35/58 - loss 0.02294940 - samples/sec: 0.32 - lr: 0.000005
2021-12-14 22:21:59,961 epoch 4 - iter 40/58 - loss 0.02223650 - samples/sec: 0.38 - lr: 0.000005
2021-12-14 22:25:04,160 epoch 4 - iter 45/58 - loss 0.02322185 - samples/sec: 0.43 - lr: 0.000004
2021-12-14 22:28:58,048 epoch 4 - iter 50/58 - loss 0.02333909 - samples/sec: 0.34 - lr: 0.000004
2021-12-14 22:32:24,6

{'test_score': 0.8723404255319149,
 'dev_score_history': [0.8378378378378378,
  0.8162162162162163,
  0.8594594594594595,
  0.8648648648648649,
  0.8648648648648649],
 'train_loss_history': [0.0946300085152709,
  0.07088000135398868,
  0.04455240070959232,
  0.025354302336763716,
  0.017925507204830417],
 'dev_loss_history': [tensor(0.0716),
  tensor(0.0939),
  tensor(0.0696),
  tensor(0.0957),
  tensor(0.0846)]}
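
With only 47 test sentences, each test score corresponds to a whole number of correct predictions. The sketch below converts the two reported `test_score` values back into counts. Treating the score as plain accuracy is an assumption: the log labels it a micro-averaged F1, which equals accuracy for single-label classification.

```python
n_test = 47
scores = {"BERT": 0.8936170212765957, "ALBERT": 0.8723404255319149}

for name, score in scores.items():
    correct = round(score * n_test)
    print(f"{name}: {correct}/{n_test} correct")
```

The two models differ by a single test sentence, so the gap between them should be read with caution.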

## 2. GPT-2

In [2]:
# # 3. initialize transformer document embeddings (many models are available)
# #flair.device = torch.device('cpu')
# document_embeddings = TransformerDocumentEmbeddings('gpt2', fine_tune=True)

# # 4. create the text classifier
# classifier = TextClassifier(document_embeddings, label_dictionary=label_dict,label_type=label_type)

# # 5. initialize the text classifier trainer with Adam optimizer
# trainer = ModelTrainer(classifier, corpus, optimizer=torch.optim.AdamW)

In [1]:
# # 6. start the training
# trainer.train('resources_Gpt2/taggers/trec',
#                learning_rate=3e-5, # use very small learning rate
#                mini_batch_size=4,
#                mini_batch_chunk_size=1, # optionally set this if transformer is too much for your machine
#                max_epochs=4, 
#                embeddings_storage_mode='none',
#                weight_decay=0.,
#                scheduler=OneCycleLR,
#                )

# Predictions with ensemble model

In [27]:
model1=LinearSVC_pipeline #linear svc
model2=TextClassifier.load('./resources_BERT/taggers/trec/best-model.pt')  #bert
model3=TextClassifier.load('./resources_albert/taggers/trec/best-model.pt') #albert

2021-12-21 10:26:14,734 loading file ./resources_BERT/taggers/trec/best-model.pt
2021-12-21 10:26:32,566 loading file ./resources_albert/taggers/trec/best-model.pt


# Ensemble Model

In [28]:
prediction_list  = []
X_test_acc=X_test[['conv','conv_bigrams_']]
for i,row in X_test_acc.iterrows():
        #--LinearSVC prediction
        model1_class = model1.predict(X_test_acc.loc[[i]])[0]
        #Bert Prediction
        sentence = Sentence(row['conv'])
        model2.predict(sentence)
        model2_class = int(sentence.labels[0].value)
        #albert Prediction
        sentence = Sentence(row['conv'])
        model3.predict(sentence)
        model3_class = int(sentence.labels[0].value)
        pred_class_array = [model1_class,model2_class,model3_class]  #
        predicted_occurence_count = Counter(pred_class_array).most_common()
        prediction_list.append(predicted_occurence_count[0][0])            
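
The voting step inside the loop can be isolated into a small function. `majority_vote` is a hypothetical helper name; the logic mirrors the `Counter(...).most_common()` call above.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label predicted by most models."""
    # most_common(1) returns [(label, count)] for the most frequent label.
    return Counter(votes).most_common(1)[0][0]


print(majority_vote([0, 1, 1]))  # two of three models say 1
print(majority_vote([0, 0, 1]))  # two of three models say 0
```

With three voters and two classes a tie is impossible. With an even number of models, `most_common` would return whichever tied label was counted first.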

## Accuracy of ensemble model

In [29]:
from sklearn.metrics import accuracy_score
X_test['actual'] = y_test
X_test['predicted_category'] = prediction_list
ensemble_acc=accuracy_score(X_test['actual'], X_test['predicted_category'])
print(ensemble_acc)
print(metrics.classification_report(y_test, X_test['predicted_category']))

0.9224137931034483
              precision    recall  f1-score   support

           0       0.95      0.96      0.95       184
           1       0.83      0.79      0.81        48

    accuracy                           0.92       232
   macro avg       0.89      0.87      0.88       232
weighted avg       0.92      0.92      0.92       232
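
On the 232-row hold-out set, the two accuracies translate into counts of correct predictions. A small check using the figures printed above:

```python
n = 232
linear_svc_acc = 0.9008620689655172
ensemble_acc = 0.9224137931034483

svc_correct = round(linear_svc_acc * n)
ensemble_correct = round(ensemble_acc * n)
print(svc_correct, ensemble_correct, ensemble_correct - svc_correct)
```

The ensemble gets five more rows right than the Linear SVC alone. The transformer dev and test files were carved out of this same `X_test`, so the ensemble figure may be somewhat optimistic if the dev set influenced model selection.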



### Predicting and saving the results

In [30]:
prediction_list  = []
X_test_acc=dftest[['conv','conv_bigrams_']]
for i,row in X_test_acc.iterrows():
        #--LinearSVC prediction
        model1_class = model1.predict(X_test_acc.loc[[i]])[0]
        #Bert Prediction
        sentence = Sentence(row['conv'])
        model2.predict(sentence)
        model2_class = int(sentence.labels[0].value)
        #albert Prediction
        sentence = Sentence(row['conv'])
        model3.predict(sentence)
        model3_class = int(sentence.labels[0].value)
        pred_class_array = [model1_class,model2_class,model3_class] 
        predicted_occurence_count = Counter(pred_class_array).most_common()
        prediction_list.append(predicted_occurence_count[0][0])            

In [None]:
dftest=dftest[['Index']]
dftest['Patient_Tag']=prediction_list

In [None]:
dftest.to_csv('final_predictions_ensemble.csv',index=False)

# Thank you

command: jupyter nbconvert ZS.ipynb --to slides --post serve