<a href="https://colab.research.google.com/github/kristopherpaul/FraudCallDetector/blob/main/AIFraudCall.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Steps**
1. Scraping relevant tweets from Twitter
> The scraped tweets are stored in **Data/raw_scraped_tweets.txt**; the scraping was done by the **scrape_tweets.py** program
2. Normalizing the text of the scraped tweets and of the Santa Barbara Corpus of regular conversations
3. Training the XLNet Transformer model
4. Converting the call recordings of the final test set to transcripts with speaker diarization using IBM Watson's API
5. Normalizing the text of the final test set transcripts
6. Getting model predictions on the test set of 20 call recordings (10 fraud + 10 regular)
7. Evaluating the model's effectiveness

**Setting up the environment**

In [None]:
!pip install symspellpy
!pip install pylangacq
!pip install ibm_watson
!pip install phonetics
!pip install transformers
!pip install torch
%cd drive
%cd My Drive
%cd DLFraudCall

Collecting symspellpy
  Downloading https://files.pythonhosted.org/packages/99/af/e71fcca6a42b6a63f518b0c1627e1f67822815cb0cf71e6af05acbd75c78/symspellpy-6.7.0-py3-none-any.whl (2.6MB)
     |████████████████████████████████| 2.6MB 8.5MB/s 
Installing collected packages: symspellpy
Successfully installed symspellpy-6.7.0
Collecting pylangacq
  Downloading https://files.pythonhosted.org/packages/85/80/a86b86562e0c233babf9d63c6189917eac6f5c4ebe5119b52b2448208073/pylangacq-0.12.0-py3-none-any.whl (65kB)
     |████████████████████████████████| 71kB 5.3MB/s 
Installing collected packages: pylangacq
Successfully installed pylangacq-0.12.0
Collecting ibm_watson
  Downloading https://files.pythonhosted.org/packages/a2/3c/c2cfb41db546fe98820e89017c892d73991cef61b9c48680191fe703a214/ibm-watson-4.7.1.tar.gz (385kB)
     |████████████████████████████████| 389kB 9.4MB/s 
Collecting websocket-client==0.48.0
  Downloading https://files.pythonhosted.org/packages/8

**Required Library Imports**

In [None]:
import pandas as pd
import transformers
from tqdm import trange
from transformers import XLNetTokenizer, XLNetModel, AdamW, get_linear_schedule_with_warmup
from transformers import XLNetForSequenceClassification
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
import tensorflow as tf
from sklearn.model_selection import train_test_split
import keras
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence

from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

from symspellpy.symspellpy import SymSpell, Verbosity
from numpy import array,asarray,zeros
import pylangacq as pla
import spacy
import pkg_resources
import pickle
import random 
import string
import json
import re
import os

dirpath = "drive/My Drive/DLFraudCall"

**Converting the Santa Barbara Corpus (SBC) collection of regular call transcripts from .cha to .txt**

In [None]:
cha_f = pla.read_chat('./Data/SantaBarbaraCorpus/*.cha')
cha_sents = cha_f.utterances()

cha_d = {}

for name,sent in cha_sents:
    if name in cha_d:
        cha_d[name] += " "+sent
    else:
        cha_d[name] = sent

with open('./Data/raw_SBC.txt', 'w',encoding="utf-8") as filehandle:
    filehandle.writelines("%s\n" % value for key,value in cha_d.items())
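The grouping step in the cell above can be tried without the corpus files. This is a minimal sketch, assuming utterances arrive as `(participant, text)` pairs as in that loop; the helper name `group_by_participant` and the sample pairs are made up for illustration.

```python
from collections import defaultdict

def group_by_participant(utterances):
    """Concatenate each participant's utterances into one space-separated string."""
    grouped = defaultdict(list)
    for name, sent in utterances:
        grouped[name].append(sent)
    # Dicts keep insertion order, so participants appear in first-seen order.
    return {name: " ".join(sents) for name, sents in grouped.items()}

# Hypothetical sample data, not taken from the Santa Barbara Corpus.
sample = [("JAMIE", "how are you"), ("HAROLD", "fine thanks"), ("JAMIE", "good to hear")]
grouped = group_by_participant(sample)
print(grouped)
```

Each value of the resulting dictionary corresponds to one line written to **raw_SBC.txt**.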

**Functions for Text Normalization**

1. Normalizing Punctuation



In [None]:
def norm_punctuation(data,b):
    norm_data = []
    whitelist = set('abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ')
    for line in data:
        line = str(line)
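        # Drop parentheses and turn curly quotes into straight apostrophes
        line = re.sub(r'\(','',line)
        line = re.sub(r'\)','',line)
        line = re.sub('’','\'',line)
        line = re.sub(',',' ',line)
        line = re.sub('‘','\'',line)
        line = re.sub(r'\.',' ',line)
        # %HESITATION is the hesitation marker found in the speech-to-text transcripts
        line = re.sub('%HESITATION','',line)
        # Remove every apostrophe (the pattern matches runs of one or more)
        line = re.sub('\'*\'','',line)
        # Collapse repeated !, ? or ; into a single character
        line = re.sub(r'([!?,;])\1+', r'\1', line)
        # No effect as written: periods were already replaced by spaces above
        line = re.sub(r'\.{2,}', r'...', line)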
        if b:
            #Only for SBC Data
            line = ''.join(filter(whitelist.__contains__,line))
        norm_data.append(line)
    return norm_data

2. Removing Tags like **@userid** mainly from tweets



In [None]:
def rem_tag(data):
    norm_data = []
    for line in data:
        line = str(line)
        line = re.sub(r'@[A-Za-z0-9\.\-+_]+', r'', line)
        norm_data.append(line)
    return norm_data

3. Normalizing Whitespaces

In [None]:
def norm_whitespace(data):
    norm_data = []
    for line in data:
        line = str(line)
        line = re.sub(r"//t",r"\t", line)
        line = re.sub(r"( )\1+",r"\1", line)
        line = re.sub(r"(\n)\1+",r"\1", line)
        line = re.sub(r"(\r)\1+",r"\1", line)
        line = re.sub(r"(\t)\1+",r"\1", line)
        norm_data.append(line.strip(" "))
    return norm_data

4. Normalizing Character cases

In [None]:
def norm_case(data):
    norm_data = []
    for line in data:
        line = str(line)
        line = line.lower()
        norm_data.append(line)
    return norm_data

5. Expanding Contractions eg: **we're** is replaced with **we are**

In [None]:
def other_contrac(data):
    othercon = json.loads(open('./NLP_txt/othercon.json', 'r').read())
    norm_data = []
    for line in data:
        tokens = line.split()
        new_tokens = []
        for t_pos in range(0,len(tokens)):
            if tokens[t_pos] in othercon:
                new_tokens.append(othercon[tokens[t_pos]])
            else:
                new_tokens.append(tokens[t_pos])
        new_line = " ".join(new_tokens).strip(" ")
        norm_data.append(new_line)
    return norm_data

def norm_contractions(data):
    stdcon = json.loads(open('./NLP_txt/stdcon.json', 'r').read())
    norm_data = []
    for line in data:
        tokens = line.split()
        new_tokens = []
        skip = False
        for t_pos in range(0,len(tokens)):
            if skip:
                skip = False
                continue
            if tokens[t_pos] in stdcon:
                new_tokens.append(stdcon[tokens[t_pos]])
            elif (t_pos < (len(tokens)-1)) and (str(tokens[t_pos]+"'"+tokens[t_pos+1]) in stdcon):
                new_tokens.append(stdcon[str(tokens[t_pos]+"'"+tokens[t_pos+1])])
                skip = True
            else:
                new_tokens.append(tokens[t_pos])
        new_line = " ".join(new_tokens).strip(" ")
        norm_data.append(new_line)
    return norm_data
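The lookup logic of `norm_contractions` can be sketched on a single line of text. This is an illustration under assumptions: the contents of `stdcon.json` are not shown in this notebook, so the mapping below is hypothetical, and `expand_contractions` is a made-up helper name.

```python
def expand_contractions(line, mapping):
    """Expand contractions in one line using a lookup table.

    A token is replaced when it is a key of `mapping`; two adjacent tokens are
    replaced together when joining them with an apostrophe gives a key.
    """
    tokens = line.split()
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        joined = tok + "'" + tokens[i + 1] if i + 1 < len(tokens) else None
        if tok in mapping:
            out.append(mapping[tok])
        elif joined in mapping:
            out.append(mapping[joined])
            i += 1  # the next token was consumed as part of the contraction
        else:
            out.append(tok)
        i += 1
    return " ".join(out)

# Hypothetical mapping; the real entries live in ./NLP_txt/stdcon.json.
sample_map = {"we're": "we are", "don't": "do not"}
expanded = expand_contractions("we're sure they don t know", sample_map)
print(expanded)
```

The two-token branch covers contractions whose halves were separated earlier in the pipeline, such as `don t`.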

6. Spelling Corrections along with reducing exaggerations eg: **ohhh** is replaced with **oh**

In [None]:
def spell_correction(data):
    mx_edit_dist = 3
    pref_len = 4
    spellchecker = SymSpell(mx_edit_dist,pref_len)
    dictionary_path = pkg_resources.resource_filename("symspellpy","frequency_dictionary_en_82_765.txt")
    bigram_path = pkg_resources.resource_filename("symspellpy","frequency_bigramdictionary_en_243_342.txt")
    spellchecker.load_dictionary(dictionary_path,term_index=0,count_index=1)
    spellchecker.load_bigram_dictionary(bigram_path,term_index=0,count_index=2)
    norm_data = []
    for line in data:
        norm_data.append(spell_correction_line(line,spellchecker))
    return norm_data

def reduce_exaggeration(line):
    line = str(line)
    return re.sub(r'([\w])\1+', r'\1', line)

def is_numeric(line):
    for char in line:
        if not (char in "0123456789" or char in ",%.$"):
            return False
    return True

def spell_correction_line(line,spellchecker):
    if len(line) < 1:
        return ""
    mx_edit_dist_l = 2
    suggest_verbosity = Verbosity.TOP
    token_list = line.split()
    for word_pos in range(len(token_list)):
        word = token_list[word_pos]
        if word is None:
            token_list[word_pos] = ""
            continue
        if not '\n' in word and word not in string.punctuation and not is_numeric(word) and not (word in spellchecker.words.keys()):
            suggestions = spellchecker.lookup(word,suggest_verbosity,mx_edit_dist_l)
            n_word = ""
            if len(suggestions) > 0:
                n_word = suggestions[0].term
            else:
                n_word = reduce_exaggeration(word)
            token_list[word_pos] = n_word
    return " ".join(token_list).strip()
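The regular expression in `reduce_exaggeration` is easiest to understand on a few inputs. The sketch below repeats the same pattern on made-up words; note that it also shortens legitimate double letters, which may be why the function is only used as a fallback when SymSpell returns no suggestion.

```python
import re

def reduce_exaggeration(word):
    # Collapse every run of a repeated word character to a single character.
    return re.sub(r"(\w)\1+", r"\1", word)

examples = ["ohhh", "sooooo", "hello"]
reduced = [reduce_exaggeration(w) for w in examples]
print(reduced)  # ['oh', 'so', 'helo']
```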

7. Removing Stopwords

In [None]:
def rem_pre_stopwords(data):
    new_data = []
    stopwords = []
    with open('./NLP_txt/pre_stopwords.txt', 'r') as filehandle:
        stopwords = [word.strip() for word in filehandle.readlines()]
    for line in data:
        words = line.split(" ")
        new_words = []
        for word in words:
            if word not in stopwords:
                new_words.append(word)
        new_line = " ".join(new_words).strip()
        new_data.append(new_line)
    return new_data

def rem_stopwords(data):
    new_data = []
    stopwords = []
    with open('./NLP_txt/stopwords.txt', 'r') as filehandle:
        stopwords = [word.strip() for word in filehandle.readlines()]
    for line in data:
        words = line.split(" ")
        new_words = []
        for word in words:
            if word not in stopwords:
                new_words.append(word)
        new_line = " ".join(new_words).strip()
        new_data.append(new_line)
    return new_data

8. Lemmatizing to group together variant forms of the same word eg: **changing** is replaced with **change**

In [None]:
def lemmatize(data):
    nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])
    new_norm=[]
    for sentence in data:
        new_norm.append(_lemmatize_text(sentence, nlp).strip())
    return new_norm

def _lemmatize_text(sentence, nlp):
    sent = ""
    doc = nlp(sentence)
    for token in doc:
        sent+=" "+token.lemma_
    return sent

**Grouping the whole Text Normalization process into a single function**

In [None]:
def normalize_data(data,b):
    data = norm_punctuation(data,b)
    data = rem_tag(data)
    data = norm_whitespace(data)
    data = norm_case(data)
    data = other_contrac(data)
    data = norm_contractions(data)
    data = norm_case(data)
    data = norm_whitespace(data)
    if b:
        data = spell_correction(data)
    data = lemmatize(data)

    for i in range(len(data)):
        data[i] = re.sub('-PRON-','',data[i])

    data = norm_whitespace(data)
    data = rem_pre_stopwords(data)
    data = rem_stopwords(data)
    data = set(data)
    return data

Normalizing **raw_scraped_tweets.txt**

In [None]:
data = []

with open('./Data/raw_scraped_tweets.txt', 'r',encoding="utf-8") as filehandle:
    data = [line.strip() for line in filehandle.readlines()]

b_words1 = []
for line in data:
    for word in line.split():
        b_words1.append(word)
b_words1 = set(b_words1)

data = normalize_data(data,False)

a_words1 = []
for line in data:
    for word in line.split():
        a_words1.append(word)
a_words1 = set(a_words1)

with open('./Data/norm_scraped_tweets.txt', 'w',encoding="utf-8") as filehandle:
    filehandle.writelines("%s\n" % line for line in data)

print("raw_scraped_tweets.txt\n---------------------")
print("No. of distinct words before Text normalization:",len(b_words1))
print("No. of distinct words after Text normalization:",len(a_words1))

raw_scraped_tweets.txt
---------------------
No. of distinct words before Text normalization: 2337
No. of distinct words after Text normalization: 1411


Normalizing **raw_SBC.txt**

In [None]:
data = []

with open('./Data/raw_SBC.txt', 'r',encoding="utf-8") as filehandle:
    data = [line.strip() for line in filehandle.readlines()]

b_words2 = []
for line in data:
    for word in line.split():
        b_words2.append(word)
b_words2 = set(b_words2)

data = normalize_data(data,True)

a_words2 = []
for line in data:
    for word in line.split():
        a_words2.append(word)
a_words2 = set(a_words2)

with open('./Data/norm_SBC.txt', 'w',encoding="utf-8") as filehandle:
    filehandle.writelines("%s\n" % line for line in data)

print("raw_SBC.txt\n-----------")
print("No. of distinct words before Text normalization:",len(b_words2))
print("No. of distinct words after Text normalization:",len(a_words2))

raw_SBC.txt
-----------
No. of distinct words before Text normalization: 7220
No. of distinct words after Text normalization: 3417


After text normalization, the number of distinct words drops by about **40%** in **raw_scraped_tweets.txt** (2337 to 1411) and by about **53%** in **raw_SBC.txt** (7220 to 3417).
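The percentages follow directly from the word counts printed by the two cells above. A short check, with `percent_reduction` as a helper name chosen for this example:

```python
def percent_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    return 100.0 * (before - after) / before

tweets = percent_reduction(2337, 1411)  # raw_scraped_tweets.txt
sbc = percent_reduction(7220, 3417)     # raw_SBC.txt
print(f"tweets: {tweets:.1f}%  SBC: {sbc:.1f}%")
```

The values come out at roughly 39.6% and 52.7%, which round to the figures quoted.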

Combining **norm_scraped_tweets.txt** and **norm_SBC.txt** into **Dataset.csv** (label 1 for tweets, 0 for SBC)

In [None]:
MAX_LEN = 20

data_labels = []

with open('./Data/norm_scraped_tweets.txt', 'r',encoding="utf-8") as filehandle:
    data_tweets = [line.strip() for line in filehandle.readlines()]
    data_labels.extend([1]*len(data_tweets))

with open('./Data/norm_SBC.txt', 'r',encoding="utf-8") as filehandle:
    data_sbc = [line.strip() for line in filehandle.readlines()]
    data_labels.extend([0]*len(data_sbc))

# Avoid shadowing the built-in dict
dataset = {'text': data_tweets+data_sbc, 'labels': data_labels}
df = pd.DataFrame(dataset)
df.to_csv('./Data/Dataset.csv')

**XLNet** 
1. Tokenizing Sentences of the Dataset


In [None]:
df = pd.read_csv('./Data/Dataset.csv')
sents = df.text.values
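# Note: XLNet's own special tokens are "<sep>" and "<cls>"; the literal strings
# "[SEP] [CLS]" appended here are split into ordinary word pieces by this tokenizer.
sents = [str(sent) + " [SEP] [CLS]" for sent in sents]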
labels = df.labels.values
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased',do_lower_case=True)
tokenized_sents = [tokenizer.tokenize(sent) for sent in sents]





2. Padding/truncating sequences and creating attention masks



In [None]:
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_sents]
input_ids = pad_sequences(input_ids,maxlen=MAX_LEN,dtype="long",truncating="post",padding="post")
attention_masks = []
for seq in input_ids:
    seq_mask = [float(i>0) for i in seq]
    attention_masks.append(seq_mask)
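What `pad_sequences` and the mask loop do to a single sequence can be shown in plain Python. This is a sketch rather than the Keras implementation: it assumes post-truncation, post-padding and a padding value of 0, as in the cell above, and the token ids are made up.

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Cut `ids` to `max_len` items or right-pad it with `pad_id`."""
    ids = list(ids)[:max_len]
    return ids + [pad_id] * (max_len - len(ids))

def attention_mask(padded_ids, pad_id=0):
    """1.0 for real tokens, 0.0 for padding positions."""
    return [float(i != pad_id) for i in padded_ids]

# Hypothetical token ids, chosen only for illustration.
short = pad_or_truncate([17, 42, 9], max_len=5)
long_ = pad_or_truncate([3, 1, 4, 1, 5, 9, 2], max_len=5)
print(short, attention_mask(short))
print(long_, attention_mask(long_))
```

With `MAX_LEN = 20`, any tokens beyond the first 20 of a sentence are discarded, and shorter sentences are masked from their last real token onward.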

3. Splitting Dataset into training and cross validation sets in the ratio 80:20

In [None]:
train_sents,cv_sents,train_labels,cv_labels = train_test_split(input_ids,labels,random_state=56,test_size=0.2)
train_masks,cv_masks, _, _ = train_test_split(attention_masks,input_ids,random_state=56,test_size=0.2)

In [None]:
train_sents = torch.tensor(train_sents)
cv_sents = torch.tensor(cv_sents)
train_labels = torch.tensor(train_labels)
cv_labels = torch.tensor(cv_labels)
train_masks = torch.tensor(train_masks)
cv_masks = torch.tensor(cv_masks)

4. Loading the training and cross validation data into DataLoaders

In [None]:
batch_size = 32

train_data = TensorDataset(train_sents,train_masks,train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data,sampler=train_sampler,batch_size=batch_size)

cv_data = TensorDataset(cv_sents,cv_masks,cv_labels)
cv_sampler = SequentialSampler(cv_data)
cv_dataloader = DataLoader(cv_data,sampler=cv_sampler,batch_size=batch_size)

5. Choosing the Model

In [None]:
model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased",num_labels=2)


Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForSequenceClassification: ['lm_loss.weight', 'lm_loss.bias']
- This IS expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['sequence_summary.summary.weight', 'sequence_summary.summary.bias', 'logits_proj.weight', 'logits_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions a

6. Configuring the optimizer

In [None]:
param_optimizer = list(model.named_parameters())
no_decay = ['bias','gamma','beta']
# The optimizer reads the per-group key 'weight_decay'; an unrecognized key
# such as 'weight_decay_rate' has no effect.
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters,lr=2e-5)
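The grouping in the cell above is a substring match on parameter names. The sketch below applies the same rule to a handful of hypothetical names (they are illustrative, not read from the model) so that the resulting split is visible; `split_by_decay` is a made-up helper.

```python
def split_by_decay(named_params, no_decay=("bias", "gamma", "beta")):
    """Partition (name, param) pairs by whether the name contains a no-decay marker."""
    decay, skip = [], []
    for name, _param in named_params:
        (skip if any(nd in name for nd in no_decay) else decay).append(name)
    return decay, skip

# Hypothetical parameter names; the values are placeholders.
params = [("layer.0.ff.weight", None), ("layer.0.ff.bias", None),
          ("layer.0.layer_norm.weight", None), ("logits_proj.bias", None)]
decay, skip = split_by_decay(params)
print("decay:", decay)
print("no decay:", skip)
```

Under this rule a normalization weight whose name contains neither `gamma` nor `beta` lands in the decayed group, so it is worth printing `model.named_parameters()` to confirm which names the patterns actually catch.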

7. Training the Model

In [None]:
device = torch.device("cpu")
train_loss_set = []

epochs = 12

for _ in trange(epochs, desc="Epoch"):
    model.train()
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0
    for step, batch in enumerate(train_dataloader):
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        optimizer.zero_grad()
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
        loss = outputs[0]
        logits = outputs[1]
        train_loss_set.append(loss.item())    
        loss.backward()
        optimizer.step()

        tr_loss += loss.item()
        nb_tr_examples += b_input_ids.size(0)
        nb_tr_steps += 1

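    # Evaluate with dropout disabled; model.train() is called again at the start of the next epoch
    model.eval()
    with torch.no_grad():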
        correct = 0
        total = 0
        for i, batch in enumerate(train_dataloader):
            batch = tuple(t.to(device) for t in batch)
            b_input_ids, b_input_mask, b_labels = batch
            outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
            prediction = torch.argmax(outputs[0],dim=1)
            total += b_labels.size(0)
            correct+=(prediction==b_labels).sum().item()
        print('Train Accuracy of the model on train data is: {} %'.format(100 * correct / total))

    model.eval()  # keep dropout off for the cross-validation pass
    with torch.no_grad():
        correct = 0
        total = 0
        for i, batch in enumerate(cv_dataloader):
            batch = tuple(t.to(device) for t in batch)
            b_input_ids, b_input_mask, b_labels = batch
            outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
            prediction = torch.argmax(outputs[0],dim=1)
            total += b_labels.size(0)
            correct+=(prediction==b_labels).sum().item()
        print('Cross Validation Accuracy of the model on cv data is: {} %'.format(100 * correct / total))
        cv_acc = (100.0 * correct) / total
        if cv_acc >= 93:
            break

Epoch:   0%|          | 0/12 [00:00<?, ?it/s]

Train Accuracy of the model on train data is: 87.62214983713355 %


Epoch:   8%|▊         | 1/12 [01:29<16:20, 89.09s/it]

Cross Validation Accuracy of the model on cv data is: 83.11688311688312 %
Train Accuracy of the model on train data is: 87.62214983713355 %


Epoch:  17%|█▋        | 2/12 [02:57<14:47, 88.80s/it]

Cross Validation Accuracy of the model on cv data is: 83.11688311688312 %
Train Accuracy of the model on train data is: 96.09120521172639 %


Epoch:  25%|██▌       | 3/12 [04:25<13:18, 88.67s/it]

Cross Validation Accuracy of the model on cv data is: 89.6103896103896 %
Train Accuracy of the model on train data is: 97.06840390879479 %


Epoch:  33%|███▎      | 4/12 [05:54<11:49, 88.64s/it]

Cross Validation Accuracy of the model on cv data is: 89.6103896103896 %
Train Accuracy of the model on train data is: 99.0228013029316 %
Cross Validation Accuracy of the model on cv data is: 93.50649350649351 %


**Speech to Text conversion of Call Recordings with Speaker Diarization using IBM Watson's API**

1. Sending the API request and receiving the JSON response

In [None]:
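# Read the API key from the environment instead of hard-coding it in the notebook.
# The variable name IBM_WATSON_APIKEY is an assumption; use whatever name you export.
apikey = os.environ["IBM_WATSON_APIKEY"]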
url = "https://api.au-syd.speech-to-text.watson.cloud.ibm.com/instances/70141360-b930-4968-bd81-ccf647e25b31"

authenticator = IAMAuthenticator(apikey)
speech_to_text = SpeechToTextV1(authenticator=authenticator)
speech_to_text.set_service_url(url)

file_done = []
for filename in os.listdir('./SpeechToText/json'):
    file_done.append(filename.split('.')[0])

for filename in os.listdir('./CallRecordings'):
    if filename.split('.')[0] in file_done:
        continue
    print("Converting",filename,"....",end=" ")
    f_type = filename.split('.')[1]
    file_path = "./CallRecordings/"+filename
    with open(file_path, 'rb') as f:
        c_type = 'audio/'+f_type
        res = speech_to_text.recognize(audio=f,content_type=c_type,model='en-US_NarrowbandModel',continuous=True,speaker_labels=True).get_result()
        out_json = json.dumps(res,indent=2)

    out_file = filename.split('.')[0]+".json"
    out_path = "./SpeechToText/json/"+out_file
    with open(out_path, "w") as outfile:
        outfile.write(out_json)
    print("Done")

2. Extracting text from the JSON response and normalizing it

In [None]:
file_done = []
for filename in os.listdir("./SpeechToText/transcript"):
    file_done.append(filename.split('.')[0])

for filename in os.listdir("./SpeechToText/json"):
    if filename.split('.')[0] in file_done:
        continue
    print("Normalizing",filename,"....",end=" ")	
    file_path = "./SpeechToText/json/"+filename
    with open(file_path, 'r') as f:
        json_obj = json.load(f)
        
    # Map start time -> end time -> word
    ttow = {}
    for sentence in json_obj['results']:
        for word in sentence['alternatives'][0]['timestamps']:
            # word is [text, start_time, end_time]
            ttow[word[1]] = {}
            ttow[word[1]][word[2]] = word[0]

    # Concatenate the words attributed to each speaker
    speakers = {}
    for word in json_obj['speaker_labels']:
        speaker = word['speaker']
        if speaker in speakers:
            speakers[speaker] += " "+ttow[word['from']][word['to']]
        else:
            speakers[speaker] = ttow[word['from']][word['to']]

    out_file = filename.split('.')[0]+".txt"
    out_path = "./SpeechToText/transcript/"+out_file
    with open(out_path, "w") as outfile:
        data = [] 
        for speaker in speakers.keys():
            data.append(speakers[speaker])
        data = normalize_data(data,True)
        for line in data:
            outfile.write(line+"\n")
    print("Done")
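The join between word timestamps and speaker labels can be tried on a small hand-written response. This is a sketch under assumptions: the fragment below only imitates the fields the cell above reads (`results`, `alternatives`, `timestamps`, `speaker_labels`), its words and times are made up, and it keys words by a `(start, end)` tuple instead of the nested dictionary used above.

```python
def words_by_speaker(response):
    """Join recognized words into one string per speaker label."""
    # Index each word by its (start, end) time span.
    word_at = {}
    for result in response["results"]:
        for word, start, end in result["alternatives"][0]["timestamps"]:
            word_at[(start, end)] = word
    speakers = {}
    for label in response["speaker_labels"]:
        word = word_at[(label["from"], label["to"])]
        speakers.setdefault(label["speaker"], []).append(word)
    return {spk: " ".join(words) for spk, words in speakers.items()}

# Hypothetical response fragment (made-up words and times).
sample = {
    "results": [{"alternatives": [{
        "transcript": "hello there hi",
        "timestamps": [["hello", 0.0, 0.4], ["there", 0.4, 0.8], ["hi", 1.1, 1.3]],
    }]}],
    "speaker_labels": [
        {"from": 0.0, "to": 0.4, "speaker": 0, "confidence": 0.6},
        {"from": 0.4, "to": 0.8, "speaker": 0, "confidence": 0.6},
        {"from": 1.1, "to": 1.3, "speaker": 1, "confidence": 0.5},
    ],
}
by_speaker = words_by_speaker(sample)
print(by_speaker)
```

Each speaker's string then becomes one input line for `normalize_data`.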

**Getting Model Predictions for the Call recordings in the final Test set**

In [None]:
for filename in os.listdir("./SpeechToText/transcript"):
    data = []
    data_labels = []
    with open('./SpeechToText/transcript/'+filename, 'r') as filehandle:
        for line in filehandle.readlines():
            if len(line.strip().split()) >= 4:
                data.append(line.strip())

    data = [' '.join(data)]
    data_labels.append(1)
    sents = data
    sents = [str(sent) + " [SEP] [CLS]" for sent in sents]
    labels = data_labels
    tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased',do_lower_case=True)
    tokenized_sents = [tokenizer.tokenize(sent) for sent in sents]

    input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_sents]
    input_ids = pad_sequences(input_ids,maxlen=MAX_LEN,dtype="long",truncating="post",padding="post")
    attention_masks = []
    for seq in input_ids:
        seq_mask = [float(i>0) for i in seq]
        attention_masks.append(seq_mask)    

    test_sents = input_ids
    test_labels = labels
    test_masks = attention_masks

    test_sents = torch.tensor(test_sents)
    test_labels = torch.tensor(test_labels)
    test_masks = torch.tensor(test_masks)


    test_data = TensorDataset(test_sents,test_masks,test_labels)
    test_sampler = SequentialSampler(test_data)
    test_dataloader = DataLoader(test_data,sampler=test_sampler,batch_size=batch_size)

    model.eval()  # disable dropout for inference
    with torch.no_grad():
        correct = 0
        total = 0
        for i, batch in enumerate(test_dataloader):
            batch = tuple(t.to(device) for t in batch)
            b_input_ids, b_input_mask, b_labels = batch
            outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
            prediction = torch.argmax(outputs[0],dim=1)
            total += b_labels.size(0)
            correct+=(prediction==b_labels).sum().item()
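    # Each file holds one example labeled 1 (fraud), so this is 100.0 when the
    # model predicts fraud and 0.0 otherwise; it is not a calibrated probability.
    probability = 100 * correct / total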
    out_file = filename.split('.')[0]+".pred"
    out_path = "./Predictions/"+out_file
    with open(out_path, "w") as outfile:
        outfile.write(str(probability))

**Evaluating Model's effectiveness**

In [None]:
print("File Name                      |Type of Recording|Predicted Type")
print("------------------------------- ----------------- --------------")
correct = 0
total = 0
for filename in os.listdir("./Predictions"):
    typerec = "NOT FRAUD"
    if '-' in filename:
        typerec = "FRAUD"
    typepred = "NOT FRAUD"
    with open('./Predictions/'+filename, 'r') as filehandle:
        for line in filehandle.readlines():
            val = float(line.strip())
            if(val == 100.0):
                typepred = "FRAUD"
    print(filename+(" "*(31-len(filename)))+"|"+typerec+(" "*(17-len(typerec)))+"|"+typepred+(" "*(14-len(typepred))))
    total += 1
    correct += (typerec==typepred)
print("Accuracy: ",(correct/total)*100,"%")

File Name                      |Type of Recording|Predicted Type
------------------------------- ----------------- --------------
elder-fraud.pred               |FRAUD            |NOT FRAUD     
amazon-fraud.pred              |FRAUD            |FRAUD         
cv19_delivery-fraud.pred       |FRAUD            |FRAUD         
cv19_social_security-fraud.pred|FRAUD            |FRAUD         
debt_arrest-fraud.pred         |FRAUD            |NOT FRAUD     
diabetic_test_kit-fraud.pred   |FRAUD            |FRAUD         
student_loan-fraud.pred        |FRAUD            |FRAUD         
irs_tax-fraud.pred             |FRAUD            |FRAUD         
test_kit-fraud.pred            |FRAUD            |FRAUD         
cv19_vaccine-fraud.pred        |FRAUD            |FRAUD         
SBC025.pred                    |NOT FRAUD        |NOT FRAUD     
SBC028.pred                    |NOT FRAUD        |NOT FRAUD     
SBC031.pred                    |NOT FRAUD        |NOT FRAUD     
SBC039.pred              