# 2.1.2 Preprocessing data before building the NLP model

Scraped titles and abstracts that were downloaded in 1.1 as TAC_scraped.csv were ranked offline for relevance. Articles that mentioned both fast drying and corrosion resistance were ranked 5. Articles that mentioned only one of these aspects were ranked 3. All others, including those that mentioned powder coatings or electrodeposition coatings, were ranked 1.

This code merges the title and abstract text into a single column, then replaces all rankings of 5 and 3 with 1 (relevant) and all rankings of 1 with 0 (not relevant). The data is then split into train, test and validation portions and saved as an npz file for import into the model training code in 3.1.2.

In [1]:
import pandas as pd
import numpy as np

## Import data and remove empty cells

In [2]:
#import training data
Database_file = 'TACScrapedCleanMSDOS.csv'

raw_data = pd.read_csv(Database_file, encoding='latin')

#combine title and abstracts
raw_data['Title_Abstract'] = raw_data['title'].fillna('') + raw_data['Abstracts'].fillna('')

#remove other columns
data_TA = raw_data[['Title_Abstract', 'Relevant']]

data_TA.shape


(3935, 2)

In [3]:
#sample results
example = 1
'Title Abstract and label:', data_TA['Title_Abstract'][example], data_TA['Relevant'][example]

('Title Abstract and label:',
 'Two-component polyurethane clear coat kit system ',
 1.0)
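
One detail worth checking in the merge above: the `+` operator joins the two strings directly, so the last word of a title and the first word of an abstract run together unless the title already ends in whitespace. A minimal sketch of a merge with an explicit separator follows. The two sample rows are hypothetical; only the column names come from the notebook.

```python
import pandas as pd

# Hypothetical sample rows; only the column names match the notebook.
demo = pd.DataFrame({
    'title': ['Fast drying primer', None],
    'Abstracts': ['Improved corrosion resistance.', 'Powder coating study.'],
})

# Join title and abstract with a single space, then trim the stray
# whitespace left behind when either field is empty.
demo['Title_Abstract'] = (
    demo['title'].fillna('')
    .str.cat(demo['Abstracts'].fillna(''), sep=' ')
    .str.strip()
)

print(demo['Title_Abstract'].tolist())
```

Whether the separator matters depends on the tokenizer used downstream, so this is a suggestion to verify rather than a required change.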

## Make the rankings binary

In [4]:
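# drop rows that were never ranked (ignore_index needs pandas >= 2.0)
data_TA = data_TA.dropna(subset=['Relevant'], ignore_index=True)
# order matters: turn rank 1 into 0 first, then ranks 3 and 5 into 1
data_TA_1_to_0 = data_TA.replace({'Relevant': 1}, 0)
data_TA_binary = data_TA_1_to_0.replace({'Relevant': [3, 5]}, 1)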

#sample balance
print(
    'Not relevant:', len(data_TA_binary[data_TA_binary['Relevant'] == 0]),
    'Relevant:', len(data_TA_binary[data_TA_binary['Relevant'] == 1]),
    'Total documents:', data_TA_binary.shape[0]
)

Not relevant: 283 Relevant: 214 Total documents: 497
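
The two `replace` calls only give the right answer in that order, because the first pass must clear the original 1's before 3 and 5 are turned into 1. As an alternative, a single-step sketch using `isin` is shown here on hypothetical rankings. Note the assumption that every rank other than 3 or 5 should become 0, which differs from `replace` if unexpected values such as 2 or 4 are present.

```python
import pandas as pd

# Hypothetical rankings; the notebook's real values come from the CSV.
ranks = pd.DataFrame({'Relevant': [5.0, 3.0, 1.0, 1.0, 5.0]})

# 1 if the rank is 3 or 5, otherwise 0. No intermediate frame is needed,
# so the result does not depend on the order of two replace calls.
ranks['Relevant'] = ranks['Relevant'].isin([3, 5]).astype(int)

print(ranks['Relevant'].value_counts().to_dict())
```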


## Balance the data by removing excess not relevant documents

In [6]:
# number of excess not relevant documents
number_to_remove = len(data_TA_binary[data_TA_binary['Relevant'] == 0]) - len(data_TA_binary[data_TA_binary['Relevant'] == 1])
# randomly pick that many not relevant rows (fixed seed for reproducibility) and drop them
data_binary_remove = data_TA_binary.loc[data_TA_binary['Relevant'] == 0].sample(n = number_to_remove, random_state = 42)
data_balanced = data_TA_binary.drop(data_binary_remove.index)
data_balanced.reset_index(drop=True, inplace=True)

print(
    'Not relevant:', len(data_balanced[data_balanced['Relevant'] == 0]),
    'Relevant:', len(data_balanced[data_balanced['Relevant'] == 1])
)

Not relevant: 214 Relevant: 214
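
The same kind of downsampling can be sketched with `groupby(...).sample`, which keeps an equal number of rows per class without computing the excess by hand. The frame below is hypothetical, and this variant will not select the same rows as the cell above even with the same seed.

```python
import pandas as pd

# Hypothetical imbalanced labels: six not relevant (0), four relevant (1).
labels = pd.DataFrame({
    'Title_Abstract': [f'doc {i}' for i in range(10)],
    'Relevant': [0] * 6 + [1] * 4,
})

# The size of the smallest class sets how many rows to keep per class.
n_per_class = int(labels['Relevant'].value_counts().min())

balanced = (
    labels.groupby('Relevant')
    .sample(n=n_per_class, random_state=42)
    .reset_index(drop=True)
)

print(balanced['Relevant'].value_counts().to_dict())
```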


## Split off 10% for test and 10% of the remainder for validation data

In [8]:
test_data = data_balanced.sample(frac = 0.1, random_state = 1, ignore_index = False)
train_data = data_balanced.drop(test_data.index)
test_data.reset_index(drop=True, inplace=True)
train_data.reset_index(drop=True, inplace=True)

val_data = train_data.sample(frac = 0.1, random_state = 5, ignore_index = False)
train_data_balanced = train_data.drop(val_data.index)
val_data.reset_index(drop=True, inplace=True)
train_data_balanced.reset_index(drop=True, inplace=True)

#sample the new balance of the sets
print(
    'Test relevant:', len(test_data[test_data['Relevant'] == 1]),
    'Test not relevant:', len(test_data[test_data['Relevant'] == 0]))
print(
    'val relevant:', len(val_data[val_data['Relevant'] == 1]),
    'val not relevant:', len(val_data[val_data['Relevant'] == 0]))
print(
    'Train relevant:', len(train_data_balanced[train_data_balanced['Relevant'] == 1]),
    'Train not relevant:', len(train_data_balanced[train_data_balanced['Relevant'] == 0]))
print( 
    'Total test data:', test_data.shape[0],
    'Total val data:', val_data.shape[0],
    'Total train data:', train_data_balanced.shape[0]
)

Test relevant: 21 Test not relevant: 22
val relevant: 19 val not relevant: 19
Train relevant: 174 Train not relevant: 173
Total test data: 43 Total val data: 38 Total train data: 347
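
Because the validation sample is drawn from what remains after the test split, it amounts to roughly 9% of the balanced set rather than 10%, leaving about 81% for training. The sketch below reproduces the row counts on a stand-in frame of the same length (428 rows). The frame and its `row` column are placeholders, not the notebook's data.

```python
import pandas as pd

# Stand-in for data_balanced: 428 rows, as reported by the balancing step.
frame = pd.DataFrame({'row': range(428)})

test = frame.sample(frac=0.1, random_state=1)
rest = frame.drop(test.index)
val = rest.sample(frac=0.1, random_state=5)
train = rest.drop(val.index)

# Share of the full balanced set that lands in each portion.
for name, part in [('test', test), ('val', val), ('train', train)]:
    print(name, len(part), f'{len(part) / len(frame):.1%}')
```

Neither split is stratified, which is why the class counts per portion printed above are close to even but not guaranteed to be.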


## Separate targets and align data type for tensor conversion

In [19]:
test_targets = test_data.pop('Relevant').astype(np.int32)
train_targets = train_data_balanced.pop('Relevant').astype(np.int32)
val_targets = val_data.pop('Relevant').astype(np.int32)

train_data_export = train_data_balanced.to_numpy(copy = True)
test_data_export = test_data.to_numpy(copy = True)
val_data_export = val_data.to_numpy(copy = True)

## Save the data for modeling

In [21]:
np.savez('2.1.2.TrainTestValData',
         train = train_data_export, train_targets = train_targets,
         test = test_data_export, test_targets = test_targets,
         val = val_data_export, val_targets = val_targets)
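
`np.savez` appends `.npz` to the file name, and because the text arrays have `object` dtype they are stored with pickle, so reading them back requires `allow_pickle=True`. The following is a minimal sketch of the round trip on hypothetical stand-in arrays written to a temporary file. How 3.1.2 actually loads the data is not shown in this notebook, so treat this as an assumption.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-ins for the exported arrays: text as an (n, 1) object
# array, as DataFrame.to_numpy() returns for a single string column.
texts = np.array([['first abstract'], ['second abstract']], dtype=object)
targets = np.array([1, 0], dtype=np.int32)

path = os.path.join(tempfile.mkdtemp(), 'demo_split')
np.savez(path, train=texts, train_targets=targets)  # writes demo_split.npz

# Object arrays are pickled on save, so np.load needs allow_pickle=True.
with np.load(path + '.npz', allow_pickle=True) as archive:
    train_text = archive['train'].ravel()      # flatten (n, 1) to (n,)
    train_labels = archive['train_targets']

print(train_text.shape, train_labels.dtype)
```

Since unpickling can execute arbitrary code, `allow_pickle=True` should only be used on files you created yourself.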