## Data Preprocessing
##### This notebook records the subsetting and preprocessing of the Batteries Abstract data (training set only), which is available on HuggingFace. It is the first entry in a series that replicates a workplace problem: modelling binary outcomes with small datasets.

### 1. Import Data

In [1]:
# Import packages
import pandas as pd
import numpy as np

# Import data
df = pd.read_csv(r"/Users/ishanisahama/Documents/Data Science/github_blog/keybert and h2o/input/training_data.csv") 
df.head(10)

# Data Source: https://huggingface.co/datasets/batterydata/paper-abstracts/tree/main


,abstract,label
0,Background Allergic contact dermatitis due to ...,non-battery
1,ABSTRACT According to the Nernst–Planck equati...,battery
2,Carbon-encapsulated nano-MnO composite with no...,battery
3,Search for simple and economical electrocataly...,battery
4,The frequent occurrence of natural disasters i...,battery
5,"Polymeric gel electrolytes (PGE), based on pol...",battery
6,Actigraphy has been used for more than 60 year...,non-battery
7,Photoelectrocatalytic cells for water splittin...,battery
8,Perfectly aligned silicon microwire arrays sho...,battery
9,Potassium-ion batteries attract tremendous att...,battery


### 2. Subset Data Sample

##### This section extracts an equal-sized sample from each class. This replicates the workplace problem, in which a small dataset (<800 samples) with equal class sizes was used for modelling.

In [2]:
# Subset "battery" and "non-battery" samples from dataset
battery = df[df["label"] == "battery"]
non_battery = df[df["label"] == "non-battery"]

# Print data dimensions of subsets
print("The dimensions of the battery dataset are", battery.shape)
print("The dimensions of the non-battery dataset are", non_battery.shape)

The dimensions of the battery dataset are (20629, 2)
The dimensions of the non-battery dataset are (12034, 2)


In [3]:
# Extract a sample of 350 rows from each subset
n = 350
battery_samp = battery.sample(n=n, random_state=123)
non_battery_samp = non_battery.sample(n=n, random_state=123)

# Print data dimensions of samples
print("The dimensions of the battery sample are", battery_samp.shape)
print("The dimensions of the non-battery sample are", non_battery_samp.shape)


The dimensions of the battery sample are (350, 2)
The dimensions of the non-battery sample are (350, 2)
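
##### As a hedged aside, the two-step sampling above can also be written as a single grouped sample. The sketch below assumes pandas 1.1 or later (for `groupby(...).sample`) and uses a hypothetical toy frame rather than the real abstracts; with the same `random_state` it is not guaranteed to pick the same rows as the two separate `sample` calls.

```python
import pandas as pd

# Toy stand-in for the abstracts data (hypothetical rows, not the real dataset)
toy = pd.DataFrame({
    "abstract": [f"abstract {i}" for i in range(10)],
    "label": ["battery"] * 6 + ["non-battery"] * 4,
})

# Draw the same number of rows from every class in a single call
n_per_class = 3
balanced = toy.groupby("label").sample(n=n_per_class, random_state=123)

# Each label should now appear exactly n_per_class times
print(balanced["label"].value_counts())
```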


### 3. Preprocess Modelling Sample

##### This section concatenates the "battery" and "non-battery" samples and minimally preprocesses the text to prepare it for keyword extraction. In the workplace problem, numbers were appearing in the keyword extraction output (e.g. "xxx 2023"), along with specific junk words that have no bearing on the analysis. Although the keyword extraction method chosen for this project works with largely unprocessed text, numerical information and stop words are not useful output for deciphering the content of the text. Therefore, numerical information and stop words are removed from the input data.

In [4]:
# Concatenate samples
frames = [battery_samp, non_battery_samp]
df_model = pd.concat(frames)

# Print data dimensions of modelling dataset
print("The dimensions of df_model are", df_model.shape)

The dimensions of df_model are (700, 2)


In [5]:
# View modelling dataset
df_model.head(10)

,abstract,label
14034,A new ionic liquid-based electrolyte for lithi...,battery
27411,The interest in self-consumption of PV electri...,battery
12201,This paper explores the synergistic and cataly...,battery
3739,Li-rich layered oxides with micro-sized primar...,battery
14130,"In the present study, Al2O3 is utilized for th...",battery
3994,Micron spherical Sn doping Li1.2Ni0.2Mn0.8O2 c...,battery
10606,Based on the re-construction idea of carbon na...,battery
17436,Li-rich layered transition metal oxides with t...,battery
14267,Micrometre-size silicon particles are desirabl...,battery
14547,"Globally, buildings are responsible for approx...",battery


In [6]:
# How many unique abstracts (should match the row count)?
df_model["abstract"].nunique()

700
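
##### The check above compares the number of distinct abstracts with the number of rows. A minimal sketch of the same idea on a hypothetical toy frame, which also lists the repeated rows when the counts disagree:

```python
import pandas as pd

# Toy frame with one repeated abstract (hypothetical rows)
toy = pd.DataFrame({
    "abstract": ["cathode study", "anode study", "cathode study"],
    "label": ["battery", "battery", "battery"],
})

# Compare the number of distinct abstracts with the number of rows
n_unique = toy["abstract"].nunique()
has_duplicates = n_unique != len(toy)

# Select every row whose abstract occurs more than once
repeated = toy[toy["abstract"].duplicated(keep=False)]

print(n_unique, len(toy), has_duplicates)
print(repeated)
```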

In [7]:
# Create an ID column by row number and reset the index 
df_model["id"] = range(len(df_model))
# df_model["id"] = df_model[["abstract"]].sum(axis=1).map(hash)  --> The "hash" function creates complex IDs. 
df_model = df_model.reset_index(drop=True)
df_model.head(10)

,abstract,label,id
0,A new ionic liquid-based electrolyte for lithi...,battery,0
1,The interest in self-consumption of PV electri...,battery,1
2,This paper explores the synergistic and cataly...,battery,2
3,Li-rich layered oxides with micro-sized primar...,battery,3
4,"In the present study, Al2O3 is utilized for th...",battery,4
5,Micron spherical Sn doping Li1.2Ni0.2Mn0.8O2 c...,battery,5
6,Based on the re-construction idea of carbon na...,battery,6
7,Li-rich layered transition metal oxides with t...,battery,7
8,Micrometre-size silicon particles are desirabl...,battery,8
9,"Globally, buildings are responsible for approx...",battery,9
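
##### One caveat on the commented-out `hash` alternative: Python salts the built-in `hash` of strings per process by default, so those IDs would change between sessions. If content-based IDs were wanted, a stable digest is one option. The sketch below is an assumption, not part of the original workflow; `stable_id` and the toy rows are hypothetical.

```python
import hashlib

import pandas as pd


def stable_id(text: str, length: int = 12) -> str:
    """Return a short hex digest that is identical across Python sessions."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:length]


# Toy abstracts (hypothetical rows)
toy = pd.DataFrame({"abstract": ["cathode study", "anode study"]})

# Derive the ID from the abstract text itself
toy["id"] = toy["abstract"].map(stable_id)

print(toy)
```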


In [8]:
# Lowercase text data
df_model["text_proc"] = df_model["abstract"].str.lower()

# Remove numerical information
df_model["text_proc"] = df_model["text_proc"].apply(lambda x: "".join([character for character in x if not character.isdigit()]))

# View preprocessing so far
df_model.head(10)

,abstract,label,id,text_proc
0,A new ionic liquid-based electrolyte for lithi...,battery,0,a new ionic liquid-based electrolyte for lithi...
1,The interest in self-consumption of PV electri...,battery,1,the interest in self-consumption of pv electri...
2,This paper explores the synergistic and cataly...,battery,2,this paper explores the synergistic and cataly...
3,Li-rich layered oxides with micro-sized primar...,battery,3,li-rich layered oxides with micro-sized primar...
4,"In the present study, Al2O3 is utilized for th...",battery,4,"in the present study, alo is utilized for the ..."
5,Micron spherical Sn doping Li1.2Ni0.2Mn0.8O2 c...,battery,5,micron spherical sn doping li.ni.mn.o cathode ...
6,Based on the re-construction idea of carbon na...,battery,6,based on the re-construction idea of carbon na...
7,Li-rich layered transition metal oxides with t...,battery,7,li-rich layered transition metal oxides with t...
8,Micrometre-size silicon particles are desirabl...,battery,8,micrometre-size silicon particles are desirabl...
9,"Globally, buildings are responsible for approx...",battery,9,"globally, buildings are responsible for approx..."
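
##### The digit removal above can also be expressed with pandas' vectorised string methods. The sketch below uses hypothetical toy strings and assumes ASCII digits, for which `\d` and `str.isdigit` agree. It also shows the side effect visible in the table above: chemical formulae such as Al2O3 collapse to fragments like "alo".

```python
import pandas as pd

# Toy abstracts (hypothetical rows)
toy = pd.Series(["Al2O3 coating reported in 2023", "Li1.2Ni0.2Mn0.8O2 cathode"])

# Lowercase, then delete every digit character with a regular expression
no_digits = toy.str.lower().str.replace(r"\d", "", regex=True)

print(no_digits.tolist())
```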


In [9]:
# Import stop words packages
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords

# Set stop words
stop_words = set(stopwords.words('english'))

# Remove stop words from text
df_model["text_proc"] = df_model["text_proc"].apply(lambda x: " ".join(word for word in x.split() if word not in stop_words))
df_model.head(10)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ishanisahama/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


,abstract,label,id,text_proc
0,A new ionic liquid-based electrolyte for lithi...,battery,0,new ionic liquid-based electrolyte lithium bat...
1,The interest in self-consumption of PV electri...,battery,1,interest self-consumption pv electricity grid-...
2,This paper explores the synergistic and cataly...,battery,2,paper explores synergistic catalytic propertie...
3,Li-rich layered oxides with micro-sized primar...,battery,3,li-rich layered oxides micro-sized primary par...
4,"In the present study, Al2O3 is utilized for th...",battery,4,"present study, alo utilized first time coating..."
5,Micron spherical Sn doping Li1.2Ni0.2Mn0.8O2 c...,battery,5,micron spherical sn doping li.ni.mn.o cathode ...
6,Based on the re-construction idea of carbon na...,battery,6,based re-construction idea carbon nanomaterial...
7,Li-rich layered transition metal oxides with t...,battery,7,li-rich layered transition metal oxides nomina...
8,Micrometre-size silicon particles are desirabl...,battery,8,micrometre-size silicon particles desirable ba...
9,"Globally, buildings are responsible for approx...",battery,9,"globally, buildings responsible approximately ..."
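
##### Note that splitting on whitespace leaves punctuation attached to words, so a stop word followed by a comma is not removed. The sketch below illustrates this with a tiny hand-picked stop word set (an assumption, standing in for NLTK's English list) and shows one possible way to handle it by stripping punctuation before the lookup.

```python
import string

# Tiny illustrative stop word set (an assumption; the notebook uses NLTK's English list)
stop_words = {"after", "that", "the", "is", "for"}

text = "after that, the coating is utilized for the first time"

# Whitespace split keeps "that," intact, so it does not match "that"
plain = " ".join(w for w in text.split() if w not in stop_words)

# Stripping surrounding punctuation before the lookup catches it
stripped = " ".join(
    w for w in text.split() if w.strip(string.punctuation) not in stop_words
)

print(plain)
print(stripped)
```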


In [10]:
# Export output for further processing
df_model.to_excel(r"/Users/ishanisahama/Documents/Data Science/github_blog/keybert and h2o/output/df_model.xlsx")