# Scientific Articles – Offline Preprocessing & Embeddings

This notebook prepares the data for our semantic search engine.

**Goals:**
- Load the Scopus CSV export
- Clean and select the most relevant fields (title, abstract, authors, etc.)
- Build a combined text field for semantic embeddings
- Compute embeddings for all articles using `SentenceTransformer`
- Save:
  - cleaned metadata (`articles.parquet`)
  - article embeddings (`embeddings.npy`)
  - an optional FAISS index (`index.faiss`) for fast similarity search

These artifacts will later be used in a web / Streamlit app for semantic search.
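
As a rough illustration of the "combined text field" goal above, the sketch below joins the non-empty text fields of one article into a single string that an embedding model could consume. The function name `build_combined_text`, the choice of fields, and the `". "` separator are assumptions made for this example, not necessarily what the notebook uses later; the input strings are toy values.

```python
def build_combined_text(title, abstract, keywords):
    """Join the non-empty fields into one string for the embedding model.

    Hypothetical helper: the field selection and separator are illustrative.
    """
    parts = [title, abstract, keywords]
    # Skip empty or whitespace-only fields so no dangling separators appear.
    return ". ".join(p.strip() for p in parts if p and p.strip())


# Toy record with a missing abstract (as happens for some rows in the export).
example = build_combined_text("A toy title", "", "keyword one; keyword two")
print(example)  # A toy title. keyword one; keyword two
```

Dropping empty fields matters because some rows have no abstract or keywords, and an embedding of a string made mostly of separators would carry little meaning.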


## 1. Import libraries

We start by importing the main Python libraries:

- `pandas` and `numpy` for data handling,
- `re` for simple text cleaning,
- `SentenceTransformer` to compute semantic embeddings,
- `os` for file operations.


In [9]:
import pandas as pd
import numpy as np
import re
import os

from sentence_transformers import SentenceTransformer




## 2. Load the CSV dataset

We load the Scopus-based CSV file that contains one row per document.

Make sure the path and file name are correct (replace `scopus_full_data_v2.csv` if needed).
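
One way to make a wrong path fail early with a readable message is to check for the file before calling `pd.read_csv`. The sketch below is only a suggestion; `load_articles` is a hypothetical helper name that does not appear elsewhere in this notebook.

```python
from pathlib import Path

import pandas as pd


def load_articles(csv_path):
    """Load the CSV export, failing early if the file is not found.

    Hypothetical helper for illustration only.
    """
    path = Path(csv_path)
    if not path.is_file():
        # Show the resolved path so a wrong working directory is easy to spot.
        raise FileNotFoundError(f"CSV export not found: {path.resolve()}")
    return pd.read_csv(path)


# Usage (assumes the file sits next to the notebook):
# df = load_articles("scopus_full_data_v2.csv")
```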


In [13]:
df = pd.read_csv('scopus_full_data_v2.csv')
df.head()

Unnamed: 0,file_name,chapter_title,doi,scopus_id,publication_year,cover_date,book_title,publisher,aggregation_type,authors,affiliation,abstract,description,author_keywords,ASJC,ASJC_translation,reference_count
0,201800000,Public health and international epidemiology f...,10.1007/978-3-319-98485-8_15,85077976956,2018,2018-12-31,"Radiology in Global Health: Strategies, Implem...",Springer International Publishing,Book,Pongpirul K.; Lungren M.P.,"Department of Radiology, Stanford University S...",,,,2700,Medicine,76
1,201800001,Flexible Printed Active Antenna for Digital Te...,10.23919/PIERS.2018.8597669,85060936020,2018,2018-12-31,Progress in Electromagnetics Research Symposium,Institute of Electrical and Electronics Engine...,Conference Proceeding,Pratumsiri T.; Janpugdee P.,"Department of Electrical Engineering, Wireless...","© 2018 The Institute of Electronics, Informati...",This paper presents the development of a flexi...,,"[{'$': '2208'}, {'$': '2504'}]","Electrical and Electronic Engineering, Materia...",4
2,201800002,Parametric study of hydrogen production via so...,10.1016/j.ces.2018.08.042,85052201238,2018,2018-12-31,Chemical Engineering Science,Elsevier Ltd,Journal,Phuakpunk K.; Assabumrungrat S.; Chalermsinsuw...,"Fuels Research Center, Department of Chemical ...",© 2018 Elsevier LtdComputational fluid dynamic...,Computational fluid dynamics was applied for s...,Circulating fluidized bed; Computational fluid...,"[{'$': '1600'}, {'$': '1500'}, {'$': '2209'}]","Chemistry, Chemical Engineering, Industrial an...",42
3,201800003,Superhydrophobic coating from fluoroalkylsilan...,10.1016/j.apsusc.2018.08.059,85051498032,2018,2018-12-31,Applied Surface Science,Elsevier B.V.,Journal,Saengkaew J.; Le D.; Samart C.; Kongparakul S....,"FRST, Academy of Science, Office of the Royal ...",© 2018 Elsevier B.V. A superhydrophobic/supero...,A superhydrophobic/superoleophilic mesh was su...,Encapsulation; Fluoroalkylsilane; Natural rubb...,"[{'$': '1600'}, {'$': '3104'}, {'$': '3100'}, ...","Chemistry, Condensed Matter Physics, Physics a...",45
4,201800004,Electrochemical impedance-based DNA sensor usi...,10.1016/j.aca.2018.07.045,85050678366,2018,2018-12-31,Analytica Chimica Acta,Elsevier B.V.,Journal,Teengam P.; Siangproh W.; Tuantranont A.; Vila...,"Organic Synthesis Research Unit, Department of...",© 2018 Elsevier B.V. A label-free electrochemi...,A label-free electrochemical DNA sensor based ...,acpcPNA; Electrochemical impedance spectroscop...,"[{'$': '1602'}, {'$': '1303'}, {'$': '2304'}, ...","Analytical Chemistry, Biochemistry, Environmen...",55


## 3. Inspect dataset structure

We inspect:
- column names,
- data types,
- basic information about missing values.

This helps us verify that the data matches our expectations and identify potential issues.
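
Raw missing-value counts are easier to judge next to percentages. The sketch below shows one possible summary, demonstrated on a toy dataframe; `missing_summary` is a hypothetical helper and the toy values are invented for the example.

```python
import pandas as pd


def missing_summary(frame):
    """Return missing-value counts and percentages per column, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)


# Toy data standing in for the real export.
toy = pd.DataFrame({
    "chapter_title": ["A", "B", None, "D"],
    "abstract": [None, "text", None, "text"],
    "scopus_id": [1, 2, 3, 4],
})
print(missing_summary(toy))
```

On the real data, the same call would be `missing_summary(df)`.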


In [10]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19805 entries, 0 to 19804
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   file_name         19805 non-null  int64 
 1   chapter_title     19804 non-null  object
 2   doi               18670 non-null  object
 3   scopus_id         19805 non-null  int64 
 4   publication_year  19805 non-null  int64 
 5   cover_date        19805 non-null  object
 6   book_title        19805 non-null  object
 7   publisher         19800 non-null  object
 8   aggregation_type  19805 non-null  object
 9   authors           19805 non-null  object
 10  affiliation       19793 non-null  object
 11  abstract          19276 non-null  object
 12  description       19276 non-null  object
 13  author_keywords   16338 non-null  object
 14  ASJC              19805 non-null  object
 15  ASJC_translation  19805 non-null  object
 16  reference_count   19805 non-null  int64 
dtypes: int64(4), object(13)

In [12]:
df.isna().sum()

file_name              0
chapter_title          1
doi                 1135
scopus_id              0
publication_year       0
cover_date             0
book_title             0
publisher              5
aggregation_type       0
authors                0
affiliation           12
abstract             529
description          529
author_keywords     3467
ASJC                   0
ASJC_translation       0
reference_count        0
dtype: int64

## 4. Select relevant columns for semantic search

Our semantic search engine will primarily rely on:
- `file_name` (source file identifier),
- `chapter_title` (used as the article title),
- `abstract` (main text for meaning),
- `doi` (to link to the article),
- `scopus_id`,
- `publication_year`,
- `authors`,
- `affiliation`,
- `author_keywords`,
- `ASJC_translation` (high-level domain/subject information).

We create a reduced dataframe containing only these fields. The remaining columns (`cover_date`, `book_title`, `publisher`, `aggregation_type`, `description`, `ASJC`, `reference_count`) are dropped.
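
If the export format changes, selecting a column that no longer exists raises a fairly terse `KeyError`. A small guard can report exactly which columns are absent. This is a sketch under that assumption; `select_columns` is a hypothetical helper and the toy dataframe is invented for the example.

```python
import pandas as pd


def select_columns(frame, wanted):
    """Return a copy of `frame` restricted to `wanted`, checking names first."""
    missing = [c for c in wanted if c not in frame.columns]
    if missing:
        raise KeyError(f"Columns missing from the dataset: {missing}")
    # .copy() avoids chained-assignment warnings when the result is modified later.
    return frame[wanted].copy()


toy = pd.DataFrame({
    "chapter_title": ["A", "B"],
    "abstract": ["x", "y"],
    "publisher": ["P1", "P2"],
})
reduced = select_columns(toy, ["chapter_title", "abstract"])
print(list(reduced.columns))  # ['chapter_title', 'abstract']
```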


In [14]:
cols_to_keep = [
    "file_name",
    "chapter_title",
    "abstract",
    "doi",
    "scopus_id",
    "publication_year",
    "authors",
    "affiliation",
    "author_keywords",
    "ASJC_translation"
]

df = df[cols_to_keep].copy()
df.head()


Unnamed: 0,file_name,chapter_title,abstract,doi,scopus_id,publication_year,authors,affiliation,author_keywords,ASJC_translation
0,201800000,Public health and international epidemiology f...,,10.1007/978-3-319-98485-8_15,85077976956,2018,Pongpirul K.; Lungren M.P.,"Department of Radiology, Stanford University S...",,Medicine
1,201800001,Flexible Printed Active Antenna for Digital Te...,"© 2018 The Institute of Electronics, Informati...",10.23919/PIERS.2018.8597669,85060936020,2018,Pratumsiri T.; Janpugdee P.,"Department of Electrical Engineering, Wireless...",,"Electrical and Electronic Engineering, Materia..."
2,201800002,Parametric study of hydrogen production via so...,© 2018 Elsevier LtdComputational fluid dynamic...,10.1016/j.ces.2018.08.042,85052201238,2018,Phuakpunk K.; Assabumrungrat S.; Chalermsinsuw...,"Fuels Research Center, Department of Chemical ...",Circulating fluidized bed; Computational fluid...,"Chemistry, Chemical Engineering, Industrial an..."
3,201800003,Superhydrophobic coating from fluoroalkylsilan...,© 2018 Elsevier B.V. A superhydrophobic/supero...,10.1016/j.apsusc.2018.08.059,85051498032,2018,Saengkaew J.; Le D.; Samart C.; Kongparakul S....,"FRST, Academy of Science, Office of the Royal ...",Encapsulation; Fluoroalkylsilane; Natural rubb...,"Chemistry, Condensed Matter Physics, Physics a..."
4,201800004,Electrochemical impedance-based DNA sensor usi...,© 2018 Elsevier B.V. A label-free electrochemi...,10.1016/j.aca.2018.07.045,85050678366,2018,Teengam P.; Siangproh W.; Tuantranont A.; Vila...,"Organic Synthesis Research Unit, Department of...",acpcPNA; Electrochemical impedance spectroscop...,"Analytical Chemistry, Biochemistry, Environmen..."


## 5. Basic text cleaning

We define a small helper function `clean_text` to:
- turn missing values (`NaN`/`None`) into an empty string,
- convert all other values to string,
- replace newlines with spaces and collapse any run of whitespace into a single space,
- strip leading/trailing spaces.

We then apply it to the free-text columns: title, abstract, authors, affiliation, and author keywords.


In [15]:
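def clean_text(text):
    """Basic text cleaning for titles, abstracts, etc."""
    # Missing values (NaN/None) become an empty string.
    if pd.isna(text):
        return ""
    # \s+ matches newlines, tabs, and repeated spaces, so one substitution
    # both replaces newlines and collapses runs of whitespace.
    text = re.sub(r"\s+", " ", str(text))
    return text.strip()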

text_columns = ["chapter_title", "abstract", "authors", "affiliation", "author_keywords"]

for col in text_columns:
    df[col] = df[col].apply(clean_text)

df.head()


Unnamed: 0,file_name,chapter_title,abstract,doi,scopus_id,publication_year,authors,affiliation,author_keywords,ASJC_translation
0,201800000,Public health and international epidemiology f...,,10.1007/978-3-319-98485-8_15,85077976956,2018,Pongpirul K.; Lungren M.P.,"Department of Radiology, Stanford University S...",,Medicine
1,201800001,Flexible Printed Active Antenna for Digital Te...,"© 2018 The Institute of Electronics, Informati...",10.23919/PIERS.2018.8597669,85060936020,2018,Pratumsiri T.; Janpugdee P.,"Department of Electrical Engineering, Wireless...",,"Electrical and Electronic Engineering, Materia..."
2,201800002,Parametric study of hydrogen production via so...,© 2018 Elsevier LtdComputational fluid dynamic...,10.1016/j.ces.2018.08.042,85052201238,2018,Phuakpunk K.; Assabumrungrat S.; Chalermsinsuw...,"Fuels Research Center, Department of Chemical ...",Circulating fluidized bed; Computational fluid...,"Chemistry, Chemical Engineering, Industrial an..."
3,201800003,Superhydrophobic coating from fluoroalkylsilan...,© 2018 Elsevier B.V. A superhydrophobic/supero...,10.1016/j.apsusc.2018.08.059,85051498032,2018,Saengkaew J.; Le D.; Samart C.; Kongparakul S....,"FRST, Academy of Science, Office of the Royal ...",Encapsulation; Fluoroalkylsilane; Natural rubb...,"Chemistry, Condensed Matter Physics, Physics a..."
4,201800004,Electrochemical impedance-based DNA sensor usi...,© 2018 Elsevier B.V. A label-free electrochemi...,10.1016/j.aca.2018.07.045,85050678366,2018,Teengam P.; Siangproh W.; Tuantranont A.; Vila...,"Organic Synthesis Research Unit, Department of...",acpcPNA; Electrochemical impedance spectroscop...,"Analytical Chemistry, Biochemistry, Environmen..."
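
The same cleaning can also be expressed with pandas' vectorized string methods instead of a row-by-row `apply`. The sketch below is an alternative, not the approach used above; `clean_series` is a hypothetical name and the sample values are toy data.

```python
import pandas as pd


def clean_series(series):
    """Vectorized whitespace cleanup for a text column (illustrative sketch)."""
    return (
        series.fillna("")                          # missing values -> empty string
        .astype(str)                               # non-string values -> string
        .str.replace(r"\s+", " ", regex=True)      # collapse all whitespace runs
        .str.strip()                               # trim both ends
    )


toy = pd.Series(["  Deep\nlearning \t for   search ", None, float("nan"), 2018],
                dtype=object)
cleaned = clean_series(toy).tolist()
print(cleaned)  # ['Deep learning for search', '', '', '2018']
```

For a dataset of this size either form should be fast enough; the vectorized version mainly saves a little code when many columns are cleaned.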
