# Step 3: Keyword completion

## Data loading

In this step, the keywords of the records selected for the study are completed. The file saved in the previous step is loaded with:

In [1]:
import pandas as pd

pdf = pd.read_json("step-02.json", orient="records", lines=True)

Next, the `pandas.DataFrame` is converted to a `RecordsDataFrame` object:

In [2]:
from techminer import RecordsDataFrame

rdf = RecordsDataFrame(pdf)

In [3]:
from techminer import nan2none

rdf = nan2none(rdf)
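
The later cells test for missing values with `x is None`, and that test would not match a `NaN` placeholder, which is presumably why the conversion is applied here. The following is a minimal plain-Python sketch of the idea on a list of record dictionaries. The helper `nan_to_none` and the sample records are hypothetical; this is not TechMiner's `nan2none` implementation.

```python
import math


def nan_to_none(value):
    """Return None for a float NaN; return any other value unchanged."""
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


# Hypothetical records: one of them has a NaN placeholder instead of keywords.
records = [
    {"Title": "Paper A", "Author Keywords": float("nan")},
    {"Title": "Paper B", "Author Keywords": "ARIMA; ANN"},
]

cleaned = [{k: nan_to_none(v) for k, v in rec.items()} for rec in records]

# NaN is not None, so an `x is None` test only finds the gap after conversion.
missing_before = sum(rec["Author Keywords"] is None for rec in records)
missing_after = sum(rec["Author Keywords"] is None for rec in cleaned)
print(missing_before, missing_after)
```
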

## Record visualization

**TechMiner** provides the `display_records()` function to display a portion of a dataframe in JSON format.

In [4]:
from techminer import display_records

display_records(rdf[["Title", "Author Keywords", "Index Keywords"]].head(5))

-----------------------------------------------
Record index: 0
{
  "Author Keywords": "Component trends; Empirical mode decomposition; Interference-less machine learning; Long short-term memory; Meta-learning; Moving average; Noise reduction; Nonlinear autoregressive neural network; Time series forecasting",
  "Index Keywords": null,
  "Title": "Improving DWT-RNN model via B-spline wavelet multiresolution to forecast a high-frequency time series"
}
-----------------------------------------------
Record index: 1
{
  "Author Keywords": null,
  "Index Keywords": "Earnings; Financial data processing; Information science; Newsprint; Consumer price index; Distributed representation; Financial time series forecasting; Long short term memory; Novel applications; Stock predictions; Textual information; Tokyo Stock Exchange; Costs",
  "Title": "Direct marketing campaigns in retail banking with the use of deep learning and random forests"
}
-----------------------------------------------
Record 

## Keyword completion

This step creates a column (field) in the dataframe that contains the key terms used for document selection. First, the `'Author Keywords'` and `'Index Keywords'` columns are joined with the `merge_fields` function. The new column is called `'keywords'`.

In [5]:
from techminer import merge_fields

merge_fields(
    rdf["Author Keywords"], rdf["Index Keywords"], sepA=";", sepB=";", new_sep=";"
)[0:10]

0    Component trends;Empirical mode decomposition;...
1    Earnings;Financial data processing;Information...
2    Multilayer neural networks;Financial time seri...
3    Electronic trading;Time series analysis;Deep l...
4    Commerce;Deep learning;Electronic trading;Fina...
5    Financial time series predictions;Forecasting ...
6    Electronic trading;Nearest neighbor search;Fin...
7    Time series analysis;Deep learning;Financial m...
8    Complex networks;Data mining;Economics;Forecas...
9    IBM stock indices;Electronic trading;Financial...
dtype: object

In [6]:
rdf["keywords"] = merge_fields(
    rdf["Author Keywords"], rdf["Index Keywords"], sepA=";", sepB=";", new_sep=";"
)
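
To make the merge step concrete, here is a minimal plain-Python sketch of joining two delimited keyword strings. The function `merge_keyword_fields` is hypothetical and is not TechMiner's `merge_fields`; stripping whitespace, dropping duplicates, and returning `None` when both inputs are missing are assumptions chosen to be consistent with the outputs shown in this section.

```python
def merge_keyword_fields(field_a, field_b, sep_a=";", sep_b=";", new_sep=";"):
    """Merge two delimited strings into one, dropping blanks and duplicates.

    Either argument may be None. Returns None when no term is found.
    """
    terms = []
    for field, sep in ((field_a, sep_a), (field_b, sep_b)):
        if field is None:
            continue
        for term in field.split(sep):
            term = term.strip()
            # Keep the first occurrence of each term, in order of appearance.
            if term and term not in terms:
                terms.append(term)
    return new_sep.join(terms) if terms else None


both = merge_keyword_fields("ARIMA; Deep learning", "Deep learning; Forecasting")
only_index = merge_keyword_fields(None, "Earnings; Costs")
neither = merge_keyword_fields(None, None)
print(both)
print(only_index)
print(neither)
```

A record with neither field ends up with `None` in this sketch, which is the case the following cells count and then fill in.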

However, 8 records have neither `'Author Keywords'` nor `'Index Keywords'`:

In [7]:
len(rdf[rdf["keywords"].map(lambda x: x is None)])

8

In [8]:
from techminer.keywords import Keywords

kyw = Keywords()
kyw.add_keywords(rdf["keywords"], sep=";")
kyw.keywords[0:20]

['(2D) 2 PCA',
 '(2D) <sup>2</sup> PCA',
 'AMAPE',
 'ANN',
 'ARIMA',
 'ARIMA Model',
 'ARIMA model',
 'ARIMA modeling',
 'Absolute values',
 'Abstract representation',
 'Accounts receivable',
 'Accuracy Improvement',
 'Accuracy of classifications',
 'Accurate prediction',
 'Activation layer',
 'AdaBoost algorithm',
 'Adam Optimizer',
 'Adaptive boosting',
 'Adaptive gradient algorithm',
 'Adaptive noise']

In [9]:
len(rdf[rdf["Abstract"].map(lambda x: x is None)])

0

In [10]:
len(rdf[rdf["Title"].map(lambda x: x is None)])

0

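In [11]:
#
# Remove the copyright notice: keep only the text before the first
# copyright sign (U+00A9). Non-string values and abstracts without
# the sign are left unchanged.
#
rdf["Abstract"] = rdf["Abstract"].map(
    lambda x: x[0 : x.find("\u00a9")]
    if isinstance(x, str) and x.find("\u00a9") != -1
    else x
)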
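
As an illustration of what this truncation does to a single abstract, the sketch below applies the same rule to a made-up string. The helper `strip_copyright` and the sample text are hypothetical; `str.partition` is used here as an equivalent to the `find` and slice combination in the cell.

```python
COPYRIGHT_SIGN = "\u00a9"


def strip_copyright(text):
    """Cut the text at the first copyright sign; pass other values through."""
    if isinstance(text, str):
        head, found, _ = text.partition(COPYRIGHT_SIGN)
        return head if found else text
    return text


sample = "We forecast stock prices with an LSTM network. \u00a9 2019 Example Publisher."
trimmed = strip_copyright(sample)
print(repr(trimmed))
print(strip_copyright("No notice here."))
print(strip_copyright(None))
```

Note that any whitespace before the sign is kept, so a trailing space may remain at the end of the abstract.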

In [12]:
title_abstract = merge_fields(rdf["Title"], rdf["Abstract"], new_sep=" ")

In [13]:
#
# Extract keywords from the title and abstract
#
keywords_title_abstract = title_abstract.map(
    lambda x: kyw.extract_from_text(x, sep=";")
)
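
The extraction step can be pictured as searching the text for terms from the known keyword list. The sketch below is one possible way to do that in plain Python; the function `extract_keywords`, the case-insensitive whole-term matching, and the sample vocabulary are assumptions for illustration and do not describe how `Keywords.extract_from_text` works internally.

```python
import re


def extract_keywords(text, vocabulary, sep=";"):
    """Return the vocabulary terms found in text, joined by sep, or None."""
    if text is None:
        return None
    found = []
    for term in vocabulary:
        # Whole-term, case-insensitive match; the lookarounds keep "ANN"
        # from matching inside a longer word such as "channel".
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return sep.join(found) if found else None


vocabulary = ["ANN", "ARIMA", "Adaptive boosting"]
text = "An ARIMA baseline is compared with adaptive boosting on a noisy channel."
matched = extract_keywords(text, vocabulary)
print(matched)
print(extract_keywords("Nothing relevant here.", vocabulary))
```

Returning `None` when nothing matches keeps the result compatible with the `x is None` checks used in this section.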

In [14]:
idx = rdf["keywords"].map(lambda x: x is None)
rdf.loc[idx, "keywords"] = keywords_title_abstract[idx]

In [15]:
len(rdf[rdf["keywords"].map(lambda x: x is None)])

0

In [16]:
rdf.to_json("step-03.json", orient="records", lines=True)