# Step 3: Keyword completion

## Data loading

In this step, the keywords of the records are completed. The file saved in the previous step, `step-02.json`, is loaded with:

In [1]:
import pandas as pd

pdf = pd.read_json(
    'step-02.json', 
    orient='records', 
    lines=True)

Next, the `pandas.DataFrame` is converted to a `RecordsDataFrame` object:

In [2]:
from techminer import RecordsDataFrame
rdf = RecordsDataFrame(pdf)

Missing values (`NaN`) are then converted to `None` with `nan2none`; later cells test for them with `x is None`.

In [3]:
from techminer import nan2none
rdf = nan2none(rdf)
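
What `nan2none` does internally is not shown here. As an illustration only, replacing missing values by `None` can be sketched in plain pandas. The helper `nan_to_none` and the example frame below are hypothetical and are not TechMiner's implementation.

```python
import numpy as np
import pandas as pd


def nan_to_none(df):
    """Hypothetical helper: return a copy of df with missing values replaced by None.

    Assumes every cell holds a scalar (string, number, None or NaN).
    """
    out = df.copy()
    for col in out.columns:
        # An object-dtype column stores None as-is instead of coercing it to NaN.
        out[col] = pd.Series(
            [None if pd.isna(v) else v for v in out[col]],
            index=out.index,
            dtype=object,
        )
    return out


# Made-up records, used only to exercise the helper.
example = pd.DataFrame({
    'Title': ['Low-latency trading', 'A second paper'],
    'Author Keywords': ['Liquidity; NASDAQ', np.nan],
})
cleaned = nan_to_none(example)
```

After this conversion, a test such as `cleaned.loc[1, 'Author Keywords'] is None` holds, which is the kind of check the later cells perform.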

## Record visualization

**TechMiner** provides the `display_records()` function to display a portion of a dataframe in JSON format.

In [4]:
from techminer import display_records

display_records(rdf[['Title', 'Author Keywords', 'Index Keywords']].head(5))

-----------------------------------------------
Record index: 0
{
  "Author Keywords": null,
  "Index Keywords": "Algorithmic trading; Experimental demonstrations; Fibre-optic communication; Signal transmission; State-of-the-art technology; Transmission bandwidth; Wavelength division multiplexed; Wide bandwidth; Bandwidth; Data communication systems; Fibers; Glass fibers; Light transmission; Light velocity; Optical fibers; Supercomputers; Vacuum; Wave transmission",
  "Title": "Towards high-capacity fibre-optic communications at the speed of light in vacuum"
}
-----------------------------------------------
Record index: 1
{
  "Author Keywords": "High-frequency trading; Limit order markets; Liquidity; Market quality; NASDAQ; Order placement strategies",
  "Index Keywords": null,
  "Title": "Low-latency trading"
}
-----------------------------------------------
Record index: 2
{
  "Author Keywords": null,
  "Index Keywords": null,
  "Title": "Rise of the machines: Algorithmic trading in

## Keyword completion

This step aims to create a column (field) in the dataframe containing key terms for document selection. First, the columns `'Author Keywords'` and `'Index Keywords'` are joined using the `'merge_fields'` function. The new column is called `'keywords'`.

In [5]:
from techminer import merge_fields
merge_fields(rdf['Author Keywords'], rdf['Index Keywords'], sepA=';', sepB=';', new_sep=';')[0:10]

0    Algorithmic trading;Experimental demonstration...
1    High-frequency trading;Limit order markets;Liq...
2                                                 None
3    competition (economics);financial market;marke...
4    High-frequency trading;Market making;Market qu...
5    Financial disclosure;Individual characteristic...
6                                                 None
7    Social networking (online);Social network secu...
8                                                 None
9                                                 None
dtype: object

In [6]:
rdf['keywords'] = merge_fields(rdf['Author Keywords'], rdf['Index Keywords'], sepA=';', sepB=';', new_sep=';')
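
To make the row-wise merge concrete, here is a minimal sketch in plain pandas. It is not the implementation of `merge_fields`: the function name `merge_keyword_fields` is hypothetical, and both the whitespace stripping and the removal of duplicate terms are assumptions about reasonable behavior.

```python
import pandas as pd


def merge_keyword_fields(col_a, col_b, sep=';'):
    """Hypothetical stand-in: join two keyword columns row by row."""

    def merge_row(a, b):
        terms = []
        for value in (a, b):
            if isinstance(value, str):
                # Split on the separator; drop surrounding whitespace and empty items.
                terms.extend(t.strip() for t in value.split(sep) if t.strip())
        # Assumption: duplicates are removed, keeping first-seen order.
        unique = list(dict.fromkeys(terms))
        # A row with no terms in either column stays None.
        return sep.join(unique) if unique else None

    merged_rows = [merge_row(a, b) for a, b in zip(col_a, col_b)]
    return pd.Series(merged_rows, index=col_a.index, dtype=object)


# Made-up columns: one row per case (only A, only B, neither).
author = pd.Series(['High-frequency trading; Liquidity', None, None], dtype=object)
index_kw = pd.Series([None, 'Bandwidth; Fibers', None], dtype=object)
merged = merge_keyword_fields(author, index_kw)
```

The last row of `merged` is `None` because neither input column has a value, which is the situation the following cells deal with.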

However, 51 records have neither `'Author Keywords'` nor `'Index Keywords'`, so their `'keywords'` value is `None`.

In [7]:
len(rdf[rdf['keywords'].map(lambda x: x is None)])

51

To complete these records, a vocabulary is built from all existing keywords. The vocabulary terms that appear in the title or abstract of a record without keywords are then used as its keywords.

In [8]:
from techminer.keywords import Keywords
kyw = Keywords()
kyw.add_keywords(rdf['keywords'], sep=';')
kyw.keywords[0:20]

['(location-based) publish/subscribe',
 '10-fold cross-validation',
 '?-stable processes',
 'ABM',
 'ACE',
 'ADALINE',
 'ADCC-GARCH',
 'AI techniques',
 'ANFIS',
 'ANFIS ensemble',
 'ANN',
 'ARCH',
 'ARIMA',
 'ARMA',
 'Abstract modeling',
 'Academic literature',
 'Academic research',
 'Acceleration',
 'Acceleration approach',
 'Acoustic streaming']

In [9]:
len(rdf[rdf['Abstract'].map(lambda x: x is None)])

0

In [10]:
len(rdf[rdf['Title'].map(lambda x: x is None)])

0

In [11]:
# Remove the copyright notice: keep only the text before the first '©' character.
rdf['Abstract'] = rdf['Abstract'].map(lambda x: x[0:x.find('\u00a9')] if isinstance(x, str) and x.find('\u00a9') != -1 else x)
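
The truncation above can also be written as a small named function, which is easier to test in isolation. This is a sketch, not part of TechMiner: `strip_copyright` is a hypothetical name, and unlike the lambda it also removes trailing whitespace left before the copyright sign.

```python
def strip_copyright(text):
    """Return text cut at the first copyright sign; non-strings pass through."""
    if not isinstance(text, str):
        return text
    # partition() returns the part before the first '©', or the whole
    # string when the character is absent.
    head, _, _ = text.partition('\u00a9')
    # Assumption: trailing whitespace before the notice is not wanted.
    return head.rstrip()


# Made-up abstracts covering the three cases.
with_notice = strip_copyright('We study order placement. \u00a9 2015 Some Publisher')
without_notice = strip_copyright('We study order placement.')
missing = strip_copyright(None)
```

It could be applied with `rdf['Abstract'].map(strip_copyright)` in place of the lambda.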

In [12]:
title_abstract =  merge_fields(rdf['Title'], rdf['Abstract'], new_sep=' ')

In [13]:
# Keywords are extracted from the title and the abstract.
keywords_title_abstract = title_abstract.map(lambda x: kyw.extract_from(x, sep=';'))
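
How `extract_from` matches terms is not documented here. One plausible approach, shown only as a sketch, is to search the text for each vocabulary term as a whole word or phrase, ignoring case. The function `extract_known_keywords` and the sample vocabulary are hypothetical.

```python
import re


def extract_known_keywords(text, vocabulary, sep=';'):
    """Hypothetical sketch: return the vocabulary terms found in text, joined by sep."""
    if not isinstance(text, str):
        return None
    found = []
    for term in vocabulary:
        # Lookarounds instead of \b, so terms that start or end with a
        # non-word character, such as '(location-based) ...', still match.
        pattern = r'(?<!\w)' + re.escape(term) + r'(?!\w)'
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    # No match means the record still has no keywords.
    return sep.join(found) if found else None


vocabulary = ['ANN', 'ARIMA', 'Market making']
text = 'Planning trades with ARIMA models for market making'
extracted = extract_known_keywords(text, vocabulary)
# 'ANN' is not reported: it occurs only inside the word 'Planning'.
```

Whole-term matching avoids false positives for short acronyms, at the cost of missing plural or inflected forms.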

In [14]:
idx = rdf['keywords'].map(lambda x: x is None)
rdf.loc[idx, 'keywords'] = keywords_title_abstract[idx]

In [15]:
len(rdf[rdf['keywords'].map(lambda x: x is None)])

0
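
The fill step follows a common pandas pattern: build a boolean mask of the rows with missing values, assign fallback values to those rows only with `.loc`, and count again. The sketch below uses made-up data (`df` and `fallback` are illustrative, not the study's records) and `isna()`, which treats both `None` and `NaN` as missing.

```python
import pandas as pd

# Made-up records: rows 1 and 3 have no keywords.
df = pd.DataFrame({
    'keywords': pd.Series(['ARIMA;ANN', None, 'Liquidity', None], dtype=object),
})
# Fallback keywords, for example terms extracted from title and abstract.
fallback = pd.Series(['ARIMA', 'Market making', 'NASDAQ', 'Order flow'], dtype=object)

# Boolean mask of the rows that need a value.
missing = df['keywords'].isna()
n_before = int(missing.sum())

# Assign only where the mask is True; other rows keep their keywords.
df.loc[missing, 'keywords'] = fallback[missing]
n_after = int(df['keywords'].isna().sum())
```

Rows that already had keywords are untouched, so keywords supplied with the record always take precedence over the fallback.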

In [17]:
rdf.to_json(
    'step-03.json', 
    orient='records', 
    lines=True)