# Step 3: Keyword completion

## Data loading

In this step, the records for the study are selected. The file saved in the previous step (`step-02.json`) is loaded with:

In [None]:
import pandas as pd

pdf = pd.read_json(
    'step-02.json', 
    orient='records', 
    lines=True)
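The options `orient='records'` and `lines=True` correspond to the JSON Lines layout, with one JSON object per line. The following is a minimal sketch of the round trip, using an in-memory buffer and two made-up records rather than the actual `step-02.json` file:

```python
import io
import pandas as pd

# Two made-up records, for illustration only (not taken from the study data).
sample = pd.DataFrame({
    'Title': ['Paper A', 'Paper B'],
    'Author Keywords': ['deep learning;forecasting', None],
})

# Without a target path, to_json returns the serialized text:
# one JSON object per line (JSON Lines).
text = sample.to_json(orient='records', lines=True)

# Reading it back with the same options restores one row per line.
restored = pd.read_json(io.StringIO(text), orient='records', lines=True)
```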

Next, the `pandas.DataFrame` is converted to a `'RecordsDataFrame'` object.

In [None]:
from techminer import RecordsDataFrame
rdf = RecordsDataFrame(pdf)

In [None]:
type(rdf)

In [None]:
from techminer import nan2none
rdf = nan2none(rdf)
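The implementation of `nan2none` is not shown here. Judging by its name and by the `x is None` checks used later in this step, it presumably replaces `NaN` values with `None`. The sketch below illustrates that idea for a single column; `nan_to_none` is a hypothetical helper, not part of TechMiner.

```python
import math
import pandas as pd

def nan_to_none(series):
    """Hypothetical helper: return a copy of `series` with NaN replaced by None."""
    cleaned = [
        None if isinstance(v, float) and math.isnan(v) else v
        for v in series.tolist()
    ]
    # dtype=object keeps None as None instead of converting it back to NaN.
    return pd.Series(cleaned, index=series.index, dtype=object)

# Made-up values for illustration.
column = pd.Series(['a;b', float('nan'), 'c'])
clean_column = nan_to_none(column)
```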

## Record visualization

**TechMiner** implements the `display_records()` function to display a portion of a dataframe in JSON format.

In [None]:
from techminer import display_records

display_records(rdf[['Title', 'Author Keywords', 'Index Keywords']].head(5))

## Keyword completion

This step aims to create a column (field) in the dataframe containing key terms for document selection. First, the columns `'Author Keywords'` and `'Index Keywords'` are joined using the `'merge_fields'` function. The new column is called `'keywords'`.

In [None]:
from techminer import merge_fields
merge_fields(rdf['Author Keywords'], rdf['Index Keywords'], sepA=';', sepB=';', new_sep=';')[0:10]

In [None]:
rdf['keywords'] = merge_fields(rdf['Author Keywords'], rdf['Index Keywords'], sepA=';', sepB=';', new_sep=';')
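To make the merge concrete, here is a rough sketch of how two delimited fields could be combined for a single record. `merge_keyword_strings` is a hypothetical helper written for illustration. It is not the `merge_fields` implementation, and details such as duplicate removal and whitespace stripping are assumptions.

```python
def merge_keyword_strings(a, b, sep_a=';', sep_b=';', new_sep=';'):
    """Hypothetical helper: join two delimited strings, dropping duplicates."""
    terms = []
    for value, sep in ((a, sep_a), (b, sep_b)):
        if not isinstance(value, str):
            continue  # None/NaN: nothing to merge from this field
        for term in value.split(sep):
            term = term.strip()
            if term and term not in terms:
                terms.append(term)
    # A record with no keywords in either field stays empty (None).
    return new_sep.join(terms) if terms else None

merged = merge_keyword_strings('a; b', 'b;c')
```

In this sketch, a record with both fields empty yields `None`, which matches the kind of missing value counted in the next cells.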

However, 51 records have neither `'Author Keywords'` nor `'Index Keywords'`, so their `'keywords'` value is empty.

In [None]:
len(rdf[rdf['keywords'].map(lambda x: x is None)])

In [None]:
from techminer.keywords import Keywords
kyw = Keywords()
kyw.add_keywords(rdf['keywords'], sep=';')
kyw.keywords[0:20]
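The cell above collects the terms found in the `'keywords'` column into a `Keywords` object. As a rough illustration of the idea only (the internals of `Keywords` are not shown here), a vocabulary of unique terms can be built from delimited strings as follows. `collect_keywords` is a hypothetical function, not a TechMiner call.

```python
def collect_keywords(values, sep=';'):
    """Hypothetical helper: sorted list of unique terms in delimited strings."""
    unique = set()
    for value in values:
        if isinstance(value, str):  # skip None/NaN entries
            unique.update(t.strip() for t in value.split(sep) if t.strip())
    return sorted(unique)

vocabulary = collect_keywords(['b;a', None, 'a; c'])
```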

In [None]:
len(rdf[rdf['Abstract'].map(lambda x: x is None)])

In [None]:
len(rdf[rdf['Title'].map(lambda x: x is None)])

In [None]:
# Remove the copyright notice: keep only the text before the © symbol
rdf['Abstract'] = rdf['Abstract'].map(lambda x: x.split('\u00a9')[0] if isinstance(x, str) else x)

In [None]:
title_abstract = merge_fields(rdf['Title'], rdf['Abstract'], new_sep=' ')

In [None]:
# Keywords are extracted from the title and the abstract
keywords_title_abstract = title_abstract.map(lambda x: kyw.extract_from_text(x, sep=';'))
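The extraction step looks for known keywords inside the combined title and abstract text. The sketch below shows one possible way to do this with whole-word, case-insensitive matching. `extract_keywords` is a hypothetical function, and the matching rules are assumptions, not the behavior of `extract_from_text`.

```python
import re

def extract_keywords(text, vocabulary, sep=';'):
    """Hypothetical helper: return the vocabulary terms that occur in `text`."""
    if not isinstance(text, str):
        return None  # no title/abstract text available
    found = [
        kw for kw in vocabulary
        # \b restricts matches to whole words; escape() protects special characters.
        if re.search(r'\b' + re.escape(kw) + r'\b', text, flags=re.IGNORECASE)
    ]
    return sep.join(found) if found else None

# Made-up vocabulary and text for illustration.
terms = ['neural network', 'forecast', 'load forecasting']
extracted = extract_keywords('A Neural Network for load forecasting', terms)
```

Whole-word matching means `'forecast'` is not reported for a text that only contains `'forecasting'`. Whether that is desirable depends on the study.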

In [None]:
idx = rdf['keywords'].map(lambda x: x is None)
rdf.loc[idx, 'keywords'] = keywords_title_abstract[idx]
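The assignment above fills only the rows selected by the boolean mask and leaves existing keywords untouched. A small self-contained sketch of the same pattern with plain pandas objects and made-up values:

```python
import pandas as pd

# Made-up data: the second record has no keywords.
records = pd.DataFrame({'keywords': pd.Series(['a;b', None, 'c'], dtype=object)})
# Keywords extracted from title and abstract, aligned on the same index.
fallback = pd.Series(['x', 'y;z', None], dtype=object)

# Boolean mask of the records that still lack keywords.
missing = records['keywords'].isna()

# Only the masked rows receive the fallback values.
records.loc[missing, 'keywords'] = fallback[missing]
```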

In [None]:
len(rdf[rdf['keywords'].map(lambda x: x is None)])

In [None]:
rdf.to_json(
    'step-03.json', 
    orient='records', 
    lines=True)