# Keyword completion

## Data loading

In this step, the records for the study are selected. The previous file is loaded with:

In [1]:
import pandas as pd

url = "https://raw.githubusercontent.com/jdvelasq/techminer/master/data/tutorial/"

pdf = pd.read_json(url + "keyword-completation.json", orient="records", lines=True)

HTTPError: HTTP Error 404: Not Found

`NaN` values are replaced with `None`.

In [None]:
pdf = pdf.applymap(lambda x: None if pd.isna(x) is True else x)
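
A minimal sketch of why this conversion is convenient: `NaN` is a float that does not compare equal to itself, whereas `None` can be tested with `is None`, as the later cells do. The helper name `nan_to_none` is hypothetical, and unlike `pd.isna` it only recognises float `NaN`.

```python
import math


def nan_to_none(value):
    """Return None for a float NaN; pass every other value through unchanged."""
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


# NaN never equals itself, so equality checks cannot detect it
print(float("nan") == float("nan"))  # False

print(nan_to_none(float("nan")))  # None
print(nan_to_none("deep learning"))  # deep learning
```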

## Keyword completion

This step creates a column (field) in the dataframe containing the key terms used for document selection. The columns `'Author Keywords'` and `'Index Keywords'` are joined into a new column called `'keywords'`.

In [None]:
pdf = pdf.assign(
    keywords=pdf["Author Keywords"].map(lambda x: x.split(";") if x is not None else [])
    + pdf["Index Keywords"].map(lambda x: x.split(";") if x is not None else [])
)

# remove blank spaces surrounding each keyword
pdf["keywords"] = pdf["keywords"].map(lambda x: [e.strip() for e in x])

# join the keywords into a single string
pdf["keywords"] = pdf["keywords"].map(lambda x: ";".join(x))

# convert empty keyword strings to None
pdf["keywords"] = pdf.keywords.map(lambda x: None if x == "" else x)

In [None]:
pdf.keywords.head()
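
The same per-record logic can be sketched as a plain function, which makes the behaviour easy to check on a single row. The name `merge_keywords` and the sample strings are made up for illustration; the function mirrors the split, strip, join and empty-to-`None` steps of the cell above.

```python
def merge_keywords(author_keywords, index_keywords, sep=";"):
    """Join two separator-delimited keyword strings into one, or return None."""
    terms = []
    for field in (author_keywords, index_keywords):
        if field is not None:
            # split on the separator and trim the surrounding whitespace
            terms.extend(term.strip() for term in field.split(sep))
    merged = sep.join(terms)
    # an empty result means that neither field had keywords
    return merged if merged != "" else None


print(merge_keywords("deep learning; forecasting", "Time series;LSTM"))
# deep learning;forecasting;Time series;LSTM
print(merge_keywords(None, None))
# None
```

The second call shows the case that matters for the rest of this step: when both source fields are missing, the merged value is `None`.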

However, some records have neither `'Author Keywords'` nor `'Index Keywords'`.

In [None]:
len(pdf[pdf.keywords.map(lambda x: x is None)])

In [None]:
# Verification: rows where both source columns are empty must have no keywords

pdf.keywords[
    (pdf["Author Keywords"].map(lambda x: x is None))
    & (pdf["Index Keywords"].map(lambda x: x is None))
]

In the following code, a `Keywords` object is created and the content of the `keywords` column is added to it.

In [None]:
from techminer.keywords import Keywords

kyw = Keywords()
kyw.add_keywords(pdf.keywords, sep=";")
kyw.keywords[0:20]

In [None]:
#
# Number of records without abstract
#
len(pdf[pdf.Abstract.map(lambda x: x is None)])

In [None]:
#
# Number of rows without title
#
len(pdf[pdf["Title"].map(lambda x: x is None)])

In [None]:
#
# Remove copyright character from abstract
#
pdf["Abstract"] = pdf.Abstract.map(
    lambda x: x[0 : x.find("\u00a9")]
    if isinstance(x, str) and x.find("\u00a9") != -1
    else x
)
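
As a quick check of the truncation rule, the sketch below applies the same logic to one made-up abstract. The function name `strip_copyright` and the sample text are assumptions for illustration only.

```python
def strip_copyright(text, marker="\u00a9"):
    """Truncate text at the first copyright sign; pass other values through."""
    if isinstance(text, str) and marker in text:
        return text[: text.find(marker)]
    return text


# made-up abstract ending with a publisher notice
sample = "We propose a forecasting model. \u00a9 2019 Example Publisher"

print(repr(strip_copyright(sample)))
# 'We propose a forecasting model. '
print(strip_copyright(None))
# None
```

Note that the space before the copyright sign is kept; adding `.rstrip()` to the truncated text would remove it if that matters downstream.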

In [None]:
#
# Combine title and abstract into a single variable
#
title_abstract = pdf["Title"] + " " + pdf["Abstract"]

In [None]:
#
# Extract previously recorded keywords from the text using the Keywords object.
#
keywords_in_title_and_abstract = title_abstract.map(
    lambda x: kyw.extract_from_text(x, sep=";")
)
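
To give a feel for what this kind of extraction does, here is a rough stand-in written in plain Python. It is not techminer's implementation of `extract_from_text`: the function `extract_known_keywords`, its case-insensitive whole-word matching, its `None` return for no match, and the sample vocabulary are all assumptions made for this sketch.

```python
import re


def extract_known_keywords(text, vocabulary, sep=";"):
    """Hypothetical stand-in: return the known keywords found in text, or None."""
    if not isinstance(text, str):
        return None
    found = [
        term
        for term in vocabulary
        # whole-word, case-insensitive match of each known keyword
        if re.search(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE)
    ]
    return sep.join(found) if found else None


vocabulary = ["deep learning", "forecasting", "LSTM"]  # made-up keyword list
text = "A deep learning approach to electricity price forecasting"

print(extract_known_keywords(text, vocabulary))
# deep learning;forecasting
```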

In [None]:
#
# Add the extracted keywords only to rows without keywords
#
idx = pdf.keywords.map(lambda x: x is None)
pdf.loc[idx, "keywords"] = keywords_in_title_and_abstract[idx]
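
The effect of this selective assignment can be illustrated on a few made-up records, using plain dictionaries instead of a dataframe. The field name `extracted` is hypothetical and stands for the keywords recovered from title and abstract.

```python
# made-up records: existing keywords plus keywords recovered from the text
records = [
    {"keywords": "LSTM;forecasting", "extracted": "forecasting"},
    {"keywords": None, "extracted": "deep learning"},
    {"keywords": None, "extracted": None},
]

for record in records:
    # only rows without keywords are filled; existing values are left as-is
    if record["keywords"] is None:
        record["keywords"] = record["extracted"]

still_missing = sum(1 for record in records if record["keywords"] is None)

print([record["keywords"] for record in records])
# ['LSTM;forecasting', 'deep learning', None]
print(still_missing)
# 1
```

The first record keeps its original keywords, and a record can still end up without keywords when nothing was recovered from its text.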

In [None]:
#
# Verify the number of rows without keywords
#
len(pdf[pdf.keywords.map(lambda x: x is None)])

In [None]:
#
# Save the dataframe for the next step
#
pdf.to_json("deletion-of-records.json", orient="records", lines=True)
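
With `orient="records"` and `lines=True`, the file holds one JSON object per line. The sketch below reproduces that layout with the standard `json` module on two made-up records; the exact spacing of pandas' output may differ, so this only illustrates the structure and the fact that `None` is stored as `null`.

```python
import json

# made-up records standing in for dataframe rows
records = [
    {"Title": "Paper A", "keywords": "deep learning;forecasting"},
    {"Title": "Paper B", "keywords": None},
]

# one JSON object per line
lines = "\n".join(json.dumps(record) for record in records)
print(lines)

# reading the lines back restores the original records, including None
restored = [json.loads(line) for line in lines.splitlines()]
print(restored == records)  # True
```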