# Step 3: Record selection

In this step, the records for the study are selected. The file saved in the previous step is loaded with:

In [1]:
import pandas as pd

In [2]:
pdf = pd.read_csv('demo-merged.csv')

Next, the `pandas.DataFrame` is converted to a `RecordsDataFrame` object, and missing values (`NaN`) are replaced with `None` using `nan2none`.

In [3]:
from techminer import RecordsDataFrame
rdf = RecordsDataFrame(pdf)

In [4]:
from techminer import nan2none
rdf = nan2none(rdf)
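
The later cells test for missing values with `x is None`, so it helps to see what such a conversion looks like in plain pandas. The following is a minimal sketch, not the TechMiner implementation: `nan_to_none` is a hypothetical helper, and it is only an assumption that `nan2none` behaves in a similar way.

```python
import pandas as pd


def nan_to_none(df):
    """Return a copy of df in which every missing value is the Python object None.

    Hypothetical helper for illustration; not part of TechMiner.
    """
    out = {}
    for col in df.columns:
        # dtype=object keeps None as None instead of converting it back to NaN.
        out[col] = pd.Series(
            [None if pd.isna(v) else v for v in df[col]],
            index=df.index,
            dtype=object,
        )
    return pd.DataFrame(out)


example = pd.DataFrame(
    {
        "Title": ["Low-latency trading", "A survey"],
        "Author Keywords": ["Liquidity; NASDAQ", float("nan")],
    }
)
cleaned = nan_to_none(example)
```

After the conversion, a check such as `cleaned.loc[1, "Author Keywords"] is None` can be used to detect empty fields.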

## Group management

Record selection is based on the presence or absence of terms in each row of the dataframe.

To mark the records, a new column called `'SELECTED'` is created. In this tutorial, this kind of column is called a group.

In the following code, the column is created with a default value of `None`.

In [5]:
rdf['SELECTED'] = None
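
To illustrate how a group column of this kind could be filled in from the presence of a term, here is a minimal sketch in plain pandas. The helper `mark_by_term`, the sample titles and the chosen terms are assumptions made for illustration; they are not part of TechMiner.

```python
import pandas as pd

records = pd.DataFrame(
    {
        "Title": [
            "Low-latency trading",
            "Fibre-optic communications",
            "Workshop on Engineering Applications",
        ]
    }
)
records["SELECTED"] = None  # default: not reviewed yet


def mark_by_term(df, column, term, value):
    """Set 'SELECTED' to value for rows whose column contains term (case-insensitive).

    Hypothetical helper for illustration only.
    """
    mask = df[column].map(
        lambda x: isinstance(x, str) and term.lower() in x.lower()
    )
    df.loc[mask, "SELECTED"] = value
    return df


mark_by_term(records, "Title", "trading", True)    # include
mark_by_term(records, "Title", "workshop", False)  # exclude
```

Rows that match neither term keep the default `None`, which distinguishes records that have not been classified yet from those that were explicitly included or excluded.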

## Record visualization

**TechMiner** implements the `display_records()` function to display a portion of a dataframe in JSON format.

In [6]:
from techminer import display_records

display_records(rdf[['Title', 'Author Keywords', 'Index Keywords']].head(5))

-----------------------------------------------
Record index: 0
{
  "Author Keywords": null,
  "Index Keywords": "Algorithmic trading; Experimental demonstrations; Fibre-optic communication; Signal transmission; State-of-the-art technology; Transmission bandwidth; Wavelength division multiplexed; Wide bandwidth; Bandwidth; Data communication systems; Fibers; Glass fibers; Light transmission; Light velocity; Optical fibers; Supercomputers; Vacuum; Wave transmission",
  "Title": "Towards high-capacity fibre-optic communications at the speed of light in vacuum"
}
-----------------------------------------------
Record index: 1
{
  "Author Keywords": "High-frequency trading; Limit order markets; Liquidity; Market quality; NASDAQ; Order placement strategies",
  "Index Keywords": null,
  "Title": "Low-latency trading"
}
-----------------------------------------------
Record index: 2
{
  "Author Keywords": null,
  "Index Keywords": null,
  "Title": "Rise of the machines: Algorithmic trading in

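A similar record-by-record view can be produced with the standard `json` module. The sketch below is only an approximation of the output shown above: `format_records` is a hypothetical helper, and the separator width is arbitrary.

```python
import json

import pandas as pd


def format_records(df):
    """Return the rows of df as indented JSON blocks, one block per record.

    Hypothetical helper; it mimics the layout above but is not TechMiner code.
    """
    blocks = []
    for index, row in df.iterrows():
        blocks.append("-" * 40)
        blocks.append(f"Record index: {index}")
        # sort_keys=True lists the fields alphabetically; None is written as null.
        blocks.append(json.dumps(row.to_dict(), indent=2, sort_keys=True))
    return "\n".join(blocks)


sample = pd.DataFrame(
    {
        "Title": ["Low-latency trading"],
        "Index Keywords": [None],
        "Author Keywords": ["Liquidity; NASDAQ"],
    }
)
text = format_records(sample)
print(text)
```
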
## Keyword completion

This step creates a column (field) in the dataframe that contains the key terms used for document selection. First, the columns `'Author Keywords'` and `'Index Keywords'` are joined using the `merge_fields` function. The new column is called `'keywords'`.

In [7]:
from techminer import merge_fields
merge_fields(rdf['Author Keywords'], rdf['Index Keywords'], sepA=';', sepB=';')[0:10]

0    Algorithmic trading;Experimental demonstration...
1    High-frequency trading;Limit order markets;Liq...
2                                                 None
3    competition (economics);financial market;marke...
4    High-frequency trading;Market making;Market qu...
5    Financial disclosure;Individual characteristic...
6                                                 None
7    Automated social engineering;Online privacy;On...
8                                                 None
9                                                 None
dtype: object

In [8]:
rdf['keywords'] = merge_fields(rdf['Author Keywords'], rdf['Index Keywords'], sepA=';', sepB=';')
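
The idea of joining two delimited fields can be sketched without TechMiner. The helper `join_keyword_fields` below is hypothetical; whether `merge_fields` trims whitespace or handles duplicates in the same way is not known from this tutorial.

```python
import pandas as pd


def join_keyword_fields(a, b, sep=";"):
    """Join two delimited strings into one; return None if both are missing.

    Hypothetical helper for illustration only.
    """
    terms = []
    for field in (a, b):
        if isinstance(field, str):
            terms.extend(t.strip() for t in field.split(sep) if t.strip())
    return sep.join(terms) if terms else None


author = pd.Series(["Liquidity; NASDAQ", None, None], dtype=object)
index_kw = pd.Series([None, "Bandwidth; Fibers", None], dtype=object)

merged = pd.Series(
    [join_keyword_fields(a, b) for a, b in zip(author, index_kw)],
    dtype=object,
)
```

Records in which both fields are empty yield `None`, which is the case counted in the next cell.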

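However, 51 records have neither `'Author Keywords'` nor `'Index Keywords'`. For these records, keywords are extracted from the title and abstract, using the terms already present in the `'keywords'` column as the vocabulary.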

In [9]:
len(rdf[rdf['keywords'].map(lambda x: x is None)])

51

In [10]:
from techminer.keywords import Keywords
kyw = Keywords()
kyw.add_keywords(rdf['keywords'], sep=';')
kyw._keywords[0:10]

['(location-based) publish/subscribe',
 '10-fold cross-validation',
 '?-stable processes',
 'ABM',
 'ACE',
 'ADALINE',
 'ADCC-GARCH',
 'AI techniques',
 'ANFIS',
 'ANFIS ensemble']
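
The sorted list above suggests a vocabulary of unique terms. A minimal sketch of how such a vocabulary could be collected from a delimited column follows; `build_vocabulary` is a hypothetical helper and does not reproduce the internals of the `Keywords` class.

```python
def build_vocabulary(values, sep=";"):
    """Collect the unique, sorted terms found in an iterable of delimited strings.

    Hypothetical helper; missing values (None) are skipped.
    """
    terms = set()
    for value in values:
        if isinstance(value, str):
            terms.update(t.strip() for t in value.split(sep) if t.strip())
    return sorted(terms)


vocabulary = build_vocabulary(["ANFIS;ABM", None, "ABM; ACE"])
```
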

In [11]:
## Remove the copyright notice at the end of each abstract
import numpy as np
rdf['Abstract'] = rdf['Abstract'].map(lambda x: x[0:x.find('\u00a9')] if isinstance(x, str) and x.find('\u00a9')!= -1 else x)
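
The one-line lambda above can be easier to read as a named function. The sketch below is an equivalent formulation using `str.partition`; the name `strip_copyright` is introduced here for illustration.

```python
def strip_copyright(text, marker="\u00a9"):
    """Return text up to (not including) the first copyright sign.

    Non-string values (for example None) are returned unchanged, and text
    without the marker is returned as is.
    """
    if not isinstance(text, str):
        return text
    return text.partition(marker)[0]


shortened = strip_copyright("We study latency. \u00a9 2019 Elsevier")
```

It would be applied in the same way as the lambda, for example `rdf['Abstract'].map(strip_copyright)`.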

In [12]:
title_abstract = merge_fields(rdf['Title'], rdf['Abstract'])

In [13]:
## Extract keywords from the title and the abstract
keywords_title_abstract = title_abstract.map(lambda x: kyw.extract_from(x, sep=';'))

In [14]:
idx = rdf['keywords'].map(lambda x: x is None)
rdf.loc[idx, 'keywords'] = keywords_title_abstract[idx]

In [15]:
len(rdf[rdf['keywords'].map(lambda x: x is None)])

0
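
The extraction and fill-in steps above can be sketched in plain pandas. This is a simplified illustration under stated assumptions: `extract_terms` is a hypothetical helper that uses case-insensitive substring matching, which may differ from what `Keywords.extract_from` does.

```python
import pandas as pd


def extract_terms(text, vocabulary, sep=";"):
    """Return the vocabulary terms that occur in text, joined by sep (or None).

    Hypothetical helper; plain substring matching, case-insensitive.
    """
    if not isinstance(text, str):
        return None
    lowered = text.lower()
    found = [term for term in vocabulary if term.lower() in lowered]
    return sep.join(found) if found else None


vocabulary = ["algorithmic trading", "liquidity"]
records = pd.DataFrame(
    {
        "Title": ["Algorithmic trading and liquidity", "Low-latency trading"],
        "keywords": [None, "NASDAQ;Liquidity"],
    },
    dtype=object,
)

extracted = records["Title"].map(lambda x: extract_terms(x, vocabulary))

# Fill in only the rows that have no keywords yet.
missing = records["keywords"].isna()
records.loc[missing, "keywords"] = extracted[missing]
```

Substring matching is deliberately naive: a short term such as `ACE` would also match inside a word like `place`, so a real implementation would likely need word-boundary checks.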

In [16]:
##
## Removal of conference records

In [17]:
conf = Keywords()
conf.add_keywords(['Conference', 'Proceeding', 'Workshop'], sep=';')
rdf['CONF'] = rdf['Title'].map(lambda x: True if x in conf else False)

In [18]:
for title in rdf[rdf['CONF']]['Title']:
    print(title)

ICSIT 2019 - 10th International Conference on Society and Information Technologies, Proceedings
International Conference on Digital Science, DSIC 2018
ACM International Conference Proceeding Series
Proceedings - 23rd IEEE International Conference on High Performance Computing Workshops, HiPCW 2016
Workshop on Logistik und Echtzeit, Echtzeit 2017
4th Workshop on Engineering Applications, WEA 2017
10th International Conference on Knowledge Science, Engineering and Management, KSEM 2017
19th International Conference on Business Information Systems, BIS 2016
SISY 2015 - IEEE 13th International Symposium on Intelligent Systems and Informatics, Proceedings
Proceedings - 2013 Tools and Methods of Program Analysis, TMPA 2013
International Workshops on Business Information Systems, BIS 2015
2015 National Conference on Parallel Computing Technologies, PARCOMPTECH 2015
7th European Workshop on Probabilistic Graphical Models, PGM 2014
Self-Organizing Systems - 7th IFIP TC 6 International Workshop,

In [19]:
print(len(rdf))
rdf = rdf[rdf['CONF'].map(lambda x: False if x is True else True)]
print(len(rdf))

528
510
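
For comparison, the same flag-and-filter pattern can be written with pandas string methods and a boolean mask. This sketch uses sample titles and a regular expression chosen for illustration; it does not rely on the `Keywords` class.

```python
import pandas as pd

records = pd.DataFrame(
    {
        "Title": [
            "Low-latency trading",
            "ACM International Conference Proceeding Series",
            "4th Workshop on Engineering Applications, WEA 2017",
        ]
    }
)

# Flag titles that mention any of the three terms, ignoring case.
pattern = "conference|proceeding|workshop"
is_conference = records["Title"].str.contains(
    pattern, case=False, regex=True, na=False
)

# Keep only the rows that were not flagged.
journal_records = records[~is_conference]
```

The `~` operator negates the boolean mask, so no `lambda` is needed to invert the flags.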


In [20]:
##
## Keyword cleaning
##
from techminer import Thesaurus, text_clustering
th = text_clustering(rdf['keywords'], sep=';')


In [21]:
rdf['keywords (cleaned)'] = rdf['keywords'].map(lambda x: th.apply(x, sep=';'))
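
To give an intuition for what a thesaurus built by clustering does, the sketch below groups spelling variants under a normalized key and replaces each variant with the most frequent spelling in its group. Both helpers are hypothetical, and the normalization rule (lowercase, hyphens treated as spaces) is an assumption; the algorithm used by `text_clustering` may be different.

```python
from collections import Counter, defaultdict


def build_thesaurus(terms):
    """Map each spelling to the most frequent spelling sharing its normalized key.

    Hypothetical helper for illustration only.
    """
    groups = defaultdict(Counter)
    for term in terms:
        key = " ".join(term.lower().replace("-", " ").split())
        groups[key][term] += 1
    mapping = {}
    for counter in groups.values():
        preferred = counter.most_common(1)[0][0]
        for term in counter:
            mapping[term] = preferred
    return mapping


def apply_thesaurus(value, mapping, sep=";"):
    """Replace each term of a delimited string and drop repeated terms."""
    if not isinstance(value, str):
        return value
    cleaned = []
    for term in value.split(sep):
        term = mapping.get(term.strip(), term.strip())
        if term and term not in cleaned:
            cleaned.append(term)
    return sep.join(cleaned)


terms = [
    "High-frequency trading",
    "High-frequency trading",
    "high frequency trading",
    "Liquidity",
]
mapping = build_thesaurus(terms)
result = apply_thesaurus(
    "high frequency trading;Liquidity;High-frequency trading", mapping
)
```
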

## Manual review using external software

The dataframe can be saved to disk for manual review in, for example, an external editor. In the following code, the columns `'Title'`, `'Author Keywords'`, `'Index Keywords'`, `'Abstract'` and `'SELECTED'` are saved in JSON format.

In [22]:
rdf[['Title', 'Author Keywords', 'Index Keywords', 'Abstract', 'SELECTED']].to_json(
    'demo-records.json', 
    orient='records', 
    lines=True)
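
With `orient='records'` and `lines=True`, pandas writes one JSON object per line. The sketch below shows the layout on a small sample dataframe and reads it back; it uses an in-memory string instead of a file so that it is self-contained.

```python
import io
import json

import pandas as pd

records = pd.DataFrame(
    {
        "Title": ["Low-latency trading", "A survey of order books"],
        "SELECTED": [None, None],
    }
)

# Without a path, to_json returns the text instead of writing a file.
text = records.to_json(orient="records", lines=True)
lines = text.splitlines()

# Each line is a complete JSON object describing one record.
first = json.loads(lines[0])

# The same layout can be loaded back into a dataframe after manual review.
restored = pd.read_json(io.StringIO(text), orient="records", lines=True)
```
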

The following figure shows a partial view of the file. Note that the JSON format makes it easy to identify each field.

![screen_capture_4.jpg](screen_capture_4.jpg)

## Removing duplicate records

Records with the same title are removed. 

In [23]:
from techminer import remove_duplicate_records

print(len(rdf))
rdf = remove_duplicate_records(rdf, fields='Title', matchType='fingerprint')
print(len(rdf))

510
504
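
A common way to detect near-identical titles is to compare a normalized "fingerprint" of each title rather than the raw text. The sketch below shows one such scheme (lowercase, punctuation removed, unique tokens sorted). It is an assumption that `matchType='fingerprint'` works along these lines; the `fingerprint` function here is written for illustration only.

```python
import string

import pandas as pd


def fingerprint(text):
    """Return a normalized key: lowercase, no punctuation, unique sorted tokens.

    Hypothetical helper; not the TechMiner implementation.
    """
    if not isinstance(text, str):
        return None
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


records = pd.DataFrame(
    {
        "Title": [
            "Low-latency trading",
            "Low-Latency Trading.",
            "Algorithmic trading",
        ]
    }
)

keys = records["Title"].map(fingerprint)

# Keep the first record of each group of titles sharing a fingerprint.
unique_records = records[~keys.duplicated(keep="first")]
```

Note that `duplicated` treats missing keys as equal to each other, so records without a title would need separate handling.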


## Saving the cleaned dataframe

In [24]:
rdf.to_json(
    'cleaned-records.json', 
    orient='records', 
    lines=True)