# Step 2: Data merging

The previous step left three data files in the current directory:

In [3]:
!ls -1 *.csv

demo-citations.csv
demo-keywords.csv
demo-refs.csv


First, the files are loaded as dataframes.

In [4]:
import pandas as pd
citations = pd.read_csv('demo-citations.csv')
keywords = pd.read_csv('demo-keywords.csv')
references = pd.read_csv('demo-refs.csv')

Second, the column names of each dataframe are listed and reviewed to identify unwanted columns.

In [5]:
citations.columns

Index(['Authors', 'Author(s) ID', 'Title', 'Year', 'Source title', 'Volume',
       'Issue', 'Art. No.', 'Page start', 'Page end', 'Page count', 'Cited by',
       'DOI', 'Link', 'Affiliations', 'Authors with affiliations', 'ISSN',
       'ISBN', 'CODEN', 'Abbreviated Source Title', 'Document Type',
       'Publication Stage', 'Access Type', 'Source', 'EID'],
      dtype='object')

In [6]:
keywords.columns

Index(['Link', 'Abstract', 'Author Keywords', 'Index Keywords'], dtype='object')

In [7]:
references.columns

Index(['Link', 'References'], dtype='object')
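
`'Link'` is the only column shared by the three dataframes, so it is the one place to check that the files list the same records in the same order. The following is a minimal sketch of such a check; the helper `same_row_order` and the toy values are hypothetical and not part of the original workflow.

```python
import pandas as pd


def same_row_order(frames, key="Link"):
    """Return True if every dataframe has the same `key` values in the same order."""
    first = frames[0][key].reset_index(drop=True)
    return all(first.equals(f[key].reset_index(drop=True)) for f in frames[1:])


# Toy stand-ins for the three exported files (made-up values)
a = pd.DataFrame({"Link": ["u1", "u2"], "Title": ["A", "B"]})
b = pd.DataFrame({"Link": ["u1", "u2"], "Abstract": ["x", "y"]})
c = pd.DataFrame({"Link": ["u2", "u1"], "References": ["r2", "r1"]})

print(same_row_order([a, b]))     # True
print(same_row_order([a, b, c]))  # False: `c` lists the records in another order
```

With the real data, the call would be `same_row_order([citations, keywords, references])`, run before the `'Link'` column is dropped.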

The `'Link'` column can be removed from all three dataframes. The `'Authors with affiliations'` column in `citations` can be removed as well.

In [10]:
citations = citations.drop(columns=['Link', 'Authors with affiliations'])
keywords = keywords.drop(columns='Link')
references = references.drop(columns='Link')

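Third, the dataframes are concatenated column-wise. Because `pd.concat` with `axis=1` aligns rows by index, this step assumes that the three files list the same records in the same order.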

In [11]:
df = pd.concat([citations, keywords, references], axis=1)
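
To illustrate how `pd.concat` behaves with `axis=1`, the sketch below uses two toy dataframes of different lengths (made-up values, not the tutorial data). Rows are matched by index label, and positions missing from one dataframe are filled with `NaN`.

```python
import pandas as pd

# Toy dataframes with default integer indexes 0..2 and 0..1
left = pd.DataFrame({"Title": ["A", "B", "C"]})
right = pd.DataFrame({"Abstract": ["x", "y"]})

# axis=1 places the columns side by side, aligning rows on the index
combined = pd.concat([left, right], axis=1)

print(combined.shape)                      # (3, 2)
print(combined["Abstract"].isna().sum())   # 1: row 2 has no matching abstract
```

A row count larger than expected, or unexpected missing values after the concatenation, would suggest that the source files are not aligned.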

Not all records in the dataframe `df` will be included in the analysis. To mark the records selected for the review, a `'Selected'` field is added to the dataframe with a default value of `False`.

In [12]:
df['Selected'] = False
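
As a minimal sketch of how the flag could be used later, records can be marked with a boolean mask and `.loc`. The citation threshold below is a hypothetical criterion chosen only for illustration, and the toy values are made up.

```python
import pandas as pd

# Toy dataframe; 'Cited by' is a column in the data, the values are invented
df = pd.DataFrame({"Title": ["A", "B", "C"], "Cited by": [12, 3, 40]})
df["Selected"] = False

# Hypothetical criterion: keep records cited at least 10 times
mask = df["Cited by"] >= 10
df.loc[mask, "Selected"] = True

print(df["Selected"].tolist())  # [True, False, True]
```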

Finally, the dataframe is saved to disk for further processing.

In [13]:
df.to_csv('demo-merged.csv')
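
Note that `to_csv` writes the dataframe index as an unnamed first column by default. The sketch below, which uses an in-memory buffer and toy values instead of the real file, shows one way to read such a file back so that the index is restored rather than loaded as an extra column.

```python
import io

import pandas as pd

# Toy dataframe standing in for the merged data
df = pd.DataFrame({"Title": ["A", "B"], "Selected": [False, False]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer)

# Without index_col, the saved index comes back as an extra column
buffer.seek(0)
plain = pd.read_csv(buffer)
print(list(plain.columns))     # ['Unnamed: 0', 'Title', 'Selected']

# With index_col=0, the first column is used as the index again
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(list(restored.columns))  # ['Title', 'Selected']
```

Alternatively, passing `index=False` to `to_csv` omits the index from the file altogether.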

In the next part of this tutorial, strategies for selecting records are discussed.