# Filter data based on keywords

## Introduction
The aim of this notebook is to read a set of keywords and a set of scraped data, and to filter out all entries (rows) that are not related to COVID-19.

## Import libraries and set up defaults

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#import seaborn as sns
%matplotlib inline
#%xmode Verbose
# Set global default figure size
plt.rc('figure', figsize=(20, 12)) # It's nice with figures that fill the whole space in width
# Show maximum of 8 rows when printing dataframes
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 8
# Show only 4 digits of precision when printing floating point numbers
np.set_printoptions(precision=4, suppress=True)

## Read in the keywords

In [2]:
raw = "01_raw/"
key_words_df = pd.read_csv(raw + "filtering_terms.tsv",
                           sep = '\t',
                           header = 0,
                           usecols = ['ORF','Gene','Gene2','Full_Name','Disease_Names'] # Dropping Search_Terms column
                          )
key_words_df

Unnamed: 0,ORF,Gene,Gene2,Full_Name,Disease_Names
0,ORF1AB,nsp1,,Host translation inhibitor nsp1,sars-cov-2
1,ORF1AB,nsp2,,Non-structural protein 2,sars-cov2
2,ORF1AB,nps3,,Papain-like proteinase,covid19
3,ORF1AB,nps4,,Non-structural protein 4,covid-19
...,...,...,...,...,...
24,,,,S-protein,
25,,,,Spike protein,
26,,,,Spike trimeric complex (S1S2S`),
27,,,,Spike surface glycoprotein (monomer),


Most of the keywords above are from [Zhang Lab's](https://zhanglab.ccmb.med.umich.edu/COVID-19/) website.

#### Create a unique Python list of keywords

In [3]:
first_term = (key_words_df['ORF']
              .dropna() # Drop np.nan:s
              .unique() # Keep each distinct value only once
              .tolist() # Make a python list
             ) + \
key_words_df['Gene'].dropna().unique().tolist() + \
key_words_df['Gene2'].dropna().unique().tolist() + \
key_words_df['Full_Name'].dropna().unique().tolist() + \
key_words_df['Disease_Names'].dropna().unique().tolist()
print(first_term)

['ORF1AB', 'S', 'ORF3A', 'E', 'M', 'ORF6', 'ORF7A', 'ORF8', 'N', 'ORF10', 'nsp1', 'nsp2', 'nps3', 'nps4', 'nsp5', 'nsp6', 'nsp7', 'nsp8', 'nsp9', 'nsp10', 'RDRP', 'Hel', 'Exon', 'NendoU', "2'-O-MT", 'Spike', '3CL-PRO', 'Spike trimeric complex (S1, S2, S`)', 'Host translation inhibitor nsp1', 'Non-structural protein 2', 'Papain-like proteinase', 'Non-structural protein 4', '3C-like proteinase', 'Non-structural protein 6', 'Non-structural protein 7', 'Non-structural protein 8', 'Non-structural protein 9', 'Non-structural protein 10', 'RNA-Directed RNA Polymerase', 'Helicase', 'Proofreading exoribonuclease (Guanine-N7 methyltransferase)', 'Uridylate-specific endoribonuclease', "2'-O-methyltransferase", 'Spike surface glycoprotein (monomer)', 'Protein 3a', 'Envelope small membrane proteins', 'Membrane protein', 'Protein 6', 'Protein 7a', 'Protein 8', 'Nucleoprotein', '3` UTR', 'S-protein', 'Spike protein', 'Spike trimeric complex (S1S2S`)', 'sars-cov-2', 'sars-cov2', 'covid19', 'covid-19',

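Let's remove some blatantly non-specific search terms such as "S" or "E". Note that the filter below is a case-sensitive substring test, so it drops every term that contains one of the excluded strings (for example "Spike protein" or "Non-structural protein 2"), not only the exact matches: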

In [4]:
excludable_terms = ['S',
                    'E', 
                    'M', 
                    'N', 
                    'Hel', 
                    'Exon', 
                    'Helicase', 
                    '3` UTR'
                   ]
first_term = [term for term in first_term if
          all(excludable not in term for excludable in excludable_terms)]
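If only the exact terms were meant to be dropped, a membership test could be used instead of the substring test. The following is a minimal sketch on a shortened, hand-picked term list; the names `terms`, `exact_filtered` and `substring_filtered` are assumptions made for illustration and are not part of the notebook.

```python
# Hypothetical comparison of exact-match and substring-based exclusion
terms = ['ORF1AB', 'S', 'E', 'Spike protein', 'Helicase', 'nsp1', 'sars-cov-2']
excludable_terms = ['S', 'E', 'M', 'N', 'Hel', 'Exon', 'Helicase', '3` UTR']

# Exact match: drop a term only if it is identical to an excluded term
excluded = set(excludable_terms)
exact_filtered = [term for term in terms if term not in excluded]

# Substring test, as in the cell above: drop a term if it contains an excluded string
substring_filtered = [term for term in terms
                      if all(excludable not in term for excludable in excludable_terms)]

print(exact_filtered)      # keeps 'Spike protein'
print(substring_filtered)  # drops 'Spike protein' because it contains 'S'
```

Which variant is preferable depends on whether long names such as "Spike protein" should remain as search terms.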

Let's make all the keywords lower case for easier use.

In [5]:
first_term = [term.lower() for term in first_term]
first_term

['orf1ab',
 'orf3a',
 'orf6',
 'orf7a',
 'orf8',
 'orf10',
 'nsp1',
 'nsp2',
 'nps3',
 'nps4',
 'nsp5',
 'nsp6',
 'nsp7',
 'nsp8',
 'nsp9',
 'nsp10',
 'rdrp',
 '3cl-pro',
 'host translation inhibitor nsp1',
 'papain-like proteinase',
 '3c-like proteinase',
 'uridylate-specific endoribonuclease',
 "2'-o-methyltransferase",
 'protein 3a',
 'protein 6',
 'protein 7a',
 'protein 8',
 'sars-cov-2',
 'sars-cov2',
 'covid19',
 'covid-19',
 'sars',
 'coronavirus',
 'ncov2019',
 'ncov-2019',
 'ncov2019',
 'ncov-2019',
 'covid2019',
 'ncov']
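The list above contains `'ncov2019'` and `'ncov-2019'` twice, presumably because the entries only differed in letter case before lowercasing. The duplicates are harmless for the search but can be removed while keeping the original order. A minimal sketch, using a shortened list and the hypothetical name `unique_terms`:

```python
# Shortened stand-in for the tail of first_term after lowercasing
terms = ['ncov2019', 'ncov-2019', 'ncov2019', 'ncov-2019', 'covid2019', 'ncov']

# dict keys are unique and keep insertion order, so this removes
# duplicates without reordering the remaining terms
unique_terms = list(dict.fromkeys(terms))
print(unique_terms)
```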

## Read in Scraped data from Mendeley Database

In [6]:
md_data = pd.read_csv(raw + "mendeley_molecular_dynamics.csv",
                   sep = ",",
                   header = 0
                  )
md_data.head()

Unnamed: 0,accessRights,authors,containerURI,dataTypes,dateAvailable,description,doi,externalSubjectAreas,institutions,keywords,method,publicationDate,source,title,type_cont,version
0,,NMR structure and <strong>molecular</strong> <...,http://www.bmrb.wisc.edu/data_library/summary/...,DATASET|TEXT,,Natural source:\nCommon Name:. Taxonomy ID:. ...,10.13018/BMR6757,,,,,,BIOLOGICAL_MAGNETIC_RESONANCE_DATABANK,1,article,
1,http://www.gnu.org/licenses/gpl-3.0.en.html,Sergio Davis|Claudia Loyola|Felipe González|Jo...,https://data.mendeley.com/datasets/v55y7vcyrx,DATASET|FILE_SET,2019-03-14,This program has been imported from the CPC Pr...,10.17632/v55y7vcyrx.1,Surface Science|Condensed Matter Physics|Compu...,,Surface Science|Condensed Matter Physics|Compu...,,2019-03-14,MENDELEY_DATA,Las Palmeras Molecular Dynamics: A flexible an...,article,1.0
2,info:eu-repo/semantics/restrictedAccess,Walter Rocchia,https://zenodo.org/record/2649259,OTHER|DATASET,2019-11-01,This dataset contains Molecular Dynamics traje...,10.5281/zenodo.2649259,molecular dynamics,,molecular dynamics,,2019-04-30,ZENODO,Molecular dynamics simulation dataset,article,
3,,"Laranjeiro, Ricardo|Whitmore, David",https://doi.org/10.5061/dryad.r07bc,TABULAR_DATA|DATASET,2014-06-13,The circadian clock is known to regulate a wid...,10.5061/dryad.r07bc,photoreceptors|neuroD|retina|transcription fac...,"Centre for Cell and Molecular Dynamics, Depart...",Danio rerio,,2014-06-13,DRYAD,Data from: Transcription factors involved in r...,,
4,info:eu-repo/semantics/openAccess,Henrik Andersen Sveinsson,https://zenodo.org/record/3769670,DATASET,2020-04-29,Atom coordinates for molecular dynamics. For u...,10.5281/zenodo.3769670,,,,,2020-04-27,ZENODO,Atom coordinates for molecular dynamics,article,


In [7]:
mt_data = pd.read_csv(raw + "mendeley_molecular_trajectories.csv",
                   sep = ",",
                   header = 0
                  )
mt_data.head()

Unnamed: 0,accessRights,authors,containerURI,dataTypes,dateAvailable,description,doi,externalSubjectAreas,institutions,keywords,method,publicationDate,source,title,type_cont,version
0,,"Trujillo, Kevin|Papagiannopoulos, Tasso|Olsen,...",https://doi.org/10.5256%2Ff1000research.6127.d...,DATASET,,The data are represented as trajectory files (...,10.5256/f1000research.6127.d43528,,,,,2015-01-01,bl.f1000r,Data of molecular dynamics trajectories,,
1,info:eu-repo/semantics/openAccess,Jiří Průša|Michal Cifra,https://zenodo.org/record/3676936,DATASET,2020-02-21,Molecular dynamics (MD) trajectories of water ...,10.5281/zenodo.3676936,,,,,2020-02-20,ZENODO,A1904 molecular dynamics trajectories data,article,
2,info:eu-repo/semantics/restrictedAccess,Walter Rocchia,https://zenodo.org/record/2649259,OTHER|DATASET,2019-11-01,This dataset contains Molecular Dynamics traje...,10.5281/zenodo.2649259,molecular dynamics,,molecular dynamics,,2019-04-30,ZENODO,Molecular dynamics simulation dataset,article,
3,CC BY 4.0,"Kenney, Ian M.|Shujie Fan|Beckstein, Oliver",https://doi.org/10.6084%2Fm9.figshare.7185203,DATASET,,Molecular dynamics (MD) trajectory of the NhaA...,10.6084/m9.figshare.7185203,60112 Structural Biology (incl. Macromolecular...,,,,2018-01-01,figshare.ars,Molecular dynamics trajectory of membrane prot...,,
4,https://www.elsevier.com/about/policies/open-a...,Ioannis G. Tsoulos|Athanassios Stavrakoudis,https://data.mendeley.com/datasets/55rdy6fdyc,DATASET|FILE_SET,2019-12-05,Abstract \n Eucb is a standalone program for g...,10.17632/55rdy6fdyc.1,Biological Sciences|Computational Physics|Mole...,,Biological Sciences|Computational Physics|Mole...,,2019-12-05,MENDELEY_DATA,Eucb: A C++ program for molecular dynamics tra...,article,1.0


## Check that the data is OK

In [8]:
md_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2002 entries, 0 to 2001
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   accessRights          1546 non-null   object 
 1   authors               2001 non-null   object 
 2   containerURI          2002 non-null   object 
 3   dataTypes             2002 non-null   object 
 4   dateAvailable         1245 non-null   object 
 5   description           2002 non-null   object 
 6   doi                   2002 non-null   object 
 7   externalSubjectAreas  1786 non-null   object 
 8   institutions          308 non-null    object 
 9   keywords              967 non-null    object 
 10  method                86 non-null     object 
 11  publicationDate       2000 non-null   object 
 12  source                2002 non-null   object 
 13  title                 2002 non-null   object 
 14  type_cont             982 non-null    object 
 15  version              

In [9]:
mt_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 438 entries, 0 to 437
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   accessRights          346 non-null    object 
 1   authors               438 non-null    object 
 2   containerURI          438 non-null    object 
 3   dataTypes             438 non-null    object 
 4   dateAvailable         373 non-null    object 
 5   description           438 non-null    object 
 6   doi                   438 non-null    object 
 7   externalSubjectAreas  379 non-null    object 
 8   institutions          86 non-null     object 
 9   keywords              291 non-null    object 
 10  method                13 non-null     object 
 11  publicationDate       438 non-null    object 
 12  source                438 non-null    object 
 13  title                 438 non-null    object 
 14  type_cont             291 non-null    object 
 15  version               1

In both data sets, two columns consist mostly of missing values (`np.nan`): `method` (86 of 2002 and 13 of 438 entries are non-null) and `institutions` (308 of 2002 and 86 of 438). Let's check what they contain, just for curiosity's sake. (This doesn't really affect anything else in the ensuing analyses.)
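The share of missing values per column can also be computed directly instead of being read off the `info()` output. The sketch below uses a small made-up frame named `toy` in place of `md_data` or `mt_data`; the values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for md_data / mt_data (values are made up)
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "method": [np.nan, np.nan, np.nan, "some steps"],
    "institutions": [np.nan, "Some University", np.nan, "Another Institute"],
})

# isna() marks missing cells; the column-wise mean is the missing fraction
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the real data, the same two lines would rank the columns from most to least sparse.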

In [10]:
md_data.loc[md_data["method"].notnull()] # https://stackoverflow.com/a/42137824

Unnamed: 0,accessRights,authors,containerURI,dataTypes,dateAvailable,description,doi,externalSubjectAreas,institutions,keywords,method,publicationDate,source,title,type_cont,version
36,http://creativecommons.org/licenses/by/4.0,Adrien Cerdan|Nicolas Martin|Marco Cecchini,https://data.mendeley.com/datasets/mh34bc6gty,SOFTWARE_CODE|SEQUENCING_DATA|DATASET,2018-09-13,The dataset presented here is supporting the a...,10.17632/mh34bc6gty.1,Molecular Dynamics|Biophysics,Institut Pasteur|Universite de Strasbourg,Molecular Dynamics|Biophysics,Molecular Dynamics simulations where carried o...,2018-09-13,MENDELEY_DATA,An ion permeable state of the Glycine Receptor...,article,1.0
38,http://creativecommons.org/licenses/by/4.0,Adrien Cerdan|Nicolas Martin|Marco Cecchini,https://data.mendeley.com/datasets/mh34bc6gty,SOFTWARE_CODE|SEQUENCING_DATA|DATASET,2018-09-13,The dataset presented here is supporting the a...,10.17632/mh34bc6gty.1,Molecular Dynamics|Biophysics,Institut Pasteur|Universite de Strasbourg,Molecular Dynamics|Biophysics,Molecular Dynamics simulations where carried o...,2018-09-13,MENDELEY_DATA,An ion permeable state of the Glycine Receptor...,article,1.0
107,https://creativecommons.org/licenses/by-nc/3.0,Liao Y Chen,https://data.mendeley.com/datasets/tghr5d9zgx,DATASET|FILE_SET,2018-11-11,This data has three parts: \n\n1. C++ code for...,10.17632/tghr5d9zgx.2,Molecular Dynamics,University of Texas at San Antonio,Molecular Dynamics,1. tar zxvf *.gz\n2. go to rasral/100-equil a...,2018-11-11,MENDELEY_DATA,"hSMD code, scripts, and coordinates for protei...",article,2.0
128,http://creativecommons.org/licenses/by/4.0,Ivan Novoselov,https://data.mendeley.com/datasets/4wfyg22srj,OTHER|SOFTWARE_CODE|DATASET,2018-12-06,This is an example of the input files required...,10.17632/4wfyg22srj.1,Molecular Dynamics,,Molecular Dynamics,"In order to perform the calculation, first yo...",2018-12-06,MENDELEY_DATA,MTP as a promising tool to study diffusion - r...,article,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1982,http://creativecommons.org/licenses/by/4.0,Taro Mieno|Aaron Hrozencik|Jordan Suter|Mani ...,https://data.mendeley.com/datasets/2csdvyry9t,SOFTWARE_CODE|DATASET|TEXT,2019-08-21,The data contain annual observations of ground...,10.17632/2csdvyry9t.1,Hydrology|Agricultural Irrigation|Agricultural...,Colorado State University|University of Nebras...,Hydrology|Agricultural Irrigation|Agricultural...,Steps to reproduce the results can be found at...,2019-08-21,MENDELEY_DATA,Annual well-level groundwater use records and ...,article,1.0
1994,http://creativecommons.org/licenses/by/4.0,Francesco Paolo Mancuso,https://data.mendeley.com/datasets/xfvykctgp6,SOFTWARE_CODE|DATASET|FILE_SET,2019-04-04,This data repository comprises the data and co...,10.17632/xfvykctgp6.1,Marine Biology|Natural Sciences,Universita degli Studi di Bologna,Marine Biology|Natural Sciences,The html files contain the R codes used to mak...,2019-04-04,MENDELEY_DATA,Influence of ambient temperature on the photos...,article,1.0
1997,http://creativecommons.org/licenses/by/4.0,Jibo He,https://data.mendeley.com/datasets/fvtfjyvw7d,TABULAR_DATA|DATASET|FILE_SET,2020-03-04,"This dataset is shared by Dr. Jibo HE, founder...",10.17632/fvtfjyvw7d.2,Big Data|University Student|Online Teaching,Tsinghua University|Peking University,Big Data|University Student|Online Teaching,"Using web crawling techniques, 1.8 million ori...",2020-03-04,MENDELEY_DATA,Big Data Set from RateMyProfessor.com for Prof...,article,2.0
1999,http://creativecommons.org/licenses/by/4.0,Connon I. Thomas|Christian Keine|Satoko Okayam...,https://data.mendeley.com/datasets/v88r5t5myz,SOFTWARE_CODE|IMAGE|VIDEO|TABULAR_DATA|DATASET...,2019-10-29,Contains data and software from the publicatio...,10.17632/v88r5t5myz.4,Electron Microscopy|Light Microscopy|Presynapt...,University of Iowa|Max Planck Florida Institute,Electron Microscopy|Light Microscopy|Presynapt...,All data are stored in CSV-files with periods ...,2019-10-29,MENDELEY_DATA,"Data/Software for ""Presynaptic Mitochondria Vo...",article,4.0


In [11]:
mt_data.loc[mt_data["method"].notnull()]

Unnamed: 0,accessRights,authors,containerURI,dataTypes,dateAvailable,description,doi,externalSubjectAreas,institutions,keywords,method,publicationDate,source,title,type_cont,version
9,http://creativecommons.org/licenses/by/4.0,Adrien Cerdan|Nicolas Martin|Marco Cecchini,https://data.mendeley.com/datasets/mh34bc6gty,SOFTWARE_CODE|SEQUENCING_DATA|DATASET,2018-09-13,The dataset presented here is supporting the a...,10.17632/mh34bc6gty.1,Molecular Dynamics|Biophysics,Institut Pasteur|Universite de Strasbourg,Molecular Dynamics|Biophysics,Molecular Dynamics simulations where carried o...,2018-09-13,MENDELEY_DATA,An ion permeable state of the Glycine Receptor...,article,1.0
308,http://creativecommons.org/licenses/by/4.0,Edoardo Paluan,https://data.mendeley.com/datasets/2ct5gfw6s3,OTHER|SOFTWARE_CODE|GEO_DATA|IMAGE|TABULAR_DAT...,2016-12-06,The aim of the task was to investigate the Dia...,10.17632/2ct5gfw6s3.1,Molecules|Density Functional Theory (DFT)|Comp...,King's College London,Molecules|Density Functional Theory (DFT)|Comp...,The code essentially utilises the Klenmar-Byla...,2016-12-06,MENDELEY_DATA,Density functional theory simulations of molec...,article,1.0
339,http://creativecommons.org/licenses/by/4.0,YAO YI|Xu Jiang He|Andrew B Barron|Yi Bo Liu|Z...,https://data.mendeley.com/datasets/wzfmyz3rp8,TABULAR_DATA|DATASET,2020-04-24,Whether a female honey bee (Apis mellifera) de...,10.17632/wzfmyz3rp8.2,Inheritance|Epigenetics|Honey Bee|Insect,Macquarie University Department of Biological ...,Inheritance|Epigenetics|Honey Bee|Insect,G1E were generation 1 queens reared from eggs ...,2020-04-24,MENDELEY_DATA,Transgenerational accumulation of methylome ch...,article,2.0
386,http://creativecommons.org/licenses/by/4.0,Lisa Rose-Wiles,https://data.mendeley.com/datasets/7fk4n7ych7,TABULAR_DATA|DATASET,2018-05-24,This is the raw data for our samples of refere...,10.17632/7fk4n7ych7.1,Library and Information Science|Chemistry,Seton Hall University,Library and Information Science|Chemistry,The methodology is described in the associated...,2018-05-24,MENDELEY_DATA,Chemistry reference data 2018,article,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
419,http://creativecommons.org/licenses/by/4.0,Maycon Franco|GILBERTO PEREZ|SILVIO POPADIUK,https://data.mendeley.com/datasets/4x7xkj3d4p,TABULAR_DATA|DATASET,2019-01-15,This accelerator is based on the bibliometric ...,10.17632/4x7xkj3d4p.1,Bibliometrics,Universidade Presbiteriana Mackenzie,Bibliometrics,The usage instructions can be accessed at this...,2019-01-15,MENDELEY_DATA,Accelerator for bibliometric study,article,1.0
421,http://creativecommons.org/licenses/by/4.0,Alexis Huf|Frank Siqueira,https://data.mendeley.com/datasets/hcbcg23836,OTHER|SOFTWARE_CODE|TABULAR_DATA|DATASET|DOCUM...,2019-01-04,This repository contains document data for a S...,10.17632/hcbcg23836.2,Software Engineering|Systematic Review|Web Ser...,Coordenacao de Aperfeicoamento de Pessoal de N...,Software Engineering|Systematic Review|Web Ser...,To reproduce the automated stages of selection...,2019-01-04,MENDELEY_DATA,"Documents Data for ""Composition of Heterogeneo...",article,2.0
423,http://creativecommons.org/licenses/by/4.0,Danny S Guamán,https://data.mendeley.com/datasets/zvp3986f5b,OTHER|TABULAR_DATA|DATASET|DOCUMENT|TEXT|FILE_SET,2020-03-08,This repo contains the data used in a systemat...,10.17632/zvp3986f5b.1,Engineering|Privacy|Computer Science|Software ...,Escuela Politecnica Nacional|Universidad Polit...,Engineering|Privacy|Computer Science|Software ...,Read the associated paper that contains the me...,2020-03-08,MENDELEY_DATA,A Systematic Mapping Study on Software Quality...,article,1.0
433,http://creativecommons.org/licenses/by/4.0,charles-francois LATCHOUMANE|Lohitash Karumbai...,https://data.mendeley.com/datasets/743vvdtk7n,SOFTWARE_CODE|TABULAR_DATA|DATASET,2020-03-16,Data set extracted from https://globalclinical...,10.17632/743vvdtk7n.1,Clinical Data Collection,University of Georgia,Clinical Data Collection,run analyze_ClinicalTrials.m having all xls. f...,2020-03-16,MENDELEY_DATA,Neurostimulation and reach-to-grasp function r...,article,1.0


In [12]:
md_data.loc[md_data["institutions"].notnull()]

Unnamed: 0,accessRights,authors,containerURI,dataTypes,dateAvailable,description,doi,externalSubjectAreas,institutions,keywords,method,publicationDate,source,title,type_cont,version
3,,"Laranjeiro, Ricardo|Whitmore, David",https://doi.org/10.5061/dryad.r07bc,TABULAR_DATA|DATASET,2014-06-13,The circadian clock is known to regulate a wid...,10.5061/dryad.r07bc,photoreceptors|neuroD|retina|transcription fac...,"Centre for Cell and Molecular Dynamics, Depart...",Danio rerio,,2014-06-13,DRYAD,Data from: Transcription factors involved in r...,,
21,,"Risso, Valeria A.|Martinez Rodriguez, Sergio|C...",https://doi.org/10.5061/dryad.53629,DATASET|FILE_SET,2017-07-20,Protein engineering studies often suggest the ...,10.5061/dryad.53629,,Departamento de Quimica Fisica|Facultad de Cie...,,,2017-07-20,DRYAD,Data from: De novo active sites for resurrecte...,,
36,http://creativecommons.org/licenses/by/4.0,Adrien Cerdan|Nicolas Martin|Marco Cecchini,https://data.mendeley.com/datasets/mh34bc6gty,SOFTWARE_CODE|SEQUENCING_DATA|DATASET,2018-09-13,The dataset presented here is supporting the a...,10.17632/mh34bc6gty.1,Molecular Dynamics|Biophysics,Institut Pasteur|Universite de Strasbourg,Molecular Dynamics|Biophysics,Molecular Dynamics simulations where carried o...,2018-09-13,MENDELEY_DATA,An ion permeable state of the Glycine Receptor...,article,1.0
38,http://creativecommons.org/licenses/by/4.0,Adrien Cerdan|Nicolas Martin|Marco Cecchini,https://data.mendeley.com/datasets/mh34bc6gty,SOFTWARE_CODE|SEQUENCING_DATA|DATASET,2018-09-13,The dataset presented here is supporting the a...,10.17632/mh34bc6gty.1,Molecular Dynamics|Biophysics,Institut Pasteur|Universite de Strasbourg,Molecular Dynamics|Biophysics,Molecular Dynamics simulations where carried o...,2018-09-13,MENDELEY_DATA,An ion permeable state of the Glycine Receptor...,article,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1990,http://creativecommons.org/licenses/by/4.0,Ryuta Shioi|Fumika Karaki|Hiromasa Yoshioka|To...,https://data.mendeley.com/datasets/jr23ccpp46,SOFTWARE_CODE|IMAGE|TABULAR_DATA|DATASET|DOCUM...,2020-04-20,An example data set and accompanying R script ...,10.17632/jr23ccpp46.1,Screening|Image Analysis,Tokyo Daigaku Teiryo Seimei Kagaku Kenkyujo|RI...,Screening|Image Analysis,,2020-04-20,MENDELEY_DATA,Supplementary datasets and R scripts for: Imag...,article,1.0
1994,http://creativecommons.org/licenses/by/4.0,Francesco Paolo Mancuso,https://data.mendeley.com/datasets/xfvykctgp6,SOFTWARE_CODE|DATASET|FILE_SET,2019-04-04,This data repository comprises the data and co...,10.17632/xfvykctgp6.1,Marine Biology|Natural Sciences,Universita degli Studi di Bologna,Marine Biology|Natural Sciences,The html files contain the R codes used to mak...,2019-04-04,MENDELEY_DATA,Influence of ambient temperature on the photos...,article,1.0
1997,http://creativecommons.org/licenses/by/4.0,Jibo He,https://data.mendeley.com/datasets/fvtfjyvw7d,TABULAR_DATA|DATASET|FILE_SET,2020-03-04,"This dataset is shared by Dr. Jibo HE, founder...",10.17632/fvtfjyvw7d.2,Big Data|University Student|Online Teaching,Tsinghua University|Peking University,Big Data|University Student|Online Teaching,"Using web crawling techniques, 1.8 million ori...",2020-03-04,MENDELEY_DATA,Big Data Set from RateMyProfessor.com for Prof...,article,2.0
1999,http://creativecommons.org/licenses/by/4.0,Connon I. Thomas|Christian Keine|Satoko Okayam...,https://data.mendeley.com/datasets/v88r5t5myz,SOFTWARE_CODE|IMAGE|VIDEO|TABULAR_DATA|DATASET...,2019-10-29,Contains data and software from the publicatio...,10.17632/v88r5t5myz.4,Electron Microscopy|Light Microscopy|Presynapt...,University of Iowa|Max Planck Florida Institute,Electron Microscopy|Light Microscopy|Presynapt...,All data are stored in CSV-files with periods ...,2019-10-29,MENDELEY_DATA,"Data/Software for ""Presynaptic Mitochondria Vo...",article,4.0


In [13]:
mt_data.loc[mt_data["institutions"].notnull()]

Unnamed: 0,accessRights,authors,containerURI,dataTypes,dateAvailable,description,doi,externalSubjectAreas,institutions,keywords,method,publicationDate,source,title,type_cont,version
8,http://creativecommons.org/licenses/by/4.0,Yevgen Yurenko|Martin Lepšík|Juraj Dobiaš,https://data.mendeley.com/datasets/vtxgt2y9rc,OTHER|SEQUENCING_DATA|DATASET|TEXT,2019-05-16,The dataset contains molecular dynamics trajec...,10.17632/vtxgt2y9rc.1,Molecular Dynamics|DNA|Ligand Binding|Molecula...,Ustav organicke chemie a biochemie Akademie ve...,Molecular Dynamics|DNA|Ligand Binding|Molecula...,,2019-05-16,MENDELEY_DATA,"Structures, molecular dynamics trajectories an...",article,1.0
9,http://creativecommons.org/licenses/by/4.0,Adrien Cerdan|Nicolas Martin|Marco Cecchini,https://data.mendeley.com/datasets/mh34bc6gty,SOFTWARE_CODE|SEQUENCING_DATA|DATASET,2018-09-13,The dataset presented here is supporting the a...,10.17632/mh34bc6gty.1,Molecular Dynamics|Biophysics,Institut Pasteur|Universite de Strasbourg,Molecular Dynamics|Biophysics,Molecular Dynamics simulations where carried o...,2018-09-13,MENDELEY_DATA,An ion permeable state of the Glycine Receptor...,article,1.0
14,info:eu-repo/semantics/openAccess,"Stachura, Sławomir|Kneller, Gerald R.",https://zenodo.org/record/61743,OTHER|DATASET,2016-09-12,This file contains the center-of-mass coordina...,10.5281/zenodo.61743,POPC|ActivePapers|MARTINI|molecular dynamics,Centre de Biophysique Moléculaire|CNRS|Synchro...,POPC|ActivePapers|MARTINI|molecular dynamics,,2016-09-07,ZENODO,Lipid center-of-mass trajectory: long-time dyn...,article,
16,http://creativecommons.org/licenses/by/4.0,Alessandro Nascimento,https://data.mendeley.com/datasets/ptxn54nc8m,DATASET|FILE_SET,2018-12-07,This dataset contains the MD trajectories of c...,10.17632/ptxn54nc8m.1,Biotechnology|Molecular Mechanics with Molecul...,Universidade de Sao Paulo Instituto de Fisica ...,Biotechnology|Molecular Mechanics with Molecul...,,2018-12-07,MENDELEY_DATA,Structure and Dynamics of Trichoderma harzianu...,article,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
433,http://creativecommons.org/licenses/by/4.0,charles-francois LATCHOUMANE|Lohitash Karumbai...,https://data.mendeley.com/datasets/743vvdtk7n,SOFTWARE_CODE|TABULAR_DATA|DATASET,2020-03-16,Data set extracted from https://globalclinical...,10.17632/743vvdtk7n.1,Clinical Data Collection,University of Georgia,Clinical Data Collection,run analyze_ClinicalTrials.m having all xls. f...,2020-03-16,MENDELEY_DATA,Neurostimulation and reach-to-grasp function r...,article,1.0
434,http://creativecommons.org/licenses/by/4.0,Charles-Francois Latchoumane,https://data.mendeley.com/datasets/77pxrcssj3,SOFTWARE_CODE|TABULAR_DATA|DATASET,2020-04-15,Clinical trials posted and filtered for brain ...,10.17632/77pxrcssj3.1,Clinical Trial Results,University of Georgia,Clinical Trial Results,,2020-04-15,MENDELEY_DATA,Brain Injury Neuromodulation Reach-to-Grasp,article,1.0
436,http://creativecommons.org/licenses/by/4.0,Agung Purnomo|Nur Asitah,https://data.mendeley.com/datasets/dsdnv2s7t3,TABULAR_DATA|DATASET,2020-03-28,The knowledge management reseach dataset with ...,10.17632/dsdnv2s7t3.1,Knowledge Management|Management|Business,Bina Nusantara University,Knowledge Management|Management|Business,,2020-03-28,MENDELEY_DATA,Knowledge Management Research in Indonesia Dat...,article,1.0
437,http://creativecommons.org/licenses/by/4.0,Agung Purnomo|Andre Septianto,https://data.mendeley.com/datasets/c77sxxms9f,TABULAR_DATA|DATASET,2020-04-08,The brand management reseach & publication dat...,10.17632/c77sxxms9f.2,Brand Management|Management|Business|Marketing,Bina Nusantara University|Universitas Airlangga,Brand Management|Management|Business|Marketing,,2020-04-08,MENDELEY_DATA,Brand Management Research Data (1968-2019),article,2.0


Let's make a working copy of the data frame and confirm that its structure is unchanged:

In [14]:
md_clean = md_data.copy()
md_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2002 entries, 0 to 2001
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   accessRights          1546 non-null   object 
 1   authors               2001 non-null   object 
 2   containerURI          2002 non-null   object 
 3   dataTypes             2002 non-null   object 
 4   dateAvailable         1245 non-null   object 
 5   description           2002 non-null   object 
 6   doi                   2002 non-null   object 
 7   externalSubjectAreas  1786 non-null   object 
 8   institutions          308 non-null    object 
 9   keywords              967 non-null    object 
 10  method                86 non-null     object 
 11  publicationDate       2000 non-null   object 
 12  source                2002 non-null   object 
 13  title                 2002 non-null   object 
 14  type_cont             982 non-null    object 
 15  version              

In [15]:
first_term

['orf1ab',
 'orf3a',
 'orf6',
 'orf7a',
 'orf8',
 'orf10',
 'nsp1',
 'nsp2',
 'nps3',
 'nps4',
 'nsp5',
 'nsp6',
 'nsp7',
 'nsp8',
 'nsp9',
 'nsp10',
 'rdrp',
 '3cl-pro',
 'host translation inhibitor nsp1',
 'papain-like proteinase',
 '3c-like proteinase',
 'uridylate-specific endoribonuclease',
 "2'-o-methyltransferase",
 'protein 3a',
 'protein 6',
 'protein 7a',
 'protein 8',
 'sars-cov-2',
 'sars-cov2',
 'covid19',
 'covid-19',
 'sars',
 'coronavirus',
 'ncov2019',
 'ncov-2019',
 'ncov2019',
 'ncov-2019',
 'covid2019',
 'ncov']

Judging from the excerpts above, one possible search strategy is to first look for the `first_term` keywords in all columns except `Authors` and `Type of (possible) format`, and then to look for `second_term` keywords in the `Keywords` and `Description` columns to focus in on molecular dynamics data.

## Finding all rows with keywords in them in `Title` and `Description` columns
The boolean `pd.Series` objects `found_md` and `found_mt` initialised below hold one entry per row and mark the rows in which at least one keyword has been found. They start out as all `False`.

In [16]:
falses_md = np.zeros(len(md_data["title"]), dtype=bool) # https://stackoverflow.com/a/21174962
falses_mt = np.zeros(len(mt_data["title"]), dtype=bool)
found_md = pd.Series(data = falses_md,
                   dtype = bool)
found_mt = pd.Series(data = falses_mt,
                   dtype = bool)
#found2 = pd.Series(data = falses,
#                   dtype = bool)
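As a side note, an all-`False` mask can also be created in one step from the frame's own index. This matters because `|` between two Series aligns on index labels; with the default `RangeIndex` used here both approaches give the same result. A minimal sketch with a made-up frame `toy`:

```python
import pandas as pd

# Toy frame with a non-default index to show that the mask follows it
toy = pd.DataFrame({"title": ["a", "b", "c"]}, index=[10, 11, 12])

# Broadcasting a scalar over the index gives an all-False boolean Series
found = pd.Series(False, index=toy.index)
print(found)
```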

### Find all indexes with a match

#### Obtain column names of interest to loop through

In [17]:
num_elem_md = len(md_data.columns.values.tolist())
num_elem_mt = len(mt_data.columns.values.tolist())
cols_md = md_data.columns.values.tolist()
cols_mt = mt_data.columns.values.tolist()
print(cols_md)

['accessRights', 'authors', 'containerURI', 'dataTypes', 'dateAvailable', 'description', 'doi', 'externalSubjectAreas', 'institutions', 'keywords', 'method', 'publicationDate', 'source', 'title', 'type_cont', 'version']


Let's remove the last element ("version") from the list of column names, because that column has a float data type and can't be searched with the string methods used in the next step.

In [18]:
cols_md.pop(num_elem_md-1)
cols_mt.pop(num_elem_mt-1)
print(cols_mt)

['accessRights', 'authors', 'containerURI', 'dataTypes', 'dateAvailable', 'description', 'doi', 'externalSubjectAreas', 'institutions', 'keywords', 'method', 'publicationDate', 'source', 'title', 'type_cont']
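Dropping the last element relies on "version" being the final column. A sketch of an alternative that picks the searchable columns by data type is shown below; the frame `toy` and the name `text_cols` are assumptions made for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame with two text columns and one float column (values are made up)
toy = pd.DataFrame({
    "title": ["Atom coordinates", "Lipid trajectory"],
    "source": ["ZENODO", "MENDELEY_DATA"],
    "version": [1.0, None],
})

# Keep every column that is not numeric, regardless of its position
text_cols = [col for col in toy.columns if not is_numeric_dtype(toy[col])]
print(text_cols)
```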


### Find matches in the "Molecular dynamics" keyword search

In [19]:
for col in cols_md:
    for word in first_term:
        # Find out if the current search term can be found in the column
        # (literal match; missing values count as "no match")
        cur_match = md_data[col].str.lower().str.contains(word, regex=False, na=False) # https://stackoverflow.com/a/15333283
        # Join the found matches to one Series
        found_md = found_md | cur_match

Let's check how many rows got at least one hit with these keywords:

In [20]:
found_md.value_counts()

False    1937
True       65
dtype: int64

Let's check out some of the matches:

In [21]:
md_data[found_md].head()

Unnamed: 0,accessRights,authors,containerURI,dataTypes,dateAvailable,description,doi,externalSubjectAreas,institutions,keywords,method,publicationDate,source,title,type_cont,version
133,http://creativecommons.org/licenses/by/4.0,Teruhisa S. KOMATSU|Yohei Koyama|Noriaki OKIMO...,https://data.mendeley.com/datasets/vpps4vhryg,OTHER|IMAGE|TABULAR_DATA|DATASET|TEXT,2020-04-27,Raw trajectory data (GROMACS format) of 10 mic...,10.17632/vpps4vhryg.2,Virus|Drug|Molecular Dynamics,,Virus|Drug|Molecular Dynamics,,2020-04-27,MENDELEY_DATA,COVID-19 related trajectory data of 10 microse...,article,2.0
168,info:eu-repo/semantics/openAccess,"Durdagi, Serdar|Aksoydan, Busecan|Dogan, Berna...",https://zenodo.org/record/3756976,DATASET,2020-04-19,Data includes all of the trajectories (1000) o...,10.5281/zenodo.3756976,"SARS-CoV2 Main Protease, holo form, PDB 6LU7, ...",,"SARS-CoV2 Main Protease, holo form, PDB 6LU7, ...",,2020-04-18,ZENODO,All Atom Molecular Dynamics Simulations of inh...,article,
184,info:eu-repo/semantics/openAccess,"Durdagi, Serdar|Aksoydan, Busecan|Dogan, Berna...",https://zenodo.org/record/3751321,DATASET,2020-04-15,Data includes all of the trajectories (2000) o...,10.5281/zenodo.3751321,"SARS-CoV2 Main Protease, lopinavir, molecular ...",,"SARS-CoV2 Main Protease, lopinavir, molecular ...",,2020-04-14,ZENODO,All Atom Molecular Dynamics Simulations of Lop...,article,
191,http://creativecommons.org/licenses/by/4.0,Ryunosuke Yoshino|Nobuaki Yasuo|Masakazu Sekijima,https://data.mendeley.com/datasets/5jfsx6j75g,DATASET|FILE_SET,2020-04-21,MD simulations were performed using Desmond on...,10.17632/5jfsx6j75g.2,Drug|Coronavirus Disease 2019|Molecular Dynami...,,Drug|Coronavirus Disease 2019|Molecular Dynami...,,2020-04-21,MENDELEY_DATA,Trajectory data of molecular dynamics simulati...,article,2.0
199,CC BY 4.0,"Cespugli, Marco|Durmaz, Vedat|Steinkellner, Ge...",https://figshare.com/articles/Molecular_dynami...,DATASET,,Molecular dynamics simulations (500 ps MD at 3...,10.6084/m9.figshare.11788794.v1,60102 Bioinformatics|60506 Virology,,,,2020-01-01,figshare.ars,Molecular dynamics simulations of coronavirus ...,,
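When inspecting matches it can help to know which keyword triggered each hit, for example to spot short terms that also occur in unrelated text. A minimal sketch on toy data follows. The frame, the column selection, the keywords, and the `matched_terms` column are all hypothetical.

```python
import pandas as pd

# Hypothetical toy rows; the second one should match nothing
toy = pd.DataFrame({
    "title": ["COVID-19 related trajectory data", "Lysozyme in water"],
    "description": ["Raw trajectory data of the main protease", None],
})
keywords = ["covid-19", "main protease", "nsp1"]  # assumed to be lower-case already
cols = ["title", "description"]

# Join the text columns into one lower-cased string per row; missing cells become ""
text = toy[cols].fillna("").apply(" ".join, axis=1).str.lower()

# For each row, list the keywords that occur literally in the joined text
toy["matched_terms"] = text.apply(lambda s: [w for w in keywords if w in s])
print(toy["matched_terms"].tolist())
```

Rows with an empty list matched nothing. A term that appears for a surprisingly large share of rows is a candidate for closer review.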


### Find matches in the "Molecular trajectories" keyword search

In [22]:
for col in cols_mt:
    for word in first_term:
        # Flag rows whose lower-cased text in this column contains the current keyword;
        # na = False counts missing values as non-matches
        cur_match = mt_data[col].str.lower().str.contains(word, na = False) # https://stackoverflow.com/a/15333283
        # Accumulate the hits across keywords and columns with a logical OR
        found_mt = found_mt | cur_match
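The same column-by-keyword loop is used for both data sets. As a hedged alternative, the keywords can be combined into one escaped, case-insensitive pattern and wrapped in a helper. The function name `find_keyword_rows` and the toy inputs are assumptions for illustration. On the real data the result may differ from the loop above wherever a keyword contains regular-expression metacharacters, because they are escaped here.

```python
import re
import pandas as pd

def find_keyword_rows(df, cols, keywords):
    """Return a boolean Series that is True where any keyword occurs in any of cols."""
    # Escape each keyword so it is matched literally, then OR them into one pattern
    pattern = "|".join(re.escape(w) for w in keywords)
    found = pd.Series(False, index=df.index)
    for col in cols:
        # case=False ignores case; na=False treats missing cells as non-matches
        found = found | df[col].str.contains(pattern, case=False, na=False)
    return found

# Hypothetical toy data with one missing-value-only row in "keywords"
toy = pd.DataFrame({
    "title": ["SARS-CoV2 Main Protease", "Lysozyme in water", "Water box"],
    "keywords": [None, "Spike protein (monomer)", None],
})
mask = find_keyword_rows(toy, ["title", "keywords"], ["sars-cov2", "(monomer)"])
print(mask.tolist())
```

The helper assumes a non-empty keyword list and text (object or string) columns. An empty list would produce an empty pattern that matches every non-missing cell.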

Let's check how many rows matched at least one keyword:

In [23]:
found_mt.value_counts()

False    417
True      21
dtype: int64

Let's check out some of the matches:

In [24]:
mt_data[found_mt].head()

Unnamed: 0,accessRights,authors,containerURI,dataTypes,dateAvailable,description,doi,externalSubjectAreas,institutions,keywords,method,publicationDate,source,title,type_cont,version
105,http://creativecommons.org/licenses/by/4.0,Teruhisa S. KOMATSU|Yohei Koyama|Noriaki OKIMO...,https://data.mendeley.com/datasets/vpps4vhryg,OTHER|IMAGE|TABULAR_DATA|DATASET|TEXT,2020-04-27,Raw trajectory data (GROMACS format) of 10 mic...,10.17632/vpps4vhryg.2,Virus|Drug|Molecular Dynamics,,Virus|Drug|Molecular Dynamics,,2020-04-27,MENDELEY_DATA,COVID-19 related trajectory data of 10 microse...,article,2.0
139,http://creativecommons.org/licenses/by/4.0,Ryunosuke Yoshino|Nobuaki Yasuo|Masakazu Sekijima,https://data.mendeley.com/datasets/5jfsx6j75g,DATASET|FILE_SET,2020-04-21,MD simulations were performed using Desmond on...,10.17632/5jfsx6j75g.2,Drug|Coronavirus Disease 2019|Molecular Dynami...,,Drug|Coronavirus Disease 2019|Molecular Dynami...,,2020-04-21,MENDELEY_DATA,Trajectory data of molecular dynamics simulati...,article,2.0
153,info:eu-repo/semantics/openAccess,"Durdagi, Serdar|Aksoydan, Busecan|Dogan, Berna...",https://zenodo.org/record/3756976,DATASET,2020-04-19,Data includes all of the trajectories (1000) o...,10.5281/zenodo.3756976,"SARS-CoV2 Main Protease, holo form, PDB 6LU7, ...",,"SARS-CoV2 Main Protease, holo form, PDB 6LU7, ...",,2020-04-18,ZENODO,All Atom Molecular Dynamics Simulations of inh...,article,
156,info:eu-repo/semantics/openAccess,"Durdagi, Serdar|Aksoydan, Busecan|Dogan, Berna...",https://zenodo.org/record/3746892,DATASET,2020-04-09,Data includes all of the trajectories (2000) o...,10.5281/zenodo.3746892,"SARS-CoV2 Main Protease, ritonavir, molecular ...",,"SARS-CoV2 Main Protease, ritonavir, molecular ...",,2020-04-09,ZENODO,All Atom Molecular Dynamics Simulations of Rit...,article,
179,info:eu-repo/semantics/openAccess,Průša Jiří|Cifra Michal,https://zenodo.org/record/3352030,DATASET|FILE_SET,2019-08-06,We present molecular dynamics (MD) trajectorie...,10.5281/zenodo.3352030,Dielectric spectroscopy|Amino acid|Molecular d...,,Dielectric spectroscopy|Amino acid|Molecular d...,,2019-07-31,ZENODO,Dataset A1606 and A1905,article,


## Output the filtered data sets as TSV files

In [25]:
results = "02_processed/"
md_copy = md_data[found_md].copy()
# Renumber the rows from 0 so the pre-filter index is not carried into the output
md_copy.reset_index(drop = True).to_csv(path_or_buf = results + "filtered_molecular_dynamics.tsv",
                                        sep = "\t"
                                       )

In [26]:
mt_copy = mt_data[found_mt].copy()
# Renumber the rows from 0, as for the molecular dynamics data
mt_copy.reset_index(drop = True).to_csv(path_or_buf = results + "filtered_molecular_trajectories.tsv",
                                        sep = "\t"
                                       )
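One detail worth checking: `to_csv` writes the index as an extra first column by default, even after `reset_index(drop = True)`, and reading such a file back with `pd.read_csv` yields a column named `Unnamed: 0`. If that column is not wanted downstream, `index = False` leaves it out. The sketch below demonstrates this on a toy frame written to an in-memory buffer. The frame, its values, and the variable names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical toy frame; the DOIs are placeholders, not real records
toy = pd.DataFrame({"doi": ["10.0000/example.1", "10.0000/example.2"],
                    "source": ["ZENODO", "MENDELEY_DATA"]})

# Default behaviour: the index is written as a first column with an empty header
with_index = io.StringIO()
toy.to_csv(with_index, sep="\t")
with_index.seek(0)
cols_with_index = pd.read_csv(with_index, sep="\t").columns.tolist()

# index=False writes only the data columns
without_index = io.StringIO()
toy.to_csv(without_index, sep="\t", index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index, sep="\t").columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Keeping the renumbered index in the file, as the cells above do, is also a valid choice when a stable row number is useful to later steps.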