# CORD-19 Software Counting
This Jupyter notebook counts software mentions in the CORD-19 dataset, available from: 
https://datadryad.org/stash/dataset/doi:10.5061/dryad.vmcvdncs0

First, relevant packages must be imported to the Notebook. 

In [1]:
import numpy as np
import pandas as pd
import csv
import ast
import collections
import matplotlib.pyplot as plt
import Levenshtein as lev
from fuzzywuzzy import fuzz 

Get the data and save it to a variable. 

In [2]:
CORD19_CSV = pd.read_csv('../data/cord-19/CORD19_software_mentions.csv' , converters={'software': lambda x: x[1:-1].split(',')})
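
The converter above strips the outer brackets and splits the raw cell text on commas, which keeps the quote characters and the space after each comma inside the entries. As a point of comparison, the sketch below parses the same kind of cell with ast.literal_eval (ast is already imported in this notebook). It assumes that every cell is a valid Python list literal, which has not been verified for the full dataset; the sample string is illustrative only.

```python
import ast

# Illustrative cell content, modelled on the "software" column of this notebook.
raw_cell = "['STATA', 'IAMCEST']"

# Approach of the converter: drop the brackets, then split on commas.
naive = raw_cell[1:-1].split(',')

# Alternative (assumption: the cell is a valid Python list literal).
parsed = ast.literal_eval(raw_cell)

print(naive)   # entries still carry quotes; the second one starts with a space
print(parsed)  # plain strings without quotes or padding
```

A name that itself contains a comma would be split in two by the naive approach, while literal_eval would raise an error on a malformed cell, so either choice needs checking against the real data.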

Show the head of the dataset to inspect all columns and obtain a broad overview. 

In [3]:
CORD19_CSV.head(20)

Unnamed: 0,paper_id,doi,title,source_x,license,publish_time,journal,url,software
0,00006903b396d50cc0037fed39916d57d50ee801,,Urban green space and happiness in developed c...,ArXiv,arxiv,2021-01-04,,https://arxiv.org/pdf/2101.00807v1.pdf,['Google Street View']
1,0000fcce604204b1b9d876dc073eb529eb5ce305,10.1016/j.regg.2021.01.002,La Geriatría de Enlace con residencias en la é...,Elsevier; PMC,els-covid,2021-01-13,Rev Esp Geriatr Gerontol,https://api.elsevier.com/content/article/pii/S...,['SEGG']
2,000122a9a774ec76fa35ec0c0f6734e7e8d0c541,10.1016/j.rec.2020.08.002,Impact of COVID-19 on ST-segment elevation myo...,Elsevier; Medline; PMC,no-cc,2020-09-08,Rev Esp Cardiol (Engl Ed),https://api.elsevier.com/content/article/pii/S...,"['STATA', 'IAMCEST']"
3,0001418189999fea7f7cbe3e82703d71c85a6fe5,10.1016/j.vetmic.2006.11.026,Absence of surface expression of feline infect...,Elsevier; Medline; PMC,no-cc,2007-03-31,Vet Microbiol,https://www.sciencedirect.com/science/article/...,['SPSS']
4,00033d5a12240a8684cfe943954132b43434cf48,10.3390/v12080849,Detection of Severe Acute Respiratory Syndrome...,Medline; PMC,cc-by,2020-08-04,Viruses,https://www.ncbi.nlm.nih.gov/pubmed/32759673/;...,"['R', 'MassARRAY Typer Analyzer']"
5,00035ac98d8bc38fbca02a1cc957f55141af67c0,10.3389/fpsyt.2020.559701,The Psychological Pressures of Breast Cancer P...,Medline; PMC,cc-by,2020-12-15,Front Psychiatry,https://doi.org/10.3389/fpsyt.2020.559701; htt...,"['Wechat', 'SPSS Statistics']"
6,00039b94e6cb7609ecbddee1755314bcfeb77faa,10.1111/j.1365-2249.2004.02415.x,Plasma inflammatory cytokines and chemokines i...,Medline; PMC,bronze-oa,2004-04-01,Clinical & Experimental Immunology,https://onlinelibrary.wiley.com/doi/pdfdirect/...,['Statistical Package for Social Sciences (SPS...
7,0004456994f6c1d5db7327990386d33c01cff32a,10.1186/1471-2334-10-8,Seasonal influenza risk in hospital healthcare...,PMC,cc-by,2010-01-12,BMC Infect Dis,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...,"['STATA', 'STATA', 'Statacorp']"
8,00073cb65dd2596249230fab8b15a71c4a135895,10.1086/605034,Risk Parameters of Fulminant Acute Respiratory...,Medline; PMC,no-cc,2009-08-01,J Infect Dis,https://doi.org/10.1086/605034; https://www.nc...,"['SPSS', 'SPSS']"
9,0007f972812bb45abbe5b0edf8db5359d49c23eb,10.1186/s42234-020-00057-1,The role of nicotinic receptors in SARS-CoV-2 ...,Medline; PMC,cc-by,2020-10-28,Bioelectron Med,https://www.ncbi.nlm.nih.gov/pubmed/33292872/;...,"['geNorm', 'GraphPad Prism', 'GraphPad', 'C..."


The dataset contains nine columns.
The rest of this notebook explores the column "software". 
To that end, the column is saved to its own object. 

In [4]:
software = CORD19_CSV.software

In [5]:
software

0                    ['Google Street View']
1                                  ['SEGG']
2                     ['STATA',  'IAMCEST']
3                                  ['SPSS']
4        ['R',  'MassARRAY Typer Analyzer']
                        ...                
77443                          ['UpToDate']
77444                   ['SALib',  'Panda']
77445                ['Prism',  'GraphPad']
77446    ['R package circular',  'R',  'R']
77447       ['GRAM',  'R studio',  'Stata']
Name: software, Length: 77448, dtype: object

In [6]:
len_software_multiple_entries = len(software)
len_software_multiple_entries

77448

The software object contains 77448 rows, each holding the software entries of one paper. Some rows contain more than one entry; for instance, the rows with index 2 and 4 each have two entries. The object is therefore transformed so that each row contains exactly one software entry. 

In [7]:
software = software.explode(ignore_index = True)
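
As a small illustration of what explode does, the following sketch applies it to a made-up series of lists and derives the average number of entries per row in the same way as this notebook does further down. The sample values are invented for the demo.

```python
import pandas as pd

# Three made-up rows, each holding a list of software mentions.
rows = pd.Series([["STATA", "IAMCEST"], ["SPSS"], ["R", "R", "Stata"]])

# One list element per row; ignore_index=True renumbers the result 0..n-1.
flat = rows.explode(ignore_index=True)

# Average number of mentions per original row.
average = len(flat) / len(rows)

print(flat.tolist())
print(average)
```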

Remove the single quotes around the software mentions (the square brackets were already stripped by the converter in read_csv). 

In [8]:
software = software.str.replace('\'', '')

Inspect the software object to check that each row contains only one entry. 

In [9]:
software

0         Google Street View
1                       SEGG
2                      STATA
3                    IAMCEST
4                       SPSS
                 ...        
558787                     R
558788                     R
558789                  GRAM
558790              R studio
558791                 Stata
Name: software, Length: 558792, dtype: object

In [10]:
len_software_single_entries = len(software)
len_software_single_entries

558792

Now, the object contains solely one entry per row and has the length of 558792.

In [11]:
average_entries_per_row = len_software_single_entries/len_software_multiple_entries
average_entries_per_row

7.215060427641773

Dividing the number of single entries by the number of rows shows that the dataset contains on average 7.2 software entries per row. Next, the value_counts function is used to reduce the number of rows by collapsing identical entries into one row with a count. 

In [12]:
software.value_counts(dropna=False)

 R                            8389
 SPSS                         4738
SPSS                          4472
 BLAST                        3166
 Excel                        2666
                              ... 
"R package  graphics"            1
 SnpEff (SNP Effects)            1
 Gatan Microscopy Suite 37       1
 MyoMaster                       1
VULNERABLE                       1
Name: software, Length: 120264, dtype: int64

The function value_counts reduced the number of rows to 120264. Nevertheless, some identical software mentions are still counted as distinct. For instance, "SPSS" is listed twice with two different counts. To check for possible leading spaces, the series is converted to a dictionary.

In [13]:
software_dict = software.to_dict()
software_dict

{0: 'Google Street View',
 1: 'SEGG',
 2: 'STATA',
 3: ' IAMCEST',
 4: 'SPSS',
 5: 'R',
 6: ' MassARRAY Typer Analyzer',
 7: 'Wechat',
 8: ' SPSS Statistics',
 9: 'Statistical Package for Social Sciences (SPSS)',
 10: ' BD CBA',
 11: 'STATA',
 12: ' STATA',
 13: ' Statacorp',
 14: 'SPSS',
 15: ' SPSS',
 16: 'geNorm',
 17: ' GraphPad Prism',
 18: ' GraphPad',
 19: ' Cellranger',
 20: ' R',
 21: ' Seurat',
 22: ' ggplot2',
 23: ' LinRegPCR',
 24: 'GramA',
 25: 'R package edgeR',
 26: ' R package edgeR',
 27: ' R package edgeR',
 28: ' STAR',
 29: ' FastQC',
 30: ' R package ALDEx2',
 31: ' ImageJ',
 32: ' PfAlbas',
 33: ' SAM',
 34: 'MORO Praxis',
 35: ' MORO',
 36: 'R2HC',
 37: 'Singapour',
 38: 'PRESET',
 39: 'Google Trends (GT',
 40: ' GT',
 41: 'SINUS',
 42: ' VICTIMES',
 43: 'Spirocall',
 44: ' LIBSVM',
 45: ' MATLAB voicebox',
 46: ' openS',
 47: ' MILE',
 48: 'DP',
 49: ' DP',
 50: ' DSGVO',
 51: ' iOS',
 52: ' DSGVO',
 53: ' PEPP',
 54: ' PT',
 55: ' DP',
 56: ' 3T',
 57: 'REDCap

In this case, some mentions start with a space, which makes value_counts treat them as distinct from the same mention without the space. Therefore, the function remove_empty_spaces(dic) takes a dictionary and removes a space at the first position of each string. 

In [14]:
def remove_empty_spaces(dic):
    """ Function removing an empty space at the first position of a string. 
    """
    for i in dic:
        if dic[i][:1] == " ":
            dic[i] = dic[i][1:] #.strip() -> Improvement
    return dic
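
The comment in the function hints at .strip() as an improvement. A minimal sketch of that idea with the vectorized pandas string methods follows, using made-up sample values. Note that str.strip() removes all leading and trailing whitespace, not just a single leading space, so the resulting counts could differ slightly from the ones shown in this notebook.

```python
import pandas as pd

# Made-up mentions with the kind of padding seen in the dictionary above.
mentions = pd.Series(["SPSS", " SPSS", " R", "R", "  Excel "])

# Vectorized whitespace removal, no dictionary round trip needed.
cleaned = mentions.str.strip()
counts = cleaned.value_counts()

print(counts)
```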

In [15]:
software_dict = remove_empty_spaces(software_dict)
software_dict

{0: 'Google Street View',
 1: 'SEGG',
 2: 'STATA',
 3: 'IAMCEST',
 4: 'SPSS',
 5: 'R',
 6: 'MassARRAY Typer Analyzer',
 7: 'Wechat',
 8: 'SPSS Statistics',
 9: 'Statistical Package for Social Sciences (SPSS)',
 10: 'BD CBA',
 11: 'STATA',
 12: 'STATA',
 13: 'Statacorp',
 14: 'SPSS',
 15: 'SPSS',
 16: 'geNorm',
 17: 'GraphPad Prism',
 18: 'GraphPad',
 19: 'Cellranger',
 20: 'R',
 21: 'Seurat',
 22: 'ggplot2',
 23: 'LinRegPCR',
 24: 'GramA',
 25: 'R package edgeR',
 26: 'R package edgeR',
 27: 'R package edgeR',
 28: 'STAR',
 29: 'FastQC',
 30: 'R package ALDEx2',
 31: 'ImageJ',
 32: 'PfAlbas',
 33: 'SAM',
 34: 'MORO Praxis',
 35: 'MORO',
 36: 'R2HC',
 37: 'Singapour',
 38: 'PRESET',
 39: 'Google Trends (GT',
 40: 'GT',
 41: 'SINUS',
 42: 'VICTIMES',
 43: 'Spirocall',
 44: 'LIBSVM',
 45: 'MATLAB voicebox',
 46: 'openS',
 47: 'MILE',
 48: 'DP',
 49: 'DP',
 50: 'DSGVO',
 51: 'iOS',
 52: 'DSGVO',
 53: 'PEPP',
 54: 'PT',
 55: 'DP',
 56: '3T',
 57: 'REDCapbased',
 58: 'REDCap',
 59: 'Research

Now, the software mentions within the dictionary do not contain empty spaces at the first position of the string. For the use of value_counts, the dictionary is converted to a pandas series. 

In [16]:
software_series = pd.Series(software_dict)
software_series.value_counts()

R                        10805
SPSS                      9210
GraphPad Prism            3986
Excel                     3856
BLAST                     3674
                         ...  
Maldita                      1
Loccioni                     1
NetPlantGene Server          1
Discover                     1
GWAS Catalog REST API        1
Length: 102725, dtype: int64

Removing the leading spaces decreased the number of distinct mentions. To reduce it further, all strings are converted to upper case. 

In [17]:
software_dict = software_series.to_dict()

In [18]:
def capitalize_mentions(dic):
    """ Function iterating a dictionary and capitalizing all strings.
    """
    for i in dic:
        dic[i] = dic[i].upper()
    return dic
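
The effect of this normalization can be sketched on a small made-up list with collections.Counter (collections is already imported in this notebook). The sample values are invented; the point is only that spellings which differ by case collapse into one key.

```python
from collections import Counter

# Made-up mentions that differ only in letter case.
mentions = ["Stata", "STATA", "stata", "GraphPad Prism", "Graphpad Prism", "R"]

before = Counter(mentions)                    # case-sensitive counting
after = Counter(m.upper() for m in mentions)  # counting after upper-casing

print(len(before), len(after))
print(after.most_common(1))
```

Upper-casing would also merge two different tools whose names differ only by case, so the merged counts are best read as an approximation.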

In [19]:
software_dict = capitalize_mentions(software_dict)
software_dict

{0: 'GOOGLE STREET VIEW',
 1: 'SEGG',
 2: 'STATA',
 3: 'IAMCEST',
 4: 'SPSS',
 5: 'R',
 6: 'MASSARRAY TYPER ANALYZER',
 7: 'WECHAT',
 8: 'SPSS STATISTICS',
 9: 'STATISTICAL PACKAGE FOR SOCIAL SCIENCES (SPSS)',
 10: 'BD CBA',
 11: 'STATA',
 12: 'STATA',
 13: 'STATACORP',
 14: 'SPSS',
 15: 'SPSS',
 16: 'GENORM',
 17: 'GRAPHPAD PRISM',
 18: 'GRAPHPAD',
 19: 'CELLRANGER',
 20: 'R',
 21: 'SEURAT',
 22: 'GGPLOT2',
 23: 'LINREGPCR',
 24: 'GRAMA',
 25: 'R PACKAGE EDGER',
 26: 'R PACKAGE EDGER',
 27: 'R PACKAGE EDGER',
 28: 'STAR',
 29: 'FASTQC',
 30: 'R PACKAGE ALDEX2',
 31: 'IMAGEJ',
 32: 'PFALBAS',
 33: 'SAM',
 34: 'MORO PRAXIS',
 35: 'MORO',
 36: 'R2HC',
 37: 'SINGAPOUR',
 38: 'PRESET',
 39: 'GOOGLE TRENDS (GT',
 40: 'GT',
 41: 'SINUS',
 42: 'VICTIMES',
 43: 'SPIROCALL',
 44: 'LIBSVM',
 45: 'MATLAB VOICEBOX',
 46: 'OPENS',
 47: 'MILE',
 48: 'DP',
 49: 'DP',
 50: 'DSGVO',
 51: 'IOS',
 52: 'DSGVO',
 53: 'PEPP',
 54: 'PT',
 55: 'DP',
 56: '3T',
 57: 'REDCAPBASED',
 58: 'REDCAP',
 59: 'RESEARCH

In [20]:
software_series = pd.Series(software_dict)
software_series = software_series.value_counts()
software_series

R                 10805
SPSS               9229
GRAPHPAD PRISM     4461
EXCEL              4054
BLAST              3943
                  ...  
SENOVA                1
WSOLVE                1
GOSE                  1
MSCARLET              1
ROBUSTALERT           1
Length: 89482, dtype: int64

Converting the software mentions to upper case decreased the number of distinct mentions further. Next, the fuzzywuzzy comparison functions are introduced. fuzzywuzzy is based on the Levenshtein distance and offers several ways of scoring the similarity of two strings. 
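
To make the notion of Levenshtein distance concrete, here is a minimal pure-Python sketch of the classic dynamic-programming computation. It is not the code that fuzzywuzzy runs, and fuzzywuzzy reports normalized scores between 0 and 100 rather than raw distances; the sketch only illustrates the underlying idea of counting single-character edits.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


print(levenshtein("BLAST", "LAST"))
print(levenshtein("ENSEMBL", "ENSEMBLE"))
```

Short names are only one or two edits away from many unrelated names, which is worth keeping in mind when choosing a threshold.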

In [21]:
def fuzzy_ratio_compare(str1, str2, th):
    """Return True if fuzz.ratio of the two strings exceeds the threshold th."""
    return fuzz.ratio(str1, str2) > th

def fuzzy_partial_ratio_compare(str1, str2, th):
    """Return True if fuzz.partial_ratio of the two strings exceeds the threshold th."""
    return fuzz.partial_ratio(str1, str2) > th

def fuzzy_token_sort_ratio_compare(str1, str2, th):
    """Return True if fuzz.token_sort_ratio of the two strings exceeds the threshold th."""
    return fuzz.token_sort_ratio(str1, str2) > th

def fuzzy_token_set_ratio_compare(str1, str2, th):
    """Return True if fuzz.token_set_ratio of the two strings exceeds the threshold th."""
    return fuzz.token_set_ratio(str1, str2) > th
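
To give a feeling for how a token-set comparison differs from a plain ratio, the sketch below uses a simplified stand-in built on difflib.SequenceMatcher from the standard library. The helper names simple_ratio and simple_token_set_ratio are hypothetical and this is not fuzzywuzzy's implementation (for example, fuzzywuzzy also preprocesses the strings); it only mirrors the general idea of comparing the shared tokens against each full token set.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity of two strings on a 0..100 scale (stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def simple_token_set_ratio(a, b):
    """Simplified token-set score: best ratio among the shared tokens and
    the two 'shared tokens + remaining tokens' strings."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    common = " ".join(sorted(tokens_a & tokens_b))
    combined_a = (common + " " + " ".join(sorted(tokens_a - tokens_b))).strip()
    combined_b = (common + " " + " ".join(sorted(tokens_b - tokens_a))).strip()
    return max(simple_ratio(common, combined_a),
               simple_ratio(common, combined_b),
               simple_ratio(combined_a, combined_b))


print(simple_ratio("R", "R PACKAGE"))            # 20
print(simple_token_set_ratio("R", "R PACKAGE"))  # 100
```

Under this scheme a one-token name scores 100 against every longer name that contains it as a token, which helps explain why a single-letter mention such as 'R' can absorb many multi-word mentions.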

For performance reasons, the following steps use only a subset of the series: the limit most frequent mentions. 

In [22]:
limit = 10000
software_series_shaped = software_series.head(limit)
software_series_shaped

R                 10805
SPSS               9229
GRAPHPAD PRISM     4461
EXCEL              4054
BLAST              3943
                  ...  
GOCHECK               8
OURA                  8
EXONERATE             8
TOUSANTICOVID         8
DAL                   8
Length: 10000, dtype: int64
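
The reason for limiting the subset is that comparing every mention with every other mention grows quadratically with the number of mentions. A rough sketch of the arithmetic, assuming each unordered pair is visited once (the helper name pair_count is hypothetical):

```python
def pair_count(n):
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for size in (5000, 10000):
    print(size, pair_count(size))
```

Doubling the subset size roughly quadruples the number of pairs, so the limit has a large effect on the running time.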

Converting the series to a DataFrame for comparison purposes. The index of the DataFrame is required for selecting rows. 

In [23]:
# Turn the index (software names) and the values (counts) into two columns
# with a fresh 0..n-1 integer index.
df_shaped = software_series_shaped.rename_axis('Software').reset_index(name='Matches')
df_shaped

Unnamed: 0,Software,Matches
0,R,10805
1,SPSS,9229
2,GRAPHPAD PRISM,4461
3,EXCEL,4054
4,BLAST,3943
...,...,...
9995,GOCHECK,8
9996,OURA,8
9997,EXONERATE,8
9998,TOUSANTICOVID,8


Remove parentheses from the mentions to prevent an "unterminated subpattern" regular-expression error at a later stage of this notebook.

In [24]:
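# regex=False treats the parentheses as literal characters, not as a regex pattern.
df_shaped['Software'] = df_shaped.Software.str.replace('(', '', regex=False)
df_shaped['Software'] = df_shaped.Software.str.replace(')', '', regex=False)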
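
The error in question can be reproduced with the standard library alone. The sketch below uses a mention with an unbalanced parenthesis, as it occurs in the data, and shows two ways around the problem: removing the parentheses, as done in this notebook, or escaping the string with re.escape as an alternative that keeps the original text.

```python
import re

# A mention with an unbalanced parenthesis, as it appears in the data.
name = "GOOGLE TRENDS (GT"

# Using the raw string as a regular expression fails to compile.
try:
    re.compile(name)
    compiles = True
except re.error as err:
    compiles = False
    print("regex error:", err)

# Option 1 (used in this notebook): drop the parentheses.
stripped = name.replace("(", "").replace(")", "")

# Option 2 (alternative): escape the string so it is matched literally.
escaped = re.escape(name)

print(stripped)
print(re.search(escaped, "SEE GOOGLE TRENDS (GT DATA") is not None)
```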

The following function compares software mentions based on the fuzzywuzzy method. As a result, a blacklist with identified duplicates and a modified dataframe are returned. 

In [25]:
def unify_dataframe(df, th):
    """Merge the counts of software mentions that match under fuzz.token_set_ratio.

    Note: df is modified in place. Returns the DataFrame and a blacklist (set)
    of the mentions that were identified as duplicates.
    """
    blacklist = set()
    for i in range(len(df)):
        for j in range(i + 1, len(df)):
            # Skip rows that were already merged into an earlier mention.
            if df['Software'][i] not in blacklist:
                if fuzzy_token_set_ratio_compare(df['Software'][i], df['Software'][j], th):
                    print("'"+df['Software'][i]+"' with " + str(df['Matches'][i]) + " mentions matched with '" + df['Software'][j] + "' mentioned " + str(df['Matches'][j])+ " times")
                    # Chained assignment: this line triggers the pandas warning shown in the output.
                    df['Matches'][i] = int(df['Matches'][i] + df['Matches'][j])
                    blacklist.add(df['Software'][j])
    return df, blacklist
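
The chained assignment in this function is what pandas warns about in the output of the next cell. The following sketch shows a hypothetical variant, unify_counts, that writes through a single .loc call instead. It is not the function used for the results in this notebook and it behaves differently in two ways: it works on a copy, and it skips rows that were already merged, so each mention is counted at most once. The is_match parameter stands in for the fuzzy comparison, and the toy matcher and sample data are invented for the demo.

```python
import pandas as pd


def unify_counts(df, is_match):
    """Hypothetical variant of unify_dataframe.

    is_match(a, b) -> bool stands in for the fuzzy comparison.
    Assumes columns 'Software' and 'Matches' and a default 0..n-1 index.
    """
    df = df.copy()   # leave the caller's DataFrame untouched
    merged = set()   # row labels already folded into an earlier row
    for i in range(len(df)):
        if i in merged:
            continue
        for j in range(i + 1, len(df)):
            if j in merged:
                continue
            if is_match(df.loc[i, 'Software'], df.loc[j, 'Software']):
                # Single .loc call instead of chained indexing.
                df.loc[i, 'Matches'] += df.loc[j, 'Matches']
                merged.add(j)
    return df.drop(index=sorted(merged)).reset_index(drop=True)


sample = pd.DataFrame({'Software': ['R', 'R STUDIO', 'RSTUDIO', 'SPSS'],
                       'Matches': [10, 3, 2, 5]})
# Toy matcher for the demo only: two mentions match if their first token is equal.
result = unify_counts(sample, lambda a, b: a.split()[0] == b.split()[0])
print(result)
```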

In [26]:
%%time
threshold = 84
df_unified, blacklist = unify_dataframe(df_shaped, threshold)
#Wall time: 1h 4min 32s -> head = 10000 th -> 85
#Wall time: 17min 25s -> head = 5000 th -> 85

'R' with 10805 mentions matched with 'R PACKAGE' mentioned 313 times
'R' with 11118 mentions matched with 'R FOUNDATION FOR STATISTICAL COMPUTING' mentioned 280 times
'R' with 11398 mentions matched with 'R CORE TEAM' mentioned 207 times
'R' with 11605 mentions matched with 'R STUDIO' mentioned 148 times
'R' with 11753 mentions matched with 'R DEVELOPMENT CORE TEAM' mentioned 86 times
'R' with 11839 mentions matched with 'R SCRIPT' mentioned 84 times
'R' with 11923 mentions matched with 'MASK R - CNN' mentioned 71 times
'R' with 11994 mentions matched with 'R FOUNDATION FOR STATISTICAL' mentioned 63 times
'R' with 12057 mentions matched with 'R CORE' mentioned 62 times


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Matches'][i] = int(df['Matches'][i] + df['Matches'][j])


'R' with 12119 mentions matched with 'R PROJECT FOR STATISTICAL COMPUTING' mentioned 57 times
'R' with 12176 mentions matched with 'R SCRIPTS' mentioned 55 times
'R' with 12231 mentions matched with 'R FOUNDATION FOR' mentioned 38 times
'R' with 12269 mentions matched with 'R SHINY' mentioned 36 times
'R' with 12305 mentions matched with 'R LANGUAGE' mentioned 34 times
'R' with 12339 mentions matched with 'R PACKAGES' mentioned 33 times
'R' with 12372 mentions matched with 'R FOR' mentioned 33 times
'R' with 12405 mentions matched with 'R ENVIRONMENT' mentioned 32 times
'R' with 12437 mentions matched with 'R PACKAGE LIMMA' mentioned 31 times
'R' with 12468 mentions matched with 'R FOUNDATION' mentioned 28 times
'R' with 12496 mentions matched with 'R PACKAGE SEURAT' mentioned 27 times
'R' with 12523 mentions matched with 'R PACKAGE "' mentioned 27 times
'R' with 12550 mentions matched with 'LIMMA R' mentioned 24 times
'R' with 12574 mentions matched with 'R PACKAGE GGPLOT2' mentioned 

'BLAST' with 6524 mentions matched with 'LAST' mentioned 28 times
'BLAST' with 6552 mentions matched with 'BLAST BASIC LOCAL ALIGNMENT SEARCH TOOL' mentioned 24 times
'BLAST' with 6576 mentions matched with 'BASIC LOCAL ALIGNMENT SEARCH TOOL BLAST' mentioned 21 times
'BLAST' with 6597 mentions matched with '- BLAST' mentioned 18 times
'BLAST' with 6615 mentions matched with 'PBLAST' mentioned 17 times
'BLAST' with 6632 mentions matched with 'BLAST +' mentioned 16 times
'BLAST' with 6648 mentions matched with 'BLAST SEARCH' mentioned 16 times
'BLAST' with 6664 mentions matched with 'PRIMER BLAST' mentioned 15 times
'BLAST' with 6679 mentions matched with 'PROTEIN BLAST' mentioned 12 times
'BLAST' with 6691 mentions matched with 'NBLAST' mentioned 11 times
'BLAST' with 6702 mentions matched with 'NUCLEOTIDE BLAST' mentioned 9 times
'BLAST' with 6711 mentions matched with 'BLAST BASIC LOCAL ALIGNMENT SEARCH TOOL' mentioned 8 times
'STATA' with 3688 mentions matched with 'STAT' mentioned 2

'ENSEMBL' with 800 mentions matched with 'ENSEMBLE' mentioned 167 times
'ENSEMBL' with 967 mentions matched with 'ENSEMBL GENOME BROWSER' mentioned 27 times
'ENSEMBL' with 994 mentions matched with 'ENSEMBL BIOMART' mentioned 12 times
'ENSEMBL' with 1006 mentions matched with 'ENSEMBL BROWSER' mentioned 12 times
'SKYPE' with 780 mentions matched with 'SKYPE FOR BUSINESS' mentioned 10 times
'BERT' with 717 mentions matched with 'MBERT' mentioned 104 times
'BERT' with 821 mentions matched with 'SBERT' mentioned 40 times
'BERT' with 861 mentions matched with 'BRT' mentioned 19 times
'BERT' with 880 mentions matched with '- BERT' mentioned 13 times
'BERT' with 893 mentions matched with 'IBERT' mentioned 10 times
'BERT' with 903 mentions matched with 'ERT' mentioned 9 times
'GENEIOUS' with 711 mentions matched with 'GENEIOUS PRIME' mentioned 100 times
'GENEIOUS' with 811 mentions matched with 'GENEIOUS PRO' mentioned 44 times
'GENEIOUS' with 855 mentions matched with 'GENEIOUS R11' mentione

'MOE' with 678 mentions matched with 'MOLECULAR OPERATING ENVIRONMENT MOE' mentioned 15 times
'STRING' with 428 mentions matched with 'STING' mentioned 10 times
'DAVID' with 421 mentions matched with 'VISUALIZATION AND INTEGRATED DISCOVERY DAVID' mentioned 27 times
'DAVID' with 448 mentions matched with 'DAVID BIOINFORMATICS RESOURCES' mentioned 13 times
'RSTUDIO' with 419 mentions matched with 'R STUDIO' mentioned 148 times
'RSTUDIO' with 567 mentions matched with 'STUDIO' mentioned 106 times
'ONE' with 416 mentions matched with 'QUANTITY ONE' mentioned 190 times
'ONE' with 606 mentions matched with 'CONE' mentioned 103 times
'ONE' with 709 mentions matched with 'ONE HEALTH' mentioned 50 times
'ONE' with 759 mentions matched with 'PLOS ONE' mentioned 21 times
'ONE' with 780 mentions matched with 'PREDICTION ONE' mentioned 12 times
'ONE' with 792 mentions matched with 'RAD QUANTITY ONE' mentioned 10 times
'ONE' with 802 mentions matched with 'ONEE' mentioned 8 times
'WEB' with 414 ment

'NETMHCPAN' with 461 mentions matched with 'NETMHCPAN EL' mentioned 9 times
'PREDICT' with 286 mentions matched with 'PREDICTA' mentioned 14 times
'PREDICT' with 300 mentions matched with '19PREDICT' mentioned 10 times
'PREDICT' with 310 mentions matched with 'PREDICTRV' mentioned 9 times
'PREDICT' with 319 mentions matched with 'PREDICTIV' mentioned 8 times
'DISCOVERY STUDIO' with 286 mentions matched with 'R STUDIO' mentioned 148 times
'DISCOVERY STUDIO' with 434 mentions matched with 'DISCOVERY' mentioned 132 times
'DISCOVERY STUDIO' with 566 mentions matched with 'STUDIO' mentioned 106 times
'DISCOVERY STUDIO' with 672 mentions matched with 'DISCOVERY STUDIO VISUALIZER' mentioned 98 times
'DISCOVERY STUDIO' with 770 mentions matched with 'BIOVIA DISCOVERY STUDIO' mentioned 46 times
'DISCOVERY STUDIO' with 816 mentions matched with 'BIOVIA DISCOVERY STUDIO VISUALIZER' mentioned 31 times
'DISCOVERY STUDIO' with 847 mentions matched with 'ACCELRYS DISCOVERY STUDIO' mentioned 22 times


'CNN' with 229 mentions matched with 'MASK R - CNN' mentioned 71 times
'CNN' with 300 mentions matched with 'RCNN' mentioned 25 times
'CNN' with 325 mentions matched with 'CONN' mentioned 23 times
'CNN' with 348 mentions matched with 'FASTER R - CNN' mentioned 22 times
'CNN' with 370 mentions matched with 'CANN' mentioned 14 times
'CNN' with 384 mentions matched with 'MASK R CNN' mentioned 11 times
'CNN' with 395 mentions matched with '- CNN' mentioned 9 times
'CNN' with 404 mentions matched with 'CASCADE R - CNN' mentioned 8 times
'CNN' with 412 mentions matched with 'CGNN' mentioned 8 times
'SPARK' with 227 mentions matched with 'APACHE SPARK' mentioned 82 times
'SPARK' with 309 mentions matched with 'SPARKY' mentioned 28 times
'SPARK' with 337 mentions matched with 'SPARKS' mentioned 21 times
'SPARK' with 358 mentions matched with 'SPARK SQL' mentioned 18 times
'SPARK' with 376 mentions matched with 'SPARK NLP' mentioned 14 times
'SPARK' with 390 mentions matched with 'SPAR' mention

'ADOBE PHOTOSHOP' with 341 mentions matched with 'ADOBE PHOTOSHOP CC' mentioned 9 times
'ADOBE PHOTOSHOP' with 350 mentions matched with 'PHOTOSHOP CC' mentioned 8 times
'MEME' with 208 mentions matched with 'MEM' mentioned 168 times
'MEME' with 376 mentions matched with 'MEME SUITE' mentioned 24 times
'LASERGENE' with 207 mentions matched with 'DNASTAR LASERGENE' mentioned 38 times
'LASERGENE' with 245 mentions matched with 'LASERGENE PACKAGE' mentioned 14 times
'PANTHER' with 207 mentions matched with 'PANTHER FUSION' mentioned 12 times
'RNAFOLD' with 206 mentions matched with 'UNAFOLD' mentioned 20 times
'RNAFOLD' with 226 mentions matched with 'RNACOFOLD' mentioned 14 times
'RNAFOLD' with 240 mentions matched with 'RNALFOLD' mentioned 11 times
'RNAFOLD' with 251 mentions matched with 'RNAFOLD WEBSERVER' mentioned 11 times
'RNAFOLD' with 262 mentions matched with 'RNAPLFOLD' mentioned 9 times
'[UNK]' with 204 mentions matched with 'GOOGLE [UNK]' mentioned 22 times
'[UNK]' with 226 m

'COVIDNET' with 310 mentions matched with 'CVDNET' mentioned 10 times
'GENECONV' with 169 mentions matched with 'GENECOV' mentioned 12 times
'EXPASY' with 169 mentions matched with 'EXPASY PROTPARAM' mentioned 30 times
'CCP4' with 169 mentions matched with 'CCP4 SUITE' mentioned 23 times
'CCP4' with 192 mentions matched with 'CCP' mentioned 10 times
'SAM' with 169 mentions matched with 'SCAM' mentioned 23 times
'SAM' with 192 mentions matched with 'STEPWATCH ACTIVITY MONITOR SAM' mentioned 16 times
'SAM' with 208 mentions matched with 'SAMD' mentioned 14 times
'SAM' with 222 mentions matched with 'SIGNIFICANCE ANALYSIS OF MICROARRAYS SAM' mentioned 13 times
'SAM' with 235 mentions matched with 'SLAM' mentioned 9 times
'SP' with 168 mentions matched with 'GLIDE SP' mentioned 11 times
'GBSA' with 168 mentions matched with 'GSA' mentioned 41 times
'NETCTL' with 168 mentions matched with 'NETCTL1' mentioned 18 times
'JALVIEW' with 168 mentions matched with 'ALIVIEW' mentioned 48 times
'3D'

'STATISTICAL PACKAGE FOR THE SOCIAL SCIENCES' with 761 mentions matched with 'STATISTICAL PACKAGE FOR THE SOCIAL SCIENCES SOFTWARE SPSS' mentioned 8 times
'STATISTICAL PACKAGE FOR THE SOCIAL SCIENCES' with 769 mentions matched with 'SPSS STATISTICAL PACKAGE FOR THE SOCIAL SCIENCES' mentioned 8 times
'ALLOY' with 150 mentions matched with 'ALLOY ANALYZER' mentioned 51 times
'POSTGRESQL' with 148 mentions matched with 'POSTGRES' mentioned 11 times
'ARDUINO' with 148 mentions matched with 'ARDUINO IDE' mentioned 18 times
'ARDUINO' with 166 mentions matched with 'ARDUINO UNO' mentioned 9 times
'BLACKBOARD' with 147 mentions matched with 'BLACKBOARD COLLABORATE' mentioned 18 times
'VADR' with 146 mentions matched with 'VADER' mentioned 98 times
'VADR' with 244 mentions matched with 'VAR' mentioned 34 times
'REACTOME' with 146 mentions matched with 'REACTOMEFI' mentioned 12 times
'BIACORE' with 145 mentions matched with 'BIACORE EVALUATION' mentioned 23 times
'BIACORE' with 168 mentions matc

'SCIENCE' with 118 mentions matched with 'STATISTICAL PACKAGE FOR SOCIAL SCIENCE SPSS' mentioned 61 times
'SCIENCE' with 179 mentions matched with 'STATISTICAL PACKAGE FOR SOCIAL SCIENCE' mentioned 34 times
'SCIENCE' with 213 mentions matched with 'OPEN SCIENCE' mentioned 30 times
'SCIENCE' with 243 mentions matched with 'OPEN SCIENCE FRAMEWORK' mentioned 25 times
'SCIENCE' with 268 mentions matched with 'OF SCIENCE' mentioned 24 times
'SCIENCE' with 292 mentions matched with 'SCIENCE DIRECT' mentioned 22 times
'SCIENCE' with 314 mentions matched with 'WEB OF SCIENCE' mentioned 17 times
'SCIENCE' with 331 mentions matched with 'STATISTICAL PACKAGE FOR SOCIAL SCIENCE SPSS' mentioned 14 times
'SCIENCE' with 345 mentions matched with 'STATISTICAL PACKAGE FOR THE SOCIAL SCIENCE SPSS' mentioned 12 times
'SCIENCE' with 357 mentions matched with 'MATRIX SCIENCE' mentioned 10 times
'SCIENCE' with 367 mentions matched with 'SPSS STATISTICAL PACKAGE FOR SOCIAL SCIENCE' mentioned 8 times
'IMPACT'

'ROS' with 184 mentions matched with 'ROCS' mentioned 20 times
'ROS' with 204 mentions matched with 'ROBOT OPERATING SYSTEM ROS' mentioned 16 times
'ROS' with 220 mentions matched with 'ROSE' mentioned 15 times
'ROS' with 235 mentions matched with 'ROST' mentioned 15 times
'DEAL' with 101 mentions matched with 'EAL' mentioned 62 times
'DEAL' with 163 mentions matched with 'DEA' mentioned 17 times
'DEAL' with 180 mentions matched with 'IDEAL' mentioned 12 times
'DEAL' with 192 mentions matched with 'DEL' mentioned 11 times
'DEAL' with 203 mentions matched with 'DAL' mentioned 8 times
'PONDR' with 100 mentions matched with 'PONDR VLXT' mentioned 13 times
'PONDR' with 113 mentions matched with 'PONDR®' mentioned 10 times
'CELLRANGER' with 100 mentions matched with 'CELL RANGER' mentioned 88 times
'AUTODOCK TOOLS' with 100 mentions matched with 'AUTODOCKTOOLS' mentioned 99 times
'AUTODOCK TOOLS' with 199 mentions matched with 'TOOLS' mentioned 38 times
'AUTODOCK TOOLS' with 237 mentions ma

'GOOGLE PLAY' with 190 mentions matched with 'PLAY' mentioned 11 times
'RADS' with 91 mentions matched with 'RAD' mentioned 29 times
'RADS' with 120 mentions matched with 'ROADS' mentioned 26 times
'RADS' with 146 mentions matched with 'RAS' mentioned 12 times
'GENCODE' with 91 mentions matched with 'ENCODE' mentioned 88 times
'GENCODE' with 179 mentions matched with 'GECODE' mentioned 9 times
'ONT' with 91 mentions matched with 'ONTO' mentioned 8 times
'RANDOM' with 90 mentions matched with 'RANDOM FOREST' mentioned 41 times
'METABOANALYST' with 89 mentions matched with 'METABOANALYSTR' mentioned 10 times
'IMAGENET' with 88 mentions matched with 'IMAGENE' mentioned 15 times
'NGS' with 88 mentions matched with 'MNGS' mentioned 55 times
'NGS' with 143 mentions matched with 'NGSI' mentioned 37 times
'ITOL' with 88 mentions matched with 'INTERACTIVE TREE OF LIFE ITOL' mentioned 20 times
'AMBER18' with 88 mentions matched with 'AMBER16' mentioned 58 times
'AMBER18' with 146 mentions matche

'MIMIC' with 77 mentions matched with 'MIMICS' mentioned 25 times
'GAIA' with 77 mentions matched with 'AIA' mentioned 9 times
'GAIA' with 86 mentions matched with 'GAA' mentioned 8 times
'DYNAMUT' with 77 mentions matched with 'DYNAMUT2' mentioned 32 times
'TREESOLVE' with 77 mentions matched with 'RESOLVE' mentioned 32 times
'GTEX' with 77 mentions matched with 'TEX' mentioned 39 times
'PROTÉGÉ' with 76 mentions matched with 'PROT' mentioned 45 times
'GOOGLE DOCS' with 76 mentions matched with 'GOOGLE' mentioned 42 times
'GOOGLE DOCS' with 118 mentions matched with 'GOOGLE®' mentioned 26 times
'GOOGLE DOCS' with 144 mentions matched with 'GOOGLE DOC' mentioned 20 times
'GOOGLE DOCS' with 164 mentions matched with 'GOOGLE DUO' mentioned 19 times
'HOMER' with 76 mentions matched with 'HOME' mentioned 14 times
'MCODE' with 76 mentions matched with 'MODE' mentioned 34 times
'MCODE' with 110 mentions matched with 'CODE' mentioned 21 times
'MCODE' with 131 mentions matched with 'MOLECULAR 

'- PAD PRISM' with 115 mentions matched with 'PAD' mentioned 18 times
'- PAD PRISM' with 133 mentions matched with 'PAD PRISM' mentioned 11 times
'MIRA' with 69 mentions matched with 'MIA' mentioned 40 times
'MIRA' with 109 mentions matched with 'AMIRA' mentioned 35 times
'MIRA' with 144 mentions matched with 'MIR' mentioned 23 times
'MIRA' with 167 mentions matched with 'MICRA' mentioned 11 times
'MIRA' with 178 mentions matched with 'MISRA' mentioned 9 times
'WECHAT' with 69 mentions matched with 'WECHAT APP' mentioned 12 times
'FACEBOOK LIVE' with 69 mentions matched with 'LIVE' mentioned 34 times
'REALSTAR' with 69 mentions matched with 'REALSTAR®' mentioned 26 times
'HUMANN' with 69 mentions matched with 'HUMANN2' mentioned 8 times
'TEST' with 69 mentions matched with 'EST' mentioned 38 times
'TEST' with 107 mentions matched with 'ETEST' mentioned 29 times
'TEST' with 136 mentions matched with '3TEST' mentioned 8 times
'RNAALIFOLD' with 68 mentions matched with 'RNALFOLD' mentione

'POP' with 61 mentions matched with 'PROP' mentioned 29 times
'POP' with 90 mentions matched with 'POMP' mentioned 13 times
'GCG' with 61 mentions matched with 'GCG APP' mentioned 11 times
'BIOGRID' with 60 mentions matched with 'ISOGRID' mentioned 17 times
'UPTODATE' with 60 mentions matched with 'UPTODATE®' mentioned 16 times
'ANALYSE' with 60 mentions matched with 'ANALYST' mentioned 35 times
'ANALYSE' with 95 mentions matched with 'ANALYZE' mentioned 22 times
'LIB' with 60 mentions matched with 'DLIB' mentioned 20 times
'LIB' with 80 mentions matched with 'ZLIB' mentioned 10 times
'LIB' with 90 mentions matched with 'ILIB' mentioned 8 times
'LEAST' with 60 mentions matched with 'LAST' mentioned 28 times
'LEAST' with 88 mentions matched with 'EAST' mentioned 12 times
'GENESPRING' with 60 mentions matched with 'GENESPRING GX' mentioned 29 times
'GENESPRING' with 89 mentions matched with 'GENESPRING GX11' mentioned 16 times
'MOL' with 60 mentions matched with 'JMOL' mentioned 49 times

'SOTA' with 52 mentions matched with 'STA' mentioned 22 times
'EDNA' with 52 mentions matched with 'DNA' mentioned 30 times
'EDNA' with 82 mentions matched with 'ENA' mentioned 27 times
'ACE2' with 52 mentions matched with 'ACE' mentioned 43 times
'ACE2' with 95 mentions matched with 'HACE2' mentioned 10 times
'DESMOND' with 52 mentions matched with 'MYDESMOND' mentioned 8 times
'REGENN' with 52 mentions matched with 'REGNN' mentioned 29 times
'COACH' with 52 mentions matched with 'PE COACH' mentioned 9 times
'COACH' with 61 mentions matched with 'PTSD COACH' mentioned 8 times
'SSA' with 52 mentions matched with 'CSSA' mentioned 12 times
'AMBERTOOLS' with 52 mentions matched with 'AMBERTOOLS18' mentioned 15 times
'AMBERTOOLS' with 67 mentions matched with 'AMBERTOOLS19' mentioned 9 times
'GRAPHPAD INSTAT' with 51 mentions matched with 'INSTAT' mentioned 38 times
'EXETERA' with 51 mentions matched with 'NEXTERA' mentioned 25 times
'GOOGLE EARTH ENGINE' with 51 mentions matched with 'GOO

'NIH IMAGE' with 126 mentions matched with 'NIH IMAGE J' mentioned 23 times
'FPOCKET' with 45 mentions matched with 'POCKETS' mentioned 16 times
'FPOCKET' with 61 mentions matched with 'POCKET' mentioned 12 times
'ASTRA' with 45 mentions matched with 'ASTREA' mentioned 42 times
'ASTRA' with 87 mentions matched with 'ASTRAL' mentioned 15 times
'SUMO' with 45 mentions matched with 'SUM' mentioned 9 times
'CHESS' with 45 mentions matched with 'CESS' mentioned 10 times
'NCSS' with 45 mentions matched with 'CSS' mentioned 30 times
'SHAP' with 45 mentions matched with 'SHAPE' mentioned 42 times
'SHAP' with 87 mentions matched with 'SHARP' mentioned 17 times
'SHAP' with 104 mentions matched with 'SHP' mentioned 9 times
'MCC' with 45 mentions matched with 'MCCE' mentioned 10 times
'VERA' with 45 mentions matched with 'ERA' mentioned 18 times
'VERA' with 63 mentions matched with 'VER' mentioned 8 times
'NODEXL' with 44 mentions matched with 'NODEXL PRO' mentioned 12 times
'PAR' with 44 mentions

'PROCAT' with 40 mentions matched with 'PROCT' mentioned 9 times
'CC' with 40 mentions matched with 'ADOBE ILLUSTRATOR CC' mentioned 14 times
'CC' with 54 mentions matched with 'ADOBE PHOTOSHOP CC' mentioned 9 times
'CC' with 63 mentions matched with 'PHOTOSHOP CC' mentioned 8 times
'ACAD' with 40 mentions matched with 'ACD' mentioned 23 times
'ACAD' with 63 mentions matched with 'CAD' mentioned 20 times
'ACAD' with 83 mentions matched with 'ACA' mentioned 15 times
'ACAD' with 98 mentions matched with 'ARCAD' mentioned 10 times
'CALC' with 40 mentions matched with 'CAL' mentioned 17 times
'CALC' with 57 mentions matched with 'MED CALC' mentioned 10 times
'HEX' with 40 mentions matched with 'SHEX' mentioned 8 times
'CPACHECKER' with 40 mentions matched with 'ACHECKER' mentioned 8 times
'OCTET' with 40 mentions matched with 'OCTET DATA ANALYSIS' mentioned 11 times
'SCREENED' with 40 mentions matched with 'SCREEN' mentioned 19 times
'VAE' with 40 mentions matched with 'CVAE' mentioned 16 

'DEEPMIND' with 47 mentions matched with 'DEEPMINE' mentioned 9 times
'CMA' with 36 mentions matched with 'CMDA' mentioned 14 times
'CMA' with 50 mentions matched with 'CMAQ' mentioned 10 times
'CMA' with 60 mentions matched with 'MCMA' mentioned 9 times
'METAXL' with 36 mentions matched with 'METAL' mentioned 18 times
'GT' with 36 mentions matched with 'GOOGLE TRENDS GT' mentioned 18 times
'GT' with 54 mentions matched with 'GOOGLE TRENDS GT' mentioned 17 times
'ALARA' with 36 mentions matched with 'AARA' mentioned 17 times
'MINER' with 36 mentions matched with 'MINE' mentioned 24 times
'MINER' with 60 mentions matched with 'MYSTERY MINER' mentioned 14 times
'MINER' with 74 mentions matched with 'AMINER' mentioned 10 times
'MINER' with 84 mentions matched with 'QDA MINER' mentioned 9 times
'CTFFIND4' with 36 mentions matched with 'CTFFIND' mentioned 11 times
'GSVA' with 36 mentions matched with 'SVA' mentioned 23 times
'MAXBIN' with 35 mentions matched with 'MAXBIN2' mentioned 8 times

'FREESTYLE LIBRE' with 33 mentions matched with 'FREESTYLE' mentioned 12 times
'REDZONE COLLECTOR' with 33 mentions matched with 'REDZONE' mentioned 14 times
'SURVIVAL' with 33 mentions matched with 'R PACKAGE SURVIVAL' mentioned 8 times
'REDIS' with 33 mentions matched with 'REDS' mentioned 9 times
'GENETYX' with 33 mentions matched with 'GENETIX' mentioned 11 times
'BASIC' with 33 mentions matched with 'BASIC LOCAL ALIGNMENT SEARCH TOOL' mentioned 31 times
'BASIC' with 64 mentions matched with 'BLAST BASIC LOCAL ALIGNMENT SEARCH TOOL' mentioned 24 times
'BASIC' with 88 mentions matched with 'VISUAL BASIC' mentioned 24 times
'BASIC' with 112 mentions matched with 'BASIC LOCAL ALIGNMENT SEARCH TOOL BLAST' mentioned 21 times
'BASIC' with 133 mentions matched with 'BLAST BASIC LOCAL ALIGNMENT SEARCH TOOL' mentioned 8 times
'BBMAP' with 33 mentions matched with 'BBAP' mentioned 19 times
'BBMAP' with 52 mentions matched with 'BBMP' mentioned 13 times
'SPARX' with 32 mentions matched with '

'VIRTUOSO' with 29 mentions matched with 'VIRTUS' mentioned 14 times
'DUET' with 29 mentions matched with 'DET' mentioned 26 times
'GRADEPRO' with 29 mentions matched with 'GRADEPRO GUIDELINE DEVELOPMENT TOOL' mentioned 8 times
'CONNECT' with 29 mentions matched with 'CONCNET' mentioned 27 times
'CONNECT' with 56 mentions matched with 'ADOBE CONNECT' mentioned 23 times
'CONNECT' with 79 mentions matched with 'CONVNET' mentioned 16 times
'CONNECT' with 95 mentions matched with 'MTCONNECT' mentioned 13 times
'GPR' with 29 mentions matched with 'GDPR' mentioned 21 times
'GPR' with 50 mentions matched with 'GPAR' mentioned 20 times
'BEAGLE' with 29 mentions matched with 'EAGLE' mentioned 17 times
'BAIDU' with 29 mentions matched with 'BAIDU MAP' mentioned 12 times
'BAIDU' with 41 mentions matched with 'BAIDU MAPS' mentioned 8 times
'DIAL' with 29 mentions matched with 'DIALS' mentioned 20 times
'DIAL' with 49 mentions matched with 'DAL' mentioned 8 times
'CORODET' with 29 mentions matched 

'CEEMDAN' with 26 mentions matched with 'ICEEMDAN' mentioned 10 times
'ORBIT' with 26 mentions matched with 'KORBIT' mentioned 23 times
'RBM' with 26 mentions matched with 'RRBM' mentioned 8 times
'EISS' with 26 mentions matched with 'ZEISS' mentioned 21 times
'KOMPACT' with 26 mentions matched with 'COMPACT' mentioned 25 times
'MINDS' with 26 mentions matched with 'MIDS' mentioned 19 times
'MINDS' with 45 mentions matched with 'MIND' mentioned 10 times
'RISK' with 26 mentions matched with 'COCHRANE RISK OF BIAS TOOL' mentioned 22 times
'RISK' with 48 mentions matched with 'ERISK' mentioned 21 times
'RISK' with 69 mentions matched with 'COCHRANE RISK OF BIAS' mentioned 17 times
'RISK' with 86 mentions matched with 'COCHRANE RISK' mentioned 9 times
'OGA' with 26 mentions matched with 'MOGA' mentioned 25 times
'UPARSE' with 26 mentions matched with 'PARSE' mentioned 11 times
'UPARSE' with 37 mentions matched with 'CUSPARSE' mentioned 11 times
'OSM' with 25 mentions matched with 'OSEM' me

'REX' with 23 mentions matched with 'RTEX' mentioned 11 times
'INCUCYTE' with 23 mentions matched with 'INCYTE' mentioned 15 times
'RTC' with 23 mentions matched with 'RTCA' mentioned 9 times
'ERGO' with 23 mentions matched with 'EGO' mentioned 8 times
'MPSS' with 23 mentions matched with 'OMPSS' mentioned 10 times
'PRESS' with 23 mentions matched with 'XPRESS' mentioned 12 times
'EMOJIS' with 23 mentions matched with 'EMOJI' mentioned 19 times
'GLYCAM' with 23 mentions matched with 'GLYCAM06' mentioned 9 times
'SHELXD' with 23 mentions matched with 'SHELX' mentioned 19 times
'CIRSEQ' with 23 mentions matched with 'CIRSE' mentioned 15 times
'TREEMAP' with 23 mentions matched with 'STREETMAP' mentioned 9 times
'DYNAMO' with 23 mentions matched with 'DYNAMODB' mentioned 15 times
'IMAGESCOPE' with 23 mentions matched with 'APERIO IMAGESCOPE' mentioned 22 times
'IOCT' with 23 mentions matched with 'IOT' mentioned 14 times
'BOT' with 23 mentions matched with 'BOOT' mentioned 9 times
'RAVEL'

'BESI' with 20 mentions matched with 'ESI' mentioned 11 times
'IES' with 20 mentions matched with 'PIES' mentioned 11 times
'GEMF' with 20 mentions matched with 'GEM' mentioned 17 times
'EPICOLLECT5' with 20 mentions matched with 'EPICOLLECT' mentioned 15 times
'PSYCHINFO' with 20 mentions matched with 'PSYCINFO' mentioned 10 times
'IDSR' with 20 mentions matched with 'IDS' mentioned 14 times
'MCP' with 20 mentions matched with 'MCNP' mentioned 9 times
'COUNT' with 20 mentions matched with 'LINGUISTIC INQUIRY AND WORD COUNT LIWC' mentioned 18 times
'COUNT' with 38 mentions matched with '- COUNT' mentioned 8 times
'ESMDA' with 20 mentions matched with 'ESDA' mentioned 9 times
'SMC' with 20 mentions matched with 'MSMC' mentioned 13 times
'SMC' with 33 mentions matched with 'STMC' mentioned 11 times
'EPITOOLS' with 20 mentions matched with 'TEPITOOL' mentioned 17 times
'CHROMASPRO' with 20 mentions matched with 'CROMASPRO' mentioned 15 times
'OPC' with 20 mentions matched with 'OPCE' ment

'BETA' with 17 mentions matched with 'ETA' mentioned 8 times
'TRI' with 17 mentions matched with 'TRIM' mentioned 17 times
'SELECT' with 17 mentions matched with 'SELECTON' mentioned 9 times
'CORINA' with 17 mentions matched with 'CORIA' mentioned 13 times
'SILAC' with 17 mentions matched with 'SLAC' mentioned 11 times
'TOME' with 17 mentions matched with 'TOE' mentioned 10 times
'NOSOI' with 17 mentions matched with 'NOSO' mentioned 10 times
'CLUSTALX2' with 17 mentions matched with 'CLUSTALX1' mentioned 11 times
'ADVANCE' with 17 mentions matched with 'VECTOR NTI ADVANCE' mentioned 14 times
'RADTRANSLATE' with 17 mentions matched with 'TRANSLATE' mentioned 16 times
'TAP' with 17 mentions matched with 'GTAP' mentioned 12 times
'TAP' with 29 mentions matched with 'STAP' mentioned 10 times
'QSEE' with 17 mentions matched with 'SEE' mentioned 11 times
'HAPLOTYPECALLER' with 17 mentions matched with 'GATK HAPLOTYPECALLER' mentioned 9 times
'PORTER' with 17 mentions matched with 'REPORTER'

'TER' with 15 mentions matched with 'TERM' mentioned 8 times
'FCOVID' with 15 mentions matched with '- COVID' mentioned 10 times
'GLFORTHEL' with 15 mentions matched with 'FORTHEL' mentioned 9 times
'DIST' with 15 mentions matched with 'DIS' mentioned 14 times
'DIST' with 29 mentions matched with 'DISTS' mentioned 12 times
'PHYLODYNAMIC' with 14 mentions matched with 'PHYLODYNAMICS' mentioned 12 times
'PEACE' with 14 mentions matched with 'PACE' mentioned 13 times
'LOLAS' with 14 mentions matched with 'LOLA' mentioned 10 times
'SSPS' with 14 mentions matched with 'SSP' mentioned 10 times
'VEGAS' with 14 mentions matched with 'VGAS' mentioned 10 times
'ACTILIFE' with 14 mentions matched with 'ACTLIFE' mentioned 10 times
'DATAWARRIOR' with 14 mentions matched with 'OSIRIS DATAWARRIOR' mentioned 9 times
'SIG' with 14 mentions matched with 'SIGA' mentioned 11 times
'STATA16' with 14 mentions matched with 'STATA14' mentioned 12 times
'STATA16' with 26 mentions matched with 'STATA15' mention

'SEQUENCE DETECTOR' with 12 mentions matched with 'SEQUENCE DETECTION' mentioned 8 times
'INTERNET' with 12 mentions matched with 'INTERVET' mentioned 9 times
'INTERNET' with 21 mentions matched with 'INTERNET ARCHIVE' mentioned 8 times
'GOA' with 12 mentions matched with 'GOAP' mentioned 10 times
'GOA' with 22 mentions matched with 'GOEA' mentioned 9 times
'EVM' with 12 mentions matched with 'KEVM' mentioned 9 times
'MRT' with 12 mentions matched with 'IMRT' mentioned 9 times
'NAIVE' with 12 mentions matched with 'NATIVE' mentioned 11 times
'SIMVA' with 12 mentions matched with 'SIMV' mentioned 9 times
'IFCN' with 12 mentions matched with 'FCN' mentioned 9 times
'SCIMAT' with 12 mentions matched with 'SCIMATH' mentioned 10 times
'PENN' with 12 mentions matched with 'PEN' mentioned 8 times
'IMR' with 12 mentions matched with 'SIMR' mentioned 9 times
'IMR' with 21 mentions matched with 'IMRT' mentioned 9 times
'SPLUS' with 12 mentions matched with 'SPLS' mentioned 10 times
'IDSA' with 1

The blacklist contains the names that were matched as duplicates of another entry, so the corresponding rows must be removed from the DataFrame.

In [27]:
# Drop every row whose software name was merged into another entry.
duplicate_rows = df_unified[df_unified['Software'].isin(blacklist)].index
df_unified.drop(duplicate_rows, inplace=True)
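The merge-and-remove step can be illustrated on toy data. The following is only a minimal sketch: it assumes, as the running totals in the log above suggest, that the mentions of each matched duplicate are added to the kept name before the duplicate is blacklisted. The names `df_toy`, `matched_pairs` and `toy_blacklist` are hypothetical and not part of the notebook; the counts for SHAP and its near-duplicates are taken from the log.

```python
import pandas as pd

# Toy counts: SHAP and its near-duplicates come from the matching log,
# R is an unrelated entry that must survive untouched.
df_toy = pd.DataFrame({
    "Software": ["SHAP", "SHAPE", "SHARP", "SHP", "R"],
    "Matches": [45, 42, 17, 9, 13187],
})

# Hypothetical (kept name, duplicate name) pairs; in the notebook these
# are produced by the fuzzy-matching step, which is not reproduced here.
matched_pairs = [("SHAP", "SHAPE"), ("SHAP", "SHARP"), ("SHAP", "SHP")]

counts = df_toy.set_index("Software")["Matches"].copy()
toy_blacklist = []
for kept, duplicate in matched_pairs:
    # Add the duplicate's mentions to the kept name (45 -> 87 -> 104 -> 113).
    counts[kept] += counts[duplicate]
    toy_blacklist.append(duplicate)

# Write the accumulated counts back, then drop the blacklisted rows.
df_toy["Matches"] = df_toy["Software"].map(counts)
df_toy = df_toy[~df_toy["Software"].isin(toy_blacklist)]

print(df_toy)
```

After this step only SHAP (with the summed count) and R remain, and no blacklisted name is left in the `Software` column.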

For comparison purposes, the DataFrame is sorted in descending order by matches. 

In [28]:
df_unified.sort_values(by=['Matches'], inplace=True, ascending=False)
df_unified

Unnamed: 0,Software,Matches
0,R,13187
1,SPSS,11322
2,GRAPHPAD PRISM,8507
4,BLAST,6719
3,EXCEL,4319
...,...,...
9571,GROWTREE,8
9570,KIT,8
9569,POLYJET,8
9567,ARCHS4,8


To verify the removal of duplicates, the length of the DataFrame is printed.

In [29]:
len(df_unified)

7610

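To investigate how the position of each software mention changed, the following loop compares its original index position with its position after sorting by matches. A positive value means the entry moved up in the ranking; a negative value means it moved down.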

In [30]:
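list_change = []
for i in range(len(df_unified)):
    # Original index label minus the current position after sorting.
    dif = df_unified.index[i] - i
    if dif > 0:
        # Prefix upward moves with an explicit plus sign.
        list_change.append("+" + str(dif))
    else:
        # Store as a string as well so the column has a single type.
        list_change.append(str(dif))
df_unified['Change'] = list_change
df_unified.head(50)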

Unnamed: 0,Software,Matches,Change
0,R,13187,0
1,SPSS,11322,0
2,GRAPHPAD PRISM,8507,0
4,BLAST,6719,1
3,EXCEL,4319,-1
5,STATA,4048,0
10,MEGA,3428,4
6,SAS,3399,-1
12,IMAGEJ,2779,4
7,MATLAB,2710,-2
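The same comparison can be written without an explicit loop over positions. This is a minimal sketch on toy data that reuses the first five rows of the output above; the name `toy` is hypothetical, and the sketch assumes that the index labels still hold the ranking from before the merge.

```python
import numpy as np
import pandas as pd

# First five rows of the sorted output, with their original index labels.
toy = pd.DataFrame(
    {
        "Software": ["R", "SPSS", "GRAPHPAD PRISM", "BLAST", "EXCEL"],
        "Matches": [13187, 11322, 8507, 6719, 4319],
    },
    index=[0, 1, 2, 4, 3],
)

# Original index label minus current position, for all rows at once.
dif = toy.index.to_numpy() - np.arange(len(toy))

# Format as strings; upward moves get an explicit plus sign.
toy["Change"] = [f"+{d}" if d > 0 else str(d) for d in dif]

print(toy)
```

BLAST carries index label 4 but sits at position 3, so it moved up by one place, while EXCEL moved down by one.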


Finally, the result of this notebook is assigned to a new DataFrame and saved to an external pickle file so that the classification notebook can load it.

In [31]:
df_software_mentions = df_unified
df_software_mentions.to_pickle('software_mentions_CS5099.pkl')
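How a later notebook might load the stored result can be sketched as a pickle round trip. This is an illustrative example only: it writes a toy DataFrame to a temporary directory instead of the real file, and the file name `toy_software_mentions.pkl` is hypothetical.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"Software": ["R", "SPSS"], "Matches": [13187, 11322]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_software_mentions.pkl")
    # Store the DataFrame, as done for df_software_mentions above.
    toy.to_pickle(path)
    # Load it again, as a downstream notebook would.
    restored = pd.read_pickle(path)

print(restored.equals(toy))
```

Pickling preserves the index labels and column types, so the sorted order is available again after loading. Note that a plain assignment such as `df_software_mentions = df_unified` only creates a second name for the same object; `df_unified.copy()` would be needed for an independent copy.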