# CORD-19 Software classification

This Jupyter notebook classifies software mentions extracted from the CORD-19 dataset: https://datadryad.org/stash/dataset/doi:10.5061/dryad.vmcvdncs0

First, relevant packages must be imported to the Notebook.

In [1]:
import numpy as np
import pandas as pd
import csv
import ast
import collections
import matplotlib.pyplot as plt
import Levenshtein as lev
from fuzzywuzzy import fuzz 
import json

The DataFrame "df_software_mentions" produced by the notebook "CORD-19-software-counting-cs5099.ipynb" is the input for the classification. It is loaded here from the file "software_mentions_CS5099.pkl".

In [2]:
df_software_mentions = pd.read_pickle('software_mentions_CS5099.pkl')
df_software_mentions

Unnamed: 0,Software,Matches,Change
0,R,11994,0
1,SPSS,10930,0
4,BLAST,6448,+2
3,EXCEL,4209,0
5,STATA,3688,+1
...,...,...,...
994,COVNET,69,+106
995,CAPS,69,+106
996,DR,69,+106
997,WECHAT,69,+106


Drop the Change column, reset the index, and create a Classification column that starts out as "Unclassified".

In [3]:
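#df_software = df_software_mentions.drop(columns='Matches')
# Pass the axis by keyword: the positional form drop('Change', 1) is rejected by pandas 2.x.
df_software = df_software_mentions.drop(columns='Change')
# drop=True discards the old index instead of adding it as an 'index' column.
df_software = df_software.reset_index(drop=True)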
df_software['Classification'] = "Unclassified"
df_software

Unnamed: 0,Software,Matches,Classification
0,R,11994,Unclassified
1,SPSS,10930,Unclassified
2,BLAST,6448,Unclassified
3,EXCEL,4209,Unclassified
4,STATA,3688,Unclassified
...,...,...,...
888,COVNET,69,Unclassified
889,CAPS,69,Unclassified
890,DR,69,Unclassified
891,WECHAT,69,Unclassified


In [4]:
# result = df_software.to_json(orient='records')
# parsed = json.loads(result)
# software_json = json.dumps(parsed, indent=4) 
# print(software_json)

In [5]:
# df_read_json = pd.read_json(software_json)
# print(df_read_json.to_string()) 

In [6]:
# df_json_classifier = pd.read_json('software_classification_CS5099.json')
# df_json_classifier

In [7]:
Categories_CSV = pd.read_csv('software_categories_CS5099.csv')
Categories_CSV

Unnamed: 0,Statistics,Bioinformatics,Communication,BibliographyServices,OperatingSystems,ProgrammingLanguage,Uncertain
0,R,BLAST,REDCAP,GOOGLE SCHOLAR,IOS,MATLAB,EXCEL
1,SPSS,PYMOL,SKYPE,SCOPUS,LINUX,NET,MEGA
2,STATA,CHIMERA,QUALTRICS,GISAID,WINDOWS,PYTHON,MUSCLE
3,SAS,FLOWJO,GITHUB,GOOGLE TRENDS,MS,BERT,SWISS
4,NVIVO,ENSEMBL,REDDIT,XGBOOST,MOE,TENSORFLOW,PHENIX
5,SEURAT,BEAST,FACETIME,FASTTEXT,ROSETTA,SIMPLOT,ONE
6,MEDCALC,MAFFT,SURVEYMONKEY,CHEMBL,,SCIKIT,MODEL
7,GRAPHPAD PRISM,CYTOSCAPE,,,,WORD2VEC,LEARN
8,GGPLOT2,GROMACS,,,,SPARQL,CT
9,STATISTICAL PACKAGE FOR THE SOCIAL SCIENCES,GENEIOUS,,,,OPENMP,ARCGIS


In [8]:
def get_category(mention):
    """
    Return the category of a software mention, or None when no category is found.
    The lookup adapts to whatever columns Categories_CSV contains.
    """
    for column in Categories_CSV.columns:
        # A column matches when any of its entries equals the mention.
        if (Categories_CSV[column] == mention).any():
            return column
    return None
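
The column-by-column scan above can also be expressed as a dictionary lookup. The following is only a sketch on toy data, not the notebook's method: `categories`, `category_of` and `lookup_category` are hypothetical names, and the table holds just a few values copied from the first rows of the categories file.

```python
import pandas as pd

# Toy stand-in for the categories table (assumption: a small excerpt only).
categories = pd.DataFrame({
    "Statistics": ["R", "SPSS", "STATA"],
    "Bioinformatics": ["BLAST", "PYMOL", "CHIMERA"],
    "OperatingSystems": ["IOS", "LINUX", None],
})

# Wide -> long: one row per (category, software) pair; drop the empty padding cells.
long_form = categories.melt(var_name="Category", value_name="Software").dropna()

# Reverse lookup: software name -> category.
category_of = dict(zip(long_form["Software"], long_form["Category"]))

def lookup_category(mention):
    """Hypothetical helper: return the category of a mention, or None if unknown."""
    return category_of.get(mention)

print(lookup_category("BLAST"))
print(lookup_category("UNKNOWN TOOL"))
```

One difference to keep in mind: if a name were listed in more than one column, the dictionary would keep the last column it appears in, whereas `get_category` returns the first.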

In [9]:
%%time
for i, row in df_software.iterrows():
    # .at writes straight into df_software and avoids chained assignment.
    df_software.at[i, 'Classification'] = get_category(row.Software)
df_software.head(5)


Wall time: 5.41 s


Unnamed: 0,Software,Matches,Classification
0,R,11994,Statistics
1,SPSS,10930,Statistics
2,BLAST,6448,Bioinformatics
3,EXCEL,4209,Uncertain
4,STATA,3688,Statistics
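
Looping over rows gets slow as the frame grows. As a hedged alternative, the whole column could be classified in one step with `Series.map`. The sketch below uses toy data; the dictionary `category_of` and the frame `toy` are assumptions made for illustration and are not part of the notebook.

```python
import pandas as pd

# Hand-written lookup for illustration only (assumption, not the real category file).
category_of = {"R": "Statistics", "SPSS": "Statistics", "BLAST": "Bioinformatics"}

toy = pd.DataFrame({
    "Software": ["R", "BLAST", "WECHAT"],
    "Matches": [11994, 6448, 69],
})

# Series.map looks up every value at once; names missing from the dict become NaN.
toy["Classification"] = toy["Software"].map(category_of)
print(toy)
```

Mentions without a known category end up with a missing value, which matches the behaviour of a lookup that returns nothing for unknown names.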


In [15]:
# Sum the matches per category; rows without a category are left out.
df_total_matches = df_software.groupby('Classification')[['Matches']].sum()
df_total_matches.sort_values(by="Matches", ascending=False)

Unnamed: 0,Matches
Statistics,32578
Bioinformatics,24987
Uncertain,17838
ProgrammingLanguage,10594
BibliographyServices,8473
Communication,4473
OperatingSystems,3027
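
Mentions that received no category have a missing Classification and therefore do not appear in the totals above. A minimal sketch on toy data of how they could be counted as well, assuming a placeholder label such as "Unclassified" is acceptable (the frame `toy` and both result names are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "Software": ["R", "SPSS", "BLAST", "WECHAT"],
    "Matches": [11994, 10930, 6448, 69],
    "Classification": ["Statistics", "Statistics", "Bioinformatics", None],
})

# Default behaviour: rows with a missing Classification are excluded from the totals.
totals = toy.groupby("Classification")["Matches"].sum()

# Give missing categories an explicit label so their matches are summed too.
labelled = toy.fillna({"Classification": "Unclassified"})
totals_all = labelled.groupby("Classification")["Matches"].sum()

print(totals.sort_values(ascending=False))
print(totals_all.sort_values(ascending=False))
```

Comparing the two results shows how many matches fall outside the defined categories.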
