# CORD-19 Software classification

This Jupyter notebook classifies the software mentions extracted from the CORD-19 dataset, available from: https://datadryad.org/stash/dataset/doi:10.5061/dryad.vmcvdncs0

First, import the packages the notebook relies on.

In [1]:
import numpy as np
import pandas as pd
import csv
import ast
import collections
import matplotlib.pyplot as plt
import Levenshtein as lev
from fuzzywuzzy import fuzz 
import json

The classification starts from the data frame "df_software_mentions" produced by the notebook "CORD-19-software-counting-cs5099.ipynb". That notebook saved it to the file "software_mentions.pkl", which is read back here.

In [2]:
df_software_mentions = pd.read_pickle('software_mentions.pkl')
df_software_mentions

Unnamed: 0,Software,Matches,Change
0,R,10805,0
1,SPSS,9229,0
4,BLAST,6300,+2
2,GRAPHPAD PRISM,4617,-1
3,EXCEL,4054,-1
...,...,...,...
995,ANSYS FLUENT,69,+77
996,CUSTOMMUNE,69,+77
997,REALSTAR,69,+77
998,DPLYR,69,+77


Narrow the data frame to the "Software" column and create a column for the classification.

In [3]:
# Keep only the software names and renumber the rows from 0.
# drop(label, 1) with a positional axis is rejected by pandas 2.x, so use columns=.
df_software = df_software_mentions.drop(columns=['Matches', 'Change'])
df_software = df_software.reset_index(drop=True)
df_software['Classification'] = "Unclassified"
df_software

Unnamed: 0,Software,Classification
0,R,Unclassified
1,SPSS,Unclassified
2,BLAST,Unclassified
3,GRAPHPAD PRISM,Unclassified
4,EXCEL,Unclassified
...,...,...
918,ANSYS FLUENT,Unclassified
919,CUSTOMMUNE,Unclassified
920,REALSTAR,Unclassified
921,DPLYR,Unclassified


In [4]:
result = df_software.to_json(orient='records')
parsed = json.loads(result)
software_json = json.dumps(parsed, indent=4) 
print(software_json)

[
    {
        "Software": "R",
        "Classification": "Unclassified"
    },
    {
        "Software": "SPSS",
        "Classification": "Unclassified"
    },
    {
        "Software": "BLAST",
        "Classification": "Unclassified"
    },
    {
        "Software": "GRAPHPAD PRISM",
        "Classification": "Unclassified"
    },
    {
        "Software": "EXCEL",
        "Classification": "Unclassified"
    },
    {
        "Software": "STATA",
        "Classification": "Unclassified"
    },
    {
        "Software": "SAS",
        "Classification": "Unclassified"
    },
    {
        "Software": "MATLAB",
        "Classification": "Unclassified"
    },
    {
        "Software": "GOOGLE SCHOLAR",
        "Classification": "Unclassified"
    },
    {
        "Software": "GRAPHPAD",
        "Classification": "Unclassified"
    },
    {
        "Software": "MEGA",
        "Classification": "Unclassified"
    },
    {
        "Software": "NET",
        "Classification": "Unclassifie

In [5]:
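from io import StringIO  # passing a literal JSON string to read_json is deprecated since pandas 2.1
df_read_json = pd.read_json(StringIO(software_json))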
print(df_read_json.to_string()) 

                                    Software Classification
0                                          R   Unclassified
1                                       SPSS   Unclassified
2                                      BLAST   Unclassified
3                             GRAPHPAD PRISM   Unclassified
4                                      EXCEL   Unclassified
5                                      STATA   Unclassified
6                                        SAS   Unclassified
7                                     MATLAB   Unclassified
8                             GOOGLE SCHOLAR   Unclassified
9                                   GRAPHPAD   Unclassified
10                                      MEGA   Unclassified
11                                       NET   Unclassified
12                                     PRISM   Unclassified
13                                    SCOPUS   Unclassified
14                                    PYTHON   Unclassified
15                                    GI
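
The notebook does not show how the JSON printed above becomes a file on disk with edited labels. The following is a minimal sketch of one possible route, not the author's recorded procedure. It assumes the records are written to a file, the "Classification" values are edited, and the file is read back with `pd.read_json`. The three-row frame, the temporary directory and the loop that stands in for manual editing are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for df_software (the real frame has 922 rows).
df_software = pd.DataFrame({
    "Software": ["R", "SPSS", "BLAST"],
    "Classification": ["Unclassified"] * 3,
})

# Write the records to a temporary directory for this sketch; in the
# notebook the target would presumably be the working directory.
out_path = Path(tempfile.mkdtemp()) / "software_classification.json"
df_software.to_json(out_path, orient="records", indent=4)

# Stand-in for the manual step: overwrite every label in the file.
# "Category 1" is only a placeholder label here; its meaning is not stated.
records = json.loads(out_path.read_text())
for record in records:
    record["Classification"] = "Category 1"
out_path.write_text(json.dumps(records, indent=4))

# Read the edited file back into a data frame.
df_roundtrip = pd.read_json(out_path)
print(df_roundtrip)
```

Writing with `orient='records'` keeps one object per software name, which is easy to edit by hand and round-trips through `pd.read_json` without extra arguments.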

In [6]:
df_json_classifier = pd.read_json('software_classification.json')
df_json_classifier

Unnamed: 0,Software,Classification
0,R,Category 1
1,SPSS,Category 1
2,BLAST,Category 1
3,GRAPHPAD PRISM,Category 1
4,EXCEL,Category 1
...,...,...
890,DR,Category 1
891,GRAPHPPI,Category 1
892,POLYMAKE,Category 1
893,- PAD PRISM,Category 1
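
The classified file has fewer rows (894) than "df_software" (922), and it contains at least one name, "- PAD PRISM", that looks altered. A quick consistency check could list the names that appear in only one of the two frames. The sketch below is an assumption about how such a check might look, using `DataFrame.merge` with `indicator=True` on small toy frames. It is not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins; the real frames come from 'software_mentions.pkl'
# and 'software_classification.json'.
df_software_mentions = pd.DataFrame({
    "Software": ["R", "SPSS", "BLAST", "DPLYR"],
    "Matches": [10805, 9229, 6300, 69],
})
df_json_classifier = pd.DataFrame({
    "Software": ["R", "SPSS", "BLAST", "- PAD PRISM"],
    "Classification": ["Category 1"] * 4,
})

# An outer merge keeps every name from both frames; the '_merge' column
# records whether a row came from the left frame, the right frame, or both.
merged = df_software_mentions.merge(
    df_json_classifier, on="Software", how="outer", indicator=True
)

only_in_mentions = merged.loc[merged["_merge"] == "left_only", "Software"].tolist()
only_in_classifier = merged.loc[merged["_merge"] == "right_only", "Software"].tolist()

print("Not yet classified:", only_in_mentions)
print("Classified but unknown:", only_in_classifier)
```

Rows marked `both` carry the mention count and the label side by side, so the same merged frame could later be used to total the matches per category.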
