# **Pattern Matching Colab**

This Colab notebook categorises a set of scientific papers by subject and by study design. It supports 35 subject categories and 3 main study designs.

**Note**: Please name your data file *input_data.csv*. The *title* column must be named 'Title' or 'title'; the *abstract* column, if present, must be named 'Abstract' or 'abstract'. Upload the file using the upload button at the top left of the left sidebar. The results will appear in a folder named *RESULTS*, which the code creates automatically.
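Before uploading, it can help to confirm that the file has the expected header. The sketch below is not part of the notebook: `check_input_header` is a hypothetical helper that reads only the first row with Python's standard `csv` module and reports whether a title column (required) and an abstract column (optional) are present.

```python
import csv

def check_input_header(path):
    """Return (has_title, has_abstract) for the CSV file at `path`.

    Hypothetical helper, not part of the SciLiteratureProcessing repo.
    """
    # utf-8-sig drops a leading byte-order mark; errors="replace" avoids
    # failing on files saved in another encoding such as ISO-8859-1.
    with open(path, newline="", encoding="utf-8-sig", errors="replace") as handle:
        header = next(csv.reader(handle), [])
    names = {name.strip() for name in header}
    has_title = bool(names & {"Title", "title"})
    has_abstract = bool(names & {"Abstract", "abstract"})
    return has_title, has_abstract
```

Calling `check_input_header("input_data.csv")` on a file whose header row is `Title,Abstract` returns `(True, True)`; a first value of `False` means the title column is missing or misnamed.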

In [None]:
#@title Clone the GitHub repo { form-width: "20%" }

!git clone https://github.com/nice-digital/SciLiteratureProcessing

In [None]:
#@title Install Python packages { form-width: "20%" }

#@markdown Please execute this cell by pressing the _Play_ button 
#@markdown on the left to download and import third-party software 
#@markdown in this Colab notebook. 

#@markdown This installs the software on the Colab 
#@markdown notebook in the cloud and not on your computer.
import subprocess  # needed for the CalledProcessError handler below
from IPython.utils import io
try:
  # Suppress pip output unless an install fails
  with io.capture_output() as captured:
    %shell pip install scispacy
    %shell pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_md-0.4.0.tar.gz
    %shell pip install import-ipynb
    %shell pip install pandas
except subprocess.CalledProcessError:
  print(captured)
  raise

import os
import numpy as np
import spacy
import scispacy
import pandas as pd
from spacy.matcher import Matcher

from pathlib import Path
import logging

# Load the scispaCy model installed above
nlp = spacy.load("en_core_sci_md")
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.INFO)

In [None]:
#@title Function definitions { form-width: "20%" }

#@markdown Please execute this cell by pressing the _Play_ button 
#@markdown on the left 

#@markdown This defines the functions used for pre-processing and for identifying study designs and categories.

%cd /content/SciLiteratureProcessing/code/

# Function definitions to pre-process title and abstract
%run -i "text_preprocess.py"

# Function definitions for pattern matching
%run -i "text_patternmatch.py"

# Function that categorizes the data into 35 Covid topics
%run -i "categorise_covid_topics.py"

%cd /content/

In [None]:
#@title File settings to get started { form-width: "20%" }

#@markdown Please ensure that input_data.csv has been uploaded, then execute this cell by pressing the _Play_ button 
#@markdown on the left 
input_filename = 'input_data.csv'
DATA_PATH = input_filename 

results_folder = 'RESULTS'
RESULTS_FOLDER = results_folder     # output folder; the R and zip cells below assume 'RESULTS'
os.makedirs(RESULTS_FOLDER, exist_ok=True)  # create the folder if it does not already exist
RESULTS_PATH = Path(RESULTS_FOLDER)


In [None]:
#@title Pre-process input data { form-width: "20%" }

#@markdown Please execute this cell by pressing the _Play_ button 
#@markdown on the left 
try:
    lit_data = pd.read_csv(DATA_PATH)
except Exception as e:
    print(e)
    try:
      lit_data = pd.read_csv(DATA_PATH, encoding = "ISO-8859-1")
    except Exception as e:
      print("Unable to read the input file! Have you uploaded input_data.csv?")
      raise

lit_data.rename(columns = {'Title':'title', 'Abstract': 'abstract'}, inplace = True)

try:
  lit_data = preproc_title(lit_data) 
except Exception as e:
  print(e)
  print("Error- No title detected! Title is needed for pattern matching!")
  raise

lit_data.drop_duplicates(subset=['title'], inplace=True)

try:  
  lit_data = preproc_abstract(lit_data)
except Exception as e:
  print(e)
  print("No abstracts! Proceeding with title only pattern match")
    
lit_data['analyst_valid'] = 0
lit_data['analyst_comments'] = " "
lit_data.to_csv('data.csv', index=False)  # omit the index so it is not read back as an extra column

In [None]:
#@title Execute Pattern Matching { form-width: "20%" }

#@markdown **Note:** By default, the code screens for all 35 categories. To screen only for specific categories, do as follows.

#@markdown Click 'show code', edit 'topic_list' so that it contains only the categories you need, and then press the _Play_ button to execute this cell.

#@markdown Please note that the upper-case spelling of each category name, and the comma after every name except the last, are required for the code to execute correctly.

#@markdown **Note**: This block of code will take some time, depending on how many records the data file contains and how many categories are selected. 
#@markdown For 1000 records and 39 categories it takes around 20 minutes.

import time
data = "data.csv"
lit_data = pd.read_csv(data)
start_time = time.time()
topic_list = ['OBSERVATIONAL', 'RELEVANT',
                'REMDESIVIR', 'TOCILIZUMAB', 'SARILUMAB', 'IVERMECTIN',
                'ASPIRIN', 'BUDESONIDE', 'CORTICOSTEROIDS', 'ANTIBIOTICS',
                'COLCHICINE', 'AZITHROMYCIN','LONG COVID', 'NAb', 'SACT', 'BONE MARROW',
                'ASTHMA', 'RHEUMATOLOGY', 'COPD', 'CYSTIC FIBROSIS', 'PREGNANCY',
                'MYOCARDIAL', 'GASTRO', 'KIDNEY', 'DERMATOLOGY', 'VIT C', 'VIT D',
                'VIIT', 'VTE','RESPIRATORY', 'PLANNED CARE', 'INTERSTITIAL LUNG', 'CO_INFECTION',
                'MANAGEMENT', 'ASSESSMENT','CYP IMMUNOSUPPRESSION', 'THERAPEUTICS']

lit_data = categorise_topics_covid(lit_data, topic_list)

end_time = time.time()
elapsed_time = end_time - start_time
print("Total time taken (seconds):", round(elapsed_time, 2))
lit_data.to_csv(RESULTS_PATH / "processed.csv", index=False) 
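To screen a subset of categories, `topic_list` can be shortened before `categorise_topics_covid` is called. The sketch below is an illustration only: `DEFAULT_TOPICS` is an abbreviated copy of the list above, and `check_topics` is a hypothetical helper (not part of the repo) that flags names whose spelling does not match the default list exactly.

```python
# Abbreviated copy of the default list, for illustration only.
DEFAULT_TOPICS = ('OBSERVATIONAL', 'RELEVANT', 'REMDESIVIR', 'TOCILIZUMAB',
                  'LONG COVID', 'NAb', 'VIT C', 'VIT D', 'THERAPEUTICS')

def check_topics(selected, supported=DEFAULT_TOPICS):
    """Return the entries of `selected` that are not in `supported`.

    Hypothetical helper; matching is exact, including case and spaces.
    """
    return [name for name in selected if name not in supported]

# A custom selection with one misspelt entry ('vit d' instead of 'VIT D').
topic_list = ['REMDESIVIR', 'LONG COVID', 'vit d']
unknown = check_topics(topic_list)
if unknown:
    print("Unrecognised categories:", unknown)
```

Running such a check before the pattern match is a cheap way to catch a mistyped name before a long screening run starts.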

***Note***: Run the cells below only if all the default categories were screened. If a custom set of categories was screened, please download the processed.csv file from the RESULTS folder and stop here. The cells below also create graphs of the distribution of the various categories in the dataset.

In [None]:
#@title Calculate counts of each category { form-width: "20%" }

#@markdown Please execute this cell by pressing the _Play_ button 
#@markdown on the left 

def create_stats_df(in_data, studyCatStartCol, studyCatEndCol, fname1, fname2):
  """Write per-column totals for study designs (fname1) and study categories (fname2)."""
  # Study designs: columns from "RCT" through "subject relevant" (inclusive)
  start_col = in_data.columns.get_loc("RCT")
  end_col = in_data.columns.get_loc("subject relevant")
  in_data_subset = in_data.iloc[:,start_col: end_col+1] #subset the dataframe
  # Sum the positive entries in each column
  stats_data = {'StudyType': in_data_subset.columns.values, 'Count': in_data_subset[in_data_subset > 0].sum()} 
  stats_df = pd.DataFrame(stats_data)
  stats_df.to_csv(fname1)

  # Study categories: columns from studyCatStartCol through studyCatEndCol (inclusive)
  start_col = in_data.columns.get_loc(studyCatStartCol)
  end_col = in_data.columns.get_loc(studyCatEndCol)
  in_data_subset = in_data.iloc[:,start_col: end_col+1] #subset the dataframe
  stats_data = {'StudyCategory': in_data_subset.columns.values, 'Count': in_data_subset[in_data_subset > 0].sum()} 
  stats_df = pd.DataFrame(stats_data)
  stats_df.to_csv(fname2)

data = RESULTS_PATH / "processed.csv"
in_data = pd.read_csv(data)

create_stats_df(in_data,"remdesivir","therapeutics", RESULTS_PATH / "whole_stats_studydesign_relevant.csv",RESULTS_PATH / "whole_stats_studycategory_relevant.csv")
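The counts written by `create_stats_df` come from `in_data_subset[in_data_subset > 0].sum()`, which adds up the positive values in each column. The toy example below uses made-up data (only the `RCT` and `subject relevant` column names are taken from the cell above; `observational` is a placeholder). It shows that this sum equals the number of matching papers only if each column holds 0/1 flags, which this notebook does not confirm, whereas `(subset > 0).sum()` counts rows regardless of the stored values.

```python
import pandas as pd

# Made-up data: three papers, with one column holding a value above 1.
toy = pd.DataFrame({
    "title": ["paper a", "paper b", "paper c"],
    "RCT": [1, 0, 2],
    "observational": [0, 0, 1],
    "subject relevant": [1, 1, 0],
})

# Select the contiguous block of columns by name, as create_stats_df does.
start_col = toy.columns.get_loc("RCT")
end_col = toy.columns.get_loc("subject relevant")
subset = toy.iloc[:, start_col:end_col + 1]

summed = subset[subset > 0].sum()   # adds positive values: RCT gives 3
counted = (subset > 0).sum()        # counts positive rows: RCT gives 2

print(pd.DataFrame({"summed": summed, "counted": counted}))
```

If the category columns are strictly 0/1, the two columns of the printed table agree and either expression can be used.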

In [None]:
#@title Execute this to run R code in subsequent cells { form-width: "20%" }

#@markdown Please execute this cell by pressing the _Play_ button 
#@markdown on the left 

%load_ext rpy2.ipython

In [None]:
%%R
#@title Execute this to generate graphs { form-width: "20%" }

#@markdown Please execute this cell by pressing the _Play_ button 
#@markdown on the left 

# Note: the %%R cell magic must be the first line of the cell

library(ggplot2)
RESULTS_PATH <- "RESULTS/"
plot_study_design <- function(in_fname, out_fname, title){
  in_fname <- paste(RESULTS_PATH, in_fname, sep="")
  out_fname <- paste(RESULTS_PATH, out_fname, sep="")
  df2 <- read.csv(in_fname, stringsAsFactors = FALSE)
  p <- ggplot(data = df2) + 
                geom_bar(mapping = aes(x = reorder(StudyType, -Count), y = Count), stat = "identity", fill = "#05C3DE", width = 0.3) +
                labs(title = title) + 
                xlab("Themes") +
                ylab("Count") +
                theme(axis.text.x = element_text(angle = 90, hjust = 1))
  ggsave(out_fname, plot = p)
  print(p)
}

plot_study_category <- function(in_fname, out_fname, title){
  in_fname <- paste(RESULTS_PATH, in_fname, sep="")
  out_fname <- paste(RESULTS_PATH, out_fname, sep="")
  df2 <- read.csv(in_fname, stringsAsFactors = FALSE)
  p <- ggplot(data = df2) + 
                geom_bar(mapping = aes(x = reorder(StudyCategory, -Count), y = Count), stat = "identity", fill = "#05C3DE", width = 0.3) +
                labs(title = title) + 
                xlab("Themes") +
                ylab("Count") +
                theme(axis.text.x = element_text(angle = 90, hjust = 1))
  ggsave(out_fname, plot = p)
  print(p)
}

plot_study_design("whole_stats_studydesign_relevant.csv", "whole_studydesign.png", "Study Designs- Full")
plot_study_category("whole_stats_studycategory_relevant.csv", "whole_studycategory.png", "Study Category- Full")


In [None]:
#@title Execute this to download the full RESULTS folder as a zip file { form-width: "20%" }

#@markdown Executing this cell is optional. 

#@markdown If you execute it, a RESULTS.zip file will appear in the file pane on the left.

!zip -r /content/RESULTS.zip /content/RESULTS
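If the `zip` command is unavailable, for example when running outside Colab, a similar archive can be produced from Python. This is a minimal sketch using the standard library's `shutil.make_archive`; `zip_results` is a hypothetical helper, and the default folder and archive names are those used in this notebook.

```python
import shutil
from pathlib import Path

def zip_results(results_folder="RESULTS", archive_name="RESULTS"):
    """Zip `results_folder` into `<archive_name>.zip` and return the archive path.

    Hypothetical helper, not part of the SciLiteratureProcessing repo.
    """
    folder = Path(results_folder).resolve()
    return shutil.make_archive(
        archive_name,                   # output path without the .zip suffix
        "zip",
        root_dir=str(folder.parent),    # archive paths are relative to this directory
        base_dir=folder.name,           # only this folder is added to the archive
    )
```

Calling `zip_results()` from `/content` would write `RESULTS.zip` next to the RESULTS folder, with entries stored under `RESULTS/` rather than under the full `/content/RESULTS` path that the `zip` command above records.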