# BatApp paper classifier
## Author: Łukasz Popek

### Description

A notebook for training and comparing classifiers that assess whether an article concerns the emergence of new types of viruses among bat species. The classification is based on text analysis of the abstracts of scientific articles. Two classifiers were trained: one based on logistic regression and one based on a support vector machine. The dataset is single-class in nature: the positive class consists of checked articles drawn from the Chinese DBatVir database (http://www.mgc.ac.cn/DBatVir/), while articles downloaded at random by keyword are implicitly treated as the negative class. Then, in an iterative process, the trained classifier was used to pull positive cases out of this assumed negative class.
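
The comparison described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the toy abstracts are invented, and the TF-IDF features, the `LinearSVC` variant, and the default hyperparameters are choices made here, not details confirmed by this notebook. It also assumes scikit-learn is installed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy abstracts invented for illustration; they are NOT from the notebook's dataset.
positive_abstracts = [
    "novel coronavirus detected in horseshoe bat samples",
    "new paramyxovirus identified in fruit bat colony",
    "novel astrovirus discovered in insectivorous bat species",
]
# Stand-ins for the randomly downloaded articles assumed to be negative.
assumed_negative_abstracts = [
    "polymer coating improves corrosion resistance of steel",
    "galaxy cluster survey reveals dark matter distribution",
    "monetary policy effects on regional housing prices",
]

texts = positive_abstracts + assumed_negative_abstracts
labels = [1] * len(positive_abstracts) + [0] * len(assumed_negative_abstracts)

# Assumed setup: TF-IDF features feeding each of the two model families.
classifiers = {
    "logistic_regression": make_pipeline(TfidfVectorizer(), LogisticRegression()),
    "linear_svm": make_pipeline(TfidfVectorizer(), LinearSVC()),
}

new_abstract = ["novel virus identified in bat species"]
predictions = {}
for name, model in classifiers.items():
    model.fit(texts, labels)
    # 1 = predicted to concern new bat viruses, 0 = predicted negative
    predictions[name] = int(model.predict(new_abstract)[0])

print(predictions)
```

In an iterative setup, the same `predict` call could be applied to the assumed-negative articles, and any article flagged as positive could be reviewed and relabeled before retraining.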

In [1]:
!gdown 1upjZzIE8_Qka1pNtijQTaufaA4-0qKpx

Downloading...
From: https://drive.google.com/uc?id=1upjZzIE8_Qka1pNtijQTaufaA4-0qKpx
To: /content/bat-list.txt
100% 55.2k/55.2k [00:00<00:00, 61.5MB/s]


In [4]:
PATH = "bat-list.txt"

In [5]:
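# Parse lines of the form "Common name (Latin name)" into a list of records.
data = {'bat-list': []}
with open(PATH) as file:
    for line in file:
        latin_start_ind = line.find('(')
        latin_end_ind = line.find(')', latin_start_ind + 1)
        if latin_start_ind == -1 or latin_end_ind == -1:
            continue  # skip blank or malformed lines without a "(Latin name)" part
        data['bat-list'].append({
            'latin-name': line[latin_start_ind + 1:latin_end_ind].strip(),
            'common-name': line[:latin_start_ind].strip(),
        })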

data


{'bat-list': [{'latin-name': 'Aethalops alecto',
   'common-name': 'Pygmy fruit bat'},
  {'latin-name': 'Aethalops aequalis', 'common-name': 'Borneo fruit bat'},
  {'latin-name': 'Alionycteris paucidentata',
   'common-name': 'Mindanao pygmy fruit bat'},
  {'latin-name': 'Balionycteris maculata',
   'common-name': 'Spotted-winged fruit bat'},
  {'latin-name': 'Chironax melanocephalus',
   'common-name': 'Black-capped fruit bat'},
  {'latin-name': 'Cynopterus brachyotis',
   'common-name': 'Lesser short-nosed fruit bat'},
  {'latin-name': 'Cynopterus horsfieldii',
   'common-name': "Horsfield's fruit bat"},
  {'latin-name': 'Cynopterus luzoniensis',
   'common-name': "Peters's fruit bat"},
  {'latin-name': 'Cynopterus minutus', 'common-name': 'Minute fruit bat'},
  {'latin-name': 'Cynopterus nusatenggara',
   'common-name': 'Nusatenggara short-nosed fruit bat'},
  {'latin-name': 'Cynopterus sphinx',
   'common-name': 'Greater short-nosed fruit bat'},
  {'latin-name': 'Cynopterus titthae
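
To make the parsing step easier to check in isolation, the same logic can be wrapped in a small function and tried on single lines. This is a sketch: `parse_bat_line` is a hypothetical helper name, and the sample line is reconstructed from the output above on the assumption that each line of `bat-list.txt` reads `Common name (Latin name)`.

```python
def parse_bat_line(line):
    """Split 'Common name (Latin name)' into a record; return None if no parentheses."""
    start = line.find('(')
    end = line.find(')', start + 1)
    if start == -1 or end == -1:
        return None  # blank or malformed line
    return {
        'latin-name': line[start + 1:end].strip(),
        'common-name': line[:start].strip(),
    }

# Assumed line format, reconstructed from the notebook output; the second line is blank.
sample_lines = ["Pygmy fruit bat (Aethalops alecto)\n", "\n"]
parsed = [parse_bat_line(line) for line in sample_lines]
print(parsed)
```

Returning `None` for lines without parentheses lets the caller decide whether to skip or report them, instead of silently storing a mangled record.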

In [6]:
data['bat-list'].sort(key=lambda d: d['latin-name'])
data

{'bat-list': [{'latin-name': 'Acerodon celebensis',
   'common-name': 'Sulawesi fruit bat'},
  {'latin-name': 'Acerodon humilis', 'common-name': 'Talaud fruit bat'},
  {'latin-name': 'Acerodon jubatus', 'common-name': 'Golden-capped fruit bat'},
  {'latin-name': 'Acerodon leucotis', 'common-name': 'Palawan fruit bat'},
  {'latin-name': 'Acerodon mackloti', 'common-name': 'Sunda fruit bat'},
  {'latin-name': 'Aethalops aequalis', 'common-name': 'Borneo fruit bat'},
  {'latin-name': 'Aethalops alecto', 'common-name': 'Pygmy fruit bat'},
  {'latin-name': 'Alionycteris paucidentata',
   'common-name': 'Mindanao pygmy fruit bat'},
  {'latin-name': 'Ametrida centurio',
   'common-name': 'Little white-shouldered bat'},
  {'latin-name': 'Amorphochilus schnablii', 'common-name': 'Smoky bat'},
  {'latin-name': 'Anoura cadenai', 'common-name': "Cadena's tailless bat"},
  {'latin-name': 'Anoura caudifer', 'common-name': 'Tailed tailless bat'},
  {'latin-name': 'Anoura cultrata', 'common-name': "Ha
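
As a small illustration of the sort in this cell, the snippet below applies the same `key` function to three records copied from the notebook outputs. `list.sort` works in place and compares the Latin names as plain strings, so the ordering is case-sensitive; this is harmless here only if every genus name is capitalized consistently, which is an assumption about the input file.

```python
# Three records copied from the notebook outputs, deliberately out of order.
records = [
    {'latin-name': 'Aethalops alecto', 'common-name': 'Pygmy fruit bat'},
    {'latin-name': 'Aethalops aequalis', 'common-name': 'Borneo fruit bat'},
    {'latin-name': 'Acerodon celebensis', 'common-name': 'Sulawesi fruit bat'},
]

# Sort in place by Latin name, exactly as the cell does for the full list.
records.sort(key=lambda d: d['latin-name'])

latin_names = [record['latin-name'] for record in records]
print(latin_names)
# ['Acerodon celebensis', 'Aethalops aequalis', 'Aethalops alecto']
```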

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [9]:
import json

# Write the sorted bat list to bat-list.json on the mounted Google Drive
with open("/content/drive/MyDrive/UW ICM Master/Data/bat-list.json", "w") as outfile:
    json.dump(data, outfile, indent=4)
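
A quick way to confirm that the file was written correctly is to load it back and compare it with the in-memory structure. The sketch below does this with a one-record sample and a temporary directory; both are stand-ins chosen here so the example runs without Google Drive, and are not part of the original notebook.

```python
import json
import os
import tempfile

# One record copied from the sorted output, standing in for the full `data` dict.
sample = {
    'bat-list': [
        {'latin-name': 'Acerodon celebensis', 'common-name': 'Sulawesi fruit bat'},
    ]
}

# Hypothetical temporary path, used instead of the Google Drive location.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "bat-list.json")
    with open(path, "w") as outfile:
        json.dump(sample, outfile, indent=4)
    with open(path) as infile:
        loaded = json.load(infile)

# True when the JSON round trip preserved the structure exactly.
round_trip_ok = loaded == sample
print(round_trip_ok)
```

In the notebook itself, the same check would read the file from the Drive path and compare the result with `data`.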