## EDA - Grail QA

The data is in JSON format, with the schema shown below. For our purposes, only the `domains` and `question` fields matter.
```
{
  "answer": {
    "[]": {
      "answer_argument": "string",
      "answer_type": "string",
      "entity_name": "string"
    }
  },
  "domains": {
    "[]": "string"
  },
  "function": "string",
  "graph_query": {
    "edges": {
      "[]": {
        "end": "int32",
        "friendly_name": "string",
        "relation": "string",
        "start": "int32"
      }
    },
    "nodes": {
      "[]": {
        "class": "string",
        "friendly_name": "string",
        "function": "string",
        "id": "string",
        "nid": "int32",
        "node_type": "string",
        "question_node": "int32"
      }
    }
  },
  "level": "string",
  "num_edge": "int32",
  "num_node": "int32",
  "qid": "string",
  "question": "string",
  "s_expression": "string",
  "sparql_query": "string"
}
```
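
As a rough illustration of keeping only those two fields, the sketch below flattens records shaped like the schema into a two-column frame. The helper name `extract_domains_and_questions` is hypothetical (it is not the `get_domains_and_questions` function from `src.data.utils`), the sample records are made up, and joining multiple domains with `;` in sorted order is an assumption inferred from labels such as `biology;people` that appear in the data.

```python
import json

import pandas as pd


def extract_domains_and_questions(records):
    """Keep only the 'domains' and 'question' fields of each record.

    Hypothetical helper for illustration. Multiple domains are joined
    with ';' in sorted order (an assumption based on labels such as
    'biology;people').
    """
    rows = []
    for record in records:
        rows.append({
            "domains": ";".join(sorted(record["domains"])),
            "questions": record["question"],
        })
    return pd.DataFrame(rows, columns=["domains", "questions"])


# Two made-up records following the schema (unused fields omitted).
sample_json = """
[
  {"qid": "1", "domains": ["medicine"], "question": "what are the symptoms of flu?"},
  {"qid": "2", "domains": ["people", "biology"], "question": "which breeds share a gender?"}
]
"""
sample_df = extract_domains_and_questions(json.loads(sample_json))
print(sample_df)
```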

In [1]:
import pandas as pd

pd.options.display.max_rows = 150
pd.options.display.max_colwidth = 0

In [2]:
from src.data.utils import get_domains_and_questions

train = pd.DataFrame(get_domains_and_questions('train', 'grail_qa'))

In [3]:
train.head()

Unnamed: 0,domains,questions
0,medicine,oxybutynin chloride 5 extended release film coated tablet is the ingredients of what routed drug?
1,education,the type single-sex school are in which institutions?
2,religion,the leaders of the earliest established religious organization are given what title?
3,user.patrick.default_domain,"on 07/01/1970, which warship v1.1 was hit?"
4,language,what is the language regulator of basque?


In [7]:
train.domains.value_counts()

music                                          2280
fictional_universe                             2120
book                                           2060
medicine                                       2002
computer                                       1923
astronomy                                      1534
people                                         1468
sports                                         1383
spaceflight                                    1381
tv                                             1321
biology                                        1248
government                                     1247
food                                           1200
comic_books                                    1054
education                                      994 
aviation                                       930 
business                                       871 
time                                           823 
architecture                                   795 
religion    

Several subdomains in this list are candidates for the umbrella domains `healthcare` and `technology`.
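
Note that `value_counts()` treats a composite label such as `business;medicine` as its own category, separate from `medicine`. A minimal sketch of counting per individual subdomain instead, assuming `;` is the separator between domains; the `toy` frame is made-up data standing in for `train`.

```python
import pandas as pd

# Toy frame standing in for `train`; the rows are made up.
toy = pd.DataFrame({
    "domains": ["medicine", "business;medicine", "biology;people", "computer", "medicine"],
    "questions": ["q1", "q2", "q3", "q4", "q5"],
})

# Split composite labels on ';' and give each part its own row.
atomic = toy["domains"].str.split(";").explode()

# Count how many questions touch each individual subdomain.
atomic_counts = atomic.value_counts()
print(atomic_counts)

# Labels that contain a given subdomain, alone or in combination.
has_medicine = toy["domains"].str.contains("medicine", regex=False)
medicine_labels = sorted(toy.loc[has_medicine, "domains"].unique())
print(medicine_labels)
```

Counting this way can surface composite labels that would otherwise be missed when scanning the list for a single subdomain name.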

## Candidates for `healthcare`

In [6]:
from functools import partial

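def get_candidate_sample(df, subdomain, n=5):
    """Return up to n random rows whose `domains` label equals subdomain."""
    matches = df.loc[df.domains == subdomain]
    # Cap n so that sampling a rare subdomain does not raise a ValueError.
    return matches.sample(min(n, len(matches)))

# Bind the training set so a sample only needs the subdomain name.
get_sample = partial(get_candidate_sample, train)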

In [14]:
get_sample('medicine')

Unnamed: 0,domains,questions
3836,medicine,what are some of the symptoms of neck pain?
19795,medicine,what medications should not be used during acute pain
27564,medicine,which drug formulation has the manufactured forms of tylenol cold cough and severe congestion 325/10/200/5 liquid?
43138,medicine,what medical treatments have the side effect symptoms of pains?
1847,medicine,driving is a contradiction for which medical treatment?


In [15]:
get_sample('biology')

Unnamed: 0,domains,questions
28721,biology,find an organism that weighs less than 21.3.
13632,biology,what animal breed's origin is macedonia?
28319,biology,which plant disease is hosted on oats?
40553,biology,which organism classifications rank lower than infraorder ?
37268,biology,what is the name of the cytogenic band that is associated with a genomic locus also associated with chromosome 4 (human)?


In [16]:
get_sample('biology;people')

Unnamed: 0,domains,questions
8510,biology;people,how many illustrators are in the same gender with codex?
28973,biology;people,little wolf shares a gender with how many illustrators?
38852,biology;people,how many illustrators are in the same gender with risen star?
16896,biology;people,herman the bull is the same gender as how many illustrators?
7444,biology;people,how many illustrators share a gender with lights up?


In [17]:
get_sample('business;medicine')

Unnamed: 0,domains,questions
39923,business;medicine,what is the largest manufactured drug with the brand of micardis?
36111,business;medicine,what is the largest drug manufactured with the brand colace?
8438,business;medicine,what is the largest manufactured drug retrovir?
796,business;medicine,name the largest manufactured drug with the brand ofrequip?
3299,business;medicine,what is the largest manufactured drug with the brand of therabenzaprine-90?


## Candidates for `technology`

In [18]:
get_sample('computer')

Unnamed: 0,domains,questions
16573,computer,zx printer is the peripherals within which computer emulator?
34720,computer,what computer manufacturer/brand has a computer line that is the parent model of apple iic?
7275,computer,which programming language is influenced by hypercard ?
31557,computer,what family of computers is the zx interface 2 compatible with?
21135,computer,which file format has the genre of package manager?


In [19]:
get_sample('spaceflight')

Unnamed: 0,domains,questions
18687,spaceflight,the name of the space mission that has astronaut mark c. lee?
35660,spaceflight,263.3 is the isp of which rocket engine?
10163,spaceflight,which rocket engine that is manufactured by npo energomash has an isp (vacuum) bigger than 400.0?
22580,spaceflight,the designer of rd-118 designed what other rocket engines?
24140,spaceflight,"find a bipropellant rocket engine, that has a status of test fired, whose thrust (sea level) is more than or equal to 784000.0."


In [20]:
get_sample('aviation')

Unnamed: 0,domains,questions
11585,aviation,which aviation incident is of the type improvised explosive devices?
18700,aviation,boeing b737 is part of which aircraft model line?
24330,aviation,find information about the wuhan airlines aviation incident.
34002,aviation,which airline's hub includes baghdad international airport?
14081,aviation,what models of aircraft were designed by john carver meadows frost?


In [21]:
get_sample('automotive')

Unnamed: 0,domains,questions
4440,automotive,can you name the engines that use the same fuel as the hyundai 2.0l 4 cylinder 210 hp 223 ft-lbs turbo?
15416,automotive,which class is the automotive class of automobile model whose sister model is the crown victoria?
22043,automotive,which automobile has a generation number of 8?
23721,automotive,which automobile model directly preceded the second generation mercury topaz?
7342,automotive,what is the automobile generation that uses the platform of ford d3 platform?


In [22]:
get_sample('broadcast')

Unnamed: 0,domains,questions
39856,broadcast,what capital of administrative division is the location of a broadcast producer that is the producer of wcai/wnan talk radio?
26442,broadcast,what is the radio format with format of wjse and stations of ktmp?
20592,broadcast,what is the broadcast content with genre latino?
21792,broadcast,2008-11-23 was the end date of which broadcast content?
29314,broadcast,what is an example of a house music and urban contemporary broadcast content?


In [23]:
get_sample('internet')

Unnamed: 0,domains,questions
8546,internet,which website has the api of javascript object notation?
24332,internet,the website ended most recently is a member of which website category?
6848,internet,name the subsites of yssy atc simulation.
8870,internet,what is the website that belongs to the same website category with mess with msn messenger?
2822,internet,matias basilico owns how many active websites?


In [24]:
get_sample('engineering')

Unnamed: 0,domains,questions
38974,engineering,what is the upper material class of aviation fuel?
22537,engineering,what are the names of reaction engines with a mass of at least 5393.0?
32908,engineering,ac motor is a subcategory of which category of engine?
34062,engineering,a ranted voltage of 250.0 is the voltage of what kind of power plug?
23336,engineering,what are the upper material classes of wood?


In [25]:
get_sample('military')

Unnamed: 0,domains,questions
33676,military,1st red banner army belongs to which armed force sub-division?
16029,military,the 7th illinois volunteer infantry regiment belongs to which armed force?
16583,military,state of texas is from which armed force military combatant?
6918,military,gestapo is included in which armed force?
10920,military,which military person participated in battle of grozny?


## Ontology

In a live meeting, we reviewed samples of the candidate subdomains, like the one below, and agreed on which of them should map to `healthcare` or `technology`.

In [28]:
get_candidate_sample(train, 'engineering', n=10)

Unnamed: 0,domains,questions
25213,engineering,2vsb is the child modulation of which signal modulation mode?
13139,engineering,the heinkel hes 40 engine is under what category?
30912,engineering,which piston engine has the capacity of 27.0?
27971,engineering,what material can be classified with grey iron?
15982,engineering,which engine category has sub-categories like turbofan?
33095,engineering,what sub-category does rolls-royce vulture fall under in the engine category?
42663,engineering,vantage xp-360 is the variation of what engine?
12203,engineering,malleable iron and what other materials are in the same class?
8236,engineering,what signal modulation mode is the child modulation of a-vsb?
18684,engineering,which battery shape format is the size of aaaa battery?
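
Once a mapping has been agreed, applying it could look like the sketch below. The function name `add_umbrella_domain` and the `umbrella` column are hypothetical, and `EXAMPLE_ONTOLOGY` is a placeholder for illustration only; it does not record what was decided in the meeting.

```python
import pandas as pd

# Placeholder mapping for illustration only; NOT the mapping agreed in the meeting.
EXAMPLE_ONTOLOGY = {
    "medicine": "healthcare",
    "computer": "technology",
}


def add_umbrella_domain(df, ontology):
    """Return a copy of df with an 'umbrella' column; unmapped labels become NaN."""
    out = df.copy()
    out["umbrella"] = out["domains"].map(ontology)
    return out


# Made-up rows standing in for `train`.
toy = pd.DataFrame({
    "domains": ["medicine", "computer", "music"],
    "questions": ["q1", "q2", "q3"],
})
labelled = add_umbrella_domain(toy, EXAMPLE_ONTOLOGY)
print(labelled)
```

Leaving unmapped subdomains as NaN keeps them easy to drop or inspect later with `labelled.umbrella.isna()`.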
