# Building OHEN

This notebook documents how we build OHEN and details the considerations that went into each expansion.

<a id='top'></a>

# Table of Contents

## [Arranging Our Data](#arrange)

## [Helper Functions](#helper)


## [Food Assertions](#food)

### [Class Assertions](#class)
[High Protein Foods](#highProtein)<br>
[High Magnesium Foods](#highMag)<br>
[Saturated Fat Free Foods](#satFat)<br>
[Low Cholesterol Foods](#lowChol)<br>
[High Fiber Foods](#highFiber)<br>
[Low Sugar Foods](#lowSugar)<br>

### [Data Property Assertions](#data)
[Grams of Protein](#protein)<br>
[Vitamin C](#vitc)<br>
[Zinc](#zinc)<br>

## [Exercise Assertions](#exercise)

<a id='arrange'></a>

## Arranging Our Data

In [1]:
import pandas as pd

In [2]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

In [3]:
foods = pd.read_csv("fndds_ingredient_nutrient_value.csv")
nutrients = pd.read_csv("nutrient.csv")

We downloaded the above two files from the USDA database located here: https://fdc.nal.usda.gov/download-datasets.html. Both are part of the "Full Download of All Data Types", *April 2020 version 2* (CSV – 85M).

The file *fndds_ingredient_nutrient_value.csv* contains data on a wide variety of foods, with values for 65 different nutrients.  The nutrients are identified only by numeric codes, however, so we need the *nutrient.csv* file to map each nutrient code to the name of the nutrient.
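As a small illustration of that lookup, the sketch below maps nutrient codes to names with a dictionary built from a toy stand-in for *nutrient.csv*. The three rows reuse values that appear in this notebook; the names `toy_nutrients`, `toy_foods` and `code_to_name` are ours and are not part of the USDA files.

```python
import pandas as pd

# Toy stand-in for nutrient.csv (only the two columns used in this notebook).
toy_nutrients = pd.DataFrame({
    "nutrient_nbr": [203, 204, 208],
    "name": ["Protein", "Total lipid (fat)", "Energy"],
})

# Toy stand-in for fndds_ingredient_nutrient_value.csv.
toy_foods = pd.DataFrame({
    "SR description": ["Butter, salted"] * 3,
    "Nutrient code": [203, 204, 208],
    "Nutrient value": [0.85, 81.11, 717.0],
})

# Build a code -> name lookup once, then apply it to every row.
code_to_name = dict(zip(toy_nutrients["nutrient_nbr"], toy_nutrients["name"]))
toy_foods["Nutrient name"] = toy_foods["Nutrient code"].map(code_to_name)

print(toy_foods[["Nutrient code", "Nutrient name"]])
```

Mapping through a dictionary names every row in one pass, which may be convenient if the codes ever stop repeating in a fixed pattern.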

In [4]:
nutrientCodes = []

for i in range(len(nutrients)):
    nutrientCodes.append(nutrients.iloc[i].nutrient_nbr)

In [5]:
nutrientCodes = foods["Nutrient code"][0:65] # nutrient codes repeat in this same pattern throughout rest of data
nutrientList = []

for i,code in enumerate(nutrientCodes):
    nutrientName = nutrients[nutrients.nutrient_nbr==code].iloc[0]["name"]
    foods.loc[i, "Nutrient name"] = nutrientName
    nutrientList.append(nutrientName)

In [6]:
foods.head()

Unnamed: 0,Ingredient code,SR description,Nutrient code,Nutrient value,Nutrient value source,SR 28 derivation code,SR 28 AddMod year,Start date,End date,Nutrient name
0,1001,"Butter, salted",203,0.85,SR28,,1976,2015-01-01 00:00:00.0,2016-12-31 00:00:00.0,Protein
1,1001,"Butter, salted",204,81.11,SR28,,1976,2015-01-01 00:00:00.0,2016-12-31 00:00:00.0,Total lipid (fat)
2,1001,"Butter, salted",205,0.06,SR28,NC,1976,2015-01-01 00:00:00.0,2016-12-31 00:00:00.0,"Carbohydrate, by difference"
3,1001,"Butter, salted",208,717.0,SR28,NC,2010,2015-01-01 00:00:00.0,2016-12-31 00:00:00.0,Energy
4,1001,"Butter, salted",221,0.0,SR28,,1985,2015-01-01 00:00:00.0,2016-12-31 00:00:00.0,"Alcohol, ethyl"


Our data originally looks like the above.  We see that "Butter, salted", for example, has a Nutrient value of 0.85 for Nutrient code 203.  Looking up code 203 in *nutrient.csv* tells us that "Butter, salted" contains 0.85 grams of protein.  All values are given per 100 grams of food, so "Butter, salted" also contains 81.11 grams of total fat per 100 grams.  We now transform the data into a more usable form:

In [7]:
columns = nutrientList.copy()
columns.insert(0,"Food")

rows = []

# each food occupies a contiguous block of len(nutrientCodes) rows in foods
for i in range(0, len(foods), len(nutrientCodes)):
    foodName = foods.loc[i,"SR description"]
    # .loc slices by label and includes the end label
    nutrientValues = list(foods.loc[i:i+len(nutrientCodes)-1,"Nutrient value"])
    rows.append([foodName] + nutrientValues)

# build the frame once: DataFrame.append was removed in pandas 2.0 and is slow inside a loop
ohen = pd.DataFrame(rows, columns = columns)

ohen.head()

Unnamed: 0,Food,Protein,Total lipid (fat),"Carbohydrate, by difference",Energy,"Alcohol, ethyl",Water,Caffeine,Theobromine,"Sugars, total including NLEA","Fiber, total dietary","Calcium, Ca","Iron, Fe","Magnesium, Mg","Phosphorus, P","Potassium, K","Sodium, Na","Zinc, Zn","Copper, Cu","Selenium, Se",Retinol,"Vitamin A, RAE","Carotene, beta","Carotene, alpha",Vitamin E (alpha-tocopherol),Vitamin D (D2 + D3),"Cryptoxanthin, beta",Lycopene,Lutein + zeaxanthin,"Vitamin C, total ascorbic acid",Thiamin,Riboflavin,Niacin,Vitamin B-6,"Folate, total",Vitamin B-12,"Choline, total",Vitamin K (phylloquinone),Folic acid,"Folate, food","Folate, DFE","Vitamin E, added","Vitamin B-12, added",Cholesterol,"Fatty acids, total saturated",4:0,6:0,8:0,10:0,12:0,14:0,16:0,18:0,18:1,18:2,18:3,20:4,22:6 n-3 (DHA),16:1,18:4,20:1,20:5 n-3 (EPA),22:1,22:5 n-3 (DPA),"Fatty acids, total monounsaturated","Fatty acids, total polyunsaturated"
0,"Butter, salted",0.85,81.11,0.06,717.0,0.0,15.87,0.0,0.0,0.06,0.0,24.0,0.02,2.0,24.0,24.0,643.0,0.09,0.0,1.0,671.0,684.0,158.0,0.0,2.32,0.0,0.0,0.0,0.0,0.0,0.005,0.034,0.042,0.003,3.0,0.17,18.8,7.0,0.0,3.0,3.0,0.0,0.0,215.0,51.368,3.226,2.007,1.19,2.529,2.587,7.436,21.697,9.999,19.961,2.728,0.315,0.0,0.0,0.961,0.0,0.1,0.0,0.0,0.0,21.021,3.043
1,"Butter, whipped, with salt",0.49,78.3,2.87,718.0,0.0,16.72,0.0,0.0,0.06,0.0,23.0,0.05,1.0,24.0,41.0,583.0,0.05,0.01,0.0,671.0,683.0,135.0,1.0,1.37,0.0,6.0,0.0,13.0,0.0,0.007,0.064,0.022,0.008,4.0,0.07,18.8,4.6,0.0,4.0,4.0,0.0,0.0,225.0,45.39,1.635,1.373,0.858,2.039,2.354,7.515,20.531,7.649,17.37,2.713,0.298,0.119,0.003,1.417,0.003,0.147,0.022,0.005,0.045,19.874,3.331
2,"Butter oil, anhydrous",0.28,99.48,0.0,876.0,0.0,0.24,0.0,0.0,0.0,0.0,4.0,0.0,0.0,3.0,5.0,2.0,0.01,0.001,0.0,824.0,840.0,193.0,0.0,2.8,0.0,0.0,0.0,0.0,0.0,0.001,0.005,0.003,0.001,0.0,0.01,22.3,8.6,0.0,0.0,0.0,0.0,0.0,256.0,61.924,3.226,1.91,1.112,2.495,2.793,10.005,26.166,12.056,25.026,2.247,1.447,0.0,0.0,2.228,0.0,0.0,0.0,0.0,0.0,28.732,3.694
3,"Cheese, blue",21.4,28.74,2.34,353.0,0.0,42.41,0.0,0.0,0.5,0.0,528.0,0.31,23.0,387.0,256.0,1146.0,2.66,0.04,14.5,192.0,198.0,74.0,0.0,0.25,0.5,0.0,0.0,0.0,0.0,0.029,0.382,1.016,0.166,36.0,1.22,15.4,2.4,0.0,36.0,36.0,0.0,0.0,75.0,18.669,0.658,0.361,0.247,0.601,0.491,3.301,9.153,3.235,6.622,0.536,0.264,0.0,0.0,0.816,0.0,0.0,0.0,0.0,0.0,7.778,0.8
4,"Cheese, brick",23.24,29.68,2.79,371.0,0.0,41.11,0.0,0.0,0.51,0.0,674.0,0.43,24.0,451.0,136.0,560.0,2.6,0.024,14.5,286.0,292.0,76.0,0.0,0.26,0.5,0.0,0.0,0.0,0.0,0.014,0.351,0.118,0.065,20.0,1.26,15.4,2.5,0.0,20.0,20.0,0.0,0.0,94.0,18.764,0.914,0.373,0.299,0.585,0.482,3.227,8.655,3.455,7.401,0.491,0.293,0.0,0.0,0.817,0.0,0.0,0.0,0.0,0.0,8.598,0.784


In [8]:
len(ohen)

2730
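The same long-to-wide reshaping can be sketched with `DataFrame.pivot` on a toy frame. This is an alternative illustration, not the code used to build `ohen`; it assumes each (food, nutrient) pair occurs exactly once, and the names `toy_long` and `toy_wide` are ours.

```python
import pandas as pd

toy_long = pd.DataFrame({
    "SR description": ["Butter, salted", "Butter, salted", "Cheese, blue", "Cheese, blue"],
    "Nutrient name": ["Protein", "Energy", "Protein", "Energy"],
    "Nutrient value": [0.85, 717.0, 21.4, 353.0],
})

# One row per food, one column per nutrient.
toy_wide = toy_long.pivot(
    index="SR description", columns="Nutrient name", values="Nutrient value"
)
toy_wide.columns.name = None  # drop the leftover "Nutrient name" axis label
toy_wide = toy_wide.reset_index().rename(columns={"SR description": "Food"})

print(toy_wide)
```

Note that `pivot` sorts the nutrient columns alphabetically, whereas the loop above keeps the order in which the codes appear in the file.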

Now we can easily search and filter all 2730 foods according to formal definitions, which are quoted in each code block, or according to our own criteria when no formal definition exists (such as a "high magnesium" food).  The following blocks of code detail the considerations that went into creating each list.  Each list of foods is then formatted into an OWL-appropriate syntax and saved as a text file in the "Foods" folder.  From there, we copy and paste each file directly into our ohen.OWL file from a text editor.  We have marked the pasted regions with comments of the form

\<!--MC Start *some food category* --> <br> *list of foods with class assertions and named individuals* <br>
\<!--MC End *some food category*--> 

to easily track what modifications we have made.

[Back to Top](#top)

<a id='helper'></a>

## Helper Functions

In [9]:
import os

def writeFile(fileName, foodsList, IRI, assertionType="Class", dataFeature="Protein"):
    fileName = "Foods/" + fileName + '.txt'
    if os.path.exists(fileName): #deleting file here is useful only for debugging purposes
        os.remove(fileName)

    with open(fileName, 'w') as file:
        for food in foodsList:
            foodAssertion = food
            foodAssertion = foodAssertion.replace("%","percent")
            foodAssertion = foodAssertion.replace(" ","")
            foodAssertion = foodAssertion.replace(",","-")
            foodAssertion = foodAssertion.replace(";","_")
            foodAssertion = foodAssertion.replace('"',"inch")
            foodAssertion = foodAssertion.replace(')',"-")
            foodAssertion = foodAssertion.replace('(',"-")
            foodAssertion = foodAssertion.replace(':',"_")
            foodAssertion = foodAssertion.replace('&',"_and_")
            if assertionType=="Class":
                file.write('\t<ClassAssertion>\n\t\t<Class IRI="{}"/>\n\t\t<NamedIndividual IRI="#{}"/>\n\t</ClassAssertion>\n'.format(IRI, foodAssertion))
            if assertionType=="DataProperty":
                dataValue = ohen[ohen.Food==food][dataFeature].values[0]
                file.write('\t<DataPropertyAssertion>\n\t\t<DataProperty IRI="{}"/>\n\t\t<NamedIndividual IRI="#{}"/>\n\t\t<Literal datatypeIRI="http://www.w3.org/2001/XMLSchema#decimal">{}</Literal>\n\t</DataPropertyAssertion>\n'.format(IRI, foodAssertion, dataValue))
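To make the naming scheme easier to inspect, here is a minimal sketch that separates the character replacements from the file writing. The helper names `sanitize_name` and `class_assertion` are hypothetical and do not exist in the notebook; the replacement pairs and the XML template are copied from `writeFile`.

```python
# Replacement pairs applied in order, mirroring the chain in writeFile.
REPLACEMENTS = [
    ("%", "percent"), (" ", ""), (",", "-"), (";", "_"), ('"', "inch"),
    (")", "-"), ("(", "-"), (":", "_"), ("&", "_and_"),
]

def sanitize_name(food):
    """Turn a food description into the fragment used in NamedIndividual IRIs."""
    for old, new in REPLACEMENTS:
        food = food.replace(old, new)
    return food

def class_assertion(class_iri, food):
    """Return the OWL/XML ClassAssertion block for one food."""
    return (
        '\t<ClassAssertion>\n'
        '\t\t<Class IRI="{}"/>\n'
        '\t\t<NamedIndividual IRI="#{}"/>\n'
        '\t</ClassAssertion>\n'
    ).format(class_iri, sanitize_name(food))

print(class_assertion("http://purl.obolibrary.org/obo/FOODON_03510203", "Cheese, blue"))
```

Returning a string instead of writing directly makes it possible to check a single food's output before generating a whole file.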

In [10]:
def foodsByQuantile(feature, topQuantile):
    '''
        Find all foods whose "feature" value lies in the top "topQuantile" fraction of all foods
        (e.g. topQuantile=0.1 returns the foods at or above the 90th percentile)
    '''
    value = ohen[feature].quantile(1-topQuantile)
    return list(ohen[ohen[feature]>=value].sort_values(by="Food").Food)
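The cut-off logic of `foodsByQuantile` can be tried on a toy frame. The ten single-letter "foods" and their values below are invented for illustration; only the column name and the quantile expression come from the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Food": list("abcdefghij"),
    "Magnesium, Mg": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})

top_fraction = 0.1
# The 90th percentile is the cut-off for the top 10 %.
cutoff = toy["Magnesium, Mg"].quantile(1 - top_fraction)
top_foods = list(toy[toy["Magnesium, Mg"] >= cutoff].sort_values(by="Food").Food)

print(cutoff, top_foods)
```

Because `quantile` interpolates linearly between data points by default, the cut-off here falls between 9 and 10, so only the food with the largest value is selected.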

In [11]:
def foodsLessThanAmount(feature, value):
    '''
        Find all foods with "feature" value less than or equal to the given value 
    '''
    return list(ohen[ohen[feature]<=value].sort_values(by="Food").Food)

In [12]:
def foodsGreaterThanAmount(feature, value):
    '''
        Find all foods with "feature" value greater than or equal to the given value 
    '''
    return list(ohen[ohen[feature]>=value].sort_values(by="Food").Food)

[Back to Top](#top)

<a id='food'></a>

## Food Assertions

This section will explain how we added various food assertions to OHEN.

<a id='class'></a>

### Class Assertions

We begin with various class assertions.  Any considerations for why a food was (not) included is detailed in the comments of each code block.

<a id='highProtein'></a>

#### High Protein Foods

In [13]:
'''
definition: "A claim that a food is high in protein, and any claim likely to have the same meaning for the consumer, 
may only be made where at least 20 % of the energy value of the food is provided by protein 
[REGULATION (EC) No 1924/2006 Corrigendum 2007-01-18]."
'''
highProteinFoods = []
for i in range(len(ohen)):
    if ohen.loc[i,"Protein"]*4 >= ohen.loc[i,"Energy"] * 0.2:
        highProteinFoods.append(ohen.loc[i,"Food"])
highProteinFoods;

In [14]:
writeFile('highProteinFoods', highProteinFoods, 'http://purl.obolibrary.org/obo/FOODON_03510203')
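The 20 % energy rule can also be written as a vectorized comparison. The sketch below uses a toy frame; the chicken and water rows are invented values, and the extra `has_energy` filter is our assumption, not part of the cell above. The factor 4 is the usual energy conversion of about 4 kcal per gram of protein.

```python
import pandas as pd

toy = pd.DataFrame({
    "Food": ["Chicken breast (toy values)", "Butter, salted", "Water (toy values)"],
    "Protein": [31.0, 0.85, 0.0],   # g per 100 g
    "Energy": [165.0, 717.0, 0.0],  # kcal per 100 g
})

KCAL_PER_G_PROTEIN = 4  # general conversion factor for protein

protein_kcal = toy["Protein"] * KCAL_PER_G_PROTEIN
meets_rule = protein_kcal >= 0.2 * toy["Energy"]

# Assumption (not in the original cell): require some energy, so that
# zero-calorie items do not pass the test trivially with 0 >= 0.
has_energy = toy["Energy"] > 0

print(list(toy.loc[meets_rule, "Food"]))
print(list(toy.loc[meets_rule & has_energy, "Food"]))
```

Whether zero-energy foods should count as "high protein" is a modelling choice; the sketch only shows where such a filter would go.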

<a id='highMag'></a>
#### High Magnesium Foods

In [15]:
'''
Here we have arbitrarily decided that "high magnesium foods" should be foods with Mg values in the top 10% of our foods
'''

highMagnesiumFoods = foodsByQuantile("Magnesium, Mg", 0.1)
writeFile("highMagnesiumFoods", highMagnesiumFoods, "http://www.semanticweb.org/mclou/ontologies/2020/6/ohen3#highMagnesium")

<a id='satFat'></a>

#### Saturated Fat Free Foods

In [16]:
'''
definition: "A claim that a food does not contain saturated fat, and any claim likely to have the same meaning for the consumer, 
may only be made where the sum of saturated fat and trans-fatty acids does not exceed 0,1 g of saturated fat per 100 g 
or 100 ml [REGULATION (EC) No 1924/2006 Corrigendum 2007-01-18]."

*Our data has no trans-fatty acid column, so we check total saturated fat only
'''

saturatedFatFreeFoods = foodsLessThanAmount("Fatty acids, total saturated", 0.1)
writeFile("saturatedFatFreeFoods", saturatedFatFreeFoods, "http://purl.obolibrary.org/obo/FOODON_03510179")

<a id='lowChol'></a>

#### Low Cholesterol Foods

In [17]:
'''
definition: "Food having 20 milligrams or less cholesterol per amount customarily consumed 
(and per 50 grams of food if the amount customarily consumed is small). 
Meals and main dishes contain 20 milligrams or less cholesterol per 100 grams of food. 
If the food qualifies by special processing and total fat exceeds 13 grams per amount and labeled serving, 
the amount of cholesterol must be 'substantially less' (25%) than in a comparable food with significant market share 
(5% of market)."

*For simplicity, we apply the 20 mg threshold per 100 g to every food
'''
lowCholesterolFoods = foodsLessThanAmount("Cholesterol", 20)
writeFile("lowCholesterolFoods", lowCholesterolFoods, "http://purl.obolibrary.org/obo/FOODON_03510043")

<a id='highFiber'></a>

#### High Fiber Foods

In [18]:
'''
definition: "A claim that a food is high in fibre, and any claim likely to have the same meaning for the consumer, 
may only be made where the product contains at least 6 g of fibre per 100 g or at least 3 g of fibre per 100 kcal 
[REGULATION (EC) No 1924/2006 Corrigendum 2007-01-18]."

*For simplicity, we only check whether there are at least 6 g of fiber per 100 g
'''
highFiberFoods = foodsGreaterThanAmount("Fiber, total dietary", 6)
writeFile("highFiberFoods", highFiberFoods, "http://purl.obolibrary.org/obo/FOODON_03510048")
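The quoted definition also allows a food to qualify with at least 3 g of fibre per 100 kcal, which the cell above does not check. A minimal sketch of both criteria follows; it assumes the "Energy" column is in kcal per 100 g, and two of the three rows are invented toy values.

```python
import pandas as pd

toy = pd.DataFrame({
    "Food": ["Leafy green (toy values)", "Bran cereal (toy values)", "Butter, salted"],
    "Fiber, total dietary": [2.0, 12.0, 0.0],  # g per 100 g
    "Energy": [20.0, 350.0, 717.0],            # kcal per 100 g
})

fiber = toy["Fiber, total dietary"]
energy = toy["Energy"]

per_100g_ok = fiber >= 6
# Fibre per 100 kcal, written without division so zero-energy rows are safe:
# fiber / energy * 100 >= 3 is equivalent to fiber * 100 >= 3 * energy when energy > 0.
per_100kcal_ok = (energy > 0) & (fiber * 100 >= 3 * energy)

high_fiber = list(toy.loc[per_100g_ok | per_100kcal_ok, "Food"])
print(high_fiber)
```

Low-calorie vegetables are the typical case that passes only the per-100-kcal test, so adding it would likely enlarge the list.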

<a id='lowSugar'></a>

#### Low Sugar Foods

In [19]:
'''
definition "A claim that a food is low in sugars, and any claim likely to have the same meaning for the consumer, 
may only be made where the product contains no more than 5 g of sugars per 100 g for solids or 2,5 g of sugars per 
100 ml for liquids [REGULATION (EC) No 1924/2006 Corrigendum 2007-01-18].
Not defined in U.S. Federal Register; no basis for a recommended intake."

*We do not have a foolproof way to classify solids vs liquids for our foods, so we shall choose '2.5' as our
threshold to be conservative
'''
lowSugarFoods = foodsLessThanAmount("Sugars, total including NLEA", 2.5)
writeFile("lowSugarFoods", lowSugarFoods, "http://purl.obolibrary.org/obo/FOODON_03510062")

[Back to Top](#top)

<a id='data'></a>

### Data Property Assertions

To keep the load on our Protege reasoner low, we include data properties only for some values and for selected classes of foods.  Ideally, all food instances would have complete information.

<a id='protein'></a>

#### Grams of Protein

In [20]:
'''
We choose to include the actual amount of protein that is contained in a high protein food.
The amount of protein one actually consumes matters more than the share of a food's energy that comes from protein, 
yet the definition used above classifies a high protein food only by that percentage.  This allows foods such as 
"Arugula, raw" to be included, which may be misleading because arugula is a high protein food only by percentage, 
not by actual amount.
'''
writeFile('highProteinFoodsGrams', highProteinFoods, '#protein_Grams', assertionType="DataProperty")

<a id='vitc'></a>

#### Vitamin C

In [21]:
'''
Vitamin C is typically found among certain plant foods, so the most relevant class from those we have defined above would 
likely be "high fiber foods".
'''
writeFile('highFiberVitaminCMG', highFiberFoods, '#vitaminC_MG', assertionType="DataProperty", dataFeature="Vitamin C, total ascorbic acid")

<a id='zinc'></a>

#### Zinc

In [22]:
'''
Zinc is prevalent in many animal products, and so we shall check "high protein foods".
'''
writeFile('highProteinZinc', highProteinFoods, '#zinc_MG', assertionType="DataProperty", dataFeature="Zinc, Zn")

[Back to Top](#top)

<a id='exercise'></a>

## Exercise Assertions

We now want to add some assertions about the major muscle groups in the body.  Our Ontology of Physical Exercises (OPE) defines many muscle groups as classes.  In building OHEN, we want to create specific instances of each of the major muscle groups.

In [23]:
import re

muscleClassName = '<Class IRI="http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#Muscle"/>'
majorMuscleGroups = []

with open('ohenTest.owl', 'r', encoding="utf8") as ohenOWLFile:
    lines = ohenOWLFile.readlines()
    for i, line in enumerate(lines):
        if muscleClassName in line:
            # skip lines where Muscle directly follows a Declaration or SubClassOf tag;
            # otherwise the previous line names the class that is paired with Muscle
            if 'Declaration' not in lines[i-1] and 'SubClassOf' not in lines[i-1]: # these are now all major muscle groups
                majorMuscleGroups.append(re.search('#(.*)"', lines[i-1]).group(1))
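The line-based filter can be checked on a few hand-written lines. The snippet below is a sketch that assumes the OWL/XML layout in which the subclass is listed on the line before the superclass inside a `SubClassOf` element; the `Organ` class is a hypothetical superclass added only to exercise the skip condition.

```python
import re

BASE = "http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#"
MUSCLE = '<Class IRI="' + BASE + 'Muscle"/>'

toy_lines = [
    "<Declaration>",
    "    " + MUSCLE,
    "</Declaration>",
    "<SubClassOf>",
    '    <Class IRI="' + BASE + 'Triceps"/>',
    "    " + MUSCLE,
    "</SubClassOf>",
    "<SubClassOf>",
    "    " + MUSCLE,
    '    <Class IRI="' + BASE + 'Organ"/>',  # hypothetical superclass
    "</SubClassOf>",
]

found = []
for i, line in enumerate(toy_lines):
    if i == 0 or MUSCLE not in line:
        continue
    previous = toy_lines[i - 1]
    # Skip the declaration of Muscle and axioms where Muscle is the subclass.
    if "Declaration" in previous or "SubClassOf" in previous:
        continue
    match = re.search(r'#([^"]*)"', previous)
    if match:
        found.append(match.group(1))

print(found)
```

A full XML parser would be more robust than matching on adjacent lines, but the sketch keeps the same approach as the cell above so the two can be compared directly.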

For example, "Triceps" is a subclass of "Muscle" in OPE, and we represent a user's triceps group by creating an instance of the triceps muscle class with "userMuscle-" as a prefix.  Though we only have one user as of now, it would be simple to extend this concept to create instances of each muscle group for multiple users.  In this way, we could record more detailed information, such as a metric for the strength of that user's muscle or whether that user has an injury in that muscle group.

In [24]:
len(majorMuscleGroups)

103

We see that we have 103 major muscle groups as defined by OPE.  In the following code we create the user's instance of each muscle group.  For the purpose of the example in the paper, we assert that the user has an injury in the triceps muscle.

In [25]:
fileName = "Muscle Groups/majorMuscleGroups.txt"
if os.path.exists(fileName): #deleting file here is useful only for debugging purposes
    os.remove(fileName)

with open(fileName, 'w') as file:
    for muscle in majorMuscleGroups:
        classIRI = "http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#" + muscle
        muscleName = 'userMuscle-' + muscle
        dataValue = 'false'
        dataIRI = 'isInjured'
        if muscle == 'Triceps': # for our example, the user only has a triceps injury
            dataValue = 'true'
        file.write('\t<ClassAssertion>\n\t\t<Class IRI="{}"/>\n\t\t<NamedIndividual IRI="#{}"/>\n\t</ClassAssertion>\n'.format(classIRI, muscleName))
        file.write('\t<DataPropertyAssertion>\n\t\t<DataProperty IRI="{}"/>\n\t\t<NamedIndividual IRI="#{}"/>\n\t\t<Literal datatypeIRI="http://www.w3.org/2001/XMLSchema#boolean">{}</Literal>\n\t</DataPropertyAssertion>\n'.format(dataIRI, muscleName, dataValue))
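The multi-user extension mentioned earlier could look roughly like the sketch below. The function `muscle_assertions`, the user identifiers, the per-user prefix scheme and the `Biceps` class name are all hypothetical; the XML templates and the `isInjured` property are copied from the cell above.

```python
def muscle_assertions(muscle, injured=False, prefix="userMuscle-"):
    """Return the class and isInjured assertions for one muscle instance."""
    class_iri = "http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#" + muscle
    name = prefix + muscle
    value = "true" if injured else "false"
    class_part = (
        '\t<ClassAssertion>\n'
        '\t\t<Class IRI="{}"/>\n'
        '\t\t<NamedIndividual IRI="#{}"/>\n'
        '\t</ClassAssertion>\n'
    ).format(class_iri, name)
    data_part = (
        '\t<DataPropertyAssertion>\n'
        '\t\t<DataProperty IRI="isInjured"/>\n'
        '\t\t<NamedIndividual IRI="#{}"/>\n'
        '\t\t<Literal datatypeIRI="http://www.w3.org/2001/XMLSchema#boolean">{}</Literal>\n'
        '\t</DataPropertyAssertion>\n'
    ).format(name, value)
    return class_part + data_part

# Hypothetical users and injuries, for illustration only.
injuries = {"user1": {"Triceps"}, "user2": set()}

text = ""
for user, injured_muscles in injuries.items():
    for muscle in ["Triceps", "Biceps"]:
        text += muscle_assertions(muscle, muscle in injured_muscles, prefix=user + "Muscle-")

print(text)
```

Giving each user a distinct prefix keeps the individuals' IRIs unique, which is one simple way to avoid two users sharing the same muscle instance.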

We have written this text file to the "Muscle Groups" folder.  From here, it is a simple matter of copying and pasting it into the *ohenTest.owl* file as before, between the comments

\<!--MC Start Major Muscle Groups --> <br> *Class and Data Assertions* <br>
\<!--MC End Major Muscle Groups--> 

These text edits can be found in the *ohenTest.owl* file.  Note that *ohenFinal.owl* does not directly contain these edits: when the file is saved and exported in Protege, comments are deleted and the manual assertions we have made are ordered differently from how they were pasted into the raw text representation.

[Back to Top](#top)