# Create RDF Adding PFAF Plant Details

## Description 
Given a file holding plant family names and the scientific and common names of plant genera within those families (those rated 3+ stars for edible, medicinal or 'other' uses), obtain the following fields from each plant's PFAF.org page:

* USDA hardiness
* Known hazards
* Habitats
* Range
* Edibility rating and uses
* Medicinal rating and uses
* Other use rating and uses
* Physical characteristics

The information in these fields is then written out in Turtle format.

An example of the input file format is:
```
    ...
    Aizoaceae (carpetweed family)
    Achyranthes aspera	Devil's Horsewhip
    Achyranthes bidentata	Niu Xi
    Achyranthes japonica	Japanese Chaff Flower
    Alternanthera sissoo	Brazilian Spinach, Sambu, Samba lettuce
    Atriplex semibaccata	Australian Saltbush, Australian saltbush, Creeping saltbush
    
    Amaranthaceae (Chenopodiaceae$, amaranth family, goosefoot family)
    Amaranthus blitum	Slender Amaranth, Purple amaranth
    Amaranthus caudatus	Love Lies Bleeding
    Amaranthus cruentus	Purple Amaranth, Red amaranth
    Amaranthus hybridus	Rough Pigweed, Slim amaranth
    ...
```
where the name (scientific and common name) values are tab-separated. 
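
As a rough illustration of this layout, the sketch below classifies a single line as a family line, a plant line or a blank separator. The function name `classify_line` and the tuples it returns are assumptions made for this example; the notebook itself performs the equivalent checks inline in the processing cell.

```python
def classify_line(line):
    """Classify one line of the names file (hypothetical helper, not part of the notebook)."""
    fields = line.strip().split('\t')
    if not fields[0]:
        return ('blank',)
    if len(fields) == 1 and fields[0].endswith(')'):
        # Family line: 'Name (synonym, synonym, ...)'
        family, _, details = fields[0].partition(' (')
        return ('family', family, details[:-1])    # Drop the closing parenthesis
    # Plant line: scientific name, then an optional tab-separated list of common names
    common = fields[1] if len(fields) > 1 else ''
    return ('plant', fields[0], common)
```

For instance, `classify_line('Aizoaceae (carpetweed family)')` would yield the family name and the text inside the parentheses, while a line with a tab would be treated as a plant entry.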

Notes:

* The file of names (Plants.txt) was created by querying PFAF.org for plants rated 3 or more stars (out of 5) for edible, medicinal or 'other' uses, and then parsing the resulting tables.
* Common names were manually corrected where there were periods, dashes or semicolons separating names. 
* Additional common names were appended to the list of PFAF names, where these names were derived from parsing the Wikipedia pages for specific scientific names. (This parsing was done for an earlier approach to locating plant-related details, and its results were reused.)
* Scientific names are suffixed by '$' to indicate that the text should not be converted to lower case and the '@en' language tag should not be appended in the Turtle output.
* Even after all the manual cleanup, errors existed in the Turtle files (such as literals exceeding 1024 characters and invalid characters in URNs). These were located by adding the data to Stardog and correcting the errors as they were encountered.
  * A work item was created to address these problems, to allow non-technical users to extend the plant ontologies.
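
The two error types mentioned above (over-long literals and invalid characters in names) could be screened for before loading the files. The following is only a sketch under stated assumptions: the helper names are hypothetical, the name pattern is deliberately stricter than what Turtle actually permits, and the 1024-character limit is taken from the note above rather than from any documented setting.

```python
import re

# Conservative pattern for the local part of a prefixed name such as ':RibesNigrum'.
# Assumption: Turtle allows more characters than this; the check is deliberately strict.
SAFE_LOCAL_NAME = re.compile(r'[A-Za-z_][A-Za-z0-9_-]*')

def is_safe_local_name(name):
    """Return True if the name uses only letters, digits, underscores and hyphens."""
    return SAFE_LOCAL_NAME.fullmatch(name) is not None

def long_literals(ttl_text, limit=1024):
    """Return the double-quoted literals in ttl_text that exceed `limit` characters."""
    # Assumes literals contain no embedded double quotes, which holds when double
    # quotes are replaced by single quotes before the text is written.
    return [lit for lit in re.findall(r'"([^"]*)"', ttl_text) if len(lit) > limit]
```

A check like this would flag a name such as `Devil'sClaw` (unescaped apostrophe) and report any literal that needs to be shortened or split.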

## Library imports

In [1]:
import re
import requests
import browsercookie
from bs4 import BeautifulSoup
import itertools, time

## Constants, lists and dictionaries

In [2]:
pfafUrl = 'https://pfaf.org/user/Plant.aspx?LatinName='

getRating = 'getRating'
langTag = 'addLangTag'

edibleUse = 'edible'
edibleUses = ['ChocolateAndSubstitute', 'CoffeeAndSubstitute', 'Coloring', 
              'CondimentAndSeasoning', 'CurdlingAgent', 'Beverage',
              'EggAndSubstitute', 'MilkAndSubstitute', 'VegetableOil', 'Pectin', 
              'Rutin', 'SugarAndSubstitute', 'TeaAndSubstitute']
medicinalUse = 'medicinal'
otherUse = 'other'

# Convert several usage tags to more meaningful strings
mappingPfafTerms = {# Convert plant part tags to more meaningful strings
                    'ApicalBud' : 'PlantBud', 
                    'Flowers' : 'PlantFlower',
                    'Fruit' : 'PlantFruit',
                    'InnerBark' : 'PlantInnerBark',
                    'Leaves' : 'PlantLeaf',
                    'Manna' : 'PlantSapAndManna',
                    'Nectar' : 'PlantNectar',
                    'Pollen' : 'PlantPollen',
                    'Root' : 'PlantRoot',
                    'Sap' : 'PlantSapAndManna',
                    'Seed' : 'PlantSeed',
                    'Seedpod' : 'PlantSeedpod',
                    'Shoots' : 'PlantShoot',
                    'Stem' : 'PlantStem',
                    # Conversion of PFAF edible uses
                    'Chocolate': 'ChocolateAndSubstitute',    
                    'Coffee' : 'CoffeeAndSubstitute',   
                    'Colouring' : 'Coloring',   
                    'Condiment' : 'CondimentAndSeasoning',   
                    'Drink' : 'Beverage',
                    'Egg' : 'EggAndSubstitute',
                    'Milk' : 'MilkAndSubstitute',   
                    'Oil' : 'VegetableOil',   
                    'Salt' : 'CondimentAndSeasoning',
                    'Sweetener' : 'SugarAndSubstitute',
                    'Tea' : 'TeaAndSubstitute',
                    # Conversion of PFAF medicinal uses
                    'Acrid' : 'Irritant',
                    'Anaesthetic' : 'Anesthetic',
                    'Anodyne' : 'Analgesic',
                    'Antidandruff' : 'Antiseborrheic',
                    'Antidermatosic' : 'SkinTreatment',
                    'Antihaemorrhoidal' : 'Antihemorrhoidal',
                    'Antiphlogistic' : 'Antiinflammatory',
                    'Antirheumatic' : 'Antiarthritic',
                    'Antitumor' : 'Anticancer',
                    'Aperient' : 'Laxative',
                    'Appetizer' : 'AppetiteStimulant',
                    'Aromatherapy' : 'AromaticAndAromatherapy', 
                    'Aromatic' : 'AromaticAndAromatherapy',
                    'Bitter' : 'Stomachic, AppetiteStimulant',
                    'BloodTonic' : 'Tonic',
                    'Cancer' : 'Anticancer',
                    'Cardiac' : 'HeartTreatment',
                    'Cardiotonic' : 'HeartTreatment',
                    'Carminative' : 'Antiflatulent',
                    'Cathartic' : 'Laxative',
                    'Detergent' : 'MedicinalCleaning',
                    'Enuresis' : 'EnuresisTreatment',
                    'Expectorant' : 'ExpectorantAndMucoactive',
                    'Febrifuge' : 'Antipyretic',
                    'Haemolytic' : 'Hemolytic',
                    'Haemostatic' : 'Coagulant',
                    'Hydrogogue' : 'Laxative',
                    'Hypnotic' : 'SedativeAndHypnotic',
                    'Hypoglycaemic' : 'Antidiabetic',
                    'Infertility' : 'InfertilityTreatment',
                    'Kidney' : 'KidneyTreatment',
                    'Narcotic' : 'Analgesic, Hypnotic',
                    'Nutritive' : 'Restorative',
                    'Plaster' : 'CastSplintAndBrace',
                    'Purgative' : 'Laxative',
                    'Refrigerant' : 'MedicinalCooling',
                    'Resolvent' : 'Anticancer',
                    'Salve' : 'Emollient',
                    'Sedative' : 'SedativeAndHypnotic',
                    'Skin' : 'SkinTreatment',
                    'Sternutatory' : 'Errhine',
                    'Stings' : 'Antiallergy',
                    'Tb' : 'Antituberculosis',
                    'UterineTonic' : 'Oxytoxic',
                    'Vd' : 'Antivenereal',
                    'Vermifuge' : 'Anthelmintic',
                    'Warts' : 'Keratolytic',
                    'Antineoplastics' : 'Anticancer',
                    'AppetiteStimulants' : 'AppetiteStimulant',
                    'Antianxiety' : 'Anxiolytic',
                    'Antipanic' : 'Anxiolytic',
                    'Antidiarrhoeal' : 'Antidiarrheal',
                    # Conversion of PFAF other uses
                    'Adhesive' : 'GumAndAdhesive',
                    'Alcohol' : 'FuelAndPower',
                    'BabyCare' : 'PersonalAndBabyCare',
                    'Beads' : 'Bead',
                    'Besom' : 'Broom',
                    'Biomass' : 'Biofuel',
                    'Bottles' : 'Bottle',
                    'Companion' : 'CompanionPlant',
                    'Containers' : 'Container',
                    'Cork' : 'WoodCorkAndSubstitute',
                    'Dye' : 'DyeInkAndPaint',
                    'Essential' : 'EssentialOil',
                    'Fuel' : 'FuelAndPower',
                    'FrictionSticks' : 'FrictionStick',
                    'Furniture' : 'FurnitureAndFurnishing',
                    'Gum' : 'GumAndAdhesive',
                    'Hair' : 'HairCare',
                    'Ink' : 'DyeInkAndPaint',
                    'Leather' : 'LeatherAndSubstitute',
                    'Lining' : 'StuffingAndLining',
                    'LiquidFeed' : 'Fertilizer',
                    'Musical' : 'MusicalInstrument',
                    'Nails' : 'NailAndSubstitute',
                    'Needles' : 'NeedleAndPin',
                    'Packing' : 'StuffingAndLining',
                    'Paint' : 'DyeInkAndPaint',
                    'Pins' : 'NeedleAndPin',
                    'Pioneer' : 'PioneerPlant',
                    'Pipes' : 'Pipe',
                    'Pitch' : 'Waterproofing',
                    'Pollution' : 'PollutionControl',
                    'Pot-Pourri' : 'Incense',    # str.title() capitalizes the letter after the hyphen
                    'Rust' : 'RustTreatment',
                    'Size' : 'SurfacePreparation',
                    'Soap' : 'SoapAndSubstitute',
                    'String' : 'Fibre',
                    'Stuffing' : 'StuffingAndLining',
                    'Teeth' : 'ToothCare',
                    'Tinder' : 'Kindling',
                    'Wax' : 'VegetableWax',
                    'Weaving' : 'PlantWeaving',
                    'Wood' : 'WoodCorkAndSubstitute'}

# Non-unique tags that are mapped from the 'medicinal use' type
mappingPfafMedicinalUsageTerms = {'Disinfectant' : 'MedicinalCleaning',
                                  'Parasiticide': 'Antiparasitic',
                                  'Plaster': 'CastSplintAndBrace'}

# (Unique) Usage tags that are discarded
discardedPfafUsage = ['Gelatine', 'Stabilizer', 
                      'Antibilious', 'Bach', 'Balsamic', 'FootCare',  
                      'Homeopathy', 'Lenitive', 'Miscellany', 'Women', 
                      'BlottingPaper', 'Buttons', 'DarningBall', 'Lighting',  
                      'Litmus', 'Microscope', 'Pencil', 'Porcelain', 'Potash', 'Raffia', 
                      'SoapMaking', 'Straw', 'WaxedPaper', 'WeatherForecasting', 'Wick'] 

# Usage tags that are duplicated 
discardedPfafUsageByType = [(medicinalUse, 'Deodorant'),    # Using deodorant in 'other use'
                            (otherUse, 'Pectin'),           # Using pectin in 'edible use'
                            (edibleUse, 'Gum')]             # Using gum in 'other use'
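
The lookups against these dictionaries and lists happen after each PFAF tag has been title-cased and stripped of spaces, so keys need to be written in that normalized form. The sketch below shows the normalization on a few tags; `normalize_tag`, `map_tag` and `sample_mapping` are hypothetical names, and the raw tag strings are assumed examples rather than text captured from PFAF.

```python
def normalize_tag(raw_tag):
    """Mirror the notebook's normalization: title-case, then drop spaces."""
    return raw_tag.title().replace(' ', '')

# Tiny illustrative subset of the mapping; the full dictionary is defined above
sample_mapping = {'Anodyne': 'Analgesic', 'BabyCare': 'PersonalAndBabyCare'}

def map_tag(raw_tag, mapping=sample_mapping):
    tag = normalize_tag(raw_tag)
    return mapping.get(tag, tag)    # Unmapped tags pass through unchanged

print(map_tag('Baby care'))                  # PersonalAndBabyCare
print(normalize_tag("Women's Complaints"))   # Women'SComplaints
print(normalize_tag('Pot-pourri'))           # Pot-Pourri
```

Note that `str.title()` capitalizes the letter following an apostrophe or hyphen, which is why tags containing punctuation end up with unexpected capitalization.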

## Functions

In [3]:
def addDetailsToPlantFile(textString, processing, predicate, plantFile):
    if textString and not('Coming soon' in textString):
        textString = cleanupText(textString).strip()
        if getRating in processing:
            # Rating is formatted as '(x of 5)'
            textString = textString[1:2]    
            plantFile.write(f'  :{predicate} "{textString}"^^xsd:integer ;\n')
        else:  
            if langTag in processing:
                plantFile.write(f'  :{predicate} "{textString}"@en ;\n')
            else:
                plantFile.write(f'  :{predicate} "{textString}" ;\n')

def addNames(textString, plantFile = None):
    if textString:
        syns = textString.split(',')     # Names are separated by commas
        finalSyns = []
        for syn in syns:
            if len(syn) > 3:
                if syn.endswith('$'):    # '$' indicates an alternate scientific name
                    finalSyns.append(f'"{syn[:len(syn) - 1].strip()}"')
                else:
                    finalSyns.append(f'"{syn.strip().lower()}"@en')
        finalString = ', '.join([str(finSyn) for finSyn in finalSyns])
        if plantFile:
            if finalString:    
                plantFile.write(f';\n  :synonym {finalString} .\n')
            else:
                plantFile.write('.\n')
        else:
            return finalString

def addUsesToPlantFile(navString, predicate, plantFile):
    useSet = set()    # Using sets since there may be duplicate tags
    partSet = set()    
    # Edible details include edible plant parts and uses
    for useTag in navString.find_all('a'):
        useTagString = useTag.string.title().replace(' ', '')
        # Account for invalid capitalization for Women's Complaints (which we want to discard anyway)
        if not('Women' in useTagString):   
            if not(useTagString in discardedPfafUsage) \
            and not((predicate, useTagString) in discardedPfafUsageByType):
                # Fix up tags
                if predicate == medicinalUse and useTagString in mappingPfafMedicinalUsageTerms:
                    useTagString = mappingPfafMedicinalUsageTerms[useTagString]
                elif useTagString in mappingPfafTerms:
                    useTagString = mappingPfafTerms[useTagString]
                if not(predicate == edibleUse) or useTagString in edibleUses:
                    useSet.add(useTagString)
                else:
                    partSet.add(useTagString)
    if bool(partSet):
        finalString = ', '.join(f':{part}' for part in partSet)
        plantFile.write(f'  :plant_edible_part {finalString} ;\n')
    if bool(useSet):
        finalString = ', '.join(f':{use}' for use in useSet)
        plantFile.write(f'  :plant_{predicate}_use {finalString} ;\n')
        
    last = getLastText(navString.stripped_strings)
    if last is not None:
        if predicate == otherUse:    # Compare string values with ==, not identity
            plantFile.write(f'  :plant_other_use_text "{cleanupText(last)}"@en .\n\n')
        else:
            plantFile.write(f'  :plant_{predicate}_text "{cleanupText(last)}"@en ;\n')
    elif predicate == otherUse:
        plantFile.write('  .\n\n')    # Clean up by closing the Turtle declaration

def cleanupText(textString):
    # Remove PFAF bibliographic references (found within square brackets), replace double quotes 
    # with single quotes and remove CRs
    textString = re.sub(r'\[.*?\]', '', textString)    # Raw string avoids invalid-escape warnings
    textString = textString.replace('\n', ' ').replace('\r', '')
    return textString.replace('"', "'")
    
def getLastText(texts):
    try:
        *_, last = texts
    except ValueError:
        return None
    return last
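
The rating handling in `addDetailsToPlantFile` relies on the string being shaped exactly like '(x of 5)' and takes the second character. As a hedged alternative, not what the notebook does, a regular expression could make that assumption explicit and return nothing when the text has a different shape. The name `parse_rating` is hypothetical.

```python
import re

def parse_rating(text):
    """Extract x from a string shaped like '(x of 5)'; return None if it does not match."""
    match = re.search(r'\((\d) of 5\)', text or '')
    return int(match.group(1)) if match else None
```

With this approach, unexpected text such as a placeholder message would be skipped rather than written out as a malformed integer literal.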

## Process the family and genus names

In [4]:
# Get PFAF cookies from an existing https://pfaf.org page opened in a browser 
# The code below uses Chrome (for ex, to use Firefox, change the code to browsercookie.firefox())
# Sending the cookies is necessary to get a complete response from requests.get()
# Note that you may have to give specific permission for this code to access your browser's cookie data
cj = browsercookie.chrome()

with open('Plants.txt', 'r') as plantsFile:
    plantData = plantsFile.read()

# Get each line from the file
plants = plantData.split('\n')

# Check each line for either a family name or scientific and common names of genuses 
#    within the family
# Write the family names to a file, to be added to the commodity-plants.ttl file
# Also use the family name to create a .ttl file to hold the genus details
familyName = ''
with open('Families.ttl', 'a') as ttlFamilies:
    for plantName in plants:
        names = plantName.split('\t')     # Tab-separated data
        if names[0]:                      # Blank lines were inserted for readability and can be ignored
            if ')' in names[0]:           # If the line includes a closing parenthesis, then it is a family name
                familyNames = names[0].split(' (')    # Split the family details outside vs inside the parentheses
                familyName = familyNames[0]
                print(f'Family: {familyName}')
                ttlFamilies.write(f':{familyName} a owl:Class ;\n  rdfs:subClassOf :FloweringPlant ')             
                if not('(family)' in names[0]):       # If there are details in parentheses, these are synonyms
                    ttlFamilies.write(f';\n  :synonym {addNames(familyNames[1][:len(familyNames[1]) - 1])} ')
                ttlFamilies.write('.\n\n')
                with open(f'plants_{familyName}.ttl', 'a') as plantFile:
                    plantFile.write('@prefix : <urn:ontoinsights:ontology:dna:> .\n'\
                                    '@prefix dna: <urn:ontoinsights:ontology:dna:> .\n'\
                                    '@prefix owl: <http://www.w3.org/2002/07/owl#> .\n'\
                                    '@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .\n'\
                                    '@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n'\
                                    '@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n\n'\
                                    '########################################################################\n'\
                                    '# File defining plants in the family, ')
                    plantFile.write(familyName)    # familyName is the bare family name, so no slicing is needed
                    plantFile.write('\n# \n'\
                                    '# Created: February 8, 2020\n'\
                                    '# Last modified: February 9, 2020\n'\
                                    '########################################################################\n\n'\
                                    '########################################################################\n'\
                                    '# Classes and Punned Individuals\n'\
                                    '########################################################################\n\n')
            else:
                genusName = names[0]
                print(genusName)
                if len(names) > 1:
                    time.sleep(2)    # Pace requests to the PFAF website to avoid being blocked
                    pfafPlantUrl = f"{pfafUrl}{genusName.replace(' ', '+')}"
                    resp = requests.get(pfafPlantUrl, cookies=cj)
                    if resp.status_code == 200:    # Valid response
                        soup = BeautifulSoup(resp.content, 'html.parser')
                        with open(f'plants_{familyName}.ttl', 'a') as plantFile:
                            urn = genusName.title().replace(' ', '')
                            plantFile.write(f":{urn} a owl:Class ;\n  rdfs:subClassOf :{familyName} ")
                            addNames(names[1], plantFile)
                            plantFile.write(f':{urn} \n')
                            addDetailsToPlantFile(soup.find(id='ctl00_ContentPlaceHolder1_lblUSDAhardiness').string,
                                                  ' ', 'plant_hardiness_zones', plantFile)
                            addDetailsToPlantFile(soup.find(id='ctl00_ContentPlaceHolder1_lblKnownHazards').string, 
                                                  langTag, 'plant_hazards', plantFile)
                            addDetailsToPlantFile(soup.find(id='ctl00_ContentPlaceHolder1_txtHabitats').string, 
                                                  langTag, 'plant_habitat', plantFile)
                            addDetailsToPlantFile(soup.find(id='ctl00_ContentPlaceHolder1_lblRange').string, 
                                                  langTag, 'plant_country_range', plantFile)
                            addDetailsToPlantFile(soup.find(id='ctl00_ContentPlaceHolder1_txtEdrating').string, 
                                                  getRating, 'plant_edibility_rating', plantFile)
                            addDetailsToPlantFile(soup.find(id='ctl00_ContentPlaceHolder1_txtMedRating').string, 
                                                  getRating, 'plant_medicinal_rating', plantFile)
                            addDetailsToPlantFile(soup.find(id='ctl00_ContentPlaceHolder1_txtOtherUseRating').string, 
                                                  getRating, 'plant_other_use_rating', plantFile)
                            addDetailsToPlantFile(soup.find(id='ctl00_ContentPlaceHolder1_lblPhystatment').text, 
                                                  langTag, 'plant_physical_characteristics', plantFile)
                            addUsesToPlantFile(soup.find(id='ctl00_ContentPlaceHolder1_txtEdibleUses'), 
                                               edibleUse, plantFile)
                            addUsesToPlantFile(soup.find(id='ctl00_ContentPlaceHolder1_txtMediUses'), 
                                               medicinalUse, plantFile)
                            addUsesToPlantFile(soup.find(id='ctl00_ContentPlaceHolder1_txtOtherUses'), 
                                               otherUse, plantFile)
                    else:
                        print(f'Error response: {resp.status_code} for {genusName}')
                        break

Family: Grossulariaceae
Ribes aciculare
Ribes alpinum
Ribes altissimum
Ribes aureum
Ribes burejense
Ribes curvatum
Ribes cynosbati
Ribes divaricatum
Ribes fragrans
Ribes gayanum
Ribes himalense
Ribes hirtellum
Ribes horridum
Ribes janczewskii
Ribes lacustre
Ribes longiracemosum
Ribes maximowiczii
Ribes maximowiczii floribundum
Ribes missouriense
Ribes montigenum
Ribes nigrum
Ribes odoratum
Ribes oxyacanthoides
Ribes palczewskii
Ribes petiolare
Ribes petraeum
Ribes petraeum biebersteinii
Ribes pinetorum
Ribes procumbens
Ribes punctatum
Ribes rotundifolium
Ribes rubrum
Ribes sachalinense
Ribes sativum
Ribes triste
Ribes uva-crispa
Ribes warszewiczii
Ribes x culverwellii
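
The request URL and the Turtle class name for each plant are both derived from the scientific name in the cell above. The sketch below isolates those two derivations so they can be checked without contacting PFAF. The function names are hypothetical, and `quote_plus` is a substitution for the notebook's `replace(' ', '+')`: it produces the same result for names containing only letters, spaces and hyphens, and additionally percent-encodes other reserved characters.

```python
from urllib.parse import quote_plus

PFAF_URL = 'https://pfaf.org/user/Plant.aspx?LatinName='    # Same base URL as pfafUrl

def plant_url(scientific_name):
    """Build the PFAF page URL for a scientific name."""
    return PFAF_URL + quote_plus(scientific_name)

def plant_urn(scientific_name):
    """Title-case each word and remove spaces, as in the processing cell."""
    return scientific_name.title().replace(' ', '')

print(plant_url('Ribes nigrum'))           # https://pfaf.org/user/Plant.aspx?LatinName=Ribes+nigrum
print(plant_urn('Ribes x culverwellii'))   # RibesXCulverwellii
print(plant_urn('Ribes uva-crispa'))       # RibesUva-Crispa
```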


# Unit tests