# To Code

## template mapping files are in the git repository
## original data in _CyVerse Discovery Environment_ 
### data file is: "Extant Aepyceros database_updated 11_2016.csv"

### _lifeStage_
- in _Age (juv, prime adult, older adult, old)_
- create new columns ageValue and ageUnit
- separate out lifeStage (e.g., juvenile, adult) from ageValue and ageUnit
- ageUnit is singular and spelled out (e.g., "year")
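
The split described above can be sketched with pandas string methods. This is only an illustration: the sample strings are hypothetical stand-ins for the `Age (juv, prime adult, older adult, old)` column, and the names `raw`, `life_stage`, `age_text`, and `parts` are not part of the notebook.

```python
import pandas as pd

# Hypothetical raw values in the style of the age column
raw = pd.Series(["Prime Adult", "Juvenile (22-24 months)", "Old Adult (8)"])

# Life stage: the text before an optional parenthesised part
life_stage = raw.str.replace(r"\s*\(.*\)\s*$", "", regex=True).str.strip()

# Age text: the content of the parentheses, NaN when there are none
age_text = raw.str.extract(r"\(([^)]*)\)", expand=False)

# Split the age text into a value and an optional unit word
parts = age_text.str.extract(
    r"^\s*(?P<ageValue>[\d.\-]+)\s*(?P<ageUnit>[A-Za-z]+)?\s*$"
)
# Units are singular and spelled out, e.g. "months" becomes "month"
parts["ageUnit"] = parts["ageUnit"].str.rstrip("s")
```

Rows without a written unit come back with a missing `ageUnit`; whether to default those to "year", as the cell below does, is a separate decision.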

### _measurementUnit_
- linear measurements are in "mm"
- separate unit from _Weight_
    - Weight is in "lb"
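
If the unit were written inside the weight values themselves, it could be separated roughly as follows. This is a sketch under that assumption: the sample strings are hypothetical, the real _Weight_ column may already be purely numeric, and `weight`, `extracted`, `measurement_value`, and `measurement_unit` are illustrative names.

```python
import pandas as pd

# Hypothetical weight strings; the real column may contain numbers only
weight = pd.Series(["120 lb", "95lb", "101.5"])

# Capture the number and an optional trailing unit word
extracted = weight.str.extract(
    r"^\s*(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-z]+)?\s*$"
)
measurement_value = pd.to_numeric(extracted["value"])
# Fall back to "lb", the unit stated in the notes, when none is written
measurement_unit = extracted["unit"].fillna("lb")
```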

### _sex_
- make lowercase
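
The cells below rename `SEX` to `sex` but do not appear to change its values. A minimal sketch of the lowercasing step, using hypothetical sample values and an illustrative `sex_clean` name:

```python
import pandas as pd

# Hypothetical values for the source "SEX" column
sex = pd.Series(["M", "F", " Male", None])

# Trim whitespace and lowercase; missing values stay missing
sex_clean = sex.str.strip().str.lower()
```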

### _unused columns_
- Location code
- Notes

In [1]:
import pandas as pd
import re
import uuid

In [2]:
#Import Aepyceros Data Locally
aepyceros = pd.read_csv("../Original Data/Extant Aepyceros database_updated 11_2016.csv")

In [3]:
#Preliminary data cleaning

#add columns for lifeStage, ageValue, ageUnit, and ageEstimation
aepyceros=aepyceros.assign(lifeStage="")
aepyceros=aepyceros.assign(ageValue = "")
aepyceros=aepyceros.assign(ageUnit = "year")
aepyceros=aepyceros.assign(ageEstimation = "")

for ind in aepyceros.index:
    x=aepyceros.loc[ind, 'Age (juv, prime adult, older adult, old)']
    y=str(x)
    z=y.split()
    
    #Organizing ageValue into correct column
    #a is the text between "(" and ")"; without parentheses find() returns -1,
    #so a is y minus its last character (hence "Prime Adul")
    a=y[y.find("(")+1:y.find(")")]
    if a == "Prime Adul":
        aepyceros.loc[ind, 'ageValue']=""
        aepyceros.loc[ind, 'ageUnit']=""
    elif a == "22-24 months":
        aepyceros.loc[ind, 'ageValue']="22-24"
        aepyceros.loc[ind, 'ageUnit']="month"
    elif a == "No teeth, horns fully developed Adul":
        aepyceros.loc[ind, 'ageValue']=""
        aepyceros.loc[ind, 'ageUnit']=""
        aepyceros.loc[ind, 'ageEstimation']="No teeth, horns fully developed Adult"
    elif a == "Molars well worn":
        aepyceros.loc[ind, 'ageValue']=""
        aepyceros.loc[ind, 'ageUnit']=""
        aepyceros.loc[ind, 'ageEstimation']="Molars well worn"
    else:
        aepyceros.loc[ind, 'ageValue']=a
        
    #Organizing ageEstimation further
    if "M3 erupting" in y:
        aepyceros.loc[ind, 'ageEstimation']="M3 erupting"
    elif "molars not well worn" in y:
        aepyceros.loc[ind, 'ageEstimation']="molars not well worn"
    elif "M1 slightly worn-all M3s erupted" in y:
        aepyceros.loc[ind, 'ageEstimation']="M1 slightly worn-all M3s erupted"

    #Organizing lifeStage column (keyed on the first word of the raw value)
    if z[0] == "Prime":
        aepyceros.loc[ind, 'lifeStage']="Prime Adult"
    elif z[0] == "Juvenile":
        aepyceros.loc[ind, 'lifeStage']="Juvenile"
    elif z[0] == "juvenile":
        aepyceros.loc[ind, 'lifeStage']="Juvenile"
    elif z[0] == "Old":
        aepyceros.loc[ind, 'lifeStage']="Old Adult"
    elif z[0] == "Older":
        aepyceros.loc[ind, 'lifeStage']="Older Adult"
    elif z[0] == "Young":
        aepyceros.loc[ind, 'lifeStage']="Young Adult"
    elif z[0] == "Very":
        aepyceros.loc[ind, 'lifeStage']="Very Old"
    elif z[0] == "No":
        aepyceros.loc[ind, 'lifeStage']="Adult"
    else:
        aepyceros.loc[ind, 'lifeStage']=y
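
Writes of the form `df['col'][ind] = value` index twice, and pandas may apply the assignment to a temporary copy; that is what its "A value is trying to be set on a copy of a slice" message refers to. A minimal sketch of the single-step `.loc` form on a toy frame (the frame `df` here is illustrative, not the real data):

```python
import pandas as pd

df = pd.DataFrame({"ageValue": ["", ""], "ageUnit": ["year", "year"]})

# Chained form (two indexing steps, may write to a temporary copy):
#     df["ageValue"][0] = "22-24"
# Single-step form: row label and column name in one .loc call
df.loc[0, "ageValue"] = "22-24"
df.loc[0, "ageUnit"] = "month"
```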

In [4]:
#Rearrange columns so that template columns are first, followed by measurement values

#Specify desired columns, in the order they should appear
cols = ['Museum',
        'Specimen #',
        'Species',
        'SEX',
        'Country/Continent',
        'State/Province',
        'lifeStage',
        'ageValue',
        'ageUnit',
        'ageEstimation',
        'Weight',
        'Humerus Length',
        'Medapodial Length',#fovt name?
        'Medapodial Width AP',#fovt name?
        'Medapodial Width ML',#fovt name?
        'Astragalus Length',#fovt name?
        'Astragalus Width',#fovt name?
        'Femur Length']

#Subset dataframe
aepyceros = aepyceros[cols]
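
Selecting with a list of labels raises a `KeyError` if any label is absent, so it can help to check the list against the frame first. A sketch on a toy frame; `df`, `wanted`, `missing`, and `subset` are illustrative names:

```python
import pandas as pd

# Toy frame standing in for the raw data
df = pd.DataFrame({"Museum": ["A"], "Specimen #": [1], "Notes": ["x"]})
wanted = ["Museum", "Specimen #", "Weight"]

# Columns requested but absent from the frame
missing = [c for c in wanted if c not in df.columns]
# Keep only those that exist, preserving the requested order
subset = df[[c for c in wanted if c in df.columns]]
```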

In [5]:
#Matching template and column terms

#Renaming columns 
aepyceros = aepyceros.rename(columns = {'Museum':'institutionCode',
                                        'Specimen #':'individualID',
                                        'Species':'scientificName',
                                        'SEX':'sex',
                                        'Country/Continent':'country',
                                        'State/Province':'stateProvince'})

In [6]:
#Matching trait and ontology terms

#Renaming columns
aepyceros = aepyceros.rename(columns={'Weight':'body mass',
                                      'Humerus Length':'humerus length from caput',
                                      'Femur Length':'femur length'})


In [7]:
#Create measurementUnit column
aepyceros=aepyceros.assign(measurementUnit="")

In [8]:
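#create materialSampleID, a UUID for each specimen (one per row of the wide table)
#a list comprehension is needed: a single uuid.uuid4() call would give every row the same value
aepyceros['materialSampleID'] = [uuid.uuid4() for _ in range(len(aepyceros.index))]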
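
Assigning the result of one `uuid.uuid4()` call to a column broadcasts that single value to every row. The toy sketch below contrasts this with one call per row; the column names `sharedID` and `rowID` are made up for the illustration.

```python
import uuid
import pandas as pd

df = pd.DataFrame({"individualID": ["a", "b", "c"]})

# A single call is broadcast: every row receives the same UUID
df["sharedID"] = str(uuid.uuid4())
# One call per row gives each row its own identifier
df["rowID"] = [str(uuid.uuid4()) for _ in range(len(df))]
```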

In [9]:
#create long version so that each trait has its own row

#creating long version, first specifying the id (keep) variables, then naming the variable and value columns
longVers=pd.melt(aepyceros, 
                id_vars=['institutionCode',
                         'individualID',
                         'scientificName',
                         'sex',
                         'country',
                         'stateProvince',
                         'lifeStage',
                         'ageValue',
                         'ageUnit',
                         'ageEstimation',
                         'measurementUnit',
                         'materialSampleID'], 
                var_name = 'measurementType', 
                value_name = 'measurementValue')
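
To see what the reshape does, here is `pd.melt` on a two-specimen toy table (the values are invented for illustration). Every column not listed in `id_vars` becomes one row per specimen.

```python
import pandas as pd

wide = pd.DataFrame({
    "individualID": ["s1", "s2"],
    "body mass": [120, 95],
    "femur length": [210.0, 198.5],
})

# One row per (specimen, trait) pair
long = pd.melt(
    wide,
    id_vars=["individualID"],
    var_name="measurementType",
    value_name="measurementValue",
)
```

Traits with no recorded value are kept as rows with a missing `measurementValue`; dropping them would be an additional step.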


In [10]:
#Populating measurementUnit column with appropriate measurement units in long version
#body mass is in "lb"; all other (linear) measurements are in "mm"
longVers['measurementUnit'] = "mm"
longVers.loc[longVers['measurementType'] == "body mass", 'measurementUnit'] = "lb"
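
The same unit rule can also be written as a lookup, which may scale better if more non-millimetre traits are added later. A sketch on a toy frame; the dictionary name `unit_by_type` is an assumption of this example.

```python
import pandas as pd

long = pd.DataFrame(
    {"measurementType": ["body mass", "femur length", "Astragalus Width"]}
)

# Units from the notes: weight in "lb", linear measurements in "mm"
unit_by_type = {"body mass": "lb"}
# Types missing from the lookup fall back to "mm"
long["measurementUnit"] = long["measurementType"].map(unit_by_type).fillna("mm")
```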

In [11]:
#create diagnosticID which is a UUID for each measurement
longVers=longVers.assign(diagnosticID = '')
longVers['diagnosticID'] = [uuid.uuid4() for _ in range(len(longVers.index))]

In [12]:
#Writing long data csv file
longVers.to_csv('../Mapped Data/Aepyceros_Data_Long.csv')
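
By default `to_csv` also writes the row index as an unnamed first column. Whether that column is wanted in the mapped file is a choice for the template; the sketch below only shows the difference, writing to in-memory buffers rather than to disk.

```python
import io
import pandas as pd

df = pd.DataFrame({"measurementType": ["body mass"], "measurementValue": [120]})

with_index = io.StringIO()
df.to_csv(with_index)                  # default: index written as first column
without_index = io.StringIO()
df.to_csv(without_index, index=False)  # index omitted

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
```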