# Manual Changes

## template mapping files are in the git repository
## original data in _CyVerse Discovery Environment_ 
### data file is: "J.Biogeo.2008.AllData.Final.csv"

### _catalogNumber_
- in Specimen.Number column (new catalogNumber)
- separate out institutionCode from Specimen.Number
- create new column titled institutionCode

### _measurementUnit_
- either in "g" or "mm"

### _otherCatalogNumbers_
- concatenated list of:
    - Proxy.Specimen.Number
    - Annual.Specimen.Number
    - YOC.Specimen.Number
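
The concatenation described above can be sketched on a toy table. This is only an illustration: the sample values, the `join_present` helper, and the `" | "` separator are assumptions, not part of the original mapping. It shows one way to join the three identifiers while skipping missing entries, so that adjacent values do not run together.

```python
import pandas as pd

# Hypothetical values for illustration; the real columns come from the source CSV.
toy = pd.DataFrame({
    "Proxy.Specimen.Number": ["P-1", None, "P-3"],
    "Annual.Specimen.Number": ["A-1", "A-2", None],
    "YOC.Specimen.Number": [None, "Y-2", None],
})

id_cols = ["Proxy.Specimen.Number", "Annual.Specimen.Number", "YOC.Specimen.Number"]

def join_present(row, sep=" | "):
    """Join the non-missing identifiers in a row with a separator (hypothetical helper)."""
    return sep.join(str(v) for v in row if pd.notna(v))

toy["otherCatalogNumbers"] = toy[id_cols].apply(join_present, axis=1)
print(toy["otherCatalogNumbers"].tolist())
```

A separator keeps the individual identifiers recoverable later; whether the template expects one is not stated here, so treat the choice as open.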

### _unused columns_
- datum (geodetic datum for the latitude/longitude coordinates)

## To Code
### _elevationInMeters_
- in _elevation.ft_
- convert to meters
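
As a minimal sketch of the conversion, assuming a toy table with made-up elevations (the `elevation.m` column name and the sample values are illustrative only), multiplying by 0.3048 turns feet into meters and leaves missing values missing:

```python
import pandas as pd

FEET_TO_METERS = 0.3048  # exact by definition of the international foot

# Hypothetical elevations in feet; NaN stands in for a missing record.
toy = pd.DataFrame({"elevation.ft": [0.0, 1000.0, 5280.0, float("nan")]})

# Write the result to a separate column so the source values stay inspectable.
toy["elevation.m"] = toy["elevation.ft"] * FEET_TO_METERS
print(toy)
```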

In [1]:
import pandas as pd
import numpy as np
import re
import uuid

In [2]:
#Import Biogeo Data Locally
biogeo = pd.read_csv("../Original Data/biogeo.csv")

In [3]:
#Preliminary data cleaning

#Convert elevation.ft values from feet to meters
#1 foot is exactly 0.3048 meters
biogeo['elevation.ft']=biogeo['elevation.ft'].multiply(0.3048)
#The column keeps the name elevation.ft for now; it is renamed in a later cell

In [4]:
#Preliminary data cleaning

#Create a new institutionCode column from the first whitespace-delimited token of Specimen.Number;
#the remainder stays in Specimen.Number (empty if there is no second token).
#Whole-column assignment avoids chained indexing and the SettingWithCopyWarning it triggers.
tokens = biogeo['Specimen.Number'].astype(str).str.split(n=1)
biogeo['institutionCode'] = tokens.str[0]
biogeo['Specimen.Number'] = tokens.str[1].fillna('')


In [5]:
#Add measurementUnit column 
biogeo=biogeo.assign(measurementUnit = "")

In [6]:
#Add otherCatalogNumbers: concatenation of the Proxy, Annual, and YOC specimen numbers
biogeo=biogeo.assign(otherCatalogNumbers = biogeo['Proxy.Specimen.Number'].fillna('')+biogeo['Annual.Specimen.Number'].fillna('')+biogeo['YOC.Specimen.Number'].fillna('') )



In [7]:
#Rearrange columns so that template columns are first, followed by measurement values

#Specify desired columns, in output order
cols = ['Specimen.Number',
        'institutionCode',
        'otherCatalogNumbers',
        'dec.lat',
        'dec.long',  
        'max.error',
        'elevation.ft',
        'ear.length.mm',
        'hind.foot.length.mm',
        'tail.length.mm',
        'total.length.mm',
        'body.mass.g',
        'measurementUnit']

#Subset dataframe
biogeo = biogeo[cols]

In [8]:
#Matching template and column terms

#Renaming columns 
biogeo = biogeo.rename(columns = {'Specimen.Number':'catalogNumber', 
                                  'dec.lat':'decimalLatitude', 
                                  'dec.long':'decimalLongitude',  
                                  'max.error':'coordinateUncertaintyInMeters', 
                                  'elevation.ft':'pointElevationInMeters'})

In [9]:
#Matching trait and ontology terms

#Renaming columns
biogeo = biogeo.rename(columns={'ear.length.mm':'ear length',
                                'hind.foot.length.mm':'hind foot length',
                                'tail.length.mm': 'tail length',
                                'total.length.mm':'full body length',
                                'body.mass.g':'body mass'})

In [10]:
#create materialSampleID which is a UUID for each specimen (one per row)
#a bare uuid.uuid4() would be broadcast, giving every row the same ID
biogeo['materialSampleID'] = [uuid.uuid4() for _ in range(len(biogeo.index))]
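
To see why the identifier has to be generated once per row, here is a small sketch on a hypothetical three-row table (the `sharedID` column exists only for the comparison). Assigning the result of a single `uuid.uuid4()` call broadcasts that one value to every row, while a comprehension calls it once per row:

```python
import uuid
import pandas as pd

toy = pd.DataFrame({"catalogNumber": ["1", "2", "3"]})  # hypothetical rows

# A scalar is broadcast: every row receives the same UUID string.
toy["sharedID"] = str(uuid.uuid4())

# One call per row yields a distinct UUID for each row.
toy["materialSampleID"] = [str(uuid.uuid4()) for _ in range(len(toy))]

print(toy["sharedID"].nunique())
print(toy["materialSampleID"].is_unique)
```

Checking `is_unique` on the identifier column before writing the output is a cheap guard against this kind of broadcast.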

In [11]:
#create long version so that each trait has its own row

#creating long version, first specifying the identifier (keep) variables, then naming the variable and value columns
longVers=pd.melt(biogeo, 
                id_vars=['catalogNumber',
                         'institutionCode',
                         'otherCatalogNumbers',
                         'decimalLatitude',
                         'decimalLongitude',  
                         'coordinateUncertaintyInMeters',
                         'pointElevationInMeters',
                         'materialSampleID',
                         'measurementUnit'], 
                          var_name = 'measurementType', 
                          value_name = 'measurementValue')

#Populating measurementUnit column with appropriate measurement units in long version:
#body mass is in grams, every length trait is in millimeters.
#Whole-column assignment avoids chained indexing and the SettingWithCopyWarning it triggers.
longVers['measurementUnit'] = np.where(longVers['measurementType'] == 'body mass', 'g', 'mm')
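
The wide-to-long reshaping can be illustrated on a toy table. The sample values and the `units` lookup below are assumptions made for the sketch, not the notebook's data. `pd.melt` turns each trait column into its own row, and a dictionary lookup with a fallback assigns the unit:

```python
import pandas as pd

# Hypothetical wide table: one row per specimen, one column per trait.
wide = pd.DataFrame({
    "catalogNumber": ["100", "101"],
    "body mass": [21.5, 19.0],
    "tail length": [80.0, 76.0],
})

long_df = pd.melt(
    wide,
    id_vars=["catalogNumber"],
    var_name="measurementType",
    value_name="measurementValue",
)

# Look up units by trait name; traits not in the mapping fall back to "mm".
units = {"body mass": "g"}
long_df["measurementUnit"] = long_df["measurementType"].map(units).fillna("mm")
print(long_df)
```

Two specimens with two traits each produce four rows. A lookup table like `units` scales more easily than an if/else if further units are ever added.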


In [12]:
#create diagnosticID which is a UUID for each measurement
longVers=longVers.assign(diagnosticID = '')
longVers['diagnosticID'] = [uuid.uuid4() for _ in range(len(longVers.index))]

In [13]:
#Writing long data csv file
longVers.to_csv('../Mapped Data/Biogeo_Data_Long.csv')