# Manual Changes

## template mapping files are in the git repository

## original data in _CyVerse Discovery Environment_ 
### data file is: "ODOVIRGCLEAN.csv"

### _lifeStage_ and _ageValue_
- in _lifestage_
- create new columns _ageValue_ and _ageUnit_
- separate out lifeStage (e.g., juvenile, adult) from ageValue and ageUnit
- make sure ageUnit is spelled out and singular (e.g., "year")
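
The unit clean-up in the last bullet was done by hand; it could be scripted along the lines below. This is only a sketch: the helper name `normalize_age_unit` and the abbreviation table are hypothetical, and the spellings that actually occur in the data file may differ.

```python
# Hypothetical lookup of abbreviations; extend it to match the spellings in the data.
UNIT_ALIASES = {
    "yr": "year", "yrs": "year", "y": "year",
    "mo": "month", "mos": "month",
    "wk": "week", "wks": "week",
    "d": "day",
}

def normalize_age_unit(raw):
    """Return a spelled-out, singular unit such as "year"."""
    unit = raw.strip().lower().rstrip(".")
    if unit in UNIT_ALIASES:
        return UNIT_ALIASES[unit]
    # Fall back to dropping a plural "s" (e.g., "years" -> "year").
    if unit.endswith("s") and len(unit) > 1:
        return unit[:-1]
    return unit

print(normalize_age_unit("Years"))   # year
print(normalize_age_unit("yrs."))    # year
print(normalize_age_unit("month"))   # month
```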

### _yearCollected_
- in _eventDate_
- create new column _yearCollected_
- separate out year
- include century as well (e.g., 1999)

### _unused columns_
- LocationCode
- Note

## To Code
### _measurementValue_
- select only "1st_" measurement
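
One way to pick out the "1st_" measurements is to filter on the column-name prefix. The sketch below uses a toy frame; the "2nd_" column is invented here only to show that it is dropped, and is not claimed to exist in the real file.

```python
import pandas as pd

# Toy frame; the "2nd_" column is invented purely to show the filter.
toy = pd.DataFrame({
    "catalognumber": ["A1", "A2"],
    "1st_body_mass": [50000, 61000],
    "2nd_body_mass": [50500, None],
})

# Keep only columns whose names start with "1st_".
first_measurements = toy.filter(regex=r"^1st_")
print(first_measurements.columns.tolist())  # ['1st_body_mass']
```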

### _measurementUnit_
- make sure either in "g" or "mm"

In [1]:
import pandas as pd
import re
import uuid

In [2]:
#Import Deer VertNet Data Locally
deer = pd.read_csv("../Original Data/ODOVIRGCLEAN.csv")
#Import Deer VertNet Data from Cyverse
#deer = pd.read_csv("https://de.cyverse.org/dl/d/126821C9-D23A-4B22-9B3F-25F19311066E/ODOVIRGCLEAN.csv")

In [3]:
#Preliminary data cleaning

#Where ageValue holds a number followed by a unit (e.g., "2 years"), keep the number
#in ageValue and write the singular unit "year" to the ageUnit column.
for ind in deer.index:
    y = str(deer.loc[ind, 'ageValue'])
    z = y.split()

    #Skip empty strings; .loc writes to deer itself rather than to a temporary copy
    if z and any(char.isdigit() for char in z[0]):
        deer.loc[ind, 'ageUnit'] = "year"
        if len(z) > 1:
            #re.escape keeps the unit text from being read as a regex pattern
            y = re.sub(re.escape(z[1]), '', y).strip()
        deer.loc[ind, 'ageValue'] = y
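
The same split can be written without a row loop. The sketch below is an alternative to the cell above, not a copy of it: it looks for a leading number with `Series.str.extract`, whereas the cell tests whether the first token contains any digit. The toy values are invented, and the unit is assumed to always be "year", as in the cell.

```python
import pandas as pd

# Toy values standing in for the ageValue column; the real entries may differ.
toy = pd.DataFrame({"ageValue": ["2 years", "1.5 yr", "adult", None]})

# Capture a leading number; rows without one come back as missing.
number = toy["ageValue"].str.extract(r"^\s*(\d+(?:\.\d+)?)", expand=False)

has_number = number.notna()
toy["ageUnit"] = None
toy.loc[has_number, "ageUnit"] = "year"               # unit assumed to be years
toy.loc[has_number, "ageValue"] = number[has_number]  # keep only the number
print(toy)
```

Rows such as "adult" are left untouched, so the life-stage text survives for later mapping.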


In [4]:
#Preliminary data cleaning

#Parse the eventdate column, identify the year, and store it in a new yearCollected column
deer = deer.assign(yearCollected = '')

for ind in deer.index:
    b = str(deer.loc[ind, 'eventdate'])

    if '/' in b:
        #slash-delimited dates: the year is taken from the third field
        deer.loc[ind, 'yearCollected'] = b.split('/')[2]
    elif '-' in b:
        #dash-delimited dates: the year is taken from the first field
        deer.loc[ind, 'yearCollected'] = b.split('-')[0]
    else:
        #anything else is copied through unchanged
        deer.loc[ind, 'yearCollected'] = b
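
A regular-expression search is another way to pull the year out, and it does not depend on which delimiter a date uses. This is a different technique from the split in the cell above; the helper name `extract_year` is hypothetical, and the sketch assumes every year is written with four digits.

```python
import re

# Four consecutive digits that are not part of a longer number.
YEAR_PATTERN = re.compile(r"(?<!\d)(\d{4})(?!\d)")

def extract_year(event_date):
    """Return the first four-digit year in event_date, or "" if there is none."""
    match = YEAR_PATTERN.search(str(event_date))
    return match.group(1) if match else ""

for value in ["6/15/1999", "1999-06-15", "1999", "6/1999", float("nan")]:
    print(repr(value), "->", repr(extract_year(value)))
```

It could be applied to the whole column with `deer['eventdate'].map(extract_year)`. Unlike the split, missing dates come back as an empty string instead of the text "nan".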


In [5]:
#Rearrange columns so that template columns are first, followed by measurement values

#Specify desired columns, in the order they should appear
cols = ['catalognumber',
        'collectioncode',
        'decimallatitude',
        'decimallongitude',
        'eventdate',
        'institutioncode',
        'lifestage',
        'ageValue',
        'ageUnit',
        'locality',
        'sex',
        'scientificname',
        'yearCollected',
        '1st_body_mass',
        '1st_ear_length',
        '1st_hind_foot_length',
        '1st_tail_length',
        '1st_total_length']

#Subset dataframe
deer = deer[cols]

In [6]:
#Matching template and column terms

#Renaming columns 
deer = deer.rename(columns = {'catalognumber':'catalogNumber', 
                            'collectioncode':'collectionCode',
                            'decimallatitude':'decimalLatitude',
                            'decimallongitude':'decimalLongitude',
                            'eventdate':'verbatimEventDate',
                            'institutioncode' :'institutionCode',
                            'lifestage':'verbatimAgeValue',
                            'locality':'verbatimLocality',
                            'scientificname':'scientificName'})


In [7]:
#Matching trait and ontology terms

#Renaming columns
deer = deer.rename(columns={'1st_body_mass':'body mass',
                            '1st_ear_length': 'ear length',
                            '1st_hind_foot_length':'hind foot length',
                            '1st_tail_length':'tail length',
                            '1st_total_length':'full body length'})


In [8]:
#create new column individualID, a unique identifier built by joining collectionCode, institutionCode, and catalogNumber
deer=deer.assign(individualID = deer['collectionCode'] + deer['institutionCode']+ deer['catalogNumber'])
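
Joining columns with `+` only works when every part is a string; if pandas reads catalogNumber as a number, the addition raises a TypeError. A minimal sketch of a more defensive version follows. The toy codes and the "-" separator are assumptions made for illustration; the cell above joins the parts with no separator.

```python
import pandas as pd

# Toy rows; the codes and numbers are invented for illustration.
toy = pd.DataFrame({
    "collectionCode": ["Mamm", "Mamm"],
    "institutionCode": ["XYZ", "XYZ"],
    "catalogNumber": [1234, 1235],   # numeric on purpose
})

# Cast each part to str so the pieces can always be concatenated.
parts = [toy[c].astype(str)
         for c in ["collectionCode", "institutionCode", "catalogNumber"]]

# The "-" separator is an assumption; pass sep="" to mimic plain concatenation.
toy["individualID"] = parts[0].str.cat(parts[1:], sep="-")
print(toy["individualID"].tolist())  # ['Mamm-XYZ-1234', 'Mamm-XYZ-1235']
```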

In [9]:
#create new column basisOfRecord which is "preservedSpecimen"
deer=deer.assign(basisOfRecord = 'preservedSpecimen')

In [10]:
#make a measurementUnit column
deer=deer.assign(measurementUnit = "")

In [11]:
#create long version so that each trait has its own row

#creating long version, first specifying keep variables, then naming variable and value
#basisOfRecord is listed as a keep variable so it is not melted into a measurement row
longVers=pd.melt(deer, 
                id_vars=['catalogNumber',
                         'individualID',
                         'collectionCode',
                         'decimalLatitude',
                         'decimalLongitude', 
                         'verbatimEventDate', 
                         'institutionCode',
                         'verbatimAgeValue',
                         'ageValue',
                         'ageUnit',
                         'verbatimLocality',
                         'sex',
                         'scientificName',
                         'yearCollected',
                         'measurementUnit',
                         'basisOfRecord'], 
                          var_name = 'measurementType', 
                          value_name = 'measurementValue')

#Populating measurementUnit column with appropriate measurement units in long version:
#"g" for body mass, "mm" for every length measurement
longVers['measurementUnit'] = "mm"
longVers.loc[longVers['measurementType'] == "body mass", 'measurementUnit'] = "g"
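
`pd.melt` treats every column that is not listed in `id_vars` as a value column unless `value_vars` is given. The toy example below, with invented values, shows the wide-to-long reshape on two traits and names `value_vars` explicitly so that no other column can be melted by accident.

```python
import pandas as pd

# Toy wide table with two trait columns; the values are invented.
wide = pd.DataFrame({
    "catalogNumber": ["A1", "A2"],
    "body mass": [50000, 61000],
    "tail length": [250, 270],
})

# One output row per (specimen, trait) pair.
long = pd.melt(wide,
               id_vars=["catalogNumber"],
               value_vars=["body mass", "tail length"],
               var_name="measurementType",
               value_name="measurementValue")

print(long.shape)  # (4, 3)
```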


In [12]:
#create materialSampleID, a UUID for each measurement row
longVers['materialSampleID'] = [uuid.uuid4() for _ in range(len(longVers.index))]

In [13]:
#Writing long data csv file
longVers.to_csv('../Mapped Data/VertNet_Deer_Data_Long.csv')
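
By default `to_csv` also writes the row index as an unnamed first column. Whether that column is wanted in the mapped file is not stated here, so the sketch below only shows the difference that `index=False` makes, using a toy frame and in-memory buffers instead of the real output path.

```python
import io
import pandas as pd

toy = pd.DataFrame({"measurementType": ["body mass"], "measurementValue": [50000]})

# Default: the row index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)

print(with_index.getvalue().splitlines()[0])     # ,measurementType,measurementValue
print(without_index.getvalue().splitlines()[0])  # measurementType,measurementValue
```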