# WholeTale Summer Internship 2018 -- Taxonomy alignment as a key to enhance reproducibility in biodiversity research: a case study of Magnolia

# Step 5: Concept Mapping

## Goal of this project
In biodiversity research, we often expect the scientific names of species to act as unique identifiers, but in practice they may not. Why is that?

(1) The scientific names can vary over time

(2) The names stay the same, but the semantics of the names change

Other complicated issues: 

(1) Different people may have different views of the taxonomy of the same topic

(2) Species distribution datasets often include only a species ‘name’, without crediting the authority behind that taxonomy

This is why there is a pressing need to align different taxonomies that address the same topic: not only to make the names more interoperable, but also to pave the way for further use of the datasets.

## Overview of the tasks for this project

**Step 1**: Decide **which species** (or genus) to examine 

**Step 2**: Domain experts provide a **mapping** table for the taxonomies used over time for that particular species 

**Step 3**: Researcher transposes the domain experts’ table into an **Euler/X or LeanEuler** input file

**Step 4**: Gather **species distribution datasets** from biodiversity portals

**Step 5**: **Concept mapping** of the taxonomies and create new datasets based on different taxonomies

**Step 6**: **Data cleaning** - geocode missing lat-long information 

**Step 7**: **Visualizing** species co-occurrence distribution & synthesized taxonomy alignment distribution

**Step 8**: **Niche modeling** and further analysis

## The tasks achieved in this notebook
In this notebook, we focus on **Step 5**: the **Concept Mapping** process.

We have separate notebooks for the details of step 7 and step 8. 

## Explanation of the Concept Mapping process
The main objective of the concept mapping process is to take a dataset that uses Taxonomy 1 and create new datasets that use Taxonomy 2, 3, 4 or more.

To be more specific, in the case of Magnolia, the scientific names column in our species co-occurrence dataset is based on the Weakley 2014 taxonomy. We want to map the scientific names column onto another authority's taxonomy, for example the Magnolia taxonomy of Chapman 1883, so that we can produce a **new** co-occurrence dataset based on the Chapman 1883 taxonomy. This further enables analysis or visualization of 'what it could have been' had the datasets been modeled under a different taxonomy.
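
As a minimal sketch of the idea (not the pipeline used in this notebook, which works from the LeanEuler output), renaming can be pictured as a lookup from Taxonomy 1 names to Taxonomy 2 names. The `weakley_to_chapman` dictionary, the toy `records` frame, and the `chapman1883Name` column are illustrative assumptions; the name pairs themselves are taken from the alignment table shown later in this notebook.

```python
import pandas as pd

# Toy occurrence records named under Weakley 2014 (illustrative only)
records = pd.DataFrame({
    "scientificName": [
        "Magnolia tripetala",
        "Magnolia grandiflora",
        "Magnolia ashei",
    ]
})

# Hypothetical hand-written lookup: Weakley 2014 name -> Chapman 1883 name
weakley_to_chapman = {
    "Magnolia tripetala": "Magnolia umbrella",
    "Magnolia grandiflora": "Magnolia grandiflora",
    "Magnolia ashei": "Magnolia macrophylla",
}

# Series.map looks up each name; names missing from the lookup become NaN
records["chapman1883Name"] = records["scientificName"].map(weakley_to_chapman)
print(records)
```

A hand-written lookup like this does not scale and hides why each pair was chosen, which is why the rest of the notebook derives the mapping from the alignment relations instead.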


In [71]:
#import all the necessary libraries we will be using 
import pandas as pd
import numpy as np

## Read in the LeanEuler output relations table
In Step 3 we aligned Taxonomy 1 (Weakley 2014) and Taxonomy 2 (Chapman 1883) using LeanEuler, a logic-based tool built on RCC-5. This process gave us an output table (the **MIR** table) that shows every possible pairwise relation between the concepts in Taxonomy 1 and the concepts in Taxonomy 2. We will use this table for our concept mapping task:

In [72]:
#create dataframe from LeanEuler output file 
df=pd.read_csv("inputFiles/rel_3.csv")

In [73]:
df

Unnamed: 0.1,Unnamed: 0,pw,x1,x2,x3
0,0,1,"""2014_Magnolia""","""1883_Magnolia_acuminata""",""">"""
1,1,1,"""2014_Magnolia""","""1883_Magnolia_Clade1""",""">"""
2,2,1,"""2014_Magnolia""","""1883_Magnolia_cordata""",""">"""
3,3,1,"""2014_Magnolia""","""1883_Magnolia_grandiflora""",""">"""
4,4,1,"""2014_Magnolia""","""1883_Magnolia_umbrella""",""">"""
5,5,1,"""2014_Magnolia""","""1883_Magnolia_glauca""",""">"""
6,6,1,"""2014_Magnolia""","""1883_Magnolia_Clade2""",""">"""
7,7,1,"""2014_Magnolia""","""1883_Magnolia_macrophylla""",""">"""
8,8,1,"""2014_Magnolia""","""1883_Magnolia_fraseri""",""">"""
9,9,1,"""2014_Magnolia_Clade1""","""1883_Magnolia_acuminata""",""">"""


## Data preprocessing
We clean up the MIR table in the following steps: strip the punctuation marks and remove unnecessary notation using regular expressions.

In [74]:
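#copy original column to a new column and remove all the punctuation marks
#(regex=True is passed explicitly because newer pandas versions treat the pattern as a literal string by default)
df['t1'] = df['x1'].str.replace(r'[^\w\s]', '', regex=True)
df['t2'] = df['x2'].str.replace(r'[^\w\s]', '', regex=True)
df['relations'] = df['x3'].str.replace('"', '', regex=False)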

In [75]:
df.head()

Unnamed: 0.1,Unnamed: 0,pw,x1,x2,x3,t1,t2,relations
0,0,1,"""2014_Magnolia""","""1883_Magnolia_acuminata""",""">""",2014_Magnolia,1883_Magnolia_acuminata,>
1,1,1,"""2014_Magnolia""","""1883_Magnolia_Clade1""",""">""",2014_Magnolia,1883_Magnolia_Clade1,>
2,2,1,"""2014_Magnolia""","""1883_Magnolia_cordata""",""">""",2014_Magnolia,1883_Magnolia_cordata,>
3,3,1,"""2014_Magnolia""","""1883_Magnolia_grandiflora""",""">""",2014_Magnolia,1883_Magnolia_grandiflora,>
4,4,1,"""2014_Magnolia""","""1883_Magnolia_umbrella""",""">""",2014_Magnolia,1883_Magnolia_umbrella,>


In [76]:
#strip the preceding taxonomy name in t1 and t2; ***NOTE: may have to revise the regex - the case 'taxon.magnolia' might not match
df["t1"] = df['t1'].str.replace(r'^[^_]+_', '', regex=True)
df["t2"] = df['t2'].str.replace(r'^[^_]+_', '', regex=True)

In [77]:
df.head()

Unnamed: 0.1,Unnamed: 0,pw,x1,x2,x3,t1,t2,relations
0,0,1,"""2014_Magnolia""","""1883_Magnolia_acuminata""",""">""",Magnolia,Magnolia_acuminata,>
1,1,1,"""2014_Magnolia""","""1883_Magnolia_Clade1""",""">""",Magnolia,Magnolia_Clade1,>
2,2,1,"""2014_Magnolia""","""1883_Magnolia_cordata""",""">""",Magnolia,Magnolia_cordata,>
3,3,1,"""2014_Magnolia""","""1883_Magnolia_grandiflora""",""">""",Magnolia,Magnolia_grandiflora,>
4,4,1,"""2014_Magnolia""","""1883_Magnolia_umbrella""",""">""",Magnolia,Magnolia_umbrella,>


In [78]:
#replace underscores with blanks
df["t1"] = df["t1"].str.replace('_',' ')
df["t2"] = df["t2"].str.replace('_',' ')

In [79]:
df.head()

Unnamed: 0.1,Unnamed: 0,pw,x1,x2,x3,t1,t2,relations
0,0,1,"""2014_Magnolia""","""1883_Magnolia_acuminata""",""">""",Magnolia,Magnolia acuminata,>
1,1,1,"""2014_Magnolia""","""1883_Magnolia_Clade1""",""">""",Magnolia,Magnolia Clade1,>
2,2,1,"""2014_Magnolia""","""1883_Magnolia_cordata""",""">""",Magnolia,Magnolia cordata,>
3,3,1,"""2014_Magnolia""","""1883_Magnolia_grandiflora""",""">""",Magnolia,Magnolia grandiflora,>
4,4,1,"""2014_Magnolia""","""1883_Magnolia_umbrella""",""">""",Magnolia,Magnolia umbrella,>
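
The three cleaning steps above can be folded into one reusable function. This is a sketch under the assumption that every label follows the quoted `<taxonomy>_<name>` pattern seen in the MIR table; `clean_concept` is a hypothetical helper name, not part of LeanEuler or pandas.

```python
import pandas as pd

def clean_concept(labels: pd.Series) -> pd.Series:
    """Turn labels like '"1883_Magnolia_acuminata"' into 'Magnolia acuminata'."""
    return (
        labels.str.replace(r"[^\w\s]", "", regex=True)  # drop quotes and other punctuation
              .str.replace(r"^[^_]+_", "", regex=True)  # drop the leading taxonomy tag, e.g. '2014_'
              .str.replace("_", " ", regex=False)       # underscores become spaces
              .str.strip()
    )

raw = pd.Series([
    '"2014_Magnolia"',
    '"1883_Magnolia_acuminata"',
    '"2014_Magnolia_Clade1"',
])
cleaned = clean_concept(raw)
print(cleaned.tolist())
```

Note that a label without an underscore, such as the `taxon.magnolia` case flagged in the comment above, passes through the prefix step unchanged (after losing its period), so such labels would need a different pattern.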


In [80]:
#show all the column names in this table
df.columns

Index(['Unnamed: 0', 'pw', 'x1', 'x2', 'x3', 't1', 't2', 'relations'], dtype='object')

In [81]:
#Drop unnecessary columns
df.drop(columns=['Unnamed: 0', 'pw', 'x1', 'x2', 'x3'], inplace=True)

In [82]:
df.head()

Unnamed: 0,t1,t2,relations
0,Magnolia,Magnolia acuminata,>
1,Magnolia,Magnolia Clade1,>
2,Magnolia,Magnolia cordata,>
3,Magnolia,Magnolia grandiflora,>
4,Magnolia,Magnolia umbrella,>


## Concept mapping process starts here
Our main interest in replacing the taxonomies lies in the "scientific names" column, which is now the "t1" column. We therefore look first at all the names in this column.

In [83]:
#look at the unique concepts in the dataset
df['t1'].unique()

array(['Magnolia', 'Magnolia Clade1', 'Magnolia Clade2',
       'Magnolia acuminata var acuminata',
       'Magnolia acuminata var subcordata', 'Magnolia grandiflora',
       'Magnolia tripetala', 'Magnolia virginiana var australis',
       'Magnolia virginiana var virginiana', 'Magnolia ashei',
       'Magnolia macrophylla', 'Magnolia fraseri', 'Magnolia pyramidata'],
      dtype=object)

When we do concept mapping, we operate on the **leaf nodes**, or the **children**, meaning that we map only the lowest-level concepts in our taxonomy. We can therefore remove all the higher-level concepts, or the **parents**.

In [84]:
#drop the parent level concepts - leaving only the children in the table
df2 = df[~df['t1'].str.contains("Magnolia Clade1|Magnolia Clade2")]
df2 = df2[~df2['t2'].str.contains("Magnolia Clade1|Magnolia Clade2")]

In [85]:
df2

Unnamed: 0,t1,t2,relations
0,Magnolia,Magnolia acuminata,>
2,Magnolia,Magnolia cordata,>
3,Magnolia,Magnolia grandiflora,>
4,Magnolia,Magnolia umbrella,>
5,Magnolia,Magnolia glauca,>
7,Magnolia,Magnolia macrophylla,>
8,Magnolia,Magnolia fraseri,>
16,Magnolia,Magnolia,=
17,Magnolia acuminata var acuminata,Magnolia acuminata,=
18,Magnolia acuminata var subcordata,Magnolia cordata,=


In [86]:
#Drop the top-level node "Magnolia" in both taxonomies
df2=df2[df2['t1'] != "Magnolia"] 
df2=df2[df2['t2'] != "Magnolia"] 

In [87]:
df2

Unnamed: 0,t1,t2,relations
17,Magnolia acuminata var acuminata,Magnolia acuminata,=
18,Magnolia acuminata var subcordata,Magnolia cordata,=
19,Magnolia grandiflora,Magnolia grandiflora,=
20,Magnolia tripetala,Magnolia umbrella,=
28,Magnolia virginiana var australis,Magnolia glauca,<
31,Magnolia virginiana var virginiana,Magnolia glauca,<
37,Magnolia ashei,Magnolia macrophylla,<
40,Magnolia macrophylla,Magnolia macrophylla,<
43,Magnolia fraseri,Magnolia fraseri,<
46,Magnolia pyramidata,Magnolia fraseri,<
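
The two filtering cells above can also be written as a single exact-match filter. The sketch below assumes the parent concepts are known in advance and listed by hand in `PARENTS`; `keep_leaf_rows` is a hypothetical helper. Exact matching with `isin` avoids the risk that a substring pattern such as "Magnolia" would match every row.

```python
import pandas as pd

# Parent (non-leaf) concepts, listed by hand for this case study
PARENTS = ["Magnolia", "Magnolia Clade1", "Magnolia Clade2"]

def keep_leaf_rows(mir: pd.DataFrame, parents=PARENTS) -> pd.DataFrame:
    """Keep only rows where both t1 and t2 are leaf concepts."""
    is_parent = mir["t1"].isin(parents) | mir["t2"].isin(parents)
    return mir.loc[~is_parent].copy()

# A few rows copied from the cleaned MIR table above
toy = pd.DataFrame({
    "t1": ["Magnolia", "Magnolia", "Magnolia Clade1", "Magnolia tripetala"],
    "t2": ["Magnolia acuminata", "Magnolia Clade1", "Magnolia acuminata", "Magnolia umbrella"],
    "relations": [">", ">", ">", "="],
})
leaves = keep_leaf_rows(toy)
print(leaves)
```

Returning a `.copy()` also means later column assignments on the result do not trigger pandas' chained-assignment warning.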


## RCC-5 alignment notes:
RCC-5 consists of five relations: equals ("="), is_included_in ("<"), includes (">"), overlaps, and disjoint ("!"). In the concept-mapping process we replace the names in a column of one dataset with names from another taxonomy, so the only relations that can be *equivalently mapped* (one name replaced directly by one name) are *equals* and *is_included_in*; the other relations need extra processing. If we mapped all five relations equivalently, we would create many false alignments. A separate notebook shows how things can go wrong if the relations are not *logically* mapped.
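
The handling of each relation described above can be summarized as a small lookup table. This is only a sketch: the labels "direct", "ambiguous", and "drop" are names invented here for illustration, and the "><" symbol for overlaps is an assumption, since no overlaps row appears in this notebook's data.

```python
import pandas as pd

# symbol -> (relation name, how the Taxonomy 2 name is carried over)
RCC5_HANDLING = {
    "=": ("equals", "direct"),           # one name replaces one name
    "<": ("is_included_in", "direct"),   # the T1 concept lies fully inside the T2 concept
    ">": ("includes", "ambiguous"),      # several T2 names are possible
    "><": ("overlaps", "ambiguous"),     # assumed symbol; several T2 names are possible
    "!": ("disjoint", "drop"),           # the concepts share nothing, so no mapping
}

relations = pd.Series(["=", "<", ">", "!"])
handling = relations.map(lambda symbol: RCC5_HANDLING[symbol][1])
print(handling.tolist())
```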

### Congruence and Inverse-Inclusion

In [88]:
#Create a new column and fill it with the t2 name for the 'equals' and 'is_included_in' relations
a=df2['relations']=="="
b=df2['relations']=="<"
df2['new']=np.where(a|b,df2['t2'],'NA')

In [89]:
df2

Unnamed: 0,t1,t2,relations,new
17,Magnolia acuminata var acuminata,Magnolia acuminata,=,Magnolia acuminata
18,Magnolia acuminata var subcordata,Magnolia cordata,=,Magnolia cordata
19,Magnolia grandiflora,Magnolia grandiflora,=,Magnolia grandiflora
20,Magnolia tripetala,Magnolia umbrella,=,Magnolia umbrella
28,Magnolia virginiana var australis,Magnolia glauca,<,Magnolia glauca
31,Magnolia virginiana var virginiana,Magnolia glauca,<,Magnolia glauca
37,Magnolia ashei,Magnolia macrophylla,<,Magnolia macrophylla
40,Magnolia macrophylla,Magnolia macrophylla,<,Magnolia macrophylla
43,Magnolia fraseri,Magnolia fraseri,<,Magnolia fraseri
46,Magnolia pyramidata,Magnolia fraseri,<,Magnolia fraseri


### Disjointness
The *disjoint* relation means that the two concepts in the two taxonomies are **NOT** the same. Therefore they should not be mapped at all.

In [90]:
#get rid of all the 'disjoint' articulations 
df2=df2[df2['relations']!="!"]

In [91]:
df2

Unnamed: 0,t1,t2,relations,new
17,Magnolia acuminata var acuminata,Magnolia acuminata,=,Magnolia acuminata
18,Magnolia acuminata var subcordata,Magnolia cordata,=,Magnolia cordata
19,Magnolia grandiflora,Magnolia grandiflora,=,Magnolia grandiflora
20,Magnolia tripetala,Magnolia umbrella,=,Magnolia umbrella
28,Magnolia virginiana var australis,Magnolia glauca,<,Magnolia glauca
31,Magnolia virginiana var virginiana,Magnolia glauca,<,Magnolia glauca
37,Magnolia ashei,Magnolia macrophylla,<,Magnolia macrophylla
40,Magnolia macrophylla,Magnolia macrophylla,<,Magnolia macrophylla
43,Magnolia fraseri,Magnolia fraseri,<,Magnolia fraseri
46,Magnolia pyramidata,Magnolia fraseri,<,Magnolia fraseri


### Inclusion and Overlapping
These relations give a concept more than one name to choose from in the other taxonomy, so we join the candidate names using **"OR"**.
Since this alignment does not contain either of these two relations among its leaf concepts, the join has no visible effect in the following df2 tables.

In [92]:
#Join the remaining t2 names of each t1 concept with "OR"; only concepts with several "includes"/"overlaps" rows end up with more than one name
df2['new'] = df2.groupby(['t1'])['t2'].transform(lambda x: ' OR '.join(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [93]:
df2

Unnamed: 0,t1,t2,relations,new
17,Magnolia acuminata var acuminata,Magnolia acuminata,=,Magnolia acuminata
18,Magnolia acuminata var subcordata,Magnolia cordata,=,Magnolia cordata
19,Magnolia grandiflora,Magnolia grandiflora,=,Magnolia grandiflora
20,Magnolia tripetala,Magnolia umbrella,=,Magnolia umbrella
28,Magnolia virginiana var australis,Magnolia glauca,<,Magnolia glauca
31,Magnolia virginiana var virginiana,Magnolia glauca,<,Magnolia glauca
37,Magnolia ashei,Magnolia macrophylla,<,Magnolia macrophylla
40,Magnolia macrophylla,Magnolia macrophylla,<,Magnolia macrophylla
43,Magnolia fraseri,Magnolia fraseri,<,Magnolia fraseri
46,Magnolia pyramidata,Magnolia fraseri,<,Magnolia fraseri
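
Because the Magnolia alignment has no *includes* or *overlaps* rows among its leaf concepts, the effect of the "OR" join is easier to see on a toy table. The taxon names below ("Taxon A", "Taxon X", and so on) are made up for illustration and are not part of the Magnolia data.

```python
import pandas as pd

# Toy alignment: "Taxon A" includes two concepts of the other taxonomy
toy = pd.DataFrame({
    "t1": ["Taxon A", "Taxon A", "Taxon B", "Taxon B"],
    "t2": ["Taxon X", "Taxon Y", "Taxon Z", "Taxon X"],
    "relations": [">", ">", "=", "!"],
})

# Disjoint pairs are never mapped
toy = toy[toy["relations"] != "!"].copy()

# Every remaining t2 name for the same t1 concept is a candidate, joined with "OR"
toy["new"] = toy.groupby("t1")["t2"].transform(lambda names: " OR ".join(names))

# One row per t1 concept is enough for the lookup
name_map = toy.drop_duplicates("t1").set_index("t1")["new"].to_dict()
print(name_map)
```

Here "Taxon A" ends up with two candidate names, while "Taxon B" keeps its single congruent name.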


In [94]:
#drop the duplicated rows 
df2=df2.drop_duplicates('t1')

In [95]:
df2

Unnamed: 0,t1,t2,relations,new
17,Magnolia acuminata var acuminata,Magnolia acuminata,=,Magnolia acuminata
18,Magnolia acuminata var subcordata,Magnolia cordata,=,Magnolia cordata
19,Magnolia grandiflora,Magnolia grandiflora,=,Magnolia grandiflora
20,Magnolia tripetala,Magnolia umbrella,=,Magnolia umbrella
28,Magnolia virginiana var australis,Magnolia glauca,<,Magnolia glauca
31,Magnolia virginiana var virginiana,Magnolia glauca,<,Magnolia glauca
37,Magnolia ashei,Magnolia macrophylla,<,Magnolia macrophylla
40,Magnolia macrophylla,Magnolia macrophylla,<,Magnolia macrophylla
43,Magnolia fraseri,Magnolia fraseri,<,Magnolia fraseri
46,Magnolia pyramidata,Magnolia fraseri,<,Magnolia fraseri


## Co-occurrence dataset
This is the co-occurrence dataset we obtained in Step 4.

In [96]:
#read into the co-occurrence csv file based on one taxonomy: e.g. Weakley2014's taxonomy co-occurrence datasets
occurrences=pd.read_csv("inputFiles/Magnolia_secWeakley2014_geocoded.csv")

In [97]:
#display all the headers of the dataset
occurrences.columns

Index(['Unnamed: 0', 'id', 'institutionCode', 'collectionCode',
       'ownerInstitutionCode', 'basisOfRecord', 'occurrenceID',
       'catalogNumber', 'otherCatalogNumbers', 'kingdom', 'phylum', 'class',
       'order', 'family', 'scientificName', 'taxonID',
       'scientificNameAuthorship', 'genus', 'specificEpithet', 'taxonRank',
       'infraspecificEpithet', 'identifiedBy', 'dateIdentified',
       'identificationReferences', 'identificationRemarks', 'taxonRemarks',
       'identificationQualifier', 'typeStatus', 'recordedBy',
       'associatedCollectors', 'recordNumber', 'eventDate', 'year', 'month',
       'day', 'startDayOfYear', 'endDayOfYear', 'verbatimEventDate',
       'occurrenceRemarks', 'habitat', 'substrate', 'verbatimAttributes',
       'fieldNumber', 'informationWithheld', 'dataGeneralizations',
       'dynamicProperties', 'associatedTaxa', 'reproductiveCondition',
       'establishmentMeans', 'cultivationStatus', 'lifeStage', 'sex',
       'individualCount', 'prepa

In [98]:
#drop all the unnecessary columns
occurrences.drop(['Unnamed: 0', 'id', 'institutionCode', 'collectionCode',
       'ownerInstitutionCode', 'basisOfRecord', 'occurrenceID',
       'catalogNumber', 'otherCatalogNumbers', 'kingdom', 'phylum', 'class',
       'order', 'family', 'taxonID',
       'scientificNameAuthorship', 'genus', 'specificEpithet', 'taxonRank',
       'infraspecificEpithet', 'identifiedBy', 'dateIdentified',
       'identificationReferences', 'identificationRemarks', 'taxonRemarks',
       'identificationQualifier', 'typeStatus', 'recordedBy',
       'associatedCollectors', 'recordNumber', 'eventDate', 'year', 'month',
       'day', 'startDayOfYear', 'endDayOfYear', 'verbatimEventDate',
       'occurrenceRemarks', 'habitat', 'substrate', 'verbatimAttributes',
       'fieldNumber', 'informationWithheld', 'dataGeneralizations',
       'dynamicProperties', 'associatedTaxa', 'reproductiveCondition',
       'establishmentMeans', 'cultivationStatus', 'lifeStage', 'sex',
       'individualCount', 'preparations', 'country',
       'municipality', 'locality', 'locationRemarks', 'localitySecurity',
       'localitySecurityReason', 'geodeticDatum',
       'coordinateUncertaintyInMeters', 'verbatimCoordinates',
       'georeferencedBy', 'georeferenceProtocol', 'georeferenceSources',
       'georeferenceVerificationStatus', 'georeferenceRemarks',
       'minimumElevationInMeters', 'maximumElevationInMeters',
       'minimumDepthInMeters', 'maximumDepthInMeters', 'verbatimDepth',
       'verbatimElevation', 'disposition', 'language', 'recordEnteredBy',
       'modified', 'sourcePrimaryKey-dbpk', 'collId', 'recordId', 'references'],inplace=True,axis=1)

In [99]:
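#remove all the punctuation marks in the scientificName column (regex=True is required for the pattern to be treated as a regular expression in newer pandas)
occurrences["scientificName"] = occurrences['scientificName'].str.replace(r'[^\w\s]', '', regex=True)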

In [100]:
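#remove all leading or trailing whitespaces in the scientificName column
#(str.strip() returns a new Series, so the result is assigned back before being displayed)
occurrences["scientificName"] = occurrences["scientificName"].str.strip()
occurrences["scientificName"]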

0                   Magnolia grandiflora
1                   Magnolia grandiflora
2                   Magnolia grandiflora
3                   Magnolia grandiflora
4                   Magnolia grandiflora
5                   Magnolia grandiflora
6                     Magnolia tripetala
7                   Magnolia grandiflora
8                     Magnolia tripetala
9                     Magnolia tripetala
10                  Magnolia grandiflora
11                  Magnolia grandiflora
12                    Magnolia tripetala
13                    Magnolia tripetala
14                  Magnolia grandiflora
15                    Magnolia tripetala
16                  Magnolia grandiflora
17                    Magnolia tripetala
18                    Magnolia tripetala
19                  Magnolia grandiflora
20                    Magnolia tripetala
21                    Magnolia tripetala
22                    Magnolia tripetala
23                  Magnolia grandiflora
24              

In [101]:
occurrences.head()

Unnamed: 0,scientificName,stateProvince,county,decimalLongitude,decimalLatitude
0,Magnolia grandiflora,North Carolina,Brunswick,-78.244113,34.074733
1,Magnolia grandiflora,North Carolina,Brunswick,-78.244113,34.074733
2,Magnolia grandiflora,North Carolina,Brunswick,-78.244113,34.074733
3,Magnolia grandiflora,North Carolina,Brunswick,-78.244113,34.074733
4,Magnolia grandiflora,North Carolina,Brunswick,-78.244113,34.074733


In [102]:
df2.head()

Unnamed: 0,t1,t2,relations,new
17,Magnolia acuminata var acuminata,Magnolia acuminata,=,Magnolia acuminata
18,Magnolia acuminata var subcordata,Magnolia cordata,=,Magnolia cordata
19,Magnolia grandiflora,Magnolia grandiflora,=,Magnolia grandiflora
20,Magnolia tripetala,Magnolia umbrella,=,Magnolia umbrella
28,Magnolia virginiana var australis,Magnolia glauca,<,Magnolia glauca


### Merge the co-occurrence dataset with the concept mapping df2 table

In [103]:
result=pd.merge(occurrences,df2,left_on='scientificName',right_on='t1',how='left')

In [104]:
result

Unnamed: 0,scientificName,stateProvince,county,decimalLongitude,decimalLatitude,t1,t2,relations,new
0,Magnolia grandiflora,North Carolina,Brunswick,-78.244113,34.074733,Magnolia grandiflora,Magnolia grandiflora,=,Magnolia grandiflora
1,Magnolia grandiflora,North Carolina,Brunswick,-78.244113,34.074733,Magnolia grandiflora,Magnolia grandiflora,=,Magnolia grandiflora
2,Magnolia grandiflora,North Carolina,Brunswick,-78.244113,34.074733,Magnolia grandiflora,Magnolia grandiflora,=,Magnolia grandiflora
3,Magnolia grandiflora,North Carolina,Brunswick,-78.244113,34.074733,Magnolia grandiflora,Magnolia grandiflora,=,Magnolia grandiflora
4,Magnolia grandiflora,North Carolina,Brunswick,-78.244113,34.074733,Magnolia grandiflora,Magnolia grandiflora,=,Magnolia grandiflora
5,Magnolia grandiflora,North Carolina,Beaufort,-76.863077,35.494745,Magnolia grandiflora,Magnolia grandiflora,=,Magnolia grandiflora
6,Magnolia tripetala,North Carolina,Beaufort,-76.863077,35.494745,Magnolia tripetala,Magnolia umbrella,=,Magnolia umbrella
7,Magnolia grandiflora,North Carolina,Alamance,-79.399460,36.044027,Magnolia grandiflora,Magnolia grandiflora,=,Magnolia grandiflora
8,Magnolia tripetala,North Carolina,Alamance,-79.399460,36.044027,Magnolia tripetala,Magnolia umbrella,=,Magnolia umbrella
9,Magnolia tripetala,North Carolina,Alamance,-79.399460,36.044027,Magnolia tripetala,Magnolia umbrella,=,Magnolia umbrella
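
A left merge silently leaves unmatched names empty, so it can be worth checking the result. The sketch below uses two standard `pd.merge` options: `validate="many_to_one"` raises an error if a `t1` name appears more than once in the mapping table, and `indicator=True` adds a `_merge` column that flags occurrence rows with no mapping. The toy frames and the unmatched name "Magnolia sp" are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for the occurrences and df2 tables
occ = pd.DataFrame({
    "scientificName": ["Magnolia grandiflora", "Magnolia tripetala", "Magnolia sp"]
})
mapping = pd.DataFrame({
    "t1": ["Magnolia grandiflora", "Magnolia tripetala"],
    "new": ["Magnolia grandiflora", "Magnolia umbrella"],
})

checked = pd.merge(
    occ, mapping,
    left_on="scientificName", right_on="t1",
    how="left",
    validate="many_to_one",  # each t1 name may appear only once in the mapping
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# Occurrence names that found no counterpart in the mapping table
unmatched = checked.loc[checked["_merge"] == "left_only", "scientificName"]
print(unmatched.tolist())
```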


In [105]:
#add a new column "newName": where the scientific name matches t1, return the mapped name from the "new" column
result['newName']=np.where(result['scientificName']==result['t1'],result['new'],np.nan)

In [106]:
result

Unnamed: 0,scientificName,stateProvince,county,decimalLongitude,decimalLatitude,t1,t2,relations,new,newName
0,Magnolia grandiflora,North Carolina,Brunswick,-78.244113,34.074733,Magnolia grandiflora,Magnolia grandiflora,=,Magnolia grandiflora,Magnolia grandiflora
1,Magnolia grandiflora,North Carolina,Brunswick,-78.244113,34.074733,Magnolia grandiflora,Magnolia grandiflora,=,Magnolia grandiflora,Magnolia grandiflora
2,Magnolia grandiflora,North Carolina,Brunswick,-78.244113,34.074733,Magnolia grandiflora,Magnolia grandiflora,=,Magnolia grandiflora,Magnolia grandiflora
3,Magnolia grandiflora,North Carolina,Brunswick,-78.244113,34.074733,Magnolia grandiflora,Magnolia grandiflora,=,Magnolia grandiflora,Magnolia grandiflora
4,Magnolia grandiflora,North Carolina,Brunswick,-78.244113,34.074733,Magnolia grandiflora,Magnolia grandiflora,=,Magnolia grandiflora,Magnolia grandiflora
5,Magnolia grandiflora,North Carolina,Beaufort,-76.863077,35.494745,Magnolia grandiflora,Magnolia grandiflora,=,Magnolia grandiflora,Magnolia grandiflora
6,Magnolia tripetala,North Carolina,Beaufort,-76.863077,35.494745,Magnolia tripetala,Magnolia umbrella,=,Magnolia umbrella,Magnolia umbrella
7,Magnolia grandiflora,North Carolina,Alamance,-79.399460,36.044027,Magnolia grandiflora,Magnolia grandiflora,=,Magnolia grandiflora,Magnolia grandiflora
8,Magnolia tripetala,North Carolina,Alamance,-79.399460,36.044027,Magnolia tripetala,Magnolia umbrella,=,Magnolia umbrella,Magnolia umbrella
9,Magnolia tripetala,North Carolina,Alamance,-79.399460,36.044027,Magnolia tripetala,Magnolia umbrella,=,Magnolia umbrella,Magnolia umbrella


In [107]:
#drop the unnecessary columns
result.drop(['scientificName','t1','t2','relations','new'],inplace=True,axis=1)

In [108]:
result.head()

Unnamed: 0,stateProvince,county,decimalLongitude,decimalLatitude,newName
0,North Carolina,Brunswick,-78.244113,34.074733,Magnolia grandiflora
1,North Carolina,Brunswick,-78.244113,34.074733,Magnolia grandiflora
2,North Carolina,Brunswick,-78.244113,34.074733,Magnolia grandiflora
3,North Carolina,Brunswick,-78.244113,34.074733,Magnolia grandiflora
4,North Carolina,Brunswick,-78.244113,34.074733,Magnolia grandiflora


In [109]:
#update the column name to scientificName
result=result.rename(columns={'newName':'scientificName'})

In [110]:
result.head()

Unnamed: 0,stateProvince,county,decimalLongitude,decimalLatitude,scientificName
0,North Carolina,Brunswick,-78.244113,34.074733,Magnolia grandiflora
1,North Carolina,Brunswick,-78.244113,34.074733,Magnolia grandiflora
2,North Carolina,Brunswick,-78.244113,34.074733,Magnolia grandiflora
3,North Carolina,Brunswick,-78.244113,34.074733,Magnolia grandiflora
4,North Carolina,Brunswick,-78.244113,34.074733,Magnolia grandiflora


In [111]:
#save the new result as a new CSV file
result.to_csv('MagnoliaSecChapman1883Correct.csv')
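
The "Unnamed: 0" columns that show up when the input files are loaded most likely come from CSV files that were saved together with their row index. As a hedged illustration (using an in-memory buffer rather than the real files), passing `index=False` to `to_csv` avoids writing that extra column:

```python
import io
import pandas as pd

frame = pd.DataFrame({"scientificName": ["Magnolia grandiflora"]})

# Default behaviour: the row index is written as an unnamed first column
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Whether to drop the index depends on whether downstream notebooks rely on that first column, so the cell above is left unchanged.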