In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
import random

%matplotlib inline 

# All iNaturalist Records of Lobelia Spicata inside of the U.S.
Spicata is the target species. It has about as many U.S. observations as Yellow Lady's Slipper and Spring Ladies' Tresses, but those species have a higher conservation status than Spicata. Because Spicata's latitude and longitude won't be obscured by taxon geoprivacy, its records are data that I can use to validate models.

To simulate how I imagine someone might target an endangered species, I pulled all observations of Spicata and, from there, a list of users whose records will have "clues" about the location of the Spicata plants. For validation purposes, I also need to make sure each user's Spicata observations have a public positional accuracy of 30 m or less.

In [2]:
df_spicata = pd.read_csv("../data/observations_spicata.csv")
df_spicata.head()

Unnamed: 0,id,observed_on_string,observed_on,time_observed_at,time_zone,user_id,user_login,user_name,created_at,updated_at,...,taxon_supertribe_name,taxon_tribe_name,taxon_subtribe_name,taxon_genus_name,taxon_genushybrid_name,taxon_species_name,taxon_hybrid_name,taxon_subspecies_name,taxon_variety_name,taxon_form_name
0,82406,"May 20, 2012 16:44",2012-05-20,2012-05-20 21:44:00 UTC,Central Time (US & Canada),604,eric_hunt,Eric Hunt,2012-05-23 22:43:55 UTC,2020-12-09 10:50:50 UTC,...,,,,Lobelia,,Lobelia spicata,,,,
1,82408,"May 20, 2012 16:19",2012-05-20,2012-05-20 21:19:00 UTC,Central Time (US & Canada),604,eric_hunt,Eric Hunt,2012-05-23 22:46:36 UTC,2020-12-09 10:50:51 UTC,...,,,,Lobelia,,Lobelia spicata,,,,
2,87039,Sun Jun 03 2012 09:52:31 GMT-0400 (EDT),2012-06-03,2012-06-03 13:52:31 UTC,Eastern Time (US & Canada),477,loarie,Scott Loarie,2012-06-04 05:25:45 UTC,2019-07-02 19:37:38 UTC,...,,,,Lobelia,,Lobelia spicata,,,,
3,92772,"June 16, 2012 05:40",2012-06-16,2012-06-16 12:40:00 UTC,Pacific Time (US & Canada),477,loarie,Scott Loarie,2012-06-18 18:04:31 UTC,2015-10-08 14:36:10 UTC,...,,,,Lobelia,,Lobelia spicata,,,,
4,195645,2008-07-06,2008-07-06,,Eastern Time (US & Canada),12610,susanelliott,Susan Elliott,2013-02-10 18:04:29 UTC,2020-02-19 21:15:23 UTC,...,,,,Lobelia,,Lobelia spicata,,,,


In [3]:
df_spicata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4547 entries, 0 to 4546
Data columns (total 67 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   id                                4547 non-null   int64  
 1   observed_on_string                4538 non-null   object 
 2   observed_on                       4538 non-null   object 
 3   time_observed_at                  4437 non-null   object 
 4   time_zone                         4547 non-null   object 
 5   user_id                           4547 non-null   int64  
 6   user_login                        4547 non-null   object 
 7   user_name                         3079 non-null   object 
 8   created_at                        4547 non-null   object 
 9   updated_at                        4547 non-null   object 
 10  quality_grade                     4547 non-null   object 
 11  license                           3489 non-null   object 
 12  url   

# Comparing public positional accuracy to positional accuracy

Positional accuracy is a kind of confidence measure for the lat/lon provided: an observation is said to be within x meters of the given lat/lon, where x is the positional accuracy. A positional accuracy of 30 m or less is considered "research standard" [according to iNaturalist](https://www.inaturalist.org/posts/2035-observation-location-accuracy). However, the downloaded data has two positional accuracy columns--one labelled "positional accuracy" and one labelled "public positional accuracy". I was not able to find an explanation of the difference on the website, so I've compared the two below using the data. Because I want to use spicata observations where the lat/lon is tied to a positional accuracy of 30 m or less, and the two columns do differ, it's important to know which one to use.

In [4]:
# number of spicata observations with public pos acc of 30 m or less
(df_spicata["public_positional_accuracy"] <= 30).sum()

2076

In [5]:
# number of spicata observations with pos acc of 30 m or less
(df_spicata["positional_accuracy"] <= 30).sum()

2304

In [6]:
# suspect the 228 difference between two counts is due to geoprivacy
# checking that suspicion 

# boolean series over rows with positional accuracy of 30 m or less--True if public pos acc is greater than 30 m
truth_table = (df_spicata[df_spicata["positional_accuracy"] <= 30]["public_positional_accuracy"] > 30) 
# index labels of the rows where the condition is True
indices_for_diff = truth_table[truth_table].index
# len function confirms I've isolated the 228 rows
len(indices_for_diff)

228

In [7]:
# a count of geoprivacy for those entries shows all of them have "obscured" geoprivacy
df_spicata.loc[indices_for_diff, "geoprivacy"].value_counts()

obscured    228
Name: geoprivacy, dtype: int64

All differences between public pos acc and pos acc are accounted for by an obscured geoprivacy. From this I infer that "positional accuracy" is the accuracy attached to the "private latitude" and "private longitude"--values which are not available to me--whereas "public positional accuracy" is the accuracy attached to the "latitude" and "longitude" values, which are publicly available to me. So "public positional accuracy" is the metric I need to focus on.
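
The same check can be wrapped into one reusable step. The sketch below runs on a small made-up frame (toy values, not iNaturalist records); the column names match the export, while the helper name `accuracy_mismatch_by_geoprivacy` is hypothetical.

```python
import pandas as pd

def accuracy_mismatch_by_geoprivacy(df, threshold=30):
    """Hypothetical helper: count geoprivacy values for rows whose
    positional accuracy meets the threshold but whose public
    positional accuracy exceeds it."""
    private_ok = df["positional_accuracy"] <= threshold
    public_worse = df["public_positional_accuracy"] > threshold
    mismatch = private_ok & public_worse
    # dropna=False keeps rows with no geoprivacy value visible in the count
    return df.loc[mismatch, "geoprivacy"].value_counts(dropna=False)

# toy data: rows 1 and 2 are accurate privately but coarse publicly
toy = pd.DataFrame({
    "positional_accuracy":        [5, 10, 20, 500, 8],
    "public_positional_accuracy": [5, 28000, 27000, 500, 8],
    "geoprivacy":                 [None, "obscured", "obscured", None, None],
})
counts = accuracy_mismatch_by_geoprivacy(toy)
print(counts)
```

If every mismatched row comes back as "obscured", as it does for the 228 rows above, geoprivacy explains the whole gap.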

# Constructing a Dataset around Spicata

In [8]:
# list of all users in US who posted spicata with public positional accuracy <= 30
spicata_users = df_spicata[df_spicata["public_positional_accuracy"] <= 30]["user_id"].unique().tolist()
len(spicata_users)

1245

Ideally, I would work with all of the data for every user in the set above, and that may become possible as I learn more ways to access the data. But given my current skillset and the way iNaturalist's data query page works, I am using a brute force method: choose 10 random users at a time from the list of generated users and pull each user's records from the iNaturalist query page until I have a set to explore that has at least 100,000 data points AND at least 100 Spicata records. 
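
One way the batch selection could be made repeatable is sketched below. The ids are stand-ins rather than real iNaturalist users, and `next_batch` is a hypothetical helper that uses `random.sample` with a seed instead of the `random.shuffle` approach used in this notebook.

```python
import random

def next_batch(remaining_users, batch_size=10, seed=None):
    """Hypothetical helper: return (batch, remaining), where batch is a
    random sample without replacement and remaining excludes the batch."""
    rng = random.Random(seed)  # a seeded generator makes the draw repeatable
    batch = rng.sample(remaining_users, min(batch_size, len(remaining_users)))
    chosen = set(batch)
    remaining = [u for u in remaining_users if u not in chosen]
    return batch, remaining

user_pool = list(range(1000, 1025))  # stand-in ids, not real users
batch1, user_pool = next_batch(user_pool, seed=42)
batch2, user_pool = next_batch(user_pool, seed=43)
print(len(batch1), len(batch2), len(user_pool))
```

Because each call returns the reduced pool, a user cannot be drawn twice, which is what the manual `remove` cells below accomplish.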

In [9]:
# 10 users chosen at random using random.shuffle ALL HAVE BEEN INCLUDED
to_remove = [ 635041, 2588524,  923056, 18434,  318468, 1549697, 3583533,
       1679129, 2047965, 2570804]

In [10]:
for id in to_remove:
    spicata_users.remove(id)
len(spicata_users)

1235

In [11]:
# next 10 random users: HAS BEEN INCLUDED
to_remove2 = [787855, 1892152, 542981, 656158, 2248142, 780600, 4659453, 2336149, 3512034, 1773265]
to_remove.extend(to_remove2)

In [12]:
for id in to_remove2:
    spicata_users.remove(id)
len(spicata_users)

1225

In [13]:
# HAS BEEN INCLUDED
to_remove3 = [2889644, 2718421, 805798, 518143, 1237684, 4531483, 818667, 1081392, 2585152, 1401675]
to_remove.extend(to_remove3)

In [14]:
for id in to_remove3:
    spicata_users.remove(id)
len(spicata_users)

1215

In [15]:
# HAS BEEN INCLUDED
to_remove4 = [328268, 557161,2808877,2907199,1234210,1006938,1750165,2154312,1531120,1834316]
to_remove.extend(to_remove4)

In [16]:
for id in to_remove4:
    spicata_users.remove(id)
len(spicata_users)

1205

In [17]:
# HAS BEEN INCLUDED
to_remove5 = [1511270, 1237597, 5800322, 416976, 5669487, 3823990, 4736884, 5241676, 1149437, 1916723]
to_remove.extend(to_remove5)

In [18]:
for id in to_remove5:
    spicata_users.remove(id)
len(spicata_users)

1195

In [19]:
# HAS BEEN INCLUDED
to_remove6 = [436529, 1262660, 674664, 477, 1976181, 475357, 1883226, 3062686, 2125177, 265604]
to_remove.extend(to_remove6)


In [20]:
for id in to_remove6:
    spicata_users.remove(id)
len(spicata_users)

1185

In [21]:
to_remove7 = [1804605, 11792, 434504, 604, 3199368, 1872981, 1387258, 392818, 1635190, 289544]
to_remove.extend(to_remove7)

In [22]:
for id in to_remove7:
    spicata_users.remove(id)
len(spicata_users)

1175

In [23]:
# reads in the csv files from separate users and puts them together in one data frame
# files are numbered spicata_0001.csv, spicata_0002.csv, ... (one per user in to_remove)
files = [f"../data/spicata_{i:04d}.csv" for i in range(1, len(to_remove) + 1)]
# a single concat over the list avoids re-copying the growing frame on every iteration
data_sp = pd.concat([pd.read_csv(file) for file in files], ignore_index=True)



In [24]:
# double-checking that the set has the expected number of users
data_sp["user_id"].nunique()

70

In [25]:
data_sp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155607 entries, 0 to 155606
Data columns (total 67 columns):
 #   Column                            Non-Null Count   Dtype  
---  ------                            --------------   -----  
 0   id                                155607 non-null  int64  
 1   observed_on_string                155146 non-null  object 
 2   observed_on                       155111 non-null  object 
 3   time_observed_at                  148730 non-null  object 
 4   time_zone                         155607 non-null  object 
 5   user_id                           155607 non-null  int64  
 6   user_login                        155607 non-null  object 
 7   user_name                         124934 non-null  object 
 8   created_at                        155607 non-null  object 
 9   updated_at                        155607 non-null  object 
 10  quality_grade                     155607 non-null  object 
 11  license                           126938 non-null  o

# Cleaning the Dataset

### Dropping Columns and Rows

__dropping columns__
I originally downloaded all data columns available via the iNaturalist query page. However, many are not of immediate interest, especially as my first model will involve basic interpolation. Other columns may be of interest later, but for now I'm choosing 21 of the 67 columns for EDA and then my first attempts at modeling. 

__Dropping missing lat/lon rather than attempting to impute because:__
1) accurate lat/lon is essential to this analysis
2) accurately imputing from other clues would be time intensive
3) null lat/lon makes up less than 1% of the rows, so dropping doesn't significantly hurt my dataset

__Dropping missing "time_observed_at" rows because:__
1) much of my analysis is based on this value
2) null values are less than 5% of the rows, so dropping doesn't hurt my dataset

*HOWEVER: it is worth noting that there are two choices for imputation that could be explored:*
1) impute the "created_at" value into the null "time_observed_at" (ordering the data by "created_at" instead is another option to consider)
2) order the data by "created_at" and then impute a "time_observed_at" that falls between the "time_observed_at" rows on either side of the null row (assuming the same user)

But it is my assessment that this will not impact my dataset enough to make it worth the effort.
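
For reference, the second imputation option could look roughly like the sketch below. It is not what this notebook does (the rows are dropped instead). The data is made up, `interpolate_observed_time` is a hypothetical helper, and the linear interpolation treats a user's rows as evenly spaced, so a single gap is filled with the midpoint of its neighbors.

```python
import pandas as pd

EPOCH = pd.Timestamp("1970-01-01", tz="UTC")

def interpolate_observed_time(df):
    """Hypothetical helper: fill missing time_observed_at per user by
    interpolating between neighbors when rows are ordered by created_at."""
    out = df.sort_values(["user_id", "created_at"]).copy()
    # seconds since epoch as floats, so missing times become NaN
    seconds = (out["time_observed_at"] - EPOCH).dt.total_seconds()
    # limit_area="inside" only fills gaps that have a known value on both sides
    seconds = seconds.groupby(out["user_id"]).transform(
        lambda s: s.interpolate(limit_area="inside")
    )
    out["time_observed_at"] = pd.to_datetime(seconds, unit="s", utc=True)
    return out

toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "created_at": pd.to_datetime(
        ["2021-01-01 10:05", "2021-01-01 11:30", "2021-01-01 12:05",
         "2021-02-01 09:00", "2021-02-01 09:30"], utc=True),
    "time_observed_at": pd.to_datetime(
        ["2021-01-01 10:00", None, "2021-01-01 12:00",
         None, "2021-02-01 09:20"], utc=True),
})
filled = interpolate_observed_time(toy)
print(filled[["user_id", "time_observed_at"]])
```

User 2's first row stays null because it has no earlier observation to interpolate from, so rows like that would still need to be dropped or handled separately.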

In [26]:
#columns of interest
to_keep = ['id', 'time_observed_at','user_id', 'created_at',
       'quality_grade', 'num_identification_agreements',
       'num_identification_disagreements', 'captive_cultivated',
       'latitude', 'longitude',
       'positional_accuracy', 'public_positional_accuracy', 'geoprivacy',
       'taxon_geoprivacy', 'coordinates_obscured', 'species_guess', 'scientific_name', 'common_name',
       'taxon_kingdom_name','taxon_genus_name',
      'taxon_species_name']
data_sp = data_sp[to_keep].copy() # copy so later in-place edits don't act on a view of the original frame

In [27]:
# dropping missing "time_observed_at" rows and confirming
data_sp.dropna(subset=['time_observed_at'], inplace=True)
print(f'Number of null time_observed_at entries = {data_sp[data_sp["time_observed_at"].isnull()].shape[0]}')

Number of null time_observed_at entries = 0


In [28]:
# dropping missing latitude/longitude rows and confirming
# these nulls usually coincide, but occasionally only one of the two is missing,
# so drop any row where either coordinate is null
data_sp.dropna(subset=["latitude", "longitude"], inplace=True)
print(f'Number of null latitude/longitude entries = {data_sp[["latitude", "longitude"]].isnull().sum().sum()}')

Number of null latitude/longitude entries = 0


### Dummy and Boolean

__geoprivacy__
has only an "obscured" value--the rest are null. Simple boolean.

__taxon_geoprivacy__
has two values, "open" and "obscured", but also many nulls, so I choose to dummy the two and drop the original column. 
*HOWEVER: it might be worth researching what "open" means for the dataset, because it may be that any "unmarked" rows are, by default, "open", in which case this column could also just be a boolean.*

__taxon_kingdom_name__
The bulk of the data is in the Animal, Plant or Fungi kingdoms so it is unlikely that the other kingdoms will contribute to a data model. Additionally, these are the kingdoms of interest to me. It could be worth checking out a difference if the next two most frequent kingdoms were included, but I don't think it will be worth my time at this juncture.
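
A minimal sketch of both conversions on made-up rows follows. It uses a vectorized comparison for the boolean rather than `.apply` with a lambda; this is an alternative with the same result, not the code the notebook runs.

```python
import pandas as pd

toy = pd.DataFrame({
    "geoprivacy": ["obscured", None, None, "obscured"],
    "taxon_geoprivacy": ["open", None, "obscured", "open"],
})

# boolean flag: 1 where geoprivacy is "obscured", 0 otherwise (nulls included)
toy["geoprivacy"] = toy["geoprivacy"].eq("obscured").astype(int)

# one indicator column per non-null value; null rows get 0 in both columns
dummies = pd.get_dummies(toy["taxon_geoprivacy"], dtype=int)
toy = pd.concat([toy.drop(columns="taxon_geoprivacy"), dummies], axis="columns")
print(toy)
```

Keeping both dummy columns preserves the three-way distinction between "open", "obscured", and null, which a single boolean would lose.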



In [29]:
data_sp["geoprivacy"].value_counts() #to boolean

obscured    8598
Name: geoprivacy, dtype: int64

In [30]:
# this code demonstrated during Dec 19th kickoff and had short runtime compared to others
# this way of using .apply and lambda isn't yet comfortable/natural to me
data_sp["geoprivacy"] = data_sp["geoprivacy"].apply(lambda x: 1 if x == "obscured" else 0)

In [31]:
data_sp["geoprivacy"].value_counts()

0    139645
1      8598
Name: geoprivacy, dtype: int64

In [32]:
data_sp["taxon_geoprivacy"].value_counts() # to dummy

open        25657
obscured     3041
Name: taxon_geoprivacy, dtype: int64

In [33]:
# makes dummies
taxon_geoprivacy_dum = pd.get_dummies(data_sp["taxon_geoprivacy"])
# concatenates dummy columns with previous data set
data_sp = pd.concat([data_sp,taxon_geoprivacy_dum],axis='columns')
# drops original taxon_geoprivacy column 
data_sp.drop(columns="taxon_geoprivacy",inplace=True)


In [34]:
data_sp["taxon_kingdom_name"].value_counts() 

Plantae      81968
Animalia     58343
Fungi         7525
Protozoa       237
Chromista       40
Bacteria        10
Viruses          9
Name: taxon_kingdom_name, dtype: int64

In [35]:
# dummy Animalia, Plantae and Fungi
kingdom_dum = pd.get_dummies(data_sp["taxon_kingdom_name"])[["Animalia", "Fungi", "Plantae"]]
data_sp = pd.concat([data_sp, kingdom_dum], axis = "columns")
data_sp.drop(columns = "taxon_kingdom_name", inplace = True)

### Imputation

__Other Naming Categoricals__
There are too many options for these to dummy at this time and I would like to sort and group on the categoricals for initial EDA. Given that these are strings, I am imputing the string "not stated" into any nulls here.

__Positional Accuracy Columns__
I choose to impute the mean into the null values for both of these columns. There are few null values, and imputing the mean will at least preserve the mean. Additionally, imputing a value near the medians would be dangerous because it would place those rows among the positional accuracies that are considered good for research. Sticking with the mean instead signals "poor" positional accuracy, which is the safer assumption to make. 
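
The effect of that choice shows up on a small right-skewed series (made-up values, with a 30 m threshold as in the rest of this notebook):

```python
import numpy as np
import pandas as pd

# made-up accuracies in meters: mostly small, one huge value, two nulls
acc = pd.Series([3, 5, 8, 10, 12, 40, 25000, np.nan, np.nan],
                name="public_positional_accuracy")
threshold = 30

median_fill = acc.fillna(acc.median())  # median is 10 m, below the threshold
mean_fill = acc.fillna(acc.mean())      # mean is about 3583 m, well above it

n_good_original = (acc <= threshold).sum()
n_good_median = (median_fill <= threshold).sum()
n_good_mean = (mean_fill <= threshold).sum()
print(n_good_original, n_good_median, n_good_mean)
```

With the median, both unknown rows would silently count as research-grade positions; with the mean, they stay outside the 30 m cutoff and the column mean is unchanged.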

In [36]:
#filling categoricals with missing info with "not stated" and confirming
cat_with_null = ['species_guess', 'scientific_name','common_name', 
                 'taxon_genus_name','taxon_species_name']
data_sp[cat_with_null] = data_sp[cat_with_null].fillna("not stated")
print(f'Number of null entries in stated columns = {data_sp[cat_with_null].isnull().sum().sum()}')

Number of null entries in stated columns = 0


In [37]:
round(data_sp[['positional_accuracy','public_positional_accuracy']].describe(),2)

       positional_accuracy  public_positional_accuracy
count             131942.0                    132953.0
mean                309.45                     2667.02
std                13957.8                    15889.78
min                    0.0                         0.0
25%                    5.0                         5.0
50%                    9.0                        10.0
75%                   32.0                        65.0
max              2703477.0                   2703477.0


In [38]:
# imputing the mean into the null values of both accuracy columns
PA_mean = data_sp["positional_accuracy"].mean()
PPA_mean = data_sp["public_positional_accuracy"].mean()
data_sp["positional_accuracy"] = data_sp["positional_accuracy"].fillna(PA_mean)
data_sp["public_positional_accuracy"] = data_sp["public_positional_accuracy"].fillna(PPA_mean)

In [39]:
round(data_sp[['positional_accuracy','public_positional_accuracy']].describe(),2)

       positional_accuracy  public_positional_accuracy
count             148243.0                    148243.0
mean                309.45                     2667.02
std               13168.04                    15048.03
min                    0.0                         0.0
25%                    5.0                         5.0
50%                   10.0                        11.0
75%                   98.0                       244.0
max              2703477.0                   2703477.0


### Final Cleaning Touches/Checks

In [40]:
# changing to datetimes
data_sp["time_observed_at"] = pd.to_datetime(data_sp["time_observed_at"])
data_sp["created_at"] = pd.to_datetime(data_sp["created_at"])

In [41]:
data_sp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 148243 entries, 0 to 155606
Data columns (total 24 columns):
 #   Column                            Non-Null Count   Dtype              
---  ------                            --------------   -----              
 0   id                                148243 non-null  int64              
 1   time_observed_at                  148243 non-null  datetime64[ns, UTC]
 2   user_id                           148243 non-null  int64              
 3   created_at                        148243 non-null  datetime64[ns, UTC]
 4   quality_grade                     148243 non-null  object             
 5   num_identification_agreements     148243 non-null  int64              
 6   num_identification_disagreements  148243 non-null  int64              
 7   captive_cultivated                148243 non-null  bool               
 8   latitude                          148243 non-null  float64            
 9   longitude                         148243 non-nul

In [42]:
# number of spicata observations in this set with necessary positional accuracy
# Goal is for this number to be at least 100
((data_sp["taxon_species_name"] == "Lobelia spicata")&(data_sp["public_positional_accuracy"]<=30)).sum()

100

# Adding Important Data Columns for Analysis

### Clustering User Data According to Proximity in Time

In [43]:
# sorting data by user then by time_observed_at so "minute_diff" and "km_diff" can operate on a single user in the order
# the observations were made
# NOTE: there may be a reason to choose "created_at" or "id" instead if "time_observed_at" turns out to be poor for modeling
data_sp = data_sp.sort_values(by = ["user_id", "time_observed_at"]).reset_index(drop = True)

In [44]:
# column that calculates time difference between given row and previous row, converted from seconds to minutes
data_sp["minute_diff"] = data_sp["time_observed_at"].diff()
data_sp["minute_diff"] = data_sp["minute_diff"].dt.total_seconds() / 60

For each user's first entry, the "minute_diff" value is calculated against the previous user's last entry, which doesn't make sense. There are some options for what could be imputed there: the mean or median of that user's time differences, for example, or perhaps 0. I'm starting with -0.001 because it's close to zero, so it will act a lot like zero if included in summary stats, but it can also be easily filtered out of the dataset altogether, since all actual time differences are zero or positive given that I ordered each user's rows sequentially in time.
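
An alternative that avoids the cross-user difference in the first place is to take the difference within each user group, so the first row per user comes out null and can be filled with the sentinel. This is a sketch on made-up timestamps, not the loop used in the next cell:

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [7, 7, 7, 9, 9],
    "time_observed_at": pd.to_datetime(
        ["2021-05-01 10:00", "2021-05-01 10:30", "2021-05-01 12:00",
         "2021-05-02 08:00", "2021-05-02 08:05"], utc=True),
}).sort_values(["user_id", "time_observed_at"]).reset_index(drop=True)

imputed_initial_time = -0.001  # same sentinel as in this notebook

# diff within each user: the first row of every user has no predecessor, so it is NaN
minutes = toy.groupby("user_id")["time_observed_at"].diff().dt.total_seconds() / 60
toy["minute_diff"] = minutes.fillna(imputed_initial_time)
print(toy)
```

This relies on there being no missing timestamps left, which holds here because null "time_observed_at" rows were dropped during cleaning.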

In [45]:
users = data_sp["user_id"].unique() # identifies users in set --REQUIRED FOR LATER CODE
imputed_initial_time = -0.001  # change here for different imputation
for user in users:
    min_index = data_sp.index[data_sp["user_id"] == user].min()
    data_sp.loc[min_index, "minute_diff"] = imputed_initial_time

In [46]:
# summary stats with -0.001 as initial minute_diff included
round(data_sp["minute_diff"].describe(),3)

count     148243.000
mean        1749.765
std        27436.864
min           -0.001
25%            1.167
50%            4.767
75%           34.133
max      6661710.917
Name: minute_diff, dtype: float64

In [47]:
# summary stats with -0.001 as initial minute_diff excluded
round(data_sp[data_sp["minute_diff"] != imputed_initial_time]["minute_diff"].describe(),3)

count     148173.000
mean        1750.592
std        27443.318
min            0.000
25%            1.167
50%            4.767
75%           34.200
max      6661710.917
Name: minute_diff, dtype: float64

### Clustering User Data According to Proximity in Space (with time as an ordering category)

Using scikit-learn's [haversine function](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.haversine_distances.html) to approximate the great-circle distance between two lat/lon points.
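
For comparison, the haversine formula can also be written directly in NumPy and applied to whole columns at once. The sketch below is an alternative to the row-by-row scikit-learn call used in this notebook; `haversine_km` is a hypothetical helper, the coordinates are made up, and 6371 km is the same Earth-radius approximation used in the next cell.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

toy = pd.DataFrame({
    "user_id":   [1, 1, 1, 2, 2],
    "latitude":  [0.0, 0.0, 1.0, 40.0, 40.0],
    "longitude": [0.0, 1.0, 1.0, -75.0, -75.0],
})

# previous location within the same user; the first row per user has none (NaN)
prev = toy.groupby("user_id")[["latitude", "longitude"]].shift()
toy["km_diff"] = haversine_km(
    prev["latitude"], prev["longitude"], toy["latitude"], toy["longitude"]
).fillna(-0.001)  # same sentinel idea as for minute_diff
print(toy)
```

One degree along the equator or along a meridian comes out at roughly 111 km, which is a quick sanity check on the radius and the radian conversion.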

In [48]:
from sklearn.metrics.pairwise import haversine_distances
from math import radians
def dist_in_km(df, i, s = 1): 
    '''
    Takes the index of one row in a given dataset
    and the number of rows ahead of that row in the dataset
    and computes the estimated distance in km 
    between the two latitude/longitude coordinates
    
    Arguments:
    df = dataframe
    i = index of row with first location of concern
    s = step or number of rows ahead of initial row for second location
        (default is 1--to calc distance between location in given row and next one)
    
    Returns: 
    Estimated distance in km between the two lat/lon locations 
    for the row given and the row s steps ahead of row i
    
    Example: 
    dist_in_km(data_sp, 4, 2)
    Will find the distance in km between the locations in the rows of data_sp labelled 4 and 4 + 2
    '''
    location1 = [df.loc[i, "latitude"], df.loc[i, "longitude"]]
    location2 = [df.loc[i + s, "latitude"], df.loc[i + s, "longitude"]]
    loc1_rad = [radians(deg) for deg in location1]
    loc2_rad = [radians(deg) for deg in location2]
    result = haversine_distances([loc1_rad, loc2_rad])
    dist = result[0][1] * 6371  # multiply by Earth radius in km to get km
    return dist

In [49]:
# creates list of indices where new users begin
first_index_for_user = []
for user in users: # users as defined in previous section
    min_index = data_sp.index[data_sp["user_id"] == user].min()
    first_index_for_user.append(min_index)

In [50]:
# CODE RELIES ON EARLIER SORTING AND INDEXING BY USER AND TIME (or ID if chosen)
# creates distance difference column by user
# initial row for a user is imputed 
# every row afterward for that user, distance is calculated between given row and the row before
imputed_initial_dist = -0.001 # change here for different imputation
first_indices = set(first_index_for_user) # set lookup is much faster than list lookup inside the loop
for i in range(data_sp.shape[0]):
    if i in first_indices:
        data_sp.loc[i, "km_diff"] = imputed_initial_dist # sets each user's first row (including row 0) as imputed value
    else: 
        data_sp.loc[i, "km_diff"] = dist_in_km(data_sp, i-1) 
        # sets all non first indices as difference using dist_in_km function

In [51]:
# summary stats with imputed initial distance differences included
round(data_sp["km_diff"].describe(),3)

count    148243.000
mean         28.404
std         297.634
min          -0.001
25%           0.012
50%           0.094
75%           2.667
max       17744.088
Name: km_diff, dtype: float64

In [52]:
# summary stats with imputed initial distance differences excluded
round(data_sp[data_sp["km_diff"]!= imputed_initial_dist]["km_diff"].describe(),3)

count    148173.000
mean         28.417
std         297.703
min           0.000
25%           0.012
50%           0.094
75%           2.676
max       17744.088
Name: km_diff, dtype: float64

In [53]:
data_sp.head()

Unnamed: 0,id,time_observed_at,user_id,created_at,quality_grade,num_identification_agreements,num_identification_disagreements,captive_cultivated,latitude,longitude,...,common_name,taxon_genus_name,taxon_species_name,obscured,open,Animalia,Fungi,Plantae,minute_diff,km_diff
0,39481,2007-04-14 16:48:00+00:00,477,2011-11-18 18:32:06+00:00,research,1,0,False,36.294893,-78.996592,...,Pickerel Frog,Lithobates,Lithobates palustris,0,0,1,0,0,-0.001,-0.001
1,4914407,2009-08-10 20:10:00+00:00,477,2017-01-08 21:14:11+00:00,research,2,0,False,-23.13739,-44.170157,...,Atlantic Ghost Crab,Ocypode,Ocypode quadrata,0,0,1,0,0,1222762.0,7553.014945
2,8755,2010-11-01 03:52:00+00:00,477,2010-11-01 04:04:18+00:00,research,0,0,False,37.740688,-122.440483,...,Argentinian biddy biddy,Acaena,Acaena pinnatifida,0,0,0,0,1,644142.0,10598.859075
3,10565,2011-01-26 18:14:01+00:00,477,2011-01-26 19:08:52+00:00,research,3,0,False,37.428551,-122.18013,...,Toyon,Heteromeles,Heteromeles arbutifolia,0,0,0,0,1,124702.0,41.60482
4,10566,2011-01-26 19:16:39+00:00,477,2011-01-26 19:17:41+00:00,research,1,0,False,37.428566,-122.17971,...,wild radish,Raphanus,Raphanus sativus,0,0,0,0,1,62.63333,0.037091


In [55]:
# sending cleaned dataset to csv to use for EDA to cut down on runtime
data_sp.to_csv("../data/spicata_clean.csv")