# Compiling Final Dataset
We've built our table from the two datasets that hold most of the information. Species that were not in the taxa information we gathered were excluded. Many fields are still missing, so in this notebook we will explore what to keep and what to throw out.

In [1]:
import os
import pickle
import numpy as np
import pandas as pd

base_dir = os.environ['HOME'] + '/python/Mushroom_Classifier/'

In [101]:
DF_MO = pickle.load(open(base_dir + 'MO_tables/finished_df_MO.p','rb'))
DF_DSA = pickle.load(open(base_dir + 'DSA_info/finished_df_DSA.p', 'rb'))
DF = pd.concat([DF_MO, DF_DSA], axis=0)
#creating an identifier that is unique across data sources
#(one per observation: an observation with several images repeats it)
DF['Unique_ID'] = DF['Data_Source'] + '_' + DF['id'].astype(str)
    
print(DF.shape)
display(DF)
pickle.dump(DF, open(base_dir + 'combined_dataset_df.p', 'wb'))

(1022772, 16)


Unnamed: 0,id,location_id,in_location,vote_cache,when,Domain,Kingdom,Phylum,Class,Order,Family,Genus,Species,Image,Data_Source,Unique_ID
0,1,214,1,1.923350,['2004-07-13'],Eukarya,Fungi,Ascomycota,Sordariomycetes,Xylariales,Xylariaceae,Xylaria,polymorpha,1,MO,MO_1
0,2,53,1,2.706040,['2004-07-17'],,,,,,,Xylaria,magnoliae,2,MO,MO_2
0,3,60,1,2.499910,['2002-01-08'],Eukarya,Fungi,Ascomycota,Sordariomycetes,Xylariales,Xylariaceae,Xylaria,hypoxylon,3,MO,MO_3
0,4,5,1,2.499910,['1996-01-15'],Eukarya,Fungi,Ascomycota,Sordariomycetes,Xylariales,Xylariaceae,Xylaria,hypoxylon,4,MO,MO_4
0,5,36,1,1.666610,['2002-12-28'],Eukarya,Fungi,Basidiomycota,Agaricomycetes,Agaricales,Typhulaceae,Xeromphalina,,5,MO,MO_5
0,6,58,1,2.505280,['2002-01-08'],Eukarya,Fungi,Basidiomycota,Agaricomycetes,Agaricales,Typhulaceae,Xeromphalina,campanella,6,MO,MO_6
0,7,58,0,2.499910,['2005-01-07'],,,,,,,Xerocomus,zelleri,7,MO,MO_7
0,8,39,1,2.451440,['2004-11-26'],,,,,,,Xerocomus,zelleri,8,MO,MO_8
0,9,69,1,2.499910,['2003-01-03'],,,,,,,Xerocomus,subtomentosus,9,MO,MO_9
1,9,69,1,2.499910,['2003-01-03'],,,,,,,Xerocomus,subtomentosus,10,MO,MO_9


OK. Let's get some stats. I want the following:  

1) How many entries have most taxa fields  
2) How many images per class  
3) How many entries do we lose with vote_cache > 1.5  
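
Question 3 can be answered with a single boolean mask. The following is a minimal sketch on a small made-up frame rather than the real `DF`; the helper name `rows_lost_at_threshold` is hypothetical, and counting rows with a missing `vote_cache` as lost is an assumption.

```python
import numpy as np
import pandas as pd

def rows_lost_at_threshold(df, threshold):
    """Count rows that a `vote_cache > threshold` filter would drop.

    NaN compares as False, so rows with a missing vote_cache count as lost.
    """
    keep = df['vote_cache'] > threshold
    return int((~keep).sum())

# Toy data, not the real dataset
toy = pd.DataFrame({'vote_cache': [1.92, 2.71, 0.84, np.nan, 1.5]})
lost = rows_lost_at_threshold(toy, 1.5)
print('We would lose %i/%i entries' % (lost, toy.shape[0]))
```

Calling the same helper with several thresholds shows how quickly the dataset shrinks as the cutoff rises.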


In [9]:
filters = []
fields_to_include = \
        ['Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']
for fields in fields_to_include:
    filters.append(DF[fields].apply(lambda x: x != ''))

filter_complete = pd.concat(filters, axis=1)
observations_with_number_of_fields = {}
for i in range(7):
    fields_complete_filter = (filter_complete.sum(axis=1) == i)
    observations_with_number_of_fields[i] = DF[fields_complete_filter].shape[0]
for k,v in observations_with_number_of_fields.items():
    print('Number of entries (%i) with %i taxonomy labels' %(v,k))
    
#how many entries have both genus and species?
filters = []
for fields in ['Genus','Species']:
    filters.append(DF[fields].apply(lambda x: x != ''))
filter_complete = pd.concat(filters, axis=1)
contains_genus_and_species = sum(filter_complete.sum(axis=1) == 2)
#genus present but species missing
contains_only_genus = sum(filter_complete['Genus'] & ~filter_complete['Species'])
print('Number of observations that contain genus and species: %i/%i'\
                                  %(contains_genus_and_species, DF.shape[0]))

Number of entries (5908) with 0 taxonomy labels
Number of entries (37036) with 1 taxonomy labels
Number of entries (274812) with 2 taxonomy labels
Number of entries (21678) with 3 taxonomy labels
Number of entries (42166) with 4 taxonomy labels
Number of entries (224998) with 5 taxonomy labels
Number of entries (416174) with 6 taxonomy labels
Number of observations that contain genus and species: 704783/1022772
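
The per-row label count above can also be computed without an explicit loop over `range(7)`. This is a sketch of one possible vectorized equivalent on made-up rows, assuming (as in the cell above) that a missing label is stored as an empty string; the names `toy`, `labels_per_row` and `label_counts` are illustrative only.

```python
import pandas as pd

taxa_fields = ['Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']

# Toy data, not the real dataset; '' marks a missing label as in DF
toy = pd.DataFrame({
    'Phylum':  ['Ascomycota', '', ''],
    'Class':   ['Sordariomycetes', '', ''],
    'Order':   ['Xylariales', '', ''],
    'Family':  ['Xylariaceae', '', ''],
    'Genus':   ['Xylaria', 'Xerocomus', ''],
    'Species': ['polymorpha', 'zelleri', ''],
})

# Number of non-empty taxonomy labels in each row
labels_per_row = (toy[taxa_fields] != '').sum(axis=1)

# Rows per label count, including counts that never occur
label_counts = labels_per_row.value_counts().reindex(range(7), fill_value=0)
for k, v in label_counts.items():
    print('Number of entries (%i) with %i taxonomy labels' % (v, k))
```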


In [63]:
yes_g = np.reshape(filter_complete.values[:,0]==1, (filter_complete.shape[0],1))
no_g = np.reshape(filter_complete.values[:,0]==0, (filter_complete.shape[0],1))
yes_s = np.reshape(filter_complete.values[:,1]==1, (filter_complete.shape[0],1))
no_s = np.reshape(filter_complete.values[:,1]==0, (filter_complete.shape[0],1))

genus_and_no_species = sum(np.sum(np.concatenate([yes_g, no_s], axis=1), axis=1) == 2)
no_genus_and_species = sum(np.sum(np.concatenate([no_g, yes_s], axis=1), axis=1) == 2)
genus_and_species = sum(np.sum(np.concatenate([yes_g, yes_s], axis=1), axis=1) == 2)
neither_genus_or_species = sum(np.sum(np.concatenate([no_g, no_s], axis=1), axis=1) == 2)

In [66]:
print('Genus and no species %i' %genus_and_no_species)
print('No Genus but species %i' %no_genus_and_species)
print('Genus and Species :) %i' %genus_and_species)
print('Neither: %i' %neither_genus_or_species)
adds_up = neither_genus_or_species + genus_and_species \
            + no_genus_and_species + genus_and_no_species
print('Does it add up? %i/%i' %(adds_up, DF.shape[0]))

Genus and no species 290559
No Genus but species 0
Genus and Species :) 704783
Neither: 27430
Does it add up? 1022772/1022772


OK, no observations have a species without a genus, so the data at least follows some hierarchical logic. For entries that have a genus but no species, I'm going to set the species to 'unknown'. Predicting the genus when we can't get the species would still be a respectable result, and this lets us keep 290k images.
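
The same genus/species breakdown can be read off a 2x2 contingency table. The sketch below uses `pd.crosstab` on made-up rows, again assuming that an empty string marks a missing label; it is an alternative way to run the check, not the code that produced the numbers above.

```python
import pandas as pd

# Toy data, not the real dataset; '' marks a missing label as in DF
toy = pd.DataFrame({
    'Genus':   ['Xylaria', 'Xeromphalina', 'Xerocomus', ''],
    'Species': ['polymorpha', '', 'zelleri', ''],
})

has_genus = (toy['Genus'] != '').rename('has_genus')
has_species = (toy['Species'] != '').rename('has_species')

# 2x2 table of row counts for every presence/absence combination
presence = pd.crosstab(has_genus, has_species)
print(presence)

# Rows that name a species without a genus would break the hierarchy
species_without_genus = int((has_species & ~has_genus).sum())
print('Species without genus: %i' % species_without_genus)
```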

In [102]:
DF.loc[DF['Species'] == '', 'Species'] = 'unknown'
display(DF)

Unnamed: 0,id,location_id,in_location,vote_cache,when,Domain,Kingdom,Phylum,Class,Order,Family,Genus,Species,Image,Data_Source,Unique_ID
0,1,214,1,1.923350,['2004-07-13'],Eukarya,Fungi,Ascomycota,Sordariomycetes,Xylariales,Xylariaceae,Xylaria,polymorpha,1,MO,MO_1
0,2,53,1,2.706040,['2004-07-17'],,,,,,,Xylaria,magnoliae,2,MO,MO_2
0,3,60,1,2.499910,['2002-01-08'],Eukarya,Fungi,Ascomycota,Sordariomycetes,Xylariales,Xylariaceae,Xylaria,hypoxylon,3,MO,MO_3
0,4,5,1,2.499910,['1996-01-15'],Eukarya,Fungi,Ascomycota,Sordariomycetes,Xylariales,Xylariaceae,Xylaria,hypoxylon,4,MO,MO_4
0,5,36,1,1.666610,['2002-12-28'],Eukarya,Fungi,Basidiomycota,Agaricomycetes,Agaricales,Typhulaceae,Xeromphalina,unknown,5,MO,MO_5
0,6,58,1,2.505280,['2002-01-08'],Eukarya,Fungi,Basidiomycota,Agaricomycetes,Agaricales,Typhulaceae,Xeromphalina,campanella,6,MO,MO_6
0,7,58,0,2.499910,['2005-01-07'],,,,,,,Xerocomus,zelleri,7,MO,MO_7
0,8,39,1,2.451440,['2004-11-26'],,,,,,,Xerocomus,zelleri,8,MO,MO_8
0,9,69,1,2.499910,['2003-01-03'],,,,,,,Xerocomus,subtomentosus,9,MO,MO_9
1,9,69,1,2.499910,['2003-01-03'],,,,,,,Xerocomus,subtomentosus,10,MO,MO_9


In [103]:
#replace missing vote_cache values with 1 for now
DF.loc[DF['vote_cache'].isna(), 'vote_cache'] = 1

In [104]:
DF['GS_Dir'] = DF['Genus'] + '_' + DF['Species']

## Filtering
We are applying the following rules to decide which images to keep. I also want statistics on how many pictures each rule costs us:  
1) Must have a Genus (a missing Species was set to 'unknown' above)  
-> We are losing 27,430 pictures  
2) There must be at least 100 pictures per class  
-> We are losing 204,719 pictures  
3) Vote_cache must be greater than 0  
-> We are losing about 6,000 pictures  
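
The rules can be sketched as one function that records the loss at each step. This is an illustration on made-up rows with a tiny class threshold, not the notebook's actual cells; the function name `apply_filters` and its `min_per_class` parameter are hypothetical. The steps run in the same order as the cells below (genus, then vote_cache, then class size), so the class-size loss is counted on rows that already passed the other two rules.

```python
import pandas as pd

def apply_filters(df, min_per_class=100):
    """Apply the three rules in order and record rows lost at each step."""
    lost = {}

    n = len(df)
    df = df[df['Genus'] != '']            # rule 1: drop rows with no genus
    lost['no_genus'] = n - len(df)

    n = len(df)
    df = df[df['vote_cache'] > 0]         # rule 3: positive vote_cache
    lost['vote_cache'] = n - len(df)

    n = len(df)
    class_size = df.groupby('GS_Dir')['GS_Dir'].transform('size')
    df = df[class_size >= min_per_class]  # rule 2: enough images per class
    lost['small_class'] = n - len(df)

    return df, lost

# Toy data with a tiny threshold, not the real dataset
toy = pd.DataFrame({
    'Genus': ['Xylaria', 'Xylaria', 'Xylaria', 'Xerocomus', ''],
    'Species': ['hypoxylon', 'hypoxylon', 'hypoxylon', 'zelleri', 'unknown'],
    'vote_cache': [2.5, 2.5, -0.5, 2.4, 1.0],
})
toy['GS_Dir'] = toy['Genus'] + '_' + toy['Species']
kept, lost = apply_filters(toy, min_per_class=2)
print(lost)
```

Using `transform('size')` keeps the class size aligned with each row, so the final filter is a plain boolean mask instead of a loop over groups.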

In [105]:
taxa_fields = ['Phylum','Class','Order','Family','Genus','Species']
#Must have a Genus (Species may be 'unknown')
DF = DF[DF.Genus != '']
#Vote_cache must be greater than 0
DF = DF[DF.vote_cache > 0]

In [99]:
images_per_class = {i: sub_df.shape[0] for i, sub_df in DF.groupby('GS_Dir')}
images_per_class = sorted(images_per_class.items(), key=lambda x: x[1], reverse=True)

species_more_than_100 = [w for w in images_per_class if w[1]>=100]
no_observations = sum([w[1] for w in species_more_than_100])
no_classes = len(species_more_than_100)
observations_lost = DF.shape[0] - no_observations
print('Observations: %i across %i classes' %(no_observations, no_classes))
print('We lost %i observations' %observations_lost)

Observations: 784646 across 1890 classes
We lost 204719 observations


In [100]:
filtered_df = []
#let's actually make a new dataframe
for i, sub_df in DF.groupby('GS_Dir'):
    if sub_df.shape[0] >= 100:
        filtered_df.append(sub_df)
DF = pd.concat(filtered_df, axis=0)
display(DF)
pickle.dump(DF, open(base_dir + 'combined_dataset_df.p', 'wb'))

Unnamed: 0,id,location_id,in_location,vote_cache,when,Domain,Kingdom,Phylum,Class,Order,Family,Genus,Species,Image,Data_Source,Unique_ID,GS_Dir
0,32028,780,1,0.836956,['2009-12-22'],Eukarya,Fungi,Ascomycota,Lecanoromycetes,Acarosporales,Acarosporaceae,Acarospora,unknown,74305,MO,MO_32028,Acarospora_unknown
1,32028,780,1,0.836956,['2009-12-22'],Eukarya,Fungi,Ascomycota,Lecanoromycetes,Acarosporales,Acarosporaceae,Acarospora,unknown,74306,MO,MO_32028,Acarospora_unknown
2,32028,780,1,0.836956,['2009-12-22'],Eukarya,Fungi,Ascomycota,Lecanoromycetes,Acarosporales,Acarosporaceae,Acarospora,unknown,74307,MO,MO_32028,Acarospora_unknown
0,50671,939,1,0.831821,['2010-08-10'],Eukarya,Fungi,Ascomycota,Lecanoromycetes,Acarosporales,Acarosporaceae,Acarospora,unknown,99288,MO,MO_50671,Acarospora_unknown
1,50671,939,1,0.831821,['2010-08-10'],Eukarya,Fungi,Ascomycota,Lecanoromycetes,Acarosporales,Acarosporaceae,Acarospora,unknown,99289,MO,MO_50671,Acarospora_unknown
2,50671,939,1,0.831821,['2010-08-10'],Eukarya,Fungi,Ascomycota,Lecanoromycetes,Acarosporales,Acarosporaceae,Acarospora,unknown,99290,MO,MO_50671,Acarospora_unknown
0,67984,2930,1,1.678310,['2011-05-23'],Eukarya,Fungi,Ascomycota,Lecanoromycetes,Acarosporales,Acarosporaceae,Acarospora,unknown,147707,MO,MO_67984,Acarospora_unknown
1,67984,2930,1,1.678310,['2011-05-23'],Eukarya,Fungi,Ascomycota,Lecanoromycetes,Acarosporales,Acarosporaceae,Acarospora,unknown,147708,MO,MO_67984,Acarospora_unknown
2,67984,2930,1,1.678310,['2011-05-23'],Eukarya,Fungi,Ascomycota,Lecanoromycetes,Acarosporales,Acarosporaceae,Acarospora,unknown,147709,MO,MO_67984,Acarospora_unknown
3,67984,2930,1,1.678310,['2011-05-23'],Eukarya,Fungi,Ascomycota,Lecanoromycetes,Acarosporales,Acarosporaceae,Acarospora,unknown,147710,MO,MO_67984,Acarospora_unknown


In [106]:
pickle.dump(DF, open(base_dir + 'combined_dataset_df.p', 'wb'))

And we're done.