# WibblyWobbly

### Match data to a catalog

Import wibblywobbly and load your data and catalog as lists.

In [1]:
import wibblywobbly as ww

catalog = ["Mouse", "Cat", "Dog", "Human"]
data = ["mice",  "CAT ", "doggo", "PERSON", 999]

WibblyWobbly compares the data to the catalog and returns the most likely options, each with a similarity score.
If it cannot find a good match, it returns the original data. By default it returns a pandas dataframe.

WibblyWobbly automatically accepts catalog options whose similarity score is higher than `thr_accept` and rejects those whose score is lower than `thr_reject`. These threshold values can be adjusted depending on the data quality. Non-string values are ignored.

In [2]:
ww.map_list_to_catalog(data, catalog, thr_accept=95, thr_reject=40)

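| | Data | Option1 | Score1 | Option2 | Score2 | Option3 | Score3 |
|---|---|---|---|---|---|---|---|
| 0 | CAT | Cat | 100 | | | | |
| 1 | doggo | Dog | 90 | Mouse | 20.0 | Human | 0.0 |
| 2 | mice | Mouse | 44 | Cat | 29.0 | Human | 22.0 |
| 3 | PERSON | PERSON | 0 | | | | |
| 4 | 999 | 999 | 0 | | | | |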
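
The accept/reject rule can be sketched with the standard library. This is only an illustration: it assumes `difflib.SequenceMatcher` as the similarity measure and uses two hypothetical helpers, `score` and `classify`. Neither is confirmed to be what WibblyWobbly does internally, so the scores need not match the library's.

```python
from difflib import SequenceMatcher


def score(value, option):
    """Similarity between 0 and 100, ignoring case and surrounding spaces."""
    a, b = value.strip().lower(), option.strip().lower()
    return round(100 * SequenceMatcher(None, a, b).ratio())


def classify(value, catalog, thr_accept=95, thr_reject=40):
    """Return (status, result) for one value (hypothetical helper)."""
    if not isinstance(value, str):
        return "ignored", value  # non-string values are left untouched
    best = max(catalog, key=lambda option: score(value, option))
    best_score = score(value, best)
    if best_score > thr_accept:
        return "accept", best
    if best_score < thr_reject:
        return "reject", value  # no good match: keep the original data
    return "wobbly", best  # in between: worth a visual check


catalog = ["Mouse", "Cat", "Dog", "Human"]
for value in ["mice", "CAT ", "doggo", "PERSON", 999]:
    print(repr(value), classify(value, catalog))
```

Raising `thr_accept` sends more values to visual inspection, while lowering `thr_reject` keeps more weak matches as candidates.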


WibblyWobbly can also return a dictionary with the best options.

In [3]:
ww.map_list_to_catalog(data, catalog, output_format="dictionary")

{'PERSON': 'PERSON', 999: 999, 'CAT ': 'Cat', 'doggo': 'Dog', 'mice': 'mice'}
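
Once you have the dictionary, applying it to the original list is a plain lookup. A minimal sketch, using the dictionary printed above and falling back to the original value when a key is missing:

```python
# Dictionary copied from the output above
mapping = {'PERSON': 'PERSON', 999: 999, 'CAT ': 'Cat', 'doggo': 'Dog', 'mice': 'mice'}
data = ["mice", "CAT ", "doggo", "PERSON", 999]

# Look up each value; keep the value itself if it has no entry
cleaned = [mapping.get(value, value) for value in data]
print(cleaned)  # ['mice', 'Cat', 'Dog', 'PERSON', 999]
```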

It is possible to set a `reject_value` that replaces the rejected entries.

In [4]:
ww.map_list_to_catalog(data, catalog, output_format="dictionary", reject_value='Other')

{'PERSON': 'Other', 999: 999, 'CAT ': 'Cat', 'doggo': 'Dog', 'mice': 'Other'}
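
A `reject_value` also makes the rejected entries easy to list afterwards. A small sketch based on the dictionary printed above; it assumes the chosen `reject_value` is not itself a catalog entry, otherwise genuine matches would be listed too:

```python
# Dictionary copied from the output above
mapping = {'PERSON': 'Other', 999: 999, 'CAT ': 'Cat', 'doggo': 'Dog', 'mice': 'Other'}
reject_value = 'Other'

# Collect the original values that were replaced by the reject value
rejected = [key for key, value in mapping.items() if value == reject_value]
print(rejected)  # ['PERSON', 'mice']
```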

WibblyWobbly can also print warnings about suspicious values to facilitate visual inspection.

In [5]:
ww.map_list_to_catalog(data, catalog, output_format="dictionary", 
                       thr_accept=95, thr_reject=40,  warnings=True)

REJECT: PERSON
	Options: Dog (30), Human (18), Mouse (18)
WOBBLY: doggo
	Options: Dog (90), Mouse (20), Human (0)
WOBBLY: mice
	Options: Mouse (44), Cat (29), Human (22)


{'PERSON': 'PERSON', 999: 999, 'CAT ': 'Cat', 'doggo': 'Dog', 'mice': 'Mouse'}

### Clean a dataframe

First import pandas.

In [6]:
import pandas as pd

Then load the catalog and the data table with _.read_csv()_ or _.read_excel()_.

In [7]:
df_catalog = pd.read_csv("./tests/example_taxa.csv")
df_catalog

| | Common name | Order | Family | Genus | Species |
|---|---|---|---|---|---|
| 0 | Guinea pig | Rodentia | Caviidae | Cavia | porcellus |
| 1 | Mouse | Rodentia | Muridae | Mus | musculus |
| 2 | Rat | Rodentia | Muridae | Rattus | norvegicus |
| 3 | Cat | Carnivora | Felidae | Felis | catus |
| 4 | Dog | Carnivora | Canidae | Canis | lupus |
| 5 | Rhesus macaque | Primates | Cercopithecidae | Macaca | mulatta |
| 6 | Chimpanzee | Primates | Hominidae | Pan | troglodytes |
| 7 | Gorilla | Primates | Hominidae | Gorilla | gorilla |
| 8 | Orangutan | Primates | Hominidae | Pongo | pygmaeus |
| 9 | Human | Primates | Hominidae | Homo | sapiens |


In [8]:
df_data = pd.read_csv("./tests/example_dirty_name.csv")
df_data

| | Animal | Count |
|---|---|---|
| 0 | mice | 3 |
| 1 | CAT | 1 |
| 2 | doggo | 5 |
| 3 | PERSON | 0 |
| 4 | guinea pig | 1 |
| 5 | pig | 2 |
| 6 | Gorilla | 3 |
| 7 | Chimpanzee | 0 |
| 8 | orangután | 1 |
| 9 | chinpanze | 7 |


Then create two lists with the columns you want to use as catalog and data using _.to_list()_.

In [9]:
catalog = df_catalog["Common name"].to_list()
print('Catalog: ', catalog)
data = df_data["Animal"].to_list()
print('Data: ', data)

Catalog:  ['Guinea pig', 'Mouse', 'Rat', 'Cat', 'Dog', 'Rhesus macaque', 'Chimpanzee', 'Gorilla', 'Orangutan', 'Human']
Data:  ['mice', 'CAT ', 'doggo', 'PERSON', 'guinea pig', 'pig', 'Gorilla', 'Chimpanzee', 'orangután', 'chinpanze', 'gorila', nan, 'dogs', 'rats', 'mouse', 'kitty', 'Cat', 'macaco']


Create an equivalence dictionary with _.map_list_to_catalog()_. Use `output_format="dictionary"` to get a dictionary and `warnings=True` to check the results.

It may be necessary to adjust `thr_accept` and `thr_reject` to get the best results.

In [10]:
equivalence = ww.map_list_to_catalog(data, catalog, output_format="dictionary", 
                                     thr_accept=80, thr_reject=50,  warnings=True)
equivalence

REJECT: PERSON
	Options: Rhesus macaque (43), Chimpanzee (30), Dog (30)
REJECT: kitty
	Options: Cat (30), Rat (30), Chimpanzee (18)
REJECT: mice
	Options: Rhesus macaque (45), Guinea pig (45), Mouse (44)
WOBBLY: macaco
	Options: Rhesus macaque (60), Cat (60), Human (36)


{nan: nan,
 'pig': 'Guinea pig',
 'orangután': 'Orangutan',
 'gorila': 'Gorilla',
 'guinea pig': 'Guinea pig',
 'PERSON': 'PERSON',
 'dogs': 'Dog',
 'chinpanze': 'Chimpanzee',
 'mouse': 'Mouse',
 'CAT ': 'Cat',
 'kitty': 'kitty',
 'Cat': 'Cat',
 'doggo': 'Dog',
 'Gorilla': 'Gorilla',
 'mice': 'mice',
 'macaco': 'Rhesus macaque',
 'Chimpanzee': 'Chimpanzee',
 'rats': 'Rat'}
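
A quick way to find the entries that still need attention is to list the keys whose value is not in the catalog. A minimal sketch using the catalog and a few entries copied from the dictionary above:

```python
catalog = ['Guinea pig', 'Mouse', 'Rat', 'Cat', 'Dog', 'Rhesus macaque',
           'Chimpanzee', 'Gorilla', 'Orangutan', 'Human']

# A few entries copied from the dictionary above
equivalence = {'pig': 'Guinea pig', 'PERSON': 'PERSON', 'kitty': 'kitty',
               'mice': 'mice', 'macaco': 'Rhesus macaque'}

# Keys whose mapped value is not a catalog entry were left unmatched
allowed = set(catalog)
needs_review = sorted(key for key, value in equivalence.items() if value not in allowed)
print(needs_review)  # ['PERSON', 'kitty', 'mice']
```

This check only finds unmatched values. A value mapped to the wrong catalog entry still passes it, so the WOBBLY warnings remain worth reading.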

Manually correct the errors by editing the dictionary.

In [11]:
equivalence['macaco'] = 'Rhesus macaque'
equivalence['kitty']  = 'Cat'
equivalence['PERSON'] = 'Human'
equivalence['mice']   = 'Mouse'
equivalence

{nan: nan,
 'pig': 'Guinea pig',
 'orangután': 'Orangutan',
 'gorila': 'Gorilla',
 'guinea pig': 'Guinea pig',
 'PERSON': 'Human',
 'dogs': 'Dog',
 'chinpanze': 'Chimpanzee',
 'mouse': 'Mouse',
 'CAT ': 'Cat',
 'kitty': 'Cat',
 'Cat': 'Cat',
 'doggo': 'Dog',
 'Gorilla': 'Gorilla',
 'mice': 'Mouse',
 'macaco': 'Rhesus macaque',
 'Chimpanzee': 'Chimpanzee',
 'rats': 'Rat'}

Clean the dirty data using the equivalence dictionary and the function _.map()_. Don't forget to assign the result back to the column to save the new values.

In [12]:
df_data['Animal'] = df_data['Animal'].map(equivalence)
df_data

| | Animal | Count |
|---|---|---|
| 0 | Mouse | 3 |
| 1 | Cat | 1 |
| 2 | Dog | 5 |
| 3 | Human | 0 |
| 4 | Guinea pig | 1 |
| 5 | Guinea pig | 2 |
| 6 | Gorilla | 3 |
| 7 | Chimpanzee | 0 |
| 8 | Orangutan | 1 |
| 9 | Chimpanzee | 7 |
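
One thing to watch with _.map()_: when it is given a plain dictionary, pandas turns any value without an entry into `NaN`. The sketch below uses a made-up series in which `"zebra"` is a hypothetical value missing from the dictionary, and shows one way to fall back to the original value.

```python
import pandas as pd

animals = pd.Series(["mice", "CAT ", "zebra"])
equivalence = {"mice": "Mouse", "CAT ": "Cat"}  # no entry for the hypothetical "zebra"

# Values missing from the dictionary become NaN
mapped = animals.map(equivalence)
print(mapped.isna().tolist())  # [False, False, True]

# Fall back to the original value where the mapping produced NaN
kept = animals.map(equivalence).fillna(animals)
print(kept.tolist())  # ['Mouse', 'Cat', 'zebra']
```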


Save the clean table as a file with _.to_csv()_ or _.to_excel()_.

```python
df_data.to_csv("./tests/example_clean_name.csv")
```
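
By default _.to_csv()_ also writes the dataframe index, which comes back as an extra `Unnamed: 0` column when the file is read again. If that column is not wanted, passing `index=False` avoids it. A small sketch with a made-up dataframe and in-memory buffers instead of files:

```python
import io

import pandas as pd

df = pd.DataFrame({"Animal": ["Mouse", "Cat"], "Count": [3, 1]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()
print(columns_with_index)  # ['Unnamed: 0', 'Animal', 'Count']

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = pd.read_csv(without_index).columns.tolist()
print(columns_without_index)  # ['Animal', 'Count']
```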