In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 

# Data Tidying and Cleaning Lab
## Reading, tidying and cleaning data. Preparing data for exploration, mining, analysis and learning

In this lab, you'll be working with the Coffee Quality Index dataset, located [here](https://www.kaggle.com/datasets/volpatto/coffee-quality-database-from-cqi). For convenience (and to save trouble in case you can't download files, or someone uploads a newer version), I've provided the dataset in the `data/` folder. The metadata (description) is at the Kaggle link. For this lab, you'll only need `merged_data_cleaned.csv`, as it is the concatenation of the other two datasets.

In this (and the following labs), you'll get several questions and problems. Do your analysis, describe it, use any tools and plots you wish, and answer. You can create any number of cells you'd like.

Sometimes, the answers will not be unique, and they will depend on how you decide to approach and solve the problem. This is usual - we're doing science after all!

It's a good idea to save your clean dataset after all the work you've done to it.

### Problem 1. Read the dataset (1 point)
This should be self-explanatory. The first column is the index.

In [6]:
coffe_index = pd.read_csv("data/merged_data_cleaned.csv", index_col=0)

In [7]:
coffe_index

Unnamed: 0,Species,Owner,Country.of.Origin,Farm.Name,Lot.Number,Mill,ICO.Number,Company,Altitude,Region,...,Color,Category.Two.Defects,Expiration,Certification.Body,Certification.Address,Certification.Contact,unit_of_measurement,altitude_low_meters,altitude_high_meters,altitude_mean_meters
0,Arabica,metad plc,Ethiopia,metad plc,,metad plc,2014/2015,metad agricultural developmet plc,1950-2200,guji-hambela,...,Green,0,"April 3rd, 2016",METAD Agricultural Development plc,309fcf77415a3661ae83e027f7e5f05dad786e44,19fef5a731de2db57d16da10287413f5f99bc2dd,m,1950.0,2200.0,2075.0
1,Arabica,metad plc,Ethiopia,metad plc,,metad plc,2014/2015,metad agricultural developmet plc,1950-2200,guji-hambela,...,Green,1,"April 3rd, 2016",METAD Agricultural Development plc,309fcf77415a3661ae83e027f7e5f05dad786e44,19fef5a731de2db57d16da10287413f5f99bc2dd,m,1950.0,2200.0,2075.0
2,Arabica,grounds for health admin,Guatemala,"san marcos barrancas ""san cristobal cuch",,,,,1600 - 1800 m,,...,,0,"May 31st, 2011",Specialty Coffee Association,36d0d00a3724338ba7937c52a378d085f2172daa,0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660,m,1600.0,1800.0,1700.0
3,Arabica,yidnekachew dabessa,Ethiopia,yidnekachew dabessa coffee plantation,,wolensu,,yidnekachew debessa coffee plantation,1800-2200,oromia,...,Green,2,"March 25th, 2016",METAD Agricultural Development plc,309fcf77415a3661ae83e027f7e5f05dad786e44,19fef5a731de2db57d16da10287413f5f99bc2dd,m,1800.0,2200.0,2000.0
4,Arabica,metad plc,Ethiopia,metad plc,,metad plc,2014/2015,metad agricultural developmet plc,1950-2200,guji-hambela,...,Green,2,"April 3rd, 2016",METAD Agricultural Development plc,309fcf77415a3661ae83e027f7e5f05dad786e44,19fef5a731de2db57d16da10287413f5f99bc2dd,m,1950.0,2200.0,2075.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1334,Robusta,luis robles,Ecuador,robustasa,Lavado 1,our own lab,,robustasa,,"san juan, playas",...,Blue-Green,1,"January 18th, 2017",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,,,
1335,Robusta,luis robles,Ecuador,robustasa,Lavado 3,own laboratory,,robustasa,40,"san juan, playas",...,Blue-Green,0,"January 18th, 2017",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,40.0,40.0,40.0
1336,Robusta,james moore,United States,fazenda cazengo,,cafe cazengo,,global opportunity fund,795 meters,"kwanza norte province, angola",...,,6,"December 23rd, 2015",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,795.0,795.0,795.0
1337,Robusta,cafe politico,India,,,,14-1118-2014-0087,cafe politico,,,...,Green,1,"August 25th, 2015",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,,,


### Problem 2. Observations and features (1 point)
How many observations are there? How many features? Which features are numerical, and which are categorical?

**Note:** Think about the _meaning_, not the data types. The dataset hasn't been thoroughly cleaned.

#### Observations 
In a dataset, an observation is a single row of data and represents one individual record. In our dataset the observations are the rows: 1339 of them (the index runs from 0 to 1338).
#### Features
The features are specific measurable properties or characteristics of the data, and they correspond to the columns. In our dataset we have 43 features.
##### Numerical Features
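These are features that represent quantities. Usually they are numbers on which we can do mathematical operations. In our dataset such features are Number.of.Bags, Category.Two.Defects, the ratings (Aroma, Flavor, etc.) and Altitude, although Altitude is stored as text. ICO.Number looks like a number, but it is an identifier, so by meaning it is categorical.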
##### Categorical Features
These are features that represent categories or groups. They can be nominal or ordinal. In our dataset such features are Species, Owner, Country.of.Origin, etc.
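The counts above can be checked in code. The following is a minimal sketch on a small toy frame (the four columns and their values are only illustrative, not the real data); it shows why the stored data type is not enough to decide whether a feature is numerical by meaning.

```python
import pandas as pd

# Toy stand-in for the real table; values are illustrative only.
sample = pd.DataFrame({
    "Species": ["Arabica", "Robusta"],
    "Altitude": ["1950-2200", "795 meters"],
    "Aroma": [8.67, 7.5],
    "Category.Two.Defects": [0, 6],
})

# Rows are observations, columns are features.
n_observations, n_features = sample.shape

# What pandas stores as numbers versus everything else.
stored_as_number = sample.select_dtypes(include="number").columns.tolist()
stored_as_text = sample.select_dtypes(exclude="number").columns.tolist()

print(n_observations, n_features)
print("numeric dtype:", stored_as_number)
print("other dtype:", stored_as_text)
```

Here `Altitude` lands in the non-numeric group even though it measures a quantity, which is exactly the kind of column that needs cleaning before it can be treated as numerical.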

### Problem 3. Column manipulation (1 point)
Make the column names more Pythonic (which helps with the quality and... aesthetics). Convert column names to `snake_case`, i.e. `species`, `country_of_origin`, `ico_number`, etc. Try to not do it manually.

In [11]:
coffe_index.columns

Index(['Species', 'Owner', 'Country.of.Origin', 'Farm.Name', 'Lot.Number',
       'Mill', 'ICO.Number', 'Company', 'Altitude', 'Region', 'Producer',
       'Number.of.Bags', 'Bag.Weight', 'In.Country.Partner', 'Harvest.Year',
       'Grading.Date', 'Owner.1', 'Variety', 'Processing.Method', 'Aroma',
       'Flavor', 'Aftertaste', 'Acidity', 'Body', 'Balance', 'Uniformity',
       'Clean.Cup', 'Sweetness', 'Cupper.Points', 'Total.Cup.Points',
       'Moisture', 'Category.One.Defects', 'Quakers', 'Color',
       'Category.Two.Defects', 'Expiration', 'Certification.Body',
       'Certification.Address', 'Certification.Contact', 'unit_of_measurement',
       'altitude_low_meters', 'altitude_high_meters', 'altitude_mean_meters'],
      dtype='object')

In [12]:
# Use a function to convert the column names to lower case and snake_case  
coffe_index = coffe_index.rename(columns = lambda col: col.lower().replace(".", "_"))
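The rename above covers this dataset, where the only separator is a dot. As a hedged alternative, the sketch below uses a regular expression so that spaces or other punctuation would be handled too; `to_snake_case` is a hypothetical helper name, not something from pandas.

```python
import re

def to_snake_case(name):
    """Lowercase a column name and turn runs of non-alphanumerics into '_'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()

for original in ["Country.of.Origin", "ICO.Number", "Owner.1", "altitude_low_meters"]:
    print(original, "->", to_snake_case(original))
```

It could be passed directly to `rename(columns=to_snake_case)` in place of the lambda.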

In [13]:
coffe_index

Unnamed: 0,species,owner,country_of_origin,farm_name,lot_number,mill,ico_number,company,altitude,region,...,color,category_two_defects,expiration,certification_body,certification_address,certification_contact,unit_of_measurement,altitude_low_meters,altitude_high_meters,altitude_mean_meters
0,Arabica,metad plc,Ethiopia,metad plc,,metad plc,2014/2015,metad agricultural developmet plc,1950-2200,guji-hambela,...,Green,0,"April 3rd, 2016",METAD Agricultural Development plc,309fcf77415a3661ae83e027f7e5f05dad786e44,19fef5a731de2db57d16da10287413f5f99bc2dd,m,1950.0,2200.0,2075.0
1,Arabica,metad plc,Ethiopia,metad plc,,metad plc,2014/2015,metad agricultural developmet plc,1950-2200,guji-hambela,...,Green,1,"April 3rd, 2016",METAD Agricultural Development plc,309fcf77415a3661ae83e027f7e5f05dad786e44,19fef5a731de2db57d16da10287413f5f99bc2dd,m,1950.0,2200.0,2075.0
2,Arabica,grounds for health admin,Guatemala,"san marcos barrancas ""san cristobal cuch",,,,,1600 - 1800 m,,...,,0,"May 31st, 2011",Specialty Coffee Association,36d0d00a3724338ba7937c52a378d085f2172daa,0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660,m,1600.0,1800.0,1700.0
3,Arabica,yidnekachew dabessa,Ethiopia,yidnekachew dabessa coffee plantation,,wolensu,,yidnekachew debessa coffee plantation,1800-2200,oromia,...,Green,2,"March 25th, 2016",METAD Agricultural Development plc,309fcf77415a3661ae83e027f7e5f05dad786e44,19fef5a731de2db57d16da10287413f5f99bc2dd,m,1800.0,2200.0,2000.0
4,Arabica,metad plc,Ethiopia,metad plc,,metad plc,2014/2015,metad agricultural developmet plc,1950-2200,guji-hambela,...,Green,2,"April 3rd, 2016",METAD Agricultural Development plc,309fcf77415a3661ae83e027f7e5f05dad786e44,19fef5a731de2db57d16da10287413f5f99bc2dd,m,1950.0,2200.0,2075.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1334,Robusta,luis robles,Ecuador,robustasa,Lavado 1,our own lab,,robustasa,,"san juan, playas",...,Blue-Green,1,"January 18th, 2017",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,,,
1335,Robusta,luis robles,Ecuador,robustasa,Lavado 3,own laboratory,,robustasa,40,"san juan, playas",...,Blue-Green,0,"January 18th, 2017",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,40.0,40.0,40.0
1336,Robusta,james moore,United States,fazenda cazengo,,cafe cazengo,,global opportunity fund,795 meters,"kwanza norte province, angola",...,,6,"December 23rd, 2015",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,795.0,795.0,795.0
1337,Robusta,cafe politico,India,,,,14-1118-2014-0087,cafe politico,,,...,Green,1,"August 25th, 2015",Specialty Coffee Association,ff7c18ad303d4b603ac3f8cff7e611ffc735e720,352d0cf7f3e9be14dad7df644ad65efc27605ae2,m,,,


In [14]:
pd.set_option('display.max_columns', None)

In [15]:
print(coffe_index)

      species                     owner country_of_origin  \
0     Arabica                 metad plc          Ethiopia   
1     Arabica                 metad plc          Ethiopia   
2     Arabica  grounds for health admin         Guatemala   
3     Arabica       yidnekachew dabessa          Ethiopia   
4     Arabica                 metad plc          Ethiopia   
...       ...                       ...               ...   
1334  Robusta               luis robles           Ecuador   
1335  Robusta               luis robles           Ecuador   
1336  Robusta               james moore     United States   
1337  Robusta             cafe politico             India   
1338  Robusta             cafe politico           Vietnam   

                                     farm_name lot_number            mill  \
0                                    metad plc        NaN       metad plc   
1                                    metad plc        NaN       metad plc   
2     san marcos barrancas "san cris

### Problem 4. Bag weight (1 point)
What's up with the bag weights? Make all necessary changes to the column values. Don't forget to document your methods and assumptions.

In [17]:
coffe_index.bag_weight

0       60 kg
1       60 kg
2           1
3       60 kg
4       60 kg
        ...  
1334     2 kg
1335     2 kg
1336     1 kg
1337    5 lbs
1338    5 lbs
Name: bag_weight, Length: 1339, dtype: object

In [18]:
coffe_index.bag_weight.dtype

dtype('O')

In [19]:
coffe_index.bag_weight.value_counts()

bag_weight
1 kg        331
60 kg       256
69 kg       200
70 kg       156
2 kg        122
100 lbs      59
30 kg        29
5 lbs        23
6            19
20 kg        14
50 kg        14
10 kg        11
59 kg        10
1 lbs         8
1             7
3 lbs         7
5 kg          7
2 lbs         5
4 lbs         4
80 kg         4
18975 kg      4
0 lbs         3
46 kg         3
29 kg         2
9000 kg       2
25 kg         2
66 kg         2
35 kg         2
12000 kg      2
40 kg         2
6 kg          2
19200 kg      2
15 kg         2
13800 kg      1
100 kg        1
55 lbs        1
4 kg          1
67 kg         1
350 kg        1
3 kg          1
8 kg          1
80 lbs        1
24 kg         1
1500 kg       1
2 kg,lbs      1
0 kg          1
660 kg        1
1218 kg       1
2             1
18 kg         1
150 lbs       1
18000 kg      1
1 kg,lbs      1
132 lbs       1
34 kg         1
130 lbs       1
Name: count, dtype: int64

#### Problems found with bag_weight column
Looking at `value_counts()`, the dtype and the column data, we can see the following problems
with the bag_weight column:
- different weight units (kg and lbs)
- numbers and units are not separated
- entries without units (e.g. `1`, `6`)
- ambiguous entries (`1 kg,lbs`, `2 kg,lbs`)
- problems with whitespace and casing
- data type is object

To solve the above issues with the bag_weight column, we will perform the following operations.

In [22]:
# 1. Normalize casing and spaces

coffe_index["bag_weight"] = coffe_index["bag_weight"].str.lower().str.replace(",", " ").str.strip()

In [23]:
coffe_index["bag_weight"]

0       60 kg
1       60 kg
2           1
3       60 kg
4       60 kg
        ...  
1334     2 kg
1335     2 kg
1336     1 kg
1337    5 lbs
1338    5 lbs
Name: bag_weight, Length: 1339, dtype: object

In [24]:
# 2. Extract the weight
# We want the numeric weight of each observation, so we extract the number
# as a float and keep it in a new feature column, bag_weight_num.

coffe_index["bag_weight_num"] = coffe_index["bag_weight"].str.extract(r'(\d+\.?\d*)')[0].astype(float)

In [25]:
coffe_index["bag_weight_num"]

0       60.0
1       60.0
2        1.0
3       60.0
4       60.0
        ... 
1334     2.0
1335     2.0
1336     1.0
1337     5.0
1338     5.0
Name: bag_weight_num, Length: 1339, dtype: float64

In [26]:
# 3. Extract the unit

# As a next step we extract the unit from the bag_weight column and store it in a new column, bag_weight_unit

coffe_index["bag_weight_unit"] =  coffe_index["bag_weight"].str.extract(r'\d+\.?\d*\s*([a-z]*)')[0]

In [27]:
# 4. Convert lbs to kg

# Assumption: entries without a unit are treated as kilograms.
coffe_index["weight_kg"] = np.where(
    coffe_index["bag_weight_unit"].isin(["lb", "lbs"]),
    coffe_index["bag_weight_num"] * 0.453592,
    coffe_index["bag_weight_num"]
)
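The four steps above can be sketched end to end on a handful of values taken from the `value_counts()` output. This is only an illustration under two stated assumptions: a missing unit means kilograms, and in an ambiguous entry such as `2 kg,lbs` the first unit wins. The constant name `LB_TO_KG` is introduced here for readability.

```python
import numpy as np
import pandas as pd

LB_TO_KG = 0.45359237  # exact definition of the international pound

raw = pd.Series(["60 kg", "1", "5 lbs", "2 kg,lbs", "100 LBS "])

# Normalize casing, separators and surrounding whitespace.
clean = raw.str.lower().str.replace(",", " ").str.strip()

# Split each entry into a number and the first unit that follows it.
parts = clean.str.extract(r"^(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[a-z]*)")
value = parts["value"].astype(float)

# Assumption: an entry with no unit is in kilograms.
unit = parts["unit"].replace("", "kg")

# Convert pounds to kilograms; everything else is kept as is.
weight_kg = np.where(unit.isin(["lb", "lbs"]), value * LB_TO_KG, value)

print(pd.DataFrame({"raw": raw, "unit": unit, "weight_kg": weight_kg}))
```

Making the default unit explicit keeps the assumption visible in the data instead of hiding it in the conversion step.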

In [28]:
coffe_index["bag_weight"]

0       60 kg
1       60 kg
2           1
3       60 kg
4       60 kg
        ...  
1334     2 kg
1335     2 kg
1336     1 kg
1337    5 lbs
1338    5 lbs
Name: bag_weight, Length: 1339, dtype: object

In [29]:
# Delete the duplicated and redundant columns

In [30]:
print(coffe_index)

      species                     owner country_of_origin  \
0     Arabica                 metad plc          Ethiopia   
1     Arabica                 metad plc          Ethiopia   
2     Arabica  grounds for health admin         Guatemala   
3     Arabica       yidnekachew dabessa          Ethiopia   
4     Arabica                 metad plc          Ethiopia   
...       ...                       ...               ...   
1334  Robusta               luis robles           Ecuador   
1335  Robusta               luis robles           Ecuador   
1336  Robusta               james moore     United States   
1337  Robusta             cafe politico             India   
1338  Robusta             cafe politico           Vietnam   

                                     farm_name lot_number            mill  \
0                                    metad plc        NaN       metad plc   
1                                    metad plc        NaN       metad plc   
2     san marcos barrancas "san cris

### Problem 5. Dates (1 point)
This should remind you of problem 4 but it's slightly nastier. Fix the harvest years, document the process.

While you're here, fix the expiration dates, and grading dates. Unlike the other column, these should be dates (`pd.to_datetime()` is your friend).

In [65]:
coffe_index.harvest_year.value_counts()

harvest_year
2012                        354
2014                        233
2013                        181
2015                        129
2016                        124
2017                         70
2013/2014                    29
2015/2016                    28
2011                         26
2017 / 2018                  19
2014/2015                    19
2009/2010                    12
2010                         10
2010-2011                     6
2016 / 2017                   6
4T/10                         4
March 2010                    3
2009-2010                     3
Mayo a Julio                  3
4T/2010                       3
Abril - Julio                 2
January 2011                  2
2011/2012                     2
08/09 crop                    2
December 2009-March 2010      1
TEST                          1
4T72010                       1
2018                          1
1t/2011                       1
2016/2017                     1
3T/2011                    
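One possible way to fix the harvest years is to keep the first four-digit year found in each entry and mark everything else as missing. The sketch below assumes that for a range such as `2013/2014` the first year is the one to keep, and that entries with no four-digit year (`Mayo a Julio`, `08/09 crop`, `TEST`) cannot be recovered. `first_year` is a hypothetical helper name.

```python
import re

# A four-digit year starting with 19 or 20.
YEAR = re.compile(r"(?:19|20)\d{2}")

def first_year(text):
    """Return the first four-digit year in text, or None if there is none."""
    if not isinstance(text, str):
        return None  # covers NaN and other non-text values
    match = YEAR.search(text)
    return int(match.group()) if match else None

examples = ["2012", "2013/2014", "2017 / 2018", "December 2009-March 2010",
            "4T/2010", "4T72010", "Mayo a Julio", "08/09 crop", "TEST"]
for raw in examples:
    print(repr(raw), "->", first_year(raw))
```

The helper could then be applied to the column with `coffe_index["harvest_year"].map(first_year)`.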

In [69]:
coffe_index.columns


Index(['species', 'owner', 'country_of_origin', 'farm_name', 'lot_number',
       'mill', 'ico_number', 'company', 'altitude', 'region', 'producer',
       'number_of_bags', 'bag_weight', 'in_country_partner', 'harvest_year',
       'grading_date', 'owner_1', 'variety', 'processing_method', 'aroma',
       'flavor', 'aftertaste', 'acidity', 'body', 'balance', 'uniformity',
       'clean_cup', 'sweetness', 'cupper_points', 'total_cup_points',
       'moisture', 'category_one_defects', 'quakers', 'color',
       'category_two_defects', 'expiration', 'certification_body',
       'certification_address', 'certification_contact', 'unit_of_measurement',
       'altitude_low_meters', 'altitude_high_meters', 'altitude_mean_meters',
       'bag_weight_num', 'bag_weight_unit', 'weight_kg'],
      dtype='object')

In [73]:
coffe_index.expiration.value_counts()

expiration
July 11th, 2013        25
December 26th, 2014    25
June 6th, 2013         19
August 30th, 2013      18
July 26th, 2013        15
                       ..
March 8th, 2012         1
May 11th, 2012          1
December 1st, 2012      1
April 27th, 2013        1
December 23rd, 2015     1
Name: count, Length: 566, dtype: int64

In [77]:
coffe_index.grading_date.value_counts()

grading_date
July 11th, 2012        25
December 26th, 2013    24
June 6th, 2012         19
August 30th, 2012      18
July 26th, 2012        15
                       ..
March 9th, 2011         1
May 12th, 2011          1
December 2nd, 2011      1
April 27th, 2012        1
December 23rd, 2014     1
Name: count, Length: 567, dtype: int64
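Both date columns follow the pattern "Month day-with-suffix, year". A minimal sketch of the conversion, assuming every entry follows that pattern: strip the ordinal suffix, then parse with an explicit format so that anything unexpected becomes `NaT` instead of being guessed.

```python
import pandas as pd

# Sample values copied from the outputs above.
raw_dates = pd.Series(["April 3rd, 2016", "May 31st, 2011",
                       "December 23rd, 2015", "July 11th, 2013"])

# Remove the ordinal suffix that follows the day number (1st, 2nd, 3rd, 4th...).
without_suffix = raw_dates.str.replace(r"(\d+)(?:st|nd|rd|th)", r"\1", regex=True)

# Parse with an explicit format; unparseable entries become NaT.
parsed = pd.to_datetime(without_suffix, format="%B %d, %Y", errors="coerce")

print(parsed)
```

Counting `parsed.isna()` afterwards is a quick way to see whether any entry did not match the assumed pattern.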

### Problem 6. Countries (1 point)
How many coffees are there with unknown countries of origin? What can you do about them?

In [83]:
coffe_index.country_of_origin.value_counts()

country_of_origin
Mexico                          236
Colombia                        183
Guatemala                       181
Brazil                          132
Taiwan                           75
United States (Hawaii)           73
Honduras                         53
Costa Rica                       51
Ethiopia                         44
Tanzania, United Republic Of     40
Uganda                           36
Thailand                         32
Nicaragua                        26
Kenya                            25
El Salvador                      21
Indonesia                        20
China                            16
India                            14
Malawi                           11
United States                    10
Peru                             10
Myanmar                           8
Vietnam                           8
Haiti                             6
Philippines                       5
United States (Puerto Rico)       4
Panama                            4
Ecuador   

In [79]:
coffe_index.country_of_origin.isna()

0       False
1       False
2       False
3       False
4       False
        ...  
1334    False
1335    False
1336    False
1337    False
1338    False
Name: country_of_origin, Length: 1339, dtype: bool

In [81]:
coffe_index.country_of_origin.isna().value_counts()

country_of_origin
False    1338
True        1
Name: count, dtype: int64

In [89]:
unknown_countries = coffe_index[coffe_index["country_of_origin"].isna()]

In [91]:
unknown_countries

Unnamed: 0,species,owner,country_of_origin,farm_name,lot_number,mill,ico_number,company,altitude,region,producer,number_of_bags,bag_weight,in_country_partner,harvest_year,grading_date,owner_1,variety,processing_method,aroma,flavor,aftertaste,acidity,body,balance,uniformity,clean_cup,sweetness,cupper_points,total_cup_points,moisture,category_one_defects,quakers,color,category_two_defects,expiration,certification_body,certification_address,certification_contact,unit_of_measurement,altitude_low_meters,altitude_high_meters,altitude_mean_meters,bag_weight_num,bag_weight_unit,weight_kg
1197,Arabica,racafe & cia s.c.a,,,,,3-37-1980,,,,,149,70 kg,Almacafé,,"March 1st, 2011",Racafe & Cia S.C.A,,,6.75,6.75,6.42,6.83,7.58,7.5,10.0,10.0,10.0,7.25,79.08,0.1,0,0.0,,3,"February 29th, 2012",Almacafé,e493c36c2d076bf273064f7ac23ad562af257a25,70d3c0c26f89e00fdae6fb39ff54f0d2eb1c38ab,m,,,,70.0,kg,70.0


There is only one coffee with an unknown country of origin. Since it is a single row out of 1339, we can drop it without noticeably affecting the data.
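An alternative to dropping is to infer the country from other rows that share the same owner. The sketch below uses placeholder owners and countries (all hypothetical) and fills a missing country with the most common country recorded for that owner; whether this is trustworthy for the real row would need to be checked by hand.

```python
import pandas as pd

# Placeholder data; owners and countries are hypothetical.
sample = pd.DataFrame({
    "owner": ["owner a", "owner a", "owner a", "owner b"],
    "country_of_origin": ["Country X", "Country X", None, "Country Y"],
})

def most_common(values):
    """Most frequent non-missing value, or None if there is none."""
    modes = values.dropna().mode()
    return modes.iloc[0] if not modes.empty else None

# Most common known country per owner.
owner_country = sample.groupby("owner")["country_of_origin"].agg(most_common)

# Fill only the missing entries, using the owner's usual country.
filled = sample["country_of_origin"].fillna(sample["owner"].map(owner_country))

print(filled)
```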

### Problem 7. Owners (1 point)
There are two suspicious columns, named `Owner`, and `Owner.1` (they're likely called something different after you solved problem 3). Do something about them. Is there any link to `Producer`?

In [93]:
coffe_index.owner.value_counts()

owner
juan luis alvarado romero           155
racafe & cia s.c.a                   60
exportadora de cafe condor s.a       54
kona pacific farmers cooperative     52
ipanema coffees                      50
                                   ... 
alvaro quiros perez                   1
olivia hernandez virves               1
finca las nieves                      1
pedro santos e silva                  1
james moore                           1
Name: count, Length: 315, dtype: int64

In [95]:
coffe_index.owner_1.value_counts()

owner_1
Juan Luis Alvarado Romero           155
Racafe & Cia S.C.A                   60
Exportadora de Cafe Condor S.A       54
Kona Pacific Farmers Cooperative     52
Ipanema Coffees                      50
                                   ... 
ALVARO QUIROS PEREZ                   1
OLIVIA HERNANDEZ VIRVES               1
FINCA LAS NIEVES                      1
Pedro Santos e Silva                  1
James Moore                           1
Name: count, Length: 319, dtype: int64

In [97]:
# Compare the two columns
# First we lowercase the text in both columns and strip whitespace for the comparison
coffe_index["owner_clean"] = coffe_index["owner"].str.lower().str.strip()
coffe_index["owner_1_clean"] = coffe_index["owner_1"].str.lower().str.strip()

In [99]:
coffe_index.owner_clean.value_counts() 

owner_clean
juan luis alvarado romero           155
racafe & cia s.c.a                   60
exportadora de cafe condor s.a       54
kona pacific farmers cooperative     52
ipanema coffees                      50
                                   ... 
alvaro quiros perez                   1
olivia hernandez virves               1
finca las nieves                      1
pedro santos e silva                  1
james moore                           1
Name: count, Length: 315, dtype: int64

In [101]:
coffe_index.owner_1_clean.value_counts()

owner_1_clean
juan luis alvarado romero           155
racafe & cia s.c.a                   60
exportadora de cafe condor s.a       54
kona pacific farmers cooperative     52
ipanema coffees                      50
                                   ... 
romulo bello flores                   1
rachel peterson                       1
josé luis rojas yeo                   1
nitin coffee estate                   1
james moore                           1
Name: count, Length: 317, dtype: int64

In [103]:
# Compare the clean columns
coffe_index["compare"] = coffe_index["owner_clean"] == coffe_index["owner_1_clean"]

In [105]:
# Count how many rows differ
coffe_index["compare"].value_counts()

compare
True     1328
False      11
Name: count, dtype: int64

In [107]:
# List the rows which differ
differences = coffe_index[~coffe_index["compare"]][["owner_clean", "owner_1_clean"]]

In [109]:
differences

Unnamed: 0,owner_clean,owner_1_clean
219,"ceca, s.a.","ceca,s.a."
364,,
392,federacion nacional de cafeteros,federación nacional de cafeteros
459,,
602,,
734,klem organics,klemorganics
848,,
882,,
961,klem organics,klemorganics
975,,


Taking into account that the analysis shows the differences between the two columns are small (spacing, accents and missing values), we can remove the `owner` column.
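Several of the rows listed as different are missing in both columns; they show up because a missing value never compares equal to anything, including another missing value. A hedged sketch of a stricter comparison follows: it also ignores accents, spaces, dots and commas, and treats "missing in both" as equal. `normalize_name` is a hypothetical helper, and the sample values mirror the kinds of differences found above.

```python
import re
import unicodedata

import pandas as pd

def normalize_name(name):
    """Lowercase, strip accents and drop spaces, dots and commas."""
    if pd.isna(name):
        return None
    ascii_name = (unicodedata.normalize("NFKD", name)
                  .encode("ascii", "ignore").decode("ascii"))
    return re.sub(r"[\s.,]+", "", ascii_name.lower())

owner = pd.Series(["ceca, s.a.", None, "federacion nacional de cafeteros",
                   "klem organics", "metad plc"])
owner_1 = pd.Series(["CECA,S.A.", None, "Federación Nacional de Cafeteros",
                     "KlemOrganics", "Some Other Owner"])

a = owner.map(normalize_name)
b = owner_1.map(normalize_name)

# Equal after normalization, or missing in both columns.
same = (a == b) | (a.isna() & b.isna())

print(same)
```

With this comparison only genuinely different names remain flagged, which makes the decision to drop one of the columns easier to justify.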

In [111]:
owner_set = set(coffe_index['owner_clean'].dropna().unique())
owner1_set = set(coffe_index['owner_1_clean'].dropna().unique())

only_in_owner = owner_set - owner1_set
only_in_owner1 = owner1_set - owner_set

In [113]:
only_in_owner

{'klem organics'}

In [115]:
only_in_owner1

{'ceca,s.a.', 'federación nacional de cafeteros', 'klemorganics'}

In [None]:
coffe_index.producer.value_counts()

To find out whether there is any link to the `producer` column, we first lowercase its text and strip the whitespace.

In [117]:
coffe_index["producer_clean"]= coffe_index["producer"].str.lower().str.strip()

In [121]:
# Compare the two
coffe_index["relation"] = coffe_index["producer_clean"] == coffe_index["owner_clean"]

In [123]:
coffe_index.relation.value_counts()

relation
False    1188
True      151
Name: count, dtype: int64

In [125]:
coffe_index[coffe_index["relation"]]

Unnamed: 0,species,owner,country_of_origin,farm_name,lot_number,mill,ico_number,company,altitude,region,producer,number_of_bags,bag_weight,in_country_partner,harvest_year,grading_date,owner_1,variety,processing_method,aroma,flavor,aftertaste,acidity,body,balance,uniformity,clean_cup,sweetness,cupper_points,total_cup_points,moisture,category_one_defects,quakers,color,category_two_defects,expiration,certification_body,certification_address,certification_contact,unit_of_measurement,altitude_low_meters,altitude_high_meters,altitude_mean_meters,bag_weight_num,bag_weight_unit,weight_kg,owner_clean,owner_1_clean,compare,producer_clean,relation
0,Arabica,metad plc,Ethiopia,metad plc,,metad plc,2014/2015,metad agricultural developmet plc,1950-2200,guji-hambela,METAD PLC,300,60 kg,METAD Agricultural Development plc,2014,"April 4th, 2015",metad plc,,Washed / Wet,8.67,8.83,8.67,8.75,8.50,8.42,10.0,10.0,10.00,8.75,90.58,0.12,0,0.0,Green,0,"April 3rd, 2016",METAD Agricultural Development plc,309fcf77415a3661ae83e027f7e5f05dad786e44,19fef5a731de2db57d16da10287413f5f99bc2dd,m,1950.0,2200.0,2075.0,60.0,kg,60.0,metad plc,metad plc,True,metad plc,True
1,Arabica,metad plc,Ethiopia,metad plc,,metad plc,2014/2015,metad agricultural developmet plc,1950-2200,guji-hambela,METAD PLC,300,60 kg,METAD Agricultural Development plc,2014,"April 4th, 2015",metad plc,Other,Washed / Wet,8.75,8.67,8.50,8.58,8.42,8.42,10.0,10.0,10.00,8.58,89.92,0.12,0,0.0,Green,1,"April 3rd, 2016",METAD Agricultural Development plc,309fcf77415a3661ae83e027f7e5f05dad786e44,19fef5a731de2db57d16da10287413f5f99bc2dd,m,1950.0,2200.0,2075.0,60.0,kg,60.0,metad plc,metad plc,True,metad plc,True
4,Arabica,metad plc,Ethiopia,metad plc,,metad plc,2014/2015,metad agricultural developmet plc,1950-2200,guji-hambela,METAD PLC,300,60 kg,METAD Agricultural Development plc,2014,"April 4th, 2015",metad plc,Other,Washed / Wet,8.25,8.50,8.25,8.50,8.42,8.33,10.0,10.0,10.00,8.58,88.83,0.12,0,0.0,Green,2,"April 3rd, 2016",METAD Agricultural Development plc,309fcf77415a3661ae83e027f7e5f05dad786e44,19fef5a731de2db57d16da10287413f5f99bc2dd,m,1950.0,2200.0,2075.0,60.0,kg,60.0,metad plc,metad plc,True,metad plc,True
9,Arabica,diamond enterprise plc,Ethiopia,tulla coffee farm,,tulla coffee farm,2014/15,diamond enterprise plc,1795-1850,"snnp/kaffa zone,gimbowereda",Diamond Enterprise Plc,50,60 kg,METAD Agricultural Development plc,2014,"March 30th, 2015",Diamond Enterprise Plc,Other,Natural / Dry,8.08,8.58,8.50,8.50,7.67,8.42,10.0,10.0,10.00,8.50,88.25,0.10,0,0.0,Green,4,"March 29th, 2016",METAD Agricultural Development plc,309fcf77415a3661ae83e027f7e5f05dad786e44,19fef5a731de2db57d16da10287413f5f99bc2dd,m,1795.0,1850.0,1822.5,60.0,kg,60.0,diamond enterprise plc,diamond enterprise plc,True,diamond enterprise plc,True
22,Arabica,roberto licona franco,Mexico,la herradura,,la herradura,0,,1320,xalapa,ROBERTO LICONA FRANCO,14,1 kg,AMECAFE,2012,"July 26th, 2012",ROBERTO LICONA FRANCO,Other,Washed / Wet,8.17,8.25,8.17,8.00,7.83,8.17,10.0,10.0,10.00,8.58,87.17,0.13,0,0.0,Green,0,"July 26th, 2013",AMECAFE,59e396ad6e22a1c22b248f958e1da2bd8af85272,0eb4ee5b3f47b20b049548a2fd1e7d4a2b70d0a7,m,1320.0,1320.0,1320.0,1.0,kg,1.0,roberto licona franco,roberto licona franco,True,roberto licona franco,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1314,Robusta,ugacof,Uganda,ugacof project area,,ugacof,0,ugacof ltd,1212,central,UGACOF,320,60 kg,Uganda Coffee Development Authority,2013,"July 14th, 2014",UGACOF,,,8.00,7.92,7.92,7.75,7.83,7.75,10.0,10.0,7.75,8.08,83.00,0.12,0,0.0,Green,7,"July 14th, 2015",Uganda Coffee Development Authority,e36d0270932c3b657e96b7b0278dfd85dc0fe743,03077a1c6bac60e6f514691634a7f6eb5c85aae8,m,1212.0,1212.0,1212.0,60.0,kg,60.0,ugacof,ugacof,True,ugacof,True
1315,Robusta,katuka development trust ltd,Uganda,katikamu capca farmers association,,katuka development trust,0,katuka development trust ltd,1200-1300,luwero central region,Katuka Development Trust Ltd,1,60 kg,Uganda Coffee Development Authority,2013,"June 26th, 2014",Katuka Development Trust Ltd,,,8.33,7.83,7.83,7.75,8.25,7.75,10.0,10.0,7.58,7.67,83.00,0.12,0,0.0,Green,3,"June 26th, 2015",Uganda Coffee Development Authority,e36d0270932c3b657e96b7b0278dfd85dc0fe743,03077a1c6bac60e6f514691634a7f6eb5c85aae8,m,1200.0,1300.0,1250.0,60.0,kg,60.0,katuka development trust ltd,katuka development trust ltd,True,katuka development trust ltd,True
1324,Robusta,kasozi coffee farmers association,Uganda,kasozi coffee farmers,,,0,kasozi coffee farmers association,1367,eastern,Kasozi coffee farmers Association,1,60 kg,Uganda Coffee Development Authority,2013,"July 14th, 2014",Kasozi Coffee Farmers Association,,,8.00,7.75,7.75,7.58,7.67,7.50,10.0,10.0,7.58,7.67,81.50,0.11,0,0.0,Green,7,"July 14th, 2015",Uganda Coffee Development Authority,e36d0270932c3b657e96b7b0278dfd85dc0fe743,03077a1c6bac60e6f514691634a7f6eb5c85aae8,m,1367.0,1367.0,1367.0,60.0,kg,60.0,kasozi coffee farmers association,kasozi coffee farmers association,True,kasozi coffee farmers association,True
1328,Robusta,kawacom uganda ltd,Uganda,bushenyi,,kawacom,0,kawacom uganda ltd,1600,western,Kawacom uganda ltd,1,60 kg,Uganda Coffee Development Authority,2013,"June 27th, 2014",Kawacom Uganda LTD,,,7.33,7.58,7.50,7.75,7.75,7.67,10.0,10.0,7.75,7.58,80.92,0.12,0,0.0,Green,1,"June 27th, 2015",Uganda Coffee Development Authority,e36d0270932c3b657e96b7b0278dfd85dc0fe743,03077a1c6bac60e6f514691634a7f6eb5c85aae8,m,1600.0,1600.0,1600.0,60.0,kg,60.0,kawacom uganda ltd,kawacom uganda ltd,True,kawacom uganda ltd,True


### Problem 8. Coffee color by country and continent (1 point)
Create a table which shows how many coffees of each color are there in every country. Leave the missing values as they are.

**Note:** If you ask me, countries should be in rows, I prefer long tables much better than wide ones.

Now do the same for continents. You know what continent each country is located in.

In [127]:
coffe_index.color.value_counts()

color
Green           870
Bluish-Green    114
Blue-Green       85
Name: count, dtype: int64

In [129]:
coffe_index_p = coffe_index.pivot_table(index = "country_of_origin", columns = "color", values = "species", 
                                            aggfunc = "count")

In [131]:
coffe_index_p.reset_index()

color,country_of_origin,Blue-Green,Bluish-Green,Green
0,Brazil,14.0,12.0,92.0
1,Burundi,,,1.0
2,China,,,16.0
3,Colombia,8.0,8.0,118.0
4,Costa Rica,10.0,9.0,28.0
5,Cote d?Ivoire,,1.0,
6,Ecuador,2.0,1.0,
7,El Salvador,2.0,2.0,9.0
8,Ethiopia,,2.0,15.0
9,Guatemala,2.0,7.0,159.0


In [133]:
coffe_index_p.columns.name=None
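For the continent table, one option is to map each country to a continent and count again. The sketch below uses a toy frame and a deliberately partial, hypothetical mapping `continent_of`; the real one would need an entry for every country in the dataset. Missing colors are left untouched in the data and only labelled in the count table so that they stay visible.

```python
import pandas as pd

# Toy rows; the real table has many more countries.
sample = pd.DataFrame({
    "country_of_origin": ["Brazil", "Brazil", "Ethiopia", "Taiwan", "Ethiopia"],
    "color": ["Green", None, "Green", "Blue-Green", "Bluish-Green"],
})

# Hypothetical, partial mapping for illustration.
continent_of = {"Brazil": "South America", "Ethiopia": "Africa", "Taiwan": "Asia"}
continent = sample["country_of_origin"].map(continent_of).rename("continent")

# Label missing colors for display only; the source column is not modified.
color_label = sample["color"].fillna("(missing)")

by_continent = pd.crosstab(continent, color_label)

print(by_continent)
```

Continents end up in rows and colors in columns, matching the layout of the country table.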

### Problem 9. Ratings (1 point)
The columns `Aroma`, `Flavor`, etc., up to `Moisture` represent subjective ratings. Explore them. Show the means and range; draw histograms and / or boxplots as needed. You can even try correlations if you want. What's up with all those ratings?

In [137]:
coffe_index.aroma.value_counts()

aroma
7.67    179
7.50    165
7.58    152
7.75    125
7.42    122
7.83    103
7.33     98
7.25     78
7.92     59
8.00     48
7.17     45
7.08     28
7.00     23
8.08     20
8.17     20
6.92     14
8.42      9
8.25      9
6.83      9
6.75      7
8.33      7
6.67      3
8.50      3
6.50      2
8.67      2
7.81      2
5.08      1
8.75      1
6.42      1
6.17      1
8.58      1
6.33      1
0.00      1
Name: count, dtype: int64

In [139]:
coffe_index.flavor.value_counts()

flavor
7.50    166
7.58    166
7.67    148
7.75    126
7.42    116
7.33    111
7.83     89
7.25     64
7.17     56
7.92     45
7.08     42
8.00     41
7.00     36
8.17     18
6.83     17
6.92     15
8.08     14
6.75     10
6.50      9
8.25      7
8.33      5
8.42      5
6.58      5
6.67      5
8.50      5
8.67      4
6.33      3
7.88      2
6.17      2
8.58      2
6.42      1
8.83      1
6.08      1
7.81      1
0.00      1
Name: count, dtype: int64

In [141]:
coffe_index.moisture.value_counts()

moisture
0.11    383
0.12    294
0.00    264
0.10    182
0.13     76
0.09     27
0.14     23
0.08     16
0.01     15
0.15      8
0.05      8
0.02      7
0.06      7
0.07      5
0.16      5
0.04      4
0.03      4
0.20      3
0.17      3
0.18      2
0.28      1
0.21      1
0.22      1
Name: count, dtype: int64
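The counts above show one coffee with an aroma and flavor of 0.00, and 264 rows with a moisture of 0.00, which look more like missing measurements than real ratings. A minimal sketch of the summary asked for (means and ranges) and of a check for all-zero rating rows, on a toy frame whose rows are illustrative:

```python
import pandas as pd

# Toy ratings; rows are illustrative only.
ratings = pd.DataFrame({
    "aroma": [7.67, 7.50, 0.00, 8.42],
    "flavor": [7.58, 7.50, 0.00, 8.50],
    "moisture": [0.11, 0.12, 0.00, 0.10],
})

# One row per rating column: mean, min, max and range.
summary = ratings.agg(["mean", "min", "max"]).T
summary["range"] = summary["max"] - summary["min"]

# Rows where every taste rating is exactly zero are suspicious.
all_zero = ratings[["aroma", "flavor"]].eq(0).all(axis=1)

print(summary)
print(ratings[all_zero])
```

From here, `ratings.hist()` or `ratings.boxplot()` would draw the distributions, and `ratings.corr()` would give the correlations.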

### Problem 10. High-level errors (1 point)
Check the countries against region names, altitudes, and companies. Are there any discrepancies (e.g. human errors, like a region not matching the country)? Take a look at the (cleaned) altitudes; there has been a lot of preprocessing done to them. Was it done correctly?

In [143]:
coffe_index.altitude.value_counts()

altitude
1100             43
1200             42
1300             32
1400             32
4300             31
                 ..
1473              1
1600 - 1800 m     1
4500 pies         1
900-1100          1
795 meters        1
Name: count, Length: 396, dtype: int64

In [145]:
coffe_index.region.value_counts()

region
huila                                                                      112
oriente                                                                     80
south of minas                                                              68
kona                                                                        66
veracruz                                                                    35
                                                                          ... 
phahi                                                                        1
mahuixtlan                                                                   1
52 narino (exact location: mattituy; municipal region: florida code 381      1
aceh                                                                         1
kwanza norte province, angola                                                1
Name: count, Length: 356, dtype: int64

In [151]:
coffe_index.company

0           metad agricultural developmet plc
1           metad agricultural developmet plc
2                                         NaN
3       yidnekachew debessa coffee plantation
4           metad agricultural developmet plc
                        ...                  
1334                                robustasa
1335                                robustasa
1336                  global opportunity fund
1337                            cafe politico
1338                            cafe politico
Name: company, Length: 1339, dtype: object
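One way to check the altitude preprocessing is to re-derive the range in meters from the raw `altitude` text and compare it with the `altitude_low_meters` and `altitude_high_meters` columns. The sketch below assumes that the words `pies`, `ft` and `feet` mark feet and that everything else is already in meters; `altitude_range_in_meters` is a hypothetical helper. Raw values such as 4300, which appears 31 times without a unit, may also be feet and would need a separate plausibility check.

```python
import re

FEET_TO_METERS = 0.3048  # exact definition of the international foot

def altitude_range_in_meters(text):
    """Return (low, high) in meters, or None when no number is present."""
    lowered = text.lower()
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", lowered)]
    if not numbers:
        return None
    # Assumption: these unit words mark feet; anything else is taken as meters.
    in_feet = re.search(r"\d\s*(?:pies|ft|feet)\b", lowered) is not None
    factor = FEET_TO_METERS if in_feet else 1.0
    return (min(numbers) * factor, max(numbers) * factor)

for raw in ["1950-2200", "1600 - 1800 m", "795 meters", "4500 pies"]:
    print(raw, "->", altitude_range_in_meters(raw))
```

Rows where the recomputed range disagrees with the stored columns are the ones worth inspecting by hand.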

### * Problem 11. Clean and explore at will
The dataset claimed to be clean, but we were able to discover a lot of things to fix and do better.

Play around with the data as much as you wish, and if you find variables to tidy up and clean - by all means, do that!