## Data Cleaning

Import the basic packages.

In [1]:
import pandas as pd
import numpy as np

Now let's load the second coffee dataset.

In [2]:
coffee_df = pd.read_csv('../data/coffee.csv', index_col=[0])

Create a working copy so that the original `coffee_df` stays untouched.

In [3]:
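# drop=True discards the old index instead of adding it back as an extra column,
# which matches the 34 columns reported below
coffee = coffee_df.copy().reset_index(drop=True)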

Take a quick look at the first rows and the shape of the data.

In [4]:
display(coffee[:3])
print('The shape of coffee is ', coffee.shape)

Unnamed: 0,all_text,name,rating,roaster,slug,region_africa_arabia,region_caribbean,region_central_america,region_hawaii,region_asia_pacific,...,aroma,acid,body,flavor,aftertaste,with_milk,desc_1,desc_2,desc_3,desc_4
0,\n\n\n\n \n93\nFlight Coffee Co.\nEthiopia Der...,Ethiopia Deri Kochoha,93,Flight Coffee Co.,/review/ethiopia-deri-kochoha-2,1,0,0,0,0,...,9.0,8.0,9.0,9.0,8.0,,"Bright, crisp, sweetly tart. Citrus medley, ca...",From the Deri Kochoha mill in the Hagere Marya...,A poised and melodic wet-processed Ethiopia co...,
1,\n\n\n\n\n91\nDoi Chaang Coffee\nEspresso\nLoc...,Espresso,91,Doi Chaang Coffee,/review/espresso-14,0,0,0,0,1,...,8.0,,8.0,8.0,8.0,9.0,"Evaluated as espresso. Deeply rich, sweetly ro...",Doi Chaang is a single-estate coffee produced ...,"A rich, resonant espresso from Thailand, espec...",
2,\n\n\n\n \n95\nTemple Coffee and Tea\nKenya Ru...,Kenya Ruthaka Peaberry,95,Temple Coffee and Tea,/review/kenya-ruthaka-peaberry,1,0,0,0,0,...,9.0,8.0,9.0,10.0,8.0,,"Deeply sweet, richly savory. Dark chocolate, p...",Despite challenges ranging from contested gove...,"A high-toned, nuanced Kenya cup, classic in it...",


The shape of coffee is  (5124, 34)


Now we remove the columns that we are not interested in.

In [5]:
coffee.columns

Index(['all_text', 'name', 'rating', 'roaster', 'slug', 'region_africa_arabia',
       'region_caribbean', 'region_central_america', 'region_hawaii',
       'region_asia_pacific', 'region_south_america', 'type_espresso',
       'type_organic', 'type_fair_trade', 'type_decaffeinated',
       'type_pod_capsule', 'type_blend', 'type_estate', 'location', 'origin',
       'roast', 'est_price', 'review_date', 'agtron', 'aroma', 'acid', 'body',
       'flavor', 'aftertaste', 'with_milk', 'desc_1', 'desc_2', 'desc_3',
       'desc_4'],
      dtype='object')

In [6]:
Unwanted = ['all_text', 'name', 'roaster', 'slug', 'origin', 'est_price', 'location',
            'agtron', 'with_milk', 'desc_1', 'desc_2', 'desc_3', 'desc_4']

In [7]:
# Drop all unwanted columns in one call instead of looping
coffee.drop(columns=Unwanted, inplace=True)
print('The shape of coffee is ', coffee.shape)

The shape of coffee is  (5124, 21)


Determine the percentage of missing values in each feature.

In [8]:
# isna().mean() gives the fraction of missing values per column;
# multiplying by 100 turns it into a percentage
df_null = pd.DataFrame({'null%': coffee.isna().mean() * 100})

df_null

Unnamed: 0,null%
rating,0.0
region_africa_arabia,0.0
region_caribbean,0.0
region_central_america,0.0
region_hawaii,0.0
region_asia_pacific,0.0
region_south_america,0.0
type_espresso,0.0
type_organic,0.0
type_fair_trade,0.0


We drop `acid` and `aftertaste` because of their high null rates.

In [9]:
coffee.drop(['acid', 'aftertaste'], inplace = True, axis = 1)
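
The choice of columns to drop can also be derived from the null percentages rather than read off the table. Below is a minimal sketch on a toy frame; the helper name `high_null_columns` and the 30% cutoff are assumptions for illustration, not values used in this notebook.

```python
import numpy as np
import pandas as pd


def high_null_columns(df, threshold_pct):
    """Return the names of columns whose share of missing values exceeds threshold_pct."""
    null_pct = df.isna().mean() * 100  # per-column percentage of NaN
    return null_pct[null_pct > threshold_pct].index.tolist()


# Toy data standing in for the coffee frame (values are made up).
toy = pd.DataFrame({
    'rating': [93, 91, 95, 90],
    'acid': [8.0, np.nan, np.nan, np.nan],   # 75% missing
    'aroma': [9.0, 8.0, np.nan, 9.0],        # 25% missing
})

to_drop = high_null_columns(toy, threshold_pct=30)  # hypothetical cutoff
toy_reduced = toy.drop(columns=to_drop)
print(to_drop)
```

Making the cutoff explicit keeps the decision reproducible if the underlying data changes.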

Drop the rows where `roast`, `aroma`, `body`, or `flavor` is null.

In [10]:
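# .copy() makes the filtered frame independent, so later column assignments
# do not trigger a SettingWithCopyWarning
coffee = coffee[coffee['roast'].notna() & coffee['aroma'].notna()
                & coffee['body'].notna() & coffee['flavor'].notna()].copy()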
print('The shape of coffee is ', coffee.shape)

The shape of coffee is  (4680, 19)
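
The chained `notna()` mask above can also be written with `DataFrame.dropna` and its `subset` parameter. The sketch below uses a made-up toy frame to check that both forms keep the same rows; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the coffee frame (values are made up).
toy = pd.DataFrame({
    'roast': ['Medium', None, 'Light', 'Dark'],
    'aroma': [9.0, 8.0, np.nan, 8.0],
    'body': [9.0, 8.0, 8.0, 8.0],
    'flavor': [9.0, 8.0, 9.0, np.nan],
})

required = ['roast', 'aroma', 'body', 'flavor']

# Boolean-mask version: keep rows where every required column is present.
mask = toy[required].notna().all(axis=1)
by_mask = toy[mask]

# dropna version: drop a row if any of the listed columns is missing.
by_dropna = toy.dropna(subset=required)

print(by_mask.equals(by_dropna))
```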


Adjust the dtype of each feature and collapse the one-hot encoded region columns back into a single `region` column.

In [13]:
# Columns 1 through 6 are the six region_* flags; the slice 1:7 includes region_south_america
coffee['region'] = (coffee.iloc[:, 1:7] == 1).idxmax(axis=1)
to_convert = ['type_espresso', 'type_organic', 'type_fair_trade', 'type_blend',
              'type_decaffeinated', 'type_pod_capsule', 'type_estate']
coffee[to_convert] = coffee[to_convert].astype('category')
coffee['rating'] = pd.to_numeric(coffee['rating'], errors='coerce')

In [14]:
coffee.dtypes

rating                     float64
region_africa_arabia         int64
region_caribbean             int64
region_central_america       int64
region_hawaii                int64
region_asia_pacific          int64
region_south_america         int64
type_espresso             category
type_organic              category
type_fair_trade           category
type_decaffeinated        category
type_pod_capsule          category
type_blend                category
type_estate               category
roast                       object
review_date                 object
aroma                      float64
body                       float64
flavor                     float64
region                      object
dtype: object
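
One caveat with `idxmax` for undoing one-hot encoding: when a row has no flag set, it still returns the first column label. Whether such rows exist in this dataset has not been checked here, so the following is only a sketch on toy data. It selects the region columns by name prefix and masks unflagged rows to `NaN`.

```python
import pandas as pd

# Toy one-hot frame; the last row has no region flagged (values are made up).
toy = pd.DataFrame({
    'region_africa_arabia': [1, 0, 0],
    'region_asia_pacific': [0, 1, 0],
    'region_south_america': [0, 0, 0],
})

# Select the flag columns by name instead of by position.
region_cols = [c for c in toy.columns if c.startswith('region_')]
flags = toy[region_cols] == 1

# idxmax returns the first column label when a row has no True at all,
# so rows without any flagged region are masked to NaN explicitly.
toy['region'] = flags.idxmax(axis=1).where(flags.any(axis=1))
print(toy['region'].tolist())
```

Selecting by prefix also avoids relying on the column order, which changes whenever columns are added or dropped upstream.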

Finally, we save the file.

In [15]:
coffee.to_csv('../data/coffee_after_cleaning.csv', index = False)
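
Note that CSV stores no dtype information, so the `category` columns set above come back as plain numeric columns when the file is read again. A minimal sketch of restoring them with the `dtype` argument of `pd.read_csv`, using an in-memory buffer and a made-up toy frame in place of the real file:

```python
import io

import pandas as pd

# Toy frame with one categorical column (values are made up).
toy = pd.DataFrame({
    'rating': [93.0, 91.0],
    'type_espresso': pd.Series([0, 1]).astype('category'),
})

# Round-trip through CSV text in memory instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)  # the category dtype is lost here

buffer.seek(0)
restored = pd.read_csv(buffer, dtype={'type_espresso': 'category'})

print(plain['type_espresso'].dtype, restored['type_espresso'].dtype)
```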