# Data cleaning 

In total, we collected 5 csv files covering different aspects of coffee production, which we will explore later. All files are imported in this notebook. We examine each file, and 4 of the files go through the following data cleaning process, implemented as a function:

1. Convert csv file to a dataframe
2. Drop unnecessary columns
3. Rename columns
4. Rearrange columns
5. Export to cleaned csv files

Exception:
- The data for 'production_quantity' uses different column names for the country ('Country' and 'Country Code' instead of 'Area' and 'Area Code'). We will clean it separately, without the function.
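
One way to confirm which naming a file uses before cleaning it is to read only its header row. The sketch below is illustrative: `country_column` is a hypothetical helper, not part of the notebook, and the two in-memory samples only mimic the header layout of the real files.

```python
import io
import pandas as pd

def country_column(csv_source):
    """Return 'Country' or 'Area', whichever column name the file uses."""
    # nrows=0 parses only the header row, so no data rows are loaded
    columns = pd.read_csv(csv_source, nrows=0).columns
    for candidate in ('Country', 'Area'):
        if candidate in columns:
            return candidate
    raise ValueError(f'no country column found in {list(columns)}')

# In-memory stand-ins for two header layouts (headers only, no data rows)
header_a = io.StringIO('Country Code,Country,Year,Unit,Value\n')
header_b = io.StringIO('Area Code,Area,Year,Unit,Value\n')

name_a = country_column(header_a)
name_b = country_column(header_b)
print(name_a, name_b)
```

With real files, the same helper could be called with a path such as `'./original_data/production_quantity.csv'`.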

In [95]:
import pandas as pd

## Dataframe 1: production_quantity

In [105]:
# Convert csv file to dataframe
production_quantity = pd.read_csv('./original_data/production_quantity.csv')

In [106]:
# Drop unnecessary columns
to_drop = ['Domain Code','Domain','Country Code','Element Code','Element','Item Code','Item','Year Code','Flag','Flag Description']

production_quantity = production_quantity.drop(to_drop,axis=1)

In [107]:
# Rename columns
production_quantity = production_quantity.rename({'Value': 'production_quantity'},axis=1)

In [108]:
# Rearrange columns
production_quantity = production_quantity[['Country','Year','production_quantity','Unit']]
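
As an alternative to dropping columns after loading, the wanted columns can be selected at read time with the `usecols` parameter of `pd.read_csv`. This is only a sketch of that option, not what the notebook does: the in-memory sample carries a subset of the original columns, and the values in the `Domain`, `Country Code` and `Flag` columns are placeholders.

```python
import io
import pandas as pd

# Stand-in for production_quantity.csv with a subset of its columns;
# values in the columns we do not keep are placeholders
raw = io.StringIO(
    'Domain,Country Code,Country,Year,Unit,Value,Flag\n'
    'x,0,Angola,1961,1000 tonnes,169,x\n'
    'x,0,Angola,1962,1000 tonnes,185,x\n'
)

keep = ['Country', 'Year', 'Value', 'Unit']
sketch = (
    pd.read_csv(raw, usecols=keep)  # load only the columns we need
      .rename(columns={'Value': 'production_quantity'})
      # usecols does not control column order, so reorder explicitly
      [['Country', 'Year', 'production_quantity', 'Unit']]
)
print(sketch.columns.tolist())
```

The trade-off is that `usecols` requires listing what to keep rather than what to drop, which is shorter here because only four columns survive.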

In [113]:
production_quantity

| | Country | Year | production_quantity | Unit |
|---|---|---|---|---|
| 0 | Angola | 1961 | 169 | 1000 tonnes |
| 1 | Angola | 1962 | 185 | 1000 tonnes |
| 2 | Angola | 1963 | 168 | 1000 tonnes |
| 3 | Angola | 1964 | 198 | 1000 tonnes |
| 4 | Angola | 1965 | 205 | 1000 tonnes |
| ... | ... | ... | ... | ... |
| 3970 | China | 2009 | 70 | 1000 tonnes |
| 3971 | China | 2010 | 50 | 1000 tonnes |
| 3972 | China | 2011 | 65 | 1000 tonnes |
| 3973 | China | 2012 | 92 | 1000 tonnes |


In [114]:
# Export to cleaned csv file
production_quantity.to_csv('./cleaned_data/production_quantity_c.csv')
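
Note that `to_csv` writes the row index as an unnamed first column by default, so reading a cleaned file back yields an extra `Unnamed: 0` column. The sketch below shows the difference on a small in-memory frame. Whether to pass `index=False` in this notebook is a judgment call that depends on how the cleaned files are read later.

```python
import io
import pandas as pd

df = pd.DataFrame({'Country': ['Angola', 'Angola'], 'Year': [1961, 1962]})

# Default behaviour: the row index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()

print(cols_default)   # ['Unnamed: 0', 'Country', 'Year']
print(cols_no_index)  # ['Country', 'Year']
```

If the index column is kept in the file, passing `index_col=0` to `pd.read_csv` restores it as the index instead of a data column.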

## Dataframes 2-5: gross_production_value, primary_forest, pesticides_usage, crop_residue

- Define a function for cleaning

In [110]:
def csv_to_dataframe(filename):
    """Load ./original_data/{filename}.csv and return the cleaned dataframe."""
    df = pd.read_csv(f'./original_data/{filename}.csv')
    to_drop = ['Domain Code','Domain','Area Code','Element Code','Element','Item Code','Item','Year Code','Flag','Flag Description']
    # Drop unnecessary columns, rename, then rearrange; the value column is named after the file
    df = (df.drop(to_drop,axis=1)
            .rename({'Area':'Country','Value': filename},axis=1)
            [['Country','Year',filename,'Unit']])
    return df
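
The same steps could in principle cover the 'production_quantity' file as well, because `rename` silently skips keys that are not present. The following is a minimal sketch of that idea, assuming the cleaning operates on an already loaded dataframe. `clean_frame` is a hypothetical name, and the sample rows are placeholders apart from the Angola values shown earlier.

```python
import pandas as pd

def clean_frame(df, value_name):
    """Hypothetical variant that accepts either 'Area' or 'Country' as the country column."""
    # rename ignores missing keys by default, so 'Area' is renamed only if it exists
    df = df.rename(columns={'Area': 'Country', 'Value': value_name})
    # Selecting the four wanted columns drops everything else and fixes the order
    return df[['Country', 'Year', value_name, 'Unit']].copy()

# Frame that uses 'Country' (as production_quantity does)
with_country = pd.DataFrame({
    'Country Code': [0], 'Country': ['Angola'], 'Year': [1961],
    'Unit': ['1000 tonnes'], 'Value': [169],
})
# Frame that uses 'Area' (as the other files do); unit and value are placeholders
with_area = pd.DataFrame({
    'Area Code': [0], 'Area': ['Angola'], 'Year': [1961],
    'Unit': ['unit'], 'Value': [1],
})

cleaned_a = clean_frame(with_country, 'production_quantity')
cleaned_b = clean_frame(with_area, 'primary_forest')
print(cleaned_a.columns.tolist())
print(cleaned_b.columns.tolist())
```

Selecting columns by name instead of dropping a fixed list also avoids the `KeyError` that `drop` raises when a listed column is missing.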
                        
    

- Save all dataframes into a dictionary and export each one to a cleaned csv file

In [112]:
files = ['gross_production_value','primary_forest','pesticides_usage','crop_residue']

df_dict = {}
for file in files:
    # Keep the cleaned dataframe in the dictionary, then export it
    df_dict[file] = csv_to_dataframe(file)
    df_dict[file].to_csv(f'./cleaned_data/{file}_c.csv')
    