# Data Joiner

This notebook joins two data tables, Individual Carbon Footprints and the Carbon Footprint of each resource, into a single data frame that can be used for further analysis.

### Section 1 - Importing Required Libraries

Import the libraries needed to join the two datasets.

In [1]:
import pandas as pd
import numpy as np

### Section 2 - Loading the data

This section loads data from two CSV files: one describing each individual's activities, and one giving the carbon footprint of each resource type used for an activity performed by an individual.

In [2]:
individual_df = pd.read_csv(r'../../data/output_data/Individuals_Carbon_Footprint_NA_Dropped.csv')
carbon_fp_df = pd.read_csv(r'../../data/output_data/Resources_Carbon_Footprint_NA_Dropped.csv')

In [3]:
individual_df.head()

Unnamed: 0,Indnum,Group,Activity,Units,Consumption,Quality_of_Life_Importance__1_10,Name of Resource Used,Amount of Resource Used per Unit
0,1,1,Household heating < 70F,hours,10.0,85.0,solar_powered__water_heater,1.0
1,2,2,wash-up,count,44.0,34.0,solar_powered__water_heater,1.0
2,3,2,shower - long (> 3 min),count,40.0,85.0,solar_powered__water_heater,1.0
3,3,2,wash-up,count,45.0,27.0,solar_powered__water_heater,1.0
4,5,3,use of clothes washer,count,7.0,41.0,solar_powered__water_heater,1.0


In [4]:
carbon_fp_df.head()

Unnamed: 0,Activity,Per,Name of Resource Used,Carbon Footprint of Resource per Unit
0,shower - short,activity,solar powered water heater,1.2e-05
1,shower - long (> 3 min),activity,solar powered water heater,1.7e-05
2,bath,activity,solar powered water heater,8.8e-05
3,wash-up,activity,solar powered water heater,4e-06
4,use of dishwasher,activity,solar powered water heater,2.5e-05


### Section 3 - Column Name Standardization

This section standardizes the column names of both dataframes so that they follow one consistent naming scheme. Consistent names make the join simpler and reduce redundant columns.

In [15]:
# Import the regex library used to clean up each column name
import re

In [16]:
indiv_column_names = individual_df.columns.values
carbon_fp_column_names = carbon_fp_df.columns.values

In [17]:
indiv_column_names

array(['Indnum', 'Group', 'Activity', 'Units', 'Consumption',
       'Quality_of_Life_Importance__1_10', 'Name of Resource Used',
       'Amount of Resource Used per Unit'], dtype=object)

In [18]:
carbon_fp_column_names

array(['Activity', 'Per', 'Name of Resource Used',
       'Carbon Footprint of Resource per Unit'], dtype=object)

In [71]:
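def column_names_standardizer(column_name):
    # Lowercase, replace whitespace runs with "_", then collapse repeated "_".
    # Raw strings keep "\s" from being treated as an invalid escape sequence.
    return re.sub(r'_+', '_', re.sub(r'\s+', '_', column_name.lower()))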

column_names_standardizer_vfunc = np.vectorize(column_names_standardizer)
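
To make the standardization rule easier to follow, the sketch below breaks the same three steps (lowercase, whitespace to underscore, collapse repeated underscores) into separate statements. The variable names are illustrative only and are not part of the notebook; the patterns are written as raw strings.

```python
import re

# Step-by-step view of the rule on a name that contains a double underscore.
name = 'Quality_of_Life_Importance__1_10'
lowered = name.lower()                       # lowercase everything
no_spaces = re.sub(r'\s+', '_', lowered)     # whitespace runs become "_"
collapsed = re.sub(r'_+', '_', no_spaces)    # "__" collapses to "_"
print(collapsed)  # quality_of_life_importance_1_10

# The same rule applied in one expression to a name that contains spaces.
spaced = re.sub(r'_+', '_', re.sub(r'\s+', '_', 'Name of Resource Used'.lower()))
print(spaced)  # name_of_resource_used
```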

In [72]:
column_names_standardizer_vfunc(indiv_column_names)

array(['indnum', 'group', 'activity', 'units', 'consumption',
       'quality_of_life_importance_1_10', 'name_of_resource_used',
       'amount_of_resource_used_per_unit'], dtype='<U32')

In [73]:
column_names_standardizer_vfunc(carbon_fp_column_names)

array(['activity', 'per', 'name_of_resource_used',
       'carbon_footprint_of_resource_per_unit'], dtype='<U37')
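
The cells above only compute the standardized names; they do not assign them back to the dataframes. A minimal sketch of one way to apply them and then join follows. It uses small made-up frames modeled on the previews rather than the real CSV files, and it assumes that `activity` and `name_of_resource_used` are the join keys and that a left join is wanted. Neither choice is confirmed by the notebook so far. Because the previews show resource names formatted differently in the two tables (`solar_powered__water_heater` versus `solar powered water heater`), the sketch also runs the key values through the same function so that they can match.

```python
import re
import pandas as pd

def column_names_standardizer(column_name):
    # Lowercase, replace whitespace runs with "_", then collapse repeated "_".
    return re.sub(r'_+', '_', re.sub(r'\s+', '_', column_name.lower()))

# Hypothetical miniature frames standing in for individual_df and carbon_fp_df.
individual_demo = pd.DataFrame({
    'Activity': ['wash-up'],
    'Name of Resource Used': ['solar_powered__water_heater'],
    'Consumption': [44.0],
})
carbon_fp_demo = pd.DataFrame({
    'Activity': ['wash-up'],
    'Name of Resource Used': ['solar powered water heater'],
    'Carbon Footprint of Resource per Unit': [4e-06],
})

# DataFrame.rename accepts a function and applies it to every column label.
individual_demo = individual_demo.rename(columns=column_names_standardizer)
carbon_fp_demo = carbon_fp_demo.rename(columns=column_names_standardizer)

# Assumption: normalizing the resource names the same way makes the keys comparable.
for frame in (individual_demo, carbon_fp_demo):
    frame['name_of_resource_used'] = frame['name_of_resource_used'].map(
        column_names_standardizer
    )

# Assumed join keys and join type; adjust to the actual analysis needs.
joined = individual_demo.merge(
    carbon_fp_demo,
    on=['activity', 'name_of_resource_used'],
    how='left',
)
print(joined.columns.tolist())
```

Passing the function to `rename` avoids the intermediate NumPy array, although assigning the vectorized result to `df.columns` would work as well.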