### Data transformation

This is the second automatically graded exercise for JODA. The objective here is to get our hands dirty with data.

The context of this particular analysis is a fictional company that routinely runs different machine learning operations.

We have generated a dataset that has the following columns or properties (to be engineered into features):

- Date
- Department
- ML Task ID
- ML Method
- Task Category
- Model Complexity (Parameters)
- Training Data Size (GB)
- Training Duration (Hours)
- Hardware Used
- Energy Consumption (kWh)
- CO2 Emissions (Kg)
- Cloud Provider

Moreover, there is a secondary dataset that includes information about the energy sources for the different cloud providers:

- Cloud Provider
- Green Energy

Import the needed packages

In [2]:
import pandas as pd

In [18]:
df_co2 = pd.read_excel('data/co2-emissions.xlsx')
df_providers = pd.read_excel('data/cloud-providers.xlsx')

df_final = df_co2.merge(df_providers, how='left', on='Cloud Provider')
print(df_final.columns)

Index(['Date', 'Department', 'ML Task ID', 'ML Method', 'Task Category',
       'Model Complexity (Parameters)', 'Training Data Size (GB)',
       'Training Duration (Hours)', 'Hardware Used',
       'Energy Consumption (kWh)', 'CO2 Emissions (Kg)', 'Cloud Provider',
       'Green Energy'],
      dtype='object')
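A left merge silently keeps rows whose provider has no match and would duplicate rows if a provider were listed twice in the lookup table. The sketch below shows one way to check for both cases. It uses small made-up frames (`df_co2_demo`, `df_providers_demo`) rather than the exercise files, so the values are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the two Excel sheets (values are made up for illustration).
df_co2_demo = pd.DataFrame({
    'ML Task ID': [1, 2, 3],
    'Cloud Provider': ['AWS', 'Azure', 'On-Premise'],
    'CO2 Emissions (Kg)': [10.0, 20.0, 30.0],
})
df_providers_demo = pd.DataFrame({
    'Cloud Provider': ['AWS', 'Azure'],
    'Green Energy': ['Green', 'Hybrid'],
})

# validate='many_to_one' raises MergeError if a provider appears twice on the right;
# indicator=True adds a '_merge' column showing where each row came from.
merged_demo = df_co2_demo.merge(
    df_providers_demo,
    how='left',
    on='Cloud Provider',
    validate='many_to_one',
    indicator=True,
)

# Rows without a matching provider keep NaN in 'Green Energy'.
unmatched = merged_demo[merged_demo['_merge'] == 'left_only']
print(unmatched[['ML Task ID', 'Cloud Provider']])
```

If `unmatched` is non-empty on the real data, those rows would need a decision (for example, a fill value) before aggregating by energy type.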


Aggregate the data to department level. That is, each row should represent the aggregated values for one department. Note that you do not need to aggregate every column, only the ones explicitly listed in the instructions.

Calculate the total CO2 emissions for each department

In [27]:
df_department_level = df_final.groupby('Department').agg({'CO2 Emissions (Kg)': 'sum'})

# Reset index to convert the index (Department) into a regular column
df_department_level = df_department_level.reset_index()

# Display the resulting DataFrame
print(df_department_level)

         Department  CO2 Emissions (Kg)
0  Customer Support        12565.569898
1           Finance        13568.637182
2   Human Resources        15256.236043
3         Marketing        12821.756125
4        Operations        15004.901708
5               R&D        14644.874294


Rename the CO2 emissions column to co2_emissions_kg

In [28]:
df_department_level.rename(columns={'CO2 Emissions (Kg)': 'co2_emissions_kg'}, inplace=True)
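The aggregation, index reset, and rename above can probably be combined into a single step with pandas named aggregation. This is a minimal sketch on a made-up frame (`df_demo`), not the exercise data.

```python
import pandas as pd

# Illustrative data only; the real exercise uses df_final.
df_demo = pd.DataFrame({
    'Department': ['Finance', 'R&D', 'Finance', 'R&D'],
    'CO2 Emissions (Kg)': [1.5, 2.0, 3.5, 4.0],
})

# as_index=False keeps 'Department' as a regular column;
# the keyword name becomes the output column name.
department_totals = (
    df_demo
    .groupby('Department', as_index=False)
    .agg(co2_emissions_kg=('CO2 Emissions (Kg)', 'sum'))
)
print(department_totals)
```

The result has the same shape as the three-step version, with the target column name chosen at aggregation time.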

Create a function that picks the most common value in a Pandas Series object

In [29]:
def pick_most_frequent(values):
    mode_series = values.mode()
    
    # If multiple modes are present, return the first mode
    most_common = mode_series.iloc[0] if not mode_series.empty else None
    
    return most_common

pick_most_frequent(pd.Series(['A', 'B', 'B', 'C']))

'B'
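It can help to know how this function behaves on edge cases. `Series.mode()` returns its results sorted and ignores missing values by default, so a tie resolves to the smallest value rather than the first one seen. The sketch below repeats the function under a different name (`pick_most_frequent_demo`) only so that it runs on its own.

```python
import pandas as pd

def pick_most_frequent_demo(values):
    # Same logic as pick_most_frequent above, repeated so the snippet is self-contained.
    mode_series = values.mode()
    return mode_series.iloc[0] if not mode_series.empty else None

# Tie between 'A' and 'B': mode() returns both, sorted, so 'A' is picked.
tie_result = pick_most_frequent_demo(pd.Series(['B', 'A', 'B', 'A']))

# Empty input: mode() is empty, so the function falls back to None.
empty_result = pick_most_frequent_demo(pd.Series([], dtype='object'))

# Missing values are dropped before counting (dropna=True is the default).
nan_result = pick_most_frequent_demo(pd.Series(['A', None, None, None]))

print(tie_result, empty_result, nan_result)
```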

Pick the most frequent ML method for each department.

In [31]:
most_frequent_ml_method_per_department = df_final.groupby('Department')['ML Method'].agg(pick_most_frequent)
print(most_frequent_ml_method_per_department)

df_department_level = df_department_level.merge(most_frequent_ml_method_per_department, on='Department', how='left')
print(df_department_level)

Department
Customer Support    Linear Regression
Finance                           RNN
Human Resources                   RNN
Marketing               Decision Tree
Operations                Transformer
R&D                               RNN
Name: ML Method, dtype: object
         Department  co2_emissions_kg          ML Method
0  Customer Support      12565.569898  Linear Regression
1           Finance      13568.637182                RNN
2   Human Resources      15256.236043                RNN
3         Marketing      12821.756125      Decision Tree
4        Operations      15004.901708        Transformer
5               R&D      14644.874294                RNN


Sort the rows by CO2 emissions in descending order, so that the department with the largest emissions comes first.

In [37]:
df_department_level = df_department_level.sort_values(by='co2_emissions_kg', ascending=False)

print(df_department_level)

         Department  co2_emissions_kg          ML Method
2   Human Resources      15256.236043                RNN
4        Operations      15004.901708        Transformer
5               R&D      14644.874294                RNN
1           Finance      13568.637182                RNN
3         Marketing      12821.756125      Decision Tree
0  Customer Support      12565.569898  Linear Regression
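After sorting, the original index labels (2, 4, 5, ...) are retained, as the output above shows. If a clean 0..n-1 index is preferred, `sort_values` accepts `ignore_index=True` (available since pandas 1.0). A small sketch with made-up numbers:

```python
import pandas as pd

# Illustrative values only.
df_demo = pd.DataFrame({
    'Department': ['Finance', 'Human Resources', 'R&D'],
    'co2_emissions_kg': [13.5, 15.2, 14.6],
})

# ignore_index=True relabels the sorted rows 0, 1, 2, ...
sorted_demo = df_demo.sort_values(
    by='co2_emissions_kg', ascending=False, ignore_index=True
)
print(sorted_demo)
```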


Calculate the CO2 emissions for each department in the different Green Energy categories. That is, the resulting dataframe will have as many columns as there are distinct values of Green Energy.

In [40]:
pivot_table = df_final.pivot_table(index='Department', columns='Green Energy', values='CO2 Emissions (Kg)', aggfunc='sum')

print(pivot_table)

Green Energy            Green       Hybrid      Unknown
Department                                             
Customer Support  1463.697220  1425.139532  9676.733146
Finance           2991.539843  2427.085307  8150.012032
Human Resources   2423.439560  3431.874604  9400.921879
Marketing         2282.675323  2923.163567  7615.917236
Operations        2253.368372  3779.599277  8971.934059
R&D               3028.493500  2280.103568  9336.277226
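One quick sanity check for a pivot like this is that each row should sum to the department total, provided no row has a missing Green Energy value (`pivot_table` leaves such rows out). The sketch below demonstrates the check on a made-up frame; the names `pivot_demo`, `totals_demo`, and `row_sums` are illustrative.

```python
import pandas as pd

# Illustrative data only.
df_demo = pd.DataFrame({
    'Department': ['Finance', 'Finance', 'R&D', 'R&D'],
    'Green Energy': ['Green', 'Unknown', 'Green', 'Hybrid'],
    'CO2 Emissions (Kg)': [1.0, 2.0, 3.0, 4.0],
})

pivot_demo = df_demo.pivot_table(
    index='Department',
    columns='Green Energy',
    values='CO2 Emissions (Kg)',
    aggfunc='sum',
)
totals_demo = df_demo.groupby('Department')['CO2 Emissions (Kg)'].sum()

# sum(axis=1) skips NaN cells (department/category pairs with no data).
row_sums = pivot_demo.sum(axis=1)

# Compare with a small tolerance to allow for floating-point rounding.
matches = bool(((row_sums - totals_demo).abs() < 1e-9).all())
print(matches)
```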


Next, let's try to do something a bit more difficult. That is, calculate department CO2 emissions per energy type.

One way to achieve this is to use the pivot_table() function to create a separate dataframe with the new columns and then join it to the main dataframe using merge(). There are likely other, more elegant ways as well.

Include the specified columns in the result dataframe, one per energy type.

In [42]:
pivot_table = pd.pivot_table(df_final, values='CO2 Emissions (Kg)', index='Department', columns='Green Energy', aggfunc='sum')

df_result = pd.merge(df_department_level, pivot_table, how='left', on='Department')

print(df_result)

         Department  co2_emissions_kg          ML Method        Green  \
0   Human Resources      15256.236043                RNN  2423.439560   
1        Operations      15004.901708        Transformer  2253.368372   
2               R&D      14644.874294                RNN  3028.493500   
3           Finance      13568.637182                RNN  2991.539843   
4         Marketing      12821.756125      Decision Tree  2282.675323   
5  Customer Support      12565.569898  Linear Regression  1463.697220   

        Hybrid      Unknown  
0  3431.874604  9400.921879  
1  3779.599277  8971.934059  
2  2280.103568  9336.277226  
3  2427.085307  8150.012032  
4  2923.163567  7615.917236  
5  1425.139532  9676.733146  
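As one possible alternative to `pivot_table()`, the same wide layout can be built with `groupby` followed by `unstack`. The sketch below uses a made-up frame, and the `co2_` column prefix is an assumption added for readability; an automatically graded exercise may expect the original category names, so treat the prefix as optional.

```python
import pandas as pd

# Illustrative data only.
df_demo = pd.DataFrame({
    'Department': ['Finance', 'Finance', 'R&D', 'R&D', 'R&D'],
    'Green Energy': ['Green', 'Hybrid', 'Green', 'Green', 'Unknown'],
    'CO2 Emissions (Kg)': [1.0, 2.0, 3.0, 4.0, 5.0],
})

by_energy = (
    df_demo
    .groupby(['Department', 'Green Energy'])['CO2 Emissions (Kg)']
    .sum()
    .unstack(fill_value=0.0)   # one column per energy type; missing pairs become 0
    .add_prefix('co2_')        # hypothetical naming choice, not from the exercise
    .reset_index()
)
by_energy.columns.name = None  # drop the leftover 'Green Energy' axis label
print(by_energy)
```

Filling missing combinations with 0 differs from `pivot_table()`, which leaves them as NaN unless `fill_value` is passed; which is appropriate depends on how the result is used downstream.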


Save the results

In [43]:
import os

def ensure_folder_exists(folder_path):
    if not os.path.exists(folder_path):
        os.makedirs(folder_path)
        print(f"Folder '{folder_path}' created.")
    else:
        print(f"Folder '{folder_path}' already exists.")

ensure_folder_exists('results')

Folder 'results' already exists.


In [44]:
# df_result.to_excel('results/department_co2.xlsx', index=False)
df_result.to_pickle('results/department_co2.pkl')
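To confirm that a saved file can be read back unchanged, a round trip through `pd.read_pickle` is a simple check. The sketch below writes to a temporary directory instead of `results/` and uses a made-up frame, so it does not touch the exercise output.

```python
import os
import tempfile

import pandas as pd

# Illustrative data only.
df_demo = pd.DataFrame({
    'Department': ['Finance', 'R&D'],
    'co2_emissions_kg': [13.5, 14.6],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'department_co2.pkl')
    df_demo.to_pickle(path)
    df_loaded = pd.read_pickle(path)

# equals() compares values, dtypes, and index in one call.
round_trip_ok = df_loaded.equals(df_demo)
print(round_trip_ok)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.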