# Finding Scope 3 emissions of a pharma company
- The Scope 3 category covered is downstream transportation and distribution.
- The data is taken from data.world.
- The distance-based methodology is used for the calculations.
    - According to the GHG Protocol: "If sub-contractor fuel data cannot be easily obtained in order to use the fuel-based method, then the distance-based method should be used."

# Data points required for calculations of scope 3
1. Total distance in km
    - Coordinates of the source and destination locations
2. Size of packages
3. Mode of Transport
4. Emission Factors
5. Calculation formulas
    1. Emissions from road transport = ∑ (mass of goods purchased (tonnes) × distance travelled in transport leg (km) × emission factor of transport mode or vehicle type (kg CO2e/tonne-km))
    2. Emissions from air transport = ∑ (mass of goods purchased (tonnes) × distance travelled in transport leg (km) × emission factor of transport mode or vehicle type (kg CO2e/tonne-km))
    3. Emissions from sea transport = ∑ (mass of goods purchased (tonnes) × distance travelled in transport leg (km) × emission factor of transport mode or vehicle type (kg CO2e/tonne-km))
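
The distance-based formula above can be sketched in Python as follows. This is a minimal illustration, not the notebook's final calculation. The helper names (`haversine_km`, `leg_emissions_kg`, `total_emissions_kg`) and the `EMISSION_FACTORS` dictionary are assumptions made for this example, and the factor values are placeholders rather than published figures. Great-circle distance between coordinates is also only an approximation of the route a shipment actually travels.

```python
import math

# Placeholder emission factors in kg CO2e per tonne-km.
# These numbers are illustrative assumptions, NOT published values.
EMISSION_FACTORS = {"Air": 1.0, "Ocean": 0.01, "Truck": 0.1}

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def leg_emissions_kg(weight_kg, distance_km, mode):
    """Emissions of one transport leg: tonnes x km x emission factor."""
    return (weight_kg / 1000) * distance_km * EMISSION_FACTORS[mode]


def total_emissions_kg(legs):
    """Sum emissions over an iterable of (weight_kg, distance_km, mode) tuples."""
    return sum(leg_emissions_kg(w, d, m) for w, d, m in legs)
```

The weight is divided by 1000 because the dataset records kilograms while the formula expects tonnes.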

### Steps I took
1. Searching for a dataset
2. Understanding the dataset and making the assumptions needed for the calculations
    - Every assumption should have a proper justification.
3. Listing the data points required for the calculations
4. Listing the calculation formulas
5. Filtering the required columns out of the main data
6. Exploratory data analysis
7. Choosing a single year for the calculations


## Data columns
1. Country - Destination location
2. Mode - Mode of transport (Air, Ocean, Road)
3. Weight - Shipment weight in kilograms
4. Delivery Date - Date of delivery
5. Manufacturing Site - Source location

In [175]:
# Import libraries
import pandas as pd
import numpy as np

In [103]:
# Collecting data
raw_df = pd.read_csv("./Data/SCMS_Delivery_History_Dataset_20150929.csv")

In [205]:
# Filtering out required columns from the main data.

filtered_df = raw_df[['ID','Country','Shipment Mode','Manufacturing Site','Weight (Kilograms)','Delivery Recorded Date']].copy()
filtered_df.rename(columns={"Shipment Mode":"Mode","Weight (Kilograms)":"Weight","Delivery Recorded Date":"Delivery_Date",'Manufacturing Site':'Manufacturing_site'},inplace=True)

In [206]:
# Convert the delivery date to datetime so that rows can be grouped by year
filtered_df['Delivery_Date'] = pd.to_datetime(filtered_df['Delivery_Date'])

# Exploratory Data Analysis

In [235]:
# Checking total number of data points for each year
filtered_df['Delivery_Date'].groupby(by=filtered_df['Delivery_Date'].dt.year).size()

Delivery_Date
2006      65
2007     670
2008    1109
2009    1192
2010    1176
2011    1049
2012    1250
2013    1205
2014    1599
2015    1009
Name: Delivery_Date, dtype: int64

In [207]:
# Check for null values
filtered_df.isna().sum()

ID                      0
Country                 0
Mode                  360
Manufacturing_site      0
Weight                  0
Delivery_Date           0
dtype: int64

The Mode column (mode of transport) has 360 null values.

In [234]:
# Checking Mode column's null values

# filtered_df[filtered_df['Mode'].isna()]

# Counting the number of null values per year
filtered_df[filtered_df['Mode'].isna()].groupby(by=filtered_df['Delivery_Date'].dt.year).size()

Delivery_Date
2006      2
2007    264
2008     94
dtype: int64

All null Mode values fall in 2006, 2007 and 2008, so I will not consider these years for the base year.
Although 2006 has only 2 rows with a null Mode, it has only 65 data points in total, which is too few for a base year compared with the other years.

In [228]:
# Checking Weight column values
filtered_df[filtered_df['Weight'].str.isnumeric() == False]

Unnamed: 0,ID,Country,Mode,Manufacturing_site,Weight,Delivery_Date
8,46,Nigeria,Air,"Aurobindo Unit III, India",See ASN-93 (ID#:1281),2006-12-07
12,62,Nigeria,Air,"EY Laboratories, USA",Weight Captured Separately,2007-01-10
15,68,Zimbabwe,Air,"BMS Meymac, France",Weight Captured Separately,2007-03-19
16,69,Nigeria,,ABBVIE GmbH & Co.KG Wiesbaden,Weight Captured Separately,2007-05-07
31,262,South Africa,,GSK Mississauga (Canada),Weight Captured Separately,2008-01-29
...,...,...,...,...,...,...
10318,86817,Zimbabwe,Truck,"Cipla, Goa, India",See DN-4307 (ID#:83920),2015-07-20
10319,86818,Zimbabwe,Truck,"Mylan, H-12 & H-13, India",See DN-4307 (ID#:83920),2015-07-20
10320,86819,Côte d'Ivoire,Truck,Hetero Unit III Hyderabad IN,See DN-4313 (ID#:83921),2015-08-07
10321,86821,Zambia,Truck,Cipla Ltd A-42 MIDC Mahar. IN,Weight Captured Separately,2015-09-03


As shown above, the Weight column contains non-numeric values. For rows marked 'Weight Captured Separately' the weight is not available, whereas for values such as 'See ASN-93 (ID#:1281)' the weight can be found by looking up the row with the referenced ID.
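
The three kinds of Weight entries can be told apart with a regular expression. The snippet below is a small sketch on stand-in values in the same formats as the dataset. The `weights` series, the `summary` frame and its column names are assumptions made for illustration and are not part of the notebook.

```python
import pandas as pd

# Stand-in for the Weight column, using the formats seen in the dataset.
weights = pd.Series(
    ["21", "Weight Captured Separately", "See ASN-93 (ID#:1281)", "See DN-4307 (ID#:83920)"]
)

# Pull the referenced ID out of 'See ... (ID#:NNNN)' entries; other rows get NaN.
ref_id = weights.str.extract(r"ID#:(\d+)", expand=False)

summary = pd.DataFrame(
    {
        "Weight": weights,
        "is_numeric": weights.str.isnumeric(),
        "ref_id": pd.to_numeric(ref_id).astype("Int64"),
        "unavailable": weights.eq("Weight Captured Separately"),
    }
)
print(summary)
```

Matching on `ID#:` followed by digits avoids depending on the ID having a fixed number of digits.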

In [233]:
# Counting the number of 'Weight Captured Separately' values in the Weight column per year.

filtered_df[filtered_df['Weight'] == 'Weight Captured Separately'].groupby(by=filtered_df['Delivery_Date'].dt.year).size()

Delivery_Date
2006      6
2007     28
2008    233
2009    262
2010     67
2011     56
2012     33
2013     80
2014    386
2015    356
dtype: int64

2012 has the fewest 'Weight Captured Separately' values (33) and a good number of data points (1250), so we can use it as the base year.
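
The per-year checks used to pick the base year can also be gathered into one table. The sketch below runs on a tiny invented frame with the same column names as `filtered_df`. The rows, the `summary` frame and its column names are assumptions for illustration only.

```python
import pandas as pd

# Tiny stand-in for filtered_df; the rows are invented for illustration.
df = pd.DataFrame(
    {
        "Mode": ["Air", None, "Truck", "Air", "Ocean"],
        "Weight": ["21", "Weight Captured Separately", "34", "Weight Captured Separately", "957"],
        "Delivery_Date": pd.to_datetime(
            ["2007-01-10", "2007-05-07", "2012-06-12", "2012-06-01", "2012-03-29"]
        ),
    }
)

year = df["Delivery_Date"].dt.year.rename("Year")

summary = pd.DataFrame(
    {
        "rows": df.groupby(year).size(),
        "missing_mode": df["Mode"].isna().groupby(year).sum(),
        "weight_unavailable": df["Weight"].eq("Weight Captured Separately").groupby(year).sum(),
    }
)
# Share of rows per year whose weight cannot be recovered.
summary["unavailable_share"] = summary["weight_unavailable"] / summary["rows"]
print(summary)
```

Looking at shares alongside raw counts helps when years differ a lot in size, as 2006 does here.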

# Data Cleaning

In [242]:
# Keeping only the 2012 rows whose weight is recoverable (dropping 'Weight Captured Separately')

final_df = filtered_df[(filtered_df['Delivery_Date'].dt.year == 2012) & (filtered_df['Weight'] != 'Weight Captured Separately')].copy()


In [260]:
final_df

Unnamed: 0,ID,Country,Mode,Manufacturing_site,Weight,Delivery_Date
2684,12973,Haiti,Air,"Cipla, Goa, India",21,2012-06-12
2688,13016,Burundi,Air,Chembio Diagnostics Sys. Inc.,34,2012-06-01
2689,13020,Haiti,Air,Premier Medical Corporation,1,2012-01-30
2706,13311,Côte d'Ivoire,Air,"Alere Medical Co., Ltd.",957,2012-03-29
2717,13608,Uganda,Air,"Janssen-Cilag, Latina, IT",43,2012-05-21
...,...,...,...,...,...,...
10125,86529,Uganda,Truck,Hetero Unit III Hyderabad IN,1,2012-11-20
10126,86530,Uganda,Air,"Aurobindo Unit III, India",2034,2012-11-28
10128,86532,Nigeria,Air,Mylan (formerly Matrix) Nashik,See DN-2947 (ID#:83642),2012-12-04
10129,86533,Côte d'Ivoire,Truck,Mylan (formerly Matrix) Nashik,See DN-2954 (ID#:82534),2012-12-06


In [342]:
filtered_df[filtered_df['ID'] == 82434]['Weight'].iloc[0]

'Weight Captured Separately'

In [346]:
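# Function to look up the IDs referenced in the Weight column and print the referenced weight
import re

def mapWeights(weight):
    # Numeric weights need no lookup
    if weight.isnumeric():
        return
    # Extract the ID from entries such as 'See DN-2947 (ID#:83642)', whatever its length
    match = re.search(r'ID#:(\d+)', weight)
    if match is None:
        return
    ID = int(match.group(1))
    referenced = filtered_df[filtered_df['ID'] == ID]['Weight']
    if referenced.empty:
        print(f'Error == {ID} --- no row with this ID')
    elif referenced.iloc[0].isnumeric():
        print(f'{ID} -- {int(referenced.iloc[0])}')
    else:
        # The referenced row has no usable weight either
        print(f'Error == {ID} --- {referenced.iloc[0]}')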

In [347]:
final_df['Weight'].apply(mapWeights)

16901 -- 1701
20321 -- 2951
16232 -- 1486
16901 -- 1701
20697 -- 5939
15772 -- 10210
14279 -- 10579
30618 -- 4669
22658 -- 197
30211 -- 213
25990 -- 331
16085 -- 1173
20219 -- 941
35053 -- 546
20371 -- 4587
37031 -- 2998
21091 -- 4654
16085 -- 1173
15772 -- 10210
29734 -- 6433
33116 -- 9117
20383 -- 4726
16173 -- 1598
32594 -- 22
39053 -- 175
18629 -- 1208
44737 -- 6142
22658 -- 197
24384 -- 94
44737 -- 6142
21091 -- 4654
20371 -- 4587
44737 -- 6142
30320 -- 351
19512 -- 1675
27586 -- 201
16725 -- 436
42394 -- 2486
15772 -- 10210
22658 -- 197
35053 -- 546
20383 -- 4726
20697 -- 5939
18269 -- 16
57392 -- 12945
35053 -- 546
35053 -- 546
25990 -- 331
18240 -- 1364
46858 -- 41
20371 -- 4587
37031 -- 2998
14988 -- 181
20321 -- 2951
16725 -- 436
15772 -- 10210
37031 -- 2998
25423 -- 57
20321 -- 2951
30618 -- 4669
77780 -- 35
15772 -- 10210
46005 -- 6490
30618 -- 4669
29734 -- 6433
82436 -- 308
82436 -- 308
82457 -- 2652
82459 -- 14487
82462 -- 1414
82462 -- 1414
82471 -- 729
82994 -- 8
82994

2684     None
2688     None
2689     None
2706     None
2717     None
         ... 
10125    None
10126    None
10128    None
10129    None
10136    None
Name: Weight, Length: 1217, dtype: object
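
The run above prints each lookup but returns `None` for every row, so the Weight column itself is unchanged. One possible way to write the resolved values back is sketched below. `resolve_weights`, the `demo` frame and the `Weight_kg` column are hypothetical names introduced for this example, and the sketch assumes that a reference is followed only once and that IDs are unique.

```python
import pandas as pd


def resolve_weights(df):
    """Return numeric weights, following 'See ... (ID#:NNNN)' references once."""
    # Weight of every row keyed by its ID as a string; non-numeric entries become NaN.
    by_id = pd.to_numeric(df["Weight"], errors="coerce")
    by_id.index = df["ID"].astype(str)
    by_id = by_id[~by_id.index.duplicated()]

    # Weights that are already numeric.
    direct = pd.to_numeric(df["Weight"], errors="coerce")
    # Referenced IDs as strings (NaN where the entry is not a reference).
    ref_id = df["Weight"].str.extract(r"ID#:(\d+)", expand=False)
    # Weight of the referenced row; unresolved references stay NaN.
    looked_up = ref_id.map(by_id)
    return direct.fillna(looked_up)


# Invented rows in the same formats as the notebook's data.
demo = pd.DataFrame(
    {
        "ID": [100, 101, 102, 103],
        "Weight": ["308", "See DN-1 (ID#:100)", "Weight Captured Separately", "See DN-2 (ID#:102)"],
    }
)
demo["Weight_kg"] = resolve_weights(demo)
print(demo)
```

Rows whose reference points at an entry without a usable weight, as ID 82434 does in the dataset, remain `NaN` and would need to be dropped or estimated under a documented assumption.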