**Feature Engineering**

Part 1: Get the CSVs into DataFrames

Opium Sown: Have number of sown hectares for each district in each province for each year between 2010 - 2020, except for part of Bamyan, Day Kundi, Farah, Faryab, Ghazni, and Ghor, where we have 2008-2018 data. Pending further investigation, we will code these provinces with zeros. All fields marked '-' or 'p-f' (poppy-free) in the CSV are replaced by zeros. Data is of numerical type. All NaN values are set to zero.

Soil Data: Dropped the WRB Codes column because it contained numerous inaccuracies as an artefact of the scraping process and because it was collinear with the Soil Type column. Soil Type column was categorical and has been turned into dummy variables with one hot encoding. All other data was of numerical type. Province name is broadcast for each row of soil sample information. All fields marked (-)(-) in the CSV are replaced by zeros. Dataframe only contains the estimates of sown areas. All margin of error information is removed from the dataframe. All NaN values are set to zero.

BIG CAVEAT FOR SOIL DATA: We do not unique soil data for all 34 provinces. We only have soil data for 9 provinces. I cannot find any up to date soil data for all the provinces - only PDF maps from 2011 that shows topsoil texture distribution (source: Afghan Geodesy and Cartography Head Office, conforms to United Nations Afghanistan Regions 3958.1 R3, June 2011), and qualitative assessments from the UNODC Opium Yields reports. So we are going to subjectively broadcast the soil data from each of the 9 provinces to the closest provinces whose topsoil texture distributions closely resemble each others per the 2011 maps. To get a single reading for each province, we will consider the area of each soil type in each province as the "weight" vector and multiply it to the respective chemical measurement of that soil type, and take the sum of each multiplied column, kind of like a weighted average. And so we will have with subjective soil quality metrics for each province. But these broadcast readings should be subsituted with unique soil sample results as soon as the data becomes available.

Temperature and Precipitation Data: All data is transposed such that the years are columns and months are rows. Province name is broadcast for each row of climatological information. All NaN values are set to zero. Data is of numerical type.



In [1]:
import pandas as pd
import os
import numpy as np

In [52]:
tract_directory = "/content/ML-Climate-Final-Project-Template/data"
opium_sown = None
soil_data = None
temp_data = None
precip_data = None
climate_data = None
for filename in os.listdir(tract_directory):
  fn = tract_directory + '/' + filename
  if "Opium" in fn:
    section_frame = pd.read_csv (fn, header=0)
    section_frame.dropna(how='all', inplace=True)
    section_frame = section_frame.loc[:, ~section_frame.columns.str.contains('^Unnamed')]
    section_frame = section_frame[(section_frame['Province'].str.contains("Total")==True) | (section_frame['District'].str.contains("Total")==True)]
    section_frame.fillna('', inplace=True)
    section_frame["combo"] = section_frame["Province"] + section_frame["District"]
    section_frame['new province'] = section_frame['combo'].map(lambda tot_str: tot_str.partition('Total')[0])
    Y_sown = section_frame.drop(['Province', 'District', 'combo', 'new province'], axis=1)
    Y_sown.replace(regex={'[^0-9]': 0}, inplace=True)
    Y_sown["Province"] = section_frame['new province']
    Y_sown.replace(regex={'Sari  Pul': 'Sar-e-Pul'}, inplace=True)
    Y_sown.replace(regex={'\s+$': ''}, inplace=True)
    if opium_sown is None:
      opium_sown = Y_sown
    else:
      opium_sown = pd.concat([opium_sown, Y_sown]).reset_index(drop=True)
      opium_sown.fillna(0, inplace=True)    

  if "Soil" in fn:
    section_frame = pd.read_csv (fn, header=0)
    section_frame.set_axis(['WRB_Code', 'Soil_Type', 'Area', 'Sand_Perc', 'Clay_Perc', 'OM_Perc', 'pH_Water', 'EC', 'Tot_N_ppm', 'P_ppm', 'K_ton_per_ha', 'S_ppm', 'CaCO_ton_per_ha3'], axis=1, inplace=True)
    section_frame.drop(['WRB_Code', 'Soil_Type'], axis=1, inplace=True)
    section_frame.replace(regex={'\(+.*': '', '±.*': '', ' ±.*': '', ' \(±.*': '', '\(-\)\(-\)': 0,  '\(±.*': ''}, inplace=True)
    section_frame.replace(regex={'[^0-9.]': ''}, inplace=True)
    section_frame.replace(r'^\s*$', np.NaN, regex=True, inplace=True)
    section_frame.fillna(0, inplace=True)
    section_frame.dropna(how='all', inplace=True)
    section_frame = section_frame.astype('float')
    
    province = filename.split('_')[0]
    if province == 'Balkh':
      similar_provs = ['Balkh', 'Kunduz', 'Jawzjan', 'Samangan', 'Sar-e-Pul', 'Faryab']
    elif province == 'Bamyan':
      similar_provs = ['Bamyan', 'Day Kundi', 'Ghor', 'Ghazni']
    elif province == 'Hirat':
      similar_provs = ['Hirat', 'Badghis', 'Farah']
    elif province == 'Kabul':
      similar_provs = ['Kabul', 'Wardak', 'Logar', 'Kapisa', 'Parwan']
    elif province == 'Kandahar':
      similar_provs = ['Kandahar', 'Uruzgan', 'Zabul']
    elif province == 'Khost':
      similar_provs = ['Khost', 'Paktika', 'Paktya']
    elif province == 'Nangarhar':
      similar_provs = ['Nangarhar', 'Kunar', 'Laghman']
    elif province == 'Nimroz':
      similar_provs = ['Nimroz', 'Hilmand']
    elif province == 'Takhar':
      similar_provs = ['Takhar', 'Badakhshan', 'Baghlan', 'Panjsher', 'Nuristan']

    X_soil = pd.DataFrame(similar_provs, columns=['Province'])
    for col_name in section_frame.columns.values.tolist():
      if col_name != 'Area':
        X_soil[col_name] = pd.Series(section_frame['Area'] * section_frame[col_name]).sum()

    if soil_data is None:
      soil_data = X_soil
    else:
      soil_data = pd.concat([soil_data, X_soil]).reset_index(drop=True)

  if ("pr" in fn) or ("tas" in fn):
    info = pd.read_csv(fn, skiprows=2, nrows=0)
    section_frame = pd.read_csv (fn, skiprows=3)
    section_frame.rename(columns={'Unnamed: 0':'Years'}, inplace=True )
    section_frame.drop(section_frame[section_frame['Years'] < 2010].index, inplace = True)
    section_frame.insert(loc=0, column='Province', value=info.columns[1])
    section_frame.replace(regex={'Daykundi': 'Day Kundi'}, inplace=True)
    section_frame.fillna(0, inplace=True)
    if "pr" in fn:
      section_frame.columns = section_frame.columns[:2].union('mean_precip_' + section_frame.columns[2:])
      if precip_data is None:
        precip_data = section_frame
      else:
        precip_data = pd.concat([precip_data, section_frame]).reset_index(drop=True)

    elif "tas" in fn:
      section_frame.columns = section_frame.columns[:2].union('mean_temp_' + section_frame.columns[2:])
      if temp_data is None:
        temp_data = section_frame
      else:
        temp_data = pd.concat([temp_data, section_frame]).reset_index(drop=True)

climate_data = precip_data.merge(temp_data, on=['Province', 'Years'])

province_list = [
                 'Balkh', 
                 'Kunduz', 
                 'Jawzjan', 
                 'Samangan', 
                 'Sar-e-Pul', 
                 'Faryab', 
                 'Bamyan', 
                 'Day Kundi', 
                 'Ghor', 
                 'Ghazni', 
                 'Hirat', 
                 'Badghis', 
                 'Farah', 
                 'Kabul', 
                 'Wardak', 
                 'Logar', 
                 'Kapisa', 
                 'Parwan', 
                 'Kandahar', 
                 'Uruzgan', 
                 'Zabul', 
                 'Khost', 
                 'Paktika', 
                 'Paktya', 
                 'Nangarhar', 
                 'Kunar', 
                 'Laghman',
                 'Nimroz',
                 'Hilmand',
                 'Takhar', 
                 'Badakhshan', 
                 'Baghlan', 
                 'Panjsher', 
                 'Nuristan'
                 ]
opium_sown = opium_sown[opium_sown['Province'].isin(province_list)]
opium_sown = opium_sown.melt(id_vars=['Province'], var_name="Years", value_name="Hectares_Sown")

print("Features Compiled. Dataframes:")
print("Opium Sown")
print(opium_sown)
print("Soil Data")
print(soil_data)
print("Climate Data")
print(climate_data)

Features Compiled. Dataframes:
Opium Sown
      Province Years Hectares_Sown
0     Panjsher  2010             0
1       Parwan  2010             0
2     Samangan  2010             0
3    Sar-e-Pul  2010             0
4       Takhar  2010             0
..         ...   ...           ...
369  Nangarhar  2020          2225
370     Nimroz  2020          2931
371   Nuristan  2020             0
372    Paktika  2020             0
373     Paktya  2020             0

[374 rows x 3 columns]
Soil Data
      Province     Sand_Perc     Clay_Perc        OM_Perc      pH_Water  \
0    Nangarhar  8.917814e+05  1.642795e+05    8102.759193  1.155549e+05   
1        Kunar  8.917814e+05  1.642795e+05    8102.759193  1.155549e+05   
2      Laghman  8.917814e+05  1.642795e+05    8102.759193  1.155549e+05   
3        Hirat  4.385181e+06  4.082190e+06  320336.086800  1.069221e+06   
4      Badghis  4.385181e+06  4.082190e+06  320336.086800  1.069221e+06   
5        Farah  4.385181e+06  4.082190e+06  320336.086

In [77]:
X_set = soil_data.merge(climate_data, on='Province', how='outer')
print(X_set)

      Province    Sand_Perc     Clay_Perc       OM_Perc       pH_Water  \
0    Nangarhar  891781.3891  164279.52858   8102.759193  115554.886126   
1    Nangarhar  891781.3891  164279.52858   8102.759193  115554.886126   
2    Nangarhar  891781.3891  164279.52858   8102.759193  115554.886126   
3    Nangarhar  891781.3891  164279.52858   8102.759193  115554.886126   
4    Nangarhar  891781.3891  164279.52858   8102.759193  115554.886126   
..         ...          ...           ...           ...            ...   
369     Paktya  408612.1570  122078.46700  11652.528800   60579.290000   
370     Paktya  408612.1570  122078.46700  11652.528800   60579.290000   
371     Paktya  408612.1570  122078.46700  11652.528800   60579.290000   
372     Paktya  408612.1570  122078.46700  11652.528800   60579.290000   
373     Paktya  408612.1570  122078.46700  11652.528800   60579.290000   

              EC      Tot_N_ppm          P_ppm  K_ton_per_ha        S_ppm  \
0    3471.719562  642438.152062  4

**Random Forest Benchmark Regressor**

For each province, for each year, the X dataset is the local soil features, and the mean temperature and mean precipitation in the 12 months of the precending year, while the Y is the number of hectares sown in the current year. We have climatological data from 2010 through 2020 (11 years). We will use 34 provinces * 10 years from 2010 to 2019 = 340 datapoints in total for training and testing. Once we have fine-tuned our benchmark, we will use the 2020 climatological features (and existing soil featuers) to predict the number of hectares of opium sown in 2021, and compare our prediction against the UNODC report that will come out later in the year.

We will train the regressor on data from 27 (or roughly 80%) of the 34 provinces, and test the regressor on data from the remaining 7 provinces.

In [80]:
X_set['Years_to_match'] = X_set['Years'] + 1
X_set['primary_key'] = X_set['Province'] + '_' + X_set['Years_to_match'].astype(str)
X_set.drop(['Years_to_match'], axis=1)
X_set.set_index('primary_key', inplace=True)
X_train = X_set.copy()
X_test = X_set.copy()
X_train.drop(X_train[X_train['Years'] > 2019].index, inplace = True)
X_test.drop(X_test[X_test['Years'] > 2019].index, inplace = True)
X_train = X_train[X_train['Province'].isin(province_list[:27])]
X_test = X_test[X_test['Province'].isin(province_list[27:])]
X_train.drop(['Years', 'Province'], axis=1, inplace=True)
X_test.drop(['Years', 'Province'], axis=1, inplace=True)


opium_sown['primary_key'] = opium_sown['Province'] + '_' + opium_sown['Years'].astype(str)
opium_sown.set_index('primary_key', inplace=True)
Y_train = opium_sown.copy()
Y_test = opium_sown.copy()
Y_train.drop(Y_train[Y_train['Years'].astype(int) < 2011].index, inplace = True)
Y_test.drop(Y_test[Y_test['Years'].astype(int) < 2011].index, inplace = True)
Y_train = Y_train[Y_train['Province'].isin(province_list[:27])]
Y_test = Y_test[Y_test['Province'].isin(province_list[27:])]
Y_train.drop(['Years', 'Province'], axis=1, inplace=True)
Y_test.drop(['Years', 'Province'], axis=1, inplace=True)

# Need to ensure X and Y datapoint match up by order of entry in respective df
training_set = pd.merge(X_train, Y_train, left_index=True, right_index=True).reset_index(drop=True)
test_set = pd.merge(X_test, Y_test, left_index=True, right_index=True).reset_index(drop=True)

Y_train = training_set[['Hectares_Sown']]
X_train = training_set.drop(['Hectares_Sown'], axis=1)
Y_test = test_set[['Hectares_Sown']]
X_test = test_set.drop(['Hectares_Sown'], axis=1)

print("X_train:")
print(X_train)
print("X_test:")
print(X_test)
print("Y_train:")
print(Y_train)
print("Y_test:")
print(Y_test)

X_train:
       Sand_Perc     Clay_Perc       OM_Perc       pH_Water           EC  \
0    891781.3891  164279.52858   8102.759193  115554.886126  3471.719562   
1    891781.3891  164279.52858   8102.759193  115554.886126  3471.719562   
2    891781.3891  164279.52858   8102.759193  115554.886126  3471.719562   
3    891781.3891  164279.52858   8102.759193  115554.886126  3471.719562   
4    891781.3891  164279.52858   8102.759193  115554.886126  3471.719562   
..           ...           ...           ...            ...          ...   
265  408612.1570  122078.46700  11652.528800   60579.290000  1468.462000   
266  408612.1570  122078.46700  11652.528800   60579.290000  1468.462000   
267  408612.1570  122078.46700  11652.528800   60579.290000  1468.462000   
268  408612.1570  122078.46700  11652.528800   60579.290000  1468.462000   
269  408612.1570  122078.46700  11652.528800   60579.290000  1468.462000   

         Tot_N_ppm          P_ppm  K_ton_per_ha        S_ppm  \
0    642438.15

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [83]:
# Fitting Random Forest Regression to the dataset
# import the regressor
from sklearn.ensemble import RandomForestRegressor
  
 # create regressor object
random_forest_regressor = RandomForestRegressor(n_estimators = 100, random_state = 1)
  
# fit the regressor with x and y data
random_forest_regressor.fit(X_train, Y_train)

  if __name__ == '__main__':


RandomForestRegressor(random_state=1)

In [84]:
# Evaluating the Trained Random Forest Regressor
from sklearn.metrics import explained_variance_score, max_error, mean_absolute_error, mean_squared_error, mean_squared_log_error, r2_score
Y_pred_test = random_forest_regressor.predict(X_test)

# View Regression Metrics:
print("Explained variance regression score function")
print(explained_variance_score(Y_test, Y_pred_test))
print("The max_error metric calculates the maximum residual error")	
print(max_error(Y_test, Y_pred_test))
print("Mean absolute error regression loss")
print(mean_absolute_error(Y_test, Y_pred_test))
print("Mean squared error regression loss")
print(mean_squared_error(Y_test, Y_pred_test))
print("Mean squared logarithmic error regression loss")
print(mean_squared_log_error(Y_test, Y_pred_test))
print("R^2 (coefficient of determination) regression score function")
print(r2_score(Y_test, Y_pred_test))

Explained variance regression score function
-0.16329474600828342
The max_error metric calculates the maximum residual error
46507.18
Mean absolute error regression loss
4401.492428571429
Mean squared error regression loss
81438056.21063569
Mean squared logarithmic error regression loss
30.57135336140466
R^2 (coefficient of determination) regression score function
-0.29917301876219704
