## **Feature Engineering**

Get the CSVs into DataFrames

Opium Sown: We have the number of hectares sown for each district in each province for each year from 2010 to 2020, except for parts of Bamyan, Day Kundi, Farah, Faryab, Ghazni, and Ghor, where we only have 2008-2018 data. Pending further investigation, the missing values for these provinces are coded as zeros. All fields marked '-' or 'p-f' (poppy-free) in the CSV are replaced by zeros. Data is of numerical type. All NaN values are set to zero.
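
The placeholder handling described above can be sketched on a toy frame. The column names and values below are made up for illustration, and `pd.to_numeric(..., errors="coerce")` is used here as a stand-in for the regex replacement in the loading cell, so treat it as one possible approach rather than the notebook's exact method.

```python
import pandas as pd

# Hypothetical toy rows; the real UNODC CSV layout may differ.
raw = pd.DataFrame({
    "Province": ["Balkh", "Hilmand", "Kabul"],
    "2010": ["p-f", "1200", "35"],
    "2011": ["-", "800", None],
})

cleaned = raw.copy()
for col in ["2010", "2011"]:
    # Anything that is not a number ('-', 'p-f', missing) becomes NaN, then zero.
    cleaned[col] = pd.to_numeric(cleaned[col], errors="coerce").fillna(0)

print(cleaned)
```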

Soil Data: Dropped the WRB Codes column because it contained numerous inaccuracies as an artefact of the scraping process and because it was collinear with the Soil Type column. Soil Type column was categorical and has been turned into dummy variables with one hot encoding. All other data was of numerical type. Province name is broadcast for each row of soil sample information. All fields marked (-)(-) in the CSV are replaced by zeros. Dataframe only contains the estimates of sown areas. All margin of error information is removed from the dataframe. All NaN values are set to zero.
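
As a rough illustration of the one-hot encoding step mentioned above, the sketch below uses `pd.get_dummies` on a toy frame. The soil type names, the values, and the `Soil` prefix are assumptions made for the example only.

```python
import pandas as pd

# Hypothetical soil rows; type names and values are illustrative only.
soil = pd.DataFrame({
    "Soil_Type": ["Calcisols", "Leptosols", "Calcisols"],
    "Sand_Perc": [60.0, 45.0, 58.0],
})

# One 0/1 indicator column per distinct soil type; the original column is dropped.
encoded = pd.get_dummies(soil, columns=["Soil_Type"], prefix="Soil", dtype=int)
print(encoded)
```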

BIG CAVEAT FOR SOIL DATA: We do not have unique soil data for all 34 provinces; we only have soil data for 9 provinces. I cannot find any up-to-date soil data for the remaining provinces, only PDF maps from 2011 that show topsoil texture distribution (source: Afghan Geodesy and Cartography Head Office, conforms to United Nations Afghanistan Regions 3958.1 R3, June 2011) and qualitative assessments from the UNODC Opium Yields reports. So we subjectively broadcast the soil data from each of the 9 provinces to the nearby provinces whose topsoil texture distributions most closely resemble its own per the 2011 maps. To get a single reading for each province, we treat the area of each soil type in the province as a "weight" vector, multiply it by the respective chemical measurement for that soil type, sum each multiplied column, and divide by the total area, i.e. an area-weighted average. This gives us subjective soil quality metrics for each province. These broadcast readings should be substituted with unique soil sample results as soon as the data becomes available.
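
A minimal sketch of the area-weighted reading and the broadcast to similar provinces, assuming a frame with an `Area` column plus measurement columns. The numbers are invented, and the province pairing is only reused from the loading cell as an example.

```python
import pandas as pd

# Hypothetical soil samples for one surveyed province (values are illustrative).
samples = pd.DataFrame({
    "Area": [100.0, 300.0],
    "Sand_Perc": [40.0, 60.0],
    "pH_Water": [8.0, 7.0],
})

weights = samples["Area"]
measurements = samples.drop(columns="Area")

# Area-weighted mean per column: sum(area * value) / sum(area).
province_reading = measurements.mul(weights, axis=0).sum() / weights.sum()

# Broadcast the single reading to the surveyed province and its look-alikes.
similar_provs = ["Nimroz", "Hilmand"]
broadcast = pd.DataFrame({"Province": similar_provs}).assign(**province_reading.to_dict())
print(broadcast)
```

With these toy numbers the sand reading is (100 * 40 + 300 * 60) / 400 = 55, and both provinces receive the same row of values.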

Temperature and Precipitation Data: All data is transposed such that the years are columns and months are rows. Province name is broadcast for each row of climatological information. All NaN values are set to zero. Data is of numerical type.



In [1]:
import pandas as pd
import os
import numpy as np

In [2]:
tract_directory = "/content/ML-Climate-Final-Project-Template/data"
opium_sown = None
soil_data = None
temp_data = None
precip_data = None
climate_data = None
for filename in os.listdir(tract_directory):
  fn = tract_directory + '/' + filename
  if "WGI" in fn:
    governance_indicators = pd.read_csv (fn, header=0)
  if "Opium" in fn:
    section_frame = pd.read_csv (fn, header=0)
    section_frame.dropna(how='all', inplace=True)
    section_frame = section_frame.loc[:, ~section_frame.columns.str.contains('^Unnamed')]
    section_frame = section_frame[(section_frame['Province'].str.contains("Total")==True) | (section_frame['District'].str.contains("Total")==True)]
    section_frame.fillna('', inplace=True)
    section_frame["combo"] = section_frame["Province"] + section_frame["District"]
    section_frame['new province'] = section_frame['combo'].map(lambda tot_str: tot_str.partition('Total')[0])
    Y_sown = section_frame.drop(['Province', 'District', 'combo', 'new province'], axis=1)
    Y_sown.replace(regex={'[^0-9]': 0}, inplace=True)
    Y_sown["Province"] = section_frame['new province']
    Y_sown.replace(regex={'Sari  Pul': 'Sar-e-Pul'}, inplace=True)
    Y_sown.replace(regex={r'\s+$': ''}, inplace=True)
    if opium_sown is None:
      opium_sown = Y_sown
    else:
      opium_sown = pd.concat([opium_sown, Y_sown]).reset_index(drop=True)
      opium_sown.fillna(0, inplace=True)    

  if "Soil" in fn:
    section_frame = pd.read_csv (fn, header=0)
    section_frame.columns = ['WRB_Code', 'Soil_Type', 'Area', 'Sand_Perc', 'Clay_Perc', 'OM_Perc', 'pH_Water', 'EC', 'Tot_N_ppm', 'P_ppm', 'K_ton_per_ha', 'S_ppm', 'CaCO_ton_per_ha3']
    section_frame.drop(['WRB_Code', 'Soil_Type'], axis=1, inplace=True)
    # Strip margin-of-error annotations and '(-)(-)' placeholders, then keep only digits and dots
    section_frame.replace(regex={r'\(+.*': '', r'±.*': '', r' ±.*': '', r' \(±.*': '', r'\(-\)\(-\)': 0,  r'\(±.*': ''}, inplace=True)
    section_frame.replace(regex={r'[^0-9.]': ''}, inplace=True)
    # Empty strings become NaN and are then zero-filled
    section_frame.replace(r'^\s*$', np.nan, regex=True, inplace=True)
    section_frame.fillna(0, inplace=True)
    section_frame = section_frame.astype('float')
    
    province = filename.split('_')[0]
    if province == 'Balkh':
      similar_provs = ['Balkh', 'Kunduz', 'Jawzjan', 'Samangan', 'Sar-e-Pul', 'Faryab']
    elif province == 'Bamyan':
      similar_provs = ['Bamyan', 'Day Kundi', 'Ghor', 'Ghazni']
    elif province == 'Hirat':
      similar_provs = ['Hirat', 'Badghis', 'Farah']
    elif province == 'Kabul':
      similar_provs = ['Kabul', 'Wardak', 'Logar', 'Kapisa', 'Parwan']
    elif province == 'Kandahar':
      similar_provs = ['Kandahar', 'Uruzgan', 'Zabul']
    elif province == 'Khost':
      similar_provs = ['Khost', 'Paktika', 'Paktya']
    elif province == 'Nangarhar':
      similar_provs = ['Nangarhar', 'Kunar', 'Laghman']
    elif province == 'Nimroz':
      similar_provs = ['Nimroz', 'Hilmand']
    elif province == 'Takhar':
      similar_provs = ['Takhar', 'Badakhshan', 'Baghlan', 'Panjsher', 'Nuristan']

    X_soil = pd.DataFrame(similar_provs, columns=['Province'])
    for col_name in section_frame.columns.values.tolist():
      if col_name != 'Area':
        X_soil[col_name] = pd.Series(section_frame['Area'] * section_frame[col_name]).sum()
        X_soil[col_name]  = X_soil[col_name] / pd.Series(section_frame['Area']).sum()

    if soil_data is None:
      soil_data = X_soil
    else:
      soil_data = pd.concat([soil_data, X_soil]).reset_index(drop=True)

  if ("pr" in fn) or ("tas" in fn):
    info = pd.read_csv(fn, skiprows=2, nrows=0)
    section_frame = pd.read_csv (fn, skiprows=3)
    section_frame.rename(columns={'Unnamed: 0':'Years'}, inplace=True )
    section_frame.drop(section_frame[section_frame['Years'] < 2010].index, inplace = True)
    section_frame.insert(loc=0, column='Province', value=info.columns[1])
    section_frame.replace(regex={'Daykundi': 'Day Kundi'}, inplace=True)
    section_frame.fillna(0, inplace=True)
    if "pr" in fn:
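      # Index.union() sorts its result and would scramble the month labels, so concatenate instead
      section_frame.columns = list(section_frame.columns[:2]) + list('mean_precip_' + section_frame.columns[2:])
      if precip_data is None:
        precip_data = section_frame
      else:
        precip_data = pd.concat([precip_data, section_frame]).reset_index(drop=True)

    elif "tas" in fn:
      # Same label fix as for precipitation: keep the columns in their original order
      section_frame.columns = list(section_frame.columns[:2]) + list('mean_temp_' + section_frame.columns[2:])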
      if temp_data is None:
        temp_data = section_frame
      else:
        temp_data = pd.concat([temp_data, section_frame]).reset_index(drop=True)

climate_data = precip_data.merge(temp_data, on=['Province', 'Years'])

province_list = [
                 'Balkh', 
                 'Kunduz', 
                 'Jawzjan', 
                 'Samangan', 
                 'Sar-e-Pul', 
                 'Faryab', 
                 'Bamyan', 
                 'Day Kundi', 
                 'Ghor', 
                 'Ghazni', 
                 'Hirat', 
                 'Badghis', 
                 'Farah', 
                 'Kabul', 
                 'Wardak', 
                 'Logar', 
                 'Kapisa', 
                 'Parwan', 
                 'Kandahar', 
                 'Uruzgan', 
                 'Zabul', 
                 'Khost', 
                 'Paktika', 
                 'Paktya', 
                 'Nangarhar', 
                 'Kunar', 
                 'Laghman',
                 'Nimroz',
                 'Hilmand',
                 'Takhar', 
                 'Badakhshan', 
                 'Baghlan', 
                 'Panjsher', 
                 'Nuristan'
                 ]
opium_sown = opium_sown[opium_sown['Province'].isin(province_list)]
opium_sown = opium_sown.melt(id_vars=['Province'], var_name="Years", value_name="Hectares_Sown")

print("Features Compiled. Dataframes:")
print("Opium Sown")
print(opium_sown)
print("Governance Indicators")
print(governance_indicators)
print("Soil Data")
print(soil_data)
print("Climate Data")
print(climate_data)

Features Compiled. Dataframes:
Opium Sown
       Province Years Hectares_Sown
0    Badakhshan  2010          1100
1       Badghis  2010          2958
2       Baghlan  2010             0
3         Balkh  2010             0
4     Nangarhar  2010           719
..          ...   ...           ...
369    Panjsher  2020             0
370      Parwan  2020             0
371    Samangan  2020             0
372   Sar-e-Pul  2020             0
373      Takhar  2020             0

[374 rows x 3 columns]
Governance Indicators
    Years  Control of Corruption: Estimate  \
0    2010                        -1.636177   
1    2011                        -1.579174   
2    2012                        -1.419741   
3    2013                        -1.436510   
4    2014                        -1.354829   
5    2015                        -1.342216   
6    2016                        -1.526172   
7    2017                        -1.515626   
8    2018                        -1.487624   
9    2019           

In [3]:
part_of_X_set = climate_data.merge(governance_indicators, on="Years")
X_set = soil_data.merge(part_of_X_set, on='Province', how='outer')
print(X_set)

     Province  Sand_Perc  Clay_Perc   OM_Perc  pH_Water        EC  Tot_N_ppm  \
0    Kandahar  62.086269  14.910295  0.840974  7.929618  0.742861  36.879112   
1    Kandahar  62.086269  14.910295  0.840974  7.929618  0.742861  36.879112   
2    Kandahar  62.086269  14.910295  0.840974  7.929618  0.742861  36.879112   
3    Kandahar  62.086269  14.910295  0.840974  7.929618  0.742861  36.879112   
4    Kandahar  62.086269  14.910295  0.840974  7.929618  0.742861  36.879112   
..        ...        ...        ...       ...       ...       ...        ...   
369    Paktya  54.983806  16.427164  1.567991  8.151691  0.197600  39.386921   
370    Paktya  54.983806  16.427164  1.567991  8.151691  0.197600  39.386921   
371    Paktya  54.983806  16.427164  1.567991  8.151691  0.197600  39.386921   
372    Paktya  54.983806  16.427164  1.567991  8.151691  0.197600  39.386921   
373    Paktya  54.983806  16.427164  1.567991  8.151691  0.197600  39.386921   

         P_ppm  K_ton_per_ha     S_ppm 

## **Training and Test Set Splitting**

For each province and each year, the X dataset consists of the governance indicators, local soil features, and the mean temperature and mean precipitation in each of the 12 months of the preceding year, while Y is an indicator variable for whether any hectares of poppy were sown in the current year. We have climatological data from 2010 through 2020 (11 years). We will use 34 provinces * 10 years from 2010 to 2019 = 340 datapoints in total for training and testing. Once we have fine-tuned our benchmark, we will use the 2020 climatological features (and existing soil features) to predict the number of hectares of opium sown in 2021, and compare our prediction against the UNODC report that will come out later in the year.
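
The one-year lag between features and labels can be sketched as a join on a shifted year. The frames and values below are toy data, and `Label_Year` is a hypothetical column name; the notebook itself builds a string key and merges on the index, so this is an equivalent illustration rather than the same code.

```python
import pandas as pd

# Hypothetical features observed in year t and labels observed in year t + 1.
features = pd.DataFrame({
    "Province": ["Kabul", "Kabul"],
    "Years": [2010, 2011],
    "mean_temp_Jan": [1.5, 2.0],
})
labels = pd.DataFrame({
    "Province": ["Kabul", "Kabul"],
    "Years": [2011, 2012],
    "Poppy_Region": [0, 1],
})

# Shift the feature year forward by one so it lines up with the label year.
features["Label_Year"] = features["Years"] + 1
paired = features.merge(
    labels.rename(columns={"Years": "Label_Year"}),
    on=["Province", "Label_Year"],
    how="inner",
)
print(paired[["Province", "Years", "Label_Year", "Poppy_Region"]])
```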

We will train the classifier on data from 27 (or roughly 80%) of the 34 provinces, and test the classifier on data from the remaining 7 provinces.
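
A small sketch of a province-level split, assuming a shortened province list and a fixed seed. The seed is an addition for repeatability and is not something the cell below sets.

```python
import random

# Hypothetical short province list; the notebook uses all 34 provinces.
provinces = ["Balkh", "Kabul", "Hirat", "Khost", "Nimroz",
             "Takhar", "Bamyan", "Kandahar", "Nangarhar", "Ghor"]

rng = random.Random(0)  # fixed seed (an assumption) so the split is repeatable
n_train = round(0.8 * len(provinces))
train_provinces = set(rng.sample(provinces, n_train))
test_provinces = [p for p in provinces if p not in train_provinces]

print(sorted(train_provinces))
print(test_provinces)
```

Splitting by province rather than by row keeps every year of a given province on the same side of the split, so the test provinces are entirely unseen during training.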

In [4]:
# importing random module
import random

# initializing the value of n
n = 27

# get random n provinces from list
training_provinces = random.sample(province_list, n)

X_set['Years_to_match'] = X_set['Years'] + 1
X_set['primary_key'] = X_set['Province'] + '_' + X_set['Years_to_match'].astype(str)
# Years_to_match is kept for now; it is dropped after the X/Y merge below
X_set.set_index('primary_key', inplace=True)
X_set.drop(X_set[X_set['Years'] >= 2020].index, inplace = True)

X_train = X_set[X_set['Province'].isin(training_provinces)]
X_test = X_set[~X_set['Province'].isin(training_provinces)]

# Non-inplace drops avoid pandas' SettingWithCopyWarning on the filtered slices
X_train = X_train.drop(['Years', 'Province'], axis=1)
X_test = X_test.drop(['Years', 'Province'], axis=1)


opium_sown['primary_key'] = opium_sown['Province'] + '_' + opium_sown['Years'].astype(str)
opium_sown.set_index('primary_key', inplace=True)

opium_sown['Hectares_Sown'] = opium_sown['Hectares_Sown'].astype(float)
opium_sown['Poppy_Region'] = (opium_sown['Hectares_Sown'] > 0).astype(int)

Y_all = opium_sown.copy()
Y_all.drop(Y_all[Y_all['Years'].astype(int) <= 2010].index, inplace = True)

Y_train = Y_all[Y_all['Province'].isin(training_provinces)]
Y_test = Y_all[~Y_all['Province'].isin(training_provinces)]

# Non-inplace drops avoid pandas' SettingWithCopyWarning on the filtered slices
Y_train = Y_train.drop(['Years', 'Province', 'Hectares_Sown'], axis=1)
Y_test = Y_test.drop(['Years', 'Province', 'Hectares_Sown'], axis=1)

# Need to ensure X and Y datapoint match up by order of entry in respective df
training_set_OG = pd.merge(X_train, Y_train, left_index=True, right_index=True).reset_index(drop=True)
test_set = pd.merge(X_test, Y_test, left_index=True, right_index=True).reset_index(drop=True)

# Upsampling poppy regions in the training set only to balance the two classes
training_set_positives = training_set_OG[training_set_OG['Poppy_Region']==1]
training_set_upsample = training_set_positives.sample(frac=0.9, replace=True, random_state=1)
training_set = pd.concat([training_set_OG, training_set_upsample])


Y_train = training_set[['Poppy_Region']]
X_train = training_set.drop(['Years_to_match', 'Poppy_Region'], axis=1)

Y_test = test_set[['Poppy_Region']]
X_test = test_set.drop(['Years_to_match', 'Poppy_Region'], axis=1)

print("X_train:")
print(X_train)
print("Y_train:")
print(Y_train)

print("X_test:")
print(X_test)
print("Y_test:")
print(Y_test)

X_train:
     Sand_Perc  Clay_Perc   OM_Perc  pH_Water        EC   Tot_N_ppm  \
0    62.086269  14.910295  0.840974  7.929618  0.742861   36.879112   
1    62.086269  14.910295  0.840974  7.929618  0.742861   36.879112   
2    62.086269  14.910295  0.840974  7.929618  0.742861   36.879112   
3    62.086269  14.910295  0.840974  7.929618  0.742861   36.879112   
4    62.086269  14.910295  0.840974  7.929618  0.742861   36.879112   
..         ...        ...       ...       ...       ...         ...   
45   60.661439   4.905354  0.877024  7.993465  0.310647    0.000000   
67   60.400277   5.867435  2.579025  7.857396  0.216705    0.000000   
58   60.661439   4.905354  0.877024  7.993465  0.310647    0.000000   
232  33.132586  30.843317  2.420325  8.078585  1.088076  142.083959   
70   60.400277   5.867435  2.579025  7.857396  0.216705    0.000000   

         P_ppm  K_ton_per_ha      S_ppm  CaCO_ton_per_ha3  ...  mean_temp_Mar  \
0    32.179072      1.540933   7.549862        954.943891



In [5]:
Y_train['Poppy_Region'].value_counts()

1    182
0    174
Name: Poppy_Region, dtype: int64
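
The class counts above can be sanity-checked with a little arithmetic. The sketch assumes the training set held 27 provinces x 10 years = 270 rows before upsampling and that `sample(frac=0.9)` draws `round(0.9 * n)` rows. The values 96 and 86 are inferred from the printed counts, not printed by the notebook.

```python
# Counts reported above, after upsampling.
positives_after, negatives = 182, 174
rows_before = 27 * 10  # assumed: 27 training provinces x 10 label years

positives_before = rows_before - negatives   # inferred original positives
resampled = round(0.9 * positives_before)    # rows drawn with replacement

# The inferred numbers reproduce the reported positive count.
assert positives_before + resampled == positives_after
print(positives_before, resampled)
```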

In [6]:
Y_test['Poppy_Region'].value_counts()

0    54
1    16
Name: Poppy_Region, dtype: int64

## **Random Forest Benchmark Classifier**

In [7]:
#Import Random Forest Model
from sklearn.ensemble import RandomForestClassifier

#Create a Random Forest Classifier
clf=RandomForestClassifier(n_estimators=100)

#Train the model using the training sets (ravel() passes the labels as a 1-D array)
clf.fit(X_train, Y_train.values.ravel())

Y_pred_rfclass=clf.predict(X_test)

  


In [8]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(Y_test, Y_pred_rfclass))

Accuracy: 0.8


In [9]:
# Model Precision: what percentage of predicted positives are actually positive?
print("Precision:",metrics.precision_score(Y_test, Y_pred_rfclass))

# Model Recall: what percentage of actual positives are predicted as positive?
print("Recall:",metrics.recall_score(Y_test, Y_pred_rfclass))

# Model F1 Score:
print("F1 Score:",metrics.f1_score(Y_test, Y_pred_rfclass))

Precision: 0.625
Recall: 0.3125
F1 Score: 0.4166666666666667
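
The scores above can be reproduced by hand from a confusion matrix. The counts below are inferred from the reported scores and the 70-row test set (16 positives, 54 negatives); the notebook does not print them, so treat them as a consistency check rather than recorded output.

```python
# Inferred confusion counts for the random forest on the test set.
tp, fp, fn, tn = 5, 3, 11, 51

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

The low recall means most poppy-growing province-years in the test set are missed, which the headline accuracy of 0.8 hides because negatives dominate the test set.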


In [10]:
Y_pred_rfclass

array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0])

## **SVM Classifier**

In [11]:
#Import svm model
from sklearn import svm

#Create a svm Classifier
clf = svm.SVC(kernel='rbf', C=1.0, gamma='scale') # Non-Linear problem, using radial basis function kernel

#Train the model using the training sets (ravel() passes the labels as a 1-D array)
clf.fit(X_train, Y_train.values.ravel())

#Predict the response for test dataset
Y_pred_svm = clf.predict(X_test)


In [12]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Model Accuracy: how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(Y_test, Y_pred_svm))

Accuracy: 0.7142857142857143


In [13]:
# Model Precision: what percentage of predicted positives are actually positive?
print("Precision:",metrics.precision_score(Y_test, Y_pred_svm))

# Model Recall: what percentage of actual positives are predicted as positive?
print("Recall:",metrics.recall_score(Y_test, Y_pred_svm))

# Model F1 Score:
print("F1 Score:",metrics.f1_score(Y_test, Y_pred_svm))

Precision: 0.43333333333333335
Recall: 0.8125
F1 Score: 0.5652173913043479


In [14]:
Y_pred_svm

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0])