**Feature Engineering**

Part 1: Get the CSVs into DataFrames

Opium Sown: We have the number of hectares sown in each district of each province for every year from 2010 to 2020, except for part of Bamyan, Day Kundi, Farah, Faryab, Ghazni, and Ghor, where the data cover 2008-2018. Pending further investigation, we code the missing values for these provinces as zeros. All fields marked '-' or 'p-f' (poppy-free) in the CSV are replaced by zeros, and all NaN values are set to zero. Data is of numerical type.
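
As a minimal sketch of the placeholder handling described above, the snippet below zeroes out '-' and 'p-f' cells and converts the year columns to numbers. The toy rows, the province and district names, and the helper `clean_sown` are hypothetical and exist only for illustration; they are not part of the project data or code.

```python
import pandas as pd

# Hypothetical miniature of one opium CSV; all values are made up.
raw = pd.DataFrame({
    "Province": ["Example-Province", "Example-Province"],
    "District": ["District A", "District-B"],
    "2010": ["-", "12"],
    "2011": ["p-f", None],
})

year_cols = [c for c in raw.columns if c not in ("Province", "District")]

def clean_sown(frame, columns):
    """Return a copy whose year columns are numeric, with placeholders and NaN set to 0."""
    out = frame.copy()
    # Exact-match replacement: only cells that are entirely '-' or 'p-f' change.
    out[columns] = out[columns].replace({"-": "0", "p-f": "0"})
    # Convert to numbers; anything unparseable becomes NaN and is then zeroed.
    out[columns] = out[columns].apply(pd.to_numeric, errors="coerce").fillna(0)
    return out

cleaned = clean_sown(raw, year_cols)
print(cleaned)
```

Restricting the replacement to the year columns and matching whole cells keeps hyphenated names such as the made-up "District-B" intact.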

Soil Data: We dropped the WRB Code column because the scraping process left it with numerous inaccuracies and because it was collinear with the Soil type column. The Soil type column was categorical and has been converted to dummy variables with one-hot encoding; all other data is of numerical type. The province name is broadcast to every row of soil sample information. All fields marked (-)(-) in the CSV are replaced by zeros. The dataframe contains only the estimates; all margin-of-error information is removed. All NaN values are set to zero.
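
The soil preprocessing described above could be sketched roughly as follows. The `pH` column, the soil type labels, and the province name are placeholders chosen for illustration; the real CSVs may use different column names and value formats.

```python
import pandas as pd

# Hypothetical soil rows; only the 'Soil type' column name comes from the notebook.
soil = pd.DataFrame({
    "Soil type": ["Calcisols", "Leptosols", "Calcisols"],
    "pH": ["7.9 (±0.3)", "(-)(-)", "8.1 (±0.2)"],
})

# Drop the "(±...)" margin-of-error suffix, keeping only the point estimate.
soil["pH"] = soil["pH"].str.replace(r"\s*\(±.*$", "", regex=True)
# '(-)(-)' marks a missing value; code it as zero.
soil["pH"] = soil["pH"].replace({"(-)(-)": "0"})
soil["pH"] = pd.to_numeric(soil["pH"], errors="coerce").fillna(0)

# Broadcast the province name to every row, then one-hot encode the soil type.
soil["Province"] = "Example"
encoded = pd.get_dummies(soil, columns=["Soil type"], prefix="Soil_Type")
print(encoded)
```

Each distinct soil type becomes its own `Soil_Type_...` indicator column, and the original `Soil type` column is removed.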

Temperature and Precipitation Data: All data is transposed so that years are columns and months are rows. The province name is broadcast to every row of climatological information. All NaN values are set to zero. Data is of numerical type.
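
To illustrate the transposition described above, here is a small sketch that assumes the raw table has one row per year and one column per month. That layout, the `Year` and `Month` labels, and the values are assumptions made for the example, not a description of the actual climate files.

```python
import pandas as pd

# Hypothetical table: one row per year, one column per month (values made up).
by_year = pd.DataFrame({
    "Year": [2010, 2011],
    "Jan": [1.0, 2.0],
    "Feb": [3.0, None],
})

# Make years the columns and months the rows.
by_month = by_year.set_index("Year").T
by_month.columns = by_month.columns.astype(str)
by_month.columns.name = None
by_month.index.name = "Month"
by_month = by_month.reset_index()

# Broadcast the province name to every row and zero out missing values.
by_month["Province"] = "Example"
by_month = by_month.fillna(0)
print(by_month)
```

Unlike the notebook cell, this sketch keeps the month labels as a column, which can make later joins on month easier.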



In [1]:
import pandas as pd
import os

In [6]:
tract_directory = "/content/ML-Climate-Final-Project-Template/data"
opium_sown = None
soil_data = None
temp_data = None
precip_data = None
for filename in os.listdir(tract_directory):
  fn = os.path.join(tract_directory, filename)
  if "Opium" in fn:
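    section_frame = pd.read_csv(fn, header=0)
    # Drop the columns that pandas auto-named 'Unnamed: N'.
    section_frame = section_frame.loc[:, ~section_frame.columns.str.contains('^Unnamed')]
    # Remove the "total" summary rows.
    section_frame = section_frame[section_frame.Province.str.contains("total", case=False) == False]
    section_frame = section_frame[section_frame.District.str.contains("total", case=False) == False]
    # Zero out cells that consist solely of '-' or 'p-f' (poppy-free). The patterns are
    # anchored so that hyphenated province or district names are left untouched.
    section_frame.replace(regex={r'^\s*-\s*$': 0, r'^\s*p-f\s*$': 0}, inplace=True)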
    section_frame.dropna(how='all', inplace=True)
    if opium_sown is None:
      opium_sown = section_frame
    else:
      opium_sown = pd.concat([opium_sown, section_frame]).reset_index(drop=True)
      opium_sown.fillna(0, inplace=True)

  if "Soil" in fn:
    section_frame = pd.read_csv(fn, header=0)
    # Strip "(±...)" margin-of-error suffixes and zero out "(-)(-)" placeholders.
    section_frame.replace(regex={r' \(±.*': '', r'\(-\)\(-\)': 0, r'\(±.*': ''}, inplace=True)
    section_frame.dropna(how='all', inplace=True)
    section_frame.drop(['WRB Code'], axis=1, inplace=True)
    province = filename.split('_')[0]
    section_frame['Province']=province
    section_frame = pd.get_dummies(section_frame, columns=['Soil type'], prefix="Soil_Type")
    if soil_data is None:
      soil_data = section_frame
    else:
      soil_data = pd.concat([soil_data, section_frame]).reset_index(drop=True)
      soil_data.fillna(0, inplace=True)

  if ("pr" in fn) or ("tas" in fn):
    # Read only the header on the third line; its second entry is the province name.
    info = pd.read_csv(fn, skiprows=2, nrows=0)
    section_frame = pd.read_csv(fn, skiprows=3)
    section_frame.rename(columns={'Unnamed: 0': 'Months'}, inplace=True)
    section_frame['Months'] = section_frame['Months'].astype(str)
    # Transpose, then promote the first row to column labels.
    section_frame_transposed = section_frame.T
    section_frame_transposed.columns = section_frame_transposed.iloc[0]
    section_frame_transposed.drop(section_frame_transposed.index[0], inplace=True)
    # reset_index returns a new frame by default, so apply it in place.
    section_frame_transposed.reset_index(drop=True, inplace=True)
    # Discard the first 109 columns.
    section_frame_transposed.drop(columns=section_frame_transposed.columns[:109], inplace=True)
    section_frame_transposed['Province'] = info.columns[1]
    section_frame_transposed.fillna(0, inplace=True)
    if "pr" in fn:
      if precip_data is None:
        precip_data = section_frame_transposed
      else:
        precip_data = pd.concat([precip_data, section_frame_transposed]).reset_index(drop=True)
    elif "tas" in fn:
      if temp_data is None:
        temp_data = section_frame_transposed
      else:
        temp_data = pd.concat([temp_data, section_frame_transposed]).reset_index(drop=True)


print("Features Compiled. Dataframes:")
print("Opium Sown")
print(opium_sown)
print("Soil Data")
print(soil_data)
print("Precipitation Data")
print(precip_data)
print("Mean Temperature Data")
print(temp_data)

Features Compiled. Dataframes:
Opium Sown
    Province                     District 2008 2009  ... 2017 2018 2019 2020
0     Bamyan   Bamyan (Provincial Center)    0    0  ...    0    0    0    0
1     Bamyan                      Kahmard    0    0  ...    0    0    0    0
2     Bamyan                       Panjab    0    0  ...    0    0    0    0
3     Bamyan                      Saighan    0    0  ...    0    0    0    0
4     Bamyan                       Shebar    0    0  ...    0    0    0    0
..       ...                          ...  ...  ...  ...  ...  ...  ...  ...
481   Takhar                     Namak Ab    0    0  ...    0    0    0    0
482   Takhar                       Rustaq    0    0  ...   23  193    0    0
483   Takhar  Taloqan (Provincial Center)    0    0  ...    0    1    0    0
484   Takhar                       Warsaj    0    0  ...    0    0    0    0
485   Takhar                   Yangi Qala    0    0  ...    0    0    0    0

[486 rows x 15 columns]
Soil Data

In [None]:
print(soil_data['Province'])

0       Hirat
1       Hirat
2       Hirat
3       Hirat
4       Hirat
        ...  
128    Takhar
129    Takhar
130    Takhar
131    Takhar
132    Takhar
Name: Province, Length: 133, dtype: object
