This notebook is my attempt to replicate and improve on the predictive models in Blair & Sambanis (2020), “Forecasting Civil Wars: Theory and Structure in an Age of ‘Big Data’ and Machine Learning.”

In [36]:
import pandas as pd 
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

Loading the one-month data from Blair & Sambanis (2020).

In [37]:
file_path = '~/Desktop/blair&sambanis2020_rn/data/b&s_data/1mo_data.dta'
df = pd.read_stata(file_path)

Dropping rows where the target (incidence_civil_ns) is missing. I need to look into why these values are missing. Can they be assumed to be 0? Or are they observations in which a conflict is ongoing, which should be excluded?

In [38]:
df = df[df['incidence_civil_ns'].notna()]
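
One way to start on the missing-target question is to tabulate where the NaNs fall before dropping them. The sketch below runs on a small made-up frame. The grouping column country and the helper name missing_share_by are illustrative assumptions, not names confirmed to exist in the Blair & Sambanis data.

```python
import numpy as np
import pandas as pd


def missing_share_by(frame, target, by):
    """Return the share of rows with a missing target within each group."""
    return frame[target].isna().groupby(frame[by]).mean()


# Toy stand-in for the real data (all values are invented for illustration).
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],  # hypothetical grouping column
    "year": [2001, 2002, 2003, 2001, 2002, 2003],
    "incidence_civil_ns": [0.0, 1.0, np.nan, 0.0, 0.0, 0.0],
})

share_by_country = missing_share_by(toy, "incidence_civil_ns", "country")
share_by_year = missing_share_by(toy, "incidence_civil_ns", "year")
print(share_by_country)
print(share_by_year)
```

If the missing rows turn out to cluster directly after rows where the target is 1, that would be consistent with ongoing-conflict observations being coded as missing rather than 0, but that reading would still need to be checked against the replication materials.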

Selecting the inputs and target.

In [39]:
y = df.incidence_civil_ns

features = ["year", "month", "gov_opp_low_level", "gov_reb_low_level", "opp_gov_low_level", 
    "reb_gov_low_level", "gov_opp_nonviol_repression", 
    "gov_reb_nonviol_repression", "gov_opp_accommodations", 
    "gov_reb_accommodations", "reb_gov_demands", "opp_gov_demands"]
X = df[features]

Creating the training and testing split.

In [40]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)

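Preparing the data: imputing missing feature values with the column mean, then standardizing. Both transformers are fit on the training set only and then applied to the test set.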

In [41]:
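# Impute missing feature values with the column mean (SimpleImputer's default).
# Fit on the training set only, then reuse the same fill values on the test set.
my_imputer = SimpleImputer()
X_train_imputed = pd.DataFrame(my_imputer.fit_transform(X_train),
                               columns=X_train.columns, index=X_train.index)
X_test_imputed = pd.DataFrame(my_imputer.transform(X_test),
                              columns=X_test.columns, index=X_test.index)

# Feature scaling: standardize with the training-set mean and standard deviation
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train_imputed)
X_test_sc = sc.transform(X_test_imputed)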
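
The impute-then-scale steps can also be chained in a scikit-learn Pipeline, which keeps the fit-on-train, transform-on-test order in one object. This is a minimal sketch on synthetic data, not the workflow actually used in this notebook. The step names "impute" and "scale" and the toy array X_toy are assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix, with about 10% of values set to NaN.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan

X_tr, X_te = train_test_split(X_toy, random_state=78)

# Mean imputation followed by standardization, applied in that order.
prep = Pipeline(steps=[
    ("impute", SimpleImputer()),
    ("scale", StandardScaler()),
])

X_tr_prepped = prep.fit_transform(X_tr)  # learns fill values and scaling from train
X_te_prepped = prep.transform(X_te)      # reuses the train statistics on test

print(X_tr_prepped.shape, X_te_prepped.shape)
```

An estimator could be appended as a final step so that a single fit and predict call covers preprocessing and modeling together.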

Fitting an initial random forest model.

In [43]:
rf_reg = RandomForestRegressor(n_estimators=100, max_leaf_nodes=5,
    n_jobs=1, random_state=0)
rf_reg.fit(X_train_sc, y_train)
preds = rf_reg.predict(X_test_sc)

print('\nRandom Forest\n')
print('Mean absolute error: %.2f'
      % mean_absolute_error(y_test, preds))

# A random forest has no coef_ attribute; feature importances are the closest analogue
# print('Feature importances: \n', rf_reg.feature_importances_)
# The mean squared error
print('Mean squared error: %.2f'
      % mean_squared_error(y_test, preds))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
      % r2_score(y_test, preds))


Random Forest

Mean absolute error: 0.00
Mean squared error: 0.00
Coefficient of determination: 0.02
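
Errors that print as 0.00 alongside an R² of 0.02 suggest the two-decimal format may be hiding small but nonzero values. That is what one would expect if the target is a rare 0/1 indicator, which is an assumption here and has not been confirmed in this notebook. The sketch below uses invented data to show the effect. ROC AUC appears only as one example of a ranking-based metric and is not a metric the notebook above computes.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, roc_auc_score

# Invented example: 1,000 observations, only 2 of them positive.
y_true = np.zeros(1000)
y_true[:2] = 1.0

# Nearly constant predictions close to the base rate, slightly higher for positives.
y_pred = np.full(1000, 0.002)
y_pred[:2] = 0.004

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
auc = roc_auc_score(y_true, y_pred)

# Two decimals round the error away; four decimals keep it visible.
print('MAE, 2 decimals: %.2f' % mae)
print('MAE, 4 decimals: %.4f' % mae)
print('MSE, 4 decimals: %.4f' % mse)
print('ROC AUC: %.2f' % auc)
```

Printing the real metrics with more decimal places, and comparing them against the share of positive observations in y_test, would show whether the model is doing more than predicting the base rate.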
