# Forecasting Civil Wars

## Importing section

In [1]:
import civil_war_base as cw
import pandas as pd
import timeit
from itertools import compress
from country_converter import CountryConverter
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import timedelta  
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import scipy.stats as sp
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import roc_auc_score

## 1. Reading and cleaning ICEWS data

### 1.1. About the data

We have created a Python module called **"civil_war_base.py"** that contains most of the functions used in this notebook, along with some helpers that are not called here. For more information about a function, look it up in that file: every function has a docstring and is commented.

This part was the most difficult one, since it required reading very large datasets, modifying them and concatenating them.

The data comes from the Integrated Crisis Early Warning System ([ICEWS](https://en.wikipedia.org/wiki/Integrated_Conflict_Early_Warning_System)) dataset, which contains political events from across the world, each with its date and additional information ranging from the source and target agents to a value for the impact of the event in the country. The latest version can be found in [Harvard Dataverse](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/28075).

We used version 29.0 of the ICEWS dataset, except for the year 2017: that file was corrupted, so we used version 28.0 instead. We renamed every file to the year it contains and zipped them into the file **"events.zip"**. This file is very large, and it was included in this GitHub repository with the help of [Git Large File Storage](https://git-lfs.github.com/).

The data contains the coded events together with the country where they took place, the source and the target.


#### 1.1.1. Conflict and Mediation Event Observations code (CAMEO Code)

The Conflict and Mediation Event Observations code ([CAMEO Code](https://en.wikipedia.org/wiki/Conflict_and_Mediation_Event_Observations#:~:text=Conflict%20and%20Mediation%20Event%20Observations%20(CAMEO)%20is%20a%20framework%20for,system%20developed%20by%20Charles%20A.)) is a framework for coding event data. The codebook can also be downloaded from the [Harvard Dataverse](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/28075) database that contains the ICEWS data. The coding summarizes each action as a number. There are over 300 codes, each identifying a different action.

### 1.2. Cleaning the data

For this section we will use as an example the year 2019. We will show how the cleaning process worked for that year and then apply the same process to every year. The process goes as follows: 

**1. Read the data of the year**

For this purpose we have defined a function inside **"civil_war_base.py"** called <code>read_events_year()</code> which will extract from the **"events.zip"** file the dataset containing all events of the selected year:

In [None]:
all_events = cw.read_events_year("events.zip",2019)

In [None]:
all_events.sample(5)

**2. Selecting events with same source and target country**

The purpose of this project is to analyse civil wars, and civil wars occur within a country, so we are not interested in events involving different countries. The function <code>internal_events_year()</code> will do this selection for us:

In [None]:
internal_events = cw.internal_events_year("events.zip",2019)

In [None]:
internal_events.sample(5)
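
The selection described in this step can be sketched with plain pandas. This is only an illustration on a toy table, not the actual body of <code>internal_events_year()</code>; the column names _"Source Country"_ and _"Target Country"_ are the ICEWS columns used later in this notebook, and the rows are made up.

```python
import pandas as pd

# Toy events table (made-up rows) with the ICEWS country columns.
events = pd.DataFrame({
    "Source Country": ["Spain", "Spain", "France", None],
    "Target Country": ["Spain", "France", "France", "Chile"],
    "Event Text": ["Make statement", "Consult", "Protest", "Appeal"],
})

# Keep only the rows whose source and target country coincide.
# A missing country never compares equal, so those rows are dropped too.
same_country = events["Source Country"] == events["Target Country"]
internal = events[same_country].reset_index(drop=True)
print(internal)
```

Only the first and third rows survive, since they are the only ones where both countries match.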

**3. Establishing source and target**

Once we have internal events selected, we have to establish who did what to whom. Fortunately, ICEWS has the [ICEWS Dictionaries](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/28118). These dictionaries contain relevant information about the sectors, agents and actors that take part in the events. 

We will study four different groups as identified from the Sector ICEWS Dictionary:
 - Government (Gov): Rules the country.
 - Opposition (Opp): Confronts the government.
 - Insurgents (Ins): Violently confronts the government. 
 - People (Peo): Workers, citizens and civilians affected by the others. Media is also included in this sector, due to its influence on public opinion. 
 
We have created a function that helps us do this, <code>sector_filter()</code>. Our objective is to maximize the number of events we extract, so several different names are accepted for each group: 

In [None]:
internal_events["Source Sectors"] = internal_events["Source Sectors"].apply(cw.sector_filter)
pct_miss = internal_events["Source Sectors"].isnull().mean()*100
print("Missing "+str(pct_miss)+"% of the data")

In [None]:
internal_events["Target Sectors"] = internal_events["Target Sectors"].apply(cw.sector_filter)
pct_miss = internal_events["Target Sectors"].isnull().mean()*100
print("Missing "+str(pct_miss)+"% of the data")

We see that with our criteria we are missing about 20% of the data. This percentage is about the same for every year. To recover some of it, we can use the _"Source Name"_ and _"Target Name"_ columns to get more information about the sources and targets.

In [None]:
internal_events[internal_events["Source Sectors"].isnull() | internal_events["Target Sectors"].isnull()].sample(5)

We see that sometimes, when the _"Source Sectors"_ value is missing, the _"Source Name"_ column contains the name of the country. The same happens for the _"Target Sectors"_ and _"Target Name"_ columns. We will treat these cases as events whose source or target is the Government. 

In [None]:
internal_events.loc[internal_events["Source Name"] == internal_events["Source Country"],"Source Sectors"] = "Government"
internal_events.loc[internal_events["Target Name"] == internal_events["Target Country"],"Target Sectors"] = "Government"

In [None]:
pct_miss = internal_events["Source Sectors"].isnull().mean()*100
print("Missing "+str(pct_miss)+"% of the data")

In [None]:
pct_miss = internal_events["Target Sectors"].isnull().mean()*100
print("Missing "+str(pct_miss)+"% of the data")

We see that we have greatly improved the amount of data we are keeping. We could go further and extract information from the Actors and Agents ICEWS Dictionaries. We tried this approach; however, the code took 4.9 minutes to run for a single year, which means that filtering all the years in the ICEWS dataset would take about 2 hours! 

The reported improvement was roughly an extra 0.4% of data. Considering the amount of time it takes to run and the little extra information it gets, we will skip this step. The code used is hidden in the cell below (...) in case the reader wants to check it:

We created a function called <code>read_filtered_data()</code> that applies these changes to the data before generating the dataset, so that we can directly obtain a cleaned dataset: 

In [None]:
filtered_events = cw.read_filtered_data("events.zip",2019)

In [None]:
filtered_events.sample(5)

**4. Selecting relevant columns**

Now that we have established the sectors of the source and the target, we will select only the relevant columns. These are listed below, followed by the new name they will be given:
- Source Sectors - Source
- Event Text - Event
- CAMEO Code - CAMEO
- Intensity - Intensity
- Target Sectors - Target
- Country - Country
- Month - Month
- Year - Year

Finally, since we are focusing on the interaction between different groups, we have to select events in which the source and the target are different. If there is a missing value in the _"Source"_ or _"Target"_ columns, we will drop the whole row. 

This selection with all the previous steps is implemented in the function <code>read_cols_filtered()</code>:

In [None]:
final_events = cw.read_cols_filtered("events.zip",2019)

In [None]:
final_events.sample(5)
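
The final row selection of this step (drop rows with a missing sector, then keep only events between different sectors) can be sketched as follows. This is a minimal illustration on made-up rows, assuming the columns have already been renamed to _"Source"_ and _"Target"_; it is not the actual body of <code>read_cols_filtered()</code>.

```python
import pandas as pd

# Made-up events after renaming the sector columns.
events = pd.DataFrame({
    "Source": ["Government", "Insurgents", "People", None, "Opposition"],
    "Target": ["Insurgents", "Insurgents", "Government", "People", None],
})

# Drop rows where either side is unknown ...
known = events.dropna(subset=["Source", "Target"])
# ... then keep only events between two different sectors.
cross_sector = known[known["Source"] != known["Target"]].reset_index(drop=True)
print(cross_sector)
```

Of the five toy rows, two are dropped for missing values and one because source and target are the same sector.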

In [None]:
print(len(final_events)/len(all_events)*100,"%")

As shown in the cell above, after cleaning our data we retain only about 27% of the total number of events reported by the ICEWS dataset for the year 2019. 

To register all the possible interactions between different sectors, we will use dummy variables that code events from source a to target b as a_b. If a_b is 1, the event had a as the source and b as the target. 

All these steps are done in the function <code>source_target_interaction</code>:

In [None]:
final_events2 = cw.source_target_interaction("events.zip",2019)

In [None]:
final_events2
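
The a_b coding can be illustrated with <code>pd.get_dummies</code>. This is a sketch on made-up rows using the sector abbreviations from this notebook (Gov, Ins, Peo); how <code>source_target_interaction</code> builds the columns internally is not shown here, so treat the approach as an assumption.

```python
import pandas as pd

# Made-up events with abbreviated sector names.
events = pd.DataFrame({
    "Source": ["Gov", "Ins", "Gov"],
    "Target": ["Ins", "Gov", "Peo"],
})

# Build the "a_b" label and turn it into one 0/1 column per pair.
pair = events["Source"] + "_" + events["Target"]
dummies = pd.get_dummies(pair).astype(int)
coded = pd.concat([events, dummies], axis=1)
print(coded)
```

Each row has exactly one dummy set to 1, the one matching its source and target.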

To avoid having a country named in different ways (such as North Korea and Democratic People's Republic of Korea), we will use the ISO3 notation for all countries. 

In [None]:
iso3df = cw.iso3country("events.zip",2019)

In [None]:
iso3df.sample(5)

This way, although we are adding more variables to our model, we are also getting useful columns that will allow us to select events that occur between the sectors we choose. 

# 2. Reading and cleaning the Civil War data

In the previous step we cleaned and selected the data we are going to use as our independent variables. Now it's time to clean and work with our dependent variable. 

## 2.1. About the data

To determine whether a conflict is ongoing, historians set threshold values for different variables, which may or may not include the total number of deaths, the factions taking part or the duration of peace agreements. Civil war datasets may use different criteria and, therefore, the start and end dates may differ from dataset to dataset. Moreover, some datasets may split a conflict into several civil wars while others classify it as a single civil war. 

### 2.1.1. Political Instability Task Force (PITF)

We will use the State Failure Problem set from the Political Instability Task Force (PITF), found on the Center for Systemic Peace [website](http://www.systemicpeace.org/inscrdata.html). In particular we will consider the [Consolidated Cases](http://www.systemicpeace.org/inscr/PITF%20Consolidated%20Case%20List%202018.pdf) dataset. The PITF criteria can be found in its [Codebook](http://www.systemicpeace.org/inscr/PITFProbSetCodebook2018.pdf). This dataset contains consolidated cases of civil wars from 1955 to 2018. 

## 2.2. Cleaning the data

To extract the data from the PDF we used Smallpdf's [PDF to Excel Converter](https://smallpdf.com/pdf-to-excel). The conversion introduces errors that need to be fixed. We only want the information related to the month and year each conflict started and the month and year it finished. We defined a function that reads and fixes the data for us, <code>read_PITF()</code>:

In [None]:
PITF = cw.read_PITF("PITF Consolidated Case List 2018-converted.xlsx")

In [None]:
pd.set_option('display.max_rows', None)
PITF.sample(5)

# 3. Defining our model
Now that we have cleaned the ICEWS dataset we have to define our model. Our purpose is to group events by period of time, using counts or means. We have come up with three models, inspired by Robert A. Blair and Nicholas Sambanis (2020), "Forecasting Civil Wars: Theory and Structure in an Age of “Big Data” and Machine Learning". 

As for civil wars, we will use the PITF dataset we generated before to map them into three columns:
- "*CW_s*": 1 if a civil war started.
- "*CW_o*": 1 if a civil war is ongoing.
- "*CW_f*": 1 if a civil war ended. 
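
As an illustration of this mapping, the sketch below turns one conflict record into monthly flags. Everything in it is hypothetical: the country code "AAA", the dates, and the keys `start` and `end` are made up, and the real PITF columns may be named differently. Whether the start and end months also count as ongoing is a convention choice; here they are excluded so that the three flags never overlap, which is an assumption.

```python
import pandas as pd

# Hypothetical conflict record: one war from March to June 2001.
war = {"ISO3": "AAA", "start": "2001-03", "end": "2001-06"}

# Monthly panel for the same made-up country.
months = pd.period_range("2001-01", "2001-08", freq="M")
panel = pd.DataFrame({"ISO3": "AAA", "Year_Month": months})

start = pd.Period(war["start"], freq="M")
end = pd.Period(war["end"], freq="M")

# 1 in the month the war starts / ends, 0 elsewhere.
panel["CW_s"] = (panel["Year_Month"] == start).astype(int)
panel["CW_f"] = (panel["Year_Month"] == end).astype(int)
# 1 in the months strictly between start and end (assumed convention).
between = (panel["Year_Month"] > start) & (panel["Year_Month"] < end)
panel["CW_o"] = between.astype(int)
print(panel)
```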

After that, we will create three new columns ("*CW_s_plus1*", "*CW_f_plus1*", "*CW_o_plus1*") to store the value of the next month.

We will train our model with target variables "*CW_s_plus1*", "*CW_o_plus1*" and "*CW_f_plus1*" and we will leave out columns "*ISO3*" and "*Year_Month*" from the set of predictors. The purpose of this first step is to generalize the prediction of civil wars only considering events. 
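
A next-month target can be built by shifting each flag one row up within each country. The sketch below shows the idea for "*CW_s_plus1*" on made-up data; the country codes are placeholders, and filling the last month of each country with 0 (it has no next month) is an assumption, since those rows could equally be dropped.

```python
import pandas as pd

# Made-up panel: two placeholder countries, four months each.
panel = pd.DataFrame({
    "ISO3": ["AAA"] * 4 + ["BBB"] * 4,
    "Year_Month": list(pd.period_range("2001-01", periods=4, freq="M")) * 2,
    "CW_s": [0, 0, 1, 0, 0, 0, 0, 0],
})

# Sort so that "next row" means "next month" within a country.
panel = panel.sort_values(["ISO3", "Year_Month"]).reset_index(drop=True)

# shift(-1) inside each country avoids leaking values across countries.
next_month = panel.groupby("ISO3")["CW_s"].shift(-1)
panel["CW_s_plus1"] = next_month.fillna(0).astype(int)
print(panel)
```

Grouping by country before shifting matters: a plain `shift(-1)` would copy the first month of one country into the last month of the previous one.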

## 3.1. Interaction fraction
We will use the monthly fraction of interactions between each pair of sectors, that is, the share of a country's events in a month that took place between that pair. The purpose of this model is to see whether counting interactions is enough to determine the start, continuation and end of a civil war. The function that generates this dataset is <code>interaction_fraction</code>. 

In [2]:
interaction_fraction = cw.interaction_fraction("events.zip", "PITF Consolidated Case List 2018-converted.xlsx")
interaction_fraction.to_csv("interaction_fraction_model.csv")

Generating model...
Cleaning missing years...
Adding civil wars...
Done!
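
To make the fraction concrete, here is a small sketch using <code>pd.crosstab</code> with row normalization. The events, the placeholder country codes and the column name `Pair` are made up for illustration, and this is not necessarily how <code>cw.interaction_fraction</code> is implemented.

```python
import pandas as pd

# Made-up events, one row per event, already labelled with their pair.
events = pd.DataFrame({
    "ISO3": ["AAA"] * 4 + ["BBB"] * 2,
    "Year_Month": ["2001-01"] * 3 + ["2001-02"] + ["2001-01"] * 2,
    "Pair": ["Gov_Ins", "Gov_Ins", "Peo_Gov", "Gov_Opp", "Ins_Gov", "Ins_Gov"],
})

# Count events per (country, month) and pair, then divide each row by its total.
fractions = pd.crosstab(
    index=[events["ISO3"], events["Year_Month"]],
    columns=events["Pair"],
    normalize="index",
)
print(fractions.round(3))
```

In this toy table, two of the three AAA events in 2001-01 are Gov_Ins, so that cell is 2/3. Country-months without any event do not appear in the crosstab and would have to be added separately.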


Due to the time it takes to run this model, the data has been saved into "*interaction_fraction_model.csv*".

In [None]:
interaction_fraction = pd.read_csv("interaction_fraction_model.csv").drop("Unnamed: 0", axis=1)
interaction_fraction["Year_Month"] = pd.to_datetime(interaction_fraction["Year_Month"])

In [3]:
interaction_fraction.sample(10)

Unnamed: 0,ISO3,Year_Month,Gov_Ins,Gov_Opp,Gov_Peo,Ins_Gov,Ins_Opp,Ins_Peo,Opp_Gov,Opp_Ins,...,Peo_Gov,Peo_Ins,Peo_Opp,CW_s,CW_f,CW_o,CW_s_plus1,CW_f_plus1,CW_o_plus1,CW_plus1
13499,CPV,2015-12,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0
40601,MLT,2018-06,0.0,0.333333,0.333333,0.0,0.0,0.0,0.333333,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0
41107,MNE,2012-08,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0
13053,COM,2002-10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0
32685,KHM,2006-10,0.083333,0.383333,0.233333,0.0,0.0,0.0,0.216667,0.0,...,0.083333,0.0,0.0,0.0,0.0,0.0,0,0,0,0
55502,SLV,2012-03,0.333333,0.083333,0.083333,0.25,0.0,0.0,0.083333,0.0,...,0.166667,0.0,0.0,0.0,0.0,0.0,0,0,0,0
56078,SOM,2012-03,0.216561,0.006369,0.159236,0.261146,0.012739,0.16242,0.003185,0.003185,...,0.143312,0.031847,0.0,0.0,0.0,1.0,0,0,1,1
42365,MSR,1997-06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0
26358,HKG,2007-07,0.0,0.0,0.612903,0.0,0.0,0.0,0.064516,0.0,...,0.225806,0.0,0.032258,0.0,0.0,0.0,0,0,0,0
36383,LTU,2002-12,0.0,0.111111,0.422222,0.0,0.0,0.0,0.155556,0.0,...,0.288889,0.0,0.022222,0.0,0.0,0.0,0,0,0,0


In [None]:
interaction_fraction.head()

### 3.1.1. Principal component analysis
Performing principal component analysis will give us some insight into our variables and how relevant they are to our model. 

In [None]:
cols = ["Gov_Ins", "Gov_Opp", "Gov_Peo", "Ins_Gov", "Ins_Opp", "Ins_Peo", 
           "Opp_Gov", "Opp_Ins", "Opp_Peo", "Peo_Gov", "Peo_Ins", "Peo_Opp"]
if_data = interaction_fraction[cols]

# Standardize
ss_if = StandardScaler()
if_data = ss_if.fit_transform(if_data)

In [None]:
pca_if = PCA(2)
pca_if.fit(if_data)

In [None]:
pca_if.explained_variance_ratio_

With just the first two components we are able to explain about 25% of the variance in the data. With 12 predictors, if each one explained the same fraction of the variance, it would account for about 8% (1/12). This suggests that some predictors might be more useful than others to split the data. 

25% of explained variance in two components is not enough to see strong differences in the data. However, these two components are still useful for visualizing it. 
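
The comparison against the uniform 1/12 share can be reproduced on synthetic data. The sketch below does not use scikit-learn: it computes the explained variance ratios from the eigenvalues of the correlation matrix, which is equivalent to running PCA on standardized data. The data is random and only meant to illustrate the arithmetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_predictors = 500, 12

# Synthetic data: independent noise plus one shared factor on the first 4 columns.
X = rng.normal(size=(n_rows, n_predictors))
X[:, :4] += 2.0 * rng.normal(size=(n_rows, 1))

# Standardize, then take the eigenvalues of the correlation matrix
# (sorted from largest to smallest).
Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigenvalues = np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False))[::-1]
explained_ratio = eigenvalues / eigenvalues.sum()

# Share each component would explain if all were equally informative.
uniform_share = 1 / n_predictors
print(explained_ratio[:2], round(uniform_share, 3))
```

A first component well above `uniform_share` indicates that several predictors move together, which is the situation described for the real data.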

In [None]:
if_data_transformed = pca_if.transform(if_data)

In [None]:
f, ax = plt.subplots(ncols = 3, figsize=(20,5))
targets = [("CW_s_plus1", "Starting Civil War"),
           ("CW_f_plus1", "Finishing Civil War"),
           ("CW_o_plus1", "Ongoing Civil War")]
for axis, (target, title) in zip(ax, targets):
    # Plot the majority class first (faint), then the civil war cases on top.
    for value, alpha in [(0, 0.1), (1, 0.5)]:
        mask = (interaction_fraction[target] == value).to_numpy()
        # x and y are passed by keyword: recent seaborn versions reject them as positionals.
        sns.scatterplot(x = if_data_transformed[mask, 0],
                        y = if_data_transformed[mask, 1],
                        ax = axis, alpha = alpha)
    axis.set_title(title)

We see that the values are displayed in a fan pattern, with starting and finishing civil war cases located near the edges of the distribution, although not very clearly. Most starting civil wars are clustered close to (0, 0). For ongoing civil wars, the values are distributed evenly across the whole dataset.

To see which predictors affect each component the most we can extract their coefficients:

In [None]:
components_if = pd.DataFrame(abs(pca_if.components_)).transpose()
sorted_cols = components_if[0].sort_values(ascending = False).index
[(cols[i], components_if.iloc[i,0]) for i in sorted_cols]

In [None]:
sorted_cols = components_if[1].sort_values(ascending = False).index
[(cols[i], components_if.iloc[i,1]) for i in sorted_cols]

We see that the first component is heavily determined by the interactions between the Insurgents and the Government, and also by the interactions between People and Insurgents, while the second component is heavily determined by the interactions between the Opposition and the Government.

This could explain why civil wars sit at the edges of the fan distribution we saw above: civil wars where the Insurgents have a bigger impact lie along the first component, while civil wars where the Opposition has a bigger impact lie along the second component. 


### 3.1.2. Statistical significance
To see whether the differences in each column are statistically significant, we will perform a t-test comparing cases where a civil war starts, is ongoing or finishes with cases where none of these occur:

In [None]:
# To store results (one row per target, one column per predictor)
colst = ["CW_s_plus1", "CW_o_plus1", "CW_f_plus1"]
pvalues_if = pd.DataFrame(0.0, index=colst, columns=cols)

In [None]:
# Getting p-values
for col in cols:
    for target in colst:
        others = [c for c in colst if c != target]
        # Rows where the event of interest does NOT happen next month
        no_event = (interaction_fraction[target]==0)
        # Rows where neither of the other two events happens next month
        not_others = (interaction_fraction[others[0]]==0) & (interaction_fraction[others[1]]==0)
        a = interaction_fraction[no_event & not_others][col]
        b = interaction_fraction[~no_event & not_others][col]
        # Welch's t-test (does not assume equal variances)
        pvalue = sp.ttest_ind(a, b, equal_var=False).pvalue
        pvalues_if.loc[target, col] = pvalue

In [None]:
pvalues_if.round(4)

For cases where a civil war is starting, the differences between means are not statistically significant at the 5% level: all p-values are above 0.05, meaning that if there were no real difference, differences at least this large would still appear more than 5% of the time. This suggests a certain amount of spontaneity that our model fails to capture. 

For cases where a civil war is ongoing, most of the p-values are very close to zero, suggesting that, due to the nature of the conflict, some sectors interact with each other in a significantly different way. We have high p-values for the interaction between Government and People and between Opposition and People, suggesting that during the conflict People play a secondary role, and that it is more relevant to focus on the actions of the combatants. 

For cases where a civil war is finishing, some p-values are very close to 0, suggesting that the winding down of a conflict can be predicted more accurately than its start. However, most of the p-values are relatively high. 
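
To show what a single cell of the p-value table represents, here is a small sketch of Welch's t-test (`equal_var=False`) on synthetic data. The two groups are random numbers chosen to mimic a large "no war" group and a small, more dispersed "war" group; they are not taken from the ICEWS data.

```python
import numpy as np
import scipy.stats as sp

rng = np.random.default_rng(1)

# Synthetic fractions: many peaceful months, few war months with more spread.
no_war = rng.normal(loc=0.10, scale=0.05, size=5000)
war = rng.normal(loc=0.30, scale=0.15, size=40)

# Welch's t-test: compares the two means without assuming equal variances,
# which matters when the group sizes and spreads differ this much.
result = sp.ttest_ind(no_war, war, equal_var=False)
print(result.statistic, result.pvalue)
```

A p-value below 0.05 here means that a difference in means this large would be unlikely if both groups shared the same mean.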

## 3.2. Mean intensity interaction between each group
We will use the "Intensity" column for this model. Each of the interaction columns (Gov_Ins, Gov_Opp, ...) will hold the mean intensity in a country over a given period of time (the period can range from monthly to yearly). Negative values imply a negative action (for example, an armed fight between sectors) and positive values a positive action (for example, an agreement between sectors). The mean value shows the average positivity or negativity of the interaction between sectors during that period. This model is generated with the function <code>mean_intensity</code>. 

This model is meant to improve on the previous one by including not only the number of interactions but also their intensity. 

In [4]:
mean_intensity = cw.mean_intensity("events.zip", "PITF Consolidated Case List 2018-converted.xlsx")
mean_intensity.to_csv("mean_intensity_model.csv")

Generating model...
Cleaning missing years...
Adding civil wars...
Done!
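
The aggregation behind this model can be sketched with <code>pivot_table</code>. The rows, the placeholder country code and the column name `Pair` are made up, and filling pairs without events with 0 is an assumption made for illustration; the actual <code>cw.mean_intensity</code> function may work differently.

```python
import pandas as pd

# Made-up events with an intensity score each.
events = pd.DataFrame({
    "ISO3": ["AAA", "AAA", "AAA", "AAA"],
    "Year_Month": ["2001-01", "2001-01", "2001-01", "2001-02"],
    "Pair": ["Gov_Ins", "Gov_Ins", "Peo_Gov", "Gov_Ins"],
    "Intensity": [-10.0, -4.0, 3.0, 1.0],
})

# Mean intensity per (country, month) and pair; pairs with no events get 0.
mean_by_pair = events.pivot_table(
    index=["ISO3", "Year_Month"],
    columns="Pair",
    values="Intensity",
    aggfunc="mean",
    fill_value=0.0,
)
print(mean_by_pair)
```

Note that a 0 produced by `fill_value` (no events) is indistinguishable from a true mean of 0 (events that cancel out), which is worth keeping in mind when reading the table.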


The function takes a long time to run, so the result is already stored in a file called <code>"mean_intensity_model.csv"</code> and can be read directly into the notebook:

In [None]:
mean_intensity = pd.read_csv("mean_intensity_model.csv").drop("Unnamed: 0", axis=1)
mean_intensity["Year_Month"] = pd.to_datetime(mean_intensity["Year_Month"])

In [6]:
mean_intensity.sample(10)

Unnamed: 0,ISO3,Year_Month,Gov_Ins,Gov_Opp,Gov_Peo,Ins_Gov,Ins_Opp,Ins_Peo,Opp_Gov,Opp_Ins,...,Peo_Gov,Peo_Ins,Peo_Opp,CW_s,CW_f,CW_o,CW_s_plus1,CW_f_plus1,CW_o_plus1,CW_plus1
59396,SYR,2000-09,0.0,0.0,-0.5,0.0,0.0,0.0,0.375,0.0,...,-0.2,0.0,0.0,0.0,0.0,0.0,0,0,0,0
31997,KEN,1997-06,-0.238095,-0.416667,-1.964286,-0.214286,0.0,-0.214286,-0.25,0.0,...,-1.507143,0.0,0.0,0.0,0.0,0.0,0,0,0,0
38241,MCO,2013-10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0
29821,ISL,2008-02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0
64758,URY,2015-07,0.0,-0.181818,-1.545455,0.0,0.0,0.0,0.0,0.0,...,0.854545,0.0,0.0,0.0,0.0,0.0,0,0,0,0
57934,SVN,1998-11,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.333333,0.0,0.0,0.0,0.0,0.0,0,0,0,0
47075,NRU,2005-12,0.0,0.0,-5.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0
53010,SDN,1996-07,0.485714,0.0,-1.428571,0.0,0.0,0.0,-0.285714,0.0,...,-0.285714,0.0,0.0,0.0,0.0,1.0,0,0,1,1
58282,SWE,2003-11,0.0,0.0,-1.166667,0.0,0.0,0.0,0.0,0.0,...,-0.333333,0.0,0.0,0.0,0.0,0.0,0,0,0,0
12199,COG,2003-08,1.166667,-0.333333,-2.333333,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0


In [None]:
mean_intensity.describe()

### 3.2.1. Principal component analysis
We still have 12 predictors, however they contain different information in them. As before, on average, each predictor should explain about 8% of the variance in the data, let's see if this is the case or if some components are above that 8%. 

In [None]:
cols = ["Gov_Ins", "Gov_Opp", "Gov_Peo", "Ins_Gov", "Ins_Opp", "Ins_Peo", 
           "Opp_Gov", "Opp_Ins", "Opp_Peo", "Peo_Gov", "Peo_Ins", "Peo_Opp"]
mi_data = mean_intensity[cols]

# Standardize
ss_mi = StandardScaler()
mi_data = ss_mi.fit_transform(mi_data)

In [None]:
pca_mi = PCA(2)
pca_mi.fit(mi_data)

In [None]:
pca_mi.explained_variance_ratio_

With two components we can explain about 24% of the variation in the data. These two components are above that 8%, but slightly below the explained variance we saw in the previous model. Let's visualize them:

In [None]:
mi_data_transformed = pca_mi.transform(mi_data)

In [None]:
f, ax = plt.subplots(ncols = 3, figsize=(20,5))
targets = [("CW_s_plus1", "Starting Civil War"),
           ("CW_f_plus1", "Finishing Civil War"),
           ("CW_o_plus1", "Ongoing Civil War")]
for axis, (target, title) in zip(ax, targets):
    # Plot the majority class first (faint), then the civil war cases on top.
    for value, alpha in [(0, 0.1), (1, 0.5)]:
        mask = (mean_intensity[target] == value).to_numpy()
        # x and y are passed by keyword: recent seaborn versions reject them as positionals.
        sns.scatterplot(x = mi_data_transformed[mask, 0],
                        y = mi_data_transformed[mask, 1],
                        ax = axis, alpha = alpha)
    axis.set_title(title)

We see two main axes this time. Starting and finishing civil wars appear to lie along those axes, with most of the values located near (0, 0). Finishing civil wars, in particular, tend to lie along the quasi-horizontal axis. For ongoing civil wars, values are more spread out within the distribution. 

To better explain those components, let's see which predictors have more importance in each component:

In [None]:
components_mi = pd.DataFrame(abs(pca_mi.components_)).transpose()
sorted_cols = components_mi[0].sort_values(ascending = False).index
[(cols[i], components_mi.iloc[i,0]) for i in sorted_cols]

In [None]:
sorted_cols = components_mi[1].sort_values(ascending = False).index
[(cols[i], components_mi.iloc[i,1]) for i in sorted_cols]

We see very similar results to the ones we obtained in the previous model. The interactions between the Insurgents and the Government, and also the interactions between People and Insurgents, heavily influence the first component, with the Opposition becoming more relevant in the second component. 

The two quasi-axes of the scatterplot along which starting and finishing civil wars lie separate civil wars where the Insurgents are one of the combatants from those where the Opposition is one of the combatants. For finishing civil wars, the most populated quasi-axis is the quasi-horizontal one, suggesting that the interaction between Government and Insurgents is more relevant for predicting the end of a civil war. 

### 3.2.2. Statistical significance
We are going to check whether the mean of each column differs significantly between cases where a civil war starts, finishes or is ongoing and cases where none of these occur. Again, we will use a t-test to obtain the p-values:

In [None]:
# To store results (one row per target, one column per predictor)
pvalues_mi = pd.DataFrame(0.0, index=colst, columns=cols)

In [None]:
# Simple comparison: target == 0 vs. target == 1, ignoring the other two targets.
# Note: these values are overwritten by the filtered comparison in the next cell.
for col in cols:
    for target in colst:
        a = mean_intensity[mean_intensity[target]==0][col]
        b = mean_intensity[mean_intensity[target]==1][col]
        pvalue = sp.ttest_ind(a, b, equal_var=False).pvalue
        pvalues_mi.loc[target, col] = pvalue

In [None]:
# Getting p-values
for col in cols:
    for target in colst:
        others = [c for c in colst if c != target]
        # Rows where the event of interest does NOT happen next month
        no_event = (mean_intensity[target]==0)
        # Rows where neither of the other two events happens next month
        not_others = (mean_intensity[others[0]]==0) & (mean_intensity[others[1]]==0)
        a = mean_intensity[no_event & not_others][col]
        b = mean_intensity[~no_event & not_others][col]
        # Welch's t-test (does not assume equal variances)
        pvalue = sp.ttest_ind(a, b, equal_var=False).pvalue
        pvalues_mi.loc[target, col] = pvalue

In [None]:
pvalues_mi.round(4)

For starting civil wars, we see high p-values as with the previous model. This again suggests the spontaneity of the event. However, there are two p-values very close to 0: *Peo_Ins* and *Opp_Ins*, suggesting that the interaction between people and opposition with insurgents can be relevant when analysing the possibility of a starting civil war. 

For ongoing civil wars, most of the p-values are close to 0, suggesting that the nature of the conflict generates a very negative mean intensity, which differentiates these cases from those where no civil war is taking place. 

For finishing civil wars the p-values increase, and only a couple of them suggest a statistically significant difference: *Ins_Gov* and *Ins_Peo*.

## 3.3. CAMEO counts 
One column that we left unused in our last model is the CAMEO (Conflict and Mediation Event Observations) code. There are 20 different root codes:

01: Make public statement, 02: Appeal, 03: Express intent to cooperate, 04: Consult, 05: Engage in diplomatic cooperation, 06: Engage in material cooperation, 07: Provide aid, 08: Yield, 09: Investigate, 10: Demand, 11: Disapprove, 12: Reject, 13: Threaten, 14: Protest, 15: Exhibit force posture, 16: Reduce relations, 17: Coerce, 18: Assault, 19: Fight and 20: Use unconventional mass violence. 

Each of these codes has its own sublevels (0211: Appeal for economic cooperation, 1011: Demand economic cooperation). We will focus only on the main ones (listed in the paragraph above), so this model uses only the first two digits of the code.
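
Reducing a code to its two-digit root can be sketched as below. The sketch assumes the codes are stored as strings with their leading zeros preserved; if they were read as integers, the leading zero would be lost and would have to be restored first, which is not handled here. The sample codes are illustrative.

```python
import pandas as pd

# Illustrative CAMEO codes stored as strings (leading zeros preserved).
codes = pd.Series(["0211", "1011", "010", "14", "1823"])

# The root code is simply the first two characters.
root = codes.str[:2]
print(root.tolist())
```

With string codes, "0211" maps to root "02" (Appeal) and "1011" to root "10" (Demand).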

In [7]:
cameo_fraction = cw.cameo_fraction("events.zip", "PITF Consolidated Case List 2018-converted.xlsx")
cameo_fraction.to_csv("cameo_fraction_model.csv")

Generating model...
Cleaning missing years...
Adding civil wars...
Done!


We have saved the resulting dataframe as "*cameo_fraction_model.csv*" to avoid the long time it takes to generate it:

In [None]:
cameo_fraction = pd.read_csv("cameo_fraction_model.csv").drop("Unnamed: 0", axis=1)
cameo_fraction["Year_Month"] = pd.to_datetime(cameo_fraction["Year_Month"])

In [9]:
cameo_fraction.sample(10)

Unnamed: 0,ISO3,Year_Month,Gov_Ins_01,Gov_Ins_02,Gov_Ins_03,Gov_Ins_04,Gov_Ins_05,Gov_Ins_06,Gov_Ins_07,Gov_Ins_08,...,Peo_Opp_18,Peo_Opp_19,Peo_Opp_20,CW_s,CW_f,CW_o,CW_s_plus1,CW_f_plus1,CW_o_plus1,CW_plus1
57494,SUR,2010-03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0
68439,ZAF,2010-04,0.021622,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0
37597,MAF,2008-02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0
36257,LSO,2016-06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0
14878,CYM,2010-11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0
66473,VIR,2014-06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0
54090,SGS,2014-07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0
60801,TJK,1997-10,0.233333,0.0,0.0,0.0,0.0,0.0,0.0,0.2,...,0.0,0.0,0.0,0.0,0.0,1.0,0,0,1,1
53134,SDN,2006-11,0.020202,0.0,0.015152,0.005051,0.070707,0.015152,0.0,0.005051,...,0.0,0.0,0.0,0.0,0.0,1.0,0,0,1,1
65047,UZB,2015-08,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0


### 3.3.1. Principal component analysis
As we did before, this will give us some insight into our model. This time we have 240 predictors (20 CAMEO codes for each of the 12 types of interaction), so each predictor should on average explain about 0.4% of the variance. 
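
The predictor count can be checked by enumerating the column names. The sketch below assumes the naming pattern `Source_Target_code` seen in the sample output of this dataset (for example `Gov_Ins_01`); it only builds the names and does not touch the data.

```python
from itertools import permutations

sectors = ["Gov", "Ins", "Opp", "Peo"]

# Ordered pairs of different sectors: 4 * 3 = 12 interaction types.
pairs = [f"{src}_{tgt}" for src, tgt in permutations(sectors, 2)]

# Two-digit CAMEO root codes, "01" to "20".
root_codes = [f"{code:02d}" for code in range(1, 21)]

# One column per (pair, root code) combination.
columns = [f"{pair}_{code}" for pair in pairs for code in root_codes]
print(len(pairs), len(columns), round(100 / len(columns), 2))
```

With 240 columns, an equal share of the variance is 100 / 240, roughly 0.42% per predictor.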

In [None]:
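# Predictors only: skip ISO3 and Year_Month at the front and the
# 7 civil war columns (CW_s ... CW_plus1) at the end.
cf_data = cameo_fraction.iloc[:,2:-7]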

# Standardize
ss_cf = StandardScaler()
cf_data = ss_cf.fit_transform(cf_data)

In [None]:
pca_cf = PCA(2)
pca_cf.fit(cf_data)

In [None]:
pca_cf.explained_variance_ratio_

We see that the first component describes 1% of the variance in the model, while the second component describes about 0.9%. This is interesting because, combining both components, we only get about 2% of the total variance, which is very little compared to the previous models, where we explained around 25% of the variance with just the first two components. However, we have to consider that we have 240 predictors and that each of the first two components explains more than twice what a single predictor would on average. 

In [None]:
cf_data_transformed = pca_cf.transform(cf_data)

In [None]:
f, ax = plt.subplots(ncols = 3, figsize=(20,5))
sns.scatterplot(x=cf_data_transformed[cameo_fraction["CW_s_plus1"]==0,0],
                y=cf_data_transformed[cameo_fraction["CW_s_plus1"]==0,1],
                ax = ax[0], alpha = 0.1)
sns.scatterplot(x=cf_data_transformed[cameo_fraction["CW_s_plus1"]==1,0],
                y=cf_data_transformed[cameo_fraction["CW_s_plus1"]==1,1],
                ax = ax[0], alpha = 0.5)
ax[0].set_title("Starting civil war")
sns.scatterplot(x=cf_data_transformed[cameo_fraction["CW_f_plus1"]==0,0],
                y=cf_data_transformed[cameo_fraction["CW_f_plus1"]==0,1],
                ax = ax[1], alpha = 0.1)
sns.scatterplot(x=cf_data_transformed[cameo_fraction["CW_f_plus1"]==1,0],
                y=cf_data_transformed[cameo_fraction["CW_f_plus1"]==1,1],
                ax = ax[1], alpha = 0.5)
ax[1].set_title("Finishing civil war")
sns.scatterplot(x=cf_data_transformed[cameo_fraction["CW_o_plus1"]==0,0],
                y=cf_data_transformed[cameo_fraction["CW_o_plus1"]==0,1],
                ax = ax[2], alpha = 0.1)
sns.scatterplot(x=cf_data_transformed[cameo_fraction["CW_o_plus1"]==1,0],
                y=cf_data_transformed[cameo_fraction["CW_o_plus1"]==1,1],
                ax = ax[2], alpha = 0.5)
ax[2].set_title("Ongoing civil war")

Again, we see the fan-like shape of the distribution, but this time starting and finishing civil wars do not fall along any particular axis, although some variation is observed along the first principal component. Again, most of the values for starting and finishing civil wars tend to be near (0, 0), while ongoing civil wars are spread across the distribution. 

To better understand these components, we will analyze the coefficients of the transformation for each variable:

In [None]:
components_cf = pd.DataFrame(abs(pca_cf.components_)).transpose()
cols_cam = cameo_fraction.columns[2:-6]
sorted_cols = components_cf[0].sort_values(ascending = False).index
[(cols_cam[i], components_cf.iloc[i,0]) for i in sorted_cols][:10]

In [None]:
sorted_cols = components_cf[1].sort_values(ascending = False).index
[(cols_cam[i], components_cf.iloc[i,1]) for i in sorted_cols][:10]

We see the same division between Government-Insurgents and Government-Opposition interactions that we saw in previous models. The first component relies on Government-Insurgents interactions, while the second component relies on Government-Opposition interactions. 

As for the CAMEO codes, the first component has higher values for Government-Insurgents Consults (04), followed by 18 and 19, which correspond to Assault and Fight. This is surprising because we would expect Assault (18) and Fight (19) to have more relevance than Consults (04). However, Consults (04) comprises visits, meetings, mediations and negotiations, so maybe it is the lack of them that makes this component so relevant. Still, with the first two components only explaining about 2% of the total variance, this is not a rigorous explanation. The second component also relies on Consults (04). 

### 3.3.2. Statistical significance
This model is a little more difficult to analyse, since we have 240 predictors, but we can still check which predictors are the most statistically significant when predicting civil wars. As we did before, we will use t-tests to evaluate the p-values of observing differences between cases where a civil war starts, finishes or is ongoing and cases where it does not. 

In [None]:
# Dataframes to fill: one row per target, one column per CAMEO fraction predictor
pvalues_cf = pd.DataFrame(0.0, index=colst, columns=cols_cam)
increase_decrease_cf = pd.DataFrame(0.0, index=colst, columns=cols_cam)

In [None]:
# Filling the dataframes
for i in range(0, len(cols_cam)):
    for j in range(0, len(colst)):
        filter_col = [col for col in colst if col != colst[j]]
        # Rows where the target event does not occur
        no_war = (cameo_fraction[colst[j]]==0)
        # Rows where neither of the other two targets occurs
        not_others = (cameo_fraction[filter_col[0]]==0) & (cameo_fraction[filter_col[1]]==0)
        a = cameo_fraction[no_war & not_others][cols_cam[i]]
        b = cameo_fraction[~no_war & not_others][cols_cam[i]]
        # Welch's t-test (does not assume equal variances)
        pvalue = sp.ttest_ind(a, b, equal_var=False).pvalue
        pvalues_cf.loc[colst[j],cols_cam[i]] = pvalue
        # Positive values mean the predictor is higher when the event occurs
        increase_decrease_cf.loc[colst[j],cols_cam[i]] = b.mean()-a.mean()
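
As a small illustration of the test used here, the sketch below runs a Welch t-test (`equal_var=False`) on synthetic values for one hypothetical predictor. The group sizes, means and spreads are made up; they only mimic the situation of many peaceful months and few war months.

```python
import numpy as np
import scipy.stats as sp

rng = np.random.default_rng(42)

# Hypothetical values of a single predictor (not taken from the ICEWS data):
# a large group without civil war and a small group with civil war
no_war = rng.normal(loc=0.10, scale=0.05, size=2000)
war = rng.normal(loc=0.20, scale=0.10, size=60)

# Welch's t-test: does not assume both groups share the same variance
result = sp.ttest_ind(no_war, war, equal_var=False)

# Same sign convention as in the loop: value with war minus value without
difference = war.mean() - no_war.mean()

print("p-value:", result.pvalue)
print("difference in means (war - no war):", difference)
```

Welch's variant is a reasonable choice when group sizes and variances differ strongly, as they do when one class is rare.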

Now we can see the 5 most relevant predictors: those where the difference is statistically significant (p-value below 0.01) and the difference in means is largest in absolute value. For starting civil wars:

In [None]:
order_s = increase_decrease_cf[pvalues_cf<0.01].loc["CW_s_plus1"].abs().sort_values(ascending=False).index
increase_decrease_cf[order_s[:5]].loc["CW_s_plus1"]

We see that at the start of a civil war, interaction with the People sector is very relevant. We have CAMEO codes 16 (Reduce relations), 17 (Coerce), 09 (Investigate), 20 (Use unconventional mass violence) and 03 (Express intent to cooperate). All those values are negative, meaning that these kinds of interactions decrease at the beginning of a civil war: People increase relations with the Opposition and the Government is less cooperative with the Opposition.

In [None]:
order_o = increase_decrease_cf[pvalues_cf<0.01].loc["CW_o_plus1"].abs().sort_values(ascending=False).index
increase_decrease_cf[order_s[:5]].loc["CW_o_plus1"]

During the course of a civil war, we see the same predictors we saw during the start of a civil war. This time relations between People and Opposition also decrease (more 16 (Reduce relations) and more 17 (Coerce)).

In [None]:
order_f = increase_decrease_cf[pvalues_cf<0.01].loc["CW_f_plus1"].abs().sort_values(ascending=False).index
increase_decrease_cf[order_f[:5]].loc["CW_f_plus1"]

At the end of a civil war, we have 18 (Assault), 19 (Fight), 01 (Make public statement) and 16 (Reduce relations). We see that fights and assaults between Insurgents and Government increase. Relations between People and Opposition improve (less 16 (Reduce relations)).

Relations between People and Opposition (*Peo_Opp_16*) appeared for all three of our targets, suggesting that this is a relevant interaction when it comes to predicting conflict. 

## 3.4. Mean model
This model simply combines the three previous models and predicts using either the mean or a weighted mean of their predictions. 
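
A minimal sketch of how such a combination could look. The probabilities and the weights are hypothetical values chosen for illustration; in practice the weights might come from each model's cross-validated AUC, but that is an assumption rather than something fixed in this notebook.

```python
import numpy as np

# Hypothetical predicted probabilities from the three models
# for four country-months (illustrative values only)
p_interaction = np.array([0.10, 0.80, 0.30, 0.05])
p_intensity = np.array([0.20, 0.70, 0.50, 0.10])
p_cameo = np.array([0.15, 0.90, 0.40, 0.00])
predictions = np.vstack([p_interaction, p_intensity, p_cameo])

# Simple mean of the three predictions
mean_prediction = predictions.mean(axis=0)

# Weighted mean: one hypothetical weight per model
weights = np.array([0.6, 0.8, 0.7])
weighted_prediction = np.average(predictions, axis=0, weights=weights)

print("mean:    ", mean_prediction)
print("weighted:", weighted_prediction)
```

`np.average` normalizes the weights internally, so they do not need to sum to one.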

# 4. Training our models
We will use logistic regression to train all our models. 

## 4.1. Interaction fraction

In [None]:
# Getting sets
date = "2007-12"  # same YYYY-MM format as Year_Month, so the comparison is chronological
X_if_train = interaction_fraction[interaction_fraction["Year_Month"]<=date].iloc[:,2:-6]
y_if_train = interaction_fraction[interaction_fraction["Year_Month"]<=date].iloc[:,-3:]

X_if_test = interaction_fraction[interaction_fraction["Year_Month"]>date].iloc[:,2:-6]
y_if_test = interaction_fraction[interaction_fraction["Year_Month"]>date].iloc[:,-3:]
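
The time-based split can be illustrated on a toy frame. This sketch assumes `Year_Month` holds values formatted as `YYYY-MM`; the `ISO3` and `x` column names and all values are hypothetical. The months are parsed into pandas periods so the comparison is chronological rather than alphabetical.

```python
import pandas as pd

# Toy frame with hypothetical column names and values
toy = pd.DataFrame({
    "ISO3": ["TJK", "SDN", "ZAF", "UZB"],
    "Year_Month": ["1997-10", "2006-11", "2010-04", "2015-08"],
    "x": [0.23, 0.02, 0.02, 0.00],
})

# Parse the month labels into monthly periods
months = pd.PeriodIndex(toy["Year_Month"], freq="M")
cutoff = pd.Period("2007-12", freq="M")

# Everything up to the cutoff is used for training, the rest for testing
train = toy[months <= cutoff]
test = toy[months > cutoff]

print(train["ISO3"].tolist())
print(test["ISO3"].tolist())
```

Splitting by date rather than at random keeps later months out of the training set, which matters when the goal is forecasting.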

In [None]:
# Starting 
model_if_s = LogisticRegression(random_state=1492)

# Finished
model_if_f = LogisticRegression(random_state=1492)

# Ongoing
model_if_o = LogisticRegression(random_state=1492)

Since situations where civil wars occurred are rare compared to situations where they did not, accuracy is not a valid error measure. Instead, we will use the Area Under the ROC Curve (AUC), which summarizes the trade-off between the true positive rate and the false positive rate across all classification thresholds, and therefore accounts for True and False Positives (TP, FP) and True and False Negatives (TN, FN) at the same time. 
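
The sketch below illustrates the point on synthetic labels. The roughly 2% positive rate and the score distribution are assumptions made for illustration. A model that always predicts "no civil war" gets a high accuracy but an AUC of 0.5, while mildly informative scores are rewarded by the AUC.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical labels: roughly 2% of the months are positives
y_true = (rng.random(5000) < 0.02).astype(int)

# A "model" that always predicts no civil war
always_zero = np.zeros_like(y_true)
acc = accuracy_score(y_true, always_zero)
auc_constant = roc_auc_score(y_true, np.zeros(len(y_true)))

# Scores that are mildly informative about the label
scores = rng.normal(size=len(y_true)) + 1.5 * y_true
auc_informative = roc_auc_score(y_true, scores)

print("accuracy of the constant model:", acc)
print("AUC of the constant model:", auc_constant)
print("AUC of the informative scores:", auc_informative)
```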

In [None]:
cv_if_s = cross_val_score(model_if_s, X_if_train, y_if_train.iloc[:,0], cv=10, scoring = 'roc_auc')
cv_if_f = cross_val_score(model_if_f, X_if_train, y_if_train.iloc[:,1], cv=10, scoring = 'roc_auc')
cv_if_o = cross_val_score(model_if_o, X_if_train, y_if_train.iloc[:,2], cv=10, scoring = 'roc_auc')

In [None]:
print('Starting: [',np.quantile(cv_if_s,0.025),',',np.quantile(cv_if_s,0.975),']')
print('\t mean:',np.mean(cv_if_s),'+-',np.std(cv_if_s),'\n')
print('Finishing: [',np.quantile(cv_if_f,0.025),',',np.quantile(cv_if_f,0.975),']')
print('\t mean:',np.mean(cv_if_f),'+-',np.std(cv_if_f),'\n')
print('Ongoing: [',np.quantile(cv_if_o,0.025),',',np.quantile(cv_if_o,0.975),']')
print('\t mean:',np.mean(cv_if_o),'+-',np.std(cv_if_o),'\n')

Above we see the range between the 2.5% and 97.5% quantiles of the AUCs obtained using cross-validation, as well as the mean AUC followed by its standard deviation across folds. We see that this model does not work very well at forecasting the start of a civil war. However, it performs relatively well at predicting the ending of a civil war and very well at predicting whether a conflict is ongoing. 

To get a deeper insight into the model, it is good to take a look at the coefficients, as they might also help us explain how the prediction is occurring:
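
For reference, here is a short sketch of the summary statistics that can be computed from a set of fold scores. The ten AUC values are hypothetical. It separates the standard deviation of the folds from the standard error of the mean, and uses the sample standard deviation (`ddof=1`), whereas `np.std` defaults to `ddof=0`.

```python
import numpy as np

# Hypothetical AUCs from a 10-fold cross-validation
cv_scores = np.array([0.71, 0.68, 0.75, 0.80, 0.66, 0.73, 0.77, 0.70, 0.74, 0.69])

mean_auc = cv_scores.mean()
std_auc = cv_scores.std(ddof=1)                      # spread of the fold scores
sem_auc = std_auc / np.sqrt(len(cv_scores))          # standard error of the mean
low, high = np.quantile(cv_scores, [0.025, 0.975])   # empirical quantile range

print(f"mean: {mean_auc:.3f}  std: {std_auc:.3f}  sem: {sem_auc:.3f}")
print(f"2.5%-97.5% quantile range: [{low:.3f}, {high:.3f}]")
```

With only ten folds the quantile range is close to the minimum and maximum fold score, so it is best read as a rough indication of spread.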

In [None]:
coefs_df_if_s = pd.DataFrame(index=cols)
model_if_s.fit(X_if_train,y_if_train["CW_s_plus1"])
coefs_if_s = model_if_s.coef_
coefs_df_if_s["Coefs"] = coefs_if_s[0]
coefs_df_if_s.loc[coefs_df_if_s.abs().sort_values("Coefs", ascending=False).index,:].head()

So for starting a civil war, the actions from Insurgents to Government are the most relevant, followed by the interaction between Government and Opposition. This matches what we observed when we performed the PCA, where the first two principal components relied very strongly on Insurgents and Opposition interacting with the Government. 

For finishing civil wars:

In [None]:
coefs_df_if_f = pd.DataFrame(index=cols)
model_if_f.fit(X_if_train,y_if_train["CW_f_plus1"])
coefs_if_f = model_if_f.coef_
coefs_df_if_f["Coefs"] = coefs_if_f[0]
coefs_df_if_f.loc[coefs_df_if_f.abs().sort_values("Coefs", ascending=False).index,:].head()

For finishing civil wars, we see that the importance of actions from Government to Insurgents becomes greater than it was for starting civil wars, suggesting that the response of the Government increases. Surprisingly, the Opposition does not play a big part in predicting the end of a civil war. 

For predicting whether a civil war is ongoing:

In [None]:
coefs_df_if_o = pd.DataFrame(index=cols)
model_if_o.fit(X_if_train,y_if_train["CW_o_plus1"])
coefs_if_o = model_if_o.coef_
coefs_df_if_o["Coefs"] = coefs_if_o[0]
coefs_df_if_o.loc[coefs_df_if_o.abs().sort_values("Coefs", ascending=False).index,:].head()

We see very strong coefficients, and Insurgents take part in all of them. Actions between Insurgents and both Government and People are the most relevant factors. 

Finally, we can see how our model performs on the test set:

In [None]:
# Starting
y_if_prd_s = model_if_s.predict_proba(X_if_test)[:, 1]  # probability of the positive class
print("Starting:", roc_auc_score(y_if_test["CW_s_plus1"], y_if_prd_s))

# Finishing
y_if_prd_f = model_if_f.predict_proba(X_if_test)[:, 1]
print("Finishing:", roc_auc_score(y_if_test["CW_f_plus1"], y_if_prd_f))

# Ongoing
y_if_prd_o = model_if_o.predict_proba(X_if_test)[:, 1]
print("Ongoing:", roc_auc_score(y_if_test["CW_o_plus1"], y_if_prd_o))

We see overfitting: the AUCs on the test set fall below those obtained with cross-validation on the training set. 

## 4.2. Mean intensity

In [None]:
# Getting sets
date = "2007-12"  # same YYYY-MM format as Year_Month, so the comparison is chronological
X_mi_train = mean_intensity[mean_intensity["Year_Month"]<=date].iloc[:,2:-6]
y_mi_train = mean_intensity[mean_intensity["Year_Month"]<=date].iloc[:,-3:]

X_mi_test = mean_intensity[mean_intensity["Year_Month"]>date].iloc[:,2:-6]
y_mi_test = mean_intensity[mean_intensity["Year_Month"]>date].iloc[:,-3:]

In [None]:
# Starting 
model_mi_s = LogisticRegression(random_state=1492)

# Finished
model_mi_f = LogisticRegression(random_state=1492)

# Ongoing
model_mi_o = LogisticRegression(random_state=1492)

In [None]:
cv_mi_s = cross_val_score(model_mi_s, X_mi_train, y_mi_train.iloc[:,0], cv=10, scoring = 'roc_auc')
cv_mi_f = cross_val_score(model_mi_f, X_mi_train, y_mi_train.iloc[:,1], cv=10, scoring = 'roc_auc')
cv_mi_o = cross_val_score(model_mi_o, X_mi_train, y_mi_train.iloc[:,2], cv=10, scoring = 'roc_auc')

In [None]:
print('Starting: [',np.quantile(cv_mi_s,0.025),',',np.quantile(cv_mi_s,0.975),']')
print('\t mean:',np.mean(cv_mi_s),'+-',np.std(cv_mi_s),'\n')
print('Finishing: [',np.quantile(cv_mi_f,0.025),',',np.quantile(cv_mi_f,0.975),']')
print('\t mean:',np.mean(cv_mi_f),'+-',np.std(cv_mi_f),'\n')
print('Ongoing: [',np.quantile(cv_mi_o,0.025),',',np.quantile(cv_mi_o,0.975),']')
print('\t mean:',np.mean(cv_mi_o),'+-',np.std(cv_mi_o),'\n')

We are getting better results than we did with the previous model.

In [None]:
coefs_df_mi_s = pd.DataFrame(index=cols)
model_mi_s.fit(X_mi_train,y_mi_train["CW_s_plus1"])
coefs_mi_s = model_mi_s.coef_
coefs_df_mi_s["Coefs"] = coefs_mi_s[0]
coefs_df_mi_s.loc[coefs_df_mi_s.abs().sort_values("Coefs", ascending=False).index,:].head()

In [None]:
coefs_df_mi_f = pd.DataFrame(index=cols)
model_mi_f.fit(X_mi_train,y_mi_train["CW_f_plus1"])
coefs_mi_f = model_mi_f.coef_
coefs_df_mi_f["Coefs"] = coefs_mi_f[0]
coefs_df_mi_f.loc[coefs_df_mi_f.abs().sort_values("Coefs", ascending=False).index,:].head()

In [None]:
coefs_df_mi_o = pd.DataFrame(index=cols)
model_mi_o.fit(X_mi_train,y_mi_train["CW_o_plus1"])
coefs_mi_o = model_mi_o.coef_
coefs_df_mi_o["Coefs"] = coefs_mi_o[0]
coefs_df_mi_o.loc[coefs_df_mi_o.abs().sort_values("Coefs", ascending=False).index,:].head()

In [None]:
# Starting
y_mi_prd_s = model_mi_s.predict_proba(X_mi_test)[:, 1]  # probability of the positive class
print("Starting:", roc_auc_score(y_mi_test["CW_s_plus1"], y_mi_prd_s))

# Finishing
y_mi_prd_f = model_mi_f.predict_proba(X_mi_test)[:, 1]
print("Finishing:", roc_auc_score(y_mi_test["CW_f_plus1"], y_mi_prd_f))

# Ongoing
y_mi_prd_o = model_mi_o.predict_proba(X_mi_test)[:, 1]
print("Ongoing:", roc_auc_score(y_mi_test["CW_o_plus1"], y_mi_prd_o))

## 4.3. Cameo fraction

In [None]:
# Getting sets
date = "2007-12"  # same YYYY-MM format as Year_Month, so the comparison is chronological
X_cf_train = cameo_fraction[cameo_fraction["Year_Month"]<=date].iloc[:,2:-6]
y_cf_train = cameo_fraction[cameo_fraction["Year_Month"]<=date].iloc[:,-3:]

X_cf_test = cameo_fraction[cameo_fraction["Year_Month"]>date].iloc[:,2:-6]
y_cf_test = cameo_fraction[cameo_fraction["Year_Month"]>date].iloc[:,-3:]

In [None]:
# Starting 
model_cf_s = LogisticRegression(random_state=1492)

# Finished
model_cf_f = LogisticRegression(random_state=1492)

# Ongoing
model_cf_o = LogisticRegression(random_state=1492)

In [None]:
cv_cf_s = cross_val_score(model_cf_s, X_cf_train, y_cf_train.iloc[:,0], cv=10, scoring = 'roc_auc')
cv_cf_f = cross_val_score(model_cf_f, X_cf_train, y_cf_train.iloc[:,1], cv=10, scoring = 'roc_auc')
cv_cf_o = cross_val_score(model_cf_o, X_cf_train, y_cf_train.iloc[:,2], cv=10, scoring = 'roc_auc')

In [None]:
print('Starting: [',np.quantile(cv_cf_s,0.025),',',np.quantile(cv_cf_s,0.975),']')
print('\t mean:',np.mean(cv_cf_s),'+-',np.std(cv_cf_s),'\n')
print('Finishing: [',np.quantile(cv_cf_f,0.025),',',np.quantile(cv_cf_f,0.975),']')
print('\t mean:',np.mean(cv_cf_f),'+-',np.std(cv_cf_f),'\n')
print('Ongoing: [',np.quantile(cv_cf_o,0.025),',',np.quantile(cv_cf_o,0.975),']')
print('\t mean:',np.mean(cv_cf_o),'+-',np.std(cv_cf_o),'\n')

In [None]:
coefs_df_cf_s = pd.DataFrame(index=cols_cam)
model_cf_s.fit(X_cf_train,y_cf_train["CW_s_plus1"])
coefs_cf_s = model_cf_s.coef_
coefs_df_cf_s["Coefs"] = coefs_cf_s[0]
coefs_df_cf_s.loc[coefs_df_cf_s.abs().sort_values("Coefs", ascending=False).index,:].head()

In [None]:
coefs_df_cf_f = pd.DataFrame(index=cols_cam)
model_cf_f.fit(X_cf_train,y_cf_train["CW_f_plus1"])
coefs_cf_f = model_cf_f.coef_
coefs_df_cf_f["Coefs"] = coefs_cf_f[0]
coefs_df_cf_f.loc[coefs_df_cf_f.abs().sort_values("Coefs", ascending=False).index,:].head()

In [None]:
coefs_df_cf_o = pd.DataFrame(index=cols_cam)
model_cf_o.fit(X_cf_train,y_cf_train["CW_o_plus1"])
coefs_cf_o = model_cf_o.coef_
coefs_df_cf_o["Coefs"] = coefs_cf_o[0]
coefs_df_cf_o.loc[coefs_df_cf_o.abs().sort_values("Coefs", ascending=False).index,:].head()

In [None]:
# Starting
y_cf_prd_s = model_cf_s.predict_proba(X_cf_test)[:, 1]  # probability of the positive class
print("Starting:", roc_auc_score(y_cf_test["CW_s_plus1"], y_cf_prd_s))

# Finishing
y_cf_prd_f = model_cf_f.predict_proba(X_cf_test)[:, 1]
print("Finishing:", roc_auc_score(y_cf_test["CW_f_plus1"], y_cf_prd_f))

# Ongoing
y_cf_prd_o = model_cf_o.predict_proba(X_cf_test)[:, 1]
print("Ongoing:", roc_auc_score(y_cf_test["CW_o_plus1"], y_cf_prd_o))