In [10]:
# 5. Recall that the data are dyadic, i.e., for each crisis, we have all the pairs of countries
#    that were hostile. Accordingly, our data use suffixes of "1" and "2" to denote whether a
#    column applies to the first or second actor. NOTE: when we coded the data, the idea was to
#    list the initiator as the first country, but this choice may have been ambiguous. A
#    robustness test would be to rerun the Monte Carlo simulations (see 6 and 7) with the order
#    of the actors reversed, i.e., treating actor 2 as the initiator.
#
# 6. First goal: write a Monte Carlo to produce "historical confidence intervals". For each
#    crisis, make a uniform draw among the different historical codings (picking ONE of them,
#    not all) and use it in the sample. Repeat for multiple samples and average the results of
#    your model across them.
#
# 7. Second goal: improve the Monte Carlo to pick ONE row for each crisis. Note that some
#    crises have a ton of dyads, which inflates the importance of these dyads. Ideally, each
#    Monte Carlo sample contains 1 dyad from 1 historical understanding for each crisis, so the
#    size of each sample should equal the number of crises: around N = 354, give or take.

#import libraries

import numpy as np
import pandas as pd

#DATA IMPORT, CLEANING, PREPROCESSING

df = pd.read_csv('crisis_dataset_edited.csv')

# drop the NOTES, COMPLEXITY, A1 NUKES, A2 NUKES, and NAME columns
df.drop(['NOTES','COMPLEXITY','A1 NUKES','A2 NUKES','NAME'], axis=1, inplace=True)
# drop rows that are entirely empty
df.dropna(how='all', inplace=True)
#df.dropna(inplace=True)

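# column names, the list of distinct crisis numbers, and a container for the sampled rows
first_column = 'CRISNO'
second_column = 'HISTORY'
unique_crisis_no = df[first_column].unique()
final_selected_rows = []

# inspect the cleaned data and the crisis numbers
print(df)
print(unique_crisis_no)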

# iterate over crisis numbers
for first_value in unique_crisis_no:
    # keep only the rows (dyads and historical codings) for this crisis
    filtered_df_first = df[df[first_column] == first_value]

    # distinct historical codings for this crisis; NaN is dropped because
    # `== NaN` never matches and would leave nothing to sample from
    unique_history = filtered_df_first[second_column].dropna().unique()

    if len(unique_history) > 0:
        # uniform draw of ONE historical coding
        chosen_history = np.random.choice(unique_history)
        # keep only the dyads recorded under the chosen coding
        final_filtered_df = filtered_df_first[filtered_df_first[second_column] == chosen_history]

        # there should always be at least one row to sample from
        if not final_filtered_df.empty:
            # randomly select ONE dyad
            final_selected_row = final_filtered_df.sample(n=1)
            final_selected_rows.append(final_selected_row)
        # sanity check in case something went wrong
        else:
            print('this should not be happening')

# combine to df and sort by crisno
final_combined_df = pd.concat(final_selected_rows)
final_combined_df = final_combined_df.sort_values(by=first_column)

print(final_combined_df)
output_filename = 'final_selected_rows_example.csv'
final_combined_df.to_csv(output_filename, index=False)
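
# Robustness sketch for note 5 (reversing the actor order). This is a minimal,
# hypothetical helper rather than part of the original analysis: the name
# `swap_actor_order` is illustrative, and it assumes the actor-specific columns
# follow the 'ACTOR 1'/'ACTOR 2' and 'A1 ...'/'A2 ...' naming seen in the data.
import pandas as pd


def swap_actor_order(data):
    """Return a copy of `data` with the actor-1 and actor-2 columns exchanged."""
    # candidate column pairs: the actor names plus every 'A1 <name>'/'A2 <name>' pair
    pairs = [('ACTOR 1', 'ACTOR 2')]
    pairs += [(col, 'A2 ' + col[3:]) for col in data.columns if col.startswith('A1 ')]

    renames = {}
    for first, second in pairs:
        # only swap when both halves of the pair are present
        if first in data.columns and second in data.columns:
            renames[first] = second
            renames[second] = first

    # rename swaps the labels; reindexing restores the original column order
    return data.rename(columns=renames)[list(data.columns)]

# Possible use: rerun the sampling above on swap_actor_order(df) instead of df,
# so that actor 2 is treated as the initiator.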

#model stuff
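# A minimal sketch of how goals 6 and 7 might be wrapped into a repeated Monte
# Carlo. Everything named here is an assumption for illustration: the function
# names, the `n_samples` and `seed` parameters, the use of
# np.random.default_rng, and the `statistic` callback, which stands in for
# whatever model is eventually fit to each sample.
import numpy as np
import pandas as pd


def draw_one_dyad_per_crisis(data, rng, crisis_col='CRISNO', history_col='HISTORY'):
    """Draw ONE historical coding per crisis, then ONE dyad under that coding."""
    rows = []
    # rows without a crisis number or a history coding cannot be sampled
    usable = data.dropna(subset=[crisis_col, history_col])
    for _, crisis_rows in usable.groupby(crisis_col):
        # uniform draw among the distinct codings of this crisis
        chosen_history = rng.choice(crisis_rows[history_col].unique())
        candidates = crisis_rows[crisis_rows[history_col] == chosen_history]
        # uniform draw of a single dyad under the chosen coding
        rows.append(candidates.iloc[[rng.integers(len(candidates))]])
    return pd.concat(rows)


def monte_carlo_average(data, statistic, n_samples=100, seed=0):
    """Average `statistic` over `n_samples` independent one-dyad-per-crisis draws."""
    rng = np.random.default_rng(seed)
    estimates = [statistic(draw_one_dyad_per_crisis(data, rng)) for _ in range(n_samples)]
    # mean across draws, plus the spread of the estimates across draws
    return float(np.mean(estimates)), float(np.std(estimates))

# Possible use, with the mean of KINETIC as a placeholder for a real model:
# mean_est, sd_est = monte_carlo_average(df, lambda s: s['KINETIC'].mean(), n_samples=1000)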




     CRISNO  YEAR  HISTORY ACTOR 1 ACTOR 2            A1 THREAT  \
2     445.0  2004      1.0     GRG     RUS         CONVENTIONAL   
3     446.0  2005      1.0     ETH     ERI         CONVENTIONAL   
4     447.0  2005      1.0     CHA     SUD         CONVENTIONAL   
5     448.0  2006      1.0     IRN     USA                  NaN   
6     448.0  2006      2.0     IRN     UKG              NUCLEAR   
..      ...   ...      ...     ...     ...                  ...   
991   171.0  1959      1.0     CHN     IND         CONVENTIONAL   
992   172.0  1959      1.0     IRQ     IRN  CONVENTIONAL, OTHER   
993   173.0  1960      1.0     EGY     ISR         CONVENTIONAL   
994   175.0  1960      1.0     DOM     VEN         CONVENTIONAL   
995   175.0  1960      1.0     DOM     USA                  NaN   

               A2 THREAT  A1 INTENSITY  A2 INTENSITY  KINETIC  ...  \
2           CONVENTIONAL           2.0           1.0      1.0  ...   
3           CONVENTIONAL           3.0           3.0   