# Scientific Python
## Central European University // Fall 2021

## Final Project
Instructor: Marton Posfai

Student: Alessandra Oshiro


## 1. Introduction

## 2. Loading and cleaning the dataset



In [14]:
import pandas as pd
import numpy as np

In [47]:
social_movements_df = pd.read_csv("social_movements.csv")
social_movements_df.head(5)

| | id | country | ccode | year | region | protest | protestnumber | startday | startmonth | startyear | ... | protesterdemand4 | stateresponse1 | stateresponse2 | stateresponse3 | stateresponse4 | stateresponse5 | stateresponse6 | stateresponse7 | sources | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 201990001 | Canada | 20 | 1990 | North America | 1 | 1 | 15.0 | 1.0 | 1990.0 | ... | NaN | ignore | NaN | NaN | NaN | NaN | NaN | NaN | 1. great canadian train journeys into history;... | canada s railway passenger system was finally ... |
| 1 | 201990002 | Canada | 20 | 1990 | North America | 1 | 2 | 25.0 | 6.0 | 1990.0 | ... | NaN | ignore | NaN | NaN | NaN | NaN | NaN | NaN | 1. autonomy s cry revived in quebec the new yo... | protestors were only identified as young peopl... |
| 2 | 201990003 | Canada | 20 | 1990 | North America | 1 | 3 | 1.0 | 7.0 | 1990.0 | ... | NaN | ignore | NaN | NaN | NaN | NaN | NaN | NaN | 1. quebec protest after queen calls for unity ... | the queen, after calling on canadians to remai... |
| 3 | 201990004 | Canada | 20 | 1990 | North America | 1 | 4 | 12.0 | 7.0 | 1990.0 | ... | NaN | accomodation | NaN | NaN | NaN | NaN | NaN | NaN | 1. indians gather as siege intensifies; armed ... | canada s federal government has agreed to acqu... |
| 4 | 201990005 | Canada | 20 | 1990 | North America | 1 | 5 | 14.0 | 8.0 | 1990.0 | ... | NaN | crowd dispersal | arrests | accomodation | NaN | NaN | NaN | NaN | 1. dozens hurt in mohawk blockade protest the ... | protests were directed against the state due t... |


In [60]:
social_movements_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17145 entries, 0 to 17144
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   country                17145 non-null  object 
 1   year                   17145 non-null  int64  
 2   region                 17145 non-null  object 
 3   protest                17145 non-null  int64  
 4   protestnumber          17145 non-null  int64  
 5   startday               15239 non-null  float64
 6   startmonth             15239 non-null  float64
 7   startyear              15239 non-null  float64
 8   endday                 15239 non-null  float64
 9   endmonth               15239 non-null  float64
 10  endyear                15239 non-null  float64
 11  protesterviolence      15758 non-null  float64
 12  participants_category  9887 non-null   object 
 13  participants           15746 non-null  object 
 14  protesteridentity      14684 non-null  object 
 15  pr

This initial summary shows that there are some NaN values in our dataset. Moreover, the names and descriptions of the variables suggest that some of the data will not be relevant for the analysis. The next step is therefore to clean the data and drop the columns that will not be needed. First, I will drop the columns that do not provide relevant information; then I will drop those that have a high proportion of NaNs.

In [49]:
social_movements_df.drop(["id", "ccode", "location", "sources", "notes"], axis = 1, inplace = True)

In [50]:
column_na_proportion = social_movements_df.isna().sum()/len(social_movements_df)
print(column_na_proportion)

country                  0.000000
year                     0.000000
region                   0.000000
protest                  0.000000
protestnumber            0.000000
startday                 0.111169
startmonth               0.111169
startyear                0.111169
endday                   0.111169
endmonth                 0.111169
endyear                  0.111169
protesterviolence        0.080898
participants_category    0.423330
participants             0.081598
protesteridentity        0.143540
protesterdemand1         0.111228
protesterdemand2         0.826363
protesterdemand3         0.977661
protesterdemand4         0.951531
stateresponse1           0.112978
stateresponse2           0.831554
stateresponse3           0.945757
stateresponse4           0.985768
stateresponse5           0.950481
stateresponse6           0.999067
stateresponse7           0.946340
dtype: float64


In [61]:
social_movements_df.dropna(subset = ["stateresponse1", "protesterdemand1"], inplace = True)

In [62]:
column_na_proportion = social_movements_df.isna().sum()/len(social_movements_df)
print(column_na_proportion)

country                  0.000000
year                     0.000000
region                   0.000000
protest                  0.000000
protestnumber            0.000000
startday                 0.000000
startmonth               0.000000
startyear                0.000000
endday                   0.000000
endmonth                 0.000000
endyear                  0.000000
protesterviolence        0.000000
participants_category    0.350539
participants             0.000789
protesteridentity        0.036428
protesterdemand1         0.000000
protesterdemand2         0.804971
protesterdemand3         0.974882
protesterdemand4         0.948777
stateresponse1           0.000000
stateresponse2           0.810231
stateresponse3           0.938848
stateresponse4           0.983956
stateresponse5           0.947593
stateresponse6           0.998948
stateresponse7           0.943648
dtype: float64


In the previous cells, I calculated the proportion of NaNs per column. As we can see, several columns have a proportion of NaNs close to 1. This is most evident in the variables for protester demands and for state responses. The reason is that only some mobilizations involved several demands, or several state responses as the conflict progressed, so the additional columns are empty for most observations. Because dropping these variables outright would lose information about the state response, I decided to recode "stateresponse1" to "stateresponse7" as a single variable that indicates whether the response was peaceful (accommodation or agreement) or involved force (arrests, crowd dispersal, shootings, killings).

In [64]:
state_response_complete = list(
    zip(social_movements_df["stateresponse1"], 
            social_movements_df["stateresponse2"], 
            social_movements_df["stateresponse3"], 
            social_movements_df["stateresponse4"], 
            social_movements_df["stateresponse5"], 
            social_movements_df["stateresponse6"], 
            social_movements_df["stateresponse7"]))


state_response_complete

[('ignore', nan, nan, nan, nan, nan, nan),
 ('ignore', nan, nan, nan, nan, nan, nan),
 ('ignore', nan, nan, nan, nan, nan, nan),
 ('accomodation', nan, nan, nan, nan, nan, nan),
 ('crowd dispersal', 'arrests', 'accomodation', nan, nan, nan, nan),
 ('crowd dispersal', 'shootings', nan, nan, nan, nan, nan),
 ('ignore', nan, nan, nan, nan, nan, nan),
 ('ignore', nan, nan, nan, nan, nan, nan),
 ('arrests', nan, nan, nan, nan, nan, nan),
 ('ignore', nan, nan, nan, nan, nan, nan),
 ('arrests', nan, nan, nan, nan, nan, nan),
 ('ignore', nan, nan, nan, nan, nan, nan),
 ('ignore', nan, nan, nan, nan, nan, nan),
 ('ignore', nan, nan, nan, nan, nan, nan),
 ('shootings', 'killings', nan, nan, nan, nan, nan),
 ('ignore', nan, nan, nan, nan, nan, nan),
 ('ignore', nan, nan, nan, nan, nan, nan),
 ('ignore', nan, nan, nan, nan, nan, nan),
 ('ignore', nan, nan, nan, nan, nan, nan),
 ('accomodation', nan, nan, nan, nan, nan, nan),
 ('ignore', nan, nan, nan, nan, nan, nan),
 ('crowd dispersal', 'arrests'

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
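
A TypeError like the one above is typically what `np.isnan` raises when it receives strings, which is what the state response columns contain. The sketch below shows one way the recoding described earlier could be done without calling `np.isnan`. It is not the code used in this project: the helper `recode_state_response`, the column name `state_response`, and the decision to treat every label outside the force list (including "ignore") as peaceful are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: these labels count as use of force. Every other label
# (for example "ignore" or "accomodation") is treated as peaceful.
FORCE_RESPONSES = ["arrests", "crowd dispersal", "shootings", "killings"]


def recode_state_response(df, columns):
    """Return 'force' if any response column holds a force label, else 'peaceful'."""
    # isin() compares the labels directly, so NaN cells simply evaluate to
    # False and np.isnan never has to be applied to strings.
    used_force = df[columns].isin(FORCE_RESPONSES).any(axis=1)
    return pd.Series(np.where(used_force, "force", "peaceful"), index=df.index)


# Toy rows copied from the tuples printed above (first three columns only).
example_df = pd.DataFrame({
    "stateresponse1": ["ignore", "accomodation", "crowd dispersal", "shootings"],
    "stateresponse2": [np.nan, np.nan, "arrests", "killings"],
    "stateresponse3": [np.nan, np.nan, "accomodation", np.nan],
})
response_columns = ["stateresponse1", "stateresponse2", "stateresponse3"]
example_df["state_response"] = recode_state_response(example_df, response_columns)
print(example_df["state_response"].tolist())
```

With this rule, a mobilization that saw both force and accommodation is coded as "force". Whether that is the right choice depends on the research question.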

For the protesters' demands, I did something similar: I recoded "protesterdemand1" to "protesterdemand4" as a single variable that indicates the nature of the demand. It has three categories: economic, political, or both.
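
A minimal sketch of how such a recoding could look is shown below. The demand labels and their grouping into economic and political are hypothetical, since the labels are not listed in this notebook and would need to be checked against the dataset. The helper `recode_demands` and the variable `demand_type` are also names introduced only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical grouping of demand labels; replace with the labels
# that actually appear in the protesterdemand columns.
ECONOMIC_DEMANDS = ["labor wage dispute", "land farm issue", "price increases, tax policy"]
POLITICAL_DEMANDS = ["political behavior, process", "removal of politician"]


def recode_demands(df, columns):
    """Return 'economic', 'political' or 'both' per row (NaN if no label matches)."""
    has_economic = df[columns].isin(ECONOMIC_DEMANDS).any(axis=1)
    has_political = df[columns].isin(POLITICAL_DEMANDS).any(axis=1)

    demand_type = pd.Series(np.nan, index=df.index, dtype="object")
    demand_type.loc[has_economic] = "economic"
    demand_type.loc[has_political] = "political"
    # Rows matching both groups are overwritten last.
    demand_type.loc[has_economic & has_political] = "both"
    return demand_type


# Toy rows, invented for illustration only.
toy_demands = pd.DataFrame({
    "protesterdemand1": ["labor wage dispute", "removal of politician", "price increases, tax policy"],
    "protesterdemand2": [np.nan, np.nan, "political behavior, process"],
})
toy_demands["demand_type"] = recode_demands(toy_demands, ["protesterdemand1", "protesterdemand2"])
print(toy_demands["demand_type"].tolist())
```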

Then, I reformatted the start and end dates so that each fits in a single column instead of three separate ones. For this, I used the datetime module, which also allows me to calculate the duration of each mobilization. The duration was added to the dataframe as well because it could be a relevant predictor for the machine learning model.

In [58]:
start_time_complete = list(
    zip(social_movements_df["startday"], 
            social_movements_df["startmonth"], 
            social_movements_df["startyear"]))

start_time_complete

[(15.0, 1.0, 1990.0),
 (25.0, 6.0, 1990.0),
 (1.0, 7.0, 1990.0),
 (12.0, 7.0, 1990.0),
 (14.0, 8.0, 1990.0),
 (19.0, 9.0, 1990.0),
 (10.0, 9.0, 1991.0),
 (28.0, 9.0, 1991.0),
 (4.0, 5.0, 1992.0),
 (16.0, 5.0, 1993.0),
 (1.0, 7.0, 1993.0),
 (1.0, 9.0, 1994.0),
 (18.0, 11.0, 1994.0),
 (20.0, 2.0, 1995.0),
 (8.0, 9.0, 1995.0),
 (26.0, 10.0, 1996.0),
 (28.0, 10.0, 1997.0),
 (19.0, 11.0, 1997.0),
 (nan, nan, nan),
 (nan, nan, nan),
 (22.0, 2.0, 2000.0),
 (26.0, 2.0, 2000.0),
 (9.0, 5.0, 2000.0),
 (16.0, 6.0, 2000.0),
 (nan, nan, nan),
 (nan, nan, nan),
 (3.0, 5.0, 2003.0),
 (nan, nan, nan),
 (3.0, 3.0, 2005.0),
 (10.0, 9.0, 2005.0),
 (21.0, 2.0, 2006.0),
 (29.0, 6.0, 2007.0),
 (10.0, 8.0, 2008.0),
 (13.0, 5.0, 2009.0),
 (10.0, 11.0, 2009.0),
 (12.0, 2.0, 2010.0),
 (3.0, 7.0, 2011.0),
 (10.0, 2.0, 2012.0),
 (nan, nan, nan),
 (nan, nan, nan),
 (6.0, 10.0, 2015.0),
 (2.0, 2.0, 2016.0),
 (10.0, 2.0, 2016.0),
 (25.0, 2.0, 2016.0),
 (20.0, 3.0, 2016.0),
 (24.0, 3.0, 2016.0),
 (12.0, 4.0, 2016.0),

In [59]:
end_time_complete = list(
    zip(social_movements_df["endday"], 
            social_movements_df["endmonth"], 
            social_movements_df["endyear"]))

end_time_complete

[(15.0, 1.0, 1990.0),
 (25.0, 6.0, 1990.0),
 (1.0, 7.0, 1990.0),
 (6.0, 9.0, 1990.0),
 (15.0, 8.0, 1990.0),
 (19.0, 9.0, 1990.0),
 (17.0, 9.0, 1991.0),
 (2.0, 10.0, 1991.0),
 (5.0, 5.0, 1992.0),
 (16.0, 5.0, 1993.0),
 (31.0, 8.0, 1993.0),
 (1.0, 9.0, 1994.0),
 (18.0, 11.0, 1994.0),
 (20.0, 2.0, 1995.0),
 (8.0, 9.0, 1995.0),
 (26.0, 10.0, 1996.0),
 (9.0, 11.0, 1997.0),
 (4.0, 12.0, 1997.0),
 (nan, nan, nan),
 (nan, nan, nan),
 (23.0, 2.0, 2000.0),
 (26.0, 2.0, 2000.0),
 (9.0, 5.0, 2000.0),
 (16.0, 6.0, 2000.0),
 (nan, nan, nan),
 (nan, nan, nan),
 (3.0, 5.0, 2003.0),
 (nan, nan, nan),
 (3.0, 4.0, 2005.0),
 (10.0, 9.0, 2005.0),
 (17.0, 8.0, 2006.0),
 (29.0, 6.0, 2007.0),
 (10.0, 8.0, 2008.0),
 (13.0, 5.0, 2009.0),
 (10.0, 11.0, 2009.0),
 (13.0, 2.0, 2010.0),
 (3.0, 7.0, 2011.0),
 (6.0, 6.0, 2012.0),
 (nan, nan, nan),
 (nan, nan, nan),
 (6.0, 10.0, 2015.0),
 (2.0, 2.0, 2016.0),
 (10.0, 2.0, 2016.0),
 (25.0, 2.0, 2016.0),
 (4.0, 4.0, 2016.0),
 (24.0, 3.0, 2016.0),
 (12.0, 4.0, 2016.0),
 (1
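
The day, month and year tuples above can be combined into a single date per observation, and the duration then follows from subtracting the two dates. The sketch below is one possible way to do this with the datetime module and pandas. The helper `combine_date` and the column names `start_date`, `end_date` and `duration_days` are assumptions introduced for this example, and rows with missing date parts are kept as `NaT` rather than dropped.

```python
from datetime import datetime

import numpy as np
import pandas as pd


def combine_date(df, day_col, month_col, year_col):
    """Build one datetime column from day, month and year columns."""
    dates = []
    for day, month, year in zip(df[day_col], df[month_col], df[year_col]):
        if pd.isna(day) or pd.isna(month) or pd.isna(year):
            # Missing date parts give a missing date.
            dates.append(pd.NaT)
        else:
            # The columns are stored as floats, so cast to int first.
            dates.append(datetime(int(year), int(month), int(day)))
    return pd.Series(pd.to_datetime(dates), index=df.index)


# Three rows copied from the start and end tuples printed above.
dates_df = pd.DataFrame({
    "startday": [15.0, 12.0, np.nan],
    "startmonth": [1.0, 7.0, np.nan],
    "startyear": [1990.0, 1990.0, np.nan],
    "endday": [15.0, 6.0, np.nan],
    "endmonth": [1.0, 9.0, np.nan],
    "endyear": [1990.0, 1990.0, np.nan],
})
dates_df["start_date"] = combine_date(dates_df, "startday", "startmonth", "startyear")
dates_df["end_date"] = combine_date(dates_df, "endday", "endmonth", "endyear")

# Difference in whole days; a protest that starts and ends on the same day gets 0.
dates_df["duration_days"] = (dates_df["end_date"] - dates_df["start_date"]).dt.days
print(dates_df[["start_date", "end_date", "duration_days"]])
```

Counting a same-day protest as 0 days is a design choice; adding 1 to the difference would instead count the number of calendar days on which the protest took place.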

Finally, to conclude the data cleaning process, I dropped the columns in which half or more of the values are NaN.

In [None]:
col_names = column_na_proportion.index
valid_cols = [col for col in col_names if column_na_proportion[col] < 0.5]
social_movements_df = social_movements_df[valid_cols]
social_movements_df.head()