# IMT574 Problem Set 4
February 6, 2020

## INTRODUCTION
This problem set should not be hard per se but it may be very slow in terms of computing. You are mostly asked just to use existing packages to analyze a big survey dataset.
The problem set has two parts: first you try to get as good a prediction as you can using k-NN, logistic regression and SVM methods, and thereafter you analyze how much does knowledge of country of the respondent improve our prediction.
The problem set has 4 goals: a) explore classification methods we have learned in a novel context (a large survey) b) learn more about cross validation c) learn a little bit about the global opinion (but we don't do the latter that rigorously).

## World Values Survey
In this database we use World Values Survey data. The data is free to be downloaded from the webpage, just you have to sign up but I expect to download the version on canvas. It is a survey, conducted every few years in a number of countries. Here we use wave 6 data, mostly from 2013-2014. Note that not all countries are participating in each wave.
The questions revolve around different opinion topics, including trust, work, religion, family, gender equality, and nationalism. In this problem set we focus on what the respondents think about abortion: "Please tell if abortion can always be justified, never be justified, or something in between". The responses range between 1 - never justifiable, and 10 - always justifiable. Besides of the numeric range 1..10, a number of cases have negative codes (this applies to many variables). These are various types of missing information (-5: missing, -4: not asked, -3: not applicable, -2: no answer, -1: don't know). We treat all these as just missing below.
The version we use here is a little bit simplified, I have removed a large number of variables that are constructed from the other variables and hence highly collinear with the rest of the data.
I strongly recommend you to browse the documentation before you start, there are two large-ish documentation files provided.


## 1 Explore and prepare the data (20pt) 
As the first step, explore the data.
1. (2pt) Load the data. How many responses and variables do we have?

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.neighbors import KNeighborsClassifier

In [2]:
world_survey_data = pd.read_csv('../../data/wvs.csv.bz2', sep= '\t')
world_survey_data.head()

Unnamed: 0,V2,V4,V5,V6,V7,V8,V9,V10,V11,V12,...,MN_228S8,MN_229A,MN_230A,MN_233A,MN_237B1,MN_249A1,MN_249A3,I_RELIGBEL,I_NORM1,I_VOICE1
0,12,1,1,1,-2,1,1,2,1,1,...,3,-3,-3,-3,-3,1,1,0.0,1.0,0.0
1,12,1,2,3,4,2,2,2,2,2,...,3,-3,-3,-3,-3,2,-1,0.0,1.0,0.66
2,12,1,3,2,4,2,1,2,2,2,...,4,1,1,2,-3,1,1,0.0,1.0,0.33
3,12,1,1,3,4,3,1,2,1,2,...,2,2,1,2,-3,1,2,0.0,1.0,0.0
4,12,1,1,1,2,1,1,1,3,2,...,2,2,1,2,-3,1,2,0.0,1.0,0.66


In [3]:
print(world_survey_data.columns)

Index(['V2', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12',
       ...
       'MN_228S8', 'MN_229A', 'MN_230A', 'MN_233A', 'MN_237B1', 'MN_249A1',
       'MN_249A3', 'I_RELIGBEL', 'I_NORM1', 'I_VOICE1'],
      dtype='object', length=328)


In [4]:
len(world_survey_data)

90350

We have in total 90350 rows (responses) and 328 variables.

2. (3pt) Create a summary table over all responses for V204: is abortion justifiable. How many non- missing responses (i.e. positive answers) do you find? Describe the the opinion about the abortion among the global pool of respondents.

In [5]:
world_survey_data['V204'].describe()

count    90350.000000
mean         2.946386
std          2.964040
min         -5.000000
25%          1.000000
50%          2.000000
75%          5.000000
max         10.000000
Name: V204, dtype: float64

In [6]:
v204_summary = world_survey_data.groupby('V204').V204.count()
v204_summary

V204
-5        23
-4      1523
-2      1045
-1      2017
 1     40227
 2      7896
 3      6294
 4      4497
 5      9580
 6      4395
 7      3493
 8      3397
 9      1896
 10     4067
Name: V204, dtype: int64

In [7]:
##Aggregating all the missing entries
missing = v204_summary[:4].sum()
missing

4608

In [8]:
non_missing = v204_summary.sum() - missing 
non_missing

85742

In [9]:
##Getting general opinion (only taking into account the positive responses)
world_survey_data[world_survey_data.V204 > 0].V204.describe()

count    85742.000000
mean         3.225024
std          2.764319
min          1.000000
25%          1.000000
50%          2.000000
75%          5.000000
max         10.000000
Name: V204, dtype: float64

Thus, there are 4608 missing and 85742 positive (non-missing) responses in the 'V204' variable.
In order to know the general opinion I look at the avg. rating of the positive responses which came out to be 3.22. This shows that general opinion is more on the lower end or in other words is more towards "abortion cannot be justified". 

3. (4pt) Now remove missings. We do it in two ways:
    a). remove everything that are not positive integers for V204 and V2 (country).
    b). for all other variables, remove the missings in the sense of missing value on computer. You may leave negative answers in the data, otherwise I am afraid your sample size collapses. What is the final number of observations?


In [10]:
##Removing non-positive values for V204 and V2 variables
positive_data = world_survey_data[(world_survey_data.V204 > 0) & (world_survey_data.V2 > 0)]
## Dropping NAs for all other variables in positive data
positive_data.dropna(inplace = True)
positive_data

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


Unnamed: 0,V2,V4,V5,V6,V7,V8,V9,V10,V11,V12,...,MN_228S8,MN_229A,MN_230A,MN_233A,MN_237B1,MN_249A1,MN_249A3,I_RELIGBEL,I_NORM1,I_VOICE1
0,12,1,1,1,-2,1,1,2,1,1,...,3,-3,-3,-3,-3,1,1,0.0,1.0,0.00
1,12,1,2,3,4,2,2,2,2,2,...,3,-3,-3,-3,-3,2,-1,0.0,1.0,0.66
2,12,1,3,2,4,2,1,2,2,2,...,4,1,1,2,-3,1,1,0.0,1.0,0.33
3,12,1,1,3,4,3,1,2,1,2,...,2,2,1,2,-3,1,2,0.0,1.0,0.00
4,12,1,1,1,2,1,1,1,3,2,...,2,2,1,2,-3,1,2,0.0,1.0,0.66
5,12,1,2,2,2,4,1,2,1,2,...,3,2,1,1,-3,1,2,0.0,1.0,0.00
6,12,1,1,1,1,1,1,2,2,1,...,3,2,2,2,-3,1,1,0.0,1.0,0.66
7,12,1,1,1,1,2,2,2,1,2,...,3,1,1,2,-3,2,2,0.0,1.0,0.00
8,12,1,1,1,2,2,2,2,2,2,...,3,2,1,1,-3,-3,-3,0.0,1.0,0.33
9,12,1,1,1,2,1,1,1,1,2,...,3,-3,-3,-3,0,-3,-3,0.0,1.0,0.66


In [11]:
positive_data.describe()

Unnamed: 0,V2,V4,V5,V6,V7,V8,V9,V10,V11,V12,...,MN_228S8,MN_229A,MN_230A,MN_233A,MN_237B1,MN_249A1,MN_249A3,I_RELIGBEL,I_NORM1,I_VOICE1
count,79267.0,79267.0,79267.0,79267.0,79267.0,79267.0,79267.0,79267.0,79267.0,79267.0,...,79267.0,79267.0,79267.0,79267.0,79267.0,79267.0,79267.0,79267.0,79267.0,79267.0
mean,463.059054,1.097935,1.667945,1.851767,2.571776,1.471091,1.856195,1.811826,2.065412,1.479632,...,-3.438581,-3.512319,-3.636949,-3.579712,-3.858213,-3.57816,-3.582449,0.315238,0.467017,0.337433
std,247.769472,0.392617,0.761947,0.88478,1.0671,0.917788,1.087241,0.761601,0.867091,0.503287,...,1.83686,1.574501,1.32586,1.580421,0.627439,1.421441,1.413258,0.464614,0.498914,0.316637
min,12.0,-5.0,-5.0,-5.0,-5.0,-5.0,-5.0,-5.0,-5.0,-5.0,...,-4.0,-4.0,-4.0,-4.0,-4.0,-5.0,-5.0,0.0,0.0,0.0
25%,275.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,...,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,0.0,0.0,0.0
50%,434.0,1.0,2.0,2.0,3.0,1.0,1.0,2.0,2.0,1.0,...,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,0.0,0.0,0.33
75%,702.0,1.0,2.0,2.0,3.0,2.0,3.0,2.0,3.0,2.0,...,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,-4.0,1.0,1.0,0.66
max,887.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,2.0,...,4.0,2.0,2.0,5.0,1.0,2.0,2.0,1.0,1.0,1.0


After cleaning we now have 79267 rows of data with no NAs and only positive values in 'V2' and 'V204'.

4.(2pt) In order to simplify the analysis below, create a new binary variable abortion as

$$abortion = \begin{cases}
       1 &\quad\text{V204 > 3}\\
       0 &\quad\text{otherwise} \\ 
     \end{cases}$$


In [12]:
## Creating a new binary variable called abortion
positive_data['abortion'] = positive_data.V204.apply(lambda x: 1 if x>3 else 0)
positive_data.head(100)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


Unnamed: 0,V2,V4,V5,V6,V7,V8,V9,V10,V11,V12,...,MN_229A,MN_230A,MN_233A,MN_237B1,MN_249A1,MN_249A3,I_RELIGBEL,I_NORM1,I_VOICE1,abortion
0,12,1,1,1,-2,1,1,2,1,1,...,-3,-3,-3,-3,1,1,0.0,1.0,0.00,0
1,12,1,2,3,4,2,2,2,2,2,...,-3,-3,-3,-3,2,-1,0.0,1.0,0.66,0
2,12,1,3,2,4,2,1,2,2,2,...,1,1,2,-3,1,1,0.0,1.0,0.33,0
3,12,1,1,3,4,3,1,2,1,2,...,2,1,2,-3,1,2,0.0,1.0,0.00,0
4,12,1,1,1,2,1,1,1,3,2,...,2,1,2,-3,1,2,0.0,1.0,0.66,0
5,12,1,2,2,2,4,1,2,1,2,...,2,1,1,-3,1,2,0.0,1.0,0.00,0
6,12,1,1,1,1,1,1,2,2,1,...,2,2,2,-3,1,1,0.0,1.0,0.66,0
7,12,1,1,1,1,2,2,2,1,2,...,1,1,2,-3,2,2,0.0,1.0,0.00,0
8,12,1,1,1,2,2,2,2,2,2,...,2,1,1,-3,-3,-3,0.0,1.0,0.33,0
9,12,1,1,1,2,1,1,1,1,2,...,-3,-3,-3,0,-3,-3,0.0,1.0,0.66,0


5. (5pt) Compute (pearson) correlation table between abortion and all other variables in the data. There are many of these!
Present these variables in descending order according to the absolute value of the correlation. 
  <br> Take a look at a few variables that have strong correlation with abortion. What do these represent?

In [13]:
corr_abortion = positive_data.corrwith(positive_data['abortion'])
corr_abortion = pd.DataFrame(corr_abortion)
corr_abortion['abs_corr'] = abs(corr_abortion[0])
corr_abortion.sort_values(by = 'abs_corr', ascending = False)[0]

abortion      1.000000
V204          0.881048
V205          0.548653
V203          0.485419
V206          0.446394
V207          0.418271
V152         -0.315280
V9            0.314117
V203A         0.291576
V146          0.272220
V210          0.257035
V19           0.249042
V202          0.246232
V145          0.243545
V200          0.239010
V147          0.224269
I_RELIGBEL    0.217138
V153          0.210643
V199          0.204017
V185          0.198473
V201          0.197711
V79           0.197341
V108          0.196402
V252         -0.191483
V208          0.188882
V90           0.179921
I_NORM1       0.179657
V43           0.176142
V125_00       0.172031
V231          0.169773
                ...   
V142          0.015085
V73           0.014980
V216          0.014682
V250          0.014277
V63          -0.013637
V140          0.012815
V56           0.012504
V16          -0.012139
V155          0.011695
V171         -0.010739
V60           0.010198
V173          0.009864
V123       

The top 3 rows are that of variables having a strong correlation with 'abortion'. The first row is the variable itself (having a correlation of 1), followed by 'V204' (having a correlation of 0.881) and lastly, 'V205' (having a correlation of 0.549).
The variable 'V204' is the same column from which we computed our abortion variable and captures what respondents think about abortion. While, 'V205' captures what respondents think about divorce.

6. (4pt) convert country code V2 into dummies. First rename V2 to country. Thereafter use pd.get_dummies along these lines:


In [14]:
positive_data.rename(columns={'V2':'country'}, inplace= True)
positive_data.columns

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  return super(DataFrame, self).rename(**kwargs)


Index(['country', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12',
       ...
       'MN_229A', 'MN_230A', 'MN_233A', 'MN_237B1', 'MN_249A1', 'MN_249A3',
       'I_RELIGBEL', 'I_NORM1', 'I_VOICE1', 'abortion'],
      dtype='object', length=329)

In [15]:
len(positive_data.columns)

329

In [16]:
data = pd.get_dummies(positive_data, columns = ['country'])
data.head()

Unnamed: 0,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,...,country_752,country_764,country_780,country_788,country_792,country_804,country_840,country_858,country_860,country_887
0,1,1,1,-2,1,1,2,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,1,2,3,4,2,2,2,2,2,1,...,0,0,0,0,0,0,0,0,0,0
2,1,3,2,4,2,1,2,2,2,2,...,0,0,0,0,0,0,0,0,0,0
3,1,1,3,4,3,1,2,1,2,2,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,2,1,1,1,3,2,1,...,0,0,0,0,0,0,0,0,0,0


Afterwards, remove country variable from the data. How many rows/columns do you have now? How many country dummies does the data contain?
Note that get_dummies creates a dummy for every category, so you have to remove one of these dummies in order to avoid perfect multicollinearity.

In [17]:
data.drop(columns=['country_887', 'V204'], inplace=True)
##Country column is already dropped when creating dummies 
diff_columns = data.columns.difference(positive_data.columns)
print(len(diff_columns))
diff_columns

57


Index(['country_112', 'country_12', 'country_152', 'country_156',
       'country_158', 'country_170', 'country_196', 'country_218',
       'country_233', 'country_268', 'country_275', 'country_276',
       'country_288', 'country_31', 'country_32', 'country_344', 'country_356',
       'country_36', 'country_368', 'country_392', 'country_398',
       'country_400', 'country_410', 'country_417', 'country_422',
       'country_434', 'country_458', 'country_48', 'country_484',
       'country_504', 'country_51', 'country_528', 'country_554',
       'country_566', 'country_586', 'country_604', 'country_608',
       'country_616', 'country_634', 'country_642', 'country_643',
       'country_646', 'country_702', 'country_705', 'country_710',
       'country_716', 'country_724', 'country_752', 'country_76',
       'country_764', 'country_780', 'country_788', 'country_792',
       'country_804', 'country_840', 'country_858', 'country_860'],
      dtype='object')

In [18]:
len(data.columns)

384

There are 79267 rows and 384 columns in our data after creating the dummy variables. The get_dummies function added 57 columns (for 58 unique countries) in our data.

## 2 Implement Cross-Validation (40 pt)
Now it's time to write your own code that does k-fold CV. I recommend to go the following path:
1. (3pt) Make it as a function that takes k, the (unfitted) model, features X and the target y as arguments.
2. (10pt) Next, one should randomly shuffle the data. However, it is easier to generate a list of indices, and shuffle those randomly.
3. (25pt) Loop the following k times
  <br> (a). Select every k-th of your indices for validation data
  <br>  (b). For training data, select all indices, except those that went into validation data. Hint: check out set operations
   <br> (c). Separate the data X and the target y into training/validation parts. 
   <br> (d) Fit the model on training data
   <br> (e). Predict outcome on validation data
   <br> (f). Compute the resulting statistic (you may compute more than one).
4. (2pt) Finally, return mean of the statistics.
<br> Note: This is my suggested path but you may follow another one.

In [19]:
## Testing this function for a small dataset
X= data.iloc[:50,:].drop(columns='abortion')
y = data.iloc[:50].abortion

In [20]:
indices = np.array([i for i in range(len(X))])
np.random.shuffle(indices)
print(indices)

[18 14  5 39 46 32 49 30 28  8 42 13 47 33  7 26 38 48 19 37 23 31 27 24
 10 16 17  2 15 44 11  1 34  4 35  3 21 22 41 45  9 43 29 20  0 25 40  6
 12 36]


In [23]:
## This function returns validation and training indices
def valadationDataset(fold_size, indices, k):
    val_indices = indices[((k-1)*fold_size):(k*fold_size)]
    train_indices = np.setdiff1d(indices, val_indices)
    val_indices = np.array(val_indices)
    return (val_indices, train_indices)

In [24]:
##Testing the above function
(val_indices, train_indices) = valadationDataset(10, indices, 5)
print(train_indices)
print(val_indices)
# X.iloc[train_indices,:]

[ 1  2  3  4  5  7  8 10 11 13 14 15 16 17 18 19 21 22 23 24 26 27 28 30
 31 32 33 34 35 37 38 39 41 42 44 45 46 47 48 49]
[ 9 43 29 20  0 25 40  6 12 36]


In [27]:
def k_fold(k, model, X, y, statistic):
    results = {}
    ## If given  more than 1 metric
    for stat in statistic:
        results[str(stat)] = []
    indices = np.array([i for i in range(len(X))])
    np.random.shuffle(indices)
    group_size = int(len(X)/k)
    for i in range(1, k+1):
        ##Selecting indices which would go into validation and training sets
        (val_indices, train_indices) = valadationDataset(group_size, indices, i)
        
        ## Creating validation set
        validation_X = X.iloc[val_indices,:]
        validation_y = y.iloc[val_indices]
        
        ## Creating training set
        train_X = X.iloc[train_indices,:]
        train_y = y.iloc[train_indices]
        
        model_fit = model.fit(train_X, train_y)
        y_pred = model_fit.predict(validation_X)
        for stat in statistic:
            results[str(stat)].append(stat(validation_y, y_pred)) 
         
    stat_means = []
    for i in range(len(statistic)):
        stat_means.append(np.mean(results[str(statistic[i])]))
    return stat_means

In [28]:
##Testing the function with KNN model with k = 5 and 3-fold validation
model_kNN = KNeighborsClassifier(n_neighbors= 5)
k_fold(3, model_kNN, X, y, [accuracy_score, f1_score])

  'precision', 'predicted', average, warn_for)


[0.8958333333333334, 0.0]

## 3 Find the best model (40)
In this section your task is to find which model: k-NN, logistic regression, or SVM works best. You will evaluate the model performance using 5-fold cross-validation with accuracy and F-score as the metric. And unlike in all your future work, here you will use your own CV impementation!
k-NN and SVM are sensitive to the distance metric, so you may also try to normalized versus non- normalized features. Check out sklearn.preprocessing.normalize. Logistic regression is agnostic with respect to the metric, but may benefit from more similar variable values for numerical reasons.
Some of the methods (k-NN, SVM) are slow to compute, so you may start with a subset of data (say, 5000 random lines only). If everything turns out fine, you increase the data size as far as your computer can go.
### 3.1 k-NN (13pt)
First, use k-NN and experiment with a few different k-s.
1. (2pt) Separate your training data into X (features), and y (target). Target will be the abortion variable, X are all the other features.


In [29]:
data2 = data.dropna()

In [30]:
## Target variable y
y = data2.abortion
## Features X
X = data2.drop(columns='abortion')

2. (2pt) pick a k and set up the k-NN model. Use your freshly-minted CV routine to cross-validate accuracy and F-score of your k-NN model.

In [31]:
import sklearn.model_selection as ms

In [32]:
models_accuracy = []
models_f1 = []

In [33]:
# Creating a kNN model with k = 3
model_kNN = KNeighborsClassifier(n_neighbors= 3)

In [34]:
## Performing 5 fold cross-validation 
results_model1 = k_fold(5, model_kNN, X, y, [accuracy_score, f1_score])
results_model1

[0.7962152274017535, 0.7134124152713449]

In [35]:
models_accuracy.append(results_model1[0])
models_f1.append(results_model1[1])

3. (5pt) Try a few different k-NN models (pick different k, choose to normalize/not-to-normalize your features).

In [36]:
##Trying the same model with normalization
## Normalizing the features
import sklearn.preprocessing as prep
Xnorm = prep.normalize(X, norm='l2', axis=1, copy=True, return_norm=False)


In [39]:
Xnorm = pd.DataFrame(Xnorm)

In [40]:
results_model2 = k_fold(5, model_kNN, Xnorm, y, [accuracy_score, f1_score])
results_model2

[0.797880527344982, 0.7205424354400936]

In [41]:
models_accuracy.append(results_model2[0])
models_f1.append(results_model2[1])

The accuracy and f1-score has slightly improved. Thus, keeping the data normalized.

In [42]:
##Trying the a different kNN model with k = 5
model_kNN2 = KNeighborsClassifier(n_neighbors= 5)

In [43]:
results_model2 = k_fold(5, model_kNN2, Xnorm, y, [accuracy_score, f1_score])
results_model2

[0.8056393111713872, 0.7282845620448672]

In [45]:
models_accuracy.append(results_model2[0])
models_f1.append(results_model2[1])

In [48]:
##Creating a new model with k = 8
model_kNN3 = KNeighborsClassifier(n_neighbors= 8)
results_model3 = k_fold(5, model_kNN3, Xnorm, y, [accuracy_score, f1_score])
results_model3

[0.782, 0.6992984014954644]

In [49]:
models_accuracy.append(results_model3[0])
models_f1.append(results_model3[1])

In [50]:
##Creating a new model with k = 10
model_kNN4 = KNeighborsClassifier(n_neighbors= 10)
results_model4 = k_fold(5, model_kNN3, Xnorm, y, [accuracy_score, f1_score])
results_model4

[0.78105, 0.6987364305969128]

In [51]:
models_accuracy.append(results_model4[0])
models_f1.append(results_model4[1])

4. (4pt) Present the results from your best k-NN model. Note: as you are using two metrics here, you may end up with different models performing better according to different measures.

In [53]:
models = ['k=3_no_norm', 'k=3', 'k=5', 'k=8', 'k=10']
##For storing results of these models
results = pd.DataFrame(columns= ['model', 'accuracy', 'f1'])

results['model'] = models
results['accuracy'] = models_accuracy
results['f1'] = models_f1
results['model_type'] = 'kNN'

In [54]:
results

Unnamed: 0,model,accuracy,f1,model_type
0,k=3_no_norm,0.796215,0.713412,kNN
1,k=3,0.797881,0.720542,kNN
2,k=5,0.805639,0.728285,kNN
3,k=8,0.782,0.699298,kNN
4,k=10,0.78105,0.698736,kNN


The best model is the fourth one where k = 5. I achieved an accuracy of 0.806 and and f1-score of 0.728.

### 3.2 Logistic Regression (9pt)
1. Now repeat the process above with logistic regression. As we have a myriad of features anyway, we are not going to do any feature engineering. Just a plain logistic regression.

In [55]:
import sklearn.linear_model as lm

In [56]:
##Here using normalized data only
model_log = lm.LogisticRegression()

In [59]:
results_log = k_fold(5, model_log, X, y, [accuracy_score, f1_score])
results_log

[0.8134, 0.7492451244297966]

### 3.3 SVM (15pt)
Now repeat the process with support vector machines while choosing between a few different kernels and kernel options, such as degree for polynomial kernels.
Hint: I have mixed experience with sklearn version of SVM. I recommend to limit the number of iterations, initially maybe to just 1000, in order to ensure your model actually terminates.
1. (14pt) pick a kernel and repeat the process above.
Note that some kernels are slower than others, so be careful.

In [60]:
from sklearn.svm import SVC

In [61]:
model_svm = SVC(gamma='auto', kernel= 'linear', max_iter = 100)

In [62]:
results_svm = k_fold(5, model_svm, Xnorm, y, [accuracy_score, f1_score])
results_svm



[0.46105, 0.5605644405615481]

In [63]:
model_svm2 = SVC(gamma='auto', max_iter = 100)

In [64]:
results_svm2 = k_fold(5, model_svm2, Xnorm, y, [accuracy_score, f1_score])
results_svm2



[0.49195, 0.5715969322455511]

2. (2pt) If your models worked like mine, you may have noticed that while accuracy seems all right, precision and recall are rather low. Explain what does such a phenomenon mean.


In my case, accuarcy is also fairly low.
A high accuracy with low precision and recall indicates that the data maybe imbalanced. In other words, our data doesn't have a balanced number of cases for positive and negative classes. 

### 3.4 Compare the models (3pt)
1. (2pt) Finally, compare the models. Which ones performed the best in terms of accuracy? Which ones in terms of F-score? Did you encounter other kind of issues with certain models? Which models were fast and which ones slow?


In [None]:
In terms of accuracy kNN works better.
SVM models took the longest time (more than 30 minutes) while logistic regression was the fastest.

2. (1pt) If you have to repeat the exercise with a single model (and you have, see below), which one will you pick?


I would go with my the kNN model with k = 8 because it performed better in terms of both accuracy and f1_score. Also, it was way faster than the SVMs.

## 4 How large a role does country play? (20pt)
Here we switch from machine learning to social sciences. Public opinion differs from country to country, but also inside the countries. Does the fact that we include country code in data help us to substantially improve the predictions?
<br> You pick the best ML method from above. You estimate two sets of models: one with country information included, and one where it is removed. Is the former noticeably better than the latter?
1. (10pt) Pick your best ML method based you designed above. Cross-validate the accuracy of abortion variable using all the features, including country dummies and report the accuracy. Essentially you repeat here what you did above, so you can also just copy the result from above.

2. (15pt) Now remove all the country dummies, but keep the other variables intact. And repeat.

In [67]:
##positive_data is the dataset before i created dummy variables
new_data = positive_data.drop(columns = 'country')

In [68]:
## Target variable y
y = new_data.abortion
## Features X
X = new_data.drop(columns='abortion')

In [72]:
##Using log model
results_log2 = k_fold(5, model_log, X, y, [accuracy_score, f1_score])
results_log2

[1.0, 1.0]

3. (5pt) Comment what you found. Does country information help to noticeably improve the prediction?

Removing country information increases the model performance considerably. Thus,in my case, latter is considerably better than formal.