# Class 2: Solution

## Data

In this exercise, we will work with data from Kayser et al. (2023) and the article "Coalition inclusion probabilities: a party-strategic measure for predicting policy and politics".

The paper develops a method to estimate coalition inclusion probabilities (CIPs). The idea is to predict the probability of every possible coalition that could form at each time point *t* in a range of countries. The probabilities take public opinion, polls, and bargaining leverage into account. To arrive at a party's probability of entering government, the probabilities of the different coalition combinations are then summarized by party.

## Question 1: Working with Pandas Dataframes

#### Exercise 1.0: Importing Pandas 

In [None]:
# Start by importing pandas as pd

#### Solution 1.0

In [None]:
import pandas as pd

#### Exercise 1.1: Reading data

Read in the dataset called `CIP_static_Denmark.csv`. 

*Hint:* Make sure to specify the correct filepath and the correct delimiter (done using the `sep` parameter). You can use the `os` module to specify your working directory to be able to use relative paths.
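
To see why the delimiter matters, here is a minimal sketch on an in-memory stand-in for a semicolon-delimited file. The column `seats` and the values are made up for illustration and are not taken from `CIP_static_Denmark.csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for a semicolon-delimited file (made-up values)
raw = "party_abbr;year;seats\nV;1997;42\nSd;1997;63\n"

# With the default sep=',' every row ends up in a single column
wrong = pd.read_csv(io.StringIO(raw))

# With sep=';' the three columns are split as intended
right = pd.read_csv(io.StringIO(raw), sep=';')

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 3)
```

A quick look at `.shape` or `.columns` right after reading is usually enough to catch a wrong delimiter.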

#### Solution 1.1

In [None]:
# Import os here and change working directory so that you can load in the data using a relative path
import os
os.chdir('C:/Users/au535365/Dropbox/teaching/css_fall2023')

In [None]:
fpath = 'data/CIP_static_Denmark.csv'
df = pd.read_csv(fpath, sep=';')

#### Exercise 1.2: Inspecting the data

- Inspect the top rows of the dataframe
- Get info about the dataframe

#### Solution 1.2

In [None]:
# Print the top rows
df.head()

In [None]:
# Get info on the type of columns, the shape and so on
df.info()

#### Exercise 1.3
We only want to keep CIPs from 1997 and onwards. Filter away any rows before 1997. 

*Hint:* Pay attention to the type of column. 

#### Solution 1.3

In [None]:
df = df.loc[df['year'] >= 1997]

#### Exercise 1.4

We only want to keep a subset of parties. This can be done using the `party_abbr` column. 

Keep only parties abbreviated as A, DF, En-O, KF, NLA, NB, RV, Sd, SF, and V. 

Remember to reset indices.

*Hint:* Store the abbreviations in a list. Remember to enclose each element as a string. Using A results in NameError, but 'A' is accepted.
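
The filtering pattern and the effect of resetting the index can be sketched on a toy dataframe. The codes `'X'` and `'Y'` and the name `toy` are made up for illustration.

```python
import pandas as pd

# Toy data: 'X' and 'Y' are made-up codes that should be dropped
toy = pd.DataFrame({'party_abbr': ['A', 'X', 'V', 'Y', 'Sd'],
                    'year': [2019, 2019, 2019, 2019, 2019]})
keep = ['A', 'V', 'Sd']

# .isin() returns a boolean Series that .loc uses as a row mask
filtered = toy.loc[toy['party_abbr'].isin(keep)]
print(filtered.index.tolist())        # [0, 2, 4]

# reset_index(drop=True) renumbers the rows and discards the old labels
filtered_reset = filtered.reset_index(drop=True)
print(filtered_reset.index.tolist())  # [0, 1, 2]
```

Without `drop=True`, the old row labels would be kept as an extra column called `index`.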

#### Solution 1.4

In [None]:
parties = ['A', 'DF', 'En-O', 'KF', 'NLA', 'NB', 'RV', 'Sd', 'SF', 'V']

In [None]:
df = df.loc[df['party_abbr'].isin(parties)].reset_index(drop=True)

#### Exercise 1.5

We are not happy with the current naming of the abbreviations. Below, you find a dictionary that creates the desired replacements.

*Hint*: The `.apply()` method does not work here. We need to use the `.map()` method. Give the replacement dictionary as input to `.map()`.
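
One detail worth knowing about `.map()` with a dictionary: values that are missing from the dictionary become `NaN`. The sketch below uses a shortened mapping and a made-up code `'ZZ'` to show this.

```python
import pandas as pd

# 'ZZ' is a made-up code that is not a key in the mapping
codes = pd.Series(['A', 'Sd', 'En-O', 'ZZ'])
mapping = {'A': 'ALT', 'Sd': 'S', 'En-O': 'EL'}

mapped = codes.map(mapping)
print(mapped.tolist()[:3])   # ['ALT', 'S', 'EL']
print(mapped.isna().sum())   # 1
```

This is one reason to filter the parties first: every remaining abbreviation then has an entry in the dictionary.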

In [None]:
replacement_dict = {"A": "ALT",
                    "DF": "DF",
                    "En-O": "EL",
                    "KF": "KF",
                    "NLA": "LA",
                    "NB": "NB",
                    "RV": "RV",
                    "Sd": "S",
                    "SF": "SF",
                    "V": "V"}

#### Solution 1.5

In [None]:
# Write solution here:
df['party_abbr'] = df['party_abbr'].map(replacement_dict)

#### Exercise 1.6
Now we are happy with the current state of the data. Now we want to compute simple descriptive statistics to get to know our data even better.

However, when doing it we encounter problems. We realize that `pr_ingov_mean`, the main variable of interest, is encoded as a string.

Before we go ahead, let's see what's happening:

- Compute the mean of the `pr_ingov_mean` column (this is CIP for party $i$ and time $t$)


*Hint*: If you encounter problems, it is likely because `pr_ingov_mean` is not a numerical value. 

The problem is that `pr_ingov_mean`, which is supposed to be numerical, is stored as a string. That alone would be easy to handle, but the strings use ',' as the decimal separator. For instance, `float('1')` is perfectly fine, but `float('1,1')` raises a `ValueError`. Try it yourself.

To solve this, we need to replace the ',' in the numerical string. We replace it with a dot '.'

*Hint*: Use `.apply()` combined with a lambda function using a `.replace()` method. 
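
The failure and the fix can be sketched on a single string and on a small Series. The values are made up, and the vectorized `.str.replace()` call is shown as an alternative to `.apply()` with a lambda, not as the required solution.

```python
import pandas as pd

raw = '0,75'

# float() does not accept a decimal comma
try:
    float(raw)
    parsed_directly = True
except ValueError:
    parsed_directly = False

# Replacing the comma with a dot makes the string parseable
fixed = float(raw.replace(',', '.'))
print(parsed_directly, fixed)  # False 0.75

# The same idea applied to a whole column, using the string accessor
converted = (pd.Series(['0,75', '0,5'])
             .str.replace(',', '.', regex=False)
             .astype(float))
print(converted.tolist())      # [0.75, 0.5]
```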

#### Solution 1.6

In [None]:
# Compute mean of pr_ingov_mean - you should get an error message
df['pr_ingov_mean'].mean()

In [None]:
# Use .apply() and a lambda function to replace , with . in the string
df['pr_ingov_mean'] = df['pr_ingov_mean'].apply(lambda x: x.replace(',', '.'))

#### Exercise 1.7

Now that you have replaced `,` with `.`, we want to convert the string to a numerical object. This can be done using type casting. Since we have no NaNs, it is straightforward. 

*Hint*: Use the `.astype()` method, where you specify the type you want to cast to. In this case, it is a float. Remember to overwrite the original column in the dataframe.
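
As an aside, the replace-and-cast steps can often be avoided at read time with the `decimal` parameter of `pd.read_csv()`. The sketch below uses made-up values in an in-memory file, so treat it as an illustration rather than a description of the actual dataset.

```python
import io
import pandas as pd

# Made-up semicolon-delimited data with decimal commas
raw = "party_abbr;year;pr_ingov_mean\nV;1997;0,45\nSd;1997;0,52\n"

# decimal=',' tells the parser to treat the comma as the decimal separator
parsed = pd.read_csv(io.StringIO(raw), sep=';', decimal=',')

print(parsed['pr_ingov_mean'].dtype)     # float64
print(parsed['pr_ingov_mean'].tolist())  # [0.45, 0.52]
```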

#### Solution 1.7

In [None]:
# Use type casting to convert string to a float object
df['pr_ingov_mean'] = df['pr_ingov_mean'].astype(float)

#### Exercise 1.8
Finally - we are ready to compute some descriptive statistics.

Compute the:
- mean
- median
- standard deviation

of the `pr_ingov_mean` column. 

*Hint:* Use the `.describe()` method to get all stats at once.

#### Solution 1.8

In [None]:
# Compute descriptive statistics on the pr_ingov_mean column 
df['pr_ingov_mean'].describe()

#### Exercise 1.9
We now want to decompose the CIPs by party and year. We are still interested in `pr_ingov_mean`.

How do you achieve that?

*Hint:* To flatten the indices of the resulting dataframe, use the `.reset_index()` method. 
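
What "flattening the indices" means can be sketched on a toy dataframe with made-up values: grouping by two columns produces a two-level index, and `.reset_index()` turns those levels back into ordinary columns.

```python
import pandas as pd

# Toy data with made-up probabilities
toy = pd.DataFrame({
    'party_abbr': ['S', 'S', 'V', 'V', 'V'],
    'year': [2000, 2000, 2000, 2001, 2001],
    'pr_ingov_mean': [0.2, 0.4, 0.6, 0.5, 0.7],
})

# Grouping by two keys gives a MultiIndex with two levels
summary = toy.groupby(['party_abbr', 'year'])['pr_ingov_mean'].describe()
print(summary.index.nlevels)       # 2

# reset_index() moves the group keys back into regular columns
flat = summary.reset_index()
print(flat.columns.tolist()[:4])   # ['party_abbr', 'year', 'count', 'mean']
```

Having `party_abbr` and `year` as regular columns is what makes the later plotting and `.to_numpy()` steps simple.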

#### Solution 1.9

In [None]:
# Your solution here:
grouped_df = df.groupby(['party_abbr','year'])['pr_ingov_mean'].describe().reset_index()

#### Exercise 1.10
We want to plot the results of the `grouped_df` for each party over time. Make a plot using the code provided in the `class2-tutorial` notebook.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme()

#### Solution 1.10

In [None]:
# Make the plot here
fig, ax = plt.subplots()
for k, v in grouped_df.groupby('party_abbr'):
    v.plot(x='year', y='mean', label=k, ax=ax, marker='o', markersize=3)
plt.xlabel('Year', fontsize=10)
plt.ylabel('Mean CIP', fontsize=10)
plt.xticks(size=10)
plt.yticks(size=10)
plt.legend(frameon=False, fontsize=8, loc='upper right', ncol=3, bbox_to_anchor=(1.02, 1.0))
plt.show()

## Question 2: NumPy Arrays

We now turn to NumPy arrays. We continue working with the CIPs. 

#### Exercise 2.1
Start by importing NumPy

#### Solution 2.1

In [None]:
# Import numpy as np here
import numpy as np

#### Exercise 2.2
The first thing we do is to convert a pandas column to a NumPy array to have some data to work with.

Continue to use the dataframe `grouped_df`. Make two arrays, one called `features` and another called `labels`. 

The first array, `features`, should be an array of the `mean` column in the dataframe used to make the plot. 

The second array, `labels`, should be an array of the `party_abbr` column in the dataframe used to make the plot.

*Hint:* We can do this using the `.to_numpy()` method. Alternatively, you can simply use `np.array()`

#### Solution 2.2

In [None]:
# One solution
features = grouped_df['mean'].to_numpy()
labels = grouped_df['party_abbr'].to_numpy()

In [None]:
# Another solution
features = np.array(grouped_df['mean'])
labels = np.array(grouped_df['party_abbr'])

#### Exercise 2.3

Compute the shape of the two arrays, `features` and `labels`. What are their dimensions?

#### Solution 2.3

In [None]:
# Compute shape
features.shape, labels.shape

#### Exercise 2.4
We only want to work with the grand old parties. 
* 1) Filter the arrays based on the list `parties` given below. 
* 2) When this is done, recompute the shape of the filtered array. 
* 3) Make sure that the parties appear the same number of times. You can test this using `np.unique(X, return_counts=True)`, where `X` should be replaced with your array.

*Hint:* Use the `np.isin()` function. The syntax is `np.isin(element, test_elements)` where `element` should be the array and `test_elements` should be the list given below. The function returns an array with boolean values, which can be thought of as a mask. Use this to filter the arrays. Remember to filter both `features` and `labels`.
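
The masking idea can be sketched on two short, made-up arrays. The names `toy_labels` and `toy_values` are placeholders for illustration.

```python
import numpy as np

# Made-up labels and values of equal length
toy_labels = np.array(['S', 'S', 'DF', 'V', 'V', 'DF'])
toy_values = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

# Boolean mask: True where the label is in the list
mask = np.isin(toy_labels, ['S', 'V'])
print(mask.tolist())    # [True, True, False, True, True, False]

# The same mask filters both arrays, keeping them aligned
kept_labels = toy_labels[mask]
kept_values = toy_values[mask]

# Count how often each remaining label occurs
names, counts = np.unique(kept_labels, return_counts=True)
print(names.tolist(), counts.tolist())  # ['S', 'V'] [2, 2]
```

Applying one mask to both arrays is what keeps each value paired with its label.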

In [None]:
parties = ['KF', 'S', 'V', 'RV', 'SF']

#### Solution 2.4

In [None]:
# Generate mask
mask = np.isin(labels, parties)

In [None]:
# Filter
features = features[mask]
labels = labels[mask]

In [None]:
# Recompute shape
features.shape, labels.shape

In [None]:
# Use np.unique
np.unique(labels, return_counts=True)[1]

#### Exercise 2.5

Now that all parties appear an equal number of times, we are able to reshape the data. 

We want to reshape the array such that each row corresponds to a party and each column is a yearly observation. 

What should the resulting dimension be?

Save it in objects called `features_reshaped` and `labels_reshaped`.
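
A small sketch of how `.reshape()` fills rows, using made-up arrays. It assumes the flat array is ordered party by party, which holds for `grouped_df` because `groupby` sorts the group keys by default. If the data were ordered differently, the rows would mix parties.

```python
import numpy as np

# Made-up flat arrays, ordered party by party (three observations each)
toy_values = np.arange(6)
toy_labels = np.array(['S', 'S', 'S', 'V', 'V', 'V'])

# -1 lets NumPy infer the second dimension: 6 elements / 2 rows = 3 columns
values_2d = toy_values.reshape(2, -1)
labels_2d = toy_labels.reshape(2, -1)

print(values_2d.shape)        # (2, 3)
print(values_2d.tolist())     # [[0, 1, 2], [3, 4, 5]]
print(labels_2d.tolist())     # [['S', 'S', 'S'], ['V', 'V', 'V']]
```

Reshaping `labels` alongside `features` is a cheap way to verify that each row really contains a single party.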

#### Solution 2.5

In [None]:
# Reshape arrays: Input the correct dimensions in the parentheses

# One solution
features_reshaped = features.reshape(5, -1)
labels_reshaped = labels.reshape(5, -1)

# Alternative solution
features_reshaped = features.reshape(5, 10)
labels_reshaped = labels.reshape(5, 10)

#### Exercise 2.6

Check the dimensions of the arrays now. Are they as intended?

#### Solution 2.6

In [None]:
features_reshaped.shape, labels_reshaped.shape

#### Exercise 2.7

We want to figure out each party's highest CIP at any point in time. 

Do this using the `np.max()` function. Pay attention to the dimension of the output. Is it as intended? What's the problem?

#### Solution 2.7

In [None]:
# Compute the maximum CIP for each party using np.max()
np.max(features_reshaped)

#### Exercise 2.8

`np.max()` works, but we do not get a value for each party. Instead, it returns the pooled maximum over all entries, whereas our intended output has shape (5,), one value for each party. This happens because we need to tell the function whether we want the row-wise or the column-wise maximum. If nothing is specified, it returns the pooled maximum. 

We can control whether we get a pooled, row-wise, or column-wise maximum using the `axis` argument, which for a two-dimensional array is either $0$ or $1$. Try computing the maximum value using first $0$ and then $1$. Which one is the correct version when we want to get each party's highest probability? Save the two results in objects called `max_axis0` and `max_axis1`, respectively. 
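
The effect of `axis` is easiest to see on a small, made-up array with two rows and three columns:

```python
import numpy as np

# Made-up 2 x 3 array: think of rows as parties and columns as years
grid = np.array([[0.1, 0.9, 0.3],
                 [0.4, 0.2, 0.8]])

pooled = np.max(grid)            # single value over all entries
per_column = np.max(grid, axis=0)  # collapses the rows: one value per column
per_row = np.max(grid, axis=1)     # collapses the columns: one value per row

print(pooled)               # 0.9
print(per_column.tolist())  # [0.4, 0.9, 0.8]
print(per_row.tolist())     # [0.9, 0.8]
```

A useful rule of thumb is that `axis` names the dimension that disappears, so with parties in rows, `axis=1` gives one value per party.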

#### Solution 2.8

In [None]:
# Axis 0
max_axis0 = np.max(features_reshaped, axis=0)

In [None]:
# Axis 1
max_axis1 = np.max(features_reshaped, axis=1)

#### Exercise 2.9

We can see from the output that the highest probability is the last index ([4] or [-1]). 

We want to figure out which party this corresponds to. Of course we can do this manually, but often we want an automated solution.

To do this, we can use the `np.argmax()` function on the `max_axis1` object to return the index of the highest probability. This will return 4 in this case. 

Use the index to figure out which party we are talking about by filtering the `labels_reshaped` using the output from the argmax.

If you have done it correctly, it should return an array containing only 'V' (Venstre). 
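
The lookup can be sketched with made-up numbers: `np.argmax()` gives the position of the largest value, and that position selects the matching row of labels.

```python
import numpy as np

# Made-up row-wise maxima and the matching 2-D label array
row_max = np.array([0.3, 0.9, 0.5])
label_grid = np.array([['KF', 'KF'],
                       ['S', 'S'],
                       ['V', 'V']])

best = np.argmax(row_max)      # position of the largest value
print(best)                    # 1
print(label_grid[best].tolist())  # ['S', 'S']

# Taking the first element of the row gives the label as a single string
print(label_grid[best, 0])     # S
```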


#### Solution 2.9

In [None]:
# Write your solution here:
labels_reshaped[np.argmax(max_axis1)]

#### Exercise 2.10
We now return to our original `labels` object with shape (50,). 

The workflow with two arrays called `features` and `labels` is very common in machine learning applications. Many algorithms in Python expect two-dimensional arrays. In most cases the `features` array already has more than one dimension. This is not the case here, where both arrays are one-dimensional, and passing them on as they are would cause problems. 

For the example here, we only reshape `labels` to shape (50, 1). Save the reshaped array in an object called `labels_new`. 

Verify that the result is correct using `.shape`. 

*Hint:* This can be done using either the `.reshape()` method or `np.newaxis` (an indexing object, not a function). Use the latter in this case.
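
A short sketch on a made-up array shows that the two approaches give the same column vector. `np.newaxis` is simply an alias for `None`, so it is used inside the square brackets rather than called.

```python
import numpy as np

flat = np.array(['S', 'V', 'RV'])   # shape (3,)

col_a = flat[:, np.newaxis]         # add a second axis of length 1
col_b = flat.reshape(-1, 1)         # same result, with -1 inferring the rows

print(col_a.shape, col_b.shape)     # (3, 1) (3, 1)
print(np.array_equal(col_a, col_b)) # True
print(np.newaxis is None)           # True
```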

#### Solution 2.10

In [None]:
# Reshape array using np.newaxis
labels_new = labels[:, np.newaxis]

In [None]:
# Verify shape 
labels_new.shape

In [None]:
# Alternative solution
labels_new = labels.reshape(labels.shape[0], 1)           # I use labels.shape[0] to get the number of rows.
labels_new.shape

## Question 3: Writing and Reading Files

#### Exercise 3.1
We now want to write and read some files to get familiar with typical file formats. 

Start by writing the newly reshaped array `labels_new` to a numpy file.

This can be done using both `np.save()` and `np.savetxt()`. Try using both. Can you use both? If not, why?

When you have saved it, read in the saved file using the `np.load()` function. Save the data in an object called `labels_loaded`. 

*Hint:* You probably get an error when reading the file again. Read the error message and adapt your code. It should be straightforward.
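
The behaviour behind the questions above can be sketched with a temporary directory and a made-up array. The sketch assumes the labels are an object-dtype array of strings, which is what a string column converted from pandas typically is. The file names are placeholders.

```python
import os
import tempfile
import numpy as np

# Made-up object-dtype column vector of strings
obj_labels = np.array(['S', 'V', 'RV'], dtype=object)[:, np.newaxis]

with tempfile.TemporaryDirectory() as tmp:
    npy_path = os.path.join(tmp, 'labels.npy')
    txt_path = os.path.join(tmp, 'labels.txt')

    # np.save stores object arrays by pickling them
    np.save(npy_path, obj_labels)

    # np.load refuses pickled data unless allow_pickle=True is passed
    try:
        np.load(npy_path)
        loaded_without_pickle = True
    except ValueError:
        loaded_without_pickle = False
    reloaded = np.load(npy_path, allow_pickle=True)

    # np.savetxt defaults to a numeric format, which fails for strings
    try:
        np.savetxt(txt_path, obj_labels)
        savetxt_default_ok = True
    except TypeError:
        savetxt_default_ok = False

    # With a string format the text file can be written and read back
    np.savetxt(txt_path, obj_labels, fmt='%s')
    from_text = np.loadtxt(txt_path, dtype=str)

print(loaded_without_pickle, savetxt_default_ok)  # False False
print(reloaded.shape, from_text.tolist())         # (3, 1) ['S', 'V', 'RV']
```

Only use `allow_pickle=True` on files you created yourself or otherwise trust, since unpickling can execute arbitrary code.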

In [None]:
# SETUP
import os
import platform

path_to_folder = 'Dropbox/teaching/css_fall2023'

if platform.system() == 'Linux':
    base_dir = '/home/rask/'
else:
    base_dir = 'C:/Users/au535365/'

base_dir = os.path.join(base_dir, path_to_folder)

#### Solution 3.1

In [None]:
# np.save to a filename called 'labels_new_save.npy'
np.save(os.path.join(base_dir, 'data/class2/labels_new_save.npy'), labels_new) 

In [None]:
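# np.savetxt to a filename called 'labels_new_savetxt.npy'
# Note: with the default numeric format this raises a TypeError for string labels; passing fmt='%s' writes them as text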
np.savetxt(os.path.join(base_dir, 'data/class2/labels_new_savetxt.npy'), labels_new) 

In [None]:
# np.load the file 'labels_new_save.npy' (allow_pickle=True is needed because the labels are stored as Python objects)
labels_loaded = np.load(os.path.join(base_dir, 'data/class2/labels_new_save.npy'), allow_pickle=True) 

#### Exercise 3.2

We forgot to write our Pandas dataframe `grouped_df`. 

Write the dataframe to a `.csv` file using the `.to_csv()` method. Remember to specify `index=False`. 

When this is done, read the file back in. Save it in an object called `grouped_df_loaded`.
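
Why `index=False` matters can be sketched with in-memory buffers instead of files. The dataframe and its values are made up for illustration.

```python
import io
import pandas as pd

toy = pd.DataFrame({'party_abbr': ['S', 'V'], 'mean': [0.4, 0.6]})

# Default: the row index is written as an extra, unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
back_with = pd.read_csv(with_index)

# index=False: only the actual columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
back_without = pd.read_csv(without_index)

print(back_with.columns.tolist())     # ['Unnamed: 0', 'party_abbr', 'mean']
print(back_without.columns.tolist())  # ['party_abbr', 'mean']
```

With `index=False`, the dataframe read back has the same columns as the one that was written.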

#### Solution 3.2

In [None]:
# Write to csv
grouped_df.to_csv(os.path.join(base_dir, 'data/class2/partyyear-CPI.csv'), index=False)

In [None]:
# Read as csv
grouped_df_loaded = pd.read_csv(os.path.join(base_dir, 'data/class2/partyyear-CPI.csv'))