# Candidate Test 2022 Analysis Part 1

This exercise focuses on the candidate tests from two television networks: DR and TV2. Data from both tests have been given on a scale of five responses (-2, -1, 0, 1, 2).

---

There are 6 datasets included in this exercise:

- `alldata.xlsx`: Contains responses from both TV stations.
- `drdata.xlsx`: Contains responses from DR.
- `drq.xlsx`: Contains questions from DR.
- `tv2data.xlsx`: Contains responses from TV2.
- `tv2q.xlsx`: Contains questions from TV2.
- `electeddata.xlsx`: Contains responses from both TV stations for candidates who were elected to the parliament. Note that 9 members are missing; 7 of them didn't take any of the tests. Additionally, some notable figures like Mette F. and Lars Løkke did not participate in any of the tests.

---

It's entirely up to you how you approach this data, but at a *minimum*, your analysis should include:
- Age of the candidates grouped by parties.
- An overview of the most "confident" candidates, i.e., those with the highest proportion of "strongly agree" or "strongly disagree" responses.
- Differences in responses between candidates, both inter-party and intra-party, along with an explanation of which parties have the most internal disagreements.
- Classification models to predict candidates' party affiliations. Investigate if there are any candidates who seem to be in the "wrong" party based on their political landscape positions. You must use the following algorithms: **Decision Tree**, **Random Forest** and **Gradient Boosted Tree**, i.e., a total of 3 models are to be trained.

---

The following parties are represented:

| Party letter | Party name | Party name (English) | Political position |
| :-: | :-: | :-: | :-: |
| A | Socialdemokratiet | Social Democrats | Centre-left |
| V | Venstre | Danish Liberal Party | Centre-right |
| M | Moderaterne | Moderates | Centre-right |
| F | Socialistisk Folkeparti | Socialist People's Party | Left-wing |
| Æ | Danmarksdemokraterne | Denmark Democrats | Right-wing |
| I | Liberal Alliance | Liberal Alliance | Right-wing |
| C | Konservative | Conservative People's Party | Right-wing |
| Ø | Enhedslisten | Red-Green Alliance | Far-left |
| B | Radikale Venstre | Social Liberal Party | Centre-left |
| D | Nye Borgerlige | New Right | Far-right |
| Å | Alternativet | The Alternative | Centre-left |
| O | Dansk Folkeparti | Danish People's Party | Far-right |
| Q | Frie Grønne | Free Greens | Centre-left |
| K | Kristendemokraterne | Christian Democrats | Centre-right |

Below you can see the results and the colors chosen to represent the parties. Use these colors in your analysis above.

![Alt text](image-1.png)


Others have undertaken similar analyses. You can draw inspiration from the following (use Google Translate if your Danish is rusty):

- [Analysis of where individual candidates stand relative to each other and their parties](https://v2022.dumdata.dk/)
- [Candidate Test 2022 – A deep dive into the data](https://kwedel.github.io/kandidattest2022/)
- [The Political Landscape 2019](https://kwedel.github.io/kandidattest2019/)



In [None]:
# Age of the candidates grouped by parties

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data (the datasets are listed as .xlsx above; this assumes CSV exports with the same base names)
all_data = pd.read_csv('alldata.csv')
dr_data = pd.read_csv('drdata.csv')
drq_data = pd.read_csv('drq.csv')
elected_data = pd.read_csv('electeddata.csv')
tv2_data = pd.read_csv('tv2data.csv')
tv2q_data = pd.read_csv('tv2q.csv')

analysis = all_data[['navn','parti', 'alder']]

# Remove rows with missing values
analysis = analysis.dropna()
analysis = analysis[analysis['alder'] > 0]

# Display the age of the candidates grouped by party
plt.figure(figsize=(12, 6))
sns.boxplot(data=analysis, x='parti', y='alder', palette="Set2")
plt.xticks(rotation=90)
plt.title("Age Distribution of Candidates by Party")
plt.xlabel("Party")
plt.ylabel("Age")
plt.show()
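
A numeric summary can complement the box plot. The following is a minimal sketch on a made-up toy table, assuming only the column names used above (`navn`, `parti`, `alder`); the candidate names and ages are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the real data; values are invented, column names follow the notebook.
toy = pd.DataFrame({
    "navn": ["a", "b", "c", "d", "e"],
    "parti": ["Venstre", "Venstre", "Venstre", "Alternativet", "Alternativet"],
    "alder": [30, 40, 50, 25, 35],
})

# One row per party: number of candidates plus mean, median, min and max age,
# ordered from the youngest to the oldest party by median age.
age_summary = (
    toy.groupby("parti")["alder"]
       .agg(["count", "mean", "median", "min", "max"])
       .sort_values("median")
)
print(age_summary)
```

Replacing `toy` with the filtered `analysis` frame should give the same kind of table for the real candidates.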


In [None]:
# An overview of the most "confident" candidates, i.e., those with the highest
# proportion of "strongly agree" (2) or "strongly disagree" (-2) responses.

# Keep the candidate name and the answer columns only
confident_candidates = all_data.drop(columns=['parti', 'alder', 'storkreds'])
answers = confident_candidates.drop(columns=['navn'])

# For each candidate (row), count the 2 and -2 answers; axis=1 sums across the columns of a row
confident_candidates['strongly_agree'] = (answers == 2).sum(axis=1)
confident_candidates['strongly_disagree'] = (answers == -2).sum(axis=1)
confident_candidates['total'] = confident_candidates['strongly_agree'] + confident_candidates['strongly_disagree']

# Share of strong answers among all questions
confident_candidates['proportion'] = confident_candidates['total'] / answers.shape[1]

# Keep only the name and the strong-answer statistics, most confident candidates first
confident_candidates_filtered = confident_candidates[['navn', 'strongly_agree', 'strongly_disagree', 'total', 'proportion']]
confident_candidates_filtered = confident_candidates_filtered.sort_values(by='proportion', ascending=False)

confident_candidates_filtered


In [None]:
#Differences in responses between candidates, both inter-party and intra-party, along with an explanation of which parties have the most internal disagreements.
import numpy as np


# Keep only the answer columns (drop name, party, age and constituency)
numeric_columns = all_data.drop(columns=['navn', 'parti', 'alder', 'storkreds'])

# Calculating intra-party differences (standard deviation within each party)
intra_std = numeric_columns.groupby(all_data['parti']).std()
intra_var = (intra_std**2).mean(axis=1)
intra_std_overall = np.sqrt(intra_var)
intra_std['mean_overall_std'] = intra_std_overall

print("Intra-party overall standard deviations:")
print(intra_std[['mean_overall_std']])

# Plot the intra-party differences
plt.figure(figsize=(12, 6))
intra_std['mean_overall_std'].plot(kind='bar', title='Intra-party Differences in Responses')
plt.xlabel("Party")
plt.ylabel("Standard Deviation")
plt.show()


Which parties have the most internal disagreements?

Internal disagreement is measured here by the standard deviation of the responses within each party: the per-question variances are averaged and the square root is taken, giving one number per party. The higher this value, the more spread out the answers of a party's candidates are, so the parties with the tallest bars in the plot above have the most internal disagreement.
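
As a small worked example of this measure, the sketch below uses two invented parties (`X` and `Y`) and two invented questions (`q1`, `q2`) on the same -2 to 2 scale. Party `X` answers almost uniformly, party `Y` is split, so `Y` should come out with the higher value.

```python
import numpy as np
import pandas as pd

# Toy answers on the -2..2 scale; party names and question ids are made up.
toy = pd.DataFrame({
    "parti": ["X", "X", "X", "Y", "Y", "Y"],
    "q1":    [2, 2, 2, -2, 0, 2],
    "q2":    [1, 1, 2, -1, 2, -2],
})

# Per-party, per-question standard deviation (pandas uses the sample std, ddof=1).
per_question_std = toy.groupby("parti")[["q1", "q2"]].std()

# Combine the questions: average the variances, then take the square root.
disagreement = np.sqrt((per_question_std ** 2).mean(axis=1)).sort_values(ascending=False)
print(disagreement)
```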

In [None]:
# Per-question standard deviation across all candidates. This is only a rough proxy
# for inter-party differences, because it also includes the spread within each party.
inter_std = numeric_columns.std()

print("Per-question standard deviations across all candidates:")
print(inter_std)

# Plot the spread per question
plt.figure(figsize=(12, 6))
inter_std.plot(kind='bar', title='Spread of Responses per Question (All Candidates)')
plt.xlabel("Statement")
plt.ylabel("Standard Deviation")
plt.show()
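
One possible way to isolate the differences between parties is to average the answers per party first and then look at how much those party means vary per question. The sketch below illustrates the idea on invented data (parties `X`, `Y`, `Z` and questions `q1`, `q2` are hypothetical); it is an alternative to the calculation above, not the method used in this notebook.

```python
import pandas as pd

# Toy answers; the parties disagree on q1 but give the same average answer on q2.
toy = pd.DataFrame({
    "parti": ["X", "X", "Y", "Y", "Z", "Z"],
    "q1":    [2, 1, -2, -1, 2, 2],
    "q2":    [0, 1, 1, 0, 0, 1],
})

# Mean answer per party and question.
party_means = toy.groupby("parti")[["q1", "q2"]].mean()

# Spread of the party means per question: a high value means the parties
# disagree with each other on that question.
inter_party_std = party_means.std().sort_values(ascending=False)
print(inter_party_std)
```

With the real data, `toy` would be replaced by the answer columns grouped by `all_data['parti']`.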

Classification models to predict candidates' party affiliations. Investigate if there are any candidates who seem to be in the "wrong" party based on their political landscape positions. You must use the following algorithms: **Decision Tree**, **Random Forest** and **Gradient Boosted Tree**, i.e., a total of 3 models are to be trained.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report

# Prepare the data
X = numeric_columns  # Features
y = all_data['parti']  # Target variable

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
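
Because the parties differ a lot in size, a purely random split can leave small parties under-represented in the test set. A minimal sketch of a stratified split follows, using invented data with one large and one small party; `stratify` is a standard parameter of `train_test_split`, everything else here is made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 candidates from a large party, 10 from a small one.
X_toy = pd.DataFrame({"q1": range(30), "q2": range(30, 60)})
y_toy = pd.Series(["Large"] * 20 + ["Small"] * 10, name="parti")

# stratify=y_toy keeps the party proportions the same in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print(y_te.value_counts())
```

Note that stratification requires at least two candidates per party, so very small parties may need to be handled separately.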

In [None]:
# Decision Tree 
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
dt_predictions = dt_model.predict(X_test)
print("Decision Tree Classifier Report:")
print(classification_report(y_test, dt_predictions, zero_division=0))

In [None]:
# Random Forest 
rf_model = RandomForestClassifier(random_state=42, n_estimators=100)
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)
print("Random Forest Classifier Report:")
print(classification_report(y_test, rf_predictions, zero_division=0))

In [None]:
# Gradient Boosted Tree 
gb_model = GradientBoostingClassifier(random_state=42, n_estimators=100)
gb_model.fit(X_train, y_train)
gb_predictions = gb_model.predict(X_test)
print("Gradient Boosted Tree Classifier Report:")
print(classification_report(y_test, gb_predictions, zero_division=0))

In [None]:
# Identifying candidates who seem to be in the "wrong" party for gradient boosted tree predictions

# Creating a DataFrame for misclassified candidates
wrong_party_candidates = X_test[y_test != gb_predictions].copy()

# Adding actual and predicted party columns
wrong_party_candidates['actual_party'] = y_test[y_test != gb_predictions]
wrong_party_candidates['predicted_party'] = gb_predictions[y_test != gb_predictions]

# Adding candidate names from the original dataset
wrong_party_candidates['navn'] = all_data.loc[wrong_party_candidates.index, 'navn']

print("Candidates who seem to be in the 'wrong' party:")
display(wrong_party_candidates[['navn', 'actual_party','predicted_party']])

In [None]:
# Identifying candidates who seem to be in the "wrong" party for random forest predictions

# Creating a DataFrame for misclassified candidates
wrong_party_rf_candidates = X_test[y_test != rf_predictions].copy()

# Adding actual and predicted party columns
wrong_party_rf_candidates['actual_party'] = y_test[y_test != rf_predictions]
wrong_party_rf_candidates['predicted_party'] = rf_predictions[y_test != rf_predictions]

# Adding candidate names from the original dataset
wrong_party_rf_candidates['navn'] = all_data.loc[wrong_party_rf_candidates.index, 'navn']

print("Candidates who seem to be in the 'wrong' party (Random Forest):")
display(wrong_party_rf_candidates[['navn', 'actual_party', 'predicted_party']])
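
The two cells above can only flag candidates who happen to land in the 30% test set. One way to get a prediction for every candidate is out-of-fold prediction with `cross_val_predict`. The sketch below shows this on invented data: parties `X` and `Y`, five hypothetical questions, and one candidate planted with `Y`-like answers but labelled `X`. It is an illustration under these assumptions, not a replacement for the analysis above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.default_rng(0)

# Toy data: party "X" answers positively, party "Y" negatively (5 questions).
X_pos = rng.integers(1, 3, size=(20, 5))     # values in {1, 2}
X_neg = rng.integers(-2, 0, size=(20, 5))    # values in {-2, -1}
planted = rng.integers(-2, 0, size=(1, 5))   # answers like "Y" but labelled "X"

X_toy = pd.DataFrame(
    np.vstack([X_pos, X_neg, planted]),
    columns=[f"q{i}" for i in range(1, 6)],
)
y_toy = pd.Series(["X"] * 20 + ["Y"] * 20 + ["X"], name="parti")

# Each candidate is predicted by a model that never saw that candidate in training.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
oof_pred = cross_val_predict(model, X_toy, y_toy, cv=cv)

# Candidates whose out-of-fold prediction differs from their actual party.
flagged = pd.DataFrame({"actual_party": y_toy, "predicted_party": oof_pred})
flagged = flagged[flagged["actual_party"] != flagged["predicted_party"]]
print(flagged)
```

The planted candidate (row 40) should appear in `flagged` with `Y` as the predicted party.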

In [None]:
# Identifying candidates who seem to be in the "wrong" party with a confusion matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Use the gradient boosted tree predictions so the matrix matches its labels and title
cm = confusion_matrix(y_test, gb_predictions, labels=gb_model.classes_)

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=gb_model.classes_)
disp.plot(xticks_rotation=90, cmap='viridis')
plt.title("Confusion Matrix for Gradient Boosted Tree Classifier")
plt.show()
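
A confusion matrix with this many parties can be hard to read, so it may help to list the most frequently confused party pairs as a table. The following sketch does this with pandas on invented labels (`X`, `Y`, `Z`); with the real data, `actual` and `predicted` would be `y_test` and the model's predictions.

```python
import pandas as pd

# Toy actual and predicted labels; party names are made up.
actual = pd.Series(["X", "X", "X", "Y", "Y", "Z", "Z", "Z"], name="actual_party")
predicted = pd.Series(["X", "Y", "Y", "Y", "Y", "Z", "X", "Z"], name="predicted_party")

# Same information as a confusion matrix, as a labelled table.
table = pd.crosstab(actual, predicted)
print(table)

# Keep only the misclassifications and count each (actual, predicted) pair.
errors = pd.DataFrame({"actual_party": actual, "predicted_party": predicted})
errors = errors[errors["actual_party"] != errors["predicted_party"]]
confused_pairs = (
    errors.groupby(["actual_party", "predicted_party"])
          .size()
          .sort_values(ascending=False)
)
print(confused_pairs)
```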