# Data
The dataset is derived from "The Danish Parliament Corpus 2009 - 2017, v2, w. subject annotation".

The dataset consists of transcriptions of speeches in the Folketing (Danish Parliament) from the first session of 2009 up to and including the first session of 2016 (6/10 2009 – 7/9 2017).

For each speech, metadata is attached, partly about the member of the Danish Parliament ('Name', 'Gender', 'Party', 'Role', 'Title', 'Birth', 'Age'), partly about the speech ('Date', 'samling', 'Start time', 'End time', 'Time', 'Agenda item', 'Case no', 'Case type', 'Agenda title', 'Subject 1', 'Subject 2').

The dataset is originally structured in tsv txt-files with one file per meeting.

 _(Hansen, Dorte Haltrup and Navarretta, Costanza, 2021, The Danish Parliament Corpus 2009 - 2017, v2, w. subject annotation, CLARIN-DK-UCPH Centre Repository, http://hdl.handle.net/20.500.12115/44.)_


For this notebook, we have collected the tsv files into a new dataset, which we have saved in a csv file separated by pipes. The csv file has been uploaded to [sciencedata.dk](https://sciencedata.dk/), from where it can be downloaded via url using the pandas.read_csv() method.
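
A minimal sketch of that collection step, assuming each meeting file is tab-separated with a header row. The two in-memory "files" and their contents are invented placeholders, not rows from the corpus, and the column names are a shortened illustration.

```python
import io
import pandas as pd

# Two tiny stand-ins for per-meeting TSV files (invented content).
meeting_1 = "Name\tParty\tText\nA\tS\tFirst speech\n"
meeting_2 = "Name\tParty\tText\nB\tV\tSecond speech\n"

# Read each tab-separated file and stack them into one dataframe.
frames = [pd.read_csv(io.StringIO(m), sep="\t") for m in (meeting_1, meeting_2)]
combined = pd.concat(frames, ignore_index=True)

# Write the combined data pipe-separated, then read it back the same way.
buffer = io.StringIO()
combined.to_csv(buffer, sep="|", index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer, sep="|")
print(reloaded)
```

A pipe is a convenient separator here because speech text often contains commas and tabs but rarely a pipe character.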


# Problem statement

The work we have done is inspired by the report [A Decade of Immigration in the British Press](https://migrationobservatory.ox.ac.uk/resources/reports/decade-immigration-british-press/) by William L. Allen. The report examines the UK media coverage of migration and contains interesting visualisations that can be used for analytical purposes.

In this notebook we produce some visualisations that can also be used for analytical purposes, but instead of investigating media coverage, we investigate speeches from the Danish Parliament.

The topic is immigration policy from 2009 - 2017. We examine speeches from the following parties:

- Enhedslisten (EL)
- Socialistisk Folkeparti (SF)
- Socialdemokratiet (S)
- Radikale Venstre (RV)
- Venstre (V)
- Liberal Alliance (LA)
- Konservative Folkeparti (KF)
- Dansk Folkeparti (DF)

What characterises the policies of the political parties, as assessed from the speeches of party members in the Folketing (Danish Parliament)? We limit the dataset to speeches from the parties listed above and to speeches that contain at least one of the following search terms: "asylansøger" (asylum seeker), "flygtning" (refugee), "indvandrer" (immigrant), or "integration". The speeches are grouped by party and year.

## We perform the following data analyses:

- [Data analysis 1: Using Scikit-learn's Tf-Idf algorithm, we analyze the distinctive keywords that characterize the speeches of different parties.](#da1)
- [Data analysis 2: Build a Vega-Altair chart for visual word trend analysis. 2.a. Use the Statistikbank API and a Bag-of-words (BOW) on the speeches to build a layered Vega-Altair chart for visual analysis. 2.b. Use the BOW to visualise trends of the keywords.](#da2)
- [Data analysis 3: Analyse and visualise the changes of modifiers. Using Spacy's pos-tagger and dependency parsing, we conduct an analysis of which modifiers are associated with "asylansøger" (asylum seeker), "flygtning" (refugee), "indvandrer" (immigrant), or "integration".](#da3)
- [Data analysis 4: Using Sentida's Sentiment Analysis algorithm, we analyze the mood in the speeches of the different parties.](#da4)

In [1]:
import pandas as pd
from tqdm import tqdm
tqdm.pandas()
import json 
import altair as alt
import warnings
warnings.filterwarnings("ignore")
import re
import time
from urllib.request import urlopen
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter

In [2]:
# Load data from sciencedata.dk 
#df = pd.read_csv('folketingsreferater_2009_2016.csv', sep='|')
df = pd.read_csv('https://sciencedata.dk/shared/825e999a5c13fd22d28d4289fa899ba1?download', sep='|')

In [3]:
df = df.rename(columns={'samling':'Session'})

## Limit the dataset to the selected parties and search terms (nouns)

In [4]:
# Subset dataframe
# Define the parties to include in the analysis
values = ['EL', 'SF', 'S', 'RV', 'V', 'LA', 'KF', 'DF']
data = df[df['Party'].isin(values)] 


# Subset dataframe
# Define the terms to search for, create a condition for each term using str.contains, 
# and subset the dataframe based on the conditions
terms_to_search = ['asylansøger', 'flygtning', 'indvandrer', 'integration']
conditions = data['Text'].str.contains('|'.join(terms_to_search), case=False, na=False)
subset = data[conditions].reset_index(drop=True)

# Subset dataframe
# Pick the relevant columns 
subset = subset[['Date', 'Session', 'Party', 'Text', 'Name']]

# Save the subset
subset.to_csv('subset.csv', index=False)

In [5]:
# Read the subset
subset = pd.read_csv('subset.csv')
#subset = pd.read_csv('https://sciencedata.dk/shared/ab2377f7e7858f483cd101b8a4eb934f?download')

In [6]:
subset['Year'] =  subset['Date'].str.extract(r'(\d{4})')

# Preprocess text
In this step the text is cleaned. This includes removing:
- stopwords
- names of the politicians from speeches
- numbers
- words with numbers
- words of one or two letters
- punctuation 
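
To make the effect of these cleaning steps concrete, here is a small sketch that applies the same kind of regular expressions to one sentence. The sample sentence and the two-word stop word set are invented for illustration; the notebook itself loads a full stop word list and adds the politicians' names.

```python
import re

# Hypothetical stop words, only for this illustration.
stop_words = {"kom", "til"}

def scrub_sample(text):
    text = re.sub(r'\b\d+(\.\d+)?\b', '', text)   # plain numbers such as 2015 or 21.000
    text = re.sub(r'\w*\d\w*', '', text)          # words that contain a digit
    text = re.sub(r'\b\w{1,2}\b', '', text)       # words of one or two letters
    # Keep runs of characters that start and end at a word boundary,
    # which strips leading and trailing punctuation.
    tokens = re.findall(r'\b\S+\b', text.lower().replace('_', ' '))
    return ' '.join(t for t in tokens if t not in stop_words)

sample = "I 2015 kom 21.000 flygtninge til Danmark, og vi så på L87_forslaget."
cleaned = scrub_sample(sample)
print(cleaned)
```

Note that the order of the steps matters: numbers are removed before short words, so the fragments left behind by a removed number cannot survive as tokens.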

In [7]:
# Group by 'Year' and 'Party' and aggregate speeches
subset_group_by_session_party = subset.groupby(['Year', 'Party'])['Text'].agg(' '.join).reset_index()


startTime = time.time()

# Extract speeches to list
print ('Get speeches') 
speeches = subset_group_by_session_party['Text'].tolist()

print ('Load stopwords')
# load stopword list from sciencedata.dk
with urlopen('https://sciencedata.dk/shared/646999887396d04771f84554d30f75ff?download') as response:
    stop_words = response.read().decode('utf-8-sig').split('\r\n') 
    
    
# Find names in the dataset and extend the stopword list with those names     
name_list = list(subset['Name'])
new_name_list = []
for i in name_list:
    name_split = i.split()
    new_name_list.append(name_split)
    
# Merge the inner lists into one single list
merged_name_list = [i for s in new_name_list for i in s]
# remove the duplicates using set and list  
name_set_as_list = list(set(merged_name_list))
# extend stopword list with name list 
stop_words.extend(name_set_as_list)
# change all letters to lower case
stop_words = [i.lower() for i in stop_words]

print ('Build text scrubber')
# Preprocessor function that takes a text string and returns a clean text string without stopwords
def scrub_text(text):
    # Remove all numbers (including integers and decimals)
    text_without_numbers = re.sub(r'\b\d+(\.\d+)?\b', '', text)

    # Remove words that include numbers
    text_without_words_with_numbers = re.sub(r'\w*\d\w*', '', text_without_numbers)
    
    # remove short words
    text_without_short_words = re.sub(r'\b\w{1,2}\b', '', text_without_words_with_numbers)
    
    # find all runs of non-whitespace characters between two word boundaries
    text_without_punctuation = ' '.join(re.findall(r'\b\S+\b', text_without_short_words.lower().replace('_', ' ')))
    
    tokens = text_without_punctuation.split()
    
    tokens_without_stopwords = [i for i in tokens if i not in stop_words]
    
    clean_text_with_out_stopwords = ' '.join(tokens_without_stopwords)
        
    return clean_text_with_out_stopwords

print ('Scrub the texts')
clean_strings_in_list = [scrub_text(i) for i in speeches]
print ('Add clear text to dataframe')
subset_group_by_session_party['Clean_text_wo_sw'] = clean_strings_in_list

executionTime = (time.time() - startTime)
print('Execution time in seconds: ' + str(executionTime))

Get speeches
Load stopwords
Build text scrubber
Scrub the texts
Add clear text to dataframe
Execution time in seconds: 15.90079140663147


In [8]:
# Save the data for later use
subset_group_by_session_party.to_csv('preprocessed_text_data_grouped_by_year.csv', index=False)

<a id="da1"></a>
# Data analysis 1
# TF-IDF.

Using Tf-Idf, we identify the distinctive keywords that characterize the different parties.

Method:
1. Subset the data to speeches that contain at least one of the following words: "asylansøger" (asylum seeker), "flygtning" (refugee), "indvandrer" (immigrant), or "integration". The speeches are grouped by party and year.
2. Preprocess the texts using a stopword list and ***Spacy*** to lemmatize the words in the speeches.
3. Use **Tf-Idf** to identify distinctive keywords.


It is necessary to produce a subset of the data because Spacy's lemmatizer requires a lot of memory and processing time when the texts are long. Therefore, we subset the dataset as described in step 1 above. We also disable some components of the spaCy pipeline to make the text processing run faster. _SpaCy Guides: [Language Processing Pipelines](https://spacy.io/usage/processing-pipelines)._

The issue is addressed on Stack Exchange, for example in the thread
[Increasing SpaCy max NLP limit](https://datascience.stackexchange.com/questions/38745/increasing-spacy-max-nlp-limit), where it is suggested to 1. increase the value of the attribute _nlp.max_length_ and 2. pass an argument to the nlp call that disables the ner and parser components.

In [9]:
# Load the data
preprocessed_text_data = pd.read_csv('preprocessed_text_data_grouped_by_year.csv')

In [10]:
# Load Danish spacy model
nlp = spacy.load("da_core_news_sm")
nlp.max_length = 5000000 #or any large value, as long as you don't run out of RAM

def lemmatize_text(text):
    doc = nlp(text, disable = ['ner', 'parser'])
    lemmas = [x.lemma_ for x in doc]
    return lemmas

In [11]:
startTime = time.time()
preprocessed_text_data['Lemmatized_text'] = preprocessed_text_data['Clean_text_wo_sw'].progress_apply(lambda x : lemmatize_text(x))
executionTime = (time.time() - startTime)


def join_list(word_list):
    return ' '.join(word_list)
preprocessed_text_data['Lemmatized_text'] = preprocessed_text_data['Lemmatized_text'].apply( lambda x : join_list(x)) 

print('Execution time in seconds: ' + str(executionTime))

100%|██████████████████████████████████████████████████████████████████████████████████| 72/72 [00:39<00:00,  1.85it/s]

Execution time in seconds: 39.006192207336426





We now have the text data collected, cleaned, and lemmatized. That is what we need for the Tf-Idf study, where we will identify distinctive keywords.

We save these data in a csv file to save time in the future.

In [12]:
preprocessed_text_data[['Year', 'Party','Lemmatized_text']].to_csv('sub_set_lemmatized_text_df_groupby_year.csv', index=False)

## Tf-Idf

We calculate Tf-Idf scores for the remaining words and modify our dataframe so that we get a vector structure where words become features and the Tf-Idf values become values in rows.
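
For intuition, the sketch below computes Tf-Idf by hand on three toy documents. It assumes the smoothed idf formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each row, which to our understanding corresponds to the default settings of scikit-learn's TfidfVectorizer. The toy documents are invented, and the whitespace tokenisation is a simplification.

```python
import math
from collections import Counter

# Invented toy documents, one per "party-year" group.
docs = ["flygtning grænse grænse", "integration arbejde", "flygtning integration"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
doc_freq = {w: sum(w in doc for doc in tokenized) for w in vocab}
# Smoothed inverse document frequency (assumed formula, see lead-in).
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = {w: counts[w] * idf[w] for w in vocab}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}   # L2-normalised row

rows = [tfidf_row(doc) for doc in tokenized]
top_word_doc0 = max(rows[0], key=rows[0].get)
print(top_word_doc0)
```

In the first toy document the word that occurs only there gets the highest score, while a word shared with another document is weighted down. This is the sense in which Tf-Idf picks out distinctive keywords.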

In [13]:
# Load the clean and lemmatized text data 
sub_set_lemmatized_text_df = pd.read_csv('sub_set_lemmatized_text_df_groupby_year.csv')

In [14]:
def tf_idf(text_list):
    # Initialize the TF-IDF vectorizer - choose a max feature 
    tfidf_vectorizer = TfidfVectorizer(max_features=5000)

    # Fit and transform the text data to calculate TF-IDF scores
    tfidf_matrix = tfidf_vectorizer.fit_transform(text_list)

    # Get the TF-IDF scores for each item in the text_list
    tfidf_scores = tfidf_matrix.toarray()

    # Initialize a list to store the TF-IDF scores for each item
    tfidf_scores_list = []

    # Iterate through the text_list and build the list of TF-IDF scores
    for i, item in enumerate(text_list):
        words = tfidf_vectorizer.get_feature_names_out()
        scores = tfidf_scores[i]
        #item_scores = {word: score for word, score in zip(words, scores) if score > 0}
        item_scores = {word: score for word, score in zip(words, scores)}
        tfidf_scores_list.append(item_scores)

    return tfidf_scores_list



text_list = sub_set_lemmatized_text_df['Lemmatized_text'].tolist() # input data
distinctive_words = tf_idf(text_list) # Apply the function
sub_set_lemmatized_text_df['Distinctive_words'] =  distinctive_words # add the output data to the dataframe

# subset to remove the 'Lemmatized_text' column
subset_grouped_lemmatized_distinctive_keywords = sub_set_lemmatized_text_df[['Year', 'Party','Distinctive_words']]


def build_vector_structure(data_frame):
    
    meta_data_structure = data_frame[['Year', 'Party']]
    
    vector_rows = []
    
    # Iterate over the rows of the input frame and turn each
    # word-score dictionary into a one-row dataframe
    for index, row in data_frame.iterrows():
        data_distinctive_words = row['Distinctive_words']
        df_distinctive_words = pd.DataFrame([data_distinctive_words])
        vector_rows.append(df_distinctive_words)
                
    vector_structure = pd.concat(vector_rows).reset_index(drop=True)
    output_data = pd.concat([meta_data_structure, vector_structure], axis=1)
    
    return output_data
    
td_idf_vector_frame  = build_vector_structure(subset_grouped_lemmatized_distinctive_keywords)

# Save the td_idf_vector_frame for later 
td_idf_vector_frame.to_csv('tf_idf_vector_frame_groupby_year.csv', index=False)

### Extract and visualise the top n most distinct words

The top n most distinctive words for each party in each year are displayed using the Tf-Idf algorithm, a [Sorted Bar Chart](https://altair-viz.github.io/gallery/bar_chart_sorted.html), and Altair's binding methods, which give us interactive dropdown menus to choose from.

Please note that we have used the "empty=False" parameter in the selections bound to the dropdown menus. This means that you must actively select both a party and a year to see the 15 words that are most distinctive for that party in that year compared to the rest of the parties.

In [15]:
# Load the td_idf_vector_frame
td_idf_vector_frame = pd.read_csv('tf_idf_vector_frame_groupby_year.csv')


from itertools import islice
def take(n, iterable):
    return list(islice(iterable, n))

def topN(row, n):
    x = row.to_dict() # convert the input row to a dictionary 
    x = {k: v for k, v in sorted(x.items(), key=lambda item: -item[1])} # sort the dictionary based on their values 
    n_items = take(n, x.items()) # take the first n values from the dictionary 
    return n_items


# Find the n largest values in rows
n = 15 #number of elements needed
top_n_words_df = td_idf_vector_frame.copy()
top_n_words_df['top_n_words'] = top_n_words_df.iloc[:, 2:].apply(lambda row : topN(row,n), axis = 1)
top_n_words_df = top_n_words_df[['Year','Party', 'top_n_words']]


# Data transformation; flattening the top_n_words column, and send the data back to a dataframe
rows = []
for index, row in top_n_words_df.iterrows():
    for word, score in row['top_n_words']:
        rows.append({'Year': row['Year'], 'Party': row['Party'], 'Word': word, 'Score': score})
flat_df = pd.DataFrame(rows) # Create a new DataFrame


# Create dropdown selections for Year and Party
dropdown_session = alt.binding_select(options=flat_df['Year'].unique(), name='Year')
selection_session = alt.selection_single(fields=['Year'], bind=dropdown_session, empty=False)

dropdown_party = alt.binding_select(options=flat_df['Party'].unique(), name='Party')
selection_party = alt.selection_single(fields=['Party'], bind=dropdown_party, empty=False)

# Bar chart
bar_chart = alt.Chart(flat_df).mark_bar().encode(
    x='Score:Q',
    y=alt.Y('Word:N', sort='-x'),  # Sort bars by Value in descending order
    color=alt.value('steelblue'),
    tooltip=['Word', 'Score']
).transform_filter(
    selection_session
).transform_filter(
    selection_party
).properties(
    title='Distinctive words compared to the rest of the parties',
    width=450,
    height=250)

# Combine chart with dropdown selections
distinct_words = bar_chart.add_params(selection_session, selection_party                                  
).configure_view(continuousWidth=400 + 30,  # Adjusts the width to account for padding
).configure(padding={"left": 60, "right": 50, "top": 10, "bottom": 10})  # Adds x pixels of padding


distinct_words

<a id="da2"></a>
# Data analysis 2
## Build Vega-Altair chart for visual word trend analysis
### 2.a. - A layered chart with data from the Statistikbank API and the parliament speeches
#### Build BOW 

In [16]:
# load data 
sub_set_lemmatized_text_df = pd.read_csv('sub_set_lemmatized_text_df_groupby_year.csv')

# Build vectors using BOW
corpus = sub_set_lemmatized_text_df['Lemmatized_text'].to_list() # Prepare for sk-learns CountVectorizer 
vectorizer = CountVectorizer()  
X = vectorizer.fit_transform(corpus) 
feature_names = vectorizer.get_feature_names_out() # Get features out 
vectors = X.toarray() # Get vectors
bow = pd.DataFrame(vectors, columns=feature_names) # Build the bow
meta_data_df = sub_set_lemmatized_text_df[['Year', 'Party']]
meta_data_bows = pd.concat([meta_data_df,bow], axis=1)


# save bow for later use
meta_data_bows.to_csv('meta_data_bows_groupby_year.csv', index=False)


# Calculate word frequency
meta_data_bows1 = meta_data_bows.copy()
bows_word_count = meta_data_bows1.iloc[ : , 2:] # word count 
bows_sum = meta_data_bows1.iloc[ : , 2:].sum(axis=1) # sum of word count
bow_word_frequency = bows_word_count.div(bows_sum, axis=0) # Calculate word frequency
meta_data = meta_data_bows1.iloc[ : , :2] # get metadata
meta_data_bows_word_freq = pd.concat([meta_data, bow_word_frequency], axis=1) # Append columns of DataFrames


# Save data for later
meta_data_bows_word_freq.to_csv('meta_data_bows_word_freq_groupby_year.csv', index=False)

### Get data from the Statistikbank API and extract the relative word frequency of keywords from the parliament speeches

The gross number of applicants covers all applications for asylum in Denmark, regardless of whether the asylum case is substantively processed in Denmark or not. Read more at www.nyidanmark.dk. Source: [Note on the Gross Number of Applicants](https://www.statistikbanken.dk/VAN5RKA) (retrieved 31-10-2023).
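
The table is requested per quarter (labels such as 2009K1), while the speeches are grouped by year, so the quarterly figures have to be summed per year. A minimal sketch of that aggregation follows. The rows are invented placeholders, not real figures from Statistics Denmark.

```python
import re
from collections import defaultdict

# Invented placeholder rows in the shape (quarter label, applicants).
# These numbers are NOT real statistics.
rows = [("2009K1", 10), ("2009K2", 20), ("2010K1", 5), ("2010K4", 15)]

yearly = defaultdict(int)
for quarter, applicants in rows:
    # The first four digits of the quarter label are the year.
    year = int(re.match(r"(\d{4})", quarter).group(1))
    yearly[year] += applicants

print(dict(yearly))
```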

In [17]:
url = 'https://api.statbank.dk/v1/data/VAN5/CSV?Tid=2009K1%2C2009K2%2C2009K3%2C2009K4%2C2010K1%2C2010K2%2C2010K3%2C2010K4%2C2011K1%2C2011K2%2C2011K3%2C2011K4%2C2012K1%2C2012K2%2C2012K3%2C2012K4%2C2013K1%2C2013K2%2C2013K3%2C2013K4%2C2014K1%2C2014K2%2C2014K3%2C2014K4%2C2015K1%2C2015K2%2C2015K3%2C2015K4%2C2016K1%2C2016K2%2C2016K3%2C2016K4%2C2017K1%2C2017K2%2C2017K3%2C2017K4&ASYLTYPE=BRU'

tabel_van5_raw = pd.read_csv(url, sep = ";")
tabel_van5_raw['Year'] = tabel_van5_raw['TID'].str.extract(r'(\d{4})')
tabel_van5_raw = tabel_van5_raw.rename(columns= {'INDHOLD' : 'Asylum seekers'})
tabel_van5_raw = tabel_van5_raw.iloc[:,2:]
asylum_seekers_2009_2017 = tabel_van5_raw.groupby('Year')['Asylum seekers'].sum().to_frame().reset_index()
# Work on a copy of the yearly totals (note: despite the variable name, 2017 is still included)
asylum_seekers_2009_2016 = asylum_seekers_2009_2017.copy()
asylum_seekers_2009_2016['Year'] = asylum_seekers_2009_2016['Year'].astype(int)


# Load bow data
data = meta_data_bows_word_freq.copy() 
data = data[['Year','Party','asylansøger']]
data['Year'] = data['Year'].astype(str).str[:4].astype(int)
data = data[['Party', 'asylansøger', 'Year']]

### layered_chart 1 ###
chart_1 = alt.Chart(asylum_seekers_2009_2016).mark_bar(
    color="lightblue",
    interpolate='step-after',
    line=True
).encode(
    x='Year:O',
    y= alt.Y('Asylum seekers:Q', axis=alt.Axis(title='Gross number of asylum applicants in Denmark')),
    color=alt.value('grey'),
    opacity=alt.value(0.3)  # Adjust opacity as needed (0.0 to 1.0)
)



### layered_chart 2 ###

# Define a selection that allows users to select a Party
selection = alt.selection_single(
    fields=['Party'],  # Assuming 'Party' is a field in your 'data' DataFrame
    bind=alt.binding_select(options=data['Party'].unique(), name='Select Party'),
    empty='all'  # when no party is selected, all parties are treated as selected
)



# Create a line chart
chart_2 = alt.Chart(data).mark_line(interpolate='basis').encode(
    x='Year:O',  # Use 'O' for ordinal data (years)
    y=alt.Y('asylansøger:Q', axis=alt.Axis(title='Asylum seekers')),  # Use 'Q' for quantitative data (values)
    color= 'Party',
    opacity=alt.condition(selection, alt.value(0.8), alt.value(0.2))    
).properties(
    width=500,
    height=250,
    title="Word frequency"
).interactive(
).add_params(
    selection
)   

# Layer the two charts
layered_chart = alt.layer(chart_1, chart_2).resolve_scale(y='independent').properties(
    title= 'Applications for Asylum in Denmark and the mentions of Asylum Seekers in speeches',
    width= 500,
    height= 300
)

layered_chart

### 2.b. The Trends of the keywords in the Speeches 
Relative frequency
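
Relative frequency here means the count of a word divided by the total number of words in the same party-year group, which makes groups with different amounts of speech comparable. A small sketch with invented counts (the group labels are hypothetical):

```python
import pandas as pd

# Invented word counts for two party-year groups.
counts = pd.DataFrame(
    {"asylansøger": [2, 1], "flygtning": [6, 1], "integration": [2, 2]},
    index=["group_a", "group_b"],
)

# Divide each row by its own total so that every row sums to 1.
relative = counts.div(counts.sum(axis=1), axis=0)
print(relative.round(2))
```

Although "flygtning" occurs six times in one group and once in the other, the comparison is made on its share of each group's words rather than on the raw counts.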

In [18]:

w=350
h=250

# Define a selection that allows users to select a Party
selection = alt.selection_single(
    fields=['Party'],  # Assuming 'Party' is a field in your 'data' DataFrame
    bind=alt.binding_select(options=data['Party'].unique(), name='Select Party'),
    empty='all'  # when no party is selected, all parties are treated as selected
)



data = meta_data_bows_word_freq.copy()
data = data[['Year','Party','asylansøger']]
### chart 1 ###
#selection = alt.selection_point(fields=['Party'], bind='legend')
# Create a line chart
chart1 = alt.Chart(data).mark_line(interpolate='basis').encode(
    x='Year:O',  # Use 'O' for ordinal data (years)
    y='asylansøger:Q',  # Use 'Q' for quantitative data (values)
    color= 'Party',
    opacity=alt.condition(selection, alt.value(0.8), alt.value(0.2))    
).properties(
    width=w,
    height=h,
    title="Word frequency"
).interactive(
).add_params(
    selection
)

# input data
data = meta_data_bows_word_freq.copy()
data = data[['Year','Party','flygtning']]
### chart 2 ### 
#selection = alt.selection_point(fields=['Party'], bind='legend')
# Create a line chart
chart2 = alt.Chart(data).mark_line(interpolate='basis').encode(
    x='Year:O',  # Use 'O' for ordinal data (years)
    y='flygtning:Q',  # Use 'Q' for quantitative data (values)
    color= 'Party',
    opacity=alt.condition(selection, alt.value(0.8), alt.value(0.2))

).properties(
    width=w,
    height=h,
    title="Word frequency"
).interactive(
).add_params(
    selection
)


# input data
data = meta_data_bows_word_freq.copy()
data = data[['Year','Party','indvandrer']]
### chart 3 ### 
#selection = alt.selection_point(fields=['Party'], bind='legend')
# Create a line chart
chart3 = alt.Chart(data).mark_line(interpolate='basis').encode(
    x='Year:O',  # Use 'O' for ordinal data (years)
    y='indvandrer:Q',  # Use 'Q' for quantitative data (values)
    color= 'Party',
    opacity=alt.condition(selection, alt.value(0.8), alt.value(0.2))

).properties(
    width=w,
    height=h,
    title="Word frequency"
).interactive(
).add_params(
    selection
)

data = meta_data_bows_word_freq.copy()
data = data[['Year','Party','integration']]
### chart 4 ###
#selection = alt.selection_point(fields=['Party'], bind='legend')
# Create a line chart
chart4 = alt.Chart(data).mark_line(interpolate='basis').encode(
    x='Year:O',  # Use 'O' for ordinal data (years)
    y='integration:Q',  # Use 'Q' for quantitative data (values)
    color= 'Party',
    opacity=alt.condition(selection, alt.value(0.8), alt.value(0.2))    
).properties(
    width=w,
    height=h,
    title="Word frequency"
).interactive(
).add_params(
    selection
)

# Display the charts
alt.concat(
    chart1,
    chart2,
    chart3,
    chart4,
    columns=2
).properties(
    title="Word trends and relative frequency of the use of the words 'asylansøger', 'flygtning', 'indvandrer', and 'integration' among different parties"
)

<a id="da3"></a>
# Data analysis 3
## Analyse and visualise the changes of modifiers.
### Using Spacy's pos-tagger and dependency parsing to conduct a visual analysis of which modifiers are associated with "asylansøger" (asylum seeker), "flygtning" (refugee), "indvandrer" (immigrant), or "integration".

[Spacy's Dependency Parsing](https://spacy.io/usage/linguistic-features#dependency-parse) is used as a tool to identify the words that members of the Folketing (Danish Parliament) use to modify the words "asylansøger" (asylum seeker), "flygtning" (refugee), "indvandrer" (immigrant), and "integration".

For the study, data is divided into subsets.

The temporal subsets are:
1. Subset of speeches from all parties from all years
2. Subset of speeches from all parties from before 2015
3. Subset of speeches from all parties from 2015 onwards

The political and temporal subsets, each split into before 2015 and from 2015 onwards, are:
1. Subset of speeches from Enhedslisten and Socialistisk Folkeparti (left wing)
2. Subset of speeches from Dansk Folkeparti and Konservative Folkeparti (right wing)
3. Subset of speeches from Socialdemokratiet and Venstre (center)


The subsets are filtered so that only sentences containing one of the following text strings are kept: 'asylansøger', 'flygtning', 'indvandrer', or 'integration'. Remember that a text string like 'flygtning' (refugee) is also present in a word like 'flygtningefamilie' (refugee family).
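
The point about compound words can be illustrated with plain substring matching. This is only a sketch: the example sentences are invented, and the lower-casing is our own addition for the illustration.

```python
terms = ["asylansøger", "flygtning", "indvandrer", "integration"]

def matched_terms(sentence):
    """Return the search terms that occur anywhere inside the sentence."""
    lowered = sentence.lower()
    return [t for t in terms if t in lowered]

# A compound word matches because it contains the search term as a substring.
compound_hit = matched_terms("Flygtningefamilien venter på svar")
no_hit = matched_terms("Vi diskuterer skattepolitik")
print(compound_hit, no_hit)
```

Substring matching therefore also picks up inflected forms and compounds, which is convenient for Danish, where compounds are written as one word.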

In [19]:
# load data
print ('Loading data ...')
preprocessed_text_data = pd.read_csv('https://sciencedata.dk/shared/b67b24bd3bbb5c1926fde03c9013a761?download')
preprocessed_text_data = preprocessed_text_data.drop('Clean_text_wo_sw', axis=1)

Loading data ...


In [20]:
print ('building temporal subsets  ...')
all_years = ' '.join(preprocessed_text_data['Text'])
before_2015 = ' '.join(preprocessed_text_data.query('Year < 2015')['Text'])
after_2015 = ' '.join(preprocessed_text_data.query('Year >= 2015')['Text'])

building temporal subsets  ...


In [21]:
print ('building political and temporal subsets  ...')
print ('building the left wing parties subset before 2015...')
before2015 = preprocessed_text_data.query('Year < 2015')
left_parties = ['EL', 'SF']
left_parties_before2015 = before2015[before2015['Party'].isin(left_parties)]
left_parties_before2015_text = ' '.join(left_parties_before2015['Text'])


print ('building the left wing parties subset after 2015 ...')
after2015 = preprocessed_text_data.query('Year >= 2015')
left_parties = ['EL', 'SF']
left_parties_after2015 = after2015[after2015['Party'].isin(left_parties)]
left_parties_after2015_text = ' '.join(left_parties_after2015['Text'])


print ('building the right wing parties subset before 2015...')
before2015 = preprocessed_text_data.query('Year < 2015')
right_parties = ['DF', 'KF']
right_parties_before2015 = before2015[before2015['Party'].isin(right_parties)]
right_parties_before2015_text = ' '.join(right_parties_before2015['Text'])


print ('building the right wing parties subset after 2015...')
after2015 = preprocessed_text_data.query('Year >= 2015')
right_parties = ['DF', 'KF']
right_parties_after2015 = after2015[after2015['Party'].isin(right_parties)]
right_parties_after2015_text = ' '.join(right_parties_after2015['Text'])


print ('building the center parties subset before 2015...')
before2015 = preprocessed_text_data.query('Year < 2015')
center_parties = ['S', 'V']
center_parties_before2015 = before2015[before2015['Party'].isin(center_parties)]
center_parties_before2015_text = ' '.join(center_parties_before2015['Text'])

print ('building the center parties subset after 2015...')
after2015 = preprocessed_text_data.query('Year >= 2015')
center_parties = ['S', 'V']
center_parties_after2015 = after2015[after2015['Party'].isin(center_parties)]
center_parties_after2015_text = ' '.join(center_parties_after2015['Text'])

building political and temporal subsets  ...
building the left wing parties subset before 2015...
building the left wing parties subset after 2015 ...
building the right wing parties subset before 2015...
building the right wing parties subset after 2015...
building the center parties subset before 2015...
building the center parties subset after 2015...


In [22]:
# Extract sentences holding selected keywords
print ('Extract sentences holding selected keywords ...')

def extract_sentences_holding_keywords(text):
    # Split on full stops as a simple approximation of sentence boundaries
    split_text = text.split('.')
    filtered_sent = []
    for sent in split_text:
        # Keep each sentence once, even if it holds several keywords
        if any(word in sent for word in ['asylansøger', 'flygtning', 'indvandrer', 'integration']):
            filtered_sent.append(sent.strip())
                
    return filtered_sent


# Temporal data
filtered_sent_all_years = extract_sentences_holding_keywords(all_years)
filtered_sent_before_2015 = extract_sentences_holding_keywords(before_2015)
filtered_sent_after_2015 = extract_sentences_holding_keywords(after_2015)


# Political wing data
left_parties_before2015_filtered_sent = extract_sentences_holding_keywords(left_parties_before2015_text)
left_parties_after2015_filtered_sent = extract_sentences_holding_keywords(left_parties_after2015_text)
right_parties_before2015_filtered_sent = extract_sentences_holding_keywords(right_parties_before2015_text)
right_parties_after2015_filtered_sent = extract_sentences_holding_keywords(right_parties_after2015_text)
center_parties_before2015_filtered_sent = extract_sentences_holding_keywords(center_parties_before2015_text)
center_parties_after2015_filtered_sent = extract_sentences_holding_keywords(center_parties_after2015_text)

Extract sentences holding selected keywords ...


In [23]:
def extract_modifiers(filtered_sent):
    # Process the sentences using Spacy
    # Load spacy model
    nlp = spacy.load("da_core_news_sm")
    nlp.max_length = 5000000 #or any large value, as long as you don't run out of RAM

    lemmatized_text_strings = []
    for sent in filtered_sent:
        doc = nlp(sent)

        # Additional words to filter out
        additional_filter_words = {'ønske', 'se', 'få', 'mene', 'sige', 'tage', \
                                   'bruge', 'give', 'vigtig', 'gå', 'kigge', 'integrationsminister',\
                                  'minister', 'ordfører', 'justitsminister'}

        # Filter out stopwords, punctuation, and extract only nouns, verbs, and adjectives
        # Apply lemmatization to each token
        lemmatized_text = [token.lemma_ for token in doc if not token.is_stop \
                           and not token.is_punct \
                           and token.pos_ in ['NOUN', 'VERB', 'ADJ']\
                           and token.lemma_.lower() not in additional_filter_words]

        # Join the lemmatized tokens back into a string
        lemmatized_text_str = ' '.join(lemmatized_text)

        # Append the lemmatized_text_str to a list
        lemmatized_text_strings.append(lemmatized_text_str)


    # Count the frequency of each word
    word_freq = Counter(' '.join(lemmatized_text_strings).split())

    # Convert to a Pandas DataFrame for Altair
    df = pd.DataFrame(word_freq.items(), columns=['Word', 'Frequency']).sort_values(by='Frequency', ascending=False)
    
    return df

In [24]:
# Extract modifiers
print ('Extract modifiers from temporal data ...  (Long processing time) - Thanks for your patience.')
df_all_years = extract_modifiers(filtered_sent_all_years)
df_before_2015 = extract_modifiers(filtered_sent_before_2015)
df_after_2015 = extract_modifiers(filtered_sent_after_2015)

Extract modifiers from temporal data ...  (Long processing time) - Thanks for your patience.


In [25]:
# Extract modifiers
print ('Extract modifiers from political wing data ...')
df_left_before_2015 = extract_modifiers(left_parties_before2015_filtered_sent)
df_left_after_2015 = extract_modifiers(left_parties_after2015_filtered_sent)
df_right_before_2015 = extract_modifiers(right_parties_before2015_filtered_sent)
df_right_after_2015 = extract_modifiers(right_parties_after2015_filtered_sent)
df_center_before_2015 = extract_modifiers(center_parties_before2015_filtered_sent)
df_center_after_2015 = extract_modifiers(center_parties_after2015_filtered_sent)

Extract modifiers from political wing data ...


In [26]:
# Saving modifiers for later
df_all_years.to_csv('modifiers_all_years_groupby_year.csv', index=False)
df_before_2015.to_csv('modifiers_before_2015_groupby_year.csv', index=False)
df_after_2015.to_csv('modifiers_after_2015_groupby_year.csv', index=False)

In [27]:
# Saving modifiers for later
df_left_before_2015.to_csv('modifiers_left_before_2015_groupby_year.csv', index=False)
df_left_after_2015.to_csv('modifiers_left_after_2015_groupby_year.csv', index=False)
df_right_before_2015.to_csv('modifiers_right_before_2015_groupby_year.csv', index=False)
df_right_after_2015.to_csv('modifiers_right_after_2015_groupby_year.csv', index=False)
df_center_before_2015.to_csv('modifiers_center_before_2015_groupby_year.csv', index=False)
df_center_after_2015.to_csv('modifiers_center_after_2015_groupby_year.csv', index=False)

# Analyse and visualise the changes in modifiers

Which modifiers gained importance from session 20141 onwards?
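
The comparison behind this question can be sketched on toy data. The snippet below is a minimal illustration, not the notebook's own code: the counts are made up, the helper names `relative_frequencies` and `share_change` are hypothetical, and the change is computed as the share after minus the share before, so that positive values mark modifiers that gained importance.

```python
from collections import Counter

def relative_frequencies(counts):
    """Scale raw counts so that they sum to 1."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def share_change(before, after):
    """Return (word, after - before) pairs for words in both periods, largest gain first."""
    rel_before = relative_frequencies(before)
    rel_after = relative_frequencies(after)
    common = rel_before.keys() & rel_after.keys()
    changes = {w: rel_after[w] - rel_before[w] for w in common}
    return sorted(changes.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy counts, not taken from the corpus
before = Counter({'ung': 6, 'ny': 3, 'syrisk': 1})
after = Counter({'ung': 4, 'ny': 2, 'syrisk': 4})

ranking = share_change(before, after)
for word, change in ranking:
    print(f'{word}: {change:+.2f}')
```

Relative shares are used instead of raw counts because the two periods contain different amounts of text, so raw counts are not directly comparable.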

In [28]:
# Load data

df_all_years = pd.read_csv('modifiers_all_years_groupby_year.csv')
df_before_2015 = pd.read_csv('modifiers_before_2015_groupby_year.csv')
df_after_2015 = pd.read_csv('modifiers_after_2015_groupby_year.csv')
df_left_before_2015 = pd.read_csv('modifiers_left_before_2015_groupby_year.csv')
df_left_after_2015 = pd.read_csv('modifiers_left_after_2015_groupby_year.csv')
df_right_before_2015 = pd.read_csv('modifiers_right_before_2015_groupby_year.csv')
df_right_after_2015 = pd.read_csv('modifiers_right_after_2015_groupby_year.csv')
df_center_before_2015 = pd.read_csv('modifiers_center_before_2015_groupby_year.csv')
df_center_after_2015 = pd.read_csv('modifiers_center_after_2015_groupby_year.csv')

In [29]:
#################   Analysis    ########################



###### the left wing parties subset
df_left_before_2015_c = df_left_before_2015.copy()

sum_frequency = df_left_before_2015_c['Frequency'].sum()
df_left_before_2015_c['Relative_frequency'] = df_left_before_2015_c['Frequency'] / sum_frequency

df_left_after_2015_c = df_left_after_2015.copy()
sum_frequency = df_left_after_2015_c['Frequency'].sum()
df_left_after_2015_c['Relative_frequency'] = df_left_after_2015_c['Frequency'] / sum_frequency

# Merge the DataFrames on the word column
common_words_left_df = pd.merge(df_left_before_2015_c, df_left_after_2015_c, on='Word')

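# Drop the raw counts; only the relative frequencies are compared
common_words_left_df = common_words_left_df.drop(['Frequency_x', 'Frequency_y'], axis=1)

# Calculate the change in relative frequency (after minus before),
# so that positive values mark modifiers that gained importance
common_words_left_df['Frequency_Difference'] = common_words_left_df['Relative_frequency_y'] - common_words_left_df['Relative_frequency_x']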
# Sort by difference
df_left_sorted = common_words_left_df.sort_values(by='Frequency_Difference', ascending=False)
df_left_sorted = df_left_sorted.drop(['Relative_frequency_x','Relative_frequency_y'], axis=1)



###### the right wing parties subset
df_right_before_2015_c = df_right_before_2015.copy()

sum_frequency = df_right_before_2015_c['Frequency'].sum()
df_right_before_2015_c['Relative_frequency'] = df_right_before_2015_c['Frequency'] / sum_frequency

df_right_after_2015_c = df_right_after_2015.copy()
sum_frequency = df_right_after_2015_c['Frequency'].sum()
df_right_after_2015_c['Relative_frequency'] = df_right_after_2015_c['Frequency'] / sum_frequency

# Merge the DataFrames on the word column
common_words_right_df = pd.merge(df_right_before_2015_c, df_right_after_2015_c, on='Word')

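# Drop the raw counts; only the relative frequencies are compared
common_words_right_df = common_words_right_df.drop(['Frequency_x', 'Frequency_y'], axis=1)

# Calculate the change in relative frequency (after minus before),
# so that positive values mark modifiers that gained importance
common_words_right_df['Frequency_Difference'] = common_words_right_df['Relative_frequency_y'] - common_words_right_df['Relative_frequency_x']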
# Sort by difference
df_right_sorted = common_words_right_df.sort_values(by='Frequency_Difference', ascending=False)
df_right_sorted = df_right_sorted.drop(['Relative_frequency_x','Relative_frequency_y'], axis=1)




###### the center parties subset
df_center_before_2015_c = df_center_before_2015.copy()

sum_frequency = df_center_before_2015_c['Frequency'].sum()
df_center_before_2015_c['Relative_frequency'] = df_center_before_2015_c['Frequency'] / sum_frequency

df_center_after_2015_c = df_center_after_2015.copy()
sum_frequency = df_center_after_2015_c['Frequency'].sum()
df_center_after_2015_c['Relative_frequency'] = df_center_after_2015_c['Frequency'] / sum_frequency

# Merge the DataFrames on the word column
common_words_center_df = pd.merge(df_center_before_2015_c, df_center_after_2015_c, on='Word')

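# Drop the raw counts; only the relative frequencies are compared
common_words_center_df = common_words_center_df.drop(['Frequency_x', 'Frequency_y'], axis=1)

# Calculate the change in relative frequency (after minus before),
# so that positive values mark modifiers that gained importance
common_words_center_df['Frequency_Difference'] = common_words_center_df['Relative_frequency_y'] - common_words_center_df['Relative_frequency_x']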
# Sort by difference
df_center_sorted = common_words_center_df.sort_values(by='Frequency_Difference', ascending=False)
df_center_sorted = df_center_sorted.drop(['Relative_frequency_x','Relative_frequency_y'], axis=1)




###### all parties subset
df_before_2015_c = df_before_2015.copy()

sum_frequency = df_before_2015_c['Frequency'].sum()
df_before_2015_c['Relative_frequency'] = df_before_2015_c['Frequency'] / sum_frequency

df_after_2015_c = df_after_2015.copy()
sum_frequency = df_after_2015_c['Frequency'].sum()
df_after_2015_c['Relative_frequency'] = df_after_2015_c['Frequency'] / sum_frequency

# Merge the DataFrames on the word column
common_words_all_df = pd.merge(df_before_2015_c, df_after_2015_c, on='Word')

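# Drop the raw counts; only the relative frequencies are compared
common_words_all_df = common_words_all_df.drop(['Frequency_x', 'Frequency_y'], axis=1)

# Calculate the change in relative frequency (after minus before),
# so that positive values mark modifiers that gained importance
common_words_all_df['Frequency_Difference'] = common_words_all_df['Relative_frequency_y'] - common_words_all_df['Relative_frequency_x']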
# Sort by difference
df_all_sorted = common_words_all_df.sort_values(by='Frequency_Difference', ascending=False)
df_all_sorted = df_all_sorted.drop(['Relative_frequency_x','Relative_frequency_y'], axis=1)





##################    Visualise    ################################

# Create bar charts with sorting
w= 600
h= 250

### The left
df_left_before_2015_chart = alt.Chart(df_left_before_2015.head(15)).mark_bar(color='#db3a34').encode(
    y=alt.Y('Word', sort=alt.EncodingSortField(field='Frequency', order='descending'), title='Modifiers'),
    x='Frequency',
    tooltip=['Word', 'Frequency']
).properties(
    width=w,
    height=h,
    title='Modifiers used by left wing parties before 2015',
)

df_left_after_2015_chart = alt.Chart(df_left_after_2015.head(15)).mark_bar(color='#db3a34').encode(
    y=alt.Y('Word', sort=alt.EncodingSortField(field='Frequency', order='descending'), title='Modifiers'),
    x='Frequency',
    tooltip=['Word', 'Frequency']
).properties(
    width=w,
    height=h,
    title='Modifiers used by left wing parties after 2015'
)


modifier_change_chart_left = alt.Chart(df_left_sorted.head(15)).mark_bar(color='#db3a34').encode(
    y=alt.Y('Word', sort=alt.EncodingSortField(field='Frequency_Difference', order='descending'), title='Modifiers'),
    x='Frequency_Difference',
    tooltip=['Word', 'Frequency_Difference']
).properties(
    width=w,
    height=h,
    title='Modifiers that gained more importance among left wing parties from 2015 to 2017'
)

### The right
df_right_before_2015_chart = alt.Chart(df_right_before_2015.head(15)).mark_bar(color='#177e89').encode(
    y=alt.Y('Word', sort=alt.EncodingSortField(field='Frequency', order='descending'), title='Modifiers'),
    x='Frequency',
    tooltip=['Word', 'Frequency']
).properties(
    width=w,
    height=h,
    title='Modifiers used by right wing parties before 2015'
)


df_right_after_2015_chart = alt.Chart(df_right_after_2015.head(15)).mark_bar(color='#177e89').encode(
    y=alt.Y('Word', sort=alt.EncodingSortField(field='Frequency', order='descending'), title='Modifiers'),
    x='Frequency',
    tooltip=['Word', 'Frequency']
).properties(
    width=w,
    height=h,
    title='Modifiers used by right wing parties after 2015'
)


modifier_change_chart_right = alt.Chart(df_right_sorted.head(15)).mark_bar(color='#177e89').encode(
    y=alt.Y('Word', sort=alt.EncodingSortField(field='Frequency_Difference', order='descending'), title='Modifiers'),
    x='Frequency_Difference',
    tooltip=['Word', 'Frequency_Difference']
).properties(
    width=w,
    height=h,
    title='Modifiers that gained more importance among right wing parties from 2015 to 2017'
)



### The center
df_center_before_2015_chart = alt.Chart(df_center_before_2015.head(15)).mark_bar(color='#084c61').encode(
    y=alt.Y('Word', sort=alt.EncodingSortField(field='Frequency', order='descending'), title='Modifiers'),
    x='Frequency',
    tooltip=['Word', 'Frequency']
).properties(
    width=w,
    height=h,
    title='Modifiers used by center parties before 2015'
)


df_center_after_2015_chart = alt.Chart(df_center_after_2015.head(15)).mark_bar(color='#084c61').encode(
    y=alt.Y('Word', sort=alt.EncodingSortField(field='Frequency', order='descending'), title='Modifiers'),
    x='Frequency',
    tooltip=['Word', 'Frequency']
).properties(
    width=w,
    height=h,
    title='Modifiers used by center parties after 2015'
)


modifier_change_chart_center = alt.Chart(df_center_sorted.head(15)).mark_bar(color='#084c61').encode(
    y=alt.Y('Word', sort=alt.EncodingSortField(field='Frequency_Difference', order='descending'), title='Modifiers'),
    x='Frequency_Difference',
    tooltip=['Word', 'Frequency_Difference']
).properties(
    width=w,
    height=h,
    title='Modifiers that gained more importance among center parties from 2015 to 2017'
)




### All parties

before_2015_chart = alt.Chart(df_before_2015.head(15)).mark_bar(color='#ffc857').encode(
    y=alt.Y('Word', sort=alt.EncodingSortField(field='Frequency', order='descending'), title='Modifiers'),
    x='Frequency',
    tooltip=['Word', 'Frequency']
).properties(
    width=w,
    height=h,
    title='Most common lemmatised modifiers before 2015'
)


after_2015_chart = alt.Chart(df_after_2015.head(15)).mark_bar(color='#ffc857').encode(
    y=alt.Y('Word', sort=alt.EncodingSortField(field='Frequency', order='descending'), title='Modifiers'),
    x='Frequency',
    tooltip=['Word', 'Frequency']
).properties(
    width=w,
    height=h,
    title='Most common lemmatised modifiers after 2015'
)

modifier_change_chart_all = alt.Chart(df_all_sorted.head(15)).mark_bar(color='#ffc857').encode(
    y=alt.Y('Word', sort=alt.EncodingSortField(field='Frequency_Difference', order='descending'), title='Modifiers'),
    x='Frequency_Difference',
    tooltip=['Word', 'Frequency_Difference']
).properties(
    width=w,
    height=h,
    title='Modifiers that gained more importance among all parties from 2015 to 2017'
)



# Display the charts
alt.concat(
    
    # blue
    df_right_before_2015_chart,
    df_right_after_2015_chart,
    modifier_change_chart_right,
    
    # center
    df_center_before_2015_chart,
    df_center_after_2015_chart,
    modifier_change_chart_center,
    
    #red
    df_left_before_2015_chart,
    df_left_after_2015_chart,
    modifier_change_chart_left,
    
    # all parties
    before_2015_chart,
    after_2015_chart,
    modifier_change_chart_all,
    columns=1,
    title='Common modifiers used by the left, the center, and the right wing parties that relate to asylum seeker, immigrant, integration, and refugee (political wings subset)',
   
)



<a id="da4"></a>
# Data analysis 4
## Using Sentida's Sentiment Analysis algorithm, we analyze the mood in the speeches of the different parties.

We perform sentiment analysis to identify changes in sentiments in speeches related to the words "asylansøger" (asylum seeker), "flygtning" (refugee), "indvandrer" (immigrant), and "integration".
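
Selecting the sentences that mention one of these words can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `keyword_sentences` and the example text are hypothetical, sentences are split naively on `.`, `?` and `!`, and matching is done case-insensitively on substrings.

```python
import re

KEYWORDS = ('asylansøger', 'flygtning', 'indvandrer', 'integration')

def keyword_sentences(text, keywords=KEYWORDS):
    """Return the sentences that mention at least one keyword, each sentence once."""
    sentences = [s.strip() for s in re.split(r'[.?!]', text) if s.strip()]
    return [s for s in sentences if any(k in s.lower() for k in keywords)]

# Hypothetical example text, not taken from the corpus
speech = 'Integration kræver arbejde. Vejret er godt i dag. Flygtninge og indvandrere skal have hjælp!'
print(keyword_sentences(speech))
```

Because the match is on substrings, inflected forms such as "flygtninge" and compounds containing a keyword are also caught, while a sentence with several keywords is still returned only once.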

Which sentiment analysis approach do we use? In its DaNLP repository, a collection of Natural Language Processing resources for Danish, the Alexandra Institute provides an overview of open sentiment analysis models and datasets for Danish. There are two types of models: 1. wordlist models, which return a float score that corresponds to a sentiment: negative (minus), neutral (0), or positive (plus); 2. language models, which return a descriptive text string that corresponds to a semantic field, for example joy/peace of mind, expectation/interest, or trust/acceptance. [Alexandra Institute. (2021). Sentiment_analysis.md.](https://github.com/alexandrainst/danlp/blob/master/docs/docs/tasks/sentiment_analysis.md)

In this notebook we use sentiment analysis to look at variations in scores. This leaves us with two models: AFINN and Sentida. We choose Sentida, since it has been updated with new functionality, whereas Finn Årup Nielsen, the author of AFINN, has not published on sentiment analysis since 2019. Both AFINN and Sentida are wordlist models that aggregate a sentiment score based on the occurrence of words from the wordlist in a given text. There are four problems with this approach:

- the model does not take into account syntactic relations among words.
- the model ignores adverbs, and thus the role adverbs play in expressing degrees of something and attitudes towards something.
- the model does not reflect people's way of perceiving emotions.
- the model cannot handle words that mean two different things.

The developers behind Sentida found inspiration in the English-language VADER model and have tried to minimize these problems by building in different, simple forms of "awareness", for example by increasing scores around negations, capital letters, and exclamation marks to give these elements of the language more attention.

For example:

_“Maden (+0.3) var god (+2.3), (← x 0.5) men (1.5 x →) serviceringen (+0.3) var elendig (-4.3).” ⇒ 1.3 - 6 ⇒ sentiment score: -4.7_

_“The food (+0.3) was good (+2.3), (← x 0.5) but (1.5 x →) the service (+0.3) was horrendous (-4.3).” ⇒ 1.3 - 6 ⇒ sentiment score: -4.7_

_“Det er så sejt (+3.6)! (← x 1.291)” ⇒ sentiment score: 4.6_

_“It is so cool (+3.6)! (← x 1.291)” ⇒ sentiment score: 4.6_

_“DET ER SÅ SEJT (+3.6). (← x 1.733)” ⇒ sentiment score: 6.2_

_“IT IS SO COOL (+3.6). (← x 1.733)” ⇒ sentiment score: 6.2_

_Kran, E., & Orm, S.: EMMA - Danish Natural-Language Processing of Emotion in Text. The new State-of-the-Art in Danish Sentiment Analysis and a Multidimensional Emotional Sentiment Validation Dataset, Journal of Language Work, Vol 5, No. 1, 2020._
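
The arithmetic in the quoted examples can be retraced with a few lines of Python. This is a toy sketch and not Sentida's implementation: the word scores and multipliers are copied from the examples, the arrows are read as multipliers applied to the scores on that side, and the helper names `plain_sum` and `sum_with_but` are hypothetical.

```python
import re

# Word scores copied from the quoted example; every other word counts as 0
SCORES = {'maden': 0.3, 'god': 2.3, 'serviceringen': 0.3, 'elendig': -4.3}

def plain_sum(text):
    """Sum the wordlist scores of all tokens, ignoring sentence structure."""
    return sum(SCORES.get(t, 0.0) for t in re.findall(r'\w+', text.lower()))

def sum_with_but(text, before=0.5, after=1.5):
    """Down-weight the scores before 'men' (but) and up-weight those after it."""
    left, sep, right = text.lower().partition(' men ')
    if not sep:
        # No 'men' in the text: fall back to the plain sum
        return plain_sum(text)
    return before * plain_sum(left) + after * plain_sum(right)

sentence = 'Maden var god, men serviceringen var elendig.'
print(round(plain_sum(sentence), 1))     # -1.4
print(round(sum_with_but(sentence), 1))  # -4.7

# Multipliers from the exclamation mark and capital letter examples
print(round(3.6 * 1.291, 1))  # 4.6
print(round(3.6 * 1.733, 1))  # 6.2
```

A plain wordlist sum gives -1.4 for the first sentence, while weighting the clause after "men" more heavily gives the -4.7 reported in the example.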

## Add a Sentiment Score to the selected speeches

In [30]:
# Read the subset
subset = pd.read_csv('https://sciencedata.dk/shared/ab2377f7e7858f483cd101b8a4eb934f?download')
subset['Year'] =  subset['Date'].str.extract(r'(\d{4})')
subset_for_sa = subset.copy()

In [31]:
# Split the text into sentences and keep those that contain a keyword.
import re

def extract_keyword_sentences(text):
    key_words = ['asylansøger', 'flygtning', 'indvandrer', 'integration']
    sentences = re.split('[?.]', text)
    kw_sent = []
    for s in sentences:
        # any() keeps a sentence once, even if it contains several keywords
        if any(w in s for w in key_words):
            kw_sent.append(s.strip())

    return '. '.join(kw_sent)

subset_for_sa['keyword_sents'] = subset_for_sa['Text'].apply(extract_keyword_sentences)


# Add Sentiment Score

print ('adding sentiment score - it takes time ... ')
from sentida import Sentida

# Create the Sentida instance once and reuse it for every text
SV = Sentida()

def sentiment_score(text_input):
    return SV.sentida(text = text_input, output = 'mean', normal = False, speed = 'normal')


keyword_sents = list(subset_for_sa['keyword_sents'])


# Score the extracted sentences of each speech; str() guards against non-string values
sentiment_scores = []
for text in keyword_sents:
    sentiment_scores.append(sentiment_score(str(text)))


subset_for_sa['Sentiment_Scores'] = sentiment_scores

subset_for_sa_output = subset_for_sa[['Date','Year','Party', 'Sentiment_Scores']].copy()


# Save for later
subset_for_sa_output.to_csv('subset_sentiment_score_2.csv', index=False)

print ('done adding sentiment score')

adding sentiment score - it takes time ... 
done adding sentiment score


In [32]:
subset_for_sa.at[15, 'keyword_sents']

'Vi har et ønske om at sikre en god integration, der gør, at vi alle sammen får et bedre forhold, ved at alle kommer til at bidrage til samfundsudviklingen'

In [33]:
# Inspect
print (f"Text:\n{subset_for_sa.at[15, 'keyword_sents']}\n")
print (f"Sentiment Score: {subset_for_sa.at[15, 'Sentiment_Scores']}")

Text:
Vi har et ønske om at sikre en god integration, der gør, at vi alle sammen får et bedre forhold, ved at alle kommer til at bidrage til samfundsudviklingen

Sentiment Score: 0.7530864197530863


## Visualise the mean Sentiment Scores over time - by year 

In [34]:
import pandas as pd
import altair as alt
import warnings
warnings.filterwarnings("ignore")

#subset_sentiment_score = pd.read_csv('https://sciencedata.dk/shared/5ce9ee25d12d6c895de9e3a309de27a1?download')
subset_sentiment_score = pd.read_csv('subset_sentiment_score_2.csv')

In [35]:
sentiment_year = subset_sentiment_score.groupby(['Year', 'Party'])['Sentiment_Scores'].mean().to_frame().reset_index()

# Define a selection that allows users to select a Party
selection = alt.selection_single(
    fields=['Party'],  # 'Party' is a column in sentiment_year
    bind=alt.binding_select(options=sentiment_year['Party'].unique(), name='Select Party'),
    empty='all'  # With nothing selected, all parties are highlighted
)


chart = alt.Chart(sentiment_year).mark_line(interpolate='basis').encode(
    x='Year:O',  # Ordinal data for year
    y='Sentiment_Scores:Q',  # Quantitative data for sentiment scores
    color='Party:N',  # Nominal data for party
    opacity=alt.condition(selection, alt.value(0.8), alt.value(0.2)),  # Dim the parties that are not selected
    #tooltip=['Year', 'Party', 'Sentiment_Scores']
).properties(
    width=700,
    height=250,
    title='Sentiment Scores by Party over Years',
).interactive(
).add_params(
    selection
)

chart

## Visualise the mean Sentiment Scores over time - by day

In [36]:
# Group data by 'Party' and 'Date', then calculate the **mean** sentiment score
grouped_data = subset_sentiment_score.groupby(['Party', 'Date'])['Sentiment_Scores'].mean().reset_index()

# Create a dropdown selection using .binding_select() and .selection_single()
dropdown = alt.binding_select(options=subset_sentiment_score['Party'].unique(), name='Party')
selection = alt.selection_single(fields=['Party'], bind=dropdown)

# Create brush for interactivity using .selection_interval()
brush = alt.selection_interval(encodings=['x'])


# Conditional color encoding
color_condition = alt.condition(
    alt.datum.Sentiment_Scores > 0, 
    alt.value('#809c13'),  # Positive values colored green
    alt.value('#db3a34')     # Zero and negative values colored red
)


# Base chart with dynamic filter based on dropdown and aggregated data
base = alt.Chart(grouped_data).mark_bar().encode(
    x = 'Date:T',
    y = alt.Y('Sentiment_Scores:Q', title='Mean Sentiment Score'),
    color= color_condition
).transform_filter(
    selection
).properties(
    width = 700,
    height = 200,
    title = 'The Daily Mean Sentiment Score Grouped by Party' 
)

# Upper chart uses the brush for interactivity
upper = base.encode(alt.X('Date:T', scale=alt.Scale(domain=brush)))

# Lower chart allows brushing
lower = base.properties(height=60, title='Use the drop-down to select a party and drag on this chart to select an interval').add_params(brush)

# Combine the charts
combined_chart = alt.vconcat(upper, lower).add_params(selection)

combined_chart 


# References

[Hansen, Dorte Haltrup and Navarretta, Costanza, 2021, The Danish Parliament Corpus 2009 - 2017, v2, w. subject annotation, CLARIN-DK-UCPH Centre Repository](http://hdl.handle.net/20.500.12115/44)

spaCy Usage Documentation, [Language Processing Pipelines](https://spacy.io/usage/processing-pipelines). Date accessed December 13, 2023.

Altair Guides: [altair.binding_select](https://altair-viz.github.io/user_guide/generated/api/altair.binding_select.html#altair-binding-select). Date accessed December 13, 2023.

Altair Guides: [altair.selection_single](https://altair-viz.github.io/user_guide/generated/api/altair.selection_single.html#altair-selection-single). Date accessed December 13, 2023.

Altair Guides: [altair.selection_interval](https://altair-viz.github.io/user_guide/generated/api/altair.selection_interval.html#altair-selection-interval). Date accessed December 13, 2023.


Data Science Stack Exchange: [Increasing SpaCy max NLP limit](https://datascience.stackexchange.com/questions/38745/increasing-spacy-max-nlp-limit) Date accessed December 13, 2023. 

Statistikbanken.dk [Note on the Gross Number of Applicants from the page](https://www.statistikbanken.dk/VAN5RKA). Date accessed 31-10-2023.

Kran, E., & Orm, S.: [EMMA - Danish Natural-Language Processing of Emotion in Text. The new State-of-the-Art in Danish Sentiment Analysis and a Multidimensional Emotional Sentiment Validation Dataset, Journal of Language Work, Vol 5, No. 1, 2020.](https://tidsskrift.dk/lwo/article/download/121221/168666)

Alexandra Institute: [Sentiment_analysis.md.](https://github.com/alexandrainst/danlp/blob/master/docs/docs/tasks/sentiment_analysis.md), 2021, Date accessed 31-10-2023

Allen, William L.: [A Decade of Immigration in the British Press, 2016](https://migrationobservatory.ox.ac.uk/resources/reports/decade-immigration-british-press/) 