# Exam details
Written work:  

You will have to analyze a dataset and test a relevant hypothesis.

You should choose your own dataset by Lecture 12 (written hand-in already available)

The report will be written as a Jupiter notebook.

Minimal set of analysis:
1) Describe your data and visualize some key dimensions.
2) Perform at least two analysis (depending on what is appropriate given the data you selected):
- Does your data contain quantitative values that allow for a hypothesis testing?
   IF YES: Formulate an hypothesis and test it. Complement the testing with an appropriate visualization.
- Does your data contain unstructured textual information?
 IF YES: Perform sentiment analysis on your data and describe and visualize the results.
- Does your data contain network structures (or a network structure can be extracted)?
 IF YES: Ask a question about the network structure and answer it.

OBS: Different datasets can be investigated in many different ways. Any combination of the above-described analysis is acceptable as long as you ask 2 questions. 
e.g. Statistical hypothesis testing + network analysis, Sentiment analysis + network analysis , Statistical hypothesis testing + Statistical hypothesis testing.

Groups:  Find your own group (2 ppls - 3 is possible) let the TAs know before Fall break.

Opgaven skal indeholde en eller flere af følgende: 
- Hypothesis testing
- Sentiment Analyse
- Network Analysis 

Desuden også
- Interactive Visualization 

# Data
Datasættet stammer fra "The Danish Parliament Corpus 2009 - 2017, v2, w. subject annotation".

Kilde: 
Hansen, Dorte Haltrup and Navarretta, Costanza, 2021, The Danish Parliament Corpus 2009 - 2017, v2, w. subject annotation, CLARIN-DK-UCPH Centre Repository, http://hdl.handle.net/20.500.12115/44.


Datasættet består af transkriptioner af taler i Folketinget fra første samling 2009 til og med først samling 2016 (6/10 2009 – 7/9 2017). Til hver tale er der tilknyttet metadata, dels om medlemmet af folketinget ('Name', 'Gender', 'Party', 'Role', 'Title', 'Birth', 'Age'), dels om talen (Date', 'samling', 'Start time', 'End time', 'Time', 'Agenda item', 'Case no', 'Case type', 'Agenda title', 'Subject 1', 'Subject 2').

Datasættet er struktureret i tsv txt-filer, som er formateret i utf-8. Der er en fil per møde.

Kilde:
Samme, Readme


Til denne opgave har vi samlet tsv filerne i et nyt datasæt, som vi har gemt i en csv fil separeret med pipes. Csv filen er uploadet til sciencedata.dk, hvorfra den kan downloades via url med pandas.read_csv() metoden.





# Emnet
Emnet er immigrationspolitik fra 2009 - 2017. Hvilke kendetegn har de forskellige partiers politik vurdereret ud fra partimedlemers taler i Folketinget?

1. Ved hjælp af Tf-Idf identificerer vi de særegne nøgleord, der kendetegner de forskellige partier.

2. sentiment analyse på taler som indeholder noget om flygtning. Det kan f.eks. være taler, som handler om 'os' og 'dem' - nærlæsning. Bliver der større variation i sentiment-scorrerne mere varieret op til et valg? Bliver de mere varieret omkring 2015?  
3. pos -tag f.eks. verber fra forskellige partier, og hvilke adjektiver knytters sig til et begreb. 


# Load data

Data ligger på sciencedata.dk og deles derfra med download link.

In [1]:
import pandas as pd
from tqdm import tqdm
tqdm.pandas()

In [2]:
# Load data from sciencedata.dk
df = pd.read_csv('https://sciencedata.dk/shared/825e999a5c13fd22d28d4289fa899ba1?download', sep='|')

In [3]:
print (f'The coloumns are {df.columns}')

The coloumns are Index(['ID', 'Date', 'samling', 'Start time', 'End time', 'Time',
       'Agenda item', 'Case no', 'Case type', 'Agenda title', 'Subject 1',
       'Subject 2', 'Name', 'Gender', 'Party', 'Role', 'Title', 'Birth', 'Age',
       'Text'],
      dtype='object')


# Question 1
# TF-IDF. Segment = speech 

1. Ved hjælp af Tf-Idf identificerer vi de særegne nøgleord, der kendetegner de forskellige partier.

Method:
1. Subset data on the subject value "Immigration" and on the role value "member". Group on session and party and aggregate the speeches according to the groups.  
2. Preprocess the texts using a stopword list and ***Spacy*** to lemmatize words in speeches.
3. Use **Tf-Idf** to identify distinctive keywords



Der skal renses bedre, fordi spacy lemmatizer løber tør for plads med så mange ord.
Jeg kan fjerne ord med tal, tal og ord mindre end to bogstaver.



In [5]:
# Subset data
input_data = df[(df['Subject 1'] == 'Immigration') & (df['Role'] == 'medlem')].reset_index()
# Group by 'session' and 'party' and aggregate speeches
input_data_grouped = df.groupby(['samling', 'Party'])['Text'].agg(' '.join).reset_index()

In [8]:
# Preprocess
import re
import time
from urllib.request import urlopen
startTime = time.time()

def scrub_text(text):
    # Remove all numbers (including integers and decimals)
    text_without_numbers = re.sub(r'\b\d+(\.\d+)?\b', '', text)

    # Remove words that include numbers
    text_without_words_with_numbers = re.sub(r'\w*\d\w*', '', text_without_numbers)
    
    # find all none whitspace characters between two word bounderies
    return re.findall(r'\b\S+\b', text_without_words_with_numbers.lower().replace('_', ' ')) 

# load stopword list from sciencedata.dk
with urlopen('https://sciencedata.dk/shared/01f12a3094769a0a6c66fdd0d7bb7dac?download') as response:
    stop_words = response.read().decode('utf-8-sig').split('\r\n')    
    
    
def filter_stopword(text_list):
    return [i for i in text_list if i not in stop_words]

def remove_short_words(text_list):
    return [i for i in text_list if len(i) > 2]
  
print('speeches')    
speeches = input_data_grouped['Text'].tolist()

print('clean_strings_in_list')
clean_strings_in_list = [scrub_text(i) for i in speeches]

print('strings_wo_stop_words')
strings_wo_stop_words = [filter_stopword(text) for text in clean_strings_in_list] 

print('filter_short_words')
remove_short_words = [remove_short_words(text) for text in strings_wo_stop_words] 

print('strings')
strings = [' '.join(i) for i in remove_short_words]

executionTime = (time.time() - startTime)
print('Execution time in seconds: ' + str(executionTime))

speeches
clean_strings_in_list
strings_wo_stop_words
filter_short_words
strings
Execution time in seconds: 163.52673745155334


In [9]:
input_data_grouped['Clean_text_wo_sw'] = strings

Hvis jeg kører Spacy-lemmatizer-koden nedenfor får jeg en valueerror:

_ValueError: [E088] Text of length 1975499 exceeds maximum of 1000000. The parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`._

For at koorigere den plan vi har haft, kunne vi opdele data i mindre subsets, applicere Spacy lemmatizeren på mindre dele og samle datasættet igen. Vi er interesseret i at anvende Spacy frem for NLTK, fordi NLTK har ikke en lematizer, der kan benyttes på dansk tekst. Clarin-dk har udviklet en lemmatizer til dansk tekst, men den er ufleksibel, fordi den skal benyttes via et webinterface og kan ikke håndtere de mængder af data, som har arbejder med.

Vores problem er at tekstlængden på 1975499, som giver en valueerror er kun en brøkdel af vores data. Det er kun den del, der består af alle DF medlemmers taler inden for emnet 'immigration' i samlingen 20091.

Vi kan observere, at det er krævende for en computer at arbejde med Spacy, når det kommer til store tekstmængder, hvilket er problematisk, fordi flere og flere studerende og professionelle arbejder med text mining på almindelige, gennemsnitlige computere.

Problemet bliver behandlet på online fællesskaber som Stack overflow og Stackexchange, for eksempel i denne blogpost 
_Increasing SpaCy max NLP limit:_ https://datascience.stackexchange.com/questions/38745/increasing-spacy-max-nlp-limit, hvor der stilles forslag om 1. at hæve værdien i funktionen _nlp.max_length_ for eksempel nlp.max_length = 4000000 og 2. at tilføje til nlp funktionen et argument, der deaktiverer ner og parser fra det obejct, som nlp functionen returnerer. 
nlp(text, disable = ['ner', 'parser'])


In [10]:
import spacy

# Load Danish spacy model
nlp = spacy.load("da_core_news_sm")
nlp.max_length = 5000000 #or any large value, as long as you don't run out of RAM

def lemmatize_text(text):
    doc = nlp(text, disable = ['ner', 'parser'])
    lemmas = [x.lemma_ for x in doc]
    return lemmas

In [11]:
import time
startTime = time.time()
input_data_grouped['Lemmatized_text'] = input_data_grouped['Clean_text_wo_sw'].progress_apply(lambda x : lemmatize_text(x))
executionTime = (time.time() - startTime)
print('Execution time in seconds: ' + str(executionTime))

100%|████████████████████████████████████████████████████████████████████████████████| 120/120 [09:22<00:00,  4.68s/it]

Execution time in seconds: 562.18892121315





In [12]:
def join_list(word_list):
    return ' '.join(word_list)
input_data_grouped['Lemmatized_text'] = input_data_grouped['Lemmatized_text'].apply( lambda x : join_list(x)) 

In [15]:
input_data_grouped[['samling', 'Party','Lemmatized_text']].to_csv('lemmas_tf_df.csv', index=False)

# Tf-Idf

In [17]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from urllib.request import urlopen

# load stopword list from sciencedata.dk
with urlopen('https://sciencedata.dk/shared/01f12a3094769a0a6c66fdd0d7bb7dac?download') as response:
    stop_words = response.read().decode('utf-8-sig').split('\r\n') 

def tf_idf(text_list):
    # Initialize the TF-IDF vectorizer
    tfidf_vectorizer = TfidfVectorizer(max_features=10000)

    # Fit and transform the text data to calculate TF-IDF scores
    tfidf_matrix = tfidf_vectorizer.fit_transform(text_list)

    # Get the TF-IDF scores for each item in the text_list
    tfidf_scores = tfidf_matrix.toarray()

    # Initialize a list to store the TF-IDF scores for each item
    tfidf_scores_list = []

    # Iterate through the text_list and build the list of TF-IDF scores
    for i, item in enumerate(text_list):
        words = tfidf_vectorizer.get_feature_names_out()
        scores = tfidf_scores[i]
        item_scores = {word: score for word, score in zip(words, scores) if score > 0}
        tfidf_scores_list.append(item_scores)

    return tfidf_scores_list

text_list = input_data_grouped['Lemmatized_text'].tolist()


distinctive_words = tf_idf(text_list)
input_data_grouped['distinctive_words'] =  distinctive_words

In [19]:
input_data_grouped[['samling', 'Party','distinctive_words']].to_csv('distinctive_keywords.csv', index=False)

In [21]:
input_data_grouped['distinctive_words']

0      {'aa': 0.0017881791460270247, 'aaen': 0.027348...
1      {'aaen': 0.0015018063733042001, 'aalborg': 0.0...
2      {'aaen': 0.012353096560451231, 'aalborg': 0.01...
3      {'afslutning': 0.03590976435798225, 'afslutnin...
4      {'aa': 0.0017273624779354773, 'aaen': 0.025228...
                             ...                        
115    {'aaja': 0.013314896049065541, 'aalborg': 0.00...
116    {'aaja': 0.0065066882588372425, 'aalborg': 0.0...
117    {'aaja': 0.015361357552289612, 'absolut': 0.01...
118    {'absolut': 0.012174042175246407, 'acceptabelt...
119    {'aaja': 0.00585553659971789, 'aalborg': 0.004...
Name: distinctive_words, Length: 120, dtype: object