We need to measure how much 'attention' the Member States give to environmental issues in their speeches.

We will need to convert each speech into a quantitative variable.
One approach is sentence-based climate change topic detection.
Every sentence of every speech would be classified as being about climate change or not, and the share of climate-related sentences can then serve as the attention measure for that speech.
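
As a rough sketch of how per-sentence labels could be turned into one number per speech, the snippet below splits a speech into sentences with a naive regular expression, labels each sentence, and reports the share labeled as climate-related. The function is_climate_sentence and its keyword list are hypothetical stand-ins for a trained classifier, and attention_share is an illustrative name, not part of any library.

```python
import re

# Hypothetical keyword list; a stand-in for a trained sentence classifier.
CLIMATE_KEYWORDS = ("climate", "global warming", "greenhouse", "emissions")


def is_climate_sentence(sentence):
    """Placeholder classifier: True if the sentence mentions a climate keyword."""
    text = sentence.lower()
    return any(keyword in text for keyword in CLIMATE_KEYWORDS)


def split_sentences(speech):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", speech.strip())
    return [p for p in parts if p]


def attention_share(speech, classify=is_climate_sentence):
    """Share of sentences in a speech that are classified as climate-related."""
    sentences = split_sentences(speech)
    if not sentences:
        return 0.0
    return sum(1 for s in sentences if classify(s)) / len(sentences)


# Invented example speech: two of the four sentences mention climate terms.
speech = (
    "We welcome the new trade agreement. "
    "Climate change threatens our coastal cities. "
    "Greenhouse gas emissions must fall this decade. "
    "Peacekeeping remains a priority."
)
share = attention_share(speech)
print(share)
```

Passing a different callable as classify (for example a wrapper around a fitted scikit-learn pipeline's predict method) would reuse the same aggregation step.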

We can use the following paper: https://s3.us-east-1.amazonaws.com/climate-change-ai/papers/neurips2020/69/paper.pdf, the corresponding dataset: https://www.sustainablefinance.uzh.ch/en/research/climate-fever.html, and the following tutorial: https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

In [33]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the labeled dataset to be used for training
climate = pd.read_json('data/climate-fever-dataset-r1.jsonl', lines=True) # Source https://www.sustainablefinance.uzh.ch/en/research/climate-fever.html
climate.head()



Unnamed: 0,claim_id,claim,claim_label,evidences
0,0,Global warming is driving polar bears toward e...,SUPPORTS,[{'evidence_id': 'Extinction risk from global ...
1,5,The sun has gone into ‘lockdown’ which could c...,SUPPORTS,"[{'evidence_id': 'Famine:386', 'evidence_label..."
2,6,The polar bear population has been growing.,REFUTES,"[{'evidence_id': 'Polar bear:1332', 'evidence_..."
3,9,Ironic' study finds more CO2 has slightly cool...,REFUTES,"[{'evidence_id': 'Atmosphere of Mars:131', 'ev..."
4,10,Human additions of CO2 are in the margin of er...,REFUTES,[{'evidence_id': 'Carbon dioxide in Earth's at...


In [42]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    climate.claim, climate.claim_label, test_size=0.33, random_state=42
)

The pipeline tokenizes the text, counts token occurrences, applies TF-IDF weighting, and trains a Naive Bayes classifier. Note that CountVectorizer() with default settings does not filter stop words.

In [47]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Training a classifier
climate_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

climate_clf.fit(X_train, y_train)
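
CountVectorizer with default arguments keeps stop words, because its stop_words parameter defaults to None. A minimal sketch of the difference, using two invented sentences rather than the CLIMATE-FEVER claims:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy sentences, invented for illustration only.
docs = [
    "The ice is melting in the Arctic.",
    "Sea levels are rising and the coast is at risk.",
]

default_vect = CountVectorizer().fit(docs)
filtered_vect = CountVectorizer(stop_words="english").fit(docs)

default_vocab = set(default_vect.vocabulary_)
filtered_vocab = set(filtered_vect.vocabulary_)

# Tokens that only the default vectorizer keeps, i.e. the removed stop words.
print(sorted(default_vocab - filtered_vocab))
```

If stop word filtering is wanted in the pipeline, the 'vect' step could be written as CountVectorizer(stop_words='english'); whether that helps on this dataset would need to be checked.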

In [48]:
import numpy as np

# Evaluation
predicted = climate_clf.predict(X_test)

np.mean(predicted == y_test)

0.46745562130177515

See whether a linear support vector machine (SVM) improves the accuracy over Naïve Bayes.

In [49]:
from sklearn.linear_model import SGDClassifier

climate_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2',
                          alpha=1e-3, random_state=42,
                          max_iter=5, tol=None)),
])

climate_clf.fit(X_train, y_train)

In [51]:
predicted = climate_clf.predict(X_test)
np.mean(predicted == y_test)

0.46548323471400394

In [53]:
from sklearn import metrics

print(metrics.classification_report(y_test, predicted))

                 precision    recall  f1-score   support

       DISPUTED       0.29      0.05      0.08        41
NOT_ENOUGH_INFO       0.38      0.29      0.33       160
        REFUTES       0.39      0.21      0.28        75
       SUPPORTS       0.51      0.74      0.60       231

       accuracy                           0.47       507
      macro avg       0.39      0.32      0.32       507
   weighted avg       0.43      0.47      0.43       507
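
For context, the support column shows that 231 of the 507 test claims are labeled SUPPORTS, so always predicting the most frequent class would already score about 0.46. A minimal sketch of that baseline with scikit-learn's DummyClassifier follows. It rebuilds a label vector from the support counts in the report rather than from the actual data, which is an assumption made to keep the example self-contained.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class counts copied from the support column of the classification report.
support = {"DISPUTED": 41, "NOT_ENOUGH_INFO": 160, "REFUTES": 75, "SUPPORTS": 231}
y = np.repeat(list(support.keys()), list(support.values()))

# The dummy model ignores the features, so a column of zeros is enough here.
X = np.zeros((len(y), 1))

# Always predict the most frequent label and measure the resulting accuracy.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = baseline.score(X, y)
print(round(baseline_accuracy, 3))
```

In the notebook itself, the baseline would be fit on X_train, y_train and scored on X_test, y_test. Compared with it, the two accuracies obtained so far (0.467 and 0.465) are only marginally better than always predicting SUPPORTS.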



Parameter tuning

In [54]:
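from sklearn.model_selection import GridSearchCV

# Candidate values for each pipeline step, addressed as <step>__<parameter>
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}

# 5-fold cross-validated search over all 8 combinations, using all CPU cores
gs_clf = GridSearchCV(climate_clf, parameters, cv=5, n_jobs=-1)
# Fit on the first 100 training samples only
gs_clf.fit(X_train[:100], y_train[:100])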



In [55]:
gs_clf.best_score_

0.44000000000000006

In [56]:
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))
    

clf__alpha: 0.01
tfidf__use_idf: True
vect__ngram_range: (1, 2)
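
The best score above comes from cross-validation on a 100-sample subset, so it is not directly comparable to the test-set accuracies reported earlier. As a hedged sketch of how a finished search could be inspected, the snippet below runs the same kind of grid search on a tiny invented corpus and tabulates cv_results_. The texts, the two labels, and cv=3 are assumptions chosen to keep the example self-contained.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny invented corpus: 6 climate-related and 6 unrelated sentences.
texts = [
    "Sea levels are rising", "Carbon emissions keep growing",
    "Glaciers are melting fast", "Global warming harms crops",
    "Greenhouse gases trap heat", "Droughts follow rising temperatures",
    "The council adopted the budget", "Trade talks resume next week",
    "The election was held in May", "Peacekeepers arrived on Monday",
    "The treaty covers fishing rights", "Delegates discussed visa rules",
]
labels = ["climate"] * 6 + ["other"] * 6

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(loss="hinge", penalty="l2", random_state=42)),
])
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": (True, False),
    "clf__alpha": (1e-2, 1e-3),
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(texts, labels)

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score").to_string(index=False))

# With refit=True (the default), the search object predicts with the best pipeline.
prediction = search.predict(["Emissions are rising"])
print(prediction)
```

In the notebook, gs_clf.score(X_test, y_test) would give the held-out accuracy of the tuned pipeline, which is the number to compare against the untuned results.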
