NLP process
1. Sentence segmentation
2. Tokenization
3. Stop-word removal
4. Stemming (stripping suffixes such as -ing, -s, -ed)
5. Lemmatization (am, are, is: be)
6. Part-of-speech tagging (noun, verb, preposition...)
7. Named entity recognition (location, name...)
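
The first five steps can be illustrated with a toy sketch that uses only the standard library. The stop-word set, the lemma lookup and the suffix list below are tiny hypothetical stand-ins; a real pipeline would rely on NLTK or spaCy, and steps 6 and 7 need a trained tagger, so they are left out here.

```python
import re

STOP_WORDS = {"the", "is", "are", "am", "a", "an", "and", "of", "to", "in"}  # tiny illustrative set
LEMMAS = {"am": "be", "are": "be", "is": "be"}  # toy lemma lookup
SUFFIXES = ("ing", "ed", "s")  # suffixes listed in step 4


def segment(text):
    """Step 1: split text into sentences after ., ! or ?."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Step 2: lower-case the sentence and keep alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def remove_stop_words(tokens):
    """Step 3: drop tokens that carry little meaning on their own."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Step 4: crude stemming, strip the first matching suffix if at least 3 letters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def lemmatize(token):
    """Step 5: map an inflected form to its dictionary form via the lookup."""
    return LEMMAS.get(token, token)


text = "Prices are rising. The bank raised rates."
sentences = segment(text)
tokens = [tokenize(s) for s in sentences]
lemmas = [[lemmatize(t) for t in sent] for sent in tokens]
content = [remove_stop_words(sent) for sent in tokens]
stems = [[stem(t) for t in sent] for sent in content]

print(sentences)
print(lemmas)
print(stems)
```

Note how the crude stemmer turns "rising" into "ris": stemming only chops suffixes, whereas lemmatization maps a word to a real dictionary form.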

# 1. Load data

In [1]:
import pandas as pd
import os

In [2]:
# working directory
work_dir = os.getcwd()

In [3]:
input_path = os.path.join(work_dir, "INPUT/all_speeches.csv")
speeches_data = pd.read_csv(input_path)
speeches_data["date"] = pd.to_datetime(speeches_data["date"],format="%d/%m/%Y")

In [4]:
# Option 1: random sample of 100 rows for testing
#df_raw = speeches_data.sample(n = 100)

# Option 2: the 20 most recent speeches for testing
df_raw = speeches_data.sort_values("date").tail(20)

# 2.1 KeyBERT: keyword extraction

CountVectorizer Tips & Tricks
1. Pass a custom vectorizer to control the candidate n-grams, e.g. vectorizer = CountVectorizer(ngram_range=(1, 3), stop_words="english")
2. Use KeyphraseVectorizers: \
Extracts grammatically accurate keyphrases based on their part-of-speech tags.\
No need to specify n-gram ranges.\
Produces document-keyphrase matrices.\
Supports multiple languages.\
Allows user-defined part-of-speech patterns for keyphrase extraction.
3. Apply Maximal Marginal Relevance (MMR) on top of KeyBERT to diversify the output.
4. Part-of-speech patterns: \
KeyphraseVectorizers tags each document with part-of-speech labels and then applies a regex pattern over the tags to extract the keyphrases that match.
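
To make tip 4 concrete, here is a minimal sketch of regex matching over part-of-speech tags. It does not call KeyphraseVectorizers: the tokens are hand-tagged (hypothetical data), and the pattern "zero or more adjectives followed by one or more nouns" is an assumption chosen for illustration.

```python
import re

# Hypothetical, hand-tagged tokens; a real pipeline would get the tags from a POS tagger
tagged = [
    ("the", "DT"), ("central", "JJ"), ("bank", "NN"), ("raised", "VBD"),
    ("interest", "NN"), ("rates", "NNS"), ("to", "TO"), ("fight", "VB"),
    ("high", "JJ"), ("inflation", "NN"),
]

# Zero or more adjective tags (J*) followed by one or more noun tags (N*)
PATTERN = re.compile(r"(?:<J[^>]*>)*(?:<N[^>]*>)+")


def extract_keyphrases(tagged_tokens):
    """Return the token spans whose tag sequence matches PATTERN."""
    tag_string = ""
    starts = {}  # offset where a tag starts in tag_string -> token index
    ends = {}    # offset where a tag ends in tag_string -> token index
    for i, (_token, tag) in enumerate(tagged_tokens):
        starts[len(tag_string)] = i
        tag_string += f"<{tag}>"
        ends[len(tag_string)] = i

    phrases = []
    for match in PATTERN.finditer(tag_string):
        first, last = starts[match.start()], ends[match.end()]
        phrases.append(" ".join(tok for tok, _ in tagged_tokens[first:last + 1]))
    return phrases


keyphrases = extract_keyphrases(tagged)
print(keyphrases)
```

Because the phrase boundaries come from the tags, no n-gram range has to be chosen in advance.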

Embedding models:
1. Sentence Transformers
2. Flair
3. Spacy
4. Universal Sentence Encoder (USE)
5. Gensim

Note that Gensim is primarily used for word embedding models. These typically work best for short documents, since the word embeddings are pooled into a single document vector.


In [16]:
#%%capture
#!pip install keybert

In [7]:
#pip install keyphrase_vectorizers

In [5]:
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer
from sklearn.feature_extraction.text import CountVectorizer  # used by keyBERT_word below

In [9]:
# Load the embedding model once and reuse it for every document
kw_model = KeyBERT(model="all-MiniLM-L6-v2")
#kw_model = KeyBERT()


def keyBERT_word(text, top_n=5):
    """Extract single-word keywords using CountVectorizer candidates."""
    vectorizer = CountVectorizer(ngram_range=(1, 1), stop_words="english")
    keywords = kw_model.extract_keywords(
        docs=text,
        vectorizer=vectorizer,
        # use_maxsum=True,
        use_mmr=True,
        # diversity=0.7,
        # nr_candidates=20,
        top_n=top_n,
    )
    # Keep only the keyword strings and drop the similarity scores
    return [keyword for keyword, _score in keywords]


def keyBERT_phase(text, top_n=5):
    """Extract keyphrases using KeyphraseCountVectorizer (part-of-speech based candidates)."""
    vectorizer = KeyphraseCountVectorizer()
    keywords = kw_model.extract_keywords(
        docs=text,
        vectorizer=vectorizer,
        use_mmr=True,
        top_n=top_n,
    )
    # Keep only the keyphrase strings and drop the similarity scores
    return [keyword for keyword, _score in keywords]
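
The `use_mmr=True` and `diversity` options above refer to Maximal Marginal Relevance. The sketch below is a plain-Python illustration of the idea, not KeyBERT's own code: each step picks the candidate that is most similar to the document while being least similar to the keywords already selected. The 2-d vectors are hypothetical toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mmr(doc_vec, cand_vecs, top_n=2, diversity=0.5):
    """Select top_n words from cand_vecs (word -> vector) by Maximal Marginal Relevance."""
    doc_sim = {word: cosine(vec, doc_vec) for word, vec in cand_vecs.items()}
    # Start with the candidate closest to the document
    selected = [max(doc_sim, key=doc_sim.get)]
    while len(selected) < min(top_n, len(cand_vecs)):
        best_word, best_score = None, -math.inf
        for word, vec in cand_vecs.items():
            if word in selected:
                continue
            # Penalize similarity to anything already chosen
            redundancy = max(cosine(vec, cand_vecs[s]) for s in selected)
            score = (1 - diversity) * doc_sim[word] - diversity * redundancy
            if score > best_score:
                best_word, best_score = word, score
        selected.append(best_word)
    return selected


# Hypothetical toy embeddings
doc = [1.0, 0.4]
candidates = {
    "inflation": [1.0, 0.3],
    "inflationary": [1.0, 0.2],
    "energy": [0.4, 1.0],
}

no_diversity = mmr(doc, candidates, top_n=2, diversity=0.0)
high_diversity = mmr(doc, candidates, top_n=2, diversity=0.7)
print(no_diversity)
print(high_diversity)
```

With `diversity=0.0` the near-duplicate "inflationary" is picked second; with `diversity=0.7` it is replaced by "energy". This is the behaviour that would reduce pairs such as "inflation, inflationary" in the keyword columns.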

In [16]:
df_keywords_01 = df_raw.copy()
# 1. Compare keyphrase_ngram_range values of 1, 2 and 3
df_keywords_01["keywords"] = df_keywords_01["text"].apply(lambda x: keyBERT_word(x))
df_keywords_01["keyphase"] = df_keywords_01["text"].apply(lambda x: keyBERT_phase(x))

In [17]:
output_path = os.path.join(work_dir, "OUTPUT/all_speeches_BERT.csv")
df_keywords_01.sort_values(["country","date"]).to_csv(output_path)

Unnamed: 0,reference,country,date,title,author,is_gov,text,keywords_1,keywords_2,keywords_3
3221,r220620b_ECB,euro area,2022-06-20,NO_INFO,lane,0,Notes: The vertical line indicates the start o...,"[hicp, forecast, gdp, hicpx, forecasts]","[hicpx quarterly, model hicp, hicp energy, hic...","[hicp energy prices, hicp hicpx quarterly, hea..."
6114,r220620a_BOE,united kingdom,2022-06-20,UK monetary policy in the context of global sp...,mann,0,Welcome to this presentation of the May . The ...,"[inflation, inflationary, shocks, monetary, ec...","[shocks russia, shocks global, supply shocks, ...","[large shocks russia, shocks russia invasion, ..."
3220,r220620a_ECB,euro area,2022-06-20,Hearing of the Committee on Economic and Monet...,lagarde,1,It is a pleasure to be here again for our seco...,"[euro, eurosystem, monetary, brussels, sanctions]","[affecting euro, monetary policy, relevant eur...","[facing monetary policy, severely affecting eu..."
279,r220621a_BOA,australia,2022-06-21,Inflation and Monetary Policy,lowe,1,I would like to thank AMCHAM for the invitatio...,"[inflation, inflationary, monetary, yield, cpi]","[current inflationary, ongoing inflation, rece...","[responding higher inflation, inflation austra..."
3222,r220622a_ECB,euro area,2022-06-22,"Good, bad and hopeful news: the latest on the ...",elderson,0,I understand that today's audience includes ma...,"[ecb, risks, risk, crises, climate]","[risks climate, risks banks, practices ecb, cl...","[risk management ecb, risks practices ecb, ban..."
...,...,...,...,...,...,...,...,...,...,...
901,r221102a_BOC,canada,2022-11-02,Preparing for payments supervision,morrow,0,"Good morning, and thank you for inviting me to...","[banks, bank, fintech, payments, payment]","[payments fintech, bank canada, finance canada...","[bank canada evolving, payments ecosystem cana..."
3245,r221103a_ECB,euro area,2022-11-03,Mind the step: calibrating monetary policy in ...,panetta,0,The euro area is facing a sequence of unpreced...,"[inflationary, inflation, euro, eurosystem, mo...","[inflation euro, risks inflation, risks euro, ...","[euro reinforcing inflationary, risks inflatio..."
3246,r221104b_ECB,euro area,2022-11-04,The euro area economy and the energy transition,guindos,0,I am very pleased to be taking part in this ev...,"[inflation, macroeconomic, inflationary, econo...","[energy inflation, economy following, economy ...","[inflation rising energy, euro area economy, e..."
3247,r221104a_ECB,euro area,2022-11-04,Monetary policy in a high inflation environmen...,lagarde,1,"Inflation in the euro area is far too high, re...","[inflation, inflationary, monetary, recessions...","[estonia inflation, exacerbate inflationary, i...","[inflation entrenched euro, estonia inflation ..."


# 2.2. GPT2

In [None]:
# 1. Summary of the full text

In [None]:
# 2. Summary of individual paragraphs

# 3. text clustering: document level, sentence level, word level

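1. The set of keywords extracted for each speech can also be vectorized, and the speeches can then be clustered on those vectors.
2. Can document-level vector similarity be computed directly? Are the resulting vectors comparable? Is there a document length limit? (Would comparing keywords work better?)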
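
One way to explore the first idea is to represent each speech by the mean of its keyword vectors and compare speeches with cosine similarity. The sketch below assumes hypothetical 3-d keyword vectors and made-up speech names; in practice the vectors would come from one of the embedding models discussed in section 3.2.

```python
import math

# Hypothetical keyword vectors (toy values)
keyword_vectors = {
    "inflation": [0.9, 0.1, 0.0],
    "monetary": [0.8, 0.2, 0.1],
    "climate": [0.0, 0.9, 0.3],
    "risk": [0.1, 0.7, 0.6],
}

# Hypothetical keyword sets, one per speech
speech_keywords = {
    "speech_a": ["inflation", "monetary"],
    "speech_b": ["monetary", "inflation", "risk"],
    "speech_c": ["climate", "risk"],
}


def mean_vector(words):
    """Average the vectors of the given keywords into one fixed-length vector."""
    vecs = [keyword_vectors[w] for w in words]
    return [sum(column) / len(vecs) for column in zip(*vecs)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


speech_vectors = {name: mean_vector(words) for name, words in speech_keywords.items()}
sim_ab = cosine(speech_vectors["speech_a"], speech_vectors["speech_b"])
sim_ac = cosine(speech_vectors["speech_a"], speech_vectors["speech_c"])
print(round(sim_ab, 3), round(sim_ac, 3))
```

Every speech ends up with a vector of the same length regardless of how many keywords it has, which sidesteps the document length question for this approach.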

In [29]:
#pip install python-Levenshtein

In [3]:
#Libraries for preprocessing
from gensim.parsing.preprocessing import remove_stopwords
import string
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
#import webcolors

#Download once if using NLTK for preprocessing
import nltk
nltk.download('punkt')

#Libraries for vectorisation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV
#from fuzzywuzzy import fuzz


from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from sentence_transformers import SentenceTransformer

#Libraries for clustering
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/jiayue.yuan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 3.1 data preprocessing 

In [15]:
# NOTE: run this cell to reload the saved output data after a kernel restart
import os
import pandas as pd

work_dir = os.getcwd()
output_path = os.path.join(work_dir, "OUTPUT/all_speeches_BERT.csv")
df_keywords_01 = pd.read_csv(output_path)
df_keywords_01['date'] = pd.to_datetime(df_keywords_01["date"])

def str2li(s):
    """Convert a stringified list such as "['a', 'b']" back into a list of strings."""
    li = s.strip('][\'').replace("'","").split(', ')
    return li

# The CSV stores each keyword list as a string, so convert the columns back to lists
df_keywords_01['keywords_1'] = df_keywords_01['keywords_1'].apply(str2li)
df_keywords_01['keywords_2'] = df_keywords_01['keywords_2'].apply(str2li)
df_keywords_01['keywords_3'] = df_keywords_01['keywords_3'].apply(str2li)
df_keywords_01['keywords'] = df_keywords_01['keywords'].apply(str2li)
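
Stripping brackets and quotes by hand breaks when a keyphrase itself contains an apostrophe or a comma. A possible alternative, sketched here with the standard library, is `ast.literal_eval`, which parses the stored string as a Python literal. `parse_list` is a hypothetical helper name, and the sample string is made up.

```python
import ast


def parse_list(cell):
    """Parse a stringified Python list; return [] if the cell is not a valid list literal."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []
    return value if isinstance(value, list) else []


# Made-up example containing an apostrophe and a comma inside the phrases
raw = "['monetary policy', \"central bank's mandate\", 'supply, demand']"
parsed = parse_list(raw)
print(parsed)
```

It could be applied to a freshly loaded column in the same way as `str2li`, for example `df_keywords_01['keywords_1'].apply(parse_list)`.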


In [45]:
df_keywords_01.head(5)

Unnamed: 0.1,Unnamed: 0,reference,country,date,title,author,is_gov,text,keywords_1,keywords_2,keywords_3,keywords
0,3221,r220620b_ECB,euro area,2022-06-20,NO_INFO,lane,0,Notes: The vertical line indicates the start o...,"[hicp, forecast, gdp, hicpx, forecasts]","[hicpx quarterly, model hicp, hicp energy, hic...","[hicp energy prices, hicp hicpx quarterly, hea...","[hicp energy prices, future oil demand, econom..."
1,6114,r220620a_BOE,united kingdom,2022-06-20,UK monetary policy in the context of global sp...,mann,0,Welcome to this presentation of the May . The ...,"[inflation, inflationary, shocks, monetary, ec...","[shocks russia, shocks global, supply shocks, ...","[large shocks russia, shocks russia invasion, ...","[domestic inflationary pressures, demand shock..."
2,3220,r220620a_ECB,euro area,2022-06-20,Hearing of the Committee on Economic and Monet...,lagarde,1,It is a pleasure to be here again for our seco...,"[euro, eurosystem, monetary, brussels, sanctions]","[affecting euro, monetary policy, relevant eur...","[facing monetary policy, severely affecting eu...","[monetary policy, monetary policy meeting, eur..."
3,279,r220621a_BOA,australia,2022-06-21,Inflation and Monetary Policy,lowe,1,I would like to thank AMCHAM for the invitatio...,"[inflation, inflationary, monetary, yield, cpi]","[current inflationary, ongoing inflation, rece...","[responding higher inflation, inflation austra...","[higher inflation, inflation rate, inflation s..."
4,3222,r220622a_ECB,euro area,2022-06-22,"Good, bad and hopeful news: the latest on the ...",elderson,0,I understand that today's audience includes ma...,"[ecb, risks, risk, crises, climate]","[risks climate, risks banks, practices ecb, cl...","[risk management ecb, risks practices ecb, ban...","[eu climate targets, climate risks, climate cr..."


In [38]:
# Collect all keywords into a single flat list
li_keywords = []
for keywords in df_keywords_01["keywords_1"]:
    li_keywords.extend(keywords)

# 1) Remove stop words
li_keywords_nonstop = [remove_stopwords(x) for x in li_keywords]

# 2) Stemming and lower-casing (step removed)
#li_keywords_stemmed = [PorterStemmer().stem(word) for word in li_keywords_nonstop]

# 3) Deduplicate
li_keywords_clean = list(set(li_keywords_nonstop))
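
Deduplicating with `set` does not preserve order, and for strings the resulting order can change between Python sessions, which makes later row indices harder to reproduce. A small sketch of an order-preserving alternative, using made-up keyword lists:

```python
from itertools import chain

# Made-up keyword lists, one per speech
keyword_lists = [["inflation", "monetary", "cpi"], ["monetary", "euro", "inflation"]]

# Flatten the nested lists, then drop duplicates while keeping first-seen order
flat = list(chain.from_iterable(keyword_lists))
unique_ordered = list(dict.fromkeys(flat))
print(unique_ordered)
```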

In [39]:
print(len(li_keywords_nonstop))  # count before deduplication (the stemming step is disabled)
print(len(li_keywords_clean))

500
184


In [40]:
li_keywords_clean

['overseeing',
 'critic',
 'bundesbank',
 'forecasting',
 'banknote',
 'findings',
 'longevity',
 'assets',
 'economists',
 'documents',
 'cpi',
 'insurer',
 'inflationary',
 'merchants',
 'payment',
 'mortgage',
 'stakeholders',
 'tribal',
 'lisbon',
 'award',
 'conference',
 'risk',
 'governors',
 'environmental',
 'studies',
 'guidance',
 'unemployment',
 'brussels',
 'yield',
 'stablecoins',
 'euro',
 'insurance',
 'fiscal',
 'unemployed',
 'easing',
 'hicpx',
 'biodiversity',
 'speech',
 'macroeconomics',
 'decades',
 'sustainable',
 'regulation',
 'hicp',
 'policies',
 'information',
 'policymakers',
 'innovation',
 'kansas',
 'investment',
 'rate',
 'disruptions',
 'canadians',
 'european',
 'pandemic',
 'disturbances',
 'reinsurance',
 'friedman',
 'energy',
 'macroeconomic',
 'cryptocurrency',
 'massachusetts',
 'eurobarometer',
 'housing',
 'crises',
 'lenders',
 'currencies',
 'hurricane',
 'tightening',
 'ecb',
 'currency',
 'economy',
 'rates',
 'demand',
 'belfast',
 'eco

## 3.2 embedding

1: Train a model (BERT) on all speeches, then embed the selected keywords/keyphrases. \
2: Use a pre-trained model (Word2Vec/BERT), load its full vocabulary, and convert each keyword to a vector.

## 1) KeyBERT.extract_embeddings

In [10]:
kw_model = KeyBERT()
doc_embeddings, word_embeddings = kw_model.extract_embeddings(li_keywords_clean)

In [1]:
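# Needs pandas imported and the previous cell run in the current kernel.
# Open question: word_embeddings covers the candidate words found by the vectorizer; doc_embeddings may be the one holding a vector per input keyword.
pd.DataFrame(word_embeddings)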

NameError: name 'pd' is not defined

## 2) Word2Vec (gensim.models)

### Word2Vec VS BERT

Architecture: Word2Vec is a shallow neural network with a single hidden layer, while BERT is a deep Transformer encoder.

Inputs: In its skip-gram variant, Word2Vec takes a single word as input and predicts its context words (the CBOW variant does the reverse), while BERT takes a whole sequence of tokens as input and predicts the masked ones. As a result, Word2Vec assigns each word one fixed vector, whereas BERT produces a different vector for the same word depending on its context.

Pretraining: Both are pre-trained on large unlabelled text corpora. Word2Vec learns word representations from local context windows, while BERT is pre-trained with masked language modelling and next-sentence prediction. Training on specific NLP tasks (such as question answering and sentiment analysis) happens afterwards, during fine-tuning.

Transfer Learning: Word2Vec is often used as a feature extractor that provides fixed-length vectors to plug into other models, while BERT can be fine-tuned on specific NLP tasks with minimal architecture changes, allowing it to adapt to new data and improve performance.

Performance: BERT generally outperforms Word2Vec on NLP tasks such as text classification, named entity recognition, and question answering. However, Word2Vec can still be useful when computational resources are limited.

In [93]:
# train the model first
#w = Word2Vec(,vector_size=100)

In [6]:
# Load the dictionary of a pre-trained model
import gensim.downloader
# Show all available models in gensim-data
print(list(gensim.downloader.info()['models'].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


In [None]:
#w2v_vectors = gensim.downloader.load('word2vec-google-news-300')

In [9]:
import gensim.downloader as api
glove_model = api.load("glove-twitter-25")
glove_model['inflation']

array([ 0.38943 ,  0.10575 , -1.2358  , -0.61433 ,  0.25176 ,  0.066561,
       -0.50221 , -1.9261  ,  0.91188 ,  0.33808 ,  1.075   ,  0.3934  ,
       -1.8223  ,  1.558   ,  0.81213 , -1.3259  , -0.58039 ,  0.72888 ,
        0.93991 , -0.62783 , -0.46672 ,  0.35953 ,  1.0572  , -0.054854,
       -0.86242 ], dtype=float32)

In [10]:
w2v_model = api.load("word2vec-google-news-300")
w2v_model['inflation']



array([ 0.04150391, -0.05224609, -0.34179688,  0.76171875,  0.015625  ,
       -0.11572266,  0.04199219, -0.24414062,  0.14160156, -0.36132812,
        0.05810547, -0.18066406,  0.22265625,  0.28710938, -0.47070312,
        0.52734375,  0.40039062, -0.04248047,  0.08984375, -0.11816406,
        0.3671875 ,  0.33398438,  0.53515625,  0.53125   ,  0.0112915 ,
       -0.29882812,  0.31054688, -0.00506592,  0.28320312,  0.21582031,
        0.09033203, -0.7421875 , -0.13964844, -0.33203125, -0.29882812,
       -0.30078125,  0.07910156,  0.09619141, -0.09667969,  0.59375   ,
        0.07470703, -0.13378906,  0.5703125 , -0.19824219, -0.26953125,
        0.02832031,  0.38085938,  0.19140625, -0.18164062, -0.11376953,
        0.24121094,  0.28320312, -0.08398438, -0.10205078,  0.4453125 ,
       -0.02380371,  0.03491211,  0.23535156,  0.07666016,  0.05200195,
        0.19921875,  0.01165771,  0.09570312, -0.03637695,  0.140625  ,
        0.15917969, -0.07275391,  0.27148438,  0.203125  ,  0.16

In [41]:
vec_dic = {}
for word in li_keywords_clean:
    # Skip words missing from the pre-trained vocabulary; the unguarded lookup
    # raised: KeyError: "Key 'bundesbank' not present"
    if word not in w2v_model:
        continue
    vec_dic[word] = w2v_model[word]
pd.DataFrame(vec_dic)
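
The vocabulary of `word2vec-google-news-300` is case-sensitive, while the keywords here are lower-cased, so some lookups may fail only because of casing. The sketch below tries a few case variants before giving up. A plain dict stands in for the loaded model so that the example runs without a download; the helper name and the toy vocabulary are hypothetical, and whether a given word such as "Bundesbank" exists in the real vocabulary is not verified here.

```python
def lookup_with_fallback(word, vectors):
    """Return (matched_key, vector) for the first case variant found, else (None, None)."""
    for candidate in (word, word.capitalize(), word.upper()):
        if candidate in vectors:
            return candidate, vectors[candidate]
    return None, None


# Toy stand-in for a pre-trained vocabulary (hypothetical values)
toy_vectors = {
    "inflation": [0.1, 0.2],
    "Bundesbank": [0.3, 0.4],
    "ECB": [0.5, 0.6],
}

found = {}
missing = []
for word in ["inflation", "bundesbank", "ecb", "hicpx"]:
    key, vector = lookup_with_fallback(word, toy_vectors)
    if key is None:
        missing.append(word)  # keep track of words with no vector at all
    else:
        found[word] = vector

print(sorted(found))
print(missing)
```

Keeping the `missing` list makes it easy to see how much of the keyword set is lost to out-of-vocabulary words before clustering.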

## 3)  SentenceTransformer

In [1]:
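# Requires the import cell at the start of section 3 to have been run in the current kernel
trans_model = SentenceTransformer("all-MiniLM-L6-v2")
#embeddings = trans_model.encode(sentences)
len(trans_model.encode("inflation"))

# Note: the kernel died while running this cell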

NameError: name 'SentenceTransformer' is not defined

## 3.3 PCA & k-means

In [32]:
# Stemming and making words lower case

# kmeans
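
A minimal sketch of what this step could look like with scikit-learn. The random matrix `X` is a hypothetical stand-in for the keyword embeddings (one row per keyword), and the number of clusters is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-in for keyword embeddings:
# 3 well-separated groups of 20 vectors each, in 50 dimensions
centers = rng.normal(scale=5.0, size=(3, 50))
X = np.vstack([center + rng.normal(scale=0.5, size=(20, 50)) for center in centers])

# Cluster in the original embedding space
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Project to 2 dimensions, e.g. for a scatter plot coloured by cluster label
coords = PCA(n_components=2).fit_transform(X)

print(labels.shape, coords.shape)
```

Clustering before the projection keeps all the information in the embeddings; PCA is used here only to obtain plottable coordinates.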


# 4. Speech labelling

A pre-trained model can be used directly: choose a topic word, then print its most similar words.
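
The idea can be sketched with cosine similarity over a toy vocabulary. The vectors below are hypothetical; with a loaded gensim model the equivalent lookup would be its `most_similar` method, for example `w2v_model.most_similar("inflation", topn=10)`.

```python
import math

# Hypothetical toy vocabulary
toy_vectors = {
    "inflation": [0.9, 0.1, 0.0],
    "prices": [0.8, 0.2, 0.1],
    "cpi": [0.85, 0.05, 0.1],
    "climate": [0.0, 0.9, 0.3],
    "biodiversity": [0.1, 0.8, 0.4],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def most_similar(topic, vectors, topn=2):
    """Rank all other words by cosine similarity to the topic word."""
    scores = [
        (word, cosine(vectors[topic], vec))
        for word, vec in vectors.items()
        if word != topic
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


inflation_words = [word for word, _ in most_similar("inflation", toy_vectors)]
climate_words = [word for word, _ in most_similar("climate", toy_vectors, topn=1)]
print(inflation_words)
print(climate_words)
```

A speech could then be labelled with a topic when enough of its keywords appear among that topic word's nearest neighbours.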