#### NLP Workshops with AVIVA
#### Day 3
#### Feature Engineering and Text Representation - Financial Sentiment Analysis
#### Part 1

This notebook presents functions that produce a vector representation of the loaded text using classical methods. It is the next step in preparing the data for the sentiment prediction task.

#### Loading libraries

In [1]:
from collections import Counter
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from typing import Union

#### Loading the data

The data is the output of the previous session (Text Preprocessing) and contains a "PreprocessedSentence" column with the cleaned text from the "Sentence" column.

In [2]:
data_cleaned = pd.read_csv("data/FinancialNewsPreprocessed.csv").reset_index(drop=True)
data_cleaned

Unnamed: 0.1,Unnamed: 0,Sentence,PreprocessedSentence,Sentiment
0,0,The GeoSolutions technology will leverage Bene...,geosolution technology leverage benefon gps so...,positive
1,1,"$ESI on lows, down $1.50 to $2.50 BK a real po...",esi low down to bk real possibility,negative
2,2,"For the last quarter of 2010 , Componenta 's n...",for last quarter of componenta net sale double...,positive
3,3,According to the Finnish-Russian Chamber of Co...,accord to finnishrussian chamber of commerce a...,neutral
4,4,The Swedish buyout firm has sold its remaining...,swedish buyout firm have sell its remain perce...,neutral
...,...,...,...,...
5650,5837,RISING costs have forced packaging producer Hu...,rise cost have force packaging producer huhtam...,negative
5651,5838,Nordic Walking was first used as a summer trai...,nordic walking be first use as summer training...,neutral
5652,5839,"According shipping company Viking Line , the E...",accord ship company vike line eu decision have...,neutral
5653,5840,"In the building and home improvement trade , s...",building home improvement trade sale decrease ...,neutral


#### Defining sample data

In [3]:
corpus = pd.Series(["small dog", "cute cute cat", "cute cat"])
corpus

0        small dog
1    cute cute cat
2         cute cat
dtype: object

#### Bag of words

In [4]:
def create_bag_of_words(sentences: pd.Series) -> pd.DataFrame:
    """
    Creates a DataFrame of bag-of-words vectors from a Series of text documents
    using only Counter and pandas, without scikit-learn.

    :param sentences: corpus of documents that needs to be represented by vectors
    :return: DataFrame containing bag-of-words vectors with feature names as columns
    """
    word_frequencies = []

    for sentence in sentences:
        words = sentence.split()
        word_count = Counter(words)
        word_frequencies.append(word_count)

    return pd.DataFrame(word_frequencies).fillna(0)

In [5]:
print(corpus)
create_bag_of_words(corpus)

0        small dog
1    cute cute cat
2         cute cat
dtype: object


Unnamed: 0,small,dog,cute,cat
0,1.0,1.0,0.0,0.0
1,0.0,0.0,2.0,1.0
2,0.0,0.0,1.0,1.0


#### Vector representation of text

In [6]:
def represent_text_as_vector(
    sentences: pd.Series,
    vectorize_method: Union[CountVectorizer, TfidfVectorizer] = CountVectorizer,
    **kwargs,
) -> pd.DataFrame:
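    """
    Creates a DataFrame with vector representation from text documents.

    :param sentences: corpus of documents that needs to be represented by vectors
    :param vectorize_method: vectorizer class (not an instance), CountVectorizer or TfidfVectorizer
    :param kwargs: keyword arguments passed to the vectorizer constructor
    :return: DataFrame containing bag-of-words or tf-idf vectors with feature names as columns
    """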
    vectorizer = vectorize_method(**kwargs)
    document_term_matrix = vectorizer.fit_transform(sentences)
    feature_names = vectorizer.get_feature_names_out()

    return pd.DataFrame(document_term_matrix.toarray(), columns=feature_names)

In [7]:
represent_text_as_vector(corpus, CountVectorizer)

Unnamed: 0,cat,cute,dog,small
0,0,0,1,1
1,1,2,0,0
2,1,1,0,0


From here on we will work with the data from the 'PreprocessedSentence' column.

In [8]:
sentences = data_cleaned["PreprocessedSentence"]
sentences

0       geosolution technology leverage benefon gps so...
1                     esi low down to bk real possibility
2       for last quarter of componenta net sale double...
3       accord to finnishrussian chamber of commerce a...
4       swedish buyout firm have sell its remain perce...
                              ...                        
5650    rise cost have force packaging producer huhtam...
5651    nordic walking be first use as summer training...
5652    accord ship company vike line eu decision have...
5653    building home improvement trade sale decrease ...
5654    helsinki afx kci konecranes say it have win or...
Name: PreprocessedSentence, Length: 5655, dtype: object

#### Side note
scikit-learn ships with a built-in English stop-word list, which we can pass via the stop_words argument as a preliminary preprocessing step.

In [9]:
represent_text_as_vector(sentences, CountVectorizer, stop_words="english")

Unnamed: 0,aa,aal,aaland,aalto,aaltonen,aapl,aaron,aava,aazhang,ab,...,zloty,znga,zoltan,zone,zoo,zs,zsl,zte,zurich,zxx
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5650,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5651,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5652,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5653,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


##### Representations of the financial data from the selected dataset, produced with the methods discussed, look as follows

In [10]:
bow = represent_text_as_vector(sentences, CountVectorizer)
bow

Unnamed: 0,aa,aal,aaland,aalto,aaltonen,aapl,aaron,aava,aazhang,ab,...,zloty,znga,zoltan,zone,zoo,zs,zsl,zte,zurich,zxx
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5650,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5651,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5652,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5653,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


##### Analysis of the most frequent words in the documents

In [11]:
def get_most_common_words(text_as_vector: pd.DataFrame, n: int = None) -> pd.Series:
    """
    Creates a Series of the most frequent words in the vocabulary/corpus.

    :param text_as_vector: vector representation of sentences
    :param n: number of top words to return; None returns all words
    :return: Series with the n most frequent words and their frequencies
    """
    sum_words = text_as_vector.sum(axis=0)
    words_freq = sum_words.sort_values(ascending=False)
    return words_freq[:n]

In [12]:
get_most_common_words(bow, 20)

of         3319
to         2749
be         2543
eur        1569
for        1292
from        925
company     916
have        862
mn          789
its         635
as          632
by          625
sale        580
profit      578
say         577
with        575
at          541
finnish     525
share       516
it          493
dtype: int64

##### n-grams

In [13]:
tmp = represent_text_as_vector(sentences, CountVectorizer, ngram_range=(1, 3))
tmp

Unnamed: 0,aa,aa fb,aal,aal up,aal up as,aaland,aaland island,aalto,aalto university,aalto university university,...,zsl look very,zte,zte corp,zte corp she,zurich,zurich insurance,zurich insurance consider,zxx,zxx base,zxx base smartphone
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5650,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5651,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5652,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5653,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
first_row = tmp.iloc[0, :]
first_row[first_row == 1]

base                               1
base search                        1
base search technology             1
benefon                            1
benefon gps                        1
benefon gps solution               1
by                                 1
by provide                         1
by provide location                1
commercial                         1
commercial model                   1
community                          1
community platform                 1
community platform location        1
content                            1
content new                        1
content new powerful               1
geosolution                        1
geosolution technology             1
geosolution technology leverage    1
gps                                1
gps solution                       1
gps solution by                    1
leverage                           1
leverage benefon                   1
leverage benefon gps               1
location base                      1
l

#### One Hot Encoding

In [15]:
print(corpus)
represent_text_as_vector(corpus, CountVectorizer, binary=True)

0        small dog
1    cute cute cat
2         cute cat
dtype: object


Unnamed: 0,cat,cute,dog,small
0,0,0,1,1
1,1,1,0,0
2,1,1,0,0
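
With `binary=True` every non-zero count is set to 1, so the vector records only whether a word is present. A minimal check on the toy corpus, thresholding the ordinary counts by hand:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = pd.Series(["small dog", "cute cute cat", "cute cat"])

counts = CountVectorizer().fit_transform(corpus).toarray()
binary = CountVectorizer(binary=True).fit_transform(corpus).toarray()

# Thresholding the counts reproduces the binary representation
thresholded = (counts > 0).astype(int)
matches = np.array_equal(thresholded, binary)
print(matches)
```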


In [16]:
represent_text_as_vector(sentences, CountVectorizer, binary=True)

Unnamed: 0,aa,aal,aaland,aalto,aaltonen,aapl,aaron,aava,aazhang,ab,...,zloty,znga,zoltan,zone,zoo,zs,zsl,zte,zurich,zxx
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5650,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5651,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5652,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5653,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### TF-IDF

In [17]:
print(corpus)
represent_text_as_vector(corpus, TfidfVectorizer)

0        small dog
1    cute cute cat
2         cute cat
dtype: object


Unnamed: 0,cat,cute,dog,small
0,0.0,0.0,0.707107,0.707107
1,0.447214,0.894427,0.0,0.0
2,0.707107,0.707107,0.0,0.0
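
The toy values above can be reproduced by hand. The sketch below assumes the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm="l2"`, raw term counts as TF), under which idf(t) = ln((1 + n) / (1 + df(t))) + 1 and each row is then scaled to unit Euclidean length. The variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = pd.Series(["small dog", "cute cute cat", "cute cat"])

# Term frequencies: raw counts per document
counts = CountVectorizer().fit_transform(corpus).toarray()

# Smoothed inverse document frequency
n_documents = counts.shape[0]
document_frequency = (counts > 0).sum(axis=0)
idf = np.log((1 + n_documents) / (1 + document_frequency)) + 1

# TF * IDF, followed by L2 normalisation of every row
tfidf_manual = counts * idf
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

tfidf_sklearn = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))
```

In document 1, "cute" and "cat" share the same IDF, so after normalisation their weights are 2/√5 ≈ 0.894 and 1/√5 ≈ 0.447, matching the output above.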


In [18]:
represent_text_as_vector(sentences, TfidfVectorizer)

Unnamed: 0,aa,aal,aaland,aalto,aaltonen,aapl,aaron,aava,aazhang,ab,...,zloty,znga,zoltan,zone,zoo,zs,zsl,zte,zurich,zxx
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5650,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5651,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5652,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5653,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Text representation using vectors of length 100

In [19]:
represent_text_as_vector(sentences, TfidfVectorizer, max_features=100)

Unnamed: 0,about,accord,after,also,as,at,bank,be,business,buy,...,total,unit,up,use,value,we,well,which,with,year
0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
1,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
2,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.249836
3,0.0,0.524186,0.000000,0.0,0.000000,0.000000,0.0,0.209798,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
4,0.0,0.000000,0.406357,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5650,0.0,0.000000,0.000000,0.0,0.000000,0.389224,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
5651,0.0,0.000000,0.000000,0.0,0.409392,0.000000,0.0,0.235913,0.0,0.0,...,0.0,0.0,0.0,0.595887,0.0,0.0,0.0,0.0,0.0,0.000000
5652,0.0,0.613252,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
5653,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000


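When building the vocabulary, we can ignore tokens whose document frequency (DF) is below a given threshold. The threshold is passed via the min_df argument: a float is interpreted as a proportion of documents, an integer as an absolute number of documents.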

In [20]:
represent_text_as_vector(sentences, TfidfVectorizer, min_df=0.1)

Unnamed: 0,be,company,eur,for,from,have,its,of,profit,say,to
0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000
2,0.000000,0.000000,0.000000,0.522932,0.575346,0.000000,0.000000,0.374350,0.329196,0.000000,0.383420
3,0.359018,0.527928,0.000000,0.000000,0.000000,0.000000,0.000000,0.685066,0.000000,0.000000,0.350832
4,0.000000,0.547783,0.000000,0.000000,0.000000,0.560083,0.621483,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...
5650,0.000000,0.000000,0.000000,0.000000,0.000000,0.613866,0.681162,0.000000,0.000000,0.000000,0.398983
5651,1.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
5652,0.000000,0.699213,0.000000,0.000000,0.000000,0.714913,0.000000,0.000000,0.000000,0.000000,0.000000
5653,0.000000,0.000000,0.825249,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.564769
