# Rule-based Sentiment Analysis for Czech

All code for rule-based sentiment analysis. It contains functions for loading datasets and lexicons, preprocessing data, the implementation of rule-based algorithms, and evaluation.

Most of the code blocks are just definitions of functions. They are then used at the end to carry out the sentiment analysis in the evaluation part.

To see the evaluations and examples, run all blocks before the Evaluation section. Then choose a block after that to run. As preprocessing whole datasets takes some time, the preprocessing has already been done and preprocessed datasets are serialized using the Pickle library.




### Imports

Libraries used are pandas, sklearn, spacy, spacy_udpipe, pickle, and a few others.

In [1]:
import pandas as pd
import sklearn
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
import spacy_udpipe
spacy_udpipe.download('cs')
#nlp = spacy_udpipe.load('cs')
from spacy import displacy
from spacy.tokens import Doc, Token, Span
from spacy.lang.cs import stop_words as stop_words
from string import punctuation
from collections import defaultdict
from decimal import Decimal
import pickle
import re
import math
import copy

Already downloaded a model for the 'cs' language


## Functions for loading datasets

Contains functions for loading datasets in various formats (xlsx, csv, txt, ...). They have parameters so that changes to the actual source file shouldn't be needed.

### Loads from two txt files - one containing texts, one labels - needs files names, label names

In [2]:
#loads from two txt files - one containing texts, one labels
def load_from_txt(text_file, label_file, pos_label, neg_label, neu_label, encoding='utf-8'):
    with open(text_file, 'r', encoding=encoding) as texts_file:
        texts = texts_file.readlines()

    with open(label_file, 'r', encoding=encoding) as labels_file:
        labels = labels_file.readlines()

    texts = [text.strip() for text in texts]
    labels = [label.strip() for label in labels]
    
    data = {
        'texts': texts,
        'labels': labels
    }
    
    df = pd.DataFrame(data)

    label_mapping = {pos_label: 'p', neg_label: 'n', neu_label: '0'}
    df['labels'] = df['labels'].replace(label_mapping)
    df = df[df['labels'].isin([pos_label, neg_label, neu_label])]

    print('data loaded')
    
    return df


### Loads from txt files that seperately have positive, negative and neutral texts - needs files names

In [3]:
#loads from txt files that seperately habe positive, negative and neutral texts
def load_from_separate_txts(pos_file, neg_file, neu_file, encoding='utf-8'):
    with open(pos_file, 'r', encoding=encoding) as file:
        pos_texts = file.readlines()
        
    with open(neg_file, 'r', encoding=encoding) as file:
        neg_texts = file.readlines()
        
    with open(neu_file, 'r', encoding=encoding) as file:
        neu_texts = file.readlines()

    texts=[text.strip() for text in pos_texts]+[text.strip() for text in neg_texts]+[text.strip() for text in neu_texts]
    labels=['p']*len(pos_texts)+['n']*len(neg_texts)+['0']*len(neu_texts)

    data = {
        'texts': texts,
        'labels': labels
    }
    
    df = pd.DataFrame(data)

    print('data loaded')
    
    return df


### Loads from one csv file containg both texts and labels - needs file name, header names (text, label), label names

In [4]:
#loads from one csv file containg both texts and labels
def load_from_csv(file_name, text_header, label_header, pos_label, neg_label, neu_label, encoding='utf-8', separator=','):
    df = pd.read_csv(file_name, sep=separator, encoding=encoding, names=[text_header, label_header])
    df.rename(columns={text_header: 'texts', label_header: 'labels'}, inplace=True)

    label_mapping = {pos_label: 'p', neg_label: 'n', neu_label: '0'}
    df['labels'] = df['labels'].replace(label_mapping)
    df = df[df['labels'].isin([pos_label, neg_label, neu_label])]
    
    print('data loaded')
    
    return df

### Loads from excel - needs file name, header names (text, label), label names

In [5]:
#load from excel
def load_from_excel(file_name, text_header, label_header, pos_label, neg_label, neu_label, sheet_name=0):
    df = pd.read_excel(file_name, sheet_name=sheet_name, usecols=[text_header, label_header])
    df.rename(columns={text_header: 'texts', label_header: 'labels'}, inplace=True)

    label_mapping = {pos_label: 'p', neg_label: 'n', neu_label: '0'}
    df['labels'] = df['labels'].replace(label_mapping)
    df = df[df['labels'].isin([pos_label, neg_label, neu_label])]
    
    print('data loaded')
    
    return df

## Data Preprocessing

Preprocessing using the udpipe model, spacy_udpipe and spacy libraries.

### Loading the udpipe model for preprocessing and setting up Spacy extensions

The model for preprocessing is loaded using spacy_udpipe. The model comes from https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3131.

In [6]:
nlp = spacy_udpipe.load_from_path(lang="cs", path="./czech-pdt-ud-2.5-191206.udpipe", meta={"description": "cz model cac"})

### Returns lemmatized word

In [7]:
def lemmatize_word(word):
    return nlp(word)[0].lemma_

### Some helper methods

In [8]:
def get_tokens(doc):
    return [token for token in doc]

#removes repeating punctuation
def clean_text(text):
    pattern = r'([^\w\s])\1+'

    cleaned_text = re.sub(pattern, r'\1', text)
    
    return cleaned_text

### Preprocesses one text, return doc object

In [9]:
def preprocess_text(text):
    return nlp(text)

### Preprocesses a whole dataframe

In [10]:
def preprocess_texts(df):
    print('starting preprocessing')

    df['texts'] = df['texts'].apply(clean_text)

    df['docs'] = list(nlp.pipe(df['texts']))
    df['tokens'] = df['docs'].apply(get_tokens)
    
    print('finished preprocessing')
    
    return df

## Serializing preprocessed data

As preprocessing the data takes some time, it is useful to do it just once and then serialize it and load when it is needed. This is done using the Pickle library. Preprocessing CSFD takes a long time (took around 45 minutes), so it is disabled by default. To enable it, change cell type from raw to code. Preprocessing the other datasets shouldn't take too long. The serialized files are used during evaluation, so this part needs to be run before that. 

### Preprocess and serialize whole csfd dataset (runs for a long time - enable first)

### Preprocess and serialize whole facebook dataset

In [11]:
def preprocess_serialize_facebook():
    df = load_from_txt('datasets/facebook/gold-posts.txt','datasets/facebook/gold-labels.txt','p','n','0')
    preprocessed_df = preprocess_texts(df)
    preprocessed_df = preprocessed_df.drop('tokens', axis=1)
    with open('datasets/facebook.pickle', 'wb') as f:
        pickle.dump(preprocessed_df, f)

preprocess_serialize_facebook()

data loaded
starting preprocessing
finished preprocessing


### Preprocess and serialize whole synthetic dataset

In [12]:
def preprocess_serialize_synthetic():
    df = load_from_separate_txts('datasets/synthetic/positive.txt','datasets/synthetic/negative.txt','datasets/synthetic/neutral.txt')
    preprocessed_df = preprocess_texts(df)
    preprocessed_df = preprocessed_df.drop('tokens', axis=1)
    with open('datasets/synthetic.pickle', 'wb') as f:
        pickle.dump(preprocessed_df, f)

preprocess_serialize_synthetic()

data loaded
starting preprocessing
finished preprocessing


### Preprocess and serialize whole extracted dataset

In [13]:
def preprocess_serialize_extracted():
    extracted_dataset = load_from_excel('datasets/extracted/extracted_dataset.xlsx', 'Text', 'Immer', 'p', 'n', '0')
    preprocessed_df = preprocess_texts(extracted_dataset)
    preprocessed_df = preprocessed_df.drop('tokens', axis=1)
    with open('datasets/extracted.pickle', 'wb') as f:
        pickle.dump(preprocessed_df, f)

preprocess_serialize_extracted()

data loaded
starting preprocessing
finished preprocessing


### Load from pickle, restore tokens column

In [14]:
def load_from_pickle(filename):
    with open(filename, 'rb') as file:
        df = pickle.load(file)
    df['tokens'] = df['docs'].apply(get_tokens)
    return df

## Loading and generating lexicons

Contains functions for loading lexicons (Sublex - https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0022-FF60-B, Affin.CZ - https://github.com/VilemR/affin.cz) and for automatic generation of lexicon from existing data. The loaded lexicons can be either a dataframe or a dictionary, with the dictionary being the final product that is then used later.

### Loads lexicon from csv (as a dataframe)

In [15]:
def load_lexicon_from_csv(file_name, lemma_header='lemma', label_header='sentiment_value', encoding='utf-8', separator=','):
    df = pd.read_csv(file_name, sep=separator, encoding=encoding)

    df.rename(columns={lemma_header: 'lemma', label_header: 'sentiment_value'}, inplace=True)
    
    print('lexicon loaded from '+file_name)
    
    return df

### Loads czech sublex (as a dataframe)

In [16]:
def remove_after_underscore(s):
    return s.split('_')[0]

def load_czech_sublex(file_name):
    column_names = ['negation', 'pos', 'lemma', 'sentiment_value', 'src']

    df = pd.read_csv(file_name, sep='\t', header=None, names=column_names, engine ='python')
    
    replacement_map = {'NEG': -1.0, 'POS': 1.0}
    df['sentiment_value'] = df['sentiment_value'].map(replacement_map).astype(float)

    df['lemma'] = df['lemma'].apply(remove_after_underscore)
    
    print('lexicon loaded from '+file_name)
    
    return df


### Loads czech sublex (as a dict)

In [17]:
def load_czech_sublex_dict(file_name):    
    return df_lexicon_to_dict(load_czech_sublex(file_name))

### Loads AFFIN CZ lexicon
As the Affin.CZ lexicon isn't fully lemmatized, lemmatization is applied here during loading.

In [18]:
def load_affincz(file_name):
    affin = load_lexicon_from_csv(file_name, lemma_header='word_cz', label_header='polarity', separator=',')
    
    lemmatized_dict = {}

    def process_row(row):
        word = row['lemma']
        label = row['sentiment_value']
        lemma = lemmatize_word(word)
        if word == 'není-li': # lematized as 'být' which would be incorrect
            lemmatized_dict['není-li']=-1
            return
        if word.startswith('ne'):
            if not lemma.startswith('ne') and lemma[0] == word[2]:
                lemmatized_dict['ne'+lemma] = label
            else:
                lemmatized_dict[lemma] = label
        else:
            lemmatized_dict[lemma] = label  

    # Apply the function to each row in the DataFrame
    affin.apply(process_row, axis=1)

    return lemmatized_dict

### Loads vulgarisms lexicon as dict (all negative)

In [19]:
def load_vulgarisms(file_name):
    vulgarisms_dict = {}
    
    with open(file_name, 'r', encoding='utf-8') as file:
        lines = file.readlines()
        
    for line in lines:
        line = line.strip()
        vulgarisms_dict[line] = -1

    print('lexicon loaded from '+file_name)
    
    return vulgarisms_dict

### Converts dataframe lexicon into dict lexicon (columns 'lemma' and 'sentiment_value')

In [20]:
def df_lexicon_to_dict(df):
    return dict(zip(df['lemma'], df['sentiment_value']))

### Automatic lexicon generation

Generates the lexicon - needs df with tokens (output of the preprocessing function) and labels. Parameter min_occurence is the minimal occurrence of a word in the data for it  to be used in the lexicon. Min_abs_value is the minimal absolute sentiment value a word must have to be in the lexicon. If simple_sentiment is True, it will round the sentiment values to either -1 or 1. Use additional_stats=True for counts and ratios used for calculating sentiment value.

In [21]:
def generate_lexicon(df, min_occurence=1, min_abs_value = 0.3, simple_sentiment=False, stop_words=stop_words.STOP_WORDS, additional_stats=False):
    print('starting lexicon generation')
    
    lemma_counts = defaultdict(lambda: defaultdict(int))

    for i, row in df.iterrows():
        label = row['labels']
        tokens = row['tokens']
        for token in tokens:
            if token.lemma_ not in stop_words and token.pos_ in ['NOUN', 'VERB', 'ADJ', 'ADV']:
                lemma_counts[label][token.lemma_] += 1

    lemma_counts_df = pd.DataFrame(lemma_counts).fillna(0).astype(int)

    lemma_counts_df['total'] = lemma_counts_df.sum(axis=1)

    for label in df['labels'].unique():
        lemma_counts_df[f'{label}_ratio'] = lemma_counts_df[label] / lemma_counts_df['total']

    lemma_counts_df = lemma_counts_df[lemma_counts_df['total'] >= min_occurence]

    lemma_counts_df['sentiment_value'] = -1 * lemma_counts_df.get('n_ratio', 0) + lemma_counts_df.get('p_ratio', 0)
    lemma_counts_df = lemma_counts_df[lemma_counts_df['sentiment_value'] != 0]

    lemma_counts_df = lemma_counts_df[abs(lemma_counts_df['sentiment_value']) >= min_abs_value]
    if simple_sentiment:
        lemma_counts_df['sentiment_value'] = lemma_counts_df['sentiment_value'].apply(lambda x: 1 if x > 0 else -1)
        

    lemma_counts_df.reset_index(inplace=True)
    lemma_counts_df.rename(columns={'index': 'lemma'}, inplace=True)
    
    print('finished lexicon generation')
    
    if additional_stats:
        return lemma_counts_df
    else:
        return lemma_counts_df[['lemma', 'sentiment_value']]

### Automatic lexicon generation - dict form

In [22]:
def generate_lexicon_dict(df, min_occurence=1, min_abs_value=0.3, simple_sentiment=False, stop_words=stop_words.STOP_WORDS):
    return df_lexicon_to_dict(generate_lexicon(df, min_occurence, min_abs_value, simple_sentiment, stop_words))

## Sentiment analysis implementation
Besides loading and initializing things, there are two main parts - 1) looking up the sentiment value of a word in a lexicon, and 2) obtaining the sentiment of a whole text. Two approaches have been implemented for 2) - proximity approach and dependency tree approach.

### Adding spacy extensions
Spacy extensions are set so necessary information (sentiment, shifter type, etc.) can be added to Tokens and Spans. These extensions are then used by the algorithms.

In [23]:
Token.set_extension("sentiment", default=0, force=True)
Token.set_extension("shifter_type", default='none', force=True)
Token.set_extension("passed_shifters", default=[], force=True)
Token.set_extension("passed_sentiment", default=0, force=True)
Token.set_extension("ignore_as_shifter", default=False, force=True)
Span.set_extension("sentiment", default=0, force=True)

### Loads shifter lists
Returns negators, intensificators, deintensificators, adversatives (all in lemmatized form).

In [24]:
def load_shifters():
    negators = []
    intensificators = []
    deintensificators = []
    adversatives = []
    
    with open('text_resources/shifters/negators_cs.txt', 'r', encoding='utf-8') as file:
        negators = [lemmatize_word(word) for word in file.readlines()]
    
    with open('text_resources/shifters/intensificators_cs.txt', 'r', encoding='utf-8') as file:
        intensificators = [lemmatize_word(word) for word in file.readlines()]
    
    with open('text_resources/shifters/deintensificators_cs.txt', 'r', encoding='utf-8') as file:
        deintensificators = [lemmatize_word(word) for word in file.readlines()]
    
    with open('text_resources/shifters/adversatives_cs.txt', 'r', encoding='utf-8') as file:
        adversatives = [lemmatize_word(word) for word in file.readlines()]

    return negators, intensificators, deintensificators, adversatives



### Initializing some constants, lists, sets, helper methods
Several thing are initialized here that are used by the algorithms later. This includes constants for shifter patterns, punctuation list, shifter lists, definition for the economics pattern and lists of emoticons. Also some helper methods.

In [25]:
constants = {
    'negation_positive': -1,
    'negation_negative': -0.5,
    'intensification': 1.8,
    'deintensification': 0.2,
    'adversative_before': 1.5,
    'adversative_after': 0.5
}

punct_list = ['.',',','?','!',';']

negators, intensificators, deintensificators, adversatives = load_shifters()

bigger_words = {'zvýšit', 'vysoký', 'vyšší', 'růst','roustoucí', 'sílící', 'zvyšující', 'roste', 'vzrůst', 'zvýšení', 'vzestup', 'vzestoupit', 'vzestupující','zvětšit', 'zvětšující', 'zesílit', 'zesílení', 'zesilující', 'posílit', 'posílení', 'posilující', 'velký','stoupnout','stoupnutí', 'stoupající'}
smaller_words = {'snížit', 'snižující', 'nízký', 'nižší', 'pokles', 'poklesnout', 'snížení', 'klesnout', 'klesající', 'klesání','klesat', 'sestoupit', 'sestup', 'sestupující', 'menší', 'malý','klesnutí'}

bigger_is_good_words = {'mzda','plat','ekonomika','HDP', 'hodnota', 'úrověň', 'zisk', 'výdělek', 'profit', 'tržba', 'zaměstnanost'}
bigger_is_bad_words = {'cena','dluh','zadlužení','inflace', 'nezaměstnanost', 'schodek', 'deficit'}

happy_emoticons  = set()
with open('text_resources/other/happy_emoticons.txt', 'r', encoding='utf-8') as file:
    for line in file:
        happy_emoticons.add(line.strip())
        
sad_emoticons  = set()
with open('text_resources/other/happy_emoticons.txt', 'r', encoding='utf-8') as file:
    for line in file:
        happy_emoticons.add(line.strip())    

In [26]:
def find_token_by_lemma(tokens, lemma):
    for token in tokens:
        if token.lemma_ == lemma:
            return token

def get_nbor(token, pos):
    try:
        return token.nbor(pos).text
    except (TypeError, IndexError) as e:
         return ''

### Looking up sentiment of a word in a lexicon
Returns sentiment of a given token using dict lookup, handles negation. Has fix for emoticons (the udpipe model used doesn't tokenize emoticons properly). See comments accompanying code for more detail.

In [27]:
def get_token_sentiment(token, lexicon, enable_negated_lemma=True):
    #lexical negation (word starting with 'ne' - eg.'nehezký', 'nebezpečí' etc.) must be accounted for, as lemma from lexicon and lemmatizator aren't always the same
    #eg. according to the cs udpipe model lemma for 'nebezpečí' is 'bezpečí', but the czech sublex only lists 'nebezpečí' as a polarized word
    #if enable_negated_lemma is True, function will check for the negated version of the word and return a flipped sentiment value of that word
    #eg. if lemma is 'nehezký' but lexicon only contains 'hezký' as positive, it will return negative - might not be 100 % accurate
    
    if token.text.startswith('ne'):
            #both token and lemma start with 'ne' -> return sentiment if lemma in lexicon
            if token.lemma_.startswith('ne') and token.lemma_ in lexicon:
                return  lexicon[token.lemma_]    
            if token.lemma_.startswith('ne') and token.lemma_ not in lexicon:
                return 0

            #token starts with 'ne' but lemma doesn't -> either it was lexical negation which got removed by lemmatization'ne' OR it wasn't a lexical negation (eg. 'nejoblíbenější' -> 'oblíbený')
            #if the first char of lemma matches third char of original, we assume it was lexical negation (only 'ne' was removed) -> then we figure out which version (original or non-negated) is in lexicon
            if not token.lemma_.startswith('ne') and len(token.text) > 3 and token.lemma_[0] == token.text[2]:
                #neither are in lexicon -> return 0
                if 'ne'+token.lemma_ not in lexicon and token.lemma_ not in lexicon:
                    return 0   
                #original (negated) is in lexicon -> return sentiment for original 
                if 'ne'+token.lemma_ in lexicon:
                    return lexicon['ne'+token.lemma_]
                #for negated polarized verbs
                if token.lemma_ in lexicon and token.pos_=='VERB' and  'Neg' in token.morph.get('Polarity'):
                    return -1 * lexicon[token.lemma_]
                if token.lemma_ in lexicon:
                    return -1 * lexicon[token.lemma_]
            #it isn't lexical negation, but something like 'nejoblíbenější' -> 'oblíbený'
            if token.lemma_ in lexicon:
                return lexicon[token.lemma_]
            return 0
    
    #token doesn't start with 'ne'
    else:        
        #lemma is in lexicon -> return sentiment value for lemma
        if token.lemma_ in lexicon:
            return lexicon[token.lemma_]
        #(optionally) if lemma not in lexicon but 'ne' + lemma is (potentially negated version of that word), return flipped version of that 
        if enable_negated_lemma and 'ne'+token.lemma_ in lexicon:
            return -1 * lexicon['ne'+token.lemma_]

        #text emoticon fix
        
        if token.text in  [':',';'] and False:
            nbor_minus_1 = get_nbor(token, -1)
            nbor_minus_2 = get_nbor(token, -2)
            nbor_plus_1 = get_nbor(token, 1)
            nbor_plus_2 = get_nbor(token, 2)

            possible_emoticons = [nbor_minus_1+nbor_minus_2+token.text,nbor_minus_1+token.text,token.text+nbor_plus_1,token.text+nbor_plus_1+nbor_plus_2]
            if set(possible_emoticons).intersection(happy_emoticons):
                return 1
            if set(possible_emoticons).intersection(sad_emoticons):
                return -1
        
        #lemma not in lexicon -> return 0
        return 0

### Marks token as shifter

In [28]:
def mark_as_shifter(token):
    if token.lemma_ in ['nikoliv', 'nikoli', 'ne', 'nikterak', 'nijak'] or (token.pos_ in ['VERB','AUX'] and
                'Neg' in token.morph.get('Polarity')):
        token._.shifter_type = 'neg'
    elif token.lemma_ in intensificators:
        token._.shifter_type = 'int'
    elif token.lemma_ in deintensificators:
        token._.shifter_type = 'deint'
    elif token.lemma_ in adversatives:
        token._.shifter_type = 'adv'

### Applies shifter pattern to polarized token
Depending on the shifter type, it applies the shifter pattern to the polarized token.

In [29]:
constants = {
    'negation_positive': -1,
    'negation_negative': -0.5,
    'intensification': 1.8,
    'deintensification': 0.2,
    'adversative_before': 1.5,
    'adversative_after': 0.5
}

shifter_counts = {
    'neg': 0,
    'int': 0,
    'deint': 0,
    'adv_before':0,
    'adv_after':0
}

def apply_shifter(token, shifter_type, constants=constants, on_passed=False):
    shifter_counts[shifter_type] = shifter_counts[shifter_type] + 1
    if on_passed:
        #print('applying shifter: ', token, shifter_type)
        if shifter_type == 'neg':
            if token._.passed_sentiment > 0:
                token._.passed_sentiment = constants['negation_positive'] * token._.passed_sentiment
            else:
                token._.passed_sentiment = constants['negation_negative'] * token._.passed_sentiment          
        if shifter_type == 'int':
            token._.passed_sentiment = constants['intensification'] * token._.passed_sentiment 
        if shifter_type == 'deint':
            token._.passed_sentiment = constants['deintensification'] * token._.passed_sentiment
        if shifter_type == 'adv_before':
            token._.passed_sentiment = constants['adversative_before'] * token._.passed_sentiment
        if shifter_type == 'adv_after':
            token._.passed_sentiment = constants['adversative_after'] * token._.passed_sentiment
    else:
        #print('applying shifter: ', token, shifter_type)
        if shifter_type == 'neg':
            if token._.sentiment > 0:
                token._.sentiment = constants['negation_positive'] * token._.sentiment
            else:
                token._.sentiment = constants['negation_negative'] * token._.sentiment   
        if shifter_type == 'int':
            token._.sentiment = constants['intensification'] * token._.sentiment 
        if shifter_type == 'deint':
            token._.sentiment = constants['deintensification'] * token._.sentiment
        if shifter_type == 'adv_before':
            token._.sentiment = constants['adversative_before'] * token._.sentiment
        if shifter_type == 'adv_after':
            token._.sentiment = constants['adversative_after'] * token._.sentiment

### Proximity-based shifter pattern approach
Returns overall sentiment from a list of tokens using proximity-based shifter pattern approach. First, it marks shifter words and assigns sentiment to tokens using the lexicon lookup. Then, it applies the shifter patterns using proximity clusters around polarized words. Lastly, it iterates over the set of sentences to calculate overall sentiment. The economics pattern (bigger is better / bigger is worse) is mixed in with the usual shifters (negation, intensification, deintensification, adversatives).

In [30]:
def get_sentiment_proxi(tokens, lexicon, before=4, after=2, constants = constants, debug=False):
    
    sentences = set()
    
    for token in tokens:

        #some extensions have to be reset
        token._.sentiment = 0
        token._.ignore_as_shifter = False
        
        sentences.add(token.sent)

        
        mark_as_shifter(token)
        
        token._.sentiment = get_token_sentiment(token, lexicon)
        
    
    for i in range(len(tokens)): # for every token
        token = tokens[i]
        if token._.sentiment != 0 or token.lemma_ in bigger_is_good_words.union(bigger_is_bad_words):

            #creates a list of tokens preceeding the polarized token, based on the 'before' argument. Cuts it short if there is any punctuation
            start_index = max(0, i-before)
            pre_tokens = tokens[start_index:i]
            for j in range(len(pre_tokens)):  
                if pre_tokens[j].text in punct_list:
                    start_index = j
            pre_tokens = pre_tokens[start_index:]       
                

            #creates a list of tokens following the polarized token, based on the 'after' argument. Cuts it short if there is punctuation, if it's a comma, it looks one token further for adversative
            end_index = min(len(tokens), i + after + 1)
            foll_tokens = tokens[i + 1:end_index]
            for j in range (len(foll_tokens)):
                if foll_tokens[j] in punct_list:
                    end_index = j
                if foll_tokens[j] == ',' and foll_tokens[j].lemma_ in adversatives:
                    end_index = min(len(tokens),j+1)       
            foll_tokens = foll_tokens[:end_index]

            context_tokens = pre_tokens + foll_tokens
            
            negators_count =  0

            #assign polarity to economic term based on increase/decrease ('high inflation' -> negative)
            #needs to happen before shifters are apllied
            #if it was used in this way, it needs to be ignored as a shifter later
            context_tokens_set = {t.lemma_ for t in context_tokens}
            #increase of economic domain term            
            intersection = context_tokens_set.intersection(bigger_words)
            if intersection:
                if token.lemma_ in bigger_is_good_words:
                    token._.sentiment = 1
                if token.lemma_ in bigger_is_bad_words:
                    token._.sentiment = -1
                used_lemma = list(intersection)[0]
                ignored_token = find_token_by_lemma(context_tokens, used_lemma)
                ignored_token._.ignore_as_shifter = True
                
            #decrease of economic domain term
            intersection = context_tokens_set.intersection(smaller_words)
            if intersection:
                if token.lemma_ in bigger_is_good_words:
                    token._.sentiment = -1
                if token.lemma_ in bigger_is_bad_words:
                    token._.sentiment = 1
                used_lemma = list(intersection)[0]
                ignored_token = find_token_by_lemma(context_tokens, used_lemma)
                ignored_token._.ignore_as_shifter = True

            debug and print('Evaluating context cluster of token ',token)
            for context_token in context_tokens: 
                

                if not context_token._.ignore_as_shifter:
                
                    #negation
                    #Only first negator is counted, as additional negators won't have additional effect for usual polarized context ('nebylo to nikterak dobré' stays negative)
                    if context_token._.shifter_type == 'neg' and negators_count == 0:
                        apply_shifter(token, 'neg')
                        negators_count += 1
                        debug and print('Applying shifter ',context_token,' (',context_token._.shifter_type,') to token ',token,', new sentiment: ', token._.sentiment)
    
                    #adversatives
                    if context_token._.shifter_type == 'adv':
                        if context_token in foll_tokens:
                            apply_shifter(token, 'adv_after')
                            debug and print('Applying shifter ',context_token,' (',context_token._.shifter_type,') to token ',token,', new sentiment: ', token._.sentiment)
                        elif context_token in pre_tokens:
                            apply_shifter(token, 'adv_before')
                            debug and print('Applying shifter ',context_token,' (',context_token._.shifter_type,') to token ',token,', new sentiment: ', token._.sentiment)
                            
                    #intensification  
                    if context_token._.shifter_type == 'int':
                        apply_shifter(token, 'int')
                        debug and print('Applying shifter ',context_token,' (',context_token._.shifter_type,') to token ',token,', new sentiment: ', token._.sentiment)
    
                    #deintensification
                    if context_token._.shifter_type == 'deint':
                        apply_shifter(token, 'deint')
                        debug and print('Applying shifter ',context_token,' (',context_token._.shifter_type,') to token ',token,', new sentiment: ', token._.sentiment)
                    
                token._.ignore_as_shifter = False
        else:
            token._.sentiment = 0
            
        

    overall_sentiment = 0
    
    for sentence in sentences:
        sentence_sentiment = 0
        for token in sentence:
            sentence_sentiment += token._.sentiment
        #sentence_sentiment = sentence_sentiment / math.sqrt(len(sentence))
        sentence._.sentiment = sentence_sentiment
        overall_sentiment += sentence._.sentiment

    if len(sentences)==0:
        return 0
    else:
        overall_sentiment = overall_sentiment / len(sentences)
        
    return tokens, overall_sentiment

### Dependency tree-based shifter pattern approach
Returns overall sentiment from a list of tokens using dependency tree-based shifter pattern approach. First, it marks shifter words, assigns sentiment to tokens using the lexicon lookup, and identifies the head token of each sentence (1 sentence = 1 tree). Then, for each tree, it applies the shifter patterns, starting from the rightmost lowest node. Lastly, overall sentiment is calculated from all the sentences. The economics pattern (bigger is better / bigger is worse) is mixed in with the usual shifters (negation, intensification, deintensification, adversatives).

First, a helper method that allows traversing the dependency trees

In [31]:
#generates token list out of tree, in order from bottom level to top, starting from the rightmost token on bottom level
def bottom_to_top_traversal(root_token):
    levels = defaultdict(list)
    max_level = 0

    def dfs(node, level):
        nonlocal max_level
        max_level = max(max_level, level)
        levels[level].append(node)
        for child in node.children:
            dfs(child, level + 1)

    dfs(root_token, 0)

    for level in range(max_level, -1, -1):
        for token in reversed(levels[level]):
            yield token

The actual algorithm

In [32]:
def get_sentiment_dep(tokens, lexicon, constants = constants, debug=False):

    root_tokens = set()
    
    for token in tokens:

        #some extensions have to be reset just in case
        token._.sentiment = 0
        token._.passed_sentiment = 0
        token._.passed_shifters = []
        
     
        if token.head == token:
            root_tokens.add(token.head)

        mark_as_shifter(token)
        
        token._.sentiment = get_token_sentiment(token, lexicon)

    overall_sentiment = 0
    
    # for every dependency tree
    for root_token in root_tokens:

        negator_count = 0
        
        #iterate over tokens in tree from bottom to top, starting from the rightmost bottom token
        for token in bottom_to_top_traversal(root_token):
            

            parent = token.head

            #token isn't root token
            if token != parent:

                ignore_own_shifter = False

                #assign polarity to economic term parent based on increase/decrease
                if token.lemma_ in bigger_words:
                    if parent.lemma_ in bigger_is_good_words:
                        parent._.sentiment = 1
                    if parent.lemma_ in bigger_is_bad_words:
                        parent._.sentiment = -1
                    ignore_own_shifter = True

                if token.lemma_ in smaller_words:
                    if parent.text in bigger_is_good_words:
                        parent._.sentiment = -1
                    if parent.text in bigger_is_bad_words:
                        parent._.sentiment = 1
                    ignore_own_shifter = True

                #token is a shifter 
                if token._.shifter_type != 'none' and not ignore_own_shifter:

                    # parent has sentiment - apply shifter
                    if parent._.sentiment != 0: 
                        if not (token._.shifter_type == 'neg' and negator_count > 0): # only one negation allowed
                            apply_shifter(parent, token._.shifter_type)
                            debug and print('Applying shifter ',token,' (',token._.shifter_type,') to parent node ', parent, 'new sentiment: ', parent._.sentiment)
                            if token._.shifter_type == 'neg':
                                negator_count += 1
                    #parent doesn't have sentiment - pass shifter to parent
                    else: 
                        parent._.passed_shifters.append(token._.shifter_type)
                        debug and print('Passing shifter ', token,' (',token._.shifter_type,') to parent node ', parent)

                #token has been passed some shifters
                if len(token._.passed_shifters) != 0:
                    #if token has sentiment - apply passed shifter on itself
                    if token._.sentiment != 0:
                        for passed_shifter in token._.passed_shifters:
                            apply_shifter(token, passed_shifter)
                            debug and print('Applying inherited shifter (', passed_shifter,') to node',token, 'new sentiment: ', token._.sentiment)
                            token._.passed_shifters.remove(passed_shifter)

                
                #finally, if token has any sentiment, it is passed to parent as passed sentiment
                if(token._.sentiment != 0):
                    parent._.passed_sentiment += token._.sentiment
                    debug and print('Passing sentiment from child node ',token,' to parent node ', parent, '(',token._.sentiment,')')
                
                #if token has been passed sentiment, it also passes it to parent as passed sentiment
                if(token._.passed_sentiment != 0):
                    parent._.passed_sentiment += token._.passed_sentiment
                    debug and print('Passing inherited sentiment from child node ',token ,' to parent node ', parent, '(',token._.passed_sentiment,')')

            #token is root token
            else:
                #root token has no sentiment of its own (not a polarized word)
                if token._.sentiment == 0:
                    #if it has been passed any shifter, apply it to the passed sentiment it received
                    for passed_shifter in token._.passed_shifters:
                        if not (passed_shifter == 'neg' and negator_count>0): #only one negation allowed
                            apply_shifter(token, passed_shifter, on_passed=True)
                            debug and print('Applying inherited shifter (', passed_shifter,') to node', token, ' on inherited sentiment, new inherited sentiment: ', token._.passed_sentiment)

                    #root token is a negator - most likely a negative verb, verbal negation negates whole sentence - negates any passed sentiment
                    if token._.shifter_type=='neg' and negator_count==0:
                        apply_shifter(token, 'neg', on_passed=True)
                        debug and print('Applying sentential negation from root token ',token, ' on inherited sentiment, new inherited sentiment: ', token._.passed_sentiment)

                #root token has sentiment on its own
                else:
                    #root token is negator - should negate any passed sentiment (sentential negation)
                    if token._.shifter_type == 'neg' and token._.passed_sentiment != 0 and negator_count==0:
                        apply_shifter(token, 'neg', on_passed=True)
                        debug and print('Applying sentential negation from root token ',token, 'on inherited sentiment, new inherited sentiment: ', token._.passed_sentiment)

                    #applies any passed shifters on itself and passed sentiment
                    for passed_shifter in token._.passed_shifters:
                       if not (passed_shifter == 'neg' and negator_count>0):
                           apply_shifter(token, passed_shifter)
                           print('Applying inherited shifter (', passed_shifter,') to node', token, 'new sentiment: ', token._.sentiment)
                           if(token._.passed_sentiment != 0):
                               apply_shifter(token, passed_shifter, on_passed = True)
                               debug and print('Applying inherited shifter (', passed_shifter,') to node', token, ' on inherited sentiment, new inherited sentiment: ', token._.passed_sentiment)
                        
        overall_sentiment = overall_sentiment + root_token._.sentiment + root_token._.passed_sentiment

    if len(root_tokens)==0:
        return 0
    else:
        overall_sentiment = overall_sentiment / len(root_tokens)
        
    return tokens, overall_sentiment

### Applies sentiment analysis algorithm on a dataset

Choose either proximity or dependency tree method using the parameter method ('proxi' or 'dep')

In [33]:
def get_df_row_sentiment(row, lexicon, method):
    tokens = row['tokens']
    sentiment = 0
    try:
        if method in ['proxi', 'proximity']:
            tokens, sentiment = get_sentiment_proxi(tokens, lexicon)
        elif method in ['dep', 'dependency']:
            tokens, sentiment = get_sentiment_dep(tokens, lexicon)
    finally:       
        return sentiment
    
def apply_sa_on_dataset(df, lexicon, method):
    df = df.copy()
    if method not in ['proxi', 'proximity', 'dep', 'dependency']:
        print('No such method as '+method+'.')
        return
    df['sentiment'] = df.apply(lambda row: get_df_row_sentiment(row, lexicon, method), axis=1)
    df['sentiment_pred'] = df['sentiment'].apply(lambda x: 'p' if x > 0 else ('n' if x < 0 else '0'))
    return df

### Load lexicons

In [34]:
sublex = load_czech_sublex_dict('text_resources/lexicons/sublex_1_0.csv')
affin = load_affincz('text_resources/lexicons/affincz.txt')
#autolex = generate_lexicon_dict(preprocessed_df, min_occurence=2, min_abs_value=0.5, simple_sentiment=True)
#joinedlex = dict(autolex)
#joinedlex = dict(affin)
#joinedlex.update(affin)
#joinedlex.update(sublex)

lexicon loaded from text_resources/lexicons/sublex_1_0.csv
lexicon loaded from text_resources/lexicons/affincz.txt


### Lemmatization test
For testing purposes

In [35]:
doc = nlp('není')
for token in doc:
    print('Input: '+token.text)
    print('Lemma: '+token.lemma_)
    print('POS: '+token.pos_)
    print(token.morph)

Input: není
Lemma: být
POS: AUX
Mood=Ind|Number=Sing|Person=3|Polarity=Neg|Tense=Pres|VerbForm=Fin|Voice=Act


### Sentiment analysis for one string
#### Proximity

In [37]:
doc = preprocess_text('Kytara nezněla vskutku špatně')
tokens = get_tokens(doc)
for token in tokens:
    mark_as_shifter(token)
    if(token._.shifter_type != 'none'):
       print(token, ' --- sentiment: ', get_token_sentiment(token, sublex), ', shifter type: ', token._.shifter_type) 
    else:
        print(token, ' --- sentiment: ',get_token_sentiment(token, sublex))
print('----------------------------')       
tokens, overall_sentiment = get_sentiment_proxi(tokens, sublex, debug=True)

print('Overall sentiment: ' + str(overall_sentiment))

Kytara  --- sentiment:  0
nezněla  --- sentiment:  0 , shifter type:  neg
vskutku  --- sentiment:  0 , shifter type:  int
špatně  --- sentiment:  -1.0
----------------------------
Evaluating context cluster of token  špatně
Applying shifter  nezněla  ( neg ) to token  špatně , new sentiment:  0.5
Applying shifter  vskutku  ( int ) to token  špatně , new sentiment:  0.9
Overall sentiment: 0.9


#### Dependency tree

In [38]:
doc = preprocess_text('Kytara nezněla vskutku špatně')
tokens = get_tokens(doc)
for token in tokens:
    mark_as_shifter(token)
    if(token._.shifter_type != 'none'):
       print(token, ' --- sentiment: ', get_token_sentiment(token, sublex), ', shifter type: ', token._.shifter_type) 
    else:
        print(token, ' --- sentiment: ',get_token_sentiment(token, sublex))          
displacy.render(doc, style="dep", jupyter=True, options={"distance": 90})
tokens = get_tokens(doc)
tokens, overall_sentiment = get_sentiment_dep(tokens, sublex, debug=True)
print('Overall sentiment: ' + str(overall_sentiment))

Kytara  --- sentiment:  0
nezněla  --- sentiment:  0 , shifter type:  neg
vskutku  --- sentiment:  0 , shifter type:  int
špatně  --- sentiment:  -1.0


Passing sentiment from child node  špatně  to parent node  nezněla ( -1.0 )
Passing shifter  vskutku  ( int ) to parent node  nezněla
Applying inherited shifter ( int ) to node nezněla  on inherited sentiment, new inherited sentiment:  -1.8
Applying sentential negation from root token  nezněla  on inherited sentiment, new inherited sentiment:  0.9
Overall sentiment: 0.9


## Evaluation
Make sure to run all previous cells before running evaluations. Contains evaluations for Facebook, CSFD, Exctracted and Synthetic datasets. Parameters can be changed. Which lexicons will be used can be changed. Method (proximity, dependency) can be changed. All evaluations are done using 10-fold cross-validation.

### Run this first - helper for generating and averaging classification reports

In [39]:
def get_report(label_col, pred_col, output_dict=False):
    report = classification_report(label_col, pred_col, output_dict=output_dict)
    return report

def avg_reports(*args):
    mean_dict = dict()
    for label in reports[0].keys():
        dictionary = dict()

        if label in 'accuracy':
            mean_dict[label] = sum(d[label] for d in reports) / len(reports)
            continue

        for key in reports[0][label].keys():
            dictionary[key] = sum(d[label][key] for d in reports) / len(reports)
        mean_dict[label] = dictionary

    return mean_dict

### Facebook data evaluation

In [40]:
preprocessed_df = load_from_pickle('datasets/facebook.pickle')

In [41]:
kfold = KFold(n_splits=10, shuffle=True, random_state=1)

sublex = load_czech_sublex_dict('text_resources/lexicons/sublex_1_0.csv')
affin = load_affincz('text_resources/lexicons/affincz.txt')

reports = []

for fold_idx, (train_idx, test_idx) in enumerate(kfold.split(preprocessed_df)):
    train = preprocessed_df.iloc[train_idx]
    test = preprocessed_df.iloc[test_idx]
    autolex = generate_lexicon_dict(train, min_occurence=5, min_abs_value=0.4, simple_sentiment=True)
    #joinedlex = sublex.copy()
    #joinedlex.update(autolex)
    result_df = apply_sa_on_dataset(test, sublex, method='proxi')
    report = get_report(result_df['labels'], result_df['sentiment_pred'], output_dict=True)
    reports.append(report)

avg_report = avg_reports(reports)
print(avg_report)


lexicon loaded from text_resources/lexicons/sublex_1_0.csv
lexicon loaded from text_resources/lexicons/affincz.txt
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
{'0': {'precision': 0.6328396220490873, 'recall': 0.49274075299265796, 'f1-score': 0.5538765274586636, 'support': 517.4}, 'n': {'precision': 0.43213861263871245, 'recall': 0.2612506581280924, 'f1-score': 0.3250802781010592, 'support': 199.1}, 'p': {'precision': 0.3894562069273318, 'recall': 0.679723215579535, 'f1-score': 0.

### CSFD data evaluation

Make sure that CSFD data was preprocessed and serialized first if you need to see the CSFD evaluation (it is disabled by default due to it taking a long time).

In [42]:
preprocessed_df = load_from_pickle('datasets/csfd.pickle')

In [43]:
kfold = KFold(n_splits=10, shuffle=True, random_state=1)

sublex = load_czech_sublex_dict('text_resources/lexicons/sublex_1_0.csv')
affin = load_affincz('text_resources/lexicons/affincz.txt')

reports = []

for fold_idx, (train_idx, test_idx) in enumerate(kfold.split(preprocessed_df)):
    train = preprocessed_df.iloc[train_idx]
    test = preprocessed_df.iloc[test_idx]
    autolex = generate_lexicon_dict(train, min_occurence=25, min_abs_value=0.5, simple_sentiment=True)
    #joinedlex = sublex.copy()
    #joinedlex.update(autolex)
    result_df = apply_sa_on_dataset(test, autolex, method='proxi')
    report = get_report(result_df['labels'], result_df['sentiment_pred'], output_dict=True)
    reports.append(report)

avg_report = avg_reports(reports)
print(avg_report)

lexicon loaded from text_resources/lexicons/sublex_1_0.csv
lexicon loaded from text_resources/lexicons/affincz.txt
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
{'0': {'precision': 0.45847198705800973, 'recall': 0.4794422275704098, 'f1-score': 0.4686844422931947, 'support': 3076.8}, 'n': {'precision': 0.694541192022708, 'recall': 0.46688852605234155, 'f1-score': 0.5583646936653408, 'support': 2971.6}, 'p': {'precision': 0.6202215551295744, 'recall': 0.7874472039502504, 'f1-score': 

### Extracted dataset evaluation

In [44]:
preprocessed_df = load_from_pickle('datasets/extracted.pickle')

In [45]:
kfold = KFold(n_splits=10, shuffle=True, random_state=1)

#sublex = load_czech_sublex_dict('text_resources/lexicons/sublex_1_0.csv')
#affin = load_affincz('text_resources/lexicons/affincz.txt')

reports = []

for fold_idx, (train_idx, test_idx) in enumerate(kfold.split(preprocessed_df)):
    train = preprocessed_df.iloc[train_idx]
    test = preprocessed_df.iloc[test_idx]
    autolex = generate_lexicon_dict(train, min_occurence=5, min_abs_value=0.5, simple_sentiment=False)
    #joinedlex = sublex.copy()
    #joinedlex.update(autolex)
    result_df = apply_sa_on_dataset(test, autolex, method='proxi')
    report = get_report(result_df['labels'], result_df['sentiment_pred'], output_dict=True)
    reports.append(report)

avg_report = avg_reports(reports)
print(avg_report)

starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
{'0': {'precision': 0.4449063233699797, 'recall': 0.6304585679717336, 'f1-score': 0.5212716731118343, 'support': 193.2}, 'n': {'precision': 0.6419769104301432, 'recall': 0.49466944257089207, 'f1-score': 0.5572701766373085, 'support': 193.4}, 'p': {'precision': 0.6081213209470266, 'recall': 0.49428656139308363, 'f1-score': 0.5449014097522296, 'support': 193.5}, 'accuracy': 0.5392186479909787, 'macro avg': {'precision': 0.5650015182490498

### Synthetic dataset evauluation

In [46]:
preprocessed_df = load_from_pickle('datasets/synthetic.pickle')

In [47]:
kfold = KFold(n_splits=10, shuffle=True, random_state=1)

sublex = load_czech_sublex_dict('text_resources/lexicons/sublex_1_0.csv')
affin = load_affincz('text_resources/lexicons/affincz.txt')

reports = []

for fold_idx, (train_idx, test_idx) in enumerate(kfold.split(preprocessed_df)):
    train = preprocessed_df.iloc[train_idx]
    test = preprocessed_df.iloc[test_idx]
    autolex = generate_lexicon_dict(train, min_occurence=10, min_abs_value=0.5, simple_sentiment=True)
    #joinedlex = sublex.copy()
    #joinedlex.update(autolex)
    result_df = apply_sa_on_dataset(test, autolex, method='proxi')
    report = get_report(result_df['labels'], result_df['sentiment_pred'], output_dict=True)
    reports.append(report)
    
avg_report = avg_reports(reports)
print(avg_report)



lexicon loaded from text_resources/lexicons/sublex_1_0.csv
lexicon loaded from text_resources/lexicons/affincz.txt
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
starting lexicon generation
finished lexicon generation
{'0': {'precision': 0.819811144265362, 'recall': 0.5396400342738978, 'f1-score': 0.6506030899219202, 'support': 250.0}, 'n': {'precision': 0.8130077966234891, 'recall': 0.8980815589428023, 'f1-score': 0.8529720947664691, 'support': 250.0}, 'p': {'precision': 0.7257059987594261, 'recall': 0.8983096962820077, 'f1-score': 0.80

## Examples for automatic lexicon generation

Parameters can be changed.

### Example - generate auto lexicon from fb data

In [48]:
preprocessed_df = load_from_pickle('datasets/facebook.pickle')
lexicon = generate_lexicon(preprocessed_df, min_occurence=25, min_abs_value = 0.5, additional_stats=True)
print(lexicon)

starting lexicon generation
finished lexicon generation
         lemma   n    p   0  total   n_ratio   p_ratio   0_ratio  \
0        drahý  26    1  10     37  0.702703  0.027027  0.270270   
1         vůně   8  100  15    123  0.065041  0.813008  0.121951   
2       chápat  26    0  19     45  0.577778  0.000000  0.422222   
3        pěkný   9   67  15     91  0.098901  0.736264  0.164835   
4        super  12   95  10    117  0.102564  0.811966  0.085470   
5      dovolat  19    2   8     29  0.655172  0.068966  0.275862   
6       skvělý   4   60   6     70  0.057143  0.857143  0.085714   
7     oblíbený   1   27   2     30  0.033333  0.900000  0.066667   
8      smlouva  27    1  21     49  0.551020  0.020408  0.428571   
9     zákazník  43    2  18     63  0.682540  0.031746  0.285714   
10         síť  21    1  18     40  0.525000  0.025000  0.450000   
11   odpovědět  17    1  13     31  0.548387  0.032258  0.419355   
12    procento  18    0  12     30  0.600000  0.000000  0.40

### Example - generate auto lexicon from csfd data

In [49]:
preprocessed_df = load_from_pickle('datasets/csfd.pickle')
lexicon = generate_lexicon(preprocessed_df, min_occurence=5, additional_stats=True)
print(lexicon)

starting lexicon generation
finished lexicon generation
            lemma     p     n     0  total   p_ratio   n_ratio   0_ratio  \
0          příběh  4329  1228  3178   8735  0.495592  0.140584  0.363824   
1      potvrzovat    88    37    42    167  0.526946  0.221557  0.251497   
2        pravidlo   163    58    84    305  0.534426  0.190164  0.275410   
3       perfektný    33     1     5     39  0.846154  0.025641  0.128205   
4          muzika   166    38    84    288  0.576389  0.131944  0.291667   
...           ...   ...   ...   ...    ...       ...       ...       ...   
13573     emzácký     0     3     4      7  0.000000  0.428571  0.571429   
13574   kostýmový     0     2     3      5  0.000000  0.400000  0.600000   
13575       usrat     0     5     0      5  0.000000  1.000000  0.000000   
13576     hasnout     0     2     3      5  0.000000  0.400000  0.600000   
13577   pochůzkář     0     3     3      6  0.000000  0.500000  0.500000   

       sentiment_value  
0     

### Example - generate auto lexicon from extracted dataset

In [50]:
preprocessed_df = load_from_pickle('datasets/extracted.pickle')
lexicon = generate_lexicon(preprocessed_df, min_occurence=5, additional_stats=True)
print(lexicon)

starting lexicon generation
finished lexicon generation
          lemma   p  n  0  total   p_ratio   n_ratio   0_ratio  \
0     vyžadovat   1  5  6     12  0.083333  0.416667  0.500000   
1       pomáhat  17  1  3     21  0.809524  0.047619  0.142857   
2    efektivita   7  0  0      7  1.000000  0.000000  0.000000   
3     umožňovat  11  1  3     15  0.733333  0.066667  0.200000   
4       nákupní   7  1  5     13  0.538462  0.076923  0.384615   
..          ...  .. .. ..    ...       ...       ...       ...   
897   pracující   0  3  3      6  0.000000  0.500000  0.500000   
898      záloha   0  5  3      8  0.000000  0.625000  0.375000   
899    narozený   0  3  2      5  0.000000  0.600000  0.400000   
900    náhradní   0  3  2      5  0.000000  0.600000  0.400000   
901     pasažér   0  2  3      5  0.000000  0.400000  0.600000   

     sentiment_value  
0          -0.333333  
1           0.761905  
2           1.000000  
3           0.666667  
4           0.461538  
..           

### Example - generate auto lexicon from synthetic dataset

In [51]:
preprocessed_df = load_from_pickle('datasets/synthetic.pickle')
lexicon = generate_lexicon(preprocessed_df, min_occurence=5, additional_stats=True)
print(lexicon)

starting lexicon generation
finished lexicon generation
         lemma     p    n   0  total   p_ratio   n_ratio   0_ratio  \
0      balíček    65    3   8     76  0.855263  0.039474  0.105263   
1    povzbudit    17    0   0     17  1.000000  0.000000  0.000000   
2         růst  1392  413  44   1849  0.752839  0.223364  0.023797   
3         malý   300   40  44    384  0.781250  0.104167  0.114583   
4      střední   274   25  54    353  0.776204  0.070822  0.152975   
..         ...   ...  ...  ..    ...       ...       ...       ...   
528    působit     0    3   5      8  0.000000  0.375000  0.625000   
529    uzavřít     0    3   2      5  0.000000  0.600000  0.400000   
530     uvádět     0    2   4      6  0.000000  0.333333  0.666667   
531    západní     0    2   4      6  0.000000  0.333333  0.666667   
532    rozruch     0    2   4      6  0.000000  0.333333  0.666667   

     sentiment_value  
0           0.815789  
1           1.000000  
2           0.529475  
3          