# Application of ```basicnlp``` functions to text data

One of the main purposes of this package is to provide functions that make it easy to apply NLP and text-mining techniques to text documents, which are usually stored in ```pandas.Series```.

In this script, we show how ```basicnlp``` and ```utils``` make it straightforward to combine NLTK, TextBlob and any self-created functions and apply them to texts containing different numbers of sentences.


## Setup and Imports

Import modules and our user-defined functions

In [11]:
import pandas as pd
import numpy as np
import os

from nlpbumblebee.utils import *
from nlpbumblebee.basicnlp import *

In [2]:
pd.set_option('display.max_colwidth', None)  # show full cell contents; None replaces the deprecated -1 (pandas >= 1.0)

Let's start by taking a look at the list of user-defined functions

In [10]:
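# `import *` copies the functions into the namespace but does not bind the module names,
# so dir(utils) and dir(basicnlp) need the modules to be imported explicitly
from nlpbumblebee import utils, basicnlp

print(dir(utils))
print(dir(basicnlp))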


['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'collections', 'combine_2fs', 'combine_functions', 'combine_functions_output_series', 'flattenIrregularListOfLists', 'functools', 'list2string', 'merge_dfs', 'np', 'output_series', 'pd', 'reduce', 'series_output', 'string_to_series_out', 'word_tokens2string_sentences', 'wraps']
['POS_tagging', 'SentimentIntensityAnalyzer', 'TextBlob', 'WordNetLemmatizer', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'analyser', 'break_compound_words', 'classify_subjectivity', 'count_pos', 'count_punkt', 'count_words', 'fix_neg_auxiliary', 'get_sentiment_score_TB', 'get_sentiment_score_VDR', 'get_sentiment_stricter_threshold', 'get_subjectivity', 'get_wordnet_pos', 'is_part_string', 'itertools', 'keep_only_strict_polarity_sents', 'lemmatise', 'mark_neg', 'mark_negation', 'np', 'pd', 'pos_tag', 'punctuation', 'remove_objective_sents', 'rem

## Data

For the purpose of this notebook, we create some text data by combining same-sentiment sentences from 'From Group to Individual Labels using Deep Features' (Kotzias et al., KDD 2015) into short paragraphs.

Importantly for our examples, the texts are made up of varying numbers of sentences.

In [6]:
text = ["I know that sounds funny, but to me it seemed like sketchy technology that wouldn't work well. Well, this one works great.",
        "All I can do is whine on the Internet, so here it goes. The more I use the thing the less I like it.",
        "I still maintain that monkeys shouldn't make headphones, we just obviously don't share enough DNA to copy the design over to humans. Thank you for wasting my money!",
        "Not only will it drain your player, but may also potentially fry it. I want my money back.",
        "There are massive levels, massive unlockable characters... it's just a massive game.",
        "",
        "A great film by a great director. The movie had you on the edge of your seat and made you somewhat afraid to go to your car at the end of the night. The music in the film is really nice too. I'd advise anyone to go and see it.", 
        "I especially liked the non-cliche choices with the parents; in other movies, I could predict the dialog verbatim, but the writing in this movie made better selections.",
        "Brilliant!"
       ]

score = [1, 0, 0, 0, 1, np.nan, 1, 1, 1]
source = ['yelp', 'yelp', 'yelp', 'yelp', 'imbd', 'imbd', 'imdb', 'imdb', 'imdb']
user = [1,2,3,4,5,6,7,8,9]

In [7]:
import pandas as pd

df = pd.DataFrame({
    'user' : user,
    'text': text,
    'source' : source
    })
    

Let's have a quick look at the pandas dataframe we have just created.

In [8]:
df

Unnamed: 0,user,text,source
0,1,"I know that sounds funny, but to me it seemed like sketchy technology that wouldn't work well. Well, this one works great.",yelp
1,2,"All I can do is whine on the Internet, so here it goes. The more I use the thing the less I like it.",yelp
2,3,"I still maintain that monkeys shouldn't make headphones, we just obviously don't share enough DNA to copy the design over to humans. Thank you for wasting my money!",yelp
3,4,"Not only will it drain your player, but may also potentially fry it. I want my money back.",yelp
4,5,"There are massive levels, massive unlockable characters... it's just a massive game.",imbd
5,6,,imbd
6,7,A great film by a great director. The movie had you on the edge of your seat and made you somewhat afraid to go to your car at the end of the night. The music in the film is really nice too. I'd advise anyone to go and see it.,imdb
7,8,"I especially liked the non-cliche choices with the parents; in other movies, I could predict the dialog verbatim, but the writing in this movie made better selections.",imdb
8,9,Brilliant!,imdb


As we can see, the texts are composed of varying numbers of sentences, and one data point is missing.

## Applying ```basicnlp``` functions to a pandas.Series

The main idea behind our basic NLP functions is to make it easy to apply NLP and text-mining functions from NLTK, TextBlob and other self-created functions to texts with different numbers of sentences that are stored in pandas.Series.

Let's see how this works.

### Building pipelines of functions

Suppose we want to split each text into sentences, calculate the polarity score of each sentence, and then average the scores within each text.

#### Sentence and Word Tokenisation

One option is to chain our functions through a series of ```apply().apply()``` calls, which is what we do in the following code.

As you can see, the function ```word_tokenise()``` preserves the sentence boundaries, which is really useful when calculating sentiment scores as this must be done at sentence (and not paragraph) level.

We then save the results as a new pandas.Series.

In [12]:
df['text_word_tok'] = df['text'].apply(sent_tokenise).apply(word_tokenise)


In [13]:
df[['text_word_tok']]

Unnamed: 0,text_word_tok
0,"[[I, know, that, sounds, funny, ,, but, to, me, it, seemed, like, sketchy, technology, that, would, n't, work, well, .], [Well, ,, this, one, works, great, .]]"
1,"[[All, I, can, do, is, whine, on, the, Internet, ,, so, here, it, goes, .], [The, more, I, use, the, thing, the, less, I, like, it, .]]"
2,"[[I, still, maintain, that, monkeys, should, n't, make, headphones, ,, we, just, obviously, do, n't, share, enough, DNA, to, copy, the, design, over, to, humans, .], [Thank, you, for, wasting, my, money, !]]"
3,"[[Not, only, will, it, drain, your, player, ,, but, may, also, potentially, fry, it, .], [I, want, my, money, back, .]]"
4,"[[There, are, massive, levels, ,, massive, unlockable, characters, ..., it, 's, just, a, massive, game, .]]"
5,[]
6,"[[A, great, film, by, a, great, director, .], [The, movie, had, you, on, the, edge, of, your, seat, and, made, you, somewhat, afraid, to, go, to, your, car, at, the, end, of, the, night, .], [The, music, in, the, film, is, really, nice, too, .], [I, 'd, advise, anyone, to, go, and, see, it, .]]"
7,"[[I, especially, liked, the, non-cliche, choices, with, the, parents, ;, in, other, movies, ,, I, could, predict, the, dialog, verbatim, ,, but, the, writing, in, this, movie, made, better, selections, .]]"
8,"[[Brilliant, !]]"
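The nested list-of-lists structure shown above can be illustrated without NLTK. The following is only a rough sketch: the helpers `simple_sent_tokenise` and `simple_word_tokenise` are hypothetical stand-ins, not part of ```basicnlp```, and their regular expressions are much cruder than NLTK's tokenisers (for instance, they keep "wouldn't" as a single token).

```python
import re


def simple_sent_tokenise(text):
    """Split a text into sentences after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def simple_word_tokenise(sentences):
    """Tokenise each sentence separately, so sentence boundaries are preserved."""
    return [re.findall(r"\w+(?:'\w+)?|[^\w\s]", s) for s in sentences]


sentences = simple_sent_tokenise("I want my money back. Brilliant!")
tokens = simple_word_tokenise(sentences)
print(tokens)

# An empty text yields an empty list rather than raising an error
print(simple_word_tokenise(simple_sent_tokenise("")))
```

Because each sentence keeps its own inner list, any later step (such as sentiment scoring) can still operate sentence by sentence.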


#### Calculate sentiment or subjectivity score at sentence level and then at paragraph level

Let's then use NLTK's VADER to compute the sentiment score for each sentence in each text. What we obtain is a list of scores for each text, with one score per sentence in that text.

In [14]:
df['text'].apply(lambda x: sent_tokenise(x)).apply(lambda x: get_sentiment_score_VDR(x))

0    [0.455, 0.7351]           
1    [-0.3612, 0.2975]         
2    [-0.2235, -0.126]         
3    [0.0, 0.0772]             
4    [0.0]                     
5    NaN                       
6    [0.8481, 0.0, 0.4754, 0.0]
7    [0.7092]                  
8    [0.6239]                  
Name: text, dtype: object

Let's now calculate the text-level polarity score by averaging the sentence-level scores for each text:

In [15]:
df['text_sentiment_VDR'] = df['text'].apply(lambda x: sent_tokenise(x)).apply(lambda x: get_sentiment_score_VDR(x)).apply(lambda x: np.mean(x))

In [16]:
df[['text','text_sentiment_VDR']]

Unnamed: 0,text,text_sentiment_VDR
0,"I know that sounds funny, but to me it seemed like sketchy technology that wouldn't work well. Well, this one works great.",0.59505
1,"All I can do is whine on the Internet, so here it goes. The more I use the thing the less I like it.",-0.03185
2,"I still maintain that monkeys shouldn't make headphones, we just obviously don't share enough DNA to copy the design over to humans. Thank you for wasting my money!",-0.17475
3,"Not only will it drain your player, but may also potentially fry it. I want my money back.",0.0386
4,"There are massive levels, massive unlockable characters... it's just a massive game.",0.0
5,,
6,A great film by a great director. The movie had you on the edge of your seat and made you somewhat afraid to go to your car at the end of the night. The music in the film is really nice too. I'd advise anyone to go and see it.,0.330875
7,"I especially liked the non-cliche choices with the parents; in other movies, I could predict the dialog verbatim, but the writing in this movie made better selections.",0.7092
8,Brilliant!,0.6239
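The averaging step can be checked by hand. The sketch below uses only the standard library and a hypothetical helper `mean_or_nan` (not part of the package) to reproduce two of the text-level scores from the sentence-level scores printed earlier, and to show one way a missing text could be propagated as NaN.

```python
import math


def mean_or_nan(scores):
    """Average a list of sentence scores; return NaN for missing or empty input."""
    if isinstance(scores, float) and math.isnan(scores):
        return math.nan
    if not scores:
        return math.nan
    return sum(scores) / len(scores)


# Sentence-level scores copied from the notebook output above
first_text = mean_or_nan([0.455, 0.7351])
seventh_text = mean_or_nan([0.8481, 0.0, 0.4754, 0.0])
missing_text = mean_or_nan(math.nan)

print(first_text, seventh_text, missing_text)
```

Note that a plain average treats a long neutral sentence and a short emphatic one equally, which is worth keeping in mind when comparing texts with different numbers of sentences.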


### A smarter way to build pipelines of functions with ```combine_functions()```

The function ```combine_functions()``` creates chains (or pipelines) of functions. The real advantage of this approach is that once we have created our function pipeline, we can apply it to our text data in one go without building a chain of ```apply().apply()``` calls, as the recalculation below shows.
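To make the idea concrete, here is a minimal sketch of how such a left-to-right function composer could be written with `functools.reduce`. The names `compose_two` and `compose` are hypothetical, and this is not necessarily how ```combine_functions()``` is implemented in the package.

```python
from functools import reduce


def compose_two(f, g):
    """Return a function that applies f first and then g."""
    return lambda x: g(f(x))


def compose(*functions):
    """Chain any number of one-argument functions, applied left to right."""
    # The identity function is the starting value, so compose() with no arguments is valid
    return reduce(compose_two, functions, lambda x: x)


# Equivalent to str.split(str.lower(str.strip(text)))
pipeline = compose(str.strip, str.lower, str.split)
result = pipeline("  Hello World ")
print(result)
```

The composed pipeline is itself a one-argument function, so it can be passed directly to `Series.apply`.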

Let's re-calculate the paragraph-level sentiment score using ```combine_functions()```:

In [17]:
par_sentiment_VDR_fn = combine_functions(sent_tokenise,
                                        get_sentiment_score_VDR,
                                        np.mean)

In [18]:
df['text_sentiment_VDR'] = df['text'].apply(par_sentiment_VDR_fn)
df[['text', 'text_sentiment_VDR']]

Unnamed: 0,text,text_sentiment_VDR
0,"I know that sounds funny, but to me it seemed like sketchy technology that wouldn't work well. Well, this one works great.",0.59505
1,"All I can do is whine on the Internet, so here it goes. The more I use the thing the less I like it.",-0.03185
2,"I still maintain that monkeys shouldn't make headphones, we just obviously don't share enough DNA to copy the design over to humans. Thank you for wasting my money!",-0.17475
3,"Not only will it drain your player, but may also potentially fry it. I want my money back.",0.0386
4,"There are massive levels, massive unlockable characters... it's just a massive game.",0.0
5,,
6,A great film by a great director. The movie had you on the edge of your seat and made you somewhat afraid to go to your car at the end of the night. The music in the film is really nice too. I'd advise anyone to go and see it.,0.330875
7,"I especially liked the non-cliche choices with the parents; in other movies, I could predict the dialog verbatim, but the writing in this movie made better selections.",0.7092
8,Brilliant!,0.6239


Let's compare these to the scores that we would obtain using TextBlob instead:

In [19]:
df['text_sentiment_TB'] = df['text'].apply(combine_functions(sent_tokenise, get_sentiment_score_TB, np.mean))

In [20]:
df[['text','text_sentiment_VDR', 'text_sentiment_TB']]

Unnamed: 0,text,text_sentiment_VDR,text_sentiment_TB
0,"I know that sounds funny, but to me it seemed like sketchy technology that wouldn't work well. Well, this one works great.",0.59505,0.525
1,"All I can do is whine on the Internet, so here it goes. The more I use the thing the less I like it.",-0.03185,0.083333
2,"I still maintain that monkeys shouldn't make headphones, we just obviously don't share enough DNA to copy the design over to humans. Thank you for wasting my money!",-0.17475,0.0
3,"Not only will it drain your player, but may also potentially fry it. I want my money back.",0.0386,0.0
4,"There are massive levels, massive unlockable characters... it's just a massive game.",0.0,-0.1
5,,,
6,A great film by a great director. The movie had you on the edge of your seat and made you somewhat afraid to go to your car at the end of the night. The music in the film is really nice too. I'd advise anyone to go and see it.,0.330875,0.2
7,"I especially liked the non-cliche choices with the parents; in other movies, I could predict the dialog verbatim, but the writing in this movie made better selections.",0.7092,0.458333
8,Brilliant!,0.6239,1.0


#### Specifying functions' arguments

Chained functions' arguments can be specified by using lambda functions, as in the following example:

In [21]:
?remove_stopwords

In [22]:
text_processing_fn1 = combine_functions(sent_tokenise,
                                      word_tokenise,
                                      to_lower,
                                      POS_tagging,
                                      lemmatise,
                                      lambda x: remove_stopwords(x, extra_stopwords = ["'s", 'one'], words_to_keep = ['more'] )
                                     )
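As an alternative to a lambda, `functools.partial` can also fix keyword arguments before a function enters a pipeline. The sketch below uses a hypothetical stand-in, `drop_words`, with a tiny made-up stopword list; it only mimics the keyword names used above and is not the package's ```remove_stopwords()```.

```python
from functools import partial


def drop_words(tokens, extra_stopwords=(), words_to_keep=()):
    """Remove stopwords from a flat list of tokens (toy stopword list, for illustration only)."""
    base_stopwords = {"i", "the", "it", "more"}
    stopwords = (base_stopwords | set(extra_stopwords)) - set(words_to_keep)
    return [t for t in tokens if t.lower() not in stopwords]


# Bind the keyword arguments once; the result is a one-argument function
drop_custom = partial(drop_words, extra_stopwords=["'s", "one"], words_to_keep=["more"])

cleaned = drop_custom(["The", "more", "I", "use", "it"])
print(cleaned)
```

Both approaches produce a one-argument callable; `partial` has the small advantage of a more informative `repr` when debugging a pipeline.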

In [23]:
df['cleaned_text'] = df['text'].apply(lambda x: text_processing_fn1(x))
df[['cleaned_text']]

Unnamed: 0,cleaned_text
0,"[[know, sound, funny, ,, seem, like, sketchy, technology, would, n't, work, well, .], [well, ,, work, great, .]]"
1,"[[whine, internet, ,, go, .], [more, use, thing, less, like, .]]"
2,"[[still, maintain, monkey, n't, make, headphone, ,, obviously, n't, share, enough, dna, copy, design, human, .], [thank, waste, money, !]]"
3,"[[not, drain, player, ,, may, also, potentially, fry, .], [want, money, back, .]]"
4,"[[massive, level, ,, massive, unlockable, character, ..., massive, game, .]]"
5,[]
6,"[[great, film, great, director, .], [movie, edge, seat, make, somewhat, afraid, go, car, end, night, .], [music, film, really, nice, .], ['d, advise, anyone, go, see, .]]"
7,"[[especially, like, non-cliche, choice, parent, ;, movie, ,, could, predict, dialog, verbatim, ,, writing, movie, make, good, selection, .]]"
8,"[[brilliant, !]]"


The text data still contains punctuation and now some extra empty strings (where we removed stopwords). So let's remove these empty strings and the punctuation, and also flatten each list, so that each text is a single list of lemmas instead of a list of lists of tokens.

In [24]:
text_processing_fn2 = combine_functions(flattenIrregularListOfLists,
                                        remove_punctuation,
                                        lambda x: list(filter(None, x)))   # a lambda function included on the fly

df['cleaned_text'] = df['cleaned_text'].apply(lambda x: text_processing_fn2(x))

In [25]:
df[['cleaned_text']]

Unnamed: 0,cleaned_text
0,"[know, sound, funny, seem, like, sketchy, technology, would, nt, work, well, well, work, great]"
1,"[whine, internet, go, more, use, thing, less, like]"
2,"[still, maintain, monkey, nt, make, headphone, obviously, nt, share, enough, dna, copy, design, human, thank, waste, money]"
3,"[not, drain, player, may, also, potentially, fry, want, money, back]"
4,"[massive, level, massive, unlockable, character, massive, game]"
5,[]
6,"[great, film, great, director, movie, edge, seat, make, somewhat, afraid, go, car, end, night, music, film, really, nice, d, advise, anyone, go, see]"
7,"[especially, like, noncliche, choice, parent, movie, could, predict, dialog, verbatim, writing, movie, make, good, selection]"
8,[brilliant]
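The three clean-up steps can be sketched with the standard library alone. The helpers `flatten` and `strip_punctuation` below are hypothetical stand-ins for ```flattenIrregularListOfLists``` and ```remove_punctuation```, written under the assumption that punctuation is stripped character by character (which would explain why "n't" becomes "nt" in the output above).

```python
import string


def flatten(items):
    """Recursively flatten an arbitrarily nested list into a flat generator."""
    for item in items:
        if isinstance(item, (list, tuple)):
            yield from flatten(item)
        else:
            yield item


def strip_punctuation(tokens):
    """Delete every ASCII punctuation character from each token."""
    table = str.maketrans("", "", string.punctuation)
    return [token.translate(table) for token in tokens]


nested = [["know", ",", "n't"], ["well", "."]]
flat = list(flatten(nested))
no_punct = strip_punctuation(flat)
# filter(None, ...) drops the empty strings left behind by pure-punctuation tokens
cleaned_tokens = list(filter(None, no_punct))
print(cleaned_tokens)
```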


#### Remove objective sentences

Let's only keep those sentences within each text that are assessed as 'subjective' (using TextBlob's subjectivity score).

In [26]:
my_preprocessor1 = combine_functions(sent_tokenise, 
                                     lambda x : remove_objective_sents(x, 0.3),  # keep only sentences with subjectivity score > 0.3
                                     list2string  # join the remaining sentences back into a single string
                                    )

# Let's take a look at each sentence's subjectivity score first
df['text'].apply(combine_functions(sent_tokenise, get_subjectivity))

0    [1.0, 0.75]              
1    [0.0, 0.2833333333333333]
2    [0.5, 0.0]               
3    [1.0, 0.0]               
4    [0.85]                   
5    [nan]                    
6    [0.75, 0.9, 1.0, 0.0]    
7    [0.625]                  
8    [1.0]                    
Name: text, dtype: object

In [27]:
df['subj_text'] = df['text'].apply(my_preprocessor1)

df[['subj_text']]

Unnamed: 0,subj_text
0,"I know that sounds funny, but to me it seemed like sketchy technology that wouldn't work well. Well, this one works great."
1,
2,"I still maintain that monkeys shouldn't make headphones, we just obviously don't share enough DNA to copy the design over to humans."
3,"Not only will it drain your player, but may also potentially fry it."
4,"There are massive levels, massive unlockable characters... it's just a massive game."
5,
6,A great film by a great director. The movie had you on the edge of your seat and made you somewhat afraid to go to your car at the end of the night. The music in the film is really nice too.
7,"I especially liked the non-cliche choices with the parents; in other movies, I could predict the dialog verbatim, but the writing in this movie made better selections."
8,Brilliant!
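The filtering logic itself is a simple threshold test. In the sketch below, `keep_subjective` is a hypothetical helper (not the package's ```remove_objective_sents()```), and the scoring function is replaced by a lookup of the two subjectivity scores printed above for the third text, so that no TextBlob installation is needed.

```python
def keep_subjective(sentences, score_fn, threshold=0.3):
    """Keep only the sentences whose subjectivity score is strictly above the threshold."""
    return [s for s in sentences if score_fn(s) > threshold]


# Subjectivity scores copied from the notebook output for the third text (index 2)
scores = {
    "I still maintain that monkeys shouldn't make headphones, we just obviously "
    "don't share enough DNA to copy the design over to humans.": 0.5,
    "Thank you for wasting my money!": 0.0,
}

kept = keep_subjective(list(scores), scores.get)
# Join the surviving sentences back into one string, as in the table above
subj_text = " ".join(kept)
print(subj_text)
```

When every sentence falls below the threshold, the joined result is an empty string, which matches the empty rows in the table.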


#### Count the occurrences of a specific Part-Of-Speech in the text

The function ```count_pos()``` requires a list of (token, POS_tag) tuples as input, so the text must be tokenised and POS-tagged first:

In [28]:
count_adj = combine_functions(sent_tokenise,
                              word_tokenise,
                              POS_tagging,
                              lambda x: count_pos(x, pos_to_cnt='J', normalise=False)
                             )

df['text'].apply(count_adj)

0    3.0
1    2.0
2    1.0
3    0.0
4    4.0
5   NaN 
6    3.0
7    3.0
8    0.0
Name: text, dtype: float64
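In the Penn Treebank tag set, the adjective tags (JJ, JJR, JJS) all begin with "J", so counting adjectives amounts to counting tags with that prefix. The sketch below assumes prefix matching and uses hand-written tags; `count_tag_prefix` is a hypothetical helper, not the package's ```count_pos()```.

```python
def count_tag_prefix(tagged_sentences, prefix="J"):
    """Count tokens whose POS tag starts with the given prefix, across all sentences."""
    return sum(
        1
        for sentence in tagged_sentences
        for _token, tag in sentence
        if tag.startswith(prefix)
    )


# Hand-tagged example (tags written manually for illustration, not produced by a tagger)
tagged = [
    [("A", "DT"), ("great", "JJ"), ("film", "NN")],
    [("really", "RB"), ("nice", "JJ")],
]

n_adjectives = count_tag_prefix(tagged)
n_nouns = count_tag_prefix(tagged, prefix="N")
print(n_adjectives, n_nouns)
```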

## Applying chained functions to strings and saving the results in a pd.Series

Pipelines built with ```combine_functions()``` can also be applied to strings, and the results can then be turned into a pandas.Series with the function ```output_series()```:

In [29]:
output_series( [text_processing_fn1(text) for text in ['I love cabbage. But I am not big on sprouts', 'Me neither.', 'I am please we agree.']])


outcome    [[[love, cabbage, .], [not, big, sprout]], [[neither, .]], [[please, agree, .]]]
dtype: object

In [30]:
combine_functions_output_series(text_processing_fn1('I love cabbage'))

<function nlpbumblebee.utils.combine_2fs.<locals>.<lambda>>

### Series-output decorator
series_output


In [2]:
#TODO