<a href="https://colab.research.google.com/github/mjahanshahi/intermediate-nlp/blob/master/text-analytics/Text_Analytics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Text Analysis

Welcome to this Colab notebook, which I will use for demonstration purposes.

## Comparing NLTK vs spaCy

In [2]:
import spacy
from spacy.lang.en import English
import nltk
from nltk.tokenize import word_tokenize

In [3]:
en = English()
text = 'We are doing Text Analysis.'
doc = en(text)
print(type(doc))
print([(x, type(x)) for x in doc])

<class 'spacy.tokens.doc.Doc'>
[(We, <class 'spacy.tokens.token.Token'>), (are, <class 'spacy.tokens.token.Token'>), (doing, <class 'spacy.tokens.token.Token'>), (Text, <class 'spacy.tokens.token.Token'>), (Analysis, <class 'spacy.tokens.token.Token'>), (., <class 'spacy.tokens.token.Token'>)]


In [4]:
nltk.download('punkt')
doc = word_tokenize(text)
print(type(doc))
print([(x, type(x)) for x in doc])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
<class 'list'>
[('We', <class 'str'>), ('are', <class 'str'>), ('doing', <class 'str'>), ('Text', <class 'str'>), ('Analysis', <class 'str'>), ('.', <class 'str'>)]


In [5]:
%timeit en(text)
%timeit nltk.tokenize.casual_tokenize(text)

The slowest run took 26.97 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8 µs per loop
The slowest run took 7.34 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 16.2 µs per loop


## spaCy's Language Models

In [8]:
from spacy.lang.en import English
en = English()
print(en.tokenizer)
print(en.pipe_names)


<spacy.tokenizer.Tokenizer object at 0x7f83754a5e58>
[]


In [9]:
nlp = spacy.load('en_core_web_sm')
print(nlp.pipe_names)

['tagger', 'parser', 'ner']


In [10]:
doc = en('Text analysis is so much fun!')
print(doc)
print(type(doc))
doc_attrs = set(dir(doc))
print(doc_attrs)

Text analysis is so much fun!
<class 'spacy.tokens.doc.Doc'>
{'lang_', '__ne__', '__sizeof__', 'to_utf8_array', '_realloc', '__init_subclass__', 'sentiment', '__format__', '__lt__', '_vector', 'ents', 'to_disk', 'vector_norm', 'is_parsed', 'get_lca_matrix', '__str__', 'to_bytes', '__unicode__', 'is_sentenced', '__init__', '__iter__', 'doc', 'noun_chunks_iterator', 'remove_extension', '__new__', '__class__', '__reduce_ex__', '_py_tokens', '__setattr__', '_', 'is_nered', 'to_json', '__bytes__', '__delattr__', 'retokenize', 'char_span', '__repr__', '__len__', 'from_array', 'text_with_ws', '__dir__', 'to_array', 'similarity', 'mem', 'count_by', 'from_disk', 'get_extension', 'has_vector', 'noun_chunks', '__getattribute__', '__pyx_vtable__', '_bulk_merge', '__setstate__', 'print_tree', 'sents', 'lang', '__doc__', '__ge__', 'has_extension', 'text', 'tensor', '_vector_norm', 'user_token_hooks', 'cats', '__subclasshook__', 'set_extension', 'user_data', 'extend_tensor', 'user_span_hooks', 'from_

Tokens are the units that make up a document.

In [11]:
print(doc[0])
print(type(doc[0]))
print(dir(doc[0]))

Text
<class 'spacy.tokens.token.Token'>
['_', '__bytes__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', 'ancestors', 'check_flag', 'children', 'cluster', 'conjuncts', 'dep', 'dep_', 'doc', 'ent_id', 'ent_id_', 'ent_iob', 'ent_iob_', 'ent_kb_id', 'ent_kb_id_', 'ent_type', 'ent_type_', 'get_extension', 'has_extension', 'has_vector', 'head', 'i', 'idx', 'is_alpha', 'is_ancestor', 'is_ascii', 'is_bracket', 'is_currency', 'is_digit', 'is_left_punct', 'is_lower', 'is_oov', 'is_punct', 'is_quote', 'is_right_punct', 'is_sent_start', 'is_space', 'is_stop', 'is_title', 'is_upper', 'lang', 'lang_', 'left_edge', 'lefts', 'lemma', 'lemma_', 'lex_id', 'like_email', 'like_num', 'like_url', 'lower', 'lowe

In [15]:
print(doc[0])
print(doc[0].lower_)

Text
text


### Text Preprocessing

#### Normalizing case

In [16]:
[x.lower_ for x in en(text)]

['we', 'are', 'doing', 'text', 'analysis', '.']

#### Stripping punctuation

In [17]:
[x.text for x in en(text) if x.is_alpha]

['We', 'are', 'doing', 'text', 'analysis']

In [19]:
text = "We're doing text analysis and it's fun!"
[x.text for x in en(text) if x.is_alpha]

Removing non-alpha ['We', 'doing', 'text', 'analysis', 'and', 'it', 'fun']


#### Lemmatizing

In [23]:
[x.lemma_ for x in nlp(text)]

['-PRON-', 'be', 'do', 'text', 'analysis', 'and', '-PRON-', 'be', 'fun', '!']

#### Stop Words

In [24]:
[x.text for x in en(text) if not x.is_stop]

['text', 'analysis', 'fun', '!']

#### Named Entities

First, URLs:

In [25]:
text = "Check out the course on Github: https://github.com/mjahanshahi/intermediate-nlp"
print([x for x in en(text) if not x.like_url])
print(['-URL-' if x.like_url else x for x in en(text)])

[Check, out, the, course, on, Github, :]
[Check, out, the, course, on, Github, :, '-URL-']


In [26]:
parsed = nlp(text)
# look at the individual tokens
tokens = [t for t in parsed]
print(tokens)
# look at the identified named-entities and their types
for e in parsed.ents:
    print(e, type(e), e.label_, spacy.explain(e.label_))

[Check, out, the, course, on, Github, :, https://github.com/mjahanshahi/intermediate-nlp]
Github <class 'spacy.tokens.span.Span'> ORG Companies, agencies, institutions, etc.


### Putting it all together

In [20]:
text_data = ["I'm taking a course on Safari.",
            "I'm learning about Text Analysis.",
            "We are studying preprocessing text and then analysing it",
            "Check out the course on Github: https://github.com/mjahanshahi/intermediate-nlp"]

In [21]:
def tokenize_full(docs, model=nlp,
                  entities=False,
                  stop_words=False,
                  lowercase=True,
                  alpha_only=True,
                  lemma=True):
    """Full tokenizer with flags for processing steps
    entities: If False, replaces named entities with their entity type (and URLs with 'URL')
    stop_words: If False, removes stop words
    lowercase: If True, lowercases all tokens
    alpha_only: If True, removes all non-alphabetic tokens
    lemma: If True, lemmatizes words
    """
    tokenized_docs = []
    for d in docs:
        parsed = model(d)
        # token collector
        tokens = []
        # entity collector: type of the entity currently being read, if any
        ent = ''
        for t in parsed:
            # only need this if we're replacing entities
            if not entities:
                # replace URLs
                if t.like_url:
                    tokens.append('URL')
                    continue
                # inside an entity: remember its type and move on
                if t.ent_iob_ != 'O':
                    ent = t.ent_type_
                    continue
                # the entity just ended: emit its type, then process the current token as usual
                if ent != '':
                    tokens.append(ent)
                    ent = ''
            # only include stop words if stop_words==True
            if t.is_stop and not stop_words:
                continue
            # only include non-alpha tokens if alpha_only==False
            if not t.is_alpha and alpha_only:
                continue
            token = t.lemma_ if lemma else t.text
            if lowercase:
                # str.lower() returns a new string, so the result has to be kept
                token = token.lower()
            tokens.append(token)
        # flush an entity that ends the document
        if ent != '':
            tokens.append(ent)
        tokenized_docs.append(tokens)
    return tokenized_docs

In [27]:
tokenize_full(text_data, stop_words=True, alpha_only=False, entities=True)

[['-pron-', 'be', 'take', 'a', 'course', 'on', 'safari', '.'],
 ['-pron-', 'be', 'learn', 'about', 'text', 'analysis', '.'],
 ['-pron-',
  'be',
  'study',
  'preprocesse',
  'text',
  'and',
  'then',
  'analyse',
  '-pron-'],
 ['check',
  'out',
  'the',
  'course',
  'on',
  'github',
  ':',
  'https://github.com/mjahanshahi/intermediate-nlp']]

In [31]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
cv = CountVectorizer()
v = cv.fit_transform(text_data).toarray()
pd.DataFrame(v, columns=cv.get_feature_names())

Unnamed: 0,about,analysing,analysis,and,are,check,com,course,github,https,intermediate,it,learning,mjahanshahi,nlp,on,out,preprocessing,safari,studying,taking,text,the,then,we
0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0
1,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0
2,0,1,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,1,0,1,1
3,0,0,0,0,0,1,1,1,2,1,1,0,0,1,1,1,1,0,0,0,0,0,1,0,0


In [33]:
cv = CountVectorizer(vocabulary=['text', 'analysis', 'preprocessing', 'safari'])
v = cv.fit_transform(text_data).toarray()
pd.DataFrame(v, columns=cv.get_feature_names())

Unnamed: 0,text,analysis,preprocessing,safari
0,0,0,0,1
1,1,1,0,0
2,1,0,1,0
3,0,0,0,0


## Using a Dictionary to Analyse Review Sentiment

A traditional technique for analysing the sentiment of texts is to use dictionaries of words with positive and negative connotations and to count the occurrences of those words in a text, taking their polarity and valence into account.

In this section, we will use the Afinn package, which has about 2.5k words coded by polarity and valence.
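
To make the idea concrete, here is a minimal sketch of dictionary-based scoring. The `toy_lexicon` and the `dictionary_score` helper are hypothetical illustrations, not the Afinn word list or its implementation, and the valences are made up for the example.

```python
import re

# Hypothetical mini-lexicon: word -> valence.
# The sign is the polarity and the magnitude is the strength.
# These values are illustrative only; they are not taken from the Afinn word list.
toy_lexicon = {"great": 3, "good": 2, "bad": -2, "terrible": -3}

def dictionary_score(text, lexicon):
    """Sum the valence of every word in `text` that appears in `lexicon`."""
    # Lowercase the text and keep runs of letters/apostrophes as words
    words = re.findall(r"[a-z']+", text.lower())
    # Words missing from the lexicon contribute 0
    return sum(lexicon.get(word, 0) for word in words)

print(dictionary_score("The fabric is great but the fit is bad", toy_lexicon))
```

Here "great" contributes +3 and "bad" contributes -2, so the positive and negative words partly cancel out.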

In [2]:
!pip install afinn

Collecting afinn
  Downloading https://files.pythonhosted.org/packages/86/e5/ffbb7ee3cca21ac6d310ac01944fb163c20030b45bda25421d725d8a859a/afinn-0.1.tar.gz (52kB)
Building wheels for collected packages: afinn
  Building wheel for afinn (setup.py) ... done
  Created wheel for afinn: filename=afinn-0.1-cp36-none-any.whl size=53453 sha256=693b118b7381dc265be177ae816b0ac4923bd634c0e3d1a309d10533dcafecd3
  Stored in directory: /root/.cache/pip/wheels/b5/1c/de/428301f3333ca509dcf20ff358690eb23a1388fbcbbde008b2
Successfully built afinn
Installing collected packages: afinn
Succes

In [13]:
from afinn import Afinn
afinn = Afinn(language='en')

In [8]:
afinn.score('Great')

3.0

In [7]:
afinn.score('Good')

3.0

In [9]:
afinn.score('Terrible')

-3.0

In [16]:
afinn.score('I feel great! :)')

5.0

Let's apply this to an actual dataset! This is the Women's Clothing E-Commerce Reviews from [this Kaggle Challenge](https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews). Bi-directional LSTMs have [reached an F1 score of 0.93.](https://github.com/AFAgarap/ecommerce-reviews-analysis)

In [18]:
import pandas as pd

In [30]:
df = pd.read_csv('/Womens Clothing E-Commerce Reviews.csv', index_col=0)

In [31]:
df.head()

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


We can see that some titles are null. It's also possible that some reviews do not contain any text.

In [151]:
df[(df["Review Text"].isnull()) & (df["Title"].isnull())].shape[0]

844

We should remove these from the dataframe since this analysis aims to infer sentiment from text. 

In [159]:
df.drop(df[(df["Review Text"].isnull()) & (df["Title"].isnull())].index, inplace=True)

There are two columns that may convey sentiment:
- `Review Text`
- `Title`

To calculate the Afinn sentiment score for all of the responses in the dataframe, we can apply the scorer to the `Review Text` column and create a new column `text_score`. We do the same to generate a `title_score` column. 



In [82]:

# Score every non-null review; rows without review text are left as NaN
df["text_score"] = df["Review Text"].dropna().apply(afinn.score)


In [87]:
# Score every non-null title; rows without a title are left as NaN
df["title_score"] = df["Title"].dropna().apply(afinn.score)

In [88]:
df["total_score"] = 2 * df["title_score"] + df["text_score"]

In [107]:
df['total_score'].describe()

count    23486.000000
mean        11.325172
std          7.615414
min        -20.000000
25%          6.000000
50%         11.000000
75%         16.000000
max         52.000000
Name: total_score, dtype: float64

In [100]:
df.groupby("Rating").median(numeric_only=True)

Rating,Clothing ID,Age,Recommended IND,Positive Feedback Count,text_score,title_score,total_score
1,936,42,0,1,3,0,3
2,936,41,0,1,4,0,5
3,936,40,0,1,6,0,7
4,928,41,1,1,7,2,11
5,936,41,1,1,9,2,13


In [160]:
df[(df["total_score"]<10) & (df["Rating"]==5)]

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,text_score,title_score,total_score,word_count,normalized_score
4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses,6,0,6,36,0.166667
6,858,39,Cagrcoal shimmer fun,I aded this in my basket at hte last mintue to...,5,1,1,General Petite,Tops,Knits,-1,4,7,101,11.990099
8,1077,24,Flattering,I love this dress. i usually get an xs but it ...,5,1,0,General,Dresses,Dresses,3,0,3,34,0.088235
11,1095,39,,This dress is perfection! so pretty and flatte...,5,1,2,General Petite,Dresses,Dresses,4,0,4,8,0.500000
13,767,44,Runs big,Bought the black xs to go under the larkspur m...,5,1,0,Initmates,Intimate,Intimates,1,1,3,69,3.014493
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23417,850,39,Get it quick!,Can i tell you this top is amazing?! get it qu...,5,1,0,General,Tops,Blouses,6,0,6,15,0.400000
23438,181,68,Just right,I feel like snagging a pair of these was the e...,5,1,0,Initmates,Intimate,Legwear,4,0,4,62,0.064516
23441,1104,63,Sweet surprise,Don't know why but i didn't have high expectat...,5,1,25,General Petite,Dresses,Dresses,2,2,6,88,6.022727
23442,1104,39,Flattering dress,"Love this dress, very flattering fit and the f...",5,1,0,General Petite,Dresses,Dresses,6,0,6,41,0.146341


One of the drawbacks of using the raw Afinn score is that longer texts may yield higher values simply because they contain more words. To adjust for that, we can divide the score by the number of words in the text.
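
As a small illustration of the adjustment, the sketch below divides a raw score by the word count. The `normalized_score` helper and the example reviews are hypothetical, and returning 0.0 for an empty text is an assumption made here to avoid dividing by zero.

```python
def normalized_score(raw_score, text):
    """Divide a raw sentiment score by the number of whitespace-separated words."""
    word_count = len(text.split())
    # Assumption: an empty text gets a score of 0.0 instead of raising an error
    if word_count == 0:
        return 0.0
    return raw_score / word_count

short_review = "Love it"
long_review = "Love it " + "and it fits well " * 5

# Both reviews are given the same raw score of 3 for the sake of comparison
print(normalized_score(3, short_review))
print(normalized_score(3, long_review))
```

With the same raw score, the short review ends up with a much higher normalized score than the long one, because its sentiment is concentrated in fewer words.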

In [161]:
# Count whitespace-separated words in each review (0 when the review text is missing)
df["word_count"] = df["Review Text"].fillna("").str.split().str.len()
# Normalize the review score by its length; the title score keeps its double weight
df["normalized_score"] = (df["text_score"] / df["word_count"]) + (2 * df["title_score"])

In [162]:
df["normalized_score"].describe()

count    22641.000000
mean         3.552963
std          3.909686
min        -11.865979
25%          0.146341
50%          4.070175
75%          6.183099
max         24.164706
Name: normalized_score, dtype: float64

In [213]:
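def generate_confusion_matrix(df, score_column, threshold):
  """Count confusion-matrix cells and return the F1 score for the positive class.

  A review is predicted positive when its score is >= threshold. Ratings above 3
  count as positive, ratings below 3 as negative, and 3-star reviews are ignored.
  """
  tp = df[(df[score_column]>=threshold) & (df["Rating"]>3)].shape[0]
  fp = df[(df[score_column]>=threshold) & (df["Rating"]<3)].shape[0]
  # true negatives are counted for completeness but do not enter the F1 formula
  tn = df[(df[score_column]<threshold) & (df["Rating"]<3)].shape[0]
  fn = df[(df[score_column]<threshold) & (df["Rating"]>3)].shape[0]
  # F1 = tp / (tp + 0.5 * (fp + fn))
  return tp / (tp + 0.5*(fp + fn))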

In [214]:
generate_confusion_matrix(df, "normalized_score", -1)

0.9426901223776224

In [215]:
generate_confusion_matrix(df, "total_score", 2)

0.9424460431654677

Feature engineering with an out-of-the-box dictionary gives us some pretty good results!
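
Note that the value returned by `generate_confusion_matrix` is an F1 score: `tp / (tp + 0.5 * (fp + fn))` is algebraically the same as the harmonic mean of precision and recall. The sketch below checks this with made-up counts; the numbers and helper names are hypothetical and are not taken from the review dataset.

```python
def f1_from_counts(tp, fp, fn):
    """F1 written directly in terms of confusion-matrix counts."""
    return tp / (tp + 0.5 * (fp + fn))

def f1_from_precision_recall(tp, fp, fn):
    """F1 written as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up counts, for illustration only
tp, fp, fn = 80, 10, 30

print(f1_from_counts(tp, fp, fn))
print(f1_from_precision_recall(tp, fp, fn))
```

Both functions agree up to floating-point rounding. True negatives appear in neither formula, so F1 says nothing about how well negative reviews are identified.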

## Creating your own classifier

It's possible that you may want to create a new set of words that relate to your specific use case.

In [223]:
def get_score(text, custom_set):
  # First we tokenize 
  text = text.lower()
  punctuation = '"!#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
  tokenized_text = "".join([ch for ch in text if ch not in punctuation]).split()
  tokenized_set = set(tokenized_text)
  
  return len(tokenized_set.intersection(custom_set)) * 2


In [242]:
custom_set = set(["flattering", "quick", "well", "right", "comfortable", "slimming", "confident"])

In [235]:
df.at[4, "Review Text"]

'This shirt is very flattering to all due to the adjustable front tie. it is the perfect length to wear with leggings and it is sleeveless so it pairs well with any cardigan. love this shirt!!!'

In [243]:
get_score(df.at[4, "Review Text"], custom_set)

4

In [237]:
df.at[23442, "Review Text"]

'Love this dress, very flattering fit and the fabric does not feel heavy but is sturdy - i wore it for first dinner out with my husband after losing most of my baby weight and felt great and confident in it.'

In [244]:
get_score(df.at[23442, "Review Text"], custom_set)

4

In [245]:
get_score(df.at[23438, "Review Text"], custom_set)

4

In [241]:
df.at[23438, "Review Text"]

"I feel like snagging a pair of these was the equivalent to standing in line for black friday, as they always seem to be out of stock. now i know why. these are soft, comfortable, and slimming. they're somewhere between the hold of control top pantyhose and spans--they don't fall and sag throughout the day and are nicely slimming without being pain-inducing."