# Feature Engineering for NLP in Python
Rounak Banik - Data Scientist at Fractal Analytics

Rounak is a Young India Fellow and the author of the book, Hands-on Recommendation Systems with Python. He currently works as a Data Science Fellow with the QuantumBlack division of McKinsey and Company. He obtained his B.Tech degree in Electronics & Communication Engineering from IIT Roorkee.

Summary
1. Basic features and readability scores
2. Text preprocessing, POS tagging, and NER (named entity recognition)
3. N-Gram models - for sentiment analysis
4. TF-IDF and cosine similarity scores - for recommender

# Intro to NLP feature engineering

1. Introduction to NLP feature engineering
 - Welcome to Feature Engineering for NLP in Python! I am Rounak and I will be your instructor for this course. In this course, you will learn to extract useful features out of text and convert them into formats that are suitable for machine learning algorithms.

2. Numerical data
 - For any ML algorithm, data fed into it must be in tabular form and all the training features must be numerical. Consider the Iris dataset. Every training instance has exactly four numerical features. The ML algorithm uses these four features to train and predict if an instance belongs to class iris-virginica, iris-setosa or iris-versicolor.

3. One-hot encoding
 - ML algorithms can also work with categorical data, provided it is converted into numerical form through one-hot encoding. Let's say you have a categorical feature 'sex' with two categories, 'male' and 'female'.

4. One-hot encoding
 - One-hot encoding will convert this feature into two features,

5. One-hot encoding
 - 'sex_male' and 'sex_female' such that each male instance has a 'sex_male' value of 1 and a 'sex_female' value of 0. For female instances, the values are reversed.
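
To make the transformation concrete, here is a minimal pure-Python sketch of what one-hot encoding does to the 'sex' feature. The helper name `one_hot` and the sample records are assumptions made for illustration; they are not part of the course material, and in practice a library such as pandas does this work.

```python
def one_hot(records, feature):
    """Replace a categorical feature with one 0/1 column per category."""
    # Collect the distinct categories in a stable (sorted) order
    categories = sorted({record[feature] for record in records})
    encoded = []
    for record in records:
        # Copy every other feature unchanged
        row = {key: value for key, value in record.items() if key != feature}
        # Add one indicator column per category
        for category in categories:
            row[f"{feature}_{category}"] = int(record[feature] == category)
        encoded.append(row)
    return encoded


# Hypothetical sample data
people = [{"age": 23, "sex": "male"}, {"age": 31, "sex": "female"}]
encoded_people = one_hot(people, "sex")
print(encoded_people)
```

Each record keeps its numerical features and gains exactly one indicator set to 1.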

6. One-hot encoding with pandas
 - To do this in code, we use pandas' get_dummies() function. Let's import pandas using the alias pd. We can then pass our dataframe df into the pd.get_dummies() function and pass a list of features to be encoded as the columns argument. Not mentioning columns will lead pandas to automatically encode all non-numerical features. Finally, we overwrite the original dataframe with the encoded version by assigning the dataframe returned by get_dummies() back to df.

7. Textual data
 - Consider a movie reviews dataset. In its raw form, this data cannot be used by any ML algorithm. The training feature 'review' isn't numerical, and it isn't categorical either, so one-hot encoding does not apply.

8. Text pre-processing
 - We need to perform two steps to make this dataset suitable for ML. The first is to standardize the text. This involves steps like converting words to lowercase and their base form. For instance, 'Reduction' gets lowercased and then converted to its base form, reduce. We will cover these concepts in more detail in subsequent lessons.

9. Vectorization
 - After preprocessing, the reviews are converted into a set of numerical training features through a process known as vectorization. After vectorization, our original review dataset gets converted

10. Vectorization
 - into something like this. We will learn techniques to achieve this in later lessons.
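
As a rough preview of the idea, the sketch below turns two short reviews into rows of word counts over a shared vocabulary. This is a simple bag-of-words scheme chosen here for illustration; the function name `vectorize` and the sample reviews are assumptions, and the course's own techniques are introduced in later lessons.

```python
from collections import Counter


def vectorize(documents):
    """Turn each document into a list of word counts over a shared vocabulary."""
    # Lowercase and split on whitespace (a deliberately crude tokenizer)
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is every distinct word, in sorted order
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One numerical feature per vocabulary word
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Hypothetical sample reviews
reviews = ["great movie great cast", "boring movie"]
vocabulary, vectors = vectorize(reviews)
print(vocabulary)
print(vectors)
```

Every review becomes a fixed-length row of numbers, which is the tabular form ML algorithms expect.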

11. Basic features
 - We can also extract certain basic features from text. It may be useful to know the word count, character count and average word length of a particular text. While working with niche data such as tweets, it may also be useful to know how many hashtags have been used in a tweet. This tweet by Silverado Records, for instance, uses two.

12. POS tagging
 - So far, we have seen how to extract features out of an entire body of text. Some NLP applications may require you to extract features for individual words. For instance, you may want to do parts-of-speech tagging to know the different parts-of-speech present in your text as shown. As an example, consider the sentence 'I have a dog'. POS tagging will label each word with its corresponding part-of-speech.

13. Named Entity Recognition
 - You may also want to perform named entity recognition to find out if a particular noun refers to a person, organization or country. For instance, consider the sentence "Brian works at DataCamp". Here, there are two nouns, "Brian" and "DataCamp". Brian refers to a person whereas DataCamp refers to an organization.

14. Concepts covered
 - Therefore, broadly speaking, this course will teach you how to conduct text preprocessing, extract certain basic features, word features and convert documents into a set of numerical features (using a process known as vectorization).

# One-hot encoding

In [None]:
# Print the features of df1
print(df1.columns)

# Perform one-hot encoding
df1 = pd.get_dummies(df1, columns=['feature 5'])

# Print the new features of df1
print(df1.columns)

# Print first five rows of df1
print(df1.head())

# Basic feature extraction

1. Basic feature extraction
 - In this video, we will learn to extract certain basic features from text. While not very powerful, they can give us a good idea of the text we are dealing with.

2. Number of characters
 - The most basic feature we can extract from text is the number of characters, including whitespaces. For instance, the string "I don't know." has 13 characters. The number of characters is the length of the string. Python gives us a built-in len() function which returns the length of the string passed into it. The output will be 13 here too. If our dataframe df has a textual feature (say 'review'), we can compute the number of characters for each review and store it as a new feature 'num_chars' by using the pandas dataframe apply method. This is done by creating df['num_chars'] and assigning it to df['review'].apply(len).

3. Number of words
 - Another feature we can compute is the number of words. Assuming that every word is separated by a space, we can use a string's split() method to convert it into a list where every element is a word. In this example, the string Mary had a little lamb is split to create a list containing the words Mary, had, a, little and lamb. We can now compute the number of words by computing the number of elements in this list using len().

4. Number of words
 - To do this for a textual feature in a dataframe, we first define a function that takes in a string as an argument and returns the number of words in it. The steps followed inside the function are similar as before. We then pass this function word_count into apply. We create df['num_words'] and assign it to df['review'].apply(word_count).

5. Average word length
 - Let's now compute the average length of words in a string. Let's define a function avg_word_length() which takes in a string and returns the average word length. We first split the string into words and compute the length of each word. Next, we compute the average word length by dividing the sum of the lengths of all words by the number of words.

6. Average word length
 - We can now pass this into apply() to generate an average word length feature like before.

7. Special features
 - When working with data such as tweets, it may be useful to compute the number of hashtags or mentions used. This tweet by DataCamp, for instance, has one mention, upendra_35, which begins with an @, and two hashtags, PySpark and Spark, which begin with a #.

8. Hashtags and mentions
 - Let's write a function that computes the number of hashtags in a string. We split the string into words. We then use list comprehension to create a list containing only those words that are hashtags. We do this using the startswith method of strings to find out if a word begins with #. The final step is to return the number of elements in this list using len. The procedure to compute number of mentions is identical except that we check if a word starts with @. Let's see this function in action. When we pass a string "@janedoe This is my first tweet! #FirstTweet #Happy", the function returns 2 which is indeed the number of hashtags in the string.

9. Other features
 - There are other basic features we can compute such as number of sentences, number of paragraphs, number of words starting with an uppercase, all-capital words, numeric quantities etc. The procedure to do this is extremely similar to the ones we've already covered.

## Number of characters

In [2]:
len("I don't know.")

13

In [None]:
# apply to column
# create a 'num_chars' feature
df['num_chars'] = df['review'].apply(len)

## Average character count of a text feature

In [None]:
# Create a feature char_count
tweets['char_count'] = tweets['content'].apply(len)

# Print the average character count
print(tweets['char_count'].mean())

# note - fake news articles tend to have longer titles - per research

## Number of words - assume separated by space

In [5]:
text = "Mary had a little lamb."
words = text.split()
print(words)
print(len(words))

['Mary', 'had', 'a', 'little', 'lamb.']
5


In [None]:
# feature in df
# function that returns number of words in string
def word_count(string):
    # split the string into words
    words = string.split()
    # return length of words list
    return len(words)

# create num_words feature in df
df['num_words'] = df['review'].apply(word_count)

In [None]:
# Example - word count of TED talks

# Function that returns number of words in a string
def count_words(string):
    # Split the string into words
    words = string.split()
    
    # Return the number of words
    return len(words)

# Create a new feature word_count
ted['word_count'] = ted['transcript'].apply(count_words)

# Print the average word count of the talks
print(ted['word_count'].mean())

'''
You can use the word_count feature to compute its correlation 
with other variables such as number of views, number of comments,
etc. and derive extremely interesting insights about TED.
'''

## Average word length

In [None]:
# function that returns avg word length
def avg_word_length(x):
    # split the string into words
    words = x.split()
    # compute length of each word and store in a separate list
    word_lengths = [len(word) for word in words]
    # compute average word length (local name kept distinct from the function name)
    avg_length = sum(word_lengths)/len(words)
    # return average word length
    return avg_length

# create a new feature avg_word_length
df['avg_word_length'] = df['review'].apply(avg_word_length)

## Special features like hashtags and mentions

In [6]:
# function that returns number of hashtags
def hashtag_count(string):
    # split the string into words
    words = string.split()
    # create a list of hashtags
    hashtags = [word for word in words if word.startswith('#')]
    # return number of hashtags
    return len(hashtags)

hashtag_count("@janedoe This is my first tweet! #FirstTweet #Happy")

2

In [None]:
# Create a feature hashtag_count and display distribution
tweets['hashtag_count'] = tweets['content'].apply(hashtag_count)
tweets['hashtag_count'].hist()
plt.title('Hashtag count distribution')
plt.show()

In [None]:
# same thing for mentions - using @ character
# Function that returns number of mentions in a string
def count_mentions(string):
    # Split the string into words
    words = string.split()
    
    # Create a list of words that are mentions
    mentions = [word for word in words if word.startswith('@')]
    
    # Return number of mentions
    return(len(mentions))

# Create a feature mention_count and display distribution
tweets['mention_count'] = tweets['content'].apply(count_mentions)
tweets['mention_count'].hist()
plt.title('Mention count distribution')
plt.show()

## Other features
- number of sentences
- number of paragraphs
- words starting with an uppercase
- all-capital words
- numeric quantities
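
The features listed above can be sketched with the same split-and-count approach plus a few regular expressions. The function name `other_features`, the sample text, and the heuristics (sentence ends marked by `.`, `!` or `?`; paragraphs separated by blank lines) are assumptions made for illustration, not rules from the course.

```python
import re


def other_features(text):
    """Compute a few rough surface features of a text."""
    words = text.split()
    return {
        # Rough sentence count: each run of ., ! or ? is treated as one sentence end
        "num_sentences": len(re.findall(r"[.!?]+", text)),
        # Paragraphs are assumed to be separated by blank lines
        "num_paragraphs": len([p for p in re.split(r"\n\s*\n", text) if p.strip()]),
        # Words whose first character is an uppercase letter
        "num_capitalized": sum(1 for w in words if w[0].isupper()),
        # All-capital words; single letters such as "I" are excluded
        "num_all_caps": sum(1 for w in words if w.isupper() and len(w) > 1),
        # Integers and simple decimals
        "num_numeric": len(re.findall(r"\b\d+(?:\.\d+)?\b", text)),
    }


# Hypothetical sample text
sample = "OMG! This is the BEST song of 2019.\n\nTop 5 for sure."
features = other_features(sample)
print(features)
```

These heuristics miscount on abbreviations such as "U.S.A." or on ellipses, so treat the numbers as approximations.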

# Readability Tests - word, syllable, sentence count

1. Readability tests
 - In this lesson, we will look at a set of interesting features known as readability tests.

2. Overview of readability tests
 - These tests are used to determine the readability of a particular passage. In other words, they indicate the educational level a person needs to have reached in order to comprehend a particular piece of text. The scale usually ranges from primary school up to college graduate level and is based on the American education system. 
 - Readability tests are done using a mathematical formula that utilizes the word, syllable and sentence count of the passage. They are routinely used by organizations to determine how easy their publications are to understand. They have also found applications in domains such as fake news and opinion spam detection.

3. Readability text examples
 - There are a variety of readability tests in use.
 - common ones include the 
    - Flesch reading ease
    - Gunning fog index
    - the simple measure of gobbledygook (SMOG) 
    - Dale-Chall score. 
 - Note that these tests are used for texts in English. Tests for other languages also exist that take the nuances of that particular language into consideration. For the sake of brevity, we will cover only the first two scores in detail. However, once you understand them, you will be in a good position to understand and use the other scores too.

5. Flesch reading ease
 - The Flesch Reading Ease is one of the oldest and most widely used readability tests. The score is based on two ideas. The first is that the greater the average sentence length, the harder the text is to read. Consider these two sentences. The first is easier to follow than the second. The second idea is that the greater the average number of syllables in a word, the harder the text is to read. Therefore, "I live in my home" is considered easier to read than "I reside in my domicile" because it uses fewer syllables per word. The higher the Flesch Reading Ease score, the greater the readability. In other words, a higher score indicates that the text is easier to understand.
 - Principles
     - The greater the avg sentence length, the harder the text is to read
     - The greater the avg number of syllables in a word, the harder the text is to read
 - The higher the Flesch score, the greater the readability
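
The two principles above correspond to the two penalty terms in the standard published Flesch Reading Ease formula. Below is a minimal sketch that applies it to pre-computed counts. The function name `flesch_reading_ease` is an assumption, and the syllable counts in the example were counted by hand; a library such as Textatistic estimates syllables automatically and may give slightly different values.

```python
def flesch_reading_ease(num_words, num_sentences, num_syllables):
    """Flesch Reading Ease from raw counts (higher means easier to read)."""
    avg_sentence_length = num_words / num_sentences
    avg_syllables_per_word = num_syllables / num_words
    # Longer sentences and longer words both lower the score
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word


# "I live in my home": 5 words, 1 sentence, 5 syllables (hand count)
easy = flesch_reading_ease(5, 1, 5)
# "I reside in my domicile": 5 words, 1 sentence, 8 syllables (hand count)
hard = flesch_reading_ease(5, 1, 8)
print("%.2f %.2f" % (easy, hard))
```

The first sentence scores higher than the second, matching the intuition described above. Very short, simple text can score above 100.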

6. Flesch reading ease score interpretation
 - This table shows how to interpret the Flesch Reading Ease scores. A score above 90 would imply that the text is comprehensible to a 5th grader whereas a score below 30 would imply the text can only be understood by college graduates.
 - Reading ease score - Grade level
     - 90-100, 5th grade
     - 80-90, 6
     - 70-80, 7
     - 60-70, 8-9
     - 50-60, 10-12
     - 30-50, College
     - 0-30, College graduate     

7. Gunning fog index
 - The Gunning fog index was developed in 1954. Like Flesch, this score is also dependent on the average sentence length. However, it uses percentage of complex words in place of average syllables per word to compute its score. Here, complex words refer to all words that have three or more syllables. 
 - Greater the % of complex words (3+ syllables), harder to read
 - Unlike Flesch, the formula for the Gunning fog index is such that the higher the score, the more difficult the passage is to understand.
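
For reference, the standard published Gunning fog formula combines the average sentence length with the percentage of complex words. The sketch below applies it to pre-computed counts; the function name `gunning_fog` and the example counts are assumptions made for illustration.

```python
def gunning_fog(num_words, num_sentences, num_complex_words):
    """Gunning fog index from raw counts (higher means harder to read)."""
    avg_sentence_length = num_words / num_sentences
    # Complex words are those with three or more syllables
    percent_complex = 100 * num_complex_words / num_words
    return 0.4 * (avg_sentence_length + percent_complex)


# Hypothetical passage: 100 words, 5 sentences, 10 complex words
fog = gunning_fog(100, 5, 10)
print(fog)
```

With these counts the index comes out at about 12, which the interpretation table maps to high school senior level.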
 
8. Gunning fog index interpretation
 - The index can be interpreted using this table. A score of 6 would indicate 6th grade reading difficulty whereas a score of 17 would indicate college graduate level reading difficulty.
 - Gunning Fog index, grade level
     - 17, college graduate
     - 16, college senior
     - 15, college jr
     - 14, college soph
     - 13, college freshman
     - 12, high school senior
     - 11, hs jr
     - 10, hs soph
     - 9, hs freshman
     - 8, 8th grade
     - 7, 7
     - 6, 6
     
9. The textatistic library
 - We can conduct these readability tests in Python using the Textatistic library. We import the Textatistic class from textatistic. Next, we create a Textatistic object and pass in the passage or text we're evaluating. We then access the dictionary of readability scores from the Textatistic object using the 'scores' attribute and store it in a variable named readability_scores. Finally, we access the various scores from the readability_scores dictionary using their corresponding keys as shown. In this example, the text that was passed is between the reading level of a college senior and that of a college graduate.

## Textatistic library for readability

In [None]:
# import Textatistic class
from textatistic import Textatistic

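# create a Textatistic object for the passage and get its dictionary of scores
# (text is assumed to hold the passage being evaluated)
readability_scores = Textatistic(text).scores

# print the individual scores
print(readability_scores['flesch_score'])
print(readability_scores['gunningfog_score'])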

In [None]:
# Import Textatistic
from textatistic import Textatistic

# Compute the readability scores 
readability_scores = Textatistic(sisyphus_essay).scores

# Print the flesch reading ease score
flesch = readability_scores['flesch_score']
print("The Flesch Reading Ease is %.2f" % (flesch))

In [None]:
# Import Textatistic
from textatistic import Textatistic

# List of excerpts
excerpts = [forbes, harvard_law, r_digest, time_kids]

# Loop through excerpts and compute gunning fog index
gunning_fog_scores = []
for excerpt in excerpts:
  readability_scores = Textatistic(excerpt).scores
  gunning_fog = readability_scores['gunningfog_score']
  gunning_fog_scores.append(gunning_fog)

# Print the gunning fog indices
print(gunning_fog_scores)

# [14.436002482929858, 20.735401069518716, 11.085587583148559, 
# 5.926785009861934]

# Tokenization and Lemmatization
- aka standardizing the text for ML

1. Tokenization and Lemmatization
 - In NLP, we usually have to deal with texts from a variety of sources. For instance,

2. Text sources
 - it can be a news article where the text is grammatically correct and proofread. It could be tweets containing shorthand and hashtags. It could also be comments on YouTube where people have a tendency to overuse capital letters and punctuation.

3. Making text machine friendly
 - It is important that we standardize these texts into a machine friendly format. We want our models to treat similar words as the same. Consider the words Dogs and dog. Strictly speaking, they are different strings. However, they denote the same thing. Similarly, reduction, reducing and reduce should also be standardized to the same string regardless of their form and case usage. Other examples include don't and do not, and won't and will not. In the next couple of lessons, we will learn techniques to achieve this.

4. Text preprocessing techniques
 - The text processing techniques you use are dependent on the application you're working on. 
 - Common text preprocessing
     - converting words into lowercase 
     - removing unnecessary whitespace, 
     - removing punctuation, 
     - removing commonly occurring words or stopwords, 
     - expanding contracted words like don't 
     - removing special characters such as numbers and emojis.

5. Tokenization
 - To do this, we must first understand tokenization. Tokenization is the process of splitting a string into its constituent tokens. These tokens may be sentences, words or punctuation marks, and the rules for producing them are specific to a particular language. 
 - In this course, we will primarily be focused on word and punctuation tokens. For instance, consider this sentence. Tokenizing it into its constituent words and punctuation marks will yield the following list of tokens. Tokenization also involves splitting contracted words. Therefore, a word like don't gets decomposed into two tokens, do and n't, as can be seen in this example.
     - ie. ["Do", "n't", "do", "this", "."]
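
To illustrate the splitting behavior described above without any external library, here is a rough regular-expression tokenizer. It is only an approximation written for this note, not spaCy's algorithm, and the names `TOKEN_PATTERN` and `tokenize` are assumptions.

```python
import re

# Alternatives are tried left to right at each position:
#   n't          the contracted "not"
#   '\w+         other clitics such as 'm, 've, 're
#   \w+(?=n't)   the word stem directly before n't (e.g. "Do" in "Don't")
#   \w+          any other word
#   [^\w\s]      a single punctuation character
TOKEN_PATTERN = re.compile(r"n't|'\w+|\w+(?=n't)|\w+|[^\w\s]")


def tokenize(text):
    """Split text into word, clitic and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Don't do this.")
print(tokens)
print(tokenize("I'm here!"))
```

A pattern this simple mishandles quoted words and many other cases, which is why a trained tokenizer is used in the rest of this section.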

6. Tokenization using spaCy
 - To perform tokenization in python, we will use the spacy library. We first import the spacy library. Next, we load a pre-trained English model 'en_core_web_sm' using spacy.load(). This will return a Language object that has the know-how to perform tokenization. This is stored in the variable nlp. Let's now define a string we want to tokenize. We pass this string into nlp to generate a spaCy Doc object. We store this in a variable named doc. This Doc object contains the required tokens (and many other things, as we will soon find out). We generate the list of tokens by using list comprehension as shown. This is essentially looping over doc and extracting the text of each token in each iteration. The result is as follows.

7. Lemmatization
 - Lemmatization is the process of converting a word into its lowercased base form or lemma. This is an extremely powerful process of standardization. For instance, the words reducing, reduces, reduced and reduction, when lemmatized, are all converted into the base form reduce. Similarly, forms of the verb be such as am, are and is are converted into be. Lemmatization also allows us to convert contracted tokens into their full forms. Therefore, n't is converted to not and 've is converted to have.
 - examples
     - reducing, reduces, reduced, reduction > reduce
     - am, are, is > be
     - n't > not
     - 've > have
     - all pronouns > '-PRON-'
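
The mappings above can be mimicked with a toy lookup table. This is only a sketch built from the examples listed in this section; the names `LEMMA_TABLE` and `lemmatize` are assumptions, and real lemmatizers such as spaCy's rely on part-of-speech information and much larger rule sets.

```python
# Toy lookup table containing only the examples from the list above
LEMMA_TABLE = {
    "reducing": "reduce", "reduces": "reduce",
    "reduced": "reduce", "reduction": "reduce",
    "am": "be", "are": "be", "is": "be",
    "n't": "not", "'ve": "have",
}


def lemmatize(tokens, table=LEMMA_TABLE):
    """Lowercase each token and replace it with its lemma when one is known."""
    return [table.get(token.lower(), token.lower()) for token in tokens]


toy_lemmas = lemmatize(["They", "are", "reducing", "costs"])
print(toy_lemmas)
```

Unknown words such as "costs" pass through lowercased but otherwise unchanged, which shows why a lookup table alone is not enough.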

8. Lemmatization using spaCy
 - When you pass the string into nlp, spaCy automatically performs lemmatization by default. Therefore, generating lemmas is identical to generating tokens except that we extract token.lemma_ in each iteration inside the list comprehension instead of token.text. Also, observe how spaCy converted each I into -PRON-. This is standard behavior in spaCy 2.x, where every pronoun is converted into the string '-PRON-'. In spaCy 3 and later, -PRON- is no longer used and a pronoun's own lemma is returned instead.

In [None]:
# example - tokenization using spaCy
import spacy
# load the en_core_web_sm model - pretrained English model
nlp = spacy.load('en_core_web_sm')
# initialize string
string = "Hello! I don't know what I'm doing here."
# create a Doc object
doc = nlp(string)
# generate list of tokens
tokens = [token.text for token in doc]
print(tokens)

In [None]:
# example - lemmatization = convert word into its base form, lemma, in lowercase
import spacy
# load the en_core_web_sm model - pretrained English model
nlp = spacy.load('en_core_web_sm')
# initialize string
string = "Hello! I don't know what I'm doing here."
# create a Doc object
doc = nlp(string)
# generate list of lemmas
lemmas = [token.lemma_ for token in doc]
print(lemmas)

# Convert lemmas into a string
print(' '.join(lemmas))

## Text cleaning

1. Text cleaning
 - Now that we know how to convert a string into a list of lemmas, we are in a good position to perform basic text cleaning.

2. Text cleaning techniques
 - Some of the most common text cleaning steps include removing
     - extra whitespaces
     - escape sequences
     - punctuations
     - special characters such as numbers
     - stopwords
 - In other words, it is very common to remove non-alphabetic tokens and words that occur so commonly that they are not very useful for analysis.

3. isalpha()
 - Every Python string has an isalpha() method that returns True if all the characters of the string are alphabetic.
 - "Dog".isalpha() will return True 
 - but "3dogs".isalpha() will return False as it contains the non-alphabetic character 3. 
 - Similarly, numbers, punctuation and emojis will all return False too. 
 - This is an extremely convenient way to remove all (lemmatized) tokens that are or contain numbers, punctuation and emojis.

4. A word of caution
 - If isalpha() as a silver bullet that cleans text meticulously seems too good to be true, it's because it is. Remember that isalpha() has a tendency of returning false on words we would not want to remove. 
 - Examples include abbreviations such as U.S.A. and U.K. which have periods in them, and proper nouns with numbers in them such as word2vec and xto10x. 
 - For such nuanced cases, isalpha() may not be sufficient. It may be advisable to write your own custom functions, typically using regular expressions, to ensure you're not inadvertently removing useful words.
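
As one possible shape for such a custom function, the sketch below keeps plain words, dotted abbreviations and names with embedded digits. The pattern and the names `KEEP_PATTERN` and `is_useful_token` are assumptions made for illustration; the right rules depend on your data.

```python
import re

KEEP_PATTERN = re.compile(
    r"[A-Za-z]+"                  # plain alphabetic words
    r"|(?:[A-Za-z]\.){2,}"        # dotted abbreviations such as U.S.A.
    r"|[A-Za-z]+\d+[A-Za-z0-9]*"  # names with embedded digits such as word2vec
)


def is_useful_token(token):
    """Return True if the whole token matches one of the patterns to keep."""
    return KEEP_PATTERN.fullmatch(token) is not None


sample_tokens = ["Dog", "3dogs", "U.S.A.", "word2vec", "xto10x", "!!!"]
alpha_flags = [token.isalpha() for token in sample_tokens]
custom_flags = [is_useful_token(token) for token in sample_tokens]
print(alpha_flags)
print(custom_flags)
```

Only "Dog" survives isalpha(), while the custom check also keeps the abbreviation and the two product-style names.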

5. Removing non-alphabetic characters
 - Consider the string here. This has a lot of punctuations, unnecessary extra whitespace, escape sequences, numbers and emojis. We will generate the lemmatized tokens like before.

6. Removing non-alphabetic characters
 - Next, we loop through the tokens again and choose only those words that are either -PRON- or contain only alphabetic characters. Let's now print out the sanitized string. We see that all the non-alphabetic characters have been removed and each word is separated by a single space.

7. Stopwords
 - There are some words in the English language that occur so commonly that it is often a good idea to just ignore them. Examples include articles such as a and the, be verbs such as is and am and pronouns such as he and she.

8. Removing stopwords using spaCy
 - spaCy has a built-in list of stopwords which we can access using spacy.lang.en.stop_words.STOP_WORDS.

9. Removing stopwords using spaCy
 - We make a small tweak to the a_lemmas generation step. Notice that we have removed the -PRON- condition, as pronouns are stopwords anyway and should be removed. Additionally, we have introduced a new condition to check that the word does not belong to spaCy's list of stopwords. The output is as follows. Notice how the string consists only of base form words. Always exercise caution while using third party stopword lists. It is common for an application to find certain words useful that a third party list treats as stopwords. It is often advisable to create your own custom stopword list.
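
The advice about custom stopword lists can be sketched as follows. The contents of `CUSTOM_STOPWORDS`, the function name `remove_stopwords` and the sample lemmas are hypothetical; a real list should be built from the needs of your application.

```python
# Hypothetical custom stopword list; tailor it to your application
CUSTOM_STOPWORDS = {"a", "an", "the", "is", "am", "be", "he", "she", "i", "this"}


def remove_stopwords(lemmas, stopwords=CUSTOM_STOPWORDS):
    """Keep lemmas that are alphabetic and not in the stopword list."""
    return [lemma for lemma in lemmas
            if lemma.isalpha() and lemma not in stopwords]


# Hypothetical lemmas, as they might come out of the lemmatization step
sample_lemmas = ["this", "be", "the", "best", "song", "ever", "!", "top", "5"]
cleaned = remove_stopwords(sample_lemmas)
print(" ".join(cleaned))
```

Passing the stopword list as a parameter makes it easy to swap between a third party list and your own.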

10. Other text preprocessing techniques
 - There are other preprocessing techniques that are used but have been omitted for the sake of brevity. 
 - Some of them include 
     - removing HTML or XML tags
     - replacing accented characters
     - correcting spelling errors and shorthands
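
Two of these steps can be sketched with the standard library alone. The function names `strip_tags` and `strip_accents` are assumptions, and the regular expression is a rough approach: for real HTML or XML, a proper parser is the safer choice.

```python
import re
import unicodedata


def strip_tags(text):
    """Remove anything that looks like an HTML or XML tag."""
    return re.sub(r"<[^>]+>", "", text)


def strip_accents(text):
    """Replace accented characters with their unaccented base characters."""
    # NFKD splits an accented character into base character + combining mark
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep everything else
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_tags("<p>Hello <b>world</b></p>"))
print(strip_accents("café naïve"))
```

Spelling correction is not shown because it typically needs a dictionary or a trained model.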

11. A word of caution
 - We have covered a lot of text preprocessing techniques in the last couple of lessons. However, a word of caution is in order. The text preprocessing techniques you use always depend on the application. There are many applications which may find punctuation, numbers and emojis useful, so it may be wise not to remove them. In other cases, the use of all caps may be a good indicator of something. Remember to use only those techniques that are relevant to your particular use case.

### Removing non-alphabetic characters

In [None]:
# remove tokens that are not alphabetic
a_lemmas = [lemma for lemma in lemmas if lemma.isalpha() or lemma == '-PRON-']

# print string after text cleaning
print(' '.join(a_lemmas))

### Remove stopwords using spaCy
- words that are common but not useful
- ie. articles, be verbs, pronouns, etc.
- NOTE - advised to create custom stopword list

In [None]:
# get list of stopwords
stopwords = spacy.lang.en.stop_words.STOP_WORDS
string = """
OMG!!!! This is like    the best thing ever \t\n.
Wow,  such an amazing song! I'm hooked. Top 5 definitely. ?
"""

# generate lemmas for the string (nlp is the model loaded earlier)
lemmas = [token.lemma_ for token in nlp(string)]

# remove stopwords and non-alphabetic tokens
a_lemmas = [lemma for lemma in lemmas if lemma.isalpha()
            and lemma not in stopwords]
# print string after text cleaning
print(' '.join(a_lemmas))

### Example - fxn to preprocess text

In [None]:
# Function to preprocess text
def preprocess(text):
    # Create Doc object
    doc = nlp(text, disable=['ner', 'parser'])
    # Generate lemmas
    lemmas = [token.lemma_ for token in doc]
    # Remove stopwords and non-alphabetic characters
    a_lemmas = [lemma for lemma in lemmas 
            if lemma.isalpha() and lemma not in stopwords]
    
    return ' '.join(a_lemmas)
  
# Apply preprocess to ted['transcript']
ted['transcript'] = ted['transcript'].apply(preprocess)
print(ted['transcript'])

## POS tagging - parts of speech
- caution - not an exact science b/c depends on pretrained models
- spaCy uses 20 POS annotations
- https://spacy.io/api/annotation

1. Part-of-speech tagging
 - In this lesson, we will cover part-of-speech tagging, which is one of the most popularly used feature engineering techniques in NLP.

2. Applications
 - Part-of speech tagging or POS tagging has an immense number of applications in NLP. 
 - used in word-sense disambiguation to identify the sense of a word in a sentence. 
     - For instance, consider the sentences 
         - "the bear is a majestic animal" and 
         - "please bear with me". 
     - Both sentences use the word 'bear' but they mean different things. 
     - POS tagging helps in identifying this distinction by identifying one bear as a noun and the other as a verb. 
 - POS tagging is also used in 
     - sentiment analysis
     - question answering systems 
     - linguistic approaches to detect fake news and opinion spam. 
 - For example, one paper discovered that fake news headlines, on average, tend to use fewer common nouns and more proper nouns than mainstream headlines. Generating the POS tags for these words proved extremely useful in detecting false or hyperpartisan news.

3. POS tagging
 - So what is POS tagging? It is the process of assigning every word (or token) in a piece of text its corresponding part-of-speech. For instance, consider the sentence "Jane is an amazing guitarist". A typical POS tagger will label Jane as a proper noun, is as a verb, an as a determiner (or an article), amazing as an adjective and finally, guitarist as a noun.

4. POS tagging using spaCy
 - POS Tagging is extremely easy to do using spaCy's models and performing it is almost identical to generating tokens or lemmas. As usual, we import the spacy library and load the en_core_web_sm model as nlp. We will use the same sentence "Jane is an amazing guitarist" from before. We will then create a Doc object that will perform POS tagging, by default.

5. POS tagging using spaCy
 - Using list comprehension, we generate a list of tuples pos where the first element of the tuple is the token and is generated using token.text and the second element is its POS tag, which is generated using token.pos_. Printing pos will give us the following output. Note how the tagger correctly identified all the parts-of-speech as we had discussed earlier. That said, remember that POS tagging is not an exact science. spaCy infers the POS tags of these words based on the predictions given by its pre-trained models. In other words, the accuracy of the POS tagging is dependent on the data that the model has been trained on and the data that it is being used on.

6. POS annotations in spaCy
 - spaCy is capable of identifying close to 20 parts-of-speech and, as we saw in the previous slide, it uses specific annotations to denote a particular part of speech. For instance, PROPN refers to a proper noun and DET refers to a determiner. You can find the complete list of POS annotations used by spaCy in spaCy's documentation. Here is a snapshot of the web page.

### POS tagging

In [None]:
import spacy

# load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# initialize string
string = "Jane is an amazing guitarist"

# create a Doc object
doc = nlp(string)

# Generate list of tokens and pos tags using tuples
# .text and .pos_ are attributes
pos = [(token.text, token.pos_) for token in doc]
print(pos)

'''
[('Jane', 'PROPN'),
 ('is', 'VERB'),
 ('an', 'DET'),
 ('amazing', 'ADJ'),
 ('guitarist', 'NOUN')]
'''
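
Once the (token, tag) tuples exist, turning them into numerical features is a matter of counting. The sketch below reuses the output shown above as a hard-coded list, so it runs without spaCy; the variable names `tagged` and `tag_counts` are assumptions.

```python
from collections import Counter

# The tagger output from the cell above, copied in as plain data
tagged = [("Jane", "PROPN"), ("is", "VERB"), ("an", "DET"),
          ("amazing", "ADJ"), ("guitarist", "NOUN")]

# Count how often each POS tag occurs
tag_counts = Counter(tag for _, tag in tagged)

# A Counter returns 0 for tags that never occur
print(tag_counts["PROPN"], tag_counts["NOUN"], tag_counts["PRON"])
```

Counts such as the number of proper nouns can then be stored as dataframe columns, as the following cells do.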

### Counting nouns and proper nouns in a piece of text

In [None]:
nlp = spacy.load('en_core_web_sm')

# Returns number of proper nouns
def proper_nouns(text, model=nlp):
    # Create doc object
    doc = model(text)
    # Generate list of POS tags
    pos = [token.pos_ for token in doc]

    # Return number of proper nouns
    return pos.count('PROPN')

print(proper_nouns("Abdul, Bill and Cathy went to the market to buy apples.", nlp))

In [None]:
nlp = spacy.load('en_core_web_sm')

# Returns number of other nouns
def nouns(text, model=nlp):
    # Create doc object
    doc = model(text)
    # Generate list of POS tags
    pos = [token.pos_ for token in doc]

    # Return number of other nouns
    return pos.count('NOUN')

print(nouns("Abdul, Bill and Cathy went to the market to buy apples.", nlp))

### Proper noun and Noun usage in fake news

In [None]:
# reuse functions proper_nouns and nouns above

# proper noun usage
headlines['num_propn'] = headlines['title'].apply(proper_nouns)

# Compute mean of proper nouns
real_propn = headlines[headlines['label'] == 'REAL']['num_propn'].mean()
fake_propn = headlines[headlines['label'] == 'FAKE']['num_propn'].mean()

# Print results
print("Mean no. of proper nouns in real and fake headlines are %.2f and %.2f respectively" %
      (real_propn, fake_propn))

# Mean no. of proper nouns in real and fake headlines are 2.46 and 4.86 respectively

In [None]:
# reuse functions proper_nouns and nouns above

# noun usage
headlines['num_noun'] = headlines['title'].apply(nouns)

# Compute mean of other nouns
real_noun = headlines[headlines['label'] == 'REAL']['num_noun'].mean()
fake_noun = headlines[headlines['label'] == 'FAKE']['num_noun'].mean()

# Print results
print("Mean no. of other nouns in real and fake headlines are %.2f and %.2f respectively"%(real_noun, fake_noun))

# Mean no. of other nouns in real and fake headlines are 2.30 and 1.44 respectively

## NER = Named entity recognition

1. Named entity recognition
 - The final technique we will learn as part of this chapter is named entity recognition.

2. Applications
 - Named entity recognition, or NER, has a host of extremely useful applications.
 - It is used to build
     - efficient search algorithms
     - question answering systems
         - For instance, let us say you have a piece of text and you ask your system about the people that are being talked about in the text. NER would help the system answer this question by identifying all the entities in the text that refer to a person.
     - news article classifiers, which news providers use to categorize their articles
     - complaint-handling systems, which customer service centers use to classify and record their complaints efficiently.

3. Named entity recognition
 - A named entity is anything that can be denoted with a proper name or a proper noun. 
 - Named entity recognition or NER, therefore, is the process of
     - identifying such named entities in a piece of text and classifying them into predefined categories 
     - categories such as person, organization, country, etc. 
     - For example, consider the text "John Doe is a software engineer working at Google. He lives in France." Performing NER on this text will tell us that there are three named entities: John Doe, who is a person, Google, which is an organization and France, which is a country (or geopolitical entity)

4. NER using spaCy
 - Like POS tagging, performing NER is extremely easy using spaCy's pre-trained models. Let's try to find the named entities in the same sentence we used earlier. As usual, we import the spacy library, load the required model and create a Doc object for the string. When we do this, spaCy automatically computes all the named entities and makes them available as the ents attribute of doc. Therefore, to access each named entity and its category, we use a list comprehension to loop over doc.ents and create a tuple containing the entity name, which is accessed using ent.text, and the entity category, which is accessed using ent.label_. Printing this list out will give the following output. We see that spaCy has correctly identified and classified all the named entities in this string.

5. NER annotations in spaCy
 - Currently, spaCy's models are capable of identifying more than 15 different types of named entities. The complete list of categories and their annotations can be found in spaCy's documentation. Here is a snapshot of the page.
 - https://spacy.io/api/annotation#named-entities
 
6. A word of caution
 - In this chapter, we have used spaCy's models to accomplish several tasks.
 - However, remember that spaCy's models are not perfect and their performance depends on the data they were trained with and the data they are being used on.
 - For instance, if we are trying to extract named entities from texts in a heavily technical field, such as medicine, spaCy's pretrained models may not do a great job. In such nuanced cases, it is better to train your models with your specialized data. Also, remember that spaCy's models are language specific. This is understandable considering that each language has its own grammar and nuances. The en_core_web_sm model that we've been using is, as the name suggests, only suitable for English texts.

### NER using spaCy
- 15+ categories of named entities in spaCy

In [None]:
import spacy
string="John Doe is a software engineer working at Google. He lives in France."

# load model and create Doc object
nlp = spacy.load('en_core_web_sm')
doc = nlp(string)

# generate named entities, ent = entity
ne = [(ent.text, ent.label_) for ent in doc.ents]
print(ne)

In [None]:
# Load the required model
nlp = spacy.load('en_core_web_sm')

# Create a Doc instance 
text = 'Sundar Pichai is the CEO of Google. Its headquarters is in Mountain View.'
doc = nlp(text)

# Print all named entities and their labels
for ent in doc.ents:
    print(ent.text, ent.label_)
    
'''
Sundar Pichai ORG - note mislabeled
Google ORG
Mountain View GPE
'''

In [None]:
# Example - Identifying people mentioned in a news article

def find_persons(text):
  # Create Doc object
  doc = nlp(text)
  
  # Identify the persons
  persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
  
  # Return persons
  return persons

# print persons in article tc
print(find_persons(tc))

# ['Sheryl Sandberg', 'Mark Zuckerberg']

# Vectorization - Building a bag of words (BoW) model
- Each value in the vector corresponds to the frequency of the corresponding word in the vocabulary.

1. Building a bag of words model
 - In this chapter, we will cover vectorization which is, as you may recall, the process of converting text into vectors.

2. Recap of data format for ML algorithms
 - Recall that for any ML algorithm to run properly, data fed into it must be in tabular form and all the training features must be numerical. This is clearly not the case for textual data. In this lesson, we will learn a technique called bag of words that converts text documents into vectors.

3. Bag of words model
 - Extract word tokens from a text document (henceforth, we will refer to this as just a document)
 - Compute the frequency of the word tokens
 - Construct a word vector out of these frequencies and the vocabulary of the entire corpus of documents.

4. Bag of words model example
 - Consider a corpus of three documents: 'The lion is the king of the jungle', 'Lions have lifespans of a decade', and 'The lion is an endangered species'.

5. Bag of words model example
 - We now extract the unique word tokens that occur in this corpus of documents. This will be the vocabulary of our model. In this example, the following 15 word tokens will constitute our vocabulary. Since there are 15 words in our vocabulary, our word vectors will have 15 dimensions and each dimension's value will correspond to the frequency of the word token corresponding to that dimension. For instance, the second dimension will correspond to the number of times the second word in the vocabulary, an, occurs in the document. Let's now convert our documents into word vectors using this bag of words model. The lion is the king of the jungle is converted to the following vector. Similarly, the other two sentences have the following word vector representations.
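
The construction described above can be sketched in plain Python. This is only an illustration, assuming whitespace tokenization with no preprocessing and the three lion sentences as they appear in the code cell of this lesson. The helper name `to_vector` and the case-insensitive sort used to order the vocabulary are illustrative choices, not part of any library.

```python
from collections import Counter

corpus = [
    'The lion is the king of the jungle',
    'Lions have lifespans of a decade',
    'The lion is an endangered species',
]

# Tokenize on whitespace only; no lowercasing, so 'The' and 'the' stay distinct
tokenized = [doc.split() for doc in corpus]

# Vocabulary = unique tokens across the whole corpus, sorted case-insensitively
vocabulary = sorted({tok for doc in tokenized for tok in doc},
                    key=lambda w: (w.lower(), w))

def to_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one document (hypothetical helper)."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

bow_vectors = [to_vector(doc, vocabulary) for doc in tokenized]

print(len(vocabulary))   # 15
print(vocabulary)
for vec in bow_vectors:
    print(vec)
```

With this ordering, the second dimension corresponds to 'an', and the first sentence has a 2 in the 'the' dimension and a 1 in the 'The' dimension.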

6. Text preprocessing
 - As we were constructing this model, you may have noticed how text preprocessing would have been extremely useful in creating arguably better models. We would usually want 'Lions' and 'lion' to mean the same thing and therefore be counted as the same token. The same applies to 'The' and 'the'.
 - No punctuation
 - No stopwords
 - Leads to smaller vocabularies, which is a good thing.
 - Reducing the number of dimensions helps improve performance
     - While working with vectorization, it is routine to form word vectors running into thousands of dimensions and keeping this to a minimum helps improve performance.

7. Bag of words model using sklearn
 - To construct the bag of words model in Python, we will use the scikit-learn library. We will use the corpus from before, consisting of the three sentences on lions. Let's ignore text preprocessing for now.

8. Bag of words model using sklearn
 - We import the CountVectorizer class from sklearn.feature_extraction.text. This is the class that will help us build our bag of words model. Next, we instantiate a CountVectorizer object, vectorizer. Finally, we create our matrix of word vectors by passing corpus to the fit_transform method of vectorizer and store it in bow_matrix. bow_matrix is a sparse matrix, and we can print its 2D array form using bow_matrix.toarray(). Notice how the output differs from the word vectors we generated by hand.
 - By default, CountVectorizer
     - lowercases words
     - ignores single-character tokens such as 'a'
     - orders the columns by its own vocabulary (the sorted terms that remain after the two steps above), so the columns do not line up with the hand-built vocabulary.
 - We will learn how to map the vocabulary to the indices in the exercises. We can now use this bow_matrix as our training features in ML models.
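
To see why the vectors differ from the hand-built ones, the sketch below imitates the two defaults mentioned above using only the `re` module: lowercasing, and CountVectorizer's default `token_pattern` of `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters. The function `default_like_tokens` is a hypothetical stand-in for illustration, not scikit-learn's implementation.

```python
import re

corpus = [
    'The lion is the king of the jungle',
    'Lions have lifespans of a decade',
    'The lion is an endangered species',
]

# Same regular expression as CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_like_tokens(text):
    """Lowercase the text, then keep tokens with at least two word characters."""
    return TOKEN_PATTERN.findall(text.lower())

# Sorted vocabulary over the whole corpus
vocabulary = sorted({tok for doc in corpus for tok in default_like_tokens(doc)})

print(default_like_tokens(corpus[1]))  # 'a' is dropped, 'Lions' becomes 'lions'
print(len(vocabulary))                 # 13 instead of 15
```

Merging 'The' with 'the' and dropping 'a' is what shrinks the vocabulary from 15 to 13 terms.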

In [None]:
# example
import pandas as pd

corpus = pd.Series([
    'The lion is the king of the jungle',
    'Lions have lifespans of a decade',
    'The lion is an endangered species'
])

# import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# create CountVectorizer object
vectorizer = CountVectorizer()
# generate matrix of word vectors, sparse matrix, 
# use as ML features
bow_matrix = vectorizer.fit_transform(corpus)
print(bow_matrix.toarray())

## BoW model for movie taglines

In [None]:
# corpus of more than 7000 movie tag lines
# skip text preprocessing in this example

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer object
vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_matrix = vectorizer.fit_transform(corpus)

# Print the shape of bow_matrix
print(bow_matrix.shape)

# (7033, 6614)

## Analyzing dimensionality and preprocessing

In [None]:
# lem_corpus which contains the pre-processed versions of the 
# movie taglines from the previous exercise. 
# In other words, the taglines have been lowercased and lemmatized, 
# and stopwords have been removed.

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer object
vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_lem_matrix = vectorizer.fit_transform(lem_corpus)

# Print the shape of bow_lem_matrix
print(bow_lem_matrix.shape)

# 0    roll dice unleash excitement
# 1           yell fight ready love
# 2    friend people let let forget
# 3      world normal surprise life
# 4          los angeles crime saga
# Name: 1, dtype: object

# <script.py> output:
#     (6959, 5223)

Notice how the number of features has dropped from 6614 to 5223 for the pre-processed movie taglines. The reduced number of dimensions on account of text preprocessing usually leads to better performance when conducting machine learning, so it is a good idea to consider it.

However, as mentioned in a previous lesson, the final decision always depends on the nature of the application.

## Mapping feature indices with feature names

In [8]:
'''
CountVectorizer doesn't necessarily index the vocabulary 
in alphabetical order. In this exercise, we will learn to 
map each feature index to its corresponding feature name 
from the vocabulary.
'''
corpus = ['The lion is the king of the jungle', 'Lions have lifespans of a decade', 'The lion is an endangered species']

# Create CountVectorizer object
vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_matrix = vectorizer.fit_transform(corpus)

# Convert bow_matrix into a DataFrame
bow_df = pd.DataFrame(bow_matrix.toarray())

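# Map the column names to vocabulary
# (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
bow_df.columns = vectorizer.get_feature_names_out()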

# Print bow_df
print(bow_df)

'''
   an  decade  endangered  have  is  ...  lion  lions  of  species  the
0   0       0           0     0   1  ...     1      0   1        0    3
1   0       1           0     1   0  ...     0      1   1        0    0
2   1       0           1     0   1  ...     1      0   0        1    1

[3 rows x 13 columns]
'''

## Building a BoW Naive Bayes classifier

1. Building a BoW Naive Bayes classifier
 - In this lesson, we will walk through a machine learning problem that utilizes feature engineering techniques we've learned, to arrive at a desired result.

2. Spam filtering
 - Let's take a look at the spam filtering problem. We're given a dataset of messages that have been labelled as spam or ham. Here, you can see a typical spam and ham message. Our task is to train an ML model that can predict the label given a particular text.

3. Steps
 - There are 3 steps involved. 
    1. text preprocessing.
    2. build the bag-of-words model
    3. ML - predictive modeling using the generated BoW vectors. 
 - Note that although we use the term 'modeling' in the context of both BoW and machine learning, they mean two different things.

4. Text preprocessing using CountVectorizer arguments
 - lowercase: False, True
 - strip_accents: 'unicode', 'ascii', None
 - stop_words: 'english', list, None
 - token_pattern: regex
 - tokenizer: function
     - can use spaCy tokenization here
 - spaCy for lemmatization
 
 - We've already learned how to conduct text preprocessing using spaCy. However, it is also possible to do some of this using CountVectorizer, which takes a number of arguments for preprocessing. The lowercase argument, when set to True, converts words to lowercase. The strip_accents argument converts accented characters according to a unicode or ASCII mapping. Passing in a stop_words argument makes CountVectorizer ignore stopwords; you can pass a custom list or the string 'english' to use scikit-learn's list of English stopwords. You can specify tokenization using a regular expression as the value of the token_pattern argument, or with the tokenizer argument, which takes a function that accepts a string and returns a list of tokens. This way, CountVectorizer allows the use of spaCy's tokenization techniques. CountVectorizer cannot perform certain steps, such as lemmatization, automatically; this is where spaCy is useful. Although it performs tokenization and preprocessing, CountVectorizer's main job is to convert a corpus into a matrix of numerical vectors.
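
As a small illustration of these arguments, the sketch below fits two vectorizers on a made-up pair of messages (the messages are assumptions for demonstration only) and compares the vocabularies they learn. It relies only on the `vocabulary_` attribute, which is a dictionary mapping each term to its column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up messages, used purely for illustration
messages = ['WIN a FREE prize', 'free café tonight']

# Defaults: lowercase=True, no accent stripping, no stopword removal
default_vectorizer = CountVectorizer()
default_vectorizer.fit(messages)

# Keep case, map accented characters to ASCII, drop English stopwords
spam_vectorizer = CountVectorizer(lowercase=False,
                                  strip_accents='ascii',
                                  stop_words='english')
spam_vectorizer.fit(messages)

print(sorted(default_vectorizer.vocabulary_))
print(sorted(spam_vectorizer.vocabulary_))
```

With `lowercase=False`, 'FREE' and 'free' become separate features, which is the kind of signal the spam example in this lesson tries to preserve.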

5. Building the BoW model
 - As usual, we import CountVectorizer from scikit-learn. We then instantiate a CountVectorizer object called vectorizer. We perform accent stripping using ASCII mapping and remove English stopwords. We also set the lowercase argument to False. This is because spam messages tend to abuse all-capital words and we might want to preserve this information for the ML step. The dataset has already been loaded into the dataframe df. We split this dataset into training and test sets using scikit-learn's train_test_split function.

6. Building the BoW model
 - We now fit the vectorizer on the training set and transform it into its bag-of-words representation. We can perform both these steps together using the fit_transform method. Next, we transform the test set into its BoW representation. Note that we do not fit the vectorizer with the test data. It is possible that there are some words in the test data that are not in the vocabulary of the vectorizer. In such cases, CountVectorizer simply ignores these words.
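
The fit-then-transform split can be sketched without scikit-learn to make the handling of unseen words explicit. The functions `fit_vocabulary` and `transform` below are hypothetical stand-ins written for this illustration, and the example messages are made up; they are not CountVectorizer's actual implementation.

```python
from collections import Counter

def fit_vocabulary(documents):
    """Learn a word -> column index mapping from the training documents only."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}

def transform(documents, vocabulary):
    """Turn documents into count vectors; words outside the vocabulary are skipped."""
    vectors = []
    for doc in documents:
        counts = Counter(word for word in doc.lower().split() if word in vocabulary)
        vectors.append([counts[word] for word in vocabulary])
    return vectors

train_docs = ['win a free prize now', 'see you at lunch']
test_docs = ['free lunch voucher']

vocabulary = fit_vocabulary(train_docs)          # fitted on training data only
train_bow = transform(train_docs, vocabulary)
test_bow = transform(test_docs, vocabulary)      # 'voucher' was never seen

print(list(vocabulary))
print(test_bow[0])
```

Because the vocabulary comes from the training set alone, training and test vectors have the same number of columns, and 'voucher' contributes nothing to the test vector.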

7. Training the Naive Bayes classifier
 - We're now in a good position to train an ML model. We will use the Multinomial Naive Bayes classifier for this task. We import the MultinomialNB class from scikit-learn and create an object named clf. We then fit clf on the training BoW vectors and their corresponding labels. We can now test the performance of our model. We compute the accuracy of the model on the test set using clf.score. In this case, our model registered an accuracy of 76% on the test set.

### Building the BoW model - multinomial Naive Bayes classifier

In [None]:
# import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# import train_test_split
from sklearn.model_selection import train_test_split
# import MultinomialNB
from sklearn.naive_bayes import MultinomialNB

# create CountVectorizer object
# SPAM abuses uppercase, so may be useful
vectorizer = CountVectorizer(strip_accents='ascii', 
                             stop_words='english',
                             lowercase=False)

# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['message'],
                                                   df['label'],
                                                   test_size=0.25)

# generate training BoW vectors
X_train_bow = vectorizer.fit_transform(X_train)

# generate test BoW vectors
X_test_bow = vectorizer.transform(X_test)

# create multinomial object
clf = MultinomialNB()

# train clf
clf.fit(X_train_bow, y_train)

# compute accuracy on test set
accuracy = clf.score(X_test_bow, y_test)
print(accuracy)

# 0.76

### BoW vectors for movie reviews

In [None]:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer object
vectorizer = CountVectorizer(lowercase=True, stop_words='english')

# Fit and transform X_train
X_train_bow = vectorizer.fit_transform(X_train)

# Transform X_test
X_test_bow = vectorizer.transform(X_test)

# Print shape of X_train_bow and X_test_bow
print(X_train_bow.shape)
print(X_test_bow.shape)

# (250, 8158)
# (750, 8158)

### Predicting the sentiment of a movie review
- We use these BoW vectors to train a Naive Bayes classifier that can detect the sentiment of a movie review, and then compute its accuracy.
- Note that since this is a binary classification problem, the model is only capable of classifying a review as either positive (1) or negative (0). It is incapable of detecting neutral reviews.

In [None]:
# Create a MultinomialNB object
clf = MultinomialNB()

# Fit the classifier
clf.fit(X_train_bow, y_train)

# Measure the accuracy
accuracy = clf.score(X_test_bow, y_test)
print("The accuracy of the classifier on the test set is %.3f" % accuracy)

# Predict the sentiment of a negative review
review = "The movie was terrible. The music was underwhelming and the acting mediocre."
prediction = clf.predict(vectorizer.transform([review]))[0]
print("The sentiment predicted by the classifier is %i" % (prediction))

'''
<script.py> output:
    The accuracy of the classifier on the test set is 0.732
    The sentiment predicted by the classifier is 0
'''

## Building n-gram models

1. Building n-gram models
 - We already know how to build bag-of-words representations of our documents and use it to conduct various machine learning tasks.

2. BoW shortcomings - context is lost!
 - Consider the following mini reviews. One is a positive review, which states that the movie was good and not boring. The other is negative, commenting that the movie was not good and boring. If we were to construct BoW vectors for these reviews, we would get identical vectors since both reviews contain exactly the same words. And herein lies the biggest shortcoming of the bag-of-words model: the context of the words is lost. In this example, the position of the word 'not' changes the entire sentiment of the review. Therefore, in this lesson, we will study techniques that will allow us to model this.
 - BoW: n=1

3. n-grams
 - An n-gram is a contiguous sequence of n elements (or words) in a given document. The bag-of-words model that we've explored so far is nothing but an n-gram model where n is equal to one. Let's now explore n-grams when n is greater than one. Consider the sentence 'for you a thousand times over'. If we set n to 2, then the n-grams (called bigrams in this case) would be for you, you a, a thousand, thousand times and times over.

4. n-grams
 - Similarly, for n equal to 3, the n-grams (or trigrams) will be for you a, you a thousand, a thousand times, thousand times over. Therefore, we can use these n-grams to capture more context and account for cases like 'not'.
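
A short plain-Python sketch of n-gram extraction, assuming simple whitespace tokenization. The `ngrams` helper is illustrative, and the exact wording of the two mini reviews is an assumption based on their description in this lesson.

```python
from collections import Counter

def ngrams(text, n):
    """Return all contiguous sequences of n whitespace-separated words (hypothetical helper)."""
    tokens = text.lower().split()
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = 'for you a thousand times over'
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)
print(bigrams)
print(trigrams)

# Assumed wording of the two mini reviews
positive = 'the movie was good and not boring'
negative = 'the movie was not good and boring'

# Unigram (bag-of-words) counts are identical, bigram counts are not
same_unigrams = Counter(ngrams(positive, 1)) == Counter(ngrams(negative, 1))
same_bigrams = Counter(ngrams(positive, 2)) == Counter(ngrams(negative, 2))
print(same_unigrams, same_bigrams)
```

The bigrams 'not boring' and 'not good' are what distinguish the two reviews once n is greater than one.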

5. n-grams Applications
 - capturing more context
 - used in sentence completion
 - spelling correction 
 - machine translation correction
 - In all these cases, the model computes the probability of n words occurring contiguously to perform the above processes.

6. Building n-gram models using scikit-learn
 - Building these n-gram models using scikit-learn is extremely simple, now that we know how to use CountVectorizer. CountVectorizer takes an argument, ngram_range, which is a tuple containing the lower and upper bound for the range of n-values.
 - Bigrams only - passing 2,2 as the ngram_range
     - bigrams = CountVectorizer(ngram_range=(2,2))
 - Generate unigrams, bigrams and trigrams - passing in 1,3 will generate n-grams where n is equal to 1, 2 and 3.
     - ngrams = CountVectorizer(ngram_range=(1,3))

7. Shortcomings
 - While on the surface it may seem tempting to generate n-grams of high orders to capture more and more context, it comes with caveats. We've already seen that the BoW vectors run into thousands of dimensions.
 - Adding higher order n-grams increases the number of dimensions even more and while performing machine learning, leads to a problem known as the curse of dimensionality. 
 - Additionally, n-grams for n greater than 3 become exceedingly rare to find in multiple documents. So that feature becomes effectively useless. For these reasons, it is often a good idea to restrict yourself to n-grams where n is small.

### n-gram models for movie tag lines
We have a corpus of more than 7000 movie tag lines. 
Our job is to generate n-gram models up to n equal to 1, 
up to n equal to 2 and up to n equal to 3 for this data and 
find the number of features for each model.

We will then compare the number of features generated for each model.

In [None]:
# Generate n-grams up to n=1
vectorizer_ng1 = CountVectorizer(ngram_range=(1, 1))
ng1 = vectorizer_ng1.fit_transform(corpus)

# Generate n-grams up to n=2
vectorizer_ng2 = CountVectorizer(ngram_range=(1, 2))
ng2 = vectorizer_ng2.fit_transform(corpus)

# Generate n-grams up to n=3
vectorizer_ng3 = CountVectorizer(ngram_range=(1, 3))
ng3 = vectorizer_ng3.fit_transform(corpus)

# Print the number of features for each model
print("ng1, ng2 and ng3 have %i, %i and %i features respectively" %
      (ng1.shape[1], ng2.shape[1], ng3.shape[1]))

# ng1, ng2 and ng3 have 6614, 37100 and 76881 features respectively

### Higher order n-grams for sentiment analysis
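Here we build a classifier that can detect whether the review of a particular movie is positive or negative. This time, however, we use n-grams up to n=2 for the task. The cell below assumes that ng_vectorizer, X_train_ng, X_test_ng and the labels are already defined.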

In [None]:
# Define an instance of MultinomialNB 
clf_ng = MultinomialNB()

# Fit the classifier 
clf_ng.fit(X_train_ng, y_train)

# Measure the accuracy 
accuracy = clf_ng.score(X_test_ng, y_test)
print("The accuracy of the classifier on the test set is %.3f" % accuracy)

# Predict the sentiment of a negative review
review = "The movie was not good. The plot had several holes and the acting lacked panache."
prediction = clf_ng.predict(ng_vectorizer.transform([review]))[0]
print("The sentiment predicted by the classifier is %i" % (prediction))

# The accuracy of the classifier on the test set is 0.758
# The sentiment predicted by the classifier is 0

### Comparing performance of n-gram models
- We will conduct sentiment analysis for the same movie reviews from before using two n-gram models: unigrams and n-grams up to n equal to 3.

- We will compare the performance using three criteria: 
    - accuracy of the model on the test set
    - time taken to execute the program 
    - the number of features created when generating the n-gram representation.

In [None]:
import time

start_time = time.time()
# Splitting the data into training and test sets
train_X, test_X, train_y, test_y = train_test_split(
    df['review'], df['sentiment'], test_size=0.5, 
    random_state=42, stratify=df['sentiment'])

# Generating ngrams
vectorizer = CountVectorizer()
train_X = vectorizer.fit_transform(train_X)
test_X = vectorizer.transform(test_X)

# Fit classifier
clf = MultinomialNB()
clf.fit(train_X, train_y)

# Print accuracy, time and number of dimensions
print("The program took %.3f seconds to complete. The accuracy on the test set is %.2f. The ngram representation had %i features." %
      (time.time() - start_time, clf.score(test_X, test_y), train_X.shape[1]))

# The program took 0.163 seconds to complete. 
# The accuracy on the test set is 0.75. 
# The ngram representation had 12347 features.

In [None]:
start_time = time.time()
# Splitting the data into training and test sets
train_X, test_X, train_y, test_y = train_test_split(
    df['review'], df['sentiment'], test_size=0.5, 
    random_state=42, stratify=df['sentiment'])

# Generating ngrams
vectorizer = CountVectorizer(ngram_range=(1, 3))
train_X = vectorizer.fit_transform(train_X)
test_X = vectorizer.transform(test_X)

# Fit classifier
clf = MultinomialNB()
clf.fit(train_X, train_y)

# Print accuracy, time and number of dimensions
print("The program took %.3f seconds to complete. The accuracy on the test set is %.2f. The ngram representation had %i features." %
      (time.time() - start_time, clf.score(test_X, test_y), 
       train_X.shape[1]))

# The program took 2.683 seconds to complete. 
# The accuracy on the test set is 0.77. 
# The ngram representation had 178240 features.

# 2nd model with 3 grams performs marginally better

# Building tf-idf document vectors
- tf-idf = term frequency-inverse document frequency


1. Building tf-idf document vectors
 - In the last chapter, we learned about n-gram modeling.

2. n-gram modeling
 - In n-gram modeling, the weight of a dimension for the vector representation of a document is dependent on the number of times the word corresponding to the dimension occurs in the document. Let's say we have a document that has the word 'human' occurring 5 times. Then, the dimension of its vector representation corresponding to 'human' would have the value 5.

3. Motivation
 - However, some words occur very commonly across all the documents in the corpus. As a result, the vector representations get more characterized by these dimensions. Consider a corpus of documents on the Universe. Let's say there is a particular document on Jupiter where the word 'jupiter' and 'universe' both occur about 20 times. However, 'jupiter' rarely figures in the other documents whereas 'universe' is just as common. We could argue that although both *jupiter* and *universe* occur 20 times, *jupiter* should be given a larger weight on account of its exclusivity. In other words, the word 'jupiter' characterizes the document more than 'universe'.

4. Applications
 - Weighting words this way has a huge number of applications.
 - automatically detect stopwords for the corpus instead of relying on a generic list. 
 - search algorithms to determine the ranking of pages containing the search query 
 - recommender systems as we will soon find out.
 - In a lot of cases, this kind of weighting also generates better performance during predictive modeling.

5. Term frequency-inverse document frequency
 - The weighting mechanism we've described is known as term frequency-inverse document frequency or tf-idf for short. It is based on the idea that the weight of a term in a document should be proportional to its frequency and an inverse function of the number of documents in which it occurs.

6. Mathematical formula
 - Mathematically, the weight of a term i in document j is the term frequency of i in j multiplied by the log of the ratio of the number of documents in the corpus, N, to the number of documents in which the term i occurs, df_i: $w_{i,j} = tf_{i,j} \cdot \log(N / df_i)$.
 - For example, let's say the word 'library' occurs in a document 5 times. There are 20 documents in the corpus and 'library' occurs in 8 of them. Then, the tf-idf weight of 'library' in the vector representation of this document will be 5 times the log of 20/8, which is approximately 2 (using a base-10 logarithm). In general, the higher the tf-idf weight, the more important the word is in characterizing the document. A high tf-idf weight for a word in a document may imply that the word is relatively exclusive to that particular document or that the word occurs extremely commonly in the document, or both.
 - If a term occurs in every document, its weight is 0, because N / df_i = 1 and log(1) = 0.
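
The 'library' example can be checked with a few lines of Python. This sketch assumes a base-10 logarithm, since that is the base that reproduces the "approximately 2" quoted above; the function name `tfidf_weight` is illustrative. Note that scikit-learn's TfidfVectorizer uses a smoothed and normalized variant by default, so its numbers will not match this plain formula.

```python
import math

def tfidf_weight(term_freq, n_docs, doc_freq):
    """Plain tf-idf: term frequency times log10(N / df). Base 10 is an assumption."""
    return term_freq * math.log10(n_docs / doc_freq)

# 'library' occurs 5 times in the document and in 8 of the 20 documents
library_weight = tfidf_weight(term_freq=5, n_docs=20, doc_freq=8)
print(round(library_weight, 2))  # 1.99

# A term that occurs in every document gets weight 0, since log10(1) = 0
everywhere_weight = tfidf_weight(term_freq=5, n_docs=20, doc_freq=20)
print(everywhere_weight)  # 0.0
```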

10. tf-idf using scikit-learn
 - Generating vectors that use tf-idf weighting is almost identical to what we've already done so far. Instead of using CountVectorizer, we use the TfidfVectorizer class of scikit-learn. Its parameters and methods are almost identical to those of CountVectorizer. The only difference is that TfidfVectorizer assigns weights using a variant of the tf-idf formula from before (by default it smooths the idf term and normalizes each document vector) and has extra parameters related to inverse document frequency, which we will not cover in this course. Here, we can see how using TfidfVectorizer is almost identical to using CountVectorizer for a corpus. However, notice that the weights are non-integer and reflect values calculated by the tf-idf weighting.

## tf-idf using scikit-learn

In [None]:
# import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# create TfidfVectorizer object
vectorizer = TfidfVectorizer()
# generate matrix of word vectors
tfidf_matrix = vectorizer.fit_transform(corpus)
print(tfidf_matrix.toarray())

In [None]:
# tf-idf vectors for TED talks

# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Generate matrix of word vectors
tfidf_matrix = vectorizer.fit_transform(ted)

# Print the shape of tfidf_matrix
print(tfidf_matrix.shape)

# (500, 29158)

## Cosine similarity - how similar 2 vectors are

1. Cosine similarity
 - We now know how to compute vectors out of text documents. With this representation in mind, let us now explore techniques that will allow us to determine how similar two vectors, and consequently two documents, are to each other. More specifically, we will learn about the cosine similarity score, which is one of the most popular similarity metrics in NLP.

2. Mathematical formula
 - Very simply put, the cosine similarity score of two vectors is the cosine of the angle between the vectors. Mathematically, it is the ratio of the dot product of the vectors to the product of their magnitudes: $\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$. Let's walk through what this formula really means.

- 1 Image courtesy techninpink.com

3. The dot product
 - The dot product is computed by summing the product of values across corresponding dimensions of the vectors. Let's say we have two n-dimensional vectors V and W as shown. Then, the dot product here would be v1 times w1 plus v2 times w2 and so on until vn times wn. As an example, consider two vectors A and B. By applying the formula above, we see that the dot product comes to 37.

4. Magnitude of a vector
 - The magnitude of a vector is essentially the length of the vector. Mathematically, it is defined as the square root of the sum of the squares of the values across all the dimensions of the vector. Therefore, for an n-dimensional vector V, the magnitude, mod V, is computed as the square root of v1 squared plus v2 squared and so on until vn squared. Consider the vector A from before. Using the above formula, we compute its magnitude to be root 66.

5. The cosine score
 - We are now in a position to compute the cosine similarity score of A and B. It is the dot product, which is 37, divided by the product of the magnitudes of A and B, which are root 66 and root 38 respectively. The value comes out to be approximately 0.738, which is the value of the cosine of the angle theta between the two vectors.
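
The arithmetic of the last three slides can be reproduced in plain Python. This is a sketch using the vectors A = (4, 7, 1) and B = (5, 2, 3) from the scikit-learn cell in this lesson; the helper names `dot`, `magnitude` and `cosine_score` are illustrative.

```python
import math

def dot(v, w):
    """Sum of products across corresponding dimensions."""
    return sum(vi * wi for vi, wi in zip(v, w))

def magnitude(v):
    """Square root of the sum of squared components."""
    return math.sqrt(dot(v, v))

def cosine_score(v, w):
    """Dot product divided by the product of the magnitudes."""
    return dot(v, w) / (magnitude(v) * magnitude(w))

A = (4, 7, 1)
B = (5, 2, 3)

print(dot(A, B))                      # 37
print(dot(A, A), dot(B, B))           # 66 38, so the magnitudes are root 66 and root 38
print(round(cosine_score(A, B), 4))   # 0.7388
```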

6. Cosine Score: points to remember
 - Since the cosine score is simply the cosine of the angle between two vectors, its value is bounded between -1 and 1.
 - However, in NLP, document vectors almost always use non-negative weights. Therefore, cosine scores vary between 0 and 1, where 0 indicates no similarity (the documents share no terms) and 1 indicates that the two vectors point in the same direction, as they do for identical documents.
 - Since the cosine score ignores the magnitude of the vectors, it is fairly robust to document length. This may be an advantage or a disadvantage depending on the use case.
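
These points can be checked numerically. The sketch below uses a small hypothetical helper, `cosine`, written only for this illustration; it is not part of scikit-learn or the course material.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two 1-D vectors (illustrative helper)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

A = np.array([4.0, 7.0, 1.0])
B = np.array([5.0, 2.0, 3.0])

base = cosine(A, B)
# Scaling a vector changes its magnitude but not its direction,
# so the cosine score is unchanged
scaled = cosine(10 * A, B)
# A score of -1 needs negative components, which document vectors rarely have
opposite = cosine(A, -A)
# Vectors with no overlapping non-zero dimensions score 0
orthogonal = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

print(base, scaled, opposite, orthogonal)
```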

7. Implementation using scikit-learn
 - Scikit-learn offers a cosine_similarity function that outputs a similarity matrix containing the pairwise cosine scores for a set of vectors. You can import cosine_similarity from sklearn.metrics.pairwise. However, remember that cosine_similarity takes 2-D arrays as arguments; passing in 1-D arrays will throw an error. Let us compute the cosine similarity score of vectors A and B from before. We get the same answer of approximately 0.738 as before.

### cosine similarity using scikit-learn

In [None]:
# import the cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity

# define two 3-dimensional vectors A and B
A = (4,7,1)
B = (5,2,3)

# compute the cosine score of A and B
score = cosine_similarity([A], [B])

# print the cosine score
print(score)

### Computing dot product
- compute the dot product between two vectors, A = (1, 3) and B = (-2, 2)
- using the numpy library, we will use the np.dot() function to compute the dot product of two numpy arrays.

In [None]:
import numpy as np

# Initialize numpy vectors
A = np.array([1,3])
B = np.array([-2,2])

# Compute dot product
dot_prod = np.dot(A, B)

# Print dot product
print(dot_prod)

# dot product of the two vectors is 1 * -2 + 3 * 2 = 4

### Cosine similarity matrix of a corpus
In this exercise, you have been given a corpus, which is a list containing five sentences. You have to compute the cosine similarity matrix which contains the pairwise cosine similarity score for every pair of sentences (vectorized using tf-idf).

Remember, the value corresponding to the ith row and jth column of a similarity matrix denotes the similarity score for the ith and jth vector.

In [None]:
corpus = ['The sun is the largest celestial body in the solar system', 'The solar system consists of the sun and eight revolving planets', 'Ra was the Egyptian Sun God', 'The Pyramids were the pinnacle of Egyptian architecture', 'The quick brown fox jumps over the lazy dog']

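# Import TfidfVectorizer (cosine_similarity was imported above)
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize an instance of tf-idf Vectorizer
tfidf_vectorizer = TfidfVectorizer()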

# Generate the tf-idf vectors for the corpus
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# Compute and print the cosine similarity matrix
# note - tfidf_matrix are both first and second arguments
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(cosine_sim)

'''
[[1.         0.36413198 0.18314713 0.18435251 0.16336438]
 [0.36413198 1.         0.15054075 0.21704584 0.11203887]
 [0.18314713 0.15054075 1.         0.21318602 0.07763512]
 [0.18435251 0.21704584 0.21318602 1.         0.12960089]
 [0.16336438 0.11203887 0.07763512 0.12960089 1.        ]]
'''
'''
From our similarity matrix, we see that the first and the 
second sentence are the most similar. 
Also the fifth sentence has, on average, the 
lowest pairwise cosine scores. 
This is intuitive as it contains entities that are 
not present in the other sentences.
'''
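
The two observations above can be read off the matrix programmatically. The following is a small sketch that hard-codes the matrix printed above; the variable names (`off_diag`, `mean_scores`) are illustrative assumptions.

```python
import numpy as np

# Similarity matrix copied from the output above
cosine_sim = np.array([
    [1.0,        0.36413198, 0.18314713, 0.18435251, 0.16336438],
    [0.36413198, 1.0,        0.15054075, 0.21704584, 0.11203887],
    [0.18314713, 0.15054075, 1.0,        0.21318602, 0.07763512],
    [0.18435251, 0.21704584, 0.21318602, 1.0,        0.12960089],
    [0.16336438, 0.11203887, 0.07763512, 0.12960089, 1.0],
])

# Mask the diagonal so a sentence is never matched with itself
off_diag = cosine_sim.copy()
np.fill_diagonal(off_diag, -np.inf)

# Row and column (0-based) of the most similar pair of distinct sentences
i, j = np.unravel_index(np.argmax(off_diag), off_diag.shape)

# Average score of each sentence against the other four
n = cosine_sim.shape[0]
mean_scores = (cosine_sim.sum(axis=1) - np.diag(cosine_sim)) / (n - 1)
least_similar = int(np.argmin(mean_scores))

print(i, j, least_similar)
```

Indices are 0-based here, so the pair (0, 1) corresponds to the first and second sentences and index 4 to the fifth.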

## Building a plot line based recommender
- tf-idf vectors and cosine scores to build a recommender system

1. Building a plot line based recommender
 - In this lesson, we will use tf-idf vectors and cosine scores to build a recommender system that suggests movies based on overviews.

2. Movie recommender
 - We have a dataset containing movie overviews. Here, we can see two movies, Shanghai Triad and Cry, the Beloved Country, and their overviews.
 - Our task is to build a system that takes in a movie title and outputs a list of movies that have similar plot lines. For instance, if we passed in 'The Godfather', we could expect output like this. Notice how a lot of the movies listed here have to do with crime and gangsters, just like The Godfather.

4. Steps
 - The following steps are involved.
     1. Text preprocessing - the first step, as always, is to preprocess the movie overviews.
     2. Generate tf-idf vectors for the overviews.
     3. Generate the cosine similarity matrix, which contains the pairwise similarity scores of every movie with every other movie.
     4. Once the cosine similarity matrix is computed, build the recommender function.

5. The recommender function
 - We will build a recommender function as part of this course.
 - The recommender function takes a movie title, the cosine similarity matrix and an indices series as arguments.
     - The indices series is a reverse mapping from movie titles to their indices in the original dataframe.
 - It extracts the pairwise cosine similarity scores of the movie passed in with every other movie.
 - It sorts these scores in descending order.
 - It outputs the titles of the movies corresponding to the highest similarity scores.
 - Note that the function ignores the highest similarity score of 1. This is because the movie most similar to a given movie is the movie itself!
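
The behaviour described above can be traced on a toy example. This is a sketch under stated assumptions: the titles, the hand-written similarity matrix and the function name `recommend_toy` are all hypothetical and exist only to illustrate the reverse mapping and the sort-and-skip logic.

```python
import numpy as np
import pandas as pd

# Hypothetical titles and a hand-written similarity matrix (illustration only)
titles = pd.Series(['Movie A', 'Movie B', 'Movie C', 'Movie D'])
toy_sim = np.array([
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.6],
    [0.7, 0.3, 1.0, 0.4],
    [0.1, 0.6, 0.4, 1.0],
])

# Reverse mapping: title -> row position in the similarity matrix
indices = pd.Series(titles.index, index=titles)

def recommend_toy(title, sim, indices, n=2):
    # Look up the row of the requested movie
    idx = indices[title]
    # Pair each score with its position and sort in descending order
    scores = sorted(enumerate(sim[idx]), key=lambda x: x[1], reverse=True)
    # Skip position 0 of the sorted list: it is the movie itself (score 1)
    top = [i for i, _ in scores[1:n + 1]]
    return titles.iloc[top].tolist()

result = recommend_toy('Movie A', toy_sim, indices)
print(result)
```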

6. Generating tf-idf vectors
 - Let's say we already have the preprocessed movie overviews as 'movie_plots'. We already know how to generate the tf-idf vectors.

7. Generating cosine similarity matrix
 - Generating the cosine similarity matrix is also extremely simple. We simply pass in tfidf_matrix as both the first and second argument of cosine_similarity. This generates a matrix that contains the pairwise similarity score of every movie with every other movie. The value corresponding to the ith row and the jth column is the cosine similarity score of movie i with movie j. Notice that the diagonal elements of this matrix are 1. This is because, as stated earlier, the cosine similarity score of movie k with itself is 1.

8. The linear_kernel function - use instead of cosine similarity
 - The magnitude of a tf-idf vector produced by scikit-learn's TfidfVectorizer is 1, because the vectorizer applies L2 normalization by default (norm='l2').
 - Recall from the previous lesson that the cosine score is computed as the ratio of the dot product and the product of the magnitudes of the vectors. Since the magnitudes are 1, the cosine score of two tf-idf vectors is equal to their dot product!
 - This fact can help us improve the speed of computation of our cosine similarity matrix, as we do not need to compute the magnitudes while working with tf-idf vectors.
 - Therefore, while working with tf-idf vectors, we can use the linear_kernel function, which computes the pairwise dot product of every vector with every other vector.
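
The claim that the two functions agree for tf-idf vectors can be verified directly. The sketch below reuses three sentences from the earlier corpus exercise and relies on the default settings of TfidfVectorizer; with norm=None the rows would not have unit length and the two matrices would differ.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

docs = [
    'The sun is the largest celestial body in the solar system',
    'The solar system consists of the sun and eight revolving planets',
    'Ra was the Egyptian Sun God',
]

# Default settings apply L2 normalization to every row
tfidf_matrix = TfidfVectorizer().fit_transform(docs)

# Each row (document vector) should have magnitude 1
row_norms = np.linalg.norm(tfidf_matrix.toarray(), axis=1)

# Plain dot products should then match the cosine scores
same = np.allclose(cosine_similarity(tfidf_matrix, tfidf_matrix),
                   linear_kernel(tfidf_matrix, tfidf_matrix))

print(row_norms, same)
```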

9. Generating cosine similarity matrix
 - Let us replace the cosine_similarity function with linear_kernel. As you can see, the output remains the same, but it typically takes less time to compute.

10. The get_recommendations function
 - The recommender function and the indices series described earlier will be built in the exercises. You can use this function to generate recommendations using the cosine similarity matrix.

In [None]:
# example
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# generate matrix of tf-idf vectors
tfidf_matrix = vectorizer.fit_transform(movie_plots)

# generate cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

### Use linear_kernel instead of cosine_similarity to reduce computation time

In [None]:
# example - use linear_kernel instead of cosine_similarity
# because it gives the same matrix for tf-idf vectors in less time
# import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# generate cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

### get_recommendations function

In [None]:
get_recommendations('The Lion King', cosine_sim, indices)

### Comparing linear_kernel and cosine_similarity

In [None]:
import time

# Record start time
start = time.time()

# Compute cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Print cosine similarity matrix
print(cosine_sim)

# Print time taken
print("Time taken: %s seconds" %(time.time() - start))
'''
[[1.         0.         0.         ... 0.         0.         0.        ]
 [0.         1.         0.         ... 0.         0.         0.        ]
 [0.         0.         1.         ... 0.         0.01418221 0.        ]
 ...
 [0.         0.         0.         ... 1.         0.01589009 0.        ]
 [0.         0.         0.01418221 ... 0.01589009 1.         0.        ]
 [0.         0.         0.         ... 0.         0.         1.        ]]
Time taken: 0.33606457710266113 seconds
'''

In [None]:
# Record start time
start = time.time()

# Compute cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Print cosine similarity matrix
print(cosine_sim)

# Print time taken
print("Time taken: %s seconds" %(time.time() - start))
'''
[[1.         0.         0.         ... 0.         0.         0.        ]
 [0.         1.         0.         ... 0.         0.         0.        ]
 [0.         0.         1.         ... 0.         0.01418221 0.        ]
 ...
 [0.         0.         0.         ... 1.         0.01589009 0.        ]
 [0.         0.         0.01418221 ... 0.01589009 1.         0.        ]
 [0.         0.         0.         ... 0.         0.         1.        ]]
Time taken: 0.31749439239501953 seconds
'''

### Plot recommendation engine
- Build a recommendation engine that suggests movies based on the similarity of their plot lines. You have been given a get_recommendations() function that takes in the title of a movie, a similarity matrix and an indices series as its arguments and outputs a list of the most similar movies. The indices series has already been provided to you.
- You have also been given a movie_plots Series that contains the plot lines of several movies. Your task is to generate a cosine similarity matrix for the tf-idf vectors of these plots.

In [None]:
# Initialize the TfidfVectorizer 
tfidf = TfidfVectorizer(stop_words='english')

# Construct the TF-IDF matrix
tfidf_matrix = tfidf.fit_transform(movie_plots)

# Generate the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
 
# Generate recommendations 
print(get_recommendations('The Dark Knight Rises', cosine_sim, indices))
'''
1                              Batman Forever
2                                      Batman
3                              Batman Returns
8                  Batman: Under the Red Hood
9                            Batman: Year One
10    Batman: The Dark Knight Returns, Part 1
11    Batman: The Dark Knight Returns, Part 2
5                Batman: Mask of the Phantasm
7                               Batman Begins
4                              Batman & Robin
Name: title, dtype: object
'''

### The recommender function
Build a recommender function get_recommendations(), as discussed in the lesson and the previous exercise. As we know, it takes a movie title, a cosine similarity matrix, and a mapping from movie titles to indices as arguments, and outputs a list of the 10 titles most similar to the original title (excluding the title itself).

You have been given a dataset metadata that consists of the movie titles and overviews. The head of this dataset has been printed to console.

In [None]:
# Generate mapping between titles and index
indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()

def get_recommendations(title, cosine_sim, indices):
    # Get index of movie that matches title
    idx = indices[title]
    # Sort the movies based on the similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores for 10 most similar movies
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar movies
    return metadata['title'].iloc[movie_indices]

### TED talk recommender
Build a recommendation system that suggests TED Talks based on their transcripts. You have been given a get_recommendations() function that takes in the title of a talk, a similarity matrix and an indices series as its arguments, and outputs a list of the most similar talks. The indices series has already been provided to you.

You have also been given a transcripts series that contains the transcripts of around 500 TED talks. Your task is to generate a cosine similarity matrix for the tf-idf vectors of the talk transcripts.

We will then generate recommendations for a talk titled '5 ways to kill your dreams' by Brazilian entrepreneur Bel Pesce.

In [None]:
# Initialize the TfidfVectorizer 
tfidf = TfidfVectorizer(stop_words='english')

# Construct the TF-IDF matrix
tfidf_matrix = tfidf.fit_transform(transcripts)

# Generate the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
 
# Generate recommendations 
print(get_recommendations('5 ways to kill your dreams', cosine_sim, indices))

# Beyond n-grams: word embeddings

1. Beyond n-grams: word embeddings
 - We have covered a lot of ground in the last 4 chapters. However, before we bid adieu, we will cover one advanced topic that has a large number of applications in NLP.

2. The problem with BoW and tf-idf
 - Consider the three sentences
     - 'I am happy'
     - 'I am joyous'
     - 'I am sad'
 - Now if we were to compute the similarities, 'I am happy' and 'I am joyous' would have the same score as 'I am happy' and 'I am sad', regardless of which of the vectorization techniques covered so far we use. This is because 'happy', 'joyous' and 'sad' are considered to be completely different words. However, we know that happy and joyous are more similar to each other than either is to sad. This is something that the vectorization techniques we've covered so far simply cannot capture.
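
This limitation is easy to reproduce. The sketch below uses scikit-learn's CountVectorizer with default settings (an assumption; note that its default tokenizer drops single-character tokens such as 'I') and compares the two cosine scores.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ['I am happy', 'I am joyous', 'I am sad']

# Bag-of-words vectors: one column per distinct word in the corpus
bow = CountVectorizer().fit_transform(sentences)

# Pairwise cosine scores between the three sentences
sim = cosine_similarity(bow)
happy_joyous = sim[0, 1]
happy_sad = sim[0, 2]

# Both pairs share only the word 'am', so the two scores are equal
print(happy_joyous, happy_sad)
```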

3. Word embeddings
 - Word embedding = process of mapping words into an n-dimensional vector space. 
 - These vectors are usually produced using deep learning models and huge amounts of data. The techniques used are beyond the scope of this course. 
 - However, once generated, these vectors can be used to discern how similar two words are to each other. 
 - Consequently, they can also be used to detect synonyms and antonyms. 
 - Word embeddings are also capable of capturing complex relationships. For instance, they can be used to detect that the words king and queen relate to each other the same way as man and woman, or that France and Paris are related in the same way as Russia and Moscow.
 - One last thing to note is that word embeddings are not trained on user data; they depend on the pre-trained spaCy model you're using and are independent of the size of your dataset.

4. Word embeddings using spaCy
 - Generating word embeddings is easy using spaCy's pre-trained models. As usual, we load the spaCy model and create the doc object for our string. Note that it is advisable to load larger spaCy models while working with word vectors. This is because the en_core_web_sm model does not technically ship with word vectors but with context-specific tensors, which tend to give relatively poorer results. We generate word vectors for each word by looping through the tokens and accessing the vector attribute. The truncated output is as shown.

5. Word similarities
 - We can compute how similar two words are to each other by using the similarity method of a spacy token. Let's say we want to compute how similar happy, joyous and sad are to each other. We define a doc containing the three words. We then use a nested loop to calculate the similarity scores between each pair of words. As expected, happy and joyous are more similar to each other than they are to sad.

6. Document similarities
 - spaCy also allows us to directly compute the similarity between two documents by using the average of the word vectors of all the words in a particular document. Let's consider the three sentences from before. We create doc objects for the sentences. Like spaCy tokens, docs also have a similarity method. Therefore, we can compute the similarity between two docs as follows. As expected, 'I am happy' is more similar to 'I am joyous' than it is to 'I am sad'. Note that the similarity scores are high in both cases because all sentences share two of their three words, 'I' and 'am'.
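
The averaging idea can be sketched without loading a spaCy model. Everything in the block below is an assumption made for illustration: the three-dimensional "word vectors" are made-up numbers rather than real embeddings, and `doc_vector` and `cosine` are hypothetical helpers, not spaCy functions.

```python
import numpy as np

# Made-up 3-dimensional "word vectors" (illustration only, not real embeddings)
toy_vectors = {
    'I':      np.array([1.0, 0.0, 0.0]),
    'am':     np.array([0.0, 1.0, 0.0]),
    'happy':  np.array([0.2, 0.1, 0.9]),
    'joyous': np.array([0.1, 0.2, 0.8]),
    'sad':    np.array([0.1, 0.1, -0.9]),
}

def doc_vector(sentence):
    """Average the toy vectors of the whitespace-separated words."""
    return np.mean([toy_vectors[w] for w in sentence.split()], axis=0)

def cosine(u, v):
    """Cosine of the angle between two 1-D vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

happy = doc_vector('I am happy')
joyous = doc_vector('I am joyous')
sad = doc_vector('I am sad')

sim_joyous = cosine(happy, joyous)
sim_sad = cosine(happy, sad)
print(sim_joyous, sim_sad)
```

Because 'happy' and 'joyous' were given nearby toy vectors, the averaged document vectors for those two sentences end up closer to each other than to the one containing 'sad'.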

## Word embeddings using spaCy

In [None]:
import spacy

# load model and create Doc object
nlp = spacy.load('en_core_web_lg')
doc = nlp('I am happy')

# generate word vectors for each token
for token in doc:
    print(token.vector)


## Word similarities
- generate the pairwise similarity scores of all the words in a sentence.

In [None]:
doc = nlp("happy joyous sad")
for token1 in doc:
    for token2 in doc:
        print(token1.text, token2.text, token1.similarity(token2))

In [None]:
# the sentence is available as: sent

# Create the doc object
doc = nlp(sent)

# Compute pairwise similarity scores
for token1 in doc:
  for token2 in doc:
    print(token1.text, token2.text, token1.similarity(token2))
'''
    I I 1.0
    I like 0.023032807
    I apples 0.10175116
    I and 0.047492094
    I oranges 0.10894456
    like I 0.023032807
    like like 1.0
    like apples 0.015370452
    like and 0.189293
    like oranges 0.021943133
    apples I 0.10175116
    apples like 0.015370452
    apples apples 1.0
    apples and -0.17736834
    apples oranges 0.6315578
    and I 0.047492094
    and like 0.189293
    and apples -0.17736834
    and and 1.0
    and oranges 0.018627528
    oranges I 0.10894456
    oranges like 0.021943133
    oranges apples 0.6315578
    oranges and 0.018627528
    oranges oranges 1.0
'''

## Document similarities
- using the avg of word vectors of all the words in a document

In [None]:
# generate doc objects
sent1 = nlp('I am happy')
sent2 = nlp('I am sad')
sent3 = nlp('I am joyous')

# compute and print similarity between sent1 and sent2
print(sent1.similarity(sent2))

# compute and print similarity between sent1 and sent3
print(sent1.similarity(sent3))

# note: both scores will be high because all 3 sentences
# share the words 'I am'

### Computing similarity of Pink Floyd songs
You have been given the lyrics of three songs by the British band Pink Floyd, namely 'High Hopes', 'Hey You' and 'Mother'. The lyrics to these songs are available as hopes, hey and mother respectively.

Your task is to compute the pairwise similarity between mother and hopes, and mother and hey.

In [None]:
# Create Doc objects
mother_doc = nlp(mother)
hopes_doc = nlp(hopes)
hey_doc = nlp(hey)

# Print similarity between mother and hopes
print(mother_doc.similarity(hopes_doc))

# Print similarity between mother and hey
print(mother_doc.similarity(hey_doc))

# 0.6006234924640204
# 0.9135920924498578

Notice that 'Mother' and 'Hey You' have a similarity score of 0.9 whereas 'Mother' and 'High Hopes' have a score of only 0.6. This is probably because 'Mother' and 'Hey You' were both songs from the same album, 'The Wall', and were penned by Roger Waters. On the other hand, 'High Hopes' was a part of the album 'The Division Bell' with lyrics by David Gilmour and his wife, Polly Samson.