# Feature Engineering for NLP in Python
Rounak Banik - Data Scientist at Fractal Analytics

Rounak is a Young India Fellow and the author of the book, Hands-on Recommendation Systems with Python. He currently works as a Data Science Fellow with the QuantumBlack division of McKinsey and Company. He obtained his B.Tech degree in Electronics & Communication Engineering from IIT Roorkee.

Summary
1. Basic features and readability scores
2. Text preprocessing, POS tagging, and NER (named entity recognition)
3. N-Gram models - for sentiment analysis
4. TF-IDF and cosine similarity scores - for recommender

# Intro to NLP feature engineering

1. Introduction to NLP feature engineering
 - Welcome to Feature Engineering for NLP in Python! I am Rounak and I will be your instructor for this course. In this course, you will learn to extract useful features out of text and convert them into formats that are suitable for machine learning algorithms.

2. Numerical data
 - For any ML algorithm, data fed into it must be in tabular form and all the training features must be numerical. Consider the Iris dataset. Every training instance has exactly four numerical features. The ML algorithm uses these four features to train and predict if an instance belongs to class iris-virginica, iris-setosa or iris-versicolor.

3. One-hot encoding
 - ML algorithms can also work with categorical data, provided it is converted into numerical form through one-hot encoding. Let's say you have a categorical feature 'sex' with two categories, 'male' and 'female'.

4. One-hot encoding
 - One-hot encoding will convert this feature into two features,

5. One-hot encoding
 - 'sex_male' and 'sex_female' such that each male instance has a 'sex_male' value of 1 and a 'sex_female' value of 0. For females, it is the reverse.

6. One-hot encoding with pandas
 - To do this in code, we use pandas' get_dummies() function. Let's import pandas using the alias pd. We can then pass our dataframe df into the pd.get_dummies() function and pass a list of features to be encoded as the columns argument. Not mentioning columns will lead pandas to automatically encode all non-numerical features. Finally, we overwrite the original dataframe with the encoded version by assigning the dataframe returned by get_dummies() back to df.

7. Textual data
 - Consider a movie reviews dataset. This data cannot be used as-is by any machine learning (ML) algorithm. The training feature 'review' isn't numerical, and it isn't categorical either, so one-hot encoding cannot be applied to it.

8. Text pre-processing
 - We need to perform two steps to make this dataset suitable for ML. The first is to standardize the text. This involves steps like converting words to lowercase and reducing them to their base form. For instance, 'Reduction' gets lowercased and then converted to its base form, reduce. We will cover these concepts in more detail in subsequent lessons.

9. Vectorization
 - After preprocessing, the reviews are converted into a set of numerical training features through a process known as vectorization. After vectorization, our original review dataset gets converted

10. Vectorization
 - into something like this. We will learn techniques to achieve this in later lessons.

11. Basic features
 - We can also extract certain basic features from text. It may be useful to know the word count, character count and average word length of a particular text. While working with niche data such as tweets, it may also be useful to know how many hashtags have been used in a tweet. This tweet by Silverado Records, for instance, uses two.

12. POS tagging
 - So far, we have seen how to extract features out of an entire body of text. Some NLP applications may require you to extract features for individual words. For instance, you may want to do part-of-speech (POS) tagging to identify the different parts of speech present in your text. As an example, consider the sentence 'I have a dog'. POS tagging will label each word with its corresponding part of speech.

13. Named Entity Recognition
 - You may also want to perform named entity recognition to find out if a particular noun refers to a person, organization or country. For instance, consider the sentence "Brian works at DataCamp". Here, there are two nouns, "Brian" and "DataCamp". Brian refers to a person whereas DataCamp refers to an organization.

14. Concepts covered
 - Therefore, broadly speaking, this course will teach you how to preprocess text, extract basic features and word-level features, and convert documents into a set of numerical features (using a process known as vectorization).
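
To make the idea of vectorization concrete before the later lessons, here is a minimal sketch that turns text into word-count vectors by hand. It is a simplified stand-in and not necessarily the technique the course uses later; the sample reviews and the helper names `preprocess` and `vectorize` are assumptions made only for this illustration, and the preprocessing here is limited to lowercasing.

```python
from collections import Counter

# Toy reviews; hypothetical data for illustration only
reviews = [
    "The movie was good",
    "The movie was bad",
    "Good plot good acting",
]

def preprocess(text):
    # Standardize the text: lowercase it and split on whitespace
    # (no conversion to base form in this sketch)
    return text.lower().split()

# Build a sorted vocabulary of every distinct word across all reviews
tokenized = [preprocess(review) for review in reviews]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

def vectorize(tokens):
    # Count how often each vocabulary word occurs in one review
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# One numerical row per review, one column per vocabulary word
vectors = [vectorize(tokens) for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each review becomes a row of numbers with one column per word in the vocabulary, which is the tabular, all-numerical form that ML algorithms expect.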

# One-hot encoding

In [None]:
import pandas as pd

# Print the features of df1 (df1 is assumed to be loaded already)
print(df1.columns)

# Perform one-hot encoding
df1 = pd.get_dummies(df1, columns=['feature 5'])

# Print the new features of df1
print(df1.columns)

# Print first five rows of df1
print(df1.head())
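
Since `df1` is not defined in these notes, the following self-contained sketch shows the same call on a small made-up dataframe. The column names and values are assumptions for illustration. Passing `dtype=int` asks pandas for 1/0 values; without it, recent pandas versions return True/False columns.

```python
import pandas as pd

# Toy dataframe; the column names and values are made up for illustration
df = pd.DataFrame({
    "age": [23, 35, 41],
    "sex": ["male", "female", "female"],
})

# Encode only the 'sex' column; dtype=int yields 1/0 instead of True/False
df = pd.get_dummies(df, columns=["sex"], dtype=int)

# 'sex' is replaced by 'sex_female' and 'sex_male'; 'age' is left untouched
print(df)
```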

# Basic feature extraction

1. Basic feature extraction
 - In this video, we will learn to extract certain basic features from text. While not very powerful, they can give us a good idea of the text we are dealing with.

2. Number of characters
 - The most basic feature we can extract from text is the number of characters, including whitespaces. For instance, the string "I don't know." has 13 characters. The number of characters is the length of the string. Python gives us a built-in len() function which returns the length of the string passed into it. The output will be 13 here too. If our dataframe df has a textual feature (say 'review'), we can compute the number of characters for each review and store it as a new feature 'num_chars' by using the pandas dataframe apply method. This is done by creating df['num_chars'] and assigning it to df['review'].apply(len).

3. Number of words
 - Another feature we can compute is the number of words. Assuming that every word is separated by a space, we can use a string's split() method to convert it into a list where every element is a word. In this example, the string "Mary had a little lamb." is split to create a list containing the words Mary, had, a, little and lamb. Note that split() does not remove punctuation, so the last element keeps its trailing period. We can now compute the number of words by computing the number of elements in this list using len().

4. Number of words
 - To do this for a textual feature in a dataframe, we first define a function that takes in a string as an argument and returns the number of words in it. The steps followed inside the function are the same as before. We then pass this function, word_count, into apply. We create df['num_words'] and assign it to df['review'].apply(word_count).

5. Average word length
 - Let's now compute the average length of words in a string. Let's define a function avg_word_length() which takes in a string and returns the average word length. We first split the string into words and compute the length of each word. Next, we compute the average word length by dividing the sum of the lengths of all words by the number of words.

6. Average word length
 - We can now pass this into apply() to generate an average word length feature like before.

7. Special features
 - When working with data such as tweets, it may be useful to compute the number of hashtags or mentions used. This tweet by DataCamp, for instance, has one mention, upendra_35, which begins with an @, and two hashtags, PySpark and Spark, which begin with a #.

8. Hashtags and mentions
 - Let's write a function that computes the number of hashtags in a string. We split the string into words. We then use a list comprehension to create a list containing only those words that are hashtags. We do this using the startswith method of strings to find out if a word begins with #. The final step is to return the number of elements in this list using len. The procedure to compute the number of mentions is identical except that we check if a word starts with @. Let's see this function in action. When we pass the string "@janedoe This is my first tweet! #FirstTweet #Happy", the function returns 2, which is indeed the number of hashtags in the string.

9. Other features
 - There are other basic features we can compute, such as the number of sentences, number of paragraphs, number of words starting with an uppercase letter, all-capital words, numeric quantities, etc. The procedure is very similar to the ones we've already covered.

## Number of characters

In [2]:
len("I don't know.")

13

In [None]:
# apply to column
# create a 'num_chars' feature
df['num_chars'] = df['review'].apply(len)

## Number of words - assume separated by space

In [5]:
text = "Mary had a little lamb."
words = text.split()
print(words)
print(len(words))

['Mary', 'had', 'a', 'little', 'lamb.']
5


In [None]:
# feature in df
# function that returns number of words in string
def word_count(string):
    # split the string into words
    words = string.split()
    # return length of words list
    return len(words)

# create num_words feature in df
df['num_words'] = df['review'].apply(word_count)

## Average word length

In [None]:
# function that returns avg word length
def avg_word_length(x):
    # split the string into words
    words = x.split()
    # an empty string has no words; return 0 to avoid dividing by zero
    if not words:
        return 0.0
    # compute length of each word and store in a separate list
    word_lengths = [len(word) for word in words]
    # return average word length
    return sum(word_lengths) / len(words)

# create a new feature avg_word_length
df['avg_word_length'] = df['review'].apply(avg_word_length)
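
The cells above assume a dataframe `df` with a 'review' column already exists. As a minimal end-to-end sketch, the three features can be tried on a small made-up dataframe. The sample reviews are hypothetical, and returning 0.0 for an empty review is a design choice of this sketch, made so that an empty string does not cause a division by zero.

```python
import pandas as pd

# Hypothetical reviews, including an empty one to exercise the edge case
df = pd.DataFrame({"review": ["I don't know.", "Mary had a little lamb.", ""]})

def word_count(text):
    # Number of whitespace-separated words
    return len(text.split())

def avg_word_length(text):
    words = text.split()
    # Guard against empty text, which would otherwise divide by zero
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)

# Create one new column per feature
df["num_chars"] = df["review"].apply(len)
df["num_words"] = df["review"].apply(word_count)
df["avg_word_length"] = df["review"].apply(avg_word_length)

print(df)
```

Punctuation is counted as part of a word here, so "lamb." contributes a length of 5 rather than 4.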

## Special features like hashtags

In [6]:
# function that returns number of hashtags
def hashtag_count(string):
    # split the string into words
    words = string.split()
    # create a list of hashtags
    hashtags = [word for word in words if word.startswith('#')]
    # return number of hashtags
    return len(hashtags)

hashtag_count("@janedoe This is my first tweet! #FirstTweet #Happy")

2
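
The notes say that counting mentions follows the same procedure with @ in place of #. A minimal sketch of that variant is shown here; the function name `mention_count` is chosen for this example.

```python
# function that returns number of mentions
def mention_count(text):
    # split the string into words
    words = text.split()
    # create a list of mentions (words that begin with '@')
    mentions = [word for word in words if word.startswith("@")]
    # return number of mentions
    return len(mentions)

# The sample tweet contains a single mention, @janedoe
print(mention_count("@janedoe This is my first tweet! #FirstTweet #Happy"))
```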

## Other features
- number of sentences
- number of paragraphs
- words starting with an uppercase letter
- all-capital words
- numeric quantities
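
The features listed above can be sketched with the same split-and-count pattern. The definitions below are deliberately naive assumptions made for illustration: sentences are counted by their ending punctuation, paragraphs are assumed to be separated by a blank line, and only whole numbers are treated as numeric quantities. All function names and the sample text are made up for this example.

```python
import re
import string

def sentence_count(text):
    # Naive: count runs of sentence-ending punctuation
    # (abbreviations such as "Dr." will inflate the count)
    return len(re.findall(r"[.!?]+", text))

def paragraph_count(text):
    # Assumes paragraphs are separated by a blank line
    return len([p for p in text.split("\n\n") if p.strip()])

def capitalized_word_count(text):
    # Words whose first character is an uppercase letter
    return len([w for w in text.split() if w[0].isupper()])

def all_caps_word_count(text):
    # Words whose letters are all uppercase, such as "NASA"
    return len([w for w in text.split() if w.isupper()])

def numeric_count(text):
    # Whole numbers only; surrounding punctuation is stripped first
    return len([w for w in text.split() if w.strip(string.punctuation).isdigit()])

text = "NASA launched 3 rockets in 2024. It was GREAT!\n\nEveryone cheered."
print(sentence_count(text))
print(paragraph_count(text))
print(capitalized_word_count(text))
print(all_caps_word_count(text))
print(numeric_count(text))
```

Each function can be passed to apply() on a text column in the same way as word_count.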