**Introduction to NLP feature engineering**
___
- concepts covered
    - text preprocessing
    - basic features
    - word features
    - vectorization
___

In [None]:
#One-hot encoding

#In the previous exercise, we encountered a dataframe df1 which
#contained categorical features and was therefore unsuitable for
#ML algorithms.

#In this exercise, your task is to convert df1 into a format that is
#suitable for machine learning.

# Print the features of df1
#print(df1.columns)
#################################################
#<script.py> output:
#    Index(['feature 1', 'feature 2', 'feature 3', 'feature 4', 'feature 5', 'label'], dtype='object')
#################################################

# Perform one-hot encoding
#df1 = pd.get_dummies(df1, columns=['feature 5'])

# Print the new features of df1
#print(df1.columns)

# Print first five rows of df1
#print(df1.head())

#################################################
#Index(['feature 1', 'feature 2', 'feature 3', 'feature 4', 'label', 'feature 5_female', 'feature 5_male'], dtype='object')
#       feature 1  feature 2  feature 3  feature 4  label  feature 5_female  feature 5_male
#    0    29.0000          0          0   211.3375      1                 1               0
#    1     0.9167          1          2   151.5500      1                 0               1
#    2     2.0000          1          2   151.5500      0                 1               0
#    3    30.0000          1          2   151.5500      0                 0               1
#    4    25.0000          1          2   151.5500      0                 1               0
#################################################
#You have successfully performed one-hot encoding on this dataframe.
#Notice how feature 5 (which represents sex) is converted into two
#features, feature 5_male and feature 5_female. With one-hot
#encoding performed, df1 contains only numerical features and can
#now be fed into any standard ML model!
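The one-hot encoding step above can be tried on a small, made-up dataframe. This is only a sketch: the column names `age`, `sex`, and `label` are illustrative stand-ins, not the actual columns of df1. Recent pandas versions return boolean dummy columns by default, so `dtype=int` is passed here to get the 0/1 values shown in the output above.

```python
import pandas as pd

# Hypothetical toy data standing in for df1; column names are illustrative
df = pd.DataFrame({
    'age': [29.0, 0.92, 2.0],
    'sex': ['female', 'male', 'female'],
    'label': [1, 1, 0],
})

# One-hot encode the categorical column.
# dtype=int forces 0/1 columns instead of booleans.
encoded = pd.get_dummies(df, columns=['sex'], dtype=int)

# The original 'sex' column is dropped and the new dummy columns
# are appended after the remaining columns
print(encoded.columns.tolist())
print(encoded.head())
```

Every column left in `encoded` is numeric, which is the property most ML models require.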

**Basic feature extraction**
___
- number of characters
- number of words
- average word length
- special features
    - e.g., number of hashtags in a tweet
- other features
    - number of sentences
    - number of paragraphs
    - words starting with an uppercase
    - all-capital words
    - numeric quantities
___
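Several of the features listed above need only a few lines of plain Python. The sketch below assumes words are separated by whitespace, so punctuation attached to a word counts toward its length. The function names are illustrative and do not come from the exercises.

```python
def avg_word_length(text):
    """Average number of characters per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)

def count_capitalized(text):
    """Number of words whose first character is an uppercase letter."""
    return sum(1 for word in text.split() if word[0].isupper())

def count_all_caps(text):
    """Number of words written entirely in capitals."""
    return sum(1 for word in text.split() if word.isupper())

def count_numeric(text):
    """Number of words made up only of digits."""
    return sum(1 for word in text.split() if word.isdigit())

sample = "NASA launched 3 rockets in May"
print(avg_word_length(sample))
print(count_capitalized(sample), count_all_caps(sample), count_numeric(sample))
```

Each function takes a string and returns a number, so it can be passed to `apply` on a dataframe text column in the same way as `len`.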

In [None]:
#Character count of Russian tweets

#In this exercise, you have been given a dataframe tweets which
#contains some tweets associated with Russia's Internet Research
#Agency and compiled by FiveThirtyEight.

#Your task is to create a new feature 'char_count' in tweets which
#computes the number of characters for each tweet. Also, compute the
#average length of the tweets. The tweets are available in the
#content feature of tweets.

# Create a feature char_count
#tweets['char_count'] = tweets['content'].apply(len)

# Print the average character count
#print(tweets['char_count'].mean())

#################################################
#<script.py> output:
#    103.462
#################################################
#Notice that the average character count of these tweets is
#approximately 103, which is much higher than the overall average
#tweet length of around 40 characters. Depending on what you're
#working on, this may be worth investigating. For
#your information, there is research indicating that fake news
#articles tend to have longer titles! Therefore, even extremely
#basic features such as character counts can prove very useful
#in certain applications.

In [None]:
#Word count of TED talks

#ted is a dataframe that contains the transcripts of 500 TED talks.
#Your job is to compute a new feature word_count which contains the
#approximate number of words for each talk. You also
#need to compute the average word count of the talks. The transcripts
#are available as the transcript feature in ted.

#In order to complete this task, you will need to define a function
#count_words that takes in a string as an argument and returns the
#number of words in the string. You will then need to apply this
#function to the transcript feature of ted to create the new feature
#word_count and compute its mean.

# Function that returns number of words in a string
#def count_words(string):
#    # Split the string into words
#    words = string.split()

#    # Return the number of words
#    return len(words)

# Create a new feature word_count
#ted['word_count'] = ted['transcript'].apply(count_words)

# Print the average word count of the talks
#print(ted['word_count'].mean())

#################################################
#<script.py> output:
#   1987.1
#################################################
#You now know how to compute the number of words in a given piece
#of text. Also, notice that the average length of a talk is close
#to 2000 words. You can use the word_count feature to compute its
#correlation with other variables such as number of views, number
#of comments, etc. and derive extremely interesting insights about
#TED.

In [None]:
#Hashtags and mentions in Russian tweets

#Let's revisit the tweets dataframe containing the Russian tweets.
#In this exercise, you will compute the number of hashtags and
#mentions in each tweet by defining two functions count_hashtags()
#and count_mentions() respectively and applying them to the content
#feature of tweets.

#In case you don't recall, the tweets are contained in the content
#feature of tweets.

# Function that returns number of hashtags in a string
#def count_hashtags(string):
#    # Split the string into words
#    words = string.split()

#    # Create a list of words that are hashtags
#    hashtags = [word for word in words if word.startswith('#')]

#    # Return number of hashtags
#    return len(hashtags)

# Create a feature hashtag_count and display distribution
#tweets['hashtag_count'] = tweets['content'].apply(count_hashtags)
#tweets['hashtag_count'].hist()
#plt.title('Hashtag count distribution')
#plt.show()

![_images/19.1.svg](_images/19.1.svg)

In [None]:
# Function that returns number of mentions in a string
#def count_mentions(string):
#    # Split the string into words
#    words = string.split()

#    # Create a list of words that are mentions
#    mentions = [word for word in words if word.startswith('@')]

#    # Return number of mentions
#    return len(mentions)

# Create a feature mention_count and display distribution
#tweets['mention_count'] = tweets['content'].apply(count_mentions)
#tweets['mention_count'].hist()
#plt.title('Mention count distribution')
#plt.show()

![_images/19.2.svg](_images/19.2.svg)

You now have a good grasp of how to compute various types of
summary features. In the next lesson, we will learn about more
advanced features that are capable of capturing more nuanced
information beyond simple word and character counts.
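The hashtag and mention counters differ only in the prefix they look for, so one way to avoid the duplication is a single helper that takes the prefix as a parameter. The sketch below is an illustration under assumptions: `count_prefixed` is a hypothetical name, and the three-row dataframe is a made-up stand-in for the real tweets data.

```python
import pandas as pd

def count_prefixed(text, prefix):
    """Count whitespace-separated words that start with the given prefix."""
    return sum(1 for word in text.split() if word.startswith(prefix))

# Made-up sample rows; the real tweets dataframe is not reproduced here
tweets = pd.DataFrame({'content': [
    '#nlp is fun @alice @bob',
    'no tags here',
    'email me at a@b.com #contact',
]})

# Reuse the same helper for both features
tweets['hashtag_count'] = tweets['content'].apply(lambda t: count_prefixed(t, '#'))
tweets['mention_count'] = tweets['content'].apply(lambda t: count_prefixed(t, '@'))

print(tweets[['hashtag_count', 'mention_count']])
```

Because the check is `startswith`, an `@` inside a word, as in an email address, is not counted as a mention.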

**Readability tests**
___
- overview of readability tests
    - determine readability of an English passage
    - scale ranging from primary school up to college graduate level
    - a mathematical formula utilizing word, syllable, and sentence count
    - used in fake news and opinion spam detection
- readability test examples
    - **Flesch reading ease**
        - the greater the average sentence length, the harder the text is to read
        - the greater the average number of syllables per word, the harder the text is to read
        - the higher the score, the greater the readability
        ![_images/19.1.PNG](_images/19.1.PNG)
    - **Gunning fog index**
    - Simple Measure of Gobbledygook (SMOG)
    - Dale-Chall score
___
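To make the Flesch reading ease idea concrete, here is a minimal sketch assuming the standard published form of the formula: 206.835 − 1.015 × (words per sentence) − 84.6 × (syllables per word). The syllable counter is a rough heuristic that treats each run of vowels as one syllable, and it is an assumption of this sketch. The sentence and word splitting rules are likewise simplified, so the scores are only approximate.

```python
import re

def count_syllables(word):
    # Rough heuristic (an assumption, not a linguistic rule):
    # each run of consecutive vowels counts as one syllable
    groups = re.findall(r'[aeiouy]+', word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text):
    # Split on sentence-ending punctuation and drop empty pieces
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    # Treat runs of letters and apostrophes as words
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError('text must contain at least one word')

    syllables = sum(count_syllables(word) for word in words)
    avg_sentence_length = len(words) / len(sentences)
    avg_syllables_per_word = syllables / len(words)

    # Longer sentences and longer words both lower the score
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word

simple = 'The cat sat on the mat.'
dense = ('Incomprehensible institutional considerations '
         'necessitate extraordinary deliberation.')

print(flesch_reading_ease(simple))
print(flesch_reading_ease(dense))
```

The short, monosyllabic sentence scores far higher than the dense one, which matches the rule that a higher score means more readable text.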