<h1><b><font color = 'brown'>
Feature Extraction from Texts
</font></b></h1>

<h2>
<b>

<ul>
<font color = 'brown green'>

<li>
Machine learning algorithms do not understand textual data directly.
</li><br>

<li>
We need to represent the text data in numerical form or vectors.
</li><br>

<li>
To convert each textual sentence into a vector, we need to represent it as a set of features.
</li><br>

<li>
This set of features should uniquely represent the text, though, individually, some of the features may be common across many textual sentences.
</li><br>

<li>
Features can be classified into two different categories:
</li><br>

<li>
General features: These features are statistical calculations and do not depend
on the content of the text. Some examples of general features could be the
number of tokens in the text, the number of characters in the text, and so on.
</li><br>

<li>
Specific features: These features are dependent on the inherent meaning of
the text and represent the semantics of the text. For example, the frequency of
unique words in the text is a specific feature.
</li><br>

</font>
</ul>
</b>
</h2>
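
The distinction between the two categories can be sketched in a few lines of plain Python. This is a minimal illustration, not part of the original exercises: the function names `general_features` and `specific_features` and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter

def general_features(text):
    """Statistics that ignore which words the text contains."""
    tokens = text.split()  # naive whitespace tokenization (assumption)
    return {"num_tokens": len(tokens), "num_characters": len(text)}

def specific_features(text):
    """Word frequencies, which depend on the content of the text."""
    return Counter(text.lower().split())

sentence = "the cat chased the mouse"
print(general_features(sentence))   # {'num_tokens': 5, 'num_characters': 24}
print(specific_features(sentence))  # 'the' occurs twice, every other word once
```

Changing "cat" to "dog" would leave the general features untouched but alter the specific ones.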

<h1><b><font color = 'brown'>
Extracting General Features from Raw Text
</font></b></h1>

<h2>
<b>

<ul>
<font color = 'brown green'>

<li>
As we've already learned, general features refer to those that are not directly
dependent on the individual tokens constituting a text corpus.
</li><br>

<li>
Let's consider these two sentences: "The sky is blue" and "The pillar is yellow".
</li><br>

<li>
Here, both sentences have the same number of words (a general feature): four.
</li><br>

<li>
But the individual constituent tokens are different.
</li><br>

<li>
Let's complete an exercise to understand this better.
</li><br>

</font>
</ul>
</b>
</h2>
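
The comparison of the two example sentences can be checked directly. The sketch below assumes simple whitespace tokenization, which is enough for these two sentences.

```python
first = "The sky is blue"
second = "The pillar is yellow"

first_tokens = first.split()
second_tokens = second.split()

# General feature: the word count is identical for both sentences
print(len(first_tokens), len(second_tokens))  # 4 4

# Token-level view: the words that appear in only one of the two sentences
differing_tokens = set(first_tokens) ^ set(second_tokens)
print(sorted(differing_tokens))  # ['blue', 'pillar', 'sky', 'yellow']
```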

<h1><b><font color = 'brown'>
Exercise 01: Extracting General Features from Raw Text
</font></b></h1>

In this exercise, we will extract general features from input text. These general features include detecting the number of words, the presence of "wh" words (words beginning with "wh", such as "what" and "why") and the language in which the text is written.

1. Open a Jupyter Notebook.

2. Import the **pandas** and **TextBlob** libraries, download the **nltk** `punkt` tokenizer data, and create a DataFrame with four sentences.

In [None]:
import pandas as pd
from textblob import TextBlob

import nltk
nltk.download('punkt')  

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# one sentence per row, stored in a single column named 'text'
df = pd.DataFrame({'text': [
    'The interim budget for 2019 will be announced on 1st February.',
    'Do you know how much expectation the middle-class working population is having from this budget?',
    'February is the shortest month in a year.',
    'This financial year will end on 31st March.',
]})
df.head()

Unnamed: 0,text
0,The interim budget for 2019 will be announced ...
1,Do you know how much expectation the middle-cl...
2,February is the shortest month in a year.
3,This financial year will end on 31st March.


3. Use the **apply()** function to iterate through each row of the `text` column,
convert each sentence into a **TextBlob** object, and count the words it contains.

In [None]:
def add_num_words(df):
    df['number_of_words'] = df['text'].apply(lambda x : len(TextBlob(str(x)).words))
    return df

In [None]:
add_num_words(df)['number_of_words']

0    11
1    15
2     8
3     8
Name: number_of_words, dtype: int64

4. Use the **apply()** function to iterate through each row of the `text` column,
convert each sentence into a **TextBlob** object, and extract its words
to check whether any of them belong to the declared set of "wh" words.

In [None]:
def is_present(wh_words, df):

    # Lowercase each sentence, tokenize it, and intersect its set of tokens with
    # wh_words. The feature is True when the intersection is non-empty.
    # Lowercasing ensures that a sentence-initial "What" or "Why" also matches.
    df['is_wh_words_present'] = df['text'].apply(
        lambda x: len(set(TextBlob(str(x).lower()).words).intersection(wh_words)) > 0)
    return df

wh_words = {'why', 'who', 'which', 'what', 'where', 'when', 'how'}

is_present(wh_words, df)['is_wh_words_present']

0    False
1     True
2    False
3    False
Name: is_wh_words_present, dtype: bool
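
Because the `wh_words` set is lowercase, whether a sentence-initial "What" is detected depends on how token case is handled. The following is a TextBlob-free sketch of the same set-intersection idea that lowercases the text first. The function name `has_wh_word` and the regex tokenizer are assumptions made for illustration, not part of the exercise.

```python
import re

WH_WORDS = {"why", "who", "which", "what", "where", "when", "how"}

def has_wh_word(sentence, wh_words=WH_WORDS):
    # Naive tokenizer (assumption): lowercase the text, keep runs of letters
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    # True when at least one token is a "wh" word
    return len(tokens & wh_words) > 0

print(has_wh_word("What is the time?"))                          # True
print(has_wh_word("February is the shortest month in a year."))  # False
```

Only whole tokens are compared, so a word such as "whatever" does not count as a match.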

<h1><b><font color = 'brown'>
Exercise 02: Extracting General Features from Text
</font></b></h1>

In this exercise, we will extract various general features from documents. 
<br>The dataset that we will be using here consists of random statements.
<br>Our objective is to find the frequency of various general features such as punctuation, uppercase and lowercase words, letters, digits, words, and whitespaces.

1. Open a Jupyter Notebook.

2. Insert a new cell and add the following code to import the necessary libraries:

In [None]:
import pandas as pd
from string import punctuation
from collections import Counter

import nltk
from nltk import pos_tag, word_tokenize
from nltk.data import load

# download the tagset documentation, the tokenizer, and the PoS tagger
nltk.download('tagsets')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


3. To see what different kinds of parts of speech **nltk** provides, add the
following code:

In [None]:
def get_tagsets():
  tagdict = load('help/tagsets/upenn_tagset.pickle')
  return list(tagdict.keys())
  
tag_list = get_tagsets()
print(tag_list)

['LS', 'TO', 'VBN', "''", 'WP', 'UH', 'VBG', 'JJ', 'VBZ', '--', 'VBP', 'NN', 'DT', 'PRP', ':', 'WP$', 'NNPS', 'PRP$', 'WDT', '(', ')', '.', ',', '``', '$', 'RB', 'RBR', 'RBS', 'VBD', 'IN', 'FW', 'RP', 'JJR', 'JJS', 'PDT', 'MD', 'VB', 'WRB', 'NNP', 'EX', 'NNS', 'SYM', 'CC', 'CD', 'POS']


4. Calculate the number of occurrences of each **PoS** tag by iterating through each
document and annotating each word with its corresponding PoS tag. Add the
following code to implement this:

In [None]:
def get_pos_occurrence_freq(data, tag_list):

  # collect one {tag: count} dict per sentence
  rows = []
  for text_line in data.text:

    # get the PoS tag of each token
    pos_tags = [tag for _, tag in pos_tag(word_tokenize(str(text_line)))]

    # count how often each PoS tag occurs in the given sentence
    rows.append(dict(Counter(pos_tags)))

  # build the DataFrame in one step (DataFrame.append was removed in pandas 2.0);
  # tags that do not occur in a sentence become NaN and are replaced with 0
  feature_df = pd.DataFrame(rows, columns = tag_list).fillna(0).astype(float)

  return feature_df

data = pd.read_csv('/content/drive/MyDrive/NLP/Feature Extraction Methods in NLP/data.csv')
feature_df = get_pos_occurrence_freq(data, tag_list)
feature_df.head()

Unnamed: 0,LS,TO,VBN,'',WP,UH,VBG,JJ,VBZ,--,...,MD,VB,WRB,NNP,EX,NNS,SYM,CC,CD,POS
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
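
The counting step can be illustrated without nltk or pandas. In this sketch the tagged sentences are hand-written, hypothetical inputs (in the exercise the tags come from `pos_tag`), and `tag_count_rows` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical, hand-tagged sentences (assumption for illustration only)
tagged_sentences = [
    [("The", "DT"), ("sky", "NN"), ("is", "VBZ"), ("blue", "JJ")],
    [("Who", "WP"), ("called", "VBD"), ("?", ".")],
]
tag_list = ["DT", "NN", "VBZ", "JJ", "WP", "VBD", "."]

def tag_count_rows(tagged_sentences, tag_list):
    rows = []
    for tagged in tagged_sentences:
        counts = Counter(tag for _, tag in tagged)
        # A Counter returns 0 for missing keys, so absent tags become 0
        rows.append([counts[tag] for tag in tag_list])
    return rows

rows = tag_count_rows(tagged_sentences, tag_list)
print(rows[0])  # [1, 1, 1, 1, 0, 0, 0]
print(rows[1])  # [0, 0, 0, 0, 1, 1, 1]
```

Each row plays the role of one line of `feature_df`: one column per tag, with zeros for tags that do not occur in the sentence.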


5. To calculate the number of unique punctuation marks in each sentence, add the following code:

In [None]:
def add_punctuation_count(feature_df, data):

  feature_df['num_of_unique_punctuations'] = data['text'].apply(lambda x: len(set(x).intersection(set(punctuation))))
  
  return feature_df  

feature_df = add_punctuation_count(feature_df, data)

feature_df['num_of_unique_punctuations'].head()

0    0
1    0
2    1
3    1
4    0
Name: num_of_unique_punctuations, dtype: int64
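
Note that intersecting character sets counts each distinct punctuation mark once, however often it repeats. The sketch below contrasts that with a total count. The helper names are assumptions made for illustration.

```python
from string import punctuation

PUNCTUATION = set(punctuation)

def unique_punctuation_count(text):
    # Distinct punctuation characters, as in the set intersection above
    return len(set(text) & PUNCTUATION)

def total_punctuation_count(text):
    # Every occurrence of a punctuation character
    return sum(ch in PUNCTUATION for ch in text)

sample = "Wait... what?!"
print(unique_punctuation_count(sample))  # 3  ('.', '?', '!')
print(total_punctuation_count(sample))   # 5
```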

6. To count the words in each sentence that begin with an uppercase letter, add the following code:

In [None]:
def get_captalized_word_count(feature_df, data):

  feature_df['number_of_captalized_words'] = data['text'].apply(lambda x: len([word for word in word_tokenize(str(x)) if word[0].isupper()]))

  return feature_df

feature_df = get_captalized_word_count(feature_df, data)

feature_df['number_of_captalized_words'].head()

0    1
1    1
2    1
3    1
4    1
Name: number_of_captalized_words, dtype: int64

7. To count the words in each sentence that begin with a lowercase letter, add the following code:

In [None]:
def get_small_word_count(feature_df, data):

  feature_df['number_of_small_words'] = data['text'].apply(lambda x: len([word for word in word_tokenize(str(x)) if word[0].islower()]))
  
  return feature_df

feature_df = get_small_word_count(feature_df, data)
feature_df['number_of_small_words'].head()

0    4
1    3
2    7
3    3
4    2
Name: number_of_small_words, dtype: int64
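
The two counts above do not necessarily add up to the number of tokens, because a token that starts with a digit or a punctuation mark is neither uppercase nor lowercase. A small sketch, using a hand-written token list (an assumption standing in for the output of `word_tokenize`):

```python
def case_counts(tokens):
    capitalized = sum(1 for token in tokens if token[0].isupper())
    lowercase = sum(1 for token in tokens if token[0].islower())
    # Tokens starting with a digit, punctuation mark, or other symbol
    other = len(tokens) - capitalized - lowercase
    return capitalized, lowercase, other

tokens = ["The", "budget", "for", "2019", "is", "due", "."]
print(case_counts(tokens))  # (1, 4, 2)
```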

8. To calculate the number of letters in each sentence, use the following code:

In [None]:
def get_number_of_alphabets(feature_df, data):

  feature_df['number_of_alphabets'] = data['text'].apply(lambda x: len([ch for ch in str(x) if ch.isalpha()]))

  return feature_df

feature_df = get_number_of_alphabets(feature_df, data)
feature_df['number_of_alphabets'].head()

0    19
1    18
2    28
3    14
4    13
Name: number_of_alphabets, dtype: int64

9. To calculate the number of digits in each sentence, add the following code:

In [None]:
def get_number_of_digit_count(feature_df, data):

  feature_df['number_of_digits'] = data['text'].apply(lambda x: len([ch for ch in str(x) if ch.isdigit()]))

  return feature_df

feature_df = get_number_of_digit_count(feature_df, data)
feature_df['number_of_digits'].head()

0    0
1    0
2    0
3    0
4    0
Name: number_of_digits, dtype: int64

10. To calculate the number of words (tokens) in each sentence, add the following code:

In [None]:
def get_number_of_words(feature_df, data):

  feature_df['number_of_words'] = data['text'].apply(lambda x: len(word_tokenize(str(x))))

  return feature_df

feature_df = get_number_of_words(feature_df, data)
feature_df['number_of_words'].head()

0    5
1    4
2    9
3    5
4    3
Name: number_of_words, dtype: int64

11. To calculate the number of whitespace characters in each sentence, add the
following code:

In [None]:
def get_number_of_whitespaces(feature_df, data):

  feature_df['number_of_whitespaces'] = data['text'].apply(lambda x: len([ch for ch in str(x) if ch.isspace()]))

  return feature_df

feature_df = get_number_of_whitespaces(feature_df, data)
feature_df['number_of_whitespaces'].head()

0    4
1    3
2    7
3    3
4    2
Name: number_of_whitespaces, dtype: int64

12. To view the full feature set we have just created, add the following code:

In [None]:
feature_df.head()

Unnamed: 0,LS,TO,VBN,'',WP,UH,VBG,JJ,VBZ,--,...,CC,CD,POS,num_of_unique_punctuations,number_of_captalized_words,number_of_small_words,number_of_alphabets,number_of_digits,number_of_words,number_of_whitespaces
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0,1,4,19,0,5,4
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0,1,3,18,0,4,3
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1,1,7,28,0,9,7
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1,1,3,14,0,5,3
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0,1,2,13,0,3,2
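
The letter, digit, and whitespace counts from steps 8, 9, and 11 all classify individual characters, so they could also be gathered in a single pass over the text. The following is a sketch of that alternative, not the approach used in the exercise. `character_class_counts` is a hypothetical helper name.

```python
def character_class_counts(text):
    counts = {"letters": 0, "digits": 0, "whitespaces": 0}
    for ch in text:
        if ch.isalpha():
            counts["letters"] += 1
        elif ch.isdigit():
            counts["digits"] += 1
        elif ch.isspace():
            counts["whitespaces"] += 1
    return counts

print(character_class_counts("Room 42 is open"))
# {'letters': 10, 'digits': 2, 'whitespaces': 3}
```

With pandas, such a function could be applied once per row and its result expanded into columns, at the cost of departing from the one-function-per-feature layout used above.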
