# <font color = 'dodgerblue'>**PreProcessing_Feature_Engineering_IMDB**</font>

We covered the fundamentals of spaCy, especially tokenization, in the previous notebooks. In this notebook, we will learn how to alter spaCy's default tokenizer. Then, we will use a simple example to demonstrate how spaCy can be used for pre-processing and for creating manual features. Lastly, we will develop classes for (1) general preprocessing and (2) manual feature extraction. These classes are designed to plug into scikit-learn pipelines, which chain preprocessing steps and model training together so that the same transformations are applied consistently from data preparation through model evaluation. Building them will also improve our understanding of scikit-learn classes.

**Plan**

1. Basic cleaning (remove HTML tags).
2. Understand how to do preprocessing using spaCy.
3. Understand how to change the default behavior of the tokenizer in spaCy.
4. Create a custom pre-processing class.
5. Understand the extraction of POS (Part Of Speech) related features.
6. Understand the extraction of Text Descriptive Features.
7. Extract the count of named entities.
8. Create a custom class for manual feature extraction.





# <font color = 'dodgerblue'>**Install/Import Libraries**

In [None]:
# install spacy
if 'google.colab' in str(get_ipython()):
    !pip install -U spacy -qq

In [None]:
# Import the pandas library for working with data frames
import pandas as pd
import numpy as np

# Import the spacy library for natural language processing
import spacy

# Import the List type from the typing module to use in function annotations
from typing import List

# for basic cleaning
from bs4 import BeautifulSoup
import re

from pprint import pprint


In [None]:
# check spacy version
spacy.__version__


'3.7.2'

# <font color = 'dodgerblue'>**Data**

In [None]:
text = ["""New version of operation system is iOS 11. It is better than iOS 9.
The new version of iPhone X seems cool.""", """ <p> The video of iphone x released. I liked iOS 9 but I like iOS 11 more.
You may not like my like @Tech_Guru #Iphone #IOS harpreet@utdallas.edu  https://jindal.utdallas.edu/""",
        """</p><p>The concept of regular expressions began in the 1950s, when the American mathematician <a href="/wiki/Stephen_Cole_Kleene" title="Stephen Cole Kleene">Stephen Cole Kleene</a> formalized the description of a <i><a href="/wiki/Regular_language" title="Regular language">regular language</a></i>. They came into common use with <a href="/wiki/Unix" title="Unix">Unix</a> text-processing utilities. Different <a href="/wiki/Syntax_(programming_languages)" title="Syntax (programming languages)">syntaxes</a> for writing regular expressions have existed since the 1980s, one being the <a href="/wiki/POSIX" title="POSIX">POSIX</a> standard and another, widely used, being the <a href="/wiki/Perl" title="Perl">Perl</a> syntax.
</p><p>Regular expressions are used in <a href="/wiki/Search_engine" title="Search engine">search engines</a>, search and replace dialogs of <a href="/wiki/Word_processor" title="Word processor">word processors</a> and <a href="/wiki/Text_editor" title="Text editor">text editors</a>, in <a href="/wiki/Text_processing" title="Text processing">text processing</a> utilities such as <a href="/wiki/Sed" title="Sed">sed</a> and <a href="/wiki/AWK" title="AWK">AWK</a> and in <a href="/wiki/Lexical_analysis" title="Lexical analysis">lexical analysis</a>. Many <a href="/wiki/Programming_language" title="Programming language">programming languages</a> provide regex capabilities either built-in or via <a href="/wiki/Library_(computing)" title="Library (computing)">libraries</a>, as it has uses in many situations.
</p> """]
df = pd.DataFrame(text, columns=['Reviews'])

In [None]:
df.head()

Unnamed: 0,Reviews
0,New version of operation system is iOS 11. It ...
1,<p> The video of iphone x released. I liked i...
2,</p><p>The concept of regular expressions bega...


In [None]:
pprint(df['Reviews'].values[0], width = 80)

('New version of operation system is iOS 11. It is better than iOS 9.\n'
 'The new version of iPhone X seems cool.')


In [None]:
pprint(df['Reviews'].values[1], width = 80)

(' <p> The video of iphone x released. I liked iOS 9 but I like iOS 11 more.\n'
 'You may not like my like @Tech_Guru #Iphone #IOS harpreet@utdallas.edu  '
 'https://jindal.utdallas.edu/')


In [None]:
pprint(df['Reviews'].values[2], width = 80)

('</p><p>The concept of regular expressions began in the 1950s, when the '
 'American mathematician <a href="/wiki/Stephen_Cole_Kleene" title="Stephen '
 'Cole Kleene">Stephen Cole Kleene</a> formalized the description of a <i><a '
 'href="/wiki/Regular_language" title="Regular language">regular '
 'language</a></i>. They came into common use with <a href="/wiki/Unix" '
 'title="Unix">Unix</a> text-processing utilities. Different <a '
 'href="/wiki/Syntax_(programming_languages)" title="Syntax (programming '
 'languages)">syntaxes</a> for writing regular expressions have existed since '
 'the 1980s, one being the <a href="/wiki/POSIX" title="POSIX">POSIX</a> '
 'standard and another, widely used, being the <a href="/wiki/Perl" '
 'title="Perl">Perl</a> syntax.\n'
 '</p><p>Regular expressions are used in <a href="/wiki/Search_engine" '
 'title="Search engine">search engines</a>, search and replace dialogs of <a '
 'href="/wiki/Word_processor" title="Word processor">word processors</a> a

In [None]:
type(df['Reviews'].values[2])

str

# <font color = 'dodgerblue'>**Import Spacy Model**

In [None]:
# download the small English model (en_core_web_sm)
!python -m spacy download en_core_web_sm


In [None]:
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')


# <font color = 'dodgerblue'>**Pre-processing**

## <font color = 'dodgerblue'>**Basic cleaning (remove HTML tags).**

In [None]:
def basic_clean(text: str) -> str:
    """
    This function performs basic text cleaning on an input string by removing HTML tags (if present)
    and replacing newline and return characters with a space.

    Parameters:
    text (str): The input string to be cleaned

    Returns:
    str: The cleaned string
    """
    # Use BeautifulSoup to remove HTML tags (if present)
    soup = BeautifulSoup(text, "html.parser")
    text = soup.get_text()

    # Replace newline and return characters with a space
    return re.sub(r'[\n\r]',' ', text)

In [None]:
cleaned_text= [basic_clean(text).encode('utf-8', 'ignore').decode() for text in df['Reviews'].values]

In [None]:
pprint(cleaned_text[2])

('The concept of regular expressions began in the 1950s, when the American '
 'mathematician Stephen Cole Kleene formalized the description of a regular '
 'language. They came into common use with Unix text-processing utilities. '
 'Different syntaxes for writing regular expressions have existed since the '
 '1980s, one being the POSIX standard and another, widely used, being the Perl '
 'syntax. Regular expressions are used in search engines, search and replace '
 'dialogs of word processors and text editors, in text processing utilities '
 'such as sed and AWK and in lexical analysis. Many programming languages '
 'provide regex capabilities either built-in or via libraries, as it has uses '
 'in many situations.  ')


## <font color = 'dodgerblue'>**Understand how to do preprocessing using spaCy**
- We will now see how to do basic pre-processing (removing stop words, lemmatization, removing URLs, removing emails, etc.) using spaCy.
- We will also see how spaCy handles mentions (@keyword) and hashtags (#keyword).

In [None]:
# initialize an empty list to store processed text
processed_text = []

# Use context manager to temporarily disable the named pipes of spaCy NLP processing pipeline
with nlp.select_pipes(disable=['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']):
  # stream the documents through the pipeline in batches (n_process=1 means a single process)
  for doc in nlp.pipe(cleaned_text, batch_size=3, n_process=1):
      # filter out URLs and emails, then extract text of each token
      tokens = [token.text for token in doc if not token.like_url and not token.like_email]
      # join tokens back into a single string
      text = ' '.join(tokens)
      # append to the processed_text list
      processed_text.append(text)

In [None]:
cleaned_text[1]

'  The video of iphone x released. I liked iOS 9 but I like iOS 11 more. You may not like my like @Tech_Guru #Iphone #IOS harpreet@utdallas.edu  https://jindal.utdallas.edu/'

In [None]:
processed_text[1]

'The video of iphone x released . I liked iOS 9 but I like iOS 11 more . You may not like my like @Tech_Guru # Iphone # IOS'

- From the above we can see that `#` is treated as a prefix, so it is separated from the keyword and becomes a separate token. If we remove punctuation, `#` will be removed as well.

- However, `@` is not treated as a prefix. Hence, `@keyword` is kept as a single token.
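
The two observations above can be checked token by token. The following is a minimal sketch that assumes a blank English pipeline (`spacy.blank("en")`) uses the same default tokenizer rules as `en_core_web_sm`; the variable names `nlp_blank`, `rows`, and `kept` are introduced here for illustration only.

```python
import spacy

# A blank English pipeline contains only the tokenizer, so no model download is needed.
nlp_blank = spacy.blank("en")

sample = ("You may not like my like @Tech_Guru #Iphone "
          "harpreet@utdallas.edu https://jindal.utdallas.edu/")
doc = nlp_blank(sample)

# Inspect each token together with the two flags used for filtering.
rows = [(token.text, token.like_url, token.like_email) for token in doc]
for text, is_url, is_email in rows:
    print(f"{text!r:35} like_url={is_url} like_email={is_email}")

# Keep only the tokens that are neither URLs nor email addresses.
kept = [text for text, is_url, is_email in rows if not is_url and not is_email]
print(kept)
```

Printing the flags per token makes it easy to see which tokens the `like_url` and `like_email` filters drop, and that `#` arrives as its own token while `@Tech_Guru` does not.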

## <font color = 'dodgerblue'>**Understand how to change the default behavior of the tokenizer in spaCy**

We will now see how we can change this default behavior. We will modify the default prefixes. You can also modify the default suffixes in a similar manner.

In [None]:
# Load the spaCy model for English. This model includes various components like tokenization, lemmatization, part-of-speech tagging, etc.
nlp = spacy.load('en_core_web_sm')

# Get the default set of prefix characters from the loaded spaCy model.
# In tokenization, these prefixes are characters or sets of characters at the beginning of a word
# that are treated separately from the word they are attached to.
prefixes = list(nlp.Defaults.prefixes)

# The following lines adjust the tokenizer behavior for specific characters, often used in social media text or web text.

# Add '@' to the list of prefixes.
prefixes += ['@']

# Remove '#' from the list of prefixes. This is typically done when you want to treat hashtags as a single token rather
# than separating the '#' symbol from the following text.
# It allows the hashtag to remain intact for further analysis or feature extraction.
prefixes.remove(r'#')


# Compile prefix regex based on selected prefixes
prefix_regex = spacy.util.compile_prefix_regex(prefixes)
nlp.tokenizer.prefix_search = prefix_regex.search

In [None]:
# initialize an empty list to store processed text
processed_text = []

# Use context manager to temporarily disable the named pipes of spaCy NLP processing pipeline
with nlp.select_pipes(disable=['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']):
  # stream the documents through the pipeline in batches (n_process=1 means a single process)
  for doc in nlp.pipe(cleaned_text, batch_size=3, n_process=1):
      # filter out URLs and emails, then extract text of each token
      tokens = [token.text for token in doc if not token.like_url and not token.like_email]
      # join tokens back into a single string
      text = ' '.join(tokens)
      # append to the processed_text list
      processed_text.append(text)

In [None]:
processed_text[1]

'   The video of iphone x released . I liked iOS 9 but I like iOS 11 more . You may not like my like @ Tech_Guru #Iphone #IOS  '

- `@Tech_Guru` is tokenized as `@ Tech_Guru`, indicating that the '@' character is treated as a separate token. This modification is due to adding '@' to the list of prefix characters, making the tokenizer recognize and separate it from the word that follows. The intention might be to isolate mentions or handle them distinctly in subsequent processing steps.

- `#Iphone` and `#IOS` appear intact, without separating the '#' from the keywords. This is the result of removing '#' from the list of prefix characters. By doing this, the tokenizer no longer splits hashtags into two tokens ('#' and 'keyword'). Instead, each hashtag is preserved as a single token, which is beneficial for analyses where the integrity of hashtags is important, such as sentiment analysis or trend detection on social media platforms.
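
To compare the default and the modified tokenizer side by side, the prefix change can be wrapped in a small helper. This is a sketch, not the notebook's own code: `build_social_tokenizer_nlp` is a hypothetical name, and it assumes a blank English pipeline is an acceptable stand-in for the tokenizer of `en_core_web_sm`.

```python
import spacy
from spacy.util import compile_prefix_regex


def build_social_tokenizer_nlp():
    """Return a blank English pipeline whose tokenizer splits '@' but keeps '#'.

    Hypothetical helper for illustration only.
    """
    nlp_custom = spacy.blank("en")
    # Copy the defaults so the class-level tuple is not modified.
    prefixes = list(nlp_custom.Defaults.prefixes)
    prefixes.append("@")   # '@' is now split off as its own token
    prefixes.remove("#")   # '#' now stays attached to the hashtag text
    nlp_custom.tokenizer.prefix_search = compile_prefix_regex(prefixes).search
    return nlp_custom


sample = "like @Tech_Guru #Iphone"
default_tokens = [token.text for token in spacy.blank("en")(sample)]
custom_tokens = [token.text for token in build_social_tokenizer_nlp()(sample)]

print("default:", default_tokens)
print("custom: ", custom_tokens)
```

Building a fresh pipeline inside the helper keeps the change local, so other code that relies on the default tokenizer is unaffected.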



## <font color = 'dodgerblue'>**Create a custom pre-processing class**
Now, we will create a custom class for pre-processing with spaCy so that we can reuse this code. Once we have tested the class, we will move it into a module (a .py file) and import it like any other module.

The `SpacyPreprocessor` class is a text preprocessing transformer designed for natural language processing tasks. It extends scikit-learn's `BaseEstimator` and `TransformerMixin` to integrate smoothly with scikit-learn pipelines. The class provides a structured way to clean and preprocess text data using spaCy's powerful NLP tools, along with additional text cleaning steps.

Here's a breakdown of its components and functionalities:

1. **`__init__` (Constructor):**
   - The constructor initializes the transformer with a wide range of parameters that control how the text will be processed. These include options for batch processing, text cleaning (like removing stopwords, punctuation, URLs, emails, etc.), and text transformation (like lemmatization and stemming).
   - The `model` parameter specifies the spaCy language model to use.
   - A ValueError is raised if both `lemmatize` and `stemming` are set to True, as they are mutually exclusive text normalization techniques.

2. **`BaseEstimator`:**
   - Inheriting from `BaseEstimator` makes the class compatible with other scikit-learn utilities. It enables functionalities like getting and setting parameters, and it's necessary for any custom estimator.

3. **`TransformerMixin`:**
   - `TransformerMixin` supplies a `fit_transform` method that calls `fit` followed by `transform`, so the class only needs to implement those two methods itself.
   - This mixin is useful for creating custom transformers in scikit-learn pipelines.

4. **`fit`:**
   - The `fit` method is a requirement for compatibility with scikit-learn's transformer interface. In this case, it's a placeholder that doesn't learn anything from the data (since text preprocessing doesn't involve fitting to a dataset). It simply returns `self`, allowing the transformer to be used in a pipeline.

5. **`transform`:**
   - The `transform` method is where the actual text processing happens. It takes an input array of text data and performs the following steps:
     - Validates the input format.
     - Applies `basic_clean` to each text, which strips HTML tags, replaces line breaks with spaces, and trims leading and trailing whitespace.
     - If `basic_clean_only` is False, it further processes the text using the `spacy_preprocessor` method, which applies the spaCy pipeline and the specified text cleaning and normalization options.
     - Returns the preprocessed text data.

In summary, `SpacyPreprocessor` is a customizable text preprocessing class that conforms to scikit-learn's interface, making it a flexible tool for integrating NLP preprocessing into machine learning pipelines. The use of `BaseEstimator` and `TransformerMixin` ensures compatibility and ease of use with scikit-learn, while the `fit` and `transform` methods provide the standard interface for a transformer in this ecosystem.
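
Before looking at the full class, the transformer interface described above can be illustrated with a much smaller example. The `WhitespaceNormalizer` class below is hypothetical and is not part of the notebook; it only shows how `BaseEstimator` provides `get_params`/`set_params`, how `TransformerMixin` provides `fit_transform`, and how such a class slots into a `Pipeline`.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline


class WhitespaceNormalizer(BaseEstimator, TransformerMixin):
    """Toy transformer (illustrative only): collapse whitespace, optionally lowercase."""

    def __init__(self, lower=True):
        # Store constructor arguments unchanged so get_params/set_params work.
        self.lower = lower

    def fit(self, X, y=None):
        # Nothing is learned from the data; return self so the step can be chained.
        return self

    def transform(self, X, y=None):
        cleaned = [" ".join(text.split()) for text in X]
        return [text.lower() for text in cleaned] if self.lower else cleaned


pipe = Pipeline([
    ("clean", WhitespaceNormalizer()),
    ("bow", CountVectorizer()),
])

features = pipe.fit_transform(["The  new iPhone", "the new   iOS"])
vocab = sorted(pipe.named_steps["bow"].vocabulary_)
params = WhitespaceNormalizer().get_params()

print(vocab)
print(features.shape)
print(params)
```

Because the constructor only stores its arguments, the parameter can also be changed through the pipeline, for example with `pipe.set_params(clean__lower=False)`.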

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from bs4 import BeautifulSoup
import re
import spacy
import numpy as np
from nltk.stem.porter import PorterStemmer
import os

class SpacyPreprocessor(BaseEstimator, TransformerMixin):

    def __init__(self, model, *, batch_size = 64, lemmatize=True, lower=True, remove_stop=True,
                remove_punct=True, remove_email=True, remove_url=True, remove_num=False, stemming = False,
                add_user_mention_prefix=True, remove_hashtag_prefix=False, basic_clean_only=False):

        self.model = model
        self.batch_size = batch_size
        self.remove_stop = remove_stop
        self.remove_punct = remove_punct
        self.remove_num = remove_num
        self.remove_url = remove_url
        self.remove_email = remove_email
        self.lower = lower
        self.add_user_mention_prefix = add_user_mention_prefix
        self.remove_hashtag_prefix = remove_hashtag_prefix
        self.basic_clean_only = basic_clean_only

        if lemmatize and stemming:
            raise ValueError("Only one of 'lemmatize' and 'stemming' can be True.")

        # Validate basic_clean_only option: every other processing option must be switched off
        if self.basic_clean_only and (lemmatize or lower or remove_stop or remove_punct or remove_num or stemming or
                                      remove_email or remove_url or
                                      add_user_mention_prefix or remove_hashtag_prefix):
            raise ValueError("If 'basic_clean_only' is set to True, other processing options must be set to False.")

        # Assign lemmatize and stemming

        self.lemmatize = lemmatize
        self.stemming = stemming

    def basic_clean(self, text):
        soup = BeautifulSoup(text, "html.parser")
        text = soup.get_text()
        text = re.sub(r'[\n\r]', ' ', text)
        return text.strip()

    def spacy_preprocessor(self, texts):
        final_result = []
        nlp = spacy.load(self.model)

        # Disable unnecessary pipelines in spaCy model
        if self.lemmatize:
            # Disable parser and named entity recognition
            disabled_pipes = ['parser', 'ner']
        else:
            # Disable tagger, parser, attribute ruler, lemmatizer and named entity recognition
            disabled_pipes = ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

        with nlp.select_pipes(disable=disabled_pipes):
          # Modify tokenizer behavior based on user_mention_prefix and hashtag_prefix settings
          if self.add_user_mention_prefix or self.remove_hashtag_prefix:
              prefixes = list(nlp.Defaults.prefixes)
              if self.add_user_mention_prefix:
                  prefixes += ['@']  # Treat '@' as a separate token
              if self.remove_hashtag_prefix:
                  prefixes.remove(r'#')  # Don't separate '#' from the following text
              prefix_regex = spacy.util.compile_prefix_regex(prefixes)
              nlp.tokenizer.prefix_search = prefix_regex.search

          # Create the stemmer once instead of once per token
          stemmer = PorterStemmer() if self.stemming else None

          # Process text data in batches using spaCy's nlp.pipe()
          for doc in nlp.pipe(texts, batch_size=self.batch_size):
              filtered_tokens = []
              for token in doc:
                  # Check if token should be removed based on specified filters
                  if self.remove_stop and token.is_stop:
                      continue
                  if self.remove_punct and token.is_punct:
                      continue
                  if self.remove_num and token.like_num:
                      continue
                  if self.remove_url and token.like_url:
                      continue
                  if self.remove_email and token.like_email:
                      continue

                  # Append the token's text, lemma, or stemmed form to the filtered_tokens list
                  if self.lemmatize:
                      filtered_tokens.append(token.lemma_)
                  elif self.stemming:
                      filtered_tokens.append(stemmer.stem(token.text))
                  else:
                      filtered_tokens.append(token.text)

              # Join the tokens and apply lowercasing if specified
              text = ' '.join(filtered_tokens)
              if self.lower:
                  text = text.lower()
              final_result.append(text.strip())

        return final_result


    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Validate the input type up front so that bad input fails loudly
        # instead of printing a message and silently returning None
        if not isinstance(X, (list, np.ndarray)):
            raise TypeError(f'Expected list or numpy array, got {type(X)}')

        x_clean = [self.basic_clean(text).encode('utf-8', 'ignore').decode() for text in X]

        # Check if only basic cleaning is required
        if self.basic_clean_only:
            return x_clean  # Return the list of basic-cleaned texts

        return self.spacy_preprocessor(x_clean)


In [None]:
# instantiate the pre-processor with every option switched off except basic cleaning
preprocessor = SpacyPreprocessor(model='en_core_web_sm', batch_size=64, lemmatize=False, lower=False,
                                    remove_stop=False, remove_punct=False, remove_email=False,
                                    remove_url=False, remove_num=False, stemming=False,
                                    add_user_mention_prefix=False, remove_hashtag_prefix=False, basic_clean_only=True)

In [None]:
cleaned_text = preprocessor.fit_transform(df['Reviews'].values)

for item in cleaned_text:
    print()
    pprint(item, width = 80)


('New version of operation system is iOS 11. It is better than iOS 9. The new '
 'version of iPhone X seems cool.')

('The video of iphone x released. I liked iOS 9 but I like iOS 11 more. You '
 'may not like my like @Tech_Guru #Iphone #IOS harpreet@utdallas.edu  '
 'https://jindal.utdallas.edu/')

('The concept of regular expressions began in the 1950s, when the American '
 'mathematician Stephen Cole Kleene formalized the description of a regular '
 'language. They came into common use with Unix text-processing utilities. '
 'Different syntaxes for writing regular expressions have existed since the '
 '1980s, one being the POSIX standard and another, widely used, being the Perl '
 'syntax. Regular expressions are used in search engines, search and replace '
 'dialogs of word processors and text editors, in text processing utilities '
 'such as sed and AWK and in lexical analysis. Many programming languages '
 'provide regex capabilities either built-in or via libraries, as it has u

# <font color = 'dodgerblue'>**Feature Extraction**

## <font color = 'dodgerblue'> **Understand the extraction of POS (Part Of Speech) related features.**

In [None]:
noun_count = [] # create a list to store the noun count for each document
aux_count = [] # create a list to store the auxiliary verb count for each document
verb_count = [] # create a list to store the verb count for each document
adj_count =[] # create a list to store the adjective count for each document

# disable lemmatizer and named entity recognizer
disabled_pipes =  ['lemmatizer', 'ner']

# iterate over the documents in the dataframe using the spacy pipe method
with nlp.select_pipes(disable=disabled_pipes):
  for doc in nlp.pipe(cleaned_text, batch_size=3, n_process=1):

      # find all nouns and proper nouns in the document and store in a list
      nouns = [token.text for token in doc if (token.pos_ in ["NOUN","PROPN"])]

      # find all auxiliary verbs in the document and store in a list
      auxs =  [token.text for token in doc if (token.pos_ in ["AUX"])]

      # find all verbs in the document and store in a list
      verbs =  [token.text for token in doc if (token.pos_ in ["VERB"])]
      print(verbs)

      # find all adjectives in the document and store in a list
      adjectives =  [token.text for token in doc if (token.pos_ in ["ADJ"])]

      # store the count of nouns in the noun_count list
      noun_count.append(len(nouns))

      # store the count of auxiliary verbs in the aux_count list
      aux_count.append(len(auxs))

      # store the count of verbs in the verb_count list
      verb_count.append(len(verbs))

      # store the count of adjectives in the adj_count list
      adj_count.append(len(adjectives))


['seems']
['released', 'liked', 'like', 'like']
['began', 'formalized', 'came', 'processing', 'writing', 'existed', 'used', 'replace', 'sed', 'provide', 'built']


In [None]:
noun_count, verb_count, aux_count, adj_count

([6, 8, 40], [1, 4, 11], [2, 1, 5], [4, 1, 12])
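
The four list comprehensions above each walk over the document once. A single pass with `collections.Counter` gives the same group counts. The sketch below works on plain POS-tag strings so that it runs without a model; `POS_GROUPS`, `count_pos_groups`, and the example tag sequence are assumptions made for illustration, and with spaCy the input would be `(token.pos_ for token in doc)`.

```python
from collections import Counter

# Map each feature name to the coarse POS tags it should count (hypothetical grouping
# that mirrors the cell above: nouns include proper nouns).
POS_GROUPS = {
    "noun": {"NOUN", "PROPN"},
    "aux": {"AUX"},
    "verb": {"VERB"},
    "adj": {"ADJ"},
}


def count_pos_groups(pos_tags):
    """Count tags once, then sum the counts that belong to each group."""
    tag_counts = Counter(pos_tags)
    return {name: sum(tag_counts[tag] for tag in tags)
            for name, tags in POS_GROUPS.items()}


# Hand-written tag sequence used only to exercise the function.
example_tags = ["DET", "ADJ", "NOUN", "ADP", "PROPN", "PROPN", "VERB", "ADJ", "PUNCT"]
counts = count_pos_groups(example_tags)
print(counts)
```

Keeping the tag groups in one dictionary makes it straightforward to add another feature, such as adverbs, without writing another list comprehension.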

## <font color = 'dodgerblue'> **Understand the extraction of Text Descriptive Features**

In [None]:
nlp = spacy.load('en_core_web_sm')
disabled_pipes = ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

# Lists that collect one value per document
list_count_words = []
list_count_characters = []
list_count_characters_no_space = []
list_avg_word_length = []
list_count_numbers = []
list_count_sentences = []

with nlp.select_pipes(disable=disabled_pipes):
    # The parser is disabled, so add the rule-based sentencizer to get sentence boundaries
    if not nlp.has_pipe('sentencizer'):
        nlp.add_pipe('sentencizer')

    for doc in nlp.pipe(cleaned_text, batch_size=3, n_process=1):
        # Count words (all tokens except punctuation)
        count_word = len([token.text for token in doc if not token.is_punct])

        # Count all characters (including spaces)
        count_char = len(doc.text)

        # Count characters without spaces
        count_char_no_space = len(doc.text_with_ws.replace(' ', ''))

        # Count characters without spaces and punctuation
        # count_char_no_punct = sum(len(token.text) for token in doc if not token.is_punct)

        # Approximate average word length; the +1 in the denominator avoids
        # division by zero for empty documents (the numerator still includes punctuation)
        avg_word_length = count_char_no_space / (count_word + 1)

        # Count numbers (tokens made up entirely of digits)
        count_numbers = len([token for token in doc if token.is_digit])

        # Count sentences
        count_sentences = len(list(doc.sents))

        list_count_words.append(count_word)
        list_count_characters.append(count_char)
        list_count_characters_no_space.append(count_char_no_space)
        list_avg_word_length.append(avg_word_length)
        list_count_numbers.append(count_numbers)
        list_count_sentences.append(count_sentences)

In [None]:
list_count_words,  list_count_characters, list_count_characters_no_space, list_avg_word_length, list_count_numbers, list_count_sentences

([22, 28, 106],
 [107, 170, 687],
 [86, 143, 584],
 [3.739130434782609, 4.931034482758621, 5.457943925233645],
 [2, 2, 0],
 [3, 3, 5])
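
The average word length values can be reproduced by hand from the other two columns, which also makes the effect of the `+ 1` in the denominator visible. The sketch below copies the (characters without spaces, words) pairs from the output above; the function name `avg_word_length` and the variable names are introduced here for illustration.

```python
def avg_word_length(count_char_no_space: int, count_word: int) -> float:
    """Same formula as in the cell above: the +1 keeps the division defined for empty text."""
    return count_char_no_space / (count_word + 1)


# (characters without spaces, words) for the three documents, copied from the output above
observed = [(86, 22), (143, 28), (584, 106)]

averages = [avg_word_length(chars, words) for chars, words in observed]
print(averages)

# An empty document has zero characters and zero words; the result is 0.0 rather than an error.
empty_doc_average = avg_word_length(0, 0)
print(empty_doc_average)
```

Note that the `+ 1` slightly lowers every value: for the first document the result is 86 / 23 rather than 86 / 22. If exact averages matter, an explicit check for `count_word == 0` is an alternative.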

In [None]:
for item in cleaned_text:
    print()
    pprint(item, width = 80)


('New version of operation system is iOS 11. It is better than iOS 9. The new '
 'version of iPhone X seems cool.')

('The video of iphone x released. I liked iOS 9 but I like iOS 11 more. You '
 'may not like my like @Tech_Guru #Iphone #IOS harpreet@utdallas.edu  '
 'https://jindal.utdallas.edu/')

('The concept of regular expressions began in the 1950s, when the American '
 'mathematician Stephen Cole Kleene formalized the description of a regular '
 'language. They came into common use with Unix text-processing utilities. '
 'Different syntaxes for writing regular expressions have existed since the '
 '1980s, one being the POSIX standard and another, widely used, being the Perl '
 'syntax. Regular expressions are used in search engines, search and replace '
 'dialogs of word processors and text editors, in text processing utilities '
 'such as sed and AWK and in lexical analysis. Many programming languages '
 'provide regex capabilities either built-in or via libraries, as it has u

## <font color = 'dodgerblue'> **Count of Named Entities**

In [None]:
nlp = spacy.load('en_core_web_sm')
count_ner = []
# Disable the tok2vec, tagger, parser, attribute ruler, and lemmatizer pipelines for improved performance
disabled_pipes = ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer']
with nlp.select_pipes(disable=disabled_pipes):
  for doc in nlp.pipe(cleaned_text, batch_size=1000, n_process=-1):
      ner_text = [ent.text for ent in doc.ents]
      print(ner_text)
      ner_labels = [ent.label_ for ent in doc.ents]
      print(ner_labels)
      count_ner.append(len(ner_labels))

['11', '9']
['CARDINAL', 'CARDINAL']
['9', '11', 'IOS']
['CARDINAL', 'CARDINAL', 'ORG']
['the 1950s', 'American', 'Stephen Cole', 'the 1980s', 'one', 'POSIX', 'Perl', 'AWK']
['DATE', 'NORP', 'PERSON', 'DATE', 'CARDINAL', 'ORG', 'ORG', 'ORG']


In [None]:
count_ner

[2, 3, 8]
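
The total above treats every entity type the same. If per-type counts are wanted as features, the printed labels can be tallied with `collections.Counter`. This is an optional sketch that goes slightly beyond the notebook: the label lists are copied from the output above, and `labels_per_doc`, `total_counts`, and `label_counts` are names chosen for illustration. With spaCy, each inner list would come from `[ent.label_ for ent in doc.ents]`.

```python
from collections import Counter

# Entity labels per document, copied from the printed output above
labels_per_doc = [
    ['CARDINAL', 'CARDINAL'],
    ['CARDINAL', 'CARDINAL', 'ORG'],
    ['DATE', 'NORP', 'PERSON', 'DATE', 'CARDINAL', 'ORG', 'ORG', 'ORG'],
]

# Total number of entities per document (matches count_ner)
total_counts = [len(labels) for labels in labels_per_doc]

# Number of entities per label and document; missing labels count as 0
label_counts = [Counter(labels) for labels in labels_per_doc]

print(total_counts)
for counts in label_counts:
    print(dict(counts))
```

A `Counter` returns 0 for labels that do not occur, so a lookup such as `label_counts[0]['ORG']` is safe even when a document contains no organizations.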

## <font color = 'dodgerblue'> **Create a custom class for manual feature extraction**

In [None]:
from sklearn.base import TransformerMixin, BaseEstimator
import numpy as np
import spacy
import re
import sys
import os
from pathlib import Path


class ManualFeatures(TransformerMixin, BaseEstimator):


    def __init__(self, spacy_model, batch_size = 64, pos_features = True, ner_features = True, text_descriptive_features = True):

        self.spacy_model = spacy_model
        self.batch_size = batch_size
        self.pos_features = pos_features
        self.ner_features = ner_features
        self.text_descriptive_features = text_descriptive_features

    def get_cores(self):
        """
        Get the number of CPU cores to use in parallel processing.
        """
        # Get the number of CPU cores available on the system
        # (os.cpu_count() can return None, so fall back to 1).
        num_cores = os.cpu_count() or 1
        if num_cores < 3:
            use_cores = 1
        else:
            # Use roughly half of the cores, plus one
            use_cores = num_cores // 2 + 1
        return use_cores

    def get_pos_features(self, cleaned_text):

        nlp = spacy.load(self.spacy_model)
        noun_count = []
        aux_count = []
        verb_count = []
        adj_count =[]

        # Disable the lemmatizer and NER pipelines for improved performance
        disabled_pipes = ['lemmatizer', 'ner']
        with nlp.select_pipes(disable=disabled_pipes):
            n_process = self.get_cores()
            for doc in nlp.pipe(cleaned_text, batch_size=self.batch_size, n_process=n_process):
                # Extract nouns, auxiliaries, verbs, and adjectives from the document
                nouns = [token.text for token in doc if token.pos_ in ["NOUN","PROPN"]]
                auxs =  [token.text for token in doc if token.pos_ in ["AUX"]]
                verbs =  [token.text for token in doc if token.pos_ in ["VERB"]]
                adjectives =  [token.text for token in doc if token.pos_ in ["ADJ"]]

                # Store the count of each type of word in separate lists
                noun_count.append(len(nouns))
                aux_count.append(len(auxs))
                verb_count.append(len(verbs))
                adj_count.append(len(adjectives))

        # Stack the count lists vertically to form a 2D numpy array
        return np.transpose(np.vstack((noun_count, aux_count, verb_count, adj_count)))



    def get_ner_features(self, cleaned_text):
        nlp = spacy.load(self.spacy_model)
        count_ner = []

        # Disable the tok2vec, tagger, parser, attribute ruler, and lemmatizer pipelines for improved performance
        disabled_pipes = ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer']
        with nlp.select_pipes(disable=disabled_pipes):
            n_process = self.get_cores()
            for doc in nlp.pipe(cleaned_text, batch_size=self.batch_size, n_process=n_process):
                ners = [ent.label_ for ent in doc.ents]
                count_ner.append(len(ners))

        # Convert the list of NER counts to a 2D numpy array
        return np.array(count_ner).reshape(-1, 1)


    def get_text_descriptive_features(self, cleaned_text):
        list_count_words = []
        list_count_characters = []
        list_count_characters_no_space = []
        list_avg_word_length = []
        list_count_numbers = []
        list_count_sentences = []

        nlp = spacy.load(self.spacy_model)
        disabled_pipes = ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
        with nlp.select_pipes(disable=disabled_pipes):
            if not nlp.has_pipe('sentencizer'):
                nlp.add_pipe('sentencizer')
            n_process = self.get_cores()
            for doc in nlp.pipe(cleaned_text, batch_size=self.batch_size, n_process=n_process):
                count_word = len([token for token in doc if not token.is_punct])
                count_char = len(doc.text)
                count_char_no_space = len(doc.text_with_ws.replace(' ', ''))
                avg_word_length = count_char_no_space / (count_word + 1)
                count_numbers = len([token for token in doc if token.is_digit])
                count_sentences = len(list(doc.sents))

                list_count_words.append(count_word)
                list_count_characters.append(count_char)
                list_count_characters_no_space.append(count_char_no_space)
                list_avg_word_length.append(avg_word_length)
                list_count_numbers.append(count_numbers)
                list_count_sentences.append(count_sentences)

        text_descriptive_features = np.vstack((list_count_words, list_count_characters, list_count_characters_no_space, list_avg_word_length,
                                    list_count_numbers, list_count_sentences))
        return np.transpose(text_descriptive_features)


    def fit(self, X, y=None):

        return self


    def transform(self, X, y=None):

        try:
            # Check if the input data is a list or numpy array
            if not isinstance(X, (list, np.ndarray)):
                raise TypeError(f"Expected list or numpy array, got {type(X)}")


            feature_names = []

            if self.text_descriptive_features:
                text_descriptive_features = self.get_text_descriptive_features(X)
                feature_names.extend(['count_words', 'count_characters',
                                      'count_characters_no_space', 'avg_word_length',
                                      'count_numbers', 'count_sentences'])
            else:
                # Zero-column placeholder with one row per document,
                # so that np.hstack below still works when this group is disabled
                text_descriptive_features = np.empty(shape=(len(X), 0))

            if self.pos_features:
                pos_features = self.get_pos_features(X)
                feature_names.extend(['noun_count', 'aux_count', 'verb_count', 'adj_count'])
            else:
                pos_features = np.empty(shape=(len(X), 0))

            if self.ner_features:
                ner_features = self.get_ner_features(X)
                feature_names.extend(['ner'])
            else:
                ner_features = np.empty(shape=(len(X), 0))

            # Stack the feature arrays horizontally to form a single 2D numpy array.
            # Note: this returns a tuple (features, feature_names), not just the array.
            return np.hstack((text_descriptive_features, pos_features, ner_features)), feature_names

        except Exception as error:
            print(f'An exception occurred: {repr(error)}')
            # Re-raise so the caller does not silently receive None
            raise
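
When one of the feature groups is switched off, `transform` still passes a placeholder array for it to `np.hstack`. `np.hstack` requires every 2D block to have the same number of rows, so a placeholder only stacks cleanly if it has one row per document and zero columns. The sketch below illustrates this with made-up shapes; the array names are illustrative and not taken from the class.

```python
import numpy as np

n_samples = 3

# Hypothetical feature blocks: 6 descriptive columns, POS group disabled, 1 NER column
descriptive = np.ones((n_samples, 6))
pos_disabled = np.empty((n_samples, 0))  # zero columns, but one row per document
ner = np.ones((n_samples, 1))

combined = np.hstack((descriptive, pos_disabled, ner))
print(combined.shape)

# A (0, 0) placeholder has no rows, so it does not line up with the other blocks
try:
    np.hstack((descriptive, np.empty((0, 0)), ner))
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)
```

The zero-column block contributes nothing to the result, so the combined array has 6 + 0 + 1 columns.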

In [None]:
featurizer = ManualFeatures(spacy_model='en_core_web_sm', batch_size=3)

In [None]:
X_train_features, feature_names = featurizer.fit_transform(cleaned_text)
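
Note that `transform` returns a tuple `(features, feature_names)`. That is convenient for building a DataFrame, but inside a scikit-learn `Pipeline` the next step would receive the whole tuple instead of the feature matrix. One possible workaround, sketched here under the assumption that we wrap the featurizer rather than change it, is a thin adapter that keeps the array and stores the names as an attribute. Both `StubFeaturizer` (a stand-in so the example runs without spaCy) and `ArrayOnly` are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class StubFeaturizer(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in that mimics the (features, names) return value."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # One made-up feature: the character length of each text
        features = np.array([[len(text)] for text in X], dtype=float)
        return features, ['count_characters']


class ArrayOnly(TransformerMixin, BaseEstimator):
    """Hypothetical adapter: returns only the feature matrix, stores the names."""

    def __init__(self, featurizer):
        self.featurizer = featurizer

    def fit(self, X, y=None):
        self.featurizer.fit(X, y)
        return self

    def transform(self, X):
        features, names = self.featurizer.transform(X)
        # Keep the names available for later inspection
        self.feature_names_ = names
        return features


wrapped = ArrayOnly(StubFeaturizer())
X_demo = wrapped.fit_transform(['good movie', 'bad'])
print(X_demo.shape, wrapped.feature_names_)
```

In the same way, `ArrayOnly(ManualFeatures(spacy_model='en_core_web_sm'))` could presumably be used as a pipeline step, although that combination is not tested here.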

In [None]:
pd.DataFrame(X_train_features, columns=feature_names)

,count_words,count_characters,count_characters_no_space,avg_word_length,count_numbers,count_sentences,noun_count,aux_count,verb_count,adj_count,ner
0,22.0,107.0,86.0,3.73913,2.0,3.0,6.0,2.0,1.0,4.0,2.0
1,28.0,170.0,143.0,4.931034,2.0,3.0,8.0,1.0,4.0,1.0,3.0
2,106.0,687.0,584.0,5.457944,0.0,5.0,40.0,5.0,11.0,12.0,8.0
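
As a sanity check on the output, `avg_word_length` is computed in the class as `count_characters_no_space / (count_words + 1)`. The `+ 1` in the denominator presumably guards against division by zero for empty texts, which means the value slightly underestimates the true average. Punctuation characters also count in the numerator, while punctuation tokens are excluded from `count_words`. The short sketch below recomputes the column from the three rows shown above; the numbers are copied from the table.

```python
# (count_words, count_characters_no_space, avg_word_length) copied from the output above
rows = [
    (22, 86, 3.73913),
    (28, 143, 4.931034),
    (106, 584, 5.457944),
]

for count_words, count_chars_no_space, reported in rows:
    # Same formula as in get_text_descriptive_features
    recomputed = count_chars_no_space / (count_words + 1)
    print(round(recomputed, 6), reported)
```

Each recomputed value agrees with the reported one to the displayed precision.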
