<a href="https://colab.research.google.com/github/glitcher007/NLP/blob/main/NLP_pre_processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To remove HTML tags from a string in NLP preprocessing, you can use various methods. One common approach is to use regular expressions. Here's a simple example in Python using the re module:

In [1]:
import re

def remove_html_tags(text):
    clean_text = re.sub('<.*?>', '', text)
    return clean_text

# Example usage:
html_string = "<p>This is an <b>example</b> HTML string.</p>"
cleaned_text = remove_html_tags(html_string)
print(cleaned_text)


This is an example HTML string.


This remove_html_tags function uses the re.sub method to replace all occurrences of HTML tags with an empty string. The regular expression <.*?> matches any HTML tag and the .*? part ensures a non-greedy match, so it stops at the first closing angle bracket.

Note that while this method can work for simple cases, it might not handle all edge cases in HTML parsing. For more robust HTML processing, you might want to consider using an HTML parser like BeautifulSoup:

In [2]:
from bs4 import BeautifulSoup

def remove_html_tags_bs4(text):
    soup = BeautifulSoup(text, 'html.parser')
    clean_text = soup.get_text()
    return clean_text

# Example usage:
html_string = "<p>This is an <b>example</b> HTML string.</p>"
cleaned_text = remove_html_tags_bs4(html_string)
print(cleaned_text)


This is an example HTML string.


In this example, BeautifulSoup is used to parse the HTML, and get_text() is used to extract the text content without HTML tags. This method is generally more robust and handles complex HTML structures better.

In [3]:
import re

def remove_urls(text):
    # This regular expression matches URLs
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub('', text)

# Example usage:
text_with_urls = "Check out this website: https://www.example.com for more information."
text_without_urls = remove_urls(text_with_urls)
print(text_without_urls)


Check out this website:  for more information.


In this example, the remove_urls function uses the re.sub method to replace URLs with an empty string. The regular expression https?://\S+|www\.\S+ is designed to match both HTTP and HTTPS URLs, as well as URLs starting with "www."

Keep in mind that removing URLs in this way might not handle all edge cases, and it's a relatively simple approach. If your text contains complex URL variations or if you need a more sophisticated solution, you might want to explore using specialized URL parsing libraries or additional pre-processing techniques.

In [4]:
import string

def remove_punctuation(text):
    # Create a translation table with None for each punctuation character
    translator = str.maketrans('', '', string.punctuation)
    # Use translate to remove punctuation
    text_without_punct = text.translate(translator)
    return text_without_punct

# Example usage:
text_with_punct = "Hello, World! This is an example text with some punctuation."
text_without_punct = remove_punctuation(text_with_punct)
print(text_without_punct)


Hello World This is an example text with some punctuation


In [None]:
import re

def remove_punctuation_regex(text):
    # Use regex to replace all punctuation characters with an empty string
    text_without_punct = re.sub(r'[^\w\s]', '', text)
    return text_without_punct

# Example usage:
text_with_punct = "Hello, World! This is an example text with some punctuation."
text_without_punct = remove_punctuation_regex(text_with_punct)
print(text_without_punct)


Both methods will remove common punctuation marks from the input string, leaving only the alphanumeric characters and spaces. Choose the method that best fits your needs or based on your preference for readability and simplicity.


Spelling correction in NLP involves identifying and correcting misspelled words in a text. One popular library for spelling correction in Python is textblob. Here's a simple example:

In [5]:
from textblob import TextBlob

def correct_spelling(text):
    blob = TextBlob(text)
    corrected_text = str(blob.correct())
    return corrected_text

# Example usage:
text_with_spelling_errors = "Ths is an example of a sentnce with sme speling mistake."
corrected_text = correct_spelling(text_with_spelling_errors)
print(corrected_text)


The is an example of a sentence with she spelling mistake.


n this example, the TextBlob library is used to create a text blob object, and the correct() method is applied to correct the spelling. Note that this is a basic example, and the accuracy of spelling correction may vary based on the context and the specific library or method used.

Keep in mind that there are other more sophisticated spelling correction models and libraries available, such as the symspellpy library and various language-specific models. These may provide better performance for specific use cases or domains.

Here's an example using the symspellpy library


Removing stop words is a common preprocessing step in NLP to filter out common words that do not contribute much to the meaning of a sentence. In Python, you can use the Natural Language Toolkit (nltk) library to remove stop words. Here's an example:

In [7]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

def remove_stop_words(text):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word.lower() not in stop_words]
    return ' '.join(filtered_text)

# Example usage:
text_with_stop_words = "This is an example sentence with some common stop words."
text_without_stop_words = remove_stop_words(text_with_stop_words)
print(text_without_stop_words)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


example sentence common stop words .


In this example, the stopwords.words('english') function from nltk is used to obtain a set of English stop words. The word_tokenize function is then used to tokenize the input text into words. Finally, a list comprehension is employed to filter out stop words, and the result is joined back into a string.

Keep in mind that the list of stop words can vary based on the language and the specific requirements of your task. If you're working with a language other than English, you can replace 'english' with the appropriate language code (e.g., 'spanish', 'french').

Adjust the code based on your specific needs, and consider the requirements of your NLP task when determining whether or not to remove stop words.

To remove emojis from a text in NLP preprocessing, you can use a regular expression. Emojis are often represented as Unicode characters, and you can match and remove them using regex. Here's an example in Python:

In [8]:
import re

def remove_emojis(text):
    # Emoji regex pattern
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F700-\U0001F77F"  # alchemical symbols
                               u"\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
                               u"\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
                               u"\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
                               u"\U0001FA00-\U0001FA6F"  # Chess Symbols
                               u"\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
                               u"\U00002702-\U000027B0"  # Dingbats
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)

    # Remove emojis
    text_without_emojis = emoji_pattern.sub(r'', text)
    return text_without_emojis

# Example usage:
text_with_emojis = "Hello! 😊 How are you today? 🌍"
text_without_emojis = remove_emojis(text_with_emojis)
print(text_without_emojis)


Hello!  How are you today? 


This code defines a remove_emojis function that uses a regular expression to match and remove emojis from the input text. Adjust the regex pattern as needed for your specific use case.

Keep in mind that emojis can convey valuable information, and removing them may not be suitable for all NLP tasks, particularly those involving sentiment analysis or emotion detection. Customize the code based on your specific requirements.

It seems there might be a slight error in your question, and I assume you're asking about "lemmatization" rather than "demonization." Lemmatization is a text normalization process in NLP that involves reducing words to their base or root form. It helps in grouping together different inflected forms of a word.

For lemmatization in Python, you can use the NLTK library, which provides a WordNetLemmatizer class. Here's an example:

In [9]:
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('wordnet')

def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text)
    lemmatized_tokens = [lemmatizer.lemmatize(word, get_pos(word)) for word in tokens]
    lemmatized_text = ' '.join(lemmatized_tokens)
    return lemmatized_text

def get_pos(word):
    # Get the part of speech for WordNet lemmatization
    pos_tag = nltk.pos_tag([word])[0][1][0].upper()
    pos_mapping = {'N': wordnet.NOUN, 'V': wordnet.VERB, 'R': wordnet.ADV, 'J': wordnet.ADJ}
    return pos_mapping.get(pos_tag, wordnet.NOUN)

# Example usage:
text_to_lemmatize = "The cats are running"


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


It appears you are referring to the process of "demojization" in the context of NLP. Demojization typically involves converting emojis into their textual representations. In Python, you can use the emoji library to perform demojization. If you don't have the library installed, you can install it using:

In [10]:
!pip install emoji


Collecting emoji
  Downloading emoji-2.9.0-py2.py3-none-any.whl (397 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m397.5/397.5 kB[0m [31m2.9 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: emoji
Successfully installed emoji-2.9.0


In [11]:
import emoji

def demojize_text(text):
    demojized_text = emoji.demojize(text)
    return demojized_text

# Example usage:
text_with_emojis = "Hello! 😊 How are you today? 🌍"
demojized_text = demojize_text(text_with_emojis)
print(demojized_text)


Hello! :smiling_face_with_smiling_eyes: How are you today? :globe_showing_Europe-Africa:


In this example, the emoji.demojize() function is applied to convert emojis into their textual representations. For instance, "😊" might be converted to ":smiling_face_with_smiling_eyes:".

Keep in mind that demojization might be useful in certain NLP tasks where you want to analyze or process text without considering the emojis. Adjust the code based on your specific needs and the requirements of your NLP task.