# NLP: Custom Text Preprocessing Function

In [1]:
# Importing the necessary libraries
from nltk.corpus import stopwords
import re

In [2]:
REPLACE_BY_SPACE_RE = re.compile(r"[/(){}\[\]\|@,;!]")
BAD_SYMBOLS_RE = re.compile(r"[^0-9a-z #+_]")
stop = set(stopwords.words('english'))

def text_preprocess(text):
    """Preprocess the input text and return the cleaned text.

    Args:
        text (str): Input string.

    Returns:
        str: The cleaned string.
    """
    # removing mentions and urls
    text = re.sub(r"(?:@|https?://)\S+", "", text)
    
    # lowercase text
    text = text.lower() 
    
    # removing digits
    text =  re.sub('[0-9]+', '', text)
    
    # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = REPLACE_BY_SPACE_RE.sub(" ", text) 
    
    # replace every character not allowed by BAD_SYMBOLS_RE with a space
    text = BAD_SYMBOLS_RE.sub(" ", text) 
    
    # delete stop words from text
    text = ' '.join([word for word in text.split() if word not in stop]) 
    
    # strip any white space characters
    text = text.strip()
    
    return text
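
To see what each step contributes, the following is a minimal sketch that applies the same patterns one at a time to a short string. The helper `trace_preprocess`, the `sample` sentence, and the small `DEMO_STOP` set are hypothetical and only for illustration; `DEMO_STOP` stands in for the NLTK stop-word list so the snippet runs without downloading the corpus.

```python
import re

# Same character classes as in the function above.
REPLACE_BY_SPACE_RE = re.compile(r"[/(){}\[\]|@,;!]")
BAD_SYMBOLS_RE = re.compile(r"[^0-9a-z #+_]")

# Hypothetical stand-in for set(stopwords.words('english')).
DEMO_STOP = {"the", "in", "a", "is", "at", "of", "and", "see", "for"}


def trace_preprocess(text):
    """Return a list of (step name, text after that step) pairs."""
    steps = []

    text = re.sub(r"(?:@|https?://)\S+", "", text)
    steps.append(("remove mentions/urls", text))

    text = text.lower()
    steps.append(("lowercase", text))

    text = re.sub(r"[0-9]+", "", text)
    steps.append(("remove digits", text))

    text = REPLACE_BY_SPACE_RE.sub(" ", text)
    steps.append(("punctuation to space", text))

    text = BAD_SYMBOLS_RE.sub(" ", text)
    steps.append(("bad symbols to space", text))

    text = " ".join(w for w in text.split() if w not in DEMO_STOP)
    steps.append(("remove stop words", text))

    return steps


sample = "See the C# docs (v2) for the ten-year plan at https://example.com, thanks @bob!"
for name, value in trace_preprocess(sample):
    print(f"{name:>22}: {value!r}")
```

The URL and mention pattern ends in `\S+`, so it also consumes punctuation attached to the match (the comma after the URL and the `!` after `@bob`). The final `split()`/`join` collapses the repeated spaces left behind by the earlier substitutions.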

### How the function works on a simple toy corpus

In [3]:
toy_corpus = """The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem.[2] However, real progress was much slower, and after the ALPAC report in 1966, which found that ten-year-long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted until the late 1980s when the first statistical machine translation systems were developed. refer the link https://pypi.org/project/nlppreprocess/#history"""

Comparing the number of characters before and after preprocessing (`len()` on a string counts characters, not words).

In [4]:
print("Toy Corpus Raw: " + str(len(toy_corpus)) + " characters", "\n\n", toy_corpus)

# using our function for preprocessing

cleaned2 = text_preprocess(toy_corpus)
print("\n\nToy Corpus Cleaned (Custom regex function): " + str(len(cleaned2)) + " characters", "\n\n", cleaned2)

Toy Corpus Raw: 654 characters 

 The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem.[2] However, real progress was much slower, and after the ALPAC report in 1966, which found that ten-year-long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted until the late 1980s when the first statistical machine translation systems were developed. refer the link https://pypi.org/project/nlppreprocess/#history


Toy Corpus Cleaned (Custom regex function): 442 characters 

 georgetown experiment involved fully automatic translation sixty russian sentences english authors claimed within three five years machine translation would solved problem however real progress much slower alpac report found ten year long research failed fu
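
The counts printed above come from `len()` applied to a string, which measures characters. If a word count is what you want, one simple approach is to split on whitespace first. The sketch below uses a hypothetical helper, `count_chars_and_words`, and a short sample string to show the difference.

```python
def count_chars_and_words(text):
    """Return (number of characters, number of whitespace-separated tokens)."""
    return len(text), len(text.split())


sample = "The Georgetown experiment in 1954"
chars, words = count_chars_and_words(sample)
print(f"{chars} characters, {words} words")  # 33 characters, 5 words
```

Whitespace splitting is only a rough tokenizer: punctuation stays attached to words, so it suits cleaned text better than raw text.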