# CLEANING TEXTS USING REGEX AND NLTK

Data cleaning is one of the most important steps when working with textual data, as you can't go straight from source text to fitting a machine learning or deep learning model. 
Cleaning text can be a difficult task, and there are several ways to do it. This notebook shows some functions from the ***re*** (regular expression) and ***nltk*** (Natural Language Toolkit) libraries that you can use to clean textual data. 

We will use the "Real/Fake Job Posting Prediction" dataset available on Kaggle. Details of this dataset are available at: [https://www.kaggle.com/shivamb/real-or-fake-fake-jobposting-prediction]. 

This dataset is also available in this repository, in the "datasets" folder. 

In [1]:
#LET'S START BY IMPORTING THE NECESSARY LIBRARIES 
import pandas as pd 
import numpy as np 
import re 
import nltk 
import string
from nltk.corpus import stopwords

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\luizh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [32]:
#IMPORT AND SELECT JOB DESCRIPTIONS  
df = pd.read_csv("./datasets/fake_job_postings.csv") 

texts = df.description.astype('str')

In [16]:
#HERE IS AN EXAMPLE WHERE WE CAN SEE THAT THE TEXTS HAVE SEVERAL CHARACTERS THAT CAN INTERFERE WITH A CLASSIFICATION MODEL
print(texts.iloc[1])

Organised - Focused - Vibrant - Awesome!Do you have a passion for customer service? Slick typing skills? Maybe Account Management? ...And think administration is cooler than a polar bear on a jetski? Then we need to hear you! We are the Cloud Video Production Service and opperating on a glodal level. Yeah, it's pretty cool. Serious about delivering a world class product and excellent customer service.Our rapidly expanding business is looking for a talented Project Manager to manage the successful delivery of video projects, manage client communications and drive the production process. Work with some of the coolest brands on the planet and learn from a global team that are representing NZ is a huge way!We are entering the next growth stage of our business and growing quickly internationally.  Therefore, the position is bursting with opportunity for the right person entering the business at the right time. 90 Seconds, the worlds Cloud Video Production Service - http://90#URL_fbe6559afac

In [33]:
#Now, let's start processing this data. 
#First, let's remove the links present in the data. For that, 
# let's create the following function:
def remove_url(text):
    comp = re.compile(r'https?://\S+|www\.\S+')
    return comp.sub(r'', text)

#Now, let's apply to the data 
texts = texts.map(lambda text: remove_url(text)) 

#As another example, let's remove HTML tags and extra whitespace from the texts
def spaces_html(text):
    #It is possible to use the sub function directly, without using the compile function
    text = re.sub(r'<.*?>', '', text) 
    text = re.sub(r"\s+", " ", text).strip()
    
    return text
#Applying to the data
texts = texts.map(lambda text: spaces_html(text))
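
To see what the two functions in the cell above do in isolation, here is a minimal, self-contained sketch on a made-up string. The sample text and the names `URL_PATTERN`, `TAG_PATTERN`, `SPACE_PATTERN`, and `clean_markup` are illustrative assumptions and are not part of the notebook or the dataset; the regular expressions themselves are the same ones used above.

```python
import re

# Same expressions as in the cell above, compiled once and kept in one place.
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
TAG_PATTERN = re.compile(r'<.*?>')
SPACE_PATTERN = re.compile(r'\s+')

def clean_markup(text):
    """Remove URLs and HTML tags, then collapse runs of whitespace."""
    text = URL_PATTERN.sub('', text)
    text = TAG_PATTERN.sub('', text)
    return SPACE_PATTERN.sub(' ', text).strip()

# Made-up sample text (assumption for illustration only).
sample = "Apply at <b>our site</b>:   https://example.com/jobs  or www.example.com today"
cleaned = clean_markup(sample)
print(cleaned)
```

One caveat worth keeping in mind: because `\S+` consumes everything up to the next whitespace character, any text glued directly to a URL is removed along with it.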

In [27]:
#Show a text of the data 
print(texts.iloc[1])

#We can see that it's much better, right?

Organised - Focused - Vibrant - Awesome!Do you have a passion for customer service? Slick typing skills? Maybe Account Management? ...And think administration is cooler than a polar bear on a jetski? Then we need to hear you! We are the Cloud Video Production Service and opperating on a glodal level. Yeah, it's pretty cool. Serious about delivering a world class product and excellent customer service.Our rapidly expanding business is looking for a talented Project Manager to manage the successful delivery of video projects, manage client communications and drive the production process. Work with some of the coolest brands on the planet and learn from a global team that are representing NZ is a huge way!We are entering the next growth stage of our business and growing quickly internationally. Therefore, the position is bursting with opportunity for the right person entering the business at the right time. 90 Seconds, the worlds Cloud Video Production Service - Seconds is the worlds Clou

In [34]:
#Ok, we can see that there are still some problems in our text above, 
# like punctuation, capital letters, and numbers. Let's fix it!

#First, we'll remove the numbers. 
texts = texts.map(lambda text: re.sub(r'\d+', '', text))

#Now, we'll take care of special characters and capital letters. 
def remove_punctuation(text):
    for char in string.punctuation:
        text = text.replace(char, ' ')
    #Lowercase the text and collapse any remaining extra spaces.
    return ' '.join(text.lower().split())
texts = texts.map(lambda text: remove_punctuation(text))
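
As a possible alternative to looping over `string.punctuation`, the same replacement can be done in a single pass with `str.translate`. The sketch below is only one way to do it; the names `PUNCT_TO_SPACE` and `remove_punctuation_translate` and the sample sentence are assumptions introduced for illustration.

```python
import string

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def remove_punctuation_translate(text):
    """Replace punctuation with spaces, lowercase, and collapse whitespace."""
    return ' '.join(text.translate(PUNCT_TO_SPACE).lower().split())

# Made-up sample text (assumption for illustration only).
sample = "Yeah, it's pretty cool... Awesome!Do you agree?"
result = remove_punctuation_translate(sample)
print(result)
```

Note that contractions are split by this step ("it's" becomes "it s"), which is also visible in the notebook output. In addition, `string.punctuation` only covers ASCII characters, so typographic quotes and similar symbols are left untouched.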

In [29]:
texts.iloc[1]

'organised focused vibrant awesome do you have a passion for customer service slick typing skills maybe account management and think administration is cooler than a polar bear on a jetski then we need to hear you we are the cloud video production service and opperating on a glodal level yeah it s pretty cool serious about delivering a world class product and excellent customer service our rapidly expanding business is looking for a talented project manager to manage the successful delivery of video projects manage client communications and drive the production process work with some of the coolest brands on the planet and learn from a global team that are representing nz is a huge way we are entering the next growth stage of our business and growing quickly internationally therefore the position is bursting with opportunity for the right person entering the business at the right time seconds the worlds cloud video production service seconds is the worlds cloud video production service 

In [35]:
#Okay, looks like we're ready? Not yet. 
# We still need to carry out an essential step for processing textual data: 
# removing stopwords. 
# Let's go!
#To inspect the stopword list, you can run: ", ".join(stopwords.words('english'))
STOPWORDS = set(stopwords.words('english'))

def remove_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])

texts = texts.map(lambda text: remove_stopwords(text))
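
The filtering logic can be illustrated without downloading anything by using a small hand-written stopword set. In the sketch below, `DEMO_STOPWORDS` and `remove_stopwords_demo` are hypothetical stand-ins; the notebook itself uses the full NLTK English list. Storing the words in a `set` keeps each membership check cheap.

```python
# A tiny, hand-picked stand-in for stopwords.words('english')
# (assumption for illustration; not the real NLTK list).
DEMO_STOPWORDS = {"do", "you", "have", "a", "for", "is", "than",
                  "on", "we", "to", "the", "it", "s"}

def remove_stopwords_demo(text, stopword_set=DEMO_STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return " ".join(word for word in text.split() if word not in stopword_set)

# Made-up sample, already lowercased and stripped of punctuation.
sample = "do you have a passion for customer service yeah it s pretty cool"
filtered = remove_stopwords_demo(sample)
print(filtered)
```

The comparison is case-sensitive, so a capitalized token such as "The" would not be removed here. This is why lowercasing the text before removing stopwords, as done in the previous cell, matters.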

In [41]:
print(texts.iloc[1])

organised focused vibrant awesome passion customer service slick typing skills maybe account management think administration cooler polar bear jetski need hear cloud video production service opperating glodal level yeah pretty cool serious delivering world class product excellent customer service rapidly expanding business looking talented project manager manage successful delivery video projects manage client communications drive production process work coolest brands planet learn global team representing nz huge way entering next growth stage business growing quickly internationally therefore position bursting opportunity right person entering business right time seconds worlds cloud video production service seconds worlds cloud video production service enabling brands agencies get high quality online video content shot produced anywhere world fast affordable managed seamlessly cloud purchase publish seconds removes hassle cost risk speed issues working regular video production compa

In [None]:
#We can see above that our text is much cleaner than the original. 
# There are still several other ways to preprocess texts, such as removing 
# mentions and hashtags or applying a lemmatizer or stemmer, but we stop here for today, 
# and the challenge of applying these preprocessing steps is up to you :) 
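
As a starting point for that challenge, here is a minimal sketch of removing mentions and hashtags with `re`. The patterns, the function name `remove_mentions_hashtags`, and the sample text are all assumptions made for illustration; real data may need stricter patterns.

```python
import re

# Hypothetical patterns: "@" or "#" followed by one or more word characters.
MENTION_PATTERN = re.compile(r'@\w+')
HASHTAG_PATTERN = re.compile(r'#\w+')

def remove_mentions_hashtags(text):
    """Drop @mentions and #hashtags, then collapse leftover whitespace."""
    text = MENTION_PATTERN.sub('', text)
    text = HASHTAG_PATTERN.sub('', text)
    return ' '.join(text.split())

# Made-up sample text (assumption for illustration only).
sample = "Great role at @acme_corp! #hiring #remote apply now"
no_tags = remove_mentions_hashtags(sample)
print(no_tags)
```

A step like this would have to run before punctuation removal, because that step replaces `@` and `#` with spaces and leaves the rest of the mention or hashtag behind as a regular word. Also be aware that `@\w+` matches part of an e-mail address, so addresses should be handled first if they need to be preserved.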