# Preprocessing the data for Topic Modelling


---

![](https://www.iteratorshq.com/wp-content/uploads/2020/09/how_to_clean_data.jpg)

[Blog to read on Why Data Cleaning is Important](https://blog.nextpathway.com/5-reasons-why-data-cleaning-matters)

# Importing the libraries needed for Preprocessing and Loading data


---



In [None]:
# Importing the necessary libraries

import re
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords

# Downloading tokenizer and stopwords list
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True



---

# In this part we will load our data and display its contents
- If you need the data, [you can download it here](https://drive.google.com/file/d/10kdA6QxOZD6poqJ_H-F5WtyNAnd-UUvG/view?usp=sharing) and then upload it to Colab. You can also upload and use your own data.
---



In [None]:
data_df=pd.read_csv("/content/drive/MyDrive/final_wikidata_scrapped_data_csv")

In [None]:
data_df.shape

(398, 4)

In [None]:
data_df.columns

Index(['title', 'content', 'url', 'links'], dtype='object')

In [None]:
data_df

Unnamed: 0,title,content,url,links
0,Application security,Application security (short AppSec) includes a...,https://en.wikipedia.org/wiki/Application_secu...,"['Access control', 'Advanced persistent threat..."
1,Static application security testing,Static application security testing (SAST) is ...,https://en.wikipedia.org/wiki/Static_applicati...,"['Abstract syntax tree', 'Adobe Flash Player',..."
2,Web application firewall,A web application firewall (WAF) is a specific...,https://en.wikipedia.org/wiki/Web_application_...,"['Advanced persistent threat', 'Adware', 'Anom..."
3,Dynamic application security testing,A dynamic application security testing (DAST) ...,https://en.wikipedia.org/wiki/Dynamic_applicat...,"['AJAX', 'Adobe Flash', 'Arbitrary code execut..."
4,Dynamic application security testing,A dynamic application security testing (DAST) ...,https://en.wikipedia.org/wiki/Dynamic_applicat...,"['AJAX', 'Adobe Flash', 'Arbitrary code execut..."
...,...,...,...,...
393,Web application,A web application (or web app) is application ...,https://en.wikipedia.org/wiki/Web_application,"['3D computer graphics', 'AJAX', 'API economy'..."
394,Web3,Web3 (also known as Web 3.0 and sometimes styl...,https://en.wikipedia.org/wiki/Web3,"['Abra (company)', 'Airdrop (cryptocurrency)',..."
395,WS-Security,"Web Services Security (WS-Security, WSS) is an...",https://en.wikipedia.org/wiki/WS-Security,"['Advanced Message Queuing Protocol', 'Apache ..."
396,Web hosting service,A web hosting service is a type of Internet ho...,https://en.wikipedia.org/wiki/Web_hosting_service,"['ASP.NET', 'Active Server Pages', 'Afilias', ..."


In [None]:
data_df.isnull().sum()

title      0
content    0
url        0
links      0
dtype: int64

In [None]:
data_df.content

0      Application security (short AppSec) includes a...
1      Static application security testing (SAST) is ...
2      A web application firewall (WAF) is a specific...
3      A dynamic application security testing (DAST) ...
4      A dynamic application security testing (DAST) ...
                             ...                        
393    A web application (or web app) is application ...
394    Web3 (also known as Web 3.0 and sometimes styl...
395    Web Services Security (WS-Security, WSS) is an...
396    A web hosting service is a type of Internet ho...
397    In computing, a Trojan horse  is any malware t...
Name: content, Length: 398, dtype: object

In [None]:
data_df.title.value_counts()

Computer security                            8
McAfee                                       5
Public-key cryptography                      4
Internet security                            4
Directorate-General for External Security    4
                                            ..
ESET                                         1
Comodo Internet Security                     1
Symantec Endpoint Protection                 1
Personal data                                1
Trojan horse (computing)                     1
Name: title, Length: 335, dtype: int64
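
The counts above show that some titles appear more than once (rows 3 and 4 in the preview also look identical). If repeated articles are not wanted in the corpus, one possible approach is `drop_duplicates`. The sketch below uses a small made-up frame rather than the real data; `toy_df` and `deduped` are hypothetical names, and this step is not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for data_df (made-up rows, not the real dataset).
toy_df = pd.DataFrame({
    "title": ["DAST", "DAST", "Web3"],
    "content": ["dynamic testing", "dynamic testing", "web 3.0"],
})

# Keep the first occurrence of each (title, content) pair and renumber the index.
deduped = (
    toy_df.drop_duplicates(subset=["title", "content"], keep="first")
    .reset_index(drop=True)
)
print(len(toy_df), "->", len(deduped))
```

Whether duplicates should be dropped depends on how the data was scraped, so treat this as an option rather than a required step.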



---

# Now we will start with Preprocessing steps.
- Note that we will only preprocess the `content` column, i.e. the text data. The other columns, such as `url` and `links`, are not relevant to us for Topic Modelling.

---




In [None]:
data=data_df.content

In [None]:
data.shape

(398,)

In [None]:
data

0      Application security (short AppSec) includes a...
1      Static application security testing (SAST) is ...
2      A web application firewall (WAF) is a specific...
3      A dynamic application security testing (DAST) ...
4      A dynamic application security testing (DAST) ...
                             ...                        
393    A web application (or web app) is application ...
394    Web3 (also known as Web 3.0 and sometimes styl...
395    Web Services Security (WS-Security, WSS) is an...
396    A web hosting service is a type of Internet ho...
397    In computing, a Trojan horse  is any malware t...
Name: content, Length: 398, dtype: object

# Step 1: Converting the data to str format so we can carry out our work effectively

In [None]:
data = data.astype(str)

In [None]:
# checking that the first element of our Series is now a string
type(data[0])

str

# Step 2: Converting the text to lowercase

In [None]:
data = data.str.lower()
data

0      application security (short appsec) includes a...
1      static application security testing (sast) is ...
2      a web application firewall (waf) is a specific...
3      a dynamic application security testing (dast) ...
4      a dynamic application security testing (dast) ...
                             ...                        
393    a web application (or web app) is application ...
394    web3 (also known as web 3.0 and sometimes styl...
395    web services security (ws-security, wss) is an...
396    a web hosting service is a type of internet ho...
397    in computing, a trojan horse  is any malware t...
Name: content, Length: 398, dtype: object

# Step 3: Creating a function to remove punctuation from the text to clean the corpus

In [None]:
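# Punctuation characters to delete. The period is not in this set, so "3.0" is kept.
# A raw string is used so that the backslash is taken literally.
PUNCT_TO_REMOVE = r"""!"'()*+,-/:;<=>?[\]_`{|}~"""
def remove_punctuation(text):
    """custom function to remove the punctuation"""
    return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))

data = data.apply(remove_punctuation)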

In [None]:
data

0      application security short appsec includes all...
1      static application security testing sast is us...
2      a web application firewall waf is a specific f...
3      a dynamic application security testing dast is...
4      a dynamic application security testing dast is...
                             ...                        
393    a web application or web app is application so...
394    web3 also known as web 3.0 and sometimes styli...
395    web services security wssecurity wss is an ext...
396    a web hosting service is a type of internet ho...
397    in computing a trojan horse  is any malware th...
Name: content, Length: 398, dtype: object
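
To see what the `translate` call does on a single string, here is a small stand-alone sketch. It copies the same character set (written as a raw string) and runs it on a made-up sentence; `sample` is a hypothetical input. Note that the period is not in the set, so `3.0` survives, while the hyphen is deleted outright, which is why `ws-security` turns into `wssecurity` in the output above.

```python
# Same characters as PUNCT_TO_REMOVE in the cell above, as a raw string.
PUNCT_TO_REMOVE = r"""!"'()*+,-/:;<=>?[\]_`{|}~"""

# str.maketrans('', '', chars) builds a table that maps every character in
# `chars` to None; str.translate then deletes those characters.
table = str.maketrans('', '', PUNCT_TO_REMOVE)

sample = "web services security (ws-security, wss), also web 3.0!"
cleaned = sample.translate(table)
print(cleaned)  # web services security wssecurity wss also web 3.0
```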

- We also need to remove the special characters from the text corpus

In [None]:
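# Special characters to delete, including newlines and tabs.
# Note: these are deleted outright, not replaced with a space.
SPECIAL_CHARS_TO_REMOVE = "!@#$^%&\n\t"
def remove_special_characters(text):
    """custom function to remove the special characters"""
    return text.translate(str.maketrans('', '', SPECIAL_CHARS_TO_REMOVE))

data = data.apply(remove_special_characters)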

In [None]:
data

0      application security short appsec includes all...
1      static application security testing sast is us...
2      a web application firewall waf is a specific f...
3      a dynamic application security testing dast is...
4      a dynamic application security testing dast is...
                             ...                        
393    a web application or web app is application so...
394    web3 also known as web 3.0 and sometimes styli...
395    web services security wssecurity wss is an ext...
396    a web hosting service is a type of internet ho...
397    in computing a trojan horse  is any malware th...
Name: content, Length: 398, dtype: object
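
One thing to watch for in the cell above: `translate` deletes `\n` and `\t` without leaving a space, so two words separated only by a line break get glued together. If that matters for your corpus, a possible alternative (an assumption on our side, not what the notebook ran) is to replace whitespace runs with a single space first. `delete_special_chars` and `collapse_whitespace` are hypothetical helper names.

```python
import re

SPECIAL_CHARS = "!@#$^%&\n\t"

def delete_special_chars(text):
    """Same behaviour as the cell above: characters are deleted outright."""
    return text.translate(str.maketrans('', '', SPECIAL_CHARS))

def collapse_whitespace(text):
    """Replace any run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", text).strip()

sample = "application security\nincludes all tasks"
print(delete_special_chars(sample))                       # words around \n are joined
print(delete_special_chars(collapse_whitespace(sample)))  # words stay separate
```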

# Step 4: Removing the stop words from the corpus (optional)

- This step is optional.
- It depends on the task you are performing.
- Since we are performing topic modelling, we do not need to remove the stop words.
- We can remove the stop words if we need to, but it's not necessary. Since topic modelling mainly focuses on the structure of the words, it's better not to remove them.


In [None]:
# This cell builds the set of English stop words from NLTK.
# The commented-out function below would remove those stop words from our corpus;
# it is left disabled because we keep the stop words here.

STOPWORDS = set(stopwords.words('english'))
# def remove_stopwords(text):
#     """custom function to remove the stopwords"""
#     return " ".join([word for word in str(text).split() if word not in STOPWORDS])

# data = data.apply(remove_stopwords)

In [None]:
len(STOPWORDS)

179
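
If you do decide to remove stop words, the commented-out function above would work roughly like this. To keep the sketch runnable without downloading anything, it uses a tiny hand-written set (`TOY_STOPWORDS`, a hypothetical stand-in) instead of the 179 NLTK words.

```python
# Tiny made-up stop word set; the notebook's STOPWORDS comes from NLTK instead.
TOY_STOPWORDS = {"a", "is", "the", "of", "in", "any", "that"}

def remove_stopwords(text, stop_words=TOY_STOPWORDS):
    """Drop every whitespace-separated word that is in stop_words."""
    return " ".join(word for word in str(text).split() if word not in stop_words)

result = remove_stopwords("in computing a trojan horse is any malware")
print(result)  # computing trojan horse malware
```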

# Step 5: We can also tokenize the text using nltk (optional)

In [None]:
# This step is again optional; since we are going to perform topic modelling, we don't need to tokenize

# l=[]
# for i in range(len(data)):
#   data[i] = nltk.sent_tokenize(data[i])
#   l.append(data[i])
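
For reference, the commented-out code above would split each document into sentences with `nltk.sent_tokenize`. The sketch below is not NLTK: it is a rough regular-expression stand-in that only illustrates the idea of sentence and word tokenization, and both function names are hypothetical.

```python
import re

def simple_sent_tokenize(text):
    """Rough sentence split: break after '.', '?' or '!' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]

def simple_word_tokenize(text):
    """Rough word split: runs of word characters, keeping inner dots (e.g. 3.0)."""
    return re.findall(r"\w+(?:\.\w+)*", text)

doc = "web3 is also known as web 3.0. it is an idea for a new web."
sentences = simple_sent_tokenize(doc)
words = simple_word_tokenize(sentences[0])
print(sentences)
print(words)
```

A real tokenizer handles abbreviations and other edge cases that this sketch does not.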

# Step 6: Removing the Emoji from our corpus (optional)

In [None]:
def remove_emoji(string):
    """custom function to remove emoji and similar symbols"""
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # wide catch-all range (also covers many non-emoji characters)
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)

remove_emoji("game is on 🔥🔥")

'game is on '

In [None]:
remove_emoji("Hilarious😂")

'Hilarious'
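
One caveat with the pattern above: the last range, `\U000024C2-\U0001F251`, is very wide and also covers CJK characters, so non-emoji text in some languages would be stripped too. If your corpus is multilingual, a narrower pattern may be safer. The sketch below simply drops that range; `remove_emoji_narrow` is a hypothetical name, and the remaining ranges are not a complete emoji list.

```python
import re

# Same ranges as above, minus the very wide \U000024C2-\U0001F251 block.
NARROW_EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicator symbols (flags)
    "\U00002702-\U000027B0"  # dingbats
    "]+"
)

def remove_emoji_narrow(text):
    """Remove emoji in the ranges above while leaving other scripts alone."""
    return NARROW_EMOJI_PATTERN.sub("", text)

print(remove_emoji_narrow("game is on 🔥🔥"))
print(remove_emoji_narrow("日本語 😂"))
```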

# Step 7: Removal of URLs (optional)
- The next preprocessing step is to remove any URLs present in the corpus.
- Sometimes our corpus has some URLs in it.
- We may need to remove them for our further analysis.

In [None]:
def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)

In [None]:
text = "Driverless AI NLP blog post on https://www.h2o.ai/blog/detecting-sarcasm-is-difficult-but-ai-may-have-an-answer/"
remove_urls(text)

'Driverless AI NLP blog post on '

In [None]:
text = "Please refer to link http://lnkd.in/ecnt5yC for the paper"
remove_urls(text)

'Please refer to link  for the paper'

In [None]:
# Suppose there is no http or https in the URL. The function captures that case as well.
text = "Want to know more. Checkout www.h2o.ai for additional information"
remove_urls(text)

'Want to know more. Checkout  for additional information'
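
A practical note on ordering: if you want to strip URLs, it is safer to do so before the punctuation step, because removing `:` and `/` first leaves a fragment that the pattern no longer recognises. The sketch below repeats the two helpers from this notebook to show the difference on one made-up sentence.

```python
import re

PUNCT_TO_REMOVE = r"""!"'()*+,-/:;<=>?[\]_`{|}~"""
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def remove_punctuation(text):
    return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))

def remove_urls(text):
    return URL_PATTERN.sub('', text)

text = "please refer to link http://lnkd.in/ecnt5yC for the paper"

# URLs first, then punctuation: the link is removed cleanly.
urls_first = remove_punctuation(remove_urls(text))
# Punctuation first: the link becomes "httplnkd.inecnt5yC" and is no longer matched.
punct_first = remove_urls(remove_punctuation(text))

print(urls_first)
print(punct_first)
```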

# Step 8: Removing HTML tags (optional)

In [None]:
def remove_html(text):
    html_pattern = re.compile('<.*?>')
    return html_pattern.sub(r'', text)

text = """<div>
<h1> H2O</h1>
<p> AutoML</p>
<a href="https://www.h2o.ai/products/h2o-driverless-ai/"> Driverless AI</a>
</div>"""

print(remove_html(text))


 H2O
 AutoML
 Driverless AI
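
The regular expression above is fine for simple markup like this example. It does not decode entities such as `&amp;`, and since `.` does not match a newline, a tag that spans several lines is left in place. As a possible alternative, here is a minimal sketch built on the standard library's `html.parser`; `strip_tags` and `_TextExtractor` are hypothetical names.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects only the text between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text content; entities are already decoded by default.
        self.parts.append(data)

def strip_tags(html_text):
    """Return the text content of html_text with all tags removed."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)

print(strip_tags("<p>Hello <b>world</b> &amp; more</p>"))
```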



# Saving the Processed Data into new csv file using pandas

In [None]:
# checking the datatype of our data
type(data)

pandas.core.series.Series

In [None]:
# converting from series to dataframe
final_data = pd.DataFrame(data)

In [None]:
# Checking the final data datatype
type(final_data)

pandas.core.frame.DataFrame

In [None]:
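# Saving the preprocessed data into csv file format using the .to_csv() method
# to_csv() only writes a file when it is given a path; the file name below is an assumed placeholder
final_data.to_csv("preprocessed_data.csv", index=False)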
print("Done!!!")

Done!!!


![](https://c.tenor.com/I6bSd_xNoc0AAAAM/hooray-its-weekend.gif)