## Regex for Rule-Based Information Mining Systems

### Text Cleaning

**1. Removing a Specific String from the text**

In [1]:
import re 

text = "RT @Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r"
clean_text = re.sub(r"RT ", "", text)

print(clean_text)

@Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r
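
The pattern `"RT "` matches anywhere in the text, not only at the start. The sketch below uses a hypothetical tweet (not from the dataset) to show how anchoring the pattern with `^` limits the removal to a leading retweet marker:

```python
import re

# Hypothetical tweet where "RT " also occurs inside the body (in "START ").
text = "RT @user: START saving now #DeMonetization"

# Unanchored pattern: removes every "RT " occurrence, damaging "START ".
unanchored = re.sub(r"RT ", "", text)
print(unanchored)

# Anchored pattern: removes only a retweet marker at the start of the text.
anchored = re.sub(r"^RT\s+", "", text)
print(anchored)
```

The anchored form is usually the safer choice when the marker is only expected at the beginning of a tweet.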


**2. Removing Specific Symbols like <U+...>**

In [2]:
text = "@Jaggesh2 Bharat band on 28??<ed><U+00A0><U+00BD><ed><U+00B8><U+0082>Those who  are protesting #demonetization  are all different party leaders"
clean_text = re.sub(r"<U\+[A-Z0-9]+>", "", text)

print(clean_text)

@Jaggesh2 Bharat band on 28??<ed><ed>Those who  are protesting #demonetization  are all different party leaders


Note: `+` is a regex quantifier ("one or more"), so it cannot be used directly to match a literal plus sign; it must be escaped with a backslash (`\+`).
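
As a small sketch of the escaping rule, `re.escape()` can add the backslashes for a literal string automatically. The second substitution assumes that the leftover `<ed>` markers are unwanted as well, which the original example does not state; the input is a shortened version of the tweet above:

```python
import re

# Shortened version of the tweet used in the example above.
text = "Bharat band on 28??<ed><U+00A0><U+00BD><ed><U+00B8><U+0082>Those who are protesting"

# re.escape() backslash-escapes regex metacharacters such as "+",
# so a literal string can be used safely as a pattern.
literal = "<U+00A0>"
without_literal = re.sub(re.escape(literal), "", text)
print(without_literal)

# Assumption: "<ed>" markers are debris too.
# The alternation removes "<ed>" and any four-digit "<U+XXXX>" code.
clean_text = re.sub(r"<ed>|<U\+[0-9A-F]{4}>", "", text)
print(clean_text)
```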

**3. Removing URLs and other information enclosed in <>**

In [3]:
text = """It is raining heavily today and I am not sure if I will be able to travel.
Can we postpone our meeting. Hope it is fine with you :) I am sending the new meeting invite on 
<a href= "www.example.com"> this link </a> """

In [4]:
print(text)

It is raining heavily today and I am not sure if I will be able to travel.
Can we postpone our meeting. Hope it is fine with you :) I am sending the new meeting invite on 
<a href= "www.example.com"> this link </a> 


In [5]:
import re

text = re.sub(r"<[^<]+?>", "", text)

print(text)

It is raining heavily today and I am not sure if I will be able to travel.
Can we postpone our meeting. Hope it is fine with you :) I am sending the new meeting invite on 
 this link  
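
Stripping the tags leaves stray spaces and the line break behind. One possible follow-up step, shown here as a sketch on a shortened copy of the text, collapses runs of whitespace into single spaces:

```python
import re

# Shortened copy of the text from the example above.
html_text = 'I am sending the new meeting invite on \n<a href= "www.example.com"> this link </a> '

# Same tag-stripping pattern as above.
no_tags = re.sub(r"<[^<]+?>", "", html_text)

# Collapse runs of whitespace (spaces, newlines) and trim both ends.
normalized = re.sub(r"\s+", " ", no_tags).strip()
print(normalized)
```

Note that this also removes the line breaks, which may or may not be wanted depending on the downstream task.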


### Text Data Extraction

1. Extracting the platform (Android, Web, Google, etc.) from the top platforms with more than 100 tweets

In [None]:
def platform_type(x):
    ser = re.search( r"android|iphone|web|windows|mobile|google|facebook|ipad|tweetdeck|onlywire", x, re.IGNORECASE)
    if ser:
        return ser.group()
    else:
        return None

# reset the index of the series and keep the platform names
top_platforms = top_platforms.reset_index()["index"]

# extract the platform type from each entry
top_platforms.apply(platform_type)

Note: `top_platforms` is the series of tweet sources with the highest number of tweets (more than 100 each); it is not defined in this section.
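
Because `top_platforms` is not defined here, the following self-contained sketch applies the same pattern to a plain list of source strings. The strings in `sample_sources` are hypothetical stand-ins, not values from the dataset:

```python
import re

PLATFORM_PATTERN = re.compile(
    r"android|iphone|web|windows|mobile|google|facebook|ipad|tweetdeck|onlywire",
    re.IGNORECASE,
)

def platform_type(source):
    """Return the first platform keyword found in source, or None."""
    match = PLATFORM_PATTERN.search(source)
    return match.group() if match else None

# Hypothetical source strings; the real values come from the tweet dataset.
sample_sources = [
    "Twitter for Android",
    "Twitter Web Client",
    "TweetDeck",
    "Some Unknown App",
]

platforms = [platform_type(s) for s in sample_sources]
print(platforms)
# ['Android', 'Web', 'TweetDeck', None]
```

`group()` returns the text as it appears in the input, so calling `.lower()` on the result may be useful when the platforms are counted afterwards.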

2. Extracting hashtags from the tweets

In [4]:
text = "RT @Atheist_Krishna: The effect of #Demonetization !!\r\n. https://t.co/A8of7zh2f5"
hashtag = re.search(r"#\w+", text)

print(hashtag.group())

#Demonetization


Note: Hashtags usually convey important information in social media text.

In [5]:
# In case the text contains more than one hashtag

text = """RT @kapil_kausik: #Doltiwal I mean #JaiChandKejriwal is "hurt" by #Demonetization as the same has rendered USELESS <ed><U+00A0><U+00BD><ed><U+00B1><U+0089> "acquired funds" No wo"""
hashtags = re.findall(r"#\w+", text)

print(hashtags)

['#Doltiwal', '#JaiChandKejriwal', '#Demonetization']
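
Once hashtags can be extracted from a single tweet, the same pattern can be run over many tweets to see which hashtags dominate. The sketch below uses shortened versions of the tweets from this section and lowercases each tag, which is an assumption about how case variants should be merged:

```python
import re
from collections import Counter

# Shortened versions of tweets used earlier in this section.
tweets = [
    "RT @Atheist_Krishna: The effect of #Demonetization !!",
    'RT @kapil_kausik: #Doltiwal I mean #JaiChandKejriwal is "hurt" by #Demonetization',
    "Those who are protesting #demonetization are all different party leaders",
]

# Count every hashtag across all tweets, ignoring case differences.
hashtag_counts = Counter(
    tag.lower()
    for tweet in tweets
    for tag in re.findall(r"#\w+", tweet)
)
print(hashtag_counts.most_common(1))
# [('#demonetization', 3)]
```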


3. Splitting sentences into tokens based on delimiters

In [6]:
sentence = "Football,Cricket;Golf Tennis"

tokens = re.split(r"[,;\s]", sentence)

print(tokens)

['Football', 'Cricket', 'Golf', 'Tennis']
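
The character class above matches exactly one delimiter at a time, so adjacent delimiters (for example a comma followed by a space) produce empty strings in the result. A minimal sketch with a hypothetical input shows the effect of adding `+` to the class:

```python
import re

# Hypothetical input with adjacent delimiters.
sentence = "Football, Cricket;; Golf   Tennis"

# Single-character class: adjacent delimiters yield empty strings.
with_empties = re.split(r"[,;\s]", sentence)
print(with_empties)

# Adding "+" treats a run of delimiters as one separator.
tokens = re.split(r"[,;\s]+", sentence)
print(tokens)
```

A delimiter at the very start or end of the input would still produce an empty string, so stripping the input first may be needed.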


**Important Additional Code is Available in Notebooks**

    1. Checking the Top Sources having more than 100 Tweets :: Regular Expressions in Action