## Regular Expressions in Action
We have already seen in the module that a well-chosen regular expression can handle complex text manipulation with a fairly simple pattern. In this notebook we will explore that on a real-world dataset. This exercise will introduce some practical applications of regex and make you comfortable writing the code for them, which will come in handy in the **Social Media Information Extraction** project you will be doing in module 8.

### Table of Contents
 1. About the Dataset
 2. Regex for Cleaning Text Data 
 3. Regex for Text Data Extraction
 4. Regex Challenge


### 1. About the Dataset

The dataset that we are going to use is the same dataset of tweets from Twitter that will be used in module 8 for **Social Media Information Extraction**. You can download it from [here](https://s3.amazonaws.com/thinkific/file_uploads/118220/attachments/f3a/dcc/62a/Regular_Expressions_in_Action.zip).
Let's load the dataset using pandas and have a quick look at some sample tweets. 

In [1]:
#Load the dataset
import pandas as pd 
dataset = pd.read_csv("tweets.csv", encoding = "ISO-8859-1")

dataset.head()

Unnamed: 0.1,Unnamed: 0,X,text,favorited,favoriteCount,replyToSN,created,truncated,replyToSID,id,replyToUID,statusSource,screenName,retweetCount,isRetweet,retweeted
0,1,1,RT @rssurjewala: Critical question: Was PayTM ...,False,0,,2016-11-23 18:40:30,False,,8.014957e+17,,"<a href=""http://twitter.com/download/android"" ...",HASHTAGFARZIWAL,331,True,False
1,2,2,RT @Hemant_80: Did you vote on #Demonetization...,False,0,,2016-11-23 18:40:29,False,,8.014957e+17,,"<a href=""http://twitter.com/download/android"" ...",PRAMODKAUSHIK9,66,True,False
2,3,3,"RT @roshankar: Former FinSec, RBI Dy Governor,...",False,0,,2016-11-23 18:40:03,False,,8.014955e+17,,"<a href=""http://twitter.com/download/android"" ...",rahulja13034944,12,True,False
3,4,4,RT @ANI_news: Gurugram (Haryana): Post office ...,False,0,,2016-11-23 18:39:59,False,,8.014955e+17,,"<a href=""http://twitter.com/download/android"" ...",deeptiyvd,338,True,False
4,5,5,RT @satishacharya: Reddy Wedding! @mail_today ...,False,0,,2016-11-23 18:39:39,False,,8.014954e+17,,"<a href=""http://cpimharyana.com"" rel=""nofollow...",CPIMBadli,120,True,False


As can be seen above, the **text** column is of interest to us because it contains the tweet itself. You don't have to worry about the other columns at this point; they will be handled in future modules. Let's go ahead and inspect some of the tweets.

In [2]:
for index, tweet in enumerate(dataset["text"][10:15]):
    print(index+1,".",tweet)

1 . Many opposition leaders are with @narendramodi on the #Demonetization 
And respect their decision,but support opposition just b'coz of party
2 . RT @Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r
3 . @Jaggesh2 Bharat band on 28??<ed><U+00A0><U+00BD><ed><U+00B8><U+0082>Those who  are protesting #demonetization  are all different party leaders.
4 . RT @Atheist_Krishna: The effect of #Demonetization !!
. https://t.co/A8of7zh2f5
5 . RT @sona2905: When I explained #Demonetization to myself and tried to put it down in my words which are not laced with any heavy technical


**Note: Noise present in tweets**

If you look closely, you'll see that there are some words like `RT` repeating in tweets. Also, there are strange symbols like `<U+00A0>` in tweet 3. This is the noise that is present in our dataset that we need to get rid of in order to do anything meaningful. 

### 2. Regex for Cleaning Text Data

Now that we know our tweets dataset is full of noise, we can use Python's `re` module to clean it up.

#### a. Removing `RT`
RT means that the given tweet is a retweet of another. That is useful information, but it is already present in the **isRetweet** column of our dataset, so we can remove it from the text.


In [3]:
import re 

text = "RT @Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r"
clean_text = re.sub(r"RT ", "", text)

print("Text before:\n", text)
print("Text after:\n", clean_text)

Text before:
 RT @Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r
Text after:
 @Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r
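
One caveat about the pattern above: `r"RT "` matches those three characters anywhere in the text, including inside a longer word. The sketch below compares it with a pattern anchored to the start of the tweet. The second sample string is made up for illustration and is not taken from the dataset.

```python
import re

texts = [
    "RT @Joydas: Question in Narendra Modi App",
    "SMART move on #Demonetization",  # made-up example: "RT " occurs inside a word
]

# Unanchored: removes "RT " wherever it appears
loose = [re.sub(r"RT ", "", t) for t in texts]

# Anchored: removes "RT" plus following whitespace only at the start of the text
anchored = [re.sub(r"^RT\s+", "", t) for t in texts]

for before, l, a in zip(texts, loose, anchored):
    print(repr(before), "->", repr(l), "|", repr(a))
```

Whether the anchored form is the right choice depends on how retweets are written in your data; it assumes the marker always comes first.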


Here we accomplished the task with a simple pattern. Let's take an example where we'd need a deeper understanding of Regex operators.

#### b. Removing `<U+..>`-like symbols

If you look at tweet 3 in the example above, there are strange symbols of the form `<U+..>` all over the place. We need to come up with a general regular expression that covers all such symbols. Let's break it down.

 - Every symbol starts with `<U+`, so we can use that directly in our expression. However, `+` is a regex operator, so we need to escape it with a backslash to tell the regex engine that here `+` is a literal part of the pattern: `<U\+`.
 - The `<U+` is followed by one or more uppercase letters or digits, which can be represented by `[A-Z0-9]+`.
 - The symbol always ends with `>`.
 
Let's code it up. 

In [4]:
text = "@Jaggesh2 Bharat band on 28??<ed><U+00A0><U+00BD><ed><U+00B8><U+0082>Those who  are protesting #demonetization  are all different party leaders"
clean_text = re.sub(r"<U\+[A-Z0-9]+>", "", text)

print("Text before:\n", text)
print("Text after:\n", clean_text)

Text before:
 @Jaggesh2 Bharat band on 28??<ed><U+00A0><U+00BD><ed><U+00B8><U+0082>Those who  are protesting #demonetization  are all different party leaders
Text after:
 @Jaggesh2 Bharat band on 28??<ed><ed>Those who  are protesting #demonetization  are all different party leaders


**Note** that although we have gotten rid of the majority of the symbols, `<ed>` is still present. Removing it is left as an exercise for you to try out.

#### c. Fixing the `&` and `&amp;`

If you explore the tweets further, you'll see that `&amp;` appears in many tweets, for example:

*RT @kanimozhi: Ts is exactly what Pappu `&amp;` opposition has done to themselves by opposing #Demonetization Now none can stop Modi bandwagon ti…*

`&amp;` is actually the HTML entity (escape sequence) for `&`, which people often use in place of `and` on Twitter.

We can fix this in our text by using a simple expression.

In [5]:
text = "RT @harshkkapoor: #DeMonetization survey results after 24 hours 5Lacs opinions Amazing response &amp; Commitment in fight against Blackmoney"
clean_text = re.sub(r"&amp;", "&", text)

print("Text before:\n", text)
print("Text after:\n", clean_text)

Text before:
 RT @harshkkapoor: #DeMonetization survey results after 24 hours 5Lacs opinions Amazing response &amp; Commitment in fight against Blackmoney
Text after:
 RT @harshkkapoor: #DeMonetization survey results after 24 hours 5Lacs opinions Amazing response & Commitment in fight against Blackmoney
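
`&amp;` is only one of several HTML entities that can show up in scraped text. As a minimal sketch, the standard library's `html.unescape` decodes all of them at once, and it can be chained with the two earlier cleaning steps. The helper name `clean_tweet` and the sample string are assumptions made for illustration, and the anchored `^RT\s+` pattern is a stricter variant of the one used above.

```python
import html
import re

# Hypothetical helper: chains the three cleaning steps from this section
def clean_tweet(text):
    text = re.sub(r"^RT\s+", "", text)          # drop a leading retweet marker
    text = re.sub(r"<U\+[A-Z0-9]+>", "", text)  # drop <U+XXXX> symbols
    return html.unescape(text)                  # decode &amp;, &lt;, &gt;, ...

# Made-up sample string, not taken from the dataset
sample = "RT @user: Amazing response &amp; Commitment<U+00A0> &lt;3"
cleaned = clean_tweet(sample)
print(cleaned)
```

To clean the whole column, the same function could be applied with `dataset["text"].apply(clean_tweet)`.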


### 3. Regex for Text Data Extraction

#### a. Extracting platform type of tweets
- Apart from cleaning text data, regex can be used effectively to extract information from it. For example, we extracted dates from text in the video module. Regex can also be used creatively to build new features.

- Take the **statusSource** column in the dataset as an example. If you look closely, you will find that it tells you which platform (android/iphone/web/windows phone) was used for the given tweet. Information like this can be very useful for our machine learning model.

In [6]:
#List platforms that have more than 100 tweets
platform_count = dataset["statusSource"].value_counts()
top_platforms = platform_count.loc[platform_count>100]
top_platforms

<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>    7642
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                      2548
<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>      2093
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      492
<a href="https://mobile.twitter.com" rel="nofollow">Twitter Lite</a>                     263
<a href="https://mobile.twitter.com" rel="nofollow">Mobile Web (M5)</a>                  178
<a href="http://www.facebook.com/twitter" rel="nofollow">Facebook</a>                    167
<a href="http://twitter.com/#!/download/ipad" rel="nofollow">Twitter for iPad</a>        165
<a href="http://www.twitter.com" rel="nofollow">Twitter for Windows Phone</a>            139
<a href="http://onlywire.com/" rel="nofollow">OnlyWire / Official App</a>                136
<a href="http://www.twitter.com" rel="nofollow">Twitter for Windows</a

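These are the platforms with more than 100 tweets each. Each entry is an HTML `<a>..</a>` tag, so we can use regex to pull a platform keyword out of it. Note that the pattern below is searched across the whole string, so the first match may come from the URL in `href` rather than from the link text. Let's extract our platform names.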

In [7]:
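def platform_type(x):
    #Return the first platform keyword found anywhere in the string (case-insensitive), else None
    match = re.search(r"android|iphone|web|windows|mobile|google|facebook|ipad|tweetdeck|onlywire", x, re.IGNORECASE)
    return match.group() if match else None

#Build a series from the HTML source strings stored in the index
#(the column name produced by reset_index() differs across pandas versions)
top_platforms = pd.Series(top_platforms.index, name="index")

#extract platform types
top_platforms.apply(platform_type)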

0       android
1           Web
2        iphone
3     tweetdeck
4        mobile
5        mobile
6      facebook
7          ipad
8       Windows
9      onlywire
10      Windows
11       mobile
12       google
Name: index, dtype: object
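
The keyword search above returns whichever keyword appears first in the string. If you want the readable label that sits between the `<a>` and `</a>` tags instead, a capturing group can pull it out. This is a minimal sketch that assumes every source string has the simple `<a ...>label</a>` shape seen in the output above; the variable names are illustrative.

```python
import re

source = '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>'

# Capture everything between the ">" that closes the opening tag and "</a>"
match = re.search(r">([^<]+)</a>", source)
label = match.group(1) if match else None

print(label)
```

`group(1)` returns only the text matched inside the parentheses, while `group()` would return the whole match including `>` and `</a>`.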

#### b. Extracting hashtags from the tweets

Hashtags usually convey important information in social media related texts. Using regex, we can easily extract hashtags from each tweet. 

In [8]:
text = "RT @Atheist_Krishna: The effect of #Demonetization !!\r\n. https://t.co/A8of7zh2f5"
hashtag = re.search(r"#\w+", text)

print("Tweet:\n", text)
print("Hashtag:\n", hashtag.group())

Tweet:
 RT @Atheist_Krishna: The effect of #Demonetization !!
. https://t.co/A8of7zh2f5
Hashtag:
 #Demonetization


Notice that a tweet can contain more than one hashtag. `re.search()` returns only the first match, so this is where we can take advantage of `re.findall()`.

In [9]:
text = """RT @kapil_kausik: #Doltiwal I mean #JaiChandKejriwal is "hurt" by #Demonetization as the same has rendered USELESS <ed><U+00A0><U+00BD><ed><U+00B1><U+0089> "acquired funds" No wo"""
hashtags = re.findall(r"#\w+", text)

print("Tweet:\n", text)
print("Hashtag:\n", hashtags)

Tweet:
 RT @kapil_kausik: #Doltiwal I mean #JaiChandKejriwal is "hurt" by #Demonetization as the same has rendered USELESS <ed><U+00A0><U+00BD><ed><U+00B1><U+0089> "acquired funds" No wo
Hashtag:
 ['#Doltiwal', '#JaiChandKejriwal', '#Demonetization']
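
One detail of `re.findall()` worth knowing: when the pattern contains a capturing group, it returns the group contents rather than the full matches. The sketch below uses this to drop the leading `#` and then lowercases the tags, since the dataset spells the same hashtag with different capitalization (for example `#Demonetization` and `#DeMonetization`). The sample text is a shortened version of the tweet above.

```python
import re

text = "RT @kapil_kausik: #Doltiwal I mean #JaiChandKejriwal is hurt by #Demonetization"

# No group: the full matches are returned, including "#"
with_hash = re.findall(r"#\w+", text)

# One capturing group: only the text inside the parentheses is returned
without_hash = re.findall(r"#(\w+)", text)

# Lowercase so that differently capitalized spellings compare equal
normalized = [tag.lower() for tag in without_hash]

print(with_hash)
print(normalized)
```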


Now that you have understood the core concepts of regular expressions and seen them in action, it's time to test what you have learned in the next section.

### 4. Regex Challenge

Use the regex concepts covered so far to solve the following tasks on your own.

**a. Removing URLs from tweets**

**Difficulty - Easy**

Many tweets contain one or more URLs in their `text`, and these don't necessarily provide useful information, so we can get rid of them. For example:

*@Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r*


We can very well remove the URL as it isn't providing much useful information.


In [10]:
#Your Code Here

**b. Extract Top 100 mentions**

**Difficulty - Medium**

Many of the retweets (RT) mention people in the form *@username*. For example, see the following tweet:

*@Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r*

Here *@Joydas* is a mention. You need to extract the mentions from all the tweets and find the top 100 usernames in terms of how often they are mentioned in the dataset.

In [None]:
#Your Code Here