
In [18]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Introduction

The dataset we will be working with is based on a CSV of Hacker News stories from September 2015 to September 2016. The columns in the dataset are explained below:

* id: The unique identifier from Hacker News for the story
* title: The title of the story
* url: The URL that the story links to, if the story has a URL
* num_points: The number of points the story acquired, calculated as the total number of upvotes minus the total number of downvotes
* num_comments: The number of comments that were made on the story
* author: The username of the person who submitted the story
* created_at: The date and time at which the story was submitted

In [19]:
import pandas as pd 
hn = pd.read_csv('/content/drive/MyDrive/my_datasets/Data Cleaning in Python Advanced/hacker_news.csv')

In [20]:
hn.head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52
1,11964716,Florida DJs May Face Felony for April Fools' W...,http://www.thewire.com/entertainment/2013/04/f...,2,1,vezycash,6/23/2016 22:20
2,11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Ent...,3,1,hswarna,6/17/2016 0:01
3,10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07ste...,8,2,walterbell,9/30/2015 4:12
4,10482257,Title II kills investment? Comcast and other I...,http://arstechnica.com/business/2015/10/comcas...,53,22,Deinos,10/31/2015 9:48


In [21]:
hn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20099 entries, 0 to 20098
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            20099 non-null  int64 
 1   title         20099 non-null  object
 2   url           17659 non-null  object
 3   num_points    20099 non-null  int64 
 4   num_comments  20099 non-null  int64 
 5   author        20099 non-null  object
 6   created_at    20099 non-null  object
dtypes: int64(3), object(4)
memory usage: 1.1+ MB


## Regular Expression Basics

### The Regular Expression Module

We're going to use the `re` module to find out how many story titles in our Hacker News dataset mention Python. We'll use a set, `[Pp]`, so the pattern matches both Python with a capital 'P' and python with a lowercase 'p'.

In [22]:
# import module re
import re 

# extract a list 'titles' containing all the titles from dataset
titles = hn['title'].tolist()

python_mentions = 0

pattern = '[Pp]ython'

for title in titles:
    if re.search(pattern, title):
        python_mentions += 1

print(python_mentions)

160
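
As a small illustration of what the loop above relies on, the sketch below runs `re.search` on a few made-up titles (the `sample_titles` list is an assumption, not data from the Hacker News file). `re.search` returns a match object when the pattern occurs anywhere in the string and `None` otherwise, which is why it can be used directly in an `if` test.

```python
import re

# Hypothetical titles for illustration only; not taken from the dataset.
sample_titles = [
    "Why I switched to python",
    "Python 3.5 released",
    "A guide to monty pythons",
    "Learning Ruby in a weekend",
]

pattern = r"[Pp]ython"

# re.search returns a match object (truthy) or None (falsy).
results = [re.search(pattern, title) for title in sample_titles]

# Pull out the matched text where there was a match.
matched_text = [m.group() if m else None for m in results]

print(matched_text)
```

Note that the third title matches because the pattern is found inside the longer word "pythons".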


### Counting Matches with pandas Methods

In [23]:
titles = hn['title']
pattern = '[Pp]ython'

python_mentions = titles.str.contains(pattern).sum()

print(python_mentions)

160
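
The `title` column has no missing values, but a column such as `url` does (17,659 non-null out of 20,099 rows). The sketch below, using a made-up series named `sample`, shows one way to keep the result of `str.contains` purely boolean in that situation by passing `na=False`.

```python
import pandas as pd

# Hypothetical series with a missing value, for illustration only.
sample = pd.Series(["Python tips", None, "python tricks", "Ruby news"])

# Without na=False the missing entry stays missing in the result;
# na=False treats it as "no match" so the mask is purely boolean.
mask = sample.str.contains(r"[Pp]ython", na=False)

# Summing a boolean mask counts the True values.
count = int(mask.sum())

print(count)
```

A purely boolean mask can also be used safely for selection, as in `sample[mask]`.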


### Using Regular Expressions to Select Data

In [24]:
titles = hn['title']

ruby_titles = titles[titles.str.contains('[Rr]uby')]

print(ruby_titles.head())

190                    Ruby on Google AppEngine Goes Beta
484          Related: Pure Ruby Relational Algebra Engine
1388    Show HN: HTTPalooza  Ruby's greatest HTTP clie...
1949    Rewriting a Ruby C Extension in Rust: How a Na...
2022    Show HN: CrashBreak  Reproduce exceptions as f...
Name: title, dtype: object


### Quantifiers

In [25]:
pattern = 'e-?mail'
email_bool = titles.str.contains(pattern)
email_count = email_bool.sum()
email_titles = titles[email_bool]

print(email_count)
print('\n')
print(email_titles.head())

86


119     Show HN: Send an email from your shell to your...
313         Disposable emails for safe spam free shopping
1361    Ask HN: Doing cold emails? helps us prove this...
1750    Protect yourself from spam, bots and phishing ...
2421                   Ashley Madison hack treating email
Name: title, dtype: object
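
The `?` quantifier in `e-?mail` makes the hyphen optional: it matches zero or one occurrence. As a minimal sketch, the made-up strings below show which spellings the pattern accepts. `re.fullmatch` is used here, as an illustrative choice, so that the whole string has to match.

```python
import re

pattern = r"e-?mail"

# Hypothetical test strings, not taken from the dataset.
candidates = ["email", "e-mail", "e--mail", "e mail"]

# fullmatch requires the entire string to match the pattern.
accepted = [s for s in candidates if re.fullmatch(pattern, s)]

print(accepted)
```

Two hyphens or a space are rejected, because `-?` allows at most one hyphen and nothing else between `e` and `mail`.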


### Character Classes 

In [26]:
pattern = r'\[\w+\]'
tag_titles = titles[titles.str.contains(pattern)]
tag_count = titles.str.contains(pattern).sum()

print(tag_count)
print('\n')
print(tag_titles.head())


444


66     Analysis of 114 propaganda sources from ISIS, ...
100    Munich Gunman Got Weapon from the Darknet [Ger...
159         File indexing and searching for Plan 9 [pdf]
162    Attack on Kunduz Trauma Centre, Afghanistan  I...
195               [Beta] Speedtest.net  HTML5 Speed Test
Name: title, dtype: object
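
To make the tag pattern more concrete: `\[` and `\]` match literal brackets, and `\w+` matches one or more word characters (letters, digits, or underscore). The following sketch tries the pattern on a few made-up strings; the `candidates` list is purely illustrative.

```python
import re

pattern = r"\[\w+\]"

# Hypothetical titles for illustration only.
candidates = [
    "Report [pdf]",
    "Empty []",
    "Talk [two words]",
    "Review [2015]",
    "Build [beta_1]",
]

# Keep the strings where a bracketed run of word characters is found.
matches = [s for s in candidates if re.search(pattern, s)]

print(matches)
```

Empty brackets fail because `+` needs at least one character, and `[two words]` fails because a space is not a word character.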


### Accessing the Matching Text with Capture Groups 

In [34]:
pattern = r'\[(\w+)\]'

tag_titles = titles.str.extract(pattern, expand=False)
tag_freq = tag_titles.value_counts(dropna=False)

print(tag_freq)
print('\n')
print(tag_titles.head())

NaN            19655
pdf              276
video            111
2015               3
audio              3
2014               2
beta               2
slides             2
Petition           1
Ubuntu             1
NSFW               1
Benchmark          1
transcript         1
USA                1
gif                1
1996               1
German             1
Map                1
Beta               1
song               1
React              1
JavaScript         1
CSS                1
detainee           1
Videos             1
Infograph          1
Python             1
coffee             1
Australian         1
crash              1
blank              1
Challenge          1
5                  1
HBR                1
Live               1
SPA                1
viz                1
ask                1
map                1
videos             1
png                1
ANNOUNCE           1
2008               1
repost             1
satire             1
GOST               1
Excerpt            1
comic        
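
The frequency table lists some tags more than once under different capitalization (for example `beta` and `Beta`, or `map` and `Map`). One possible follow-up, sketched here on a made-up series rather than the real titles, is to lowercase the extracted tags before counting them.

```python
import pandas as pd

# Hypothetical titles for illustration only.
sample = pd.Series(["Intro [Video]", "Paper [pdf]", "Demo [video]", "No tag here"])

# The parentheses capture only the text between the brackets.
tags = sample.str.extract(r"\[(\w+)\]", expand=False)

# Lowercasing merges variants such as "Video" and "video".
# value_counts() drops missing values by default.
tag_counts = tags.str.lower().value_counts()

print(tag_counts)
```

Whether merging case variants is appropriate depends on the analysis; this is only one way to tidy the counts.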

### Negative Character Classes 

A simple pattern such as `[Jj]ava` also matches titles that contain Java only as part of the word JavaScript. We want to exclude these titles so we get an accurate count.

Let's use the negative set [^Ss] to exclude instances like JavaScript and Javascript:

In [40]:
pattern = r'[Jj]ava[^Ss]'

java_titles = titles[titles.str.contains(pattern)]

print(java_titles.head())

436     Unikernel Power Comes to Java, Node.js, Go, an...
811     Ask HN: Are there any projects or compilers wh...
1840                    Adopting RxJava on the Airbnb App
1972          Node.js vs. Java: Which Is Faster for APIs?
2093                    Java EE and Microservices in 2016
Name: title, dtype: object


### Word Boundaries 

While the negative set was effective in removing bad matches that mention JavaScript, it also had the side effect of excluding any title where Java occurs at the end of the string, because `[^Ss]` must match one character after `Java`. For example, this title was missed:

Pippo  Web framework in Java

In [42]:
pattern = r'\b[Jj]ava\b'

java_titles = titles[titles.str.contains(pattern)]

print(java_titles.head())

436     Unikernel Power Comes to Java, Node.js, Go, an...
811     Ask HN: Are there any projects or compilers wh...
1023                         Pippo  Web framework in Java
1972          Node.js vs. Java: Which Is Faster for APIs?
2093                    Java EE and Microservices in 2016
Name: title, dtype: object
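
The difference between the two approaches can be seen side by side in the sketch below, which runs both patterns over a few made-up titles (the `samples` list is an assumption for illustration). The word boundary `\b` matches a position rather than a character, so it also succeeds at the very end of a string.

```python
import re

# Hypothetical titles for illustration only.
samples = [
    "Web framework in Java",
    "JavaScript tips",
    "Java EE in 2016",
    "RxJava on Android",
]

negative_set = r"[Jj]ava[^Ss]"   # needs one more character after "Java"
word_boundary = r"\b[Jj]ava\b"   # needs "Java" to stand alone as a word

with_negative_set = [s for s in samples if re.search(negative_set, s)]
with_word_boundary = [s for s in samples if re.search(word_boundary, s)]

print(with_negative_set)
print(with_word_boundary)
```

The word-boundary pattern picks up the title ending in Java and drops `RxJava`, which mirrors the change between the two outputs above.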


### Matching at the Start and End of Strings

In [50]:
pattern_beginning = r"^\[\w+\]"
beginning_count = titles.str.contains(pattern_beginning).sum()

pattern_ending = r"\[\w+\]$"
ending_count = titles.str.contains(pattern_ending).sum()

print(beginning_count)
print('\n')
print(ending_count)

15


417
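
As a minimal sketch of how the anchors behave, the example below applies the same two patterns to three made-up titles with the tag at the start, at the end, and in the middle. The `samples` list is an assumption for illustration.

```python
import re

# Hypothetical titles for illustration only.
samples = [
    "[pdf] Annual report",
    "Annual report [pdf]",
    "Annual [pdf] report",
]

# ^ anchors the match to the start of the string, $ to the end.
starts_with_tag = [bool(re.search(r"^\[\w+\]", s)) for s in samples]
ends_with_tag = [bool(re.search(r"\[\w+\]$", s)) for s in samples]

print(starts_with_tag)
print(ends_with_tag)
```

A tag in the middle of a title matches neither anchored pattern.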


### Challenge: Using Flags to Modify Regex Pattern

We can use **flags** to specify that our regular expression should ignore case. Here, `re.I` (short for `re.IGNORECASE`) makes the pattern match regardless of capitalization.

In [62]:
import re

email_tests = pd.Series(['email', 'Email', 'e Mail', 'e mail', 'E-mail',
              'e-mail', 'eMail', 'E-Mail', 'EMAIL', 'emails', 'Emails',
              'E-Mails'])

pattern = r'\be[\-\s]?mails?\b'

email_mentions = email_tests[email_tests.str.contains(pattern, flags=re.I)]

print(email_mentions)

0       email
1       Email
2      e Mail
3      e mail
4      E-mail
5      e-mail
6       eMail
7      E-Mail
8       EMAIL
9      emails
10     Emails
11    E-Mails
dtype: object
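
To show what the flag contributes, the sketch below runs the same pattern with and without `re.IGNORECASE` using the standard `re` module. The `samples` list is made up for illustration; `female` is included to show the effect of the leading `\b`.

```python
import re

pattern = r"\be[\-\s]?mails?\b"

# Hypothetical strings for illustration only.
samples = ["E-Mail", "EMAIL", "e mails", "female"]

# Without the flag, only the all-lowercase spelling matches.
without_flag = [s for s in samples if re.search(pattern, s)]

# With the flag, capitalization no longer matters.
with_flag = [s for s in samples if re.search(pattern, s, flags=re.IGNORECASE)]

print(without_flag)
print(with_flag)
```

In both runs `female` is rejected, since the `e` in it is not at the start of a word.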
