# REGEX MEMO

#### Character classes
- .	        any character except newline
- \w\d\s	word, digit, whitespace
- \W\D\S	not word, digit, whitespace
- [abc]	    any of a, b, or c
- [^abc]	not a, b, or c
- [a-g]	    character between a & g
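
The classes above can be tried with the standard re module. This is a minimal sketch; the sample string is made up for illustration.

```python
import re

# Hypothetical sample text, not taken from the dataset
sample = "Room 42b, gate C-7"

digits = re.findall(r"\d", sample)         # ['4', '2', '7']
words = re.findall(r"\w+", sample)         # ['Room', '42b', 'gate', 'C', '7']
a_to_g = re.findall(r"[a-g]", sample)      # ['b', 'g', 'a', 'e'] (lowercase only)
not_word = re.findall(r"[^\w\s]", sample)  # [',', '-'] (neither word nor whitespace)

print(digits, words, a_to_g, not_word)
```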

#### Anchors
- ^abc$	    start / end of the string
- \b\B	    word, not-word boundary
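
A small sketch of how the anchors behave, using a made-up string. Note that \B matches only where there is no word boundary, so it finds the cat inside concat.

```python
import re

# Hypothetical sample text
text = "cat concat cat"

at_start = re.findall(r"^cat", text)        # ['cat'] (only the first word)
at_end = re.findall(r"cat$", text)          # ['cat'] (only the last word)
whole_words = re.findall(r"\bcat\b", text)  # ['cat', 'cat'] (skips 'concat')
inside_word = re.findall(r"\Bcat\b", text)  # ['cat'] (the tail of 'concat')

print(at_start, at_end, whole_words, inside_word)
```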

#### Escaped characters
- \.\*\\	escaped special characters
- \t\n\r	tab, linefeed, carriage return

#### Groups & Lookaround
- (abc)	    capture group
- \1	    backreference to group #1
- (?:abc)	non-capturing group
- (?=abc)	positive lookahead
- (?!abc)	negative lookahead
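
The group and lookahead entries above can be illustrated as follows. The strings are invented for the example; only standard re calls are used.

```python
import re

# Capture group: group(1) holds what the parentheses matched
width = re.search(r"(\d+)px", "width: 120px").group(1)     # '120'

# Non-capturing group: groups without creating a numbered capture
repeats = re.findall(r"(?:ab)+", "ababab ab")              # ['ababab', 'ab']

# Backreference: \1 must repeat whatever group 1 matched
doubled = re.search(r"(\w)\1", "letter").group()           # 'tt'

# Positive lookahead: digits only when followed by 'px' (px is not consumed)
px_values = re.findall(r"\d+(?=px)", "10px 20em 30px")     # ['10', '30']

# Negative lookahead: 'q' only when NOT followed by 'u'
q_without_u = re.findall(r"q(?!u)", "qatar quit iraq")     # ['q', 'q']

print(width, repeats, doubled, px_values, q_without_u)
```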

#### Quantifiers & Alternation
- a*a+a?	0 or more, 1 or more, 0 or 1
- a{5}a{2,}	exactly five, two or more
- a{1,3}	between one & three
- a+?a{2,}?	match as few as possible
- ab|cd	    match ab or cd
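
The difference between greedy and lazy quantifiers is easiest to see side by side. A minimal sketch on a made-up HTML fragment:

```python
import re

# Hypothetical sample text
html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as possible, so one match spans the whole string
greedy = re.findall(r"<.+>", html)    # ['<b>bold</b> and <i>italic</i>']

# Lazy: .+? stops at the first '>' it can, giving one match per tag
lazy = re.findall(r"<.+?>", html)     # ['<b>', '</b>', '<i>', '</i>']

# Bounded repetition: a{1,3} cannot cover four a's in a full match
too_long = re.fullmatch(r"a{1,3}", "aaaa")   # None

# Alternation: either branch may match at each position
either = re.findall(r"ab|cd", "abcdab")      # ['ab', 'cd', 'ab']

print(greedy, lazy, too_long, either)
```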

To test a regex interactively, try [regexr.com](https://regexr.com/).

Below is a short refresher on using regex in Python with pandas.

In [1]:
import pandas as pd
import numpy as np

hn = pd.read_csv('hacker_news.csv')

In [2]:
hn

|   | id | title | url | num_points | num_comments | author | created_at |
|---|---|---|---|---|---|---|---|
| 0 | 12224879 | Interactive Dynamic Video | http://www.interactivedynamicvideo.com/ | 386 | 52 | ne0phyte | 8/4/2016 11:52 |
| 1 | 11964716 | Florida DJs May Face Felony for April Fools' W... | http://www.thewire.com/entertainment/2013/04/f... | 2 | 1 | vezycash | 6/23/2016 22:20 |
| 2 | 11919867 | Technology ventures: From Idea to Enterprise | https://www.amazon.com/Technology-Ventures-Ent... | 3 | 1 | hswarna | 6/17/2016 0:01 |
| 3 | 10301696 | Note by Note: The Making of Steinway L1037 (2007) | http://www.nytimes.com/2007/11/07/movies/07ste... | 8 | 2 | walterbell | 9/30/2015 4:12 |
| 4 | 10482257 | Title II kills investment? Comcast and other I... | http://arstechnica.com/business/2015/10/comcas... | 53 | 22 | Deinos | 10/31/2015 9:48 |
| 5 | 10557283 | Nuts and Bolts Business Advice |  | 3 | 4 | shomberj | 11/13/2015 0:45 |
| 6 | 12296411 | Ask HN: How to improve my personal website? |  | 2 | 6 | ahmedbaracat | 8/16/2016 9:55 |
| 7 | 11337617 | Shims, Jigs and Other Woodworking Concepts to ... | http://firstround.com/review/shims-jigs-and-ot... | 34 | 7 | zt | 3/22/2016 16:18 |
| 8 | 10379326 | That self-appendectomy | http://www.southpolestation.com/trivia/igy1/ap... | 91 | 10 | jimsojim | 10/13/2015 9:30 |
| 9 | 11370829 | Crate raises $4M seed round for its next-gen S... | http://techcrunch.com/2016/03/15/crate-raises-... | 3 | 1 | hitekker | 3/27/2016 18:08 |


In [3]:
import re

In [4]:
python_mentions = 0
pattern = '[pP]ython'
titles = hn['title']

for title in titles:
    if re.search(pattern, title):
        python_mentions += 1

python_mentions

160

In [5]:
v_python_mentions = titles.str.contains(pattern).sum()

v_python_mentions

160

In [6]:
ruby_titles = titles[titles.str.contains('[Rr]uby')]

ruby_titles.head()

190                    Ruby on Google AppEngine Goes Beta
484          Related: Pure Ruby Relational Algebra Engine
1388    Show HN: HTTPalooza  Ruby's greatest HTTP clie...
1949    Rewriting a Ruby C Extension in Rust: How a Na...
2022    Show HN: CrashBreak  Reproduce exceptions as f...
Name: title, dtype: object

In [16]:
email_bool = titles.str.contains('e-?mail')

In [17]:
email_counts = email_bool.sum()

email_counts

86

In [18]:
email_titles = titles[email_bool]

email_titles

119      Show HN: Send an email from your shell to your...
313          Disposable emails for safe spam free shopping
1361     Ask HN: Doing cold emails? helps us prove this...
1750     Protect yourself from spam, bots and phishing ...
2421                    Ashley Madison hack treating email
2685          Ask HN: Weather forecast in your email daily
3379            Ask HN: How do we solve the email problem?
3865     What Mailchimp does to make sure emails get de...
3889     Show HN: Do you know what emails your competit...
3921      Im killing most of my email capture. Here's why.
4219              Ask HN: One email, multiple team members
4318     How to get your email newsletter out of promot...
4322     SecureMyEmail  email client that automatically...
4577              Show HN: Gaggle Mail  Simple group email
4837     Ask HN: How do you manage per-service emails w...
4901     Show HN: GPG-Mailer  send GPG-encrypted emails...
5314     Ask HN: Has anybody built Tinder/Imgur style m.

In [20]:
# Raw string so that \[ and \w reach the regex engine unchanged
tag_pattern = r'\[\w+\]'

tag_titles = titles[titles.str.contains(tag_pattern)]

tag_count = tag_titles.shape[0]

tag_count

444

In [46]:
extract_pattern = r'\[(\w+)\]'

# expand=False returns a Series instead of a one-column DataFrame,
# so value_counts() can be called on the result directly
tags = titles.str.extract(extract_pattern, expand=False)
tag_freq = tags.value_counts().sort_values(ascending=False)

tag_freq

pdf            276
video          111
2015             3
audio            3
beta             2
2014             2
slides           2
SPA              1
Australian       1
Beta             1
blank            1
Ubuntu           1
React            1
Infograph        1
Skinnywhale      1
Python           1
song             1
much             1
JavaScript       1
HBR              1
NSFW             1
ask              1
2008             1
satire           1
videos           1
Live             1
repost           1
Excerpt          1
German           1
1996             1
crash            1
gif              1
ANNOUNCE         1
Benchmark        1
png              1
CSS              1
Videos           1
map              1
Challenge        1
transcript       1
survey           1
Petition         1
5                1
viz              1
GOST             1
coffee           1
Map              1
USA              1
detainee         1
updated          1
comic            1
SpaceX           1
Name: title,
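
The counts above treat tags such as beta and Beta, or Map and map, as different values. One possible refinement, sketched here with the standard library on a handful of invented titles, is to lowercase each tag before counting. The titles and the choice to normalize case are assumptions for illustration, not part of the original analysis.

```python
import re
from collections import Counter

# Hypothetical titles, not taken from hacker_news.csv
sample_titles = [
    "Deep learning tutorial [pdf]",
    "Keynote recording [video]",
    "[PDF] Compiler notes",
    "No tag here",
    "Another paper [pdf]",
]

tag_re = re.compile(r"\[(\w+)\]")

tags = []
for title in sample_titles:
    match = tag_re.search(title)
    if match:                      # titles without a tag are skipped
        tags.append(match.group(1).lower())

tag_counts = Counter(tags)
print(tag_counts.most_common())    # [('pdf', 3), ('video', 1)]
```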

In [47]:
def first_10_matches(pattern):
    results = titles[titles.str.contains(pattern)]
    return results.head(10)

In [50]:
java_pattern = r"\b[jJ]ava\b"

java_titles = titles[titles.str.contains(java_pattern)]

java_titles.shape[0]

54

In [53]:
beginning_count = titles[titles.str.contains(r"^\[\w+\]")].shape[0]
ending_count = titles[titles.str.contains(r"\[\w+\]$")].shape[0]

print("beginning count: ", beginning_count," ending count: " , ending_count)

beginning count:  15  ending count:  417


In [54]:
email_tests = pd.Series(['email', 'Email', 'e Mail', 'e mail', 'E-mail',
              'e-mail', 'eMail', 'E-Mail', 'EMAIL'])

In [66]:
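# r"e[\-\s]?mail" covers every variant below.
# r'e-?|\s?mail' does not work as intended: | has the lowest precedence,
# so it reads as "e-?" OR "\s?mail" and matches any string containing an "e".
email_mentions = email_tests.str.contains(r'e[\-\s]?mail', flags=re.I).sum()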

email_mentions

9
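
Alternation binds more loosely than everything else in a pattern, which is why an unparenthesized pattern such as r'e-?|\s?mail' over-matches. A quick check with the re module, using the nine variants from the cell above plus one unrelated word:

```python
import re

variants = ['email', 'Email', 'e Mail', 'e mail', 'E-mail',
            'e-mail', 'eMail', 'E-Mail', 'EMAIL']

good = re.compile(r"e[\-\s]?mail", re.I)
loose = re.compile(r"e-?|\s?mail", re.I)   # parsed as (e-?) | (\s?mail)

all_variants_match = all(good.search(v) for v in variants)   # True

# 'tree' has nothing to do with email, but the loose pattern accepts it
loose_false_positive = bool(loose.search("tree"))   # True
good_rejects = good.search("tree") is None          # True

print(all_variants_match, loose_false_positive, good_rejects)
```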

In [77]:
email_pattern = r"e[\-\s]?mail"

email_mentions = titles.str.contains(email_pattern,flags = re.I).sum()

print(email_mentions)

email_titlez = titles[titles.str.contains(email_pattern,flags = re.I)]

email_titlez

151


119      Show HN: Send an email from your shell to your...
161      Computer Specialist Who Deleted Clinton Emails...
174                                        Email Apps Suck
261      Emails Show Unqualified Clinton Foundation Don...
313          Disposable emails for safe spam free shopping
332                           Inky: Secure Email Made Easy
450      Mailtrain (the open source Mailchimp clone) is...
1361     Ask HN: Doing cold emails? helps us prove this...
1750     Protect yourself from spam, bots and phishing ...
1774     From Email Introductions to Addressing Diversi...
1900     Police Emails About Ahmed Mohamed: 'This Is Wh...
1956                   Email newsletters are the new zines
2018     Emails from a CEO Who Just Has a Few Changes t...
2421                    Ashley Madison hack treating email
2685          Ask HN: Weather forecast in your email daily
3181     Validating Email Addresses with a Regex? Do Yo...
3379            Ask HN: How do we solve the email proble

In [79]:
sql_count = titles.str.contains(r'SQL',flags=re.I).sum()

sql_count

108

In [81]:
hn_sql = hn[hn['title'].str.contains(r'\w+SQL',flags=re.I)].copy()

In [85]:
# expand=False returns a Series, which is what a single-column assignment expects
hn_sql['flavor'] = hn_sql['title'].str.extract(r'(\w+SQL)', flags=re.I, expand=False)

hn_sql['flavor'] = hn_sql['flavor'].str.lower()

In [86]:
sql_pivot = hn_sql.pivot_table(index='flavor', values='num_comments', aggfunc='mean')

In [95]:
python_pattern = r"([Pp]ython [\d\.]+)"

py_versions = hn['title'].str.extract(python_pattern,expand=False)

py_versions_freq = dict(py_versions.value_counts())

py_versions_freq


{'Python 3': 10,
 'Python 3.5': 3,
 'Python 2': 2,
 'Python 3.6': 2,
 'Python 2.7': 1,
 'Python 8': 1,
 'Python 1.5': 1,
 'python 2': 1,
 'Python 4': 1,
 'Python 3.5.0': 1}
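
Because [\d\.]+ accepts any run of digits and dots, it can also pick up a sentence-ending period. A stricter variant, sketched below on invented titles, requires every dot to be followed by a digit. Whether the extra strictness matters for this dataset is not checked here.

```python
import re

loose = re.compile(r"[Pp]ython [\d\.]+")
strict = re.compile(r"[Pp]ython \d+(?:\.\d+)*")   # dots only between digits

# Hypothetical title where the version number ends a sentence
title = "Moving to Python 3. Finally"

loose_match = loose.search(title).group()      # 'Python 3.'
strict_match = strict.search(title).group()    # 'Python 3'

# Multi-part versions are still captured in full
full_version = strict.search("Python 3.5.0 released").group()   # 'Python 3.5.0'

print(loose_match, strict_match, full_version)
```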

In [99]:
first_ten = titles[titles.str.contains(r'\b[Cc]\b[^\+\.]')].head(10)

first_ten

365                      The new C standards are worth it
444           Moz raises $10m Series C from Foundry Group
521          Fuchsia: Micro kernel written in C by Google
1307            Show HN: Yupp, yet another C preprocessor
1326                     The C standard formalized in Coq
1365                          GNU C Library 2.23 released
1429    Cysignals: signal handling (SIGINT, SIGSEGV, )...
1620                        SDCC  Small Device C Compiler
1949    Rewriting a Ruby C Extension in Rust: How a Na...
2195    MyHTML  HTML Parser on Pure C with POSIX Threa...
Name: title, dtype: object

In [100]:
pattern = r"(?<!Series\s)\b[cC]\b(?![\+\.])"
c_mentions = titles.str.contains(pattern).sum()
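
The pattern in this cell combines a negative lookbehind, (?<!Series\s), with a negative lookahead, (?![\+\.]). The sketch below runs it against a few titles to show what each part filters out. The first two titles come from the earlier output; the last two are made up for illustration.

```python
import re

pattern = re.compile(r"(?<!Series\s)\b[cC]\b(?![\+\.])")

checks = {
    "The new C standards are worth it": None,             # plain mention of C
    "Moz raises $10m Series C from Foundry Group": None,  # blocked by the lookbehind
    "Custom Deleters for C++ Smart Pointers": None,       # blocked by the lookahead (+)
    "C. elegans genome notes": None,                      # blocked by the lookahead (.)
}

for title in checks:
    checks[title] = bool(pattern.search(title))

for title, matched in checks.items():
    print(matched, "|", title)
```

Python's re module only accepts fixed-width lookbehinds; Series\s qualifies because it always spans exactly seven characters.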

In [103]:
pattern = r"\b(\w+)\s\1\b"

repeated_words = titles[titles.str.contains(pattern)]
repeated_words

3102                  Silicon Valley Has a Problem Problem
3176                Wire Wire: A West African Cyber Threat
3178                         Flexbox Cheatsheet Cheatsheet
4797                            The Mindset Mindset (2015)
7276     Valentine's Day Special: Bye Bye Tinder, Flirt...
10371    Mcdonalds copying cyriak  cows cows cows in th...
11575                                    Bang Bang Control
11901          Cordless Telephones: Bye Bye Privacy (1991)
12697          Solving the the Monty-Hall-Problem in Swift
15049    Bye Bye Webrtc2SIP: WebRTC with Asterisk and A...
15839          Intellij-Rust Rust Plugin for IntelliJ IDEA
Name: title, dtype: object
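
To see which word was repeated, the backreference pattern can be run through re.search and the captured group inspected. A minimal sketch; the first two titles come from the output above and the third is made up.

```python
import re

pattern = re.compile(r"\b(\w+)\s\1\b")

m1 = pattern.search("Solving the the Monty-Hall-Problem in Swift")
m2 = pattern.search("Intellij-Rust Rust Plugin for IntelliJ IDEA")

repeated_1 = m1.group(1)   # 'the'
repeated_2 = m2.group(1)   # 'Rust'

# The match is case-sensitive by default: 'The the' is not a repeat...
case_sensitive = pattern.search("The the end")              # None
# ...unless re.I is passed, which also applies to the backreference
case_insensitive = re.search(pattern.pattern, "The the end", re.I)

print(repeated_1, repeated_2, case_sensitive, case_insensitive.group())
```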

In [105]:
# Normalize every spelling of "email" to a single form
pattern = r"e[\-\s]?mail"

email_variations = pd.Series(['email', 'Email', 'e Mail',
                        'e mail', 'E-mail', 'e-mail',
                        'eMail', 'E-Mail', 'EMAIL'])

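# regex=True is needed in pandas 2.0+, where str.replace treats the pattern as a literal by default
email_uniform = email_variations.str.replace(pattern, "email", flags=re.I, regex=True)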

email_uniform

0    email
1    email
2    email
3    email
4    email
5    email
6    email
7    email
8    email
dtype: object
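
The same substitution works on a single string with the re module. The sketch below uses subn, which also reports how many replacements were made; the input sentence is invented.

```python
import re

pattern = re.compile(r"e[\-\s]?mail", re.I)

# Hypothetical sentence mixing several spellings
text = "Send E-Mail or e mail, never EMAIL"

cleaned, n_replaced = pattern.subn("email", text)

print(cleaned)      # Send email or email, never email
print(n_replaced)   # 3
```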

In [106]:
titles_clean = titles.str.replace(pattern, "email", flags=re.I, regex=True)

In [107]:
titles_clean

0                                Interactive Dynamic Video
1        Florida DJs May Face Felony for April Fools' W...
2             Technology ventures: From Idea to Enterprise
3        Note by Note: The Making of Steinway L1037 (2007)
4        Title II kills investment? Comcast and other I...
5                           Nuts and Bolts Business Advice
6              Ask HN: How to improve my personal website?
7        Shims, Jigs and Other Woodworking Concepts to ...
8                                   That self-appendectomy
9        Crate raises $4M seed round for its next-gen S...
10       Advertising Cannot Maintain the Internet. Here...
11                                          Coding Is Over
12       Show HN: Wio Link  ESP8266 Based Web of Things...
13                  Custom Deleters for C++ Smart Pointers
14              How often to update third party libraries?
15                        Review my AI based marketing bot
16       Ask HN: Am I the only one outraged by Twitter .

In [108]:
test_urls = pd.Series([
 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
 'http://www.interactivedynamicvideo.com/',
 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
 'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
 'HTTPS://github.com/keppel/pinn',
 'Http://phys.org/news/2015-09-scale-solar-youve.html',
 'https://iot.seeed.cc',
 'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
 'http://beta.crowdfireapp.com/?beta=agnipath',
 'https://www.valid.ly'
])

In [110]:
# Extract the domain from each URL; flags=re.I lets HTTPS:// and Http:// match too
# earlier attempt: r"\/\/([\w]+?[\.]?\w+\.\w+[\.]?[\w]+)\/?"
pattern = r"https?://([\w\.]+)"

test_url_clean = test_urls.str.extract(pattern, flags=re.I, expand=False)

test_url_clean

0                     www.amazon.com
1    www.interactivedynamicvideo.com
2                    www.nytimes.com
3                      evonomics.com
4                         github.com
5                           phys.org
6                       iot.seeed.cc
7                   www.bfilipek.com
8              beta.crowdfireapp.com
9                       www.valid.ly
dtype: object
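
One limitation of [\w\.]+ is that it stops at a hyphen, so a hyphenated domain would be cut short. The sketch below compares it with a character class that also allows hyphens, and with urllib.parse.urlparse from the standard library as a cross-check. The example.org URL is hypothetical.

```python
import re
from urllib.parse import urlparse

original = re.compile(r"https?://([\w\.]+)", re.I)
with_hyphen = re.compile(r"https?://([\w\-\.]+)", re.I)

url = "HTTPS://github.com/keppel/pinn"
regex_domain = with_hyphen.search(url).group(1)   # 'github.com'
parsed_domain = urlparse(url).netloc              # 'github.com'

# Hypothetical URL with a hyphen in the domain
hyphen_url = "http://my-site.example.org/page"
truncated = original.search(hyphen_url).group(1)      # 'my'
complete = with_hyphen.search(hyphen_url).group(1)    # 'my-site.example.org'

print(regex_domain, parsed_domain, truncated, complete)
```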

In [112]:
domains = hn['url'].str.extract(pattern, flags=re.I, expand=False)

domains.value_counts()

github.com                       1008
medium.com                        825
www.nytimes.com                   525
www.theguardian.com               248
techcrunch.com                    245
www.youtube.com                   213
www.bloomberg.com                 193
arstechnica.com                   191
www.washingtonpost.com            190
www.wsj.com                       138
www.theatlantic.com               137
www.bbc.com                       134
www.wired.com                     114
www.theverge.com                  112
www.bbc.co                        108
en.wikipedia.org                  100
twitter.com                        93
qz.com                             85
motherboard.vice.com               82
www.newyorker.com                  81
www.forbes.com                     78
www.businessinsider.com            78
nautil.us                          77
www.nature.com                     72
www.reuters.com                    71
www.economist.com                  66
arxiv.org   

In [115]:
# Split each URL into three parts: protocol, domain, and path
pattern = r"(.+)://([\w\.]+)/?(.*)"

test_url_parts = test_urls.str.extract(pattern)
test_url_parts

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | https | www.amazon.com | Technology-Ventures-Enterprise-Thomas-Byers/dp... |
| 1 | http | www.interactivedynamicvideo.com |  |
| 2 | http | www.nytimes.com | 2007/11/07/movies/07stein.html?_r=0 |
| 3 | http | evonomics.com | advertising-cannot-maintain-internet-heres-sol... |
| 4 | HTTPS | github.com | keppel/pinn |
| 5 | Http | phys.org | news/2015-09-scale-solar-youve.html |
| 6 | https | iot.seeed.cc |  |
| 7 | http | www.bfilipek.com | 2016/04/custom-deleters-for-c-smart-pointers.html |
| 8 | http | beta.crowdfireapp.com | ?beta=agnipath |
| 9 | https | www.valid.ly |  |


In [116]:
url_paths = hn['url'].str.extract(pattern)

url_paths

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | http | www.interactivedynamicvideo.com |  |
| 1 | http | www.thewire.com | entertainment/2013/04/florida-djs-april-fools-... |
| 2 | https | www.amazon.com | Technology-Ventures-Enterprise-Thomas-Byers/dp... |
| 3 | http | www.nytimes.com | 2007/11/07/movies/07stein.html?_r=0 |
| 4 | http | arstechnica.com | business/2015/10/comcast-and-other-isps-boost-... |
| 5 |  |  |  |
| 6 |  |  |  |
| 7 | http | firstround.com | review/shims-jigs-and-other-woodworking-concep... |
| 8 | http | www.southpolestation.com | trivia/igy1/appendix.html |
| 9 | http | techcrunch.com | 2016/03/15/crate-raises-4m-seed-round-for-its-... |


In [117]:
# Named groups, (?P<name>...), become the column names of the extracted DataFrame
pattern = r"(?P<protocol>.+)://(?P<domain>[\w\.]+)/?(?P<path>.*)"

url_parts = hn['url'].str.extract(pattern)

url_parts

|   | protocol | domain | path |
|---|---|---|---|
| 0 | http | www.interactivedynamicvideo.com |  |
| 1 | http | www.thewire.com | entertainment/2013/04/florida-djs-april-fools-... |
| 2 | https | www.amazon.com | Technology-Ventures-Enterprise-Thomas-Byers/dp... |
| 3 | http | www.nytimes.com | 2007/11/07/movies/07stein.html?_r=0 |
| 4 | http | arstechnica.com | business/2015/10/comcast-and-other-isps-boost-... |
| 5 |  |  |  |
| 6 |  |  |  |
| 7 | http | firstround.com | review/shims-jigs-and-other-woodworking-concep... |
| 8 | http | www.southpolestation.com | trivia/igy1/appendix.html |
| 9 | http | techcrunch.com | 2016/03/15/crate-raises-4m-seed-round-for-its-... |
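
Outside pandas, the same named groups are available through groupdict(). The sketch below also shows one edge case: because the protocol group uses a greedy .+, a URL that contains a second :// later on is split at the last occurrence. Restricting the protocol to \w+ is one possible fix; the redirect URL is hypothetical.

```python
import re

pattern = re.compile(r"(?P<protocol>.+)://(?P<domain>[\w\.]+)/?(?P<path>.*)")

parts = pattern.search(
    "http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0"
).groupdict()
# {'protocol': 'http', 'domain': 'www.nytimes.com',
#  'path': '2007/11/07/movies/07stein.html?_r=0'}

# Hypothetical URL with another URL embedded in its query string
nested = "http://a.com/?next=https://b.org/x"
greedy_parts = pattern.search(nested).groupdict()
# protocol swallows everything up to the LAST '://'

# Assumed refinement: a protocol is a single run of word characters
strict = re.compile(r"(?P<protocol>\w+)://(?P<domain>[\w\.]+)/?(?P<path>.*)")
strict_parts = strict.search(nested).groupdict()

print(parts)
print(greedy_parts)
print(strict_parts)
```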
