In [1]:
import pandas as pd
import re

hn = pd.read_csv("hacker_news.csv")
titles = hn['title']

In [2]:
# The title column mentions SQL with three different capitalizations: SQL, sql, and Sql.

pattern = 'sql'

sql_counts = titles.str.contains(pattern, flags = re.I).sum()
sql_counts

108

In [3]:
# alternative method 

pattern = r'[Ss][Qq][Ll]'

sql_counts = titles.str.contains(pattern).sum()
sql_counts

108

In [4]:
# Look at titles that have word characters immediately before "SQL"

pattern = r'\w+sql'

# creating a new dataframe, hn_sql, including only rows that mention a SQL flavor

hn_sql = hn[hn['title'].str.contains(pattern, flags = re.I)].copy()

# Create a new column called flavor containing sql flavors

hn['flavor'] = hn_sql['title'].str.extract(r'(\w+sql)', flags = re.I)

# clean the values in the flavor column by converting them to lowercase

hn['flavor'] = hn['flavor'].str.lower()

# Build a pivot table of the average number of comments for each flavor

sql_pivot = hn.pivot_table(index = 'flavor', values = 'num_comments', aggfunc = 'mean')
sql_pivot


flavor,num_comments
cloudsql,5.0
memsql,14.0
mysql,12.230769
nosql,14.529412
postgresql,25.962963
sparksql,1.0


In [4]:
# Regular expression that matches Python or python, followed by a whitespace character, followed by one or more digits or periods

pattern = r'[Pp]ython\s([\d.]+)' # the capture group holds the digits and periods (the Python version)

pv_version = titles.str.extract(pattern)
py_versions_freq = dict(pv_version[0].value_counts())
py_versions_freq

{'3': 10,
 '2': 3,
 '3.5': 3,
 '3.6': 2,
 '3.5.0': 1,
 '8': 1,
 '4': 1,
 '1.5': 1,
 '2.7': 1}

In [6]:
# Alternative to the cell above, using a literal space instead of \s
pattern = r'[Pp]ython ([\d\.]+)'
pv_version = titles.str.extract(pattern)
py_versions_freq = dict(pv_version[0].value_counts())
py_versions_freq

{'3': 10,
 '2': 3,
 '3.5': 3,
 '3.6': 2,
 '3.5.0': 1,
 '8': 1,
 '4': 1,
 '1.5': 1,
 '2.7': 1}

In [10]:
def first_10_matches(pattern):
    """
    Return the first 10 story titles that match
    the provided regular expression
    """
    all_matches = titles[titles.str.contains(pattern)]
    first_10 = all_matches.head(10)
    return first_10

pattern = r'\b[Cc]\b[^.+]' # standalone C or c, followed by any character except a period or a plus sign

first_ten = first_10_matches(pattern)
first_ten

365                      The new C standards are worth it
444           Moz raises $10m Series C from Foundry Group
521          Fuchsia: Micro kernel written in C by Google
1307            Show HN: Yupp, yet another C preprocessor
1326                     The C standard formalized in Coq
1365                          GNU C Library 2.23 released
1429    Cysignals: signal handling (SIGINT, SIGSEGV, )...
1620                        SDCC  Small Device C Compiler
1949    Rewriting a Ruby C Extension in Rust: How a Na...
2195    MyHTML  HTML Parser on Pure C with POSIX Threa...
Name: title, dtype: object

In our first 10 matches we have one irrelevant result, which is about `"Series C,"` a term used to represent a particular type of startup fundraising.

Additionally, we've run into the same issue as we did in the previous file: by using a **negative set**, we may have eliminated any instances where the last character of the title is "C" (the second-to-last line of output matches in spite of the fact that it ends with "C," because it also has "C" earlier in the string).

Neither problem can be solved with negative sets. A set, negative or not, always has to match exactly one character, so it cannot express "nothing follows" and it cannot inspect the text before the match without consuming it. Instead we'll need a new tool: **lookarounds**.
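The limitation described above can be reproduced with a few strings. This is a minimal sketch using the standard `re` module; the sample titles other than the two taken from the output above are made up for illustration.

```python
import re

# The negative-set pattern from the cell above.
negative_set = r"\b[Cc]\b[^.+]"

samples = [
    "The C standard formalized in Coq",             # from the output above
    "Moz raises $10m Series C from Foundry Group",  # from the output above
    "Learning C",                                   # hypothetical: C is the last character
    "C++ templates explained",                      # hypothetical: C followed by a plus sign
]

# True where the pattern finds a match, False otherwise.
results = {s: bool(re.search(negative_set, s)) for s in samples}

for title, matched in results.items():
    print(f"{matched!s:5}  {title}")
```

The "Series C" title matches even though it is not about the language, and "Learning C" is missed because the set needs one more character to consume.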

Lookarounds let us define a character or sequence of characters that either must or must not come before or after our regex match. There are four types of lookarounds:

* positive lookahead: `zzz(?=abc)` matches `zzz` only when it is followed by `abc`
* negative lookahead: `zzz(?!abc)` matches `zzz` only when it is not followed by `abc`
* positive lookbehind: `(?<=abc)zzz` matches `zzz` only when it is preceded by `abc`
* negative lookbehind: `(?<!abc)zzz` matches `zzz` only when it is not preceded by `abc`

These tips can help us remember the syntax for lookarounds:

* Inside the parentheses, the first character of a lookaround is always `?`.
* If the lookaround is a lookbehind, the next character will be `<`, which we can think of as an arrow head pointing behind the match.
* The next character indicates whether the lookaround is positive (`=`) or negative (`!`).

The contents of a lookaround can include any other regular expression component. For instance, here is an example where we match only cases that are followed by at least five characters:

`run_test_cases(r"Green(?=.{5})")`
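To see the four lookarounds side by side, the sketch below runs each one against a single made-up string and records where `zzz` matches. It uses only the standard `re` module; the test string and the three "Green" samples are assumptions chosen for illustration, not data from the notebook.

```python
import re

# Hypothetical test string containing three occurrences of "zzz".
text = "abczzz zzzabc zzz"

lookarounds = {
    "positive lookahead": r"zzz(?=abc)",
    "negative lookahead": r"zzz(?!abc)",
    "positive lookbehind": r"(?<=abc)zzz",
    "negative lookbehind": r"(?<!abc)zzz",
}

# Record the start index of every match for each pattern.
match_starts = {
    name: [m.start() for m in re.finditer(pattern, text)]
    for name, pattern in lookarounds.items()
}
for name, starts in match_starts.items():
    print(f"{name:20} {starts}")

# A lookahead checks the following text without consuming it,
# so the match itself is still just "Green".
green = re.compile(r"Green(?=.{5})")
green_results = {
    s: bool(green.search(s)) for s in ["Green Light", "Greenhouse", "Green Tea"]
}
print(green_results)
print(green.search("Green Light").group())
```

"Green Tea" does not match because only four characters follow "Green".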

In [13]:
# Exclude matches where 'Series ' immediately precedes C, or where a period or plus sign immediately follows it.
pattern = r'(?<!Series\s)\b[Cc]\b(?![.+])'

c_mentions = titles.str.contains(pattern).sum()
c_mentions

102

Let's say we wanted to identify strings that had words with double letters, like the `"ee"` in `"feed"`. We can do this with `backreferences`.

`Backreferences` allow us to repeat a **capture group** within our regex pattern by referring to the group with an **integer** that gives its position in the order the groups are captured. For example, `\1` refers to the text matched by the first capture group.
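As a minimal sketch of the double-letter case, the pattern `(\w)\1` captures one word character and then requires the same character again. The word list is made up for illustration.

```python
import re

# (\w) captures a single word character; \1 must repeat that same character.
double_letter = re.compile(r"(\w)\1")

words = ["feed", "fed", "balloon"]  # hypothetical sample words

# Collect the full text of every double-letter match in each word.
doubles = {w: [m.group() for m in double_letter.finditer(w)] for w in words}
print(doubles)
```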

In [9]:
# regular expression to match cases of repeated words:

# We'll define a word as a series of one or more word characters that are preceded and followed by a boundary anchor.
# We'll define repeated words as the same word repeated twice, separated by a whitespace character.

pattern = r'\b(\w+)\s\1\b'

repeated_words = titles[titles.str.contains(pattern)]
repeated_words

  


3102                  Silicon Valley Has a Problem Problem
3176                Wire Wire: A West African Cyber Threat
3178                         Flexbox Cheatsheet Cheatsheet
4797                            The Mindset Mindset (2015)
7276     Valentine's Day Special: Bye Bye Tinder, Flirt...
10371    Mcdonalds copying cyriak  cows cows cows in th...
11575                                    Bang Bang Control
11901          Cordless Telephones: Bye Bye Privacy (1991)
12697          Solving the the Monty-Hall-Problem in Swift
15049    Bye Bye Webrtc2SIP: WebRTC with Asterisk and A...
15839          Intellij-Rust Rust Plugin for IntelliJ IDEA
Name: title, dtype: object

When we worked with basic string methods, we used the `str.replace()` method to replace simple substrings. We can achieve the same with regular expressions using the `re.sub()` function. The basic syntax for `re.sub()` is:

`re.sub(pattern, repl, string, flags=0)`

`string = "aBcDEfGHIj"`
`print(re.sub(r"[A-Z]", "-", string))`

`a-c--f---j`
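A few more `re.sub()` calls may help show what the arguments do. This is a sketch using the standard `re` module: the first call reuses a title from the repeated-words output above, and the other input string is made up. The replacement string can refer to capture groups with `\1`, and `count` limits how many replacements are made.

```python
import re

# Collapse a repeated word by replacing the pair with the captured word.
title = "Solving the the Monty-Hall-Problem in Swift"
deduplicated = re.sub(r"\b(\w+)\s\1\b", r"\1", title)
print(deduplicated)

# flags=re.I makes the pattern case-insensitive.
uniform = re.sub(r"sql", "SQL", "MySql and Postgresql", flags=re.I)
print(uniform)

# count=2 stops after the first two replacements.
limited = re.sub(r"[A-Z]", "-", "aBcDEfGHIj", count=2)
print(limited)
```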

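When working in pandas, we can use the `Series.str.replace()` method, which uses nearly identical syntax:

`Series.str.replace(pat, repl, flags=0, regex=True)`

In pandas 2.0 and later, `regex` defaults to `False`, so pass `regex=True` explicitly to have `pat` treated as a regular expression.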

In [15]:
sql_variations = pd.Series(["SQL", "Sql", "sql"])

sql_uniform = sql_variations.str.replace(r"sql", "SQL", flags=re.I, regex=True)
print(sql_uniform)

0    SQL
1    SQL
2    SQL
dtype: object


In [10]:
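# Match "email" with an optional hyphen or whitespace character after the "e"
pattern = r'(e[-\s]?mail)'

# Collect each distinct spelling found in the titles
email_variations = titles.str.extract(pattern, flags = re.I)[0].value_counts()
email_variations = pd.Series(email_variations.index)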
email_variations

0     email
1     Email
2    e Mail
3    e-mail
4    e mail
5    E-mail
6    E-Mail
7     EMAIL
8     eMail
dtype: object

In [11]:
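pattern = r'(e[-\s]?mail)'
# regex=True is required in pandas 2.0+, where str.replace treats the pattern literally by default
email_uniform = email_variations.str.replace(pattern, "email", flags = re.I, regex = True)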
email_uniform

0    email
1    email
2    email
3    email
4    email
5    email
6    email
7    email
8    email
dtype: object

In [12]:
titles_clean = titles.str.replace(pattern, "email", flags = re.I, regex = True)
titles_clean.str.extract(pattern)[0].value_counts()

email    151
Name: 0, dtype: int64

In [50]:
# extracting domain name from url
pattern = r"https?://([\w\.]+)"
hn["url"].str.extract(pattern, flags = re.I)[0].value_counts().head(10)

github.com                1008
medium.com                 825
www.nytimes.com            525
www.theguardian.com        248
techcrunch.com             245
www.youtube.com            213
www.bloomberg.com          193
arstechnica.com            191
www.washingtonpost.com     190
www.wsj.com                138
Name: 0, dtype: int64

In [68]:
# Extract the protocol, domain, and page path using three capture groups
pattern = r'(.+)://([\w\.]+)/?(.*)'
hn["url"].str.extract(pattern).head()

,0,1,2
0,http,www.interactivedynamicvideo.com,
1,http,www.thewire.com,entertainment/2013/04/florida-djs-april-fools-...
2,https,www.amazon.com,Technology-Ventures-Enterprise-Thomas-Byers/dp...
3,http,www.nytimes.com,2007/11/07/movies/07stein.html?_r=0
4,http,arstechnica.com,business/2015/10/comcast-and-other-isps-boost-...


We can also name these columns, which we'll do using named capture groups.

In order to name a capture group we use the syntax `?P<name>`, where `name` is the name of our capture group. This syntax goes after the opening parenthesis, but before the regex syntax that defines the capture group.

In [74]:
pattern = r'(?P<protocol>.+)://(?P<domain>[\w\.]+)/?(?P<path>.*)'
hn["url"].str.extract(pattern).head()

,protocol,domain,path
0,http,www.interactivedynamicvideo.com,
1,http,www.thewire.com,entertainment/2013/04/florida-djs-april-fools-...
2,https,www.amazon.com,Technology-Ventures-Enterprise-Thomas-Byers/dp...
3,http,www.nytimes.com,2007/11/07/movies/07stein.html?_r=0
4,http,arstechnica.com,business/2015/10/comcast-and-other-isps-boost-...
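Named capture groups also work outside pandas. The sketch below applies the same pattern with the standard `re` module and reads the groups back by name; the URL is reassembled from row 3 of the table above.

```python
import re

url_pattern = re.compile(r"(?P<protocol>.+)://(?P<domain>[\w\.]+)/?(?P<path>.*)")

# URL reassembled from the protocol, domain, and path shown in row 3 above.
url = "http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0"

match = url_pattern.search(url)

# groupdict() returns every named group as a dictionary.
parts = match.groupdict()
print(parts)

# A single group can also be looked up by name.
print(match.group("domain"))
```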
