# Hacker News: September 2015 - September 2016

<i>Created by the startup incubator Y Combinator in 2007, [Hacker News](https://news.ycombinator.com/) is a social news site where *posts* (user-submitted content) are voted and commented on, much like Reddit. Unlike Reddit, however, users can only downvote once they've accumulated enough karma (user points), a restriction meant to discourage [trolling](https://unlcms.unl.edu/engineering/james-hanson/trolls-and-their-impact-social-media) and encourage intelligent, respectful discourse. Because Hacker News is popular in technology and startup circles, its top posts can attract hundreds of thousands of user engagements.</i>

## Dataset

The dataset is a subset of a .csv file containing Hacker News stories from September 2015 to September 2016. The dataset has been reduced from about 300,000 rows to about 20,000 rows. Submissions without any comments have been removed, and the remaining posts have been randomly sampled. The columns are as follows:

| COLUMN NAME    | DESCRIPTION  |
|:---------------|:-------------|
| id | Unique identifier from Hacker News for the story |
| title | Title of the story |
| url | URL that the story links to, if the story has a URL |
| num_points | Number of points the story acquired, calculated as the total number of upvotes minus the total number of downvotes |
| num_comments | Number of comments that were made on the story |
| author | Username of the person who submitted the story |
| created_at | Date and time at which the story was submitted |

With this dataset, the goal is to explore ways of using regular expressions to allow for more refined pattern matching and searches. 

Below is a high-level overview of the dataset, along with sample rows.

In [1]:
import pandas as pd

hn = pd.read_csv('datasets/hacker_news.csv')
hn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20100 entries, 0 to 20099
Data columns (total 7 columns):
id              20100 non-null int64
title           20100 non-null object
url             17660 non-null object
num_points      20100 non-null int64
num_comments    20100 non-null int64
author          20100 non-null object
created_at      20100 non-null object
dtypes: int64(3), object(4)
memory usage: 1.1+ MB


In [2]:
hn.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| id | 20100.0 | 11317530.0 | 696439.944151 | 10176908.0 | 10701763.5 | 11284446.5 | 11926073.0 | 12578975.0 |
| num_points | 20100.0 | 50.29607 | 107.107687 | 1.0 | 3.0 | 9.0 | 54.0 | 2553.0 |
| num_comments | 20100.0 | 24.80229 | 56.10734 | 1.0 | 1.0 | 3.0 | 21.0 | 1733.0 |


In [3]:
hn.head()

| | id | title | url | num_points | num_comments | author | created_at |
|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | 12224879 | Interactive Dynamic Video | http://www.interactivedynamicvideo.com/ | 386 | 52 | ne0phyte | 8/4/2016 11:52 |
| 1 | 10975351 | How to Use Open Source and Shut the Fuck Up at... | http://hueniverse.com/2016/01/26/how-to-use-op... | 39 | 10 | josep2 | 1/26/2016 19:30 |
| 2 | 11964716 | Florida DJs May Face Felony for April Fools' W... | http://www.thewire.com/entertainment/2013/04/f... | 2 | 1 | vezycash | 6/23/2016 22:20 |
| 3 | 11919867 | Technology ventures: From Idea to Enterprise | https://www.amazon.com/Technology-Ventures-Ent... | 3 | 1 | hswarna | 6/17/2016 0:01 |
| 4 | 10301696 | Note by Note: The Making of Steinway L1037 (2007) | http://www.nytimes.com/2007/11/07/movies/07ste... | 8 | 2 | walterbell | 9/30/2015 4:12 |


In [4]:
hn[hn.url.isnull()].head()

| | id | title | url | num_points | num_comments | author | created_at |
|:---|:---|:---|:---|:---|:---|:---|:---|
| 6 | 10557283 | Nuts and Bolts Business Advice | NaN | 3 | 4 | shomberj | 11/13/2015 0:45 |
| 7 | 12296411 | Ask HN: How to improve my personal website? | NaN | 2 | 6 | ahmedbaracat | 8/16/2016 9:55 |
| 15 | 12335860 | How often to update third party libraries? | NaN | 7 | 5 | rabid_oxen | 8/22/2016 12:37 |
| 17 | 10610020 | Ask HN: Am I the only one outraged by Twitter ... | NaN | 28 | 29 | tkfx | 11/22/2015 13:43 |
| 22 | 11610310 | Ask HN: Aby recent changes to CSS that broke m... | NaN | 1 | 1 | polskibus | 5/2/2016 10:14 |


In [5]:
hn[hn.num_comments == max(hn.num_comments)]

| | id | title | url | num_points | num_comments | author | created_at |
|:---|:---|:---|:---|:---|:---|:---|:---|
| 11909 | 12445994 | iPhone 7 | http://www.apple.com/iPhone7 | 756 | 1733 | benigeri | 9/7/2016 18:52 |


Some quick takeaways on the data include:

- The overview confirms that every `num_comments` value is at least 1, because posts without comments were removed from the original source.
- Only the `url` column has NaN values. These values seem to be mostly for *Ask HN* posts, which are often stand-alone and don't require external links.
- The most-commented post is about the iPhone 7, with 1,733 comments.

## Using the `re` Module

Although regular expressions can be used with pandas, Python also has a built-in `re` (regular expression) module, which contains functions and classes specifically for working with regular expressions. The power of regular expressions lies in their special character sequences, which allow for sophisticated pattern matching.

### Simple Patterns

The first example below uses a loop to count the titles that mention Python. Because the regex pattern contains the set `[Pp]`, it is unnecessary to spell out both the title-case 'Python' and the lowercase 'python' in the search. The next example uses the vectorized `Series.str.contains()` method to get the same count more efficiently.

In [6]:
import re
titles = hn['title']

In [7]:
mentions = 0
for t in titles.tolist():
    if re.search('[Pp]ython', t):
        mentions += 1
        
print(mentions)

160


In [8]:
# generate sum using a boolean array
titles.str.contains('[Pp]ython').sum()

160

In [9]:
# generate a boolean array from a series
titles.str.contains('[Pp]ython').head()

0    False
1    False
2    False
3    False
4    False
Name: title, dtype: bool

In [10]:
# apply regex-driven boolean mask as filter
titles[titles.str.contains("[Pp]ython")].head()

103                  From Python to Lua: Why We Switched
104            Ubuntu 16.04 LTS to Ship Without Python 2
145    Create a GUI Application Using Qt and Python i...
197    How I Solved GCHQ's Xmas Card with Python and ...
437    Unikernel Power Comes to Java, Node.js, Go, an...
Name: title, dtype: object

In [11]:
# sample article titles containing 'Ruby'
titles[titles.str.contains('[Rr]uby')].head()

191                    Ruby on Google AppEngine Goes Beta
485          Related: Pure Ruby Relational Algebra Engine
1389    Show HN: HTTPalooza  Ruby's greatest HTTP clie...
1950    Rewriting a Ruby C Extension in Rust: How a Na...
2023    Show HN: CrashBreak  Reproduce exceptions as f...
Name: title, dtype: object
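
To see the set syntax in isolation, here is a minimal sketch that uses only the `re` module. The `sample_titles` list is invented for illustration and is not drawn from the dataset.

```python
import re

# Hypothetical titles, invented for illustration (not from the dataset)
sample_titles = [
    "Why I switched to Python",
    "python tips for beginners",
    "PYTHON in all caps",
    "A monty pythonesque story",
    "Ruby vs. Go",
]

# [Pp] matches exactly one character: either 'P' or 'p'
pattern = re.compile(r"[Pp]ython")

matches = [t for t in sample_titles if pattern.search(t)]
print(len(matches))  # 3
for t in matches:
    print(t)
```

Note that the set only covers the first letter, so `PYTHON` is not matched, while `pythonesque` is, because the pattern places no restriction on what follows.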

### Quantifiers

Quantifiers specify how many times the preceding character (or group) must repeat for the pattern to match. The most common are `?` (zero or one time), `*` (zero or more times), `+` (one or more times), and braces such as `{n}` or `{m,n}` (an exact count or a range). They shorten patterns that would otherwise have to spell out every variant.

In the example below, `?` lets the search match both `e-mail` (where `-` occurs once) and `email` (where `-` occurs zero times).

In [12]:
titles[titles.str.contains('e-?mail')].head()

120     Show HN: Send an email from your shell to your...
314         Disposable emails for safe spam free shopping
1362    Ask HN: Doing cold emails? helps us prove this...
1751    Protect yourself from spam, bots and phishing ...
2422                   Ashley Madison hack treating email
Name: title, dtype: object

In [13]:
# `*` allows zero or more hyphens, so 'email' and 'e-mail' both still match
titles[titles.str.contains('e-*mail')].head()

120     Show HN: Send an email from your shell to your...
314         Disposable emails for safe spam free shopping
1362    Ask HN: Doing cold emails? helps us prove this...
1751    Protect yourself from spam, bots and phishing ...
2422                   Ashley Madison hack treating email
Name: title, dtype: object
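
The difference between the quantifiers is easier to see on a handful of strings. The following is a small sketch with made-up inputs (`samples` is hypothetical, not taken from the dataset), comparing `?`, `*`, `+`, and a brace quantifier.

```python
import re

# Hypothetical strings, invented for illustration
samples = ["email", "e-mail", "e--mail", "emal"]

quantified = {
    r"e-?mail": "zero or one hyphen",
    r"e-*mail": "zero or more hyphens",
    r"e-+mail": "one or more hyphens",
    r"e-{2}mail": "exactly two hyphens",
}

results = {}
for pattern, meaning in quantified.items():
    # Keep every sample in which the pattern is found
    results[pattern] = [s for s in samples if re.search(pattern, s)]
    print(f"{pattern:<12} ({meaning}): {results[pattern]}")
```

On these inputs, `e-?mail` and `e-*mail` differ only on `e--mail`, which is consistent with the two dataset queries above returning the same first five rows.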

### Character Classes

A character class is a shorthand that matches any single character from a predefined group. Common character classes and their respective regex patterns include:

| Character Class | Pattern  | Scope    |
|:----------------|:---------|:---------|
| digit | \d | any digit from 0 to 9 |
| word | \w | any letter (uppercase or lowercase), digit, or underscore character |
| whitespace | \s | any space, tab or linebreak character |
| dot | . | any character except newline |

Character classes are especially useful when unknown characters need to be accounted for. For instance:

In [14]:
# match any tag, i.e., a word enclosed in square brackets
tags = titles[titles.str.contains(r'\[\w+\]')]
tags.head()

67     Analysis of 114 propaganda sources from ISIS, ...
101    Munich Gunman Got Weapon from the Darknet [Ger...
160         File indexing and searching for Plan 9 [pdf]
163    Attack on Kunduz Trauma Centre, Afghanistan  I...
196               [Beta] Speedtest.net  HTML5 Speed Test
Name: title, dtype: object
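
As a rough illustration of what each class in the table matches, the sketch below runs them against a single made-up string (`text` is hypothetical and not from the dataset).

```python
import re

# Hypothetical string, invented for illustration
text = "Show HN: v2_beta released\tin 2016 [pdf]"

digits = re.findall(r"\d+", text)   # runs of digits
words = re.findall(r"\w+", text)    # runs of letters, digits, and underscores
spaces = re.findall(r"\s", text)    # individual whitespace characters (spaces, tab)
tag = re.search(r"\[\w+\]", text)   # a word enclosed in square brackets

print(digits)        # ['2', '2016']
print(words)         # ['Show', 'HN', 'v2_beta', 'released', 'in', '2016', 'pdf']
print(len(spaces))   # 6
print(tag.group())   # [pdf]
```

The underscore in `v2_beta` counts as a word character, so `\w+` keeps the token whole, while the colon and brackets split tokens apart.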

### Capture Groups

A capture group is a part of a regex pattern enclosed in parentheses, so that the text it matches can be extracted or reused separately from the rest of the match.

In [15]:
tags.head()

67     Analysis of 114 propaganda sources from ISIS, ...
101    Munich Gunman Got Weapon from the Darknet [Ger...
160         File indexing and searching for Plan 9 [pdf]
163    Attack on Kunduz Trauma Centre, Afghanistan  I...
196               [Beta] Speedtest.net  HTML5 Speed Test
Name: title, dtype: object

In [16]:
# print samples of news tags
print(tags.str.extract(r'(\[\w+\])').head())

            0
67      [pdf]
101  [German]
160     [pdf]
163     [pdf]
196    [Beta]


In [17]:
# move the capture group inside the brackets to exclude them from the results
print(tags.str.extract(r'\[(\w+)\]').head())

          0
67      pdf
101  German
160     pdf
163     pdf
196    Beta


In [18]:
# extract the words inside tag brackets and count the most common ones
tags.str.extract(r'\[(\w+)\]')[0].value_counts().head()

pdf       276
video     111
audio       3
2015        3
slides      2
Name: 0, dtype: int64
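
The same distinction between the whole match and the captured part can be seen with the `re` module directly. This is a minimal sketch on a made-up title (`title` is hypothetical, not from the dataset).

```python
import re

# Hypothetical title, invented for illustration
title = "Lecture notes on compilers [pdf] [2015]"

match = re.search(r"\[(\w+)\]", title)
whole = match.group(0)   # the entire match, brackets included
inner = match.group(1)   # only the text captured by the parentheses

# With a single capture group, findall returns just the captured text
all_tags = re.findall(r"\[(\w+)\]", title)

print(whole)      # [pdf]
print(inner)      # pdf
print(all_tags)   # ['pdf', '2015']
```

`re.search` stops at the first match, much as `Series.str.extract()` returns only the first match per row, whereas `re.findall` collects every tag in the string.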

### Negative Character Classes

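A negative character class (or negative set) matches any single character *except* those listed. It is written by placing `^` immediately after the opening bracket, as in `[^Ss]`.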

In [19]:
def first_10_matches(pattern):
    '''Return the first 10 story titles
       matching the provided pattern
    '''
    return titles[titles.str.contains(pattern)].head(10)

# match `[Jj]ava` only when it is followed by a character other than 'S' or 's'
first_10_matches(r'[Jj]ava[^Ss]')

437     Unikernel Power Comes to Java, Node.js, Go, an...
812     Ask HN: Are there any projects or compilers wh...
1841                    Adopting RxJava on the Airbnb App
1973          Node.js vs. Java: Which Is Faster for APIs?
2094                    Java EE and Microservices in 2016
2368    Code that is valid in both PHP and Java, and p...
2494    Ask HN: I've been a java dev for a couple of y...
2752                Eventsourcing for Java 0.4.0 released
2911                2016 JavaOne Intel Keynote  32mn Talk
3453    What are the Differences Between Java Platform...
Name: title, dtype: object

As the example above shows, the negative set filters out unwanted matches that mention JavaScript. It has a side effect, however: it also drops any title where `Java` occurs at the end of the string, such as this one:

`Pippo  Web framework in Java`

The negative set `[^Ss]` must match exactly one character, so an occurrence of `Java` at the very end of a string, with nothing after it, is not considered a match.
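
This behavior can be checked on a few strings. The sketch below uses made-up titles (`samples` is hypothetical) and prints what the negative-set pattern actually matches in each.

```python
import re

# Hypothetical titles, invented for illustration
samples = [
    "Java 9 released",
    "JavaScript fatigue",
    "Web framework in Java",
]

negative_set = r"[Jj]ava[^Ss]"

found = {}
for s in samples:
    m = re.search(negative_set, s)
    # Store the matched text, or None when there is no match
    found[s] = m.group() if m else None
    print(f"{s!r}: {found[s]!r}")
```

In the first title the match is `'Java '`, including the trailing space, which shows that the negative set consumes a character. The last title has no character after `Java`, so the search returns `None`.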

### Word Boundaries

A different approach is to use the word boundary anchor, specified using the syntax `\b`. A word boundary matches the position between a word character and a non-word character, or a word character and the start/end of a string.

The example below shows how using a word boundary changes the result for the same string:

In [20]:
sample_string = "Sometimes people confuse JavaScript with Java"

# use negative character class
print(re.search(r"Java[^S]", sample_string))

None


In [21]:
# use word boundary
print(re.search(r"\bJava\b", sample_string))

<re.Match object; span=(41, 45), match='Java'>


In [22]:
# use word boundary as a boolean mask
titles[titles.str.contains(r'\b[Jj]ava\b')].head()

437     Unikernel Power Comes to Java, Node.js, Go, an...
812     Ask HN: Are there any projects or compilers wh...
1024                         Pippo  Web framework in Java
1973          Node.js vs. Java: Which Is Faster for APIs?
2094                    Java EE and Microservices in 2016
Name: title, dtype: object
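
Because `\b` depends on what counts as a word character, it can be worth testing a few edge cases. The strings below are made up for illustration (`samples` is hypothetical), and the sketch assumes the default Unicode-aware behavior of `\w` in Python 3.

```python
import re

# Hypothetical strings, invented for illustration
samples = ["Java", "Java, Node.js", "JavaScript", "RxJava", "java_tools", "Java-based"]

pattern = re.compile(r"\b[Jj]ava\b")
matched = [s for s in samples if pattern.search(s)]
print(matched)  # ['Java', 'Java, Node.js', 'Java-based']

# \b is zero-width: the match covers only the four letters, not the comma
span = pattern.search("Java, Node.js").span()
print(span)  # (0, 4)
```

`RxJava` and `java_tools` are excluded because letters and underscores are word characters, so there is no boundary next to `Java` there. A hyphen is not a word character, so `Java-based` still matches.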

### Beginning and End Anchors

Besides the word boundary anchor, the two most common anchors are the beginning anchor (`^`) and the end anchor (`$`), which match the start and the end of the string, respectively. The `^` character serves both as the beginning anchor and as the negative set marker; it acts as the latter only when it immediately follows an opening `[`.

In [23]:
titles[titles.str.contains(r'^\[\bpdf\b\]')]

10961    [pdf] Ninth Circuit Decision on AT&T Throttling
Name: title, dtype: object

In [24]:
# capture titles with tags in the beginning
titles[titles.str.contains(r'^\[\w+\]')].head()

196                [Beta] Speedtest.net  HTML5 Speed Test
399        [video] Google Self-Driving SUV Sideswipes Bus
3137                          [CSS] Yellow Fade Technique
5055    [React] proptypes-parser: Define React PropTyp...
9390    [Petition] Tell Microsoft to stop making browsers
Name: title, dtype: object

In [25]:
# capture titles with tags in the end
titles[titles.str.contains(r'\[\w+\]$')].head()

67     Analysis of 114 propaganda sources from ISIS, ...
101    Munich Gunman Got Weapon from the Darknet [Ger...
160         File indexing and searching for Plan 9 [pdf]
163    Attack on Kunduz Trauma Centre, Afghanistan  I...
211    A plan to rescue western democracy from the ig...
Name: title, dtype: object
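
The two roles of `^`, together with the `$` anchor, can be sketched on a few made-up strings (`samples` is hypothetical, not from the dataset).

```python
import re

# Hypothetical titles, invented for illustration
samples = ["[pdf] Annual report", "Annual report [pdf]", "Annual [pdf] report"]

# ^ outside brackets anchors the tag to the start of the string
starts_with_tag = [s for s in samples if re.search(r"^\[\w+\]", s)]
# $ anchors the tag to the end of the string
ends_with_tag = [s for s in samples if re.search(r"\[\w+\]$", s)]

# The first ^ is an anchor; the second, right after '[', negates the set:
# match from the start of the string up to the first bracket
before_tag = re.search(r"^[^\[\]]+", samples[1]).group()

print(starts_with_tag)    # ['[pdf] Annual report']
print(ends_with_tag)      # ['Annual report [pdf]']
print(repr(before_tag))   # 'Annual report '
```

The title with the tag in the middle matches neither anchored pattern, even though it would match the unanchored `\[\w+\]`.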

## Challenge: Flags

In addition to the options discussed so far, flags are optional arguments that make regular expressions more flexible. For instance, the `re.I` flag makes pattern matching ignore case:

In [31]:
# match(es) without the re.I flag
email_tests = pd.Series(['email', 'Email', 'eMail', 'EMAIL'])
email_tests[email_tests.str.contains(r'email')]

0    email
dtype: object

In [32]:
# matches with the re.I flag
email_tests[email_tests.str.contains(r"email", flags = re.I)]

0    email
1    Email
2    eMail
3    EMAIL
dtype: object

In [35]:
# regex that matches all 'email' variations
titles[titles.str.contains(r'\be[-\s]?mail', flags=re.I)]

120      Show HN: Send an email from your shell to your...
162      Computer Specialist Who Deleted Clinton Emails...
175                                        Email Apps Suck
262      Emails Show Unqualified Clinton Foundation Don...
314          Disposable emails for safe spam free shopping
                               ...                        
18848    Show HN: Crisp iOS keyboard for email and text...
19304    Ask HN: Why big email providers don't sign the...
19396    I used HTML Email when applying for jobs, here...
19447    Tell HN: Secure email provider Riseup will run...
19906    Gmail Will Soon Warn Users When Emails Arrive ...
Name: title, Length: 143, dtype: object

In [53]:
# counts of different 'email' formats in titles
titles.str.extract(r'(\be[-\s]?mail)', flags=re.I)[0].value_counts()

email     79
Email     56
e-mail     5
E-mail     2
E-Mail     1
Name: 0, dtype: int64
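
The same flag works with the `re` module directly, and a compiled pattern can also be used to normalize the variants it finds. The sketch below is one possible approach on made-up strings (`samples` is hypothetical); rewriting every variant as `email` is an assumption made for illustration, not a step from the analysis above.

```python
import re

# Hypothetical strings, invented for illustration
samples = ["email", "Email", "e-mail", "E-Mail", "e mail", "voicemail"]

# re.IGNORECASE is the long name for re.I
pattern = re.compile(r"\be[-\s]?mail", flags=re.IGNORECASE)

matched = [s for s in samples if pattern.search(s)]
# Replace every matched variant with one canonical spelling
normalized = [pattern.sub("email", s) for s in matched]

print(matched)      # ['email', 'Email', 'e-mail', 'E-Mail', 'e mail']
print(normalized)   # ['email', 'email', 'email', 'email', 'email']
```

`voicemail` contains the letters `email` but is excluded because the `\b` anchor requires the `e` to start a word.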