# Hacker News: Programming Languages

<i>Created by the startup incubator Y Combinator in 2007, [Hacker News](https://news.ycombinator.com/) is a social news site where *posts* (user-submitted content) are voted and commented on, in a format very similar to Reddit's. Unlike Reddit, however, users can only upvote or downvote once they've accumulated enough karma (user points), a restriction meant to discourage [trolling](https://unlcms.unl.edu/engineering/james-hanson/trolls-and-their-impact-social-media) and encourage intelligent, respectful discourse. Because the site is popular in technology and startup circles, its top posts can attract hundreds of thousands of user engagements.</i>

## Dataset

The dataset is a subset of a .csv file containing Hacker News stories from September 2015 to September 2016. The dataset has been reduced from about 300,000 rows to about 20,000 rows. Submissions without any comments have been removed, and the remaining posts have been randomly sampled. The columns are as follows:

| Column Name    | Description  |
|:---------------|:-------------|
| id | Unique identifier from Hacker News for the story |
| title | Title of the story |
| url | URL that the story links to, if the story has a URL |
| num_points | Number of points the story acquired, calculated as the total number of upvotes minus the total number of downvotes |
| num_comments | Number of comments that were made on the story |
| author | Username of the person who submitted the story |
| created_at | Date and time at which the story was submitted |

We'll continue to analyze and count mentions of different programming languages in the dataset, and then we'll finish by extracting the different components of the URLs submitted to Hacker News.

In [1]:
import pandas as pd
import numpy as np
import re

hn = pd.read_csv('datasets/hacker_news.csv')
titles = hn['title']
sql_counts = titles.str.contains(r'sql', flags=re.I).sum()
sql_counts

108

## Capture Groups

In [2]:
hn_sql = hn[hn['title'].str.contains(r"\w+SQL", flags=re.I)].copy()
hn_sql['flavor'] = hn_sql['title'].str.extract(r"(\w+SQL)", flags=re.I)
hn_sql['flavor'] = hn_sql['flavor'].str.lower()

In [3]:
sql_pivot = hn_sql.pivot_table(values='num_comments', index='flavor', aggfunc='mean')
sql_pivot

| flavor | num_comments |
|:-----------|:-------------|
| cloudsql | 5.0 |
| memsql | 14.0 |
| mysql | 12.230769 |
| nosql | 14.529412 |
| postgresql | 25.962963 |
| sparksql | 1.0 |


### Using Capture Groups to Extract Data

In [4]:
py_versions_freq = dict(titles.str.extract(r'[Pp]ython ([\d\.]+)')[0].value_counts())

In [5]:
py_versions_freq

{'3': 10,
 '3.5': 3,
 '2': 3,
 '3.6': 2,
 '3.5.0': 1,
 '8': 1,
 '1.5': 1,
 '4': 1,
 '2.7': 1}

### Counting Mentions of the 'C' Language

In [6]:
def first_10_matches(pattern):
    '''
    Return the first 10 story titles that match
    the provided regular expression
    '''
    all_matches = titles[titles.str.contains(pattern)]
    first_10 = all_matches.head(10)
    return first_10

# first_ten = first_10_matches(r'[^.+]\b[Cc]\b[^+*\.]')
# first_ten
first_10_matches(r"\b[Cc]\b[^.+]")

366                      The new C standards are worth it
445           Moz raises $10m Series C from Foundry Group
522          Fuchsia: Micro kernel written in C by Google
1308            Show HN: Yupp, yet another C preprocessor
1327                     The C standard formalized in Coq
1366                          GNU C Library 2.23 released
1430    Cysignals: signal handling (SIGINT, SIGSEGV, )...
1621                        SDCC  Small Device C Compiler
1950    Rewriting a Ruby C Extension in Rust: How a Na...
2196    MyHTML  HTML Parser on Pure C with POSIX Threa...
Name: title, dtype: object

### Using Lookarounds to Control Matches Based on Surrounding Text

It looks like we're getting close. In our first 10 matches we have one irrelevant result, which is about "Series C," a term used to represent a particular type of startup fundraising.

Additionally, we've run into the same issue as before: by using a negative set, we may have excluded titles where "C" is the last character, because the negative set requires one more character after the "C." (A title that ends with "C" can still match if it also contains a standalone "C" earlier in the string.)

Neither problem can be fixed with negative sets. A negative set still has to match, and consume, exactly one character, so it fails at the end of a string, and it says nothing about the text that comes before the match. Instead we'll need a new tool: lookarounds.

Lookarounds let us define a character or sequence of characters that either must or must not come before or after our regex match.
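
To see why a negative set falls short where a lookahead works, here is a minimal sketch using only the standard library. The sample strings are invented for illustration and are not taken from the dataset.

```python
import re

# Hypothetical sample titles (not from the Hacker News dataset).
samples = ["Programming in C", "C++ tips", "Why C matters"]

# A negative set must consume one character after the "C" ...
negative_set = r"\b[Cc]\b[^.+]"
# ... while a negative lookahead only checks what follows, consuming nothing.
negative_lookahead = r"\b[Cc]\b(?![.+])"

for s in samples:
    set_hit = bool(re.search(negative_set, s))
    look_hit = bool(re.search(negative_lookahead, s))
    print(f"{s!r}: negative set={set_hit}, lookahead={look_hit}")
```

The title ending in "C" is missed by the negative set, because there is no character left to consume, but it is found by the lookahead. "C++" is rejected by both.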

In [7]:
test_cases = ['Red_Green_Blue',
              'Yellow_Green_Red',
              'Red_Green_Red',
              'Yellow_Green_Blue',
              'Green']

def run_test_cases(pattern):
    for tc in test_cases:
        result = re.search(pattern, tc)
        print(result or "NO MATCH")

In [8]:
# positive lookahead
run_test_cases(r"Green(?=_Blue)")

<re.Match object; span=(4, 9), match='Green'>
NO MATCH
NO MATCH
<re.Match object; span=(7, 12), match='Green'>
NO MATCH


In [9]:
# negative lookahead
run_test_cases(r"Green(?!_Red)")

<re.Match object; span=(4, 9), match='Green'>
NO MATCH
NO MATCH
<re.Match object; span=(7, 12), match='Green'>
<re.Match object; span=(0, 5), match='Green'>


In [10]:
# positive lookbehind
run_test_cases(r"(?<=Red_)Green")

<re.Match object; span=(4, 9), match='Green'>
NO MATCH
<re.Match object; span=(4, 9), match='Green'>
NO MATCH
NO MATCH


In [11]:
# negative lookbehind
run_test_cases(r"(?<!Yellow_)Green")

<re.Match object; span=(4, 9), match='Green'>
NO MATCH
<re.Match object; span=(4, 9), match='Green'>
NO MATCH
<re.Match object; span=(0, 5), match='Green'>


In [12]:
# pattern = r'((?<![Ss]eries)\b[Cc].\b(?!.[A-Z]+))'
pattern2 = r'(?<!Series\s)\b[Cc]\b(?![\+\.])'
titles[titles.str.contains(pattern2)].head()

366                 The new C standards are worth it
522     Fuchsia: Micro kernel written in C by Google
1308       Show HN: Yupp, yet another C preprocessor
1327                The C standard formalized in Coq
1366                     GNU C Library 2.23 released
Name: title, dtype: object

### Backreferences: Using Capture Groups in a Regex Pattern

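Let's say we wanted to identify strings that had words with double letters, like the "ee" in "feed." Because we don't know ahead of time what letters might be repeated, we need a way to specify a capture group and then to repeat it. We can do this with backreferences: `\1` matches whatever text the first capture group matched. The same technique extends from repeated letters to repeated words, which is what `pattern3` in cell 13 looks for in the titles.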
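
As a minimal sketch of the double-letter case described above, using only the standard library (the word list is made up for illustration):

```python
import re

# (\w) captures one word character; \1 requires that same character again.
double_letter = r"(\w)\1"

for word in ["feed", "hello", "cat"]:
    m = re.search(double_letter, word)
    print(word, "->", m.group() if m else "NO MATCH")
```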

In [13]:
pattern3 = r'(\b\w+)\s\1\b'
titles[titles.str.contains(pattern3)].values


array(['Silicon Valley Has a Problem Problem',
       'Wire Wire: A West African Cyber Threat',
       'Flexbox Cheatsheet Cheatsheet', 'The Mindset Mindset (2015)',
       "Valentine's Day Special: Bye Bye Tinder, Flirting in the Support Channel",
       'Mcdonalds copying cyriak  cows cows cows in their new commercial?',
       'Bang Bang Control', 'Cordless Telephones: Bye Bye Privacy (1991)',
       'Solving the the Monty-Hall-Problem in Swift',
       'Bye Bye Webrtc2SIP: WebRTC with Asterisk and Amazon AWS Only',
       'Intellij-Rust Rust Plugin for IntelliJ IDEA'], dtype=object)

### Substituting Regular Expression Matches

In [14]:
email_variations = pd.Series(['email', 'Email', 'e Mail', 'e mail', 'E-mail', 'e-mail', 'eMail', 'E-Mail', 'EMAIL'])

In [15]:
email_uniform = email_variations.str.replace(r'\be[-\s]?mail', 'email', flags=re.I, regex=True)

In [16]:
email_uniform

0    email
1    email
2    email
3    email
4    email
5    email
6    email
7    email
8    email
dtype: object

In [17]:
titles_clean = titles.str.replace(r'\be[-\s]?mail\b', 'email', flags=re.I, regex=True)

In [18]:
titles_clean.str.extract(r'(\be[-\s]?mail\b)')[0].value_counts()

email    108
Name: 0, dtype: int64

In [19]:
titles.str.extract(r'(\be[-\s]?mail\b)', flags=re.I)[0].value_counts()

email     62
Email     40
e-mail     5
E-Mail     1
Name: 0, dtype: int64
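
The `\b` anchors matter once the pattern is applied to full titles rather than isolated words. Here is a minimal sketch using the standard library's `re.sub`, with made-up sample strings, of what can happen when the anchors are dropped:

```python
import re

anchored = r"\be[-\s]?mail\b"
unanchored = r"e[-\s]?mail"

# Hypothetical strings for illustration (not from the dataset).
ok = "Send me an E-Mail"
tricky = "The game mail system"

# Both patterns normalize a genuine "E-Mail".
print(re.sub(anchored, "email", ok, flags=re.I))
print(re.sub(unanchored, "email", ok, flags=re.I))

# Without \b, the trailing "e" of "game" plus " mail" is treated as a match.
print(re.sub(anchored, "email", tricky, flags=re.I))
print(re.sub(unanchored, "email", tricky, flags=re.I))
```

The anchored pattern leaves the second string untouched, while the unanchored one fuses "game mail" into "gamemail".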

In [20]:
email_variations = pd.Series(['email', 'Email', 'e Mail',
                        'e mail', 'E-mail', 'e-mail',
                        'eMail', 'E-Mail', 'EMAIL'])

email_uniform = email_variations.str.replace(r'e[-\s]?mail', 'email', flags=re.I, regex=True)
titles_clean = titles.str.replace(r'e[-\s]?mail', 'email', flags=re.I, regex=True)

In [21]:
titles_clean

0                                Interactive Dynamic Video
1        How to Use Open Source and Shut the Fuck Up at...
2        Florida DJs May Face Felony for April Fools' W...
3             Technology ventures: From Idea to Enterprise
4        Note by Note: The Making of Steinway L1037 (2007)
                               ...                        
20095    How Purism Avoids Intels Active Management Tec...
20096            YC Application Translated and Broken Down
20097    Microkernels are slow and Elvis didn't do no d...
20098                        How Product Hunt really works
20099    RoboBrowser: Your friendly neighborhood web sc...
Name: title, Length: 20100, dtype: object

### Extracting Domains from URLs

In [22]:
test_urls = pd.Series([
 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
 'http://www.interactivedynamicvideo.com/',
 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
 'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
 'HTTPS://github.com/keppel/pinn',
 'Http://phys.org/news/2015-09-scale-solar-youve.html',
 'https://iot.seeed.cc',
 'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
 'http://beta.crowdfireapp.com/?beta=agnipath',
 'https://www.valid.ly?param'
])

In [23]:
dq_pattern = r'https?://([\w\.]+)'
test_urls.str.extract(dq_pattern, flags=re.I)

|   | 0 |
|:--|:--|
| 0 | www.amazon.com |
| 1 | www.interactivedynamicvideo.com |
| 2 | www.nytimes.com |
| 3 | evonomics.com |
| 4 | github.com |
| 5 | phys.org |
| 6 | iot.seeed.cc |
| 7 | www.bfilipek.com |
| 8 | beta.crowdfireapp.com |
| 9 | www.valid.ly |


In [24]:
hn['url'].str.extract(dq_pattern, flags=re.I)[0].str.replace(r'^www\.', '', regex=True).value_counts().head(20)

github.com              1010
medium.com               825
nytimes.com              531
theguardian.com          248
techcrunch.com           246
youtube.com              216
bloomberg.com            193
arstechnica.com          191
washingtonpost.com       190
theatlantic.com          138
wsj.com                  138
bbc.com                  134
wired.com                114
theverge.com             112
bbc.co.uk                108
en.wikipedia.org         100
twitter.com               93
qz.com                    85
newyorker.com             82
motherboard.vice.com      82
Name: 0, dtype: int64

In [25]:
dict(test_urls)

{0: 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
 1: 'http://www.interactivedynamicvideo.com/',
 2: 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
 3: 'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
 4: 'HTTPS://github.com/keppel/pinn',
 5: 'Http://phys.org/news/2015-09-scale-solar-youve.html',
 6: 'https://iot.seeed.cc',
 7: 'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
 8: 'http://beta.crowdfireapp.com/?beta=agnipath',
 9: 'https://www.valid.ly?param'}

### Extracting URL Parts Using Multiple Capture Groups

In [26]:
three_pattern = r'(https?)://([\w\.]+)(?:/)?(.+|)?'
test_urls.str.extract(three_pattern, flags=re.I)

|   | 0 | 1 | 2 |
|:--|:--|:--|:--|
| 0 | https | www.amazon.com | Technology-Ventures-Enterprise-Thomas-Byers/dp... |
| 1 | http | www.interactivedynamicvideo.com | |
| 2 | http | www.nytimes.com | 2007/11/07/movies/07stein.html?_r=0 |
| 3 | http | evonomics.com | advertising-cannot-maintain-internet-heres-sol... |
| 4 | HTTPS | github.com | keppel/pinn |
| 5 | Http | phys.org | news/2015-09-scale-solar-youve.html |
| 6 | https | iot.seeed.cc | |
| 7 | http | www.bfilipek.com | 2016/04/custom-deleters-for-c-smart-pointers.html |
| 8 | http | beta.crowdfireapp.com | ?beta=agnipath |
| 9 | https | www.valid.ly | ?param |
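
The numeric column labels above say little about what each group holds. One option is to name the groups. The sketch below uses the standard library only; the group names `protocol`, `domain`, and `path` are assumptions chosen for illustration, and the pattern is a simplified variant of the three-group pattern above rather than the notebook's own.

```python
import re

# Same structure as the three-group pattern, with hypothetical group names.
named_pattern = r"(?P<protocol>https?)://(?P<domain>[\w.]+)/?(?P<path>.*)"

def split_url(url):
    """Return a dict of URL parts, or None if the URL doesn't match."""
    m = re.search(named_pattern, url, flags=re.I)
    return m.groupdict() if m else None

print(split_url("http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0"))
print(split_url("https://iot.seeed.cc"))
```

When a pattern with named groups is passed to pandas `Series.str.extract`, the group names are used as the column names of the result.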


In [27]:
hn['url'].str.extract(three_pattern, flags=re.I)

|   | 0 | 1 | 2 |
|:--|:--|:--|:--|
| 0 | http | www.interactivedynamicvideo.com | |
| 1 | http | hueniverse.com | 2016/01/26/how-to-use-open-source-and-shut-the... |
| 2 | http | www.thewire.com | entertainment/2013/04/florida-djs-april-fools-... |
| 3 | https | www.amazon.com | Technology-Ventures-Enterprise-Thomas-Byers/dp... |
| 4 | http | www.nytimes.com | 2007/11/07/movies/07stein.html?_r=0 |
| ... | ... | ... | ... |
| 20095 | https | puri.sm | philosophy/how-purism-avoids-intels-active-man... |
| 20096 | https | medium.com | @zreitano/the-yc-application-broken-down-and-t... |
| 20097 | http | blog.darknedgy.net | technology/2016/01/01/0/ |
| 20098 | https | medium.com | @benjiwheeler/how-product-hunt-really-works-d8... |


In [28]:
hn['url'].head(20)

0               http://www.interactivedynamicvideo.com/
1     http://hueniverse.com/2016/01/26/how-to-use-op...
2     http://www.thewire.com/entertainment/2013/04/f...
3     https://www.amazon.com/Technology-Ventures-Ent...
4     http://www.nytimes.com/2007/11/07/movies/07ste...
5     http://arstechnica.com/business/2015/10/comcas...
6                                                   NaN
7                                                   NaN
8     http://firstround.com/review/shims-jigs-and-ot...
9     http://www.southpolestation.com/trivia/igy1/ap...
10    http://techcrunch.com/2016/03/15/crate-raises-...
11    http://evonomics.com/advertising-cannot-mainta...
12    https://medium.com/@loorinm/coding-is-over-6d6...
13                                 https://iot.seeed.cc
14    http://www.bfilipek.com/2016/04/custom-deleter...
15                                                  NaN
16          http://beta.crowdfireapp.com/?beta=agnipath
17                                              

In [29]:
pattern = r"(.+)://([\w\.]+)/?(.*)"

test_url_parts = test_urls.str.extract(pattern)
url_parts = hn['url'].str.extract(pattern)

In [30]:
url_parts

|   | 0 | 1 | 2 |
|:--|:--|:--|:--|
| 0 | http | www.interactivedynamicvideo.com | |
| 1 | http | hueniverse.com | 2016/01/26/how-to-use-open-source-and-shut-the... |
| 2 | http | www.thewire.com | entertainment/2013/04/florida-djs-april-fools-... |
| 3 | https | www.amazon.com | Technology-Ventures-Enterprise-Thomas-Byers/dp... |
| 4 | http | www.nytimes.com | 2007/11/07/movies/07stein.html?_r=0 |
| ... | ... | ... | ... |
| 20095 | https | puri.sm | philosophy/how-purism-avoids-intels-active-man... |
| 20096 | https | medium.com | @zreitano/the-yc-application-broken-down-and-t... |
| 20097 | http | blog.darknedgy.net | technology/2016/01/01/0/ |
| 20098 | https | medium.com | @benjiwheeler/how-product-hunt-really-works-d8... |


In [31]:
dict(hn['url'])

{0: 'http://www.interactivedynamicvideo.com/',
 1: 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
 2: 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
 3: 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
 4: 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
 5: 'http://arstechnica.com/business/2015/10/comcast-and-other-isps-boost-network-investment-despite-net-neutrality/',
 6: nan,
 7: nan,
 8: 'http://firstround.com/review/shims-jigs-and-other-woodworking-concepts-to-conquer-technical-debt/',
 9: 'http://www.southpolestation.com/trivia/igy1/appendix.html',
 10: 'http://techcrunch.com/2016/03/15/crate-raises-4m-seed-round-for-its-next-gen-sql-database/',
 11: 'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
 12: 'https://medium.com/@loorinm/coding-is-over-6d653abe8da8',
 13: 'https://iot.seeed.cc',
 14: 'http://www.bfili