hits

A simple & easy way to see how many people have viewed your GitHub Repository.

(Badges: Build Status, HitCount, codecov.io, Dependency Status, devDependency Status)

Why?

We have a few projects on GitHub ...
Sadly, we have had no idea how many people are reading/using them, because GitHub only shares "traffic" stats for the past 14 days and not in "real time" (unless people star/watch the repo). Also, manually checking who has viewed a project is exceptionally tedious when you have more than a handful of projects.

We want to know the popularity of each of our repos so we can see what people are finding useful and decide where to invest our time.

What?

A simple way to add (very basic) analytics to your GitHub repos.

There are already many "badges" that people use in their repos. See: github.com/dwyl/repo-badges
But we haven't seen one that gives a "hit counter" of the number of times a GitHub page has been viewed ...
So, in today's mini project we're going to create a basic Web Counter.

https://en.wikipedia.org/wiki/Web_counter

What Data to Capture/Store?

The first question we asked ourselves was: What is the minimum possible amount of (useful/unique) info we can store per visit (to one of our projects)?

  1. date + time (timestamp) when the person visited the site/page.
    https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date/now

  2. url being visited. i.e. which project was viewed.

  3. user-agent of the browser/device (or "crawler") visiting the site/page: https://en.wikipedia.org/wiki/User_agent

  4. IP Address of the client. (for checking uniqueness)

  5. Language of the person's web browser. Note: While not "essential", we added Browser Language as the 5th piece of data (when it is set/sent by the browser/device) because it's insightful to know what language people are using so that we can determine if we should be translating/"localising" our content.
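Put together, a single "hit" can be represented as a small record. Here is a minimal sketch in JavaScript (the field names are illustrative, not the project's actual schema):

```js
// A single "hit" record: the five pieces of data listed above.
// Field names here are illustrative only, not the project's actual schema.
var hit = {
  timestamp: Date.now(),                                         // 1. date + time of the visit (ms since epoch)
  url: 'github.com/dwyl/hits',                                   // 2. which page/repo was viewed
  user_agent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)', // 3. browser/device (or crawler)
  ip: '84.91.136.21',                                            // 4. client IP address (for uniqueness)
  language: 'EN-GB'                                              // 5. browser language (when sent)
};

console.log(hit);
```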

"Common Log Format" (CLF) ?

We initially considered using the "Common Log Format" (CLF) because it's well-known/understood. see: https://en.wikipedia.org/wiki/Common_Log_Format

An example log entry:

127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

Real example:

84.91.136.21 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) 007 [05/Aug/2017:16:50:51 -0000] "GET github.com/dwyl/phase-two HTTP/1.0" 200 42247

The data makes sense when viewed as a table:

| IP Address of Client | User Identifier | User ID | Date+Time of Request | Request "Verb" and URL | HTTP Status Code | Size of Response |
|----------------------|-----------------|---------|----------------------|------------------------|------------------|------------------|
| 84.91.136.21 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) | 007 | [05/Aug/2017:16:50:51 -0000] | "GET github.com/dwyl/phase-two HTTP/1.0" | 200 | 42247 |

On further reflection, we think the "Common Log Format" is inefficient as it contains a lot of duplicate and some useless data.

We can do better.

Alternative Log Format ("ALF")

From the CLF we can remove:

  • IP Address, User Identifier and User ID can be condensed into a single hash (see below).
  • "GET" - the word is implied by the service we are running (we only accept GET requests).
  • Response size is irrelevant and will be the same for most requests.

| Timestamp | URL | User Agent | IP Address | Language | Hit Count |
|-----------|-----|------------|------------|----------|-----------|
| 1436570536950 | github.com/dwyl/the-book | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) | 84.91.136.21 | EN-GB | 42 |

In the example log entry above, the pieces of data that identify the "user" requesting the page/resource (User Agent, IP Address and Language) repeat on every hit, so rather than duplicating them in an inefficient string, we can hash them!

Any repeating user-identifying data should be concatenated into a single string and hashed (see below).

Log entries are stored as a ("pipe" delimited) String which can be parsed and re-formatted into any other format:

1436570536950|github.com/dwyl/phase-two|Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)|88.88.88.88|EN-US|42
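Since an entry is just a delimited String, writing and reading one takes only a few lines. A quick sketch (not the project's internal API):

```js
// Build a pipe-delimited log entry from its parts (illustrative sketch).
var fields = [
  Date.now(),                                         // timestamp
  'github.com/dwyl/phase-two',                        // url
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)',  // user agent
  '88.88.88.88',                                      // ip address
  'EN-US',                                            // language
  42                                                  // hit count
];
var entry = fields.join('|');

// Parse an entry back into named fields.
var parts = entry.split('|');
var parsed = {
  timestamp: Number(parts[0]),
  url: parts[1],
  user_agent: parts[2],
  ip: parts[3],
  language: parts[4],
  count: Number(parts[5])
};
console.log(parsed);
```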

Reducing Storage (Costs)

If a person views multiple pages, three pieces of data are duplicated: User Agent, IP Address and Language. Rather than storing this data multiple times, we hash the data and store the hash as a lookup.

Hash Long Repeating (Identical) Data

If we run the following Browser|IP|Language String:

'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)|84.91.136.21|EN-US'

through a SHA hash function we get: 8HKg3NB5Cf (always).¹

Sample code:

var hash = require('./lib/hash.js'); // the project's hashing helper
// concatenated Browser|IP|Language string for this visitor
var user_agent_string = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)|88.88.88.88|EN-US';
var agent_hash = hash(user_agent_string, 10); // truncated to 10 characters: 8HKg3NB5Cf

¹ Note: a SHA hash is always 40 characters, but we truncate it because 10 alphanumeric characters (selected from a set of 26 letters + 10 digits) means there are 36¹⁰ = 3,656,158,440,062,976 (three and a half quadrillion) possible strings, which we consider "enough" entropy. (If you disagree, tell us why in an issue!)
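If you are curious what such a truncating hash function looks like, here is a minimal sketch using Node's built-in crypto module. This is only an illustration: the project's actual lib/hash.js may use a different algorithm/encoding, so the output will not necessarily be 8HKg3NB5Cf.

```js
var crypto = require('crypto');

// Illustrative sketch of a truncating hash (NOT the project's lib/hash.js).
function hash(input, length) {
  return crypto
    .createHash('sha1')           // any SHA variant works for this purpose
    .update(input)
    .digest('base64')             // base64 gives letters + digits (+ a few symbols)
    .replace(/[^A-Za-z0-9]/g, '') // strip '+', '/' and '=' so only alphanumerics remain
    .substring(0, length);        // truncate to the requested length
}

var user_string = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)|84.91.136.21|EN-US';
console.log(hash(user_string, 10)); // always the same 10-character id for this input
```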

Hit Data With Hash

1436570536950|github.com/dwyl/the-book|8HKg3NB5Cf|42

How?

Place a badge (image) in your repo README.md so others can see how popular the page is and you can track it.

Run it Yourself!

Download (clone) the code to your local machine:

git clone https://github.com/dwyl/hits.git && cd hits

Note: you will need to have Node.js installed on your localhost.

Install dependencies:

npm install

Run locally:

npm run dev

Visit: http://localhost:8000/any/url/count.svg
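You can also exercise the endpoint from a script; for example (assuming the dev server is running on port 8000 as above):

```js
var http = require('http');

// Request the counter badge and print the response status.
http.get('http://localhost:8000/any/url/count.svg', function (res) {
  console.log('status:', res.statusCode);
  console.log('content-type:', res.headers['content-type']);
  res.resume(); // drain the response body so the socket is released
});
```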

Data Storage

Recording the "hit" data is essential for this app to work and be useful.

We have built it to work with two "data stores": Filesystem and Redis

Note: you only need one storage option to be available.

Filesystem
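The idea of the Filesystem store is simply to append each hit to a log file on disk. A minimal sketch of that idea (this is not the project's actual implementation; the file name and entry format are assumptions):

```js
var fs = require('fs');
var path = require('path');

// Illustrative sketch only: append one pipe-delimited entry per hit to a local file.
// The real project's file layout and paths will differ.
var LOG_FILE = path.join(__dirname, 'hits.log'); // assumed location

function recordHit(url, user_hash) {
  var entry = [Date.now(), url, user_hash].join('|') + '\n';
  fs.appendFile(LOG_FILE, entry, function (err) {
    if (err) { console.error('failed to record hit:', err); }
  });
}

recordHit('github.com/dwyl/the-book', '8HKg3NB5Cf');
```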

Research

User Agents

How many user agents (web browsers + crawlers) are there? There appear to be fewer than a couple of thousand (see http://www.useragentstring.com/pages/useragentstring.php), which means we could store them using a numeric index, e.g. 1 - 3000.

But storing the user agents using a numeric index means we need to perform a lookup on each hit, which requires network IO ... (expensive!) What if there were a way of deriving a String representation of the user-agent string directly ... oh, that's right, here's one I made earlier... https://github.com/dwyl/aguid

Log Formats

Node.js http module headers

https://nodejs.org/api/http.html#http_message_rawheaders
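The fields we capture map directly onto what Node's http module exposes on an incoming request. A sketch (the header names are standard HTTP; the x-forwarded-for fallback is an assumption that depends on whether you sit behind a proxy):

```js
var http = require('http');

// Sketch: pull the data we care about straight off the incoming request.
http.createServer(function (req, res) {
  var hit = {
    timestamp: Date.now(),
    url: req.url,
    user_agent: req.headers['user-agent'],
    // behind a proxy/load-balancer the client IP is usually forwarded in a header
    ip: req.headers['x-forwarded-for'] || req.socket.remoteAddress,
    language: req.headers['accept-language']
  };
  console.log(hit);
  res.end('ok');
}).listen(3000); // arbitrary port for this standalone sketch
```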

Running the Test Suite locally

The test suite includes tests for 3 databases, so running the tests on your localhost requires all 3 to be running.

Deploying and using the app only requires one of the databases to be available.
