This repository has been archived by the owner on Feb 24, 2022. It is now read-only.

Add a provider field fixes #77 #78

Merged
merged 1 commit into from
Oct 13, 2016

Conversation

jaredlockhart
Collaborator

No description provided.

metadata.provider = '';
if(url) {
const parsedUrl = urlparse.parse(url);
metadata.provider = parsedUrl.hostname.replace('www.', '').split('.').slice(0, 1).join('');

Contributor

FWIW, this will possibly "fail" w/ subdomains.

function providerUrl(hostname) {
  const provider = hostname.replace('www.', '').split('.').slice(0, 1).join('');
  console.log(provider);
}

providerUrl('yahoo.com'); // "yahoo"
providerUrl('www.reddit.com'); // "reddit"
providerUrl('sports.bing.com'); // "sports"  <== should this be "sports" or "bing"?
providerUrl('www.bbc.co.uk'); // "bbc"

Contributor

Also, you may be able to convert the .slice(0, 1).join('') to .shift(), as in:

const provider = hostname.replace('www.', '').split('.').shift();
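
A quick way to check that the two forms agree is to run both over a few hostnames. This is only a sketch: the function names `firstLabelSliceJoin` and `firstLabelShift` and the sample list are made up for the comparison and are not part of the PR.

```js
// Hypothetical helper names, used only to compare the two spellings.
function firstLabelSliceJoin(hostname) {
  return hostname.replace('www.', '').split('.').slice(0, 1).join('');
}

function firstLabelShift(hostname) {
  // split() always returns at least one element, so shift() never yields undefined here.
  return hostname.replace('www.', '').split('.').shift();
}

const samples = ['yahoo.com', 'www.reddit.com', 'sports.bing.com', 'www.bbc.co.uk', 'localhost', ''];

// Collect any hostname where the two forms disagree.
const mismatches = samples.filter((h) => firstLabelSliceJoin(h) !== firstLabelShift(h));
console.log(mismatches); // []
```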

Contributor

Depending on your viewpoint on supporting subdomains like "sports.bing.com", we could also possibly use something like tldjs to do some intelligent guessing for TLDs.

const tld = require('tldjs');

function providerUrl(hostname) {
    const provider = hostname.replace('www.', '')
      .replace('.' + tld.getPublicSuffix(hostname), '');
    console.log(provider);
}

providerUrl('yahoo.com'); // "yahoo"
providerUrl('www.reddit.com'); // "reddit"
providerUrl('sports.bing.com'); // "sports.bing"
providerUrl('www.bbc.co.uk'); // "bbc"

https://runkit.com/57fff073d643a000147e639d/57fff073d643a000147e639e


But basically I'm not 💯 on what sports.bing.com should return for a provider:

  1. "sports"? Your current behavior.
  2. "sports.bing"? That's the TLD.js behavior in the comment above.
  3. "bing"? Should be easy enough using a combination of TLD.js and then doing a .pop() to get the last segment of the hostname (after we remove the TLD).

function providerUrl(hostname) {
    const provider = hostname // .replace('www.', '')
      .replace('.' + tld.getPublicSuffix(hostname), '')
      .split('.')
      .pop();
    console.log(provider);
}

Of course, there may be some hidden dragons w/ TLD.js which may make it more effort/risk than it's worth.
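
For comparison, option 3 can be sketched without the dependency. Big assumption: `KNOWN_SUFFIXES` below is a tiny hand-picked list standing in for the real Public Suffix List, and `registrableLabel` is a hypothetical name, so this only illustrates the idea and is not a replacement for TLD.js.

```js
// ASSUMPTION: a small hand-picked suffix list, NOT the real Public Suffix List.
const KNOWN_SUFFIXES = ['co.uk', 'com.au', 'net.au', 'com', 'org', 'net', 'uk', 'au'];

// Hypothetical helper: returns the label directly before the public suffix,
// or null when the suffix is not in the list above.
function registrableLabel(hostname) {
  // Try longer suffixes first so "co.uk" wins over "uk".
  const suffix = KNOWN_SUFFIXES
    .slice()
    .sort((a, b) => b.length - a.length)
    .find((s) => hostname.endsWith('.' + s));
  if (!suffix) return null; // unknown suffix: the caller decides what to do

  // Drop ".suffix", then keep only the last remaining label.
  const rest = hostname.slice(0, -(suffix.length + 1));
  return rest.split('.').pop();
}

console.log(registrableLabel('sports.bing.com')); // "bing"
console.log(registrableLabel('www.bbc.co.uk'));   // "bbc"
console.log(registrableLabel('example.ninja'));   // null
```

The `null` return makes the gap visible: any suffix missing from the list needs a fallback, which is the kind of maintenance cost a library would otherwise absorb.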

@jaredlockhart
Collaborator Author

Allowed support for subdomains so 'sports.bing.com' will render as 'sports bing' which I think is a reasonable compromise.

@coveralls

Coverage Status

Coverage remained the same at 100.0% when pulling 6e73102 on 77 into ee886dc on master.

@coveralls

Coverage Status

Coverage remained the same at 100.0% when pulling f2acfef on 77 into ee886dc on master.

@jaredlockhart jaredlockhart merged commit 25e3d41 into master Oct 13, 2016
return urlparse.parse(url)
.hostname
.replace(/www[a-zA-Z0-9]*\./, '')
.replace('co.', '')

Contributor

The special casing of this one particular second-level TLD is mildly concerning.
Not sure how many others (if any) we should bother to handle: https://en.wikipedia.org/wiki/Second-level_domain

But probably can call it "good enough" for now, and address in the future if we notice weird results.

function getProvider(hostname) {
  const out = hostname
    .replace(/www[a-zA-Z0-9]*\./, '')
    .replace('co.', '')
    .split('.')
    .slice(0, -1)
    .join(' ');
  console.log(out);
}

getProvider('yahoo.com'); // "yahoo"
getProvider('www.reddit.com'); // "reddit"
getProvider('sports.bing.com'); // "sports bing"
getProvider('www.bbc.co.uk'); // "bbc"
getProvider('www.cnn.com'); // "cnn"
getProvider('aeon.co'); // "aeon"
getProvider('mobile.nytimes.com'); // "mobile nytimes"
getProvider('www1.foobar.ca'); // "foobar"
getProvider('mobile.www.barfoo.ninja'); // "mobile barfoo" // <== removes "www." from middle of URL.
getProvider('www.school.k12.il'); // "school k12"
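
One case the examples above don't cover: because `.replace('co.', '')` is not anchored to a whole label, a hostname whose name merely ends in "co" gets mangled. The sketch below reproduces the chain and compares it with a more conservative variant. `getProviderAnchored` and its regexes are assumptions for illustration, not code from the PR, and `costco.com` is just a sample hostname.

```js
// Same chain as the getProvider above, but returning instead of logging.
function getProvider(hostname) {
  return hostname
    .replace(/www[a-zA-Z0-9]*\./, '')
    .replace('co.', '')
    .split('.')
    .slice(0, -1)
    .join(' ');
}

// Hypothetical variant: strip only a leading "www"/"www1" label, and drop "co"
// only when it is a whole label sitting directly before the final label.
function getProviderAnchored(hostname) {
  return hostname
    .replace(/^www[a-zA-Z0-9]*\./, '')
    .replace(/\.co(?=\.[^.]+$)/, '')
    .split('.')
    .slice(0, -1)
    .join(' ');
}

console.log(JSON.stringify(getProvider('www.costco.com')));         // ""
console.log(JSON.stringify(getProviderAnchored('www.costco.com'))); // "costco"
console.log(JSON.stringify(getProviderAnchored('www.bbc.co.uk')));  // "bbc"
```

The anchored variant changes one documented behavior: a "www." in the middle of a hostname is no longer removed, so 'mobile.www.barfoo.ninja' becomes "mobile www barfoo".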

Contributor

Other interesting results from random Reddit /r/worldnews results:

getProvider('www.businessinsider.com.au'); // "businessinsider com"
getProvider('abcnews.go.com'); // "abcnews go"
getProvider('timesofindia.indiatimes.com'); // "timesofindia indiatimes"
getProvider('www.abc.net.au'); // "abc net"
getProvider('bigstory.ap.org'); // "bigstory ap"

@pdehaan
Contributor

pdehaan commented Oct 13, 2016

Last one: here are the sorted and deduped results of running the first 5 pages of /r/worldnews through your provider function (keeping only providers with more than one word):

[ 'abc net',    // http://www.abc.net.au/
  'abcnews go',    // http://abcnews.go.com/
  'bigstory ap',    // http://bigstory.ap.org
  'businessinsider com',    // http://www.businessinsider.com.au/
  'dailystar com',    // http://www.dailystar.com.lb/
  'economictimes indiatimes',    // http://economictimes.indiatimes.com/
  'edition cnn',    // http://edition.cnn.com/
  'globalnation inquirer',    // http://globalnation.inquirer.net/
  'm ndtv',    // http://m.ndtv.com/
  'mobile reuters',    // http://mobile.reuters.com/
  'motherboard vice',    // http://motherboard.vice.com/
  'nakedsecurity sophos',    // https://nakedsecurity.sophos.com/
  'news sky',    // http://news.sky.com/
  'news vice',    // https://news.vice.com/
  'timesofindia indiatimes'    // http://timesofindia.indiatimes.com/
]

const urlparse = require('url');
const { fetchSubreddit, domainReducer } = require('reddit-as-json');

fetchSubreddit('worldnews', 5)
  .then(domainReducer)
  .then(({data}) => data.map((domain) => getProvider(`https://${domain.name}`)))
  .then((data) => {
    return Object.keys(data.reduce((prev, curr) => {
      prev[curr] = true;
      return prev;
    }, {})).sort();
  })
  .then((data) => data.filter((provider) => provider.split(' ').length > 1))
  .then((data) => console.log(data))
  .catch((err) => console.error(err));

function getProvider(url) {
  return urlparse.parse(url)
    .hostname
    .replace(/www[a-zA-Z0-9]*\./, '')
    .replace('co.', '')
    .split('.')
    .slice(0, -1)
    .join(' ');
}
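
The same dedupe-and-filter step can be tried offline. The sketch below makes a few assumptions: it swaps the network fetch for a hard-coded sample of hostnames taken from the list above, uses the WHATWG `URL` class instead of `url.parse`, and dedupes with a `Set` instead of the reduce-to-object trick. `multiWordProviders` and `sample` are hypothetical names.

```js
// Same replace/split chain as getProvider above, using the global URL class.
function getProvider(url) {
  return new URL(url).hostname
    .replace(/www[a-zA-Z0-9]*\./, '')
    .replace('co.', '')
    .split('.')
    .slice(0, -1)
    .join(' ');
}

// Hypothetical helper: dedupe with a Set, sort, keep multi-word providers only.
function multiWordProviders(hostnames) {
  const providers = new Set(hostnames.map((h) => getProvider(`https://${h}`)));
  return [...providers].sort().filter((p) => p.split(' ').length > 1);
}

// Sample input standing in for the subreddit fetch (note the duplicate entry).
const sample = [
  'www.abc.net.au',
  'abcnews.go.com',
  'news.vice.com',
  'www.cnn.com',
  'abcnews.go.com',
  'edition.cnn.com',
];

console.log(multiWordProviders(sample));
// [ 'abc net', 'abcnews go', 'edition cnn', 'news vice' ]
```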
