More tokens to strip #48

Open
wumpus opened this issue May 16, 2018 · 2 comments

@wumpus
Contributor

wumpus commented May 16, 2018

Hi. I'm a search engine guy, and I'm very interested in a well-tested list of strippable CGI args to reduce the work my crawler has to do. I tried to build a list algorithmically: I took the top 1000 websites from an old Alexa list, plus a few hosts I care about, sampled their URLs as crawled by CommonCrawl, and then counted which CGI args appeared on many of the hosts.

The biggest was &utm_source, appearing on 474 of the 1,000 hosts. I dropped everything that appeared on fewer than 5 hosts. So, in theory, this is a somewhat representative sample of the most popular ones... although CommonCrawl isn't totally representative of the web, of course.
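
For anyone who wants to repeat the count on their own crawl sample, here is a minimal sketch of the per-host tally described above. The function names and the `min_hosts` parameter are hypothetical, and it assumes the input is just an iterable of URL strings. It treats `www.example.com` and `example.com` as different hosts, which may or may not match the original count.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def count_param_hosts(urls):
    """Map each query-parameter name to the set of hosts it was seen on."""
    hosts_by_param = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        if not host:
            continue  # skip anything without a usable host
        # keep_blank_values=True so that "?foo=" still counts as a sighting of foo
        for name, _value in parse_qsl(parts.query, keep_blank_values=True):
            hosts_by_param[name].add(host)
    return hosts_by_param


def popular_params(urls, min_hosts=5):
    """Return (name, host_count) pairs seen on at least min_hosts hosts, most common first."""
    counts = {name: len(hosts) for name, hosts in count_param_hosts(urls).items()}
    kept = [(name, n) for name, n in counts.items() if n >= min_hosts]
    return sorted(kept, key=lambda item: (-item[1], item[0]))
```

Counting distinct hosts rather than raw occurrences keeps one very large site from dominating the list, which is why the sketch stores a set of hosts per parameter.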

Here is a list with examples of the ones that aren't currently in your configuration:

# more utm_ -- I think people use utm_ as a prefix for their own purposes and/or Google doesn't document all of them

# https://www.mozilla.org/en-US/firefox/new/?f=30&ref=producthunt&utm_expid=71153379-28.SNKFJ4VqRziIW1TLqjhpAw.1&utm_referrer=https%3A%2F%2Fwww.google.com%2F

utm_expid (15 hosts)
utm_referrer (12 hosts)

# https://www.etsy.com/?utm_source=google&utm_medium=cpc&utm_term=etsy&utm_campaign=search_fr_fr-fr-src-pure-brand-exact-st_exact_etsy&gclid=EAIaIQobChMIk6Duvp6n1QIVjantCh1f-whGEAAYASAAEgLsx_D_BwE&gclsrc=aw.ds

gclsrc 22 hosts

# https://www.google.fr/chrome/browser/features.html?brand=CHBD&gclid=CN6B2tjusdECFVAQ0wodfmcISw&dclid=CM6vjtnusdECFcSjUQodyg4B2Q

dclid 21 hosts {similar to gclid?}

# session IDs that normally live in cookies

# Adobe ColdFusion
# https://techcrunch.com/?CFID=8494701&CFTOKEN=56974155

&CFID= 25 hosts, 70 total instances
&CFTOKEN= 25 hosts, 70 total instances

# PHP
# http://instagram.com/p/BUPpEcIDFjT/?PHPSESSID=dbj4v5fl2c6sd8f8986aprqpf3

&PHPSESSID= 5 hosts, 89 total instances

and here are the popular ones that you don't have at all:

# Web Trends

# http://www.nature.com/collections/dtfkmdgglg?WT.mc_id=SFB_NA_1017_FattyLiverGraphic
# https://www.microsoft.com/en-us/store/b/accessories?tid=vpOCJmmq&cid=5250&pcrid=3050714533&pkw=makerbot%20replicator%202%20desktop%203d%20printer&pmt=e&WT.srch=1&WT.mc_id=pointitsem_Microsoft+US_bing_5+-+Accessories&WT.term=make
# https://www.chase.com/ccp/index.jsp?pg_name=ccpmapp/shared/assets/page/repayment_examples&WT.ac=st_ctr_student&jp_aid=st_ctr_student&WT.mc_id=st_ctr_student_repayment&jp_mep=st_ctr_student_repayment&WT.pn_sku=repayment_plans&memberid=studentcenter
# https://www.intuit.com/company/press-room/press-releases/2013/QuickenPullsBacktheCoversonLoveandMoney/?WT.qs_osrc=TST-164886110

&WT.mc_id= 24 hosts, 2530 total instances
&WT.srch= 14 hosts, 422 total instances
&WT.ac= 8 hosts, 4094 total instances
&WT.qs_osrc= 5 hosts, 20 total instances
&WT.pn_sku

# Oracle Eloqua

# http://www.cray.com/company/policies-and-practices/privacy-policy?elqTrackId=2e97d2d4f56e41eb9498379bab9753db&elqaid=584&elqat=2
# http://www.blackboard.com/Platforms/Collaborate/Resources/Webinars-and-Demos.aspx?elq=a318adfc3e7e40de83e0883a1d6760ba&elqCampaignId=329

&elqTrackId= 12 hosts, 191 total instances
&elqaid= 12 hosts, 189 total instances
&elqat= 12 hosts, 189 total instances
&elqCampaignId= 7 hosts, 138 total instances
&elq= 7 hosts, 111 total instances

# comScore Digital Analytix:

# http://www.dailymail.co.uk/sport/rugbyunion/article-5082539/France-23-28-New-Zealand-Blacks-French.html?ITO=1490&ns_mchannel=rss&ns_campaign=1490
# http://www.hotstar.com/tv/cineplay/13080?ns_mchannel=Article&ns_source=Scroll&ns_campaign=Cineplay&ns_linkname=CineplayShowPage&ns_fee=0

&ns_campaign= 6 hosts, 97 total instances
&ns_mchannel= 5 hosts, 92 total instances
&ns_source=
&ns_linkname=
&ns_fee=
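
As a rough illustration of what stripping these would look like, here is a sketch that removes the parameter names listed above from a URL. This is not the project's actual implementation. The names `STRIP_EXACT`, `STRIP_PREFIXES`, and `strip_tracking` are hypothetical, and both the case-insensitive matching and the choice to treat `utm_` as a prefix are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit, unquote

# Exact names from the lists above, compared case-insensitively (an assumption).
STRIP_EXACT = frozenset(name.lower() for name in (
    "gclsrc", "dclid", "CFID", "CFTOKEN", "PHPSESSID",
    "WT.mc_id", "WT.srch", "WT.ac", "WT.qs_osrc", "WT.pn_sku",
    "elqTrackId", "elqaid", "elqat", "elqCampaignId", "elq",
    "ns_campaign", "ns_mchannel", "ns_source", "ns_linkname", "ns_fee",
))
# utm_ is handled as a prefix because sites appear to invent their own utm_* names.
STRIP_PREFIXES = ("utm_",)


def strip_tracking(url):
    """Return url with the listed query parameters removed."""
    parts = urlsplit(url)
    kept = []
    for piece in parts.query.split("&"):
        if not piece:
            continue
        name = unquote(piece.split("=", 1)[0]).lower()
        if name in STRIP_EXACT or name.startswith(STRIP_PREFIXES):
            continue
        kept.append(piece)  # keep the raw text so surviving params are not re-encoded
    return urlunsplit(parts._replace(query="&".join(kept)))
```

The query string is split by hand rather than parsed and re-encoded, so parameters that survive keep their original escaping and order.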

# suspicious but probably too generic

# https://www.cray.com/?leadsource=website&srcdes=seagate&campaign=7010b0000018kLW
&campaign= 15 hosts, 9072 total instances

# https://wordpress.com/create/?utm_source=bing&utm_campaign=WordPress-Generic-Exact-US-GP&utm_medium=cpc&keyword=wordpress&creative=9925335912&campaignid=128065278&adgroupid=3099786316&matchtype=e&device=c&network=o
&campaignid= 6 hosts, 74 total instances

@newhouse
Owner

Hi @wumpus and thanks for the issue and excellent supporting data!
Some of these look for sure like no-brainers to add to the core set of trackers to block, while others look a little more dangerous.

I'm in the midst of working on a system to allow users to add/remove their own trackers, in which case I'd be far more willing to put many of these into the defaults. If I get stalled out on that update, I'll probably just add them to a minor update when I get an hour or so to play with and test them.

If you don't see any motion on this in a week or so, please prod me. Thanks again!

@wumpus
Contributor Author

wumpus commented May 21, 2018

Just noticed this one. A little googling says it's been around for a while, and that it's common enough that some Reddit subs have banned using it:

https://www.youtube.com/attribution_link?a=dRBqlLWtf5U&u=%2Fwatch%3Fv%3Dpogq2tZFKKo%26feature%3Dshare

It's not just a token to strip, though. Normally only Amazon designs urls this poorly!
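
To make the difference concrete: the real destination is percent-encoded inside the `u` parameter, so the URL has to be unwrapped rather than trimmed. A minimal sketch, assuming `u` always holds a path on the same host; the function name is hypothetical and the same-host guard is my own assumption:

```python
from urllib.parse import urlsplit, urljoin, parse_qs


def unwrap_attribution_link(url):
    """Return the destination hidden in an attribution_link URL, else url unchanged."""
    parts = urlsplit(url)
    if parts.path != "/attribution_link":
        return url
    targets = parse_qs(parts.query).get("u")
    if not targets:
        return url
    target = targets[0]  # parse_qs has already percent-decoded this value
    # Assumption: only accept same-host paths, never "//other.host/..." or absolute URLs.
    if not target.startswith("/") or target.startswith("//"):
        return url
    return urljoin(url, target)
```

For the example above this yields `https://www.youtube.com/watch?v=pogq2tZFKKo&feature=share`, and the leftover `feature=share` could then go through ordinary parameter stripping if it is on the strip list.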

lyubomyr-shaydariv added a commit to lyubomyr-shaydariv/uu-webext that referenced this issue Jun 19, 2024
lyubomyr-shaydariv added a commit to lyubomyr-shaydariv/uu-webext that referenced this issue Jun 20, 2024
lyubomyr-shaydariv added a commit to lyubomyr-shaydariv/uu-webext that referenced this issue Jun 21, 2024
lyubomyr-shaydariv added a commit to lyubomyr-shaydariv/uu-webext that referenced this issue Jun 22, 2024