New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1362438: additional redirects from SCL3 Apache to Django #4231

Merged
merged 3 commits into from Aug 3, 2017

Conversation

Projects
None yet
3 participants
@metadave
Member

metadave commented May 10, 2017

source:
https://github.com/mozilla/kuma/blob/master/configs/htaccess

The remaining rules to address from htaccess are in this gist for now.

  • Rewrite rules that don't redirect (typically annotated with [L] in this file) will most likely need to be handled by Whitenoise or some other method:
# Links to FTP'ed code samples and examples
RewriteRule ^patches(.*) data/www/patches$1 [L]
RewriteRule ^presentations(.*) data/www/presentations$1 [L]
RewriteRule ^samples(.*) data/www/samples$1 [L]
RewriteRule ^diagrams(.*) data/www/diagrams$1 [L]
RewriteRule ^web-tech(.*) data/www/web-tech$1 [L]
RewriteRule ^css(.*) data/www/css$1 [L]


# Rewrites to robots & sitemaps
RewriteRule ^robots.txt$ media/robots.txt [L]
RewriteRule ^sitemap.xml$ media/sitemap.xml [L]
RewriteRule ^sitemaps/([\w\-]*)/sitemap.xml$ media/sitemaps/$1/sitemap.xml [L]
  • These seem to be broken:
# Miscellaneous off-site redirects
# broken
RewriteRule ^contests/$ http://labs.mozilla.com/contests/extendfirefox/ [R=302]
# broken
RewriteRule ^contests/extendfirefox(/.*)? http://labs.mozilla.com/contests/extendfirefox$1 [R=302]
# http://wiki.ecmascript.org doesn't seem to respond.
RewriteRule ^es4(/.*)?$ http://wiki.ecmascript.org/ [R]
  • These remain unimplemented:
RewriteCond %{QUERY_STRING} ^title=http [NC]
RewriteRule ^index\.php$ / [L,NC]

RewriteCond %{QUERY_STRING} ^title=(.*)$ [NC]
RewriteRule ^index\.php$ %1 [R=301,L,NC]

# HACK: Django will eventually redirect the user to the right spot, but skip a
# couple of redirects for these known legacy locales
RewriteRule ^en/(.*)$     /mwsgi/en-US/$1 [L,QSA,NE,NC]
RewriteRule ^cn/(.*)$     /mwsgi/zh-CN/$1 [L,QSA,NE,NC]
RewriteRule ^zh_cn/(.*)$  /mwsgi/zh-CN/$1 [L,QSA,NE,NC]
RewriteRule ^zh_tw/(.*)$  /mwsgi/zh-TW/$1 [L,QSA,NE,NC]

# Everything else passes through the Django handler
RewriteRule ^(.*)$ /mwsgi/$1 [L,QSA,NE]
  • For the following rule, we discussed using a Django view to serve robots.txt via a template to change content based on dev/stage/prod:
RewriteCond %{HTTP_HOST} allizom
RewriteRule ^robots.txt$ media/robots-go-away.txt [L]
  • Misc
    • tests with en-US hardcoded either have en-US directly in the redirect rule, or use this prefix regex to match locale: ^(?P<pre>\w{2,3}(?:-\w{2})?/)?
    • autopep8 caused some extra whitespace diffs
    • mozmeao/infra#217
@metadave

This comment has been minimized.

Show comment
Hide comment
@metadave

metadave May 10, 2017

Member

I'm not sure if the Travis errors are related to this PR.

Member

metadave commented May 10, 2017

I'm not sure if the Travis errors are related to this PR.

@metadave metadave changed the title from additional redirects from SCL3 Apache to Django to Bug 1362438: additional redirects from SCL3 Apache to Django May 10, 2017

Show outdated Hide outdated kuma/redirects/redirects.py
redirect(
r'^(?P<pre>[\w\-]*)/HTML5$',
'/{pre}/docs/HTML/HTML5',
re_flags='i', prepend_locale=False, permanent=True),

This comment has been minimized.

@jwhitlock

jwhitlock May 11, 2017

Member

This pattern is causing the Kuma test failure. I'll need to look closer at Kuma and commit history to decide which needs to bend.

@jwhitlock

jwhitlock May 11, 2017

Member

This pattern is causing the Kuma test failure. I'll need to look closer at Kuma and commit history to decide which needs to bend.

@jwhitlock jwhitlock self-assigned this May 15, 2017

@metadave metadave requested a review from escattone Jul 18, 2017

@jwhitlock

This comment has been minimized.

Show comment
Hide comment
@jwhitlock

jwhitlock Jul 27, 2017

Member

OK, here's what's happening. The failing test is requesting this path:

en-US/HTML/HTML5?view=edit

The redirect rule is parsing this as:

  • locale=en-US/
  • pre=HTML
  • (rest) = /HTML5

This is because locale_prefix=True, which is the default, and automatically looks for and captures an optional locale in the pattern.

So, the redirect pattern uses the rule:

{locale}{pre}/docs/HTML/HTML5

to redirect to:

en-US/HTML/docs/HTML/HTML5?view=edit

breaking the test, which is trying to detect MindTouch URLs like ?view=edit and convert them to Kuma URLs like $edit. Which is another level of dumb.

In any case, URL patterns that are trying to detect locales (this one and a few others) need to lean on locale_prefix instead.

I think I own this PR now, having made @metadave wait over 2 months for me to figure it out.

Member

jwhitlock commented Jul 27, 2017

OK, here's what's happening. The failing test is requesting this path:

en-US/HTML/HTML5?view=edit

The redirect rule is parsing this as:

  • locale=en-US/
  • pre=HTML
  • (rest) = /HTML5

This is because locale_prefix=True, which is the default, and automatically looks for and captures an optional locale in the pattern.

So, the redirect pattern uses the rule:

{locale}{pre}/docs/HTML/HTML5

to redirect to:

en-US/HTML/docs/HTML/HTML5?view=edit

breaking the test, which is trying to detect MindTouch URLs like ?view=edit and convert them to Kuma URLs like $edit. Which is another level of dumb.

In any case, URL patterns that are trying to detect locales (this one and a few others) need to lean on locale_prefix instead.

I think I own this PR now, having made @metadave wait over 2 months for me to figure it out.

metadave and others added some commits May 10, 2017

bug 1362438: Refactor locale redirects
Add helper function to explicitly set locale_prefix and prepend_locale
to False for "regular" redirects, to avoid matching on non-locale
prefixes.

Add helper function locale_redirect to set locale_prefix to True, and
default prepend_locale to True. Adjust locale redirect patterns to rely
on locale matching in redirect_urls library.

The /Learn topic has moved to /docs/Learn, but this redirect is
enforced in Apache rules. Adding a TODO to change the redirect after AWS
move.

Adjusted some test patterns to use valid locales, so that tests will
pass against Apache in SCL3 and locally.
@codecov-io

This comment has been minimized.

Show comment
Hide comment
@codecov-io

codecov-io Aug 3, 2017

Codecov Report

Merging #4231 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4231      +/-   ##
==========================================
+ Coverage   88.32%   88.33%   +<.01%     
==========================================
  Files         163      163              
  Lines       10229    10233       +4     
  Branches     1420     1420              
==========================================
+ Hits         9035     9039       +4     
  Misses        968      968              
  Partials      226      226
Impacted Files Coverage Δ
kuma/redirects/redirects.py 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f08f03c...17ec129. Read the comment docs.

codecov-io commented Aug 3, 2017

Codecov Report

Merging #4231 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4231      +/-   ##
==========================================
+ Coverage   88.32%   88.33%   +<.01%     
==========================================
  Files         163      163              
  Lines       10229    10233       +4     
  Branches     1420     1420              
==========================================
+ Hits         9035     9039       +4     
  Misses        968      968              
  Partials      226      226
Impacted Files Coverage Δ
kuma/redirects/redirects.py 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f08f03c...17ec129. Read the comment docs.

@jwhitlock

Django tests are passing, and the redirect tests pass against Apache/SCL3 and just Django. Thanks, @metadave!

@jwhitlock jwhitlock merged commit 3b6807e into master Aug 3, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@jwhitlock jwhitlock deleted the dp_apache_redirects_part_2 branch Aug 3, 2017

@metadave metadave referenced this pull request Aug 7, 2017

Merged

Kuma Report, July 2017 #48

@jwhitlock jwhitlock referenced this pull request Aug 17, 2017

Closed

Implement and retire remaining Apache rules #436

8 of 8 tasks complete
@jwhitlock

This comment has been minimized.

Show comment
Hide comment
@jwhitlock

jwhitlock Aug 23, 2017

Member

As part of mozmeao/infra#436, I took a look at what's left from the Apache config.

# Links to FTP'ed code samples and examples
RewriteRule ^patches(.*) data/www/patches$1 [L]
RewriteRule ^presentations(.*) data/www/presentations$1 [L]
RewriteRule ^samples(.*) data/www/samples$1 [L]
RewriteRule ^diagrams(.*) data/www/diagrams$1 [L]
RewriteRule ^web-tech(.*) data/www/web-tech$1 [L]
RewriteRule ^css(.*) data/www/css$1 [L]

presentations, diagrams, and samples are handled in kuma PR 4365 and infra PR 431.

patches, web-tech, and css are no longer present in the public folder on SCL3. There are some files in a non-served location for web-tech, which makes it appear to be a WordPress site from 2009-2010. The files are some scaled images, no text content. The internet will not miss these files, and we should start returning 404s.

RewriteRule ^robots.txt$ media/robots.txt [L]

robots.txt is a Django view, fixed in kuma PR 4273

RewriteRule ^sitemap.xml$ media/sitemap.xml [L]
RewriteRule ^sitemaps/([\w\-]*)/sitemap.xml$ media/sitemaps/$1/sitemap.xml [L]

These files are periodically generated, and accessed by Google and others to see what is new. We'll need to serve these from Django instead. If we move MEDIA_ROOT up a level, then we should be OK.

# Miscellaneous off-site redirects
# broken
RewriteRule ^contests/$ http://labs.mozilla.com/contests/extendfirefox/ [R=302]
# broken
RewriteRule ^contests/extendfirefox(/.*)? http://labs.mozilla.com/contests/extendfirefox$1 [R=302]
# http://wiki.ecmascript.org doesn't seem to respond.
RewriteRule ^es4(/.*)?$ http://wiki.ecmascript.org/ [R]

contests appears to refer to the Extend Firefox contest from 2008. There's no good place to send these requests, but http://www.mozillalabs.com/ feels as good as any. There were 3 requests in the last week.

es4 appears to have been the ECMAScript 4 committee wiki. It looks like Mozilla hosted it in 2006, before it moved. ES4 was abandoned. Wikipedia points to links on www.ecmascript.org, but these don't resolve. The new home appears to be http://www.ecma-international.org/memento/TC39.htm, which seems as good as any other place to send people.

RewriteCond %{QUERY_STRING} ^title=http [NC]
RewriteRule ^index\.php$ / [L,NC]

RewriteCond %{QUERY_STRING} ^title=(.*)$ [NC]
RewriteRule ^index\.php$ %1 [R=301,L,NC]

This must have been the way to load pages in the MediaWiki/DekiWiki days. I looked at the requests and they are roughly:

  • 90% over 1600 requests for recent changes RSS feeds. We have some feeds, but I think there is little value in making these legacy feed URLs start working.
  • 6% requests for legacy pages, mostly from search engine crawlers, none from internal referrers
  • 4% "sniffing" attempts, looking for Joomla admin, etc.

I don't see any reason to handle these requests, and they should return 404s

# HACK: Django will eventually redirect the user to the right spot, but skip a
# couple of redirects for these known legacy locales
RewriteRule ^en/(.*)$     /mwsgi/en-US/$1 [L,QSA,NE,NC]
RewriteRule ^cn/(.*)$     /mwsgi/zh-CN/$1 [L,QSA,NE,NC]
RewriteRule ^zh_cn/(.*)$  /mwsgi/zh-CN/$1 [L,QSA,NE,NC]
RewriteRule ^zh_tw/(.*)$  /mwsgi/zh-TW/$1 [L,QSA,NE,NC]

This is tracked in bug 962148. This rule serves up the duplicate content under a different URL, and instead we want a permanent redirect.

# Everything else passes through the Django handler
RewriteRule ^(.*)$ /mwsgi/$1 [L,QSA,NE]

This is the Apache alias for the Django modwsgi handler, and it not needed in AWS.

To determine the impact of these rules, I looked the the Apache access logs on developer1, which is one of six servers for developer.mozilla.org, developer.cdn.mozilla.net, and mozillademos.org.

for site in developer.mozilla.org mozillademos.org developer.cdn.mozilla.net; do
  echo $site;
  find $site -name 'access*' | xargs zgrep "GET /presentations" | wc -l;
done
Path developer.mozilla.org mozillademos.org developer.cdn.mozilla.net
/presentations 214 0 0
/patches 0 0 0
/samples 1083 0 0
/diagrams 0 0 0
/web-tech 339 0 0
/css 290 1 0
/robots.txt 835 203 3
/sitemap.xml 33 7 0
/sitemaps 181 0 0
/contests 2 0 0
/es4 13 0 0
/index.php 1769 0 0
/en/ 114320 0 0
/cn/ 451 0 0
/zh_cn/ 10 0 0
/zh_tw/ 96 0 0

Update 1: We do have recent changes feeds. I still think we should 404 the requests to the legacy feeds, rather than redirect them to "Create a New Page".

Member

jwhitlock commented Aug 23, 2017

As part of mozmeao/infra#436, I took a look at what's left from the Apache config.

# Links to FTP'ed code samples and examples
RewriteRule ^patches(.*) data/www/patches$1 [L]
RewriteRule ^presentations(.*) data/www/presentations$1 [L]
RewriteRule ^samples(.*) data/www/samples$1 [L]
RewriteRule ^diagrams(.*) data/www/diagrams$1 [L]
RewriteRule ^web-tech(.*) data/www/web-tech$1 [L]
RewriteRule ^css(.*) data/www/css$1 [L]

presentations, diagrams, and samples are handled in kuma PR 4365 and infra PR 431.

patches, web-tech, and css are no longer present in the public folder on SCL3. There are some files in a non-served location for web-tech, which makes it appear to be a WordPress site from 2009-2010. The files are some scaled images, no text content. The internet will not miss these files, and we should start returning 404s.

RewriteRule ^robots.txt$ media/robots.txt [L]

robots.txt is a Django view, fixed in kuma PR 4273

RewriteRule ^sitemap.xml$ media/sitemap.xml [L]
RewriteRule ^sitemaps/([\w\-]*)/sitemap.xml$ media/sitemaps/$1/sitemap.xml [L]

These files are periodically generated, and accessed by Google and others to see what is new. We'll need to serve these from Django instead. If we move MEDIA_ROOT up a level, then we should be OK.

# Miscellaneous off-site redirects
# broken
RewriteRule ^contests/$ http://labs.mozilla.com/contests/extendfirefox/ [R=302]
# broken
RewriteRule ^contests/extendfirefox(/.*)? http://labs.mozilla.com/contests/extendfirefox$1 [R=302]
# http://wiki.ecmascript.org doesn't seem to respond.
RewriteRule ^es4(/.*)?$ http://wiki.ecmascript.org/ [R]

contests appears to refer to the Extend Firefox contest from 2008. There's no good place to send these requests, but http://www.mozillalabs.com/ feels as good as any. There were 3 requests in the last week.

es4 appears to have been the ECMAScript 4 committee wiki. It looks like Mozilla hosted it in 2006, before it moved. ES4 was abandoned. Wikipedia points to links on www.ecmascript.org, but these don't resolve. The new home appears to be http://www.ecma-international.org/memento/TC39.htm, which seems as good as any other place to send people.

RewriteCond %{QUERY_STRING} ^title=http [NC]
RewriteRule ^index\.php$ / [L,NC]

RewriteCond %{QUERY_STRING} ^title=(.*)$ [NC]
RewriteRule ^index\.php$ %1 [R=301,L,NC]

This must have been the way to load pages in the MediaWiki/DekiWiki days. I looked at the requests and they are roughly:

  • 90% over 1600 requests for recent changes RSS feeds. We have some feeds, but I think there is little value in making these legacy feed URLs start working.
  • 6% requests for legacy pages, mostly from search engine crawlers, none from internal referrers
  • 4% "sniffing" attempts, looking for Joomla admin, etc.

I don't see any reason to handle these requests, and they should return 404s

# HACK: Django will eventually redirect the user to the right spot, but skip a
# couple of redirects for these known legacy locales
RewriteRule ^en/(.*)$     /mwsgi/en-US/$1 [L,QSA,NE,NC]
RewriteRule ^cn/(.*)$     /mwsgi/zh-CN/$1 [L,QSA,NE,NC]
RewriteRule ^zh_cn/(.*)$  /mwsgi/zh-CN/$1 [L,QSA,NE,NC]
RewriteRule ^zh_tw/(.*)$  /mwsgi/zh-TW/$1 [L,QSA,NE,NC]

This is tracked in bug 962148. This rule serves up the duplicate content under a different URL, and instead we want a permanent redirect.

# Everything else passes through the Django handler
RewriteRule ^(.*)$ /mwsgi/$1 [L,QSA,NE]

This is the Apache alias for the Django modwsgi handler, and it not needed in AWS.

To determine the impact of these rules, I looked the the Apache access logs on developer1, which is one of six servers for developer.mozilla.org, developer.cdn.mozilla.net, and mozillademos.org.

for site in developer.mozilla.org mozillademos.org developer.cdn.mozilla.net; do
  echo $site;
  find $site -name 'access*' | xargs zgrep "GET /presentations" | wc -l;
done
Path developer.mozilla.org mozillademos.org developer.cdn.mozilla.net
/presentations 214 0 0
/patches 0 0 0
/samples 1083 0 0
/diagrams 0 0 0
/web-tech 339 0 0
/css 290 1 0
/robots.txt 835 203 3
/sitemap.xml 33 7 0
/sitemaps 181 0 0
/contests 2 0 0
/es4 13 0 0
/index.php 1769 0 0
/en/ 114320 0 0
/cn/ 451 0 0
/zh_cn/ 10 0 0
/zh_tw/ 96 0 0

Update 1: We do have recent changes feeds. I still think we should 404 the requests to the legacy feeds, rather than redirect them to "Create a New Page".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment