Encoding on Swedish is broken #10

Closed
Ainali opened this issue Feb 27, 2016 · 10 comments

Comments

@Ainali
Contributor

Ainali commented Feb 27, 2016

The letters å, ä and ö are not displayed correctly, as can be seen at positions 3, 9 and 29 (among others) here: http://top.hatnote.com/sv/wikipedia/2016/2/24.html

@mahmoud
Member

mahmoud commented Feb 27, 2016

Hey Jan, thanks for the report. We just noticed this yesterday and I'm looking into it now. It appears to be an upstream encoding change, as we've pushed no code changes in the past week or so. Not sure if there was a Labs announcement about this, but it is pretty inconvenient either way, sorry about that.

When we fix it I'll regenerate the old pages and let you know here.

@mahmoud
Member

mahmoud commented Feb 27, 2016

(and definitely don't go look at Chinese, it's a disaster)

@mahmoud
Member

mahmoud commented Feb 27, 2016

I can now confirm, the upstream service data encoding broke this week: Here is the raw data for the 26th.

If you wget that url and look at it in a text editor, you may notice many occurrences of \ufffd. That is the Unicode "replacement character", usually found in cases of dirty/improper encoding. It's often rendered as a box or a question mark. Anyways, I will try to pick this up with the relevant WMF people.
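
As a rough illustration of the above (not the code top.hatnote.com actually runs), here is a minimal Python sketch of how U+FFFD ends up in text and how to spot it. The sample title and the helper names are made up for the example. Parsing the raw file with `json.loads` turns the escaped `\ufffd` you see in a text editor into the actual character.

```python
import json

REPLACEMENT_CHAR = "\ufffd"  # Unicode "replacement character"


def count_replacement_chars(text: str) -> int:
    """Count how many replacement characters a string contains."""
    return text.count(REPLACEMENT_CHAR)


def find_damaged(titles):
    """Return only the titles that contain at least one U+FFFD."""
    return [t for t in titles if REPLACEMENT_CHAR in t]


# One common way the character appears: bytes written in one encoding
# (Latin-1 here, purely as an example) are decoded as UTF-8 with
# errors="replace", so the undecodable byte is swapped for U+FFFD.
damaged = "Malmö".encode("latin-1").decode("utf-8", errors="replace")

# In raw JSON the character shows up as the six-character escape \ufffd;
# json.loads converts it back into the real code point.
parsed = json.loads('{"article": "Malm\\ufffd"}')

print(repr(damaged))
print(find_damaged(["Malmö", parsed["article"]]))
```

Once a title has been through this kind of lossy decode, the original letter cannot be recovered from the string itself, which is why the fix has to happen where the data is produced.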

@milimetric

I'm as surprised by this as you are. There was a deployment on Monday which allows you to pass URI-encoded titles to the per-article endpoint (not the one you use). The other thing we added is to specify UTF-8 in the Content-Type header. But that was released a while ago and should have only helped with this issue. So if this was fine on Wednesday but broken on Friday, then maybe the issue is in the front-end RESTBase instance that proxies requests to us. I will take a look but sadly am away from the laptop until tomorrow night. I don't have access to Phabricator, so if someone could add an unbreak-now task and tag it with Analytics, that'd be useful. Thanks and sorry for the annoyance.
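
For readers unfamiliar with the two changes mentioned here, a small Python sketch of what they amount to on the client side. The title and the header value are illustrative assumptions, not captured from the actual API.

```python
from email.message import Message
from urllib.parse import quote, unquote

# URI-encoding a title: non-ASCII characters are UTF-8 encoded and each
# byte is written as %XX. safe="" also escapes "/" inside a title.
title = "Malmö"  # example title, chosen only for illustration
encoded = quote(title, safe="")
print(encoded)           # Malm%C3%B6
print(unquote(encoded))  # Malmö

# Reading the charset a server declares in its Content-Type header.
# The header value below is a made-up example.
header = Message()
header["Content-Type"] = "application/json; charset=utf-8"
print(header.get_content_charset())  # utf-8
```

A declared charset only tells the client how to decode the bytes. It cannot repair titles that were already damaged before they were stored.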

@gwicke

gwicke commented Feb 28, 2016

The characters look garbled when requesting directly from the backend from within the cluster: curl http://aqs.svc.eqiad.wmnet:7232/analytics.wikimedia.org/v1/pageviews/top/sv.wikipedia/all-access/2016/02/26. This rules out a frontend issue.

However, characters are fine both internally and externally for older dates: https://wikimedia.org/api/rest_v1/metrics/pageviews/top/sv.wikipedia/all-access/2016/01/26

This suggests that something broke the top-title data stored to Cassandra recently.
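
One way to narrow down when the data went bad, sketched in Python. The URL pattern is taken from the public link above. The `items` / `articles` / `article` layout of the response is an assumption about the API's JSON, and `fetch` is a hypothetical callable you supply yourself (for example a thin wrapper around `urllib.request.urlopen` that returns the body as text).

```python
import json
from datetime import date
from typing import Callable, Iterable, Optional

TOP_URL = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/top/"
    "{project}/all-access/{d:%Y/%m/%d}"
)


def top_url(project: str, d: date) -> str:
    """Build the top-pageviews URL for one project and one day."""
    return TOP_URL.format(project=project, d=d)


def has_garbled_titles(payload: dict) -> bool:
    """True if any article title contains U+FFFD (assumed JSON layout)."""
    for item in payload.get("items", []):
        for entry in item.get("articles", []):
            if "\ufffd" in entry.get("article", ""):
                return True
    return False


def first_garbled_date(
    project: str,
    days: Iterable[date],
    fetch: Callable[[str], str],
) -> Optional[date]:
    """Return the first day whose top list has garbled titles, else None."""
    for d in days:
        body = fetch(top_url(project, d))
        if has_garbled_titles(json.loads(body)):
            return d
    return None
```

Passing `fetch` in as a parameter keeps the check usable both against the public endpoint and against the internal backend, which is the comparison made above.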

@papuass

papuass commented Mar 1, 2016

Same for Latvian: http://top.hatnote.com/lv/
I think they added Main page to the statistics at the same time. The problem for Latvian is that it is the most viewed page almost every day. So in the RSS version I get a notification that Main page was the most popular page yesterday, again.

@mahmoud
Member

mahmoud commented Mar 1, 2016

Created a separate issue for the RSS variety, but we're still waiting on the fix from the WMF Analytics team for the encoding.

@gwicke

gwicke commented Mar 1, 2016

See https://phabricator.wikimedia.org/T128295.

@milimetric

The important comment is https://phabricator.wikimedia.org/T128295#2074948 which says we have 7 days of data to re-compute. This will take a long time, as Joseph says, but we can let you know when it's done and the changes are propagated to the storage behind the API.

On Tue, Mar 1, 2016 at 3:25 AM, Gabriel Wicke wrote:

See https://phabricator.wikimedia.org/T128295.


Reply to this email directly or view it on GitHub
#10 (comment).

@slaporte
Member

slaporte commented Mar 7, 2016

We've regenerated Feb. 23 - 29, now that the pageview API has properly encoded titles for that period.

slaporte closed this as completed Mar 7, 2016