Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Subtitle Downloading #1117

Merged
merged 18 commits into from
Jun 15, 2020
Merged

Subtitle Downloading #1117

merged 18 commits into from
Jun 15, 2020

Conversation

bakshiutkarsha
Copy link
Contributor

should resolve #877

@codecov
Copy link

codecov bot commented May 15, 2020

Codecov Report

Merging #1117 into master will increase coverage by 0.33%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1117      +/-   ##
==========================================
+ Coverage   68.65%   68.98%   +0.33%     
==========================================
  Files          25       25              
  Lines        2249     2270      +21     
  Branches      441      441              
==========================================
+ Hits         1544     1566      +22     
+ Misses        522      521       -1     
  Partials      183      183              
Impacted Files Coverage Δ
src/S3.ts 75.67% <ø> (+1.99%) ⬆️
src/util/misc.ts 71.84% <100.00%> (ø)
src/util/saveArticles.ts 81.81% <100.00%> (+0.93%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b9c3fff...9634b73. Read the comment docs.

@kelson42
Copy link
Collaborator

@bakshiutkarsha What is the status of this PR?! There is not WIP or draft... but nobody is assigned to review it?!

Copy link
Collaborator

@kelson42 kelson42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bakshiutkarsha Please secure that the the development process is respected properly. Missing the review assignement like all other kinds of thing going forgotten in the process (not putting draft, not reviewieng, forgotting to merge, forgotting to rebase properly, etc...) just slows down the whole process and creates overhead (often on my side).

Concerning the testing, the instructions were:

Here is how your test should looks like IMO: (1) you have a small wikicode for a video with subtitle (2) you send it to https://en.wikipedia.org/api/rest_v1/#/Transforms/post_transform_wikitext_to_html (3) you get the result rewrite it and verify that the track URL is proper.

So far I can see, (3) is there but not (1) and (2) which makes the whole test relatively weak. What will happen if for some reason the HTML generated by Mediawiki changes?! The answer is that it might go through unoticed!

@bakshiutkarsha
Copy link
Contributor Author

@bakshiutkarsha Please secure that the the development process is respected properly. Missing the review assignement like all other kinds of thing going forgotten in the process (not putting draft, not reviewieng, forgotting to merge, forgotting to rebase properly, etc...) just slows down the whole process and creates overhead (often on my side).

Concerning the testing, the instructions were:

Here is how your test should looks like IMO: (1) you have a small wikicode for a video with subtitle (2) you send it to https://en.wikipedia.org/api/rest_v1/#/Transforms/post_transform_wikitext_to_html (3) you get the result rewrite it and verify that the track URL is proper.

So far I can see, (3) is there but not (1) and (2) which makes the whole test relatively weak. What will happen if for some reason the HTML generated by Mediawiki changes?! The answer is that it might go through unoticed!

@kelson42 I will make sure in future, these things won't affect the development timeline and the process is respected properly.

src/util/saveArticles.ts Outdated Show resolved Hide resolved
src/util/saveArticles.ts Outdated Show resolved Hide resolved
src/util/saveArticles.ts Outdated Show resolved Hide resolved
let mediaDependencies: Array<{ url: string, path: string }> = [];
let doc = domino.createDocument(html);
const tmRet = await treatMedias(doc, mw, dump, articleId);
const tmRet = await treatMedias(doc, mw, dump, articleId, downloader, zimCreator);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure it's a good idea to carry downloader and zimCreator 3 levels down into treatMedias. Normally, we get a set of links back to upper level then handle them by iterating mediaDependencies here. This way we don't deal with zim file in treat*() functions which follow OOP spirit. I belive the function called processArticleHtml should know nothing about the entity like zim, so I feel there's better way to handle this ;)
Could we handle subtitles outside this function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understood this correctly, I agree with point that downloader and zimcreator got dragged 3 level down, but trackEle is tightly coupled with videoEle, so IMO it makes sense to have them in a same place. But le me know if you suggest differently?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably it does make sense to keep status quo because that would require so much changes to make the order here in processArticleHtml(). Let's table this until we get a resources to catch up a technical debt.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reopen this because I'm not happy about that either. We need to architecture this in a smarter way.

Copy link
Contributor Author

@bakshiutkarsha bakshiutkarsha May 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted the change of dragging downloader and zimcreator to 3 level down and handled this in different way as suggested.

src/util/saveArticles.ts Outdated Show resolved Hide resolved
test/util.ts Outdated Show resolved Hide resolved
test/unit/saveArticles.test.ts Outdated Show resolved Hide resolved
test/unit/saveArticles.test.ts Outdated Show resolved Hide resolved
let mediaDependencies: Array<{ url: string, path: string }> = [];
let doc = domino.createDocument(html);
const tmRet = await treatMedias(doc, mw, dump, articleId);
const tmRet = await treatMedias(doc, mw, dump, articleId, downloader, zimCreator);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reopen this because I'm not happy about that either. We need to architecture this in a smarter way.

test/unit/saveArticles.test.ts Outdated Show resolved Hide resolved
src/util/const.ts Outdated Show resolved Hide resolved
src/util/saveArticles.ts Outdated Show resolved Hide resolved
src/util/saveArticles.ts Outdated Show resolved Hide resolved
@midik
Copy link
Contributor

midik commented May 18, 2020

BTW, here's the fix for the bug that we're probably facing on v14

@kelson42
Copy link
Collaborator

BTW, here's the fix for the bug that we're probably facing on v14

Nice, happy if the bug is not on our side... and happy again to see this will be part of next v14 release (my guess).

@bakshiutkarsha bakshiutkarsha marked this pull request as draft May 23, 2020 08:00
@bakshiutkarsha bakshiutkarsha marked this pull request as ready for review May 25, 2020 20:14
Copy link
Contributor

@midik midik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minorities

test/util.ts Outdated Show resolved Hide resolved
test/util.ts Outdated Show resolved Hide resolved
test/unit/util.test.ts Outdated Show resolved Hide resolved
Copy link
Contributor

@midik midik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've no more feedback here at this point

Copy link
Collaborator

@kelson42 kelson42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI must pass

@@ -30,7 +30,7 @@ _test('S3 checks', async(t) => {
const imageExist = await s3.downloadIfPossible('bm.wikipedia.org/static/images/project-logos/bmwiki-2x.png', 'https://bm.wikipedia.org/static/images/project-logos/bmwiki-2x.png');
t.assert(!!imageExist, 'Image exists in S3');
// Checking the data related to image matches
t.equals(imageExist.headers.Metadata.etag, '"aeff-54a391a807034"', 'Etag matches');
t.equals(imageExist.headers.Metadata.etag, '"a740-5a6b0464619c2"', 'Etag matches');
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really weak... and consequently has been broken only a few months after been implemented. Automated test should not have the this ETAG hardcoded. If the whole lifecycle has been tested properly you should not need to do that. Should be pretty easy to fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed this case, because lifecycle is getting tested.

test/unit/util.test.ts Show resolved Hide resolved
test/util.ts Outdated Show resolved Hide resolved
src/util/saveArticles.ts Outdated Show resolved Hide resolved
src/util/saveArticles.ts Outdated Show resolved Hide resolved
src/util/saveArticles.ts Outdated Show resolved Hide resolved
@kelson42
Copy link
Collaborator

kelson42 commented Jun 6, 2020

@bakshiutkarsha It needs a few changes and the CI shoudl pass.

test/unit/util.test.ts Outdated Show resolved Hide resolved
test('treat multiple subtitles in one video', async(t) => {
const { downloader, mw, dump } = await setupScrapeClasses({ format: '' });

// Wikicode is taken from article "User:Charliechlorine/sandbox" which has multiple(4) subtitles in this video
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My last comment is confusing:

  • 'treat one subtitle' should test exactly which URL
  • 'treat multiple subtitles in one video' should do the same
  • I would add an global test, testing the whole HTML testHtmlRewritingE2e

@kelson42
Copy link
Collaborator

@bakshiutkarsha Can you please rebase on master?

@bakshiutkarsha
Copy link
Contributor Author

  • 'treat one subtitle' should test exactly which URL - DONE
  • 'treat multiple subtitles in one video' should do the same - DONE
  • I would add an global test, testing the whole HTML testHtmlRewritingE2e - DONE, but can you confirm is this what you wanted?

Copy link
Collaborator

@kelson42 kelson42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should have one test for treatSubtitle(), one test for treatVideo() and one E2E test for a video wiki text using testHtmlRewritingE2e().

Yout current usage of testHtmlRewritingE2e() for subtitles is useless because you test it with the value of your own code... so this is probably always going to work. A unit test should test against hardcoded values (so here HTML)

@bakshiutkarsha
Copy link
Contributor Author

bakshiutkarsha commented Jun 15, 2020

You should have one test for treatSubtitle(), one test for treatVideo() and one E2E test for a video wiki text using testHtmlRewritingE2e().

Yout current usage of testHtmlRewritingE2e() for subtitles is useless because you test it with the value of your own code... so this is probably always going to work. A unit test should test against hardcoded values (so here HTML)

The HTML I am getting from the API is not exactly same as the online version of it. For eg. API gives me HTML starting from figure tag and there is no such tag in online version of it. So, what should be done in this case?

@kelson42 kelson42 merged commit 39a5e62 into master Jun 15, 2020
@kelson42 kelson42 deleted the issue/877-new branch June 15, 2020 08:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Mediawiki video subtitles should be scraped too
3 participants