The tag "</p>" and "<p>" appears on the questions of MagoGenie channel on CC. #222

Closed
harishbalachandran opened this issue Apr 18, 2017 · 7 comments

Comments

@harishbalachandran

harishbalachandran commented Apr 18, 2017

Summary:
After importing the MagoGenie questions into CC, we see that the tags "</p>" and/or "<p>" appear in most of the questions.

Link used:
https://contentworkshop.learningequality.org/channels/f531b1ddf64755c9be5061a922317021/edit
Channel ID: f531b1ddf64755c9be5061a922317021

Screenshot:
screen shot 2017-04-18 at 12 46 29 pm
screen shot 2017-04-18 at 12 46 57 pm
screen shot 2017-04-18 at 12 51 23 pm
screen shot 2017-04-18 at 12 53 25 pm

@harishbalachandran
Author

@rtibbles @aronasorman Can you please put an ETA on this? It's a blocker for all the MG channels.

@rtibbles
Member

rtibbles commented Apr 24, 2017

Looks like the Magogenie content needs to be run through an HTML to Markdown parser before being put in?

This package should do the trick, I think: https://pypi.python.org/pypi/html2text

@jamalex
Member

jamalex commented Apr 24, 2017

Note that the ricecooker does not support HTML, only Markdown (with embedded $-delimited LaTeX formulas, as needed). The tool that @rtibbles links to may be helpful for converting HTML to Markdown in the sushi chef.

@yogeshmhaskule

yogeshmhaskule commented Apr 25, 2017

@rtibbles @aronasorman @jamalex For clarification: will this issue be handled in ricecooker, or do we need to handle it in our code (i.e. the sushi chef)? I remember @aronasorman and @jayoshih had taken care of the same issue before.

@jayoshih
Contributor

@yogeshmhaskule For security reasons, we needed to escape HTML tags to prevent script attacks. To maintain the paragraphs, you'll need to use \n instead.
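
As a rough illustration of the \n suggestion above, here is a minimal sketch that swaps paragraph tags for newlines using only the standard library. The helper name `paragraphs_to_newlines` is hypothetical and not part of ricecooker; it only handles `<p>` tags (including spaced variants like `< / P >`) and leaves every other tag untouched, so it is not a full HTML-to-Markdown conversion.

```python
import re

# Hypothetical helper, not part of ricecooker: handles <p> tags only.
_P_OPEN = re.compile(r"<\s*p(\s[^>]*)?>", re.IGNORECASE)
_P_CLOSE = re.compile(r"<\s*/\s*p\s*>", re.IGNORECASE)

def paragraphs_to_newlines(html):
    """Replace closing </p> tags with a newline and drop opening <p> tags."""
    text = _P_CLOSE.sub("\n", html)
    text = _P_OPEN.sub("", text)
    # Collapse runs of three or more newlines, then trim the ends.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

print(paragraphs_to_newlines("<p>First</p><p>Second</p>"))
```

Other tags (`<span>`, images, and so on) would still need separate handling, for example with an HTML-to-Markdown converter.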

@yogeshmhaskule

yogeshmhaskule commented Apr 26, 2017

@jayoshih @aronasorman I tried using \n in place of the </p> tag, but ran into problems with other tags such as <span> and </br>, and then with some .png and base64 images. For these I used the "html2text" Python package. It removes the 'img' tag of a base64 image and puts '![]' before the base64 image data, and it puts "!\[\]\" before a png image, so the png image failed to download. If you could provide more details on parsing HTML in the "Sample program" of ricecooker, it would help us understand the process and move forward.

I have attached the sample response of question in file:
sample_response.txt

Note that the answer content has the same format as the question content. For reference, take a look at the answer content in the file (it's a combination of text, MathML, and base64).

@jamalex
Member

jamalex commented Apr 26, 2017

The examples in your sample text contain MathML source, but also include the images that are the rendered version of the MathML, so we can just use the images in this case. The examples you describe (e.g. ![](...)) are valid Markdown, and will work with the ricecooker, even with base64-encoded images. However, there's some escaping in there, and newlines, that throw it off. The code below shows an example of converting the source you have provided into something that works for the ricecooker. For fuller MathML code (with no image alternative), you'll need to follow the instructions in the other issue.

import json
import requests
import html2text

from ricecooker.classes.nodes import ChannelNode, ExerciseNode
from ricecooker.classes.questions import MultipleSelectQuestion

from le_utils.constants import licenses

def convert_html_to_markdown(html):
    # Undo the "\/" escaping and drop newlines before conversion; both throw off html2text.
    return html2text.html2text(html.replace("\\/", "/").replace("\n", ""))

def construct_channel(*args, **kwargs):

    channel = ChannelNode(
        source_domain="test.com",
        source_id="test",
        title="Exercise test",
    )

    exercise = ExerciseNode(source_id="ex1", title="My Ex", license=licenses.CC_BY)
    channel.add_child(exercise)

    question_source = json.loads(requests.get("https://github.com/fle-internal/content-curation/files/958703/sample_response.txt").content.decode())["103898"]

    question = MultipleSelectQuestion(
        id="question",
        question=convert_html_to_markdown(question_source["question"]["content"]),
        correct_answers=[convert_html_to_markdown(a["content"]) for a in question_source["possible_answers"] if a["is_correct"]],
        all_answers=[convert_html_to_markdown(a["content"]) for a in question_source["possible_answers"]],
    )

    exercise.add_question(question)

    return channel
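
To check whether converted Markdown still contains usable image references, a small diagnostic along these lines may help. This is only a sketch: `list_image_targets` is a hypothetical helper, and the assumption that images must appear in unescaped `![](...)` form is taken from the discussion above rather than from ricecooker's source.

```python
import re

# Matches unescaped Markdown images such as ![alt](target).
# Hypothetical inspection helper, not part of ricecooker or html2text.
_IMAGE = re.compile(r"!\[[^\]]*\]\(([^)\s]+)[^)]*\)")

def list_image_targets(markdown):
    """Return (kind, target) pairs for every unescaped Markdown image found."""
    results = []
    for match in _IMAGE.finditer(markdown):
        target = match.group(1)
        # data: URIs carry inline base64 data; anything else is a path or URL.
        kind = "base64" if target.startswith("data:") else "file-or-url"
        results.append((kind, target))
    return results

sample = "Q ![](data:image/png;base64,AAAA) and ![](http://example.com/a.png)"
for kind, target in list_image_targets(sample):
    print(kind, target)
```

An escaped form such as `!\[\]\(...\)` produces no matches here, which is one way to spot the escaping problem described earlier in the thread.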

@rtibbles rtibbles closed this as completed May 4, 2017