Non-English language support #29

Open
lfschafaschek opened this issue May 25, 2021 · 6 comments
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@lfschafaschek

lfschafaschek commented May 25, 2021

With a Kindle set to Portuguese, the highlights' locations and dates aren't added in Notion. I think the problem is the date format, which is different in Portuguese.
I tested switching the language to English and creating a new highlight; this one was exported correctly (it's the last entry in the file).
My Clippings.txt

@paperboi paperboi changed the title Highlights's date and locations aren't exported Non-English language support May 25, 2021
@paperboi paperboi added enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed labels May 25, 2021
@paperboi
Owner

paperboi commented May 25, 2021

For reference, this is the snippet of code that scrapes the location, page and date information out of the text file.
See lines 49-53 in /kindle2notion/parsing.py

The function that addresses this is pasted below:

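from typing import List, Tuple

from dateutil.parser import parse  # assumed source of parse(); the import is not shown in the original snippet


def _parse_page_location_and_date(raw_clipping_list: List) -> Tuple[str, str, str]:
    # The second line of a clipping holds the metadata fields, separated by ' | '.
    second_line = raw_clipping_list[1]
    second_line_as_list = second_line.strip().split(' | ')
    page = location = date = ''
    for element in second_line_as_list:
        element = element.lower()
        # The keywords below are English-only, so clippings saved in another
        # Kindle language never match and all three values stay empty.
        if 'page' in element:
            page = element[element.find('page'):].replace('page', '').strip()
        if 'location' in element:
            location = element[element.find('location'):].replace('location', '').strip()
        if 'added on' in element:
            date = parse(element[element.find('added on'):].replace('added on', '').strip())
            date = date.strftime('%A, %d %B %Y %I:%M:%S %p')

    return page, location, date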

One would need to replace 'page', 'location' and 'added on' in lines 49, 51, 53 with the equivalent terms used in the respective language's My Clippings.txt file to get the relevant result.

In your case, from my limited understanding, these would be 'destaque na página', 'destaque ou posição', and 'Adicionado:'.

I'm leaving this issue open because I'm unsure how to incorporate this feature into the structure of the package, and I'm open to input from the GH community on this one. A workable solution may be to identify the language when scraping the first clipping and adapt the keywords accordingly. I can change the language on my Kindle and make some test clippings, so that they get saved in that language in the My Clippings file, and code from there.
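
The keyword-swapping idea described in this comment could be sketched roughly as follows. This is only an illustration, not code from the package: the `KEYWORDS` table, `detect_language` and `parse_metadata` are hypothetical names, the Portuguese terms are the unverified guesses from this thread, and the Portuguese sample line in the usage note is invented. The date is returned as raw text, because parsing localized month and weekday names would need separate handling.

```python
from typing import Dict, Tuple

# Hypothetical keyword table. The English terms come from the existing parser;
# the Portuguese terms are the unverified guesses mentioned in this thread.
KEYWORDS: Dict[str, Dict[str, str]] = {
    "en": {"page": "page", "location": "location", "date": "added on"},
    "pt": {
        "page": "destaque na página",
        "location": "destaque ou posição",
        "date": "adicionado:",
    },
}


def detect_language(metadata_line: str) -> str:
    """Pick the language whose keywords occur most often; default to English."""
    line = metadata_line.lower()
    scores = {
        lang: sum(keyword in line for keyword in terms.values())
        for lang, terms in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "en"


def _value_after(element: str, keyword: str) -> str:
    """Return the lowercased text following the keyword, or '' if it is absent."""
    lowered = element.lower()
    index = lowered.find(keyword)
    return lowered[index + len(keyword):].strip() if index != -1 else ""


def parse_metadata(metadata_line: str) -> Tuple[str, str, str]:
    """Extract (page, location, raw date text) from a clipping's second line."""
    terms = KEYWORDS[detect_language(metadata_line)]
    page = location = date = ""
    for element in metadata_line.strip().split(" | "):
        page = _value_after(element, terms["page"]) or page
        location = _value_after(element, terms["location"]) or location
        date = _value_after(element, terms["date"]) or date
    return page, location, date
```

For example, `parse_metadata("- Seu destaque ou posição 120-121 | Adicionado: terça-feira, 25 de maio de 2021 10:15:00")` would pick the Portuguese keywords and return the location and the raw date text, leaving the page empty. Supporting another language would then mean adding one more entry to the table, once its real keywords are confirmed from an actual My Clippings.txt file.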

@asyr01
Contributor

asyr01 commented Jun 8, 2021

Really appreciate the hard work you put in.
There is no problem with English. However, when it comes to my Turkish books,
words that include special Turkish letters unfortunately come through incomplete in Notion.
For example "i, ç, ü, ö": these non-English letters are missing.
Maybe we could find some way to handle it.
Also, when we run the script a second time and the clippings are all the same, could it skip the existing ones
and only append the new ones? Is that possible?
Thanks, have a good one.

@paperboi
Owner

Placing #46 here for reference. Thanks for contributing again @asyr01!

Regarding your second question, the current package is already capable of doing that. It can be optimized with a JSON structure to track clippings instead of the current method.
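
As a rough illustration of the JSON-based tracking idea, clippings could be identified by a hash and the set of known hashes stored in a small JSON file. This is a sketch under assumptions, not the package's current method: the function names, the `(book title, text)` tuple shape, and the file layout are all hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Iterable, List, Set, Tuple


def clipping_id(book_title: str, text: str) -> str:
    """Stable identifier for a clipping, derived from its title and text."""
    return hashlib.sha256(f"{book_title}\n{text}".encode("utf-8")).hexdigest()


def load_seen(path: Path) -> Set[str]:
    """Load the identifiers of already exported clippings (empty on first run)."""
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def save_seen(path: Path, seen: Set[str]) -> None:
    """Persist the identifiers as a sorted JSON list."""
    path.write_text(json.dumps(sorted(seen), indent=2), encoding="utf-8")


def filter_new(
    clippings: Iterable[Tuple[str, str]], seen: Set[str]
) -> List[Tuple[str, str]]:
    """Return only unseen clippings and record them in `seen`."""
    new_items = []
    for title, text in clippings:
        cid = clipping_id(title, text)
        if cid not in seen:
            seen.add(cid)
            new_items.append((title, text))
    return new_items
```

On each run, the script would call `load_seen`, export only what `filter_new` returns, and then call `save_seen`, so a second run over an unchanged My Clippings.txt would export nothing.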

@mefonseca

mefonseca commented Jun 11, 2021

Hi! Really appreciate this package!
I was also using a Kindle in Portuguese and not getting the location and date. I changed my device to English and it is all good now.
However, non-English letters are missing. I think it is the same problem as @asyr01's.
"Transformação" -> "Transformao"
"Mudança" -> "Mudana"
"Está" -> "est"
"Você" -> "voc"

I saw that the last commit was regarding this issue:

raw_clippings_text = raw_clippings_text.encode("ascii", errors="xmlcharrefreplace").decode()


If the file were only read as utf-8-sig, I think the non-English letters would be read correctly (I tried manually running the function "read_raw_clippings" on my "My Clippings.txt"), but I don't know what would happen in other parts of the code.

raw_clippings_text = open(clippings_file_path, "r", encoding="utf-8-sig").read()
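
To make the difference between the two lines concrete, here is a small standalone sketch. The helper names `escape_non_ascii` and `read_clippings` are hypothetical and only wrap the two calls quoted above; it assumes the clippings file is UTF-8, optionally with a byte order mark.

```python
def escape_non_ascii(text: str) -> str:
    """Mirror the ASCII re-encoding step: non-ASCII characters become
    numeric XML character references instead of staying as letters."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode()


def read_clippings(path: str) -> str:
    """Read the file as UTF-8, dropping a leading byte order mark if present.
    Accented letters are kept unchanged."""
    with open(path, "r", encoding="utf-8-sig") as clippings_file:
        return clippings_file.read()


# "ç" is U+00E7 (231) and "ã" is U+00E3 (227), so this prints
# Transforma&#231;&#227;o
print(escape_non_ascii("Transformação"))
```

If a later step strips or mishandles those `&#...;` references, the accented letters would disappear, which may be what produced results like "Transformao". Reading with `utf-8-sig` alone keeps the original characters, so any later code that touches the text would need to handle non-ASCII strings.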


Thank you!

@paperboi paperboi added this to To do in Enhancements Jun 25, 2021
@paperboi
Owner

Thanks for the tip @mefonseca! Implemented your request in the latest release.
@asyr01 please update the package and try running it on your system. It should account for those letters now.

@lfschafaschek Will implement custom Portuguese support soon!

Thank you all for your patience and goodwill. Hope this fix addresses your issues here.

@huhlik-cz

Hi, I'm running the latest version and I have the same issue as above, but with Czech characters like these: ěščřžňů. Can the Czech language also be supported? Thank you!
