Persian text bugs were fixed #7
src/chat_statistics/stats.py
Outdated
regrex_pattern = re.compile(pattern = "["
    "\u2069"
    "\u2066"
    "]+", flags = re.UNICODE)
What is this? Did you simply mean "[\u2069\u2066]+"? What are the spaces for?
Previously, the pattern was meant to cover more characters.
When the list of suspicious characters was reduced, the structure of the pattern was left unchanged.
Thanks a lot.
Your suggestion has been added to the new commits.
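The simplified pattern suggested in this thread can be sketched as follows. This is a minimal illustration, not the project's actual code: the names `ISOLATES_PATTERN` and `strip_isolates` are hypothetical, and it assumes the only characters to remove are U+2066 (LEFT-TO-RIGHT ISOLATE) and U+2069 (POP DIRECTIONAL ISOLATE). In Python 3, adjacent string literals are concatenated and `re.UNICODE` is already the default for `str` patterns, so the one-line form should behave the same as the multi-line original.

```python
import re

# Hypothetical helper, not taken from the PR: removes the two
# directional-isolate characters discussed in the review.
# U+2066 = LEFT-TO-RIGHT ISOLATE, U+2069 = POP DIRECTIONAL ISOLATE.
ISOLATES_PATTERN = re.compile("[\u2066\u2069]+")


def strip_isolates(text: str) -> str:
    """Return text with every run of U+2066/U+2069 removed."""
    return ISOLATES_PATTERN.sub("", text)


# Sample Persian text wrapped in isolate characters (made up for illustration).
sample = "\u2066سلام\u2069 دنیا"
cleaned = strip_isolates(sample)
```

Compiling the pattern once at module level avoids recompiling it for every message when many messages are processed.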
src/chat_statistics/stats.py
Outdated
    "\u2069"
    "\u2066"
    "]+", flags = re.UNICODE)
text = regrex_pattern.sub(r'', text)
No need for the r prefix here.
- text = regrex_pattern.sub(r'', text)
+ text = regrex_pattern.sub('', text)
The r prefix was removed in the new commits.
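A small check of why the prefix is unnecessary in this case: the raw-string prefix only changes how backslashes in a literal are interpreted, so an empty literal is the same string with or without it. The variable names below are illustrative only and do not come from the PR.

```python
import re

# With no backslashes in the literal, r'' and '' are the same string.
empty_raw = r''
empty_plain = ''
same_empty = empty_raw == empty_plain

# The prefix matters only when backslashes are present:
raw_len = len(r'\n')    # two characters: a backslash and the letter n
plain_len = len('\n')   # one character: a newline

# Either form of the replacement gives the same substitution result.
pattern = re.compile("[\u2066\u2069]+")
result_raw = pattern.sub(r'', "a\u2066b")
result_plain = pattern.sub('', "a\u2066b")
```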
src/chat_statistics/stats.py
Outdated
import demoji
import arabic_reshaper
The imports are not sorted.
- import demoji
- import arabic_reshaper
+ import arabic_reshaper
+ import demoji
The imports were sorted in the new commits.
src/chat_statistics/stats.py
Outdated
elif isinstance(sub_msg, dict) and sub_msg['type'] in {
'text_link', 'bold', 'italic', 'hashtag', 'mention', 'pre'}:
Better indentation:
- elif isinstance(sub_msg, dict) and sub_msg['type'] in {
- 'text_link', 'bold', 'italic', 'hashtag', 'mention', 'pre'}:
+ elif isinstance(sub_msg, dict) and sub_msg['type'] in {
+     'text_link', 'bold', 'italic', 'hashtag', 'mention', 'pre'
+ }:
Applied.
Looks awesome. Thanks @siniorone.
With these modifications, emojis are removed from the text.
More complicated, nested messages are now extracted successfully.
More stop words were added to the project, and users are encouraged to use the Vazir font.
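The nested-message extraction mentioned above might look roughly like the sketch below. It is not the project's implementation: the function name `extract_text`, the constant `TEXT_ENTITY_TYPES`, and the sample message are hypothetical, and it assumes a message's text field is either a plain string or a list whose items are strings or dicts carrying `type` and `text` keys. Only the set of entity types is taken from the reviewed diff.

```python
# Entity types copied from the reviewed diff.
TEXT_ENTITY_TYPES = {'text_link', 'bold', 'italic', 'hashtag', 'mention', 'pre'}


def extract_text(msg_text):
    """Flatten a message's text field into one plain string.

    Assumption: msg_text is a str, or a list of str and dict items,
    where each dict has 'type' and 'text' keys.
    """
    if isinstance(msg_text, str):
        return msg_text

    parts = []
    for sub_msg in msg_text:
        if isinstance(sub_msg, str):
            parts.append(sub_msg)
        # .get() is used here (unlike the diff's sub_msg['type']) so that
        # dicts without a 'type' key are skipped instead of raising KeyError.
        elif isinstance(sub_msg, dict) and sub_msg.get('type') in TEXT_ENTITY_TYPES:
            parts.append(sub_msg.get('text', ''))
    return ''.join(parts)


# Made-up sample: the 'phone' entity is not in the accepted set and is dropped.
message = [
    'Hello ',
    {'type': 'bold', 'text': 'world'},
    ' ',
    {'type': 'hashtag', 'text': '#test'},
    {'type': 'phone', 'text': '123'},
]
flat = extract_text(message)
```

Keeping the accepted types in a set makes the membership test cheap and keeps the list of supported entities in one place.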