WhatsAppChatLoader fails to load 24 hours time format chats #2457

Closed
itortouch opened this issue Apr 5, 2023 · 1 comment · Fixed by #2458

Comments

@itortouch
Contributor

During an experiment I tried to load some personal WhatsApp conversations into a vectorstore, but loading failed. Below is an example dataset and code in which half of the lines are loaded and half are dropped:

Dataset (whatsapp_chat.txt):

19/10/16, 13:24 - Aitor Mira: Buenas Andrea!
19/10/16, 13:24 - Aitor Mira: Si
19/10/16, 13:24 PM - Aitor Mira: Buenas Andrea!
19/10/16, 13:24 PM - Aitor Mira: Si

Code:

from langchain.document_loaders import WhatsAppChatLoader
loader = WhatsAppChatLoader("../data/whatsapp_chat.txt")
docs = loader.load()

Returns:

[Document(page_content='Aitor Mira on 19/10/16, 13:24 PM: Buenas Andrea!\n\nAitor Mira on 19/10/16, 13:24 PM: Si\n\n', metadata={'source': '..\\data\\whatsapp_chat.txt'})]

Because of a bug in the regex match pattern, any line without AM or PM after the hour:minutes is not matched. As a result, the first two lines of whatsapp_chat.txt are ignored and only the last two are matched.

Here is the buggy regex:
r"(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{1,2} (?:AM|PM)) - (.*?): (.*)"

Here is the fixed regex, which parses both 12-hour and 24-hour time formats:
r"(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{1,2}(?: AM| PM)?) - (.*?): (.*)"
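
To check the difference without going through the loader, here is a minimal sketch that runs both patterns against the four sample lines from the dataset above. The `parse_lines` helper and the constant names are hypothetical and exist only for this illustration; this is not the loader's actual code, and it assumes each line is matched from its start with `re.match`.

```python
import re

# Both patterns are copied verbatim from this issue.
BUGGY = re.compile(
    r"(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{1,2} (?:AM|PM)) - (.*?): (.*)"
)
FIXED = re.compile(
    r"(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{1,2}(?: AM| PM)?) - (.*?): (.*)"
)

# The four lines of whatsapp_chat.txt shown above.
SAMPLE_LINES = [
    "19/10/16, 13:24 - Aitor Mira: Buenas Andrea!",
    "19/10/16, 13:24 - Aitor Mira: Si",
    "19/10/16, 13:24 PM - Aitor Mira: Buenas Andrea!",
    "19/10/16, 13:24 PM - Aitor Mira: Si",
]


def parse_lines(pattern, lines):
    """Hypothetical helper: return (date, sender, text) for each matching line."""
    parsed = []
    for line in lines:
        match = pattern.match(line.strip())
        if match:
            parsed.append(match.groups())
    return parsed


buggy_result = parse_lines(BUGGY, SAMPLE_LINES)
fixed_result = parse_lines(FIXED, SAMPLE_LINES)

# The buggy pattern keeps only the lines with an AM/PM suffix;
# the fixed pattern keeps all four.
print(len(buggy_result), len(fixed_result))
```

Making the ` AM`/` PM` suffix an optional non-capturing group keeps the three capture groups (date, sender, text) unchanged, so any code that consumes those groups should not need to change.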

@itortouch
Contributor Author

PR solution ready:

#2458

hwchase17 pushed a commit that referenced this issue Apr 6, 2023
Fix for 24 hour time format bug. Now whatsapp regex is able to parse
either 12 or 24 hours time format.

Linked [issue](#2457).