During an experiment I tried to load some personal WhatsApp conversations into a vectorstore, but loading was failing. Below is an example dataset in which half of the lines load correctly and the other half are dropped:
Dataset (whatsapp_chat.txt):
19/10/16, 13:24 - Aitor Mira: Buenas Andrea!
19/10/16, 13:24 - Aitor Mira: Si
19/10/16, 13:24 PM - Aitor Mira: Buenas Andrea!
19/10/16, 13:24 PM - Aitor Mira: Si
Returns:
[Document(page_content='Aitor Mira on 19/10/16, 13:24 PM: Buenas Andrea!\n\nAitor Mira on 19/10/16, 13:24 PM: Si\n\n', metadata={'source': '..\\data\\whatsapp_chat.txt'})]
What's happening is that, due to a bug in the regex match pattern, any line without AM or PM after the hour:minutes is not matched. As a result, the first two lines of whatsapp_chat.txt are silently ignored and only the last two are matched.
Here is the buggy regex: r"(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{1,2} (?:AM|PM)) - (.*?): (.*)"
Here is the proposed fix, which parses both 12-hour and 24-hour time formats by making the AM/PM suffix optional: r"(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{1,2}(?: AM| PM)?) - (.*?): (.*)"
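The difference between the two patterns can be checked in isolation with the standard `re` module. The following is a minimal sketch, not the loader's actual code: the helper `parse_lines` and the names `BUGGY`, `FIXED` and `SAMPLE_LINES` are hypothetical, and it assumes each line is matched from its start with `re.match`. The two patterns and the four sample lines are copied from this report.

```python
import re

# Patterns copied verbatim from the report above.
BUGGY = re.compile(r"(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{1,2} (?:AM|PM)) - (.*?): (.*)")
FIXED = re.compile(r"(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{1,2}(?: AM| PM)?) - (.*?): (.*)")

# The four lines of whatsapp_chat.txt from the dataset above.
SAMPLE_LINES = [
    "19/10/16, 13:24 - Aitor Mira: Buenas Andrea!",
    "19/10/16, 13:24 - Aitor Mira: Si",
    "19/10/16, 13:24 PM - Aitor Mira: Buenas Andrea!",
    "19/10/16, 13:24 PM - Aitor Mira: Si",
]


def parse_lines(pattern, lines):
    """Hypothetical helper: return (timestamp, sender, message) for each matching line."""
    parsed = []
    for line in lines:
        match = pattern.match(line.strip())
        if match:
            parsed.append(match.groups())
    return parsed


buggy_result = parse_lines(BUGGY, SAMPLE_LINES)
fixed_result = parse_lines(FIXED, SAMPLE_LINES)

# The buggy pattern only keeps lines that carry an AM/PM suffix;
# the fixed pattern keeps all four.
print(len(buggy_result), len(fixed_result))
```

With the buggy pattern only the two lines carrying a PM suffix are parsed, which is consistent with the output shown above; with the optional suffix, the 24-hour lines are parsed as well.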