# Working with `chatlogs.txt`

*All imports are consolidated at the top of this notebook in the cell below.*

In [1]:
import re

In [17]:
import json

One of the files in the Gab archive released a few years ago appears to be a single file of chat logs, `chatlogs.txt`. The rest of the archive consists of SQL files: accounts, groups, statuses, verifications. `statuses.sql` looks promising, but at 67GB it is quite large. In contrast, `chatlogs.txt` is a mere 10MB. As the header (see below) notes, `chatlogs.txt` has "70593 messages in 19683 chats with 15322 users."

My first step was to find out what the file looked like. I opened it in Visual Studio Code, but below is the beginning of the file as seen from Python.

In [2]:
with open("chatlogs.txt") as f:
    chatlogtexts = f.read()

print(chatlogtexts[0:500])

# #gableaks Gab "private" chats ft the ceo, qanon, nazi simps & much more.
# 70593 messages in 19683 chats with 15322 users
# user flags: [V]erified [P]ro [D]onor [I]nvestor [B]ot [DELETED]
# FUCK TRUMP. FUCK COLONIZERS & CAPITALISTS. DEATH TO AMERIKKKA.
--- 1
2020-12-22T20:18:32 @OsmanAbbaker: hi
2020-12-22T20:18:40 @OsmanAbbaker: how are you
2020-12-22T20:19:33 @OsmanAbbaker: my name is osman am from effortless english group i want to ask you somthing please
2020-12-22T20:20:42 @OsmanAbbaker: 


So, after a header of dubious utility, which we will remove in a moment, we get to the meat of the file, which appears to consist of lines formatted as follows:

```
DATE-TIME @user: text
```
In most cases the entirety of the entry is on one line, but there are a number of places where a newline character creates multi-line entries. 

There are also a number of lines that have nothing more than three dashes and a number. 
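
A minimal sketch of how one such line could be pulled apart into fields, assuming every entry starts with a timestamp, then `@user` with optional bracketed flags, then a colon. The pattern `LINE_RE` and the helper `parse_line` are illustrative names, not part of this notebook, and have only been checked against the sample lines shown here.

```python
import re
from datetime import datetime

# Hypothetical pattern for one entry: timestamp, @user, optional [FLAG]s, text
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}) "  # date-time
    r"@([^\[:\s]+)"                              # user name
    r"((?:\[[A-Z]+\])*)"                         # flags such as [P] or [DELETED]
    r": ?(.*)$",                                 # message text
    re.DOTALL,
)

def parse_line(line):
    """Return (datetime, user, flags, text), or None if the line does not match."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    stamp, user, flags, text = match.groups()
    flag_list = re.findall(r"\[([A-Z]+)\]", flags)
    return datetime.fromisoformat(stamp), user, flag_list, text

print(parse_line("2020-12-22T20:18:59 @anuralight[P]: yo\n"))
print(parse_line("--- 1\n"))  # separator lines do not match, so this is None
```

Returning `None` for non-matching lines makes it easy to spot the separator lines and any continuation lines of multi-line entries.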

I decided to remove those lines first, parsing them out while reading the lines as items in a list.

In [3]:
# Instead of reading the file to a string 
# Read each line into a list of strings 
# But leave out those lines that start with three dashes
with open("chatlogs.txt") as f:
    chatlog = [n for n in f.readlines() if not n.startswith('---')]

# Let's look at the first 5 lines
for i in chatlog[0:5]:
    print(i)

# #gableaks Gab "private" chats ft the ceo, qanon, nazi simps & much more.

# 70593 messages in 19683 chats with 15322 users

# user flags: [V]erified [P]ro [D]onor [I]nvestor [B]ot [DELETED]

# FUCK TRUMP. FUCK COLONIZERS & CAPITALISTS. DEATH TO AMERIKKKA.

2020-12-22T20:18:32 @OsmanAbbaker: hi



In [4]:
del chatlog[0:4]

With that bit of editing done, we can look at how the lines are structured:

In [5]:
chatlog[0:5]

['2020-12-22T20:18:32 @OsmanAbbaker: hi\n',
 '2020-12-22T20:18:40 @OsmanAbbaker: how are you\n',
 '2020-12-22T20:19:33 @OsmanAbbaker: my name is osman am from effortless english group i want to ask you somthing please\n',
 '2020-12-22T20:20:42 @OsmanAbbaker: if you dont mind\n',
 '2020-12-22T20:18:59 @anuralight[P]: yo\n']

Those look pretty straightforward, but I discovered multi-line blocks while scrolling in VS Code -- hand-checking always pays off! I want to be able to break the log by the date entry, which looks like this:

In [6]:
chatlog[5][0:19]

'2020-12-22T21:20:05'

The first 19 characters of each entry are the date-time. I played with regex (on regexr.com) until I got what seemed like a reasonable solution:
```
^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}
```
For some reason, my first impulse was to write this back to a file so that I could then read it back with `readline`:
```python
with open("temp.txt", "w") as f:
    for i in chatlog:
        f.write(f"{i}\n")
```
but I realized I wanted to split a string using the regex above.

The cell block below starts with the join, and then splits the joined string back into a list of strings, but this time using the date-time pattern as the split. The goal here is to keep multi-line texts together.

The use of the `compile` function below is a matter of personal preference. I like keeping the regex pattern out of the `split`, `sub`, or `findall` calls, but you could just as easily pass the pattern directly to the split function.
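
To make the join-then-split idea concrete, here is a small sketch on made-up lines (the users and messages are invented for illustration). It shows that a multi-line message stays in one piece and that calling the method on a compiled pattern gives the same result as passing the pattern string to `re.split`.

```python
import re

# Made-up sample lines, including one message that spans two lines
sample_lines = [
    "2020-01-01T10:00:00 @alice: first line\n",
    "second line of the same message\n",
    "2020-01-01T10:01:00 @bob: hi\n",
]

joined = " ".join(sample_lines)
stamp_pattern = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}"
re_stamp = re.compile(stamp_pattern)

# Splitting via the compiled pattern or via the module-level function
pieces = re_stamp.split(joined)
same_result = pieces == re.split(stamp_pattern, joined)
print(same_result)

# Two date-times give three pieces: the first is the empty string
# that precedes the first date-time
print(len(pieces))
print(repr(pieces[1]))
```

Note that a split on N matches returns N + 1 pieces, which is worth keeping in mind when comparing counts later.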

In [7]:
# Join our list of strings into one big string again
chats = " ".join(chatlog)

# Split combined string at date-time
# Raw string so that \d is passed to the regex engine unchanged
re_datetime = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")
splits = re.split(re_datetime, chats)

# Checking in to see what we have
print(len(splits))
for i in splits[50:52]:
     print(i)

70596
 @JimHalsey: No idea. I thought that other thing still existed. I dunno. If you do figure out how I can give you the gift of abs, let me know 😂 I hope you’re having a wonderful day, cutie. Be happy 😘
 
 @JimHalsey: Been watching Breeders (1986) tonight, Miss 👀 Can’t send you a pic, but I’m sure you can jewgle it. About to watch Creepozoids now 😱
 


Yeah, this is Gab. The racism is pretty abhorrent. 

If we're going to focus only on the text of the messages, we need to remove the user names. I again used regexr.com to work out the pattern. (Note that you can paste some of your sample text into the regexr text box.)

In [8]:
re_user = re.compile("^ (.*?):")

texts = []
for i in splits:
    text = re.sub(re_user, "", i, count=1)
    texts.append(text)

In [9]:
print(len(texts))
for i in texts[100:105]:
    print(i)

70596
 Thank you. I’m just a very lonely person. I had someone, kind of, or at least I thought I did (maybe I never had her; I don’t know), but she doesn’t want me anymore. I’ve been trying to find someone for a long time now. I’m the real deal. I like sex, but I find non-procreative sex pointless. My theory is that men were given the urge so that they’d procreate. So I understand we have that urge and need to satiate it, lest we go mad, but I find it totally pointless if it’s not for making babies. I’m not big on banging anything that moves. I used to go to pubs and clubs and did some of that, but never as much as I could’ve if I’d really wanted to. Those encounters are short-lived and leave you feeling very alone and used the next day when she’s left in a cab. I’m not about that and never was. I did it, but only because I couldn’t find a woman to marry me and give me kids. Now, every girl I meet who says she wants that is more interested in her work/business or too broken to pair bon

So now I have 70596 texts. I can save those to a text file, or I can put them back with the date-time and the user and save them as a csv.
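
One possible way to put the texts back together with their date-time and user is to split on a pattern wrapped in parentheses, since `re.split` then keeps the matched date-times in its output. The sketch below runs on a made-up sample; the names `re_stamp_keep` and `records` are illustrative, and the user pattern assumes every message begins with `@user:`.

```python
import re

# Made-up sample in the same shape as the joined chat log
sample = (
    "2020-01-01T10:00:00 @alice[P]: first line\n"
    " second line\n"
    " 2020-01-01T10:01:00 @bob: hi\n"
)

# Wrapping the pattern in parentheses makes re.split keep each date-time
re_stamp_keep = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")
parts = re_stamp_keep.split(sample)

# parts[0] is whatever precedes the first date-time; after that the list
# alternates date-time, message, date-time, message, ...
records = []
for stamp, body in zip(parts[1::2], parts[2::2]):
    # Separate "@user:" from the text; DOTALL lets the text span lines
    match = re.match(r"\s*@(.*?): ?(.*)", body, re.DOTALL)
    user, text = match.groups() if match else (None, body)
    records.append({"datetime": stamp, "user": user, "text": text.strip()})

for record in records:
    print(record)
```

Because each record is a plain dictionary, a list like this could later be handed to `pandas.DataFrame` or written out with the `csv` module.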

<div class="alert alert-block alert-danger">
    <b>Counts Do Not Match.</b>
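    The header promises 70593 messages, but the split returns 70596 pieces, and the cells below find 70595 date-times and 66028 user names. In order to build a dataframe, I will need to split and keep the date-times and users. For now, I just need the texts. Let's save those to a file.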
</div>

In [10]:
datetimes = re.findall(re_datetime, chats)

In [11]:
print(len(datetimes))
print(datetimes[100:102])

70595
['2020-12-29T16:12:11', '2020-12-29T16:13:18']


In [12]:
users = re.findall("@(.*?):", chats)
print(len(users))
print(users[100:102])

66028
['JimHalsey', 'JimHalsey']


## Texts

All the texts of the messages are in `texts`, which is a list. If I want to preserve newlines, should I save the list to a CSV? Do I want to preserve newlines at all?

In [16]:
texts[15:25]

[' its happening\n ',
 ' Squeeeeeeeeeeeee 😆\n ',
 ' Hello, Beautiful Lady. Have a fantastic day ☺️\n ',
 ' Hi.  Thank you.  Hope your day is wonderful as well.  :)\n ',
 ' Thanks, Sweetness\n ',
 ' Are you working today, Beautiful Girl, or are you hanging with fam?\n ',
 ' I’m working.  The only days I have off for the next 2 weeks is Christmas Day &amp;amp; New Year’s Day.  :(.   I need to get another job.  This one is killing me\n ',
 ' Was gonna say no, what you need is a husband and kids, but you know this already, and I don’t want to labour the point, tell you shid you know and just generally make you feel crappy lol — even though I’m sure I do at times, which is never my intention and which I am sorry for. I’m sure one day you’ll find a good man and will have a family. It’s a lot easier for a woman than it is for a man. Even when a woman is past her best, and can’t have kids, there are still men lining up. Whereas White men find it hard to find anyone now, especially a good Aryan

In [18]:
with open("texts.json", 'w') as f:
    # indent=2 is not needed but makes the file human-readable 
    # if the data is nested
    json.dump(texts, f, indent=2) 
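
On the newline question raised earlier: JSON stores a newline inside a string as the escape `\n`, so a list saved this way should come back unchanged. A quick round-trip check, using made-up messages and a temporary file (both are assumptions for illustration, so the real `texts.json` is left alone):

```python
import json
import os
import tempfile

# Made-up messages, one of which contains internal newlines
sample_texts = [" hi\n ", " first line\n second line\n "]

# Write to a throwaway file rather than the real texts.json
path = os.path.join(tempfile.mkdtemp(), "sample_texts.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample_texts, f, indent=2)

# Read the list back in
with open(path, encoding="utf-8") as f:
    restored = json.load(f)

# Each message should come back exactly as it went in, newlines included
print(restored == sample_texts)
```

The same `json.load` call is how the saved texts can be read back in a later session.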