# Data Pre-Processing Phase 1 - Making the data accessible

The TwiBot-22 dataset consists of over 100GB of textual data.
It is necessary to preprocess and organise the data in a way that lays the foundation for further analysis.

In its original form the dataset is structured as follows:
```
TwiBot22-data
 └── original
     ├── edge.csv       # 6.2GB
     ├── hashtag.json   # 255MB
     ├── label.csv      # 21MB
     ├── list.json      # 4.7MB
     ├── readme.md      # 1.9KB
     ├── split.csv      # 20MB
     ├── tweet_0.json   # 12GB
     ├── tweet_0.json   # 11GB
     ├── tweet_1.json   # 11GB
     ├── tweet_2.json   # 11GB
     ├── tweet_3.json   # 11GB
     ├── tweet_4.json   # 11GB
     ├── tweet_5.json   # 12GB
     ├── tweet_6.json   # 11GB
     ├── tweet_7.json   # 12GB
     ├── tweet_8.json   # 9.4GB
     └── user.json      # 747MB
```

Our first phase of preprocessing includes:
- Split `tweet_[0-8].json` into individual `tweets.json` per user
- Split `user.json` into individual `user.json` per user

Function calls in this notebook can take a significant amount of time to complete.
Therefore, the cells that trigger execution are commented out.
To run these steps, uncomment them.

In [1]:
import json

import pandas as pd

from collections import defaultdict
from pathlib import Path
from tqdm import tqdm

In [2]:
DATA_DIR = Path('/data/TwiBot-22')
USER_DIR = DATA_DIR / 'processed' / 'users'
USER_DIR.mkdir(exist_ok=True, parents=True)

In [3]:
def load_json(filename):
    with open(filename) as fh:
        data = json.load(fh)
    return data

## Step 1: Split `tweet_[0-8].json` into one file per user

In [4]:
TWEET_FILES = sorted((DATA_DIR / 'original').glob("tweet_[0-8].json"))

In [5]:
def extract_tweets_per_user(tweet_json):
    tweets_per_user = defaultdict(list)
    tweet_data = load_json(tweet_json)
    for tweet in tweet_data:
        tweets_per_user[f'u{tweet["author_id"]}'].append(tweet)
    return tweets_per_user
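
To illustrate the grouping step without loading a multi-gigabyte file, here is a minimal sketch on three made-up tweets. The sample records and the helper name `group_by_author` are hypothetical; only the `author_id` field and the `u` prefix are taken from the function above.

```python
from collections import defaultdict

# Hypothetical miniature of a tweet file: only the fields used here.
sample_tweets = [
    {"id": "t1", "author_id": 1000, "text": "first"},
    {"id": "t2", "author_id": 2000, "text": "second"},
    {"id": "t3", "author_id": 1000, "text": "third"},
]


def group_by_author(tweets):
    """Group tweets by author, using the same 'u<author_id>' key as above."""
    grouped = defaultdict(list)
    for tweet in tweets:
        grouped[f'u{tweet["author_id"]}'].append(tweet)
    return grouped


grouped = group_by_author(sample_tweets)
print({user: [tw["id"] for tw in tws] for user, tws in grouped.items()})
# {'u1000': ['t1', 't3'], 'u2000': ['t2']}
```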

In [6]:
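def remove_duplicates(tweets):
    ids = [tw["id"] for tw in tweets]
    if len(ids) != len(set(ids)):
        # Keep the first occurrence of every tweet id.
        new_tweets = {}
        for tw in tweets:
            tweet_id = tw['id']  # avoid shadowing the built-in `id`
            if tweet_id in new_tweets:
                # Tweets with the same id must be identical.
                assert tw == new_tweets[tweet_id]
            else:
                new_tweets[tweet_id] = tw
        tweets = list(new_tweets.values())
    return tweets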

def check_for_duplicates(tweets):
    ids = [tw["id"] for tw in tweets]
    assert len(ids) == len(set(ids)), "Duplicates found"
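
The de-duplication idea can be sketched on a toy list as follows. This is a simplified variant, not the function above: the name `dedupe_by_id` and the sample records are hypothetical, and unlike `remove_duplicates` it does not assert that duplicated tweets are identical.

```python
def dedupe_by_id(tweets):
    """Keep the first occurrence of each tweet id, preserving order."""
    seen = {}
    for tw in tweets:
        seen.setdefault(tw["id"], tw)
    return list(seen.values())


merged = [
    {"id": "t1", "text": "a"},
    {"id": "t2", "text": "b"},
    {"id": "t1", "text": "a"},  # same tweet loaded a second time
]
unique = dedupe_by_id(merged)
print([tw["id"] for tw in unique])
# ['t1', 't2']
```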

In [7]:
def export_tweets_per_user(tweet_files):
    for tweet_file in tqdm(tweet_files):
        tweets_per_user = extract_tweets_per_user(tweet_file)
        for user, tweets in tweets_per_user.items():
            user_directory = USER_DIR / user
            user_directory.mkdir(exist_ok=True)
            user_tweets_file = user_directory / 'tweets.json'
            if user_tweets_file.is_file():
                # Merge with tweets exported from a previously processed file.
                with open(user_tweets_file, "r") as fh:
                    tweets = tweets + json.load(fh)
                tweets = remove_duplicates(tweets)
                check_for_duplicates(tweets)  # assert that no tweet is added twice
            with open(user_tweets_file, "w") as fh:
                json.dump(tweets, fh)

This snippet takes a few hours to complete, but it only needs to be run once, since the results are saved to disk.<br>
The large `tweet_[0-8].json` files can be split into one `tweets.json` per user by running the following call for each file.

The kernel will most likely die at some point if this is run for all nine JSON files in a single loop. We therefore recommend running it manually, once per file.


```python
# i is the index of the file to process (0 to 8)
export_tweets_per_user([TWEET_FILES[i]])
```

After this step is completed, there should be 933,872 user-specific directories in `USER_DIR`<br>
$\Longrightarrow$ $\approx$ 93% of users tweeted

## Step 2: Split `user.json` into one file per user

In [8]:
USER_FILE = DATA_DIR / 'original' / 'user.json'

In [9]:
def split_json_into_subfiles(json_file, destination):
    data = load_json(json_file)
    for entry in tqdm(data):
        # Write into `destination` rather than the global USER_DIR.
        current_entry = destination / entry["id"]
        current_entry.mkdir(exist_ok=True)
        with open(current_entry / json_file.name, "w") as fh:
            json.dump(entry, fh)

After running the following snippet we have a separate `user.json` for each user.

```python
split_json_into_subfiles(USER_FILE, USER_DIR)
```

We can also verify that we now have 1,000,000 user-specific directories in `USER_DIR`.
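
One way such a check could look is sketched below. The helper `count_user_dirs` is hypothetical and is demonstrated on a throwaway directory; pointing it at `USER_DIR` would be the real check, under the assumption that every user directory sits directly inside it.

```python
import tempfile
from pathlib import Path


def count_user_dirs(user_dir):
    """Return (number of user directories, number that contain tweets.json)."""
    total = with_tweets = 0
    for entry in user_dir.iterdir():
        if not entry.is_dir():
            continue
        total += 1
        if (entry / "tweets.json").is_file():
            with_tweets += 1
    return total, with_tweets


# Demonstration on a temporary directory with three made-up users.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name, has_tweets in [("u1", True), ("u2", True), ("u3", False)]:
        (root / name).mkdir()
        (root / name / "user.json").write_text("{}")
        if has_tweets:
            (root / name / "tweets.json").write_text("[]")
    total, with_tweets = count_user_dirs(root)

print(total, with_tweets)
# 3 2
```

On the full dataset, the ratio `with_tweets / total` should correspond to the share of users who tweeted.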

## Step 3: Extract tweet times
Our initial plan was to analyse temporal patterns in this dataset.
For that purpose, it is helpful to have the tweet times readily available.

Ultimately, we decided against using the temporal patterns. However, the files created in the following loop are still relevant for the data exploration in the next notebook.

In [10]:
def export_tweet_times(user_dir):
    for user in tqdm(user_dir.glob("*")):
        if not (user / "tweets.json").is_file():
            continue
        tweet_times = defaultdict(list)
        with open(user / "tweets.json", "r") as fh:
            user_tweets = json.load(fh)
        for tweet in user_tweets:
            tweet_times["id"].append(tweet["id"])
            tweet_times["created_at"].append(tweet["created_at"])
        tweet_csv = user / "tweet_times.csv"
        pd.DataFrame(tweet_times, columns=["id", "created_at"]).to_csv(tweet_csv, index=False)

Again, the following snippet takes a long time to finish and only needs to be run once.

```python
export_tweet_times(USER_DIR)
```

For example, for user `u1436559360843517957` the resulting table looks like this:

| id                   | created_at                |
|:---------------------|:--------------------------|
| t1487173555120918530 | 2022-01-28 21:19:40+00:00 |
| t1485644387140911104 | 2022-01-24 16:03:18+00:00 |
| t1464373636530376708 | 2021-11-26 23:20:56+00:00 |
| t1447239553216094211 | 2021-10-10 16:36:12+00:00 |
| t1447239443111464968 | 2021-10-10 16:35:46+00:00 |
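
As a rough sketch of how these files could be consumed later, the timestamps can be parsed with the standard library alone. This assumes that `tweet_times.csv` stores `created_at` exactly as displayed in the table; the inline CSV text below simply repeats those five rows.

```python
import csv
import io
from datetime import datetime

# The rows shown in the table above, as they would appear in tweet_times.csv.
csv_text = """id,created_at
t1487173555120918530,2022-01-28 21:19:40+00:00
t1485644387140911104,2022-01-24 16:03:18+00:00
t1464373636530376708,2021-11-26 23:20:56+00:00
t1447239553216094211,2021-10-10 16:36:12+00:00
t1447239443111464968,2021-10-10 16:35:46+00:00
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
times = sorted(datetime.fromisoformat(row["created_at"]) for row in rows)

# Gaps between consecutive tweets, oldest first.
gaps = [later - earlier for earlier, later in zip(times, times[1:])]
print(gaps[0])
# 0:00:26
```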

## Step 4: Split `edge.csv` into one file per relation

In this step the large `edge.csv` file will be split into smaller files based on the type of relation.

This is not strictly necessary as we could also load the given `edge.csv` and filter the rows whenever we need to do that.
But it reduces the runtime for further processing and gives a better overview of the kinds of relations we have.
This is especially relevant once we start to add new ones.

In [11]:
EDGE_FILE = DATA_DIR / 'original' / 'edge.csv'
RELATION_DIR = DATA_DIR / 'processed' / 'relations'
RELATION_DIR.mkdir(exist_ok=True)

In [12]:
def split_into_subfiles(csv_file, category, out_path):
    df = pd.read_csv(csv_file)
    categories = df[category].unique()
    for cat in (pbar := tqdm(categories)):
        pbar.set_description(f"Processing {cat}")
        
        category_df = df.loc[df[category] == cat]
        category_df.to_csv(out_path / f"{cat}.csv", index=False)

The function can be called as follows:
```python
split_into_subfiles(csv_file=EDGE_FILE, category='relation', out_path=RELATION_DIR)
```

This will create 14 new files, one for each type of relation:
```
contain.csv      # list  -> tweet
discuss.csv      # tweet -> hashtag
followed.csv     # list  -> user
followers.csv    # user  -> user
following.csv    # user  -> user
like.csv         # user  -> tweet
membership.csv   # list  -> user
mentioned.csv    # tweet -> user
own.csv          # user  -> list
pinned.csv       # user  -> tweet
post.csv         # user  -> tweet
quoted.csv       # tweet -> tweet
replied_to.csv   # tweet -> tweet
retweeted.csv    # tweet -> tweet
```
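
Since `split_into_subfiles` loads the whole 6.2GB file into memory, a streaming alternative may be worth considering on machines with little RAM. The sketch below is not the method used in this notebook: `split_csv_streaming` is a hypothetical helper built on the standard `csv` module, and the column names `source_id` and `target_id` in the demo data are assumptions (only `relation` is confirmed above).

```python
import csv
import tempfile
from contextlib import ExitStack
from pathlib import Path


def split_csv_streaming(csv_file, category, out_path):
    """Write one CSV per distinct value of `category`, reading row by row."""
    out_path = Path(out_path)
    out_path.mkdir(parents=True, exist_ok=True)
    writers = {}
    with ExitStack() as stack:
        src = stack.enter_context(open(csv_file, newline=""))
        reader = csv.DictReader(src)
        for row in reader:
            cat = row[category]
            if cat not in writers:
                # Open each output file once and keep it open until the end.
                out = stack.enter_context(
                    open(out_path / f"{cat}.csv", "w", newline="")
                )
                writers[cat] = csv.DictWriter(out, fieldnames=reader.fieldnames)
                writers[cat].writeheader()
            writers[cat].writerow(row)
    return sorted(writers)


# Demonstration on a tiny made-up edge file.
with tempfile.TemporaryDirectory() as tmp:
    edge_file = Path(tmp) / "edge.csv"
    edge_file.write_text(
        "source_id,relation,target_id\n"
        "u1,post,t1\n"
        "u1,following,u2\n"
        "u2,post,t2\n"
    )
    relations = split_csv_streaming(edge_file, "relation", Path(tmp) / "relations")
    post_rows = (Path(tmp) / "relations" / "post.csv").read_text().splitlines()

print(relations)
# ['following', 'post']
```

The trade-off is speed: row-by-row processing in pure Python is likely slower than the pandas version, but it only ever holds a single row in memory.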

# End Result

This concludes our first round of pre-processing. We split our data into more granular pieces and end up with the following structure:

```
TwiBot22-data
 ├── original
 │   ├── edge.csv
 │   ├── ...
 │   └── user.json
 └── processed
     ├── relations
     │   ├── contain.csv
     │   ├── discuss.csv
     │   ├── followed.csv
     │   ├── followers.csv
     │   ├── following.csv
     │   ├── like.csv
     │   ├── membership.csv
     │   ├── mentioned.csv
     │   ├── own.csv
     │   ├── pinned.csv
     │   ├── post.csv
     │   ├── quoted.csv
     │   ├── replied_to.csv
     │   └── retweeted.csv
     └── users
         ├── u1000
         │   ├── tweets.json
         │   ├── tweet_times.csv
         │   └── user.json
         ├── u1000001683005542402
         │   ├── tweets.json
         │   ├── tweet_times.csv
         │   └── user.json
         ├── ...
         └── u999996798793109506
             ├── tweets.json
             ├── tweet_times.csv
             └── user.json
```