# Explanation

We want to combine the processed datasets into a single JSON file in which each post has roughly the following structure:
```json
{
    "id": 45043602,
    "title": "X",
    "content": "...",
    "tags": [
        {
            "id": 34343,
            "tagName": "cpu",
            "description": "..."
        }
    ],
    "parentId": null,
    "creationDate": "xx/xx/xxxx",
    "viewCount": 15,
    "score": 224,
    "lastCountUpdate": "xx/xx/xxxx",
    "owner": {
        "id": "12353",
        "...": "..."
    },
    "comments": [
        {
            "comment_id": 2434,
            "made_by": "user_id",
            "content": "..."
        }
    ]
}

```
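As a rough illustration of this target shape, the sketch below builds one such record as a plain Python dictionary and serializes it with `json.dumps`. Every value is a placeholder copied from the structure above, not real data, and the name `example_post` is invented for this example.

```python
import json

# Placeholder values only; the field names follow the structure above.
example_post = {
    "id": 45043602,
    "title": "X",
    "content": "...",
    "tags": [{"id": 34343, "tagName": "cpu", "description": "..."}],
    "parentId": None,  # Python None is serialized as JSON null
    "creationDate": "xx/xx/xxxx",
    "viewCount": 15,
    "score": 224,
    "lastCountUpdate": "xx/xx/xxxx",
    "owner": {"id": "12353"},
    "comments": [{"comment_id": 2434, "made_by": "user_id", "content": "..."}],
}

# Serialize to a JSON string and parse it back to check the round trip
serialized = json.dumps(example_post, indent=4)
round_tripped = json.loads(serialized)
```

Note that valid JSON requires straight double quotes around every key and string, which `json.dumps` produces automatically.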

In [1]:
import json
import pandas as pd

In [1]:
'Datasets/Comments.json',
'Datasets/Posts.json',
'Datasets/Tags.json',
'Datasets/Users.json',
'Datasets/Votes.json'


# Read files

In [2]:
# Open JSON files

with open("Datasets/posts.json", "r", encoding="utf-8") as f_posts:
    df_posts = pd.read_json(f_posts)
    
df_posts.head(1)


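|   | id | postTypeId | acceptedAnswerId | creationDate | score | viewCount | body | ownerUserId | title | tags | parentId |
|---|----|------------|------------------|--------------|-------|-----------|------|-------------|-------|------|----------|
| 0 | 2 | 1 | 28.0 | 2012-03-06T19:06:05.667 | 19 | 1128.0 | `<p>The set difference operator (e.g., <code>EX...` | 5 | Does the 'difference' operation add expressive... | [database-theory, relational-algebra, finite-m... | |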


In [3]:
with open("Datasets/tags.json", "r", encoding="utf-8") as f_tags:
    df_tags = pd.read_json(f_tags)

df_tags.head(1)

|   | tagName | description |
|---|---------|-------------|
| 0 | cpu-pipelines | In computing, a pipeline, also known as a data... |


In [4]:
with open("Datasets/comments.json", "r", encoding="utf-8") as f_comments:
    df_comments = pd.read_json(f_comments)
    
df_comments.head(1)

|   | postId | text | userId |
|---|--------|------|--------|
| 0 | 2 | To show that they have the same expressive pow... | 10.0 |


# Add tag information to the list of tags in each post

In [5]:
l = []
for tag_list in df_posts['tags']:
    temp = []

    # Check if there are any tags in the first place
    if tag_list:
        # For each tag name, look up its row in the tags dataframe,
        # convert that row to a dictionary, and collect it for this post.
        for tag in tag_list:
            t_info = df_tags.loc[df_tags['tagName'] == tag]
            # Assumes every tag name has a row in df_tags;
            # indexing [0] raises IndexError otherwise.
            temp.append(t_info.to_dict(orient="records")[0])

    # One list of tag dictionaries per post, in post order
    l.append(temp)

In [6]:
df_posts['tags'] = l
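
The loop above filters `df_tags` once for every tag of every post, which can get slow on a large dataset. A possible alternative, sketched here on invented toy data, builds a dictionary keyed by `tagName` once and then looks each tag up directly. The names `toy_tags`, `toy_posts`, `tag_index`, and `expand_tags` are hypothetical, and unlike the loop above this sketch silently skips tag names that have no entry.

```python
import pandas as pd

# Toy stand-ins for df_tags and df_posts (invented values)
toy_tags = pd.DataFrame({
    "tagName": ["cpu", "sorting"],
    "description": ["About processors", "About ordering data"],
})
toy_posts = pd.DataFrame({
    "id": [1, 2],
    "tags": [["cpu", "sorting"], None],
})

# Build the lookup once: tag name -> full tag record
tag_index = {rec["tagName"]: rec for rec in toy_tags.to_dict(orient="records")}


def expand_tags(tag_list):
    """Return the tag records for a list of tag names; no tags gives []."""
    if not isinstance(tag_list, list):
        return []
    # Unknown tag names are skipped rather than raising an error
    return [tag_index[name] for name in tag_list if name in tag_index]


toy_posts["tags"] = toy_posts["tags"].apply(expand_tags)
```

Whether skipping unknown tags is acceptable depends on the data; raising an error instead would surface inconsistencies between the posts and tags files earlier.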

In [7]:
df_posts['tags'].iloc[0]

[{'tagName': 'database-theory',
  'description': 'Database theory encapsulates a broad range of topics related to the study and research of the theoretical realm of databases and database management systems.\nTheoretical aspects of data management include, among other areas, the foundations of query languages, computational complexity and expressive...'},
 {'tagName': 'relational-algebra',
  'description': 'In database theory, relational algebra is a theory that uses algebraic structures with a well-founded semantics for modeling data, and defining queries on it. The theory was introduced by Edgar F. Codd.\nThe main application of relational algebra is to provide a theoretical foundation for relational databases...'},
 {'tagName': 'finite-model-theory',
  'description': 'Finite model theory (FMT) is a subarea of model theory (MT). MT is the branch of logic which deals with the relation between a formal language (syntax) and its interpretations (semantics). FMT is a restriction of MT to

# Add a list of comments to each post

In [8]:
l = []

for post_id in df_posts['id']:
    # Select all comments that belong to this post
    temp_df = df_comments.loc[df_comments['postId'] == post_id]
    # Convert each comment row to a dictionary; no comments gives []
    l.append(temp_df.to_dict(orient="records"))

In [9]:
df_posts['comments'] = l
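
Filtering `df_comments` once per post repeats the same scan many times. One way to avoid that, shown here as a sketch on invented toy data, is to group the comments by `postId` a single time and then look up each post's group. The names `toy_comments`, `toy_posts`, and `comments_by_post` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for df_comments and df_posts (invented values)
toy_comments = pd.DataFrame({
    "postId": [3, 3, 5],
    "text": ["first", "second", "third"],
    "userId": [43.0, 67.0, 71.0],
})
toy_posts = pd.DataFrame({"id": [2, 3, 5]})

# Group once: post id -> list of comment dictionaries
comments_by_post = {
    post_id: group.to_dict(orient="records")
    for post_id, group in toy_comments.groupby("postId")
}

# Posts without comments get an empty list
toy_posts["comments"] = [
    comments_by_post.get(post_id, []) for post_id in toy_posts["id"]
]
```

The resulting column has the same shape as the one built by the loop: one list of comment dictionaries per post, in post order.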

In [10]:
df_posts['comments'].iloc[1][:4]

[{'postId': 3,
  'text': 'Merge sort is $O(n\\log n)$ in the worst case, and sorting an array of integers where there is a known bound on the size of the integers can be done in $O(n)$ time with a counting sort.',
  'userId': 43.0},
 {'postId': 3,
  'text': '[Three Beautiful Quicksorts](http://video.google.com/videoplay?docid=-1031789501179533828) by Jon Bently might be of interest (Google Tech Talk).',
  'userId': 67.0},
 {'postId': 3,
  'text': 'http://www.sorting-algorithms.com/ has a pretty thorough comparison of sorting algorithms.',
  'userId': 71.0},
 {'postId': 3,
  'text': '@rgrig my bad! I must not write in a state of deep sleepiness. I will correct that. Thanks!',
  'userId': 24.0}]

# Save new Posts dataframe as JSON

In [11]:
df_json = df_posts.to_json(orient="records")
parsed = json.loads(df_json)

with open("Datasets/final_posts.json", "w", encoding="utf-8") as f:
    json.dump(parsed, f, indent=4)
    print("Final Posts dataset successfully converted to JSON!")

Final Posts dataset successfully converted to JSON!
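
As a quick sanity check, the saved file can be loaded back and inspected. The sketch below shows the idea with a temporary file and a single invented record rather than the real `Datasets/final_posts.json`; the names `records`, `loaded`, and `all_nested` are hypothetical.

```python
import json
import os
import tempfile

# One invented record with the nested fields added in the steps above
records = [
    {"id": 2, "tags": [{"tagName": "cpu", "description": "..."}], "comments": []},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "final_posts.json")

    # Write the records the same way as the cell above
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=4)

    # Read them back to confirm the file parses as JSON
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

# Every post should carry list-valued "tags" and "comments" fields
all_nested = all(
    isinstance(post["tags"], list) and isinstance(post["comments"], list)
    for post in loaded
)
```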
