In this notebook we generate the data for the three new properties introduced in A3:

- Two edge properties on the `REVIEWED` edge: *description*, a short text in which the reviewing author justifies the review, and *suggested_acceptance*, assumed here to be a boolean that states whether the author supports accepting the paper.
- One node property on the `Author` node: *affiliation*, the organization the author is affiliated with.

Note that every instance needs a value: the number of generated rows must match the number of `REVIEWED` edges and of `Author` nodes, respectively.

# 0. Libraries

In [65]:
import csv
import random
from faker import Faker
from datetime import datetime, timedelta
import os
import pandas as pd

In [66]:
directory = "data_lab1"

# Create directory for saving the data if it doesn't exist
if not os.path.exists(directory):
    os.makedirs(directory)
    print(f"Directory '{directory}' created successfully")
else:
    print(f"Directory '{directory}' already exists")

Directory 'data_lab1' already exists


# 1. Adjustable parameters

Below we include some sample topics and words for creating fake data.

In [67]:
topics = [
    "This paper explores the impact of machine learning algorithms on data analysis efficiency.",
    "We present a novel approach for optimizing graph database queries.",
    "This study analyzes the effects of large-scale distributed systems in cloud computing.",
    "In this work, we investigate the security challenges in IoT networks.",
    "This paper proposes a new model for natural language processing tasks.",
    "The research examines the evolution of data privacy regulations worldwide.",
    "An empirical study on the performance of blockchain technologies.",
    "We provide a comparative analysis of various AI optimization techniques.",
    "This study evaluates the scalability of real-time recommendation systems.",
    "A new framework for cybersecurity threat detection is introduced."
]

# Components to generate unique titles
adjectives = ["Efficient", "Scalable", "Robust", "Secure", "Advanced", "Distributed", "Optimized", "Flexible"]
nouns = ["Framework", "Model", "Approach", "Architecture", "Method", "Algorithm", "Technique", "System"]
fields = [
    "Machine Learning",
    "Blockchain",
    "Cybersecurity",
    "Natural Language Processing",
    "Quantum Computing",
    "Data Privacy",
    "Graph Databases",
    "Cloud Computing",
    "Healthcare AI",
    "IoT Networks",
]

# Helper function to generate a unique title
def generate_unique_title(existing_titles):
    while True:
        title = f"{random.choice(adjectives)} {random.choice(nouns)} for {random.choice(fields)}"
        if title not in existing_titles:
            existing_titles.add(title)
            return title

existing_titles = set()
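
`generate_unique_title` keeps drawing until it finds an unused title, so it would loop forever once every combination has been handed out (8 adjectives × 8 nouns × 10 fields = 640 titles with the lists above). A minimal sketch of a bounded alternative, assuming it is acceptable to enumerate the pool up front; `sample_unique_titles` and the shortened word lists are hypothetical and only meant as an illustration:

```python
import itertools
import random

# Shortened stand-ins for the notebook's word lists (illustration only)
adjectives = ["Efficient", "Scalable"]
nouns = ["Framework", "Model"]
fields = ["Blockchain", "Graph Databases"]

def sample_unique_titles(n, rng=random):
    """Return n distinct titles, failing loudly if the pool is too small."""
    pool = [
        f"{adj} {noun} for {field}"
        for adj, noun, field in itertools.product(adjectives, nouns, fields)
    ]
    if n > len(pool):
        raise ValueError(f"Requested {n} titles, but only {len(pool)} combinations exist")
    # random.sample draws without replacement, so no title repeats
    return rng.sample(pool, n)

titles = sample_unique_titles(5)
print(titles)
```

Drawing without replacement trades the incremental `existing_titles` set for a single call, which may or may not suit a workflow that requests titles one at a time.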

In [68]:
# Define components of the review descriptions
openings = [
    "This paper presents",
    "The authors propose",
    "An innovative approach is introduced in",
    "A comprehensive study is conducted on",
    "The manuscript explores"
]

# Note: this reassigns `topics`; the abstract sentences from the previous cell are overwritten
topics = [
    "a novel method for data analysis",
    "an in-depth review of machine learning techniques",
    "a new framework for natural language processing",
    "an experimental evaluation of neural networks",
    "a theoretical model for quantum computing"
]

evaluations = [
    "The methodology is sound and well-explained.",
    "Results are promising but require further validation.",
    "The approach lacks sufficient experimental support.",
    "The paper is well-structured and easy to follow.",
    "Some claims are not adequately substantiated."
]

# Define recommendation phrases with associated acceptance status
recommendations = [
    ("I recommend acceptance after minor revisions.", True),
    ("Major revisions are necessary before acceptance.", True),
    ("The paper should be rejected due to insufficient contributions.", False),
    ("Accept with enthusiasm.", True),
    ("Consider for a poster presentation.", True),
    ("The methodology is flawed and lacks rigor.", False),
    ("The results do not support the conclusions drawn.", False),
    ("The paper fails to make a significant contribution.", False)
]

# Generate a list of random review descriptions
reviews = []
for _ in range(5):
    # Each recommendation is a (phrase, flag) tuple; only the phrase goes into the text
    review = f"{random.choice(openings)} {random.choice(topics)}. {random.choice(evaluations)} {random.choice(recommendations)[0]}"
    reviews.append(review)

# Output the list of reviews
for idx, review in enumerate(reviews, 1):
    print(f"Review {idx}: {review}")


Review 1: The manuscript explores a theoretical model for quantum computing. Some claims are not adequately substantiated. The paper fails to make a significant contribution.
Review 2: This paper presents an in-depth review of machine learning techniques. Results are promising but require further validation. The paper fails to make a significant contribution.
Review 3: This paper presents a theoretical model for quantum computing. The approach lacks sufficient experimental support. I recommend acceptance after minor revisions.
Review 4: The authors propose a theoretical model for quantum computing. The methodology is sound and well-explained. Consider for a poster presentation.
Review 5: The manuscript explores a theoretical model for quantum computing. The methodology is sound and well-explained. Accept with enthusiasm.
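
Each entry of `recommendations` is a `(phrase, flag)` pair, so the phrase and the acceptance flag have to come from the same draw for the text and the boolean to agree. A small sketch of that unpacking, using a two-entry stand-in list and a hypothetical helper `draw_recommendation`:

```python
import random

# Two-entry stand-in for the notebook's `recommendations` list
recommendations = [
    ("I recommend acceptance after minor revisions.", True),
    ("The paper should be rejected due to insufficient contributions.", False),
]

def draw_recommendation(rng=random):
    """Pick one (phrase, flag) pair and return both parts from the same draw."""
    phrase, accepted = rng.choice(recommendations)
    return phrase, accepted

phrase, accepted = draw_recommendation()
print(f"{phrase} -> suggested_acceptance={accepted}")
```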


# 2. Loading the fake data of the `REVIEWED` edges and `Author` nodes

In [69]:
reviewed = pd.read_csv('data_lab1/reviewed.csv')
author = pd.read_csv('data_lab1/authors.csv')

In [70]:
reviewed.head()

|   | author_id | paper_id | review_date |
|---|---|---|---|
| 0 | 44 | 1 | 2024-09-20 |
| 1 | 46 | 1 | 2024-05-22 |
| 2 | 26 | 1 | 2024-11-04 |
| 3 | 49 | 1 | 2024-06-16 |
| 4 | 12 | 1 | 2024-06-10 |


In [71]:
author.head()

|   | id | name |
|---|---|---|
| 0 | 1 | Shannon Martinez |
| 1 | 2 | Olivia Jones |
| 2 | 3 | Michael Baker |
| 3 | 4 | Patrick Curtis |
| 4 | 5 | Tanya Riley |


To identify each record uniquely, we only need the key columns:

- From the `Author` node, the `id`.
- From the `REVIEWED` edge, the `author_id` and the `paper_id`.

In [72]:
# Keep only the key columns, selected by name rather than by position
reviewed = reviewed[['author_id', 'paper_id']]
author = author[['id']]
author = author.rename(columns={'id': 'author_id'})
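
Because these key columns are what later ties each generated property to its node or edge, it can be worth confirming that they really are unique before generating anything. A sketch with toy data (the values and the helper name `has_duplicate_keys` are made up for illustration), built on `DataFrame.duplicated`:

```python
import pandas as pd

def has_duplicate_keys(df, key_cols):
    """Return True if any combination of key_cols appears more than once."""
    return bool(df.duplicated(subset=key_cols).any())

# Toy stand-ins for the real CSVs
toy_reviewed = pd.DataFrame({"author_id": [44, 46, 44], "paper_id": [1, 1, 2]})
toy_author = pd.DataFrame({"author_id": [1, 2, 2]})

print(has_duplicate_keys(toy_reviewed, ["author_id", "paper_id"]))  # False
print(has_duplicate_keys(toy_author, ["author_id"]))                # True
```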

# 3. Creating the data (the `.csv`)

## 3.1. `Author` node: adding the affiliation

In [73]:
# Create fake company data for each author id
fake = Faker()
for i in range(len(author)):
    author.at[i, 'affiliation'] = fake.company()

# Display the head of the data frame with the new attribute
author.head()

|   | author_id | affiliation |
|---|---|---|
| 0 | 1 | Morrison, Casey and Parrish |
| 1 | 2 | Bailey Group |
| 2 | 3 | Flores-Wright |
| 3 | 4 | Anderson-Robinson |
| 4 | 5 | Russo LLC |


## 3.2. `REVIEWED` edges: adding a description and a suggested acceptance

In [74]:
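def generate_reviews(num_reviews, acceptance_rate):
    """Generate `num_reviews` review dicts, roughly `acceptance_rate` of them accepting.

    Each dict holds a 'description' text and a matching boolean
    'suggested_acceptance'; the list is shuffled before it is returned.
    """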
    num_accept = int(num_reviews * acceptance_rate)
    num_reject = num_reviews - num_accept

    # Separate recommendations based on acceptance status
    accept_recommendations = [rec for rec in recommendations if rec[1]]
    reject_recommendations = [rec for rec in recommendations if not rec[1]]

    reviews = []

    # Generate accepted reviews
    for _ in range(num_accept):
        opening = random.choice(openings)
        topic = random.choice(topics)
        evaluation = random.choice(evaluations)
        recommendation = random.choice(accept_recommendations)[0]
        review_text = f"{opening} {topic}. {evaluation} {recommendation}"
        reviews.append({'description': review_text, 'suggested_acceptance': True})

    # Generate rejected reviews
    for _ in range(num_reject):
        opening = random.choice(openings)
        topic = random.choice(topics)
        evaluation = random.choice(evaluations)
        recommendation = random.choice(reject_recommendations)[0]
        review_text = f"{opening} {topic}. {evaluation} {recommendation}"
        reviews.append({'description': review_text, 'suggested_acceptance': False})

    # Shuffle the reviews to randomize order
    random.shuffle(reviews)

    return reviews
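
Note that `int(num_reviews * acceptance_rate)` truncates toward zero, so the realized share of accepting reviews can fall below `acceptance_rate` when the product is not a whole number. The arithmetic, isolated in a hypothetical helper `split_counts` that mirrors the first two lines of `generate_reviews`:

```python
def split_counts(num_reviews, acceptance_rate):
    """Mirror the accept/reject split used in generate_reviews."""
    num_accept = int(num_reviews * acceptance_rate)  # truncates, does not round
    num_reject = num_reviews - num_accept
    return num_accept, num_reject

print(split_counts(10, 0.7))  # (7, 3): exactly 70% accepted
print(split_counts(7, 0.7))   # (4, 3): 4/7 is about 57% accepted
```

If the nearest whole number is preferred, `round(...)` could replace `int(...)`; which of the two is wanted is a design choice the notebook leaves open.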

In [75]:
# Generate one review per REVIEWED edge
num_reviews = len(reviewed)  # Number of reviews to generate
acceptance_rate = 0.7  # Fraction of reviews that suggest acceptance
sample_reviews = generate_reviews(num_reviews, acceptance_rate)

# Create a pandas DataFrame
reviewed_properties = pd.DataFrame(sample_reviews)

# Display the first few rows of the DataFrame
reviewed_properties.head()

|   | description | suggested_acceptance |
|---|---|---|
| 0 | An innovative approach is introduced in a new ... | False |
| 1 | An innovative approach is introduced in a nove... | True |
| 2 | The authors propose a new framework for natura... | True |
| 3 | The authors propose an experimental evaluation... | False |
| 4 | A comprehensive study is conducted on an exper... | True |


In [76]:
# Attach the review properties column-wise; rows are paired by index label
reviewed = pd.concat([reviewed, reviewed_properties], axis=1)
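
`pd.concat(..., axis=1)` pairs rows by index label rather than by position. That is harmless here because both frames carry the default `RangeIndex`, but a filtered or re-ordered `reviewed` would produce misaligned rows padded with `NaN`. A toy illustration of the pitfall, with `reset_index(drop=True)` as one possible guard (the frames below are invented for the example):

```python
import pandas as pd

# Left frame with a non-default index, e.g. after filtering rows
left = pd.DataFrame({"author_id": [44, 46]}, index=[5, 6])
right = pd.DataFrame({"suggested_acceptance": [True, False]})

# Index labels 5, 6 and 0, 1 do not overlap: four rows, each missing one column
misaligned = pd.concat([left, right], axis=1)

# Resetting the index first restores position-based pairing
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(len(misaligned), len(aligned))
```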

In [77]:
reviewed.head()

|   | author_id | paper_id | description | suggested_acceptance |
|---|---|---|---|---|
| 0 | 44 | 1 | An innovative approach is introduced in a new ... | False |
| 1 | 46 | 1 | An innovative approach is introduced in a nove... | True |
| 2 | 26 | 1 | The authors propose a new framework for natura... | True |
| 3 | 49 | 1 | The authors propose an experimental evaluation... | False |
| 4 | 12 | 1 | A comprehensive study is conducted on an exper... | True |


# 4. Saving the data frames

In [78]:
author.to_csv('data_lab1/authors_additional_properties.csv', index=False)
reviewed.to_csv('data_lab1/reviewed_additional_properties.csv', index=False)
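
As stated at the top of the notebook, every `REVIEWED` edge and every `Author` node needs a value for its new properties. A hedged sketch of a check that could be run on the frames (or on the CSV files read back); `check_complete` is a hypothetical helper and the toy frame stands in for the real data:

```python
import pandas as pd

def check_complete(df, expected_rows, property_cols):
    """Return True if df has the expected row count and no missing property values."""
    has_all_rows = len(df) == expected_rows
    no_missing = bool(df[property_cols].notna().all().all())
    return has_all_rows and no_missing

# Toy stand-in for the `reviewed` frame after the properties were attached
toy = pd.DataFrame({
    "author_id": [44, 46],
    "paper_id": [1, 1],
    "description": ["Sound methodology.", None],
    "suggested_acceptance": [True, False],
})

props = ["description", "suggested_acceptance"]
print(check_complete(toy, 2, props))           # False: one description is missing
print(check_complete(toy.dropna(), 1, props))  # True
```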