# <span style="color:#D6D58E">Lorenzo Paolini - OpenScience Project Notebook</span>

## <span style="color:#9FC131">Research question</span>

<b>How many citations (according to COCI) involve, either as citing or cited entities, publications in SSH journals (according to ERIH-PLUS) included in OpenCitations Meta? Which disciplines cite the most, and which are cited the most? How many citations start from and go to publications in OpenCitations Meta that are not included in SSH journals?</b>

--------

-----

### <span style="color:#D6D58E">General abstract - Progressive update</span>
#### Last update: Week 1 (20/26 March)

<b>Purpose</b>: we want to find out the following:
- by looking at the citation data contained in COCI, the number of citations included in Meta that refer to publications in SSH (Social Sciences and Humanities) journals listed in ERIH-PLUS;
- the disciplines citing the most vs. the disciplines cited the most;
- the citations from/to publications contained in Meta that are not included in SSH journals.

We want to create a connection between these three different datasets in order to have an overall view of the citations present in each of them.


<b>Methodology</b>: we will approach the problem from a computational point of view, by building Python software able to analyse the data, query them to retrieve the information needed, and present the results in a clear and understandable way.


<b>Findings</b>: so far, we cannot see meaningful differences in the number of citations coming from different disciplines, since this number is related to the subject of the study, while the most cited disciplines are psychology, health and science studies.


<b>Originality/Value</b>: our research is valuable because it adds information to existing resources, with the aim of making them easier to use and giving users a clearer view of the data contained in each dataset. Further developments are possible: for example, we could analyse other disciplines to obtain the same overview for other fields.


<i><b>Keywords</b>: OpenScience, Citation, OC-COCI, OC-Meta, ERIH-PLUS, journals</i>


--------

-----

### <span style="color:#9FC131">Week 1</span>
##### <i>20/03 - 25/03</i>

During this week we have defined the abstract for our work. Additionally, I have started to download the data that we will use to carry out our project.

-----

### <span style="color:#9FC131">Week 2</span>
##### <i>27/03 - 01/04</i>

During this second week, which goes together with the third one because of Easter, I created my own personal ORCID.<br>
Additionally, I finally downloaded all the data for the final project and started to explore them in detail. The aim of this exploration was to get a better grasp of what we have at our disposal to answer the research questions stated at the top of this notebook.<br>

Most of the exploration was done with the pandas and os libraries. I still have some doubts about COCI in particular: I am not sure which data I should work on.

##### <span style="color:#D6D58E">Data Management Plan</span>
Together with my group, we have defined the first draft of the data management plan of our project and deposited it permanently on Zenodo. As requested, we produced it for two datasets:
- one for the data we will use for our project, and 
- another for the software we will develop to analyse them.

-----

### <span style="color:#9FC131">Week 3</span>
##### <i>03/04 - 08/04</i>

##### <span style="color:#D6D58E">Workflow</span>
Together with my group, we have also defined and written a first version of our workflow in [protocols.io](https://www.protocols.io/). The workflow is not precisely defined yet, because we still need to understand better what we aim to do.

After an additional review of the workflow, this morning we have obtained a DOI for the first version.

-----

### <span style="color:#9FC131">Week 4</span>
##### <i>10/04 - 15/04</i>

During this week I tried to catch up on the lecture I had missed, but I did not manage to cover all of it. Nonetheless, I looked more closely at the topic of peer review and reviewed the other group's Data Management Plan, trying to make the review as useful as possible.

-----

### <span style="color:#9FC131">Week 5</span>
##### <i>17/04 - 22/04</i>

We noticed a new release of ERIH-PLUS. We decided all together to use this version to conduct our analysis.

This week, I met several times with the other members of the group to revise the DMP and the Protocol according to the reviews we received. New versions of both research outcomes have been published. Additionally, we discussed and prepared some answers to our reviewers, which will be published and subsequently delivered.

Another result reached through these meetings is a first shared and commonly agreed version of the final software, which we have designed and started to write. In particular, we decided to re-use part of the code from the preprocessing operations developed inside OpenCitations, properly linked in the workflow.

-----

### <span style="color:#9FC131">Week 6</span>
##### <i>24/04 - 29/04</i>

During this week, we discussed the workflow of the project in more depth. We have also run some experiments on COCI's preprocessing.

Additionally, I have started to work on executable bash files that make the entire process easier to reproduce. Accordingly, I have created a new branch in our GitHub repository, containing these new files and some tests that need further investigation.

For now, I have developed a .sh file that automatically downloads all the original files. These files will then be processed by an additional .sh file, which is still under development: at the moment, preprocessing.sh contains only a first version of the COCI preprocessing operations.

I have also written a first version of the README of this new branch, which explains how to use these files and which commands are needed.

-----------

### <span style="color:#9FC131">Week 7</span>
##### <i>01/05 - 06/05</i>

I reviewed the original answer to the other group's review of our DMP, following the double check done with Sara, and sent it back to her so that she could publish it on Zenodo.

Additionally, I have started working on the code to answer the three research questions.
This work has been done in parallel with that of the other members of the group. In this way we should have different versions capable of solving the problem, and we should also be able to double-check the results of our analysis more precisely.

---------

### <span style="color:#9FC131">Week 8</span>
##### <i>08/05 - 13/05</i>

Some of the code is now ready. I have produced the answers to both the first and the third research questions, and we are waiting to double-check them against the results of the other members' analysis.

In my view, the best way to deal with such big data is to produce smaller versions of the same data that carry less information. Accordingly, I have stored the records contained in META whose venue is an SSH journal listed in ERIH-PLUS, together with their DOIs, in new .csv files, called ERIH_META. These files have also been produced by the other members of the group but, since their version was not working properly on my machine, I devised a way to produce my own copy.

In [None]:
import os
import pandas as pd
from multiprocessing import Pool
from functools import partial
from tqdm import tqdm
import json

# Preprocess erih in json
erih_dir_path = "/Volumes/Extreme SSD/OS_data/Processed_data/Processed_ERIH/erih_preprocessed.csv"

erih = pd.read_csv(erih_dir_path, delimiter=';', encoding='utf-8')
erih_dict = {}
erih_disciplines = set()
for idx, row in tqdm(erih.iterrows()):
    erih_dict[row["venue_id"]] = []
    disciplines = row["ERIH_disciplines"].split(',')
    for discipline in disciplines:
        erih_dict[row["venue_id"]].append(discipline.strip())
        erih_disciplines.add(discipline.strip())

with open("erih_dict.json", "w") as f:
    json.dump(erih_dict, f)

with open("erih_disciplines.json", "w") as f:
    disciplines = {}
    for discipline in erih_disciplines:
        disciplines[discipline] = 0
    json.dump(disciplines, f)

# Build erih-meta in csv files

meta_dir_path = "/Volumes/Extreme SSD/OS_data/Processed_data/Processed_META/"
erih_meta_dir_path = "/Volumes/Extreme SSD/OS_data/Processed_data/ERIH_META_prep/"
meta_filenames = [filename for filename in os.listdir(meta_dir_path) if os.path.isfile(os.path.join(meta_dir_path, filename)) 
                                                                                and not filename.startswith("._")]

for filename in tqdm(meta_filenames):
    # read
    meta_df = pd.read_csv(os.path.join(meta_dir_path, filename), delimiter=',', encoding='utf-8')
    # drop nan from venue column
    meta_df = meta_df.dropna(subset=['venue'])
    # add a new column
    meta_df["ERIH_disciplines"] = ""
    # iterate over rows
    for idx, row in meta_df.iterrows():
        # a venue cell may hold several space-separated ids:
        # use the disciplines of the first id found in erih_dict
        for venue_id in row["venue"].split(' '):
            if venue_id in erih_dict:
                meta_df.at[idx, "ERIH_disciplines"] = erih_dict[venue_id]
                break
    # save the dataframe -> one by one...
    meta_df.to_csv(os.path.join(erih_meta_dir_path, filename), index=False)
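
The same "smaller copy" idea can be applied to COCI itself. The following is a minimal sketch, not the code actually used to build the `smaller_COCI` folder: the helper name `load_citing_cited` is hypothetical, and the column names are assumed to follow the layout of the COCI CSV dump (`oci, citing, cited, creation, timespan, journal_sc, author_sc`).

```python
import io

import pandas as pd


def load_citing_cited(source):
    """Read a COCI-style CSV (path or file-like object), keeping only
    the 'citing' and 'cited' columns (hypothetical helper)."""
    return pd.read_csv(source, usecols=["citing", "cited"], dtype=str)


# Hypothetical two-row sample; the header is an assumption based on the
# column layout of the COCI CSV dump.
sample = io.StringIO(
    "oci,citing,cited,creation,timespan,journal_sc,author_sc\n"
    "0101-0202,10.1000/a,10.1000/b,2020-01,P1Y,no,no\n"
    "0101-0303,10.1000/a,10.1000/c,2020-01,P2Y,no,yes\n"
)
small = load_citing_cited(sample)
print(list(small.columns))
```

Reading only the two columns needed for the counts should reduce both memory usage and the size of the files written back with `small.to_csv(..., index=False)`.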

Then, I split the papers in the newly built ERIH-META into SSH and NOT_SSH publications, to have a clear view of each group. The result has been saved in a JSON file, as you can see below.

In [None]:
# Function to filter each erih_meta csv in order to divide ssh and not_ssh dois in json

erih_meta_filenames = [filename for filename in os.listdir(erih_meta_dir_path) if os.path.isfile(os.path.join(erih_meta_dir_path, filename)) 
                                                                                and not filename.startswith("._")]


erih_meta_papers = {'ssh_papers':list(),
                    'not_ssh_papers':list()}

for filename in tqdm(erih_meta_filenames):
    df = pd.read_csv(os.path.join(erih_meta_dir_path, filename))
    df = df[['id', 'ERIH_disciplines']]
    # fill all the possible NaN or None with ""
    df = df.fillna('')
    # create boolean mask for erih_disciplines column
    mask = df['ERIH_disciplines'] != ''

    # filter the dataframe with this mask
    ssh_df = df[mask]
    ssh_df = ssh_df.reset_index(drop=True)

    # Build a second dataframe from the inverted mask, keeping only the rows where the mask is False
    not_ssh_df = df[~mask]
    not_ssh_df = not_ssh_df.reset_index(drop=True)

    # Get the unique values of the id column
    unique_ssh = ssh_df['id'].unique().tolist()
    unique_not_ssh = not_ssh_df['id'].unique().tolist()

    # Append the unique values to the list
    erih_meta_papers['ssh_papers'].extend(unique_ssh)
    erih_meta_papers['not_ssh_papers'].extend(unique_not_ssh)

# Save inside JSON
with open("erih_meta_papers.json", "w") as f:
    json.dump(erih_meta_papers, f)

print("Done...")

Starting from the above results, I checked whether the same work had more than one DOI (as can be spotted in the .csv files of Meta). Where it did, I split the DOIs, removed possible duplicate values, and created a new dictionary of unique DOIs for SSH and NOT_SSH.

In [None]:
# Clean double dois in erih-meta

#load erih json into dict
with open("erih_meta_papers.json", "r") as f:
    erih_meta_papers = json.load(f)

ssh_papers = []
for ssh_pap in tqdm(erih_meta_papers['ssh_papers']):
    papers = ssh_pap.split(' ')
    ssh_papers.extend(papers)

not_ssh_papers = []
for not_ssh_pap in tqdm(erih_meta_papers['not_ssh_papers']):
    papers = not_ssh_pap.split(' ')
    not_ssh_papers.extend(papers)

# Create a new dict with the unique values
erih_meta_papers_unique = {'ssh_papers':list(set(ssh_papers)),
                            'not_ssh_papers':list(set(not_ssh_papers))}

with open("erih_meta_papers_unique.json", "w") as f:
    json.dump(erih_meta_papers_unique, f)

Additionally, I have developed the snippet below to answer both Q1 and Q3; it is still pretty slow. It needs to be optimized (and to run on parallel cores), which I will do if the answers to the two research questions are correct and if we decide, with the other members of the group, to use this approach.

In [None]:
import os
import pandas as pd
from tqdm import tqdm
import json

coci_dir_path = "/Volumes/Extreme SSD/OS_data/Processed_data/smaller_COCI/"

# Scan directory
coci_filenames = []
with os.scandir(coci_dir_path) as entries:
    for entry in tqdm(entries, desc='Iterating filenames...', colour='blue', smoothing=0.1, total=len(os.listdir(coci_dir_path))):
        if entry.is_file() and not entry.name.startswith("._"):
            coci_filenames.append(entry.name)

print('Reading erih-meta...')
with open("erih_meta_papers_unique.json", "r") as f:
    erih_meta_papers_unique = json.load(f)

print('Building sets...')
ssh_set = set(erih_meta_papers_unique['ssh_papers'])
not_ssh_set = set(erih_meta_papers_unique['not_ssh_papers'])

ssh_citations = 0
not_ssh_citations = 0

def count_citations(row):
    """
    Classify one COCI row as 'ssh', 'not_ssh' or 'other'.
    Meant to be passed to DataFrame.apply with axis=1.
    """
    # The row contains an SSH citation? -> This is with an OR
    if row['citing'] in ssh_set or row['cited'] in ssh_set:
        return 'ssh'
    # The row contains a non-SSH citation? -> This is with an AND
    elif row['citing'] in not_ssh_set and row['cited'] in not_ssh_set:
        return 'not_ssh'
    # If not inside
    else:
        return 'other'

for filename in tqdm(coci_filenames, desc='Iterating files...', colour='green', smoothing=0.1):
    df = pd.read_csv(os.path.join(coci_dir_path, filename))

    # Apply count_citations to each row of the DF
    citation_counts = df.apply(count_citations, axis=1).value_counts()
    # Increment the SSH citation count and non-SSH citation count
    ssh_citations += citation_counts.get('ssh', 0)
    not_ssh_citations += citation_counts.get('not_ssh', 0)

print(ssh_citations)
print(not_ssh_citations)   

print("Done...")

The following are the results I got from it:
- SSH Count in META: 225370804
- NOT_SSH Count in META: 985223927
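
Since the snippet above is still slow, one possible optimisation is to replace the row-wise `apply` with vectorised `isin` masks. The following is only a sketch of that idea, not the approach agreed with the group: the helper name `count_citations_vectorized` and the toy identifiers are hypothetical, while the `citing`/`cited` columns and the OR/AND logic mirror `count_citations`.

```python
import pandas as pd


def count_citations_vectorized(df, ssh_set, not_ssh_set):
    """Return (ssh, not_ssh) citation counts for one COCI chunk
    (hypothetical helper mirroring count_citations)."""
    # SSH citation: citing OR cited entity is an SSH publication
    ssh_mask = df["citing"].isin(ssh_set) | df["cited"].isin(ssh_set)
    # non-SSH citation: not already counted as SSH, and BOTH ends are non-SSH
    not_ssh_mask = (
        ~ssh_mask
        & df["citing"].isin(not_ssh_set)
        & df["cited"].isin(not_ssh_set)
    )
    return int(ssh_mask.sum()), int(not_ssh_mask.sum())


# Toy data with made-up identifiers, only to illustrate the three cases.
toy = pd.DataFrame({
    "citing": ["doi:a", "doi:c", "doi:c", "doi:x"],
    "cited": ["doi:c", "doi:b", "doi:d", "doi:c"],
})
ssh = {"doi:a", "doi:b"}
not_ssh = {"doi:c", "doi:d"}

ssh_count, not_ssh_count = count_citations_vectorized(toy, ssh, not_ssh)
print(ssh_count, not_ssh_count)  # 2 1
```

In the toy data, the first two rows involve an SSH publication, the third links two non-SSH publications, and the fourth falls into 'other' because `doi:x` is in neither set. Inside the loop over the COCI files, the two returned values could be added to `ssh_citations` and `not_ssh_citations` in place of the `apply` call.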