# <span style="color:#D6D58E">Lorenzo Paolini - OpenScience Project Notebook</span>

## <span style="color:#9FC131">Research question</span>

<b>How many citations (according to COCI) involve, either as citing or cited entities, publications in SSH journals (according to ERIH-PLUS) included in OpenCitations Meta? What are the disciplines that cite the most and those that are cited the most? How many citations start from and go to publications in OpenCitations Meta that are not included in SSH journals?</b>

--------

-----

### <span style="color:#D6D58E">General abstract - Progressive update</span>
#### Last update: Week 1 (20/26 march)

<b>Purpose</b>: we want to find out the following:
- by looking at the citation data contained in COCI, the number of citations included in Meta which refer to publications in SSH (Social Sciences and Humanities) journals indicated in ERIH-PLUS;
- the disciplines citing the most VS the disciplines cited the most;
- the citations from/to publications contained in Meta which are not included in SSH journals.

We want to create a connection between these three different datasets in order to have an overall view of the citations present in each of them.


<b>Methodology</b>: we will approach the problem computationally, by building Python software that analyses the data, queries them to retrieve the information needed, and presents the results in a clear and understandable way.


<b>Findings</b>: so far, we cannot see meaningful differences in the number of citations coming from different disciplines, since it is related to the subject of the study, while the most cited disciplines are psychology, health and science studies.


<b>Originality/Value</b>: our research adds information to existing resources, with the aim of facilitating their use and giving users a clearer view of the data contained in each dataset. Further developments are possible: for example, we could analyse other disciplines, to obtain the same overview for other fields.


<i><b>Keywords</b>: OpenScience, Citation, OC-COCI, OC-Meta, ERIH-PLUS, journals</i>


--------

-----

### <span style="color:#9FC131">Week 1</span>
##### <i>20/03 - 25/03</i>

During this week we have defined the abstract for our work. Additionally, I have started to download the data that we will use to carry out our project.

-----

### <span style="color:#9FC131">Week 2</span>
##### <i>27/03 - 01/04</i>

During this second week, which goes together with the third one because of the Easter break, I created my own ORCID iD.<br>
Additionally, I have finally downloaded all the data for the final project and started to explore them in detail. The aim of this exploration was to get a better grasp of what we have at our disposal in order to answer the research questions stated at the top of this notebook.<br>

Most of the exploration has been done with the pandas and os libraries. I still have some doubts about COCI in particular: I am not sure which data I should work on.
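
This kind of exploration with pandas and os can be sketched as follows. The helper name `peek_csv_folder` and its parameters are hypothetical and not part of the project code; the sketch only assumes that a dump is a folder of CSV files and reads the first few rows of each one, so that even very large files can be inspected cheaply.

```python
import os
import pandas as pd


def peek_csv_folder(dir_path, n_rows=5):
    """Return {filename: DataFrame with the first n_rows rows} for each CSV in dir_path."""
    previews = {}
    for name in sorted(os.listdir(dir_path)):
        full_path = os.path.join(dir_path, name)
        # skip sub-folders, non-CSV files and macOS "._" companion files
        if not os.path.isfile(full_path) or not name.endswith(".csv") or name.startswith("._"):
            continue
        # nrows limits parsing to the head of the file, keeping memory usage low
        previews[name] = pd.read_csv(full_path, nrows=n_rows)
    return previews
```

Calling `peek_csv_folder(some_dir)` and printing the `columns` of each preview gives a quick overview of which fields every dataset offers.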

##### <span style="color:#D6D58E">Data Management Plan</span>
Together with my group, we have defined the first draft of the data management plan of our project, and deposited it permanently on Zenodo. According to the requests, we have produced it for two datasets:
- one for the data we will use for our project, and 
- another for the software we will develop to analyse them.

-----

### <span style="color:#9FC131">Week 3</span>
##### <i>03/04 - 08/04</i>

##### <span style="color:#D6D58E">Workflow</span>
Together with my group, we have also defined and written a first version of our workflow in [protocols.io](https://www.protocols.io/). The workflow is not precisely defined yet, because we still need to better understand what we aim to do.

After an additional review of the workflow, this morning we obtained a DOI for the first version.

-----

### <span style="color:#9FC131">Week 4</span>
##### <i>10/04 - 15/04</i>

During this week I have tried to catch up on the lectures I missed, but I did not manage to cover them all. Nonetheless, I have studied the topic of peer review in more depth and reviewed the other group's Data Management Plan, trying to make the review as useful as possible.

-----

### <span style="color:#9FC131">Week 5</span>
##### <i>17/04 - 22/04</i>

We noticed a new release of ERIH-PLUS and decided together to use this version for our analysis.

This week, we met several times with the other members of the group in order to revise the DMP and the Protocol according to the reviews we received. New versions of both the research outcomes have been published. Additionally, we discussed and prepared some answers to our reviewers, which will be published and subsequently delivered.

Another result reached through these meetings is a first shared version of the final software, which we have designed and started to write. In particular, we decided to re-use part of the code from the preprocessing operations developed inside OpenCitations, properly linked in the workflow.

-----

### <span style="color:#9FC131">Week 6</span>
##### <i>24/04 - 29/04</i>

During this week, we discussed the workflow of the project in more detail. We have also run some experiments on COCI's preprocessing.

Additionally, I have started to work on executable bash files that make the entire process easier to reproduce. Accordingly, I have created a new branch in our GitHub repository, containing these new files and some tests that need further investigation.

For now, I have developed a .sh file that automatically downloads all the original files, which will then be processed by an additional .sh file that is still under development. At the moment, the preprocessing.sh file contains only a first version of the COCI preprocessing operations.

I have also written a first version of the README of this new branch, useful to explain how to deal with such files and what commands are needed.

-----------

### <span style="color:#9FC131">Week 7</span>
##### <i>01/05 - 06/05</i>

I revised the original answer to the other group's review of our DMP, following the double check done with Sara, and sent it back to her for publication on Zenodo.

Additionally, I have started working on the code to answer the three research questions.
This work has been done in parallel with the work of the other members of the group. In this way we should have different versions capable of solving the problem, and we can also double-check more precisely the results of our analysis.

---------

### <span style="color:#9FC131">Week 8</span>
##### <i>08/05 - 13/05</i>

Some of the code is now ready. I have produced the answers to both the first and the third research questions, and we are waiting for a double check against the other members' analysis results.

Given the way I approached the problem, the best way to deal with such big data is to produce smaller versions of the same data, carrying less information. Accordingly, I have stored the DOIs contained in META that belong to an SSH journal in a new .csv file, called ERIH_META. This file has also been produced by the other members of the group but, since it was not working properly on my machine, I devised a way to produce my own copy of it.

In [None]:
import os
import pandas as pd
from multiprocessing import Pool
from functools import partial
from tqdm import tqdm
import json

# Preprocess erih in json
erih_dir_path = "/Volumes/Extreme SSD/OS_data/Processed_data/Processed_ERIH/erih_preprocessed.csv"

erih = pd.read_csv(erih_dir_path, delimiter=';', encoding='utf-8')
erih_dict = {}
erih_disciplines = set()
for idx, row in tqdm(erih.iterrows()):
    erih_dict[row["venue_id"]] = []
    disciplines = row["ERIH_disciplines"].split(',')
    for discipline in disciplines:
        erih_dict[row["venue_id"]].append(discipline.strip())
        erih_disciplines.add(discipline.strip())

with open("erih_dict.json", "w") as f:
    json.dump(erih_dict, f)

with open("erih_disciplines.json", "w") as f:
    disciplines = {}
    for discipline in erih_disciplines:
        disciplines[discipline] = 0
    json.dump(disciplines, f)

# Build erih-meta in csv files

meta_dir_path = "/Volumes/Extreme SSD/OS_data/Processed_data/Processed_META/"
erih_meta_dir_path = "/Volumes/Extreme SSD/OS_data/Processed_data/ERIH_META_prep/"
meta_filenames = [filename for filename in os.listdir(meta_dir_path) if os.path.isfile(os.path.join(meta_dir_path, filename)) 
                                                                                and not filename.startswith("._")]

for filename in tqdm(meta_filenames):
    # read
    meta_df = pd.read_csv(os.path.join(meta_dir_path, filename), delimiter=',', encoding='utf-8')
    # drop nan from venue column
    meta_df = meta_df.dropna(subset=['venue'])
    # add a new column
    meta_df["ERIH_disciplines"] = ""
    # iterate over rows
    for idx, row in meta_df.iterrows():
        # a venue cell may hold several space-separated ids:
        # keep the disciplines of the first id found in erih_dict
        for venue_id in row["venue"].split(' '):
            if venue_id in erih_dict:
                meta_df.at[idx, "ERIH_disciplines"] = erih_dict[venue_id]
                break
    # save the dataframe -> one by one...
    meta_df.to_csv(os.path.join(erih_meta_dir_path, filename), index=False)
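
The venue lookup above can also be expressed without a row-by-row loop. The following is only a sketch of one possible vectorised variant, not the code used in the project: the function name `add_erih_disciplines` is hypothetical, and, unlike the cell above, it stores the disciplines as a single `", "`-joined string instead of a Python list (an assumption made to keep the CSV output easy to parse).

```python
import pandas as pd


def add_erih_disciplines(meta_df, erih_dict):
    """Return a copy of meta_df (rows without venue dropped) with an 'ERIH_disciplines' column.

    Each 'venue' cell may hold several space-separated ids; the disciplines of
    the first id found in erih_dict are kept, joined with ', '.
    """
    out = meta_df.dropna(subset=["venue"]).copy()
    # one row per (paper, venue id); explode keeps the original row index
    venue_ids = out["venue"].str.split(" ").explode()
    # map every id to its list of disciplines; ids unknown to ERIH become NaN
    matched = venue_ids.map(erih_dict).dropna()
    # keep only the first matching venue id of each paper
    first_match = matched[~matched.index.duplicated(keep="first")]
    # join each list into one string; papers without a match get ""
    out["ERIH_disciplines"] = first_match.map(", ".join).reindex(out.index).fillna("")
    return out
```

Whether this is actually faster than the loop on the real META files would need to be measured.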

Then, I have divided the papers in the newly built ERIH-META into SSH and NOT_SSH publications. The result has been saved in a JSON file, as you can see below.

In [None]:
# Function to filter each erih_meta csv in order to divide ssh and not_ssh dois in json

erih_meta_filenames = [filename for filename in os.listdir(erih_meta_dir_path) if os.path.isfile(os.path.join(erih_meta_dir_path, filename)) 
                                                                                and not filename.startswith("._")]


erih_meta_papers = {'ssh_papers':list(),
                    'not_ssh_papers':list()}

for filename in tqdm(erih_meta_filenames):
    df = pd.read_csv(os.path.join(erih_meta_dir_path, filename))
    df = df[['id', 'ERIH_disciplines']]
    # fill all the possible NaN or None with ""
    df = df.fillna('')
    # create boolean mask for erih_disciplines column
    mask = df['ERIH_disciplines'] != ''

    # filter the dataframe with this mask
    ssh_df = df[mask]
    ssh_df = ssh_df.reset_index(drop=True)

    # Create a second dataframe from the above mask, where are kept only the False rows in the mask
    not_ssh_df = df[~mask]
    not_ssh_df = not_ssh_df.reset_index(drop=True)

    # Get the unique values of the id column
    unique_ssh = ssh_df['id'].unique().tolist()
    unique_not_ssh = not_ssh_df['id'].unique().tolist()

    # Append the unique values to the list
    erih_meta_papers['ssh_papers'].extend(unique_ssh)
    erih_meta_papers['not_ssh_papers'].extend(unique_not_ssh)

# Save inside JSON
with open("erih_meta_papers.json", "w") as f:
    json.dump(erih_meta_papers, f)

print("Done...")

Starting from the above results, I have checked whether some works carry more than one DOI (as can be spotted in the .csv files of Meta). Where this happens I have split the DOIs, then I have removed possible duplicate values and created a new dictionary of unique DOIs for SSH and NOT_SSH.

In [None]:
# Clean double dois in erih-meta

#load erih json into dict
with open("erih_meta_papers.json", "r") as f:
    erih_meta_papers = json.load(f)

ssh_papers = []
for ssh_pap in tqdm(erih_meta_papers['ssh_papers']):
    papers = ssh_pap.split(' ')
    ssh_papers.extend(papers)

not_ssh_papers = []
for not_ssh_pap in tqdm(erih_meta_papers['not_ssh_papers']):
    papers = not_ssh_pap.split(' ')
    not_ssh_papers.extend(papers)

# Create a new dict with the unique values
erih_meta_papers_unique = {'ssh_papers':list(set(ssh_papers)),
                            'not_ssh_papers':list(set(not_ssh_papers))}

with open("erih_meta_papers_unique.json", "w") as f:
    json.dump(erih_meta_papers_unique, f)
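
Since the two groups are treated differently when citations are counted, it may be worth checking that no id ends up in both of them after the split. A minimal sketch of such a check follows; the helper names `split_ids` and `unique_and_overlap` are hypothetical, and the sketch assumes, as in the cell above, that multiple ids are separated by single spaces.

```python
def split_ids(raw_values):
    """Flatten space-separated id strings into a set of single ids."""
    return {single_id for raw in raw_values for single_id in raw.split(" ") if single_id}


def unique_and_overlap(ssh_raw, not_ssh_raw):
    """Return (unique SSH ids, unique non-SSH ids, ids present in both groups)."""
    ssh = split_ids(ssh_raw)
    not_ssh = split_ids(not_ssh_raw)
    return ssh, not_ssh, ssh & not_ssh
```

An empty third element means the two groups are disjoint; a non-empty one lists the ids that would need a closer look.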

Additionally, I have developed this snippet to answer both Q1 and Q3. It is still fairly slow and needs to be optimized (and to run on parallel cores); I will do so if the answers to the two research questions are correct and if, together with the other members of the group, we decide to use this approach.

In [None]:
import os
import pandas as pd
from tqdm import tqdm
import json

coci_dir_path = "/Volumes/Extreme SSD/OS_data/Processed_data/smaller_COCI/"

# Scan directory
coci_filenames = []
with os.scandir(coci_dir_path) as entries:
    for entry in tqdm(entries, desc='Iterating filenames...', colour='blue', smoothing=0.1, total=len(os.listdir(coci_dir_path))):
        if entry.is_file() and not entry.name.startswith("._"):
            coci_filenames.append(entry.name)

print('Reading erih-meta...')
with open("erih_meta_papers_unique.json", "r") as f:
    erih_meta_papers_unique = json.load(f)

print('Building sets...')
ssh_set = set(erih_meta_papers_unique['ssh_papers'])
not_ssh_set = set(erih_meta_papers_unique['not_ssh_papers'])

ssh_citations = 0
not_ssh_citations = 0

def count_citations(row):
    """
    Classify one COCI row as 'ssh', 'not_ssh' or 'other'.
    It is meant to be passed to the apply method of pandas (axis=1).
    """
    # The row contains an SSH citation? -> This is with an OR
    if row['citing'] in ssh_set or row['cited'] in ssh_set:
        return 'ssh'
    # The row contains a non-SSH citation? -> This is with an AND
    elif row['citing'] in not_ssh_set and row['cited'] in not_ssh_set:
        return 'not_ssh'
    # If not inside
    else:
        return 'other'

for filename in tqdm(coci_filenames, desc='Iterating files...', colour='green', smoothing=0.1):
    df = pd.read_csv(os.path.join(coci_dir_path, filename))

    # Apply count_citations to each row of the DF
    citation_counts = df.apply(count_citations, axis=1).value_counts()
    # Increment the SSH citation count and non-SSH citation count
    ssh_citations += citation_counts.get('ssh', 0)
    not_ssh_citations += citation_counts.get('not_ssh', 0)

print(ssh_citations)
print(not_ssh_citations)   

print("Done...")

The following are the results I got from it:
- SSH Count in META: 225370804
- NOT_SSH Count in META: 985223927

We have seen that the numbers are the same, thus both methods work. I am currently waiting for the other members of the group to run their method on different computer architectures; then we will see which one is faster and which one has more room for optimization.
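
One possible optimization of my approach, sketched here under the assumption that each COCI file has `citing` and `cited` columns as in the snippet above, is to replace the row-wise `apply` with boolean masks built through `Series.isin`. The function name `count_citations_vectorised` is hypothetical; the logic mirrors the OR condition for SSH and the AND condition for non-SSH citations.

```python
import pandas as pd


def count_citations_vectorised(df, ssh_set, not_ssh_set):
    """Return (ssh_count, not_ssh_count) for one COCI chunk."""
    # SSH citation: at least one of the two endpoints is an SSH publication (OR)
    is_ssh = df["citing"].isin(ssh_set) | df["cited"].isin(ssh_set)
    # non-SSH citation: both endpoints are non-SSH publications (AND),
    # excluding rows already classified as SSH
    is_not_ssh = ~is_ssh & df["citing"].isin(not_ssh_set) & df["cited"].isin(not_ssh_set)
    return int(is_ssh.sum()), int(is_not_ssh.sum())
```

Inside the loop over the files, the two returned values would simply be added to `ssh_citations` and `not_ssh_citations`. The actual speed-up has not been measured on the real data.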

---------

### <span style="color:#9FC131">Week 9</span>
##### <i>15/05 - 19/05</i>

During this week we have double-checked our results, and we have joined all the methods inside a single class. I have extended and tried to optimize (by merging together the various methods and functions) the code written by the other members of the group, as well as my code. The final result is the following:

In [None]:
import os
from os.path import exists
from datetime import datetime
import pandas as pd
from tqdm import tqdm
import csv
import multiprocessing
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from lib.csv_manager_erih_meta_disciplines import CSVManager

class Counter(object):
    _entity_columns_to_use_erih_meta_disciplines = ['id', 'erih_disciplines']
    _entity_columns_to_use_erih_meta_without_disciplines = ['id']
    _entity_columns_to_use_q1_q3 = ['citing', 'cited']
    _entity_columns_to_use_q2 = ['id', 'citing', 'cited', 'disciplines']

    def __init__(self, coci_preprocessed_path, erih_meta_path):
        self._list_coci_files = self.get_all_files(coci_preprocessed_path, '.csv')
        self._list_erih_meta_files = self.get_all_files(erih_meta_path, '.csv')
        self.num_cpu = multiprocessing.cpu_count() - 1

    def get_all_files(self, i_dir_or_compr, req_type):
        '''It returns a list containing all the files found in the input folder and with the extension required, like ".csv".'''
        result = []
        if os.path.isdir(i_dir_or_compr):
            for cur_dir, cur_subdir, cur_files in os.walk(i_dir_or_compr):
                for cur_file in cur_files:
                    if cur_file.endswith(req_type) and not os.path.basename(cur_file).startswith("."):
                        result.append(os.path.join(cur_dir, cur_file))
        return result

    def splitted_to_file(self, cur_n, lines, columns_to_use, output_dir_path):
        '''
        This method is responsible for writing the new csv files, with the columns passed as input.
        It concretely produces output files by creating in the output folder a new file every n lines
        which come from the other methods (like "create_erih_meta_with_disciplines", "create_dataset_SSH", etc.)
        where n is the integer number defined as an input parameter.
        In particular, the method takes in input the current number of lines, a data structure containing
        the lines to write in the output file, the name of the columns of the new csv files, the path of the directory to store the new files.
        '''
        if int(cur_n) != 0 and int(cur_n) % int(self._interval) == 0:
            filename = "count_" + str(cur_n // self._interval) + '.csv'
            if os.path.exists(os.path.join(output_dir_path, filename)):
                cur_datetime = datetime.now()
                dt_string = cur_datetime.strftime("%d%m%Y_%H%M%S")
                filename = filename[:-len('.csv')] + "_" + dt_string + '.csv'
            with open(os.path.join(output_dir_path, filename), "w", encoding="utf8", newline="") as f_out:
                dict_writer = csv.DictWriter(f_out, delimiter=",", quoting=csv.QUOTE_ALL, escapechar="\\",
                                             fieldnames=columns_to_use)
                dict_writer.writeheader()
                dict_writer.writerows(lines)
            lines = []
            return lines
        else:
            return lines

    #def create_erih_meta_with_disciplines(self):
    #    '''This method, starting from the "ERIH_META" dataset creates a subset of it, containing just the ids with at least a discipline associated.
    #    It has two columns: 'id' and 'erih_disciplines' '''
    #    output_erih_meta_disciplines = os.path.join(self._output_dir + 'erih_meta_with_disciplines')
    #    if not exists(output_erih_meta_disciplines):
    #        os.makedirs(output_erih_meta_disciplines)
    #    data = []
    #    count = 0
    #    for file_idx, file in enumerate(tqdm(self._list_erih_meta_files), 1):
    #        chunksize = 10000
    #        with pd.read_csv(file, usecols=['id', 'erih_disciplines'], chunksize=chunksize, sep=",") as reader:
    #            for chunk in reader:
    #                chunk.fillna("", inplace=True)
    #                df_dict_list = chunk.to_dict("records")
    #                for line in df_dict_list:
    #                    discipline = line.get('erih_disciplines')
    #                    if discipline:
    #                        data.append(line)
    #                        count += 1
    #                        if int(count) != 0 and int(count) % int(self._interval) == 0:
    #                            data = self.splitted_to_file(count, data, self._entity_columns_to_use_erih_meta_disciplines, output_erih_meta_disciplines)
    #    if len(data) > 0:
    #        count = count + (self._interval - (int(count) % int(self._interval)))
    #        self.splitted_to_file(count, data, self._entity_columns_to_use_erih_meta_disciplines, output_erih_meta_disciplines)

    #def create_erih_meta_without_disciplines(self):
    #    '''This method, starting from the "ERIH_META" dataset creates a subset of it, containing just the ids without a discipline associated.
    #    It has just one column: 'id' '''
    #    output_erih_meta_without_disciplines = os.path.join(self._output_dir + 'erih_meta_without_disciplines')
    #    if not exists(output_erih_meta_without_disciplines):
    #        os.makedirs(output_erih_meta_without_disciplines)
    #    data = []
    #    count = 0
    #    for file_idx, file in enumerate(tqdm(self._list_erih_meta_files), 1):
    #        chunksize = 10000
    #        with pd.read_csv(file, usecols=['id', 'erih_disciplines'], chunksize=chunksize, sep=",") as reader:
    #            for chunk in reader:
    #                chunk.fillna("", inplace=True)
    #                df_dict_list = chunk.to_dict("records")
    #                for line in df_dict_list:
    #                    new_line = dict()
    #                    discipline = line.get('erih_disciplines')
    #                    if not discipline:
    #                        new_line['id'] = line.get('id')
    #                        data.append(new_line)
    #                        count += 1
    #                        if int(count) != 0 and int(count) % int(self._interval) == 0:
    #                            data = self.splitted_to_file(count, data, self._entity_columns_to_use_erih_meta_without_disciplines, output_erih_meta_without_disciplines)
    #    if len(data) > 0:
    #        count = count + (self._interval - (int(count) % int(self._interval)))
    #        self.splitted_to_file(count, data, self._entity_columns_to_use_erih_meta_without_disciplines, output_erih_meta_without_disciplines)

    def create_additional_files(self, with_disciplines):
        process_output_dir = self._path_erih_meta_with_disciplines if with_disciplines else self._path_erih_meta_without_disciplines
        entity_columns_to_use = self._entity_columns_to_use_erih_meta_disciplines if with_disciplines else self._entity_columns_to_use_erih_meta_without_disciplines
        
        if not exists(process_output_dir):
            os.makedirs(process_output_dir)

        data = []
        count = 0

        if with_disciplines:
            for _, file in enumerate(tqdm(self._list_erih_meta_files), 1):
                chunksize = 10000
                with pd.read_csv(file, usecols=['id', 'erih_disciplines'], chunksize=chunksize, sep=",") as reader:
                    for chunk in reader:
                        chunk.fillna("", inplace=True)
                        df_dict_list = chunk.to_dict("records")
                        for line in df_dict_list:
                            discipline = line.get('erih_disciplines')
                            if discipline:
                                data.append(line)
                                count += 1
                                if int(count) != 0 and int(count) % int(self._interval) == 0:
                                    data = self.splitted_to_file(count, data, entity_columns_to_use, process_output_dir)
        else:
            for _, file in enumerate(tqdm(self._list_erih_meta_files), 1):
                chunksize = 10000
                with pd.read_csv(file, usecols=['id', 'erih_disciplines'], chunksize=chunksize, sep=",") as reader:
                    for chunk in reader:
                        chunk.fillna("", inplace=True)
                        df_dict_list = chunk.to_dict("records")
                        for line in df_dict_list:
                            discipline = line.get('erih_disciplines')
                            new_line = dict()
                            if not discipline:
                                new_line['id'] = line.get('id')
                                data.append(new_line)
                                count += 1
                                if int(count) != 0 and int(count) % int(self._interval) == 0:
                                    data = self.splitted_to_file(count, data, entity_columns_to_use, process_output_dir)
            
        if len(data) > 0:
            count = count + (self._interval - (int(count) % int(self._interval)))
            self.splitted_to_file(count, data, entity_columns_to_use, process_output_dir)    

#####################

    #def create_dataset_SSH(self):
    #    '''This method creates, starting from "COCI_preprocessed", the dataset that we use for answering to the first research question.
    #    It has two columns 'citing' and 'cited', and contains just the DOIs that belongs to SSH journals.'''
    #    output_q1 = os.path.join(self._output_dir + 'dataset_SSH')
    #    if not exists(output_q1):
    #        os.makedirs(output_q1)
    #    self._CSVManager_erih_meta_with_disciplines = CSVManager(self._path_erih_meta_with_disciplines)
    #    data = []
    #    count = 0
    #    for file_idx, file in enumerate(tqdm(self._list_coci_files), 1):
    #        chunksize = 10000
    #        with pd.read_csv(file, chunksize=chunksize, sep=",") as reader:
    #            for chunk in reader:
    #                chunk.fillna("", inplace=True)
    #                df_dict_list = chunk.to_dict("records")
    #                for line in df_dict_list:
    #                    citing = line.get('citing')
    #                    cited = line.get('cited')
    #                    if self._CSVManager_erih_meta_with_disciplines.get_value(citing) or self._CSVManager_erih_meta_with_disciplines.get_value(cited):
    #                        count += 1
    #                        if self._CSVManager_erih_meta_with_disciplines.get_value(citing) and self._CSVManager_erih_meta_with_disciplines.get_value(cited):
    #                            data.append(line)
    #                        elif self._CSVManager_erih_meta_with_disciplines.get_value(citing):
    #                            entity_dict1 = dict()
    #                            entity_dict1['citing'] = citing
    #                            entity_dict1['cited'] = ""
    #                            data.append(entity_dict1)
    #                        elif self._CSVManager_erih_meta_with_disciplines.get_value(cited):
    #                            entity_dict2 = dict()
    #                            entity_dict2['cited'] = cited
    #                            entity_dict2['citing'] = ""
    #                            data.append(entity_dict2)
#
    #                        if int(count) != 0 and int(count) % int(self._interval) == 0:
    #                            data = self.splitted_to_file(count, data, self._entity_columns_to_use_q1_q3, output_q1)
#
    #    if len(data) > 0:
    #        count = count + (self._interval - (int(count) % int(self._interval)))
    #        self.splitted_to_file(count, data, self._entity_columns_to_use_q1_q3, output_q1)
#
    #def create_dataset_no_SSH(self):
    #    '''This method creates, starting from "COCI_preprocessed", the dataset that we use for answering to the third research question.
    #    It has two columns 'citing' and 'cited', and contains just the DOIs that don't belong to SSH journals.'''
    #    output_q3 = os.path.join(self._output_dir + 'dataset_no_SSH')
    #    if not exists(output_q3):
    #        os.makedirs(output_q3)
    #    self._set_erih_meta_without_disciplines = CSVManager.load_csv_column_as_set(self._path_erih_meta_without_disciplines, 'id')
    #    data = []
    #    count = 0
    #    for file_idx, file in enumerate(tqdm(self._list_coci_files), 1):
    #        chunksize = 10000
    #        with pd.read_csv(file, chunksize=chunksize, sep=",") as reader:
    #            for chunk in reader:
    #                chunk.fillna("", inplace=True)
    #                df_dict_list = chunk.to_dict("records")
    #                for line in df_dict_list:
    #                    citing = line.get('citing')
    #                    cited = line.get('cited')
    #                    if citing in self._set_erih_meta_without_disciplines or cited in self._set_erih_meta_without_disciplines:
    #                        count += 1
    #                        if citing in self._set_erih_meta_without_disciplines and cited in self._set_erih_meta_without_disciplines:
    #                            data.append(line)
    #                        elif citing in self._set_erih_meta_without_disciplines:
    #                            entity_dict1 = dict()
    #                            entity_dict1['citing'] = citing
    #                            entity_dict1['cited'] = ""
    #                            data.append(entity_dict1)
    #                        elif cited in self._set_erih_meta_without_disciplines:
    #                            entity_dict2 = dict()
    #                            entity_dict2['cited'] = cited
    #                            entity_dict2['citing'] = ""
    #                            data.append(entity_dict2)
    #                        if int(count) != 0 and int(count) % int(self._interval) == 0:
    #                            data = self.splitted_to_file(count, data, self._entity_columns_to_use_q1_q3, output_q3)
    #    if len(data) > 0:
    #        count = count + (self._interval - (int(count) % int(self._interval)))
    #        self.splitted_to_file(count, data, self._entity_columns_to_use_q1_q3, output_q3)

    def create_datasets_for_count(self, is_SSH=True):
        output_process_dir = self._path_dataset_SSH if is_SSH else self._path_dataset_no_SSH
        
        if not exists(output_process_dir):
            os.makedirs(output_process_dir)

        if is_SSH:
            self._CSVManager_erih_meta_with_disciplines = CSVManager(self._path_erih_meta_with_disciplines)
        else:
            self._set_erih_meta_without_disciplines = CSVManager.load_csv_column_as_set(self._path_erih_meta_without_disciplines, 'id')

        data = []
        count = 0
        for _, file in enumerate(tqdm(self._list_coci_files), 1):
            chunksize = 10000
            with pd.read_csv(file, chunksize=chunksize, sep=",") as reader:
                for chunk in reader:
                    chunk.fillna("", inplace=True)
                    df_dict_list = chunk.to_dict("records")
                    for line in df_dict_list:
                        citing = line.get('citing')
                        cited = line.get('cited')

                        if is_SSH:
                            condition = self._CSVManager_erih_meta_with_disciplines.get_value(citing) or self._CSVManager_erih_meta_with_disciplines.get_value(cited)
                        else:
                            condition = citing in self._set_erih_meta_without_disciplines and cited in self._set_erih_meta_without_disciplines

                        if condition:
                            count += 1
                            if is_SSH:
                                entity_condition1 = self._CSVManager_erih_meta_with_disciplines.get_value(citing)
                                entity_condition2 = self._CSVManager_erih_meta_with_disciplines.get_value(cited)
                            else:
                                entity_condition1 = citing in self._set_erih_meta_without_disciplines
                                entity_condition2 = cited in self._set_erih_meta_without_disciplines

                            if entity_condition1 and entity_condition2:
                                data.append(line)
                            elif entity_condition1:
                                entity_dict1 = {'citing': citing, 'cited': ""}
                                data.append(entity_dict1)
                            elif entity_condition2:
                                entity_dict2 = {'cited': cited, 'citing': ""}
                                data.append(entity_dict2)

                            if int(count) != 0 and int(count) % int(self._interval) == 0:
                                data = self.splitted_to_file(count, data, self._entity_columns_to_use_q1_q3, output_process_dir)

        if len(data) > 0:
            count = count + (self._interval - (int(count) % int(self._interval)))
            self.splitted_to_file(count, data, self._entity_columns_to_use_q1_q3, output_process_dir)

    def count_lines(self, path):
        '''This method simply counts and sums the lines of csv files contained in the folder, the path of which is passed as input'''
        citations_count = 0
        for file in tqdm(self.get_all_files(path, '.csv')):
            results = pd.read_csv(file, sep=",")
            citations_count += len(results)
        return citations_count

    def execute_count(self, output_dir='OutputFiles/', create_subfiles=False, interval=10000):
        if create_subfiles:
            self._interval = interval
            self._output_dir = output_dir
            if not exists(self._output_dir):
                os.makedirs(self._output_dir)
            self._path_erih_meta_with_disciplines = os.path.join(output_dir, 'erih_meta_with_disciplines')
            self._path_dataset_SSH = os.path.join(output_dir, 'dataset_SSH')
            self._path_erih_meta_without_disciplines = os.path.join(output_dir, 'erih_meta_without_disciplines')
            self._path_dataset_no_SSH = os.path.join(output_dir, 'dataset_no_SSH')

        '''question 1
        calls create_erih_meta_with_disciplines -> creates a dataset with the columns "id" and "erih_disciplines", containing only the SSH DOIs with their associated disciplines
        calls create_dataset_SSH -> creates the dataset we use to answer Q1, i.e. a two-column dataset ("citing" and "cited") filled only with SSH DOIs
        calls count_lines -> takes "self._path_dataset_SSH" as input; counts the rows of the dataset created with "create_dataset_SSH" and gives the answer to Q1
        '''
        '''question 3
        calls create_erih_meta_without_disciplines -> creates a single-column dataset ("id") containing only the DOIs with no associated discipline
        calls create_dataset_no_SSH -> creates the dataset we use to answer Q3, i.e. a two-column dataset ("citing" and "cited") filled only with non-SSH DOIs
        calls count_lines -> takes "self._path_dataset_no_SSH" as input; counts the rows of the dataset created with "create_dataset_no_SSH" and gives the answer to Q3
        '''

        if create_subfiles:
            # Answer to question 1
            self.create_additional_files(with_disciplines=True)
            self.create_datasets_for_count(is_SSH=True)
            ssh_citations = self.count_lines(self._path_dataset_SSH)
            print('Number of citations that (according to COCI) involve, either as citing or cited entities, publications in SSH journals (according to ERIH-PLUS) included in OpenCitations Meta: %d' %ssh_citations)

            # Answer to question 3
            self.create_additional_files(with_disciplines=False)
            self.create_datasets_for_count(is_SSH=False)
            not_ssh_citations = self.count_lines(self._path_dataset_no_SSH)
            print('Number of citations that (according to COCI) start from and go to publications in OpenCitations Meta that are not included in SSH journals: %d' %not_ssh_citations)
        else:
            print('\nStarting the process, be patient, it will take a while...\n')
            ssh_papers = list()
            not_ssh_papers = list()

            for filename in tqdm(self._list_erih_meta_files ,total=len(self._list_erih_meta_files), desc='Building lists of DOIs over ERIH-PLUS and META...', colour='yellow', smoothing=0.1):
                df = pd.read_csv(filename) # it was -> os.path.join(erih_meta_dir_path, filename))
                df = df[['id', 'erih_disciplines']] # Attention to the name given to the ERIH_disciplines column, if erih or ERIH
                # fill all the possible NaN or None with ""
                df = df.fillna('')
                # create boolean mask for erih_disciplines column
                mask = df['erih_disciplines'] != ''

                # filter the dataframe with the above mask
                ssh_df = df[mask]
                ssh_df = ssh_df.reset_index(drop=True)

                # Create a second dataframe with the rows where the mask is False (no discipline associated)
                not_ssh_df = df[~mask]
                not_ssh_df = not_ssh_df.reset_index(drop=True)

                # Get the unique values of the id column
                unique_ssh = ssh_df['id'].unique().tolist()
                unique_not_ssh = not_ssh_df['id'].unique().tolist()

                # Append the unique values to the list
                ssh_papers.extend(unique_ssh)
                not_ssh_papers.extend(unique_not_ssh)

            print('Decoupling DOIs from lists...')
            ssh_papers_unique = []
            for paper in ssh_papers:
                papers = paper.split(' ')
                ssh_papers_unique.extend(papers)

            not_ssh_papers_unique = []
            for paper in not_ssh_papers:
                papers = paper.split(' ')
                not_ssh_papers_unique.extend(papers)

            print('Creating sets for unique DOIs...')
            ssh_set = set(ssh_papers_unique)
            not_ssh_set = set(not_ssh_papers_unique)

            ssh_citations = 0
            not_ssh_citations = 0

            def count_citations(ssh_set, not_ssh_set, row):
                if row['citing'] in ssh_set or row['cited'] in ssh_set:
                    return 'ssh'
                elif row['citing'] in not_ssh_set and row['cited'] in not_ssh_set:
                    return 'not_ssh'
                else:
                    return 'other'
            
            def count_citations_in_file(ssh_set, not_ssh_set, filepath):
                df = pd.read_csv(filepath, usecols=['citing', 'cited'])
                citation_counts = df.apply(lambda row: count_citations(ssh_set, not_ssh_set, row), axis=1).value_counts()
                return citation_counts.get('ssh', 0), citation_counts.get('not_ssh', 0)

            print('Starting to count...\n')
            with ThreadPoolExecutor(max_workers=self.num_cpu) as executor:
                count_citations_partial = partial(count_citations_in_file, ssh_set, not_ssh_set)
                results = list(tqdm(executor.map(count_citations_partial, self._list_coci_files), total=len(self._list_coci_files), desc='Iterating files...', colour='green', smoothing=0.1))

            print('Updating results...')
            for ssh_count, not_ssh_count in results:
                ssh_citations += ssh_count
                not_ssh_citations += not_ssh_count

            print('Number of citations that (according to COCI) involve, either as citing or cited entities, publications in SSH journals (according to ERIH-PLUS) included in OpenCitations Meta: %d' %ssh_citations)
            print('Number of citations that (according to COCI) start from and go to publications in OpenCitations Meta that are not included in SSH journals: %d' %not_ssh_citations)
        
        print('Done...')
        return ssh_citations, not_ssh_citations


"""
c = Counter("/Volumes/Extreme SSD/OS_data/Processed_data/smaller_COCI/", "/Volumes/Extreme SSD/OS_data/Processed_data/ERIH_META_prep/")
count = c.execute_count()
print(count)


c = Counter(processed_coci_path, erih_meta_path)
count = c.execute_count(output_dir=DOVE_PREFERITE, create_subfiles=True, interval=10000)
print(count)
"""



-------

### <span style="color:#9FC131">Week 10</span>
##### <i>22/05 - 26/05</i>

During this week we started discussing the best visualizations for the workshop and checked what is left to be done.</br>
I have devised a different way to answer the second research question and added it to the previous code.

Right now the code is still running, so once it finishes we will be able to compare the two results.
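
As a rough illustration of the idea behind this second approach, here is a minimal sketch that counts, for each discipline, how many citations it appears in on the citing side and on the cited side. The DOIs, the discipline assignments and the helper name `count_by_discipline` are invented for the example; the actual code works on the COCI and ERIH-META files and may differ in its details.

```python
from collections import Counter


def count_by_discipline(citations, id_disciplines_map):
    """Count, per discipline, how often it appears on the citing and on the cited side."""
    citing_counts = Counter()
    cited_counts = Counter()
    for citing, cited in citations:
        # A DOI missing from the map has no SSH discipline and contributes nothing.
        # set() avoids counting a discipline twice for the same publication.
        citing_counts.update(set(id_disciplines_map.get(citing, [])))
        cited_counts.update(set(id_disciplines_map.get(cited, [])))
    return citing_counts, cited_counts


# Hypothetical toy data: DOIs and disciplines are made up for illustration.
id_disciplines_map = {
    "10.1/a": ["History", "Philosophy"],
    "10.1/b": ["Psychology"],
}
citations = [
    ("10.1/a", "10.1/b"),
    ("10.1/b", "10.1/a"),
    ("10.1/c", "10.1/b"),  # the citing DOI has no SSH discipline
]

citing_counts, cited_counts = count_by_discipline(citations, id_disciplines_map)
print("Most cited discipline:", cited_counts.most_common(1))
```

Using a `Counter` per side keeps the "citing" and "cited" totals separate, so the discipline that cites the most and the one that is cited the most can be read off independently.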

Below is the temporary code, to be added to the new counter class that the other members of the group have developed to address some issues discovered during testing.

In [None]:
import os
from os.path import exists
from datetime import datetime
import pandas as pd
from tqdm import tqdm
import csv
import multiprocessing
from concurrent.futures import ThreadPoolExecutor
from functools import partial
import time
from lib.csv_manager_erih_meta_disciplines import CSVManager

class Counter(object):
    _entity_columns_to_use_erih_meta_disciplines = ['id', 'erih_disciplines']
    _entity_columns_to_use_erih_meta_without_disciplines = ['id']
    _entity_columns_to_use_q1_q3 = ['citing', 'cited']
    _entity_columns_to_use_q2 = ['id', 'citing', 'cited', 'disciplines']

    def __init__(self, coci_preprocessed_path, erih_meta_path, num_cpus=None):
        self._list_coci_files = self.get_all_files(coci_preprocessed_path, '.csv')
        self._list_erih_meta_files = self.get_all_files(erih_meta_path, '.csv')
        # Always keep at least one worker: cpu_count() - 1 is 0 on a single-core machine
        self.num_cpu = num_cpus if num_cpus is not None else max(1, multiprocessing.cpu_count() - 1)

    def get_all_files(self, i_dir_or_compr, req_type):
        '''It returns a list containing all the files found in the input folder and with the extension required, like ".csv".'''
        result = []
        if os.path.isdir(i_dir_or_compr):
            for cur_dir, cur_subdir, cur_files in os.walk(i_dir_or_compr):
                for cur_file in cur_files:
                    if cur_file.endswith(req_type) and not os.path.basename(cur_file).startswith("."):
                        result.append(os.path.join(cur_dir, cur_file))
        return result

    def splitted_to_file(self, cur_n, lines, columns_to_use, output_dir_path):
        '''
        This method is responsible for writing the new csv files, with the columns passed as input.
        It concretely produces output files by creating in the output folder a new file every n lines
        which come from the other methods (like "create_erih_meta_with_disciplines", "create_dataset_SSH", etc.)
        where n is the integer number defined as an input parameter.
        In particular, the method takes in input the current number of lines, a data structure containing
        the lines to write in the output file, the name of the columns of the new csv files, the path of the directory to store the new files.
        '''
        if int(cur_n) != 0 and int(cur_n) % int(self._interval) == 0:
            filename = "count_" + str(cur_n // self._interval) + '.csv'

            if os.path.exists(os.path.join(output_dir_path, filename)):
                cur_datetime = datetime.now()
                dt_string = cur_datetime.strftime("%d%m%Y_%H%M%S")
                filename = filename[:-len('.csv')] + "_" + dt_string + '.csv'

            with open(os.path.join(output_dir_path, filename), "w", encoding="utf8", newline="") as f_out:
                dict_writer = csv.DictWriter(f_out, delimiter=",", quoting=csv.QUOTE_ALL, escapechar="\\",
                                             fieldnames=columns_to_use)
                dict_writer.writeheader()
                dict_writer.writerows(lines)

            lines = []
            return lines
        else:
            return lines

    def create_additional_files(self, with_disciplines):
        process_output_dir = self._path_erih_meta_with_disciplines if with_disciplines else self._path_erih_meta_without_disciplines
        entity_columns_to_use = self._entity_columns_to_use_erih_meta_disciplines if with_disciplines else self._entity_columns_to_use_erih_meta_without_disciplines
        
        if not exists(process_output_dir):
            os.makedirs(process_output_dir)

        data = []
        count = 0

        if with_disciplines:
            for file in tqdm(self._list_erih_meta_files, desc='Processing ERIH-META files to extract SSH DOIs...', total=len(self._list_erih_meta_files), colour='yellow', smoothing=0.1):
                chunksize = 10000
                with pd.read_csv(file, usecols=['id', 'erih_disciplines'], chunksize=chunksize, sep=",") as reader:
                    for chunk in reader:
                        chunk.fillna("", inplace=True)
                        df_dict_list = chunk.to_dict("records")
                        for line in df_dict_list:
                            discipline = line.get('erih_disciplines')
                            if discipline:
                                data.append(line)
                                count += 1
                                if int(count) != 0 and int(count) % int(self._interval) == 0:
                                    data = self.splitted_to_file(count, data, entity_columns_to_use, process_output_dir)
        else:
            for file in tqdm(self._list_erih_meta_files, desc='Processing ERIH-META files to extract non-SSH DOIs...', total=len(self._list_erih_meta_files), colour='yellow', smoothing=0.1):
                chunksize = 10000
                with pd.read_csv(file, usecols=['id', 'erih_disciplines'], chunksize=chunksize, sep=",") as reader:
                    for chunk in reader:
                        chunk.fillna("", inplace=True)
                        df_dict_list = chunk.to_dict("records")
                        for line in df_dict_list:
                            discipline = line.get('erih_disciplines')
                            new_line = dict()
                            if not discipline:
                                new_line['id'] = line.get('id')
                                data.append(new_line)
                                count += 1
                                if int(count) != 0 and int(count) % int(self._interval) == 0:
                                    data = self.splitted_to_file(count, data, entity_columns_to_use, process_output_dir)
            
        if len(data) > 0:
            count = count + (self._interval - (int(count) % int(self._interval)))
            self.splitted_to_file(count, data, entity_columns_to_use, process_output_dir)    

    def create_datasets_for_count(self, is_SSH=True):
        output_process_dir = self._path_dataset_SSH if is_SSH else self._path_dataset_no_SSH
        load_message = 'SSH' if is_SSH else 'non-SSH'
        
        if not exists(output_process_dir):
            os.makedirs(output_process_dir)

        if is_SSH:
            self._CSVManager_erih_meta_with_disciplines = CSVManager(self._path_erih_meta_with_disciplines)
        else:
            self._set_erih_meta_without_disciplines = CSVManager.load_csv_column_as_set(self._path_erih_meta_without_disciplines, 'id')

        data = []
        count = 0
        for file in tqdm(self._list_coci_files, desc=f'Processing COCI files to build {load_message} files for counting...', total=len(self._list_coci_files), colour='cyan', smoothing=0.1):
            chunksize = 10000
            with pd.read_csv(file, chunksize=chunksize, sep=",") as reader:
                for chunk in reader:
                    chunk.fillna("", inplace=True)
                    df_dict_list = chunk.to_dict("records")
                    for line in df_dict_list:
                        citing = line.get('citing')
                        cited = line.get('cited')

                        if is_SSH:
                            condition = self._CSVManager_erih_meta_with_disciplines.get_value(citing) or self._CSVManager_erih_meta_with_disciplines.get_value(cited)
                        else:
                            condition = citing in self._set_erih_meta_without_disciplines and cited in self._set_erih_meta_without_disciplines

                        if condition:
                            count += 1
                            if is_SSH:
                                entity_condition1 = self._CSVManager_erih_meta_with_disciplines.get_value(citing)
                                entity_condition2 = self._CSVManager_erih_meta_with_disciplines.get_value(cited)
                            else:
                                entity_condition1 = citing in self._set_erih_meta_without_disciplines
                                entity_condition2 = cited in self._set_erih_meta_without_disciplines

                            if entity_condition1 and entity_condition2:
                                data.append(line)
                            elif entity_condition1:
                                entity_dict1 = {'citing': citing, 'cited': ""}
                                data.append(entity_dict1)
                            elif entity_condition2:
                                entity_dict2 = {'cited': cited, 'citing': ""}
                                data.append(entity_dict2)

                            if int(count) != 0 and int(count) % int(self._interval) == 0:
                                data = self.splitted_to_file(count, data, self._entity_columns_to_use_q1_q3, output_process_dir)

        if len(data) > 0:
            count = count + (self._interval - (int(count) % int(self._interval)))
            self.splitted_to_file(count, data, self._entity_columns_to_use_q1_q3, output_process_dir)

    def count_lines(self, path):
        '''This method simply counts and sums the lines of csv files contained in the folder, the path of which is passed as input'''
        citations_count = 0
        files_list = self.get_all_files(path, '.csv')
        for file in tqdm(files_list, total=len(files_list), desc='Counting on files...', colour='red', smoothing=0.1):
            results = pd.read_csv(file, sep=",")
            citations_count += len(results)
        return citations_count
    
    def iterate_erih_meta(self):
        ssh_papers = list()
        not_ssh_papers = list()
        id_disciplines_map = dict()
        ssh_disciplines = set()

        for filename in tqdm(self._list_erih_meta_files, total=len(self._list_erih_meta_files), desc='Building lists of DOIs over ERIH-PLUS and META...', colour='yellow', smoothing=0.1):
            df = pd.read_csv(filename) # it was -> os.path.join(erih_meta_dir_path, filename))
            df = df[['id', 'erih_disciplines']] # Attention to the name given to the erih_disciplines column, if erih or ERIH
            # fill all the possible NaN or None with ""
            df = df.fillna('')
            # create boolean mask for erih_disciplines column
            mask = df['erih_disciplines'] != ''
            # filter the dataframe with the above mask
            ssh_df = df[mask]
            ssh_df = ssh_df.reset_index(drop=True)

            for _, row in ssh_df.iterrows():
                disciplines = row['erih_disciplines'].split(',')
                disciplines = [discipline.strip() for discipline in disciplines]
                doi = row['id']
                # Merge the disciplines of a repeated id without duplicates,
                # so that a discipline is not counted twice for the same citation
                known_disciplines = id_disciplines_map.setdefault(doi, [])
                for discipline in disciplines:
                    if discipline not in known_disciplines:
                        known_disciplines.append(discipline)
                ssh_disciplines.update(disciplines)
                    
            # Create a second dataframe with the rows where the mask is False (no discipline associated)
            not_ssh_df = df[~mask]
            not_ssh_df = not_ssh_df.reset_index(drop=True)
            # Get the unique values of the id column
            unique_ssh = ssh_df['id'].unique().tolist()
            unique_not_ssh = not_ssh_df['id'].unique().tolist()
            # Append the unique values to the list
            ssh_papers.extend(unique_ssh)
            not_ssh_papers.extend(unique_not_ssh)

        print('Decoupling DOIs...')
        ssh_papers_unique = []
        for paper in ssh_papers:
            papers = paper.split(' ')
            ssh_papers_unique.extend(papers)

        not_ssh_papers_unique = []
        for paper in not_ssh_papers:
            papers = paper.split(' ')
            not_ssh_papers_unique.extend(papers)

        unique_id_disciplines_map = dict()
        for key, value in id_disciplines_map.items():
            multiple_keys = key.split(' ')
            for k in multiple_keys:
                unique_id_disciplines_map[k] = value

        print('Creating sets for unique DOIs...')
        ssh_set = set(ssh_papers_unique)
        not_ssh_set = set(not_ssh_papers_unique)

        return ssh_set, not_ssh_set, unique_id_disciplines_map, ssh_disciplines


    def execute_count(self, output_dir='OutputFiles/', create_subfiles=False, interval=10000):
        if create_subfiles:
            self._interval = interval
            self._output_dir = output_dir
            if not exists(self._output_dir):
                os.makedirs(self._output_dir)
            self._path_erih_meta_with_disciplines = os.path.join(output_dir, 'erih_meta_with_disciplines')
            self._path_dataset_SSH = os.path.join(output_dir, 'dataset_SSH')
            self._path_erih_meta_without_disciplines = os.path.join(output_dir, 'erih_meta_without_disciplines')
            self._path_dataset_no_SSH = os.path.join(output_dir, 'dataset_no_SSH')

        if create_subfiles:
            # Answer to question 1
            self.create_additional_files(with_disciplines=True)
            self.create_datasets_for_count(is_SSH=True)
            ssh_citations = self.count_lines(self._path_dataset_SSH)
            print('Number of citations that (according to COCI) involve, either as citing or cited entities, publications in SSH journals (according to ERIH-PLUS) included in OpenCitations Meta: %d' %ssh_citations)

            # Answer to question 3
            self.create_additional_files(with_disciplines=False)
            self.create_datasets_for_count(is_SSH=False)
            not_ssh_citations = self.count_lines(self._path_dataset_no_SSH)
            print('Number of citations that (according to COCI) start from and go to publications in OpenCitations Meta that are not included in SSH journals: %d' %not_ssh_citations)
        else:
            print('\nStarting the process, be patient, it will take a while...\n')
            
            ssh_set, not_ssh_set, id_disciplines_map, ssh_disciplines = self.iterate_erih_meta()

            ssh_citations = 0
            not_ssh_citations = 0
            discipline_counter = {}

            for discipline in ssh_disciplines:
                discipline_counter[discipline] = {'citing': 0, 
                                                  'cited': 0}

            def count_citations(ssh_set, not_ssh_set, row):
                if row['citing'] in ssh_set or row['cited'] in ssh_set:
                    return 'ssh'
                elif row['citing'] in not_ssh_set and row['cited'] in not_ssh_set:
                    return 'not_ssh'
                else:
                    return 'other'
                
            def count_disciplines(id_disciplines_map, discipline_counter, row):
                if row['citing'] in id_disciplines_map:
                    citing_disciplines = id_disciplines_map[row['citing']]
                    for discipline in citing_disciplines:
                        discipline_counter[discipline]['citing'] += 1
                if row['cited'] in id_disciplines_map:
                    cited_disciplines = id_disciplines_map[row['cited']]
                    for discipline in cited_disciplines:
                        discipline_counter[discipline]['cited'] += 1
            
            def count_citations_in_file(ssh_set, not_ssh_set, id_disciplines_map, discipline_counter, filepath):
                df = pd.read_csv(filepath, usecols=['citing', 'cited'])
                #citation_counts = df.apply(lambda row: count_citations(ssh_set, not_ssh_set, row), axis=1).value_counts()
                count_ssh_citations = 0
                count_not_ssh_citations = 0
                # Use a counter local to this file: the shared discipline_counter only provides the discipline names.
                # Updating the shared dictionary from every thread and then summing the returned dictionaries
                # would add the totals to themselves once per file.
                file_discipline_counter = {discipline: {'citing': 0, 'cited': 0} for discipline in discipline_counter}

                for _, row in df.iterrows():
                    citation_type = count_citations(ssh_set, not_ssh_set, row)
                    if citation_type == 'ssh':
                        count_ssh_citations += 1
                    elif citation_type == 'not_ssh':
                        count_not_ssh_citations += 1
                    count_disciplines(id_disciplines_map, file_discipline_counter, row)
                
                return (count_ssh_citations, count_not_ssh_citations, file_discipline_counter)
                #return (citation_counts.get('ssh', 0), citation_counts.get('not_ssh', 0), discipline_counter)

            print('Starting to count...\n')
            # Print the notice up front: a check made right after starting a timer can never exceed 45 seconds
            print('Be aware that the overall speed of the process depends on your machine,')
            print('please be patient, a progress bar will appear soon...')

            with ThreadPoolExecutor(max_workers=self.num_cpu) as executor:
                count_citations_partial = partial(count_citations_in_file, ssh_set, not_ssh_set, id_disciplines_map, discipline_counter)
                results = list(tqdm(executor.map(count_citations_partial, self._list_coci_files), total=len(self._list_coci_files), desc='Iterating files...', colour='green', smoothing=0.1))

            print('Updating results...')
            for ssh_count, not_ssh_count, partial_discipline_counter in results:
                ssh_citations += ssh_count
                not_ssh_citations += not_ssh_count
                
                for discipline, counts in partial_discipline_counter.items():
                    discipline_counter[discipline]['citing'] += counts['citing']
                    discipline_counter[discipline]['cited'] += counts['cited']

            print('Number of citations that (according to COCI) involve, either as citing or cited entities, publications in SSH journals (according to ERIH-PLUS) included in OpenCitations Meta: %d' %ssh_citations)
            print('Number of citations that (according to COCI) start from and go to publications in OpenCitations Meta that are not included in SSH journals: %d' %not_ssh_citations)
            # Report the discipline name together with its count (not the count alone)
            discipline_citing_more = max(discipline_counter, key=lambda x: discipline_counter[x]['citing'])
            print(f"The discipline with the highest 'citing' count is '{discipline_citing_more}' ({discipline_counter[discipline_citing_more]['citing']} citations)")
            discipline_more_cited = max(discipline_counter, key=lambda x: discipline_counter[x]['cited'])
            print(f"The discipline with the highest 'cited' count is '{discipline_more_cited}' ({discipline_counter[discipline_more_cited]['cited']} citations)")

        print('Done...')
        return ssh_citations, not_ssh_citations

"""
c = Counter("/Volumes/Extreme SSD/OS_data/Processed_data/smaller_COCI/", "/Volumes/Extreme SSD/OS_data/Processed_data/ERIH_META_Marta/")
count = c.execute_count()
print(count)

c = Counter("/Volumes/Extreme SSD/OS_data/Processed_data/smaller_COCI/", "/Volumes/Extreme SSD/OS_data/Processed_data/ERIH_META_Marta/")

#files = c.get_all_files("/Volumes/Extreme SSD/OS_data/Processed_data/ERIH_META_Marta/", '.csv')
#print(files)

count = c.execute_count(output_dir='/Volumes/Extreme SSD/OS_data/OutputFiles/', create_subfiles=True, interval=10000)
print(count)
"""