# Maddalena's Notebook
## Open Science 2022/2023

## 23/03/23 

I started exploring the course material and the data sources we will need for our research question.

<b>Research Question</b>
* What is the coverage of publications in Social Science and Humanities (SSH) journals 
(according to ERIH-PLUS) included in OpenCitations Meta? 
* What are the disciplines that have more publications? 
* What are countries providing the largest number of publications and journals? 
* How many of the SSH journals are available in Open Access according to the data in DOAJ?

<b>Data sources</b>
* <a href= "http://opencitations.net/meta">OpenCitations Meta</a>
* <a href= "https://kanalregister.hkdir.no/publiseringskanaler/erihplus/">ERIH-PLUS</a>
* <a href= "https://doaj.org/">DOAJ</a>

One of the first problems to address in our research is how to retrieve the data, whether through a REST API or by downloading a dump. I therefore tried to figure out how a REST API works and then made some requests to the OpenCitations REST API.
Since we will need to access a large amount of data and I currently see no way of doing it manually, I am wondering whether it can be done by accessing the REST API with Python, and I plan to look into it.
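
As a starting point for that exploration, here is a minimal sketch of how records might be requested from Python using only the standard library. The endpoint pattern (`/meta/api/v1/metadata/` followed by prefixed identifiers such as `doi:...`, joined by `__`) is my reading of the OpenCitations Meta API documentation and still needs to be checked; the function names are my own, and the example DOI is the one of an article listed further down in this notebook.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumed base URL of the OpenCitations Meta REST API (to be verified against the docs).
META_API = "https://opencitations.net/meta/api/v1/metadata/"


def build_meta_url(dois):
    """Build one request URL for several DOIs.

    Assumption: identifiers are prefixed with 'doi:' and joined with '__'.
    """
    ids = "__".join("doi:" + quote(doi.strip().lower(), safe="/") for doi in dois)
    return META_API + ids


def fetch_metadata(dois, timeout=30):
    """Send the request and return the decoded JSON response."""
    request = Request(build_meta_url(dois), headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds and prints the URL; call fetch_metadata() to actually query the API.
    print(build_meta_url(["10.1515/opis-2018-0013"]))
```

Batching several DOIs per request should keep the number of calls low, but for the full dataset a dump download is probably still the more realistic option.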


I downloaded the ERIH-PLUS list of approved journals as a dump and tried to explore the dataset:

In [8]:
import pandas as pd

approvedSSH = pd.read_csv("ERIHPLUSapprovedJournals.csv", encoding = "utf-8", sep=";")
print(len(approvedSSH))
approvedSSH.info()

11065
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11065 entries, 0 to 11064
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Journal ID              11065 non-null  int64 
 1   Print ISSN              8741 non-null   object
 2   Online ISSN             9510 non-null   object
 3   Original Title          11065 non-null  object
 4   International Title     11065 non-null  object
 5   Country of Publication  10945 non-null  object
 6   ERIH PLUS Disciplines   11065 non-null  object
 7   OECD Classifications    11065 non-null  object
 8   [Last Updated]          11065 non-null  object
dtypes: int64(1), object(8)
memory usage: 778.1+ KB


Nonetheless, this information is about journals and not about publications. I think we will need a list of the DOIs of the publications in the journals indexed by ERIH-PLUS; I have explored a bit, but I can't seem to find downloadable article-level data. This will be something to address in the first group meeting.
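
Even without article-level data, the journal-level dump can already give a first view by country and by discipline. Below is a minimal sketch on toy rows. It assumes that `ERIH PLUS Disciplines` stores the disciplines of a journal as one comma-separated string (I have not verified the separator); the sample rows and the function names are made up for illustration.

```python
import pandas as pd

# Toy rows mimicking three columns of the ERIH-PLUS dump (values are made up).
sample = pd.DataFrame(
    {
        "Journal ID": [1, 2, 3],
        "Country of Publication": ["Italy", "Norway", "Italy"],
        "ERIH PLUS Disciplines": [
            "History, Philosophy",
            "Linguistics",
            "History",
        ],
    }
)


def journals_per_country(df):
    """Count journals per country; rows with a missing country are ignored."""
    return df["Country of Publication"].value_counts()


def journals_per_discipline(df, sep=","):
    """Count journals per discipline, assuming `sep`-separated discipline lists."""
    disciplines = (
        df["ERIH PLUS Disciplines"]
        .str.split(sep)   # one list of disciplines per journal
        .explode()        # one row per (journal, discipline) pair
        .str.strip()      # drop the spaces left around each name
    )
    return disciplines.value_counts()


print(journals_per_country(sample))
print(journals_per_discipline(sample))
```

The same two functions should work on `approvedSSH` once the separator is confirmed, since they only rely on the column names shown by `info()`.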


## 27/03/23 

I read the Emerald Publishing guidelines on how to write an abstract.

Following the key points that an abstract should feature, I reviewed the sources provided by the teacher on the subject of our research question and did some further research to better focus on the purpose and value of our research.

In particular, since our research question belongs to the domain of Scientometrics and asks about the coverage of SSH journals in OpenCitations and how many of those are open access, I noted down some articles that I skimmed and plan to read in the future:

- <a href="https://www.nature.com/articles/s41599-018-0149-x">The possibility and desirability of replication in the humanities</a>
- <a href="https://scholarlypublications.universiteitleiden.nl/handle/1887/65315">Exploration of reproducibility issues in scientometric research</a>
- <a href="https://direct.mit.edu/qss/article/3/4/953/113634/A-quantitative-and-qualitative-open-citation">A quantitative and qualitative open citation analysis of retracted articles in the humanities</a>
- <a href="https://www.degruyter.com/document/doi/10.1515/opis-2018-0013/html">Scholarly Communication Practices in Humanities and Social Sciences: A Study of Researchers' Attitudes and Awareness of Open Access</a>

Finally, I started reasoning on the first point of the abstract, "Purpose", by formulating some on-point questions to share with the group later today, together with my personal answers, to guide us in writing our abstract.

* What is the problem we need to solve? Why is it relevant? What is the main finding?
<center>The problem to solve is to understand the state of the art of the presence of citation data for Social Science and Humanities journals in the OpenCitations databases. <br> It is relevant because of the centrality of citations in the SSH domain and the possibility of conducting further research on the available citation data. How does Open Access relate to that?<br>
The main findings will be: 1. providing an up-to-date overview of citation data and openness in the SSH domain; 2. creating software that can make these data available and keep them up to date, to support further research on the topic.</center>

<hr>

### Notes and questions for myself
* How many of the Open Access journals adopt an article processing charge (i.e. make people pay a fee to publish open access)? What is that for? How else could the open access system sustain itself without having authors pay to publish?

<b>Notes from <a href="https://www.emeraldgrouppublishing.com/how-to/authoring-editing-reviewing/write-article-abstract">Emerald group publishing guidelines</a></b>

maximum of 250 words

<b>Points to feature</b>
* <b>Purpose</b>: This is where you explain why you undertook this study. Explain the problem that you have solved. Let readers know why you chose to study this topic or problem and its relevance. Let them know what your key argument or main finding is.
* <b>Study design, methodology, approach</b>: Let readers know exactly what you did to reach your results: the tools, methods, protocols or datasets used.
* <b>Findings</b>: What you found during your study, whether it answers the problem you set out to explore, and whether your hypothesis was confirmed. Be clear: give exact figures and not generalizations.
* <b>Originality, value</b>: An analysis of the value of your results. Ask colleagues whether your analysis is balanced and fair. Conjecture what future research steps could be.

<b>Include</b>
* Research limitations/implications
* Practical implications
* Social implications


After the group meeting, where we shared our current understanding of the research to carry out and envisioned possible study designs and technical approaches, I offered to write a coherent first draft of the abstract based on the material and perspectives we shared. This first draft has been pushed to the repository for my teammates to read and rework, so we can agree on a final version to present on Wednesday.


## Abstract first draft
<span style="color: red">currently around 400 words, to cut</span>

* <b>Purpose</b>: In the domain of Social Science and Humanities, citations hold a central role in measuring the reliability and interconnectedness of a publication as well as its long-term impact. Referencing the works that inspired a particular work provides the reader with the possibility of accessing primary sources, hence enhancing scholarly communication and community evaluation of research.
This project aims to assess the state of the art of Social Science and Humanities journals' citation data coverage in the OpenCitations databases and to investigate how many of these journals are also Open Access. The research provides further insight into the available data, evaluating which countries and disciplines provide the largest number of publications and journals.

* <b>Study design and methodology</b>: We use data from multiple sources: OpenCitations Meta, a database that stores and delivers bibliographic metadata about citations; ERIH-PLUS, an academic Social Science and Humanities journal index; and DOAJ, the Directory of Open Access Journals, to retrieve data about open access journals. These data are fetched, filtered, and processed by means of the Python programming language. The code will be published in a GitHub repository, along with instructions on how to use it, to give users full access to reproduce this analysis at any moment in time. Finally, output data will be visualized in an intuitive and user-friendly way.

* <b>Findings</b>: Considering the diverse problems and the reluctance surrounding the opening of scholarly knowledge in the SSH domain, we expect to find an increasing but still insufficient number of fully open access journals. Nonetheless, we expect this deficiency to be partly counterbalanced by a significant amount of open citation metadata for SSH publications, provided by the OpenCitations database. <span style="color: red">to inquire</span>

* <b>Originality</b>: Our study provides valuable insights into the current state of the openness of Social Science and Humanities publications and provides reusable data to support further research about Open Science in the Scientometrics domain.
Furthermore, we believe that increased open access to SSH research can enable broader societal engagement, foster cross-disciplinary collaboration, and promote the dissemination of knowledge to a wider audience.

* <b>Limitations</b>: The research is based on the current state and reliability of publicly available data, which might not encompass the entire landscape of SSH publications. These data will probably change over time and might contain errors we overlook, despite our best efforts to be as precise and meticulous as possible.

Keywords: Open Access, Social Science and Humanities, OpenCitations Meta, ERIH-PLUS, DOAJ, scholarly communication, open science, citation

## 02/04/23 

Today I reviewed the 29/03 lesson about FAIR and open data and summarized and organized my handwritten notes and the slides into a single txt document.
While revising I noted down some questions and doubts to share and discuss with the group during our next meeting, where we will start working on our Data Management Plan.
<hr>
<b>Unclear issues</b>

- What are practical examples of vocabularies that use FAIR principles? (Interoperability principle 2)
- (Meta)data include qualified references to other (meta)data (?)

About the DMP

- What data (besides csv datasets) will our research produce?
- Do we need to add metadata to our data?
- What metadata should software have? Are there standards?
- Are the datasets to describe in our DMP the source datasets or our final processed dataset?

## 04/04/23 

In order to fill in the sections of the DMP I am in charge of, I first read <a href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004525">Ten Simple Rules for Creating a Good Data Management Plan</a> and wrote a short summary, highlighting the parts that may concern us the most.

1. data summary - madda
2. reused data - madda
3. fair data - madda

then fair data ds 2

During the compilation of the Dataset description I stopped at some questions, since I realized I was not entirely sure how to answer them:

<b>1.1.3 What are the formats of the described generated/collected data?</b><br>
I am not sure whether a dataset description should be specific to one kind of data or whether it can include all kinds of data produced during the research. I think our generated data will be disseminated in different ways: a dataset in the classical sense for sure, but probably also a textual description of the analysis and likely some visualizations (maybe a platform? A website? A PDF?).

I looked into some other resources to help me better understand the specifics of different data formats:
* <a href="https://ukdataservice.ac.uk/learning-hub/research-data-management/format-your-data/recommended-formats/">File formats recommended by the UK Data Service</a>
* <a href="http://www.docs.is.ed.ac.uk/docs/data-library/EUDL_RDM_Handbook.pdf">Edinburgh University Data Library Research Data Management Handbook</a>
* <a href="https://dmptool-stg.cdlib.org/general_guidance">DMPTool</a>

In the dropdown menu I chose "Discipline specific formats" since it seemed to me the most suitable to our type of data. All the formats we will use are in common usage by the research community, even though not "discipline specific" in a narrow sense.

<b>2.1.1 Are you re-using the described data and how?</b><br>
Since we are using public structured data, I initially thought this meant we were reusing data. On second thought, since they seem to intend "reused data" as data that are outcomes of other research, this is not true for us: our source data are downloaded from public directories, aggregators and databases.

<b>3.1.1.1 Will you use metadata to describe the data?</b><br>
I am familiar with the DCAT standard, but I thought it could be worth looking into other standards among the ones in the dropdown menu. I had a quick look at <a href="https://www.w3.org/TR/vocab-data-cube/#outline">The RDF data cube vocabulary</a> and at <a href="https://schema.datacite.org/meta/kernel-4.4/doc/DataCite-MetadataKernel_v4.4.pdf">The DataCite metadata schema</a>.
Reading the RDF data cube vocabulary made me think about whether we will also be producing statistical data, in what format they should be published (I think csv is fine, but they could also be visualizations) and with which standard they should be described (DCAT is also fine, but we could look into SDMX).<br>
<b>N.B.</b> DCAT has the class dcat:Resource, which is a general catalogue resource: "It is strongly recommended to use a more specific sub-class. When describing a resource which is not a dcat:Dataset or dcat:DataService, it is recommended to create a suitable sub-class of dcat:Resource, or use dcat:Resource with the dct:type property to indicate the specific type." This would allow us to include non-dataset resources as data in our catalogue too.
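
To make the dcat:Resource option more concrete, here is a minimal sketch of how a small catalogue could be written as JSON-LD from Python, with one dcat:Dataset and one generic dcat:Resource typed through dct:type. The `ex:` identifiers, the titles and the helper function are hypothetical, and using the DCMI Type term `Software` for the code is my own assumption rather than something we have agreed on.

```python
import json

# Namespace prefixes; "ex" is a placeholder namespace for our own identifiers.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "ex": "http://example.org/",
}


def describe_resource(identifier, title, rdf_class, dct_type=None):
    """Return a JSON-LD node for one catalogued resource."""
    node = {"@id": identifier, "@type": rdf_class, "dct:title": title}
    if dct_type is not None:
        # dct:type states what kind of thing a generic dcat:Resource is.
        node["dct:type"] = {"@id": dct_type}
    return node


catalogue = {
    "@context": CONTEXT,
    "@graph": [
        # Hypothetical identifiers and titles, used only for illustration.
        describe_resource("ex:results-csv", "Processed results", "dcat:Dataset"),
        describe_resource(
            "ex:analysis-code",
            "Analysis software",
            "dcat:Resource",
            dct_type="http://purl.org/dc/dcmitype/Software",
        ),
    ],
}

print(json.dumps(catalogue, indent=2))
```

Visualizations or a textual report could be added as further dcat:Resource nodes in the same way, each with its own dct:type.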

<b>3.1.1.12 Will you provide searchable metadata for the described data?</b><br>
How can we make our metadata searchable?

As it turns out, many other answers require further group discussion. So far, considering these questions has led me to note down useful resources for envisioning possible answers.

<hr>

* <a href="https://www.openaire.eu/find-trustworthy-data-repository">data repositories</a>
* <a href="http://www.openarchives.org/OAI/2.0/guidelines.htm">OAI-PMH</a>
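
On the question of searchable metadata, one possible answer suggested by these resources is to deposit our data in a repository that exposes its records through OAI-PMH, so that aggregators can harvest them. The sketch below only shows what a harvesting request looks like. The `verb`, `metadataPrefix` and `set` arguments are defined by the protocol, while the base URL is a placeholder and the function name is my own.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional: restrict harvesting to one set (e.g. a collection).
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Placeholder endpoint: replace with the OAI-PMH base URL of the chosen repository.
print(build_list_records_url("https://repository.example.org/oai"))
```

If the repository we choose supports the protocol, we would not need to implement anything ourselves: harvesters would send requests like this one to the repository.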



