## 1. Retrieve OpenCitations Meta publications and journals registered in the ERIH PLUS index

Starting from the ERIH PLUS dataset of approved Social Sciences and Humanities journals, ERIHPLUSapprovedJournals.csv (downloaded 27/04/2023), we want to retrieve all the publications included in the OpenCitations Meta database (https://opencitations.net/meta) that belong to one of those journals.

### 1.1 

To accomplish this, we intend to download the data dump and process it in chunks (for example by reading each CSV with pandas and a chunksize parameter, by iterating over the files in the folder with the os library, or by reading the zip archive directly with the zipfile library).
Note that the OpenCitations Meta data dump has one row per entity, and an entity is either a publication or a venue. At this stage we do not need the full publication information, so we reduce the dataset to the venue information only.
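As a rough illustration of the chunked approach described above, the sketch below streams the CSV files contained in a zip archive one row at a time and keeps only the rows that carry a venue. It assumes the dump is a zip archive of comma-separated files with a `venue` column; the helper name `iter_venue_rows` and the sample data are hypothetical.

```python
import csv
import io
import zipfile


def iter_venue_rows(archive, venue_column="venue"):
    """Yield rows (as dicts) that have a non-empty venue, one CSV member at a time.

    `archive` is a path or a binary file-like object pointing to a zip archive.
    """
    with zipfile.ZipFile(archive) as zf:
        for member in zf.namelist():
            if not member.endswith(".csv"):
                continue
            with zf.open(member) as raw:
                # Decode lazily so that a whole file is never held in memory
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                for row in csv.DictReader(text):
                    if row.get(venue_column):
                        yield row


# Hypothetical in-memory archive standing in for the real dump
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("0.csv", "id,title,venue\nmeta:br/1,A,J [issn:1234-5678]\nmeta:br/2,B,\n")
buffer.seek(0)

rows = list(iter_venue_rows(buffer))
```

Only the first sample row survives the filter, since the second one has an empty venue. For the real dump, `archive` would be the path to the downloaded file.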

In [2]:
import os
import pandas as pd
import csv

def detect_delimiter(file_path):
    with open(file_path, 'r', newline='', encoding='utf-8') as file:
        dialect = csv.Sniffer().sniff(file.read(1024))
    return dialect.delimiter

delimiter = detect_delimiter('ERIHPLUSapprovedJournals.csv')
erih_plus_df = pd.read_csv('ERIHPLUSapprovedJournals.csv', sep=delimiter)
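`csv.Sniffer` raises `csv.Error` when it cannot settle on a delimiter, and it may pick an unexpected character when it is free to choose among all of them. A possible hardening of the function above, assuming the delimiter is one of a few usual candidates, is sketched here; the name `detect_delimiter_safe`, its default values and the sample file are hypothetical.

```python
import csv
import os
import tempfile


def detect_delimiter_safe(file_path, candidates=",;\t|", default=","):
    """Guess the delimiter among `candidates`, returning `default` if sniffing fails."""
    with open(file_path, "r", newline="", encoding="utf-8") as file:
        sample = file.read(4096)
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # Raised when the sniffer cannot determine a delimiter
        return default


# Hypothetical semicolon-separated file used only to exercise the helper
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write("Journal ID;Print ISSN;Country\n1;1234-5678;Spain\n2;2345-6789;Italy\n")
    detected = detect_delimiter_safe(path)
```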



In [3]:
erih_plus_df.head(1)

Unnamed: 0,Journal ID,Print ISSN,Online ISSN,Original Title,International Title,Country of Publication,ERIH PLUS Disciplines,OECD Classifications,[Last Updated]
0,486254,1989-3477,,@tic.revista d'innovació educativa,@tic.revista d'innovació educativa,Spain,Interdisciplinary research in the Social Scien...,Educational Sciences; Other Social Sciences,2015-06-25 13:48:26


In [18]:
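def process_meta_csv(file_path, e_df):
    meta_data = pd.read_csv(file_path)
    # Cast to string so that rows without a venue (NaN) can go through the regex
    meta_data['venue'] = meta_data['venue'].astype(str)
    # Extract the first ISSN listed for the venue; rows without one get NaN
    meta_data['issn'] = meta_data['venue'].str.extract(r'issn:(\d{4}-\d{3}[\dX])')
    
    # Extract the identifier (OMID) from the 'id' column
    meta_data['id'] = meta_data['id'].str.extract(r'(meta:[^ ]*)')
     
    # Match journals on either ISSN. Missing keys (NaN) also match each other in a
    # pandas merge, which yields rows with an empty 'issn'; these are dropped later.
    merged_data_print = e_df.merge(meta_data, left_on='Print ISSN', right_on='issn', how='inner')
    merged_data_online = e_df.merge(meta_data, left_on='Online ISSN', right_on='issn', how='inner')
    merged_data = pd.concat([merged_data_print, merged_data_online], ignore_index=True)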
    
    # Keep only the relevant columns for the mapping dataframe
    merged_data = merged_data[['id', 'issn', 'Journal ID', 'Print ISSN', 'Online ISSN']].rename(columns={'id': 'OC_OMID', 'issn': 'OC_ISSN', 'Journal ID': 'EP_ID', 'Print ISSN': 'EP_Print_ISSN', 'Online ISSN': 'EP_Online_ISSN'})
    
    # Create the 'EP_ISSN' column
    merged_data['EP_ISSN'] = merged_data['EP_Print_ISSN'].combine_first(merged_data['EP_Online_ISSN'])
    
    # Drop the 'EP_Print_ISSN' and 'EP_Online_ISSN' columns
    merged_data = merged_data.drop(columns=['EP_Print_ISSN', 'EP_Online_ISSN'])

    return merged_data

merged_data = process_meta_csv('0.csv', erih_plus_df)
merged_data

Unnamed: 0,OC_OMID,OC_ISSN,EP_ID,EP_ISSN
0,meta:br/060100,,488561,2341-0515
1,meta:br/060176,,488561,2341-0515
2,meta:br/06084,,488561,2341-0515
3,meta:br/060104,,488561,2341-0515
4,meta:br/06046,,488561,2341-0515
...,...,...,...,...
7381951,meta:br/0602842,,491147,2309-1606
7381952,meta:br/0602972,,491147,2309-1606
7381953,meta:br/0602880,,491147,2309-1606
7381954,meta:br/0602884,,491147,2309-1606
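The regular expression in `process_meta_csv` is applied with `str.extract`, which keeps only the first match, so a venue that lists several ISSNs is compared through just one of them. A minimal sketch of collecting every ISSN instead, assuming that a venue string can carry more than one `issn:` identifier; the helper name `extract_issns` and the sample string are hypothetical.

```python
import re

ISSN_PATTERN = re.compile(r"issn:(\d{4}-\d{3}[\dX])")


def extract_issns(venue):
    """Return every ISSN found in a venue string, in order of appearance."""
    return ISSN_PATTERN.findall(venue)


# Hypothetical venue value with two ISSNs
sample_venue = "Example Journal [issn:1234-5678 issn:8765-432X omid:br/0000]"
issns = extract_issns(sample_venue)
```

On a dataframe, the same idea could be expressed with `Series.str.findall` followed by `explode`, giving one row per ISSN before the merge.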


In [9]:
csv_directory = ''  # Folder containing the OpenCitations Meta CSV files
frames = []

for file_name in os.listdir(csv_directory):
    if file_name.endswith('.csv'):
        file_path = os.path.join(csv_directory, file_name)
        frames.append(process_meta_csv(file_path, erih_plus_df))

# Concatenate once at the end instead of growing the dataframe inside the loop
merged_data = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

FileNotFoundError: [WinError 3] The system cannot find the path specified: 'I:\\open-sci\\dump-files\\opencitations-meta\\solo_one'
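`os.listdir` raises `FileNotFoundError` when the directory does not exist, which is also what happens with an empty string. One way to fail earlier and with a clearer message is sketched below; the helper `list_csv_files` is hypothetical, and the directory is a temporary one created only for the demonstration.

```python
import tempfile
from pathlib import Path


def list_csv_files(directory):
    """Return the CSV files in `directory`, sorted by name."""
    folder = Path(directory)
    # Path('') resolves to the current directory, so reject empty values explicitly
    if not directory or not folder.is_dir():
        raise FileNotFoundError(f"CSV directory not found: {directory!r}")
    return sorted(folder.glob("*.csv"))


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "0.csv").write_text("id,venue\n", encoding="utf-8")
    (Path(tmp) / "notes.txt").write_text("ignored", encoding="utf-8")
    found = [path.name for path in list_csv_files(tmp)]
```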

In [10]:
merged_data.head(1)

In [6]:
new_merged_data = merged_data.dropna(subset=['OC_ISSN']).reset_index(drop=True)
new_merged_data.head(2)

Unnamed: 0,OC_OMID,OC_ISSN,EP_ID,EP_ISSN
0,meta:br/0601646,0172-6404,471777,0172-6404
1,meta:br/0601638,0172-6404,471777,0172-6404
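The rows with an empty OC_ISSN that `dropna` removes here are most likely produced by the merge itself: pandas matches missing join keys against each other, so every ERIH PLUS journal lacking an online (or print) ISSN is paired with every Meta row lacking an ISSN. The sketch below reproduces the effect on made-up data and shows one way to avoid it, by dropping missing keys before merging.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the ERIH PLUS and Meta dataframes (values are made up)
journals = pd.DataFrame({"Journal ID": [1, 2], "Online ISSN": ["1111-1111", np.nan]})
meta = pd.DataFrame({"id": ["meta:br/1", "meta:br/2", "meta:br/3"],
                     "issn": ["1111-1111", np.nan, np.nan]})

# Naive merge: the journal with a missing ISSN matches both Meta rows with a missing ISSN
naive = journals.merge(meta, left_on="Online ISSN", right_on="issn", how="inner")

# Dropping missing keys on both sides first keeps only genuine ISSN matches
clean = journals.dropna(subset=["Online ISSN"]).merge(
    meta.dropna(subset=["issn"]), left_on="Online ISSN", right_on="issn", how="inner"
)
```

Filtering before the merge keeps the intermediate result small, which matters when the unfiltered merge grows to millions of rows.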


### 1.2

Here we need a step that adds open access information to the dataframe we just created, so that the OMIDs are directly connected to the accessibility status of the journal.

In [7]:
# Load DOAJ CSV file into a DataFrame
doaj_file_path = 'journalcsv__doaj.csv'
doaj_df = pd.read_csv(doaj_file_path, encoding="UTF-8")

In [8]:
doaj_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19278 entries, 0 to 19277
Data columns (total 54 columns):
 #   Column                                                                       Non-Null Count  Dtype  
---  ------                                                                       --------------  -----  
 0   Journal title                                                                19278 non-null  object 
 1   Journal URL                                                                  19278 non-null  object 
 2   URL in DOAJ                                                                  19278 non-null  object 
 3   When did the journal start to publish all content using an open license?     19277 non-null  float64
 4   Alternative title                                                            7485 non-null   object 
 5   Journal ISSN (print version)                                                 11148 non-null  object 
 6   Journal EISSN (online version)        

In [9]:
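# Select the columns by name and keep every row (read_csv has already consumed the header)
new_doaj = doaj_df[['Journal ISSN (print version)', 'Journal EISSN (online version)', 'Country of publisher']]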
new_doaj.columns

Index(['Journal ISSN (print version)', 'Journal EISSN (online version)',
       'Country of publisher'],
      dtype='object')

In [12]:
# Create a dictionary of Open Access ISSNs, skipping missing print or online ISSNs
open_access_dict = {}

for index, row in new_doaj.iterrows():
    for column in ['Journal ISSN (print version)', 'Journal EISSN (online version)']:
        if pd.notna(row[column]):
            open_access_dict[row[column]] = True


In [13]:
# Merge Open Access information with the main dataframe
new_merged_data['Open Access'] = new_merged_data['OC_ISSN'].map(open_access_dict)


In [14]:
# Fill missing Open Access information with 'Unknown'
new_merged_data['Open Access'] = new_merged_data['Open Access'].fillna('Unknown')


In [15]:
new_merged_data.head()

Unnamed: 0,OC_OMID,OC_ISSN,EP_ID,EP_ISSN,Open Access
0,meta:br/0601646,0172-6404,471777,0172-6404,Unknown
1,meta:br/0601638,0172-6404,471777,0172-6404,Unknown
2,meta:br/0601645,0172-6404,471777,0172-6404,Unknown
3,meta:br/0601643,0172-6404,471777,0172-6404,Unknown
4,meta:br/0601640,0172-6404,471777,0172-6404,Unknown
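Because the dictionary only ever stores `True`, a set of ISSNs carries the same information, and `Series.isin` yields a boolean column in one step. The sketch below treats journals missing from DOAJ as `False` rather than `'Unknown'`, which is an assumption about the intended semantics; the data is made up.

```python
import pandas as pd

# Toy stand-ins for new_doaj and new_merged_data (values are made up)
doaj = pd.DataFrame({
    "Journal ISSN (print version)": ["1111-1111", None],
    "Journal EISSN (online version)": ["2222-2222", "3333-3333"],
})
merged = pd.DataFrame({"OC_OMID": ["meta:br/1", "meta:br/2"],
                       "OC_ISSN": ["3333-3333", "9999-9999"]})

# Collect every non-missing print or online ISSN listed in DOAJ
issn_columns = ["Journal ISSN (print version)", "Journal EISSN (online version)"]
open_access_issns = set(pd.concat([doaj[col] for col in issn_columns]).dropna())

# True when the ISSN found in OpenCitations Meta is listed in DOAJ
merged["Open Access"] = merged["OC_ISSN"].isin(open_access_issns)
```

A boolean column avoids mixing `True` with the string `'Unknown'`, which keeps later filtering and aggregation simpler.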
