<h1>First Step</h1>
<p>The first step in using the scripts in this repository is to collect the entities that contain mashed DOIs (several DOIs attached to a single entity). 
This can be done through the SPARQL endpoint of OpenCitations Meta. The query below retrieves all of the journals in the endpoint that have more than one DOI.</p>

In [None]:
# Placeholder
# PREFIX datacite: <http://purl.org/spar/datacite/>
# PREFIX dcterms: <http://purl.org/dc/terms/>
# PREFIX fabio: <http://purl.org/spar/fabio/>

# SELECT ?journal (COUNT(?identifier) AS ?doiCount)
# WHERE {
#   ?journal a fabio:Journal .
#   ?journal datacite:hasIdentifier ?identifier .
#   ?identifier datacite:usesIdentifierScheme datacite:doi .
# }
# GROUP BY ?journal
# HAVING (COUNT(?identifier) > 1)
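
<p>As a rough sketch of how the query above could be run and exported to CSV, the snippet below sends it over the standard SPARQL protocol and flattens the JSON result. The endpoint URL, the helper names <code>run_query</code> and <code>bindings_to_csv</code>, and the sample IRI are assumptions made for illustration; they are not part of this repository. The network call is defined but not executed, and the demonstration works on a hand-made response.</p>

```python
import csv
import io

# Assumption: endpoint URL of OpenCitations Meta; verify it against the official documentation.
META_SPARQL_ENDPOINT = "https://opencitations.net/meta/sparql"

QUERY = """
PREFIX datacite: <http://purl.org/spar/datacite/>
PREFIX fabio: <http://purl.org/spar/fabio/>

SELECT ?journal (COUNT(?identifier) AS ?doiCount)
WHERE {
  ?journal a fabio:Journal .
  ?journal datacite:hasIdentifier ?identifier .
  ?identifier datacite:usesIdentifierScheme datacite:doi .
}
GROUP BY ?journal
HAVING (COUNT(?identifier) > 1)
"""


def run_query(endpoint=META_SPARQL_ENDPOINT, query=QUERY):
    """Send the query over HTTP and return the parsed JSON (needs network access)."""
    import requests  # third-party; imported lazily so the helper below works without it

    response = requests.get(
        endpoint,
        params={"query": query},
        headers={"Accept": "application/sparql-results+json"},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()


def bindings_to_csv(sparql_json):
    """Flatten a SPARQL JSON result into CSV text, one column per selected variable."""
    columns = sparql_json["head"]["vars"]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    for binding in sparql_json["results"]["bindings"]:
        writer.writerow([binding.get(col, {}).get("value", "") for col in columns])
    return buffer.getvalue()


# Offline demonstration with a made-up response in the standard JSON result format.
sample_response = {
    "head": {"vars": ["journal", "doiCount"]},
    "results": {
        "bindings": [
            {
                "journal": {"type": "uri", "value": "https://w3id.org/oc/meta/br/0601"},
                "doiCount": {"type": "literal", "value": "2"},
            }
        ]
    },
}
csv_text = bindings_to_csv(sample_response)
print(csv_text)
```

<p>The resulting CSV keeps the <code>journal</code> column name, which is the column read in the next step.</p>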

<p>After exporting the results in CSV format, we can start working on the entities returned by the query. The CSV contains the URL of each journal and its OMID, together with the number of DOIs assigned to that entity. The DOIFinder object retrieves the DOIs of these entities.</p>

In [None]:
from DoiFinder import DOIFinder
import pandas as pd
from tqdm import tqdm  # import the callable; the bare module cannot be called as tqdm(...)

finder = DOIFinder()
    
# Read the CSV file containing journal URLs
journal_doi_df = pd.read_csv("insert_file_path", usecols=['journal'])

# Iterate over each URL in the 'journal' column
for idx, initial_url in tqdm(journal_doi_df['journal'].items(), desc="Processing journals"):
    initial_url = str(initial_url).strip()  # Clean the URL string
        
    # Check if it's a valid URL
    if initial_url and (initial_url.startswith('http://') or initial_url.startswith('https://')):
        # Find DOI links in the URL
        doi_links = finder.find_doi_links(initial_url)
            
        # Save the found DOI links (if any) into the 'journal_doi' column
        if doi_links:
            # Join all DOI links found into one comma-separated string
            journal_doi_df.at[idx, 'journal_doi'] = ", ".join(doi_links)
    else:
        print(f"Invalid URL: {initial_url}")

# Save the updated DataFrame with the new 'journal_doi' column to a CSV file
journal_doi_df.to_csv("insert_file_path", index=False)
print("Updated CSV file saved.")


<p>Next, we use the DOIOpener object to validate these DOIs and determine which ones must be deleted from the original entity, which ones point to other entities, and which ones really belong to the entity they are assigned to. This step has to be done manually, since there is no way to validate the DOIs automatically.</p>

In [None]:
from doi_opener import DOIOpener
# Insert the list of dois
doi_list = []

doi_opener = DOIOpener(doi_list)
doi_opener.process_and_open_urls()
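
<p>For readers who want to see what opening a list of DOIs for manual inspection might involve, here is a minimal sketch. It is not the implementation of DOIOpener; the helpers <code>doi_to_url</code> and <code>open_dois</code> are hypothetical and only rely on the <code>https://doi.org/</code> resolver and the standard library.</p>

```python
# Prefixes that may precede a DOI and should be stripped before building the link.
_KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(doi):
    """Turn a bare DOI, a 'doi:' string, or a DOI URL into an https://doi.org/ link."""
    cleaned = doi.strip()
    lowered = cleaned.lower()
    for prefix in _KNOWN_PREFIXES:
        if lowered.startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    return "https://doi.org/" + cleaned


def open_dois(dois):
    """Open every DOI in a new browser tab so it can be checked by hand."""
    import webbrowser  # standard library; imported here because it is only needed interactively

    for doi in dois:
        webbrowser.open_new_tab(doi_to_url(doi))


# Building the links does not require a browser or network access.
links = [doi_to_url(d) for d in ["10.1234/example-doi-1", " doi:10.5678/example-doi-2 "]]
print(links)
```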

<p>Once the correct DOIs are gathered, we discard the ones that are either completely wrong or return a 404 error. We can then use the SPARQLCitationExtractor to retrieve all of the citations that point to the entities with mashed DOIs, as well as the citations that go from those entities to other entities. Afterwards, we can reuse the DOIFinder object to find the DOIs of the citing and cited entities.</p>

In [None]:
from citation_finder import SPARQLCitationExtractor
# insert the csv containing the urls of the entities
extractor = SPARQLCitationExtractor(
    endpoint_url="https://opencitations.net/index/sparql",
    input_csv_path="insert_path",
    output_directory="insert_directory"
)
    
extractor.extract_citing_data()
extractor.extract_cited_data()
extractor.merge_citing_and_cited_data()
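
<p>To make the two directions of extraction more concrete, the sketch below builds one SPARQL query for incoming citations and one for outgoing citations of a single entity. It assumes that the index models citations with the CiTO properties <code>cito:hasCitingEntity</code> and <code>cito:hasCitedEntity</code>; the function <code>build_citation_query</code> and the sample IRI are hypothetical, and the queries actually sent by SPARQLCitationExtractor may differ.</p>

```python
def build_citation_query(entity_iri, direction):
    """Return a SPARQL query listing the citations of one entity.

    direction="incoming": entities that cite entity_iri.
    direction="outgoing": entities that entity_iri cites.
    """
    # Reject characters that would break out of the <...> IRI delimiters.
    if any(ch in entity_iri for ch in '<>" \n\t'):
        raise ValueError(f"Not a plain IRI: {entity_iri!r}")

    if direction == "incoming":
        fixed, variable = "hasCitedEntity", "hasCitingEntity"
    elif direction == "outgoing":
        fixed, variable = "hasCitingEntity", "hasCitedEntity"
    else:
        raise ValueError("direction must be 'incoming' or 'outgoing'")

    return (
        "PREFIX cito: <http://purl.org/spar/cito/>\n"
        "SELECT ?citation ?other WHERE {\n"
        f"  ?citation cito:{fixed} <{entity_iri}> ;\n"
        f"            cito:{variable} ?other .\n"
        "}"
    )


# Made-up entity IRI used only to show the shape of the generated queries.
entity = "https://w3id.org/oc/meta/br/0601"
incoming_query = build_citation_query(entity, "incoming")
outgoing_query = build_citation_query(entity, "outgoing")
print(incoming_query)
```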

<p>We now have all of the citations involving the mashed entities, which will need to be removed later. 
We can proceed to gather the metadata of the mashed entities with the MetadataGatherer object. This object requires a semi-automatic approach: it retrieves the HTML tags of the page that hold the metadata needed for the upload to OpenCitations Meta, so the tags to retrieve must be adjusted inside the script and the entities handled case by case.</p>
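
<p>The tag-based approach can be illustrated with a small, standard-library sketch. It assumes the landing page exposes its metadata in <code>&lt;meta name="citation_..."&gt;</code> tags, which many publishers do but not all; the class <code>MetaTagCollector</code> and the sample page are hypothetical and do not reproduce MetadataGatherer. Changing the list of wanted names is the equivalent of adjusting the tags case by case.</p>

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect the content of <meta> tags whose name is in a configurable list."""

    def __init__(self, wanted_names):
        super().__init__()
        self.wanted_names = {name.lower() for name in wanted_names}
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content")
        if name in self.wanted_names and content:
            # A name can repeat (for example one tag per author), so keep a list.
            self.found.setdefault(name, []).append(content.strip())


# Made-up landing page used only for the demonstration.
sample_page = """
<html><head>
<meta name="citation_title" content="An Example Article">
<meta name="citation_author" content="Doe, Jane">
<meta name="citation_author" content="Roe, Richard">
<meta name="description" content="Not part of the wanted tags">
</head><body></body></html>
"""

collector = MetaTagCollector(["citation_title", "citation_author"])
collector.feed(sample_page)
print(collector.found)
```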

<p>Before continuing, we need to address the DOIs that point to an entity different than the one they were assigned to. After opening these DOIs and verifying that they are indeed articles and not journals, we can retrieve their citations and their metadata. This way we can recreate the connections to other entities and later upload the new entities to OpenCitations Meta. The DOIProcessor object can gather the metadata of these articles and their citations.</p>

In [None]:
from metadata_researcher import MetadataGatherer
# Replace the list below with the DOIs you want to process, or load them from a file.
example_dois = [
    "10.1234/example-doi-1",
    "10.5678/example-doi-2",
]

# Create a MetadataGatherer instance
processor = MetadataGatherer(example_dois)

# Process DOIs and save metadata to a JSON file
processor.process_dois(output_file="insert_file_path")

<p>Now that we have the metadata of the entities with mashed DOIs, we can gather the metadata of the entities they cite with the CrossRefProcessor object.</p>

In [None]:
from cited_articles_metadata_gatherer import CrossRefProcessor
processor = CrossRefProcessor(
    input_csv="insert_file_path",
    output_json="insert_file_path",
    output_csv="insert_file_path",
    filtered_json="insert_file_path"
)
    
# Process data and save results
processor.process_data()
    
# Filter and save referenced DOIs
processor.filter_referenced_dois()
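
<p>As an illustration of what filtering referenced DOIs can mean, the sketch below pulls the DOIs out of the reference list of a Crossref-style work record. It assumes the JSON layout returned by the Crossref REST API for a single work, where references sit under <code>message.reference</code> and carry an optional <code>DOI</code> key; the function <code>referenced_dois</code> and the sample record are hypothetical and are not taken from CrossRefProcessor.</p>

```python
def referenced_dois(crossref_record):
    """Return the distinct, lowercased DOIs found in a work's reference list."""
    # Accept either the full API response or just its "message" part.
    message = crossref_record.get("message", crossref_record)
    dois = []
    for reference in message.get("reference", []):
        doi = reference.get("DOI")
        if not doi:
            continue  # unstructured references often carry no DOI
        normalized = doi.strip().lower()
        if normalized not in dois:
            dois.append(normalized)
    return dois


# Made-up record that mimics the shape of a Crossref work response.
sample_record = {
    "status": "ok",
    "message": {
        "DOI": "10.1234/example-doi-1",
        "reference": [
            {"key": "ref1", "DOI": "10.5678/Example-DOI-2"},
            {"key": "ref2", "unstructured": "A reference without a DOI."},
            {"key": "ref3", "DOI": "10.5678/example-doi-2"},
        ],
    },
}
print(referenced_dois(sample_record))
```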

<p>All of the elements needed to reconnect the entities are now available; we just need to order them and link them. The DOIMatcher object helps by connecting the entities that cite the mashed entities to the mashed entities themselves.</p>

In [None]:
from citing_doi_matcher import DOIMatcher
processor = DOIMatcher(
    csv_file="insert_file_path",
    json_file="insert_file_path",
    output_file="insert_file_path"
)

processor.process()
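
<p>The matching step can be pictured as a join on normalized DOIs. The sketch below is a simplified stand-in, not the logic of DOIMatcher: the column names <code>citing_doi</code> and <code>cited_doi</code>, the metadata layout, and the function <code>match_citing_to_metadata</code> are all assumptions chosen for the example.</p>

```python
import csv
import io


def normalize_doi(doi):
    """DOIs are case-insensitive, so compare them stripped and lowercased."""
    return doi.strip().lower()


def match_citing_to_metadata(csv_text, metadata_by_doi):
    """Pair each citing DOI with the metadata of the entity it cites, when known."""
    lookup = {normalize_doi(doi): meta for doi, meta in metadata_by_doi.items()}
    matched, unmatched = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = normalize_doi(row["cited_doi"])
        if key in lookup:
            matched.append(
                {
                    "citing_doi": row["citing_doi"],
                    "cited_doi": key,
                    "cited_metadata": lookup[key],
                }
            )
        else:
            unmatched.append(row["citing_doi"])
    return matched, unmatched


# Made-up inputs: a citation CSV and the metadata gathered for one mashed entity.
citations_csv = (
    "citing_doi,cited_doi\n"
    "10.1000/a,10.1234/Example-DOI-1\n"
    "10.1000/b,10.9999/unknown\n"
)
metadata = {"10.1234/example-doi-1": {"title": "An Example Article"}}

matched, unmatched = match_citing_to_metadata(citations_csv, metadata)
print(matched)
print(unmatched)
```

<p>Keeping the unmatched citing DOIs in a separate list makes it easy to review by hand the citations that could not be reconnected.</p>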

<p>The remaining operations concern the upload to Meta and the re-establishment of the connections between the entities, which will be carried out later.
Remaining operations: upload the entities to the new version of Meta; retrieve their OMIDs; match the OMIDs to the other identifiers; recreate the connections between the entities and the citations.</p>