<a href="https://colab.research.google.com/github/m-mahdavian/duplicated-references/blob/main/duplicatefinder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Readme
When several co-authors use reference managers on the same manuscript, the reference list often ends up with anomalies: the same reference can appear twice, either in identical form or with small differences in its data. Reference managers have built-in duplicate removers, but when two or more co-authors work on a manuscript the problem still seems inevitable. This program helps authors, especially corresponding authors, detect duplicate references in a lengthy reference list before submitting the manuscript to a journal.

You can remove the 100% matches with no further consideration. References that are only reported as similar need more care: they may be true duplicates that were entered differently in the reference managers, or they may be different works that happen to have, e.g., similar titles or similar author lists. Remove such suspected duplicates manually, after a thorough check.

I used the FuzzyWuzzy library for Python to measure similarity. You can lower the target similarity % if data-entry errors in the reference managers make duplicates more likely in your reference list.

To use this code, copy and paste the reference list into Notepad, with one reference per line, and save it as "references.txt". The code is compatible with most reference list formats, although some formats may require small changes to the code. Please leave me a comment if you enjoy it, if you find a glitch with some reference list, or if you have suggestions for changing the code. Please mention your selected similarity % in your comment.

Note: Finding duplicates with the pandas duplicated() method and removing them with drop_duplicates() seems to be ineffective sometimes, which is why I didn't use them here.

This code was developed by M. Mahdavian. I used some hints from Bing AI while developing it.
https://www.linkedin.com/in/mohammad-mahdavian-50827b53

# Installing and Loading Required Libraries

In [None]:
%pip install fuzzywuzzy[speedup]



In [None]:
import pandas as pd
from fuzzywuzzy import fuzz

# Read the References Text File

In [None]:
# Read the file line by line. ISO-8859-1 maps every byte to a character,
# so decoding never fails on unexpected characters.
with open("references.txt", encoding="ISO-8859-1") as file:
  text = file.readlines()
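
The cell above decodes every line as ISO-8859-1, which never raises an error but can garble accented author names if the file was actually saved as UTF-8. A minimal sketch of a more forgiving decoder, assuming a hypothetical helper `decode_line` that is not part of the original notebook, might try UTF-8 first and fall back to ISO-8859-1:

```python
def decode_line(raw: bytes) -> str:
    """Decode one line of the references file (hypothetical helper)."""
    try:
        # Many editors save plain text as UTF-8, so try that first.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this cannot fail.
        return raw.decode("ISO-8859-1")


name = "M\u00fcller, A. (2020)"  # "Muller" with a u-umlaut
utf8_line = name.encode("utf-8")
latin1_line = name.encode("ISO-8859-1")

# Both encodings of the same reference decode to the same text.
print(decode_line(utf8_line) == decode_line(latin1_line))
```

With a helper like this, the same reference saved by two co-authors in different encodings would still compare as identical.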

In [None]:
df=pd.DataFrame(text)

# Removing the Reference Numbers

In [None]:
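# Drop a leading label such as "[12]" from each reference.
# A regex removes only the label; str.strip("[12]") would instead remove
# any of those characters from both ends of the line.
df[0] = df[0].str.replace(r"^\s*\[\d+\]\s*", "", regex=True)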
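
One detail worth knowing for this step: `str.strip` treats its argument as a set of characters to remove from both ends, not as a literal label. The sketch below uses only the standard library and a hypothetical helper `remove_label` (not part of the original notebook), and it assumes references are numbered in the form `[12]`:

```python
import re

# Matches a leading label such as "[12]" plus surrounding whitespace.
LABEL = re.compile(r"^\s*\[\d+\]\s*")


def remove_label(line: str) -> str:
    """Remove a leading "[n]" label and leave the rest of the line intact."""
    return LABEL.sub("", line)


line = "[12] Smith J. Corrosion Science 2021"

# strip("[12]") removes any of "[", "1", "2", "]" from both ends,
# so the trailing "21" of the year is lost as well.
print(repr(line.strip("[12]")))   # ' Smith J. Corrosion Science 20'
print(repr(remove_label(line)))   # 'Smith J. Corrosion Science 2021'
```

Reference lists numbered in another style, such as `12.` or `(12)`, would need a different pattern.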

# Finding Exact Duplicates

In [None]:
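# Convert each row to a string once instead of on every comparison.
rows = [df.iloc[i].to_string() for i in range(len(df))]
duplicated = []   # row strings of references that have an exact copy
matched = []      # row indices already reported as the copy of an earlier row
found = False
for i in range(len(rows)):
  for j in range(len(rows)):
    if i != j and rows[i] == rows[j]:
      found = True
      # Report each duplicated reference only once.
      if rows[i] not in duplicated and i not in matched and j not in matched:
        print("Ref. Number #", i+1, "is identical (100% similar) to Ref. Number #", j+1)
        duplicated.append(rows[i])
        matched.append(j)
if not found:
  print("NO IDENTICAL (100% SIMILAR) REFERENCES WERE DETECTED.")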

Ref. Number # 40 is identical (100% similar) to Ref. Number # 42
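
The nested loops above compare every pair of references. As an alternative sketch (not the notebook's method), exact duplicates can also be grouped in a single pass with a dictionary. The helper name `find_exact_duplicates`, the sample references, and the whitespace trimming are assumptions made for illustration:

```python
from collections import defaultdict


def find_exact_duplicates(references):
    """Return lists of 1-based positions whose reference text is identical."""
    positions = defaultdict(list)
    for number, reference in enumerate(references, start=1):
        # Trim surrounding whitespace so a stray newline does not hide a match.
        positions[reference.strip()].append(number)
    return [group for group in positions.values() if len(group) > 1]


sample = [
    "Smith J. Title A. 2020.",
    "Lee K. Title B. 2019.",
    "Smith J. Title A. 2020.\n",
]
print(find_exact_duplicates(sample))  # [[1, 3]]
```

Each group lists every position of one reference, so a reference that appears three times is reported as a single group of three.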


# Removing Identical References

In [None]:
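# Replace the first copy of each exact duplicate with a placeholder so that
# the near-duplicate search does not report the same pair again.
for i in range(len(df)):
  row_text = df.iloc[i].to_string()
  if row_text in duplicated:
    # Remove this exact entry rather than whatever happens to be first in the list.
    duplicated.remove(row_text)
    df.at[i, 0] = "Mahdavian"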

# Finding Near Duplicates

In [None]:
# Set the intended similarity % here.
similarity=80
print("Suspected duplicate references with similarity above or equal to", similarity, "%:")
print("***Remove them manually after a thorough check. You can lower the similarity % if you think \nduplicates are more likely in your reference list because of data-entry errors in the \nreference managers.***")
# Convert each row to a string once instead of on every comparison.
rows=[df.iloc[i].to_string() for i in range(len(df))]
# 14 is the length of the row string of a "Mahdavian" placeholder,
# which marks an exact duplicate that was already handled.
placeholder_len=14
matched=[]   # rows already reported as the second member of a pair
found=False
for i in range(len(rows)):
  for j in range(len(rows)):
    if i!=j and fuzz.ratio(rows[i], rows[j])>=similarity:
      # Skip pairs in which both rows are placeholders.
      if len(rows[i])!=placeholder_len or len(rows[j])!=placeholder_len:
        if i not in matched and j not in matched:
          print("Ref. Number #", i+1, "is similar to Ref. Number #", j+1)
          matched.append(j)
        found=True
if not found:
  print("NO SIMILARITY WAS DETECTED.")


Suspected duplicate references with similarity above or equal to 80 %:
***Remove them manually after a thorough check. You can lower the similarity % if you think 
duplicates are more likely in your reference list because of data-entry errors in the 
reference managers.***
NO SIMILARITY WAS DETECTED.
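
To get a feel for the similarity threshold, the sketch below scores two reference pairs on a 0 to 100 scale. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`, so it runs without FuzzyWuzzy; the scores may differ slightly from FuzzyWuzzy's. The helper name `similarity_percent` and the sample references are made up for illustration:

```python
from difflib import SequenceMatcher


def similarity_percent(a: str, b: str) -> int:
    """Score two strings from 0 (nothing in common) to 100 (identical)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# The same work entered with different punctuation by two co-authors.
ref_a = "Smith J, Lee K. Corrosion of mild steel in seawater. Corros Sci 2020;165:108."
ref_b = "Smith J., Lee K. Corrosion of mild steel in sea water. Corros. Sci. 2020, 165, 108."
# An unrelated work.
ref_c = "Brown T. Polymer coatings for aluminium alloys. Prog Org Coat 2018;120:50."

threshold = 80
for other in (ref_b, ref_c):
    score = similarity_percent(ref_a, other)
    print(score, "similar" if score >= threshold else "different")
```

Lowering `threshold` flags more pairs for manual checking, at the cost of more false alarms between unrelated references.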
