<a href="https://colab.research.google.com/github/m-mahdavian/duplicated-references/blob/main/duplicatefinder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Unfortunately, when co-authors use different reference managers, anomalies can appear in the reference list.
# The same reference may be inserted in two places in the list, either in exactly the same format or with slight differences in the data.
# Although reference managers have built-in duplicate removers, this problem seems almost inevitable when two or more co-authors work on one manuscript.
# The purpose of this program is to help authors, especially corresponding authors, detect duplicate references in a lengthy reference list before submitting a manuscript to a journal.
# You can remove 100% matches without further consideration. However, references flagged as similar may be true duplicates that were entered differently in the reference managers, or genuinely different references that happen to have, e.g., similar titles or similar author lists.
# Therefore, remove suspected duplicates manually, and only after a thorough check. I used the FuzzyWuzzy library for Python to find similarities.
# You can decrease the similarity % if you think data-entry errors in the reference managers make duplicates more likely in your reference list.

# To use this program, simply copy and paste the reference list into Notepad and save it as "references.txt".
# This program is compatible with most reference list formats. However, some formats may require small changes to the code.
# Please leave me a comment if you enjoy it, find a glitch with some reference list, or have suggestions for changing the code. Please mention the similarity % you selected in your comment.

# Note: Finding duplicates with duplicated() and removing them with drop_duplicates() sometimes seems to be ineffective, which is why I didn't use them here.

# This program has been developed by M. Mahdavian.
# https://www.linkedin.com/in/mohammad-mahdavian-50827b53
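
To make the similarity threshold concrete, the sketch below scores two hypothetical entries for the same paper that differ only in how they were typed, plus one unrelated entry. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`, so the exact numbers may differ from FuzzyWuzzy's. The reference strings and the `similarity_percent` helper are made up for illustration.

```python
from difflib import SequenceMatcher

def similarity_percent(a, b):
    """Return a 0-100 similarity score (a stand-in for fuzz.ratio, not the same implementation)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

# Hypothetical entries: the same paper entered two ways, plus an unrelated one.
ref_a = 'J. Smith and A. Lee, "Coating degradation in marine environments," Corros. Sci., 2020.'
ref_b = 'J. Smith, A. Lee, "Coating degradation in marine environments", Corrosion Science, 2020.'
ref_c = 'M. Brown, "Polymer synthesis by emulsion methods," Polymer, 2015.'

threshold = 85  # same default as the notebook's `similarity` variable
same_paper = similarity_percent(ref_a, ref_b)
different_paper = similarity_percent(ref_a, ref_c)

# Each line shows the score and whether the pair would be flagged at this threshold.
print(same_paper, same_paper >= threshold)
print(different_paper, different_paper >= threshold)
```

Lowering `threshold` flags more pairs for manual review; raising it flags fewer but risks missing duplicates that were entered very differently.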


In [1]:
pip install fuzzywuzzy[speedup]

Collecting fuzzywuzzy[speedup]
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Collecting python-levenshtein>=0.12 (from fuzzywuzzy[speedup])
  Downloading python_Levenshtein-0.23.0-py3-none-any.whl (9.4 kB)
Collecting Levenshtein==0.23.0 (from python-levenshtein>=0.12->fuzzywuzzy[speedup])
  Downloading Levenshtein-0.23.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (169 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 169.4/169.4 kB 5.2 MB/s eta 0:00:00
Collecting rapidfuzz<4.0.0,>=3.1.0 (from Levenshtein==0.23.0->python-levenshtein>=0.12->fuzzywuzzy[speedup])
  Downloading rapidfuzz-3.5.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.3/3.3 MB 34.0 MB/s eta 0:00:00
Installing collected packages: fuzzywuzzy, rapidfuzz, Levenshtein, python-levenshtein
Successfully installed Levenshtein-0.23.0 fuzzyw

In [2]:
import pandas as pd
from fuzzywuzzy import fuzz

# Read the References Text File

In [3]:
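# Read every line of the reference list as bytes, then decode it.
# ISO-8859-1 maps every byte to a character, so decoding never fails.
with open("references.txt", "rb") as file:
  data = file.readlines()
text = [line.decode("ISO-8859-1") for line in data]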

In [4]:
df=pd.DataFrame(text)

In [5]:
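# Strip the opening "[" of labels such as "[12]<tab>", then remove the remaining "12]<tab>".
df[0] = df[0].str.strip('[]')
# A single call handles every row, so no loop is needed.
df = df.replace(r'\d+]\t', '', regex=True)
df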

Unnamed: 0,0
0,"J. Davis, ""Corrosion control by protective coa..."
1,"F. Zhang et al., ""Self-healing mechanisms in s..."
2,"F. Olivieri, R. Castaldo, M. Cocca, G. Gentile..."
3,"M. Rui, Y. Jiang, and A. Zhu, ""Sub-micron calc..."
4,"R. G. Buchheit, H. Guan, S. Mahajanam, and F. ..."
...,...
57,"C. Gu, J. Hu, and X. Zhong, ""The coating delam..."
58,"C. Gu, J. Hu, and X. Zhong, ""Evidence of hydro..."
59,"H. Chen et al., ""Preparation of a BTAUIOGO n..."
60,"M. Keramatinia, R. Majidi, and B. Ramezanzadeh..."
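
The cleaning step above can be illustrated on plain strings. This sketch assumes each line of `references.txt` looks like `[12]<tab>Author, "Title," ...`, which is what the `strip('[]')` call and the `\d+]\t` pattern suggest. The sample lines and the `clean_reference` helper are hypothetical.

```python
import re

def clean_reference(line):
    """Drop a leading '[n]<tab>' label, mirroring strip('[]') followed by the regex replace."""
    line = line.strip("[]")              # removes the opening '[' (and any ']' at the very end)
    return re.sub(r"\d+]\t", "", line)   # removes the remaining 'n]<tab>'

samples = [
    '[1]\tA. Author, "First hypothetical title," J. Example, 2020.\n',
    '[12]\tB. Writer and C. Editor, "Second hypothetical title," J. Example, 2021.\n',
]
cleaned = [clean_reference(s) for s in samples]
for c in cleaned:
    print(repr(c))
```

Because the pattern is not anchored to the start of the line, it would also remove a digit, bracket, and tab sequence in the middle of a reference. Anchoring it with `^` may be safer for lists where that can occur.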


# Exact Duplicates

In [9]:
# Compare every pair of references and report each exact match once.
duplicated = []   # text of each reference that has an identical copy
reported = []     # row positions already reported as the copy in a pair
s = 0             # number of identical comparisons found
for i in range(len(df)):
  for j in range(len(df)):
    if i != j and df.iloc[i].to_string() == df.iloc[j].to_string():
      if df.iloc[i].to_string() not in duplicated and i not in reported and j not in reported:
        print("Ref. Number #", i+1, "is identical (100% similar) to Ref. Number #", j+1)
        duplicated.append(df.iloc[i].to_string())
        reported.append(j)
      s = s + 1
if s == 0:
  print("NO IDENTICAL (100% SIMILAR) REFERENCES WERE DETECTED.")

NO IDENTICAL (100% SIMILAR) REFERENCES WERE DETECTED.
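
As an alternative to the nested loops, exact duplicates can also be grouped in a single pass with a dictionary. This is only a sketch on a plain Python list rather than the notebook's DataFrame; `find_exact_duplicates` and the sample list are hypothetical.

```python
from collections import defaultdict

def find_exact_duplicates(references):
    """Map each repeated reference text to the 1-based positions where it occurs."""
    positions = defaultdict(list)
    for number, ref in enumerate(references, start=1):
        positions[ref.strip()].append(number)  # ignore surrounding whitespace and newlines
    return {ref: nums for ref, nums in positions.items() if len(nums) > 1}

refs = ["Ref A", "Ref B", "Ref A", "Ref C", "Ref B", "Ref A"]
groups = find_exact_duplicates(refs)
for ref, nums in groups.items():
    print("Ref. Numbers", nums, "are identical")
```

Grouping this way lists every copy of a reference together, including references that appear three or more times.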


In [7]:
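# Replace the first copy of each exact duplicate with a placeholder,
# so that only one copy remains for the near-duplicate check.
for i in range(len(df)):
  ref = df.iloc[i].to_string()
  if ref in duplicated:
    duplicated.remove(ref)  # remove this reference, not whichever entry happens to be first
    df.at[i, 0] = "Mahdavian"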
df

Unnamed: 0,0
0,"J. Davis, ""Corrosion control by protective coa..."
1,"F. Zhang et al., ""Self-healing mechanisms in s..."
2,"F. Olivieri, R. Castaldo, M. Cocca, G. Gentile..."
3,"M. Rui, Y. Jiang, and A. Zhu, ""Sub-micron calc..."
4,"R. G. Buchheit, H. Guan, S. Mahajanam, and F. ..."
...,...
57,"C. Gu, J. Hu, and X. Zhong, ""The coating delam..."
58,"C. Gu, J. Hu, and X. Zhong, ""Evidence of hydro..."
59,"H. Chen et al., ""Preparation of a BTAUIOGO n..."
60,"M. Keramatinia, R. Majidi, and B. Ramezanzadeh..."


# Near Duplicates

In [8]:
similarity = 85   # minimum fuzz.ratio score (%) for two references to be flagged
reported = []     # row positions already reported as the second reference in a pair
s = 0             # number of comparisons at or above the threshold
print("Suspected duplicate references with similarity above or equal to", similarity, "%:")
print("***Remove them manually after a thorough check. You can decrease the similarity % if you think \nthere is a higher chance of duplicates in your reference list due to errors in the reference \ninformation entered in the reference managers.***")
for i in range(len(df)):
  for j in range(len(df)):
    if i != j and fuzz.ratio(df.iloc[i].to_string(), df.iloc[j].to_string()) >= similarity:
      id1 = len(df.iloc[i].to_string())
      id2 = len(df.iloc[j].to_string())
      # Skip pairs where both rows are 14 characters long, which appears to be the
      # length of to_string() for the "Mahdavian" placeholder rows.
      if id1 != 14 or id2 != 14:
        if i not in reported and j not in reported:
          print("Ref. Number #", i+1, "is similar to Ref. Number #", j+1)
          reported.append(j)
        s = s + 1
if s == 0:
  print("NO SIMILARITY WAS DETECTED.")


Suspected duplicate references with similarity above or equal to 85 %:
***Remove them manually after a thorough check. You can decrease the similarity % if you think 
there is a higher chance of duplicates in your reference list due to errors in the reference 
information entered in the reference managers.***
Ref. Number # 15 is similar to Ref. Number # 35
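
Printing the score next to each flagged pair can make the manual check easier to prioritise. The sketch below visits each pair only once with `itertools.combinations` and uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`, so its scores may differ from FuzzyWuzzy's. `similar_pairs` and the sample references are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similar_pairs(references, threshold=85):
    """Return (score, i, j) for each pair of 1-based reference numbers scoring >= threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(references, start=1), 2):
        score = round(100 * SequenceMatcher(None, a, b).ratio())
        if score >= threshold:
            pairs.append((score, i, j))
    return sorted(pairs, reverse=True)  # highest scores first

refs = [
    'A. Author, "Hypothetical coating study," J. Example, vol. 1, 2020.',
    'B. Writer, "An unrelated polymer topic," Other J., vol. 7, 2018.',
    'A. Author, "Hypothetical coating study", J. Example, vol 1, 2020.',
]
result = similar_pairs(refs)
for score, i, j in result:
    print("Ref. Number #", i, "is similar to Ref. Number #", j, f"({score} %)")
```

Visiting each unordered pair once roughly halves the number of comparisons and avoids reporting the same pair in both directions.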
