# Data Quality Check

This file checks whether the articles in "article_list.csv" match the articles in "salt_author_info.csv" (converted from "salt_author_info.xlsx"), using the following steps:

- "article_list.csv" -> data frame "article_list"
- "salt_author_info.csv" -> data frame "author"
- take the "ID" and "Title" columns from article_list -> right_df
- take the "ID" and "original_title" columns from author, deduplicate (the same article ID appears in multiple rows) and sort by ID -> left_df
- merge the two data frames on ID and check whether the two titles are the same (i.e., no title/ID mismatch)
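
The steps above can be sketched end to end on toy data. This is only an illustration: the rows below are hypothetical stand-ins for the two CSV files, and the final filter is one possible way to express the title check, not necessarily the one used later in this notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files (not the real data).
article_list = pd.DataFrame({
    "ID": [1, 2, 3],
    "Title": ["Paper A", "Paper B", "Workshop name"],
})
author = pd.DataFrame({
    "ID": [1, 1, 2, 3],  # one row per author, so article IDs repeat
    "original_title": ["Paper A", "Paper A", "Paper B", "Paper C"],
})

# One row per article on each side.
right_df = article_list[["ID", "Title"]]
left_df = author[["ID", "original_title"]].drop_duplicates().sort_values(by="ID")

# Align the two title columns by ID.
merged_df = left_df.merge(right_df, on="ID")

# Keep only the rows where the two titles differ.
mismatches = merged_df[merged_df["Title"] != merged_df["original_title"]]
print(mismatches)
```

In this toy example only the row with ID 3 ends up in `mismatches`.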

Result: one mismatch was found, article ID 36 (row 24 of the merged data frame, counting from 0). This is because the originally deposited dataset mistook the workshop name for the title. All other articles match.

Conclusion: the IDs in "salt_author_info.csv" follow the IDs in "article_list.csv".

import pandas as pd

In [43]:
author = pd.read_csv("salt_author_info.csv")
attr_list = pd.read_csv("Article_attr.csv")
article_list = pd.read_csv("Article_list.csv")

In [44]:
article_list.head()

,ID,paper assigned ID,Type,Study Groupings,Title,year,Attitude,Doi,Retracted (Y/N)
0,1,Hooper2002,Systematic Review,,Systematic review of long term effects of advi...,2002,inconclusive,10.1136/bmj.325.7365.628,
1,2,Hooper2003,Systematic Review,,Reduced dietary salt for prevention of cardiov...,2003,inconclusive,10.1002/14651858.CD003656,
2,3,Hooper2004,Systematic Review,,Advice to reduce dietary salt for prevention o...,2004,inconclusive,10.1002/14651858.CD003656.pub2,
3,4,Strazzullo2009,Systematic Review,,"Salt intake, stroke, and cardiovascular diseas...",2009,for,10.1136/bmj.b4567,
4,5,Taylor2011a,Systematic Review,,Reduced dietary salt for the prevention of car...,2011,inconclusive,10.1038/ajh.2011.115,


In [45]:
len(article_list['ID'].unique())

82

In [46]:
len(author['ID'].unique())

82

In [65]:
author['ID'].unique()

array([ 1,  2,  3,  4,  5,  7,  9, 12, 13, 26, 27, 30, 31, 33, 34, 38, 39,
       40, 41, 42, 43, 45, 47, 48, 49, 50, 51, 52, 53, 54, 55, 58, 59, 61,
       62, 63, 64, 65, 66, 67, 68, 70, 71, 73, 74, 75, 76, 77, 78, 79, 80,
       81, 82, 83, 84, 85, 87, 88, 89, 90, 92,  6, 11, 46, 56, 57, 69, 14,
       28, 60, 86, 91, 93,  8, 10, 29, 32, 35, 36, 37, 44, 72],
      dtype=int64)

In [59]:
len(author['ID'].sort_values().unique())

82
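
Equal counts of unique IDs do not by themselves show that the two files contain the same IDs. A minimal sketch of a stricter check, using set differences on hypothetical ID columns (the two Series stand in for `article_list['ID']` and `author['ID']`):

```python
import pandas as pd

# Hypothetical ID columns; the values are made up for illustration.
article_ids = pd.Series([1, 2, 3, 5])
author_ids = pd.Series([1, 1, 2, 3, 4])

# Both sides have the same number of unique IDs ...
print(article_ids.nunique(), author_ids.nunique())

# ... yet each side contains an ID the other one lacks.
only_in_articles = set(article_ids) - set(author_ids)
only_in_authors = set(author_ids) - set(article_ids)
print(only_in_articles, only_in_authors)
```

If both differences are empty, the two ID sets are identical.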

In [48]:
left_df = author[['ID','original_title']].drop_duplicates().sort_values(by='ID')
len(left_df)

82
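
Because `drop_duplicates()` works on (ID, title) pairs, an ID that carried two different titles would survive as two rows. A length of 82, equal to the number of unique IDs, therefore suggests that each ID has exactly one title. A hedged sketch of making that check explicit, on a hypothetical author table in which one article has conflicting titles:

```python
import pandas as pd

# Hypothetical author table: article 2 appears with two different titles.
author = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "original_title": ["Paper A", "Paper A", "Paper B", "Paper B (draft)"],
})

# Deduplicating pairs leaves 3 rows for only 2 unique IDs.
pairs = author[["ID", "original_title"]].drop_duplicates()
print(len(pairs), author["ID"].nunique())

# Count distinct titles per ID and list the IDs with more than one.
titles_per_id = author.groupby("ID")["original_title"].nunique()
conflicting_ids = titles_per_id[titles_per_id > 1].index.tolist()
print(conflicting_ids)
```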

In [49]:
right_df = article_list[['ID','Title']]
len(right_df)

82

In [50]:
merged_df = left_df.merge(right_df, on='ID')
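
`DataFrame.merge` performs an inner join by default, so an ID present in only one of the two frames would be dropped without notice. One possible safeguard, sketched here on hypothetical frames, is an outer merge with `indicator=True` and `validate="one_to_one"`; the latter raises an error if an ID is duplicated on either side.

```python
import pandas as pd

# Hypothetical frames: ID 3 exists only on the left, ID 4 only on the right.
left_df = pd.DataFrame({"ID": [1, 2, 3], "original_title": ["A", "B", "C"]})
right_df = pd.DataFrame({"ID": [1, 2, 4], "Title": ["A", "B", "D"]})

# The outer join keeps every ID; the "_merge" column records where each row came from.
checked = left_df.merge(
    right_df, on="ID", how="outer", validate="one_to_one", indicator=True
)

# Rows that did not find a partner in the other frame.
unmatched = checked[checked["_merge"] != "both"]
print(unmatched)
```

An empty `unmatched` frame would indicate that every ID was paired.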

In [51]:
merged_df.head(5)

,ID,original_title,Title
0,1,Systematic review of long term effects of advi...,Systematic review of long term effects of advi...
1,2,Reduced dietary salt for prevention of cardiov...,Reduced dietary salt for prevention of cardiov...
2,3,Advice to reduce dietary salt for prevention o...,Advice to reduce dietary salt for prevention o...
3,4,"Salt intake, stroke, and cardiovascular diseas...","Salt intake, stroke, and cardiovascular diseas..."
4,5,Reduced dietary salt for the prevention of car...,Reduced dietary salt for the prevention of car...


In [52]:
# merged_df has a default 0..n-1 index after the merge, so .loc[i, ...] addresses row i.
for i in range(len(merged_df)):
    # Strict string comparison: any difference in case or whitespace counts as a mismatch.
    if merged_df.loc[i,'Title'] != merged_df.loc[i,'original_title']:
        print('Titles for article ', merged_df.loc[i,'ID'], ' DO NOT match.')
        print('\nTitle in the article list: ', merged_df.loc[i,'Title'])
        print('\nTitle in the author list: ', merged_df.loc[i,'original_title'])
    else:
        print('Titles for article ', merged_df.loc[i,'ID'], ' match.')

Titles for article  1  match.
Titles for article  2  match.
Titles for article  3  match.
Titles for article  4  match.
Titles for article  5  match.
Titles for article  6  match.
Titles for article  7  match.
Titles for article  8  match.
Titles for article  9  match.
Titles for article  10  match.
Titles for article  11  match.
Titles for article  12  match.
Titles for article  13  match.
Titles for article  14  match.
Titles for article  26  match.
Titles for article  27  match.
Titles for article  28  match.
Titles for article  29  match.
Titles for article  30  match.
Titles for article  31  match.
Titles for article  32  match.
Titles for article  33  match.
Titles for article  34  match.
Titles for article  35  match.
Titles for article  36  DO NOT match.

Title in the article list:  Workshop on Sodium and Blood Pressure

Title in the author list:  Multiple Risk Factor Intervention Trial follow-up
Titles for article  37  match.
Titles for article  38  match.
Titles for article  

In [None]:
## Only article 36 did not match. The reason is that the original dataset (deposited in the data bank) mistook the workshop name for the title.

# Cohen J. Multiple Risk Factor Intervention Trial follow-up. Workshop on Sodium and
# Blood Pressure. Bethesda, MD, National Heart, Lung, and Blood Institute. 1999

# Titles for article  36  DO NOT match.

# Title in the article list:  Workshop on Sodium and Blood Pressure

# Title in the author list:  Multiple Risk Factor Intervention Trial follow-up

In [63]:
attr_list['ID'].to_numpy()

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 26, 27, 28,
       29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45,
       46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62,
       63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
       80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93],
      dtype=int64)

In [64]:
len(attr_list['ID'].to_numpy())

82