In [1]:
import pandas as pd

# Data Cleaning

Start by removing unneeded variables

In [2]:
#load data and select columns
df = pd.read_csv("./data/all_rawis.csv").loc[
    :, ['scholar_indx', 'name', 'grade', 'area_of_interest',
        'tags', 'students_inds', 'teachers_inds']].rename(
    columns={'grade':'generation'})

Some records have duplicated data under different scholar indices. Let's remove the duplicates.

In [3]:
df.drop_duplicates(subset=['name','generation','area_of_interest','tags','students_inds','teachers_inds'],
                  keep='first',ignore_index=True,inplace=True)

There are now 24247 records in the data.

## Remove Records Without Teachers

Because of the nature of hadith, every narrator should have at least one teacher except the Prophet, who has no teachers. Let's remove all records without teacher indices that also do not appear in any other scholars' student indices, except the Prophet. This will also have the effect of ensuring there are no isolated nodes. There are some records that have teacher names but not indices, these are few enough that we can ignore them.

In [4]:
#doing this inside a function to keep variables local
#return dataset after removing all records without teacher indices, except the Prophet
def rm_noteachers(df):
    prophet = df.loc[0,:].to_dict()#the prophet
    
    has_teacher=set()#all scholar_indx which are listed as students somewhere
    for inds in df['students_inds']:
        if isinstance(inds, list):
            for ind in inds:
                has_teacher.add(ind)

    no_teachers_inds = list(df.loc[1:,'teachers_inds'].isna())#whether teachers_inds is NA by position
    no_teachers = set()#all scholar_indx verified to have no teachers in the dataset
    for i in range(len(no_teachers_inds)):
        if no_teachers_inds[i]:
            if df['scholar_indx'][i+1] in has_teacher:
                pass
            else:
                no_teachers.add(df['scholar_indx'][1+i])

    narrators = df.loc[1:,:].query('scholar_indx not in '+str(list(no_teachers)))   
    
    return narrators.append(
        prophet,ignore_index=True).sort_values(
        by='scholar_indx',axis='index').set_index('scholar_indx',drop=False)

#modify the data
df = rm_noteachers(df)

There are now 13061 records remaining.

## Remove Untrustworthy Narrators

The `area_of_interest` column contains some data on trustworthiness for ~5500 records, so let's use this to remove all records with a reputation of less than 'sahih'.

First, let's extract all of the grades and add them to the dataset as a new column `grade`.

In [5]:
#extract grades
grades = []
for item in df['area_of_interest'].str.findall(r"\[Grade:[^\]]+\]").values:
    if isinstance(item, list):
        if len(item) > 0:
            grade = item[0].lower().strip('[]').lstrip('grade').lstrip(':')
            grades.append(grade)
            continue
    grades.append('undefined')

#assign grade column to the df and remove area_of_interest column
df = df.assign(grade=grades)
del df['area_of_interest']
del grades #remove unneeded variable

Take a look at the unique grade values:

In [6]:
df['grade'].unique()

array(['undefined', 'no doubt', 'thiqah thiqah', 'thiqah', 'maqbool',
       'sadooq', 'weak', 'sadooq/delusion', 'not thiqah',
       'unknown-majhool', 'abandoned', 'liar', 'accused liar'],
      dtype=object)

See the data dictionary at muslimscholars.info for explanations of each grade.

We are including:
no doubt (Companions), thiqah thiqah (Awthaqun Nas), thiqah (Thiqat), sadooq (Saduq), sadooq/delusion (Saduq Yahim), maqbool (Maqbool/Layyin), not thiqah (Majhool al-haal/Mastur), and undefined

We are excluding:
abandoned (Matruk), accused liar (Muttaham bi'l kadhib), liar (Kadhdhaab, waddaa'), weak (Da'eef), and unknown-majhool (Majhool)

Now let's remove all untrustworthy narrators.

In [7]:
#again working inside a function to keep variables local
#return dataset after removing all untrustworthy narrators
def rm_untrustworthy(df):
    untrustworthy = [] #this will hold indices to drop
    drop_grades = ['abandoned','liar','accused liar','weak','unknown-majhool'] #grades we are excluding
    
    #build list of indices of untrustworthy narrators
    for indx, grade in df['grade'].items():
        if (grade in drop_grades):
            untrustworthy.append(indx)
                
    return df.drop(index=untrustworthy)
    
#modify the data
df = rm_untrustworthy(df)

There are now 12,429 records remaining.

## Create Gender Variable

Now let's clean up the `tags` column. We are only interested in the `Female` tag, so let's create a new column `gender` where records with a `Female` tag are encoded as `"f"` and records without a `Female` tag are encoded as `"m"`.

In [8]:
#tags are stored as a single string
#return "f" if "female" found in tags, otherwise "m"
def get_gender(tag):
    if pd.isna(tag):
        return "m"
    elif tag.lower().find('female') == -1:
        return "m"
    else:
        return "f"

#assign gender column to the df and remove tags column
df = df.assign(gender=df['tags'].transform(get_gender))
del df['tags']

It's possible that some female narrators were missing tags and have been misclassified. Since by convention women's names contain "bint" (i.e. "daughter of") while men's names contain "bin" (i.e. "son of") we can check to see if any scholars with "bint" in their names have been classified as men. Since names may contain multiple levels of ancestors, the presence of "bint" is necessary but not sufficient to conclude a record has been misclassified.

In [9]:
#detect names with 'bint' that are classified 'm'
#generate array of indices with potentially misclassified gender
flagged = []
for indx, data in df.loc[:,['name','gender']].iterrows():
    if (data[0].lower().find('bint') >= 0) & (data[1] == 'm'):
        flagged.append(indx)

Now let's look at the records flagged as misclassified. There are only 15, so we can manually double-check all of them. Some will not appear here because we have since altered them in the dataset.

In [10]:
df.filter(items=flagged,axis='index')

Unnamed: 0_level_0,scholar_indx,name,generation,students_inds,teachers_inds,grade,gender
scholar_indx,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
186,186,Umm Kulthum bint 'Amr (Jarwal) ( أم كلثوم بنت ...,Comp.(RA),,1,undefined,m
6222,6222,al-Swda'a bint 'Asim Lha Shbh ( السوداء ( رضي ...,Comp.(RA),,1,undefined,m
6415,6415,al-Fry'h bint Wahub al-Zhryh Rf'ha ( الفريعة ب...,Comp.(RA),,1,undefined,m
6607,6607,'Abdullah al-Bakry Rwt bint'h Bhyh ( عبد الله ...,Comp.(RA),,1,undefined,m
6864,6864,Hqh bint 'Amr Slt al-Qbltyn ( حقة بنت عمرو صلت...,Comp.(RA),,1,undefined,m
7729,7729,al-Tfyl bin Akhy Jwyryh bint al-Harith ( الطفي...,Comp.(RA),,1,undefined,m
8367,8367,Khbab Mwla Fitimah bint 'Utba bin Rabi'y ( خبا...,Comp.(RA),,1,undefined,m
8708,8708,Shrhbyl bin Habib Zwj al-Shfa'a bint ( شرحبيل ...,Comp.(RA),,1,undefined,m
9179,9179,'Isa al-Msyh bin Maryam al-Sdyqh bint ( عيسى ا...,Comp.(RA),,1,undefined,m
18885,18885,Kinanah Client of Safiyya bint Huyayy كنانة مو...,Follower(Tabi') [3rd Generation],"20604, 24467, 20323, 20321, 28095","60, 4, 13, 11041",maqbool,m


Notes: 

- 641's name starts with a "bint" but their spouse is listed as 'Iqrab bint Mua'dh on muslimscholars.info. Lesbian companion to the prophet???
    - No, 641 is a male; this is a translation error. The correct translation of the Arabic name should be "Qays bin al-Khutaym al-Ansari".

- The latter half of 7729's name does correspond to one of the prophet's wives but I cant tell what's up with the first half.
    - The correct translation of the Arabic name would be "al-Tufayl son of the brother of Juwayriyyah bint al-Harith". So he was male. 

- 9179 seems to be Jesus but why would Jesus be listed as a companion?
    - That's so weird??? Checking his profile on https://muslimscholars.info/manage.php?submit=scholar&ID=9179, yes it is Jesus - even the Arabic biography on the page matches up. I have no idea why he's listed as a companion, nor why it says his "place of stay" is the Hijaz? This entry should be deleted.

- On muslimscholars.info 18885 is identified as "Kinanah (bin Nabiyya), Client of Safiyya bint Huyayy" so I'm assuming male classification is correct.
    - Yes, I agree - correct male classification. 

Typos:
- 641: Qays bint al-Khutaym should be translated as Qays bin al-Khutaym. In fact, every time his name is mentioned in English it is spelled with the "bint" typo - this includes the instances where his children's names are listed. So I corrected all the instances of this in the all_rawis.csv file.
- 3046: It shouldn't be "Qutila bint Qays." Qutila bint Qays was indeed a female, but #3046 is a male and his name, correctly translated from Arabic, should be Qays bin al-Makshuh al-Muradi. I corrected this in the csv file.
-3117: Should be "Mu'awiya bin Suwayd al-Mazni" (not bint). Corrected in file.

Manually checking the flagged records identified the following records as misclassified:

In [11]:
#indices of misclassified records - individuals who were listed as male but are in fact female. 
misclassified = [186, 6222, 6415, 6864]

#generate corrected gender column
corrected_gender = []
for indx, gender in df['gender'].items():
    if indx in misclassified:
        corrected_gender.append('f')
    else:
        corrected_gender.append(gender)

#remove old gender column and assign new gender column to the df
del df['gender']
df = df.assign(gender=corrected_gender)
del flagged #remove unneeded variables
del misclassified
del corrected_gender

Now that we're comfortable with our gender classifications, let's check the gender ratio in the dataset

In [12]:
df['gender'].value_counts()

m    11225
f     1204
Name: gender, dtype: int64

## Clean Up Edgelists

Connections for each scholar are stored as strings containing lists of their students' and teachers' numeric indices. Let's turn these into lists of numeric indices.

In [13]:
#build corrected columns for students_inds and teachers_inds
#strings of lists of numbers ---> lists of numeric indices
students_inds_corrected = []
teachers_inds_corrected = []
for indx, data in df.loc[:,['students_inds','teachers_inds']].iterrows():
    
    students = data[0]
    if pd.isna(students):
        students_inds_corrected.append(students)
    elif isinstance(students, str):
        students_temp = []
        for item in students.split(','):
            if item.strip().isdigit():
                students_temp.append(int(item.strip()))
            else:
                print("non-numeric in students_inds at id="+str(indx)+", value: "+item.strip())
        students_inds_corrected.append(students_temp)
    else:
        raise TypeError("students_inds value at indx "+str(indx)+" is neither str nor NaN")
    
    teachers = data[1]
    if pd.isna(teachers):
        teachers_inds_corrected.append(teachers)
    elif isinstance(teachers, str):
        teachers_temp = []
        for item in teachers.split(','):
            if item.strip().isdigit():
                teachers_temp.append(int(item.strip()))
            else:
                print("non-numeric in teachers_inds at id="+str(indx)+", value: "+item.strip())
        teachers_inds_corrected.append(teachers_temp)
    else:
        raise TypeError("teachers_inds value at indx "+str(indx)+" is neither str nor NaN")

#remove old columns and assign corrected columns to the dataset
del df['students_inds']
del df['teachers_inds']
df = df.assign(students_inds=students_inds_corrected, teachers_inds=teachers_inds_corrected)
del students_inds_corrected #remove unneeded variables
del teachers_inds_corrected

## Export Data

Let's look at the cleaned dataset

In [14]:
df

Unnamed: 0_level_0,scholar_indx,name,generation,grade,gender,students_inds,teachers_inds
scholar_indx,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,1,Prophet Muhammad(saw) ( محمّد صلّی اللہ علیہ و...,Rasool Allah,undefined,m,"[2, 3, 4, 5, 6, 7, 8, 9, 10, 16, 17, 18, 19, 5...",
2,2,Abu Bakr As-Siddique ( أبو بكر الصديق ( رضي ال...,Comp.(RA) [1st Generation],undefined,m,"[3, 4, 5, 8, 49, 53, 107, 168, 17, 106, 18, 29...",[1]
3,3,'Umar ibn al-Khattab ( عمر بن الخطاب بن نفيل (...,Comp.(RA) [1st Generation],undefined,m,"[54, 18, 563, 4, 5, 6, 9, 8, 39, 16, 27, 28, 4...","[1, 2]"
4,4,'Uthman ibn 'Affaan ( عثمان بن عفان ( رضي الله...,Comp.(RA) [1st Generation],undefined,m,"[10582, 10587, 16, 49, 123, 391, 13, 19, 16, 1...","[1, 2, 3]"
5,5,Ali ibn Abi Talib ( علي بن أبي طالب بن عبد الم...,Comp.(RA) [1st Generation],undefined,m,"[30, 31, 16, 400, 13, 38, 182, 438, 17, 18, 10...","[1, 2, 3, 63, 163]"
...,...,...,...,...,...,...,...
38948,38948,Yazid bin S'aid al-Sabahi يزيد بن سعيد الصباحي,3rd Century AH,thiqah,m,,"[20001, 20185]"
38992,38992,'Abdul Qahir bin Rashid bin Sa'd عبد القاهر بن...,3rd Century AH,thiqah,m,,[26012]
38996,38996,Muslim bin Yazid bin Madhkur مسلم بن يزيد بن م...,3rd Century AH,thiqah,m,,[17256]
39104,39104,Ibrahim bin Musa al-Maktab إبراهيم بن موسى المكتب,3rd Century AH,thiqah,m,[38719],[28578]


Save dataset to pickle

In [15]:
df.to_pickle('./data/cleaned_rawis.pkl')