In [1]:
import pandas as pd

# Data Cleaning

Start by removing unneeded variables

In [2]:
#load data and select columns
df = pd.read_csv("./data/all_rawis.csv").loc[
    :, ['scholar_indx', 'name', 'grade', 'teachers', 'students',
        'area_of_interest', 'tags', 'students_inds', 'teachers_inds']].rename(
    columns={'grade':'generation'})

There are 24326 records in the uncleaned data.

## Remove Records Without Teachers

Because of the nature of hadith, every narrator should have at least one teacher except the Prophet, who has no teachers. Let's remove all records without teacher indices, except the Prophet. This will also have the effect of ensuring there are no isolated nodes. There are some records that have teacher names but not indices, these are few enough that we can ignore them.

In [3]:
#doing this inside a function to keep variables local
#return dataset after removing all records without teacher indices, except the Prophet
def rm_noteachers(df):
    prophet = df.loc[0,:].to_dict()
    narrators = df.loc[1:,:].dropna(subset=['teachers_inds'])
    
    return narrators.append(
        prophet,ignore_index=True).sort_values(
        by='scholar_indx',axis='index').set_index('scholar_indx',drop=False)

#modify the data
df = rm_noteachers(df)

There are now 13105 records remaining.

## Remove Untrustworthy Narrators

The `area_of_interest` column contains some data on trustworthiness for 5448 records, so let's use this to remove all records with a reputation of less than 'sahih'.

First, let's extract all of the grades and add them to the dataset as a new column `grade`.

In [4]:
#extract grades
grades = []
for item in df['area_of_interest'].str.findall(r"\[Grade:[^\]]+\]").values:
    if isinstance(item, list):
        if len(item) > 0:
            grade = item[0].lower().strip('[]').lstrip('grade').lstrip(':')
            grades.append(grade)
            continue
    grades.append('undefined')

#assign grade column to the df and remove area_of_interest column
df = df.assign(grade=grades)
del df['area_of_interest']
del grades #remove unneeded variable

Take a look at the unique grade values:

In [5]:
df['grade'].unique()

array(['undefined', 'no doubt', 'thiqah thiqah', 'thiqah', 'maqbool',
       'sadooq', 'weak', 'sadooq/delusion', 'not thiqah',
       'unknown-majhool', 'abandoned', 'liar', 'accused liar'],
      dtype=object)

See the data dictionary at muslimscholars.info for explanations of each grade.

We are including:
TODO

We are excluding:
TODO

Now let's remove all untrustworthy narrators.

In [6]:
#again working inside a function to keep variables local
#return dataset after removing all untrustworthy narrators
def rm_untrustworthy(df):
    untrustworthy = [] #this will hold indices to drop
    drop_grades = ['abandoned','liar','accused liar','weak','unknown-mahjool'] #grades we are excluding
    
    #build list of indices of untrustworthy narrators
    for indx, grade in df['grade'].items():
        if (grade in drop_grades):
            untrustworthy.append(indx)
                
    return df.drop(index=untrustworthy)
    
#modify the data
df = rm_untrustworthy(df)

There are now TODO records remaining.

In [7]:
df.shape

(12692, 9)

## Create Gender Variable

Now let's clean up the `tags` column. We are only interested in the `Female` tag, so let's create a new column `gender` where records with a `Female` tag are encoded as `"f"` and records without a `Female` tag are encoded as `"m"`.

In [8]:
#tags are stored as a single string
#return "f" if "female" found in tags, otherwise "m"
def get_gender(tag):
    if pd.isna(tag):
        return "m"
    elif tag.lower().find('female') == -1:
        return "m"
    else:
        return "f"

#assign gender column to the df and remove tags column
df = df.assign(gender=df['tags'].transform(get_gender))
del df['tags']

Check the ratio of males to females in the dataset

In [9]:
df['gender'].value_counts()

m    11484
f     1208
Name: gender, dtype: int64

In [10]:
df

Unnamed: 0_level_0,scholar_indx,name,generation,teachers,students,students_inds,teachers_inds,grade,gender
scholar_indx,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
1,1,Prophet Muhammad(saw) ( محمّد صلّی اللہ علیہ و...,Rasool Allah,,"Abu Bakr As-Siddique [2] , 'Umar ibn al-Khatta...","2, 3, 4, 5, 6, 7, 8, 9, 10, 16, 17, 18, 19, 53...",,undefined,m
2,2,Abu Bakr As-Siddique ( أبو بكر الصديق ( رضي ال...,Comp.(RA) [1st Generation],"Muhammad (saw) [1] ,","'Umar ibn al-Khattab [3] , 'Uthman ibn 'Affaan...","3, 4, 5, 8, 49, 53, 107, 168, 17, 106, 18, 29,...",1,undefined,m
3,3,'Umar ibn al-Khattab ( عمر بن الخطاب بن نفيل (...,Comp.(RA) [1st Generation],"Muhammad (saw) [1] , Abu Bakr As-Siddique [2]","Hafsa bint Umar [54] , ibn Umar [18] , 'Asim b...","54, 18, 563, 4, 5, 6, 9, 8, 39, 16, 27, 28, 49...","1, 2",undefined,m
4,4,'Uthman ibn 'Affaan ( عثمان بن عفان ( رضي الله...,Comp.(RA) [1st Generation],"Muhammad (saw) [1] , Abu Bakr As-Siddique [2] ...","Aban bin 'Uthman [10582] , Sa'id bin 'Uthman [...","10582, 10587, 16, 49, 123, 391, 13, 19, 16, 17...","1, 2, 3",undefined,m
5,5,Ali ibn Abi Talib ( علي بن أبي طالب بن عبد الم...,Comp.(RA) [1st Generation],"Muhammad (saw) [1] , Abu Bakr As-Siddique [2] ...","Hassan ibn Ali bin Abi Talib [30] , Hussain ib...","30, 31, 16, 400, 13, 38, 182, 438, 17, 18, 106...","1, 2, 3, 63, 163",undefined,m
...,...,...,...,...,...,...,...,...,...
38948,38948,Yazid bin S'aid al-Sabahi يزيد بن سعيد الصباحي,3rd Century AH,"Imam Maalik [20001] , Y'aqub bin 'Abdur Rahman...",,,"20001, 20185",thiqah,m
38992,38992,'Abdul Qahir bin Rashid bin Sa'd عبد القاهر بن...,3rd Century AH,Rashidayn bin Sa'd al-Mahri al-Qayni [26012],,,26012,thiqah,m
38996,38996,Muslim bin Yazid bin Madhkur مسلم بن يزيد بن م...,3rd Century AH,Yazid bin Mdhkwr al-Hmdany [17256],,,17256,thiqah,m
39104,39104,Ibrahim bin Musa al-Maktab إبراهيم بن موسى المكتب,3rd Century AH,Ma'mar bin Suliaman [28578],"Y'aqub bin Sufyan [38719] , أبو حامد بن هارون ...",38719,28578,thiqah,m
