In [41]:
import pandas as pd
import math

# Data Cleaning

Start by loading the data and keeping only the columns we need.

In [2]:
#load data and select columns
df = pd.read_csv("./data/all_rawis.csv", index_col='scholar_indx').loc[
    :, ['name', 'grade', 'teachers', 'students', 'area_of_interest',
        'tags', 'students_inds', 'teachers_inds']]

There are 24326 records in the uncleaned data.

We now want to remove records that have neither teachers nor students, since these would be isolated nodes in the graph.

Dropping records with neither teacher nor student indices leaves 13755 remaining records, 651 of which still lack teacher indices, and 383 of which lack teacher names.

Dropping records with neither teacher nor student names leaves 13870 remaining records, 766 of which lack teacher indices and 434 of which lack teacher names.

In either case, around 300 of the remaining records contain teacher names but not indices (651 - 383 = 268 and 766 - 434 = 332). Since both this number and 13870 - 13755 = 115 are relatively small, it should be safe to drop records based on the absence of teacher/student indices.
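The comparison between the two filtering options can be reproduced with a small helper. This is a minimal sketch rather than the notebook's own code: `compare_drop_strategies` and the toy frame are hypothetical, and only the column names are taken from the data.

```python
import pandas as pd

def compare_drop_strategies(frame):
    """Count what remains after dropping by index columns vs. by name columns."""
    # option 1: drop rows where both index columns are missing
    by_inds = frame.dropna(how="all", subset=["students_inds", "teachers_inds"])
    # option 2: drop rows where both name columns are missing
    by_names = frame.dropna(how="all", subset=["students", "teachers"])
    return {
        "kept_by_inds": len(by_inds),
        "kept_by_names": len(by_names),
        "by_inds_missing_teacher_inds": int(by_inds["teachers_inds"].isna().sum()),
        "by_inds_missing_teacher_names": int(by_inds["teachers"].isna().sum()),
    }

# toy data (made up) with the same column names as the real file
toy = pd.DataFrame(
    {
        "teachers": ["A [1]", None, "B [2]", None],
        "students": [None, "C [3]", None, None],
        "teachers_inds": ["1", None, None, None],
        "students_inds": [None, "3", None, None],
    },
    index=pd.Index([10, 11, 12, 13], name="scholar_indx"),
)
summary = compare_drop_strategies(toy)
print(summary)
```

Calling `compare_drop_strategies(df)` on the frame as loaded, before any rows are dropped, should give the counts quoted in the text, assuming the columns are unchanged.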

It may be that some of the remaining records without teacher indices are simply missing data. Would we expect anyone other than the Prophet to have no teachers? Is that even possible?

In [3]:
#remove records missing both teacher and student indices
df.dropna(how='all', subset=['students_inds', 'teachers_inds'], inplace=True)

#count records without teacher indices
df['teachers_inds'].isna().value_counts()

False    13104
True       651
Name: teachers_inds, dtype: int64

Now let's clean up the `tags` column. We are only interested in the `Female` tag, so let's create a new column `gender` where records with a `Female` tag are encoded as `"f"` and records without a `Female` tag are encoded as `"m"`.

In [4]:
#tags are stored as a single string
#return "f" if "female" is found in tags (case-insensitive), otherwise "m"
def get_gender(tag):
    #records with no tags at all are encoded as "m"
    if pd.isna(tag):
        return "m"
    return "f" if 'female' in tag.lower() else "m"

#append gender column to the df and remove tags column
df = df.assign(gender=df['tags'].transform(get_gender))
del df['tags']

Check the counts of males and females in the dataset.

In [5]:
df['gender'].value_counts()

m    12523
f     1232
Name: gender, dtype: int64
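The counts above can be turned into an explicit ratio and shares. This is a small sketch that re-enters the two counts by hand; on the real frame, `df['gender'].value_counts(normalize=True)` should give the same shares.

```python
import pandas as pd

# counts copied from the value_counts() output above
counts = pd.Series({"m": 12523, "f": 1232}, name="gender")

# share of each gender among all remaining records
shares = counts / counts.sum()

# number of male records per female record
ratio = counts["m"] / counts["f"]

print(shares.round(3))
print(round(ratio, 2))
```

So there are roughly ten male records for every female record, and female narrators make up about 9% of the cleaned data.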

The `area_of_interest` column contains some data on trustworthiness, so we can use it to remove all records with a reputation below 'sahih'. First, check which columns remain.

In [52]:
df.columns

Index(['name', 'grade', 'teachers', 'students', 'area_of_interest',
       'students_inds', 'teachers_inds', 'gender'],
      dtype='object')
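The filtering step itself could look something like the sketch below. It assumes the reliability label appears in `area_of_interest` in the form `Narrator[Grade:Thiqah]`, as in the rows shown in this notebook. The function `filter_by_grade`, its `accepted` parameter, and the toy rows are hypothetical; which labels count as at least 'sahih' is not settled here and has to be supplied by the analyst.

```python
import pandas as pd

def filter_by_grade(frame, accepted, keep_untagged=True):
    """Keep rows whose extracted grade label is in `accepted`."""
    # pull the text after "Grade:" up to the closing bracket; rows without the tag become NaN
    label = (
        frame["area_of_interest"]
        .str.extract(r"Grade:\s*([^\]]+)", expand=False)
        .str.strip()
    )
    keep = label.isin(accepted)
    if keep_untagged:
        # rows with no grade tag at all are kept rather than silently dropped
        keep = keep | label.isna()
    return frame[keep]

toy = pd.DataFrame({"area_of_interest": [
    "Narrator[Grade:Thiqah]",          # format seen in the data
    "Tafsir/Quran, Hadith",            # no grade tag
    None,                              # missing value
    "Narrator[Grade:SomeOtherGrade]",  # made-up label, for illustration only
]})

strict = filter_by_grade(toy, {"Thiqah"}, keep_untagged=False)
lenient = filter_by_grade(toy, {"Thiqah"})
print(len(strict), len(lenient))
```

The `keep_untagged` switch matters because many early records in this data carry no `Grade:` tag at all, so a strict filter would remove them too.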

Now we want to get lists of narrators and their students.
The following code builds a dictionary with a single entry for Aishah r.a., mapping her index (53) to a list of her students' ids.

In [7]:
aishahsStudents = [int(x) for x in df.loc[53]['students_inds'].split(", ")]
students = {53 : aishahsStudents}
students # a dictionary for Aishah r.a. and her students

{53: [70, 106, 13, 17, 18, 41, 28, 10535, 10511, 10520, 10521, 10522, 11002]}

The following code segment expands the dictionary to include a key for each of Aishah's students, whose value is a list of that student's own students.

In [25]:
for i in aishahsStudents: # iterate through Aishah's students
    students[i] = [int(x) for x in df.loc[i]['students_inds'].split(", ")]

len(students) # a dictionary for Aishah r.a. and her students and her students' students

14

Goal: we want to make a function that, given a starting index and a max degree, will return a dictionary of teachers and students.

Note: I am using a try statement here because the data contains unexpected index values (such as NaN) that raise errors when split() is called on them. The try block simply skips any entry that raises an error.

In [46]:


def studentLists(startInd, maxDegree):
    # build the initial dictionary with an entry for startInd (the starting narrator) and a list of all her students
    narrators = {startInd : [int(x) for x in df.loc[startInd]['students_inds'].split(", ")]}
    for _ in range(maxDegree):
        # iterate over a snapshot of the values, because a dictionary cannot change size while it is being iterated
        values = list(narrators.values())
        for valuelist in values: # each valuelist is a list of student indices
            for value in valuelist:
                try:
                    narrators[value] = [int(x) for x in df.loc[value]['students_inds'].split(", ")]
                except (KeyError, AttributeError, ValueError):
                    # skip indices that are missing from df or whose students_inds is NaN or malformed
                    continue
    return narrators
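As an alternative to catching errors, the index strings could be parsed defensively. The helper below is a sketch under the assumption that a well-formed cell is a comma-separated string of integers such as `"70, 106, 13"`; `parse_inds` is a hypothetical name and is not part of the notebook.

```python
def parse_inds(raw):
    """Turn a cell such as "70, 106, 13" into [70, 106, 13].

    Missing cells (None, NaN) and non-numeric tokens yield no entries,
    so the caller does not need a try block around split().
    """
    # str() makes NaN and None harmless: "nan" and "None" are not decimal tokens
    tokens = (part.strip() for part in str(raw).split(","))
    return [int(tok) for tok in tokens if tok.isdecimal()]

print(parse_inds("70, 106, 13"))
print(parse_inds(float("nan")))
```

One trade-off of this design is that malformed tokens are dropped silently, so it may be worth counting them separately if data quality is a concern.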


- startInd = 53 -> Aishah bint Abi Bakr r.a.
- maxDegree = 0 -> Aishah and her students
- maxDegree = 1 -> Aishah and her students and her students' students
- maxDegree = 2 -> Aishah and her students and her students' students and her students' students' students.
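The meaning of `maxDegree` listed above can also be sketched as a level-by-level traversal that only visits the newest layer of students on each pass, instead of rescanning every list. This is an illustrative variant, not the notebook's function: `students_by_degree` and the `students_of` lookup (a plain dict from narrator id to a list of student ids) are hypothetical, and the toy graph is made up.

```python
def students_by_degree(start, max_degree, students_of):
    """Return {narrator: students} for narrators within max_degree steps of start.

    students_of is assumed to map a narrator id to a list of student ids;
    narrators with no recorded students are simply absent from it.
    """
    narrators = {start: students_of.get(start, [])}
    frontier = list(narrators[start])  # students discovered in the previous pass
    for _ in range(max_degree):
        next_frontier = []
        for sid in frontier:
            # skip narrators already recorded or with no student list
            if sid in narrators or sid not in students_of:
                continue
            narrators[sid] = students_of[sid]
            next_frontier.extend(students_of[sid])
        frontier = next_frontier
    return narrators

# toy graph: 1 teaches 2 and 3, 2 teaches 4, 4 teaches 5, 5 teaches 1 (a cycle)
toy = {1: [2, 3], 2: [4], 4: [5], 5: [1]}
print(students_by_degree(1, 0, toy))
print(students_by_degree(1, 2, toy))
```

With `max_degree = 0` only the starting narrator is a key, and each extra degree adds the students of the previous layer, which matches the list above.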

In [48]:
x = studentLists(53, 2)
len(x)

153

In [51]:
x

In [41]:
df

scholar_indx,name,grade,teachers,students,area_of_interest,students_inds,teachers_inds,gender
1,Prophet Muhammad(saw) ( محمّد صلّی اللہ علیہ و...,Rasool Allah,,"Abu Bakr As-Siddique [2] , 'Umar ibn al-Khatta...","Tafsir/Quran, Recitation/Quran, Hadith, Comman...","2, 3, 4, 5, 6, 7, 8, 9, 10, 16, 17, 18, 19, 53...",,m
2,Abu Bakr As-Siddique ( أبو بكر الصديق ( رضي ال...,Comp.(RA) [1st Generation],"Muhammad (saw) [1] ,","'Umar ibn al-Khattab [3] , 'Uthman ibn 'Affaan...","Tafsir/Quran, Recitation/Quran, Narrator [ ع ...","3, 4, 5, 8, 49, 53, 107, 168, 17, 106, 18, 29,...",1,m
3,'Umar ibn al-Khattab ( عمر بن الخطاب بن نفيل (...,Comp.(RA) [1st Generation],"Muhammad (saw) [1] , Abu Bakr As-Siddique [2]","Hafsa bint Umar [54] , ibn Umar [18] , 'Asim b...","Tafsir/Quran, Recitation/Quran, Narrator [ ع ...","54, 18, 563, 4, 5, 6, 9, 8, 39, 16, 27, 28, 49...","1, 2",m
4,'Uthman ibn 'Affaan ( عثمان بن عفان ( رضي الله...,Comp.(RA) [1st Generation],"Muhammad (saw) [1] , Abu Bakr As-Siddique [2] ...","Aban bin 'Uthman [10582] , Sa'id bin 'Uthman [...","Tafsir/Quran, Narrator [ ع - صحابة ], Fiqh, ...","10582, 10587, 16, 49, 123, 391, 13, 19, 16, 17...","1, 2, 3",m
5,Ali ibn Abi Talib ( علي بن أبي طالب بن عبد الم...,Comp.(RA) [1st Generation],"Muhammad (saw) [1] , Abu Bakr As-Siddique [2] ...","Hassan ibn Ali bin Abi Talib [30] , Hussain ib...","Tafsir/Quran, Recitation/Quran, Narrator [ ع ...","30, 31, 16, 400, 13, 38, 182, 438, 17, 18, 106...","1, 2, 3, 63, 163",m
...,...,...,...,...,...,...,...,...
38948,Yazid bin S'aid al-Sabahi يزيد بن سعيد الصباحي,3rd Century AH,"Imam Maalik [20001] , Y'aqub bin 'Abdur Rahman...",,Narrator[Grade:Thiqah],,"20001, 20185",m
38992,'Abdul Qahir bin Rashid bin Sa'd عبد القاهر بن...,3rd Century AH,Rashidayn bin Sa'd al-Mahri al-Qayni [26012],,Narrator[Grade:Thiqah],,26012,m
38996,Muslim bin Yazid bin Madhkur مسلم بن يزيد بن م...,3rd Century AH,Yazid bin Mdhkwr al-Hmdany [17256],,Narrator[Grade:Thiqah],,17256,m
39104,Ibrahim bin Musa al-Maktab إبراهيم بن موسى المكتب,3rd Century AH,Ma'mar bin Suliaman [28578],"Y'aqub bin Sufyan [38719] , أبو حامد بن هارون ...",Narrator[Grade:Thiqah],38719,28578,m
