In [1]:
import pandas as pd

# Data Cleaning

Start by loading the data and keeping only the columns we need.

In [2]:
#load data and select columns
df = pd.read_csv("./data/all_rawis.csv").loc[
    :, ['scholar_indx', 'name', 'grade', 'area_of_interest',
        'tags', 'students_inds', 'teachers_inds']].rename(
    columns={'grade':'generation'})

Some records have duplicated data under different scholar indices. Let's remove the duplicates.

In [3]:
df.drop_duplicates(subset=['name','generation','area_of_interest','tags','students_inds','teachers_inds'],
                  keep='first',inplace=True)

There are now 24247 records in the data.

## Remove Records Without Teachers

Because of the nature of hadith, every narrator should have at least one teacher except the Prophet, who has no teachers. Let's remove all records without teacher indices, except the Prophet. This will also have the effect of ensuring there are no isolated nodes. Some records have teacher names but no indices, but these are few enough that we can ignore them.

In [4]:
#doing this inside a function to keep variables local
#return dataset after removing all records without teacher indices, except the Prophet
def rm_noteachers(df):
    prophet = df.loc[[0],:] #the Prophet is the record at row label 0; keep it as a one-row dataframe
    narrators = df.loc[1:,:].dropna(subset=['teachers_inds'])
    
    #DataFrame.append was removed in pandas 2.0, so combine with pd.concat instead
    return pd.concat(
        [narrators, prophet],ignore_index=True).sort_values(
        by='scholar_indx',axis='index').set_index('scholar_indx',drop=False)

#modify the data
df = rm_noteachers(df)

There are now 13061 records remaining.

## Remove Untrustworthy Narrators

The `area_of_interest` column contains some data on trustworthiness for ~5500 records, so let's use this to remove all records with a reputation of less than 'sahih'.

First, let's extract all of the grades and add them to the dataset as a new column `grade`.

In [5]:
#extract grades
grades = []
for item in df['area_of_interest'].str.findall(r"\[Grade:[^\]]+\]").values:
    if isinstance(item, list):
        if len(item) > 0:
            #str.lstrip strips a set of characters, not a prefix, so split on the first colon instead
            grade = item[0].lower().strip('[]').split(':', 1)[1]
            grades.append(grade)
            continue
    grades.append('undefined')

#assign grade column to the df and remove area_of_interest column
df = df.assign(grade=grades)
del df['area_of_interest']
del grades #remove unneeded variable
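
The same extraction can also be written without an explicit loop. The following is only a sketch on toy data, assuming the `[Grade:...]` pattern used in the cell above; the toy strings are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for df['area_of_interest']; the values are made up for illustration.
area = pd.Series(["Hadith [Grade:Thiqah]", None, "Fiqh"])

# str.extract returns the first match of the capture group, or NaN when there is
# no match (or the value is missing), so fillna supplies the 'undefined' label.
toy_grades = (
    area.str.extract(r"\[Grade:([^\]]+)\]", expand=False)
        .str.lower()
        .fillna('undefined')
)
print(toy_grades.tolist())
```

On the real data this would replace the loop with a single assignment to the `grade` column, at the cost of being a little less explicit about the missing-value cases.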

Take a look at the unique grade values:

In [6]:
df['grade'].unique()

array(['undefined', 'no doubt', 'thiqah thiqah', 'thiqah', 'maqbool',
       'sadooq', 'weak', 'sadooq/delusion', 'not thiqah',
       'unknown-majhool', 'abandoned', 'liar', 'accused liar'],
      dtype=object)

See the data dictionary at muslimscholars.info for explanations of each grade.

We are including:
no doubt (Companions), thiqah thiqah (Awthaqun Nas), thiqah (Thiqat), sadooq (Saduq), sadooq/delusion (Saduq Yahim), maqbool (Maqbool/Layyin), not thiqah (Majhool al-haal/Mastur), and undefined

We are excluding:
abandoned (Matruk), accused liar (Muttaham bi'l kadhib), liar (Kadhdhaab, waddaa'), weak (Da'eef), and unknown-majhool (Majhool)

Now let's remove all untrustworthy narrators.

In [7]:
#again working inside a function to keep variables local
#return dataset after removing all untrustworthy narrators
def rm_untrustworthy(df):
    drop_grades = ['abandoned','liar','accused liar','weak','unknown-majhool'] #grades we are excluding
    
    #keep only the rows whose grade is not in the excluded list
    return df.loc[~df['grade'].isin(drop_grades)].copy()
    
#modify the data
df = rm_untrustworthy(df)

There are now 12429 records remaining.

## Create Gender Variable

Now let's clean up the `tags` column. We are only interested in the `Female` tag, so let's create a new column `gender` where records with a `Female` tag are encoded as `"f"` and records without a `Female` tag are encoded as `"m"`.

In [8]:
#tags are stored as a single string
#return "f" if "female" found in tags, otherwise "m"
def get_gender(tag):
    if pd.isna(tag):
        return "m"
    elif tag.lower().find('female') == -1:
        return "m"
    else:
        return "f"

#assign gender column to the df and remove tags column
df = df.assign(gender=df['tags'].transform(get_gender))
del df['tags']

It's possible that some female narrators were missing tags and have been misclassified. By convention, women's names contain "bint" (i.e. "daughter of") while men's names contain "bin" (i.e. "son of"), so we can check whether any scholars with "bint" in their names have been classified as men. Since names may contain multiple levels of ancestors, the presence of "bint" is necessary but not sufficient to conclude that a record has been misclassified.

In [9]:
#detect names with 'bint' that are classified 'm'
#generate array of indices with potentially misclassified gender
flagged = []
for indx, data in df.loc[:,['name','gender']].iterrows():
    #access by column label; positional access like data[0] is deprecated for labelled Series
    if ('bint' in data['name'].lower()) and (data['gender'] == 'm'):
        flagged.append(indx)

print("flagged: "+str(len(flagged)))

flagged: 15
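
The same flagging rule can be expressed as a boolean mask instead of a row loop. This is a minimal sketch on a toy frame, not the notebook's own cell; the first two names echo rows from the data, while the third name and the indices 642001 and 642002 are hypothetical.

```python
import pandas as pd

# Toy frame indexed like the real data (by scholar index).
toy = pd.DataFrame(
    {
        'name': ["Umm Kulthum bint 'Amr (Jarwal)", "Thabit bin Qays", "Example bint Example"],
        'gender': ['m', 'm', 'f'],
    },
    index=[186, 642001, 642002],
)

# Case-insensitive substring test; na=False treats missing names as "no match".
has_bint = toy['name'].str.lower().str.contains('bint', regex=False, na=False)
toy_flagged = toy.index[has_bint & (toy['gender'] == 'm')].tolist()
print(toy_flagged)
```

Like the loop, this matches "bint" anywhere in the name, so it flags candidates for manual review rather than confirmed misclassifications.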


Now let's look at the records flagged as misclassified. There are only 15, so we can manually double-check all of them.

In [10]:
df.filter(items=flagged,axis='index')

| scholar_indx (index) | scholar_indx | name | generation | students_inds | teachers_inds | grade | gender |
|---|---|---|---|---|---|---|---|
| 186 | 186 | Umm Kulthum bint 'Amr (Jarwal) ( أم كلثوم بنت ... | Comp.(RA) | | 1 | undefined | m |
| 641 | 641 | Qays bint al-Khutaym ( قيس بن الخطيم الأنصاري ... | Comp.(RA) | | 1 | undefined | m |
| 642 | 642 | Thabit bin Qays bint al-Khutaym ( ثابت بن قيس ... | Comp.(RA) | | 1 | undefined | m |
| 643 | 643 | Yazid bin Qays bint al-Khutaym ( يزيد بن قيس ب... | Comp.(RA) | | 1 | undefined | m |
| 3046 | 3046 | Qutila bint Qays ( قيس بن المكشوح المرادي ( رض... | Comp.(RA) | | 1 | undefined | m |
| 3117 | 3117 | Mu'awiya bint Suwayd al-Mazni ( معاوية بن سويد... | Comp.(RA) [1st Generation] | 11324, 11052, 11348, 11350, 11381 | 1, 3123, 400 | undefined | m |
| 6222 | 6222 | al-Swda'a bint 'Asim Lha Shbh ( السوداء ( رضي ... | Comp.(RA) | | 1 | undefined | m |
| 6415 | 6415 | al-Fry'h bint Wahub al-Zhryh Rf'ha ( الفريعة ب... | Comp.(RA) | | 1 | undefined | m |
| 6607 | 6607 | 'Abdullah al-Bakry Rwt bint'h Bhyh ( عبد الله ... | Comp.(RA) | | 1 | undefined | m |
| 6864 | 6864 | Hqh bint 'Amr Slt al-Qbltyn ( حقة بنت عمرو صلت... | Comp.(RA) | | 1 | undefined | m |


Manually checking the flagged records confirmed one record, 186, as misclassified.

***TODO***: I've checked the flagged records myself and noted the definite misclassifications, but the records below still need confirmation because I couldn't find anybody by those exact names via web search. Someone who reads Arabic could help here, since the romanized transliterations may be incorrect or nonstandard (I wouldn't be able to tell).

Need confirmation: 641, 642, 643, 3046, 3117, 6222, 6415, 6607, 6864, 7729, 8367, 8708, 9179

Notes:

- 641's name contains "bint", but their spouse is listed as 'Iqrab bint Mua'dh on muslimscholars.info. The Arabic text in the record reads "بن" (bin), so the "bint" may be a transliteration error rather than a sign of a female narrator.

- The latter half of 7729's name corresponds to one of the Prophet's wives, but I can't tell what the first half refers to.

- 9179 seems to be Jesus, but it is unclear why Jesus would be listed as a companion.

- On muslimscholars.info, 18885 is identified as "Kinanah (bin Nabiyya), Client of Safiyya bint Huyayy", so I'm assuming the male classification is correct.

In [11]:
#indices of misclassified records
misclassified = [186]

#correct the gender of the misclassified records in place
df.loc[misclassified, 'gender'] = 'f'
del flagged #remove unneeded variables
del misclassified

Now that we're comfortable with our gender classifications, let's check the gender ratio in the dataset:

In [12]:
df['gender'].value_counts()

m    11228
f     1201
Name: gender, dtype: int64
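
To read the counts above as proportions, divide each by the total. The sketch below rebuilds the two counts from the output by hand; on the real dataframe, `df['gender'].value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({'m': 11228, 'f': 1201}, name='gender')

# Share of each gender among all remaining records.
shares = counts / counts.sum()
print(shares.round(4))
```

Women make up a little under a tenth of the remaining records.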

## Clean Up Edgelists

Connections for each scholar are stored as lists of their students' and teachers' numeric indices. Some records in `students_inds` and `teachers_inds` also contain text at the end indicating there are other unrecorded connections. Let's remove these pieces of text so we can easily process the numeric data.

In [18]:
students_inds_corrected = []
teachers_inds_corrected = []
for indx, data in df.loc[:,['students_inds','teachers_inds']].iterrows():
    students = data['students_inds'] #access by label; positional data[0] is deprecated
    if pd.isna(students):
        students_inds_corrected.append(students)
    elif isinstance(students, str):
        students_temp = []
        for item in students.split(','):
            #TODO
            #item.strip
            pass
    else:
        raise TypeError("students_inds value at indx " + str(indx) + " is neither str nor NaN")
    
    teachers = data['teachers_inds'] #access by label; positional data[1] is deprecated
    if pd.isna(teachers):
        teachers_inds_corrected.append(teachers)
    elif isinstance(teachers, str):
        teachers_temp = []
        #TODO
    else:
        raise TypeError("teachers_inds value at indx " + str(indx) + " is neither str nor NaN")

#TODO

In [None]:
df.loc[1,'students_inds'].split(',')
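
One possible way to finish the cleanup is sketched below. It assumes that valid indices are tokens made up only of digits and that anything else in the comma-separated string is the trailing text to discard; the helper name `clean_inds` and the "and others" text are hypothetical, since the exact wording of the trailing text isn't shown here. Unlike the loop above, it returns an empty list rather than NaN for missing values.

```python
import pandas as pd

def clean_inds(value):
    """Parse a comma-separated string of indices into a list of ints.

    Assumption: tokens consisting only of digits are indices; every other
    token is treated as free text and dropped. Missing values become [].
    """
    if pd.isna(value):
        return []
    if not isinstance(value, str):
        raise TypeError("expected str or NaN, got " + type(value).__name__)
    tokens = (token.strip() for token in value.split(','))
    return [int(token) for token in tokens if token.isdecimal()]

# Toy rows; the trailing text in the second row is invented for illustration.
toy = pd.DataFrame({
    'students_inds': ["11324, 11052, 11348", "2, 3, and others", None],
    'teachers_inds': ["1, 3123, 400", "1", "1"],
})
cleaned = toy[['students_inds', 'teachers_inds']].apply(lambda col: col.map(clean_inds))
print(cleaned)
```

Storing lists of integers makes it straightforward to build an edgelist later, for example by exploding each list into one row per connection.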

## Export Data

Save the data to CSV.
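
A minimal sketch of the export step follows. The file name `rawis_clean.csv` and the temporary directory are placeholders, not the project's actual output location, and the toy frame stands in for the cleaned `df`.

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for the cleaned dataframe, indexed by scholar_indx as above.
df_clean = pd.DataFrame({
    'scholar_indx': [1, 186],
    'name': ["toy name A", "toy name B"],
    'gender': ['m', 'f'],
}).set_index('scholar_indx', drop=False)

# Hypothetical output path; a temporary directory keeps the sketch self-contained.
out_path = os.path.join(tempfile.mkdtemp(), "rawis_clean.csv")

# index=False because scholar_indx is already kept as a regular column.
df_clean.to_csv(out_path, index=False)

# Read the file back to confirm the export round-trips.
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```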