Part 1 of the assignment starts here. It produces 5 csv files, one per department, each holding the data of that department's faculty members.

In [96]:
from bs4 import BeautifulSoup
import requests
url_link = 'http://lhr.nu.edu.pk/faculty/'
#get all the information in text form
content = requests.get(url_link).text
doc = BeautifulSoup(content,'html.parser')
#check to see if the function is working
#print(doc.prettify())


The cell below defines 2 helper functions. The designation text on the website is padded with a lot of whitespace, so the strip function is used to remove the whitespace before and after the string and the split function is used to separate the text into lines. Because it is important to know whether a specific faculty member is an HEC Approved PhD Supervisor, a condition based on the observed data is applied. For an approved supervisor, the data after splitting has the form:


data after splitting = ['Professor', '                ', '       HEC Approved PhD Supervisor']

So, in this case, the code simply checks whether the string at index 2 contains 'HEC Approved PhD Supervisor'.

For faculty members who are not HEC approved PhD supervisors, the list has only 1 element, i.e. the designation. Hence, if the list has fewer than 3 elements, the function returns 'No'.
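
As a quick check of the splitting logic, the sketch below applies strip and split to a made-up designation string shaped like the example above. The sample text and the variable names are assumptions for illustration, not data taken from the website.

```python
# Hypothetical raw designation text, padded the way the example above suggests
raw = "\n    Professor\n                \n       HEC Approved PhD Supervisor\n  "

# strip() trims the outer whitespace, split('\n') separates the lines
parts = raw.strip().split("\n")
print(parts)

# the first element is the designation; the label, if present, is the third
designation = parts[0]
is_supervisor = len(parts) >= 3 and "HEC Approved PhD Supervisor" in parts[2]
print(designation, is_supervisor)

# a member without the label yields a single-element list
plain = "\n    Lecturer\n  ".strip().split("\n")
print(plain)
```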

In [97]:
def cleanDesignation(designation):
    designation = designation.strip()
    designation = designation.split('\n')
    return designation[0]

def checkPHD(designation):
    designation = designation.strip()
    designation = designation.split('\n')

    #the supervisor label, when present, is the third line of the designation text
    if len(designation) >= 3 and 'HEC Approved PhD Supervisor' in designation[2]:
        return 'Yes'
    return 'No'
    

The cell below defines a function getID that retrieves the ID of the faculty member from the link provided. The ID is the last segment of the link. For example, in the link http://lhr.nu.edu.pk/fsc/facultyProfile/4391, the ID is 4391.
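
The same extraction can be tried in isolation. The sketch below is an alternative that uses str.rsplit rather than a character loop; the name id_from_link is hypothetical, and it assumes the link ends in a numeric ID.

```python
def id_from_link(link):
    # drop a trailing slash, if any, then keep the text after the last '/'
    last_segment = link.rstrip("/").rsplit("/", 1)[-1]
    return int(last_segment)

print(id_from_link("http://lhr.nu.edu.pk/fsc/facultyProfile/4391"))
```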

In [98]:
def getID(link):
    i = -1
    id = ''
    while(link[i] != '/'):
        num = link[i]
        id = num + id
        i = i-1
    
    return int(id)


The function ScrapeDept in the cell below scrapes the data for one department. It locates the department's section by its HTML id, collects all faculty cards with findAll, and then iterates over the cards, extracting the relevant information through the classes defined in the HTML.
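
The find-then-iterate pattern can be seen on a small piece of made-up markup. The HTML, the class name card and the field names below are assumptions for illustration and do not reproduce the real page; find_all is the current spelling of findAll.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates two department sections with faculty cards
html = """
<div id="fsc">
  <div class="card"><h5 class="text-center">Dr. A</h5>
    <a class="faculty-link" href="/fsc/facultyProfile/1">Profile</a></div>
  <div class="card"><h5 class="text-center">Dr. B</h5>
    <a class="faculty-link" href="/fsc/facultyProfile/2">Profile</a></div>
</div>
<div id="ee">
  <div class="card"><h5 class="text-center">Dr. C</h5>
    <a class="faculty-link" href="/ee/facultyProfile/3">Profile</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# restrict the search to one department, then collect its cards
section = soup.find(id="fsc")
cards = section.find_all(class_="card")

rows = []
for card in cards:
    rows.append({
        "Name": card.find(class_="text-center").text,
        "Link": card.find("a", class_="faculty-link")["href"],
    })

print(rows)
```

When class_ is given several space-separated names, as in ScrapeDept, Beautiful Soup compares it with the exact text of the class attribute, so the order of the names matters.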

In [99]:
import pandas as pd


#Scrapes data from the department whose ID is passed
def ScrapeDept(deptID, deptName):
    doc2 = doc.find(id = deptID)
    Cards = doc2.findAll(class_ = 'col-lg-3 col-md-4 col-sm-6 col-12')

    #define the names of the classes from which data is to be scraped, for better readability
    nameClass = "text-center"
    designationClass = "small text-center font-italic"
    emailClass = "mb-0 text-center"
    imageClass = "card-img-top rounded-circle mt-3 mb-0 d-block mx-auto"
    reference = 'http://lhr.nu.edu.pk/'

    #initialize a dataframe with columns
    df = pd.DataFrame(columns=['ID', 'Name', 'Designation',
    'HEC Approved PHD Supervisor', 'Email',
     'Department', 'ImageURL'])

    for Card in Cards:

        #get the name of the faculty member
        name = Card.find(class_ = nameClass).text

        #get the designation of the faculty member
        rawDesignation = Card.find(class_ = designationClass).text
        designation = cleanDesignation(rawDesignation)

        #check if HEC approved supervisor or not
        phdSup = checkPHD(rawDesignation)

        #get the email of the faculty member
        email = Card.find(class_ = emailClass).text

        #prepare the faculty link
        facultyLink = Card.find('a', class_ = 'faculty-link')['href']
        #facultyLink = reference + facultyLink

        #getID
        id = getID(facultyLink)

        #assign the department
        department = deptName
        
        #get the image URL
        imgUrl = Card.find(class_ = imageClass)['src']
        imgUrl = reference + imgUrl


        #add a row to the dataframe (DataFrame.append was removed in pandas 2.0)
        row = pd.DataFrame([{'ID' : id, 'Name' : name, 'Designation' : designation,
        'HEC Approved PHD Supervisor': phdSup, 'Email': email, 'Department': department,
        'ImageURL': imgUrl}])
        df = pd.concat([df, row], ignore_index = True)

    return df
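
Growing a DataFrame inside a loop copies the frame on every iteration. A common alternative, sketched here with made-up values, is to collect one dictionary per card in a plain list and build the frame once at the end. The tuple of sample values and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical per-card values standing in for the scraped fields
cards = [
    (4391, "Dr. A", "Professor"),
    (4392, "Dr. B", "Lecturer"),
]

rows = []
for faculty_id, name, designation in cards:
    rows.append({"ID": faculty_id, "Name": name, "Designation": designation})

# build the frame once; the columns argument fixes the column order
df = pd.DataFrame(rows, columns=["ID", "Name", "Designation"])
print(df)
```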
    


In [100]:
#Call the Scraping Function for each department
csDf = ScrapeDept('fsc', 'Computer Science')
eeDf = ScrapeDept('ee', 'Electrical Engineering')
fsmDf = ScrapeDept('fsm', 'FAST School of Management')
cvDf = ScrapeDept('cv', 'Civil Engineering')
ssDf = ScrapeDept('ss', 'Social Sciences')

#convert DataFrames into csv files
csDf.to_csv('fsc.csv', index=False)
eeDf.to_csv('ee.csv', index=False)
cvDf.to_csv('cv.csv', index=False)
fsmDf.to_csv('fsm.csv', index=False)
ssDf.to_csv('ss.csv', index=False)

Part 1 of the assignment ends here.
*********************************************************************************************************************************

The cell below loads the created files into new dataframes

In [101]:
#convert csv files to dataframes

csDf_ = pd.read_csv('fsc.csv')
eeDf_ = pd.read_csv('ee.csv')
cvDf_ = pd.read_csv('cv.csv')
fsmDf_ = pd.read_csv('fsm.csv')
ssDf_ = pd.read_csv('ss.csv')

The cell below takes 10 random entries from each dataframe
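
DataFrame.sample raises a ValueError when asked for more rows than the frame holds (unless replace=True), which matters if a department has fewer than 10 members. A minimal sketch of a guard, using a made-up frame and hypothetical variable names:

```python
import pandas as pd

# Hypothetical small department with only four rows
small = pd.DataFrame({"ID": [1, 2, 3, 4], "Name": ["A", "B", "C", "D"]})

# cap the sample size at the number of rows; random_state makes it repeatable
n = min(10, len(small))
sample = small.sample(n=n, random_state=0)
print(sample)
```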

In [102]:
# Sample dataFrames

fsc_sample = csDf_.sample(10)
ee_sample = eeDf_.sample(10)
cv_sample = cvDf_.sample(10)
fsm_sample = fsmDf_.sample(10)
ss_sample = ssDf_.sample(10)


convertColToList converts a column of a dataframe into a Python list, which is later iterated over as desired. getFacultyLink builds the profile link of a faculty member from the department code and the ID.

In [103]:
def convertColToList(df, colName):
    ids = df[colName].tolist()
    return ids

def getFacultyLink(id, dept):
    link = f'http://lhr.nu.edu.pk/{dept}/facultyProfile/{id}'
    return link

getExtension extracts the extension from the phone number. The extension follows the last colon at the end of the string that contains the phone number, for example, XXXXXXXXXXXXX Ext:241. If no extension is given (the text after the colon is 'None'), the function returns -1.
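
To make the expected behaviour concrete, the sketch below does the same job with a regular expression. It is an alternative to the character loop, the name extension_from_phone is hypothetical, and the sample strings are invented to match the format described above.

```python
import re

def extension_from_phone(phone):
    # look for digits after a colon at the end of the string, allowing spaces
    match = re.search(r":\s*(\d+)\s*$", phone)
    return int(match.group(1)) if match else -1

print(extension_from_phone("XXXXXXXXXXXXX Ext:241"))
print(extension_from_phone("XXXXXXXXXXXXX Ext:None"))
```

Unlike the loop, this version also returns -1 when the string contains no colon at all.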

In [104]:
def getExtension(phone):
    i = -1
    ext = ''
    while(phone[i] != ':'):
        num = phone[i]
        ext = num + ext
        i = i-1

    if ext == 'None': return -1
    return int(ext)

otherAttributes extracts the phone extension and the most recent education entry from the profile page at the facultyLink given to it.
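
A profile page may lack the education list altogether, in which case chained find calls fail on None. The sketch below shows one way to read the first list item defensively. It searches the whole fragment rather than a specific education block, and the function name and the HTML fragments are made up for illustration.

```python
from bs4 import BeautifulSoup

def first_education(html):
    soup = BeautifulSoup(html, "html.parser")
    # select_one returns None when nothing matches, so no exception is raised
    item = soup.select_one("ul li")
    return item.get_text(strip=True) if item else "No Information"

# Made-up profile fragments
with_list = "<div><ul><li> PhD (CS), 2015 </li><li>MS (CS), 2009</li></ul></div>"
without_list = "<div><p>Profile under construction</p></div>"

print(first_education(with_list))
print(first_education(without_list))
```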

In [105]:
def otherAttributes(facultyLink):
    #get the content
    content = requests.get(facultyLink).text
    Card = BeautifulSoup(content,'html.parser')

    #define classes of phone and education
    phoneClass = 'fas fa-phone-square mr-1'
    educationClass = 'col-lg-8 col-md-6 col-sm-12 text-justify'
    
    #rawPhone is the complete phone number along with the text, while phone stores the extraction
    rawPhone = Card.find('span', class_ = 'small').text
    phone = getExtension(rawPhone)
    
    edClass = Card.find(class_ = educationClass)
    
    #find the unordered list tag
    education = edClass.find('ul')
    
    #extract the first li tag, that will be the most recent education. Otherwise return 'No Information'
    if education.find('li'):
        education = education.find('li').text
    
    else:
        education = 'No Information'
    
    return phone,education

getExtendedDf takes a dataframe, converts the IDs in the given column into a list, extracts the extension and education for each ID, and returns a dataframe with the columns ID, Extension and Education.

In [106]:
def getExtendedDf(df, colName, dept):
    ids = convertColToList(df,colName)

    dataFrame = pd.DataFrame(columns=['ID', 'Extension', 'Education'])
    for id in ids:
        link = getFacultyLink(id, dept)
        ext, education = otherAttributes(link)
        
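        #DataFrame.append was removed in pandas 2.0, so concatenate a one-row frame instead
        row = pd.DataFrame([{'ID' : id, 'Extension': ext, 'Education': education}])
        dataFrame = pd.concat([dataFrame, row], ignore_index = True)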
    
    return dataFrame

In [107]:
#makes the append warning go away
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [108]:
#call getExtendedDf for every department
csDf2 = getExtendedDf(csDf_, 'ID', 'fsc')
eeDf2 = getExtendedDf(eeDf_, 'ID', 'ee')
cvDf2 = getExtendedDf(cvDf_, 'ID', 'cv')
ssDf2 = getExtendedDf(ssDf_, 'ID', 'ss')
fsmDf2 = getExtendedDf(fsmDf_, 'ID', 'fsm')


In [109]:
#convert new dataframes to csv files
csDf2.to_csv('fsc_2.csv', index=False)
eeDf2.to_csv('ee_2.csv', index=False)
cvDf2.to_csv('cv_2.csv', index=False)
fsmDf2.to_csv('fsm_2.csv', index=False)
ssDf2.to_csv('ss_2.csv', index=False)

Part 2 of the assignment ends here. Part 3 is in the cell below.

In [110]:
#load all csv files into dataframes

cs = pd.read_csv('fsc.csv')
ee = pd.read_csv('ee.csv')
cv = pd.read_csv('cv.csv')
fsm= pd.read_csv('fsm.csv')
ss = pd.read_csv('ss.csv')


cs1 = pd.read_csv('fsc_2.csv')
ee1 = pd.read_csv('ee_2.csv')
cv1 = pd.read_csv('cv_2.csv')
fsm1= pd.read_csv('fsm_2.csv')
ss1 = pd.read_csv('ss_2.csv')

In [111]:
#merge the files on the column ID
csMerged = pd.merge(cs, cs1, on='ID', how='left')
eeMerged = pd.merge(ee, ee1, on='ID', how='left')
cvMerged = pd.merge(cv, cv1, on='ID', how='left')
ssMerged = pd.merge(ss, ss1, on='ID', how='left')
fsmMerged = pd.merge(fsm, fsm1, on='ID', how='left')
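
A left merge keeps every row of the first frame and fills the columns of the second with NaN where no matching ID exists. The sketch below shows this on made-up frames; the frame names and values are hypothetical.

```python
import pandas as pd

# Made-up frames: ID 3 has no row in the second frame
base = pd.DataFrame({"ID": [1, 2, 3], "Name": ["A", "B", "C"]})
extra = pd.DataFrame({"ID": [1, 2], "Extension": [241, -1]})

merged = pd.merge(base, extra, on="ID", how="left")
print(merged)
```

Because of the missing value, the Extension column of the merged frame is stored as floating point rather than integer.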

In [112]:
#combine every dataframe into a single dataframe (DataFrame.append was removed in pandas 2.0)
df_merged = pd.concat([csMerged, eeMerged, cvMerged, ssMerged, fsmMerged], ignore_index=True)

#make a csv file
df_merged.to_csv('fast_lhr_faculty.csv', index=False)