# End-To-End Example: Data Analysis of iSchool Classes

In this end-to-end example we will perform a data analysis in Python Pandas. We will attempt to answer the following questions:

- What percentage of the schedule is undergrad (course number below 500)?
- Which undergrad classes are on Friday, or at 8AM?

Things we will demonstrate:

- `read_html()` for basic web scraping
- dealing with 5 pages of data
- appending multiple `DataFrame`s together (`append()` in older pandas, `pd.concat()` in pandas 2.0 and later)
- Feature engineering (adding a column to the `DataFrame`)

The iSchool schedule of classes can be found here: https://ischool.syr.edu/classes 


In [1]:
import pandas as pd

# this turns off warning messages
import warnings
warnings.filterwarnings('ignore')

In [31]:
# just figure out how to get the data
website = 'https://ischool.syr.edu/classes/?page=1'
data = pd.read_html(website)
data[0]

Unnamed: 0,Course,Section,Class,Credits,Title,Instructor(s),Time,Day,Room(s)
0,GET302,M001,37463,3.0,Global Financial Sys Arch,vcschoon,5:00pm - 7:45pm,Tu,Hinds Hall 011
1,GET460,M001,41946,3.0,Global Technology Abroad,Laurie A Ferger,12:00am - 12:00am,,
2,GET460,M002,41948,3.0,Global Technology Abroad,Paul Brian Gandel,12:00am - 12:00am,,
3,GET602,M001,37464,3.0,Global Financial Sys Arch,vcschoon,5:00pm - 7:45pm,Tu,Hinds Hall 011
4,GET660,M001,41947,3.0,Global Technology Abroad,Laurie A Ferger,12:00am - 12:00am,,
5,GET660,M002,41949,3.0,Global Technology Abroad,Paul Brian Gandel,12:00am - 12:00am,,
6,IDS401,M001,37545,3.0,What's the Big Idea?,Michael A D'Eredita,5:00pm - 7:50pm,Tu,FALK ROOM 201
7,IDS403,M001,37431,1.0,Startup Sandbox,John DuRoss Liddy,2:15pm - 5:05pm,F,Syracuse Technology Garden
8,IDS460,M002,37571,3.0,Entretech - NYC,John DuRoss Liddy,12:00am - 12:00am,,
9,IDS460,M003,37572,3.0,Spring Break in Silicon Valley,John DuRoss Liddy,12:00am - 12:00am,,


In [6]:
# let's generate links to the other pages
website = 'https://ischool.syr.edu/classes/?page='
for i in range(1,6):
    link = website + str(i)
    print(link)


https://ischool.syr.edu/classes/?page=1
https://ischool.syr.edu/classes/?page=2
https://ischool.syr.edu/classes/?page=3
https://ischool.syr.edu/classes/?page=4
https://ischool.syr.edu/classes/?page=5


In [15]:
# let's read them all and append them to a single data frame

website = 'https://ischool.syr.edu/classes/?page='
classes = pd.DataFrame()

for i in range(1,6):
    link = website + str(i)
    data = pd.read_html(link)
    # DataFrame.append() was removed in pandas 2.0; pd.concat() does the same job
    classes = pd.concat([classes, data[0]], ignore_index=True)
    
classes.sample(5)

Unnamed: 0,Course,Section,Class,Credits,Title,Instructor(s),Time,Day,Room(s)
111,IST511,M406,37614,3.0,Intro to Library & Info Prof,Alison J Johnson,12:00am - 10:30pm,M,Online Online
21,IST195,M008,37452,3.0,LAB: Information Technologies,Jeff Rubin,2:55pm - 3:50pm,F,Hinds Hall 010
145,IST614,M401,37590,3.0,Mngmt Prncpls for Info Profess,Dr. John P. Stinnett,12:00am - 10:30pm,Tu,Online Online
61,IST345,M003,37434,3.0,Managing Info Systems Projects,Mark Andrew Borte,5:15pm - 8:05pm,Tu,Hinds Hall 111
173,IST621,M401,37560,3.0,Info Management and Technology,Michael Larche,12:00am - 10:30pm,Tu,Online Online
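
The loop above grows the data frame one page at a time. A common alternative, sketched here with two tiny made-up frames standing in for the scraped pages (`page1` and `page2` are hypothetical names, not part of the original notebook), is to collect the pages in a plain Python list and concatenate once at the end. This also sidesteps `DataFrame.append()`, which was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical stand-ins for the tables that read_html() returns for each page.
page1 = pd.DataFrame({'Course': ['GET302', 'IDS401'], 'Day': ['Tu', 'Tu']})
page2 = pd.DataFrame({'Course': ['IST195', 'IST256'], 'Day': ['F', 'Th']})

frames = []                      # one DataFrame per page
for page in (page1, page2):
    frames.append(page)          # list.append, not DataFrame.append

# A single concat at the end; ignore_index=True renumbers the rows from 0.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

In the real loop, `frames.append(data[0])` would take the place of the stand-in frames.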


In [16]:
## let's set the columns

website = 'https://ischool.syr.edu/classes/?page='
classes = pd.DataFrame()

for i in range(1,6):
    link = website + str(i)
    data = pd.read_html(link)
    # DataFrame.append() was removed in pandas 2.0; pd.concat() does the same job
    classes = pd.concat([classes, data[0]], ignore_index=True)
    
classes.columns = ['Course','Section','ClassNo','Credits','Title','Instructor','Time','Days','Room']

classes.sample(5)


Unnamed: 0,Course,Section,ClassNo,Credits,Title,Instructor,Time,Days,Room
45,IST335,M002,37393,3.0,Intro/Info Based Organizations,Blythe Scherrer,8:00am - 9:20am,MW,Hinds Hall 117
102,IST486,U800,37543,3.0,Social Media in the Organiz.,Maren Guse Powell,12:00am - 12:00am,,Online
31,IST256,M004,37538,3.0,Appl.Prog.For Information Syst,Angela Usha Ramnarine-Rieks,12:30pm - 1:50pm,Th,LINK200
162,IST618,M401,37595,3.0,Information Policy,Angela Usha Ramnarine-Rieks,12:00am - 10:30pm,M,Online Online
69,IST352,M005,37475,3.0,Info Analysis of Org. Systems,P Douglas Taber,8:00am - 9:20am,MW,Hinds Hall 111


In [33]:
## this is good stuff. Let's make a function out of it for simplicity

def get_ischool_classes():
    website = 'https://ischool.syr.edu/classes/?page='
    classes = pd.DataFrame()

    for i in range(1,6):
        link = website + str(i)
        data = pd.read_html(link)
        # DataFrame.append() was removed in pandas 2.0; pd.concat() does the same job
        classes = pd.concat([classes, data[0]], ignore_index=True)
    
    classes.columns = ['Course','Section','ClassNo','Credits','Title','Instructor','Time','Days','Room']

    return classes

# main program 
classes = get_ischool_classes()

In [19]:
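# undergrad classes are numbered 0-499 and grad classes 500 and up,
# but we don't have a course number column, so we must engineer one.

# the first three characters are the subject, the rest is the number
# (a notebook cell only displays its last expression, so only the numbers show below)
classes['Course'].str[0:3].sample(5)
classes['Course'].str[3:].sample(5)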


169    618
73     359
18     195
66     352
227    687
Name: Course, dtype: object
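
Slicing at a fixed position works because every subject code shown here has three letters. As a hedged alternative, assuming course codes are always uppercase letters followed by digits, a regular expression with named groups can split the two parts without relying on the position. The sample codes are taken from the tables above; the pattern itself is an assumption about the format.

```python
import pandas as pd

courses = pd.Series(['GET302', 'IDS401', 'IST195', 'IST618'], name='Course')

# Named groups become column names: letters go to Subject, digits to Number.
parts = courses.str.extract(r'^(?P<Subject>[A-Z]+)(?P<Number>\d+)$')
print(parts)
```

Rows that do not match the pattern come back as missing values, which makes unexpected course codes easy to spot.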

In [20]:
# make the subject and number columns
classes['Subject'] = classes['Course'].str[0:3]
classes['Number'] = classes['Course'].str[3:]
classes.sample(5)

Unnamed: 0,Course,Section,ClassNo,Credits,Title,Instructor,Time,Days,Room,Subject,Number
30,IST256,M003,37537,3.0,Appl.Prog.For Information Syst,Angela Usha Ramnarine-Rieks,3:45pm - 5:05pm,W,CH101,IST,256
41,IST300,M002,41802,3.0,Information Security Policy,James Enwright,6:45pm - 8:05pm,MW,Hinds Hall 013,IST,300
119,IST600,M004,37582,3.0,Intro to Cloud Technologies,Radhika Garg,2:15pm - 5:00pm,M,Hinds Hall 018,IST,600
42,IST300,M001,42515,3.0,Enterprise Data Analysis,Michael A Leonardo,5:15pm - 8:05pm,M,Hinds Hall 010,IST,300
205,IST659,M401,37602,3.0,Data Admin Concepts & Db Mgmt,Chad Aaron Harper,12:00am - 10:30pm,Tu,Online Online,IST,659


In [36]:
# and finally we can create the column we need!
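classes['Type'] = ''
# use .loc rather than chained indexing, which can end up assigning to a temporary copy
# the numbers are strings, so this compares text; that works as long as every number has three digits
classes.loc[classes['Number'] < '500', 'Type'] = 'UGrad'
classes.loc[classes['Number'] >= '500', 'Type'] = 'Grad'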

classes.sample(5)

Unnamed: 0,Course,Section,ClassNo,Credits,Title,Instructor,Time,Days,Room,Subject,Number,UG,Type
113,IST600,M800,37113,1.0,IT Auditing,Thomas J Wood,12:00am - 12:00am,,Online,IST,600,N,Grad
1,GET302,M001,37049,3.0,Global Financial Sys Arch,Frank Jr Marullo,5:15pm - 8:00pm,M,Hinds Hall 021,GET,302,Y,UGrad
182,IST718,M801,37082,3.0,Advanced Information Analytics,Gary E Krudys,12:00am - 12:00am,,Online,IST,718,N,Grad
193,IST754,M001,37065,3.0,Telecom Final Project,Lee H Badman,9:30am - 12:15pm,W,Hinds Hall 117,IST,754,N,Grad
69,IST425,M001,36950,3.0,Enterprise Risk Management,Michael Larche,5:15pm - 6:35pm,MW,Slocum 104,IST,425,Y,UGrad


In [21]:
# the entire program to retrieve the data and setup the columns looks like this:

# main program 
classes = get_ischool_classes()
classes['Subject'] = classes['Course'].str[0:3]
classes['Number'] = classes['Course'].str[3:]
classes['Type'] = ''
# .loc assigns in place; chained indexing can end up assigning to a temporary copy
classes.loc[classes['Number'] < '500', 'Type'] = 'UGrad'
classes.loc[classes['Number'] >= '500', 'Type'] = 'Grad'
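
The comparisons above are made on strings, which is only safe while every course number has the same number of digits (as text, `'1000' < '500'` is true). A minimal sketch of a numeric version follows. The four course codes are made up to exercise the 500 boundary, and the use of `np.where` is a substitution for illustration, not the notebook's own method.

```python
import numpy as np
import pandas as pd

# Made-up course codes chosen to sit on both sides of the 500 boundary.
classes = pd.DataFrame({'Course': ['IST195', 'IST499', 'IST500', 'IST718']})

# Convert the numeric part to integers so the comparison is numeric, not textual.
classes['Number'] = pd.to_numeric(classes['Course'].str[3:])

# np.where picks 'UGrad' where the condition holds and 'Grad' everywhere else.
classes['Type'] = np.where(classes['Number'] < 500, 'UGrad', 'Grad')
print(classes)
```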


In [22]:
# let's find the number of grad / undergrad courses
classes['Type'].value_counts()

# more grad classes than undergrad

Grad     150
UGrad    100
Name: Type, dtype: int64
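
The counts answer the first question only indirectly, since it asked for a percentage. A small sketch of how the share could be computed with `value_counts(normalize=True)`, using a made-up four-row `Type` column rather than the scraped data:

```python
import pandas as pd

# Made-up Type column: three undergrad rows and one grad row.
types = pd.Series(['UGrad', 'UGrad', 'UGrad', 'Grad'], name='Type')

# normalize=True returns fractions that sum to 1; multiply by 100 for percentages.
shares = types.value_counts(normalize=True) * 100
print(shares.round(1))
```

Applied to the counts shown above, 100 undergrad classes out of 250 is 40% of the schedule.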

In [24]:
# how many undergrad classes on a Friday?
friday = classes[ (classes['Type'] == 'UGrad') & (classes['Days'].str.find('F')>=0 ) ]
friday


Unnamed: 0,Course,Section,ClassNo,Credits,Title,Instructor,Time,Days,Room,Subject,Number,Type
7,IDS403,M001,37431,1.0,Startup Sandbox,John DuRoss Liddy,2:15pm - 5:05pm,F,Syracuse Technology Garden,IDS,403,UGrad
16,IST195,M003,37443,3.0,LAB: Information Technologies,Jeff Rubin,9:30am - 10:25am,F,Hinds Hall 010,IST,195,UGrad
17,IST195,M004,37444,3.0,LAB: Information Technologies,Jeff Rubin,10:35am - 11:30am,F,Hinds Hall 010,IST,195,UGrad
18,IST195,M005,37445,3.0,LAB: Information Technologies,Jeff Rubin,11:40am - 12:35pm,F,Hinds Hall 010,IST,195,UGrad
19,IST195,M006,37446,3.0,LAB: Information Technologies,Jeff Rubin,12:45pm - 1:40pm,F,Hinds Hall 010,IST,195,UGrad
20,IST195,M007,37447,3.0,LAB: Information Technologies,Jeff Rubin,1:50pm - 2:45pm,F,Hinds Hall 010,IST,195,UGrad
21,IST195,M008,37452,3.0,LAB: Information Technologies,Jeff Rubin,2:55pm - 3:50pm,F,Hinds Hall 010,IST,195,UGrad
22,IST195,M002,37488,3.0,LAB: Information Technologies,Jeff Rubin,8:25am - 9:20am,F,Hinds Hall 010,IST,195,UGrad
25,IST233,M003,37448,3.0,LAB: Intro to Computer Networking,S Bruce Boardman,10:35am - 11:30am,F,Hinds Hall 027,IST,233,UGrad
26,IST233,M004,37449,3.0,LAB: Intro to Computer Networking,S Bruce Boardman,11:40am - 12:35pm,F,Hinds Hall 027,IST,233,UGrad


In [26]:
# let's get rid of those pesky LAB sections!!!
# which undergrad Friday classes are not labs?
friday_no_lab = friday[ ~friday['Title'].str.startswith('LAB:')]
friday_no_lab


Unnamed: 0,Course,Section,ClassNo,Credits,Title,Instructor,Time,Days,Room,Subject,Number,Type
7,IDS403,M001,37431,1.0,Startup Sandbox,John DuRoss Liddy,2:15pm - 5:05pm,F,Syracuse Technology Garden,IDS,403,UGrad
50,IST337,M001,41702,1.0,IM&T Support Practicum,Jeffrey Fouts,2:15pm - 5:00pm,F,Hinds Hall 117,IST,337,UGrad


In [27]:
# Looking for more classes to avoid? How about 8AM classes?
eight_am = classes[ classes['Time'].str.startswith('8:00am')]
eight_am

Unnamed: 0,Course,Section,ClassNo,Credits,Title,Instructor,Time,Days,Room,Subject,Number,Type
35,IST256,M008,37583,3.0,Appl.Prog.For Information Syst,Avinash Kadaji,8:00am - 9:20am,W,Hinds Hall 011,IST,256,UGrad
39,IST263,M002,37440,3.0,Intro to Front-End Web Dev,Christian A Kirkegaard,8:00am - 9:20am,TuTh,Hinds Hall 021,IST,263,UGrad
45,IST335,M002,37393,3.0,Intro/Info Based Organizations,Blythe Scherrer,8:00am - 9:20am,MW,Hinds Hall 117,IST,335,UGrad
69,IST352,M005,37475,3.0,Info Analysis of Org. Systems,P Douglas Taber,8:00am - 9:20am,MW,Hinds Hall 111,IST,352,UGrad
70,IST352,M006,37476,3.0,Info Analysis of Org. Systems,Alexander Corsello,8:00am - 9:20am,TuTh,Hinds Hall 117,IST,352,UGrad
72,IST359,M003,37421,3.0,Intro to Data Base Mgmt Systs,Blythe Scherrer,8:00am - 9:20am,TuTh,Hinds Hall 010,IST,359,UGrad
95,IST466,M001,37399,3.0,Prof Issues/Info Mgmt & Tech,Marcene S. Sonneborn,8:00am - 9:20am,MW,Hinds Hall 021,IST,466,UGrad
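
The 8AM filter does not restrict on `Type`, and a string test such as `startswith()` returns a missing value for any row whose text is missing, which boolean indexing rejects. A hedged sketch that combines both conditions and treats missing values as non-matches follows. The three rows are made up for illustration, and the missing `Time` in particular is an assumption, not something seen in the scraped data.

```python
import pandas as pd

# Made-up rows; the None in Time is an assumed missing value.
classes = pd.DataFrame({
    'Course': ['IST335', 'IST486', 'IST618'],
    'Time': ['8:00am - 9:20am', None, '12:00am - 10:30pm'],
    'Type': ['UGrad', 'UGrad', 'Grad'],
})

# na=False makes a missing Time count as "does not match" instead of NaN.
starts_at_8 = classes['Time'].str.startswith('8:00am', na=False)

# Combine with the undergrad condition; each condition is a boolean Series.
eight_am_ugrad = classes[(classes['Type'] == 'UGrad') & starts_at_8]
print(eight_am_ugrad)
```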
