# Web Crawling & JSON <br>
This week's homework focuses on web crawling to collect data: specifically, crawling to get JSON files and then making sense of those files.
<br> <br>
There are three (3) parts to this assignment.
- Parts 1 and 2 are guided and/or taken from the lessons in class.
- Part 3 is open-ended: there is more than one way to do it, and it can be more challenging.

## Assignment Case Study: NYU Gallatin Courses

<img src = 'gallatin_web.png'>

Imagine you're a student at NYU Gallatin and you would like to find courses at Gallatin. 
<br>
To do so, you have access to the website above. But even as you filter your choices, you realize that you still don't have a clear picture of which courses are available on any given day and time.
<br>
You tell a friend about this, and the two of you figure that, since you have both taken an introductory class, there has to be a programmatic way to harvest this data and make sense of which classes to take.
<br>
You start web crawling the website and get a message from IT that you are abusing resources and, more specifically, violating the **Policy on Responsible Use of NYU Computers and Data**.
<br>
As you scratch your head, IT Services refers you to a website on accessing data and you realize that there is a JSON representation of this data at: http://gallatin.nyu.edu/academics/courses/jcr:content/content/search.json?
<br>
<img src = 'gallatinjson.png'>

## Part 1 - Working with remote files
In this part, the goal is to enable you to take remote, web-hosted information as your data source and work with it locally.

In [45]:
# Question 1 - To run the scripts below, you need to import a few modules. Please import them.

import requests
from bs4 import BeautifulSoup
import json

url = 'http://gallatin.nyu.edu/academics/courses/jcr:content/content/search.json?'
resp = requests.get(url)
resp.json()

{'totalMatches': '4623',
 '1': {'course': 'IDSEM-UG1752',
  'title': 'This Mediated Life: An Introduction to the Study of Mass Media',
  'credit': '4',
  'foundation-libarts': 'HUM',
  'level': 'U',
  'term': 'WI',
  'type': 'Interdisciplinary Seminars (IDSEM-UG)',
  'year': '2019',
  'section': '001',
  'description': 'This interdisciplinary seminar will provide an intensive introduction to the study of mass media. Utilizing wide ranging critical and theoretical methodologies, the course will consider how media alternately reflects and forms our sense of politics, economics, race, gender, sexuality and citizenship. The course will be concerned with questions such as: What function does mass media serve for society? How does a media saturated cultural environment shape our identity? How do mass media forms delineate and naturalize prevailing ideologies and ways of being in the world? Can media provide a means to challenge cultural and political hegemony? Readings will be drawn from Ber

In [46]:
# Let's assign the JSON response to the name gallatin
gallatin = resp.json()
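
Since the goal of this part is to use remote data locally, one option is to save the parsed response to a file once and reload it later instead of requesting the URL again. The sketch below is only an illustration: `save_json`, `load_json`, the file name, and the small `sample` dictionary are hypothetical stand-ins, and in the notebook the dictionary would come from `resp.json()`.

```python
import json
import tempfile
from pathlib import Path


def save_json(data, path):
    """Write an already-parsed JSON object (dict or list) to a local file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(data, handle, ensure_ascii=False, indent=2)


def load_json(path):
    """Read a local JSON file back into Python objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Hypothetical stand-in for the dictionary returned by resp.json()
sample = {
    "totalMatches": "2",
    "1": {"course": "IDSEM-UG1752", "credit": "4"},
    "2": {"course": "IDSEM-UG1938", "credit": "4"},
}

# Hypothetical file name inside a fresh temporary directory
cache_path = Path(tempfile.mkdtemp()) / "gallatin_sample.json"

save_json(sample, cache_path)
restored = load_json(cache_path)

print(restored == sample)  # the round trip preserves the dictionary
```

Working from a local copy also means repeated experiments do not send repeated requests to the server.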

Inspecting the result above, we see that the key 'totalMatches' is of no use to us, so we want to get rid of it. <br><br>
Checking StackOverflow, we find that the dictionary method pop(), e.g. dictionary_name.pop('key', None), removes a key from a dictionary, and that the method keys() lists all the keys in the dictionary. <br><br>
With this in mind, can we remove the 'totalMatches' key from gallatin and show which keys remain?

In [351]:
# Question 2 - Remove totalMatches and show the keys that are left
gallatin.pop('totalMatches', None)
gallatin.keys()

dict_keys(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50'])
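
To see what `pop` and `keys` do in isolation, here is a small sketch on a hand-made dictionary. The `sample` data is a hypothetical stand-in with the same shape as `gallatin`. It also shows that the keys are strings, so sorting them as text does not give numeric order.

```python
# Hypothetical stand-in with the same shape as the gallatin dictionary
sample = {
    "totalMatches": "3",
    "10": {"course": "FIRST-UG794"},
    "2": {"course": "IDSEM-UG1938"},
    "1": {"course": "IDSEM-UG1752"},
}

removed = sample.pop("totalMatches", None)  # returns the removed value
again = sample.pop("totalMatches", None)    # key is already gone: returns None, no KeyError

remaining = list(sample.keys())             # keys keep their insertion order
as_text = sorted(sample)                    # string sort: '10' comes before '2'
as_numbers = sorted(sample, key=int)        # numeric sort of the same string keys

print(removed, again)
print(remaining)
print(as_text)
print(as_numbers)
```

Without the second argument, `pop` raises a `KeyError` when the key is missing, which is why the default of `None` is convenient here.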

In [48]:
# Check what it looks like now
gallatin

{'1': {'course': 'IDSEM-UG1752',
  'title': 'This Mediated Life: An Introduction to the Study of Mass Media',
  'credit': '4',
  'foundation-libarts': 'HUM',
  'level': 'U',
  'term': 'WI',
  'type': 'Interdisciplinary Seminars (IDSEM-UG)',
  'year': '2019',
  'section': '001',
  'description': 'This interdisciplinary seminar will provide an intensive introduction to the study of mass media. Utilizing wide ranging critical and theoretical methodologies, the course will consider how media alternately reflects and forms our sense of politics, economics, race, gender, sexuality and citizenship. The course will be concerned with questions such as: What function does mass media serve for society? How does a media saturated cultural environment shape our identity? How do mass media forms delineate and naturalize prevailing ideologies and ways of being in the world? Can media provide a means to challenge cultural and political hegemony? Readings will be drawn from Berger’s \xa0<em>Media Analy

## Part 2 - Working with JSON files

In [50]:
# Question 3 - From the looks of it, the response holds 50 courses, each stored under a key from '1' through '50'.
# The description of each course doesn't matter to you at this point. Please remove the 'description' key
# of the first course ('1') and then display the course:
gallatin['1'].pop('description', None)
gallatin['1']

{'course': 'IDSEM-UG1752',
 'title': 'This Mediated Life: An Introduction to the Study of Mass Media',
 'credit': '4',
 'foundation-libarts': 'HUM',
 'level': 'U',
 'term': 'WI',
 'type': 'Interdisciplinary Seminars (IDSEM-UG)',
 'year': '2019',
 'section': '001',
 'days': 'Mon Tue Wed Thu ',
 'times': '10:00 AM - 1:30 PM',
 'days2': '',
 'instructors': [{'Julian Cornell': '/content/gallatin/en/people/faculty/jc266'}],
 'notes': ''}

In [51]:
# Question 4: Based on question 3, can you now remove all the 'description' keys in your gallatin dictionary?
# Please display the result

for key in list(gallatin.keys()):
    gallatin[key].pop('description', None)
    
gallatin


{'1': {'course': 'IDSEM-UG1752',
  'title': 'This Mediated Life: An Introduction to the Study of Mass Media',
  'credit': '4',
  'foundation-libarts': 'HUM',
  'level': 'U',
  'term': 'WI',
  'type': 'Interdisciplinary Seminars (IDSEM-UG)',
  'year': '2019',
  'section': '001',
  'days': 'Mon Tue Wed Thu ',
  'times': '10:00 AM - 1:30 PM',
  'days2': '',
  'instructors': [{'Julian Cornell': '/content/gallatin/en/people/faculty/jc266'}],
  'notes': ''},
 '2': {'course': 'IDSEM-UG1938',
  'title': 'What Do We Study When We Study Religion?',
  'credit': '4',
  'foundation-libarts': 'HUM',
  'level': 'U',
  'term': 'WI',
  'type': 'Interdisciplinary Seminars (IDSEM-UG)',
  'year': '2019',
  'section': '001',
  'days': 'Mon Tue Wed Thu ',
  'times': '10:00 AM - 1:30 PM',
  'days2': '',
  'instructors': [{'Gregory Erickson': '/content/gallatin/en/people/faculty/gte1'}],
  'notes': ''},
 '3': {'course': 'ARTS-UG1485',
  'title': 'Beyond Picture Perfect: Personal Choice in a Digital World',
  

In [52]:
# Question 5 - Do the same thing: get rid of the keys 'days2' and 'notes', then display gallatin

for key in list(gallatin.keys()):
    gallatin[key].pop('days2', None)
    gallatin[key].pop('notes', None)
    
gallatin

{'1': {'course': 'IDSEM-UG1752',
  'title': 'This Mediated Life: An Introduction to the Study of Mass Media',
  'credit': '4',
  'foundation-libarts': 'HUM',
  'level': 'U',
  'term': 'WI',
  'type': 'Interdisciplinary Seminars (IDSEM-UG)',
  'year': '2019',
  'section': '001',
  'days': 'Mon Tue Wed Thu ',
  'times': '10:00 AM - 1:30 PM',
  'instructors': [{'Julian Cornell': '/content/gallatin/en/people/faculty/jc266'}]},
 '2': {'course': 'IDSEM-UG1938',
  'title': 'What Do We Study When We Study Religion?',
  'credit': '4',
  'foundation-libarts': 'HUM',
  'level': 'U',
  'term': 'WI',
  'type': 'Interdisciplinary Seminars (IDSEM-UG)',
  'year': '2019',
  'section': '001',
  'days': 'Mon Tue Wed Thu ',
  'times': '10:00 AM - 1:30 PM',
  'instructors': [{'Gregory Erickson': '/content/gallatin/en/people/faculty/gte1'}]},
 '3': {'course': 'ARTS-UG1485',
  'title': 'Beyond Picture Perfect: Personal Choice in a Digital World',
  'credit': '4',
  'level': 'U',
  'term': 'WI',
  'type': 'Ar
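
The loops in Questions 4 and 5 follow the same pattern, so they could be folded into one helper. This is a minimal sketch, not the assignment's required solution: `drop_keys` and the `sample` data are hypothetical, and the second record deliberately lacks a 'notes' key to show that `pop` with a default tolerates it.

```python
def drop_keys(records, unwanted):
    """Remove every key in `unwanted` from each sub-dictionary, in place."""
    for record in records.values():
        for key in unwanted:
            record.pop(key, None)  # the default avoids a KeyError for absent keys
    return records


# Hypothetical stand-in with the same shape as the gallatin dictionary
sample = {
    "1": {"course": "IDSEM-UG1752", "days": "Mon Tue Wed Thu ", "days2": "", "notes": ""},
    "2": {"course": "ARTS-UG1485", "days": "Mon Tue Wed Thu ", "days2": ""},
}

drop_keys(sample, ["days2", "notes"])
print(sample)
```

Iterating over `records.values()` is safe here because only the inner dictionaries change size. Removing keys from the outer dictionary while looping over it would raise a `RuntimeError`, which is when copying the keys with `list(...)` first becomes necessary.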

#### Coding lesson:
#### As you will notice, popping keys does not solve everything: some of the data will have missing values or even missing keys.
#### For example, let us print all the instructors:

In [53]:
for key in gallatin.keys():                  # iterate over the keys '1' through '50'
    print(gallatin[key]['instructors'])      # look up the instructors of each course (sub-dictionary)

[{'Julian Cornell': '/content/gallatin/en/people/faculty/jc266'}]
[{'Gregory Erickson': '/content/gallatin/en/people/faculty/gte1'}]
[{'Jeff Day': '/content/gallatin/en/people/faculty/jmd20'}]
[{'Irene Han': '/content/gallatin/en/people/faculty/ih472'}]
[{'Pedro Cristiani': '/content/gallatin/en/people/faculty/pc78'}]
[{'Karen Holmberg': '/content/gallatin/en/people/faculty/kgh1'}]
[{'Jeff Day': '/content/gallatin/en/people/faculty/jmd20'}]
[{'June Foley': '/content/gallatin/en/people/faculty/jaf3'}]
[{'Alex Halberstadt': '/content/gallatin/en/people/faculty/alh18'}]
[{'Christina M. Squitieri': '/content/gallatin/en/people/faculty/cms531'}]
[{'Kwami Coleman': '/content/gallatin/en/people/faculty/ktc4'}]
[{'George Shulman': '/content/gallatin/en/people/faculty/gms1'}]
[{'David Sugarman': '/content/gallatin/en/people/faculty/dss368'}]


KeyError: 'instructors'

In [62]:
# Notice from the output above that the last course printed before the error is David Sugarman's

for key in gallatin.keys():                          # iterate over the keys '1' through '50'
    try:
        print(gallatin[key]['instructors'])          # get the instructors of each course (sub-dictionary)
    except KeyError:                                 # this course has no 'instructors' key
        print('None')

[{'Julian Cornell': '/content/gallatin/en/people/faculty/jc266'}]
[{'Gregory Erickson': '/content/gallatin/en/people/faculty/gte1'}]
[{'Jeff Day': '/content/gallatin/en/people/faculty/jmd20'}]
[{'Irene Han': '/content/gallatin/en/people/faculty/ih472'}]
[{'Pedro Cristiani': '/content/gallatin/en/people/faculty/pc78'}]
[{'Karen Holmberg': '/content/gallatin/en/people/faculty/kgh1'}]
[{'Jeff Day': '/content/gallatin/en/people/faculty/jmd20'}]
[{'June Foley': '/content/gallatin/en/people/faculty/jaf3'}]
[{'Alex Halberstadt': '/content/gallatin/en/people/faculty/alh18'}]
[{'Christina M. Squitieri': '/content/gallatin/en/people/faculty/cms531'}]
[{'Kwami Coleman': '/content/gallatin/en/people/faculty/ktc4'}]
[{'George Shulman': '/content/gallatin/en/people/faculty/gms1'}]
[{'David Sugarman': '/content/gallatin/en/people/faculty/dss368'}]
None
[{'Orna Ophir': '/content/gallatin/en/people/faculty/oo10'}]
[{'Jacob Remes': '/content/gallatin/en/people/faculty/jar31'}]
[{'Keith Miller': '/conten
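
An alternative to `try`/`except` for optional keys is `dict.get`, which returns `None` (or a default you choose) instead of raising `KeyError`. The sketch below uses a hypothetical two-course `sample` in which the second course, with a made-up course code, has no 'instructors' key.

```python
# Hypothetical stand-in: the second course has no 'instructors' key
sample = {
    "1": {
        "course": "IDSEM-UG1752",
        "instructors": [{"Julian Cornell": "/content/gallatin/en/people/faculty/jc266"}],
    },
    "2": {"course": "EXAMPLE-UG0000"},  # made-up course code for illustration
}

# .get returns None when the key is absent, so no exception handling is needed
found = {key: record.get("instructors") for key, record in sample.items()}

# A membership test lists exactly which courses lack the key
missing = [key for key, record in sample.items() if "instructors" not in record]

print(found)
print(missing)
```

Which style to prefer is a judgment call: `get` is compact for simple lookups, while `try`/`except KeyError` keeps the error case explicit.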

In [63]:
# Let us take the last item of these instructors and call it z
#z = [{'Maria-Luisa Achino-Loeb': '/content/gallatin/en/people/faculty/mal6'}]
z = [{'Andrea Gadberry': '/content/gallatin/en/people/faculty/alg16'}]

# Since we know the list holds only one dictionary, let us convert that first item (the dictionary) into a list of its keys
z_name = list(z[0])

In [64]:
# and now we can get just the name in a list
z_name

['Andrea Gadberry']

In [65]:
# then the name as a string
z_name[0]

'Andrea Gadberry'

In [66]:
# Given that, let us create a function
def getInstructorName(instructor):
    return list(instructor[0])[0]

In [67]:
getInstructorName([{'Kristoffer Diaz': '/content/gallatin/en/people/faculty/kkd2000'}])

'Kristoffer Diaz'
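
`getInstructorName` returns only the first name of the first dictionary. The data shown so far has one instructor per course, but if a course ever listed several (an assumption, not something visible in the output above), a variant could collect every name. `get_instructor_names` is a hypothetical helper for illustration, and the second faculty path in the usage lines is made up.

```python
def get_instructor_names(instructors):
    """Return every instructor name found in a list of {name: path} dictionaries."""
    # Iterating over a dictionary yields its keys, which are the names here
    return [name for entry in instructors for name in entry]


one = get_instructor_names(
    [{"Kristoffer Diaz": "/content/gallatin/en/people/faculty/kkd2000"}]
)
two = get_instructor_names(
    [
        {"Jeff Day": "/content/gallatin/en/people/faculty/jmd20"},
        {"Example Person": "/example/path"},  # made-up second instructor
    ]
)
none = get_instructor_names([])  # an empty list gives an empty result, not an error

print(one, two, none)
```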

In [68]:
# Question 6 - Using a list, could you store the names of all the instructors, using the getInstructorName function and the
# loop above?

instructorlist = []

for key in gallatin.keys():
    try:
        instructorlist.append(getInstructorName(gallatin[key]['instructors']))
    except (KeyError, IndexError):      # no 'instructors' key, or an empty instructors list
        instructorlist.append('None')
    
instructorlist

['Julian Cornell',
 'Gregory Erickson',
 'Jeff Day',
 'Irene Han',
 'Pedro Cristiani',
 'Karen Holmberg',
 'Jeff Day',
 'June Foley',
 'Alex Halberstadt',
 'Christina M. Squitieri',
 'Kwami Coleman',
 'George Shulman',
 'David Sugarman',
 'None',
 'Orna Ophir',
 'Jacob Remes',
 'Keith Miller',
 'Karen Hornick',
 'None',
 'Lenora Champagne',
 'Adrian Versteegh',
 'Danielle Inkpen',
 'Stacy Pies',
 'Julie Avina',
 'Paul Thaler',
 'Simon Fortin',
 'Gianpaolo Baiocchi',
 'None',
 'Barton Bishop',
 'Melissa Turoff',
 'Gene Cittadino',
 'Irene Morrison-Moncure',
 'Valerie Forman',
 'Sara Murphy',
 'Judith Sloan',
 'Christopher Trogan',
 'None',
 'Allen Keller',
 'Diane Wong',
 'Roy Nathanson',
 'Todd Porterfield',
 'Meera Nair',
 'Leon Grek',
 'Vasuki Nesiah',
 'Bill Rayner',
 'Jim Tolisano',
 'Kristoffer Diaz',
 'Karen Hornick',
 'Christopher Bram',
 'Andrea Gadberry']

## Part 3 - Making the data more useful (a bit more involved)

In [215]:
# As a helper, you can start with an empty DataFrame
import pandas as pd
df = pd.DataFrame()
df

In [402]:
for key in gallatin.keys():
    gallatin[key].pop('foundation-libarts', None)
    gallatin[key].pop('foundation-histcult', None)
    gallatin[key].pop('section', None)
    gallatin[key].pop('instructors', None)

df1 = pd.DataFrame.from_dict(gallatin, orient='index')    #convert gallatin dictionary to dataframe
indexlist = df1.index.values.tolist()                     #create list containing index df1 index values
new_index = [int(i)-1 for i in indexlist]                 #convert df1 index values from string to int, and prep to match df2 indices
df1 = df1.set_index([new_index])                          #replace df1 string index values with new int index values
df1 = df1.sort_index()                                    #sort by index values

df2 = pd.DataFrame([instructorlist])                      #convert instructorlist list to dataframe
df2 = df2.transpose()                                     #transpose dataframe
df2 = df2.rename(columns={df2.columns[0]: "instructor" }) #name column "instructor"

df = pd.concat([df1, df2], axis=1, sort=False)
df

| | course | title | credit | level | term | type | year | days | times | instructor |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | IDSEM-UG1752 | This Mediated Life: An Introduction to the Stu... | 4 | U | WI | Interdisciplinary Seminars (IDSEM-UG) | 2019 | Mon Tue Wed Thu | 10:00 AM - 1:30 PM | Julian Cornell |
| 1 | IDSEM-UG1938 | What Do We Study When We Study Religion? | 4 | U | WI | Interdisciplinary Seminars (IDSEM-UG) | 2019 | Mon Tue Wed Thu | 10:00 AM - 1:30 PM | Gregory Erickson |
| 2 | ARTS-UG1485 | Beyond Picture Perfect: Personal Choice in a D... | 4 | U | WI | Arts Workshops (ARTS-UG) | 2019 | Mon Tue Wed Thu | 2:00 PM - 5:30 PM | Jeff Day |
| 3 | IDSEM-UG2012 | Plato&rsquo;s Republic | 4 | U | WI | Interdisciplinary Seminars (IDSEM-UG) | 2019 | Mon Tue Wed Thu | 2:00 PM - 5:30 PM | Irene Han |
| 4 | ARTS-UG1568 | Television, Now----Mapping An Original Show | 4 | U | WI | Arts Workshops (ARTS-UG) | 2019 | Mon Tue Wed Thu | 2:00 PM - 5:30 PM | Pedro Cristiani |
| 5 | IDSEM-UG2004 | NYC Coastlines: Past, Present, and Future | 4 | U | SP | Interdisciplinary Seminars (IDSEM-UG) | 2019 | Mon Wed | 11:00 AM - 12:15 PM | Karen Holmberg |
| 6 | ARTS-UG1480 | Photograph New York, Create Your Vision | 4 | U | SP | Arts Workshops (ARTS-UG) | 2019 | Tue | 6:20 PM - 9:00 PM | Jeff Day |
| 7 | IDSEM-UG1811 | Desperate Housewives of the 19th-Century Novel | 4 | U | SP | Interdisciplinary Seminars (IDSEM-UG) | 2019 | Wed | 3:30 PM - 6:10 PM | June Foley |
| 8 | WRTNG-UG1024 | Magazine Writing | 4 | U | SP | Advanced Writing Courses (WRTNG-UG) | 2019 | Fri | 11:00 AM - 1:45 PM | Alex Halberstadt |
| 9 | FIRST-UG794 | First-Year Research Seminar: Utopian Literatur... | 4 | U | SP | First-Year Program: Research Seminars (FIRST-UG) | 2019 | Tue Thu | 2:00 PM - 3:15 PM | Christina M. Squitieri |
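
Since Part 3 allows more than one approach, here is one alternative sketch that avoids aligning two DataFrames by index: attach the instructor name to each course record first, in numeric key order, and only then hand the rows to pandas. `to_rows` and the `sample` data are hypothetical, the third course code is made up, and the code below uses only plain Python so the reshaping step can be checked on its own.

```python
def to_rows(records, missing="None"):
    """Flatten {key: course} into a list of row dictionaries ordered by numeric key."""
    rows = []
    for key in sorted(records, key=int):            # '1', '2', ..., '10' in numeric order
        record = records[key]
        # Copy every field except the nested instructors list
        row = {name: value for name, value in record.items() if name != "instructors"}
        # Fall back to a list holding one empty dict when the key is absent or empty
        instructors = record.get("instructors") or [{}]
        # First key of the first dictionary is the name; use `missing` if there is none
        row["instructor"] = next(iter(instructors[0]), missing)
        rows.append(row)
    return rows


# Hypothetical stand-in with the same shape as the gallatin dictionary
sample = {
    "2": {
        "course": "IDSEM-UG1938",
        "term": "WI",
        "instructors": [{"Gregory Erickson": "/content/gallatin/en/people/faculty/gte1"}],
    },
    "1": {
        "course": "IDSEM-UG1752",
        "term": "WI",
        "instructors": [{"Julian Cornell": "/content/gallatin/en/people/faculty/jc266"}],
    },
    "3": {"course": "EXAMPLE-UG0000", "term": "SP"},  # made-up course without instructors
}

rows = to_rows(sample)
for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)`, which assigns a default integer index starting at 0. Unlike the cell above, this sketch leaves the original dictionary unchanged.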
