In [2]:
import urllib2
import json
import pandas as pd

### Download course catalogue
Next, we access the Coursera API and download the course catalogue. For each course, we are interested in three fields:

shortName: The short name associated with the course.
name: The course name or title.
language: The language code for the course. (e.g. 'en' means English.)

We also include the universities and categories parameters in the query. This returns the ids that match each course with its corresponding universities and categories. Below are the Python commands to do so.

In [3]:
courses_response = urllib2.urlopen('https://api.coursera.org/api/catalog.v1/courses?fields=shortName,name,language&includes=universities,categories')
courses_data = json.load(courses_response)
courses_data = courses_data['elements']
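The query string above is easier to read and modify when it is assembled from its parts. The sketch below does this with Python 3's `urllib.parse` (the notebook itself uses Python 2's `urllib2`). The helper name `build_catalog_url` is hypothetical, and no request is sent.

```python
from urllib.parse import urlencode

CATALOG_BASE = 'https://api.coursera.org/api/catalog.v1/'

def build_catalog_url(resource, fields=(), includes=()):
    """Assemble a catalog URL from a resource name and optional parameters."""
    params = {}
    if fields:
        params['fields'] = ','.join(fields)
    if includes:
        params['includes'] = ','.join(includes)
    url = CATALOG_BASE + resource
    if params:
        # safe=',' keeps the commas readable instead of encoding them as %2C
        url += '?' + urlencode(params, safe=',')
    return url

courses_url = build_catalog_url(
    'courses',
    fields=['shortName', 'name', 'language'],
    includes=['universities', 'categories'],
)
print(courses_url)
```

Calling the helper with only a resource name, such as `build_catalog_url('categories')`, yields the bare endpoint without a query string.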

If we want to get the data about the first course in the courses_data list, we simply execute the command below.

In [4]:
courses_data[0]

{u'id': 2163,
 u'language': u'en',
 u'links': {u'categories': [8, 10, 19, 20], u'universities': [65]},
 u'name': u'The Land Ethic Reclaimed: Perceptive Hunting, Aldo Leopold, and Conservation',
 u'shortName': u'perceptivehunting'}

The first course in the courses_data list is 'The Land Ethic Reclaimed: Perceptive Hunting, Aldo Leopold, and Conservation', offered in English by the university with id=65, under categories 8, 10, 19 and 20.

### Download universities and categories data
Next, we retrieve the universities and categories data from the Coursera API. For the universities data, we are interested in the university name and its location.

In [5]:
universities_response = urllib2.urlopen('https://api.coursera.org/api/catalog.v1/universities?fields=name,locationCountry')
universities_data = json.load(universities_response)
universities_data = universities_data['elements']

We can get the data about the first university from universities_data by executing the following command:

In [6]:
universities_data[0]

{u'id': 234,
 u'links': {},
 u'locationCountry': u'CN',
 u'name': u"Xi'an Jiaotong University",
 u'shortName': u'xjtu'}

Xi'an Jiaotong University is the first university on the list.

### Download categories data
Similarly, we can get the course categories data by executing the following commands.

In [7]:
categories_response = urllib2.urlopen('https://api.coursera.org/api/catalog.v1/categories')
categories_data = json.load(categories_response)
categories_data = categories_data['elements']

We can get the data about the first category from categories_data by executing the following command:

In [8]:
categories_data[0]

{u'id': 5, u'links': {}, u'name': u'Mathematics', u'shortName': u'math'}

## Structuring the data

In this section, we will structure courses_data, universities_data and categories_data into pandas DataFrames, and we will map the university and category ids to the corresponding names. By the end of this section, we will have one pandas DataFrame called courses_df that holds all the necessary data in a well-structured format.

### Putting the data into Pandas DataFrames

First, we start by creating a pandas DataFrame for the courses data.

In [9]:
courses_df = pd.DataFrame()

Next, we add the course_name, course_language, course_short_name, categories and universities columns to the courses_df DataFrame.

In [10]:
courses_df['course_name'] = map(lambda course_data: course_data['name'], courses_data)
courses_df['course_language'] = map(lambda course_data: course_data['language'], courses_data)
courses_df['course_short_name'] = map(lambda course_data: course_data['shortName'], courses_data)
courses_df['categories'] = map(lambda course_data: course_data['links']['categories'] if 'categories' in course_data['links'] else [], courses_data)
courses_df['universities'] = map(lambda course_data: course_data['links']['universities'] if 'universities' in course_data['links'] else [], courses_data)
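An equivalent way to build the same columns is to flatten each record into a plain dict first and hand the whole list to `pd.DataFrame`. The sketch below is a minimal illustration in Python 3 syntax. `sample_courses` is made-up data shaped like the API response shown earlier, and `flatten_course` is a hypothetical helper name.

```python
import pandas as pd

# Made-up records that mimic the shape of the catalog response
sample_courses = [
    {'id': 1, 'name': 'Course A', 'language': 'en', 'shortName': 'coursea',
     'links': {'categories': [5, 10], 'universities': [234]}},
    {'id': 2, 'name': 'Course B', 'language': 'fr', 'shortName': 'courseb',
     'links': {}},
]

def flatten_course(course):
    """Turn one API record into a flat row; missing links become empty lists."""
    links = course.get('links', {})
    return {
        'course_name': course['name'],
        'course_language': course['language'],
        'course_short_name': course['shortName'],
        'categories': links.get('categories', []),
        'universities': links.get('universities', []),
    }

sample_df = pd.DataFrame([flatten_course(c) for c in sample_courses])
print(sample_df)
```

Using `dict.get` with a default replaces the `if ... in ... else []` conditionals, and each record is traversed once rather than once per column.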

We can print the first 5 rows from the courses_df DataFrame by executing the command below.

In [11]:
courses_df.head()

Unnamed: 0,course_name,course_language,course_short_name,categories,universities
0,"The Land Ethic Reclaimed: Perceptive Hunting, ...",en,perceptivehunting,"[8, 10, 19, 20]",[65]
1,"Contraception: Choices, Culture and Consequences",en,contraception,"[3, 8]",[10]
2,Introduction to Computational Arts: Processing,en,compartsprocessing,"[1, 4, 18, 22]",[117]
3,Introduction to Programming with MATLAB,en,matlab,"[12, 15]",[37]
4,Experimentation for Improvement,en,experiments,"[4, 5, 15, 16]",[148]


Similarly, we create a pandas DataFrame for the universities data, and we add the university_id, university_name and university_location_country columns to it.

In [12]:
universities_df = pd.DataFrame()
universities_df['university_id'] = map(lambda university_data: university_data['id'], universities_data)
universities_df['university_name'] = map(lambda university_data: university_data['name'], universities_data)
universities_df['university_location_country'] = map(lambda university_data: university_data['locationCountry'], universities_data)

In [13]:
universities_df.head()

Unnamed: 0,university_id,university_name,university_location_country
0,234,Xi'an Jiaotong University,CN
1,120,University of New Mexico,US
2,10,"University of California, San Francisco",US
3,56,"University of California, Santa Cruz",US
4,24,Hebrew University of Jerusalem,


Next, we change the universities_df index to university_id.

In [14]:
universities_df = universities_df.set_index('university_id')

This makes university_id the column used for indexing. We can print the first 5 rows from the universities_df DataFrame by executing the command below.

In [15]:
universities_df.head()

university_id,university_name,university_location_country
234,Xi'an Jiaotong University,CN
120,University of New Mexico,US
10,"University of California, San Francisco",US
56,"University of California, Santa Cruz",US
24,Hebrew University of Jerusalem,
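Indexing by id is what makes direct lookups possible later on. A minimal sketch of such a lookup, in Python 3 syntax, on a two-row made-up `sample_universities_df` (a hypothetical name) that copies values from the output above:

```python
import pandas as pd

sample_universities_df = pd.DataFrame({
    'university_id': [234, 120],
    'university_name': ["Xi'an Jiaotong University", 'University of New Mexico'],
    'university_location_country': ['CN', 'US'],
}).set_index('university_id')

# With the id as index, .loc takes the id as the row label
name = sample_universities_df.loc[120, 'university_name']
print(name)

# Membership tests against the index tell us whether an id is known
print(234 in sample_universities_df.index)
print(999 in sample_universities_df.index)
```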


Next, we create a pandas DataFrame for the categories data, and we add the category_id and category_name columns to it.

In [16]:
categories_df = pd.DataFrame()
categories_df['category_id'] = map(lambda category_data: category_data['id'], categories_data)
categories_df['category_name'] = map(lambda category_data: category_data['name'], categories_data)

In [17]:
categories_df.head()

Unnamed: 0,category_id,category_name
0,5,Mathematics
1,10,Biology & Life Sciences
2,24,Chemistry
3,25,Energy & Earth Sciences
4,14,Education


Similarly, we change the categories_df index to category_id.

In [18]:
categories_df = categories_df.set_index('category_id')

We can print the first 5 rows from the categories_df DataFrame by executing the command below.

In [19]:
categories_df.head()

category_id,category_name
5,Mathematics
10,Biology & Life Sciences
24,Chemistry
25,Energy & Earth Sciences
14,Education


### Mapping ids with the corresponding names

In the courses_df DataFrame, the universities and the categories are referred to by their ids and not by their names. To change that, we start by defining a function that maps the ids to their corresponding names.

In [20]:
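def map_ids_names(ids_array, df, object_name):
    # Look up each id in the index of df and collect the value of the object_name column
    names_array = []
    for object_id in ids_array:
        try:
            names_array.append(df.loc[object_id, object_name])
        except KeyError:
            # Skip ids that have no matching row in df
            continue
    return names_array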

For example, we can print the categories with ids [4,5,15,16] by executing the command below.

In [21]:
map_ids_names([4,5,15,16], categories_df, 'category_name')

[u'Information, Tech & Design',
 u'Mathematics',
 u'Engineering',
 u'Statistics and Data Analysis']

Similarly, we can print the name of the university with id 234 by executing the command below.

In [22]:
map_ids_names([234], universities_df, 'university_name')

[u"Xi'an Jiaotong University"]
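Note that `map_ids_names` silently skips any id that is missing from the lookup DataFrame. The sketch below reproduces that behaviour on a tiny, made-up `sample_categories_df` (Python 3 syntax), using a plain dict built from the indexed column. `names_for_ids` is a hypothetical name, not part of the original notebook.

```python
import pandas as pd

sample_categories_df = pd.DataFrame({
    'category_id': [5, 10, 14],
    'category_name': ['Mathematics', 'Biology & Life Sciences', 'Education'],
}).set_index('category_id')

# Build an id -> name dict once, so each lookup is a plain dict access
id_to_name = sample_categories_df['category_name'].to_dict()

def names_for_ids(ids, lookup):
    """Return the names for the known ids, skipping ids absent from lookup."""
    return [lookup[i] for i in ids if i in lookup]

# 999 is not a known category id, so it is dropped from the result
print(names_for_ids([5, 999, 14], id_to_name))
```

Whether skipping unknown ids is desirable depends on the use case: it keeps the pipeline running, but it also hides ids that the categories or universities endpoints did not return.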

Next, we add both the category and university names to the courses_df DataFrame.

In [23]:
courses_df['categories_name'] = courses_df.apply(lambda row: map_ids_names(row['categories'], categories_df, 'category_name'), axis=1)
courses_df['universities_name'] = courses_df.apply(lambda row: map_ids_names(row['universities'], universities_df, 'university_name'), axis=1)

In [24]:
courses_df.head()

Unnamed: 0,course_name,course_language,course_short_name,categories,universities,categories_name,universities_name
0,"The Land Ethic Reclaimed: Perceptive Hunting, ...",en,perceptivehunting,"[8, 10, 19, 20]",[65],"[Health & Society, Biology & Life Sciences, Fo...",[University of Wisconsin–Madison]
1,"Contraception: Choices, Culture and Consequences",en,contraception,"[3, 8]",[10],"[Medicine, Health & Society]","[University of California, San Francisco]"
2,Introduction to Computational Arts: Processing,en,compartsprocessing,"[1, 4, 18, 22]",[117],"[Computer Science: Theory, Information, Tech &...",[The State University of New York]
3,Introduction to Programming with MATLAB,en,matlab,"[12, 15]",[37],"[Computer Science: Software Engineering, Engin...",[Vanderbilt University]
4,Experimentation for Improvement,en,experiments,"[4, 5, 15, 16]",[148],"[Information, Tech & Design, Mathematics, Engi...",[McMaster University]


## Adding course URLs to the data

The URL to each Coursera course looks like https://www.coursera.org/course/<shortName>. We can add the URLs to the courses_df DataFrame by executing the command below.

In [25]:
courses_df['course_url'] = 'https://www.coursera.org/course/' + courses_df['course_short_name']

We can print the first 5 rows from the courses_df DataFrame by executing the command below.

In [26]:
courses_df.head()

Unnamed: 0,course_name,course_language,course_short_name,categories,universities,categories_name,universities_name,course_url
0,"The Land Ethic Reclaimed: Perceptive Hunting, ...",en,perceptivehunting,"[8, 10, 19, 20]",[65],"[Health & Society, Biology & Life Sciences, Fo...",[University of Wisconsin–Madison],https://www.coursera.org/course/perceptivehunting
1,"Contraception: Choices, Culture and Consequences",en,contraception,"[3, 8]",[10],"[Medicine, Health & Society]","[University of California, San Francisco]",https://www.coursera.org/course/contraception
2,Introduction to Computational Arts: Processing,en,compartsprocessing,"[1, 4, 18, 22]",[117],"[Computer Science: Theory, Information, Tech &...",[The State University of New York],https://www.coursera.org/course/compartsproces...
3,Introduction to Programming with MATLAB,en,matlab,"[12, 15]",[37],"[Computer Science: Software Engineering, Engin...",[Vanderbilt University],https://www.coursera.org/course/matlab
4,Experimentation for Improvement,en,experiments,"[4, 5, 15, 16]",[148],"[Information, Tech & Design, Mathematics, Engi...",[McMaster University],https://www.coursera.org/course/experiments


## Getting social sharing counts

### Getting social counts from sharedcount.com

SharedCount is a service that looks up the number of times a given URL has been shared on major social networks. To get the share counts for any web page, go to sharedcount.com, enter the page URL in the input field and press the "Analyze" button. For example, this can be done for the very popular Machine Learning course from Stanford, which has the URL https://www.coursera.org/course/ml.

### Getting social counts using the sharedcount.com API

In our dataset, we want to get the social shares of more than 1000 courses. Looking up each course URL in the browser is not efficient. Fortunately, sharedcount.com provides an API that we can use to automate this process. To use the API, you need to create an account and get an API key. The API allows a daily quota of 10,000 requests, which is more than enough for this use case. We start by defining a function that returns the social shares from sharedcount.com.

In [27]:
def get_social_metrics(url, api_key):
    sharedcount_response = urllib2.urlopen('https://free.sharedcount.com/?url=' + url + '&apikey=' + api_key)
    return json.load(sharedcount_response)
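The course URL is itself passed as a query parameter, so it is safest to percent-encode it when building the request. A minimal sketch of doing so, in Python 3 syntax with `urllib.parse.urlencode`: `build_sharedcount_url` is a hypothetical helper, `'YOUR_API_KEY'` is a placeholder, and the endpoint is the one used in the function above. No request is sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SHAREDCOUNT_ENDPOINT = 'https://free.sharedcount.com/'

def build_sharedcount_url(page_url, api_key):
    """Build the request URL with both parameters percent-encoded."""
    query = urlencode({'url': page_url, 'apikey': api_key})
    return SHAREDCOUNT_ENDPOINT + '?' + query

request_url = build_sharedcount_url('https://www.coursera.org/course/ml', 'YOUR_API_KEY')
print(request_url)

# Decoding the query string recovers the original values
print(parse_qs(urlsplit(request_url).query))
```

Encoding matters mostly for page URLs that contain characters such as `&` or `?`, which would otherwise be read as part of the outer query string.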

An API key is required to run the next cell; see https://docs.sharedcount.com/ for how to obtain one.

In [29]:
SHAREDCOUNT_API_KEY = '2c8787d463e1be55a6ef68b1a341ccc0ccb41697'
courses_df['sharedcount_metrics'] = map(lambda course_url: get_social_metrics(course_url, SHAREDCOUNT_API_KEY), courses_df['course_url'])
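With more than 1000 requests in a row, a single failed call would abort the whole `map`. One option is to wrap the fetch in a helper that retries and falls back to `None`. The sketch below assumes this approach and is written in Python 3 syntax. `fetch_all` and `fake_fetch` are hypothetical names, and the fake fetcher stands in for `get_social_metrics` so the example runs without network access or an API key.

```python
import time

def fetch_all(urls, fetch, retries=2, pause_seconds=0.0):
    """Call fetch(url) for each URL, retrying on errors; failed URLs map to None."""
    results = {}
    for url in urls:
        results[url] = None
        for attempt in range(retries + 1):
            try:
                results[url] = fetch(url)
                break
            except IOError:
                # urllib network errors derive from IOError; wait, then retry
                time.sleep(pause_seconds)
    return results

calls = {'count': 0}

def fake_fetch(url):
    """Stand-in fetcher: one URL fails once, one URL always fails."""
    calls['count'] += 1
    if url.endswith('/flaky') and calls['count'] < 3:
        raise IOError('temporary failure')
    if url.endswith('/broken'):
        raise IOError('permanent failure')
    return {'Twitter': 1}

metrics = fetch_all(
    ['https://example.org/ok', 'https://example.org/flaky', 'https://example.org/broken'],
    fake_fetch,
)
print(metrics)
```

Rows whose value is `None` can then be filtered out or retried later, instead of losing the results already collected.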

In this tutorial, we will only use the social share counts from Twitter, LinkedIn and Facebook. We can add this information to the courses_df DataFrame by executing the commands below.

In [30]:
courses_df['twitter_count'] = map(lambda sharedcount: sharedcount['Twitter'], courses_df['sharedcount_metrics'])
courses_df['linkedin_count'] = map(lambda sharedcount: sharedcount['LinkedIn'], courses_df['sharedcount_metrics'])
courses_df['facebook_count'] = map(lambda sharedcount: sharedcount['Facebook']['total_count'], courses_df['sharedcount_metrics'])
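The lookups above raise a `KeyError` if a response lacks one of the networks. A more defensive variant can default missing counts to zero. The sketch below assumes the response shape implied by the code above (top-level `Twitter` and `LinkedIn` counts, and a nested `Facebook` dict with `total_count`). `extract_counts` is a hypothetical helper and the sample responses are made up.

```python
def extract_counts(metrics):
    """Pull the three counts out of one response, defaulting to 0 when absent."""
    metrics = metrics or {}
    facebook = metrics.get('Facebook') or {}
    return {
        'twitter_count': metrics.get('Twitter') or 0,
        'linkedin_count': metrics.get('LinkedIn') or 0,
        'facebook_count': facebook.get('total_count') or 0,
    }

complete = {'Twitter': 81, 'LinkedIn': 1, 'Facebook': {'total_count': 1058}}
partial = {'Twitter': 5}

print(extract_counts(complete))
print(extract_counts(partial))
print(extract_counts(None))
```

Defaulting to zero keeps the later sorting simple, at the cost of not distinguishing "no shares" from "no data".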

We can print the first 5 rows from the courses_df DataFrame by executing the command below.

In [31]:
courses_df.head()

Unnamed: 0,course_name,course_language,course_short_name,categories,universities,categories_name,universities_name,course_url,sharedcount_metrics,twitter_count,linkedin_count,facebook_count
0,"The Land Ethic Reclaimed: Perceptive Hunting, ...",en,perceptivehunting,"[8, 10, 19, 20]",[65],"[Health & Society, Biology & Life Sciences, Fo...",[University of Wisconsin–Madison],https://www.coursera.org/course/perceptivehunting,"{u'StumbleUpon': 0, u'Reddit': 0, u'Delicious'...",81,1,1058
1,"Contraception: Choices, Culture and Consequences",en,contraception,"[3, 8]",[10],"[Medicine, Health & Society]","[University of California, San Francisco]",https://www.coursera.org/course/contraception,"{u'StumbleUpon': 0, u'Reddit': 0, u'Delicious'...",199,3,1703
2,Introduction to Computational Arts: Processing,en,compartsprocessing,"[1, 4, 18, 22]",[117],"[Computer Science: Theory, Information, Tech &...",[The State University of New York],https://www.coursera.org/course/compartsproces...,"{u'StumbleUpon': 0, u'Reddit': 0, u'Delicious'...",169,1,1014
3,Introduction to Programming with MATLAB,en,matlab,"[12, 15]",[37],"[Computer Science: Software Engineering, Engin...",[Vanderbilt University],https://www.coursera.org/course/matlab,"{u'StumbleUpon': 0, u'Reddit': 0, u'Delicious'...",188,25,2537
4,Experimentation for Improvement,en,experiments,"[4, 5, 15, 16]",[148],"[Information, Tech & Design, Mathematics, Engi...",[McMaster University],https://www.coursera.org/course/experiments,"{u'StumbleUpon': 0, u'Reddit': 0, u'Delicious'...",47,122,313


## Querying the data

Now that we have all Coursera courses with the corresponding number of social shares structured in a Pandas DataFrame, we can write queries that return the top courses for a specific category or language.

### Getting the top 10 most popular English courses by Twitter count

In [32]:
cols_to_show = ['course_name', 'universities_name', 'categories_name', 'twitter_count', 'linkedin_count', 'facebook_count']
#Get English courses
query = courses_df[courses_df['course_language'] == 'en']
#Sort the courses by twitter count and get the top 10 courses
query = query.sort_values('twitter_count', ascending=False).head(10)
query[cols_to_show]

Unnamed: 0,course_name,universities_name,categories_name,twitter_count,linkedin_count,facebook_count
418,Gamification,[University of Pennsylvania],"[Information, Tech & Design, Business & Manage...",10346,8896,23322
206,Functional Programming Principles in Scala,[École Polytechnique Fédérale de Lausanne],[Computer Science: Software Engineering],6755,701,10129
740,Public-Private Partnerships (PPP): How can PPP...,[The World Bank Group],"[Economics & Finance, Humanities , Business & ...",6599,2020,8166
314,Cryptography I,[Stanford University],"[Computer Science: Theory, Computer Science: S...",4494,8896,16044
252,Social Network Analysis,[University of Michigan],"[Information, Tech & Design, Computer Science:...",3909,39,10347
833,Principles of Reactive Programming,[École Polytechnique Fédérale de Lausanne],[Computer Science: Software Engineering],3859,8896,3281
743,Think Again: How to Reason and Argue,[Duke University],"[Humanities , Teacher Professional Development]",2948,160,460
667,Model Thinking,[University of Michigan],"[Economics & Finance, Humanities ]",2774,393,10859
720,Introduction to Finance,[University of Michigan],"[Economics & Finance, Business & Management]",2712,8896,12897
34,E-learning and Digital Cultures,[The University of Edinburgh],[Education],2632,8896,5188


### Getting the top 10 most popular English courses in "Statistics and Data Analysis" by Twitter count

In [33]:
#Get English courses
query = courses_df[courses_df['course_language'] == 'en']
#Filter the "Statistics and Data Analysis" courses
query = query[query['categories_name'].map(lambda categories_name: 'Statistics and Data Analysis' in categories_name)]
#Sort the courses by twitter count and get the top 10 courses
query = query.sort_values('twitter_count', ascending=False).head(10)
query[cols_to_show]

Unnamed: 0,course_name,universities_name,categories_name,twitter_count,linkedin_count,facebook_count
771,Startup Engineering,[Stanford University],"[Computer Science: Software Engineering, Busin...",2584,252,7896
608,Computing for Data Analysis,[Johns Hopkins University],"[Health & Society, Statistics and Data Analysis]",2169,252,8814
162,Introduction to Data Science,[University of Washington],"[Information, Tech & Design, Computer Science:...",1624,491,4670
97,Statistics One,[Princeton University],[Statistics and Data Analysis],1500,151,7312
185,Data Analysis,[Johns Hopkins University],"[Health & Society, Statistics and Data Analysis]",1493,145,4943
735,Maps and the Geospatial Revolution,[The Pennsylvania State University],"[Information, Tech & Design, Statistics and Da...",1411,327,7108
480,"Creativity, Innovation, and Change | 创意，创新, 与 变革",[The Pennsylvania State University],"[Computer Science: Theory, Economics & Finance...",1229,301,9533
106,The Data Scientist’s Toolbox,[Johns Hopkins University],[Statistics and Data Analysis],981,8896,4570
862,R Programming,[Johns Hopkins University],"[Information, Tech & Design, Statistics and Da...",945,8896,8875
560,Enhance Your Career and Employability Skills,[University of London],"[Computer Science: Theory, Economics & Finance...",710,597,4317
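The two queries differ only in the category filter, so they can be folded into one helper. A minimal sketch on a small made-up DataFrame: `top_courses` is a hypothetical name, and `sort_values` is used for sorting, which assumes pandas 0.17 or later.

```python
import pandas as pd

# Made-up rows with the same column names as courses_df
sample_df = pd.DataFrame({
    'course_name': ['A', 'B', 'C', 'D'],
    'course_language': ['en', 'en', 'fr', 'en'],
    'categories_name': [
        ['Mathematics'],
        ['Statistics and Data Analysis'],
        ['Statistics and Data Analysis'],
        ['Mathematics', 'Statistics and Data Analysis'],
    ],
    'twitter_count': [10, 30, 50, 20],
})

def top_courses(df, language='en', category=None, by='twitter_count', n=10):
    """Return the n courses with the highest `by` count for a language and optional category."""
    result = df[df['course_language'] == language]
    if category is not None:
        result = result[result['categories_name'].map(lambda names: category in names)]
    return result.sort_values(by, ascending=False).head(n)

top_english = top_courses(sample_df, n=2)
top_stats = top_courses(sample_df, category='Statistics and Data Analysis')
print(top_english['course_name'].tolist())
print(top_stats['course_name'].tolist())
```

The `by` parameter makes it straightforward to rank by `linkedin_count` or `facebook_count` instead, provided those columns exist in the DataFrame being queried.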


## Conclusion

In this tutorial, we learned how to use the Coursera API to get the course catalogue and how to use the sharedcount.com API to get social share metrics for each course. The technique introduced here can be applied to other use cases that require a popularity ranking for measuring the relevance of a list of links.