This notebook provides the code for collecting Udemy course data through the Udemy API.

Reference: https://www.udemy.com/developers/affiliate/

In [78]:
import json
import requests
import time
import pandas as pd

# Udemy API function 

The function defined here retrieves Udemy course information for a given course category, one page at a time. Each page returns at most 60 courses.

In [2]:
def get_udemy_course(pnumber, catid):
    """Return the list of course items on page `pnumber` of category `catid`."""
    url = "https://www.udemy.com/api-2.0/discovery-units/all_courses/" \
          "?p={0}" \
          "&page_size=60" \
          "&category_id={1}" \
          "&source_page=category_page&locale=en_US&currency=usd&sos=pc&fl=cat".format(pnumber, catid)

    response = requests.get(url)
    if response.status_code != 200:
        # Print the status and return an empty list, which ends the calling `while` loop.
        print(response.status_code)
        return []
    time.sleep(1)  # pause for one second between requests
    return response.json()['unit']['items']
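
The query string in the function above is assembled by hand. As a minimal sketch, the same URL can be built from a dictionary with the standard library's `urlencode`, which makes each parameter easier to read and change. The names `BASE_URL` and `build_course_url` are illustrative and not part of the original notebook.

```python
from urllib.parse import urlencode

# Hypothetical helper names; the endpoint and parameters are copied from get_udemy_course.
BASE_URL = "https://www.udemy.com/api-2.0/discovery-units/all_courses/"


def build_course_url(pnumber, catid, page_size=60):
    """Build the request URL for one page of one category."""
    params = {
        "p": pnumber,
        "page_size": page_size,
        "category_id": catid,
        "source_page": "category_page",
        "locale": "en_US",
        "currency": "usd",
        "sos": "pc",
        "fl": "cat",
    }
    # urlencode keeps the dictionary's insertion order.
    return BASE_URL + "?" + urlencode(params)


print(build_course_url(1, 288))
```

The resulting string could be passed to `requests.get` in place of the hand-formatted `url`.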

# Collect the data based on the categories

There are twelve course categories on the Udemy site, and each has its own category ID, listed below. The `while` loop keeps requesting pages until it passes the last page of the given category, at which point the function returns an empty list and the loop stops. Although I could loop over all twelve categories, I chose to call the function once per category so that people can grab only the data they want. In addition, I added a `category` column so that I can examine each category separately during the EDA process.

- Development: `category_id = 288`
- Business: `category_id = 268`
- IT & Software: `category_id = 294`
- Office Productivity: `category_id = 292`
- Personal Development: `category_id = 296`
- Design: `category_id = 269`
- Marketing: `category_id = 290`
- Lifestyle: `category_id = 274`
- Photography: `category_id = 273`
- Health & Fitness: `category_id = 276`
- Music: `category_id = 278`
- Teaching & Academics: `category_id = 300`
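
The paging logic described above can be sketched as one reusable function. This is only an illustration under a few assumptions: `CATEGORY_IDS`, `collect_category`, and the `max_pages` safety limit are names introduced here, and `fake_fetch` is a stand-in for `get_udemy_course` so that the sketch runs without network access.

```python
# IDs copied from the list above; the dictionary name is illustrative.
CATEGORY_IDS = {
    "Development": 288,
    "Business": 268,
    "IT & Software": 294,
    "Office Productivity": 292,
    "Personal Development": 296,
    "Design": 269,
    "Marketing": 290,
    "Lifestyle": 274,
    "Photography": 273,
    "Health & Fitness": 276,
    "Music": 278,
    "Teaching & Academics": 300,
}


def collect_category(fetch_page, category_id, max_pages=1000):
    """Request pages 1, 2, 3, ... until fetch_page returns an empty list."""
    courses = []
    for page_number in range(1, max_pages + 1):  # max_pages is an assumed safety cap
        items = fetch_page(page_number, category_id)
        if not items:
            break
        courses.extend(items)
    return courses


def fake_fetch(page_number, category_id):
    """Stand-in for get_udemy_course: two pages of made-up records, then nothing."""
    pages = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}
    return pages.get(page_number, [])


print(len(collect_category(fake_fetch, CATEGORY_IDS["Development"])))  # prints 3
```

In the notebook, passing `get_udemy_course` as `fetch_page` would reproduce the per-category `while` loops that follow, and a single category can still be fetched on its own.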

## 1. "Development" courses

In [3]:
all_dev = []
page_number = 1
data = True
while data:
    data = get_udemy_course(page_number, 288)
    all_dev.extend(data)
    page_number += 1

400


In [30]:
Udemy_Dev_course = pd.DataFrame(all_dev)
Udemy_Dev_course['language'] = [i['title'] for i in Udemy_Dev_course['locale']]
Udemy_Dev_course = Udemy_Dev_course[['id', 'avg_rating_recent', 
                                     'objectives_summary','num_subscribers', 
                                     'content_info', 'headline', 'image_304x171',
                                     'title', 'url','language' ]]
Udemy_Dev_course['category'] = 'Development'

In [90]:
Udemy_Dev_course.to_csv('Udemy_Dev_course.csv', index=False)
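
The shaping done above (keep nine raw fields, pull the language name out of `locale`, and tag the category) is repeated for every category below. A minimal sketch of the same shaping applied to one raw record at a time, using plain dictionaries: the names `KEPT_FIELDS` and `to_row` are illustrative, and the sample record contains made-up values.

```python
# Field names copied from the column selection above.
KEPT_FIELDS = ['id', 'avg_rating_recent', 'objectives_summary', 'num_subscribers',
               'content_info', 'headline', 'image_304x171', 'title', 'url']


def to_row(item, category):
    """Reduce one raw course record to the eleven columns kept in the notebook."""
    row = {field: item.get(field) for field in KEPT_FIELDS}  # missing fields become None
    row['language'] = item['locale']['title']  # same lookup as the list comprehension
    row['category'] = category
    return row


# Made-up record for illustration only.
sample = {'id': 1, 'title': 'Sample course', 'locale': {'title': 'English'}, 'extra': 'dropped'}
row = to_row(sample, 'Development')
print(row['language'], row['category'])  # prints: English Development
```

A list of such rows can be passed to `pd.DataFrame` directly. Unlike the column selection above, this sketch fills a missing field with `None` instead of raising a `KeyError`.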

## 2. "Business" courses

In [36]:
all_business = []
page_number = 1
data = True
while data:
    data = get_udemy_course(page_number, 268)
    all_business.extend(data)
    page_number += 1

400


In [40]:
Udemy_Bus_course = pd.DataFrame(all_business)
Udemy_Bus_course['language'] = [i['title'] for i in Udemy_Bus_course['locale']]
Udemy_Bus_course = Udemy_Bus_course[['id', 'avg_rating_recent', 
                                     'objectives_summary','num_subscribers', 
                                     'content_info', 'headline', 'image_304x171',
                                     'title', 'url','language' ]]
Udemy_Bus_course['category'] = 'Business'

In [91]:
Udemy_Bus_course.to_csv('Udemy_Bus_course.csv', index=False)

## 3. "IT & Software" courses

In [43]:
all_it_soft = []
page_number = 1
data = True
while data:
    data = get_udemy_course(page_number, 294)
    all_it_soft.extend(data)
    page_number += 1

400


In [44]:
Udemy_it_course = pd.DataFrame(all_it_soft)
Udemy_it_course['language'] = [i['title'] for i in Udemy_it_course['locale']]
Udemy_it_course = Udemy_it_course[['id', 'avg_rating_recent', 
                                   'objectives_summary','num_subscribers', 
                                   'content_info', 'headline', 'image_304x171',
                                   'title', 'url','language' ]]
Udemy_it_course['category'] = 'IT_&_Software'

In [92]:
Udemy_it_course.to_csv('Udemy_it_course.csv', index=False)

## 4. "Office Productivity" courses

In [46]:
all_office = []
page_number = 1
data = True
while data:
    data = get_udemy_course(page_number, 292)
    all_office.extend(data)
    page_number += 1

In [49]:
Udemy_office_course = pd.DataFrame(all_office)
Udemy_office_course['language'] = [i['title'] for i in Udemy_office_course['locale']]
Udemy_office_course = Udemy_office_course[['id', 'avg_rating_recent', 
                                           'objectives_summary','num_subscribers', 
                                           'content_info', 'headline', 'image_304x171',
                                           'title', 'url','language' ]]
Udemy_office_course['category'] = 'Office_Productivity'

In [93]:
Udemy_office_course.to_csv('Udemy_office_course.csv', index=False)

## 5. "Personal Development" courses

In [47]:
all_personal = []
page_number = 1
data = True
while data:
    data = get_udemy_course(page_number, 296)
    all_personal.extend(data)
    page_number += 1

400


In [94]:
Udemy_personal_course = pd.DataFrame(all_personal)
Udemy_personal_course['language'] = [i['title'] for i in Udemy_personal_course['locale']]
Udemy_personal_course = Udemy_personal_course[['id', 'avg_rating_recent', 
                                               'objectives_summary','num_subscribers', 
                                               'content_info', 'headline', 'image_304x171',
                                               'title', 'url','language' ]]
Udemy_personal_course['category'] = 'Personal_Development'

Udemy_personal_course.to_csv('Udemy_personal_course.csv', index=False)

## 6. "Design" courses

In [51]:
all_design = []
page_number = 1
data = True
while data:
    data = get_udemy_course(page_number, 269)
    all_design.extend(data)
    page_number += 1

In [52]:
Udemy_design_course = pd.DataFrame(all_design)
Udemy_design_course['language'] = [i['title'] for i in Udemy_design_course['locale']]
Udemy_design_course = Udemy_design_course[['id', 'avg_rating_recent', 
                                           'objectives_summary','num_subscribers', 
                                           'content_info', 'headline', 'image_304x171',
                                           'title', 'url','language' ]]
Udemy_design_course['category'] = 'Design'

In [95]:
Udemy_design_course.to_csv('Udemy_design_course.csv', index=False)

## 7. "Marketing" courses

In [54]:
all_marketing = []
page_number = 1
data = True
while data:
    data = get_udemy_course(page_number, 290)
    all_marketing.extend(data)
    page_number += 1

In [55]:
Udemy_marketing_course = pd.DataFrame(all_marketing)
Udemy_marketing_course['language'] = [i['title'] for i in Udemy_marketing_course['locale']]
Udemy_marketing_course = Udemy_marketing_course[['id', 'avg_rating_recent', 
                                                 'objectives_summary','num_subscribers', 
                                                 'content_info', 'headline', 'image_304x171',
                                                 'title', 'url','language' ]]
Udemy_marketing_course['category'] = 'Marketing'

In [96]:
Udemy_marketing_course.to_csv('Udemy_marketing_course.csv', index=False)

## 8. "Lifestyle" courses

In [57]:
all_lifestyle = []
page_number = 1
data = True
while data:
    data = get_udemy_course(page_number, 274)
    all_lifestyle.extend(data)
    page_number += 1

In [58]:
Udemy_lifestyle_course = pd.DataFrame(all_lifestyle)
Udemy_lifestyle_course['language'] = [i['title'] for i in Udemy_lifestyle_course['locale']]
Udemy_lifestyle_course = Udemy_lifestyle_course[['id', 'avg_rating_recent', 
                                                 'objectives_summary','num_subscribers', 
                                                 'content_info', 'headline', 'image_304x171',
                                                 'title', 'url','language' ]]
Udemy_lifestyle_course['category'] = 'Lifestyle'

In [97]:
Udemy_lifestyle_course.to_csv('Udemy_lifestyle_course.csv', index=False)

## 9. "Photography" courses

In [60]:
all_photography = []
page_number = 1
data = True
while data:
    data = get_udemy_course(page_number, 273)
    all_photography.extend(data)
    page_number += 1

In [61]:
Udemy_photography_course = pd.DataFrame(all_photography)
Udemy_photography_course['language'] = [i['title'] for i in Udemy_photography_course['locale']]
Udemy_photography_course = Udemy_photography_course[['id', 'avg_rating_recent', 
                                                     'objectives_summary','num_subscribers', 
                                                     'content_info', 'headline', 'image_304x171',
                                                     'title', 'url','language' ]]
Udemy_photography_course['category'] = 'Photography'

In [98]:
Udemy_photography_course.to_csv('Udemy_photography_course.csv', index=False)

## 10. "Health & Fitness" courses

In [63]:
all_health = []
page_number = 1
data = True
while data:
    data = get_udemy_course(page_number, 276)
    all_health.extend(data)
    page_number += 1

In [65]:
Udemy_health_course = pd.DataFrame(all_health)
Udemy_health_course['language'] = [i['title'] for i in Udemy_health_course['locale']]
Udemy_health_course = Udemy_health_course[['id', 'avg_rating_recent', 
                                           'objectives_summary','num_subscribers', 
                                           'content_info', 'headline', 'image_304x171',
                                           'title', 'url','language' ]]
Udemy_health_course['category'] = 'Health_&_Fitness'

In [99]:
Udemy_health_course.to_csv('Udemy_health_course.csv', index=False)

## 11. "Music" courses

In [67]:
all_music = []
page_number = 1
data = True
while data:
    data = get_udemy_course(page_number, 278)
    all_music.extend(data)
    page_number += 1

In [68]:
Udemy_music_course = pd.DataFrame(all_music)
Udemy_music_course['language'] = [i['title'] for i in Udemy_music_course['locale']]
Udemy_music_course = Udemy_music_course[['id', 'avg_rating_recent', 
                                         'objectives_summary','num_subscribers', 
                                         'content_info', 'headline', 'image_304x171',
                                         'title', 'url','language' ]]
Udemy_music_course['category'] = 'Music'

In [100]:
Udemy_music_course.to_csv('Udemy_music_course.csv', index=False)

## 12. "Teaching & Academics" courses

In [70]:
all_academics = []
page_number = 1
data = True
while data:
    data = get_udemy_course(page_number, 300)
    all_academics.extend(data)
    page_number += 1

In [71]:
Udemy_academics_course = pd.DataFrame(all_academics)
Udemy_academics_course['language'] = [i['title'] for i in Udemy_academics_course['locale']]
Udemy_academics_course = Udemy_academics_course[['id', 'avg_rating_recent', 
                                                 'objectives_summary','num_subscribers', 
                                                 'content_info', 'headline', 'image_304x171',
                                                 'title', 'url','language' ]]
Udemy_academics_course['category'] = 'Teaching_&_Academics'

In [101]:
Udemy_academics_course.to_csv('Udemy_academics_course.csv', index=False)

# Merge and store the data to Postgres in the AWS instance

The collected data are merged into a single dataframe using the `pd.concat` function.

In [108]:
Udemy = pd.concat([Udemy_academics_course,Udemy_Bus_course,Udemy_design_course,Udemy_Dev_course,
                   Udemy_health_course,Udemy_it_course,Udemy_lifestyle_course,Udemy_marketing_course,
                   Udemy_music_course,Udemy_office_course,Udemy_personal_course,Udemy_photography_course], 
                   axis=0, ignore_index=True)

In [110]:
Udemy.to_csv('Udemy.csv', index=False)

> Data Dictionary of the final dataframe

|Column Name|Data type|Description|
| --- | --- | --- |
|id|int|Course ID|
|avg_rating_recent|float|The average rating of the course|
|objectives_summary|object|Summary of the course objectives|
|num_subscribers|int|The number of people subscribed to the course|
|content_info|object|The length of the course or the number of quizzes/questions|
|headline|object|The headline of the course|
|image_304x171|object|Link to the course thumbnail|
|title|object|The title of the course|
|url|object|The link to the course on the Udemy website|
|language|object|The language the instructor uses in the course|
|category|object|The course category|
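
Before the CSV is loaded into a database, it can help to confirm that its header matches the data dictionary above. The following is a minimal sketch using only the standard library. `EXPECTED_COLUMNS` and `header_matches` are names introduced here, and the in-memory `sample` stands in for `open('Udemy.csv', newline='')`.

```python
import csv
import io

# Column names in the order the dataframe was assembled.
EXPECTED_COLUMNS = ['id', 'avg_rating_recent', 'objectives_summary', 'num_subscribers',
                    'content_info', 'headline', 'image_304x171', 'title', 'url',
                    'language', 'category']


def header_matches(csv_file, expected=EXPECTED_COLUMNS):
    """Return True if the first row of the CSV equals the expected column names."""
    header = next(csv.reader(csv_file), [])  # an empty file yields []
    return header == list(expected)


# In-memory stand-in for the real file, containing only a header row.
sample = io.StringIO(",".join(EXPECTED_COLUMNS) + "\n")
print(header_matches(sample))  # prints True
```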


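This dataset is loaded into the Postgres server on the AWS instance with the following commands. The table has a `description` column that is not present in `Udemy.csv`, so the `COPY` statement lists the eleven CSV columns explicitly and `description` is left `NULL`.
```
CREATE TABLE udemy(
	id INT,
	avg_rating_recent NUMERIC,
	objectives_summary TEXT,
	num_subscribers NUMERIC,
	content_info TEXT,
	headline TEXT,
	image_304x171 TEXT,
	title TEXT,
	url TEXT,
	language TEXT,
	category TEXT,
	description TEXT
);
COPY udemy (id, avg_rating_recent, objectives_summary, num_subscribers, content_info,
            headline, image_304x171, title, url, language, category)
FROM '/Udemy.csv' WITH DELIMITER ',' CSV HEADER;
```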