## Udacity Student Engagement Data Analysis ##
### Activity ###
The following data from the Udacity DAND (Data Analyst Nanodegree) is provided for analysis.


* **Enrollments** (enrollments.csv) -
    Provides details of when students enroll in and cancel from Udacity

* **Daily Engagement** (daily_engagement.csv) -
    Provides a per-student engagement summary for a given date: number of courses visited, total minutes visited, number of lessons completed, and number of projects completed

* **Project Submissions** (project_submissions.csv) -
    Provides details of project submissions: submission date, evaluation date, evaluation status, grade obtained, and the lesson the project belongs to
    

## Enrollment Questions ##
1. Month with the highest number of joiners and month with the lowest number of joiners
1. Days to Cancel - Frequency Distribution, Mean, Median, Mode
1. Joining Date - Which day of the month, day of the week, and month have the most joins
1. Cancellation - Which day of the month, day of the week, and month have the most cancellations
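
The first two questions can be approached with a counter and a few summary statistics. The following is only a minimal sketch on made-up rows, not the analysis itself: `sample_enrollments` is hypothetical data shaped like the cleaned enrollment records, and the `statistics` module assumes Python 3.4 or later.

```python
from collections import Counter
from datetime import datetime
from statistics import mean, median, mode

# Hypothetical, already-cleaned rows (not taken from enrollments.csv).
sample_enrollments = [
    {'join_date': datetime(2014, 11, 10), 'days_to_cancel': 65},
    {'join_date': datetime(2014, 11, 25), 'days_to_cancel': 5},
    {'join_date': datetime(2015, 1, 9), 'days_to_cancel': None},
    {'join_date': datetime(2015, 1, 14), 'days_to_cancel': 5},
    {'join_date': datetime(2015, 1, 20), 'days_to_cancel': 30},
]

# Count joiners per (year, month) so the same month in different years stays separate.
joins_per_month = Counter(
    (row['join_date'].year, row['join_date'].month) for row in sample_enrollments
)
busiest_month = max(joins_per_month, key=joins_per_month.get)
quietest_month = min(joins_per_month, key=joins_per_month.get)

# Count joiners per weekday; datetime.weekday() returns 0 for Monday.
joins_per_weekday = Counter(row['join_date'].weekday() for row in sample_enrollments)

# Summary statistics over students who actually canceled (days_to_cancel is set).
days = [row['days_to_cancel'] for row in sample_enrollments
        if row['days_to_cancel'] is not None]
summary = {'mean': mean(days), 'median': median(days), 'mode': mode(days)}

print(busiest_month, quietest_month, summary)
```

The same pattern applies to cancellations by swapping `join_date` for `cancel_date` and skipping rows where it is `None`.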

## Engagement Questions ##
1. Monthly - Total Minutes Visited, Total Projects Completed, Total Lessons Completed
1. 
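
One possible way to compute the monthly totals is to accumulate each metric under a (year, month) key. This is a sketch under assumptions: `sample_engagements` is hypothetical data that mirrors the field names in daily_engagement.csv after the values have been converted to numbers and dates.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical, already-cleaned rows (not taken from daily_engagement.csv).
sample_engagements = [
    {'utc_date': datetime(2015, 1, 9), 'total_minutes_visited': 11,
     'lessons_completed': 0, 'projects_completed': 0},
    {'utc_date': datetime(2015, 1, 10), 'total_minutes_visited': 37,
     'lessons_completed': 1, 'projects_completed': 0},
    {'utc_date': datetime(2015, 2, 1), 'total_minutes_visited': 53,
     'lessons_completed': 2, 'projects_completed': 1},
]

metrics = ('total_minutes_visited', 'lessons_completed', 'projects_completed')

# Each month starts with every metric at zero.
monthly_totals = defaultdict(lambda: dict.fromkeys(metrics, 0))
for row in sample_engagements:
    month = (row['utc_date'].year, row['utc_date'].month)
    for metric in metrics:
        monthly_totals[month][metric] += row[metric]

for month in sorted(monthly_totals):
    print(month, monthly_totals[month])
```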

## Project Submission Questions ##
1. Time taken for Project
2. By Lesson - Processing State
3. 


In [151]:
import sys
import csv
import datetime as dt

In [152]:
def parse_data(filename):
    with open(filename) as f:
        reader = csv.DictReader(f)
        return list(reader)

enrollments = parse_data('/Users/subramanyans/GitHub/courses/Udacity/DataAnalysis/enrollments.csv')
engagements = parse_data('/Users/subramanyans/GitHub/courses/Udacity/DataAnalysis/daily_engagement.csv')
submissions = parse_data('/Users/subramanyans/GitHub/courses/Udacity/DataAnalysis/project_submissions.csv')

print ("Enrollments:\n{}{}\nEngagements:\n{}{}\nSubmissions:\n{}{}".format(\
            len(enrollments), enrollments[0], len(engagements), engagements[0], len(submissions), submissions[0]))

Enrollments:
1640{'status': 'canceled', 'is_udacity': 'True', 'is_canceled': 'True', 'join_date': '2014-11-10', 'account_key': '448', 'cancel_date': '2015-01-14', 'days_to_cancel': '65'}
Engagements:
136240{'lessons_completed': '0.0', 'num_courses_visited': '1.0', 'total_minutes_visited': '11.6793745', 'projects_completed': '0.0', 'acct': '0', 'utc_date': '2015-01-09'}
Submissions:
3642{'lesson_key': '3176718735', 'processing_state': 'EVALUATED', 'account_key': '256', 'assigned_rating': 'UNGRADED', 'completion_date': '2015-01-16', 'creation_date': '2015-01-14'}


***************
#### Cleaning Data ####
Note that `csv.DictReader` returns every field as a string. Many fields can be converted to native Python types (booleans, integers, dates) for easier handling.

It is also easier to group the records into dictionaries keyed by account, where each value is the list of that account's records, than to keep them as flat lists of dictionaries. This makes later lookups by account simpler.
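
The grouping idea can be sketched with `collections.defaultdict`, which removes the need for an explicit "is the key already present" check. The helper name `group_by_account` and the sample rows are illustrative assumptions, not part of the original notebook.

```python
from collections import defaultdict

def group_by_account(records, key_name='account_key'):
    """Return a dict mapping each account key to the list of its records."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[key_name]].append(record)
    return grouped

# Hypothetical rows: one account with two enrollments, one with a single enrollment.
sample_rows = [
    {'account_key': 448, 'status': 'CANCELED'},
    {'account_key': 448, 'status': 'CURRENT'},
    {'account_key': 700, 'status': 'CURRENT'},
]

by_account = group_by_account(sample_rows)
print(len(by_account[448]))  # number of enrollment records for account 448
```

Looking up all records for one account then becomes a single dictionary access instead of a scan over the whole list.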

In [153]:
enrollment_dict = {}
engagement_dict = {}
submission_dict = {}

def parse_date(date):
    if not date:
        return None
    else:
        return dt.datetime.strptime(date,'%Y-%m-%d')


def parse_int(value):
    if not value:
        return None
    else:
        return int(value)

# {'join_date': '2014-11-10', 'days_to_cancel': '65', 'status': 'canceled', 'is_canceled': 'True',
# 'cancel_date': '2015-01-14', 'account_key': '448', 'is_udacity': 'True'}

# Clean Enrollment Data
def clean_enrollments():
    for enrollment in enrollments:
        enrollment['is_canceled'] = (enrollment['is_canceled'] == 'True')
        enrollment['is_udacity'] = (enrollment['is_udacity'] == 'True')
        enrollment['join_date'] = parse_date(enrollment['join_date'])
        enrollment['cancel_date'] = parse_date(enrollment['cancel_date'])
        enrollment['account_key'] = parse_int(enrollment['account_key'])
        enrollment['days_to_cancel'] = parse_int(enrollment['days_to_cancel'])
        enrollment['status'] = enrollment['status'].upper()
        
        if enrollment['account_key'] in enrollment_dict:
            enrollment_dict[enrollment['account_key']].append(enrollment)
        else:
            enrollment_dict[enrollment['account_key']] = [enrollment]


# Clean Engagement Data
# {'acct': '0', 'total_minutes_visited': '11.6793745', 'num_courses_visited': '1.0', 'lessons_completed': '0.0', 'projects_completed': '0.0', 'utc_date': '2015-01-09'}
def clean_engagements():
    for engagement in engagements:
        engagement['acct'] = parse_int(engagement['acct'])
        engagement['total_minutes_visited'] = int(float(engagement['total_minutes_visited']))
        engagement['num_courses_visited'] = int(float(engagement['num_courses_visited']))
        engagement['lessons_completed'] = int(float(engagement['lessons_completed']))
        engagement['projects_completed'] = int(float(engagement['projects_completed']))
        engagement['utc_date'] = parse_date(engagement['utc_date'])

        engagement['account_key'] = engagement['acct']
        del engagement['acct']
        
        if engagement['account_key'] in engagement_dict:
            engagement_dict[engagement['account_key']].append(engagement)
        else:
            engagement_dict[engagement['account_key']] = [engagement]


# Clean Submissions Data
#{'completion_date': '2015-01-16', 'lesson_key': '3176718735', 'processing_state': 'EVALUATED', 'assigned_rating': 'UNGRADED', 'creation_date': '2015-01-14', 'account_key': '256'}
def clean_submissions():
    for submission in submissions:
        submission['completion_date'] = parse_date(submission['completion_date'])
        submission['lesson_key'] = parse_int(submission['lesson_key'])
        submission['account_key'] = parse_int(submission['account_key'])
        submission['creation_date'] = parse_date(submission['creation_date'])
        
        if submission['account_key'] in submission_dict:
            submission_dict[submission['account_key']].append(submission)
        else:
            submission_dict[submission['account_key']] = [submission]

clean_enrollments()
clean_engagements()
clean_submissions()

# print ("Enrollments:\n{}\nEngagements:\n{}\nSubmissions:\n{}"\
#        .format(enrollment_dict[0], engagement_dict[0], submission_dict[0]))

**********
_Check whether the number of unique accounts differs across the data sets_
**********
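
Beyond comparing counts, set differences show which accounts are missing from a data set. The sketch below uses hypothetical account keys and a hypothetical helper, `compare_accounts`, purely to illustrate the idea.

```python
def compare_accounts(enrolled, engaged, submitted):
    """Report accounts that appear in one data set but not in another."""
    return {
        # Enrolled students with no daily engagement rows at all
        'enrolled_without_engagement': enrolled - engaged,
        # Engagement rows whose account never appears in enrollments
        'engaged_without_enrollment': engaged - enrolled,
        # Submissions whose account never appears in enrollments
        'submitted_without_enrollment': submitted - enrolled,
    }

# Hypothetical account keys, not taken from the CSV files.
report = compare_accounts(
    enrolled={448, 700, 1304, 2000},
    engaged={448, 700},
    submitted={448},
)
print(report)
```

In the notebook, the three arguments would be the sets returned by `get_unique_accounts`.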

In [154]:
def get_unique_accounts(input_list):
    unique_accounts = set()
    for element in input_list:
        unique_accounts.add(element['account_key'])
    return unique_accounts

unique_enrollments = get_unique_accounts(enrollments)
unique_engagements = get_unique_accounts(engagements)
unique_submissions = get_unique_accounts(submissions)

print(len(enrollments), len(unique_enrollments), enrollments[0])
print(len(engagements), len(unique_engagements), engagements[0])
print(len(submissions), len(unique_submissions), submissions[0])

(1640, 1302, {'status': 'CANCELED', 'is_udacity': True, 'is_canceled': True, 'join_date': datetime.datetime(2014, 11, 10, 0, 0), 'account_key': 448, 'cancel_date': datetime.datetime(2015, 1, 14, 0, 0), 'days_to_cancel': 65})
(136240, 1237, {'lessons_completed': 0, 'num_courses_visited': 1, 'total_minutes_visited': 11, 'projects_completed': 0, 'account_key': 0, 'utc_date': datetime.datetime(2015, 1, 9, 0, 0)})
(3642, 743, {'lesson_key': 3176718735, 'processing_state': 'EVALUATED', 'account_key': 256, 'assigned_rating': 'UNGRADED', 'completion_date': datetime.datetime(2015, 1, 16, 0, 0), 'creation_date': datetime.datetime(2015, 1, 14, 0, 0)})


In [156]:
# Find out set of Test accounts
udacity_test_accounts = set()
# Solution for Dictionary
# for enrollments in enrollment_dict.values():
#     if (len(enrollments)) > 1:
#         for enrollment in enrollments:
#             if enrollment['is_udacity']:
#                 udacity_test_accounts.add(enrollment['account_key'])
#     else:
#         if enrollments[0]['is_udacity']:
#             udacity_test_accounts.add(enrollments[0]['account_key'])

udacity_test_accounts = set([enrollment['account_key'] for enrollment in enrollments if enrollment['is_udacity']])
print (udacity_test_accounts)

set([448, 1069, 1101, 312, 818, 1304])


In [157]:
# Remove Udacity Test Accounts
# for ta in udacity_test_accounts:
#     if ta in enrollment_dict:
#         del enrollment_dict[ta]
#     if ta in engagement_dict:
#         del engagement_dict[ta]
#     if ta in submission_dict:
#         del submission_dict[ta]

enrollments = [enrollment for enrollment in enrollments if not enrollment['is_udacity']]
engagements = [engagement for engagement in engagements if engagement['account_key'] not in udacity_test_accounts]
submissions = [submission for submission in submissions if submission['account_key'] not in udacity_test_accounts]

### Question ###
**How do the numbers in the daily engagement table differ for students who pass the first project?**

#### Approach ####
* From Submissions, get the first PASSED project and note its submission date
* From Daily_Engagement, get the total time spent until that submission
    * _This step is problematic since students might have spent time on lessons other than the one submitted_
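
The approach above could be sketched as follows. This is an illustration on hypothetical rows, and it assumes that a passed project is recorded as `assigned_rating == 'PASSED'` and that `creation_date` is the submission date; neither is confirmed by the data shown so far. It also inherits the caveat noted above: minutes spent on unrelated lessons are counted too.

```python
from datetime import datetime

# Hypothetical, already-cleaned rows (not taken from the CSV files).
sample_submissions = [
    {'account_key': 256, 'assigned_rating': 'UNGRADED', 'creation_date': datetime(2015, 1, 14)},
    {'account_key': 256, 'assigned_rating': 'PASSED', 'creation_date': datetime(2015, 1, 20)},
    {'account_key': 256, 'assigned_rating': 'PASSED', 'creation_date': datetime(2015, 2, 2)},
    {'account_key': 300, 'assigned_rating': 'UNGRADED', 'creation_date': datetime(2015, 1, 18)},
]
sample_engagements = [
    {'account_key': 256, 'utc_date': datetime(2015, 1, 10), 'total_minutes_visited': 30},
    {'account_key': 256, 'utc_date': datetime(2015, 1, 20), 'total_minutes_visited': 45},
    {'account_key': 256, 'utc_date': datetime(2015, 1, 25), 'total_minutes_visited': 60},
    {'account_key': 300, 'utc_date': datetime(2015, 1, 12), 'total_minutes_visited': 20},
]

# Earliest submission date of a PASSED project per account (assumed rating value).
first_pass_date = {}
for s in sample_submissions:
    if s['assigned_rating'] != 'PASSED':
        continue
    key = s['account_key']
    if key not in first_pass_date or s['creation_date'] < first_pass_date[key]:
        first_pass_date[key] = s['creation_date']

# Total minutes logged up to and including the day of that submission.
minutes_until_pass = {key: 0 for key in first_pass_date}
for e in sample_engagements:
    key = e['account_key']
    if key in first_pass_date and e['utc_date'] <= first_pass_date[key]:
        minutes_until_pass[key] += e['total_minutes_visited']

print(minutes_until_pass)
```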

### Changed Question ###
**Only look at engagement during the first week, and ignore students who cancelled within the first week**

#### Approach ####
* Create a dictionary of "paid_students": students who have not cancelled, or who cancelled after more than 7 days
* The dictionary maps account_key to joining date

In [158]:
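# Check is_canceled first: days_to_cancel is None when the CSV field is empty
# (expected for students who never canceled), and None > 7 raises TypeError on Python 3.
paid_students = {enrollment['account_key']: enrollment['join_date'] \
                 for enrollment in enrollments if not enrollment['is_canceled'] or enrollment['days_to_cancel'] > 7}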

print(len(paid_students))    
#print(paid_students)


# Alternative using enrollment_dict (assumes test accounts were removed from it first):
# paid_students = {}
# for account_enrollments in enrollment_dict.values():
#     for enrollment in account_enrollments:
#         if not enrollment['is_canceled'] or enrollment['days_to_cancel'] > 7:
#             paid_students[enrollment['account_key']] = enrollment['join_date']
# print (paid_students)

995


In [None]:
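# Group each paid student's engagement records from the first 7 days after joining.
# A single pass over engagements avoids rescanning the full list for every student.
first_week_engagements = {}
for e in engagements:
    key = e['account_key']
    if key not in paid_students:
        continue
    days_since_join = (e['utc_date'] - paid_students[key]).days
    # The lower bound skips records dated before the stored join date (e.g. from an earlier enrollment)
    if 0 <= days_since_join < 7:
        first_week_engagements.setdefault(key, []).append(e)

# .get avoids a KeyError if account 0 is not among the paid students
print (first_week_engagements.get(0))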
        