# Data Analysis of Udacity Student Engagements

Given data about student enrollments on Udacity, this analysis explores which variables affect project completion.

#### Questions to ask

* How long does it take students to submit projects?
* How do students who pass their projects differ from students who don’t?
* How much time does the average student spend on Udacity?
* How does time spent relate to lessons/projects completed?
* How does student engagement change over time?
* How many times do students submit before they pass?

#### Load Data from CSVs

We will be working with three CSV files: 
* enrollments.csv
* daily_engagement.csv
* project_submissions.csv

We can write a function to load data from CSV files. 

In [77]:
# read_csv takes a CSV file name and returns a list of dictionaries, one per row

import unicodecsv

def read_csv(filename):
    with open(filename, 'rb') as f:  # 'rb' = read in binary mode, as unicodecsv expects
        reader = unicodecsv.DictReader(f)
        return list(reader)

# Read in our data and store in variables 
enrollments = read_csv('enrollments.csv')
daily_engagement = read_csv('daily_engagement.csv')
project_submissions = read_csv('project_submissions.csv')    
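The same loader can be sketched with only the standard library. This is a hypothetical Python 3 variant, not the notebook's code: `read_csv_py3` is an illustrative name, the built-in `csv` module stands in for `unicodecsv`, and the small `sample_enrollments.csv` file is made up so the example runs without the real data.

```python
import csv
import os
import tempfile

def read_csv_py3(filename):
    # Hypothetical Python 3 counterpart of read_csv: open in text mode,
    # with newline='' as the csv module documentation recommends.
    with open(filename, newline='', encoding='utf-8') as f:
        return list(csv.DictReader(f))

# Made-up two-row file, used only to demonstrate the call.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_enrollments.csv')
with open(sample_path, 'w', newline='', encoding='utf-8') as f:
    f.write('account_key,status\n448,canceled\n700,current\n')

sample_rows = read_csv_py3(sample_path)
print(sample_rows[0]['account_key'])
```

As with `unicodecsv.DictReader`, every value comes back as a string, so the type conversions later in this notebook would still be needed.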

Let's look at what we are working with. Print out the first row of each table.

In [78]:
print enrollments[0]
print daily_engagement[0]
print project_submissions[0]

{u'status': u'canceled', u'is_udacity': u'True', u'is_canceled': u'True', u'join_date': u'2014-11-10', u'account_key': u'448', u'cancel_date': u'2015-01-14', u'days_to_cancel': u'65'}
{u'lessons_completed': u'0.0', u'num_courses_visited': u'1.0', u'total_minutes_visited': u'11.6793745', u'projects_completed': u'0.0', u'acct': u'0', u'utc_date': u'2015-01-09'}
{u'lesson_key': u'3176718735', u'processing_state': u'EVALUATED', u'account_key': u'256', u'assigned_rating': u'UNGRADED', u'completion_date': u'2015-01-16', u'creation_date': u'2015-01-14'}


#### Fixing Data Types

From the first row of each table, we see that the CSV reader returns every value as a string. Before we can manipulate the data, we need to convert each column to an appropriate data type.

* The "cancel_date" field is a string that we convert to a Python datetime object using the datetime library. An empty string means the student has not cancelled yet.
* The "days_to_cancel" field is either an integer stored as a string or an empty string. An empty string again means the student has not cancelled yet.
* The "account_key" field is kept as a string. Account keys are unique identifiers, and we will not be doing arithmetic on them.

In [79]:
from datetime import datetime as dt

# Takes a date as a string, and returns a Python datetime object. 
# If there is no date given, returns None
def parse_date(date):
    if date == '': 
        return None
    else:
        return dt.strptime(date, '%Y-%m-%d')
    
# Takes a string which is either an empty string or represents an integer,
# and returns an int, or None if the string is empty.
def parse_maybe_int(i):
    if i == '': 
        return None 
    else:
        return int(i)

# Clean up the data types in the enrollments table
for enrollment in enrollments:
    enrollment['cancel_date'] = parse_date(enrollment['cancel_date'])
    enrollment['days_to_cancel'] = parse_maybe_int(enrollment['days_to_cancel'])
    enrollment['is_canceled'] = enrollment['is_canceled'] == 'True'
    enrollment['is_udacity'] = enrollment['is_udacity'] == 'True'
    enrollment['join_date'] = parse_date(enrollment['join_date'])
    
enrollments[0]

{u'account_key': u'448',
 u'cancel_date': datetime.datetime(2015, 1, 14, 0, 0),
 u'days_to_cancel': 65,
 u'is_canceled': True,
 u'is_udacity': True,
 u'join_date': datetime.datetime(2014, 11, 10, 0, 0),
 u'status': u'canceled'}

In [80]:
# Clean up the data types in the engagement table

# Counts are stored as float strings such as '0.0'. Convert them to int,
# since a student cannot complete 0.5 of a lesson or project.

for engagement_record in daily_engagement:
    engagement_record['lessons_completed'] = int(float(engagement_record['lessons_completed']))
    engagement_record['num_courses_visited'] = int(float(engagement_record['num_courses_visited']))
    engagement_record['projects_completed'] = int(float(engagement_record['projects_completed']))
    engagement_record['total_minutes_visited'] = float(engagement_record['total_minutes_visited'])
    engagement_record['utc_date'] = parse_date(engagement_record['utc_date'])
    
daily_engagement[0]

{u'acct': u'0',
 u'lessons_completed': 0,
 u'num_courses_visited': 1,
 u'projects_completed': 0,
 u'total_minutes_visited': 11.6793745,
 u'utc_date': datetime.datetime(2015, 1, 9, 0, 0)}

In [81]:
# Clean up the data types in the submissions table
for submission in project_submissions:
    submission['completion_date'] = parse_date(submission['completion_date'])
    submission['creation_date'] = parse_date(submission['creation_date'])

project_submissions[0]

{u'account_key': u'256',
 u'assigned_rating': u'UNGRADED',
 u'completion_date': datetime.datetime(2015, 1, 16, 0, 0),
 u'creation_date': datetime.datetime(2015, 1, 14, 0, 0),
 u'lesson_key': u'3176718735',
 u'processing_state': u'EVALUATED'}
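The three cleaning loops above repeat one pattern: apply a converter to each named column. A minimal sketch of that pattern follows, assuming a hypothetical `clean_rows` helper and `ENROLLMENT_SCHEMA` mapping (neither is in the notebook) and a made-up sample row. It is written for Python 3.

```python
from datetime import datetime

def parse_date(date):
    # Same behaviour as the notebook's parse_date: '' means "no date".
    return None if date == '' else datetime.strptime(date, '%Y-%m-%d')

def parse_maybe_int(i):
    # Same behaviour as the notebook's parse_maybe_int.
    return None if i == '' else int(i)

def parse_bool(s):
    # Mirrors the notebook's `value == 'True'` comparison.
    return s == 'True'

# Hypothetical schema: column name -> converter. Unlisted columns stay strings.
ENROLLMENT_SCHEMA = {
    'cancel_date': parse_date,
    'days_to_cancel': parse_maybe_int,
    'is_canceled': parse_bool,
    'is_udacity': parse_bool,
    'join_date': parse_date,
}

def clean_rows(rows, schema):
    # Convert each listed column in place and return the same list.
    for row in rows:
        for column, convert in schema.items():
            row[column] = convert(row[column])
    return rows

# Made-up row for a student who has not cancelled.
sample = [{'account_key': '700', 'status': 'current', 'is_udacity': 'False',
           'is_canceled': 'False', 'join_date': '2014-11-10',
           'cancel_date': '', 'days_to_cancel': ''}]
clean_rows(sample, ENROLLMENT_SCHEMA)
print(sample[0]['cancel_date'], sample[0]['join_date'])
```

Like the loops above, this converts in place, so running it twice on the same rows would fail because the values are no longer strings.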

#### Investigating the Data

#### Renaming table headers

Two tables have a column named "account_key", but the daily_engagement table names the same column "acct". We need to rename "acct" to "account_key" so that all three tables can be handled the same way.


In [82]:
# Change Header: We can either create a new list, or modify the old list. 
# Let's modify the list.

for engagement_record in daily_engagement:
    engagement_record['account_key'] = engagement_record['acct']
    del engagement_record['acct']

Check that the renamed header is 'account_key' instead of 'acct': 

In [83]:
daily_engagement[0]['account_key']

u'0'
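One caveat with the rename loop is that re-running the cell raises a `KeyError`, because "acct" no longer exists after the first run. A minimal sketch of a re-runnable variant is shown here; the `rename_key` helper and the sample record are hypothetical.

```python
def rename_key(records, old_key, new_key):
    # pop() removes old_key and returns its value in one step.
    # Records that no longer have old_key are left untouched.
    for record in records:
        if old_key in record:
            record[new_key] = record.pop(old_key)
    return records

# Made-up record in the shape of a daily_engagement row.
sample_engagement = [{'acct': '0', 'lessons_completed': 0}]
rename_key(sample_engagement, 'acct', 'account_key')
rename_key(sample_engagement, 'acct', 'account_key')  # second call is a no-op
print(sample_engagement[0])
```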

#### Unique students

We can write a helper function that collects the unique account keys from any of the three tables.

In [84]:
def get_unique_students(data):
    unique_students = set()
    for data_point in data:
        unique_students.add(data_point['account_key'])
    return unique_students

We can find the total number of enrollments in each csv file by counting the number of rows.

In [85]:
len(enrollments)

1640

Does this mean 1640 students are enrolled in the course? We can find the total number of unique students in each table by counting the number of values in a set. 

In [86]:
unique_enrolled_students = set()

for enrollment in enrollments:
    unique_enrolled_students.add(enrollment['account_key'])

len(unique_enrolled_students)

1302

There are 1,640 enrollments in total but only 1,302 unique students. A possible explanation is that some students enroll, cancel, and re-enroll. Let's see how engaged these 1,302 unique students are.

In [87]:
len(daily_engagement)

136240

In [88]:
unique_engagement_students = set()

for engagement_record in daily_engagement:
    unique_engagement_students.add(engagement_record['account_key'])
len(unique_engagement_students)

1237

We see that the 136,240 daily engagement records come from only 1,237 unique students.

In [89]:
len(project_submissions)

3642

In [90]:
unique_project_submitters = set()
for submission in project_submissions:
    unique_project_submitters.add(submission['account_key'])
len(unique_project_submitters)

743

We see that the 3,642 project submissions come from 743 unique students. This averages to almost 5 submissions per student (3,642 / 743 ≈ 4.9).
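The row counts and unique-student counts above were computed one table at a time. A sketch of doing both in a single loop with the `get_unique_students` helper follows. The toy tables are made up, and the code targets Python 3 (the `print` calls take several arguments).

```python
def get_unique_students(data):
    # Same result as the notebook's helper, written as a set of keys.
    return set(row['account_key'] for row in data)

# Made-up rows: account '1' appears twice in the first table.
toy_enrollments = [{'account_key': '1'}, {'account_key': '1'}, {'account_key': '2'}]
toy_submissions = [{'account_key': '2'}]

counts = {}
for name, table in [('enrollments', toy_enrollments), ('submissions', toy_submissions)]:
    counts[name] = (len(table), len(get_unique_students(table)))
    print(name, 'rows:', counts[name][0], 'unique students:', counts[name][1])

# Average submissions per submitting student, using the counts reported above.
submissions_per_student = 3642 / 743.0
print(round(submissions_per_student, 2))
```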

#### Students with 0 daily engagement 

Why are there more unique students in the enrollment table than in the engagement table? It is strange for a student to enroll but never have a single engagement with the platform. Let's print one enrollment row for such a student.

In [91]:
for enrollment in enrollments:
    # find the account key for each enrollment 
    student = enrollment['account_key']
    # check if that account key is in set of unique students 
    if student not in unique_engagement_students:
        print enrollment 
        break 

{u'status': u'canceled', u'is_udacity': False, u'is_canceled': True, u'join_date': datetime.datetime(2014, 11, 12, 0, 0), u'account_key': u'1219', u'cancel_date': datetime.datetime(2014, 11, 12, 0, 0), u'days_to_cancel': 0}


Looking at this data point, we notice that the JOIN date and CANCEL date are both (2014, 11, 12, 0, 0). Perhaps a user has to be active for one full day before daily engagement is registered?
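One way to test this guess is to look for students with no engagement whose join and cancel dates differ. If any exist, same-day cancellation cannot be the whole explanation. The sketch below assumes a hypothetical `find_unexplained_students` helper and uses made-up toy rows, including the invented account key '9001'.

```python
from datetime import datetime

def find_unexplained_students(enrollments, engaged_keys):
    # Keep enrollments that have no engagement records and that are
    # NOT same-day cancellations (join_date differs from cancel_date).
    unexplained = []
    for enrollment in enrollments:
        if enrollment['account_key'] in engaged_keys:
            continue
        if enrollment['join_date'] != enrollment['cancel_date']:
            unexplained.append(enrollment)
    return unexplained

# Made-up rows: a same-day cancel, a later cancel, and an engaged student.
toy = [
    {'account_key': '1219', 'join_date': datetime(2014, 11, 12),
     'cancel_date': datetime(2014, 11, 12)},
    {'account_key': '9001', 'join_date': datetime(2015, 1, 10),
     'cancel_date': datetime(2015, 3, 10)},
    {'account_key': '5', 'join_date': datetime(2015, 1, 1),
     'cancel_date': None},
]
engaged = set(['5'])

leftover = find_unexplained_students(toy, engaged)
print([e['account_key'] for e in leftover])
```

On the real tables, the equivalent call would take `enrollments` and `unique_engagement_students` as its arguments.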

## Tracking Down the Remaining Problems

In [94]:
# Create a set of the account keys for all Udacity test accounts
udacity_test_accounts = set()
for enrollment in enrollments:
    if enrollment['is_udacity']:
        udacity_test_accounts.add(enrollment['account_key'])
len(udacity_test_accounts)

6

In [95]:
# Given some data with an account_key field, removes any records corresponding to Udacity test accounts
def remove_udacity_accounts(data):
    non_udacity_data = []
    for data_point in data:
        if data_point['account_key'] not in udacity_test_accounts:
            non_udacity_data.append(data_point)
    return non_udacity_data

In [96]:
# Remove Udacity test accounts from all three tables
non_udacity_enrollments = remove_udacity_accounts(enrollments)
non_udacity_engagement = remove_udacity_accounts(daily_engagement)
non_udacity_submissions = remove_udacity_accounts(project_submissions)

print len(non_udacity_enrollments)
print len(non_udacity_engagement)
print len(non_udacity_submissions)

1622
135656
3634


## Refining the Question

Question: 

How do numbers in the daily engagement table differ for students who pass the first project compared to students who don't pass the first project?

Exploring this further, we will need to isolate engagement data for the time period BEFORE the student's first project. This period of time will be different for each student. Example: One student may take two weeks to do their project, whereas another student may take two months. How would we compare data from different lengths of time? We also need to make the assumption that the engagements are only towards that one project.

Considering these factors, we will only look at engagement during a fixed period (the first week after enrollment) and exclude students who cancel within a week. Note that the free trial for Udacity courses is seven days, so the excluded students are those who cancelled during the free trial.

We start by creating a dictionary of students who either haven't canceled yet, or have stayed enrolled more than one week.

In [104]:
## Create a dictionary named paid_students containing all students who either
## haven't canceled yet or who remained enrolled for more than 7 days. The keys
## should be account keys, and the values should be the date the student enrolled.

paid_students = {}

for enrollment in non_udacity_enrollments:
    if not enrollment['is_canceled'] or enrollment['days_to_cancel'] > 7:
        account_key = enrollment['account_key']
        enrollment_date = enrollment['join_date']
        
        if account_key not in paid_students or \
            enrollment_date > paid_students[account_key]:
            paid_students[account_key] = enrollment_date
            

len(paid_students)

995

## Getting Data from First Week

In [None]:
# Takes a student's join date and the date of a specific engagement record,
# and returns True if that engagement record happened within one week
# of the student joining.
def within_one_week(join_date, engagement_date):
    time_delta = engagement_date - join_date
    return time_delta.days < 7
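An edge case is worth checking here. `paid_students` stores the most recent join date, so a student who re-enrolled can have engagement records dated before that join date. Those records produce a negative `time_delta`, which also satisfies `< 7`. The sketch below demonstrates this with made-up dates; `within_first_week_strict` is a hypothetical variant, not part of the notebook.

```python
from datetime import datetime

def within_one_week(join_date, engagement_date):
    # Copy of the notebook's function.
    time_delta = engagement_date - join_date
    return time_delta.days < 7

def within_first_week_strict(join_date, engagement_date):
    # Hypothetical variant that also rejects records dated before join_date.
    time_delta = engagement_date - join_date
    return 0 <= time_delta.days < 7

join = datetime(2015, 3, 1)      # made-up most recent join date
earlier = datetime(2015, 1, 9)   # made-up record from an earlier enrollment
first_day = datetime(2015, 3, 1)

loose_result = within_one_week(join, earlier)
strict_result = within_first_week_strict(join, earlier)
print(loose_result, strict_result)
print(within_first_week_strict(join, first_day))
```

If the summary statistics later show a student with more minutes than a week contains, this is one place to look.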

In [None]:
#####################################
#                 7                 #
#####################################

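## Create a list of rows from the engagement table including only rows where
## the student is one of the paid students found above, and the date is within
## one week of the student's join date.

paid_engagement_in_first_week = []
for engagement_record in non_udacity_engagement:
    account_key = engagement_record['account_key']
    if account_key in paid_students:
        join_date = paid_students[account_key]
        if within_one_week(join_date, engagement_record['utc_date']):
            paid_engagement_in_first_week.append(engagement_record)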

## Exploring Student Engagement

In [None]:
from collections import defaultdict

# Create a dictionary of engagement grouped by student.
# The keys are account keys, and the values are lists of engagement records.
engagement_by_account = defaultdict(list)
for engagement_record in paid_engagement_in_first_week:
    account_key = engagement_record['account_key']
    engagement_by_account[account_key].append(engagement_record)

In [None]:
# Create a dictionary with the total minutes each student spent in the classroom during the first week.
# The keys are account keys, and the values are numbers (total minutes)
total_minutes_by_account = {}
for account_key, engagement_for_student in engagement_by_account.items():
    total_minutes = 0
    for engagement_record in engagement_for_student:
        total_minutes += engagement_record['total_minutes_visited']
    total_minutes_by_account[account_key] = total_minutes

In [None]:
import numpy as np

# Summarize the data about minutes spent in the classroom
# list() keeps this working where values() returns a view rather than a list
total_minutes = list(total_minutes_by_account.values())
print 'Mean:', np.mean(total_minutes)
print 'Standard deviation:', np.std(total_minutes)
print 'Minimum:', np.min(total_minutes)
print 'Maximum:', np.max(total_minutes)

## Debugging Data Analysis Code

In [None]:
#####################################
#                 8                 #
#####################################

## Go through a similar process as before to see if there is a problem.
## Locate at least one surprising piece of data, output it, and take a look at it.

## Lessons Completed in First Week

In [None]:
#####################################
#                 9                 #
#####################################

## Adapt the code above to find the mean, standard deviation, minimum, and maximum for
## the number of lessons completed by each student during the first week. Try creating
## one or more functions to re-use the code above.
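One possible way to factor the grouping, summing, and summarizing steps into reusable functions is sketched below. The function names are hypothetical and the toy records are made up. The standard-library `statistics` module is swapped in for NumPy so the example has no third-party dependency; `statistics.pstdev` is the population standard deviation, which is what `np.std` computes by default.

```python
from collections import defaultdict
import statistics

def group_data(data, key_name):
    # Group rows into lists keyed by the value of key_name.
    grouped = defaultdict(list)
    for row in data:
        grouped[row[key_name]].append(row)
    return grouped

def sum_grouped_items(grouped, field_name):
    # Total one numeric field for each group.
    return {key: sum(row[field_name] for row in rows)
            for key, rows in grouped.items()}

def describe_data(values):
    # Same four statistics as the minutes summary above.
    values = list(values)
    return {'mean': statistics.mean(values),
            'std': statistics.pstdev(values),
            'min': min(values),
            'max': max(values)}

# Made-up first-week engagement records for two students.
toy_engagement = [
    {'account_key': 'a', 'lessons_completed': 1},
    {'account_key': 'a', 'lessons_completed': 2},
    {'account_key': 'b', 'lessons_completed': 0},
]
lessons_by_account = sum_grouped_items(
    group_data(toy_engagement, 'account_key'), 'lessons_completed')
summary = describe_data(lessons_by_account.values())
print(summary)
```

The same three calls would work for `total_minutes_visited` by changing only the field name.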

## Number of Visits in First Week

In [None]:
######################################
#                 10                 #
######################################

## Find the mean, standard deviation, minimum, and maximum for the number of
## days each student visits the classroom during the first week.
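A minimal sketch of counting days visited follows. It rests on two assumptions that the notebook does not state: each engagement record covers one student on one day, and a day counts as a visit when `num_courses_visited` is greater than zero. The `days_visited_by_account` helper and the toy records are hypothetical.

```python
from collections import defaultdict

def days_visited_by_account(engagement_records):
    # Count, per student, the records with at least one course visited.
    # Adding 0 for non-visit days keeps students with zero visits in the result.
    days = defaultdict(int)
    for record in engagement_records:
        visited = 1 if record['num_courses_visited'] > 0 else 0
        days[record['account_key']] += visited
    return dict(days)

# Made-up first-week records: 'a' visits on two of three days, 'b' never visits.
toy_records = [
    {'account_key': 'a', 'num_courses_visited': 2},
    {'account_key': 'a', 'num_courses_visited': 0},
    {'account_key': 'a', 'num_courses_visited': 1},
    {'account_key': 'b', 'num_courses_visited': 0},
]
days_visited = days_visited_by_account(toy_records)
print(days_visited)
```

The resulting values can be summarized with the same mean, standard deviation, minimum, and maximum used for minutes.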

## Splitting out Passing Students

In [None]:
######################################
#                 11                 #
######################################

## Create two lists of engagement data for paid students in the first week.
## The first list should contain data for students who eventually pass the
## subway project, and the second list should contain data for students
## who do not.

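subway_project_lesson_keys = ['746169184', '3176718735']

# Assumption: 'PASSED' and 'DISTINCTION' are the passing values of
# assigned_rating (only 'UNGRADED' appears in the sample row above).
pass_subway_project = set()
for submission in non_udacity_submissions:
    if (submission['lesson_key'] in subway_project_lesson_keys and
            submission['assigned_rating'] in ('PASSED', 'DISTINCTION')):
        pass_subway_project.add(submission['account_key'])

passing_engagement = []
non_passing_engagement = []
for engagement_record in paid_engagement_in_first_week:
    if engagement_record['account_key'] in pass_subway_project:
        passing_engagement.append(engagement_record)
    else:
        non_passing_engagement.append(engagement_record)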

## Comparing the Two Student Groups

In [None]:
######################################
#                 12                 #
######################################

## Compute some metrics you're interested in and see how they differ for
## students who pass the subway project vs. students who don't. A good
## starting point would be the metrics we looked at earlier (minutes spent
## in the classroom, lessons completed, and days visited).

## Making Histograms

In [None]:
######################################
#                 13                 #
######################################

## Make histograms of the three metrics we looked at earlier for both
## students who passed the subway project and students who didn't. You
## might also want to make histograms of any other metrics you examined.

## Improving Plots and Sharing Findings

In [None]:
######################################
#                 14                 #
######################################

## Make a more polished version of at least one of your visualizations
## from earlier. Try importing the seaborn library to make the visualization
## look better, adding axis labels and a title, and changing one or more
## arguments to the hist() function.