# Stage 1: Data wrangling

## Load Data from CSVs

We could copy this loading code three times, but any time you are about to repeat something is a good moment to write a function. Then, if we find a bug, we can fix it in one place.

In [139]:
import unicodecsv

def read_csv(path):
    with open(path, 'rb') as f:
        reader = unicodecsv.DictReader(f)
        data = list(reader)
    return data

enrollments = read_csv("enrollments.csv")
enrollments[0]

OrderedDict([('account_key', '448'),
             ('status', 'canceled'),
             ('join_date', '2014-11-10'),
             ('cancel_date', '2015-01-14'),
             ('days_to_cancel', '65'),
             ('is_udacity', 'True'),
             ('is_canceled', 'True')])

In [140]:
daily_engagement = read_csv("daily_engagement.csv")    
daily_engagement[0]

OrderedDict([('acct', '0'),
             ('utc_date', '2015-01-09'),
             ('num_courses_visited', '1.0'),
             ('total_minutes_visited', '11.6793745'),
             ('lessons_completed', '0.0'),
             ('projects_completed', '0.0')])

In [141]:
project_submissions = read_csv("project_submissions.csv")
project_submissions[0]

OrderedDict([('creation_date', '2015-01-14'),
             ('completion_date', '2015-01-16'),
             ('assigned_rating', 'UNGRADED'),
             ('account_key', '256'),
             ('lesson_key', '3176718735'),
             ('processing_state', 'EVALUATED')])

## Fixing Data Types
Some fields, such as join_date (date), days_to_cancel (int), and is_udacity (bool), were read in as strings. We need to convert them to the correct types.

It is better to do this upfront so that we avoid data type confusion later.

In [142]:
from datetime import datetime as dt

# Takes a date as a string, and returns a Python datetime object. 
# If there is no date given, returns None
def parse_date(date):
    if date == '':
        return None
    else:
        return dt.strptime(date, '%Y-%m-%d')
    
# Takes a string which is either an empty string or represents an integer,
# and returns an int or None.
def parse_maybe_int(i):
    if i == '':
        return None
    else:
        return int(i)

# Clean up the data types in the enrollments table
for enrollment in enrollments:
    enrollment['cancel_date'] = parse_date(enrollment['cancel_date'])
    enrollment['days_to_cancel'] = parse_maybe_int(enrollment['days_to_cancel'])
    enrollment['is_canceled'] = enrollment['is_canceled'] == 'True'
    enrollment['is_udacity'] = enrollment['is_udacity'] == 'True'
    enrollment['join_date'] = parse_date(enrollment['join_date'])
    
enrollments[0]

OrderedDict([('account_key', '448'),
             ('status', 'canceled'),
             ('join_date', datetime.datetime(2014, 11, 10, 0, 0)),
             ('cancel_date', datetime.datetime(2015, 1, 14, 0, 0)),
             ('days_to_cancel', 65),
             ('is_udacity', True),
             ('is_canceled', True)])

In [143]:
# Clean up the data types in the engagement table
for engagement_record in daily_engagement:
    engagement_record['lessons_completed'] = int(float(engagement_record['lessons_completed']))
    engagement_record['num_courses_visited'] = int(float(engagement_record['num_courses_visited']))
    engagement_record['projects_completed'] = int(float(engagement_record['projects_completed']))
    engagement_record['total_minutes_visited'] = float(engagement_record['total_minutes_visited'])
    engagement_record['utc_date'] = parse_date(engagement_record['utc_date'])
    
daily_engagement[0]

OrderedDict([('acct', '0'),
             ('utc_date', datetime.datetime(2015, 1, 9, 0, 0)),
             ('num_courses_visited', 1),
             ('total_minutes_visited', 11.6793745),
             ('lessons_completed', 0),
             ('projects_completed', 0)])

In [144]:
# Clean up the data types in the submissions table
for submission in project_submissions:
    submission['completion_date'] = parse_date(submission['completion_date'])
    submission['creation_date'] = parse_date(submission['creation_date'])

project_submissions[0]

OrderedDict([('creation_date', datetime.datetime(2015, 1, 14, 0, 0)),
             ('completion_date', datetime.datetime(2015, 1, 16, 0, 0)),
             ('assigned_rating', 'UNGRADED'),
             ('account_key', '256'),
             ('lesson_key', '3176718735'),
             ('processing_state', 'EVALUATED')])

Note that the cells above modify the contents of our data variables in place. If you run them more than once in the same session, an error will occur, because the second time the values are no longer the strings that the parsing functions expect.
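
One way to make the conversion cells safe to re-run is to let the parser pass through values that have already been converted. This is only a minimal sketch: parse_date_idempotent is a hypothetical helper, not part of the original notebook.

```python
from datetime import datetime


def parse_date_idempotent(date):
    """Parse a 'YYYY-MM-DD' string; leave already-converted values alone.

    Hypothetical variant of parse_date for illustration only.
    """
    # Values from an earlier run are either None or datetime objects.
    if date is None or isinstance(date, datetime):
        return date
    if date == '':
        return None
    return datetime.strptime(date, '%Y-%m-%d')


first_pass = parse_date_idempotent('2014-11-10')
# Running the conversion again returns the same object instead of raising.
second_pass = parse_date_idempotent(first_pass)
print(first_pass, second_pass)
```

The trade-off is that the function now silently accepts more input types, so a genuinely wrong value is a little harder to spot.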

# Stage 2: Questions we can now ask about data

What fraction of students do not cancel their nanodegree?

What fraction of students cancel their nanodegree?

What is the average time to cancel, and how are those times distributed?

What is the average time enrolled, and its distribution, for students who do not cancel?

How do engagement patterns differ between students who cancel and those who do not?

Were students who did not cancel more engaged in course activity than those who canceled?

Were students who did not cancel more active in additional courses?

How many courses did students who canceled take, compared with those who did not?

How many minutes were they engaged daily/weekly/monthly (+ distribution)?

How many lessons did they finish daily/weekly/monthly (+ distribution)?

How many projects did they submit daily/weekly/monthly (+ distribution)?

Were successful students active more regularly than those who canceled?

How do the two groups differ in the number of blank/incomplete/ungraded submissions vs. passed/distinction submissions?

What is the ratio of finished to unfinished submissions among successful and canceling users?

How much time do students spend on Udacity courses?

How long does it take from starting a course to finishing it?

How long does it take to submit a project?

How does time spent relate to the number of lessons completed or projects submitted?

How does engagement change over time?

### How do students who pass projects differ from those who do not?

# Stage 3: Data wrangling again

In [145]:
#####################################
#                 2                 #
#####################################

## Find the total number of rows and the number of unique students (account keys)
## in each table.

enrollments_rows_n = len(enrollments)
enrollments_unique_students_n = len(set([enrollment['account_key'] for enrollment in enrollments]))
print("enrollments_rows_n", enrollments_rows_n)
print("enrollments_unique_students_n", enrollments_unique_students_n)

daily_engagement_rows_n = len(daily_engagement)
daily_engagement_unique_students_n = len(set([daily_engagement_element['acct'] for daily_engagement_element in daily_engagement]))
print("daily_engagement_rows_n", daily_engagement_rows_n)
print("daily_engagement_unique_students_n", daily_engagement_unique_students_n)

project_submissions_rows_n = len(project_submissions)
project_submissions_unique_students_n = len(set([project_submission['account_key'] for project_submission in project_submissions]))
print("project_submissions_rows_n", project_submissions_rows_n)
print("project_submissions_unique_students_n", project_submissions_unique_students_n)

enrollments_rows_n 1640
enrollments_unique_students_n 1302
daily_engagement_rows_n 136240
daily_engagement_unique_students_n 1237
project_submissions_rows_n 3642
project_submissions_unique_students_n 743


Problems:

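A) enrollments_unique_students_n > daily_engagement_unique_students_n, although daily engagement should cover every enrolled student, including those who did not do anything.

B) The account key column has different names: account_key in enrollments and project submissions, acct in daily engagement.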

## Problems in the Data

In [146]:
#####################################
#                 3                 #
#####################################

## Rename the "acct" column in the daily_engagement table to "account_key".

daily_engagement[1]

OrderedDict([('acct', '0'),
             ('utc_date', datetime.datetime(2015, 1, 10, 0, 0)),
             ('num_courses_visited', 2),
             ('total_minutes_visited', 37.2848873333),
             ('lessons_completed', 0),
             ('projects_completed', 0)])

In [147]:
for daily_engagement_element in daily_engagement:
    daily_engagement_element['account_key'] = daily_engagement_element['acct']
    del daily_engagement_element['acct']
    
daily_engagement[0]['account_key']

'0'

Now that every table uses the same account_key column, we can solve the previous task more cleanly by writing one function that returns the unique students in any table.

In [156]:
#####################################
#                 2                 #
#####################################

## Find the total number of rows and the number of unique students (account keys)
## in each table.

def get_unique_students(data):
    return set([data_row['account_key'] for data_row in data])

enrollments_rows_n = len(enrollments)
enrollments_unique_students = get_unique_students(enrollments)
print("enrollments_rows_n", enrollments_rows_n)
print("enrollments_unique_students_n", len(enrollments_unique_students))

daily_engagement_rows_n = len(daily_engagement)
daily_engagement_unique_students = get_unique_students(daily_engagement)
print("daily_engagement_rows_n", daily_engagement_rows_n)
print("daily_engagement_unique_students_n", len(daily_engagement_unique_students))

project_submissions_rows_n = len(project_submissions)
project_submissions_unique_students = get_unique_students(project_submissions)
print("project_submissions_rows_n", project_submissions_rows_n)
print("project_submissions_unique_students_n", len(project_submissions_unique_students))

enrollments_rows_n 1640
enrollments_unique_students_n 1302
daily_engagement_rows_n 136240
daily_engagement_unique_students_n 1237
project_submissions_rows_n 3642
project_submissions_unique_students_n 743


## Missing Engagement Records

When analyzing data, it is important to investigate inconsistencies like this one (more unique students in enrollments than in daily engagement) before going any further. Until you know what caused them, you do not know what is incorrect and you cannot trust your results.

The process:
1) Identify which data points are surprising.

2) Print surprising data points.

3) Fix any problems.
- more investigation may be needed
- or there might be no problem (as here: the students simply canceled within 24 hours)
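
The first two steps of this process can be sketched with a set difference. The miniature tables below are invented for illustration; only the account_key field and the get_unique_students idea come from the notebook.

```python
# Invented miniature tables; the real ones are loaded from the CSV files.
enrollments = [
    {'account_key': '1', 'days_to_cancel': 30},
    {'account_key': '2', 'days_to_cancel': 0},
    {'account_key': '3', 'days_to_cancel': None},
]
daily_engagement = [
    {'account_key': '1', 'total_minutes_visited': 12.5},
    {'account_key': '3', 'total_minutes_visited': 0.0},
]


def get_unique_students(data):
    """Return the set of account keys that appear in a table."""
    return {row['account_key'] for row in data}


# Step 1: identify surprising data points (enrolled, but no engagement).
missing_keys = (get_unique_students(enrollments)
                - get_unique_students(daily_engagement))

# Step 2: print them so they can be inspected by eye.
surprising = [row for row in enrollments
              if row['account_key'] in missing_keys]
for row in surprising:
    print(row)
```

Step 3 stays a manual judgment: in this invented sample the missing student canceled after 0 days, which would explain the gap without any fix.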

In [176]:
#####################################
#                 4                 #
#####################################
## Find any one student enrollments where the student is missing from the daily engagement table.
## Output that enrollment.
for enrollment in enrollments:
    if enrollment['account_key'] not in daily_engagement_unique_students:
        print(enrollment)
        break

OrderedDict([('account_key', '1219'), ('status', 'canceled'), ('join_date', datetime.datetime(2014, 11, 12, 0, 0)), ('cancel_date', datetime.datetime(2014, 11, 12, 0, 0)), ('days_to_cancel', 0), ('is_udacity', False), ('is_canceled', True)])


## Checking for More Problem Records

After solving one problem, we should check whether any other problematic data points remain.

In [179]:
#####################################
#                 5                 #
#####################################

## Find the number of surprising data points (enrollments missing from
## the engagement table) that remain, if any.

for enrollment in enrollments:
    if enrollment['account_key'] not in daily_engagement_unique_students \
        and enrollment['join_date'] != enrollment['cancel_date']:
        print(enrollment)

OrderedDict([('account_key', '1304'), ('status', 'canceled'), ('join_date', datetime.datetime(2015, 1, 10, 0, 0)), ('cancel_date', datetime.datetime(2015, 3, 10, 0, 0)), ('days_to_cancel', 59), ('is_udacity', True), ('is_canceled', True)])
OrderedDict([('account_key', '1304'), ('status', 'canceled'), ('join_date', datetime.datetime(2015, 3, 10, 0, 0)), ('cancel_date', datetime.datetime(2015, 6, 17, 0, 0)), ('days_to_cancel', 99), ('is_udacity', True), ('is_canceled', True)])
OrderedDict([('account_key', '1101'), ('status', 'current'), ('join_date', datetime.datetime(2015, 2, 25, 0, 0)), ('cancel_date', None), ('days_to_cancel', None), ('is_udacity', True), ('is_canceled', False)])


They are all Udacity test accounts (is_udacity is True), which should be excluded from the data.

## Tracking Down the Remaining Problems


In [183]:
# Create a set of the account keys for all Udacity test accounts
udacity_test_accounts = set()
for enrollment in enrollments:
    if enrollment['is_udacity']:
        udacity_test_accounts.add(enrollment['account_key'])
len(udacity_test_accounts)

6

In [184]:
# Given some data with an account_key field, removes any records corresponding to Udacity test accounts
def remove_udacity_accounts(data):
    non_udacity_data = []
    for data_point in data:
        if data_point['account_key'] not in udacity_test_accounts:
            non_udacity_data.append(data_point)
    return non_udacity_data

In [186]:
# Remove Udacity test accounts from all three tables
non_udacity_enrollments = remove_udacity_accounts(enrollments)
non_udacity_engagement = remove_udacity_accounts(daily_engagement)
non_udacity_submissions = remove_udacity_accounts(project_submissions)

print(len(non_udacity_enrollments))
print(len(non_udacity_engagement))
print(len(non_udacity_submissions))

1622
135656
3634


# Stage 4: Data exploration

## Refining the Question

In [None]:
#####################################
#                 6                 #
#####################################

## Create a dictionary named paid_students containing all students who either
## haven't canceled yet or who remained enrolled for more than 7 days. The keys
## should be account keys, and the values should be the date the student enrolled.

paid_students =
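
The exercise above is left open in the notebook. One possible approach is sketched below on invented sample rows. It assumes that days_to_cancel is None for students who have not canceled, and that for a student who enrolled more than once we keep the most recent join date; find_paid_students is a hypothetical helper name.

```python
from datetime import datetime

# Invented rows in the same shape as non_udacity_enrollments.
sample_enrollments = [
    {'account_key': '1', 'join_date': datetime(2015, 1, 1),
     'days_to_cancel': None},
    {'account_key': '2', 'join_date': datetime(2015, 1, 5),
     'days_to_cancel': 3},
    {'account_key': '3', 'join_date': datetime(2015, 2, 1),
     'days_to_cancel': 30},
    {'account_key': '3', 'join_date': datetime(2015, 4, 1),
     'days_to_cancel': None},
]


def find_paid_students(enrollments):
    """Map account key -> join date for students who stayed past 7 days."""
    paid = {}
    for enrollment in enrollments:
        days_to_cancel = enrollment['days_to_cancel']
        # None means the student has not canceled (assumption).
        if days_to_cancel is None or days_to_cancel > 7:
            account_key = enrollment['account_key']
            join_date = enrollment['join_date']
            # Assumption: keep the most recent enrollment per student.
            if account_key not in paid or join_date > paid[account_key]:
                paid[account_key] = join_date
    return paid


paid_students = find_paid_students(sample_enrollments)
print(paid_students)
```

Student 2 is dropped because they canceled after 3 days, and student 3 keeps the later of their two join dates.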

## Getting Data from First Week

In [None]:
# Takes a student's join date and the date of a specific engagement record,
# and returns True if that engagement record happened within one week
# of the student joining.
def within_one_week(join_date, engagement_date):
    time_delta = engagement_date - join_date
    return time_delta.days < 7

In [None]:
#####################################
#                 7                 #
#####################################

## Create a list of rows from the engagement table including only rows where
## the student is one of the paid students you just found, and the date is within
## one week of the student's join date.

paid_engagement_in_first_week = 
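
A possible way to fill in this exercise, sketched on invented sample data. within_one_week is copied unchanged from the cell above; the paid_students dictionary and the engagement rows are made up for the example.

```python
from datetime import datetime


def within_one_week(join_date, engagement_date):
    # Copied unchanged from the notebook cell above.
    time_delta = engagement_date - join_date
    return time_delta.days < 7


# Invented data: one paid student who joined on 2015-01-01.
paid_students = {'1': datetime(2015, 1, 1)}
sample_engagement = [
    {'account_key': '1', 'utc_date': datetime(2015, 1, 3),
     'total_minutes_visited': 20.0},
    {'account_key': '1', 'utc_date': datetime(2015, 1, 20),
     'total_minutes_visited': 5.0},
    {'account_key': '2', 'utc_date': datetime(2015, 1, 3),
     'total_minutes_visited': 9.0},
]

# Keep a record only if the student is paid and the record falls
# within one week of that student's join date.
paid_engagement_in_first_week = [
    record for record in sample_engagement
    if record['account_key'] in paid_students
    and within_one_week(paid_students[record['account_key']],
                        record['utc_date'])
]
print(len(paid_engagement_in_first_week))
```

The membership test comes first in the condition, so the dictionary lookup is never attempted for students who are not in paid_students.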

## Exploring Student Engagement

In [None]:
from collections import defaultdict

# Create a dictionary of engagement grouped by student.
# The keys are account keys, and the values are lists of engagement records.
engagement_by_account = defaultdict(list)
for engagement_record in paid_engagement_in_first_week:
    account_key = engagement_record['account_key']
    engagement_by_account[account_key].append(engagement_record)

In [None]:
# Create a dictionary with the total minutes each student spent in the classroom during the first week.
# The keys are account keys, and the values are numbers (total minutes)
total_minutes_by_account = {}
for account_key, engagement_for_student in engagement_by_account.items():
    total_minutes = 0
    for engagement_record in engagement_for_student:
        total_minutes += engagement_record['total_minutes_visited']
    total_minutes_by_account[account_key] = total_minutes

In [None]:
import numpy as np

# Summarize the data about minutes spent in the classroom
# list() is needed in Python 3, where dict.values() returns a view
total_minutes = list(total_minutes_by_account.values())
print('Mean:', np.mean(total_minutes))
print('Standard deviation:', np.std(total_minutes))
print('Minimum:', np.min(total_minutes))
print('Maximum:', np.max(total_minutes))

## Debugging Data Analysis Code

In [None]:
#####################################
#                 8                 #
#####################################

## Go through a similar process as before to see if there is a problem.
## Locate at least one surprising piece of data, output it, and take a look at it.
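
One sanity check that could be used here: no student can spend more minutes in the classroom in one week than the week contains (7 * 24 * 60 = 10080). The sketch below applies that check to invented totals and then prints the underlying records; none of these numbers come from the real data.

```python
# Invented totals in the same shape as total_minutes_by_account.
total_minutes_by_account = {'1': 120.5, '2': 10568.1, '3': 0.0}

# Invented "first week" records for the suspicious student.
engagement_by_account = {
    '2': [
        {'utc_date': '2015-01-07', 'total_minutes_visited': 850.5},
        {'utc_date': '2015-07-09', 'total_minutes_visited': 9717.6},
    ],
}

MINUTES_PER_WEEK = 7 * 24 * 60

# Identify totals that are impossible for a single week.
surprising = {
    account_key: minutes
    for account_key, minutes in total_minutes_by_account.items()
    if minutes > MINUTES_PER_WEEK
}

# Print the records behind each surprising total for inspection.
for account_key in surprising:
    for record in engagement_by_account[account_key]:
        print(account_key, record)
```

In this invented sample the two records are months apart, which would point at the first-week filter rather than at the data itself.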

## Lessons Completed in First Week

In [None]:
#####################################
#                 9                 #
#####################################

## Adapt the code above to find the mean, standard deviation, minimum, and maximum for
## the number of lessons completed by each student during the first week. Try creating
## one or more functions to re-use the code above.
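
A sketch of what such reusable functions might look like. The names group_data, sum_grouped_items, and describe_data are hypothetical, the sample rows are invented, and the standard-library statistics module is used here instead of numpy so the example has no dependencies. statistics.pstdev is the population standard deviation, which is also what np.std computes by default.

```python
from collections import defaultdict
import statistics


def group_data(data, key_name):
    """Group rows into lists keyed by the value of key_name."""
    grouped = defaultdict(list)
    for row in data:
        grouped[row[key_name]].append(row)
    return grouped


def sum_grouped_items(grouped_data, field_name):
    """Sum one field over each group of rows."""
    return {key: sum(row[field_name] for row in rows)
            for key, rows in grouped_data.items()}


def describe_data(values):
    """Return mean, population standard deviation, minimum, and maximum."""
    values = list(values)
    return {
        'mean': statistics.mean(values),
        'std': statistics.pstdev(values),
        'min': min(values),
        'max': max(values),
    }


# Invented first-week engagement rows.
sample_engagement = [
    {'account_key': '1', 'lessons_completed': 2},
    {'account_key': '1', 'lessons_completed': 1},
    {'account_key': '2', 'lessons_completed': 0},
    {'account_key': '3', 'lessons_completed': 3},
]

lessons_by_account = sum_grouped_items(
    group_data(sample_engagement, 'account_key'), 'lessons_completed')
summary = describe_data(lessons_by_account.values())
print(summary)
```

The same three functions would work for total_minutes_visited by changing only the field name.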

## Number of Visits in First Week

In [None]:
######################################
#                 10                 #
######################################

## Find the mean, standard deviation, minimum, and maximum for the number of
## days each student visits the classroom during the first week.
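
One possible way to count visit days, sketched on invented rows. It assumes each engagement row covers one student on one day, and that a day counts as a visit when num_courses_visited is greater than zero.

```python
from collections import defaultdict

# Invented first-week rows: one row per student per day (assumption).
sample_engagement = [
    {'account_key': '1', 'num_courses_visited': 2},
    {'account_key': '1', 'num_courses_visited': 0},
    {'account_key': '1', 'num_courses_visited': 1},
    {'account_key': '2', 'num_courses_visited': 0},
]

days_visited_by_account = defaultdict(int)
for record in sample_engagement:
    # A day counts once, no matter how many courses were visited.
    visited = 1 if record['num_courses_visited'] > 0 else 0
    days_visited_by_account[record['account_key']] += visited

print(dict(days_visited_by_account))
```

Adding 0 for days without a visit keeps students who never visited in the dictionary, so they still count toward the mean.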

## Splitting out Passing Students

In [None]:
######################################
#                 11                 #
######################################

## Create two lists of engagement data for paid students in the first week.
## The first list should contain data for students who eventually pass the
## subway project, and the second list should contain data for students
## who do not.

subway_project_lesson_keys = ['746169184', '3176718735']

passing_engagement =
non_passing_engagement =
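
A possible approach to this split, sketched on invented rows. It assumes that a submission counts as passing when assigned_rating is 'PASSED' or 'DISTINCTION'; those rating strings are an assumption and should be checked against the real submissions table.

```python
subway_project_lesson_keys = ['746169184', '3176718735']

# Assumption: these two ratings mean the project was passed.
passing_ratings = {'PASSED', 'DISTINCTION'}

# Invented submissions; lesson key '9999' stands for some other project.
sample_submissions = [
    {'account_key': '1', 'lesson_key': '746169184',
     'assigned_rating': 'PASSED'},
    {'account_key': '2', 'lesson_key': '3176718735',
     'assigned_rating': 'INCOMPLETE'},
    {'account_key': '3', 'lesson_key': '9999',
     'assigned_rating': 'PASSED'},
]

# Invented first-week engagement rows for paid students.
sample_engagement = [
    {'account_key': '1', 'total_minutes_visited': 30.0},
    {'account_key': '2', 'total_minutes_visited': 5.0},
    {'account_key': '3', 'total_minutes_visited': 12.0},
]

# Students with at least one passing submission for the subway project.
pass_subway_project = {
    submission['account_key']
    for submission in sample_submissions
    if submission['lesson_key'] in subway_project_lesson_keys
    and submission['assigned_rating'] in passing_ratings
}

passing_engagement = [record for record in sample_engagement
                      if record['account_key'] in pass_subway_project]
non_passing_engagement = [record for record in sample_engagement
                          if record['account_key'] not in pass_subway_project]

print(len(passing_engagement), len(non_passing_engagement))
```

Student 3 lands in the non-passing list because their passing submission belongs to a different project.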

## Comparing the Two Student Groups

In [None]:
######################################
#                 12                 #
######################################

## Compute some metrics you're interested in and see how they differ for
## students who pass the subway project vs. students who don't. A good
## starting point would be the metrics we looked at earlier (minutes spent
## in the classroom, lessons completed, and days visited).

## Making Histograms

In [None]:
######################################
#                 13                 #
######################################

## Make histograms of the three metrics we looked at earlier for both
## students who passed the subway project and students who didn't. You
## might also want to make histograms of any other metrics you examined.

## Improving Plots and Sharing Findings

In [None]:
######################################
#                 14                 #
######################################

## Make a more polished version of at least one of your visualizations
## from earlier. Try importing the seaborn library to make the visualization
## look better, adding axis labels and a title, and changing one or more
## arguments to the hist() function.