### Python code for the free Udacity course Intro to Data Analysis

### Lesson 1: Data Analysis Process

This first part covers the data analysis process: data wrangling (acquiring and cleaning the data), exploration (building intuition and finding patterns), drawing conclusions (or making predictions), and communicating the results. The dataset is provided by Udacity and contains information on students' enrollments, daily engagement, and project submissions.

#### Step 1: Load libraries and dataset

In [2]:
## Install libraries
pip install unicodecsv

In [3]:
## Load libraries
import unicodecsv
from datetime import datetime as dt
from collections import defaultdict
import numpy as np

In [7]:
# Define function to read csv
def read_csv(filename):
    with open(filename, 'rb') as f:
        reader = unicodecsv.DictReader(f)
        return list(reader)
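
On Python 3, the standard-library `csv` module reads Unicode text directly when the file is opened in text mode, so `unicodecsv` may not be needed. The following is a minimal sketch, assuming UTF-8 encoded files; the function name `read_csv_stdlib` and the small sample file are hypothetical and not part of the course code.

```python
import csv
import os
import tempfile

def read_csv_stdlib(filename):
    """Read a CSV file into a list of dicts using only the standard library."""
    # newline='' is the setting the csv documentation recommends for reader input
    with open(filename, newline='', encoding='utf-8') as f:
        return list(csv.DictReader(f))

# Tiny made-up file, used only to show the shape of the result
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 newline='', encoding='utf-8') as tmp:
    tmp.write('account_key,status\n448,canceled\n')
    sample_path = tmp.name

sample_rows = read_csv_stdlib(sample_path)
os.remove(sample_path)
print(sample_rows[0]['status'])
```

As with `unicodecsv.DictReader`, every value comes back as a string, so the type conversions in Step 2 are still required.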

In [26]:
# Load data
enrollments = read_csv('C:/Users/neliq/Desktop/Intro-to-data-analysis/enrollments.csv')
engagements = read_csv('C:/Users/neliq/Desktop/Intro-to-data-analysis/daily_engagement.csv')
submissions = read_csv('C:/Users/neliq/Desktop/Intro-to-data-analysis/project_submissions.csv')

#### Step 2: Fixing column names and data types


In [27]:
## Rename the column acct
for eng in engagements:
    eng['account_key'] = eng['acct']
    del eng['acct']

In [24]:
enrollments[0]

{'account_key': '448',
 'status': 'canceled',
 'join_date': '2014-11-10',
 'cancel_date': '2015-01-14',
 'days_to_cancel': '65',
 'is_udacity': 'True',
 'is_canceled': 'True'}

In [28]:
## Define functions

## 1. Take a date string and return a Python datetime object.
## If no date is given (empty string), return None.

def parse_date(date):
    if date == '':
        return None
    else:
        return dt.strptime(date, '%Y-%m-%d')
    
## 2. Take a string that is either empty or represents an integer,
## and return an int or None.

def parse_maybe_int(i):
    if i == '':
        return None
    else:
        return int(i)
    
## Apply the functions and change data types of the enrollments dataset
for enrollment in enrollments:
    enrollment['cancel_date'] = parse_date(enrollment['cancel_date'])
    enrollment['days_to_cancel'] = parse_maybe_int(enrollment['days_to_cancel'])
    enrollment['is_canceled'] = enrollment['is_canceled'] == 'True'
    enrollment['is_udacity'] = enrollment['is_udacity'] == 'True'
    enrollment['join_date'] = parse_date(enrollment['join_date'])

## Clean up the data types in the engagement table
for engagement_record in engagements:
    engagement_record['lessons_completed'] = int(float(engagement_record['lessons_completed']))
    engagement_record['num_courses_visited'] = int(float(engagement_record['num_courses_visited']))
    engagement_record['projects_completed'] = int(float(engagement_record['projects_completed']))
    engagement_record['total_minutes_visited'] = float(engagement_record['total_minutes_visited'])
    engagement_record['utc_date'] = parse_date(engagement_record['utc_date'])
    
## Clean up the data types in the submissions table
for submission in submissions:
    submission['completion_date'] = parse_date(submission['completion_date'])
    submission['creation_date'] = parse_date(submission['creation_date'])


## Check the first record of each table (a notebook cell displays only the last expression)
enrollments[0]
engagements[0]
submissions[0]

{'creation_date': datetime.datetime(2015, 1, 14, 0, 0),
 'completion_date': datetime.datetime(2015, 1, 16, 0, 0),
 'assigned_rating': 'UNGRADED',
 'account_key': '256',
 'lesson_key': '3176718735',
 'processing_state': 'EVALUATED'}
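
To make the behaviour of the two helper functions concrete, here is a small self-contained sketch that repeats their definitions in compact form and calls them on made-up inputs. The sample strings are assumptions chosen for illustration. The last call shows why the engagement columns go through `int(float(...))`: a string such as `'2.0'` cannot be passed to `int` directly, which is presumably how those columns are stored.

```python
from datetime import datetime as dt

def parse_date(date):
    # Empty string means "no date"
    return None if date == '' else dt.strptime(date, '%Y-%m-%d')

def parse_maybe_int(i):
    # Empty string means "no value"
    return None if i == '' else int(i)

parsed_date = parse_date('2015-01-14')
missing_date = parse_date('')
parsed_int = parse_maybe_int('65')
missing_int = parse_maybe_int('')

# int('2.0') raises ValueError, so convert through float first
lessons = int(float('2.0'))

print(parsed_date, missing_date, parsed_int, missing_int, lessons)
```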

#### Step 3: Investigating the data

Find the total number of enrollments, engagement records, and project submissions, as well as the number of unique students in each table.

In [29]:
def get_unique_students(data):
    unique_students = set()
    for data_point in data:
        unique_students.add(data_point['account_key'])
    return unique_students
print('Enrollments:', len(enrollments))
unique_enrolled_students = get_unique_students(enrollments)
print('Unique enrollments:', len(unique_enrolled_students))
print('Engagements:', len(engagements))
unique_engagement_students = get_unique_students(engagements)
print('Unique engagements:', len(unique_engagement_students))
print('Submissions:', len(submissions))
unique_project_submitters = get_unique_students(submissions)
print('Unique submissions:', len(unique_project_submitters))

Enrollments: 1640
Unique enrollments: 1302
Engagements: 136240
Unique engagements: 1237
Submissions: 3642
Unique submissions: 743
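
The same counting idea can be sketched with a set comprehension, which collects each `account_key` once regardless of how many rows share it. The records below are made up for illustration and are not taken from the dataset.

```python
# Hypothetical sample: two rows for the same student, one row for another
sample_records = [
    {'account_key': '448'},
    {'account_key': '448'},
    {'account_key': '700'},
]

# A set keeps each key only once
unique_keys = {record['account_key'] for record in sample_records}

print('Rows:', len(sample_records))
print('Unique students:', len(unique_keys))
```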


#### Missing engagement records

Some enrolled students are missing from `unique_engagement_students`: there are 1302 unique enrolled students but only 1237 unique students with engagement records, and the two numbers should match. The cell below prints one such enrollment; that student canceled on the same day they joined.

In [31]:
for enrollment in enrollments:
    student = enrollment['account_key']
    if student not in unique_engagement_students:
        print(enrollment)
        break

{'account_key': '1219', 'status': 'canceled', 'join_date': datetime.datetime(2014, 11, 12, 0, 0), 'cancel_date': datetime.datetime(2014, 11, 12, 0, 0), 'days_to_cancel': 0, 'is_udacity': False, 'is_canceled': True}
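
Because the unique students are stored as sets, the students who appear in one table but not in the other can also be listed with a set difference. This is a sketch on made-up keys, not the course's approach; the variable names are hypothetical.

```python
# Hypothetical account keys for illustration
enrolled = {'448', '700', '1219'}
engaged = {'448', '700'}

# Students who enrolled but have no engagement record
missing_from_engagement = enrolled - engaged

# Students with engagement records but no enrollment (empty in this sample)
missing_from_enrollment = engaged - enrolled

print(missing_from_engagement, missing_from_enrollment)
```

With the real data, the equivalent expression would be `unique_enrolled_students - unique_engagement_students`.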


#### Check for more problem records

Next we look for enrollments that have no engagement records even though the student did not cancel on the day they joined. Printing these records reveals another problem: they belong to Udacity test accounts (`is_udacity` is `True`), which should be removed.

In [33]:
num_problem_students = 0
for enrollment in enrollments:
    student = enrollment['account_key']
    if (student not in unique_engagement_students and enrollment['join_date'] != enrollment['cancel_date']):
        print(enrollment)
        num_problem_students += 1

num_problem_students

{'account_key': '1304', 'status': 'canceled', 'join_date': datetime.datetime(2015, 1, 10, 0, 0), 'cancel_date': datetime.datetime(2015, 3, 10, 0, 0), 'days_to_cancel': 59, 'is_udacity': True, 'is_canceled': True}
{'account_key': '1304', 'status': 'canceled', 'join_date': datetime.datetime(2015, 3, 10, 0, 0), 'cancel_date': datetime.datetime(2015, 6, 17, 0, 0), 'days_to_cancel': 99, 'is_udacity': True, 'is_canceled': True}
{'account_key': '1101', 'status': 'current', 'join_date': datetime.datetime(2015, 2, 25, 0, 0), 'cancel_date': None, 'days_to_cancel': None, 'is_udacity': True, 'is_canceled': False}


3

#### Check the number of Udacity test accounts

Six Udacity test accounts were found. These need to be removed.

In [34]:
udacity_test_accounts = set()
for enrollment in enrollments:
    if enrollment['is_udacity']:
        udacity_test_accounts.add(enrollment['account_key'])
len(udacity_test_accounts)

6

#### Remove Udacity test accounts and create new clean datasets

In [37]:
def remove_udacity_accounts(data):
    non_udacity_data = []
    for data_point in data:
        if data_point['account_key'] not in udacity_test_accounts:
            non_udacity_data.append(data_point)
    return non_udacity_data

non_udacity_enrollments = remove_udacity_accounts(enrollments)
non_udacity_engagements = remove_udacity_accounts(engagements)
non_udacity_submissions = remove_udacity_accounts(submissions)

print(len(non_udacity_enrollments))
print(len(non_udacity_engagements))
print(len(non_udacity_submissions))

1622
135656
3634


#### Step 4: Exploration phase

In [14]:
paid_students = {}
for enrollment in non_udacity_enrollments:
    if (not enrollment['is_canceled'] or
            enrollment['days_to_cancel'] > 7):
        account_key = enrollment['account_key']
        enrollment_date = enrollment['join_date']
        if (account_key not in paid_students or
                enrollment_date > paid_students[account_key]):
            paid_students[account_key] = enrollment_date
len(paid_students)

995
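
The selection rule in the cell above keeps a student if they have not canceled or if they canceled after more than 7 days, and it stores the most recent qualifying join date per student. The sketch below applies the same rule to made-up enrollments so the effect of each condition is visible. The sample records are assumptions, and reading the 7-day threshold as a free-trial period is inferred from the function name `remove_free_trial_cancels`.

```python
from datetime import datetime

sample_enrollments = [
    # Canceled after 3 days: not counted as paid
    {'account_key': 'a', 'is_canceled': True, 'days_to_cancel': 3,
     'join_date': datetime(2015, 1, 1)},
    # Canceled after 30 days: counted
    {'account_key': 'b', 'is_canceled': True, 'days_to_cancel': 30,
     'join_date': datetime(2015, 1, 5)},
    # Same student enrolled twice, both qualify: the later join date is kept
    {'account_key': 'c', 'is_canceled': True, 'days_to_cancel': 20,
     'join_date': datetime(2015, 2, 1)},
    {'account_key': 'c', 'is_canceled': False, 'days_to_cancel': None,
     'join_date': datetime(2015, 3, 1)},
]

sample_paid = {}
for e in sample_enrollments:
    # Short-circuit: days_to_cancel is only compared when the student canceled
    if not e['is_canceled'] or e['days_to_cancel'] > 7:
        key = e['account_key']
        if key not in sample_paid or e['join_date'] > sample_paid[key]:
            sample_paid[key] = e['join_date']

print(sorted(sample_paid))
```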

In [19]:
def within_one_week(join_date, engagement_date):
    time_delta = engagement_date - join_date
    return time_delta.days < 7

def remove_free_trial_cancels(data):
    new_data = []
    for data_point in data:
        if data_point['account_key'] in paid_students:
            new_data.append(data_point)
    return new_data
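
One detail of `within_one_week` is worth checking on sample dates: `time_delta.days` is negative when the engagement date is earlier than the join date, so such a record also satisfies `< 7`. The sketch below repeats the function and compares it with a stricter variant. The name `within_first_week_strict` is hypothetical, and the variant assumes that engagement dated before the join date should not count.

```python
from datetime import datetime

def within_one_week(join_date, engagement_date):
    time_delta = engagement_date - join_date
    return time_delta.days < 7

def within_first_week_strict(join_date, engagement_date):
    # Hypothetical variant: also require the engagement not to precede the join date
    time_delta = engagement_date - join_date
    return 0 <= time_delta.days < 7

join = datetime(2015, 1, 10)
same_day = datetime(2015, 1, 10)
seven_days_later = datetime(2015, 1, 17)
before_join = datetime(2015, 1, 1)

for label, day in [('same day', same_day),
                   ('seven days later', seven_days_later),
                   ('before join', before_join)]:
    print(label, within_one_week(join, day), within_first_week_strict(join, day))
```

The two versions differ only for dates before the join date, which can occur when a student has enrolled more than once and only the latest join date is stored.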

In [54]:
paid_enrollments = remove_free_trial_cancels(non_udacity_enrollments)
paid_engagement = remove_free_trial_cancels(non_udacity_engagements)
paid_submissions = remove_free_trial_cancels(non_udacity_submissions)

print(len(paid_enrollments))
print(len(paid_engagement))
print(len(paid_submissions))

1293
134549
3618
