# Extract, Transform, Load

**Goal**: Use exploratory data analysis to design the experiment, then transform the data and load it back into the database.

Specific:

- Extract applicant information regarding admissions quiz completion.
- Design a research question, null hypothesis and alternative hypothesis for the experiment.
- Create functions for transforming applicant documents and loading them to a database.
- Build a Python class to streamline the experiment.

In [7]:
import random

import pandas as pd
from pymongo import MongoClient

### Extract

Aggregate Clients by Quiz Completion

In [9]:
complete = 3717
incomplete = 1308

In [10]:
total = complete + incomplete
prop_incomplete = incomplete/total
print(
    "Proportion of users who don't complete admissions quiz:", round(prop_incomplete, 2)
)

Proportion of users who don't complete admissions quiz: 0.26
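The counts above are entered by hand. As a minimal sketch of how they might be obtained from the database, the snippet below groups documents by the `admissionsQuiz` field with a `$group` aggregation stage. The helper name `count_by_quiz_status`, the `FakeCollection` stand-in, and the status value `"complete"` are assumptions made for illustration; with a real PyMongo collection, the same pipeline would be passed to its `aggregate` method.

```python
def count_by_quiz_status(collection):
    """Return {status: count} for the `admissionsQuiz` field (hypothetical helper)."""
    pipeline = [
        # Group documents by quiz status and count one per document
        {"$group": {"_id": "$admissionsQuiz", "count": {"$sum": 1}}}
    ]
    return {doc["_id"]: doc["count"] for doc in collection.aggregate(pipeline)}


class FakeCollection:
    """Offline stand-in for a PyMongo collection (assumption, not a real API)."""

    def aggregate(self, pipeline):
        # Returns the counts reported in this notebook
        return iter(
            [
                {"_id": "complete", "count": 3717},
                {"_id": "incomplete", "count": 1308},
            ]
        )


counts = count_by_quiz_status(FakeCollection())
prop_incomplete = counts["incomplete"] / sum(counts.values())
print(round(prop_incomplete, 2))  # 0.26
```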


Developing a Research Question

RQ: Does sending an email to no-quiz applicants increase their likelihood of completing the admissions quiz?

In [11]:
null_hypothesis = "No significant difference in quiz completion between the two groups"

alternate_hypothesis = "A significant difference in quiz completion between the two groups"

print("Null Hypothesis:", null_hypothesis)
print("Alternate Hypothesis:", alternate_hypothesis)

Null Hypothesis: No significant difference in quiz completion between the two groups
Alternate Hypothesis: A significant difference in quiz completion between the two groups
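One way to state these hypotheses formally is sketched below. The symbols are assumptions introduced here: $p_C$ is the quiz completion proportion in the control group and $p_T$ is the proportion in the treatment group. The alternative is written as two-sided to match the wording above.

$$
H_0: p_T = p_C \qquad \text{versus} \qquad H_A: p_T \neq p_C
$$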


Find_by_date Function

In [12]:
def find_by_date(collection, date_string):
    """Find records in a PyMongo Collection created on a given date.

    Parameters
    ----------
    collection : pymongo.collection.Collection
        Collection in which to search for documents.
    date_string : str
        Date to query. Format must be '%Y-%m-%d', e.g. '2022-06-28'.

    Returns
    -------
    observations : list
        Result of query. List of documents (dictionaries).
    """
    # Convert `date_string` to datetime object
    start = pd.to_datetime(date_string, format="%Y-%m-%d")
    # Offset `start` by 1 day
    end = start + pd.DateOffset(days=1)
    # Create PyMongo query for no-quiz applicants b/t `start` and `end`
    query = {"createdAt": {"$gte": start, "$lt": end}, "admissionsQuiz": "incomplete"}
    # Query collection, get result
    result = collection.find(query)
    # Convert `result` to list
    observations = list(result)
    return observations
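To make the date window easier to inspect without a database connection, here is a minimal sketch that builds only the query dictionary. It swaps in the standard library's `datetime` and `timedelta` for the pandas calls used above, and the helper name `build_date_query` is hypothetical.

```python
from datetime import datetime, timedelta


def build_date_query(date_string):
    """Build a query for no-quiz applicants created on `date_string` (hypothetical helper)."""
    # Start of the requested day
    start = datetime.strptime(date_string, "%Y-%m-%d")
    # Start of the following day; the window is half-open: [start, end)
    end = start + timedelta(days=1)
    return {
        "createdAt": {"$gte": start, "$lt": end},
        "admissionsQuiz": "incomplete",
    }


query = build_date_query("2022-05-04")
print(query["createdAt"]["$gte"], query["createdAt"]["$lt"])
```

Using `$gte` on the start and `$lt` on the end keeps a document created exactly at midnight from being counted in two consecutive days.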

### Transform: Designing the Experiment

- This step assigns the extracted applicants to a control group and a treatment group.

Assign_to_groups Function

In [14]:
def assign_to_groups(observations):
    """Randomly assigns observations to control and treatment groups.

    Parameters
    ----------
    observations : list or pymongo.cursor.Cursor
        List of users to assign to groups.

    Returns
    -------
    observations : list
        List of documents from `observations` with two additional keys:
        `inExperiment` and `group`.
    """

    # Convert to a list (a Cursor cannot be shuffled), then shuffle reproducibly
    observations = list(observations)
    random.seed(42)
    random.shuffle(observations)
    
    # Get index position of item at observations halfway point
    idx = len(observations) // 2

    # Assign first half of observations to control group
    for doc in observations[:idx]:
        doc["inExperiment"] = True
        doc["group"] = "no email (control)"

    # Assign second half of observations to treatment group
    for doc in observations[idx:]:
        doc["inExperiment"] = True
        doc["group"] = "email (treatment)"

    return observations
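A quick check of the split on made-up data might look like the sketch below. It repeats a condensed copy of the function so the snippet runs on its own, using the `inExperiment` key named in the docstring. The five applicant documents are hypothetical.

```python
import random
from collections import Counter


def assign_to_groups(observations):
    """Condensed copy of the function above, so this sketch is self-contained."""
    observations = list(observations)
    random.seed(42)
    random.shuffle(observations)
    idx = len(observations) // 2
    for doc in observations[:idx]:
        doc["inExperiment"] = True
        doc["group"] = "no email (control)"
    for doc in observations[idx:]:
        doc["inExperiment"] = True
        doc["group"] = "email (treatment)"
    return observations


# Hypothetical applicant documents, for illustration only
applicants = [{"_id": i, "admissionsQuiz": "incomplete"} for i in range(5)]
assigned = assign_to_groups(applicants)

group_counts = Counter(doc["group"] for doc in assigned)
print(group_counts["no email (control)"], group_counts["email (treatment)"])  # 2 3
```

With an odd number of observations, the floor division in `idx` places the extra document in the treatment group.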

Export_treatment_emails Function