In [1]:
# all imports here
import duo_etl
import os
import configparser

In [2]:
# create spark session
spark = duo_etl.create_spark_session(mode='local')
#path = '/home/e/Projects/duo-pipeline/data_files/'
path = '../data_files/'

22/04/11 13:38:31 WARN Utils: Your hostname, MacBook-Air.local resolves to a loopback address: 127.0.0.1; using 172.20.10.2 instead (on interface en0)
22/04/11 13:38:31 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/04/11 13:38:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
# read learning traces into df
filename = 'learning_traces.csv'
lt_df = duo_etl.read_learning_traces(spark, path, filename)

# read language reference table into df
filename = 'language-codes-full_json.json'
lang_df = duo_etl.read_lang_ref(spark, path, filename)

                                                                                

In [4]:
# load txt file containing breakdown of lexeme codes
filename = 'lexeme_reference.txt'
lex_df = duo_etl.read_lex_ref_local(spark, path, filename)

#### Print Dataset Sizes

In [5]:
duo_etl.show_size('learning traces', lt_df)



learning traces dataset has 12854226 rows


                                                                                

In [6]:
duo_etl.show_size('lexeme reference', lex_df)


lexeme reference dataset has 22 rows


                                                                                

In [7]:
duo_etl.show_size('language reference', lang_df)

language reference dataset has 487 rows


In [8]:
lt_df.columns

['p_recall',
 'timestamp',
 'delta',
 'user_id',
 'learning_language',
 'ui_language',
 'lexeme_id',
 'lexeme_string',
 'history_seen',
 'history_correct',
 'session_seen',
 'session_correct']

### Step 2: Explore and Assess the Data
#### Explore the Data 
An integral part of data pipelines is performing data quality checks. In this case we are concerned with missing data and duplicate data.

#### Cleaning Steps
Each raw table is checked for missing values, and any row with a missing value is dropped entirely.

Duplicate data also presents a problem, so all duplicate rows are dropped from the tables.
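
The two cleaning rules above can be illustrated on a small in-memory sample. This is only a sketch in plain Python, not the implementation of `duo_etl.check_data`, whose internals are not shown in this notebook. The function `clean_rows`, its optional `cols` argument, and the sample rows are all hypothetical. On a Spark DataFrame the same effect is commonly obtained with `DataFrame.dropna()` followed by `DataFrame.dropDuplicates()`.

```python
def clean_rows(rows, cols=None):
    """Drop rows with a missing value in `cols` (all columns if None),
    then drop exact duplicate rows, keeping the first occurrence."""
    cleaned, seen = [], set()
    for row in rows:
        check = cols if cols is not None else row.keys()
        # treat None and empty strings as missing; 0 is a valid value
        if any(row.get(c) is None or row.get(c) == '' for c in check):
            continue
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


# hypothetical sample rows, not taken from learning_traces.csv
sample_rows = [
    {'user_id': 'u1', 'lexeme_id': 'a1', 'delta': 100},
    {'user_id': 'u1', 'lexeme_id': 'a1', 'delta': 100},  # exact duplicate
    {'user_id': 'u2', 'lexeme_id': None, 'delta': 50},   # missing value
    {'user_id': 'u3', 'lexeme_id': 'b2', 'delta': 0},
]
cleaned_rows = clean_rows(sample_rows)
print(len(sample_rows) - len(cleaned_rows), 'rows dropped')
```

Restricting the missing-value check to a few columns, as the `cols` argument does here, is useful when a reference table has optional fields that are allowed to be empty.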

In [None]:
# clean learning traces
lt_df = duo_etl.check_data(lt_df, 'learning traces data')

# clean language reference
lang_df = duo_etl.check_data(lang_df, 'language data', cols=['alpha2', 'English'])

# clean lexeme data
lex_df = duo_etl.check_data(lex_df, 'lexeme data')

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

![Star schema diagram](schema_diagram.png "schema")

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# dimension table: users
dim_users = duo_etl.create_users_table(lt_df)

# dimension table: times
dim_times = duo_etl.create_times_table(lt_df)

# dimension table: languages
dim_langs = duo_etl.create_langs_table(spark, lt_df, lang_df)

# dimension table: words
dim_words = duo_etl.create_words_table(lt_df, lex_df)

# fact table: word views
fact_wordviews = duo_etl.create_wordviews_table(lt_df)

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
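
As a rough illustration of the first and third kinds of check, the sketch below tests primary-key uniqueness and compares source and target row counts on plain Python lists. The helpers `pk_is_unique` and `counts_match` are hypothetical stand-ins and are not the `duo_etl` functions used in this notebook. The sketch also assumes that the fact table is expected to have exactly as many rows as its source.

```python
def pk_is_unique(rows, key):
    """True if every row has a non-missing value for `key` and no value repeats."""
    values = [row.get(key) for row in rows]
    if any(v is None for v in values):
        return False
    return len(set(values)) == len(values)


def counts_match(source_rows, target_rows):
    """True if source and target contain the same number of rows."""
    return len(source_rows) == len(target_rows)


# hypothetical sample data
good_users = [{'user_id': 'u1'}, {'user_id': 'u2'}]
bad_users = good_users + [{'user_id': 'u1'}]  # repeated key

print(pk_is_unique(good_users, 'user_id'))
print(pk_is_unique(bad_users, 'user_id'))
print(counts_match(good_users, bad_users))
```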
 
Run Quality Checks

In [None]:
# check pkey users table
pk = duo_etl.qc_check_pk_unique(dim_users, 'user_id')
if not pk:
    print('qc failed for users table')

# check pkey langs table
pk = duo_etl.qc_check_pk_unique(dim_langs, 'alpha2_code')
if not pk:
    print('qc failed for langs table')

# check pkey words table
pk = duo_etl.qc_check_pk_unique(dim_words, 'lexeme_id')
if not pk:
    print('qc failed for words table')

# check pkey times table
pk = duo_etl.qc_check_pk_unique(dim_times, 'epoch')
if not pk:
    print('qc failed for times table')

# compare the number of rows in learning traces with the size of the word views table
count = duo_etl.qc_source_count(lt_df, fact_wordviews)
if not count:
    print('qc failed for wordviews table')


##### Write to Parquet Files

In [None]:
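# NOTE: s3_path (the S3 destination) is not set in any cell shown above;
# it must be defined before this cell is run

# directory in S3 bucket to store parquet files
output_data = 'output_files/'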

# write parquet files to S3
duo_etl.upload_parquet(s3_path, output_data, dim_times, 'dim_times.parquet')
duo_etl.upload_parquet(s3_path, output_data, dim_langs, 'dim_langs.parquet')
duo_etl.upload_parquet(s3_path, output_data, dim_users, 'dim_users.parquet')
duo_etl.upload_parquet(s3_path, output_data, dim_words, 'dim_words.parquet')
duo_etl.upload_parquet(s3_path, output_data, fact_wordviews, 'fact_wordviews.parquet')


#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### dim_langs
##### *alpha2_code*: the two-letter ISO 639-1 code that identifies the language (from learning traces table)
##### *english_name*: the name of the language in English (from language reference table)

### dim_times
##### *timestamp*: timestamp of session (derived from epoch)
##### *epoch*: Unix epoch of session (from learning traces table)
##### *hour*: hour of session (derived from epoch)
##### *day*: day of session (derived from epoch)
##### *week*: week of session (derived from epoch)
##### *month*: month of session (derived from epoch)
##### *year*: year of session (derived from epoch)
##### *weekday*: day of week of session (derived from epoch)
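
The derived fields above can be computed from the epoch alone. The sketch below uses only the standard library and rests on several assumptions that the notebook does not confirm: the epoch is in seconds, times are interpreted as UTC, *week* is the ISO week number, and *weekday* runs from 1 (Monday) to 7 (Sunday). `time_parts` is a hypothetical helper, and the epoch value is an arbitrary example.

```python
from datetime import datetime, timezone


def time_parts(epoch):
    """Break a Unix epoch (seconds) into the fields of a dim_times-style row."""
    ts = datetime.fromtimestamp(epoch, tz=timezone.utc)  # assumption: UTC
    return {
        'timestamp': ts.isoformat(),
        'epoch': epoch,
        'hour': ts.hour,
        'day': ts.day,
        'week': ts.isocalendar()[1],  # assumption: ISO week number
        'month': ts.month,
        'year': ts.year,
        'weekday': ts.isoweekday(),   # assumption: 1 = Monday ... 7 = Sunday
    }


example_row = time_parts(1362076081)  # arbitrary example epoch
print(example_row)
```

Spark's own date functions use different conventions in places (for example, its `dayofweek` starts the week on Sunday), so the convention chosen should be recorded alongside the field description.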

### dim_users
##### *user_id*: user ID (from learning traces table)
##### *number_of_sessions*: number of sessions user has logged (derived from learning traces table)

### dim_words
##### *lexeme_id*: lexeme ID (from learning traces table)
##### *language*: language of the word (from learning traces table)
##### *lemma*: lemma of word (derived from learning traces table)
##### *surface*: surface of word (derived from learning traces table)
##### *part_of_speech*: part of speech of word (derived from learning traces and lexeme reference)
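
To show how *surface*, *lemma*, and *part_of_speech* might be derived, the sketch below parses a single `lexeme_string`. It assumes the string follows a `surface/lemma<pos><tag>...` layout, which this notebook does not document, and it does not handle any other variant. `parse_lexeme` is a hypothetical helper, and `POS_NAMES` is a made-up subset of tag descriptions, so the real contents of lexeme_reference.txt may differ.

```python
import re

# hypothetical subset of tag descriptions; the real reference file may differ
POS_NAMES = {'vblex': 'lexical verb', 'n': 'noun', 'adj': 'adjective'}


def parse_lexeme(lexeme_string):
    """Split 'surface/lemma<pos><tag>...' into surface, lemma, and part of speech."""
    head, _, tag_part = lexeme_string.partition('<')
    surface, _, lemma = head.partition('/')
    tags = re.findall(r'<([^>]+)>', '<' + tag_part) if tag_part else []
    pos_tag = tags[0] if tags else None  # assumption: the first tag is the part of speech
    return {
        'surface': surface,
        'lemma': lemma or surface,  # fall back to the surface form if no lemma is given
        'part_of_speech': POS_NAMES.get(pos_tag, pos_tag),
    }


parsed = parse_lexeme('lernt/lernen<vblex><pri><p3><sg>')  # illustrative string
print(parsed)
```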

### fact_wordviews
##### *timestamp*: timestamp of session
##### *user_id*: user ID
##### *delta*: time since word last seen
##### *learning_language*: language that user is learning
##### *ui_language*: language of the user interface that the user is using
##### *lexeme_id*: word ID
##### *session_pct*: percentage of times the user got the word correct in the current session
##### *history_pct*: percentage of times the user got the word correct in all previous sessions
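
The two percentage fields are presumably computed from the `session_correct`/`session_seen` and `history_correct`/`history_seen` columns of the learning traces table, although the notebook does not show the calculation. The sketch below is one plausible version. The helper name `pct_correct`, the 0 to 100 scale, and the handling of a zero denominator are all assumptions.

```python
def pct_correct(correct, seen):
    """Percentage of correct answers, or None if the word was never seen."""
    if not seen:
        return None  # avoid division by zero
    return 100.0 * correct / seen


print(pct_correct(3, 4))  # 75.0
print(pct_correct(0, 0))  # None
```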

### Sample Queries

In [None]:
# view language pairs available
lang_pairs = duo_etl.languages_available(fact_wordviews, dim_langs)
lang_pairs.show()

In [None]:
# number of users per language pair (for analysts)
pair_users = duo_etl.num_users_pair(fact_wordviews)
pair_users.show()

In [None]:
# number of words shown per language pair (for analysts)
pair_views = duo_etl.num_views_pair(fact_wordviews)
pair_views.show()
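
The logic behind the last two queries can be sketched in plain Python. This is not how `duo_etl.num_users_pair` or `duo_etl.num_views_pair` are implemented, since their internals are not shown here. The helper names and sample rows are hypothetical, and a language pair is assumed to mean the combination of `learning_language` and `ui_language`.

```python
from collections import defaultdict


def users_per_pair(wordviews):
    """Count distinct users for each (learning_language, ui_language) pair."""
    users = defaultdict(set)
    for row in wordviews:
        users[(row['learning_language'], row['ui_language'])].add(row['user_id'])
    return {pair: len(ids) for pair, ids in users.items()}


def views_per_pair(wordviews):
    """Count word views (rows) for each (learning_language, ui_language) pair."""
    views = defaultdict(int)
    for row in wordviews:
        views[(row['learning_language'], row['ui_language'])] += 1
    return dict(views)


# hypothetical sample rows
sample_views = [
    {'user_id': 'u1', 'learning_language': 'de', 'ui_language': 'en'},
    {'user_id': 'u1', 'learning_language': 'de', 'ui_language': 'en'},
    {'user_id': 'u2', 'learning_language': 'de', 'ui_language': 'en'},
    {'user_id': 'u3', 'learning_language': 'en', 'ui_language': 'es'},
]
print(users_per_pair(sample_views))  # {('de', 'en'): 2, ('en', 'es'): 1}
print(views_per_pair(sample_views))  # {('de', 'en'): 3, ('en', 'es'): 1}
```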

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
    * The data was increased by 100x.
    * The data populates a dashboard that must be updated on a daily basis by 7am every day.
    * The database needed to be accessed by 100+ people.

This project uses S3 to store the datasets because S3 is a cost-effective way to store data in the cloud. Spark is used because its Python API makes it a good tool for wrangling data, and because it scales horizontally, so it can handle the load as the dataset grows.

A star schema was used to model the data because each key field in the fact_wordviews fact table (user, time, language, and word) can be further described in a corresponding dimension table. For this use case, the advantage of a star schema over other schemas is that the fact table stays as lean as possible, while more detailed information is just a JOIN away in a dimension table.
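
To make the "just a JOIN away" point concrete, the sketch below enriches lean fact rows with attributes looked up in a dimension table, using plain Python dictionaries. All names and values are hypothetical sample data. On Spark DataFrames the equivalent would be a call such as `fact_wordviews.join(dim_words, on='lexeme_id', how='left')`.

```python
# hypothetical dimension table, keyed by its primary key
sample_dim_words = {
    'w1': {'lemma': 'lernen', 'part_of_speech': 'verb'},
    'w2': {'lemma': 'Haus', 'part_of_speech': 'noun'},
}

# hypothetical fact rows: only keys and measures, no descriptive fields
sample_facts = [
    {'user_id': 'u1', 'lexeme_id': 'w1', 'session_pct': 100.0},
    {'user_id': 'u2', 'lexeme_id': 'w2', 'session_pct': 50.0},
]


def join_words(facts, words):
    """Left join: add word attributes to each fact row via lexeme_id."""
    return [{**fact, **words.get(fact['lexeme_id'], {})} for fact in facts]


enriched_facts = join_words(sample_facts, sample_dim_words)
for row in enriched_facts:
    print(row)
```

The fact rows carry no descriptive text of their own, so a change to a word's description only has to be made once, in the dimension table.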

The data should be updated whenever a new learning_traces.csv and its accompanying lexeme_reference.txt are released. The learning_traces.csv dataset contains the events, and lexeme_reference.txt describes the lexeme codes of the words in those events. We do not expect the language-codes-full_json.json dataset to change, as it is a fixed reference table.

If the data was increased by 100x, a larger Spark cluster would be needed, most likely a managed one such as Amazon EMR or Databricks.

If the data is used to populate a dashboard that must be updated daily by 7am, Airflow would be used to schedule the pipeline early enough to finish before that deadline: load the updated learning_traces.csv file, perform the data modeling, and upload the modeled data to S3 so that the dashboard can fetch the newly updated data.

If the database needed to be accessed by 100+ people, a Redshift cluster would be made available to the people who need access to the data. This would ensure ACID compliance among all users of the data.