# Duolingo Word Views Data Pipeline
### Data Engineering Capstone Project

#### Project Summary
Duolingo is a company whose product is a language learning app. The app uses statistical techniques to optimize the speed and efficacy with which its users learn languages.

In this project, a data pipeline is created for Duolingo researchers to help them better understand their users' behavior within the app.

Some questions that Duolingo researchers may ask are:

*What are the most common language pairs?*

*Which language pair has the most activity?*

*Are certain language pairs correlated with time-of-day?*

*Which language pair has the best retention?*

*Which language UI has the highest word retention across all learning languages?*

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# all imports here
import duo_etl
import os
import configparser

### Step 1: Scope the Project and Gather Data

#### Scope 
Duolingo has made a data set available for public use, "learning_traces.csv", containing instances of users viewing words in the language they are learning. Each row contains the user's ID, the language they are learning, the word, and various other statistics relevant to the user's current session and history.

Duolingo has also made an accompanying data file available, "lexeme_reference.txt", which breaks down the linguistic attributes of the words used in "learning_traces.csv".

The final data file, "language-codes-full_json.json", contains the list of ISO language codes, including those present in the "learning_traces.csv" data set.

This data pipeline downloads the three datasets mentioned above from S3, loads each into its own Spark dataframe, cleans the data, and reorganizes it into a data model suited to the analysis. It then writes the modeled data to Parquet files on S3, which can easily be loaded into Redshift for analysts to query.


#### Describe and Gather Data 
*learning_traces.csv* can be downloaded from Duolingo at https://github.com/duolingo/halflife-regression

*lexeme_reference.txt* can be downloaded from Duolingo at https://github.com/duolingo/halflife-regression

*language-codes-full_json.json* can be downloaded from https://datahub.io/core/language-codes#resource-language-codes

In [2]:
# load AWS credentials
config = configparser.ConfigParser()
config.read('dl.cfg')
os.environ['AWS_ACCESS_KEY_ID']=config['AWS']['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY']=config['AWS']['AWS_SECRET_ACCESS_KEY']

# s3 bucket where datasets live
s3_path = 's3a://duo-data-pipeline/'
s3_bucket = 'duo-data-pipeline'

# create spark session
spark = duo_etl.create_spark_session()

# read learning traces into df
filename = 'learning_traces.csv'
lt_df = duo_etl.read_learning_traces(spark, s3_path, filename)

# read language reference table into df
filename = 'language-codes-full_json.json'
lang_df = duo_etl.read_lang_ref(spark, s3_path, filename)

# load txt file containing breakdown of lexeme codes
filename = 'lexeme_reference.txt'
lex_df = duo_etl.read_lex_ref(spark, s3_bucket, filename)
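
The credentials cell above expects `dl.cfg` to contain an `[AWS]` section with two keys. The file itself is not part of this notebook, so the following is only a minimal sketch of the assumed layout, parsed from an in-memory string with placeholder values. `SAMPLE_CFG` and `load_aws_credentials` are hypothetical names used for illustration.

```python
import configparser

# Hypothetical contents of dl.cfg; the values are placeholders, not real keys.
SAMPLE_CFG = """
[AWS]
AWS_ACCESS_KEY_ID = YOUR_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY = YOUR_SECRET_ACCESS_KEY
"""

def load_aws_credentials(cfg_text):
    """Parse config text and return the two AWS values as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    section = parser["AWS"]
    return {
        "AWS_ACCESS_KEY_ID": section["AWS_ACCESS_KEY_ID"],
        "AWS_SECRET_ACCESS_KEY": section["AWS_SECRET_ACCESS_KEY"],
    }

creds = load_aws_credentials(SAMPLE_CFG)
print(sorted(creds))
```

Values are written without quotes: `configparser` keeps any quote characters as part of the value.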

#### Print Dataset Sizes

In [3]:
duo_etl.show_size('learning traces', lt_df)
duo_etl.show_size('language reference', lang_df)
duo_etl.show_size('lexeme reference', lex_df)

learning traces dataset has 2000000 rows
language reference dataset has 487 rows
lexeme reference dataset has 22 rows


### Step 2: Explore and Assess the Data
#### Explore the Data 
An integral part of any data pipeline is performing data quality checks. In this case we are concerned with missing data and duplicate data.

#### Cleaning Steps
All of the raw data is checked for missing values, and any row containing a missing value is dropped entirely.

Duplicate data also presents a problem, so all duplicate rows are dropped from the tables.

In [4]:
# clean learning traces
lt_df = duo_etl.check_data(lt_df, 'learning traces data')

# clean language reference
lang_df = duo_etl.check_data(lang_df, 'language data', cols=['alpha2', 'English'])

# clean lexeme data
lex_df = duo_etl.check_data(lex_df, 'lexeme data')
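
The implementation of `duo_etl.check_data` is not shown in this notebook. As a rough illustration of the two cleaning rules described above, here is a plain-Python sketch on a list of dicts; `clean_rows` and the sample rows are hypothetical. On a Spark dataframe, the same effect could probably be achieved with `DataFrame.dropna(subset=cols)` followed by `DataFrame.dropDuplicates()`, though whether `duo_etl` does it that way is an assumption.

```python
def clean_rows(rows, cols=None):
    """Drop rows with missing values in `cols`, then drop exact duplicates.

    rows: list of dicts that share the same keys.
    cols: columns to check for missing values; None means every column.
    """
    cleaned, seen = [], set()
    for row in rows:
        check = cols if cols is not None else list(row)
        if any(row.get(c) in (None, "") for c in check):
            continue  # missing value: drop the whole row
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        cleaned.append(row)
    return cleaned

# Made-up sample rows in the shape of the language reference table.
raw = [
    {"alpha2": "en", "English": "English"},
    {"alpha2": "en", "English": "English"},     # duplicate
    {"alpha2": None, "English": "Abkhazian"},   # missing code
    {"alpha2": "pt", "English": "Portuguese"},
]
cleaned = clean_rows(raw, cols=["alpha2", "English"])
print(cleaned)
```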

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

![alt text](schema_diagram.png "schema")

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [5]:
# dimension table: users
dim_users = duo_etl.create_users_table(lt_df)

# dimension table: times
dim_times = duo_etl.create_times_table(lt_df)

# dimension table: languages
dim_langs = duo_etl.create_langs_table(spark, lt_df, lang_df)

# dimension table: words
dim_words = duo_etl.create_words_table(lt_df, lex_df)

# fact table: word views
fact_wordviews = duo_etl.create_wordviews_table(lt_df)

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [6]:
# check primary key: users table
pk = duo_etl.qc_check_pk_unique(dim_users, 'user_id')
if not pk:
    print('qc failed for users table')

# check primary key: langs table
pk = duo_etl.qc_check_pk_unique(dim_langs, 'alpha2_code')
if not pk:
    print('qc failed for langs table')

# check primary key: words table
pk = duo_etl.qc_check_pk_unique(dim_words, 'lexeme_id')
if not pk:
    print('qc failed for words table')

# check primary key: times table
pk = duo_etl.qc_check_pk_unique(dim_times, 'epoch')
if not pk:
    print('qc failed for times table')

# source/count check: the word views table should have as many rows as learning traces
count = duo_etl.qc_source_count(lt_df, fact_wordviews)
if not count:
    print('qc failed for wordviews table')
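
The cell above prints a message and carries on when a check fails, which is easy to miss in an unattended run. One possible alternative, sketched here in plain Python on lists of dicts, is to collect the results and raise if anything failed. `pk_is_unique`, `counts_match`, and `run_checks` are hypothetical helpers, not functions from `duo_etl`.

```python
from collections import Counter

def pk_is_unique(rows, key):
    """Return True when no value of `key` is missing or repeated."""
    counts = Counter(row[key] for row in rows)
    return None not in counts and all(n == 1 for n in counts.values())

def counts_match(source_rows, target_rows):
    """Source/count check: both tables hold the same number of rows."""
    return len(source_rows) == len(target_rows)

def run_checks(checks):
    """checks: list of (name, passed) pairs. Raise if any check failed."""
    failed = [name for name, passed in checks if not passed]
    if failed:
        raise ValueError("qc failed for: " + ", ".join(failed))

# Made-up sample tables.
users = [{"user_id": "u1"}, {"user_id": "u2"}]
traces = [{"user_id": "u1"}, {"user_id": "u1"}, {"user_id": "u2"}]
views = list(traces)

run_checks([
    ("users table", pk_is_unique(users, "user_id")),
    ("wordviews table", counts_match(traces, views)),
])
print("all checks passed")
```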


##### write to parquet files

In [7]:
# directory in S3 bucket to store parquet files
output_data = 'output_files/'

# write parquet files to S3
duo_etl.upload_parquet(s3_path, output_data, dim_times, 'dim_times.parquet')
duo_etl.upload_parquet(s3_path, output_data, dim_langs, 'dim_langs.parquet')
duo_etl.upload_parquet(s3_path, output_data, dim_users, 'dim_users.parquet')
duo_etl.upload_parquet(s3_path, output_data, dim_words, 'dim_words.parquet')
duo_etl.upload_parquet(s3_path, output_data, fact_wordviews, 'fact_wordviews.parquet')


#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### dim_langs
##### *alpha2_code*: the two-letter ISO 639-1 code identifying the language (from learning traces table)
##### *english_name*: the name of the language in English (from language reference table)

### dim_times
##### *timestamp*: timestamp of session (derived from epoch)
##### *epoch*: Unix epoch of session (from learning traces table)
##### *hour*: hour of session (derived from epoch)
##### *day*: day of session (derived from epoch)
##### *week*: week of session (derived from epoch)
##### *month*: month of session (derived from epoch)
##### *year*: year of session (derived from epoch)
##### *weekday*: day of week of session (derived from epoch)
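
As an illustration of how the derived `dim_times` fields relate to `epoch`, here is a small sketch using only the standard library. It assumes UTC, ISO 8601 week numbers, and weekdays numbered 1 (Monday) to 7 (Sunday); the conventions actually used in `duo_etl.create_times_table` are not shown in this notebook and may differ. `time_fields` is a hypothetical name.

```python
from datetime import datetime, timezone

def time_fields(epoch):
    """Derive the dim_times columns from a Unix epoch (seconds, UTC assumed)."""
    ts = datetime.fromtimestamp(epoch, tz=timezone.utc)
    return {
        "epoch": epoch,
        "timestamp": ts.isoformat(),
        "hour": ts.hour,
        "day": ts.day,
        "week": ts.isocalendar()[1],   # ISO 8601 week number
        "month": ts.month,
        "year": ts.year,
        "weekday": ts.isoweekday(),    # 1 = Monday ... 7 = Sunday
    }

row = time_fields(1362076081)
print(row)
```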

### dim_users
##### *user_id*: user ID (from learning traces table)
##### *number_of_sessions*: number of sessions user has logged (derived from learning traces table)

### dim_words
##### *lexeme_id*: lexeme ID (from learning traces table)
##### *language*: language of the word (from learning traces table)
##### *lemma*: lemma of word (derived from learning traces table)
##### *surface*: surface of word (derived from learning traces table)
##### *part_of_speech*: part of speech of word (derived from learning traces and lexeme reference)
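
The *lemma*, *surface*, and *part_of_speech* fields are derived by splitting a lexeme string from the learning traces. The sketch below assumes strings of the form `surface/lemma<pos><modifier>...`; both that format and the sample string are assumptions for illustration, and `parse_lexeme` is a hypothetical helper rather than the code in `duo_etl.create_words_table`. Looking the tag up in the lexeme reference table is left out.

```python
import re

# Assumed layout: surface form, a slash, the lemma, then zero or more <tag> groups.
LEXEME_RE = re.compile(
    r"^(?P<surface>[^/]*)/(?P<lemma>[^<]*)(?P<tags>(?:<[^>]*>)*)$"
)

def parse_lexeme(lexeme_string):
    """Split a lexeme string into surface, lemma, and tags.

    Returns None for strings that do not fit the assumed pattern.
    """
    match = LEXEME_RE.match(lexeme_string)
    if match is None:
        return None
    tags = re.findall(r"<([^>]*)>", match.group("tags"))
    return {
        "surface": match.group("surface"),
        "lemma": match.group("lemma"),
        "pos_tag": tags[0] if tags else None,  # first tag taken as part of speech
        "modifiers": tags[1:],
    }

parsed = parse_lexeme("lernt/lernen<vblex><pri><p3><sg>")
print(parsed)
```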

### fact_wordviews
##### *timestamp*: timestamp of session
##### *user_id*: user ID
##### *delta*: time since word last seen
##### *learning_language*: language that user is learning
##### *ui_language*: language that user is using
##### *lexeme_id*: word ID
##### *session_pct*: percent that user has gotten the word correct in current session
##### *history_pct*: percent that user has gotten the word correct in all previous sessions

### Sample Queries

In [8]:
# view language pairs available
lang_pairs = duo_etl.languages_available(fact_wordviews, dim_langs)
lang_pairs.show()

+-----------------+-----------+------------------+------------------+
|learning_language|ui_language| Learning Language|       UI Language|
+-----------------+-----------+------------------+------------------+
|               pt|         en|        Portuguese|           English|
|               de|         en|            German|           English|
|               es|         en|Spanish; Castilian|           English|
|               it|         en|           Italian|           English|
|               fr|         en|            French|           English|
|               en|         pt|           English|        Portuguese|
|               en|         es|           English|Spanish; Castilian|
|               en|         it|           English|           Italian|
+-----------------+-----------+------------------+------------------+



In [9]:
# number of users per language pair
pair_users = duo_etl.num_users_pair(fact_wordviews)
pair_users.show()

+-----------------+-----------+---------------------------+
|learning_language|ui_language|number of learners for pair|
+-----------------+-----------+---------------------------+
|               de|         en|                       4275|
|               es|         en|                       9078|
|               pt|         en|                        799|
|               en|         pt|                       2353|
|               en|         es|                       8851|
|               en|         it|                        982|
|               fr|         en|                       5628|
|               it|         en|                       2046|
+-----------------+-----------+---------------------------+
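
The logic behind a query like this can be sketched in plain Python. The version below assumes that "number of learners" means distinct `user_id` values per (learning language, UI language) pair; `users_per_pair` and the sample rows are hypothetical. In Spark, a `groupBy` on the two language columns with `countDistinct('user_id')` would be one way to express the same thing.

```python
from collections import defaultdict

def users_per_pair(views):
    """Count distinct users for each (learning_language, ui_language) pair."""
    users = defaultdict(set)
    for view in views:
        pair = (view["learning_language"], view["ui_language"])
        users[pair].add(view["user_id"])
    return {pair: len(ids) for pair, ids in users.items()}

# Made-up word view rows; u1 appears twice for the same pair.
sample_views = [
    {"user_id": "u1", "learning_language": "de", "ui_language": "en"},
    {"user_id": "u1", "learning_language": "de", "ui_language": "en"},
    {"user_id": "u2", "learning_language": "de", "ui_language": "en"},
    {"user_id": "u3", "learning_language": "en", "ui_language": "pt"},
]
pair_counts = users_per_pair(sample_views)
print(pair_counts)
```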



In [10]:
# number of words shown per language pair
pair_views = duo_etl.num_views_pair(fact_wordviews)
pair_views.show()

+-----------------+-----------+-----------------------------+
|learning_language|ui_language|number of words seen for pair|
+-----------------+-----------+-----------------------------+
|               de|         en|                       226055|
|               es|         en|                       527346|
|               pt|         en|                        46085|
|               en|         pt|                       144067|
|               en|         es|                       579262|
|               en|         it|                        59757|
|               fr|         en|                       299490|
|               it|         en|                       117938|
+-----------------+-----------+-----------------------------+



### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.

This project uses S3 to store the datasets because S3 is a cost-effective way to store data in the cloud. Spark is used because its Python API makes it a good tool for wrangling data, and because it scales horizontally, so it can handle the load as the dataset grows.

A star schema was used to model the data because each field in the fact_wordviews fact table can be further described in a corresponding dimension table. In this use case, the advantage of a star schema over other schemas is that it keeps the fact table as minimally descriptive as possible, with more granular information just a JOIN away in a dimension table.
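
To make "just a JOIN away" concrete, here is a small self-contained sketch using an in-memory SQLite database in place of Redshift. The table and column names follow the data dictionary above, but the rows are made up and only a subset of the fact table columns is included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_langs (alpha2_code TEXT PRIMARY KEY, english_name TEXT);
CREATE TABLE fact_wordviews (
    user_id TEXT, learning_language TEXT, ui_language TEXT,
    lexeme_id TEXT, session_pct REAL
);
-- Made-up sample rows.
INSERT INTO dim_langs VALUES ('en', 'English'), ('de', 'German');
INSERT INTO fact_wordviews VALUES
    ('u1', 'de', 'en', 'lex1', 1.0),
    ('u2', 'de', 'en', 'lex2', 0.5);
""")

# The fact table stores only language codes; the names come from dim_langs,
# joined once for the learning language and once for the UI language.
rows = conn.execute("""
    SELECT learn.english_name, ui.english_name, AVG(f.session_pct)
    FROM fact_wordviews AS f
    JOIN dim_langs AS learn ON f.learning_language = learn.alpha2_code
    JOIN dim_langs AS ui ON f.ui_language = ui.alpha2_code
    GROUP BY learn.english_name, ui.english_name
""").fetchall()
print(rows)
conn.close()
```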

The data should be updated whenever a new learning_traces.csv and its accompanying lexeme_reference.txt are released, because learning_traces.csv contains the events and lexeme_reference.txt describes the words in those events. We do not expect language-codes-full_json.json to be updated, as it is a fixed reference table.

If the data was increased by 100x, a larger Spark cluster would be needed, most likely a managed one such as Amazon EMR or Databricks.

If the data populated a dashboard that must be updated daily by 7am, Airflow would be used to schedule the pipeline: load the updated learning_traces.csv file, perform the data modeling, and upload the modeled data to S3 early enough for the dashboard to fetch the newly updated data before the deadline.

If the database needed to be accessed by 100+ people, a Redshift cluster would be made available to everyone who needs access to the data. Redshift handles many concurrent queries and provides ACID-compliant transactions, so all users would see consistent data.