# Spaced Repetition Model for Language Learning

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/N8XJME

## Language Codes:
### ISO Language Codes (639-1 and 639-2) and IETF Language Types
https://datahub.io/core/language-codes#resource-language-codes


Data used to develop Duolingo's half-life regression (HLR) spaced repetition algorithm. It is a collection of 13 million user-word pairs for learners of several languages with a variety of language backgrounds. It includes practice recall rates, lag times between practices, and other morpho-lexical metadata.

GitHub:
https://github.com/duolingo/halflife-regression
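
For orientation, the HLR model predicts the probability of recalling a word as p = 2^(-Δ/h), where Δ is the lag time since the word was last practiced and h is the estimated half-life of the word in the learner's memory, both in the same time unit. A minimal sketch of that relationship follows; the function name and the sample values are illustrative assumptions, not code from the repository.

```python
def predicted_recall(delta: float, half_life: float) -> float:
    """Recall probability after `delta` time units for a given half-life.

    Both arguments must use the same unit (for example, days).
    """
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    return 2.0 ** (-delta / half_life)


# When the lag equals the half-life, predicted recall is exactly one half.
at_half_life = predicted_recall(delta=3.0, half_life=3.0)
# Immediately after practice (no lag) predicted recall is 1.
right_after = predicted_recall(delta=0.0, half_life=3.0)
```

A longer half-life flattens the forgetting curve, so the same lag yields a higher predicted recall.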

## Some references

- https://acsweb.ucsd.edu/~btomosch/duodata.html

- https://blog.duolingo.com/how-we-learn-how-you-learn/

In [None]:
import pandas as pd
import numpy as np
import time

In [None]:
pd.set_option('display.max_rows', 500)

In [2]:
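# With chunksize set, read_csv returns a lazy TextFileReader, so this timing
# covers only opening the file, not parsing its rows
s_time_chunk = time.time()
filename = 'learning_traces.csv'
chunk = pd.read_csv(filename, chunksize=1000)
e_time_chunk = time.time()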

In [3]:
chunk

<pandas.io.parsers.TextFileReader at 0x7fb358ac6f40>
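
Because `chunksize` makes `read_csv` return a lazy reader instead of a DataFrame, the file has to be consumed by iterating over it. The sketch below streams a file chunk by chunk and keeps only running totals in memory. The in-memory sample CSV and the variable names are assumptions for illustration; with the real data the reader would be opened on `learning_traces.csv` instead.

```python
import io

import pandas as pd

# Tiny stand-in for learning_traces.csv (assumed sample, not the real file)
sample_csv = io.StringIO(
    "p_recall,learning_language,ui_language\n"
    "1.0,de,en\n"
    "0.5,de,en\n"
    "1.0,es,en\n"
    "0.75,de,en\n"
)

total_rows = 0
recall_sum = 0.0

# Each iteration yields a DataFrame with at most `chunksize` rows
for part in pd.read_csv(sample_csv, chunksize=2):
    total_rows += len(part)
    recall_sum += float(part["p_recall"].sum())

mean_recall = recall_sum / total_rows
```

Only one chunk is held in memory at a time, which is the point of this approach for a file with millions of rows.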

In [6]:
# Read only the first 100 rows; get_chunk() already returns a DataFrame
csv_reader = pd.read_csv(filename, iterator=True, chunksize=100)
chunk = csv_reader.get_chunk()

In [14]:
chunk

,p_recall,timestamp,delta,user_id,learning_language,ui_language,lexeme_id,lexeme_string,history_seen,history_correct,session_seen,session_correct
0,1.0,1362076081,27649635,u:FO,de,en,76390c1350a8dac31186187e2fe1e178,lernt/lernen<vblex><pri><p3><sg>,6,4,2,2
1,0.5,1362076081,27649635,u:FO,de,en,7dfd7086f3671685e2cf1c1da72796d7,die/die<det><def><f><sg><nom>,4,4,2,1
2,1.0,1362076081,27649635,u:FO,de,en,35a54c25a2cda8127343f6a82e6f6b7d,mann/mann<n><m><sg><nom>,5,4,1,1
3,0.5,1362076081,27649635,u:FO,de,en,0cf63ffe3dda158bc3dbd55682b355ae,frau/frau<n><f><sg><nom>,6,5,2,1
4,1.0,1362076081,27649635,u:FO,de,en,84920990d78044db53c1b012f5bf9ab5,das/das<det><def><nt><sg><nom>,4,4,1,1
5,1.0,1362076081,27649635,u:FO,de,en,56429751fdaedb6e491f4795c770f5a4,der/der<det><def><m><sg><nom>,4,3,1,1
6,1.0,1362076081,27649635,u:FO,de,en,1bacf218eaaf9f944e525f7be9b31899,kind/kind<n><nt><sg><nom>,4,4,1,1
7,1.0,1362082032,444407,u:dDwF,es,en,73eecb492ca758ddab5371cf7b5cca32,bajo/bajo<pr>,3,3,1,1
8,1.0,1362082044,5963,u:FO,de,en,76390c1350a8dac31186187e2fe1e178,lernt/lernen<vblex><pri><p3><sg>,8,6,6,6
9,0.75,1362082044,5963,u:FO,de,en,7dfd7086f3671685e2cf1c1da72796d7,die/die<det><def><f><sg><nom>,6,5,4,3


In [16]:
chunk.columns

Index(['p_recall', 'timestamp', 'delta', 'user_id', 'learning_language',
       'ui_language', 'lexeme_id', 'lexeme_string', 'history_seen',
       'history_correct', 'session_seen', 'session_correct'],
      dtype='object')

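- 'p_recall': proportion of exercises in this lesson/practice where the word/lexeme was correctly recalled

- 'timestamp': UNIX timestamp of the current lesson/practice

- 'delta': seconds since the last lesson/practice that included this word/lexeme

- 'user_id': anonymized ID of the student who did the lesson/practice

- 'learning_language': language being learned

- 'ui_language': user interface language (presumably native to the student)

- 'lexeme_id': system ID for the lexeme tag (i.e. the word)

- 'lexeme_string': lexeme tag (surface form/lemma followed by morphological tags)

- 'history_seen': total times the user has seen the word/lexeme prior to this lesson/practice

- 'history_correct': total times the user has been correct for the word/lexeme prior to this lesson/practice

- 'session_seen': times the user saw the word/lexeme during this lesson/practice

- 'session_correct': times the user got the word/lexeme correct during this lesson/practice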
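
The column meanings can be sanity-checked against the sample rows shown earlier: `p_recall` should equal `session_correct / session_seen`, `timestamp` should parse as seconds since the UNIX epoch, and `delta` can be converted to days. This is a small sketch using three of those rows; the derived column names `lesson_time` and `delta_days` are assumptions introduced here.

```python
import pandas as pd

# Three rows copied from the first chunk displayed above
rows = pd.DataFrame({
    "p_recall": [1.0, 0.5, 0.75],
    "timestamp": [1362076081, 1362076081, 1362082044],
    "delta": [27649635, 27649635, 5963],
    "session_seen": [2, 2, 4],
    "session_correct": [2, 1, 3],
})

# p_recall is the fraction answered correctly within the session
recomputed = rows["session_correct"] / rows["session_seen"]
matches = bool((recomputed - rows["p_recall"]).abs().lt(1e-9).all())

# Interpret the timestamp as seconds since the epoch (UTC)
rows["lesson_time"] = pd.to_datetime(rows["timestamp"], unit="s", utc=True)

# delta is in seconds; 86400 seconds per day
rows["delta_days"] = rows["delta"] / 86400
```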

## Fact Table

word views: timestamp (date format), user_id, learning language name, ui language name, word name, time since last seen

## Dimension Tables Redesigned

user table: user_id, number of sessions

time table: timestamp, year, month, day, dayofweek, hour, minute, second

language pairs table: ui language name, learning language name

word table: lexeme id, surface form (word name), language code
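
As a rough illustration of how two of these dimension tables could be derived with pandas, the sketch below builds the time table from the UNIX timestamp and the word table from the lexeme columns. It assumes the surface form is the part of `lexeme_string` before the first `/`, and that the word's language code is the `learning_language` value; the output column names are placeholders.

```python
import pandas as pd

# Two rows copied from the first chunk displayed above
traces = pd.DataFrame({
    "timestamp": [1362076081, 1362082032],
    "learning_language": ["de", "es"],
    "lexeme_id": [
        "76390c1350a8dac31186187e2fe1e178",
        "73eecb492ca758ddab5371cf7b5cca32",
    ],
    "lexeme_string": ["lernt/lernen<vblex><pri><p3><sg>", "bajo/bajo<pr>"],
})

# Time table: one row per distinct timestamp, broken into calendar parts
ts = pd.to_datetime(traces["timestamp"], unit="s")
time_table = pd.DataFrame({
    "timestamp": traces["timestamp"],
    "year": ts.dt.year,
    "month": ts.dt.month,
    "day": ts.dt.day,
    "dayofweek": ts.dt.dayofweek,  # Monday is 0
    "hour": ts.dt.hour,
    "minute": ts.dt.minute,
    "second": ts.dt.second,
}).drop_duplicates()

# Word table: surface form is the text before the first "/" (assumption)
word_table = pd.DataFrame({
    "lexeme_id": traces["lexeme_id"],
    "surface_form": traces["lexeme_string"].str.split("/", n=1).str[0],
    "language_code": traces["learning_language"],
}).drop_duplicates()
```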

## Fact Table Redesigned

word views: timestamp, user_id, learning language name, ui language name, word name, time since last seen, percent correct this session, percent correct historic
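
One possible way to compute the two percent-correct columns of this fact table is to divide the correct counts by the seen counts for the session and for the history. The sketch below does this for two sample rows; the output column names are assumptions, and the word name is again taken as the text before the first `/` in `lexeme_string`.

```python
import pandas as pd

# Two rows copied from the first chunk displayed above
traces = pd.DataFrame({
    "timestamp": [1362076081, 1362076081],
    "delta": [27649635, 27649635],
    "user_id": ["u:FO", "u:FO"],
    "learning_language": ["de", "de"],
    "ui_language": ["en", "en"],
    "lexeme_string": [
        "lernt/lernen<vblex><pri><p3><sg>",
        "die/die<det><def><f><sg><nom>",
    ],
    "history_seen": [6, 4],
    "history_correct": [4, 4],
    "session_seen": [2, 2],
    "session_correct": [2, 1],
})

word_views = pd.DataFrame({
    "timestamp": traces["timestamp"],
    "user_id": traces["user_id"],
    "learning_language": traces["learning_language"],
    "ui_language": traces["ui_language"],
    "word": traces["lexeme_string"].str.split("/", n=1).str[0],
    "seconds_since_last_seen": traces["delta"],
    # Percentages on a 0-100 scale
    "pct_correct_session": 100 * traces["session_correct"] / traces["session_seen"],
    "pct_correct_history": 100 * traces["history_correct"] / traces["history_seen"],
})
```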

## Analytics Team Questions

- what are the most common language pairs?

- which language pair has the most activity?

- are certain language pairs correlated with time-of-day?

- which language pair has the best retention?

- which UI language has the highest word retention across all learning languages?

- what is the average time it takes to learn a word?
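
Two of the questions above (most activity and best retention per language pair) can be sketched as a single pandas aggregation. The example treats the number of rows as activity and the mean `p_recall` as retention; both are simplifying assumptions, and the four sample rows are only a stand-in for the full data.

```python
import pandas as pd

# Small stand-in sample using values from the first chunk displayed above
traces = pd.DataFrame({
    "ui_language": ["en", "en", "en", "en"],
    "learning_language": ["de", "de", "es", "de"],
    "p_recall": [1.0, 0.5, 1.0, 0.75],
})

pair_stats = (
    traces.groupby(["ui_language", "learning_language"])
    .agg(views=("p_recall", "size"), mean_recall=("p_recall", "mean"))
)

# Index labels are (ui_language, learning_language) tuples
most_active_pair = pair_stats["views"].idxmax()
best_retention_pair = pair_stats["mean_recall"].idxmax()
```

On the full dataset a pair with very few rows could top the retention ranking by chance, so a minimum-views filter may be worth adding before calling `idxmax`.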

## Data Flow

- upload the learning_traces and language code CSVs to S3

- spin up a Spark cluster on AWS

- read data from S3 into Spark, perform data modeling

- push modeled data from Python into S3 as Parquet files

- save modeled data to a Redshift database, ready for the analytics team to query
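
The steps above could look roughly like the following PySpark sketch, shown for the language pairs table only. The bucket layout (`raw/`, `modeled/`), the app name, and all function names are hypothetical, and cluster setup and S3 credentials are assumed to be configured elsewhere. The Redshift step is represented only by a helper that builds a `COPY ... FORMAT AS PARQUET` statement; running it against a database is left out.

```python
def s3a_path(bucket: str, key: str) -> str:
    """Build the s3a:// URI used by Spark's Hadoop S3 connector."""
    return f"s3a://{bucket}/{key}"


def redshift_copy_sql(table: str, s3_prefix: str, iam_role: str) -> str:
    """Build a Redshift COPY statement that loads Parquet files from S3.

    Redshift expects an s3:// prefix here, not s3a://.
    """
    return (
        f"COPY {table} FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS PARQUET;"
    )


def model_language_pairs(bucket: str) -> None:
    """Read the raw traces from S3, model one table, write it back as Parquet."""
    # Imported here so the helpers above work without Spark installed
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("learning-traces-model").getOrCreate()

    # Hypothetical key layout: raw CSV under raw/, modeled tables under modeled/
    traces = spark.read.csv(
        s3a_path(bucket, "raw/learning_traces.csv"),
        header=True,
        inferSchema=True,
    )

    # Language pairs dimension: distinct (ui, learning) combinations
    language_pairs = traces.select("ui_language", "learning_language").distinct()

    language_pairs.write.mode("overwrite").parquet(
        s3a_path(bucket, "modeled/language_pairs")
    )
```

Writing Parquet first and loading it with `COPY` keeps the Spark job independent of the Redshift connection, which is one reason to stage the modeled data in S3.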