# Papercup Data Scientist Test

This project provides an SQLite database filled with a realistic dataset. The test is made up of two sections, each with three questions.

The first section focuses on SQL, and the second focuses on using Python for data analysis and visualization. This notebook includes the questions and sample code to query the SQLite database. You are free to use any tools to help with your solutions, including a search engine. The entire exercise is expected to take 1-2 hours.

## Data 

The dataset is made up of two tables in an SQLite database. Below is a schema of the tables:

![schema](./db.png)

The data is generated by a workflow in which each video is checked by a QA for translation or transcription errors. A QA may need multiple sessions to finalize a single video.
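If the schema image does not render, the table and column names can also be read from the database itself. The following is a minimal sketch, assuming `papercup.db` has already been downloaded as described under "Getting started"; `describe_schema` is a hypothetical helper name and is not part of the provided notebook.

```python
import sqlite3


def describe_schema(conn):
    """Return {table_name: [column names]} for every table in the database."""
    # sqlite_master is SQLite's built-in catalogue of tables, indexes and views.
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    schema = {}
    for (name,) in tables:
        # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
        schema[name] = [column[1] for column in columns]
    return schema


# Example call once the database file exists:
# describe_schema(sqlite3.connect("papercup.db"))
```

The returned dictionary should match the tables and columns in the schema diagram.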

## Getting started

The first step is to copy this notebook into your Google Drive or GitHub Gist. Navigate to the "File" dropdown in the upper left corner and select your preferred option. This will create a copy of the notebook for you to work on.


Next, run the following cell to download the data:


In [2]:
%%capture
!wget https://github.com/papercup-ai/data-technical-test/raw/main/papercup.db

## Assessment Criteria

You will be assessed on:

- Correctness and completeness of your answers
- Quality of code 
- Level of detail provided in solutions

## Submitting your answers

Once you have completed all the tasks, share this notebook with us by providing a link to it on GitHub Gist or Google Drive.

## SQL

### Q1: Calculate the average duration of translated videos in the dataset
### Q2: Calculate the total time spent by each QA for each month of the year
### Q3: Find the average QA time per minute of video for each language

The example answers below are based on the following simplified dataset.

    Sessions

    | qa | session_duration  | tv_id | session_date  |
    |----|-------------------|-------|---------------|
    | 1  | 6000              | a     |   2021-01-01  |
    | 1  | 12000             | b     |   2021-01-09  |
    | 2  | 12000             | c     |   2021-01-07  |
    | 2  | 22000             | c     |   2021-01-05  |

    Translated video

    | language 	| duration 	| tv_id 	| ... 	|
    |----------	|----------	|-------	|-----	|
    | es-la    	| 300      	| a     	|     	|
    | es-la    	| 1200     	| b     	|     	|
    | de-de    	| 1900     	| c     	|     	|

Example A1: The average video duration is 1133.33s, i.e. (300 + 1200 + 1900) / 3.

Example A2: In January, QA 1 spent a total of 18000s working on videos and QA 2 spent 34000s.

Example A3: Based on the data presented, a minute of Spanish (es-la) video takes 12 minutes to QA on average: 18000s of QA time across 1500s of video.
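The arithmetic behind the three example answers can be checked against an in-memory copy of the simplified tables. This is only a sketch of one possible reading of the questions, not a reference solution: the table name `translated_video` and the variable `demo_conn` are assumptions made for illustration, both duration columns are assumed to be in seconds (as the example answers suggest), and A3 is read as total QA time divided by total video duration per language.

```python
import sqlite3

# In-memory copy of the simplified example tables shown above.
# "translated_video" is an assumed table name; check it against the real schema.
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
    CREATE TABLE session (
        qa INTEGER, session_duration INTEGER, tv_id TEXT, session_date TEXT
    );
    CREATE TABLE translated_video (language TEXT, duration INTEGER, tv_id TEXT);
    INSERT INTO session VALUES
        (1, 6000,  'a', '2021-01-01'),
        (1, 12000, 'b', '2021-01-09'),
        (2, 12000, 'c', '2021-01-07'),
        (2, 22000, 'c', '2021-01-05');
    INSERT INTO translated_video VALUES
        ('es-la', 300,  'a'),
        ('es-la', 1200, 'b'),
        ('de-de', 1900, 'c');
""")

# A1: mean video duration in seconds (about 1133.33 for this data).
avg_duration = demo_conn.execute(
    "SELECT AVG(duration) FROM translated_video"
).fetchone()[0]

# A2: total session time per QA per calendar month.
monthly = demo_conn.execute("""
    SELECT qa, strftime('%Y-%m', session_date) AS month, SUM(session_duration)
    FROM session
    GROUP BY qa, strftime('%Y-%m', session_date)
    ORDER BY qa, month
""").fetchall()

# A3: sum QA time per video first, so a video with several sessions
# contributes its duration only once, then divide totals per language.
per_language = demo_conn.execute("""
    WITH qa_time AS (
        SELECT tv_id, SUM(session_duration) AS qa_seconds
        FROM session
        GROUP BY tv_id
    )
    SELECT v.language, SUM(q.qa_seconds) * 1.0 / SUM(v.duration) AS qa_per_video_minute
    FROM translated_video AS v
    JOIN qa_time AS q ON q.tv_id = v.tv_id
    GROUP BY v.language
    ORDER BY v.language
""").fetchall()

print(avg_duration)
print(monthly)
print(per_language)
```

Because QA time and video duration are both in seconds here, their ratio is the same as minutes of QA per minute of video. Averaging the per-video ratios instead would give a different figure for es-la (15 rather than 12), so it is worth stating which definition an answer uses.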

In [None]:
import pandas as pd
import sqlite3
conn = sqlite3.connect("papercup.db")


In [None]:
q1_sql = """
    SELECT * FROM session;
"""

pd.read_sql_query(q1_sql, conn)

In [None]:
q2_sql = """
    SELECT * FROM session;
"""
pd.read_sql_query(q2_sql, conn)


In [None]:
q3_sql = """
    SELECT * FROM session;
"""
pd.read_sql_query(q3_sql, conn)


## Python

For the following tasks, you are free to use any library for analysis and visualization.

### Q4: Plot a histogram of session durations in minutes
### Q5: Plot the relationship between video duration and the number of QA minutes spent per minute of video
### Q6: Provide a function that predicts how many minutes a QA would spend on a video. Provide a brief description and analysis of the function.

_Note: there is no single correct answer; this is an exploratory question._