# Papercup Data Analyst Test

This project uses an SQLite database that has been filled with a realistic dataset. The test is made up of two sections: five SQL questions (Q1 to Q5) and three Python questions (Q6 to Q8).

The first section focuses on SQL, and the second section focuses on using Python for data analysis and visualization. This notebook includes the questions and sample code to query the SQLite database. You are free to use any tools to help with your solutions, including a search engine. The entire exercise is expected to take 1-2 hours.

## Data 

The dataset is made of two tables in an SQLite database. Below is a schema of the tables:

![schema](./db.png)

The data is generated by a workflow in which a QA validates each video for translation or transcription errors. A QA may need multiple sessions to finalize a video.

## Getting started

The first step is to copy this notebook into your Google Drive or GitHub Gist. Open the "File" dropdown in the upper left corner and select your preferred option. This creates a copy of the notebook for you to work on.


Next, run the cell below to download the data.


In [None]:
%%capture
!wget https://github.com/papercup-ai/data-technical-test/raw/main/papercup.db

## Assessment Criteria

You will be assessed on:

- Correctness and completeness of your answers
- Quality of code 
- Level of detail provided in solutions

## Submitting your answers

Once you have completed all the tasks, share this notebook with us by providing a link to it on GitHub Gist or your Google Drive.

## SQL

### Q1: Identify the QAs with the best and worst average performance per minute of video.
### Q2: Identify the QAs who are performing above the 3rd quartile.
### Q3: Are there any translated videos with no session associated?
### Q4: How has the average time that QAs spend per video changed from their initial month to their most recent month?
### Q5: Retrieve the translated video IDs and their corresponding total session durations for sessions where the total session duration exceeds 1200 seconds. Include only videos in the 'completed' state, according to the translated_video table.

_Note: We do look for the correct answers. However, we are more interested in the approach and the steps followed to answer the questions._

In [None]:
import pandas as pd
import sqlite3
conn = sqlite3.connect("papercup.db")


In [None]:
# Preview the translated_video table: column names, dtypes, and non-null counts.
q1_sql = """
    SELECT * FROM translated_video;
"""

pd.read_sql_query(q1_sql, conn).info()

In [None]:
# Preview the session table as a DataFrame.
q2_sql = """
    SELECT * FROM session;
"""
pd.read_sql_query(q2_sql, conn)
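
Before writing the queries, it can help to list the tables and their columns directly from SQLite rather than relying only on the schema image. The following is a minimal sketch that runs against an in-memory database with stand-in tables so that it is self-contained. The column names used here (`id`, `state`, `translated_video_id`, `duration`) are hypothetical placeholders, not the real schema.

```python
import sqlite3

import pandas as pd

# In-memory stand-in for papercup.db; the columns below are hypothetical.
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript(
    """
    CREATE TABLE translated_video (id INTEGER PRIMARY KEY, state TEXT);
    CREATE TABLE session (
        id INTEGER PRIMARY KEY,
        translated_video_id INTEGER,
        duration REAL
    );
    """
)

# sqlite_master lists every object in the database; keep only the tables.
tables = pd.read_sql_query(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;",
    demo_conn,
)

# PRAGMA table_info returns one row per column of the given table.
schema = {
    name: pd.read_sql_query(f"PRAGMA table_info({name});", demo_conn)
    for name in tables["name"]
}

for name, columns in schema.items():
    print(name, list(columns["name"]))
```

To inspect the real tables, the same two queries can be run with the `conn` object created above instead of `demo_conn`.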


## Python

For the following tasks, you are free to use any library to perform the analysis and produce visualisations.

### Q6: Visualise the relationship between the total video duration and the average time spent by QA on each minute of video.
### Q7: Calculate the correlation coefficient between video duration and session duration. Provide a brief interpretation of the results.
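
As a starting point for the mechanics only, here is a small sketch of computing two common correlation coefficients with pandas. It assumes a DataFrame with one row per video; the column names `video_duration` and `session_duration` and all of the values are made up for illustration and are not taken from `papercup.db`.

```python
import pandas as pd

# Synthetic values purely for illustration (seconds); not real data.
demo = pd.DataFrame(
    {
        "video_duration": [60, 120, 180, 240, 300],
        "session_duration": [400, 650, 1000, 1150, 1600],
    }
)

# Pearson measures linear association between the two columns.
pearson = demo.corr(method="pearson").loc["video_duration", "session_duration"]

# Spearman measures monotonic (rank-based) association.
spearman = demo.corr(method="spearman").loc["video_duration", "session_duration"]

print(f"Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}")
```

Reporting both can be useful: a Spearman value noticeably higher than the Pearson value may hint at a monotonic but non-linear relationship, which is worth mentioning in the interpretation.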
### Q8: Perform data cleaning on the video categories column.


_Note: there is no single correct answer; this is an exploratory question._
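
One possible first pass at cleaning a free-text category column is sketched below. The labels are invented to show typical inconsistencies (stray whitespace, mixed case, empty strings) and the column name `category` is an assumption; the real column may need different or additional steps.

```python
import pandas as pd

# Made-up category labels; not taken from papercup.db.
raw = pd.Series(
    ["News", " news", "SPORTS", "sports ", "", None, "Kids & Family"],
    name="category",
)

# Trim whitespace and unify case so that variants of one label collapse together.
stripped = raw.str.strip().str.lower()

# Treat empty strings as missing values rather than as a category of their own.
cleaned = stripped.mask(stripped == "")

# Compare label counts before and after to see which variants were merged.
print(raw.value_counts(dropna=False))
print(cleaned.value_counts(dropna=False))
```

Comparing the two `value_counts` outputs makes each cleaning decision visible, which helps when documenting the approach.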