# **Starter Dataset - Statistics**

This notebook presents some insights into the so-called "starter dataset".

Specifically, the term "starter dataset" refers to the demo data used by Talentia's developers in their daily work.
For this thesis study, I've considered the following tables:
- *TF_QST_QUESTIONNAIRES*, describing the existing questionnaires;
- *TF_QST_QUESTIONS*, describing the questions defined for each questionnaire;
- *TF_QST_QUESTION_TYPES*, describing the possible question types;
- *TF_QST_ANSWERS*, describing the possible answers for each defined question.

In [1]:
import os
import sys
# Add the sibling "src" folder of the parent directory to the import path (OS-independent)
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'src'))

from src.data.TFQuestionnairesDataset import TFQuestionnairesDataset
import src.visualization.visualize as vis

## **Data**

The dataset is managed by the *TFQuestionnairesDataset* class, whose properties expose all the needed information.

First, I report below just a few rows of each table, showing only the "essential" columns. All the extracted tables also contain fields that do not directly affect this task (e.g. the so-called audit columns, or those related to the UI).
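As a rough illustration of what such a preview involves, the sketch below restricts a table to a list of columns and keeps the first rows using plain pandas. The function `head_with_columns` and the toy table are hypothetical; the actual implementation of `vis.get_head_using_essential_columns` is not shown in this notebook and may differ.

```python
import pandas as pd


def head_with_columns(df: pd.DataFrame, columns: list, n: int = 5) -> pd.DataFrame:
    """Return the first n rows of df, restricted to the given columns."""
    return df.loc[:, columns].head(n)


# Toy table, invented for illustration: CREATED_BY stands in for an audit column.
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "CODE": ["A", "B", "C"],
    "NAME": ["First", "Second", "Third"],
    "CREATED_BY": ["admin", "admin", "admin"],
})

preview = head_with_columns(toy, ["ID", "CODE", "NAME"], n=2)
print(preview)
```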

In [2]:
dataset = TFQuestionnairesDataset()
dataset.load_data()

In [3]:
vis.get_head_using_essential_columns(dataset.questionnaires, dataset.ESSENTIAL_COLUMNS_QUESTIONNAIRES)

| Unnamed: 0 | ID | CODE | NAME | DESCRIPTION |
|---|---|---|---|---|
| 0 | 1146002 | Stress Survey | Stress Survey | Burnout being represented by emotional and or ... |
| 1 | 1146004 | Mood | Mood | Checking continuously the "climate" of your Te... |
| 2 | 1173001 | REMOTE_WORKING | Remote Working - Tell us your experience | Remote Working - Tell us your experience |
| 3 | 1173002 | EXIT_INTERVIEW | Exit interview - Tell us your experience | Thank you for taking the time to complete this... |
| 4 | 1173003 | OQ | Onboarding Questionnaire | This questionnaire can be used in an onboardin... |


In [4]:
vis.get_head_using_essential_columns(dataset.question_types, dataset.ESSENTIAL_COLUMNS_QUESTION_TYPES)

| Unnamed: 0 | ID | CODE | NAME | DESCRIPTION |
|---|---|---|---|---|
| 0 | 1 | SingleChoice | Single choice | Use this type of question to choose one answer... |
| 1 | 2 | MultiChoice | Multi choice | Use this type of question to choose one or mor... |
| 2 | 3 | RatingScale | Rating scale | Use this type of question to rate something. |
| 3 | 4 | Comment | Comment | Use this type of question to aquire feedback. |
| 4 | 5 | ReorderItems | Reorder items | Use this type of question to reorder items. |


In [5]:
vis.get_head_using_essential_columns(dataset.questions, dataset.ESSENTIAL_COLUMNS_QUESTIONS)

| Unnamed: 0 | ID | TYPE_ID | QUESTIONNAIRE_ID | CODE | NAME | DISPLAY_ORDER |
|---|---|---|---|---|---|---|
| 0 | 1146002 | 3 | 1146002.0 | 1 | I often feel overwhelmed by tasks which I'm no... | 1 |
| 1 | 1146003 | 3 | 1146002.0 | 2 | I'm not comfortable with my working Time sched... | 2 |
| 2 | 1146004 | 3 | 1146002.0 | 3 | I'm often frustrated because I'm allocated on ... | 3 |
| 3 | 1146005 | 3 | 1146002.0 | 4 | I often work overtime and I don't have time fo... | 4 |
| 4 | 1146006 | 3 | 1146002.0 | 5 | The place where I use to work for most of my t... | 5 |


In [6]:
vis.get_head_using_essential_columns(dataset.answers, dataset.ESSENTIAL_COLUMNS_ANSWERS)

| Unnamed: 0 | ID | QUESTION_ID | ANSWER |
|---|---|---|---|
| 0 | 1146002 | 1146002 | Completely Disagree |
| 1 | 1146003 | 1146002 | 2 |
| 2 | 1146004 | 1146002 | 3 |
| 3 | 1146005 | 1146002 | 4 |
| 4 | 1146006 | 1146002 | 5 |
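
The column names in the previews suggest how the four tables relate: `QUESTIONNAIRE_ID` and `TYPE_ID` in the questions table, and `QUESTION_ID` in the answers table, look like foreign keys to the `ID` column of the corresponding tables. This is an assumption drawn from the names only, not something the notebook states. Under that assumption, a minimal pandas sketch of joining the tables could look like the following; all rows are invented for illustration.

```python
import pandas as pd

# Toy rows, invented for illustration only (not taken from the starter dataset).
questionnaires = pd.DataFrame({"ID": [10], "NAME": ["Demo survey"]})
question_types = pd.DataFrame({"ID": [3], "CODE": ["RatingScale"]})
questions = pd.DataFrame({
    "ID": [100, 101],
    "TYPE_ID": [3, 3],
    "QUESTIONNAIRE_ID": [10, 10],
    "NAME": ["Question one", "Question two"],
})
answers = pd.DataFrame({
    "ID": [1000, 1001, 1002],
    "QUESTION_ID": [100, 100, 101],
    "ANSWER": ["1", "2", "1"],
})

# Assumed keys: answers.QUESTION_ID -> questions.ID,
# questions.TYPE_ID -> question_types.ID,
# questions.QUESTIONNAIRE_ID -> questionnaires.ID.
joined = (
    answers
    .merge(questions, left_on="QUESTION_ID", right_on="ID",
           suffixes=("_ANSWER", "_QUESTION"))
    .merge(question_types.add_suffix("_TYPE"),
           left_on="TYPE_ID", right_on="ID_TYPE")
    .merge(questionnaires.add_suffix("_QUESTIONNAIRE"),
           left_on="QUESTIONNAIRE_ID", right_on="ID_QUESTIONNAIRE")
)
print(joined[["ANSWER", "NAME", "CODE_TYPE", "NAME_QUESTIONNAIRE"]])
```

Note that `QUESTIONNAIRE_ID` is displayed as a float (`1146002.0`) in the questions preview, which may indicate that some rows have no questionnaire; with the default inner join used here, such rows would be dropped.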


### **Count null values**

Not all columns will be used in the later steps.

The analysis below therefore counts the null values in each column of every extracted table, so that columns carrying little or no information can be identified and safely discarded.

In [7]:
questionnaires_null_count = vis.count_null_values_for_columns(dataset.questionnaires)
question_types_null_count = vis.count_null_values_for_columns(dataset.question_types)
questions_null_count = vis.count_null_values_for_columns(dataset.questions)
answers_null_count = vis.count_null_values_for_columns(dataset.answers)

In [8]:
vis.plot_null_values(questionnaires_null_count, "Questionnaires", color="#90AEE4")

In [9]:
vis.plot_null_values(question_types_null_count, "Questions' Types", color="#2DCCD3")

In [10]:
vis.plot_null_values(questions_null_count, "Questions", color="#D0006F")

In [11]:
vis.plot_null_values(answers_null_count, "Answers", color="#372082")
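
For reference, counting nulls per column can be done directly in pandas. The sketch below is a minimal version under stated assumptions: the helper `count_nulls`, the toy table, and the "entirely empty" criterion for dropping a column are all hypothetical, and the `vis` helpers used above may work differently.

```python
import pandas as pd


def count_nulls(df: pd.DataFrame) -> pd.Series:
    """Number of null values per column, largest first."""
    return df.isna().sum().sort_values(ascending=False)


# Toy table, invented for illustration: HELP_TEXT is entirely empty.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "NAME": ["a", None, "c", "d"],
    "HELP_TEXT": [None, None, None, None],
})

null_counts = count_nulls(toy)
null_ratio = null_counts / len(toy)

# Hypothetical criterion: a column is a candidate for removal if every value is null.
droppable = null_ratio[null_ratio == 1.0].index.tolist()
print(null_counts)
print("Candidates for removal:", droppable)
```

If matplotlib is available, `null_counts.plot.bar()` gives a bar chart comparable in spirit to the plots above.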

## **Questionnaires insights**

Although this is only an initial analysis on demo data, I wanted to investigate the distribution of the "ground truth" data.

Below are some simple statistics, followed by plots of the length distribution of questions and answers.

In [12]:
avg_questions = vis.average_number_per_column(dataset.questions, "QUESTIONNAIRE_ID")
print(f"Average number of questions per questionnaire: {avg_questions}")

Average number of questions per questionnaire: 8


In [13]:
avg_question_types = vis.average_number_per_column(dataset.questions, "TYPE_ID")
print(f"Average number of questions per question type: {avg_question_types}")

Average number of questions per question type: 27


In [14]:
avg_answers = vis.average_number_per_column(dataset.answers, "QUESTION_ID")
print(f"Average number of answers per question: {avg_answers}")

Average number of answers per question: 5
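
One plausible way to compute such an average is to count the rows in each group and take the mean of those counts. The sketch below assumes this definition and assumes the result is rounded to an integer, since the printed values are whole numbers; `average_rows_per_group` is a hypothetical name and `vis.average_number_per_column` may be implemented differently.

```python
import pandas as pd


def average_rows_per_group(df: pd.DataFrame, column: str) -> int:
    """Mean number of rows per distinct value of `column`, rounded to an int.

    Rows where `column` is null are ignored, since groupby drops null keys by default.
    """
    group_sizes = df.groupby(column).size()
    return int(round(float(group_sizes.mean())))


# Toy questions table, invented for illustration:
# questionnaire 1 has three questions, questionnaire 2 has one.
toy_questions = pd.DataFrame({
    "ID": [100, 101, 102, 103],
    "QUESTIONNAIRE_ID": [1, 1, 1, 2],
})

avg = average_rows_per_group(toy_questions, "QUESTIONNAIRE_ID")
print(f"Average number of questions per questionnaire: {avg}")
```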


### **Length distribution**

In [15]:
vis.plot_length_distribution(dataset.questions, "NAME", "Questions", color="#D0006F")

In [16]:
vis.plot_length_distribution(dataset.answers, "ANSWER", "Answers", color="#372082")
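
The notebook does not state whether length is measured in characters or in words. As a minimal sketch, assuming character counts, the lengths behind such a plot could be computed with pandas as follows; `text_lengths` and the toy data are hypothetical.

```python
import pandas as pd


def text_lengths(df: pd.DataFrame, column: str) -> pd.Series:
    """Character length of each non-null entry in a text column."""
    return df[column].dropna().astype(str).str.len()


# Toy answers, invented for illustration; the null entry is skipped.
toy_answers = pd.DataFrame({"ANSWER": ["Yes", "No", None, "Maybe"]})

lengths = text_lengths(toy_answers, "ANSWER")
print(lengths.describe())
```

With matplotlib installed, `lengths.plot.hist()` would draw the corresponding histogram. To measure words instead of characters, `.str.split().str.len()` could replace `.str.len()`.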