# Question Survey Analysis

The first study, called `question-survey`, examined the effect of social influence on curiosity, using upvotes as a proxy for social interest. Below we analyze the responses to test the hypothesis that the same questions, when shown with higher upvote counts, receive higher curiosity ratings from participants.

In [1]:
%load_ext pycodestyle_magic

In [2]:
# Analytical Tools
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Utilities
import math
import json
import pprint
import utilities.processing as processing

# Make printing much more convenient
log = pprint.pprint

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

### Loading Data

In [3]:
FILE_NAMES = [
    'raw-data/question-setup-question-survey.json',
    'raw-data/question-survey-entries.json'
]

with open(FILE_NAMES[0]) as file:
    literals = json.load(file)
    
q_text = literals['question_text']
j_text = literals['judgement_text']
QUESTIONS = {ques: 'q' + str(num) for num, ques in enumerate(q_text)}
JUDGEMENTS = {judge: 'j' + str(num) for num, judge in enumerate(j_text)}
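
The two dictionary comprehensions above map each question or judgement text to a short column code. A minimal sketch of the resulting mapping, using invented question texts (the real ones live in the setup JSON file and are not reproduced here):

```python
# Hypothetical question texts, invented purely for illustration.
example_q_text = ["Why is the sky blue?", "How do vaccines work?"]

# Same comprehension as in the cell above: text -> 'q' + position in the list.
example_questions = {ques: 'q' + str(num)
                     for num, ques in enumerate(example_q_text)}

print(example_questions)
# {'Why is the sky blue?': 'q0', 'How do vaccines work?': 'q1'}
```

Because the text itself is the key, two identical texts would collapse into a single entry, so this mapping relies on every question and judgement text being unique.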

In [4]:
with open(FILE_NAMES[1]) as file:
    # Each line keeps its trailing '\n', so strip before testing for blank lines
    master_responses = [json.loads(line) for line in file if line.strip()]
# Study changed after the first 30 were collected
real_responses = master_responses[30:]
len(real_responses)

100
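
The entries file is read as one JSON object per line. When iterating over a file, each line keeps its trailing newline, so a blank line is the non-empty string `'\n'` and is only skipped if whitespace is stripped before the truthiness check. A small sketch with an in-memory stand-in for the file (the records are made up):

```python
import io
import json

# Two made-up records separated by a blank line, standing in for the real file.
fake_file = io.StringIO('{"condition": "A"}\n\n{"condition": "B"}\n')

# Stripping turns the blank line '\n' into '', which is falsy and gets skipped;
# without the strip, json.loads would be called on '\n' and raise an error.
records = [json.loads(line) for line in fake_file if line.strip()]

print(len(records))  # 2
```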

### Reading Responses into Data
Creates a `DataFrame` based on the survey data.

In [15]:
# Create dictionary to represent future DataFrame
num_questions = len(QUESTIONS)
num_judgements = len(JUDGEMENTS)
col_labels = processing.get_col_labels(num_questions,
                                       num_judgements,
                                       choice=False)
data = {label: [] for label in col_labels}
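
`processing.get_col_labels` is a project helper whose source is not shown here. Judging only by the column names that appear in the resulting `DataFrame` (`condition`, `consent`, `q0j0` through `q0j4`, `q0score`, and so on), a rough stand-in might look like the sketch below. The function name `sketch_col_labels` is hypothetical, and the `choice` parameter of the real helper is not modeled.

```python
def sketch_col_labels(num_questions, num_judgements):
    """Hypothetical stand-in for processing.get_col_labels (not the real helper)."""
    labels = ['condition', 'consent']
    for q in range(num_questions):
        # One column per judgement for this question, e.g. q0j0 ... q0j4
        labels += [f'q{q}j{j}' for j in range(num_judgements)]
        # Plus the score shown alongside the question
        labels.append(f'q{q}score')
    return labels


labels = sketch_col_labels(10, 5)
print(len(labels))   # 62
print(labels[:9])
```

With 10 questions and 5 judgements this gives 62 columns, which is consistent with the reported `data.size` of 6200 for 100 responses.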

In [16]:
processing.fill_question_survey_data(data,
                                     real_responses,
                                     QUESTIONS,
                                     JUDGEMENTS)

In [18]:
data = pd.DataFrame(data)
log(data.size)
data.head()

6200


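| | condition | consent | q0j0 | q0j1 | q0j2 | q0j3 | q0j4 | q0score | q1j0 | q1j1 | ... | q8j2 | q8j3 | q8j4 | q8score | q9j0 | q9j1 | q9j2 | q9j3 | q9j4 | q9score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A | 1 | 3 | 3 | 2 | 4 | 4 | 3381 | 4 | 3 | ... | 3 | 2 | 2 | 32 | 4 | 3 | 4 | 3 | 3 | 29 |
| 1 | A | 1 | 4 | 4 | 2 | 5 | 2 | 3362 | 4 | 4 | ... | 4 | 4 | 2 | 32 | 5 | 0 | 5 | 4 | 2 | 32 |
| 2 | A | 1 | 1 | 4 | 4 | 3 | 4 | 3365 | 2 | 4 | ... | 4 | 3 | 3 | 37 | 5 | 2 | 5 | 4 | 4 | 26 |
| 3 | A | 1 | 2 | 1 | 1 | 2 | 1 | 3370 | 2 | 2 | ... | 3 | 2 | 3 | 53 | 3 | 2 | 1 | 2 | 2 | 36 |
| 4 | A | 1 | 1 | 0 | 5 | 2 | 5 | 3370 | 3 | 4 | ... | 5 | 4 | 0 | 23 | 4 | 2 | 6 | 5 | 0 | 29 |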


### Analyzing Data

We first split the data into two groups by condition: participants who were shown higher scores for the group A questions, and participants who were shown higher scores for the group B questions. The question groups themselves are listed in the constants section for reference, where the first five entries correspond to group A and the last five to group B.

In [20]:
# Remove participants without consent
data = data[data.consent == 1]
# Separate into DataFrames for each condition
high_a = data[data.condition == 'A']
high_b = data[data.condition == 'B']
a_size, b_size = len(high_a), len(high_b)
a_size, b_size

(54, 46)
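
With the two conditions separated, one possible next step is to compare the ratings a single question received under each condition. The sketch below runs a one-sided Mann-Whitney U test from `scipy.stats` on made-up ratings. The choice of test, the function name `compare_conditions`, and the data are all assumptions for illustration, not necessarily the analysis used in this study.

```python
from scipy import stats

# Synthetic ratings for ONE question, invented purely for illustration.
ratings_high = [4, 5, 3, 4, 5, 4, 3, 5]  # condition showing many upvotes
ratings_low = [3, 2, 4, 3, 2, 3, 4, 2]   # condition showing few upvotes


def compare_conditions(high, low):
    """One-sided Mann-Whitney U test: do ratings in `high` tend to be larger?"""
    result = stats.mannwhitneyu(high, low, alternative='greater')
    return result.statistic, result.pvalue


u_stat, p_value = compare_conditions(ratings_high, ratings_low)
print(f"U = {u_stat:.1f}, one-sided p = {p_value:.4f}")
```

A rank-based test is used in this sketch because the ratings look like ordinal scale responses. With the real data, the two input lists would be the matching rating column taken from `high_a` and `high_b`.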