# Pet Dog Behavior Study Analysis

## Table of Contents

1. [Overall Incidence](#toc1)  
    1.1 [Total survey participants](#toc1.1)  
    1.2 [Total participants that reported a behavior problem](#toc1.2)  
    1.3 [Total participants that reported a behavior problem but joined for other reasons](#toc1.3)  
    1.4 [Overall incidence summary](#toc1.4)  
2. [First Investigation](#)
3. [Second Investigation](#)
4. [Third Investigation](#)

<a id='toc1'></a>
## 1. Overall Incidence

<a id='toc1.1'></a>
### 1.1 Total survey participants

In [1]:
import sqlite3
import pandas as pd
import textwrap
import scipy.stats as scs
from IPython.display import display


# Create the necessary dataframe.
con = sqlite3.connect('../data/processed/processed.db')
query = ('SELECT record_id, question_reason_for_part_3, q02_score FROM users JOIN dogs '
         'USING(record_id)')
df = pd.read_sql_query(query, con)
df.columns = ['id', 'reason', 'problems']
df['problems'] = pd.to_numeric(df['problems'])
# Keep one row per participant: the row with the highest problem score.
df = df.sort_values('problems', ascending=False).drop_duplicates('id').sort_index()

# Get a count of the total number of participants.
cnt_total_users = len(df.index)
print('Total participants: %d' %cnt_total_users)

Total participants: 2986
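
The de-duplication step in the cell above is easy to misread, so here is a minimal sketch of the same `sort_values` / `drop_duplicates` chain on invented toy data. It assumes that the join returns one row per dog, so a participant with several dogs appears several times; the `toy` frame and its values are illustrative only and do not come from the study database.

```python
import pandas as pd

# Hypothetical data: participants 1 and 3 each have two dogs.
toy = pd.DataFrame({
    'id':       [1, 1, 2, 3, 3],
    'problems': [0, 2, 1, 3, 0],
})

# Sort so the highest score per participant comes first, keep the first
# row seen for each id, then restore the original row order.
deduped = (toy.sort_values('problems', ascending=False)
              .drop_duplicates('id')
              .sort_index())
print(deduped)
```

Under that assumption, a participant is counted as reporting a behavior problem if any one of their dogs has a nonzero score.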


<a id='toc1.2'></a>
### 1.2 Total participants that reported a behavior problem

In [2]:
# Determine the total number of participants who reported behavior problems.
cnt_any_w_problems = len(df[df.problems != 0].index)
print('Total participants who reported behavior problems: %d ' %cnt_any_w_problems)

Total participants who reported behavior problems: 2810 


<a id='toc1.3'></a>
### 1.3 Total participants that reported a behavior problem but joined for other reasons

In [3]:
diff_reason = df[df.reason != '1']
diff_w_problem = diff_reason[diff_reason.problems != 0]
cnt_diff_w_problem = len(diff_w_problem.index)
result = ('Participants who reported behavior problems, but did not list the behavior '
          'problems as the reason for joining the study: %d' %cnt_diff_w_problem)
print(textwrap.fill(result, width=90))

Participants who reported behavior problems, but did not list the behavior problems as the
reason for joining the study: 2043


<a id='toc1.4'></a>
### 1.4 Overall incidence summary

In [4]:
# Express each incidence as a percentage (fraction * 100) so it matches the '%' sign.
pct_any_reason = 100 * cnt_any_w_problems / cnt_total_users
summary = ('Behavior problems were reported by %d of the total %d participants (%.2f%%). '
           %(cnt_any_w_problems, cnt_total_users, pct_any_reason))

# Exclude participants who reported problems and listed them as the reason for joining.
cnt_diff_reason = cnt_total_users - (cnt_any_w_problems - cnt_diff_w_problem)
pct_diff_reason = 100 * cnt_diff_w_problem / cnt_diff_reason
summary += ('After removing participants who listed behavior problems as a reason for '
            'joining the study, behavior problems were reported by %d of the remaining %d '
            'participants (%.2f%%).' %(cnt_diff_w_problem, cnt_diff_reason, pct_diff_reason))

print(textwrap.fill(summary, width=90))

Behavior problems were reported by 2810 of the total 2986 participants (94.11%). After
removing participants who listed behavior problems as a reason for joining the study,
behavior problems were reported by 2043 of the remaining 2219 participants (92.07%).
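
The incidence arithmetic can be checked without the database. The sketch below uses only the three counts printed in sections 1.1 to 1.3; the variable names are illustrative, and the `.2%` format specifier is used because it multiplies the fraction by 100 itself.

```python
# Counts copied from the printed outputs in sections 1.1 to 1.3.
total_participants = 2986
with_problems = 2810
with_problems_other_reason = 2043

# Participants left after removing those who reported a problem and
# listed it as their reason for joining.
remaining = total_participants - (with_problems - with_problems_other_reason)

overall = format(with_problems / total_participants, '.2%')
other_reason = format(with_problems_other_reason / remaining, '.2%')
print(remaining, overall, other_reason)  # 2219 94.11% 92.07%
```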


## 2. First Investigation

### 2.1 Preparation of data

In [5]:
# Create the necessary dataframe.
query = ('SELECT q04_1, q04_2, q04_9 FROM dogs')
df = pd.read_sql_query(query, con)
df.columns = ['thunderstorm phobia', 'noise phobia', 'separation anxiety']
for col in df:
    df[col] = pd.to_numeric(df[col])
    
# Set a significance level.
sig_p = 0.01
    
# Record the total number of dogs.
cnt_total_dogs = len(df.index)
print('Total dogs: %d' %cnt_total_dogs)

Total dogs: 5018


### 2.2 Looping through pairs

In [6]:
pairs = [['thunderstorm phobia', 'noise phobia'],
         ['thunderstorm phobia', 'separation anxiety'],
         ['noise phobia', 'separation anxiety']
        ]
for pair in pairs:
    contingency = pd.crosstab(df[pair[0]], df[pair[1]])
    display(contingency)
    print('Chi-squared Test of Independence for %s and %s:' %(pair[0], pair[1]))
    c, p, dof, expected = scs.chi2_contingency(contingency, correction=False)
    print('chi2 = %f, p = %.2E, dof = %d' %(c, p, dof))
    if p < sig_p:
        print('The resulting p-value is below the set significance threshold (%.2f).' %sig_p)

| thunderstorm phobia (rows) / noise phobia (columns) | 0 | 1 |
| --- | --- | --- |
| 0 | 3839 | 427 |
| 1 | 210 | 542 |


Chi-squared Test of Independence for thunderstorm phobia and noise phobia:
chi2 = 1580.493072, p = 0.00E+00, dof = 1
The resulting p-value is below the set significance threshold (0.01).


| thunderstorm phobia (rows) / separation anxiety (columns) | 0 | 1 |
| --- | --- | --- |
| 0 | 3750 | 516 |
| 1 | 501 | 251 |


Chi-squared Test of Independence for thunderstorm phobia and separation anxiety:
chi2 = 223.618927, p = 1.47E-50, dof = 1
The resulting p-value is below the set significance threshold (0.01).


| noise phobia (rows) / separation anxiety (columns) | 0 | 1 |
| --- | --- | --- |
| 0 | 3592 | 457 |
| 1 | 659 | 310 |


Chi-squared Test of Independence for noise phobia and separation anxiety:
chi2 = 258.860919, p = 3.04E-58, dof = 1
The resulting p-value is below the set significance threshold (0.01).
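
To make the test statistic less of a black box, the following sketch recomputes the Pearson chi-squared statistic by hand for the first contingency table above. It is a dependency-free illustration of what an uncorrected test (`correction=False`) computes, not the SciPy implementation; `chi2_statistic` is a hypothetical helper name.

```python
def chi2_statistic(observed):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (obs - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof


# Thunderstorm phobia (rows) by noise phobia (columns), copied from the output above.
observed = [[3839, 427],
            [210, 542]]
chi2, dof = chi2_statistic(observed)
print('chi2 = %f, dof = %d' % (chi2, dof))
```

The result should agree with the `chi2` and `dof` values printed for this pair, up to floating-point rounding.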


In [7]:
contingency = pd.crosstab(df['separation anxiety'], [df['noise phobia'],
                                                     df['thunderstorm phobia']])
display(contingency.style)
title = ('Chi-squared Test of Independence for separation anxiety and the combination of '
         'noise and thunderstorm phobia:')
print(textwrap.fill(title, width=90))
c, p, dof, expected = scs.chi2_contingency(contingency, correction=False)
print('chi2 = %f, p = %.2E, dof = %d' %(c, p, dof))
if p < sig_p:
    print('The resulting p-value is below the set significance threshold (%.2f).' %sig_p)

| separation anxiety | noise phobia 0, thunderstorm phobia 0 | noise phobia 0, thunderstorm phobia 1 | noise phobia 1, thunderstorm phobia 0 | noise phobia 1, thunderstorm phobia 1 |
| --- | --- | --- | --- | --- |
| 0 | 3452 | 140 | 298 | 361 |
| 1 | 387 | 70 | 129 | 181 |


Chi-squared Test of Independence for separation anxiety and the combination of noise and
thunderstorm phobia:
chi2 = 343.870316, p = 3.17E-74, dof = 3
The resulting p-value is below the set significance threshold (0.01).


## 3. Second Investigation

In [8]:
# Create the necessary dataframe.
query = ('SELECT q02_main_2, q02_main_3 FROM dogs')
df = pd.read_sql_query(query, con)
df.columns = ['fearful/anxious behavior', 'repetitive behavior']
for col in df:
    df[col] = pd.to_numeric(df[col])

contingency = pd.crosstab(df['fearful/anxious behavior'], df['repetitive behavior'])
display(contingency)
c, p, dof, expected = scs.chi2_contingency(contingency,
                                           correction=False)
print('Chi-square Test of Independence:')
print('chi2 = %f, p = %.2E, dof = %d' %(c, p, dof))
if p < sig_p:
    print('The resulting p-value is below the set significance threshold (%.2f).' %sig_p)

| fearful/anxious behavior (rows) / repetitive behavior (columns) | 0 | 1 |
| --- | --- | --- |
| 0 | 2307 | 324 |
| 1 | 1893 | 494 |


Chi-square Test of Independence:
chi2 = 64.426486, p = 1.00E-15, dof = 1
The resulting p-value is below the set significance threshold (0.01).
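
The same test-and-report pattern recurs in several cells, so it could be factored into a small helper. The sketch below is one possible shape, assuming the same uncorrected test and the 0.01 threshold used throughout; `chi2_report` and its `alpha` parameter are hypothetical names, not part of the original notebook.

```python
import scipy.stats as scs


def chi2_report(observed, alpha=0.01):
    """Run an uncorrected chi-squared test of independence and summarize it."""
    chi2, p, dof, _expected = scs.chi2_contingency(observed, correction=False)
    return {'chi2': chi2, 'p': p, 'dof': dof, 'significant': p < alpha}


# Fearful/anxious behavior (rows) by repetitive behavior (columns),
# copied from the contingency table above.
observed = [[2307, 324],
            [1893, 494]]
report = chi2_report(observed)
print('chi2 = %f, p = %.2E, dof = %d' % (report['chi2'], report['p'], report['dof']))
```

Returning a dictionary rather than printing inside the helper keeps the formatting decisions in the calling cell.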
