## FetchMaker

You’ve just started working at the hottest new tech startup, FetchMaker. FetchMaker’s mission is to match up prospective dog owners with their perfect pet. FetchMaker has been collecting data on their adoptable dogs, and it’s your job to analyze some of that data.

In [1]:
import numpy as np
import pandas as pd

In [2]:
dogs = pd.read_csv('dog_data.csv')
dogs.head()

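|   | is_rescue | weight | tail_length | age | color | likes_children | is_hypoallergenic | name | breed |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 6 | 2.25 | 2 | black | 1 | 0 | Huey | chihuahua |
| 1 | 0 | 4 | 5.36 | 4 | black | 0 | 0 | Cherish | chihuahua |
| 2 | 0 | 7 | 3.63 | 3 | black | 0 | 1 | Becka | chihuahua |
| 3 | 0 | 5 | 0.19 | 2 | black | 0 | 0 | Addie | chihuahua |
| 4 | 0 | 5 | 0.37 | 1 | black | 1 | 1 | Beverlee | chihuahua |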


FetchMaker estimates (based on historical data for all dogs) that 8% of dogs in their system are rescues.

They would like to know if whippets are significantly more or less likely than other dogs to be a rescue.

In [15]:
# Boolean mask: True for rows whose breed is whippet (not yet the rescue values)
is_whippet = dogs.breed == 'whippet'
is_whippet

0      False
1      False
2      False
3      False
4      False
       ...  
795     True
796     True
797     True
798     True
799     True
Name: breed, Length: 800, dtype: bool

In [13]:
whippet_rescue = dogs.is_rescue[dogs.breed == 'whippet']
whippet_rescue

700    0
701    0
702    0
703    0
704    0
      ..
795    0
796    0
797    0
798    0
799    0
Name: is_rescue, Length: 100, dtype: int64

How many whippets are rescues (remember that the value of is_rescue is 1 for rescues and 0 otherwise)? 

In [4]:
num_whippet_rescues = np.sum(whippet_rescue == 1)
num_whippet_rescues

6

How many whippets are in this sample of data in total?

In [5]:
num_whippets = len(whippet_rescue)
num_whippets

100

Use a hypothesis test to test the following null and alternative hypotheses:

- Null: 8% of whippets are rescues

- Alternative: more or less than 8% of whippets are rescues

Save the p-value from this test as pval and print it out. Using a significance threshold of 0.05, is the proportion of whippets who are rescues significantly different from 8%?

In [6]:
# binom_test was deprecated and later removed from SciPy; binomtest is its replacement
from scipy.stats import binomtest
pval = binomtest(num_whippet_rescues, num_whippets, p=0.08).pvalue
pval

0.5811780106238105
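
To see what the two-sided binomial test measures, the p-value can be reproduced by hand: under the null (8% rescues), add up the probabilities of every rescue count that is no more likely than the observed 6 out of 100. The following is a minimal sketch using only the standard library. The helper name `two_sided_binom_pval` and the small tie tolerance are illustrative assumptions, not part of the project.

```python
from math import comb

def two_sided_binom_pval(k, n, p):
    """Exact two-sided binomial p-value (hypothetical helper for illustration)."""
    # Probability of each possible count 0..n under the null proportion p
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    # Sum the outcomes that are as likely or less likely than the observed one;
    # the tiny tolerance guards against floating-point ties
    cutoff = pmf[k] * (1 + 1e-7)
    return sum(prob for prob in pmf if prob <= cutoff)

pval_by_hand = two_sided_binom_pval(6, 100, 0.08)
print(pval_by_hand)         # should be close to the p-value reported above
print(pval_by_hand < 0.05)  # False: not significant at the 0.05 threshold
```

Because the p-value is well above 0.05, the data give no evidence that the proportion of whippets who are rescues differs from 8%.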

### Mid-Sized Dog Weights

Three of FetchMaker’s most popular mid-sized dog breeds are whippets, terriers, and pitbulls. Is there a significant difference in the average weights of these three dog breeds?

To start answering this question, save the weights of each of these breeds in three separate series named wt_whippets, wt_terriers, and wt_pitbulls, respectively.

In [7]:
wt_whippets = dogs.weight[dogs.breed == 'whippet']
wt_terriers = dogs.weight[dogs.breed == 'terrier']
wt_pitbulls = dogs.weight[dogs.breed == 'pitbull']

Run a single hypothesis test to address the following null and alternative hypotheses:

- Null: whippets, terriers, and pitbulls all weigh the same amount on average

- Alternative: whippets, terriers, and pitbulls do not all weigh the same amount on average (at least one pair of breeds has differing average weights)

Save the resulting p-value as pval and print it out. Using a significance threshold of 0.05, is there at least one pair of dog breeds with significantly different average weights?

In [8]:
from scipy.stats import f_oneway
Fstat, pval = f_oneway(wt_whippets, wt_terriers, wt_pitbulls)
pval

3.276415588274815e-17
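
A one-way ANOVA compares the variation between group means with the variation inside each group. As a rough illustration of the F statistic behind that p-value, here is a minimal sketch on made-up numbers. The helper `one_way_f_stat` and the toy lists are assumptions for illustration and are not taken from dog_data.csv.

```python
from statistics import mean

def one_way_f_stat(*groups):
    """F statistic for one-way ANOVA (hypothetical helper for illustration)."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Between-group sum of squares: how far each group mean is from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of values around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Made-up values, NOT taken from dog_data.csv
toy_a = [1, 2, 3]
toy_b = [2, 3, 4]
toy_c = [5, 6, 7]
f_stat = one_way_f_stat(toy_a, toy_b, toy_c)
print(f_stat)
```

A large F statistic means the group means are far apart relative to the spread within groups, which is what produces a very small p-value like the one above.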

If you completed the previous step correctly, you should have concluded that at least one pair of dog breeds has significantly different average weights.

Run another hypothesis test to determine which of those breeds (whippets, terriers, and pitbulls) weigh different amounts on average. Use an overall type I error rate of 0.05 for all three comparisons.

In [9]:
dogs_wtp = dogs[dogs.breed.isin(['whippet', 'terrier', 'pitbull'])]

In [10]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
output = pairwise_tukeyhsd(dogs_wtp.weight, dogs_wtp.breed)
print(output)

 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
 group1  group2 meandiff p-adj   lower  upper  reject
-----------------------------------------------------
pitbull terrier   -13.24  0.001 -16.728 -9.752   True
pitbull whippet    -3.34 0.0639  -6.828  0.148  False
terrier whippet      9.9  0.001   6.412 13.388   True
-----------------------------------------------------
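
One way to read the table is to compare each adjusted p-value with the overall error rate. The short sketch below copies the three rows reported above into a list (the variable names are illustrative) and keeps the pairs whose adjusted p-value falls below 0.05.

```python
# (group1, group2, adjusted p-value) copied from the Tukey HSD table above
tukey_rows = [
    ("pitbull", "terrier", 0.001),
    ("pitbull", "whippet", 0.0639),
    ("terrier", "whippet", 0.001),
]
alpha = 0.05  # overall type I error rate

significant_pairs = [(g1, g2) for g1, g2, p_adj in tukey_rows if p_adj < alpha]
print(significant_pairs)
# [('pitbull', 'terrier'), ('terrier', 'whippet')]
```

Terriers differ significantly in average weight from both pitbulls and whippets, while the pitbull and whippet difference (adjusted p-value 0.0639) is not significant at this threshold.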


### Poodle and Shihtzu Colors

FetchMaker wants to know if poodles and shihtzus come in different colors.

In [11]:
dogs_ps = dogs[dogs.breed.isin(['poodle', 'shihtzu'])]
Xtab = pd.crosstab(dogs_ps.color, dogs_ps.breed)
Xtab

| color | poodle | shihtzu |
|---|---|---|
| black | 17 | 10 |
| brown | 13 | 36 |
| gold | 8 | 6 |
| grey | 52 | 41 |
| white | 10 | 7 |


Run a hypothesis test for the following null and alternative hypotheses:

- Null: There is no association between breed (poodle vs. shihtzu) and color.

- Alternative: There is an association between breed (poodle vs. shihtzu) and color.

Save the p-value as pval and print it out. Do poodles and shihtzus come in significantly different color combinations? Use a significance threshold of 0.05.

In [12]:
from scipy.stats import chi2_contingency
chi2, pval, dof, exp = chi2_contingency(Xtab)
print(pval)

0.005302408293244593
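
The chi-square test compares the observed counts with the counts expected if color and breed were unrelated. As a hedged sketch of that calculation, the block below recomputes the statistic from the crosstab using only the standard library. The variable names are illustrative, and the closed-form tail probability used here is valid only for an even number of degrees of freedom (it is 4 for this table).

```python
from math import exp, factorial

# Observed counts copied from the crosstab above: color -> (poodle, shihtzu)
observed = {
    "black": (17, 10),
    "brown": (13, 36),
    "gold": (8, 6),
    "grey": (52, 41),
    "white": (10, 7),
}

col_totals = [sum(row[j] for row in observed.values()) for j in range(2)]
grand_total = sum(col_totals)

chi2_by_hand = 0.0
for row in observed.values():
    row_total = sum(row)
    for j, obs in enumerate(row):
        # Expected count if color and breed were independent
        expected = row_total * col_totals[j] / grand_total
        chi2_by_hand += (obs - expected) ** 2 / expected

dof = (len(observed) - 1) * (2 - 1)  # (rows - 1) * (columns - 1)

# Upper-tail chi-square probability; this closed form holds only for even dof
half = chi2_by_hand / 2
pval_by_hand = exp(-half) * sum(half**k / factorial(k) for k in range(dof // 2))

print(chi2_by_hand, dof, pval_by_hand)
```

Since the p-value is below 0.05, the data are not consistent with breed and color being unrelated: poodles and shihtzus differ significantly in their color distributions.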
