### FetchMaker
Congratulations! You’ve just started working at the hottest new tech startup, FetchMaker. FetchMaker’s mission is to match up prospective dog owners with their perfect pet. FetchMaker has been collecting data on their adoptable dogs, and it’s your job to analyze some of that data.

FetchMaker has provided us with data for a sample of dogs from their app, including the following attributes:

- weight, an integer representing how heavy a dog is in pounds
- tail_length, a float representing tail length in inches
- age, in years
- color, a string such as "brown" or "grey"
- is_rescue, a binary indicator equal to 1 for rescues and 0 otherwise

The data has been saved as a pandas DataFrame named dogs.

In [1]:
import numpy as np
import pandas as pd

In [2]:
dogs = pd.read_csv('/Users/elorm/Documents/Repos/Datasets/dog_data.csv')
dogs.head()

Unnamed: 0,is_rescue,weight,tail_length,age,color,likes_children,is_hypoallergenic,name,breed
0,0,6,2.25,2,black,1,0,Huey,chihuahua
1,0,4,5.36,4,black,0,0,Cherish,chihuahua
2,0,7,3.63,3,black,0,1,Becka,chihuahua
3,0,5,0.19,2,black,0,0,Addie,chihuahua
4,0,5,0.37,1,black,1,1,Beverlee,chihuahua


FetchMaker estimates (based on historical data for all dogs) that 8% of dogs in their system are rescues.

They would like to know if whippets are significantly more or less likely than other dogs to be a rescue.

In [3]:
dogs.breed.nunique()

8

In [4]:
dogs.breed.unique()

array(['chihuahua', 'greyhound', 'pitbull', 'poodle', 'rottweiler',
       'shihtzu', 'terrier', 'whippet'], dtype=object)

In [5]:
# Subset to just whippets, terriers, and pitbulls
dogs_wtp = dogs[dogs.breed.isin(['whippet', 'terrier', 'pitbull'])]

# Subset to just poodles and shihtzus
dogs_ps = dogs[dogs.breed.isin(['poodle', 'shihtzu'])]

In [6]:
whippet_rescue = dogs.is_rescue[dogs.breed == 'whippet']

How many whippets are rescues (remember that the value of is_rescue is 1 for rescues and 0 otherwise)? 

In [7]:
num_whippet_rescues = np.sum(whippet_rescue == 1)
print(num_whippet_rescues)

#OR

num_whippet_rescues = np.count_nonzero(whippet_rescue)
print(num_whippet_rescues)

6
6


In [8]:
#How many whippets are in this sample of data in total?
num_whippets = len(whippet_rescue)
print(num_whippets)

100


Use a hypothesis test to test the following null and alternative hypotheses:

- Null: 8% of whippets are rescues
- Alternative: more or less than 8% of whippets are rescues

Save the p-value from this test as pval and print it out. Using a significance threshold of 0.05, is the proportion of whippets who are rescues significantly different from 8%?

For this test, we are focused on a single binary categorical variable, which indicates whether or not each whippet is a rescue. We want to compare the number of rescues in our sample to a hypothetical population-level proportion of 0.08. Therefore, we should use a binomial test.

First, we need to import the binom_test() function:

In [9]:
from scipy.stats import binom_test

In [10]:
# Run a two-sided binomial test against the historical rescue rate of 8%
pval = binom_test(num_whippet_rescues, num_whippets, .08)
print(pval)

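0.5811780106238098

Because this p-value is well above the 0.05 threshold, we fail to reject the null hypothesis: the proportion of whippets who are rescues is not significantly different from 8%.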
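To make the calculation behind this p-value concrete, the sketch below recomputes it using only the standard library. It assumes the usual exact two-sided definition: sum the probabilities of every outcome that is no more likely than the observed one under the null. The helper name `two_sided_binom_pval` is hypothetical and is not part of SciPy. Note also that recent SciPy releases replace `binom_test` with `binomtest`, so the import above may need adjusting depending on the installed version.

```python
from math import comb

def two_sided_binom_pval(k, n, p):
    """Exact two-sided binomial p-value (hypothetical helper, not a SciPy function)."""
    # Probability of each possible number of successes under the null
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    # Keep outcomes that are no more likely than the observed count;
    # the small relative tolerance guards against floating-point ties
    cutoff = pmf[k] * (1 + 1e-7)
    return sum(prob for prob in pmf if prob <= cutoff)

# 6 rescues out of 100 whippets, null proportion of 0.08 (values from the cells above)
manual_pval = two_sided_binom_pval(6, 100, 0.08)
print(manual_pval)
```

The printed value should agree with the SciPy result above, roughly 0.58.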


Three of FetchMaker’s most popular mid-sized dog breeds are whippets, terriers, and pitbulls. Is there a significant difference in the average weights of these three dog breeds?

In [11]:
#Weights of the whippets
wt_whippets = dogs[dogs['breed'] == 'whippet']['weight']

#Weights of the terriers
wt_terriers = dogs[dogs['breed'] == 'terrier']['weight']

#Weights of the pitbulls
wt_pitbulls = dogs[dogs['breed'] == 'pitbull']['weight']

Run a single hypothesis test to address the following null and alternative hypotheses:

- Null: whippets, terriers, and pitbulls all weigh the same amount on average
- Alternative: whippets, terriers, and pitbulls do not all weigh the same amount on average (at least one pair of breeds has differing average weights)

This test addresses an association between two variables: a non-binary categorical variable (breed, with three possible options) and a quantitative variable (weight). It is not a good idea to run three separate two-sample t-tests here, because running multiple t-tests increases our chances of a type I error, or a false positive. In order to run a single hypothesis test with three categories, we should use an ANOVA.
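As a rough illustration of why three separate t-tests inflate the type I error rate, the arithmetic can be sketched as follows. The calculation assumes the three tests are independent, which is a simplification (the pairwise comparisons share data), so the result is an approximation rather than an exact figure.

```python
alpha = 0.05          # significance threshold for each individual t-test
n_comparisons = 3     # whippet/terrier, whippet/pitbull, terrier/pitbull

# Probability of at least one false positive across all comparisons,
# under the simplifying assumption that the tests are independent
familywise_error = 1 - (1 - alpha) ** n_comparisons
print(round(familywise_error, 4))
```

Under that assumption, the chance of at least one false positive is roughly 14% rather than 5%, which is why a single ANOVA is the better choice here.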

To do this, we first need to import the f_oneway() function:

In [12]:
# Run an ANOVA 
from scipy.stats import f_oneway
Fstat, pval = f_oneway(wt_whippets, wt_terriers, wt_pitbulls)
print(pval)

3.276415588274815e-17


Because the p-value is far below 0.05, we reject the null hypothesis: at least one pair of dog breeds has significantly different average weights.
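For intuition about what the ANOVA is measuring, here is a minimal sketch of the F statistic computed by hand: the ratio of between-group variation to within-group variation. The helper `one_way_f` and the toy numbers are made up for illustration; they are not the SciPy implementation and are not taken from dog_data.csv.

```python
def one_way_f(*groups):
    """F statistic for a one-way ANOVA (illustrative helper, not SciPy's code)."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of individual values around their own group mean
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Made-up weights for three tiny groups
toy_f = one_way_f([1, 2, 3], [2, 3, 4], [5, 6, 7])
print(toy_f)
```

A large F value means the group means are spread far apart relative to the scatter inside each group, which is what produces a small p-value.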

Run another hypothesis test to determine which of those breeds (whippets, terriers, and pitbulls) weigh different amounts on average. Use an overall type I error rate of 0.05 for all three comparisons.

In [14]:
# Subset to just whippets, terriers, and pitbulls
dogs_wtp = dogs[dogs.breed.isin(['whippet', 'terrier', 'pitbull'])]
print(dogs_wtp)

     is_rescue  weight  tail_length  age  color  likes_children  \
200          0      71         5.74    4  black               0   
201          0      26        11.56    3  black               0   
202          0      56        10.76    4  black               0   
203          0      33         6.32    4  black               1   
204          0      54        17.18    4  black               1   
..         ...     ...          ...  ...    ...             ...   
795          0      24        14.34    6   grey               0   
796          0      42         7.04   12   grey               0   
797          0      58        10.59    8   grey               1   
798          0      62        18.83    1   grey               1   
799          0      27        15.18    3   grey               0   

     is_hypoallergenic      name    breed  
200                  0   Charlot  pitbull  
201                  0       Jud  pitbull  
202                  0  Rosamund  pitbull  
203                

For this test, we need Tukey’s range test, which can be implemented with pairwise_tukeyhsd. First we need to import the function:

In [15]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

In [16]:
results = pairwise_tukeyhsd(endog = dogs_wtp.weight, groups = dogs_wtp.breed)
print(results)

 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
 group1  group2 meandiff p-adj   lower  upper  reject
-----------------------------------------------------
pitbull terrier   -13.24  0.001 -16.728 -9.752   True
pitbull whippet    -3.34 0.0639  -6.828  0.148  False
terrier whippet      9.9  0.001   6.412 13.388   True
-----------------------------------------------------


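For any pair where “reject” is “True”, we conclude that those two breeds weigh significantly different amounts. Here, terriers differ significantly from both pitbulls and whippets, while the difference between pitbulls and whippets (p-adj = 0.0639) is not significant at the 0.05 level.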
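One way to read the table is through the confidence intervals: a pair is flagged as significant exactly when the interval for its mean difference excludes zero. The sketch below checks this against the rows printed above; the tuples are copied by hand from that output, and the variable names are illustrative only.

```python
# Rows copied from the Tukey output above: (group1, group2, lower, upper, reject)
tukey_rows = [
    ("pitbull", "terrier", -16.728, -9.752, True),
    ("pitbull", "whippet", -6.828, 0.148, False),
    ("terrier", "whippet", 6.412, 13.388, True),
]

significant_pairs = []
for group1, group2, lower, upper, reject in tukey_rows:
    # An interval that excludes zero corresponds to reject == True
    excludes_zero = lower > 0 or upper < 0
    assert excludes_zero == reject
    if reject:
        significant_pairs.append((group1, group2))

print(significant_pairs)
```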

FetchMaker wants to know if poodles and shihtzus come in different colors.

In [18]:
# Create a contingency table of color vs. breed
Xtab = pd.crosstab(dogs_ps.color, dogs_ps.breed)
print(Xtab)

breed  poodle  shihtzu
color                 
black      17       10
brown      13       36
gold        8        6
grey       52       41
white      10        7


Run a hypothesis test for the following null and alternative hypotheses:

- Null: There is no association between breed (poodle vs. shihtzu) and color.
- Alternative: There is an association between breed (poodle vs. shihtzu) and color.

In [19]:
# Run a Chi-Square Test
from scipy.stats import chi2_contingency
chi2, pval, dof, exp = chi2_contingency(Xtab)
print(pval)

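0.005302408293244593

Because this p-value is below 0.05, we conclude that poodles and shihtzus differ significantly in their color distributions.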
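To show where this p-value comes from, the sketch below rebuilds the chi-square statistic from the contingency table by comparing each observed count with the count expected if color and breed were independent. It uses only the standard library. The closed-form p-value at the end is a shortcut that holds only for 4 degrees of freedom, which happens to be the case for this 5 x 2 table; it is not a general replacement for `chi2_contingency`.

```python
from math import exp

# Observed counts from the contingency table above: color -> (poodle, shihtzu)
observed = {
    "black": (17, 10),
    "brown": (13, 36),
    "gold": (8, 6),
    "grey": (52, 41),
    "white": (10, 7),
}

col_totals = [sum(row[j] for row in observed.values()) for j in range(2)]
grand_total = sum(col_totals)

chi2_stat = 0.0
for row in observed.values():
    row_total = sum(row)
    for j, count in enumerate(row):
        # Expected count if color and breed were independent
        expected = row_total * col_totals[j] / grand_total
        chi2_stat += (count - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(col_totals) - 1)  # (5 - 1) * (2 - 1) = 4
# Closed-form chi-square survival function, valid only when dof == 4
chi2_pval = exp(-chi2_stat / 2) * (1 + chi2_stat / 2)
print(chi2_stat, dof, chi2_pval)
```

The largest contribution to the statistic comes from the brown row, where shihtzus are heavily over-represented relative to poodles.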
