# FetchMaker
Congratulations! You’ve just started working at the hottest new tech startup, FetchMaker. FetchMaker’s mission is to match up prospective dog owners with their perfect pet. Data on thousands of adoptable dogs are in FetchMaker’s system, and it’s your job to analyze some of that data.

In [6]:
import pandas as pd
import numpy as np

dogs = pd.read_csv("dog_data.csv")

def get_attribute(breed, attribute):
    if breed in dogs.breed.unique():
        if attribute in dogs.columns:
            return dogs[dogs["breed"] == breed][attribute]
        else:
            raise NameError('Attribute {} does not exist.'.format(attribute))
    else:
        raise NameError('Breed {} does not exist.'.format(breed))


def get_weight(breed):
    return get_attribute(breed, 'weight')
  
def get_tail_length(breed):
    return get_attribute(breed, 'tail_length')

def get_color(breed):
    return get_attribute(breed, 'color')

def get_age(breed):
    return get_attribute(breed, 'age')

def get_is_rescue(breed):
    return get_attribute(breed, 'is_rescue')

def get_likes_children(breed):
    return get_attribute(breed, 'likes_children')

def get_is_hypoallergenic(breed):
    return get_attribute(breed, "is_hypoallergenic")

def get_name(breed):
    return get_attribute(breed, "name")

## Play around with the data
1.
Let’s start by including a data interface called fetchmaker that will give you access to FetchMaker’s dog data.

Use import fetchmaker at the top of your script.py file to import the fetchmaker package.

2.
The attributes that FetchMaker keeps track of are:

* weight, an integer representing how heavy a dog is in pounds
* tail_length, a float representing tail length in inches
* age, in years
* color, a string such as "brown" or "grey"
* is_rescue, a boolean stored as 0 or 1

The fetchmaker package lets you access this data for a specific breed of dog with the following format:

    fetchmaker.get_weight("poodle")
    
This returns a pandas Series containing the weights of the poodles recorded in the system. The other functions are get_tail_length, get_color, get_age, and get_is_rescue, each of which takes a breed as its input.

Get the tail lengths of all of the "rottweiler"s in the system, and store them in a variable called rottweiler_tl.

3.
Print out the mean of rottweiler_tl and the standard deviation of rottweiler_tl, using np.mean and np.std.
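
One detail worth knowing for this step: np.std computes the population standard deviation (ddof=0) by default, while the pandas Series.std method defaults to the sample version (ddof=1). The sketch below illustrates the difference on made-up tail lengths; the numbers are hypothetical and are not taken from dog_data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical tail lengths in inches (illustration only, not FetchMaker data).
tail_lengths = pd.Series([3.5, 4.0, 4.5, 5.0, 6.0])
values = tail_lengths.to_numpy()

mean_tl = np.mean(values)
pop_std = np.std(values)           # population standard deviation, ddof=0
sample_std = tail_lengths.std()    # pandas default is the sample version, ddof=1

print(mean_tl, pop_std, sample_std)
```

For a breed with many recorded dogs the two values are close, but they are not interchangeable for small samples.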

## Data to the rescue
4.
Over the years, about 8% of the dogs in the FetchMaker system have been rescues, so that is the rate we expect. We want to know if whippets are significantly more or less likely than that to be rescues.

Store the is_rescue values for "whippet"s in a variable called whippet_rescue.

5.
Use np.count_nonzero to get the number of entries in whippet_rescue that are 1. Store this number in a variable called num_whippet_rescues.

6.
Get the number of samples in the whippet set by taking the np.size of whippet_rescue. Store this in a variable called num_whippets.

7.
Use a binomial test to test the number of whippet rescues, num_whippet_rescues, against our expected percentage, 8%.

Remember to import the binomial test by using from scipy.stats import binom_test.

8.
Print out the p-value. Is your result significant?
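
As a rough sketch of steps 5 through 8: newer SciPy releases provide binomtest, which returns a result object with a pvalue attribute, and binom_test was removed in SciPy 1.12. The counts below are hypothetical placeholders, not the real whippet numbers, and the 0.05 threshold is an assumed significance level.

```python
from scipy.stats import binomtest

# Hypothetical counts for illustration; replace with num_whippet_rescues and num_whippets.
num_rescues = 6
num_dogs = 100
expected_rate = 0.08  # the 8% rescue rate stated in the task

result = binomtest(num_rescues, num_dogs, p=expected_rate)
p_value = result.pvalue

# Assumed significance level of 0.05.
is_significant = p_value < 0.05
print(p_value, is_significant)
```

A p-value below the chosen threshold would suggest that the breed's rescue rate differs from 8%; a larger one means the data are consistent with the expected rate.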

## Size does matter
9.
Three of our most popular mid-sized dog breeds are whippets, terriers, and pitbulls. Is there a significant difference in the average weights of these three dog breeds? Perform a comparative numerical test to determine if there is a significant difference.


Hint
Use ANOVA for this scenario. First, use the line from scipy.stats import f_oneway to import SciPy’s ANOVA function.
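
A minimal sketch of a one-way ANOVA with f_oneway follows. It uses synthetic weights generated with NumPy rather than FetchMaker data; the group means, spread, and sample sizes are assumptions chosen so that one group clearly differs.

```python
import numpy as np
from scipy.stats import f_oneway

# Synthetic weights in pounds (illustration only).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=30, scale=5, size=100)
group_b = rng.normal(loc=30, scale=5, size=100)
group_c = rng.normal(loc=45, scale=5, size=100)  # deliberately heavier group

# f_oneway accepts one array per group and returns the F statistic and p-value.
f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)
```

A small p-value says that at least one group mean differs from the others, but not which one.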

10.
Now, perform another test to determine which of the pairs of these dog breeds differ from each other.
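
One way to compare every pair of groups is Tukey's range test. The sketch below uses scipy.stats.tukey_hsd, which assumes SciPy 1.8 or later; pairwise_tukeyhsd from statsmodels is another option. The data are synthetic and the group parameters are assumptions for illustration.

```python
import numpy as np
from scipy.stats import tukey_hsd

# Synthetic weights in pounds (illustration only).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=30, scale=5, size=100)
group_b = rng.normal(loc=30, scale=5, size=100)
group_c = rng.normal(loc=45, scale=5, size=100)  # deliberately heavier group

res = tukey_hsd(group_a, group_b, group_c)

# res.pvalue[i, j] is the adjusted p-value for the comparison of group i with group j.
pvals = res.pvalue
print(pvals[0, 1], pvals[0, 2], pvals[1, 2])
```
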



## Categorical dog test
11.
We want to see if "poodle"s and "shihtzu"s have significantly different color breakdowns.

Get the poodle colors and store them in a variable called poodle_colors.

Get the shih tzu colors and store them in a variable called shihtzu_colors.



12.
You can get the number of occurrences of brown poodles by using np.count_nonzero(poodle_colors == "brown").

Use this function to build a Chi Square contingency table, called color_table, with the following structure:

|       | Poodle | Shih Tzu |
|-------|--------|----------|
| Black | x      | x        |
| Brown | x      | x        |
| Gold  | x      | x        |
| Grey  | x      | x        |
| White | x      | x        |

Fill in the “x” entries with the number of poodles or shih tzus of the specified color.
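
The table can also be built with a loop over the color names instead of writing out each count by hand. This is a sketch on two small hypothetical arrays (breed_a and breed_b are stand-ins, not real FetchMaker data):

```python
import numpy as np

colors = ["black", "brown", "gold", "grey", "white"]

# Hypothetical color samples for two breeds (illustration only).
breed_a = np.array(["black", "brown", "brown", "grey", "white", "grey"])
breed_b = np.array(["brown", "gold", "grey", "grey", "black", "brown"])

# One row per color, one column per breed.
color_table_sketch = [
    [int(np.count_nonzero(breed_a == c)), int(np.count_nonzero(breed_b == c))]
    for c in colors
]

print(color_table_sketch)  # [[1, 1], [2, 2], [0, 1], [2, 2], [1, 0]]
```
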



13.
Feed your color_table into SciPy’s Chi Square test, save the p-value and print it out.

Is there a significant difference?
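
As a sketch of this step, chi2_contingency returns the test statistic, the p-value, the degrees of freedom, and the table of expected counts. The counts below are the ones printed in this notebook's output for poodles and shih tzus; the 0.05 threshold is an assumed significance level.

```python
from scipy.stats import chi2_contingency

# Rows: black, brown, gold, grey, white. Columns: poodle, shih tzu.
table = [[17, 10], [13, 36], [8, 6], [52, 41], [10, 7]]

chi2, p_value, dof, expected = chi2_contingency(table)

# Degrees of freedom are (rows - 1) * (columns - 1) = 4 for a 5x2 table.
print(chi2, p_value, dof)

# Assumed significance level of 0.05.
print(p_value < 0.05)
```
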




14.


Feel free to play around with fetchmaker more and run some hypothesis tests of your own.

The breeds you can explore are "poodle", "rottweiler", "whippet", "greyhound", "terrier", "chihuahua", "shihtzu", and "pitbull".

In [7]:

import numpy as np
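# import fetchmaker  (not needed here: the accessor functions are defined in the cell above)
# Note: binom_test was removed in SciPy 1.12; newer releases provide binomtest instead.
from scipy.stats import binom_test, f_oneway, chi2_contingency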
from statsmodels.stats.multicomp import pairwise_tukeyhsd

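# Tasks 2-3: tail lengths of all rottweilers, then their mean and standard deviation
rottweiler_tl = get_tail_length('rottweiler')

print(np.mean(rottweiler_tl))

print(np.std(rottweiler_tl))

# Tasks 4-8: are whippets rescued at a rate different from the expected 8%?
whippet_rescue = get_is_rescue("whippet")

num_whippet_rescues = np.count_nonzero(whippet_rescue)

num_whippets = np.size(whippet_rescue)

print(binom_test(num_whippet_rescues, num_whippets, 0.08))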

# Task 9: one-way ANOVA on the weights of the three mid-sized breeds
w = get_weight('whippet')
t = get_weight('terrier')
p = get_weight('pitbull')

print(f_oneway(w, t, p).pvalue)

# Task 10: Tukey's range test shows which pairs of breeds differ
values = np.concatenate([w, t, p])
labels = ['whippet'] * len(w) + ['terrier'] * len(t) + ['pitbull'] * len(p)
print(pairwise_tukeyhsd(values, labels, 0.05))

# Tasks 11-12: color counts per breed, one row per color and one column per breed
poodle_colors = get_color("poodle")
shihtzu_colors = get_color("shihtzu")

color_table = [
    [np.count_nonzero(poodle_colors == 'black'),
     np.count_nonzero(shihtzu_colors == 'black')],

    [np.count_nonzero(poodle_colors == 'brown'),
     np.count_nonzero(shihtzu_colors == 'brown')],

    [np.count_nonzero(poodle_colors == 'gold'),
     np.count_nonzero(shihtzu_colors == 'gold')],

    [np.count_nonzero(poodle_colors == 'grey'),
     np.count_nonzero(shihtzu_colors == 'grey')],

    [np.count_nonzero(poodle_colors == 'white'),
     np.count_nonzero(shihtzu_colors == 'white')],
]

print(color_table)

_, color_pval, _, _ = chi2_contingency(color_table)

print(color_pval)


4.2361
2.0647536874891395
0.5811780106238098
3.276415588274815e-17
 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
 group1  group2 meandiff p-adj   lower  upper  reject
-----------------------------------------------------
pitbull terrier   -13.24  0.001 -16.728 -9.752   True
pitbull whippet    -3.34 0.0639  -6.828  0.148  False
terrier whippet      9.9  0.001   6.412 13.388   True
-----------------------------------------------------
[[17, 10], [13, 36], [8, 6], [52, 41], [10, 7]]
0.005302408293244593
