# Sampling Costs vs Risk Costs
Sprint 7: Sampling, Instruments, and Bias 

By: Jon Honda

# 1. Skill Story / Skills / Questions
As an engineer using stormwater quality data to design treatment systems, I need to understand how small sample sizes affect uncertainty, so that I can develop effective tools to communicate how additional sampling costs can result in reduced stormwater treatment costs. However, the data I work with is very sensitive, so it will be difficult to use here. Instead, I will find a publicly available data set and create a scenario similar to the one I face as an engineer.

# 2. Example Projects, Models, and Benchmarks

## 2.a Minimum Samples for a Certain Margin of Error
http://www.dummies.com/education/math/statistics/how-to-determine-the-minimum-size-needed-for-a-statistical-sample/  
*Suppose you are getting ready to do your own survey to estimate a population mean; wouldn’t it be nice to see ahead of time what sample size you need to get the margin of error you want? Thinking ahead will save you money and time and it will give you results you can live with in terms of the margin of error — you won’t have any surprises later.*

Here is the formula that links sample size (n) with the margin of error and confidence level surrounding the estimated population mean.

In [5]:
%%latex   
\begin{align}n \geq \left(\frac{z \cdot \sigma}{\mathrm{MOE}}\right)^2\end{align}


#### Explanation of terms:
n: number of samples  
z: z value (critical value) for the chosen confidence level (CL); it tells how many standard deviations away from the mean the bounds of the central CL% of a standard normal distribution are  
σ (sigma): standard deviation of the population (in practice, estimated from the sample)  
MOE: margin of error, the largest acceptable distance between the estimated mean and the true mean  

#### BUT WHAT DOES IT MEAN????
**Try this explanation on for size:**  
 If your sample size is at least n, then CL% of the time the sample average will fall within +/- MOE of the true population mean.
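
To make the formula concrete, here is a minimal sketch of it in Python. The function name `min_sample_size` and the example values (sigma of 1.5, MOE of 0.25) are my own assumptions for illustration, and it assumes a two-sided confidence level. It uses the standard library's `statistics.NormalDist` to get z, so it runs without any extra packages.

```python
import math
from statistics import NormalDist

def min_sample_size(sigma, moe, confidence=0.95):
    """Smallest n with n >= (z * sigma / MOE)^2, for a two-sided confidence level."""
    # Two-sided critical value: (1 - confidence) / 2 of the probability sits in each tail.
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return math.ceil((z * sigma / moe) ** 2)

# Hypothetical inputs: scores with a standard deviation of 1.5 and a target MOE of 0.25.
n_95 = min_sample_size(sigma=1.5, moe=0.25, confidence=0.95)
n_90 = min_sample_size(sigma=1.5, moe=0.25, confidence=0.90)
n_95_half_moe = min_sample_size(sigma=1.5, moe=0.125, confidence=0.95)
print(n_95, n_90, n_95_half_moe)
```

Because MOE is squared in the denominator, halving the margin of error roughly quadruples the number of samples needed, which is the cost trade-off this sprint is about.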

## 2.b Calculation Steps
https://www.wikihow.com/Calculate-Confidence-Interval  
Steps given on this site include:
1.  Select a sample from your chosen population.  
2.  Calculate your sample mean and sample standard deviation.  
3.  Choose your desired confidence level.  
4.  Calculate your margin of error (using the formula in 2.a, rearranged as MOE = z * sigma / sqrt(n)).  
    To find the critical value, z, convert the confidence level percentage to a decimal. Then use a Z table to look up the value.  http://math.arizona.edu/~rsims/ma464/standardnormaltable.pdf  
    *It turns out there are different ways to do this step. I modified the wikiHow directions to use the more common approach.*
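
The four steps can be sketched in a few lines of Python. This is only an illustration: the `sample` values are made up, and I am assuming the two-sided lookup, where a 95% confidence level puts 2.5% in each tail. For a sample this small a t critical value would usually be preferred over z; z is used here only to stay consistent with the formula in 2.a.

```python
import math
import statistics
from statistics import NormalDist

# Step 1: select a sample (made-up review scores, for illustration only).
sample = [8.3, 7.5, 9.2, 6.7, 10.0, 8.8, 7.9, 9.6, 8.1, 7.1]
n = len(sample)

# Step 2: sample mean and sample standard deviation.
mean = statistics.mean(sample)
s = statistics.stdev(sample)

# Step 3: choose the confidence level.
confidence = 0.95

# Step 4: critical value and margin of error, MOE = z * s / sqrt(n).
z = NormalDist().inv_cdf((1 + confidence) / 2)
moe = z * s / math.sqrt(n)

ci = (mean - moe, mean + moe)
print(f"mean={mean:.2f}, s={s:.2f}, z={z:.3f}, MOE={moe:.2f}")
print(f"{confidence:.0%} confidence interval: {ci[0]:.2f} to {ci[1]:.2f}")
```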
   
    


# 3. Technical Prototyping

## 3.a EDA Kaggle Hotel Data
(Is this data set going to work for my needs?)

I'm going to try doing the sample size work using the European hotel review data on Kaggle.  
https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe  
Maybe set up a story where we have to pay to stay at a hotel every time we do a review.
We want to know how much we'll need to spend in order to get a good idea of hotel quality in Europe.

### Obtain the data, write to dataframe (df)

In [6]:
# https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe
import pandas as pd
df = pd.read_csv('_jonhonda_dat\\Hotel_Reviews.csv')

### General data review

In [7]:
print(df.columns)
print('number of hotel reviews: ', len(df))
display(df.head(2))
ls_Hotels = df.Hotel_Name.unique()  # unique values in the Hotel_Name column
print('number of hotels reviewed: ', len(ls_Hotels))

Index(['Hotel_Address', 'Additional_Number_of_Scoring', 'Review_Date',
       'Average_Score', 'Hotel_Name', 'Reviewer_Nationality',
       'Negative_Review', 'Review_Total_Negative_Word_Counts',
       'Total_Number_of_Reviews', 'Positive_Review',
       'Review_Total_Positive_Word_Counts',
       'Total_Number_of_Reviews_Reviewer_Has_Given', 'Reviewer_Score', 'Tags',
       'days_since_review', 'lat', 'lng'],
      dtype='object')
number of hotel reviews:  515738


Unnamed: 0,Hotel_Address,Additional_Number_of_Scoring,Review_Date,Average_Score,Hotel_Name,Reviewer_Nationality,Negative_Review,Review_Total_Negative_Word_Counts,Total_Number_of_Reviews,Positive_Review,Review_Total_Positive_Word_Counts,Total_Number_of_Reviews_Reviewer_Has_Given,Reviewer_Score,Tags,days_since_review,lat,lng
0,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,8/3/2017,7.7,Hotel Arena,Russia,I am so angry that i made this post available...,397,1403,Only the park outside of the hotel was beauti...,11,7,2.9,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968
1,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,8/3/2017,7.7,Hotel Arena,Ireland,No Negative,0,1403,No real complaints the hotel was great great ...,105,7,7.5,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968


number of hotels reviewed:  1492


### Summary statistics

In [8]:
import numpy as np
df.groupby('Hotel_Name')['Reviewer_Score'].agg([np.min, np.max, np.sum, np.mean, np.std, len])

| Hotel_Name | amin | amax | sum | mean | std | len |
|---|---|---|---|---|---|---|
| 11 Cadogan Gardens | 4.2 | 10.0 | 1406.4 | 8.845283 | 1.386494 | 159.0 |
| 1K Hotel | 3.8 | 10.0 | 1163.5 | 7.861486 | 1.667232 | 148.0 |
| 25hours Hotel beim MuseumsQuartier | 2.5 | 10.0 | 6189.5 | 8.983309 | 1.224922 | 689.0 |
| 41 | 6.7 | 10.0 | 1000.3 | 9.711650 | 0.590497 | 103.0 |
| 45 Park Lane Dorchester Collection | 8.3 | 10.0 | 268.9 | 9.603571 | 0.564035 | 28.0 |
| 88 Studios | 3.3 | 10.0 | 3896.5 | 8.489107 | 1.401501 | 459.0 |
| 9Hotel Republique | 2.5 | 10.0 | 1600.1 | 8.743716 | 1.460864 | 183.0 |
| A La Villa Madame | 3.8 | 10.0 | 363.0 | 8.853659 | 1.422339 | 41.0 |
| ABaC Restaurant Hotel Barcelona GL Monumento | 2.5 | 10.0 | 262.4 | 8.464516 | 1.975947 | 31.0 |
| AC Hotel Barcelona Forum a Marriott Lifestyle Hotel | 2.5 | 10.0 | 2312.4 | 8.001384 | 1.637303 | 289.0 |
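
As a rough check of what these standard deviations imply, the 2.a formula can be applied hotel by hotel. This is a minimal sketch, assuming a target MOE of 0.25 score points at a two-sided 95% confidence level (both are my assumptions), with a few `std` values copied by hand from the summary output above rather than read from the dataframe.

```python
import math
from statistics import NormalDist

# std values copied from the summary statistics output above.
hotel_std = {
    "11 Cadogan Gardens": 1.386494,
    "41": 0.590497,
    "45 Park Lane Dorchester Collection": 0.564035,
    "ABaC Restaurant Hotel Barcelona GL Monumento": 1.975947,
}

TARGET_MOE = 0.25   # assumed target margin of error, in score points
CONFIDENCE = 0.95   # assumed two-sided confidence level

z = NormalDist().inv_cdf((1 + CONFIDENCE) / 2)

# n >= (z * sigma / MOE)^2, rounded up to a whole number of reviews.
needed = {
    name: math.ceil((z * std / TARGET_MOE) ** 2)
    for name, std in hotel_std.items()
}

for name, n in sorted(needed.items(), key=lambda item: item[1]):
    print(f"{n:4d} reviews  {name}")
```

Hotels with more scattered scores need far more reviews for the same margin of error, so a single sample size for every hotel would either overspend on consistent hotels or underspend on inconsistent ones.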


## 3.b Finding stats library
I need a Python stats library that works with pandas DataFrames.
I came across SciPy, an apparently widely used library for stats.  
https://www.scipy.org/  
https://stackoverflow.com/questions/20864847/probability-to-z-score-and-vice-versa-in-python  



In [11]:
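import scipy.stats as st
# z-score for a cumulative probability (here 0.95); not shown below because a cell only echoes its last expression
st.norm.ppf(.95)
# cumulative probability for a z-score
st.norm.cdf(1.64)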

0.94949741652589625
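
One thing worth checking here is the difference between one-sided and two-sided lookups, since `st.norm.ppf(.95)` puts all 5% in one tail. The sketch below uses the standard library's `statistics.NormalDist` as a stand-in so it runs without SciPy; its `inv_cdf` and `cdf` methods play the same role as `st.norm.ppf` and `st.norm.cdf`. The variable names are my own.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# One-sided: 95% of the distribution lies below this z (same idea as st.norm.ppf(.95)).
z_one_sided = std_normal.inv_cdf(0.95)

# Two-sided: 95% lies between -z and +z, so each tail holds 2.5%.
z_two_sided = std_normal.inv_cdf(0.975)

# Central coverage of +/- 1.64 standard deviations.
coverage_164 = std_normal.cdf(1.64) - std_normal.cdf(-1.64)

print(f"one-sided 95% z: {z_one_sided:.3f}")
print(f"two-sided 95% z: {z_two_sided:.3f}")
print(f"coverage of +/-1.64: {coverage_164:.3f}")
```

So a z of about 1.64 corresponds to a two-sided confidence level of roughly 90%, not 95%, which matters when plugging z into the sample size formula.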

# 4. Project Scope
I propose to learn about:
1. Ways to describe uncertainty as it relates to sample size.
2. The mathematical relationships between sample size and uncertainty. Hopefully there is a way to do something like this:
   Given the current sample size, identify the probability that the average obtained through sampling is not the "true" average, or that the observed 60th percentile value is not the true 60th percentile value, etc. Basically, the probability that the value used in our design is not sufficiently close to the true value, so that we should do something to accommodate this uncertainty.
3. How increasing the sample size reduces the probability that our average (or other statistic) based on observation is not the "true" value.
4. How to build a simulator that demonstrates that spending X dollars on more sampling results in a Y reduction in necessary remedial actions.
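
Items 2 and 3 can be sketched for the simplest case, the sample mean. The code below is a rough illustration under assumptions of my own: scores are treated as normally distributed, "not sufficiently close" is defined as missing the true mean by more than a chosen `tolerance`, and the numbers (mean 8.5, sigma 1.5, tolerance 0.5) are made up. It compares the normal-approximation probability with a small simulation.

```python
import math
import random
from statistics import NormalDist

def miss_probability(n, sigma, tolerance):
    """P(|sample mean - true mean| > tolerance) under a normal approximation."""
    z = tolerance * math.sqrt(n) / sigma
    return 2 * (1 - NormalDist().cdf(z))

def simulated_miss_rate(n, mu, sigma, tolerance, trials=2000, seed=0):
    """Fraction of simulated samples whose mean misses mu by more than tolerance."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        sample_mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if abs(sample_mean - mu) > tolerance:
            misses += 1
    return misses / trials

for n in (5, 20, 80):
    approx = miss_probability(n, sigma=1.5, tolerance=0.5)
    simulated = simulated_miss_rate(n, mu=8.5, sigma=1.5, tolerance=0.5)
    print(f"n={n:3d}  approx={approx:.3f}  simulated={simulated:.3f}")
```

Percentiles such as the 60th do not have an equally simple formula, so for those the simulation route is probably the one to extend.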

## THE STORY:
I'm going to try doing the sample size work using the European hotel review data on Kaggle.  
https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe  
Maybe set up a story where we have to pay to stay at a hotel every time we do a review.
Let's pretend we are a travel agency that books hotels in Europe. We base our pricing on our user rating system, e.g. a 3 star hotel rating means you pay $300 per night; a 5 star rating means you pay $500 per night. If you stay and feel you didn't get the experience promised by our rating system, then we'll pay you the difference between what our rating system said you should get and what you actually got. E.g. for a 5 star rated hotel you pay $500, but you only got a 3 star experience, so we are out $200.

We developed the ratings by paying a small group of people to stay at each hotel. We want to know how many stays at each hotel are necessary to be confident that our rating system is correct, so that we minimize the amount that we pay out to dissatisfied customers.

We want to know how much we'll need to spend in order to get a good idea of hotel quality in Europe.
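
The pricing and payout rule in the story can be written down as a tiny sketch. The $100 per star rate is inferred from the two examples above ($300 for 3 stars, $500 for 5 stars), and the function names are hypothetical. I am also assuming that nothing is paid out when the experience meets or beats the rating.

```python
PRICE_PER_STAR = 100  # assumed from the examples: 3 stars -> $300, 5 stars -> $500

def nightly_price(stars):
    """Price charged per night for a hotel with the given star rating."""
    return stars * PRICE_PER_STAR

def payout(rated_stars, experienced_stars):
    """Amount refunded when the experience falls short of the rating."""
    shortfall = nightly_price(rated_stars) - nightly_price(experienced_stars)
    return max(0, shortfall)  # no refund if the stay met or exceeded the rating

print(payout(rated_stars=5, experienced_stars=3))
print(payout(rated_stars=3, experienced_stars=5))
```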




# The approach
1. Send out n reviewers to sample the hotels.
   Simulate going out to the real world by drawing n sample points from the Kaggle data.
2. Generate statistics for the reviewers' hotel findings (avg rating, std dev).
3. Build an estimated probability distribution using the reviewers' findings.
4. Run the pricing model using the reviewer-generated probability distribution to find the "best" price point.
5. Run the pricing model on the entire Kaggle data set using the "best" price point to see if we make money or not.
6. Do 1-5 over and over with different numbers of reviewers to see how changing the number of reviewers affects the pricing outcome.
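
A first pass at this loop could look like the sketch below. It is heavily simplified and rests on assumptions of my own: a synthetic population of scores stands in for one hotel's Kaggle reviews, two score points are treated as one star, the cost of a review stay is a made-up $300, and steps 3 and 4 are skipped by pricing the hotel directly at the reviewers' average score.

```python
import random

PRICE_PER_STAR = 100        # assumed from the story's examples (3 stars -> $300)
COST_PER_REVIEW_STAY = 300  # hypothetical cost of sending one reviewer

def make_population(size=2000, seed=1):
    """Synthetic stand-in for one hotel's Reviewer_Score values (0-10 scale)."""
    rng = random.Random(seed)
    return [min(10.0, max(0.0, rng.gauss(8.0, 1.5))) for _ in range(size)]

def average_refund(population, rated_score):
    """Average payout per customer, assuming 2 score points equal 1 star."""
    total = sum(max(0.0, rated_score - s) / 2 * PRICE_PER_STAR for s in population)
    return total / len(population)

def simulate(population, n_reviewers, trials=500, seed=0):
    """Repeat steps 1, 2 and 5 many times and average the outcomes."""
    rng = random.Random(seed)
    true_mean = sum(population) / len(population)
    abs_errors, refunds = [], []
    for _ in range(trials):
        sample = rng.sample(population, n_reviewers)       # step 1: send n reviewers
        rated = sum(sample) / n_reviewers                  # step 2: average rating
        abs_errors.append(abs(rated - true_mean))
        refunds.append(average_refund(population, rated))  # step 5: test on everyone
    return {
        "sampling_cost": n_reviewers * COST_PER_REVIEW_STAY,
        "mean_abs_rating_error": sum(abs_errors) / trials,
        "mean_refund_per_customer": sum(refunds) / trials,
    }

population = make_population()
results = {n: simulate(population, n) for n in (5, 20, 80)}  # step 6: vary n
for n, outcome in results.items():
    print(n, outcome)
```

On this synthetic data the rating error shrinks as more reviewers are sent, while the sampling cost grows linearly. Whether the refund savings justify that cost is the question the real pricing model still has to answer.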

## THOUGHTS  
What does confidence level actually mean?? How is it used with margin of error???

#### From the following website:
https://www.isixsigma.com/tools-templates/sampling-data/margin-error-and-confidence-levels-made-simple/  
*How well the sample represents the population is gauged by two important statistics – the survey’s margin of error and confidence level. They tell us how well the spoonfuls represent the entire pot. **For example, a survey may have a margin of error of plus or minus 3 percent at a 95 percent level of confidence. These terms simply mean that if the survey were conducted 100 times, the data would be within a certain number of percentage points above or below the percentage reported in 95 of the 100 surveys***  

*In other words, Company X surveys customers and finds that 50 percent of the respondents say its customer service is “very good.” The confidence level is cited as 95 percent plus or minus 3 percent. This information means that if the survey were conducted 100 times, the percentage who say service is “very good” will range between 47 and 53 percent most (95 percent) of the time.*

The website goes on to say that MOE and confidence intervals are linked with number of samples!!!  
*Margin of error – the plus or minus 3 percentage points in the above example – decreases as the sample size increases, but only to a point. A very small sample, such as 50 respondents, has about a 14 percent margin of error while a sample of 1,000 has a margin of error of 3 percent. The size of the population (the group being surveyed) does not matter. (This statement assumes that the population is larger than the sample.) There are, however, diminishing returns. By doubling the sample to 2,000, the margin of error only decreases from plus or minus 3 percent to plus or minus 2 percent. Although a 95 percent level of confidence is an industry standard, a 90 percent level may suffice in some instances. A 90 percent level can be obtained with a smaller sample, which usually translates into a less expensive survey. To obtain a 3 percent margin of error at a 90 percent level of confidence requires a sample size of about 750. For a 95 percent level of confidence, the sample size would be about 1,000.*
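
The numbers in that quote can be roughly reproduced with the standard margin of error formula for a proportion. This is a sketch under the assumption that the article uses the worst case p = 0.5 and a two-sided z value; the function names are my own.

```python
import math
from statistics import NormalDist

def proportion_moe(n, confidence=0.95, p=0.5):
    """Margin of error for a proportion: z * sqrt(p * (1 - p) / n)."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return z * math.sqrt(p * (1 - p) / n)

def proportion_sample_size(moe, confidence=0.95, p=0.5):
    """Smallest n that achieves the given margin of error for a proportion."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / moe ** 2)

for n in (50, 1000, 2000):
    print(f"n={n:5d}  MOE={proportion_moe(n):.1%}")

print("n for 3% MOE at 90%:", proportion_sample_size(0.03, confidence=0.90))
print("n for 3% MOE at 95%:", proportion_sample_size(0.03, confidence=0.95))
```

The results land close to the article's rounded figures (about 14%, 3% and 2%, and sample sizes of about 750 and 1,000), which suggests the same formula sits behind them.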

#### PYTHON STATISTICS:
http://people.duke.edu/~ccc14/sta-663/index.html




# CODE IT!!!!

In [2]:
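import pandas as pd
import numpy as np

df = pd.read_csv('_jonhonda_dat\\Hotel_Reviews.csv')
HotelGrp = df.groupby('Hotel_Name')  # group the reviews by hotel

# First attempt: HotelGrp.apply(lambda aHotel: print (aHotel['Hotel_Name']))
# It printed 'here' (a debug print) and then failed with:
#   TypeError: <lambda>() got an unexpected keyword argument 'axis'
# Looping over the groups avoids passing a print-only function to apply.
for hotel_name, aHotel in HotelGrp:
    print(hotel_name, len(aHotel))  # hotel name and its number of reviews

# FacGroup = pdFaccp.groupby('Facility_ID') #group combo options by facility_id
# return FacGroup.apply(lambda aFac:  aFac.iloc[random.randint(0,aFac.shape[0]-1)]) #randomly select a combo option for each facility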