<a href="https://colab.research.google.com/github/johnwesleyharding/DS-Unit-1-Sprint-2-Statistics/blob/master/JWH_assignment_DS_122_Sampling_Confidence_Intervals_and_Hypothesis_Testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Assignment - Build a confidence interval

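A confidence interval refers to a neighborhood around some point estimate, the size of which is determined by the desired confidence level (together with the variability and size of the sample). For instance, we might say that 52% of Americans prefer tacos to burritos, with a 95% confidence interval of +/- 5%.

52% (0.52) is the point estimate, and +/- 5% (the interval $[0.47, 0.57]$) is the confidence interval. "95% confidence" corresponds to a significance level of $\alpha = 1 - 0.95 = 0.05$: a two-tailed test rejects a null value at that level exactly when the value falls outside the interval, i.e. when its p-value is below $0.05$.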

In this case, the confidence interval includes $0.5$ - which is the natural null hypothesis (that half of Americans prefer tacos and half burritos, thus there is no clear favorite). So in this case, we could use the confidence interval to report that we've failed to reject the null hypothesis.
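
As a rough check of the taco example, the sketch below computes a normal-approximation interval for a proportion. The sample size `n = 384` is an assumption chosen so that the margin comes out near 5%; the example above does not state one, and `proportion_ci` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def proportion_ci(p_hat, n, confidence=0.95):
    """Normal-approximation (Wald) interval for a sample proportion."""
    # Two-tailed critical value, about 1.96 for 95% confidence
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# n = 384 is a hypothetical sample size, not a figure from the text
lower, upper = proportion_ci(0.52, 384)
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
print("Contains the null value 0.5:", lower <= 0.5 <= upper)
```

Because 0.5 falls inside the interval, this hypothetical sample would not reject the "no clear favorite" null hypothesis at the 5% level.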

But providing the full analysis with a confidence interval, including a graphical representation of it, can be a helpful and powerful way to tell your story. Done well, it is also more intuitive to a layperson than simply saying "fail to reject the null hypothesis" - it shows that in fact the data does *not* give a single clear result (the point estimate) but a whole range of possibilities.

How is a confidence interval built, and how should it be interpreted? It does *not* mean that 95% of the data lies in that interval - instead, the frequentist interpretation is "if we were to repeat this experiment 100 times and build an interval from each sample, we would expect ~95 of those intervals to contain the true population value."
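
One way to get a feel for the repeated-sampling interpretation is to simulate it. The sketch below draws many samples from a population whose mean is known, builds an interval from each, and counts how often the interval contains that known mean. The population parameters, sample size, and the helper name `coverage` are all assumptions made for illustration, and the interval uses the normal critical value rather than the t-distribution.

```python
import random
import statistics
from statistics import NormalDist

def coverage(true_mean=10.0, sd=2.0, n=100, repeats=1000, confidence=0.95, seed=0):
    """Fraction of repeated-sample intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        mean = statistics.fmean(sample)
        stderr = statistics.stdev(sample) / n ** 0.5
        # Count the interval as a hit when it covers the known population mean
        if mean - z * stderr <= true_mean <= mean + z * stderr:
            hits += 1
    return hits / repeats

print(f"Intervals containing the true mean: {coverage():.1%}")
```

The printed fraction should land close to the nominal 95%, though not exactly on it, since the simulation itself is random.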

For a 95% confidence interval and a normal(-ish) distribution, you can simply remember that +/-2 standard deviations contains 95% of the probability mass. The 95% confidence interval for a mean based on a given sample is therefore centered at the sample mean (the point estimate) and extends +/- 2 (or more precisely 1.96) standard errors of the mean, where the standard error is the sample standard deviation divided by $\sqrt{n}$.
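
The "2 versus 1.96" rule of thumb can be checked directly against the standard normal distribution. This is a small sketch using only the Python standard library; `mass_within` is a hypothetical helper name.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def mass_within(k):
    """Probability that a standard normal value falls within +/- k."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

print(f"+/- 1.96: {mass_within(1.96):.4f}")
print(f"+/- 2.00: {mass_within(2.0):.4f}")
```

The first value is very nearly 0.95, while +/- 2 covers slightly more, which is why 2 works as a convenient approximation.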

Different confidence levels (90%, 99%) and different distributional assumptions change the multiplier (the critical value) used to build the interval, but the overall process and interpretation (with a frequentist approach) will be the same.
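
To illustrate how the multiplier changes, the sketch below prints two-tailed critical values for a few confidence levels, from both the normal distribution and a t-distribution. The sample size of 10 and the helper name `t_critical` are assumptions for illustration.

```python
from scipy import stats

def t_critical(confidence, n):
    """Two-tailed t critical value for a sample of size n."""
    return stats.t.ppf((1 + confidence) / 2, df=n - 1)

for confidence in (0.90, 0.95, 0.99):
    z = stats.norm.ppf((1 + confidence) / 2)
    print(f"{confidence:.0%}: z = {z:.3f}, t (n=10) = {t_critical(confidence, 10):.3f}")
```

Higher confidence levels and smaller samples both push the critical value up, which widens the interval.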

Your assignment - using the data from the prior module ([congressional voting records](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records)):


### Confidence Intervals:
1. Generate and numerically represent a confidence interval
2. Graphically (with a plot) represent the confidence interval
3. Interpret the confidence interval - what does it tell you about the data and its distribution?




In [0]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
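# Import the scipy.stats package itself as `stats`, so that stats.sem,
# stats.t and stats.chi2_contingency are all available below
from scipy import stats
from scipy.stats import chisquare, t, normaltest, kruskal
from scipy.stats import ttest_ind, ttest_ind_from_stats, ttest_rel, ttest_1samp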

In [2]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data

df = pd.read_csv('house-votes-84.data', 
                 header=None,
                 names=['party','handicapped-infants','water-project',
                          'budget','physician-fee-freeze', 'el-salvador-aid',
                          'religious-groups','anti-satellite-ban',
                          'aid-to-contras','mx-missile','immigration',
                          'synfuels', 'education', 'right-to-sue','crime','duty-free',
                          'south-africa'])

df = df.replace({'?': np.nan, 'n': 0, 'y': 1})

dem = df[df['party'] == "democrat"]
rep = df[df['party'] == "republican"]

--2019-10-09 15:25:25--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data.7’


2019-10-09 15:25:26 (286 KB/s) - ‘house-votes-84.data.7’ saved [18171/18171]



In [0]:
def confidence_interval(data, confidence=0.95):
  """
  Calculate a confidence interval around a sample mean for given data.
  Using t-distribution and two-tailed test, default 95% confidence. 
  
  Arguments:
    data - iterable (list or numpy array) of sample observations
    confidence - level of confidence for the interval
  
  Returns:
    tuple of (mean, lower bound, upper bound)
  """
  data = np.array(data)
  mean = np.mean(data)
  n = len(data)
  stderr = stats.sem(data)
  margin = stderr * stats.t.ppf((1 + confidence) / 2.0, n - 1)
  return (mean, mean - margin, mean + margin)

In [4]:
(smean, cimin, cimax) = confidence_interval(df['south-africa'].dropna(), confidence = .95)
print(f'Confidence Interval: {cimin} to {cimax} \n')

AttributeError: ignored

In [0]:
# def showconfidence(series):

#   (smean, cimin, cimax) = confidence_interval(df[series].dropna(), confidence = .95)
  
#   sns.distplot(df[series].dropna())
#   plt.axvline(x = cimin, color = 'orange')
#   plt.axvline(x = cimax, color = 'orange')
#   plt.axvline(x = smean, color = 'g')
#   plt.show()
  
#   print(f'Confidence Interval: {cimin} to {cimax} \n')

In [0]:
# showconfidence('south-africa')
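
For the graphical part of the assignment, here is a minimal sketch that uses plain matplotlib and a short list of hypothetical votes in place of a real column. The function name `plot_confidence` and the example data are assumptions, not part of the original notebook.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def plot_confidence(votes, confidence=0.95, ax=None):
    """Histogram of 0/1 votes with the mean and its confidence bounds marked."""
    votes = np.asarray(votes, dtype=float)
    mean = votes.mean()
    margin = stats.sem(votes) * stats.t.ppf((1 + confidence) / 2, len(votes) - 1)
    if ax is None:
        _, ax = plt.subplots()
    ax.hist(votes, bins=2, range=(0, 1), color="lightgray")
    ax.axvline(mean - margin, color="orange", label="lower bound")
    ax.axvline(mean + margin, color="orange", label="upper bound")
    ax.axvline(mean, color="green", label="sample mean")
    ax.set_xlabel("vote (0 = no, 1 = yes)")
    ax.legend()
    return mean, mean - margin, mean + margin

# Hypothetical votes standing in for one column of the dataset
example_votes = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0]
mean, lower, upper = plot_confidence(example_votes)
```

With a real column, something like `plot_confidence(df['south-africa'].dropna())` should work the same way, provided the missing votes are dropped first.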

### Chi-squared tests:
4. Take a dataset that we have used in the past in class that has **categorical** variables. Pick two of those categorical variables and run a chi-squared test on that data
  - By hand using Numpy
  - In a single line using Scipy

In [0]:
!wget https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip

!unzip LoanStats_2018Q4.csv.zip

# skipfooter is only supported by the Python parsing engine
df = pd.read_csv('LoanStats_2018Q4.csv', header = 1, skipfooter = 2, engine = 'python')

In [7]:
df.describe()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,installment,annual_inc,url,desc,dti,delinq_2yrs,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,total_acc,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_amnt,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,annual_inc_joint,dti_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,...,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,deferral_term,hardship_amount,hardship_length,hardship_dpd,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,settlement_amount,settlement_percentage,settlement_term
count,0.0,0.0,128412.0,128412.0,128412.0,128412.0,128412.0,0.0,0.0,128175.0,128412.0,128412.0,56216.0,15450.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,29180.0,128412.0,16782.0,16782.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,...,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,125553.0,128412.0,128412.0,128412.0,128412.0,126720.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,16782.0,16782.0,16782.0,16782.0,16524.0,16782.0,16782.0,16782.0,16782.0,5154.0,69.0,69.0,69.0,69.0,67.0,69.0,69.0,168.0,168.0,168.0
mean,,,15971.321021,15971.321021,15968.498166,463.253654,82797.33,,,19.933178,0.227837,0.447038,36.880337,86.130162,11.564052,0.12185,16898.0,22.677413,11099.092269,11097.369164,5968.022141,5966.735984,4517.296051,1441.805821,0.620806,8.299461,1.493903,1845.527731,0.017958,46.553461,1.0,133551.6,19.226602,0.0,188.304286,146792.2,0.939507,2.760202,0.689071,1.572665,...,5.414128,4.882129,7.065944,8.288524,8.19921,12.866702,5.386342,11.546717,0.0,0.0,0.059488,2.011642,94.659843,32.900756,0.121733,0.0,188485.2,53560.97,27439.534163,46821.8,36642.8,0.585985,1.587177,11.436003,55.878831,3.027112,12.423907,0.03617,0.062984,38.51591,3.0,210.569275,3.0,14.768116,630.752687,15521.93,256.52942,7241.088274,52.028333,18.10119
std,,,10150.384233,10150.384233,10152.16897,285.718934,108298.5,,,20.143542,0.733793,0.73448,21.813805,21.880055,5.981599,0.332825,24082.55,12.129216,9151.449997,9152.063754,5783.892565,5784.03214,5505.609562,1186.760791,6.321555,189.708011,34.147442,4960.129365,0.146569,21.801716,0.0,96870.01,8.141631,0.0,1569.290033,173872.7,1.145306,2.942377,0.935776,1.565118,...,3.439644,3.205305,4.517165,7.389195,4.958905,7.873911,3.38838,5.977599,0.0,0.0,0.410652,1.880559,8.989288,34.899647,0.332552,0.0,196553.6,55993.35,26377.282557,49574.91,32525.66,0.936053,1.801878,6.690119,26.071241,3.254318,8.190067,0.347726,0.364083,23.659436,0.0,132.49769,0.0,7.4718,392.539085,8737.277743,222.369541,4751.362959,9.760939,6.851408
min,,,1000.0,1000.0,725.0,30.48,0.0,,,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,9000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,22.37,3.0,1.0,67.11,1034.03,7.98,437.21,40.0,1.0
25%,,,8000.0,8000.0,8000.0,253.5975,47058.0,,,11.76,0.0,0.0,19.0,72.0,7.0,0.0,5599.0,14.0,3853.89,3853.89,2689.63,2689.03,1681.55,560.64,0.0,0.0,0.0,274.25,0.0,29.0,1.0,87394.0,13.2,0.0,0.0,27465.5,0.0,1.0,0.0,0.0,...,3.0,3.0,4.0,3.0,5.0,7.0,3.0,7.0,0.0,0.0,0.0,1.0,92.3,0.0,0.0,0.0,53159.75,20091.0,10100.0,15000.0,16104.0,0.0,0.0,7.0,36.4,1.0,7.0,0.0,0.0,19.0,3.0,100.1,3.0,9.0,337.2,8543.03,56.6,3658.8225,45.0,14.0
50%,,,14000.0,14000.0,14000.0,382.905,68000.0,,,17.99,0.0,0.0,34.0,90.0,10.0,0.0,11199.5,21.0,8932.05,8932.05,4281.48,4281.48,2765.06,1097.16,0.0,0.0,0.0,446.735,0.0,47.0,1.0,117000.0,18.655,0.0,0.0,74985.0,1.0,2.0,0.0,1.0,...,5.0,4.0,6.0,6.0,7.0,11.0,5.0,10.0,0.0,0.0,0.0,2.0,100.0,25.0,0.0,0.0,117700.0,38469.0,19800.0,34605.0,28585.0,0.0,1.0,10.0,57.5,2.0,11.0,0.0,0.0,37.0,3.0,186.71,3.0,16.0,560.13,14385.64,234.35,6127.41,45.01,18.0
75%,,,21600.0,21600.0,21600.0,622.68,99000.0,,,25.3,0.0,1.0,53.0,104.0,15.0,0.0,20563.0,29.0,16915.7725,16915.245,7194.82,7190.64,4929.57,1991.6275,0.0,0.0,0.0,775.03,0.0,64.0,1.0,158679.0,24.93,0.0,0.0,221823.5,1.0,3.0,1.0,2.0,...,7.0,6.0,9.0,11.0,10.0,17.0,7.0,14.0,0.0,0.0,0.0,3.0,100.0,55.6,0.0,0.0,273606.5,68041.25,36100.0,63289.5,47228.75,1.0,3.0,15.0,76.9,4.0,16.0,0.0,0.0,58.0,3.0,313.77,3.0,20.0,939.03,21278.88,356.27,9571.75,60.0,24.0
max,,,40000.0,40000.0,40000.0,1618.24,9757200.0,,,999.0,24.0,5.0,160.0,119.0,94.0,6.0,2358150.0,160.0,38060.6,38060.6,49349.8505,49349.85,40000.0,10768.47,408.66,34655.15,6237.927,41253.54,8.0,162.0,1.0,6282000.0,39.99,0.0,208593.0,9971659.0,13.0,56.0,6.0,19.0,...,59.0,64.0,69.0,130.0,69.0,127.0,37.0,94.0,0.0,0.0,23.0,24.0,100.0,100.0,6.0,0.0,9999999.0,2622906.0,666200.0,2118996.0,1110019.0,6.0,15.0,67.0,170.1,34.0,95.0,21.0,15.0,153.0,3.0,559.64,3.0,28.0,1678.92,35394.04,1045.41,22161.0,98.24,24.0


In [6]:
df.describe(include = object)

Unnamed: 0,term,int_rate,grade,sub_grade,emp_title,emp_length,home_ownership,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,earliest_cr_line,revol_util,initial_list_status,last_pymnt_d,next_pymnt_d,last_credit_pull_d,application_type,verification_status_joint,sec_app_earliest_cr_line,hardship_flag,hardship_type,hardship_reason,hardship_status,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_loan_status,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date
count,128412,128412,128412,128412,107465,116708,128412,128412,128412,128412,128412,128412,128412,128412,128412,128412,128256,128412,128253,109769,128411,128412,14848,16782,128412,69,69,69,69,69,69,69,128412,168,168,168
unique,2,46,7,35,43892,11,4,3,3,7,2,12,12,880,50,644,1074,2,13,3,13,2,3,573,2,1,7,3,4,6,4,4,2,7,3,8
top,36 months,13.56%,A,A4,Teacher,10+ years,MORTGAGE,Not Verified,Oct-2018,Current,n,debt_consolidation,Debt consolidation,112xx,CA,Aug-2006,0%,w,Sep-2019,Oct-2019,Sep-2019,Individual,Not Verified,Aug-2006,N,INTEREST ONLY-3 MONTHS DEFERRAL,UNEMPLOYMENT,ACTIVE,Sep-2019,Nov-2019,Sep-2019,Late (16-30 days),N,Sep-2019,ACTIVE,Aug-2019
freq,88179,6974,38011,9770,2090,38826,63490,58350,46305,105925,128367,70603,70603,1370,17879,1130,1132,114498,106772,109708,116568,111630,6360,155,128346,69,23,66,42,30,31,30,128244,59,152,57


In [12]:
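# Note: 'installment' is a continuous numeric column, so this table gets one column
# per distinct payment amount; a categorical column (e.g. 'grade') suits a chi-squared test better
observed = pd.crosstab(df['home_ownership'], df['installment']).values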
print(observed.shape)

(4, 12363)


In [13]:
chi_squared, p_value, dof, expected = stats.chi2_contingency(observed)

print(f"Chi-Squared: {chi_squared}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}") 
print("Expected: \n", np.array(expected))

AttributeError: ignored

Scipy failed to mask my lack of statistics knowledge.  I have no idea what's wrong.
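
A sketch of the two approaches the assignment asks for, run on a small hypothetical contingency table rather than the loan data. It assumes `stats` refers to the `scipy.stats` package (for example via `from scipy import stats`); if `stats` is bound to something else, calls such as `stats.chi2_contingency` may raise an `AttributeError`. The counts below are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical counts: rows are one categorical variable, columns another
observed = np.array([[30, 20, 10],
                     [20, 30, 40]])

# By hand: expected counts under independence are
# (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()
chi_squared = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = stats.chi2.sf(chi_squared, dof)  # upper-tail probability

# In a single line with SciPy
chi2_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(observed)

print(f"By hand: chi2 = {chi_squared:.3f}, p = {p_value:.4f}, dof = {dof}")
print(f"SciPy:   chi2 = {chi2_scipy:.3f}, p = {p_scipy:.4f}, dof = {dof_scipy}")
```

The test is meant for two categorical variables with a modest number of levels each. Crossing a categorical column with a continuous one produces thousands of nearly empty cells, and the expected counts become too small for the chi-squared approximation to be reliable.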

### Other Questions:

It was mentioned a couple of times today that the standard deviation and the t-statistic are not the same thing.  As I understand it, standard deviation is some kind of measure of the spread of data in a sample or population.  Is the t-statistic something different?  Or a different measurement of that distribution?  What kind of features would the data need to have for that difference to be easy to observe?
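
One way to look at the difference numerically: the standard deviation describes how spread out the individual observations are, the standard error describes how uncertain the sample mean is, and the one-sample t-statistic is the distance between the sample mean and a hypothesized mean, measured in standard errors. The sketch below uses hypothetical yes/no votes and a hypothetical helper `summarize`.

```python
import math
import statistics

def summarize(sample, null_mean):
    """Return (standard deviation, standard error, one-sample t-statistic)."""
    n = len(sample)
    sd = statistics.stdev(sample)   # spread of the observations
    se = sd / math.sqrt(n)          # uncertainty of the sample mean
    t_stat = (statistics.fmean(sample) - null_mean) / se
    return sd, se, t_stat

# Hypothetical yes/no votes; the null hypothesis is an even split (mean 0.5)
small = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
large = small * 10  # same proportions, ten times as many observations

for name, sample in (("n=10", small), ("n=100", large)):
    sd, se, t_stat = summarize(sample, null_mean=0.5)
    print(f"{name}: sd = {sd:.3f}, se = {se:.3f}, t = {t_stat:.2f}")
```

The standard deviation barely changes between the two samples, while the t-statistic grows with the sample size. Data with the same spread but more observations makes the difference easy to see.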

I'm lost on the significance of picking a null hypothesis, a confidence interval, and identifying an observation that really means something.  It seems like any way of considering the data is either obvious, insignificant, or irrelevant.  Is it all about who agrees with your perception of the null hypothesis?  Do you always want the smallest confidence interval you can achieve or one that's consistent across a domain?
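
On the question of interval width, a short sketch may help: for a fixed spread, the margin of error shrinks as the sample grows and widens as the confidence level rises, so a narrow interval is a consequence of the data and the chosen level rather than something to tune freely. The spread of 0.5 and the helper name `margin_of_error` are assumptions for illustration, and the normal approximation is used throughout.

```python
import math
from statistics import NormalDist

def margin_of_error(sd, n, confidence):
    """Half-width of a normal-approximation interval for a mean."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return z * sd / math.sqrt(n)

sd = 0.5  # assumed spread, roughly that of an evenly split yes/no vote
for n in (25, 100, 400):
    row = ", ".join(f"{c:.0%}: +/-{margin_of_error(sd, n, c):.3f}"
                    for c in (0.90, 0.95, 0.99))
    print(f"n = {n:>3} -> {row}")
```

Quadrupling the sample size halves the margin, while moving from 95% to 99% confidence widens it at any sample size.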



## Stretch goals:

1. Write a summary of your findings, mixing prose and math/code/results. *Note* - yes, this is by definition a political topic. It is challenging but important to keep your writing voice *neutral* and stick to the facts of the data. Data science often involves considering controversial issues, so it's important to be sensitive about them (especially if you want to publish).
2. Apply the techniques you learned today to your project data or other data of your choice, and write/discuss your findings here.
3. Refactor your code so it is elegant, readable, and can be easily run for all issues.
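
For the third stretch goal, one possible shape for a refactor is a single function that returns the mean and confidence bounds for every issue at once. This is a sketch under assumptions: `interval_table` is a hypothetical name, and the tiny frame below only imitates the layout of the cleaned voting data (0/1 votes with `NaN` for missing values).

```python
import numpy as np
import pandas as pd
from scipy import stats

def interval_table(frame, confidence=0.95):
    """Mean and confidence bounds for every numeric column, ignoring missing votes."""
    rows = {}
    for issue in frame.select_dtypes("number").columns:
        votes = frame[issue].dropna()
        margin = stats.sem(votes) * stats.t.ppf((1 + confidence) / 2, len(votes) - 1)
        rows[issue] = {"mean": votes.mean(),
                       "lower": votes.mean() - margin,
                       "upper": votes.mean() + margin}
    return pd.DataFrame.from_dict(rows, orient="index")

# Tiny hypothetical frame in the same shape as the cleaned voting data
toy = pd.DataFrame({
    "party": ["democrat", "republican", "democrat", "republican", "democrat"],
    "budget": [1, 0, 1, np.nan, 1],
    "crime": [0, 1, 0, 1, 1],
})
print(interval_table(toy))
```

Calling the same function on a per-party subset of the data would give one table per party for a side-by-side comparison.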

## Resources

- [Interactive visualization of the Chi-Squared test](https://homepage.divms.uiowa.edu/~mbognar/applets/chisq.html)
- [Calculation of Chi-Squared test statistic](https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test)
- [Visualization of a confidence interval generated by R code](https://commons.wikimedia.org/wiki/File:Confidence-interval.svg)
- [Expected value of a squared standard normal](https://math.stackexchange.com/questions/264061/expected-value-calculation-for-squared-normal-distribution) (it's 1 - which is why the expected value of a Chi-Squared with $n$ degrees of freedom is $n$, as it's the sum of $n$ squared standard normals)
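
The last point can be checked with a quick simulation: summing $n$ squared standard normal draws many times and averaging the results should give a value close to $n$. The sketch uses only the standard library; the number of draws and the helper name `mean_chi_squared` are arbitrary choices.

```python
import random

def mean_chi_squared(dof, draws=20000, seed=0):
    """Average of many simulated chi-squared values with `dof` degrees of freedom."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        # One chi-squared draw: the sum of `dof` squared standard normals
        total += sum(rng.gauss(0, 1) ** 2 for _ in range(dof))
    return total / draws

for dof in (1, 5, 10):
    print(f"dof = {dof:>2}: simulated mean = {mean_chi_squared(dof):.2f}")
```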