<a href="https://colab.research.google.com/github/johnwesleyharding/DS-Unit-1-Sprint-2-Statistics/blob/master/JWH_assignment_DS_121_Statistics_Probability.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
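As a minimal sketch of such a two-sample test, the snippet below rebuilds the handicapped-infants votes from the yea/nay tallies printed in this notebook's output and marks abstentions as `NaN`. The helper `votes` is hypothetical, and the missing counts (9 and 3) are inferred from the party sizes minus the `describe()` counts. Passing `nan_policy='omit'` asks `ttest_ind` to drop the missing votes instead of returning `nan`.

```python
import numpy as np
from scipy.stats import ttest_ind

def votes(yes, no, missing=0):
    """Hypothetical helper: build a 0/1 vote array with NaN for missing votes."""
    return np.array([1.0] * yes + [0.0] * no + [np.nan] * missing)

# handicapped-infants tallies as printed in this notebook's output
dem_votes = votes(yes=156, no=102, missing=9)
rep_votes = votes(yes=31, no=134, missing=3)

# Two-sample t-test; NaN entries are ignored rather than propagated
result = ttest_ind(dem_votes, rep_votes, nan_policy='omit')
print(f"t = {float(result.statistic):.2f}, p = {float(result.pvalue):.3g}")
```

A positive statistic here means the first sample (democrats) has the higher mean, so the sign tells you which party supports the issue more.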

Stretch goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Apply hypothesis testing to your personal project data (for the purposes of this notebook you can type a summary of the hypothesis you formed and tested)
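One possible shape for the refactor in stretch goal 1 is a function that takes any DataFrame plus a grouping column and returns a table of results instead of printing them. This is only a sketch: the name `compare_groups`, its parameters, and the `toy` frame are made up for illustration and are not the real voting data.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def compare_groups(data, group_col, group_a, group_b):
    """Run a two-sample t-test on every non-group column (assumed numeric).

    Returns one row per column with both group means, the t statistic,
    and the two-sided p-value. NaNs are dropped per column, per group.
    """
    a = data[data[group_col] == group_a]
    b = data[data[group_col] == group_b]
    rows = []
    for col in data.columns.drop(group_col):
        t, p = ttest_ind(a[col].dropna(), b[col].dropna())
        rows.append({'issue': col,
                     f'{group_a}_mean': a[col].mean(),
                     f'{group_b}_mean': b[col].mean(),
                     't': t, 'p': p})
    return pd.DataFrame(rows).set_index('issue')

# Tiny made-up frame, for illustration only
toy = pd.DataFrame({
    'party':   ['democrat'] * 6 + ['republican'] * 6,
    'issue_a': [1, 1, 1, 1, 1, 0,   0, 0, 0, 0, 1, np.nan],
    'issue_b': [1, 0, 1, 0, 1, 0,   0, 1, 0, 1, np.nan, 1],
})
results = compare_groups(toy, 'party', 'democrat', 'republican')
print(results)
```

Returning a DataFrame makes it easy to filter afterwards, for example `results[results['p'] < 0.01]`.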

In [0]:
### YOUR CODE STARTS HERE
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import style
# Import `stats` from scipy itself; `scipy.stats.stats` is a deprecated namespace
from scipy import stats
from scipy.stats import ttest_ind, ttest_ind_from_stats, ttest_rel, ttest_1samp

In [2]:
# Getting started with the assignment
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data

--2019-10-10 00:38:19--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data.1’


2019-10-10 00:38:19 (286 KB/s) - ‘house-votes-84.data.1’ saved [18171/18171]



In [3]:
# Load Data
df = pd.read_csv('house-votes-84.data', 
                 header=None,
                 names=['party','handicapped-infants','water-project',
                          'budget','physician-fee-freeze', 'el-salvador-aid',
                          'religious-groups','anti-satellite-ban',
                          'aid-to-contras','mx-missile','immigration',
                          'synfuels', 'education', 'right-to-sue','crime','duty-free',
                          'south-africa'])
print(df.shape)
df.head()

(435, 17)


Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [4]:
# Encode votes numerically; np.nan replaces np.NaN, which NumPy 2.0 removed
df = df.replace({'?':np.nan, 'n':0, 'y':1})

df.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [5]:
df.isnull().sum()

party                     0
handicapped-infants      12
water-project            48
budget                   11
physician-fee-freeze     11
el-salvador-aid          15
religious-groups         11
anti-satellite-ban       14
aid-to-contras           15
mx-missile               22
immigration               7
synfuels                 21
education                31
right-to-sue             25
crime                    17
duty-free                28
south-africa            104
dtype: int64

In [6]:
df['party'].value_counts()

democrat      267
republican    168
Name: party, dtype: int64

In [7]:
dem = df[df['party'] == "democrat"]
dem.describe()

Unnamed: 0,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
count,258.0,239.0,260.0,259.0,255.0,258.0,259.0,263.0,248.0,263.0,255.0,249.0,252.0,257.0,251.0,185.0
mean,0.604651,0.502092,0.888462,0.054054,0.215686,0.476744,0.772201,0.828897,0.758065,0.471483,0.505882,0.144578,0.289683,0.350195,0.63745,0.935135
std,0.489876,0.501045,0.315405,0.226562,0.412106,0.50043,0.420224,0.377317,0.429121,0.500138,0.500949,0.352383,0.454518,0.477962,0.481697,0.246956
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0
75%,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [8]:
rep = df[df['party'] == "republican"]
rep.describe()

Unnamed: 0,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
count,165.0,148.0,164.0,165.0,165.0,166.0,162.0,157.0,165.0,165.0,159.0,155.0,158.0,161.0,156.0,146.0
mean,0.187879,0.506757,0.134146,0.987879,0.951515,0.89759,0.240741,0.152866,0.115152,0.557576,0.132075,0.870968,0.860759,0.981366,0.089744,0.657534
std,0.391804,0.501652,0.341853,0.10976,0.215442,0.304104,0.428859,0.36101,0.320176,0.498186,0.339643,0.336322,0.347298,0.135649,0.286735,0.476168
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0
50%,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0
75%,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

In [0]:
def partyttest(series):
  """Print t-test summaries and vote tallies for one issue column.

  Relies on the module-level frames `df`, `dem`, and `rep`.
  """
  # Overall share of yes votes across both parties
  m = df[series].mean()

  # Flag the issue as divisive when each party's mean differs
  # significantly (p < .01) from the overall mean
  dem_vs_all = ttest_1samp(dem[series], m, nan_policy='omit')
  rep_vs_all = ttest_1samp(rep[series], m, nan_policy='omit')
  if dem_vs_all.pvalue < .01 and rep_vs_all.pvalue < .01:
    print('The parties are divided.')

  # One-sample tests against unanimous support (1) and unanimous opposition (0)
  ds = ttest_1samp(dem[series], 1, nan_policy='omit')
  rs = ttest_1samp(rep[series], 1, nan_policy='omit')
  do = ttest_1samp(dem[series], 0, nan_policy='omit')
  ro = ttest_1samp(rep[series], 0, nan_policy='omit')

  # |t| <= 2 means the party's mean cannot be clearly distinguished
  # from unanimity, so the party is reported as (nearly) unanimous
  if abs(ds.statistic) <= 2:
    print(f'Democrats support this issue with a pvalue of: {ds.pvalue}')

  if abs(rs.statistic) <= 2:
    print(f'Republicans support this issue with a pvalue of: {rs.pvalue}')

  if abs(do.statistic) <= 2:
    print(f'Democrats oppose this issue with a pvalue of: {do.pvalue}')

  if abs(ro.statistic) <= 2:
    print(f'Republicans oppose this issue with a pvalue of: {ro.pvalue}')

  # Two-sample test between the parties; p > .1 means no detectable difference
  ind = ttest_ind(rep[series], dem[series], nan_policy='omit')
  if ind.pvalue > .1:
    print(f'Bipartisanship is alive with a pvalue of: {ind.pvalue}!')

  # Tally yes/no votes overall and by party (missing votes are excluded)
  yay = (df[series] == 1.0).sum()
  nay = (df[series] == 0.0).sum()
  dy = (dem[series] == 1.0).sum()
  dn = (dem[series] == 0.0).sum()
  ry = (rep[series] == 1.0).sum()
  rn = (rep[series] == 0.0).sum()
  print(f'Pass? Yay: {yay} Nay: {nay}  |  Dems: Yay: {dy} Nay: {dn}  |  Reps: Yay: {ry} Nay: {rn} \n')

In [0]:
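def testall(dataframe):
  """Run partyttest on every issue column (all columns after 'party')."""
  # Use the argument rather than the global df; skip column 0 ('party')
  for column in dataframe.columns[1:]:

    print(column.upper())
    partyttest(column)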


In [11]:
testall(df)

HANDICAPPED-INFANTS
The parties are divided.
Pass? Yay: 187 Nay: 236  |  Dems: Yay: 156 Nay: 102  |  Reps: Yay: 31 Nay: 134 

WATER-PROJECT
Bipartisanship is alive with a pvalue of: 0.9291556823993485!
Pass? Yay: 195 Nay: 192  |  Dems: Yay: 120 Nay: 119  |  Reps: Yay: 75 Nay: 73 

BUDGET
The parties are divided.
Pass? Yay: 253 Nay: 171  |  Dems: Yay: 231 Nay: 29  |  Reps: Yay: 22 Nay: 142 

PHYSICIAN-FEE-FREEZE
The parties are divided.
Republicans support this issue with a pvalue of: 0.1579292482594923
Pass? Yay: 177 Nay: 247  |  Dems: Yay: 14 Nay: 245  |  Reps: Yay: 163 Nay: 2 

EL-SALVADOR-AID
The parties are divided.
Pass? Yay: 212 Nay: 208  |  Dems: Yay: 55 Nay: 200  |  Reps: Yay: 157 Nay: 8 

RELIGIOUS-GROUPS
The parties are divided.
Pass? Yay: 272 Nay: 152  |  Dems: Yay: 123 Nay: 135  |  Reps: Yay: 149 Nay: 17 

ANTI-SATELLITE-BAN
The parties are divided.
Pass? Yay: 239 Nay: 182  |  Dems: Yay: 200 Nay: 59  |  Reps: Yay: 39 Nay: 123 

AID-TO-CONTRAS
The parties are divided.
Pass? 

Domain knowledge: Democrats controlled the House with more than 61% of the seats (267 of the 435 members in this data). Party unity was far less absolute than in recent political environments.



A reasonable null hypothesis is that the two parties vote yes on a given issue in the same proportion.
The alternative is that the proportions differ: one party is more likely to vote yes and the other no, or one party is unified while the other is split. On the water-project issue the two-sample test fails to reject the null hypothesis (p ≈ 0.93).
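Because the null hypothesis is stated in terms of proportions, a chi-square test on a 2x2 contingency table is another way to check it. The sketch below is not part of the original analysis; it uses the yea/nay tallies printed earlier in this notebook for water-project and budget, with rows for the parties and columns for the vote.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: democrats, republicans. Columns: yea, nay.
water_project = np.array([[120, 119],
                          [75, 73]])
budget = np.array([[231, 29],
                   [22, 142]])

for name, table in [('water-project', water_project), ('budget', budget)]:
    # expected holds the counts implied by "same proportion in both parties"
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"{name}: chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3g}")
```

If the two approaches agree, water-project should again show no detectable party difference while budget should show a strong one.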

Questions:

Any null hypothesis can have many alternatives. What happens if we test against an invalid alternative, and what is the virtue of choosing the correct one before we see the test result?
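One way to explore this question is to run the same comparison under different alternatives. The sketch below rebuilds the immigration votes from the `describe()` counts and means above (democrats 124 yea of 263, republicans 92 yea of 165) and assumes a SciPy version whose `ttest_ind` accepts the `alternative` argument (1.6 or later, as far as I know).

```python
import numpy as np
from scipy.stats import ttest_ind

# immigration votes rebuilt from the describe() output above
dem_votes = np.array([1.0] * 124 + [0.0] * 139)
rep_votes = np.array([1.0] * 92 + [0.0] * 73)

# Same data, three different alternative hypotheses
two_sided = ttest_ind(rep_votes, dem_votes)                           # H1: means differ
rep_higher = ttest_ind(rep_votes, dem_votes, alternative='greater')   # H1: rep > dem
rep_lower = ttest_ind(rep_votes, dem_votes, alternative='less')       # H1: rep < dem

print(f"two-sided p      = {two_sided.pvalue:.4f}")
print(f"rep > dem only p = {rep_higher.pvalue:.4f}")
print(f"rep < dem only p = {rep_lower.pvalue:.4f}")
```

The one-sided p-value in the observed direction is half the two-sided one, so picking the direction after looking at the data quietly doubles the apparent strength of the evidence. That is one argument for fixing the alternative beforehand.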

What does the curve look like for issues that democrats support and republicans do not? How do we visualize these comparisons? How can there be outliers on both sides of a binary variable, or is the curve one-sided?

The lesson showed that the t-distribution is flatter than the normal distribution. How can we understand that with this data, and what does either curve look like here?
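A rough numerical way to approach the last question is to compare the t-distribution with the standard normal at a few degrees of freedom. In the sketch below, the cutoff of 2 is an arbitrary choice for illustration, and 421 is the degrees of freedom a pooled two-sample test would have for handicapped-infants (258 + 165 - 2, from the `describe()` counts).

```python
import numpy as np
from scipy import stats

cutoff = 2.0  # arbitrary |t| threshold, chosen only for illustration

# Two-sided tail area and peak height of the standard normal
normal_tail = 2 * stats.norm.sf(cutoff)
print(f"normal       P(|Z| > {cutoff}) = {normal_tail:.4f}, "
      f"peak density = {stats.norm.pdf(0):.4f}")

# Same quantities for t-distributions with increasing degrees of freedom
for dof in (2, 10, 30, 421):
    t_tail = 2 * stats.t.sf(cutoff, dof)
    peak = stats.t.pdf(0, dof)
    print(f"t, dof={dof:<4} P(|T| > {cutoff}) = {t_tail:.4f}, "
          f"peak density = {peak:.4f}")
```

A lower peak and a larger tail area are what "flatter" means here. With several hundred votes per issue the two curves should be nearly indistinguishable, so the distinction matters mostly for small samples.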

In [0]:
#chi_squared, p_value, dof, expected = stats.chi2_contingency(contingency)
#p_value = stats.chi2.sf(chi_squared, dof)