

## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues show "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one per congressperson), a class label (democrat or republican), and 16 binary attributes (a yes or no vote on each issue). Be aware: there are missing values, which appear as `?` in the raw file.

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
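
For reference, here is the pooled (equal-variance) form of the two-sample t statistic, which is what `scipy.stats.ttest_ind` computes by default (`equal_var=True`). The symbols are notation chosen for this note rather than anything defined in the assignment: $\bar{x}_i$, $s_i^2$ and $n_i$ are the mean, sample variance and number of non-missing votes in group $i$.

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2},
\qquad
\text{df} = n_1 + n_2 - 2
$$

A large $|t|$ relative to a t distribution with $n_1 + n_2 - 2$ degrees of freedom gives a small p-value.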

In [0]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind
from scipy import stats

In [3]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data --no-check-certificate

--2020-04-10 20:42:51--  https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
  Issued certificate has expired.
HTTP request sent, awaiting response... 200 OK
Length: 18171 (18K) [application/x-httpd-php]
Saving to: ‘house-votes-84.data’


2020-04-10 20:42:51 (278 KB/s) - ‘house-votes-84.data’ saved [18171/18171]



In [5]:
column_header = ['party','handicapped-infants','water-project',
                          'budget','physician-fee-freeze', 'el-salvador-aid',
                          'religious-groups','anti-satellite-ban',
                          'aid-to-contras','mx-missile','immigration',
                          'synfuels', 'education', 'right-to-sue','crime','duty-free',
                          'south-africa']


voters = pd.read_csv('house-votes-84.data', header=None, names=column_header,
                     na_values='?')
voters.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y


In [6]:
voters = voters.replace({'y': 1, 'n': 0})
voters.head(3)

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0


In [7]:
voters['party'].value_counts()

democrat      267
republican    168
Name: party, dtype: int64

In [8]:
# subset of rows for republican members
repubs = voters[voters['party'] == 'republican']
repubs.shape

(168, 17)

In [9]:
# subset of rows for democrat members
dems = voters[voters['party'] == 'democrat']
dems.shape

(267, 17)

In [10]:
# share of republicans who voted yes on the handicapped-infants bill
# (missing votes are still counted in the denominator here)
print(repubs['handicapped-infants'].sum())
print(len(repubs))
repubs['handicapped-infants'].sum() / len(repubs)

31.0
168


0.18452380952380953

In [11]:
# share of democrats who voted yes on the handicapped-infants bill
# (missing votes are still counted in the denominator here)
print(dems['handicapped-infants'].sum())
print(len(dems))
dems['handicapped-infants'].sum() / len(dems)

156.0
267


0.5842696629213483

In [0]:
# the republican share above is biased low because NaN (missing) votes
# are counted in len(repubs); keep only the non-missing votes
col = repubs['handicapped-infants']
inf_no_nans = col[~np.isnan(col)]




In [13]:
inf_no_nans.sum()/len(inf_no_nans)

0.18787878787878787

In [0]:
# same NaN filtering for the democrats
col2 = dems['handicapped-infants']
inf_no_nans2 = col2[~np.isnan(col2)]

In [0]:
inf_no_nans2.sum()/len(inf_no_nans2)

0.6046511627906976

In [14]:
# two-sample t-test comparing democrat and republican support on one issue;
# nan_policy='omit' drops the missing votes before testing
ttest_ind(dems['handicapped-infants'], repubs['handicapped-infants'], nan_policy='omit')

Ttest_indResult(statistic=9.205264294809222, pvalue=1.613440327937243e-18)
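
As a sanity check, the statistic above can be reproduced by hand from vote counts alone. This is a minimal sketch, not part of the original notebook: the helper name `pooled_t_from_counts` is made up for illustration, and the non-missing counts 258 and 165 are not printed anywhere above but are inferred from the yes-vote sums and the NaN-free shares (156/258 and 31/165). It assumes the pooled-variance form of the test, which is the `ttest_ind` default.

```python
import math


def pooled_t_from_counts(yes_a, n_a, yes_b, n_b):
    """Pooled two-sample t statistic for 0/1 data given yes counts and group sizes."""
    mean_a = yes_a / n_a
    mean_b = yes_b / n_b
    # For 0/1 data the sum of squared deviations from the mean is n * p * (1 - p).
    ss_a = n_a * mean_a * (1 - mean_a)
    ss_b = n_b * mean_b * (1 - mean_b)
    df = n_a + n_b - 2
    pooled_var = (ss_a + ss_b) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err, df


# democrats: 156 yes of 258 non-missing; republicans: 31 yes of 165 non-missing
t_stat, df = pooled_t_from_counts(156, 258, 31, 165)
print(t_stat, df)
```

The printed statistic should agree with the value reported by `ttest_ind` above (about 9.205), with 421 degrees of freedom.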

**Assignment**

In [0]:
issue_dem = []  # p < 0.01 and democrats more supportive
issue_rep = []  # p < 0.01 and republicans more supportive
issue_all = []  # p > 0.1: no significant difference detected
issue_oth = []  # 0.01 <= p <= 0.1: borderline

for coll in voters:
  if coll != "party":
    rep = voters[voters["party"] == "republican"][coll].dropna()
    dem = voters[voters["party"] == "democrat"][coll].dropna()
    pvalue = stats.ttest_ind(rep, dem)[1]  # run the test once per issue
    row = [coll, pvalue, dem.mean(), rep.mean()]
    if pvalue > 0.1:
      issue_all.append(row)
    elif pvalue < 0.01:
      if dem.mean() > rep.mean():
        issue_dem.append(row)
      else:
        issue_rep.append(row)
    else:
      issue_oth.append(row)

**Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01**

In [21]:
print("Issues Democrats Support More Than Republicans:")
print("\tpvalue\t\t\tDemocrat Support\tRepublican Support\tIssue")
for i in issue_dem:
  print("\t{}\t{}\t{}\t{}".format(i[1], i[2], i[3], i[0]))


Issues Democrats Support More Than Republicans:
	pvalue			Democrat Support	Republican Support	Issue
	1.613440327936998e-18	0.6046511627906976	0.18787878787878787	handicapped-infants
	2.0703402795405602e-77	0.8884615384615384	0.13414634146341464	budget
	8.521033017443427e-31	0.7722007722007722	0.24074074074074073	anti-satellite-ban
	2.824718413723432e-54	0.8288973384030418	0.15286624203821655	aid-to-contras
	5.030792653107883e-47	0.7580645161290323	0.11515151515151516	mx-missile
	1.5759322301054227e-15	0.5058823529411764	0.1320754716981132	synfuels
	5.997697174348817e-32	0.6374501992031872	0.08974358974358974	duty-free
	3.652674361672202e-11	0.9351351351351351	0.6575342465753424	south-africa


**Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01**


In [22]:
print("Issues Republicans Support More Than Democrats:")
print("\tpvalue\t\t\tDemocrat Support\tRepublican Support\tIssue")
for i in issue_rep:
  print("\t{}\t{}\t{}\t{}".format(i[1], i[2], i[3], i[0]))

Issues Republicans Support More Than Democrats:
	pvalue			Democrat Support	Republican Support	Issue
	1.994262314074572e-177	0.05405405405405406	0.9878787878787879	physician-fee-freeze
	5.600520111728605e-68	0.21568627450980393	0.9515151515151515	el-salvador-aid
	2.393672252059893e-20	0.47674418604651164	0.8975903614457831	religious-groups
	1.8834203990447446e-64	0.14457831325301204	0.8709677419354839	education
	1.2278581709672014e-34	0.2896825396825397	0.8607594936708861	right-to-sue
	9.9523427056044e-47	0.35019455252918286	0.9813664596273292	crime


**Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)**

In [23]:
print("Issues With No Significant Difference (p > 0.1):")
print("\tpvalue\t\t\tDemocrat Support\tRepublican Support\tIssue")
for i in issue_all:
  print("\t{}\t{}\t{}\t{}".format(i[1], i[2], i[3], i[0]))
print()
print()
print("Issues With Borderline Evidence (0.01 <= p <= 0.1):")
print("\tpvalue\t\t\tDemocrat Support\tRepublican Support\tIssue")
for i in issue_oth:
  print("\t{}\t{}\t{}\t{}".format(i[1], i[2], i[3], i[0]))

Issues With No Significant Difference (p > 0.1):
	pvalue			Democrat Support	Republican Support	Issue
	0.9291556823994811	0.502092050209205	0.5067567567567568	water-project


Issues With Borderline Evidence (0.01 <= p <= 0.1):
	pvalue			Democrat Support	Republican Support	Issue
	0.08330248490425282	0.4714828897338403	0.5575757575757576	immigration


## Stretch Goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Work on performing a t-test without using SciPy in order to get "under the hood" and learn more thoroughly about this topic.
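
For the first stretch goal, one possible refactoring is sketched below. The function names `compare_groups` and `compare_all` are hypothetical, and the sketch assumes a DataFrame shaped like `voters`: one grouping column plus numeric 0/1 columns that may contain NaN.

```python
import pandas as pd
from scipy import stats


def compare_groups(df, group_col, group_a, group_b, value_col):
    """Two-sample t-test of value_col between two groups, ignoring NaNs."""
    a = df.loc[df[group_col] == group_a, value_col].dropna()
    b = df.loc[df[group_col] == group_b, value_col].dropna()
    t_stat, p_value = stats.ttest_ind(a, b)
    return {"issue": value_col, "t": t_stat, "p": p_value,
            "mean_a": a.mean(), "mean_b": b.mean()}


def compare_all(df, group_col, group_a, group_b):
    """Run compare_groups on every column except the grouping column."""
    rows = [compare_groups(df, group_col, group_a, group_b, col)
            for col in df.columns if col != group_col]
    return pd.DataFrame(rows).set_index("issue")
```

With the notebook's data this could be called as `results = compare_all(voters, 'party', 'democrat', 'republican')`, after which a filter such as `results[results['p'] < 0.01]` picks out the strongly partisan issues. Returning a DataFrame rather than four separate lists keeps the thresholds out of the test logic, so they can be changed without rerunning the tests.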
### Start with a 1-sample t-test
 - Establish the conditions for your test 
 - [Calculate the T Statistic](https://blog.minitab.com/hs-fs/hubfs/Imported_Blog_Media/701f9c0efa98a38fb397f3c3ec459b66.png?width=247&height=172&name=701f9c0efa98a38fb397f3c3ec459b66.png) (You'll need to omit NaN values from your sample).
 - Translate that t-statistic into a P-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)
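
The first two steps above can be sketched with the standard library only. This is a minimal illustration under stated assumptions: the function name `one_sample_t` and the null-hypothesis value `mu0` are hypothetical, and the statistic used is the usual t = (sample mean - mu0) / (s / sqrt(n)) with n - 1 degrees of freedom.

```python
import math


def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for H0: population mean == mu0."""
    # Omit missing values (NaN) before computing anything.
    values = [x for x in sample if not math.isnan(x)]
    n = len(values)
    mean = sum(values) / n
    # Sample variance with the n - 1 (Bessel) correction.
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    std_err = math.sqrt(variance / n)
    return (mean - mu0) / std_err, n - 1


# Toy data, not taken from the voting records.
t_stat, df = one_sample_t([1, 2, 3, 4, 5, float("nan")], mu0=2)
print(t_stat, df)
```

For a column of votes, a natural null hypothesis would be `mu0=0.5` (support is evenly split), but that choice is yours to state when you establish the conditions for the test.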

### Then try a 2-sample t-test
 - Establish the conditions for your test 
 - [Calculate the T Statistic](https://lh3.googleusercontent.com/proxy/rJJ5ZOL9ZDvKOOeBihXoZDgfk7uv1YsRzSQ1Tc10RX-r2HrRpRLVqlE9CWX23csYQXcTniFwlBg3H-qR8MKJPBGnjwndqlhDX3JxoDE5Yg) (You'll need to omit NaN values from your sample).
 - Translate that t-statistic into a P-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)
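
Instead of a table or applet, the last step can also be approximated numerically. The sketch below is one possible approach, not the method the assignment prescribes: it integrates the Student's t density with Simpson's rule to get a two-sided p-value. The names `t_pdf` and `two_sided_p_value` and the `steps` parameter are made up for this example.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    # Work in logs so that large df does not overflow the gamma function.
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p_value(t_stat, df, steps=10_000):
    """Approximate P(|T| >= |t_stat|) using Simpson's rule on [0, |t_stat|]."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_pdf(i * h, df)
    central_mass = 2 * (h / 3) * total  # P(-|t| <= T <= |t|), by symmetry
    return max(0.0, 1.0 - central_mass)


print(two_sided_p_value(2.0, 10))
```

Because the p-value is computed as one minus the central probability, this approach loses precision for extremely small p-values (anything near or below about 1e-12 comes back as roughly 0). That is fine for deciding whether p is below 0.01, but it will not reproduce tiny values such as 1e-18 digit for digit.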

### Then check your answers using SciPy!