<a href="https://colab.research.google.com/github/mbrooke1113/DS-Unit-1-Sprint-2-Statistics/blob/master/DS_Unit_1_Sprint_Challenge_2_Statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Unit 1 Sprint Challenge 2

## Exploring Data, Testing Hypotheses

In this sprint challenge you will look at a dataset of echocardiograms:

<https://archive.ics.uci.edu/ml/datasets/Echocardiogram>

Attribute Information:

1. survival -- the number of months patient survived (has survived, if patient is still alive). Because all the patients had their heart attacks at different times, it is possible that some patients have survived less than one year but they are still alive. Check the second variable to confirm this. Such patients cannot be used for the prediction task mentioned above.
2. still-alive -- a binary variable. 0=dead at end of survival period, 1 means still alive
3. age-at-heart-attack -- age in years when heart attack occurred
4. pericardial-effusion -- binary. Pericardial effusion is fluid around the heart. 0=no fluid, 1=fluid
5. fractional-shortening -- a measure of contractility around the heart; lower numbers are increasingly abnormal
6. epss -- E-point septal separation, another measure of contractility. Larger numbers are increasingly abnormal.
7. lvdd -- left ventricular end-diastolic dimension. This is a measure of the size of the heart at end-diastole. Large hearts tend to be sick hearts.
8. wall-motion-score -- a measure of how the segments of the left ventricle are moving
9. wall-motion-index -- equals wall-motion-score divided by number of segments seen. Usually 12-13 segments are seen in an echocardiogram. Use this variable INSTEAD of the wall motion score.
10. mult -- a derived variable which can be ignored
11. name -- the name of the patient (I have replaced them with "name")
12. group -- meaningless, ignore it
13. alive-at-1 -- Boolean-valued. Derived from the first two attributes. 0 means patient was either dead after 1 year or had been followed for less than 1 year. 1 means patient was alive at 1 year.

Sprint challenges are evaluated based on satisfactory completion of each part. It is suggested you work through it in order, getting each aspect reasonably working, before trying to deeply explore, iterate, or refine any given step. Once you get to the end, if you want to go back and improve things, go for it!

## Part 1 - Load and validate the data

- Load the data as a `pandas` data frame.
- Validate that it has the appropriate number of observations (you can check the raw file, and also read the dataset description from UCI).
- UCI says there should be missing data - check, and if necessary change the data so pandas recognizes it as `NaN`
- Make sure that the loaded features are of the types described above (continuous values should be treated as float), and correct as necessary

This is review, but skills that you'll use at the start of any data exploration. Further, you may have to do some investigation to figure out which file to load from - that is part of the puzzle.

In [37]:
# Load data

import pandas as pd
import numpy as np


In [38]:
from google.colab import files
uploaded = files.upload()

Saving echocardiogram.data to echocardiogram (1).data


In [39]:
columns = ['survival', 'still alive','age at heart attack', 'pericardial effusion',
          'fractional shortening', 'EPSS', 'LVDD', 'wall motion score', 
          'wall motion index', 'malt', 'name', 'group', 'alive at 1']

# `error_bad_lines=False` was removed in pandas 2.0; `on_bad_lines='skip'` is the equivalent
heart = pd.read_csv('echocardiogram.data', names=columns, on_bad_lines='skip',
                    na_values='?')
heart

Unnamed: 0,survival,still alive,age at heart attack,pericardial effusion,fractional shortening,EPSS,LVDD,wall motion score,wall motion index,malt,name,group,alive at 1
0,11.0,0.0,71.0,0,0.260,9.000,4.600,14.0,1.000,1.000,name,1,0.0
1,19.0,0.0,72.0,0,0.380,6.000,4.100,14.0,1.700,0.588,name,1,0.0
2,16.0,0.0,55.0,0,0.260,4.000,3.420,14.0,1.000,1.000,name,1,0.0
3,57.0,0.0,60.0,0,0.253,12.062,4.603,16.0,1.450,0.788,name,1,0.0
4,19.0,1.0,57.0,0,0.160,22.000,5.750,18.0,2.250,0.571,name,1,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
127,7.5,1.0,64.0,0,0.240,12.900,4.720,12.0,1.000,0.857,name,,
128,41.0,0.0,64.0,0,0.280,5.400,5.470,11.0,1.100,0.714,name,,
129,36.0,0.0,69.0,0,0.200,7.000,5.050,14.5,1.210,0.857,name,,
130,22.0,0.0,57.0,0,0.140,16.100,4.360,15.0,1.360,0.786,name,,


In [40]:
heart.shape

(132, 13)

Shape of dataset matches the values presented by UCI.

In [41]:
heart.isnull().sum()

survival                  2
still alive               1
age at heart attack       6
pericardial effusion      0
fractional shortening     8
EPSS                     15
LVDD                     11
wall motion score         4
wall motion index         2
malt                      3
name                      1
group                    22
alive at 1               57
dtype: int64

The missing-value counts in the previous cell differ slightly from those reported by UCI.
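
One way to sanity-check how missing values are recognized is to parse a tiny sample with and without `na_values='?'`. The snippet below is a minimal sketch: the three-row sample and the shortened column list are made up for illustration and are not the UCI file.

```python
import io
import pandas as pd

# Hypothetical three-row sample; '?' marks a missing value, as in the UCI file.
raw = "11,0,71\n19,?,72\n?,0,55\n"
cols = ["survival", "still alive", "age at heart attack"]

# Without na_values, '?' is kept as text, so nothing is counted as missing.
as_text = pd.read_csv(io.StringIO(raw), names=cols)

# With na_values='?', pandas converts the markers to NaN.
as_nan = pd.read_csv(io.StringIO(raw), names=cols, na_values="?")

print(as_text.isnull().sum().sum())
print(as_nan.isnull().sum())
```

Comparing the two counts shows whether a column still hides `?` markers as strings.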

In [42]:
heart.dtypes

survival                 float64
still alive              float64
age at heart attack      float64
pericardial effusion       int64
fractional shortening    float64
EPSS                     float64
LVDD                     float64
wall motion score        float64
wall motion index        float64
malt                     float64
name                      object
group                     object
alive at 1               float64
dtype: object

## Part 2 - Exploring data, Testing hypotheses

The only thing we really know about this data is that Alive-at-1 is the class label. Besides that, we have continuous features and categorical features.

Explore the data: you can use whatever approach (tables, utility functions, visualizations) to get an impression of the distributions and relationships of the variables. In general, your goal is to understand how the features are different when grouped by the two class labels (`1` and `0`).

For the continuous features, how are they different when split between the two class labels? Choose two features to run t-tests (again split by class label) - specifically, select one feature that is *extremely* different between the classes, and another feature that is notably less different (though perhaps still "statistically significantly" different). You may have to explore more than two features to do this.

For the categorical features, explore by creating "cross tabs" (aka [contingency tables](https://en.wikipedia.org/wiki/Contingency_table)) between them and the class label, and apply the Chi-squared test to them. [pandas.crosstab](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html) can create contingency tables, and [scipy.stats.chi2_contingency](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html) can calculate the Chi-squared statistic for them.

There are also categorical features - as with the t-test, try to find one where the Chi-squared test returns an extreme result (rejecting the null that the data are independent), and one where it is less extreme.

**NOTE** - "less extreme" just means smaller test statistic/larger p-value. Even the least extreme differences may be strongly statistically significant.

Your *main* goal is the hypothesis tests, so don't spend too much time on the exploration/visualization piece. That is just a means to an end - use simple visualizations, such as boxplots or a scatter matrix (both built in to pandas), to get a feel for the overall distribution of the variables.

This is challenging, so manage your time and aim for a baseline of at least running two t-tests and two Chi-squared tests before polishing. And don't forget to answer the questions in part 3, even if your results in this part aren't what you want them to be.

In [None]:
# TODO


In [46]:
alive = heart[heart['alive at 1'] == 1]
alive

Unnamed: 0,survival,still alive,age at heart attack,pericardial effusion,fractional shortening,EPSS,LVDD,wall motion score,wall motion index,malt,name,group,alive at 1
10,10.0,1.0,77.0,0,0.13,16.0,4.23,18.0,1.8,0.714,name,1,1.0
14,0.5,1.0,62.0,0,0.12,23.0,5.8,11.67,2.33,0.358,name,1,1.0
16,0.5,1.0,69.0,1,0.26,11.0,4.65,18.0,1.64,0.784,name,1,1.0
17,0.5,1.0,62.529,1,0.07,20.0,5.2,24.0,2.0,0.857,name,1,1.0
19,1.0,1.0,66.0,1,0.22,15.0,5.4,27.0,2.25,0.857,name,1,1.0
20,0.75,1.0,69.0,0,0.15,12.0,5.39,19.5,1.625,0.857,name,1,1.0
21,0.75,1.0,85.0,1,0.18,19.0,5.46,13.83,1.38,0.71,name,1,1.0
22,0.5,1.0,73.0,0,0.23,12.733,6.06,7.5,1.5,0.36,name,1,1.0
23,5.0,1.0,71.0,0,0.17,0.0,4.65,8.0,1.0,0.57,name,1,1.0
37,1.0,1.0,65.0,0,0.06,23.6,,21.5,2.15,0.714,name,2,1.0


In [47]:
alive.shape

(24, 13)

In [48]:
dead = heart[heart['alive at 1'] == 0]
dead

Unnamed: 0,survival,still alive,age at heart attack,pericardial effusion,fractional shortening,EPSS,LVDD,wall motion score,wall motion index,malt,name,group,alive at 1
0,11.0,0.0,71.0,0,0.26,9.0,4.6,14.0,1.0,1.0,name,1,0.0
1,19.0,0.0,72.0,0,0.38,6.0,4.1,14.0,1.7,0.588,name,1,0.0
2,16.0,0.0,55.0,0,0.26,4.0,3.42,14.0,1.0,1.0,name,1,0.0
3,57.0,0.0,60.0,0,0.253,12.062,4.603,16.0,1.45,0.788,name,1,0.0
4,19.0,1.0,57.0,0,0.16,22.0,5.75,18.0,2.25,0.571,name,1,0.0
5,26.0,0.0,68.0,0,0.26,5.0,4.31,12.0,1.0,0.857,name,1,0.0
6,13.0,0.0,62.0,0,0.23,31.0,5.43,22.5,1.875,0.857,name,1,0.0
7,50.0,0.0,60.0,0,0.33,8.0,5.25,14.0,1.0,1.0,name,1,0.0
8,19.0,0.0,46.0,0,0.34,0.0,5.09,16.0,1.14,1.003,name,1,0.0
9,25.0,0.0,54.0,0,0.14,13.0,4.49,15.5,1.19,0.93,name,1,0.0


In [49]:
from scipy.stats import ttest_ind

In [50]:
ttest_ind(alive['LVDD'], dead['LVDD'], nan_policy='omit')

Ttest_indResult(statistic=2.351305780538384, pvalue=0.021702859002225544)

In [51]:
# Test whether age at heart attack differs between the two classes
ttest_ind(alive['age at heart attack'], dead['age at heart attack'], nan_policy='omit')

Ttest_indResult(statistic=2.2165294422402875, pvalue=0.02986013147831115)

In [52]:
# Test whether the wall motion index differs between the two classes
ttest_ind(alive['wall motion index'], dead['wall motion index'], nan_policy='omit')

Ttest_indResult(statistic=5.271197163087783, pvalue=1.393774872046117e-06)
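
To make the direction of a t-test result explicit, it can help to print the group means next to the statistic. The sketch below uses only the ten wall motion index values of each group displayed in the tables above, so its numbers will not match the full-data results. It also swaps in Welch's t-test (`equal_var=False`), which is an assumption on my part and not what the cells above ran.

```python
import numpy as np
from scipy.stats import ttest_ind

# Wall motion index for the ten rows of each group shown in the tables above
alive_wmi = np.array([1.8, 2.33, 1.64, 2.0, 2.25, 1.625, 1.38, 1.5, 1.0, 2.15])
dead_wmi = np.array([1.0, 1.7, 1.0, 1.45, 2.25, 1.0, 1.875, 1.0, 1.14, 1.19])

# equal_var=False selects Welch's t-test, which does not assume equal variances
result = ttest_ind(alive_wmi, dead_wmi, equal_var=False)

# The sign of the statistic follows the argument order:
# positive means the first sample (alive) has the larger mean.
mean_diff = alive_wmi.mean() - dead_wmi.mean()
print(f"mean difference (alive - dead): {mean_diff:.3f}")
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```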

In [54]:
# Finding categoricals
heart.describe(exclude= np.number)

Unnamed: 0,name,group
count,131,110
unique,1,3
top,name,2
freq,131,85


In [65]:
heart['name'].unique()

array(['name', nan], dtype=object)

In [64]:
heart['group'].unique()

array(['1', '2', 'name', nan], dtype=object)

In [79]:
name = pd.crosstab(heart['name'], heart['alive at 1'])
name

alive at 1,0.0,1.0
name,Unnamed: 1_level_1,Unnamed: 2_level_1
name,50,24


In [76]:
group = pd.crosstab(heart['group'], heart['alive at 1'])
group

alive at 1,0.0,1.0,2.0
group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,15,9,0
2,35,15,0
name,0,0,1


In [77]:
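from scipy.stats import chisquare
# Note: chisquare with axis=None runs a goodness-of-fit test of all cell counts
# against a uniform distribution; it is not a test of independence.
chisquare(group, axis = None)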

Power_divergenceResult(statistic=135.84, pvalue=1.7373215360490077e-25)

In [80]:
chisquare(name, axis= None)

Power_divergenceResult(statistic=9.135135135135135, pvalue=0.0025074693935343085)
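
For a test of independence on a contingency table, the challenge text points to `scipy.stats.chi2_contingency`. The following is a minimal sketch using the counts from the `group` crosstab above, restricted to the valid rows (groups 1 and 2) and the valid class labels (0 and 1). Dropping the stray `name` row and `2.0` column is my own cleaning assumption.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Counts copied from the `group` crosstab above, keeping only the valid
# rows (groups 1 and 2) and the valid class labels (0 and 1).
observed = pd.DataFrame(
    [[15, 9], [35, 15]],
    index=pd.Index(["1", "2"], name="group"),
    columns=pd.Index([0.0, 1.0], name="alive at 1"),
)

# chi2_contingency tests independence of rows and columns; for 2x2 tables
# it applies Yates' continuity correction by default.
stat, p_value, dof, expected = chi2_contingency(observed)

print(f"chi2 = {stat:.3f}, p = {p_value:.3f}, dof = {dof}")
print(pd.DataFrame(expected, index=observed.index, columns=observed.columns))
```

On these counts the p-value comes out well above 0.05, so the test gives no evidence that `group` and `alive at 1` are related.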

## Part 3 - Analysis and Interpretation

Now that you've looked at the data, answer the following questions:

- Interpret and explain the two t-tests you ran - what do they tell you about the relationships between the continuous features you selected and the class labels?
- Interpret and explain the two Chi-squared tests you ran - what do they tell you about the relationships between the categorical features you selected and the class labels?
- What was the most challenging part of this sprint challenge?

Answer with text, but feel free to intersperse example code/results or refer to it from earlier.

*Your words here!*

From the t-tests, both LVDD (p ≈ 0.022) and wall motion index (p ≈ 1.4e-06) differed between the two classes, with wall motion index showing by far the stronger difference. Both t-statistics are positive and the alive group was passed as the first sample, so the alive group has the higher mean for both features. The LVDD difference is much less pronounced, though still significant at the 0.05 level.

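The chi-square tests on the two categorical columns, `group` and `name`, both returned low p-values (about 1.7e-25 and 0.0025). However, `chisquare` with `axis=None` tests whether the cell counts are uniform, not whether the column is independent of the class label, so these results only show that the counts are unevenly spread across cells. Since `name` holds a single value and `group` is documented as meaningless, neither column is expected to say anything useful about whether the person was alive.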

This unit was a little more complicated because it combined technical work with theoretical ideas. Understanding the concepts and then translating them into code was the main challenge.



## Part 4 - Bayesian vs Frequentist Statistics

Using a minimum of 2-3 sentences, give an explanation of Bayesian and Frequentist statistics, and then compare and contrast these two approaches to statistical inference.



 *Your words here!*

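 Both approaches use probability to reason about uncertain events. The frequentist approach treats probability as the long-run frequency of an event over repeated trials and relies only on the observed data, while the Bayesian approach treats probability as a degree of belief and combines the observed data with prior information.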

CONTRAST AND COMPARE

Bayesian inference depends on a prior, the probability assigned to each hypothesis before the data are seen, whereas frequentist inference uses no prior and focuses exclusively on the observed events. Because a Bayesian posterior can serve as the prior when new data arrive, the Bayesian approach can be updated iteratively, which the frequentist approach does not do in the same way.
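
The contrast can be sketched with the class counts from the crosstab earlier (24 alive and 50 not alive at one year). The uniform `Beta(1, 1)` prior is an assumption chosen for illustration; a different prior would give a different posterior.

```python
# Counts taken from the crosstab earlier: 24 alive and 50 not alive at 1 year
alive, not_alive = 24, 50
n = alive + not_alive

# Frequentist point estimate: the observed proportion (maximum likelihood)
mle = alive / n

# Bayesian estimate: start from a Beta(1, 1) prior (uniform, an assumption)
# and update it with the data; the posterior is Beta(1 + alive, 1 + not_alive).
prior_a, prior_b = 1, 1
post_a, post_b = prior_a + alive, prior_b + not_alive
posterior_mean = post_a / (post_a + post_b)

print(f"frequentist estimate: {mle:.4f}")
print(f"posterior mean:       {posterior_mean:.4f}")
```

With this much data the two estimates are close. The prior pulls the posterior mean slightly toward 0.5, and that pull shrinks as more observations are added.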

# Stretch Goals: 
Do these to get a 3. These are not required in order to pass the Sprint Challenge.

## Part 1: 

Make sure that all of your dataframe columns have the appropriate data types. *Hint:* If a column has the datatype of "object" even though it's made up of float or integer values, you can coerce it to act as a numeric column by using the `pd.to_numeric()` function. In order to get a 3 on this section make sure that your data exploration is particularly well commented, easy to follow, and thorough.
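
As a small sketch of the hint, the snippet below applies `pd.to_numeric` to a series that mirrors the unique values reported for `group` earlier (`'1'`, `'2'`, `'name'`, and a missing value). Using `errors="coerce"` is one option among several; it silently turns unparseable entries into `NaN`, which may or may not be what you want.

```python
import numpy as np
import pandas as pd

# Values mirroring the unique entries reported for `group` earlier
group = pd.Series(["1", "2", "name", np.nan], dtype=object, name="group")

# errors="coerce" turns anything that cannot be parsed (here "name") into NaN
group_numeric = pd.to_numeric(group, errors="coerce")

print(group_numeric.tolist())
print(group_numeric.dtype)
```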

## Part 2:

Write functions that can calculate t-tests and chi^2 tests on all of the appropriate column combinations from the dataset. (Remember that certain tests require certain variable types.)
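
One possible shape for such a function is sketched below. The helper name `ttests_by_label` and the synthetic demo data are hypothetical, introduced only for illustration, and the function assumes a binary label coded as `0` and `1`.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def ttests_by_label(df, label, features):
    """Run a two-sample t-test per feature, split by a binary label column."""
    rows = []
    for col in features:
        group1 = df.loc[df[label] == 1, col].dropna()
        group0 = df.loc[df[label] == 0, col].dropna()
        stat, p = ttest_ind(group1, group0)
        rows.append({"feature": col, "statistic": stat, "pvalue": p})
    # Most extreme differences first
    return pd.DataFrame(rows).sort_values("pvalue").reset_index(drop=True)

# Synthetic stand-in data (an assumption for illustration, not the UCI file)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "label": np.repeat([0, 1], 30),
    "shifted": np.concatenate([rng.normal(0, 1, 30), rng.normal(3, 1, 30)]),
    "noise": rng.normal(0, 1, 60),
})
demo.loc[0, "noise"] = np.nan  # missing values are dropped per feature

summary = ttests_by_label(demo, "label", ["shifted", "noise"])
print(summary)
```

Sorting by p-value makes it easy to pick one extreme and one less extreme feature, as Part 2 asks.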

## Part 3: 

Calculate and report confidence intervals on your most important mean estimates (choose at least two). Make some kind of a graphic or visualization to help us see visually how precise these estimates are.
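
A t-based confidence interval for a mean can be computed along the following lines. This is a sketch: the helper name `mean_confidence_interval` is hypothetical, and the input is only the LVDD values from the ten `alive` rows displayed earlier (one of which is missing), not the full column.

```python
import numpy as np
from scipy import stats

def mean_confidence_interval(values, confidence=0.95):
    """Return (mean, lower, upper) for a t-based confidence interval."""
    data = np.asarray(values, dtype=float)
    data = data[~np.isnan(data)]          # ignore missing values
    mean = data.mean()
    sem = stats.sem(data)                 # standard error of the mean (ddof=1)
    margin = sem * stats.t.ppf((1 + confidence) / 2, df=len(data) - 1)
    return mean, mean - margin, mean + margin

# LVDD for the ten `alive` rows displayed earlier (one value is missing)
alive_lvdd = [4.23, 5.8, 4.65, 5.2, 5.4, 5.39, 5.46, 6.06, 4.65, np.nan]
mean, lower, upper = mean_confidence_interval(alive_lvdd)
print(f"mean = {mean:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

The returned lower and upper bounds can be passed as error bars to a plotting call to show how precise each estimate is.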

## Part 4:

Give an extra awesome explanation of Bayesian vs Frequentist Statistics. Maybe use code or visualizations, or any other means necessary to show an above average grasp of these high level concepts.

In [None]:
# You can work the stretch goals down here or back up in their regular sections
# just make sure that they are labeled so that we can easily differentiate
# your main work from the stretch goals.