<a href="https://colab.research.google.com/github/phillippsm/colab_project/blob/colab_dev/ShoeSize.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Shoe Size vs Height

Explore the data by running the cells in order. Do we need to make any changes along the way?

How do we interpret the correlation result (Pearson) that we find at the end?

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


## Get some data - original source is [here](https://osf.io/ja9dw/)

In [0]:
ss_df = pd.read_csv("http://cyclicautomata.com/stuff/data/shoe_size.csv", parse_dates=['time'])

## What's in the dataset?
Look at the first few rows of the data.

In [0]:
ss_df.head()

## Visualise it
Plot the data to see its range and likely correlation

In [0]:
ss_df.plot.scatter(x='shoe_size', y='height')

## Start to Filter
A height of zero is not a valid measurement, and leaving such rows in will distort the correlation.

Let's keep only the rows with a height greater than 100 cm.

In [0]:
ss_df[ss_df.height>100].plot.scatter(x='shoe_size', y='height')
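
Before trusting a filter it helps to count how many rows it removes. The sketch below uses a small made-up dataframe (`toy_df` and its values are hypothetical, not taken from the survey) to show the same boolean-mask pattern and the row count before and after.

```python
import pandas as pd

# Hypothetical toy data standing in for the real shoe-size survey.
toy_df = pd.DataFrame({
    "shoe_size": [38, 42, 44, 40, 39],
    "height": [165.0, 180.0, 0.0, 172.0, 364.0],
})

# A boolean mask is True for the rows we want to keep.
valid = toy_df.height > 100
kept = toy_df[valid]

# Count what the filter removed before relying on the result.
n_dropped = len(toy_df) - len(kept)
print(f"kept {len(kept)} rows, dropped {n_dropped}")
```

Note that this mask only removes the zero height; the implausibly tall value still passes and needs its own upper bound.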

## Outlier Validity?

One data point has a height of over 350 cm.

This seems improbable. We should discard this data point pending further information about the quality of the data.

In [0]:
# A height of over 350 cm seems improbable, so also require height < 300
ss_df[(ss_df.height>100) & (ss_df.height<300)].plot.scatter(x='shoe_size', y='height')

This looks much better.

## Make it so
Put the filtered data into a separate dataframe (called filtered_ss)

***Should we look at separating data by "sex"?***

In [0]:
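# Keep plausible heights only; .copy() gives an independent dataframe
# so later in-place changes do not trigger SettingWithCopyWarning
filtered_ss = ss_df[(ss_df.height>100) & (ss_df.height<300)].copy()
#filtered_ss.drop(['time','sex'],axis=1,inplace=True)
# All rows, then men only, then women only
filtered_ss.plot.scatter(x='shoe_size', y='height')
filtered_ss[filtered_ss.sex=='man'].plot.scatter(x='shoe_size', y='height')
filtered_ss[filtered_ss.sex=='woman'].plot.scatter(x='shoe_size', y='height')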

## Find the correlation between the columns

pandas provides a single `DataFrame.corr()` method to do this.

You tell `corr()` which method you wish to use:
* pearson : standard correlation coefficient
* kendall : Kendall Tau correlation coefficient
* spearman : Spearman rank correlation
* callable: callable with input two 1d ndarrays (your custom code)
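
As a rough illustration of how these options differ, the sketch below runs three of them on a small made-up dataframe. The data in `toy_df` and the helper `pearson_by_hand` are hypothetical; the helper just applies the textbook definition of Pearson's r so that the callable option has something to call.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: height rises with shoe size, but not in a straight line.
toy_df = pd.DataFrame({
    "shoe_size": [36, 38, 40, 42, 44],
    "height": [155.0, 160.0, 170.0, 185.0, 205.0],
})

def pearson_by_hand(a, b):
    """Custom callable: Pearson's r computed from its definition."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

pearson = toy_df.corr(method="pearson").loc["shoe_size", "height"]
spearman = toy_df.corr(method="spearman").loc["shoe_size", "height"]
custom = toy_df.corr(method=pearson_by_hand).loc["shoe_size", "height"]

print(f"pearson={pearson:.3f} spearman={spearman:.3f} custom={custom:.3f}")
```

Because the toy heights always increase with shoe size, the rank-based Spearman coefficient is 1, while Pearson is slightly lower because the relationship is not exactly linear. The custom callable should agree with the built-in Pearson result.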



In [0]:
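# Select the two numeric columns; recent pandas versions raise an error
# if corr() is given the text column 'sex'
filtered_ss[['shoe_size', 'height']].corr(method='pearson')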

The result is a matrix showing "column A compared with column B".

Height is perfectly correlated with height (a coefficient of 1) because each value is being compared with itself.

### What if we filter by "sex" - are there differences in correlation?

In [0]:
# Correlation for men only, restricted to the numeric columns
filtered_ss.loc[filtered_ss.sex=='man', ['shoe_size', 'height']].corr(method='pearson')

In [0]:
# Correlation for women only, restricted to the numeric columns
filtered_ss.loc[filtered_ss.sex=='woman', ['shoe_size', 'height']].corr(method='pearson')
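
The per-group comparison can also be written once with `groupby` instead of one cell per value of `sex`. This is a minimal sketch on made-up numbers (`toy_df` and its values are hypothetical); only the column names and the 'man' / 'woman' labels are taken from the cells above.

```python
import pandas as pd

# Hypothetical toy data; the labels mirror the 'man' / 'woman' values used above.
toy_df = pd.DataFrame({
    "sex": ["man", "man", "man", "woman", "woman", "woman"],
    "shoe_size": [42, 44, 46, 36, 38, 40],
    "height": [175.0, 182.0, 190.0, 158.0, 166.0, 171.0],
})

# One Pearson coefficient per group, computed between the two numeric columns.
per_sex = {
    sex: group["shoe_size"].corr(group["height"], method="pearson")
    for sex, group in toy_df.groupby("sex")
}

for sex, r in per_sex.items():
    print(f"{sex}: r = {r:.3f}")
```

The same dictionary comprehension should work on `filtered_ss` in place of `toy_df`, assuming its `sex` column has no missing values.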

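The independent-samples t-test (`ttest_ind`) from `scipy.stats` gives us a p-value. Note that it tests whether the two columns have the same mean; it does not test whether they are correlated.

A p-value below 0.05 is conventionally interpreted as statistically significant.

What is our p-value for shoe size vs height?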

In [0]:
from scipy.stats import ttest_ind

# Welch's t-test (equal_var=False): compares the mean height with the mean shoe size
ttest_ind(filtered_ss.height, filtered_ss.shoe_size, equal_var=False)
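
A t-test on the two columns compares their means, so its p-value says little about how shoe size and height move together. If the question is whether the Pearson correlation itself is significant, one option is `scipy.stats.pearsonr`, which returns the coefficient together with a p-value for the null hypothesis of no linear correlation. The sketch below is an assumption about what the exercise is after, shown on made-up data (`toy_df` is hypothetical).

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical toy data standing in for the filtered survey.
toy_df = pd.DataFrame({
    "shoe_size": [36, 37, 38, 39, 40, 41, 42, 43],
    "height": [158.0, 160.0, 166.0, 165.0, 172.0, 175.0, 174.0, 182.0],
})

# r is Pearson's coefficient; p_value tests the null hypothesis r == 0.
r, p_value = pearsonr(toy_df.shoe_size, toy_df.height)

print(f"r = {r:.3f}, p = {p_value:.5f}")
```

On the real data the equivalent call would be `pearsonr(filtered_ss.shoe_size, filtered_ss.height)`, provided neither column contains missing values.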