# Tutorial 1

data9b_w.txt and data9b_m.txt are tab-separated values tables in which each row records the number of steps that a person took on a particular day (steps) and that person's body mass index (bmi). data9b_w.txt contains data from women, while data9b_m.txt contains data from men.

## Load required packages

In [None]:
# Python has many useful built-in functions, but you can also add
# functionality to Python using packages written by others.
# After installing them through PyPI (pip) or Anaconda (conda),
# they can be loaded with "import name".
#
# Added functions are then referenced as name.function, so it is often
# helpful to shorten the package name with "import name as abbreviation".
# Python also allows selective loading of parts of packages using "from",
# for example "from collections import Counter".

import pandas as pd
import numpy as np
import scipy as sp
import scipy.stats  # load the stats submodule explicitly so that sp.stats is available
import seaborn as sns
import matplotlib.pyplot as plt

## Data Ingest

In [None]:
# Read in tab separated values tables using the read_csv function from pandas (pd)
men = pd.read_csv('data9b_m.txt', sep = "\t")
women = pd.read_csv('data9b_w.txt', sep = "\t")
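
A quick sanity check after loading is to confirm the column names, table size, and column types. The sketch below reads a small in-memory table rather than the real files, so it runs anywhere. The values are invented for illustration, and the column order is an assumption; only the column names steps and bmi come from the description above.

```python
import io
import pandas as pd

# Hypothetical three-row table standing in for data9b_m.txt (values are invented)
example_tsv = "steps\tbmi\n8000\t24.5\n12000\t22.1\n5000\t27.3\n"

# io.StringIO lets read_csv treat a string as if it were a file
example = pd.read_csv(io.StringIO(example_tsv), sep="\t")

# Confirm the table has the expected columns, size, and numeric types
print(example.columns.tolist())
print(example.shape)
print(example.dtypes)
```

The same three checks can be run on men and women once the real files have been read in.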

In [None]:
# The print function can be used to display variables
# pandas truncates the middle of large tables when they are printed
print(men)
print(women)

## Testing null hypotheses

Assume that both traits are normally distributed for males and for females. Consider the following (alternative, not null) hypotheses:

a) There is a difference in the mean number of steps between women and men.

b) The correlation coefficient between steps and bmi is negative for women.

c) The correlation coefficient between steps and bmi is positive for men.

In [None]:
# You can access columns of a pandas DataFrame (pandas' name for a table)
# with the format dataframe.column. For example, men.steps returns the
# step counts from every row of the men's dataset.
#
# We can then use the mean(), median(), and std() functions from the numpy library
# to obtain the mean, median, and standard deviation of each column.
# Note that np.std() returns the population standard deviation by default (ddof=0).
# Finally, we use the print function to display the results.
# The "+" operator joins strings together, and str() converts the
# numbers returned by numpy to strings so that they can be joined.

print("Means\nMen: " + str(np.mean(men.steps)) + "\nWomen: " + str(np.mean(women.steps)) + "\n")

print("Medians\nMen: " + str(np.median(men.steps)) + "\nWomen: " + str(np.median(women.steps)) + "\n")

print("Standard Deviation\nMen: " + str(np.std(men.steps)) + "\nWomen: " + str(np.std(women.steps)) + "\n")
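
The standard deviation comes in two forms, and it is worth knowing which one a function returns. The sketch below uses made-up step counts (not values from the data files) to show that np.std() divides by n by default, while the pandas .std() method divides by n - 1.

```python
import numpy as np
import pandas as pd

# Made-up step counts for illustration only (not taken from the data files)
steps = pd.Series([4000, 6000, 8000, 10000, 12000])

population_sd = np.std(steps)       # ddof=0 (numpy default): divides by n
sample_sd = np.std(steps, ddof=1)   # ddof=1: divides by n - 1
pandas_sd = steps.std()             # pandas defaults to ddof=1

print(f"Population SD: {population_sd:.1f}")
print(f"Sample SD:     {sample_sd:.1f}")
print(f"pandas .std(): {pandas_sd:.1f}")
```

For samples as small as this one the two versions differ noticeably; the gap shrinks as the number of rows grows.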

In [None]:
# scipy.stats has a lot of useful statistical functions like t-tests and correlations
# Here we use the ttest_ind() and pearsonr() functions to test the null hypotheses
# Both functions report two-sided p-values by default

print("Significant difference in steps?")
print(sp.stats.ttest_ind(men.steps, women.steps))

print("\nCorrelation between men steps and bmi?")
print(sp.stats.pearsonr(men.steps, men.bmi))

print("\nCorrelation between women steps and bmi?")
print(sp.stats.pearsonr(women.steps, women.bmi))
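
Hypotheses b and c are directional, whereas pearsonr() reports a two-sided p-value by default. One possible way to handle this is sketched below: halve the two-sided p-value when the observed correlation has the hypothesized sign, and otherwise use one minus half of it. The helper one_sided_p is hypothetical (it is not part of scipy), the data are synthetic, and the 0.05 threshold is an assumed significance level that the tutorial does not specify.

```python
import numpy as np
from scipy import stats

def one_sided_p(r, p_two_sided, direction):
    """Convert a two-sided p-value into a one-sided one (hypothetical helper).

    direction is "negative" (alternative: r < 0) or "positive" (alternative: r > 0).
    """
    observed_matches = (r < 0) if direction == "negative" else (r > 0)
    return p_two_sided / 2 if observed_matches else 1 - p_two_sided / 2

# Synthetic data with a built-in negative relationship (illustration only)
rng = np.random.default_rng(0)
steps = rng.normal(8000, 2000, size=50)
bmi = 30 - 0.0005 * steps + rng.normal(0, 1, size=50)

# pearsonr returns the correlation coefficient and the two-sided p-value
r, p_two = stats.pearsonr(steps, bmi)
p_neg = one_sided_p(r, p_two, "negative")

alpha = 0.05  # assumed significance level; the tutorial does not fix one
print("r =", r)
print("two-sided p =", p_two)
print("one-sided p (r < 0) =", p_neg)
print("Reject null at alpha?", p_neg < alpha)
```

On the real data, the same helper could be applied to the output of pearsonr(women.steps, women.bmi) with "negative" and pearsonr(men.steps, men.bmi) with "positive".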

**Question:** For each null hypothesis, do we reject it or fail to reject it?

**Answer:**

## Further exploration

**Question:** What other conclusions can you draw from the data? Two examples of exploratory plotting are included below; come up with others!

**Answer:**

In [None]:
plt.rcParams['figure.figsize'] = [6, 6]
# Example 1: Scatterplot of male against female step counts, with rows paired by position in the two tables
sns.scatterplot(x=men.steps, y=women.steps);

In [None]:
plt.rcParams['figure.figsize'] = [6, 6]
# Example 2: Visually check for normality in female bmi
sns.histplot(women.bmi, kde=True);
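
A histogram only gives a visual impression of normality. As a possible supplement, the sketch below applies the Shapiro-Wilk test from scipy.stats to synthetic BMI-like values; the numbers are generated for illustration and are not taken from the data files.

```python
import numpy as np
from scipy import stats

# Synthetic BMI-like values drawn from a normal distribution (illustration only)
rng = np.random.default_rng(1)
bmi_sample = rng.normal(loc=25, scale=3, size=100)

# Null hypothesis of the Shapiro-Wilk test: the sample comes from a normal distribution
statistic, p_value = stats.shapiro(bmi_sample)

print("W =", statistic)
print("p =", p_value)
```

A small p-value is evidence against normality, while a large one does not prove that the data are normal. On the real data, women.bmi would take the place of bmi_sample.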