
# Introduction to Data Science in Python

## Week 4: Beyond Data Manipulation

## Basic Statistical Testing

In [3]:
# We use statistics in a lot of different ways in data science, and in this lecture, I want to refresh your
# knowledge of hypothesis testing, which is a core data analysis activity behind experimentation. The goal of
# hypothesis testing is to determine if, for instance, the two different conditions we have in an experiment
# have resulted in different impacts.

# Let's import our usual numpy and pandas libraries
import pandas as pd
import numpy as np

# Now let's bring in some new libraries from scipy
from scipy import stats

In [None]:
# Now, SciPy is an interesting collection of libraries for data science, and you'll use most or perhaps all of
# these libraries. The broader SciPy ecosystem includes numpy and pandas, plotting libraries such as
# matplotlib, and a number of scientific library functions as well. The scipy package itself builds on numpy.

In [4]:
# When we do hypothesis testing, we actually have two statements of interest: the first is our actual
# explanation, which we call the alternative hypothesis, and the second is that the explanation we have is not
# sufficient, and we call this the null hypothesis. Our actual testing method is to ask how compatible the
# data are with the null hypothesis. If we find strong enough evidence of a difference between groups, then we
# reject the null hypothesis in favor of our alternative.

# Let's see an example of this; we're going to use some grade data
df = pd.read_csv('grades.csv')
df.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,2015-12-13 17:06:10.750000000,51.49104,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000


In [5]:
# If we take a look inside the data frame, we see we have six different assignments. Let's look at the
# dimensions of this DataFrame
print("There are {} rows and {} columns".format(df.shape[0], df.shape[1]))

There are 2315 rows and 13 columns


In [7]:
# For the purpose of this lecture, let's segment this population into two pieces. Let's say those who finish
# the first assignment by the end of December 2015, we'll call them early finishers, and those who finish it 
# sometime after that, we'll call them late finishers.

early_finishers=df[pd.to_datetime(df['assignment1_submission'])<'2016']

print(type(early_finishers['assignment1_submission']))

early_finishers.head()


<class 'pandas.core.series.Series'>


Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
0,B73F2C11-70F0-E37D-8B10-1D20AFED50B1,92.733946,2015-11-02 06:55:34.282000000,83.030552,2015-11-09 02:22:58.938000000,67.164441,2015-11-12 08:58:33.998000000,53.011553,2015-11-16 01:21:24.663000000,47.710398,2015-11-20 13:24:59.692000000,38.168318,2015-11-22 18:31:15.934000000
1,98A0FAE0-A19A-13D2-4BB5-CFBFD94031D1,86.790821,2015-11-29 14:57:44.429000000,86.290821,2015-12-06 17:41:18.449000000,69.772657,2015-12-10 08:54:55.904000000,55.098125,2015-12-13 17:32:30.941000000,49.588313,2015-12-19 23:26:39.285000000,44.629482,2015-12-21 17:07:24.275000000
4,5ECBEEB6-F1CE-80AE-3164-E45E99473FB4,64.8138,2015-12-13 17:06:10.750000000,51.49104,2015-12-14 12:25:12.056000000,41.932832,2015-12-29 14:25:22.594000000,36.929549,2015-12-28 01:29:55.901000000,33.236594,2015-12-29 14:46:06.628000000,33.236594,2016-01-05 01:06:59.546000000
5,D09000A0-827B-C0FF-3433-BF8FF286E15B,71.647278,2015-12-28 04:35:32.836000000,64.05255,2016-01-03 21:05:38.392000000,64.75255,2016-01-07 08:55:43.692000000,57.467295,2016-01-11 00:45:28.706000000,57.467295,2016-01-11 00:54:13.579000000,57.467295,2016-01-20 19:54:46.166000000
8,C9D51293-BD58-F113-4167-A7C0BAFCB6E5,66.595568,2015-12-25 02:29:28.415000000,52.916454,2015-12-31 01:42:30.046000000,48.344809,2016-01-05 23:34:02.180000000,47.444809,2016-01-02 07:48:42.517000000,37.955847,2016-01-03 21:27:04.266000000,37.955847,2016-01-19 15:24:31.060000000


In [12]:
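# How would you go about getting the late_finishers dataframe?

# Using >= rather than > keeps a submission stamped exactly 2016-01-01 00:00:00 from falling outside both groups
late_finishers2=df[pd.to_datetime(df['assignment1_submission'])>='2016']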

late_finishers2.head()


Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
6,3217BE3F-E4B0-C3B6-9F64-462456819CE4,87.498744,2016-03-05 11:05:25.408000000,69.998995,2016-03-09 07:29:52.405000000,55.999196,2016-03-16 22:31:24.316000000,50.399276,2016-03-18 07:19:26.032000000,45.359349,2016-03-19 10:35:41.869000000,45.359349,2016-03-23 14:02:00.987000000
7,F1CB5AA1-B3DE-5460-FAFF-BE951FD38B5F,80.57609,2016-01-24 18:24:25.619000000,72.518481,2016-01-27 13:37:12.943000000,65.266633,2016-01-30 14:34:36.581000000,65.266633,2016-02-03 22:08:49.002000000,65.266633,2016-02-16 14:22:23.664000000,65.266633,2016-02-18 08:35:04.796000000
9,E2C617C2-4654-622C-AB50-1550C4BE42A0,59.270882,2016-03-06 12:06:26.185000000,59.270882,2016-03-13 02:07:25.289000000,53.343794,2016-03-17 07:30:09.241000000,53.343794,2016-03-20 21:45:56.229000000,42.675035,2016-03-27 15:55:04.414000000,38.407532,2016-03-30 20:33:13.554000000


In [13]:
# Here's my solution. First, the dataframe df and the early_finishers share index values, so I really just
# want everything in the df which is not in early_finishers

late_finishers=df[~df.index.isin(early_finishers.index)]

late_finishers.head()

Unnamed: 0,student_id,assignment1_grade,assignment1_submission,assignment2_grade,assignment2_submission,assignment3_grade,assignment3_submission,assignment4_grade,assignment4_submission,assignment5_grade,assignment5_submission,assignment6_grade,assignment6_submission
2,D0F62040-CEB0-904C-F563-2F8620916C4E,85.512541,2016-01-09 05:36:02.389000000,85.512541,2016-01-09 06:39:44.416000000,68.410033,2016-01-15 20:22:45.882000000,54.728026,2016-01-11 12:41:50.749000000,49.255224,2016-01-11 17:31:12.489000000,44.329701,2016-01-17 16:24:42.765000000
3,FFDF2B2C-F514-EF7F-6538-A6A53518E9DC,86.030665,2016-04-30 06:50:39.801000000,68.824532,2016-04-30 17:20:38.727000000,61.942079,2016-05-12 07:47:16.326000000,49.553663,2016-05-07 16:09:20.485000000,49.553663,2016-05-24 12:51:18.016000000,44.598297,2016-05-26 08:09:12.058000000
6,3217BE3F-E4B0-C3B6-9F64-462456819CE4,87.498744,2016-03-05 11:05:25.408000000,69.998995,2016-03-09 07:29:52.405000000,55.999196,2016-03-16 22:31:24.316000000,50.399276,2016-03-18 07:19:26.032000000,45.359349,2016-03-19 10:35:41.869000000,45.359349,2016-03-23 14:02:00.987000000
7,F1CB5AA1-B3DE-5460-FAFF-BE951FD38B5F,80.57609,2016-01-24 18:24:25.619000000,72.518481,2016-01-27 13:37:12.943000000,65.266633,2016-01-30 14:34:36.581000000,65.266633,2016-02-03 22:08:49.002000000,65.266633,2016-02-16 14:22:23.664000000,65.266633,2016-02-18 08:35:04.796000000
9,E2C617C2-4654-622C-AB50-1550C4BE42A0,59.270882,2016-03-06 12:06:26.185000000,59.270882,2016-03-13 02:07:25.289000000,53.343794,2016-03-17 07:30:09.241000000,53.343794,2016-03-20 21:45:56.229000000,42.675035,2016-03-27 15:55:04.414000000,38.407532,2016-03-30 20:33:13.554000000


In [None]:
# There are lots of other ways to do this. For instance, you could just copy and paste the first boolean mask
# and change the sign from less than to greater than or equal to. This is ok, but if you decide you want to
# change the date down the road you have to remember to change it in two places. You could also do a join of
# the dataframe df with early_finishers: a left join keeps every row of the left dataframe, so you would then
# keep only the rows that found no match in early_finishers. You also could have written a function that
# determines if someone is early or late, and then called .apply() on the dataframe and added a new column to
# the dataframe. This is a pretty reasonable answer as well.
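
The alternatives described above can be sketched as follows. This is only an illustration under stated assumptions: `toy` is a hypothetical four-row stand-in for `grades.csv` with invented values, and the names `cutoff`, `is_early`, and `finisher_label` are not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature stand-in for grades.csv (values invented for illustration)
toy = pd.DataFrame({
    "student_id": ["a", "b", "c", "d"],
    "assignment1_submission": [
        "2015-11-02 06:55:34", "2016-01-09 05:36:02",
        "2015-12-13 17:06:10", "2016-04-30 06:50:39",
    ],
})

# Option 1: build the mask once and reuse it, so the cutoff date lives in one place
cutoff = pd.Timestamp("2016-01-01")
is_early = pd.to_datetime(toy["assignment1_submission"]) < cutoff
toy_early = toy[is_early]
toy_late = toy[~is_early]

# Option 2: left join with indicator=True, then keep rows found only in the left frame
merged = toy.merge(toy_early[["student_id"]], on="student_id", how="left", indicator=True)
toy_late_join = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# Option 3: label each row with .apply() and store the label in a new column
def finisher_label(submitted):
    return "early" if pd.to_datetime(submitted) < cutoff else "late"

toy["finisher"] = toy["assignment1_submission"].apply(finisher_label)

print(toy_late["student_id"].tolist())
print(toy_late_join["student_id"].tolist())
print(toy["finisher"].tolist())
```

Reusing a single mask (option 1) avoids the "change it in two places" problem, while the join and `.apply()` variants do more work for the same split.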

In [14]:
# As you've seen, the pandas data frame object has a variety of statistical functions associated with it. If
# we call the mean function directly on the data frame, we see that the mean of each assignment is
# calculated.

# Let's compare the means for our two populations

print(early_finishers['assignment1_grade'].mean())
print(late_finishers['assignment1_grade'].mean())

74.94728457024303
74.0450648477065


In [15]:
# Ok, these look pretty similar. But, are they the same? What do we mean by similar? This is where the
# Student's t-test comes in. It allows us to form the alternative hypothesis ("These are different") as well
# as the null hypothesis ("These are the same") and then test that null hypothesis.

# When doing hypothesis testing, we have to choose a significance level as a threshold for how much of a
# chance of a false positive we're willing to accept. This significance level is typically called alpha.

# For this example, let's use a threshold of 0.05 for our alpha, or 5%. Now this is a commonly used number
# but it's really quite arbitrary.

# The SciPy library contains a number of different statistical tests and forms a basis for hypothesis testing
# in Python and we're going to use the ttest_ind() function which does an independent t-test (meaning the
# populations are not related to one another).

# The result of ttest_ind() is the t-statistic and a p-value.
# It's this latter value, the probability, which is most important to us, as it indicates the chance (between
# 0 and 1) of seeing a difference at least as large as the one we observed if the null hypothesis were true.

# Let's bring in our ttest_ind function
from scipy.stats import ttest_ind 

# Let's run this function with our two populations, looking at the assignment 1 grades
ttest_ind(early_finishers['assignment1_grade'],late_finishers['assignment1_grade'])

Ttest_indResult(statistic=1.322354085372139, pvalue=0.1861810110171455)

In [None]:
# So here we see that the p-value is about 0.19, and this is above our alpha value of 0.05. This means that we
# cannot reject the null hypothesis. The null hypothesis was that the two populations are the same, and we
# don't have enough certainty in our evidence (because the p-value is greater than alpha) to come to a
# conclusion to the contrary. This doesn't mean that we have proven the populations are the same.
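
To connect the t-statistic and the p-value in the result above, here is a minimal sketch that recomputes the two-sided p-value from the reported statistic. It assumes the default pooled-variance test, so the degrees of freedom are n1 + n2 - 2 = 2313 (the two groups together cover all 2315 rows); the variable names are illustrative only.

```python
from scipy import stats

# t-statistic printed by ttest_ind for the assignment 1 grades
t_stat = 1.322354085372139
# Assumption: equal_var=True (the ttest_ind default), so dof = n1 + n2 - 2
dof = 2315 - 2

# Two-sided p-value: probability of a |t| at least this large if the null hypothesis is true
p_value = 2 * stats.t.sf(abs(t_stat), dof)

# Compare against the significance level chosen in the lecture
alpha = 0.05
reject_null = p_value <= alpha
print(p_value, reject_null)
```

The decision rule is the comparison on the last lines: the null hypothesis is rejected only when the p-value is at or below alpha.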

In [17]:
# Why don't we check the other assignment grades?
print(ttest_ind(early_finishers['assignment2_grade'],late_finishers['assignment2_grade']))
print(ttest_ind(early_finishers['assignment3_grade'],late_finishers['assignment3_grade']))
print(ttest_ind(early_finishers['assignment4_grade'],late_finishers['assignment4_grade']))
print(ttest_ind(early_finishers['assignment5_grade'],late_finishers['assignment5_grade']))
print(ttest_ind(early_finishers['assignment6_grade'],late_finishers['assignment6_grade']))

Ttest_indResult(statistic=1.2514717608216366, pvalue=0.2108889627004424)
Ttest_indResult(statistic=1.6133726558705392, pvalue=0.10679998102227865)
Ttest_indResult(statistic=0.049671157386456125, pvalue=0.960388729789337)
Ttest_indResult(statistic=-0.05279315545404755, pvalue=0.9579012739746492)
Ttest_indResult(statistic=-0.11609743352612056, pvalue=0.9075854011989656)


In [None]:
# Ok, so it looks like in this data we do not have enough evidence to suggest the populations differ with
# respect to grade. Let's take a look at those p-values for a moment though, because they are saying things
# that can inform experimental design down the road. 

# For instance, one of the assignments, assignment 3, has a p-value around 0.1.
# This means that if we had accepted an alpha of 11%, this result would have been
# considered statistically significant. As a researcher, this would suggest to me that there is something here
# worth considering following up on. For instance, if we had a small number of participants (we don't) or if
# there was something unique about this assignment as it relates to our experiment (whatever it was) then
# there may be followup experiments we could run.

In [18]:
# P-values have come under fire recently for being insufficient for telling us enough about the interactions
# which are happening, and two other techniques, confidence intervals and Bayesian analyses, are being used
# more regularly. One issue with p-values is that as you run more tests you are likely to get a value which
# is statistically significant just by chance.

# Let's see a simulation of this. First, let's create a data frame of 100 columns, each with 100 numbers

df1=pd.DataFrame([np.random.random(100) for x in range(100)])  # each array becomes one row of 100 values

df1.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99
0,0.818277,0.233608,0.518128,0.61033,0.195709,0.100853,0.810955,0.740671,0.155037,0.145705,0.442893,0.312357,0.997357,0.160068,0.060567,0.330716,0.750934,0.155802,0.134776,0.060802,0.736405,0.837134,0.571161,0.55736,0.370893,0.56121,0.067348,0.051619,0.161962,0.308803,0.866842,0.364688,0.881721,0.764948,0.355653,0.072992,0.004618,0.512695,0.667528,0.278086,...,0.688559,0.176884,0.142137,0.595805,0.751944,0.090052,0.95695,0.655991,0.427692,0.322871,0.336547,0.998436,0.04483,0.228977,0.139382,0.206863,0.871421,0.948399,0.419822,0.890982,0.598638,0.287521,0.35867,0.45824,0.327219,0.469971,0.623924,0.145221,0.860122,0.835407,0.483336,0.934659,0.133325,0.488039,0.697272,0.396496,0.716929,0.03135,0.321844,0.026569
1,0.51135,0.850737,0.622293,0.027536,0.58622,0.570966,0.411718,0.354321,0.278828,0.089558,0.520698,0.92231,0.676812,0.975383,0.877069,0.919266,0.679872,0.278006,0.096081,0.753526,0.503679,0.911471,0.592548,0.324981,0.686097,0.090539,0.147167,0.25885,0.816539,0.240602,0.284821,0.488813,0.231613,0.652234,0.815487,0.40699,0.136763,0.005417,0.370829,0.943688,...,0.75036,0.909496,0.499873,0.738233,0.636898,0.808294,0.496766,0.409117,0.481659,0.816139,0.915172,0.417053,0.03729,0.123902,0.946377,0.547728,0.09799,0.969038,0.488535,0.534029,0.372771,0.250365,0.623569,0.679341,0.495421,0.351866,0.082751,0.553675,0.079547,0.600957,0.28165,0.284411,0.535888,0.854358,0.173917,0.240181,0.205953,0.345446,0.596038,0.152388
2,0.671104,0.680082,0.214162,0.496686,0.527034,0.522543,0.312727,0.467914,0.468365,0.357307,0.488672,0.026002,0.266188,0.506849,0.240952,0.902955,0.553964,0.185046,0.929856,0.927359,0.741997,0.680518,0.927719,0.667827,0.935045,0.181294,0.533388,0.252525,0.489838,0.890991,0.895952,0.704061,0.956294,0.887262,0.876846,0.326498,0.189796,0.080183,0.725301,0.047,...,0.616178,0.464799,0.269118,0.992395,0.367223,0.559672,0.85469,0.118635,0.968611,0.133317,0.272445,0.035222,0.730131,0.110186,0.205423,0.750984,0.546255,0.456043,0.50995,0.470678,0.55017,0.019251,0.374774,0.103766,0.072768,0.155605,0.641769,0.395292,0.814257,0.24719,0.74944,0.299144,0.180745,0.53111,0.508201,0.876599,0.460025,0.960571,0.279017,0.584619
3,0.807513,0.586911,0.564979,0.047946,0.26078,0.419999,0.813494,0.831799,0.963824,0.69776,0.240277,0.613946,0.836159,0.627284,0.551211,0.955532,0.048552,0.069001,0.039163,0.511676,0.701746,0.1056,0.650474,0.259368,0.164167,0.876202,0.453437,0.707517,0.56655,0.004749,0.327701,0.39637,0.26884,0.373736,0.524732,0.526039,0.251371,0.263491,0.813904,0.187576,...,0.944553,0.400379,0.996747,0.030102,0.260854,0.382516,0.92113,0.149605,0.273989,0.896228,0.189673,0.444857,0.850275,0.108514,0.384534,0.385041,0.337417,0.363923,0.109052,0.908007,0.319534,0.963387,0.246667,0.158023,0.154372,0.069946,0.518156,0.637051,0.466489,0.163116,0.580611,0.423641,0.191118,0.617849,0.467387,0.858818,0.109349,0.534692,0.624827,0.215953
4,0.640545,0.952519,0.124015,0.03901,0.879437,0.130547,0.388832,0.332975,0.97952,0.505161,0.421468,0.943437,0.784898,0.297371,0.98785,0.47999,0.814713,0.420326,0.103438,0.657838,0.493729,0.322063,0.917259,0.356038,0.649316,0.396615,0.324807,0.544297,0.599277,0.058952,0.759891,0.885951,0.801895,0.163879,0.661054,0.393278,0.297883,0.543983,0.017936,0.259712,...,0.453008,0.80932,0.701571,0.053833,0.887011,0.357287,0.339245,0.465887,0.835052,0.044189,0.730994,0.224889,0.423724,0.361777,0.551211,0.452273,0.146676,0.703315,0.993713,0.505097,0.915539,0.345159,0.167939,0.701688,0.894131,0.461347,0.099962,0.288209,0.388458,0.966105,0.464786,0.272398,0.376983,0.343652,0.296716,0.063981,0.686916,0.360091,0.604852,0.833728


In [20]:
# Ok, let's create a second dataframe

df2=pd.DataFrame([np.random.random(100) for x in range(100)])
df2.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99
0,0.421163,0.497285,0.257804,0.75666,0.941861,0.729373,0.579277,0.952047,0.633942,0.034157,0.9524,0.42688,0.29177,0.929623,0.26458,0.512514,0.689713,0.112331,0.281042,0.833182,0.629293,0.925568,0.437328,0.364861,0.673443,0.517831,0.078445,0.474334,0.837862,0.916667,0.350025,0.487348,0.791029,0.527768,0.902243,0.007381,0.261241,0.852046,0.721642,0.784149,...,0.29998,0.772088,0.277825,0.877293,0.753875,0.84724,0.349904,0.19714,0.397241,0.97362,0.271412,0.456894,0.463097,0.075765,0.686967,0.682188,0.10359,0.196513,0.240473,0.306323,0.442605,0.79445,0.32845,0.330211,0.654027,0.449609,0.021356,0.689195,0.372164,0.769303,0.90384,0.78122,0.07659,0.585984,0.099104,0.584649,0.891952,0.14717,0.539485,0.344606
1,0.014441,0.50925,0.858099,0.530881,0.413645,0.628138,0.436286,0.993423,0.037184,0.072112,0.834712,0.673316,0.835497,0.020229,0.608879,0.73521,0.022592,0.840181,0.757378,0.556917,0.044663,0.065599,0.192649,0.923058,0.40492,0.517441,0.885316,0.748804,0.134315,0.814349,0.466764,0.810653,0.627651,0.301383,0.858305,0.085805,0.541686,0.097252,0.356889,0.364328,...,0.932666,0.754762,0.953101,0.107904,0.913569,0.027047,0.981784,0.622415,0.16276,0.688607,0.738109,0.94681,0.235624,0.867784,0.611618,0.180711,0.283093,0.219997,0.106478,0.921186,0.953534,0.950642,0.590741,0.843981,0.919829,0.222184,0.960137,0.004707,0.362058,0.078903,0.459933,0.79825,0.504317,0.029269,0.246199,0.557834,0.764946,0.575636,0.652884,0.845353
2,0.644802,0.002848,0.050685,0.458497,0.567392,0.18334,0.129881,0.180041,0.033273,0.145161,0.422416,0.259115,0.162285,0.718505,0.166314,0.831025,0.834772,0.877686,0.546152,0.681986,0.628428,0.054182,0.182449,0.672617,0.454024,0.720504,0.85988,0.496303,0.180191,0.469483,0.739849,0.58623,0.709133,0.288277,0.58974,0.697909,0.332961,0.064399,0.715348,0.048843,...,0.59756,0.636452,0.766567,0.488453,0.072133,0.357016,0.696652,0.567733,0.125515,0.295504,0.486698,0.195899,0.174206,0.079618,0.241478,0.156166,0.909936,0.218639,0.286138,0.531869,0.130905,0.998726,0.186992,0.618169,0.237312,0.824913,0.422749,0.009223,0.487514,0.49285,0.832198,0.086883,0.829833,0.286792,0.296285,0.61965,0.951393,0.193944,0.928602,0.758451
3,0.494647,0.148745,0.096672,0.820684,0.437282,0.048114,0.492941,0.425937,0.925796,0.107171,0.412025,0.577022,0.107701,0.636694,0.032142,0.378025,0.980374,0.395038,0.263272,0.665829,0.350452,0.228408,0.614001,0.559619,0.0061,0.622214,0.396043,0.010834,0.707523,0.068304,0.036801,0.505593,0.531104,0.23173,0.077736,0.633794,0.662122,0.499985,0.365098,0.230863,...,0.62637,0.427114,0.198079,0.052973,0.577367,0.278404,0.191261,0.453726,0.027959,0.499755,0.799943,0.195007,0.201428,0.819519,0.604838,0.208968,0.371638,0.201835,0.034679,0.071765,0.191705,0.036412,0.072458,0.032455,0.729885,0.081661,0.797511,0.894028,0.016895,0.659058,0.3385,0.011181,0.181511,0.799124,0.332658,0.978959,0.765132,0.756038,0.504018,0.304725
4,0.633099,0.396085,0.478736,0.220845,0.450311,0.694579,0.659734,0.574931,0.190544,0.098252,0.249805,0.936324,0.779909,0.293064,0.799492,0.866875,0.737707,0.414936,0.985985,0.413507,0.694851,0.964798,0.336507,0.589971,0.949901,0.205289,0.733263,0.557963,0.061503,0.355622,0.39738,0.00112,0.132939,0.580556,0.503232,0.912505,0.445534,0.442406,0.990035,0.028582,...,0.684443,0.6142,0.182721,0.079117,0.311673,0.376049,0.069722,0.525973,0.992675,0.923461,0.891244,0.416105,0.956417,0.864859,0.76787,0.557983,0.531811,0.997822,0.48077,0.326239,0.165627,0.984017,0.701705,0.33914,0.240446,0.693151,0.764057,0.812951,0.214569,0.177549,0.571052,0.714716,0.741981,0.237501,0.157134,0.79182,0.703566,0.086636,0.904248,0.025941


In [None]:
# Are these two DataFrames the same? Maybe a better question is, for a given column inside of df1, is it the
# same as the column inside df2?

# Let's take a look. Let's say our critical value is 0.1, or an alpha of 10%. And we're going to compare each
# column in df1 to the same numbered column in df2. And we'll report when the p-value is less than or equal to
# 10%, which means that we have sufficient evidence to say that the columns are different.



In [22]:
# Let's write this in a function called test_columns
def test_columns(alpha=0.1):
    # I want to keep track of how many differ
    num_diff=0
    # And now we can just iterate over the columns
    for col in df1.columns:
        # we can run our ttest_ind between the two dataframes
        teststat,pval=ttest_ind(df1[col], df2[col])
        # and we check the pvalue versus the alpha
        if pval<=alpha:
            # And now we'll just print out if they are different and increment the num_diff
            print("Col {} is statistically significantly different at alpha={}, pval={}".format(col,alpha,pval))
            num_diff=num_diff+1
    # and let's print out some summary stats
    print("Total number different was {}, which is {}%".format(num_diff,float(num_diff)/len(df1.columns)*100))

# And now let's actually run this
test_columns()

Col 2 is statistically significantly different at alpha=0.1, pval=0.09698006670238109
Col 11 is statistically significantly different at alpha=0.1, pval=0.015381448599358541
Col 15 is statistically significantly different at alpha=0.1, pval=0.024371446048831673
Col 21 is statistically significantly different at alpha=0.1, pval=0.0022989698051481184
Col 30 is statistically significantly different at alpha=0.1, pval=0.0633174131099543
Col 31 is statistically significantly different at alpha=0.1, pval=0.08063650051947202
Col 76 is statistically significantly different at alpha=0.1, pval=0.08882686508936156
Total number different was 7, which is 7.000000000000001%
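
The same column-by-column comparison that test_columns performs can also be done in one call, since `ttest_ind` accepts 2-D inputs together with an `axis` argument. The sketch below is an illustration rather than a replacement for the function above: `sim1`, `sim2`, and the seed are hypothetical, and because the data are freshly generated the count it prints will not match the run shown in this notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Seeded generator so the simulation is repeatable (the seed value is arbitrary)
rng = np.random.default_rng(0)
sim1 = pd.DataFrame(rng.random((100, 100)))
sim2 = pd.DataFrame(rng.random((100, 100)))

# axis=0 runs one independent t-test per column and returns arrays of results
result = ttest_ind(sim1.to_numpy(), sim2.to_numpy(), axis=0)
pvals = np.asarray(result.pvalue)

alpha = 0.1
num_diff = int((pvals <= alpha).sum())
print("Total number different was {} of {}".format(num_diff, len(pvals)))

# Cross-check against the explicit loop over columns
loop_pvals = np.array([ttest_ind(sim1[c], sim2[c]).pvalue for c in sim1.columns])
print(np.allclose(pvals, loop_pvals))
```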


In [24]:
# Interesting, so we see that there are a bunch of columns that are different! In fact, that number looks a
# lot like the alpha value we chose. So what's going on - shouldn't all of the columns be the same? Remember
# that all the t-test does is check whether two samples differ by more than we would expect by chance, given
# some significance level, in our case, 10%.

# The more random comparisons you do, the more will just happen to look different by chance. In this example,
# we checked 100 columns, so we would expect roughly 10 of them to be flagged if our alpha was 0.1.

# We can test some other alpha values as well

test_columns(0.05)

Col 11 is statistically significantly different at alpha=0.05, pval=0.015381448599358541
Col 15 is statistically significantly different at alpha=0.05, pval=0.024371446048831673
Col 21 is statistically significantly different at alpha=0.05, pval=0.0022989698051481184
Total number different was 3, which is 3.0%
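
The arithmetic behind these counts can be written out directly. The sketch below assumes the tests are independent (reasonable for the simulated columns) and uses hypothetical helper names; the Bonferroni adjustment is shown as one common, conservative way to control false positives and is not something the lecture itself applies.

```python
def expected_false_positives(num_tests, alpha):
    # With a true null in every test, each test is flagged with probability alpha
    return num_tests * alpha

def familywise_error_rate(num_tests, alpha):
    # Probability that at least one of the independent tests is flagged by chance
    return 1 - (1 - alpha) ** num_tests

def bonferroni_alpha(num_tests, alpha):
    # Per-test threshold that keeps the family-wise error rate at or below alpha
    return alpha / num_tests

for alpha in (0.1, 0.05):
    print(
        alpha,
        expected_false_positives(100, alpha),
        familywise_error_rate(100, alpha),
        bonferroni_alpha(100, alpha),
    )
```

With 100 tests, a chance "significant" result somewhere is close to certain at either alpha, which is why the 7 and 3 flagged columns above are unsurprising.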


In [25]:
# So, keep this in mind when you are doing statistical tests like the t-test which has a p-value. Understand
# that this p-value isn't magic, and that alpha is a threshold you choose when reporting results and trying
# to answer your hypothesis. What's a reasonable threshold? Depends on your question, and you need to engage
# domain experts to better understand what they would consider significant.

# Just for fun, let's recreate that second dataframe using a different, non-normal distribution. I'll
# arbitrarily choose chi squared

df2=pd.DataFrame([np.random.chisquare(df=1,size=100) for x in range(100)])

test_columns()

Col 0 is statistically significantly different at alpha=0.1, pval=0.0006365749161282661
Col 1 is statistically significantly different at alpha=0.1, pval=0.035205551893638005
Col 2 is statistically significantly different at alpha=0.1, pval=0.00031110260844063105
Col 3 is statistically significantly different at alpha=0.1, pval=0.0005830473894458481
Col 4 is statistically significantly different at alpha=0.1, pval=0.0007418892315833164
Col 5 is statistically significantly different at alpha=0.1, pval=3.377054367319157e-05
Col 6 is statistically significantly different at alpha=0.1, pval=0.06445329665785245
Col 7 is statistically significantly different at alpha=0.1, pval=0.0009403914506916738
Col 8 is statistically significantly different at alpha=0.1, pval=0.0849261720137337
Col 9 is statistically significantly different at alpha=0.1, pval=5.9299568129655804e-05
Col 10 is statistically significantly different at alpha=0.1, pval=0.0006667310094725716
Col 11 is statistically significant