# Data Science Mathematics
# Statistical Inference
# In-Class Activity

Let's calculate a p-value for our data set.  First, let's import the relevant libraries.

In [125]:
import pandas as pd
import numpy as np
from scipy import stats

Now, let's import our data set.  You need to specify the absolute path of the data set (Dataset_Session5.csv).

In [126]:
dataset = r'Dataset_Session5.csv'  # Specify absolute path here.
data = pd.read_csv(dataset)
data.head()

| Unnamed: 0 | Date | Similarity Metric (0.00-1.00) |
|---|---|---|
| 0 | 18-Feb-18 | 0.0614 |
| 1 | 19-Feb-18 | 0.0913 |
| 2 | 20-Feb-18 | 0.0368 |
| 3 | 21-Feb-18 | 0.0213 |
| 4 | 22-Feb-18 | 0.0098 |


Now, let's get the data we want in the correct format: a Numpy array.

In [127]:
data_series = data['Similarity Metric (0.00-1.00)'].to_numpy()

Now we can calculate some descriptive statistics.

In [128]:
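# Mean and standard deviation.
# Note: np.std defaults to ddof=0 (population SD, dividing by n);
# the t-test below uses the sample SD (ddof=1, dividing by n - 1).
m = np.average(data_series)
sd = np.std(data_series)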

Now print the values in the cell below:

In [129]:
print(" Mean: {}".format(m), "\n", "Standard Deviation: {}".format(sd))

 Mean: 0.07577999999999999 
 Standard Deviation: 0.14518890774895077
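
The standard deviation printed above comes from `np.std` with its default `ddof=0`, which divides by n. The hand calculation later in this notebook divides by n - 1 and gives a slightly larger value (0.1464 instead of 0.1452). A minimal sketch of the difference, using a short made-up array (`sample` is hypothetical and is not the course data set):

```python
import numpy as np

# Hypothetical values for illustration only; not the course data set.
sample = np.array([0.06, 0.09, 0.04, 0.02, 0.01])

pop_sd = np.std(sample)             # ddof=0 (NumPy default): divides by n
sample_sd = np.std(sample, ddof=1)  # ddof=1: divides by n - 1

print("Population SD (ddof=0):", pop_sd)
print("Sample SD (ddof=1):", sample_sd)
```

The sample version is the one a one-sample t-test relies on, so it is the safer choice whenever the standard deviation feeds into a t statistic.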


At first glance, does it look like our value on 18 March 2018 is much different from the mean?

We know that the value on 18 March 2018 is 0.8965. To see whether that value is anomalous, we perform a one-sample t-test, treating 0.8965 as the hypothesized "population mean" against which the sample mean is compared.

In [130]:
pop_mean1 = 0.8965

So let's calculate our t statistic and our two-tailed p-value.

In [131]:
t, p = stats.ttest_1samp(data_series, pop_mean1, axis=0)

Now let's print our results.

In [132]:
print('t statistic: {}'.format(t))
print('p-value: {}'.format(p))

t statistic: -43.41977659265283
p-value: 1.749492641970814e-46
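
`stats.ttest_1samp` returns a two-tailed p-value by default. If a one-tailed test is wanted instead, recent SciPy versions (1.6 or newer) accept an `alternative` keyword. The sketch below uses a short made-up array (`sample` is hypothetical and is not the course data set) and the 18 March 2018 value from the text:

```python
import numpy as np
from scipy import stats

# Hypothetical values for illustration only; not the course data set.
sample = np.array([0.06, 0.09, 0.04, 0.02, 0.01, 0.12, 0.03, 0.07])
reference_value = 0.8965  # value for 18 March 2018, from the text above

# Two-sided test (SciPy's default): H0 is mean == reference_value.
t_two, p_two = stats.ttest_1samp(sample, reference_value)

# One-sided test: H1 is mean < reference_value.
# The `alternative` keyword requires SciPy 1.6 or newer.
t_one, p_one = stats.ttest_1samp(sample, reference_value, alternative='less')

print('two-sided: t = {}, p = {}'.format(t_two, p_two))
print('one-sided: t = {}, p = {}'.format(t_one, p_one))
```

The t statistic is the same in both calls; only the p-value changes. When the statistic falls in the tail named by `alternative`, the one-sided p-value is half the two-sided one.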


Do we reject the null hypothesis?

**Now save your output.** Go to File -> Print Preview and save your final output as a PDF. Turn it in to your instructor, along with any additional sheets.

## Homework

a) Null Hypothesis = The mean tweet similarity metric for all days will be less than or equal to the tweet similarity metric from significant Russian political events among the target Twitter group.

Rejecting the null hypothesis would mean that our analysis indicates a significant difference between the tweet similarity metric on 18 March 2018 and the mean tweet similarity metric of all days. Statistical inference is the process of using sample data to draw conclusions about population parameters. In those terms, rejecting the null hypothesis indicates that significant events, such as Russian political elections, significantly increase the similarity metric of tweets for this specific target population.

Note that this single test is of limited use on its own, because many confounding variables could influence the results of this methodology. It would be necessary to draw several more samples, centered on different Russian events and on several different target Twitter groups, in order to generate inferences with high validity and reliability.

b) We are able to detect an anomalous event given the dataset and an alpha value of 5%. After performing a one-sample t-test, the p-value was calculated to be 1.749e-46 (see both above and below). This value is far lower than the alpha value of 0.05, indicating that results this extreme would be very unlikely to occur by chance alone if the null hypothesis were true. As a result, we reject the null hypothesis, and we can infer that the Russian election did correlate with a significant increase in the tweet similarity metric.

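The same conclusion follows from the t-value, which was calculated at -43.42. The t-distribution table for a one-tailed t-test at an alpha value of 0.05 and df = 59 gives a critical t-value of 1.6716. The rejection region of a one-tailed test lies in a single tail, and the observed t-value is negative, so the comparison here is against -1.6716; the calculated value of -43.42 lies far beyond it, and the null hypothesis can be rejected. (The rule "higher than the critical value or lower than its negative" belongs to a two-tailed test, whose critical value at the same alpha and df is larger, roughly 2.00. The calculated t-value exceeds that cut-off as well.)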
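
Critical values can also be computed directly instead of being read from a table. The sketch below is one way to do it with `scipy.stats.t.ppf`, assuming df = 59 and the t statistic reported above; the variable names are illustrative only. Computed values may differ slightly from table entries because of rounding or the table's df spacing.

```python
from scipy import stats

alpha = 0.05
df = 59  # n - 1 for the 60 observations

# One-tailed: all of alpha sits in a single tail.
t_crit_one = stats.t.ppf(1 - alpha, df)
# Two-tailed: alpha is split evenly between both tails.
t_crit_two = stats.t.ppf(1 - alpha / 2, df)

t_observed = -43.42  # t statistic reported above

print('one-tailed critical t: {:.4f}'.format(t_crit_one))
print('two-tailed critical t: {:.4f}'.format(t_crit_two))
print('reject (lower one-tailed):', t_observed < -t_crit_one)
print('reject (two-tailed):', abs(t_observed) > t_crit_two)
```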

c) As stated in part b, the null hypothesis would be rejected based on the calculated p-value.

d) Assuming an alpha value of .01, the critical t-value increases to 2.3917. This occurs because decreasing the alpha value makes the cut-off for significance more stringent, so the sample mean must lie further from the hypothesized mean for the result to be judged significant. The t-value expresses that distance in units of the standard error, standard deviation/sqrt(n), so it makes sense that the cut-off value increases when the alpha value is decreased.

Because the calculated t-value and p-value are so extreme in this example, reducing the alpha value to .01 would not change the conclusion. This is further evidence for the significance of the results.
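
To see how the cut-off moves as alpha shrinks, the one-tailed critical value can be tabulated for several alpha levels. This is a rough sketch assuming df = 59 and the t statistic reported above; the alpha levels and the `critical_values` dictionary are chosen for illustration only.

```python
from scipy import stats

df = 59               # n - 1 for the 60 observations
t_observed = -43.42   # t statistic reported above

critical_values = {}
for alpha in (0.10, 0.05, 0.01, 0.001):
    # One-tailed critical value: all of alpha sits in one tail.
    critical_values[alpha] = stats.t.ppf(1 - alpha, df)
    rejected = t_observed < -critical_values[alpha]
    print('alpha = {:<5} critical t = {:.4f} reject H0: {}'.format(
        alpha, critical_values[alpha], rejected))
```

The critical value grows as alpha decreases, yet the observed t statistic stays far beyond every cut-off in the list.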

In [133]:
# Sample mean, computed by hand.
x_bar1 = sum(data_series)/len(data_series)
x_bar1

0.07577999999999999

In [134]:
# Sum of squared deviations from the sample mean.
total = 0
for dat in data_series:
    total += (dat - x_bar1)**2

In [135]:
# Sample standard deviation: divide by n - 1, then take the square root.
sd1 = (total/(len(data_series) - 1))**.5
sd1

0.1464141523214201

In [136]:
n1 = len(data_series)
n1

60

In [137]:
def t_test(x_bar, pop_mean, sd, n):
    """Return the one-sample t statistic: (x_bar - pop_mean) / (sd / sqrt(n))."""
    t = (x_bar - pop_mean)/(sd/(n**.5))
    return t

In [138]:
t_test(x_bar1, pop_mean1, sd1, n1)

-43.41977659265283
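
The hand-computed t statistic matches the one returned by `stats.ttest_1samp`. To finish the manual calculation, the two-tailed p-value can be recovered from the t distribution with n - 1 degrees of freedom. A minimal sketch, assuming the values printed above (`t_manual` and `p_two_tailed` are illustrative names):

```python
from scipy import stats

t_manual = -43.41977659265283  # value returned by t_test above
n = 60                         # sample size (n1 above)
df = n - 1

# Two-tailed p-value: probability mass in both tails beyond |t|.
p_two_tailed = 2 * stats.t.sf(abs(t_manual), df)
print('two-tailed p-value: {}'.format(p_two_tailed))
```

The result should agree, up to floating-point rounding, with the p-value reported by `stats.ttest_1samp` earlier in the notebook.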