### The Question to be Answered

Before we decide which test to use, we need to be clear about what we want to find out. Is it:

* To validate whether the population mean is correct
* To compare the difference between 2 groups of data to see whether the difference is statistically significant
* To validate whether there’s a relationship between 2 categorical variables

#### Null and Alternative Hypothesis
Next, decide on the hypotheses based on your test objective. A null hypothesis (H0) proposes that there is no effect or difference in the population, and an alternative hypothesis (H1) proposes otherwise.

To decide whether to reject the null hypothesis, a test statistic is calculated and then compared with a critical value.

The critical values are the boundaries of the critical region.
* If the absolute value of the test statistic > critical value, the null hypothesis is rejected.
* If the absolute value of the test statistic ≤ critical value, we fail to reject the null hypothesis (which is not the same as proving it true).
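
The decision rule can be sketched for a two-sided Z test as follows. The significance level and the test statistic below are made-up values for illustration only; they do not come from the dataset in this notebook.

```python
from scipy.stats import norm

alpha = 0.05   # significance level (assumed for this example)
z_stat = 2.3   # hypothetical test statistic, not computed from real data

# Two-sided test: alpha is split across both tails of the standard normal
z_crit = norm.ppf(1 - alpha / 2)

# Reject H0 when the statistic falls outside [-z_crit, z_crit]
reject_h0 = abs(z_stat) > z_crit
print(f"critical value: {z_crit:.3f}, reject H0: {reject_h0}")
```

For other tests (t, chi-squared, F) the same comparison applies, but the critical value comes from that test's own distribution.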

#### P-Value

Besides the test statistic, the p-value is another important result to look at. The p-value is the probability of observing a result at least as extreme as the one in the sample, assuming the null hypothesis is true. A p-value below the chosen significance level (commonly 0.05) is treated as evidence against the null hypothesis, so you reject it.
The lower the p-value, the stronger the evidence.


In conclusion, for a given test, the larger the absolute value of the test statistic, the smaller the p-value, and the greater the evidence against the null hypothesis.
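
As a small illustration of that relationship, the sketch below converts a few hypothetical Z statistics into two-sided p-values using the standard normal distribution. The helper name `two_sided_p_value` is introduced here for the example and is not part of any library.

```python
from scipy.stats import norm

def two_sided_p_value(z):
    """Two-sided p-value for a Z statistic under the standard normal."""
    # sf is the survival function, 1 - cdf; doubling covers both tails
    return 2 * norm.sf(abs(z))

# Hypothetical statistics: the p-value shrinks as |z| grows
for z in (0.5, 1.0, 1.96, 3.0):
    print(f"z = {z:>4}: p = {two_sided_p_value(z):.4f}")
```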


[Seeing Theory](https://seeing-theory.brown.edu/basic-probability/index.html#section1)

In [None]:
import os
import glob
import pandas as pd
import numpy as np
import filenames
import missingno as msno
import matplotlib.pyplot as plt

pd.set_option('display.max_rows', 500, 'display.max_columns', 500,
              'display.width', 1000)

### Data

Use glob to get all the CSV files in the raw data folder.

In [None]:
# profile_folder_path is expected to be a pathlib.Path; glob yields every CSV in it
profile_files = filenames.profile_folder_path.glob("*.csv")

profile_appended_data = []
# loop over the list of csv files
for f in profile_files:
    data = pd.read_csv(f)
    profile_appended_data.append(data)
#profile_appended_data

df = pd.concat(profile_appended_data)
df.reset_index(drop=True, inplace=True)

#### Drop duplicate userid

In [None]:
df = df.drop_duplicates(subset=['userid'], keep='last').reset_index(drop=True)

#### Create Label for Followers

In [None]:
import csv
fpath = filenames.followers_path
follower = []
with open(fpath, newline='') as f:
    for i in csv.reader(f):
        follower.append(i[0])

In [None]:
len(follower)

In [None]:
df['is_follower'] = df['username'].isin(follower).astype(int)

### Test About One Categorical Variable - Z test

#### Sample Question: 
Is there a difference in the number of followers and non-followers in the population?

To check whether the values of a single categorical variable occur in different proportions, we will use a one-proportion Z test. Let’s state the hypotheses:

* Ho: there is no difference between the number of followers and non-followers
* H1: there is a difference between the number of followers and non-followers

This is a two-sided test because we are checking whether the proportion of followers Pf is different from the proportion of non-followers Pn.

If we wanted to check whether Pf > Pn or Pf < Pn, we would have a one-tailed test.

In [None]:
from statsmodels.stats.proportion import proportions_ztest
 
count = df[df['is_follower'] == 1].shape[0] #number of followers 
nobs = df.shape[0] #number of rows | or trials 
value = 0.5 # Value under the null hypothesis: proportion of followers = proportion of non-followers = 0.5

# We use alternative='two-sided' because we are checking Pf≠Pn.
# For Pf>Pn set it to "larger" and for Pf<Pn set it to "smaller".
 
stat, pval = proportions_ztest(count, nobs, value, alternative='two-sided')
 
print("p_value: ",round(pval,3))

The p-value is less than 0.05, hence we reject the null hypothesis at the 5% significance level.
That means the proportion of followers differs from the proportion of non-followers in the population.
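
To see what the one-proportion Z test computes, here is a minimal by-hand sketch. The counts are hypothetical, not taken from `df`. The standard error uses the sample proportion, which is also what `proportions_ztest` does when its `prop_var` argument is left at the default.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical counts, for illustration only
count, nobs, p0 = 130, 400, 0.5

p_hat = count / nobs                       # observed proportion
se = np.sqrt(p_hat * (1 - p_hat) / nobs)   # standard error from the sample proportion
z_manual = (p_hat - p0) / se               # distance from the null value in standard errors
p_manual = 2 * norm.sf(abs(z_manual))      # two-sided p-value

print("z:", round(z_manual, 3), "p_value:", round(p_manual, 3))
```

Some textbooks use the null value `p0` instead of `p_hat` in the standard error; the two versions give slightly different statistics.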

### Test About Two Categorical Variables - Chi-Squared Test

#### Sample Question: 

Does the proportion of followers and non-followers differ across is_private?

If we want to check whether two categorical variables are independent, we will use the Chi-Squared test.

The key assumptions associated with this test are: 
1. The data are a random sample from the population. 
2. Each subject belongs to exactly one category of each variable.

Let’s state the hypotheses:

* Ho: is_follower and is_private are independent (there is no relationship between them)
* H1: is_follower and is_private are dependent (there is a relationship between them)

In [None]:
from scipy.stats import chi2_contingency
 
#The easiest way to apply a chi-squared test is to compute the contingency table.
 
contigency= pd.crosstab(df['is_follower'], df['is_private'])
contigency

In [None]:
#Chi-square test of independence.
c, p, dof, expected = chi2_contingency(contigency)
 
print("p_value: ",round(p,3))

The p-value is not less than 0.05, hence we fail to reject the null hypothesis (Ho) at the 5% significance level.

That means we have no evidence of a relationship between 'is_follower' and 'is_private'.

[Further Reading](https://towardsdatascience.com/chi-square-test-for-independence-in-python-with-examples-from-the-ibm-hr-analytics-dataset-97b9ec9bb80a)

##### Caveats and Limitations
There are a few caveats when conducting this analysis as well as some limitations of this test:

1. In order to draw a meaningful conclusion, the number of samples in each scenario needs to be sufficiently large, which might not be the case in reality.
2. A significant relationship does not imply causality.
3. The Chi-square test itself does not provide additional insights besides ‘significant relationship or not’. For example, the test does not tell us the direction or strength of the relationship, such as whether followers are less likely than non-followers to have a private account.
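
Caveats 1 and 3 can be partly addressed in code. The sketch below uses a made-up 2x2 table (not the notebook's data) to inspect the expected counts and to compute Cramér's V, a common effect-size measure that is added here as an illustration and is not part of the original analysis.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of counts, for illustration only
table = pd.DataFrame([[30, 70], [45, 55]],
                     index=['group_a', 'group_b'],
                     columns=['no', 'yes'])

# correction=False turns off Yates' continuity correction, which SciPy
# applies by default to 2x2 tables
chi2, p, dof, expected = chi2_contingency(table, correction=False)

# Caveat 1: a common rule of thumb wants every expected count to be at least 5
print("smallest expected count:", expected.min())

# Caveat 3: Cramér's V summarises the strength of association on a 0-to-1 scale
n = table.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print("p_value:", round(p, 3), "Cramér's V:", round(cramers_v, 3))
```

To see the direction of a relationship, looking at row percentages with `pd.crosstab(..., normalize='index')` is usually enough.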

In [None]:
def chi2_square_test(col1, col2):
    """Print the chi-squared independence test p-value for two columns of df."""
    contigency= pd.crosstab(df[col1], df[col2])
    #Chi-square test of independence.
    c, p, dof, expected = chi2_contingency(contigency)

    print("p_value: ",round(p,3))

In [None]:
chi2_square_test('is_follower', 'is_business_account')

'is_follower' and 'is_business_account' Groups are Dependent.

In [None]:
chi2_square_test('is_follower', 'has_public_story')

'is_follower' and 'has_public_story' Groups are Independent.

### Test About one Categorical and one Numeric Variable - T test

#### Sample Question: 

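Is there a difference in mediacount between followers and non-followers?

In this situation, we will use a t-test (Student’s t-test).

* Ho: There is no difference in mean mediacount between the two groups
* H1: There is a difference in mean mediacount between the two groups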

A t-test looks at the t-statistic, the t-distribution values, and the degrees of freedom to determine the statistical significance.

##### What Is a T Distribution?
The T distribution, also known as the Student’s t-distribution, is a type of probability distribution that is similar to the normal distribution with its bell shape but has heavier tails. T distributions have a greater chance for extreme values than normal distributions, hence the fatter tails.
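
The heavier tails are easy to check numerically. This sketch compares the probability of exceeding a threshold of 3 under the standard normal and under t distributions with a few arbitrarily chosen degrees of freedom.

```python
from scipy.stats import norm, t

threshold = 3.0

# Probability of a value above the threshold under the standard normal
print("normal:", norm.sf(threshold))

# The same tail probability under t distributions; it approaches the
# normal value as the degrees of freedom grow
for dof in (2, 5, 30, 1000):
    print(f"t (df={dof}):", t.sf(threshold, df=dof))
```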

[Further Reading](https://www.investopedia.com/terms/t/tdistribution.asp)

In [None]:
from scipy.stats import ttest_ind
 
#this is a two-sided test
#halving the two-sided p-value gives the one-sided one only when the t-statistic has the sign stated in H1
 
t_stat, p = ttest_ind(df.query('is_follower== 1')['mediacount'], df.query('is_follower== 0')['mediacount'])
 
print("p_value: ",round(p,3))

The p-value is less than 0.05, hence we reject the null hypothesis at the 5% significance level. That means there is a difference in mean mediacount between followers and non-followers.
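
Two variations are worth knowing. The sketch below runs on synthetic data (not `df`) and uses Welch's t-test (`equal_var=False`), which does not assume equal variances and is a different choice from the default used above. It also requests a one-sided test directly through the `alternative` argument, which needs SciPy 1.6 or later.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic groups with different sizes and spreads, for illustration only
rng = np.random.default_rng(0)
group_a = rng.normal(loc=52, scale=10, size=200)
group_b = rng.normal(loc=50, scale=25, size=80)

# Welch's t-test, two-sided
t_two, p_two = ttest_ind(group_a, group_b, equal_var=False)

# One-sided version: H1 says the mean of group_a is greater than that of group_b
t_one, p_one = ttest_ind(group_a, group_b, equal_var=False, alternative='greater')

print("two-sided p_value:", round(p_two, 3))
print("one-sided p_value:", round(p_one, 3))
```

The one-sided p-value equals half the two-sided one only when the t-statistic is positive here; otherwise it is one minus that half.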

In [None]:
t_stat, p = ttest_ind(df.query('is_follower== 1')['followers'], df.query('is_follower== 0')['followers'])
 
print("p_value: ",round(p,3))

In [None]:
t_stat, p = ttest_ind(df.query('is_follower== 1')['followees'], df.query('is_follower== 0')['followees'])
 
print("p_value: ",round(p,3))

In [None]:
df.columns

### Test About one Categorical Variable with more than two unique values and one Numeric Variable - ANOVA

#### Sample Question:

Is there a difference in 'followers' between 'business_category_name' groups?

Now, we will use the ANOVA (Analysis Of Variance) test.

* Ho: Group means of followers are equal
* H1: At least one group mean of followers is different from the other groups

In [None]:
df['business_category_name'].value_counts()

In [None]:
import scipy.stats as stats
 
# stats.f_oneway takes the groups as input and returns the ANOVA F statistic and p-value
fvalue, pvalue = stats.f_oneway(df.query('business_category_name == "Creators & Celebrities"')['followers'],
                                df.query('business_category_name == "Personal Goods & General Merchandise Stores"')['followers'],
                                df.query('business_category_name == "Transportation & Accomodation Services"')['followers'])
 
print("p_value: ",round(pvalue,3))

The p-value is not less than 0.05, hence we fail to reject the null hypothesis at the 5% significance level.
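
When there are many categories, writing one `query` per group gets tedious. A possible alternative, sketched here on a toy DataFrame with made-up column names `category` and `value`, is to build the groups with `groupby` and unpack them into `f_oneway`.

```python
import pandas as pd
from scipy.stats import f_oneway

# Toy data, for illustration only
toy = pd.DataFrame({
    'category': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
    'value':    [10, 12, 11, 20, 22, 21, 11, 13, 12],
})

# One array of values per category, with missing values dropped
groups = [g['value'].dropna().to_numpy() for _, g in toy.groupby('category')]

# f_oneway accepts any number of groups as separate arguments
fvalue, pvalue = f_oneway(*groups)
print("F:", round(fvalue, 3), "p_value:", round(pvalue, 3))
```

This compares every category at once, so in practice you may want to filter out categories with very few rows first.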

### Test About Two Numeric Variables - Correlation
#### Sample Question: 
Is there a relationship between two numeric variables, such as mediacount and followers?

* Ho: There is no linear relationship between the two variables
* H1: There is a linear relationship between the two variables

We will use a correlation test. A correlation test gives us two things: a correlation coefficient and a p-value. The correlation coefficient ranges from -1 to 1 and shows how strongly, and in which direction, the two variables are linearly related. For the p-value, we apply the same principle as before: if it is less than 0.05, we reject the null hypothesis.
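
To make the coefficient less of a black box, the sketch below computes Pearson's r on synthetic data (not `df`) both with `pearsonr` and by hand, as the covariance divided by the product of the two standard deviations.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic, linearly related data, for illustration only
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 0.6 * x + rng.normal(scale=0.8, size=100)

r, p_value = pearsonr(x, y)

# Same coefficient by hand: sample covariance over the product of sample
# standard deviations (both with ddof=1)
r_manual = np.cov(x, y)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))

print("pearsonr:", round(r, 3), "manual:", round(r_manual, 3),
      "p_value:", round(p_value, 3))
```

Pearson's r only captures linear relationships and is sensitive to outliers, which matters for skewed counts.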

In [None]:
import scipy.stats as stats
 
#for this example we will use the Pearson Correlation.
pearson_coef, p_value = stats.pearsonr(df["mediacount"], df["followers"])
 
print("Pearson Correlation Coefficient: ", pearson_coef, "and a P-value of:", round(p_value,3) )

In [None]:
#for this example we will use the Pearson Correlation.
pearson_coef, p_value = stats.pearsonr(df["mediacount"], df["followees"])
 
print("Pearson Correlation Coefficient: ", pearson_coef, "and a P-value of:", round(p_value,3) )

In [None]:
#for this example we will use the Pearson Correlation.
pearson_coef, p_value = stats.pearsonr(df["followers"], df["followees"])
 
print("Pearson Correlation Coefficient: ", pearson_coef, "and a P-value of:", round(p_value,3) )

As we can see, the p-value is less than 0.05, hence we reject the null hypothesis at the 5% significance level. That means there is a linear relationship between the two variables.