# Statistical tests

This notebook contains code to repeat the statistical tests used to look for any relationship between different endolichenic fungal (ELF) isolates and their host lichens.

In [2]:
import pandas as pd
from scipy.stats import chi2_contingency

## Table 3.3 - Statistical comparison of ELF taxonomic diversity against various host mycobiont categories using Pearson's Chi-Squared tests 

A Pearson's Chi-Squared test is commonly used for comparing large categorical datasets. Specifically, here we will be performing a test of independence to see whether the type of ELF isolated is independent of the host lichen taxonomy, growth form, location, and macroclimate.

The first step of a Pearson's Chi-Squared test is to calculate the Chi-Squared test statistic ($X^2$), which is a normalized sum of squared deviations between observed and theoretical frequencies. The degrees of freedom ($df$) for that statistic are then calculated using the following formula:

$$ df = (R-1) \times (C-1) $$

where $R$ is the number of categories for the first variable and $C$ is the number of categories for the second variable. We then use a significance level of 0.05 to assess the comparison. The null hypothesis ($H_0$) is that the two variables are independent, i.e. that the observed frequencies match the theoretical (expected) frequencies. If the test statistic exceeds the critical value of $X^2$ for the given $df$ (equivalently, if the *p-value* is below 0.05), $H_0$ is rejected in favour of the alternative hypothesis ($H_1$) that the distributions differ; otherwise $H_0$ is retained.
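The calculation described above can be sketched by hand on a small table. The counts below are hypothetical values chosen only for illustration (they are not taken from the ELF dataset), and the variable names are assumptions of this sketch. The result is cross-checked against `scipy.stats.chi2_contingency`.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical 2x3 table of observed counts (illustrative values only).
observed = np.array([[10, 20, 30],
                     [20, 20, 20]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Expected counts under independence: (row total * column total) / grand total.
expected = row_totals * col_totals / grand_total

# Test statistic: sum over all cells of (observed - expected)^2 / expected.
stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (R - 1) * (C - 1).
n_rows, n_cols = observed.shape
dof = (n_rows - 1) * (n_cols - 1)

# Critical value and p-value at a significance level of 0.05.
alpha = 0.05
critical_value = chi2.ppf(1 - alpha, dof)
p_value = chi2.sf(stat, dof)

# Cross-check against scipy (no continuity correction, to match the manual sum).
stat_scipy, p_scipy, dof_scipy, expected_scipy = chi2_contingency(
    observed, correction=False
)

print(f"X^2 = {stat:.4f}, df = {dof}, critical = {critical_value:.4f}, p = {p_value:.4f}")
print("Reject H0" if stat > critical_value else "Retain H0")
```

Comparing the statistic with the critical value and comparing the p-value with the significance level always give the same decision, so either can be reported.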

To investigate this, data is parsed from `ELF_master_results.csv`, and the values in the relevant columns are assigned category codes using `df.astype("category").cat.codes` so that contingency tables can be prepared from them.

In [3]:
data = pd.read_csv("../ELF_master_results.csv")
data_sub = pd.DataFrame()
data.fillna("unidentified", inplace=True)
data_sub["isolate_class"] = data["isolate_class"].astype("category").cat.codes
data_sub["isolate_order"] = data["isolate_order"].astype("category").cat.codes
data_sub["isolate_family"] = data["isolate_family"].astype("category").cat.codes
data_sub["isolate_genus"] = data["isolate_genus"].astype("category").cat.codes
data_sub["Host_ID"] = data["Host_ID"].astype("category").cat.codes
data_sub["Host_location"] = data["Host_location"].astype("category").cat.codes
data_sub["Host_Order"] = data["Host_Order"].astype("category").cat.codes
data_sub["Host_Family"] = data["Host_Family"].astype("category").cat.codes
data_sub["Host_Genus"] = data["Host_Genus"].astype("category").cat.codes
data_sub["Host_Species"] = data["Host_Species"].astype("category").cat.codes
data_sub["Photobiont"] = data["Photobiont"].astype("category").cat.codes
data_sub["Growth_form"] = data["Growth_form"].astype("category").cat.codes
data_sub["Macroclimate"] = data["Macroclimate"].astype("category").cat.codes

data_sub.head()

Unnamed: 0,isolate_class,isolate_order,isolate_family,isolate_genus,Host_ID,Host_location,Host_Order,Host_Family,Host_Genus,Host_Species,Photobiont,Growth_form,Macroclimate
0,7,6,14,62,9,2,2,5,14,5,0,1,6
1,4,8,17,31,9,2,2,5,14,5,0,1,6
2,4,8,17,31,9,2,2,5,14,5,0,1,6
3,10,18,45,67,9,2,2,5,14,5,0,1,6
4,4,8,17,31,9,2,2,5,14,5,0,1,6
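As a small illustration of what the category codes in the table above mean, the sketch below encodes a toy column. The growth-form values are hypothetical placeholders and are not read from `ELF_master_results.csv`.

```python
import pandas as pd

# Hypothetical values, used only to show how category codes are assigned.
toy = pd.DataFrame({"Growth_form": ["foliose", "crustose", "foliose", "fruticose"]})

cat = toy["Growth_form"].astype("category")
codes = cat.cat.codes

# Categories are sorted, and each value is replaced by the position of its category.
mapping = dict(enumerate(cat.cat.categories))

print(list(codes))
print(mapping)
```

The codes act only as labels: a contingency table built from the codes contains the same counts as one built from the original strings, so keeping a mapping like this makes it possible to translate the codes back when reading the results.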


The function below performs a Pearson's Chi-Squared test for two columns from a given dataframe, with the assumption that the values are represented in categorical form as numerical values, as prepared above. It uses the `chi2_contingency` function from `scipy.stats`.

In [4]:
def pearsons_chi2(df, var1, var2):
    """
    Input:  df = pandas dataframe in which column values are stored in categorical form as numerical values. 
            var1 = first column/category for comparison.
            var2 = second column/category for comparison.
    Output: prints summary output of Chi-Squared test.
            returns dictionary of Chi-Squared test output.
    """
    crosstab = pd.crosstab(df[var1], df[var2])
    stat, p, dof, expected = chi2_contingency(crosstab)

    diction = {"Variables": f"{var1}_vs_{var2}",
                "Chi-Squared Statistic": stat, 
                "p-value": p,
                "Degrees of Freedom": dof}
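    # Store the result as a one-row dataframe so the caller can concatenate results.
    dict_df = pd.DataFrame(data=diction, index=[0])

    # Validity check: the Chi-Squared approximation is only considered reliable
    # when every expected frequency is at least 5. Note that this branch does
    # not compare p against a significance level.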
    if (expected >= 5).all():
        print(f"Reject null hypothesis for {var1} and {var2}")
        print("\tChi-Squared Statistic: ", stat)
        print("\tp-value: ", p)
        print("\tDegrees of Freedom: ", dof)
        print("\tExpected Values: ", expected)
    else:
        print(f"Reject alternative hypothesis for {var1} and {var2}")

    return dict_df

Now that we have a function to apply to each comparison of columns in the prepared dataframe, we can combine the lists of 'isolate' ELF columns and 'host' lichen columns.
The code below prepares a list of tuples pairing every element of the 'isolate' variable list with every element of the 'host' variable list, then passes each tuple to the `pearsons_chi2` function. The output of each comparison is saved to a dataframe.
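The pairing step can also be sketched with `itertools.product` from the standard library, which yields the same tuples in the same order as a nested list comprehension. The shortened lists here are only for illustration.

```python
from itertools import product

# Shortened lists for illustration; the column names come from the dataframe above.
isolate_vars = ["isolate_genus", "isolate_family"]
host_vars = ["Host_Order", "Growth_form", "Macroclimate"]

# Every (isolate, host) combination, isolate variable varying slowest.
pairs = list(product(isolate_vars, host_vars))

for isolate_var, host_var in pairs:
    print(isolate_var, "vs", host_var)
```

With four isolate columns and eight host columns, the full set contains 32 comparisons.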

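As it turned out, none of the comparisons met the guideline that [all observed and expected frequencies should be at least 5](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html) (the function checks the expected frequencies), so the Chi-Squared approximation is not reliable for these tables and we cannot reject the null hypothesis ($H_0$) for any test on this basis.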
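Because the expected-frequency guideline and the significance decision are two separate questions, it can help to report them side by side. The sketch below is one possible way to do that and is not the function used in this notebook: the name `chi2_with_checks`, its `alpha` and `min_expected` parameters, and the toy data are all assumptions made for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_with_checks(df, var1, var2, alpha=0.05, min_expected=5):
    """Hypothetical helper: run the test and report the validity check separately."""
    crosstab = pd.crosstab(df[var1], df[var2])
    stat, p, dof, expected = chi2_contingency(crosstab)
    return {
        "Variables": f"{var1}_vs_{var2}",
        "Chi-Squared Statistic": stat,
        "p-value": p,
        "Degrees of Freedom": dof,
        # Smallest expected count in the table.
        "min_expected": float(expected.min()),
        # True when every expected count reaches the guideline value.
        "assumption_met": bool((expected >= min_expected).all()),
        # True when the p-value is below the significance level.
        "p_below_alpha": bool(p < alpha),
    }

# Toy data (not from the ELF dataset): two binary variables, 60 rows.
toy = pd.DataFrame({
    "a": [0] * 30 + [1] * 30,
    "b": [0] * 20 + [1] * 10 + [0] * 10 + [1] * 20,
})

summary = chi2_with_checks(toy, "a", "b")
print(summary)
```

A result would then be interpreted only when `assumption_met` is true, with `p_below_alpha` giving the decision about $H_0$. Note that for a 2x2 table `chi2_contingency` applies Yates' continuity correction by default.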

In [5]:
final_df = pd.DataFrame()

isolate_vars = ["isolate_genus", "isolate_family", "isolate_order", "isolate_class"]
host_vars = ["Host_Species", "Host_Genus", "Host_Family", "Host_Order",
    "Host_location", "Photobiont", "Growth_form", "Macroclimate",]
result = [(x, y) for x in isolate_vars for y in host_vars]

for i in result:
    df = pearsons_chi2(data_sub, i[0], i[1])
    final_df = pd.concat([final_df, df], ignore_index=True)

Reject alternative hypothesis for isolate_genus and Host_Species
Reject alternative hypothesis for isolate_genus and Host_Genus
Reject alternative hypothesis for isolate_genus and Host_Family
Reject alternative hypothesis for isolate_genus and Host_Order
Reject alternative hypothesis for isolate_genus and Host_location
Reject alternative hypothesis for isolate_genus and Photobiont
Reject alternative hypothesis for isolate_genus and Growth_form
Reject alternative hypothesis for isolate_genus and Macroclimate
Reject alternative hypothesis for isolate_family and Host_Species
Reject alternative hypothesis for isolate_family and Host_Genus
Reject alternative hypothesis for isolate_family and Host_Family
Reject alternative hypothesis for isolate_family and Host_Order
Reject alternative hypothesis for isolate_family and Host_location
Reject alternative hypothesis for isolate_family and Photobiont
Reject alternative hypothesis for isolate_family and Growth_form
Reject alternative hypothesis fo

The resulting `final_df` is then saved to a file. The results were then manually entered into a table for display.

In [7]:
final_df.to_csv("pearsons_chi_squared_tests.csv")
display(final_df)

Unnamed: 0,Variables,Chi-Squared Statistic,p-value,Degrees of Freedom
0,isolate_genus_vs_Host_Species,2669.740879,1.610399e-21,2010
1,isolate_genus_vs_Host_Genus,2436.705483,4.146126e-37,1608
2,isolate_genus_vs_Host_Family,1345.802843,2.695727e-12,1005
3,isolate_genus_vs_Host_Order,468.031615,0.01269443,402
4,isolate_genus_vs_Host_location,2051.39546,2.822886e-13,1608
5,isolate_genus_vs_Photobiont,344.841687,0.001054765,268
6,isolate_genus_vs_Growth_form,140.46216,0.3339095,134
7,isolate_genus_vs_Macroclimate,692.800731,7.253408e-11,469
8,isolate_family_vs_Host_Species,1878.458525,4.735137e-20,1350
9,isolate_family_vs_Host_Genus,1694.618736,4.601834e-30,1080


## Fig. 3.11 - Prepare data for rarefaction curves on iNEXT

The following code prepares an abundance table that can be input into the [iNEXT](https://chao.shinyapps.io/iNEXTOnline/) online viewer, using default settings, to generate rarefaction curves of endolichenic fungal genus diversity against lichen host Order.

The iNEXT Online user guide is available at http://chao.stat.nthu.edu.tw/wordpress/wp-content/uploads/software/iNEXTOnline_UserGuide.pdf

In [8]:
data_sub2 = pd.DataFrame()
data_sub2["isolate_genus"] = data["isolate_genus"]
data_sub2["Host_Order"] = data["Host_Order"]
results = pd.crosstab(data_sub2['isolate_genus'], data_sub2['Host_Order'])

In [9]:
results.to_csv("iNEXT_input.csv")
display(results)

isolate_genus,Gyalectales,Lecanorales,Peltigerales,Pertusariales,Teloschistales,Trichotheliales,unidentified
Absidia,0,0,0,0,0,1,0
Amphirosellinia,0,6,4,0,0,0,0
Anthostomelloides,2,6,5,2,0,0,0
Antrelloides,0,1,0,0,0,0,0
Ascochyta,0,2,0,0,0,0,0
...,...,...,...,...,...,...,...
Trichonectria,0,1,0,0,0,0,0
Umbelopsis,0,3,1,0,0,0,0
Xylaria,0,5,3,1,0,0,0
Xylotumulus,0,2,0,0,0,0,0
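To give a sense of what a rarefaction curve computes from one column of an abundance table like the one above, the sketch below uses the classical individual-based rarefaction formula: the expected number of genera in a random subsample of $n$ isolates. This is an illustration only and is not claimed to be the calculation iNEXT performs; the function name `expected_richness` and the abundance vector are hypothetical.

```python
from math import comb

def expected_richness(counts, n):
    """Expected number of taxa in a random subsample of n individuals.

    counts: abundances per taxon for one host category.
    n: subsample size, between 0 and the total number of individuals.
    """
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total abundance")
    # For each taxon, 1 minus the probability that the subsample misses it.
    return sum(1 - comb(total - c, n) / comb(total, n) for c in counts)

# Hypothetical abundance vector (three genera, ten isolates in total).
toy_counts = [5, 3, 2]
curve = [expected_richness(toy_counts, n) for n in range(0, sum(toy_counts) + 1)]

for n, richness in enumerate(curve):
    print(n, round(richness, 3))
```

The curve starts at zero, rises steeply while common genera are being picked up, and flattens as it approaches the observed number of genera at the full sample size.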
