# Kruskal-Wallis test
This notebook runs the Kruskal-Wallis H-test on a .csv dataset stored in Google Drive. It tests whether subscription duration differs between users' regions.

1. Install necessary Python libraries

In [None]:
!pip install pandas scipy



2. Import essential libraries

In [None]:
import pandas as pd
from scipy.stats import kruskal
import sys
from google.colab import drive

3. Define configuration parameters: file path for the dataset, column names for duration and region, and the statistical significance level (alpha).

In [None]:
CSV_FILE_PATH = '/content/drive/My Drive/colabfiles/ab_test_dataset.csv'
DURATION_COLUMN = 'subscription_duration_days'
REGION_COLUMN = 'region'
alpha = 0.05

4. Define a function to mount Google Drive

In [None]:
def mount_google_drive():
    """Mount Google Drive at /content/drive; return True on success, False otherwise."""
    print("Mounting Google Drive...")
    try:
        drive.mount('/content/drive')
        print("Google Drive mounted successfully.")
        return True
    except Exception as e:
        print(f"Error mounting Google Drive: {e}")
        print("Please ensure you've authorized Colab to access your Google Drive.")
        return False

5. Mount Google Drive and exit if mounting fails

In [None]:
if not mount_google_drive():
    sys.exit(1)

Mounting Google Drive...
Mounted at /content/drive
Google Drive mounted successfully.


6. Load dataset from the specified path, display initial rows and info, and handle file-related errors.

In [None]:
try:
    df = pd.read_csv(CSV_FILE_PATH)
    print(f"Successfully loaded data from '{CSV_FILE_PATH}'")
    print("\nFirst 5 rows of the dataset:")
    print(df.head())
    print("\nDataset Info:")
    df.info()
except FileNotFoundError:
    print(f"Error: The file '{CSV_FILE_PATH}' was not found.")
    print("Please ensure the CSV_FILE_PATH variable points to the correct file on Google Drive.")
    sys.exit(1)
except Exception as e:
    print(f"An error occurred while loading the CSV file: {e}")
    sys.exit(1)

Successfully loaded data from '/content/drive/My Drive/colabfiles/ab_test_dataset.csv'

First 5 rows of the dataset:
   user_pseudo_id category        country subscription_start subscription_end  \
0    1.099668e+06  desktop  United States          11/4/2020        1/12/2021   
1    1.136556e+06   mobile           Peru          11/2/2020       12/24/2020   
2    1.271864e+06   mobile          India          11/7/2020       12/13/2020   
3    1.014060e+06  desktop  United States          11/3/2020        1/22/2021   
4    1.828432e+06  desktop  United States         11/13/2020       12/20/2020   

   subscription_duration_days day_type         region  
0                          69  Weekday  North America  
1                          52  Weekday  South America  
2                          36  Weekend           Asia  
3                          80  Weekday  North America  
4                          37  Weekday  North America  

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIn
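
Before moving on, it can help to confirm that the configured columns exist and that the duration column is numeric. The following is a minimal sketch under stated assumptions: `check_columns` is a hypothetical helper that is not part of the notebook, and `sample` is a small stand-in frame used in place of the real `df`.

```python
import pandas as pd


def check_columns(df, required):
    """Return the names in `required` that are missing from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]


# Stand-in data for illustration only; the notebook would pass its own df.
sample = pd.DataFrame({
    "subscription_duration_days": [69, 52, "n/a"],
    "region": ["North America", "South America", "Asia"],
})

missing = check_columns(sample, ["subscription_duration_days", "region"])
if missing:
    print(f"Missing columns: {missing}")

# Coerce durations to numbers; entries that cannot be parsed become NaN.
durations = pd.to_numeric(sample["subscription_duration_days"], errors="coerce")
n_bad = int(durations.isna().sum())
print(f"Non-numeric or missing durations: {n_bad}")
```

Rows that end up as NaN here would be dropped by the `dropna()` call in the preparation step, so counting them first shows how much data the test will actually use.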

7. Prepare data for Kruskal-Wallis test by extracting and filtering subscription durations for each unique region.

In [None]:
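# Build one array of non-missing durations per region
region_data = [
    df.loc[df[REGION_COLUMN] == region, DURATION_COLUMN].dropna().values
    for region in df[REGION_COLUMN].unique()
]
# Drop regions that have no usable observations
region_data_filtered = [data for data in region_data if len(data) > 0]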
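
The same per-region split can also be expressed with `groupby`, which keeps the region label attached to each array. This is an alternative sketch rather than the notebook's own approach; `sample`, `clean` and `groups` are illustrative names, and the toy values are made up.

```python
import pandas as pd

# Toy data with a missing region and missing durations (illustrative values).
sample = pd.DataFrame({
    "region": ["Asia", "Asia", "Europe", "Europe", None, "Africa"],
    "subscription_duration_days": [36, 40, 52, None, 10, None],
})

# Drop rows where either the region or the duration is missing.
clean = sample.dropna(subset=["region", "subscription_duration_days"])

# Map each region name to a NumPy array of its durations.
groups = {
    region: frame["subscription_duration_days"].to_numpy()
    for region, frame in clean.groupby("region")
}

for region, values in groups.items():
    print(region, len(values))
```

Regions whose rows are all missing (here "Africa") disappear after `dropna`, so no separate empty-group filter is needed. The arrays can then be passed to the test as `kruskal(*groups.values())`.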

8. Perform Kruskal-Wallis test, provide results and conclusions

In [None]:
print("\n--- Kruskal-Wallis Test ---")
print("Hypotheses:")
print("H0: The user's region has no effect on subscription duration.")
print("H1: At least one region has a different median subscription duration than the others.")
print(f"Significance level (alpha): {alpha}")


# The test needs at least two groups with data
if len(region_data_filtered) >= 2:
    h_statistic, p_value = kruskal(*region_data_filtered)

    print(f"\nKruskal-Wallis H-statistic: {h_statistic:.4f}")
    print(f"P-value: {p_value:.4f}")

    print("\n--- Conclusion ---")
    if p_value < alpha:
        print("Since the p-value is less than the significance level, we reject the null hypothesis (H0).")
        print("Conclusion: There is a statistically significant difference in the median subscription duration between at least two regions.")
    else:
        print("Since the p-value is greater than the significance level, we fail to reject the null hypothesis (H0).")
        print("Conclusion: There is no statistically significant difference in the median subscription duration between the regions.")
else:
    print("\nNot enough regions with data to perform the Kruskal-Wallis H-test (at least 2 groups are required).")


--- Kruskal-Wallis Test ---
Hypotheses:
H0: The user's region has no effect on subscription duration.
H1: At least one region has a different median subscription duration than the others.
Significance level (alpha): 0.05

Kruskal-Wallis H-statistic: 6.8905
P-value: 0.4404

--- Conclusion ---
Since the p-value is greater than the significance level, we fail to reject the null hypothesis (H0).
Conclusion: There is no statistically significant difference in the median subscription duration between the regions.
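
To show where the H-statistic comes from, the sketch below recomputes it from pooled ranks using the standard formula H = 12 / (N(N+1)) * sum(R_i^2 / n_i) - 3(N+1), divided by a tie correction, and compares the result with `scipy.stats.kruskal`. The function `kruskal_h` and the `toy` arrays are illustrative assumptions, not part of the notebook or its dataset.

```python
import numpy as np
from scipy.stats import kruskal, rankdata, tiecorrect


def kruskal_h(groups):
    """Compute the tie-corrected Kruskal-Wallis H from a list of 1-D arrays."""
    pooled = np.concatenate(groups)
    n_total = pooled.size
    ranks = rankdata(pooled)  # ties receive the average of their ranks

    # Split the pooled ranks back into the original groups.
    bounds = np.cumsum([len(g) for g in groups])[:-1]
    rank_groups = np.split(ranks, bounds)

    # Sum of (rank sum squared / group size) over all groups.
    ssbn = sum(r.sum() ** 2 / r.size for r in rank_groups)
    h = 12.0 / (n_total * (n_total + 1)) * ssbn - 3 * (n_total + 1)
    return h / tiecorrect(ranks)


# Made-up durations for three groups; the value 52 appears twice to create a tie.
toy = [
    np.array([36, 40, 52, 61]),
    np.array([37, 69, 80]),
    np.array([52, 55, 58, 90]),
]

h_manual = kruskal_h(toy)
h_scipy, p_scipy = kruskal(*toy)
print(f"manual H = {h_manual:.4f}, scipy H = {h_scipy:.4f}")
```

Because H depends only on ranks, the test is insensitive to outliers in the raw durations, which is one reason to prefer it over a one-way ANOVA when the data are skewed.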
