<a href="https://colab.research.google.com/github/hemendrasakpal/AINE-AI-Projects/blob/main/Project7/Statistical_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 7: Statistical Analysis and Hypothesis Testing

## Packages and setup

In [None]:
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import os

from scipy.stats import shapiro
import scipy.stats as stats

#parameter settings
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

### Reading data and initial processing

In [None]:
#Read data using pandas
from google.colab import files
uploaded = files.upload()

import io
user_df = pd.read_csv(io.BytesIO(uploaded['cookie_cats.csv']))
# user_df=pd.read_csv("C:/Users/91842/Desktop/Aine Ai Internship Projects/cookie_cats.csv")

#Check data types of each column using the "dtypes" attribute
print("Data types for the data set:")
user_df.dtypes

#Check dimensions of the data, i.e. number of rows and columns, using the pandas "shape" attribute
print("Shape of the data i.e. no. of rows and columns")
user_df.shape

#display first 5 rows of the data using "head" function
print("First 5 rows of the raw data:")
user_df.head(5)

## Exercise

## Q1. Detect and resolve problems in the data (Missing value, Outliers, etc.)

### Q1.1 Identify missing value

In [None]:
#Check for any missing values in the data using isnull() function

user_df.isnull().sum().sum()

### Q1.2 Identify outliers

In [None]:
#Check for outlier values in sum_gamerounds column
plt.title("Total gamerounds played")
plt.xlabel("Index")
plt.ylabel("sum_gamerounds")
plt.plot(user_df.sum_gamerounds)

In [None]:
#Based on the plot, filter out the outlier from sum_gamerounds played; use the max() function to find the index of the outlier

user_df['sum_gamerounds'].max()
user_df[user_df['sum_gamerounds']  == 49854]
user_df.drop(index = 57702, inplace= True)
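
The cell above hard-codes the outlier value and its index label. A minimal sketch of the same step without hard-coded numbers is shown below. It assumes a toy frame (`toy_df`, with made-up rows) standing in for `user_df`, and uses `idxmax()` to look up the label of the largest value.

```python
import pandas as pd

# Toy stand-in for user_df; the rows are made up for illustration.
toy_df = pd.DataFrame({
    "userid": [1, 2, 3, 4, 5],
    "sum_gamerounds": [3, 38, 49854, 165, 0],
})

# idxmax() returns the index label of the row holding the maximum value.
outlier_idx = toy_df["sum_gamerounds"].idxmax()
outlier_value = toy_df.loc[outlier_idx, "sum_gamerounds"]

# drop() without inplace returns a new frame and leaves toy_df untouched,
# so re-running the cell does not remove additional rows.
cleaned_df = toy_df.drop(index=outlier_idx)
print(outlier_idx, outlier_value, len(cleaned_df))
```

Assigning the result to a new name instead of using `inplace=True` keeps the cell safe to re-run.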

In [None]:
#Plot the graph for sum_gamerounds played after removing the outlier

plt.title("Total gamerounds played")
plt.xlabel("Index")
plt.ylabel("sum_gamerounds")
plt.plot(user_df.sum_gamerounds)

## Q2. Plot summary statistics and identify trends to answer basic business questions

### Q2.1 What is the overall 7-day retention rate of the game?

In [None]:
#Insert calculation for 7-day retention rate

retention_rate_7= round((user_df['retention_7'].sum()/user_df['retention_7'].count())*100,2)
print("Overall 7-day retention rate of the game for both versions is: " ,retention_rate_7,"%")
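
Because `retention_7` is a boolean flag, dividing its sum by its count is the same as taking its mean. The short sketch below checks this on a made-up Series (`retained` is a hypothetical name, not a column from the data set).

```python
import pandas as pd

# Made-up retention flags for five players.
retained = pd.Series([True, False, False, True, False])

# True counts as 1 and False as 0, so sum/count is the share of True values.
rate_sum_count = round(retained.sum() / retained.count() * 100, 2)

# mean() of a boolean Series gives the same share directly.
rate_mean = round(retained.mean() * 100, 2)
```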

### Q2.2 How many players never played the game after installing? 

In [None]:
# Count the players whose sum_gamerounds is equal to zero

(user_df['sum_gamerounds'] == 0).sum()

### Q2.3 Does the number of users decrease as the level progresses highlighting the difficulty of the game?

In [None]:
#Group by sum_gamerounds and count the number of users for the first 200 gamerounds
#Use plot() function on the summarized stats to visualize the chart

user_df.groupby('sum_gamerounds').userid.count()[:200].plot()

## Q3. Generate crosstab for two groups of players to understand if there is a difference in 7 days retention rate & total number of game rounds played

### Q3.1 Seven days retention rate summary for different game versions

In [None]:
#Create a cross tab of game version and retention_7 flag, then convert the user counts in each row to proportions

pd.crosstab(user_df.version, user_df.retention_7).apply(lambda r: r/r.sum(), axis=1)
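
The row-wise `apply` above turns counts into proportions by hand. As a hedged alternative, `pd.crosstab` accepts `normalize="index"`, which should give the same table. The sketch below compares the two on a small made-up frame (`toy_df` is hypothetical).

```python
import pandas as pd

# Made-up data: four players per version.
toy_df = pd.DataFrame({
    "version": ["gate_30"] * 4 + ["gate_40"] * 4,
    "retention_7": [True, False, False, False, True, True, False, False],
})

# Manual normalization: divide each row of counts by its row total.
manual = pd.crosstab(toy_df.version, toy_df.retention_7).apply(
    lambda r: r / r.sum(), axis=1
)

# Built-in normalization over each row (index) of the crosstab.
builtin = pd.crosstab(toy_df.version, toy_df.retention_7, normalize="index")
```

Each row sums to 1, so the `True` column can be read directly as the retention rate for that version.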

<mark>__Analysis Results:__</mark>
    
Type your interpretation here from the crosstab generated above   

### Q3.2 Gamerounds summary for different game versions

In [None]:
#use pandas group by to calculate average game rounds played summarized by different versions

user_df.groupby('version')['sum_gamerounds'].mean()
print('\n')
user_df.groupby('version')['sum_gamerounds'].count()

<mark>__Analysis Results:__</mark>
    
Does the total number of game rounds played by each player differ between versions of the game?

The total number of rounds played for gate_40 is greater than that of gate_30 

## Q4. Perform two-sample test for groups A and B to test statistical significance amongst the groups in the sum of game rounds played i.e., if groups A and B are statistically different

### Initial data processing

In [None]:
#Define A/B groups for hypothesis testing
user_df["version"] = np.where(user_df.version == "gate_30", "A", "B")
group_A=pd.DataFrame(user_df[user_df.version=="A"]['sum_gamerounds'])
group_B=pd.DataFrame(user_df[user_df.version=="B"]['sum_gamerounds'])

### Q4.1 Shapiro test of Normality

In [None]:
#---------------------- Shapiro Test ----------------------
# NULL Hypothesis H0: Distribution is normal
# ALTERNATE Hypothesis H1: Distribution is not normal    

#test for group_A
shapiro(group_A.sum_gamerounds)
#test for group_B
shapiro(group_B.sum_gamerounds)

<mark>__Analysis Results:__</mark>
    
__Type your answer here:__ Analyze and interpret the results of shapiro test of normality i.e. are the two groups normally distributed?

**Both groups A and B have p-values below 0.05, so we reject the null hypothesis of normality: neither group is normally distributed**
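
The SciPy documentation notes that for samples larger than 5000 the Shapiro-Wilk statistic is accurate but the p-value may not be. One possible workaround, sketched below under that assumption, is to run the test on a random subsample of 5000 values. The data here is synthetic (an exponential draw chosen only because it is right-skewed), and the subsampling step is an illustration, not part of the original analysis.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)

# Synthetic, right-skewed values standing in for round counts.
skewed = rng.exponential(scale=50, size=20000)

# Draw 5000 values without replacement to stay within the documented range.
subsample = rng.choice(skewed, size=5000, replace=False)

stat, p_value = shapiro(subsample)

# H0 (normality) is rejected when the p-value falls below 0.05.
reject_normality = p_value < 0.05
```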

### Q4.2 Test of homogeneity of variance

In [None]:
#---------------------- Levene's Test ----------------------
# NULL Hypothesis H0: Two groups have equal variances
# ALTERNATE Hypothesis H1: Two groups do not have equal variances

#perform levene's test and accept or reject the null hypothesis based on the results

stats.levene(group_A.sum_gamerounds, group_B.sum_gamerounds)

<mark>__Analysis Results:__</mark>
    
__Type your answer here:__ Write your final recommendation from the results of Levene's test

**Since the p-value is not below 0.05, we fail to reject the null hypothesis: there is no evidence that groups A and B have unequal variances**

### Q4.3 Test of significance: Two sample test

In [None]:
#---------------------- Two samples test ----------------------
# NULL Hypothesis H0: The two groups come from the same distribution
# ALTERNATE Hypothesis H1: The two groups come from different distributions

#Apply relevant two sample test to accept or reject the NULL hypothesis

stats.mannwhitneyu(group_A.sum_gamerounds, group_B.sum_gamerounds)


<mark>__Analysis Results:__</mark>
    
__Type your answer here:__ Write your final recommendation from the results of two sample hypothesis testing

**Since the p-value is below 0.05, we reject the null hypothesis and conclude that groups A and B differ in the number of game rounds played**
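
The default for the `alternative` argument of `mannwhitneyu` has changed across SciPy releases, so it may be safer to spell it out. The sketch below runs a two-sided test on two synthetic 1-D samples (`sample_a` and `sample_b` are made up, with the second deliberately shifted upwards) and turns the p-value into a decision at the 0.05 level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic, right-skewed samples; sample_b has a larger scale on purpose.
sample_a = rng.exponential(scale=50, size=2000)
sample_b = rng.exponential(scale=65, size=2000)

# Passing 1-D inputs and an explicit alternative keeps the result unambiguous.
result = stats.mannwhitneyu(sample_a, sample_b, alternative="two-sided")

# Reject H0 (same distribution) when the p-value is below 0.05.
groups_differ = result.pvalue < 0.05
```

Passing 1-D Series or arrays, rather than single-column DataFrames, also ensures that `result.pvalue` is a single number rather than an array.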

## Q5. Based on significance testing results, if groups A and B are statistically different, which level has more advantage in terms of player retention and number of game rounds played

In [None]:
#Analyze the 1 day and 7 days retention rate for two different groups using group by function
user_df.groupby('version')['retention_1'].mean()
user_df.groupby('version')['retention_7'].mean()

<mark>__Analysis Results:__</mark>
    
__Type your answer here:__ Write your final recommendation to the company regarding which level works best as the first gate  - Level 30 or Level 40

**Both the 1-day and the 7-day retention rates are slightly higher for gate 30 (group A) than for gate 40 (group B)**

## Q6. [Bonus Question]  Using bootstrap resampling, plot the retention rate distribution for both the groups in order to visualize the effect of different versions of the game on retention.

In [None]:
#Hint: Plot density function
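
One way to approach this question is sketched below. It resamples the rows with replacement many times, records the mean retention per version for each resample, and collects the results in a DataFrame whose density can then be plotted. The helper name `bootstrap_retention`, the number of resamples, and the synthetic `toy_df` (with made-up retention probabilities) are all assumptions for illustration; in the notebook, `user_df` would take the place of `toy_df`.

```python
import numpy as np
import pandas as pd

def bootstrap_retention(df, column, n_boot=500, seed=0):
    """Resample rows with replacement; return mean retention per version."""
    rng = np.random.default_rng(seed)
    boot_means = []
    for _ in range(n_boot):
        # Each resample gets its own integer seed drawn from the generator.
        resample_seed = int(rng.integers(0, 2**32 - 1))
        resampled = df.sample(frac=1, replace=True, random_state=resample_seed)
        boot_means.append(resampled.groupby("version")[column].mean())
    # One row per resample, one column per version.
    return pd.DataFrame(boot_means).reset_index(drop=True)

# Synthetic data with made-up retention probabilities for groups A and B.
data_rng = np.random.default_rng(1)
toy_df = pd.DataFrame({
    "version": np.repeat(["A", "B"], 1000),
    "retention_7": np.concatenate([
        data_rng.random(1000) < 0.19,
        data_rng.random(1000) < 0.18,
    ]),
})

boot_7d = bootstrap_retention(toy_df, "retention_7")
```

Calling `boot_7d.plot.kde()` on the result draws one density curve per version, which matches the hint above. The amount of overlap between the two curves gives a visual sense of how clearly the versions differ in retention.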
