# Exploratory Data Analysis using MS Power BI/ Python (matplotlib/ seaborn)
*This `.ipynb` file both runs our Python code and documents our EDA process. All visuals included here come from our `Exploratory Data Analysis using MS Power BI and Python.pbix` file, generated either with Power BI's built-in visuals or with Python scripts that import the matplotlib and seaborn libraries.*

## Correlation Analysis 
The Correlation plot visual provided by Microsoft Corporation for Power BI is useful for correlation analysis. Feeding it our cleaned, pre-processed data gives the correlation plot below:

<img src="../Charts and Visualizations/Correlation Analysis.png">

As correlation plots only work for numeric data, categorical variables such as Claim, Product Name, etc. are not included in this part of the analysis.

From this plot, we can see that Net Sales and Commission are strongly correlated. That is because Commission is derived from Net Sales: without Net Sales, there would be no Commission. Hence, we chose the second most positively correlated pair, Net Sales and Duration, and continued our EDA from there.

Note that we do not continue the EDA from the Commission and Commission Percentage pair, because Commission Percentage is derived from Commission itself. A strong positive correlation between these two variables is therefore expected.
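The Power BI visual does this work for us, but the same idea can be sketched in Python. The snippet below is a minimal illustration, not the code behind the plot above: the rows are invented toy values, and the fixed 25% commission rate is a hypothetical assumption used only to show why a derived column correlates strongly with its source column.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT values from the real dataset.
toy = pd.DataFrame({
    "Product Name": ["Plan A", "Plan B", "Plan A", "Plan C", "Plan B"],
    "Duration": [5, 30, 12, 365, 60],
    "Net Sales": [20.0, 45.0, 25.0, 250.0, 80.0],
})

# Hypothetical rule: commission is a fixed share of net sales.
toy["Commission"] = 0.25 * toy["Net Sales"]

# Keep only numeric columns, since categorical columns cannot be correlated this way.
numeric = toy.select_dtypes(include="number")

# Pairwise Pearson correlation between every pair of numeric columns.
corr_matrix = numeric.corr(method="pearson")
print(corr_matrix.round(3))
```

In this toy example, Commission is an exact multiple of Net Sales, so their correlation is 1. This is the same reason we skip derived pairs when choosing where to continue the EDA.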

## Data Distribution and Density Plots

To give a better understanding of what our dataset looks like, we created two density plots showing the distribution of our data by Age and by Net Sales (both density plots are from MS Power BI):

<img src="../Charts and Visualizations/Density Plot of Age.png">
<br>
<br>
<img src="../Charts and Visualizations/Density Plot of Net Sales.png">

This shows that most insurance purchases were made by customers aged around 36, and that the net sales for most purchases fall within the range 0 - 60.

## Anomaly Behaviour/ Patterns
While working on the relationship between Net Sales and Duration, we found anomalies in the Duration column, as the x-axis of our graph extended beyond 4000:

<img src="../Charts and Visualizations/Original Net Sales by Duration.png">

These anomalies distort our dataset, so we went back to data cleaning and removed the 14 anomalous records. This is the Net Sales by Duration line chart after the anomalies have been removed:

<img src="../Charts and Visualizations/Cleaned Net Sales by Duration.png">



From this chart, we can see that there are 2 distinct spikes in Net Sales: one within the first 2 weeks and one at the 1-year mark. This observation becomes important later, in the Important Features section.
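The anomaly removal described above can be sketched with a simple pandas filter. This is only an illustration under stated assumptions: the `MAX_DURATION_DAYS` cutoff is hypothetical (the exact rule used in our data cleaning is not shown in this notebook), and the rows are toy values rather than real records.

```python
import pandas as pd

# Hypothetical cutoff, chosen only because the original chart extended past 4000.
MAX_DURATION_DAYS = 4000

# Toy rows for illustration only; these are NOT values from the real dataset.
toy = pd.DataFrame({
    "Duration": [7, 14, 365, 4580, 4881],
    "Net Sales": [20.0, 35.0, 250.0, 10.0, 15.0],
})

# Keep rows whose Duration is at or below the cutoff and renumber the index.
cleaned = toy[toy["Duration"] <= MAX_DURATION_DAYS].reset_index(drop=True)

# Count how many rows the filter dropped.
removed = len(toy) - len(cleaned)
print(f"Removed {removed} anomalous rows; {len(cleaned)} rows remain.")
```

Counting the removed rows is a quick check that the filter drops only the expected records.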

## Statistical Test & Description

We use the `train_test_split` function from the `sklearn.model_selection` module to randomly draw 10 percent of our cleaned dataset, and we perform our statistical test on that sample. The `pandas.DataFrame.corr()` method calculates the correlation between Duration and Net Sales; its value is stored in `r` in the next cell.

In [26]:
from sklearn.model_selection import train_test_split
import pandas as pd

dataset = pd.read_excel("../Datasets/MASA_Hackathon_2022_Travel_Insurance_Data_Set_Cleaned.xlsx")

x = dataset["Duration"]
y = dataset["Net Sales"]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, train_size=0.9, shuffle=True, random_state=1)
df = pd.concat([x_test, y_test], axis=1, join="inner")

print(df.corr(method="pearson"))

# n is the sample size, needed for the Fisher's transformation test
n = df.shape[0]
print(n)

           Duration  Net Sales
Duration   1.000000   0.654308
Net Sales  0.654308   1.000000
6235


Now, with the correlation calculated, we store it in variable r and perform our calculations.<br>
**h<sub>0</sub>: rho = 0.6**<br>
**h<sub>1</sub>: rho > 0.6**

In [33]:
from math import log, sqrt

rho = 0.6
r = 0.654308
Z = 0.5 * (log((1 + r) / (1 - r)) - log((1 + rho) / (1 - rho))) / sqrt(1 / (n - 3))
print(Z)

7.077064172725067


Since this test statistic is larger than the critical value at the 5% significance level, Z<sub>0.05</sub> = 1.645, the null hypothesis h<sub>0</sub> is rejected in favor of the alternative h<sub>1</sub>.

To conclude, the sample data suggests that the population correlation rho is greater than 0.6.
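As a cross-check on the test above, the same Fisher transformation can be wrapped in a small helper that also reports a one-sided p-value. This is a minimal sketch: the function name `fisher_z_test` is our own for illustration, and the inputs are the `r`, `rho`, and `n` values from the cells above.

```python
from math import atanh, erfc, sqrt


def fisher_z_test(r, rho0, n):
    """One-sided test of h0: rho = rho0 against h1: rho > rho0."""
    # atanh(x) equals 0.5 * log((1 + x) / (1 - x)), i.e. Fisher's transformation.
    z = (atanh(r) - atanh(rho0)) * sqrt(n - 3)
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * erfc(z / sqrt(2))
    return z, p_value


z_stat, p_value = fisher_z_test(r=0.654308, rho0=0.6, n=6235)
print(f"Z = {z_stat:.4f}, one-sided p-value = {p_value:.3g}")
```

A p-value below 0.05 leads to the same decision as comparing Z against the critical value 1.645.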

## Important Features
Net Sales is strongly influenced by the product. We therefore took the Net Sales and Duration correlation found earlier and examined how it relates to the different products.

<img src="../Charts and Visualizations/Net Sales by Product Name.png">

This pie chart explains why there are 2 spikes: the 2 way Comprehensive Plan, Cancellation Plan, and Rental Vehicle Excess Insurance are mostly short-term insurance plans, whereas the Annual Silver Plan starts at 364 days.

<img src="../Charts and Visualizations/Annual Silver Plan Net Sales and Location.png">

The Annual Silver Plan stands out because it provides the highest net sales for a single timeframe: 66,509.20 for plans with a duration of 365 days. More interestingly, all Annual Silver Plan customers travel to Singapore.