# Using T-SQL to perform basic data analytics

## Context
A tour and travel company is offering a travel insurance package to its customers. The company wants to know which customers would be interested in buying it, based on its database history.

### Task
Use SQL commands to perform exploratory data analysis and find insights for customer-specific advertising of the package.

## Setup

In [1]:
#relevant imports
import pandas as pd 
import pyodbc
import sqlalchemy as db

## Queries

I used the Travel Insurance Prediction dataset, which contains travellers' information. Source: Kaggle.com

### Importing & Cleaning Dataset

In [2]:
#Connect to SQL Server
cnxn_str = ("Driver={SQL Server};"
            "Server=Myserver;"
            "Database=MyDb;")
cnxn = pyodbc.connect(cnxn_str)

As a first step, I look at the first rows and the schema of the TravelInsurancePrediction table to see which columns are available for future queries.

In [3]:
#Show the first 5 rows of the table
query = 'SELECT * FROM TravelInsurancePrediction;' 
x =pd.read_sql(query, cnxn)
x.head()

Unnamed: 0,Index,Age,EmploymentType,GraduateOrNot,AnnualIncome,FamilyMembers,ChronicDiseases,FrequentFlyer,EverTravelledAbroad,TravelInsurance
0,0,31,Government Sector,Yes,40000,6,1,No,No,0
1,1,31,Private Sector/Self Employed,Yes,125000,7,0,No,No,0
2,2,34,Private Sector/Self Employed,Yes,50000,4,1,No,No,1
3,3,28,Private Sector/Self Employed,Yes,70000,3,1,No,No,0
4,4,28,Private Sector/Self Employed,Yes,70000,8,1,Yes,No,0


In [4]:
#Show Table Schema
query = "SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'TravelInsurancePrediction';" 
x =pd.read_sql(query, cnxn)
x

Unnamed: 0,TABLE_CATALOG,TABLE_SCHEMA,TABLE_NAME,COLUMN_NAME,ORDINAL_POSITION,COLUMN_DEFAULT,IS_NULLABLE,DATA_TYPE,CHARACTER_MAXIMUM_LENGTH,CHARACTER_OCTET_LENGTH,...,DATETIME_PRECISION,CHARACTER_SET_CATALOG,CHARACTER_SET_SCHEMA,CHARACTER_SET_NAME,COLLATION_CATALOG,COLLATION_SCHEMA,COLLATION_NAME,DOMAIN_CATALOG,DOMAIN_SCHEMA,DOMAIN_NAME
0,MyDb,dbo,TravelInsurancePrediction,Index,1,,NO,nvarchar,50.0,100.0,...,,,,UNICODE,,,SQL_Latin1_General_CP1_CI_AS,,,
1,MyDb,dbo,TravelInsurancePrediction,Age,2,,NO,int,,,...,,,,,,,,,,
2,MyDb,dbo,TravelInsurancePrediction,EmploymentType,3,,NO,nvarchar,50.0,100.0,...,,,,UNICODE,,,SQL_Latin1_General_CP1_CI_AS,,,
3,MyDb,dbo,TravelInsurancePrediction,GraduateOrNot,4,,NO,nvarchar,50.0,100.0,...,,,,UNICODE,,,SQL_Latin1_General_CP1_CI_AS,,,
4,MyDb,dbo,TravelInsurancePrediction,AnnualIncome,5,,NO,int,,,...,,,,,,,,,,
5,MyDb,dbo,TravelInsurancePrediction,FamilyMembers,6,,NO,int,,,...,,,,,,,,,,
6,MyDb,dbo,TravelInsurancePrediction,ChronicDiseases,7,,NO,int,,,...,,,,,,,,,,
7,MyDb,dbo,TravelInsurancePrediction,FrequentFlyer,8,,NO,nvarchar,50.0,100.0,...,,,,UNICODE,,,SQL_Latin1_General_CP1_CI_AS,,,
8,MyDb,dbo,TravelInsurancePrediction,EverTravelledAbroad,9,,NO,nvarchar,50.0,100.0,...,,,,UNICODE,,,SQL_Latin1_General_CP1_CI_AS,,,
9,MyDb,dbo,TravelInsurancePrediction,TravelInsurance,10,,NO,int,,,...,,,,,,,,,,


There are 10 columns in the table. None of them holds datetime data, and none of them allows NULL values.
The TravelInsurance and ChronicDiseases columns contain integer data with only 2 values: 1 for Yes and 0 for No.

Now I have an overall picture of the dataset. As a next step, I'll check whether there are any errors or outliers in the data and clean them up.

In [5]:
#This step is not strictly necessary, as the table schema already shows that no column allows NULL values
query = 'SELECT * FROM TravelInsurancePrediction WHERE Age IS NULL OR AnnualIncome IS NULL OR TravelInsurance IS NULL ;' 
x =pd.read_sql(query, cnxn)
x

Unnamed: 0,Index,Age,EmploymentType,GraduateOrNot,AnnualIncome,FamilyMembers,ChronicDiseases,FrequentFlyer,EverTravelledAbroad,TravelInsurance
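
The query above checks only three columns. As a minimal sketch, the same check can be generated for every column. The helper name `null_check_query` and the hard-coded column list are assumptions for illustration, not part of the original notebook. Column names are wrapped in square brackets because `Index` is a reserved word in T-SQL.

```python
# Hypothetical helper: builds one NULL check covering all listed columns.
COLUMNS = [
    "Index", "Age", "EmploymentType", "GraduateOrNot", "AnnualIncome",
    "FamilyMembers", "ChronicDiseases", "FrequentFlyer",
    "EverTravelledAbroad", "TravelInsurance",
]


def null_check_query(table, columns):
    """Return a T-SQL query selecting rows where any listed column is NULL."""
    conditions = " OR ".join(f"[{col}] IS NULL" for col in columns)
    return f"SELECT * FROM [{table}] WHERE {conditions};"


query = null_check_query("TravelInsurancePrediction", COLUMNS)
print(query)
```

The resulting string can be passed to `pd.read_sql(query, cnxn)` in the same way as the other queries in this notebook.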


As Age and AnnualIncome are numeric, I check whether there are any unusual outliers.

In [6]:
#Show range of Age and Income
query = 'SELECT MIN(Age) as min_age, MAX(Age) as max_age, \
	MIN(AnnualIncome) as min_income, MAX(AnnualIncome) as max_income \
FROM TravelInsurancePrediction' 
x =pd.read_sql(query, cnxn)
x

Unnamed: 0,min_age,max_age,min_income,max_income
0,25,35,30000,180000


Age ranges from 25 to 35. This range is quite small, but assuming the dataset is based on a campaign targeting young customers, it makes sense.
However, the annual income looks far too high. While it's normal for some customers to have a high income, the minimum income is 300,000 per year! This is abnormal.

After checking with the marketing team, it's confirmed that the AnnualIncome column has an extra trailing 0.
I remove the extra digit as follows:

In [7]:
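#Update the column values: divide by 10 to drop the extra digit
#Note: this statement is not idempotent; re-running the cell divides the values again
cursor = cnxn.cursor()
cursor.execute("UPDATE TravelInsurancePrediction SET AnnualIncome = AnnualIncome/10")
cnxn.commit()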
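
An `UPDATE` like the one above changes the values every time it runs, so re-running the notebook would shrink the incomes again. The sketch below shows one way to guard against that: it rescales only while the column still looks unscaled. It uses an in-memory SQLite database as a stand-in for SQL Server so that it runs anywhere. The function name `rescale_income_once`, the threshold of 300,000 (taken from the minimum income quoted above) and the three sample rows are all assumptions for illustration.

```python
import sqlite3


def rescale_income_once(conn, min_raw_income=300_000):
    """Divide AnnualIncome by 10 only if the column still looks unscaled.

    min_raw_income is a hypothetical guard value: if the smallest income
    is below it, we assume the correction has already been applied.
    """
    cur = conn.cursor()
    cur.execute("SELECT MIN(AnnualIncome) FROM TravelInsurancePrediction")
    (current_min,) = cur.fetchone()
    if current_min is not None and current_min >= min_raw_income:
        cur.execute(
            "UPDATE TravelInsurancePrediction SET AnnualIncome = AnnualIncome / 10"
        )
        conn.commit()
        return True
    return False


# Toy stand-in table with made-up rows (not the real dataset).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TravelInsurancePrediction (AnnualIncome INTEGER)")
conn.executemany(
    "INSERT INTO TravelInsurancePrediction VALUES (?)",
    [(300000,), (1250000,), (1800000,)],
)

first_run = rescale_income_once(conn)   # applies the correction
second_run = rescale_income_once(conn)  # detects it was already applied
incomes = [
    row[0]
    for row in conn.execute(
        "SELECT AnnualIncome FROM TravelInsurancePrediction ORDER BY AnnualIncome"
    )
]
print(first_run, second_run, incomes)
```

With pyodbc the same guard would work against the real table, since the function only relies on `cursor()`, `execute()`, `fetchone()` and `commit()`.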

Then I run the query again to check that the table has been updated.

In [8]:
query = 'SELECT MIN(Age) as min_age, MAX(Age) as max_age, \
	MIN(AnnualIncome) as min_income, MAX(AnnualIncome) as max_income \
FROM TravelInsurancePrediction' 
x =pd.read_sql(query, cnxn)
x

Unnamed: 0,min_age,max_age,min_income,max_income
0,25,35,3000,18000


### Understand Data

Now that the dataset is clean, let's get to know our data.

In [9]:
#Show number and percentage of traveler with insurance
query = "SELECT Have_Insurance, COUNT(*) as travelers, \
        COUNT(TravelInsurance) * 100.00 /SUM(COUNT(TravelInsurance)) OVER () AS Percentage_with_Insurance \
        FROM ( 	SELECT TravelInsurance, \
        CASE WHEN TravelInsurance = 1 THEN 'Yes' \
        ELSE 'No' \
        END AS Have_Insurance \
        FROM TravelInsurancePrediction ) Insurance \
        GROUP BY Have_Insurance;"
x =pd.read_sql(query, cnxn)
x

Unnamed: 0,Have_Insurance,travelers,Percentage_with_Insurance
0,Yes,710,35.73226
1,No,1277,64.26774
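
The percentage column above is each group's count divided by the grand total, which is what `SUM(COUNT(...)) OVER ()` computes on top of the `GROUP BY`. As a rough illustration, the same arithmetic can be reproduced in plain Python from the two counts in the output. The list `flags` is a reconstructed stand-in for the TravelInsurance column, not the real data.

```python
from collections import Counter

# Counts taken from the output above: 710 travellers with insurance, 1277 without.
flags = [1] * 710 + [0] * 1277

# Equivalent of GROUP BY Have_Insurance with COUNT(*)
counts = Counter("Yes" if flag == 1 else "No" for flag in flags)

# Equivalent of SUM(COUNT(*)) OVER (): the grand total across all groups
grand_total = sum(counts.values())

# Each group's share of the grand total, in percent
percentages = {label: n * 100.0 / grand_total for label, n in counts.items()}

for label in ("Yes", "No"):
    print(label, counts[label], round(percentages[label], 5))
```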


35.7% of the customers have purchased travel insurance.
Next, let's see whether age affects their decision.

In [10]:
#Show number and percentage of traveller with insurance per age
query = "SELECT Age, SUM(TravelInsurance) as With_Insurance, \
        SUM(TravelInsurance) * 100.0 / SUM(SUM(TravelInsurance)) OVER () AS Percentage_With_Ins \
        FROM TravelInsurancePrediction \
        GROUP BY Age  \
        Order By Age; "
x =pd.read_sql(query, cnxn)
x

Unnamed: 0,Age,With_Insurance,Percentage_With_Ins
0,25,92,12.957746
1,26,74,10.422535
2,27,27,3.802817
3,28,105,14.788732
4,29,51,7.183099
5,30,28,3.943662
6,31,75,10.56338
7,32,19,2.676056
8,33,78,10.985915
9,34,133,18.732394


It's difficult to see a pattern at the level of individual ages.
I will divide the travellers into 4 age groups:

    - Mid 20s: from 25 to 27
    - Late 20s: from 28 to 30
    - Early 30s: from 31 to 33
    - Mid 30s: from 34 to 35
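
The grouping above can be sketched as a small Python function that mirrors the `CASE` expression used in the query that follows. The function name `age_bin` is an assumption for illustration. Like a `CASE` without an `ELSE`, it returns nothing for ages outside the covered range.

```python
def age_bin(age):
    """Map an age to its group label, mirroring the SQL CASE expression."""
    if 25 <= age <= 27:
        return "Mid 20s"
    if 28 <= age <= 30:
        return "Late 20s"
    if 31 <= age <= 33:
        return "Early 30s"
    if age > 33:
        return "Mid 30s"
    return None  # ages below 25 fall through, like CASE without ELSE (NULL)


# Print the label for every age in the dataset's range (25 to 35).
for age in range(25, 36):
    print(age, age_bin(age))
```

Using non-overlapping boundaries matters here: each age must fall into exactly one group, otherwise the first matching branch silently decides where it is counted.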

In [11]:
#Show number and percentage of traveller with insurance per age range

query = "WITH AgeBins AS \
        ( SELECT \
            CASE \
                WHEN Age >= 25 and Age <= 27 THEN 'Mid 20s' \
                WHEN Age >= 28 and Age <= 30 THEN 'Late 20s' \
                WHEN Age >= 31 and Age <= 33 THEN 'Early 30s' \
                WHEN Age > 33 THEN 'Mid 30s' \
            END as age_bins, \
            TravelInsurance \
        FROM TravelInsurancePrediction ) \
        SELECT \
            age_bins, \
            SUM(TravelInsurance) AS With_Insurance, \
            SUM(TravelInsurance) * 100.0 / SUM(SUM(TravelInsurance)) OVER () AS Percentage_With_Ins, \
            SUM(TravelInsurance) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY age_bins) AS Percentage_Total_Bins \
        FROM AgeBins \
        GROUP BY age_bins \
        ORDER BY  \
            (CASE age_bins \
                WHEN 'Mid 20s' THEN 1 \
                WHEN 'Late 20s' THEN 2 \
                WHEN 'Early 30s' THEN 3 \
                WHEN 'Mid 30s' THEN 4 \
                END) ASC;"
x =pd.read_sql(query, cnxn)
x

Unnamed: 0,age_bins,With_Insurance,Percentage_With_Ins,Percentage_Total_Bins
0,Mid 20s,193,27.183099,45.411765
1,Late 20s,184,25.915493,24.115334
2,Early 30s,172,24.225352,36.363636
3,Mid 30s,161,22.676056,49.386503


The With_Insurance column shows the number of travellers with insurance in each age group.

The Percentage_With_Ins column shows each age group's share of all travellers with insurance.

The Percentage_Total_Bins column shows the share of travellers with insurance among all travellers within each age group.
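
The two percentage columns differ only in their denominator, which is easy to mix up. The sketch below recomputes both from the counts in the output above. The group sizes are an assumption: they are not printed by the query and were back-calculated from Percentage_Total_Bins (for example, 193 / 0.45411765 ≈ 425).

```python
# With_Insurance counts per age group, taken from the output above.
with_insurance = {"Mid 20s": 193, "Late 20s": 184, "Early 30s": 172, "Mid 30s": 161}

# Assumed group sizes, back-calculated from Percentage_Total_Bins.
group_size = {"Mid 20s": 425, "Late 20s": 763, "Early 30s": 473, "Mid 30s": 326}

total_insured = sum(with_insurance.values())

share_of_insured = {}   # corresponds to Percentage_With_Ins
rate_within_group = {}  # corresponds to Percentage_Total_Bins
for group, insured in with_insurance.items():
    # Denominator: all travellers with insurance
    share_of_insured[group] = insured * 100.0 / total_insured
    # Denominator: all travellers in this age group
    rate_within_group[group] = insured * 100.0 / group_size[group]
    print(
        f"{group}: {share_of_insured[group]:.2f}% of all insured, "
        f"{rate_within_group[group]:.2f}% of the group insured"
    )
```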

Mid 20s and mid 30s travellers are the most likely to purchase insurance: 45% of mid 20s and 49% of mid 30s travellers decided to buy it. However, these groups contribute about the same proportion to the total number of travellers with insurance as the other groups (each age group contributes 22% to 27% of the total).

This can be explained if the mid 20s and mid 30s age groups, although more likely to purchase insurance, do not make up the majority of customers.
Let's test this theory by finding each age group's share of total customers.

In [12]:
#show percentage of each age group per total number of traveller
query = "WITH AgeBins AS \
        ( SELECT \
            CASE \
                WHEN Age >= 25 and Age <= 27 THEN 'Mid 20s' \
                WHEN Age >= 28 and Age <= 30 THEN 'Late 20s' \
                WHEN Age >= 31 and Age <= 33 THEN 'Early 30s' \
                WHEN Age > 33 THEN 'Mid 30s' \
            END as age_bins, \
            TravelInsurance \
        FROM TravelInsurancePrediction ) \
        SELECT \
            age_bins, \
            SUM(TravelInsurance) AS With_Insurance, \
            Count(*) * 100.0 / SUM(COUNT(*)) OVER () AS Percentage \
        FROM AgeBins \
        GROUP BY age_bins \
        ORDER BY  \
            (CASE age_bins \
                WHEN 'Mid 20s' THEN 1 \
                WHEN 'Late 20s' THEN 2 \
                WHEN 'Early 30s' THEN 3 \
                WHEN 'Mid 30s' THEN 4 \
                END) ASC;"
x =pd.read_sql(query, cnxn)
x

Unnamed: 0,age_bins,With_Insurance,Percentage
0,Mid 20s,193,21.389029
1,Late 20s,184,38.399597
2,Early 30s,172,23.804731
3,Mid 30s,161,16.406643


Mid 30s contributes the least to the total customers, and Mid 20s is the second lowest.
This supports our previous theory that the mid 20s and mid 30s age groups are more likely to purchase insurance, even though they don't contribute the most to the total number of insurance policies sold.

Next, let's look at the income distribution. I'll group income into 3 segments:

    - Low Income: 40,100 a year or less
    - Mid Income: more than 40,100 and up to 102,400 a year
    - High Income: more than 102,400 a year
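
As a sketch of the same segmentation outside SQL, the thresholds from the `CASE` expression in the query that follows (40,100 and 102,400) can be treated as sorted upper bounds and looked up with `bisect`. The names `THRESHOLDS`, `LABELS` and `income_bin` are assumptions for illustration. The labels are spelled as in the query.

```python
from bisect import bisect_left

# Inclusive upper bounds of the Low and Mid segments, as in the SQL CASE.
THRESHOLDS = [40_100, 102_400]
LABELS = ["Low Incomes", "Mid Income", "High Incomes"]


def income_bin(income):
    """Return the income segment label for an annual income.

    bisect_left returns the index of the first threshold >= income,
    so a value equal to a threshold stays in the lower segment,
    matching the <= comparisons in the query.
    """
    return LABELS[bisect_left(THRESHOLDS, income)]


for income in (30_000, 40_100, 40_101, 102_400, 102_401, 180_000):
    print(income, income_bin(income))
```

Keeping the thresholds in one list makes it harder for the written description and the query to drift apart.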

In [13]:
# number and percentage of traveller with insurance per income range 
query = " WITH IncomeBins AS \
        (SELECT CASE \
                WHEN AnnualIncome <= 40100 THEN 'Low Incomes' \
                WHEN AnnualIncome > 40100 and AnnualIncome <= 102400 THEN 'Mid Income' \
                WHEN AnnualIncome > 102400 THEN 'High Incomes' \
            END as incomes, \
            TravelInsurance \
        FROM TravelInsurancePrediction ) \
        SELECT incomes, \
            SUM(TravelInsurance) AS With_Insurance, \
            SUM(TravelInsurance) * 100.0 / SUM(SUM(TravelInsurance)) OVER () AS Percentage_With_Ins, \
            SUM(TravelInsurance) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY incomes) AS Percentage_Total_Bins \
        FROM IncomeBins \
        GROUP BY incomes \
        ORDER BY \
            (CASE incomes \
                WHEN 'Low Incomes' THEN 1 \
                WHEN 'Mid Income' THEN 2 \
                WHEN 'High Incomes' THEN 3 \
                END) ASC; "
x =pd.read_sql(query, cnxn)
x

Unnamed: 0,incomes,With_Insurance,Percentage_With_Ins,Percentage_Total_Bins
0,Low Incomes,710,100.0,35.73226


64.5% of travellers who purchased insurance earn more than 102,400 a year, and 32.5% earn from 40,100 to 102,400 a year. Only 2.8% of travellers with insurance earn less than 40,100.
Besides, 51.5% of high-income customers decided to buy insurance. The figures for mid-income and low-income customers are 26% and 9.6% respectively.
There is evidence that high-income customers tend to buy more insurance.

Next, let's see if **both** age and income have an effect on the decision.

In [14]:
query = "WITH AgeIncome AS \
        (SELECT CASE \
                WHEN Age >= 25 and Age <= 27 THEN 'Mid 20s' \
                WHEN Age >= 28 and Age <= 30 THEN 'Late 20s' \
                WHEN Age >= 31 and Age <= 33 THEN 'Early 30s' \
                WHEN Age > 33 THEN 'Mid 30s' \
            END as age_bins, \
            CASE WHEN AnnualIncome <= 40100 THEN 'Low Incomes' \
                WHEN AnnualIncome > 40100 and AnnualIncome <= 102400 THEN 'Mid Income' \
                WHEN AnnualIncome > 102400 THEN 'High Incomes' \
            END as incomes, \
            TravelInsurance \
        FROM TravelInsurancePrediction) \
        SELECT incomes, age_bins, \
            SUM(TravelInsurance) AS With_Insurance, \
            SUM(TravelInsurance) * 100.0 / SUM(SUM(TravelInsurance)) OVER () AS Percentage_With_Ins, \
            SUM(TravelInsurance) * 100.0 / SUM(COUNT(TravelInsurance)) OVER () AS Percentage_Total \
        FROM AgeIncome \
        GROUP BY age_bins, incomes \
        ORDER BY \
            (CASE incomes \
            WHEN 'Low Incomes' THEN 1 \
            WHEN 'Mid Income' THEN 2 \
            WHEN 'High Incomes' THEN 3 \
            END) ASC, \
            (CASE age_bins \
            WHEN 'Mid 20s' THEN 1 \
            WHEN 'Late 20s' THEN 2 \
            WHEN 'Early 30s' THEN 3 \
            WHEN 'Mid 30s' THEN 4 \
            END) ASC;"
x =pd.read_sql(query, cnxn)
x

Unnamed: 0,incomes,age_bins,With_Insurance,Percentage_With_Ins,Percentage_Total
0,Low Incomes,Mid 20s,193,27.183099,9.713135
1,Low Incomes,Late 20s,184,25.915493,9.260191
2,Low Incomes,Early 30s,172,24.225352,8.656266
3,Low Incomes,Mid 30s,161,22.676056,8.102667


We can see that Mid 20s with Mid Income contributes the most insurance sold: ~23% of travellers with insurance, followed by Early 30s with High Income at 15%.

Next, we'll see if a traveller's health status is a factor in their decision to buy insurance.

In [15]:
query = "SELECT ChronicDiseases as ChronicDiseases_0_No_1_Yes, COUNT(*) AS travelers, \
            COUNT(*) * 100.00 / SUM(COUNT(*)) OVER () AS Percentage_Total, \
            SUM(TravelInsurance) as With_Insurance, \
            SUM(TravelInsurance) * 100.0 / SUM(SUM(TravelInsurance)) OVER () AS Percentage_With_Ins \
        FROM TravelInsurancePrediction \
        GROUP BY ChronicDiseases \
        Order By ChronicDiseases;"
x =pd.read_sql(query, cnxn)
x

Unnamed: 0,ChronicDiseases_0_No_1_Yes,travelers,Percentage_Total,With_Insurance,Percentage_With_Ins
0,0,1435,72.219426,505,71.126761
1,1,552,27.780574,205,28.873239


27.8% of all travellers have chronic diseases, and 28.87% of travellers with insurance have chronic diseases.
Because the two shares are nearly equal, there's no evidence that travellers with chronic diseases tend to buy more or less insurance.
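
Another way to read the same output is to compare the purchase rate inside each group. The sketch below does that with the counts from the table above. The helper name `purchase_rate` is an assumption for illustration.

```python
def purchase_rate(with_insurance, travellers):
    """Percentage of travellers in a group who bought insurance."""
    return with_insurance * 100.0 / travellers


# Counts taken from the output above (ChronicDiseases = 0 and = 1).
rate_without_chronic = purchase_rate(505, 1435)
rate_with_chronic = purchase_rate(205, 552)

print(f"No chronic disease: {rate_without_chronic:.1f}% bought insurance")
print(f"Chronic disease:    {rate_with_chronic:.1f}% bought insurance")
```

The two rates are within a couple of percentage points of each other, which is consistent with the reading above.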

Next, let's see if frequent flyers tend to buy more insurance.

In [16]:
query = "SELECT FrequentFlyer, \
            COUNT(*) as Passanger, \
            COUNT(*) * 100.00 / SUM(COUNT(*)) OVER () AS Percentage_Total, \
            SUM(TravelInsurance) as With_Insurance, \
            SUM(TravelInsurance) * 100.0 / SUM(SUM(TravelInsurance)) OVER () AS Percentage_With_Ins \
        FROM TravelInsurancePrediction \
        GROUP BY FrequentFlyer \
        Order By FrequentFlyer; "
x =pd.read_sql(query, cnxn)
x

Unnamed: 0,FrequentFlyer,Passanger,Percentage_Total,With_Insurance,Percentage_With_Ins
0,No,1570,79.013588,471,66.338028
1,Yes,417,20.986412,239,33.661972


21.0% of all travellers are frequent flyers, and 33.7% of travellers with insurance are frequent flyers.

Although there's not enough evidence to say that frequent flyers tend to buy more insurance, we should collect more data on this group to see if there is a correlation.

Next, let's see if travellers who have travelled abroad tend to buy more insurance.

In [17]:
query = "SELECT EverTravelledAbroad, COUNT(*) as Passanger, \
            COUNT(*) * 100.00 / SUM(COUNT(*)) OVER () AS Percentage_Total, \
            SUM(TravelInsurance) as With_Insurance, \
            SUM(TravelInsurance) * 100.0 / SUM(SUM(TravelInsurance)) OVER () AS Percentage_With_Ins \
        FROM TravelInsurancePrediction \
        GROUP BY EverTravelledAbroad \
        Order By EverTravelledAbroad;"
x =pd.read_sql(query, cnxn)
x

Unnamed: 0,EverTravelledAbroad,Passanger,Percentage_Total,With_Insurance,Percentage_With_Ins
0,No,1607,80.875692,412,58.028169
1,Yes,380,19.124308,298,41.971831


Only 19% of all travellers have ever travelled abroad, but 42% of travellers who purchased insurance have travelled abroad.
There's strong evidence that travellers who have travelled abroad tend to purchase insurance.

## Conclusion

So far, we have found strong evidence that travellers with a higher income and those who have travelled abroad tend to purchase insurance.  
I might run a hypothesis test in the future to confirm this, but I'll leave the analysis here for now.
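
One possible form such a hypothesis test could take is a chi-square test of independence on the EverTravelledAbroad counts. This is only a sketch and not part of the analysis above: the choice of test, the function name `chi_square_2x2` and the 5% significance level are assumptions. The counts come from the EverTravelledAbroad output (298 of 380 travellers who went abroad bought insurance, 412 of 1607 who did not go abroad).

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the table

        [[a, b],
         [c, d]]
    """
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Rows: travelled abroad yes / no. Columns: insured / not insured.
abroad_insured, abroad_total = 298, 380
home_insured, home_total = 412, 1607

chi2 = chi_square_2x2(
    abroad_insured, abroad_total - abroad_insured,
    home_insured, home_total - home_insured,
)

# Critical value of the chi-square distribution with 1 degree of freedom
# at the 5% significance level.
CRITICAL_VALUE = 3.841

print(f"chi-square statistic: {chi2:.2f}")
print("reject independence at 5%:", chi2 > CRITICAL_VALUE)
```

A statistic above the critical value would suggest that buying insurance and having travelled abroad are not independent in this sample. It says nothing about causation.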