# Project: Ensemble Techniques - Travel Package Purchase Prediction

## Background and Context
You are a Data Scientist for a tourism company named "Visit with us". The Policy Maker of the company wants to enable and establish a viable business model to expand the customer base.

A viable business model is a central concept that helps you understand how the business currently operates and how to change those practices for the benefit of the tourism sector.

One of the ways to expand the customer base is to introduce a new offering of packages.

The company currently offers 5 types of packages: Basic, Standard, Deluxe, Super Deluxe, and King. Last year's data shows that 18% of the customers purchased a package.

However, the marketing cost was quite high because customers were contacted at random without looking at the available information.

The company is now planning to launch a new product, the Wellness Tourism Package. Wellness Tourism is defined as travel that allows the traveler to maintain, enhance, or kick-start a healthy lifestyle, and to support or increase their sense of well-being.

This time, however, the company wants to harness the available data of existing and potential customers to make the marketing expenditure more efficient.

As a Data Scientist at the "Visit with us" travel company, you have to analyze the customers' data to provide recommendations to the Policy Maker and the Marketing Team, and build a model to predict which potential customers will purchase the newly introduced travel package.

### Objective

To predict which customer is more likely to purchase the newly introduced travel package.

### Data Dictionary

#### Customer details:

1. CustomerID: Unique customer ID
2. ProdTaken: Whether the customer has purchased a package or not (0: No, 1: Yes)
3. Age: Age of customer
4. TypeofContact: How the customer was contacted (Company Invited or Self Enquiry)
5. CityTier: City tier depends on the development of a city, population, facilities, and living standards. The categories are ordered, i.e. Tier 1 > Tier 2 > Tier 3
6. Occupation: Occupation of customer
7. Gender: Gender of customer
8. NumberOfPersonVisiting: Total number of persons planning to take the trip with the customer
9. PreferredPropertyStar: Preferred hotel property rating by customer
10. MaritalStatus: Marital status of customer
11. NumberOfTrips: Average number of trips in a year by customer
12. Passport: Whether the customer has a passport or not (0: No, 1: Yes)
13. OwnCar: Whether the customer owns a car or not (0: No, 1: Yes)
14. NumberOfChildrenVisiting: Total number of children under the age of 5 planning to take the trip with the customer
15. Designation: Designation of the customer in the current organization
16. MonthlyIncome: Gross monthly income of the customer

#### Customer interaction data: 

1. PitchSatisfactionScore: Sales pitch satisfaction score
2. ProductPitched: Product pitched by the salesperson
3. NumberOfFollowups: Total number of follow-ups done by the salesperson after the sales pitch
4. DurationOfPitch: Duration of the pitch by a salesperson to the customer

### Best Practices for Notebook:

* The notebook should be well-documented, with inline comments explaining the functionality of code and markdown cells containing comments on the observations and insights.
* The notebook should be run from start to finish in a sequential manner before submission.
* It is preferable to remove all warnings and errors before submission.
* The notebook should be submitted as an HTML file (.html) and as a notebook file (.ipynb)

### Submission Guidelines:

1. There are two parts to the submission:
    1. A well-commented Jupyter notebook [format - .ipynb]
    2. The notebook converted to HTML format
2. Any assignment found copied/plagiarized from other groups will not be graded and will be awarded zero marks
3. Please ensure timely submission, as any submission post-deadline will not be accepted for evaluation
4. The submission will not be evaluated if
    1. it is submitted post-deadline, or
    2. more than 2 files are submitted

Happy Learning!!

## Scoring guide (Rubric) - Travel Package Purchase Prediction

| Criteria  | Points |
| --------- | ------ |
| Perform an Exploratory Data Analysis on the data  | 8  |
| Illustrate the insights based on EDA  | 4  |
| Data Pre-processing  | 7  |
| Model building - Bagging  | 4  |
| Model performance improvement - Bagging  | 9  |
| Model building - Boosting  | 6  |
| Model performance improvement - Boosting  | 9  |
| Model performance evaluation   |  4   |
| Actionable Insights & Recommendations   |  5  |
| Notebook - Overall    |  4   |

## Environment Setup

We will import all the libraries/packages needed for EDA, model building, and model evaluation at the beginning, so that the rest of the notebook can focus purely on those tasks.

We will need the following:
* pandas: For working with dataframes.
* NumPy: For working with arrays and collections.
* Matplotlib: For plotting functions.
* Seaborn: For producing high-quality visualizations.
* Warnings: For suppressing warnings in the notebook output to keep it tidy.
* scikit-learn: For the Bagging, Boosting, and Stacking algorithms.
* XGBoost: For the XGBoost model implementation.

#### Note:
I am using a conda (miniconda) environment on my machine with the following libraries and language versions. The `environment.yml` file is listed below so that the evaluator can run the notebook in the same environment as I did.

```yaml
name: gl-tensorflow
dependencies:
    - python=3.7
    - pip>=20.0
    - jupyter
    - tensorflow=2.0
    - scikit-learn
    - scipy
    - pandas
    - pandas-datareader
    - matplotlib
    - pillow
    - tqdm
    - requests
    - h5py
    - pyyaml
    - flask
    - boto3
    - xgboost
    - pip:
        - bayesian-optimization
        - gym
        - kaggle
```
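The environment file pins only a few versions, and the notebook also imports packages it does not list (for example `seaborn` and `sklearnex`), so it can help to record what is actually installed. Below is a minimal sketch using only the standard library. The helper name `library_versions` and the list of module names are my own assumptions, based on the imports used in this notebook; modules that are not installed are reported as `None` instead of raising an error.

```python
import importlib


def library_versions(module_names):
    """Return {module name: version string, or None if the module is not installed}."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
            continue
        # Most scientific packages expose __version__; fall back to "unknown" otherwise.
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


# Module names assumed from the imports used in this notebook.
for name, version in library_versions(
    ["pandas", "numpy", "matplotlib", "seaborn", "sklearn", "xgboost"]
).items():
    print(f"{name}: {version}")
```

Printing this once at the top of the notebook makes it easier to explain differences in results between two machines.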

In [1]:
# Optional patch that speeds up scikit-learn operations on Intel hardware.
# It is applied before the scikit-learn imports below so that they pick up the patched implementations.
# I have a MacBook Pro and I see good performance gains - please comment out these two lines if you don't need it.
from sklearnex import patch_sklearn
patch_sklearn()

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# use the inline backend for matplotlib to render output inline in the notebook.
%matplotlib inline
import seaborn as sns
# needed to turn off rendering of warnings in notebook outputs
import warnings
warnings.filterwarnings('ignore')
# scikit-learn ecosystem
from sklearn import metrics
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.tree import DecisionTreeClassifier
# XGBoost
from xgboost import XGBClassifier

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
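Since the Intel patch is optional, a guarded import is one way to let the notebook run unchanged on machines where `scikit-learn-intelex` is not installed, instead of asking the reader to comment lines out. This is only a sketch: the helper name `enable_sklearn_acceleration` is hypothetical. As I understand the extension's documentation, the patch should be applied before scikit-learn estimators are imported, so in the notebook this call would sit at the top of the import cell.

```python
def enable_sklearn_acceleration():
    """Apply the Intel patch when it is available; return True if it was applied."""
    try:
        from sklearnex import patch_sklearn
    except ImportError:
        # The extension is not installed: continue with stock scikit-learn.
        return False
    patch_sklearn()
    return True


accelerated = enable_sklearn_acceleration()
print(f"Intel extension enabled: {accelerated}")
```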


## Data Loading

In this project the data is available in an Excel spreadsheet with 2 worksheets. The first worksheet describes the columns in the data. Although we won't need this worksheet for any of our tasks, it is still nice to have it in a data frame so that we can quickly print an overview of the columns inline rather than having to open the Excel spreadsheet separately.

The second worksheet contains the observations on which we want to train/test our models.

Let's import the data to get started.

In [13]:
data_dictionary = pd.read_excel('Tourism.xlsx', sheet_name='Data Dict', engine='openpyxl')
customer_travel_package_data = pd.read_excel('Tourism.xlsx', sheet_name='Tourism', engine='openpyxl')

In [14]:
data_dictionary.drop('Unnamed: 0', axis=1, inplace=True)
data_dictionary.drop([0], inplace=True)
data_dictionary.rename(columns={"Unnamed: 1":"Dataset", "Unnamed: 2":"ColumnName", "Unnamed: 3":"Description"}, inplace=True)
data_dictionary.head()

|   | Dataset | ColumnName | Description |
| --- | ------- | ---------- | ----------- |
| 1 | Tourism | CustomerID | Unique customer ID |
| 2 | Tourism | ProdTaken | Whether the customer has purchased a package o... |
| 3 | Tourism | Age | Age of customer |
| 4 | Tourism | TypeofContact | How customer was contacted (Company Invited or... |
| 5 | Tourism | CityTier | City tier depends on the development of a city... |


In [15]:
customer_travel_package_data.head()

|   | CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 200000 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3 | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | 1 | 0.0 | Manager | 20993.0 |
| 1 | 200001 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 3 | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | 1 | 2.0 | Manager | 20130.0 |
| 2 | 200002 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 3 | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0 | 0.0 | Executive | 17090.0 |
| 3 | 200003 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 2 | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | 1 | 1.0 | Executive | 17909.0 |
| 4 | 200004 | 0 | NaN | Self Enquiry | 1 | 8.0 | Small Business | Male | 2 | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | 1 | 0.0 | Executive | 18468.0 |


# EDA

## General Understanding of the dataset

In order to gain a general understanding of the dataset we will do some simple tasks like:
- Looking at the shape of data
- Determining the quality of data - count of missing values vs total values in the dataset
- Information about the different columns in the dataset
- Distribution of target variable (positive/yes vs negative/no)

In [29]:
print(f'Dataset Shape: {customer_travel_package_data.shape}')
missing_values_count = customer_travel_package_data.isna().sum().sum()
total_values_count = customer_travel_package_data.shape[0]*customer_travel_package_data.shape[1]
missing_values_percentage = (missing_values_count/total_values_count)*100
print(f'Missing Values: {missing_values_count} ({missing_values_percentage}%)\nTotal Values: {total_values_count}')

Dataset Shape: (4888, 20)
Missing Values: 1012 (1.03518821603928%)
Total Values: 97760


In [32]:
# Get information about various columns and their datatypes
customer_travel_package_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   CustomerID                4888 non-null   int64  
 1   ProdTaken                 4888 non-null   int64  
 2   Age                       4662 non-null   float64
 3   TypeofContact             4863 non-null   object 
 4   CityTier                  4888 non-null   int64  
 5   DurationOfPitch           4637 non-null   float64
 6   Occupation                4888 non-null   object 
 7   Gender                    4888 non-null   object 
 8   NumberOfPersonVisiting    4888 non-null   int64  
 9   NumberOfFollowups         4843 non-null   float64
 10  ProductPitched            4888 non-null   object 
 11  PreferredPropertyStar     4862 non-null   float64
 12  MaritalStatus             4888 non-null   object 
 13  NumberOfTrips             4748 non-null   float64
 14  Passport

In [33]:
# describe the dataset in terms of distribution of numeric values - some of the numeric values are actually categories
customer_travel_package_data.describe()

|   | CustomerID | ProdTaken | Age | CityTier | DurationOfPitch | NumberOfPersonVisiting | NumberOfFollowups | PreferredPropertyStar | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | MonthlyIncome |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 4888.0 | 4888.0 | 4662.0 | 4888.0 | 4637.0 | 4888.0 | 4843.0 | 4862.0 | 4748.0 | 4888.0 | 4888.0 | 4888.0 | 4822.0 | 4655.0 |
| mean | 202443.5 | 0.188216 | 37.622265 | 1.654255 | 15.490835 | 2.905074 | 3.708445 | 3.581037 | 3.236521 | 0.290917 | 3.078151 | 0.620295 | 1.187267 | 23619.853491 |
| std | 1411.188388 | 0.390925 | 9.316387 | 0.916583 | 8.519643 | 0.724891 | 1.002509 | 0.798009 | 1.849019 | 0.454232 | 1.365792 | 0.485363 | 0.857861 | 5380.698361 |
| min | 200000.0 | 0.0 | 18.0 | 1.0 | 5.0 | 1.0 | 1.0 | 3.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1000.0 |
| 25% | 201221.75 | 0.0 | 31.0 | 1.0 | 9.0 | 2.0 | 3.0 | 3.0 | 2.0 | 0.0 | 2.0 | 0.0 | 1.0 | 20346.0 |
| 50% | 202443.5 | 0.0 | 36.0 | 1.0 | 13.0 | 3.0 | 4.0 | 3.0 | 3.0 | 0.0 | 3.0 | 1.0 | 1.0 | 22347.0 |
| 75% | 203665.25 | 0.0 | 44.0 | 3.0 | 20.0 | 3.0 | 4.0 | 4.0 | 4.0 | 1.0 | 4.0 | 1.0 | 2.0 | 25571.0 |
| max | 204887.0 | 1.0 | 61.0 | 3.0 | 127.0 | 5.0 | 6.0 | 5.0 | 22.0 | 1.0 | 5.0 | 1.0 | 3.0 | 98678.0 |
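The non-null counts printed by `info()` and `describe()` above also show where the 1012 missing values sit. The worked example below uses only the counts visible in those outputs, restricted to the columns that report fewer than 4888 values. In the notebook itself, `customer_travel_package_data.isna().sum()` should give the same per-column breakdown directly.

```python
TOTAL_ROWS = 4888

# Non-null counts copied from the info()/describe() outputs above,
# restricted to the columns that report fewer than 4888 values.
non_null_counts = {
    "Age": 4662,
    "TypeofContact": 4863,
    "DurationOfPitch": 4637,
    "NumberOfFollowups": 4843,
    "PreferredPropertyStar": 4862,
    "NumberOfTrips": 4748,
    "NumberOfChildrenVisiting": 4822,
    "MonthlyIncome": 4655,
}

missing_by_column = {col: TOTAL_ROWS - count for col, count in non_null_counts.items()}

# Print columns from most to least affected, with the share of rows missing.
for col, missing in sorted(missing_by_column.items(), key=lambda item: item[1], reverse=True):
    print(f"{col:<26}{missing:>5}  ({missing / TOTAL_ROWS:.2%})")

print(f"Total missing: {sum(missing_by_column.values())}")
```

The per-column gaps add up to 1012, which matches the total computed earlier, so the remaining columns have no missing values. The most affected column, DurationOfPitch, is missing 251 values (about 5.1% of rows).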


In [34]:
# counts of negative (0) and positive (1) observations
customer_travel_package_data['ProdTaken'].value_counts()

0    3968
1     920
Name: ProdTaken, dtype: int64
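
To put these counts in perspective, here is a small worked example that turns them into shares. It uses only the two counts from the output above. In the notebook, calling `value_counts(normalize=True)` on the same column should return these proportions directly.

```python
# Class counts copied from the value_counts() output above.
class_counts = {0: 3968, 1: 920}
total = sum(class_counts.values())

class_shares = {label: count / total for label, count in class_counts.items()}
for label, share in class_shares.items():
    print(f"ProdTaken={label}: {share:.1%}")

# A model that always predicts the majority class (0) would already reach
# this accuracy, which is why accuracy alone is a weak yardstick here.
majority_baseline_accuracy = max(class_shares.values())
print(f"Majority-class baseline accuracy: {majority_baseline_accuracy:.1%}")
```

Roughly 81% of the observations are negative and 19% positive, which is in line with the 18% purchase rate mentioned in the background.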

### Insights about the dataset.

After poking around the data, we can make these general observations about the dataset:
1. There are around 5k observations (4888 to be precise); each observation has 19 attributes (excluding CustomerID, which doesn't add any value for our use case).
2. We have a total of 1012 missing values (roughly 1% of all data points), so the overall data quality is reasonable.
3. The dataset has a lot of categorical variables (9 out of 19):
    1. TypeofContact
    2. CityTier
    3. Occupation
    4. Gender
    5. ProductPitched
    6. PreferredPropertyStar (stored as a number, but the values are categories)
    7. MaritalStatus
    8. PitchSatisfactionScore (stored as a number, but the values are categories)
    9. Designation
4. The dataset is skewed towards negative observations: 3968 customers (about 81%) did not purchase a package, while 920 (about 19%) did.
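
One way to act on the third observation is to tell pandas explicitly that these columns are categories, including the ones stored as numbers. The sketch below is an illustration rather than the notebook's actual pre-processing: the `sample` frame is a hypothetical three-row excerpt with values copied from the first rows of the `head()` output, and only a few of the nine columns are shown.

```python
import pandas as pd

# Hypothetical three-row excerpt; values copied from the head() output above.
sample = pd.DataFrame({
    "CityTier": [3, 1, 1],
    "PreferredPropertyStar": [3.0, 4.0, 3.0],
    "PitchSatisfactionScore": [2, 3, 3],
    "Occupation": ["Salaried", "Salaried", "Free Lancer"],
    "MonthlyIncome": [20993.0, 20130.0, 17090.0],
})

# Columns to treat as categories, even though some are stored as numbers.
categorical_columns = [
    "CityTier",
    "PreferredPropertyStar",
    "PitchSatisfactionScore",
    "Occupation",
]

# astype accepts a {column: dtype} mapping; the other columns keep their dtype.
sample = sample.astype({col: "category" for col in categorical_columns})
print(sample.dtypes)
```

After the conversion, `describe()` no longer reports means and quartiles for these columns, which avoids reading too much into statistics such as the "mean" city tier.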

### Actions based on the above insights

Based on the above, we can drop the CustomerID column right away, as it would only add noise to the multivariate analysis.

In [35]:
customer_travel_package_data.drop('CustomerID', axis=1, inplace=True)
customer_travel_package_data.head()

|   | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3 | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | 1 | 0.0 | Manager | 20993.0 |
| 1 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 3 | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | 1 | 2.0 | Manager | 20130.0 |
| 2 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 3 | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0 | 0.0 | Executive | 17090.0 |
| 3 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 2 | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | 1 | 1.0 | Executive | 17909.0 |
| 4 | 0 | NaN | Self Enquiry | 1 | 8.0 | Small Business | Male | 2 | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | 1 | 0.0 | Executive | 18468.0 |
