# **Holiday Package Prediction**

## **Context**

**"Trips & Travel.com"** wants to establish a viable business model to expand its customer base. Currently, the company offers five types of packages: Basic, Standard, Deluxe, Super Deluxe, and King. *Last year's data showed that 18% of customers purchased these packages. However, marketing costs were high because customers were contacted at random, without using the available information.* The company now plans to launch a new product: the Wellness Tourism Package. Wellness tourism is travel that helps maintain, enhance, or start a healthy lifestyle and boosts overall well-being. *This time, the company intends to use existing data on current and potential customers to spend its marketing budget more efficiently.*

Based on this, we can determine the goals, objectives, and key metrics for this project:

**Goals:**
- Increase the customer purchase rate
- Reduce marketing costs

**Objectives:**
- Utilize available data to target potential customers more effectively.
- Optimize marketing strategies to ensure cost-efficiency and higher conversion rates.

**Key Metrics:**
- Customer purchase rate: Measure the percentage increase in package purchases.
- Return on investment (ROI): Assess the financial returns from the new Wellness Tourism Package and overall marketing efforts.
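The two metrics above can be written as simple ratios. The sketch below is only illustrative: the source does not define ROI precisely, so the formula `(revenue - cost) / cost` is an assumption, and every figure in the example is hypothetical rather than taken from the dataset.

```python
def purchase_rate(purchases: int, customers_contacted: int) -> float:
    """Share of contacted customers who bought a package."""
    if customers_contacted <= 0:
        raise ValueError("customers_contacted must be positive")
    return purchases / customers_contacted


def roi(revenue: float, marketing_cost: float) -> float:
    """Return on investment, assumed here to be (revenue - cost) / cost."""
    if marketing_cost <= 0:
        raise ValueError("marketing_cost must be positive")
    return (revenue - marketing_cost) / marketing_cost


# Hypothetical figures, for illustration only
baseline_rate = purchase_rate(purchases=180, customers_contacted=1000)
campaign_rate = purchase_rate(purchases=250, customers_contacted=1000)
rate_uplift = campaign_rate - baseline_rate  # change in purchase rate
campaign_roi = roi(revenue=50_000, marketing_cost=20_000)

print(f"Purchase rate uplift: {rate_uplift:.1%}")
print(f"Campaign ROI: {campaign_roi:.2f}")
```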


## **Data Understanding**

The dataset used in this analysis is sourced from Kaggle, titled [Holiday Package Prediction](https://www.kaggle.com/datasets/susant4learning/holiday-package-purchase-prediction). It contains 4888 rows and 20 columns. The dataset includes the following columns:

1. **CustomerID**: Unique identifier for each customer.
2. **ProdTaken**: Indicates whether the customer purchased the product (0: No, 1: Yes).
3. **Age**: Age of customer.
4. **TypeofContact**: How the customer was contacted (Company Invited or Self Enquiry).
5. **CityTier**: City tier depends on the development of a city, population, facilities, and living standards. The categories are ordered.
6. **DurationOfPitch**: Duration of the pitch by a salesperson to the customer.
7. **Occupation**: Occupation of customer.
8. **Gender**: Gender of customer.
9. **NumberOfPersonVisiting**: Total number of persons planning to take the trip with the customer.
10. **NumberOfFollowups**: Total number of follow-ups made by the salesperson after the sales pitch.
11. **ProductPitched**: Product pitched by the salesperson.
12. **PreferredPropertyStar**: Preferred hotel property rating by customer.
13. **MaritalStatus**: Marital status of customer.
14. **NumberOfTrips**: Average number of trips in a year by customer.
15. **Passport**: Indicates if the customer has a passport (0: No, 1: Yes).
16. **PitchSatisfactionScore**: Satisfaction score of the sales pitch.
17. **OwnCar**: Indicates if the customer owns a car (0: No, 1: Yes).
18. **NumberOfChildrenVisiting**: Total number of children under the age of 5 planning to take the trip with the customer.
19. **Designation**: Job title of the customer in their current organization.
20. **MonthlyIncome**: Gross monthly income of the customer.

## **Import Library**

In [2]:
import pandas as pd

## **Read Dataset**

In [3]:
df = pd.read_csv('Travel.csv')
df

| | CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200000 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3 | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | 1 | 0.0 | Manager | 20993.0 |
| 1 | 200001 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 3 | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | 1 | 2.0 | Manager | 20130.0 |
| 2 | 200002 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 3 | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0 | 0.0 | Executive | 17090.0 |
| 3 | 200003 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 2 | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | 1 | 1.0 | Executive | 17909.0 |
| 4 | 200004 | 0 | NaN | Self Enquiry | 1 | 8.0 | Small Business | Male | 2 | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | 1 | 0.0 | Executive | 18468.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4883 | 204883 | 1 | 49.0 | Self Enquiry | 3 | 9.0 | Small Business | Male | 3 | 5.0 | Deluxe | 4.0 | Unmarried | 2.0 | 1 | 1 | 1 | 1.0 | Manager | 26576.0 |
| 4884 | 204884 | 1 | 28.0 | Company Invited | 1 | 31.0 | Salaried | Male | 4 | 5.0 | Basic | 3.0 | Single | 3.0 | 1 | 3 | 1 | 2.0 | Executive | 21212.0 |
| 4885 | 204885 | 1 | 52.0 | Self Enquiry | 3 | 17.0 | Salaried | Female | 4 | 4.0 | Standard | 4.0 | Married | 7.0 | 0 | 1 | 1 | 3.0 | Senior Manager | 31820.0 |
| 4886 | 204886 | 1 | 19.0 | Self Enquiry | 3 | 16.0 | Small Business | Male | 3 | 4.0 | Basic | 3.0 | Single | 3.0 | 0 | 5 | 0 | 2.0 | Executive | 20289.0 |


## **EDA**

### Descriptive Statistics

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   CustomerID                4888 non-null   int64  
 1   ProdTaken                 4888 non-null   int64  
 2   Age                       4662 non-null   float64
 3   TypeofContact             4863 non-null   object 
 4   CityTier                  4888 non-null   int64  
 5   DurationOfPitch           4637 non-null   float64
 6   Occupation                4888 non-null   object 
 7   Gender                    4888 non-null   object 
 8   NumberOfPersonVisiting    4888 non-null   int64  
 9   NumberOfFollowups         4843 non-null   float64
 10  ProductPitched            4888 non-null   object 
 11  PreferredPropertyStar     4862 non-null   float64
 12  MaritalStatus             4888 non-null   object 
 13  NumberOfTrips             4748 non-null   float64
 14  Passport

Based on the dataset information, we can infer the following:
- Eight columns have missing values (Age, TypeofContact, DurationOfPitch, NumberOfFollowups, PreferredPropertyStar, NumberOfTrips, NumberOfChildrenVisiting, and MonthlyIncome). The share of missing values is small in each case (DurationOfPitch has the most, with 251 of 4888 rows, or about 5.1%), so they may not require action, but further analysis is necessary.
- The data types are generally appropriate and do not present any significant issues. However, to make the analysis easier, I will convert the ProdTaken, Passport, and OwnCar columns to strings (object dtype) so that they are treated as categorical.
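The missing-value observation above can be checked with a small summary of counts and percentages per column. This is a minimal sketch, not the notebook's own code: `missing_summary` is a hypothetical helper, and the toy DataFrame stands in for `df` so the block runs on its own.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    missing = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": missing,
        "missing_pct": (missing / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy data (hypothetical) standing in for the real dataset
toy = pd.DataFrame({
    "Age": [41.0, 49.0, np.nan, 33.0],
    "TypeofContact": ["Self Enquiry", None, "Self Enquiry", "Company Invited"],
    "CityTier": [3, 1, 1, 1],
})

report = missing_summary(toy)
print(report)
```

On the real data, `missing_summary(df)` would list the eight affected columns with their shares, which makes it easier to decide whether to drop or impute.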

In [14]:
# Cast the binary flag columns to strings so pandas treats them as categorical
df[['ProdTaken', 'Passport', 'OwnCar']] = df[['ProdTaken', 'Passport', 'OwnCar']].astype('str')
# Categorical columns (including the converted flags)
categoric = ['TypeofContact', 'Occupation', 'Gender', 'ProductPitched', 'MaritalStatus', 'Designation', 'ProdTaken', 'Passport', 'OwnCar']
# Numeric columns
numeric = ['Age', 'CityTier', 'DurationOfPitch', 'NumberOfPersonVisiting', 'NumberOfFollowups', 'PreferredPropertyStar', 'NumberOfTrips', 'PitchSatisfactionScore', 'NumberOfChildrenVisiting', 'MonthlyIncome']

In [21]:
df[['Gender']].value_counts()

Gender 
Male       2916
Female     1817
Fe Male     155
Name: count, dtype: int64

In [15]:
df[categoric].describe()

| | TypeofContact | Occupation | Gender | ProductPitched | MaritalStatus | Designation | ProdTaken | Passport | OwnCar |
|---|---|---|---|---|---|---|---|---|---|
| count | 4863 | 4888 | 4888 | 4888 | 4888 | 4888 | 4888 | 4888 | 4888 |
| unique | 2 | 4 | 3 | 5 | 4 | 5 | 2 | 2 | 2 |
| top | Self Enquiry | Salaried | Male | Basic | Married | Executive | False | False | True |
| freq | 3444 | 2368 | 2916 | 1842 | 2340 | 1842 | 3968 | 3466 | 3032 |
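The `top` and `freq` rows can be turned into percentage shares by dividing by the non-null count of each column. The sketch below shows one way to do that; `top_share` is a hypothetical helper and the toy Series is made up so the block is self-contained.

```python
import pandas as pd


def top_share(series: pd.Series) -> tuple:
    """Return (most frequent value, its count, its share in % of non-null rows)."""
    counts = series.value_counts()  # missing values are excluded by default
    top_value = counts.index[0]
    top_freq = int(counts.iloc[0])
    share_pct = round(top_freq / int(counts.sum()) * 100, 1)
    return top_value, top_freq, share_pct


# Toy data (hypothetical), including one missing value
toy = pd.Series(["Self Enquiry", "Self Enquiry", "Company Invited", None, "Self Enquiry"])
print(top_share(toy))
```

Because missing values are excluded, a share for TypeofContact computed this way is relative to 4863 rows rather than 4888.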


- **TypeofContact**: Most customers made contact through Self Enquiry (3444 out of 4863 / 70.8%).
- **Occupation**: Salaried customers are the largest group (2368 out of 4888 / 48.4%).
- **Gender**: Most customers are male (2916 out of 4888 / 59.7%). However, the column has three unique values instead of the expected two: `Fe Male` appears alongside `Female`, which looks like a data-entry inconsistency. This issue needs to be addressed.
- **ProductPitched**: The Basic package is the most commonly pitched (1842 out of 4888 / 37.7%).
- **MaritalStatus**: Married customers are the largest group (2340 out of 4888 / 47.9%).
- **Designation**: The most frequent job designation among customers is Executive (1842 out of 4888 / 37.7%).
- **ProdTaken**: A large majority of customers did not take the product (3968 out of 4888 / 81.2%).
- **Passport**: Most customers do not have a passport (3466 out of 4888 / 70.9%).
- **OwnCar**: Most customers own a car (3032 out of 4888 / 62.0%).
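One possible way to address the Gender issue noted above is to map the stray label onto the standard one. This sketch assumes that `Fe Male` is a misspelling of `Female`; the source does not confirm that, so treat the mapping as an assumption. The toy DataFrame is made up so the block runs on its own.

```python
import pandas as pd

# Toy data (hypothetical) reproducing the three labels seen in the Gender column
toy = pd.DataFrame({"Gender": ["Male", "Female", "Fe Male", "Male"]})

# Assumption: "Fe Male" is a data-entry variant of "Female"
toy["Gender"] = toy["Gender"].replace({"Fe Male": "Female"})

gender_counts = toy["Gender"].value_counts()
print(gender_counts)
```

Applied to the real column under the same assumption, the 155 `Fe Male` rows would join the 1817 `Female` rows, leaving two categories.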

In [16]:
df[numeric].describe()

| | Age | CityTier | DurationOfPitch | NumberOfPersonVisiting | NumberOfFollowups | PreferredPropertyStar | NumberOfTrips | PitchSatisfactionScore | NumberOfChildrenVisiting | MonthlyIncome |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 4662.0 | 4888.0 | 4637.0 | 4888.0 | 4843.0 | 4862.0 | 4748.0 | 4888.0 | 4822.0 | 4655.0 |
| mean | 37.622265 | 1.654255 | 15.490835 | 2.905074 | 3.708445 | 3.581037 | 3.236521 | 3.078151 | 1.187267 | 23619.853491 |
| std | 9.316387 | 0.916583 | 8.519643 | 0.724891 | 1.002509 | 0.798009 | 1.849019 | 1.365792 | 0.857861 | 5380.698361 |
| min | 18.0 | 1.0 | 5.0 | 1.0 | 1.0 | 3.0 | 1.0 | 1.0 | 0.0 | 1000.0 |
| 25% | 31.0 | 1.0 | 9.0 | 2.0 | 3.0 | 3.0 | 2.0 | 2.0 | 1.0 | 20346.0 |
| 50% | 36.0 | 1.0 | 13.0 | 3.0 | 4.0 | 3.0 | 3.0 | 3.0 | 1.0 | 22347.0 |
| 75% | 44.0 | 3.0 | 20.0 | 3.0 | 4.0 | 4.0 | 4.0 | 4.0 | 2.0 | 25571.0 |
| max | 61.0 | 3.0 | 127.0 | 5.0 | 6.0 | 5.0 | 22.0 | 5.0 | 3.0 | 98678.0 |
