[Open In Colab](https://colab.research.google.com/github/leulged/tanzania-tourism-prediction-zindi/blob/main/Tanzania_Tourism_Prediction.ipynb)

# 🌍 Tanzania Tourism Prediction - Zindi Challenge

Welcome to the Tanzania Tourism Prediction project! In this notebook, we aim to explore and analyze tourism data to uncover patterns, insights, and key indicators that can help predict tourism trends in Tanzania.

### 📌 Objective
Our goal is to perform data cleaning, exploration, and feature engineering on the dataset to support accurate predictions for tourism-related metrics.

### 📊 What We'll Do:
- Explore the dataset and understand its structure.
- Handle missing values and clean the data.
- Perform descriptive analysis and filtering.
- Engineer new features for better insights.
- Prepare the data for modeling through encoding and normalization.

Let's dive in and see what the data reveals about travel behaviors and tourism in Tanzania!


## 📦 Importing Libraries

We begin by importing essential Python libraries for data analysis:

- **Pandas**: For data manipulation and analysis.
- **NumPy**: For numerical operations and handling arrays.


In [3]:
import pandas as pd
import numpy as np


## 📂 Loading the Dataset

We now load the dataset into a Pandas DataFrame to begin our analysis. This dataset contains tourism-related information collected for the Zindi competition.

Let’s load the training data and take a look at its structure by displaying the first few rows using `.head()`.


In [5]:
df = pd.read_csv('/content/sample_data/Train (1).csv')
df.head()


Unnamed: 0,ID,country,age_group,travel_with,total_female,total_male,purpose,main_activity,info_source,tour_arrangement,...,package_transport_tz,package_sightseeing,package_guided_tour,package_insurance,night_mainland,night_zanzibar,payment_mode,first_trip_tz,most_impressing,total_cost
0,tour_0,SWIZERLAND,45-64,Friends/Relatives,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Friends, relatives",Independent,...,No,No,No,No,13.0,0.0,Cash,No,Friendly People,674602.5
1,tour_10,UNITED KINGDOM,25-44,,1.0,0.0,Leisure and Holidays,Cultural tourism,others,Independent,...,No,No,No,No,14.0,7.0,Cash,Yes,"Wonderful Country, Landscape, Nature",3214906.5
2,tour_1000,UNITED KINGDOM,25-44,Alone,0.0,1.0,Visiting Friends and Relatives,Cultural tourism,"Friends, relatives",Independent,...,No,No,No,No,1.0,31.0,Cash,No,Excellent Experience,3315000.0
3,tour_1002,UNITED KINGDOM,25-44,Spouse,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Package Tour,...,Yes,Yes,Yes,No,11.0,0.0,Cash,Yes,Friendly People,7790250.0
4,tour_1004,CHINA,1-24,,1.0,0.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Independent,...,No,No,No,No,7.0,4.0,Cash,Yes,No comments,1657500.0


## 🔍 1. Data Exploration

In this section, we explore the dataset to understand its quality and structure. We'll begin by:

- Counting missing values in each column.
- Identifying the number of unique values in each categorical column.


In [6]:
# Count missing values in each column
missing_values = df.isnull().sum()
print("📌 Missing values per column:\n")
print(missing_values)

# Identify categorical columns (typically 'object' type)
categorical_cols = df.select_dtypes(include='object').columns

# Show number of unique values for each categorical column
print("\n📌 Unique values in each categorical column:\n")
for col in categorical_cols:
    unique_vals = df[col].nunique()
    print(f"{col}: {unique_vals}")


📌 Missing values per column:

ID                          0
country                     0
age_group                   0
travel_with              1114
total_female                3
total_male                  5
purpose                     0
main_activity               0
info_source                 0
tour_arrangement            0
package_transport_int       0
package_accomodation        0
package_food                0
package_transport_tz        0
package_sightseeing         0
package_guided_tour         0
package_insurance           0
night_mainland              0
night_zanzibar              0
payment_mode                0
first_trip_tz               0
most_impressing           313
total_cost                  0
dtype: int64

📌 Unique values in each categorical column:

ID: 4809
country: 105
age_group: 4
travel_with: 5
purpose: 7
main_activity: 9
info_source: 8
tour_arrangement: 2
package_transport_int: 2
package_accomodation: 2
package_food: 2
package_transport_tz: 2
package_sightseeing
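
Raw counts are easier to judge as a share of all rows. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not taken from the competition data) to show how `isna().mean()` turns missing counts into percentages; on the real data the same call would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the training data
toy = pd.DataFrame({
    "travel_with": ["Alone", np.nan, "Spouse", np.nan],
    "total_female": [1.0, 0.0, np.nan, 2.0],
    "total_cost": [100.0, 250.0, 80.0, 40.0],
})

# isna() marks missing cells; the column mean of that mask is the missing fraction
missing_share = toy.isna().mean().mul(100).round(1)

# Show the columns with the largest share of missing values first
print(missing_share.sort_values(ascending=False))
```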

## 📈 2. Descriptive Statistics

Next, we compute basic statistics to summarize the dataset. Specifically, we will:

- Calculate the average **total_cost** grouped by the purpose of travel.
- Compute the average number of nights spent on the mainland and in Zanzibar, grouped by country.


In [7]:
# Average total cost by purpose of travel
avg_cost_by_purpose = df.groupby("purpose")[["total_cost"]].mean()
print("📌 Average Total Cost by Purpose:\n")
print(avg_cost_by_purpose)

# Average nights spent in mainland and Zanzibar by country
avg_nights_by_country = df.groupby("country")[["night_mainland", "night_zanzibar"]].mean()
print("\n📌 Average Nights (Mainland & Zanzibar) by Country:\n")
print(avg_nights_by_country)


📌 Average Total Cost by Purpose:

                                  total_cost
purpose                                     
Business                        1.782438e+06
Leisure and Holidays            1.195114e+07
Meetings and Conference         2.453004e+06
Other                           1.592155e+06
Scientific and Academic         4.031990e+06
Visiting Friends and Relatives  3.190776e+06
Volunteering                    3.950565e+06

📌 Average Nights (Mainland & Zanzibar) by Country:

                          night_mainland  night_zanzibar
country                                                 
ALGERIA                         7.500000       10.500000
ANGOLA                          6.000000       12.000000
ARGENTINA                       6.000000        3.000000
AUSTRALIA                       8.854839        1.870968
AUSTRIA                         9.277778        4.833333
...                                  ...             ...
UNITED STATES OF AMERICA       10.082014        1.06
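
A group mean says nothing about how many rows it is based on, and a country with a single trip can produce an extreme average. As a sketch, assuming we also want the group size next to each mean, named aggregation can return both at once. The frame `toy` and the country labels are placeholders invented for illustration.

```python
import pandas as pd

# Hypothetical data: one country with three trips, one with a single trip
toy = pd.DataFrame({
    "country": ["COUNTRY_A", "COUNTRY_A", "COUNTRY_A", "COUNTRY_B"],
    "night_mainland": [4.0, 6.0, 8.0, 6.0],
    "night_zanzibar": [0.0, 2.0, 4.0, 12.0],
})

# Named aggregation: each keyword becomes an output column
summary = toy.groupby("country").agg(
    n_trips=("night_mainland", "size"),
    avg_mainland=("night_mainland", "mean"),
    avg_zanzibar=("night_zanzibar", "mean"),
)

# Sorting by group size puts the best-supported averages first
print(summary.sort_values("n_trips", ascending=False))
```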

## 🔎 3. Data Filtering

Here, we apply conditional filtering to extract specific subsets of the data:

- Filter all trips where the **purpose** is *"Leisure and Holidays"* and **total_cost** is greater than or equal to 5000.
- Find all **first-time travelers** whose **main activity** was *"Wildlife tourism"*.


In [8]:
# Filter: Leisure and Holidays trips with total cost >= 5000
leisure_high_cost = df[(df["purpose"] == "Leisure and Holidays") & (df["total_cost"] >= 5000)]
print("📌 Leisure and Holidays trips with total_cost >= 5000 (First 10 rows):\n")
print(leisure_high_cost.head(10))

# Filter: First-time travelers interested in Wildlife tourism
first_time_wildlife = df[(df["first_trip_tz"] == "Yes") & (df["main_activity"] == "Wildlife tourism")]
print("\n📌 First-time travelers whose main activity was Wildlife tourism (First 10 rows):\n")
print(first_time_wildlife.head(10))


📌 Leisure and Holidays trips with total_cost >= 5000 (First 10 rows):

           ID                   country age_group        travel_with  \
0      tour_0                SWIZERLAND     45-64  Friends/Relatives   
1     tour_10            UNITED KINGDOM     25-44                NaN   
3   tour_1002            UNITED KINGDOM     25-44             Spouse   
4   tour_1004                     CHINA      1-24                NaN   
5   tour_1005            UNITED KINGDOM     25-44                NaN   
7   tour_1008  UNITED STATES OF AMERICA     45-64  Friends/Relatives   
8    tour_101                   NIGERIA     25-44              Alone   
10  tour_1012                    BRAZIL     25-44             Spouse   
11  tour_1013                    CANADA     45-64           Children   
12  tour_1016                    CANADA     45-64           Children   

    total_female  total_male               purpose     main_activity  \
0            1.0         1.0  Leisure and Holidays  Wildlife tou
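
Each condition in a filter is a boolean Series, and `&` combines them element-wise, which is why every comparison needs its own parentheses. The sketch below, on a made-up three-row frame, writes the same filter as a mask and as a `DataFrame.query()` string; both forms should select the same rows.

```python
import pandas as pd

# Hypothetical rows, not taken from the competition data
toy = pd.DataFrame({
    "purpose": ["Leisure and Holidays", "Business", "Leisure and Holidays"],
    "total_cost": [4000.0, 9000.0, 7500.0],
})

# Boolean mask: parentheses are required because & binds tighter than == and >=
mask = (toy["purpose"] == "Leisure and Holidays") & (toy["total_cost"] >= 5000)
by_mask = toy[mask]

# The same condition expressed as a query string
by_query = toy.query("purpose == 'Leisure and Holidays' and total_cost >= 5000")

print(by_mask)
```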

## 🛠️ 4. Feature Engineering

In this step, we create a new feature to enrich the dataset:

- **total_people**: This column represents the total number of travelers in a group, calculated by summing the number of females and males.


In [9]:
# Create a new column for total number of people
df["total_people"] = df['total_female'] + df['total_male']

# Preview the updated DataFrame
df.head()


Unnamed: 0,ID,country,age_group,travel_with,total_female,total_male,purpose,main_activity,info_source,tour_arrangement,...,package_sightseeing,package_guided_tour,package_insurance,night_mainland,night_zanzibar,payment_mode,first_trip_tz,most_impressing,total_cost,total_people
0,tour_0,SWIZERLAND,45-64,Friends/Relatives,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Friends, relatives",Independent,...,No,No,No,13.0,0.0,Cash,No,Friendly People,674602.5,2.0
1,tour_10,UNITED KINGDOM,25-44,,1.0,0.0,Leisure and Holidays,Cultural tourism,others,Independent,...,No,No,No,14.0,7.0,Cash,Yes,"Wonderful Country, Landscape, Nature",3214906.5,1.0
2,tour_1000,UNITED KINGDOM,25-44,Alone,0.0,1.0,Visiting Friends and Relatives,Cultural tourism,"Friends, relatives",Independent,...,No,No,No,1.0,31.0,Cash,No,Excellent Experience,3315000.0,1.0
3,tour_1002,UNITED KINGDOM,25-44,Spouse,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Package Tour,...,Yes,Yes,No,11.0,0.0,Cash,Yes,Friendly People,7790250.0,2.0
4,tour_1004,CHINA,1-24,,1.0,0.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Independent,...,No,No,No,7.0,4.0,Cash,Yes,No comments,1657500.0,1.0
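
One caveat: the missing-value check earlier reported a few gaps in `total_female` and `total_male`, and `+` returns `NaN` whenever either operand is missing. A possible alternative, sketched here on hypothetical values, is a row-wise `sum` with `min_count=1`, which treats a single missing count as zero but keeps `NaN` when both are missing. Whether a missing count should be read as zero is an assumption to check against the data.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: complete, one value missing, both values missing
toy = pd.DataFrame({
    "total_female": [1.0, np.nan, np.nan],
    "total_male": [1.0, 2.0, np.nan],
})

# Plain addition: NaN in either column makes the result NaN
plus = toy["total_female"] + toy["total_male"]

# Row-wise sum skips NaN, but min_count=1 still requires at least one real value
row_sum = toy[["total_female", "total_male"]].sum(axis=1, min_count=1)

print(pd.DataFrame({"plus": plus, "row_sum": row_sum}))
```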


## 📊 5. Aggregation & Grouping

In this section, we perform group-based aggregations to extract summarized insights:

- Create a new column **total_nights** by summing up nights spent on the mainland and in Zanzibar.
- Group by **age_group** and calculate the average **total_cost** and **total_nights**.
- Count how many trips fall into each **travel_with** category (rows with a missing **travel_with** are not counted).


In [10]:
# Create a new column total_nights by adding nights on mainland and Zanzibar
df["total_nights"] = df["night_mainland"] + df["night_zanzibar"]

# Group by age_group to find average total_cost and total_nights
avg_cost_nights_by_age = df.groupby("age_group")[["total_cost", "total_nights"]].mean()
print("📌 Average total_cost and total_nights by Age Group:\n")
print(avg_cost_nights_by_age)

# Count trips per travel_with category (value_counts skips NaN by default)
travel_group_counts = df["travel_with"].value_counts()
print("\n📌 Number of travelers by Travel Group:\n")
print(travel_group_counts)


📌 Average total_cost and total_nights by Age Group:

             total_cost  total_nights
age_group                            
1-24       5.415205e+06     16.116987
25-44      6.026176e+06     10.048653
45-64      1.105093e+07      9.920201
65+        1.721195e+07      9.947883

📌 Number of travelers by Travel Group:

travel_with
Alone                  1265
Spouse                 1005
Friends/Relatives       895
Spouse and Children     368
Children                162
Name: count, dtype: int64
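
The five counts above add up to 3,695, while the dataset has 4,809 rows; the difference is the 1,114 rows with a missing `travel_with`, which `value_counts()` skips by default. A minimal sketch on a made-up Series shows how `dropna=False` keeps that group visible:

```python
import numpy as np
import pandas as pd

# Hypothetical values with three missing entries
travel_with = pd.Series(
    ["Alone", np.nan, "Spouse", "Alone", np.nan, np.nan], name="travel_with"
)

# Default behaviour: NaN rows are left out of the tally
default_counts = travel_with.value_counts()

# dropna=False adds a NaN row, so the counts sum to the number of rows
with_missing = travel_with.value_counts(dropna=False)

print(with_missing)
```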


## 🔽 6. Advanced Filtering & Sorting

We now sort the data to extract specific insights:

- Identify the **top 10 most expensive trips** based on **total_cost**.
- Show each trip's **ID**, **country**, and **main_activity** alongside its cost.


In [11]:
# Get top 10 most expensive trips and their main activities
top_expensive_trips = df.sort_values("total_cost", ascending=False)[
    ["ID", "country", "total_cost", "main_activity"]
]
print("📌 Top 10 Most Expensive Trips:\n")
print(top_expensive_trips.head(10))


📌 Top 10 Most Expensive Trips:

             ID                   country  total_cost       main_activity
3411  tour_5121              SOUTH AFRICA  99532875.0    Wildlife tourism
2826  tour_4440              SOUTH AFRICA  99450000.0    Wildlife tourism
1731  tour_3109                     ITALY  95992659.0    Wildlife tourism
388   tour_1475                    CANADA  94809000.0    Wildlife tourism
1805  tour_3194            UNITED KINGDOM  92645962.5  Conference tourism
1753   tour_314                    CANADA  90085125.0    Wildlife tourism
3984  tour_5838                     CHINA  89505000.0    Wildlife tourism
4085  tour_5954  UNITED STATES OF AMERICA  86190000.0  Conference tourism
236   tour_1296               NETHERLANDS  86190000.0    Wildlife tourism
491   tour_1605                    CANADA  85059156.0    Wildlife tourism
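
For a top-N selection, `DataFrame.nlargest()` is a common alternative to sorting the whole frame. The sketch below compares the two on a small hypothetical frame (IDs and costs are invented); without ties in the sort column, both return the same rows in the same order.

```python
import pandas as pd

# Hypothetical trips with distinct costs
toy = pd.DataFrame({
    "ID": ["trip_a", "trip_b", "trip_c", "trip_d"],
    "total_cost": [300.0, 900.0, 500.0, 100.0],
    "main_activity": ["Beach tourism", "Wildlife tourism",
                      "Cultural tourism", "Wildlife tourism"],
})

# Full sort, then keep the first two rows
top_sorted = toy.sort_values("total_cost", ascending=False).head(2)

# nlargest selects the two highest-cost rows directly
top_nlargest = toy.nlargest(2, "total_cost")

print(top_nlargest)
```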


## 🧹 7. Data Cleaning

In this step, we handle data quality issues to ensure accuracy during analysis and modeling:

- Replace zero values in **total_people** with `NaN` so that dividing by them does not produce infinite values.
- Calculate **cost_per_person** from **total_cost** and the cleaned `total_people`.
- Scale the **total_cost** column to the range 0 to 1 using Min-Max normalization.


In [12]:
import numpy as np

# Replace 0 in total_people with NaN to avoid division by zero
df["total_people"] = df["total_people"].replace(0, np.nan)

# Calculate cost per person (NaN where total_people is missing)
df["cost_per_person"] = df["total_cost"] / df["total_people"]

# Min-Max Normalization of total_cost
min_cost = df["total_cost"].min()
max_cost = df["total_cost"].max()

df["total_cost_normalized"] = (df["total_cost"] - min_cost) / (max_cost - min_cost)

# Preview the changes
df[["total_cost", "total_people", "cost_per_person", "total_cost_normalized"]].head()


Unnamed: 0,total_cost,total_people,cost_per_person,total_cost_normalized
0,674602.5,2.0,337301.25,0.006288
1,3214906.5,1.0,3214906.5,0.031823
2,3315000.0,1.0,3315000.0,0.032829
3,7790250.0,2.0,3895125.0,0.077814
4,1657500.0,1.0,1657500.0,0.016168
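
Min-Max normalization maps each value x to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. As a sketch, the helper `min_max` below (a hypothetical name, not part of the notebook) wraps the same formula and adds a guard for a constant column, where the denominator would be zero:

```python
import pandas as pd


def min_max(series: pd.Series) -> pd.Series:
    """Scale a numeric Series to [0, 1]; a constant Series maps to all zeros."""
    lo, hi = series.min(), series.max()
    if hi == lo:
        # Assumption: with no spread, every value is mapped to 0.0
        return pd.Series(0.0, index=series.index, name=series.name)
    return (series - lo) / (hi - lo)


# Hypothetical costs, chosen so the result is easy to check by hand
costs = pd.Series([100.0, 150.0, 300.0], name="total_cost")
scaled = min_max(costs)
print(scaled)
```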


## 🔢 8. Encoding for Modeling (Manual)

To prepare the dataset for machine learning models, we manually encode categorical variables:

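- Convert **payment_mode** and **travel_with** into integer codes using `pd.factorize()`; missing values receive the code `-1`.
- Create a binary column **has_package**: `1` if any of three package columns (`package_sightseeing`, `package_guided_tour`, `package_insurance`) is "Yes", otherwise `0`. The remaining package columns (transport, accommodation, food) are not part of this flag.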


In [13]:
# Encode payment_mode and travel_with using pd.factorize()
df["payment_mode_encoded"], _ = pd.factorize(df["payment_mode"])
df["travel_with_encoded"], _ = pd.factorize(df["travel_with"])

# Create a binary flag 'has_package' if any package column is 'Yes'
package_cols = ["package_sightseeing", "package_guided_tour", "package_insurance"]
df["has_package"] = df[package_cols].eq("Yes").any(axis=1).astype(int)

# Preview encoded columns
df[["payment_mode", "payment_mode_encoded", "travel_with", "travel_with_encoded", "has_package"]].head()


Unnamed: 0,payment_mode,payment_mode_encoded,travel_with,travel_with_encoded,has_package
0,Cash,0,Friends/Relatives,0,0
1,Cash,0,,-1,0
2,Cash,0,Alone,1,0
3,Cash,0,Spouse,2,1
4,Cash,0,,-1,0
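
`pd.factorize()` returns two things: integer codes, numbered in order of first appearance, and the array of unique labels that the cell above discards as `_`. Keeping the labels makes it possible to map codes back to categories. A small sketch using the first five `travel_with` values shown above:

```python
import numpy as np
import pandas as pd

# The first five travel_with values from the preview above
travel_with = pd.Series(["Friends/Relatives", np.nan, "Alone", "Spouse", np.nan])

# codes: one integer per row; uniques: the labels, positioned by code
codes, uniques = pd.factorize(travel_with)

# Build a code -> label lookup; missing values carry the sentinel code -1,
# which has no entry in the lookup and therefore decodes to None
code_to_label = dict(enumerate(uniques))
decoded = [code_to_label.get(c) for c in codes]

print(codes)
print(code_to_label)
```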
