## **Exploratory Data Analysis for Tourism Sector in Tanzania.**

### **1. Understand the Problem Statement**


Tourism is a vital pillar of Tanzania’s economy, contributing significantly to foreign exchange earnings, employment, and community development. Renowned globally for attractions such as Serengeti National Park, Mount Kilimanjaro, and Zanzibar beaches, the country draws millions of visitors annually.

However, there is limited data-driven understanding of visitor demographics, seasonal patterns, and revenue drivers. This lack of insight hinders informed decision-making for policymakers, tourism boards, and stakeholders, resulting in inefficiencies in marketing strategies, resource allocation, and sustainable sector planning.

To address this, there is a need to analyze historical tourism data to uncover patterns in tourist arrivals, spending, preferences, and peak seasons. Furthermore, a predictive model can be developed to classify and forecast tourist flow levels (e.g., "High", "Medium", or "Low") and segment visitors based on key attributes like source market or package preferences.

By combining Exploratory Data Analysis (EDA) with predictive modeling, this project aims to provide actionable insights and build a foundation for data-driven decision-making in Tanzania’s tourism sector.

##### Import libraries.

In [922]:
import pandas as pd
import matplotlib.pyplot as pyplot
import numpy as np

##### Load Dataset

In [923]:
data = pd.read_csv('./datasets/Train.csv')

In [924]:
#explore data
data.head()

Unnamed: 0,ID,country,age_group,travel_with,total_female,total_male,purpose,main_activity,info_source,tour_arrangement,...,package_transport_tz,package_sightseeing,package_guided_tour,package_insurance,night_mainland,night_zanzibar,payment_mode,first_trip_tz,most_impressing,total_cost
0,tour_0,SWIZERLAND,45-64,Friends/Relatives,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Friends, relatives",Independent,...,No,No,No,No,13.0,0.0,Cash,No,Friendly People,674602.5
1,tour_10,UNITED KINGDOM,25-44,,1.0,0.0,Leisure and Holidays,Cultural tourism,others,Independent,...,No,No,No,No,14.0,7.0,Cash,Yes,"Wonderful Country, Landscape, Nature",3214906.5
2,tour_1000,UNITED KINGDOM,25-44,Alone,0.0,1.0,Visiting Friends and Relatives,Cultural tourism,"Friends, relatives",Independent,...,No,No,No,No,1.0,31.0,Cash,No,Excellent Experience,3315000.0
3,tour_1002,UNITED KINGDOM,25-44,Spouse,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Package Tour,...,Yes,Yes,Yes,No,11.0,0.0,Cash,Yes,Friendly People,7790250.0
4,tour_1004,CHINA,1-24,,1.0,0.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Independent,...,No,No,No,No,7.0,4.0,Cash,Yes,No comments,1657500.0


### **2. Hypothesis Generation**

Hypothesis generation is a critical stage in any data science or machine learning pipeline. It involves thoroughly understanding the problem and brainstorming all possible factors that may influence the outcome. This step is carried out before analyzing the data to form logical assumptions that can later be tested.

Based on the tourism data and context, the following hypotheses are proposed:

* **Age Group**: Young adults are more likely to visit Tanzania compared to other age groups.
* **Travel Companions**: Visitors are more likely to travel alone or with their spouse rather than with children.
* **Purpose of Visit**: Most visitors travel to Tanzania primarily for leisure and holidays rather than business or visiting friends and relatives.
* **Tourism Activities**: Visitors prefer wildlife and beach tourism over cultural or business-related tourism.
* **Information Source**: Tourists are more likely to learn about Tanzania through travel agents, friends, or relatives than through TV, radio, web platforms, magazines, or Tanzanian missions abroad.
* **Tour Arrangement**: Visitors prefer arranging tours independently rather than opting for package tours.
* **Payment Method**: Cash is the most common mode of payment among visitors, compared to credit cards or travelers' cheques.
* **Trip Frequency**: A majority of visitors are on their first trip to Tanzania rather than being repeat travelers.
* **Visitor Impressions**: Visitors are more likely to name the friendliness of Tanzanians as what impressed them most than to cite their overall experience or satisfaction.
* **Pricing Preference**: Most visitors prefer reasonably priced services rather than low-cost or high-cost options.
* **Length of Stay**: The typical length of stay for visitors is up to two weeks.
* **Destination Preference**: Visitors spend more time in Zanzibar than on mainland Tanzania.
* **Package Services**: Most visitors prefer not to use package services.
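
Once the data is loaded, most of these hypotheses can be checked with a simple share or comparison. The following is a minimal sketch on a small made-up frame: the rows are illustrative and not taken from `Train.csv`, and only the column names follow the dataset.

```python
import pandas as pd

# Toy rows for illustration only; these are not real survey records
toy = pd.DataFrame({
    "age_group": ["25-44", "25-44", "45-64", "1-24", "25-44"],
    "night_mainland": [13.0, 14.0, 1.0, 11.0, 7.0],
    "night_zanzibar": [0.0, 7.0, 31.0, 0.0, 4.0],
})

# Age Group hypothesis: share of visitors in each age group
age_share = toy["age_group"].value_counts(normalize=True)

# Length of Stay hypothesis: share of trips lasting at most 14 nights in total
total_nights = toy["night_mainland"] + toy["night_zanzibar"]
share_up_to_two_weeks = (total_nights <= 14).mean()

# Destination Preference hypothesis: average nights spent in each destination
mean_nights = toy[["night_mainland", "night_zanzibar"]].mean()

print(age_share)
print(f"Share staying up to two weeks: {share_up_to_two_weeks:.0%}")
print(mean_nights)
```

The same three expressions can be applied to `data` to test the corresponding hypotheses on the full dataset.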



### **3. Describe Data Structure**

**Note:** Open the `VariableDefinitions.csv` file to understand the meaning of each variable in this dataset.

In [925]:
variable_dfn = pd.read_csv('./datasets/VariableDefinitions.csv')
display(variable_dfn)

Unnamed: 0,Column Name,Definition
0,id,Unique identifier for each tourist
1,country,The country a tourist coming from.
2,age_group,The age group of a tourist.
3,travel_with,The relation of people a tourist travel with t...
4,total_female,Total number of females
5,total_male,Total number of males
6,purpose,The purpose of visiting Tanzania
7,main_activity,The main activity of tourism in Tanzania
8,infor_source,The source of information about tourism in Tan...
9,tour_arrangment,The arrangment of visiting Tanzania


In [926]:
#data shape
print(f"Data Shape: {data.shape}")

Data Shape: (4809, 23)


In [927]:
# show the last five rows
data.tail()

Unnamed: 0,ID,country,age_group,travel_with,total_female,total_male,purpose,main_activity,info_source,tour_arrangement,...,package_transport_tz,package_sightseeing,package_guided_tour,package_insurance,night_mainland,night_zanzibar,payment_mode,first_trip_tz,most_impressing,total_cost
4804,tour_993,UAE,45-64,Alone,0.0,1.0,Business,Hunting tourism,"Friends, relatives",Independent,...,No,No,No,No,2.0,0.0,Credit Card,No,No comments,3315000.0
4805,tour_994,UNITED STATES OF AMERICA,25-44,Spouse,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Package Tour,...,Yes,Yes,Yes,Yes,11.0,0.0,Cash,Yes,Friendly People,10690875.0
4806,tour_995,NETHERLANDS,1-24,,1.0,0.0,Leisure and Holidays,Wildlife tourism,others,Independent,...,No,No,No,No,3.0,7.0,Cash,Yes,Good service,2246636.7
4807,tour_997,SOUTH AFRICA,25-44,Friends/Relatives,1.0,1.0,Business,Beach tourism,"Travel, agent, tour operator",Independent,...,No,No,No,No,5.0,0.0,Credit Card,No,Friendly People,1160250.0
4808,tour_999,UNITED KINGDOM,25-44,Spouse,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Package Tour,...,Yes,Yes,Yes,No,4.0,7.0,Cash,Yes,Friendly People,13260000.0


In [928]:
# describe must be called with parentheses; without them only the bound method is echoed
data.describe()
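
As a point of reference, calling `describe()` on a DataFrame summarises its numeric columns, while calling it on a text column reports the count, the number of unique values, the most frequent value and its frequency. A small sketch on made-up rows (the values are illustrative only):

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    "country": ["CHINA", "UAE", "CHINA", "NETHERLANDS"],
    "total_male": [1.0, 1.0, 0.0, 2.0],
    "total_cost": [1000.0, 3000.0, 2000.0, 4000.0],
})

# Numeric columns: count, mean, std, min, quartiles, max
numeric_summary = toy.describe()

# A single categorical column: count, unique, top, freq
country_summary = toy["country"].describe()

print(numeric_summary)
print(country_summary)
```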

### **4. Exploratory Data Analysis**


This is the process of drawing insights from your dataset before building predictive models.
**Note:** This is an important step in your data science workflow.

In [929]:
# Check for missing values
print('missing values:', data.isnull().sum())

missing values: ID                          0
country                     0
age_group                   0
travel_with              1114
total_female                3
total_male                  5
purpose                     0
main_activity               0
info_source                 0
tour_arrangement            0
package_transport_int       0
package_accomodation        0
package_food                0
package_transport_tz        0
package_sightseeing         0
package_guided_tour         0
package_insurance           0
night_mainland              0
night_zanzibar              0
payment_mode                0
first_trip_tz               0
most_impressing           313
total_cost                  0
dtype: int64


##### **Note**: Since some columns have missing values, we fill them in (impute them) before continuing.

In [930]:
# Fill values for the categorical columns that contain missing data
# Assumption: 'No comments' (an existing category) stands in for a missing impression
categorical_fill = {'travel_with': 'Alone', 'most_impressing': 'No comments'}

for col in ['travel_with', 'total_female', 'total_male', 'most_impressing']:
    if col in categorical_fill:  # categorical column
        data[col] = data[col].fillna(categorical_fill[col])
    else:  # numeric column: use the median
        data[col] = data[col].fillna(data[col].median())


# Confirm no more nulls in those columns
print(data.isnull().sum())


ID                       0
country                  0
age_group                0
travel_with              0
total_female             0
total_male               0
purpose                  0
main_activity            0
info_source              0
tour_arrangement         0
package_transport_int    0
package_accomodation     0
package_food             0
package_transport_tz     0
package_sightseeing      0
package_guided_tour      0
package_insurance        0
night_mainland           0
night_zanzibar           0
payment_mode             0
first_trip_tz            0
most_impressing          0
total_cost               0
dtype: int64
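
The imputation idea can be tried on a toy frame before touching the real data. The sketch below fills a categorical column with its most frequent value (the mode) and a numeric column with its median. Using the mode is an assumption made here for illustration and is not necessarily the choice made in this notebook.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    "travel_with": ["Alone", np.nan, "Spouse", "Alone"],
    "total_male": [1.0, np.nan, 0.0, 2.0],
})

# Categorical column: fill with the most frequent category
mode_value = toy["travel_with"].mode()[0]
toy["travel_with"] = toy["travel_with"].fillna(mode_value)

# Numeric column: fill with the median, which is less sensitive to outliers than the mean
median_value = toy["total_male"].median()
toy["total_male"] = toy["total_male"].fillna(median_value)

# Total number of missing cells left in the frame
remaining_nulls = int(toy.isnull().sum().sum())
print(toy)
print("remaining nulls:", remaining_nulls)
```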


##### Since there are no missing values left, we can proceed to data mapping.


In [931]:
age_mapping = {
    "1-24" : "Child/Youth",
    "25-44" : "Young Adult",
    "45-64" : "Senior Adult",
    "65+" : "Elder",
}

data['age_group'] = data.age_group.map(age_mapping)
# data.head()
data.age_group.value_counts()

age_group
Young Adult     2487
Senior Adult    1391
Child/Youth      624
Elder            307
Name: count, dtype: int64
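
`Series.map` with a dictionary returns `NaN` for any value that is not a key of the dictionary, so it is worth checking that the mapping covers every category. A small sketch on made-up values (the `"unknown"` label is hypothetical and is only there to trigger the case):

```python
import pandas as pd

age_mapping = {
    "1-24": "Child/Youth",
    "25-44": "Young Adult",
    "45-64": "Senior Adult",
    "65+": "Elder",
}

# Toy values; "unknown" is deliberately absent from the mapping
ages = pd.Series(["25-44", "65+", "unknown"])
mapped = ages.map(age_mapping)

# Original labels that the mapping did not cover
unmapped = ages[mapped.isna()].unique().tolist()
print(unmapped)
```

Re-running the mapping cell on already-mapped data would hit the same case, because the new labels are not keys of `age_mapping`.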

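#### We need to organize the data
The seven package columns are combined into a single `package_services` column. The nights spent on the mainland and in Zanzibar could also be summed into one total, since both are part of the same country, but that step is left commented out below, so both night columns are kept.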

In [932]:
# Fix column name typos
data.rename(columns={
    'infor_source': 'info_source',
    'tour_arrangment': 'tour_arrangement',
    'package_accomodation': 'package_accommodation'
}, inplace=True)

# Combine nights stayed on mainland and Zanzibar into one total nights column
# data['nights_stayed'] = data['night_mainland'] + data['night_zanzibar']

# Package columns and mapping
package_mapping = {
    'package_transport_int': 'Transport Package',
    'package_accommodation': 'Accommodation Package',
    'package_food': 'Food Package',
    'package_transport_tz': 'Transportation Package',
    'package_sightseeing': 'Sightseeing Package',
    'package_guided_tour': 'Guided Tour Package',
    'package_insurance': 'Insurance Package'
}

package_cols = list(package_mapping.keys())

# Create combined multi-label column 'package_services'
def get_included_services(row):
    services = [package_mapping[col] for col in package_cols if str(row[col]).strip().lower() == 'yes']
    return ', '.join(services) if services else 'None'

data['package_services'] = data.apply(get_included_services, axis=1)

# Drop the original package columns now that they are combined
data.drop(columns=package_cols, inplace=True)

# Check the result
data.head()


Unnamed: 0,ID,country,age_group,travel_with,total_female,total_male,purpose,main_activity,info_source,tour_arrangement,night_mainland,night_zanzibar,payment_mode,first_trip_tz,most_impressing,total_cost,package_services
0,tour_0,SWIZERLAND,Senior Adult,Friends/Relatives,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Friends, relatives",Independent,13.0,0.0,Cash,No,Friendly People,674602.5,
1,tour_10,UNITED KINGDOM,Young Adult,Alone,1.0,0.0,Leisure and Holidays,Cultural tourism,others,Independent,14.0,7.0,Cash,Yes,"Wonderful Country, Landscape, Nature",3214906.5,
2,tour_1000,UNITED KINGDOM,Young Adult,Alone,0.0,1.0,Visiting Friends and Relatives,Cultural tourism,"Friends, relatives",Independent,1.0,31.0,Cash,No,Excellent Experience,3315000.0,
3,tour_1002,UNITED KINGDOM,Young Adult,Spouse,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Package Tour,11.0,0.0,Cash,Yes,Friendly People,7790250.0,"Accommodation Package, Food Package, Transport..."
4,tour_1004,CHINA,Child/Youth,Alone,1.0,0.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Independent,7.0,4.0,Cash,Yes,No comments,1657500.0,
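
If the individual services are needed again later (for example as model features), the combined column can be expanded back into indicator columns. A sketch using `Series.str.get_dummies` on made-up rows, assuming the same `', '` separator and `'None'` placeholder as above:

```python
import pandas as pd

# Toy values for illustration only
services = pd.Series([
    "None",
    "Accommodation Package, Food Package",
    "Food Package, Insurance Package",
])

# One 0/1 column per distinct service label
indicators = services.str.get_dummies(sep=", ")
print(indicators)
```

Rows without any service end up with a 1 in the `None` column, which can be dropped if it is not needed.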


In [933]:
# data.country.value_counts()

In [934]:
# data.age_group.value_counts()

In [935]:
# data.travel_with.value_counts()

In [936]:
# data.total_female.value_counts()

In [937]:
# data.total_male.value_counts()

In [938]:
# data.purpose.value_counts()

In [939]:
# data.main_activity.value_counts()

In [940]:
# data.info_source.value_counts()

In [941]:
# data.tour_arrangement.value_counts()

In [942]:
# data.payment_mode.value_counts()

In [943]:
# data.first_trip_tz.value_counts()

In [944]:
# data.most_impressing.value_counts()

In [945]:
# data.total_cost.value_counts()

In [946]:
# data.night_mainland.value_counts()

In [947]:
# data.night_zanzibar.value_counts()

In [948]:
# data.package_services.value_counts()

In [949]:
data.head()

Unnamed: 0,ID,country,age_group,travel_with,total_female,total_male,purpose,main_activity,info_source,tour_arrangement,night_mainland,night_zanzibar,payment_mode,first_trip_tz,most_impressing,total_cost,package_services
0,tour_0,SWIZERLAND,Senior Adult,Friends/Relatives,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Friends, relatives",Independent,13.0,0.0,Cash,No,Friendly People,674602.5,
1,tour_10,UNITED KINGDOM,Young Adult,Alone,1.0,0.0,Leisure and Holidays,Cultural tourism,others,Independent,14.0,7.0,Cash,Yes,"Wonderful Country, Landscape, Nature",3214906.5,
2,tour_1000,UNITED KINGDOM,Young Adult,Alone,0.0,1.0,Visiting Friends and Relatives,Cultural tourism,"Friends, relatives",Independent,1.0,31.0,Cash,No,Excellent Experience,3315000.0,
3,tour_1002,UNITED KINGDOM,Young Adult,Spouse,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Package Tour,11.0,0.0,Cash,Yes,Friendly People,7790250.0,"Accommodation Package, Food Package, Transport..."
4,tour_1004,CHINA,Child/Youth,Alone,1.0,0.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Independent,7.0,4.0,Cash,Yes,No comments,1657500.0,


#### **Types of EDA**

##### A. Univariate Analysis

In this section, we carry out univariate analysis, the simplest form of data analysis, in which each variable is examined on its own. For categorical features, a frequency table or bar plot shows how many observations fall in each category of a variable. For numerical features, probability density plots show the distribution of the variable.
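
A minimal sketch of both ideas on made-up rows (the values are illustrative, not taken from the dataset): `value_counts` builds the frequency table for a categorical feature, and `numpy.histogram` gives binned counts as a plain-number stand-in for a density plot.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    "purpose": [
        "Leisure and Holidays",
        "Business",
        "Leisure and Holidays",
        "Leisure and Holidays",
    ],
    "total_cost": [500_000.0, 1_500_000.0, 2_000_000.0, 8_000_000.0],
})

# Categorical feature: number of rows in each category
purpose_counts = toy["purpose"].value_counts()

# Numerical feature: counts in three equal-width bins over the observed range
counts, bin_edges = np.histogram(toy["total_cost"], bins=3)

print(purpose_counts)
print(counts, bin_edges)
```

Calling `purpose_counts.plot(kind="bar")` would draw the corresponding bar plot. Most of the mass in the first bin with a single value in the last one is the kind of right skew that a density plot makes visible.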