## **Exploratory Data Analysis of the Tourism Sector in Tanzania**

### **1. Understand the Problem Statement**

Tourism is a vital pillar of Tanzania’s economy, contributing significantly to foreign exchange earnings, employment, and community development. Renowned globally for attractions such as Serengeti National Park, Mount Kilimanjaro, and Zanzibar beaches, the country draws millions of visitors annually.

However, there is limited data-driven understanding of visitor demographics, seasonal patterns, and revenue drivers. This lack of insight hinders informed decision-making for policymakers, tourism boards, and stakeholders, resulting in inefficiencies in marketing strategies, resource allocation, and sustainable sector planning.

To address this, there is a need to analyze historical tourism data to uncover patterns in tourist arrivals, spending, preferences, and peak seasons. Furthermore, a predictive model can be developed to classify and forecast tourist flow levels (e.g., "High", "Medium", or "Low") and segment visitors based on key attributes like source market or package preferences.

By combining Exploratory Data Analysis (EDA) with predictive modeling, this project aims to provide actionable insights and build a foundation for data-driven decision-making in Tanzania’s tourism sector.



##### Import libraries.

In [65]:
import pandas as pd
import matplotlib.pyplot as pyplot
import numpy as np

##### Load Dataset

In [66]:
data = pd.read_csv('./datasets/Train.csv')

In [67]:
#explore data
data.head()

Unnamed: 0,ID,country,age_group,travel_with,total_female,total_male,purpose,main_activity,info_source,tour_arrangement,...,package_transport_tz,package_sightseeing,package_guided_tour,package_insurance,night_mainland,night_zanzibar,payment_mode,first_trip_tz,most_impressing,total_cost
0,tour_0,SWIZERLAND,45-64,Friends/Relatives,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Friends, relatives",Independent,...,No,No,No,No,13.0,0.0,Cash,No,Friendly People,674602.5
1,tour_10,UNITED KINGDOM,25-44,,1.0,0.0,Leisure and Holidays,Cultural tourism,others,Independent,...,No,No,No,No,14.0,7.0,Cash,Yes,"Wonderful Country, Landscape, Nature",3214906.5
2,tour_1000,UNITED KINGDOM,25-44,Alone,0.0,1.0,Visiting Friends and Relatives,Cultural tourism,"Friends, relatives",Independent,...,No,No,No,No,1.0,31.0,Cash,No,Excellent Experience,3315000.0
3,tour_1002,UNITED KINGDOM,25-44,Spouse,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Package Tour,...,Yes,Yes,Yes,No,11.0,0.0,Cash,Yes,Friendly People,7790250.0
4,tour_1004,CHINA,1-24,,1.0,0.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Independent,...,No,No,No,No,7.0,4.0,Cash,Yes,No comments,1657500.0


### **2. Describe Data Structure**

**Note:** Open the `VariableDefinitions.csv` file to understand the meaning of each variable in this dataset.

In [68]:
variable_dfn = pd.read_csv('./datasets/VariableDefinitions.csv')
display(variable_dfn)

Unnamed: 0,Column Name,Definition
0,id,Unique identifier for each tourist
1,country,The country a tourist coming from.
2,age_group,The age group of a tourist.
3,travel_with,The relation of people a tourist travel with t...
4,total_female,Total number of females
5,total_male,Total number of males
6,purpose,The purpose of visiting Tanzania
7,main_activity,The main activity of tourism in Tanzania
8,infor_source,The source of information about tourism in Tan...
9,tour_arrangment,The arrangment of visiting Tanzania


In [69]:
#data shape
print(f"Data Shape: {data.shape}")

Data Shape: (4809, 23)
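
Beyond the shape, a per-column summary of data type, non-null count, and number of distinct values gives a quick picture of the structure. The following is a minimal sketch under stated assumptions: `sample` is a hypothetical three-row stand-in for `data`, and `summarize_structure` is an illustrative helper, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in rows; in the notebook, `data` is loaded from Train.csv.
sample = pd.DataFrame({
    "country": ["CHINA", "UAE", "ITALY"],
    "travel_with": ["Alone", None, "Spouse"],
    "total_cost": [1657500.0, 3315000.0, 828750.0],
})

def summarize_structure(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, non-null count and number of unique values per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notnull().sum(),
        "n_unique": df.nunique(),  # missing values are not counted as a category
    })

structure = summarize_structure(sample)
print(structure)
```

Calling `summarize_structure(data)` in the notebook should produce one row per column of the real dataset.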


In [70]:
# show the last five rows
data.tail()

Unnamed: 0,ID,country,age_group,travel_with,total_female,total_male,purpose,main_activity,info_source,tour_arrangement,...,package_transport_tz,package_sightseeing,package_guided_tour,package_insurance,night_mainland,night_zanzibar,payment_mode,first_trip_tz,most_impressing,total_cost
4804,tour_993,UAE,45-64,Alone,0.0,1.0,Business,Hunting tourism,"Friends, relatives",Independent,...,No,No,No,No,2.0,0.0,Credit Card,No,No comments,3315000.0
4805,tour_994,UNITED STATES OF AMERICA,25-44,Spouse,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Package Tour,...,Yes,Yes,Yes,Yes,11.0,0.0,Cash,Yes,Friendly People,10690875.0
4806,tour_995,NETHERLANDS,1-24,,1.0,0.0,Leisure and Holidays,Wildlife tourism,others,Independent,...,No,No,No,No,3.0,7.0,Cash,Yes,Good service,2246636.7
4807,tour_997,SOUTH AFRICA,25-44,Friends/Relatives,1.0,1.0,Business,Beach tourism,"Travel, agent, tour operator",Independent,...,No,No,No,No,5.0,0.0,Credit Card,No,Friendly People,1160250.0
4808,tour_999,UNITED KINGDOM,25-44,Spouse,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Package Tour,...,Yes,Yes,Yes,No,4.0,7.0,Cash,Yes,Friendly People,13260000.0


In [71]:
# Summary statistics for the numeric columns.
# Note the parentheses: without them, only the bound method is returned.
data.describe()

### **3. Exploratory Data Analysis**

This is the process of extracting insights from your dataset before creating predictive models.
**Note:** This is an important step in your data science workflow.

In [72]:
# Check for missing values
print('missing values:', data.isnull().sum())

missing values: ID                          0
country                     0
age_group                   0
travel_with              1114
total_female                3
total_male                  5
purpose                     0
main_activity               0
info_source                 0
tour_arrangement            0
package_transport_int       0
package_accomodation        0
package_food                0
package_transport_tz        0
package_sightseeing         0
package_guided_tour         0
package_insurance           0
night_mainland              0
night_zanzibar              0
payment_mode                0
first_trip_tz               0
most_impressing           313
total_cost                  0
dtype: int64
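
Raw counts are easier to judge as a share of all rows; for example, 1114 missing `travel_with` values out of 4809 rows is roughly 23%. The sketch below shows one way to report this, assuming a hypothetical four-row `sample` in place of `data`; `missing_report` is an illustrative helper name.

```python
import pandas as pd

# Hypothetical stand-in for `data`, with a few gaps
sample = pd.DataFrame({
    "travel_with": ["Alone", None, "Spouse", None],
    "total_male": [1.0, 0.0, None, 2.0],
    "purpose": ["Business", "Business", "Other", "Other"],
})

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """List only the columns with missing values, with count and percentage."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

report = missing_report(sample)
print(report)
```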


##### **Note**: Since some columns have missing values, we need to fill them with sensible replacement values.

In [73]:
# Columns that contain missing values (see the counts above)
cols_to_fill = ['travel_with', 'total_female', 'total_male', 'most_impressing']

for col in cols_to_fill:
    if data[col].isnull().sum() > 0:
        if data[col].dtype == 'O':  # object/string type
            # Caution: the same 'Alone' placeholder is used for every text column,
            # so missing most_impressing values are also labelled 'Alone'
            data[col] = data[col].fillna('Alone')
        else:  # numeric: fill with the column median
            data[col] = data[col].fillna(data[col].median())


# Confirm no more nulls in those columns
print(data.isnull().sum())


ID                       0
country                  0
age_group                0
travel_with              0
total_female             0
total_male               0
purpose                  0
main_activity            0
info_source              0
tour_arrangement         0
package_transport_int    0
package_accomodation     0
package_food             0
package_transport_tz     0
package_sightseeing      0
package_guided_tour      0
package_insurance        0
night_mainland           0
night_zanzibar           0
payment_mode             0
first_trip_tz            0
most_impressing          0
total_cost               0
dtype: int64
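
A single placeholder for all text columns can produce odd categories, so one alternative is to choose a fill value per column. This is a sketch under stated assumptions, not the notebook's method: the placeholder `'No comments'` for `most_impressing` is an assumed choice, `sample` is a hypothetical stand-in for `data`, and `impute` is an illustrative helper.

```python
import pandas as pd

# Hypothetical stand-in for `data`
sample = pd.DataFrame({
    "travel_with": ["Spouse", None, "Alone"],
    "most_impressing": ["Friendly People", None, "No comments"],
    "total_female": [1.0, None, 3.0],
})

# Assumed placeholders, one per text column
fill_values = {"travel_with": "Alone", "most_impressing": "No comments"}

def impute(df, categorical_fill, numeric_cols):
    """Fill text columns with per-column placeholders and numeric columns with the median."""
    out = df.fillna(value=categorical_fill)  # returns a new frame; df is untouched
    for col in numeric_cols:
        out[col] = out[col].fillna(out[col].median())
    return out

clean = impute(sample, fill_values, ["total_female"])
print(clean)
```

Note that applying this to the real data would change the category counts reported for `most_impressing` later in the notebook.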


##### Since no missing values remain, we can proceed to data mapping


In [74]:
age_mapping = {
    "1-24" : "Child/Youth",
    "25-44" : "Young Adult",
    "45-64" : "Senior Adult",
    "65+" : "Elder",
}

data['age_group'] = data.age_group.map(age_mapping)
# data.head()
data.age_group.value_counts()

age_group
Young Adult     2487
Senior Adult    1391
Child/Youth      624
Elder            307
Name: count, dtype: int64
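
`Series.map` returns `NaN` for any value that is not a key of the mapping, so it can be worth checking for unmapped labels before overwriting the column. The sketch below uses the same `age_mapping` on a hypothetical series `raw` that contains one deliberately misspelled label, then stores the result as an ordered categorical so that the groups sort by age rather than alphabetically.

```python
import pandas as pd

age_mapping = {
    "1-24": "Child/Youth",
    "25-44": "Young Adult",
    "45-64": "Senior Adult",
    "65+": "Elder",
}

# Hypothetical raw values; the last label is misspelled on purpose
raw = pd.Series(["25-44", "1-24", "65+", "24-44"], name="age_group")

mapped = raw.map(age_mapping)

# Labels that the mapping did not recognise end up as NaN
unmapped = sorted(raw[mapped.isna()].unique())
print("Unmapped labels:", unmapped)

# Keep the natural age order instead of alphabetical order
age_order = list(age_mapping.values())
ordered = pd.Categorical(mapped.dropna(), categories=age_order, ordered=True)
print(ordered.min(), "<", ordered.max())
```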

#### We need to reorganize the data
The package columns are combined into a single column, and the nights spent on the Tanzanian mainland and in Zanzibar are added together, because both are part of one country.

In [75]:
# Fix column name typos
data.rename(columns={
    'infor_source': 'info_source',
    'tour_arrangment': 'tour_arrangement',
    'package_accomodation': 'package_accommodation'
}, inplace=True)

# Combine nights stayed on mainland and Zanzibar into one total nights column
data['nights_stayed'] = data['night_mainland'] + data['night_zanzibar']

# Package columns and mapping
package_mapping = {
    'package_transport_int': 'Transport Package',
    'package_accommodation': 'Accommodation Package',
    'package_food': 'Food Package',
    'package_transport_tz': 'Transportation Package',
    'package_sightseeing': 'Sightseeing Package',
    'package_guided_tour': 'Guided Tour Package',
    'package_insurance': 'Insurance Package'
}
package_cols = list(package_mapping.keys())

# Create combined multi-label column 'package_services'
def get_included_services(row):
    services = [package_mapping[col] for col in package_cols if str(row[col]).strip().lower() == 'yes']
    return ', '.join(services) if services else 'None'

data['package_services'] = data.apply(get_included_services, axis=1)

# ✅ Drop package columns AND nights columns in one go
data.drop(columns=package_cols + ['night_mainland', 'night_zanzibar'], inplace=True)

# Check the result
data.head()


Unnamed: 0,ID,country,age_group,travel_with,total_female,total_male,purpose,main_activity,info_source,tour_arrangement,payment_mode,first_trip_tz,most_impressing,total_cost,nights_stayed,package_services
0,tour_0,SWIZERLAND,Senior Adult,Friends/Relatives,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Friends, relatives",Independent,Cash,No,Friendly People,674602.5,13.0,
1,tour_10,UNITED KINGDOM,Young Adult,Alone,1.0,0.0,Leisure and Holidays,Cultural tourism,others,Independent,Cash,Yes,"Wonderful Country, Landscape, Nature",3214906.5,21.0,
2,tour_1000,UNITED KINGDOM,Young Adult,Alone,0.0,1.0,Visiting Friends and Relatives,Cultural tourism,"Friends, relatives",Independent,Cash,No,Excellent Experience,3315000.0,32.0,
3,tour_1002,UNITED KINGDOM,Young Adult,Spouse,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Package Tour,Cash,Yes,Friendly People,7790250.0,11.0,"Accommodation Package, Food Package, Transport..."
4,tour_1004,CHINA,Child/Youth,Alone,1.0,0.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Independent,Cash,Yes,No comments,1657500.0,11.0,
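
A combined text column is convenient to read, but a numeric count of included services is often easier to plot or model. The following sketch derives such a count with the same "yes" test used in `get_included_services`; it would have to run before the package columns are dropped. The three-column `sample` and the name `n_services` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical subset of the package columns
sample = pd.DataFrame({
    "package_food": ["Yes", "No", " yes "],
    "package_insurance": ["No", "No", "Yes"],
    "package_guided_tour": ["Yes", "No", "No"],
})
package_cols = list(sample.columns)

# Normalise each column (strip spaces, lower-case) and compare with 'yes'
included = sample[package_cols].apply(
    lambda col: col.astype(str).str.strip().str.lower().eq("yes")
)

# Each True counts as 1, so the row sum is the number of included services
sample["n_services"] = included.sum(axis=1)
print(sample)
```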


In [76]:
data.country.value_counts()

country
UNITED STATES OF AMERICA    695
UNITED KINGDOM              533
ITALY                       393
FRANCE                      280
ZIMBABWE                    274
                           ... 
CYPRUS                        1
URUGUAY                       1
MORROCO                       1
BERMUDA                       1
ESTONIA                       1
Name: count, Length: 105, dtype: int64

In [77]:
data.age_group.value_counts()

age_group
Young Adult     2487
Senior Adult    1391
Child/Youth      624
Elder            307
Name: count, dtype: int64

In [78]:
data.travel_with.value_counts()

travel_with
Alone                  2379
Spouse                 1005
Friends/Relatives       895
Spouse and Children     368
Children                162
Name: count, dtype: int64

In [79]:
data.total_female.value_counts()

total_female
1.0     2421
0.0     1669
2.0      463
3.0      144
4.0       46
5.0       25
6.0       15
7.0       10
9.0        4
10.0       4
11.0       3
12.0       3
15.0       1
49.0       1
Name: count, dtype: int64

In [80]:
data.total_male.value_counts()

total_male
1.0     2966
0.0     1137
2.0      478
3.0      139
4.0       46
6.0       17
5.0       15
15.0       2
7.0        2
10.0       2
9.0        2
17.0       1
12.0       1
44.0       1
Name: count, dtype: int64
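
Both counts contain a few very large parties (for example 49 females or 44 males in a single record), which can dominate averages. One way to inspect them is to derive a total group size and list the records above a cut-off. This is a sketch only: `sample` is hypothetical and the threshold of 20 is an arbitrary assumption, not a value from the dataset documentation.

```python
import pandas as pd

# Hypothetical stand-in for the two count columns
sample = pd.DataFrame({
    "total_female": [1.0, 0.0, 49.0, 2.0],
    "total_male": [1.0, 1.0, 0.0, 44.0],
})

# Total number of travellers in each record
sample["group_size"] = sample["total_female"] + sample["total_male"]

threshold = 20  # arbitrary cut-off chosen for illustration
large_groups = sample[sample["group_size"] > threshold]
print(large_groups)
```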

In [81]:
data.purpose.value_counts()

purpose
Leisure and Holidays              2840
Business                           671
Visiting Friends and Relatives     633
Meetings and Conference            312
Volunteering                       138
Other                              128
Scientific and Academic             87
Name: count, dtype: int64

In [82]:
data.main_activity.value_counts()

main_activity
Wildlife tourism            2259
Beach tourism               1025
Hunting tourism              457
Conference tourism           367
Cultural tourism             359
Mountain climbing            234
business                      58
Bird watching                 37
Diving and Sport Fishing      13
Name: count, dtype: int64

In [83]:
data.info_source.value_counts()

info_source
Travel, agent, tour operator      1913
Friends, relatives                1635
others                             490
Newspaper, magazines,brochures     359
Radio, TV, Web                     249
Trade fair                          77
Tanzania Mission Abroad             68
inflight magazines                  18
Name: count, dtype: int64

In [84]:
data.tour_arrangement.value_counts()

tour_arrangement
Independent     2570
Package Tour    2239
Name: count, dtype: int64

In [85]:
data.payment_mode.value_counts()

payment_mode
Cash                 4172
Credit Card           622
Other                   8
Travellers Cheque       7
Name: count, dtype: int64

In [86]:
data.first_trip_tz.value_counts()

first_trip_tz
Yes    3243
No     1566
Name: count, dtype: int64

In [87]:
data.most_impressing.value_counts()

most_impressing
Friendly People                         1541
 Wildlife                               1038
No comments                              743
Wonderful Country, Landscape, Nature     507
Good service                             365
Alone                                    313
Excellent Experience                     271
Satisfies and Hope Come Back              31
Name: count, dtype: int64
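
The ` Wildlife` label above starts with a space, which suggests stray whitespace in the raw text. If other rows hold the same label without the space, they would be counted as a separate category. A minimal sketch of trimming such values, using a hypothetical series `values` rather than the real column:

```python
import pandas as pd

# Hypothetical labels with stray leading/trailing spaces
values = pd.Series([" Wildlife", "Friendly People", "Wildlife ", "No comments"])

# Remove surrounding whitespace only; the wording itself is left untouched
cleaned = values.str.strip()

# Rows whose label changed, useful for a quick audit
changed = values[values != cleaned]
print(changed)
print(cleaned.value_counts())
```

In the notebook this would correspond to `data['most_impressing'] = data['most_impressing'].str.strip()`.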

In [88]:
data.total_cost.value_counts()

total_cost
1657500.0     154
3315000.0     109
828750.0       88
497250.0       76
331500.0       76
             ... 
14983800.0      1
416000.0        1
885038.7        1
1768552.5       1
2657000.0       1
Name: count, Length: 1637, dtype: int64

In [89]:
data.nights_stayed.value_counts()

nights_stayed
7.0      505
2.0      387
3.0      376
14.0     353
4.0      341
        ... 
53.0       1
95.0       1
65.0       1
145.0      1
44.0       1
Name: count, Length: 76, dtype: int64

In [90]:
data.package_services.value_counts()

package_services
None                                                                                                                                           2545
Accommodation Package, Food Package, Transportation Package, Sightseeing Package, Guided Tour Package                                           387
Transport Package, Accommodation Package, Food Package, Transportation Package, Sightseeing Package, Guided Tour Package, Insurance Package     357
Transport Package, Accommodation Package, Food Package, Transportation Package, Sightseeing Package, Guided Tour Package                        287
Transport Package, Accommodation Package, Food Package, Transportation Package                                                                  186
                                                                                                                                               ... 
Accommodation Package, Transportation Package, Sightseeing Package, Guided Tour Package, Insura

In [91]:
data.head()

Unnamed: 0,ID,country,age_group,travel_with,total_female,total_male,purpose,main_activity,info_source,tour_arrangement,payment_mode,first_trip_tz,most_impressing,total_cost,nights_stayed,package_services
0,tour_0,SWIZERLAND,Senior Adult,Friends/Relatives,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Friends, relatives",Independent,Cash,No,Friendly People,674602.5,13.0,
1,tour_10,UNITED KINGDOM,Young Adult,Alone,1.0,0.0,Leisure and Holidays,Cultural tourism,others,Independent,Cash,Yes,"Wonderful Country, Landscape, Nature",3214906.5,21.0,
2,tour_1000,UNITED KINGDOM,Young Adult,Alone,0.0,1.0,Visiting Friends and Relatives,Cultural tourism,"Friends, relatives",Independent,Cash,No,Excellent Experience,3315000.0,32.0,
3,tour_1002,UNITED KINGDOM,Young Adult,Spouse,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Package Tour,Cash,Yes,Friendly People,7790250.0,11.0,"Accommodation Package, Food Package, Transport..."
4,tour_1004,CHINA,Child/Youth,Alone,1.0,0.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Independent,Cash,Yes,No comments,1657500.0,11.0,


### **Note:** Now that the table is ready, we can proceed

#### **Types of EDA**

##### A. Univariate Analysis

In this section we perform univariate analysis, the simplest form of data analysis, in which each variable is examined on its own. For categorical features, a frequency table or bar plot shows how many records fall into each category. For numerical features, a probability density plot shows how the values are distributed.
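
The two cases can be sketched as follows, assuming a hypothetical six-row `sample` in place of `data`. A normalised histogram stands in for the probability density plot here, which is a substitution made to keep the example dependent only on pandas and matplotlib.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for `data`
sample = pd.DataFrame({
    "purpose": ["Leisure and Holidays", "Business", "Leisure and Holidays",
                "Other", "Business", "Leisure and Holidays"],
    "total_cost": [674602.5, 3214906.5, 3315000.0, 7790250.0, 1657500.0, 828750.0],
})

# Categorical feature: frequency table, then a bar plot of it
freq = sample["purpose"].value_counts()
print(freq)

fig, (ax_cat, ax_num) = plt.subplots(1, 2, figsize=(10, 4))
freq.plot(kind="bar", ax=ax_cat, title="purpose (frequency)")
ax_cat.set_ylabel("count")

# Numerical feature: histogram scaled so that the bar areas sum to 1
ax_num.hist(sample["total_cost"], bins=5, density=True)
ax_num.set_title("total_cost (normalised histogram)")
ax_num.set_xlabel("total_cost")

fig.tight_layout()  # in a script, call plt.show() afterwards to display the figure
```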