## **Exploratory Data Analysis for the Tourism Sector in Tanzania**

### **1. Understand the Problem Statement**


Tourism is a vital pillar of Tanzania’s economy, contributing significantly to foreign exchange earnings, employment, and community development. Renowned globally for attractions such as Serengeti National Park, Mount Kilimanjaro, and Zanzibar beaches, the country draws millions of visitors annually.

However, there is limited data-driven understanding of visitor demographics, seasonal patterns, and revenue drivers. This lack of insight hinders informed decision-making for policymakers, tourism boards, and stakeholders, resulting in inefficiencies in marketing strategies, resource allocation, and sustainable sector planning.

To address this, there is a need to analyze historical tourism data to uncover patterns in tourist arrivals, spending, preferences, and peak seasons. Furthermore, a predictive model can be developed to classify and forecast tourist flow levels (e.g., "High", "Medium", or "Low") and segment visitors based on key attributes like source market or package preferences.

By combining Exploratory Data Analysis (EDA) with predictive modeling, this project aims to provide actionable insights and build a foundation for data-driven decision-making in Tanzania’s tourism sector.

##### Import libraries.

In [112]:
# import important modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["axes.labelsize"] = 18
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline


##### Load Dataset

In [113]:
data = pd.read_csv('./datasets/Train.csv')

In [None]:
#explore data
data.head()

### **2. Hypothesis Generation**

Hypothesis generation is a critical stage in any data science or machine learning pipeline. It involves thoroughly understanding the problem and brainstorming all possible factors that may influence the outcome. This step is carried out before analyzing the data to form logical assumptions that can later be tested.

Based on the tourism data and context, the following hypotheses are proposed:

* **Age Group**: Young adults are more likely to visit Tanzania compared to other age groups.
* **Travel Companions**: Visitors are more likely to travel alone or with their spouse rather than with children.
* **Purpose of Visit**: Most visitors travel to Tanzania primarily for leisure and holidays rather than business or visiting friends and relatives.
* **Tourism Activities**: Visitors prefer wildlife and beach tourism over cultural or business-related tourism.
* **Information Source**: Tourists are more likely to learn about Tanzania through travel agents, friends, or relatives than through TV, radio, web platforms, magazines, or Tanzanian missions abroad.
* **Tour Arrangement**: Visitors prefer arranging tours independently rather than opting for package tours.
* **Payment Method**: Cash is the most common mode of payment among visitors, compared to credit cards or travelers' cheques.
* **Trip Frequency**: A majority of visitors are on their first trip to Tanzania rather than being repeat travelers.
* **Visitor Impressions**: Visitors are more likely to appreciate the friendliness of Tanzanians compared to giving feedback on experiences or overall satisfaction.
* **Pricing Preference**: Most visitors prefer reasonably priced services rather than low-cost or high-cost options.
* **Length of Stay**: The typical length of stay for visitors is up to two weeks.
* **Destination Preference**: Visitors spend more time in Zanzibar than on mainland Tanzania.
* **Package Services**: Most visitors prefer not to use package services.
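
Each hypothesis above can be phrased as a share that is checked once the data is loaded. The following is a minimal sketch for the length-of-stay hypothesis. It uses a small made-up frame rather than the real `Train.csv`, and the helper name `share_within` is hypothetical.

```python
import pandas as pd

# Toy rows for illustration only; these are not values from Train.csv.
toy = pd.DataFrame({
    "night_mainland": [3, 10, 0, 20, 7],
    "night_zanzibar": [4, 0, 5, 15, 2],
})

def share_within(frame, max_nights=14):
    """Fraction of rows whose total nights is at most max_nights."""
    total_nights = frame["night_mainland"] + frame["night_zanzibar"]
    return float((total_nights <= max_nights).mean())

stay_share = share_within(toy)
print(f"Share staying up to two weeks: {stay_share:.0%}")
```

A share well above one half would support the hypothesis; the threshold for calling it "typical" is a judgment call.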



### **3. Describe Data Structure**

**Note:** Open the VariableDefinitions.csv file to understand the meaning of each variable in this dataset.

In [None]:
variable_dfn = pd.read_csv('./datasets/VariableDefinitions.csv')
display(variable_dfn)

In [None]:
#data shape
print(f"Data Shape: {data.shape}")

In [None]:
# show the last five rows
data.tail()

In [None]:
# summary statistics for the numeric columns (note the parentheses: describe is a method)
data.describe()

### **4. Exploratory Data Analysis**


This is the process of finding insights in your dataset before creating predictive models.
**Note:** This is an important step in your data science workflow.

In [None]:
# Check for missing values
print('missing values:', data.isnull().sum())

##### **Note**: Since some columns have missing values, we need to fill them in before continuing.
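
Raw counts of missing values are easier to judge as a share of all rows. Below is a minimal sketch on made-up data (not the real dataset) that keeps only the affected columns and sorts them by percentage.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; values are invented.
toy = pd.DataFrame({
    "travel_with": ["Alone", None, "Spouse", None],
    "total_female": [1.0, np.nan, 0.0, 2.0],
    "purpose": ["Business", "Leisure", "Business", "Business"],
})

missing = toy.isnull().sum()                      # count per column
missing_pct = (missing / len(toy) * 100).round(1)  # share of rows, in percent

summary = pd.DataFrame({"missing": missing, "percent": missing_pct})
summary = summary[summary["missing"] > 0].sort_values("percent", ascending=False)
print(summary)
```

Columns with only a few percent missing are usually safe to impute; a column that is mostly empty may be better dropped.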

In [None]:
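# Columns known to contain missing values
cols_with_missing = ['travel_with', 'total_female', 'total_male', 'most_impressing']

# Missing values count per column, before imputation
print(data[cols_with_missing].isnull().sum())

for col in cols_with_missing:
    if data[col].isnull().sum() == 0:
        continue
    if pd.api.types.is_numeric_dtype(data[col]):
        # Numeric columns: fill with the median
        data[col] = data[col].fillna(data[col].median())
    elif col == 'travel_with':
        # Assumption: a missing companion entry means the visitor travelled alone
        data[col] = data[col].fillna('Alone')
    else:
        # 'Alone' is not a valid impression, so fall back to the most frequent value
        data[col] = data[col].fillna(data[col].mode()[0])

# Confirm no more nulls in those columns
print(data.isnull().sum())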


##### Since no missing values remain, we can proceed to data mapping


In [None]:
age_mapping = {
    "1-24" : "Child or Youth",
    "25-44" : "Young Adult",
    "45-64" : "Senior Adult",
    "65+" : "Elder",
}

data['age_group'] = data.age_group.map(age_mapping)
# data.head()
data.age_group.value_counts()

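#### Organizing the data
The package columns are combined into a single column. The nights spent on mainland Tanzania and in Zanzibar could likewise be summed into one total, since both are part of the same country, but that step is left commented out below and the two columns are kept separate.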

In [None]:
# Fix column name typos
data.rename(columns={
    'infor_source': 'info_source',
    'tour_arrangment': 'tour_arrangement',
    'package_accomodation': 'package_accommodation'
}, inplace=True)

# Combine nights stayed on mainland and Zanzibar into one total nights column
# data['nights_stayed'] = data['night_mainland'] + data['night_zanzibar']

# Package columns and mapping
package_mapping = {
    'package_transport_int': 'Transport Package',
    'package_accommodation': 'Accommodation Package',
    'package_food': 'Food Package',
    'package_transport_tz': 'Transportation Package',
    'package_sightseeing': 'Sightseeing Package',
    'package_guided_tour': 'Guided Tour Package',
    'package_insurance': 'Insurance Package'
}

package_cols = list(package_mapping.keys())

# Create combined multi-label column 'package_services'
def get_included_services(row):
    services = [package_mapping[col] for col in package_cols if str(row[col]).strip().lower() == 'yes']
    return ', '.join(services) if services else 'None'

data['package_services'] = data.apply(get_included_services, axis=1)

# Drop the individual package columns now that they are combined
data.drop(columns=package_cols, inplace=True)

# Convert count, night and cost columns to integers to remove decimals
int_cols = ['total_male', 'total_female', 'night_mainland', 'night_zanzibar', 'total_cost']
data[int_cols] = data[int_cols].astype(int)


# Check the result
data.head(30)


In [None]:
data.head()

#### **Types of EDA**

##### A. Univariate Analysis

In this section, we will do univariate analysis. It is the simplest form of analyzing data where we examine each variable individually. For categorical features we can use frequency table or bar plots which will calculate the number of each category in a particular variable. For numerical features, probability density plots can be used to look at the distribution of the variable.
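
Before plotting, the same information is available as numbers. The sketch below builds a frequency table for one categorical column and a summary for one numeric column; the rows are invented for illustration and only the column names follow the dataset.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "purpose": ["Leisure", "Business", "Leisure", "Leisure", "Business"],
    "total_cost": [100, 250, 400, 150, 600],
})

# Categorical feature: absolute and relative frequencies
freq = toy["purpose"].value_counts()
share = toy["purpose"].value_counts(normalize=True)

# Numerical feature: count, mean, spread and quartiles
cost_summary = toy["total_cost"].describe()

print(freq)
print(share.round(2))
print(cost_summary)
```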

In [None]:
# Assuming your DataFrame is named data

# 1. Make sure country names are uppercase for consistent mapping
data['country'] = data['country'].str.upper()

# 2. Complete region mapping dictionary (based on your 105 countries)
region_map = {
    # Africa
    'KENYA': 'Africa',
    'UGANDA': 'Africa',
    'ZIMBABWE': 'Africa',
    'SOUTH AFRICA': 'Africa',
    'ZAMBIA': 'Africa',
    'BURUNDI': 'Africa',
    'RWANDA': 'Africa',
    'DRC': 'Africa',
    'MALAWI': 'Africa',
    'MOZAMBIQUE': 'Africa',
    'ETHIOPIA': 'Africa',
    'SUDAN': 'Africa',
    'SWAZILAND': 'Africa',
    'DJIBOUT': 'Africa',
    'ALGERIA': 'Africa',
    'GHANA': 'Africa',
    'MAURITIUS': 'Africa',
    'NIGERIA': 'Africa',
    'NAMIBIA': 'Africa',
    'ANGOLA': 'Africa',
    'COMORO': 'Africa',
    'CAPE VERDE': 'Africa',
    'LESOTHO': 'Africa',
    'MADAGASCAR': 'Africa',
    'IVORY COAST': 'Africa',
    'MORROCO': 'Africa',
    'TUNISIA': 'Africa',

    # Europe
    'UNITED KINGDOM': 'Europe',
    'ITALY': 'Europe',
    'FRANCE': 'Europe',
    'GERMANY': 'Europe',
    'SPAIN': 'Europe',
    'NETHERLANDS': 'Europe',
    'SWEDEN': 'Europe',
    'BELGIUM': 'Europe',
    'DENMARK': 'Europe',
    'NORWAY': 'Europe',
    'AUSTRIA': 'Europe',
    'POLAND': 'Europe',
    'CZECH REPUBLIC': 'Europe',
    'PORTUGAL': 'Europe',
    'FINLAND': 'Europe',
    'GREECE': 'Europe',
    'SERBIA': 'Europe',
    'LITHUANIA': 'Europe',
    'SLOVAKIA': 'Europe',
    'ROMANIA': 'Europe',
    'HUNGARY': 'Europe',
    'LATVIA': 'Europe',
    'LUXEMBOURG': 'Europe',
    'SLOVENIA': 'Europe',
    'MONTENEGRO': 'Europe',
    'CROATIA': 'Europe',
    'ESTONIA': 'Europe',
    'CYPRUS': 'Europe',
    'MALT': 'Europe',
    'SCOTLAND': 'Europe',
    'BURGARIA': 'Europe',

    # Asia
    'INDIA': 'Asia',
    'CHINA': 'Asia',
    'JAPAN': 'Asia',
    'MALAYSIA': 'Asia',
    'ISRAEL': 'Asia',
    'KOREA': 'Asia',
    'TAIWAN': 'Asia',
    'PAKISTAN': 'Asia',
    'SINGAPORE': 'Asia',
    'SRI LANKA': 'Asia',
    'INDONESIA': 'Asia',
    'NEPAL': 'Asia',
    'MYANMAR': 'Asia',
    'PHILIPINES': 'Asia',
    # IRAN, IRAQ, YEMEN, LEBANON, KUWAIT, QATAR, UNITED ARAB EMIRATES and OMAN
    # are listed once, under Middle East below (duplicate dict keys would be overwritten)

    # North America
    'UNITED STATES OF AMERICA': 'North America',
    'CANADA': 'North America',
    'BERMUDA': 'North America',
    'MEXICO': 'North America',
    'COSTARICA': 'North America',
    'DOMINICA': 'North America',
    'TRINIDAD TOBACCO': 'North America',

    # South America
    'BRAZIL': 'South America',
    'ARGENTINA': 'South America',
    'CHILE': 'South America',
    'COLOMBIA': 'South America',
    'URUGUAY': 'South America',

    # Oceania
    'AUSTRALIA': 'Oceania',
    'NEW ZEALAND': 'Oceania',

    # Middle East
    'UAE': 'Middle East',
    'UNITED ARAB EMIRATES': 'Middle East',
    'QATAR': 'Middle East',
    'KUWAIT': 'Middle East',
    'OMAN': 'Middle East',
    'YEMEN': 'Middle East',
    'LEBANON': 'Middle East',
    'IRAN': 'Middle East',
    'IRAQ': 'Middle East',
}

# 3. Map the countries to their respective regions
data['region'] = data['country'].map(region_map)

# 4. Fill any missing regions with 'Other'
data['region'] = data['region'].fillna('Other')

# 5. Plot visitor counts per region
sns.catplot(
    x='region',
    kind='count',
    data=data,
    height=5,       # height in inches
    aspect=2.5,     # width = height * aspect
    palette='viridis')

- Most visitors come from Europe rather than from other regions
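
Because any country missing from the mapping silently becomes `'Other'`, it can help to list the unmapped names before filling them. This is a sketch on toy data with a deliberately incomplete map; `toy_region_map` is a stand-in, not the notebook's `region_map`.

```python
import pandas as pd

# Toy data and a deliberately incomplete map, for illustration only.
toy = pd.DataFrame({"country": ["KENYA", "ITALY", "FIJI", "KENYA", "PERU"]})
toy_region_map = {"KENYA": "Africa", "ITALY": "Europe"}

toy["region"] = toy["country"].map(toy_region_map)

# Countries that did not match any key in the map
unmapped = sorted(toy.loc[toy["region"].isna(), "country"].unique())
print("Unmapped countries:", unmapped)

toy["region"] = toy["region"].fillna("Other")
```

A long unmapped list usually points to spelling variants in the raw country names rather than to genuinely missing regions.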

In [None]:
# data.total_male.value_counts()
sns.catplot(x='age_group' , kind='count' , data=data)

In [None]:

sns.set(style="whitegrid")

# Control size: height (per facet) and aspect ratio
sns.catplot(
    x='travel_with',
    kind='count',
    data=data,
    height=5,       # height in inches
    aspect=1.5,     # width = height * aspect
    palette='viridis'
)

In [None]:
sns.catplot(x='total_female', kind='count', data=data)

- Most visitors travel with one female relative, child, or spouse

In [None]:
sns.catplot(x='total_male', kind='count', data=data)

- Most visitors travel with one male relative, child, or spouse

In [None]:
sns.catplot(
    x='purpose' , 
    kind='count' , 
    data=data , 
    height=6,      
    aspect=3.5,     
    palette='viridis'
    )

- Most visitors' purpose is leisure and holidays

In [None]:
sns.catplot(
    x='main_activity' , 
    kind='count' , 
    data=data , 
    height=6,      
    aspect=3.5,     
    palette='viridis'
    )

- Most visitors' main activity is wildlife tourism

In [None]:
sns.catplot(
    x='info_source' , 
    kind='count' , 
    data=data , 
    height=6,      
    aspect=3.5,     
    palette='viridis'
    )

- Most visitors get their tourism information from travel agents or tour operators

In [None]:
sns.catplot(
    x='tour_arrangement' , 
    kind='count' , 
    data=data , 
    height=4,      
    aspect=1.5,     
    palette='viridis'
    )

- Most visitors prefer independent tour arrangements to package tours

In [None]:
sns.catplot(
    x='night_mainland' , 
    kind='count' , 
    data=data , 
    height=5,      
    aspect=3.5,     
    palette='viridis'
    )
sns.catplot(
    x='night_zanzibar' , 
    kind='count' , 
    data=data , 
    height=6,      
    aspect=3.5,     
    palette='viridis'
    )

- Most visitors spend more nights in Zanzibar than on the mainland

In [None]:
sns.catplot(
    x='payment_mode' , 
    kind='count' , 
    data=data , 
    height=4,      
    aspect=1.5,     
    palette='viridis'
    )

- Most visitors pay in cash rather than through online payment services or traveller's cheques

In [None]:
sns.catplot(
    x='first_trip_tz' , 
    kind='count' , 
    data=data , 
    height=4,      
    aspect=1.5,     
    palette='viridis'
    )

- Most visitors are on their first trip to Tanzania. Note that `first_trip_tz` only records whether this is a first visit, so it cannot show whether visitors intend to return.

In [None]:
sns.catplot(
    x='most_impressing' , 
    kind='count' , 
    data=data , 
    height=6,      
    aspect=3.5,     
    palette='viridis'
    )

- Most visitors are impressed with how friendly people are

In [None]:
# Calculate min and max cost values first
min_cost = data['total_cost'].min()
max_cost = data['total_cost'].max()

print(f"Minimum cost: {min_cost}")
print(f"Maximum cost: {max_cost}")

bins = np.linspace(min_cost, max_cost, 6)  # 5 equal ranges
labels = [f"{int(bins[i])}-{int(bins[i+1])}" for i in range(len(bins)-1)]

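# include_lowest=True keeps the minimum-cost row in the first bin instead of turning it into NaN
data['cost_range'] = pd.cut(data['total_cost'], bins=bins, labels=labels, include_lowest=True)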

plt.figure(figsize=(16, 6))
sns.countplot(x='cost_range', data=data, palette='magma')
plt.title("Visitor Count by Spending Ranges")
plt.xlabel("Spending Range (Min–Max Bins)")
plt.ylabel("Number of Visitors")
plt.show()

- Most visitors fall in the lowest spending range, up to about 19,945,775
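
When spending is heavily skewed, equal-width bins place almost everyone in the first range. Quantile-based bins from `pd.qcut` are one alternative. The sketch below compares the two on invented numbers; the `Low`/`Medium`/`High` labels are an assumption borrowed from the problem statement, not part of the dataset.

```python
import pandas as pd

# Invented, deliberately skewed costs for illustration only.
costs = pd.Series([100, 200, 300, 400, 500, 600, 700, 800, 50_000], name="total_cost")

# Equal-width bins: one extreme value stretches the range
equal_width = pd.cut(costs, bins=3)

# Quantile bins: each level holds roughly the same number of visitors
quantile_bins = pd.qcut(costs, q=3, labels=["Low", "Medium", "High"])

width_counts = equal_width.value_counts().sort_index()
level_counts = quantile_bins.value_counts().sort_index()

print(width_counts)
print(level_counts)
```

Quantile bins make the plot readable but hide the absolute scale, so it is worth reporting the bin edges alongside the counts.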

In [None]:
service_counts = data.package_services.value_counts()

plt.figure(figsize=(30, 60))
sns.barplot(x=service_counts.values, y=service_counts.index, palette="magma")
plt.title("Most Chosen Package Services")
plt.xlabel("Number of Visitors")
plt.ylabel("Service Type")
plt.show()

##### B. Bivariate Analysis

Bivariate analysis is the simultaneous analysis of two variables (attributes). It explores the concept of relationship between two variables, whether there exists an association and the strength of this association, or whether there are differences between two variables and the significance of these differences.

After looking at every variable individually in univariate analysis, we will now explore them again with respect to the visitor's region of origin.
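
For two categorical variables, a contingency table shows the joint counts and Cramér's V summarizes the strength of association on a 0 to 1 scale. The following is a minimal sketch on toy data. The helper `cramers_v` is written here for illustration, applies no continuity or bias correction, and assumes both columns have at least two categories.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "region": ["Europe"] * 4 + ["Africa"] * 4,
    "purpose": ["Leisure", "Leisure", "Leisure", "Business",
                "Business", "Business", "Business", "Leisure"],
})

def cramers_v(frame, col_a, col_b):
    """Cramér's V (0 = no association, 1 = perfect) from a contingency table."""
    observed = pd.crosstab(frame[col_a], frame[col_b]).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts if the two variables were independent
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

table = pd.crosstab(toy["region"], toy["purpose"])
strength = cramers_v(toy, "region", "purpose")

print(table)
print(f"Cramér's V: {strength:.2f}")
```

A single number like this complements the grouped count plots: the plots show where the differences are, the statistic indicates how strong they are overall.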

In [None]:
#Explore region vs age_group
plt.figure(figsize=(16, 8))  # Increased height for better visibility
sns.countplot(x='region', hue='age_group', data=data, palette='viridis')

# Add proper title and labels
plt.title('Tourist Distribution by Region and Age Group', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Region', fontsize=14, fontweight='bold')
plt.ylabel('Number of Tourists', fontsize=14, fontweight='bold')

# Improve x-axis labels
plt.xticks(rotation=45, ha='right', fontsize=12)  # Rotate labels for better readability
plt.yticks(fontsize=12)

# Improve legend
plt.legend(title='Age Group', title_fontsize=12, fontsize=11, bbox_to_anchor=(1.05, 1), loc='upper left')

# Adjust layout to prevent label cutoff
plt.tight_layout()
plt.show()

In [None]:
#Explore region vs Travel with
plt.figure(figsize=(16, 8))  # Increased height for better visibility
sns.countplot(x='region', hue='travel_with', data=data, palette='viridis')

# Add proper title and labels
plt.title('Tourist Distribution by Region and Travel With', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Region', fontsize=14, fontweight='bold')
plt.ylabel('Number of Tourists', fontsize=14, fontweight='bold')

# Improve x-axis labels
plt.xticks(rotation=45, ha='right', fontsize=12)  # Rotate labels for better readability
plt.yticks(fontsize=12)

# Improve legend
plt.legend(title='Travel With', title_fontsize=12, fontsize=11, bbox_to_anchor=(1.05, 1), loc='upper left')

# Adjust layout to prevent label cutoff
plt.tight_layout()
plt.show()

In [None]:
#Explore region vs purpose
plt.figure(figsize=(16, 8))  # Increased height for better visibility
sns.countplot(x='region', hue='purpose', data=data, palette='viridis')

# Add proper title and labels
plt.title('Tourist Distribution by Region and Purpose', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Region', fontsize=14, fontweight='bold')
plt.ylabel('Number of Tourists', fontsize=14, fontweight='bold')

# Improve x-axis labels
plt.xticks(rotation=45, ha='right', fontsize=12)  # Rotate labels for better readability
plt.yticks(fontsize=12)

# Improve legend
plt.legend(title='Purpose', title_fontsize=12, fontsize=11, bbox_to_anchor=(1.05, 1), loc='upper left')

# Adjust layout to prevent label cutoff
plt.tight_layout()
plt.show()

In [None]:
#Explore region vs main_activity
plt.figure(figsize=(16, 8))  # Increased height for better visibility
sns.countplot(x='region', hue='main_activity', data=data, palette='viridis')

# Add proper title and labels
plt.title('Tourist Distribution by Region and Main Activity', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Region', fontsize=14, fontweight='bold')
plt.ylabel('Number of Tourists', fontsize=14, fontweight='bold')

# Improve x-axis labels
plt.xticks(rotation=45, ha='right', fontsize=12)  # Rotate labels for better readability
plt.yticks(fontsize=12)

# Improve legend
plt.legend(title='Main Activity', title_fontsize=12, fontsize=11, bbox_to_anchor=(1.05, 1), loc='upper left')

# Adjust layout to prevent label cutoff
plt.tight_layout()
plt.show()

In [None]:
#Explore region vs info_source
plt.figure(figsize=(16, 8))  # Increased height for better visibility
sns.countplot(x='region', hue='info_source', data=data, palette='viridis')

# Add proper title and labels
plt.title('Tourist Distribution by Region and Info Source', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Region', fontsize=14, fontweight='bold')
plt.ylabel('Number of Tourists', fontsize=14, fontweight='bold')

# Improve x-axis labels
plt.xticks(rotation=45, ha='right', fontsize=12)  # Rotate labels for better readability
plt.yticks(fontsize=12)

# Improve legend
plt.legend(title='Info Source', title_fontsize=12, fontsize=11, bbox_to_anchor=(1.05, 1), loc='upper left')

# Adjust layout to prevent label cutoff
plt.tight_layout()
plt.show()

In [None]:
#Explore region vs tour_arrangement
plt.figure(figsize=(16, 8))  # Increased height for better visibility
sns.countplot(x='region', hue='tour_arrangement', data=data, palette='viridis')

# Add proper title and labels
plt.title('Tourist Distribution by Region and Tour Arrangement', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Region', fontsize=14, fontweight='bold')
plt.ylabel('Number of Tourists', fontsize=14, fontweight='bold')

# Improve x-axis labels
plt.xticks(rotation=45, ha='right', fontsize=12)  # Rotate labels for better readability
plt.yticks(fontsize=12)

# Improve legend
plt.legend(title='Tour Arrangement', title_fontsize=12, fontsize=11, bbox_to_anchor=(1.05, 1), loc='upper left')

# Adjust layout to prevent label cutoff
plt.tight_layout()
plt.show()

In [None]:
# Create night categories for better visualization
def categorize_nights(nights):
    if nights == 0:
        return '0 nights'
    elif nights <= 3:
        return '1-3 nights'
    elif nights <= 7:
        return '4-7 nights'
    elif nights <= 14:
        return '8-14 nights'
    elif nights <= 30:
        return '15-30 nights'
    else:
        return '30+ nights'

# Apply categorization
data['night_mainland_category'] = data['night_mainland'].apply(categorize_nights)

# Now visualize
plt.figure(figsize=(16, 8))
sns.countplot(x='region', hue='night_mainland_category', data=data, palette='viridis')

plt.title('Tourist Distribution by Region and Length of Stay on Mainland', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Region', fontsize=14, fontweight='bold')
plt.ylabel('Number of Tourists', fontsize=14, fontweight='bold')
plt.xticks(rotation=45, ha='right', fontsize=12)
plt.yticks(fontsize=12)
plt.legend(title='Length of Stay', title_fontsize=12, fontsize=11, bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
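
The if/elif ladder in `categorize_nights` can also be expressed with `pd.cut`, which returns an ordered categorical so the legend follows the natural order of the ranges. This is a sketch under the assumption that night counts are non-negative; the names `night_bins` and `night_labels` are illustrative.

```python
import numpy as np
import pandas as pd

# Invented night counts covering every category boundary.
nights = pd.Series([0, 2, 7, 8, 14, 30, 45])

# Right-closed intervals: (-1, 0], (0, 3], (3, 7], (7, 14], (14, 30], (30, inf]
night_bins = [-1, 0, 3, 7, 14, 30, np.inf]
night_labels = ["0 nights", "1-3 nights", "4-7 nights",
                "8-14 nights", "15-30 nights", "30+ nights"]

night_category = pd.cut(nights, bins=night_bins, labels=night_labels)
print(night_category.tolist())
```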

In [None]:
# Reuse categorize_nights, defined in the previous cell, for the Zanzibar nights

# Apply categorization
data['night_zanzibar_category'] = data['night_zanzibar'].apply(categorize_nights)

# Now visualize
plt.figure(figsize=(16, 8))
sns.countplot(x='region', hue='night_zanzibar_category', data=data, palette='viridis')

plt.title('Tourist Distribution by Region and Length of Stay on Zanzibar', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Region', fontsize=14, fontweight='bold')
plt.ylabel('Number of Tourists', fontsize=14, fontweight='bold')
plt.xticks(rotation=45, ha='right', fontsize=12)
plt.yticks(fontsize=12)
plt.legend(title='Length of Stay', title_fontsize=12, fontsize=11, bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

In [None]:
#Explore region vs payment_mode
plt.figure(figsize=(16, 8))  # Increased height for better visibility
sns.countplot(x='region', hue='payment_mode', data=data, palette='viridis')

# Add proper title and labels
plt.title('Tourist Distribution by Region and Payment Mode', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Region', fontsize=14, fontweight='bold')
plt.ylabel('Number of Tourists', fontsize=14, fontweight='bold')

# Improve x-axis labels
plt.xticks(rotation=45, ha='right', fontsize=12)  # Rotate labels for better readability
plt.yticks(fontsize=12)

# Improve legend
plt.legend(title='Payment Mode', title_fontsize=12, fontsize=11, bbox_to_anchor=(1.05, 1), loc='upper left')

# Adjust layout to prevent label cutoff
plt.tight_layout()
plt.show()

In [None]:
#Explore region vs first_trip_tz
plt.figure(figsize=(16, 8))  # Increased height for better visibility
sns.countplot(x='region', hue='first_trip_tz', data=data, palette='viridis')

# Add proper title and labels
plt.title('Tourist Distribution by Region and First Trip Tz', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Region', fontsize=14, fontweight='bold')
plt.ylabel('Number of Tourists', fontsize=14, fontweight='bold')

# Improve x-axis labels
plt.xticks(rotation=45, ha='right', fontsize=12)  # Rotate labels for better readability
plt.yticks(fontsize=12)

# Improve legend
plt.legend(title='First Trip Tz', title_fontsize=12, fontsize=11, bbox_to_anchor=(1.05, 1), loc='upper left')

# Adjust layout to prevent label cutoff
plt.tight_layout()
plt.show()

In [None]:
#Explore region vs most_impressing
plt.figure(figsize=(16, 8))  # Increased height for better visibility
sns.countplot(x='region', hue='most_impressing', data=data, palette='viridis')

# Add proper title and labels
plt.title('Tourist Distribution by Region and Most Impressing', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Region', fontsize=14, fontweight='bold')
plt.ylabel('Number of Tourists', fontsize=14, fontweight='bold')

# Improve x-axis labels
plt.xticks(rotation=45, ha='right', fontsize=12)  # Rotate labels for better readability
plt.yticks(fontsize=12)

# Improve legend
plt.legend(title='Most Impressing', title_fontsize=12, fontsize=11, bbox_to_anchor=(1.05, 1), loc='upper left')

# Adjust layout to prevent label cutoff
plt.tight_layout()
plt.show()

In [None]:
# Create simplified package categories
def simplify_package_services(package_str):
    if package_str == 'None':
        return 'No Package'
    elif ',' in package_str:
        # Count number of services
        service_count = len(package_str.split(','))
        if service_count <= 2:
            return 'Basic Package (1-2 services)'
        elif service_count <= 4:
            return 'Standard Package (3-4 services)'
        else:
            return 'Premium Package (5+ services)'
    else:
        return 'Single Service'

# Apply simplification
data['package_category'] = data['package_services'].apply(simplify_package_services)

# Now visualize
plt.figure(figsize=(16, 8))
sns.countplot(x='region', hue='package_category', data=data, palette='viridis')

plt.title('Tourist Distribution by Region and Package Type', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Region', fontsize=14, fontweight='bold')
plt.ylabel('Number of Tourists', fontsize=14, fontweight='bold')
plt.xticks(rotation=45, ha='right', fontsize=12)
plt.yticks(fontsize=12)
plt.legend(title='Package Type', title_fontsize=12, fontsize=11, bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

In [None]:
#Explore region vs cost_range
plt.figure(figsize=(16, 8))  # Increased height for better visibility
sns.countplot(x='region', hue='cost_range', data=data, palette='viridis')

# Add proper title and labels
plt.title('Tourist Distribution by Region and Cost Range', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Region', fontsize=14, fontweight='bold')
plt.ylabel('Number of Tourists', fontsize=14, fontweight='bold')

# Improve x-axis labels
plt.xticks(rotation=45, ha='right', fontsize=12)  # Rotate labels for better readability
plt.yticks(fontsize=12)

# Improve legend
plt.legend(title='Cost Range', title_fontsize=12, fontsize=11, bbox_to_anchor=(1.05, 1), loc='upper left')

# Adjust layout to prevent label cutoff
plt.tight_layout()
plt.show()

In [None]:
data

#### **Our Hypothesis Results**


- Young adults are more likely to visit Tanzania compared to other age groups **TRUE**
- Visitors are more likely to travel alone or with their spouse rather than with children **TRUE**
- Most visitors travel to Tanzania primarily for leisure and holidays rather than business or visiting friends and relatives **TRUE**
- Visitors prefer wildlife and beach tourism over cultural or business-related tourism **TRUE**
- Tourists are more likely to learn about Tanzania through travel agents, friends, or relatives than through TV, radio, web platforms, magazines, or Tanzanian missions abroad **TRUE**
- Visitors prefer arranging tours independently rather than opting for package tours **TRUE**
- Cash is the most common mode of payment among visitors, compared to credit cards or travelers' cheques **TRUE**
- A majority of visitors are on their first trip to Tanzania rather than being repeat travelers **TRUE**
- Visitors are more likely to appreciate the friendliness of Tanzanians compared to giving feedback on experiences or overall satisfaction **TRUE**
- Most visitors prefer reasonably priced services rather than low-cost or high-cost options **TRUE**
- The typical length of stay for visitors is up to two weeks **TRUE**
- Visitors spend more time in Zanzibar than on mainland Tanzania **TRUE**
- Most visitors prefer not to use package services **TRUE**



In [None]:
data

### **5. Data Profiling Package**

**Profiling** is a process that helps us understand our data, and Pandas Profiling (distributed in recent releases as `ydata-profiling`) is a Python package that does exactly that. It is a simple and fast way to perform exploratory data analysis of a pandas DataFrame.

The pandas **df.describe()** and **df.info()** functions are normally used as a first step in the EDA process. However, they give only a very basic overview of the data and don't help much with large datasets. Pandas Profiling, on the other hand, extends the pandas DataFrame with **df.profile_report()** for quick data analysis.

Pandas profiling generates a complete report for your dataset, which includes:
- Basic data type information
- Descriptive statistics (mean, median, etc.)
- Common and Extreme Values
- Quantile statistics (tells you about how your data is distributed)
- Histograms for your data (again, for visualizing distributions)
- Correlations (Show features that are related)
- Missing values


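#### How to install the package

You can install it with the pip package manager by running `pip install ydata-profiling` (or `pip install pandas-profiling` for the older version), then generate a report: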

In [None]:
from ydata_profiling import ProfileReport  # or from pandas_profiling import ProfileReport if using older version

# Build the report from the DataFrame loaded earlier (named `data` in this notebook)
profile = ProfileReport(data, title="Pandas Profiling Report", explorative=True)
profile.to_file("tourism_data_report.html")
