In [1]:
import numpy as np
import pandas as pd

<h1 align="center">Data Wrangling: Import and Cleaning (2017-2022)</h1>

<h3>Overview</h3>
This notebook reads one Excel workbook per year from 2017 to 2022. Each workbook holds tourist arrivals and overnight stays, broken down by type of accommodation, by municipality, and by country of origin. Each sheet is loaded into a dictionary of DataFrames with one entry per year, keyed by a name that ends in the year.

Each DataFrame is then cleaned: columns are renamed, unneeded columns and empty rows are dropped, missing values and '-' placeholders are replaced with 0, and float columns are cast to integers. The cleaned data is saved to CSV files for further analysis and visualization.
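
The cleaning steps described above can be sketched on a small made-up frame. The helper name clean_counts and the sample values are illustrative assumptions, not part of the notebook. This sketch uses pd.to_numeric so that a column containing the '-' placeholder also ends up as an integer column.

```python
import pandas as pd

def clean_counts(df, label_col):
    """Hypothetical helper: drop rows without a label, treat '-' and NaN as 0, cast counts to int."""
    out = df[df[label_col].notna()].reset_index(drop=True)
    for col in out.columns.drop(label_col):
        # errors="coerce" turns non-numeric placeholders such as '-' into NaN
        out[col] = pd.to_numeric(out[col], errors="coerce").fillna(0).astype("int64")
    return out

# Made-up sample: the second row has no label, the third uses '-' for a missing count
sample = pd.DataFrame({
    "tip_smjestaja": ["A", None, "B"],
    "dolasci_stranih_turista": [10.0, None, "-"],
    "dolasci_domacih_turista": [3.0, None, 2.0],
})
cleaned = clean_counts(sample, "tip_smjestaja")
print(cleaned.dtypes)
```

Coercing with pd.to_numeric is one possible design choice; the notebook itself replaces '-' with 0 and casts only the columns that are already float.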

<h3>Table 1: Tourists by Accommodation Type</h3>

Sheet1 of each workbook holds the number of foreign and domestic tourists, as well as the number of overnight stays, by type of accommodation. The code reads this sheet for each year from 2017 to 2022 into a dictionary of DataFrames. It then renames the columns, drops unneeded columns, removes rows without an accommodation type, replaces missing values and '-' placeholders with 0, and casts float columns to integers. Finally, it prints the total number of foreign tourist arrivals for each year and saves the cleaned data to CSV files.

In [6]:
file_paths = {
    '2022': r'C:\Users\Ivan\Desktop\tourism_analysis\data\2017-2022\2022\ukupno_2022.xlsx',
    '2021': r'C:\Users\Ivan\Desktop\tourism_analysis\data\2017-2022\2021\ukupno_2021.xlsx',
    '2020': r'C:\Users\Ivan\Desktop\tourism_analysis\data\2017-2022\2020\ukupno_2020.xlsx',
    '2019': r'C:\Users\Ivan\Desktop\tourism_analysis\data\2017-2022\2019\ukupno_2019.xls',
    '2018': r'C:\Users\Ivan\Desktop\tourism_analysis\data\2017-2022\2018\ukupno_2018.xls',
    '2017': r'C:\Users\Ivan\Desktop\tourism_analysis\data\2017-2022\2017\ukupno_2017.xls'
}
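
As a minimal sketch, the same dictionary could be built from a base folder instead of six hard-coded absolute paths. BASE_DIR and file_paths_sketch are hypothetical names, and the relative folder is an assumption; the file names and extensions follow the dictionary above. pandas typically needs the xlrd package to read .xls files and openpyxl for .xlsx files.

```python
from pathlib import Path

# Hypothetical base folder; point this at the real location of the yearly workbooks
BASE_DIR = Path("data") / "2017-2022"

# 2017-2019 are .xls files, 2020-2022 are .xlsx files, as in the dictionary above
extensions = {2017: "xls", 2018: "xls", 2019: "xls",
              2020: "xlsx", 2021: "xlsx", 2022: "xlsx"}

# Same keys and ordering as file_paths: year strings, newest first
file_paths_sketch = {
    str(year): BASE_DIR / str(year) / f"ukupno_{year}.{ext}"
    for year, ext in sorted(extensions.items(), reverse=True)
}

for year, path in file_paths_sketch.items():
    print(year, path)
```

pd.read_excel accepts Path objects, so a dictionary built this way could be passed to the loading loops unchanged.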

In [7]:
tourists_by_acc_dict = {}
for year, path in file_paths.items():
    tourists_by_acc_dict[f'tourists_by_accomodation_{year}'] = pd.read_excel(path, sheet_name='Sheet1', header=6)
    

In [8]:
for year, df in tourists_by_acc_dict.items():
    # Note: `year` holds the full dictionary key, e.g. 'tourists_by_accomodation_2022'
    # Rename the columns that are kept, by position
    df.rename(columns={
        df.columns[0]: 'tip_smjestaja',
        df.columns[1]: 'dolasci_stranih_turista',
        df.columns[2]: 'dolasci_domacih_turista',
        df.columns[4]: 'nocenje_stranih_turista',
        df.columns[5]: 'nocenje_domacih_turista'},
        inplace=True)
    # Drop the unneeded columns
    df.drop(columns=[df.columns[3], df.columns[6]], inplace=True)

    # Remove rows without an accommodation type, then replace missing values and '-' with 0
    cleaned = df[~df['tip_smjestaja'].isnull()].reset_index(drop=True)
    tourists_by_acc_dict[year] = cleaned.fillna(0).replace('-', 0)

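for year, df in tourists_by_acc_dict.items():
    # Cast float columns (all but the first column) to int.
    # astype returns a new frame, so the cast does not rely on in-place iloc assignment.
    float_cols = df.iloc[:, 1:].select_dtypes(include='float').columns
    tourists_by_acc_dict[year] = df.astype({col: 'int' for col in float_cols})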



In [10]:
for year, df in tourists_by_acc_dict.items():
    print(year, df['dolasci_stranih_turista'].sum())

tourists_by_accomodation_2022 2036403
tourists_by_accomodation_2021 1553558
tourists_by_accomodation_2020 350795
tourists_by_accomodation_2019 2509625
tourists_by_accomodation_2018 2076803
tourists_by_accomodation_2017 1877212


In [None]:
for year, df in tourists_by_acc_dict.items():
    df.to_csv(fr'C:\Users\Ivan\Desktop\tourism_analysis\data\cleaned_data\final_data_accomodation\{year}.csv', index=False)
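
Because the loop variable holds the full dictionary key, each file is named after that key (for example tourists_by_accomodation_2022.csv). A minimal round-trip check, sketched here with a temporary folder and a made-up frame rather than the real output directory, could look like this:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for one cleaned table; the key mirrors the notebook's naming
frames = {
    "tourists_by_accomodation_2022": pd.DataFrame({
        "tip_smjestaja": ["A", "B"],
        "dolasci_stranih_turista": [10, 4],
    })
}

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)  # hypothetical output folder
    for name, frame in frames.items():
        frame.to_csv(out_dir / f"{name}.csv", index=False)
    written = sorted(p.name for p in out_dir.iterdir())
    reloaded = pd.read_csv(out_dir / written[0])

# Compare column names and values after reading the file back
original = frames["tourists_by_accomodation_2022"]
round_trip_ok = reloaded.to_dict("list") == original.to_dict("list")
print(written, round_trip_ok)
```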

<h3>Table 2: Tourists by Municipality</h3>

Sheet2 of each workbook holds the number of foreign and domestic tourists, as well as the number of overnight stays, by municipality. The code reads this sheet for each year from 2017 to 2022 into a dictionary of DataFrames. It then renames the columns, drops the first row (index label 0) and the unneeded columns, replaces missing values and '-' placeholders with 0, and casts float columns to integers. The first rows of the 2021 table are displayed as a check, and the cleaned data is saved to CSV files.
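
The positional rename-and-drop pattern can also be written as a single selection. The sketch below uses a made-up nine-column frame whose first row is empty; the layout of the real sheet is an assumption, and only the target column names come from the notebook.

```python
import pandas as pd

nan = float("nan")
# Made-up stand-in for a raw sheet: row 0 is empty, columns 3, 4, 7 and 8 are not needed
raw = pd.DataFrame([
    [None, nan, nan, nan, nan, nan, nan, nan, nan],
    ["A", 5.0, 2.0, 7.0, 0.5, 20.0, 4.0, 24.0, 0.5],
    ["B", 9.0, 3.0, 12.0, 0.5, 40.0, 6.0, 46.0, 0.5],
])

# Column position -> new name, for the columns that are kept
keep = {
    0: "opstina",
    1: "dolasci_stranih_turista",
    2: "dolasci_domacih_turista",
    5: "nocenje_stranih_turista",
    6: "nocenje_domacih_turista",
}

# Drop the row labeled 0, select the kept columns by position, then name them
tidy = raw.drop(index=0).iloc[:, list(keep)]
tidy.columns = list(keep.values())
# Cast every count column to int
tidy = tidy.astype({name: "int64" for name in tidy.columns[1:]})
print(tidy)
```

Selecting the kept positions directly avoids having to track how column positions shift after a rename or a drop.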

In [11]:
tourists_by_mun_dict = {}
for year, path in file_paths.items():
    tourists_by_mun_dict[f'tourists_by_municipality_{year}'] = pd.read_excel(path, sheet_name='Sheet2', header=3)
    

In [12]:
for year, df in tourists_by_mun_dict.items():
    # Rename the columns that are kept, by position
    df.rename(columns={
        df.columns[0]: 'opstina',
        df.columns[1]: 'dolasci_stranih_turista',
        df.columns[2]: 'dolasci_domacih_turista',
        df.columns[5]: 'nocenje_stranih_turista',
        df.columns[6]: 'nocenje_domacih_turista'},
        inplace=True)
    # Drop the first row (index label 0) and the unneeded columns
    df.drop(labels=0, inplace=True)
    df.drop(columns=[df.columns[3], df.columns[4], df.columns[7], df.columns[8]],
            inplace=True)
    # Replace missing values and '-' placeholders with 0
    tourists_by_mun_dict[year] = df.fillna(0).replace('-', 0)
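
for year, df in tourists_by_mun_dict.items():
    # Cast float columns (all but the first column) to int.
    # astype returns a new frame, so the cast does not rely on in-place iloc assignment.
    float_cols = df.iloc[:, 1:].select_dtypes(include='float').columns
    tourists_by_mun_dict[year] = df.astype({col: 'int' for col in float_cols})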



In [13]:
tourists_by_mun_dict['tourists_by_municipality_2021'].head()

Unnamed: 0,opstina,dolasci_stranih_turista,dolasci_domacih_turista,nocenje_stranih_turista,nocenje_domacih_turista
1,Andrijevica,404,119,2097,468
2,Bar,145416,6999,1333030,27611
3,Berane,1890,1955,3193,2645
4,Bijelo Polje,3227,1116,12923,1944
5,Budva,522934,33613,2688528,105605


In [None]:
for year, df in tourists_by_mun_dict.items():
    df.to_csv(fr'C:\Users\Ivan\Desktop\tourism_analysis\data\cleaned_data\final_data_municipality\{year}.csv', index=False)

<h3>Table 3: Tourists by Country</h3>

Sheet3 of each workbook was imported into a dictionary named "tourists_by_country_dict" using pandas' read_excel function. Each DataFrame was then cleaned: the first column was renamed to "zemlja_porijekla", the second to "dolasci", and the fourth to "nocenja", and the third and fifth columns were dropped. Rows whose "zemlja_porijekla" value is "Strani turisti", "Evropa", or "Vanevropske zemlje" were filtered out of each DataFrame, and the index was reset.

The cleaned DataFrames were then exported to separate CSV files for each year using pandas' to_csv function.
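
The row filter can be sketched on a made-up frame. The three filtered labels come from the notebook and appear to be subtotal rows, which is an assumption; the other rows and all numbers are invented for illustration.

```python
import pandas as pd

# Made-up sample; only the three filtered labels are taken from the notebook
by_country = pd.DataFrame({
    "zemlja_porijekla": ["Strani turisti", "Evropa", "Albanija",
                         "Vanevropske zemlje", "Australija"],
    "dolasci": [20, 12, 12, 8, 8],
    "nocenja": [70, 40, 40, 30, 30],
})

# Labels assumed to be subtotals rather than individual countries
aggregate_rows = ["Strani turisti", "Evropa", "Vanevropske zemlje"]

# Keep only rows whose label is not in the list, then renumber the index from 0
countries_only = (
    by_country[~by_country["zemlja_porijekla"].isin(aggregate_rows)]
    .reset_index(drop=True)
)
print(countries_only)
```

If those rows are indeed subtotals, leaving them in would count the same arrivals more than once when the column is summed.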

In [14]:
tourists_by_country_dict = {}
for year, path in file_paths.items():
    tourists_by_country_dict[f'tourists_by_country_{year}'] = pd.read_excel(path, sheet_name='Sheet3', header=1)

In [15]:
for year, df in tourists_by_country_dict.items():
    # Rename the columns that are kept, by position
    df.rename(columns={
        df.columns[0]: 'zemlja_porijekla',
        df.columns[1]: 'dolasci',
        df.columns[3]: 'nocenja'},
        inplace=True)
    # Drop the unneeded columns
    df.drop(columns=[df.columns[2], df.columns[4]],
            inplace=True)

    # Filter out the listed non-country rows and reset the index
    filt = ~df['zemlja_porijekla'].isin(['Strani turisti', 'Evropa', 'Vanevropske zemlje'])
    tourists_by_country_dict[year] = df[filt].reset_index(drop=True)

In [16]:
tourists_by_country_dict['tourists_by_country_2022'].tail()

Unnamed: 0,zemlja_porijekla,dolasci,nocenja
56,Ujedinjeni Arapski Emirati,2302,6183
57,Ostale azijske zemlje,27486,124100
58,Australija,6902,32521
59,Novi Zeland,1528,8519
60,Ostale zemlje Okeanije,3819,11510


In [None]:
for year, df in tourists_by_country_dict.items():
    df.to_csv(fr'C:\Users\Ivan\Desktop\tourism_analysis\data\cleaned_data\final_data_country\{year}.csv', index=False)