## Import Requirements and Datasets

In [3]:
# requirements
import pandas as pd
from pandas_profiling import ProfileReport

In [5]:
# dataset 1
df_data = pd.read_csv('original_datasets/data.csv')
df_data.head()

Unnamed: 0,Date,Selina,Sub Category,Capacity,Out of Order,Available Beds,Sold Beds,Room Revenue USD,Food and Beverage Revenue,Activities Revenue,Co-Working Revenue,Average Guest Review
0,2017-12-01,San Jose,Dorm,86,1,85,65,1170.0,341.64,179.01,79.56,8.755
1,2017-12-01,Medellin,Dorm,215,25,190,91,1820.0,373.1,274.82,72.8,8.787
2,2017-12-01,Jaco,Dorm,118,2,116,55,880.0,289.52,100.32,23.76,8.78
3,2017-12-01,Bocas del Toro,Dorm,154,2,152,108,1728.0,618.624,390.528,164.16,8.689
4,2017-12-01,Red Frog,Dorm,24,1,23,17,255.0,91.545,44.88,20.655,8.754


In [7]:
# dataset 2
df_op_dates = pd.read_csv('original_datasets/opening_dates.csv')
df_op_dates.head()

Unnamed: 0,Selina,Selina Country,Selina Opening Date
0,Venao,Panama,2014-08-01
1,Bocas del Toro,Panama,2014-12-01
2,Pedasi,Panama,2016-03-01
3,San Jose,Costa Rica,2016-11-01
4,Red Frog,Panama,2016-11-01


## Assessing

### Visual Assessment

#### List of Visual Assessments:

- The data contains Selinas that opened outside the Q4 2017-2018 date range (ref. opening dates):
    - Are all Selinas listed considered "newly opened", so that we are simply analyzing their 2018 performance?
    - Or do we remove all Selinas that opened before the data collection period?
- Missing columns:
    - Total revenue
    - Selina's KPIs
- Some Selinas are missing chunks of data for certain bed types, and some are missing individual rows of data for certain bed types:
    - Does this mean the Selina didn't have this bed type available?
    - Or was this bed type simply not booked that day?
- Column names are not in snake_case
- Missing categorical columns that could potentially add value:
    - Country
    - Month of year each Selina opened
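
One way to turn the "missing rows" observation above into something checkable is to build the full grid of expected (date, Selina, bed type) combinations and see which ones have no row. The sketch below is only an illustration on a toy frame: the "Private" bed type and the values are hypothetical, and the grid assumes every Selina offers every bed type on every date, which is exactly the open question raised in the list.

```python
import pandas as pd

# Toy stand-in for df_data; "Private" and the numbers are illustrative only.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2017-12-01", "2017-12-01", "2017-12-02"]),
    "Selina": ["San Jose", "San Jose", "San Jose"],
    "Sub Category": ["Dorm", "Private", "Dorm"],
    "Sold Beds": [65, 10, 60],
})

keys = ["Date", "Selina", "Sub Category"]

# Every (date, Selina, bed type) combination we would expect to see
# if each location reported each bed type every day.
full_index = pd.MultiIndex.from_product(
    [toy[k].unique() for k in keys],
    names=keys,
)

# Reindexing onto the full grid leaves NaN wherever a row is absent.
grid = toy.set_index(keys).reindex(full_index)
missing = grid[grid["Sold Beds"].isna()].index.to_frame(index=False)

print(missing)
```

On the real data, combinations that fall before a Selina's opening date would also show up as missing, so the result would need to be filtered against the opening dates before drawing conclusions.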

### Programmatic Assessment

In [11]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19077 entries, 0 to 19076
Data columns (total 12 columns):
Date                         19077 non-null object
Selina                       19077 non-null object
Sub Category                 19077 non-null object
Capacity                     19077 non-null int64
Out of Order                 19077 non-null int64
Available Beds               19077 non-null int64
Sold Beds                    19077 non-null int64
Room Revenue USD             19077 non-null float64
Food and Beverage Revenue    19077 non-null float64
Activities Revenue           19077 non-null float64
Co-Working Revenue           19077 non-null float64
Average Guest Review         19077 non-null float64
dtypes: float64(5), int64(4), object(3)
memory usage: 1.7+ MB


In [14]:
df_data.duplicated().sum()

0

In [15]:
df_data.describe()

Unnamed: 0,Capacity,Out of Order,Available Beds,Sold Beds,Room Revenue USD,Food and Beverage Revenue,Activities Revenue,Co-Working Revenue,Average Guest Review
count,19077.0,19077.0,19077.0,19077.0,19077.0,19077.0,19077.0,19077.0,19077.0
mean,64.614195,4.362426,60.251769,31.359228,629.480657,151.040265,92.938102,32.068546,8.778343
std,53.954401,5.809417,49.804572,25.606965,411.518606,134.996453,74.167449,29.188877,0.091046
min,6.0,0.0,5.0,2.0,36.0,0.0,0.0,0.0,8.685
25%,24.0,1.0,23.0,13.0,324.0,57.21435,42.88284,13.54752,8.717
50%,40.0,1.0,39.0,22.0,540.0,120.66912,72.42102,23.332608,8.753
75%,96.0,5.0,86.0,41.0,820.8,207.8208,122.0688,41.1642,8.785
max,254.0,43.0,233.0,126.0,2541.0,900.19125,541.47992,277.3975,9.01


#### List of Programmatic Assessments:
- Column headers are not in snake_case
- No NaNs
    - If we add rows for the missing data identified in the visual assessment, we will create NaNs that need to be filled accordingly
- Date column has the wrong dtype (object rather than datetime)
- No duplicates
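
The header and dtype issues listed above could be addressed along these lines. This is a minimal sketch on a toy frame rather than the cleaning code actually used; `to_snake_case` is a hypothetical helper, and the date format string assumes the ISO-style dates seen in the `head()` output.

```python
import pandas as pd


def to_snake_case(name: str) -> str:
    """Lower-case a header and replace hyphens and spaces with underscores."""
    return name.strip().lower().replace("-", "_").replace(" ", "_")


# Toy stand-in for df_data with a few of its original headers.
toy = pd.DataFrame({
    "Date": ["2017-12-01", "2017-12-02"],
    "Sub Category": ["Dorm", "Dorm"],
    "Co-Working Revenue": [79.56, 72.8],
})

# rename accepts a callable that is applied to every column label.
toy = toy.rename(columns=to_snake_case)

# Parse the string dates into datetime64 so they can be filtered and resampled.
toy["date"] = pd.to_datetime(toy["date"], format="%Y-%m-%d")

print(toy.dtypes)
```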

## Cleaning

In [23]:
# number of distinct Selinas in the performance data
df_data['Selina'].nunique()

23

In [24]:
# number of distinct Selinas in the opening dates data
df_op_dates['Selina'].nunique()

25

In [27]:
df_data.Selina.value_counts()

Puerto Viejo            1095
Manuel Antonio          1095
Medellin                1094
Santa Teresa North      1094
San Jose                1093
La Fortuna              1093
Tamarindo               1093
Bocas del Toro          1092
Venao                   1090
Jaco                    1083
Red Frog                1057
Playa del Carmen         973
Montanita                967
Antigua                  917
Mexico City              821
Cartagena                808
Puerto Escondido         585
Pedasi                   574
Quito                    550
Banos                    353
Lima                     272
Bogota La Candelaria     225
Cusco                     53
Name: Selina, dtype: int64

In [29]:
df_op_dates.Selina.value_counts()

Bocas del Toro          1
San Jose                1
Puerto Viejo            1
Manuel Antonio          1
Mexico City             1
Cancun - Hotel Zone     1
Cusco                   1
Pedasi                  1
Medellin                1
Bogota La Candelaria    1
Banos                   1
Puerto Escondido        1
Montanita               1
Jaco                    1
Antigua                 1
Playa del Carmen        1
Quito                   1
Venao                   1
Lima                    1
Santa Teresa North      1
Santa Teresa South      1
Cartagena               1
Tamarindo               1
La Fortuna              1
Red Frog                1
Name: Selina, dtype: int64
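
Comparing the two listings by eye, the opening dates file appears to contain two names that never occur in the performance data: Cancun - Hotel Zone and Santa Teresa South. A set difference makes that comparison explicit. The sketch below uses short hand-typed name lists as stand-ins for `df_data.Selina` and `df_op_dates.Selina`, so it illustrates the check rather than reproducing the full result.

```python
import pandas as pd

# Shortened stand-ins for the two Selina columns (names taken from the listings above).
data_names = pd.Series(
    ["Venao", "Pedasi", "Santa Teresa North"], name="Selina"
)
op_names = pd.Series(
    ["Venao", "Pedasi", "Santa Teresa North",
     "Santa Teresa South", "Cancun - Hotel Zone"],
    name="Selina",
)

# Names with an opening date but no performance rows, and the reverse.
only_in_op_dates = sorted(set(op_names) - set(data_names))
only_in_data = sorted(set(data_names) - set(op_names))

print(only_in_op_dates)
print(only_in_data)
```

With the real columns, an empty `only_in_data` would confirm that every Selina in the performance data has an opening date to join against.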