# Initial data exploration
In this step, we gather basic information about the data and identify what needs to be done in the next step, data preprocessing.

## 1. Import Packages and Data

In [1]:
import pandas as pd

In [2]:
filepath = "./jpn-hostel-data/raw.csv"
df = pd.read_csv(filepath)

In [3]:
df.head()

,Unnamed: 0,hostel.name,City,price.from,Distance,summary.score,rating.band,atmosphere,cleanliness,facilities,location.y,security,staff,valueformoney,lon,lat
0,1,"""Bike & Bed"" CharinCo Hostel",Osaka,3300,2.9km from city centre,9.2,Superb,8.9,9.4,9.3,8.9,9.0,9.4,9.4,135.513767,34.682678
1,2,& And Hostel,Fukuoka-City,2600,0.7km from city centre,9.5,Superb,9.4,9.7,9.5,9.7,9.2,9.7,9.5,,
2,3,&And Hostel Akihabara,Tokyo,3600,7.8km from city centre,8.7,Fabulous,8.0,7.0,9.0,8.0,10.0,10.0,9.0,139.777472,35.697447
3,4,&And Hostel Ueno,Tokyo,2600,8.7km from city centre,7.4,Very Good,8.0,7.5,7.5,7.5,7.0,8.0,6.5,139.783667,35.712716
4,5,&And Hostel-Asakusa North-,Tokyo,1500,10.5km from city centre,9.4,Superb,9.5,9.5,9.0,9.0,9.5,10.0,9.5,139.798371,35.727898


In [4]:
df.shape

(342, 16)

## 2. Data checks to perform
- Missing values
- Duplicates
- Data types
- Number of unique values
- Statistics of numerical features
- Categories present in categorical features

### 2.1. Missing values

In [5]:
df.isna().sum()

Unnamed: 0        0
hostel.name       0
City              0
price.from        0
Distance          0
summary.score    15
rating.band      15
atmosphere       15
cleanliness      15
facilities       15
location.y       15
security         15
staff            15
valueformoney    15
lon              44
lat              44
dtype: int64
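
The counts above suggest that the nine rating-related columns may be missing in the same 15 rows. A minimal sketch of how this could be checked, using a small made-up frame (`toy`, with only a few of the columns) rather than the real data:

```python
import pandas as pd

# Toy frame with made-up values; only the column names come from the hostel data
toy = pd.DataFrame({
    "summary.score": [9.2, None, 8.7, None],
    "rating.band": ["Superb", None, "Fabulous", None],
    "atmosphere": [8.9, None, 8.0, None],
    "lon": [135.5, None, None, 139.8],
})

# Share of missing values per column, in percent
missing_pct = toy.isna().mean().mul(100).round(1)
print(missing_pct)

# Count how many rating columns are missing in each row;
# if every row has either none or all of them missing, the gaps coincide
rating_cols = ["summary.score", "rating.band", "atmosphere"]
n_missing_per_row = toy[rating_cols].isna().sum(axis=1)
same_rows = n_missing_per_row.isin([0, len(rating_cols)]).all()
print(same_rows)
```

On the real data, the same check would use `df` and the full list of rating columns. If the gaps do coincide, those rows are probably hostels without any reviews, which would affect how they are handled during preprocessing.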

### 2.2. Duplicates

In [6]:
df.duplicated().sum()

0

### 2.3. Data types

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 342 entries, 0 to 341
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Unnamed: 0     342 non-null    int64  
 1   hostel.name    342 non-null    object 
 2   City           342 non-null    object 
 3   price.from     342 non-null    int64  
 4   Distance       342 non-null    object 
 5   summary.score  327 non-null    float64
 6   rating.band    327 non-null    object 
 7   atmosphere     327 non-null    float64
 8   cleanliness    327 non-null    float64
 9   facilities     327 non-null    float64
 10  location.y     327 non-null    float64
 11  security       327 non-null    float64
 12  staff          327 non-null    float64
 13  valueformoney  327 non-null    float64
 14  lon            298 non-null    float64
 15  lat            298 non-null    float64
dtypes: float64(10), int64(2), object(4)
memory usage: 42.9+ KB


### 2.4. Number of unique values

In [8]:
df.nunique()

Unnamed: 0       342
hostel.name      342
City               5
price.from        42
Distance         119
summary.score     44
rating.band        5
atmosphere        42
cleanliness       38
facilities        40
location.y        36
security          38
staff             30
valueformoney     35
lon              296
lat              296
dtype: int64

### 2.5. Statistics for numerical features

In [9]:
df.describe().T

,count,mean,std,min,25%,50%,75%,max
Unnamed: 0,342.0,171.5,98.871128,1.0,86.25,171.5,256.75,342.0
price.from,342.0,8388.011696,76415.272323,1000.0,2000.0,2500.0,2900.0,1003200.0
summary.score,327.0,8.782569,0.960909,3.1,8.6,9.0,9.4,10.0
atmosphere,327.0,8.238838,1.382002,2.0,7.8,8.6,9.0,10.0
cleanliness,327.0,9.011927,1.215775,2.0,8.8,9.3,9.8,10.0
facilities,327.0,8.597554,1.285356,2.0,8.0,9.0,9.3,10.0
location.y,327.0,8.694801,1.102703,2.0,8.0,9.0,9.4,10.0
security,327.0,8.947401,1.114345,2.0,8.7,9.2,9.6,10.0
staff,327.0,9.133333,1.086513,2.0,9.0,9.4,9.8,10.0
valueformoney,327.0,8.848318,1.047809,4.0,8.6,9.0,9.5,10.0
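
The price.from row has a maximum far above its 75% quartile. One common way to flag such values is the 1.5 × IQR rule of thumb. The sketch below applies it to the quartiles reported above; the choice of this rule and the short `prices` sample (the five prices from `df.head()` plus the reported maximum) are assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd

# Quartiles of price.from as reported by df.describe() above
q1, q3 = 2000.0, 2900.0
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # 1.5 * IQR rule of thumb

# Sample: the five prices shown by df.head() plus the reported maximum
prices = pd.Series([3300, 2600, 3600, 2600, 1500, 1003200], name="price.from")

# Values above the upper fence are candidate outliers
outliers = prices[prices > upper_fence]
print(upper_fence)
print(outliers.tolist())
```

On the full data, `prices` would be `df["price.from"]` and the quartiles would come from `prices.quantile([0.25, 0.75])`.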


### 2.6. Categories present in categorical features

In [10]:
# Select the text (object-dtype) columns
object_columns = df.select_dtypes(include='object').columns

# Print unique values for categorical features with fewer than 10 distinct values
for column in object_columns:
    if df[column].nunique() < 10:
        print(f"Unique values for {column}:")
        print(df[column].unique())
        print("\n")

Unique values for City:
['Osaka' 'Fukuoka-City' 'Tokyo' 'Hiroshima' 'Kyoto']


Unique values for rating.band:
['Superb' 'Fabulous' 'Very Good' nan 'Rating' 'Good']




## 3. Observations
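- There are missing values: 15 in each rating-related column (summary.score through valueformoney) and 44 in lon and lat
- price.from probably contains outliers: the mean (about 8,388) is far above the median (2,500), and the maximum is 1,003,200
- Unnamed: 0 can be removed (it duplicates the row index)
- hostel.name, lon, and lat can be removed (not relevant for this project)
- Distance should be converted to a numerical value (it is stored as text, e.g. "2.9km from city centre")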
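
As a preview of the last observation, here is a minimal sketch of one way the Distance text could be turned into a number. It assumes every value follows the "<number>km from city centre" pattern seen in `df.head()`. The name `distance_km` is hypothetical, and the actual preprocessing step may handle this differently.

```python
import pandas as pd

# Distance strings copied from the df.head() output
distance = pd.Series([
    "2.9km from city centre",
    "0.7km from city centre",
    "7.8km from city centre",
    "8.7km from city centre",
    "10.5km from city centre",
], name="Distance")

# Capture the leading number before "km"; strings that do not match become NaN
distance_km = (
    distance.str.extract(r"^\s*(\d+(?:\.\d+)?)\s*km", expand=False)
    .astype(float)
    .rename("distance_km")
)
print(distance_km.tolist())
```

After applying this to the full column, `distance_km.isna().sum()` would show whether any of the 119 distinct Distance values fail to match the assumed pattern.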