(https://www.kaggle.com/datasets/rupindersinghrana/airbnb-price-dataset)
Dataset Overview: This dataset contains Airbnb listings with columns such as id, log_price, property_type, room_type, amenities, accommodates, bathrooms, bed_type, cancellation_policy, cleaning_fee, city, neighbourhood, number_of_reviews, review_scores_rating, bedrooms, beds, etc.
Objective: Analyze the factors that influence log_price (e.g., property_type, room_type, amenities, neighbourhood, review_scores_rating) in order to recommend competitive pricing strategies aligned with market trends and customer budgets.
Key metric: correlation between amenities, accommodates, bedrooms, and price.
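The key metric can be prototyped quickly in pandas before the Hadoop pipeline is built. The frame below is a tiny synthetic stand-in for the real dataset (values are illustrative; column names follow the dataset):

```python
import pandas as pd

# Tiny synthetic stand-in for the real listings data (illustration only).
df = pd.DataFrame({
    "accommodates": [2, 4, 6, 2, 8],
    "bedrooms":     [1, 2, 3, 1, 4],
    "log_price":    [4.1, 4.7, 5.2, 4.0, 5.6],
})

# Pearson correlation of each numeric feature with log_price.
corr = df.corr()["log_price"].drop("log_price")
print(corr)
```

On the real data the same two lines apply once the listings are loaded into a DataFrame.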
- Install and log in to the Cloudera VM.
- Check that Hadoop, Hive, and Spark are installed.
- Create a shared folder to exchange files with the local machine.
- Upload the dataset to the Cloudera VM for processing.
- Upload the dataset to HDFS:
hdfs dfs -mkdir /user/cloudera/airbnb
hdfs dfs -put airbnb_data.csv /user/cloudera/airbnb/
- Verify the file in HDFS:
hdfs dfs -ls /user/cloudera/airbnb
- Create the Hive database
hive -e "CREATE DATABASE airbnb_db;"
- Create an external table for the CSV data
CREATE EXTERNAL TABLE airbnb_db.listings (
id STRING,
log_price STRING,
property_type STRING,
room_type STRING,
amenities STRING,
accommodates INT,
bathrooms STRING,
bed_type STRING,
cancellation_policy STRING,
cleaning_fee BOOLEAN,
city STRING,
description STRING,
first_review STRING,
host_has_profile_pic STRING,
host_identity_verified STRING,
host_response_rate STRING,
host_since STRING,
instant_bookable STRING,
last_review STRING,
latitude STRING,
longitude STRING,
name STRING,
neighbourhood STRING,
number_of_reviews INT,
review_scores_rating STRING,
thumbnail_url STRING,
zipcode STRING,
bedrooms STRING,
beds STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/cloudera/airbnb'
TBLPROPERTIES ('skip.header.line.count'='1');
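One caveat with FIELDS TERMINATED BY ',': the amenities field contains embedded commas inside its {"..."} encoding, so a plain comma-delimited table will split that field apart. A hedged workaround (a sketch, not part of the original pipeline) is to re-parse the CSV quote-aware locally and join amenities with ';' before uploading; the later SPLIT would then use ';' as its delimiter:

```python
import csv
import io

# One illustrative raw line: the amenities field carries embedded commas.
raw = 'id,log_price,amenities\n1,4.7,"{""Wifi"",""Kitchen""}"\n'

out = io.StringIO()
writer = csv.writer(out)
for row in csv.reader(io.StringIO(raw)):
    # Strip the braces/quotes and join amenities with ';' so a comma-delimited
    # Hive table no longer splits the field apart.
    row[2] = row[2].strip("{}").replace('"', "").replace(",", ";")
    writer.writerow(row)

print(out.getvalue())
```

For the real file, replace the in-memory string with open file handles on the downloaded CSV.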
- Clean 'log_price' (strip commas and cast to DOUBLE)
CREATE TABLE airbnb_db.listings_cleaned AS
SELECT
id,
CAST(REGEXP_REPLACE(log_price, ',', '') AS DOUBLE) AS log_price,
property_type,
room_type,
amenities,
accommodates,
CAST(bathrooms AS FLOAT) AS bathrooms,
bed_type,
cancellation_policy,
cleaning_fee,
city,
neighbourhood,
number_of_reviews,
CAST(review_scores_rating AS FLOAT) AS review_scores_rating,
CAST(bedrooms AS FLOAT) AS bedrooms,
CAST(beds AS FLOAT) AS beds
FROM airbnb_db.listings;
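The same cleaning step can be prototyped in pandas to sanity-check the casts before running the Hive CTAS (the sample frame is illustrative):

```python
import pandas as pd

# Illustrative raw values: numeric fields arrive as strings, some malformed.
df = pd.DataFrame({
    "log_price": ["4.7", "1,234", "abc"],
    "bathrooms": ["1.0", "2.5", ""],
})

# Strip thousands separators and coerce unparseable values to NaN,
# mirroring the CASTs in the Hive statement (CAST returns NULL on failure).
df["log_price"] = pd.to_numeric(df["log_price"].str.replace(",", ""), errors="coerce")
df["bathrooms"] = pd.to_numeric(df["bathrooms"], errors="coerce")
print(df)
```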
- Average price by city and neighbourhood:
SELECT city, neighbourhood, AVG(log_price) AS avg_price
FROM airbnb_db.listings_cleaned
GROUP BY city, neighbourhood
ORDER BY avg_price DESC;
This identifies high-value neighbourhoods, helping hosts set competitive prices.
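For cross-checking the Hive output, the same aggregation in pandas (the stand-in frame is illustrative):

```python
import pandas as pd

# Tiny stand-in for listings_cleaned (illustration only).
df = pd.DataFrame({
    "city": ["NYC", "NYC", "LA"],
    "neighbourhood": ["Harlem", "Harlem", "Venice"],
    "log_price": [4.5, 5.5, 4.0],
})

# GROUP BY city, neighbourhood / AVG(log_price) / ORDER BY avg_price DESC
avg = (df.groupby(["city", "neighbourhood"])["log_price"]
         .mean()
         .sort_values(ascending=False)
         .reset_index(name="avg_price"))
print(avg)
```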
- Impact of amenities on price
Parse 'amenities' (split the string into individual amenities):
CREATE TABLE airbnb_db.amenities_expanded AS
SELECT id, log_price, amenity
FROM airbnb_db.listings_cleaned
LATERAL VIEW EXPLODE(SPLIT(REGEXP_REPLACE(amenities, '[{}"]', ''), ',')) a AS amenity;
Compute the average price per amenity
SELECT amenity, AVG(log_price) AS avg_price
FROM airbnb_db.amenities_expanded
GROUP BY amenity
ORDER BY avg_price DESC;
This shows which amenities (e.g., "Air conditioning", "Kitchen") raise the price.
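The REGEXP_REPLACE + SPLIT + LATERAL VIEW EXPLODE pattern has a direct pandas analogue, useful for validating the Hive result on a sample (the frame below is illustrative):

```python
import pandas as pd

# Stand-in rows; amenities uses the raw {"..."} encoding from the dataset.
df = pd.DataFrame({
    "id": ["a", "b"],
    "log_price": [5.0, 4.0],
    "amenities": ['{"Wifi","Kitchen"}', '{"Wifi"}'],
})

# Mirror REGEXP_REPLACE + SPLIT + LATERAL VIEW EXPLODE from the Hive query.
df["amenity"] = df["amenities"].str.replace(r'[{}"]', "", regex=True).str.split(",")
exploded = df.explode("amenity")
avg = exploded.groupby("amenity")["log_price"].mean().sort_values(ascending=False)
print(avg)
```

Each listing row is duplicated once per amenity, exactly as EXPLODE does in Hive.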
- Correlation between features and price
Use Hive to compute average prices for highly rated listings, grouped by number of bedrooms
SELECT bedrooms, AVG(log_price) AS avg_price
FROM airbnb_db.listings_cleaned
WHERE review_scores_rating >= 90
GROUP BY bedrooms;
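The WHERE/GROUP BY combination above translates to a pandas filter plus groupby (stand-in frame for illustration):

```python
import pandas as pd

# Stand-in for listings_cleaned (illustration only).
df = pd.DataFrame({
    "bedrooms": [1, 2, 2, 3],
    "review_scores_rating": [95, 88, 92, 97],
    "log_price": [4.2, 4.8, 5.0, 5.5],
})

# WHERE review_scores_rating >= 90 ... GROUP BY bedrooms
avg = df[df["review_scores_rating"] >= 90].groupby("bedrooms")["log_price"].mean()
print(avg)
```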
- Export the analysis results to CSV
Impact of amenities on price
INSERT OVERWRITE DIRECTORY '/user/cloudera/airbnb/price_by_amenity'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
SELECT amenity, ROUND(AVG(log_price), 2) AS avg_price
FROM airbnb_db.amenities_expanded
GROUP BY amenity
ORDER BY avg_price DESC;
Correlation between bedrooms and price (highly rated listings)
INSERT OVERWRITE DIRECTORY '/user/cloudera/airbnb/price_by_bedrooms'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
SELECT bedrooms, ROUND(AVG(log_price), 2) AS avg_price
FROM airbnb_db.listings_cleaned
WHERE review_scores_rating >= 90
GROUP BY bedrooms
ORDER BY bedrooms;
Verify the CSV files in HDFS
hdfs dfs -ls /user/cloudera/airbnb/price_by_amenity
hdfs dfs -ls /user/cloudera/airbnb/price_by_bedrooms
Merge the HDFS part files and copy them to the home directory
hdfs dfs -getmerge /user/cloudera/airbnb/price_by_amenity /home/cloudera/price_by_amenity.csv
hdfs dfs -getmerge /user/cloudera/airbnb/price_by_bedrooms /home/cloudera/price_by_bedrooms.csv
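hdfs dfs -getmerge simply concatenates the part-* files in name order into one local file. The same idea in plain Python over local files, as a sketch (paths and file contents are illustrative):

```python
import glob
import os
import tempfile

# Create stand-in part files like those a Hive INSERT OVERWRITE DIRECTORY emits.
d = tempfile.mkdtemp()
for i, line in enumerate(["Wifi,4.5\n", "Kitchen,5.0\n"]):
    with open(os.path.join(d, f"part-0000{i}"), "w") as f:
        f.write(line)

# Concatenate the parts in name order into one CSV, as getmerge does.
merged_path = os.path.join(d, "merged.csv")
with open(merged_path, "w") as out:
    for part in sorted(glob.glob(os.path.join(d, "part-*"))):
        with open(part) as f:
            out.write(f.read())

with open(merged_path) as f:
    merged = f.read()
print(merged)
```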
- Move the result files to the shared folder
- Open Google Colab and upload the result files
- Visualize average price by amenity
import pandas as pd
import matplotlib.pyplot as plt
# The exported CSV has no header row, so column names are assigned on load.
df_price_by_amenity = pd.read_csv('price_by_amenity.csv', names=['Amenity', 'Average Price'])
df_price_by_amenity['Amenity'] = df_price_by_amenity['Amenity'].astype(str)
plt.figure(figsize=(12, 6))
plt.bar(df_price_by_amenity['Amenity'], df_price_by_amenity['Average Price'].astype(float))
plt.xlabel('Amenity')
plt.ylabel('Average Price')
plt.title('Average Price by Amenity')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
- Visualize average price by number of bedrooms
import pandas as pd
import matplotlib.pyplot as plt
# Load the exported CSV (no header row) if it is not already in memory.
df_price_by_bedrooms = pd.read_csv('price_by_bedrooms.csv', names=['Bedrooms', 'Average Price'])
plt.figure(figsize=(10, 6))
plt.bar(df_price_by_bedrooms['Bedrooms'], df_price_by_bedrooms['Average Price'].astype(float))
plt.xlabel('Number of Bedrooms')
plt.ylabel('Average Price')
plt.title('Average Price by Number of Bedrooms')
plt.show()
- Visualize the share of average price by amenity
import matplotlib.pyplot as plt
df_price_by_amenity['Amenity'] = df_price_by_amenity['Amenity'].astype(str)
df_price_by_amenity['Average Price'] = df_price_by_amenity['Average Price'].astype(float)
# autopct computes the percentage shares directly, so no extra column is needed.
plt.figure(figsize=(10, 10))
plt.pie(df_price_by_amenity['Average Price'], labels=df_price_by_amenity['Amenity'], autopct='%1.1f%%', startangle=140)
plt.title('Share of Average Price by Amenity')
plt.axis('equal')
plt.show()
- Visualize the share of average price by number of bedrooms
import matplotlib.pyplot as plt
df_price_by_bedrooms['Bedrooms'] = df_price_by_bedrooms['Bedrooms'].astype(str)
df_price_by_bedrooms['Average Price'] = df_price_by_bedrooms['Average Price'].astype(float)
# autopct computes the percentage shares directly, so no extra column is needed.
plt.figure(figsize=(10, 10))
plt.pie(df_price_by_bedrooms['Average Price'], labels=df_price_by_bedrooms['Bedrooms'], autopct='%1.1f%%', startangle=140)
plt.title('Share of Average Price by Number of Bedrooms')
plt.axis('equal')
plt.show()
- Back in the Cloudera VM, make sure Oozie is running
oozie admin -oozie http://localhost:11000/oozie -status
- Create the workflow directory structure
mkdir -p ~/oozie_workflow
cd ~/oozie_workflow
- Create workflow.xml to define the Oozie workflow; as written below, it chains two MapReduce pass-through jobs over the exported CSVs
gedit workflow.xml
- Contents of workflow.xml
<workflow-app name="price-data-processing" xmlns="uri:oozie:workflow:0.5">
<global>
<configuration>
<property>
<name>mapreduce.job.queuename</name>
<value>default</value>
</property>
</configuration>
</global>
<start to="process-amenity-data"/>
<action name="process-amenity-data">
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.mapper.class</name>
<value>org.apache.hadoop.mapred.lib.IdentityMapper</value>
</property>
<property>
<name>mapred.reducer.class</name>
<value>org.apache.hadoop.mapred.lib.IdentityReducer</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>${amenityInputPath}</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>${amenityOutputPath}</value>
</property>
<property>
<name>mapred.job.name</name>
<value>ProcessAmenityData</value>
</property>
</configuration>
</map-reduce>
<ok to="process-bedrooms-data"/>
<error to="fail"/>
</action>
<action name="process-bedrooms-data">
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.mapper.class</name>
<value>org.apache.hadoop.mapred.lib.IdentityMapper</value>
</property>
<property>
<name>mapred.reducer.class</name>
<value>org.apache.hadoop.mapred.lib.IdentityReducer</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>${bedroomsInputPath}</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>${bedroomsOutputPath}</value>
</property>
<property>
<name>mapred.job.name</name>
<value>ProcessBedroomsData</value>
</property>
</configuration>
</map-reduce>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
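Before uploading, a quick well-formedness check of workflow.xml catches one common class of submission errors. A minimal sketch (it checks XML syntax only, not full Oozie schema validity; the skeleton below is a trimmed copy of the workflow):

```python
import xml.etree.ElementTree as ET

# Minimal skeleton of the workflow; ET.fromstring raises ParseError if the
# XML is malformed (unbalanced tags, bad attributes, etc.).
xml_text = """<workflow-app name="price-data-processing" xmlns="uri:oozie:workflow:0.5">
  <start to="process-amenity-data"/>
  <kill name="fail"><message>failed</message></kill>
  <end name="end"/>
</workflow-app>"""

root = ET.fromstring(xml_text)
print(root.tag)  # {uri:oozie:workflow:0.5}workflow-app
```

For the real file, use ET.parse('workflow.xml') instead of the inline string.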
- Create job.properties to set the Oozie configuration (HDFS paths, JobTracker, workflow location).
gedit job.properties
- Contents of job.properties
nameNode=hdfs://quickstart.cloudera:8020
jobTracker=quickstart.cloudera:8032
queueName=default
examplesRoot=oozie_workflow
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/workflow.xml
amenityInputPath=/user/${user.name}/oozie_data/price_by_amenity.csv
amenityOutputPath=/user/${user.name}/oozie_output/amenity_processed
bedroomsInputPath=/user/${user.name}/oozie_data/price_by_bedrooms.csv
bedroomsOutputPath=/user/${user.name}/oozie_output/bedrooms_processed
oozie.use.system.libpath=true
oozie.action.sharelib.for.mapreduce=mapreduce
- Create an HDFS directory for the input files
hdfs dfs -mkdir /user/cloudera/oozie_data
- Upload the exported CSV files to HDFS
hdfs dfs -put /home/cloudera/price_by_amenity.csv /user/cloudera/oozie_data/
hdfs dfs -put /home/cloudera/price_by_bedrooms.csv /user/cloudera/oozie_data/
- Create an HDFS directory for workflow.xml and job.properties
hdfs dfs -mkdir /user/cloudera/oozie_workflow
- Upload workflow.xml and job.properties to HDFS
hdfs dfs -put /home/cloudera/oozie_workflow/workflow.xml /user/cloudera/oozie_workflow/
hdfs dfs -put /home/cloudera/oozie_workflow/job.properties /user/cloudera/oozie_workflow/
- Run the Oozie workflow
oozie job -oozie http://quickstart.cloudera:11000/oozie -config /home/cloudera/oozie_workflow/job.properties -run
- Start the Spark shell in a terminal
spark-shell
- Run a word count job and time it, then enable caching
price_by_amenity
val t0 = System.nanoTime()
val data = sc.textFile("hdfs:///user/cloudera/airbnb/price_by_amenity")
val words = data.flatMap(_.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
counts.count()
val t1 = System.nanoTime()
println("Duration: " + (t1 - t0) / 1e9 + " seconds")
Enable caching
val cachedData = data.cache()
cachedData.count() // cache() is lazy; an action is needed to materialize the cached RDD
price_by_bedrooms
val t0 = System.nanoTime()
val data = sc.textFile("hdfs:///user/cloudera/airbnb/price_by_bedrooms")
val words = data.flatMap(_.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
counts.count()
val t1 = System.nanoTime()
println("Duration: " + (t1 - t0) / 1e9 + " seconds")
Enable caching
val cachedData = data.cache()
cachedData.count() // cache() is lazy; an action is needed to materialize the cached RDD
- Restart spark-shell with tuning options (note that a new shell starts with an empty cache, so this run measures the effect of the tuning configuration)
spark-shell --executor-memory 2G --driver-memory 1G --conf spark.default.parallelism=4
- Rerun the word count job
price_by_amenity
val t0 = System.nanoTime()
val data = sc.textFile("hdfs:///user/cloudera/airbnb/price_by_amenity")
val words = data.flatMap(_.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
counts.count()
val t1 = System.nanoTime()
println("Duration: " + (t1 - t0) / 1e9 + " seconds")
price_by_bedrooms
val t0 = System.nanoTime()
val data = sc.textFile("hdfs:///user/cloudera/airbnb/price_by_bedrooms")
val words = data.flatMap(_.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
counts.count()
val t1 = System.nanoTime()
println("Duration: " + (t1 - t0) / 1e9 + " seconds")
- Compare the durations before and after tuning.