<a href="https://colab.research.google.com/github/hyulianton/BigData/blob/main/Analisis_Tren_Kesehatan_Masyarakat_dari_Data_Penyakit_di_Berbagai_Wilayah.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install Java Development Kit (JDK) 8

In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

Download Spark 3.5.6

In [6]:
!wget https://dlcdn.apache.org/spark/spark-3.5.6/spark-3.5.6-bin-hadoop3.tgz

--2025-07-01 14:31:47--  https://dlcdn.apache.org/spark/spark-3.5.6/spark-3.5.6-bin-hadoop3.tgz
Resolving dlcdn.apache.org (dlcdn.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to dlcdn.apache.org (dlcdn.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 400923510 (382M) [application/x-gzip]
Saving to: ‘spark-3.5.6-bin-hadoop3.tgz’


2025-07-01 14:31:53 (64.9 MB/s) - ‘spark-3.5.6-bin-hadoop3.tgz’ saved [400923510/400923510]



Extract the Spark archive

In [7]:
!tar -xf spark-3.5.6-bin-hadoop3.tgz

Set the environment variables

In [8]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.6-bin-hadoop3"
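
A wrong path in either variable tends to surface only later, as an opaque error when the Spark session starts. The following is a minimal sketch of an early check using only the standard library; `missing_dirs` is a hypothetical helper, not part of Spark or findspark.

```python
import os

def missing_dirs(var_names, environ=os.environ):
    """Return the names of variables that are unset or do not point to a directory."""
    return [
        name for name in var_names
        if not os.path.isdir(environ.get(name, ""))
    ]

# In the notebook this would run right after the two assignments above.
problems = missing_dirs(["JAVA_HOME", "SPARK_HOME"])
if problems:
    print("Check these variables before starting Spark:", problems)
```

An empty result means both directories exist; it does not confirm that they contain a working JDK or Spark distribution.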

Install the Spark libraries

In [9]:
!pip install findspark -q -q -q
# Pin PySpark to the same version as the Spark distribution downloaded above
!pip install pyspark==3.5.6 -q -q -q

# 🧠 **Project Title**: Analyzing Public Health Trends from Disease Data Across Regions

### 🎯 Project Objective

Build a data pipeline that covers raw data ingestion, cleaning, batch processing, and visualization of disease trends by geographic location. Students must show how these patterns can support public decisions, for example in allocating health resources.

----------

## 🏗️ Data Pipeline Architecture

1.  **Data Ingestion (Extract)**  
    Load a disease dataset from the `sklearn.datasets` library and treat it as simulated daily historical data.
    
2.  **Data Cleaning (Transform)**  
    Remove missing values and outliers, and create new features.
    
3.  **Batch Processing (Load)**  
    Use PySpark for monthly aggregation by region.
    
4.  **Data Storage**  
    Simulate storage by writing to a Parquet file.
    
5.  **Visualization**  
    Interactive dashboard with `Plotly` or `Dash`.
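
The first three stages can be sketched as plain functions before any Spark code is involved. This is a simplified illustration using only the standard library and synthetic records; the function names, the city subset, and the 10% missing rate are assumptions made for the example, and the storage and visualization stages are left out.

```python
import random
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean

CITIES = ["Jakarta", "Bandung", "Surabaya"]  # illustrative subset

def extract(n_rows, seed=0):
    """Stage 1: generate synthetic daily records in place of a real source."""
    rng = random.Random(seed)
    start = date(2023, 1, 1)
    rows = []
    for _ in range(n_rows):
        rows.append({
            "date": start + timedelta(days=rng.randint(0, 364)),
            "location": rng.choice(CITIES),
            # Roughly 10% of readings are missing to give the cleaning step work.
            "bmi": None if rng.random() < 0.1 else round(rng.uniform(15, 45), 1),
        })
    return rows

def transform(rows):
    """Stage 2: drop rows with a missing BMI and add a month feature."""
    return [dict(r, month=r["date"].month) for r in rows if r["bmi"] is not None]

def aggregate(rows):
    """Stage 3: average BMI per (location, month)."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["location"], r["month"])].append(r["bmi"])
    return {key: mean(values) for key, values in groups.items()}

summary = aggregate(transform(extract(500)))
for (city, month), avg_bmi in sorted(summary.items())[:3]:
    print(city, month, round(avg_bmi, 2))
```

Keeping each stage as a function with a single input and output makes it easier to swap the in-memory version for a Spark job later without touching the other stages.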
    

----------

## 📦 Dataset Used

We will use the `load_diabetes()` dataset from `sklearn.datasets` and add simulated location and date columns.

----------

## 🧪 Project Implementation (Python + PySpark)

### 1. Set Up the Environment

In [10]:
from sklearn.datasets import load_diabetes
import pandas as pd
import numpy as np
from datetime import timedelta, datetime
import random

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, month, avg

# Create the Spark session
spark = SparkSession.builder.appName("BigDataPipelineDiabetes").getOrCreate()

### 2. Simulate a Realistic Dataset

In [18]:
# Load the dataset from sklearn
data = load_diabetes()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Add random location and date columns to simulate daily batches
locations = ['Jakarta', 'Bandung', 'Surabaya', 'Medan', 'Makassar']
start_date = datetime(2023, 1, 1)

df['location'] = np.random.choice(locations, len(df))
# randint includes both endpoints; 0..364 keeps every date inside 2023
df['date'] = [start_date + timedelta(days=random.randint(0, 364)) for _ in range(len(df))]

df.head()

|   | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target | location | date |
|---|-----|-----|-----|----|----|----|----|----|----|----|--------|----------|------|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 | 151.0 | Jakarta | 2023-11-17 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 | 75.0 | Makassar | 2023-06-26 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 | 141.0 | Medan | 2023-10-25 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 | 206.0 | Surabaya | 2023-10-24 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 | 135.0 | Jakarta | 2023-08-25 |
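
`random.randint(a, b)` includes both endpoints, which matters when the simulated offsets are meant to stay inside a single calendar year. A small check, using only the standard library, of which day offsets from 2023-01-01 remain in 2023:

```python
from datetime import datetime, timedelta

start_date = datetime(2023, 1, 1)

# 2023 is not a leap year, so valid day offsets are 0..364.
last_valid = start_date + timedelta(days=364)
overflow = start_date + timedelta(days=365)

print(last_valid.date())  # 2023-12-31
print(overflow.date())    # 2024-01-01
```

Once only the month number is kept, a row dated 2024-01-01 is indistinguishable from January 2023, so an upper bound of 364 keeps the simulated year clean.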


### 3. Convert to a Spark DataFrame and Clean the Data

In [19]:
sdf = spark.createDataFrame(df)

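# Clean the data: keep only the relevant columns
sdf_clean = sdf.select('date', 'location', 'bmi', 'bp', 'target')

# Convert the date column and filter out implausible BMI values.
# Caveat: load_diabetes() returns mean-centered, scaled features by default,
# so bmi > 0 keeps only above-average BMI rows instead of removing bad readings.
sdf_clean = sdf_clean.withColumn("date", to_date("date")) \
                     .filter((col("bmi") > 0) & (col("bmi") < 100))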
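
The features returned by `load_diabetes()` are mean-centered and scaled by default, so a threshold written for raw BMI units behaves differently on them. The sketch below reproduces that kind of scaling on a handful of made-up BMI readings (illustrative values, not taken from the dataset) and counts how many survive a `> 0` filter.

```python
from math import sqrt

raw_bmi = [18.5, 21.0, 23.4, 26.1, 29.8, 34.2]  # made-up readings in kg/m^2

mean_bmi = sum(raw_bmi) / len(raw_bmi)
centered = [x - mean_bmi for x in raw_bmi]

# Scale so the squared values sum to 1, the convention scikit-learn documents
# for the default (scaled) diabetes features.
norm = sqrt(sum(c * c for c in centered))
scaled = [c / norm for c in centered]

kept = [s for s in scaled if s > 0]
print(f"{len(kept)} of {len(scaled)} rows pass bmi > 0")  # 3 of 6 rows pass bmi > 0
```

All six readings are plausible, yet half are dropped. `load_diabetes(scaled=False)`, available in scikit-learn 1.1 and later, returns the features in their original units, where a range check such as 0 to 100 is meaningful.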

### 4. Batch Processing: Aggregate BMI and Blood Pressure by Month and Location

In [22]:
# Add a month column
sdf_monthly = sdf_clean.withColumn("month", month("date"))

# Aggregate
sdf_result = sdf_monthly.groupBy("location", "month") \
                        .agg(avg("bmi").alias("avg_bmi"),
                             avg("bp").alias("avg_bp"),
                             avg("target").alias("avg_progression"))
sdf_result = sdf_result.orderBy("location", "month")

sdf_result.show()

+--------+-----+--------------------+--------------------+------------------+
|location|month|             avg_bmi|              avg_bp|   avg_progression|
+--------+-----+--------------------+--------------------+------------------+
| Bandung|    1| 0.03475090467166331|  0.0941722560068712|             236.0|
| Bandung|    2| 0.07301323329443166| 0.08212227759139878|             263.5|
| Bandung|    3| 0.04957082068752428| 0.03564378941743375|            268.25|
| Bandung|    4|0.036475403989872555| 0.02829674543497142|             161.2|
| Bandung|    5| 0.04660683748435207| 0.06548183120812735|229.33333333333334|
| Bandung|    6|0.030439656376140097| 0.03318461014896999|176.71428571428572|
| Bandung|    7|0.028284032228378497|-0.01599897522030...|             302.0|
| Bandung|    8| 0.06331292462950448| 0.03220093844158448|             241.0|
| Bandung|    9|0.018583723563451296| 0.01498668356233818|              94.5|
| Bandung|   10| 0.03582871674554408|-0.01599897522030...|      
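
The same aggregation can be cross-checked on a small sample with pandas. This is a sketch on made-up rows, assuming pandas is available; it is not a replacement for the Spark job. It groups on year and month together, which avoids merging same-numbered months if the data ever spans more than one year.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-03", "2024-01-09"]),
    "location": ["Bandung", "Bandung", "Bandung", "Bandung"],
    "bmi": [0.02, 0.04, 0.06, 0.10],
})

# Group on year and month so January 2023 and January 2024 stay separate.
monthly = (
    sample.assign(year=sample["date"].dt.year, month=sample["date"].dt.month)
          .groupby(["location", "year", "month"], as_index=False)
          .agg(avg_bmi=("bmi", "mean"))
          .sort_values(["location", "year", "month"])
)
print(monthly)
```

With a bare month number as the only time key, the two January groups above would collapse into one.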

### 5. Simulate Storage (Write to Parquet)

In [14]:
sdf_result.write.mode("overwrite").parquet("/tmp/diabetes_summary.parquet")

### 6. Visualization: Interactive Dashboard (Optional but Recommended)

Example visualization with `Plotly`:

In [23]:
import plotly.express as px

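# Load the summary back from Parquet; sort again because read order is not guaranteed
pdf_result = (spark.read.parquet("/tmp/diabetes_summary.parquet")
                   .orderBy("location", "month")
                   .toPandas())

fig = px.line(pdf_result, x="month", y="avg_bmi", color="location",
              title="Average BMI by Month and Location")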
fig.show()


## 📊 Real-World Impact Demonstrated by Students

-   Show health trends (BMI and blood pressure) across cities over the year.
    
-   Provide insight for public policy, such as expanding health education or supplying medical equipment in cities with worsening trends.
    
-   Demonstrate the ability to take raw data through to presentation-ready visualizations.
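
One way to turn monthly averages into a ranking for decision makers is to fit a straight line per city and compare slopes. The sketch below uses ordinary least squares on made-up monthly values; the city figures and the `trend_slope` helper are illustrative assumptions, not results from this notebook.

```python
def trend_slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den

# Hypothetical monthly average progression scores per city.
monthly_avg = {
    "Jakarta": [150, 152, 155, 159],
    "Medan": [160, 158, 157, 155],
}

# Cities with the steepest upward trend come first.
ranked = sorted(monthly_avg, key=lambda c: trend_slope(monthly_avg[c]), reverse=True)
print(ranked)  # ['Jakarta', 'Medan']
```

A slope on a few monthly points is a coarse signal, so it is better suited to flagging cities for closer review than to driving allocation on its own.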