# Week 2: Hands-On Data Collection (Industrial Dataset)

This notebook walks through data acquisition, building a *data dictionary*, validating its descriptions, and archiving the resulting artifacts.
Run only the cells you need (file upload / URL / Kaggle API).

Prepared by: **Budi Sunaryo**

Course: **Data Analytics**

Study Program: **Industrial Engineering, Faculty of Industrial Technology, Universitas Bung Hatta**

## 1) Environment Setup

In [14]:
# (Optional) install extra packages in Colab
# !pip install -q ydata-profiling pyarrow

import sys, os, pandas as pd, numpy as np
print('pandas:', pd.__version__)


pandas: 2.2.2


## 2) Data Acquisition: choose one option, **(A) Upload**, **(B) URL**, or **(C) Kaggle API (advanced)**

In [15]:
# Option (A): Upload a file from your computer (CSV/Excel/JSON)
try:
    from google.colab import files  # type: ignore
except ImportError as e:
    # Only a missing Colab environment is skipped; real loading errors still surface.
    print("Skip this cell if you are not using Colab upload:", e)
else:
    uploaded = files.upload()
    fname = next(iter(uploaded)) if uploaded else None
    if fname:
        if fname.lower().endswith('.csv'):
            df = pd.read_csv(fname)
        elif fname.lower().endswith(('.xls', '.xlsx')):
            df = pd.read_excel(fname)
        elif fname.lower().endswith('.json'):
            df = pd.read_json(fname, lines=False)
        else:
            raise ValueError("Unsupported format. Use CSV/Excel/JSON.")
        print("Loaded:", fname, df.shape)


Saving supply_chain_orders_sample.csv to supply_chain_orders_sample (1).csv
Loaded: supply_chain_orders_sample (1).csv (100, 18)
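
The extension checks in Option (A) can also be expressed as a lookup table, so that supporting another format becomes a one-line change. The following is a minimal sketch and not part of the original notebook: `READERS` and `load_table` are hypothetical names, and the function assumes a local file path.

```python
import os
import pandas as pd

# Hypothetical mapping from file extension to the pandas reader used in Option (A).
READERS = {
    ".csv": pd.read_csv,
    ".xls": pd.read_excel,
    ".xlsx": pd.read_excel,
    ".json": pd.read_json,
}

def load_table(path):
    """Load a tabular file by dispatching on its case-insensitive extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in READERS:
        raise ValueError(f"Unsupported format {ext!r}. Use CSV/Excel/JSON.")
    return READERS[ext](path)
```

In the upload cell this could be called as `df = load_table(fname)` once `fname` is known.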


In [17]:
# Option (B): Fetch from a URL (CSV/Parquet/JSON). Replace the URL with a direct link to your file.
url = "https://raw.githubusercontent.com/plotly/datasets/master/2014_usa_states.csv"  # <-- Replace
try:
    if url.endswith(".parquet"):
        df = pd.read_parquet(url)
    elif url.endswith(".json"):
        df = pd.read_json(url)
    else:
        df = pd.read_csv(url)
    print("Loaded from URL:", df.shape)
except Exception as e:
    print("If this fails, make sure the URL points directly to the file:", e)


Loaded from URL: (52, 4)
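
Option (B) picks a reader from the end of the URL string, which can misfire when a direct link carries a query string (for example `...?raw=1`). A small sketch of one way to extract the extension from the URL path only is given here; `url_extension` is a hypothetical helper, and the `example.com` address is a placeholder, not a real dataset.

```python
import posixpath
from urllib.parse import urlparse

def url_extension(url):
    """Return the lowercase extension of the URL path, ignoring query and fragment."""
    return posixpath.splitext(urlparse(url).path)[1].lower()

# Placeholder URL with a query string, then the URL used in Option (B).
print(url_extension("https://example.com/data/orders.parquet?raw=1"))
print(url_extension("https://raw.githubusercontent.com/plotly/datasets/master/2014_usa_states.csv"))
```

The returned value (`.parquet`, `.csv`, ...) could then replace the `url.endswith(...)` tests.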


In [None]:
# Option (C): Kaggle API (advanced); requires kaggle.json
# from google.colab import files  # type: ignore
# files.upload()  # upload kaggle.json
# !mkdir -p ~/.kaggle && cp kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
# !kaggle datasets download -d shivamb/machine-predictive-maintenance-classification -p /content/data
# !unzip -o /content/data/*.zip -d /content/data
# import pandas as pd, glob
# f = glob.glob('/content/data/**/*.csv', recursive=True)[0]
# df = pd.read_csv(f)
# print("Loaded Kaggle:", df.shape, f)


## 3) Initial Inspection

In [18]:
try:
    df
except NameError:
    raise RuntimeError("DataFrame 'df' does not exist yet. Run one of the acquisition options first.")
df.head()


|   | Rank | State | Postal | Population |
|---|---|---|---|---|
| 0 | 1 | Alabama | AL | 4849377.0 |
| 1 | 2 | Alaska | AK | 736732.0 |
| 2 | 3 | Arizona | AZ | 6731484.0 |
| 3 | 4 | Arkansas | AR | 2966369.0 |
| 4 | 5 | California | CA | 38802500.0 |


In [19]:
df.info()
df.describe(include='all').T.head(20)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Rank        52 non-null     int64  
 1   State       52 non-null     object 
 2   Postal      52 non-null     object 
 3   Population  52 non-null     float64
dtypes: float64(1), int64(1), object(2)
memory usage: 1.8+ KB


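|   | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Rank | 52.0 |  |  |  | 26.5 | 15.154757 | 1.0 | 13.75 | 26.5 | 39.25 | 52.0 |
| State | 52.0 | 52.0 | Alabama | 1.0 |  |  |  |  |  |  |  |
| Postal | 52.0 | 52.0 | AL | 1.0 |  |  |  |  |  |  |  |
| Population | 52.0 |  |  |  | 6200104.846154 | 7063773.826174 | 584153.0 | 1796360.5 | 4191848.0 | 6824438.5 | 38802500.0 |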
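
Beyond `info()` and `describe()`, a first inspection often also counts duplicate rows and missing cells. The sketch below shows one possible way to collect those numbers; `quick_checks` is a hypothetical helper and `demo` is a made-up three-row frame used only to illustrate the call.

```python
import pandas as pd

def quick_checks(df):
    """Return basic quality indicators: rows, fully duplicated rows, missing cells per column."""
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_by_column": df.isna().sum().to_dict(),
    }

# Made-up frame with one repeated row, for illustration only.
demo = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Alaska"],
    "Population": [4849377.0, 736732.0, 736732.0],
})
report = quick_checks(demo)
print(report)
```

In the notebook the same function could be applied to the loaded data with `quick_checks(df)`.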


## 4) Data Dictionary (automatic + manual)

In [20]:
sample_values = df.head(3).astype(str)
data_dict = []
for col in df.columns:
    data_dict.append({
        "column": col,
        "dtype": str(df[col].dtype),
        "non_null": int(df[col].notna().sum()),
        "nulls": int(df[col].isna().sum()),
        "missing_rate": float(df[col].isna().mean()),
        "example_values": sample_values[col].tolist()
    })
ddf = pd.DataFrame(data_dict)
ddf


|   | column | dtype | non_null | nulls | missing_rate | example_values |
|---|---|---|---|---|---|---|
| 0 | Rank | int64 | 52 | 0 | 0.0 | [1, 2, 3] |
| 1 | State | object | 52 | 0 | 0.0 | [Alabama, Alaska, Arizona] |
| 2 | Postal | object | 52 | 0 | 0.0 | [AL, AK, AZ] |
| 3 | Population | float64 | 52 | 0 | 0.0 | [4849377.0, 736732.0, 6731484.0] |


**Fill in each column's description manually in the `description` column below.**

In [21]:
ddf["description"] = ""
ddf


|   | column | dtype | non_null | nulls | missing_rate | example_values | description |
|---|---|---|---|---|---|---|---|
| 0 | Rank | int64 | 52 | 0 | 0.0 | [1, 2, 3] |  |
| 1 | State | object | 52 | 0 | 0.0 | [Alabama, Alaska, Arizona] |  |
| 2 | Postal | object | 52 | 0 | 0.0 | [AL, AK, AZ] |  |
| 3 | Population | float64 | 52 | 0 | 0.0 | [4849377.0, 736732.0, 6731484.0] |  |
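
Instead of editing the CSV by hand, the `description` column can also be filled from a Python mapping. This is a sketch under assumptions: `ddf_demo` stands in for the `ddf` built above, and the description texts are hypothetical placeholders that should be replaced with wording checked against the actual data source.

```python
import pandas as pd

# Stand-in for the `ddf` built above, reduced to its `column` field.
ddf_demo = pd.DataFrame({"column": ["Rank", "State", "Postal", "Population"]})

# Hypothetical descriptions; "Population" is left out on purpose to show the gap check.
descriptions = {
    "Rank": "Rank value as provided in the source file",
    "State": "State name",
    "Postal": "Postal abbreviation of the state",
}

# Columns absent from the mapping get an empty description.
ddf_demo["description"] = ddf_demo["column"].map(descriptions).fillna("")
missing = ddf_demo.loc[ddf_demo["description"].eq(""), "column"].tolist()
print("Columns still lacking a description:", missing)
```

Keeping the mapping in the notebook makes the descriptions reproducible when the dictionary is regenerated.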


## 5) Profiling (optional)

In [None]:
# from ydata_profiling import ProfileReport
# profile = ProfileReport(df, title="Profiling Report - Minggu 2", minimal=True)
# profile.to_file("profiling_report.html")
# print("Saved profiling_report.html")


## 6) Save Artifacts

In [22]:
import datetime, os
today = datetime.date.today()
os.makedirs("docs", exist_ok=True)
os.makedirs("data/raw", exist_ok=True)

# Save the data dictionary
ddf.to_csv("docs/data_dictionary.csv", index=False)

# Save the acquisition log (a template; fill in the placeholders afterwards)
with open("docs/acquisition_log.md", "w", encoding="utf-8") as f:
    f.write("# Acquisition Log\n")
    f.write(f"- Date: {today}\n")
    f.write("- Source: <fill in your data source>\n")
    f.write("- Method: <CSV/API/Log>\n")
    f.write("- Tooling: <Colab/Requests/etc>\n")
    f.write("- Notes: <issues & decisions>\n")

print("Saved: docs/data_dictionary.csv & docs/acquisition_log.md")


Saved: docs/data_dictionary.csv & docs/acquisition_log.md
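
The cell above creates `data/raw` but writes nothing into it. If a copy of the loaded data should travel with the artifacts, one option is to save a snapshot there together with a checksum for the acquisition log. The sketch below assumes that intent; `save_raw_snapshot` is a hypothetical helper, and the snapshot is a re-serialized CSV rather than a byte-identical copy of the original file.

```python
import hashlib
import os
import pandas as pd

def save_raw_snapshot(df, name, raw_dir="data/raw"):
    """Write df to raw_dir as CSV and return (path, SHA-256 hex digest of the written file)."""
    os.makedirs(raw_dir, exist_ok=True)
    path = os.path.join(raw_dir, name)
    df.to_csv(path, index=False)
    with open(path, "rb") as fh:
        digest = hashlib.sha256(fh.read()).hexdigest()
    return path, digest
```

A call such as `path, digest = save_raw_snapshot(df, "dataset.csv")` returns the values that could be recorded under `Notes` in `docs/acquisition_log.md`; the file name `dataset.csv` is only an example.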


## 6a) Upload the data dictionary with completed descriptions

In [24]:
from google.colab import files  # type: ignore
import os
import pandas as pd

uploaded = files.upload()  # choose the edited data_dictionary.csv
fname = next(iter(uploaded))  # Colab may rename duplicates, e.g. "data_dictionary (1).csv"
ddf = pd.read_csv(fname)
os.makedirs("docs", exist_ok=True)
ddf.to_csv("docs/data_dictionary.csv", index=False)  # overwrite the old file
print("✔️ data_dictionary.csv has been updated in docs/")


Saving data_dictionary.csv to data_dictionary.csv
✔️ data_dictionary.csv has been updated in docs/


## 6b) Validate the **Data Dictionary** (required before archiving)

In [25]:
# Validate the data dictionary
import os, pandas as pd

# Load from file if 'ddf' is not in memory
try:
    ddf
except NameError:
    if os.path.exists("docs/data_dictionary.csv"):
        ddf = pd.read_csv("docs/data_dictionary.csv")
        print("Loaded ddf from docs/data_dictionary.csv:", ddf.shape)
    else:
        raise FileNotFoundError("docs/data_dictionary.csv not found. Run the 'Save Artifacts' cell first.")

if "description" not in ddf.columns:
    raise ValueError("The data dictionary has no 'description' column. Add it and run this cell again.")

# Fill NaN before casting: astype(str) would turn NaN into the text "nan",
# which would wrongly count as a filled-in description.
is_blank = ddf["description"].fillna("").astype(str).str.strip().eq("")
n_missing = int(is_blank.sum())
total = int(len(ddf))
print(f"Description summary: {total - n_missing} filled / {total} total columns; {n_missing} empty.")

os.makedirs("docs", exist_ok=True)
with open("docs/validation_report.md", "w", encoding="utf-8") as f:
    f.write("# Validation Report - Data Dictionary\n")
    f.write(f"- Total columns: {total}\n")
    f.write(f"- Missing descriptions: {n_missing}\n")
    if n_missing > 0:
        f.write("- Columns without a description:\n")
        for col in ddf.loc[is_blank, "column"].astype(str).tolist():
            f.write(f"  - {col}\n")

ALLOW_MISSING = False  # set True to bypass the check (not recommended)
if n_missing > 0 and not ALLOW_MISSING:
    raise AssertionError("❌ Some 'description' entries are still empty. Complete them before archiving.")
else:
    print("✅ Data dictionary is valid for archiving.")


Description summary: 4 filled / 4 total columns; 0 empty.
✅ Data dictionary is valid for archiving.
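
Blank checks on text columns loaded with `pd.read_csv` need some care, because empty cells come back as `NaN` rather than as empty strings. The short sketch below contrasts two ways of testing for blanks on a made-up series `desc`; the variable names are illustrative only.

```python
import pandas as pd

# Made-up descriptions: one filled, one empty, one whitespace-only, one missing (NaN).
desc = pd.Series(["State name", "", "   ", float("nan")])

# Casting first: the missing value is no longer recognised as blank.
naive = desc.astype(str).str.strip().eq("")

# Filling missing values first keeps them detectable as blank.
robust = desc.fillna("").astype(str).str.strip().eq("")

print("naive :", naive.tolist())
print("robust:", robust.tolist())
```

Only the second variant flags the missing entry, which is why filling `NaN` before the string comparison is the safer order for this validation step.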


## 7) Archive & Download Artifacts (automatic)

In [26]:
import os, glob, datetime, zipfile

for p in ["docs", "data/raw"]:
    if not os.path.exists(p):
        raise FileNotFoundError(f"Folder not found: {p}. Run the 'Save Artifacts' cell first.")

stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
zip_path = os.path.abspath(f"Minggu2_artefak_{stamp}.zip")

# Archive only the artifact folders; zipping "." would also pull in unrelated
# files from the working directory, including earlier ZIPs.
with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
    for p in ["docs", "data/raw"]:
        for root, _dirs, names in os.walk(p):
            for name in names:
                zf.write(os.path.join(root, name))

print("Files in docs/:", glob.glob("docs/*"))
print("Files in data/raw/:", glob.glob("data/raw/*"))
print("ZIP created:", zip_path)

try:
    from google.colab import files  # type: ignore
    files.download(zip_path)
    if os.path.exists("docs/data_dictionary.csv"):
        files.download("docs/data_dictionary.csv")
    if os.path.exists("docs/acquisition_log.md"):
        files.download("docs/acquisition_log.md")
except Exception as e:
    print("If you are not in Colab, ignore this message. The ZIP is saved locally:", e)


Files in docs/: ['docs/acquisition_log.md', 'docs/validation_report.md', 'docs/data_dictionary.csv']
Files in data/raw/: []
ZIP created: /content/Minggu2_artefak_20250921-145708.zip
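
Before handing in the archive, it can be worth confirming what actually ended up inside it. A minimal sketch using the standard-library `zipfile` module is shown here; `list_archive` is a hypothetical helper name.

```python
import zipfile

def list_archive(zip_path):
    """Return the sorted member names stored in a ZIP archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted(zf.namelist())
```

Calling `list_archive(zip_path)` after the archive cell should list the files under `docs/` and, if any were saved, under `data/raw/`.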

