# Data Analysis Project: Bike Sharing Dataset
- **Name:** Nico Gilang Aprully
- **Email:** m180d4ky2887@bangkit.academy
- **Dicoding ID:** nikogilang

## Defining the Business Questions

1. Do bike usage patterns differ between summer, fall, winter, and spring?
2. How do bike usage patterns on weekends (days off) compare with those on working days?

## Import All Packages/Libraries Used

In [1]:
!pip install numpy pandas scipy matplotlib seaborn jupyter streamlit babel plotly



In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go


## Data Wrangling

### Gathering Data

In [3]:
day_url = 'https://raw.githubusercontent.com/nikogilang/analisis-data-python/main/Bike-sharing-dataset/day.csv'
hour_url = 'https://raw.githubusercontent.com/nikogilang/analisis-data-python/main/Bike-sharing-dataset/hour.csv'

In [4]:
df_day = pd.read_csv(day_url, delimiter=",")
df_hour = pd.read_csv(hour_url, delimiter=",")

The full daily DataFrame

In [5]:
df_day

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
1,2,2011-01-02,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
2,3,2011-01-03,1,0,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
3,4,2011-01-04,1,0,1,0,2,1,1,0.200000,0.212122,0.590435,0.160296,108,1454,1562
4,5,2011-01-05,1,0,1,0,3,1,1,0.226957,0.229270,0.436957,0.186900,82,1518,1600
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
726,727,2012-12-27,1,1,12,0,4,1,2,0.254167,0.226642,0.652917,0.350133,247,1867,2114
727,728,2012-12-28,1,1,12,0,5,1,2,0.253333,0.255046,0.590000,0.155471,644,2451,3095
728,729,2012-12-29,1,1,12,0,6,0,2,0.253333,0.242400,0.752917,0.124383,159,1182,1341
729,730,2012-12-30,1,1,12,0,0,0,1,0.255833,0.231700,0.483333,0.350754,364,1432,1796


The full hourly DataFrame

In [6]:
df_hour

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0000,3,13,16
1,2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.80,0.0000,8,32,40
2,3,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.80,0.0000,5,27,32
3,4,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0000,3,10,13
4,5,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0000,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17374,17375,2012-12-31,1,1,12,19,0,1,1,2,0.26,0.2576,0.60,0.1642,11,108,119
17375,17376,2012-12-31,1,1,12,20,0,1,1,2,0.26,0.2576,0.60,0.1642,8,81,89
17376,17377,2012-12-31,1,1,12,21,0,1,1,1,0.26,0.2576,0.60,0.1642,7,83,90
17377,17378,2012-12-31,1,1,12,22,0,1,1,1,0.26,0.2727,0.56,0.1343,13,48,61


### Assessing Data

In [7]:
df_day.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     731 non-null    int64  
 1   dteday      731 non-null    object 
 2   season      731 non-null    int64  
 3   yr          731 non-null    int64  
 4   mnth        731 non-null    int64  
 5   holiday     731 non-null    int64  
 6   weekday     731 non-null    int64  
 7   workingday  731 non-null    int64  
 8   weathersit  731 non-null    int64  
 9   temp        731 non-null    float64
 10  atemp       731 non-null    float64
 11  hum         731 non-null    float64
 12  windspeed   731 non-null    float64
 13  casual      731 non-null    int64  
 14  registered  731 non-null    int64  
 15  cnt         731 non-null    int64  
dtypes: float64(4), int64(11), object(1)
memory usage: 91.5+ KB


In [8]:
df_hour.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     17379 non-null  int64  
 1   dteday      17379 non-null  object 
 2   season      17379 non-null  int64  
 3   yr          17379 non-null  int64  
 4   mnth        17379 non-null  int64  
 5   hr          17379 non-null  int64  
 6   holiday     17379 non-null  int64  
 7   weekday     17379 non-null  int64  
 8   workingday  17379 non-null  int64  
 9   weathersit  17379 non-null  int64  
 10  temp        17379 non-null  float64
 11  atemp       17379 non-null  float64
 12  hum         17379 non-null  float64
 13  windspeed   17379 non-null  float64
 14  casual      17379 non-null  int64  
 15  registered  17379 non-null  int64  
 16  cnt         17379 non-null  int64  
dtypes: float64(4), int64(12), object(1)
memory usage: 2.3+ MB


Check for missing values in the DataFrames

In [9]:
print('Daily DataFrame:')
print(df_day.isna().sum())

print('\nHourly DataFrame:')
print(df_hour.isna().sum())

Daily DataFrame:
instant       0
dteday        0
season        0
yr            0
mnth          0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64

Hourly DataFrame:
instant       0
dteday        0
season        0
yr            0
mnth          0
hr            0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64


Check for duplicate rows in the DataFrames

In [10]:
print('Daily DataFrame:')
print(df_day.duplicated().sum())

print("Hourly DataFrame:")
print(df_hour.duplicated().sum())

Daily DataFrame:
0
Hourly DataFrame:
0
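
The missing-value and duplicate checks above can be bundled into one reusable helper. The following is a minimal sketch, not the notebook's own code: the function name `quality_report` and the small `sample` frame are hypothetical and only illustrate the idea.

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Return basic data-quality counts for a DataFrame (hypothetical helper)."""
    return {
        "rows": len(df),
        # Total number of missing cells across all columns
        "missing_values": int(df.isna().sum().sum()),
        # Rows that exactly repeat an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Hypothetical sample with one repeated row and one missing value
sample = pd.DataFrame({
    "dteday": ["2011-01-01", "2011-01-02", "2011-01-02", "2011-01-03"],
    "cnt": [985, 801, 801, None],
})

report = quality_report(sample)
print(report)
```

Calling `quality_report(df_day)` and `quality_report(df_hour)` would give the same information as the two cells above in a single dictionary per DataFrame.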


### Cleaning Data

Convert data types

In [11]:
df_day['dteday'] = pd.to_datetime(df_day['dteday'])
df_hour['dteday'] = pd.to_datetime(df_hour['dteday'])

In [12]:
print('dteday_day  : ', df_day['dteday'].dtype)
print('dteday_hour : ', df_hour['dteday'].dtype)

dteday_day  :  datetime64[ns]
dteday_hour :  datetime64[ns]


## Exploratory Data Analysis (EDA)

### Explore ...

In [13]:
df_day.describe()

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
count,731.0,731,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0
mean,366.0,2012-01-01 00:00:00,2.49658,0.500684,6.519836,0.028728,2.997264,0.683995,1.395349,0.495385,0.474354,0.627894,0.190486,848.176471,3656.172367,4504.348837
min,1.0,2011-01-01 00:00:00,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.05913,0.07907,0.0,0.022392,2.0,20.0,22.0
25%,183.5,2011-07-02 12:00:00,2.0,0.0,4.0,0.0,1.0,0.0,1.0,0.337083,0.337842,0.52,0.13495,315.5,2497.0,3152.0
50%,366.0,2012-01-01 00:00:00,3.0,1.0,7.0,0.0,3.0,1.0,1.0,0.498333,0.486733,0.626667,0.180975,713.0,3662.0,4548.0
75%,548.5,2012-07-01 12:00:00,3.0,1.0,10.0,0.0,5.0,1.0,2.0,0.655417,0.608602,0.730209,0.233214,1096.0,4776.5,5956.0
max,731.0,2012-12-31 00:00:00,4.0,1.0,12.0,1.0,6.0,1.0,3.0,0.861667,0.840896,0.9725,0.507463,3410.0,6946.0,8714.0
std,211.165812,,1.110807,0.500342,3.451913,0.167155,2.004787,0.465233,0.544894,0.183051,0.162961,0.142429,0.077498,686.622488,1560.256377,1937.211452


In [14]:
df_hour.describe()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
count,17379.0,17379,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0
mean,8690.0,2012-01-02 04:08:34.552045568,2.50164,0.502561,6.537775,11.546752,0.02877,3.003683,0.682721,1.425283,0.496987,0.475775,0.627229,0.190098,35.676218,153.786869,189.463088
min,1.0,2011-01-01 00:00:00,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.02,0.0,0.0,0.0,0.0,0.0,1.0
25%,4345.5,2011-07-04 00:00:00,2.0,0.0,4.0,6.0,0.0,1.0,0.0,1.0,0.34,0.3333,0.48,0.1045,4.0,34.0,40.0
50%,8690.0,2012-01-02 00:00:00,3.0,1.0,7.0,12.0,0.0,3.0,1.0,1.0,0.5,0.4848,0.63,0.194,17.0,115.0,142.0
75%,13034.5,2012-07-02 00:00:00,3.0,1.0,10.0,18.0,0.0,5.0,1.0,2.0,0.66,0.6212,0.78,0.2537,48.0,220.0,281.0
max,17379.0,2012-12-31 00:00:00,4.0,1.0,12.0,23.0,1.0,6.0,1.0,4.0,1.0,1.0,1.0,0.8507,367.0,886.0,977.0
std,5017.0295,,1.106918,0.500008,3.438776,6.914405,0.167165,2.005771,0.465431,0.639357,0.192556,0.17185,0.19293,0.12234,49.30503,151.357286,181.387599


In [15]:
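# Correlation heatmap of the numeric variables

# numeric_only=True keeps the datetime column 'dteday' out of the correlation matrix
correlation_matrix = df_day.corr(numeric_only=True)
# Pass the colour scale to px.imshow and fix the range to [-1, 1] so zero sits at the midpoint
fig = px.imshow(correlation_matrix, color_continuous_scale='balance', zmin=-1, zmax=1)
fig.update_layout(width=1200, height=600)
fig.update_layout(title="Correlation Plot of Numeric Variables")
fig.show()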
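
A heatmap is easier to interpret alongside a ranked list of correlations with the target column. The sketch below assumes a small hypothetical `sample` frame (the values are invented, not taken from the dataset) and shows how the `cnt` column of a correlation matrix can be extracted and sorted.

```python
import pandas as pd

# Hypothetical data: temp rises with cnt, hum falls as cnt rises
sample = pd.DataFrame({
    "temp": [0.2, 0.4, 0.6, 0.8],
    "hum": [0.9, 0.7, 0.5, 0.3],
    "cnt": [1000, 2000, 3000, 4000],
})

# Pearson correlation of every numeric column with 'cnt', excluding 'cnt' itself
corr_with_cnt = (
    sample.corr(numeric_only=True)["cnt"]
    .drop("cnt")
    .sort_values(ascending=False)
)
print(corr_with_cnt)
```

Applied to `df_day`, the same three calls would list which variables move most strongly with the daily rental count.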

In [16]:
# Distribution plots of the numeric variables

var_numeric = ['temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered', 'cnt']
for col in var_numeric:
    fig = px.histogram(df_day, x=col, title=f'Distribution of {col}')
    fig.update_yaxes(title_text="Count")
    fig.update_layout(width=800, height=600)
    fig.show()

In [17]:
# Season vs. number of rentals

fig = px.box(df_day, x='season', y='cnt')
fig.update_layout(title='Season vs. Number of Rentals')
fig.show()

In [18]:
# Weekday vs. number of rentals

df_day['weekday'] = df_day['weekday'].astype('category')

fig = px.box(df_day, x='weekday', y='cnt')
fig.update_layout(title='Day of the Week vs. Number of Rentals')
fig.show()

## Visualization & Explanatory Analysis

### Question 1: Do bike usage patterns differ between summer, fall, winter, and spring?

In [19]:
# Label the numeric season codes (1-4) so the plot and category_orders use the same names.
# A separate column is used so that df_day['season'] keeps its current values.
season_names = {1: 'Spring', 2: 'Summer', 3: 'Fall', 4: 'Winter'}
df_day['season_name'] = df_day['season'].replace(season_names)

# Box plot of daily rentals by season
fig = px.box(df_day, x='season_name', y='cnt', color='season_name',
             category_orders={'season_name': ['Spring', 'Summer', 'Fall', 'Winter']},
             labels={'season_name': 'Season', 'cnt': 'Bike Rentals'})
fig.update_layout(title='Bike Usage by Season',
                  xaxis_title='Season', yaxis_title='Bike Rentals')
fig.show()

In [20]:
# Replace the numeric season codes with season names.
# replace (unlike map) leaves already-converted values intact if the cell is re-run.
df_day['season'] = df_day['season'].replace({1: 'Spring', 2: 'Summer', 3: 'Fall', 4: 'Winter'})

# Aggregate first so that each bar is a single total rather than a stack of daily rows
season_totals = df_day.groupby('season', as_index=False)['cnt'].sum()

# Bar chart of total rentals per season
fig = px.bar(season_totals, x='season', y='cnt', color='season',
             category_orders={'season': ['Spring', 'Summer', 'Fall', 'Winter']},
             labels={'season': 'Season', 'cnt': 'Total Bike Rentals'},
             title='Bike Usage by Season')

fig.update_layout(xaxis_title='Season', yaxis_title='Total Bike Rentals')
fig.show()

The box plot and bar chart show that bike usage varies by season. Rentals tend to rise from spring through fall and then decline in the colder winter season. The data also contain a few outliers, which point to events that affected rentals in particular seasons. For a bike-rental operator, these seasonal trends are useful input for managing bike stock and adjusting marketing strategy.
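
The seasonal differences and outliers read off the plots can also be checked numerically. The sketch below is an illustration under stated assumptions: the `sample` values are invented, `iqr_outliers` is a hypothetical helper, and the 1.5 × IQR rule is the usual box-plot convention (a plotting library may compute quartiles slightly differently).

```python
import pandas as pd

# Hypothetical daily rental counts for two seasons
sample = pd.DataFrame({
    "season": ["Spring"] * 5 + ["Fall"] * 5,
    "cnt": [1000, 1200, 1100, 1300, 5000, 5500, 5600, 5400, 5700, 1500],
})


def iqr_outliers(values: pd.Series) -> pd.Series:
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]


# Mean and median rentals per season
summary = sample.groupby("season")["cnt"].agg(["mean", "median"])

# Outlying daily counts per season
outliers = {
    season: iqr_outliers(group).tolist()
    for season, group in sample.groupby("season")["cnt"]
}

print(summary)
print(outliers)
```

Running the same grouping on `df_day` would put numbers behind the visual comparison, and the outlier lists identify which days deserve a closer look.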

### Question 2: How do bike usage patterns on weekends (days off) compare with those on working days?

In [21]:
# Day of the week from 'dteday'; pandas uses 0: Monday, 1: Tuesday, ..., 6: Sunday
df_day['day_of_week'] = df_day['dteday'].dt.dayofweek

# Flag weekends, i.e. Saturday (5) and Sunday (6) (0: no, 1: yes)
df_day['is_weekend'] = df_day['day_of_week'].isin([5, 6]).astype(int)

# Mean daily rentals for weekends and for the remaining days
df_weekend = df_day.groupby('is_weekend')['cnt'].mean().reset_index()

# Bar chart of mean bike usage on weekends vs. working days
fig = px.bar(df_weekend, x='is_weekend', y='cnt', color='is_weekend',
             labels={'is_weekend': 'Weekend (0: No, 1: Yes)', 'cnt': 'Mean Bike Rentals'},
             title='Mean Bike Usage: Weekends vs. Working Days',
             text='cnt')
fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.update_layout(xaxis_title='Weekend (0: No, 1: Yes)', yaxis_title='Mean Bike Rentals')
fig.update_xaxes(tickvals=[0, 1], ticktext=['Working Day', 'Weekend'])
fig.show()
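
Two weekday numbering conventions are in play here: pandas' `dt.dayofweek` counts from Monday = 0, while the dataset's `weekday` column counts from Sunday = 0 (2011-01-01, a Saturday, is stored as 6). The short sketch below uses three dates from the start of the dataset to show how the two relate; the variable names are illustrative only.

```python
import pandas as pd

# 2011-01-01 was a Saturday, followed by a Sunday and a Monday
dates = pd.Series(pd.to_datetime(["2011-01-01", "2011-01-02", "2011-01-03"]))

# pandas convention: Monday=0 ... Sunday=6
pandas_dow = dates.dt.dayofweek.tolist()

# Dataset convention: Sunday=0 ... Saturday=6
dataset_weekday = ((dates.dt.dayofweek + 1) % 7).tolist()

# Weekend test under the pandas convention: Saturday (5) or Sunday (6)
is_weekend = dates.dt.dayofweek.isin([5, 6]).tolist()

print(pandas_dow, dataset_weekday, is_weekend)
```

With the dataset's own `weekday` column, the equivalent weekend test would be `isin([0, 6])` rather than `isin([5, 6])`.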

In [22]:
df_day['weekday'] = df_day['weekday'].astype('category')

# Map the weekday codes to day names (0: Sunday, ..., 6: Saturday)
df_day['weekday'] = df_day['weekday'].map({0: 'Sunday', 1: 'Monday', 2: 'Tuesday', 3: 'Wednesday', 4: 'Thursday', 5: 'Friday', 6: 'Saturday'})

# Mean rentals for each day of the week (observed=True groups only the categories present)
weekday_counts = df_day.groupby('weekday', observed=True)['cnt'].mean().reset_index()

# Bar chart of mean bike usage by day of the week
fig = px.bar(weekday_counts, x='weekday', y='cnt',
             labels={'weekday': 'Day', 'cnt': 'Mean Bike Rentals'},
             title='Day of the Week vs. Number of Rentals')

fig.update_layout(xaxis_title='Day', yaxis_title='Mean Bike Rentals')
fig.show()

The box plot and bar charts of rentals by day show two things. In the first chart, mean rentals are higher on working days than on weekends. Across the week, mean rentals rise from Sunday through Friday and then fall again on Saturday. A bike-rental operator can use this information to refine its target market and marketing strategy.
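
The comparison above groups every non-weekend day together, including public holidays that fall on a weekday. The dataset's `holiday` and `workingday` columns allow a finer split. The following is a sketch under assumptions: the `sample` rows are invented and `day_type` is a hypothetical helper, not part of the original analysis.

```python
import pandas as pd

# Hypothetical rows using the dataset's column names
sample = pd.DataFrame({
    "weekday":    [6, 0, 1, 2, 3],   # Sunday=0 ... Saturday=6
    "holiday":    [0, 0, 1, 0, 0],
    "workingday": [0, 0, 0, 1, 1],
    "cnt":        [900, 800, 1000, 1500, 1700],
})


def day_type(row: pd.Series) -> str:
    """Classify a day as 'Holiday', 'Working day', or 'Weekend'."""
    if row["holiday"] == 1:
        return "Holiday"
    if row["workingday"] == 1:
        return "Working day"
    return "Weekend"


sample["day_type"] = sample.apply(day_type, axis=1)

# Mean rentals for each day type
mean_by_type = sample.groupby("day_type")["cnt"].mean()
print(mean_by_type)
```

Applied to `df_day`, this would show whether holidays behave more like weekends or more like working days.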

## Conclusion

1. Bike rentals differ by season: they tend to rise from spring through fall and to decline in winter. Snow and cold weather, which make cycling harder, may explain the winter drop.
2. Mean bike rentals tend to be higher on working days than on weekends (days off). This may be because bikes are used more as a mode of transport for commuting and daily activities on working days.