<a href="https://colab.research.google.com/github/mikhail-karim/submission/blob/main/notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Analysis Project: Bike Sharing Dataset
- **Name:** Mikhail Shams Afzal Karim
- **Email:** mikhailsakarim@gmail.com
- **Dicoding ID:** mikhailkarim2004

## Defining the Business Questions

- How do changes in weather conditions affect the number of bike rentals?
- How does the bike rental business perform in weather that is poorly suited to riding?
- How much does rental behavior differ between working days and holidays or weekends?
- How can we predict bike rental demand from environmental and social factors in order to optimize advertising and expand the business?


## Import All Packages/Libraries Used

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Data Wrangling

### Gathering Data

In this step, we import two datasets, day.csv and hour.csv, from a GitHub repository.


In [4]:
# import the daily dataset from the GitHub repository (day.csv)
day_df = pd.read_csv("https://raw.githubusercontent.com/mikhail-karim/submission/refs/heads/main/data/day.csv")
day_df.head()

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
1,2,2011-01-02,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
2,3,2011-01-03,1,0,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
3,4,2011-01-04,1,0,1,0,2,1,1,0.2,0.212122,0.590435,0.160296,108,1454,1562
4,5,2011-01-05,1,0,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869,82,1518,1600


In [19]:
# import the hourly dataset from the GitHub repository (hour.csv)
hour_df = pd.read_csv("https://raw.githubusercontent.com/mikhail-karim/submission/refs/heads/main/data/hour.csv")
hour_df.head()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
1,2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
2,3,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
3,4,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
4,5,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1
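
The `dteday` column holds calendar dates. As an optional alternative to converting it in a later step, the date can be parsed while the file is read. The sketch below is a minimal illustration on an in-memory sample built from the first hour.csv rows shown above; `sample_csv` and `sample_df` are illustrative names, not part of the notebook.

```python
import io

import pandas as pd

# Three rows copied from the hour.csv preview, trimmed to a few columns.
sample_csv = io.StringIO(
    "instant,dteday,hr,cnt\n"
    "1,2011-01-01,0,16\n"
    "2,2011-01-01,1,40\n"
    "3,2011-01-01,2,32\n"
)

# parse_dates converts the listed columns to datetime64 while reading.
sample_df = pd.read_csv(sample_csv, parse_dates=["dteday"])
print(sample_df.dtypes)
```

The same `parse_dates=["dteday"]` argument could be passed to the `pd.read_csv` calls above, assuming the date column keeps that name.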


In [27]:
# Check for missing values in the dataframe
missing_data = hour_df.isnull().sum()

# Display the count of missing values for each column
print(missing_data)


instant       0
dteday        0
season        0
yr            0
mnth          0
hr            0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64


**Insight:**
- hour.csv has an "hr" column, so each date spans several rows, one per hour. The check above finds no missing values in any of its columns.
- day.csv has no "hr" column because each row represents one full day.
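
Because hour.csv splits each date into hourly rows, summing the hourly counts per date gives a daily figure. The sketch below shows the idea on the five hour.csv rows from the preview, so the result covers only hours 0 to 4 of 2011-01-01 and is not the full daily total; `hour_sample` and `daily_from_hourly` are illustrative names.

```python
import pandas as pd

# First five rows of hour.csv as shown in the preview (subset of columns).
hour_sample = pd.DataFrame({
    "dteday": ["2011-01-01"] * 5,
    "hr": [0, 1, 2, 3, 4],
    "casual": [3, 8, 5, 3, 0],
    "registered": [13, 32, 27, 10, 1],
    "cnt": [16, 40, 32, 13, 1],
})

# Collapse the hourly rows into one row per date by summing the counts.
daily_from_hourly = (
    hour_sample.groupby("dteday")[["casual", "registered", "cnt"]].sum()
)
print(daily_from_hourly)
```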

### Assessing Data

The first code cell in this section prints the structure of both DataFrames: row counts, column names, non-null counts, and data types.

In [37]:
# inspect the structure of each dataset

day_df.info()
print("")
hour_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     731 non-null    int64  
 1   dteday      731 non-null    object 
 2   season      731 non-null    int64  
 3   yr          731 non-null    int64  
 4   mnth        731 non-null    int64  
 5   holiday     731 non-null    int64  
 6   weekday     731 non-null    int64  
 7   workingday  731 non-null    int64  
 8   weathersit  731 non-null    int64  
 9   temp        731 non-null    float64
 10  atemp       731 non-null    float64
 11  hum         731 non-null    float64
 12  windspeed   731 non-null    float64
 13  casual      731 non-null    int64  
 14  registered  731 non-null    int64  
 15  cnt         731 non-null    int64  
dtypes: float64(4), int64(11), object(1)
memory usage: 91.5+ KB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (to

Checking for null (missing) values in day.csv and hour.csv

In [50]:
# count null values per column; print the total, plus the affected columns if any

null_counts = day_df.isnull().sum()
print("Number of null values found:", int(null_counts.sum()))
if null_counts.sum() > 0:
  print(null_counts[null_counts > 0])
print(" ")
day_df.isnull() #day.csv (daily)

Number of null values found: 0
 


Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
726,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
727,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
728,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
729,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [69]:
# count null values per column; print the total, plus the affected columns if any

null_counts = hour_df.isnull().sum()
print("Number of null values found:", int(null_counts.sum()))
if null_counts.sum() > 0:
  print(null_counts[null_counts > 0])
print(" ")
hour_df.isnull() #hour.csv (hourly)

Number of null values found: 0
 


Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17374,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
17375,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
17376,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
17377,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
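
A null check has to reduce the boolean frame returned by `isnull()` to a single number, because comparing the frame itself with `is True` always evaluates to False. The sketch below wraps that reduction in a small helper; `count_nulls` and `demo_df` are hypothetical names, and the demo data is made up for illustration.

```python
import numpy as np
import pandas as pd


def count_nulls(df: pd.DataFrame) -> int:
    """Return the total number of missing cells in df."""
    return int(df.isnull().sum().sum())


# Hypothetical frame with one missing value, used only for illustration.
demo_df = pd.DataFrame({"temp": [0.24, np.nan, 0.22], "cnt": [16, 40, 32]})

# Identity comparison against True is always False for a DataFrame.
print(demo_df.isnull() is True)

total_nulls = count_nulls(demo_df)
print("Number of null values found:", total_nulls)
```

The helper could be called once per dataset, for example `count_nulls(day_df)` and `count_nulls(hour_df)`.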


Checking for duplicated rows in day.csv and hour.csv

In [53]:
# count fully duplicated rows and print the total

duplicate_count = int(day_df.duplicated().sum())
print("Number of duplicated rows found:", duplicate_count)
print(" ")
day_df.duplicated() #day.csv (daily)

Number of duplicated rows found: 0
 


Unnamed: 0,0
0,False
1,False
2,False
3,False
4,False
...,...
726,False
727,False
728,False
729,False


In [54]:
# count fully duplicated rows and print the total

duplicate_count = int(hour_df.duplicated().sum())
print("Number of duplicated rows found:", duplicate_count)
print(" ")
hour_df.duplicated() #hour.csv (hourly)

Number of duplicated rows found: 0
 


Unnamed: 0,0
0,False
1,False
2,False
3,False
4,False
...,...
17374,False
17375,False
17376,False
17377,False
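
`duplicated()` with no arguments only flags rows that match in every column. For hourly data it can also be useful to check whether the same date and hour appears twice with different counts. The sketch below illustrates this with the `subset` parameter on made-up rows; `demo_df` is a hypothetical frame, not data from hour.csv.

```python
import pandas as pd

# Hypothetical hourly rows; the last row repeats the (dteday, hr) key of the first.
demo_df = pd.DataFrame({
    "dteday": ["2011-01-01", "2011-01-01", "2011-01-01"],
    "hr": [0, 1, 0],
    "cnt": [16, 40, 17],
})

# No row matches another in every column, so this is 0.
full_row_dupes = int(demo_df.duplicated().sum())

# Restricting the comparison to the key columns reveals the repeated hour.
key_dupes = int(demo_df.duplicated(subset=["dteday", "hr"]).sum())
print(full_row_dupes, key_dupes)
```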


Checking both datasets for inconsistent values

In [66]:
# use .describe() to make the check easier

day_df.describe() #day.csv (daily)

Unnamed: 0,instant,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
count,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0
mean,366.0,2.49658,0.500684,6.519836,0.028728,2.997264,0.683995,1.395349,0.495385,0.474354,0.627894,0.190486,848.176471,3656.172367,4504.348837
std,211.165812,1.110807,0.500342,3.451913,0.167155,2.004787,0.465233,0.544894,0.183051,0.162961,0.142429,0.077498,686.622488,1560.256377,1937.211452
min,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.05913,0.07907,0.0,0.022392,2.0,20.0,22.0
25%,183.5,2.0,0.0,4.0,0.0,1.0,0.0,1.0,0.337083,0.337842,0.52,0.13495,315.5,2497.0,3152.0
50%,366.0,3.0,1.0,7.0,0.0,3.0,1.0,1.0,0.498333,0.486733,0.626667,0.180975,713.0,3662.0,4548.0
75%,548.5,3.0,1.0,10.0,0.0,5.0,1.0,2.0,0.655417,0.608602,0.730209,0.233214,1096.0,4776.5,5956.0
max,731.0,4.0,1.0,12.0,1.0,6.0,1.0,3.0,0.861667,0.840896,0.9725,0.507463,3410.0,6946.0,8714.0


In [65]:
# use .describe() to make the check easier

hour_df.describe() #hour.csv (hourly)

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
count,17379.0,17379,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0
mean,8690.0,2012-01-02 04:08:34.552045568,2.50164,0.502561,6.537775,11.546752,0.02877,3.003683,0.682721,1.425283,0.496987,0.475775,0.627229,0.190098,35.676218,153.786869,189.463088
min,1.0,2011-01-01 00:00:00,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.02,0.0,0.0,0.0,0.0,0.0,1.0
25%,4345.5,2011-07-04 00:00:00,2.0,0.0,4.0,6.0,0.0,1.0,0.0,1.0,0.34,0.3333,0.48,0.1045,4.0,34.0,40.0
50%,8690.0,2012-01-02 00:00:00,3.0,1.0,7.0,12.0,0.0,3.0,1.0,1.0,0.5,0.4848,0.63,0.194,17.0,115.0,142.0
75%,13034.5,2012-07-02 00:00:00,3.0,1.0,10.0,18.0,0.0,5.0,1.0,2.0,0.66,0.6212,0.78,0.2537,48.0,220.0,281.0
max,17379.0,2012-12-31 00:00:00,4.0,1.0,12.0,23.0,1.0,6.0,1.0,4.0,1.0,1.0,1.0,0.8507,367.0,886.0,977.0
std,5017.0295,,1.106918,0.500008,3.438776,6.914405,0.167165,2.005771,0.465431,0.639357,0.192556,0.17185,0.19293,0.12234,49.30503,151.357286,181.387599


**Insight:**
- The assessment found no missing values and no duplicated rows in either dataset, and the describe() summaries show no obviously out-of-range values.
- In day.csv, "dteday" is stored as object, but it should be datetime. In hour.csv, it already has the correct data type.
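
Beyond reading the describe() tables, consistency can be checked with explicit rules. The sketch below tests two of them on the first three day.csv rows from the preview: that `casual` plus `registered` equals `cnt`, and that `temp` and `hum` stay within 0 to 1. The second rule is an assumption based on the value ranges in the describe() output; `day_sample`, `totals_match`, and `in_range` are illustrative names.

```python
import pandas as pd

# First three rows of day.csv as shown in the preview (subset of columns).
day_sample = pd.DataFrame({
    "temp": [0.344167, 0.363478, 0.196364],
    "hum": [0.805833, 0.696087, 0.437273],
    "casual": [331, 131, 120],
    "registered": [654, 670, 1229],
    "cnt": [985, 801, 1349],
})

# Rule 1: the two rider groups should add up to the total.
totals_match = bool(
    (day_sample["casual"] + day_sample["registered"] == day_sample["cnt"]).all()
)

# Rule 2 (assumption): scaled weather columns should fall within [0, 1].
in_range = day_sample[["temp", "hum"]].apply(lambda s: s.between(0, 1).all())
print(totals_match, in_range.to_dict())
```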

### Cleaning Data

Converting the "dteday" column in day.csv from object to datetime

In [6]:
day_df['dteday'] = pd.to_datetime(day_df['dteday'])
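
With the column stored as datetime, date parts become available through the `.dt` accessor. The sketch below is a small illustration on the first three dates from the day.csv preview; `dates` and `parsed` are illustrative names.

```python
import pandas as pd

# First three dteday values from the day.csv preview, still as text.
dates = pd.Series(["2011-01-01", "2011-01-02", "2011-01-03"])

# Convert to datetime64, then read date parts through the .dt accessor.
parsed = pd.to_datetime(dates)
day_names = parsed.dt.day_name().tolist()
months = parsed.dt.month.tolist()
print(day_names, months)
```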

Dropping unused columns

In [8]:
# errors='ignore' keeps this cell safe to re-run once the columns are already gone
day_df = day_df.drop(columns=['instant', 'holiday', 'windspeed'], errors='ignore')
day_df.head()

Renaming some columns to make them more readable

In [1]:
day_df = day_df.rename(columns={
    'dteday': 'date',
    'yr': 'year',
    'mnth': 'month',
    'weekday' : 'day',
    'workingday': 'day_type',
    'weathersit': 'weather_type',
    'atemp': 'feels',
    'hum': 'humidity',
    'cnt': 'total'})
day_df.head()


In [None]:
# map numeric codes to readable labels; replace() returns a new Series, so assign it back
day_df['month'] = day_df['month'].replace({1 : 'Jan', 2 : 'Feb', 3 : 'Mar', 4 : 'Apr', 5 : 'May', 6 : 'Jun', 7 : 'Jul', 8 : 'Aug', 9 : 'Sep', 10 : 'Oct', 11 : 'Nov', 12 : 'Dec'})
day_df['year'] = day_df['year'].replace({0 : 2011, 1 : 2012})
day_df['season'] = day_df['season'].replace({1 : 'Spring', 2 : 'Summer', 3 : 'Fall', 4 : 'Winter'})
day_df['day'] = day_df['day'].replace({0 :'sunday', 1 : 'monday', 2 : 'tuesday', 3 : 'wednesday', 4 : 'thursday', 5 :'friday', 6 :'saturday'})
day_df['weather_type'] = day_df['weather_type'].replace({1 : 'Clear/Cloudy', 2 : 'Mist', 3 : 'Light Snow/Rain', 4 : 'Heavy Rain/Fog'})
# workingday is 1 on working days and 0 on weekends and holidays
day_df['day_type'] = day_df['day_type'].replace({0 : 'Weekend/Holiday', 1 : 'Working Day'})
day_df.head()
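
Once months are text labels, plain sorting is alphabetical, which would place Apr before Jan. The sketch below is one possible way to keep calendar order, assuming it matters for later grouping or plotting: store the labels as an ordered categorical. The names `month_labels`, `codes`, and `ordered` are illustrative.

```python
import pandas as pd

month_labels = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Hypothetical month codes in arbitrary order.
codes = pd.Series([4, 1, 12, 3])

# Map codes 1..12 to labels, then fix the category order to the calendar.
labels = codes.map(dict(zip(range(1, 13), month_labels)))
ordered = pd.Series(
    pd.Categorical(labels, categories=month_labels, ordered=True)
)

alphabetical = sorted(labels.tolist())
chronological = ordered.sort_values().tolist()
print(alphabetical, chronological)
```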

**Insight:**
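- Converting "dteday" to datetime allows date components to be extracted directly from the column.
- Renaming the columns and mapping numeric codes to labels makes the daily table easier to read.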

## Exploratory Data Analysis (EDA)

### Explore ...
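
As a starting point for the working-day question, rentals can be grouped by day type and averaged. The sketch below only demonstrates the mechanics on the five day.csv rows from the preview, so its numbers are not findings about the full dataset; `day_sample` and `by_day_type` are illustrative names, and the original column names are used.

```python
import pandas as pd

# First five rows of day.csv as shown in the preview (subset of columns).
day_sample = pd.DataFrame({
    "workingday": [0, 0, 1, 1, 1],
    "weathersit": [2, 2, 1, 1, 1],
    "cnt": [985, 801, 1349, 1562, 1600],
})

# Average total rentals for non-working days (0) and working days (1).
by_day_type = day_sample.groupby("workingday")["cnt"].mean()
print(by_day_type)
```

The same pattern applies to the weather questions by grouping on `weathersit` instead.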

**Insight:**
- xxx
- xxx

## Visualization & Explanatory Analysis

### Question 1:

### Question 2:

**Insight:**
- xxx
- xxx

## Advanced Analysis (Optional)

## Conclusion

- Conclusion for question 1
- Conclusion for question 2