# Bike Sharing Daily Data
This notebook walks through:
1. Exploring `day.csv`, the daily table of the Bike Sharing dataset.
2. Checking and addressing issues across four data quality dimensions:
   - Completeness
   - Uniqueness
   - Accuracy
   - Consistency
3. Transforming the cleaned data into a star schema:
   - `dim_date`, `dim_season`, `dim_weather`, and `fact_daily_rentals`.
4. Saving the resulting tables to CSV files.

## 1. Imports and Data Loading

In [2]:
import pandas as pd

# Load the daily data
df = pd.read_csv('Dataset/day.csv', parse_dates=['dteday'])
display(df.head())

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
1,2,2011-01-02,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
2,3,2011-01-03,1,0,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
3,4,2011-01-04,1,0,1,0,2,1,1,0.2,0.212122,0.590435,0.160296,108,1454,1562
4,5,2011-01-05,1,0,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869,82,1518,1600


## 2. Overview of the Data

In [3]:
# Basic info and description
# df.info() prints its report and returns None, so it is not wrapped in display()
df.info()
display(df.describe(include='all'))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   instant     731 non-null    int64         
 1   dteday      731 non-null    datetime64[ns]
 2   season      731 non-null    int64         
 3   yr          731 non-null    int64         
 4   mnth        731 non-null    int64         
 5   holiday     731 non-null    int64         
 6   weekday     731 non-null    int64         
 7   workingday  731 non-null    int64         
 8   weathersit  731 non-null    int64         
 9   temp        731 non-null    float64       
 10  atemp       731 non-null    float64       
 11  hum         731 non-null    float64       
 12  windspeed   731 non-null    float64       
 13  casual      731 non-null    int64         
 14  registered  731 non-null    int64         
 15  cnt         731 non-null    int64         
dtypes: datetime64[ns](1), floa

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
count,731.0,731,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0
mean,366.0,2012-01-01 00:00:00,2.49658,0.500684,6.519836,0.028728,2.997264,0.683995,1.395349,0.495385,0.474354,0.627894,0.190486,848.176471,3656.172367,4504.348837
min,1.0,2011-01-01 00:00:00,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.05913,0.07907,0.0,0.022392,2.0,20.0,22.0
25%,183.5,2011-07-02 12:00:00,2.0,0.0,4.0,0.0,1.0,0.0,1.0,0.337083,0.337842,0.52,0.13495,315.5,2497.0,3152.0
50%,366.0,2012-01-01 00:00:00,3.0,1.0,7.0,0.0,3.0,1.0,1.0,0.498333,0.486733,0.626667,0.180975,713.0,3662.0,4548.0
75%,548.5,2012-07-01 12:00:00,3.0,1.0,10.0,0.0,5.0,1.0,2.0,0.655417,0.608602,0.730209,0.233214,1096.0,4776.5,5956.0
max,731.0,2012-12-31 00:00:00,4.0,1.0,12.0,1.0,6.0,1.0,3.0,0.861667,0.840896,0.9725,0.507463,3410.0,6946.0,8714.0
std,211.165812,,1.110807,0.500342,3.451913,0.167155,2.004787,0.465233,0.544894,0.183051,0.162961,0.142429,0.077498,686.622488,1560.256377,1937.211452


## 3. Data Quality Checks

### 3.1 Completeness
Check for missing values in each column.

In [4]:
null_counts = df.isnull().sum()
display(null_counts)

instant       0
dteday        0
season        0
yr            0
mnth          0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64
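
A null count of zero does not rule out a day that is absent from the file altogether. The sketch below checks for calendar gaps; `missing_dates` is a hypothetical helper and the sample dates are made up for illustration, not taken from `day.csv`.

```python
import pandas as pd

def missing_dates(dates: pd.Series) -> pd.DatetimeIndex:
    """Return the calendar days between the first and last date that never appear."""
    expected = pd.date_range(dates.min(), dates.max(), freq='D')
    return expected.difference(pd.DatetimeIndex(dates))

# Made-up sample with 2011-01-03 left out on purpose
sample = pd.Series(pd.to_datetime(['2011-01-01', '2011-01-02', '2011-01-04']))
gaps = missing_dates(sample)
print(gaps)
```

In the notebook this would be called as `missing_dates(df['dteday'])`. The 731 rows and the 2011-01-01 to 2012-12-31 range reported earlier are consistent with an empty result, since that span covers exactly 731 days.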

### 3.2 Uniqueness
Check for duplicate records at the row level.

In [5]:
dup_count = df.duplicated().sum()
print(f'Duplicate rows: {dup_count}')

Duplicate rows: 0
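
A row-level check only catches rows that are identical in every column. Two rows for the same date with different counts would pass it, so a key-level check is a useful complement. A minimal sketch, with a hypothetical helper `duplicate_keys` and a made-up sample:

```python
import pandas as pd

def duplicate_keys(frame: pd.DataFrame, key: str) -> pd.DataFrame:
    """Return every row whose value in `key` occurs more than once."""
    return frame[frame.duplicated(subset=key, keep=False)]

# Made-up sample: the same date appears twice with different counts
sample = pd.DataFrame({
    'dteday': pd.to_datetime(['2011-01-01', '2011-01-01', '2011-01-02']),
    'cnt': [985, 990, 801],
})
print(sample.duplicated().sum())              # row-level check finds 0 duplicates
print(len(duplicate_keys(sample, 'dteday')))  # key-level check flags both 2011-01-01 rows
```

For this dataset the natural candidates for `key` are `dteday` and `instant`.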


### 3.3 Accuracy
Verify that the normalized measures (`temp`, `hum`, `windspeed`) fall within [0, 1], and list the season and weather codes that actually occur in the data.

In [6]:
accuracy_checks = {
    'temp_range': df['temp'].between(0, 1).all(),
    'hum_range': df['hum'].between(0, 1).all(),
    'windspeed_range': df['windspeed'].between(0, 1).all(),
    'season_codes': set(df['season'].unique()),
    'weather_codes': set(df['weathersit'].unique())
}
display(accuracy_checks)

{'temp_range': np.True_,
 'hum_range': np.True_,
 'windspeed_range': np.True_,
 'season_codes': {np.int64(1), np.int64(2), np.int64(3), np.int64(4)},
 'weather_codes': {np.int64(1), np.int64(2), np.int64(3)}}
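
Range checks do not cover arithmetic relationships between columns. In the rows displayed earlier, `cnt` equals `casual + registered` (for example 331 + 654 = 985), so that identity can be tested as well. A minimal sketch with a hypothetical helper `count_mismatches`; the sample uses the first three rows shown in the data preview:

```python
import pandas as pd

def count_mismatches(frame: pd.DataFrame) -> pd.DataFrame:
    """Return rows where the total does not equal casual + registered."""
    return frame[frame['casual'] + frame['registered'] != frame['cnt']]

# First three rows from the preview of day.csv
sample = pd.DataFrame({
    'casual': [331, 131, 120],
    'registered': [654, 670, 1229],
    'cnt': [985, 801, 1349],
})
print(count_mismatches(sample).empty)
```

Running `count_mismatches(df)` in the notebook would list any offending rows in the full table.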

### 3.4 Consistency
Confirm that every observed categorical code belongs to the set of codes the dimension tables define.

In [17]:
expected_seasons = {1,2,3,4}
expected_weathers = {1,2,3,4}
print('Season codes consistent:', set(df['season'].unique()) <= expected_seasons)
print('Weather codes consistent:', set(df['weathersit'].unique()) <= expected_weathers)

Season codes consistent: True
Weather codes consistent: True
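
Consistency can also be checked between columns that encode the same fact. The `yr`, `mnth`, and `weekday` columns are all derivable from `dteday`, so they can be recomputed and compared. The sketch below assumes `yr` counts years from 2011 and that `weekday` uses 0 for Sunday, which matches the preview rows (2011-01-01, a Saturday, has `weekday` 6). The helper name `calendar_mismatches` is hypothetical.

```python
import pandas as pd

def calendar_mismatches(frame: pd.DataFrame, base_year: int = 2011) -> dict:
    """Count rows whose stored calendar fields disagree with `dteday`."""
    dt = frame['dteday'].dt
    # pandas numbers weekdays Monday=0..Sunday=6; the source uses Sunday=0..Saturday=6
    sunday_based = (dt.dayofweek + 1) % 7
    return {
        'yr': int((frame['yr'] != dt.year - base_year).sum()),
        'mnth': int((frame['mnth'] != dt.month).sum()),
        'weekday': int((frame['weekday'] != sunday_based).sum()),
    }

# First three rows from the preview of day.csv
sample = pd.DataFrame({
    'dteday': pd.to_datetime(['2011-01-01', '2011-01-02', '2011-01-03']),
    'yr': [0, 0, 0],
    'mnth': [1, 1, 1],
    'weekday': [6, 0, 1],
})
print(calendar_mismatches(sample))
```

A count of zero for every field means the stored calendar columns agree with the date.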


### 3.5 Denormalizing temp, atemp, hum, and windspeed

In [18]:
# temp_norm = (t_actual - t_min)/(t_max - t_min),  t_min = -8 °C, t_max = +39 °C
df['temp_c'] = df['temp'] * (39 - (-8)) + (-8)

# atemp_norm = (t_actual - t_min)/(t_max - t_min),  t_min = -16 °C, t_max = +50 °C
df['atemp_c'] = df['atemp'] * (50 - (-16)) + (-16)

# hum_norm = humidity_percent/100
df['humidity_pct'] = df['hum'] * 100

# windspeed_norm = windspeed/67  (67 is the maximum used for scaling; the mph unit is assumed here)
df['windspeed_mph'] = df['windspeed'] * 67
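
The four conversions above are all instances of inverting a min-max scaling, `actual = norm * (max - min) + min`. As a sketch, the same arithmetic can be wrapped in a pair of hypothetical helpers (`denormalize`, `normalize`) and verified with a round trip. The bounds are the ones quoted in the comments above.

```python
def denormalize(norm, lo, hi):
    """Invert min-max scaling: map a value in [0, 1] back to [lo, hi]."""
    return norm * (hi - lo) + lo

def normalize(actual, lo, hi):
    """Apply min-max scaling: map a value in [lo, hi] to [0, 1]."""
    return (actual - lo) / (hi - lo)

# Worked example with the first temp value in the preview: 0.344167 * 47 - 8
temp_c = denormalize(0.344167, lo=-8, hi=39)
print(round(temp_c, 2))  # 8.18

# Round trip: normalizing the result recovers the original value
print(abs(normalize(temp_c, -8, 39) - 0.344167) < 1e-12)
```

Both helpers work unchanged on a pandas Series, so `denormalize(df['temp'], -8, 39)` would give the same result as the `temp_c` line above.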

## 4. Transform to Star Schema

### 4.1 Build `dim_date`

In [19]:
# Create date dimension with one row per calendar day in the data
date_df = pd.DataFrame({'date': pd.date_range(start=df['dteday'].min(), end=df['dteday'].max(), freq='D')})
date_df['date_key'] = date_df['date'].dt.strftime('%Y%m%d').astype(int)
date_df['year'] = date_df['date'].dt.year
date_df['quarter'] = date_df['date'].dt.quarter
date_df['month'] = date_df['date'].dt.month
date_df['day'] = date_df['date'].dt.day
date_df['weekday'] = date_df['date'].dt.weekday + 1  # 1=Monday ... 7=Sunday (the source column uses 0=Sunday)
# Look up the holiday and working-day flags by date; dates absent from the source default to 0
flags = df.drop_duplicates('dteday').set_index('dteday')
date_df['is_holiday'] = date_df['date'].map(flags['holiday']).fillna(0).astype(int)
date_df['is_working'] = date_df['date'].map(flags['workingday']).fillna(0).astype(int)
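
The `date_key` is a "smart key": the date written as an eight-digit integer in `YYYYMMDD` order, so it sorts chronologically and can be decoded without a lookup. A small sketch of the encoding and its inverse, using hypothetical helpers `to_date_key` and `from_date_key`:

```python
import pandas as pd

def to_date_key(dates: pd.Series) -> pd.Series:
    """Encode dates as YYYYMMDD integers."""
    return dates.dt.strftime('%Y%m%d').astype(int)

def from_date_key(keys: pd.Series) -> pd.Series:
    """Decode YYYYMMDD integers back into dates."""
    return pd.to_datetime(keys.astype(str), format='%Y%m%d')

# First and last dates of the dataset
dates = pd.Series(pd.to_datetime(['2011-01-01', '2012-12-31']))
keys = to_date_key(dates)
print(keys.tolist())  # [20110101, 20121231]
print((from_date_key(keys) == dates).all())
```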


### 4.2 Build `dim_season`

In [20]:
# Season dimension mapping
# The January rows in the data preview carry season code 1, so code 1 is labeled Winter
dim_season = pd.DataFrame({
    'season_id': [1,2,3,4],
    'season_name': ['Winter','Spring','Summer','Fall']
})
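
Season labels are easy to get wrong, so it can help to check them against the calendar by listing which months occur under each season code. A minimal sketch with a hypothetical helper `months_by_season`; the sample holds the first three preview rows only, so it shows a single code.

```python
import pandas as pd

def months_by_season(frame: pd.DataFrame) -> dict:
    """Map each season code to the sorted list of calendar months it covers."""
    months = frame['dteday'].dt.month
    return {
        int(code): sorted(int(m) for m in group.unique())
        for code, group in months.groupby(frame['season'])
    }

# First three rows from the preview of day.csv
sample = pd.DataFrame({
    'dteday': pd.to_datetime(['2011-01-01', '2011-01-02', '2011-01-03']),
    'season': [1, 1, 1],
})
print(months_by_season(sample))  # {1: [1]}
```

Calling `months_by_season(df)` on the full table would show the month span of all four codes, which is the evidence the `season_name` labels should agree with.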

### 4.3 Build `dim_weather`

In [21]:
# Weather dimension mapping
dim_weather = pd.DataFrame({
    'weather_id': [1,2,3,4],
    'weather_desc': [
        'Clear/Partly Cloudy',
        'Mist/Cloudy',
        'Light Rain/Snow',
        'Heavy Rain/Snow/Storm'
    ]
})

### 4.4 Build `fact_daily_rentals`

In [22]:
# Fact table creation
fact = pd.DataFrame({
    'date_key': df['dteday'].dt.strftime('%Y%m%d').astype(int),
    'season_id': df['season'],
    'weather_id': df['weathersit'],
    'casual_count': df['casual'],
    'registered_count': df['registered'],
    'total_count': df['cnt'],
    'temp_c': df['temp_c'],
    'atemp_c': df['atemp_c'],
    'humidity_pct': df['humidity_pct'],
    'windspeed_mph': df['windspeed_mph']
})
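
A star schema relies on every foreign key in the fact table resolving to a row in its dimension. The sketch below checks that property; `orphan_keys` is a hypothetical helper and the two small tables are made up for illustration.

```python
import pandas as pd

def orphan_keys(fact: pd.DataFrame, fact_col: str,
                dim: pd.DataFrame, dim_col: str) -> list:
    """Return the distinct fact keys that have no matching dimension row."""
    unmatched = fact.loc[~fact[fact_col].isin(dim[dim_col]), fact_col]
    return unmatched.drop_duplicates().sort_values().tolist()

# Made-up tables: weather_id 5 has no row in the dimension
dim_sample = pd.DataFrame({'weather_id': [1, 2, 3, 4]})
fact_sample = pd.DataFrame({'weather_id': [1, 2, 5, 5], 'total_count': [10, 20, 30, 40]})
print(orphan_keys(fact_sample, 'weather_id', dim_sample, 'weather_id'))  # [5]
```

In the notebook the same call would apply to each relationship, for example `orphan_keys(fact, 'date_key', date_df, 'date_key')`, with an empty list meaning the keys are intact.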

## 5. Save Schema to CSV Files

In [23]:
# Save each table to CSV
date_df.to_csv('dim_date.csv', index=False)
dim_season.to_csv('dim_season.csv', index=False)
dim_weather.to_csv('dim_weather.csv', index=False)
fact.to_csv('fact_daily_rentals.csv', index=False)
print('All tables saved to CSV.')

All tables saved to CSV.
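
A quick read-back can confirm that a file was written as intended. The sketch below writes a table to a temporary directory, reloads it, and compares shape and column names. `csv_roundtrip_ok` is a hypothetical helper, and the sample table uses placeholder values rather than the notebook's data.

```python
import tempfile
from pathlib import Path

import pandas as pd

def csv_roundtrip_ok(table: pd.DataFrame, path: Path) -> bool:
    """Write `table` to `path`, read it back, and compare shape and column names."""
    table.to_csv(path, index=False)
    restored = pd.read_csv(path)
    return restored.shape == table.shape and list(restored.columns) == list(table.columns)

# Placeholder table standing in for one of the dimension tables
sample = pd.DataFrame({'id': [1, 2], 'name': ['a', 'b']})
with tempfile.TemporaryDirectory() as tmp:
    ok = csv_roundtrip_ok(sample, Path(tmp) / 'sample.csv')
print(ok)
```

The comparison is deliberately limited to shape and column names: float columns may not compare exactly after a text round trip, and dates come back as strings unless `parse_dates` is passed to `pd.read_csv`.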
