# Initial Data Processing

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 28/04/2025   | Martin | Created   | Created notebook. Initial data cleaning for Bike Rental and Penguins datasets | 

# Content 

* [Bike Rentals](#bike-rentals)
* [Palmer Penguins](#palmer-penguins)

# Bike Rentals

Daily counts of rented bikes from a bike rental company in Washington, D.C., along with weather information.

<u>Added/ Explained columns</u>

* `cnt` - Count of rented bikes, covering both casual and registered users. _Target variable_
* `weather_rmp` - Remapped weather situation. 1: Good, 2: Misty, 3: Bad
* `temp_cel` - Temperature in degrees Celsius
* `atemp_cel` - Feels-like temperature in degrees Celsius
* `rel_hum` - Relative humidity in percent (0-100)
* `windspeed_kmh` - Wind speed in km/h
* `cnt_2d_bfr` - Count of rented bikes 2 days before

<u>Additional Processing</u>

* Removed the day 2011-03-10 because its recorded humidity is 0
* Removed the first 2 days because they have no count from 2 days before
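
The `cnt_2d_bfr` lag feature can be illustrated on a toy frame. This is a minimal sketch with made-up counts rather than the real data, and the name `toy` is only for illustration. Note that `shift(2)` moves values by two rows, so it corresponds to "2 days before" only while the rows are consecutive days.

```python
import pandas as pd

# Toy data (made-up counts) standing in for day.csv
toy = pd.DataFrame({
    'dteday': pd.date_range('2011-01-01', periods=5, freq='D'),
    'cnt': [10, 20, 30, 40, 50],
})

# shift(2) moves each count down two rows, i.e. two days while dates are consecutive
toy['cnt_2d_bfr'] = toy['cnt'].shift(2)

# The first two rows have no value from two days before, so drop them
toy = toy.dropna(subset=['cnt_2d_bfr']).reset_index(drop=True)
toy['cnt_2d_bfr'] = toy['cnt_2d_bfr'].astype(int)
```

If a day is removed before the shift, the rows just after the gap pick up a count from 3 days earlier, so the order of filtering and shifting matters.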

In [14]:
import pandas as pd

In [15]:
hour = pd.read_csv('./data/bike_rental/hour.csv')
day = pd.read_csv('./data/bike_rental/day.csv')

In [16]:
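# Add cnt_2d_bfr column before filtering, so the shift always spans exactly 2 calendar days
day['cnt_2d_bfr'] = day['cnt'].shift(2)

# Remove day where humidity is 0 (copy to avoid SettingWithCopyWarning on later assignments)
hour = hour[hour['hum'] != 0].copy()
# For day, also drop the first 2 days, which have no count from 2 days before
day = day[(day['hum'] != 0) & day['cnt_2d_bfr'].notna()].copy()
day['cnt_2d_bfr'] = day['cnt_2d_bfr'].astype(int)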

# Change the weather mapping to 1/2/3 - Only hour needs to be changed
weather_map = {
  1: 1,
  2: 2,
  3: 3,
  4: 3
}
hour['weather_rmp'] = hour['weathersit'].map(weather_map)
day = day.rename({'weathersit': 'weather_rmp'}, axis=1)

# Apply denormalisation to columns
day['temp_cel'] = day['temp'] * 41
hour['temp_cel'] = hour['temp'] * 41

day['atemp_cel'] = day['atemp'] * 50
hour['atemp_cel'] = hour['atemp'] * 50

day['rel_hum'] = day['hum'] * 100
hour['rel_hum'] = hour['hum'] * 100

day['windspeed_kmh'] = day['windspeed'] * 67
hour['windspeed_kmh'] = hour['windspeed'] * 67

# Reset index
day = day.reset_index(drop=True)
hour = hour.reset_index(drop=True)
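
As a quick illustration of the remapping and denormalisation steps in the cell above, here is a minimal sketch on a toy frame. The values are made up; the scale factors (41 and 100) and the column names are the ones used above, while `toy` and `scale` are hypothetical names.

```python
import pandas as pd

# Toy rows (made-up values) using the raw column names from the cell above
toy = pd.DataFrame({
    'weathersit': [1, 2, 3, 4],
    'temp': [0.5, 0.25, 1.0, 0.0],
    'hum': [0.25, 0.5, 0.75, 1.0],
})

# Collapse category 4 into 3, as weather_map does above
toy['weather_rmp'] = toy['weathersit'].map({1: 1, 2: 2, 3: 3, 4: 3})

# Scale factors taken from the cell above: raw column -> (new column, factor)
scale = {'temp': ('temp_cel', 41), 'hum': ('rel_hum', 100)}
for raw_col, (new_col, factor) in scale.items():
    toy[new_col] = toy[raw_col] * factor
```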

In [17]:
day.shape, hour.shape

((728, 21), (17357, 22))

In [18]:
day.head()

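| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weather_rmp | temp | ... | hum | windspeed | casual | registered | cnt | cnt_2d_bfr | temp_cel | atemp_cel | rel_hum | windspeed_kmh |
| - | ------- | ------ | ------ | -- | ---- | ------- | ------- | ---------- | ----------- | ---- | --- | --- | --------- | ------ | ---------- | --- | ---------- | -------- | --------- | ------- | ------------- |
| 0 | 3 | 2011-01-03 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.196364 | ... | 0.437273 | 0.248309 | 120 | 1229 | 1349 | 985 | 8.050924 | 9.47025 | 43.7273 | 16.636703 |
| 1 | 4 | 2011-01-04 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.2 | ... | 0.590435 | 0.160296 | 108 | 1454 | 1562 | 801 | 8.2 | 10.6061 | 59.0435 | 10.739832 |
| 2 | 5 | 2011-01-05 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.226957 | ... | 0.436957 | 0.1869 | 82 | 1518 | 1600 | 1349 | 9.305237 | 11.4635 | 43.6957 | 12.5223 |
| 3 | 6 | 2011-01-06 | 1 | 0 | 1 | 0 | 4 | 1 | 1 | 0.204348 | ... | 0.518261 | 0.089565 | 88 | 1518 | 1606 | 1562 | 8.378268 | 11.66045 | 51.8261 | 6.000868 |
| 4 | 7 | 2011-01-07 | 1 | 0 | 1 | 0 | 5 | 1 | 2 | 0.196522 | ... | 0.498696 | 0.168726 | 148 | 1362 | 1510 | 1600 | 8.057402 | 10.44195 | 49.8696 | 11.304642 |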


In [19]:
day.to_csv('./data/bike_rental/day_cleaned.csv', index=False)
hour.to_csv('./data/bike_rental/hour_cleaned.csv', index=False)

---

# Palmer Penguins

Classification task. Contains measurements from 333 penguins from the Palmer Archipelago in Antarctica. The data is used to study the differences in appearance between male and female penguins.

<u>Additional Processing</u>

* Removed rows with missing values
* Removed the row where `sex` is recorded as `.`

In [8]:
penguins = pd.read_csv('./data/penguins/penguins_size.csv')

In [9]:
penguins.head()

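| | species | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex |
| - | ------- | ------ | ---------------- | --------------- | ----------------- | ----------- | --- |
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |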


In [10]:
# Drop the row where sex is recorded as the placeholder '.'
penguins = penguins[penguins['sex'] != '.']

In [11]:
penguins.dropna().reset_index(drop=True).to_csv('./data/penguins/penguins_cleaned.csv', index=False)
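
The two cleaning steps above (dropping the `.` placeholder, then dropping rows with missing values) can be sketched on a toy frame. The rows below are made up for illustration and `toy` and `cleaned` are hypothetical names; only the column names and the `.` placeholder come from the cells above.

```python
import pandas as pd

# Toy rows (made-up values) with one missing row and one '.' placeholder
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo'],
    'body_mass_g': [3750.0, None, 4875.0, 5000.0],
    'sex': ['MALE', None, '.', 'FEMALE'],
})

# Drop the placeholder sex value first, then any row with a missing value
cleaned = toy[toy['sex'] != '.'].dropna().reset_index(drop=True)
```

Only the fully populated rows with a valid `sex` value remain, and the index is renumbered from 0.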