This notebook cleans and processes hourly weather data for Manhattan covering all of 2024. It loads a raw CSV file of hourly weather observations, drops duplicate timestamps, keeps the columns of interest (e.g., temperature, humidity, wind speed), and converts the timestamps from UTC to Eastern Time. To keep the timeframe consistent, the notebook removes the entries that fall into 2023 after the conversion and estimates the final five hours of 2024 by extrapolating the most recent trend. It then checks the hourly coverage and resolves the two irregularities caused by Daylight Saving Time. The resulting dataset is clean and complete, and is used in our machine learning model.

In [27]:
# Importing.
import pandas as pd
from datetime import timedelta

In [28]:
# Loading the CSV file.
file_path = "Manhattan_40_768517_-73_982194_1704067200_1735689599_684c94e349583b00082484e5.csv"
df = pd.read_csv(file_path)

In [29]:
# Dropping duplicates based on 'dt_iso', keeping the first occurrence.
df = df.drop_duplicates(subset='dt_iso', keep='first')

In [30]:
# Getting row count.
row_count = len(df)
print("Row count:", row_count)

Row count: 8784


In [31]:
# Keeping only the useful columns.
df = df[['dt_iso', 'temp', 'humidity', 'wind_speed', 'feels_like', 'weather_main']]

# Examining results.
print(df.head())

                          dt_iso  temp  humidity  wind_speed  feels_like  \
0  2024-01-01 00:00:00 +0000 UTC  5.64        54        3.09        3.21   
1  2024-01-01 01:00:00 +0000 UTC  5.77        53        3.60        3.04   
2  2024-01-01 02:00:00 +0000 UTC  5.59        56        3.60        2.82   
3  2024-01-01 03:00:00 +0000 UTC  5.75        58        3.60        3.01   
4  2024-01-01 04:00:00 +0000 UTC  5.81        58        3.60        3.09   

  weather_main  
0       Clouds  
1       Clouds  
2       Clouds  
3         Rain  
4       Clouds  


In [32]:
# Cleaning the string by removing ' +0000 UTC'.
df['dt_iso'] = df['dt_iso'].str.replace(' +0000 UTC', '', regex=False)

# Parsing as UTC time.
df['dt_iso'] = pd.to_datetime(df['dt_iso'], format='%Y-%m-%d %H:%M:%S', utc=True)

# Converting to Eastern Time (Manhattan).
df['dt_iso'] = df['dt_iso'].dt.tz_convert('America/New_York')

# Extracting date and hour.
df['date'] = df['dt_iso'].dt.date
df['hour'] = df['dt_iso'].dt.hour

# Dropping dt_iso and rearranging.
df = df.drop(columns=['dt_iso'])
df = df[['temp', 'humidity', 'wind_speed', 'feels_like', 'weather_main', 'date', 'hour']]

# Examining results.
print(df.head(10))

   temp  humidity  wind_speed  feels_like weather_main        date  hour
0  5.64        54        3.09        3.21       Clouds  2023-12-31    19
1  5.77        53        3.60        3.04       Clouds  2023-12-31    20
2  5.59        56        3.60        2.82       Clouds  2023-12-31    21
3  5.75        58        3.60        3.01         Rain  2023-12-31    22
4  5.81        58        3.60        3.09       Clouds  2023-12-31    23
5  5.75        59        4.47        2.53         Rain  2024-01-01     0
6  5.43        60        4.12        2.32         Rain  2024-01-01     1
7  5.45        62        4.12        2.35       Clouds  2024-01-01     2
8  5.05        64        4.12        1.85       Clouds  2024-01-01     3
9  4.87        65        3.10        2.28       Clouds  2024-01-01     4


Because the provided timestamps are in UTC, converting them to Manhattan time moved them back (by 5 hours at the start of the year). As a result, the first five rows now fall into 2023 and need to be removed, since we're only using data from 2024. The same shift leaves the last five hours of 2024 without data, so we'll also need to infer those rows to maintain continuity. I'll do both in the cells that follow.
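
The size of the shift can be checked in isolation. The sketch below is a minimal illustration, not part of the cleaning pipeline: it converts two hand-picked UTC instants (chosen here for illustration, not read from the CSV) to `America/New_York`. The winter instant moves back five hours and lands on the previous calendar day, while the summer instant moves back only four hours because Daylight Saving Time is in effect.

```python
import pandas as pd

# Two illustrative UTC instants: one in winter (EST, UTC-5) and one in summer (EDT, UTC-4).
utc_times = pd.to_datetime(
    ["2024-01-01 00:00:00", "2024-07-01 00:00:00"], utc=True
)

# Converting to Eastern Time, as the notebook does for the 'dt_iso' column.
local_times = utc_times.tz_convert("America/New_York")

for utc, local in zip(utc_times, local_times):
    print(utc, "->", local)

winter_local, summer_local = local_times
```

The first converted value falls on 2023-12-31 at hour 19, which matches the first row printed above.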

In [33]:
# Dropping rows where shifted datetime now falls into 2023.
df = df[df['date'] >= pd.to_datetime('2024-01-01').date()]

# Examining results.
print(df.head(10))

    temp  humidity  wind_speed  feels_like weather_main        date  hour
5   5.75        59        4.47        2.53         Rain  2024-01-01     0
6   5.43        60        4.12        2.32         Rain  2024-01-01     1
7   5.45        62        4.12        2.35       Clouds  2024-01-01     2
8   5.05        64        4.12        1.85       Clouds  2024-01-01     3
9   4.87        65        3.10        2.28       Clouds  2024-01-01     4
10  4.60        68        2.10        2.80       Clouds  2024-01-01     5
11  4.74        70        3.60        1.78         Rain  2024-01-01     6
12  3.92        78        3.60        0.78         Rain  2024-01-01     7
13  3.94        81        2.68        1.49         Rain  2024-01-01     8
14  4.22        80        2.60        1.89         Rain  2024-01-01     9


The cell below infers the last five rows of 2024 by extending the average hourly change of the last six observed rows.

In [34]:
# Number of rows to forecast.
n_forecast = 5

# Using last 6 rows to compute average change per hour.
df_last = df.tail(6).copy()
deltas = df_last[['temp', 'humidity', 'wind_speed', 'feels_like']].diff().mean()

last_row = df.iloc[-1].copy()
new_rows = []
for _ in range(n_forecast):
    new_row = last_row.copy()

    # Applying the average hourly change.
    new_row['temp'] += deltas['temp']
    new_row['humidity'] += deltas['humidity']
    new_row['wind_speed'] += deltas['wind_speed']
    new_row['feels_like'] += deltas['feels_like']
    
    # Handling hour/date increment.
    new_hour = int(new_row['hour']) + 1
    if new_hour > 23:
        new_row['hour'] = 0
        new_row['date'] = (pd.to_datetime(new_row['date']) + timedelta(days=1)).date()
    else:
        new_row['hour'] = new_hour

    # Preserving the last known weather condition.
    new_row['weather_main'] = last_row['weather_main']

    # Appending and updating the last_row.
    new_rows.append(new_row.copy())
    last_row = new_row.copy()

# Appending inferred rows to the original DataFrame.
df = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)

# Rounding numerical columns to 2 decimal places.
df[['temp', 'humidity', 'wind_speed', 'feels_like']] = df[['temp', 'humidity', 'wind_speed', 'feels_like']].round(2)

# Examining results.
print(df.tail(10))

       temp  humidity  wind_speed  feels_like weather_main        date  hour
8774  10.77      57.0        5.66        9.39        Clear  2024-12-31    14
8775  10.78      59.0        6.17        9.45        Clear  2024-12-31    15
8776  10.24      69.0        5.14        9.12        Clear  2024-12-31    16
8777   9.87      70.0        5.14        7.37        Clear  2024-12-31    17
8778   9.23      73.0        5.66        6.38        Clear  2024-12-31    18
8779   8.84      77.8        5.97        5.73        Clear  2024-12-31    19
8780   8.45      82.6        6.28        5.08        Clear  2024-12-31    20
8781   8.05      87.4        6.58        4.42        Clear  2024-12-31    21
8782   7.66      92.2        6.89        3.77        Clear  2024-12-31    22
8783   7.27      97.0        7.20        3.12        Clear  2024-12-31    23
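
A note on what the extrapolation does: the mean of consecutive differences over six rows telescopes to the difference between the last and first of those rows divided by five, so the inferred rows continue a straight line through those two endpoints. The sketch below checks this on hypothetical readings (the `sample` values are made up for illustration and are not taken from the dataset).

```python
import pandas as pd

# Hypothetical six hourly temperature readings (not from the dataset).
sample = pd.DataFrame({"temp": [11.0, 10.5, 10.4, 9.8, 9.6, 9.0]})

# Average change per hour, computed as in the notebook.
mean_step = sample["temp"].diff().mean()

# The same quantity computed from the two endpoints only.
endpoint_slope = (sample["temp"].iloc[-1] - sample["temp"].iloc[0]) / (len(sample) - 1)

# Extending the series by five hours with a constant step.
last_value = sample["temp"].iloc[-1]
forecast = [round(last_value + mean_step * k, 2) for k in range(1, 6)]

print(mean_step, endpoint_slope)
print(forecast)
```

Because the step is constant, a bounded quantity such as humidity can leave its valid range if the horizon is long enough; for a longer forecast, a guard such as `Series.clip(upper=100)` might be worth adding.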


I’ll now check for any missing combinations of date and hour, as we should have a complete set for all of 2024.

In [35]:
# Getting date and hour combinations for 2024.
all_dates = pd.date_range(start='2024-01-01', end='2024-12-31', freq='D')
all_hours = list(range(24))

# Creating full expected MultiIndex of date and hour.
expected = pd.MultiIndex.from_product([all_dates.date, all_hours], names=['date', 'hour'])

# Extracting current date and hour combinations.
actual = pd.MultiIndex.from_frame(df[['date', 'hour']])

# Finding missing combinations.
missing = expected.difference(actual)

# Examining results.
if missing.empty:
    print("No missing date and hour combinations!")
else:
    print(f"{len(missing)} missing date and hour combination/s:")
    print(missing.to_frame(index=False).head(10))

1 missing date and hour combination/s:
         date  hour
0  2024-03-10     2


One hour is missing from the dataset (2024-03-10 02:00) due to Daylight Saving Time in New York. Clocks skip from 01:59 to 03:00, so 02:00 does not exist on that day. The gap is therefore expected and not a data error.
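
The skipped hour can be reproduced without the dataset. This minimal sketch takes four consecutive UTC hours around the spring transition (the start time is chosen here for illustration) and converts them to `America/New_York`; the local hours jump from 1 straight to 3.

```python
import pandas as pd

# Four consecutive UTC hours spanning the start of DST on 2024-03-10.
utc_hours = pd.date_range("2024-03-10 05:00", periods=4, freq="h", tz="UTC")
local_hours = utc_hours.tz_convert("America/New_York")

# Local hour of day for each instant; hour 2 never appears.
spring_hours = local_hours.hour.tolist()
print(spring_hours)
```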

In [36]:
# Total days in 2024.
total_days = len(all_dates)  # 366 for leap year

# Expected count.
raw_expected_total = total_days * 24  # 8784

# Adjusted expected count excluding DST gap.
adjusted_expected_total = raw_expected_total - len(missing)

# Actual count.
actual_total = len(actual)

# Displaying summary.
print("Overall Summary:")
print(f"  Expected rows (24 hours/day):     {raw_expected_total} (before DST adjustment)")
print(f"  Adjusted expected rows:           {adjusted_expected_total}")
print(f"  Actual rows in DataFrame:         {actual_total}")
print(f"  Missing rows (due to DST):        {len(missing)}")

Overall Summary:
  Expected rows (24 hours/day):     8784 (before DST adjustment)
  Adjusted expected rows:           8783
  Actual rows in DataFrame:         8784
  Missing rows (due to DST):        1


There is a problem. The actual number of rows is 8784, but it should be 8783 after accounting for the 1 missing hour due to DST (2024-03-10 02:00). This suggests there may be a duplicate date and hour entry that should be removed.

In [37]:
# Checking for duplicate combinations of date and hour.
duplicates = df[df.duplicated(subset=['date', 'hour'], keep=False)]

# Examining results.
if duplicates.empty:
    print("No duplicate date and hour combinations found.")
else:
    print(f"Found {len(duplicates)} duplicate rows based on date and hour:")
    print(duplicates.sort_values(['date', 'hour']).head())

Found 2 duplicate rows based on date and hour:
      temp  humidity  wind_speed  feels_like weather_main        date  hour
7368  8.61      57.0        4.63        6.01        Clear  2024-11-03     1
7369  8.10      59.0        5.81        4.90        Clear  2024-11-03     1


On 2024-11-03, 1:00 AM occurs twice because DST ends and clocks fall back from 02:00 to 01:00. I'll drop one of the two rows so that each date and hour appears only once, which keeps the hourly time series consistent for the 2025 prediction.
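
The repeated hour can also be reproduced without the dataset. In this minimal sketch, four consecutive UTC hours around the autumn transition (start time chosen here for illustration) are converted to `America/New_York`. Local hour 1 appears twice, once at UTC-4 and once at UTC-5, so the two rows are distinct observations that share a date and hour label.

```python
import pandas as pd

# Four consecutive UTC hours spanning the end of DST on 2024-11-03.
utc_hours = pd.date_range("2024-11-03 04:00", periods=4, freq="h", tz="UTC")
local_hours = utc_hours.tz_convert("America/New_York")

# Local hour of day and UTC offset (in hours) for each instant.
fall_hours = local_hours.hour.tolist()
utc_offsets = [ts.utcoffset().total_seconds() / 3600 for ts in local_hours]

print(fall_hours)
print(utc_offsets)
```

Keeping the first row, as done below, retains the observation taken while DST was still in effect.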

In [38]:
# Keeping the first row.
df = df.drop_duplicates(subset=['date', 'hour'], keep='first')

Weather data is now clean.
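
One way to sanity-check this claim is to rerun the coverage and duplicate checks as a single helper. The sketch below is an assumption-laden illustration: `check_hourly_coverage` is a hypothetical helper (not part of the notebook), and `demo` is a synthetic stand-in built from a timezone-converted hourly range rather than the real weather data. For a correctly cleaned 2024 frame, the only missing pair should be 2024-03-10 hour 2, and no duplicates should remain.

```python
import pandas as pd


def check_hourly_coverage(frame, year=2024):
    """Return (missing, duplicated) (date, hour) pairs for one calendar year.

    Hypothetical helper: expects 'date' (datetime.date) and 'hour' (int) columns.
    """
    days = pd.date_range(f"{year}-01-01", f"{year}-12-31", freq="D")
    expected = pd.MultiIndex.from_product(
        [days.date, range(24)], names=["date", "hour"]
    )
    actual = pd.MultiIndex.from_frame(frame[["date", "hour"]])
    missing = expected.difference(actual)
    duplicated = actual[actual.duplicated()]
    return missing, duplicated


# Synthetic stand-in: every UTC hour that maps to a local 2024 hour in New York.
utc_index = pd.date_range("2024-01-01 05:00", "2025-01-01 04:00", freq="h", tz="UTC")
local_index = utc_index.tz_convert("America/New_York")
demo = pd.DataFrame(
    {"date": local_index.date, "hour": local_index.hour.astype("int64")}
)

# Same deduplication step as the notebook.
demo = demo.drop_duplicates(subset=["date", "hour"], keep="first")

missing, duplicated = check_hourly_coverage(demo)
print(len(demo), len(missing), len(duplicated))
```

On the real data, the same call would be `check_hourly_coverage(df)` after the deduplication cell.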

In [39]:
# Saving to CSV file.
df.to_csv('weather_2024_cleaned.csv', index=False)