# **Data Collection & Integration:**

*   Identify & access three datasets:

  *   Traffic Dataset
  *   Weather Dataset (including temperature, precipitation, humidity & wind speed)
  *   Event Dataset (holidays and other events)


*   Integrating datasets from various sources into a single dataset:

  *   Data synchronization by timestamp to match each traffic record with its weather and event info.
  *   Developing a data integration pipeline to merge traffic, weather, and event data into a unified dataset.



*   Handling data quality issues:

  *   Cleaning the dataset (drop duplicates, fill/remove missing values, and fix inconsistencies).
  *   Scaling variables to a common range (normalize or standardize).










In [1]:
# Importing Libraries & loading all three datasets

import pandas as pd
import warnings
warnings.filterwarnings("ignore")

# Load the traffic data
traffic_data = pd.read_csv('Dataset_Uber Traffic.csv')
traffic_data['DateTime'] = pd.to_datetime(traffic_data['DateTime'])

# Load the weather data
weather_data = pd.read_csv('weather.csv')
weather_data['DateTime'] = pd.to_datetime(weather_data['date_time'])

# Load the event data
event_data = pd.read_csv('events.csv')

# Convert 'date' to 'DateTime'
event_data['DateTime'] = pd.to_datetime(event_data['date'])
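
Day/month order is worth checking at this loading step. The printed rows further down show `DateTime` 2015-01-11 next to `ID` 20151101001, which may mean that day-first strings were read month-first. A minimal sketch, assuming the raw column holds strings like `01/11/2015 00:00` (a hypothetical format, not confirmed by the source files):

```python
import pandas as pd

# Hypothetical sample values; the real CSV format is an assumption.
raw = pd.Series(["01/11/2015 00:00", "01/11/2015 01:00"])

# Default parsing reads ambiguous dates month-first: 11 January 2015.
month_first = pd.to_datetime(raw)

# An explicit format removes the ambiguity: 1 November 2015.
day_first = pd.to_datetime(raw, format="%d/%m/%Y %H:%M")

print(month_first.dt.month.tolist())
print(day_first.dt.month.tolist())
```

If the files turn out to be day-first, passing an explicit `format` (or `dayfirst=True`) to each `pd.to_datetime` call keeps all three datasets on the same calendar.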

# **Data Synchronization by timestamp to match traffic with its weather and event info.**

In [2]:
# Take the first year found in the traffic data; weather and event timestamps are aligned to it
traffic_year = traffic_data['DateTime'].dt.year.unique()[0]

def adjust_year(df, target_year):
    def replace_year(x):
        try:
            return x.replace(year=target_year)
        except ValueError:
            # Handle the February 29 case for leap years
            if x.month == 2 and x.day == 29:
                return x.replace(month=2, day=28, year=target_year)
            else:
                raise

    df['DateTime'] = df['DateTime'].apply(replace_year)
    return df

# Adjust the year in weather data
weather_data = adjust_year(weather_data, traffic_year)

# Adjust the year in event data
event_data = adjust_year(event_data, traffic_year)

# **Data Integration**
Integrating all three datasets into a single dataset

In [3]:
# Merge the datasets on DateTime
integrated_data = pd.merge(traffic_data, weather_data, on='DateTime', how='left')
integrated_data = pd.merge(integrated_data, event_data, on='DateTime', how='left')
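
Because `adjust_year` maps every year of weather history onto a single year, the weather table can hold several rows per `DateTime`, and a left merge then repeats each traffic row once per match (the first ten printed rows further down appear to show this). A minimal sketch of one way to guard against it, using small made-up frames; averaging the duplicate rows is an assumption for illustration, not the method used in this notebook:

```python
import pandas as pd
from pandas.errors import MergeError

# Made-up frames standing in for traffic_data and weather_data.
traffic = pd.DataFrame({
    "DateTime": pd.to_datetime(["2015-01-11", "2015-01-12"]),
    "Vehicles": [15, 13],
})
weather = pd.DataFrame({
    "DateTime": pd.to_datetime(["2015-01-11", "2015-01-11", "2015-01-12"]),
    "tempC": [27.0, 26.0, 25.0],
})

# validate="many_to_one" raises if the right-hand keys are not unique.
try:
    pd.merge(traffic, weather, on="DateTime", how="left", validate="many_to_one")
    duplicates_detected = False
except MergeError:
    duplicates_detected = True

# Assumed remedy: collapse to one weather row per timestamp (mean here).
weather_unique = weather.groupby("DateTime", as_index=False).mean(numeric_only=True)
merged = pd.merge(traffic, weather_unique, on="DateTime", how="left",
                  validate="many_to_one")
print(duplicates_detected, len(merged))
```

With unique keys on the right-hand side, the merged frame keeps exactly one row per traffic record.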

# **Handling Data quality issues:**

In [4]:
# Remove duplicates
integrated_data = integrated_data.drop_duplicates()

# Handle missing values
integrated_data = integrated_data.ffill()  # Forward fill: carry the last valid value forward
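
Forward fill only looks at earlier rows, so gaps at the very start of the frame stay missing. If `events.csv` lists only event days (an assumption), a forward fill would also carry a holiday label onto the ordinary days that follow it. A sketch on made-up data that treats the two kinds of column differently; the `"No holiday"` label is hypothetical:

```python
import pandas as pd

# Made-up frame: a weather column and an event column after a left merge.
df = pd.DataFrame({
    "tempC": [None, 21.0, None, 23.0],
    "holiday": [None, "New Year", None, None],
})
missing_before = df.isna().sum()

# Weather: forward fill, then back fill the leading gap.
df["tempC"] = df["tempC"].ffill().bfill()

# Events: treat "no match" as "no holiday" instead of carrying one forward.
df["holiday"] = df["holiday"].fillna("No holiday")

missing_after = df.isna().sum()
print(missing_before.to_dict(), missing_after.to_dict())
```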

In [5]:
# Checking columns in the integrated dataset
print(integrated_data.columns)

Index(['DateTime', 'Junction', 'Vehicles', 'ID', 'date_time', 'maxtempC',
       'mintempC', 'totalSnow_cm', 'sunHour', 'uvIndex', 'uvIndex.1',
       'moon_illumination', 'moonrise', 'moonset', 'sunrise', 'sunset',
       'DewPointC', 'FeelsLikeC', 'HeatIndexC', 'WindChillC', 'WindGustKmph',
       'cloudcover', 'humidity', 'precipMM', 'pressure', 'tempC', 'visibility',
       'winddirDegree', 'windspeedKmph', 'date', 'day', 'holiday',
       'holiday_type'],
      dtype='object')


# **Scaling (Standardization)**

In [6]:
from sklearn.preprocessing import StandardScaler

# Standardize the relevant columns (zero mean, unit variance)
scaler = StandardScaler()
integrated_data[['tempC', 'humidity', 'windspeedKmph']] = scaler.fit_transform(integrated_data[['tempC', 'humidity', 'windspeedKmph']])

# Save the cleaned and merged data
integrated_data.to_csv('integrated_data.csv', index=False)
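
The plan at the top mentions both normalizing and standardizing, while the cell above uses `StandardScaler`, which standardizes. A small sketch of the difference in plain pandas, on made-up values; the column names match the notebook, the numbers do not come from the data:

```python
import pandas as pd

# Made-up values for two of the scaled columns.
df = pd.DataFrame({"tempC": [20.0, 25.0, 30.0], "humidity": [40.0, 60.0, 80.0]})
cols = ["tempC", "humidity"]

# Standardization: zero mean, unit variance.
# ddof=0 is the population standard deviation, which StandardScaler also uses.
standardized = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)

# Min-max normalization: rescale each column to the range [0, 1].
normalized = (df[cols] - df[cols].min()) / (df[cols].max() - df[cols].min())

print(standardized.round(3))
print(normalized)
```

Standardized values are unbounded and centered on zero; min-max values stay within [0, 1] but are more sensitive to outliers, so the choice depends on the model used later.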

In [7]:
# Display the first 10 rows of the combined dataset after scaling
print(integrated_data.head(10))

    DateTime  Junction  Vehicles           ID            date_time  maxtempC  \
0 2015-01-11         1        15  20151101001  2009-01-11 00:00:00      27.0   
1 2015-01-11         1        15  20151101001  2010-01-11 00:00:00      26.0   
2 2015-01-11         1        15  20151101001  2011-01-11 00:00:00      28.0   
3 2015-01-11         1        15  20151101001  2012-01-11 00:00:00      29.0   
4 2015-01-11         1        15  20151101001  2013-01-11 00:00:00      29.0   
5 2015-01-11         1        15  20151101001  2014-01-11 00:00:00      28.0   
6 2015-01-11         1        15  20151101001  2015-01-11 00:00:00      26.0   
7 2015-01-11         1        15  20151101001  2016-01-11 00:00:00      27.0   
8 2015-01-11         1        15  20151101001  2017-01-11 00:00:00      26.0   
9 2015-01-11         1        15  20151101001  2018-01-11 00:00:00      27.0   

   mintempC  totalSnow_cm  sunHour  uvIndex  ...  precipMM  pressure  \
0      15.0           0.0     11.6      6.0  ..

# **The data is now integrated, cleaned, and scaled into a single dataset that captures traffic patterns alongside weather and event information, setting the stage for exploration, visualization, and predictive modeling.**