# Data Formatting

Before running the preprocessing file and the predictive model, make sure the datasets are formatted correctly. The `Data_Formating.ipynb` notebook provided in the repository performs this formatting; you may need to modify its code to suit your own data.

The final dataset must contain only numerical data. It must also include a column of correctly formatted datetime values, which is used as the index for time-series analysis in later steps.
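
Both requirements can be checked programmatically before moving on. The following is a minimal sketch and not part of the repository: `check_formatted` is a hypothetical helper name, and the toy DataFrames stand in for a real formatted dataset.

```python
import pandas as pd


def check_formatted(df: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means the frame looks ready."""
    problems = []
    # The index should hold datetime values for time-series analysis.
    if not isinstance(df.index, pd.DatetimeIndex):
        problems.append("index is not a DatetimeIndex")
    # Every column should have a numeric dtype.
    non_numeric = df.select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")
    return problems


# Toy data standing in for a formatted dataset (values are illustrative only).
good = pd.DataFrame(
    {"PM2.5": [4.0, 8.0], "wd": [0, 1]},
    index=pd.to_datetime(["2013-03-01 00:00", "2013-03-01 01:00"]),
)
# Same data, but with text labels in 'wd' and a plain integer index.
bad = good.assign(wd=["NNW", "N"]).reset_index(drop=True)

print(check_formatted(good))
print(check_formatted(bad))
```

Returning a list of problems rather than raising on the first one makes it easier to see everything that still needs fixing in a single pass.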


## Importing Libraries

In [1]:
import pandas as pd

## Formatting the Aotizhongxin Dataset

### 1. Reading Dataset

In [2]:
# Read the CSV file into a DataFrame, treating the string 'None' as a missing value
df = pd.read_csv('./Raw Datasets/Aotizhongxin.csv', na_values='None')

### 2. Creating the Datetime Column 

In [3]:
# Build a 'YYYY-MM-DD HH:00:00' string from the 'year', 'month', 'day', and 'hour' columns
df['date_str'] = df.apply(lambda x: f"{int(x['year'])}-{int(x['month']):02d}-{int(x['day']):02d} {int(x['hour']):02d}:00:00", axis=1)

# Convert the date_str column to datetime format
df['Datetime'] = pd.to_datetime(df['date_str'])

# Drop the intermediate date_str column, the date component columns, and the 'station' and 'No' columns
df.drop(columns=['date_str','station','No','month','hour','day','year'], inplace=True)

# Set 'Datetime' as the index
df.set_index('Datetime', inplace=True)
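
The cell above builds a string for every row with `apply` and then parses it. As a possible alternative, `pd.to_datetime` can assemble timestamps directly from columns named `year`, `month`, `day`, and `hour`, which avoids the intermediate string column. The sketch below uses a toy DataFrame named `raw` (a hypothetical stand-in for the raw file) rather than the actual dataset.

```python
import pandas as pd

# Toy rows with the same component columns as the raw file (values are illustrative).
raw = pd.DataFrame(
    {
        "year": [2013, 2013],
        "month": [3, 3],
        "day": [1, 1],
        "hour": [0, 1],
        "PM2.5": [4.0, 8.0],
    }
)

# pd.to_datetime assembles timestamps from columns named year, month, day, hour.
raw["Datetime"] = pd.to_datetime(raw[["year", "month", "day", "hour"]])

# Drop the component columns and index by the assembled timestamp.
formatted = raw.drop(columns=["year", "month", "day", "hour"]).set_index("Datetime")

print(formatted)
```

On large files this vectorized form is usually faster than a row-wise `apply`, though the result should be the same.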

### 3. Converting the Categorical Columns into Numerical Columns

In [4]:
# Encode the wind-direction labels in 'wd' as integer codes (missing values become -1)
df['wd'] = pd.factorize(df['wd'], use_na_sentinel=True)[0]
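
`pd.factorize` returns both the integer codes and the unique labels, and by default it assigns the sentinel code `-1` to missing entries. The sketch below, on a toy Series with illustrative values, shows one way to keep the code-to-label mapping and to turn the sentinel back into `NaN`. Whether that is desirable depends on how later steps handle missing values, so treat it as an option rather than the notebook's method.

```python
import numpy as np
import pandas as pd

# Toy wind-direction values, including one missing entry (illustrative only).
wd = pd.Series(["NNW", "N", "NNW", np.nan, "NW"])

# factorize returns integer codes plus the unique labels in order of first appearance.
codes, labels = pd.factorize(wd)

# Missing entries receive the sentinel code -1 by default.
# Mapping the sentinel back to NaN keeps them visible to a later imputation step.
wd_numeric = pd.Series(codes, index=wd.index).where(codes != -1).astype(float)

# Keep the code-to-label mapping so the encoding can be reversed later.
mapping = dict(enumerate(labels))

print(wd_numeric.tolist())
print(mapping)
```

Note that these codes are arbitrary identifiers assigned in order of first appearance; they carry no information about the compass ordering of the directions.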

### 4. Saving the Data and Printing the Final DataFrame

In [5]:
df.to_csv('./Datasets/Without Imputation/Final_Dataset_Aotizhongxin.csv', index=True) 

df

| Datetime | PM2.5 | PM10 | SO2 | NO2 | CO | O3 | TEMP | PRES | DEWP | RAIN | wd | WSPM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013-03-01 00:00:00 | 4.0 | 4.0 | 4.0 | 7.0 | 300.0 | 77.0 | -0.7 | 1023.0 | -18.8 | 0.0 | 0 | 4.4 |
| 2013-03-01 01:00:00 | 8.0 | 8.0 | 4.0 | 7.0 | 300.0 | 77.0 | -1.1 | 1023.2 | -18.2 | 0.0 | 1 | 4.7 |
| 2013-03-01 02:00:00 | 7.0 | 7.0 | 5.0 | 10.0 | 300.0 | 73.0 | -1.1 | 1023.5 | -18.2 | 0.0 | 0 | 5.6 |
| 2013-03-01 03:00:00 | 6.0 | 6.0 | 11.0 | 11.0 | 300.0 | 72.0 | -1.4 | 1024.5 | -19.4 | 0.0 | 2 | 3.1 |
| 2013-03-01 04:00:00 | 3.0 | 3.0 | 12.0 | 12.0 | 300.0 | 72.0 | -2.0 | 1025.2 | -19.5 | 0.0 | 1 | 2.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2017-02-28 19:00:00 | 12.0 | 29.0 | 5.0 | 35.0 | 400.0 | 95.0 | 12.5 | 1013.5 | -16.2 | 0.0 | 2 | 2.4 |
| 2017-02-28 20:00:00 | 13.0 | 37.0 | 7.0 | 45.0 | 500.0 | 81.0 | 11.6 | 1013.6 | -15.1 | 0.0 | 11 | 0.9 |
| 2017-02-28 21:00:00 | 16.0 | 37.0 | 10.0 | 66.0 | 700.0 | 58.0 | 10.8 | 1014.2 | -13.3 | 0.0 | 2 | 1.1 |
| 2017-02-28 22:00:00 | 21.0 | 44.0 | 12.0 | 87.0 | 700.0 | 35.0 | 10.5 | 1014.4 | -12.9 | 0.0 | 0 | 1.2 |
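
Because the file is saved with `index=True`, the `Datetime` index is written as the first CSV column. When the file is loaded again in a later step, the index has to be restored explicitly, otherwise it comes back as an ordinary text column. The sketch below shows one way to do that with `index_col` and `parse_dates`; it writes to an in-memory buffer instead of the repository's paths so it can run on its own, and the toy frame `df_out` is illustrative.

```python
import io

import pandas as pd

# Toy frame with a Datetime index, standing in for the formatted dataset.
df_out = pd.DataFrame(
    {"PM2.5": [4.0, 8.0]},
    index=pd.DatetimeIndex(["2013-03-01 00:00", "2013-03-01 01:00"], name="Datetime"),
)

# Write to an in-memory buffer instead of a file so the sketch is self-contained.
buffer = io.StringIO()
df_out.to_csv(buffer, index=True)
buffer.seek(0)

# index_col and parse_dates together restore the DatetimeIndex on load.
df_in = pd.read_csv(buffer, index_col="Datetime", parse_dates=True)

print(df_in.index)
```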


## Formatting the Ghaziabad Dataset

### 1. Reading Dataset

In [6]:
# Read the CSV file into a DataFrame, treating the string 'None' as a missing value
df = pd.read_csv('./Raw Datasets/Ghaziabad.csv', na_values='None')

### 2. Creating the Datetime Column 

In [7]:
# Convert 'From Date' and 'To Date' columns to datetime
df['From Date'] = pd.to_datetime(df['From Date'])
df['To Date'] = pd.to_datetime(df['To Date'])
# Set 'From Date' as the index
df.set_index('From Date', inplace=True)
# Drop 'To Date' and rename the index to 'Datetime'
df.drop(columns=['To Date'], inplace=True)
df = df.rename_axis('Datetime')
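
Since the `Datetime` index is meant for time-series analysis, it can help to confirm that it has no repeated or missing timestamps. The sketch below is one way to check this on a toy frame. It assumes hourly sampling, which matches the rows shown in this notebook's output but should be verified for other data; `df_toy` and its values are illustrative.

```python
import pandas as pd

# Toy hourly index with one missing hour and one repeated timestamp (illustrative only).
idx = pd.to_datetime(
    ["2017-01-11 00:00", "2017-01-11 01:00", "2017-01-11 01:00", "2017-01-11 03:00"]
)
df_toy = pd.DataFrame({"PM2.5": [332.5, 295.5, 295.5, 248.5]}, index=idx).rename_axis("Datetime")

# Timestamps that occur more than once.
duplicated = df_toy.index[df_toy.index.duplicated()].unique()

# Hours between the first and last timestamp that never appear (assumes hourly data).
expected = pd.date_range(
    df_toy.index.min(), df_toy.index.max(), freq=pd.Timedelta(hours=1)
)
missing = expected.difference(df_toy.index)

print("duplicated:", list(duplicated))
print("missing:", list(missing))
```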

### 3. Saving the Data and Printing the Final DataFrame

In [8]:
df.to_csv('./Datasets/Without Imputation/Final_Dataset_Ghaziabad.csv', index=True) 

df

| Datetime | PM2.5 | PM10 | NO | NO2 | NOx | NH3 | SO2 | CO | Ozone | Benzene | Toluene | Temp | RH | WS | WD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017-01-11 00:00:00 | 332.50 | 494.00 | 23.90 | 107.45 | 76.58 | 100.97 | 55.50 | 3.00 | 8.90 | 0.60 | NaN | 31.38 | 80.75 | 0.72 | 143.00 |
| 2017-01-11 01:00:00 | 295.50 | 435.50 | 20.42 | 95.42 | 67.38 | 104.13 | 45.20 | 2.44 | 10.70 | 0.55 | NaN | 30.73 | 81.25 | 0.53 | 126.75 |
| 2017-01-11 02:00:00 | 270.00 | 395.00 | 18.22 | 77.62 | 56.10 | 99.42 | 29.12 | 1.97 | 12.18 | 0.50 | NaN | 30.65 | 82.00 | 0.60 | 161.25 |
| 2017-01-11 03:00:00 | 248.50 | 352.75 | 16.52 | 74.72 | 53.18 | 97.13 | 20.62 | 1.77 | 10.75 | 0.43 | NaN | 30.63 | 83.00 | 0.50 | 113.75 |
| 2017-01-11 04:00:00 | 261.75 | 365.50 | 16.43 | 82.20 | 57.07 | 97.62 | 17.85 | 1.65 | 8.72 | 0.40 | NaN | 31.23 | 85.25 | 0.85 | 104.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2021-12-11 19:00:00 | 316.25 | 709.50 | 185.43 | 139.25 | 224.80 | 46.30 | 16.30 | 4.47 | 38.43 | 14.35 | 74.10 | 31.07 | 81.25 | 0.32 | 151.00 |
| 2021-12-11 20:00:00 | 397.75 | 743.75 | 242.80 | 148.68 | 276.50 | 43.05 | 16.62 | 4.98 | 39.30 | 14.70 | 76.80 | 29.60 | 84.50 | 0.30 | 151.00 |
| 2021-12-11 21:00:00 | 471.50 | 765.75 | 276.50 | 137.30 | 297.80 | 40.27 | 15.95 | 0.01 | 37.17 | 15.70 | 79.40 | 29.57 | 86.75 | 0.30 | 151.00 |
| 2021-12-11 22:00:00 | 500.00 | 763.50 | 292.30 | 118.07 | 300.48 | 42.33 | 17.05 | 7.37 | 51.85 | 15.60 | 79.22 | 29.43 | 87.25 | 0.35 | 151.00 |
