# bilaspur-air-quality


## Install the required libraries

In [2]:
!pip install pandas scikit-learn --upgrade --quiet

In [3]:
import pandas as pd


### Load the uploaded CSV file into a pandas DataFrame

In [4]:
df=pd.read_csv('bilaspur-air-quality.csv')

In [5]:
df

Unnamed: 0,date,pm25,pm10,no2,so2,co
0,2023/6/1,122,59,3,6,21
1,2023/6/2,95,54,3,5,13
2,2023/6/3,99,58,3,5,11
3,2023/6/4,101,55,3,4,6
4,2023/6/5,82,51,4,5,18
...,...,...,...,...,...,...
518,2022/5/22,,54,6,3,7
519,2022/6/12,,50,6,2,17
520,2022/6/19,,49,7,3,8
521,2022/1/16,,22,5,3,7


# Data cleaning
### Check the data types and null counts with the info() function

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 523 entries, 0 to 522
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   date    523 non-null    object
 1    pm25   523 non-null    object
 2    pm10   523 non-null    object
 3    no2    523 non-null    object
 4    so2    523 non-null    object
 5    co     523 non-null    object
dtypes: object(6)
memory usage: 24.6+ KB


**The output shows no null values, but every column has the object dtype. The date column should therefore be converted to a datetime dtype, and the remaining five columns to float, because they hold numeric values. Note that the five pollutant column names carry a leading space (for example ' pm25'), so later cells index them with that space.**

In [7]:
# Convert the date column to the datetime dtype
df['date']=pd.to_datetime(df['date'])

**Converting the object columns (pm25, pm10, no2, so2, and co) to a numeric dtype initially failed because some cells contain non-numeric text such as a blank " ". Those entries have to be handled before the columns can be converted.**

In [8]:
# Convert the object columns to a numeric dtype; errors='coerce' turns unparseable entries into NaN
df[' pm25'] = pd.to_numeric(df[' pm25'], errors='coerce')
df[' pm10'] = pd.to_numeric(df[' pm10'], errors='coerce')
df[' no2'] = pd.to_numeric(df[' no2'], errors='coerce')
df[' so2'] = pd.to_numeric(df[' so2'], errors='coerce')
df[' co'] = pd.to_numeric(df[' co'], errors='coerce')
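
A small sketch of what errors='coerce' does, using a made-up Series rather than the real file: entries that cannot be parsed as numbers become NaN instead of raising an error, and the rows themselves stay in place.

```python
import pandas as pd

# Hypothetical sample column: numbers stored as text, plus a blank and a text entry
raw = pd.Series(['122', ' ', '95', 'n/a'])

# Unparseable entries (' ' and 'n/a') are converted to NaN; no row is dropped
clean = pd.to_numeric(raw, errors='coerce')

print(clean.tolist())
print(clean.isna().sum())
```

This is why the non-null counts fall after the conversion while the total number of entries stays the same.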

In [9]:
# After the conversion, re-check the dtypes and the number of non-null values in each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 523 entries, 0 to 522
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    523 non-null    datetime64[ns]
 1    pm25   507 non-null    float64       
 2    pm10   503 non-null    float64       
 3    no2    502 non-null    float64       
 4    so2    507 non-null    float64       
 5    co     509 non-null    float64       
dtypes: datetime64[ns](1), float64(5)
memory usage: 24.6 KB


**The info() output shows that the dtypes now match the data: the date column is datetime64 and the numeric columns are float64. The non-numeric entries were not removed; they became NaN, which is why the non-null counts are now below 523.**

In [10]:
# Look at the DataFrame after the conversion
df

Unnamed: 0,date,pm25,pm10,no2,so2,co
0,2023-06-01,122.0,59.0,3.0,6.0,21.0
1,2023-06-02,95.0,54.0,3.0,5.0,13.0
2,2023-06-03,99.0,58.0,3.0,5.0,11.0
3,2023-06-04,101.0,55.0,3.0,4.0,6.0
4,2023-06-05,82.0,51.0,4.0,5.0,18.0
...,...,...,...,...,...,...
518,2022-05-22,,54.0,6.0,3.0,7.0
519,2022-06-12,,50.0,6.0,2.0,17.0
520,2022-06-19,,49.0,7.0,3.0,8.0
521,2022-01-16,,22.0,5.0,3.0,7.0


**The DataFrame still contains NaN values. These rows need to be dropped before the modelling process starts.**

In [11]:
# Remove rows containing NaN values with the dropna() method
df=df.dropna()
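
With default arguments, dropna() removes every row that has at least one NaN in any column, so the remaining row count can be lower than the smallest per-column non-null count. The sketch below shows this on a tiny made-up frame, not the real data, and uses isna().sum() to count the missing values per column first.

```python
import numpy as np
import pandas as pd

# Hypothetical frame using the same space-prefixed column names as the dataset
sample = pd.DataFrame({' pm25': [122.0, np.nan, 99.0],
                       ' pm10': [59.0, 54.0, np.nan]})

missing_per_column = sample.isna().sum()   # NaN count for each column
cleaned = sample.dropna()                  # keeps only rows with no NaN at all

print(missing_per_column)
print(cleaned.shape)
```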

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 482 entries, 0 to 506
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    482 non-null    datetime64[ns]
 1    pm25   482 non-null    float64       
 2    pm10   482 non-null    float64       
 3    no2    482 non-null    float64       
 4    so2    482 non-null    float64       
 5    co     482 non-null    float64       
dtypes: datetime64[ns](1), float64(5)
memory usage: 26.4 KB


**info() now reports no null values, and each column's dtype matches its values.**

In [13]:
# Sort the dataset by the date column in descending order, so the most recent date is in the first row
df=df.sort_values('date', ascending=False)
df

Unnamed: 0,date,pm25,pm10,no2,so2,co
21,2023-06-22,75.0,35.0,5.0,5.0,21.0
20,2023-06-21,67.0,50.0,6.0,4.0,19.0
19,2023-06-20,95.0,52.0,5.0,4.0,4.0
18,2023-06-19,129.0,39.0,4.0,2.0,8.0
17,2023-06-18,122.0,64.0,6.0,4.0,19.0
...,...,...,...,...,...,...
505,2022-01-01,73.0,54.0,6.0,5.0,5.0
504,2021-12-31,78.0,51.0,6.0,5.0,7.0
503,2021-12-30,75.0,53.0,6.0,5.0,5.0
502,2021-12-29,74.0,52.0,5.0,5.0,5.0


In [14]:
# Set the date column as the index and take the first 20 rows, i.e. the 20 most recent days
latest_20_days_details= df.set_index('date').head(20)
latest_20_days_details

date,pm25,pm10,no2,so2,co
2023-06-22,75.0,35.0,5.0,5.0,21.0
2023-06-21,67.0,50.0,6.0,4.0,19.0
2023-06-20,95.0,52.0,5.0,4.0,4.0
2023-06-19,129.0,39.0,4.0,2.0,8.0
2023-06-18,122.0,64.0,6.0,4.0,19.0
2023-06-17,122.0,60.0,6.0,5.0,8.0
2023-06-16,100.0,61.0,5.0,5.0,17.0
2023-06-15,98.0,55.0,3.0,4.0,15.0
2023-06-14,82.0,57.0,3.0,5.0,15.0
2023-06-13,50.0,54.0,4.0,5.0,22.0


### Check the variation over the 20 most recent dates with the describe() method

In [15]:
latest_20_days_details.describe()

Unnamed: 0,pm25,pm10,no2,so2,co
count,20.0,20.0,20.0,20.0,20.0
mean,93.35,53.3,4.25,4.45,14.9
std,19.958576,6.966839,1.292692,0.759155,5.757375
min,50.0,35.0,2.0,2.0,4.0
25%,81.0,50.75,3.0,4.0,10.25
50%,98.0,55.0,4.0,5.0,16.0
75%,102.0,57.25,5.0,5.0,19.0
max,129.0,64.0,6.0,5.0,23.0


**describe() gives mean values for pm25, pm10, no2, so2, and co of 93.35, 53.3, 4.25, 4.45, and 14.9 respectively. The corresponding standard deviations are 19.96, 6.97, 1.29, 0.76, and 5.76.**

In [16]:
# Create a new DataFrame of target values for the next three days, based on the mean
# and standard deviation of pm25, pm10, no2, so2, and co over the latest 20 days.
# The middle day ('2023-06-24') gets the mean, the first day ('2023-06-23') roughly
# mean - std, and the last day ('2023-06-25') roughly mean + std.
# The values are entered by hand, so they only approximate mean +/- std.
# They are stored as numbers rather than strings so they can be used numerically.
target_col = pd.DataFrame({'date': ['2023-06-23', '2023-06-24', '2023-06-25'],
                           ' pm25': [88.0, 93.35, 98.0],
                           ' pm10': [47.0, 53.3, 59.0],
                           ' no2': [3.0, 4.25, 5.5],
                           ' so2': [3.8, 4.45, 5.2],
                           ' co': [9.0, 14.9, 18.5]
                           })

In [17]:
target_col

Unnamed: 0,date,pm25,pm10,no2,so2,co
0,2023-06-23,88.0,47.0,3.0,3.8,9.0
1,2023-06-24,93.35,53.3,4.25,4.45,14.9
2,2023-06-25,98.0,59.0,5.5,5.2,18.5
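
The hand-entered targets could also be derived from the statistics directly. The sketch below is one way to do that, assuming a frame shaped like latest_20_days_details; the helper name build_targets and the toy numbers are illustrative only, and the exact mean ± std values it produces would differ from the rounded figures above.

```python
import pandas as pd

def build_targets(recent: pd.DataFrame, dates: list) -> pd.DataFrame:
    """Return three rows: mean - std, mean, mean + std for every column."""
    mean = recent.mean()
    std = recent.std()          # sample standard deviation, as reported by describe()
    rows = [mean - std, mean, mean + std]
    targets = pd.DataFrame(rows).reset_index(drop=True)
    targets.insert(0, 'date', dates)
    return targets

# Toy stand-in for latest_20_days_details (made-up values)
recent = pd.DataFrame({' pm25': [80.0, 90.0, 100.0], ' co': [10.0, 15.0, 20.0]})
targets = build_targets(recent, ['2023-06-23', '2023-06-24', '2023-06-25'])
print(targets)
```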


In [18]:
input_cols=[' pm25', ' pm10',' no2', ' so2', ' co']

### Split the dataset into train_df (60% of the original data), val_df (20% of the original data) and test_df (20% of the original data)

In [19]:
from sklearn.model_selection import train_test_split

In [20]:
train_val_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df, val_df = train_test_split(train_val_df, test_size=0.25, random_state=42)

In [21]:
print('train_df.shape :', train_df.shape)
print('val_df.shape :', val_df.shape)
print('test_df.shape :', test_df.shape)

train_df.shape : (288, 6)
val_df.shape : (97, 6)
test_df.shape : (97, 6)
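
The second split uses test_size=0.25 because it is applied to the remaining 80% of the data: 0.25 × 0.8 = 0.2 of the original. The sketch below reproduces the row counts on a placeholder array of 482 items, the number of rows left after dropna().

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(482)  # placeholder for the 482 cleaned rows

# First hold out 20% for testing, then 25% of the remaining 80% for validation
train_val, test = train_test_split(rows, test_size=0.2, random_state=42)
train, val = train_test_split(train_val, test_size=0.25, random_state=42)

print(len(train), len(val), len(test))
```

The split sizes are rounded to whole rows, so the proportions come out close to 60/20/20 rather than exact.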


In [22]:
train_inputs = train_df[input_cols].copy()
train_targets = target_col.copy()


In [23]:
val_inputs = val_df[input_cols].copy()
val_targets = target_col.copy()

In [24]:
test_inputs = test_df[input_cols].copy()
test_targets = target_col.copy()

In [25]:
train_inputs

Unnamed: 0,pm25,pm10,no2,so2,co
364,77.0,50.0,6.0,4.0,8.0
90,103.0,76.0,8.0,2.0,12.0
466,75.0,48.0,5.0,4.0,3.0
285,83.0,51.0,7.0,3.0,20.0
272,78.0,50.0,7.0,3.0,25.0
...,...,...,...,...,...
359,76.0,47.0,5.0,4.0,6.0
400,85.0,54.0,6.0,2.0,16.0
56,68.0,50.0,4.0,4.0,10.0
71,41.0,44.0,4.0,4.0,7.0


In [26]:
val_inputs

Unnamed: 0,pm25,pm10,no2,so2,co
482,73.0,50.0,5.0,6.0,8.0
173,75.0,46.0,4.0,3.0,20.0
284,83.0,52.0,6.0,3.0,16.0
403,81.0,56.0,6.0,2.0,27.0
217,67.0,66.0,10.0,4.0,8.0
...,...,...,...,...,...
113,65.0,36.0,5.0,2.0,15.0
228,54.0,42.0,4.0,4.0,12.0
481,73.0,48.0,6.0,6.0,9.0
39,116.0,40.0,3.0,5.0,16.0


In [27]:
test_inputs

Unnamed: 0,pm25,pm10,no2,so2,co
477,72.0,46.0,6.0,7.0,16.0
420,87.0,71.0,5.0,4.0,3.0
208,91.0,84.0,6.0,5.0,14.0
449,81.0,47.0,6.0,5.0,8.0
263,76.0,51.0,5.0,3.0,7.0
...,...,...,...,...,...
168,80.0,46.0,4.0,4.0,10.0
335,75.0,53.0,6.0,2.0,12.0
411,78.0,55.0,5.0,3.0,7.0
396,78.0,51.0,5.0,3.0,10.0


In [28]:
train_targets

Unnamed: 0,date,pm25,pm10,no2,so2,co
0,2023-06-23,88.0,47.0,3.0,3.8,9.0
1,2023-06-24,93.35,53.3,4.25,4.45,14.9
2,2023-06-25,98.0,59.0,5.5,5.2,18.5


In [29]:
val_targets

Unnamed: 0,date,pm25,pm10,no2,so2,co
0,2023-06-23,88.0,47.0,3.0,3.8,9.0
1,2023-06-24,93.35,53.3,4.25,4.45,14.9
2,2023-06-25,98.0,59.0,5.5,5.2,18.5


In [30]:
test_targets

Unnamed: 0,date,pm25,pm10,no2,so2,co
0,2023-06-23,88.0,47.0,3.0,3.8,9.0
1,2023-06-24,93.35,53.3,4.25,4.45,14.9
2,2023-06-25,98.0,59.0,5.5,5.2,18.5


# Preprocessing the dataset

### Imputation

In [31]:
from sklearn.impute import SimpleImputer

In [32]:
imputer= SimpleImputer().fit(train_inputs)

In [33]:
train_inputs= imputer.transform(train_inputs)
val_inputs= imputer.transform(val_inputs)
test_inputs= imputer.transform(test_inputs)

### Scaling

In [34]:
from sklearn.preprocessing import MinMaxScaler

In [35]:
scaler= MinMaxScaler().fit(train_inputs)

In [36]:
train_inputs= scaler.transform(train_inputs)
val_inputs= scaler.transform(val_inputs)
test_inputs= scaler.transform(test_inputs)
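
MinMaxScaler rescales each column with (x - min) / (max - min), where min and max come from the data it was fitted on. Because the scaler is fitted on the training inputs only, validation or test values outside the training range can fall outside [0, 1]. A small sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[50.0], [75.0], [100.0]])   # made-up training column
val = np.array([[125.0]])                     # larger than any training value

scaler = MinMaxScaler().fit(train)            # learns min=50 and max=100 from train only

train_scaled = scaler.transform(train)
val_scaled = scaler.transform(val)            # (125 - 50) / (100 - 50) = 1.5

print(train_scaled.ravel(), val_scaled.ravel())
```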

**The dataset contains no categorical values, so no encoder is needed. All features are continuous, so any regression model such as linear, Lasso, Ridge, or XGBRegressor could be applied. We use XGBRegressor here because we expect it to give the best predictions on this dataset.**

# XGBRegressor Modeling

In [37]:
!pip install xgboost --upgrade --quiet

In [38]:
from xgboost import XGBRegressor

In [39]:
model= XGBRegressor(random_state=42, n_jobs=-1)

In [40]:
model

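In [None]:
# fit() returns the fitted model, not predictions. It expects one target row per input row,
# which the 3-row target_col does not provide for the 288-row train_inputs.
model.fit(train_inputs, train_targets)

In [None]:
# predict() takes input features, not targets
predictions = model.predict(train_inputs)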
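
A supervised model needs one target row for each input row. One possible way to get that, which is an assumption about the goal and not what the notebook above does, is to use each day's readings as inputs and the following day's readings as targets. The sketch uses the first five rows shown in the DataFrame output earlier; the names sample, X and y are illustrative.

```python
import pandas as pd

cols = [' pm25', ' pm10', ' no2', ' so2', ' co']
sample = pd.DataFrame({
    'date': pd.to_datetime(['2023-06-01', '2023-06-02', '2023-06-03',
                            '2023-06-04', '2023-06-05']),
    ' pm25': [122.0, 95.0, 99.0, 101.0, 82.0],
    ' pm10': [59.0, 54.0, 58.0, 55.0, 51.0],
    ' no2': [3.0, 3.0, 3.0, 3.0, 4.0],
    ' so2': [6.0, 5.0, 5.0, 4.0, 5.0],
    ' co': [21.0, 13.0, 11.0, 6.0, 18.0],
}).sort_values('date').reset_index(drop=True)

# Next-day readings become the targets for the current day's inputs
next_day = sample[cols].shift(-1)

# The last day has no following day, so it is dropped from both sides
X = sample[cols].iloc[:-1]
y = next_day.iloc[:-1]

print(X.shape, y.shape)
```

X and y now have the same number of rows, which is what fit(X, y) requires. The sketch assumes consecutive calendar days; gaps left by dropna() would need separate handling.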