<a href="https://colab.research.google.com/github/ikkaya/AirQualityPrediction/blob/main/AirQualityPrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [32]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor
from sklearn.preprocessing import OneHotEncoder

##Reading Data




In [17]:
df = pd.read_csv('cities_air_quality_water_pollution.18-10-2021.csv')
df.shape

(3963, 5)

##Exploring and Cleaning Data

In [18]:
df.head()

Unnamed: 0,City,"""Region""","""Country""","""AirQuality""","""WaterPollution"""
0,New York City,"""New York""","""United States of America""",46.816038,49.50495
1,"Washington, D.C.","""District of Columbia""","""United States of America""",66.129032,49.107143
2,San Francisco,"""California""","""United States of America""",60.514019,43.0
3,Berlin,"""""","""Germany""",62.36413,28.612717
4,Los Angeles,"""California""","""United States of America""",36.621622,61.299435


In [19]:
df= df.rename(columns={' "Region"': 'Region', ' "Country"': 'Country', ' "AirQuality"': 'AirQuality', ' "WaterPollution"':'WaterPollution'})
df.head()

Unnamed: 0,City,Region,Country,AirQuality,WaterPollution
0,New York City,"""New York""","""United States of America""",46.816038,49.50495
1,"Washington, D.C.","""District of Columbia""","""United States of America""",66.129032,49.107143
2,San Francisco,"""California""","""United States of America""",60.514019,43.0
3,Berlin,"""""","""Germany""",62.36413,28.612717
4,Los Angeles,"""California""","""United States of America""",36.621622,61.299435
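The cell values still carry the same leading space and literal double quotes that the column names had. Below is a minimal sketch of stripping both in one pass. It runs on a hypothetical two-row frame named `sample` that mimics the raw file; it is not the notebook's `df`.

```python
import pandas as pd

# Hypothetical miniature of the raw file: names and values keep a leading space and quotes
sample = pd.DataFrame({
    "City": ["Berlin", "Los Angeles"],
    ' "Region"': [' ""', ' "California"'],
    ' "Country"': [' "Germany"', ' "United States of America"'],
    ' "AirQuality"': [62.36413, 36.621622],
})

# Strip spaces and double quotes from every column name
sample.columns = sample.columns.str.strip(' "')

# Do the same for the values of the text columns
for col in ["Region", "Country"]:
    sample[col] = sample[col].str.strip(' "')

print(sample)
```

If this cleaning were applied to `df` itself, later filters that compare against quoted strings such as `' "Turkey"'` would need to compare against `'Turkey'` instead.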


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3963 entries, 0 to 3962
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   City            3963 non-null   object 
 1   Region          3963 non-null   object 
 2   Country         3963 non-null   object 
 3   AirQuality      3963 non-null   float64
 4   WaterPollution  3963 non-null   float64
dtypes: float64(2), object(3)
memory usage: 154.9+ KB


The dataset contains three categorical columns (City, Region, Country; dtype object) and two numerical columns (AirQuality, WaterPollution; dtype float64).

In [21]:
# Checking missing values:

print(df.isna().sum())

# Checking duplicated values:

print(df.duplicated().sum())

City              0
Region            0
Country           0
AirQuality        0
WaterPollution    0
dtype: int64
0



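Unfortunately, isna() does not catch the empty ("") values in the Region column, because they are stored as literal strings rather than as NaN. Since the column cannot be relied on, we drop it.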
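One way to surface such entries is sketched here on a hypothetical three-row Series rather than the real column. It strips the quotes and spaces, then converts empty strings to NaN so that `isna()` can count them.

```python
import numpy as np
import pandas as pd

# Hypothetical values shaped like the raw Region column (leading space, literal quotes)
region = pd.Series([' "New York"', ' ""', ' "California"'], name="Region")

# Remove surrounding spaces/quotes, then turn empty strings into real missing values
cleaned = region.str.strip(' "').replace("", np.nan)

n_missing = int(cleaned.isna().sum())
print("Missing regions:", n_missing)
```

On the raw Series, `isna().sum()` is still zero, which matches what the check above reported for the whole dataset.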

In [22]:
#Elimination of Region column:
df = df.drop(columns= ['Region'])
df.head()

Unnamed: 0,City,Country,AirQuality,WaterPollution
0,New York City,"""United States of America""",46.816038,49.50495
1,"Washington, D.C.","""United States of America""",66.129032,49.107143
2,San Francisco,"""United States of America""",60.514019,43.0
3,Berlin,"""Germany""",62.36413,28.612717
4,Los Angeles,"""United States of America""",36.621622,61.299435


##Data Analysis

The dataset can be queried to answer specific questions. For example, the cell below lists the cities in Turkey with the maximum air quality score of 100.

In [23]:
df.loc[(df["Country"]== ' "Turkey"') & (df["AirQuality"] == 100.0)]

Unnamed: 0,City,Country,AirQuality,WaterPollution
1768,Adapazari,"""Turkey""",100.0,75.0
2084,Fethiye,"""Turkey""",100.0,75.0
2136,Bayburt,"""Turkey""",100.0,0.0


In [24]:
df_environmentalquality = df.sort_values(['AirQuality','WaterPollution'] ,ascending=(False,True))
df_environmentalquality.head(20)

Unnamed: 0,City,Country,AirQuality,WaterPollution
166,Vaduz,"""Liechtenstein""",100.0,0.0
266,Ingolstadt,"""Germany""",100.0,0.0
305,Sibenik,"""Croatia""",100.0,0.0
436,Koprivnica,"""Croatia""",100.0,0.0
437,Bjelovar,"""Croatia""",100.0,0.0
438,Mary,"""Turkmenistan""",100.0,0.0
451,Pematangsiantar,"""Indonesia""",100.0,0.0
489,Girona,"""Spain""",100.0,0.0
545,Pamplona,"""Spain""",100.0,0.0
557,Magelang,"""Indonesia""",100.0,0.0


The list above shows the top 20 cities ranked by environmental quality, sorted by highest air quality first and, among ties, by lowest water pollution.

##Selecting The Prediction Target and The Features

In [25]:
y = df['AirQuality']
X = df.drop(['AirQuality'], axis=1)
X.head()

Unnamed: 0,City,Country,WaterPollution
0,New York City,"""United States of America""",49.50495
1,"Washington, D.C.","""United States of America""",49.107143
2,San Francisco,"""United States of America""",43.0
3,Berlin,"""Germany""",28.612717
4,Los Angeles,"""United States of America""",61.299435


##Train-Test Split dataset

In [26]:
# split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .20 ,random_state=4)

In [27]:
X_test

Unnamed: 0,City,Country,WaterPollution
1780,Wels,"""Austria""",0.000000
3102,Camarillo,"""United States of America""",0.000000
1035,Charleston,"""United States of America""",50.000000
693,Long Beach,"""United States of America""",32.692308
3887,Kulai,"""Malaysia""",50.000000
...,...,...,...
3580,Rohtak,"""India""",25.000000
3643,Zeeland,"""United States of America""",0.000000
1116,Virginia Beach,"""United States of America""",36.666667
1489,Lafayette,"""United States of America""",0.000000


##Preprocessing and Modeling


Much of the data is non-numeric. We will get an error if we feed these variables into our models without preprocessing them first.
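The error can be reproduced on a toy frame. The sketch below uses made-up values and hypothetical names (`toy_X`, `toy_y`); it only illustrates that a string column makes the fit fail.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Made-up data: one text column and one numeric column
toy_X = pd.DataFrame({
    "Country": ["Germany", "Spain", "Germany", "Spain"],
    "WaterPollution": [28.6, 0.0, 0.0, 50.0],
})
toy_y = [62.4, 100.0, 100.0, 40.0]

try:
    RandomForestRegressor(n_estimators=10, random_state=0).fit(toy_X, toy_y)
    fit_failed = False
except ValueError as err:
    # scikit-learn cannot convert the strings in "Country" to floats
    fit_failed = True
    print("Fit failed:", err)
```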

In [28]:
# Get list of categorical variables
t = (X_train.dtypes == 'object')
object_cols = list(t[t].index)

print("Categorical variables:")
print(object_cols)

Categorical variables:
['City', 'Country']


As a preprocessing step, we one-hot encode the categorical columns using the OneHotEncoder class from scikit-learn.

In [29]:
# Apply one-hot encoder to each column with categorical data
# (scikit-learn >= 1.2 names this argument sparse_output; older releases use sparse)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_test = pd.DataFrame(OH_encoder.transform(X_test[object_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_test.index = X_test.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_test = X_test.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_test = pd.concat([num_X_test, OH_cols_test], axis=1)

# Recent scikit-learn versions require all column names to be strings
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_test.columns = OH_X_test.columns.astype(str)
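The effect of `handle_unknown='ignore'` is easiest to see on a toy example. The sketch below uses hypothetical frames (`toy_train`, `toy_test`) and keeps the encoder's default sparse output, converting it with `.toarray()` so the code does not depend on the scikit-learn version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "France" appears only in the test set
toy_train = pd.DataFrame({"Country": ["Germany", "Spain", "Germany"]})
toy_test = pd.DataFrame({"Country": ["Spain", "France"]})

encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(toy_train).toarray()
test_encoded = encoder.transform(toy_test).toarray()

print(encoder.categories_)  # categories learned from the training data only
print(test_encoded)         # the unseen "France" row is encoded as all zeros
```

A category seen only at test time therefore contributes no information to the model. This matters for a column like City, where many test values may not occur in the training set.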


**Random Forest Model**

We build a random forest model by using the RandomForestRegressor class.

In [30]:
rf_model = RandomForestRegressor(n_estimators=100, random_state=0)
rf_model.fit(OH_X_train, y_train)
rf_predictions = rf_model.predict(OH_X_test)
# For a regressor, score() returns the coefficient of determination (R^2), not accuracy
rf_model_score = rf_model.score(OH_X_test, y_test)
print("Mean Absolute Error: ", mean_absolute_error(y_test, rf_predictions))
print("R^2 score: ", rf_model_score)

Mean Absolute Error:  15.820076358801876
R^2 score:  0.4453179264933825


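The mean absolute error is about 15.8 points on a 0 to 100 scale. The score() value, which for a regressor is the coefficient of determination R² rather than a classification accuracy, is only about 0.45, so the model explains less than half of the variance in air quality.
The weakness may lie in the data (for example, WaterPollution is the only numerical feature) or in our model.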
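To make the two metrics concrete, here is a small worked example on made-up numbers (the arrays `y_true` and `y_pred` are hypothetical). It computes MAE = mean(|y - ŷ|) and R² = 1 - SS_res / SS_tot by hand, so they can be compared with scikit-learn's `mean_absolute_error` and `r2_score`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 12.0])

# Mean absolute error: average size of the residuals
mae = np.mean(np.abs(y_true - y_pred))

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("MAE:", mae, "sklearn:", mean_absolute_error(y_true, y_pred))
print("R^2:", r2, "sklearn:", r2_score(y_true, y_pred))
```

Here the MAE is 1.25 and R² is about 0.45. A model that always predicted the mean of `y_true` would score R² = 0, so R² measures improvement over that baseline rather than a fraction of correct predictions.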

**XGBoost Model**

XGBoost is a gradient boosting library that often performs well on structured (tabular) data. We try it here to see whether it gives a more accurate model.

In [31]:
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05)
my_model.fit(OH_X_train, y_train)
predictions = my_model.predict(OH_X_test)
# As with the random forest, score() returns R^2 for a regressor
my_model_score = my_model.score(OH_X_test, y_test)
print("Mean Absolute Error: ", mean_absolute_error(y_test, predictions))
print("R^2 score: ", my_model_score)

Mean Absolute Error:  16.806899074952714
R^2 score:  0.4906244064643963


Compared with the random forest, XGBoost raises the score (R²) slightly, from about 0.45 to 0.49, but its mean absolute error is higher (16.8 versus 15.8), so neither model is clearly better. XGBoost has several parameters, such as n_estimators and learning_rate, that can strongly affect both model quality and training speed. Tuning them may raise the score and lower the mean absolute error. Assuming the dataset itself is not the problem, trying a different algorithm or evaluating with cross-validation may also be needed to improve the model.
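The tuning and cross-validation mentioned above can be sketched with scikit-learn's `GridSearchCV`. Two assumptions keep the example self-contained: it uses synthetic data from `make_regression` instead of the air quality dataset, and it swaps in scikit-learn's `GradientBoostingRegressor` as a stand-in for `XGBRegressor`, which accepts the same `n_estimators` and `learning_rate` arguments. The grid values are arbitrary illustrations, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (assumption: stands in for the encoded features)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Small illustrative grid over two boosting parameters
param_grid = {"n_estimators": [50, 200], "learning_rate": [0.05, 0.1]}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=5,                               # 5-fold cross-validation
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes, so MAE is negated
)
search.fit(X_demo, y_demo)

best_mae = -search.best_score_
print("Best parameters:", search.best_params_)
print("Cross-validated MAE:", best_mae)
```

Because each candidate is scored on five held-out folds, the reported MAE is less dependent on one particular train-test split than the single-split numbers above.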