
# ST 554 Project 1, Task 1
*By Cass Crews*

*Reviewed by Joy Zhou and Trevor Lillywhite*


## Introduction

To be added.

## Importing and Cleaning Data

In [11]:
# Installing UCI machine learning repository module
!pip install ucimlrepo

# Importing key modules
import pandas as pd
import numpy as np
import math
import ucimlrepo as uci



In [26]:
# Reading in the air quality data
air_quality = uci.fetch_ucirepo(id=360)

# Capturing data only
air_quality_df = air_quality.data.features

air_quality_df.head()

air_quality_df.columns

Index(['Date', 'Time', 'CO(GT)', 'PT08.S1(CO)', 'NMHC(GT)', 'C6H6(GT)',
       'PT08.S2(NMHC)', 'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)',
       'PT08.S5(O3)', 'T', 'RH', 'AH'],
      dtype='object')

In [29]:
# Creating more informative column names
air_quality_df = air_quality_df.rename(columns={
    "CO(GT)": "CO_true",
    "PT08.S1(CO)": "CO_sensor",
    "NMHC(GT)": "NMHC_true",
    "C6H6(GT)": "C6H6_true",
    "PT08.S2(NMHC)": "NMHC_sensor",
    "NOx(GT)": "NOx_true",
    "PT08.S3(NOx)": "NOx_sensor",
    "NO2(GT)": "NO2_true",
    "PT08.S4(NO2)": "NO2_sensor",
    "PT08.S5(O3)": "O3_sensor",
    "T": "Temp",
    "RH": "Rel_Humid",
    "AH": "Abs_Humid",
})

air_quality_df.columns

Index(['Date', 'Time', 'CO_true', 'CO_sensor', 'NMHC_true', 'C6H6_true',
       'NMHC_sensor', 'NOx_true', 'NOx_sensor', 'NO2_true', 'NO2_sensor',
       'O3_sensor', 'Temp', 'Rel_Humid', 'Abs_Humid'],
      dtype='object')

Let's extract some information on the data frame, such as the total number of observations, each column's data type, and the number of null values by column.

In [30]:
air_quality_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9357 entries, 0 to 9356
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Date         9357 non-null   object 
 1   Time         9357 non-null   object 
 2   CO_true      9357 non-null   float64
 3   CO_sensor    9357 non-null   int64  
 4   NMHC_true    9357 non-null   int64  
 5   C6H6_true    9357 non-null   float64
 6   NMHC_sensor  9357 non-null   int64  
 7   NOx_true     9357 non-null   int64  
 8   NOx_sensor   9357 non-null   int64  
 9   NO2_true     9357 non-null   int64  
 10  NO2_sensor   9357 non-null   int64  
 11  O3_sensor    9357 non-null   int64  
 12  Temp         9357 non-null   float64
 13  Rel_Humid    9357 non-null   float64
 14  Abs_Humid    9357 non-null   float64
dtypes: float64(5), int64(8), object(2)
memory usage: 1.1+ MB


Note that pandas does not report any null values, since the non-null count for each column equals the total number of observations (9,357). However, the data documentation indicates that missing values are coded as -200, which pandas treats as an ordinary number rather than a null value. We clearly see this issue when we generate some basic summary statistics for the numeric variables using the `describe()` method.

In [31]:
air_quality_df.describe()

| | CO_true | CO_sensor | NMHC_true | C6H6_true | NMHC_sensor | NOx_true | NOx_sensor | NO2_true | NO2_sensor | O3_sensor | Temp | Rel_Humid | Abs_Humid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9357.0 | 9357.0 | 9357.0 | 9357.0 | 9357.0 | 9357.0 | 9357.0 | 9357.0 | 9357.0 | 9357.0 | 9357.0 | 9357.0 | 9357.0 |
| mean | -34.207524 | 1048.990061 | -159.090093 | 1.865683 | 894.595276 | 168.616971 | 794.990168 | 58.148873 | 1391.479641 | 975.072032 | 9.778305 | 39.48538 | -6.837604 |
| std | 77.65717 | 329.83271 | 139.789093 | 41.380206 | 342.333252 | 257.433866 | 321.993552 | 126.940455 | 467.210125 | 456.938184 | 43.203623 | 51.216145 | 38.97667 |
| min | -200.0 | -200.0 | -200.0 | -200.0 | -200.0 | -200.0 | -200.0 | -200.0 | -200.0 | -200.0 | -200.0 | -200.0 | -200.0 |
| 25% | 0.6 | 921.0 | -200.0 | 4.0 | 711.0 | 50.0 | 637.0 | 53.0 | 1185.0 | 700.0 | 10.9 | 34.1 | 0.6923 |
| 50% | 1.5 | 1053.0 | -200.0 | 7.9 | 895.0 | 141.0 | 794.0 | 96.0 | 1446.0 | 942.0 | 17.2 | 48.6 | 0.9768 |
| 75% | 2.6 | 1221.0 | -200.0 | 13.6 | 1105.0 | 284.0 | 960.0 | 133.0 | 1662.0 | 1255.0 | 24.1 | 61.9 | 1.2962 |
| max | 11.9 | 2040.0 | 1189.0 | 63.7 | 2214.0 | 1479.0 | 2683.0 | 340.0 | 2775.0 | 2523.0 | 44.6 | 88.7 | 2.231 |


The `min` of every numeric column is -200, so each numeric variable has at least some missing values. The placeholder also distorts the other statistics; for example, the mean of `CO_true` is negative. In fact, for `NMHC_true`, the 25th, 50th, and 75th percentiles are all -200, meaning at least 75% of observations are missing the true NMHC level.
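
The effect of the -200 placeholder on summary statistics can be illustrated with a small sketch. The toy data frame `toy_df` below is hypothetical (it is not drawn from the air quality data), and recoding -200 as `NaN` with `replace()` is only one reasonable way to make pandas skip those entries; it is shown here as an assumption, not as the approach this notebook ultimately takes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; -200 stands in for a missing reading
toy_df = pd.DataFrame({"CO_true": [2.0, -200.0, 1.0, 3.0],
                       "Temp": [10.0, 20.0, -200.0, -200.0]})

# With the placeholder left in, the means are pulled far below the real readings
raw_means = toy_df.mean()

# Recoding the placeholder as NaN so pandas treats it as missing
toy_clean = toy_df.replace(-200, np.nan)

# Means now use only the observed values (NaN is skipped by default)
clean_means = toy_clean.mean()

print(raw_means)
print(clean_means)
print(toy_clean.isna().sum())
```

After the recode, `isna()` and the non-null counts from `info()` would reflect the missing entries directly.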

Let's extract the total number of missing values for each numeric variable so we have a complete understanding of missing value rates.

In [36]:
# Counting missing values (coded as -200) for each numeric column
n_obs = len(air_quality_df)
for v in air_quality_df.select_dtypes(include="number").columns:
    n_missing = (air_quality_df[v] == -200).sum()
    pct_missing = round(100 * n_missing / n_obs, 2)
    print(v, " Missing Value Count: ", n_missing, " (", pct_missing, "%)", sep="")

CO_true Missing Value Count: 1683 (17.99%)
CO_sensor Missing Value Count: 366 (3.91%)
NMHC_true Missing Value Count: 8443 (90.23%)
C6H6_true Missing Value Count: 366 (3.91%)
NMHC_sensor Missing Value Count: 366 (3.91%)
NOx_true Missing Value Count: 1639 (17.52%)
NOx_sensor Missing Value Count: 366 (3.91%)
NO2_true Missing Value Count: 1642 (17.55%)
NO2_sensor Missing Value Count: 366 (3.91%)
O3_sensor Missing Value Count: 366 (3.91%)
Temp Missing Value Count: 366 (3.91%)
Rel_Humid Missing Value Count: 366 (3.91%)
Abs_Humid Missing Value Count: 366 (3.91%)
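
The same kind of count can also be computed without an explicit loop. The following is a minimal sketch on a hypothetical toy data frame (`toy_df`, `is_missing`, and `missing_summary` are illustrative names, not part of the original analysis), assuming -200 is the only missing-value code.

```python
import pandas as pd

# Hypothetical toy data; -200 marks a missing reading
toy_df = pd.DataFrame({"Date": ["3/10/2004"] * 4,
                       "CO_true": [2.6, -200.0, 2.2, -200.0],
                       "CO_sensor": [1360, 1292, -200, 1376]})

# Flagging placeholder values in the numeric columns only
is_missing = toy_df.select_dtypes(include="number").eq(-200)

# Column-wise counts and percentages of flagged values
missing_summary = pd.DataFrame({"count": is_missing.sum(),
                                "percent": (100 * is_missing.mean()).round(2)})

print(missing_summary)
```

Because the mean of a boolean column is the share of `True` values, `is_missing.mean()` gives the missing rate for each column in one step, and collecting the results in a data frame makes them easy to sort or filter.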
