# Data Preprocessing

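This notebook covers the data preprocessing stage of the project: loading the raw weather data, checking it for missing values, incorrect data types, and duplicates, summarizing the key features, and saving the result.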

Contents:
1. [Importing Libraries](#importing-libraries)
2. [Loading Data](#loading-data)
3. [Cleaning Data](#cleaning-data)
4. [Statistical Summary](#statistical-summary)
5. [Saving the Dataframe](#saving-the-dataframe)

# Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Loading Data
Load the weather dataset from the CSV file into a pandas dataframe.

In [2]:
df = pd.read_csv('../data/GlobalWeatherRepository.csv')

In [3]:
df.head()

,country,location_name,latitude,longitude,timezone,last_updated_epoch,last_updated,temperature_celsius,temperature_fahrenheit,condition_text,...,air_quality_PM2.5,air_quality_PM10,air_quality_us-epa-index,air_quality_gb-defra-index,sunrise,sunset,moonrise,moonset,moon_phase,moon_illumination
0,Afghanistan,Kabul,34.52,69.18,Asia/Kabul,1715849100,2024-05-16 13:15,26.6,79.8,Partly Cloudy,...,8.4,26.6,1,1,04:50 AM,06:50 PM,12:12 PM,01:11 AM,Waxing Gibbous,55
1,Albania,Tirana,41.33,19.82,Europe/Tirane,1715849100,2024-05-16 10:45,19.0,66.2,Partly cloudy,...,1.1,2.0,1,1,05:21 AM,07:54 PM,12:58 PM,02:14 AM,Waxing Gibbous,55
2,Algeria,Algiers,36.76,3.05,Africa/Algiers,1715849100,2024-05-16 09:45,23.0,73.4,Sunny,...,10.4,18.4,1,1,05:40 AM,07:50 PM,01:15 PM,02:14 AM,Waxing Gibbous,55
3,Andorra,Andorra La Vella,42.5,1.52,Europe/Andorra,1715849100,2024-05-16 10:45,6.3,43.3,Light drizzle,...,0.7,0.9,1,1,06:31 AM,09:11 PM,02:12 PM,03:31 AM,Waxing Gibbous,55
4,Angola,Luanda,-8.84,13.23,Africa/Luanda,1715849100,2024-05-16 09:45,26.0,78.8,Partly cloudy,...,183.4,262.3,5,10,06:12 AM,05:55 PM,01:17 PM,12:38 AM,Waxing Gibbous,55


# Cleaning Data
- Handling Missing Values
- Converting to appropriate data types
- Checking for duplicates

In [4]:
df.isna().sum()

country                         0
location_name                   0
latitude                        0
longitude                       0
timezone                        0
last_updated_epoch              0
last_updated                    0
temperature_celsius             0
temperature_fahrenheit          0
condition_text                  0
wind_mph                        0
wind_kph                        0
wind_degree                     0
wind_direction                  0
pressure_mb                     0
pressure_in                     0
precip_mm                       0
precip_in                       0
humidity                        0
cloud                           0
feels_like_celsius              0
feels_like_fahrenheit           0
visibility_km                   0
visibility_miles                0
uv_index                        0
gust_mph                        0
gust_kph                        0
air_quality_Carbon_Monoxide     0
air_quality_Ozone               0
air_quality_Ni

The dataset has no null values.
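Because the per-column listing is long, a compact report that shows only the columns with missing values can be easier to read. The following is a minimal sketch: `missing_report` is a hypothetical helper, and `sample` is a small synthetic frame standing in for `df`.

```python
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.Series:
    """Return missing-value counts for columns that contain at least one NaN."""
    counts = frame.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)


# Synthetic stand-in for df (values are illustrative, not from the dataset)
sample = pd.DataFrame({"humidity": [50, None, 70], "cloud": [10, 20, 30]})

report = missing_report(sample)
total_missing = int(sample.isna().sum().sum())

print(report)
print("Total missing cells:", total_missing)
```

On the real dataframe, `missing_report(df)` would be expected to return an empty Series if there are indeed no null values.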

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30096 entries, 0 to 30095
Data columns (total 41 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   country                       30096 non-null  object 
 1   location_name                 30096 non-null  object 
 2   latitude                      30096 non-null  float64
 3   longitude                     30096 non-null  float64
 4   timezone                      30096 non-null  object 
 5   last_updated_epoch            30096 non-null  int64  
 6   last_updated                  30096 non-null  object 
 7   temperature_celsius           30096 non-null  float64
 8   temperature_fahrenheit        30096 non-null  float64
 9   condition_text                30096 non-null  object 
 10  wind_mph                      30096 non-null  float64
 11  wind_kph                      30096 non-null  float64
 12  wind_degree                   30096 non-null  int64  
 13  w

The numeric columns have appropriate data types. Note that `last_updated` is stored as `object` (a string) rather than as a datetime.
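If time-based operations are needed later, the timestamp columns can be parsed explicitly. The sketch below assumes the `YYYY-MM-DD HH:MM` format seen in the `df.head()` output and that `last_updated_epoch` holds seconds since the Unix epoch; `last_updated_utc` is a hypothetical new column name, and `sample` reuses two rows from the preview.

```python
import pandas as pd

# Two rows copied from the df.head() preview
sample = pd.DataFrame({
    "last_updated": ["2024-05-16 13:15", "2024-05-16 10:45"],
    "last_updated_epoch": [1715849100, 1715849100],
})

converted = sample.assign(
    # Local timestamp string -> naive datetime (format is an assumption)
    last_updated=pd.to_datetime(sample["last_updated"], format="%Y-%m-%d %H:%M"),
    # Epoch seconds -> timezone-aware UTC datetime
    last_updated_utc=pd.to_datetime(sample["last_updated_epoch"], unit="s", utc=True),
)

print(converted.dtypes)
```

Keeping the UTC column alongside the local one makes rows from different time zones directly comparable.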

In [6]:
df.duplicated().sum()

0

There are no duplicate rows in the dataset.
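`df.duplicated()` only flags rows that are identical in every column. It can also be worth checking for repeated observations on a subset of columns. The sketch below assumes that `location_name` plus `last_updated_epoch` identifies one observation, which is an assumption rather than something the dataset documents; the second Kabul row is synthetic.

```python
import pandas as pd

# Synthetic stand-in for df: the second row repeats the key of the first
sample = pd.DataFrame({
    "location_name": ["Kabul", "Kabul", "Tirana"],
    "last_updated_epoch": [1715849100, 1715849100, 1715849100],
    "temperature_celsius": [26.6, 26.7, 19.0],
})

# Rows identical across all columns
full_dupes = int(sample.duplicated().sum())

# Rows repeating an assumed observation key
key_cols = ["location_name", "last_updated_epoch"]
key_dupes = int(sample.duplicated(subset=key_cols).sum())

print(f"Full-row duplicates: {full_dupes}")
print(f"Duplicates on {key_cols}: {key_dupes}")
```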

# Statistical Summary

In [7]:
df.describe()

,latitude,longitude,last_updated_epoch,temperature_celsius,temperature_fahrenheit,wind_mph,wind_kph,wind_degree,pressure_mb,pressure_in,...,gust_kph,air_quality_Carbon_Monoxide,air_quality_Ozone,air_quality_Nitrogen_dioxide,air_quality_Sulphur_dioxide,air_quality_PM2.5,air_quality_PM10,air_quality_us-epa-index,air_quality_gb-defra-index,moon_illumination
count,30096.0,30096.0,30096.0,30096.0,30096.0,30096.0,30096.0,30096.0,30096.0,30096.0,...,30096.0,30096.0,30096.0,30096.0,30096.0,30096.0,30096.0,30096.0,30096.0,30096.0
mean,19.132083,21.969201,1722537000.0,25.743129,78.339228,8.560663,13.781134,176.260832,1012.425306,29.896218,...,20.292344,460.003296,64.979336,10.053476,7.282977,17.877346,33.602397,1.438796,1.969631,49.496312
std,24.485795,65.853435,3938018.0,7.174114,12.913489,11.830692,19.039875,100.303245,6.209667,0.183018,...,20.379515,1070.886244,42.72982,22.527067,60.34986,48.422078,76.90399,0.819501,1.989057,34.998246
min,-41.3,-175.2,1715849000.0,-8.4,16.9,2.2,3.6,1.0,976.0,28.82,...,3.6,-9999.0,0.0,0.0,-9999.0,0.5,0.5,1.0,1.0,0.0
25%,3.75,-6.84,1719064000.0,22.1,71.8,4.3,6.8,92.0,1009.0,29.8,...,11.5,203.6,34.3,0.6,0.5,2.8,5.1,1.0,1.0,15.0
50%,17.25,23.32,1722602000.0,26.6,79.9,7.6,12.2,172.0,1013.0,29.91,...,18.4,262.7,60.0,2.035,1.5,7.9,13.32,1.0,1.0,50.0
75%,40.4,49.8822,1725971000.0,30.0,86.0,11.9,19.1,260.0,1016.0,30.0,...,26.3,397.75,90.0,7.7,5.365,18.315,32.38125,2.0,2.0,83.0
max,64.15,179.22,1729245000.0,49.2,120.6,1841.2,2963.2,360.0,1045.0,30.86,...,2970.4,38879.398,480.7,427.7,255.855,1614.1,1814.4,6.0,10.0,100.0
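
The summary shows a minimum of -9999.0 for `air_quality_Carbon_Monoxide` and `air_quality_Sulphur_dioxide`. Concentrations cannot be negative, so this value looks like a placeholder for missing readings, although that interpretation is an assumption. A minimal sketch of masking it follows; `mask_sentinels` and `SENTINEL` are hypothetical names, and `sample` is a synthetic stand-in for `df`.

```python
import numpy as np
import pandas as pd

SENTINEL = -9999.0  # assumed missing-data marker, inferred from describe()


def mask_sentinels(frame: pd.DataFrame, columns: list, sentinel: float = SENTINEL) -> pd.DataFrame:
    """Return a copy of frame with the sentinel replaced by NaN in the given columns."""
    out = frame.copy()
    out[columns] = out[columns].replace(sentinel, np.nan)
    return out


# Synthetic stand-in for the two affected columns
sample = pd.DataFrame({
    "air_quality_Carbon_Monoxide": [203.6, -9999.0, 262.7],
    "air_quality_Sulphur_dioxide": [0.5, 1.5, -9999.0],
})

cleaned = mask_sentinels(sample, list(sample.columns))
print(cleaned.isna().sum())
print(cleaned.min())
```

Once the placeholder is masked, `isna().sum()` reports those cells as missing, and statistics such as the mean and minimum are no longer distorted by it.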


In [8]:
# Select the numeric key features: temperature, wind, humidity, and precipitation
key_prefixes = ('temp', 'wind', 'humidity', 'preci')

# str.startswith accepts a tuple of prefixes; dtype 'O' marks object (string) columns
cols = [
    col for col in df.columns
    if col.startswith(key_prefixes) and df[col].dtype != 'O'
]

print(f"Key Columns: \n{cols}")

Key Columns: 
['temperature_celsius', 'temperature_fahrenheit', 'wind_mph', 'wind_kph', 'wind_degree', 'precip_mm', 'precip_in', 'humidity']


In [9]:
# Print the observed range (minimum and maximum) of each key column
for col in cols:
    print(f'{col}: ')
    print(f"Min: {df[col].min()}")
    print(f"Max: {df[col].max()}")
    print('-----------')
    print()

temperature_celsius: 
Min: -8.4
Max: 49.2
-----------

temperature_fahrenheit: 
Min: 16.9
Max: 120.6
-----------

wind_mph: 
Min: 2.2
Max: 1841.2
-----------

wind_kph: 
Min: 3.6
Max: 2963.2
-----------

wind_degree: 
Min: 1
Max: 360
-----------

precip_mm: 
Min: 0.0
Max: 27.82
-----------

precip_in: 
Min: 0.0
Max: 1.1
-----------

humidity: 
Min: 2
Max: 100
-----------
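
The ranges above include a maximum `wind_kph` of 2963.2, which is far beyond any realistic surface wind speed and suggests a data error. The sketch below shows a compact min/max table and a simple range check. The limits in `BOUNDS` are hypothetical and should be tuned to the use case, `count_out_of_range` is an illustrative helper, and `sample` is a synthetic stand-in for `df[cols]`.

```python
import pandas as pd

# Hypothetical plausibility limits per column: (low, high)
BOUNDS = {"wind_kph": (0.0, 400.0), "humidity": (0.0, 100.0)}


def count_out_of_range(frame: pd.DataFrame, bounds: dict) -> pd.Series:
    """Count non-missing values that fall outside [low, high] for each column."""
    counts = {}
    for col, (low, high) in bounds.items():
        values = frame[col]
        outside = values.notna() & ~values.between(low, high)
        counts[col] = int(outside.sum())
    return pd.Series(counts, dtype="int64")


# Synthetic stand-in using the extremes reported above
sample = pd.DataFrame({"wind_kph": [3.6, 12.2, 2963.2], "humidity": [2, 50, 100]})

summary = sample.agg(["min", "max"]).T  # one row per column, min/max side by side
flags = count_out_of_range(sample, BOUNDS)

print(summary)
print(flags)
```

Flagging rather than dropping keeps the decision about how to treat suspicious rows separate from the detection step.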



# Saving the Dataframe

In [10]:
df.to_csv('../data/CleanedGlobalWeatherRepository.csv', index=False)
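
CSV files do not store data types, so any datetime columns need to be parsed again when the file is reloaded. The following sketch checks a save and reload round trip on a small synthetic frame. It writes to a temporary directory, and the file name `cleaned_sample.csv` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Synthetic stand-in for the dataframe being saved
sample = pd.DataFrame({
    "location_name": ["Kabul", "Tirana"],
    "last_updated": pd.to_datetime(["2024-05-16 13:15", "2024-05-16 10:45"]),
    "temperature_celsius": [26.6, 19.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cleaned_sample.csv"  # hypothetical file name
    sample.to_csv(path, index=False)
    # parse_dates restores the datetime dtype that CSV cannot store
    reloaded = pd.read_csv(path, parse_dates=["last_updated"])

print(reloaded.dtypes)
print(reloaded.shape == sample.shape)
```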