# Checkpoint Three: Cleaning Data

Now you are ready to clean your data. Before you start coding, provide the link to your dataset below.

My dataset: https://www.ncei.noaa.gov/cdo-web/datasets/GHCND/locations/ZIP:64116/detail

Import the necessary libraries and create your dataframe(s).

In [1]:
# import libraries
import pandas as pd
import numpy as np
import seaborn as sb
import plotly.express as px
import matplotlib.pyplot as plt
from matplotlib import style

# read in .csv file as dataframe
weather_df = pd.read_csv('WeatherData.csv')
# drop the unlabeled first column, which pandas reads in as 'Unnamed: 0'
weather_df = weather_df.drop(columns=['Unnamed: 0'])
weather_df.head(25)

Unnamed: 0,STATION,NAME,DATE,Avg Daily Cloudiness (%),Avg Solar Day Cloudiness (%),Avg Daily Wind Speed (mph),Days in Multiday Precipitation,Fastest Wind Time (HH:MM),Base of Frozen Ground Layer (inches),Top of Frozen Ground Layer (inches),...,Weather Type - Freezing Drizzle,Weather Type - Rain,Weather Type - Freezing Rain,Weather Type - Snow,Weather Type - Ice Fog or Freezing Fog,Year,Month,Day,Month & Day,Year & Month
0,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",1934-01-01,,,,,,,,...,,1.0,,1.0,,1934,1,1,01-01,1934-01
1,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",1934-01-02,,,,,,,,...,,,,1.0,,1934,1,2,01-02,1934-01
2,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",1934-01-03,,,,,,,,...,,1.0,,1.0,,1934,1,3,01-03,1934-01
3,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",1934-01-04,,,,,,,,...,,1.0,,1.0,,1934,1,4,01-04,1934-01
4,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",1934-01-05,,,,,,,,...,,,,1.0,,1934,1,5,01-05,1934-01
5,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",1934-01-06,,,,,,,,...,,1.0,,,,1934,1,6,01-06,1934-01
6,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",1934-01-07,,,,,,,,...,,1.0,,1.0,,1934,1,7,01-07,1934-01
7,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",1934-01-08,,,,,,,,...,,,,1.0,,1934,1,8,01-08,1934-01
8,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",1934-01-09,,,,,,,,...,,1.0,,1.0,,1934,1,9,01-09,1934-01
9,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",1934-01-10,,,,,,,,...,,,,1.0,,1934,1,10,01-10,1934-01


## Unnecessary Data

Look for the different types of unnecessary data in your dataset and address it as needed. Make sure to use code comments to illustrate your thought process.

In [3]:
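# Keep only the Kansas City Downtown Airport station (USW00013988) and remove the NKC station.
# The NKC data covers too short a span to be useful and has no max or min temperature data.

# .copy() makes the filtered result an independent dataframe, so later edits do not trigger
# pandas' SettingWithCopyWarning.
weather_df = weather_df[weather_df['STATION']=='USW00013988'].copy()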
weather_df.tail()

Unnamed: 0,STATION,NAME,DATE,Avg Daily Cloudiness (%),Avg Solar Day Cloudiness (%),Avg Daily Wind Speed (mph),Days in Multiday Precipitation,Fastest Wind Time (HH:MM),Base of Frozen Ground Layer (inches),Top of Frozen Ground Layer (inches),...,Weather Type - Freezing Drizzle,Weather Type - Rain,Weather Type - Freezing Rain,Weather Type - Snow,Weather Type - Ice Fog or Freezing Fog,Year,Month,Day,Month & Day,Year & Month
29965,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",2023-05-17,,,4.03,,,,,...,,,,,,2023,5,17,05-17,2023-05
29966,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",2023-05-18,,,4.47,,,,,...,,,,,,2023,5,18,05-18,2023-05
29967,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",2023-05-19,,,4.7,,,,,...,,,,,,2023,5,19,05-19,2023-05
29968,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",2023-05-20,,,4.25,,,,,...,,,,,,2023,5,20,05-20,2023-05
29969,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",2023-05-21,,,3.58,,,,,...,,,,,,2023,5,21,05-21,2023-05
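
The reasoning in the comment above (short data span, no max or min temperatures) can be checked with a per-station summary before a station is dropped. The following is a minimal sketch on made-up rows, not the notebook's actual code; `STATION_B` is a hypothetical station ID and `station_summary` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; "STATION_B" is a hypothetical ID,
# not a station from the real dataset.
toy = pd.DataFrame({
    "STATION": ["USW00013988"] * 3 + ["STATION_B"] * 2,
    "DATE": ["1934-01-01", "1980-06-15", "2023-05-21", "2001-03-01", "2001-03-02"],
    "Highest Hourly Temp (Fahrenheit)": [26.0, 88.0, 75.0, np.nan, np.nan],
})

# One row per station: date span, row count, and number of non-null max temperatures.
# ISO-formatted date strings sort chronologically, so min/max work on the raw text.
station_summary = toy.groupby("STATION").agg(
    first_date=("DATE", "min"),
    last_date=("DATE", "max"),
    rows=("DATE", "size"),
    max_temp_readings=("Highest Hourly Temp (Fahrenheit)", "count"),
)
print(station_summary)
```

A station with few rows or zero temperature readings stands out in this table, which gives the filtering step a recorded justification.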


In [4]:
# Review column information

weather_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29970 entries, 0 to 29969
Data columns (total 56 columns):
 #   Column                                            Non-Null Count  Dtype  
---  ------                                            --------------  -----  
 0   STATION                                           29970 non-null  object 
 1   NAME                                              29970 non-null  object 
 2   DATE                                              29970 non-null  object 
 3   Avg Daily Cloudiness (%)                          3886 non-null   float64
 4   Avg Solar Day Cloudiness (%)                      3886 non-null   float64
 5   Avg Daily Wind Speed (mph)                        10904 non-null  float64
 6   Days in Multiday Precipitation                    3 non-null      float64
 7   Fastest Wind Time (HH:MM)                         3875 non-null   float64
 8   Base of Frozen Ground Layer (inches)              5 non-null      float64
 9   Top of Frozen Gro

In [5]:
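# remove all columns where fewer than 5% of the rows have a value

total_records = len(weather_df.index)

columns_to_delete = []

for column in weather_df.columns:
    # notnull().sum() counts the rows that actually hold a value in this column
    if weather_df[column].notnull().sum() < (total_records * 0.05):
        columns_to_delete.append(column)

# pass the column names directly; reassigning avoids an in-place edit of a filtered dataframe
weather_df = weather_df.drop(columns=columns_to_delete)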

# Review remaining columns
weather_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29970 entries, 0 to 29969
Data columns (total 34 columns):
 #   Column                                            Non-Null Count  Dtype  
---  ------                                            --------------  -----  
 0   STATION                                           29970 non-null  object 
 1   NAME                                              29970 non-null  object 
 2   DATE                                              29970 non-null  object 
 3   Avg Daily Cloudiness (%)                          3886 non-null   float64
 4   Avg Solar Day Cloudiness (%)                      3886 non-null   float64
 5   Avg Daily Wind Speed (mph)                        10904 non-null  float64
 6   Fastest Wind Time (HH:MM)                         3875 non-null   float64
 7   Distance Between River and Gauge Height (inches)  6203 non-null   float64
 8   Peak Gust Time (HH:MM)                            5274 non-null   float64
 9   Precipitation (in
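
The 5% coverage rule can also be expressed without a loop by using the `thresh` argument of `DataFrame.dropna`, which keeps only columns with at least that many non-null values. This is a sketch of an alternative, not the code used in this notebook; `drop_sparse_columns` and the toy dataframe are illustrative assumptions.

```python
import math

import numpy as np
import pandas as pd


def drop_sparse_columns(df, min_coverage=0.05):
    """Return a copy of df without columns whose non-null share is below min_coverage."""
    # Rounding up keeps the rule identical to "drop if non-null count < rows * min_coverage".
    min_non_null = math.ceil(len(df) * min_coverage)
    return df.dropna(axis=1, thresh=min_non_null)


# Made-up data for illustration only: 100 rows with 100%, 4%, and 5% coverage.
toy = pd.DataFrame({
    "full": np.arange(100, dtype=float),
    "sparse": [1.0] * 4 + [np.nan] * 96,
    "borderline": [1.0] * 5 + [np.nan] * 95,
})

kept = drop_sparse_columns(toy)
print(list(kept.columns))
```

In the toy data, the column with 4% coverage is removed and the column with exactly 5% coverage is kept, matching the strict "less than" comparison in the loop.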

In [6]:
# Remove remaining irrelevant columns

irrelevant_columns = ['Avg Daily Cloudiness (%)',
                      'Avg Solar Day Cloudiness (%)',
                      'Avg Daily Wind Speed (mph)',
                      'Fastest Wind Time (HH:MM)',
                      'Distance Between River and Gauge Height (inches)',
                      'Peak Gust Time (HH:MM)',
                      'Daily % of Possible Sunshine',
                      'Direction of Fastest 2-Minute Wind (degrees)',
                      'Direction of Fastest 5-Minute Wind (degrees)',
                      'Fastest mile wind direction (degrees)',
                      'Fastest 2-minute wind speed (mph)',
                      'Fastest 5-minute wind speed (mph)',
                      'Fastest mile wind speed (mph)',
                      'Weather Type - Fog',
                      'Weather Type - Thunder',
                      'Weather Type - Smoke or Haze',
                      'Weather Type - Rain',
                      'Weather Type - Snow',
                      'Daily Total Sunshine (minutes)']

weather_df = weather_df.drop(columns=irrelevant_columns)

weather_df.head()

Unnamed: 0,STATION,NAME,DATE,Precipitation (inches),Snowfall (inches),Snow Depth (inches),Avg Hourly Temp (Fahrenheit),Highest Hourly Temp (Fahrenheit),Lowest Hourly Temp (Fahrenheit),Temperature at Observation (Fahrenheit),Year,Month,Day,Month & Day,Year & Month
0,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",1934-01-01,0.0,0.0,0.0,,26.0,15.0,,1934,1,1,01-01,1934-01
1,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",1934-01-02,0.0,0.0,0.0,,33.0,25.0,,1934,1,2,01-02,1934-01
2,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",1934-01-03,0.21,0.0,0.0,,32.0,30.0,,1934,1,3,01-03,1934-01
3,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",1934-01-04,0.2,0.0,0.0,,34.0,32.0,,1934,1,4,01-04,1934-01
4,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",1934-01-05,0.0,0.0,0.0,,36.0,32.0,,1934,1,5,01-05,1934-01


## Missing Data

Test your dataset for missing data and handle it as needed. Make notes in the form of code comments as to your thought process.

In [7]:
# Review the non-null counts of the remaining columns to see where values are missing
weather_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29970 entries, 0 to 29969
Data columns (total 15 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   STATION                                  29970 non-null  object 
 1   NAME                                     29970 non-null  object 
 2   DATE                                     29970 non-null  object 
 3   Precipitation (inches)                   29908 non-null  float64
 4   Snowfall (inches)                        25137 non-null  float64
 5   Snow Depth (inches)                      24073 non-null  float64
 6   Avg Hourly Temp (Fahrenheit)             2677 non-null   float64
 7   Highest Hourly Temp (Fahrenheit)         29934 non-null  float64
 8   Lowest Hourly Temp (Fahrenheit)          29931 non-null  float64
 9   Temperature at Observation (Fahrenheit)  2768 non-null   float64
 10  Year                                     29970

In [11]:
# Check for duplicate dates: the number of unique dates should equal the row count (29970)

weather_df['DATE'].nunique()

29970
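
Comparing the unique count to the row count shows whether duplicates exist, but not which rows they are. A minimal sketch using `Series.duplicated`, shown on made-up dates rather than the real dataframe, counts and lists them directly.

```python
import pandas as pd

# Made-up dates for illustration only; "1934-01-02" is repeated on purpose.
toy = pd.DataFrame({"DATE": ["1934-01-01", "1934-01-02", "1934-01-02", "1934-01-03"]})

# duplicated() marks every repeat after the first occurrence.
duplicate_count = toy["DATE"].duplicated().sum()

# keep=False marks all copies, which makes it easy to inspect them side by side.
duplicate_rows = toy[toy["DATE"].duplicated(keep=False)]

print(duplicate_count)
print(duplicate_rows)
```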

In [12]:
# Show rows that are missing the low temperature, the high temperature, or the precipitation
Missing_Data = weather_df[
    weather_df['Lowest Hourly Temp (Fahrenheit)'].isna()
    | weather_df['Highest Hourly Temp (Fahrenheit)'].isna()
    | weather_df['Precipitation (inches)'].isna()
]
Missing_Data

Unnamed: 0,STATION,NAME,DATE,Precipitation (inches),Snowfall (inches),Snow Depth (inches),Avg Hourly Temp (Fahrenheit),Highest Hourly Temp (Fahrenheit),Lowest Hourly Temp (Fahrenheit),Temperature at Observation (Fahrenheit),Year,Month,Day,Month & Day,Year & Month
15056,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",1981-05-22,,0.0,0.0,,85.0,69.0,,1981,5,22,05-22,1981-05
15795,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",1983-05-31,0.0,0.0,0.0,,,51.0,,1983,5,31,05-31,1983-05
16521,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",1985-05-26,,0.0,0.0,,93.0,66.0,,1985,5,26,05-26,1985-05
16522,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",1985-05-27,,0.0,0.0,,72.0,63.0,,1985,5,27,05-27,1985-05
17737,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",1988-09-23,,0.0,0.0,,69.0,59.0,,1988,9,23,09-23,1988-09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29478,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",2022-01-15,,,,,33.0,21.0,,2022,1,15,01-15,2022-01
29670,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",2022-07-26,,,,,79.0,66.0,,2022,7,26,07-26,2022-07
29671,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",2022-07-27,,,,,87.0,70.0,,2022,7,27,07-27,2022-07
29691,USW00013988,"KANSAS CITY DOWNTOWN AIRPORT, MO US",2022-08-16,,,,,72.0,65.0,,2022,8,16,08-16,2022-08
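
To see whether the gaps cluster in particular years, the missing values can be counted per year. The sketch below is one possible approach on made-up rows, not a step taken in this notebook; `key_columns` and `missing_by_year` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; these are not values from WeatherData.csv.
toy = pd.DataFrame({
    "Year": [1981, 1981, 1983, 1985],
    "Precipitation (inches)": [np.nan, 0.0, 0.0, np.nan],
    "Highest Hourly Temp (Fahrenheit)": [85.0, np.nan, 70.0, 93.0],
})

key_columns = ["Precipitation (inches)", "Highest Hourly Temp (Fahrenheit)"]

# isna() gives True/False per cell; summing within each year counts the gaps.
missing_by_year = toy[key_columns].isna().groupby(toy["Year"]).sum()
print(missing_by_year)
```

A table like this can inform the choice between dropping the affected rows, filling them, or leaving them as they are.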


In [None]:
# Build a list of years as strings, 1934 through 2022 (range() stops before 2023)
years = list(map(str, range(1934,2023)))
years

## Irregular Data

Detect outliers in your dataset and handle them as needed. Use code comments to make notes about your thought process.
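
One common way to flag candidate outliers is the interquartile range (IQR) rule: values more than 1.5 × IQR below the first quartile or above the third quartile are marked for review. The sketch below applies it to made-up temperatures; `iqr_outliers` is a hypothetical helper, and the 1.5 multiplier is a convention rather than a requirement.

```python
import pandas as pd


def iqr_outliers(values, k=1.5):
    """Return the entries of a Series that fall outside the IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Made-up readings for illustration only; 190.0 is a deliberately implausible value.
toy_highs = pd.Series(
    [85.0, 88.0, 90.0, 87.0, 86.0, 89.0, 190.0],
    name="Highest Hourly Temp (Fahrenheit)",
)

flagged = iqr_outliers(toy_highs)
print(flagged)
```

Because temperatures are strongly seasonal, applying the rule within each month (for example with `groupby` on the `Month` column) may be more informative than applying it to the whole series at once. Flagged values are candidates to inspect, not automatically errors.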

## Inconsistent Data

Check for inconsistent data and address any that arises. As always, use code comments to illustrate your thought process.
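
Two consistency checks that fit a dataset like this are: does each derived `Year` value agree with its `DATE`, and is the daily high ever lower than the daily low? The sketch below runs both on made-up rows; it assumes `DATE` uses the `YYYY-MM-DD` format seen in the output above, and the variable names are illustrative.

```python
import pandas as pd

# Made-up rows for illustration only, with two deliberate inconsistencies.
toy = pd.DataFrame({
    "DATE": ["1934-01-01", "1934-01-02", "1934-01-03"],
    "Year": [1934, 1934, 1935],  # last row disagrees with DATE
    "Highest Hourly Temp (Fahrenheit)": [26.0, 20.0, 32.0],
    "Lowest Hourly Temp (Fahrenheit)": [15.0, 25.0, 30.0],  # second row: low > high
})

# Parse the text dates so the year can be compared against the derived column.
parsed_dates = pd.to_datetime(toy["DATE"], format="%Y-%m-%d")
year_mismatch = toy[parsed_dates.dt.year != toy["Year"]]

# A daily high below the daily low is physically impossible, so flag it.
inverted_temps = toy[
    toy["Highest Hourly Temp (Fahrenheit)"] < toy["Lowest Hourly Temp (Fahrenheit)"]
]

print(year_mismatch)
print(inverted_temps)
```

The same pattern extends to the `Month` and `Day` columns through `parsed_dates.dt.month` and `parsed_dates.dt.day`.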

## Summarize Your Results

Make note of your answers to the following questions.

1. Did you find all four types of dirty data in your dataset?
2. Did the process of cleaning your data give you new insights into your dataset?
3. Is there anything you would like to make note of when it comes to manipulating the data and making visualizations?