# HWTW Data Acquisition Notebook
author(s):  David Yerrington (david@yerrington.net), Hig (hig314@gmail.com)


This notebook sets out a clean, repeatable process for importing the data we will use for prototyping.

In [116]:
import pandas as pd, numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

## Basic Data Loading

It looks like our initial CSV, based on the "short" sample, can be loaded with pandas without much trouble. The data does appear to be improperly typed (most columns load as `object`), and the column headers can be updated from Hig's original code.

In [180]:
df = pd.read_csv(
    "../data/external/25339_short.csv", 
    low_memory            = False
)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38214 entries, 0 to 38213
Data columns (total 25 columns):
 % Relative Humidity      38214 non-null object
 Dew Point Temp           38214 non-null object
 Dry Bulb Temp            38214 non-null object
 Maintenance Indicator    38214 non-null object
 Precip. Total            38214 non-null object
 Pressure Tendency        38214 non-null object
 Record Type              38214 non-null object
 Sea Level Pressure       38214 non-null object
 Sky Conditions           38214 non-null object
 Station Pressure         38214 non-null object
 Station Type             38214 non-null object
 Time                     38214 non-null int64
 Val for Wind Char.       38214 non-null object
 Visibility               38214 non-null object
 Weather Type             38214 non-null object
 Wet Bulb Temp            38214 non-null object
 Wind Char. Gusts (kt)    38214 non-null object
 Wind Direction           38214 non-null object
 Wind Speed (kt)          38

### Most of the variable types are "object"
We can't do much with `object` columns, so we need to convert them to `float64` or `int64` before we can do aggregation of any kind.
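As a small illustration of why the `object` dtype gets in the way, the sketch below uses a made-up frame (not the real station data) whose numeric readings arrive as strings. It lists the `object` columns and converts one of them with `astype(float)` so that a mean can be computed. The sample values and variable names are assumptions for illustration only.

```python
import pandas as pd

# Made-up sample that mimics numeric readings loaded as strings.
sample = pd.DataFrame(
    {"dry_bulb_temp": ["41", "39", "44"], "station_type": ["A", "A", "B"]},
    dtype=object,
)

# List the columns that pandas is holding as generic Python objects.
object_cols = sample.select_dtypes(include="object").columns.tolist()

# After conversion the column is float64 and numeric aggregation works.
temps = sample["dry_bulb_temp"].astype(float)
mean_temp = temps.mean()
```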

### General Cleaning

#### Fix columns with leading spaces and special characters.

For cleaning and mapping raw data from this source, we will use a basic class. Once we have many different sources, we can use a factory pattern to abstract this problem so our code stays manageable as it grows to cover more sources.
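One possible shape for the factory pattern mentioned above is sketched here. Everything in it is hypothetical: `BaseCleaner`, `StationCsvCleaner`, the `CLEANERS` registry, the `make_cleaner` function, and the `"station_csv"` source name are invented for illustration and are not part of the current code.

```python
import pandas as pd


class BaseCleaner:
    """Hypothetical base class; each source overrides the characters to strip."""

    remove_chars = []

    def __init__(self, df):
        self.df = df

    def clean(self):
        def tidy(name):
            for char in self.remove_chars:
                name = name.replace(char, "")
            return name.lower().strip().replace(" ", "_")

        self.df.columns = [tidy(col) for col in self.df.columns]
        return self.df


class StationCsvCleaner(BaseCleaner):
    """Hypothetical cleaner for a CSV source like the one in this notebook."""

    remove_chars = ["%", "(", ")", "."]


# Hypothetical registry mapping a source name to its cleaner class.
CLEANERS = {"station_csv": StationCsvCleaner}


def make_cleaner(source, df):
    """Factory: look up the cleaner for a source and instantiate it."""
    if source not in CLEANERS:
        raise ValueError(f"no cleaner registered for source {source!r}")
    return CLEANERS[source](df)


raw = pd.DataFrame(columns=[" % Relative Humidity", " Wind Speed (kt)"])
cleaned = make_cleaner("station_csv", raw).clean()
```

Adding a new source would then mean writing one subclass and registering it, without touching the calling code.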

In [181]:
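class clean_data:
    """Normalize the column names of a raw weather DataFrame."""
    
    verbose = False
    
    # Characters stripped from column names before normalizing them.
    column_opts = dict(
        remove_chars = ["%", "(", ")", "."]
    )
    
    df = None
    
    def __init__(self, **opts):
        # Only accept options that match an existing attribute (df, verbose, ...).
        for attr, value in opts.items():
            if hasattr(self, attr):
                setattr(self, attr, value)
    
    def strip_characters(self, name):
        for char in self.column_opts['remove_chars']:
            name = name.replace(char, "")
        return name
    
    def clean_columns(self):
        # Rename the columns on the wrapped DataFrame: strip special
        # characters, lowercase, trim, and replace spaces with underscores.
        self.df.columns = [
            self.strip_characters(col).lower().strip().replace(" ", "_") 
            for col in self.df.columns
        ]

    def clean(self):
        self.clean_columns()
        
    def get_df(self):
        return self.df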

cleaner = clean_data(df = df)
cleaner.clean()
clean_df = cleaner.get_df()
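To make the renaming rule concrete, here is the same transformation as a standalone function applied to a few headers typed in by hand to resemble those in the `df.info()` output above. The function name `normalize_column` and the sample list are assumptions for illustration. The final check guards against two raw headers collapsing to the same cleaned name.

```python
REMOVE_CHARS = ["%", "(", ")", "."]


def normalize_column(name):
    """Strip special characters, lowercase, trim, and snake_case a header."""
    for char in REMOVE_CHARS:
        name = name.replace(char, "")
    return name.lower().strip().replace(" ", "_")


raw_headers = [
    " % Relative Humidity",
    " Precip. Total",
    " Wind Char. Gusts (kt)",
    " Time",
]
mapping = {raw: normalize_column(raw) for raw in raw_headers}

# True when no two raw headers map to the same cleaned name.
names_are_unique = len(set(mapping.values())) == len(mapping)
```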

### Convert "-", which presumably means "unknown", to a proper `NaN`
> We will push these fixes back into the class so we can run all of these transformations once we've handled every case we want to clean for our ETL job. This will make it easy to automate future sources.

In [182]:
clean_df.replace(r'-', np.nan, inplace = True, regex = True)
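One thing worth checking before this moves into the class: when `regex = True` and the replacement is not a string, pandas replaces the whole cell whenever the pattern matches anywhere in it, so an unanchored `-` also catches negative readings such as `-5`. The sketch below compares an unanchored and an anchored pattern on a made-up column. Whether the real data contains negative values is an assumption that has not been verified here.

```python
import numpy as np
import pandas as pd

# Made-up sample: a lone dash, a negative reading, a normal reading,
# and a dash padded with spaces.
sample = pd.DataFrame({"dry_bulb_temp": ["-", "-5", "12", " - "]}, dtype=object)

# Unanchored: any cell containing a dash becomes NaN, including "-5".
loose = sample.replace(r"-", np.nan, regex=True)

# Anchored: only cells that are nothing but a dash (plus optional
# whitespace) become NaN, so "-5" survives.
strict = sample.replace(r"^\s*-\s*$", np.nan, regex=True)

loose_missing = int(loose["dry_bulb_temp"].isna().sum())
strict_missing = int(strict["dry_bulb_temp"].isna().sum())
```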

In [183]:
exclude_columns = [
    'record_type', 
    'sky_conditions', 
    'station_type', 
    'visibility', 
    'weather_type', 
    'raw_sky_code', 
    'precip_total',       # has "T" strings as some values
    'wind_char_gusts_kt', # has "G" strings as some values
    'wind_direction'    
]
object_columns = [
    name for name, dtype in clean_df.dtypes.items() 
    if dtype == "object" and name not in exclude_columns
]
clean_df[object_columns] = clean_df[object_columns].astype(float)
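For the excluded columns that mix numbers with marker strings, `pd.to_numeric` with `errors="coerce"` is one way to find out which values refuse to parse before deciding how to treat them. The sample below is made up, and how markers such as "T" should be interpreted is left open.

```python
import pandas as pd

# Made-up sample resembling a column that mixes numbers and markers.
sample = pd.Series(["0.02", "T", "0.10", None], name="precip_total", dtype=object)

# Unparseable strings become NaN instead of raising an error.
numeric = pd.to_numeric(sample, errors="coerce")

# Values that were present in the raw data but could not be parsed.
unparsed = sample[numeric.isna() & sample.notna()]
```

Inspecting `unparsed.value_counts()` on the real columns would show every marker that needs a cleaning rule.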

### Now we have much more usable types

In [184]:
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38214 entries, 0 to 38213
Data columns (total 25 columns):
relative_humidity        20385 non-null float64
dew_point_temp           19891 non-null float64
dry_bulb_temp            20338 non-null float64
maintenance_indicator    0 non-null float64
precip_total             8367 non-null object
pressure_tendency        6779 non-null float64
record_type              38214 non-null object
sea_level_pressure       20391 non-null float64
sky_conditions           38204 non-null object
station_pressure         20395 non-null float64
station_type             38130 non-null object
time                     38214 non-null int64
val_for_wind_char        38211 non-null float64
visibility               38208 non-null object
weather_type             6307 non-null object
wet_bulb_temp            20310 non-null float64
wind_char_gusts_kt       5433 non-null object
wind_direction           38211 non-null object
wind_speed_kt            38211 non-null float

### Still need timeseries type

In [185]:
clean_df['yearmonthday'] = pd.to_datetime(clean_df['yearmonthday'], format = "%Y%m%d")
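A possible next step is to combine the date with the `time` column into a single timestamp that can serve as a time-series index. This sketch assumes that `yearmonthday` holds integers such as `20170101` and that `time` is an integer in HHMM form (for example `53` for 00:53). Neither is confirmed by the output above, and the `observed_at` column name is made up.

```python
import pandas as pd

# Made-up sample under the assumptions stated above.
sample = pd.DataFrame({"yearmonthday": [20170101, 20170101], "time": [53, 1453]})

# Zero-pad the time to four digits, then parse date and time together.
stamp = (
    sample["yearmonthday"].astype(str)
    + sample["time"].astype(str).str.zfill(4)
)
sample["observed_at"] = pd.to_datetime(stamp, format="%Y%m%d%H%M")

# A DatetimeIndex enables resampling and time-based slicing.
indexed = sample.set_index("observed_at").sort_index()
```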

### More cleaning necessary

I'll hold off until we've had a chance to talk more.

- Mapping of column names
- Proper typing of datetime
- etc

## Save dataset 1.0

- Still need to come up with a schema for a common format across all weather sources.

In [204]:
clean_df.to_csv("../data/processed/25339_short_cleaned.csv", encoding = "UTF8")
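CSV does not store dtypes, and `to_csv` writes the index as an unnamed first column by default, so whoever reads the file back needs to restore both. The round trip below uses a made-up frame and an in-memory buffer rather than the real file. The column names are taken from this notebook but the values are invented.

```python
import io

import pandas as pd

# Made-up frame with a datetime column and a missing value.
frame = pd.DataFrame(
    {
        "yearmonthday": pd.to_datetime(["2017-01-01", "2017-01-02"]),
        "dry_bulb_temp": [41.0, None],
    }
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the saved index; parse_dates restores the datetime dtype.
restored = pd.read_csv(buffer, index_col=0, parse_dates=["yearmonthday"])
```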