# Data Cleaning and ORM:
While hacking on my starting notebook, it became necessary to refactor and refocus the code. I will try to limit the scope of this notebook strictly to two tasks:
1. Impute the missing (null) values
2. Load just the imputed values into a table in the SQLite database, hawaii.sqlite

## Coding Goals:
I want to stick with SQLAlchemy and use its ORM to the fullest extent, probe SQLite and its features and functions, and ideally minimize pandas and ETL as much as possible.

## Run this notebook once and you should have another table with the imputed values added to the sqlite database.
This might be handy if we want to analyse the values our imputation process created. I've got a hunch that temperature alone is a poor predictor of precipitation values. Pressure might be better. Another step of the data cleaning process might include collecting more data.
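
One rough way to test the hunch about temperature is to compute the Pearson correlation between `tobs` and `prcp` over the rows where `prcp` is known. The sketch below is only an illustration: it uses the standard-library `sqlite3` module against an in-memory database, the rows are made up, and `pearson_r` and `hunch_db` are hypothetical names that do not appear elsewhere in this notebook.

```python
import math
import sqlite3


def pearson_r(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


# In-memory stand-in for hawaii.sqlite; these rows are invented for illustration.
hunch_db = sqlite3.connect(":memory:")
hunch_db.execute(
    "CREATE TABLE measurement ("
    "id INTEGER PRIMARY KEY, station TEXT, date TEXT, prcp FLOAT, tobs FLOAT)"
)
hunch_db.executemany(
    "INSERT INTO measurement (station, date, prcp, tobs) VALUES (?, ?, ?, ?)",
    [
        ("S1", "2017-01-01", 0.0, 70.0),
        ("S1", "2017-01-02", 0.1, 72.0),
        ("S1", "2017-01-03", 0.5, 75.0),
        ("S1", "2017-01-04", 0.2, 68.0),
        ("S1", "2017-01-05", None, 71.0),  # a row that would need imputing
    ],
)

# Only rows with a known prcp can tell us anything about the relationship.
known_pairs = hunch_db.execute(
    "SELECT tobs, prcp FROM measurement WHERE prcp IS NOT NULL"
).fetchall()
r = pearson_r(known_pairs)
print(f"Pearson r between tobs and prcp: {r:.3f}")
```

A value of `r` close to zero on the real table would support the hunch that `tobs` alone carries little information about `prcp`.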

In [ ]:
#Reserved for Libraries to import as I need them.
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.automap import automap_base
# for creating tables:
from sqlalchemy import Table, TEXT, Column, FLOAT, ForeignKey, Integer
# pandas for ETL
import pandas as pd
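# numpy and sklearn for the imputation
import numpy as np
# IterativeImputer is experimental, so this enabling import has to come first
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer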

# Raw string so the backslashes in the Windows path are not read as escape sequences
database_path = r"Homework\homework_08\sqlalchemy_challenge\data\hawaii.sqlite"

In [ ]:
# First task: create a session for my already existing SQLite database. I need
# sqlalchemy's create_engine and sqlalchemy.orm's sessionmaker, plus automap_base
# to create the classes from the database as it already exists. I did this
# manually, and sloppily, in the previous notebook.
# There is an order to doing this that I do not quite understand yet: I declare
# my base, pre-declare my new table (I'll call it "imputed"), and then create my engine.
Base = automap_base()
# This is the same setup as the 'measurement' table, except for the relationships. There is a more SQLAlchemy way to declare this table though...
class Imputed(Base):
    __tablename__ = 'imputed'
   # __table_args__ = ''

    id = Column(Integer, primary_key = True)
    station = Column('station', TEXT())
    date = Column('date', TEXT())
    prcp = Column('prcp', FLOAT())
    tobs = Column('tobs', FLOAT())

    def __repr__(self):
        # Adjacent f-strings avoid the stray whitespace that backslash continuations add
        return (f'station={self.station} date={self.date} '
                f'prcp={self.prcp} tobs={self.tobs}')

engine = create_engine(f'sqlite:///{database_path}',  connect_args={'timeout': 15})
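# Create the 'imputed' table in the database if it does not already exist
Base.metadata.create_all(engine)
# Start from a fresh base and reflect every table, including the new one
Base = automap_base()
# SQLAlchemy 1.4+ spelling; older versions use Base.prepare(engine, reflect=True)
Base.prepare(autoload_with=engine)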
Measurement = Base.classes.measurement
Station = Base.classes.station
Imputed = Base.classes.imputed
# Now to make a connection and a session for our queries:
conn = engine.connect()
Session = sessionmaker(bind=engine)
session = Session()

## Data Cleaning:
Extract whole record => Impute Null Values => Populate imputed table with the values from previous step
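
The three steps above can be sketched end to end on toy data. To keep the sketch runnable with only the standard library, it swaps in a plain least-squares line of `prcp` on `tobs` for sklearn's `IterativeImputer`, and it uses `sqlite3` directly rather than SQLAlchemy. The rows are made up, and `fit_line` and `sketch_db` are hypothetical names, so treat this as an outline of the flow rather than the method this notebook uses.

```python
import sqlite3


def fit_line(pairs):
    """Least-squares slope and intercept for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# In-memory stand-in for hawaii.sqlite; these rows are invented for illustration.
sketch_db = sqlite3.connect(":memory:")
for table in ("measurement", "imputed"):
    sketch_db.execute(
        f"CREATE TABLE {table} ("
        "id INTEGER PRIMARY KEY, station TEXT, date TEXT, prcp FLOAT, tobs FLOAT)"
    )
sketch_db.executemany(
    "INSERT INTO measurement VALUES (?, ?, ?, ?, ?)",
    [
        (1, "S1", "2017-01-01", 0.10, 70.0),
        (2, "S1", "2017-01-02", 0.20, 72.0),
        (3, "S1", "2017-01-03", None, 74.0),  # prcp missing
        (4, "S1", "2017-01-04", 0.40, 76.0),
    ],
)

# 1. Extract: rows with a known prcp to fit on, rows with a null prcp to fill.
known = sketch_db.execute(
    "SELECT tobs, prcp FROM measurement WHERE prcp IS NOT NULL"
).fetchall()
missing = sketch_db.execute(
    "SELECT id, station, date, tobs FROM measurement WHERE prcp IS NULL"
).fetchall()

# 2. Impute: predict prcp from tobs for each row that lacks it.
slope, intercept = fit_line(known)
filled = [
    (row_id, station, date, round(slope * tobs + intercept, 2), tobs)
    for row_id, station, date, tobs in missing
]

# 3. Load: only the imputed rows go into the 'imputed' table.
sketch_db.executemany(
    "INSERT INTO imputed (id, station, date, prcp, tobs) VALUES (?, ?, ?, ?, ?)",
    filled,
)
sketch_db.commit()
imputed_rows = sketch_db.execute("SELECT id, prcp, tobs FROM imputed").fetchall()
print(imputed_rows)
```

Keeping the original `id` on each imputed row means the `imputed` table can later be joined back to `measurement`.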

In [ ]:
measurement_df = pd.read_sql_table('measurement', conn)
null_impute_df = pd.read_sql_query('SELECT * FROM measurement WHERE prcp ISNULL;', conn, index_col='id')
conn.close()
null_impute_df.head()

In [ ]:
# This function takes a dataframe and returns a copy with an imputed 'prcp_imp' column added:
def impute_prcp(data_frame):
    # Fit on a random third of the rows that were passed in
    impute_fit = data_frame.sample(len(data_frame) // 3)
    imp = IterativeImputer(max_iter=1000, random_state=1235312395)
    imp.fit(impute_fit[['prcp', 'tobs']])
    # Keep the original index so the imputed column lines up row for row
    impute_df = pd.DataFrame(np.round(imp.transform(data_frame[['prcp', 'tobs']]), 2),
                             columns=['prcp', 'tobs'], index=data_frame.index)
    # Work on a copy so the caller's dataframe is not modified
    cat_df = data_frame.copy()
    cat_df['prcp_imp'] = impute_df['prcp']
    return cat_df

In [ ]:
# Some quick hacking to get the dataframe into the right format: 
test = impute_prcp(measurement_df)
cols = ['id', 'station','date', 'prcp', 'tobs']
merge_df = pd.merge(null_impute_df, test, how='inner', sort=False, on=['id', 'station', 'date', 'tobs'])
# Rename rather than assign into a slice, which avoids pandas' SettingWithCopyWarning
imputed_df = merge_df[['id', 'station', 'date', 'prcp_imp', 'tobs']].rename(columns={'prcp_imp': 'prcp'})
imputed_df = imputed_df[cols]
imputed_df = imputed_df.set_index('id')
imputed_df


In [ ]:
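# Bulk insert was giving me concurrency errors that I couldn't figure out. SQLite plays nicely with pandas' to_sql, so let's just do that.
# index must be a bool; index_label names the column the dataframe index is written to
imputed_df.to_sql('imputed', engine, if_exists='append', index=True, index_label='id')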
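
Because the load uses `if_exists='append'` and `id` is the primary key of `imputed`, running it a second time would try to insert the same ids again. The sketch below shows one possible guard, again with the standard-library `sqlite3` module on an in-memory database and made-up rows; `load_once` and `rerun_db` are hypothetical names, not part of this notebook.

```python
import sqlite3

# In-memory stand-in for hawaii.sqlite; the table is trimmed to two columns for brevity.
rerun_db = sqlite3.connect(":memory:")
rerun_db.execute("CREATE TABLE imputed (id INTEGER PRIMARY KEY, prcp FLOAT)")
rows_to_load = [(1, 0.12), (2, 0.05)]  # invented values


def load_once(db, rows):
    """Insert rows only when 'imputed' is still empty; return how many were inserted."""
    (existing,) = db.execute("SELECT COUNT(*) FROM imputed").fetchone()
    if existing:
        return 0
    db.executemany("INSERT INTO imputed (id, prcp) VALUES (?, ?)", rows)
    db.commit()
    return len(rows)


first_run = load_once(rerun_db, rows_to_load)   # inserts both rows
second_run = load_once(rerun_db, rows_to_load)  # table already populated, nothing inserted

# Without the guard, repeating the insert violates the primary key.
try:
    rerun_db.executemany("INSERT INTO imputed (id, prcp) VALUES (?, ?)", rows_to_load)
    duplicate_error = None
except sqlite3.IntegrityError as exc:
    duplicate_error = exc

print(first_run, second_run, type(duplicate_error).__name__)
```

An emptiness check is the simplest guard; dropping and recreating the table before each load would be another option if the imputation is meant to be repeatable.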
