![tower_bridge](tower_bridge.jpg)

As the climate changes, predicting the weather becomes ever more important for businesses. You have been asked to support a machine learning project that aims to build a pipeline for predicting the weather in London, England. Specifically, the model should predict the mean temperature in degrees Celsius (°C).

Because the weather depends on many different factors, you will want to run many experiments to determine the best approach to predicting it. In this project, you will run experiments for different regression models that predict the mean temperature, using a combination of `sklearn` and `mlflow`.

You will be working with data stored in `london_weather.csv`, which contains the following columns:
- **date** - recorded date of measurement - (**int**)
- **cloud_cover** - cloud cover measurement in oktas - (**float**)
- **sunshine** - sunshine measurement in hours (hrs) - (**float**)
- **global_radiation** - irradiance measurement in watts per square meter (W/m²) - (**float**)
- **max_temp** - maximum temperature recorded in degrees Celsius (°C) - (**float**)
- **mean_temp** - **target** mean temperature in degrees Celsius (°C) - (**float**)
- **min_temp** - minimum temperature recorded in degrees Celsius (°C) - (**float**)
- **precipitation** - precipitation measurement in millimeters (mm) - (**float**)
- **pressure** - pressure measurement in Pascals (Pa) - (**float**)
- **snow_depth** - snow depth measurement in centimeters (cm) - (**float**)

In [None]:
# Run this cell to import the modules you require
import pandas as pd
import numpy as np
import mlflow
import mlflow.sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Loading Data

In [None]:
# Read in the data
weather = pd.read_csv("london_weather.csv")

# Start coding here
# Use as many cells as you like

In [None]:
weather.head()

In [None]:
weather.describe()

# Cleaning Data

In [None]:
weather['date'] = pd.to_datetime(weather['date'], format='%Y%m%d')
weather.head()

- Process null values
- Normalize features for linear regression
- Extract day of year, day of week, day of month and month as features
- Generate the average mean temperature per month
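
The date-derived items in the list above can be sketched with the pandas `.dt` accessor. This is a minimal illustration on a made-up three-row frame, not the project data; the names `sample`, `day_of_year`, `day_of_week`, `day_of_month` and `month_avg_temp` are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from london_weather.csv)
sample = pd.DataFrame({
    "date": pd.to_datetime(["20200101", "20200115", "20200201"], format="%Y%m%d"),
    "mean_temp": [5.0, 7.0, 4.0],
})

# Calendar features derived from the datetime column
sample["day_of_year"] = sample["date"].dt.dayofyear
sample["day_of_week"] = sample["date"].dt.dayofweek  # Monday is 0
sample["day_of_month"] = sample["date"].dt.day
sample["month"] = sample["date"].dt.month

# Average mean temperature per calendar month, broadcast back to every row
sample["month_avg_temp"] = sample.groupby("month")["mean_temp"].transform("mean")
```

If a monthly average like this is used as a model feature, computing it from the training split only would avoid leaking test targets into the features.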


In [None]:
weather.isna().sum()

In [None]:
# Only a small number of rows have nulls in these columns, so let's drop those rows.
weather = weather.dropna(
    subset=['mean_temp', 'cloud_cover', 'global_radiation', 'pressure', 'precipitation']
)

In [None]:
weather.isna().sum()

# Exploratory Data Analysis

In [None]:
plt.figure(figsize=(50, 6))
sns.lineplot(data=weather, x='date', y='mean_temp')

In [None]:
plt.figure(figsize=(25, 6))
sns.lineplot(data=weather, x='date', y='snow_depth')

In [None]:
# Since it's a time series, let's interpolate to fill in the missing snow_depth values
weather['snow_depth'] = weather['snow_depth'].interpolate(method='linear')
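
To see what linear interpolation does to a gap, here is a small sketch on a made-up series (the values are illustrative, not from the dataset). Missing values between two known points are filled along a straight line between them.

```python
import numpy as np
import pandas as pd

# Made-up series with a two-value gap between known readings
gappy = pd.Series([0.0, np.nan, np.nan, 3.0])

# Linear interpolation fills the gap with evenly stepped values
filled = gappy.interpolate(method="linear")
```

Note that `method='linear'` treats the rows as equally spaced and ignores the index, so rows removed earlier do not affect the spacing it assumes.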

# Feature Selection

In [None]:
sns.pairplot(data=weather)

# Preprocessing Data

In [None]:
weather.loc[:, 'month'] = weather['date'].dt.month

In [None]:
X = weather.drop(['mean_temp', 'date'], axis=1)
y = weather['mean_temp']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=43)

len(X_train), len(X_test), len(y_train), len(y_test)

In [None]:
X_columns = X.columns

scaler = StandardScaler() 
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
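
The scaler is fitted on the training split and only applied to the test split. A minimal sketch on made-up arrays (the names `demo_scaler`, `train` and `test` are hypothetical) shows what that means: the test values are standardized with the training mean and standard deviation, so no test-set statistics enter preprocessing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data for illustration
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = demo_scaler.transform(test)        # reuses the training statistics
```

After scaling, the training column has mean 0 and unit standard deviation, while the test column generally does not, which is expected.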


In [None]:
X.isna().sum()

# Training and Evaluation

In [None]:
# Storing mlflow tracking in a sqlite file.
mlflow.set_tracking_uri('sqlite:///./mydb.sqlite')

In [None]:
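# Create the experiment on first use; otherwise look it up by name.
experiment = mlflow.get_experiment_by_name('London_Weather')
if experiment is None:
    experiment_id = mlflow.create_experiment('London_Weather')
    experiment = mlflow.get_experiment(experiment_id)

print(f'Name: {experiment.name}')
print(f'Creation time: {experiment.creation_time}')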

In [None]:
mlflow.set_experiment('London_Weather')

In [None]:
mlflow.sklearn.autolog()

In [None]:
eval_data = pd.DataFrame(data=X_test, columns=X_columns).reset_index(drop=True)
eval_data["mean_temp"] = y_test.reset_index(drop=True)

In [None]:
with mlflow.start_run():

    dt = DecisionTreeRegressor(criterion='squared_error',
                               splitter='best',
                               max_depth=5)

    dt.fit(X_train, y_train)

    mlflow.sklearn.log_model(dt, "model")
    
    model_uri = mlflow.get_artifact_uri("model")
    
    # Evaluate the logged model on the evaluation data as a regressor,
    # using only the "default" evaluator.
    result = mlflow.evaluate(
        model_uri,
        eval_data,
        targets="mean_temp",
        model_type="regressor",
        evaluators="default"
    )

    predicts = dt.predict(X_test)
    dt_rmse = np.sqrt(mean_squared_error(y_test, predicts))
    mlflow.log_metric("rmse_dt", dt_rmse)
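
The RMSE logged above is the square root of the mean squared error. A minimal NumPy sketch on made-up numbers (not model output) shows the same calculation by hand; the names `y_true_demo` and `y_pred_demo` are hypothetical.

```python
import numpy as np

# Made-up targets and predictions for illustration
y_true_demo = np.array([1.0, 2.0, 3.0])
y_pred_demo = np.array([1.0, 2.0, 5.0])

# Square the residuals, average them, then take the square root
rmse_demo = np.sqrt(np.mean((y_true_demo - y_pred_demo) ** 2))
```

Because the errors are squared before averaging, a single large miss raises the RMSE more than several small ones. The result is in the same unit as the target, here degrees Celsius.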

In [None]:
with mlflow.start_run():

    lr = LinearRegression() 

    lr.fit(X_train, y_train)

    mlflow.sklearn.log_model(lr, "model")
    
    model_uri = mlflow.get_artifact_uri("model")
    
    # Evaluate the logged model on the evaluation data as a regressor,
    # using only the "default" evaluator.
    result = mlflow.evaluate(
        model_uri,
        eval_data,
        targets="mean_temp",
        model_type="regressor",
        evaluators="default"
    )

    predicts = lr.predict(X_test)
    lr_rmse = np.sqrt(mean_squared_error(y_test, predicts))
    mlflow.log_metric("rmse_lr", lr_rmse)

In [None]:
experiment_results = mlflow.search_runs()
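
`mlflow.search_runs()` returns a pandas DataFrame in which each logged metric appears as a `metrics.<name>` column. One possible way to compare the runs is sketched below on a stand-in frame; the run ids and RMSE values are placeholders invented for illustration, not results from this notebook.

```python
import pandas as pd

# Stand-in for the DataFrame returned by mlflow.search_runs();
# run ids and metric values are placeholders, not real results.
runs_demo = pd.DataFrame({
    "run_id": ["run_a", "run_b"],
    "metrics.rmse_dt": [1.5, None],
    "metrics.rmse_lr": [None, 1.2],
})

# Each run logged its RMSE under a different metric name, so every row has
# exactly one non-null rmse_* value; a row-wise min (NaN skipped) picks it out.
rmse_cols = [c for c in runs_demo.columns if c.startswith("metrics.rmse_")]
runs_demo["rmse"] = runs_demo[rmse_cols].min(axis=1)

# Lowest RMSE first
ranked = runs_demo.sort_values("rmse")[["run_id", "rmse"]]
```

Logging every model's RMSE under one shared metric name would make this collapsing step unnecessary.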