<a href="https://colab.research.google.com/github/mfligiel/Models-for-MLOPS-Review/blob/main/Evidently_for_WeatherModel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Weather Data

I am going to predict Chicago's weather from the weather in 5 other nearby cities, pulled from a weather API.  The model isn't especially useful in itself, but it works well for showcasing model monitoring.

Here, I will pull in some June data, but with Phoenix swapped in for Toronto.  That gives a rather different temperature distribution!

In [1]:
!pip install evidently

Collecting evidently
  Downloading https://files.pythonhosted.org/packages/8b/64/817e8fb176d8393eb2b49f5650957e7ddb11dc3f9d531deb9e26036f8553/evidently-0.1.19.dev0-py3-none-any.whl (15.2MB)
Collecting dataclasses
  Downloading https://files.pythonhosted.org/packages/26/2f/1095cdc2868052dd1e64520f7c0d5c8c550ad297e944e641dbf1ffbb9a5d/dataclasses-0.6-py3-none-any.whl
Installing collected packages: dataclasses, evidently
Successfully installed dataclasses-0.6 evidently-0.1.19.dev0


In [2]:
import requests
import pandas as pd
import time
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
import evidently

In [3]:

from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


This should work!  I'll now find the WOEIDs (Where On Earth IDs) of the 5 cities I will use to predict Chicago's weather:

Milwaukee\
Detroit\
Toronto\
St Louis\
Omaha, NE


I'll use this site to look it up: https://www.findmecity.com/

Milwaukee: 2451822\
Detroit: 2391585\
Toronto: 4118\
St. Louis: 2486982\
Omaha, NE: 2465512

I'll switch Toronto's WOEID for that of Phoenix: 2471390
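
The request loop that follows builds one URL per city and day by string concatenation. As a minimal sketch of the same pattern, the helper below assembles the URL in one place; `BASE_URL` and `build_day_url` are hypothetical names introduced here for illustration and are not part of the notebook.

```python
# Hypothetical helper (not in the notebook): build one daily-history URL.
BASE_URL = "https://www.metaweather.com/api/location"


def build_day_url(woeid: str, year: int, month: int, day: int) -> str:
    """Return the URL for one location (by WOEID) and one calendar day."""
    return f"{BASE_URL}/{woeid}/{year}/{month}/{day}/"


# Phoenix's WOEID stands in for Toronto's in this run.
phoenix_url = build_day_url("2471390", 2021, 6, 1)
print(phoenix_url)
```

Keeping the URL format in a single function makes it easier to swap a WOEID, as done here for Phoenix, without touching the loop itself.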


In [None]:
# Dictionary of city name -> WOEID. The 'Toronto' key is kept so the column
# name matches the training data, but its WOEID is Phoenix's (2471390).
cities = {'Milwaukee':'2451822', 'Detroit':'2391585', 'Toronto':'2471390', 'St. Louis':'2486982', 'Omaha':'2465512', 'Chicago':'2379574'}

# Empty list to collect [city, date, maxtemp] rows
values = []

# Loop through cities
for k, v in cities.items():
  # Only June for this pull
  for mth in ['6']:
    # range(1, 15) covers June 1-14; this is not treated as a time series
    for day in range(1, 15):
      # URL to request
      strng = 'https://www.metaweather.com/api/location/' + v + '/2021/' + mth + '/' + str(day) + '/'
      if day == 1:
        print(strng)
      reqst = requests.get(strng)
      # Parse the response once, then take the column-wise maximum
      day_max = pd.DataFrame(reqst.json()).max()
      date = pd.to_datetime(day_max['created']).date()
      maxtemp = day_max['max_temp']
      values.append([k, date, maxtemp])
      # Pause between requests to go easy on the API
      time.sleep(3)





https://www.metaweather.com/api/location/2451822/2021/6/1/
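
The loop above keeps two things from each response: the latest `created` value and the highest `max_temp`. The same extraction can be sketched without pandas, assuming each response is a list of dicts carrying those two keys (the fields the loop reads). `daily_max` is a hypothetical helper, and the sample payload is made up for illustration; it is not real API output.

```python
from datetime import date


def daily_max(records):
    """Return (date of the latest 'created' stamp, highest 'max_temp').

    `records` is assumed to be a list of dicts with a 'created' ISO 8601
    string and a numeric 'max_temp', mirroring the fields the loop reads.
    """
    # ISO 8601 strings in one format sort chronologically as plain strings
    latest_created = max(r["created"] for r in records)
    hottest = max(r["max_temp"] for r in records)
    # The first 10 characters of an ISO 8601 timestamp are YYYY-MM-DD
    return date.fromisoformat(latest_created[:10]), hottest


# Made-up payload for illustration only
sample = [
    {"created": "2021-06-01T03:10:00", "max_temp": 24.5},
    {"created": "2021-06-01T15:40:00", "max_temp": 26.1},
]
result = daily_max(sample)
print(result)
```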


In [None]:
import pickle

#pickle.dump(values, open('.pkl', 'wb'))

In [None]:
pd.DataFrame(values).to_csv('Chicago.csv')

In [None]:
!ls

Chicago.csv  gdrive  sample_data


In [None]:
!cp Chicago.csv gdrive/MyDrive

Now that this is pulled in, I can begin building a basic model.  Given that this is mostly for the purpose of tracking, I am fine with just making an SVM and doing minimal hyperparameter optimization.

To do this, I will
- load the files
- pivot them according to city
- drop date column (I am ignoring time series aspect here)
- run a quick grid search
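
The pivot and drop steps can be sketched on a few rows. The values below are copied from the notebook output shown further down (the first two dates for three cities). Resetting the index to discard the dates is one assumed way to "drop" them; in the notebook itself the dates simply become the index after the pivot.

```python
import pandas as pd

# Long format: one row per (city, date), as stored in the CSV files
long_df = pd.DataFrame(
    [
        ["Milwaukee", "2021-03-02", 3.505],
        ["Milwaukee", "2021-03-03", 7.490],
        ["Detroit", "2021-03-02", 8.360],
        ["Detroit", "2021-03-03", 6.820],
        ["Chicago", "2021-03-02", 6.060],
        ["Chicago", "2021-03-03", 6.100],
    ],
    columns=["city", "date", "maxtemp"],
)

# Wide format: one row per date, one column per city
wide_df = long_df.pivot(index="date", columns="city", values="maxtemp")

# Discard the dates (assumption: time ordering is ignored entirely)
wide_df = wide_df.reset_index(drop=True)

# Chicago is the target; every other city is a feature
X = wide_df.drop("Chicago", axis=1)
y = wide_df["Chicago"]
print(X.columns.tolist(), y.tolist())
```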

In [None]:
!ls gdrive/MyDrive/ModelMonitoringBlog/

 Chicago.csv
 Detroit.csv
'First take on Evidently, and potential Datasets, MF 6.18.21.gdoc'
 Milwaukee.csv
'Model Code'
'Notes, 6.23.2021.gdoc'
 Omaha.csv
 St_Louis.csv
'Table of Contents.gdoc'
 Toronto.csv


In [None]:
# Recreate the dictionary above, this time with Toronto's real WOEID
cities = {'Milwaukee':'2451822', 'Detroit':'2391585', 'Toronto':'4118', 'St. Louis':'2486982', 'Omaha':'2465512', 'Chicago':'2379574'}

df = pd.DataFrame()

for i in cities.keys():
  # The St. Louis file is named with an underscore instead of '. '
  if i == 'St. Louis':
    i = 'St_Louis'
  pth = "gdrive/MyDrive/ModelMonitoringBlog/" + i + ".csv"
  print(pth)
  to_append = pd.read_csv(pth)
  print(to_append.head())
  # The first file seeds the frame; later files are appended to it
  if df.empty:
    df = to_append
    print(df.empty)
  else:
    df = pd.concat([df, to_append], ignore_index=True)
  


gdrive/MyDrive/ModelMonitoringBlog/Milwaukee.csv
   Unnamed: 0          0           1      2
0           0  Milwaukee  2021-03-02  3.505
1           1  Milwaukee  2021-03-03  7.490
2           2  Milwaukee  2021-03-04  9.355
3           3  Milwaukee  2021-03-05  9.935
4           4  Milwaukee  2021-03-06  9.345
False
gdrive/MyDrive/ModelMonitoringBlog/Detroit.csv
   Unnamed: 0        0           1       2
0           0  Detroit  2021-03-02   8.360
1           1  Detroit  2021-03-03   6.820
2           2  Detroit  2021-03-04  12.215
3           3  Detroit  2021-03-05  11.255
4           4  Detroit  2021-03-06  10.325
gdrive/MyDrive/ModelMonitoringBlog/Toronto.csv
   Unnamed: 0        0           1      2
0           0  Toronto  2021-03-02  6.035
1           1  Toronto  2021-03-03  3.360
2           2  Toronto  2021-03-04  7.225
3           3  Toronto  2021-03-05  7.525
4           4  Toronto  2021-03-06  7.010
gdrive/MyDrive/ModelMonitoringBlog/St_Louis.csv
   Unnamed: 0          0     

In [None]:
df

,Unnamed: 0,0,1,2
0,0,Milwaukee,2021-03-02,3.505
1,1,Milwaukee,2021-03-03,7.490
2,2,Milwaukee,2021-03-04,9.355
3,3,Milwaukee,2021-03-05,9.935
4,4,Milwaukee,2021-03-06,9.345
...,...,...,...,...
535,85,Chicago,2021-05-27,28.900
536,86,Chicago,2021-05-28,23.165
537,87,Chicago,2021-05-29,23.900
538,88,Chicago,2021-05-30,28.300


In [None]:
#Now, to rename the columns
df.columns = ['drp', 'city', 'date', 'maxtemp']
df.drop('drp', axis=1, inplace=True)

In [None]:
df = df.pivot(index='date', columns='city', values='maxtemp')

In [None]:
df.head()

date,Chicago,Detroit,Milwaukee,Omaha,St. Louis,Toronto
2021-03-02,6.06,8.36,3.505,7.59,15.105,6.035
2021-03-03,6.1,6.82,7.49,14.695,15.48,3.36
2021-03-04,10.15,12.215,9.355,17.6,20.425,7.225
2021-03-05,9.785,11.255,9.935,17.77,18.55,7.525
2021-03-06,8.965,10.325,9.345,17.01,19.07,7.01


Okay - now it is time to do a train/test split and a quick grid search.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('Chicago', axis=1), df['Chicago'],
    test_size=0.25, random_state=101)



In [None]:
# Define the parameter grid (gamma only affects the rbf kernel here)
param_grid = {'C': [0.1, 1, 10, 100],
              'gamma': [1, 0.1, 0.01, 0.001],
              'kernel': ['rbf', 'linear']}

grid = GridSearchCV(SVR(), param_grid, refit = True, verbose = 2)

# Fit the model for the grid search
grid.fit(X_train, y_train)

Fitting 5 folds for each of 32 candidates, totalling 160 fits
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] ....................... C=0.1, gamma=1, kernel=rbf, total=   0.0s
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] ....................... C=0.1, gamma=1, kernel=rbf, total=   0.0s
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] ....................... C=0.1, gamma=1, kernel=rbf, total=   0.0s
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] ....................... C=0.1, gamma=1, kernel=rbf, total=   0.0s
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] ....................... C=0.1, gamma=1, kernel=rbf, total=   0.0s
[CV] C=0.1, gamma=1, kernel=linear ...................................
[CV] .................... C=0.1, gamma=1, kernel=linear, total=   0.0s
[CV] C=0.1, gamma=1, kernel=linear ...................................
[CV] ..........

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s


[CV] ................ C=0.1, gamma=0.001, kernel=linear, total=   0.0s
[CV] C=1, gamma=1, kernel=rbf ........................................
[CV] ......................... C=1, gamma=1, kernel=rbf, total=   0.0s
[CV] C=1, gamma=1, kernel=rbf ........................................
[CV] ......................... C=1, gamma=1, kernel=rbf, total=   0.0s
[CV] C=1, gamma=1, kernel=rbf ........................................
[CV] ......................... C=1, gamma=1, kernel=rbf, total=   0.0s
[CV] C=1, gamma=1, kernel=rbf ........................................
[CV] ......................... C=1, gamma=1, kernel=rbf, total=   0.0s
[CV] C=1, gamma=1, kernel=rbf ........................................
[CV] ......................... C=1, gamma=1, kernel=rbf, total=   0.0s
[CV] C=1, gamma=1, kernel=linear .....................................
[CV] ...................... C=1, gamma=1, kernel=linear, total=   0.0s
[CV] C=1, gamma=1, kernel=linear .....................................
[CV] .

[Parallel(n_jobs=1)]: Done 160 out of 160 | elapsed:    9.3s finished


GridSearchCV(cv=None, error_score=nan,
             estimator=SVR(C=1.0, cache_size=200, coef0=0.0, degree=3,
                           epsilon=0.1, gamma='scale', kernel='rbf',
                           max_iter=-1, shrinking=True, tol=0.001,
                           verbose=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001],
                         'kernel': ['rbf', 'linear']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=2)

In [None]:
print(grid.best_params_)

{'C': 0.1, 'gamma': 1, 'kernel': 'linear'}


In [None]:
print(grid.best_estimator_)

SVR(C=0.1, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma=1,
    kernel='linear', max_iter=-1, shrinking=True, tol=0.001, verbose=False)


In [None]:
print(grid.best_score_)

0.9426824538085308
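
With `scoring=None`, `best_score_` is the mean cross-validated R² over the training folds, so the held-out `X_test` and `y_test` from the split are not involved in it. A minimal sketch of how a held-out check might look, using synthetic stand-in data (not the weather data) and a smaller grid than the one above to keep it quick:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data (NOT the weather data):
# the target is a noisy average of 5 feature columns
rng = np.random.default_rng(101)
X = rng.normal(loc=15.0, scale=8.0, size=(80, 5))
y = X.mean(axis=1) + rng.normal(scale=0.5, size=80)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=101)

# Smaller grid than the notebook's, chosen only to keep the sketch fast
grid = GridSearchCV(SVR(), {"C": [0.1, 1], "kernel": ["linear"]}, refit=True)
grid.fit(X_train, y_train)

# Mean cross-validated R^2 on the training folds
cv_r2 = grid.best_score_
# R^2 of the refit best estimator on rows the search never saw
test_r2 = grid.score(X_test, y_test)
print(f"CV R^2: {cv_r2:.3f}, held-out R^2: {test_r2:.3f}")
```

Comparing the two numbers gives a rough sense of whether the cross-validated score is optimistic before the model is handed over for monitoring.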


In [None]:
import pickle

# Use a context manager so the file handle is closed after writing
with open('weather_model.pkl', 'wb') as f:
  pickle.dump(grid.best_estimator_, f)

In [None]:
!cp weather_model.pkl gdrive/MyDrive/ModelMonitoringBlog
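
To reuse the saved model later, for example in a monitoring notebook, the pickle has to be loaded back. The sketch below shows the round trip on a small synthetic stand-in model written to a temporary directory; it does not touch the real `weather_model.pkl`. Pickled scikit-learn models are generally best loaded with the same scikit-learn version that wrote them.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.svm import SVR

# Synthetic stand-in model (NOT the weather model)
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X.sum(axis=1)
model = SVR(kernel="linear", C=0.1).fit(X, y)

# Save to a temporary location, closing the file after writing
path = os.path.join(tempfile.mkdtemp(), "weather_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load it back and confirm the predictions are unchanged
with open(path, "rb") as f:
    restored = pickle.load(f)

same = np.allclose(model.predict(X), restored.predict(X))
print(same)
```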