## Model Development Using Scikit-Learn
- Please open this notebook in edit mode.

#### Load the airline delay data from the CSV file into a pandas DataFrame
- This is the CSV file we stored as a data asset in the previous step.

In [7]:
import pandas as pd
from project_lib import Project


project = Project.access()
df = pd.read_csv('/project_data/data_asset/train_flights_jan_2015_csv_shaped_1ev8vmlhh3wga43246hgfq3dc')
df.head()

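| | DAY | DAY_OF_WEEK | ORIGIN_AIRPORT | DESTINATION_AIRPORT | DEPARTURE_DELAY | TAXI_OUT | DISTANCE | DELAYED |
|---|---|---|---|---|---|---|---|---|
| 0 | 23 | 5 | MSP | PHX | -2.0 | 13.0 | 1276 | 0 |
| 1 | 2 | 5 | RAP | SLC | -6.0 | 32.0 | 508 | 0 |
| 2 | 8 | 4 | STX | MIA | -5.0 | 9.0 | 1139 | 0 |
| 3 | 16 | 5 | CLT | ATL | 6.0 | 14.0 | 226 | 0 |
| 4 | 11 | 7 | ATL | BDL | 0.0 | 18.0 | 859 | 0 |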


### Separate features and label column

In [9]:
# Features: every column except the label
X = df.drop('DELAYED', axis=1)
# Label: 1 if the flight was delayed, 0 otherwise
y = df['DELAYED']

#### Separate categorical and numerical columns

In [10]:
cat = ['DAY', 'DAY_OF_WEEK', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT']
numeric = ['DEPARTURE_DELAY', 'TAXI_OUT', 'DISTANCE']
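
As a quick sanity check, it can help to confirm that the two lists cover every feature column exactly once. By default, `ColumnTransformer` drops any column that is not listed (`remainder='drop'`), so a forgotten column would silently disappear from the model. The sketch below is an illustration only: it uses a two-row toy frame built from the rows shown above, and `check_column_split` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Toy frame reusing the first two rows of the preview above.
toy = pd.DataFrame({
    "DAY": [23, 2],
    "DAY_OF_WEEK": [5, 5],
    "ORIGIN_AIRPORT": ["MSP", "RAP"],
    "DESTINATION_AIRPORT": ["PHX", "SLC"],
    "DEPARTURE_DELAY": [-2.0, -6.0],
    "TAXI_OUT": [13.0, 32.0],
    "DISTANCE": [1276, 508],
    "DELAYED": [0, 0],
})
X_toy = toy.drop("DELAYED", axis=1)

cat_cols = ["DAY", "DAY_OF_WEEK", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT"]
num_cols = ["DEPARTURE_DELAY", "TAXI_OUT", "DISTANCE"]


def check_column_split(frame, cat_cols, num_cols):
    """Hypothetical helper: report problems with a categorical/numeric split.

    Returns three sorted lists:
      missing    - listed columns that are not in the frame
      unassigned - frame columns that appear in neither list
      overlap    - columns that appear in both lists
    """
    listed = set(cat_cols) | set(num_cols)
    missing = sorted(listed - set(frame.columns))
    unassigned = sorted(set(frame.columns) - listed)
    overlap = sorted(set(cat_cols) & set(num_cols))
    return missing, unassigned, overlap


missing, unassigned, overlap = check_column_split(X_toy, cat_cols, num_cols)
print(missing, unassigned, overlap)
```

Three empty lists mean the split is complete and non-overlapping.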

### Create preprocessor for categorical and numerical columns

In [12]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric),
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat)])
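
To see what this preprocessor produces, the following minimal sketch applies the same two transformers to a small made-up frame with one numeric and one categorical column. It assumes scikit-learn and pandas are installed; the frame, the `ZZZ` airport code and the `demo_` names are illustrative. `sparse_threshold=0` is set here only to force a dense array that is easy to print; the notebook's preprocessor keeps the default.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training data: one categorical and one numeric column.
train = pd.DataFrame({
    "ORIGIN_AIRPORT": ["MSP", "RAP", "MSP", "CLT"],
    "DISTANCE": [1276, 508, 1139, 226],
})

demo_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["DISTANCE"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["ORIGIN_AIRPORT"]),
    ],
    sparse_threshold=0,  # demo only: always return a dense array
)

# Output columns: 1 scaled numeric + 3 one-hot columns (CLT, MSP, RAP).
encoded_train = demo_preprocessor.fit_transform(train)
print(encoded_train.shape)

# An airport that was not seen during fit is encoded as all zeros
# instead of raising an error, because handle_unknown="ignore".
unseen = pd.DataFrame({"ORIGIN_AIRPORT": ["ZZZ"], "DISTANCE": [500]})
encoded_unseen = demo_preprocessor.transform(unseen)
print(encoded_unseen)
```

Encoding unknown categories as all zeros matters for scoring: a flight from an airport missing in the training data can still be passed to the model.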

### Create a pipeline with the preprocessor and an estimator

In [13]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Alternative estimator: clf = GradientBoostingClassifier()
clf = LogisticRegression()
pipe = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', clf)])

### K-fold cross-validation

In [14]:
import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipe, X, y, cv=5)
scores.mean()

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.9133772530075503
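
The solver message above indicates that `LogisticRegression` reached its iteration limit (`max_iter` defaults to 100) before converging. One of the remedies the message suggests is raising `max_iter`. The sketch below shows how that parameter could be set on an estimator inside a pipeline using the `<step>__<param>` naming. It runs on synthetic data from `make_classification`, so it does not reproduce the score above, and the `demo_` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the flight data (assumption: demo only).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),  # default max_iter is 100
])

# Raise the iteration limit of the 'classifier' step.
demo_pipe.set_params(classifier__max_iter=1000)

# 5-fold cross-validation returns one accuracy score per fold.
demo_scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=5)
print(demo_scores.mean())

# After fitting, n_iter_ reports how many iterations the solver used.
demo_pipe.fit(X_demo, y_demo)
print(demo_pipe.named_steps["classifier"].n_iter_)
```

If `n_iter_` stays below `max_iter`, the solver stopped because it converged rather than because it ran out of iterations.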

### Fit the model

In [15]:
model = pipe.fit(X, y)

In [16]:
sdf=pd.DataFrame([[11,7,"MSP","PHX",2,11,570]],columns=X.columns)
sdf

| | DAY | DAY_OF_WEEK | ORIGIN_AIRPORT | DESTINATION_AIRPORT | DEPARTURE_DELAY | TAXI_OUT | DISTANCE |
|---|---|---|---|---|---|---|---|
| 0 | 11 | 7 | MSP | PHX | 2 | 11 | 570 |


In [17]:
model.predict(sdf)

array([0])
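
`predict` returns only the class label. When the confidence behind a prediction is of interest, a fitted classifier pipeline also offers `predict_proba`, which returns one probability per class in the order given by `classes_`. The following is a minimal sketch on a tiny made-up frame (two feature columns only, invented delay values), not the notebook's trained model.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training data for illustration.
toy = pd.DataFrame({
    "ORIGIN_AIRPORT": ["MSP", "RAP", "STX", "CLT", "ATL", "MSP"],
    "DEPARTURE_DELAY": [-2.0, -6.0, -5.0, 6.0, 45.0, 60.0],
    "DELAYED": [0, 0, 0, 0, 1, 1],
})
features = ["ORIGIN_AIRPORT", "DEPARTURE_DELAY"]

demo_model = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["DEPARTURE_DELAY"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["ORIGIN_AIRPORT"]),
    ])),
    ("classifier", LogisticRegression()),
]).fit(toy[features], toy["DELAYED"])

# Score a single new flight; column names must match the training frame.
new_flight = pd.DataFrame([["MSP", 2.0]], columns=features)
proba = demo_model.predict_proba(new_flight)

print(demo_model.classes_)  # class order for the probability columns
print(proba)                # one row per flight, one column per class
```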

## Store trained model
- We will store the trained model in the project. The model should appear in the `Assets` tab under the `Models` section.


##### Create a WML client
- It fetches required credentials from the environment. **Please don't replace anything.**

In [78]:
# Uncomment to install the latest WML client (only needed once, if required):
#!pip install -U ibm-watson-machine-learning

In [20]:
project = Project.access()
project_id = project.get_metadata()['metadata']['guid']
print(project_id)

1f4d574e-3b85-4be4-b7da-57ff5b06f7b9


In [21]:
import os
from ibm_watson_machine_learning import APIClient
token = os.environ['USER_ACCESS_TOKEN']
wml_credentials = {
   "token": token,
   "instance_id" : "wml_local",
   "url": os.environ['RUNTIME_ENV_APSX_URL'],
   "version": "4.0"
}
wml_client = APIClient(wml_credentials)
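
If either environment variable is missing, the cell above fails with a bare `KeyError`. The sketch below shows one way the same credentials dictionary could be assembled with a clearer error message. `build_wml_credentials` is a hypothetical helper, not part of `ibm_watson_machine_learning`, and the demo values are placeholders; inside the notebook runtime the function would be called without arguments so that it reads `os.environ`.

```python
import os

# Environment variables read by the notebook cell above.
REQUIRED_VARS = ("USER_ACCESS_TOKEN", "RUNTIME_ENV_APSX_URL")


def build_wml_credentials(env=None):
    """Hypothetical helper: build the credentials dict from a mapping.

    Uses os.environ when no mapping is given and raises a descriptive
    error if a required variable is missing or empty.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
        )
    return {
        "token": env["USER_ACCESS_TOKEN"],
        "instance_id": "wml_local",
        "url": env["RUNTIME_ENV_APSX_URL"],
        "version": "4.0",
    }


# Demonstration with placeholder values instead of the real environment.
demo_credentials = build_wml_credentials({
    "USER_ACCESS_TOKEN": "dummy-token",
    "RUNTIME_ENV_APSX_URL": "https://example.invalid",
})
print(sorted(demo_credentials))
```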

In [22]:
wml_client.set.default_project(project_id)
#wml_client.software_specifications.list()

'SUCCESS'

### Prepare metadata for storing the model
- <font color='red'>Please provide a unique model name to `MODEL_NAME` before running the following cell.</font>
- Note: the runtime identifier (`SOFTWARE_SPEC_UID` in the cell below) is a very important part of the metadata.


In [23]:
MODEL_NAME= "<unique-model-name>"
project = Project.access()
project_id=project.get_metadata()['metadata']['guid']
#print(project_id)

software_spec_uid = wml_client.software_specifications.get_uid_by_name("default_py3.7")
#print(software_spec_uid)

metadata = {
    wml_client.repository.ModelMetaNames.NAME: MODEL_NAME,
    wml_client.repository.ModelMetaNames.SOFTWARE_SPEC_UID: software_spec_uid,
    wml_client.repository.ModelMetaNames.TYPE: "scikit-learn_0.23"
}

### Store model in the WML repository
- `store_model` derives the schema details from the training data and target passed to it.
- The model is stored in the project set as the default project.

In [24]:
# Set the default project
wml_client.set.default_project(project_id)

# Store the model together with its metadata and training schema
model_details = wml_client.repository.store_model(
    model, meta_props=metadata, training_data=X, training_target=y)
model_details

{'entity': {'label_column': 'DELAYED',
  'software_spec': {'id': 'e4429883-c883-42b6-87a8-f419d64088cd',
   'name': 'default_py3.7'},
  'training_data_references': [{'connection': {'access_key_id': 'not_applicable',
     'endpoint_url': 'not_applicable',
     'secret_access_key': 'not_applicable'},
    'id': '1',
    'location': {},
    'schema': {'fields': [{'name': 'DAY', 'type': 'int64'},
      {'name': 'DAY_OF_WEEK', 'type': 'int64'},
      {'name': 'ORIGIN_AIRPORT', 'type': 'object'},
      {'name': 'DESTINATION_AIRPORT', 'type': 'object'},
      {'name': 'DEPARTURE_DELAY', 'type': 'float64'},
      {'name': 'TAXI_OUT', 'type': 'float64'},
      {'name': 'DISTANCE', 'type': 'int64'}],
     'id': '1',
     'type': 'DataFrame'},
    'type': 'fs'}],
  'type': 'scikit-learn_0.23'},
 'metadata': {'created_at': '2021-08-26T20:20:38.926Z',
  'id': '716067e5-55e3-4417-a04b-453f8695f2cc',
  'modified_at': '2021-08-26T20:20:40.846Z',
  'name': 'airline-scikitlearn',
  'owner': '1000330999',