# Driverless AI NLP with Key Words Demo

In this notebook we will:
1. Download and prep an open source text dataset which has 3 categories
2. Connect to Driverless AI and run a model (this notebook uses the lowest settings and no word embeddings; you will probably want higher settings and word embeddings in a real use case)
3. Download the predictions, clean the text with gensim, and run tf-idf to get human-readable features
4. Upload 3 datasets to DAI (one for each target class) and run MLI on each one to understand which key phrases positively and negatively impact each prediction
5. Go to the DAI UI to look at the dashboard and surrogate models

In [1]:
import gensim
import numpy as np
import pandas as pd
from h2oai_client import Client
from sklearn import model_selection
from sklearn.feature_extraction.text import TfidfVectorizer

## 1. Download and Prep an Open Source Dataset

The code below downloads the Twitter airline sentiment dataset and saves it in the current folder.

In [2]:
! wget https://www.figure-eight.com/wp-content/uploads/2016/03/Airline-Sentiment-2-w-AA.csv

--2019-10-11 11:39:49--  https://www.figure-eight.com/wp-content/uploads/2016/03/Airline-Sentiment-2-w-AA.csv
Resolving www.figure-eight.com (www.figure-eight.com)... 3.208.243.97, 54.164.48.21
Connecting to www.figure-eight.com (www.figure-eight.com)|3.208.243.97|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3704908 (3.5M) [application/octet-stream]
Saving to: ‘Airline-Sentiment-2-w-AA.csv.2’


2019-10-11 11:39:52 (1.41 MB/s) - ‘Airline-Sentiment-2-w-AA.csv.2’ saved [3704908/3704908]



We can now split the data into training and testing datasets.

In [3]:
al = pd.read_csv("Airline-Sentiment-2-w-AA.csv", encoding='ISO-8859-1')
train_al, test_al = model_selection.train_test_split(al, test_size=0.2, random_state=2018)
train_al.to_csv("train_airline_sentiment.csv", index=False)
test_al.to_csv("test_airline_sentiment.csv", index=False)
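
As a quick sanity check on the split, the sketch below repeats the same `train_test_split` call on a synthetic stand-in for the data. The 14,640-row size is an assumption derived from the train and test row counts printed later in this notebook, and the random labels are purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection

# Synthetic stand-in for the airline data (assumed size, random labels).
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    "text": ["tweet %d" % i for i in range(14640)],
    "airline_sentiment": rng.choice(["negative", "neutral", "positive"], size=14640),
})

# Same call as in the cell above: hold out 20% of the rows with a fixed seed.
toy_train, toy_test = model_selection.train_test_split(
    toy, test_size=0.2, random_state=2018
)
print(len(toy_train), len(toy_test))

# Compare class shares in the two splits; a large gap would be a reason
# to consider passing stratify= to train_test_split.
print(toy_train["airline_sentiment"].value_counts(normalize=True))
print(toy_test["airline_sentiment"].value_counts(normalize=True))
```

The split in this notebook is not stratified, so comparing the class shares on the real data is a cheap way to confirm that the test set resembles the training set.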

In [4]:
train_path = './train_airline_sentiment.csv'
test_path = './test_airline_sentiment.csv'

In [25]:
target = "airline_sentiment"

## 2. Connect to DAI & Run a Model

In [6]:
h2oai = Client("http://IP:12345", "UN", "PW")

Read the train and test files into Driverless AI using the `upload_dataset_sync` command.

In [7]:
train = h2oai.upload_dataset_sync(train_path)
test = h2oai.upload_dataset_sync(test_path)

Now let us look at some basic information about the dataset.

In [8]:
print('Train Dataset: ', len(train.columns), 'x', train.row_count)
print('Test Dataset: ', len(test.columns), 'x', test.row_count)

[c.name for c in train.columns]

Train Dataset:  20 x 11712
Test Dataset:  20 x 2928


['_unit_id',
 '_golden',
 '_unit_state',
 '_trusted_judgments',
 '_last_judgment_at',
 'airline_sentiment',
 'airline_sentiment:confidence',
 'negativereason',
 'negativereason:confidence',
 'airline',
 'airline_sentiment_gold',
 'name',
 'negativereason_gold',
 'retweet_count',
 'text',
 'tweet_coord',
 'tweet_created',
 'tweet_id',
 'tweet_location',
 'user_timezone']

We need only two columns for our experiment: `text`, which contains the text of the tweet, and `airline_sentiment`, which contains the sentiment of the tweet (the target column). We drop the remaining columns.
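
Rather than typing the list of dropped columns by hand, it can be derived from the column names. The following is a minimal sketch; `all_cols`, `keep_cols`, and `cols_to_drop` are illustrative names, and in the notebook `all_cols` could come from `[c.name for c in train.columns]`.

```python
# Column names as printed in the output above.
all_cols = [
    "_unit_id", "_golden", "_unit_state", "_trusted_judgments",
    "_last_judgment_at", "airline_sentiment", "airline_sentiment:confidence",
    "negativereason", "negativereason:confidence", "airline",
    "airline_sentiment_gold", "name", "negativereason_gold", "retweet_count",
    "text", "tweet_coord", "tweet_created", "tweet_id", "tweet_location",
    "user_timezone",
]

# The text feature and the target are the only columns we keep.
keep_cols = ["text", "airline_sentiment"]

# Everything else is dropped; the original column order is preserved.
cols_to_drop = [c for c in all_cols if c not in keep_cols]
print(len(cols_to_drop), "columns to drop")
```

The resulting list could then be passed wherever the hand-written list of dropped columns appears in the cells below.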

In [9]:
exp_preview = h2oai.get_experiment_preview_sync(
    dataset_key=train.key
    , validset_key=''
    , target_col=target
    , classification=True
    , dropped_cols=["_unit_id", "_golden", "_unit_state", "_trusted_judgments", "_last_judgment_at",
                  "airline_sentiment:confidence", "negativereason", "negativereason:confidence", "airline",
                  "airline_sentiment_gold", "name", "negativereason_gold", "retweet_count", 
                  "tweet_coord", "tweet_created", "tweet_id", "tweet_location", "user_timezone"]
    , accuracy=6
    , time=4
    , interpretability=5
    , is_time_series=False
    , enable_gpus=True
    , reproducible=False
    , resumed_experiment_id=''
    , config_overrides="""
        enable_tensorflow_charcnn='on'
        enable_tensorflow_textcnn='on'
        enable_tensorflow_textbigru='on'
    """
)
exp_preview

['ACCURACY [6/10]:',
 '- Training data size: *11,712 rows, 2 cols*',
 '- Feature evolution: *[LightGBM, XGBoostGBM]*, *3-fold CV**, 2 reps*',
 '- Final pipeline: *Ensemble (6 models), 3-fold CV*',
 '',
 'TIME [4/10]:',
 '- Feature evolution: *4 individuals*, up to *64 iterations*',
 '- Early stopping: After *5* iterations of no improvement',
 '',
 'INTERPRETABILITY [5/10]:',
 '- Feature pre-pruning strategy: None',
 '- Monotonicity constraints: disabled',
 '- Feature engineering search space (where applicable): [CVCatNumEncode, CVTargetEncode, ClusterDist, ClusterId, ClusterTE, Dates, Frequent, Interactions, IsHoliday, NumCatTE, NumToCatTE, Original, TextBiGRU, TextCNN, TextCharCNN, Text, TruncSVDNum]',
 '',
 '[LightGBM, XGBoostGBM] models to train:',
 '- Model and feature tuning: *144*',
 '- Feature evolution: *504*',
 '- Final pipeline: *6*',
 '',
 'Estimated runtime: *minutes*',
 'Auto-click Finish if not done in: *1 day*']

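Note that the feature engineering search space in the preview includes the `Text`, `TextCNN`, `TextCharCNN`, and `TextBiGRU` transformers, so text features are enabled for this preview.

Now we can start the experiment. For speed, we use the most basic settings (`accuracy=1`, `time=1`, `interpretability=10`) and leave the TensorFlow overrides commented out.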

In [11]:
model = h2oai.start_experiment_sync(
    dataset_key=train.key,
    testset_key=test.key,
    target_col=target,
    scorer='F1',
    is_classification=True,
    cols_to_drop=["_unit_id", "_golden", "_unit_state", "_trusted_judgments", "_last_judgment_at",
                  "airline_sentiment:confidence", "negativereason", "negativereason:confidence", "airline",
                  "airline_sentiment_gold", "name", "negativereason_gold", "retweet_count", 
                  "tweet_coord", "tweet_created", "tweet_id", "tweet_location", "user_timezone"],
    accuracy=1,
    time=1,
    interpretability=10,
    enable_gpus=True,
#     config_overrides="""
#         enable_tensorflow_charcnn='on'
#         enable_tensorflow_textcnn='on'
#         enable_tensorflow_textbigru='on'
#     """
)

## 3. Download the Predictions & Prep Text Data

In [63]:
test_pred_path = h2oai.download_prediction_sync(dest_dir="."
                                                , model_key=model.key
                                                , dataset_type="test" #train, test, or validation
                                                , include_columns="[text, airline_sentiment]")

test_preds = pd.read_csv(test_pred_path)

In [64]:
test_preds.head()

Unnamed: 0,airline_sentiment,airline,text,airline_sentiment.negative,airline_sentiment.neutral,airline_sentiment.positive
0,negative,US Airways,@USAirways We did. @AmericanAir said to open o...,0.726035,0.115679,0.158286
1,negative,United,@united @AmericanAir so that's it? It just end...,0.847708,0.039875,0.112417
2,negative,American,@AmericanAir I paid seat upgrade b4 the severe...,0.829466,0.1354,0.035133
3,negative,American,@AmericanAir @_Lucy_May surely much better to ...,0.662014,0.163143,0.174843
4,negative,US Airways,"@USAirways As a member of the news media, the ...",0.792003,0.038913,0.169084


In [65]:
n_items = len(test_preds[target].unique())
test_preds["predicted_value"] = test_preds.iloc[:,-n_items:].idxmax(axis=1).str.replace("airline_sentiment.", "", regex=False)

for m in test_preds["airline_sentiment"].unique():
    test_preds["actual_"+m] = test_preds[target] == m
    test_preds["predicted_"+m] = test_preds['predicted_value'] == m


    
test_preds.head()

Unnamed: 0,airline_sentiment,airline,text,airline_sentiment.negative,airline_sentiment.neutral,airline_sentiment.positive,predicted_value,actual_negative,predicted_negative,actual_positive,predicted_positive,actual_neutral,predicted_neutral
0,negative,US Airways,@USAirways We did. @AmericanAir said to open o...,0.726035,0.115679,0.158286,negative,True,True,False,False,False,False
1,negative,United,@united @AmericanAir so that's it? It just end...,0.847708,0.039875,0.112417,negative,True,True,False,False,False,False
2,negative,American,@AmericanAir I paid seat upgrade b4 the severe...,0.829466,0.1354,0.035133,negative,True,True,False,False,False,False
3,negative,American,@AmericanAir @_Lucy_May surely much better to ...,0.662014,0.163143,0.174843,negative,True,True,False,False,False,False
4,negative,US Airways,"@USAirways As a member of the news media, the ...",0.792003,0.038913,0.169084,negative,True,True,False,False,False,False
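
The cell above turns the per-class probabilities into a predicted label and then into one-vs-rest flags. The sketch below shows the same idea on a three-row toy frame with made-up probabilities; selecting the probability columns by prefix instead of by position is an assumption of this sketch, not what the notebook does.

```python
import pandas as pd

target = "airline_sentiment"

# Toy predictions with made-up probabilities, one column per class.
toy = pd.DataFrame({
    target: ["negative", "positive", "neutral"],
    target + ".negative": [0.73, 0.10, 0.30],
    target + ".neutral": [0.12, 0.20, 0.25],
    target + ".positive": [0.15, 0.70, 0.45],
})

# Pick the probability columns by their shared prefix.
prob_cols = [c for c in toy.columns if c.startswith(target + ".")]

# idxmax(axis=1) returns the name of the column holding each row's maximum;
# stripping the prefix leaves the class label.
toy["predicted_value"] = (
    toy[prob_cols].idxmax(axis=1).str.replace(target + ".", "", regex=False)
)

# One-vs-rest flags: one actual/predicted pair of boolean columns per class.
for label in toy[target].unique():
    toy["actual_" + label] = toy[target] == label
    toy["predicted_" + label] = toy["predicted_value"] == label

print(toy[["predicted_value", "actual_neutral", "predicted_neutral"]])
```

In the toy data the third row is actually neutral but is predicted positive, so its `actual_neutral` and `predicted_neutral` flags disagree. Disagreements of this kind are what the per-class datasets expose later on.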


In [66]:
# clean text: lowercase and tokenize with gensim, then remove stopwords
test_preds['clean_text'] = test_preds['text'].apply(lambda t: " ".join(gensim.utils.simple_preprocess(t)))
test_preds['clean_text'] = test_preds['clean_text'].apply(gensim.parsing.preprocessing.remove_stopwords)

test_preds.head()

Unnamed: 0,airline_sentiment,airline,text,airline_sentiment.negative,airline_sentiment.neutral,airline_sentiment.positive,predicted_value,actual_negative,predicted_negative,actual_positive,predicted_positive,actual_neutral,predicted_neutral,clean_text
0,negative,US Airways,@USAirways We did. @AmericanAir said to open o...,0.726035,0.115679,0.158286,negative,True,True,False,False,False,False,usairways americanair said open
1,negative,United,@united @AmericanAir so that's it? It just end...,0.847708,0.039875,0.112417,negative,True,True,False,False,False,False,united americanair ends come traveled literall...
2,negative,American,@AmericanAir I paid seat upgrade b4 the severe...,0.829466,0.1354,0.035133,negative,True,True,False,False,False,False,americanair paid seat upgrade severe weather g...
3,negative,American,@AmericanAir @_Lucy_May surely much better to ...,0.662014,0.163143,0.174843,negative,True,True,False,False,False,False,americanair surely better let share feedback hide
4,negative,US Airways,"@USAirways As a member of the news media, the ...",0.792003,0.038913,0.169084,negative,True,True,False,False,False,False,usairways member news media awful service days...


In [67]:
tfidf = TfidfVectorizer(min_df=50, ngram_range=(1,4))
tfidf_fit = tfidf.fit_transform(test_preds["clean_text"])

feature_names = tfidf.get_feature_names()  # on scikit-learn >= 1.0, use get_feature_names_out() instead
features_df = pd.DataFrame(tfidf_fit.toarray(), columns = feature_names)

features_df.head()

Unnamed: 0,aa,agent,airline,airport,americanair,amp,bag,bags,cancelled,cancelled flightled,...,trying,united,usairways,ve,virginamerica,wait,waiting,want,way,weather
0,0.0,0.0,0.0,0.0,0.708924,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.705285,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.443345,0.0,0.0,0.0,0.0,0.0,...,0.0,0.39244,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.26631,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.488454
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.572441,0.0,0.0,0.0,0.0,0.0,0.0,0.0
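
To see what `min_df` and `ngram_range` do, here is a small sketch on a four-document toy corpus. The corpus and the lower `min_df=2` threshold are assumptions chosen so that the effect is visible on so few documents.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "cancelled flight today",
    "cancelled flight again",
    "flight delayed",
    "great flight",
]

# Keep unigrams and bigrams that occur in at least 2 documents.
toy_tfidf = TfidfVectorizer(min_df=2, ngram_range=(1, 2))
weights = toy_tfidf.fit_transform(docs)

# get_feature_names_out() exists on scikit-learn >= 1.0;
# older releases only provide get_feature_names().
if hasattr(toy_tfidf, "get_feature_names_out"):
    names = list(toy_tfidf.get_feature_names_out())
else:
    names = list(toy_tfidf.get_feature_names())

toy_features = pd.DataFrame(weights.toarray(), columns=names)
print(toy_features.round(3))
```

Only `cancelled`, `cancelled flight`, and `flight` survive the document-frequency filter; terms such as `delayed` and `great flight` appear in a single document and are dropped. The notebook's `min_df=50` applies the same filter at a scale suited to the full test set.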


In [68]:
df = pd.concat([test_preds, features_df],sort=False,axis=1)
df.head()

Unnamed: 0,airline_sentiment,airline,text,airline_sentiment.negative,airline_sentiment.neutral,airline_sentiment.positive,predicted_value,actual_negative,predicted_negative,actual_positive,...,trying,united,usairways,ve,virginamerica,wait,waiting,want,way,weather
0,negative,US Airways,@USAirways We did. @AmericanAir said to open o...,0.726035,0.115679,0.158286,negative,True,True,False,...,0.0,0.0,0.705285,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,negative,United,@united @AmericanAir so that's it? It just end...,0.847708,0.039875,0.112417,negative,True,True,False,...,0.0,0.39244,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,negative,American,@AmericanAir I paid seat upgrade b4 the severe...,0.829466,0.1354,0.035133,negative,True,True,False,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.488454
3,negative,American,@AmericanAir @_Lucy_May surely much better to ...,0.662014,0.163143,0.174843,negative,True,True,False,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,negative,US Airways,"@USAirways As a member of the news media, the ...",0.792003,0.038913,0.169084,negative,True,True,False,...,0.0,0.0,0.572441,0.0,0.0,0.0,0.0,0.0,0.0,0.0
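
One caveat with `pd.concat(..., axis=1)` is that rows are matched on index labels, not on position. It works in this notebook because both frames carry the default 0..n-1 index. The sketch below, on made-up frames, shows what happens when the indexes differ and how `reset_index(drop=True)` avoids it.

```python
import pandas as pd

# Two frames with the same number of rows but different index labels.
left = pd.DataFrame({"text": ["a", "b", "c"]}, index=[10, 11, 12])
right = pd.DataFrame({"flight": [0.1, 0.0, 0.5]})  # default index 0..2

# Concatenating on mismatched indexes produces the union of the labels,
# with NaN wherever one side has no matching row.
misaligned = pd.concat([left, right], axis=1)

# Resetting the index first makes the rows line up by position.
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned.shape, aligned.shape)
```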


## 4. Load Data to DAI & Run MLI

In [69]:
# Final loop to run MLI on each group
for m in df["airline_sentiment"].unique():

    csv_name = m+"_mli.csv"
    
    # target and prediction flags for this class, followed by all tf-idf key-word columns
    final_cols = ["actual_"+m, "predicted_"+m] + features_df.columns.tolist()
        
    df_m = df[final_cols]
    
    print("Writing to "+csv_name)
    
    path = "/Users/mtanco/Downloads/" + csv_name
    df_m.to_csv(path, index=False)
    
    df_m_dai = h2oai.upload_dataset_sync(path)
    
    print("Starting MLI")
    mli = h2oai.run_interpretation_sync(dai_model_key = "" # no experiment key
                                        , dataset_key = df_m_dai.key
                                        , target_col = "actual_"+m
                                        , prediction_col = "predicted_"+m
                                        , lime_method="LIME_SUP"
                                        )

Writing to negative_mli.csv
Starting MLI
Writing to positive_mli.csv
Starting MLI
Writing to neutral_mli.csv
Starting MLI
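
For intuition about what a surrogate model over key-word features can show, the sketch below fits a plain scikit-learn logistic regression to the predicted flag on made-up tf-idf weights. This is a stand-in chosen for illustration and is not how Driverless AI MLI is implemented; the data and column names are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up tf-idf weights for two key words, plus the model's predicted flag.
toy = pd.DataFrame({
    "predicted_negative": [True, True, True, False, False, False],
    "cancelled": [0.9, 0.8, 0.7, 0.0, 0.0, 0.1],
    "thanks": [0.0, 0.0, 0.1, 0.9, 0.8, 0.7],
})
keyword_cols = ["cancelled", "thanks"]

# Fit a simple, interpretable model that mimics the predictions.
surrogate = LogisticRegression().fit(toy[keyword_cols], toy["predicted_negative"])

# The sign of each coefficient indicates whether the key word pushes
# the surrogate toward or away from the "negative" prediction.
coefs = pd.Series(surrogate.coef_[0], index=keyword_cols).sort_values()
print(coefs)
```

In this toy data `cancelled` gets a positive coefficient and `thanks` a negative one, which is the kind of positive and negative key-phrase impact the MLI dashboards are meant to surface for each class.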


## 5. Go to DAI UI & Look at Dashboard & Surrogate Models