Task Overview
====

## Brief
TASK 3: RISK STRATIFICATION FOR ‘ACTUAL HARM’ BASED ON
‘EVENT DESCRIPTION’



## Requirements

To assist the Department to utilise text mining and machine learning techniques
for risk stratification of events and examine the link between the ‘Event
Description’ and ‘Actual Harm’.

Machine Learning Step 1
===

In this step we prepare the data: we check the Python version, download pre-trained GloVe word embeddings, and tokenize a small example corpus.

In [1]:
# Python 3.x is required to run the following program
# To show the Python version, run:
!python -V

Python 3.6.3


Next, we download and extract the pre-trained GloVe embeddings, then convert them to word2vec format.

In [0]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip
# This step is required only if the file is not already in word2vec format
!pip install -q gensim
!python -m gensim.scripts.glove2word2vec --input glove.6B.50d.txt --output 50d.w2vformat.txt
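
The conversion mostly amounts to adding a header: the word2vec text format starts with a line `<vocab_size> <dimensions>`, followed by the same `word v1 v2 ...` rows that GloVe uses. Here is a minimal sketch of doing this by hand on a tiny made-up file instead of the real `glove.6B.50d.txt`. The function name `glove_to_word2vec`, the file names and the sample vectors are illustrative assumptions, not part of gensim.

```python
import os
import tempfile


def glove_to_word2vec(glove_path, w2v_path):
    """Copy a GloVe text file to w2v_path, prepending the 'count dims' header."""
    with open(glove_path, encoding="utf-8") as src:
        lines = [line.rstrip("\n") for line in src if line.strip()]
    # Each row is "word v1 v2 ... vN", so the dimension is the field count minus one
    dims = len(lines[0].split(" ")) - 1
    with open(w2v_path, "w", encoding="utf-8") as dst:
        dst.write(f"{len(lines)} {dims}\n")
        for line in lines:
            dst.write(line + "\n")
    return len(lines), dims


# Toy stand-in for a GloVe file (made-up 3-dimensional vectors)
workdir = tempfile.mkdtemp()
glove_file = os.path.join(workdir, "toy.glove.txt")
w2v_file = os.path.join(workdir, "toy.w2vformat.txt")
with open(glove_file, "w", encoding="utf-8") as f:
    f.write("how 0.1 0.2 0.3\nare 0.4 0.5 0.6\nyou 0.7 0.8 0.9\n")

count, dims = glove_to_word2vec(glove_file, w2v_file)
with open(w2v_file, encoding="utf-8") as f:
    header = f.readline().strip()
print(header)
```

This sketch reads the whole file into memory, which is acceptable for a demonstration; for the full embedding files the gensim script above is the safer choice.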

In [0]:
# Requirements for this task to run:
# Pandas
# Gensim
# Sklearn
# Keras and NumPy (imported below)

import io, os, sys, types, time, datetime, math, random, requests, subprocess, tempfile
import gensim.models.keyedvectors
import keras
import numpy as np

# -----------------------
# set up task options
#------------------------

MAX_NUM_WORDS = 10000
EMB_DIMS = 50

sen1 = "how are you"
sen2 = "nice to meet you and good day"

# ----------------------
# helper functions
# ----------------------

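def sentence_avg(sentence):
  """Return the mean embedding of a sentence given as a list of word indices."""
  # Invert word_index (word -> index) so that indices map back to words.
  # This replaces the -1 offset into list(word_index.keys()).
  index_word = {index: word for word, index in word_index.items()}
  vectors = []
  for index in sentence:
    word = index_word[index]
    if word in w2v_gl:
      vectors.append(w2v_gl[word])
  # Fall back to a zero vector when no word has a pre-trained embedding
  if not vectors:
    return np.zeros(EMB_DIMS)
  return np.mean(vectors, axis=0)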

# Note: fit_on_texts must be called on the entire corpus, not on a single sentence
tokenizer = keras.preprocessing.text.Tokenizer(num_words = MAX_NUM_WORDS)
corpus = (sen1, sen2)
tokenizer.fit_on_texts(corpus)
X = tokenizer.texts_to_sequences(corpus)
word_index = tokenizer.word_index


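PRE_TRAINED_EMBEDDING = '50d.w2vformat.txt'
model_g = gensim.models.keyedvectors.KeyedVectors.load_word2vec_format(PRE_TRAINED_EMBEDDING, binary = False)
# index2word is aligned with the rows of model_g.vectors (gensim < 4.0;
# in gensim >= 4.0 the attribute is called index_to_key)
w2v_gl = {w: vec for w, vec in zip(model_g.index2word, model_g.vectors)}
print(w2v_gl['word'])
print(X)

# Average the word vectors of each sentence so that every row of X has
# EMB_DIMS values, which is the fixed-length input a classifier expects
X = np.array([sentence_avg(sentence) for sentence in X])
y = [1, 0]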
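
Sentences differ in length, so the token sequences produced by the tokenizer cannot be passed to a classifier as they are. Averaging the word vectors gives every sentence a feature vector of the same size. The sketch below shows the idea with made-up 2-dimensional embeddings and plain Python lists, so it runs without the GloVe file; `average_vectors` and `toy_embeddings` are hypothetical names used only for this illustration.

```python
def average_vectors(words, embeddings, dims):
    """Average the embeddings of the known words; unknown words are skipped."""
    known = [embeddings[w] for w in words if w in embeddings]
    if not known:
        # No known word: return a zero vector so every sentence has the same length
        return [0.0] * dims
    # zip(*known) groups the values of each dimension together
    return [sum(column) / len(known) for column in zip(*known)]


# Made-up 2-dimensional embeddings, for illustration only
toy_embeddings = {
    "how": [1.0, 0.0],
    "are": [0.0, 1.0],
    "you": [1.0, 1.0],
}

features = average_vectors("how are you".split(), toy_embeddings, dims=2)
unknown = average_vectors(["zzz"], toy_embeddings, dims=2)
print(features)
print(unknown)
```

Averaging discards word order, which is a known limitation of this kind of feature; it is used here because it is simple and gives a fixed-length input.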

Machine Learning Step 2
===

In [13]:
!ls

datafiles  glove.6B.100d.txt  glove.6B.300d.txt  glove.6B.zip
datalab    glove.6B.200d.txt  glove.6B.50d.txt


In [21]:
# Embedding file for glove.6B.50d in word2vec format; replace with your own file
PRE_TRAINED_EMBEDDING = '50d.w2vformat.txt'
model_g = gensim.models.keyedvectors.KeyedVectors.load_word2vec_format(PRE_TRAINED_EMBEDDING, binary = False)
# index2word is aligned with the rows of model_g.vectors (gensim < 4.0;
# in gensim >= 4.0 the attribute is called index_to_key)
w2v_gl = {w: vec for w, vec in zip(model_g.index2word, model_g.vectors)}
print(w2v_gl['word'])

[-0.1643     0.15722   -0.55021   -0.3303     0.66463   -0.1152
 -0.2261    -0.23674   -0.86119    0.24319    0.074499   0.61081
  0.73683   -0.35224    0.61346    0.0050975 -0.62538   -0.0050458
  0.18392   -0.12214   -0.65973   -0.30673    0.35038    0.75805
  1.0183    -1.7424    -1.4277     0.38032    0.37713   -0.74941
  2.9401    -0.8097    -0.66901    0.23123   -0.073194  -0.13624
  0.24424   -1.0129    -0.24919   -0.06893    0.70231   -0.022177
 -0.64684    0.59599    0.027092   0.11203    0.61214    0.74339
  0.23572   -0.1369   ]


Machine Learning Step 3
===

# define classifier

In [0]:
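from sklearn.linear_model import LogisticRegression

# The data should be divided into training and test sets before this step.
# X must be a 2-D array of fixed-length feature vectors (for example,
# averaged word embeddings), one row per document.

# The logistic regression classifier uses the default hyperparameters here
C = 1.0
classifier = LogisticRegression(C=C)

# np.ravel flattens y into the 1-D label array that scikit-learn expects
classifier.fit(X, np.ravel(y))
y_pred = classifier.predict(X)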
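
The cell above fits and predicts on the same data, so it cannot tell us how the classifier behaves on unseen events. A minimal sketch of the train/test split mentioned in the comment follows, written with the standard library only. The function `split_train_test`, the 25% test fraction and the toy data are assumptions made for illustration; scikit-learn offers `train_test_split` in `sklearn.model_selection` for the same purpose.

```python
import random


def split_train_test(features, labels, test_fraction=0.25, seed=0):
    """Shuffle example indices and hold out a fraction of them for testing."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(rows, idx):
        return [rows[i] for i in idx]

    return (pick(features, train_idx), pick(features, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Made-up feature vectors and harm labels, for illustration only
toy_X = [[0.1, 0.2], [0.3, 0.1], [0.9, 0.8], [0.7, 0.9],
         [0.2, 0.2], [0.8, 0.7], [0.1, 0.3], [0.9, 0.9]]
toy_y = [0, 0, 1, 1, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = split_train_test(toy_X, toy_y)
print(len(X_train), len(X_test))
```

The classifier would then be fitted on `X_train`, `y_train` and evaluated on `X_test`, `y_test`. A fixed seed keeps the split reproducible between runs.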

Assume we use TF-IDF as the vectorizer.

In [1]:
# TF-IDF in scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenize(doc):
  # Lowercase and split on single spaces; punctuation stays attached to the words
  return doc.lower().split(" ")

document_0 = "China has a strong economy that is growing at a rapid pace. However politically it differs greatly from the US Economy."
document_1 = "At last, China seems serious about confronting an endemic problem: domestic violence and corruption."
document_2 = "Japan's prime minister, Shinzo Abe, is working towards healing the economic turmoil in his own country for his view on the future of his people."
document_3 = "Vladimir Putin is working hard to fix the economy in Russia as the Ruble has tumbled."
document_4 = "What's the future of Abenomics? We asked Shinzo Abe for his views"
document_5 = "Obama has eased sanctions on Cuba while accelerating those against the Russian Economy, even as the Ruble's value falls almost daily."
document_6 = "Vladimir Putin was found to be riding a horse, again, without a shirt on while hunting deer. Vladimir Putin always seems so serious about things - even riding horses."

all_documents = [document_0, document_1, document_2, document_3, document_4, document_5, document_6]

# min_df=1 keeps every term (recent scikit-learn releases reject the integer min_df=0)
sklearn_tfidf = TfidfVectorizer(norm='l2', min_df=1, use_idf=True, smooth_idf=False, sublinear_tf=True, tokenizer=tokenize)
sklearn_representation = sklearn_tfidf.fit_transform(all_documents)

print(sklearn_representation)

  (0, 17)	0.18320378146489946
  (0, 42)	0.15022972156764192
  (0, 1)	0.31019096605521496
  (0, 77)	0.23957330918096045
  (0, 28)	0.18320378146489946
  (0, 78)	0.23957330918096045
  (0, 50)	0.15022972156764192
  (0, 40)	0.23957330918096045
  (0, 15)	0.18320378146489946
  (0, 65)	0.23957330918096045
  (0, 59)	0.23957330918096045
  (0, 47)	0.23957330918096045
  (0, 61)	0.23957330918096045
  (0, 51)	0.23957330918096045
  (0, 24)	0.23957330918096045
  (0, 39)	0.23957330918096045
  (0, 37)	0.23957330918096045
  (0, 79)	0.10868731908150663
  (0, 86)	0.23957330918096045
  (0, 30)	0.23957330918096045
  (1, 17)	0.2214557196249166
  (1, 15)	0.2214557196249166
  (1, 53)	0.2895948935298433
  (1, 72)	0.2214557196249166
  (1, 73)	0.2214557196249166
  :	:
  (6, 1)	0.2549169133624212
  (6, 72)	0.1505580355265494
  (6, 73)	0.1505580355265494
  (6, 5)	0.1505580355265494
  (6, 57)	0.12345974289432536
  (6, 91)	0.2549169133624212
  (6, 64)	0.2549169133624212
  (6, 82)	0.1505580355265494
  (6, 95)	0.1505580
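
To see where these weights come from, here is a hand-written sketch of the settings used above: `sublinear_tf=True` replaces the term count tf with 1 + ln(tf), `smooth_idf=False` gives idf = ln(n / df) + 1, and `norm='l2'` scales each row to unit length. The function `tfidf_rows` and the toy documents are assumptions for illustration, not scikit-learn code.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """TF-IDF with sublinear tf, unsmoothed idf and L2 row normalisation."""
    tokenized = [doc.lower().split(" ") for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {
            # (1 + ln tf) * (ln(n / df) + 1)
            term: (1.0 + math.log(tf)) * (math.log(n_docs / df[term]) + 1.0)
            for term, tf in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


# Made-up documents, for illustration only
toy_docs = ["the economy is growing", "the ruble has tumbled", "the economy the ruble"]
rows = tfidf_rows(toy_docs)
for row in rows:
    print(row)
```

The weights printed by scikit-learn above appear consistent with this formula: 0.1832 / 0.2396 ≈ 0.765, which matches (ln(7/2) + 1) / (ln(7) + 1) for a term found in two of the seven documents compared with a term found in only one.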

In [3]:
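type(sklearn_representation)  # a SciPy sparse matrix; not displayed, since it is not the last expression in the cell
df_list = iter(sklearn_representation)  # iterate row by row (one row per document)
a = next(df_list)  # first row: the TF-IDF weights of document_0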
print(a)

  (0, 17)	0.18320378146489946
  (0, 42)	0.15022972156764192
  (0, 1)	0.31019096605521496
  (0, 77)	0.23957330918096045
  (0, 28)	0.18320378146489946
  (0, 78)	0.23957330918096045
  (0, 50)	0.15022972156764192
  (0, 40)	0.23957330918096045
  (0, 15)	0.18320378146489946
  (0, 65)	0.23957330918096045
  (0, 59)	0.23957330918096045
  (0, 47)	0.23957330918096045
  (0, 61)	0.23957330918096045
  (0, 51)	0.23957330918096045
  (0, 24)	0.23957330918096045
  (0, 39)	0.23957330918096045
  (0, 37)	0.23957330918096045
  (0, 79)	0.10868731908150663
  (0, 86)	0.23957330918096045
  (0, 30)	0.23957330918096045
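
Each printed line has the form `(row, col) value`: `row` is the document, `col` is the column of a term in the vocabulary, and only non-zero cells are listed. The sketch below mimics this layout with raw term counts instead of TF-IDF weights and assigns columns in sorted term order. The helper names and toy documents are assumptions for illustration; on a fitted scikit-learn vectorizer the term-to-column mapping is available as `vocabulary_`.

```python
def build_vocabulary(documents):
    """Map each distinct token to a column index, in sorted order."""
    terms = sorted({t for doc in documents for t in doc.lower().split(" ")})
    return {term: col for col, term in enumerate(terms)}


def sparse_entries(documents, vocabulary):
    """Yield (row, col, count) triples for the non-zero cells only."""
    for row, doc in enumerate(documents):
        counts = {}
        for token in doc.lower().split(" "):
            counts[token] = counts.get(token, 0) + 1
        for token, count in sorted(counts.items()):
            yield row, vocabulary[token], count


# Made-up documents, for illustration only
toy_docs = ["the ruble has tumbled", "the economy the ruble"]
vocab = build_vocabulary(toy_docs)
entries = list(sparse_entries(toy_docs, vocab))

# Invert the vocabulary to translate column indices back into terms
index_to_term = {col: term for term, col in vocab.items()}
for row, col, count in entries:
    print((row, col), index_to_term[col], count)
```

Looking up a column index in the inverted vocabulary is how an entry such as `(0, 79)` can be traced back to the word it represents.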
