<a href="https://colab.research.google.com/github/mralexdmitriy/parallel_hw/blob/master/colab_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### This tutorial explains 3 different methods for producing baseline results in the Kaggle competition https://www.kaggle.com/c/tensorflow2-question-answering/overview:
### 1) Using a CPU-only method without parallel computing
### 2) Using multiprocessing
### 3) Using PyCuda

#### In this competition, your goal is to predict short and long answer responses to real questions about Wikipedia articles. In this tutorial we predict only long answers, using the TF-IDF similarity between the question and each answer candidate taken from the corresponding article.
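To make the idea concrete, here is a minimal sketch of TF-IDF similarity on toy data. The document, question, and candidate strings are invented for illustration and are not taken from the dataset. The vectorizer is fitted on the article, then the question and every candidate are projected into the same vocabulary and compared by cosine similarity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data (hypothetical, not taken from the competition dataset)
document = "email marketing sends commercial messages to a group of people using email"
question = "what is email marketing"
candidates = [
    "email marketing sends commercial messages",
    "a group of people",
]

# Fit the vocabulary on the article only
tfidf = TfidfVectorizer(stop_words="english")
tfidf.fit([document])

q_vec = tfidf.transform([question])      # shape (1, n_terms)
c_vecs = tfidf.transform(candidates)     # shape (n_candidates, n_terms)

# One similarity per candidate; higher means closer to the question
toy_scores = cosine_similarity(q_vec, c_vecs).ravel()
best = int(toy_scores.argmax())
print(candidates[best], toy_scores)
```

The first candidate shares the terms "email" and "marketing" with the question, so it gets the higher score; the second shares none and scores zero.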

## 1) CPU-only method without parallel computing

### Importing

In [0]:
import os
import re
import json
import time
import warnings
import numpy as np
import pandas as pd
from scipy import spatial
from sklearn.metrics import accuracy_score, f1_score
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import text

### Mounting Google Drive to read the dataset (150k lines of JSON, 8 GB)

In [6]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


## Data overview

In [0]:
pd.set_option('display.max_columns', None)  
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', 200)

In [0]:
path = 'drive/My Drive/Colab Notebooks/150k.json'
def read_data(path,  chunksize = 5):
   
    df = []
    with open(path, 'rt') as reader:
        for i in range(chunksize):
            df.append(json.loads(reader.readline()))
    df = pd.DataFrame(df)
    return df

train = read_data(path)
train[['document_text', 'question_text', 'long_answer_candidates', 'annotations']].head(2)


Unnamed: 0,document_text,question_text,long_answer_candidates,annotations
0,"Email marketing - Wikipedia <H1> Email marketing </H1> Jump to : navigation , search <Table> <Tr> <Td> </Td> <Td> ( hide ) This article has multiple issues . Please help improve it or discuss thes...",which is the most common use of opt-in e-mail marketing,"[{'start_token': 14, 'top_level': True, 'end_token': 170}, {'start_token': 15, 'top_level': False, 'end_token': 169}, {'start_token': 52, 'top_level': False, 'end_token': 103}, {'start_token': 53,...","[{'yes_no_answer': 'NONE', 'long_answer': {'start_token': 1952, 'candidate_index': 54, 'end_token': 2019}, 'short_answers': [{'start_token': 1960, 'end_token': 1969}], 'annotation_id': 59316545022..."
1,"The Mother ( How I Met Your Mother ) - wikipedia <H1> The Mother ( How I Met Your Mother ) </H1> Jump to : navigation , search <Table> <Tr> <Th_colspan=""2""> Tracy McConnell </Th> </Tr> <Tr> <Td_co...",how i.met your mother who is the mother,"[{'start_token': 28, 'top_level': True, 'end_token': 212}, {'start_token': 29, 'top_level': False, 'end_token': 35}, {'start_token': 35, 'top_level': False, 'end_token': 45}, {'start_token': 45, '...","[{'yes_no_answer': 'NONE', 'long_answer': {'start_token': 212, 'candidate_index': 15, 'end_token': 310}, 'short_answers': [{'start_token': 213, 'end_token': 215}], 'annotation_id': 120348741537837..."


### Preprocessing and TFIDF - Vectorising

In [0]:
# Recent scikit-learn versions require stop_words to be a list, not a frozenset
stop_words = list(text.ENGLISH_STOP_WORDS.union(["book"]))
warnings.filterwarnings("ignore")

In [0]:
def predict(json_data):
    # Parse JSON data
    candidates = json_data['long_answer_candidates']
    doc_tokenized = json_data['document_text'].split(' ')
    question = json_data['question_text']
    question_s = question.split(' ') 
    annotation = json_data['annotations'][0]

    # TFIDF for the document
    # Convert a collection of raw documents to a matrix of TF-IDF features.

    tfidf = TfidfVectorizer(ngram_range=(1,1), stop_words=stop_words)
    tfidf.fit([json_data['document_text']])  
    q_tfidf = tfidf.transform([question]).todense() 
    
    # Find the nearest answer among the candidates using cosine distance
    scores = []
    for i, c in enumerate(candidates):
        s, e = c['start_token'], c['end_token']
        t = ' '.join(doc_tokenized[s:e])
        t_tfidf = tfidf.transform([t]).todense()
       
        score = 1 - spatial.distance.cosine(q_tfidf, t_tfidf)
        scores.append(score)

    # Pick the nearest candidate

    ans = (np.array(candidates)[np.argsort(scores)])[-1] # dict, top candidate
    
    if np.max(scores) < 0.2:
        ans_long = '-1:-1'
        ans = {'start_token': 0, 'end_token': 0}
    else:
        ans_long = str(ans['start_token']) + ':' + str(ans['end_token'])
              
    return ans_long
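The loop in `predict` calls `tfidf.transform` and the cosine function once per candidate. A possible alternative, sketched below, vectorizes all candidate texts in one call and scores them with a single sparse matrix operation. The helper name `predict_batched` and its `stop_words` and `threshold` parameters are assumptions introduced here, not part of the original notebook. Its results may differ slightly from `predict` for candidates whose TF-IDF vector is all zeros, because `cosine_similarity` returns 0 in that case.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def predict_batched(json_data, stop_words="english", threshold=0.2):
    """Hypothetical batched variant of predict(): score all candidates at once."""
    candidates = json_data['long_answer_candidates']
    doc_tokenized = json_data['document_text'].split(' ')

    tfidf = TfidfVectorizer(ngram_range=(1, 1), stop_words=stop_words)
    tfidf.fit([json_data['document_text']])

    # Build every candidate text first, then vectorize them in one call
    texts = [' '.join(doc_tokenized[c['start_token']:c['end_token']])
             for c in candidates]
    q_tfidf = tfidf.transform([json_data['question_text']])
    c_tfidf = tfidf.transform(texts)

    # Inputs stay sparse; all-zero rows get a similarity of 0
    scores = cosine_similarity(q_tfidf, c_tfidf).ravel()

    best = int(np.argmax(scores))
    if scores[best] < threshold:
        return '-1:-1'
    return f"{candidates[best]['start_token']}:{candidates[best]['end_token']}"
```

This has not been timed against the original loop; it only illustrates how the per-candidate calls could be merged.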

In [0]:
%%time
ids, annotations, predictions = [], [], []
n_samples = 10000
with open('drive/My Drive/Colab Notebooks/150k.json', 'r') as json_file:
    cnt = 0
    for line in tqdm(json_file):
        json_data = json.loads(line)

        annotated_answer = str(json_data['annotations'][0]['long_answer']['start_token']) + ':' + \
            str(json_data['annotations'][0]['long_answer']['end_token'])
        
        predicted_answer = predict(json_data)
        
        ids.append(str(json_data['example_id']) + '_long')
        annotations.append(annotated_answer)
        predictions.append(predicted_answer)
        
        cnt += 1
        if cnt >= n_samples:
            break

# Generating Dataframe
df = pd.DataFrame()
df['example_id'] = ids
df['CorrectString'] = annotations
df['PredictionString'] = predictions

# Evaluating
f1 = f1_score(df['CorrectString'].values, df['PredictionString'].values, average='micro')
print(f'F1-score: {f1:.4f}')


F1-score: 0.1013
CPU times: user 17min 12s, sys: 6.95 s, total: 17min 19s
Wall time: 17min 17s


## 2) Using multiprocessing

In [0]:
import psutil
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Process, Manager

In [0]:
print(f" Logical CPU count: {psutil.cpu_count(logical=True)}")
print(f" Physical CPU count: {psutil.cpu_count(logical=False)}")

 Logical CPU count: 2
 Physical CPU count: 1


### Refactoring the processing function for multiprocessing: split the data into chunks, each handled by a separate process.
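As a minimal illustration of the chunking idea, the sketch below computes half-open row ranges `[start, stop)`, one per worker. The helper `chunk_bounds` is hypothetical and is not used by the notebook code. Unlike the fixed `chunk_size` in the `process` function of this section, it also distributes the remainder when the row count is not divisible by the number of workers.

```python
def chunk_bounds(n_rows, n_chunks):
    """Return a list of (start, stop) half-open row ranges, one per chunk."""
    base, extra = divmod(n_rows, n_chunks)
    bounds = []
    start = 0
    for k in range(n_chunks):
        # The first `extra` chunks take one additional row each
        stop = start + base + (1 if k < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


print(chunk_bounds(10000, 2))  # [(0, 5000), (5000, 10000)]
```

Half-open ranges avoid processing a boundary row twice, because each chunk ends exactly where the next one starts.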

In [0]:
def process(json_path, chunk_index, total_list):
    
    ids, annotations, predictions = [], [], []
    n_rows = 10000
    num_cores = 2
    chunk_size = int(n_rows/num_cores)  # number of rows for 1 chunk
    
    with open(json_path, 'r') as json_file:
        
        cnt = 0 + (chunk_index-1)*chunk_size # starting row
        start_row = cnt
        finish_row = chunk_size*chunk_index
        
        for i, line in enumerate(json_file):
           
            if i < start_row or i >= finish_row:
                continue
            
            json_data = json.loads(line)
            annotated_answer = str(json_data['annotations'][0]['long_answer']['start_token']) + ':' + \
                str(json_data['annotations'][0]['long_answer']['end_token'])

            predicted_answer = predict(json_data)

            ids.append(str(json_data['example_id']) + '_long')
            annotations.append(annotated_answer)
            predictions.append(predicted_answer)

            cnt += 1
            
            if cnt%(chunk_size/10) == 0 and cnt < (chunk_size+1):
                print(f"computing progress: {int(cnt*100/chunk_size)}%")
            
            if cnt >= finish_row:
                break

    chunk_dict = {}
    chunk_dict['example_id'] = ids
    chunk_dict['CorrectString'] = annotations
    chunk_dict['PredictionString'] = predictions
    total_list.append(chunk_dict)

In [0]:
sum_list = list()
def multiprocessed():
    cores = 2
    processes = []
    a = time.time()
    with Manager() as manager:
        sum_list = manager.list()  # <-- can be shared between processes.
        for i in range(0, cores):
            p = Process(target=process,args=(path, i+1, sum_list))
            processes.append(p)
        # Start the processes
        for p in processes:
            p.start()
        # Ensure all processes have finished execution
        for p in processes:
            p.join()
        
        sum_list = list(sum_list)
        b = time.time()
        print(f"the executing time using multiprocessing is: {round(b-a, 3)} sec")
        return sum_list

In [0]:
sum_list = multiprocessed()

computing progress: 10%
computing progress: 20%
computing progress: 30%
computing progress: 40%
computing progress: 50%
computing progress: 60%
computing progress: 70%
computing progress: 80%
computing progress: 90%
computing progress: 100%
the executing time using multiprocessing is: 783.324 sec


In [0]:
def creating_df(lst):
    # Build one DataFrame per chunk and concatenate them in a single call
    # (DataFrame.append was removed in pandas 2.0)
    chunks = [pd.DataFrame.from_dict(l) for l in lst]
    total_df = pd.concat(chunks, ignore_index=True)
    return total_df
total_df = creating_df(sum_list)    
f1 = f1_score(total_df['CorrectString'].values, total_df['PredictionString'].values, average='micro')
print(f'F1-score: {f1:.4f}')

F1-score: 0.1013


### Because Google Colab provides only 1 physical CPU per user, multiprocessing reduces computation time by only 24.5%. So let's run the same computation on a local machine (a laptop with 4 physical CPUs; see multiprocessing_local_machine.ipynb in the current repo). Spoiler: computation time is reduced by 3.3x.
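The 24.5% figure can be reproduced from the wall-clock times reported in the cell outputs (17 min 17 s for the sequential run, 783.324 s with two processes). The short calculation below is only a worked check; the variable names are chosen for illustration.

```python
# Wall-clock times taken from the cell outputs in this notebook
sequential_sec = 17 * 60 + 17      # 1037 s, single process
parallel_sec = 783.324             # two processes on Colab

reduction = 1 - parallel_sec / sequential_sec
speedup = sequential_sec / parallel_sec

print(f"time reduction: {reduction:.1%}")   # about 24.5%
print(f"speedup: {speedup:.2f}x")           # about 1.32x
```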

## 3) Using PyCuda

In [12]:
!pip install pycuda
!pip install scikit-cuda
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import pycuda.cumath as cumath
import skcuda.linalg as linalg
linalg.init()



### Refactor the most frequently used function, **cosine distance**, with pycuda/skcuda.


In [0]:
def cosine_distance_cuda(u, v):
  """Computes the cosine distance between two vectors u and v on the GPU"""
  
  u = spatial.distance._validate_vector(u)
  v = spatial.distance._validate_vector(v)
  
  u = u.astype(np.float32)
  v = v.astype(np.float32)
  u_gpu = gpuarray.to_gpu(u)
  v_gpu = gpuarray.to_gpu(v)
  
  uv_gpu = gpuarray.dot(u_gpu, v_gpu)
  u_gpu_mag = cumath.sqrt(gpuarray.dot(u_gpu, u_gpu))
  v_gpu_mag = cumath.sqrt(gpuarray.dot(v_gpu, v_gpu))
 
  dist = 1.0 - uv_gpu / (u_gpu_mag * v_gpu_mag)
 
  return dist.get().item()

### OK, let's replace the default SciPy **cosine distance** function with **cosine_distance_cuda** in the predict function.

In [0]:
def predict(json_data):
    # Parse JSON data
    candidates = json_data['long_answer_candidates']
    doc_tokenized = json_data['document_text'].split(' ')
    question = json_data['question_text']
    question_s = question.split(' ') 
    annotation = json_data['annotations'][0]

    # TFIDF for the document
    # Convert a collection of raw documents to a matrix of TF-IDF features.

    tfidf = TfidfVectorizer(ngram_range=(1,1), stop_words=stop_words)
    tfidf.fit([json_data['document_text']])  
    q_tfidf = tfidf.transform([question]).todense() 
    
    # Find the nearest answer among the candidates using cosine distance
    scores = []
    for i, c in enumerate(candidates):
        s, e = c['start_token'], c['end_token']
        t = ' '.join(doc_tokenized[s:e])
        t_tfidf = tfidf.transform([t]).todense()
        
        # SciPy's cosine distance is replaced with the CUDA version here
        score = 1 - cosine_distance_cuda(q_tfidf, t_tfidf)
        scores.append(score)

    # Pick the nearest candidate

    ans = (np.array(candidates)[np.argsort(scores)])[-1] # dict, top candidate
    
    if np.max(scores) < 0.2:
        ans_long = '-1:-1'
        ans = {'start_token': 0, 'end_token': 0}
    else:
        ans_long = str(ans['start_token']) + ':' + str(ans['end_token'])
              
    return ans_long

In [12]:
%%time
ids, annotations, predictions = [], [], []
n_samples = 10000
with open('drive/My Drive/Colab Notebooks/150k.json', 'r') as json_file:
    cnt = 0
    for line in tqdm(json_file):
        json_data = json.loads(line)

        annotated_answer = str(json_data['annotations'][0]['long_answer']['start_token']) + ':' + \
            str(json_data['annotations'][0]['long_answer']['end_token'])
        
        predicted_answer = predict(json_data)
        
        ids.append(str(json_data['example_id']) + '_long')
        annotations.append(annotated_answer)
        predictions.append(predicted_answer)
        
        cnt += 1
        if cnt >= n_samples:
            break

# Generating Dataframe
df = pd.DataFrame()
df['example_id'] = ids
df['CorrectString'] = annotations
df['PredictionString'] = predictions

# Evaluating
f1 = f1_score(df['CorrectString'].values, df['PredictionString'].values, average='micro')
print(f'F1-score: {f1:.4f}')


*** compiler output in /tmp/tmpc_u55t85
*** compiler output in /tmp/tmpr3zxs5u8
F1-score: 0.1009
CPU times: user 49min 53s, sys: 21.5 s, total: 50min 15s
Wall time: 50min 18s


### Text processing with PyCuda increases computation time by about 2.9x (50 min 18 s vs. 17 min 17 s wall time). What went wrong? Let's analyse our new cosine distance function.

### SciPy cosine distance function, from source:

In [0]:
def cosine_distance(u, v):
  """Computes the cosine distance between two vectors u and v, as in SciPy"""
  u = spatial.distance._validate_vector(u)
  v = spatial.distance._validate_vector(v)
  uv = np.average(u * v)
  uu = np.average(np.square(u))
  vv = np.average(np.square(v))
  dist = 1.0 - uv / np.sqrt(uu * vv)
  return dist

### Our PyCuda cosine distance function

In [0]:
def cosine_distance_cuda(u, v):
  """Computes the cosine distance between two vectors u and v on the GPU"""
  
  u = spatial.distance._validate_vector(u)
  v = spatial.distance._validate_vector(v)
  
  u = u.astype(np.float32)
  v = v.astype(np.float32)
  u_gpu = gpuarray.to_gpu(u)
  v_gpu = gpuarray.to_gpu(v)
  
  uv_gpu = gpuarray.dot(u_gpu, v_gpu)
  u_gpu_mag = cumath.sqrt(gpuarray.dot(u_gpu, u_gpu))
  v_gpu_mag = cumath.sqrt(gpuarray.dot(v_gpu, v_gpu))
 
  dist = 1.0 - uv_gpu / (u_gpu_mag * v_gpu_mag)
 
  return dist.get().item()

### Computing time depends on the dimension of the input vectors. Let's compare the performance of the two functions for vector lengths from 1 to 1e8.

In [0]:
vector_dims = [1, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8]
vector_dims = [int(x) for x in vector_dims]
freq_vector_dims = []
for i, v in enumerate(vector_dims):
  if i < len(vector_dims) -1:
    l = [x for x in range(v, vector_dims[i+1], 2*v )]
    freq_vector_dims += l
  else:
    freq_vector_dims.append(v)

In [0]:
def test_cosine(iter_num, dim, CUDA=False):
  start = time.time()
  for i in range(iter_num):
      u =  np.random.rand(1,dim).squeeze()
      v =  np.random.rand(1,dim).squeeze()
      if CUDA:
        d = cosine_distance_cuda(u, v)
      else:  
        d = cosine_distance(u, v)
  finish = time.time()
  proc_time = finish - start
  return proc_time

In [0]:
time_logs_cpu = []
time_logs_cuda = []
for dim in freq_vector_dims:
  proc_cpu = test_cosine(5, dim)
  proc_cuda = test_cosine(5, dim, CUDA=True)
  time_logs_cpu.append(proc_cpu)
  time_logs_cuda.append(proc_cuda)
time_logs_cpu = [x/5 for x in time_logs_cpu]
time_logs_cuda = [x/5 for x in time_logs_cuda]
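Once both timing lists are available, the crossover point can be read off programmatically rather than only from the plots. Below is a minimal sketch with a hypothetical helper `crossover_dim`; the numbers in the usage example are made up for illustration and are not measurements.

```python
def crossover_dim(dims, cpu_times, cuda_times):
    """Return the first dimension at which CUDA is faster than CPU, else None."""
    for dim, t_cpu, t_cuda in zip(dims, cpu_times, cuda_times):
        if t_cuda < t_cpu:
            return dim
    return None


# Made-up timings (seconds), for illustration only
example_dims = [1_000, 100_000, 1_000_000]
example_cpu = [0.0001, 0.002, 0.02]
example_cuda = [0.001, 0.003, 0.01]
print(crossover_dim(example_dims, example_cpu, example_cuda))  # 1000000
```

With the real measurements, the call would be `crossover_dim(freq_vector_dims, time_logs_cpu, time_logs_cuda)`.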

In [62]:
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Scatter(x=freq_vector_dims, y=time_logs_cpu,
                    mode='lines+markers',
                    name='CPU time'))
fig.add_trace(go.Scatter(x=freq_vector_dims, y=time_logs_cuda,
                    mode='lines+markers',
                    name='CUDA time'))
fig.update_layout(
    title={
        'text': "Comparison of cosine distance function computation time",
        'xanchor': 'center',
        'y':0.9,
        'x':0.5,
        'yanchor': 'top'},
    xaxis_title="Input vectors 1D dimension",
    yaxis_title="Execution time, sec",
    font=dict(
        family="Courier New, monospace",
        size=16,
        color="#7f7f7f"
    )
)
fig.show()

In [61]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=freq_vector_dims[10:29], y=time_logs_cpu[10:29],
                    mode='lines+markers',
                    name='CPU time'))
fig.add_trace(go.Scatter(x=freq_vector_dims[10:29], y=time_logs_cuda[10:29],
                    mode='lines+markers',
                    name='CUDA time'))
fig.update_layout(
    title={
        'text': "Comparison of cosine distance function computation time (zoomed)",
        'xanchor': 'center',
        'y':0.9,
        'x':0.5,
        'yanchor': 'top'},
    xaxis_title="Input vectors 1D dimension",
    yaxis_title="Execution time, sec",
    font=dict(
        family="Courier New, monospace",
        size=16,
        color="#7f7f7f"
    )
)
fig.show()

### As the plots show, the CUDA cosine function becomes faster than the default function only from a vector dimension of about 500k. In our task the input vectors have a dimension of about 1000, and the function is executed about 70 times per question. Therefore, our CUDA function makes text processing slower.
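One way to reduce the per-call overhead, on CPU or GPU, is to score all candidates of a question in one batched operation instead of about 70 separate calls. The sketch below shows the idea with NumPy only. `batched_cosine_similarity` is a hypothetical helper, and the random vectors merely mimic the sizes mentioned above. A GPU version would need the same matrix-vector pattern to amortize the host-to-device transfer; that variant is not implemented or benchmarked here.

```python
import numpy as np


def batched_cosine_similarity(q, C, eps=1e-12):
    """Cosine similarity between vector q (d,) and every row of C (n, d)."""
    q = np.asarray(q, dtype=np.float64).ravel()
    C = np.asarray(C, dtype=np.float64)
    dots = C @ q                                   # n dot products at once
    norms = np.linalg.norm(C, axis=1) * np.linalg.norm(q)
    # eps guards against division by zero for all-zero vectors
    return dots / np.maximum(norms, eps)


rng = np.random.default_rng(0)
q = rng.random(1000)          # question vector, about 1000 TF-IDF terms
C = rng.random((70, 1000))    # about 70 candidate vectors per question
sims = batched_cosine_similarity(q, C)
print(sims.shape)  # (70,)
```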

### In this tutorial we tried 3 methods of computing TF-IDF similarity for the Kaggle competition task. The most appropriate method is multiprocessing, while the PyCuda method was the slowest because of the small input vector dimensions and the expensive transfer from NumPy arrays to GPU arrays.