## 1. Prepare dataset

- Use gdown ([GitHub link](https://github.com/wkentaro/gdown)) to download the data archive from Google Drive.
- Description of the dataset:
  - The dataset is generated from GEO (a well-known semantic parsing dataset). The target is the relation between a natural-language sentence and its logic term.
  - It contains 5 columns: ['label', '#1 ID', '#2 ID', 'sentence1', 'sentence2']. Only 'sentence1' and 'sentence2' are used as input features, and 'label' is used as the label of each training/testing sample.
  - Size of the dataset: 20400 training samples, 3400 dev samples, and around 9500 private test samples.
- Note: **Uneditable** means that editing the marked code is not accepted.


In [None]:
!pip install gdown
!gdown https://drive.google.com/uc?id=1tyfX0kv6qvA14JgmCkI4xU2zu0eaIIma -O data.zip
!unzip ./data.zip

Downloading...
From: https://drive.google.com/uc?id=1tyfX0kv6qvA14JgmCkI4xU2zu0eaIIma
To: /content/data.zip
100% 160k/160k [00:00<00:00, 60.1MB/s]
Archive:  ./data.zip
   creating: geo-data-ml-class/
  inflating: __MACOSX/._geo-data-ml-class  
  inflating: geo-data-ml-class/.DS_Store  
  inflating: __MACOSX/geo-data-ml-class/._.DS_Store  
  inflating: geo-data-ml-class/dev.enfnamepair  
  inflating: geo-data-ml-class/train.enfnamepair  
   creating: geo-data-ml-class/.ipynb_checkpoints/
  inflating: geo-data-ml-class/train.vocab  


In [None]:
import pandas as pd
import pickle
from scipy.sparse import coo_matrix, hstack

train = pd.read_csv('geo-data-ml-class/train.enfnamepair', sep=',')
dev = pd.read_csv('geo-data-ml-class/dev.enfnamepair', sep=',')
# test = pd.read_csv('geo-data-ml-class/test.csv')  # private test set, not included in the archive

In [None]:
train.columns

Index(['label', '#1 ID', '#2 ID', 'sentence1', 'sentence2'], dtype='object')

In [None]:
train.iloc[3000:3100]

| | label | #1 ID | #2 ID | sentence1 | sentence2 |
|---|---|---|---|---|---|
| 3000 | 1 | next_to:t | 88 | next _ to : t | how mani state border s0 |
| 3001 | 0 | size:i | 88 | size : i | how mani state border s0 |
| 3002 | 1 | count | 88 | count | how mani state border s0 |
| 3003 | 0 | elevation:i | 88 | elevation : i | how mani state border s0 |
| 3004 | 0 | argmin | 88 | argmin | how mani state border s0 |
| ... | ... | ... | ... | ... | ... |
| 3095 | 1 | loc:t | 91 | loc : t | what river flow through s0 |
| 3096 | 0 | state:t | 91 | state : t | what river flow through s0 |
| 3097 | 1 | lambda | 91 | lambda | what river flow through s0 |
| 3098 | 0 | argmax | 91 | argmax | what river flow through s0 |


## 2. Preprocessing data

- Concatenate the two columns "sentence1" and "sentence2" into a single document.
- Use TF-IDF ([TF-IDF documentation](https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting)) to convert each document into a vector. The vectorizer is fitted on the training data and then applied to both the training and development data.
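As a minimal sketch of what this step produces, the snippet below fits a `TfidfVectorizer` with default settings on three concatenated documents taken from the sample rows shown earlier. The names `docs`, `toy_vectorizer`, `vocab`, and `toy_matrix` are illustrative only. One thing worth noticing is that the default token pattern keeps only tokens of two or more word characters, so the separators `:` and `_` and the single-letter type suffixes `t` and `i` do not end up in the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents in the same "sentence1 + sentence2" form used in this notebook.
docs = [
    "loc : t what river flow through s0",
    "next _ to : t how mani state border s0",
    "size : i how mani state border s0",
]

# Fit with default settings, as the notebook does on the full training set.
toy_vectorizer = TfidfVectorizer().fit(docs)

# The learned vocabulary: single-character tokens and punctuation are dropped.
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)

# Each document becomes one sparse row with one column per vocabulary term.
toy_matrix = toy_vectorizer.transform(docs)
print(toy_matrix.shape)  # (number of documents, vocabulary size)
```

If the type suffixes matter for the task, passing a custom `token_pattern` to `TfidfVectorizer` is one possible way to keep them; whether that helps on this dataset is not tested here.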

In [None]:

import re
def func_name_norm(f_name):
    f_name = f_name.strip().replace(":", " : ").replace("_", " _ ")
    f_name = re.sub(r' {2,}', ' ', f_name)
    return f_name

train_text_concatination = train[['sentence1', 'sentence2']].agg(' '.join, axis=1)
train_text_concatination.head()

0          and where is c0
1      loc : t where is c0
2    state : t where is c0
3       lambda where is c0
4       argmax where is c0
dtype: object

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer().fit(train_text_concatination)

In [None]:
def data_frame_to_vector2(df):
    text_concatination = df[['sentence1', 'sentence2']].agg(' '.join, axis=1)
    return vectorizer.transform(text_concatination)
        

In [None]:
train_data_vector = data_frame_to_vector2(train)
dev_data_vector = data_frame_to_vector2(dev)

## 3. Build model and train it (Editable)

- Build the model yourself (using the sklearn library or an external library is accepted).
- Train and tune your model to get the best performance on the development set.
- Finally, save your model's predictions in the variable **pred** (a list of int values, 0 or 1) for evaluation in the next step.
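One possible way to tune a model against the development set is to loop over a few hyperparameter values and keep the one with the best F1 score on the positive class. The sketch below is only an illustration under stated assumptions: it uses a handful of rows copied from the sample output as hypothetical stand-ins for the real train/dev frames, and it swaps in `LogisticRegression` (with `class_weight="balanced"`, since label 1 is the minority class) purely because it trains quickly. The candidate values of `C` are arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Hypothetical toy split; the real notebook uses the full train/dev frames.
train_docs = [
    "next _ to : t how mani state border s0",
    "size : i how mani state border s0",
    "count how mani state border s0",
    "elevation : i how mani state border s0",
    "argmin how mani state border s0",
    "loc : t what river flow through s0",
]
y_train = [1, 0, 1, 0, 0, 1]
dev_docs = [
    "state : t what river flow through s0",
    "lambda what river flow through s0",
    "argmax what river flow through s0",
]
y_dev = [0, 1, 0]

# Fit TF-IDF on the training documents only, then transform both splits.
vec = TfidfVectorizer().fit(train_docs)
X_train, X_dev = vec.transform(train_docs), vec.transform(dev_docs)

# Try a few regularization strengths and keep the best dev-set F1.
candidates = [0.1, 1.0, 10.0]
best_C, best_f1 = None, -1.0
for C in candidates:
    model = LogisticRegression(C=C, class_weight="balanced", max_iter=1000)
    model.fit(X_train, y_train)
    score = f1_score(y_dev, model.predict(X_dev), pos_label=1, zero_division=0)
    if score > best_f1:
        best_C, best_f1 = C, score

print(best_C, best_f1)
```

The same loop structure applies to any estimator with `fit` and `predict`; only the candidate grid changes.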

In [None]:
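from sklearn.neural_network import MLPClassifier

# Build the model: an MLP with two hidden layers of 100 units each.
# Note: with solver='lbfgs', the options that only apply to 'sgd'/'adam'
# (batch_size, beta_1, beta_2, epsilon, early_stopping, validation_fraction,
# learning_rate, learning_rate_init, momentum, nesterovs_momentum, power_t,
# shuffle) have no effect; they are kept here unchanged.
cls = MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=True, epsilon=1e-08,
       hidden_layer_sizes=(100, 100), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=1, shuffle=True,
       solver='lbfgs', tol=0.001, validation_fraction=0.1, verbose=True,
       warm_start=False)

# Train on the TF-IDF vectors, then predict labels for the development set.
cls.fit(train_data_vector.toarray(), train['label'])
pred = cls.predict(dev_data_vector.toarray())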


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


In [None]:
pred[:10] # show first 10 values of the prediction 

array([0, 1, 0, 1, 0, 0, 0, 0, 0, 0])

## 4. Evaluate your model prediction (**Uneditable**)

- The cell below compares **pred** with the development-set labels and reports precision, recall, and F1 score.
- Note: **Uneditable** means that editing this code block is not accepted.

In [None]:
from sklearn.metrics import *
print(classification_report(dev['label'], pred))
precision_score(dev['label'], pred), recall_score(dev['label'], pred), f1_score(dev['label'], pred, pos_label=1)

              precision    recall  f1-score   support

           0       0.99      0.98      0.99      3097
           1       0.85      0.90      0.87       303

    accuracy                           0.98      3400
   macro avg       0.92      0.94      0.93      3400
weighted avg       0.98      0.98      0.98      3400



(0.8478260869565217, 0.900990099009901, 0.8735999999999999)
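The three scores in the tuple above can be reproduced by hand from confusion counts. The counts below are not printed by the notebook; they are inferred from the report (support of 303 for label 1, recall 273/303, precision 273/322) and should be read as an assumption consistent with those numbers.

```python
# Confusion counts for the positive class (label 1), inferred from the report
# above rather than printed by the notebook.
tp, fp, fn = 273, 49, 30

# Precision: share of predicted positives that are correct.
precision = tp / (tp + fp)
# Recall: share of actual positives that were found.
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"{precision:.4f} {recall:.4f} {f1:.4f}")  # 0.8478 0.9010 0.8736
```

Because only 303 of the 3400 dev samples are positive, the 0.98 accuracy is dominated by label 0, and the F1 score on label 1 is the more informative number here.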