<a href="https://colab.research.google.com/github/nahbos/Advanced-Information-Retrieval/blob/main/Ex01/traditional_methods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Sobhan Moradian Daghigh

- 11-22-2022

### Ex-01: Traditional methods (WarmUp)

In [239]:
import numpy as np
import pandas as pd
import scipy
from gensim.models import TfidfModel
from gensim.corpora import Dictionary
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import remove_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import random
import pickle

In [82]:
!wget -nc https://raw.githubusercontent.com/nahbos/Advanced-Information-Retrieval/main/Ex01/Data/train_data.csv
!wget -nc https://raw.githubusercontent.com/nahbos/Advanced-Information-Retrieval/main/Ex01/Data/valid_data.csv
!wget -nc https://raw.githubusercontent.com/nahbos/Advanced-Information-Retrieval/main/Ex01/Data/test_data.csv

File ‘train_data.csv’ already there; not retrieving.

File ‘valid_data.csv’ already there; not retrieving.

File ‘test_data.csv’ already there; not retrieving.



# Part One. 
* Data Loading

In [83]:
train = pd.read_csv('./train_data.csv')
val   = pd.read_csv('./valid_data.csv')
test  = pd.read_csv('./test_data.csv')

In [84]:
train.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,14,29,30,What are the laws to change your status from a...,What are the laws to change your status from a...,0
1,18,37,38,Why are so many Quora users posting questions ...,Why do people ask Quora questions which can be...,1
2,38,77,78,How do we prepare for UPSC?,How do I prepare for civil service?,1
3,58,117,118,I was suddenly logged off Gmail. I can't remem...,I can't remember my Gmail password or my recov...,1
4,60,121,122,How do I download content from a kickass torre...,Is Kickass Torrents trustworthy?,0


In [85]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37250 entries, 0 to 37249
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            37250 non-null  int64 
 1   qid1          37250 non-null  int64 
 2   qid2          37250 non-null  int64 
 3   question1     37250 non-null  object
 4   question2     37250 non-null  object
 5   is_duplicate  37250 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 1.7+ MB


In [86]:
val.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1010 entries, 0 to 1009
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            1010 non-null   int64 
 1   qid1          1010 non-null   int64 
 2   qid2          1010 non-null   int64 
 3   question1     1010 non-null   object
 4   question2     1010 non-null   object
 5   is_duplicate  1010 non-null   int64 
dtypes: int64(4), object(2)
memory usage: 47.5+ KB


In [87]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 980 entries, 0 to 979
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            980 non-null    int64 
 1   qid1          980 non-null    int64 
 2   qid2          980 non-null    int64 
 3   question1     980 non-null    object
 4   question2     980 non-null    object
 5   is_duplicate  980 non-null    int64 
dtypes: int64(4), object(2)
memory usage: 46.1+ KB


**Ok, Everything looks right ))**

# Part Two.
* Vector Space Retrieval

In [235]:
dataset = train

# Dictionary length
#     - with    stop words: 9335
#     - without stop words: 9284

tokenized_qs = [simple_preprocess(remove_stopwords(q)) for q in dataset.loc[:, 'question2']]
dct = Dictionary(tokenized_qs)  # fit dictionary
corpus = [dct.doc2bow(tokenized_q) for tokenized_q in tokenized_qs]  # convert corpus to BoW format

In [237]:
model = TfidfModel(corpus)        # fit model
tfidf_vector = model[corpus]      # apply model to the all corpus document

for question in tfidf_vector[:20]:
   print([[dct[id], round(freq, 2)] for id, freq in question])

[['card', 0.23], ['change', 0.22], ['compare', 0.15], ['green', 0.24], ['how', 0.05], ['immigration', 0.37], ['japan', 0.28], ['laws', 0.57], ['status', 0.29], ['student', 0.25], ['us', 0.2], ['visa', 0.29], ['what', 0.04]]
[['answered', 0.5], ['ask', 0.37], ['easily', 0.41], ['google', 0.4], ['people', 0.28], ['questions', 0.32], ['quora', 0.25], ['why', 0.21]]
[['how', 0.1], ['civil', 0.63], ['prepare', 0.5], ['service', 0.59]]
[['how', 0.08], ['can', 0.23], ['email', 0.33], ['gmail', 0.32], ['mail', 0.42], ['password', 0.31], ['recover', 0.37], ['recovery', 0.38], ['remember', 0.41]]
[['is', 0.18], ['kickass', 0.54], ['torrents', 0.49], ['trustworthy', 0.66]]
[['how', 0.08], ['bad', 0.42], ['book', 0.36], ['new', 0.26], ['rowling', 0.78]]
[['how', 0.12], ['english', 0.36], ['fluently', 0.62], ['learn', 0.42], ['speak', 0.53]]
[['what', 0.18], ['about', 0.48], ['actually', 0.47], ['life', 0.59], ['purpose', 0.41]]
[['compare', 0.2], ['what', 0.06], ['cambodia', 0.29], ['earthquake', 

**Since the Gensim dosent support for max_features, so for rest of the implementation, Im gonna use sklearn instead.**


In [201]:
tr_vectorizer = TfidfVectorizer(max_features=2000, stop_words='english')
tfidf_matrix_train = tr_vectorizer.fit_transform(train.loc[:, 'question2'])
print(tfidf_matrix_train.shape)

(37250, 2000)


In [202]:
ts_vectorizer = TfidfVectorizer(vocabulary=tr_vectorizer.vocabulary_, stop_words='english')
tfidf_matrix_test = ts_vectorizer.fit_transform(test.loc[:, 'question1'])
print(tfidf_matrix_test.shape)

(980, 2000)


In [205]:
similarity = cosine_similarity(tfidf_matrix_test, tfidf_matrix_train)
similarity.shape

(980, 37250)

In [258]:
def get_similar_questions(test_data, train_data, similarity_matrix, n_sim=10, samples=None):
  for i, test_q in enumerate(similarity_matrix):
    if samples is None or i in samples:
      check_duplicated_qs = []
      bests = np.argsort(test_q.tolist())[::-1]
      print('\n-', test_data.loc[i, 'question1'])
      for best in bests:
        q = train_data.loc[best, 'question2']
        if q not in check_duplicated_qs:
          print('  > ', q)
          check_duplicated_qs.append(q)
          if len(check_duplicated_qs) == n_sim:
            break

In [260]:
random.seed(2)
random_tests = random.sample(range(0, len(test)), 5)
get_similar_questions(test, train, similarity, samples=random_tests)


- Who will win the election in united states?
  >  Who will win the 2016 United States Presidential election: Trump or Clinton?
  >  Who is the coolest First Lady of the United States?
  >  Who will win the US election?
  >  Who will win Uttar Pradesh election?
  >  Who should be the next President of The United States?
  >  Who do you think will win the U.S. election in November?
  >  Who will win the US election in 2016?
  >  Does the President of the United States have a food taster?
  >  Who will win 2017 Uttar Pradesh Election and why?
  >  Who will win up 2017 election?

- How do I recover my Gmail account when it does not open after password reset?
  >  How can I recover my Gmail account's password?
  >  How do you recover your gmail account password?
  >  How can I reset the password for my Gmail account?
  >  How do I reset my Gmail account password?
  >  How do I recover my Gmail password?
  >  How Can You Recover Your Gmail Password?
  >  How can you recover your Gmail pass

# Part Three.
* Language Model Retrieval

# Part Four.
* Evaluation Metrics

### Finito