Link: https://soundcloud.com/applied-ai-course/quora-question-pair

When to use which performance metric
1. If we want class probabilities: log loss.
2. If classes are balanced: accuracy.
3. If classes are imbalanced and false positives are the costlier error (a predicted positive should really be positive): precision, TP / (TP + FP).
4. If false negatives are the costlier error (we want to find as many of the actual positives as possible): recall, TP / (TP + FN).
5. If precision and recall matter equally: F1 score, the harmonic mean of the two.
6. If we care about both classes (true negatives and true positives): ROC AUC.
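
As a rough illustration of the list above, the sketch below computes each threshold-based metric from confusion-matrix counts and log loss from predicted probabilities. The labels, probabilities and the 0.5 threshold are made-up assumptions, not data from the Quora problem.

```python
import numpy as np

# Toy, made-up labels and predicted probabilities of the positive class.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.8, 0.6, 0.1, 0.7, 0.3])
y_pred = (y_prob >= 0.5).astype(int)  # the 0.5 threshold is an assumption

# Confusion-matrix counts.
tp = int(np.sum((y_pred == 1) & (y_true == 1)))
fp = int(np.sum((y_pred == 1) & (y_true == 0)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))
tn = int(np.sum((y_pred == 0) & (y_true == 0)))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # penalises false positives
recall = tp / (tp + fn)      # penalises false negatives
f1 = 2 * precision * recall / (precision + recall)

# Binary log loss works on the probabilities, not on the thresholded labels.
p = np.clip(y_prob, 1e-15, 1 - 1e-15)
log_loss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f} log_loss={log_loss:.3f}")
```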

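In sklearn, "roc_auc_score" returns the area under the ROC curve directly from the true labels and the predicted scores. The more general "auc" function returns the area under any curve given its x and y coordinates. If you pass the fpr and tpr returned by "roc_curve" to "auc", it gives the same ROC AUC score.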
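
A minimal sketch of the two routes to the same number, assuming scikit-learn is installed. The labels and scores are made up for illustration.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy, made-up labels and scores (higher score = more likely positive).
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# Route 1: build the ROC curve, then integrate it with the generic auc().
fpr, tpr, thresholds = roc_curve(y_true, y_score)
area_from_curve = auc(fpr, tpr)

# Route 2: ask for the ROC AUC directly.
area_direct = roc_auc_score(y_true, y_score)

print(area_from_curve, area_direct)
```

Both values agree because "auc" applies the trapezoidal rule to exactly the points that make up the ROC curve.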

FPR = FP / (FP + TN), so the lower the FPR, the more negative points are classified correctly.

If a timestamp were one of the features, we would have used temporal splitting.

![](https://user-images.githubusercontent.com/63338657/178470949-c07d9cf4-75e9-4a61-8267-1b19efd60198.png)

![](https://user-images.githubusercontent.com/63338657/178471227-79e240b8-bd7f-48a3-8381-e3056e03fd4b.png)

As there is no timestamp, we opt for random splitting.

![](https://user-images.githubusercontent.com/63338657/178472279-f706f265-7eca-4c09-bee0-cb54a1b4ebf4.png)
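
The difference between the two splitting strategies can be sketched as follows. The toy frame, its column names ("timestamp", "feature", "label") and the 70/30 ratio are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real dataset has no timestamp column.
df_toy = pd.DataFrame({
    "timestamp": pd.date_range("2020-01-01", periods=10, freq="D"),
    "feature": np.arange(10),
    "label": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})
cut = int(0.7 * len(df_toy))

# Random split: shuffle the rows, then cut at 70%.
shuffled = df_toy.sample(frac=1.0, random_state=42)
train_random, test_random = shuffled.iloc[:cut], shuffled.iloc[cut:]

# Temporal split: sort by time, train on the oldest 70%, test on the newest 30%.
ordered = df_toy.sort_values("timestamp")
train_time, test_time = ordered.iloc[:cut], ordered.iloc[cut:]

print(len(train_random), len(test_random))
print(train_time["timestamp"].max(), test_time["timestamp"].min())
```

With a temporal split every test row is newer than every training row, which mimics predicting on future data. A random split gives no such guarantee.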

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.DataFrame(data={1: [0, 1, 2, np.nan, np.nan, 5, 6],
                        2: [1, np.nan, np.nan, 4, 5, 6, 7]})
display(df)

     1    2
0  0.0  1.0
1  1.0  NaN
2  2.0  NaN
3  NaN  4.0
4  NaN  5.0
5  5.0  6.0
6  6.0  7.0


In [3]:
display(df[df.isnull().any(axis=1)])

     1    2
1  1.0  NaN
2  2.0  NaN
3  NaN  4.0
4  NaN  5.0


In [4]:
df = df.fillna(value='')
display(df)

     1    2
0  0.0  1.0
1  1.0
2  2.0
3       4.0
4       5.0
5  5.0  6.0
6  6.0  7.0


In [5]:
print(df[df.isnull().any(axis=1)])

Empty DataFrame
Columns: [1, 2]
Index: []


In [6]:
all_ = pd.Series(df[1].to_list() + df[2].to_list())
display(all_)

0     0.0
1     1.0
2     2.0
3        
4        
5     5.0
6     6.0
7     1.0
8        
9        
10    4.0
11    5.0
12    6.0
13    7.0
dtype: object

In [7]:
freq = all_.value_counts().to_frame()
display(freq)

     0
     4
1.0  2
5.0  2
6.0  2
0.0  1
2.0  1
4.0  1
7.0  1


In [8]:
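# Distinct values seen more than once (the '' placeholder counts too).
# iloc avoids relying on the column name (0 in older pandas, "count" in 2.x).
rep = int((freq.iloc[:, 0] > 1).sum())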
display(rep)

4
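
Note that the count of 4 includes the empty string that replaced the missing values. A possible alternative, sketched here on the same toy data under the assumption that missing values should not count as a repeated item, is to skip the fillna step and let value_counts drop the NaNs (its default behaviour). The names df_nan, counts and rep_no_nan are illustrative.

```python
import numpy as np
import pandas as pd

# Same toy frame as above, with the NaNs left in place.
df_nan = pd.DataFrame(data={1: [0, 1, 2, np.nan, np.nan, 5, 6],
                            2: [1, np.nan, np.nan, 4, 5, 6, 7]})

# Stack both columns into one Series; value_counts() ignores NaN by default.
counts = pd.concat([df_nan[1], df_nan[2]]).value_counts()

# Number of distinct values that occur more than once.
rep_no_nan = int((counts > 1).sum())
print(rep_no_nan)  # 3 (the values 1.0, 5.0 and 6.0)
```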

![](https://user-images.githubusercontent.com/63338657/178702396-c534ffd4-0c5f-44bd-b040-304fb2baab89.png)

Link: https://towardsdatascience.com/how-to-apply-continual-learning-to-your-machine-learning-models-4754adcd7f7f

Common words: the number of words that appear in both questions.
Similar words: the number of words in the two questions that are semantically similar, i.e. have the same or a closely related meaning.

Ex:

Q1: I'm very smart.<br>Q2: I'm very handsome.

Here, "I'm" and "very" are common words. "smart" and "handsome" are similar words.
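
A minimal sketch of the common-words count, assuming lowercasing and whitespace tokenisation. The helper name common_word_count is hypothetical. Counting similar words would need a notion of semantic similarity (for example word embeddings) and is not attempted here.

```python
def common_word_count(q1: str, q2: str) -> int:
    """Number of distinct lowercase tokens shared by both questions."""
    words1 = set(q1.lower().split())
    words2 = set(q2.lower().split())
    return len(words1 & words2)


q1 = "I'm very smart."
q2 = "I'm very handsome."
print(common_word_count(q1, q2))  # 2 ("i'm" and "very")
```

Punctuation stays attached to the tokens ("smart." and "handsome."), so a real pipeline would probably strip it first.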

In [9]:
arr1 = np.array([1, 2, 3, 4])
print(arr1, arr1.shape)

arr2 = np.array([5, 6, 7, 8])
print(arr2, arr2.shape)

[1 2 3 4] (4,)
[5 6 7 8] (4,)


In [10]:
arrh = np.hstack(tup=(arr1, arr2))
print(arrh, arrh.shape)

arrv = np.vstack(tup=(arr1, arr2))
print(arrv, arrv.shape)

[1 2 3 4 5 6 7 8] (8,)
[[1 2 3 4]
 [5 6 7 8]] (2, 4)


In [11]:
arrd = np.dstack(tup=(arr1, arr2, arr1, arr2))
print(arrd, arrd.shape)

arrdf = arrd.flatten()
print(arrdf, arrdf.shape)

[[[1 5 1 5]
  [2 6 2 6]
  [3 7 3 7]
  [4 8 4 8]]] (1, 4, 4)
[1 5 1 5 2 6 2 6 3 7 3 7 4 8 4 8] (16,)


In [12]:
arr1r = arr1.reshape((-1, 1))
print(arr1r, arr1r.shape)

arr2r = arr2.reshape((-1, 1))
print(arr2r, arr2r.shape)

[[1]
 [2]
 [3]
 [4]] (4, 1)
[[5]
 [6]
 [7]
 [8]] (4, 1)


In [13]:
arrhr = np.hstack(tup=(arr1r, arr2r))
print(arrhr, arrhr.shape)

arrvr = np.vstack(tup=(arr1r, arr2r))
print(arrvr, arrvr.shape)

[[1 5]
 [2 6]
 [3 7]
 [4 8]] (4, 2)
[[1]
 [2]
 [3]
 [4]
 [5]
 [6]
 [7]
 [8]] (8, 1)


In [14]:
arrdr = np.dstack(tup=(arr1r, arr2r, arr1r, arr2r))
print(arrdr, arrdr.shape)

arrdrf = arrdr.flatten()
print(arrdrf, arrdrf.shape)

[[[1 5 1 5]]

 [[2 6 2 6]]

 [[3 7 3 7]]

 [[4 8 4 8]]] (4, 1, 4)
[1 5 1 5 2 6 2 6 3 7 3 7 4 8 4 8] (16,)


![](https://user-images.githubusercontent.com/63338657/178907103-792489b0-6775-4fbd-9743-49ba090d00b8.png)

![](https://user-images.githubusercontent.com/63338657/178907258-f5d8b747-8a57-4f85-a695-1a92fc262238.png)

![](https://user-images.githubusercontent.com/63338657/178909226-6e56bc9a-3d01-4ad0-8a1d-47e30c148ca5.png)

![](https://user-images.githubusercontent.com/63338657/178946909-1da271cb-37cc-411a-8d03-35e15a1fb2e1.png)

In [21]:
print(df.index)

RangeIndex(start=0, stop=7, step=1)


In [41]:
rp = np.random.rand(1, 2)
print(rp)

[[0.0029329  0.63523934]]


In [42]:
rp_prob = rp / np.sum(rp)
rp_prob

array([[0.00459579, 0.99540421]])
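
The two cells above turn a pair of random numbers into a valid probability distribution over two classes. One possible use, assumed here rather than stated in these notes, is a random-model baseline for log loss: assign random class probabilities to every point and see what log loss that produces. The labels below are made up, and the names n, probs and baseline_log_loss are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Made-up binary labels standing in for the real targets.
y_true = rng.integers(0, 2, size=n)

# One random pair per point, normalised so each row sums to 1.
raw = rng.random((n, 2))
probs = raw / raw.sum(axis=1, keepdims=True)

# Log loss: negative mean log-probability assigned to the true class.
p_true = probs[np.arange(n), y_true]
baseline_log_loss = -np.mean(np.log(np.clip(p_true, 1e-15, 1.0)))

print(baseline_log_loss)
```

Any trained model should reach a log loss well below this kind of baseline.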

In [43]:
alpha = [10 ** x for x in range(-5, 2)]
print(alpha)

[1e-05, 0.0001, 0.001, 0.01, 0.1, 1, 10]
