## Quora Insincere Questions Classification
#### Aleix Casellas Comas, Rubén Barco Terrones, Andreu Masdeu Ninot, Pablo Lázaro Terrones, Marco Gani Remane

### Libraries

In [1]:
import pandas as pd
import numpy as np
import sklearn
import re
import multiprocessing
import gensim.models.word2vec as w2v
import os

from sklearn import model_selection



### ETL
#### Split data into train and test

In [2]:
dir_data = 'C:/Users/ruben/Documents/Máster Data Science/2º Cuatrimestre/Natural Languaje Processing/ML_for_NLP-master/project_1/quora/'
train_data = pd.read_csv(dir_data+'train.csv')
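# Drop three degenerate questions by position. The index is reset after each
# drop, so every position refers to the frame as it stands at that point.
train_data = train_data.drop(train_data.index[420816])  # question text: '"'
train_data = train_data.reset_index(drop=True)
train_data = train_data.drop(train_data.index[792938])  # question text: 'Do '
train_data = train_data.reset_index(drop=True)
train_data = train_data.drop(train_data.index[995255])  # question text: 'W'
train_data = train_data.reset_index(drop=True)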

train_data

| | qid | question_text | target |
|---|---|---|---|
| 0 | 00002165364db923c7e6 | How did Quebec nationalists see their province... | 0 |
| 1 | 000032939017120e6e44 | Do you have an adopted dog, how would you enco... | 0 |
| 2 | 0000412ca6e4628ce2cf | Why does velocity affect time? Does velocity a... | 0 |
| 3 | 000042bf85aa498cd78e | How did Otto von Guericke used the Magdeburg h... | 0 |
| 4 | 0000455dfa3e01eae3af | Can I convert montra helicon D to a mountain b... | 0 |
| 5 | 00004f9a462a357c33be | Is Gaza slowly becoming Auschwitz, Dachau or T... | 0 |
| 6 | 00005059a06ee19e11ad | Why does Quora automatically ban conservative ... | 0 |
| 7 | 0000559f875832745e2e | Is it crazy if I wash or wipe my groceries off... | 0 |
| 8 | 00005bd3426b2d0c8305 | Is there such a thing as dressing moderately, ... | 0 |
| 9 | 00006e6928c5df60eacb | Is it just me or have you ever been in this ph... | 0 |


In [3]:
X_train, X_test = model_selection.train_test_split(train_data, test_size=0.2, stratify=train_data['target'], random_state=123)

In [4]:
X_train.shape, X_test.shape

((1044895, 3), (261224, 3))
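
The two shapes follow from how the split sizes are rounded. As a small check, assuming scikit-learn rounds the test size up and gives the remainder to the training set (which matches the shapes printed above), the arithmetic can be reproduced as:

```python
import math

n_rows = 1306119        # rows left after the three drops (1044895 + 261224)
test_size = 0.2

n_test = math.ceil(n_rows * test_size)   # 261223.8 rounds up
n_train = n_rows - n_test

print(n_train, n_test)
```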

In [5]:
y_train =  X_train['target'].values
y_train.shape

(1044895,)

In [6]:
y_test = X_test['target'].values
y_test.shape

(261224,)

In [7]:
x_train = X_train['question_text'].values
x_train, len(x_train)

(array([ 'How is the writing style and structure in the novel "Germinal" by Émile Zola depicted?',
        'Is a debtor, a bonded labor? Why',
        'What are the best ways to develop leads?', ...,
        'Is the discount rate of buying one share of a stock equal to the discount rate of buying ten shares?',
        'What is the best way to get a personal loan in Kenya?',
        'Do you think a piloted airplane could fly under the Deception Pass Bridge?'], dtype=object),
 1044895)

In [8]:
x_test = X_test['question_text'].values
x_test, len(x_test)

(array(['What is the minimum salary required for American Express Card?',
        'Can you make French fries only out of russet potatoes?',
        'How is the mark vs relative grade at NITC? What would be the pass mark for maths 1 usually? No one has answered this type of question on Quora . How much marks required for each grade?',
        ...,
        'What is the maximum size Transmission/Front sprocket that can be used for a Bajaj Avenger 220?',
        'What should I do if I have a small penis at 15-years-old?',
        'In which direction does spiders make its web?'], dtype=object),
 261224)

In [9]:
def sentence_to_wordlist(raw):
    clean = raw  # punctuation stripping disabled: re.sub("[^a-zA-Z]", " ", raw)
    clean = clean.lower()
    words = clean.split()
    return words
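
With the regular expression commented out, punctuation stays attached to the tokens, so `depicted?` and `depicted` become distinct vocabulary entries. A minimal sketch of one alternative, assuming a hypothetical helper `sentence_to_wordlist_clean` that keeps only Unicode word characters (it is not part of the notebook):

```python
import re

def sentence_to_wordlist_clean(raw):
    """Hypothetical variant: lowercase, then keep runs of word characters."""
    # \w matches Unicode letters, digits and underscore, so accented
    # characters such as 'é' survive while quotes and '?' are dropped.
    return re.findall(r"\w+", raw.lower())

print(sentence_to_wordlist_clean('The novel "Germinal" by Émile Zola depicted?'))
# ['the', 'novel', 'germinal', 'by', 'émile', 'zola', 'depicted']
```

Unlike `[^a-zA-Z]`, this pattern does not erase non-Latin text, which matters for the few non-English questions in the data.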

In [10]:
sentences = []
for raw_sentence in x_train:
    if len(raw_sentence) > 0:
        sentences.append(sentence_to_wordlist(raw_sentence))
for raw_sentence in x_test:
    if len(raw_sentence) > 0:
        sentences.append(sentence_to_wordlist(raw_sentence))

In [11]:
len(sentences), len(x_train)+len(x_test)

(1306119, 1306119)

In [12]:
sentences[0]

['how',
 'is',
 'the',
 'writing',
 'style',
 'and',
 'structure',
 'in',
 'the',
 'novel',
 '"germinal"',
 'by',
 'émile',
 'zola',
 'depicted?']

In [13]:
token_count = sum([len(sentence) for sentence in sentences])
print("The Quora corpus contains {0:,} tokens".format(token_count))

The Quora corpus contains 16,723,073 tokens


## Word2vec - 10 epochs / 300 features
### Train word2vec

In [14]:
num_features = 300
num_epochs   = 10

# Minimum word count threshold.
min_word_count = 0

# Number of threads to run in parallel.
num_workers = multiprocessing.cpu_count()

# Context window length.
context_size = 5

# Downsampling threshold for frequent words.
# gensim documents (0, 1e-5) as the useful range.
downsampling = 1e-3

seed = 1

#optional Training algorithm: 1 for skip-gram; otherwise CBOW
sg = 1

In [15]:
word2vec = w2v.Word2Vec(
    sg=sg,
    seed=seed,
    workers=num_workers,
    size=num_features,
    min_count=min_word_count,
    window=context_size,
    sample=downsampling)

In [16]:
word2vec.build_vocab(sentences, keep_raw_vocab=True)

In [17]:
word2vec.corpus_count

1306119

In [18]:
len(word2vec.vocabulary.raw_vocab)

450473

In [19]:
total_examples = len(sentences)

In [20]:
%%time
word2vec.train(sentences,
               epochs = num_epochs,
               total_examples=total_examples)

Wall time: 8min 20s


(124296125, 167230730)
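
The tuple returned by `train` is, per the gensim documentation, the effective number of words trained on and the raw number of words seen. The second value can be checked against the token count printed earlier. The gap between the two is plausibly the effect of downsampling frequent words, which is an interpretation rather than something the notebook verifies:

```python
token_count = 16723073      # tokens in the corpus, from the earlier cell
num_epochs = 10

effective_words, raw_words = 124296125, 167230730  # values returned by train()

# The raw count is the corpus size times the number of passes over it.
assert raw_words == token_count * num_epochs

# Share of raw words that actually contributed to training.
ratio = effective_words / raw_words
print(round(ratio, 3))
```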

In [21]:
foldername = "./saved_models/" + "w2v_" + str(num_features) +"features_"+str(num_epochs)+"epochs"
modelname  = "word2vec_" + str(num_features) +"features" + str(num_epochs)+"epochs.w2v"

if not os.path.exists(foldername):
    os.makedirs(foldername)
    word2vec.save(os.path.join(foldername, modelname))
else:
    print("folder {} already exists".format(foldername))
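
Note that the cell above only saves the model when the folder does not exist yet, so a retrained model is not written if the folder is already there. A possible alternative, sketched with a hypothetical helper `model_path` and `os.makedirs(..., exist_ok=True)`, builds the same names and leaves the save unconditional:

```python
import os

def model_path(num_features, num_epochs, root="./saved_models"):
    """Hypothetical helper: build the folder and file path used above."""
    folder = os.path.join(root, "w2v_{}features_{}epochs".format(num_features, num_epochs))
    filename = "word2vec_{}features{}epochs.w2v".format(num_features, num_epochs)
    return folder, os.path.join(folder, filename)

folder, path = model_path(300, 10)
print(path)
# To save, one would create the folder idempotently and then write the model:
#   os.makedirs(folder, exist_ok=True)
#   word2vec.save(path)
```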

### Predicting from a Word2vec averaged representation

Also try the other method, which concatenated this vector with the word vector. I think it was in notebook 2.

In [21]:
def doc_to_vec(sentence, word2vec):
    word_list    = sentence_to_wordlist(sentence)
    word_vectors = []
    for w in word_list:
        word_vectors.append(word2vec.wv.get_vector(w))

    return np.mean(word_vectors,axis=0)
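
`np.mean` over an empty list returns NaN (with a runtime warning), and `get_vector` raises `KeyError` for a word outside the vocabulary. Neither occurs here as long as `min_count` is 0 and every question yields at least one token, but a more defensive variant is easy to sketch. The version below is an illustration only: `doc_to_vec_safe` is a hypothetical name, and a plain dict stands in for the trained model's `wv` lookup so the example runs on its own.

```python
import numpy as np

def doc_to_vec_safe(tokens, vectors, num_features):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(num_features)
    return np.mean(known, axis=0)

# Toy stand-in for word2vec.wv (assumption: 3-dimensional vectors).
toy_vectors = {
    "what": np.array([1.0, 0.0, 2.0]),
    "is": np.array([3.0, 2.0, 0.0]),
}

print(doc_to_vec_safe(["what", "is", "rpm?"], toy_vectors, 3))  # [2. 1. 1.]
print(doc_to_vec_safe([], toy_vectors, 3))                      # [0. 0. 0.]
```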

In [22]:
X_tr = np.zeros((len(x_train), num_features))
y_tr = y_train
n_samples = X_tr.shape[0]

for i in range(n_samples):
    X_tr[i,:] = doc_to_vec(x_train[i], word2vec)

In [23]:
X_tr[0,:]

array([ 0.20921111,  0.21025616,  0.06008345, -0.02214998, -0.10523272,
       -0.02579505,  0.09889877,  0.19174583, -0.20905982,  0.2006634 ,
       -0.06001522,  0.08307546, -0.12640342, -0.02311627, -0.07260622,
       -0.06967316,  0.09929208, -0.13151792,  0.02075363, -0.04538102,
        0.05306617,  0.03122287, -0.00906889,  0.03343381,  0.15381609,
        0.12021481,  0.10070093,  0.02337468,  0.20169829,  0.08823855,
        0.09444793,  0.20519114, -0.21866359, -0.07030618, -0.18285768,
       -0.15142709, -0.13870668, -0.12757736, -0.13962598,  0.01909691,
       -0.20975958, -0.15397128, -0.23655084, -0.10314508, -0.06874561,
       -0.14424036,  0.02214006,  0.16260675,  0.08767717,  0.15231596,
        0.17467186, -0.03408764, -0.12284363, -0.06196873, -0.04730185,
       -0.14256351, -0.11773074,  0.12460084,  0.08886473,  0.18074241,
        0.14246114,  0.0523156 ,  0.04918189, -0.11882342,  0.05237473,
        0.22714394,  0.02760408, -0.01816032, -0.17608851, -0.05

In [24]:
X_te = np.zeros((len(x_test), num_features))
y_te = y_test
n_samples = X_te.shape[0]
for i in range(n_samples):
    X_te[i,:] = doc_to_vec(x_test[i], word2vec)

In [25]:
X_tr.shape, X_te.shape

((1044895, 300), (261224, 300))

### Predicting with a Logistic Regression 

In [26]:
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
logreg = LogisticRegression(n_jobs=1, C=1e5)
logreg.fit(X_tr, y_tr)

LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [27]:
y_pred = logreg.predict(X_tr)
print('Train accuracy: {}'.format(accuracy_score(y_tr, y_pred)))
print('Train F1 score: {}'.format(f1_score(y_tr, y_pred)))

y_pred = logreg.predict(X_te)
print('Test accuracy: {}'.format(accuracy_score(y_te, y_pred)))
print('Test F1 score: {}'.format(f1_score(y_te, y_pred)))

Train accuracy: 0.9470329554644247
Train F1 score: 0.4322832787961472
Test accuracy: 0.9466970875570392
Test F1 score: 0.42840722495894906


## Word2vec - 20 epochs / 300 features
### Train word2vec

In [32]:
num_features = 300
num_epochs   = 20

# Minimum word count threshold.
min_word_count = 0

# Number of threads to run in parallel.
num_workers = multiprocessing.cpu_count()

# Context window length.
context_size = 5

# Downsampling threshold for frequent words.
# gensim documents (0, 1e-5) as the useful range.
downsampling = 1e-3

seed = 1

#optional Training algorithm: 1 for skip-gram; otherwise CBOW
sg = 1

In [33]:
word2vec = w2v.Word2Vec(
    sg=sg,
    seed=seed,
    workers=num_workers,
    size=num_features,
    min_count=min_word_count,
    window=context_size,
    sample=downsampling)

In [34]:
word2vec.build_vocab(sentences, keep_raw_vocab=True)

In [35]:
word2vec.corpus_count

1306119

In [36]:
len(word2vec.vocabulary.raw_vocab)

450473

In [37]:
total_examples = len(sentences)

In [38]:
%%time
word2vec.train(sentences,
               epochs = num_epochs,
               total_examples=total_examples)

Wall time: 17min 38s


(248598897, 334461460)

In [39]:
foldername = "./saved_models/" + "w2v_" + str(num_features) +"features_"+str(num_epochs)+"epochs"
modelname  = "word2vec_" + str(num_features) +"features" + str(num_epochs)+"epochs.w2v"

if not os.path.exists(foldername):
    os.makedirs(foldername)
    word2vec.save(os.path.join(foldername, modelname))
else:
    print("folder {} already exists".format(foldername))

### Predicting from a Word2vec averaged representation

Also try the other method, which concatenated this vector with the word vector. I think it was in notebook 2.

In [40]:
def doc_to_vec(sentence, word2vec):
    word_list    = sentence_to_wordlist(sentence)
    word_vectors = []
    for w in word_list:
        word_vectors.append(word2vec.wv.get_vector(w))

    return np.mean(word_vectors,axis=0)

In [41]:
X_tr = np.zeros((len(x_train), num_features))
y_tr = y_train
n_samples = X_tr.shape[0]

for i in range(n_samples):
    X_tr[i,:] = doc_to_vec(x_train[i], word2vec)

In [42]:
X_te = np.zeros((len(x_test), num_features))
y_te = y_test
n_samples = X_te.shape[0]
for i in range(n_samples):
    X_te[i,:] = doc_to_vec(x_test[i], word2vec)

In [43]:
X_tr.shape, X_te.shape

((1044895, 300), (261224, 300))

### Predicting with a Logistic Regression 

In [44]:
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
logreg = LogisticRegression(n_jobs=1, C=1e5)
logreg.fit(X_tr, y_tr)

LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [45]:
y_pred = logreg.predict(X_tr)
print('Train accuracy: {}'.format(accuracy_score(y_tr, y_pred)))
print('Train F1 score: {}'.format(f1_score(y_tr, y_pred)))

y_pred = logreg.predict(X_te)
print('Test accuracy: {}'.format(accuracy_score(y_te, y_pred)))
print('Test F1 score: {}'.format(f1_score(y_te, y_pred)))

Train accuracy: 0.9471085611472924
Train F1 score: 0.43476926853215514
Test accuracy: 0.9467238844822834
Test F1 score: 0.43072769664989563


## Word2vec - 20 epochs / 200 features
### Train word2vec

In [46]:
num_features = 200
num_epochs   = 20

# Minimum word count threshold.
min_word_count = 0

# Number of threads to run in parallel.
num_workers = multiprocessing.cpu_count()

# Context window length.
context_size = 5

# Downsampling threshold for frequent words.
# gensim documents (0, 1e-5) as the useful range.
downsampling = 1e-3

seed = 1

#optional Training algorithm: 1 for skip-gram; otherwise CBOW
sg = 1

In [47]:
word2vec = w2v.Word2Vec(
    sg=sg,
    seed=seed,
    workers=num_workers,
    size=num_features,
    min_count=min_word_count,
    window=context_size,
    sample=downsampling)

In [48]:
word2vec.build_vocab(sentences, keep_raw_vocab=True)

In [49]:
word2vec.corpus_count

1306119

In [50]:
len(word2vec.vocabulary.raw_vocab)

450473

In [51]:
total_examples = len(sentences)

In [52]:
%%time
word2vec.train(sentences,
               epochs = num_epochs,
               total_examples=total_examples)

Wall time: 14min 56s


(248596134, 334461460)

In [53]:
foldername = "./saved_models/" + "w2v_" + str(num_features) +"features_"+str(num_epochs)+"epochs"
modelname  = "word2vec_" + str(num_features) +"features" + str(num_epochs)+"epochs.w2v"

if not os.path.exists(foldername):
    os.makedirs(foldername)
    word2vec.save(os.path.join(foldername, modelname))
else:
    print("folder {} already exists".format(foldername))

### Predicting from a Word2vec averaged representation

Also try the other method, which concatenated this vector with the word vector. I think it was in notebook 2.

In [54]:
def doc_to_vec(sentence, word2vec):
    word_list    = sentence_to_wordlist(sentence)
    word_vectors = []
    for w in word_list:
        word_vectors.append(word2vec.wv.get_vector(w))

    return np.mean(word_vectors,axis=0)

In [55]:
X_tr = np.zeros((len(x_train), num_features))
y_tr = y_train
n_samples = X_tr.shape[0]

for i in range(n_samples):
    X_tr[i,:] = doc_to_vec(x_train[i], word2vec)

In [56]:
X_te = np.zeros((len(x_test), num_features))
y_te = y_test
n_samples = X_te.shape[0]
for i in range(n_samples):
    X_te[i,:] = doc_to_vec(x_test[i], word2vec)

In [57]:
X_tr.shape, X_te.shape

((1044895, 200), (261224, 200))

### Predicting with a Logistic Regression 

In [58]:
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
logreg = LogisticRegression(n_jobs=1, C=1e5)
logreg.fit(X_tr, y_tr)

LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [59]:
y_pred = logreg.predict(X_tr)
print('Train accuracy: {}'.format(accuracy_score(y_tr, y_pred)))
print('Train F1 score: {}'.format(f1_score(y_tr, y_pred)))

y_pred = logreg.predict(X_te)
print('Test accuracy: {}'.format(accuracy_score(y_te, y_pred)))
print('Test F1 score: {}'.format(f1_score(y_te, y_pred)))

Train accuracy: 0.9466310012010776
Train F1 score: 0.426043907409504
Test accuracy: 0.9463334150001531
Test F1 score: 0.4224446916326783


## Word2vec - 20 epochs / 350 features
### Train word2vec

In [60]:
num_features = 350
num_epochs   = 20

# Minimum word count threshold.
min_word_count = 0

# Number of threads to run in parallel.
num_workers = multiprocessing.cpu_count()

# Context window length.
context_size = 5

# Downsampling threshold for frequent words.
# gensim documents (0, 1e-5) as the useful range.
downsampling = 1e-3

seed = 1

#optional Training algorithm: 1 for skip-gram; otherwise CBOW
sg = 1

In [61]:
word2vec = w2v.Word2Vec(
    sg=sg,
    seed=seed,
    workers=num_workers,
    size=num_features,
    min_count=min_word_count,
    window=context_size,
    sample=downsampling)

In [62]:
word2vec.build_vocab(sentences, keep_raw_vocab=True)

In [63]:
word2vec.corpus_count

1306119

In [64]:
len(word2vec.vocabulary.raw_vocab)

450473

In [65]:
total_examples = len(sentences)

In [66]:
%%time
word2vec.train(sentences,
               epochs = num_epochs,
               total_examples=total_examples)

Wall time: 13min 51s


(248600971, 334461460)

In [67]:
foldername = "./saved_models/" + "w2v_" + str(num_features) +"features_"+str(num_epochs)+"epochs"
modelname  = "word2vec_" + str(num_features) +"features" + str(num_epochs)+"epochs.w2v"

if not os.path.exists(foldername):
    os.makedirs(foldername)
    word2vec.save(os.path.join(foldername, modelname))
else:
    print("folder {} already exists".format(foldername))

### Predicting from a Word2vec averaged representation

Also try the other method, which concatenated this vector with the word vector. I think it was in notebook 2.

In [68]:
def doc_to_vec(sentence, word2vec):
    word_list    = sentence_to_wordlist(sentence)
    word_vectors = []
    for w in word_list:
        word_vectors.append(word2vec.wv.get_vector(w))

    return np.mean(word_vectors,axis=0)

In [69]:
X_tr = np.zeros((len(x_train), num_features))
y_tr = y_train
n_samples = X_tr.shape[0]

for i in range(n_samples):
    X_tr[i,:] = doc_to_vec(x_train[i], word2vec)

In [70]:
X_te = np.zeros((len(x_test), num_features))
y_te = y_test
n_samples = X_te.shape[0]
for i in range(n_samples):
    X_te[i,:] = doc_to_vec(x_test[i], word2vec)

In [71]:
X_tr.shape, X_te.shape

((1044895, 350), (261224, 350))

### Predicting with a Logistic Regression 

In [72]:
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
logreg = LogisticRegression(n_jobs=1, C=1e5)
logreg.fit(X_tr, y_tr)

LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [73]:
y_pred = logreg.predict(X_tr)
print('Train accuracy: {}'.format(accuracy_score(y_tr, y_pred)))
print('Train F1 score: {}'.format(f1_score(y_tr, y_pred)))

y_pred = logreg.predict(X_te)
print('Test accuracy: {}'.format(accuracy_score(y_te, y_pred)))
print('Test F1 score: {}'.format(f1_score(y_te, y_pred)))

Train accuracy: 0.9473621751467851
Train F1 score: 0.43908010810259546
Test accuracy: 0.9468693535050379
Test F1 score: 0.4345487879405174


### Exploratory checks on the data

In [28]:
X_tr.max()

1.0285873413085938

In [29]:
sum(np.isnan(X_tr))

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0])

In [113]:
lengths = []
for i in range(len(x_train)):
    n_words = len(x_train[i].split(' '))
    lengths.append(n_words)
    if n_words <= 2:
        print(x_train[i], i)
    
aux = np.asarray(lengths)
aux.argmin(), aux.min()

If  1283
Bye Bye? 6395
Whofound India? 37749
Wicca:  48551
Explain cryptocurrency? 69865
Nepal:  80771
Hello sir? 85842
What graphic? 128486
Can coffee? 160121
Feminism:  161286
What's? E=mc2 162057
Whorote gitanjali? 170242
Whatis diphthong? 175078
Whatis computer? 181681
Maladaptive daydreaming? 181975
What's pornstar? 182879
India:  185000
What cyber? 197980
Islam:  200460
Whatis rpm? 211245
Imagination:  214966
What's KFC? 228209
I 12? 265089
Dowry:  266262
Why hospitality? 288077
In Islam? 323112
Free Sandeep? 339084
Whtis love? 347064
Whatis synergy? 384154
Sykes–Picot Agreement? 420301
[math]24-7=?[/math] 453153
Germany:  455658
Incest:  457982
What nudist? 498106
Hungary:  500973
Whatis demobilisation? 512352
Is  542168
VJTI TEXTILE? 602027
To Quora: 631052
Sexism:  633587
Semspa center? 656875
What meow? 658557
What's disadvantage? 710108
What's geology? 716664
Are Jehovah's Witnesses evil? 719222
Colonialism:  720712
Which certification? 741149
Criminals:  776160
What isOrgan

(453153, 1)
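
The minimum of 1 and the many short entries ending in a space follow from `split(' ')`, which splits on every single space and keeps empty strings. It therefore never returns an empty list, and it counts a trailing space as an extra word. A quick comparison with the argument-free `split()` used in `sentence_to_wordlist`:

```python
examples = ["If ", "Wicca: ", "[math]24-7=?[/math]"]

for text in examples:
    # split(' ') keeps the empty string after a trailing space;
    # split() collapses whitespace and drops empty strings.
    print(repr(text), text.split(' '), text.split())
```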

In [120]:
aux = x_train[1283]
aux, re.sub("[^a-zA-Z]"," ", aux), re.sub("[^a-zA-Z]"," ", aux).split()

('If ', 'If ', ['If'])

In [62]:
X_train[145458:]

| | qid | question_text | target |
|---|---|---|---|
| 995255 | c309469a202434b5f1d2 | W | 1 |
| 917150 | b3b5ef435f94323017a8 | Can I learn German in one year and study mecha... | 0 |
| 919670 | b438f0def51ae7290356 | What would it look like if a human puts on 300... | 0 |
| 272187 | 3547d1f0e5413040d32d | What is the first language ever? | 0 |
| 324216 | 3f8b907035841cdcf361 | Why is the view from ISS suddenly pink? | 0 |
| 810204 | 9ec500b2510bdcf60cd3 | Are there P2P payment APIs which can be integr... | 0 |
| 524030 | 66963873eb8f5bcdd075 | Why some countries such as Iran and Turkey cal... | 0 |
| 965370 | bd22c346ba11df87fe41 | What is the way to use one WhatsApp account on... | 0 |
| 290958 | 38fb779f3730e31c53d0 | As a college senior, when do I start my applic... | 0 |
| 915791 | b36f9883113721d9364e | How do I get free Java programming courses? | 0 |


In [65]:
train_data.iloc[995255]['question_text']

'W'

In [44]:
for i in range(len(train_data)):
    if train_data.iloc[i]['question_text'] == 'Do ':
        print(i)

In [None]:
train_data = train_data.drop(train_data.index[420816])

In [48]:
train_data.iloc[792938]['question_text']

'Do '

In [43]:
len(train_data)

1306121

In [121]:
lengths = []
for i in range(len(x_test)):
    n_words = len(x_test[i].split(' '))
    lengths.append(n_words)
    if n_words <= 2:
        print(x_test[i], i)
    
aux = np.asarray(lengths)
aux.argmin(), aux.min()

Neigh! Whinny! 3568
Poland:  3985
Are rabbits? 4814
Does Bangladeshis? 79759
Whatis extempore? 80038
Nuclear weapons? 87378
Who.ismost powerful.man? 122021
What's WhatsApp? 161416
UTGST RATES? 162080
Quora:  165156
Google jigsaw? 228339
What’s 1+1=? 239114
Whats nuclear? 251789
IS---RA---EL (OHIM)? 261195


(3568, 2)

In [128]:
for i in range(len(X_tr)):
    if np.isnan(sum(X_tr[i])):
        print(i)

8946
19818
51545
89077
805767


In [135]:
x_train[805767]

'Какая компания сегодня создала самый мощный искусственный интеллект?'
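
A likely explanation for these NaN rows, assuming they were produced while the commented-out `re.sub("[^a-zA-Z]", ...)` cleaning in `sentence_to_wordlist` was active: the pattern replaces every non-Latin character with a space, the question ends up with no tokens, and averaging an empty list of vectors yields NaN. The tokenization step can be checked in isolation:

```python
import re

question = 'Какая компания сегодня создала самый мощный искусственный интеллект?'

# Same pattern as the commented-out line in sentence_to_wordlist.
tokens = re.sub("[^a-zA-Z]", " ", question).lower().split()
print(tokens)  # []
```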

In [130]:
x_train

array([ 'How is the writing style and structure in the novel "Germinal" by Émile Zola depicted?',
       'Is a debtor, a bonded labor? Why',
       'What are the best ways to develop leads?', ...,
       'Is the discount rate of buying one share of a stock equal to the discount rate of buying ten shares?',
       'What is the best way to get a personal loan in Kenya?',
       'Do you think a piloted airplane could fly under the Deception Pass Bridge?'], dtype=object)