# Final project: StackOverflow assistant bot

Congratulations on coming this far and solving the programming assignments! In this final project, we will combine everything we have learned about Natural Language Processing to construct a *dialogue chat bot*, which will be able to:
* answer programming-related questions (using the StackOverflow dataset);
* chit-chat and simulate dialogue on all non-programming-related questions.

For the chit-chat mode, we will use a pre-trained neural network engine available from [ChatterBot](https://github.com/gunthercox/ChatterBot).
Those who are aiming for an honors certificate in our course, or who are just curious, will train their own models for chit-chat.
![](https://imgs.xkcd.com/comics/twitter_bot.png)
©[xkcd](https://xkcd.com)

### Data description

To detect the *intent* of users' questions, we will need two text collections:
- `tagged_posts.tsv` — StackOverflow posts, tagged with one programming language (*positive samples*).
- `dialogues.tsv` — dialogue phrases from movie subtitles (*negative samples*).


In [1]:
import sys
sys.path.append("..")
from common.download_utils import download_project_resources

download_project_resources()

For questions that have programming-related intent, we will proceed as follows: predict the programming language (only one tag per question is allowed here) and rank candidates within that tag using embeddings.
For the ranking part, you will need:
- `word_embeddings.tsv` — word embeddings that you trained with StarSpace in the 3rd assignment. It's not a problem if you didn't, because we can offer you an alternative solution.
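
The overall flow described above can be sketched as follows. This is only a minimal illustration, not the course's actual bot code: `answer_question` and the four callables passed to it are hypothetical stand-ins for the intent recognizer, the tag classifier, the thread ranker, and the chit-chat engine, and the thread id is a made-up placeholder.

```python
def answer_question(question, recognize_intent, predict_tag, best_thread, chitchat):
    """Route a question: chit-chat for dialogue, thread search for programming."""
    if recognize_intent(question) == 'dialogue':
        return chitchat(question)
    tag = predict_tag(question)             # exactly one programming language
    thread_id = best_thread(question, tag)  # search only within that tag
    return ('I think it is about {}. This thread might help: '
            'https://stackoverflow.com/questions/{}'.format(tag, thread_id))


# Toy stand-ins (assumptions for illustration only).
def toy_intent(q):
    return 'stackoverflow' if 'python' in q.lower() else 'dialogue'


reply = answer_question(
    'How to sort a list in Python?',
    recognize_intent=toy_intent,
    predict_tag=lambda q: 'python',
    best_thread=lambda q, tag: 12345,  # placeholder thread id
    chitchat=lambda q: 'Hello!',
)
print(reply)
```

Passing the components in as callables keeps the routing logic independent of how each model is trained or loaded.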

As a result of this notebook, you should obtain the following new objects that you will then use in the running bot:

- `intent_recognizer.pkl` — intent recognition model;
- `tag_classifier.pkl` — programming language classification model;
- `tfidf_vectorizer.pkl` — vectorizer used during training;
- `thread_embeddings_by_tags` — folder with thread embeddings, arranged by tags.
    

Some functions will be reused by this notebook and the scripts, so we put them into *utils.py* file. Don't forget to open it and fill in the gaps!

In [3]:
! cp ../week3/starSpaceEmbeddings.tsv.tsv .


In [8]:
! wc -l starspace_embeddings.tsv

   65505 starspace_embeddings.tsv


In [78]:
from utils import *

## Part I. Intent and language recognition

We want to write a bot that will not only **answer programming-related questions**, but will also be able to **maintain a dialogue**. We would also like to detect the *intent* of the user from the question (we could have had a 'Question answering mode' check-box in the bot, but it wouldn't be fun at all, would it?). So the first thing we need to do is to **distinguish programming-related questions from general ones**.

It would also be good to predict which programming language a particular question refers to. By doing so, we will speed up question search by a factor of the number of languages (10 here), and exercise our *text classification* skill a bit. :)

In [10]:
import numpy as np
import pandas as pd
import pickle
import re

from sklearn.feature_extraction.text import TfidfVectorizer

### Data preparation

In the first assignment (Predict tags on StackOverflow with linear models), you have already learned how to preprocess texts and apply TF-IDF transformations. Reuse your code here. In addition, you will also need to [dump](https://docs.python.org/3/library/pickle.html#pickle.dump) the TF-IDF vectorizer with pickle to use it later in the running bot.

In [37]:
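from nltk.corpus import stopwords

# Raw strings avoid invalid-escape warnings for sequences such as '\['.
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))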

def text_prepare(text):
    """
        text: a string
        
        return: modified initial string
    """
    text = text.lower()  # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(" ", text)  # replace REPLACE_BY_SPACE_RE symbols with a space
    text = BAD_SYMBOLS_RE.sub("", text)  # delete symbols matched by BAD_SYMBOLS_RE
    text = " ".join(w for w in text.split(" ") if w != "" and w not in STOPWORDS)  # delete stopwords
    return text
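
To see what this preprocessing keeps and drops, here is a small self-contained check. It is a sketch under one assumption: `TOY_STOPWORDS` is a tiny hand-picked set used in place of NLTK's English stopword list, and `toy_text_prepare` is a hypothetical name for this illustration. Note that `#` and `+` survive, so tokens like `c#` and `c++` stay intact.

```python
import re

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
# Tiny hand-picked stopword set; the notebook uses NLTK's English list instead.
TOY_STOPWORDS = {'how', 'to', 'a', 'in', 'the', 'is', 'and'}


def toy_text_prepare(text):
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # brackets and separators become spaces
    text = BAD_SYMBOLS_RE.sub('', text)        # everything else unusual is removed
    return ' '.join(w for w in text.split() if w not in TOY_STOPWORDS)


prepared = toy_text_prepare('How to parse JSON in C# (and C++)?')
print(prepared)
```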

In [51]:
def tfidf_features(X_train, X_test, vectorizer_path):
    """Performs TF-IDF transformation and dumps the model."""

    # Train a vectorizer on X_train data, then transform X_train and X_test.
    # ngram_range=(1, 1) keeps unigrams only; token_pattern keeps tokens such as 'c#' and 'c++'.
    tfidf_vectorizer = TfidfVectorizer(min_df=5, max_df=0.9, ngram_range=(1, 1), token_pattern=r'(\S+)')
    X_train = tfidf_vectorizer.fit_transform(X_train)
    X_test = tfidf_vectorizer.transform(X_test)

    # Pickle the trained vectorizer to 'vectorizer_path' (file opened in binary write mode).
    with open(vectorizer_path, 'wb') as f:
        pickle.dump(tfidf_vectorizer, f, protocol=3)

    return X_train, X_test
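
Before relying on the dumped vectorizer in the running bot, it may be worth checking that it survives a round trip through pickle. The sketch below does this on a toy corpus; the documents, the temporary path, and the file name are assumptions made only for this example.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['parse json c#', 'python list sort', 'sort array c#']
vectorizer = TfidfVectorizer(token_pattern=r'(\S+)')
fitted = vectorizer.fit_transform(docs)

# Dump to a temporary file and load it back.
path = os.path.join(tempfile.mkdtemp(), 'toy_vectorizer.pkl')
with open(path, 'wb') as f:
    pickle.dump(vectorizer, f)
with open(path, 'rb') as f:
    restored = pickle.load(f)

# The restored vectorizer has the same vocabulary and yields the same features.
same_vocab = restored.vocabulary_ == vectorizer.vocabulary_
same_features = (restored.transform(docs) != fitted).nnz == 0
print(same_vocab, same_features)
```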

In [13]:
pickle.dump?

Now, load examples of the two classes. Use a subsample of the StackOverflow data to balance the classes. You will need the full data later.

In [32]:
sample_size = 200000

dialogue_df = pd.read_csv('data/dialogues.tsv', sep='\t').sample(sample_size, random_state=0)
stackoverflow_df = pd.read_csv('data/tagged_posts.tsv', sep='\t').sample(sample_size, random_state=0)

Check what the data looks like:

In [33]:
dialogue_df.head()

Unnamed: 0,text,tag
82925,"Donna, you are a muffin.",dialogue
48774,He was here last night till about two o'clock....,dialogue
55394,"All right, then make an appointment with her s...",dialogue
90806,"Hey, what is this-an interview? We're supposed...",dialogue
107758,Yeah. He's just a friend of mine I was trying ...,dialogue


In [34]:
stackoverflow_df.head()

Unnamed: 0,post_id,title,tag
2168983,43837842,Efficient Algorithm to compose valid expressio...,python
1084095,15747223,Why does this basic thread program fail with C...,c_cpp
1049020,15189594,Link to scroll to top not working,javascript
200466,3273927,Is it possible to implement ping on windows ph...,c#
1200249,17684551,GLSL normal mapping issue,c_cpp


Apply the *text_prepare* function to preprocess the data:

In [19]:
from utils import text_prepare

In [42]:
dialogue_df['text'].head()

82925                              Donna, you are a muffin.
48774     He was here last night till about two o'clock....
55394     All right, then make an appointment with her s...
90806     Hey, what is this-an interview? We're supposed...
107758    Yeah. He's just a friend of mine I was trying ...
Name: text, dtype: object

In [40]:
dialogue_df['text'].apply(text_prepare)

82925                                          donna muffin
48774     last night till two oclock hear really got stu...
55394                            right make appointment see
90806             hey thisan interview supposed making love
107758                     yeah hes friend mine trying help
Name: text, dtype: object

In [44]:
dialogue_df['text'] = dialogue_df['text'].apply(text_prepare)
stackoverflow_df['title'] = stackoverflow_df['title'].apply(text_prepare)

In [45]:
dialogue_df['text'].head()

82925                                          donna muffin
48774     last night till two oclock hear really got stu...
55394                            right make appointment see
90806             hey thisan interview supposed making love
107758                     yeah hes friend mine trying help
Name: text, dtype: object

### Intent recognition

We will do a binary classification on TF-IDF representations of texts. Labels will be either `dialogue` for general questions or `stackoverflow` for programming-related questions. First, prepare the data for this task:
- concatenate `dialogue` and `stackoverflow` examples into one sample
- split it into train and test in proportion 9:1, use *random_state=0* for reproducibility
- transform it into TF-IDF features

In [46]:
from sklearn.model_selection import train_test_split

In [62]:
X = np.concatenate([dialogue_df['text'].values, stackoverflow_df['title'].values])
y = ['dialogue'] * dialogue_df.shape[0] + ['stackoverflow'] * stackoverflow_df.shape[0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0 )
print('Train size = {}, test size = {}'.format(len(X_train), len(X_test)))

X_train_tfidf, X_test_tfidf = tfidf_features(X_train, X_test, vectorizer_path=RESOURCE_PATH['TFIDF_VECTORIZER'])

Train size = 360000, test size = 40000


Train the **intent recognizer** using LogisticRegression on the train set with the following parameters: *penalty='l2'*, *C=10*, *random_state=0*. Print out the accuracy on the test set to check whether everything looks good.

In [53]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [54]:
?LogisticRegression

In [55]:
######################################
######### YOUR CODE HERE #############
######################################

intent_recognizer = LogisticRegression(penalty='l2', C=10, random_state=0)
intent_recognizer.fit(X_train_tfidf, y_train)



LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=0, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [56]:
# Check test accuracy.
y_test_pred = intent_recognizer.predict(X_test_tfidf)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('Test accuracy = {}'.format(test_accuracy))

Test accuracy = 0.991725


Dump the classifier to use it in the running bot.

In [58]:
with open(RESOURCE_PATH['INTENT_RECOGNIZER'], 'wb') as f:
    pickle.dump(intent_recognizer, f)

### Programming language classification 

We will train one more classifier for the programming-related questions. It will predict exactly one tag (=programming language) and will also be based on Logistic Regression with TF-IDF features.

First, let us prepare the data for this task.

In [63]:
X = stackoverflow_df['title'].values
y = stackoverflow_df['tag'].values

In [64]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print('Train size = {}, test size = {}'.format(len(X_train), len(X_test)))

Train size = 160000, test size = 40000


Let us reuse the TF-IDF vectorizer that we have already created above. It should not make a huge difference which data was used to train it.

In [65]:
with open(RESOURCE_PATH['TFIDF_VECTORIZER'], 'rb') as f:
    vectorizer = pickle.load(f)

X_train_tfidf, X_test_tfidf = vectorizer.transform(X_train), vectorizer.transform(X_test)

Train the **tag classifier** using OneVsRestClassifier wrapper over LogisticRegression. Use the following parameters: *penalty='l2'*, *C=5*, *random_state=0*.

In [66]:
from sklearn.multiclass import OneVsRestClassifier

In [68]:
######################################
######### YOUR CODE HERE #############
######################################

tag_classifier = OneVsRestClassifier(LogisticRegression(penalty='l2', C=5, random_state=0))
tag_classifier.fit(X_train_tfidf, y_train)



OneVsRestClassifier(estimator=LogisticRegression(C=5, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=0, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
          n_jobs=None)
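
Conceptually, a one-vs-rest wrapper fits one binary classifier per tag and, for a single-label problem like this one, predicts the tag whose classifier gives the highest score. The sketch below illustrates that decision rule with NumPy only; the score matrix is made up for illustration and does not come from the trained model.

```python
import numpy as np

tags = np.array(['c#', 'java', 'python'])
# Hypothetical per-tag scores: two questions (rows), three binary classifiers (columns).
scores = np.array([
    [-1.2, 0.3, 2.1],
    [0.8, -0.5, -2.0],
])
# Pick, for every question, the tag with the highest score.
predicted = tags[np.argmax(scores, axis=1)]
print(predicted)
```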

In [69]:
# Check test accuracy.
y_test_pred = tag_classifier.predict(X_test_tfidf)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('Test accuracy = {}'.format(test_accuracy))

Test accuracy = 0.797325


Dump the classifier to use it in the running bot.

In [70]:
with open(RESOURCE_PATH['TAG_CLASSIFIER'], 'wb') as f:
    pickle.dump(tag_classifier, f)

## Part II. Ranking  questions with embeddings

To find a relevant answer (a thread from StackOverflow) to a question, you will use vector representations to calculate the similarity between the question and existing threads. We already have the `question_to_vec` function from assignment 3, which can create such a representation based on word vectors.

However, it would be costly to compute such a representation for all possible answers in the *online mode* of the bot (i.e. when the bot is running and answering questions from many users). This is why you will create a *database* of pre-computed representations. These representations will be arranged by non-overlapping tags (programming languages), so that the search for an answer can be performed within one tag at a time. This makes the bot more efficient and avoids keeping the whole database in RAM.

Load the StarSpace embeddings that were trained on Stack Overflow posts. These embeddings were trained in *supervised mode* for duplicate detection on the same corpus that is used in search. We can therefore expect these representations to help us find closely related answers for a question.

If for some reason you didn't train StarSpace embeddings in assignment 3, you can use [pre-trained word vectors](https://code.google.com/archive/p/word2vec/) from Google. All instructions on how to work with these vectors were provided in the same assignment. However, we highly recommend using the StarSpace embeddings, because they are better suited to this data. If you choose to use Google's embeddings, delete the words that are not in the StackOverflow data.
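
One possible way to drop out-of-corpus words from a large pre-trained embedding table is sketched below. The helper name `filter_embeddings` and the toy data are assumptions for illustration; in practice the vocabulary would be built from the preprocessed StackOverflow titles.

```python
import numpy as np


def filter_embeddings(embeddings, vocabulary):
    """Keep only the words that occur in the given vocabulary."""
    return {word: vec for word, vec in embeddings.items() if word in vocabulary}


# Toy data (assumption): three pre-trained vectors and a two-title corpus.
pretrained = {
    'python': np.array([0.1, 0.2], dtype=np.float32),
    'list': np.array([0.3, 0.4], dtype=np.float32),
    'opera': np.array([0.5, 0.6], dtype=np.float32),
}
titles = ['python list sort', 'sort list python']
vocabulary = {word for title in titles for word in title.split()}

filtered = filter_embeddings(pretrained, vocabulary)
print(sorted(filtered))
```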

In [72]:
np.array([1, 2, 3], dtype=float)  # the np.float alias was removed in NumPy 1.24

array([1., 2., 3.])

In [76]:
! ls data


dialogues.tsv            starspace_embeddings.tsv tagged_posts.tsv


In [139]:
def load_embeddings(embeddings_path):
    """Loads pre-trained word embeddings from tsv file.

    Args:
      embeddings_path - path to the embeddings file.

    Returns:
      embeddings - dict mapping words to vectors;
      embeddings_dim - dimension of the vectors.
    """
    
    # Hint: you have already implemented a similar routine in the 3rd assignment.
    # Note that here you also need to know the dimension of the loaded embeddings.
    # When you load the embeddings, use numpy.float32 type as dtype

    starspace_embeddings = {}
    with open(embeddings_path, 'r') as f:
        for line in f:
            # Each line is: word<TAB>v1<TAB>v2<TAB>...
            word, *embs = line.strip().split('\t')
            starspace_embeddings[word] = np.array(list(map(float, embs)), dtype=np.float32)

    # All vectors share the same length, so take the dimension from any one of them.
    embeddings_dim = next(iter(starspace_embeddings.values())).shape[0]
    return starspace_embeddings, embeddings_dim
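
A quick way to sanity-check a loader like this is to run it on a tiny file with known contents. The sketch below writes a two-word TSV to a temporary location and parses it; `load_tsv_embeddings` is a compact stand-in written for this example, and the file contents are made up.

```python
import os
import tempfile

import numpy as np


def load_tsv_embeddings(path):
    """Parse lines of the form word<TAB>v1<TAB>v2... into a dict of float32 vectors."""
    embeddings = {}
    with open(path, 'r') as f:
        for line in f:
            word, *values = line.rstrip('\n').split('\t')
            embeddings[word] = np.array([float(v) for v in values], dtype=np.float32)
    dim = len(next(iter(embeddings.values())))
    return embeddings, dim


# Write a toy embeddings file (assumption: two words, three dimensions).
path = os.path.join(tempfile.mkdtemp(), 'toy_embeddings.tsv')
with open(path, 'w') as f:
    f.write('python\t0.1\t0.2\t0.3\n')
    f.write('java\t0.4\t0.5\t0.6\n')

toy_embeddings, toy_dim = load_tsv_embeddings(path)
print(toy_dim, toy_embeddings['java'].dtype)
```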



In [146]:
def question_to_vec(question, embeddings, dim=300):
    """
        question: a string
        embeddings: dict where the key is a word and the value is its embedding
        dim: size of the representation

        result: vector representation for the question
    """
    vectors = [embeddings[w] for w in question.split(' ') if w in embeddings]
    if len(vectors) > 0:
        return np.mean(np.vstack(vectors), axis=0)
    else:
        return np.zeros((dim))
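
The two branches of this function are easy to see on toy data: known words are averaged, unknown words are skipped, and a question with no known words at all maps to the zero vector. The sketch below uses a hypothetical helper `mean_vector` with two-dimensional made-up embeddings.

```python
import numpy as np


def mean_vector(question, embeddings, dim):
    """Average the vectors of the known words; fall back to zeros if there are none."""
    vectors = [embeddings[w] for w in question.split() if w in embeddings]
    if vectors:
        return np.mean(vectors, axis=0)
    return np.zeros(dim, dtype=np.float32)


toy = {
    'numpy': np.array([1.0, 0.0], dtype=np.float32),
    'array': np.array([0.0, 1.0], dtype=np.float32),
}
known = mean_vector('create numpy array', toy, dim=2)  # 'create' is out of vocabulary
unknown = mean_vector('howdy', toy, dim=2)             # no known words at all
print(known, unknown)
```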

In [140]:
starspace_embeddings, embeddings_dim = load_embeddings('data/starspace_embeddings.tsv')

Since we want to precompute representations for all possible answers, we need to load the whole posts dataset, unlike what we did for the intent classifier:

In [86]:
posts_df = pd.read_csv('data/tagged_posts.tsv', sep='\t')

In [87]:
posts_df.head()

Unnamed: 0,post_id,title,tag
0,9,Calculate age in C#,c#
1,16,Filling a DataSet or DataTable from a LINQ que...,c#
2,39,Reliable timer in a console application,c#
3,42,Best way to allow plugins for a PHP application,php
4,59,"How do I get a distinct, ordered list of names...",c#


In [219]:
posts_df[posts_df['post_id'] == 759216]

Unnamed: 0,post_id,title,tag
34226,759216,implementation of composition and aggregation ...,c#


Look at the distribution of posts for programming languages (tags) and find the most common ones. 
You might want to use pandas [groupby](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) and [count](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html) methods:

In [129]:
counts_by_tag = posts_df[["title", "tag"]].groupby('tag').count()['title']

In [220]:
[(tag, count) for tag, count in counts_by_tag.items()]

[('c#', 394451),
 ('c_cpp', 281300),
 ('java', 383456),
 ('javascript', 375867),
 ('php', 321752),
 ('python', 208607),
 ('r', 36359),
 ('ruby', 99930),
 ('swift', 34809),
 ('vb', 35044)]
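
Because the tags are unbalanced, the saving from searching within a single tag depends on which tag is predicted. The short calculation below uses the counts printed above to compare the largest and the smallest tag.

```python
counts_by_tag = {
    'c#': 394451, 'c_cpp': 281300, 'java': 383456, 'javascript': 375867,
    'php': 321752, 'python': 208607, 'r': 36359, 'ruby': 99930,
    'swift': 34809, 'vb': 35044,
}
total = sum(counts_by_tag.values())
largest_tag = max(counts_by_tag, key=counts_by_tag.get)
smallest_tag = min(counts_by_tag, key=counts_by_tag.get)

# Fraction of the whole database scanned when the search is restricted to one tag.
for tag in (largest_tag, smallest_tag):
    share = counts_by_tag[tag] / total
    print('{}: {:.1%} of {} posts, about {:.1f}x fewer candidates'.format(
        tag, share, total, 1 / share))
```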

Now for each `tag` you need to create two data structures, which will serve as an online search index:
* `tag_post_ids` — a list of post_ids with shape `(counts_by_tag[tag],)`. It will be needed to show the title and link to the thread;
* `tag_vectors` — a matrix with shape `(counts_by_tag[tag], embeddings_dim)` where embeddings for each answer are stored.

Implement the code that calculates these structures and dumps them to files. It should take several minutes to run.

In [221]:
import os
os.makedirs(RESOURCE_PATH['THREAD_EMBEDDINGS_FOLDER'], exist_ok=True)

for tag, count in counts_by_tag.items():
    tag_posts = posts_df[posts_df['tag'] == tag]

    # Post ids in the same order as the rows of tag_vectors.
    tag_post_ids = tag_posts['post_id'].tolist()

    tag_vectors = np.zeros((count, embeddings_dim), dtype=np.float32)
    for i, title in enumerate(tag_posts['title']):
        tag_vectors[i, :] = question_to_vec(title, starspace_embeddings, embeddings_dim)

    # Dump post ids and vectors to a file.
    filename = os.path.join(RESOURCE_PATH['THREAD_EMBEDDINGS_FOLDER'], os.path.normpath('%s.pkl' % tag))
    with open(filename, 'wb') as f:
        pickle.dump((tag_post_ids, tag_vectors), f)
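
At query time, only the file for the predicted tag has to be read, which is what keeps memory usage low. The sketch below mimics the layout written above with a toy index in a temporary folder; the folder, the ids, and the helper `load_tag_index` are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np

folder = tempfile.mkdtemp()  # stands in for RESOURCE_PATH['THREAD_EMBEDDINGS_FOLDER']

# Write a toy index for two tags: (post_ids, matrix with one row per post).
toy_index = {
    'python': ([11, 12, 13], np.zeros((3, 4), dtype=np.float32)),
    'r': ([21, 22], np.ones((2, 4), dtype=np.float32)),
}
for tag, payload in toy_index.items():
    with open(os.path.join(folder, tag + '.pkl'), 'wb') as f:
        pickle.dump(payload, f)


def load_tag_index(folder, tag):
    """Load only the file that belongs to the predicted tag."""
    with open(os.path.join(folder, tag + '.pkl'), 'rb') as f:
        return pickle.load(f)


post_ids, vectors = load_tag_index(folder, 'r')
print(len(post_ids), vectors.shape)
```

The ids and the matrix rows must stay aligned, so a cheap check such as `len(post_ids) == vectors.shape[0]` after loading can catch a corrupted or mismatched file early.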

### Scratchpad

In [145]:
[(i, title) for i, title in enumerate(posts_df[posts_df['tag'] == 'r'].head()['title'])]

[(0, 'How to access the last value in a vector?'),
 (1, 'Explain the quantile() function in R'),
 (2, 'Sample Code for R?'),
 (3, 'What are some good books, web resources, and projects for learning R?'),
 (4, 'Thinking in Vectors with R')]

In [135]:
posts_df[posts_df['tag'] == 'c#'].head()['post_id']

0      9
1     16
2     39
4     59
5    109
Name: post_id, dtype: int64

In [136]:
[p for (i, p) in posts_df[posts_df['tag'] == 'c#']['post_id'].items()]

[9,
 16,
 39,
 59,
 109,
 ...

In [173]:
question = "Howdy, how are ya?"

In [195]:
question = "How to create a web service in Django?"

In [165]:
question = "How to create a Numpy array?"

In [215]:
question = "How to print hello world in Java?"

In [216]:

prepared_question = np.array([text_prepare(question)])
question_tfidf = vectorizer.transform(prepared_question)  # new name avoids shadowing the tfidf_features() function
intent_recognizer.predict(question_tfidf)[0]

'stackoverflow'

In [197]:
tag_classifier.predict(question_tfidf)[0]

'python'

In [176]:
from sklearn.metrics.pairwise import cosine_similarity

In [193]:
def rank_candidates(q_emb, candidate_threads_emb, candidate_thread_ids):
    """
        q_emb: embedding vector for a question
        candidate_threads_emb: matrix of candidate thread embeddings which we want to rank
        candidate_thread_ids: list of stackoverflow thread ids aligned with candidate_threads_emb
        
        result: a list of tuples (initial position in the list, thread_id, cos_similarity),
                sorted by decreasing similarity
    """

    candidate_similarities = cosine_similarity(q_emb.reshape(1, -1), candidate_threads_emb)[0]
    ranked = [(i, candidate_thread_ids[i], s) for i, s in enumerate(candidate_similarities)]
    return sorted(ranked, key=lambda item: item[2], reverse=True)
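
Since the bot usually needs only the best few threads, the ranking can also be written so that it returns just the top `k`. The following NumPy-only variant is a sketch, not a replacement for the function above: `top_k_by_cosine` is a hypothetical helper, the small epsilon guarding against zero vectors is an assumption, and the vectors are toy data.

```python
import numpy as np


def top_k_by_cosine(q_emb, candidate_embs, candidate_ids, k=3):
    """Return the k (thread_id, similarity) pairs with the highest cosine similarity."""
    q_norm = np.linalg.norm(q_emb)
    c_norms = np.linalg.norm(candidate_embs, axis=1)
    # Guard against zero vectors (e.g. titles with no known words).
    denom = np.maximum(q_norm * c_norms, 1e-12)
    sims = candidate_embs @ q_emb / denom
    order = np.argsort(-sims)[:k]  # indices of the k largest similarities
    return [(candidate_ids[i], float(sims[i])) for i in order]


q = np.array([1.0, 0.0])
candidates = np.array([[0.0, 1.0], [2.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
ids = [101, 102, 103, 104]
ranked = top_k_by_cosine(q, candidates, ids, k=2)
print(ranked)
```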

In [198]:
tag = "python"
thread_ids, thread_embeddings = unpickle_file(os.path.join(RESOURCE_PATH['THREAD_EMBEDDINGS_FOLDER'], os.path.normpath('%s.pkl' % tag)))

In [207]:
[(w, starspace_embeddings[w]) for w in text_prepare(question).split(' ') if w in starspace_embeddings]

[('create',
  array([-0.0117805 ,  0.0226999 ,  0.0242219 ,  0.0397424 ,  0.00503275,
          0.00431922,  0.0118978 ,  0.0438049 , -0.0220885 , -0.0105157 ,
         -0.0436699 ,  0.0207152 , -0.00498137,  0.00757037, -0.0528204 ,
         -0.00130005,  0.0543642 , -0.0172891 , -0.0204819 ,  0.0264255 ,
          0.0246876 , -0.0404211 ,  0.0250035 , -0.0557527 , -0.010876  ,
          0.0269171 ,  0.00794895, -0.00327785, -0.0263906 ,  0.0127513 ,
         -0.0214427 , -0.0248284 ,  0.0227035 , -0.044855  , -0.0323937 ,
          0.0325731 , -0.0730036 ,  0.0236489 ,  0.00977082, -0.0183327 ,
          0.0149711 ,  0.037556  , -0.0237786 , -0.00751505,  0.0428431 ,
          0.0536639 ,  0.0155935 , -0.0252742 ,  0.00580531,  0.0109279 ,
          0.0260714 ,  0.00758684,  0.00270356, -0.00502032, -0.0469401 ,
          0.00013093,  0.0832324 ,  0.00744033,  0.0123482 , -0.0226863 ,
          0.0179812 ,  0.0292145 , -0.0426    ,  0.0136503 , -0.00322326,
         -0.0297234 ,  0.0

In [212]:
question_vec = question_to_vec(text_prepare(question), starspace_embeddings, embeddings_dim)

In [213]:
question_vec

array([-2.0612299e-02,  4.6748225e-02,  1.4964900e-02, -3.4771506e-03,
       -4.7224378e-03,  2.8464105e-02,  2.0079199e-02,  2.6631899e-02,
       -1.9289967e-02,  4.1977324e-02, -4.2577991e-03,  4.0161125e-02,
        1.4388918e-02,  1.8963244e-02,  2.0507043e-02,  2.8352968e-02,
        9.5844753e-03,  9.7209662e-03, -8.2302047e-03,  4.4800699e-02,
       -5.1195551e-02, -1.9228872e-02, -2.7145443e-02, -3.5137475e-02,
       -3.0817825e-02,  4.3857872e-02,  2.1068363e-02, -6.4436086e-03,
       -1.0723825e-02, -3.1370450e-02, -2.9865485e-03,  4.9859248e-03,
        5.3725354e-02, -4.9506746e-02, -1.6400397e-02, -2.8188247e-03,
        7.0545077e-04, -1.2939148e-02, -5.7110200e-03, -2.7289215e-02,
        1.6284784e-02,  2.0205500e-02,  1.7725624e-02,  1.4774680e-02,
       -2.6595455e-02, -1.4895603e-02, -2.4808621e-02, -2.1691367e-02,
        1.1398025e-02,  1.6916201e-02, -1.1606254e-03,  2.4657335e-02,
        8.3253654e-03, -1.7750755e-02,  3.0028626e-02, -9.2590041e-04,
      

In [218]:
rank_candidates(question_vec, thread_embeddings, thread_ids)

[(10899, 759216, 0.79782104),
 (129531, 8320834, 0.79054666),
 (25035, 1547594, 0.79034626),
 (111787, 7042156, 0.79034626),
 (99143, 6176386, 0.78155243),
 (119934, 7625437, 0.77325296),
 (176696, 11826936, 0.7710525),
 (2341, 232318, 0.7599995),
 (85743, 5314369, 0.75826097),
 (78710, 4878775, 0.75708705),
 (148630, 9710374, 0.7518618),
 (20007, 1270146, 0.7504495),
 (61449, 3812361, 0.7454943),
 (47194, 2932907, 0.74521184),
 (58569, 3632860, 0.74521184),
 (83545, 5175488, 0.7440789),
 (206540, 14288377, 0.7379978),
 (56542, 3512229, 0.7353431),
 (43664, 2720335, 0.7337319),
 (50980, 3169996, 0.7337319),
 (26339, 1625170, 0.7327568),
 (57702, 3579238, 0.73252034),
 (17702, 1142994, 0.7315666),
 (19180, 1223374, 0.7315666),
 (55450, 3446497, 0.7306223),
 (125280, 8010212, 0.7306223),
 (177012, 11851533, 0.7306223),
 (181502, 12212213, 0.73014116),
 (181366, 12201776, 0.7300775),
 (966, 106329, 0.72969776),
 (2027, 208186, 0.72969776),
 (2906, 274315, 0.72969776),
 (3236, 295387, 0.72