# Естественный отбор: проранжируй комментарии с помощью ML

## Секция Data Science

Помоги команде по работе с данными разработать механизм ранжирования комментариев к постам на основе методов машинного обучения для соцсети ВКонтакте. 

**Задача:** предложить механизм сортировки комментариев к постам по их популярности на основе методов машинного обучения, чтобы модель могла как можно лучше ранжировать пользовательские комментарии.

Этапы:

1. Провести проверку и разведочный анализ данных (EDA) 
2. Используя тренировочную и тестовую выборки датасета, обучить модель, ранжировать текстовые комментарии в порядке их популярности (от популярных к менее популярным) и провести валидацию модели. Выбор модели необходимо аргументировать, основываясь на результатах обучения.
3. Проанализировать полученные результаты и сформулировать полезные инсайты о том, что обычно содержит популярный комментарий, чтобы команда VK могла использовать эту информацию для улучшения комментариев своих пользователей.
4. Предложитть методы взаимодействия с комментаторами, а также механизмы поддержки для разных групп пользователей, включая тех, чьи комментарии непопулярны.

## EDA

In [65]:
# Импортируем необходимые библиотеки
import json
import pandas as pd
import numpy as np

In [13]:
train_raw = pd.read_json('/Users/anastasiamyskina/Downloads/IT/ranking_train.jsonl', lines=True)

In [14]:
train_raw=train_raw.reset_index()
train_raw

Unnamed: 0,index,text,comments
0,0,How many summer Y Combinator fundees decided n...,[{'text': 'Going back to school is not identic...
1,1,CBS acquires last.fm for $280m,[{'text': 'It will be curious to see where thi...
2,2,How Costco Became the Anti-Wal-Mart,[{'text': 'I really hate it when people falsel...
3,3,"Fortune Favors Big Turds | Screw The Money, Th...",[{'text': 'His real point is that something ca...
4,4,StartupWeekend: 70 Founders Create One Company...,[{'text': 'Looks like someone hasn't read The ...
...,...,...,...
88102,88102,Don't upgrade to iOS 8.0.1 or you may experien...,[{'text': 'I had this issue and was able to fi...
88103,88103,Ask HN: How do US HNers get their health insur...,[{'text': 'We use a HSA-qualified high-deducti...
88104,88104,Justin Gordon Using React on Rails,[{'text': 'neat insight! A friend of mine conv...
88105,88105,"iOS 8.0.1 released, broken on iPhone 6 models,...","[{'text': 'Ouch, I feel for whoever let this s..."


В колонке `comments` тексты коментариев `text` с показателем `score` находятся в одном массиве

In [15]:
train_raw['comments'][0]

[{'text': 'Going back to school is not identical with giving up. Some founders go back to school and keep working on the startup while there.  However, those do so much worse than the people who work on the startup full-time that going back to school seems, in practice, not too far removed from a death sentence for a startup.Off the top of my head, I\'d guess we\'ve had about 8 startups where the founders went back to school.  It doesn\'t only happen with summer batches.  Founders from winter batches do it too.Usually the reason is that the startup isn\'t doing very well. However, that judgement depends a lot on how determined the founders are.  One reason we now shy away from funding people still in school is that they often unconsciously want the startup to fail, because the idea of dropping out frightens them.A lot of startups look bad after 3 months.  Someone who\'s out of school and has to make it work or get a job in a cubicle will say "don\'t worry, we\'ll figure out how to make

In [16]:
text_score = pd.DataFrame([y for x in train_raw['comments'] for y in x])
text_score.head()

Unnamed: 0,text,score
0,Going back to school is not identical with giv...,0
1,There will invariably be those who don't see t...,1
2,For me school is a way to be connected to what...,2
3,I guess it really depends on how hungry you ar...,3
4,I know pollground decided to go back to school...,4


In [25]:
posts = train_raw['text'].repeat(5).reset_index()

In [26]:
train=pd.DataFrame()
train['post_text']=posts['text']
train['com_text']=text_score['text']
train['score']=text_score['score']
train

Unnamed: 0,post_text,com_text,score
0,How many summer Y Combinator fundees decided n...,Going back to school is not identical with giv...,0
1,How many summer Y Combinator fundees decided n...,There will invariably be those who don't see t...,1
2,How many summer Y Combinator fundees decided n...,For me school is a way to be connected to what...,2
3,How many summer Y Combinator fundees decided n...,I guess it really depends on how hungry you ar...,3
4,How many summer Y Combinator fundees decided n...,I know pollground decided to go back to school...,4
...,...,...,...
440530,Pay your rent with a Credit or Debit card. No ...,Most major banks offer a service called &#x27;...,0
440531,Pay your rent with a Credit or Debit card. No ...,"It costs 3.25%, or $74.25 for the example of $...",1
440532,Pay your rent with a Credit or Debit card. No ...,As many other comments have pointed out almost...,2
440533,Pay your rent with a Credit or Debit card. No ...,My apartment building uses Yapstone&#x27;s Ren...,3


Теперь тоже самое надо проделать с тестовой выборкой

In [22]:
test_raw = pd.read_json('/Users/anastasiamyskina/Downloads/IT/ranking_test.jsonl', lines=True)
test_raw=test_raw.reset_index()
test_raw

Unnamed: 0,index,text,comments
0,0,"iOS 8.0.1 released, broken on iPhone 6 models,...",[{'text': 'I&#x27;m still waiting for them to ...
1,1,Ask HN: How do US HNers get their health insur...,[{'text': 'Get it from your employer. It&#x27;...
2,2,San Diego Researcher Crowdfunding Patent-Free ...,[{'text': 'What I don&#x27;t understand is why...
3,3,Rethinking the origins of the universe,[{'text': 'I&#x27;m not a physicist. I imagin...
4,4,SlackTextViewController: A new growing text in...,[{'text': 'As someone that doesn&#x27;t do iOS...
...,...,...,...
13999,13999,The cat's miaow,"[{'text': 'Meanwhile in the US, Stubbs has bee..."
14000,14000,Facebook’s Piracy Problem,[{'text': 'A radical idea: Maybe our model of ...
14001,14001,Go GC: Solving the Latency Problem in Go 1.5,[{'text': 'Was the presentation more in-depth ...
14002,14002,Understanding Neural Networks Through Deep Vis...,[{'text': 'Ok now I want to &quot;hear&quot; o...


In [28]:
text_score_test = pd.DataFrame([y for x in test_raw['comments'] for y in x])
post_test = test_raw['text'].repeat(5).reset_index()

In [29]:
test=pd.DataFrame()
test['post_text']=post_test['text']
test['com_text']=text_score_test['text']
test['score']=text_score_test['score']
test

Unnamed: 0,post_text,com_text,score
0,"iOS 8.0.1 released, broken on iPhone 6 models,...",I&#x27;m still waiting for them to stabilize w...,
1,"iOS 8.0.1 released, broken on iPhone 6 models,...","For those who upgraded, no need to do a restor...",
2,"iOS 8.0.1 released, broken on iPhone 6 models,...",Upgraded shortly after it was released and suf...,
3,"iOS 8.0.1 released, broken on iPhone 6 models,...",I think they were under a lot of pressure on t...,
4,"iOS 8.0.1 released, broken on iPhone 6 models,...",Fix for those who already updated: http:&#x2F...,
...,...,...,...
70015,Why does Gmail hate my domain?,I send a LOT of emails each month (email newsl...,
70016,Why does Gmail hate my domain?,I hit a similar problems when sending automate...,
70017,Why does Gmail hate my domain?,That&#x27;s all a bit presumptive and inflamma...,
70018,Why does Gmail hate my domain?,If the domain is bitbin.de and the mail server...,


### Предобработка данных

In [50]:
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
import re
import time

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/anastasiamyskina/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [34]:
def text_filtering(df_dataset):
    df_dataset['filtered_body_text'] = df_dataset.copy()
    #убираем знаки препинания, кроме дефиса
    df_dataset['filtered_body_text'] = df_dataset['filtered_body_text'].apply(lambda x: re.sub('[^a-zA-z0-9\s-]','',x))
    #убираем \n \t
    df_dataset['filtered_body_text'] = df_dataset['filtered_body_text'].apply(lambda x: re.sub('\n|\t',' ',x))
    #убираем лишние пробелы
    df_dataset['filtered_body_text'] = df_dataset['filtered_body_text'].apply(lambda x: re.sub(' +',' ',x))
    # делаем все строчными буквами
    df_dataset['filtered_body_text'] = df_dataset['filtered_body_text'].apply(lambda x: x.lower())
    #уберем ссылки
    df_dataset['filtered_body_text'] = df_dataset['filtered_body_text'].apply(lambda x: re.sub('((www\.[^\s]+)|(https?://[^\s]+))', 'site', x))
    #убираем стоп слова
    stop_words = set(stopwords.words('english'))
    df_dataset['filtered_body_text'] = df_dataset['filtered_body_text'].apply(lambda x: [word for word in x.split() if word not in stop_words])
    df_dataset['filtered_body_text'] = df_dataset['filtered_body_text'].apply(lambda x: ' '.join(x))
    
    return df_dataset

In [38]:
# предобработаем тестовую выборку
train_fltr=pd.DataFrame()
train_fltr['post_text']=text_filtering(train['post_text'])['filtered_body_text']
train_fltr['com_text']=text_filtering(train['com_text'])['filtered_body_text']
train_fltr['score']=train['score']

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_dataset['filtered_body_text'] = df_dataset.copy()# change this back to body-text
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_dataset['filtered_body_text'] = df_dataset.copy()# change this back to body-text


In [41]:
train_fltr.head()

Unnamed: 0,post_text,com_text,score
0,many summer combinator fundees decided continu...,going back school identical giving founders go...,0
1,many summer combinator fundees decided continu...,invariably dont see success set fall back orig...,1
2,many summer combinator fundees decided continu...,school way connected going real world entered ...,2
3,many summer combinator fundees decided continu...,guess really depends hungry much believe produ...,3
4,many summer combinator fundees decided continu...,know pollground decided go back school getting...,4


In [40]:
# предобработаем тестовую выборку
test_fltr=pd.DataFrame()
test_fltr['post_text']=text_filtering(test['post_text'])['filtered_body_text']
test_fltr['com_text']=text_filtering(test['com_text'])['filtered_body_text']

In [55]:
test_fltr.head()

Unnamed: 0,post_text,com_text,text
0,ios 801 released broken iphone 6 models withdrawn,ix27m still waiting stabilize wifi ipad sith i...,ios 801 released broken iphone 6 models withdr...
1,ios 801 released broken iphone 6 models withdrawn,upgraded need restore option-click quotupdateq...,ios 801 released broken iphone 6 models withdr...
2,ios 801 released broken iphone 6 models withdrawn,upgraded shortly released suffered consequence...,ios 801 released broken iphone 6 models withdr...
3,ios 801 released broken iphone 6 models withdrawn,think lot pressure healthkit front one big fla...,ios 801 released broken iphone 6 models withdr...
4,ios 801 released broken iphone 6 models withdrawn,fix already updated httpx2fx2fwwwimorecomx2fio...,ios 801 released broken iphone 6 models withdr...


In [43]:
# функция для стемминга
def str_stemmer(s):
    return " ".join([stemmer.stem(word) for word in s.lower().split()])

In [105]:
# функция для подсчета общих слов
def common_words(row):
    set1 = set(str(row['post_text']).split())
    set2 = set(str(row['com_text']).split())
    return len(set1.intersection(set2))

Соеденим тестовую и тренировочную выборки, чтобы сразу для обоих выборок сделать стемминг и добавлять новые выборки

In [59]:
df_all = pd.concat([train_fltr, test_fltr], axis=0, ignore_index=True) 
# соединим текст поста и текст коментария в одну строку
df_all['text'] = df_all['post_text'] + " " + df_all['com_text']

In [60]:
stemmer = SnowballStemmer('english')

In [61]:
start_time = time.time()


df_all['com_text'] = pd.Series(df_all['com_text'].map(lambda x:str_stemmer(str(x))))
df_all['post_text'] = pd.Series(df_all['post_text'].map(lambda x:str_stemmer(str(x))))
df_all['text'] = pd.Series(df_all['text'].map(lambda x:str_stemmer(str(x))))

# Operation Time:
print("--- %s seconds ---" % (time.time() - start_time))

--- 501.3412606716156 seconds ---


In [62]:
# сохраним данные
#df_all.to_csv(r'/Users/anastasiamyskina/Downloads/IT/df_all_stemmer.csv', index= False )

In [108]:
df_all = pd.read_csv('/Users/anastasiamyskina/Downloads/IT/df_all_stemmer.csv')
df_all

Unnamed: 0,post_text,com_text,score,text
0,mani summer combin funde decid continu startup...,go back school ident give founder go back scho...,0.0,mani summer combin funde decid continu startup...
1,mani summer combin funde decid continu startup...,invari dont see success set fall back origin p...,1.0,mani summer combin funde decid continu startup...
2,mani summer combin funde decid continu startup...,school way connect go real world enter school ...,2.0,mani summer combin funde decid continu startup...
3,mani summer combin funde decid continu startup...,guess realli depend hungri much believ product...,3.0,mani summer combin funde decid continu startup...
4,mani summer combin funde decid continu startup...,know pollground decid go back school get combi...,4.0,mani summer combin funde decid continu startup...
...,...,...,...,...
510550,gmail hate domain,send lot email month email newslett busi - yes...,,gmail hate domain send lot email month email n...
510551,gmail hate domain,hit similar problem send autom intern email go...,,gmail hate domain hit similar problem send aut...
510552,gmail hate domain,thatx27 bit presumpt inflammatori amount pure ...,,gmail hate domain thatx27 bit presumpt inflamm...
510553,gmail hate domain,domain bitbind mail server host server itx27 p...,,gmail hate domain domain bitbind mail server h...


#### Добавим новые признаки

In [109]:
# количество общих слов в посте и в комментарии
df_all['common_words'] = df_all.apply(common_words, axis=1)
# длина поста
df_all['len_of_post'] = df_all['post_text'].map(lambda x:len(str(x).split())).astype(np.int64)
# длина комментария
df_all['len_of_comm'] = df_all['com_text'].map(lambda x:len(str(x).split())).astype(np.int64)
# длина поста и коментария
df_all['len_of_text'] = df_all['text'].map(lambda x:len(str(x).split())).astype(np.int64)
# отношение числа слов в посте к общей длине поста 
df_all['ratio_post'] = df_all['common_words']/df_all['len_of_post']

In [110]:
df_all

Unnamed: 0,post_text,com_text,score,text,common_words,len_of_post,len_of_comm,len_of_text,ratio_post
0,mani summer combin funde decid continu startup...,go back school ident give founder go back scho...,0.0,mani summer combin funde decid continu startup...,6,11,96,107,0.545455
1,mani summer combin funde decid continu startup...,invari dont see success set fall back origin p...,1.0,mani summer combin funde decid continu startup...,3,11,34,45,0.272727
2,mani summer combin funde decid continu startup...,school way connect go real world enter school ...,2.0,mani summer combin funde decid continu startup...,2,11,44,55,0.181818
3,mani summer combin funde decid continu startup...,guess realli depend hungri much believ product...,3.0,mani summer combin funde decid continu startup...,1,11,31,42,0.090909
4,mani summer combin funde decid continu startup...,know pollground decid go back school get combi...,4.0,mani summer combin funde decid continu startup...,5,11,9,20,0.454545
...,...,...,...,...,...,...,...,...,...
510550,gmail hate domain,send lot email month email newslett busi - yes...,,gmail hate domain send lot email month email n...,2,3,82,85,0.666667
510551,gmail hate domain,hit similar problem send autom intern email go...,,gmail hate domain hit similar problem send aut...,0,3,20,23,0.000000
510552,gmail hate domain,thatx27 bit presumpt inflammatori amount pure ...,,gmail hate domain thatx27 bit presumpt inflamm...,0,3,30,33,0.000000
510553,gmail hate domain,domain bitbind mail server host server itx27 p...,,gmail hate domain domain bitbind mail server h...,1,3,42,45,0.333333


Обратно разделим на тренировочную и тестовую выборку

In [111]:
num_train=train_fltr.shape[0]
df_train = df_all.iloc[:num_train]
df_test = df_all.iloc[num_train:]

In [113]:
#df_train.to_csv(r'/Users/anastasiamyskina/Downloads/IT/data_train_novector.csv', index= False )
#df_test.to_csv(r'/Users/anastasiamyskina/Downloads/IT/data_test_novector.csv', index= False )