# GitHub comments sentiment extraction, labeling, and preprocessing via GitHub API and NLTK Vader

The Natural Language Toolkit (NLTK) was developed by Bird et al. NLTK Vader is a rule-based sentiment analyzer trained on movie reviews. It uses two classifiers: Naive Bayes and hierarchical.

Given a text as input, it returns the probabilities of the text being positive, negative, or neutral, together with the resulting label. The positive and negative probabilities add up to 1, while the neutral probability is computed separately. If the neutral probability exceeds 0.5, the text is labeled neutral. Otherwise, the label is positive or negative, whichever has the higher score.
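
The labeling rule described above can be sketched as a small function. This is only an illustration that assumes the rule works exactly as stated; the service computes the label itself, and the function name `label_from_probabilities` and the rounded scores are made up for the example.

```python
def label_from_probabilities(neg, neutral, pos):
    """Apply the rule described above: neutral wins above 0.5, else the higher polarity."""
    if neutral > 0.5:
        return "neutral"
    return "pos" if pos > neg else "neg"

# Rounded, illustrative scores (not taken from a real API call)
print(label_from_probabilities(neg=0.33, neutral=0.36, pos=0.67))
print(label_from_probabilities(neg=0.38, neutral=0.63, pos=0.62))
```

The first call falls through to the polarity comparison, the second is labeled neutral even though its positive score is higher than its negative one.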

We call this tool through the API provided by text-processing.com.

# Via $curl$

In [None]:
! curl -H 'Accept: application/vnd.github.v3+json' https://api.github.com/repos/octocat/hello-world/issues/comments

[
  {
    "url": "https://api.github.com/repos/octocat/Hello-World/issues/comments/1146825",
    "html_url": "https://github.com/octocat/Hello-World/pull/2#issuecomment-1146825",
    "issue_url": "https://api.github.com/repos/octocat/Hello-World/issues/2",
    "id": 1146825,
    "node_id": "MDEyOklzc3VlQ29tbWVudDExNDY4MjU=",
    "user": {
      "login": "mattstifanelli",
      "id": 783382,
      "node_id": "MDQ6VXNlcjc4MzM4Mg==",
      "avatar_url": "https://avatars.githubusercontent.com/u/783382?v=4",
      "gravatar_id": "",
      "url": "https://api.github.com/users/mattstifanelli",
      "html_url": "https://github.com/mattstifanelli",
      "followers_url": "https://api.github.com/users/mattstifanelli/followers",
      "following_url": "https://api.github.com/users/mattstifanelli/following{/other_user}",
      "gists_url": "https://api.github.com/users/mattstifanelli/gists{/gist_id}",
      "starred_url": "https://api.github.com/users/mattstifanelli/starred{/owner}{/repo}",
     

In [None]:
! curl -d "text=Let's try again via this awesome Issue tacker... \n" http://text-processing.com/api/sentiment/

{"probability": {"neg": 0.32924532949128682, "neutral": 0.36393938724832536, "pos": 0.67075467050871318}, "label": "pos"}

# Via Python $requests$ lib

In [None]:
! pip install requests



In [None]:
import requests

In [None]:
query = {'text':"Let's try again via awesome Issue tacker...\n"}
response = requests.post("http://text-processing.com/api/sentiment/", data=query)
print(response.json())

{'probability': {'neg': 0.31451181582793397, 'neutral': 0.33122764068566357, 'pos': 0.685488184172066}, 'label': 'pos'}


# GitHub comments extraction


Import the Python libraries

In [None]:
import requests
import matplotlib.pyplot as plt

In [None]:
import csv
dataset_file = open('dataset.csv', 'a', newline='')
writer = csv.writer(dataset_file)
# writer.writerow(["ID", "Comment", "Repository Name", "Repository Owner", "Negative Probability", "Neutral Probability", "Positive Probability", "Label"])

In [None]:
def write_to_csv(csv_row):
    writer.writerow(csv_row)

In [None]:
# write_to_csv([1, "comment", "repository", "owner", "0.33", "0.33", "0.34", "pos"])

Implement a helper function that analyzes one page of GitHub comments.
The following parameters are used:

*   $username$ - GitHub alias of the repository owner;
*   $repo$ - GitHub repository name;
*   $per\_page$ - number of comments per page (at most 100);
*   $page$ - page number of the results to fetch;
*   $print\_comments$ - boolean flag. If it is set to True, each fetched comment and its analysis are printed;
*   $print\_stage\_results$ - boolean flag. If it is set to True, the final statistics of the analyzed comments are printed at the end;
*   $is\_write\_to\_csv$ - boolean flag. If it is set to True, each analyzed comment is written to the CSV dataset.



In [None]:
# using NLTK

def analyze_comments_page(username, repo, per_page, page, print_comments, print_stage_results, is_write_to_csv):
  total = 0
  pos = 0
  neg = 0
  neut = 0

  print("Processing page #"+str(page)+"...\n")
  query={'per_page': per_page, 'page': page}
  resp = requests.get("https://api.github.com/repos/"+username+"/"+repo+"/issues/comments", params=query)
  comments = resp.json()

  for comment in comments:
    total=total+1
    body = comment.get("body")
    if print_comments:
      print(str(total) + '. ' + body)

    query = {'text' : body}
    response = requests.post("http://text-processing.com/api/sentiment/", data=query)
    result = response.json()  # parse the response once and reuse it below
    if print_comments:
      print(result)
      print('\n')

    sentiment = result.get("label")
    if sentiment=='pos':
      pos=pos+1
    elif sentiment=='neg':
      neg=neg+1
    else:
      neut=neut+1

    if is_write_to_csv:
      probability = result.get("probability")
      csv_data = [total, body, repo, username, probability.get("neg"), probability.get("neutral"), probability.get("pos"), sentiment]
      write_to_csv(csv_data)
  
  if print_stage_results:
    print('Processed: '+str(total))
    print('Positive comments: '+str(pos))
    print('Negative comments: '+str(neg))
    print('Neutral comments: '+str(neut))

  return total, pos, neg, neut

Implement the final function to be used. It analyzes the given number of comments in the given repository. The following parameters are used:

*   $username$ - GitHub alias of the repository owner;
*   $repo$ - GitHub repository name;
*   $comments\_to\_process$ - number of comments to be fetched;
*   $print\_comments$ - boolean flag. If it is set to True, each fetched comment and its analysis are printed;
*   $print\_stage\_results$ - boolean flag. If it is set to True, statistics of the analyzed comments are printed at each stage (for each fetched page);
*   $is\_write\_to\_csv$ - boolean flag. If it is set to True, each analyzed comment is written to the CSV dataset.

The function returns a tuple with the total number of fetched comments and the numbers of positive, negative, and neutral comments. If the repository has fewer than $comments\_to\_process$ comments, all available comments are processed.

In [None]:
def analyze_comments(username, repo, comments_to_process, print_comments, print_stage_results, is_write_to_csv):
  total = 0
  pos = 0
  neg = 0
  neut = 0
  page = 1
  temp = comments_to_process

  while True:
    if comments_to_process <= 0:
      print("Finishing...\n")
      break
    if comments_to_process <= 100:
      total, pos, neg, neut = map(lambda x: x[0]+x[1], zip((total, pos, neg, neut), analyze_comments_page(username, repo, comments_to_process, page, print_comments, print_stage_results, is_write_to_csv)))
      print("Processed in total: "+str(total)+"/"+str(temp)+"\n")
      break
    else:
      total, pos, neg, neut = map(lambda x: x[0]+x[1], zip((total, pos, neg, neut), analyze_comments_page(username, repo, 100, page, print_comments, print_stage_results, is_write_to_csv)))
      print("Currently processed: "+str(total)+"/"+str(temp)+"\n")
      page += 1
      comments_to_process -= 100
  
  return total, pos, neg, neut
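
GitHub's page-based pagination derives the offset of a page from `per_page`: page $N$ starts at item $(N-1) \cdot per\_page$. Requesting the last page with a smaller `per_page` than the earlier pages therefore shifts the window back over comments that were already fetched, unless the requested total is a multiple of the page size (as with the 500 used in this notebook). One possible way around this, sketched here under the assumption that the caller truncates the last page itself, is to keep `per_page` fixed and only decide how many comments of each page to keep. The helper name `page_plan` is hypothetical.

```python
def page_plan(comments_to_process, per_page=100):
    """Return (page, take) pairs: the page to request with a fixed per_page,
    and how many comments of that page to keep."""
    plan = []
    page = 1
    remaining = comments_to_process
    while remaining > 0:
        take = min(per_page, remaining)
        plan.append((page, take))
        remaining -= take
        page += 1
    return plan

print(page_plan(250))
```

For 250 comments this yields three requests, all with `per_page=100`, where only the first 50 comments of the third page are kept.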

In [None]:
def print_comments_analysis(username, repo, comments_to_process, print_comments, print_stage_results, is_write_to_csv):
  total, pos, neg, neut = analyze_comments(username, repo, comments_to_process, print_comments, print_stage_results, is_write_to_csv)

  print('Total processed: '+str(total))
  print('Positive comments: '+str(pos))
  print('Negative comments: '+str(neg))
  print('Neutral comments: '+str(neut)+'\n')

  labels = 'Positive\n'+str(pos), 'Negative\n'+str(neg), 'Neutral\n'+str(neut)
  sizes = [pos, neg, neut]
  maxc = max(pos, neg, neut)

  if maxc == neut:
      res = "neutral"
      explode = (0, 0, 0.1)
  elif maxc == pos:
      res = "positive"
      explode = (0.1, 0, 0)
  else:
      res = "negative. Some measures should be considered"
      explode = (0, 0.1, 0)

  fig1, ax1 = plt.subplots()
  ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
          shadow=True, startangle=90)
  ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

  plt.show()
  print("Communication is mostly "+res+".\n")

Collect and analyze 500 comments from each repository.

In [None]:
print_comments_analysis("apache", "airflow", 500, True, True, True)

In [None]:
print_comments_analysis("mitmproxy", "mitmproxy", 500, True, True, True)

In [None]:
print_comments_analysis("hpcaitech", "ColossalAI", 500, True, True, True)

In [None]:
print_comments_analysis("mikecao", "umami", 500, True, True, True)

In [None]:
print_comments_analysis("type-challenges", "type-challenges", 500, True, True, True)

In [None]:
print_comments_analysis("flutter", "flutter", 500, True, True, True)

In [None]:
print_comments_analysis("PaddlePaddle", "PaddleNLP", 500, True, True, True)

In [None]:
print_comments_analysis("benbjohnson", "litestream", 500, True, True, True)

In [None]:
print_comments_analysis("NvChad", "NvChad", 500, True, True, True)

In [None]:
print_comments_analysis("Vonng", "ddia", 500, True, True, True)

In [None]:
print_comments_analysis("withfig", "autocomplete", 500, True, True, True)

In [None]:
print_comments_analysis("mantinedev", "mantine", 500, True, True, True)

In [None]:
dataset_file.close()

with open('dataset.csv', 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)

dataset_file.close()

# Data preprocessing


First, we import necessary libraries

In [None]:
import requests
import numpy as np
import csv
import re
import nltk
# NLTK resources (punkt, stopwords) are downloaded in the cells that use them

## Split into sentences

Load the NLTK Punkt sentence tokenizer and open the extracted dataset for reading

In [None]:
nltk.download("punkt")
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [None]:
dataset_file.close()

data = []

with open('original.csv', 'r') as file:
  reader = csv.reader(file, quoting=csv.QUOTE_ALL)
  for row in reader:
    data.append(row)

for row in data:
  print(row)

Create a new CSV file for the sentence-split dataset

In [None]:
dataset_file = open('natural.csv', 'w', newline='')
writer = csv.writer(dataset_file)
write_to_csv(data[0])

In [None]:
print(data[0])

### Code removal

Code parts are separated from the text with ```, so they can be easily removed with a regex substitution:

In [None]:
for comment in data[1:]:
  # Non-greedy match, so that text between two separate code blocks is kept
  comment[1] = re.sub(r"```.*?```", "", comment[1], flags=re.DOTALL)

In [None]:
for row in data[1:]:
  print(row)
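
Whether the quantifier between the two fences is greedy matters when a comment contains more than one code block: a greedy match runs from the first fence to the last one and also removes the prose in between. A small comparison on a made-up comment (the sample text is an assumption, not taken from the dataset):

```python
import re

text = "Before\n```\ncode A\n```\nbetween\n```\ncode B\n```\nafter"

# Greedy: everything from the first fence to the last fence is removed
greedy = re.sub(r"```(.|\n)*```", "", text)
# Non-greedy with DOTALL: each fenced block is removed on its own
lazy = re.sub(r"```.*?```", "", text, flags=re.DOTALL)

print(repr(greedy))
print(repr(lazy))
```

Only the non-greedy variant keeps the word "between" while still dropping both code blocks.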

### Split

Split each comment into sentences and write them to the `natural.csv` dataset

In [None]:
counter = 1

for comment in data[1:]:
  body = comment[1]
  for sentence in tokenizer.tokenize(body):
    write_to_csv([counter, sentence] + comment[2:])
    counter+=1
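
The loop above copies every column after the comment text onto each sentence, so all sentences of one comment share that comment's probabilities and label. A minimal sketch of this row expansion, using a naive regex splitter as a simplified stand-in for the Punkt tokenizer (the `naive_split` and `expand_rows` helpers and the sample row are illustrative assumptions):

```python
import re

def naive_split(text):
    # Simplified stand-in for tokenizer.tokenize:
    # split on whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def expand_rows(rows):
    """Yield one row per sentence; columns 2+ are copied unchanged from the comment row."""
    counter = 1
    for row in rows:
        for sentence in naive_split(row[1]):
            yield [counter, sentence] + row[2:]
            counter += 1

# Made-up comment row in the dataset's column order
sample = [["1", "Thanks! Please add tests.", "repo", "owner", "0.3", "0.2", "0.7", "pos"]]
for new_row in expand_rows(sample):
    print(new_row)
```

Both resulting rows carry the label `pos`, although it was computed for the whole comment rather than for either sentence.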

In [None]:
dataset_file.close()

with open('natural.csv', 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)

dataset_file.close()

Output was truncated to the last 5000 lines.
['7242', "But it doesn't seem to be in the package.json by default, right?", 'amplication', 'amplication', '0.732041524437338', '0.19500667220831436', '0.26795847556266195', 'neg']
['7243', 'It is not in the packge.json but probably a dependency of another package.', 'amplication', 'amplication', '0.732041524437338', '0.19500667220831436', '0.26795847556266195', 'neg']
['7244', "From the image you've attached, it looks like a dependency of react-scripts... but still you should have not installed it manually.", 'amplication', 'amplication', '0.732041524437338', '0.19500667220831436', '0.26795847556266195', 'neg']
['7245', 'Usually in this type of error, deleting the entire node_modules folder helps before running `npm i` again', 'amplication', 'amplication', '0.732041524437338', '0.19500667220831436', '0.26795847556266195', 'neg']
['7246', 'Thanks!', 'amplication', 'amplication', '0.3224228363291759', '0.3

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



Output was truncated to the last 5000 lines.
['30107', 'FWIW I am -1 on additional settings.', 'django', 'django', '0.620866952701133', '0.11495825485319973', '0.37913304729886693', 'neg']
['30108', 'Please open a ticket and add tests for this change and then create a new pull request.', 'django', 'django', '0.3847681651768602', '0.6304817918289369', '0.6152318348231398', 'neutral']
['30109', 'Please open a ticket to discuss a new feature first.', 'django', 'django', '0.4010594319980103', '0.597533629761093', '0.5989405680019897', 'neutral']
['30110', 'This patch also misses tests and docs.', 'django', 'django', '0.4010594319980103', '0.597533629761093', '0.5989405680019897', 'neutral']
['30111', 'Once the ticket is accepted you can open a new pull request.', 'django', 'django', '0.4010594319980103', '0.597533629761093', '0.5989405680019897', 'neutral']
['30112', 'Fixed in 3afb5916b215c79e36408b729c9516bc435f5cb7, thx for your work on this!', 'djang

## Preprocessing


Open the split dataset for reading

In [None]:
data = []

with open('natural.csv', 'r') as file:
  reader = csv.reader(file, quoting=csv.QUOTE_ALL)
  for row in reader:
    data.append(row)

In [None]:
for row in data:
  print(row)

Output was truncated to the last 5000 lines.
['9177', 'I think that ownCloud should be moved to groupware.', 'awesome-selfhosted', 'awesome-selfhosted', '0.3952084502391142', '0.3444060695128813', '0.6047915497608858', 'pos']
['9178', 'We can do this along side the split of File Sharing and Synchronization (#161).', 'awesome-selfhosted', 'awesome-selfhosted', '0.3952084502391142', '0.3444060695128813', '0.6047915497608858', 'pos']
['9179', 'Added in 22ba737.', 'awesome-selfhosted', 'awesome-selfhosted', '0.42608683312512197', '0.6612577094719252', '0.573913166874878', 'neutral']
['9180', 'Added in f89717b', 'awesome-selfhosted', 'awesome-selfhosted', '0.42608683312512197', '0.6612577094719252', '0.573913166874878', 'neutral']
['9181', "> moved to groupware\n\nbut owncloud also has file sync (it's main feature).", 'awesome-selfhosted', 'awesome-selfhosted', '0.2962803199581182', '0.4014145035129629', '0.7037196800418818', 'pos']
['9182', 'I have move




Create a new CSV file for the preprocessed dataset

In [None]:
# dataset_file = open('preprocessed.csv', 'w', newline='')
# writer = csv.writer(dataset_file)
# write_to_csv(data[0][:4])

In [None]:
dataset_file = open('codeless.csv', 'w', newline='')
writer = csv.writer(dataset_file)
write_to_csv(data[0])

In [None]:
print(data[0])

['ID', 'Comment', 'Repository Name', 'Repository Owner', 'Negative Probability', 'Neutral Probability', 'Positive Probability', 'Label']


In [None]:
import re

### Emails, links, and usernames removal

Remove links with Imme Emosol regex

In [None]:
for comment in data[1:]:
  comment[1] = re.sub(r"(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:/[^\s]*)?", "", comment[1])

Remove emails with General Email Regex (RFC 5322 Official Standard)

In [None]:
for comment in data[1:]:
  comment[1] = re.sub(r"(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|\"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*\")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])", "", comment[1])

Remove usernames using the `\B@([A-Za-z0-9][A-Za-z0-9-]+)` regex

In [None]:
for comment in data[1:]:
  comment[1] = re.sub(r"\B@([A-Za-z0-9][A-Za-z0-9-]+)", "", comment[1])
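
The leading `\B` in the username pattern only allows a match where `@` is not preceded by a word character, so mentions at the start of a text or after a space are removed, while an `@` inside an address-like token is left alone. A short sketch on made-up strings (the `strip_mentions` helper and the samples are assumptions for illustration):

```python
import re

MENTION = re.compile(r"\B@([A-Za-z0-9][A-Za-z0-9-]+)")

def strip_mentions(text):
    # Remove @username mentions; the surrounding whitespace is kept
    return MENTION.sub("", text)

print(repr(strip_mentions("@octocat can we close this one?")))
print(repr(strip_mentions("Contact me at dev@example.com")))
```

The first string loses the mention but keeps its leading space, which matches the rows printed below that start with a space; the second string is returned unchanged.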

In [None]:
for row in data[1:]:
  print(row)

Output was truncated to the last 5000 lines.
['3150', "Thanks  \nLooking at your app I see one additional entity besides the two that appear in your screenshot.\nThis additional entity was deleted by you and that's why is not in the list.\n\nIf you didn't commit the changes, you may be able to see the deleted entity on the pending changes page.\n\n\n ", 'amplication', 'amplication', '0.3659329013255319', '0.35156428176865084', '0.6340670986744681', 'pos']
['3151', ' can we close this one?', 'amplication', 'amplication', '0.5785656028859611', '0.3874765414228363', '0.42143439711403885', 'neg']
['3152', ' can we close this one?', 'amplication', 'amplication', '0.5785656028859611', '0.3874765414228363', '0.42143439711403885', 'neg']
['3153', ' thank you for your first contribution! \n', 'amplication', 'amplication', '0.2864667203699012', '0.2681070278111735', '0.7135332796300988', 'pos']
['3154', ' sounds like a good idea. We will discuss this request 




### Non-English words and emojis removal

Remove emojis by matching their Unicode ranges with the `re` library

In [None]:
emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002500-\U00002BEF"  # box drawing through miscellaneous symbols and arrows
                               u"\U00002702-\U000027B0"  # dingbats
                               u"\U000024C2-\U0001F251"  # broad range that also covers CJK characters
                               u"\U0001f926-\U0001f937"
                               u"\U00010000-\U0010ffff"  # all supplementary-plane characters
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u200d"  # zero-width joiner
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\ufe0f"  # variation selector-16
                               u"\u3030"
                               "]+", flags=re.UNICODE)

for comment in data[1:]:
  comment[1] = emoji_pattern.sub(r'', comment[1])
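
Two of the ranges in the character class are very broad: `\U000024C2-\U0001F251` spans the CJK blocks and `\U00010000-\U0010ffff` spans every supplementary-plane character. The pattern therefore removes Chinese, Japanese, and Korean text along with emojis, while scripts below U+24C2 (such as Cyrillic or accented Latin) are left in place. A condensed sketch with only those two ranges (the name `broad_pattern` and the sample string are made up):

```python
import re

# Only the two broadest ranges of the full pattern, for illustration
broad_pattern = re.compile("[\U000024C2-\U0001F251\U00010000-\U0010ffff]+")

sample = "Thanks 👍 谢谢 спасибо"
cleaned = broad_pattern.sub("", sample)
print(repr(cleaned))
```

The emoji and the Chinese word disappear, the Cyrillic word stays, and the spaces that surrounded the removed characters remain in the string.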

In [None]:
for row in data[1:]:
  print(row)

Output was truncated to the last 5000 lines.
['9825', 'extraProps getValue setValue', 'lowcode-engine', 'alibaba', '0.5095519989067516', '0.5950890614040548', '0.4904480010932484', 'neutral']
['9826', ' clone ', 'lowcode-engine', 'alibaba', '0.5095519989067516', '0.5950890614040548', '0.4904480010932484', 'neutral']
['9827', 'schema\nuseImperativeHandle', 'lowcode-engine', 'alibaba', '0.5095519989067516', '0.5950890614040548', '0.4904480010932484', 'neutral']
['9828', '[!', 'lowcode-engine', 'alibaba', '0.4413525491033413', '0.11498424921866367', '0.5586474508966587', 'pos']
['9829', '[CLA assistant check]( <br/>Thank you for your submission!', 'lowcode-engine', 'alibaba', '0.4413525491033413', '0.11498424921866367', '0.5586474508966587', 'pos']
['9830', 'We really appreciate it.', 'lowcode-engine', 'alibaba', '0.4413525491033413', '0.11498424921866367', '0.5586474508966587', 'pos']
['9831', 'Like many open source projects, we ask that you sign our 




['34546', 'The statistics that is computed in\n> > your implementation may be sufficient for some users, but insufficient for\n> > others, e.g.', 'opencv', 'opencv', '0.5333014142170851', '0.17550771802532453', '0.46669858578291484', 'neg']
['34547', 'J. Matas et al in "Real-time scene text localization and\n> > recognition" consider different, in my opinion very useful characteristics\n> > like the perimeter, number of holes, number of zero-crossings, number of\n> > inflections etc.', 'opencv', 'opencv', '0.5333014142170851', '0.17550771802532453', '0.46669858578291484', 'neg']
['34548', '> > \n> > Managing a fast implementation of all of the statistics is a hard thing to\n> > do.', 'opencv', 'opencv', '0.5333014142170851', '0.17550771802532453', '0.46669858578291484', 'neg']
['34549', 'With templates, as I do in the code, you can redifine the functor used\n> > for statistics computation and pay for what you use.', 'opencv', 'opencv', '0.5333014142170851', '0.17550771802532453', '0.46

### Punctuation and stopwords removal

Import the NLTK word tokenizer, then tokenize each comment and keep only alphanumeric tokens

In [None]:
nltk.download("punkt")
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
for comment in data[1:]:
  comment[1] = word_tokenize(comment[1])
  comment[1] = [word for word in comment[1] if word.isalnum()]

In [None]:
nltk.download("stopwords")
print(nltk.corpus.stopwords.words('english'))
stop_words = set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'bo

Remove stopwords

In [None]:
for comment in data[1:]:
  comment[1] = [w for w in comment[1] if not w.lower() in stop_words]
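
`str.isalnum()` keeps a token only if every character is a letter or a digit, so besides punctuation it also drops tokens such as `node_modules` or `react-scripts`. A minimal sketch of the two filters on a hand-written token list (the tokens and the four-word stop list are assumptions standing in for `word_tokenize` output and the NLTK stopword set):

```python
# Hand-written tokens standing in for word_tokenize output
tokens = ["Please", "delete", "the", "node_modules", "folder", ",",
          "then", "run", "it", "again", "!"]
stop_words = {"the", "then", "it", "again"}  # hypothetical subset of the NLTK list

# Step 1: drop punctuation and any token containing non-alphanumeric characters
alnum_tokens = [t for t in tokens if t.isalnum()]
# Step 2: drop stopwords, compared case-insensitively
kept = [t for t in alnum_tokens if t.lower() not in stop_words]
print(kept)
```

The original capitalization of the surviving tokens is preserved, because lowercasing is applied only for the stopword lookup.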

In [None]:
for row in data[1:]:
  print(row)

Output was truncated to the last 5000 lines.
['10883', ['options', 'modular'], 'autocomplete', 'withfig', '0.5683677613587677', '0.29446867045533903', '0.4316322386412324', 'neg']
['10884', [], 'autocomplete', 'withfig', '0.5683677613587677', '0.29446867045533903', '0.4316322386412324', 'neg']
['10885', ['instead', 'run', 'run'], 'autocomplete', 'withfig', '0.5683677613587677', '0.29446867045533903', '0.4316322386412324', 'neg']
['10886', ['checks', 'passed'], 'autocomplete', 'withfig', '0.5683677613587677', '0.29446867045533903', '0.4316322386412324', 'neg']
['10887', ['Please', 'add', 'reaction', 'comment', 'show', 'read'], 'autocomplete', 'withfig', '0.5683677613587677', '0.29446867045533903', '0.4316322386412324', 'neg']
['10888', ['id', 'greetingComment', 'Hello', 'thank', 'much', 'creating', 'Pull', 'Request'], 'autocomplete', 'withfig', '0.5683677613587677', '0.29446867045533903', '0.4316322386412324', 'neg']
['10889', ['small', 'checklist', 




['34581', ['something', 'encounter', 'place', 'unrelated', 'context', 'code', 'way', 'dealt'], 'opencv', 'opencv', '0.5333014142170851', '0.17550771802532453', '0.46669858578291484', 'neg']
['34582', ['believe', 'approach', 'slow', 'one', 'one', 'meaningful', 'amount', 'time'], 'opencv', 'opencv', '0.5333014142170851', '0.17550771802532453', '0.46669858578291484', 'neg']
['34583', ['flags', 'parameter', 'regulate', 'exactly', 'needs', 'computed'], 'opencv', 'opencv', '0.5333014142170851', '0.17550771802532453', '0.46669858578291484', 'neg']
['34584', ['0', 'means', 'statistics'], 'opencv', 'opencv', '0.5333014142170851', '0.17550771802532453', '0.46669858578291484', 'neg']
['34585', ['could', 'mean', 'adding', 'stuff', 'etc'], 'opencv', 'opencv', '0.5333014142170851', '0.17550771802532453', '0.46669858578291484', 'neg']
['34586', ['Predefined', 'groups', 'statistics', 'way', 'really', 'manage', 'complexity', 'implementation', 'though', 'still', 'leaves', 'custom', 'statistics', 'cold',

### Preprocessing result

Write the resulting dataset to the CSV file

In [None]:
for row in data[1:]:
  write_to_csv(row[:1]+[" ".join(row[1])]+row[2:])

dataset_file.close()
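
Each row at this point holds a token list in its second column, so the list is joined back into a space-separated string before writing. A small sketch of that step, written against an in-memory buffer inside a `with` block so that the handle is closed automatically (the sample row, the `io.StringIO` buffer, and the name `buffer_writer` are illustrative assumptions; the notebook itself writes to `codeless.csv`):

```python
import csv
import io

# Made-up row with a token list in the comment column
rows = [["1", ["checks", "passed"], "autocomplete", "withfig", "0.57", "0.29", "0.43", "neg"]]

with io.StringIO(newline="") as buffer:
    buffer_writer = csv.writer(buffer)
    for row in rows:
        # Join the tokens back into one string; the other columns are unchanged
        buffer_writer.writerow(row[:1] + [" ".join(row[1])] + row[2:])
    output = buffer.getvalue()

print(output)
```

The same pattern works with `open('codeless.csv', 'w', newline='')` in place of the buffer.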

In [None]:
with open('codeless.csv', 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)

Output was truncated to the last 5000 lines.
['11673', 'instead checks passed', 'autocomplete', 'withfig', '0.5701471131910464', '0.2756931360026104', '0.4298528868089536', 'neg']
['11674', 'Please add reaction comment show read', 'autocomplete', 'withfig', '0.5701471131910464', '0.2756931360026104', '0.4298528868089536', 'neg']
['11675', 'provides following tools simplify building integration cli cli commands commands produce JSON array dump internal knowledge registered sub commands registered global parameters respectively', 'autocomplete', 'withfig', '0.23485602270324524', '0.6606186727716427', '0.7651439772967548', 'neutral']
['11676', 'dump already includes active extensions via package manager', 'autocomplete', 'withfig', '0.23485602270324524', '0.6606186727716427', '0.7651439772967548', 'neutral']
['11677', 'cli command command lets provide string executed full partial command well position cursors return list available autocomplete options'




Output was truncated to the last 5000 lines.
['30107', 'FWIW additional settings', 'django', 'django', '0.620866952701133', '0.11495825485319973', '0.37913304729886693', 'neg']
['30108', 'Please open ticket add tests change create new pull request', 'django', 'django', '0.3847681651768602', '0.6304817918289369', '0.6152318348231398', 'neutral']
['30109', 'Please open ticket discuss new feature first', 'django', 'django', '0.4010594319980103', '0.597533629761093', '0.5989405680019897', 'neutral']
['30110', 'patch also misses tests docs', 'django', 'django', '0.4010594319980103', '0.597533629761093', '0.5989405680019897', 'neutral']
['30111', 'ticket accepted open new pull request', 'django', 'django', '0.4010594319980103', '0.597533629761093', '0.5989405680019897', 'neutral']
['30112', 'Fixed 3afb5916b215c79e36408b729c9516bc435f5cb7 thx work', 'django', 'django', '0.4048156291591041', '0.4195603693305412', '0.5951843708408959', 'pos']
['30113', 'Plea