In [1]:
import pandas as pd
import re
import numpy as np
import pathlib

In [2]:
DATA_DIR = pathlib.Path.cwd().parent / 'data'

questions = pd.read_csv(DATA_DIR / 'Questions.csv',encoding="ISO-8859-1")
tags = pd.read_csv(DATA_DIR / 'Tags.csv',encoding="ISO-8859-1")

In [3]:
print(f"questions: length: {len(questions)} \n columns: {questions.columns}")
print(f"tags: length: {len(tags)} \n columns: {tags.columns}")

questions: length: 1264216 
 columns: Index(['Id', 'OwnerUserId', 'CreationDate', 'ClosedDate', 'Score', 'Title',
       'Body'],
      dtype='object')
tags: length: 3750994 
 columns: Index(['Id', 'Tag'], dtype='object')


In [4]:
# First, I will join the questions dataset with the tags dataset.
# Only Id and Body are needed for that, so the other columns are dropped.
questions = questions.drop(columns=['OwnerUserId','CreationDate','ClosedDate','Score','Title'])

In [5]:
# 37034 distinct tags. Keep only the 150 most frequent ones.
tag_counts = tags['Tag'].value_counts()
top_150_tags = tag_counts.head(150).index.tolist()
filtered_tags = tags[tags['Tag'].isin(top_150_tags)]
filtered_tags.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1931957 entries, 1 to 3750989
Data columns (total 2 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   Id      int64 
 1   Tag     object
dtypes: int64(1), object(1)
memory usage: 44.2+ MB
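Going by the outputs above, the filter keeps 1,931,957 of the 3,750,994 tag rows, roughly 51%. The sketch below shows one way to measure that share for any cutoff before committing to it. The `toy_tags` frame and the `top_n_coverage` helper are hypothetical and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the Tags table (hypothetical data, same two columns).
toy_tags = pd.DataFrame({
    "Id":  [1, 1, 2, 3, 3, 4],
    "Tag": ["python", "pandas", "python", "java", "python", "perl"],
})


def top_n_coverage(tags_df: pd.DataFrame, n: int) -> float:
    """Return the fraction of tag rows whose tag is among the n most frequent."""
    counts = tags_df["Tag"].value_counts()  # sorted by frequency, descending
    return counts.head(n).sum() / counts.sum()


# "python" appears in 3 of the 6 rows, so the top-1 cutoff covers half of them.
coverage = top_n_coverage(toy_tags, 1)
print(f"top-1 coverage: {coverage:.2f}")
```

Running the same helper on the real `tags` frame with a few values of `n` would show how much is gained by raising the cutoff.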


In [6]:
t = filtered_tags.copy()
t['Tag'] = t['Tag'].apply(lambda x : str(x))

In [7]:
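# Id is the only shared column; an inner join keeps only questions with at least one retained tag
complete = questions.merge(t, on='Id', how='inner')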

In [8]:
complete = complete.groupby(['Id','Body'])['Tag'].apply(list).reset_index()

In [9]:
complete

Unnamed: 0,Id,Body,Tag
0,80,<p>I've written a database generation script i...,[actionscript-3]
1,120,<p>Has anyone got experience creating <strong>...,"[sql, asp.net]"
2,180,<p>This is something I've pseudo-solved many t...,[algorithm]
3,260,<p>I have a little game written in C#. It uses...,"[c#, .net]"
4,330,<p>I am working on a collection of classes use...,"[c++, oop, class]"
...,...,...,...
1098639,40143170,<p>I am trying to assign a Product model for s...,[laravel]
1098640,40143190,<p>I need to extend a shell script (bash). As ...,"[python, bash]"
1098641,40143210,<p>I am building a custom MVC project and I ha...,"[php, .htaccess]"
1098642,40143340,<p>Under minifyEnabled I changed from false to...,"[android, android-studio]"
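The merge and the groupby together turn one row per (question, tag) pair into one row per question with a list of tags. Because the merge is an inner join, questions with no retained tag disappear. That is why the result has fewer rows than the 1,264,216 questions loaded. Below is a minimal sketch on toy data. The frames are hypothetical, and grouping by `Id` alone with named aggregation is an alternative to the notebook's `groupby(['Id','Body'])`, not the code that was run.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
toy_questions = pd.DataFrame({
    "Id":   [1, 2, 3],
    "Body": ["<p>a</p>", "<p>b</p>", "<p>c</p>"],
})
toy_tags = pd.DataFrame({
    "Id":  [1, 1, 3],
    "Tag": ["python", "pandas", "sql"],
})

# Inner join: question 2 has no tag row, so it is dropped.
merged = toy_questions.merge(toy_tags, on="Id", how="inner")

# One row per question; Body is constant within a group, so "first" is enough.
grouped = merged.groupby("Id", as_index=False).agg(
    Body=("Body", "first"),
    Tag=("Tag", list),
)
print(grouped)
```

Grouping by `Id` only avoids using the long `Body` strings as part of the group key, which may be cheaper on a dataset of this size.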


In [10]:
exp1 = np.random.randint(0,len(complete))
exp2 = np.random.randint(0,len(complete))
print(f"Random examples of the Body:\n Tag:{complete['Tag'][exp1]} \n {complete['Body'][exp1]}  \n\n\n  Tag:{complete['Tag'][exp2]} \n {complete['Body'][exp2]}")

Random examples of the Body:
 Tag:['bash', 'shell'] 
 <p>Below is my pre-commit git hoook</p>

<pre><code>#!/bin/bash

....
# if git diff -U0 "$FILE_PATH" | grep -iq 'todo'; # Double quoting $FILE_PATH doesnt' change anything
if git diff -U0 $FILE_PATH | grep -iq 'todo';
then
    echo $FILE_PATH ' -&gt; Contains TODO'
    exit 1

else
    echo 'nooooooooooooooooooooooooooooooooooo'
fi
</code></pre>

<p>I'm always getting the <code>noooooooooooooooooooo</code> message, however the command below, tried directly on my terminal, works well:</p>

<pre><code>git diff -U0 my/file/path.php | grep -iq 'todo' &amp;&amp; echo 'true' || echo 'false'
</code></pre>

<p>Output</p>

<pre><code>true
</code></pre>

<p><strong>UPDATE</strong></p>

<p>When running <code>bash .git/hooks/pre-commit</code> it works, very strange!!</p>

<p><strong>FYI</strong></p>

<p>I don't know if it's an important information but .git/hooks/pre-commit is a symbolik link</p>
  


  Tag:['arrays', 'perl'] 
 <p>I am writing 
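The two `np.random.randint` calls are unseeded, so a re-run shows different examples, and they can occasionally return the same index twice. If reproducible spot checks are wanted, a seeded generator is one option. This is a sketch on a hypothetical toy frame, and the seed value is arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `complete` frame.
toy = pd.DataFrame({
    "Body": ["<p>first</p>", "<p>second</p>", "<p>third</p>", "<p>fourth</p>"],
    "Tag":  [["bash"], ["perl"], ["python"], ["sql"]],
})

# A seeded generator returns the same indices on every run.
rng = np.random.default_rng(seed=42)
idx1, idx2 = rng.choice(len(toy), size=2, replace=False)  # two distinct positions

for idx in (idx1, idx2):
    # iloc selects by position, which stays valid even if the index is not 0..n-1
    print(f"Tag: {toy['Tag'].iloc[idx]}\n{toy['Body'].iloc[idx]}\n")
```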

In [11]:
# A major problem is that some of the questions contain code. To improve model performance,
# I will remove all special characters
def clean_body(txt: str) -> str:
    txt = txt.replace("<p>"," ")
    txt = txt.replace("</p>"," ")
    txt = txt.replace("<pre><code>"," ")
    # Closing tags appear as </code></pre> in the data (the reverse of the opening order)
    txt = txt.replace("</code></pre>"," ")
    new = re.sub("[^0-9a-zA-Z]+"," ",txt)
    return new

complete['Body'] = complete['Body'].apply(clean_body)

In [12]:
print(f"Random examples of the clean Body: \n Tag:{complete['Tag'][exp1]} \n {complete['Body'][exp1]}  \n\n\n  Tag:{complete['Tag'][exp2]} \n {complete['Body'][exp2]}")

Random examples of the clean Body: 
 Tag:['bash', 'shell'] 
  Below is my pre commit git hoook bin bash if git diff U0 FILE PATH grep iq todo Double quoting FILE PATH doesnt change anything if git diff U0 FILE PATH grep iq todo then echo FILE PATH gt Contains TODO exit 1 else echo nooooooooooooooooooooooooooooooooooo fi I m always getting the code noooooooooooooooooooo code message however the command below tried directly on my terminal works well git diff U0 my file path php grep iq todo amp amp echo true echo false Output true strong UPDATE strong When running code bash git hooks pre commit code it works very strange strong FYI strong I don t know if it s an important information but git hooks pre commit is a symbolik link   


  Tag:['arrays', 'perl'] 
  I am writing a perl script and currently working on a subroutine to sum all values of an array Currently my code only reads in each line and stores the entire line into each array element I need each indiv
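The cleaned text above still contains leftover words such as `strong`, `gt`, and `amp`. `clean_body` only replaces a fixed set of tags, and the final regex reduces every other tag or HTML entity to its letters. A more general variant could strip any tag and decode entities first. This is a sketch, not the cleaning used in this notebook: `clean_body_generic` is a hypothetical name, and a regex is only an approximation of real HTML parsing.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")             # any opening or closing tag
NON_ALNUM_RE = re.compile(r"[^0-9a-zA-Z]+")  # same character class as clean_body


def clean_body_generic(txt: str) -> str:
    """Strip HTML tags, decode entities, then keep only letters and digits."""
    no_tags = TAG_RE.sub(" ", txt)
    # Decode after stripping tags, so escaped code such as &lt;p&gt; is not
    # mistaken for markup.
    unescaped = html.unescape(no_tags)
    return NON_ALNUM_RE.sub(" ", unescaped).strip()


sample = "<p>Run <code>a &amp;&amp; b</code> <strong>now</strong></p>"
cleaned = clean_body_generic(sample)
print(cleaned)
```

With this variant the tag names and entity names no longer leak into the text as spurious tokens.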

In [13]:
# For now this data cleaning is good enough. I will save the result to a new CSV
complete.to_csv(DATA_DIR / 'processed_data.csv')
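One thing to keep in mind when reloading this file: CSV has no list type, so the `Tag` column is written as the string form of each Python list. Also, because `to_csv` is called without `index=False`, the row index is saved as an extra unnamed column. The sketch below shows one way to round-trip the lists, using an in-memory buffer and toy rows as stand-ins for the real file.

```python
import ast
import io

import pandas as pd

# Hypothetical rows shaped like the `complete` frame.
toy = pd.DataFrame({
    "Id":   [80, 120],
    "Body": ["first body", "second body"],
    "Tag":  [["actionscript-3"], ["sql", "asp.net"]],
})

# Write to an in-memory buffer instead of disk; index=False skips the row index.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# ast.literal_eval safely parses "['sql', 'asp.net']" back into a list.
restored = pd.read_csv(buffer, converters={"Tag": ast.literal_eval})
print(restored["Tag"].tolist())
```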