Data Cleaning Phase

In [1]:
import pandas as pd

In [2]:
dt = pd.read_csv('vscode_bugs.csv')
print(dt.head())
print()
dt.info()    #info() prints its own summary and returns None, so call it directly instead of formatting it
print(f"\n{dt.describe()}")
dt

   Issue id                                            Summary  \
0    223706  Configure unassigned keybindings, command name...   
1    223658  Some suggestions are missing on Windows with pwsh   
2    223641  Cannot split a terminal without a group [objec...   
3    223622  SCM - history graph handling of first commit i...   
4    223607          SCM - history graph using incorrect color   

                     Created                   Resolved  \
0  2024-07-25 14:34:59+00:00                        NaN   
1  2024-07-25 13:06:40+00:00  2024-07-25 16:14:57+00:00   
2  2024-07-25 12:19:09+00:00                        NaN   
3  2024-07-25 10:44:47+00:00                        NaN   
4  2024-07-25 09:43:02+00:00  2024-07-25 11:02:46+00:00   

                                         Description Resolution  
0  - Place cursor in the panel chat\r\n- Open the...        NaN  
1  I was trying to verify https://github.com/micr...  completed  
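2  ![image](https://github.com/user-attachments/a...        NaN  
3  - [ ] Show the first commit in the repository ...        NaN  
4  ![Image](https://github.com/user-attachments/a...  completed  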

Unnamed: 0,Issue id,Summary,Created,Resolved,Description,Resolution
0,223706,"Configure unassigned keybindings, command name...",2024-07-25 14:34:59+00:00,,- Place cursor in the panel chat\r\n- Open the...,
1,223658,Some suggestions are missing on Windows with pwsh,2024-07-25 13:06:40+00:00,2024-07-25 16:14:57+00:00,I was trying to verify https://github.com/micr...,completed
2,223641,Cannot split a terminal without a group [objec...,2024-07-25 12:19:09+00:00,,![image](https://github.com/user-attachments/a...,
3,223622,SCM - history graph handling of first commit i...,2024-07-25 10:44:47+00:00,,- [ ] Show the first commit in the repository ...,
4,223607,SCM - history graph using incorrect color,2024-07-25 09:43:02+00:00,2024-07-25 11:02:46+00:00,![Image](https://github.com/user-attachments/a...,completed
...,...,...,...,...,...,...
32824,8,"Flash between opening of workspaces, reload",2015-11-14 12:53:12+00:00,2015-11-19 10:26:08+00:00,start code\nset a dark theme _other than the d...,completed
32825,6,Windows only - OmniSharp does not provide Inte...,2015-11-13 16:50:44+00:00,2016-01-20 09:30:16+00:00,upgrade to rc1 dnx/runtime\n\n``` bash\ngit cl...,completed
32826,5,welcome.md packaged wrongly,2015-11-13 16:48:18+00:00,2015-11-14 05:42:28+00:00,VSCode > Help > Show Welcome > does not work\n...,completed
32827,4,"C# bracket insertion, indentation not working",2015-11-13 16:45:06+00:00,2015-11-16 11:29:58+00:00,Type if (\n\nexpected => closing )\nactual => ...,completed


In [3]:
print(dt.duplicated().sum())       #count fully duplicated rows, if any
dt.drop_duplicates(inplace=True)   #drop them, keeping the first occurrence

print(dt.duplicated().sum())       #confirm that no duplicates remain

1
0
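
To make the behaviour of the cell above concrete, here is a minimal sketch on a hypothetical three-row frame (the values are made up and are not taken from vscode_bugs.csv). It illustrates that duplicated() flags only the second and later copies of a fully identical row, so drop_duplicates() keeps the first occurrence by default.

```python
import pandas as pd

# Hypothetical toy data; not taken from vscode_bugs.csv
toy = pd.DataFrame({
    "Issue id": [1, 1, 2],
    "Summary": ["crash on start", "crash on start", "slow search"],
})

n_dupes = int(toy.duplicated().sum())   # rows identical to an earlier row
deduped = toy.drop_duplicates()         # keeps the first occurrence by default

print(n_dupes)        # 1
print(len(deduped))   # 2
```

If only some columns should define a duplicate (for example the issue id), drop_duplicates also accepts a subset argument.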


In [4]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

dt['Created'] = pd.to_datetime(dt['Created'], errors='coerce')     #parse the ISO 8601 (year-first) timestamps in Created; dayfirst is not needed, and values that fail to parse become NaT
dt = dt.dropna(subset=['Created']).copy()    #drop rows whose timestamp failed to convert; copy() avoids chained-assignment warnings below

#replace missing descriptions with an empty string
dt['Description'] = dt['Description'].fillna('')

dt['Notes'] = dt['Summary'] + '  ' + dt['Description']    #combine the summary and description columns into one text column
print(dt['Notes'].head())        #inspect the combined text



0    Configure unassigned keybindings, command name...
1    Some suggestions are missing on Windows with p...
2    Cannot split a terminal without a group [objec...
3    SCM - history graph handling of first commit i...
4    SCM - history graph using incorrect color  ![I...
Name: Notes, dtype: object


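
The Created values are year-first ISO 8601 strings, so they parse without a dayfirst hint. The sketch below illustrates this on three sample strings: two copied from the table above and one deliberately invalid value added for illustration. The utc=True argument is an assumption added here to make the resulting timezone explicit; it is not part of the original cell.

```python
import pandas as pd

# Two timestamps in the same layout as the Created column, plus one invalid value
raw = pd.Series([
    "2024-07-25 14:34:59+00:00",
    "2015-11-13 16:45:06+00:00",
    "not a date",
])

# errors="coerce" turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(raw, errors="coerce", utc=True)

print(parsed.isna().tolist())              # [False, False, True]
print(parsed[1].month, parsed[1].day)      # 11 13
```

Rows that end up as NaT are the ones the dropna(subset=['Created']) call removes.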


In [5]:
import re

#remove special characters: keep only letters, digits and spaces
dt['Notes'] = dt['Notes'].apply(lambda x: re.sub(r'[^A-Za-z0-9 ]', '', x))        #note: punctuation and newlines are deleted outright, not replaced by a space
print(dt['Notes'].head())       #verify that the changes were applied


0    Configure unassigned keybindings command name ...
1    Some suggestions are missing on Windows with p...
2    Cannot split a terminal without a group object...
3    SCM  history graph handling of first commit in...
4    SCM  history graph using incorrect color  Imag...
Name: Notes, dtype: object
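
Because the pattern above deletes punctuation and newlines rather than replacing them, neighbouring words can fuse: later outputs in this notebook contain tokens such as "welcomemd" (from "welcome.md") and "codeset" (from "start code\nset"). One possible variation, shown as a sketch only, replaces each run of special characters with a single space. The helper name strip_special is hypothetical and is not used elsewhere in the notebook.

```python
import re

def strip_special(text: str) -> str:
    """Replace every run of non-alphanumeric characters with a single space."""
    return re.sub(r"[^A-Za-z0-9]+", " ", text).strip()

print(strip_special("start code\nset a dark theme"))   # start code set a dark theme
print(strip_special("welcome.md packaged wrongly"))    # welcome md packaged wrongly
```

Applied to the column, this would be dt['Notes'].apply(strip_special). Whether "welcome md" is preferable to "welcomemd" depends on the downstream model, so treat this as an option rather than a correction.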


In [6]:
#normalize to lowercase
dt['Notes'] = dt['Notes'].str.lower()
print(dt['Notes'].head())       #verify that the changes were applied


0    configure unassigned keybindings command name ...
1    some suggestions are missing on windows with p...
2    cannot split a terminal without a group object...
3    scm  history graph handling of first commit in...
4    scm  history graph using incorrect color  imag...
Name: Notes, dtype: object


In [7]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Download the tokenizer models and the stopword list (a no-op if they are already up to date)
nltk.download('punkt')
nltk.download('stopwords')

# Newer NLTK releases also need punkt_tab for word_tokenize; fetch it only if it is missing
try:
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    nltk.download('punkt_tab')

# Keep only letters and whitespace (this also drops digits), then lowercase
dt["Cleaned"] = dt["Notes"].str.replace(r'[^a-zA-Z\s]', '', regex=True).str.lower()
print("Cleaned Text:")
print(dt["Cleaned"])

[nltk_data] Downloading package punkt to /home/kmumenin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/kmumenin/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Cleaned Text:
0        configure unassigned keybindings command name ...
1        some suggestions are missing on windows with p...
2        cannot split a terminal without a group object...
3        scm  history graph handling of first commit in...
4        scm  history graph using incorrect color  imag...
                               ...                        
32824    flash between opening of workspaces reload  st...
32825    windows only  omnisharp does not provide intel...
32826    welcomemd packaged wrongly  vscode  help  show...
32827    c bracket insertion indentation not working  t...
32828    omnisharp not included in linux build  our lin...
Name: Cleaned, Length: 32828, dtype: object


In [8]:
dt["Tokens"] = dt["Cleaned"].apply(word_tokenize)
print("\nTokenized Words:")
print(dt["Tokens"])



Tokenized Words:
0        [configure, unassigned, keybindings, command, ...
1        [some, suggestions, are, missing, on, windows,...
2        [can, not, split, a, terminal, without, a, gro...
3        [scm, history, graph, handling, of, first, com...
4        [scm, history, graph, using, incorrect, color,...
                               ...                        
32824    [flash, between, opening, of, workspaces, relo...
32825    [windows, only, omnisharp, does, not, provide,...
32826    [welcomemd, packaged, wrongly, vscode, help, s...
32827    [c, bracket, insertion, indentation, not, work...
32828    [omnisharp, not, included, in, linux, build, o...
Name: Tokens, Length: 32828, dtype: object


In [9]:
stop_words = set(stopwords.words("english"))
dt["Filtered"] = dt["Tokens"].apply(lambda tokens: [w for w in tokens if w not in stop_words])
print("\nFiltered Tokens (No Stopwords):")
print(dt["Filtered"])



Filtered Tokens (No Stopwords):
0        [configure, unassigned, keybindings, command, ...
1        [suggestions, missing, windows, pwsh, trying, ...
2        [split, terminal, without, group, object, obje...
3        [scm, history, graph, handling, first, commit,...
4        [scm, history, graph, using, incorrect, color,...
                               ...                        
32824    [flash, opening, workspaces, reload, start, co...
32825    [windows, omnisharp, provide, intellisense, st...
32826    [welcomemd, packaged, wrongly, vscode, help, s...
32827    [c, bracket, insertion, indentation, working, ...
32828    [omnisharp, included, linux, build, linux, bui...
Name: Filtered, Length: 32828, dtype: object
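
The output above shows that the English stopword list also removes "not" (rows 2 and 32825), so "does not provide" becomes just "provide". If negation matters for the downstream task, one option is to subtract a few negation words from the stopword set before filtering. The sketch below uses a small hand-written stopword set as a stand-in for set(stopwords.words("english")) so that it runs without NLTK data; the names NEGATIONS and remove_stopwords are hypothetical.

```python
# Stand-in for set(stopwords.words("english")); a hypothetical subset for illustration
base_stop_words = {"a", "of", "on", "does", "not", "no", "only", "are", "with"}

# Hypothetical set of negation words to keep in the text
NEGATIONS = {"not", "no", "nor"}

def remove_stopwords(tokens, stop_words):
    """Return the tokens that are not in stop_words, preserving their order."""
    return [w for w in tokens if w not in stop_words]

stop_words_keep_neg = base_stop_words - NEGATIONS

tokens = ["windows", "only", "omnisharp", "does", "not", "provide", "intellisense"]
print(remove_stopwords(tokens, base_stop_words))
# ['windows', 'omnisharp', 'provide', 'intellisense']
print(remove_stopwords(tokens, stop_words_keep_neg))
# ['windows', 'omnisharp', 'not', 'provide', 'intellisense']
```

In the notebook itself the same idea would read stop_words = set(stopwords.words("english")) - NEGATIONS.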


In [10]:
stemmer = PorterStemmer()
dt["Stemmed"] = dt["Filtered"].apply(lambda tokens: [stemmer.stem(w) for w in tokens])
print("\nStemmed Tokens:")
print(dt["Stemmed"])



Stemmed Tokens:
0        [configur, unassign, keybind, command, name, i...
1        [suggest, miss, window, pwsh, tri, verifi, htt...
2        [split, termin, without, group, object, object...
3        [scm, histori, graph, handl, first, commit, re...
4        [scm, histori, graph, use, incorrect, color, i...
                               ...                        
32824    [flash, open, workspac, reload, start, codeset...
32825    [window, omnisharp, provid, intellisens, stron...
32826    [welcomemd, packag, wrongli, vscode, help, sho...
32827    [c, bracket, insert, indent, work, type, expec...
32828    [omnisharp, includ, linux, build, linux, build...
Name: Stemmed, Length: 32828, dtype: object
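
The cleaning, tokenizing, filtering and stemming cells above can also be folded into one reusable function. The following is a rough sketch under stated assumptions, not the notebook's own code: the name preprocess is hypothetical, a plain whitespace split stands in for word_tokenize (so "cannot" stays one token instead of becoming "can", "not"), and the stemmer is passed in as a callable that defaults to doing nothing.

```python
import re

def preprocess(text, stop_words, stem=lambda w: w):
    """Lowercase, keep letters only, split on whitespace, drop stopwords, then stem."""
    cleaned = re.sub(r"[^a-zA-Z\s]", "", text).lower()
    tokens = cleaned.split()          # whitespace split; a stand-in for word_tokenize
    filtered = [w for w in tokens if w not in stop_words]
    return [stem(w) for w in filtered]

# Hypothetical one-word stopword set, for illustration only
print(preprocess("SCM - history graph using incorrect color", {"using"}))
# ['scm', 'history', 'graph', 'incorrect', 'color']
```

With NLTK available, the call would look like dt["Notes"].apply(lambda t: preprocess(t, stop_words, stemmer.stem)), which keeps the intermediate columns out of the DataFrame.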


In [11]:
dt

Unnamed: 0,Issue id,Summary,Created,Resolved,Description,Resolution,Notes,Cleaned,Tokens,Filtered,Stemmed
0,223706,"Configure unassigned keybindings, command name...",2024-07-25 14:34:59+00:00,,- Place cursor in the panel chat\r\n- Open the...,,configure unassigned keybindings command name ...,configure unassigned keybindings command name ...,"[configure, unassigned, keybindings, command, ...","[configure, unassigned, keybindings, command, ...","[configur, unassign, keybind, command, name, i..."
1,223658,Some suggestions are missing on Windows with pwsh,2024-07-25 13:06:40+00:00,2024-07-25 16:14:57+00:00,I was trying to verify https://github.com/micr...,completed,some suggestions are missing on windows with p...,some suggestions are missing on windows with p...,"[some, suggestions, are, missing, on, windows,...","[suggestions, missing, windows, pwsh, trying, ...","[suggest, miss, window, pwsh, tri, verifi, htt..."
2,223641,Cannot split a terminal without a group [objec...,2024-07-25 12:19:09+00:00,,![image](https://github.com/user-attachments/a...,,cannot split a terminal without a group object...,cannot split a terminal without a group object...,"[can, not, split, a, terminal, without, a, gro...","[split, terminal, without, group, object, obje...","[split, termin, without, group, object, object..."
3,223622,SCM - history graph handling of first commit i...,2024-07-25 10:44:47+00:00,,- [ ] Show the first commit in the repository ...,,scm history graph handling of first commit in...,scm history graph handling of first commit in...,"[scm, history, graph, handling, of, first, com...","[scm, history, graph, handling, first, commit,...","[scm, histori, graph, handl, first, commit, re..."
4,223607,SCM - history graph using incorrect color,2024-07-25 09:43:02+00:00,2024-07-25 11:02:46+00:00,![Image](https://github.com/user-attachments/a...,completed,scm history graph using incorrect color imag...,scm history graph using incorrect color imag...,"[scm, history, graph, using, incorrect, color,...","[scm, history, graph, using, incorrect, color,...","[scm, histori, graph, use, incorrect, color, i..."
...,...,...,...,...,...,...,...,...,...,...,...
32824,8,"Flash between opening of workspaces, reload",2015-11-14 12:53:12+00:00,2015-11-19 10:26:08+00:00,start code\nset a dark theme _other than the d...,completed,flash between opening of workspaces reload st...,flash between opening of workspaces reload st...,"[flash, between, opening, of, workspaces, relo...","[flash, opening, workspaces, reload, start, co...","[flash, open, workspac, reload, start, codeset..."
32825,6,Windows only - OmniSharp does not provide Inte...,2015-11-13 16:50:44+00:00,2016-01-20 09:30:16+00:00,upgrade to rc1 dnx/runtime\n\n``` bash\ngit cl...,completed,windows only omnisharp does not provide intel...,windows only omnisharp does not provide intel...,"[windows, only, omnisharp, does, not, provide,...","[windows, omnisharp, provide, intellisense, st...","[window, omnisharp, provid, intellisens, stron..."
32826,5,welcome.md packaged wrongly,2015-11-13 16:48:18+00:00,2015-11-14 05:42:28+00:00,VSCode > Help > Show Welcome > does not work\n...,completed,welcomemd packaged wrongly vscode help show...,welcomemd packaged wrongly vscode help show...,"[welcomemd, packaged, wrongly, vscode, help, s...","[welcomemd, packaged, wrongly, vscode, help, s...","[welcomemd, packag, wrongli, vscode, help, sho..."
32827,4,"C# bracket insertion, indentation not working",2015-11-13 16:45:06+00:00,2015-11-16 11:29:58+00:00,Type if (\n\nexpected => closing )\nactual => ...,completed,c bracket insertion indentation not working t...,c bracket insertion indentation not working t...,"[c, bracket, insertion, indentation, not, work...","[c, bracket, insertion, indentation, working, ...","[c, bracket, insert, indent, work, type, expec..."


In [12]:
# keep only the text columns produced by the cleaning steps
dt.drop(columns=['Issue id', 'Description', 'Summary', 'Created', 'Resolution', 'Resolved'], inplace=True)
dt

Unnamed: 0,Notes,Cleaned,Tokens,Filtered,Stemmed
0,configure unassigned keybindings command name ...,configure unassigned keybindings command name ...,"[configure, unassigned, keybindings, command, ...","[configure, unassigned, keybindings, command, ...","[configur, unassign, keybind, command, name, i..."
1,some suggestions are missing on windows with p...,some suggestions are missing on windows with p...,"[some, suggestions, are, missing, on, windows,...","[suggestions, missing, windows, pwsh, trying, ...","[suggest, miss, window, pwsh, tri, verifi, htt..."
2,cannot split a terminal without a group object...,cannot split a terminal without a group object...,"[can, not, split, a, terminal, without, a, gro...","[split, terminal, without, group, object, obje...","[split, termin, without, group, object, object..."
3,scm history graph handling of first commit in...,scm history graph handling of first commit in...,"[scm, history, graph, handling, of, first, com...","[scm, history, graph, handling, first, commit,...","[scm, histori, graph, handl, first, commit, re..."
4,scm history graph using incorrect color imag...,scm history graph using incorrect color imag...,"[scm, history, graph, using, incorrect, color,...","[scm, history, graph, using, incorrect, color,...","[scm, histori, graph, use, incorrect, color, i..."
...,...,...,...,...,...
32824,flash between opening of workspaces reload st...,flash between opening of workspaces reload st...,"[flash, between, opening, of, workspaces, relo...","[flash, opening, workspaces, reload, start, co...","[flash, open, workspac, reload, start, codeset..."
32825,windows only omnisharp does not provide intel...,windows only omnisharp does not provide intel...,"[windows, only, omnisharp, does, not, provide,...","[windows, omnisharp, provide, intellisense, st...","[window, omnisharp, provid, intellisens, stron..."
32826,welcomemd packaged wrongly vscode help show...,welcomemd packaged wrongly vscode help show...,"[welcomemd, packaged, wrongly, vscode, help, s...","[welcomemd, packaged, wrongly, vscode, help, s...","[welcomemd, packag, wrongli, vscode, help, sho..."
32827,c bracket insertion indentation not working t...,c bracket insertion indentation not working t...,"[c, bracket, insertion, indentation, not, work...","[c, bracket, insertion, indentation, working, ...","[c, bracket, insert, indent, work, type, expec..."
