# Project Objective
* Predict the tags assigned to a set of questions from the Stack Overflow website
* The dataset is provided by Kaggle (2013), with a training set of more than six million questions
* **Reference:** *J. Gonzalez et al., "Multi-class Multi-tag Classifier System for StackOverflow Questions", 2015 IEEE International Autumn Meeting on Power, Electronics and Computing (ROPEC).*
* **Completed by**: Dr.AI
* **Team Members:** Tong Xu, Michael Lasby, Yunying Zhang

In [0]:
%sh
pip install bs4
pip install nltk

In [0]:
from bs4 import BeautifulSoup
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_en = stopwords.words('english')
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.ml.feature import RegexTokenizer

# Import Data

## Training Dataset from Kaggle
* 6,034,195 Stack Overflow questions (4,206,314 after removing duplicates)
* Features: 'Id', 'Title', 'Body'
* Target: 'Tags'

In [0]:
# File location and type
file_location = "/FileStore/Train.csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","
multiline = "true"
escape = "\""

# The applied options are for CSV files. For other file types, these will be ignored.
train = spark.read.csv(file_location,
                  inferSchema = infer_schema, 
                  sep = delimiter, 
                  header = first_row_is_header, 
                  multiLine = multiline, 
                  escape = escape)
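
The `multiLine` and `escape` options are presumably needed because question bodies can span several lines and contain literal double quotes, which the file stores as `""` inside a quoted field. As a minimal sketch of that quoting convention, the snippet below uses Python's standard `csv` module rather than Spark. The `sample` string and the `parse_rows` helper are made up for illustration and are not part of the project.

```python
import csv
import io

# Hypothetical sample in the same shape as Train.csv: one record whose
# Body spans two lines and whose Title contains doubled ("") quotes.
sample = (
    'Id,Title,Body,Tags\n'
    '1,"How to print ""hello""?","first line\nsecond line",python printing\n'
)


def parse_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    # newline='' leaves line endings untouched so the csv module can
    # handle line breaks inside quoted fields itself.
    return list(csv.DictReader(io.StringIO(text, newline='')))


rows = parse_rows(sample)
print(len(rows))            # number of logical records, not physical lines
print(rows[0]['Title'])     # doubled quotes collapse to a single quote
print(rows[0]['Tags'].split())
```

A reader that treated every physical line as a record would split this single question in two, which is the failure the multi-line option guards against.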

In [0]:
display(train)

Id,Title,Body,Tags
1,How to check if an uploaded file is an image without mime type?,"I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload. Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?",php image-processing file-upload upload mime-types
2,How can I prevent firefox from closing when I press ctrl-w,"In my favorite editor (vim), I regularly use ctrl-w to execute a certain action. Now, it quite often happens to me that firefox is the active window (on windows) while I still look at vim (thinking vim is the active window) and press ctrl-w which closes firefox. This is not what I want. Is there a way to stop ctrl-w from closing firefox? Rene",firefox
3,R Error Invalid type (list) for variable,"I am import matlab file and construct a data frame, matlab file contains two columns with and each row maintain a cell that has a matrix, I construct a dataframe to run random forest. But I am getting following error. Error in model.frame.default(formula = expert_data_frame$t_labels ~ ., : invalid type (list) for variable 'expert_data_frame$t_labels' Here is the code how I import the matlab file and construct the dataframe: all_exp_traintest <- readMat(all_exp_filepath); len = length(all_exp_traintest$exp.traintest)/2;  for (i in 1:len) {  expert_train_df <- data.frame(all_exp_traintest$exp.traintest[i]);  labels = data.frame(all_exp_traintest$exp.traintest[i+302]);  names(labels)[1] <- ""t_labels"";  expert_train_df$t_labels <- labels;  expert_data_frame <- data.frame(expert_train_df);  rf_model = randomForest(expert_data_frame$t_labels ~., data=expert_data_frame, importance=TRUE, do.trace=100);  } Structure of the Matlab input file [56x12 double] [56x1 double] [62x12 double] [62x1 double] [62x12 double] [62x1 double] [62x12 double] [62x1 double] [62x12 double] [62x1 double] [74x12 double] [74x1 double] > str(all_exp_traintest) List of 1  $ exp.traintest:List of 604  ..$ NA: num [1:56, 1:12] 0 0 0 0 8 1 1 0 0 0 ...  ..$ NA: num [1:62, 1:12] 2 10 11 13 5 10 13 8 11 8 ...  ..$ NA: num [1:62, 1:12] 0 0 1 0 0 0 0 0 1 1 ...  ..$ NA: num [1:62, 1:12] 4 2 1 3 3 20 6 3 2 2 ...  ..$ NA: num [1:62, 1:12] 2731 2362 2937 1229 1898 ...  ..$ NA: num [1:74, 1:12] 27 33 34 38 33 35 36 35 47 46 ...  ..$ NA: num [1:74, 1:12] 106 79 99 94 153 104 146 105 125 146 ...  ..$ NA: num [1:74, 1:12] 3 9 3 0 1 26 0 4 0 0 ...  ..$ NA: num [1:51, 1:12] 5 7 3 30 0 0 0 0 0 0 ...  ..$ NA: num [1:66, 1:12] 0 0 13 0 0 3 2 2 0 2 ...  ..$ NA: num [1:73, 1:12] 1 0 1 0 0 0 2 1 2 5 ...  ..$ NA: num [1:73, 1:12] 23 14 20 14 24 22 32 61 84 278 ...  ..$ NA: num [1:75, 1:12] 1 7 0 1 2 3 3 0 16 10 ...  ..$ NA: num [1:90, 1:12] 10 7 8 15 25 12 37 31 18 48 ...  
..$ NA: num [1:90, 1:12] 0 6 3 1 5 7 8 6 1 1 ...  ..$ NA: num [1:90, 1:12] 0 1 1 2 0 4 9 6 3 4 ...  ..$ NA: num [1:90, 1:12] 6 0 5 27 11 50 22 8 10 4 ...  ..$ NA: num [1:90, 1:12] 3 9 13 12 4 0 5 0 5 0 ...  ..$ NA: num [1:90, 1:12] 1 0 1 0 1 2 1 0 1 2 ...  ..$ NA: num [1:90, 1:12] 3395 3400 3360 3770 3533 ...  ..$ NA: num [1:84, 1:12] 0 0 0 0 5 0 0 5 4 2 ...  ..$ NA: num [1:80, 1:12] 2 3 3 3 4 28 61 26 8 1 ...  ..$ NA: num [1:81, 1:12] 4 28 22 9 16 43 80 21 19 18 ...  ..$ NA: num [1:76, 1:12] 1 0 0 1 49 64 60 230 222 267 ...  ..$ NA: num [1:76, 1:12] 4786 4491 2510 1144 2071 ...  ..$ NA: num [1:76, 1:12] 80 128 254 109 114 267 152 139 368 363 ...  ..$ NA: num [1:76, 1:12] 1 5 8 2 14 5 3 13 8 2 ...  ..$ NA: num [1:76, 1:12] 10 3 8 79 4 4 11 30 2 0 ...  ..$ NA: num [1:68, 1:12] 0 0 2 0 0 2 6 0 0 4 ...  ..$ NA: num [1:68, 1:12] 1 4 5 2 2 3 3 1 3 0 ...  ..$ NA: num [1:68, 1:12] 0 0 1 0 0 0 0 0 0 1 ...  ..$ NA: num [1:69, 1:12] 39 45 2 0 1 4 3 0 13 0 ...  ..$ NA: num [1:69, 1:12] 0 4 6 0 0 4 1 6 10 1 ...  ..$ NA: num [1:69, 1:12] 0 2 5 2 2 2 0 0 3 6 ...  ..$ NA: num [1:69, 1:12] 3 0 1 1 1 4 7 5 5 1 ...  ..$ NA: num [1:66, 1:12] 5 0 0 0 0 0 0 1 3 5 ...  ..$ NA: num [1:66, 1:12] 4 3 3 0 0 4 0 0 0 0 ...  ..$ NA: num [1:65, 1:12] 0 0 1 0 0 0 5 8 4 1 ...  ..$ NA: num [1:65, 1:12] 0 5 6 0 2 0 0 1 1 2 ...  ..$ NA: num [1:69, 1:12] 0 16 5 1 14 0 1 0 0 16 ...  ..$ NA: num [1:69, 1:12] 0 0 0 0 0 25 2 3 0 0 ...  ..$ NA: num [1:64, 1:12] 2 0 0 0 0 0 0 0 0 0 ...  ..$ NA: num [1:42, 1:12] 0 0 0 0 0 0 0 0 0 0 ...  ..$ NA: num [1:67, 1:12] 0 2 4 10 15 4 1 43 1 7 ...  ..$ NA: num [1:63, 1:12] 32 6 12 5 92 8 29 7 21 20 ...  ..$ NA: num [1:63, 1:12] 2 5 12 8 10 13 6 11 10 14 ...  ..$ NA: num [1:63, 1:12] 3 5 10 9 0 1 8 13 2 14 ...  ..$ NA: num [1:54, 1:12] 0 0 14 0 0 0 0 0 0 1 ...  ..$ NA: num [1:82, 1:12] 152 99 63 57 105 44 28 33 43 49 ...  ..$ NA: num [1:81, 1:12] 0 1 0 0 0 0 0 0 0 0 ...  ..$ NA: num [1:75, 1:12] 0 1 3 0 0 0 0 0 0 0 ...  ..$ NA: num [1:75, 1:12] 1 0 0 2 0 1 0 0 0 0 ... 
 ..$ NA: num [1:75, 1:12] 1 6 5 5 3 8 1 3 1 0 ...  ..$ NA: num [1:72, 1:12] 0 0 0 0 1 0 1 2 0 0 ...  ..$ NA: num [1:62, 1:12] 310 91 4 4 9 0 0 1 0 0 ...  ..$ NA: num [1:62, 1:12] 239 374 1060 599 805 808 139 150 490 326 ...  ..$ NA: num [1:49, 1:12] 9 18 10 12 19 5 13 10 2 3 ...  ..$ NA: num [1:61, 1:12] 2 0 0 0 1 0 0 0 0 0 ...  ..$ NA: num [1:61, 1:12] 4 10 16 15 8 14 10 23 11 5 ...  ..$ NA: num [1:61, 1:12] 0 1 4 4 5 3 0 1 1 1 ...  ..$ NA: num [1:65, 1:12] 165 100 177 65 148 58 188 55 59 62 ...  ..$ NA: num [1:65, 1:12] 13 0 0 2 2 3 0 0 0 0 ...  ..$ NA: num [1:66, 1:12] 157 58 101 92 15 21 73 80 78 75 ...  ..$ NA: num [1:66, 1:12] 8 6 1 0 6 2 2 6 10 9 ...  ..$ NA: num [1:87, 1:12] 1 2 5 6 8 3 3 3 2 3 ...  ..$ NA: num [1:83, 1:12] 0 0 0 0 0 0 2 13 0 0 ...  ..$ NA: num [1:81, 1:12] 0 0 1 0 3 5 3 0 2 7 ...  ..$ NA: num [1:81, 1:12] 33 81 94 30 5 36 16 90 121 182 ...  ..$ NA: num [1:81, 1:12] 10 11 16 6 0 0 0 1 0 0 ...  ..$ NA: num [1:81, 1:12] 7 0 0 2 1 3 1 4 0 0 ...  ..$ NA: num [1:81, 1:12] 1 0 5 0 2 3 1 0 1 1 ...  ..$ NA: num [1:95, 1:12] 30 160 116 130 444 515 225 135 108 175 ...  ..$ NA: num [1:95, 1:12] 12 1 0 10 3 3 0 4 0 0 ...  ..$ NA: num [1:95, 1:12] 1 0 0 0 3 3 1 0 0 0 ...  ..$ NA: num [1:95, 1:12] 11 42 61 23 41 56 81 6 83 82 ...  ..$ NA: num [1:95, 1:12] 1 2 5 3 6 4 2 8 28 1 ...  ..$ NA: num [1:95, 1:12] 283 192 377 216 207 261 394 262 262 554 ...  ..$ NA: num [1:94, 1:12] 0 0 0 0 0 0 0 0 0 0 ...  ..$ NA: num [1:72, 1:12] 0 0 0 0 0 0 0 0 0 0 ...  ..$ NA: num [1:72, 1:12] 5 3 0 2 13 27 6 2 12 36 ...  ..$ NA: num [1:72, 1:12] 0 2 2 0 1 0 1 4 2 2 ...  ..$ NA: num [1:72, 1:12] 0 0 1 0 3 1 0 4 1 0 ...  ..$ NA: num [1:67, 1:12] 27 7 18 1 2 0 0 0 0 0 ...  ..$ NA: num [1:67, 1:12] 10 2 1 10 7 0 0 1 1 4 ...  ..$ NA: num [1:67, 1:12] 14 17 9 20 13 20 18 13 10 7 ...  ..$ NA: num [1:64, 1:12] 0 0 0 0 4 0 0 0 3 0 ...  ..$ NA: num [1:64, 1:12] 3 0 1 0 2 7 13 14 4 2 ...  ..$ NA: num [1:64, 1:12] 0 0 0 0 0 0 0 0 2 0 ...  
..$ NA: num [1:72, 1:12] 59 61 55 120 49 202 325 244 377 551 ...  ..$ NA: num [1:72, 1:12] 0 0 0 0 0 0 0 0 1 0 ...  ..$ NA: num [1:72, 1:12] 0 3 1 0 1 0 0 0 4 0 ...  ..$ NA: num [1:72, 1:12] 5 12 6 9 15 10 15 27 15 9 ...  ..$ NA: num [1:72, 1:12] 7 0 3 0 0 1 1 1 1 0 ...  ..$ NA: num [1:72, 1:12] 0 0 0 0 89 0 19 3 3 2 ...  ..$ NA: num [1:61, 1:12] 5 3 5 3 3 29 46 140 49 24 ...  ..$ NA: num [1:63, 1:12] 23 0 0 0 0 60 7 73 13 19 ...  ..$ NA: num [1:95, 1:12] 7 96 28 2 9 5 8 190 166 1 ...  ..$ NA: num [1:95, 1:12] 0 0 1 1 0 0 0 0 0 0 ...  ..$ NA: num [1:95, 1:12] 4 0 2 6 6 11 6 5 6 9 ...  .. [list output truncated]  - attr(*, ""header"")=List of 3  ..$ description: chr ""MATLAB 5.0 MAT-file, Platform: MACI64, Created on: Sun Dec 9 17:35:24 2012 ""  ..$ version : chr ""5""  ..$ endian : chr ""little"" After loading the matlab file into R all_exp_traintest$exp.traintest[1] $<NA>  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]  [1,] 0 0.0 0.00 0.000 0.5000 0.03125 0.015625 0.0000000 0.00000000 0.000000000 0.0000000000 0.0000000000  [2,] 0 0.0 0.00 1.000 0.0625 0.03125 0.000000 0.0000000 0.00000000 0.000000000 0.0000000000 0.0000000000  [3,] 0 0.0 2.00 0.125 0.0625 0.00000 0.000000 0.0000000 0.00000000 0.000000000 0.0000000000 0.0000000000  [4,] 0 4.0 0.25 0.125 0.0000 0.00000 0.000000 0.0000000 0.00000000 0.000000000 0.0000000000 0.0009765625  [5,] 8 0.5 0.25 0.000 0.0000 0.00000 0.000000 0.0000000 0.00000000 0.000000000 0.0019531250 0.0000000000  [6,] 1 0.5 0.00 0.000 0.0000 0.00000 0.000000 0.0000000 0.00000000 0.003906250 0.0000000000 0.0004882812  [7,] 1 0.0 0.00 0.000 0.0000 0.00000 0.000000 0.0000000 0.00781250 0.000000000 0.0009765625 0.0009765625  [8,] 0 0.0 0.00 0.000 0.0000 0.00000 0.000000 0.0156250 0.00000000 0.001953125 0.0019531250 0.0000000000  [9,] 0 0.0 0.00 0.000 0.0000 0.00000 0.031250 0.0000000 0.00390625 0.003906250 0.0000000000 0.0004882812 [10,] 0 0.0 0.00 0.000 0.0000 0.06250 0.000000 0.0078125 0.00781250 0.000000000 0.0009765625 
0.0000000000 [11,] 0 0.0 0.00 0.000 0.1250 0.00000 0.015625 0.0156250 0.00000000 0.001953125 0.0000000000 0.0000000000 [12,] 0 0.0 0.00 0.250 0.0000 0.03125 0.031250 0.0000000 0.00390625 0.000000000 0.0000000000 0.0004882812 [13,] 0 0.0 0.50 0.000 0.0625 0.06250 0.000000 0.0078125 0.00000000 0.000000000 0.0009765625 0.0000000000 [14,] 0 1.0 0.00 0.125 0.1250 0.00000 0.015625 0.0000000 0.00000000 0.001953125 0.0000000000 0.0024414062 [15,] 2 0.0 0.25 0.250 0.0000 0.03125 0.000000 0.0000000 0.00390625 0.000000000 0.0048828125 0.0014648438 [16,] 0 0.5 0.50 0.000 0.0625 0.00000 0.000000 0.0078125 0.00000000 0.009765625 0.0029296875 0.0039062500 [17,] 1 1.0 0.00 0.125 0.0000 0.00000 0.015625 0.0000000 0.01953125 0.005859375 0.0078125000 0.0151367188 [18,] 2 0.0 0.25 0.000 0.0000 0.03125 0.000000 0.0390625 0.01171875 0.015625000 0.0302734375 0.0019531250 [19,] 0 0.5 0.00 0.000 0.0625 0.00000 0.078125 0.0234375 0.03125000 0.060546875 0.0039062500 0.0029296875 [20,] 1 0.0 0.00 0.125 0.0000 0.15625 0.046875 0.0625000 0.12109375 0.007812500 0.0058593750 0.0253906250 [21,] 0 0.0 0.25 0.000 0.3125 0.09375 0.125000 0.2421875 0.01562500 0.011718750 0.0507812500 0.0253906250 [22,] 0 0.5 0.00 0.625 0.1875 0.25000 0.484375 0.0312500 0.02343750 0.101562500 0.0507812500 0.0063476562 [23,] 1 0.0 1.25 0.375 0.5000 0.96875 0.062500 0.0468750 0.20312500 0.101562500 0.0126953125 0.0009765625 [24,] 0 2.5 0.75 1.000 1.9375 0.12500 0.093750 0.4062500 0.20312500 0.025390625 0.0019531250 0.0000000000 [25,] 5 1.5 2.00 3.875 0.2500 0.18750 0.812500 0.4062500 0.05078125 0.003906250 0.0000000000 0.0019531250 [26,] 3 4.0 7.75 0.500 0.3750 1.62500 0.812500 0.1015625 0.00781250 0.000000000 0.0039062500 0.0029296875 [27,] 8 15.5 1.00 0.750 3.2500 1.62500 0.203125 0.0156250 0.00000000 0.007812500 0.0058593750 0.0009765625 [28,] 31 2.0 1.50 6.500 3.2500 0.40625 0.031250 0.0000000 0.01562500 0.011718750 0.0019531250 0.0000000000 [29,] 4 3.0 13.00 6.500 0.8125 0.06250 0.000000 0.0312500 0.02343750 
0.003906250 0.0000000000 0.0083007812 [30,] 6 26.0 13.00 1.625 0.1250 0.00000 0.062500 0.0468750 0.00781250 0.000000000 0.0166015625 0.0000000000 [31,] 52 26.0 3.25 0.250 0.0000 0.12500 0.093750 0.0156250 0.00000000 0.033203125 0.0000000000 0.0048828125 [32,] 52 6.5 0.50 0.000 0.2500 0.18750 0.031250 0.0000000 0.06640625 0.000000000 0.0097656250 0.0034179688 [33,] 13 1.0 0.00 0.500 0.3750 0.06250 0.000000 0.1328125 0.00000000 0.019531250 0.0068359375 0.0229492188 [34,] 2 0.0 1.00 0.750 0.1250 0.00000 0.265625 0.0000000 0.03906250 0.013671875 0.0458984375 0.0297851562 [35,] 0 2.0 1.50 0.250 0.0000 0.53125 0.000000 0.0781250 0.02734375 0.091796875 0.0595703125 0.0771484375 [36,] 4 3.0 0.50 0.000 1.0625 0.00000 0.156250 0.0546875 0.18359375 0.119140625 0.1542968750 0.0004882812 [37,] 6 1.0 0.00 2.125 0.0000 0.31250 0.109375 0.3671875 0.23828125 0.308593750 0.0009765625 0.0000000000 [38,] 2 0.0 4.25 0.000 0.6250 0.21875 0.734375 0.4765625 0.61718750 0.001953125 0.0000000000 0.0048828125 [39,] 0 8.5 0.00 1.250 0.4375 1.46875 0.953125 1.2343750 0.00390625 0.000000000 0.0097656250 0.0000000000 [40,] 17 0.0 2.50 0.875 2.9375 1.90625 2.468750 0.0078125 0.00000000 0.019531250 0.0000000000 0.0000000000 [41,] 0 5.0 1.75 5.875 3.8125 4.93750 0.015625 0.0000000 0.03906250 0.000000000 0.0000000000 0.0000000000 [42,] 10 3.5 11.75 7.625 9.8750 0.03125 0.000000 0.0781250 0.00000000 0.000000000 0.0000000000 0.0004882812 [43,] 7 23.5 15.25 19.750 0.0625 0.00000 0.156250 0.0000000 0.00000000 0.000000000 0.0009765625 0.0078125000 [44,] 47 30.5 39.50 0.125 0.0000 0.31250 0.000000 0.0000000 0.00000000 0.001953125 0.0156250000 0.0000000000 [45,] 61 79.0 0.25 0.000 0.6250 0.00000 0.000000 0.0000000 0.00390625 0.031250000 0.0000000000 0.0000000000 [46,] 158 0.5 0.00 1.250 0.0000 0.00000 0.000000 0.0078125 0.06250000 0.000000000 0.0000000000 0.0004882812 [47,] 1 0.0 2.50 0.000 0.0000 0.00000 0.015625 0.1250000 0.00000000 0.000000000 0.0009765625 0.0000000000 [48,] 0 5.0 0.00 0.000 0.0000 
0.03125 0.250000 0.0000000 0.00000000 0.001953125 0.0000000000 0.0000000000 [49,] 10 0.0 0.00 0.000 0.0625 0.50000 0.000000 0.0000000 0.00390625 0.000000000 0.0000000000 0.0000000000 [50,] 0 0.0 0.00 0.125 1.0000 0.00000 0.000000 0.0078125 0.00000000 0.000000000 0.0000000000 0.0000000000 [51,] 0 0.0 0.25 2.000 0.0000 0.00000 0.015625 0.0000000 0.00000000 0.000000000 0.0000000000 0.0000000000 [52,] 0 0.5 4.00 0.000 0.0000 0.03125 0.000000 0.0000000 0.00000000 0.000000000 0.0000000000 0.0000000000 [53,] 1 8.0 0.00 0.000 0.0625 0.00000 0.000000 0.0000000 0.00000000 0.000000000 0.0000000000 0.0000000000 [54,] 16 0.0 0.00 0.125 0.0000 0.00000 0.000000 0.0000000 0.00000000 0.000000000 0.0000000000 0.0000000000 [55,] 0 0.0 0.25 0.000 0.0000 0.00000 0.000000 0.0000000 0.00000000 0.000000000 0.0000000000 0.0000000000 [56,] 0 0.5 0.00 0.000 0.0000 0.00000 0.000000 0.0000000 0.00000000 0.000000000 0.0000000000 0.0000000000",r matlab machine-learning
4,How do I replace special characters in a URL?,"This is probably very simple, but I simply cannot find the answer myself :( Basicaly, what I want is, given this string: ""http://www.google.com/search?hl=en&q=c# objects"" I want this output: http://www.google.com/search?hl=en&q=c%23+objects I'm sure there's some helper class somewhere buried in the Framework that takes care of that for me, but I'm having trouble finding it. EDIT: I should add, that this is for a Winforms App.",c# url encoding
5,How to modify whois contact details?,"function modify(.......) {  $mcontact = file_get_contents( ""https://test.httpapi.com/api/contacts/modify.json?auth-userid=$uid&auth-password=$pass&contact-id=$cid&name=$name &company=$company&email=$email&address-line-1=$street&city=$city&country=$country&zipcode=$pincode&phone-cc=$countryCodeList[$phc]&phone=$phone"" );  $mdetails = json_decode( $mcontact, true );  return $mdetails; } using this modify function, displays warning mesage Warning: file_get_contents(https://...@hihfg.com&address-line-1=3,dfgdf,fgdf&city=dfgfd&country=India&zipcode=641005&phone-cc=91&phone=756657) [function.file-get-contents]: failed to open stream: HTTP request failed!  HTTP/1.0 400 Bad request in /home/gfdgfd/public_html/new_one/customer/account/class.whois.php  on line 49 Please help me, modify contact details..",php api file-get-contents
6,setting proxy in active directory environment,"I am using a machine on which active directory is configured. I am developing an application on the same machine. Now I want to do some performance testing of that application using the JMeter. Now when I start the JMeter proxy server, and set it in browser and try to browse the application I get an error ""Internet Explorer cannot display the webpage"". Am I missing anything?",proxy active-directory jmeter
7,How to draw barplot in this way with Coreplot,My image is cannot post so the link is my picture I want to draw a chart like the image in iOS app I use the CorePlot to help me to make this My Question: How to draw 3 barPlot whit 3 kinds color How to draw a barPlot from the CPTXYAxis 0 to -4000 in the Upside down way Any help would be appreciate Thanks all,core-plot
8,How to fetch an XML feed using asp.net,"I've decided to convert a Windows Phone 7 app that fetches an XML feed and then parses it to an asp.net web app, using Visual Web Developer Express. I figure since the code already works for WP7, it should be a matter of mostly copying and pasting it for the C# code behind. HttpWebRequest request = HttpWebRequest.CreateHttp(""http://webservices.nextbus.com/service/publicXMLFeed?command=routeConfig&a=sf-muni&r="" + line1); That's the first line of code from my WP7 app that fetches the XML feed, but I can't even get HttpWebRequest to work in Visual Web Developer like that. Intellisense shows a create and createdefault, but no CreateHttp like there was in Windows Phone 7. I just need to figure out how to fetch the page, I assume the parsing will be the same as on my phone app. Any help? Thanks, Amanda",c# asp.net windows-phone-7
9,.NET library for generating javascript?,"Do you know of a .NET library for generating javascript code? I want to generate javascript code based on information in my .NET application. I would like to be able to create an AST-like datastructure (using C#) and have it turned into valid javascript. I need to be able to create functions, statements, expressions etc., so I need something more than a JSON serializer - but I guess you could think of this as a (very) generalized JSON serializer. Do such libraries exist and if so, could you recommend any? Thank you.",.net javascript code-generation
10,"SQL Server : procedure call, inline concatenation impossible?","I'm using SQL Server 2008 R2 and was wondering if there is another way of writing something like EXEC dbo.myProcedure (SELECT columnName FROM TableName) or EXEC dbo.myProcedure @myStringVariable + 'other text' so that these procedure calls actually work, without putting the whole stuff into a variable first.",sql variables parameters procedure calls


In [0]:
train.count()

In [0]:
# Drop Id, then remove rows that repeat the same Body, Title, and Tags
train = train.select('Body','Title','Tags').distinct()
train.count()
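
Because `Id` is dropped before `distinct()` is applied, two rows count as duplicates when their body, title, and tags all match exactly, even if they were posted under different Ids. A plain-Python sketch of that rule follows. The `distinct_questions` helper and the sample rows are hypothetical, and unlike Spark this version keeps first-seen order.

```python
def distinct_questions(rows):
    """Keep the first row for each unique (Body, Title, Tags) combination."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["Body"], row["Title"], row["Tags"])
        if key not in seen:
            seen.add(key)
            # Id is intentionally left out, mirroring the select() above.
            unique.append({"Body": key[0], "Title": key[1], "Tags": key[2]})
    return unique


# Hypothetical rows: Ids 1 and 2 differ only in Id, so they collapse into one.
example = [
    {"Id": 1, "Title": "t1", "Body": "b1", "Tags": "php"},
    {"Id": 2, "Title": "t1", "Body": "b1", "Tags": "php"},
    {"Id": 3, "Title": "t1", "Body": "b1", "Tags": "php image"},
]
deduped = distinct_questions(example)
print(len(deduped))
```

Note that the third row survives: identical text with a different tag string is treated as a separate question.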

## Collection of New Data
* Used **Query Stack Overflow** in the **Stack Exchange Data Explorer** to extract 11,298 Stack Overflow questions posted after 2013

In [0]:
# File location and type
file_location = "/FileStore/tables/Query_post2013-1.csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","
multiline = "true"
escape = "\""

# The applied options are for CSV files. For other file types, these will be ignored.
new_data = spark.read.csv(file_location,
                  inferSchema = infer_schema, 
                  sep = delimiter, 
                  header = first_row_is_header, 
                  multiLine = multiline, 
                  escape = escape)

In [0]:
display(new_data)

Id,PostTypeId,AcceptedAnswerId,ParentId,CreationDate,DeletionDate,Score,ViewCount,Body,OwnerUserId,OwnerDisplayName,LastEditorUserId,LastEditorDisplayName,LastEditDate,LastActivityDate,Title,Tags,AnswerCount,CommentCount,FavoriteCount,ClosedDate,CommunityOwnedDate,ContentLicense
34432310,2,,34432089.0,2015-12-23 09:24,,1,,"Facets automate some parts of project configuration and deployment. For example inform you when servlet is not added to web.xml in dynamic web project. Stuff like that. When you add JPA facet, eclipse will create persistence.xml and will keep notifying when you create entity class but don't configure it in persistence file. There's lots more, adding facets to projects also reconfigures how your project structure looks in eclipse. Basically they do exactly as you quoted: "" Adds support for writing applications using Java programming  language. Every facet add something new so if you want more specific answer you have to answer question about specific facet.  If you don't know what they can do for you - turn them off. You can always add them when you learn more. Real life example: Adding JPA facet messed up my project using ObjectDB by creating persistence file, which, turned out, I didn't even need.",2581593.0,,2581593.0,,2015-12-23 09:30,2015-12-23 09:30,,,,3,,,,CC BY-SA 3.0
34432311,2,,34431166.0,2015-12-23 09:24,,1,,"To achieve this you can use blocks. You need to add @property (nonatomic, copy) void (^didSelectAction)(NSIndexPath *indexPath); to view controller which is shown in popover. than in tableView: didSelectRowAtIndexPath: call this block - (void) tableView:(UITableView *)tableView didSelectRowAtIndexPath:(NSIndexPath *)indexPath {  if (self.didSelectAction)  self.didSelectAction(indexPath); } So when you create a popover you should provide additional handler. Something like this Add new action to your button - (UITableViewCell *) tableView:(UITableView *)tv cellForRowAtIndexPath:(NSIndexPath *)indexPath {  CustomCell *cell = [tableView dequeueReusableCellWithIdentifier:@""CustomCell""];  [[cell button] addTarget:self action:@selector(showPopoverFromButton:) forControlEvents:UIControlEventTouchUpInside]; } - (void) showPopoverFromButton:(UIButton *)sender {  //Your table view which is shown in popover  UITableViewController *controller = [[UITableViewController alloc] init];  [controller setDidSelectAction:^{  [sender setBackgroundColor:[UIColor redColor]];  }];  FPPopoverMenu *popover = [[FPPopoverController alloc] initWithViewController:controller]; [popover show]; }",1176219.0,,1176219.0,,2015-12-23 11:20,2015-12-23 11:20,,,,7,,,,CC BY-SA 3.0
34432312,2,,34431942.0,2015-12-23 09:24,,0,,"it's me again. For Q1, What is the event for tfDateNaissanceEditing function you're listening? What's your purpose for tfDateNaissance.resignFirstResponder()? I guess what happened there is that you listen the startEditing event. So the date picker is set as input view the first time you touch it, and immediately is recalled by the last line of code. Theoretically, you could achieve what you want by just have some code like the following in viewDidLoad ... // your other functions let datePickerView : UIDatePicker = UIDatePicker() datePickerView.datePickerMode = UIDatePickerMode.Date UITextField.inputView = datePickerView datePickerView.addTarget(self, action: Selector(""datePickerValueChanged:""), forControlEvents: UIControlEvents.ValueChanged) Q2. Currently every time the value of the DatePicker changes, it will resign itself, so remove tfDateNaissance.resignFirstResponder() in datePickerValueChanged function will solve the problem. If you want to have a custom toolbar with done button, you can manually create one and set it as input accessory view also in viewDidLoad. You may also achieve this by using navigation bar like this answer BTW, didReceiveMemoryWarning function is not needed for this question.",2710486.0,,-1.0,,2017-05-23 12:15,2015-12-23 09:24,,,,3,,,,CC BY-SA 3.0
34432314,1,34433183.0,,2015-12-23 09:24,,1,537.0,"I'm working on a Java program that try to reset a user's password in oracle and sql-server. This password is a random generated password that will have some character that is not acceptable as a normal string. Eg. ',"""",; The command that I'm using to reset user's password is : oracle: ALTER USER <username> IDENTIFIED BY <password> sql-server: ALTER LOGIN <username> WITH PASSWORD = '<passowrd>' How can I do this reset so that it can accept all kind of special character? I did google and found out about quoting method:. Also I did found out about using single code and double code. But what if the password generated have a "" or same quote delimiter inside that password? Then it will be a problem. Eg. IDENTIFIED BY 'jks'k""fjh''d' Eg. password = q[#kkksdj#jsksls#] Eg. password = ""nm.js""""kh:kjhs"" Is there any way for me to do this inside the oracle and sql-server? Or do I need to escape each character one by one from java before sending to oracle/sql-server? My reset program for oracle and sql-server are different. So the method can be different.",2775433.0,,2775433.0,,2015-12-23 09:52,2015-12-23 10:09,set escape character in oracle,sql sql-server oracle oracle11g escaping,1.0,3,,,,CC BY-SA 3.0
34432316,2,,34414345.0,2015-12-23 09:24,,1,,"Yes, it's possible. Here are all the steps : git clone xxx.git composer install (make sure you have included .env.example in your git for the app key) npm install bower install php artisan migrate (if, I hope, you use migrations) gulp And you are ready to work on your project.",2559851.0,,2559851.0,,2015-12-23 12:03,2015-12-23 12:03,,,,0,,,,CC BY-SA 3.0
34432317,2,,34432150.0,2015-12-23 09:24,,4,,"The $role variable within the handle parameters will contain the variable passed in after role: so role:editor will return ""editor""",5388039.0,,,,,2015-12-23 09:24,,,,2,,,,CC BY-SA 3.0
34432318,2,,34431624.0,2015-12-23 09:24,,0,,"Use bootstrap classes: <div class=""panel panel-default"">  <div class=""panel-heading text-center"">  <h3 class=""panel-title""><b>Owner</b></h3>  </div>  <div class=""panel-body"" id=""owners"">  <ul class=""list-unstyled list-inline"">  <li class=""text-center"">  <img src=""http://cdn.akamai.steamstatic.com/steamcommunity/public/images/avatars/40/405b7b5da64cc1dbcb68110ca5e65a9c751b79a0_full.jpg"" height=""64"">  <h4>Firav</h4>  </li>  <li class=""text-center"">  <img src=""http://cdn.akamai.steamstatic.com/steamcommunity/public/images/avatars/66/66cd4ded6f8bc2761f64c110ff8f8b93e568082e_full.jpg"" height=""64"">  <h4>Donnyy</h4>  </li>  </ul>  </div> </div> jsfiddle",4377017.0,,,,,2015-12-23 09:24,,,,0,,,,CC BY-SA 3.0
34432319,1,,,2015-12-23 09:24,,-4,48.0,"I'm needing to return the value of a string after it's been split in VB.Net. The string will be something along the lines of: someexpression1 OR someexpression2 OR someexpression3 OR someexpression4 OR someexpression5 The string can't contain more than 3 of these expressions so I need to retrieve everything after someexpression3. After the split I would need the following ""OR someexpression4 OR someexpression5"", the full string will always be different lengths so I need something dynamic in order to capture the last part of the string.",5710478.0,,5710478.0,,2015-12-23 10:53,2015-12-23 13:29,Retrieve everything after a split in string VB.net,vb.net,2.0,6,,,,CC BY-SA 3.0
34432320,2,,28061319.0,2015-12-23 09:24,,32,,"For me to freeze the first row following code worked. I am not sure what is logic there.  worksheet.View.FreezePanes(2,1);",5668583.0,,,,,2015-12-23 09:24,,,,4,,,,CC BY-SA 3.0
34432321,2,,34430676.0,2015-12-23 09:24,,0,,Subclass NSTextField and override following method to return nil. override func hitTest(aPoint: NSPoint) -> NSView? {  return nil },1196508.0,,,,,2015-12-23 09:24,,,,0,,,,CC BY-SA 3.0


In [0]:
new_data.count()

In [0]:
# Keep only the columns needed later, then remove duplicate rows
new_data = new_data.select('Body','Title','Tags', 'Score').distinct()
new_data.count()

## Filter New Data
* Filter out questions that do not have a title, body, or tags
* Keep only high-quality data (questions with a score > 0)
* 2,105 new questions remain after filtering

In [0]:
new_data = new_data.where(col("Title").isNotNull()).where(col("Body").isNotNull()).where(col("Tags").isNotNull())
new_data = new_data.where(col("Score")>0)
new_data.count()
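
In the displayed sample, rows with `PostTypeId` 2 (answers) have no title or tags, so the null checks appear to remove answers as well as incomplete questions. The same two rules can be sketched in plain Python. The `keep_row` helper is hypothetical, and the sample rows are abbreviated stand-ins loosely modelled on the output above.

```python
def keep_row(row):
    """Return True if the row has a title, body, and tags and a positive score."""
    has_text = all(row.get(name) is not None for name in ("Title", "Body", "Tags"))
    score = row.get("Score")
    # A missing score is treated as failing the filter, as a null comparison would in Spark.
    return has_text and score is not None and score > 0


example = [
    {"Title": "set escape character in oracle", "Body": "question text", "Tags": "sql oracle", "Score": 1},
    {"Title": None, "Body": "answer text", "Tags": None, "Score": 32},
    {"Title": "Retrieve everything after a split", "Body": "question text", "Tags": "vb.net", "Score": -4},
]
kept = [row for row in example if keep_row(row)]
print(len(kept))
```

Only the first row passes: the second has no title or tags, and the third has a negative score.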

# Compare Datasets

## Original Data
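
Tags arrive as one space-separated string per question (for example `php image-processing file-upload upload mime-types`), so they have to be split into a list before the datasets can be compared. A minimal plain-Python sketch of that step is shown here. The `split_tags` name is hypothetical, and returning `None` for a missing tag string is an assumption about how such rows should be handled.

```python
def split_tags(line):
    """Split a whitespace-separated tag string into a list of tags."""
    if line is None:
        # Assumption: rows without tags keep a missing value instead of raising.
        return None
    # str.split() with no arguments splits on any run of whitespace
    # and ignores leading and trailing spaces.
    return line.split()


print(split_tags("php image-processing file-upload upload mime-types"))
print(split_tags("  c#   url  encoding "))
```

Splitting on arbitrary whitespace means stray double spaces do not produce empty tags.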

In [0]:
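# Split the space-separated tag string into an array of tags; null tags stay null
tags_udf = udf(lambda line: line.split() if line is not None else None, ArrayType(StringType()))
# With Spark's default case-insensitive column resolution, 'tags' replaces the existing 'Tags' column
train = train.withColumn('tags', tags_udf('Tags')).select('Title', 'Body', 'tags')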
display(train)

Title,Body,tags
How do commercial obfuscators achieve to crash .net Reflector and ILDASM?,"Some commercial obfuscators claim they can crash ILDASM (and other similar tools such as Reflector) Any idea on how they achieve that? As stated in numerous threads here, someone with enough motivation/time/skill will always find a way to read your code (aka if it's runnable, it's decompilable), but it seems to me that most casual code readers won't bother decompiling my code if Reflector can't do it for them. This level of protection of my IP (ie, protected against anybody but the hardcore guys who would probably find a way around every single trick I would throw at them anyway) would definitely be enough for me.","List(.net, obfuscation, reflector)"
I need help with this error: java.lang.NoSuchMethodError,"I have this Java code (JPA): String queryString = ""SELECT b , sum(v.votedPoints) as votedPoint "" +  "" FROM Bookmarks b "" +  "" LEFT OUTER JOIN Votes v "" +  "" on (v.organizationId = b.organizationId) "" + ""WHERE b.userId = 101 "" + ""GROUP BY b.organizationId "" +  ""ORDER BY votedPoint ascending ""; EntityManager em = getEntityManager(); Query query = em.createQuery(queryString); query.setFirstResult(start); query.setMaxResults(numRecords); List results = query.getResultList(); I don't know what is wrong with my query because it gives me this error: java.lang.NoSuchMethodError: org.hibernate.hql.antlr.HqlBaseParser.recover(Lantlr/RecognitionException;Lantlr/collections/impl/BitSet;)V  at org.hibernate.hql.antlr.HqlBaseParser.fromJoin(HqlBaseParser.java:1802)  at org.hibernate.hql.antlr.HqlBaseParser.fromClause(HqlBaseParser.java:1420)  at org.hibernate.hql.antlr.HqlBaseParser.selectFrom(HqlBaseParser.java:1130)  at org.hibernate.hql.antlr.HqlBaseParser.queryRule(HqlBaseParser.java:702)  at org.hibernate.hql.antlr.HqlBaseParser.selectStatement(HqlBaseParser.java:296)  at org.hibernate.hql.antlr.HqlBaseParser.statement(HqlBaseParser.java:159)  at org.hibernate.hql.ast.QueryTranslatorImpl.parse(QueryTranslatorImpl.java:271)  at org.hibernate.hql.ast.QueryTranslatorImpl.doCompile(QueryTranslatorImpl.java:180)  at org.hibernate.hql.ast.QueryTranslatorImpl.compile(QueryTranslatorImpl.java:134)  at org.hibernate.engine.query.HQLQueryPlan.(HQLQueryPlan.java:101)  at org.hibernate.engine.query.HQLQueryPlan.(HQLQueryPlan.java:80)  at org.hibernate.engine.query.QueryPlanCache.getHQLQueryPlan(QueryPlanCache.java:94)  at org.hibernate.impl.AbstractSessionImpl.getHQLQueryPlan(AbstractSessionImpl.java:156)  at org.hibernate.impl.AbstractSessionImpl.createQuery(AbstractSessionImpl.java:135)  at org.hibernate.impl.SessionImpl.createQuery(SessionImpl.java:1650) Thanks.","List(java, hibernate, reflection, 
nosuchmethoderror)"
Cisco ASA5505 8.2 Multiple Outside IP to Multiple Inside IP,"Trying to setup ASA5505. Semi working but having issues with accessing services from the outside. ASA5505 Basic License, Version 8.2. (plus upgrade to unlimited inside hosts). Alert: I'm a Cisco Noob. 10.10.39.X is a place holder for privacy. (EDIT: to be less confusing) I came up with this config and tested it tonight. ASA Version 8.2(1) ! hostname <removed> domain-name <removed> enable password <removed> encrypted passwd <removed> encrypted names ! interface Vlan1  nameif inside  security-level 100  ip address 172.21.36.1 255.255.252.0 ! interface Vlan2  nameif outside  security-level 0  ip address 10.10.39.10 255.255.255.248 ! interface Ethernet0/0  switchport access vlan 2 ! interface Ethernet0/1 ! interface Ethernet0/2 ! interface Ethernet0/3 ! interface Ethernet0/4 ! interface Ethernet0/5 ! interface Ethernet0/6 ! interface Ethernet0/7 ! ftp mode passive dns server-group DefaultDNS  domain-name <removed> access-list outside_inbound extended permit tcp any host 10.10.39.10 eq pptp access-list outside_inbound extended permit tcp any host 10.10.39.11 eq https access-list outside_inbound extended permit tcp any host 10.10.39.11 eq 993 access-list outside_inbound extended permit tcp any host 10.10.39.11 eq smtp access-list outside_inbound extended permit tcp any host 10.10.39.11 eq 1001 access-list outside_inbound extended permit tcp any host 10.10.39.11 eq 465 access-list outside_inbound extended permit tcp any host 10.10.39.11 eq domain access-list outside_inbound extended permit udp any eq domain host 10.10.39.11 eq domain access-list outside_inbound extended permit tcp any host 10.10.39.12 eq www access-list outside_inbound extended permit tcp any host 10.10.39.12 eq https access-list outside_inbound extended permit tcp any host 10.10.39.13 eq www access-list outside_inbound extended permit tcp any host 10.10.39.13 eq https access-list outside_inbound extended permit icmp any any echo-reply 
access-list outside_inbound extended permit icmp any any source-quench access-list outside_inbound extended permit icmp any any unreachable access-list outside_inbound extended permit icmp any any time-exceeded access-list outside_inbound extended permit icmp any any traceroute access-list outside_inbound extended permit icmp any any echo pager lines 24 logging asdm informational mtu inside 1500 mtu outside 1500 icmp unreachable rate-limit 1 burst-size 1 no asdm history enable arp timeout 14400 global (outside) 2 10.10.39.11-10.10.39.14 netmask 255.255.255.248 global (outside) 1 interface nat (inside) 1 0.0.0.0 0.0.0.0 static (inside,outside) tcp interface pptp 172.21.37.20 pptp netmask 255.255.255.255 static (inside,outside) 10.10.39.11 172.21.37.14 netmask 255.255.255.255 static (inside,outside) 10.10.39.12 172.21.37.24 netmask 255.255.255.255 static (inside,outside) 10.10.39.13 172.21.37.17 netmask 255.255.255.255 access-group outside_inbound in interface outside route outside 0.0.0.0 0.0.0.0 10.10.39.9 1 route inside 192.168.15.0 255.255.255.0 172.21.36.52 1 timeout xlate 3:00:00 timeout conn 1:00:00 half-closed 0:10:00 udp 0:02:00 icmp 0:00:02 timeout sunrpc 0:10:00 h323 0:05:00 h225 1:00:00 mgcp 0:05:00 mgcp-pat 0:05:00 timeout sip 0:30:00 sip_media 0:02:00 sip-invite 0:03:00 sip-disconnect 0:02:00 timeout sip-provisional-media 0:02:00 uauth 0:05:00 absolute timeout tcp-proxy-reassembly 0:01:00 dynamic-access-policy-record DfltAccessPolicy http server enable http 172.21.36.0 255.255.252.0 inside no snmp-server location no snmp-server contact snmp-server enable traps snmp authentication linkup linkdown coldstart crypto ipsec security-association lifetime seconds 28800 crypto ipsec security-association lifetime kilobytes 4608000 telnet 172.21.36.0 255.255.252.0 inside telnet timeout 60 ssh timeout 5 console timeout 0 threat-detection basic-threat threat-detection statistics access-list no threat-detection statistics tcp-intercept webvpn ! 
class-map inspection_default  match default-inspection-traffic ! ! policy-map type inspect dns preset_dns_map  parameters  message-length maximum 512 policy-map global_policy  class inspection_default  inspect dns preset_dns_map  inspect ftp  inspect h323 h225  inspect h323 ras  inspect rsh  inspect rtsp  inspect sqlnet  inspect skinny  inspect sunrpc  inspect xdmcp  inspect sip  inspect netbios  inspect tftp  inspect pptp  inspect ipsec-pass-thru  inspect http ! service-policy global_policy global prompt hostname context The servers that had static forwards did not have any outside network access. couldn't ping google.com for instance. mail server couldn't Domain POP the Barracuda spam filter from our ISP etc. So after doing some reading I removed the statics for 10.10.39.11, 12 and 13, and replaced those three with what's below.. (Edit: corrected IPs in this statment.) static (inside,outside) tcp 10.10.39.11 https 172.21.37.14 https netmask 255.255.255.255 static (inside,outside) tcp 10.10.39.11 993 172.21.37.14 993 netmask 255.255.255.255 static (inside,outside) tcp 10.10.39.11 smtp 172.21.37.14 smtp netmask 255.255.255.255 static (inside,outside) tcp 10.10.39.11 1001 172.21.37.14 1001 netmask 255.255.255.255 static (inside,outside) tcp 10.10.39.11 465 172.21.37.14 465 netmask 255.255.255.255 static (inside,outside) tcp 10.10.39.11 domain 172.21.37.14 domain netmask 255.255.255.255 static (inside,outside) tcp 10.10.39.12 www 172.21.37.24 www netmask 255.255.255.255 static (inside,outside) tcp 10.10.39.12 https 172.21.37.24 https netmask 255.255.255.255 static (inside,outside) tcp 10.10.39.13 www 172.21.37.17 www netmask 255.255.255.255 static (inside,outside) tcp 10.10.39.13 https 172.21.37.17 https netmask 255.255.255.255 Now the servers (for instance 172.21.37.14) could ping the outside world again. Mail started flowing (Domain POP was successful) etc. etc. But I forgot to check if webmail worked from the outside admittedly. 
But the webservers at 172.21.37.17 and 172.21.37.24 still didn't respond from the outside world. Although I was able to PPTP VPN in on 10.10.39.10 (interface) which is the outside interface IP address. and it is static mapped to 172.21.37.20. So I'm thinking there must be something wrong with NAT somewhere? no response from 10.10.39.11 to 10.10.39.14.. Could anyone look over the config and please let me know what I've done wrong? Is there something I've missed? well obviously but.. please help! Thank you.",List(cisco-asa)
How can i access my custom webpart in sharepoint foundation 2010?,"So i just started trying to develop a simple webpart today for a sharepoint foundation i put on a virtual machine. I have no previous experience with sharepoint whatsoever. As i cant run a sharepoint 2010 on my local machine for dev purposes i followed advices in this thread http://social.technet.microsoft.com/Forums/en/sharepoint2010programming/thread/cda807f6-4edf-4efc-8e9b-4d446356c8ae to able to actually develop something (just the registry bit). I created the simple test web part (writes out ""hi""), uploaded it to virtual machine, added it with add-spsolution and install-spsolution in powershell with success. When i do get-solution through powershell on my webpart it says deployed = true. What am i missing from here to get it to actually show up somewhere in the web interface so i can add it to a page? Cheers","List(sharepoint, sharepoint2010)"
log4net creates new log every minute,"log4net in my project creates new log file every minute. I would like to have just one file per instance of my application, but every instance that runs should create new log file. This is from my app.config file: <appender name=""file"" type=""log4net.Appender.RollingFileAppender"">  <file value=""C:\\Logs\\log2_""/>  <rollingStyle value=""Date""/>  <datePattern value=""MMdd_HHmmss.\tx\t""/>  <staticLogFileName value=""false""/>  <appendToFile value=""true""/>  <maximumFileSize value=""500MB""/>  <layout type=""log4net.Layout.PatternLayout"">  <conversionPattern value=""%date %-5level %message%newline""/>  </layout> </appender> What is the error here?","List(c#, log4net)"
"Adding handler to form inside div, in the future","I am using the following code to direct the results from a form to a specific div. $(window).load(function () {  $(""#form1"").submit(function() {  $.post($(this).attr(""action""), $(this).serialize(), function(html) {  $(""#resultsDiv"").html(html);  });  return false; // prevent normal submit  }); }); How can I apply this (or any) handler to future forms that may be created within an updated div ( with new yet to created content inserted into the div at some point in the future)? I have looked at the .on but I do not see an event for the updating or reloading of a div. I have tried adding a similar function to the above, but replacing (window) with (""#thefutureDivID""), but no luck.","List(events, div, event-handling, jquery-live)"
ASP.NET User Controls Cross-Communication,"The scenario: 2 user controls (foo.ascx and fum.ascx) foo has a method that would really like to access a property from fum. They live on the same page, but I can't find a very simple way to accomplish this sort of communication. Any ideas?","List(c#, asp.net, vb.net, ascx)"
Distributed mysql synching,"I'm running a server which is connected to an SQL host. I have an another server and I decided to run it as an SQL backup. So, I have 3 of them. Srv A is the SQL host, srv B is the backup. I know there's mysql replication, but it's simply not for what I like (correct me if I'm wrong). I'd like something distributed, so if the srv A comes back, it won't overwrite the database built during the downtime on srv B. I only have 3 servers, so setting up a cluster is not an option. I'd be glad if anybody could help me with that.","List(mysql, distribution)"
Showing numbers as binary from a bound source,I need to display a number as binary string (e.g. 8 => 1000). Sure I can convert it using BitConverter and set the text of my TextBox on my own in the code behind file. But this looks somewhat ugly. Is it possible to bind the TextBox to some source and convert it automatically?,"List(binary, binding, ivalueconverter)"
'Global' may not respond to '+setShow',hai  I am new iphone. In my application there is warning 'Global' may not respond to '+setShow'  [Global setShow:True]; Please help me. Thank you,List(iphone)


In [0]:
tag_counts = train.select('tags').rdd.flatMap(lambda l : [(w,1) for w in l.tags]).reduceByKey(lambda a,b: a+b).collect()
tag_counts.sort(key=lambda tup: -tup[1])

display(tag_counts)

_1,_2
c#,331505
java,299414
php,284103
javascript,265423
android,235436
jquery,221533
c++,143936
python,134137
iphone,128681
asp.net,125651
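
The `flatMap`/`reduceByKey` pattern in the cell above is a distributed word count over the tag lists. As a minimal local sketch of the same logic (the toy `rows` list is hypothetical and not taken from the dataset), `collections.Counter` yields the same kind of (tag, count) pairs, sorted by descending count:

```python
from collections import Counter

# Hypothetical toy rows: each row is a list of tags, like the 'tags' column.
rows = [["c#", "asp.net"], ["java"], ["c#", "log4net"], ["java", "hibernate"], ["c#"]]

# Equivalent of flatMap (emit every tag) followed by reduceByKey (sum the ones).
tag_counts = Counter(tag for tags in rows for tag in tags)

# Equivalent of sorting the collected pairs by descending count.
sorted_counts = sorted(tag_counts.items(), key=lambda tup: -tup[1])
print(sorted_counts[:2])
```

On the full dataset the Spark version is preferable, since only the aggregated counts are collected to the driver.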


## New Data

In [0]:
new_data = new_data.withColumn('tags', tags_udf('Tags')).select('Title', 'Body', 'tags')

In [0]:
new_tag_counts = new_data.select('tags').rdd.flatMap(lambda l : [(w,1) for w in l.tags]).reduceByKey(lambda a,b: a+b).collect()
new_tag_counts.sort(key=lambda tup: -tup[1])

display(new_tag_counts)

_1,_2
javascript,230
java,201
python,183
c#,143
android,129
php,102
html,94
c++,83
ios,79
jquery,69


# Combine Datasets
* 4,208,419 questions in the combined training data set

In [0]:
result = train.union(new_data)
display(result)

Title,Body,tags
How do commercial obfuscators achieve to crash .net Reflector and ILDASM?,"Some commercial obfuscators claim they can crash ILDASM (and other similar tools such as Reflector) Any idea on how they achieve that? As stated in numerous threads here, someone with enough motivation/time/skill will always find a way to read your code (aka if it's runnable, it's decompilable), but it seems to me that most casual code readers won't bother decompiling my code if Reflector can't do it for them. This level of protection of my IP (ie, protected against anybody but the hardcore guys who would probably find a way around every single trick I would throw at them anyway) would definitely be enough for me.","List(.net, obfuscation, reflector)"
I need help with this error: java.lang.NoSuchMethodError,"I have this Java code (JPA): String queryString = ""SELECT b , sum(v.votedPoints) as votedPoint "" +  "" FROM Bookmarks b "" +  "" LEFT OUTER JOIN Votes v "" +  "" on (v.organizationId = b.organizationId) "" + ""WHERE b.userId = 101 "" + ""GROUP BY b.organizationId "" +  ""ORDER BY votedPoint ascending ""; EntityManager em = getEntityManager(); Query query = em.createQuery(queryString); query.setFirstResult(start); query.setMaxResults(numRecords); List results = query.getResultList(); I don't know what is wrong with my query because it gives me this error: java.lang.NoSuchMethodError: org.hibernate.hql.antlr.HqlBaseParser.recover(Lantlr/RecognitionException;Lantlr/collections/impl/BitSet;)V  at org.hibernate.hql.antlr.HqlBaseParser.fromJoin(HqlBaseParser.java:1802)  at org.hibernate.hql.antlr.HqlBaseParser.fromClause(HqlBaseParser.java:1420)  at org.hibernate.hql.antlr.HqlBaseParser.selectFrom(HqlBaseParser.java:1130)  at org.hibernate.hql.antlr.HqlBaseParser.queryRule(HqlBaseParser.java:702)  at org.hibernate.hql.antlr.HqlBaseParser.selectStatement(HqlBaseParser.java:296)  at org.hibernate.hql.antlr.HqlBaseParser.statement(HqlBaseParser.java:159)  at org.hibernate.hql.ast.QueryTranslatorImpl.parse(QueryTranslatorImpl.java:271)  at org.hibernate.hql.ast.QueryTranslatorImpl.doCompile(QueryTranslatorImpl.java:180)  at org.hibernate.hql.ast.QueryTranslatorImpl.compile(QueryTranslatorImpl.java:134)  at org.hibernate.engine.query.HQLQueryPlan.(HQLQueryPlan.java:101)  at org.hibernate.engine.query.HQLQueryPlan.(HQLQueryPlan.java:80)  at org.hibernate.engine.query.QueryPlanCache.getHQLQueryPlan(QueryPlanCache.java:94)  at org.hibernate.impl.AbstractSessionImpl.getHQLQueryPlan(AbstractSessionImpl.java:156)  at org.hibernate.impl.AbstractSessionImpl.createQuery(AbstractSessionImpl.java:135)  at org.hibernate.impl.SessionImpl.createQuery(SessionImpl.java:1650) Thanks.","List(java, hibernate, reflection, 
nosuchmethoderror)"
Cisco ASA5505 8.2 Multiple Outside IP to Multiple Inside IP,"Trying to setup ASA5505. Semi working but having issues with accessing services from the outside. ASA5505 Basic License, Version 8.2. (plus upgrade to unlimited inside hosts). Alert: I'm a Cisco Noob. 10.10.39.X is a place holder for privacy. (EDIT: to be less confusing) I came up with this config and tested it tonight. ASA Version 8.2(1) ! hostname <removed> domain-name <removed> enable password <removed> encrypted passwd <removed> encrypted names ! interface Vlan1  nameif inside  security-level 100  ip address 172.21.36.1 255.255.252.0 ! interface Vlan2  nameif outside  security-level 0  ip address 10.10.39.10 255.255.255.248 ! interface Ethernet0/0  switchport access vlan 2 ! interface Ethernet0/1 ! interface Ethernet0/2 ! interface Ethernet0/3 ! interface Ethernet0/4 ! interface Ethernet0/5 ! interface Ethernet0/6 ! interface Ethernet0/7 ! ftp mode passive dns server-group DefaultDNS  domain-name <removed> access-list outside_inbound extended permit tcp any host 10.10.39.10 eq pptp access-list outside_inbound extended permit tcp any host 10.10.39.11 eq https access-list outside_inbound extended permit tcp any host 10.10.39.11 eq 993 access-list outside_inbound extended permit tcp any host 10.10.39.11 eq smtp access-list outside_inbound extended permit tcp any host 10.10.39.11 eq 1001 access-list outside_inbound extended permit tcp any host 10.10.39.11 eq 465 access-list outside_inbound extended permit tcp any host 10.10.39.11 eq domain access-list outside_inbound extended permit udp any eq domain host 10.10.39.11 eq domain access-list outside_inbound extended permit tcp any host 10.10.39.12 eq www access-list outside_inbound extended permit tcp any host 10.10.39.12 eq https access-list outside_inbound extended permit tcp any host 10.10.39.13 eq www access-list outside_inbound extended permit tcp any host 10.10.39.13 eq https access-list outside_inbound extended permit icmp any any echo-reply 
access-list outside_inbound extended permit icmp any any source-quench access-list outside_inbound extended permit icmp any any unreachable access-list outside_inbound extended permit icmp any any time-exceeded access-list outside_inbound extended permit icmp any any traceroute access-list outside_inbound extended permit icmp any any echo pager lines 24 logging asdm informational mtu inside 1500 mtu outside 1500 icmp unreachable rate-limit 1 burst-size 1 no asdm history enable arp timeout 14400 global (outside) 2 10.10.39.11-10.10.39.14 netmask 255.255.255.248 global (outside) 1 interface nat (inside) 1 0.0.0.0 0.0.0.0 static (inside,outside) tcp interface pptp 172.21.37.20 pptp netmask 255.255.255.255 static (inside,outside) 10.10.39.11 172.21.37.14 netmask 255.255.255.255 static (inside,outside) 10.10.39.12 172.21.37.24 netmask 255.255.255.255 static (inside,outside) 10.10.39.13 172.21.37.17 netmask 255.255.255.255 access-group outside_inbound in interface outside route outside 0.0.0.0 0.0.0.0 10.10.39.9 1 route inside 192.168.15.0 255.255.255.0 172.21.36.52 1 timeout xlate 3:00:00 timeout conn 1:00:00 half-closed 0:10:00 udp 0:02:00 icmp 0:00:02 timeout sunrpc 0:10:00 h323 0:05:00 h225 1:00:00 mgcp 0:05:00 mgcp-pat 0:05:00 timeout sip 0:30:00 sip_media 0:02:00 sip-invite 0:03:00 sip-disconnect 0:02:00 timeout sip-provisional-media 0:02:00 uauth 0:05:00 absolute timeout tcp-proxy-reassembly 0:01:00 dynamic-access-policy-record DfltAccessPolicy http server enable http 172.21.36.0 255.255.252.0 inside no snmp-server location no snmp-server contact snmp-server enable traps snmp authentication linkup linkdown coldstart crypto ipsec security-association lifetime seconds 28800 crypto ipsec security-association lifetime kilobytes 4608000 telnet 172.21.36.0 255.255.252.0 inside telnet timeout 60 ssh timeout 5 console timeout 0 threat-detection basic-threat threat-detection statistics access-list no threat-detection statistics tcp-intercept webvpn ! 
class-map inspection_default  match default-inspection-traffic ! ! policy-map type inspect dns preset_dns_map  parameters  message-length maximum 512 policy-map global_policy  class inspection_default  inspect dns preset_dns_map  inspect ftp  inspect h323 h225  inspect h323 ras  inspect rsh  inspect rtsp  inspect sqlnet  inspect skinny  inspect sunrpc  inspect xdmcp  inspect sip  inspect netbios  inspect tftp  inspect pptp  inspect ipsec-pass-thru  inspect http ! service-policy global_policy global prompt hostname context The servers that had static forwards did not have any outside network access. couldn't ping google.com for instance. mail server couldn't Domain POP the Barracuda spam filter from our ISP etc. So after doing some reading I removed the statics for 10.10.39.11, 12 and 13, and replaced those three with what's below.. (Edit: corrected IPs in this statment.) static (inside,outside) tcp 10.10.39.11 https 172.21.37.14 https netmask 255.255.255.255 static (inside,outside) tcp 10.10.39.11 993 172.21.37.14 993 netmask 255.255.255.255 static (inside,outside) tcp 10.10.39.11 smtp 172.21.37.14 smtp netmask 255.255.255.255 static (inside,outside) tcp 10.10.39.11 1001 172.21.37.14 1001 netmask 255.255.255.255 static (inside,outside) tcp 10.10.39.11 465 172.21.37.14 465 netmask 255.255.255.255 static (inside,outside) tcp 10.10.39.11 domain 172.21.37.14 domain netmask 255.255.255.255 static (inside,outside) tcp 10.10.39.12 www 172.21.37.24 www netmask 255.255.255.255 static (inside,outside) tcp 10.10.39.12 https 172.21.37.24 https netmask 255.255.255.255 static (inside,outside) tcp 10.10.39.13 www 172.21.37.17 www netmask 255.255.255.255 static (inside,outside) tcp 10.10.39.13 https 172.21.37.17 https netmask 255.255.255.255 Now the servers (for instance 172.21.37.14) could ping the outside world again. Mail started flowing (Domain POP was successful) etc. etc. But I forgot to check if webmail worked from the outside admittedly. 
But the webservers at 172.21.37.17 and 172.21.37.24 still didn't respond from the outside world. Although I was able to PPTP VPN in on 10.10.39.10 (interface) which is the outside interface IP address. and it is static mapped to 172.21.37.20. So I'm thinking there must be something wrong with NAT somewhere? no response from 10.10.39.11 to 10.10.39.14.. Could anyone look over the config and please let me know what I've done wrong? Is there something I've missed? well obviously but.. please help! Thank you.",List(cisco-asa)
How can i access my custom webpart in sharepoint foundation 2010?,"So i just started trying to develop a simple webpart today for a sharepoint foundation i put on a virtual machine. I have no previous experience with sharepoint whatsoever. As i cant run a sharepoint 2010 on my local machine for dev purposes i followed advices in this thread http://social.technet.microsoft.com/Forums/en/sharepoint2010programming/thread/cda807f6-4edf-4efc-8e9b-4d446356c8ae to able to actually develop something (just the registry bit). I created the simple test web part (writes out ""hi""), uploaded it to virtual machine, added it with add-spsolution and install-spsolution in powershell with success. When i do get-solution through powershell on my webpart it says deployed = true. What am i missing from here to get it to actually show up somewhere in the web interface so i can add it to a page? Cheers","List(sharepoint, sharepoint2010)"
log4net creates new log every minute,"log4net in my project creates new log file every minute. I would like to have just one file per instance of my application, but every instance that runs should create new log file. This is from my app.config file: <appender name=""file"" type=""log4net.Appender.RollingFileAppender"">  <file value=""C:\\Logs\\log2_""/>  <rollingStyle value=""Date""/>  <datePattern value=""MMdd_HHmmss.\tx\t""/>  <staticLogFileName value=""false""/>  <appendToFile value=""true""/>  <maximumFileSize value=""500MB""/>  <layout type=""log4net.Layout.PatternLayout"">  <conversionPattern value=""%date %-5level %message%newline""/>  </layout> </appender> What is the error here?","List(c#, log4net)"
"Adding handler to form inside div, in the future","I am using the following code to direct the results from a form to a specific div. $(window).load(function () {  $(""#form1"").submit(function() {  $.post($(this).attr(""action""), $(this).serialize(), function(html) {  $(""#resultsDiv"").html(html);  });  return false; // prevent normal submit  }); }); How can I apply this (or any) handler to future forms that may be created within an updated div ( with new yet to created content inserted into the div at some point in the future)? I have looked at the .on but I do not see an event for the updating or reloading of a div. I have tried adding a similar function to the above, but replacing (window) with (""#thefutureDivID""), but no luck.","List(events, div, event-handling, jquery-live)"
ASP.NET User Controls Cross-Communication,"The scenario: 2 user controls (foo.ascx and fum.ascx) foo has a method that would really like to access a property from fum. They live on the same page, but I can't find a very simple way to accomplish this sort of communication. Any ideas?","List(c#, asp.net, vb.net, ascx)"
Distributed mysql synching,"I'm running a server which is connected to an SQL host. I have an another server and I decided to run it as an SQL backup. So, I have 3 of them. Srv A is the SQL host, srv B is the backup. I know there's mysql replication, but it's simply not for what I like (correct me if I'm wrong). I'd like something distributed, so if the srv A comes back, it won't overwrite the database built during the downtime on srv B. I only have 3 servers, so setting up a cluster is not an option. I'd be glad if anybody could help me with that.","List(mysql, distribution)"
Showing numbers as binary from a bound source,I need to display a number as binary string (e.g. 8 => 1000). Sure I can convert it using BitConverter and set the text of my TextBox on my own in the code behind file. But this looks somewhat ugly. Is it possible to bind the TextBox to some source and convert it automatically?,"List(binary, binding, ivalueconverter)"
'Global' may not respond to '+setShow',hai  I am new iphone. In my application there is warning 'Global' may not respond to '+setShow'  [Global setShow:True]; Please help me. Thank you,List(iphone)


In [0]:
result.count()

# Data Preparation
We train five classifier models, one for each of the five tag positions. For each position, we find the 500 most common tags and filter the dataset to keep only the questions whose tag at that position is among them. Consequently, the training data and labels differ for each of the five models. These operations are encapsulated in a class (OOP).
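
The per-position filtering can be illustrated on a handful of in-memory records. This is only a sketch under stated assumptions: `filter_by_top_tags` and the toy `questions` list are hypothetical names introduced here, and the notebook itself performs the equivalent steps with Spark dataframes.

```python
from collections import Counter

def filter_by_top_tags(questions, position, n):
    """Keep questions whose tag at `position` (1-based) is among the n most common there."""
    # Tags that actually exist at this position (questions with fewer tags are skipped).
    tags_at_pos = [q["tags"][position - 1] for q in questions if len(q["tags"]) >= position]
    top = {tag for tag, _ in Counter(tags_at_pos).most_common(n)}
    return [
        (q["title"], q["tags"][position - 1])
        for q in questions
        if len(q["tags"]) >= position and q["tags"][position - 1] in top
    ]

# Hypothetical toy data.
questions = [
    {"title": "q1", "tags": ["java", "hibernate"]},
    {"title": "q2", "tags": ["java"]},
    {"title": "q3", "tags": ["c#", "log4net"]},
    {"title": "q4", "tags": ["c#", "asp.net"]},
    {"title": "q5", "tags": ["iphone"]},
    {"title": "q6", "tags": ["java", "hibernate"]},
]

first_position = filter_by_top_tags(questions, position=1, n=2)
second_position = filter_by_top_tags(questions, position=2, n=1)
print(len(first_position), len(second_position))
```

Note how the two positions end up with different rows and different label sets, which is why each classifier gets its own training set.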

In [0]:
from pyspark.sql.functions import desc, collect_list, col

def most_common_for_pos(df, n):
  """
  Find the n most common tags in the 'tag' column and return them as a list

  Args:
    df - dataframe with column 'tag'
    n - number of most common tags to return

  Returns:
    A list of the n most common tags, ordered by descending count
  """

  df = df.na.drop("any")
  tags_df = df.select('tag').groupby('tag').count().sort(desc('count')).limit(n)
  tags_lst = tags_df.agg(collect_list(col('tag'))).collect()[0][0]
  return tags_lst

In [0]:
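def pad_list(lst):
  """
  Pad a list with fewer than 5 elements with null values so that it has 5 elements.
  """
  # Build a new list so the input is not mutated; lists with 5 or more elements are returned unchanged
  return list(lst) + [None] * (5 - len(lst))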

class Data_Prep:
  """
  Preprocess training data, filter the dataframe based on a given number of most common tags
  """
  
  def __init__(self, df):
    """
    Construct a Data_Prep object with attribute padded_tags, which pads tag lists with fewer than five tags with null values.

    Args:
      df - a dataframe with a 'tags' column consisting of a list of tags
    """    
    pad_tags_udf = udf(lambda lst: pad_list(lst), ArrayType(StringType()))
    self.padded_tags = df.withColumn("padded_tags", pad_tags_udf("tags"))
  
  def get_tags_at_position(self, i, n = 500):
    """
    Return a filtered dataframe with the column 'tag' containing the tag at position i, keeping only the questions whose tag at that position is among the n most common tags
    
    Args:
     i: tag position (1-based indexing)
     n: number of most common tags
    
    Returns:
      A dataframe with columns 'Body', 'Title' and 'tag', filtered to the n most common tags at position i
    """
    y_i = self.padded_tags.select('Body', 'Title', element_at("padded_tags", i).alias("tag"))
    tags = most_common_for_pos(y_i, n)
    # Keep only rows whose tag is one of the n most common tags (rows with a null tag are dropped)
    return y_i.filter(col('tag').isin(tags)).select('Body', 'Title', 'tag')
  
  def get_yi(self, i):
    return self.padded_tags.select(element_at("padded_tags", i).alias("tag"))
  

## Cosine Similarity between Top tags for Original Data and New Data

In [0]:
import numpy as np

def getTextDict(a,b):
  ab = set(a+b)
  ka = {}
  kb = {}
  for k in ab: 
    ka[k] = 0
    kb[k] = 0
  for k in a: 
    ka[k] +=1
  for k in b: 
    kb[k] +=1
  return ka, kb
 
def getCosineSimilarity(ka,kb):
  a,b = getTextDict(ka,kb)
  v1 = list(a.values())
  v2 = list(b.values())
  return np.dot(v1, v2) / np.sqrt(np.dot(v1, v1)) / np.sqrt(np.dot(v2, v2)) 
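
The functions above build two count vectors over the union of tags and compute cos(a, b) = (a · b) / (‖a‖ ‖b‖). A minimal standard-library sketch of the same computation on small lists follows; `cosine_similarity` and the sample lists are hypothetical names used for illustration only.

```python
import math
from collections import Counter

def cosine_similarity(tags_a, tags_b):
    """Cosine similarity between the tag-frequency vectors of two tag lists."""
    counts_a, counts_b = Counter(tags_a), Counter(tags_b)
    vocab = set(counts_a) | set(counts_b)
    # Counter returns 0 for missing keys, so absent tags contribute nothing.
    dot = sum(counts_a[t] * counts_b[t] for t in vocab)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)

# Count vectors over (java, c#) are (2, 1) and (1, 2): dot = 4, norms = sqrt(5) each.
similar = cosine_similarity(["java", "java", "c#"], ["java", "c#", "c#"])
identical = cosine_similarity(["java", "c#"], ["c#", "java"])
disjoint = cosine_similarity(["java"], ["php"])
print(similar, identical, disjoint)
```

A value near 1 means the two datasets use tags in similar proportions at that position; a value near 0 means the tag distributions barely overlap.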


In [0]:
split_tags_train = Data_Prep(train)
train_tag1 = split_tags_train.get_yi(1)
train_tag1 = train_tag1.na.drop("any")

train_tag2 = split_tags_train.get_yi(2)
train_tag2 = train_tag2.na.drop("any")

train_tag3 = split_tags_train.get_yi(3)
train_tag3 = train_tag3.na.drop("any")

train_tag4 = split_tags_train.get_yi(4)
train_tag4 = train_tag4.na.drop("any")

train_tag5 = split_tags_train.get_yi(5)
train_tag5 = train_tag5.na.drop("any")


split_tags_new = Data_Prep(new_data)
new_tag1 = split_tags_new.get_yi(1)
new_tag1 = new_tag1.na.drop("any")

new_tag2 = split_tags_new.get_yi(2)
new_tag2 = new_tag2.na.drop("any")

new_tag3 = split_tags_new.get_yi(3)
new_tag3 = new_tag3.na.drop("any")

new_tag4 = split_tags_new.get_yi(4)
new_tag4 = new_tag4.na.drop("any")

new_tag5 = split_tags_new.get_yi(5)
new_tag5 = new_tag5.na.drop("any")

print('Cosine Similarity for tag position 1 is {:.5f}'.format(getCosineSimilarity(train_tag1.select('tag').collect(), new_tag1.select('tag').collect())))
print('Cosine Similarity for tag position 2 is {:.5f}'.format(getCosineSimilarity(train_tag2.select('tag').collect(), new_tag2.select('tag').collect())))
print('Cosine Similarity for tag position 3 is {:.5f}'.format(getCosineSimilarity(train_tag3.select('tag').collect(), new_tag3.select('tag').collect())))
print('Cosine Similarity for tag position 4 is {:.5f}'.format(getCosineSimilarity(train_tag4.select('tag').collect(), new_tag4.select('tag').collect())))
print('Cosine Similarity for tag position 5 is {:.5f}'.format(getCosineSimilarity(train_tag5.select('tag').collect(), new_tag5.select('tag').collect())))

## Filter Data
### Original Data

In [0]:
train_prep = Data_Prep(train)
train1 = train_prep.get_tags_at_position(1)
train2 = train_prep.get_tags_at_position(2)
train3 = train_prep.get_tags_at_position(3)
train4 = train_prep.get_tags_at_position(4)
train5 = train_prep.get_tags_at_position(5)

print('Training size for tag 1 with original dataset', train1.count())
print('Training size for tag 2 with original dataset', train2.count())
print('Training size for tag 3 with original dataset', train3.count())
print('Training size for tag 4 with original dataset', train4.count())
print('Training size for tag 5 with original dataset', train5.count())

### Combined Data

In [0]:
result_prep = Data_Prep(result)
result1 = result_prep.get_tags_at_position(1)
result2 = result_prep.get_tags_at_position(2)
result3 = result_prep.get_tags_at_position(3)
result4 = result_prep.get_tags_at_position(4)
result5 = result_prep.get_tags_at_position(5)

print('Training size for tag 1 with combined datasets', result1.count())
print('Training size for tag 2 with combined datasets', result2.count())
print('Training size for tag 3 with combined datasets', result3.count())
print('Training size for tag 4 with combined datasets', result4.count())
print('Training size for tag 5 with combined datasets', result5.count())

# Data Sampling
Given the limited processing capacity of the Databricks Community Edition, we limit the training dataset to approximately 100,000 questions for each classifier.

In [0]:
def sample_df(df, approxSize=100000):
  """
  Find and return a random stratified sample of df

  Args:
    df - dataframe with column 'tag'
    approxSize - approximate number of rows in the returned sample

  Returns:
    A dataframe sampled from df, with the same sampling fraction applied to every tag
  """
  # Use one sampling fraction for every tag so that tag proportions are preserved;
  # cap it at 1.0 because sampleBy only accepts fractions in [0, 1]
  fraction = min(1.0, approxSize / df.count())
  tags = df.select('tag').distinct().collect()
  tag_fraction = {row.tag: fraction for row in tags}
  df_sample = df.sampleBy('tag', tag_fraction)
  return df_sample
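
To see why a single fraction per tag gives a sample of roughly the requested size while keeping tag proportions, here is a small local simulation. It is a sketch, not the Spark implementation: `sample_by`, the toy `rows`, and the fixed seed are assumptions made for illustration, and the sketch mimics `sampleBy` only in that each row is kept independently with the fraction of its stratum (strata without a fraction are dropped).

```python
import random

def sample_by(rows, key, fractions, seed=0):
    """Keep each row independently with the fraction assigned to its stratum."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fractions.get(key(row), 0.0)]

# Hypothetical toy data: 60% 'java' rows and 40% 'c#' rows.
rows = [("java", i) for i in range(6000)] + [("c#", i) for i in range(4000)]

approx_size = 1000
fraction = min(1.0, approx_size / len(rows))          # 0.1 here
fractions = {tag: fraction for tag in {"java", "c#"}}

sample = sample_by(rows, key=lambda r: r[0], fractions=fractions)
java_share = sum(1 for tag, _ in sample if tag == "java") / len(sample)
print(len(sample), round(java_share, 2))
```

The sample size is only approximate because every row is an independent draw, which matches the `approxSize` naming in the notebook.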

## Original Data

In [0]:
sample_train1 = sample_df(train1)
print('Size of training set for tag position 1:', sample_train1.count())
sample_train1.toPandas().to_csv('sample_train1.csv')

In [0]:
sample_train2 = sample_df(train2)
print('Size of training set for tag position 2:', sample_train2.count())
sample_train2.toPandas().to_csv('sample_train2.csv')

In [0]:
sample_train3 = sample_df(train3)
print('Size of training set for tag position 3:', sample_train3.count())
sample_train3.toPandas().to_csv('sample_train3.csv')

In [0]:
sample_train4 = sample_df(train4)
print('Size of training set for tag position 4:', sample_train4.count())
sample_train4.toPandas().to_csv('sample_train4.csv')

In [0]:
train5.toPandas().to_csv('sample_train5.csv')

## Combined Data

In [0]:
sample_result1 = sample_df(result1)
print('Size of combined training set for tag position 1:', sample_result1.count())
sample_result1.toPandas().to_csv('sample_comb1.csv')

In [0]:
sample_result2 = sample_df(result2)
print('Size of combined training set for tag position 2:', sample_result2.count())
sample_result2.toPandas().to_csv('sample_comb2.csv')

In [0]:
sample_result3 = sample_df(result3)
print('Size of combined training set for tag position 3:', sample_result3.count())
sample_result3.toPandas().to_csv('sample_comb3.csv')

In [0]:
sample_result4 = sample_df(result4)
print('Size of combined training set for tag position 4:', sample_result4.count())
sample_result4.toPandas().to_csv('sample_comb4.csv')

In [0]:
result5.toPandas().to_csv('sample_comb5.csv')

# Data Preprocessing
Following the literature, we used two preprocessing methods: **traditional preprocessing**, which performs a series of text cleaning tasks, and **"lazy" preprocessing**, which only removes HTML tags and converts the text to lowercase.

## Helper Methods
* trad_prep_udf
  * traditional preprocessing: clean noise in the string (remove HTML tags, convert to lowercase, remove stop words, remove tokens with fewer than 3 characters, stem words) and return the cleaned word tokens joined into a single string
* lazy_prep_udf
  * lazy preprocessing: remove HTML tags and convert the string to lowercase (returns a string)
  * the literature did not do any preprocessing and only used parameters for CountVectorizer (setting minDF and maxDF)
* combine_tokens
  * combine word tokens from title and body
* tokenize
  * split a string of text into word tokens
* Preprocessing
  * class that cleans the title and body columns using either lazy or traditional preprocessing
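
The difference between the two modes can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's implementation: it uses `html.parser` instead of BeautifulSoup, omits stemming, and `STOP_WORDS`, `lazy_prep` and `trad_tokens` are hypothetical names with a deliberately tiny stop-word set.

```python
import re
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect text that is not inside an <a> element."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        if self.anchor_depth == 0:
            self.parts.append(data)

def lazy_prep(html):
    """Lazy mode: strip tags and link text, then lowercase."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return "".join(extractor.parts).lower()

STOP_WORDS = {"the", "is", "to"}  # hypothetical, far smaller than NLTK's list

def trad_tokens(html):
    """Traditional mode without stemming: also drop stop words and short tokens."""
    words = re.findall(r"\w+", lazy_prep(html))
    return [w for w in words if w not in STOP_WORDS and len(w) >= 3]

doc = '<p>The fix is to use <b>Log4Net</b>, see <a href="http://example.com">this link</a> now.</p>'
print(lazy_prep(doc))
print(trad_tokens(doc))
```

Lazy mode keeps punctuation and stop words and leaves term selection to the vectorizer, while traditional mode prunes the vocabulary before vectorization.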

In [0]:
from nltk.tokenize import RegexpTokenizer 

@udf("String")
def trad_prep_udf(body):
  # An explicit parser avoids BeautifulSoup's parser-guessing warning
  body = BeautifulSoup(body, 'html.parser')
  stemmer = PorterStemmer()
  
  # noise: empty every link so that its text is excluded
  for a in body.find_all('a'):
    a.clear()
  
  text = body.get_text()
  
  words = []
  text = text.lower()

  # keep alphanumeric tokens only (drops punctuation)
  tokenizer = RegexpTokenizer(r'\w+')

  tokens = tokenizer.tokenize(text)
  for token in tokens: 
    for word in nltk.word_tokenize(token):
      if word in stop_en: continue     # remove stop words
      if len(word) < 3: continue       # remove tokens with fewer than 3 characters
      words.append(stemmer.stem(word))
  return " ".join(words)

In [0]:
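@udf("String")
def lazy_prep_udf(c):
  """
  Remove noise from the string of text in the given column.

  Args:
    c - Column containing string of text

  Returns:
    Column with the cleaned, lowercased string of text
  """
  # Null titles or bodies would make BeautifulSoup raise, so pass them through
  if c is None:
    return None
  # Name the parser explicitly so bs4 does not pick one based on what is installed
  soup = BeautifulSoup(c, "html.parser")
  
  # noise: empty every <a> tag so link text is dropped
  for a in soup.find_all('a'):
    a.clear()
  
  return soup.get_text().lower()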
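
Because lazy preprocessing leaves stop words and rare tokens in place, vocabulary pruning is delegated to CountVectorizer's `minDF` and `maxDF` parameters. The pure-Python sketch below shows the idea of document-frequency filtering only; `prune_vocabulary` is a hypothetical helper, and the threshold handling (values of 1 or more are document counts, values below 1 are fractions of the corpus) follows our reading of the Spark documentation rather than Spark's actual code.

```python
from collections import Counter


def prune_vocabulary(docs, min_df=1.0, max_df=None):
    """Keep terms whose document frequency lies in [min_df, max_df].

    docs is a list of token lists. A threshold >= 1 is read as a number of
    documents; a threshold < 1 is read as a fraction of all documents.
    """
    n_docs = len(docs)
    # Count each term once per document, however often it repeats inside it.
    doc_freq = Counter(term for doc in docs for term in set(doc))

    def as_count(threshold):
        return threshold if threshold >= 1.0 else threshold * n_docs

    lo = as_count(min_df)
    hi = as_count(max_df) if max_df is not None else n_docs
    return sorted(t for t, c in doc_freq.items() if lo <= c <= hi)


docs = [
    ["python", "list", "sort"],
    ["python", "dict"],
    ["python", "list", "java"],
    ["python", "regex"],
]
# "python" appears in all 4 documents, above the 75% cap; singletons fall below min_df.
print(prune_vocabulary(docs, min_df=2, max_df=0.75))
```

A term present in nearly every question carries little information about tags, and a term seen once cannot generalize, which is why both bounds are useful.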

In [0]:
"""
Split strings of text in a dataframe into word tokens.

Args:
  df: Dataframe containing the strings of text
  inputCol: name of the input column containing the strings of text
  outputCol: name of the output column that will contain the word tokens
  
Returns:
  Dataframe containing the output column; the input column is dropped
"""
def tokenize(df, inputCol, outputCol):
  tokenizer = RegexTokenizer(inputCol = inputCol, outputCol = outputCol, pattern=r"\s+") 
  df = tokenizer.transform(df).drop(inputCol)
  return df
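
For a single string, `RegexTokenizer` with `pattern=r"\s+"` and its default settings (the pattern marks the gaps between tokens, text is lowercased, minimum token length is 1) behaves roughly like the plain-Python function below. `split_on_whitespace` is a hypothetical name used only for illustration.

```python
import re


def split_on_whitespace(text, min_token_length=1):
    """Lowercase, split on runs of whitespace, and drop tokens that are too short."""
    tokens = re.split(r"\s+", text.lower())
    # re.split yields empty strings for leading or trailing whitespace;
    # the length filter removes them.
    return [t for t in tokens if len(t) >= min_token_length]


print(split_on_whitespace("  Check file\timage\nUpload "))
```

Since both preprocessing UDFs already return lowercased text, the lowercasing here is a no-op in the notebook's pipeline; splitting on whitespace simply recovers the tokens that `trad_prep_udf` joined with spaces.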

In [0]:
"""
Combine word tokens from title and body.

Args:
  df Dataframe containing the title and body token columns
  inputCol1 name of the column containing lists of word tokens from title, cannot be same as inputCol2
  inputCol2 name of the column containing lists of word tokens from body, cannot be same as inputCol1
  outputCol name of the output column containing the combined list of word tokens from title and body

Returns:
  Dataframe with the output column added
"""
def combine_tokens(df, inputCol1, inputCol2, outputCol):
  result = df.withColumn(outputCol, concat(col(inputCol1), col(inputCol2)))
  return result

In [0]:
from pyspark.ml.feature import CountVectorizer

class Preprocessing:
  """Preprocess StackOverflow questions by removing HTML tags and tokenizing words"""
  
  def __init__(self, df, titleCol = 'Title', bodyCol = 'Body', type = 'Traditional'):
    """
    Construct a Preprocessing object with attribute df: the given df with its title and body
    text cleaned by lazy preprocessing (type == 'Lazy') or, for any other value, by
    traditional preprocessing.
    """
    self.titleCol = titleCol
    self.bodyCol = bodyCol
    if (type == 'Lazy'):
      df = df.withColumn(titleCol, lazy_prep_udf( df[titleCol] ) )
      self.df = df.withColumn(bodyCol, lazy_prep_udf( df[bodyCol] ) )
    else:
      df = df.withColumn(titleCol, trad_prep_udf( df[titleCol] ) )
      self.df = df.withColumn(bodyCol, trad_prep_udf( df[bodyCol] ) )

  def get_token(self):
    self.df = tokenize(self.df, self.titleCol, 'Title_Tokens')
    self.df = tokenize(self.df, self.bodyCol, 'Body_Tokens')
    return combine_tokens(self.df, 'Title_Tokens', 'Body_Tokens', 'Combined_Tokens')

In [0]:
# recurse=True is needed if the path is a directory (as Spark output usually is)
dbutils.fs.rm("/FileStore/tables/tag3_combined", recurse=True)

In [0]:
import os
os.listdir("/databricks/driver/")