In [1]:
import pandas as pd
import xml.etree.ElementTree as ET
import io
import numpy as np
pd.set_option('display.max_colwidth', -1)

In [2]:
tree = ET.parse("Posts.xml") 

In [3]:
root = tree.getroot() 

In [4]:
root

<Element 'posts' at 0x000002381D4E2EF8>

In [5]:
q1 = root[0].items()

In [6]:
q1

[('Id', '5'),
 ('PostTypeId', '1'),
 ('CreationDate', '2014-05-13T23:58:30.457'),
 ('Score', '9'),
 ('ViewCount', '516'),
 ('Body',
  '<p>I\'ve always been interested in machine learning, but I can\'t figure out one thing about starting out with a simple "Hello World" example - how can I avoid hard-coding behavior?</p>\n\n<p>For example, if I wanted to "teach" a bot how to avoid randomly placed obstacles, I couldn\'t just use relative motion, because the obstacles move around, but I don\'t want to hard code, say, distance, because that ruins the whole point of machine learning.</p>\n\n<p>Obviously, randomly generating code would be impractical, so how could I do this?</p>\n'),
 ('OwnerUserId', '5'),
 ('LastActivityDate', '2014-05-14T00:36:31.077'),
 ('Title',
  'How can I do simple machine learning without hard-coding behavior?'),
 ('Tags', '<machine-learning>'),
 ('AnswerCount', '1'),
 ('CommentCount', '1'),
 ('FavoriteCount', '1'),
 ('ClosedDate', '2014-05-14T14:40:25.950')]

In [7]:
[x[1] for x in q1 if x[0] == "Tags"]

['<machine-learning>']

In [17]:
[x[1] for x in q1 if x[0] == "Body"]

['<p>I\'ve always been interested in machine learning, but I can\'t figure out one thing about starting out with a simple "Hello World" example - how can I avoid hard-coding behavior?</p>\n\n<p>For example, if I wanted to "teach" a bot how to avoid randomly placed obstacles, I couldn\'t just use relative motion, because the obstacles move around, but I don\'t want to hard code, say, distance, because that ruins the whole point of machine learning.</p>\n\n<p>Obviously, randomly generating code would be impractical, so how could I do this?</p>\n']

In [21]:
# collect the body and tags of every post
# (posts without a Tags attribute, e.g. answers, get an empty list)
all_posts = []
all_tags = []
for post in root:
    attrs = post.items()
    all_posts.append([value for key, value in attrs if key == "Body"])
    all_tags.append([value for key, value in attrs if key == "Tags"])

In [22]:
all_posts[2]

['<p>Not sure if this fits the scope of this SE, but here\'s a stab at an answer anyway.</p>\n\n<p>With all AI approaches you have to decide what it is you\'re modelling and what kind of uncertainty there is. Once you pick a framework that allows modelling of your situation, you then see which elements are "fixed" and which are flexible. For example, the model may allow you to define your own network structure (or even learn it) with certain constraints. You have to decide whether this flexibility is sufficient for your purposes. Then within a particular network structure, you can learn parameters given a specific training dataset.</p>\n\n<p>You rarely hard-code behavior in AI/ML solutions. It\'s all about modelling the underlying situation and accommodating different situations by tweaking elements of the model.</p>\n\n<p>In your example, perhaps you might have the robot learn how to detect obstacles (by analyzing elements in the environment), or you might have it keep track of where 

In [23]:
all_tags[23]

['<algorithms>']

In [24]:
import re
re.sub("<|>"," ",all_tags[6][0]).split(" ")

['', 'machine-learning', '', 'bigdata', '', 'libsvm', '']
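
The empty strings in the output above appear because each `<` and `>` is replaced by a space and the result is then split on single spaces. A minimal sketch of an alternative, assuming every tag has the `<name>` form seen in this dump: `re.findall` extracts the names directly. The helper name `parse_tags` is hypothetical and not part of the notebook.

```python
import re

def parse_tags(raw):
    """Return the tag names in a string such as '<a><b>' (hypothetical helper)."""
    # Capture whatever sits between each pair of angle brackets.
    return re.findall(r"<([^<>]+)>", raw)

print(parse_tags("<machine-learning><bigdata><libsvm>"))
```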

In [25]:
len(all_posts)

31570

In [26]:
# keep only posts that carry tags; a set keeps the membership test below fast
valid_posts_idx = {idx for idx,x in enumerate(all_tags) if len(x) != 0}

In [27]:
all_posts = [x for idx,x in enumerate(all_posts) if idx in valid_posts_idx]
all_tags = [x for idx,x in enumerate(all_tags) if idx in valid_posts_idx]

### Use Beautiful Soup to clean text

In [28]:
from bs4 import BeautifulSoup

In [29]:
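clean_posts = []
for post in all_posts:
    # name the parser explicitly to avoid bs4's "no parser specified" warning
    soup = BeautifulSoup(post[0], "html.parser")
    # join all text nodes, dropping the HTML tags
    clear_text = " ".join(soup.find_all(text=True))
    # delete everything except letters and spaces, then lowercase
    clear_text = re.sub("[^a-zA-Z ]","",clear_text).lower()
    clean_posts.append(clear_text)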

In [30]:
clean_posts[4]

'i use  libsvm  to train data and predict classification on  semantic analysis  problem but it has a  performance  issue on largescale data because semantic analysis concerns  ndimension  problem  last year  liblinear  was release and it can solve performance bottleneckbut it cost too much  memory  is  mapreduce  the only way to solve semantic analysis problem on big data or are there any other methods that can improve memory bottleneck on  liblinear  '
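
Deleting every non-letter character also deletes hyphens, punctuation and newlines, which is why tokens such as `largescale` and `bottleneckbut` appear in the output above. A minimal sketch of the difference, assuming it is acceptable to replace those characters with a space instead; the function names and the sample sentence are made up for illustration.

```python
import re

def clean_delete(text):
    # Same pattern as the notebook: non-letters are removed outright.
    return re.sub("[^a-zA-Z ]", "", text).lower()

def clean_space(text):
    # Alternative: each run of non-letters becomes a single space,
    # so neighbouring words are not glued together.
    return re.sub("[^a-zA-Z]+", " ", text).lower().strip()

sample = "large-scale data,but it costs 2x"
print(clean_delete(sample))
print(clean_space(sample))
```

Whether to split or merge hyphenated words is a modelling choice; the second variant only makes the choice explicit.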

In [31]:
clean_tags = [re.sub("<|>"," ",x[0]) for x in all_tags]

In [32]:
clean_tags[4]

' machine-learning  bigdata  libsvm '

In [33]:
stack_exchg_datasc_posts = pd.DataFrame({"question":clean_posts,"tags":clean_tags})

In [34]:
stack_exchg_datasc_posts.shape

(14481, 2)

In [35]:
stack_exchg_datasc_posts.head()

index,question,tags
0,ive always been interested in machine learning but i cant figure out one thing about starting out with a simple hello world example how can i avoid hardcoding behavior for example if i wanted to teach a bot how to avoid randomly placed obstacles i couldnt just use relative motion because the obstacles move around but i dont want to hard code say distance because that ruins the whole point of machine learning obviously randomly generating code would be impractical so how could i do this,machine-learning
1,as a researcher and instructor im looking for opensource books or similar materials that provide a relatively thorough overview of data science from an applied perspective to be clear im especially interested in a thorough overview that provides material suitable for a collegelevel course not particular pieces or papers,education open-source
2,i am sure data science as will be discussed in this forum has several synonyms or at least related fields where large data is analyzed my particular question is in regards to data mining i took a graduate class in data mining a few years back what are the differences between data science and data mining and in particular what more would i need to look at to become proficient in data mining,data-mining definitions
3,in which situations would one system be preferred over the other what are the relative advantages and disadvantages of relational databases versus nonrelational databases,databases
4,i use libsvm to train data and predict classification on semantic analysis problem but it has a performance issue on largescale data because semantic analysis concerns ndimension problem last year liblinear was release and it can solve performance bottleneckbut it cost too much memory is mapreduce the only way to solve semantic analysis problem on big data or are there any other methods that can improve memory bottleneck on liblinear,machine-learning bigdata libsvm


In [63]:
# generate a column for each tag
from sklearn.feature_extraction.text import CountVectorizer
# initialize vectorizer
vect = CountVectorizer(max_features=50,tokenizer=lambda x: x.split(' '))
vect.fit(stack_exchg_datasc_posts["tags"])
tags = vect.transform(stack_exchg_datasc_posts["tags"])

In [64]:
vect.get_feature_names()

['',
 'algorithms',
 'apache-spark',
 'bigdata',
 'classification',
 'clustering',
 'cnn',
 'computer-vision',
 'convnet',
 'cross-validation',
 'data',
 'data-cleaning',
 'data-mining',
 'dataset',
 'decision-trees',
 'deep-learning',
 'feature-engineering',
 'feature-extraction',
 'feature-selection',
 'image-classification',
 'k-means',
 'keras',
 'linear-regression',
 'logistic-regression',
 'lstm',
 'machine-learning',
 'multiclass-classification',
 'neural-network',
 'nlp',
 'optimization',
 'pandas',
 'predictive-modeling',
 'python',
 'r',
 'random-forest',
 'recommender-system',
 'regression',
 'reinforcement-learning',
 'rnn',
 'scikit-learn',
 'statistics',
 'svm',
 'tensorflow',
 'text-mining',
 'time-series',
 'training',
 'unsupervised-learning',
 'visualization',
 'word2vec',
 'xgboost']
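
The empty string at the top of the feature list comes from `x.split(' ')`, which emits an empty token between consecutive spaces. The sketch below reproduces the idea behind the vectorizer step in plain Python on toy data: keep the most frequent tags and build one indicator column per tag. It is a stdlib-only illustration, not the scikit-learn implementation; the toy tag strings and variable names are assumptions.

```python
from collections import Counter

# Toy tag strings in the same space-padded format as clean_tags.
tag_strings = [
    " machine-learning  bigdata  libsvm ",
    " machine-learning  python ",
    " python  pandas  machine-learning ",
]

# str.split() with no argument drops the empty tokens.
tag_lists = [s.split() for s in tag_strings]

# Keep the 2 most frequent tags, listed alphabetically.
counts = Counter(tag for tags in tag_lists for tag in tags)
top_tags = sorted(tag for tag, _ in counts.most_common(2))

# One row per post, one 0/1 column per kept tag.
rows = [[int(tag in tags) for tag in top_tags] for tags in tag_lists]

print(top_tags)
print(rows)
```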

In [65]:
tags = pd.DataFrame(tags.toarray())
tags = tags.iloc[:,1:]
tags.columns = vect.get_feature_names()[1:]

In [66]:
stack_exchg_datasc_posts = pd.concat([stack_exchg_datasc_posts,tags],axis=1)
stack_exchg_datasc_posts.to_csv("stack_exchg_datasc_posts.csv",index=False)

In [67]:
stack_exchg_datasc_posts = pd.read_csv("stack_exchg_datasc_posts.csv")

### Training word2vec

In [68]:
from gensim.models import Word2Vec

In [69]:
train_data = [x.split(" ") for x in stack_exchg_datasc_posts.question]

In [70]:
# size: embedding dimension; window: context width;
# min_count: ignore words that occur fewer than 5 times
model = Word2Vec(train_data, size=100, window=5, min_count=5)
# saved in the working directory, where it is loaded again later
model.save("word2vec.model")

#### Questions:
1. What is the impact of changing the `min_count` argument?
2. Find the words most similar to "pandas".
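
As a starting point for question 1: gensim documents `min_count` as ignoring all words whose total frequency is lower than the given value. The sketch below mimics only that vocabulary-pruning step on toy sentences. It does not train a model, and the function name `vocabulary` and the toy data are assumptions.

```python
from collections import Counter

# Toy tokenized corpus.
sentences = [
    ["pandas", "dataframe", "merge"],
    ["pandas", "numpy"],
    ["numpy", "pandas", "typo"],
]

def vocabulary(sentences, min_count):
    """Words that occur at least min_count times across all sentences."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}

for threshold in (1, 2, 3):
    print(threshold, sorted(vocabulary(sentences, threshold)))
```

A higher threshold shrinks the vocabulary, so rare words (including typos and glued tokens) get no vector at all.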

In [71]:
model.wv.most_similar('pandas')

[('numpy', 0.7896469235420227),
 ('matplotlib', 0.7839069962501526),
 ('scipy', 0.7797021865844727),
 ('pyspark', 0.7779780626296997),
 ('seaborn', 0.7543668746948242),
 ('statsmodelsapi', 0.7470531463623047),
 ('sklearndecomposition', 0.7195875644683838),
 ('setuptools', 0.7173870205879211),
 ('matplotlibpyplot', 0.7083257436752319),
 ('numpyimport', 0.700340986251831)]

### Train tag prediction model

In [44]:
from gensim.models import Word2Vec
ai_w2vec_model = Word2Vec.load("word2vec.model")

In [45]:
from tqdm import tqdm
ques_vec = np.zeros((stack_exchg_datasc_posts.shape[0],100))
for i in tqdm(range(0,stack_exchg_datasc_posts.shape[0])):
    words = stack_exchg_datasc_posts["question"].iloc[i].split(" ")
    words = [x.strip() for x in words]
    ind_word_vecs = [ai_w2vec_model.wv[x] for x in words if x in ai_w2vec_model.wv.vocab]
    ques_vec[i] = np.array(ind_word_vecs).mean(axis=0)

100%|██████████████████████████████████████████████████████████████████████████| 14481/14481 [00:05<00:00, 2826.02it/s]
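
The loop above represents each question by the mean of the vectors of its in-vocabulary words. Taking the mean of an empty collection of vectors gives NaN, so one option is to fall back to a zero vector when no word is known. A minimal sketch, with plain Python lists standing in for NumPy arrays; the function name `average_vector` and the toy vectors are assumptions.

```python
def average_vector(words, vectors, dim):
    """Mean of the known word vectors, or a zero vector if none are known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        # No in-vocabulary word: avoid dividing by zero.
        return [0.0] * dim
    # zip(*known) iterates over the vectors component by component.
    return [sum(component) / len(known) for component in zip(*known)]

# Toy 2-dimensional "embeddings".
toy_vectors = {"pandas": [1.0, 3.0], "numpy": [3.0, 5.0]}

print(average_vector(["pandas", "numpy", "unknown"], toy_vectors, 2))
print(average_vector(["", "zzz"], toy_vectors, 2))
```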


In [42]:
ques_vec.shape

(14481, 100)

In [43]:
stack_exchg_datasc_posts.columns

Index(['question', 'tags', 'classification', 'data-mining', 'deep-learning',
       'keras', 'machine-learning', 'neural-network', 'python', 'r',
       'scikit-learn'],
      dtype='object')

In [44]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, X_train_orig, X_test_orig = train_test_split(
    ques_vec,
    stack_exchg_datasc_posts["keras"],
    stack_exchg_datasc_posts["question"],
    random_state=2,
)


In [45]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [46]:
rf.classes_

array([0, 1])

In [47]:
# predicted probability of the positive class (tag present) for each row of X_test
y_pred_prob = rf.predict_proba(X_test)[:,1]

In [48]:
# label a question positive when its predicted probability exceeds 0.1
y_pred_class = [1 if x > 0.1 else 0 for x in y_pred_prob]

In [49]:
np.sum(y_pred_class)

942

In [50]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.7754763877381938

In [51]:
res = pd.DataFrame({"text":X_test_orig,"actual":y_test, "pred":y_pred_class})

In [52]:
pd.crosstab(res["actual"],res["pred"])

actual\pred,0,1
0,2624,758
1,55,184
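
Accuracy alone can mislead when the positive class is rare, as it is here. The sketch below derives precision, recall and the accuracy of always predicting 0 from the four counts in the table above (rows are actual labels, columns are predictions). Only the arithmetic is new; the variable names are assumptions.

```python
# Counts read off the crosstab: rows = actual, columns = predicted.
tn, fp = 2624, 758   # actual 0
fn, tp = 55, 184     # actual 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)      # share of predicted positives that are correct
recall = tp / (tp + fn)         # share of actual positives that are found
baseline = (tn + fp) / total    # accuracy of always predicting 0

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} baseline={baseline:.3f}")
```

With the 0.1 threshold the model trades precision for recall, and its accuracy falls below the always-0 baseline, which is why the false positives are inspected next.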


In [53]:
error = res[(res.actual == 0) & (res.pred==1)]

In [54]:
error

index,actual,pred,text
4648,0,1,like above id like to know what exactly a skewed dataset is the explanation from statssecom sounds to me more like what i call an imbalanced dataset what is the distinction
4673,0,1,when training neural networks one hyperparameter is the size of a minibatch common choices are and elements per mini batch are there any rules guidelines how big a minibatch should be any publications which investigates the effect on the training
12821,0,1,im working on a classification problem i want to classify iris flowers from the famous iris data set using mlp i know that i the number of neurons in output layer should be the same number of classes but can i use one neuron in output layer which output is the value or or to refer to the three types or then it is considered as regression not classification thanks trin trout inpsizetrin outsizetrout hidden x iw reshapexhiddeninphiddeninp b reshapexhiddeninphiddeninphiddenhidden lw reshapexhiddeninphiddenhiddeninphiddenhiddenoutouthidden b reshapexhiddeninphiddenhiddenouthiddeninphiddenhiddenoutoutout y tanhtanhtriniwrepmatbsizetrinlwrepmatbsizetrin e gsubtracttrouty is this classification or it is considered as regression i mean should i make the out put bits to be consedered as classification and how to do this if yes
4490,0,1,i am running an svr prediction on some time series data and i am receiving this weird offset between my actual and predicted values i found this svm regression lag post that mentions adding a lag of data points behind instead of one however i am not sure how to incorporate that into my code which ive included below does anyone have any ideas on why my predicted vs actual is offset in this manner my code is as follows usrbinpythonimport mathimport statisticsimport visualizerimport numpy as npfrom datagen import constructdatafrom sklearn import svm applies support vector regression to the electricity dataset prints out the accuracy rate to the terminal and plots predictions against actual valuesdef suppvectorregress kernellist linearrbfpolykernel names linearradial basispoly preds retrieve time series data apply preprocessing data constructdata cutoff lendata xtrain datacutoff ytrain datacutoff xtest datacutoff ytest datacutoff fill in missing values denoted by zeroes as an average of both neighbors statisticsestimatemissingxtrain statisticsestimatemissingxtest logarithmically scale the data xtrain mathlogy for y in x for x in xtrain xtest mathlogy for y in x for x in xtest ytrain mathlogx for x in ytrain detrend the time series indices nparangelendata trainindices indicescutoff testindices indicescutoff detrendedslopeintercept statisticsdetrendtrainindicesytrain ytrain detrended for gen in rangelenkernellist use svr to predict test observations based upon training observations pred svrpredictionsxtrainytrainxtestkernellistgen add the trend back into the predictions trendedpred statisticsreapplytrendtestindicespredslopeintercept reverse the normalization trendedpred npexpx for x in trendedpred compute the nrmse err statisticsnormrmseytesttrendedpred print the normalized rootmean square error is strerr using kernel namesgen predsappendtrendedpred namesappendactual predsappendytest change the parameters based on the month you want to predict 
visualizercomparisonplotpredsnamesplotnamesupport vector regression load predictions vs actual yaxisnamepredicted kilowatts construct a support vector machine and get predictions for the test set returns a d vector of predictionsdef svrpredictionsxtrainytrainxtestk clf svmsvrckernelk clffitxtrainytrain return clfpredictxtest a scale invariant kernel note only conditionally semidefinitedef polykernelxy return npdotxytif namemain suppvectorregress
5807,0,1,i am relatively new to using wordvec i am interested in solving the topicword intrusion introduced here by using the vector spaces of words generated by wordvec and svc i have a corpus with a vocabulary of words the vocabulary is perfectly contained in the googles wordvec trained model i was wondering which model would provide a better representation of the words the pretrained model on m words or a model trained only on the words appearing in my corpus thanks
12523,0,1,i want to create a histogram out of the range of a column named surfacearea with country data frame i made an sql file in rrange maxcountry surface mincountry surfaceareain sql part b divide the range into equal width binsselect case when surfacearea then lowwhen surfacearea and surfacearea then mediumlowwhen surfacearea and surfacearea then mediumwhen surfacearea and surfacearea then mediumhighelse high end as type countsurfaceareagroup by typeorder by type asc im working on putting mysql into r
6022,0,1,i am new to the field of machine learning and recently learned the basics and working out various algorithms in python using libraries such as pandas numpy matplotlib scikitlearn etc i started learning about working bigdata by distributing it and using apache sparks library mllib to load and apply algorithms on it so is working with mllib the only way on spark or is there any other way to use pandas and other libraries on distributed data
7127,0,1,im trying to see how well a decision tree classifier performs on my input for this im trying to use the validation and learning curves and sklearns crossvalidation methods however they differ and i dont know what to make of it the validation curve shows up as follows based on varying the maximum depth parameter im getting worse and worse crossval scores however when i try the crossvalscore i get accuracy reliably while i was using the default tree depth for clf here it still puzzles me how the validation curve never reaches even but the crossval scores are all above what does this mean why is there a discrepancy code for reference below for the validation curve import matplotlibpyplot as pltimport numpy as npfrom sklearndatasets import loaddigitsfrom sklearnsvm import svcfrom sklearnmodelselection import validationcurvex y preparedataframexvalues preparedataframeyvaluesravelparamrange nparange trainscores testscores validationcurve decisiontreeclassifierclassweightbalanced x y paramnamemaxdepth paramrangeparamrange cvnone scoringaccuracy njobstrainscoresmean npmeantrainscores axistrainscoresstd npstdtrainscores axistestscoresmean npmeantestscores axistestscoresstd npstdtestscores axisplttitlevalidation curve with decision tree classifierpltxlabelmaxdepthpltxticksparamrangepltylabelscorepltylim lw pltplotparamrange trainscoresmean labeltraining score colordarkorange lwlwpltfillbetweenparamrange trainscoresmean trainscoresstd trainscoresmean trainscoresstd alpha colordarkorange lwlwpltplotparamrange testscoresmean labelcrossvalidation score colornavy lwlwpltfillbetweenparamrange testscoresmean testscoresstd testscoresmean testscoresstd alpha colornavy lwlwpltlegendlocbestpltshow for the crossval scores clf decisiontreeclassifierclassweightbalancedxtrain xtest ytrain ytest traintestsplit x y testsize randomstateclffitxtrain ytrainypred clfpredictxtestclfscorextest ytest update a comment has been asked about shuffling when i shuffle the data by x y 
preparedataframexvalues preparedataframeyvaluesravelindices nparangeyshapenprandomshuffleindicesx y xindices yindices i get which makes even less sense to me what does this mean
7852,0,1,im writing a university report on for the toxici comment classification kaggle competition comparing different attempt made with different models and i want to know if convolutional neural networks are one model multiple output layers model onmo are also known as multitask learning this approach would have one input layer one set of hidden layers and one output layer for each label on the other hand cnn are several layers of convolutions with nonlinear activation functions like relu or tanh applied to the results it uses convolutions over the input layer to compute the output rather than a fully connected layer or affine layer
8956,0,1,import numpy as npfrom sklearn import preprocessing crossvalidation neighborsimport pandas as pdfrom sklearnlinearmodel import linearregressiondf pdreadcsvdownloadsbreastcancerwisconsindatatxtskiprowsdfreplace inplacetruedfdropid inplacetrue x nparraydfdropclassy nparraydfclassxtrain xtest ytrain ytest crossvalidationtraintestsplitxytestsizeclf neighborskneighborsclassifierclf linearregressionnormalizetrueclffitxtrain ytrainaccuracy clfscorextest ytestprintaccuracyexamplemeasures nparrayexamplemeasures examplemeasuresreshapeprediction clfpredictexamplemeasures examplemeasuresprintprediction problem arises when i run the above command line at ubuntu or anaconda valueerror query data dimension must match training data dimension how to solve that problem i am sure that by method of isolating individual commandline and find it appears error at prediction clfpredictexamplemeasures i try to use prediction clfpredictxtest it is oki really want to predict the example i create how can i change the code
