In the previous module, we solved text classification problems with both classical ML models (Logistic Regression, Random Forest, XGBoost, etc.) and deep learning techniques (CNN, RNN and LSTM).

### Problem Statement: Building an Automatic Question Tagging System on a Stack Overflow Dataset

    

About the Module

    1. Understand the Business Problem
    2. Translate the Business Problem into a Data Science Problem
    3. About the Dataset
    4. Performance Metrics
    5. Implementation

Stack Overflow is an online platform where people ask questions related to computer science. Users can upvote or downvote these questions. Each question has three parts:

    Title: 
    
    Description:
    
    Tags:  These indicate the topics that the question covers. Tags are annotated by users or 
           predicted by predictive models on Stack Overflow.

In this module we will study how these tags can be predicted automatically. The input is the title and description of the question, both of which are text, and we extract features from this text.

Why does automatic tagging make business sense?

   Thousands of questions are asked on Stack Overflow, and it is not easy for experts to find the questions that match their domain expertise. Correct tags route each question to the right experts, so accurate tagging is valuable for the business.

## About the Dataset

    . Over 76,000 data science related questions
    . 100 unique tags (multiple questions may share the same tag)
    . A maximum of 5 tags per question

## Performance Metrics

    . High Precision: Out of all the predicted tags, how many actually belong to that question
    
    . High recall: Out of all the actual tags, how many show up in the predictions
    
    . Performance Metric: F1 Score. It gives good value if both precision and recall are high
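
As a rough illustration of these three metrics, the sketch below scores a single question by hand. The tag sets are made up for the example and are not taken from the dataset.

```python
# Hypothetical tag sets for a single question (illustrative only)
actual_tags = {"regression", "r", "ggplot2"}
predicted_tags = {"regression", "r", "time-series", "anova"}

# Tags that are both predicted and actually present
correct = actual_tags & predicted_tags

precision = len(correct) / len(predicted_tags)      # share of predicted tags that are right
recall = len(correct) / len(actual_tags)            # share of actual tags that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low.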

Steps:

    1. Import libraries and load datasets
    2. Inspect data
    3. Clean and pre-process data
    4. Reshape the target variable
    5. Extract features from the text
    6. Build multilabel classification model
    7. Make predictions and evaluate model
    8. Define inference function for new data

In [2]:
import pandas as pd
import numpy as np
import re
import nltk
import spacy
from tqdm import tqdm # a handy library that shows a percentage progress bar while executing for loops
import matplotlib.pyplot as plt

%matplotlib inline
pd.set_option('display.max_colwidth', 200)

In [3]:
df_questions = pd.read_hdf('auto_tagging_data_v2.h5')

The data is stored in HDF5 format (`.h5`), an efficient way to store large amounts of data, so we load it with `pd.read_hdf`.

In [4]:
df_questions.head(5)

Unnamed: 0,Id,Title,Body,Tags
0,6,The Two Cultures: statistics vs. machine learning?,"<p>Last year, I read a blog post from <a href=""http://anyall.org/"">Brendan O'Connor</a> entitled <a href=""http://anyall.org/blog/2008/12/statistics-vs-machine-learning-fight/"">""Statistics vs. Mach...",[machine-learning]
1,21,Forecasting demographic census,<p>What are some of the ways to forecast demographic census with some validation and calibration techniques?</p>\n\n<p>Some of the concerns:</p>\n\n<ul>\n<li>Census blocks vary in sizes as rural\n...,[forecasting]
2,22,Bayesian and frequentist reasoning in plain English,<p>How would you describe in plain English the characteristics that distinguish Bayesian from Frequentist reasoning?</p>\n,[bayesian]
3,31,What is the meaning of p values and t values in statistical tests?,"<p>After taking a statistics course and then trying to help fellow students, I noticed one subject that inspires much head-desk banging is interpreting the results of statistical hypothesis tests....","[hypothesis-testing, t-test, p-value, interpretation]"
4,36,Examples for teaching: Correlation does not mean causation,"<p>There is an old saying: ""Correlation does not mean causation"". When I teach, I tend to use the following standard examples to illustrate this point:</p>\n\n<ol>\n<li>number of storks and birth ...",[correlation]


In [5]:
df_questions.sample(6,random_state = 11) # random_state makes the sample reproducible across runs

Unnamed: 0,Id,Title,Body,Tags
41763,92185,Why is Sampling Importance Resampling (SIR) better than Importance Sampling (IS)?,"<p>From what I understand, SIR is a mechanism for sampling from a distribution $p$ that works as follows:</p>\n\n<ol>\n<li>Approximate a target distribution $p$ using an importance sample $S$ fro...","[sampling, mcmc]"
4245,179778,optimization approach in logistic regression,<p>In logistic regression we need to maximise the log likelihood which boils down to minimising a function which is sum of multiple log functions. We normally use gradient descent approach there. ...,"[machine-learning, logistic, classification, optimization]"
37183,168679,Consequences of violating proportional hazards assumption in Cox model,"<p>What are the consequences of violating the Proportional Hazards assumption in a Cox Model? I've got a Model where two factors are highly significative, but all the estimated betas associated to...","[regression, survival, cox-model]"
55932,144226,Moments and density tails,"<p>Assume that the first $n$ moments $m_1,\dots\,m_n$ of a random variable $X\in\mathbb{R}$ are known, but not its probability density function $p(x)$. </p>\n\n<p>Does there exist a methodology to...","[probability, pdf]"
47629,142745,What is the demonstration of the variance of the difference of two dependent variables?,"<p>I know that the variance of the difference of two independent variables is the sum of variances, and I can prove it. I want to know where the covariance goes in the other case.</p>\n","[variance, covariance]"
49639,195347,Rules for choosing how much training data one needs to learn a Radial Basis Function (RBF) model?,<p>I was trying to understand how much data I would need compared to the number of parameters (and to have good generalization) when I train a radial basis function (RBF) network on a regression t...,"[machine-learning, nonlinear-regression]"


Without `random_state`, `sample` returns a different random set of rows on every run.

Since we need both Title and Body when extracting features, we combine them into a single column, 'Text'.

In [6]:
df_questions['Text'] = df_questions['Title'] + " " + df_questions['Body']

In [7]:
df_questions.sample(6,random_state = 11)

Unnamed: 0,Id,Title,Body,Tags,Text
41763,92185,Why is Sampling Importance Resampling (SIR) better than Importance Sampling (IS)?,"<p>From what I understand, SIR is a mechanism for sampling from a distribution $p$ that works as follows:</p>\n\n<ol>\n<li>Approximate a target distribution $p$ using an importance sample $S$ fro...","[sampling, mcmc]","Why is Sampling Importance Resampling (SIR) better than Importance Sampling (IS)? <p>From what I understand, SIR is a mechanism for sampling from a distribution $p$ that works as follows:</p>\n\n..."
4245,179778,optimization approach in logistic regression,<p>In logistic regression we need to maximise the log likelihood which boils down to minimising a function which is sum of multiple log functions. We normally use gradient descent approach there. ...,"[machine-learning, logistic, classification, optimization]",optimization approach in logistic regression <p>In logistic regression we need to maximise the log likelihood which boils down to minimising a function which is sum of multiple log functions. We n...
37183,168679,Consequences of violating proportional hazards assumption in Cox model,"<p>What are the consequences of violating the Proportional Hazards assumption in a Cox Model? I've got a Model where two factors are highly significative, but all the estimated betas associated to...","[regression, survival, cox-model]",Consequences of violating proportional hazards assumption in Cox model <p>What are the consequences of violating the Proportional Hazards assumption in a Cox Model? I've got a Model where two fact...
55932,144226,Moments and density tails,"<p>Assume that the first $n$ moments $m_1,\dots\,m_n$ of a random variable $X\in\mathbb{R}$ are known, but not its probability density function $p(x)$. </p>\n\n<p>Does there exist a methodology to...","[probability, pdf]","Moments and density tails <p>Assume that the first $n$ moments $m_1,\dots\,m_n$ of a random variable $X\in\mathbb{R}$ are known, but not its probability density function $p(x)$. </p>\n\n<p>Does th..."
47629,142745,What is the demonstration of the variance of the difference of two dependent variables?,"<p>I know that the variance of the difference of two independent variables is the sum of variances, and I can prove it. I want to know where the covariance goes in the other case.</p>\n","[variance, covariance]","What is the demonstration of the variance of the difference of two dependent variables? <p>I know that the variance of the difference of two independent variables is the sum of variances, and I ca..."
49639,195347,Rules for choosing how much training data one needs to learn a Radial Basis Function (RBF) model?,<p>I was trying to understand how much data I would need compared to the number of parameters (and to have good generalization) when I train a radial basis function (RBF) network on a regression t...,"[machine-learning, nonlinear-regression]",Rules for choosing how much training data one needs to learn a Radial Basis Function (RBF) model? <p>I was trying to understand how much data I would need compared to the number of parameters (and...


In [8]:
df_questions['Text'].head(5)

0    The Two Cultures: statistics vs. machine learning? <p>Last year, I read a blog post from <a href="http://anyall.org/">Brendan O'Connor</a> entitled <a href="http://anyall.org/blog/2008/12/statisti...
1    Forecasting demographic census <p>What are some of the ways to forecast demographic census with some validation and calibration techniques?</p>\n\n<p>Some of the concerns:</p>\n\n<ul>\n<li>Census ...
2                             Bayesian and frequentist reasoning in plain English <p>How would you describe in plain English the characteristics that distinguish Bayesian from Frequentist reasoning?</p>\n
3    What is the meaning of p values and t values in statistical tests? <p>After taking a statistics course and then trying to help fellow students, I noticed one subject that inspires much head-desk b...
4    Examples for teaching: Correlation does not mean causation <p>There is an old saying: "Correlation does not mean causation". When I teach, I tend to use the following standard


Removing HTML tags and URLs helps, because they carry no information that is useful for tagging.

## Text Cleaning and Pre-Processing

In [9]:
def clean_text(text):
    
    text = re.sub(r'<.*?>', '', text) # remove HTML tags (URLs inside tag attributes such as href are removed with them)
    
    text = re.sub("[^a-zA-Z]"," ",text) # replace everything except letters with a space
    
    text = ' '.join(text.split()) # collapse extra whitespace in the text
    
    return text
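
To see what each step of `clean_text` does, the sketch below runs the same three substitutions on a made-up HTML snippet (not a row from the dataset). The function is repeated so the example is self-contained.

```python
import re

def clean_text(text):
    text = re.sub(r'<.*?>', '', text)      # drop HTML tags
    text = re.sub("[^a-zA-Z]", " ", text)  # replace non-letters with spaces
    return ' '.join(text.split())          # collapse whitespace

# Made-up snippet, not a row from the dataset
sample = '<p>See <a href="http://example.org/">this post</a> on p-values &amp; t-tests!</p>'
cleaned = clean_text(sample)
print(cleaned)
```

The result is `See this post on p values amp t tests`. Note two side effects: the HTML entity `&amp;` leaves the residue `amp`, and hyphenated terms such as `p-values` are split into separate words.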

In [10]:
# applying the function on text variable

df_questions['Text'] = df_questions['Text'].apply(clean_text)

In [11]:
df_questions['Text'] = df_questions['Text'].str.lower()

In [12]:
df_questions['Text'].head(5)

0    the two cultures statistics vs machine learning last year i read a blog post from brendan o connor entitled statistics vs machine learning fight that discussed some of the differences between the ...
1    forecasting demographic census what are some of the ways to forecast demographic census with some validation and calibration techniques some of the concerns census blocks vary in sizes as rural ar...
2                                       bayesian and frequentist reasoning in plain english how would you describe in plain english the characteristics that distinguish bayesian from frequentist reasoning
3    what is the meaning of p values and t values in statistical tests after taking a statistics course and then trying to help fellow students i noticed one subject that inspires much head desk bangin...
4    examples for teaching correlation does not mean causation there is an old saying correlation does not mean causation when i teach i tend to use the following standard examples

The text now looks much cleaner, but it still contains stopwords. We remove them because they hardly help in determining the tags of a question.

In [13]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

In [14]:
def strip_stopwords(text):
    
    # split the text into individual words and keep only those that are not stopwords
    
    clean_text = [w for w in text.split() if w not in stop_words] 
    
    return ' '.join(clean_text)

In [15]:
df_questions['Text_clean'] = df_questions['Text'].apply(strip_stopwords)

We can also apply various other text cleaning techniques.

To convert the target variable into a one-hot encoded format (one 0/1 column per tag), scikit-learn provides a transformer, `MultiLabelBinarizer`.

### Reshaping target variable

In [17]:
from sklearn.preprocessing import MultiLabelBinarizer

In [18]:
multilabel_binarizer = MultiLabelBinarizer()

multilabel_binarizer.fit(df_questions['Tags'])

# transform target variable ("Tags")
Y = multilabel_binarizer.transform(df_questions['Tags'])

In [19]:
Y.shape

(76365, 100)

The target is now an array with 100 columns, one per unique tag.
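
The same transformation is easier to follow on a tiny example. The sketch below uses three made-up tag lists (the names `toy_tags`, `toy_mlb` and `toy_Y` are illustrative only) to show how each tag becomes a column.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Three made-up questions with their tag lists
toy_tags = [["r", "regression"], ["bayesian"], ["r"]]

toy_mlb = MultiLabelBinarizer()
toy_Y = toy_mlb.fit_transform(toy_tags)

print(toy_mlb.classes_)  # column order: tags sorted alphabetically
print(toy_Y)             # one row per question, one column per tag
```

A question with several tags simply has several 1s in its row, which is what makes this a multilabel target rather than a multiclass one.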

In [20]:
Y

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [21]:
# Next step: extract features from the text and train multiple models to learn the relationship with the 100 newly created target variables

# Here we use TF-IDF to extract features from the text; other options include bag of words, word2vec and GloVe

from sklearn.feature_extraction.text import TfidfVectorizer

In [22]:
# max_features=10000 keeps the 10,000 most frequent terms; max_df=0.8 discards terms that appear in more than 80% of the documents

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=10000) 

X_tfidf = tfidf_vectorizer.fit_transform(df_questions['Text_clean'])

# max_df and max_features are hyperparameters; we can vary them and check which values work best for us

In [23]:
X_tfidf

<76365x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 4017592 stored elements in Compressed Sparse Row format>

In [24]:
tfidf_vectorizer

TfidfVectorizer(max_df=0.8, max_features=10000)
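
To make the effect of `max_df` and `max_features` concrete, here is a small sketch on a made-up corpus in which the word "model" occurs in every document. The corpus and the value `max_features=3` are assumptions chosen only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "model" occurs in all five documents
toy_docs = [
    "linear model regression",
    "bayesian model prior",
    "model variance mean",
    "regression model variance",
    "model prior test",
]

# max_df=0.8 drops terms found in more than 80% of documents;
# max_features=3 then keeps only the 3 most frequent remaining terms
toy_vec = TfidfVectorizer(max_df=0.8, max_features=3)
toy_X = toy_vec.fit_transform(toy_docs)

print(sorted(toy_vec.vocabulary_))  # "model" is not in the vocabulary
print(toy_X.shape)                  # (documents, kept terms)
```

Terms that appear almost everywhere say little about any particular tag, which is why discarding them is usually harmless.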

### Train - Test Split

In [25]:
from sklearn.model_selection import train_test_split

x_train_tfidf, x_val_tfidf, y_train_tfidf, y_val_tfidf = train_test_split(X_tfidf, Y, test_size=0.2, random_state=9)

In [26]:
x_train_tfidf

<61092x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 3224103 stored elements in Compressed Sparse Row format>

In [27]:
x_val_tfidf

<15273x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 793489 stored elements in Compressed Sparse Row format>

In [28]:
y_train_tfidf

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [29]:
y_val_tfidf

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

### Model Building

We need to train one model per unique tag, which in our case means 100 different models. Training 100 models is quite time consuming, so we use a simple classifier (LogisticRegression) for each of them. We could train them manually in a loop, but a better option is OneVsRestClassifier, which trains a separate Logistic Regression model for each tag in our dataset.
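
The sketch below compares the manual loop with OneVsRestClassifier on small synthetic data. The data, shapes and variable names are assumptions made for the example; it is not the notebook's dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in data: 200 samples, 5 features, 3 "tags"
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
W = rng.normal(size=(5, 3))
Y_toy = (X_toy @ W + rng.normal(scale=0.5, size=(200, 3)) > 0).astype(int)

# Manual version: one binary classifier per tag column
manual_models = [LogisticRegression().fit(X_toy, Y_toy[:, j])
                 for j in range(Y_toy.shape[1])]
manual_pred = np.column_stack([m.predict(X_toy) for m in manual_models])

# Same idea via OneVsRestClassifier
ovr = OneVsRestClassifier(LogisticRegression()).fit(X_toy, Y_toy)
ovr_pred = ovr.predict(X_toy)

print(len(ovr.estimators_))              # one fitted estimator per tag
print((manual_pred == ovr_pred).mean())  # fraction of matching predictions
```

Each tag is treated as an independent yes/no problem, so the approach ignores any correlation between tags. This strategy is often called binary relevance.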

In [30]:
from sklearn.linear_model import LogisticRegression

# Binary Relevance
from sklearn.multiclass import OneVsRestClassifier

# Performance metric
from sklearn.metrics import f1_score

In [31]:
lr = LogisticRegression()
clf = OneVsRestClassifier(lr)

In [32]:
# fit model on train data
clf.fit(x_train_tfidf, y_train_tfidf)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

OneVsRestClassifier(estimator=LogisticRegression())

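Now that our model is trained on the training set, we can make predictions. The warning above means the solver hit its iteration limit before converging; as the message suggests, increasing `max_iter` is one way to address it.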

In [33]:
# make predictions for validation set
y_pred = clf.predict(x_val_tfidf)

In [34]:
# print a few predictions

# these predictions are in one-hot encoded format, which makes it hard to tell which tags our model has predicted

print(y_pred[:3])

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


We have to convert these predictions into our tags. We can reuse the object MultiLabelBinarizer and use its inverse transform to convert these predictions into tags.

In [35]:
multilabel_binarizer.inverse_transform(y_pred)[:3]

[('prediction',), ('distributions', 'mean', 'variance'), ('r',)]

First question: 1 tag ('prediction',)

Second question: 3 tags ('distributions', 'mean', 'variance')

Third question: 1 tag ('r',)

### Let us check the actual tags of these questions

In [36]:
multilabel_binarizer.inverse_transform(y_val_tfidf[:3])

[('confidence-interval', 'regression'),
 ('distributions', 'mean', 'variance'),
 ('bayesian', 'r')]

The prediction misses both tags for the first question, matches all three tags for the second, and gets one of the two tags for the third.

The F1 score lies between 0 and 1. The best value is 1, reached when both precision and recall are perfect.

In [37]:
# evaluate performance
f1_score(y_val_tfidf, y_pred)

ValueError: Target is multilabel-indicator but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted', 'samples'].

The error occurs because f1_score has an `average` parameter that defaults to 'binary', which does not apply to multilabel targets.

In [38]:
f1_score(y_val_tfidf, y_pred,average=None) # returns 100 F1 scores, one for each tag in the dataset

array([0.04040404, 0.62266501, 0.63430421, 0.41269841, 0.57627119,
       0.15151515, 0.359375  , 0.688     , 0.5326087 , 0.23913043,
       0.51530612, 0.43896976, 0.72303207, 0.18181818, 0.57769653,
       0.59310345, 0.33513514, 0.49367089, 0.64065708, 0.08910891,
       0.30662021, 0.42592593, 0.01298701, 0.40740741, 0.29063509,
       0.05494505, 0.17142857, 0.27272727, 0.26506024, 0.5388601 ,
       0.34558824, 0.5158371 , 0.41395349, 0.31944444, 0.35056968,
       0.05594406, 0.48201439, 0.08465608, 0.28193833, 0.03125   ,
       0.65042174, 0.37931034, 0.01104972, 0.44029851, 0.56047198,
       0.36619718, 0.17346939, 0.43333333, 0.54253612, 0.        ,
       0.1875    , 0.        , 0.39370079, 0.26506024, 0.32900433,
       0.18837675, 0.17910448, 0.7617689 , 0.10344828, 0.28310502,
       0.38511327, 0.24752475, 0.47560976, 0.52325581, 0.14117647,
       0.52918288, 0.7761807 , 0.29032258, 0.49781659, 0.06451613,
       0.06617647, 0.40333333, 0.27692308, 0.36666667, 0.56498

In [39]:
f1_score(y_val_tfidf, y_pred,average=None).shape

# we can take the mean of these values to get a single overall score

(100,)

In [40]:
np.mean(f1_score(y_val_tfidf, y_pred,average=None))

0.35117052776248087

One thing is still bothering us: the distribution of tags across the dataset. As we have already seen, the distribution is quite imbalanced, yet the mean above gives equal weight to every tag. The weight should instead reflect how often each tag occurs.

In [41]:
f1_score(y_val_tfidf, y_pred,average="micro") # micro pools true positives, false positives and false negatives across all tags, so frequent tags carry more weight

0.434766545051494

This score is higher than the unweighted mean (0.351). The model itself has not changed; micro-averaging simply lets frequent tags count for more.

In [42]:
f1_score(y_val_tfidf, y_pred,average="macro") # macro is the plain mean of the per-tag scores, so every tag gets equal weight

0.35117052776248087
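
To show where the two averages differ, the sketch below computes macro and micro F1 by hand on made-up labels with one frequent and one rare tag, and prints scikit-learn's values next to them. The arrays `y_true` and `y_hat` are illustrative only.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels: tag 0 is frequent, tag 1 is rare
y_true = np.array([[1, 0], [1, 0], [1, 0], [1, 0], [0, 1], [0, 1]])
y_hat = np.array([[1, 0], [1, 0], [1, 0], [1, 0], [0, 1], [1, 0]])

# Per-tag counts of true positives, false positives and false negatives
tp = ((y_true == 1) & (y_hat == 1)).sum(axis=0)
fp = ((y_true == 0) & (y_hat == 1)).sum(axis=0)
fn = ((y_true == 1) & (y_hat == 0)).sum(axis=0)

# Macro: compute F1 per tag, then take the plain mean
per_tag_f1 = 2 * tp / (2 * tp + fp + fn)
macro_f1 = per_tag_f1.mean()

# Micro: pool the counts over all tags, then compute a single F1
micro_f1 = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())

print(macro_f1, f1_score(y_true, y_hat, average="macro"))
print(micro_f1, f1_score(y_true, y_hat, average="micro"))
```

The rare tag has the lower per-tag F1 here, so it pulls the macro average down more than the micro average.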

See the scikit-learn documentation for f1_score to learn about the other averaging options.

By default, these predictions use a probability threshold of 0.5. Changing the threshold can change the predictions, so it is worth trying other values.

In [43]:
# predict probabilities
y_pred_prob = clf.predict_proba(x_val_tfidf)

In [44]:
# set threshold value
t = 0.45

# convert to integers
y = (y_pred_prob >= t).astype(int)
f1_score(y_val_tfidf, y, average="micro")

0.4594094707520891

You can see some improvement compared with the earlier micro F1 score (0.434766).
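
Rather than trying thresholds one at a time, a small grid search can pick the value with the best micro F1. The sketch below is one possible way to do it; the helper `best_threshold` and the synthetic arrays are assumptions for the example and stand in for `y_val_tfidf` and `y_pred_prob`.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true, y_prob, thresholds):
    """Return the (threshold, micro-F1) pair with the highest micro-F1."""
    scores = [f1_score(y_true, (y_prob >= t).astype(int), average="micro")
              for t in thresholds]
    best = int(np.argmax(scores))
    return thresholds[best], scores[best]

# Synthetic stand-in for the validation labels and predicted probabilities
rng = np.random.default_rng(0)
y_true_toy = rng.integers(0, 2, size=(50, 4))
y_prob_toy = np.clip(0.35 * y_true_toy + rng.uniform(0, 0.6, size=(50, 4)), 0, 1)

grid = np.arange(0.1, 0.91, 0.05)
t_best, f1_best = best_threshold(y_true_toy, y_prob_toy, grid)
print(t_best, f1_best)
```

A threshold chosen on the validation set is tuned to that set, so its score is somewhat optimistic; a separate held-out set gives a fairer estimate.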

With this step, the auto-tagging model is complete.

### Question: How are we going to get tags for new questions?

We need a function that takes a new Stack Overflow question as input and returns the appropriate tags as output. This entire process is known as inference.

In [45]:
def infer_tags(q):
    q = clean_text(q)
    q = q.lower()
    q = strip_stopwords(q)
    q_vec = tfidf_vectorizer.transform([q])
    q_pred = clf.predict(q_vec)
    return multilabel_binarizer.inverse_transform(q_pred)

# these are the same preprocessing steps we used when building the model

In [46]:
# give new question
new_q = "Regression line in ggplot doesn't match computed regression Im using R and created a chart using ggplot2. I then create a regression so I can make some predicitions I pass my data frame of to the predict function predict(regression, Measures) I'd expect the predictions to be the same as if I used the regression line on the chart, but they aren't the same. Why would this be the case? Is there a setting in ggplot or is my expectation incorrect?"

# get tags
infer_tags(new_q)

[('r', 'regression')]

The actual tags are 'r', 'regression' and 'ggplot'. Our model predicted 2 of the 3 tags.

# Building Auto-Tagging with Deep Learning (Keras)

In [47]:
# Import Libraries and Datasets

# Text Cleaning and Preprocessing

# Reshape Target Variables (Tags)

# Model Building using Keras (the steps change from here; everything before is the same as earlier)

# Performance Evaluation

In [48]:
df_questions[['Id','Text','Tags']].sample(5)

Unnamed: 0,Id,Text,Tags
24581,113399,generating a dataset from mean standard deviation n and ci i have the output of a couple of socprog models and i d like to see if the results are statistically significant group a output mean sd c...,[normal-distribution]
23213,9539,how do i go about conducting model diagnostics on wls i m familiar with the diagnostics required for ols however i m in new territory with a model i m fitting to data in r using poisson regression...,"[r, modeling]"
35167,32810,how can i compare the effect of an pre post intervention between two groups with multiple numeric dependent variables i have collected data from two groups age age in each group i have patients i ...,"[anova, statistical-significance, repeated-measures]"
46662,148343,interpretation of non significant coefficients my regression uses ols and annual macroeconomics data i find one independent variable x negative and not statistical significant from the theory i ex...,"[regression, interpretation]"
37609,162738,confidence intervals for estimates generated from a non probability sample from what i understand to generate a margin of error to have confidence intervals for a given estimate one needs the stan...,"[probability, sampling]"


Next step: represent the text samples as integers. Every unique word or term needs to be represented by an integer.


This is important because our model accepts only numbers.


E.g. S1: "radiation can impact astronauts' memory" --> array = [1,2,3,4,5]

     S2: "how does memory work" ---> array = [6,7,5,8]
     
     Each word is represented by a unique integer, and a repeated word ("memory") keeps the same integer. A nice representation is available in the notes.
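
The two example sentences can be encoded with a few lines of plain Python. This is only a sketch of the idea: the helper `build_word_index` is hypothetical and numbers words in order of first appearance, whereas the Keras Tokenizer used below orders words by frequency, so its integers would differ.

```python
def build_word_index(sentences):
    """Assign integers to words in order of first appearance, starting at 1."""
    word_index = {}
    for sentence in sentences:
        for word in sentence.split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index

sentences = ["radiation can impact astronauts' memory", "how does memory work"]
word_index = build_word_index(sentences)

# Replace each word by its integer
encoded = [[word_index[w] for w in s.split()] for s in sentences]
print(encoded)
```

The word "memory" maps to 5 in both sentences, which is what lets the model recognise it as the same token.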
    

### Encoding Text into Numbers

In [1]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

Running the Keras imports above before importing pandas gives no error, but importing pandas first throws an error (AttributeError: type object 'h5py.h5.H5PYConfig' has no attribute '__reduce_cython__').

In [53]:
!pip install h5py



The issue linked below mentions that downgrading h5py to 2.9 resolved the problem. Our installed version is 2.10.0.

https://github.com/tensorflow/tensorflow/issues/36162


In [16]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df_questions['Text'])

In [17]:
# check unique words count
len(tokenizer.word_index)

81956

In [18]:
# vocabulary size: unique words + 1, because index 0 is reserved for padding
vocab_size = len(tokenizer.word_index) + 1
vocab_size

81957
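
The extra slot exists because the Tokenizer never assigns index 0 to a word, which leaves 0 free to fill short sequences up to a common length. The sketch below shows the idea in plain NumPy; the helper `pad_pre` is hypothetical and not part of Keras, and it pads and truncates on the left, which is assumed here to mirror the default behaviour of `pad_sequences`.

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    out = np.full((len(seqs), maxlen), value, dtype=int)
    for row, seq in enumerate(seqs):
        trimmed = seq[-maxlen:]                     # keep the last maxlen tokens
        out[row, maxlen - len(trimmed):] = trimmed  # padding value stays on the left
    return out

# Made-up integer sequences of different lengths
toy_sequences = [[1, 2, 3, 4, 5], [6, 7, 5, 8], [9]]
padded = pad_pre(toy_sequences, maxlen=4)
print(padded)
```

Every row now has the same length, so the sequences can be stacked into a single array and fed to a model.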

In [19]:
sequences = tokenizer.texts_to_sequences(df_questions['Text'])

In [20]:
i = 0
print(df_questions['Text'][i], '\n')
print(sequences[i])

the two cultures statistics vs machine learning last year i read a blog post from brendan o connor entitled statistics vs machine learning fight that discussed some of the differences between the two fields andrew gelman responded favorably to this simon blomberg from r s fortunes package to paraphrase provocatively machine learning is statistics minus any checking of models and assumptions brian d ripley about the difference between machine learning and statistics user vienna may season s greetings andrew gelman in that case maybe we should get rid of checking of models and assumptions more often then maybe we d be able to solve some of the problems that the machine learning people can solve but we can t there was also the statistical modeling the two cultures paper by leo breiman in which argued that statisticians rely too heavily on data modeling and that machine learning techniques are making progress by instead relying on the predictive accuracy of models has the statistics field 