# Detecting Disfluent Speech: Experiment 2

### What we know so far

1) Experiment 1 used a Naive Bayes classifier on the top 100 TF-IDF features. <br>
2) Stemming and lemmatization were NOT performed. <br>
3) F1 scores: training 0.93, test 0.62 (the gap suggests the model may be overfitting). <br>

### Import Libraries

In [1]:
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report

In [2]:
train = pd.read_json("train.json")
test = pd.read_json("test.json")
train = train.transpose()
test = test.transpose()

### Redoing the train/test split to address possible overfitting in Experiment 1

In [3]:
df = pd.concat([train, test])
df.reset_index(drop= True, inplace = True)
df

Unnamed: 0,original,disfluent
0,What do unstable isotope studies indicate?,What do petrologists no what do unstable isoto...
1,What is the basic unit of territorial division...,What is the second level of territorial divisi...
2,Which genus lack tentacles and sheaths?,Juvenile platyctenids no wow Which genus lack ...
3,Long-lived memory cells can remember previous ...,When a pathogen is met again scratch that I me...
4,What led to Newcastle's rise to power as milit...,What led to the Duke of Cumberland's rise to p...
...,...,...
10820,Amazon rain forest experienced another mild dr...,Amazon rain forest experienced another mild dr...
10821,The Amazon releases how much carbon dioxide ea...,"The Amazon releases how much vegetation, ack s..."
10822,The 2010 drought had three what were vegetatio...,How many er the 2010 drought had three what we...
10823,In 2005 the force absorbed how much carbon dio...,"In 2010, the force absorbed how much carbon di..."


### Data Transformation

In [4]:
# Build one long frame: fluent questions labelled 0, disfluent questions labelled 1
df1 = df[['original']].rename(columns={'original': 'Text'}).assign(Target=0)
df2 = df[['disfluent']].rename(columns={'disfluent': 'Text'}).assign(Target=1)
final_df = pd.concat([df1, df2], ignore_index=True)
final_df['ID'] = final_df.index  # row position, kept as an explicit column
final_df

Unnamed: 0,Text,Target,ID
0,What do unstable isotope studies indicate?,0,0
1,What is the basic unit of territorial division...,0,1
2,Which genus lack tentacles and sheaths?,0,2
3,Long-lived memory cells can remember previous ...,0,3
4,What led to Newcastle's rise to power as milit...,0,4
...,...,...,...
21645,Amazon rain forest experienced another mild dr...,1,21645
21646,"The Amazon releases how much vegetation, ack s...",1,21646
21647,How many er the 2010 drought had three what we...,1,21647
21648,"In 2010, the force absorbed how much carbon di...",1,21648


## STEMMING

### Function to stem a sentence.


In [5]:
porter = PorterStemmer()

def stemSentence(sentence):
    '''
    Stems every token in the sentence and returns the stemmed sentence.
    '''
    token_words = word_tokenize(sentence)
    stem_sentence = []
    for word in token_words:
        stem_sentence.append(porter.stem(word))
        stem_sentence.append(" ")  # separator; leaves one trailing space
    return "".join(stem_sentence)

In [6]:
# Preview only: the result is not assigned back, so final_df['Text'] stays unstemmed
final_df['Text'].apply(stemSentence)

0                    what do unstabl isotop studi indic ? 
1        what is the basic unit of territori divis in w...
2                    which genu lack tentacl and sheath ? 
3        long-liv memori cell can rememb previou encoun...
4        what led to newcastl 's rise to power as milit...
                               ...                        
21645    amazon rain forest experienc anoth mild drough...
21646    the amazon releas how much veget , ack sorri ,...
21647    how mani er the 2010 drought had three what we...
21648    in 2010 , the forc absorb how much carbon diox...
21649    in 2010 the forc experienc anoth sever drought...
Name: Text, Length: 21650, dtype: object
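
The cell above displays the stemmed text but does not store it, which is consistent with the unstemmed vocabulary that appears in the TF-IDF tables later on. A minimal sketch of keeping the stemmed text in its own column follows. The column name `Text_stemmed`, the helper `stem_text`, and the regex tokenizer (used here in place of `word_tokenize` so that no NLTK tokenizer data is required) are assumptions for illustration, not part of the notebook.

```python
import re

import pandas as pd
from nltk.stem import PorterStemmer

porter = PorterStemmer()

def stem_text(text):
    """Tokenize and Porter-stem a sentence (illustrative helper)."""
    # Assumption: split into word runs and single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # PorterStemmer lowercases tokens by default
    return " ".join(porter.stem(tok) for tok in tokens)

demo = pd.DataFrame({"Text": ["What do unstable isotope studies indicate?"]})
# Keep the raw text and add the stemmed version next to it
demo["Text_stemmed"] = demo["Text"].apply(stem_text)
print(demo["Text_stemmed"].iloc[0])
```

Writing to a separate column keeps the raw text available, so stemmed and unstemmed features can be compared in later experiments.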

In [7]:
X = final_df[['Text','ID']]
y = final_df['Target']

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state= 7)
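
Each source question contributes two rows to `final_df` (its original and its disfluent form), so a purely random split can put one form in training and the other in test. One possible alternative, not used in this notebook, is scikit-learn's `GroupShuffleSplit`, which keeps all rows of a group on the same side of the split. The toy frame and the `pair_id` column below are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in for final_df: 5 questions, each present as fluent (0) and disfluent (1)
n_pairs = 5
toy = pd.DataFrame({
    "Text": [f"question {i}" for i in range(n_pairs)]
            + [f"question er no {i}" for i in range(n_pairs)],
    "Target": [0] * n_pairs + [1] * n_pairs,
})
# Hypothetical group key: rows i and i + n_pairs come from the same source question
toy["pair_id"] = toy.index % n_pairs

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=7)
train_idx, test_idx = next(
    splitter.split(toy["Text"], toy["Target"], groups=toy["pair_id"])
)

train_groups = set(toy.loc[train_idx, "pair_id"])
test_groups = set(toy.loc[test_idx, "pair_id"])
# No question should appear on both sides of the split
print(train_groups.isdisjoint(test_groups))
```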

## Using TFIDF Feature Selection

In [9]:
tfidfvectorizer = TfidfVectorizer(analyzer='word')
train_matrix = tfidfvectorizer.fit_transform(X_train['Text'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
features = tfidfvectorizer.get_feature_names_out()

In [10]:
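# Note: a second vectorizer is fitted on the test split here, so its vocabulary
# and IDF weights differ from the ones learned on the training split
tfidfvectorizer_test = TfidfVectorizer(analyzer='word')
test_matrix = tfidfvectorizer_test.fit_transform(X_test['Text'])
features_test = tfidfvectorizer_test.get_feature_names_out()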

In [11]:
tfidf_train = pd.DataFrame.sparse.from_spmatrix(train_matrix, columns = features)
tfidf_train

Unnamed: 0,000,003,010,04,10,100,1000,10000,101,1010,...,zoo,zoological,zooplankton,zooplanktonic,zpp,zuider,zulfiqar,zulu,zulus,zygumunt
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17315,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17316,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17317,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17318,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [12]:
tfidf_test = pd.DataFrame.sparse.from_spmatrix(test_matrix, columns = features_test)
tfidf_test

Unnamed: 0,000,003,010,04,09,10,100,1000,102,1031,...,zinn,zones,zoning,zooplankton,zpp,zuider,zulfiqar,zygumunt,π1,π2
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4325,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4326,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4327,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4328,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Ranking Features - Train Set

In [13]:
sums = train_matrix.sum(axis=0)
final_list = []
for col, terms in enumerate(features):
    final_list.append( (terms, sums[0,col] ))

ranks = pd.DataFrame(final_list, columns=['term','rank'])
ranks = ranks.sort_values('rank', ascending=False)
ranks

Unnamed: 0,term,rank
9431,the,1102.225094
10269,what,906.852176
6670,of,734.510177
4919,in,603.542296
5213,is,570.914929
...,...,...
9704,tstorage,0.119595
9587,tpharmacy,0.119595
9586,tpatients,0.119595
9739,twithin,0.119595


### Ranking Features - Test Set

In [14]:
sums_test = test_matrix.sum(axis=0)
final_list_test = []
for col, terms in enumerate(features_test):
    final_list_test.append( (terms, sums_test[0,col] ))

ranks = pd.DataFrame(final_list_test, columns=['term','rank'])
ranks = ranks.sort_values('rank', ascending=False)
ranks

Unnamed: 0,term,rank
6034,the,282.494758
6540,what,235.767033
4255,of,188.909452
3143,in,160.893416
4167,no,148.730134
...,...,...
1780,defines,0.221632
3445,justified,0.220824
5051,refusal,0.220824
5357,satyagraha,0.211329


### TFIDF TOP 100 TRAIN SET

In [15]:
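# Note: `ranks` was reassigned in In [14], so these are the top 100 terms
# by summed TF-IDF weight on the test split, applied to the training matrix
top100 = list(ranks[:100]["term"])
top100_df = tfidf_train[top100]
top100_df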

Unnamed: 0,the,what,of,in,no,is,to,was,did,how,...,century,had,schools,between,into,school,known,population,new,amazon
0,0.094864,0.085389,0.122198,0.000000,0.000000,0.149212,0.000000,0.000000,0.0,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
1,0.080865,0.072788,0.000000,0.118213,0.118300,0.000000,0.000000,0.000000,0.0,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
2,0.000000,0.173536,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
3,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.160896,0.0,0.170696,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
4,0.000000,0.087616,0.000000,0.000000,0.284799,0.000000,0.000000,0.173341,0.0,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17315,0.000000,0.081001,0.000000,0.000000,0.000000,0.000000,0.144789,0.000000,0.0,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
17316,0.000000,0.139867,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
17317,0.126059,0.113468,0.000000,0.000000,0.000000,0.000000,0.000000,0.224487,0.0,0.000000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
17318,0.000000,0.063417,0.090754,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,...,0.0,0.0,0.0,0.232779,0.0,0.0,0.0,0.0,0.0,0.236575


### TFIDF TOP 100 TEST SET

In [16]:
top100_test = list(ranks[:100]["term"])
top100_df_test = tfidf_test[top100_test]
top100_df_test

Unnamed: 0,the,what,of,in,no,is,to,was,did,how,...,century,had,schools,between,into,school,known,population,new,amazon
0,0.000000,0.115007,0.166515,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0
1,0.077989,0.000000,0.000000,0.112571,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0
2,0.058822,0.052852,0.153047,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0
3,0.194376,0.058216,0.084290,0.093523,0.000000,0.100788,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0
4,0.000000,0.085121,0.000000,0.000000,0.000000,0.147367,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.0,0.000000,0.0,0.0,0.330662,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4325,0.000000,0.000000,0.081530,0.000000,0.000000,0.000000,0.200050,0.000000,0.000000,0.000000,...,0.0,0.214135,0.0,0.214135,0.0,0.0,0.000000,0.000000,0.0,0.0
4326,0.000000,0.066519,0.000000,0.000000,0.106285,0.000000,0.000000,0.000000,0.134414,0.000000,...,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0
4327,0.072687,0.000000,0.094561,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.279307,...,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0
4328,0.000000,0.166690,0.000000,0.000000,0.133169,0.000000,0.000000,0.166701,0.000000,0.000000,...,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0


## Create and Fit a Naive Bayes Model

In [17]:
NB = MultinomialNB()

In [18]:
tf100_model = NB.fit(top100_df, y_train)
tf100_pred_train = tf100_model.predict(top100_df)
tf100_pred_test = tf100_model.predict(top100_df_test)
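
Vectorizing and model fitting can also be chained, so that the vectorizer fitted on the training text is always the one applied before prediction. A minimal sketch using scikit-learn's `Pipeline` follows. Note that `max_features=100` keeps the 100 most frequent terms by corpus count, which is a different criterion from the summed TF-IDF ranking used in this notebook. The toy sentences and labels are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration: 0 = fluent, 1 = disfluent
texts = ["what is the capital of france",
         "what is er no where is the capital of france",
         "who wrote the book",
         "who wrote no sorry who sold the book"]
labels = [0, 1, 0, 1]

nb_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=100)),
    ("nb", MultinomialNB()),
])
nb_pipeline.fit(texts, labels)

# predict() applies the fitted vectorizer, then the classifier
preds = nb_pipeline.predict(["which er no what river is the longest"])
print(preds)
```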

## Create and Fit a Random Forest Model

In [19]:
RF = RandomForestClassifier(max_depth=2, random_state=0)
rf100_model = RF.fit(top100_df, y_train)
rf100_pred_train = rf100_model.predict(top100_df)
rf100_pred_test = rf100_model.predict(top100_df_test)

## Model Evaluation

In [20]:
def evaluate_model(y_test, y_pred):
    print("CONFUSION MATRIX\n")
    print(confusion_matrix(y_test, y_pred))
    print("\nCLASSIFICATION REPORT\n")
    print(classification_report(y_test, y_pred))
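
Because the question carried over from Experiment 1 is whether the model overfits, it can help to put the train and test F1 scores side by side. A small sketch using `f1_score` follows. `f1_gap` is a hypothetical helper name, and the labels are toy values.

```python
from sklearn.metrics import f1_score

def f1_gap(y_train, pred_train, y_test, pred_test):
    """Return (train F1, test F1, difference) for the positive class."""
    f1_train = f1_score(y_train, pred_train)
    f1_test = f1_score(y_test, pred_test)
    return f1_train, f1_test, f1_train - f1_test

# Toy labels, invented for illustration
gap = f1_gap([0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 0, 0])
print(gap)
```

A large positive difference points toward overfitting, while a difference near zero indicates that train and test performance are similar.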

### NAIVE BAYES TFIDF WITH 100 FEATURES TRAIN SET

In [21]:
print("\n ----------MODEL EVALUATION FOR NB: TFIDF WITH 100 FEATURES---------\n\n")
evaluate_model(y_train,tf100_pred_train)


 ----------MODEL EVALUATION FOR NB: TFIDF WITH 100 FEATURES---------


CONFUSION MATRIX

[[8389  305]
 [ 711 7915]]

CLASSIFICATION REPORT

              precision    recall  f1-score   support

           0       0.92      0.96      0.94      8694
           1       0.96      0.92      0.94      8626

    accuracy                           0.94     17320
   macro avg       0.94      0.94      0.94     17320
weighted avg       0.94      0.94      0.94     17320



### NAIVE BAYES TFIDF WITH 100 FEATURES TEST SET

In [22]:
print("\n ----------MODEL EVALUATION FOR NB: TFIDF WITH 100 FEATURES---------\n\n")
evaluate_model(y_test,tf100_pred_test)


 ----------MODEL EVALUATION FOR NB: TFIDF WITH 100 FEATURES---------


CONFUSION MATRIX

[[2055   76]
 [ 184 2015]]

CLASSIFICATION REPORT

              precision    recall  f1-score   support

           0       0.92      0.96      0.94      2131
           1       0.96      0.92      0.94      2199

    accuracy                           0.94      4330
   macro avg       0.94      0.94      0.94      4330
weighted avg       0.94      0.94      0.94      4330
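
As a cross-check, the class 1 (disfluent) figures in the report above can be recomputed from the test-set confusion matrix, whose rows are true labels and whose columns are predicted labels:

```python
# Test-set confusion matrix for Naive Bayes, copied from the output above
tn, fp = 2055, 76
fn, tp = 184, 2015

precision = tp / (tp + fp)  # 2015 / 2091
recall = tp / (tp + fn)     # 2015 / 2199
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.96 0.92 0.94
```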



### RANDOM FOREST TFIDF WITH 100 FEATURES TRAIN SET

In [23]:
print("\n ----------MODEL EVALUATION FOR RF: TFIDF WITH 100 FEATURES---------\n\n")
evaluate_model(y_train,rf100_pred_train)


 ----------MODEL EVALUATION FOR RF: TFIDF WITH 100 FEATURES---------


CONFUSION MATRIX

[[8542  152]
 [1572 7054]]

CLASSIFICATION REPORT

              precision    recall  f1-score   support

           0       0.84      0.98      0.91      8694
           1       0.98      0.82      0.89      8626

    accuracy                           0.90     17320
   macro avg       0.91      0.90      0.90     17320
weighted avg       0.91      0.90      0.90     17320



### RANDOM FOREST TFIDF WITH 100 FEATURES TEST SET

In [24]:
print("\n ----------MODEL EVALUATION FOR RF: TFIDF WITH 100 FEATURES---------\n\n")
evaluate_model(y_test,rf100_pred_test)


 ----------MODEL EVALUATION FOR RF: TFIDF WITH 100 FEATURES---------


CONFUSION MATRIX

[[2081   50]
 [ 406 1793]]

CLASSIFICATION REPORT

              precision    recall  f1-score   support

           0       0.84      0.98      0.90      2131
           1       0.97      0.82      0.89      2199

    accuracy                           0.89      4330
   macro avg       0.90      0.90      0.89      4330
weighted avg       0.91      0.89      0.89      4330

