## Content
- [Benchmark Model and Baseline Score](#Benchmark-Model-and-Baseline-Score)
- [Hyper-Parameters Tuning](#Hyper-Parameters-Tuning)
- [Limitations of Model](#Limitations-of-Model)
- [Conclusion and Recommendations](#Conclusion-and-Recommendations)

In [1]:
# Importing libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

In [2]:
#  Import data
combine = pd.read_csv('../datasets/combine.csv')
combine.head()

Unnamed: 0,title,selftext,comments,subreddit,text
0,RULE REMINDER: You cannot Post Offers to Trade...,Admins have banned other subs for this.\n\nNo ...,,0,rule remind cannot post offer trade sell copyr...
1,,,What are your recommendations on increasing yo...,0,recommend increas
2,,,how about just giving someone a free jump prog...,0,someon jump program origin
3,,,[deleted],0,delet
4,r/Basketball Weekly Discussion: Basketball Sho...,#Welcome to /r/Basketball's weekly Shoe Discus...,,0,basketbal weekli discuss basketbal shoe septem...


In [3]:
X = combine['text']
y = combine['subreddit']
# X is our feature variable and y is target variable

In [4]:
combine.isna().sum()

title        6502
selftext     7306
comments     1735
subreddit       0
text            0
dtype: int64

In [5]:
X_training, X_unseen, y_training, y_unseen = train_test_split(X, y, test_size=0.2, random_state=42, stratify = y)
# first split of the data

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X_training, y_training, test_size=0.2, random_state=42, stratify = y_training)
# second split of the data
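The two splits above leave three partitions: a training set, a test set used for model selection, and an unseen hold-out set. As a rough check of the proportions, the arithmetic can be sketched as below. The helper `split_sizes` is hypothetical, and the row count of 8,237 is an assumption inferred from the missing-value counts and the supports reported in this notebook, not a figure printed by the data itself. The rounding follows `train_test_split`, which rounds a fractional test size up.

```python
import math

def split_sizes(n_rows, test_pct=20):
    """Sizes of (train, test, unseen) after two successive stratified splits."""
    n_unseen = math.ceil(n_rows * test_pct / 100)       # first split: hold-out set
    n_training = n_rows - n_unseen
    n_test = math.ceil(n_training * test_pct / 100)     # second split: test set
    n_train = n_training - n_test
    return n_train, n_test, n_unseen

# 8237 rows is an assumed total, used only for illustration
n_train, n_test, n_unseen = split_sizes(8237)
print(n_train, n_test, n_unseen)
```

Under that assumption, roughly 64% of the rows are used for fitting, 16% for comparing models, and 20% are kept back for the final check.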

## Benchmark Model and Baseline Score

In [7]:
cvec = CountVectorizer()
X_train_cvec = cvec.fit_transform(X_train)
X_test_cvec = cvec.transform(X_test)
# Convert a collection of raw documents to a matrix of vectorized features
ss = StandardScaler(with_mean = False)
X_train_ss = ss.fit_transform(X_train_cvec)
X_test_ss = ss.transform(X_test_cvec)
# Scale vectorized features to unit variance; with_mean=False skips centering so the matrix stays sparse

In [8]:
benchmark = KNeighborsClassifier()  # Instantiate and fit model
benchmark.fit(X_train_ss, y_train)
benchmark_predict = benchmark.predict(X_test_ss)  # predicting results

In [9]:
TN, FP, FN, TP = confusion_matrix(y_test,benchmark_predict).ravel()

print('\033[1m'+'Benchmark model')
print('\033[0m'f'Specificity of benchmark\t: {TN/(TN + FP)}')
print(f'Sensitivity of model\t\t: {TP/(TP + FN)}')
print(f'Accuracy for train data\t\t: {cross_val_score(benchmark, X_train_ss, y_train, n_jobs = -1, cv = 5).mean()}')
print(f'Accuracy for test data\t\t: {cross_val_score(benchmark, X_test_ss, y_test, n_jobs = -1, cv = 5).mean()}\n')
print(classification_report(y_test,benchmark_predict))

pd.DataFrame(confusion_matrix(y_test,benchmark_predict),
             index = ['Actual Basketball', 'Actual Soccer'],
             columns = ['Predicted Basketball','Predicted Soccer'])

Benchmark model
Specificity of benchmark	: 0.6238532110091743
Sensitivity of model		: 0.8674698795180723
Accuracy for train data		: 0.7427453978074947
Accuracy for test data		: 0.6328061988708377

              precision    recall  f1-score   support

           0       0.82      0.62      0.71       654
           1       0.70      0.87      0.78       664

    accuracy                           0.75      1318
   macro avg       0.76      0.75      0.74      1318
weighted avg       0.76      0.75      0.74      1318



||Predicted Basketball|Predicted Soccer|
|---|---|---|
|**Actual Basketball**|408|246|
|**Actual Soccer**|88|576|


We use K Nearest Neighbors as our benchmark model. Our goal is to bring the specificity of the model as close to 1 as possible. We focus on specificity because we want to minimize false positives: we do not want to predict a text as soccer when it is actually basketball. As the dataset is balanced, we also want to improve the accuracy of the model. Currently, the benchmark's specificity is 0.62385. Its accuracy is 0.63281 when cross-validated on the test split alone, and 0.75 when the fitted model predicts the test split directly (see the classification report above). We will now tune hyperparameters to find a better model.
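To make the three metrics concrete, here is a minimal sketch that recomputes them from the benchmark confusion matrix counts shown above. The helper name `binary_metrics` is hypothetical; soccer is treated as the positive class (1) and basketball as the negative class (0), as in the rest of the notebook.

```python
def binary_metrics(tn, fp, fn, tp):
    """Return (specificity, sensitivity, accuracy) from confusion-matrix counts."""
    specificity = tn / (tn + fp)                 # share of actual basketball predicted as basketball
    sensitivity = tp / (tp + fn)                 # share of actual soccer predicted as soccer
    accuracy = (tn + tp) / (tn + fp + fn + tp)   # share of all texts predicted correctly
    return specificity, sensitivity, accuracy

# Counts taken from the benchmark confusion matrix
spec, sens, acc = binary_metrics(tn=408, fp=246, fn=88, tp=576)
print(f"specificity={spec:.5f} sensitivity={sens:.5f} accuracy={acc:.5f}")
```

A false positive here is a basketball text predicted as soccer, which is why lowering `fp` is what raises specificity.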

## Hyper-Parameters Tuning
- Multinomial Naive Bayes Model

The Count vectorizer counts the number of occurrences of each word in each text, with one column per word. The Tfidf vectorizer, on the other hand, weights each word's term frequency within a text by its inverse document frequency, so words that appear in many texts are down-weighted. With either vectorizer, a column can take many different values across texts, so the Multinomial Naive Bayes model is a better fit here than the Bernoulli Naive Bayes model, which treats each feature as binary (0 or 1).
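The difference between the two vectorizers can be sketched on a toy corpus. The two documents below are made up for illustration and are not taken from the dataset; the vectorizers are used with their default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy documents, invented for illustration only
docs = ["goal goal match", "dunk court match"]

cvec = CountVectorizer()
counts = cvec.fit_transform(docs).toarray()
# Column order follows the learned vocabulary indices (alphabetical)
vocab = sorted(cvec.vocabulary_, key=cvec.vocabulary_.get)

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs).toarray()

print(vocab)
print(counts)            # raw counts per document
print(weights.round(3))  # tf-idf weights per document
```

In the second document every word occurs once, yet "match" receives a lower tf-idf weight than "court" because "match" appears in both documents.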

In [10]:
mnb_pipe_cvec = Pipeline([('cvec',CountVectorizer()),('mnb',MultinomialNB())])

cvec_params = {'cvec__max_features': [6500, 7500, 8947],
    'cvec__min_df': [2, 3],
    'cvec__max_df': [0.4, 0.45],
    'cvec__ngram_range': [(1,1), (1,2)]}

mnb_gs_cvec = GridSearchCV(mnb_pipe_cvec, cvec_params, n_jobs = -1, cv = 5, verbose = 1)
mnb_gs_cvec.fit(X_train, y_train)
mnb_predict_cvec = mnb_gs_cvec.predict(X_test)

Fitting 5 folds for each of 24 candidates, totalling 120 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    6.6s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:   19.9s finished
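The "24 candidates, totalling 120 fits" reported above follows directly from the size of the parameter grid. A minimal sketch using `ParameterGrid` from scikit-learn, with the same grid as in the cell above and the 5 folds passed to `GridSearchCV`:

```python
from sklearn.model_selection import ParameterGrid

cvec_params = {'cvec__max_features': [6500, 7500, 8947],
               'cvec__min_df': [2, 3],
               'cvec__max_df': [0.4, 0.45],
               'cvec__ngram_range': [(1, 1), (1, 2)]}

n_candidates = len(ParameterGrid(cvec_params))  # 3 * 2 * 2 * 2 combinations
n_fits = n_candidates * 5                       # each candidate is fitted once per fold
print(n_candidates, n_fits)
```

The same calculation explains why adding a further parameter with two values, as in the SVC grids later on, doubles the number of fits.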


In [11]:
TN, FP, FN, TP = confusion_matrix(y_test,mnb_predict_cvec).ravel()

print('\033[1m'+'Count vectorizer with MultinomialNB model')
print('\033[0m'f'Specificity of model\t\t: {TN/(TN + FP)}')
print(f'Sensitivity of model\t\t: {TP/(TP + FN)}')
print(f'Accuracy for the model\t\t: {mnb_gs_cvec.best_score_}\n')
print(classification_report(y_test,mnb_predict_cvec))

pd.DataFrame(confusion_matrix(y_test,mnb_predict_cvec),
             index = ['Actual Basketball', 'Actual Soccer'],
             columns = ['Predicted Basketball','Predicted Soccer'])

Count vectorizer with MultinomialNB model
Specificity of model		: 0.9495412844036697
Sensitivity of model		: 0.8478915662650602
Accuracy for the model		: 0.8911038966878602

              precision    recall  f1-score   support

           0       0.86      0.95      0.90       654
           1       0.94      0.85      0.89       664

    accuracy                           0.90      1318
   macro avg       0.90      0.90      0.90      1318
weighted avg       0.90      0.90      0.90      1318



||Predicted Basketball|Predicted Soccer|
|---|---|---|
|**Actual Basketball**|621|33|
|**Actual Soccer**|101|563|


In [12]:
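# Note: in the scikit-learn version used here, coef_ of a two-class MultinomialNB holds the
# log-probability of each word given class 1 (soccer). The lowest values are therefore words
# that are rare in soccer texts, not necessarily words that are typical of basketball.
mnb_cv_coef = pd.DataFrame(mnb_gs_cvec.best_estimator_.named_steps['mnb'].coef_,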
             columns = mnb_gs_cvec.best_estimator_.named_steps['cvec'].get_feature_names()).rename(
    index={0:'Coefficients'}).T

print('Words likely related to soccer\t:\n',mnb_cv_coef.sort_values(by='Coefficients', ascending = False)[:10],'\n')
print('Words likely related to basketball\t:\n',mnb_cv_coef.sort_values(by='Coefficients', ascending = True)[:10])

Words likely related to soccer	:
            Coefficients
http          -4.881564
player        -4.889017
goal          -5.162667
match         -5.223496
leagu         -5.258961
like          -5.258961
team          -5.299482
substitut     -5.394041
would         -5.453633
game          -5.521686 

Words likely related to basketball	:
               Coefficients
zw              -10.882978
pivot dribbl    -10.882978
pivot           -10.882978
pickup game     -10.882978
feel better     -10.882978
feel comfort    -10.882978
feel good       -10.882978
feel great      -10.882978
physiqu         -10.882978
feel pain       -10.882978


In [13]:
mnb_pipe_tf = Pipeline([('tfidf',TfidfVectorizer()),('mnb',MultinomialNB())])

tfidf_params = {'tfidf__max_features': [5000, 7500, 10000],
    'tfidf__ngram_range': [(1,1), (1,2)],
    'tfidf__min_df': [2, 3],
    'tfidf__max_df': [0.4, 0.45]}

mnb_gs_tf = GridSearchCV(mnb_pipe_tf, tfidf_params, n_jobs = -1, cv = 5, verbose = 1)
mnb_gs_tf.fit(X_train, y_train)
mnb_predict_tf = mnb_gs_tf.predict(X_test)

Fitting 5 folds for each of 24 candidates, totalling 120 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    6.8s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:   21.7s finished


In [14]:
TN, FP, FN, TP = confusion_matrix(y_test,mnb_predict_tf).ravel()

print('\033[1m'+'Tfidf vectorizer with MultinomialNB model')
print('\033[0m'f'Specificity of model\t\t: {TN/(TN + FP)}')
print(f'Sensitivity of model\t\t: {TP/(TP + FN)}')
print(f'Accuracy for the model\t\t: {mnb_gs_tf.best_score_}\n')
print(classification_report(y_test,mnb_predict_tf))

pd.DataFrame(confusion_matrix(y_test,mnb_predict_tf),
             index = ['Actual Basketball', 'Actual Soccer'],
             columns = ['Predicted Basketball','Predicted Soccer'])

Tfidf vectorizer with MultinomialNB model
Specificity of model		: 0.9235474006116208
Sensitivity of model		: 0.8780120481927711
Accuracy for the model		: 0.8994492657176003

              precision    recall  f1-score   support

           0       0.88      0.92      0.90       654
           1       0.92      0.88      0.90       664

    accuracy                           0.90      1318
   macro avg       0.90      0.90      0.90      1318
weighted avg       0.90      0.90      0.90      1318



||Predicted Basketball|Predicted Soccer|
|---|---|---|
|**Actual Basketball**|604|50|
|**Actual Soccer**|81|583|


In [15]:
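# Note: as above, coef_ here is the log-probability of each word given class 1 (soccer),
# so the lowest values are words that are rare in soccer texts rather than basketball-specific words.
mnb_tf_coef = pd.DataFrame(mnb_gs_tf.best_estimator_.named_steps['mnb'].coef_,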
             columns = mnb_gs_tf.best_estimator_.named_steps['tfidf'].get_feature_names()).rename(
    index={0:'Coefficients'}).T

print('Words likely related to soccer\t:\n',mnb_tf_coef.sort_values(by='Coefficients', ascending = False)[:10],'\n')
print('Words likely related to basketball\t:\n',mnb_tf_coef.sort_values(by='Coefficients', ascending = True)[:10])

Words likely related to soccer	:
         Coefficients
player     -6.040902
sign       -6.143130
good       -6.180685
like       -6.229467
goal       -6.355205
season     -6.395407
would      -6.411158
great      -6.448056
well       -6.524089
leagu      -6.526777 

Words likely related to basketball	:
             Coefficients
local team     -9.752477
loud           -9.752477
lost skill     -9.752477
look weird     -9.752477
look start     -9.752477
look shoe      -9.752477
look see       -9.752477
look peopl     -9.752477
look open      -9.752477
look nba       -9.752477


We compared the Count Vectorizer and the Tfidf Vectorizer and tuned their parameters with Grid Search to build our Multinomial Naive Bayes models. Both models have a higher specificity than sensitivity, which shows that they are better at identifying texts that are actually basketball than texts that are actually soccer. Compared to the benchmark, both MultinomialNB models achieve better specificity and accuracy. Only the MultinomialNB model with the Tfidf vectorizer achieves better sensitivity than the benchmark (0.87801 against 0.86747); with the Count vectorizer, sensitivity drops to 0.84789.

## Hyper-Parameters Tuning
- Logistic Regression

In [16]:
log_pipe_cvec = Pipeline([('cvec',CountVectorizer()),('logreg', LogisticRegression())])

cvec_params = {'cvec__max_features': [6500, 7500, 8947],
    'cvec__min_df': [2, 3],
    'cvec__max_df': [0.4, 0.45],
    'cvec__ngram_range': [(1,1), (1,2)]}

log_gs_cvec = GridSearchCV(log_pipe_cvec, cvec_params, n_jobs = -1, cv = 5, verbose = 1)
log_gs_cvec.fit(X_train, y_train)
log_predict_cvec = log_gs_cvec.predict(X_test)

Fitting 5 folds for each of 24 candidates, totalling 120 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    9.9s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:   27.9s finished


In [17]:
TN, FP, FN, TP = confusion_matrix(y_test,log_predict_cvec).ravel()

print('\033[1m'+'Count vectorizer with Logistic Regression')
print('\033[0m'f'Specificity of model\t\t: {TN/(TN + FP)}')
print(f'Sensitivity of model\t\t: {TP/(TP + FN)}')
print(f'Accuracy for the model\t\t: {log_gs_cvec.best_score_}\n')
print(classification_report(y_test,log_predict_cvec))

pd.DataFrame(confusion_matrix(y_test,log_predict_cvec),
             index = ['Actual Basketball', 'Actual Soccer'],
             columns = ['Predicted Basketball','Predicted Soccer'])

Count vectorizer with Logistic Regression
Specificity of model		: 0.8302752293577982
Sensitivity of model		: 0.9367469879518072
Accuracy for the model		: 0.8893950376359075

              precision    recall  f1-score   support

           0       0.93      0.83      0.88       654
           1       0.85      0.94      0.89       664

    accuracy                           0.88      1318
   macro avg       0.89      0.88      0.88      1318
weighted avg       0.89      0.88      0.88      1318



||Predicted Basketball|Predicted Soccer|
|---|---|---|
|**Actual Basketball**|543|111|
|**Actual Soccer**|42|622|


In [18]:
log_cv_coef = pd.DataFrame(log_gs_cvec.best_estimator_.named_steps['logreg'].coef_,
             columns = log_gs_cvec.best_estimator_.named_steps['cvec'].get_feature_names()).rename(
    index={0:'Coefficients'}).T

print('Words likely related to soccer\t:\n',log_cv_coef.sort_values(by='Coefficients', ascending = False)[:10],'\n')
print('Words likely related to basketball\t:\n',log_cv_coef.sort_values(by='Coefficients', ascending = True)[:10])

Words likely related to soccer	:
            Coefficients
footbal        1.816624
sign           1.796920
match          1.677970
bayern         1.467118
liverpool      1.397566
legend         1.381313
unit           1.371288
striker        1.304146
arsen          1.300123
goal           1.250824 

Words likely related to basketball	:
            Coefficients
basketbal     -3.286422
nba           -2.774712
dunk          -2.431362
practic       -2.331687
court         -2.020693
school        -2.012413
jump          -1.939595
shoe          -1.939389
lebron        -1.893745
depend        -1.834010


In [19]:
log_pipe_tf = Pipeline([('tfidf',TfidfVectorizer()),('logreg', LogisticRegression())])

tfidf_params = {'tfidf__max_features': [5000, 7500, 10000],
    'tfidf__ngram_range': [(1,1), (1,2)],
    'tfidf__min_df': [2, 3],
    'tfidf__max_df': [0.4, 0.45]}

log_gs_tf = GridSearchCV(log_pipe_tf, tfidf_params, n_jobs = -1, cv = 5, verbose = 1)
log_gs_tf.fit(X_train, y_train)
log_predict_tf = log_gs_tf.predict(X_test)

Fitting 5 folds for each of 24 candidates, totalling 120 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    8.9s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:   26.5s finished


In [20]:
TN, FP, FN, TP = confusion_matrix(y_test,log_predict_tf).ravel()

print('\033[1m'+'Tfidf vectorizer with Logistic Regression')
print('\033[0m'f'Specificity of model\t\t: {TN/(TN + FP)}')
print(f'Sensitivity of model\t\t: {TP/(TP + FN)}')
print(f'Accuracy for the model\t\t: {log_gs_tf.best_score_}\n')
print(classification_report(y_test,log_predict_tf))

pd.DataFrame(confusion_matrix(y_test,log_predict_tf),
             index = ['Actual Basketball', 'Actual Soccer'],
             columns = ['Predicted Basketball','Predicted Soccer'])

Tfidf vectorizer with Logistic Regression
Specificity of model		: 0.8440366972477065
Sensitivity of model		: 0.9427710843373494
Accuracy for the model		: 0.8911013786343158

              precision    recall  f1-score   support

           0       0.94      0.84      0.89       654
           1       0.86      0.94      0.90       664

    accuracy                           0.89      1318
   macro avg       0.90      0.89      0.89      1318
weighted avg       0.90      0.89      0.89      1318



||Predicted Basketball|Predicted Soccer|
|---|---|---|
|**Actual Basketball**|552|102|
|**Actual Soccer**|38|626|


In [21]:
log_tf_coef = pd.DataFrame(log_gs_tf.best_estimator_.named_steps['logreg'].coef_,
             columns = log_gs_tf.best_estimator_.named_steps['tfidf'].get_feature_names()).rename(
    index={0:'Coefficients'}).T

print('Words likely related to soccer\t:\n',log_tf_coef.sort_values(by='Coefficients', ascending = False)[:10],'\n')
print('Words likely related to basketball\t:\n',log_tf_coef.sort_values(by='Coefficients', ascending = True)[:10])

Words likely related to soccer	:
            Coefficients
sign           3.011367
footbal        2.493096
match          2.488996
season         2.219935
goal           2.196048
unit           2.188189
bayern         2.072980
loan           1.903082
arsen          1.762674
liverpool      1.730885 

Words likely related to basketball	:
            Coefficients
basketbal     -6.201912
nba           -4.471698
practic       -3.916437
dunk          -3.473481
help          -3.392226
jump          -3.359188
school        -3.164266
court         -3.088074
work          -3.026016
shoe          -2.767702
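Logistic regression coefficients are on the log-odds scale, so exponentiating them gives a rough sense of their size. The sketch below does this for the two largest coefficients printed above; it is an interpretation aid under the usual "all other features held fixed" assumption, not an additional result of the model.

```python
import math

# Coefficients copied from the output above (Tfidf vectorizer with Logistic Regression)
coef_sign = 3.011367
coef_basketbal = -6.201912

# exp(coefficient) is the factor by which the odds of "soccer" are multiplied
# for a one-unit increase in that word's tf-idf weight
odds_sign = math.exp(coef_sign)
odds_basketbal = math.exp(coef_basketbal)

print(f"sign: odds x {odds_sign:.1f}")
print(f"basketbal: odds x {odds_basketbal:.4f}")
```

Because tf-idf weights are normalised per text, a word's weight is at most 1, so a full one-unit change only occurs when the text consists of that word alone.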


We compared the Count Vectorizer and the Tfidf Vectorizer and tuned their parameters with Grid Search to build our Logistic Regression models. Both models have a higher sensitivity than specificity, which shows that they are better at identifying texts that are actually soccer than texts that are actually basketball. We also found that Logistic Regression with the Tfidf Vectorizer produces a slightly better result than Logistic Regression with the Count Vectorizer. Both Logistic Regression models outperform the benchmark in specificity, sensitivity and accuracy.

## Hyper-Parameters Tuning
- Support Vector Machine

In [22]:
svc_pipe_cvec = Pipeline([('cvec',CountVectorizer()),('ss',StandardScaler()),('svc',SVC())])

cvec_params = {'cvec__max_features': [6500, 7500, 8947],
    'cvec__min_df': [2, 3],
    'cvec__max_df': [0.4, 0.45],
    'cvec__ngram_range': [(1,1), (1,2)],
    'ss__with_mean': [False],
    'svc__C' : [1,10]}

svc_gs_cvec = GridSearchCV(svc_pipe_cvec, cvec_params, n_jobs = -1, cv = 5, verbose = 1)
svc_gs_cvec.fit(X_train, y_train)
svc_predict_cvec = svc_gs_cvec.predict(X_test)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  7.4min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed:  9.3min finished


In [23]:
TN, FP, FN, TP = confusion_matrix(y_test,svc_predict_cvec).ravel()

print('\033[1m'+'Cvec vectorizer with SVC')
print('\033[0m'f'Specificity of model\t\t: {TN/(TN + FP)}')
print(f'Sensitivity of model\t\t: {TP/(TP + FN)}')
print(f'Accuracy for the model\t\t: {svc_gs_cvec.best_score_}\n')
print(classification_report(y_test,svc_predict_cvec))

pd.DataFrame(confusion_matrix(y_test,svc_predict_cvec),
             index = ['Actual Basketball', 'Actual Soccer'],
             columns = ['Predicted Basketball','Predicted Soccer'])

Cvec vectorizer with SVC
Specificity of model		: 0.8042813455657493
Sensitivity of model		: 0.9156626506024096
Accuracy for the model		: 0.8474665683426712

              precision    recall  f1-score   support

           0       0.90      0.80      0.85       654
           1       0.83      0.92      0.87       664

    accuracy                           0.86      1318
   macro avg       0.86      0.86      0.86      1318
weighted avg       0.86      0.86      0.86      1318



||Predicted Basketball|Predicted Soccer|
|---|---|---|
|**Actual Basketball**|526|128|
|**Actual Soccer**|56|608|


In [24]:
svc_pipe_tf = Pipeline([('tfidf',TfidfVectorizer()),('ss',StandardScaler()),('svc',SVC())])

tfidf_params = {'tfidf__max_features': [5000, 7500, 10000],
    'tfidf__ngram_range': [(1,1), (1,2)],
    'tfidf__min_df': [2, 3],
    'tfidf__max_df': [0.4, 0.45],
    'ss__with_mean': [False],
    'svc__C' : [1,10]}

svc_gs_tf = GridSearchCV(svc_pipe_tf, tfidf_params, n_jobs = -1, cv = 5, verbose = 1)
svc_gs_tf.fit(X_train, y_train)
svc_predict_tf = svc_gs_tf.predict(X_test)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  7.8min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed:  9.8min finished


In [25]:
TN, FP, FN, TP = confusion_matrix(y_test,svc_predict_tf).ravel()

print('\033[1m'+'Tfidf vectorizer with SVC')
print('\033[0m'f'Specificity of model\t\t: {TN/(TN + FP)}')
print(f'Sensitivity of model\t\t: {TP/(TP + FN)}')
print(f'Accuracy for the model\t\t: {svc_gs_tf.best_score_}\n')
print(classification_report(y_test,svc_predict_tf))

pd.DataFrame(confusion_matrix(y_test,svc_predict_tf),
             index = ['Actual Basketball', 'Actual Soccer'],
             columns = ['Predicted Basketball','Predicted Soccer'])

Tfidf vectorizer with SVC
Specificity of model		: 0.8119266055045872
Sensitivity of model		: 0.9292168674698795
Accuracy for the model		: 0.8702335494662625

              precision    recall  f1-score   support

           0       0.92      0.81      0.86       654
           1       0.83      0.93      0.88       664

    accuracy                           0.87      1318
   macro avg       0.88      0.87      0.87      1318
weighted avg       0.88      0.87      0.87      1318



||Predicted Basketball|Predicted Soccer|
|---|---|---|
|**Actual Basketball**|531|123|
|**Actual Soccer**|47|617|


We compared the Count Vectorizer and the Tfidf Vectorizer and tuned their parameters with Grid Search to build our SVC models. Both models have a higher sensitivity than specificity, which shows that they are better at identifying texts that are actually soccer than texts that are actually basketball. We also found that the SVC model with the Tfidf Vectorizer produces a better result than the SVC model with the Count Vectorizer. Both SVC models outperform the benchmark in specificity, sensitivity and accuracy.

## Hyper-Parameters Tuning
- K Nearest Neighbors

In [26]:
knn_pipe_cvec = Pipeline([('cvec',CountVectorizer()),('ss',StandardScaler()),('knn',KNeighborsClassifier())])

cvec_params = {'cvec__max_features': [6500, 7500, 8947],
    'cvec__min_df': [2, 3],
    'cvec__max_df': [0.4, 0.45],
    'cvec__ngram_range': [(1,1), (1,2)],
    'ss__with_mean': [False],
    'knn__n_neighbors':[3,5,7,9]}

knn_gs_cvec = GridSearchCV(knn_pipe_cvec, cvec_params, n_jobs = -1, cv = 5, verbose = 1)
knn_gs_cvec.fit(X_train, y_train)
knn_predict_cvec = knn_gs_cvec.predict(X_test)

Fitting 5 folds for each of 96 candidates, totalling 480 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   12.7s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   57.4s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 480 out of 480 | elapsed:  2.4min finished


In [27]:
TN, FP, FN, TP = confusion_matrix(y_test,knn_predict_cvec).ravel()

print('\033[1m'+'Count vectorizer with K Nearest Neighbors')
print('\033[0m'f'Specificity of model\t\t: {TN/(TN + FP)}')
print(f'Sensitivity of model\t\t: {TP/(TP + FN)}')
print(f'Accuracy for the model\t\t: {knn_gs_cvec.best_score_}\n')
print(classification_report(y_test,knn_predict_cvec))

pd.DataFrame(confusion_matrix(y_test,knn_predict_cvec),
             index = ['Actual Basketball', 'Actual Soccer'],
             columns = ['Predicted Basketball','Predicted Soccer'])

Count vectorizer with K Nearest Neighbors
Specificity of model		: 0.6590214067278287
Sensitivity of model		: 0.8298192771084337
Accuracy for the model		: 0.7507157567200554

              precision    recall  f1-score   support

           0       0.79      0.66      0.72       654
           1       0.71      0.83      0.77       664

    accuracy                           0.75      1318
   macro avg       0.75      0.74      0.74      1318
weighted avg       0.75      0.75      0.74      1318



||Predicted Basketball|Predicted Soccer|
|---|---|---|
|**Actual Basketball**|431|223|
|**Actual Soccer**|113|551|


In [28]:
knn_pipe_tf = Pipeline([('tfidf',TfidfVectorizer()),('ss',StandardScaler()),('knn', KNeighborsClassifier())])

tfidf_params = {'tfidf__max_features': [5000, 7500, 10000],
    'tfidf__ngram_range': [(1,1), (1,2)],
    'tfidf__min_df': [2, 3],
    'tfidf__max_df': [0.4, 0.45],
    'ss__with_mean': [False],
    'knn__n_neighbors':[3,5,7,9]}

knn_gs_tf = GridSearchCV(knn_pipe_tf, tfidf_params, n_jobs = -1, cv = 5, verbose = 1)
knn_gs_tf.fit(X_train, y_train)
knn_predict_tf = knn_gs_tf.predict(X_test)

Fitting 5 folds for each of 96 candidates, totalling 480 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   11.4s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   55.8s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 480 out of 480 | elapsed:  2.5min finished


In [29]:
TN, FP, FN, TP = confusion_matrix(y_test,knn_predict_tf).ravel()

print('\033[1m'+'Tfidf vectorizer with K Nearest Neighbors')
print('\033[0m'f'Specificity of model\t\t: {TN/(TN + FP)}')
print(f'Sensitivity of model\t\t: {TP/(TP + FN)}')
print(f'Accuracy for the model\t\t: {knn_gs_tf.best_score_}\n')
print(classification_report(y_test,knn_predict_tf))

pd.DataFrame(confusion_matrix(y_test,knn_predict_tf),
             index = ['Actual Basketball', 'Actual Soccer'],
             columns = ['Predicted Basketball','Predicted Soccer'])

Tfidf vectorizer with K Nearest Neighbors
Specificity of model		: 0.5764525993883792
Sensitivity of model		: 0.7740963855421686
Accuracy for the model		: 0.6496025971923703

              precision    recall  f1-score   support

           0       0.72      0.58      0.64       654
           1       0.65      0.77      0.71       664

    accuracy                           0.68      1318
   macro avg       0.68      0.68      0.67      1318
weighted avg       0.68      0.68      0.67      1318



||Predicted Basketball|Predicted Soccer|
|---|---|---|
|**Actual Basketball**|377|277|
|**Actual Soccer**|150|514|


We compared the Count Vectorizer and the Tfidf Vectorizer and tuned their parameters with Grid Search to build our K Nearest Neighbors models. Both models have a higher sensitivity than specificity, which shows that they are better at identifying texts that are actually soccer than texts that are actually basketball. K Nearest Neighbors with the Count Vectorizer produces a better result than with the Tfidf Vectorizer on all three metrics. Compared to the benchmark, neither tuned model improves sensitivity (0.82982 and 0.77410 against 0.86747). Only the Count Vectorizer model improves specificity (0.65902 against 0.62385), and its cross-validated accuracy (0.75072) is only marginally above the benchmark's (0.74275). The Tfidf Vectorizer model falls below the benchmark on all three metrics.

## Limitations of Model

Our model is fitted with sentences that were scraped from the basketball and soccer subreddits through the Reddit API. As such, the analysis is limited to the corpus of texts obtained.

Any words that are new to the corpus are ignored during vectorization and prediction. In addition, the API caps how much data can be retrieved, so the data we obtained is limited by this cap.
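The effect of out-of-vocabulary words can be sketched with a small example. The fitted vocabulary and the new texts below are invented for illustration; the behaviour shown is that of `CountVectorizer.transform`, which only counts words seen during fitting.

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(["goal match", "dunk court"])   # toy vocabulary: court, dunk, goal, match

# "touchdown" and "quarterback" were never seen during fitting, so they are dropped
partly_known = vec.transform(["goal touchdown"]).toarray()
fully_unknown = vec.transform(["touchdown quarterback"]).toarray()

print(partly_known)    # only the column for "goal" is non-zero
print(fully_unknown)   # every column is zero
```

A text made up entirely of unseen words reaches the classifier as an all-zero row, so the prediction then rests on the class priors rather than on the text.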

Our logistic regression model assumes a linear decision boundary between the two classes. In reality, texts or comments are not exactly linearly separable. Likewise, the Naive Bayes model assumes that features are independent given the class, but the words in a text are often not independent of one another.

Our models are only applicable to basketball and soccer sentences. If a sentence is related to another sport, for example American football, our model might predict that the text is associated with soccer.

We classify between soccer and basketball texts because of the nature of the MLS business. If MLS wants to compare against other sports, we only need to repeat the process: collect the relevant data and retrain the model.

## Conclusion and Recommendations
**Production Model and Implementation**

In [30]:
unseen_predict = mnb_gs_tf.predict(X_unseen) # Predicting unseen data

In [31]:
TN, FP, FN, TP = confusion_matrix(y_unseen,unseen_predict).ravel()

print('\033[1m'+'Production Model and implementation on unseen data')
print('\033[0m'f'Specificity of model\t\t: {TN/(TN + FP)}')
print(f'Sensitivity of model\t\t: {TP/(TP + FN)}')
print(f'Accuracy for the model\t\t: {mnb_gs_tf.score(X_unseen, y_unseen)}\n')
print(classification_report(y_unseen, unseen_predict))

pd.DataFrame(confusion_matrix(y_unseen, unseen_predict),
             index = ['Actual Basketball', 'Actual Soccer'],
             columns = ['Predicted Basketball','Predicted Soccer'])

Production Model and implementation on unseen data
Specificity of model		: 0.9253365973072215
Sensitivity of model		: 0.8736462093862816
Accuracy for the model		: 0.8992718446601942

              precision    recall  f1-score   support

           0       0.88      0.93      0.90       817
           1       0.92      0.87      0.90       831

    accuracy                           0.90      1648
   macro avg       0.90      0.90      0.90      1648
weighted avg       0.90      0.90      0.90      1648



||Predicted Basketball|Predicted Soccer|
|---|---|---|
|**Actual Basketball**|756|61|
|**Actual Soccer**|105|726|


In [32]:
def soccer_or_basketball(sentence):
    """Predict whether a one-element list of text is soccer (1) or basketball (0)."""
    # predict() returns an array, so compare its single element explicitly
    if mnb_gs_tf.predict(sentence)[0] == 1:
        return 'Text associated to Soccer'
    else:
        return 'Text associated to Basketball'

In [33]:
soccer_or_basketball([str(input())]) # Example one : 'Bringing the entertainment closer'

Bringing the entertainment closer


'Text associated to Basketball'

In [34]:
soccer_or_basketball([str(input())]) # Example two : 'Expansion plans coming your way'

Expansion plans coming your way


'Text associated to Soccer'

In [35]:
soccer_or_basketball([str(input())]) # Example three : 'Bringing more stadiums to you'

Bringing more stadiums to you


'Text associated to Soccer'

In [36]:
soccer_or_basketball([str(input())]) # Example four : 'Getting the most from 90 minutes'

Getting the most from 90 minutes


'Text associated to Soccer'

## Conclusion and Recommendations

We selected the Multinomial Naive Bayes model with the Tfidf Vectorizer as our production model because of its high specificity and accuracy. Recall that we want to optimize specificity because we want to filter out texts that are associated with basketball. In other words, with soccer labelled 1 and basketball labelled 0, we want a model that gives as few false positives as possible.

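Applying our model to unseen data yields the results below. The benchmark was evaluated on the test split only, so its figures come from that split. Its accuracy of 0.63281 is the 5-fold cross-validated score on the test split; its direct accuracy on the test split, from the classification report, is 0.75.

||Production Model (unseen data)|Benchmark Model (test split)|
|---|---|---|
|**Specificity**|0.92534|0.62385|
|**Accuracy**|0.89927|0.63281|
|**Sensitivity**|0.87365|0.86747|

Even on unseen data, our production model outperforms the benchmark model in specificity, accuracy and sensitivity, although the gain in sensitivity is small. This shows that our model is able to tell whether a statement is associated with soccer or basketball and can help in evaluating any suggested slogans.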

We included a few sample slogans above to show how our model evaluates them. The first sample is associated with basketball, while the other three are associated with soccer.

By adopting our model, MLS will be able to collect all suggested slogans and run them through the model for evaluation. You will be able to filter out the slogans that are predicted as basketball, which will narrow down your search for a suitable slogan.
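Running a batch of slogans through a fitted model could look like the sketch below. The small pipeline trained on invented texts only stands in for the production model so that the example runs on its own; the helper `keep_soccer_slogans` and the sample slogans are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Stand-in model trained on toy texts (1 = soccer, 0 = basketball); not the production model
toy_texts = ["goal match striker league", "penalty goal stadium match",
             "dunk court nba rebound", "jump shot court dunk"]
toy_labels = [1, 1, 0, 0]
toy_model = Pipeline([('tfidf', TfidfVectorizer()), ('mnb', MultinomialNB())])
toy_model.fit(toy_texts, toy_labels)

def keep_soccer_slogans(model, slogans):
    """Return only the slogans that the model predicts as soccer (label 1)."""
    predictions = model.predict(slogans)
    return [s for s, label in zip(slogans, predictions) if label == 1]

candidates = ["every goal counts", "own the court"]
kept = keep_soccer_slogans(toy_model, candidates)
print(kept)
```

With the production model, the same helper would take the fitted grid search object in place of `toy_model`, provided the slogans are passed in the same form as the text the model was trained on.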

We believe the model we developed can evaluate whether a suggested slogan is associated with basketball or soccer. Beyond slogans, the model is not limited to one-time use: you can apply it to any future sentences and texts in your marketing materials or campaigns.

Though our model may be useful for filtering out basketball texts, we recommend that you do not rely on it alone and also exercise your domain expertise and intuition.

When choosing words for the slogan, we recommend that MLS use the results of the logistic regression model with the Tfidf vectorizer: favour the words most strongly associated with soccer and avoid the words most strongly associated with basketball.

We chose this logistic regression model for keyword selection because it is also highly accurate and has better specificity, accuracy and sensitivity than our benchmark.

||Keywords Model|Benchmark Model|
|---|---|---|
|**Specificity**|0.84404|0.62385|
|**Accuracy**|0.89110|0.63281|
|**Sensitivity**|0.94277|0.86747|

Below are the words we believe are most strongly associated with soccer and basketball respectively. They are the words with the highest and lowest coefficients in our logistic regression model with the Tfidf vectorizer.

In [41]:
print('Words likely related to soccer\t:\n',log_tf_coef.sort_values(by='Coefficients', ascending = False)[:10],'\n')
print('Words likely related to basketball\t:\n',log_tf_coef.sort_values(by='Coefficients', ascending = True)[:10])

Words likely related to soccer	:
            Coefficients
sign           3.011367
footbal        2.493096
match          2.488996
season         2.219935
goal           2.196048
unit           2.188189
bayern         2.072980
loan           1.903082
arsen          1.762674
liverpool      1.730885 

Words likely related to basketball	:
            Coefficients
basketbal     -6.201912
nba           -4.471698
practic       -3.916437
dunk          -3.473481
help          -3.392226
jump          -3.359188
school        -3.164266
court         -3.088074
work          -3.026016
shoe          -2.767702
