# Table of Contents #

- [1. Web Scrapping and Pre-Processing](#1.-Web-Scrapping-and-Pre-Processing)
    - [1.1 Importing Libraries](#1.1-Importing-Libraries)
    - [1.2 Scrapping 'AskMen' Subreddit](#1.2-Scrapping-'AskMen'-Subreddit)
        - [1.2.1 Pre-work](#1.2.1-Pre-work)
        - [1.2.2 Scrapping](#1.2.2-Scrapping)
    - [1.3 Scrapping 'AskWomen' Subreddit](#1.3-Scrapping-'AskWomen'-Subreddit)
        - [1.3.1 Pre-work](#1.3.1-Pre-work)
        - [1.3.2 Scrapping](#1.3.2-Scrapping)
    - [1.4 Merging Scrapping Results into a Data Frame for Future Processing](#1.4-Merging-Scrapping-Results-into-a-Data-Frame-for-Future-Processing)
- [2. Exploratory Data Analysis](#2.-Exploratory-Data-Analysis)
- [3. Baseline Model](#3.-Baseline-Model)
- [4. Modeling](#4.-Modeling)
    - [4.1 Logistic Regression](#4.1-Logistic-Regression)
    - [4.2 Naive Bayes Model](#4.2-Naive-Bayes-Model)
    - [4.3 Decision Trees Classifier](#4.3-Decision-Trees-Classifier)
    - [4.4 Bagging Classifier](#4.4-Bagging-Classifier)
    - [4.5 Random Forest Classifier](#4.5-Random-Forest-Classifier)
    - [4.6 Extra Trees Classifier](#4.6-Extra-Trees-Classifier)
    - [4.7 AdaBoost Classifier](#4.7-AdaBoost-Classifier)
    - [4.8 Gradient Boosting Classifier](#4.8-Gradient-Boosting-Classifier)
    - [4.9 Support Vector Machines](#4.9-Support-Vector-Machines)
- [5. Conclusions](#5.-Conclusions)
- [6. Recommendations](#6.-Recommendations)

# 1. Web Scrapping and Pre-Processing #

## 1.1 Importing Libraries ##

In [1]:
import pandas as pd
import numpy as np
import requests
import time
from sklearn.pipeline                import Pipeline
from sklearn.model_selection         import train_test_split, GridSearchCV
from sklearn.linear_model            import LogisticRegression
from sklearn.ensemble                import BaggingClassifier,RandomForestClassifier,ExtraTreesClassifier,AdaBoostClassifier,GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction      import stop_words
from sklearn.naive_bayes             import GaussianNB, MultinomialNB
from sklearn.tree                    import DecisionTreeClassifier
from sklearn.svm                     import SVC
import warnings

In [2]:
warnings.simplefilter(action='ignore', category=FutureWarning)

## 1.2 Scrapping 'AskMen' Subreddit ##

### 1.2.1 Pre-work

In [3]:
#Setting constants
url_base = 'https://www.reddit.com/r/AskMen.json'
user_agent = {"User-agent": 'ilya-k'}
after = None

In [4]:
#Test run
res = requests.get(url = url_base,
                   headers = user_agent)
res.status_code

200

### 1.2.2 Scrapping ###

In [5]:
#This chunk of code was adopted from Boom D.

#Scrapping the 'AskMen' subreddit

#Instantiating an empty list to store the results
posts=[]

#Looping through the subreddit content
for pull_num in range(20):
    
    #Status info print
    print(f'Pull number {pull_num+1}')
    
    #First pull
    if after == None:
        new_url = url_base
    
    #Other pulls
    else:
        new_url = url_base + '?after=' + after
  
    #Pull result
    res = requests.get(new_url,
                headers = user_agent)

    #Result processing
    if res.status_code == 200:
        json_data = res.json()
        for post in json_data['data']['children']:
            #Adding individual posts to the results list
            posts.extend([post['data']['selftext']])
    #Error protection   
    else:
        print("We've run into an error. The status code is:", res.status_code)
        break
    
    #Resetting the last post's indicator                           
    after = json_data['data']['after']
    
    #Additional protection for API secutity and stability
    time.sleep(2)

Pull number 1
Pull number 2
Pull number 3
Pull number 4
Pull number 5
Pull number 6
Pull number 7
Pull number 8
Pull number 9
Pull number 10
Pull number 11
Pull number 12
Pull number 13
Pull number 14
Pull number 15
Pull number 16
Pull number 17
Pull number 18
Pull number 19
Pull number 20


In [6]:
#Storing the scrapping results into a Data Frame
data_men = pd.DataFrame(posts)

#Setting a source indicator
data_men['source'] = 'AskMen'

#Setting the column names
data_men.columns = ['text','source']

# "Meeting" new DataFrame
data_men.head()

Unnamed: 0,text,source
0,No one cares that you like listening to Taylor...,AskMen
1,"Use this for all your Halloween discussion, su...",AskMen
2,,AskMen
3,,AskMen
4,,AskMen


As we could notice, there're empty posts in the Data Frame. As a part of Pre-processing process let's eradicate empty posts. 

In [7]:
#Keepiing only meaningful posts in our Data Frame
data_men = data_men[data_men['text'] !='']

#Storing the results in a csv file
data_men.to_csv('../data/AskMen.csv')

## 1.3 Scrapping 'AskWomen' Subreddit ##

### 1.3.1 Pre-work ###

In [8]:
#Setting constants
url_base = 'https://www.reddit.com/r/AskWomen.json'
user_agent = {"User-agent": 'ilya-k'}
after = None

In [9]:
#Test run
res = requests.get(url = url_base,
                   headers = user_agent)
res.status_code

200

### 1.3.2 Scrapping ###

In [10]:
#This chunk of code was adopted from Boom D.

#Scrapping the 'AskWomen' subreddit

#Instantiating an empty list to store the results
posts=[]


#Looping through the subreddit content
for pull_num in range(20):
    
    #Status info print
    print(f'Pull number {pull_num+1}')
    
    #First pull
    if after == None:
        new_url = url_base
    
    #Other pulls
    else:
        new_url = url_base + '?after=' + after
  
    #Pull result
    res = requests.get(new_url,
                headers = user_agent)

    #Result processing
    if res.status_code == 200:
        json_data = res.json()
        for post in json_data['data']['children']:
            #Adding individual posts to the results list
            posts.extend([post['data']['selftext']])
    #Error protection   
    else:
        print("We've run into an error. The status code is:", res.status_code)
        break
    
    #Resetting the last post's indicator                           
    after = json_data['data']['after']
    
    #Additional protection for API secutity and stability
    time.sleep(2)

Pull number 1
Pull number 2
Pull number 3
Pull number 4
Pull number 5
Pull number 6
Pull number 7
Pull number 8
Pull number 9
Pull number 10
Pull number 11
Pull number 12
Pull number 13
Pull number 14
Pull number 15
Pull number 16
Pull number 17
Pull number 18
Pull number 19
Pull number 20


In [11]:
#Storing the scrapping results into a Data Frame
data_women = pd.DataFrame(posts)

#Setting a source indicator
data_women['source'] = 'AskWomen'

#Setting the column names
data_women.columns = ['text','source']

# "Meeting" new DataFrame
data_women.head()

Unnamed: 0,text,source
0,For those of you who love (or hate!) this holi...,AskWomen
1,"\nEvery Friday, just say whatever is in your m...",AskWomen
2,Borrowed from the Ask Men sub but I loved all ...,AskWomen
3,,AskWomen
4,I'm mainly talking emotional support,AskWomen


In [12]:
#Keepiing only meaningful posts in our Data Frame
data_women = data_women[data_women['text'] !='']

#Storing the results in a csv file
data_women.to_csv('../data/AskWomen.csv')

## 1.4 Merging Scrapping Results into a Data Frame for Future Processing ##

In [13]:
#Joining our two Data Frames for the two subreddits into one Data Frame for future processing
data_joined = pd.concat([data_men,data_women])

In [14]:
#"Meeting" the new joined Data Frame
data_joined

Unnamed: 0,text,source
0,No one cares that you like listening to Taylor...,AskMen
1,"Use this for all your Halloween discussion, su...",AskMen
8,I don’t have friends in my new city. My family...,AskMen
9,I am curious for comparison because I have had...,AskMen
12,Saw that recent post about what your partner d...,AskMen
...,...,...
492,There is so much more to relationships than ph...,AskWomen
493,Edit: As in a personal story or a memory.,AskWomen
494,And have you ever caught yourself objectifying...,AskWomen
496,If you had a magical notebook in which anythi...,AskWomen


In [15]:
#Saving our new joined Data Frame to csv file
data_joined.to_csv('../data/data_raw.csv')

# 2. Exploratory Data Analysis #

In [16]:
#Importing data
data = pd.read_csv('../data/data_raw.csv')

#"Meeting" our new Data Frame
data.head()

Unnamed: 0.1,Unnamed: 0,text,source
0,0,No one cares that you like listening to Taylor...,AskMen
1,1,"Use this for all your Halloween discussion, su...",AskMen
2,8,I don’t have friends in my new city. My family...,AskMen
3,9,I am curious for comparison because I have had...,AskMen
4,12,Saw that recent post about what your partner d...,AskMen


In [17]:
#Dropping a column
data.drop(columns = ['Unnamed: 0'], axis = 1, inplace=True)

In [18]:
#Checking for missing values
data.isnull().sum()

text      0
source    0
dtype: int64

In [19]:
#Exploring the shape of our data
data.shape

(383, 2)

In [20]:
#Exploring data types
data.dtypes

text      object
source    object
dtype: object

In [21]:
#Identifying our classes
data['source'].value_counts()

AskMen      253
AskWomen    130
Name: source, dtype: int64

As we can see, we will need to introduce stratification into our future training and testing sets splits, as we might encounter unbalanced classes, especially with testing sets. 

# 3. Baseline Model #

Since we are going to work on classification models, we will need to set our Baseline Model on our prevailing class - it will be 'AskMen' with the 253 values against 'AskWomen' class with 130 values.

In [22]:
data['source'].value_counts(normalize=True)

AskMen      0.660574
AskWomen    0.339426
Name: source, dtype: float64

**Hence, our Baselline Model's accuracy score is 0.6606 - this is the accuracy score we are going to evaluate all our future models against.**

# 4. Modeling #

## 4.1 Logistic Regression ##

In order to look for optimal parameters for our Logistic Regression model, we will use a pipeline with an optimizer (count vectorizer and TD-IDF vectorizer) and our logistic regression as a model and perform a grid search. 

In [23]:
#Features matrix and target vector
X = data['text']
y = data['source']

#Training and testing sets split with stratification 
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=81)

In [24]:
#Instantiating two vectorizers
cvec = CountVectorizer()
tfidf = TfidfVectorizer()

#Instantiating a pipeline for CountVectorizer
pipe_cvec = Pipeline(steps = [('vectorizer', CountVectorizer()), 
                        ('model', LogisticRegression())])

#Setting grid search parameters
hyperparams_cvec = {'vectorizer__max_features':[1000,2500,5000],
                    'vectorizer__ngram_range':[(1,1),(2,2),(1,2), (1,3),(2,3),(3,3)],
                    'vectorizer__stop_words':[None, 'english']
                   }

#Instantiating a pipeline for TF-IDF Vectorizer
pipe_tfidf = Pipeline(steps = [('vectorizer', TfidfVectorizer()), 
                        ('model', LogisticRegression())])

#Setting grid search parameters
hyperparams_tfidf = {'vectorizer__max_features':[1000,2500,5000],
                    'vectorizer__ngram_range':[(1,1),(2,2),(1,2), (1,3),(2,3),(3,3)],
                    'vectorizer__stop_words':[None, 'english'],
                   }

In [26]:
#Instantiating grid search with 3-fold cross-validation
gs_cvec = GridSearchCV(pipe_cvec,
                      hyperparams_cvec,
                      cv=3)
gs_tfidf = GridSearchCV(pipe_tfidf,
                       hyperparams_tfidf,
                       cv=3)

#Fitting grid search
results_cvec = gs_cvec.fit(X_train,y_train)
results_tfidf = gs_tfidf.fit(X_train,y_train)



In [27]:
#Best accuracy score for a CountVectorizer pipeline
results_cvec.best_score_

0.7456445993031359

In [28]:
#Best accuracy on the training set for a CountVectorizer pipeline
results_cvec.score(X_train,y_train)

0.9965156794425087

In [29]:
#Best accuracy score for a TD-IDF pipeline
results_tfidf.best_score_ 

0.7038327526132404

In [30]:
#Best accuracy on the training set for a TF-IDF pipeline
results_tfidf.score(X_train,y_train)

0.8780487804878049

As we could see, the resulting accuracy for CountVectorizer is better than for TD-IDF Vectorizer. Hence, our best Logistic Regression modelf uses the following parameters:

In [31]:
#Best estimator for CountVectorizer
results_cvec.best_params_

{'vectorizer__max_features': 2500,
 'vectorizer__ngram_range': (1, 1),
 'vectorizer__stop_words': None}

 Our best Logistic Regression model has an accuracy score of 0.7456 (in comparison with the Baseline Model's accuracy of 0.6606. All Logistic regression models are quite overfitted on training data.

## 4.2 Naive Bayes Model ##

The choice of Naive Bayes model depends on our feature matrix, which is largely dependent on our choice of vectorizer. 
With Count Vectorizer our features matrix contains only positive integer values - hence the Naive Bayes Multinomial model will be the most appropriate.  

In [76]:
#Instantiating model
mnb = MultinomialNB()

#Instantiating Count Vectorizer
cvec=CountVectorizer()

#Vectorizing Transforming our features matriz
X_mnb = cvec.fit_transform(X)

#Trasforming our sparse matrix to array
X_mnb = X_mnb.toarray()

#Train-test split with stratification for this particular method
X_train_mnb, X_test_mnb, y_train_mnb, y_test_mnb = train_test_split(X_mnb, y, stratify=y, random_state=81)

#Fitting the model
result_mnb = mnb.fit(X_train_mnb,y_train_mnb)

In [77]:
#Getting model's accuracy on trainig set
result_mnb.score(X_train_mnb, y_train_mnb)

0.9024390243902439

In [78]:
#Getting model's accuracy on testingn set
result_mnb.score(X_test_mnb, y_test_mnb)

0.75

As we could notice, our Multinomial Naive Bayes model is very overfitted on training data, as it shows accuracy of 0.9024 on a training set, but just 0.75 on a testing set - which is for now our new highest score. It surpasses teh score for the optimized Logistic Regression just by 0.0044 which is kind of negligible, as both models are very overfit.

With TF-IDF Vectorizer our features matrix contains only positive non-integer values - hence the Naive Bayes Gaussian model will be the most appropriate.  

In [35]:
#Instantiating model
gnb = GaussianNB()

#Instantiating Count Vectorizer
tfidf = TfidfVectorizer()

#Vectorizing and transforming our features matriz
X_gnb = tfidf.fit_transform(X)

#Trasforming our sparse matrix to array
X_gnb = X_gnb.toarray()

#Train-test split with stratification for this particlular model
X_train_gau, X_test_gau, y_train_gau, y_test_gau = train_test_split(X_gnb, y, stratify=y, random_state=81)

#Fitting the model
result_gnb = gnb.fit(X_train_gau,y_train_gau)

In [36]:
#Getting model's accuracy on trainig set
result_gnb.score(X_train_gau, y_train_gau)

0.9860627177700348

In [37]:
#Getting model's accuracy on testingn set
result_gnb.score(X_test_gau, y_test_gau)

0.7083333333333334

As we could notice, our Gaussian Naive Bayes model is as well very overfitted on training data, as it shows accuracy of 0.9861 on a training set, but just 0.7083 on a testing set. 
But the accuracy on the testing score has not improved in comparison with the situation when we used Count Vectorizer and Multinomial Naive Bayes model.

## 4.3 Decision Trees Classifier ##

Let's try how a Decision Tree Classifier works on text data.

In [39]:
#Instantiating the model
tree = DecisionTreeClassifier()

#Instantiating Count Vectorizer
cvec = CountVectorizer()

#Vectorizing and transforming our features matrix
X_train_cvec=cvec.fit_transform(X_train)
X_test_cvec = cvec.transform(X_test)

#Fitting the model
result_tree_cvec = tree.fit(X_train_cvec, y_train)

In [40]:
#Accuracy score training set
result_tree_cvec.score(X_train_cvec, y_train)

1.0

In [41]:
#Accuracy score testing set
result_tree_cvec.score(X_test_cvec, y_test)

0.65625

As we could see, our Decision Tree model with Count Vectorizer is severely overfitted on training data again, with accuracy score for training set of 1.0 and for testing set - just 0.6525. Let's compare the results with the situation when we use TF-IDF Vectorizer.

In [42]:
#Instantiating TF-IDF Vectorizer
tfidf = TfidfVectorizer()

#Vectorizing and transforming our features matrix
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

#Fitting the model
result_tree_tfidf = tree.fit(X_train_tfidf, y_train)

In [43]:
#Accuracy score training set
result_tree_tfidf.score(X_train_tfidf, y_train)

1.0

In [44]:
#Accuracy score testing set
result_tree_tfidf.score(X_test_tfidf, y_test)

0.625

Accuracy scores for our Decision Tree Calssifier model with TF-IDF Vectorizer show a similar picture: the model is highly overfit on training data with accuracy for the training set of 1.0 and just 0.625 for the testing set. 

Again, the same type of model shows better accuracy on testing set when we use Count Vectorizer.

## 4.4 Bagging Classifier ##

Let's see if using of a Bagging Classifier brings us to better scores. While preparing this model I dod a grid search on some model parameters, and it turned out that a model with using of Count Vectorizer with maximum features  of 2500,  n_gram of 1 token and not using any stop words, along with 111 estimators (trees) was the best. I am commenting the code for these results as running it takes a lot of time. I'm instantiating a model with these parameters in this notebook instead. 

In [45]:
#THIS IS FOR REFERENCE ONLY

# pipe_bag = Pipeline(steps = [('vectorizer', CountVectorizer()), 
#                         ('model', BaggingClassifier())])

# hyperparams_bag = {'vectorizer__max_features':[1000,2500,5000],
#                     'vectorizer__ngram_range':[(1,1),(2,2),(1,2), (1,3),(2,3),(3,3)],
#                     'vectorizer__stop_words':[None, 'english'],
#                      'model__n_estimators':[11,33,55,77,99,111]
#                    }

# gs_bag = GridSearchCV(pipe_bag,
#                       hyperparams_bag,
#                       cv=3)
# results_bag = gs_bag.fit(X_train,y_train)
#  results_bag.best_estimator_

In [46]:
#Instantiating vectorizer
cvec = CountVectorizer (max_features = 2500,
                        ngram_range = (1,1),
                        stop_words = None)

#Instantiating Bagging Classifier
bag = BaggingClassifier(n_estimators = 111)

#Transforming our features matrix for this particular method (optimized Count Vectorizer)
X_bag_train_cvec = cvec.fit_transform(X_train)
X_bag_test_cvec = cvec.transform(X_test)

#Fitting the model
results_bag = bag.fit(X_bag_train_cvec, y_train)

In [47]:
#Accuracy score training set
results_bag.score(X_bag_train_cvec, y_train)

1.0

In [48]:
#Accuracy score testing set
results_bag.score(X_bag_test_cvec, y_test)

0.75

As we can see, with the Bagging classifier we get the results of accuracy for the training set of 1.0, and just 0.75 for the testing set. It's quite an improvement and so far exactly matches the best testing set accuracy for Multinomial NB model, but still tells that the model is very overfitted on training data. 

## 4.5 Random Forest Classifier ##

Having witnessed quite some growth with the testing set accuracy score, it's quite natural to want to try a ode advanced classifier on our data, which is a Random Forest Classifier.

In [49]:
#Instantiating the model
rf = RandomForestClassifier()

#Fitting the model
results_rf_cvec = rf.fit(X_train_cvec, y_train)

In [50]:
#Accuracy score training set
results_rf_cvec.score(X_train_cvec, y_train)

0.9895470383275261

In [51]:
#Accuracy score testing set
results_rf_cvec.score(X_test_cvec, y_test)

0.7395833333333334

As we can see, with the Random Forest classifier we get the results of accuracy for the training set of 0.9835, and just 0.7129 for the testing set. They are not the best results comparinng to our Bagging classifier model, but still tells that the model is very overfitted on training data.
As another attempt to improve model's performance, let's do a gridsearch on model's parameter n_estimators (number of trees).

In [52]:
#Setting a dictionary of parameters for a grid search
rf_params = { 'n_estimators': np.arange(2,60,2) }

#Instantiating a grid search with the parameters chosen above and a 5-fold cross-validation
gs = GridSearchCV(rf, param_grid=rf_params, cv=5)

#Fitting our grid search 
gs.fit(X_train_cvec, y_train)

#Getting our best parameter
gs.best_params_



{'n_estimators': 54}

In [53]:
#Getting our best score for our best parameter
gs.best_score_

0.7700348432055749

In [54]:
#Getting our best training score
gs.score(X_train_cvec,y_train)

1.0

Grid-searched by number of trees Random Forest classifier with 54 trees gives us the best accuracy score of 0.7700 on a testing set. This is so far the best result. But the model is quite overfitted on training data.

## 4.6 Extra Trees Classifier ##

In [79]:
#Instantiating an Extra Trees Clasifier with n_estimators = 54
et = ExtraTreesClassifier(n_estimators=54)

#Fitting the model
results_et_cvec = rf.fit(X_train_cvec, y_train)

In [80]:
#Accuracy score training set
results_et_cvec.score(X_train_cvec, y_train)

0.9790940766550522

In [81]:
#Accuracy score testing set
results_et_cvec.score(X_test_cvec, y_test)

0.75

So far, an Extra Trees Classifier hasn't shown any improvement in comparison with the previously set highest score on a testing set of 0.7700 by the Random Forest model.

## 4.7 AdaBoost Classifier ##

Having witnessed another improvement with the Random Forest Classifier with number of trees(n_estimators) optimized at 70 trees, it's quite natural to try an AdaBoost Classification model on this data.

In [82]:
#Instantiating an AdaBoost Clasifier with Decision Tree Classifier as estimator and n_estimators=60
ada = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(), 
                         n_estimators = 70)

#Fitting the model
results_ada_cvec = rf.fit(X_train_cvec, y_train)

In [83]:
#Accuracy score training set
results_ada_cvec.score(X_train_cvec, y_train)

0.9825783972125436

In [84]:
#Accuracy score testing set
results_ada_cvec.score(X_test_cvec, y_test)

0.71875

AdaBoost Classifier hasn't worked better on our data, as the result on testing data is slightly lower than for the Extra Tree model, and this model is also highly overfitted. 

## 4.8 Gradient Boosting Classifier ##

In [61]:
#Instantiating an Gradient Boosting Clasifier 
gb = GradientBoostingClassifier()

#Fitting the model
results_gb_cvec = rf.fit(X_train_cvec, y_train)

In [62]:
#Accuracy score training set
results_gb_cvec.score(X_train_cvec, y_train)

0.9790940766550522

In [63]:
#Accuracy score testing set
results_gb_cvec.score(X_test_cvec, y_test)

0.65625

Gradient Boosting Classifier hasn't worked better on our data as well, as the result on testing data is lower than for the Random Forest model, and this model is also highly overfitted.

## 4.9 Support Vector Machines ##

In [64]:
#Instantiating our model with default parameters
svc = SVC()

#Fitting the model
svc.fit(X_train_cvec,y_train)


SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [65]:
#Accuracy score training set
svc.score(X_train_cvec, y_train)

0.662020905923345

In [66]:
#Accuracy score testing set
svc.score(X_test_cvec, y_test)

0.65625

As we could see, our model's performance is far from ideal. It even performs slightly below our Baseline Model (accuracy of 0.6792) with testing set accuracy slightly out-performing the training set, which indicates some underfitting. 
Theoretically, Support Vector Classifier should work quite well. Additionally, underfitting might be a sign more regularization is required. I would like to do a grid search on model's regularization parameter C.

In [69]:
#Setting a dictionary of parameters for a grid search
svc_params = { 'C': np.linspace(0.01,10,100),
             'kernel':['rbf','poly' ],
             'gamma': ['scale','auto_deprecated' ]}

#Instantiating a grid search with the parameters chosen above and a 5-fold cross-validation
gs = GridSearchCV(svc, param_grid=svc_params, cv=5)

#Fitting our grid search 
gs.fit(X_train_cvec, y_train);




In [70]:
#Getting our best parameter
gs.best_params_

{'C': 3.34, 'gamma': 'scale', 'kernel': 'rbf'}

In [71]:
#Getting best accuracy score
gs.best_score_

0.7770034843205574

In [72]:
#Best score on a training set
gs.score(X_train_cvec, y_train)

0.9512195121951219

So far, the Supporting Vectors Machine model works best on our data. We got 0.7770 score on testing set, which is the highest among all the models. But still, this model suffers from overfitting quite a lot as the accuracy score on training set is 0.9512. All attempts to enhance/extend hyperparameters grid search in order to improve model's perforamnce didn't really get any worthwhile change of model's accuracy on testing set. 

# 5. Conclusions #

My best model has turned out to be the Support Vectors Machine, and this is a expected result. This model has shown the highest accuracy on a testing set, and more than 3 out of 4 internet posts could be correctly identified as belonging to one of the two threads.
Quite an unexpected part of this result is that the model's performance is quite just a bit better (by just 3.14%) than the model's of the first choice - Linear Regression.

# 6. Recommendations #

Considering the current project I'd suggest the following:
- Working with bigger dataset, as current entire datasets contained just 383 posts
- Experiment with more optimization options - grid searching through more hyperparameters while paying attention to extending/narrowing each parameter's range and iteration steps through the corresponding ranges. _This inflicts quite some computational capacity problems_

Generally speaking, if we wanted to significantly improve our model's accuracy classifying pieces of human-produced text we should definitely consider using more advanced algorithms.