# Problem Statement 

Scrape posts from two subreddits and develop a Natural Language Processing (NLP) model that accurately classifies which subreddit each post came from.


# Executive Summary

I scraped posts from the "World News" and "Today I Learned" (TIL) subreddits.

Total posts scraped were:
- 999 posts from 'TIL' forum
- 693 posts from 'World News' forum 

**Dataframe: all_df**

|Name|Type|Description|
|---|---|---|
|**subreddit**|*int64*|Binary label: 0 for TIL posts, 1 for World News posts|
|**title**|*str*|Original, uncleaned individual posts from the subreddits|
|**clean_posts**|*str*|Cleaned individual posts from the subreddits|
|**clean_posts2**|*str*|Cleaned individual posts from the subreddits with the term 'til' removed|


# Model Fitting and Results

I first fitted baseline models using a few variations of the Naive Bayes classifier, then compared their scores against alternative models: Logistic Regression and a Decision Tree Classifier.

**Baseline Model Scoring**

**Round 1 (Raw model + dataset)**

For the Naive Bayes model, I initially fitted the model on the original training set (1269 posts, 5584 features) and scored it on both the training set and the test set (423 posts, 5584 features).
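The 1269 / 423 sizes are consistent with a 75/25 train/test split of the 1692 scraped posts. A minimal sketch of such a split, assuming scikit-learn's `train_test_split`; the placeholder posts, the `random_state` and the use of `stratify` are illustrative assumptions, not confirmed settings from the project:

```python
from sklearn.model_selection import train_test_split

# Placeholder data standing in for all_df: 999 TIL posts (label 0)
# and 693 World News posts (label 1).
posts = [f"til post {i}" for i in range(999)] + [f"news post {i}" for i in range(693)]
labels = [0] * 999 + [1] * 693

# test_size=0.25 gives the 1269 / 423 split sizes; stratify keeps the class
# proportions the same in both sets (an assumption, not confirmed).
X_train, X_test, y_train, y_test = train_test_split(
    posts, labels, test_size=0.25, stratify=labels, random_state=42
)

print(len(X_train), len(X_test))
```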


The data was vectorised using both CountVectorizer and TfidfVectorizer, then fitted with Naive Bayes classifiers in the following combinations:


- Cvec train/test score (Bernoulli): 1.0 / 0.995
- Cvec train/test score (Multinomial): 0.999 / 0.957
- Tvec train/test score (Gaussian): 1.0 / 0.898
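The three combinations above could be fitted and scored roughly as follows. This is a sketch under assumptions: the toy posts and the `fit_and_score` helper are hypothetical, all estimators use scikit-learn defaults, and the toy corpus will not reproduce the scores listed above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Toy corpus standing in for the real posts (0 = TIL, 1 = World News).
train_posts = [
    "til honey never spoils",
    "til octopuses have three hearts",
    "til bananas are berries",
    "government announces new trade sanctions",
    "president signs climate agreement",
    "election results announced by government",
]
train_labels = [0, 0, 0, 1, 1, 1]
test_posts = ["til honey was found in ancient tombs", "president meets ministers over sanctions"]
test_labels = [0, 1]


def fit_and_score(vectorizer, model, dense=False):
    """Vectorise the posts, fit the model, return (train, test) accuracy."""
    X_train = vectorizer.fit_transform(train_posts)
    X_test = vectorizer.transform(test_posts)
    if dense:
        # GaussianNB does not accept sparse matrices, so convert to dense arrays.
        X_train, X_test = X_train.toarray(), X_test.toarray()
    model.fit(X_train, train_labels)
    return model.score(X_train, train_labels), model.score(X_test, test_labels)


scores = {
    "Cvec + Bernoulli": fit_and_score(CountVectorizer(), BernoulliNB()),
    "Cvec + Multinomial": fit_and_score(CountVectorizer(), MultinomialNB()),
    "Tvec + Gaussian": fit_and_score(TfidfVectorizer(), GaussianNB(), dense=True),
}

for name, (train_score, test_score) in scores.items():
    print(f"{name}: {train_score:.3f} / {test_score:.3f}")
```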

**Round 2 (Raw model + removed 'til' term)**

A closer look at the word frequencies showed that 'til' appeared very frequently in the TIL subreddit posts, which likely let the models separate the two classes too easily. I therefore ran a second fitting after removing the term 'til'.
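One way to drop the term is a word-boundary regular expression, so that only the standalone token 'til' is removed and words such as "until" are left intact. The `remove_til` helper below is a hypothetical illustration, not the cleaning code used to build `clean_posts2`:

```python
import re

# Match 'til' only as a whole word, in any letter case.
TIL_PATTERN = re.compile(r"\btil\b", flags=re.IGNORECASE)


def remove_til(text):
    """Remove the standalone token 'til' and collapse leftover whitespace."""
    without_til = TIL_PATTERN.sub(" ", text)
    return " ".join(without_til.split())


print(remove_til("TIL that honey never spoils"))
print(remove_til("wait until the end"))
```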

Scores without the 'til' term:
- Cvec train/test score (Bernoulli): 0.988 / 0.926
- Cvec train/test score (Multinomial): 0.993 / 0.919
- Tvec train/test score (Gaussian): 1.0 / 0.891

The model scores still seem slightly overfitted for CVEC + Bernoulli but in slightly better shape for the CVEC Multinomial combination.

**Round 3 (Hyper-parameter tuning + removed 'til' term)** 


I adapted a custom helper class that takes multiple models with their respective parameter grids and passes them through GridSearchCV.

The main hyperparameters I tuned were:
- max features
- minimum document frequency
- maximum document frequency 
- ngram range 

The model combinations (similar to those above) were:
- CVEC + Bernoulli / Multinomial Bayes
- TVEC + Bernoulli / Multinomial Bayes

I ran them through the helper to optimise for different metrics:
- Accuracy: TVEC + Multinomial 0.897 (max df 0.9, max features 750, min df 3, ngram range (1,1))
- Balanced accuracy: CVEC + Bernoulli 0.889 (max df 0.9, max features 750, min df 3, ngram range (1,1))
- F1 score: TVEC + Multinomial 0.870 (max df 0.9, max features 750, min df 3, ngram range (1,1))
- Specificity: TVEC + Multinomial 0.908 (max df 0.95, max features 750, min df 3, ngram range (1,1))
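A grid search over these vectoriser hyperparameters, scored on several metrics at once, might look like the sketch below. It does not reproduce the custom helper class: it tunes a single TVEC + Multinomial pipeline on a toy corpus, and the grid values are placeholders sized for that corpus rather than the ranges actually searched. Specificity is not a built-in scorer, so it is assumed here to be recall on the negative class (label 0).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the real posts (0 = TIL, 1 = World News).
posts = [
    "honey never spoils in sealed jars",
    "octopuses have three hearts",
    "honey was found in ancient tombs",
    "bananas are berries but strawberries are not",
    "octopuses can taste with their arms",
    "ancient romans used honey as medicine",
    "government announces new trade sanctions",
    "president signs trade agreement",
    "government talks on climate agreement continue",
    "election results announced by government",
    "president meets ministers over sanctions",
    "climate summit ends with new agreement",
]
labels = [0] * 6 + [1] * 6

pipe = Pipeline([("vec", TfidfVectorizer()), ("nb", MultinomialNB())])

# Placeholder grid; a real corpus would use larger values (e.g. max_features=750).
param_grid = {
    "vec__max_features": [5, 20],
    "vec__min_df": [1, 2],
    "vec__max_df": [0.9, 1.0],
    "vec__ngram_range": [(1, 1), (1, 2)],
}

scoring = {
    "accuracy": "accuracy",
    "balanced_accuracy": "balanced_accuracy",
    "f1": "f1",
    # Specificity = recall of the negative class (label 0).
    "specificity": make_scorer(recall_score, pos_label=0),
}

# refit picks which metric decides best_params_; here it is accuracy.
search = GridSearchCV(pipe, param_grid, scoring=scoring, refit="accuracy", cv=3)
search.fit(posts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Changing `refit` to another key of `scoring` selects the best parameters for that metric instead, which is one way to obtain a separate winner per metric as in the list above.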

Overall, the plain models still seemed to work better than the tuned ones!


**Fitting against other models**

**CountVect + Log Reg**

- without tuning (train score: 1, test score: 0.919)
- with tuning (train score: 0.916, test score: 0.888)
- Sensitivity: 0.8382
- Specificity: 0.948
- Precision: 0.9177
- Accuracy: 0.9031
- F1 Score: 0.8761

**TFIDFVect + Log Reg** 

- without tuning (train score: 0.961, test score: 0.865)
- with tuning (train score: 0.899, test score: 0.867)
- Sensitivity: 0.815
- Specificity: 0.976
- Precision: 0.9592
- Accuracy: 0.9102
- F1 Score: 0.8812
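The sensitivity, specificity, precision, accuracy and F1 figures reported for the two logistic regression models all derive from the four cells of a confusion matrix. A small worked sketch, assuming World News (label 1) is the positive class; the counts are made up for illustration and are not the project's results:

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Illustrative counts only.
metrics = classification_metrics(tp=40, fp=5, tn=45, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```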

**Decision Tree**
- Grid search best train score: 0.849
- Grid search best test score: 0.804


**Verdict** 

- The untuned Naive Bayes model (CountVect + Multinomial) tied with the untuned CountVect + Log Reg model for the highest accuracy score
- TFIDF + Log Reg had the highest F1 score at 0.88

It was curious that the untuned models consistently performed better than their tuned counterparts. Perhaps this was a result of insufficient tuning, and I should have explored a wider range of parameters.

I am also unsure why, among the tuned models, the optimal hyperparameters selected were unigram-only, since intuitively bigrams should have provided more context for classification.

In terms of usage, I am unsure whether the model can serve as a good classifier of future Reddit posts, as the vocabulary used in both TIL and World News probably changes constantly. It would probably have to be retrained on a larger dataset over time.


