<a id='top'></a>
# Reddit API and Classification

**Previous:** [Preprocessing and Modeling](./03_preprocessing_and_modeling.ipynb)

## Conclusion and recommendation
---

### Imports

In [2]:
import pandas as pd

### Data imports

In [None]:
scores_df = pd.read_csv('../datasets/scores.csv')
prod_score_df = pd.read_csv('../datasets/prod_score.csv')

### Baseline model metrics

In [10]:
#baseline model
scores_df.iloc[[0]]

Unnamed: 0,model,vectorizer,valid_train,valid_test,mean_cv,tn,fp,fn,tp,bal_accuracy,misclassification,specificity,sensitivity,precision,roc_auc,params
0,nb,cvec,0.9862,0.9331,0.9373,130,14,7,163,0.9308,0.0669,0.9028,0.9588,0.9209,0.9764,default


### Production model metrics

In [11]:
#production model scored with validation data
scores_df.iloc[[7]]

Unnamed: 0,model,vectorizer,valid_train,valid_test,mean_cv,tn,fp,fn,tp,bal_accuracy,misclassification,specificity,sensitivity,precision,roc_auc,params
7,nb,cvec,0.9851,0.9363,0.9426,130,14,6,164,0.9337,0.0637,0.9028,0.9647,0.9213,0.9771,{'nb__alpha': 1.303512244681509}


In [12]:
#production model scored with unseen data
prod_score_df

Unnamed: 0,model,vectorizer,tn,fp,fn,tp,mean_cv,accuracy,bal_accuracy,misclassification,specificity,sensitivity,precision,roc_auc,params
0,nb,cvec,132,14,10,158,0.942621,0.923567,0.9223,0.0764,0.9041,0.9405,0.9186,0.98,1.303512


### Our production model
- Multinomial Naive Bayes with CountVectorizer, alpha: 1.3035

After scoring the production model on unseen data, we conclude the following:
- The model correctly classifies 92% of all posts (accuracy).
- Among posts that the model predicted to be in r/apple, 92% actually belong there (precision).
- Among posts that are actually in r/apple, the model correctly classifies 94% (sensitivity).
- Among posts that are actually in r/Android, the model correctly classifies 90% (specificity).
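
The percentages above can be re-derived from the confusion-matrix counts in the `prod_score_df` row. The following is a minimal sketch of that arithmetic, assuming r/apple is the positive class and r/Android the negative class, as the interpretation of the metrics implies; the variable names are chosen here for illustration.

```python
# Confusion-matrix counts copied from the prod_score_df row above.
# Assumption: positive class = r/apple, negative class = r/Android.
tn, fp, fn, tp = 132, 14, 10, 158

total = tn + fp + fn + tp

accuracy = (tp + tn) / total            # share of all posts classified correctly
precision = tp / (tp + fp)              # predicted r/apple that really are r/apple
sensitivity = tp / (tp + fn)            # actual r/apple posts that were recovered
specificity = tn / (tn + fp)            # actual r/Android posts that were recovered
balanced_accuracy = (sensitivity + specificity) / 2
misclassification = 1 - accuracy

metrics = {
    "accuracy": accuracy,
    "precision": precision,
    "sensitivity": sensitivity,
    "specificity": specificity,
    "bal_accuracy": balanced_accuracy,
    "misclassification": misclassification,
}

for name, value in metrics.items():
    print(f"{name:<18}{value:.4f}")
```

Rounded to four decimal places, these values agree with the `accuracy`, `precision`, `sensitivity`, `specificity`, `bal_accuracy` and `misclassification` columns reported for the unseen data.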


### Conclusion

With an overall accuracy of more than 90% in classifying whether a post belongs to the Apple subreddit, moderators can now utilise the model as a feature to check whether a post is relevant to the Apple community.

The model can be embedded in the subreddit's post submission page, where a pop-up dialog is raised when the model detects wording that suggests the post is not relevant to the Apple subreddit. The dialog pauses the submission and prompts the user to confirm that they are posting in the correct subreddit, which acts as a barrier against irrelevant "spam" and reduces the number of posts with unrelated content.

It can also act as an automation tool: the model flags posts that contain highly unusual words and holds them back from publication until a moderator approves them.
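
The two uses described above (a confirmation dialog for the user and a hold for moderator approval) could be driven by the model's predicted probability that a post belongs in r/apple. The sketch below is one hypothetical way to wire this up: the function name `route_post` and both threshold values are assumptions made for illustration, not settings taken from the earlier notebooks, and in practice the probability would come from the fitted pipeline's `predict_proba` output.

```python
def route_post(p_apple, warn_below=0.5, hold_below=0.2):
    """Decide what to do with a new post given the model's P(r/apple).

    Both thresholds are hypothetical and would need tuning on validation data.
    """
    if not 0.0 <= p_apple <= 1.0:
        raise ValueError("p_apple must be a probability between 0 and 1")
    if p_apple < hold_below:
        # Very unlikely to be relevant: hold until a moderator approves it.
        return "hold_for_moderator"
    if p_apple < warn_below:
        # Borderline: ask the user to confirm the subreddit before posting.
        return "confirm_with_user"
    return "publish"


# Example probabilities, made up for illustration.
for p in (0.95, 0.35, 0.05):
    print(p, route_post(p))
```

Keeping the two thresholds separate lets moderators trade off user friction (more confirmation dialogs) against moderation workload (more held posts) independently.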



With the model in place, moderators would have an easier job "cleaning" irrelevant posts, as most of them would already have been identified by the model, instead of having to spend time and resources scanning through every post in a subreddit where new posts arrive by the minute. This also reduces the chance of misclassification, since human moderators might occasionally miss a post during screening.

With this restriction in place, redditors in the Apple community can truly enjoy new content without having to go through irrelevant posts, which might be advertising or even harassment from redditors in "competitor communities".

### Recommendations to improve the model

**1. Explore other features**

>In addition to the analyses done in the previous notebooks, we can explore other text sources, such as post text and comment text, which might provide us with more features for modelling.

**2. Sentiment/Intent analysis**

>By analysing the underlying sentiment of the text, as well as the user's intention behind it, the model could identify whether a post expresses an opinion, news, marketing, a complaint, a suggestion, appreciation or a query.
>
>This could result in a better model that could even extend the restriction to other posts that are against the rules of the community.

**3. Other Classifier Models**

> In our analyses, we only used Naive Bayes and Logistic Regression models to create our production model. We can include other learning algorithms, such as k-Nearest Neighbors, Decision Tree Classifier and Random Forest Classifier, to determine the best model that can be optimised to be our production model.
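
A comparison of this kind could be sketched with scikit-learn as follows. This is an illustration under stated assumptions rather than the workflow used in the earlier notebooks: the tiny corpus is made up so the block runs on its own, and the hyperparameters (`n_neighbors=3`, `n_estimators=50`, `cv=3`) are arbitrary starting points. The step names `cvec` and `nb` mirror those seen in the scores table.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in corpus, made up for illustration only (1 = r/apple, 0 = r/Android).
posts = [
    "iphone battery drains after ios update",
    "macbook keyboard replacement at apple store",
    "airpods keep disconnecting from my iphone",
    "ipad pencil lag in notes app",
    "apple watch bands worth buying",
    "ios widgets on the iphone home screen",
    "pixel camera update on android",
    "samsung galaxy one ui beta released",
    "best android launcher for customisation",
    "oneplus fast charging on android phone",
    "google play store update rolling out",
    "android rooting guide for galaxy phones",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Candidate classifiers; hyperparameters are arbitrary, untuned choices.
candidates = {
    "nb": MultinomialNB(),
    "knn": KNeighborsClassifier(n_neighbors=3),
    "tree": DecisionTreeClassifier(random_state=42),
    "rf": RandomForestClassifier(n_estimators=50, random_state=42),
}

mean_cv = {}
for name, clf in candidates.items():
    # Same vectorizer for every candidate so only the classifier varies.
    pipe = Pipeline([("cvec", CountVectorizer()), (name, clf)])
    scores = cross_val_score(pipe, posts, labels, cv=3, scoring="balanced_accuracy")
    mean_cv[name] = scores.mean()

for name, score in mean_cv.items():
    print(f"{name:<5}{score:.3f}")
```

On real data, the toy corpus would be replaced by the scraped posts, and each candidate would be tuned (for example with a grid or randomised search) before being compared against the current Naive Bayes model.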


**4. Retrain model periodically**

> Because technology trends change quickly, we recommend pulling fresh data to retrain the model every new season, or whenever a change in technology trends is observed. New trends and products introduce new keywords, and retraining on newly pulled data lets the model recognise these terms and therefore better classify new subreddit posts in the future.
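
One possible way to notice an "observed change" is to track how many tokens in recent posts never appeared in the training data. The sketch below is a hypothetical heuristic, not something used in the earlier notebooks: the helper names, the simple regex tokenizer, the example posts and the `RETRAIN_THRESHOLD` value are all assumptions made for illustration.

```python
import re

# Hypothetical cut-off: retrain when more than 30% of new tokens are unseen.
RETRAIN_THRESHOLD = 0.3


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unseen_token_rate(train_posts, new_posts):
    """Fraction of tokens in new_posts that never occur in train_posts."""
    vocab = {tok for post in train_posts for tok in tokenize(post)}
    new_tokens = [tok for post in new_posts for tok in tokenize(post)]
    if not new_tokens:
        return 0.0
    unseen = sum(tok not in vocab for tok in new_tokens)
    return unseen / len(new_tokens)


# Example posts, made up for illustration.
train_posts = ["iphone battery drains fast", "pixel camera update"]
new_posts = ["iphone battery update", "vision pro headset"]

rate = unseen_token_rate(train_posts, new_posts)
needs_retraining = rate > RETRAIN_THRESHOLD
print(f"unseen token rate: {rate:.2f}, retrain: {needs_retraining}")
```

A check like this could run on a schedule alongside the seasonal retraining, so that a sudden burst of new product names triggers a refresh earlier than the calendar would.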

<div style="text-align: right">
    <div class="right"> >>> <b>Back to: </b>
        <a href="../README.md">README</a>
    </div>
    </div>

[Go to top](#top)

---