# Executive Summary

## Problem Statement

Can you predict what subreddit a comment comes from?

## Description of Data

### Size

* 116,443 comments were collected from Reddit
  * 58,355 from the mildlyinteresting subreddit
  * 58,088 from the interestingasfuck subreddit
* For each comment, the body, the author, the subreddit it came from, and the time it was posted were collected
* 13 features were engineered from the comments

### Source

This data was collected using the Pushshift Reddit API.

The GitHub repository and documentation for the API are available at [pushshift/api](https://github.com/pushshift/api).
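
As a rough illustration of the collection step, the sketch below builds a comment search request and trims a response down to the four collected fields. The endpoint URL, the `subreddit`, `size`, and `before` parameters, and the `data`, `author`, and `created_utc` response keys are assumptions based on the Pushshift documentation, and the helper names `build_comment_query` and `extract_fields` are hypothetical. It is not the collection script used for this project.

```python
from urllib.parse import urlencode

# Assumed endpoint for Pushshift comment search; check the API docs before use.
PUSHSHIFT_COMMENT_URL = "https://api.pushshift.io/reddit/search/comment/"


def build_comment_query(subreddit, size=100, before=None):
    """Return a request URL for one page of comments (hypothetical helper)."""
    params = {"subreddit": subreddit, "size": size}
    if before is not None:
        # Assumed to be a Unix timestamp; only older comments are returned.
        params["before"] = before
    return f"{PUSHSHIFT_COMMENT_URL}?{urlencode(params)}"


def extract_fields(payload):
    """Keep only the four fields collected for this project.

    Assumes the response JSON looks like {"data": [{...}, ...]}.
    """
    return [
        {
            "body": comment.get("body"),
            "auth": comment.get("author"),
            "subreddit": comment.get("subreddit"),
            "time": comment.get("created_utc"),
        }
        for comment in payload.get("data", [])
    ]
```

Paging backwards through a subreddit would then mean passing the oldest `time` value from one page as `before` for the next request.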

### Target

The target was the subreddit.

### Data Dictionary

|Feature|Type|Dataset|Description|
|---|---|---|---|
|subreddit|object/int|comment.csv/comment_clean.csv|The subreddit the comment came from. Stored as text in comment.csv and as an integer in comment_clean.csv, where 'interestingasfuck'=1 and 'mildlyinteresting'=0.|
|body|object|comment.csv/comment_clean.csv|This is the text of the comment.|
|auth|object|comment.csv|The author of the comment.|
|time|int|comment.csv|The time the author posted the comment.|
|num_of_chars|int|comment_clean.csv|The total number of characters of a comment. This includes letters, numbers, punctuation, etc.|
|word_count|int|comment_clean.csv|The total number of words in a comment.|
|question_mark|int|comment_clean.csv|Whether the comment contains a question mark. 1 means there is at least one, 0 means there are none.|
|exclaimation|int|comment_clean.csv|Whether the comment contains an exclamation point. 1 means there is at least one, 0 means there are none.|
|dot_dot_dot|int|comment_clean.csv|Whether the comment contains a trailing sentence or "...". 1 means there is at least one, 0 means there are none.|
|quotes|int|comment_clean.csv|Whether the comment contains text in quotes. 1 means there is at least one, 0 means there are none.|
|italics|int|comment_clean.csv|Whether the comment contains text in italics. 1 means there is at least one, 0 means there are none.|
|bold|int|comment_clean.csv|Whether the comment contains text in bold. 1 means there is at least one, 0 means there are none.|
|polarity|float|comment_clean.csv|The polarity score of the comment.|
|subjectivity|float|comment_clean.csv|The subjectivity score of the comment.|
|avg_word_length|int|comment_clean.csv|The average word length of the comment.|
|stop_word_count|int|comment_clean.csv|The number of stop words in the comment.|
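
To make the engineered features more concrete, here is a minimal sketch of how most of them could be derived from a comment body using only the standard library. The regular expressions for quotes, italics, and bold, and the tiny `STOP_WORDS` set, are assumptions chosen for illustration, not the rules used to build comment_clean.csv. `polarity` and `subjectivity` are left out because they need a sentiment library that is not named here.

```python
import re

# Hypothetical, deliberately small stop-word list; the real list is not given.
STOP_WORDS = {"a", "an", "the", "is", "it", "of", "and", "to", "in", "that"}

QUOTED = re.compile(r'"[^"]+"')            # text between double quotes
BOLD = re.compile(r"\*\*[^*]+\*\*")        # **text** in Reddit markdown
ITALIC = re.compile(r"(?<!\*)\*[^*\s][^*]*\*(?!\*)")  # *text*, but not **text**


def engineer_features(body):
    """Return a dict of comment-level features (illustrative sketch)."""
    words = body.split()
    # Strip surrounding punctuation before the stop-word lookup.
    cleaned = [w.lower().strip('.,!?"*') for w in words]
    return {
        "num_of_chars": len(body),
        "word_count": len(words),
        "question_mark": int("?" in body),
        "exclaimation": int("!" in body),
        "dot_dot_dot": int("..." in body),
        "quotes": int(QUOTED.search(body) is not None),
        "italics": int(ITALIC.search(body) is not None),
        "bold": int(BOLD.search(body) is not None),
        # Left unrounded here; any rounding used originally is not stated.
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "stop_word_count": sum(w in STOP_WORDS for w in cleaned),
    }


features = engineer_features('Wow, is that a "real" **photo**?')
```

Applying a function like this to every row of the `body` column would produce one new column per dictionary key.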


## Models

* Six different models were fit
* Three models were chosen to examine more closely
* MultinomialNB, LogisticRegression, and a VotingClassifier that incorporated both made the final cut

<br>

|Model|Accuracy|
|---|---|
|Baseline|53.7%|
|MultinomialNB|70.5%|
|LogisticRegression|70.8%|
|VotingClassifier|71.3%|

These models were chosen because they scored higher and ran significantly faster than the other models that were tested.
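
The following is a minimal sketch of how a VotingClassifier can combine MultinomialNB and LogisticRegression in scikit-learn. The `CountVectorizer` step, the soft voting setting, `max_iter=1000`, and the toy comments are all assumptions made for illustration. The actual vectorizer, hyperparameters, and voting strategy behind the table above are not stated here.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def build_voting_pipeline():
    """Bag-of-words features fed into a two-model voting ensemble (sketch)."""
    ensemble = VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("logreg", LogisticRegression(max_iter=1000)),
        ],
        voting="soft",  # assumption: average predicted probabilities
    )
    return make_pipeline(CountVectorizer(), ensemble)


# Toy data only; 1 = interestingasfuck, 0 = mildlyinteresting.
texts = [
    "this is absolutely incredible footage",
    "incredible engineering, wow",
    "my toast landed on its edge",
    "found a slightly bent spoon",
]
labels = [1, 1, 0, 0]

model = build_voting_pipeline()
model.fit(texts, labels)
predictions = model.predict(["wow, incredible"])
```

Soft voting works here because both estimators expose `predict_proba`. Hard voting with only two estimators can tie, which is one reason to prefer averaging probabilities in a two-model ensemble.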

## Findings/Conclusion/Recommendations

In conclusion, yes, I was able to predict what subreddit a comment came from. My best model, the VotingClassifier, classified comments by subreddit with an accuracy of 71.3%. This was an improvement of 17.6 percentage points over the baseline accuracy of 53.7%.
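
The gap between the model and the baseline can be expressed in two ways, and it helps to keep them apart. The short calculation below uses only the two accuracies reported above; the variable names are illustrative.

```python
baseline_accuracy = 53.7  # percent, from the results table
model_accuracy = 71.3     # percent, VotingClassifier

# Absolute gain: the difference between the two accuracies.
point_gain = model_accuracy - baseline_accuracy

# Relative gain: the same difference as a share of the baseline.
relative_gain = point_gain / baseline_accuracy * 100

print(f"{point_gain:.1f} percentage points")        # 17.6 percentage points
print(f"{relative_gain:.1f}% relative improvement")  # 32.8% relative improvement
```

The 17.6 figure is therefore an absolute difference in percentage points, while the relative improvement over the baseline is closer to 33%.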

I plan on collecting more data and finding a better way to deal with some of the noise generated by comments with just a few words.

## What's Next

This was a great first step in NLP classification for me. Going through the process of collecting data, cleaning it, performing EDA, and finally creating a model, I was able to learn new techniques and continue to sharpen my skills. I look forward to learning other ways of cleaning data and fine-tuning my models to achieve maximum performance.

I don't believe my model is ready for production, but I also don't think it is far off. I was able to streamline my process so that I can continue to collect more data, clean it, and model it with just a few clicks. Finding a way to deal with the noise from comments that contain just a few words would bring my model closer to production level.
