This experience started as a daunting project but quickly became fun and interesting experience. I webscraped the Reddit.com API to gather unique posts from two different subreddits. The first was the California Politics subreddit and the second was the Texas Politics subreddit. At first glance these seemed to provide a very nice contrast with each other as the topics discussed in each would have some overlap when the content centered around national politics and some divergence when the content centered around state and local politics. The hope was that they would provide enough features for the Natural Language Processor to correctly classify a posts origin.
I decided to set up the classification problem to attempt to predict if a post came from the California subreddit. I had sample sizes that were relatively large (~980 vs. ~930) and almost equal. After running multiple models through the GridSearch I concluded that they were all running well giving a 98-99% accuracy score for the training set and a 92-93% accuracy score for the testing set. The told me that my model did not suffer from a high degree of overfitting.
I decided to re-run my models without including important features such as 'california' & 'texas' to see how this changed the results. The accuracy score of the training set dropped to ~92% and the testing set accuracy dropped to ~82%.
This told me that these features where playing a very strong role in helping to classify a post correctly. I then explored the Beta values of the features to see how they looked.
I was able to conclude that the model ran very consistently and that the misclassified posts were generally a result of a particular subreddit post explicitly talking about what was happening in the opposing state.
Overall this was a fantastic learning experience and I thoroughly enjoyed the process.
Subreddit California Politics: https://www.reddit.com/r/California_Politics/
Subreddit Texas Politics: https://www.reddit.com/r/TexasPolitics/
- Two jupyter notebooks:
- The first contains the data gathering information and process
- The second contains the data modeling process and visualization.
- A .csv file of our scrapped data titled 'reddit.csv'
- A powerpoint file titled 'Subreddit_Presentation.pptx'
- A folder titled 'images' containing:
- Multiple images generated by our data
- Images used throughout the presentation file from the web
Data Modeling & Visualization Notebook
- Import Libraries
- Creating our URLs
- Accessing the API
- Keep only uniqie Posts
- Cleaning and Creating Master DataFrame
- Exporting our DataFrame for Modeling
- Data Import
- Baseline Accuracy
- Data Modeling
- Data Exploration
- Taking a Deeper Look at Our Data
- Prediction and Misclassification
- Feature Betas
- California Wordcloud
- Texas Wordcloud
- Looking At Misclassification in Depth
Here is an example of some of the code used to model our data:
Python
# Deciding which words to remove via stop words
stop_words = ['to', 'the', 'in', 'of', 'for', 'and', 'on', 'is', 'it',
'with', 'what', 'about', 'are', 'as', 'from', 'at', 'will',
'that', 'says', 'by', 'be', 'this', 'can', 'has', 'how',
'california', 'texas']
# Setting up our hyperparameters to pass through our pipeline
pipe_params = {
'vec' : [CountVectorizer(), TfidfVectorizer()],
'vec__max_features': [1700, 1900, 2500, 3000],
'vec__min_df': [2, 3],
# 'vec__max_df': [0.4, 0.5],
'vec__ngram_range': [(1,2), (1,1), (1,3)],
'model' : [LogisticRegression(),
LogisticRegression(penalty='l1', solver='liblinear'),
LogisticRegression(penalty='l2', solver='liblinear'),
MultinomialNB(alpha=1.1),
# RandomForestClassifier(n_estimators=1500)
],
'vec__stop_words': [frozenset(stop_words)],
}
# Defining a function to do our model analysis. This function takes in X, y, and any pipe parameters
def model_analysis(X, y, **pipe_params):
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
pipe = Pipeline([
('vec', CountVectorizer()),
('model', LogisticRegression())])
gs = GridSearchCV(pipe, param_grid=pipe_params, cv=3, verbose=1, n_jobs=3)
gs.fit(X_train, y_train)
print(f' Best Parameters: {gs.best_params_}')
print('')
print(f' Cross Validation Accuracy Score: {gs.best_score_}')
print(f' Training Data Accuracy Score: {gs.score(X_train, y_train)}')
print(f' Testing Data Accuracy Score: {gs.score(X_test, y_test)}')


