Code for the Datathon hosted by RPI and College Factual, April 2015.

hjweide/rpi-datathon-2015

rpi-datathon-2015

Background

Our code for the Datathon hosted by RPI's IDEA Lab and College Factual on the 25th of April 2015. Teams were free to investigate any issue related to North American universities and their surrounding areas.

Investigation

We set out to predict the sentiment expressed in social media posts made over the course of a year by students at different universities across the US and Canada. We wanted to run this sentiment analysis on posts from each university's subreddit, but because we could not find a suitable sentiment-labeled Reddit dataset, we instead trained on a publicly available Twitter dataset to see whether the model would generalize to posts on Reddit.

Some things we were interested in:

  • Are students more active during the semester than the summer?
  • Are students on the cold east coast more affected by the winter than those on the west coast?
  • Does the sentiment expressed in posts change as end-of-year exams approach?

The Data

We trained a Random Forest classifier on Bag-of-Words features extracted from a dataset of 1.5M tweets. Unfortunately, we did not have the computational resources to train on all of the data, so we selected different subsets to train and evaluate our models. We used the Twitter Sentiment Analysis Dataset, available at: http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/
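As a rough illustration of the Bag-of-Words step and the subset selection described above, here is a minimal standard-library sketch. It is not the code we ran at the Datathon: the function names, the tokenizer, the `max_features` cap, and the toy tweets are all assumptions made for the example, and the label convention (1 = positive, 0 = negative) is assumed as well. The resulting count vectors are the kind of input a classifier such as a Random Forest would be trained on.

```python
import random
import re
from collections import Counter


def tokenize(text):
    """Lowercase a post and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(texts, max_features=1000):
    """Map the most frequent tokens to column indices (hypothetical cap)."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    return {tok: i for i, (tok, _) in enumerate(counts.most_common(max_features))}


def vectorize(text, vocabulary):
    """Turn one text into a vector of token counts; unknown tokens are dropped."""
    vector = [0] * len(vocabulary)
    for tok in tokenize(text):
        idx = vocabulary.get(tok)
        if idx is not None:
            vector[idx] += 1
    return vector


def sample_subset(rows, n, seed=0):
    """Draw a reproducible random subset when the full dataset is too large."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))


# Toy labelled tweets (made up for illustration; assumed 1 = positive, 0 = negative).
tweets = [
    ("i love this class", 1),
    ("i hate finals", 0),
    ("love love the campus", 1),
]

subset = sample_subset(tweets, 2)
vocab = build_vocabulary([text for text, _ in tweets])
features = [vectorize(text, vocab) for text, _ in tweets]
labels = [label for _, label in tweets]
```

Capping the vocabulary size is one simple way to keep the feature matrix small when memory is limited, at the cost of ignoring rare words.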

As the data to predict on, we scraped the top posts from each university's subreddit over the course of the previous year.

Problems

Sentiment analysis is a very difficult problem. Posts made on the internet are inherently ambiguous and often not very coherent, which made it difficult to train a model with good classification accuracy. Additionally, the differences between posts on Twitter and posts on Reddit made it difficult to use our model, trained on Twitter data, to predict on Reddit posts. We also tried training our initial classifier on a dataset of IMDB movie reviews, but found that it generalized even worse (probably because movie reviews tend to be much longer and more coherent than social media posts).

Results

We plotted the fraction of Reddit posts containing positive and negative sentiment for each university per month over the course of a year. There were clear trends indicating that students are more active during the semester, but no discernible difference between the sentiment in the winter and summer months. Interestingly, the number of posts seemed to increase as the semester progressed, with the fraction of posts containing negative sentiment increasing slightly as finals approached.
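The per-university, per-month aggregation behind such a plot can be sketched as follows. This is a minimal illustration under stated assumptions, not our original plotting code: the function name, the `(university, month, label)` tuple layout, the example university keys, and the 1 = positive / 0 = negative labels are all hypothetical.

```python
from collections import defaultdict


def monthly_sentiment_fractions(posts):
    """Aggregate predicted labels into per-university, per-month fractions.

    posts: iterable of (university, "YYYY-MM", label) tuples, where label is
    assumed to be 1 for positive and 0 for negative predicted sentiment.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [negative count, positive count]
    for university, month, label in posts:
        counts[(university, month)][label] += 1

    fractions = {}
    for key, (negative, positive) in sorted(counts.items()):
        total = negative + positive
        fractions[key] = {
            "posts": total,  # activity level for that month
            "positive": positive / total,
            "negative": negative / total,
        }
    return fractions


# Made-up predictions for two hypothetical subreddits.
predicted = [
    ("rpi", "2014-12", 0),
    ("rpi", "2014-12", 1),
    ("rpi", "2014-12", 0),
    ("ubc", "2014-07", 1),
]
summary = monthly_sentiment_fractions(predicted)
```

The `posts` count gives the activity level for each month, while the two fractions can be plotted against the month to look for seasonal or end-of-semester trends.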

Conclusion

Predicting the sentiment contained in posts made on social media is a very difficult task. Additionally, a model trained on Twitter data does not generalize well to posts made on Reddit. This investigation would be much more valuable if the classifier could be trained on an actual sentiment-labeled Reddit dataset.
