Machine learning on imgur

imgurPCA v2

imgurpca is a modular text analysis and machine learning library. It uses post comments on imgur as its corpus.

imgurpca contains a set of tools that help parse, analyze, and act on insights gained from textual data. A modular design allows for complex workflows.

While initially written for imgur, the project has been structured so that it can be easily ported to another data source/API. Core class definitions are present in imgurpca\base which can be subclassed for different needs.

imgurpca bases all processing on Principal Component Analysis (PCA), a way to reduce the dimensionality of data points. A data point can be the set of comments made by a user, or the comments on a single post. A set of points may be described by thousands of unique words, and each unique word is one dimension, so analysing such data directly becomes intractable (see Curse of dimensionality). For a given data set, PCA finds vectors of words that best describe the set's distribution, and those vectors can be used as axes for later computations. For example, posts on imgur.com may contain thousands of unique words, which means thousands of dimensions in the data; with PCA, a few vectors can be enough to distinguish posts without significant loss of accuracy.
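To make the idea concrete, here is a minimal sketch of PCA on word counts using plain NumPy. It is not imgurpca's own code or API: the helper names bag_of_words and pca, and the toy comments, are made up for illustration, and the decomposition uses NumPy's SVD rather than anything from this library.

```python
import numpy as np

def bag_of_words(docs):
    """Return (vocab, counts); counts[i, j] is how often vocab[j] occurs in docs[i]."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    index = {word: j for j, word in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for word in doc.lower().split():
            counts[i, index[word]] += 1
    return vocab, counts

def pca(counts, n_components):
    """Project the rows of counts onto the top principal axes (computed via SVD)."""
    centered = counts - counts.mean(axis=0)
    # Rows of vt are orthonormal word vectors, ordered by explained variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    axes = vt[:n_components]
    projected = centered @ axes.T
    explained = singular_values**2 / np.sum(singular_values**2)
    return axes, projected, explained[:n_components]

# Hypothetical comments, one "data point" per line.
comments = [
    "cute cat picture",
    "cute dog picture",
    "funny cat video",
    "terrible repost again",
    "repost again and again",
]
vocab, counts = bag_of_words(comments)
axes, projected, explained = pca(counts, n_components=2)
print(counts.shape, "->", projected.shape)
```

Each row of axes is a weighted combination of words, and each comment is now described by two coordinates instead of one count per vocabulary word.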

With fewer vectors needed to describe posts and users (by their comments), other computations become less costly. For example, you can try to predict:

  • What score a comment would get based on existing gallery comments (linear regression),
  • Whether a user is likely to favourite a post based on their history (logistic regression),
  • Posts that are similar in 'tone' (k-means clustering),
  • Tags a post would get based on comments/title/description (decision tree).

All of the machine learning methods named in parentheses above are implemented in the Learner class, with more to come.
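As a rough illustration of the first item in the list, the sketch below fits a linear regression on PCA-reduced coordinates to predict a comment score. It does not use the Learner class, whose interface is described in the docs folder; the functions fit_linear_regression and predict are hypothetical names, and the data is synthetic, generated from a known formula so the result can be checked.

```python
import numpy as np

def fit_linear_regression(features, targets):
    """Least-squares fit; returns weights with the intercept as the last entry."""
    design = np.column_stack([features, np.ones(len(features))])
    weights, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return weights

def predict(weights, features):
    """Apply fitted weights (intercept last) to new feature rows."""
    design = np.column_stack([features, np.ones(len(features))])
    return design @ weights

# Made-up data: 6 comments described by 2 principal-component coordinates,
# with scores generated from score = 3*pc1 - 2*pc2 + 5.
reduced = np.array(
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
)
scores = 3 * reduced[:, 0] - 2 * reduced[:, 1] + 5

weights = fit_linear_regression(reduced, scores)
predicted = predict(weights, np.array([[2.0, 2.0]]))
print(weights, predicted)
```

Because each comment is described by two coordinates rather than thousands of word counts, the fit only has three parameters to estimate, which is the cost saving the paragraph above refers to.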

imgurpca also provides the Bot class. It can post comments, upload pictures, send messages, and so on, either interactively or in the background on a schedule.

See the docs folder for more details on usage.

About

This was my first ever personal open-source project. I had left it in suspension in favour of other things to do. I came back more than a year later to add features and fix bugs here and there. If you compare last year's version v1 with this one (v2) you'll notice quite a change in approach. That's the lesson here folks, procrastinate for a year # TODO: finish the joke.

Disclaimer: Performance of my implementation may not be optimal. This project was an exercise in using my machine learning knowledge. Where possible I coded features from scratch.