imgurPCA v2

imgurpca is a modular text analysis and machine learning library. It uses post comments on imgur as its corpus.

imgurpca contains a set of tools that help parse, analyze, and act on insights gained from textual data. A modular design allows for complex workflows.

While initially written for imgur, the project is structured so that it can be ported to another data source or API. Core class definitions live in imgurpca\base and can be subclassed for different needs.

imgurpca bases all processing on Principal Component Analysis (PCA) as a way to reduce the dimensionality of data points. A data point can be the set of comments made by a user, or the comments on a single post. Thousands of unique words can describe a set of points, and each unique word adds a dimension, so analysing such data quickly becomes intractable (see Curse of dimensionality). For a given set of data, PCA finds the vectors of words that best describe the set's distribution, and those vectors can be used as axes for later computations. For example, posts on imgur.com may contain thousands of unique words, which would mean thousands of dimensions in the data. With PCA, a few vectors can be enough to distinguish posts without significant loss in accuracy.
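To make the idea concrete, here is a minimal sketch of PCA on word counts in plain Python. It is not imgurpca's code or API: the function names, the toy comments, and the use of power iteration to approximate only the first principal axis are all assumptions made for illustration.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Turn each document into a row of word counts over a shared vocabulary."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([float(counts[word]) for word in vocab])
    return vocab, rows


def center(rows):
    """Subtract each column's mean so variance is measured around the mean."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def first_component(centered, iterations=200):
    """Approximate the top principal axis by power iteration on X^T X."""
    dims = len(centered[0])
    # Deterministic, non-uniform start vector (hypothetical choice).
    vec = [1.0 / (i + 1) for i in range(dims)]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        # scores = X v, then new_vec = X^T scores, which equals X^T X v.
        scores = [sum(x * v for x, v in zip(row, vec)) for row in centered]
        new_vec = [
            sum(s * row[j] for s, row in zip(scores, centered))
            for j in range(dims)
        ]
        norm = math.sqrt(sum(x * x for x in new_vec))
        if norm == 0.0:
            break  # data has no variance along the current direction
        vec = [x / norm for x in new_vec]
    return vec


def project(centered, axis):
    """Coordinate of each data point along the given axis."""
    return [sum(x * a for x, a in zip(row, axis)) for row in centered]


# Made-up comments standing in for posts; not real imgur data.
docs = [
    "cute cat cute kitten",
    "cat kitten cute",
    "server crash error",
    "error crash server log",
]
vocab, rows = bag_of_words(docs)
centered = center(rows)
axis = first_component(centered)
scores = project(centered, axis)
print(len(vocab), "word dimensions reduced to 1 score per post:", scores)
```

The seven word dimensions collapse into one score per post, and the two cat posts end up on the opposite side of the axis from the two server posts. The sign of the axis is arbitrary, so only the relative positions are meaningful.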

With fewer vectors needed to describe posts and users (by their comments), other computations become less costly. For example, you can try to predict:

  • What score a comment would get based on existing gallery comments (linear regression),
  • If a user is likely to favourite a post based on their history (logistic regression),
  • Posts that are similar in 'tone' (k-means clustering),
  • Tags a post would get based on comments/title/description (decision tree).

All of the machine learning methods named in parentheses above are implemented in the Learner class, with more to come.
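As an illustration of one of these methods, the sketch below runs k-means clustering on a handful of points that stand in for posts already reduced to two PCA axes. It does not use or reflect the Learner class: the function names, the sample coordinates, and the choice to seed centroids with the first k points are assumptions for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=100):
    """Plain Lloyd's algorithm; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, new_labels) if label == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centroids


# Hypothetical posts, each described by two PCA scores (made-up values).
reduced_posts = [(2.1, 0.3), (1.9, -0.1), (-2.0, 0.2), (-1.8, -0.4), (2.3, 0.0)]
labels, centroids = k_means(reduced_posts, k=2)
print(labels)
```

Because the points have only two coordinates instead of thousands of word counts, each distance computation is cheap, which is the cost saving described above.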

imgurpca also provides the Bot class. It can post comments, upload pictures, send messages and so on, either interactively or in the background on a schedule.

See the docs folder for more details on usage.

About

This was my first ever personal open-source project. I had left it in suspension in favour of other things to do. I came back more than a year later to add features and fix bugs here and there. If you compare last year's version v1 with this one (v2) you'll notice quite a change in approach. That's the lesson here folks, procrastinate for a year # TODO: finish the joke.

Disclaimer: Performance of my implementation may not be optimal. This project was an exercise in using my machine learning knowledge. Where possible I coded features from scratch.
