CS412: Mining of Reddit Text Data by Kalina Borkiewicz and Kat Schroeder

This project mines the unstructured text of conversation threads (submissions and comments) from the website www.reddit.com within three domains -- Common [Software] Vulnerabilities and Exposures (CVE), Cryptocurrency, and Cybersecurity -- to find keywords that indicate relevant dimensions in the data.

Other methods were attempted, most notably the RAKE algorithm, but the following three scripts produced the final K-means clustering results:

preprocess.py - The preprocessing script that cleans the data.
findBestK.py - A script that tests multiple values of K for K-means clustering and plots the cost of each.
redditMining.py - The script that runs K-means clustering to find the best keywords.
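
The workflow behind these three scripts can be illustrated with a minimal, dependency-free sketch. It is not the code from the repository: the sample documents, the TF-IDF weighting, the tokenizer, and every function name below (`tokenize`, `tfidf_vectors`, `kmeans`, `top_terms`) are assumptions made for illustration. It cleans a few made-up posts, computes the K-means cost for several values of K (the numbers one would plot to choose K), and lists the heaviest terms in each centroid as candidate keywords. Plotting is left out to keep the example self-contained.

```python
import math
import random
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (assumed cleaning step)."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs):
    """Return (vocabulary, dense TF-IDF vectors) for a list of token lists."""
    vocab = sorted({tok for doc in docs for tok in doc})
    df = Counter(tok for doc in docs for tok in set(doc))
    n = len(docs)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[t] * (math.log(n / df[t]) + 1.0) for t in vocab])
    return vocab, vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(vectors, centroids):
    """Index of the nearest centroid for each vector."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors]

def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids, cost)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        labels = assign(vectors, centroids)
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    labels = assign(vectors, centroids)
    # Cost: sum of squared distances from each vector to its centroid.
    cost = sum(sq_dist(v, centroids[lab]) for v, lab in zip(vectors, labels))
    return labels, centroids, cost

def top_terms(vocab, centroids, top_n=3):
    """The top_n highest-weighted vocabulary terms of each centroid."""
    return [[vocab[i] for i in sorted(range(len(vocab)),
                                      key=lambda i: -centroid[i])[:top_n]]
            for centroid in centroids]

# Made-up sample posts standing in for the Reddit data.
posts = [
    "CVE buffer overflow vulnerability patch",
    "vulnerability exploit CVE patch released",
    "bitcoin wallet price crypto",
    "crypto exchange bitcoin price drop",
    "phishing malware security breach",
    "security breach malware ransomware attack",
]
vocab, vectors = tfidf_vectors([tokenize(p) for p in posts])

# Cost for each candidate K; plotting these values gives an elbow curve.
costs = {k: kmeans(vectors, k)[2] for k in range(1, len(posts) + 1)}

# Candidate keywords for one chosen K (3 is an arbitrary choice here).
labels, centroids, _ = kmeans(vectors, 3)
keywords = top_terms(vocab, centroids, top_n=3)
```

Because the initial centroids are sampled at random, the cost for a given K depends on the seed. On real data it is common to run several seeds per K and keep the lowest cost before comparing values of K.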
