Data Mining - Spring 2020

Goals

  1. Get familiar with the R programming language
  2. Review Linear Regression, Logistic Regression
  3. Learn new Machine Learning Techniques: Random Forest, Support Vector Machine
  4. Introduction to digital trace data
  5. Learn how to use the Twitter API
  6. Practice text mining techniques: structural topic modeling, sentiment analysis

Textbooks

  1. Salganik, M. (2019). Bit by bit: Social research in the digital age. Princeton University Press. (optional)

  2. Wickham, H., & Grolemund, G. (2016). R for data science: Import, tidy, transform, visualize, and model data. O'Reilly Media. (required, also available for free)

  3. Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O'Reilly Media. (required, available as an online book)

  4. Additional articles and reports on Github

Class structure

  1. Introduction to Data Science & Ethical Issues in Data Science (Week 1-2)

  2. R Programming Language & Machine Learning 101 (object-oriented language, review of regression, support vector machine, principal component analysis, and deep learning) (Week 3-8)

  3. Text Mining (digital trace data, scraping Twitter and forums, using APIs, topic modeling) (Week 9-11)

  4. Final Project (Week 12-16)

Data Sources

  1. NCANDS - The National Child Abuse and Neglect Data System (NCANDS) is a voluntary data collection system that gathers information from all 50 states, the District of Columbia, and Puerto Rico about reports of child abuse and neglect.

  2. Kaggle - Kaggle is an online community of data scientists and machine learning practitioners. It allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.

  3. OpenNYC - Open Data is free public data published by New York City agencies and other partners.

  4. RedPillWomen Subreddit: https://www.reddit.com/r/RedPillWomen/. Ask the professor for the dataset.

  5. Podcasting Subreddit: https://www.reddit.com/r/podcasting/. Ask the professor for the dataset.

  6. Gab.com dataset. Ask the professor for the dataset.

  7. Twitter hashtags: scraping Twitter using R packages. This topic will be covered in the Text Mining part of the class.

  8. NYC Teachers' #sickout during the COVID-19 pandemic.

  9. Existing Tweet Datasets (more than 800 million tweets): https://tweetsets.library.gwu.edu/

Extra Resources

  1. DataCamp has free resources for R: https://www.datacamp.com/users/sign_up?redirect=%2Fcourses%2Ffree-introduction-to-r%2Fcontinue

  2. Guest speakers will visit class and share their data science journeys.

  3. Join the class #DataHunterS2020 Slack Channel: https://join.slack.com/t/datahunters2020/signup

  4. Join the R-Ladies Community Slack Channel if you want to reach out to women in Data Science: https://rladies-community-slack.herokuapp.com/

  5. Listen to the Data Skeptic podcast: https://dataskeptic.com/

More information about the Data Mining class is available on the Wiki.

If you have more questions, please raise an issue or email nthan@gradcenter.cuny.edu.

About

This repo is for the Data Mining course at Hunter College - Spring 2020
