This repository is a (more or less) comprehensive list of the projects I have worked on since 2017 as a student in statistics and data science at BYU and CMU. All projects are tagged with the following topical designations:
Statistical Computing
Algorithms, data structures, recursion, object-oriented programming, web scraping
Statistical Modeling
Projects involving statistical modeling
Machine Learning and NLP
Machine learning and NLP projects: constructing ML algorithms from scratch, dimensionality reduction, and unsupervised and supervised learning.
Economics Projects
Various projects related to economics at BYU, Cambridge, and CMU
SQL
SQL practice and challenges. Note that most of my experience with SQL comes from my internship with OrderBoard in summer 2019.
Project | Language(s) | Method(s) | Description |
---|---|---|---|
English Proficiency | R | NLP, PCA, Random Forest, Shiny | Determine the probability of an individual passing the TOEFL exam. Includes a GUI for the student to write an essay, plus a PowerPoint presentation. |
Random Forest | Python, SQL | Random Forest, object-oriented programming | Create Python Random Forests and SQL decision trees from scratch |
Sound of Music | R | Mixed models, hierarchical modeling | Determine factors that affect how people interpret music genre. Includes paper. |

Statistical Computing

Project | Language(s) | Method(s) | Description | Date |
---|---|---|---|---|
Closest Pair | R | Divide and conquer | Determine closest pair of points from given set | Nov 2019 |
Das Blinkenlights | R | Data structures, recursion, command line wrapper | Modular arithmetic problem with cows | Oct 2019 |
Tree Builder | R | Recursion, object-oriented programming | Build binary classification tree | Sep 2019 |
Web Scraping | R | Web scraping, regular expressions, automatic e-mail | Scrape the Walmart and Glassdoor websites | Jul 2019 |

Statistical Modeling

Project | Language(s) | Method(s) | Description | Date | Includes Paper |
---|---|---|---|---|---|
Sound of Music | R | Mixed models | Determine factors that affect how people interpret music genre | Nov 2019 | Yes |
Particulate Matter | R | Logistic mixed-effects, ROC | Determine effectiveness of particulate matter detectors | Apr 2019 | Yes |
Macular Degeneration | R | Longitudinal MLR, optim | Determine causes of age-related macular degeneration | Apr 2019 | |
Land Analysis | R | Spatial modeling, imputation | Determine effects of increased temperature; estimate and map temperature at locations obscured by cloud cover | Mar 2019 | |
Food Expenditures | R | GLS, fixing heteroskedasticity | Estimate effect of income on eating out | Mar 2019 | |
Statistics Pedagogy | R | GLS | Determine relevance of class activities on student grades | Feb 2019 | Yes |
Game of Thrones | R | Time series (SARIMA) | Predict Game of Thrones viewership | Feb 2019 | |
Greenhouse | R, SAS | Linear regression | Determine effect of various gases on average global temperature | Feb 2019 | |
Climate Change | R | Time series (SARIMA) | Predict climate change for next 5 years | Feb 2019 | |
Avalanche | R, SAS | Poisson Regression | Model the number of avalanches in Utah | Jan 2019 | |
Student Grades | SAS | Data summarization in SAS | Create reports for student grades in SAS | Dec 2018 | |
Myocardial Infarction | R | GLM, ROC/AUC | Determine causes of heart attacks | Nov 2018 | |
Cardiovascular Health | R | Longitudinal models | Determine causes of tachycardia | Nov 2018 | |
Birthweights | R | Linear regression, cross validation | Determine factors that lead to a change in baby birthweight | Sep 2018 | |
STEM | R | Logistic mixed-effects, ROC | Determine factors that influence whether students remain in STEM majors | Sep 2018 | |

Machine Learning and NLP

Project | Language(s) | Method(s) | Description | Date |
---|---|---|---|---|
English Proficiency | R | NLP, PCA, Random Forest, Shiny | Determine the probability of an individual passing the TOEFL exam. Includes a GUI for the student to write an essay. | Jan 2020 |
Stylometrics | R | NLP, PCA, Random Forest | Determine distinguishability of authors in Book of Mormon | Dec 2019 |
Information Retrieval | R | NLP, PCA | Use bag of words to search and cluster text data | Oct 2019 |
Dimensionality Reduction | Python | Hierarchical clustering, t-SNE, clustering | Classify handwritten digits (MNIST) | Nov 2018 |
Poverty | Python | Logistic regression, Naive Bayes, Random Forest, K-Nearest Neighbors | Determine causes of poverty in Costa Rica | Nov 2018 |
Housing Prices | Python | SGD, Lasso, Kernel Ridge, K-Nearest Neighbors, feature engineering, train-test split | Predict housing prices (supervised learning) | Oct 2018 |

Economics Projects

Project | Language(s) | Method(s) | Description | Date | Includes Paper |
---|---|---|---|---|---|
Per Capita Income | R | Linear regression, feature engineering | Determine socioeconomic factors that affect per-capita income | Sep 2019 | Yes |
Cost of Homeschooling | Stata | Logistic regression, fixed effects | Determine effect of maternal education on odds of child being homeschooled (working paper) | Apr 2018 | Yes |
Crime and Divorce | Stata | Linear regression, fixed effects | Explore differences in divorce and crime rates between the U.S. and the U.K. (working paper) | Jul 2017 | Yes (paper only) |

SQL

Project | Description (all in SQL) | Date |
---|---|---|
CRUD | Create, Read, Update, and Delete (“CRUD”) in SQL | Oct 2019 |
Science Forums Querying | Perform calculations and work with data from ScienceForums.net in SQL | Nov 2019 |