No description, website, or topics provided.
R Python Shell
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
data
figure
lib
.RData
.Rhistory
.gitignore
01-extract-wv.R
MSHA.Rmd
MSHA_pandas.html
MSHA_pandas.md
MSHA_pandas.rmd
MSHA_post_pandas.Rmd
README.md
TODO
extract_wv.py
load_sql.r
mongo_MSHA.html
mongo_MSHA.md
mongo_MSHA.rmd
msha_pandas.py
msha_workspace.RData
sample_graph.png

README.md

MSHA Data Exploration

An analysis of violations and accidents at WV mines, in hopes of identifying a predictive model for accidents as a function of cited violations or some other factor.

My intention is to document the process by which someone with some fluency in R and understanding of a data set's subject matter, but very little depth of understanding in predictive analytics, might go about using the great tools available to us all today.

The data is from http://www.msha.gov/OpenGovernmentData/OGIMSHA.asp

Tables include:

  • Mines
  • Inspections
  • Violations
  • Accidents

Project Organization

I have built this project as an RMarkdown document.

The /data/msha_source folder contains the large source files and their definition files. The table of violations is over 500 MB, so for the purposes of this project, I have selected out mines in West Virginia using the /munge/01-extract-wv-R script.

Updates

2013-11-21 The R script I built to count inspections, violations, and accidents inside a moving window was way too slow. I have rebuilt that functionality in python pandas, and it's fast enough that now it's no longer necessary to isolate subsets of the records.

2013-12-08 msha_pandas.py is now doing the work of calcuating daily statistics. It exports a csv file that MSHA_post_pandas.Rmd processes.

2013-12-13 Even using pandas for data prep was not enough to keep R happy. I rewrote the nested MINE_ID & VIOLATION_OCCUR_DT indexes, and added a logit analysis in pandas. The next step is to approach it not as a regression problem with trailing events as independent variables, but as a survival analysis problem.