Data analysis tools in journalism

Journalism tools

These tools are for data journalists and other writers and researchers to use in reporting stories. The scripts are in the src directory, grouped by language. The data was obtained from Corey Seliger and Megan Sapp Nelson of Purdue University through the Purdue University Research Repository (PURR). The raw input data files are in data, and the scripts create their output files in output.

How to use these tools

  1. Make sure you have Python, R, and the appropriate packages installed for each script. See "Installation" below for instructions.
  2. Download this repository as a .zip file (from the "Clone or download" button in the top-right), and unzip it.
  3. Find and open the terminal emulator program on your computer. Mac and Linux systems come with Terminal installed, and Windows systems come with Command Prompt and PowerShell. If there isn’t one installed, download one online.
  4. Type pwd and press Enter. This command shows your current working directory. Other useful commands:
    • ls displays the directories and files in the current directory.
    • cd directory moves to another directory, where "directory" is the name of the directory you’d like to move to.
    • cd .. (two dots) moves up one directory.
    • less file displays a file, where "file" is the name of the file. Use the arrow keys to move up and down, and type q to exit.
    In the scripts, comments are lines beginning with # or text enclosed in """. Comments are explanatory notes that help anyone use and understand the scripts. They are not run as code.
  5. Navigate to the unzipped directory.
  6. From there you can run the scripts from the directories in the src directory as instructed by the files and the comments in the scripts.
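
For readers who prefer to explore from inside Python, the navigation commands in step 4 have rough equivalents in the standard library. This is a minimal sketch, not part of the repository; the helper name list_directory is an assumption made for illustration.

```python
import os
from pathlib import Path


def list_directory(directory):
    """Return the sorted names of the files and directories in `directory` (like ls)."""
    return sorted(entry.name for entry in Path(directory).iterdir())


start = Path.cwd()             # like pwd: the current working directory
print(start)
print(list_directory(start))   # like ls: what the current directory contains

os.chdir(start.parent)         # like cd ..: move up one directory
print(Path.cwd())

os.chdir(start)                # move back to where we started
```

The last line returns to the starting directory so that any relative paths used afterwards still resolve as expected.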

Python scripts

Run these Python scripts to analyze data. Each Python script begins with import statements, which load the modules and functions the script needs. The lines enclosed in """ that follow are comments describing the script. They are not run as code.
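
As an illustration of that layout, here is a small hypothetical script (it is not one of the scripts in src). It shows a triple-quoted note, two forms of import statement, and a # comment. The variable names and values are made up for the example.

```python
"""
Example layout only. The text between the triple quotes is a string that
Python does not run as code, so it works as a longer note about the script.
"""

import math                   # import a whole module
from statistics import mean   # import one function from a module

# A line that begins with # is a comment; Python skips it.
values = [2, 4, 6]
average = mean(values)        # function imported with "from ... import ..."
root = math.sqrt(average)     # function reached through the module name
print(average, root)
```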

  • : This script converts a list of citations into a readable .csv file, using a pandas DataFrame as an intermediate step.

    • Usage: python citationsfile
      • For example: python ../../data/citation/statllcpub.txt
    • Requirements: pandas, matplotlib.
    • Input files must end with a suffix (such as .txt) and follow the "Scientific Style and Format for Authors, Editors, and Publishers" format.
    • Comments in the input files must begin with #.
    • Uses pandas DataFrames to store and manipulate data and matplotlib to plot graphs.
  • : This script analyzes the Data Science curricula across universities.

    • Usage: python
    • Requirements: pandas, matplotlib.
    • Uses the curricula .csv files in the data/curricula directory as input.
  • : This script establishes the scaffolding of data skills.

    • Usage: python
    • Requirements: pandas.
  • : This script extracts additional info from the raw input data.

  • : This script performs latent semantic analysis of the job descriptions.

  • : This script is an example of using pandas to manipulate data.

    • Usage: python
    • Requirements: pandas.
    • Uses pandas for data manipulation.
  • : This script analyzes the data science job postings across the United States.

    • Usage: python
    • This file must be run after running
    • Uses the extracted.csv file in the output/postings directory as input.
    • Requirements: basemap, numpy, matplotlib, pandas, proj4, scikit-learn.
    • Uses numpy for numerical operations, pandas for data manipulation, scikit-learn for machine learning, and basemap and proj4 for plotting.
  • This script performs latent analysis by outputting word similarity and frequency statistics for a set of documents.

  • : This script has simple statistical tests for evaluating uncertainty.

    • Usage: python
    • Requirements: matplotlib, numpy.
  • : This script creates an example dataset and plots it.

    • Usage: python
    • Requirements: matplotlib, pandas, seaborn.
    • Uses matplotlib and seaborn for plotting and pandas for data manipulation.
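
The citation-conversion workflow described in the list above (text input with # comments, a pandas DataFrame in the middle, .csv output) could be sketched roughly as follows. This is not the repository's code: the function name citations_to_dataframe, the single "citation" column, and the placeholder citations are all assumptions for illustration, and the real script may split each citation into more fields.

```python
import io

import pandas as pd


def citations_to_dataframe(lines):
    """Collect non-blank, non-comment lines into a one-column DataFrame."""
    citations = []
    for line in lines:
        text = line.strip()
        if not text or text.startswith("#"):
            continue  # skip blank lines and comment lines beginning with #
        citations.append(text)
    return pd.DataFrame({"citation": citations})


# Placeholder input standing in for a citations file (not real citations).
sample = io.StringIO(
    "# Comment lines start with # and are skipped\n"
    "Author A. 2001. First placeholder title. Journal One. 1(1):1-2.\n"
    "\n"
    "Author B. 2002. Second placeholder title. Journal Two. 2(2):3-4.\n"
)

frame = citations_to_dataframe(sample)
csv_text = frame.to_csv(index=False)  # returns the .csv text when no path is given
print(csv_text)
```

Passing a file path to to_csv instead of calling it with no path would write the result to disk.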

R scripts

Run these R scripts to analyze data. They break down the knowledge, skills, and abilities (KSA) that relate to data science jobs and research. The scripts require .tsv input files: tab-separated value files in which each row holds data science information in fields separated by tabs. .tsv files can generally be exported from software such as Microsoft Excel.
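
To make the .tsv format concrete, the sketch below reads a tiny tab-separated sample with Python's standard csv module. It is written in Python only to match the other examples in this README; the R scripts do their own reading. The column names title and min_years and the two rows are invented for illustration.

```python
import csv
import io

# Each line is a row; fields within a row are separated by tab characters (\t).
sample_tsv = "title\tmin_years\nData Scientist\t3\nData Analyst\t1\n"

# delimiter="\t" tells the reader to split on tabs instead of commas.
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
rows = list(reader)

for row in rows:
    print(row["title"], row["min_years"])
```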

  • techdict.R : This script creates a technology dictionary for skills and education required for data science jobs.

    • Usage: Rscript techdict.R dspos.tsv
    • Requirements: dplyr, gdata, ggplot2, tidyr.
    • Uses gdata, dplyr, and tidyr for manipulating and cleaning data and ggplot2 for plotting.
  • yearexp.R : This script plots the number of data science job postings by minimum years of experience and educational requirements from an input .tsv file.

    • Usage: Rscript yearexp.R exp.tsv
    • Requirements: dplyr, tidyr, ggplot2.
    • Uses dplyr and tidyr for manipulating and cleaning data and ggplot2 for plotting.


Installation

The required packages are found in the file in the directory for the corresponding language in the src directory. Packages can be installed using Anaconda, a distribution that includes the conda package manager. conda lets you download packages with commands such as conda install -c anaconda numpy to install numpy, or conda install -c conda-forge matplotlib to install matplotlib. Before installing packages, the corresponding channels must be added with conda config --add channels new_channel, in which new_channel is the name of the channel. The required channels are found on the conda page for the corresponding packages. The links to these pages are in the files.
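
After installing, one way to confirm that Python can find the packages is a short check like the one below. It is a sketch rather than part of the repository: the helper name missing_packages is an assumption, and the list should be edited to match the script you plan to run. Note that a package's import name can differ from its install name; scikit-learn, for example, is imported as sklearn.

```python
import importlib.util


def missing_packages(names):
    """Return the import names from `names` that Python cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names taken from the requirements listed for the Python scripts.
required = ["pandas", "matplotlib", "numpy"]
print(missing_packages(required))  # an empty list means everything was found
```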


Nelson, M. S. (2016). Scaffolding for data management skills: From undergraduate education through postgraduate training and beyond. Purdue University Research Repository. doi:10.4231/R7QJ7F9R

Seliger, C. S. (2018). Data Scientist Postings and Data Science Curriculum Datasets. Purdue University Research Repository. doi:10.4231/R7B27SJS

Seliger, C. S. (2018). Regular Expression Dictionaries Derived from Data Scientist Positions and Course Curriculum. Purdue University Research Repository. doi:10.4231/R7R78CGR

Seliger, C. S. (2018). Text Mining and Plotting Tools for KSA / DS / HEI Research Study. Purdue University Research Repository. doi:10.4231/R7MK6B49
