# SF-DAT-21 | Lab 05: Storms, `pandas`, and GitHub

## OBJECTIVES

This session closes the course's first unit, "Research Design and Data Analysis", in which we covered important building blocks for data science work:
- Research Design
- Data Manipulation, Dataset Tidying, and Exploratory Data Analysis with `pandas`
- Statistics and Visualization with `pandas`

Starting with the next session, we will shift our attention to building predictive models.  However, the skills you've learned so far will serve you well as a data scientist (e.g., mastering data tidying will save you hours and make your data much easier to work with).  This lab gives you an opportunity to put all these concepts into practice.

## BACKGROUND

Severe weather events cause public health and economic problems, many resulting in fatalities, injuries, and property/crop damage.  Preventing such outcomes to the extent possible is a key concern.

Using the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database (in the `datasets` folder alongside some documentation), which tracked characteristics of major weather events in the United States from 1950 to 2011, you will study and report what types of events cause most of the fatalities, injuries, and economic damages.

## PROMPT

You'll work in groups of 4 (no fewer) to 5 (no more) and collaborate using git, GitHub, and the SF-DAT-21-students remote repository here: https://github.com/paspeur/SF-DAT-21-students

Specifically:
- Do your individual work inside the directory named after your GitHub username.  This means copying this IPython Notebook (just this notebook, `Storm.ipynb`) into your directory.  Commit and push your changes regularly.  (We'll go over this in class.)
- Name your team and have one person on your team create a new directory at the root of the repository in the form `class-05-XXX`.  Copy this notebook there.
- How exactly you collaborate among yourselves is left for you to decide, as long as it works.  We are here to help you.

## DELIVERABLES

- Your team deliverable (more below) will be the notebook in your team directory; your individual deliverable will be the notebook in your personal directory (it can be the final team notebook, but I'm mostly interested in seeing you build proficiency with git/GitHub and this is one way to demonstrate it)
- Your notebook should be neither just code nor just text.  I'm looking for both, each supporting the other in a logical manner.
- More specifically, I want you to approach your work following the data science workflow.

## DATA SCIENCE WORKFLOW

[Don't read each question too literally.  The lab is relatively open-ended; the goal is to improve your mastery of your new skills and have fun doing it.]

- **1. IDENTIFY the Problem**
  - Write a SMART research question around the following loosely stated objective: "What types of events result in most fatalities, injuries, and economic damages?"


- **2. ACQUIRE the Data**
  - The raw dataset you'll work with is in the `datasets` folder (at the root of the repository) alongside some documentation
  - Questions you might ask yourself:
    - _What type of data is it?  (e.g., cross-sectional or longitudinal)_
    - _How well was the data collected?_
    - _Is there much missing data?_
    - _Was the data collection instrument calibrated?_
    - _Is the dataset aggregated?_
    - _Do we need pre-aggregated data?_


- **3. PARSE the Data**
  - [Again, the documentation is in the `datasets` folder]
  - You need to understand what you're working with
  - To better understand your data:
    - _Create or review the data dictionary_
    - _Perform exploratory surface analysis_
    - _Describe data structure and information being collected_
    - _Explore variables and data types_


- **4. MINE the Data**
  - Determine sampling methodology and sample data
  - Format, clean, slice, and combine data in Python
  - Create necessary derived columns from the data (new data)


- **7. PRESENT the Results**
  - Summarize findings with narrative, storytelling techniques
  - Present limitations and assumptions of your analysis
  - Identify follow-up problems and questions for future analysis
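
As a rough illustration of steps 3 and 4, the sketch below checks the share of missing values per column and then ranks event types by a chosen measure.  The column names (`EVTYPE`, `FATALITIES`, `INJURIES`, `REMARKS`), the helper names, and the toy rows are all assumptions made for this example; check the documentation in the `datasets` folder for the real column names.  Treat it as one possible starting point, not the expected solution.

```python
import pandas as pd


def missing_share(df):
    """Fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


def rank_event_types(df, measure, event_col='EVTYPE', top=10):
    """Total `measure` per event type, largest first.

    Labels are stripped and upper-cased first so that variants such as
    ' tornado' and 'TORNADO' are counted together (assumption: the raw
    labels may be inconsistent).
    """
    events = df[event_col].str.strip().str.upper()
    totals = df.groupby(events)[measure].sum()
    return totals.sort_values(ascending=False).head(top)


# Toy stand-in for the real dataset (hypothetical rows and column names).
toy = pd.DataFrame({
    'EVTYPE': ['TORNADO', ' tornado', 'FLOOD', 'HEAT', 'Flood'],
    'FATALITIES': [3, 2, 1, 4, 0],
    'INJURIES': [10, 5, 0, 2, 1],
    'REMARKS': [None, 'roof damage', None, None, 'road closed'],
})

missing = missing_share(toy)
by_fatalities = rank_event_types(toy, 'FATALITIES')
```

Running the same two helpers on the real `df` (with the actual column names) gives a first table you can discuss in your narrative.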

## DOCUMENTATION

- Some documentation for the dataset is in the `datasets` folder at the root of the repository
- `pandas` documentation: http://pandas.pydata.org/pandas-docs/stable/
- Slides from the previous sessions...

## IDEAS TO KEEP IN MIND

1. The dataset is big.  You might want to use a small subset to get things working before moving on to the entire dataset.
2. Generate both tables and histograms to answer your questions.  That will make for a better presentation to your classmates.
3. At some point, you may want to split tidying the dataset and the exploratory data analysis into different notebooks (multiple well-defined notebooks make it easier to collaborate).  You could save your cleaned-up dataset to disk (it should be small at this point) and load it from the other notebook.
4. If time allows, create a new tidy dataset reporting fatalities, injuries, and property/crop damage (min, max, median, Q1, Q3, mean, and variance) per calendar year and per state and save it as a CSV file.
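
Idea 4 could be approached roughly as in the sketch below, which groups by calendar year and state and computes the requested statistics.  The column names (`BGN_DATE`, `STATE`, `FATALITIES`, `INJURIES`), the date format, the helper names, and the output filename are assumptions for illustration only; adapt them to what you find in the dataset documentation, and add the damage columns once you have cleaned them.

```python
import pandas as pd


def q1(s):
    """First quartile (25th percentile)."""
    return s.quantile(0.25)


def q3(s):
    """Third quartile (75th percentile)."""
    return s.quantile(0.75)


def summarize_by_year_state(df, measures, date_col='BGN_DATE', state_col='STATE'):
    """One row per (year, state) with summary statistics for each measure."""
    out = df.copy()
    out['YEAR'] = pd.to_datetime(out[date_col]).dt.year
    stats = ['min', q1, 'median', q3, 'max', 'mean', 'var']
    summary = out.groupby(['YEAR', state_col])[list(measures)].agg(stats)
    # Flatten the (measure, statistic) column pairs into names like 'FATALITIES_q1'.
    summary.columns = ['_'.join(col) for col in summary.columns]
    return summary.reset_index()


# Toy stand-in for the real dataset (hypothetical rows and column names).
toy = pd.DataFrame({
    'BGN_DATE': ['04/18/1950', '06/20/1950', '12/05/1951', '12/09/1951', '12/15/1951'],
    'STATE': ['AL', 'AL', 'TX', 'TX', 'TX'],
    'FATALITIES': [0, 2, 1, 3, 5],
    'INJURIES': [15, 5, 0, 4, 2],
})

summary = summarize_by_year_state(toy, ['FATALITIES', 'INJURIES'])

# To save the result (hypothetical filename):
# summary.to_csv('storms_by_year_state.csv', index=False)
```

Note that the variance of a group with a single event is undefined and shows up as `NaN`, which is worth mentioning among your limitations.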

## CODE TO GET YOU STARTED

```python
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import csv

pd.set_option('display.max_rows', 10)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

%matplotlib inline
plt.style.use('ggplot')
```

```python
df = pd.read_csv(os.path.join('..', 'datasets', 'Storms.csv.bz2'))
```

```python
df
```
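
Because the dataset is big, it can help to prototype on a small slice first.  The sketch below reads only the first few rows (and optionally only some columns) using the `nrows` and `usecols` parameters of `pd.read_csv`.  The helper name `load_sample`, the in-memory stand-in file, and its column names are assumptions made so the example runs anywhere.

```python
import io

import pandas as pd


def load_sample(source, nrows=1000, usecols=None):
    """Read only the first `nrows` rows (and optionally a subset of columns)."""
    return pd.read_csv(source, nrows=nrows, usecols=usecols)


# In-memory stand-in for the real file (hypothetical columns and rows).
# With the lab data you would pass
# os.path.join('..', 'datasets', 'Storms.csv.bz2') instead.
fake_csv = io.StringIO(
    "EVTYPE,FATALITIES,INJURIES\n"
    "TORNADO,0,15\n"
    "HAIL,0,0\n"
    "FLOOD,1,2\n"
)

sample = load_sample(fake_csv, nrows=2, usecols=['EVTYPE', 'FATALITIES'])
```

Keep in mind that the first rows of a file are not a random sample, so use this to get code working rather than to draw conclusions.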

## ALL YOURS NOW...