# Exercise: NEISS, Question Set F

#### Summary

The [National Electronic Injury Surveillance System](https://www.cpsc.gov/Safety-Education/Safety-Guides/General-Information/National-Electronic-Injury-Surveillance-System-NEISS) is a data product produced by the US Consumer Product Safety Commission. It tracks emergency room injuries related to consumer products (e.g., "a door fell on me!").

#### Files

- **nss15.tsv**: injury data (one injury per row)
- **2017NEISSCodingManualCPSConlyNontrauma.pdf**: column definitions and explanations
- **2015 Neiss data highlights.pdf**: a partial summary of the data
- **2017ComparabilityTable.pdf**: product code definitions
- **categories-cleaned.txt**: product code definitions in CSV format (great for joining!)

#### Source

https://www.cpsc.gov/Safety-Education/Safety-Guides/General-Information/National-Electronic-Injury-Surveillance-System-NEISS

#### Skills

- Reading tab-separated files
- Ignoring bad lines
- Replacing values
- Using numpy/`np.nan`
- String search using regular expressions
- String replacement using regular expressions
- Using codebooks

# Read in `nss15.tsv`

Some of the lines just **aren't formatted correctly**. Maybe we can avoid those?

In [1]:
import pandas as pd
import matplotlib

In [2]:
df = pd.read_csv('nss15.tsv', on_bad_lines='skip', sep="\t")  # pandas 1.3+; replaces the removed error_bad_lines=False
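
To see what skipping bad lines does, here is a minimal sketch on made-up tab-separated text (the column names and values are invented, not taken from `nss15.tsv`). It assumes a pandas version that accepts `on_bad_lines="skip"` (1.3 or later), and it relies on the fact that a row with too many fields is the kind of row pandas treats as bad.

```python
import io

import pandas as pd

# Made-up TSV: the third data row has an extra field, so it is malformed.
raw = (
    "case\tage\tsex\n"
    "1\t5\t1\n"
    "2\t20\t2\n"
    "3\t61\t1\textra\n"
    "4\t88\t2\n"
)

# on_bad_lines="skip" drops rows with too many fields instead of raising.
toy = pd.read_csv(io.StringIO(raw), sep="\t", on_bad_lines="skip")
print(toy)
```

Row 3 never makes it into `toy`, so it is worth checking the row count afterwards.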

In [3]:
df.head()

Unnamed: 0,CPSC Case #	trmt_date	psu	weight	stratum	age	sex	race	race_other	diag	diag_other	body_part	disposition	location	fmv	prod1	prod2	narr1	narr2
0,150733174\t07/11/2015\t38\t15.7762\tV\t5\t1\t0...
1,150817487\t08/02/2015\t51\t74.8813\tL\t20\t2\t...
2,150717776\t06/26/2015\t41\t15.7762\tV\t61\t1\t...
3,150721694\t07/04/2015\t42\t74.8813\tL\t88\t2\t...
4,150713483\t06/08/2015\t93\t15.7762\tV\t25\t1\t...


### Check that your dataframe has 357727 rows and 19 columns.

### List the columns and their data types

### What does each column mean?

# Cleaning up a column

Take a look at the **race** column. How many rows of each race are there?

## Replace the numbers with the appropriate words they stand for.

Those numbers are terrible. Codes are fine for storage but not really for reading. **Replace the numbers with the words they stand for.**

Refer to page 28 of the column definitions file.
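
As a rough sketch of the replacing-values skill, the snippet below swaps codes for labels on a toy column using a dictionary. The three code-to-label pairs are an assumption for illustration only; the full, authoritative list is the one in the column definitions file.

```python
import pandas as pd

# Toy data standing in for the real race column.
toy = pd.DataFrame({"race": [1, 0, 2, 1, 0]})

# Assumed partial mapping for illustration; check the codebook for the real one.
race_labels = {0: "Not Stated", 1: "White", 2: "Black"}

# Series.replace with a dict swaps each key for its value.
toy["race"] = toy["race"].replace(race_labels)
print(toy["race"].value_counts())
```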

## Confirm you have 145813 White, 138666 not stated, and 48868 Black.

## Graph the number of each race, but don’t include the “Not Stated” records

## "Not Stated" seems silly - change it to be `NaN` instead

Don't use `na_values` for this.
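
One possible approach, shown on toy data, is to replace the label with `np.nan` after the file has been read. Once the value is `NaN`, `value_counts` leaves it out by default.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the relabelled race column.
toy = pd.DataFrame({"race": ["White", "Not Stated", "Black", "Not Stated"]})

# Swap the placeholder label for a real missing value.
toy["race"] = toy["race"].replace("Not Stated", np.nan)

# value_counts skips NaN unless dropna=False is passed.
print(toy["race"].value_counts())
```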

## Graph the count of each race, but don’t include the “Not Stated” records

Yes, again! The code you use should be different this time.

## Graph the top 10 most popular products for injuries 

# Cleaning up `race_other`

## `race_other` is a field for free-form race input. How many patients have a race of "HISPANIC"?

## What are the top 5 most popular "other" races?

## Searching for multiracial patients

Wow, this data entry is terrible. “Multiracial” is spelled as **MULT RACIAL**, **MULTIPLE RACIAL**, and many more. How many different spellings can you find? **Use only one line to find the spellings.**

- Tip: Be sure to **ignore the na values**.
- Tip: You should probably find the multiracial-ish rows and then `value_counts` their `race_other`
- Tip: Maybe... ask me about `.str.contains` support for regular expressions?
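
A small sketch of the tips above, on invented values: `.str.contains` treats its pattern as a regular expression by default, and `na=False` makes missing values count as non-matches. The pattern `MULT.*RAC` is just one guess at what might catch the variants; the real data may need something broader.

```python
import pandas as pd

# Invented free-text values, including a missing one.
toy = pd.DataFrame({
    "race_other": ["MULT RACIAL", "HISPANIC", None,
                   "MULTIPLE RACIAL", "MULTIRACIAL", "MULT RACIAL"],
})

# One line: filter with a regex (NaN treated as no match), then count spellings.
spellings = toy[toy["race_other"].str.contains(r"MULT.*RAC", na=False)]["race_other"].value_counts()
print(spellings)
```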

## Replace all of those spellings with “MULTIRACIAL.”

Confirm that you’ve ended up with about 1900 MULTIRACIAL rows (yours might be anywhere from 1899 to 1910, depending on how many spellings you caught).
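
For the replacement itself, `.str.replace` with `regex=True` is one option. The sketch below uses invented values and an illustrative pattern, so the counts it produces say nothing about the real file.

```python
import pandas as pd

# Invented free-text values, including a missing one.
toy = pd.DataFrame({
    "race_other": ["MULT RACIAL", "MULTIPLE RACIAL", "HISPANIC", None, "MULTIRACIAL"],
})

# Anchored pattern: whole values that start with MULT and end with RACIAL.
toy["race_other"] = toy["race_other"].str.replace(
    r"^MULT.*RACIAL$", "MULTIRACIAL", regex=True
)
print(toy["race_other"].value_counts())
```

Missing values stay missing, and values that do not match the pattern are left alone.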

## Do the same thing with misspellings of "Unknown"

You should end up with around 1660-1670 UNKNOWN entries.

## What variations on HISPANIC can you find? Standardize them to HISPANIC.

## Now try counting the number of Hispanic people again.

## Seems like a lot! Update their race column to be “Hispanic” instead of “Other”

You'll try to do this using skills you know, but pandas will probably yell at you. You get to learn this new thing called `loc` now! 

```
df.loc[df.country == 'Angola', "continent"] = "Africa"
```

This updates the `continent` column to be `Africa` for every row where `country == 'Angola'`. You CANNOT do the following, which is probably what you've wanted to do, because the first set of brackets hands back a copy and the assignment never reaches `df`:

```
df[df.country == 'Angola']['continent'] = 'Africa'
```

And now you know.
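
Here is the `loc` version as something you can run, using a tiny made-up dataframe with the same `country` and `continent` columns as the snippet above.

```python
import pandas as pd

# Made-up data matching the column names in the example.
toy = pd.DataFrame({
    "country": ["Angola", "Brazil", "Angola"],
    "continent": ["Unknown", "South America", "Unknown"],
})

# Row selector first, column name second, then assign in one step.
toy.loc[toy["country"] == "Angola", "continent"] = "Africa"
print(toy)
```

The same shape works for any "update column B where column A matches" task.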

## Graph the frequency of each race in the dataset

## Find every injury involving unicycles.

## What is the racial breakdown of people involved in unicycle accidents?

I want a **percentage**, and I want that percentage to include unknowns/NaN values.
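
A sketch of one way to get percentages that keep the missing values, on toy data: `normalize=True` turns counts into fractions and `dropna=False` keeps `NaN` as its own row.

```python
import numpy as np
import pandas as pd

# Toy data with one missing race.
toy = pd.DataFrame({"race": ["White", "White", np.nan, "Black"]})

# Fractions including NaN, scaled to percentages.
pct = toy["race"].value_counts(normalize=True, dropna=False) * 100
print(pct)
```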

## How about injuries with toboggans?

Is the racial breakdown significantly different from the racial breakdown of all patients?

## Find the top 5 most dangerous products

Just use the `prod1` column.
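
Product codes are easier to read next to their names. The sketch below counts codes and then joins the result to a lookup table. Everything in it is invented: the codes, the names, and the lookup column names `code` and `name` are hypothetical stand-ins, so adjust them to whatever `categories-cleaned.txt` actually contains.

```python
import pandas as pd

# Invented injuries and an invented lookup table (hypothetical column names).
injuries = pd.DataFrame({"prod1": [101, 101, 102, 103, 101, 102]})
codes = pd.DataFrame({
    "code": [101, 102, 103],
    "name": ["STAIRS", "FLOORS", "BEDS"],
})

# Count each product code and keep the two most frequent.
top = injuries["prod1"].value_counts().head(2)

# Turn the counts into a dataframe, then left-join the names onto it.
labeled = (
    top.rename_axis("code")
    .reset_index(name="injuries")
    .merge(codes, on="code", how="left")
)
print(labeled)
```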

## Find the top 5 most dangerous products by race

This is that weird groupby thing that you can either memorize or cut and paste every time. If you ask I'll tell it to you and you won't have to search!
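
One common shape for the "top N per group" pattern, sketched on invented data (the product codes are made up and the example keeps the top 2 rather than the top 5 to stay small). It is not necessarily the exact version meant above, just one that works.

```python
import pandas as pd

# Invented data: race labels and made-up product codes.
toy = pd.DataFrame({
    "race": ["White"] * 6 + ["Black"] * 3,
    "prod1": [101, 101, 101, 102, 102, 103, 201, 201, 202],
})

# Count products within each race (sorted high to low per group),
# then keep the first 2 rows of every race group.
top = toy.groupby("race")["prod1"].value_counts().groupby(level=0).head(2)
print(top)
```

Change `head(2)` to `head(5)` for the top 5.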