Copyright 2020 Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.

# Data cleaning, transformations, & versioning: Problem solving

This session will use the `mpg` dataset, which contains measurements of fuel economy and other properties of cars from the 1970s, to practice data cleaning, transformation, and versioning concepts.

| Variable     | Type     | Description                                       |
|:-------------|:---------|:--------------------------------------------------|
| mpg          | Ratio    | Miles per gallon; fuel economy                    |
| cylinders    | Ordinal  | Number of cylinders in engine                     |
| displacement | Ratio    | Volume inside cylinders (likely cubic inches)     |
| horsepower   | Ratio    | Unit of power                                     |
| weight       | Ratio    | Weight of car (likely pounds)                     |
| acceleration | Ratio    | Acceleration of car (likely in seconds to 60 MPH) |
| model_year   | Interval | Year of car manufacture; last two digits          |
| origin       | Nominal  | Numeric code corresponding to continent           |
| name         | Nominal  | Car model name (ID)                               |

<div style="text-align:center;font-size: smaller">
    <b>Source:</b> This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
</div>
<br>

## Missing data

Import `pandas` so you can load a dataframe.

Load a dataframe with `datasets/mpg-na-hidden.csv` and display it.
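
As a minimal sketch of this step, the cell below loads a CSV into a dataframe and counts the values `pandas` already recognizes as `NaN`. The inline `csv_text` is made-up toy data standing in for the real file, so the numbers are assumptions for illustration only.

```python
import io
import pandas as pd

# Toy stand-in for datasets/mpg-na-hidden.csv; these rows are invented.
csv_text = """mpg,cylinders,horsepower,name
18.0,8,130,car a
15.0,8,165,car b
25.0,4,95,car c
"""

# With the real file this would be: pd.read_csv("datasets/mpg-na-hidden.csv")
df = pd.read_csv(io.StringIO(csv_text))

# Count the values pandas parsed as NaN in each column.
nan_counts = df.isna().sum()
print(df)
print(nan_counts)
```

In a notebook, leaving `df` as the last line of a cell displays it without `print`.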

-----------
**QUESTION:**

Do you think there are any `NaN` values? Why? 

**ANSWER: (click here to edit)**


<hr>

`describe` the dataframe.

Look at the min, mean, and max.
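
The sketch below shows how `describe` can expose a placeholder value. The toy data and the `-999` sentinel are hypothetical and are not necessarily what the real file contains.

```python
import io
import pandas as pd

# Toy data; -999 is a hypothetical sentinel chosen for illustration.
csv_text = """mpg,horsepower
18.0,130
15.0,165
25.0,-999
30.0,90
"""
df = pd.read_csv(io.StringIO(csv_text))

summary = df.describe()
# A minimum or maximum far outside the plausible range hints at a placeholder.
print(summary.loc[["min", "mean", "max"]])
```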

-----------
**QUESTION:**

Do you think there are any `NaN` values? Why and what are they?

**ANSWER: (click here to edit)**


<hr>

One way to tell if your guess is correct is to plot the variable(s) in question.

To make plots, import `plotly.express`.

Now plot a histogram with the variable(s) in question.

Another way to see this is with a boxplot.
Copy the code above, but change `histogram` to `box`.

-----------
**QUESTION:**

Do these plots support your guess?
How do we know this is missing data and not outliers?


**ANSWER: (click here to edit)**


<hr>

Load the dataframe again, but this time tell `pandas` that the value you found is `NaN`.
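
One way to do this is the `na_values` parameter of `read_csv`. The sketch below uses toy data and a hypothetical sentinel of `-999`; substitute the value you actually found.

```python
import io
import pandas as pd

# Toy data; -999 is a hypothetical sentinel chosen for illustration.
csv_text = """mpg,horsepower
18.0,130
15.0,165
25.0,-999
30.0,90
"""

# na_values lists extra strings that should be parsed as NaN.
df = pd.read_csv(io.StringIO(csv_text), na_values=["-999"])
print(df.isna().sum())
```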

-----------
**QUESTION:**

You have two choices at this point: `dropna` or `fillna` with another value (like the median).
Which would you do, and why?


**ANSWER: (click here to edit)**


<hr>

Do whichever of the two options you chose above.
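
Both options can be sketched on toy data as follows. The values are invented, and which option is appropriate depends on how much data is missing and why.

```python
import numpy as np
import pandas as pd

# Toy data with one missing horsepower value.
df = pd.DataFrame({
    "mpg": [18.0, 15.0, 25.0, 30.0],
    "horsepower": [130.0, 165.0, np.nan, 90.0],
})

# Option 1: drop every row that contains a NaN.
dropped = df.dropna()

# Option 2: replace each NaN with the median of its column.
filled = df.fillna(df.median(numeric_only=True))
```

Dropping keeps only observed values but loses rows. Filling keeps every row but adds values that were never measured.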

## Transforming data

### Outliers

`clip` the values below the 1st percentile and above the 99th percentile for each variable in the dataframe, save this in the dataframe, and display it.
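
A minimal sketch of percentile clipping on toy data. Restricting the operation to numeric columns is an assumption made here so that the text column passes through untouched. It is not the only way to approach the exercise.

```python
import pandas as pd

# Toy data: x holds 1.0 .. 101.0, and name is a text column.
df = pd.DataFrame({
    "x": [float(i) for i in range(1, 102)],
    "name": [f"car {i}" for i in range(1, 102)],
})

# Percentiles are only defined for numeric columns.
num_cols = df.select_dtypes("number").columns
lower = df[num_cols].quantile(0.01)
upper = df[num_cols].quantile(0.99)

# axis=1 aligns each bound with its column.
df[num_cols] = df[num_cols].clip(lower=lower, upper=upper, axis=1)
```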

-----------
**QUESTION:**

Did anything happen that you didn't expect? Are you concerned? What could you do about it?


**ANSWER: (click here to edit)**


<hr>

`describe` the dataframe again.

-----------
**QUESTION:**

Did the min, mean, and max change?
What about the 25th, 50th (median), and 75th percentiles?
Are you surprised?


**ANSWER: (click here to edit)**


<hr>

### Non-normality

Plot a histogram of `displacement`.

-----------
**QUESTION:**

Suppose we wanted to predict `displacement` in a linear regression.
Should we transform it to make it more normal?


**ANSWER: (click here to edit)**


<hr>

Import `numpy` so you can transform `displacement`.

Plot a log-transformed `displacement`.
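
The transform itself can be sketched as below, using invented right-skewed values and sample skewness as a rough numeric check. The result could then be passed to a histogram in place of the raw column.

```python
import numpy as np
import pandas as pd

# Toy right-skewed values standing in for displacement.
df = pd.DataFrame({"displacement": [70.0, 90.0, 100.0, 120.0, 150.0, 200.0, 300.0, 455.0]})

# Natural log; only valid for strictly positive values.
log_disp = np.log(df["displacement"])

# Skewness closer to 0 suggests a more symmetric distribution.
print(df["displacement"].skew(), log_disp.skew())
```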

-----------
**QUESTION:**

Is this better?
What else might you consider doing?


**ANSWER: (click here to edit)**


<hr>

## Create new variables

Create a new variable `ratio` in the dataframe that is `weight` divided by `horsepower`.
This is the weight-to-power ratio, the inverse of the more commonly quoted power-to-weight ratio.
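
A minimal sketch of creating the derived column from `weight` divided by `horsepower`, using invented values:

```python
import pandas as pd

# Toy values for illustration only.
df = pd.DataFrame({"weight": [3000.0, 2000.0], "horsepower": [150.0, 80.0]})

# Element-wise division creates the new column row by row.
df["ratio"] = df["weight"] / df["horsepower"]
print(df)
```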

Compare `ratio` to `weight` and `horsepower` in three plots.

First, plot the histogram of `weight`.

Next, plot the histogram of `horsepower`.

Finally, plot the histogram of `ratio`.

-----------
**QUESTION:**

What can you say about the distributions of variables in these three plots?
How does it make you feel about `ratio`?


**ANSWER: (click here to edit)**


<hr>

## Versioning

Try to make a commit using your current workspace. 

-----------
**QUESTION:**

Was there anything about your workspace in `git` that surprised you?


**ANSWER: (click here to edit)**


<hr>
