In [None]:
# Import Numpy and Datascience modules.
import numpy as np
from datascience import *
import pandas as pd

# Plotting 
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

# Data Munging

"Data munging is a set of concepts and a methodology for taking data from unusable and erroneous forms to the new levels of structure and quality required by modern analytics processes and consumers." [Ref](https://www.talend.com/resources/what-is-data-munging/)

## Dealing with missing data

Datasets often have missing values, which can show up as nans (Not a Number) when loaded into Python.

In [None]:
ramen = Table.read_table("data/ramen-ratings.csv")

In [None]:
ramen

If you look at the CSV file, you will see that the "Top Ten" column has only ten entries; the rest of the column values are blank. There also appears to be at least one entry that contains just a newline character ('\n').

In [None]:
# Print all the unique entries in the column
np.unique(ramen.column("Top Ten"))

**Important point:** These are not true NumPy nans, just the string 'nan', so an ordinary equality test (the kind the where() method relies on) can find them, and we can replace them.

Suppose we want to replace all of the entries that are 'nan' or '\n' with the string "Not in top 10".

In [None]:
# create a function that accepts a single value
def replace_nan(x):
    if x == 'nan':
        return "Not in top 10"
    elif x == '\n':
        return "Not in top 10"
    else:
        return x
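Before applying the function to the whole column, it may help to try it on a few hand-made values. This is only a sketch: the sample strings below are illustrative assumptions, not rows copied from the ramen file, and the function is repeated (with the two checks merged) so the cell runs on its own.

```python
# Same logic as replace_nan above, repeated so this cell is self-contained
def replace_nan(x):
    if x == 'nan' or x == '\n':
        return "Not in top 10"
    return x

# Hypothetical sample entries (not taken from the CSV)
samples = ['nan', '\n', '2016 #10']
cleaned = [replace_nan(s) for s in samples]
print(cleaned)
```

Only the first two sample values should be rewritten; any other string passes through unchanged.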

In [None]:
# Use our function to replace values in the Top Ten column using apply()
ramen = ramen.with_column('Top Ten', ramen.apply(replace_nan, 'Top Ten') )
ramen.show(3)

**Mission Accomplished!**

### What if we have true np.nan values in our data set?

In [None]:
# Load one of the candidate group project data sets. This one is about air quality.
url = "https://opendata.arcgis.com/api/v3/datasets/3899a065577747fbb824f0a21afc2e7c_0/downloads/data?format=csv&spatialRefId=4326"
air = Table.read_table(url)
air.show(3)

In [None]:
ozone_values = np.unique(air.column('OZONE_PPM'))
ozone_values

In [None]:
np.isnan(ozone_values[3])

Yup! That is a true nan value. We cannot remove these with the where() method because equality tests do not work for nans.

In [None]:
np.nan == np.nan

Wow! That is confusing. It turns out you cannot test for equality with nans, but you can still identify them. NumPy has a function for this: np.isnan().

In [None]:
test = make_array( 5, 8, 12, np.nan, 2, 1)
test

In [None]:
np.isnan(test)
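To make this behaviour concrete, here is a small sketch that uses the same numbers as the test array. It shows that a nan is not even equal to itself, and that the Boolean mask from np.isnan() can be used to count or select values. The variable names are just for illustration.

```python
import numpy as np

values = np.array([5, 8, 12, np.nan, 2, 1])

# A nan never compares equal to anything, including itself
print(np.nan == np.nan)   # False
print(np.nan != np.nan)   # True

# np.isnan gives one True/False per element
mask = np.isnan(values)
n_missing = int(mask.sum())   # number of nans in the array
known = values[~mask]         # keep only the values that are not nan
print(n_missing, known)
```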

We can use this to replace the nans in our data table. Let's say we wanted to replace all the OZONE_PPM values that are nan with zeros. How do we do this? Again, we can start by writing a function.

In [None]:
# create a function that accepts a single value
def replace_true_nan(x):
    if np.isnan(x):
        return 0
    else:
        return x

In [None]:
# Test the function
for x in test:
    print(replace_true_nan(x))

In [None]:
# Use our function to replace values in the OZONE_PPM column using apply()
air2 = air.with_column('OZONE_PPM', air.apply(replace_true_nan, 'OZONE_PPM'))
air2.show(3)
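As an alternative to applying a function one value at a time, the replacement can also be sketched as a single array operation. The ozone numbers below are made up for illustration; they are not values from the air quality file.

```python
import numpy as np

# Made-up ozone readings with two missing values (illustrative only)
ozone = np.array([0.031, np.nan, 0.045, np.nan])

# Where the mask is True use 0, otherwise keep the original value
replaced = np.where(np.isnan(ozone), 0, ozone)

# np.nan_to_num does the same job in one call
also_replaced = np.nan_to_num(ozone, nan=0.0)

print(replaced)
print(also_replaced)
```

Because a table column is a NumPy array, an array built this way could presumably be passed straight to with_column() in place of the apply() result.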

### ... or you can use Pandas
What I just demonstrated is how to deal with nans within the world of Data 8's data tables. Outside this class, pandas DataFrames are the weapon of choice for Pythonistas working with tabular data. Pandas has many built-in methods for dealing with missing data, and you'll need to learn pandas as you continue your Python journey.

Let's repeat the same operation of replacing nan's in the OZONE_PPM column using pandas.

In [None]:
import pandas as pd

# Convert the data table to a pandas dataframe
df = air.to_df()

# Replace the nans in a column
df['OZONE_PPM'] = df['OZONE_PPM'].replace(np.nan, 0)

# Convert the pandas dataframe back into a data table
air3 = Table().from_df(df)
air3.show(3)

Pandas has many other methods for dealing with nans, including dropping those rows, filling with a value, and interpolating between neighboring values. You can read all about it [here](https://pandas.pydata.org/docs/user_guide/missing_data.html).
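Here is a minimal sketch of those three options on a tiny, made-up dataframe. The column name matches the air quality table, but the numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up readings with two missing values (illustrative only)
df = pd.DataFrame({'OZONE_PPM': [0.030, np.nan, 0.050, np.nan, 0.040]})

# Option 1: drop the rows where OZONE_PPM is missing
dropped = df.dropna(subset=['OZONE_PPM'])

# Option 2: fill the missing values with a constant
filled = df['OZONE_PPM'].fillna(0)

# Option 3: linearly interpolate between the neighboring values
interpolated = df['OZONE_PPM'].interpolate()

print(dropped)
print(filled.tolist())
print(interpolated.tolist())
```

Which option is appropriate depends on the question being asked: dropping rows shrinks the data set, while filling or interpolating invents values that were never measured.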

## Data Parsing 

Let's take a quick look at another one of the suggested data sets for the Group Project: Near Earth Objects.

In [None]:
url = '../Group-project/data/cneos_closeapproach_data.csv'
neo = Table.read_table(url)
neo

This is a really cool data set! Asteroids that may or may not be on a collision path with Earth. What's not to love? Well, the way the data are formatted, for one. Suppose we wanted to make a histogram of asteroid diameters. Look what we have to work with!

In [None]:
neo.column('Diameter')[0:30]

UGH! Sometimes the diameter is given in meters, sometimes in kilometers. For some, a range is given. For others, it is a value with a +/-.

**What do we do with this?**

We need to make a plan, and document it as we go. Here is what I chose to do:

1. If a range is given, find the average.
2. Drop any +/-.
3. Convert the value from a string to a number.
4. Convert all diameters to meters.

Rather than just give you the final function, I will walk you through the creation process.

In [None]:
# Create test cases
test1 = '250 m -  570 m'
test2 = '0.459±0.004 km'

**For test 1:**

Notice that the first element in the string is the lower bound on the diameter and the second-to-last element is the upper bound. We can split the string, which returns a list of the pieces separated by whitespace.
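To see why those two positions are the right ones, here is what split() returns for the first test case. Note that the double space after the dash does not produce an empty piece, because split() with no argument treats any run of whitespace as one separator.

```python
test1 = '250 m -  570 m'

parts = test1.split()
print(parts)                  # ['250', 'm', '-', '570', 'm']
print(parts[0], parts[-2])    # 250 570
```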

In [None]:
def convert_diameter(x):
    if '-' in x:
        x = x.split()
        print(x[0], x[-2])

In [None]:
convert_diameter(test1)

That is a good start. Now make the function return the average.

In [None]:
def convert_diameter(x):
    if '-' in x:
        x = x.split()
        return np.mean([float(x[0]), float(x[-2])])

In [None]:
convert_diameter(test1)

Looks good. Now let's tackle the second case. Instead of splitting on spaces, we will split on the '±', keeping the first element.

In [None]:
def convert_diameter(x):
    if '-' in x:
        x = x.split()
        return np.mean([float(x[0]), float(x[-2])])
    elif '±' in x:
        x = x.split('±')
        print(x[0])

In [None]:
convert_diameter(test2)

Great! But we need to convert this to a number and from kilometers to meters.

In [None]:
def convert_diameter(x):
    if '-' in x:
        x = x.split()
        return np.mean([float(x[0]), float(x[-2])])
    elif '±' in x:
        x = x.split('±')[0]
        return 1000 * float(x)

In [None]:
convert_diameter(test2)

Bingo! But what if there is a value somewhere in the diameter column that does not fit either of our two test cases? Better check.

In [None]:
def convert_diameter(x):
    if '-' in x:
        x = x.split()
        return np.mean([float(x[0]), float(x[-2])])
    elif '±' in x:
        x = x.split('±')[0]
        return 1000 * float(x)
    else:
        print("This value is unexpected:", x)

Fingers crossed, we will apply this function to the entire column in the data table.

In [None]:
neo2 = neo.with_column('Diameter', neo.apply(convert_diameter, 'Diameter'))

*DARN!!* It turns out there are two more cases we need to handle.

In [None]:
# Create test cases
test1 = '250 m -  570 m'
test2 = '0.459±0.004 km'
test3 = '1.4 km'

For test case three, we are looking for strings that contain 'km', but not '±'. Since we already check for '±', we just need to add a check for 'km' after it.

In [None]:
def convert_diameter(x):
    if '-' in x:
        x = x.split()
        return np.mean([float(x[0]), float(x[-2])])
    elif '±' in x:
        x = x.split('±')[0]
        return 1000 * float(x)
    elif 'km' in x:
        x = x.split()[0]
        return 1000 * float(x)
    else:
        print("This value is unexpected:", x)

In [None]:
convert_diameter(test3)
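One thing to keep in mind: the function above treats every range as meters and every ± value as kilometers. That matches our test cases, but I have not verified it for every row. If the column turned out to contain a range in kilometers, a variant that reads the unit of each value might look like the sketch below. The names to_meters and convert_diameter_units are hypothetical helpers introduced for illustration, and the last example string is made up.

```python
import numpy as np

def to_meters(text):
    """Convert one '<number> <unit>' string, e.g. '570 m' or '1.4 km', to meters."""
    number, unit = text.split()
    if unit == 'km':
        return 1000 * float(number)
    elif unit == 'm':
        return float(number)
    raise ValueError("Unexpected unit: " + unit)

def convert_diameter_units(x):
    """Hypothetical variant of convert_diameter that checks the unit of every value."""
    if '±' in x:
        # Keep the value before the ±, and take the unit from the end of the string
        value, uncertainty = x.split('±')
        unit = uncertainty.split()[-1]
        return to_meters(value + ' ' + unit)
    elif '-' in x:
        # Convert both ends of the range to meters, then average
        low, high = x.split('-')
        return np.mean([to_meters(low), to_meters(high)])
    else:
        return to_meters(x)

for example in ['250 m -  570 m', '0.459±0.004 km', '1.4 km', '1.2 km - 2.8 km']:
    print(example, '->', convert_diameter_units(example))
```

Raising an error on an unknown unit, rather than printing a message, makes an unexpected value impossible to miss when the function is applied to a whole column.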

Finally, we have to decide what to do about nans. These are the asteroids for which the diameter is unknown. If we change them to some default number, they could distort the distribution, so we should filter them out with the where() method before we apply our function.

In [None]:
diameter = neo.select('Diameter').where('Diameter', are.not_equal_to('nan'))
diameter.show(3)
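The same filter can be sketched with a plain NumPy Boolean mask. This works here because the missing diameters are the string 'nan', not true np.nan values, so an ordinary comparison finds them. The sample strings are illustrative, not rows from the file.

```python
import numpy as np

# Illustrative diameter strings, including one missing value stored as 'nan'
diameters = np.array(['250 m -  570 m', 'nan', '1.4 km'])

# Element-wise string comparison gives a True/False mask
is_known = diameters != 'nan'
known_diameters = diameters[is_known]
print(known_diameters)
```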

In [None]:
diameter_no_nan = Table().with_column('Diameter', diameter.apply(convert_diameter, 'Diameter'))
diameter_no_nan.show(3)

In [None]:
diameter_no_nan.hist('Diameter')

Well, it worked, but what a boring histogram!  There must be a few large asteroids and many, many, small ones.

In [None]:
max(diameter_no_nan.column('Diameter'))

Let's limit our histogram to asteroids under 1000 meters in diameter.

In [None]:
diameter_no_nan.where('Diameter', are.below(1000)).hist('Diameter', bins=20)
plt.title("Near-Earth Objects Diameters less than 1000 m");
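Another option, instead of cutting the data off at 1000 m, is to count the diameters in bins that grow by a factor of ten, so the few large objects and the many small ones both remain visible. The sketch below uses np.histogram with made-up diameters purely for illustration; it is not the result for the real data set.

```python
import numpy as np

# Made-up diameters in meters (illustrative only)
sizes = np.array([5, 12, 30, 45, 80, 150, 400, 900, 2500, 8000])

# Bin edges at 1, 10, 100, 1000 and 10000 meters
bins = np.logspace(0, 4, 5)
counts, edges = np.histogram(sizes, bins=bins)

for low, high, count in zip(edges[:-1], edges[1:], counts):
    print(f"{low:>7.0f} m to {high:>7.0f} m: {count}")
```

The same bin edges could be passed to a plotting call together with a logarithmic x-axis, if a picture is preferred over a table of counts.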

# Concluding Remarks

Well, we did it, but it was a lot of work. 

Here is a very telling quote [ref](https://www.appier.com/en/blog/means-data-scientist-today):

>What Makes for a Good Data Scientist

>Of course, every job has some less lovable bits and the burden of the data scientist is data cleaning! In most cases, the data we gather is ‘dirty’, with errors and discrepancies in it. For example, data showing that sales of a product have dropped dramatically may simply mean that malfunctioning machines have failed to capture the data accurately.

>Most data scientists will agree that data cleaning is the most boring part of this job. Our inside joke is that data science is 80 percent cleaning of data and 20 percent complaining about it!

>But jokes aside, data cleaning is painstaking but important work. If not done right, it can have a huge impact on the accuracy and reliability of insights.