# Goal: Read in a "dirty" data file and clean it up
* known problems with the data
  * typos
  * missing data
  * incorrect formatting

## Read in the data file `data/WA_Fn-UseC_-Sales-Win-Loss-DIRTY.csv`

In [None]:
import pandas as pd
import numpy as np
data = pd.read_csv('data/WA_Fn-UseC_-Sales-Win-Loss-DIRTY.csv')

## Take a look at the data

In [None]:
data.head(25)

## Take a look at the column names
* remove anything that doesn't look right

In [None]:
data.columns

In [None]:
data = data.drop('Unnamed: 0', axis=1)
data

In [None]:
data = data.drop('Opportunity Next Step', axis=1)
data

## Find typos
* Hint: take a look at text-based fields and use the value_counts() method to see the counts of each value

In [None]:
data['Supplies Group'].value_counts()

In [None]:
import re
data['Supplies Group'] = data['Supplies Group'].apply(lambda s: re.sub('.*r A.*', 'Car Accessories', s))

In [None]:
data['Supplies Group'].value_counts()

In [None]:
# Start simple, just to see what we can do
data['Supplies Group'] = data['Supplies Group'].apply(lambda s: re.sub('.*& N.*', 'Performance & Non-auto', s))
# That folded back in two of the typos

In [None]:
data['Supplies Group'].value_counts()

In [None]:
# Get more aggressive with the regex
# Clean up 'Performance & Non-auto'
import re
data['Supplies Group'] = data['Supplies Group'].apply(lambda s: re.sub('^P.*', 'Performance & Non-auto', s))
data['Supplies Group'].value_counts()

In [None]:
# Now clean up 'Car Accessories'
data['Supplies Group'] = data['Supplies Group'].apply(lambda s: re.sub('.*r A.*', 'Car Accessories', s))
data['Supplies Group'].value_counts()
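The regex patterns above are hand-tuned to the typos that happen to be in this file. As a possible alternative, the sketch below maps each value to its closest known-good spelling with the standard library's `difflib.get_close_matches`. The toy values, the `canonical` list, the `closest_label` helper, and the `0.8` cutoff are all assumptions made for illustration, and the helper assumes every value is a string (no missing data).

```python
import difflib
import pandas as pd

# Toy data standing in for a text column with typos (values are made up).
groups = pd.Series([
    'Car Accessories', 'Car Accesories', 'Car Accessories',
    'Performance & Non-auto', 'Performence & Non-auto',
    'Tires & Wheels',
])

# Spellings we decide to treat as correct (an assumption for this sketch).
canonical = ['Car Accessories', 'Performance & Non-auto', 'Tires & Wheels']

def closest_label(value, choices=canonical, cutoff=0.8):
    """Return the closest canonical label, or the value unchanged if none is close."""
    matches = difflib.get_close_matches(value, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else value

cleaned = groups.apply(closest_label)
print(cleaned.value_counts())
```

A cutoff that is too low can merge categories that are genuinely different, so it is worth re-checking `value_counts()` after any automatic mapping.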

In [None]:
data['Supplies Subgroup'].value_counts()

In [None]:
data['Region'].value_counts()

## Locate missing data: what do we do about it?
* It's probably OK to drop a small amount of missing data, but if a lot of data is missing, what should we do?
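One way to make "a small amount" concrete is to measure the fraction of missing values per column and pick a strategy from that. The sketch below uses a made-up toy frame; the `THRESHOLD` of 25% and the `'Unknown'` placeholder are hypothetical choices, not recommendations for the real data.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names mirror the ones used in this notebook.
toy = pd.DataFrame({
    'Region': ['Northwest', 'Pacific', np.nan, 'Midwest', 'Pacific'],
    'Supplies Subgroup': [np.nan, np.nan, 'Batteries', np.nan, 'Motorcycle Parts'],
})

# Fraction of missing values per column (0.0 to 1.0).
missing_fraction = toy.isnull().mean()
print(missing_fraction)

# Hypothetical threshold separating "drop the rows" from "fill the gaps".
THRESHOLD = 0.25
mostly_missing = missing_fraction[missing_fraction > THRESHOLD].index.tolist()
rarely_missing = missing_fraction[missing_fraction <= THRESHOLD].index.tolist()

# Fill columns with many gaps using a placeholder, then drop the few remaining rows.
toy = toy.fillna({column: 'Unknown' for column in mostly_missing})
toy = toy.dropna(subset=rarely_missing)
print(toy)
```

Filling with a placeholder keeps the rows but adds an artificial category, so downstream counts should treat `'Unknown'` separately.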

In [None]:
data['Region'].isnull().sum()

In [None]:
data = data.dropna(subset=['Region'])
data

In [None]:
data['Supplies Subgroup'].isnull().sum()

In [None]:
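# Fill missing values in this column only; a frame-wide replace would also overwrite NaN in every other column
data['Supplies Subgroup'] = data['Supplies Subgroup'].fillna('Motorcycle Parts')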
print(data['Supplies Subgroup'].value_counts())
data['Supplies Subgroup'].isnull().sum()

## Formatting errors
* it's not uncommon for data files to have things like dates formatted inconsistently
* there are no dates in these data, but one column is formatted inconsistently
* Hint: descriptive statistics might help
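A column that should be numeric but was read as text is a common sign of inconsistent formatting. The sketch below, on made-up amounts, uses `pd.to_numeric` with `errors='coerce'` to find the values that do not parse as numbers; the toy values and the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy column with made-up amounts stored as text; some carry a '$' prefix.
amounts = pd.Series(['1200', '$450', '98000', '$7500', '300'])

# errors='coerce' turns anything that does not parse as a number into NaN.
parsed = pd.to_numeric(amounts, errors='coerce')

# The values that failed to parse are the inconsistently formatted ones.
suspect = amounts[parsed.isnull()]
print(suspect)
```

Looking at the suspect values first makes it easier to choose a targeted fix instead of guessing at the problem.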

In [None]:
data['Opportunity Amount USD'].mean()

In [None]:
data["Opportunity Amount USD"].value_counts()

In [None]:
data['Opportunity Amount USD'] = data[
     'Opportunity Amount USD'].apply(lambda s: int(s.replace('$', '')))

In [None]:
data['Opportunity Amount USD'].mean()

## Write your cleansed data to the file __`data/WA_Fn-UseC_-Sales-Win-Loss-CLEAN.csv`__

In [None]:
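# index=False keeps the row index out of the file, so it is not read back as an 'Unnamed: 0' column
data.to_csv('data/WA_Fn-UseC_-Sales-Win-Loss-CLEAN.csv', index=False)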
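Whether the row index is written affects what the file looks like when it is read back, and a saved index is a likely source of columns such as the `Unnamed: 0` dropped earlier. The sketch below shows the difference on a made-up toy frame, writing to an in-memory buffer rather than to disk.

```python
import io
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({
    'Region': ['Pacific', 'Midwest'],
    'Opportunity Amount USD': [1200, 450],
})

# Default behaviour: the row index is written as a first column with no header.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)

# index=False leaves the row index out of the file.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)

print(list(with_index.columns))
print(list(without_index.columns))
```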