# Goal: Read in a "dirty" data file and clean it up
* known problems with the data
  * typos
  * missing data
  * incorrect formatting


## Read in the data file `data/WA_Fn-UseC_-Sales-Win-Loss-DIRTY.csv`

In [2]:
import pandas as pd
import numpy as np
data = pd.read_csv('data/WA_Fn-UseC_-Sales-Win-Loss-DIRTY.csv')

## Take a look at the data

In [4]:
data.head()

,Unnamed: 0,Opportunity Number,Supplies Subgroup,Supplies Group,Region,Route To Market,Elapsed Days In Sales Stage,Opportunity Result,Sales Stage Change Count,Total Days Identified Through Closing,...,Opportunity Amount USD,Client Size By Revenue,Client Size By Employee Count,Revenue From Client Past Two Years,Competitor Type,Ratio Days Identified To Total Days,Ratio Days Validated To Total Days,Ratio Days Qualified To Total Days,Deal Size Category,Opportunity Next Step
0,0,1641984,Exterior Accessories,Car Accessories,Northwest,Fields Sales,76,Won,13,104,...,0,5,5,0,Unknown,0.69636,0.113985,0.154215,1,Call
1,1,1658010,,Car Accessories,Pacific,Reseller,63,Loss,2,163,...,$0,3,5,0,Unknown,0.0,1.0,0.0,1,
2,2,1674737,Motorcycle Parts,Performance & Non-auto,Pacific,Reseller,24,Won,7,82,...,7750,1,1,0,Unknown,1.0,0.0,0.0,1,Pending
3,3,1675224,Shelters & RV,Performance & Non-auto,,Reseller,16,Loss,5,124,...,$$0,1,1,0,Known,1.0,0.0,0.0,1,
4,4,1689785,Exterior Accessories,Car Accessories,Pacific,Reseller,69,Loss,11,91,...,69756,1,1,0,Unknown,0.0,0.141125,0.0,4,


## Take a look at the column names
* remove anything that doesn't look right

In [5]:
data.columns

Index(['Unnamed: 0', 'Opportunity Number', 'Supplies Subgroup',
       'Supplies Group', 'Region', 'Route To Market',
       'Elapsed Days In Sales Stage', 'Opportunity Result',
       'Sales Stage Change Count', 'Total Days Identified Through Closing',
       'Total Days Identified Through Qualified', 'Opportunity Amount USD',
       'Client Size By Revenue', 'Client Size By Employee Count',
       'Revenue From Client Past Two Years', 'Competitor Type',
       'Ratio Days Identified To Total Days',
       'Ratio Days Validated To Total Days',
       'Ratio Days Qualified To Total Days', 'Deal Size Category',
       'Opportunity Next Step'],
      dtype='object')

In [6]:
data = data.drop('Unnamed: 0', axis=1)
data.columns

Index(['Opportunity Number', 'Supplies Subgroup', 'Supplies Group', 'Region',
       'Route To Market', 'Elapsed Days In Sales Stage', 'Opportunity Result',
       'Sales Stage Change Count', 'Total Days Identified Through Closing',
       'Total Days Identified Through Qualified', 'Opportunity Amount USD',
       'Client Size By Revenue', 'Client Size By Employee Count',
       'Revenue From Client Past Two Years', 'Competitor Type',
       'Ratio Days Identified To Total Days',
       'Ratio Days Validated To Total Days',
       'Ratio Days Qualified To Total Days', 'Deal Size Category',
       'Opportunity Next Step'],
      dtype='object')

In [7]:
data = data.drop('Opportunity Next Step', axis=1)
data.columns

Index(['Opportunity Number', 'Supplies Subgroup', 'Supplies Group', 'Region',
       'Route To Market', 'Elapsed Days In Sales Stage', 'Opportunity Result',
       'Sales Stage Change Count', 'Total Days Identified Through Closing',
       'Total Days Identified Through Qualified', 'Opportunity Amount USD',
       'Client Size By Revenue', 'Client Size By Employee Count',
       'Revenue From Client Past Two Years', 'Competitor Type',
       'Ratio Days Identified To Total Days',
       'Ratio Days Validated To Total Days',
       'Ratio Days Qualified To Total Days', 'Deal Size Category'],
      dtype='object')
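
The two `drop` calls above can also be combined into one. The following is a minimal sketch on a made-up three-column frame (`sample` is a stand-in, not the real file). It assumes that any column whose name starts with `Unnamed` is a leftover row index from an earlier save.

```python
import pandas as pd

# Hypothetical stand-in for the real data: one leftover index column,
# one real column, and one column we decided to discard.
sample = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2],
    'Region': ['Pacific', 'Midwest', 'Pacific'],
    'Opportunity Next Step': ['Call', None, 'Pending'],
})

# Assumption: columns named "Unnamed: ..." are leftover row indexes.
leftover = [c for c in sample.columns if c.startswith('Unnamed')]

# Drop them together with any other unwanted column in a single call.
cleaned = sample.drop(columns=leftover + ['Opportunity Next Step'])
print(list(cleaned.columns))
```

Passing a list to `drop(columns=...)` avoids reassigning `data` once per column and keeps the list of removed columns in one place.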

## Find typos
* Hint: take a look at text-based fields and use the __`value_counts()`__ method to see the counts of each value

In [9]:
data['Supplies Group'].value_counts()

Car Accessories            49806
Performance & Non-auto     27321
Tires & Wheels               609
Car Electronics              281
Pefforrmance & NNonauto        1
Performance  & Non-auto        1
Performmanc & Npn-augp         1
Car Acceessories               1
Performance &  Non-auto        1
Car Accssoriez                 1
ar Avccessories                1
CCar Acessories                1
Name: Supplies Group, dtype: int64
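
In the counts above, every misspelled label appears exactly once, so filtering on low counts is one way to list the suspects. This is a rough sketch on a made-up miniature of the column. The threshold `min_count` is an assumption chosen for illustration, and a genuinely rare category would be flagged too, so the result still needs a human look.

```python
import pandas as pd

# Hypothetical miniature of the 'Supplies Group' column.
groups = pd.Series(
    ['Car Accessories'] * 5
    + ['Performance & Non-auto'] * 4
    + ['Car Acceessories', 'Performmanc & Npn-augp']
)

counts = groups.value_counts()

# Assumption: a label seen fewer than `min_count` times is a suspected typo.
min_count = 2
suspects = counts[counts < min_count]
print(sorted(suspects.index))
```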

## Locate missing data: what do we do about it?
* It's probably OK to drop a small amount of missing data, but if a lot of data is missing, what should we do? (Hint: descriptive statistics)
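
One possible reading of the hint, sketched on a made-up frame: count the gaps first, then either drop the affected rows or fill them with a descriptive statistic. The choice of median for numbers and mode for text is an assumption here, not something the exercise prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with gaps in a text column and a numeric column.
days = 'Elapsed Days In Sales Stage'
sample = pd.DataFrame({
    'Region': ['Pacific', np.nan, 'Pacific', 'Midwest'],
    days: [76.0, 63.0, np.nan, 16.0],
})

# Count and share of missing values per column.
missing = sample.isna().sum()
share = sample.isna().mean()

# Option 1: drop every row that has at least one gap.
dropped = sample.dropna()

# Option 2: fill the gaps instead of losing rows.
filled = sample.copy()
# Numeric gap: the median is less sensitive to outliers than the mean.
filled[days] = filled[days].fillna(filled[days].median())
# Text gap: use the most frequent value (the mode).
filled['Region'] = filled['Region'].fillna(filled['Region'].mode()[0])

print(missing.to_dict())
print(filled)
```

Dropping is simplest when `share` is tiny. Filling keeps the row count but puts invented values into the data, so it is worth noting which columns were filled.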

In [11]:
import re
data['Supplies Group'] = data['Supplies Group'].apply(lambda s: re.sub('.*r A.*', 'Car Accessoriess', s))  # the replacement itself is misspelled ("Accessoriess"); In [14] corrects it
data['Supplies Group'].value_counts()

Car Accessoriess           49810
Performance & Non-auto     27321
Tires & Wheels               609
Car Electronics              281
Pefforrmance & NNonauto        1
Performance  & Non-auto        1
Performmanc & Npn-augp         1
Performance &  Non-auto        1
Name: Supplies Group, dtype: int64

In [12]:
data['Supplies Group'] = data['Supplies Group'].apply(lambda s: re.sub('.*& N.*', 'Performance & Non-auto', s))
data['Supplies Group'].value_counts()

Car Accessoriess           49810
Performance & Non-auto     27324
Tires & Wheels               609
Car Electronics              281
Performance &  Non-auto        1
Name: Supplies Group, dtype: int64

In [13]:
import re
data['Supplies Group'] = data['Supplies Group'].apply(lambda s: re.sub('^P.*', 'Performance & Non-auto', s))
data['Supplies Group'].value_counts()

Car Accessoriess          49810
Performance & Non-auto    27325
Tires & Wheels              609
Car Electronics             281
Name: Supplies Group, dtype: int64

In [14]:
data['Supplies Group'] = data['Supplies Group'].apply(lambda s: re.sub('.*r A.*', 'Car Accessories', s))
data['Supplies Group'].value_counts()

Car Accessories           49810
Performance & Non-auto    27325
Tires & Wheels              609
Car Electronics             281
Name: Supplies Group, dtype: int64
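
The regular expressions above were written by hand for each family of typos. An alternative, sketched here and not part of the original solution, is to map each value to the closest known-good label with the standard library's `difflib`. The helper `nearest_label` and the `cutoff` of 0.6 are illustrative choices, and `CANONICAL` simply lists the four labels left in the final counts.

```python
import difflib
import pandas as pd

# The four labels that remain after cleaning, per the counts above.
CANONICAL = ['Car Accessories', 'Performance & Non-auto',
             'Tires & Wheels', 'Car Electronics']

def nearest_label(value, cutoff=0.6):
    """Return the closest canonical label, or the value unchanged if none is close."""
    if not isinstance(value, str):
        return value  # leave NaN and other non-strings alone
    match = difflib.get_close_matches(value, CANONICAL, n=1, cutoff=cutoff)
    return match[0] if match else value

# Hypothetical sample mixing typos with one clean value.
dirty = pd.Series(['Car Accssoriez', 'ar Avccessories',
                   'Pefforrmance & NNonauto', 'Tires & Wheels'])
fixed = dirty.map(nearest_label)
print(fixed.tolist())
```

Unlike `re.sub` inside `apply`, this helper does not raise on missing values. It can still mis-assign a value that happens to resemble the wrong label, so rerunning `value_counts()` afterwards remains a useful check.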

In [15]:
data['Region'].value_counts()

Midwest         21019
Pacific         15142
Northwest        9552
Southeast        9232
Southwest        8153
Mid-Atlantic     7568
Northeast        7358
Name: Region, dtype: int64
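
The `Region` counts above add up to 78,024, one fewer than the 78,025 rows counted for `Supplies Group`. That suggests a single row has no region, because `value_counts()` leaves out missing values by default. A small sketch on made-up data shows how to make such gaps visible:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the 'Region' column with one empty entry.
region = pd.Series(['Pacific', 'Midwest', np.nan, 'Pacific'])

# Default: NaN is silently excluded, so the counts sum to 3, not 4.
visible = region.value_counts()

# dropna=False adds a NaN row so the gap shows up in the tally.
with_missing = region.value_counts(dropna=False)

print(visible.sum(), with_missing.sum(), region.isna().sum())
```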

## Formatting errors
* it's not uncommon for data files to have things like dates formatted inconsistently
* there are no dates in these data, but one column is formatted inconsistently
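
The `head()` output near the top shows `Opportunity Amount USD` values such as `$0` and `$$0` next to plain numbers, so that column is a likely candidate. The sketch below uses a made-up miniature of the column and assumes stray dollar signs are the only problem; the real column may contain other surprises.

```python
import pandas as pd

# Hypothetical miniature of 'Opportunity Amount USD', mixing plain numbers
# with values that carry one or more stray dollar signs.
amount = pd.Series(['0', '$0', '7750', '$$0', '69756'])

# astype(str) guards against a column that holds a mix of numbers and text.
# regex=False makes '$' a literal character instead of a regex anchor.
stripped = amount.astype(str).str.replace('$', '', regex=False)

# errors='coerce' turns anything still unparseable into NaN instead of raising.
cleaned = pd.to_numeric(stripped, errors='coerce')

print(cleaned.tolist())
```

Checking `cleaned.isna().sum()` afterwards shows how many values could not be converted.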

## Write your cleansed data to the file __`data/WA_Fn-UseC_-Sales-Win-Loss-CLEAN.csv`__
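
A minimal sketch of the write step, using a made-up frame and a temporary directory so it runs anywhere. In the exercise the target would be the path in the heading above.

```python
import os
import tempfile
import pandas as pd

# Hypothetical stand-in for the cleaned frame.
cleaned = pd.DataFrame({'Region': ['Pacific', 'Midwest'],
                        'Opportunity Amount USD': [0, 7750]})

# A temporary directory keeps this sketch self-contained.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, 'WA_Fn-UseC_-Sales-Win-Loss-CLEAN.csv')

# index=False keeps the row index out of the file, so reading it back
# does not produce another 'Unnamed: 0' column.
cleaned.to_csv(out_path, index=False)

round_trip = pd.read_csv(out_path)
print(list(round_trip.columns))
```

Leaving out `index=False` is the usual way an `Unnamed: 0` column like the one dropped earlier gets into a CSV file.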
