#### Wouldn't it be nice...

Wouldn't it be nice to consolidate the `install_type` field? It has 24 unique values, many of which are misspellings or variants of the same category.

I want to edit a particular row, but in order to set a column value with `.loc`, I need a unique row label.
That is a real annoyance because the index is the install date, which is not unique.  For the date of the row I want to fix
I have 286 entries...
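
One way around the missing row id is to select with a boolean mask instead of an index label. The sketch below uses a tiny made-up frame (the values are illustrative, not taken from the real file) to show that a mask passed to `.loc` targets rows by position, so repeated dates in the index should not get in the way.

```python
import pandas as pd

# Toy frame with a repeated date index, mimicking the non-unique date_installed index.
df = pd.DataFrame(
    {"install_type": ["residential", "commercial - agriculture", "unknown"],
     "size_kw": [2.85, 1.4, 120.96]},
    index=pd.to_datetime(["2009-12-31", "2009-12-31", "2009-12-31"]),
)
df.index.name = "date_installed"

# The mask has one True/False per row, so only the matching row is modified,
# even though all three rows share the same index label.
mask = df["install_type"] == "commercial - agriculture"
df.loc[mask, "install_type"] = "agricultural"
```

This fixes one label at a time; with 16 labels to repair, a single mapping applied to the whole column is less work.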

It seems like there are two choices here:
1. drop the 12 smallest categories
2. figure out how to get at these and make them as right as I can.

Option 1 isn't good because it leaves me with categories that still need to be consolidated.  

Here's a brainstorm on how to edit the install_type:
What we need to do is apply a mapping from actual to target for every one of these bad types.  That's actually pretty easy since I have all the values from `value_counts`.  So all I need is the mapping.
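
A minimal sketch of that idea, on a made-up series: start from an identity mapping built out of the `value_counts` index, then override only the labels that need fixing. The name `type_map` and the sample values are assumptions for illustration.

```python
import pandas as pd

s = pd.Series(["residential", "commerical", "education", "educational", None])

# Identity mapping for every observed (non-null) label...
type_map = {label: label for label in s.value_counts().index}
# ...then override only the entries that need consolidating.
type_map.update({"commerical": "commercial", "education": "educational"})

# Series.map looks each value up in the dict; missing values stay missing.
cleaned = s.map(type_map)
```

Starting from the identity mapping means every observed label has a key, so nothing is silently turned into a missing value by `map`.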


In [1]:
# %load ../pycode/setup.py
# set up
import numpy as np
import pandas as pd
import random
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

%matplotlib inline

def ecdf(data):
    '''Compute ECDF for a one-dimensional array of measurements.'''
    n = len(data)
    x = np.sort(data)
    y = np.arange(1, n + 1) / n
    return x, y

def min015099max(series):
    ''' return list of [ min, 1%, median, 99%, max ] series values '''
    vals = list(np.percentile(series, [1.0, 50.0, 99.0]))
    vals.append(series.max())      # append so the max lands after the 99% value
    vals.insert(0, series.min())
    return vals

# ss = np.arange(1, 101)
# min015099max(ss)

def mid98(series):
    '''  return middle 98% of series '''
    bounds = series.quantile([0.01, 0.99])
    return(series[(series > bounds.values[0]) & (series < bounds.values[1])])

# ss = pd.Series(np.arange(1, 101))
# mid98(ss)
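
The commented-out checks above can be turned into a quick sanity test. This is a sketch that re-declares both helpers so it runs on its own; the summary helper here appends the maximum so it lands last, which is what its docstring promises.

```python
import numpy as np
import pandas as pd

def min015099max(series):
    '''Return [min, 1%, median, 99%, max] of the series values.'''
    vals = list(np.percentile(series, [1.0, 50.0, 99.0]))
    vals.append(series.max())
    vals.insert(0, series.min())
    return vals

def mid98(series):
    '''Return the values strictly between the 1% and 99% quantiles.'''
    bounds = series.quantile([0.01, 0.99])
    return series[(series > bounds.values[0]) & (series < bounds.values[1])]

ss = pd.Series(np.arange(1, 101))
summary = min015099max(ss)   # [1, 1.99, 50.5, 99.01, 100] with linear interpolation
middle = mid98(ss)           # drops 1 and 100, keeps 2..99
```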


In [2]:
# load more concise dataset for exploration
dfLive = pd.read_csv('../local/data/live20171229.csv', index_col='date_installed', 
                     parse_dates=True, dtype={'zipcode' : object})  # np.object was removed in NumPy 1.24

In [3]:
dfLive.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1002025 entries, 2004-01-01 to 2015-12-31
Data columns (total 18 columns):
cost_per_watt    745516 non-null float64
cost             745629 non-null float64
size_kw          1002025 non-null float64
state            1002025 non-null object
zipcode          1002025 non-null object
city             788849 non-null object
county           980790 non-null object
install_type     966720 non-null object
new_constr       27098 non-null float64
tracking         1920 non-null float64
third_party      306989 non-null float64
appraised        223431 non-null object
incentive        788415 non-null object
utility          783186 non-null object
tech             580399 non-null object
model            580399 non-null object
installer        694390 non-null object
bipv             5252 non-null float64
dtypes: float64(7), object(11)
memory usage: 145.3+ MB


In [4]:
dfLive.install_type.value_counts()

residential                    898178
commercial                      35928
unknown                         19189
government                       5675
nonprofit                        4191
educational                      2363
agricultural                      343
customer                          326
utility                           224
education                         209
institutional                      29
commercial - other                 16
public                             15
gov't/np                           11
agriculture                         5
residential/sf                      5
nonresidential                      3
not stated                          2
commercial - small business         2
small business                      2
municipal                           1
commerical                          1
commercial - builders               1
commercial - agriculture            1
Name: install_type, dtype: int64

In [5]:
theInstallTypes = list(dfLive.install_type.value_counts().index)

In [7]:
theInstallTypes.sort(); theInstallTypes

['agricultural',
 'agriculture',
 'commercial',
 'commercial - agriculture',
 'commercial - builders',
 'commercial - other',
 'commercial - small business',
 'commerical',
 'customer',
 'education',
 'educational',
 "gov't/np",
 'government',
 'institutional',
 'municipal',
 'nonprofit',
 'nonresidential',
 'not stated',
 'public',
 'residential',
 'residential/sf',
 'small business',
 'unknown',
 'utility']

In [8]:
len(theInstallTypes)

24

In [9]:
dfLive[dfLive.install_type == 'commercial - agriculture']

date_installed,cost_per_watt,cost,size_kw,state,zipcode,city,county,install_type,new_constr,tracking,third_party,appraised,incentive,utility,tech,model,installer,bipv
2009-12-31,8.58,12009.79,1.4,OH,43432,,Ottawa,commercial - agriculture,,,,,,,,,,


In [10]:
# watch out: the index label is the install date, which many rows share
dfLive[dfLive.install_type == 'commercial - agriculture'].index

DatetimeIndex(['2009-12-31'], dtype='datetime64[ns]', name='date_installed', freq=None)

In [11]:
# I want to set the 'install_type' in the row above.  
# To do it using .loc I need row and column indexes but...
# I don't have a unique id for the row: the date label selects every row installed that day.
dfLive.loc[dfLive[dfLive.install_type == 'commercial - agriculture'].index, :]

date_installed,cost_per_watt,cost,size_kw,state,zipcode,city,county,install_type,new_constr,tracking,third_party,appraised,incentive,utility,tech,model,installer,bipv
2009-12-31,9.042553,510000.00,56.400000,MA,01701,Framingham,Middlesex,government,,1.0,,,Massachusetts Clean Energy Center,NSTAR (DBA EverSource),Poly,ND-224U1F,Solar Design Associates,
2009-12-31,5.787037,700000.00,120.960000,AZ,85215,MESA,Maricopa,unknown,,,,,Salt River Project,Salt River Project,Poly,multiple matches,Pedersen Electric,
2009-12-31,7.964912,22700.00,2.850000,CA,91335,Reseda,Los Angeles,residential,,,,,Los Angeles Department of Water & Power,Los Angeles Department of Water & Power,,,American Vision Solar Lp,
2009-12-31,5.616331,105924.00,18.860000,AZ,85718,TUCSON,Pima,residential,,,,,Tucson Electric Power,Tucson Electric Power,,,Technicians For Sustainabililty,
2009-12-31,9.067576,11969.20,1.320000,CA,95616,Davis,Yolo,residential,,,,,California Public Utilities Commission (Califo...,Pacific Gas & Electric Co,,,Grid Alternatives,
2009-12-31,,,4.400000,IN,47401,Bloomington,,residential,,,,,,,,,SSI,
2009-12-31,5.748715,58177.00,10.120000,TX,75154,Red Oak,Ellis,residential,,,,,Oncor Electric Delivery Company,Oncor Electric Delivery Company,Poly,REC230AE-US,,
2009-12-31,5.857488,48500.00,8.280000,CA,95658,Newcastle,Placer,residential,,,,,California Public Utilities Commission (Califo...,Pacific Gas & Electric Company,Poly,BP3230N,Sunrise Solar,
2009-12-31,7.223766,25283.18,3.500000,CA,94611,Oakland,Alameda,residential,,,,,California Public Utilities Commission (Califo...,Pacific Gas & Electric Company,Mono,NT-175U1,RGS/Real Goods,
2009-12-31,,,1.200000,IN,47408,Bloomington,,residential,,,,,,,,,SSI,
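
If a unique row id is wanted after all, one option is to move the dates into an ordinary column with `reset_index`, which leaves a unique integer index behind. The sketch below uses a small made-up frame; `df_by_row` and `row_id` are illustrative names, not part of the notebook.

```python
import pandas as pd

df = pd.DataFrame(
    {"install_type": ["government", "unknown", "commercial - agriculture"]},
    index=pd.to_datetime(["2009-12-31"] * 3),
)
df.index.name = "date_installed"

index_is_unique = df.index.is_unique          # False: three rows share one date

# A label lookup returns every row carrying that label.
same_day = df.loc[pd.Timestamp("2009-12-31")]

# reset_index turns date_installed into a column and installs a unique RangeIndex.
df_by_row = df.reset_index()
row_id = df_by_row.index[df_by_row["install_type"] == "commercial - agriculture"]
df_by_row.loc[row_id, "install_type"] = "agricultural"
```

The date index can be restored afterwards with `df_by_row.set_index("date_installed")` if the time-series view is still needed.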


In [12]:
typeMap = { 'agricultural'                : 'agricultural',
            'agriculture'                 : 'agricultural',
            'commercial'                  : 'commercial',           
            'commerical'                  : 'commercial',
            'customer'                    : 'unknown',
            'education'                   : 'educational',
            'educational'                 : 'educational',
            "gov't/np"                    : 'government',
            'government'                  : 'government',
            'institutional'               : 'nonprofit',
            'municipal'                   : 'government',
            'nonprofit'                   : 'nonprofit',
            'nonresidential'              : 'unknown',
            'not stated'                  : 'unknown',
            'public'                      : 'government',
            'residential'                 : 'residential',
            'residential/sf'              : 'residential',
            'small business'              : 'commercial',
            'unknown'                     : 'unknown',
            'utility'                     : 'utility',
            'commercial - agriculture'    : 'agricultural',
            'commercial - builders'       : 'commercial',
            'commercial - other'          : 'commercial',
            'commercial - small business' : 'commercial'}

In [13]:
dfLive.install_type.map(typeMap).count()

966720

In [14]:
dfLive.install_type.count()

966720

In [15]:
dfLive.install_type.size

1002025

In [16]:
dfLive.install_type.map(typeMap).size

1002025
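
The matching `count()` values above are the important check: `Series.map` returns a missing value for any label that has no key in the dictionary, so a lower count after mapping would mean a label was left out. A small sketch of how such labels could be listed, using made-up data where one label is deliberately absent from the mapping:

```python
import pandas as pd

s = pd.Series(["residential", "commerical", "brand new label", None])
type_map = {"residential": "residential", "commerical": "commercial"}

mapped = s.map(type_map)

# Rows that were non-null before mapping but null afterwards had no key in type_map.
lost = s.notna() & mapped.isna()
unmapped_labels = sorted(s[lost].unique())

counts_match = s.count() == mapped.count()
```

Here `counts_match` is false and `unmapped_labels` names the offending label; in the notebook both counts are 966720, so every non-null label was covered.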

In [17]:
i_type_clean = dfLive.install_type.map(typeMap)

In [18]:
dfLive_with_i_type = dfLive.assign(i_type=i_type_clean)

In [19]:
# Looks like it worked: these are the rows whose install_type changed under the mapping.
dfLive_with_i_type.loc[dfLive_with_i_type.install_type.notnull() & 
                       (dfLive_with_i_type.install_type != dfLive_with_i_type.i_type)]

date_installed,cost_per_watt,cost,size_kw,state,zipcode,city,county,install_type,new_constr,tracking,third_party,appraised,incentive,utility,tech,model,installer,bipv,i_type
2004-01-07,9.36,117980.00,12.600,NJ,07934,,Somerset,education,,,,,,,,,,,educational
2004-01-08,,0.00,2.220,WI,53570,,Green,customer,,,,,,,,,,,unknown
2004-01-15,9.21,36490.00,3.960,FL,34604,,Hernando,education,,,,,,,,,,,educational
2004-01-15,10.03,16852.00,1.680,FL,32608,,Alachua,education,,,,,,,,,,,educational
2004-01-15,10.03,16852.00,1.680,FL,32605,,Alachua,education,,,,,,,,,,,educational
2004-01-17,,0.00,2.800,WI,54234,,Door,customer,,,,,,,,,,,unknown
2004-01-30,1.62,90000.00,55.500,OH,45309,,Montgomery,commerical,,,,,,,,,,,commercial
2004-01-30,7.75,37180.00,4.800,FL,32609,,Alachua,education,,,,,,,,,,,educational
2004-02-15,2.72,3095.86,1.140,WI,54406,,Portage,customer,,,,,,,,,,,unknown
2004-02-16,14.02,14017.90,1.000,WI,54814,,Bayfield,customer,,,,,,,,,,,unknown


In [20]:
# so if I add up all the value_counts of install_types I mapped, I should get 629...
dfLive_with_i_type.loc[dfLive_with_i_type.install_type.notnull() & 
                       (dfLive_with_i_type.install_type != dfLive_with_i_type.i_type)].count()

cost_per_watt    544
cost             549
size_kw          629
state            629
zipcode          629
city               0
county           629
install_type     629
new_constr         0
tracking           0
third_party        0
appraised          0
incentive          0
utility            0
tech               0
model              0
installer          0
bipv               0
i_type           629
dtype: int64

In [21]:
mappedTypes = [key for key in typeMap if key != typeMap[key]]; mappedTypes

['agriculture',
 'commerical',
 'customer',
 'education',
 "gov't/np",
 'institutional',
 'municipal',
 'nonresidential',
 'not stated',
 'public',
 'residential/sf',
 'small business',
 'commercial - agriculture',
 'commercial - builders',
 'commercial - other',
 'commercial - small business']

In [22]:
# very nice
dfLive.install_type.value_counts()[mappedTypes].sum()

629

In [23]:
# sweet
dfLive_with_i_type.i_type.value_counts()

residential     898183
commercial       35950
unknown          19520
government        5702
nonprofit         4220
educational       2572
agricultural       349
utility            224
Name: i_type, dtype: int64
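
A stricter check than the 629 total is to confirm each consolidated category separately: summing the old counts grouped by their target label should reproduce the new counts exactly. A sketch on made-up data, relying on `Series.groupby` accepting a dict that maps index labels to group names:

```python
import pandas as pd

s = pd.Series(["residential", "residential", "education",
               "educational", "commerical", "commercial"])
type_map = {"residential": "residential", "education": "educational",
            "educational": "educational", "commerical": "commercial",
            "commercial": "commercial"}

old_counts = s.value_counts()

# Group the old labels by their target label and add up the counts.
expected = old_counts.groupby(type_map).sum()
actual = s.map(type_map).value_counts()

per_category_ok = expected.to_dict() == actual.to_dict()
```

In the real data this corresponds to, for example, 898178 `residential` plus 5 `residential/sf` giving the 898183 shown above.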

In [24]:
dfLive.columns

Index(['cost_per_watt', 'cost', 'size_kw', 'state', 'zipcode', 'city',
       'county', 'install_type', 'new_constr', 'tracking', 'third_party',
       'appraised', 'incentive', 'utility', 'tech', 'model', 'installer',
       'bipv'],
      dtype='object')

In [25]:
dfLive.install_type.value_counts()

residential                    898178
commercial                      35928
unknown                         19189
government                       5675
nonprofit                        4191
educational                      2363
agricultural                      343
customer                          326
utility                           224
education                         209
institutional                      29
commercial - other                 16
public                             15
gov't/np                           11
agriculture                         5
residential/sf                      5
nonresidential                      3
not stated                          2
commercial - small business         2
small business                      2
municipal                           1
commerical                          1
commercial - builders               1
commercial - agriculture            1
Name: install_type, dtype: int64

In [26]:
dfLive_with_i_type.install_type.value_counts()

residential                    898178
commercial                      35928
unknown                         19189
government                       5675
nonprofit                        4191
educational                      2363
agricultural                      343
customer                          326
utility                           224
education                         209
institutional                      29
commercial - other                 16
public                             15
gov't/np                           11
agriculture                         5
residential/sf                      5
nonresidential                      3
not stated                          2
commercial - small business         2
small business                      2
municipal                           1
commerical                          1
commercial - builders               1
commercial - agriculture            1
Name: install_type, dtype: int64

In [27]:
dfLive_with_i_type.columns

Index(['cost_per_watt', 'cost', 'size_kw', 'state', 'zipcode', 'city',
       'county', 'install_type', 'new_constr', 'tracking', 'third_party',
       'appraised', 'incentive', 'utility', 'tech', 'model', 'installer',
       'bipv', 'i_type'],
      dtype='object')

In [28]:
dfLive_with_i_type.i_type.value_counts()

residential     898183
commercial       35950
unknown          19520
government        5702
nonprofit         4220
educational       2572
agricultural       349
utility            224
Name: i_type, dtype: int64

In [29]:
# drop the original install_type column now that i_type replaces it,
# reassign dfLive to the cleaned frame,
# and for sure save the data...
dfLive = dfLive_with_i_type.drop('install_type', axis='columns')

In [30]:
dfLive.to_csv('../local/data/20180101.csv')

In [31]:
dfLive.zipcode[:24]

date_installed
2004-01-01    37397
2004-01-01    55407
2004-01-01    95616
2004-01-01    92504
2004-01-01    83115
2004-01-01    94707
2004-01-01    59101
2004-01-01    95819
2004-01-01    91711
2004-01-01    92833
2004-01-01    96725
2004-01-01    90804
2004-01-01    91024
2004-01-01    93453
2004-01-01    37771
2004-01-01    91711
2004-01-01    92399
2004-01-01    92646
2004-01-01    59711
2004-01-01    93722
2004-01-01    59718
2004-01-01    95037
2004-01-01    93465
2004-01-01    95452
Name: zipcode, dtype: object
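
The zipcodes are worth a look because they only survive as strings while the frame is in memory. When the saved CSV is read back, the `dtype` argument is needed again, otherwise pandas is likely to parse the column as integers and drop leading zeros. A sketch with an in-memory buffer standing in for the file (the two rows are illustrative):

```python
import io
import pandas as pd

df = pd.DataFrame(
    {"zipcode": ["01701", "95616"], "i_type": ["government", "residential"]},
    index=pd.to_datetime(["2009-12-31", "2009-12-31"]),
)
df.index.name = "date_installed"

buffer = io.StringIO()
df.to_csv(buffer)

# Without a dtype hint the zipcode column is inferred as integers.
buffer.seek(0)
naive = pd.read_csv(buffer, index_col="date_installed", parse_dates=True)

# With the hint the strings, including the leading zero, are preserved.
buffer.seek(0)
typed = pd.read_csv(buffer, index_col="date_installed", parse_dates=True,
                    dtype={"zipcode": str})
```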

In [32]:
dfLive.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1002025 entries, 2004-01-01 to 2015-12-31
Data columns (total 18 columns):
cost_per_watt    745516 non-null float64
cost             745629 non-null float64
size_kw          1002025 non-null float64
state            1002025 non-null object
zipcode          1002025 non-null object
city             788849 non-null object
county           980790 non-null object
new_constr       27098 non-null float64
tracking         1920 non-null float64
third_party      306989 non-null float64
appraised        223431 non-null object
incentive        788415 non-null object
utility          783186 non-null object
tech             580399 non-null object
model            580399 non-null object
installer        694390 non-null object
bipv             5252 non-null float64
i_type           966720 non-null object
dtypes: float64(7), object(11)
memory usage: 165.3+ MB
