### Part 5.5 of data cleaning

#### Agenda

* 20180109 Okay up to here.  Now work through this again
* re-index so I can get at the data to drop
* Get rid of 29925 rows with no installation type.  They are not helpful since my project is focused on residential installations.  I might be able to statistically infer which ones are residential, but that would be a different kind of project. 

* Get rid of duplicate data.  What does that mean exactly?

    1. strict duplicates - two or more rows where every column contains exactly the same data
    2. strict duplicates with relaxed floating point comparisons ($|r_1(var_n) - r_2(var_n)| < \epsilon$ for some small $\epsilon$)
    3. subset duplicates - these are rows that have the most significant fields equal and differences or missing data in the other fields.  My theory on this is that they come from multiple loads of similar datasets (perhaps from different sources).
    4. subset duplicates with relaxed floating point comparisons 
    5. other - typos, wrong dates, bad math
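
To make the first few categories concrete, here is a minimal sketch on a made-up six-row frame (the `toy` name and its values are illustrative, not the real dataset). Rounding before comparing is only an approximation of an epsilon test, since two values that are very close can still round to different numbers.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'cost':          [50000.0, 50000.0, 18278.0, 18278.0, 34000.0, 34000.0],
    'size_kw':       [5.0, 5.0, 2.76, 2.76, 5.67, 5.67],
    'cost_per_watt': [10.0, 10.0, 6.62, 6.622464, 6.0, 5.996473],
    'city':          ['Hauppauge', 'Hauppauge', 'Waltham', 'Waltham', None, 'Rochester'],
})

# 1. strict duplicates: every column identical (rows 0 and 1)
strict = toy.duplicated(keep=False)

# 2. strict duplicates with a relaxed float comparison, approximated here
#    by rounding the float column before comparing (adds rows 2 and 3)
relaxed = toy.round({'cost_per_watt': 2}).duplicated(keep=False)

# 3. subset duplicates: only the most significant fields must match
#    (adds rows 4 and 5, which differ in city and cost_per_watt)
subset = toy.duplicated(subset=['cost', 'size_kw'], keep=False)

print(strict.tolist())
print(relaxed.tolist())
print(subset.tolist())
```

Each step flags a superset of the previous one, which is why the subset comparison is the one most likely to sweep up rows that are not really the same installation.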

The dataset loaded by this code contains ~745k rows.  The changes leading to the initial reduction are summarized in the table below:

file                   | beginning rows | end rows | output file      | comments
-----------------------|----------------|----------|------------------|----------------------------------------------
d_q_and_c_1            | 1020524        | 1020516  |                  | delete rows with null indices (drop 8 rows)
d_q_and_c_2            | 1020516        | 1020516  |                  | no changes
d_q_and_c_3            | 1020516        | 1002025  | live20171229.csv | save data from 2004-2015 inclusive (drop 18491 rows)
d_q_and_c_install_type | 1002025        | 1002025  | 20180101.csv     | clean install_type; no change in rows
d_q_and_c_5            | 1002025        | 745362   | 20180105         | drop rows without all of size_kw, cost, cost_per_watt (drop 256663 rows)
d_q_and_c_5.5          | 745362         | 616144   | 20180109.csv     | drop rows with null i_type, exact duplicates, and partial duplicates (drop 129218 rows)


In [1]:
# %load ../pycode/setup.py
# set up
import numpy as np
import pandas as pd
import random
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

%matplotlib inline

def ecdf(data):
    '''Compute ECDF for a one-dimensional array of measurements.'''
    n = len(data)
    x = np.sort(data)
    y = np.arange(1, n + 1) / n
    return x, y

def min015099max(series, minmax=False):
    '''Return [1%, median, 99%] of series; with minmax=True, return [min, 1%, median, 99%, max].'''
    vals = list(np.percentile(series, [1.0, 50.0, 99.0]))
    if minmax:
        vals.insert(0, series.min())
        vals.append(series.max())
    return vals
# ss = np.arange(1, 101)
# min015099max(ss, minmax=True)

def mid98(series):
    '''Return the middle 98% of a pandas Series (values strictly between the 1st and 99th percentiles).'''
    bounds = series.quantile([0.01, 0.99])
    return series[(series > bounds.values[0]) & (series < bounds.values[1])]

# ss = pd.Series(np.arange(1, 101))
# mid98(ss)


In [2]:
# load dataset
# read zipcode as a string column; np.object was removed from recent NumPy, plain object is equivalent
dfLive = pd.read_csv('../local/data/20180105', index_col='date_installed',
                     parse_dates=True, dtype={'zipcode': object})

In [3]:
# how big? 745k.  how many nulls?
dfLive.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 745362 entries, 2004-01-01 to 2015-12-31
Data columns (total 19 columns):
Unnamed: 0       745362 non-null int64
cost_per_watt    745362 non-null float64
cost             745362 non-null float64
size_kw          745362 non-null float64
state            745362 non-null object
zipcode          745362 non-null object
city             552240 non-null object
county           732641 non-null object
new_constr       23813 non-null float64
tracking         1705 non-null float64
third_party      239541 non-null float64
appraised        146655 non-null object
incentive        552618 non-null object
utility          550246 non-null object
tech             421871 non-null object
model            421871 non-null object
installer        470553 non-null object
bipv             3897 non-null float64
i_type           715437 non-null object
dtypes: float64(7), int64(1), object(11)
memory usage: 113.7+ MB


In [4]:
# drop column that's about to be redundant
dfLive.drop('Unnamed: 0', axis='columns', inplace=True)

In [5]:
# see what we've got 
dfLive.head()

date_installed,cost_per_watt,cost,size_kw,state,zipcode,city,county,new_constr,tracking,third_party,appraised,incentive,utility,tech,model,installer,bipv,i_type
2004-01-01,6.15,4000.0,0.65,TN,37397,,Marion,,,,,,,,,,,
2004-01-01,5.74,195000.0,34.0,MN,55407,,Hennepin,,,,,,,,,,,commercial
2004-01-01,7.3887,21944.44,2.97,CA,95616,Davis,Yolo,,,,,California Energy Commission (Emerging Renewab...,Pacific Gas & Electric Company,,,Davis Lumber & Hardware Co,,
2004-01-01,10.227273,27000.0,2.64,CA,92504,Riverside,Riverside,,,,,California Energy Commission (Emerging Renewab...,Southern California Edison,,,Future Air & Windows,,
2004-01-01,7.187847,32410.0,4.509,CA,94707,Kensington,Alameda,,,,,California Energy Commission (Emerging Renewab...,Pacific Gas & Electric Company,Poly,ND-167U1,Wencon Development,,


In [6]:
# index by row_id
dfLive.reset_index(inplace=True)

In [7]:
# look at index and columns
dfLive.head()

,date_installed,cost_per_watt,cost,size_kw,state,zipcode,city,county,new_constr,tracking,third_party,appraised,incentive,utility,tech,model,installer,bipv,i_type
0,2004-01-01,6.15,4000.0,0.65,TN,37397,,Marion,,,,,,,,,,,
1,2004-01-01,5.74,195000.0,34.0,MN,55407,,Hennepin,,,,,,,,,,,commercial
2,2004-01-01,7.3887,21944.44,2.97,CA,95616,Davis,Yolo,,,,,California Energy Commission (Emerging Renewab...,Pacific Gas & Electric Company,,,Davis Lumber & Hardware Co,,
3,2004-01-01,10.227273,27000.0,2.64,CA,92504,Riverside,Riverside,,,,,California Energy Commission (Emerging Renewab...,Southern California Edison,,,Future Air & Windows,,
4,2004-01-01,7.187847,32410.0,4.509,CA,94707,Kensington,Alameda,,,,,California Energy Commission (Emerging Renewab...,Pacific Gas & Electric Company,Poly,ND-167U1,Wencon Development,,


In [8]:
# rename the index for clarity
dfLive.rename_axis('row_id', inplace=True)

In [9]:
dfLive.head()

row_id,date_installed,cost_per_watt,cost,size_kw,state,zipcode,city,county,new_constr,tracking,third_party,appraised,incentive,utility,tech,model,installer,bipv,i_type
0,2004-01-01,6.15,4000.0,0.65,TN,37397,,Marion,,,,,,,,,,,
1,2004-01-01,5.74,195000.0,34.0,MN,55407,,Hennepin,,,,,,,,,,,commercial
2,2004-01-01,7.3887,21944.44,2.97,CA,95616,Davis,Yolo,,,,,California Energy Commission (Emerging Renewab...,Pacific Gas & Electric Company,,,Davis Lumber & Hardware Co,,
3,2004-01-01,10.227273,27000.0,2.64,CA,92504,Riverside,Riverside,,,,,California Energy Commission (Emerging Renewab...,Southern California Edison,,,Future Air & Windows,,
4,2004-01-01,7.187847,32410.0,4.509,CA,94707,Kensington,Alameda,,,,,California Energy Commission (Emerging Renewab...,Pacific Gas & Electric Company,Poly,ND-167U1,Wencon Development,,


#### Reindexing is done.  Now take care of rows with null i_type.

In [10]:
# how many?
dfLive.i_type.isnull().sum()

29925

In [11]:
# drop them
dfLive.drop(dfLive[dfLive.i_type.isnull()].index, inplace=True)

In [12]:
# all gone
dfLive.i_type.isnull().sum()

0
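
For reference, `DataFrame.dropna` with a `subset` argument is another way to drop rows with a null in one column. A minimal sketch on a made-up frame (the `toy` and `kept` names are illustrative; only the `i_type` column name comes from this notebook):

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration.
toy = pd.DataFrame({
    'size_kw': [0.65, 34.0, 2.97],
    'i_type':  [np.nan, 'commercial', 'residential'],
})

# Drop rows whose i_type is missing; nulls in other columns are left alone.
kept = toy.dropna(subset=['i_type'])

print(len(toy) - len(kept), 'row(s) dropped')
```

As with the `drop(...)` call above, the surviving rows keep their original index labels.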

In [13]:
# down to 715k
dfLive.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 715437 entries, 1 to 745361
Data columns (total 19 columns):
date_installed    715437 non-null datetime64[ns]
cost_per_watt     715437 non-null float64
cost              715437 non-null float64
size_kw           715437 non-null float64
state             715437 non-null object
zipcode           715437 non-null object
city              526210 non-null object
county            702938 non-null object
new_constr        21238 non-null float64
tracking          1635 non-null float64
third_party       239336 non-null float64
appraised         146051 non-null object
incentive         525525 non-null object
utility           523209 non-null object
tech              408436 non-null object
model             408436 non-null object
installer         457974 non-null object
bipv              3126 non-null float64
i_type            715437 non-null object
dtypes: datetime64[ns](1), float64(7), object(11)
memory usage: 109.2+ MB


####  Having dropped all rows with null i_type we have 715437 rows left.

#### Now on to duplicates.

There are several kinds of duplicates.  Exact dups have all fields exactly the same.

In [14]:
# every field exactly the same as another row?
completeDups = dfLive.duplicated(keep=False)

In [15]:
# how many? 34k
len(dfLive[completeDups])

34124

In [16]:
# have a look at some of them
# everything is the same in these. I've got 2, 3, and 4 dups just in this little sample
dfLive[completeDups].head(15)

row_id,date_installed,cost_per_watt,cost,size_kw,state,zipcode,city,county,new_constr,tracking,third_party,appraised,incentive,utility,tech,model,installer,bipv,i_type
282,2004-01-13,10.0,50000.0,5.0,NY,11788,Hauppauge,Suffolk,,,,,New York State Energy Research and Development...,PSEG Long Island,,,,,residential
283,2004-01-13,10.0,50000.0,5.0,NY,11788,Hauppauge,Suffolk,,,,,New York State Energy Research and Development...,PSEG Long Island,,,,,residential
681,2004-01-26,9.26,26656.0,2.88,CA,95627,,Yolo,,,,,,,,,,,residential
701,2004-01-26,9.26,26656.0,2.88,CA,95627,,Yolo,,,,,,,,,,,residential
805,2004-02-01,13.81,319000.0,23.1,CT,6141,,,,,,,,,,,,,unknown
806,2004-02-01,13.81,319000.0,23.1,CT,6141,,,,,,,,,,,,,unknown
894,2004-02-04,6.08,19454.0,3.2,CA,94303,,Santa Clara,,,,,,,,,,,residential
899,2004-02-04,6.08,19454.0,3.2,CA,94303,,Santa Clara,,,,,,,,,,,residential
901,2004-02-04,6.08,19454.0,3.2,CA,94303,,Santa Clara,,,,,,,,,,,residential
917,2004-02-04,6.08,19454.0,3.2,CA,94303,,Santa Clara,,,,,,,,,,,residential


In [17]:
# same kind of thing here
dfLive[completeDups].tail(16)

row_id,date_installed,cost_per_watt,cost,size_kw,state,zipcode,city,county,new_constr,tracking,third_party,appraised,incentive,utility,tech,model,installer,bipv,i_type
744328,2015-12-31,3.0,592920.0,197.64,NY,12010,Amsterdam,Montgomery,,,1.0,,New York State Energy Research and Development...,National Grid,Poly,multiple matches,,,commercial
744369,2015-12-31,3.0,592920.0,197.64,NY,12010,Amsterdam,Montgomery,,,1.0,,New York State Energy Research and Development...,National Grid,Poly,multiple matches,,,commercial
744397,2015-12-31,4.019637,268994.14,66.92,NY,12306,Schenectady,Schenectady,,,1.0,,New York State Energy Research and Development...,National Grid,Poly,multiple matches,,,commercial
744401,2015-12-31,4.019637,268994.14,66.92,NY,12306,Schenectady,Schenectady,,,1.0,,New York State Energy Research and Development...,National Grid,Poly,multiple matches,,,commercial
744404,2015-12-31,4.019637,268994.14,66.92,NY,12306,Schenectady,Schenectady,,,1.0,,New York State Energy Research and Development...,National Grid,Poly,multiple matches,,,commercial
744418,2015-12-31,4.019637,268994.14,66.92,NY,12306,Schenectady,Schenectady,,,1.0,,New York State Energy Research and Development...,National Grid,Poly,multiple matches,,,commercial
744479,2015-12-31,4.019637,268994.14,66.92,NY,12306,Schenectady,Schenectady,,,1.0,,New York State Energy Research and Development...,National Grid,Poly,multiple matches,,,commercial
744521,2015-12-31,5.119934,31206.0,6.095,CA,92154,SAN DIEGO,San Diego,,,,True,California Public Utilities Commission (Non-CS...,San Diego Gas & Electric Company,Poly,KU265-6ZPA,SolarCity,,residential
744529,2015-12-31,5.119934,31206.0,6.095,CA,92154,SAN DIEGO,San Diego,,,,True,California Public Utilities Commission (Non-CS...,San Diego Gas & Electric Company,Poly,KU265-6ZPA,SolarCity,,residential
744852,2015-12-31,3.983456,5500.0,1.380711,CA,91780,Temple City,Los Angeles,,,,,California Public Utilities Commission (Non-CS...,Southern California Edison,Mono,SW 275 Mono Black,PetersenDean,,residential


In [18]:
# okay, let's get rid of those extra entries
dfNoCompleteDups = dfLive.drop_duplicates()

In [19]:
# down to 694k
dfNoCompleteDups.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 693961 entries, 1 to 745361
Data columns (total 19 columns):
date_installed    693961 non-null datetime64[ns]
cost_per_watt     693961 non-null float64
cost              693961 non-null float64
size_kw           693961 non-null float64
state             693961 non-null object
zipcode           693961 non-null object
city              514129 non-null object
county            685144 non-null object
new_constr        14855 non-null float64
tracking          1615 non-null float64
third_party       236320 non-null float64
appraised         143876 non-null object
incentive         513445 non-null object
utility           511173 non-null object
tech              398154 non-null object
model             398154 non-null object
installer         447607 non-null object
bipv              2425 non-null float64
i_type            693961 non-null object
dtypes: datetime64[ns](1), float64(7), object(11)
memory usage: 105.9+ MB


In [20]:
# any more? nope
dfNoCompleteDups.duplicated().sum()

0

In [21]:
dfLive = dfNoCompleteDups

#### Having dropped all exact duplicates, we have 693,961 rows.
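
The 34,124 figure from `duplicated(keep=False)` counts every member of each duplicate group, while `drop_duplicates()` keeps one row per group, so fewer rows are actually removed: 715437 - 693961 = 21476. A minimal sketch of that difference on made-up values (the `toy`, `flagged`, `removed` and `groups` names are illustrative):

```python
import pandas as pd

# Toy frame, invented for illustration: one group of 2, one group of 4, one singleton.
toy = pd.DataFrame({'cost': [1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 3.0]})

# keep=False marks every member of every duplicate group
flagged = int(toy.duplicated(keep=False).sum())

# drop_duplicates() removes all but the first member of each group
removed = len(toy) - len(toy.drop_duplicates())

# each group keeps exactly one row, so the difference is the number of groups
groups = flagged - removed

print(flagged, removed, groups)
```

By the same arithmetic, 34124 flagged rows minus 21476 removed rows would put the real data at 12648 duplicate groups.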

In [22]:
dfLive.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 693961 entries, 1 to 745361
Data columns (total 19 columns):
date_installed    693961 non-null datetime64[ns]
cost_per_watt     693961 non-null float64
cost              693961 non-null float64
size_kw           693961 non-null float64
state             693961 non-null object
zipcode           693961 non-null object
city              514129 non-null object
county            685144 non-null object
new_constr        14855 non-null float64
tracking          1615 non-null float64
third_party       236320 non-null float64
appraised         143876 non-null object
incentive         513445 non-null object
utility           511173 non-null object
tech              398154 non-null object
model             398154 non-null object
installer         447607 non-null object
bipv              2425 non-null float64
i_type            693961 non-null object
dtypes: datetime64[ns](1), float64(7), object(11)
memory usage: 105.9+ MB


##### Now for the more insidious kind of dup: all of the important fields are the same, but there are differences in the other fields.

These columns are the most important: ['date_installed', 'cost_per_watt', 'cost', 'size_kw', 'state', 'county', 'zipcode'].  I'm *not* going to use cost_per_watt because I already know that some of these rows differ only in the rightmost decimal places of that field. One theory is that the values were entered (or calculated from cost and size_kw) at different times with different precision.
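
The precision theory can be spot-checked on a couple of pairs. In the sample rows displayed in this notebook, `cost_per_watt` appears to equal `cost / (size_kw * 1000)`; that relationship is inferred from the displayed values rather than documented, and the tolerance below is an arbitrary choice. A minimal sketch:

```python
import numpy as np
import pandas as pd

# Two partial-duplicate pairs copied from the sample output in this notebook.
pairs = pd.DataFrame({
    'cost':          [18278.0, 18278.0, 34000.0, 34000.0],
    'size_kw':       [2.76, 2.76, 5.67, 5.67],
    'cost_per_watt': [6.62, 6.622464, 6.0, 5.996473],
})

# Assumption: cost_per_watt is dollars per watt, i.e. cost / (size_kw * 1000).
recomputed = pairs['cost'] / (pairs['size_kw'] * 1000)

# Assumption: a half-cent absolute tolerance covers values rounded to two decimals.
matches = np.isclose(pairs['cost_per_watt'], recomputed, rtol=0, atol=0.005)

print(recomputed.round(6).tolist())
print(matches)
```

If every stored value matches its recomputed value within the tolerance, the two-decimal and six-decimal versions are consistent with the same underlying cost and size, which supports leaving cost_per_watt out of the comparison key.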

In [23]:
# partially duplicated rows
pdups = dfLive.duplicated(subset=['date_installed', 'cost', 'size_kw', 'state', 'county', 'zipcode'], keep=False)

In [24]:
# how many?  154k.  That's a lot of rows
pdups.sum()

154277

##### Let's take a look at some of these.

Having looked at these, I'm convinced that almost every row, if not every row, represents the same installation as another row.
Pairs seem to differ mostly in the 'city' column and in the number of columns with missing data.

Some also differ only in the columns [incentive, utility, tech, model, installer, bipv].

To move ahead, we keep the first entry and delete the rest.  The downside is that we may lose some distinct installations, but that is likely to be quite a small percentage.  We may also lose some descriptive data fields, since the first entry is not always the most complete one.  If there were a way to choose between duplicates, I would specify keeping the row with less missing data.  These fields are secondary to the purpose of this project.
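
One candidate way to prefer the more complete row, sketched on toy data and not used in this notebook: count the nulls in each row, sort so that fuller rows come first, and only then drop duplicates on the key columns. The helper name `drop_partial_dups` and the temporary `_n_missing` column are hypothetical.

```python
import pandas as pd

KEY = ['date_installed', 'cost', 'size_kw', 'state', 'county', 'zipcode']

def drop_partial_dups(df, key=KEY):
    """Keep, for each key, the row with the fewest missing values (hypothetical helper)."""
    ranked = df.assign(_n_missing=df.isnull().sum(axis=1))
    # mergesort is stable, so rows with equal null counts keep their original order
    ranked = ranked.sort_values('_n_missing', kind='mergesort')
    kept = ranked.drop_duplicates(subset=key, keep='first')
    return kept.drop(columns='_n_missing').sort_index()

# Toy pair modelled on one pair from the partial-duplicate sample in this notebook.
toy = pd.DataFrame({
    'date_installed': pd.to_datetime(['2004-01-05', '2004-01-05']),
    'cost': [18278.0, 18278.0],
    'size_kw': [2.76, 2.76],
    'state': ['MA', 'MA'],
    'county': ['Middlesex', 'Middlesex'],
    'zipcode': ['2451', '2451'],
    'city': [None, 'Waltham'],
}, index=[19, 21])

result = drop_partial_dups(toy)
print(result.index.tolist())
```

With plain `keep='first'` the sparser row 19 would survive; with the null-count sort, the row that has a city is kept instead.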

In [25]:
sortedPdups = dfLive[pdups].sort_values(['date_installed', 'state', 'zipcode', 'county'])
sortedPdups.head(12)

row_id,date_installed,cost_per_watt,cost,size_kw,state,zipcode,city,county,new_constr,tracking,third_party,appraised,incentive,utility,tech,model,installer,bipv,i_type
19,2004-01-05,6.62,18278.0,2.76,MA,2451,,Middlesex,,,,,,,,,,,residential
21,2004-01-05,6.622464,18278.0,2.76,MA,2451,Waltham,Middlesex,,,,,Massachusetts Clean Energy Center,NSTAR (DBA EverSource),,,Conservation Services Group (Csg),,residential
20,2004-01-05,6.0,34000.0,5.67,MA,2770,,Plymouth,,,,,,,,,,,residential
22,2004-01-05,5.996473,34000.0,5.67,MA,2770,Rochester,Plymouth,,,,,Massachusetts Clean Energy Center,NSTAR (DBA EverSource),,,Alternate Energy Systems,,residential
32,2004-01-07,6.7,16953.0,2.53,MA,2474,,Middlesex,,,,,,,,,,,residential
46,2004-01-07,6.700791,16953.0,2.53,MA,2474,Arlington,Middlesex,,,,,Massachusetts Clean Energy Center,NSTAR (DBA EverSource),,,Conservation Services Group (Csg),,residential
49,2004-01-07,9.36,117980.0,12.6,NJ,7934,,Somerset,,,,,,,,,,,educational
74,2004-01-07,9.363492,117980.0,12.6,NJ,7934,GLADSTONE,Somerset,,,,,New Jersey Board of Public Utilities (CORE & R...,JCP&L,,,Solar Energy Systems,,educational
78,2004-01-08,12.8,43000.0,3.36,CA,90005,,Los Angeles,,,,,,,,,,,residential
120,2004-01-08,12.797619,43000.0,3.36,CA,90005,Los Angeles,Los Angeles,,,,,Los Angeles Department of Water & Power,Los Angeles Department of Water & Power,,,,,residential


In [26]:
sortedPdups.city.isnull().sum()

57584

In [27]:
sortedPdups.city.notnull().sum()

96693

In [28]:
# partial dups that do have a city
blee = sortedPdups[sortedPdups.city.notnull()]

In [29]:
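# of those, the ones still duplicated on the key columns, i.e. groups where at least two rows have a city
bleeblop = blee[blee.duplicated(subset=['date_installed', 'cost', 'size_kw', 'state', 'county', 'zipcode'], keep=False)]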

In [30]:
bleeblop.head()

row_id,date_installed,cost_per_watt,cost,size_kw,state,zipcode,city,county,new_constr,tracking,third_party,appraised,incentive,utility,tech,model,installer,bipv,i_type
4618,2004-05-20,9.174286,19266.0,2.1,NJ,8401,ATLANTIC CITY,Atlantic,,,,,New Jersey Board of Public Utilities (CORE & R...,Atlantic City Electric Co,,,Ecological Systems,,commercial
4619,2004-05-20,9.174286,19266.0,2.1,NJ,8401,ATLANTIC CITY,Atlantic,,,,,New Jersey Board of Public Utilities (CORE & R...,Conectiv,,,Ecological Systems,,commercial
6295,2004-07-26,7.95,75127.5,9.45,NJ,8820,EDISON,Middlesex,,,,,New Jersey Board of Public Utilities (CORE & R...,PSE&G BPU,Mono,multiple matches,Sunlit Systems,,nonprofit
6318,2004-07-26,7.95,75127.5,9.45,NJ,8820,EDISON,Middlesex,,,,,New Jersey Board of Public Utilities (CORE & R...,PSE&G BPU,Mono,NT-175U1,Sunlit Systems,,nonprofit
14875,2005-06-08,11.232071,44479.0,3.96,MN,55051,Mora,Kanabec,,1.0,,,Minnesota Department of Commerce,Mora Municipal,,,,,residential


In [31]:
# For my own edification, I'll save these to a csv file.
# Each row should have exactly the same values for ['date_installed', 'cost', 'size_kw', 'state', 'county', 'zipcode'] as
# another row.  Since they're sorted, the "dups" are adjacent (or at least very close) in the file.
sortedPdups.to_csv('./partial_dups.csv')

In [32]:
# keep the first row of each partial-duplicate group and drop the rest
dfLive.drop_duplicates(subset=['date_installed', 'cost', 'size_kw', 'state', 'county', 'zipcode'],
                       keep='first', inplace=True)

In [33]:
# down to 616k
dfLive.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 616144 entries, 1 to 745361
Data columns (total 19 columns):
date_installed    616144 non-null datetime64[ns]
cost_per_watt     616144 non-null float64
cost              616144 non-null float64
size_kw           616144 non-null float64
state             616144 non-null object
zipcode           616144 non-null object
city              465435 non-null object
county            607416 non-null object
new_constr        13314 non-null float64
tracking          1445 non-null float64
third_party       211824 non-null float64
appraised         125678 non-null object
incentive         464753 non-null object
utility           462605 non-null object
tech              368886 non-null object
model             368886 non-null object
installer         404024 non-null object
bipv              2010 non-null float64
i_type            616144 non-null object
dtypes: datetime64[ns](1), float64(7), object(11)
memory usage: 94.0+ MB


In [34]:
# double check
dfLive.duplicated(subset=['date_installed', 'cost', 'size_kw', 'state', 'county', 'zipcode'], keep=False).sum()

0

#### Having dropped a set of partial duplicates we are now at 616144 rows
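
As a quick arithmetic check, the row counts reported by the cells above can be tallied step by step. The `steps` and `dropped` names are illustrative; the numbers are the ones printed in this notebook.

```python
# Row counts reported by dfLive.info() after each step in this notebook.
steps = [
    ('loaded from 20180105',          745362),
    ('dropped rows with null i_type', 715437),
    ('dropped exact duplicates',      693961),
    ('dropped partial duplicates',    616144),
]

previous = steps[0][1]
dropped = []
for label, rows in steps:
    # rows removed by this step relative to the previous one
    dropped.append(previous - rows)
    print(f'{label:32s} {rows:>8d}  (-{previous - rows})')
    previous = rows

print('total dropped:', sum(dropped))
```

The first difference reproduces the 29925 null `i_type` rows counted earlier, and the total is the gap between the 745362 rows loaded and the 616144 rows saved.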

In [35]:
dfLive.to_csv('../local/data/20180109.csv')

### Stopping here.  Cells below merged and commented out because I don't trust that code to be doing what I want.  More tomorrow.


#### Are there still dups with only minor differences in cost/size_kw??

In [36]:
# # apparently there are.  THIS IS BROKEN!!!  Well actually not.  The comment is broken.
# # this line returns a boolean series that can be used to index dfLive return items where the subset is duplicated.
# # that doesn't mean minor differences in cost, could be any difference in cost.

# # hmmm, on second thought if the size is the same, should the cost be the same (if it's actually a dup)?
# # maybe not.  imagine two guys across the street.  One pays x, the other pays y.  They are flagged as a duplicate pair


# dfLive.duplicated(subset=['date_installed', 'size_kw', 'state', 'county', 'zipcode'], keep=False).sum()

# # grab them
# thing = dfLive[dfLive.duplicated(subset=['date_installed', 'size_kw', 'state', 'county', 'zipcode'], keep=False)]

# # have a look.  are they really dups??  I don't think they are
# thing.head(20)

# thingSorted = thing.sort_values(by=['date_installed', 'state', 'size_kw', 'cost', 'county'])

# thingSorted.to_csv('./sortaDups.csv')

# thing.groupby([thing.index.year]).row_id.count().plot()

# thing.groupby([thing.index.year, thing.index.month]).row_id.count().plot()

# thing.groupby([thing.index.year, thing.index.month]).cost_per_watt.median().plot()

# thing.to_csv('../local/data/thing0108')

# len(dfLive.incentive.value_counts())
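
A possible way to approach the question in this section, sketched on toy data and not run against the real dataset: group on the key without `cost`, then measure how far apart the costs are within each group. The 1% threshold and the `cost_spread`, `group_size` and `near_dups` names are assumptions, and a small spread still does not prove that two rows are the same installation (two neighbours can pay similar prices).

```python
import pandas as pd

KEY_NO_COST = ['date_installed', 'size_kw', 'state', 'county', 'zipcode']

# Toy rows, invented for illustration.
toy = pd.DataFrame({
    'date_installed': pd.to_datetime(['2010-03-01'] * 4 + ['2010-03-02']),
    'size_kw': [3.0, 3.0, 5.0, 5.0, 4.0],
    'state': ['CA'] * 5,
    'county': ['Yolo'] * 5,
    'zipcode': ['95616'] * 5,
    'cost': [21000.0, 21000.5, 30000.0, 42000.0, 25000.0],
})

grouped = toy.groupby(KEY_NO_COST)['cost']

# Relative spread of cost within each group: (max - min) / min
cost_spread = (grouped.transform('max') - grouped.transform('min')) / grouped.transform('min')
group_size = grouped.transform('count')

# Candidates: rows in groups of 2+ whose costs differ, but by less than 1%
near_dups = toy[(group_size > 1) & (cost_spread > 0) & (cost_spread < 0.01)]
print(near_dups.index.tolist())
```

Rows with a null in any key column (county has nulls in this dataset) are left out of the groups by default; passing `dropna=False` to `groupby` in pandas 1.1 or later keeps them.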