###### Note: This file reads TTS.csv but does not write.

#### Figure out how to save the data and then read back in so that it has the right dtypes and values

Lots of columns are mis-coded and/or mis-typed.

I have (note: you only see this if you look at the raw text):

1. floats that contain 1.0, 0.0, -9999 for boolean/NaN
1. ints that contain 1, 0, -9999 for boolean/NaN
1. strings that contain "1", "0", "-9999" for boolean/NaN
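
As a rough sketch of how the three encodings above could be brought to one representation, the helper below maps 1/0/-9999 to True/False/missing using pandas' nullable `"boolean"` dtype. The function name `to_nullable_bool` and the sample frame are made up for illustration; only the -9999 code comes from the data.

```python
import pandas as pd

def to_nullable_bool(col, missing=-9999):
    """Map 1/0/missing (stored as float, int, or str) to True/False/<NA>."""
    numeric = pd.to_numeric(col)                 # "1" -> 1, 1.0 stays 1.0
    numeric = numeric.where(numeric != missing)  # the missing code becomes NaN
    unexpected = numeric.notna() & ~numeric.isin([0, 1])
    if unexpected.any():
        raise ValueError("found values other than 0, 1, or the missing code")
    return numeric.astype("boolean")             # nullable boolean keeps the gaps

# Hypothetical sample: the same flag stored three different ways.
raw = pd.DataFrame({
    "as_float": [1.0, 0.0, -9999.0],
    "as_int": [1, 0, -9999],
    "as_str": ["1", "0", "-9999"],
})
clean = pd.DataFrame({name: to_nullable_bool(raw[name]) for name in raw.columns})
print(clean.dtypes)
```

Raising on anything other than 0, 1, or the missing code is a design choice here: a column that is not really a flag fails loudly instead of being silently coerced.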

Here are the variables we definitely need to keep:


name                           | dType          | numUnique | vals
-------------------------------|----------------|-----------|-----
index                          | int64          | 100000    | NaN
Data Provider                  | object         | 21        | NaN
System ID (from Data Provider) | object         | 96348     | NaN
System ID (Tracking the Sun)   | object         | 100000    | NaN
Installation Date              | datetime64[ns] | 3309      | NaN
System Size                    | float64        | 4637      | NaN
Total Installed Price          | float64        | 32343     | NaN
Appraised Value Flag           | bool           | 2         | [False, True]
Customer Segment               | object         | 7         | [RES, NON-RES, COM, SCHOOL, GOV, NON-PROFIT]
New Construction               | float64        | 3         | [0.0, 1.0]
Ground Mounted                 | float64        | 3         | [0.0, 1.0]
Battery System                 | float64        | 3         | [0.0, 1.0]
Zip Code                       | float64        | 1024      | NaN
City                           | object         | 723       | NaN
County                         | object         | 84        | NaN
State                          | object         | 3         | [AZ, CA, AR]
Third-Party Owned              | float64        | 3         | [1.0, 0.0]
Microinverter                  | float64        | 3         | [0.0, 1.0]
DC Optimizer                   | float64        | 3         | [0.0, 1.0]
  


Some other variables form groups that we likely won't use.

1. Module Group
```
    Module Manufacturer #1	object	99
    Module Manufacturer #2	object	147
    Module Manufacturer #3	object	112
    Module Model #1	        object	1122
    Module Model #2	        object	168
    Module Model #3	        object	45
    Module Technology #1	object	13
    Module Technology #2	object	10
    Module Technology #3	object	6
    Module Efficiency #1	float64	655
    Module Efficiency #2	float64	136
```
2. Inverter Group
```
    Inverter Manufacturer	object	59	NaN
    Inverter Model	object	384	NaN
```
3. BIPV Group
```
    BIPV Module #1	float64	3	[0.0, 1.0]
    BIPV Module #2	float64	3	[0.0, 1.0]
    BIPV Module #3	float64	3	[0.0, 1.0]
```
4. Financial Group
```
    Sales Tax Cost	float64	1994
    Rebate or Grant	float64	8802
    Performance-Based Incentive (Annual Payment)	float64	789
    Performance-Based Incentives (Duration)	int64	3
    Feed-in Tariff (Annual Payment)	float64	1
    Feed-in Tariff (Duration)	int64	1
```
1. Tilt/Azimuth Group
```
Azimuth #1	float64	156	NaN
Azimuth #2	float64	130	NaN
Azimuth #3	float64	54	NaN
Tilt #1	float64	50	NaN
Tilt #2	float64	29	NaN
Tilt #3	float64	26	NaN
```
1. Tracking Group
```
Tracking	float64	3	[0.0, 1.0]
Tracking Type	object	3	[Fixed, Single-Axis]
```
1. Misc

```
    Utility Service Territory	object	59	NaN
    Installer Name	object	1714	NaN
    Self-Installed	float64	3	[0.0, 1.0]

```


#### Things I noticed along the way:

1. I'd better rename the columns on the way in or right after.
1. "System ID (from Data Provider)" can have nulls.
1. "System ID (Tracking the Sun)" should cover the rows where "System ID (from Data Provider)" is null.
1. I get a new index every time I don't specify one in the read; now fixed with `index_col`.
1. "index" is the index in the original file; could add col for source file.
    * I'd like to keep track of the file/row source for each row.  One way I could do that is to add 10,000,000 to this column for part 1 and 20,000,000 for part 2.  Kinda hacky, but cheap.  I'll probably never use but if I needed it, it might save a bunch of time. 
1. see note A below.
1. Can I use -9999 for NaNs everywhere?  Everywhere we find one it means NA.  Let's try.
    * Looks like this works fine.  
1. I'm starting to think that the most effective way to do this is to just read it all in and then work with the column types/values.  Then save it out and read it back in with a dtype for each column that I want to keep.  Along the way, I could rename the columns.  One way to rename would be to lowercase everything and turn it all into snake_case (or use CamelCase instead).
1. 12 variables can be converted and changed to np.bool
```
    ['Appraised Value Flag', 'New Construction', 'Tracking', 'Ground Mounted', 'Battery System', 'Third-Party Owned',
     'Self-Installed', 'BIPV Module #1', 'BIPV Module #2', 'BIPV Module #3', 'Microinverter', 'DC Optimizer']
```
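
The read, fix, save, and re-read idea from the list above could look roughly like this. It is only a sketch on a three-row made-up frame: `snake_case`, the `wanted` dtype dict, and the in-memory buffer are illustrative assumptions, and nothing is written to disk.

```python
import io
import re
import pandas as pd

def snake_case(name):
    """'Third-Party Owned' -> 'third_party_owned', 'BIPV Module #1' -> 'bipv_module_1'."""
    return re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")

# Hypothetical miniature of the real table.
df = pd.DataFrame({
    "Third-Party Owned": [1.0, 0.0, float("nan")],
    "Zip Code": [71953.0, 72641.0, 71801.0],
    "Installation Date": pd.to_datetime(["2010-04-29", "2010-04-26", "2010-04-20"]),
})

# Rename first, then fix the types once in memory.
df = df.rename(columns=snake_case)
df["third_party_owned"] = df["third_party_owned"].astype("boolean")
df["zip_code"] = df["zip_code"].astype("Int64")

# The same dtypes are then requested again on the way back in.
wanted = {"third_party_owned": "boolean", "zip_code": "Int64"}

buffer = io.StringIO()                 # stands in for a file on disk
df.to_csv(buffer, index_label="row_id")
buffer.seek(0)
back = pd.read_csv(buffer, index_col="row_id", dtype=wanted,
                   parse_dates=["installation_date"])
print(back.dtypes)
```

Keeping the dtype dict next to the rename step means the saved file and the reader cannot drift apart on column names.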

In [1]:
import numpy as np
import pandas as pd
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4, suppress=True)
# please show all columns
pd.set_option("display.max_columns", 60)

##### Note A
Need a converter dict with function(s) for the (really) non-numerical columns, or maybe I don't if I use dtype.  It worked with np.bool for 'Third-Party Owned'.


In [2]:
# here's the first one
# change to take string arg rather than int...
def fnA(val):
    ''' take string value of "0", "1", "-9999" and turn into False, True and NaN'''
    if val in (['0', '1', '-9999']):
        return {'0' : False,
                '1' : True,
                '-9999' : np.nan}[val]
    else:
        raise TypeError('bad arg to fnA: {}, type={}'.format(val, type(val))) 

# print((fnA('0'), fnA('1'), fnA('-9999')))
# fnA('hugely successful')
      

In [3]:
# here's the 2nd one; may not need it
# change to take float arg rather than str
def fnB (val):
    ''' take floats 0.0, 1.0, and turn into False, True'''
    if val in ([0.0, 1.0]):
        return {0.0 : False,
                1.0 : True,}[val]
    else:
        raise TypeError('bad arg to fnB: {}, type={}'.format(val, type(val))) 

# print((fnB(0.0), fnB(1.0)))
# print(fnB('hugely successful'))
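
A small, self-contained check of what a converter actually produces may help with Note A. In this sketch the converter `flag_from_text` and the three-line CSV are invented for illustration; with pandas' default parser I would expect the column to come back holding True, False, and NaN side by side, which is not a plain bool column.

```python
import io
import numpy as np
import pandas as pd

def flag_from_text(text):
    """Hypothetical converter: '1' -> True, '0' -> False, '-9999' -> NaN."""
    return {"1": True, "0": False, "-9999": np.nan}[text]

csv_text = "row_id,flag\n0,1\n1,0\n2,-9999\n"   # made-up sample data
df = pd.read_csv(io.StringIO(csv_text), index_col="row_id",
                 converters={"flag": flag_from_text})

# The values are converted, but the column dtype still has to be set separately.
print(df["flag"].tolist(), df["flag"].dtype)
```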

##### For now, don't convert the values.  

We need to:

1. see what comes in on the read

2. decide which columns are:
    * boolean
    * int
    * float
    * string
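
One possible way to shortlist the boolean candidates is to look for columns whose non-null values are all 0 or 1. The helper name `flag_like_columns` and the sample frame below are assumptions for the sketch. Note that a column containing only 0.0 would also be picked up, so the result still needs a manual look.

```python
import numpy as np
import pandas as pd

def flag_like_columns(df):
    """Names of columns whose non-null values are all 0 or 1 (or False/True)."""
    names = []
    for col in df.columns:
        vals = df[col].dropna().unique()
        # 0.0 == 0 == False and 1.0 == 1 == True, so one set covers all encodings.
        if len(vals) > 0 and set(vals) <= {0, 1}:
            names.append(col)
    return names

# Hypothetical sample with one column of each kind.
sample = pd.DataFrame({
    "Battery System": [0.0, 1.0, np.nan],
    "Performance-Based Incentives (Duration)": [0, 20, 1],
    "State": ["AZ", "CA", "AR"],
    "Appraised Value Flag": [False, True, False],
})
print(flag_like_columns(sample))
```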

In [4]:
# unfortunately converting the values is not enough to change the type of the column

# myConverters = {'Third-Party Owned' : fnA,
#                 'Microinverter'     : fnA,
#                 'DC Optimizer'      : fnA,                
#                 'New Construction'  : fnA,  
#                 'Tracking'          : fnA,                 
#                 'Ground Mounted'    : fnA,
#                 'Battery System'    : fnA,
#                 'Self-Installed'    : fnA,
#                  #'Customer Segment'  : fnA,
#                  #'Tracking Type'  : fnA,
#                }

dfmini = pd.read_csv('../local/data/LBNL_openpv_tts_data/TTS.csv',
                     index_col='row_id',
                     parse_dates=['Installation Date'],
                     converters=None, # myConverters,
                     # low_memory=False,
                     # dtype={'Third-Party Owned' : np.bool},
                     na_values=[-9999],
                     nrows = 100000
                    ); dfmini.head()



row_id,index,Data Provider,System ID (from Data Provider),System ID (Tracking the Sun),Installation Date,System Size,Total Installed Price,Appraised Value Flag,Sales Tax Cost,Rebate or Grant,Performance-Based Incentive (Annual Payment),Performance-Based Incentives (Duration),Feed-in Tariff (Annual Payment),Feed-in Tariff (Duration),Customer Segment,New Construction,Tracking,Tracking Type,Ground Mounted,Battery System,Zip Code,City,County,State,Utility Service Territory,Third-Party Owned,Installer Name,Self-Installed,Azimuth #1,Azimuth #2,Azimuth #3,Tilt #1,Tilt #2,Tilt #3,Module Manufacturer #1,Module Manufacturer #2,Module Manufacturer #3,Module Model #1,Module Model #2,Module Model #3,Module Technology #1,Module Technology #2,Module Technology #3,BIPV Module #1,BIPV Module #2,BIPV Module #3,Module Efficiency #1,Module Efficiency #2,Module Efficiency #3,Inverter Manufacturer,Inverter Model,Microinverter,DC Optimizer
0,0,Arkansas State Energy Office,,AR_EDC_1,2010-04-29,2.016,14558.0,False,510.762764,0.0,3644.64,1,0.0,0,RES,,0.0,Fixed,0.0,,71953.0,Mena,Polk,AR,SWEPCO,0.0,Liberty Solar Solutions,0.0,,,,,,,Sharp,Sharp,Sharp,ND-224UC1,no match,no match,Multi-c-Si,,,0.0,0.0,0.0,0.142431,,,Enphase Energy,,1.0,0.0
1,1,Arkansas State Energy Office,,AR_EDC_2,2010-04-26,3.36,26096.0,False,851.271273,0.0,7210.5,1,0.0,0,RES,,0.0,Fixed,0.0,,72641.0,Jasper,Newton,AR,Carroll Electric,0.0,Liberty Solar Solutions,0.0,,,,,,,Sharp,Sharp,Sharp,ND-224UC1,no match,no match,Multi-c-Si,,,0.0,0.0,0.0,0.142431,,,Enphase Energy,,1.0,0.0
2,2,Arkansas State Energy Office,,AR_EDC_3,2010-04-20,13.44,91139.0,False,3405.085091,0.0,25178.97,1,0.0,0,RES,,0.0,Fixed,0.0,,71801.0,Hope,Hempstead,AR,Hope Water & Light,0.0,Liberty Solar Solutions,0.0,,,,,,,Sharp,Sharp,Sharp,ND-224UC1,no match,no match,Multi-c-Si,,,0.0,0.0,0.0,0.142431,,,Enphase Energy,,1.0,0.0
3,3,Arkansas State Energy Office,,AR_EDC_4,2010-04-21,5.52,40043.0,False,1398.517091,0.0,10724.34,1,0.0,0,RES,,0.0,Fixed,0.0,,71909.0,Hot Springs Village,Saline,AR,First Electric,0.0,Liberty Solar Solutions,0.0,,,,,,,Sharp,Sharp,Sharp,NU-U230F3,no match,no match,Mono-c-Si,,,0.0,0.0,0.0,0.14109,,,Enphase Energy,,1.0,0.0
4,4,Arkansas State Energy Office,,AR_EDC_5,2010-04-22,2.53,21497.0,False,640.987,0.0,3736.17,1,0.0,0,RES,,0.0,Fixed,0.0,,71909.0,Hot Springs Village,Garland,AR,Entergy,0.0,Liberty Solar Solutions,0.0,,,,,,,Sharp,Sharp,Sharp,NU-U230F3,no match,no match,Mono-c-Si,,,0.0,0.0,0.0,0.14109,,,Enphase Energy,,1.0,0.0


In [5]:
list(enumerate(dfmini.columns))

[(0, 'index'),
 (1, 'Data Provider'),
 (2, 'System ID (from Data Provider)'),
 (3, 'System ID (Tracking the Sun)'),
 (4, 'Installation Date'),
 (5, 'System Size'),
 (6, 'Total Installed Price'),
 (7, 'Appraised Value Flag'),
 (8, 'Sales Tax Cost'),
 (9, 'Rebate or Grant'),
 (10, 'Performance-Based Incentive (Annual Payment)'),
 (11, 'Performance-Based Incentives (Duration)'),
 (12, 'Feed-in Tariff (Annual Payment)'),
 (13, 'Feed-in Tariff (Duration)'),
 (14, 'Customer Segment'),
 (15, 'New Construction'),
 (16, 'Tracking'),
 (17, 'Tracking Type'),
 (18, 'Ground Mounted'),
 (19, 'Battery System'),
 (20, 'Zip Code'),
 (21, 'City'),
 (22, 'County'),
 (23, 'State'),
 (24, 'Utility Service Territory'),
 (25, 'Third-Party Owned'),
 (26, 'Installer Name'),
 (27, 'Self-Installed'),
 (28, 'Azimuth #1'),
 (29, 'Azimuth #2'),
 (30, 'Azimuth #3'),
 (31, 'Tilt #1'),
 (32, 'Tilt #2'),
 (33, 'Tilt #3'),
 (34, 'Module Manufacturer #1'),
 (35, 'Module Manufacturer #2'),
 (36, 'Module Manufacturer #3'),

In [None]:
# you have to do this for all of these that get converted
#dfmini['Battery System'] = dfmini['Battery System'].astype(np.bool)
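
One caveat with a plain bool conversion, sketched on made-up data: bool has no way to represent a missing value, so NaN is treated as truthy and turns into True. The nullable `"boolean"` dtype keeps the gap instead. The builtin `bool` is used below in place of `np.bool`, which is not available in every NumPy release; the column list is a hypothetical two-column stand-in for the real one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Battery System": [0.0, 1.0, np.nan],
    "Microinverter": [1.0, np.nan, 0.0],
})
to_convert = ["Battery System", "Microinverter"]   # hypothetical subset

# Plain bool: the NaN in the last row silently becomes True.
lossy = df["Battery System"].astype(bool)
print(lossy.tolist())

# Nullable boolean: the NaN is kept as <NA>.
for col in to_convert:
    df[col] = df[col].astype("boolean")
print(df.dtypes)
```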

In [6]:
def dfShowTypesNumUnique(df):
    # get a list of the cols
    cols = df.columns.tolist()
    # get a dict: {colName: numUniques} (NaN counts as one value)
    nameNumUnique = dict([(col, df[col].unique().size) for col in cols])
    # get a dict: {colName: colType}
    nameDtype = dict([(col, df[col].dtype) for col in cols])
    # return (nameNumUnique,  nameDtype)
    return pd.DataFrame(np.array([[nameDtype[col] for col in cols],
                                  [nameNumUnique[col] for col in cols]]).T, 
                        index=cols,
                        columns=['dType', 'numUnique'])
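
For comparison, roughly the same summary can be built from `DataFrame.dtypes` and `DataFrame.nunique`. This is a sketch, not a drop-in replacement: `column_summary` is a made-up name, and `dropna=False` is used so that NaN counts as one value, as it does with `unique().size`.

```python
import numpy as np
import pandas as pd

def column_summary(df):
    """dtype and number of distinct values (NaN counted once) per column."""
    return pd.DataFrame({
        "dType": df.dtypes,
        "numUnique": df.nunique(dropna=False),
    })

# Hypothetical sample frame.
sample = pd.DataFrame({
    "Battery System": [0.0, 1.0, np.nan],
    "State": ["AZ", "CA", "AR"],
})
summary = column_summary(sample)
print(summary)
```

A side effect of building the frame from two Series is that `numUnique` stays an integer column, so later comparisons such as `numUnique < 10` do not run on object values.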

#### Okay, this is nice.  We can see what the categories of problems are when reading the CSV

In [7]:
# data frame describing columns
metaDF = dfShowTypesNumUnique(dfmini); metaDF

Unnamed: 0,dType,numUnique
index,int64,100000
Data Provider,object,21
System ID (from Data Provider),object,96348
System ID (Tracking the Sun),object,100000
Installation Date,datetime64[ns],3309
System Size,float64,4637
Total Installed Price,float64,32343
Appraised Value Flag,bool,2
Sales Tax Cost,float64,1994
Rebate or Grant,float64,8802


In [8]:
# if there is a small number of values in a column, capture them
theVals = metaDF.index.map(
    lambda x:  dfmini[x].value_counts().index.values 
               if metaDF.loc[x, 'numUnique'] < 10 
               else np.nan)
theVals.values

array([nan, nan, nan, nan, nan, nan, nan,
       array([False, True], dtype=object), nan, nan, nan,
       array([ 0, 20,  1], dtype=int64), array([ 0.]),
       array([0], dtype=int64),
       array(['RES', 'NON-RES', 'COM', 'SCHOOL', 'GOV', 'NON-PROFIT'], dtype=object),
       array([ 0.,  1.]), array([ 0.,  1.]),
       array(['Fixed', 'Single-Axis'], dtype=object), array([ 0.,  1.]),
       array([ 0.,  1.]), nan, nan, nan,
       array(['AZ', 'CA', 'AR'], dtype=object), nan, array([ 1.,  0.]),
       nan, array([ 0.,  1.]), nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan,
       array(['Multi-c-Si', 'Mono-c-Si', 'Poly', 'Mono', 'Thin Film'], dtype=object),
       array([ 0.,  1.]), array([ 0.,  1.]), array([ 0.,  1.]), nan, nan,
       nan, nan, nan, array([ 0.,  1.]), array([ 0.,  1.])], dtype=object)

In [9]:
# now add it to the metadata
metaDF['vals'] = theVals.values
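
The same capture step can also be written as an explicit loop, which may be easier to follow than mapping over the index. The sketch below assumes a hypothetical helper `small_value_lists` and a tiny sample frame; like `value_counts()` above, it leaves NaN out of the captured values.

```python
import numpy as np
import pandas as pd

def small_value_lists(df, max_unique=10):
    """Per column: list of observed values if there are few of them, else None."""
    out = {}
    for col in df.columns:
        if df[col].nunique(dropna=False) < max_unique:
            # value_counts() drops NaN and orders by frequency.
            out[col] = df[col].value_counts().index.tolist()
        else:
            out[col] = None
    return pd.Series(out, dtype=object)

# Hypothetical sample: one flag-like column, one continuous column.
sample = pd.DataFrame({
    "Tracking": [0.0, 0.0, 1.0, np.nan],
    "System Size": [2.016, 3.36, 13.44, 5.52],
})
vals = small_value_lists(sample, max_unique=4)
print(vals)
```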


In [10]:
metaDF

Unnamed: 0,dType,numUnique,vals
index,int64,100000,
Data Provider,object,21,
System ID (from Data Provider),object,96348,
System ID (Tracking the Sun),object,100000,
Installation Date,datetime64[ns],3309,
System Size,float64,4637,
Total Installed Price,float64,32343,
Appraised Value Flag,bool,2,"[False, True]"
Sales Tax Cost,float64,1994,
Rebate or Grant,float64,8802,


In [11]:
boolies = metaDF[metaDF.numUnique <= 3]; boolies

Unnamed: 0,dType,numUnique,vals
Appraised Value Flag,bool,2,"[False, True]"
Performance-Based Incentives (Duration),int64,3,"[0, 20, 1]"
Feed-in Tariff (Annual Payment),float64,1,[0.0]
Feed-in Tariff (Duration),int64,1,[0]
New Construction,float64,3,"[0.0, 1.0]"
Tracking,float64,3,"[0.0, 1.0]"
Tracking Type,object,3,"[Fixed, Single-Axis]"
Ground Mounted,float64,3,"[0.0, 1.0]"
Battery System,float64,3,"[0.0, 1.0]"
State,object,3,"[AZ, CA, AR]"


In [12]:
to_bool = boolies.drop(['Tracking Type', 'State',
                        'Feed-in Tariff (Duration)',
                        'Feed-in Tariff (Annual Payment)',
                       'Performance-Based Incentives (Duration)']).index.values

In [13]:
to_bool

array(['Appraised Value Flag', 'New Construction', 'Tracking',
       'Ground Mounted', 'Battery System', 'Third-Party Owned',
       'Self-Installed', 'BIPV Module #1', 'BIPV Module #2',
       'BIPV Module #3', 'Microinverter', 'DC Optimizer'], dtype=object)

In [14]:
len(to_bool)

12