# Data Analysis with `pandas`

### Munging

- select, filter
- groupby
- aggregate
- reshape (pivot)
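
As a rough illustration of the four operations listed above, here is a minimal sketch on a toy frame. The column names (`meter`, `hour`, `power`) and all values are made up for the example and do not come from the data used later in this notebook.

```python
import pandas as pd

# Toy long-format data (made-up values, for illustration only)
df = pd.DataFrame({
    "meter": ["m0", "m0", "m1", "m1", "m2", "m2"],
    "hour": [0, 1, 0, 1, 0, 1],
    "power": [510.0, 520.0, 380.0, 390.0, 270.0, 280.0],
})

# select: keep a subset of columns
selected = df[["meter", "power"]]

# filter: keep rows matching a boolean condition
high = df[df["power"] > 300]

# groupby + aggregate: mean power per meter
mean_power = df.groupby("meter")["power"].mean()

# reshape (pivot): one row per hour, one column per meter
wide = df.pivot(index="hour", columns="meter", values="power")
```

Here `wide` has one row per `hour` and one column per `meter`, which is the same wide layout as the meter files read below.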

### Statistics and ML

- Clustering
- Dimensionality reduction
- Modeling

## Resources

### Data Analysis and `pandas`

#### Text

- [Python for Data Analysis](http://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1449319793) by Wes McKinney

#### Documentation
- [pandas.pydata.org](http://pandas.pydata.org/)
- [Comparison with SQL](http://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html)

#### Other

- [Tidy data](http://vita.had.co.nz/papers/tidy-data.pdf) by Hadley Wickham
- [Data Wrangling Kung Fu with Pandas](http://vimeo.com/63295598) by Wes McKinney

### Machine Learning and `sklearn`

#### Text

- [Elements of Statistical Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/) \$71 (free PDF)
- [Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/) \$54 (free PDF)
- [Learning from Data](http://www.amazon.com/gp/product/1600490069) \$28

#### Online classes

- [Stanford's Introduction to Statistical Learning](http://online.stanford.edu/course/statistical-learning-winter-2014)
- [Coursera's Machine Learning](https://www.coursera.org/course/ml)
- [Caltech's Learning from Data](http://work.caltech.edu/telecourse.html)

#### Websites

- [Kaggle](http://www.kaggle.com)
- [Metaoptimize](http://metaoptimize.com/)

#### Online `sklearn` videos

- [Jake Vanderplas'](http://www.astro.washington.edu/users/vanderplas/) excellent videos on Scikit-learn:
    - [PyData NYC 2013](https://vimeo.com/80093925) 1.5 hours.
    - [Scipy 2013](http://pyvideo.org/video/2157/intro-to-scikit-learn-i-scipy2013-tutorial-pa-7) 8 hours.
    - [PyCon 2013](http://www.youtube.com/watch?v=4ONBVNm3isI) 3 hours.
    - [PyData NYC 2012](http://vimeo.com/53062607) 45 minutes.





## Easy reading 

In [27]:
import utils

In [22]:
df = utils.read_csv('data/neighborhood/outputFiles/triplex_meter:measured_real_power.csv')
df.columns = [ x.replace('triplex_meter:measured_real_power:','') for x in df.columns ]
df.columns

Index([u'triplex_meter_0', u'triplex_meter_1', u'triplex_meter_2',
       u'triplex_meter_3', u'triplex_meter_4', u'triplex_meter_5',
       u'triplex_meter_6', u'triplex_meter_7', u'triplex_meter_8',
       u'triplex_meter_9', u'triplex_meter_10', u'triplex_meter_11',
       u'triplex_meter_12', u'triplex_meter_13', u'triplex_meter_14',
       u'triplex_meter_15', u'triplex_meter_16', u'triplex_meter_17',
       u'triplex_meter_18', u'triplex_meter_19'],
      dtype='object')

In [23]:
filename = 'measured_real_power'

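# Write the metadata header first. The last header line is "# " with no newline,
# so the column header appended by to_csv() below lands on that same line.
with open('data/{}.csv'.format(filename), 'w') as outfile:
    outfile.write("""# file...... {}.csv
# date...... Tue Aug 11 13:56:33 2015
# user...... mlunacek
# host...... (null)
# group..... class=house
# property.. {}
# limit..... 0
# interval.. 60
# """.format(filename, filename))

# Append the data (column header plus rows) after the metadata header.
df.to_csv('data/{}.csv'.format(filename), mode='a')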

In [18]:
!head data/measured_real_power.csv

# file...... measured_real_power.csv
# date...... Tue Aug 11 13:56:33 2015
# user...... mlunacek
# host...... (null)
# group..... class=house
# property.. measured_real_power
# limit..... 0
# interval.. 60
# timestamp,triplex_meter_0,triplex_meter_1,triplex_meter_2,triplex_meter_3,triplex_meter_4,triplex_meter_5,triplex_meter_6,triplex_meter_7,triplex_meter_8,triplex_meter_9,triplex_meter_10,triplex_meter_11,triplex_meter_12,triplex_meter_13,triplex_meter_14,triplex_meter_15,triplex_meter_16,triplex_meter_17,triplex_meter_18,triplex_meter_19
2013-07-01 00:00:00 UTC,510.912,510.912,510.912,547.924,510.912,510.912,547.924,547.924,581.637,581.637,384.146,384.146,384.146,531.494,531.494,531.494,530.845,530.845,271.123,383.677
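
A file with this layout can be read back with plain `pandas`. The sketch below is one possible approach, not necessarily what `utils.read_csv` does. It uses an abbreviated in-memory copy of the file (two meters only) and assumes the header is always exactly 8 metadata lines followed by the `# timestamp,...` column line.

```python
import io

import pandas as pd

# Abbreviated copy of the file layout shown above (two meters only)
text = """# file...... measured_real_power.csv
# date...... Tue Aug 11 13:56:33 2015
# user...... mlunacek
# host...... (null)
# group..... class=house
# property.. measured_real_power
# limit..... 0
# interval.. 60
# timestamp,triplex_meter_0,triplex_meter_1
2013-07-01 00:00:00 UTC,510.912,510.912
"""

# Skip the 8 metadata lines; the 9th line becomes the header row
df_back = pd.read_csv(io.StringIO(text), skiprows=8)

# The first column name still carries the "# " prefix; strip it
df_back = df_back.rename(columns=lambda c: c.lstrip("# "))

# Use the timestamp column (left as plain strings here) as the index
df_back = df_back.set_index("timestamp")
```

For a file on disk, pass the path instead of the `io.StringIO` object.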


In [25]:
df = utils.read_csv('data/neighborhood/outputFiles/triplex_meter:measured_reactive_power.csv')
df.columns = [ x.replace('triplex_meter:measured_reactive_power:','') for x in df.columns ]
df.columns

Index([u'triplex_meter_0', u'triplex_meter_1', u'triplex_meter_2',
       u'triplex_meter_3', u'triplex_meter_4', u'triplex_meter_5',
       u'triplex_meter_6', u'triplex_meter_7', u'triplex_meter_8',
       u'triplex_meter_9', u'triplex_meter_10', u'triplex_meter_11',
       u'triplex_meter_12', u'triplex_meter_13', u'triplex_meter_14',
       u'triplex_meter_15', u'triplex_meter_16', u'triplex_meter_17',
       u'triplex_meter_18', u'triplex_meter_19'],
      dtype='object')

In [26]:
filename = 'measured_reactive_power'

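# Same pattern as for measured_real_power: write the metadata header, leaving
# the final "# " line open so to_csv() completes it with the column names.
with open('data/{}.csv'.format(filename), 'w') as outfile:
    outfile.write("""# file...... {}.csv
# date...... Tue Aug 11 13:56:33 2015
# user...... mlunacek
# host...... (null)
# group..... class=house
# property.. {}
# limit..... 0
# interval.. 60
# """.format(filename, filename))

# Append the data (column header plus rows) after the metadata header.
df.to_csv('data/{}.csv'.format(filename), mode='a')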
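
The two export cells differ only in `filename`, so the pattern could be wrapped in a small function. The sketch below is one way to do that. `write_with_header` and `HEADER_TEMPLATE` are hypothetical names introduced here, and the demo frame holds made-up values written to a temporary directory rather than `data/`.

```python
import os
import tempfile

import pandas as pd

# Same header layout as the cells above; variable fields are passed in
HEADER_TEMPLATE = (
    "# file...... {name}.csv\n"
    "# date...... {date}\n"
    "# user...... {user}\n"
    "# host...... (null)\n"
    "# group..... class=house\n"
    "# property.. {name}\n"
    "# limit..... 0\n"
    "# interval.. 60\n"
    "# "  # no newline: the CSV column header is appended to this line
)


def write_with_header(df, directory, name, date, user):
    """Hypothetical helper: write the header, then append df as CSV."""
    path = os.path.join(directory, "{}.csv".format(name))
    with open(path, "w") as outfile:
        outfile.write(HEADER_TEMPLATE.format(name=name, date=date, user=user))
    df.to_csv(path, mode="a")
    return path


# Usage with a tiny stand-in frame (made-up values)
demo = pd.DataFrame(
    {"triplex_meter_0": [1.0], "triplex_meter_1": [2.0]},
    index=pd.Index(["2013-07-01 00:00:00 UTC"], name="timestamp"),
)
out_dir = tempfile.mkdtemp()
path = write_with_header(demo, out_dir, "measured_real_power",
                         date="Tue Aug 11 13:56:33 2015", user="mlunacek")
with open(path) as f:
    lines = f.read().splitlines()
```

Leaving the trailing `"# "` without a newline is what makes the column header come out as `# timestamp,...`, matching the `!head` output shown earlier.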

In [50]:
import tdshub as hub
import pandas as pd

In [35]:
g = hub.load('data/neighborhood/model.glm')

In [52]:
def create_solar_lookup():
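    for meter in g.names('triplex_meter'):
        # If g.graph is a networkx graph, successors() returns an iterator in
        # networkx >= 2.0, so materialize it before calling len().
        if len(list(g.graph.successors(meter))) == 2: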
            yield meter, 'solar'
        else:
            yield meter, 'nosolar'

df = pd.DataFrame(list(create_solar_lookup()))
df.columns = ['triplex_meter', 'type']
df

|   | triplex_meter    | type    |
|---|------------------|---------|
| 0 | triplex_meter_11 | nosolar |
| 1 | triplex_meter_18 | nosolar |
| 2 | triplex_meter_19 | solar   |
| 3 | triplex_meter_10 | solar   |
| 4 | triplex_meter_17 | solar   |
| 5 | triplex_meter_14 | nosolar |
| 6 | triplex_meter_15 | solar   |
| 7 | triplex_meter_16 | solar   |
| 8 | triplex_meter_8  | solar   |
| 9 | triplex_meter_9  | nosolar |


In [55]:
df.to_csv('data/triplex_meter_solar.csv', index=False)
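
One possible use of this lookup table is to label the wide power readings by type. The sketch below assumes that workflow. The two lookup rows are taken from the output above, but the `power` values and the `t0`/`t1` timestamps are made up for illustration.

```python
import pandas as pd

# Lookup table in the same shape as triplex_meter_solar.csv
lookup = pd.DataFrame({
    "triplex_meter": ["triplex_meter_18", "triplex_meter_19"],
    "type": ["nosolar", "solar"],
})

# Made-up wide-format readings: one column per meter
power = pd.DataFrame(
    {"triplex_meter_18": [270.0, 280.0], "triplex_meter_19": [380.0, 400.0]},
    index=pd.Index(["t0", "t1"], name="timestamp"),
)

# Wide -> long so each row is one (timestamp, meter, value) observation
long_df = power.reset_index().melt(
    id_vars="timestamp", var_name="triplex_meter", value_name="power"
)

# Attach the solar/nosolar label, then average by type
labeled = long_df.merge(lookup, on="triplex_meter", how="left")
mean_by_type = labeled.groupby("type")["power"].mean()
```

The `how="left"` merge keeps every reading even if a meter is missing from the lookup, in which case its `type` would be `NaN`.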