# Statflo Interview 

At Quora, we deal with a lot of data. As the number of users on our site grows, so does the volume of content and the number of interactions that go on daily.

We want to be able to determine various social and intellectual topics that are trending over time. However, we take privacy very seriously, and so we want to do this in a way that does not put any individual’s personal information at risk.

Your task is to design and implement a trend analysis engine. This engine should be able to take as input a table of raw data and present an interface for trend analysis. You are free to design this interface and choose the data analysis features that you wish to support. It is important that the engine only exposes to the user data that does not have any personal identifiers and protects sensitive attributes from being revealed. We provide some links to resources about how to think about this in the Resources section.
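One way to read the privacy requirement is that the engine should publish only aggregates and withhold any group small enough to point at an individual. The sketch below illustrates that idea with a minimum group size. The suppression rule, the function name `trend_counts`, the field names, and the ISO `YYYY-MM-DD` date format are all assumptions made for illustration, not part of the task statement.

```python
from collections import Counter

def trend_counts(rows, time_field, category_field, min_group_size=5):
    """Count rows per (month, category) and hide groups that are too small.

    Assumes each row is a dict and that the time field is an ISO date
    string such as "2016-01-02", so the first 7 characters are the month.
    """
    counts = Counter(
        (row[time_field][:7], row[category_field]) for row in rows
    )
    # Suppress small groups: a count of 1 or 2 may identify a single person.
    return {key: n for key, n in counts.items() if n >= min_group_size}

# Hypothetical sample rows; no identifier column is ever read or returned.
sample_rows = (
    [{"visited": "2016-01-02", "topic": "physics"}] * 3
    + [{"visited": "2016-01-09", "topic": "history"}]
)
trends = trend_counts(sample_rows, "visited", "topic", min_group_size=2)
```

Here the single `history` row is dropped from the result because it falls below the threshold, while the three `physics` rows are reported as one monthly count.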


Your engine should take as input a path to a comma-delimited text file. The first line will contain the field names, and the second line will contain the field types. The remaining rows will be the raw data.
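The two-line header described above can be read with the standard `csv` module before any analysis starts. This is a minimal sketch: the function name `read_typed_table`, the sample column names, and the type labels `ID`, `TIME`, and `CAT` are assumptions used only for illustration.

```python
import csv
import io

def read_typed_table(handle):
    """Return (schema, rows) from a file whose first two lines are headers.

    schema maps each field name to its declared type label;
    rows is a list of dicts keyed by field name.
    """
    reader = csv.reader(handle)
    names = next(reader)   # line 1: field names
    types = next(reader)   # line 2: field types
    if len(names) != len(types):
        raise ValueError("field names and field types differ in length")
    rows = [dict(zip(names, row)) for row in reader if row]
    return dict(zip(names, types)), rows

# Hypothetical input in the described layout.
SAMPLE = (
    "user,visited,topic\n"
    "ID,TIME,CAT\n"
    "u1,2016-01-02,physics\n"
    "u2,2016-01-03,history\n"
)
schema, rows = read_typed_table(io.StringIO(SAMPLE))
```

Keeping the schema separate from the rows lets later steps decide, per column, whether a field is an identifier to drop or a category to aggregate.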

The table below lists the field types we want you to support. If you have ideas for other types, feel free to add support for them and provide an input file so that we can evaluate them.


In [8]:
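import unicodecsv as csv
from faker import Factory
from collections import defaultdict

# Adapted from the blog post at
# http://blog.districtdatalabs.com/a-practical-guide-to-anonymizing-datasets-with-python-faker
# This code anonymizes the dataset if desired.

class ReadandAnonmitizeDataset():

    @staticmethod
    def anonymize_rows(rows):
        """Yield each row with its name and email replaced by fake values."""
        faker = Factory.create()
        # defaultdict maps every real value to one consistent fake value
        names = defaultdict(faker.name)
        emails = defaultdict(faker.email)

        for row in rows:
            row['name'] = names[row['name']]
            row['email'] = emails[row['email']]
            yield row

    @staticmethod
    def anonymize(source, target):
        """Read the CSV at source and write an anonymized copy to target."""
        # unicodecsv reads and writes bytes, so open both files in binary mode
        with open(source, 'rb') as f:
            with open(target, 'wb') as o:
                reader = csv.DictReader(f)
                writer = csv.DictWriter(o, reader.fieldnames)
                # Keep the header row so the output is still a labelled CSV
                writer.writeheader()
                for row in ReadandAnonmitizeDataset.anonymize_rows(reader):
                    writer.writerow(row)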

In [9]:
ReadandAnonmitizeDataset()

<__main__.ReadandAnonmitizeDataset at 0x10c230978>
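The anonymizer relies on `defaultdict` to hand out one stable replacement per real value, so the same person maps to the same fake name on every row. The sketch below shows that mechanism with the standard library only. It swaps Faker for a simple counter, and the names `make_pseudonymizer` and `person_N` are hypothetical.

```python
from collections import defaultdict
from itertools import count

def make_pseudonymizer(prefix):
    """Return a mapping that assigns each new key the next numbered alias."""
    counter = count(1)
    # The factory runs only the first time a key is looked up;
    # later lookups of the same key return the stored alias.
    return defaultdict(lambda: "{}_{}".format(prefix, next(counter)))

names = make_pseudonymizer("person")
sample_rows = [{"name": "Ada"}, {"name": "Grace"}, {"name": "Ada"}]
masked = [names[row["name"]] for row in sample_rows]
```

Both rows for `Ada` receive the same alias, which keeps per-person trends intact while hiding the real name.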

In [27]:
import pandas as pd
import plotly

# This code reads in the file for plotting interesting information
def readfile():
    return pd.read_csv('/Users/marconlaforet/Python_notebook_jupyter/Statflo_challenge/QuoraTrendAnalyzer_SampleData/airsampling.csv', header=1)

readfile().head(10)

Unnamed: 0,DATE,MATRIX,SAMPLE NAME,LOCATION,STATE NAME,CAS NUMBER,SUBSTANCE,RESULT,UNIT,REPORTING LIMIT,REPORTING LIMIT UNIT,DETECTED,LATITUDE,LONGITUDE
0,09/18/2010 12:00 AM,Air,C05-20100917-0945-PUF,C05,Louisiana,91-20-3,Naphthalene,0.027000,ug/m3,0.033000,ug/m3,True,83,70085.0
1,09/18/2010 12:00 AM,Air,C05-20100917-0945-24,C05,Louisiana,71-43-2,Benzene,0.540000,ug/m3,0.440000,ug/m3,True,84,70085.0
2,09/18/2010 12:00 AM,Air,C05-20100917-0945-24,C05,Louisiana,108-88-3,Toluene,0.960000,ug/m3,0.520000,ug/m3,True,25,70085.0
3,09/18/2010 12:00 AM,Air,C05-20100917-0945-24,C05,Louisiana,100-41-4,Ethylbenzene,,ug/m3,0.600000,ug/m3,False,39,70085.0
4,09/18/2010 12:00 AM,Air,C05-20100917-0945-24,C05,Louisiana,179601-23-1,m p-Xylene,0.910000,ug/m3,0.600000,ug/m3,True,84,70085.0
5,09/18/2010 12:00 AM,Air,C05-20100917-0945-24,C05,Louisiana,95-47-6,o-Xylene,,ug/m3,0.600000,ug/m3,False,88,70085.0
6,09/18/2010 12:00 AM,Air,C05-20100917-0945-PUF,C05,Louisiana,53-70-3,Dibenz[a h]anthracene,,ug/m3,0.033000,ug/m3,False,54,70085.0
7,09/18/2010 12:00 AM,Air,C05-20100917-0945-PUF,C05,Louisiana,218-01-9,Chrysene,,ug/m3,0.033000,ug/m3,False,66,70085.0
8,09/18/2010 12:00 AM,Air,C05-20100917-0945-PUF,C05,Louisiana,205-99-2,Benzo(b)fluoranthene,,ug/m3,0.033000,ug/m3,False,26,70085.0
9,09/18/2010 12:00 AM,Air,C05-20100917-0945-PUF,C05,Louisiana,207-08-9,Benzo[k]fluoranthene,,ug/m3,0.033000,ug/m3,False,15,70085.0


In [2]:
pd.read_csv('/Users/marconlaforet/Python_notebook_jupyter/Statflo_challenge/SampleData/CleanedFiles/hospital.csv', header=1)

NameError: name 'pd' is not defined

In [18]:
import pandas as pd
whitehouse = pd.read_csv('/Users/marconlaforet/Python_notebook_jupyter/StatfloChallenge/SampleData/CleanedFiles/whitehouse.csv', header=1)

Running unit tests for pandas
pandas version 0.18.1
numpy version 1.11.0
pandas is installed in /Users/marconlaforet/bin/anaconda/lib/python3.4/site-packages/pandas
Python version 3.4.4 |Anaconda 2.3.0 (x86_64)| (default, Jan  9 2016, 17:30:09) [GCC 4.2.1 (Apple Inc. build 5577)]
nose version 1.3.7



<nose.result.TextTestResult run=899 errors=0 failures=0>

In [24]:
whitehouse.dtypes

ID                  object
ID.1                object
ID.2                object
TIME                object
TIME.1              object
CONT               float64
CAT/SENSITIVE       object
CAT/SENSITIVE.1     object
CAT                 object
CAT.1              float64
CAT/SENSITIVE.2     object
CAT/SENSITIVE.3     object
CAT.2               object
TIME.2              object
dtype: object

In [32]:
import pandas
from sklearn import preprocessing 


In [33]:
whitehouse.apply(preprocessing.LabelEncoder().fit_transform)


NameError: name 'LabelEncoder' is not defined
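A minimal sketch of one way to get integer codes for the categorical columns, assuming that is the goal of the cell above. It uses `pd.factorize` in place of scikit-learn's `LabelEncoder` (a swapped-in technique, not the notebook's own), leaves numeric columns such as `CONT` untouched, and marks missing values with `-1`. The function name `encode_categoricals` and the sample frame are hypothetical.

```python
import pandas as pd

def encode_categoricals(frame):
    """Return (encoded_copy, mappings) with non-numeric columns as int codes."""
    encoded = frame.copy()
    mappings = {}
    for column in encoded.columns:
        if pd.api.types.is_numeric_dtype(encoded[column]):
            continue  # keep continuous values as they are
        # factorize assigns codes in order of first appearance; missing -> -1
        codes, uniques = pd.factorize(encoded[column])
        encoded[column] = codes
        mappings[column] = list(uniques)
    return encoded, mappings

# Hypothetical frame with one categorical and one continuous column.
sample = pd.DataFrame({
    "CAT": ["a", "b", None, "a"],
    "CONT": [1.0, 2.5, None, 4.0],
})
encoded, mappings = encode_categoricals(sample)
```

The returned `mappings` dict records which original label each code stands for, so encoded results can be translated back when trends are reported.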