### To analyse the OpenAQ data for India
- Option 1: requires a laptop or high-performance computer with at least 12 GB of memory, and Python with pandas.  
- Option 2: uses [dask](https://docs.dask.org/en/latest/dataframe.html) with [parquet](https://arrow.apache.org/docs/python/parquet.html) files instead of pandas with CSV files (see the example screencast [here](https://youtu.be/0eEsIA0O1iE)). This approach loads files "lazily", partitioning the dataframe so that the task scheduler can process it in parallel later.  
Note: Data for all countries is also available. That dataset is much larger, so it is better suited to Option 2.  

#### Option 1: Pandas and csv files
- Download and install the latest Miniconda for flexible Python package management from [here](https://docs.conda.io/en/latest/miniconda.html).  
- Install the required python libraries.  
```
conda install -c conda-forge pandas
```

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv(
    '/nfs/b0004/Users/earlacoa/openaq/shared/openaq_data_2013-2020_india.csv', 
    parse_dates=['date.utc'],
    usecols=['date.utc', 'parameter', 'value', 'unit', 'coordinates.latitude', 'coordinates.longitude', 'city', 'country'],
    index_col='date.utc'
)

In [3]:
df.head()

| date.utc | city | coordinates.latitude | coordinates.longitude | country | parameter | unit | value |
|---|---|---|---|---|---|---|---|
| 2013-12-14 10:30:00+00:00 | Kanpur | 26.519 | 80.233 | IN | pm25 | µg/m³ | 106.5 |
| 2013-12-14 11:30:00+00:00 | Kanpur | 26.519 | 80.233 | IN | pm25 | µg/m³ | 127.6 |
| 2013-12-14 12:30:00+00:00 | Kanpur | 26.519 | 80.233 | IN | pm25 | µg/m³ | 124.0 |
| 2013-12-14 13:30:00+00:00 | Kanpur | 26.519 | 80.233 | IN | pm25 | µg/m³ | 84.9 |
| 2013-12-14 14:30:00+00:00 | Kanpur | 26.519 | 80.233 | IN | pm25 | µg/m³ | 36.8 |


In [4]:
df.tail()

| date.utc | city | coordinates.latitude | coordinates.longitude | country | parameter | unit | value |
|---|---|---|---|---|---|---|---|
| 2020-05-21 23:45:00+00:00 | Delhi | 28.657381 | 77.158545 | IN | no2 | µg/m³ | 25.34 |
| 2020-05-21 23:45:00+00:00 | Bengaluru | 12.913522 | 77.59508 | IN | o3 | µg/m³ | 36.29 |
| 2020-05-21 23:45:00+00:00 | Delhi | 28.657381 | 77.158545 | IN | pm10 | µg/m³ | 334.42 |
| 2020-05-21 23:45:00+00:00 | Bengaluru | 12.935205 | 77.681449 | IN | co | µg/m³ | 450.0 |
| 2020-05-21 23:45:00+00:00 | Bengaluru | 12.935205 | 77.681449 | IN | so2 | µg/m³ | 0.0 |


In [5]:
# slice for the date range of interest (both end dates are inclusive)
df.loc['2020-02-01':'2020-04-30']

| date.utc | city | coordinates.latitude | coordinates.longitude | country | parameter | unit | value |
|---|---|---|---|---|---|---|---|
| 2020-02-01 00:00:00+00:00 | Jorapokhar | 23.707909 | 86.414670 | IN | co | µg/m³ | 0.00 |
| 2020-02-01 00:00:00+00:00 | Jorapokhar | 23.707909 | 86.414670 | IN | o3 | µg/m³ | 15.52 |
| 2020-02-01 00:00:00+00:00 | Tirupati | 13.670000 | 79.350000 | IN | so2 | µg/m³ | 3.70 |
| 2020-02-01 00:00:00+00:00 | Tirupati | 13.670000 | 79.350000 | IN | co | µg/m³ | 100.00 |
| 2020-02-01 00:00:00+00:00 | Tirupati | 13.670000 | 79.350000 | IN | o3 | µg/m³ | 27.90 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2020-04-30 23:45:00+00:00 | Delhi | 28.657381 | 77.158545 | IN | co | µg/m³ | 1890.00 |
| 2020-04-30 23:45:00+00:00 | Delhi | 28.657381 | 77.158545 | IN | no2 | µg/m³ | 17.69 |
| 2020-04-30 23:45:00+00:00 | Kolkata | 22.544808 | 88.340369 | IN | no2 | µg/m³ | 9.73 |
| 2020-04-30 23:45:00+00:00 | Delhi | 28.776200 | 77.051074 | IN | pm10 | µg/m³ | 259.00 |
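
Once the date range is selected, a typical next step is to narrow the rows to one pollutant. The following is a minimal sketch of that step, not part of the original notebook. It uses a small synthetic dataframe (made-up values, one row per day) in place of the real CSV file, and the names `sample`, `subset`, `pm25` and `mean_by_parameter` are hypothetical.

```python
import pandas as pd

# Synthetic stand-in for the OpenAQ India data: values are invented for illustration.
index = pd.date_range("2020-01-30", "2020-05-02", freq="D", tz="UTC", name="date.utc")
sample = pd.DataFrame(
    {
        "city": "Delhi",
        # alternate the pollutant so that there are two parameters to filter on
        "parameter": ["pm25" if i % 2 == 0 else "no2" for i in range(len(index))],
        "value": [float(i) for i in range(len(index))],
        "unit": "µg/m³",
    },
    index=index,
)

# Partial-string slicing on a DatetimeIndex includes both end dates.
subset = sample.loc["2020-02-01":"2020-04-30"]

# Keep a single pollutant with a boolean mask.
pm25 = subset[subset["parameter"] == "pm25"]

# Mean value per pollutant over the selected period.
mean_by_parameter = subset.groupby("parameter")["value"].mean()
print(mean_by_parameter)
```

The same `.loc` slice and boolean mask should apply unchanged to the `df` loaded from the CSV file, since it has the same `date.utc` index and `parameter` column.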


#### Option 2: Dask and parquet files (recommended)
- Download and install the latest Miniconda for flexible Python package management from [here](https://docs.conda.io/en/latest/miniconda.html).  
- Install the required python libraries.  
```
conda install -c conda-forge dask pyarrow
```

In [6]:
import dask.dataframe as dd
from dask.distributed import Client
client = Client() # set up local cluster (in this instance on viper) and connect to client
client

| Client | Cluster |
|---|---|
| Scheduler: tcp://127.0.0.1:38434 | Workers: 8 |
| Dashboard: http://127.0.0.1:8787/status | Cores: 32 |
| | Memory: 405.66 GB |


In [7]:
df = dd.read_parquet(
    '/nfs/b0004/Users/earlacoa/openaq/shared/openaq_data_2013-2020_india.parquet',
    columns=['parameter', 'value', 'unit', 'coordinates.latitude', 'coordinates.longitude', 'city', 'country']
)

In [8]:
df

| npartitions=2672 | parameter | value | unit | coordinates.latitude | coordinates.longitude | city | country |
|---|---|---|---|---|---|---|---|
| | object | float64 | object | float64 | float64 | object | object |
| | ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... | ... |


In [9]:
# could also use dask with the csv file
df = dd.read_csv(
    '/nfs/b0004/Users/earlacoa/openaq/shared/openaq_data_2013-2020_india.csv', 
    parse_dates=['date.utc'],
    usecols=['date.utc', 'parameter', 'value', 'unit', 'coordinates.latitude', 'coordinates.longitude', 'city', 'country']
).set_index('date.utc')

In [10]:
df

| npartitions=151 | city | coordinates.latitude | coordinates.longitude | country | parameter | unit | value |
|---|---|---|---|---|---|---|---|
| 2013-12-14 10:30:00+00:00 | object | float64 | float64 | object | object | object | float64 |
| 2016-03-30 13:15:00+00:00 | ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2020-05-11 02:00:00+00:00 | ... | ... | ... | ... | ... | ... | ... |
| 2020-05-21 23:45:00+00:00 | ... | ... | ... | ... | ... | ... | ... |


To persist a dataframe in memory, use the `.persist()` method.  
To run the scheduled tasks and return the result, use the `.compute()` method.

In [11]:
# close the client when finished
client.close()