# Task 2.Data

done by Malyshev Kyrylo

In [1]:
import pandas as pd
import numpy as np
import os


The zip file `specdata.zip` [2.4MB] containing the data can be downloaded from the data folder in the course repository.

The zip file contains 332 comma-separated-value (CSV) files containing pollution monitoring data for fine particulate matter (PM) air pollution at 332 locations in the United States. Each file contains data from a single monitor and the ID number for each monitor is contained in the file name. For example, data for monitor 200 is contained in the file `200.csv`. Each file contains three variables:
* Date: the date of the observation in YYYY-MM-DD format (year-month-day)
* sulfate: the level of sulfate PM in the air on that date (measured in micrograms per cubic meter)
* nitrate: the level of nitrate PM in the air on that date (measured in micrograms per cubic meter)

In each file there are many days where either sulfate or nitrate (or both) are missing (coded as NA). This is common with air pollution monitoring data in the United States.
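
As a quick illustration of the NA coding, the sketch below parses a small made-up sample in the layout described above, plus the `ID` column that the Part 2 code groups by. The values are invented for illustration; only the column names come from this notebook. By default pandas parses the string `NA` as a missing value, and `mean()` skips missing values.

```python
import io
import pandas as pd

# Made-up sample in the layout described above; the numbers are illustrative only.
sample = io.StringIO(
    "Date,sulfate,nitrate,ID\n"
    "2003-01-01,NA,NA,1\n"
    "2003-01-02,3.5,NA,1\n"
    "2003-01-03,4.5,0.5,1\n"
)
df = pd.read_csv(sample)

# "NA" is parsed as NaN by default, so isna() counts the missing entries per column.
missing = df[["sulfate", "nitrate"]].isna().sum()
print(missing)

# mean() skips NaN by default (skipna=True), so only 3.5 and 4.5 are averaged.
sulfate_mean = df["sulfate"].mean()
print(sulfate_mean)
```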

### Part 1

Write a function named `pollutantmean` that calculates the mean of a pollutant (sulfate or nitrate) across a specified list of monitors. The function `pollutantmean` takes three arguments: `mdirectory`, `pollutant`, and `id`. Given a vector of monitor ID numbers, `pollutantmean` reads those monitors' particulate matter data from the directory specified in the `mdirectory` argument and returns the mean of the pollutant across all of the monitors, ignoring any missing values coded as NA.

In [2]:
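def pollutantmean(mdirectory, pollutant, id):
    """Mean of `pollutant` pooled over the monitors in `id`, ignoring NA."""
    # Accept a single monitor ID as well as an iterable of IDs.
    if isinstance(id, int):
        id = [id]
    values = []
    for i in id:
        # File names are zero-padded to three digits, e.g. 1 -> 001.csv.
        values.append(pd.read_csv(os.path.join(mdirectory, f'{i:>03}.csv')))
    df = pd.concat(values)
    # mean() skips NaN by default, so missing readings are ignored.
    return df[pollutant].mean()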


In [3]:
pollutantmean('specdata', 'sulfate', range(1, 11))


4.064128242560359

In [4]:
pollutantmean('specdata', 'nitrate', range(70, 73))


1.706047351694915

In [5]:
pollutantmean('specdata', 'nitrate', 23)


1.2808333333333333
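
Because `pollutantmean` concatenates all rows before averaging, each valid reading counts once, so monitors with more observations weigh more. The sketch below contrasts this pooled mean with a mean of per-monitor means, using two made-up monitors; the data and variable names are illustrative assumptions, not part of the assignment.

```python
import pandas as pd

nan = float("nan")

# Two made-up monitors with different numbers of valid readings.
a = pd.DataFrame({"ID": 1, "sulfate": [1.0, 1.0, 1.0, nan]})
b = pd.DataFrame({"ID": 2, "sulfate": [5.0, nan, nan, nan]})
pooled = pd.concat([a, b], ignore_index=True)

# Pooled mean: every valid reading counts once, as in pollutantmean.
pooled_mean = pooled["sulfate"].mean()  # (1 + 1 + 1 + 5) / 4

# Mean of per-monitor means: every monitor counts once.
mean_of_means = pooled.groupby("ID")["sulfate"].mean().mean()  # (1 + 5) / 2

print(pooled_mean, mean_of_means)
```

The two numbers differ whenever monitors contribute unequal numbers of valid readings; the task statement asks for the pooled version.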

### Part 2

Write a function named complete that reads a directory full of files and reports the number of completely observed cases in each data file. The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases.

In [6]:
def complete(mdirectory, id):
    """Count the complete (no-NA) rows for each requested monitor."""
    if isinstance(id, int):
        id = [id]
    values = []
    for i in id:
        values.append(pd.read_csv(os.path.join(mdirectory, f'{i:>03}.csv')))
    # Keep only rows where every column is observed.
    df = pd.concat(values).dropna(how='any')
    # One row per monitor ID; groupby sorts the IDs in ascending order.
    return df.groupby('ID').size().to_frame('nobs')


In [7]:
complete('specdata', [1, 2])


ID,nobs
1,117
2,1041


In [8]:
complete('specdata', [2, 4, 8, 10, 12])


ID,nobs
2,1041
4,474
8,192
10,148
12,96


In [9]:
complete('specdata', range(30, 24, -1))


ID,nobs
25,463
26,586
27,338
28,475
29,711
30,932
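
The output for `range(30, 24, -1)` comes back sorted by ID because `groupby` sorts its keys, and a monitor with no complete rows would not appear at all. If the requested order or zero counts matter, a `reindex` is one option. The sketch below uses a hypothetical helper, `count_complete`, that works on an already-loaded frame with made-up data; it is not the author's function.

```python
import pandas as pd

def count_complete(df, ids):
    """Hypothetical helper: complete-case counts per monitor, in the order of `ids`."""
    ids = list(ids)
    counts = df.dropna(how="any").groupby("ID").size()
    # reindex restores the requested order; monitors with no complete rows get 0.
    return counts.reindex(ids, fill_value=0).rename("nobs").to_frame()

nan = float("nan")
# Made-up data: monitor 1 has two complete rows, monitor 2 has one, monitor 3 has none.
df = pd.DataFrame({
    "ID":      [1, 1, 1, 2, 2, 3],
    "sulfate": [1.0, 2.0, nan, 3.0, nan, nan],
    "nitrate": [0.1, 0.2, 0.3, 0.4, 0.5, nan],
})
result = count_complete(df, [3, 2, 1])
print(result)
```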


### Part 3

Write a function named `corr` that takes a directory of data files and a threshold for complete cases and calculates the correlation between sulfate and nitrate for monitor locations where the number of completely observed cases (on all variables) is greater than the threshold. The function should return a vector of correlations for the monitors that meet the threshold requirement. If no monitors meet the threshold requirement, then the function should return a numeric vector of length 0. The original assignment uses the `cor` function in R, which calculates the correlation between two vectors; the solution here uses the pandas equivalent, `Series.corr`, and returns the correlations as a data frame.

In [10]:
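def corr(mdirectory, threshold):
    """Sulfate/nitrate correlation for each monitor above the threshold."""
    values = []
    # os.listdir returns every entry in the directory, in no guaranteed order.
    for filename in os.listdir(mdirectory):
        df = pd.read_csv(os.path.join(mdirectory, filename))
        # Keep only completely observed rows.
        df = df.dropna(how='any')
        if len(df) > threshold:
            values.append(df['sulfate'].corr(df['nitrate']))
    # An empty list gives an empty DataFrame (length 0).
    return pd.DataFrame(values)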


In [11]:
v = corr('specdata', 150)
v


,0
0,-0.018958
1,-0.140513
2,-0.043897
3,-0.068160
4,-0.123507
...,...
229,0.316429
230,0.268780
231,0.279397
232,0.267261


In [12]:
v.describe()


,0
count,234.0
mean,0.125253
std,0.218957
min,-0.210568
25%,-0.049993
50%,0.094626
75%,0.268445
max,0.763129


In [13]:
v = corr('specdata', 400)
v


,0
0,-0.018958
1,-0.043897
2,-0.068160
3,-0.075888
4,0.763129
...,...
122,-0.123172
123,0.253978
124,0.268780
125,0.279397


In [14]:
v.describe()


,0
count,127.0
mean,0.139686
std,0.210523
min,-0.176233
25%,-0.031093
50%,0.100212
75%,0.268492
max,0.763129


In [15]:
v = corr('specdata', 5000)
v


In [16]:
len(v)


0
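
`os.listdir` returns names in no guaranteed order and includes any non-CSV file in the folder, so the row order of `v` can vary between machines. The sketch below shows one possible variant, under the hypothetical name `corr_by_monitor`, that sorts the listing, skips non-CSV files, and keys each correlation by file name. It is run against a throwaway directory with invented data, not against `specdata`.

```python
import os
import tempfile
import pandas as pd

def corr_by_monitor(mdirectory, threshold):
    """Hypothetical variant of corr: a Series of correlations keyed by file name."""
    result = {}
    # sorted() gives a reproducible order; the suffix check skips non-CSV files.
    for filename in sorted(os.listdir(mdirectory)):
        if not filename.endswith(".csv"):
            continue
        df = pd.read_csv(os.path.join(mdirectory, filename)).dropna(how="any")
        if len(df) > threshold:
            result[filename] = df["sulfate"].corr(df["nitrate"])
    return pd.Series(result, dtype=float)

# Throwaway directory with invented data, only to exercise the function.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"sulfate": [1.0, 2.0, 3.0, None],
                  "nitrate": [2.0, 4.0, 6.0, 1.0]}).to_csv(
        os.path.join(tmp, "001.csv"), index=False)
    with open(os.path.join(tmp, "notes.txt"), "w") as fh:
        fh.write("not a data file\n")
    some = corr_by_monitor(tmp, 2)     # 3 complete rows > 2, so 001.csv qualifies
    none = corr_by_monitor(tmp, 5000)  # no monitor qualifies, so the Series is empty

print(some)
print(len(none))
```

Keying by file name keeps each correlation traceable to its monitor, which the positional index of a plain list does not.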