## Reading Data with cuDF

As noted in the cuDF basics worksheet, the [cuDF](https://github.com/rapidsai/cudf) library enables you to create and manipulate GPU-accelerated dataframes. In this notebook we will highlight a variety of introductory cuDF dataframe functions. The primary advantage of cuDF is performance: a speedup of roughly 2x to 50x, depending on the workload.

We will start off by reading a data file into a dataframe:

* Read a Comma-Separated Values (CSV) data file with `cudf.read_csv`.
    * Argument is the name of the file to be read.
    * Assign result to a variable to store the data that was read.

In [2]:
import cudf

data = cudf.read_csv('data/gapminder_gdp_oceania.csv')
print(data)

       country  gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  \
0    Australia     10039.59564     10949.64959     12217.22686   
1  New Zealand     10556.57566     12247.39532     13175.67800   

   gdpPercap_1967  gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  \
0     14526.12465     16788.62948     18334.19751     19477.00928   
1     14463.91893     16046.03728     16233.71770     17632.41040   

   gdpPercap_1987  gdpPercap_1992  gdpPercap_1997  gdpPercap_2002  \
0     21888.88903     23424.76683     26997.93657     30687.75473   
1     19007.19129     18363.32494     21050.41377     23189.80135   

   gdpPercap_2007  
0     34435.36744  
1     25185.00911  


* The columns in a dataframe are the observed variables, and the rows are the observations.
* cuDF uses backslash `\` to show wrapped lines when output is too wide to fit the screen.

#### File Not Found

Our lessons store their data files in a `data` sub-directory, which is why the path to the file is `data/gapminder_gdp_oceania.csv`. If you forget to include `data/`, or if you include it but your copy of the file is somewhere else, you will get a runtime error that ends with a line like this:

`OSError: File b'gapminder_gdp_oceania.csv' does not exist`
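
One way to catch this earlier is to check that the file exists before reading it. The sketch below uses only the Python standard library; the helper name `find_data_file` is hypothetical and is not part of cuDF.

```python
from pathlib import Path


def find_data_file(name, data_dir="data"):
    """Return the path to `name` inside `data_dir`, or raise a clear error.

    Hypothetical helper for illustration only; it is not part of cuDF.
    """
    path = Path(data_dir) / name
    if not path.is_file():
        # Include the current directory so a wrong working directory is easy to spot.
        raise FileNotFoundError(
            f"{path} not found (current directory: {Path.cwd()})"
        )
    return path
```

The returned path could then be handed to the reader, for example `cudf.read_csv(str(find_data_file('gapminder_gdp_oceania.csv')))`.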

## Use `index_col` to specify that a column's values should be used as row headings.

* Row headings are numbers (0 and 1 in this case).
* Really want to index by country.
* Pass the name of the column to `read_csv` as its `index_col` parameter to do this.

In [3]:
data = cudf.read_csv('data/gapminder_gdp_oceania.csv', index_col='country')
print(data)

             gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  gdpPercap_1967  \
country                                                                       
Australia       10039.59564     10949.64959     12217.22686     14526.12465   
New Zealand     10556.57566     12247.39532     13175.67800     14463.91893   

             gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  gdpPercap_1987  \
country                                                                       
Australia       16788.62948     18334.19751     19477.00928     21888.88903   
New Zealand     16046.03728     16233.71770     17632.41040     19007.19129   

             gdpPercap_1992  gdpPercap_1997  gdpPercap_2002  gdpPercap_2007  
country                                                                      
Australia       23424.76683     26997.93657     30687.75473     34435.36744  
New Zealand     18363.32494     21050.41377     23189.80135     25185.00911  


## Use `DataFrame.dtypes` to find out more about a dataframe.

> Note
> The pandas `info()` method is not supported in the version of cuDF used for this notebook. Use the `dtypes` attribute instead.

In [3]:
data.dtypes

country            object
gdpPercap_1952    float64
gdpPercap_1957    float64
gdpPercap_1962    float64
gdpPercap_1967    float64
gdpPercap_1972    float64
gdpPercap_1977    float64
gdpPercap_1982    float64
gdpPercap_1987    float64
gdpPercap_1992    float64
gdpPercap_1997    float64
gdpPercap_2002    float64
gdpPercap_2007    float64
dtype: object

> In the output above, the twelve `gdpPercap` columns each hold floating point numbers (`float64`), while `country` holds text, which is reported as `object`.

## The `DataFrame.columns` variable stores information about the dataframe's columns.

* Note that this is data, *not* a method.
    * Like `math.pi`.
    * So do not use `()` to try to call it.
* Called a *member variable*, or just *member*.
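
The `math.pi` comparison can be tried directly. This short sketch uses only the standard library and shows what happens when a data member is called as if it were a method. Writing `data.columns()` would be expected to fail in a similar way.

```python
import math

# math.pi is data stored on the module, so it is read without parentheses.
circumference = 2 * math.pi * 1.0
print(circumference)

# Calling it like a function fails, because a float is not callable.
try:
    math.pi()
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)
```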

In [6]:
print(data.columns)

Index(['gdpPercap_1952', 'gdpPercap_1957', 'gdpPercap_1962', 'gdpPercap_1967',
       'gdpPercap_1972', 'gdpPercap_1977', 'gdpPercap_1982', 'gdpPercap_1987',
       'gdpPercap_1992', 'gdpPercap_1997', 'gdpPercap_2002', 'gdpPercap_2007'],
      dtype='object')


## Use `DataFrame.T` to transpose a dataframe.

* Sometimes want to treat columns as rows and vice versa.
* Transpose (written `.T`) swaps rows and columns. In cuDF this produces a new dataframe with copied data, rather than just a different view of the original.
* Like `columns`, it is a member variable.

In [7]:
print(data.T)

                  Australia  New Zealand
gdpPercap_1952  10039.59564  10556.57566
gdpPercap_1957  10949.64959  12247.39532
gdpPercap_1962  12217.22686  13175.67800
gdpPercap_1967  14526.12465  14463.91893
gdpPercap_1972  16788.62948  16046.03728
gdpPercap_1977  18334.19751  16233.71770
gdpPercap_1982  19477.00928  17632.41040
gdpPercap_1987  21888.88903  19007.19129
gdpPercap_1992  23424.76683  18363.32494
gdpPercap_1997  26997.93657  21050.41377
gdpPercap_2002  30687.75473  23189.80135
gdpPercap_2007  34435.36744  25185.00911
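
To see what transposing does conceptually, here is a plain-Python sketch using three of the columns shown above. It is an illustration of the idea only, not how cuDF implements `.T`.

```python
# Two rows and three of the columns from the table above.
column_names = ["gdpPercap_1952", "gdpPercap_1957", "gdpPercap_1962"]
rows = {
    "Australia": [10039.59564, 10949.64959, 12217.22686],
    "New Zealand": [10556.57566, 12247.39532, 13175.67800],
}

# zip(*...) groups the i-th value of every row, turning each column into a row.
transposed = dict(zip(column_names, zip(*rows.values())))

for name, values in transposed.items():
    print(name, values)
```

Each former column name is now a row label, and each row holds one value per country, in the order Australia, New Zealand.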


## Use `DataFrame.describe` to get summary statistics about data.

* `DataFrame.describe()` gets the summary statistics of only the columns that have numerical data. All other columns are ignored, unless you use the argument `include='all'`.

In [7]:
print(data.describe())

       gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  gdpPercap_1967  \
count        2.000000        2.000000        2.000000        2.000000   
mean     10298.085650    11598.522455    12696.452430    14495.021790   
std        365.560078      917.644806      677.727301       43.986086   
min      10039.595640    10949.649590    12217.226860    14463.918930   
25%      10168.840645    11274.086022    12456.839645    14479.470360   
50%      10298.085650    11598.522455    12696.452430    14495.021790   
75%      10427.330655    11922.958888    12936.065215    14510.573220   
max      10556.575660    12247.395320    13175.678000    14526.124650   

       gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  gdpPercap_1987  \
count         2.00000        2.000000        2.000000        2.000000   
mean      16417.33338    17283.957605    18554.709840    20448.040160   
std         525.09198     1485.263517     1304.328377     2037.668013   
min       16046.03728    16233.717700    17632.410

* Not particularly useful with just two records, but very helpful when there are thousands.
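
To make the table less of a black box, the `gdpPercap_1952` column can be recomputed by hand. This sketch uses Python's standard `statistics` module rather than cuDF. It assumes the sample standard deviation (dividing by n - 1) and linear interpolation for the percentiles, which is what the numbers printed above are consistent with.

```python
import statistics

# The two gdpPercap_1952 values from the table above.
gdp_1952 = [10039.59564, 10556.57566]

summary = {
    "count": len(gdp_1952),
    "mean": statistics.mean(gdp_1952),
    "std": statistics.stdev(gdp_1952),  # sample standard deviation (n - 1)
    "min": min(gdp_1952),
    "max": max(gdp_1952),
}

# method="inclusive" interpolates linearly between the smallest and largest values.
q25, q50, q75 = statistics.quantiles(gdp_1952, n=4, method="inclusive")
summary.update({"25%": q25, "50%": q50, "75%": q75})

for name, value in summary.items():
    print(f"{name:>5}  {value:.6f}")
```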

## Questions

#### Q1: Reading Other Data

Read the data in `gapminder_gdp_americas.csv` (which should be in the same directory as `gapminder_gdp_oceania.csv`) into a variable called `americas` and display its summary statistics.

##### Solution

In [None]:
%load solutions/062_solution_01.py

#### Q2: Inspecting Data 

After reading the data for the Americas, use `help(americas.head)` and `help(americas.tail)` to find out what `DataFrame.head` and `DataFrame.tail` do.

1. What method call will display the first three rows of this data?
2. What method call will display the last three columns of this data? (Hint: you may need to change your view of the data)

##### Solution

In [3]:
%load solutions/062_solution_02.py

#### Q3: Reading Files in Other Directories

The data for your current project is stored in a file called `microbes.csv`, which is located in a folder called `field_data`. You are doing analysis in a notebook called `analysis.ipynb` in a sibling folder called `thesis`:

In [None]:
your home directory
+-- field_data/
|    +-- microbes.csv
+-- thesis/
     +-- analysis.ipynb

What value(s) should you pass to `read_csv` to read `microbes.csv` in `analysis.ipynb`?

##### Solution

In [None]:
%load solutions/062_solution_03.py

#### Q4: Writing Data

As well as the `read_csv` function for reading data from a file, cuDF (like pandas) provides a `to_csv` method to write dataframes to files. Applying what you've learned about reading from files, write one of your dataframes to a file called `processed.csv`. You can use `help` to get information on how to use `to_csv`.

##### Solution

In [None]:
%load solutions/062_solution_04.py