# Lesson 1: Reading and Summarizing CSV Data

## Intro to Pandas

- `pandas` = Python Data Analysis Library (https://pandas.pydata.org/)
- With `pandas` you can do pretty much everything you would in a spreadsheet, plus a whole lot more!

- Why Python + Pandas for spreadsheet data?
  - Working with huge data files and complex calculations
  - Dealing with messy and missing data
  - Merging data from multiple files
  - Timeseries analysis
  - Automating repetitive tasks
  - Huge variety of fully customized graphs for visualizing data

Import the `pandas` library and give it the nickname `pd`

In [1]:
import pandas as pd

*Note: For learning purposes, we are importing each library as we introduce it. In general, it's good practice to collect all your `import` statements together and put them at the start of the notebook.*

# Reading a CSV file

Let's look at the file `weather_yvr.csv` in the sub-folder `data`

- 24 hours of weather measurements at Vancouver Airport
- View it in Jupyter Lab's CSV viewer

We will read the CSV file into our notebook with the function `pd.read_csv`:
- Try typing `pd.re` and then press `Tab` and select `read_csv` from the auto-complete options
- Our input to the `read_csv` function is the file path (including sub-folder) as a string: `'data/weather_yvr.csv'`

In [2]:
pd.read_csv('data/weather_yvr.csv')

Unnamed: 0,Datetime,Hour of Day,Conditions,Temperature (C),Relative Humidity (%)
0,2018-05-21 22:00:00,22,Mainly Clear,14.8,75.0
1,2018-05-21 23:00:00,23,Clear,13.5,76.0
2,2018-05-22 00:00:00,0,Clear,13.1,77.0
3,2018-05-22 01:00:00,1,Clear,12.9,84.0
4,2018-05-22 02:00:00,2,Clear,12.2,88.0
5,2018-05-22 03:00:00,3,Clear,12.0,87.0
6,2018-05-22 04:00:00,4,Clear,11.9,88.0
7,2018-05-22 05:00:00,5,Fog,10.4,97.0
8,2018-05-22 06:00:00,6,Mainly Sunny,11.0,91.0
9,2018-05-22 07:00:00,7,Partly Cloudy,13.0,92.0


- We've displayed the data, but we can't do anything further with this on-screen display
- We need to store the data in a variable

Store the output of `pd.read_csv` in a variable `weather_yvr`:

In [3]:
weather_yvr = pd.read_csv('data/weather_yvr.csv')

In [4]:
weather_yvr

Unnamed: 0,Datetime,Hour of Day,Conditions,Temperature (C),Relative Humidity (%)
0,2018-05-21 22:00:00,22,Mainly Clear,14.8,75.0
1,2018-05-21 23:00:00,23,Clear,13.5,76.0
2,2018-05-22 00:00:00,0,Clear,13.1,77.0
3,2018-05-22 01:00:00,1,Clear,12.9,84.0
4,2018-05-22 02:00:00,2,Clear,12.2,88.0
5,2018-05-22 03:00:00,3,Clear,12.0,87.0
6,2018-05-22 04:00:00,4,Clear,11.9,88.0
7,2018-05-22 05:00:00,5,Fog,10.4,97.0
8,2018-05-22 06:00:00,6,Mainly Sunny,11.0,91.0
9,2018-05-22 07:00:00,7,Partly Cloudy,13.0,92.0
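
If you don't have `data/weather_yvr.csv` at hand, you can try the same steps on a small in-memory CSV. This is only a minimal sketch: the `csv_text` string and the variable name `weather_demo` are made up for illustration (the values are copied from the first three rows above), and `io.StringIO` simply lets `pd.read_csv` treat a string like a file.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk (first three rows of the data above)
csv_text = """Datetime,Hour of Day,Conditions,Temperature (C),Relative Humidity (%)
2018-05-21 22:00:00,22,Mainly Clear,14.8,75.0
2018-05-21 23:00:00,23,Clear,13.5,76.0
2018-05-22 00:00:00,0,Clear,13.1,77.0
"""

# read_csv accepts a file path or a file-like object;
# storing the result in a variable lets us keep working with it
weather_demo = pd.read_csv(io.StringIO(csv_text))
print(weather_demo)
```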


What type of variable is `weather_yvr`?

In [5]:
type(weather_yvr)

pandas.core.frame.DataFrame

- `weather_yvr` is a **DataFrame**, a data type from the `pandas` library
  - A DataFrame is a 2-dimensional array (like a table in a spreadsheet)

- When we display `weather_yvr`, the integer numbers in bold on the left are the DataFrame's **index**
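
To look at the index on its own, a DataFrame has an `index` attribute. The sketch below uses a small made-up table (`demo` is a hypothetical name, not part of the lesson data) to show the default index, which simply numbers the rows starting from 0.

```python
import pandas as pd

# Hypothetical two-column table for illustration
demo = pd.DataFrame({
    'Conditions': ['Clear', 'Fog', 'Sunny'],
    'Temperature (C)': [13.5, 10.4, 19.7],
})

# By default, pandas labels the rows 0, 1, 2, ...
print(demo.index)

# The labels can be converted to a plain Python list
print(list(demo.index))
```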

Using `print` to display `weather_yvr` looks a bit different than the IPython display shown in the previous slide

In [6]:
print(weather_yvr)

               Datetime  Hour of Day     Conditions  Temperature (C)  \
0   2018-05-21 22:00:00           22   Mainly Clear             14.8   
1   2018-05-21 23:00:00           23          Clear             13.5   
2   2018-05-22 00:00:00            0          Clear             13.1   
3   2018-05-22 01:00:00            1          Clear             12.9   
4   2018-05-22 02:00:00            2          Clear             12.2   
5   2018-05-22 03:00:00            3          Clear             12.0   
6   2018-05-22 04:00:00            4          Clear             11.9   
7   2018-05-22 05:00:00            5            Fog             10.4   
8   2018-05-22 06:00:00            6   Mainly Sunny             11.0   
9   2018-05-22 07:00:00            7  Partly Cloudy             13.0   
10  2018-05-22 08:00:00            8   Mainly Sunny             14.8   
11  2018-05-22 09:00:00            9  Partly Cloudy             15.8   
12  2018-05-22 10:00:00           10  Partly Cloudy             

# Data at a Glance

`pandas` provides many ways to quickly and easily summarize your data:
- How many rows and columns are there?
- What are all the column names and what type of data is in each column?

- Numerical data: What is the average and range of the values?
- Text data: What are the unique values and how often does each occur?
- How many values are missing in each column?

DataFrame methods:

In [7]:
weather_yvr.head?

Signature: weather_yvr.head(n=5)
Docstring:
Return the first n rows.

Parameters
----------
n : int, default 5
    Number of rows to select.

Returns
-------
obj_head : type of caller
    The first n rows of the caller object.
File:      c:\users\jenfl\anaconda3\lib\site-packages\pandas\core\generic.py
Type:      method


- Type `weather_yvr.` followed by `Tab` to see other methods available for the DataFrame

In [8]:
weather_yvr.head()

Unnamed: 0,Datetime,Hour of Day,Conditions,Temperature (C),Relative Humidity (%)
0,2018-05-21 22:00:00,22,Mainly Clear,14.8,75.0
1,2018-05-21 23:00:00,23,Clear,13.5,76.0
2,2018-05-22 00:00:00,0,Clear,13.1,77.0
3,2018-05-22 01:00:00,1,Clear,12.9,84.0
4,2018-05-22 02:00:00,2,Clear,12.2,88.0


In [9]:
weather_yvr.head(3)

Unnamed: 0,Datetime,Hour of Day,Conditions,Temperature (C),Relative Humidity (%)
0,2018-05-21 22:00:00,22,Mainly Clear,14.8,75.0
1,2018-05-21 23:00:00,23,Clear,13.5,76.0
2,2018-05-22 00:00:00,0,Clear,13.1,77.0


In [10]:
weather_yvr.tail()

Unnamed: 0,Datetime,Hour of Day,Conditions,Temperature (C),Relative Humidity (%)
19,2018-05-22 17:00:00,17,Mainly Sunny,20.1,64.0
20,2018-05-22 18:00:00,18,Sunny,19.7,59.0
21,2018-05-22 19:00:00,19,Sunny,19.6,67.0
22,2018-05-22 20:00:00,20,Sunny,18.4,70.0
23,2018-05-22 21:00:00,21,Clear,16.1,84.0


In [11]:
weather_yvr.sample(4)

Unnamed: 0,Datetime,Hour of Day,Conditions,Temperature (C),Relative Humidity (%)
23,2018-05-22 21:00:00,21,Clear,16.1,84.0
10,2018-05-22 08:00:00,8,Mainly Sunny,14.8,82.0
4,2018-05-22 02:00:00,2,Clear,12.2,88.0
21,2018-05-22 19:00:00,19,Sunny,19.6,67.0
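
Because `sample` picks rows at random, you will likely see different rows each time you run it. The sketch below, on a made-up 24-row table (`demo` is a hypothetical name), compares `head`, `tail`, and `sample`, and uses the optional `random_state` argument to make the random pick repeatable.

```python
import pandas as pd

# Hypothetical table with 24 hourly rows, labelled 0 to 23
demo = pd.DataFrame({'Hour of Day': range(24)})

first_three = demo.head(3)   # first 3 rows
last_two = demo.tail(2)      # last 2 rows

# Passing the same random_state gives the same "random" rows each time
picked = demo.sample(4, random_state=0)
picked_again = demo.sample(4, random_state=0)

print(first_three)
print(last_two)
print(picked)
```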


Number of rows and columns:

In [12]:
weather_yvr.shape

(24, 5)

- The DataFrame `weather_yvr` has 24 rows and 5 columns
- The index does not count as a column
- Notice there are no parentheses at the end of `weather_yvr.shape`
- `shape` is an **attribute** of the variable `weather_yvr`

We can save `weather_yvr.shape` as a variable:

In [13]:
shape_yvr = weather_yvr.shape
print(shape_yvr)

(24, 5)


In [14]:
type(shape_yvr)

tuple

A tuple is another data type, similar to a list
- Items are enclosed in `()` instead of `[]`
- Tuples are immutable: you can't modify individual items inside a tuple
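
As a quick illustration of working with the `shape` tuple, here is a minimal sketch on a made-up DataFrame (`demo`, `n_rows`, and `n_cols` are hypothetical names). It unpacks the tuple into two variables, reads an item by position, and shows that trying to change an item raises a `TypeError`.

```python
import pandas as pd

# Hypothetical data: 3 rows, 2 columns
demo = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

# shape is a tuple: (number of rows, number of columns)
n_rows, n_cols = demo.shape
print(n_rows, n_cols)

# Items can be read by position, just like a list
shape_demo = demo.shape
print(shape_demo[0])

# ...but they can't be changed: assigning to an item raises a TypeError
try:
    shape_demo[0] = 100
    modified = True
except TypeError:
    modified = False
print(modified)
```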

Unlike a list, which can contain items of different types, each column of a DataFrame must contain items of the same type.

We can find out the names and data types of each column from the `dtypes` attribute:

In [15]:
weather_yvr.dtypes

Datetime                  object
Hour of Day                int64
Conditions                object
Temperature (C)          float64
Relative Humidity (%)    float64
dtype: object

- In a `pandas` DataFrame, a column containing text data (or containing a mix of text and numbers) is assigned a `dtype` of `object` and is treated as a column of strings

- `int64` and `float64` are integer and float, respectively
  - The `64` at the end means that they are stored as 64-bit numbers in memory
  - These data types are equivalent to `int` and `float` in Python (`pandas` is just a bit more explicit in how it names them)
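
If you want to check a column's type in code rather than by eye, `pandas` provides helper functions under `pd.api.types`. The following is a small sketch on made-up CSV text (`csv_text` and `demo` are hypothetical), not something the lesson data requires.

```python
import io

import pandas as pd

# Hypothetical CSV text with one integer column and one float column
csv_text = """Hour of Day,Temperature (C)
22,14.8
23,13.5
"""
demo = pd.read_csv(io.StringIO(csv_text))

# Names and data types of each column
print(demo.dtypes)

# Helper functions that answer "is this column an integer / float column?"
hour_is_int = pd.api.types.is_integer_dtype(demo['Hour of Day'])
temp_is_float = pd.api.types.is_float_dtype(demo['Temperature (C)'])
print(hour_is_int, temp_is_float)
```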

If we just want the column names, we can use the `columns` attribute:

In [16]:
weather_yvr.columns

Index(['Datetime', 'Hour of Day', 'Conditions', 'Temperature (C)',
       'Relative Humidity (%)'],
      dtype='object')
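
Strictly speaking, `columns` gives back a `pandas` `Index` object rather than a Python list. If a plain list is more convenient, one option is to wrap it in `list()`, as in this sketch on a made-up one-row table (`demo` and `column_names` are hypothetical names):

```python
import pandas as pd

# Hypothetical one-row table for illustration
demo = pd.DataFrame({'Conditions': ['Clear'], 'Temperature (C)': [13.5]})

# columns is an Index object
print(demo.columns)

# Convert it to a regular Python list of strings
column_names = list(demo.columns)
print(column_names)
```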

# Simple Summary Statistics

In [17]:
weather_yvr.describe()

Unnamed: 0,Hour of Day,Temperature (C),Relative Humidity (%)
count,24.0,24.0,24.0
mean,11.5,15.608333,77.166667
std,7.071068,3.119492,10.470109
min,0.0,10.4,59.0
25%,5.75,12.975,67.75
50%,11.5,15.85,77.0
75%,17.25,18.5,84.75
max,23.0,20.1,97.0


- The `describe` method is a way to quickly summarize the averages, extremes, and variability of each numerical data column
- You can look at each statistic individually with methods such as `mean`, `median`, `min`, `max`, and `std`
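
The output of `describe` is itself a DataFrame, so a single statistic can be pulled out of it by row label and column name. The sketch below uses made-up temperatures; the names `demo`, `summary`, and `temps` are hypothetical, and selecting one column with square brackets is shown here only as an assumption about what you may want to try next.

```python
import pandas as pd

# Hypothetical temperature readings
demo = pd.DataFrame({'Temperature (C)': [10.0, 12.0, 14.0, 16.0]})

summary = demo.describe()
print(summary)

# Look up one statistic by row label ('mean') and column name
print(summary.loc['mean', 'Temperature (C)'])

# The same statistics are available as individual methods on a column
temps = demo['Temperature (C)']
print(temps.mean(), temps.median(), temps.min(), temps.max())
```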

In [18]:
weather_yvr.mean()

Hour of Day              11.500000
Temperature (C)          15.608333
Relative Humidity (%)    77.166667
dtype: float64
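
The output above comes from an older `pandas` release, which quietly skipped the text columns. In more recent versions (2.0 and later, as far as we know), `mean()` no longer skips text columns automatically and is expected to raise an error instead. A sketch that should behave the same way in either case passes `numeric_only=True` explicitly; the table `demo` below is made up for illustration.

```python
import pandas as pd

# Hypothetical table mixing one text column with two numeric columns
demo = pd.DataFrame({
    'Conditions': ['Clear', 'Fog', 'Sunny'],
    'Temperature (C)': [12.0, 10.0, 20.0],
    'Relative Humidity (%)': [80.0, 97.0, 60.0],
})

# Only the numeric columns are averaged; 'Conditions' is left out
means = demo.mean(numeric_only=True)
print(means)
```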

In [19]:
weather_yvr.max()

Datetime                 2018-05-22 21:00:00
Hour of Day                               23
Conditions                             Sunny
Temperature (C)                         20.1
Relative Humidity (%)                     97
dtype: object

The `max` method includes string data from the 'Datetime' and 'Conditions' columns in its calculations, which probably isn't what we want.

Let's check out the documentation for `max`:

In [20]:
weather_yvr.max?

Signature: weather_yvr.max(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
Docstring:
This method returns the maximum of the values in the object.
            If you want the *index* of the maximum, use ``idxmax``. This is
            the equivalent of the ``numpy.ndarray`` method ``argmax``.

Parameters
----------
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values when computing the result.
level : int or level name, default None
    If the axis is a MultiIndex (hierarchical), count along a
    particular level, collapsing into a Series
numeric_only : boolean, default None
    Include only float, int, boolean columns. If None, will attempt to use
    everything, the

We can use the **keyword argument** `numeric_only` with a Boolean value of `True` to include only the numeric columns:

In [21]:
weather_yvr.max(numeric_only=True)

Hour of Day              23.0
Temperature (C)          20.1
Relative Humidity (%)    97.0
dtype: float64
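
The docstring above also mentions `idxmax`, which returns the index label of the maximum rather than the maximum itself. Here is a minimal sketch on a made-up table (`demo`, `temps`, and `warmest_label` are hypothetical names) showing how that label can be used to look up the whole row with `loc`.

```python
import pandas as pd

# Hypothetical afternoon readings
demo = pd.DataFrame({
    'Hour of Day': [15, 16, 17],
    'Temperature (C)': [19.0, 20.1, 19.7],
})

# Index label of the row with the highest temperature
temps = demo['Temperature (C)']
warmest_label = temps.idxmax()
print(warmest_label)

# Use the label to display the full row
print(demo.loc[warmest_label])
```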

Auto-complete works for keyword arguments too! You can start typing `nu` inside `weather_yvr.max()`, then press `Tab` and see what happens.

Try pressing `Shift` and `Tab` together after you type the opening `(` in `weather_yvr.max()`
- A little window pops up showing a shortened version of the documentation, including the list of available keyword arguments

# Lesson 1 Recap

### Importing `pandas` Library
```
import pandas as pd
```
- Libraries only need to be imported once in a notebook
- It's good practice to consolidate all your `import` commands together near the start of your notebook

### Reading a CSV File

To read a CSV file and store it as a DataFrame variable:
```
df = pd.read_csv('some_cool_data.csv')
```

### Quick and Easy Summaries of a DataFrame

Number of rows and columns (rows first, columns second): 
```
df.shape
```

Names and data types of each column: 
```
df.dtypes
```
Just the names of each column:
```
df.columns
```

#### Rows at a Glance

- First 5 rows:
```
df.head()
```
- Last 5 rows:
```
df.tail()
```
- A random sampling (1 row):
```
df.sample()
```
- The number of rows can be specified as an input to any of the above methods (e.g. `df.tail(7)` returns the last 7 rows)

#### Summary Statistics

Full set of summary statistics (min, max, mean, standard deviation, etc.) for each column of a DataFrame:
```
df.describe()
```

Mean value of each column:
```
df.mean()
```

And similarly for other summary statistics: `df.min()`, `df.max()`, `df.median()`, `df.std()`

Optional keyword argument to `min` and `max` methods, to include only numerical data columns:
```
df.max(numeric_only=True)
```

# Exercise 1

a) Read data for Saskatoon Airport from `'data/weather_yxe.csv'` into a new variable `weather_yxe` and display the first 7 rows.

b) How many rows and columns does `weather_yxe` have? 

c) What are the names and data types of the columns?

d) What are the minimum and maximum relative humidity in this data?

##### Bonus exercises

e) What is the mean wind speed during the first 8 hours (first 8 rows) of data?

f) What are the minimum and maximum relative humidity during the last 10 hours of data?

g) Display summary statistics for all columns for a random sampling of 12 hours of data.


# Interlude: So What?

- Why bother with any of this? 
- Isn't it easier to do all these tasks in Excel?
- Why would I care about the shape of a DataFrame, or printing out the column names and data types? 
  - Can't I just look at a spreadsheet and this information is obvious, without writing any code?

There are plenty of cases in which Python is overkill and Excel (or another application) is the perfect tool for the job.

In many other cases, for example when your data is large and unwieldy, even simple Python commands can hugely simplify tasks that would be very difficult and time-consuming to do within a spreadsheet.

- For an example, let's look at: `Example Notebook - Python Developers Survey`