# Lesson 1: Reading and Summarizing CSV Data

To jump to the recap, click [here](#recap)

## Intro to Pandas

- `pandas` = Python Data Analysis Library (https://pandas.pydata.org/)
- With `pandas` you can do pretty much everything you would in a spreadsheet, plus a whole lot more!

- Why Python + Pandas for spreadsheet data?
  - Working with huge data files and complex calculations
  - Dealing with messy and missing data
  - Merging data from multiple files
  - Timeseries analysis
  - Automating repetitive tasks
  - Huge variety of fully customized graphs for visualizing data

Import the `pandas` library and give it the nickname `pd`

In [None]:
import pandas as pd

*Note: For learning purposes, we are importing each library as we introduce it. In general, it's good practice to collect all your `import` statements together and put them at the start of the notebook.*

# Reading a CSV File

Let's look at the file `weather_yvr.csv` in the sub-folder `data`

- 24 hours of weather measurements at Vancouver Airport
- View it in Jupyter Lab's CSV viewer

We will read the CSV file into our notebook with the function `pd.read_csv`:
- Try typing `pd.re` and then press `Tab` and select `read_csv` from the auto-complete options
- Our input to the `read_csv` function is the file path (including sub-folder) as a string: `'data/weather_yvr.csv'`

In [None]:
pd.read_csv('data/weather_yvr.csv')

- We've displayed the data, but we can't do anything further with this on-screen display
- We need to store the data in a variable

Store the output of `pd.read_csv` in a variable `weather_yvr`:

In [None]:
weather_yvr = pd.read_csv('data/weather_yvr.csv')

In [None]:
weather_yvr

What type of variable is `weather_yvr`?

In [None]:
type(weather_yvr)

- `weather_yvr` is a **DataFrame**, a data type from the `pandas` library
  - A DataFrame is a 2-dimensional array (like a table in a spreadsheet)

- When we display `weather_yvr`, the integer numbers in bold on the left are the DataFrame's **index**
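To try the same steps without the data file, here is a minimal sketch that reads a small CSV from an in-memory string. The CSV text and the `Temperature` column are made up for illustration; only the `Datetime` and `Conditions` column names come from this lesson.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk (values are made up)
csv_text = """Datetime,Conditions,Temperature
2018-05-21 00:00,Clear,14.2
2018-05-21 01:00,Cloudy,13.8
2018-05-21 02:00,Rain,13.1
"""

# read_csv accepts a file path or a file-like object such as StringIO
demo = pd.read_csv(io.StringIO(csv_text))

print(type(demo))        # the DataFrame type from pandas
print(list(demo.index))  # default index: integers counting up from 0
```

The index is created automatically when the file is read; it is not one of the columns in the CSV text.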

Using `print` to display `weather_yvr` looks a bit different than the IPython display shown in the previous slide

In [None]:
print(weather_yvr)

# Data at a Glance

`pandas` provides many ways to quickly and easily summarize your data:
- How many rows and columns are there?
- What are all the column names and what type of data is in each column?

- Numerical data: What is the average and range of the values?
- Text data: What are the unique values and how often does each occur?
- How many values are missing in each column?
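As a rough preview of the last two questions, the sketch below uses a small made-up DataFrame. Selecting one column with `demo["Conditions"]`, the `value_counts` method, and the `isna` method are not covered elsewhere in this lesson, so treat this as one possible approach rather than the lesson's own method.

```python
import pandas as pd

# Hypothetical data; 'Conditions' mirrors a column name used in this lesson
demo = pd.DataFrame({
    "Conditions": ["Clear", "Rain", "Clear", "Cloudy", "Clear"],
    "Temperature": [14.2, 13.1, 15.0, None, 14.6],
})

# How often does each unique text value occur?
condition_counts = demo["Conditions"].value_counts()
print(condition_counts)

# How many values are missing in each column?
missing_counts = demo.isna().sum()
print(missing_counts)
```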

DataFrame methods:

In [None]:
weather_yvr.head?

- Type `weather_yvr.` followed by `Tab` to see other methods available for the DataFrame

In [None]:
weather_yvr.head()

In [None]:
weather_yvr.head(3)

In [None]:
weather_yvr.tail()

In [None]:
weather_yvr.sample(4)
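The sketch below shows which rows each of these methods returns, using a made-up 24-row DataFrame (the `Hour` and `Temperature` columns are hypothetical) so the results are easy to check by eye.

```python
import pandas as pd

# Hypothetical 24-row DataFrame standing in for hourly measurements
demo = pd.DataFrame({
    "Hour": range(24),
    "Temperature": [10.0 + 0.5 * h for h in range(24)],
})

first_three = demo.head(3)    # the first 3 rows
last_five = demo.tail()       # with no input, the default is 5 rows
random_four = demo.sample(4)  # 4 different rows chosen at random

print(list(first_three["Hour"]))
print(list(last_five["Hour"]))
print(len(random_four))
```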

Number of rows and columns:

In [None]:
weather_yvr.shape

- The DataFrame `weather_yvr` has 24 rows and 5 columns
- The index does not count as a column
- Notice there are no parentheses at the end of `weather_yvr.shape`
- `shape` is an **attribute** of the variable `weather_yvr`

We can save `weather_yvr.shape` as a variable:

In [None]:
shape_yvr = weather_yvr.shape
print(shape_yvr)

In [None]:
type(shape_yvr)

A tuple is another data type, similar to a list
- Items are enclosed in `()` instead of `[]`
- Tuples are immutable: you can't modify individual items inside a tuple
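A short sketch of both points, using a small made-up DataFrame (the column names and values are hypothetical): because `shape` is a tuple, it can be unpacked into two variables, and trying to change one of its items raises an error.

```python
import pandas as pd

# Hypothetical DataFrame with 3 rows and 2 columns
demo = pd.DataFrame({
    "Temperature": [14.2, 13.8, 13.1],
    "Humidity": [71, 74, 80],
})

# shape is a tuple, so it can be unpacked into two variables
n_rows, n_cols = demo.shape
print(n_rows, n_cols)

# Tuples are immutable: assigning to an item raises a TypeError
shape_demo = demo.shape
try:
    shape_demo[0] = 100
except TypeError as err:
    print("Cannot modify a tuple:", err)
```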

Unlike a list, which can contain items of different types, each column of a DataFrame has a single data type that applies to every item in the column.

We can find out the names and data types of each column from the `dtypes` attribute:

In [None]:
weather_yvr.dtypes

- In a `pandas` DataFrame, a column containing text data (or containing a mix of text and numbers) is assigned a `dtype` of `object` and is treated as a column of strings

- `int64` and `float64` are integer and float, respectively
  - The `64` at the end means that they are stored as 64-bit numbers in memory
  - These data types correspond to `int` and `float` in Python (`pandas` is just a bit more explicit in how it names them)
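To see these data types side by side, here is a minimal sketch with one made-up column of each kind (the column names and values are hypothetical). The exact name reported for the text column can depend on the `pandas` version: `object` is the traditional one, and some newer versions may report a dedicated string type instead.

```python
import pandas as pd

# Hypothetical DataFrame with one column of each kind
demo = pd.DataFrame({
    "Conditions": ["Clear", "Cloudy", "Rain"],  # text
    "Humidity": [71, 74, 80],                   # whole numbers
    "Temperature": [14.2, 13.8, 13.1],          # decimal numbers
})

print(demo.dtypes)

# The data type of a single column can be looked up by its name
humidity_dtype = demo.dtypes["Humidity"]
temperature_dtype = demo.dtypes["Temperature"]
print(humidity_dtype, temperature_dtype)
```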

If we just want a list of the column names, we can use the `columns` attribute:

In [None]:
weather_yvr.columns
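Strictly speaking, `columns` returns a `pandas` `Index` object rather than a Python list. If a plain list is needed, one option is to wrap it in `list()`, as in this sketch with a made-up one-row DataFrame:

```python
import pandas as pd

# Hypothetical one-row DataFrame; values are made up
demo = pd.DataFrame({
    "Datetime": ["2018-05-21 00:00"],
    "Conditions": ["Clear"],
    "Temperature": [14.2],
})

print(demo.columns)                # an Index object
column_names = list(demo.columns)  # converted to a plain Python list
print(column_names)
```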

# Simple Summary Statistics

In [None]:
weather_yvr.describe()

- The `describe` method is a way to quickly summarize the averages, extremes, and variability of each numerical data column
- You can look at each statistic individually with methods such as `mean`, `median`, `min`, `max`, and `std`
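As a small check of how the two approaches relate, the sketch below uses a made-up DataFrame with numeric columns only (names and values are hypothetical). Each row of the `describe` table should match the output of the method with the same name.

```python
import pandas as pd

# Hypothetical numeric-only data
demo = pd.DataFrame({
    "Temperature": [12.0, 14.0, 16.0],
    "Humidity": [60, 70, 80],
})

summary = demo.describe()
print(summary)

# Individual statistics: compare with the 'mean' and 'std' rows above
print(demo.mean())
print(demo.std())
```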

In [None]:
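# numeric_only=True skips the text columns (pandas 2.0 and later raise a TypeError without it)
weather_yvr.mean(numeric_only=True)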

In [None]:
weather_yvr.max()

The `max` method includes string data from the 'Datetime' and 'Conditions' columns in its calculations, which probably isn't what we want.

Let's check out the documentation for `max`:

In [None]:
weather_yvr.max?

We can use the **keyword argument** `numeric_only` with a Boolean value of `True` to include only the numeric columns:

In [None]:
weather_yvr.max(numeric_only=True)
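Here is a minimal sketch of the same keyword argument on a small made-up DataFrame (the `Temperature` and `Humidity` columns and all values are hypothetical). The text column is left out of the result, and the same argument works for `mean`.

```python
import pandas as pd

# Hypothetical data with one text column and two numeric columns
demo = pd.DataFrame({
    "Conditions": ["Clear", "Rain", "Cloudy"],
    "Temperature": [14.2, 13.1, 15.0],
    "Humidity": [71, 80, 65],
})

# Only the numeric columns appear in the results
max_numeric = demo.max(numeric_only=True)
mean_numeric = demo.mean(numeric_only=True)

print(max_numeric)
print(mean_numeric)
```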

Auto-complete works for keyword arguments too! You can start typing `nu` inside `weather_yvr.max()`, then press `Tab` and see what happens.

Try pressing `Shift` and `Tab` together after you type the opening `(` in `weather_yvr.max()`
- A little window pops up showing a shortened version of the documentation, including the list of available keyword arguments

<a id="recap"></a>
# Lesson 1 Recap

### Importing `pandas` Library
```
import pandas as pd
```
- Libraries only need to be imported once in a notebook
- It's good practice to consolidate all your `import` commands together near the start of your notebook

### Reading a CSV File

To read a CSV file and store it as a DataFrame variable:
```
df = pd.read_csv('some_cool_data.csv')
```

### Quick and Easy Summaries of a DataFrame

Number of rows and columns (rows first, columns second): 
```
df.shape
```

Names and data types of each column: 
```
df.dtypes
```
Just the names of each column:
```
df.columns
```

#### Rows at a Glance

- First 5 rows:
```
df.head()
```
- Last 5 rows:
```
df.tail()
```
- A random sampling (1 row):
```
df.sample()
```
- The number of rows can be specified as an input to any of the above methods (e.g. `df.tail(7)` returns the last 7 rows)

#### Summary Statistics

Full set of summary statistics (min, max, mean, standard deviation, etc.) for each column of a DataFrame:
```
df.describe()
```

Mean value of each column:
```
df.mean()
```

And similarly for other summary statistics: `df.min()`, `df.max()`, `df.median()`, `df.std()`

Optional keyword argument to include only numerical data columns (shown here for `max`; it also works for `min`, `mean`, and the other summary statistics):
```
df.max(numeric_only=True)
```

# Exercise 1

a) Read data for Saskatoon Airport from `'data/weather_yxe.csv'` into a new variable `weather_yxe` and display the first 7 rows.

b) How many rows and columns does `weather_yxe` have? 

c) What are the names and data types of the columns?

d) What are the minimum and maximum relative humidity in this data?

##### Bonus exercises

e) What is the mean wind speed during the first 8 hours (first 8 rows) of data?

f) What are the minimum and maximum relative humidity during the last 10 hours of data?

g) Display summary statistics for all columns for a random sampling of 12 hours of data.

a) Read data for Saskatoon Airport from `'data/weather_yxe.csv'` into a new variable `weather_yxe` and display the first 7 rows. 

In [None]:
weather_yxe = pd.read_csv('data/weather_yxe.csv')
weather_yxe.head(7)

- Compare the data file in the CSV viewer with the table displayed here
- `NaN` means "not a number", i.e. missing data
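The sketch below shows where `NaN` comes from, using made-up CSV text with one empty field (the `Wind Speed` column name and all values are assumptions for illustration). It also shows that a column with missing values is read as `float64`, and that summary statistics skip missing values by default.

```python
import io

import pandas as pd

# Hypothetical CSV text; the second row has an empty 'Wind Speed' field
csv_text = """Datetime,Wind Speed
2018-05-21 00:00,10
2018-05-21 01:00,
2018-05-21 02:00,20
"""

demo = pd.read_csv(io.StringIO(csv_text))
print(demo)         # the empty field is displayed as NaN
print(demo.dtypes)  # 'Wind Speed' is float64 because of the missing value

# count() reports the number of non-missing values in each column
print(demo.count())

# The mean is calculated from the two non-missing values only
wind_mean = demo["Wind Speed"].mean()
print(wind_mean)
```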

b) How many rows and columns does `weather_yxe` have? 

In [None]:
weather_yxe.shape

24 rows and 12 columns

c) What are the names and data types of the columns?

In [None]:
weather_yxe.dtypes

d) What are the minimum and maximum relative humidity in this data?

We can use `max` and `min`, or `describe`. Here's the `describe` version:

In [None]:
weather_yxe.describe()

The minimum relative humidity is 23% and the maximum is 79%.

e) We can use `describe` or `mean` methods. Here is an example solution using `mean`.

In [None]:
first_eight = weather_yxe.head(8)
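# numeric_only=True skips any text columns (pandas 2.0 and later raise a TypeError without it)
first_eight.mean(numeric_only=True)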

The mean wind speed in the first 8 hours is 14.625 km/hr.

f) We could create a new variable `last_ten = weather_yxe.tail(10)`, as in the previous example, or we can skip that step and "chain" the methods together:

In [None]:
weather_yxe.tail(10).describe()

The minimum and maximum relative humidity during the last 10 hours are 23% and 51%, respectively.
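To confirm that chaining gives the same answer as the two-step version, here is a minimal sketch on a made-up single-column DataFrame (the `Humidity` values are hypothetical, not taken from the data file):

```python
import pandas as pd

# Hypothetical data
demo = pd.DataFrame({"Humidity": [79, 60, 51, 40, 23, 35]})

# Two steps, with an intermediate variable
last_three = demo.tail(3)
two_step = last_three.max()

# One step: each method is applied to the result of the one before it
chained = demo.tail(3).max()

print(two_step.equals(chained))
```

Chaining saves a variable name, while the two-step version makes it easier to inspect the intermediate result.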

g) Since the `sample` method gives a different random sampling each time, the numbers will be different in your output compared to the example solution below.

In [None]:
weather_yxe.sample(12).describe()
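If you need the same random sample every time (for example, to compare answers with someone else), `sample` accepts an optional `random_state` keyword argument. This argument is not used elsewhere in this lesson; the sketch below uses a made-up DataFrame and an arbitrary value of `0`.

```python
import pandas as pd

# Hypothetical 24-row DataFrame standing in for hourly measurements
demo = pd.DataFrame({"Hour": range(24)})

# With the same random_state, sample returns the same rows every time
sample_a = demo.sample(12, random_state=0)
sample_b = demo.sample(12, random_state=0)

print(sample_a.equals(sample_b))
print(sample_a.describe())
```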

# Interlude: So What?

- Why bother with any of this? 
- Isn't it easier to do all these tasks in Excel?
- Why would I care about the shape of a DataFrame, or printing out the column names and data types? 
  - Can't I just look at a spreadsheet, where this information is obvious, without writing any code?

There are plenty of cases in which Python is overkill and Excel (or another application) is the perfect tool for the job.

In many other cases, for example when your data is large and unwieldy, even simple Python commands can hugely simplify tasks that would be very difficult and time-consuming to do within a spreadsheet.

- For an example, let's look at: `Example Notebook - Python Developers Survey`