<div class="page-wrap">
    <h1 class="title-slide">Data Analysis with Pandas </h1>
    <br>
    <center><img src="img/pyladies.png" width="25%"></center>
</div><!-- close .page-wrap -->


<footer>
    <p class="footer-left">September 8, 2018</p>
    <p class="footer-right">Instructor: Jennifer Walker</p>

</footer>

# Agenda

- Lesson 1: Getting Started with Data
- Lesson 2: Sorting, Filtering, and Aggregation
- Lesson 3: Subsets and Indexing

# Lesson 1: Getting Started with Data

# Python Data Analysis Ecosystem

![ecosystem](img/ecosystem.png)

# Intro to Pandas

- `pandas` = Python Data Analysis Library (https://pandas.pydata.org/)
- With `pandas` you can do pretty much everything you would in a spreadsheet, plus a whole lot more!

### Why Pandas?
- Working with large data files and complex calculations
- Dealing with messy and missing data
- Merging data from multiple files
- Time series analysis
- Automating repetitive tasks
- Combining with other Python libraries to create beautiful, fully customized visualizations

# Reading a CSV file

Let's look at the file `data/weather_hourly_YVR_2017.csv`
- One full year of Environment Canada hourly weather measurements at Vancouver Airport
- First, let's check it out in the JupyterLab CSV viewer

Now let's start working with the data in `pandas`
- First, import the `pandas` library and give it the nickname `pd`

In [1]:
import pandas as pd

Use the function `pd.read_csv` to read the data into Python and store as a variable:

In [2]:
weather_yvr = pd.read_csv('data/weather_hourly_YVR_2017.csv')

In [3]:
weather_yvr

Unnamed: 0,Datetime,Temp (deg C),Dew Point Temp (deg C),Rel Hum (%),Wind Dir (deg),Wind Spd (km/h),Visibility (km),Pressure (kPa),Hmdx,Wind Chill,Conditions
0,2017-01-01 00:00:00,1.2,-0.4,89.0,360.0,8.0,19.3,100.54,,,
1,2017-01-01 01:00:00,0.9,-0.7,89.0,360.0,2.0,24.1,100.55,,,Cloudy
2,2017-01-01 02:00:00,1.2,0.6,96.0,80.0,4.0,19.3,100.61,,,Cloudy
3,2017-01-01 03:00:00,0.6,0.2,97.0,360.0,2.0,19.3,100.65,,,Cloudy
4,2017-01-01 04:00:00,0.6,0.2,97.0,230.0,3.0,19.3,100.65,,,Cloudy
5,2017-01-01 05:00:00,0.2,-0.1,98.0,20.0,3.0,24.1,100.67,,,Cloudy
6,2017-01-01 06:00:00,0.2,-0.1,98.0,70.0,3.0,24.1,100.67,,,Cloudy
7,2017-01-01 07:00:00,0.8,0.4,97.0,40.0,8.0,24.1,100.72,,,Cloudy
8,2017-01-01 08:00:00,0.4,-2.2,83.0,20.0,5.0,32.2,100.74,,,Cloudy
9,2017-01-01 09:00:00,0.8,-4.1,70.0,360.0,19.0,48.3,100.82,,,Cloudy


- Only the first 30 and last 30 rows are displayed (but the data is all there in our `weather_yvr` variable)
- You may notice some weird `NaN` values; these represent missing data (`NaN` = "not a number")
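
To see where `NaN` values come from, here is a minimal sketch that reads a tiny made-up CSV from a string instead of a file. The data and the names `csv_text` and `sample` are hypothetical and are not part of the workshop files; the empty fields stand in for missing measurements.

```python
import io
import pandas as pd

# Hypothetical three-row CSV; the empty fields are missing measurements
csv_text = """Datetime,Temp (deg C),Conditions
2017-01-01 00:00:00,1.2,
2017-01-01 01:00:00,,Cloudy
2017-01-01 02:00:00,1.2,Cloudy
"""

# read_csv accepts a file path or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

# Empty fields are read in as NaN
print(sample)

# isna() marks each missing value with True
print(sample.isna())
```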

What type of object is `weather_yvr`?

In [4]:
type(weather_yvr)

pandas.core.frame.DataFrame

- `weather_yvr` is a **DataFrame**, a data structure from the `pandas` library
  - A DataFrame is a 2-dimensional array (organized into rows and columns, like a table in a spreadsheet)

- When we display `weather_yvr`, the integer numbers in bold on the left are the DataFrame's **index**
  - In this case, the index is simply a range of integers corresponding with the row numbers

In [5]:
weather_yvr

Unnamed: 0,Datetime,Temp (deg C),Dew Point Temp (deg C),Rel Hum (%),Wind Dir (deg),Wind Spd (km/h),Visibility (km),Pressure (kPa),Hmdx,Wind Chill,Conditions
0,2017-01-01 00:00:00,1.2,-0.4,89.0,360.0,8.0,19.3,100.54,,,
1,2017-01-01 01:00:00,0.9,-0.7,89.0,360.0,2.0,24.1,100.55,,,Cloudy
2,2017-01-01 02:00:00,1.2,0.6,96.0,80.0,4.0,19.3,100.61,,,Cloudy
3,2017-01-01 03:00:00,0.6,0.2,97.0,360.0,2.0,19.3,100.65,,,Cloudy
4,2017-01-01 04:00:00,0.6,0.2,97.0,230.0,3.0,19.3,100.65,,,Cloudy
5,2017-01-01 05:00:00,0.2,-0.1,98.0,20.0,3.0,24.1,100.67,,,Cloudy
6,2017-01-01 06:00:00,0.2,-0.1,98.0,70.0,3.0,24.1,100.67,,,Cloudy
7,2017-01-01 07:00:00,0.8,0.4,97.0,40.0,8.0,24.1,100.72,,,Cloudy
8,2017-01-01 08:00:00,0.4,-2.2,83.0,20.0,5.0,32.2,100.74,,,Cloudy
9,2017-01-01 09:00:00,0.8,-4.1,70.0,360.0,19.0,48.3,100.82,,,Cloudy


For large DataFrames, it's often useful to display just the first few or last few rows:

In [6]:
weather_yvr.head()

Unnamed: 0,Datetime,Temp (deg C),Dew Point Temp (deg C),Rel Hum (%),Wind Dir (deg),Wind Spd (km/h),Visibility (km),Pressure (kPa),Hmdx,Wind Chill,Conditions
0,2017-01-01 00:00:00,1.2,-0.4,89.0,360.0,8.0,19.3,100.54,,,
1,2017-01-01 01:00:00,0.9,-0.7,89.0,360.0,2.0,24.1,100.55,,,Cloudy
2,2017-01-01 02:00:00,1.2,0.6,96.0,80.0,4.0,19.3,100.61,,,Cloudy
3,2017-01-01 03:00:00,0.6,0.2,97.0,360.0,2.0,19.3,100.65,,,Cloudy
4,2017-01-01 04:00:00,0.6,0.2,97.0,230.0,3.0,19.3,100.65,,,Cloudy


- The `head` method returns a new DataFrame consisting of the first `n` rows (default 5)
- To display the documentation for this method, you can run the command `weather_yvr.head?` in your Jupyter notebook

First two rows:

In [7]:
weather_yvr.head(2)

Unnamed: 0,Datetime,Temp (deg C),Dew Point Temp (deg C),Rel Hum (%),Wind Dir (deg),Wind Spd (km/h),Visibility (km),Pressure (kPa),Hmdx,Wind Chill,Conditions
0,2017-01-01 00:00:00,1.2,-0.4,89.0,360.0,8.0,19.3,100.54,,,
1,2017-01-01 01:00:00,0.9,-0.7,89.0,360.0,2.0,24.1,100.55,,,Cloudy


Last four rows:

In [8]:
weather_yvr.tail(4)

Unnamed: 0,Datetime,Temp (deg C),Dew Point Temp (deg C),Rel Hum (%),Wind Dir (deg),Wind Spd (km/h),Visibility (km),Pressure (kPa),Hmdx,Wind Chill,Conditions
8756,2017-12-31 20:00:00,-1.2,-2.9,89.0,90.0,9.0,32.2,102.52,,-4.0,Mainly Clear
8757,2017-12-31 21:00:00,-0.7,-2.2,89.0,100.0,11.0,32.2,102.58,,-4.0,Mainly Clear
8758,2017-12-31 22:00:00,-1.5,-2.4,93.0,100.0,11.0,32.2,102.63,,-5.0,Mainly Clear
8759,2017-12-31 23:00:00,-2.1,-3.0,93.0,90.0,9.0,32.2,102.69,,-6.0,Mainly Clear


# Data at a Glance

`pandas` provides many ways to quickly and easily summarize your data:
- How many rows and columns are there?
- What are all the column names and what type of data is in each column?

- Numerical data: What is the average and range of the values?
- Text data: What are the unique values and how often does each occur?
- How many values are missing in each column?

Number of rows and columns:

In [9]:
weather_yvr.shape

(8760, 11)

- The DataFrame `weather_yvr` has 8760 rows and 11 columns
- The index does not count as a column
- `shape` is a **data attribute** of the variable `weather_yvr`

- Within a column of a DataFrame, the data must all be of the same type
- We can find out the names and data types of each column from the `dtypes` attribute:

In [10]:
weather_yvr.dtypes

Datetime                   object
Temp (deg C)              float64
Dew Point Temp (deg C)    float64
Rel Hum (%)               float64
Wind Dir (deg)            float64
Wind Spd (km/h)           float64
Visibility (km)           float64
Pressure (kPa)            float64
Hmdx                      float64
Wind Chill                float64
Conditions                 object
dtype: object

- In a `pandas` DataFrame, a column containing text data (or containing a mix of text and numbers) is assigned a `dtype` of `object` and is treated as a column of strings
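
As a small illustration of the mixed case, the sketch below builds a hypothetical DataFrame named `mixed` (not part of the workshop data) with one purely numerical column and one column that mixes numbers and text, then checks the `dtypes`.

```python
import pandas as pd

# Hypothetical data: one all-numeric column, one mixing numbers and text
mixed = pd.DataFrame({
    'all_numbers': [1.5, 2.0, 3.25],
    'numbers_and_text': [1.5, 'two', 3.25],
})

# The numeric column is float64; the mixed column falls back to object
print(mixed.dtypes)
```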

If we just want a list of the column names, we can use the `columns` attribute:

In [11]:
weather_yvr.columns

Index(['Datetime', 'Temp (deg C)', 'Dew Point Temp (deg C)', 'Rel Hum (%)',
       'Wind Dir (deg)', 'Wind Spd (km/h)', 'Visibility (km)',
       'Pressure (kPa)', 'Hmdx', 'Wind Chill', 'Conditions'],
      dtype='object')

# Exercise 1.1 

Let's explore `'data/weather_airports_24hr_snapshot.csv'`, which contains a 24-hour snapshot of Environment Canada weather measurements at major airport stations around Canada.

a) Read the CSV file into a new DataFrame `weather_all` and display the first 10 rows.

b) How many rows and columns does `weather_all` have?

c) Display the names and data types of each column.


# Simple Summary Statistics

Returning to our `weather_yvr` data:

In [12]:
print(weather_yvr.shape)
weather_yvr.head(3)

(8760, 11)


Unnamed: 0,Datetime,Temp (deg C),Dew Point Temp (deg C),Rel Hum (%),Wind Dir (deg),Wind Spd (km/h),Visibility (km),Pressure (kPa),Hmdx,Wind Chill,Conditions
0,2017-01-01 00:00:00,1.2,-0.4,89.0,360.0,8.0,19.3,100.54,,,
1,2017-01-01 01:00:00,0.9,-0.7,89.0,360.0,2.0,24.1,100.55,,,Cloudy
2,2017-01-01 02:00:00,1.2,0.6,96.0,80.0,4.0,19.3,100.61,,,Cloudy


The `describe` method computes simple summary statistics and returns them as a DataFrame:

In [13]:
weather_yvr.describe()

Unnamed: 0,Temp (deg C),Dew Point Temp (deg C),Rel Hum (%),Wind Dir (deg),Wind Spd (km/h),Visibility (km),Pressure (kPa),Hmdx,Wind Chill
count,8758.0,8758.0,8758.0,8745.0,8758.0,8758.0,8758.0,446.0,517.0
mean,10.249441,6.693663,79.785225,170.810749,13.23704,33.13752,101.601111,26.804933,-5.321083
std,6.63264,5.940729,13.019574,96.231337,7.360099,13.237931,0.853132,1.890264,2.546734
min,-8.4,-15.7,18.0,0.0,0.0,0.0,98.63,25.0,-13.0
25%,5.6,3.4,71.0,90.0,8.0,24.1,101.15,25.0,-7.0
50%,9.8,6.9,82.0,130.0,12.0,32.2,101.63,26.0,-5.0
75%,15.4,11.2,90.0,270.0,17.0,48.3,102.12,28.0,-4.0
max,29.2,19.2,100.0,360.0,68.0,64.4,103.94,33.0,0.0


- The `describe` method is a way to quickly summarize the averages, extremes, and variability of each numerical data column
- You can look at each statistic individually with methods such as `mean`, `median`, `min`, `max`, `std`, and `count`

In [14]:
weather_yvr.mean(numeric_only=True)  # skip the text columns 'Datetime' and 'Conditions'

Temp (deg C)               10.249441
Dew Point Temp (deg C)      6.693663
Rel Hum (%)                79.785225
Wind Dir (deg)            170.810749
Wind Spd (km/h)            13.237040
Visibility (km)            33.137520
Pressure (kPa)            101.601111
Hmdx                       26.804933
Wind Chill                 -5.321083
dtype: float64
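
The `count` statistic reports how many non-missing values each column has. A complementary view, sketched here on a hypothetical DataFrame `demo` rather than the workshop data, is to count the missing values directly by chaining the `isna` and `sum` methods.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing values in each column
demo = pd.DataFrame({
    'Temp (deg C)': [1.2, np.nan, 0.6],
    'Wind Chill': [np.nan, np.nan, -4.0],
    'Conditions': ['Cloudy', None, 'Clear'],
})

# isna() gives True/False for every cell; sum() adds up the True values per column
missing_per_column = demo.isna().sum()
print(missing_per_column)

# count() gives the non-missing values, so the two add up to the number of rows
print(demo.count())
```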

# Working with DataFrame Columns

Similar to a dictionary, we can index a specific column of a DataFrame using the column name inside square brackets:
- *Pro tip: In Jupyter notebooks, auto-complete works for DataFrame column names!*

In [15]:
weather_yvr['Temp (deg C)']

0       1.2
1       0.9
2       1.2
3       0.6
4       0.6
5       0.2
6       0.2
7       0.8
8       0.4
9       0.8
10      2.1
11      2.0
12      1.6
13      1.4
14      1.3
15      1.0
16      0.9
17     -1.3
18     -0.4
19     -0.6
20     -1.2
21     -1.0
22     -1.7
23     -3.0
24     -2.6
25     -2.4
26     -2.2
27     -2.4
28     -3.0
29     -2.8
       ... 
8730    1.4
8731    0.6
8732    0.5
8733   -0.4
8734   -1.2
8735   -1.1
8736   -1.1
8737   -0.4
8738    0.4
8739   -1.4
8740   -3.0
8741   -2.8
8742   -2.3
8743   -3.8
8744   -3.5
8745   -2.9
8746   -0.7
8747   -0.4
8748    1.3
8749    2.5
8750    3.6
8751    3.6
8752    2.3
8753    1.4
8754    1.5
8755    1.0
8756   -1.2
8757   -0.7
8758   -1.5
8759   -2.1
Name: Temp (deg C), Length: 8760, dtype: float64

The numbers on the left are the **index**

What type of object is this?

In [16]:
temperature = weather_yvr['Temp (deg C)']
type(temperature)

pandas.core.series.Series

It's a Series, another data structure from the `pandas` library.

- **DataFrame:** 2-dimensional array, like a table in a spreadsheet
  - The rows are axis 0
  - The columns are axis 1
- **Series:** 1-dimensional array, like a single column or row in a spreadsheet
  - Each individual column or row of a DataFrame is represented as a Series
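
The following sketch illustrates these two points on a small hypothetical DataFrame named `table`. It uses `iloc` to pull out a row by position, which has not been introduced yet and is used here only as an assumption to show that a row is also a Series.

```python
import pandas as pd

# Hypothetical three-row table
table = pd.DataFrame({
    'Temp (deg C)': [1.2, 0.9, 1.2],
    'Rel Hum (%)': [89.0, 89.0, 96.0],
})

column = table['Temp (deg C)']   # a single column is a Series
first_row = table.iloc[0]        # a single row (selected by position) is also a Series
print(type(column).__name__, type(first_row).__name__)

# axis=0 works down the rows: one mean per column
column_means = table.mean(axis=0)
print(column_means)

# axis=1 works across the columns: one mean per row
row_means = table.mean(axis=1)
print(row_means)
```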

Let's look at the Series again:

In [17]:
temperature

0       1.2
1       0.9
2       1.2
3       0.6
4       0.6
5       0.2
6       0.2
7       0.8
8       0.4
9       0.8
10      2.1
11      2.0
12      1.6
13      1.4
14      1.3
15      1.0
16      0.9
17     -1.3
18     -0.4
19     -0.6
20     -1.2
21     -1.0
22     -1.7
23     -3.0
24     -2.6
25     -2.4
26     -2.2
27     -2.4
28     -3.0
29     -2.8
       ... 
8730    1.4
8731    0.6
8732    0.5
8733   -0.4
8734   -1.2
8735   -1.1
8736   -1.1
8737   -0.4
8738    0.4
8739   -1.4
8740   -3.0
8741   -2.8
8742   -2.3
8743   -3.8
8744   -3.5
8745   -2.9
8746   -0.7
8747   -0.4
8748    1.3
8749    2.5
8750    3.6
8751    3.6
8752    2.3
8753    1.4
8754    1.5
8755    1.0
8756   -1.2
8757   -0.7
8758   -1.5
8759   -2.1
Name: Temp (deg C), Length: 8760, dtype: float64

The last line of the output above tells us that our Series `temperature` is named `'Temp (deg C)'`, it is 8760 rows long, and its data type is `float64`.

Many of the methods we use on a DataFrame can also be used on a Series, and vice versa:

In [18]:
temperature.head()

0    1.2
1    0.9
2    1.2
3    0.6
4    0.6
Name: Temp (deg C), dtype: float64

In [19]:
weather_yvr['Rel Hum (%)'].describe()

count    8758.000000
mean       79.785225
std        13.019574
min        18.000000
25%        71.000000
50%        82.000000
75%        90.000000
max       100.000000
Name: Rel Hum (%), dtype: float64

In [20]:
weather_yvr['Wind Spd (km/h)'].max()

68.0

# Simple Calculations

We can perform calculations on Series.
- Let's convert temperature from Celsius to Fahrenheit (multiply by 1.8 and add 32)

In [21]:
temperature_F = 1.8 * temperature + 32
temperature_F.head()

0    34.16
1    33.62
2    34.16
3    33.08
4    33.08
Name: Temp (deg C), dtype: float64

Side note: the new Series `temperature_F` still has the same name (`'Temp (deg C)'`) as the original Series
- If we want to change the name, we can update the `name` attribute of the Series:

In [22]:
temperature_F.name = 'Temp (deg F)'
temperature_F.head()

0    34.16
1    33.62
2    34.16
3    33.08
4    33.08
Name: Temp (deg F), dtype: float64

Similar to a dictionary, we can add a new column to a DataFrame by simply assigning a value to a new column name:

In [23]:
weather_yvr['Temp (deg F)'] = 1.8 * weather_yvr['Temp (deg C)'] + 32
weather_yvr.head(3)

Unnamed: 0,Datetime,Temp (deg C),Dew Point Temp (deg C),Rel Hum (%),Wind Dir (deg),Wind Spd (km/h),Visibility (km),Pressure (kPa),Hmdx,Wind Chill,Conditions,Temp (deg F)
0,2017-01-01 00:00:00,1.2,-0.4,89.0,360.0,8.0,19.3,100.54,,,,34.16
1,2017-01-01 01:00:00,0.9,-0.7,89.0,360.0,2.0,24.1,100.55,,,Cloudy,33.62
2,2017-01-01 02:00:00,1.2,0.6,96.0,80.0,4.0,19.3,100.61,,,Cloudy,34.16


By creating a new column labelled `'Temp (deg F)'`, the Series in this column is automatically given the same name as its column label:

In [24]:
weather_yvr['Temp (deg F)'].head()

0    34.16
1    33.62
2    34.16
3    33.08
4    33.08
Name: Temp (deg F), dtype: float64

# Multiple Columns of a DataFrame

We can also select several columns, in whatever order we like, using a list of column names:

In [25]:
subset_columns = ['Datetime', 'Wind Spd (km/h)', 'Wind Dir (deg)', 'Rel Hum (%)', 'Temp (deg C)']
weather_subset = weather_yvr[subset_columns]
weather_subset.head()

Unnamed: 0,Datetime,Wind Spd (km/h),Wind Dir (deg),Rel Hum (%),Temp (deg C)
0,2017-01-01 00:00:00,8.0,360.0,89.0,1.2
1,2017-01-01 01:00:00,2.0,360.0,89.0,0.9
2,2017-01-01 02:00:00,4.0,80.0,96.0,1.2
3,2017-01-01 03:00:00,2.0,360.0,97.0,0.6
4,2017-01-01 04:00:00,3.0,230.0,97.0,0.6


`weather_subset` is a DataFrame containing only the specified columns of `weather_yvr`.

We can use subsets to compute summary statistics just for specific columns of interest:

In [26]:
stats_columns = ['Rel Hum (%)', 'Temp (deg C)']
weather_yvr[stats_columns].mean()

Rel Hum (%)     79.785225
Temp (deg C)    10.249441
dtype: float64

If our list of columns is pretty short, it's often convenient to skip a step and define the list directly within the indexing operator: 

In [27]:
weather_yvr[['Datetime', 'Temp (deg C)']].head()

Unnamed: 0,Datetime,Temp (deg C)
0,2017-01-01 00:00:00,1.2
1,2017-01-01 01:00:00,0.9
2,2017-01-01 02:00:00,1.2
3,2017-01-01 03:00:00,0.6
4,2017-01-01 04:00:00,0.6


Note the double square brackets!
- These are required because we need both:
  - A pair of square brackets to extract the subset, AND
  - A pair of square brackets to define the list of columns to select.

What happens if we forget and just use a single pair of square brackets?
- It doesn't work! `pandas` treats the comma-separated names as one single column label, finds no column with that label, and raises a `KeyError`
- This is a very easy mistake to make, especially when you're coding on the fly in Jupyter
- For more details on why forgetting the second pair of square brackets causes an error, check out the section "Interlude: Forgetting the Double Square Brackets" in Lesson 3
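
The single-bracket mistake can be reproduced safely on a small hypothetical DataFrame (`demo` below is made up for illustration). The sketch catches the resulting `KeyError` so that the cell still runs to the end.

```python
import pandas as pd

# Hypothetical two-row table with three columns
demo = pd.DataFrame({
    'Datetime': ['2017-01-01 00:00:00', '2017-01-01 01:00:00'],
    'Temp (deg C)': [1.2, 0.9],
    'Rel Hum (%)': [89.0, 89.0],
})

# Double brackets: the inner pair builds a list of column names
print(demo[['Datetime', 'Temp (deg C)']])

# Single brackets: Python packs the two names into one tuple, and pandas
# looks that tuple up as a single column label, which does not exist
try:
    demo['Datetime', 'Temp (deg C)']
    error_name = None
except KeyError as err:
    error_name = type(err).__name__
    print('KeyError:', err)
```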

# Saving to CSV

We can save a DataFrame to a CSV file using the `to_csv` method.

- When you first start working with `to_csv`, it's easy to accidentally overwrite existing files, so you might want to first make a copy of your data folder, as a backup.
- By default, the `to_csv` method will save the DataFrame's index as an additional column in the CSV file. To turn this off, we use the keyword argument `index=False`.
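
To preview the effect of `index=False` without writing any files, one option is to call `to_csv` with no file path, in which case it returns the CSV text as a string. This is a sketch on a hypothetical one-column DataFrame `demo`, not the workshop data.

```python
import pandas as pd

# Hypothetical one-column table
demo = pd.DataFrame({'Temp (deg C)': [1.2, 0.9]})

# With no file path, to_csv returns the CSV text instead of writing a file
with_index = demo.to_csv()
without_index = demo.to_csv(index=False)

print(with_index)      # first field on each line is the index
print(without_index)   # only the data column
```

In the first version the header line starts with an empty field for the index column. If such a file is read back with `pd.read_csv`, that column shows up as `Unnamed: 0`.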

Let's save the data from the columns `'Datetime'`, `'Temp (deg C)'`, and the new `'Temp (deg F)'` to a CSV file:

In [28]:
temp_df = weather_yvr[['Datetime', 'Temp (deg C)', 'Temp (deg F)']]

temp_df.to_csv('data/temperatures_YVR.csv', index=False)

Check out your new file in the JupyterLab CSV viewer!

# Categorical Data

What about text data like the `'Conditions'` column?

In [29]:
weather_yvr['Conditions']

0                 NaN
1              Cloudy
2              Cloudy
3              Cloudy
4              Cloudy
5              Cloudy
6              Cloudy
7              Cloudy
8              Cloudy
9              Cloudy
10             Cloudy
11             Cloudy
12             Cloudy
13             Cloudy
14             Cloudy
15             Cloudy
16      Mostly Cloudy
17      Mostly Cloudy
18      Mostly Cloudy
19      Mostly Cloudy
20      Mostly Cloudy
21      Mostly Cloudy
22      Mostly Cloudy
23      Mostly Cloudy
24      Mostly Cloudy
25       Mainly Clear
26       Mainly Clear
27       Mainly Clear
28              Clear
29              Clear
            ...      
8730    Mostly Cloudy
8731    Mostly Cloudy
8732    Mostly Cloudy
8733    Mostly Cloudy
8734     Mainly Clear
8735     Mainly Clear
8736     Mainly Clear
8737     Mainly Clear
8738     Mainly Clear
8739     Mainly Clear
8740            Clear
8741            Clear
8742            Clear
8743     Freezing Fog
8744     F

This column consists of categories, with many repeated values.
- What are the unique values in the Series?
- How often does each value occur?
- What are the most common values?

We can answer these questions with the `unique`, `nunique`, and `value_counts` methods.
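- `unique` is only available on a Series. `nunique` also works on a DataFrame (one count per column), and recent versions of `pandas` add a DataFrame `value_counts` that counts unique rows. Here we apply all three to a single Series.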

`value_counts` is a very handy method to quickly summarize a Series of text data and find the most common values:

In [30]:
weather_yvr['Conditions'].value_counts()

Cloudy                                          2248
Mostly Cloudy                                   1822
Mainly Clear                                    1799
Clear                                           1077
Rain                                             997
Rain Showers                                     237
Fog                                              136
Rain,Fog                                         132
Snow                                              80
Drizzle,Fog                                       40
Moderate Rain,Fog                                 36
Moderate Rain                                     31
Drizzle                                           20
Rain,Drizzle,Fog                                  15
Snow Showers                                      14
Rain,Snow                                         12
Snow,Fog                                           8
Freezing Fog                                       7
Rain,Snow,Fog                                 

We can use the `unique` method to list the unique values:

In [31]:
weather_yvr['Conditions'].unique()

array([nan, 'Cloudy', 'Mostly Cloudy', 'Mainly Clear', 'Clear', 'Rain',
       'Rain,Fog', 'Rain,Drizzle,Fog', 'Rain Showers',
       'Moderate Rain,Fog', 'Moderate Rain', 'Snow', 'Moderate Snow',
       'Snow,Fog', 'Rain,Snow,Fog', 'Snow,Ice Pellets,Fog', 'Rain,Snow',
       'Ice Pellets', 'Snow Showers', 'Freezing Rain,Fog',
       'Heavy Rain,Fog', 'Drizzle,Fog', 'Rain Showers,Fog',
       'Rain,Ice Pellets', 'Rain Showers,Snow Showers,Fog',
       'Rain Showers,Snow Pellets', 'Rain Showers,Snow Showers',
       'Drizzle', 'Moderate Rain Showers', 'Moderate Rain,Drizzle',
       'Rain,Drizzle', 'Fog',
       'Heavy Rain Showers,Moderate Snow Pellets,Fog',
       'Heavy Rain,Moderate Hail,Fog', 'Moderate Rain,Drizzle,Fog',
       'Freezing Fog', 'Moderate Rain,Snow,Fog', 'Moderate Snow,Fog'],
      dtype=object)

We can use the `nunique` method to find the number of unique values:

In [32]:
weather_yvr['Conditions'].nunique()

37
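
Notice that the array returned by `unique` contains `nan` plus 37 weather categories, while `nunique` returns 37. By default, `nunique` and `value_counts` leave out missing values, whereas `unique` lists them. The sketch below shows this on a hypothetical four-item Series named `conditions_demo`.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value and one repeated category
conditions_demo = pd.Series([np.nan, 'Cloudy', 'Cloudy', 'Clear'])

unique_values = conditions_demo.unique()      # includes the missing value
n_unique = conditions_demo.nunique()          # missing values are not counted
counts = conditions_demo.value_counts()       # missing values are left out

print(unique_values)
print(n_unique)
print(counts)
```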

# Lesson 1 Recap

### Importing `pandas` Library
```
import pandas as pd
```

### DataFrames and Series

Data in `pandas` is organized into DataFrames and Series.

- **DataFrame:** 2-dimensional array, like a table in a spreadsheet
  - The rows are axis 0
  - The columns are axis 1
- **Series:** 1-dimensional array, like a single column or row in a spreadsheet
  - Each individual column or row of a DataFrame is represented as a Series

### Reading a CSV File

To read a CSV file and store it as a DataFrame variable:
```
df = pd.read_csv('some_cool_data.csv')
```

Missing data in a DataFrame or Series is represented as `NaN` ("not a number").

### Saving to a CSV File

To save a DataFrame to a CSV file: 
```
df.to_csv('cool_output.csv', index=False)
```
- To include the DataFrame's index as a column in the CSV file, omit the `index=False` keyword argument.

### Quick and Easy Summaries of a DataFrame

|||
---|----
**Useful Attributes** |
Number of rows and columns (rows first, columns second) | `df.shape` 
Names and data types of each column |  `df.dtypes` 
Just the names of each column | `df.columns` 
**Rows at a Glance** |
First `n` rows (default 5) |`df.head(n)`
Last `n` rows (default 5) | `df.tail(n)`
A random sampling of `n` rows (default 1) | `df.sample(n)`


#### Summary Statistics

Full set of summary statistics (min, max, mean, standard deviation, etc.) for each numerical column of a DataFrame:
```
df.describe()
```

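Mean value of each numerical column:
```
df.mean(numeric_only=True)
```

And similarly for other summary statistics such as `min`, `max`, `median`, and `std`. In recent versions of `pandas`, pass `numeric_only=True` when the DataFrame also contains text columns.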

### Working with DataFrame Columns

#### Single Columns

Each column of a DataFrame is a Series.
```
series_X = df['X']
```

Most DataFrame methods can be applied to a Series, for example:
- `df['X'].head()`
- `df['X'].max()`

Basic calculations with a Series and adding a new column to a DataFrame: 
```
df['Double X'] = 2 * df['X']
```

#### Multiple Columns

Use a list of column names to select several columns of a DataFrame, in a specified order:
```
df_subset = df[['E', 'A', 'C']]
```

### Categorical Data

For a column `df['Category']` of categorical data, some useful summary methods are:

|||
---|---
Unique values | `df['Category'].unique()`
Number of unique values | `df['Category'].nunique()`
Counts of each unique value | `df['Category'].value_counts()`

*Note: `unique` can only be applied to a Series (not a DataFrame). In this workshop we apply all three methods to a single Series.*

# Exercise 1.2

If you haven't already, read the file `'data/weather_airports_24hr_snapshot.csv'` into a new DataFrame `weather_all`.

a) What are the warmest and coldest temperatures in this data?

b) How many unique station names are in this data? Display a list of the unique names.

c) What is the most common weather category in the `'Conditions'` column? How many unique categories are there?

d) Add a column with the wind speed in miles per hour (multiply the wind speed in km/h by 0.62137). Save the data from columns `'Station Name'`, `'Datetime (Local Standard)'`, `'Wind Spd (km/h)'`, and your new column of wind speed in miles per hour, to a CSV file.

##### Bonus exercises

Create a variable `conditions` corresponding to the `'Conditions'` column of `weather_all`. We'll use this variable in each of the following exercises.

e) What type of object is returned by `conditions.value_counts()`? Can you think of a method that could be applied to this output so that it returns only the counts for the top `n` values? How about the bottom `n` values?
- Display only the counts for the 5 most common weather categories in `conditions`
- Display only the counts for the 5 least common weather categories in `conditions`

f) Use `conditions.value_counts?` to check out the documentation for the `value_counts` method. Experiment with the `normalize`, `sort` and `dropna` keyword arguments. How does the output change when you change these arguments?

g) `pandas` Series have a few *accessors*, which are attributes that [act like an interface to additional methods](https://realpython.com/python-pandas-tricks/#3-take-advantage-of-accessor-methods). With a Series of text data, like `conditions`, the `str` accessor allows you to apply string methods such as `upper`, `lower`, `strip`, `replace`, etc. to all the items in the Series.

- Check out some of the documentation with `conditions.str?` and `conditions.str.upper?`. 
- Create a new Series with the weather categories converted to upper case.
- Create a new Series with any instance of the string `'Snow'` in a weather category replaced with the string `'SNOW!!!'`.
- For both of these new Series, use `value_counts` or `unique` methods to verify that the output is what you were expecting.
