# Lesson 1 Recap

### Importing the `pandas` Library
```
import pandas as pd
```

### DataFrames and Series

Data in `pandas` is organized into DataFrames and Series.

- **DataFrame:** 2-dimensional array, like a table in a spreadsheet
  - The rows are axis 0
  - The columns are axis 1
- **Series:** 1-dimensional array, like a single column or row in a spreadsheet
  - Each individual column or row of a DataFrame is represented as a Series

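As a rough illustration of the two structures and the two axes, the sketch below builds a tiny DataFrame from a dictionary. The column names and numbers are made up for this example.

```python
import pandas as pd

# Made-up numbers, purely for illustration
df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

col = df['A']                  # a single column is a Series
col_totals = df.sum(axis=0)    # collapse the rows (axis 0): one total per column
row_totals = df.sum(axis=1)    # collapse the columns (axis 1): one total per row

print(type(df).__name__, type(col).__name__)
print(df.shape)                # rows first, columns second
print(col_totals)
print(row_totals)
```

Both `col_totals` and `row_totals` are themselves Series, since each is 1-dimensional.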
### Reading a CSV File

To read a CSV file and store it as a DataFrame variable:
```
df = pd.read_csv('some_cool_data.csv')
```

Missing data in a DataFrame or Series is represented as `NaN` ("not a number").

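A minimal sketch of how a blank field turns into `NaN`. To keep the example self-contained, it reads from an in-memory string (via `io.StringIO`) instead of a file on disk; the CSV contents are invented for illustration.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; the TORONTO temperature is left blank
csv_text = "Station,Temp (deg C)\nHALIFAX,12.5\nTORONTO,\nCALGARY,9.5\n"

df = pd.read_csv(io.StringIO(csv_text))

print(df)                                   # the blank field is shown as NaN
print(df['Temp (deg C)'].isnull().sum())    # number of missing temperatures
```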
### Saving to a CSV File

To save a DataFrame to a CSV file: 
```
df.to_csv('cool_output.csv', index=False)
```
- To include the DataFrame's index as a column in the CSV file, omit the `index=False` keyword argument.

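To see what `index=False` changes, the sketch below calls `to_csv` without a file name, in which case it returns the CSV text as a string instead of writing a file. The data is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# With no file path, to_csv returns the CSV text rather than saving it
without_index = df.to_csv(index=False)
with_index = df.to_csv()

print(without_index)   # columns A and B only
print(with_index)      # an extra, unnamed first column holding the index
```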
### Quick and Easy Summaries of a DataFrame

|||
---|----
**Useful Attributes** |
Number of rows and columns (rows first, columns second) | `df.shape` 
Names and data types of each column |  `df.dtypes` 
Just the names of each column | `df.columns` 
**Rows at a Glance** |
First `n` rows (default 5) |`df.head(n)`
Last `n` rows (default 5) | `df.tail(n)`
A random sampling of `n` rows (default 1) | `df.sample(n)`

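A quick sketch of these attributes and methods on a small, made-up DataFrame:

```python
import pandas as pd

# Seven rows of invented data
df = pd.DataFrame({'Name': list('abcdefg'), 'Value': range(7)})

print(df.shape)           # (number of rows, number of columns)
print(df.dtypes)          # data type of each column
print(list(df.columns))   # column names only
print(df.head(3))         # first 3 rows
print(df.tail(2))         # last 2 rows
print(df.sample(2))       # 2 rows chosen at random (differs from run to run)
```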

#### Summary Statistics

Full set of summary statistics (min, max, mean, standard deviation, etc.) for each numerical column of a DataFrame:
```
df.describe()
```

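Mean value of each numerical column:
```
df.mean(numeric_only=True)
```
- In `pandas` 2.0 and later, calling `df.mean()` on a DataFrame that contains non-numerical columns can raise a `TypeError`, so pass `numeric_only=True` or select the numerical columns first.

And similarly for other summary statistics: `df.min()`, `df.max()`, `df.median()`, `df.std()`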

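A small sketch of these summaries on a DataFrame that mixes text and numbers. The column names and values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Station': ['A', 'B', 'C', 'D'],
                   'Temp': [10.0, 20.0, 30.0, 40.0],
                   'Hum': [50, 60, 70, 80]})

summary = df.describe()               # covers the numerical columns only
means = df.mean(numeric_only=True)    # explicitly skip the text column

print(summary)
print(means)
print(df['Temp'].median())            # the same statistics work on a single column
```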
### Working with DataFrame Columns

#### Single Columns

Each column of a DataFrame is a Series.
```
series_X = df['X']
```

Most DataFrame methods can be applied to a Series, for example:
- `df['X'].head()`
- `df['X'].max()`

Perform basic calculations with a Series and store the result as a new column of the DataFrame:
```
df['Double X'] = 2 * df['X']
```

#### Multiple Columns

Use a list of column names to select several columns of a DataFrame, in a specified order:
```
df_subset = df[['E', 'A', 'C']]
```
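Putting the column operations together, here is a minimal sketch with made-up data. The column `'A plus C'` is an extra illustration not used elsewhere in this recap.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'C': [4, 5, 6], 'E': [7, 8, 9]})

df['Double A'] = 2 * df['A']          # arithmetic is applied element by element
df['A plus C'] = df['A'] + df['C']    # two columns can be combined the same way

# A list of names selects several columns, in the order listed
df_subset = df[['E', 'A', 'Double A']]
print(df_subset)
```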

### Categorical Data

For a column `df['Category']` of categorical data, some useful summary methods are:

|||
---|---
Unique values | `df['Category'].unique()`
Number of unique values | `df['Category'].nunique()`
Counts of each unique value | `df['Category'].value_counts()`

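*Note: `unique` is available only on a Series. `nunique` and `value_counts` can also be called on a DataFrame, where they behave differently: `df.nunique()` counts the unique values in each column, and `df.value_counts()` counts unique rows.*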

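A short sketch of the three methods on an invented column of colour names:

```python
import pandas as pd

df = pd.DataFrame({'Category': ['red', 'blue', 'red', 'green', 'red', 'blue']})

values = df['Category'].unique()         # distinct values, in order of first appearance
n_values = df['Category'].nunique()      # how many distinct values there are
counts = df['Category'].value_counts()   # a Series of counts, most frequent first

print(values)
print(n_values)
print(counts)
```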
---

# Lesson 2 Recap

### Sorting

Sorting a Series:
```
series_a.sort_values()
```

Sorting a DataFrame based on the values in the column `'Column B'`:
```
df.sort_values('Column B')
```
To sort in descending order, use the keyword argument `ascending=False`. In both cases, `sort_values` returns a new sorted object and leaves the original unchanged.

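A minimal sketch of sorting in both directions, using made-up data:

```python
import pandas as pd

df = pd.DataFrame({'Column A': ['x', 'y', 'z'],
                   'Column B': [2, 3, 1]})

ascending = df.sort_values('Column B')
descending = df.sort_values('Column B', ascending=False)

print(ascending)    # rows keep their original index labels
print(descending)
print(df)           # the original DataFrame is still in its original order
```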
### Filtering

To select a subset of rows with a filter:
  - Create a filter (a Boolean Series) using a comparison operator or a method such as `isnull`
  - Use the filter to extract the desired rows from the DataFrame

Example: comparison operator
```
warm = weather_all['Temp (deg C)'] > 25
weather_warm = weather_all[warm]
```

You can also use Boolean Series and DataFrames to count how often a condition is met, because `True` counts as 1 and `False` as 0 when summed. For example, count the missing values in each column of a DataFrame with `df.isnull().sum()`

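The filtering and counting steps above can be sketched on a small invented version of `weather_all` (the station names and temperatures are hypothetical):

```python
import pandas as pd

weather_all = pd.DataFrame({
    'Station Name': ['A', 'B', 'C', 'D'],
    'Temp (deg C)': [28.0, 14.5, None, 31.2],   # None becomes NaN
})

warm = weather_all['Temp (deg C)'] > 25     # Boolean Series; NaN compares as False
weather_warm = weather_all[warm]            # keep only the rows where warm is True

n_warm = warm.sum()                         # number of warm rows
missing_per_column = weather_all.isnull().sum()

print(weather_warm)
print(n_warm)
print(missing_per_column)
```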

### Aggregation

For basic aggregation operations, use the `groupby` method chained with an aggregation method (e.g., `mean`, `sum`, `max`, `min`, `median`, `std`, `count`).

For example, to find the mean values for data grouped by `'Column B'`:
```
df.groupby('Column B').mean()
```
- In `pandas` 2.0 and later, `mean` and similar numerical aggregations can raise a `TypeError` if the remaining columns include non-numerical data. Select the numerical columns first, or pass `numeric_only=True`.

For more complex aggregations, you can use the `agg` method.
- Specify a list of aggregation statistics, for example: 
```
weather_all.groupby('Station Name').agg(['mean', 'std'])
```
- Use a dictionary to specify different aggregation statistics for different columns, for example:

```
agg_dict = {'Temp (deg C)' : ['min', 'median', 'max'],
            'Rel Hum (%)' : ['mean', 'std']}
weather_all.groupby('Station Name').agg(agg_dict)
```
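A self-contained sketch of both styles of aggregation, using a tiny invented version of `weather_all`. Note that `agg` with lists of statistics produces two-level column labels, so each result column is addressed with a (column, statistic) pair.

```python
import pandas as pd

weather_all = pd.DataFrame({
    'Station Name': ['HALIFAX', 'HALIFAX', 'TORONTO', 'TORONTO'],
    'Temp (deg C)': [10.0, 14.0, 20.0, 26.0],
    'Rel Hum (%)': [80, 70, 60, 50],
})

# One statistic for every column
means = weather_all.groupby('Station Name').mean()

# Different statistics for different columns
agg_dict = {'Temp (deg C)': ['min', 'max'],
            'Rel Hum (%)': ['mean']}
summary = weather_all.groupby('Station Name').agg(agg_dict)

print(means)
print(summary)
print(summary.loc['TORONTO', ('Temp (deg C)', 'max')])
```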

### Plots

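To display `pandas` / `matplotlib` graphs inline in your Jupyter notebook, you may need to run the following magic command (recent Jupyter versions typically display plots inline by default, but running it does no harm):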
```
%matplotlib inline
```
- This command only needs to be run once in a notebook
- It's good practice to run this command at the same time as your `import` commands, near the start of your notebook

Create quick and easy plots of Series and DataFrames with the `plot` method:
- For example, to create a bar chart of one column of a DataFrame:
  - `df['A'].plot(kind='bar')`
- Plots can be adjusted with keyword arguments such as `figsize`, `fontsize`, `title`, `colormap`

---

# Lesson 3 Recap

### Anatomy of a DataFrame

![dataframe3](img/dataframe3.png)

- Each cell in a DataFrame is associated with:
  - A row position and column position (integers, starting from 0), and
  - A row label and column label
- Row and column labels are part of the DataFrame's metadata, stored in attributes:
  - `df.index` contains the labels for the DataFrame's rows
  - `df.columns` contains the labels for the DataFrame's columns
- `df.values` contains the underlying data as a NumPy array (`df.to_numpy()` is the currently recommended way to get the same array)

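A minimal sketch of these three pieces, using a made-up DataFrame with custom row labels:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['row1', 'row2'])

print(list(df.index))     # row labels
print(list(df.columns))   # column labels
print(df.values)          # the data itself, without any labels
print(df.values.shape)    # same (rows, columns) shape as df.shape
```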
### Selection Methods

There are three main ways of selecting subsets of a DataFrame (or Series):
<ol>
    <li>Using only square brackets <code>[]</code></li>
    <li>Positional indexing with <code>.iloc[]</code></li>
    <li>Label-based indexing with <code>.loc[]</code></li>
</ol>


#### 1. Using only square brackets
|||
---|----
Use a Boolean Series to select rows matching some criteria | `df[bool_series]`
Use a list of column names to select several columns in a specified order | `df[['E', 'A', 'C']]`
Select a single column (as a Series) | `series_A = df['A']`
Select a single column (as a DataFrame) | `df_A = df[['A']]`

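The sketch below runs each square-bracket form on a small invented DataFrame, mainly to show the difference between `df['A']` (a Series) and `df[['A']]` (a one-column DataFrame).

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'C': [4, 5, 6], 'E': [7, 8, 9]})

bool_series = df['A'] > 1
rows = df[bool_series]          # rows where A > 1
cols = df[['E', 'A', 'C']]      # columns in the requested order

series_A = df['A']              # 1-dimensional
df_A = df[['A']]                # 2-dimensional, with a single column

print(series_A.shape, df_A.shape)
```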

#### 2. Positional indexing with `iloc`

With `iloc`, we specify the row selection first, followed by the column selection, separated by a comma. For example, selecting the single cell at row position 10 and column position 3 (the 11th row and 4th column, since positions start at 0):
```
df.iloc[10, 3]
```

With `iloc` we can select:
- A single cell
- A single row or column
- Slices of rows and/or columns
  - As with Python lists, `iloc` slices are inclusive of the start bound and exclusive of the stop bound
- Multiple rows and/or columns in arbitrary order, using a list of positions

Example: every second row from position 100 up to and including position 200, and the columns at positions 5, 2, 4, in that order:
```
df.iloc[100:201:2, [5, 2, 4]]
```
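The same kinds of `iloc` selection can be tried on a DataFrame small enough to check by eye. The data below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2, 3, 4],
                   'b': [10, 11, 12, 13, 14],
                   'c': [20, 21, 22, 23, 24]})

cell = df.iloc[1, 2]               # row position 1, column position 2
first_row = df.iloc[0]             # a single row, returned as a Series
second_col = df.iloc[:, 1]         # a single column, returned as a Series
middle = df.iloc[1:4]              # row positions 1, 2, 3 (stop bound excluded)
picked = df.iloc[0:5:2, [2, 0]]    # every second row; columns c then a

print(cell)
print(middle)
print(picked)
```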


#### 3. Label-based indexing with `loc`

As with `iloc`, we specify the row selection first, followed by column selection, separated by a comma, but using the row and column labels instead of their positions. For example, selecting a single cell at a row labelled `'HALIFAX'` and a column labelled `'Temp (deg C)'`:
```
weather.loc['HALIFAX', 'Temp (deg C)']
```

Similar to `iloc`, with `loc` we can select:
- A single cell
- A single row or column
- Slices of rows and/or columns
  - However, unlike `iloc` and Python list slicing, `loc` slices are **inclusive** of both the start bound and the stop bound
- Multiple rows and/or columns in arbitrary order, using a list of labels

We can also select rows and/or columns with a Boolean Series, and mix and match these with the other selection options listed above.

Example: select the rows where the temperature is above 25, and a slice of columns from `'Rel Hum (%)'` up to and including `'Pressure (kPa)'`:
```
warm = weather['Temp (deg C)'] > 25
df = weather.loc[warm, 'Rel Hum (%)':'Pressure (kPa)']
```
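A runnable sketch of the `loc` selections above, using a small invented `weather` DataFrame indexed by station name. The `'Wind Spd (km/h)'` column and all values are hypothetical, added only so that the column slice spans more than two columns.

```python
import pandas as pd

weather = pd.DataFrame(
    {'Temp (deg C)': [12.0, 27.5, 30.1],
     'Rel Hum (%)': [80, 55, 40],
     'Wind Spd (km/h)': [15, 9, 22],
     'Pressure (kPa)': [101.2, 100.8, 99.9]},
    index=['HALIFAX', 'TORONTO', 'WINNIPEG'])

cell = weather.loc['HALIFAX', 'Temp (deg C)']

by_label = weather.loc['HALIFAX':'TORONTO']   # stop label included: 2 rows
by_position = weather.iloc[0:1]               # stop position excluded: 1 row

warm = weather['Temp (deg C)'] > 25
result = weather.loc[warm, 'Rel Hum (%)':'Pressure (kPa)']

print(cell)
print(len(by_label), len(by_position))
print(result)
```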