# Day 4: Session A - Introduction to Pandas DataFrames with World Cities Data

[Link to session webpage](https://eds-217-essential-python.github.io/course-materials/interactive-sessions/4a_dataframes.html)

Date: 09/06/2024

# Setup

In [None]:
import pandas as pd
import numpy as np

# Basic Data Importing

In [None]:
url = "https://raw.githubusercontent.com/datasets/world-cities/master/data/world-cities.csv"
cities_df = pd.read_csv(url)

# Basic Data Exploration
Goals: figure out the DataFrame's structure, look at the data types, get a sense of the data values, and check the column names.
### Viewing the data

In [None]:
cities_df.head(10)   # head() returns the first 5 rows by default; pass it a number to change that

In [None]:
cities_df.tail()

### Exploring DataFrames using their Properties

Use `shape` to get the size (rows x columns) of a dataframe

In [None]:
cities_df.shape   # no parentheses because `shape` is a variable (attribute/property) that holds information about the dataframe, not a function (method)

Use `columns` to get a list of column names

In [None]:
cities_df.columns

Determine the types of data using the `dtypes` property

In [None]:
cities_df.dtypes
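
The three properties above can also be tried on a tiny DataFrame where the answers are easy to check by eye. This is a minimal sketch: `toy_df` and its values are made up for illustration and are not part of the world-cities data.

```python
import pandas as pd

# Hypothetical toy data (not from the world-cities CSV)
toy_df = pd.DataFrame({
    "name": ["Paris", "Lima", "Oslo"],
    "geonameid": [1, 2, 3],
})

# `shape` is a (rows, columns) tuple, so it can be unpacked
n_rows, n_cols = toy_df.shape

# `columns` is an Index object; wrap it in list() for a plain Python list
column_names = list(toy_df.columns)

# `dtypes` is a Series mapping each column name to its data type
print(n_rows, n_cols)
print(column_names)
print(toy_df.dtypes)
```

Because these are properties, none of them take parentheses.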

### Exploring DataFrames using their methods

Use the `describe()` method to get a summary of the dataframe

In [None]:
cities_df.describe()   # get the summary statistics of numeric columns (if any)

Use `info()` to get detailed information about column types and content

In [None]:
cities_df.info()

### Checking for missing values

Use `isnull()` and `sum()` to count missing values

In [None]:
cities_df.isnull().sum()
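
To see why chaining `isnull()` and `sum()` counts missing values, it can help to split the two steps apart. The sketch below uses a made-up `toy_df` (an assumption for illustration, not the course data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing subcountry values
toy_df = pd.DataFrame({
    "name": ["Paris", "Lima", "Oslo"],
    "subcountry": ["Ile-de-France", np.nan, np.nan],
})

# Step 1: a DataFrame of booleans, True wherever a value is missing
missing_mask = toy_df.isnull()

# Step 2: sum() adds up each column, and True counts as 1
missing_counts = missing_mask.sum()

print(missing_mask)
print(missing_counts)
```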

# Basic Cleaning

To remove missing data, use `dropna()`. Pass the `subset` argument to check only specific columns for missing values.

In [None]:
cities_df = cities_df.dropna(subset=['subcountry'])    # drop rows where `subcountry` is missing

In [None]:
cities_df.isnull().sum()  # check that it worked
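
The effect of `subset` is easiest to see by comparing it with a plain `dropna()`. A minimal sketch on a made-up `toy_df` (hypothetical values, not the course data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row is missing `subcountry`, another is missing `geonameid`
toy_df = pd.DataFrame({
    "name": ["Paris", "Lima", "Oslo"],
    "subcountry": ["Ile-de-France", np.nan, "Oslo"],
    "geonameid": [1.0, 2.0, np.nan],
})

# Without `subset`, a row is dropped if ANY column is missing
drop_any = toy_df.dropna()

# With `subset`, only the listed columns are checked
drop_subset = toy_df.dropna(subset=["subcountry"])

print(len(toy_df), len(drop_any), len(drop_subset))
```

Note that `dropna()` returns a new DataFrame and leaves `toy_df` unchanged, which is why the notes above assign the result back to `cities_df`.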

# Basic Data Selection and Filtering
### Selecting columns
Select a column by putting its name in square brackets after the dataframe.

In [None]:
# selecting a single column. this returns a series
cities_df['name'].head()   # Using the `head()` method to keep my notebook clean FOR DEMO

In [None]:
# selecting multiple columns. put the columns we want into a list. this returns a dataframe
cities_df[['name', 'country']].head()   # the double brackets are necessary here: the outer `[]` indexes the dataframe, the inner `[]` is the list of columns we want

To make a series from a column, request it like `df['column']`

To make a dataframe from a column, request it as a single-item list: `df[['column']]`
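
A quick way to confirm the difference between the two forms is to check the type and shape of each result. This sketch uses a made-up `toy_df` (hypothetical, for illustration only).

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    "name": ["Paris", "Lima"],
    "country": ["France", "Peru"],
})

as_series = toy_df["name"]     # single brackets: a Series (1-dimensional)
as_frame = toy_df[["name"]]    # double brackets: a DataFrame (2-dimensional)

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```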

### Filtering rows

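Use conditional expressions to filter rows. The comparison produces a boolean Series (a "mask"), and indexing with that mask keeps only the rows where it is `True`.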

In [None]:
us_cities = cities_df[ cities_df['country'] == 'United States' ]
us_cities.head()

Another way to think about this:

```python
rows_we_want = cities_df['country'] == 'United States'
us_cities = cities_df[rows_we_want]
```

We can combine conditions with logical operators to filter on multiple columns!

In [None]:
# the declarative and expressive way
in_us = cities_df['country'] == 'United States'
in_ca = cities_df['subcountry'] == 'California'

california_cities = cities_df[ in_us & in_ca ]
california_cities.head()

Doing this more succinctly:

```python
california_cities = cities_df[
    (cities_df['country'] == 'United States') &
    (cities_df['subcountry'] == 'California')         # must wrap each condition in (): `&` binds more tightly than `==`
]
```
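
The same masks can be combined in other ways too. The sketch below assumes a made-up `toy_df` (hypothetical rows, not the course data) and uses `&` (and), `|` (or), and `~` (not), which are the element-wise operators pandas uses in place of Python's `and`, `or`, and `not`.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    "name": ["Los Angeles", "Austin", "Toronto", "San Diego"],
    "country": ["United States", "United States", "Canada", "United States"],
    "subcountry": ["California", "Texas", "Ontario", "California"],
})

# Each comparison gives a boolean Series with one value per row
in_us = toy_df["country"] == "United States"
in_ca = toy_df["subcountry"] == "California"

us_and_ca = toy_df[in_us & in_ca]      # both conditions are True
us_not_ca = toy_df[in_us & ~in_ca]     # in the US, but not in California
ca_or_canada = toy_df[in_ca | (toy_df["country"] == "Canada")]   # either condition is True

print(us_and_ca["name"].tolist())
print(us_not_ca["name"].tolist())
print(ca_or_canada["name"].tolist())
```

Naming the masks first, as here, sidesteps the parentheses issue because each comparison is evaluated on its own line.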

### Combining conditions

... coming soon ...

# Basic Sorting and Ranking

.... coming soon ....

# Basic Transformations

.... coming soon.....

# Basic Grouping and Aggregation

The workhorse method for grouping in pandas is `groupby()`.

Use aggregation on groupby to summarize the data within each group

In [None]:
# Number of cities by country
cities_per_country = cities_df.groupby('country')   
cities_per_country['name'].count().sort_values(ascending=False).head(10)
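
The same group-then-count pattern can be checked on a tiny made-up dataset where the counts are obvious. `toy_df` below is hypothetical. The `value_counts()` line is an alternative shown for comparison, not something used in the session; it gives the same counts here only because the toy data has no missing values.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    "name": ["Los Angeles", "Austin", "Toronto", "San Diego", "Lyon"],
    "country": ["United States", "United States", "Canada", "United States", "France"],
})

# groupby() alone only defines the groups; nothing is computed yet
by_country = toy_df.groupby("country")

# Aggregation step: count the non-missing names within each group
counts = by_country["name"].count().sort_values(ascending=False)

# Alternative for simple counts of one column's values
shortcut = toy_df["country"].value_counts()

print(counts)
```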