## 1. Introducing DataFrames

**What's the point of `pandas`?**
> - `pandas` is a Python package for data manipulation.
> - It can also be used for data visualization; we'll get to that in Chapter 4.

**Course outline**
> - We'll start by talking about DataFrames, which form the core of `pandas`.
> - In Chapter 2, we'll discuss aggregating data to gather insights.
> - In Chapter 3, you'll learn all about slicing and indexing to subset DataFrames.
> - Finally, you'll visualize your data, deal with missing data, and read data into a DataFrame. Let's dive in.

**`pandas` is built on NumPy and Matplotlib**
> - `pandas` is built on top of two essential Python packages, `numpy` and `matplotlib`.
> - `numpy` provides multidimensional array objects for easy data manipulation that pandas uses to store data.
> - `matplotlib` has powerful data visualization capabilities that pandas takes advantage of.

**`pandas` is popular**
> - `pandas` has millions of users, with [PyPI](https://pypistats.org/packages/pandas) recording about 14 million downloads in December 2019.
> - This represents almost the entire Python data science community!

**Rectangular data**
> - There are several ways to store data for analysis, but rectangular data, sometimes called "tabular data", is the most common form.
> - In this example, with dogs, each observation, or each dog, is a row, and each variable, or each dog property, is a column.
> - `pandas` is designed to work with rectangular data like this.

<img src="./assets/ch01_01_introducing_dataframes_img01.png">

**pandas DataFrames**
> - In `pandas`, rectangular data is represented as a DataFrame object.
> - Every programming language used for data analysis has something similar to this.
> - R also has DataFrames, while SQL has database tables.
> - Every value within a column has the same data type, such as text or numeric, but different columns can contain different data types.

<img src="./assets/ch01_01_introducing_dataframes_img02.png">
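To make the "one data type per column" idea concrete, here is a minimal sketch. The `dogs` values below are invented for illustration and are not the course's dataset; the `.dtypes` attribute reports the type pandas chose for each column.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration (not the course's dataset).
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Stella"],   # text column
    "height_cm": [56, 43, 18],                # integer column
    "weight_kg": [24.0, 24.0, 2.0],           # floating-point column
})

# Each column holds a single data type; different columns can differ.
print(dogs.dtypes)
```

The exact label printed for the text column depends on the installed pandas version, so no expected output is shown here.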

**Exploring a DataFrame: `.head()`**
> - When you first receive a new dataset, you want to quickly explore it and get a sense of its contents.
> - `pandas` has several methods for this. The first is `.head()`, which returns the first few rows of the DataFrame.
> - We only had seven rows to begin with, so it's not super exciting, but this becomes very useful if you have many rows.

<img src="./assets/ch01_01_introducing_dataframes_img03.png">

**Exploring a DataFrame: `.info()`**
> - The `.info()` method displays the names of columns, the data types they contain, and whether they have any missing values.

<img src="./assets/ch01_01_introducing_dataframes_img04.png">

**Exploring a DataFrame: `.shape`**
> - A DataFrame's `.shape` attribute contains a tuple that holds the number of rows followed by the number of columns.
> - Since this is an attribute instead of a method, you write it without parentheses.

<img src="./assets/ch01_01_introducing_dataframes_img05.png">

**Exploring a DataFrame: `.describe()`**
> - The `.describe()` method computes some summary statistics for numerical columns, like mean and median.
> - "count" is the number of non-missing values in each column.
> - `.describe()` is good for a quick overview of numeric variables, but if you want more control, you'll see how to perform more specific calculations later in the course.

<img src="./assets/ch01_01_introducing_dataframes_img06.png">

**Components of a DataFrame: `.values`**
> - DataFrames consist of three different components, accessible using attributes.
> - The `.values` attribute, as you might expect, contains the data values in a 2-dimensional `numpy` array.

<img src="./assets/ch01_01_introducing_dataframes_img07.png">

**Components of a DataFrame: `.columns` and `.index`**
> - The other two components of a DataFrame are labels for columns and rows.
> - The `.columns` attribute contains column names, and the `.index` attribute contains row numbers or row names.
> - Be careful, since row labels are stored in `.index`, not in `.rows`.
> - Notice that these are `Index` objects, which we'll cover in Chapter 3.
> - This allows for flexibility in labels. For example, the dogs data uses row numbers, but row names are also possible.

<img src="./assets/ch01_01_introducing_dataframes_img08.png">
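The three components can be seen side by side on a small DataFrame that uses row names instead of row numbers. This is only a sketch: the data is invented, and passing `index=` to the constructor is just one of several ways to attach row names.

```python
import pandas as pd

# Hypothetical toy data with row names (dog names) instead of row numbers.
dogs = pd.DataFrame(
    {"breed": ["Labrador", "Chihuahua"], "height_cm": [56, 18]},
    index=["Bella", "Stella"],
)

print(dogs.values)   # the data values as a 2D NumPy array
print(dogs.columns)  # column labels
print(dogs.index)    # row labels (note: .index, not .rows)
```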

**pandas Philosophy**
> - Python has a semi-official philosophy on how to write good code called [The Zen of Python](https://www.python.org/dev/peps/pep-0020/).
> - One suggestion is that, given a programming problem, there should be one, and preferably only one, obvious way to solve it.
> - As you go through this course, bear in mind that `pandas` deliberately doesn't follow this philosophy.
> - Instead, there are often multiple ways to solve a problem, leaving you to choose the best.
> - In this respect, `pandas` is like a Swiss Army Knife, giving you a variety of tools, making it incredibly powerful, but more difficult to learn.
> - In this course, we aim for a more streamlined approach to `pandas`, only covering the most important ways of doing things.

### 1.1. Inspecting a DataFrame

When you get a new DataFrame to work with, the first thing you need to do is explore it and see what it contains. There are several useful methods and attributes for this.

> - `.head()` returns the first few rows (the “head” of the DataFrame).
> - `.info()` shows information on each of the columns, such as the data type and number of non-missing values.
> - `.shape` returns the number of rows and columns of the DataFrame.
> - `.describe()` calculates a few summary statistics for each column.

`homelessness` is a DataFrame containing estimates of homelessness in each U.S. state in 2018. The `individuals` column is the number of homeless individuals who are not part of a family with children. The `family_members` column is the number of homeless individuals who are part of a family with children. The `state_pop` column is the state's total population.

`pandas` is imported for you.

In [1]:
# Import pandas and read the data:
import pandas as pd
homelessness = pd.read_csv("./datasets/homelessness.csv").drop(columns="Unnamed: 0")

- Print the head of the `homelessness` DataFrame.

In [2]:
homelessness.head()

Unnamed: 0,region,state,individuals,family_members,state_pop
0,East South Central,Alabama,2570.0,864.0,4887681
1,Pacific,Alaska,1434.0,582.0,735139
2,Mountain,Arizona,7259.0,2606.0,7158024
3,West South Central,Arkansas,2280.0,432.0,3009733
4,Pacific,California,109008.0,20964.0,39461588


- Print information about the column types and missing values in `homelessness`.

In [3]:
homelessness.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   region          51 non-null     object 
 1   state           51 non-null     object 
 2   individuals     51 non-null     float64
 3   family_members  51 non-null     float64
 4   state_pop       51 non-null     int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 2.1+ KB


- Print the number of rows and columns in `homelessness`.

In [4]:
homelessness.shape

(51, 5)

- Print some summary statistics that describe the `homelessness` DataFrame.

In [5]:
homelessness.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
individuals,51.0,7225.784,15991.03,434.0,1446.5,3082.0,6781.5,109008.0
family_members,51.0,3504.882,7805.412,75.0,592.0,1482.0,3196.0,52070.0
state_pop,51.0,6405637.0,7327258.0,577601.0,1777413.5,4461153.0,7340946.5,39461588.0


### 1.2. Parts of a DataFrame

To better understand DataFrame objects, it's useful to know that they consist of three components, stored as attributes:

> - `.values`: A two-dimensional NumPy array of values.
> - `.columns`: An index of columns: the column names.
> - `.index`: An index for the rows: either row numbers or row names.

You can usually think of indexes as a list of strings or numbers, though the pandas `Index` data type allows for more sophisticated options. (These will be covered later in the course.)

`homelessness` is available.

- Import `pandas` using the alias `pd`.

In [6]:
import pandas as pd

- Print a 2D NumPy array of the values in `homelessness`.

In [7]:
homelessness.values[:5]

array([['East South Central', 'Alabama', 2570.0, 864.0, 4887681],
       ['Pacific', 'Alaska', 1434.0, 582.0, 735139],
       ['Mountain', 'Arizona', 7259.0, 2606.0, 7158024],
       ['West South Central', 'Arkansas', 2280.0, 432.0, 3009733],
       ['Pacific', 'California', 109008.0, 20964.0, 39461588]],
      dtype=object)

In [8]:
homelessness.to_numpy()[:5]

array([['East South Central', 'Alabama', 2570.0, 864.0, 4887681],
       ['Pacific', 'Alaska', 1434.0, 582.0, 735139],
       ['Mountain', 'Arizona', 7259.0, 2606.0, 7158024],
       ['West South Central', 'Arkansas', 2280.0, 432.0, 3009733],
       ['Pacific', 'California', 109008.0, 20964.0, 39461588]],
      dtype=object)

- Print the column names of `homelessness`.

In [9]:
homelessness.columns

Index(['region', 'state', 'individuals', 'family_members', 'state_pop'], dtype='object')

- Print the index of `homelessness`.

In [10]:
homelessness.index

RangeIndex(start=0, stop=51, step=1)

## 2. Sorting and Subsetting

**Sorting**
> - The first thing you can do is change the order of the rows by sorting them so that the most interesting data is at the top of the DataFrame.
> - You can sort rows using the `.sort_values()` method, passing in a column name that you want to sort by.
> - For example, when we apply `.sort_values()` to the `weight_kg` column of the `dogs` DataFrame, we get the lightest dog at the top, Stella the Chihuahua, and the heaviest dog at the bottom, Bernie the Saint Bernard.

<img src="./assets/ch01_02_sorting_and_subsetting_img01.png">

**Sorting in descending order**
> - Setting the `ascending` argument to `False` will sort the data the other way around, from heaviest dog to lightest dog.

<img src="./assets/ch01_02_sorting_and_subsetting_img02.png">

**Sorting by multiple variables**
> - We can sort by multiple variables by passing a list of column names to `.sort_values()`.
> - Here, we sort first by `weight_kg`, then by `height_cm`.
> - Now, Charlie, Lucy, and Bella are ordered from shortest to tallest, even though they all weigh the same.

<img src="./assets/ch01_02_sorting_and_subsetting_img03.png">

**Sorting by multiple variables**
> - To change the direction values are sorted in, pass a list to the ascending argument to specify which direction sorting should be done for each variable.
> - Now, Charlie, Lucy, and Bella are ordered from tallest to shortest.

<img src="./assets/ch01_02_sorting_and_subsetting_img04.png">
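A minimal sketch of sorting on two columns with a separate direction for each. The heights and weights are invented for illustration, so the resulting order only mirrors the slide if the real data happens to agree.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Stella", "Bernie"],
    "height_cm": [56, 43, 46, 18, 77],
    "weight_kg": [24, 24, 24, 2, 74],
})

# Sort by weight (lightest first); break ties by height (tallest first).
sorted_dogs = dogs.sort_values(
    ["weight_kg", "height_cm"],
    ascending=[True, False],
)
print(sorted_dogs)
```

The `ascending` list is matched to the column list position by position, so the two lists need the same length.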

**Subsetting columns**
> - We may want to zoom in on just one column.
> - We can do this using the name of the DataFrame, followed by square brackets with a column name inside.
> - Here, we can look at just the name column.

<img src="./assets/ch01_02_sorting_and_subsetting_img05.png">

**Subsetting multiple columns**
> - To select multiple columns, you need two pairs of square brackets.
> - In this code, the inner and outer square brackets are performing different tasks.
> - The outer square brackets are responsible for subsetting the DataFrame, and the inner square brackets are creating a list of column names to subset.
> - This means you could provide a separate list of column names as a variable and then use that list to perform the same subsetting. Usually, it's easier to do in one line.

<img src="./assets/ch01_02_sorting_and_subsetting_img06.png">
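The two roles of the square brackets are easier to see when the list of column names is stored in a variable first. This sketch uses invented data, and the variable name `cols_to_subset` is just an illustrative choice.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Stella"],
    "breed": ["Labrador", "Poodle", "Chihuahua"],
    "height_cm": [56, 43, 18],
})

# One line: the inner brackets build a list, the outer brackets subset.
one_line = dogs[["breed", "height_cm"]]

# Two steps: build the list first, then subset with it.
cols_to_subset = ["breed", "height_cm"]
two_steps = dogs[cols_to_subset]

# Both approaches select the same columns in the same order.
print(one_line.equals(two_steps))
```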

**Subsetting rows**
> - There are lots of different ways to subset rows.
> - The most common way to do this is by creating a logical condition to filter against.
> - For example, let's find all the dogs whose height is greater than 50 centimeters.
> - Now we have a True or False value for every row.

<img src="./assets/ch01_02_sorting_and_subsetting_img07.png">

**Subsetting rows**
> - We can use the logical condition inside of square brackets to subset the rows we're interested in to get all of the dogs taller than 50 centimeters.

<img src="./assets/ch01_02_sorting_and_subsetting_img08.png">
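The two steps described above, building a True/False condition and then using it inside square brackets, can be sketched as follows. The data is invented for illustration.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Stella", "Bernie"],
    "height_cm": [56, 43, 46, 18, 77],
})

# Step 1: the comparison gives one True/False value per row.
is_tall = dogs["height_cm"] > 50
print(is_tall)

# Step 2: passing that condition in square brackets keeps only the True rows.
tall_dogs = dogs[is_tall]
print(tall_dogs)
```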

**Subsetting based on text data**
> - We can also subset rows based on text data. Here, we use the double equal sign in the logical condition to filter the dogs that are Labradors.

<img src="./assets/ch01_02_sorting_and_subsetting_img09.png">

**Subsetting based on dates**
> - We can also subset based on dates. Here, we filter all the dogs born before 2015.
> - Notice that the dates are in quotes and are written as year, then month, then day.
> - This is the international standard date format (ISO 8601).

<img src="./assets/ch01_02_sorting_and_subsetting_img10.png">
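A small sketch of date-based filtering, assuming the date column has already been parsed into pandas datetimes. The column name `date_of_birth` and the dates themselves are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; dates are written as year-month-day.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Stella"],
    "date_of_birth": pd.to_datetime(["2013-07-01", "2016-09-16", "2015-04-20"]),
})

# Compare the datetime column against a quoted year-month-day date.
born_before_2015 = dogs[dogs["date_of_birth"] < "2015-01-01"]
print(born_before_2015)
```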

**Subsetting based on multiple conditions**
> - To subset the rows that meet multiple conditions, you can combine conditions using logical operators, such as the "and" operator seen here.
> - This means that only rows that meet both of these conditions will be subsetted.
> - You could also do this in one line of code, but then you'll need to add parentheses around each condition.

<img src="./assets/ch01_02_sorting_and_subsetting_img11.png">
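Here is a sketch of both styles on invented data. The parentheses matter in the one-line version because Python evaluates `&` before comparisons such as `==`, so leaving them out changes what is being compared.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Bernie"],
    "breed": ["Labrador", "Poodle", "Labrador", "St. Bernard"],
    "color": ["brown", "black", "black", "white"],
})

# Two steps: name each condition, then combine them with &.
is_lab = dogs["breed"] == "Labrador"
is_brown = dogs["color"] == "brown"
brown_labs = dogs[is_lab & is_brown]

# One line: each condition needs its own pair of parentheses.
brown_labs_one_line = dogs[(dogs["breed"] == "Labrador") & (dogs["color"] == "brown")]

print(brown_labs)
```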

**Subsetting using `.isin()`**
> - If you want to filter on multiple values of a categorical variable, the easiest way is to use the `.isin()` method. This takes in a list of values to filter for.
> - Here, we check if the color of a dog is black or brown, and use this condition to subset the data.

<img src="./assets/ch01_02_sorting_and_subsetting_img12.png">

### 2.1. Sorting rows

Finding interesting bits of data in a DataFrame is often easier if you change the order of the rows. You can sort the rows by passing a column name to `.sort_values()`.

In cases where rows have the same value (this is common if you sort on a categorical variable), you may wish to break the ties by sorting on another column. You can sort on multiple columns in this way by passing a list of column names.

<img src="./assets/ch01_02_01_sorting_rows_img01.png">

By combining `.sort_values()` with `.head()`, you can answer questions in the form, "What are the top cases where…?".
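As a sketch of the "top cases" pattern, the example below finds the two heaviest dogs in an invented dataset by sorting in descending order and keeping the first two rows.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Stella", "Bernie"],
    "weight_kg": [24, 22, 2, 74],
})

# Sort from heaviest to lightest, then keep the first two rows.
heaviest_two = dogs.sort_values("weight_kg", ascending=False).head(2)
print(heaviest_two)
```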

`homelessness` is available and `pandas` is loaded as `pd`.

In [11]:
homelessness.head()

Unnamed: 0,region,state,individuals,family_members,state_pop
0,East South Central,Alabama,2570.0,864.0,4887681
1,Pacific,Alaska,1434.0,582.0,735139
2,Mountain,Arizona,7259.0,2606.0,7158024
3,West South Central,Arkansas,2280.0,432.0,3009733
4,Pacific,California,109008.0,20964.0,39461588


- Sort `homelessness` by the number of homeless individuals, from smallest to largest, and save this as `homelessness_ind`.
- Print the head of the sorted DataFrame.

In [12]:
homelessness_ind = homelessness.sort_values(by="individuals", ascending=True)
homelessness_ind.head()

Unnamed: 0,region,state,individuals,family_members,state_pop
50,Mountain,Wyoming,434.0,205.0,577601
34,West North Central,North Dakota,467.0,75.0,758080
7,South Atlantic,Delaware,708.0,374.0,965479
39,New England,Rhode Island,747.0,354.0,1058287
45,New England,Vermont,780.0,511.0,624358


- Sort `homelessness` by the number of homeless `family_members` in descending order, and save this as `homelessness_fam`.
- Print the head of the sorted DataFrame.

In [13]:
homelessness_fam = homelessness.sort_values(by="family_members", ascending=False)
homelessness_fam.head()

Unnamed: 0,region,state,individuals,family_members,state_pop
32,Mid-Atlantic,New York,39827.0,52070.0,19530351
4,Pacific,California,109008.0,20964.0,39461588
21,New England,Massachusetts,6811.0,13257.0,6882635
9,South Atlantic,Florida,21443.0,9587.0,21244317
43,West South Central,Texas,19199.0,6111.0,28628666


- Sort `homelessness` first by region (ascending), and then by number of family members (descending). Save this as `homelessness_reg_fam`.
- Print the head of the sorted DataFrame.

In [14]:
homelessness_reg_fam = homelessness.sort_values(by=["region", "family_members"], ascending=[True, False])
homelessness_reg_fam.head()

Unnamed: 0,region,state,individuals,family_members,state_pop
13,East North Central,Illinois,6752.0,3891.0,12723071
35,East North Central,Ohio,6929.0,3320.0,11676341
22,East North Central,Michigan,5209.0,3142.0,9984072
49,East North Central,Wisconsin,2740.0,2167.0,5807406
14,East North Central,Indiana,3776.0,1482.0,6695497


### 2.2. Subsetting columns

When working with data, you may not need all of the variables in your dataset. Square brackets (`[]`) can be used to select only the columns that matter to you in an order that makes sense to you. To select only `"col_a"` of the DataFrame `df`, use `df["col_a"]`.

To select `"col_a"` and `"col_b"` of `df`, use `df[["col_a", "col_b"]]`.
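One detail worth checking for yourself: single brackets return a one-dimensional Series, while double brackets return a DataFrame, even when only one column is selected. A minimal sketch, using the placeholder names `df`, `col_a`, and `col_b` from the text above:

```python
import pandas as pd

# Placeholder DataFrame matching the names used in the text.
df = pd.DataFrame({"col_a": [1, 2], "col_b": [3, 4]})

as_series = df["col_a"]    # single brackets: a pandas Series
as_frame = df[["col_a"]]   # double brackets: a one-column DataFrame

print(type(as_series))
print(type(as_frame))
```

This is why the exercises that ask for "a DataFrame that contains only one column" use double brackets.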

`homelessness` is available and `pandas` is loaded as `pd`.

In [15]:
homelessness.head()

Unnamed: 0,region,state,individuals,family_members,state_pop
0,East South Central,Alabama,2570.0,864.0,4887681
1,Pacific,Alaska,1434.0,582.0,735139
2,Mountain,Arizona,7259.0,2606.0,7158024
3,West South Central,Arkansas,2280.0,432.0,3009733
4,Pacific,California,109008.0,20964.0,39461588


- Create a DataFrame called `individuals` that contains only the `individuals` column of `homelessness`.
- Print the head of the result.

In [16]:
individuals = homelessness[["individuals"]]
individuals.head()

Unnamed: 0,individuals
0,2570.0
1,1434.0
2,7259.0
3,2280.0
4,109008.0


- Create a DataFrame called `state_fam` that contains only the `state` and `family_members` columns of `homelessness`, in that order.
- Print the head of the result.

In [17]:
state_fam = homelessness[["state", "family_members"]]
state_fam.head()

Unnamed: 0,state,family_members
0,Alabama,864.0
1,Alaska,582.0
2,Arizona,2606.0
3,Arkansas,432.0
4,California,20964.0


- Create a DataFrame called `ind_state` that contains the `individuals` and `state` columns of `homelessness`, in that order.
- Print the head of the result.

In [18]:
ind_state = homelessness[["individuals", "state"]]
ind_state.head()

Unnamed: 0,individuals,state
0,2570.0,Alabama
1,1434.0,Alaska
2,7259.0,Arizona
3,2280.0,Arkansas
4,109008.0,California


### 2.3. Subsetting rows

A large part of data science is about finding which bits of your dataset are interesting. One of the simplest techniques for this is to find a subset of rows that match some criteria. This is sometimes known as filtering rows or selecting rows.

There are many ways to subset a DataFrame. Perhaps the most common is to use relational operators to return `True` or `False` for each row, then pass that inside square brackets.
```
dogs[dogs["height_cm"] > 60]
dogs[dogs["color"] == "tan"]
```

You can filter for multiple conditions at once by using the "bitwise and" operator, `&`.

```
dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]
```

`homelessness` is available and `pandas` is loaded as `pd`.

In [19]:
homelessness.head()

Unnamed: 0,region,state,individuals,family_members,state_pop
0,East South Central,Alabama,2570.0,864.0,4887681
1,Pacific,Alaska,1434.0,582.0,735139
2,Mountain,Arizona,7259.0,2606.0,7158024
3,West South Central,Arkansas,2280.0,432.0,3009733
4,Pacific,California,109008.0,20964.0,39461588


- Filter `homelessness` for cases where the number of individuals is greater than ten thousand, assigning to `ind_gt_10k`. View the printed result.

In [20]:
ind_gt_10k = homelessness[homelessness["individuals"] > 10_000]
ind_gt_10k

Unnamed: 0,region,state,individuals,family_members,state_pop
4,Pacific,California,109008.0,20964.0,39461588
9,South Atlantic,Florida,21443.0,9587.0,21244317
32,Mid-Atlantic,New York,39827.0,52070.0,19530351
37,Pacific,Oregon,11139.0,3337.0,4181886
43,West South Central,Texas,19199.0,6111.0,28628666
47,Pacific,Washington,16424.0,5880.0,7523869


- Filter `homelessness` for cases where the US Census region is `"Mountain"`, assigning to `mountain_reg`. View the printed result.

In [21]:
mountain_reg = homelessness[homelessness["region"] == "Mountain"]
mountain_reg

Unnamed: 0,region,state,individuals,family_members,state_pop
2,Mountain,Arizona,7259.0,2606.0,7158024
5,Mountain,Colorado,7607.0,3250.0,5691287
12,Mountain,Idaho,1297.0,715.0,1750536
26,Mountain,Montana,983.0,422.0,1060665
28,Mountain,Nevada,7058.0,486.0,3027341
31,Mountain,New Mexico,1949.0,602.0,2092741
44,Mountain,Utah,1904.0,972.0,3153550
50,Mountain,Wyoming,434.0,205.0,577601


- Filter `homelessness` for cases where the number of `family_members` is less than one thousand and the region is "Pacific", assigning to `fam_lt_1k_pac`. View the printed result.

In [22]:
fam_lt_1k_pac = homelessness[(homelessness["family_members"] < 1_000) & (homelessness["region"] == "Pacific")]
fam_lt_1k_pac

Unnamed: 0,region,state,individuals,family_members,state_pop
1,Pacific,Alaska,1434.0,582.0,735139


### 2.4. Subsetting rows by categorical variables

Subsetting data based on a categorical variable often involves using the "or" operator (`|`) to select rows from multiple categories. This can get tedious when you want all states in one of three different regions, for example. Instead, use the `.isin()` method, which will allow you to tackle this problem by writing one condition instead of three separate ones.

```
colors = ["brown", "black", "tan"]
condition = dogs["color"].isin(colors)
dogs[condition]
```
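To see that `.isin()` replaces a chain of "or" conditions, the sketch below builds the same subset both ways on invented data and checks that the results match.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Stella", "Bernie"],
    "color": ["brown", "black", "gray", "tan", "white"],
})

# Three separate conditions joined with the "or" operator.
with_or = dogs[
    (dogs["color"] == "brown")
    | (dogs["color"] == "black")
    | (dogs["color"] == "tan")
]

# One condition using .isin() with a list of accepted values.
colors = ["brown", "black", "tan"]
with_isin = dogs[dogs["color"].isin(colors)]

print(with_or.equals(with_isin))
```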

`homelessness` is available and `pandas` is loaded as `pd`.

In [23]:
homelessness.head()

Unnamed: 0,region,state,individuals,family_members,state_pop
0,East South Central,Alabama,2570.0,864.0,4887681
1,Pacific,Alaska,1434.0,582.0,735139
2,Mountain,Arizona,7259.0,2606.0,7158024
3,West South Central,Arkansas,2280.0,432.0,3009733
4,Pacific,California,109008.0,20964.0,39461588


- Filter `homelessness` for cases where the US Census region is `"South Atlantic"` or `"Mid-Atlantic"`, assigning to `south_mid_atlantic`. View the printed result.

In [24]:
south_mid_atlantic = homelessness[homelessness["region"].isin(["South Atlantic", "Mid-Atlantic"])]
south_mid_atlantic

Unnamed: 0,region,state,individuals,family_members,state_pop
7,South Atlantic,Delaware,708.0,374.0,965479
8,South Atlantic,District of Columbia,3770.0,3134.0,701547
9,South Atlantic,Florida,21443.0,9587.0,21244317
10,South Atlantic,Georgia,6943.0,2556.0,10511131
20,South Atlantic,Maryland,4914.0,2230.0,6035802
30,Mid-Atlantic,New Jersey,6048.0,3350.0,8886025
32,Mid-Atlantic,New York,39827.0,52070.0,19530351
33,South Atlantic,North Carolina,6451.0,2817.0,10381615
38,Mid-Atlantic,Pennsylvania,8163.0,5349.0,12800922
40,South Atlantic,South Carolina,3082.0,851.0,5084156


- Filter `homelessness` for cases where the `state` is in the list of Mojave states, `canu`, assigning to `mojave_homelessness`. View the printed result.

In [25]:
canu = ["California", "Arizona", "Nevada", "Utah"]
mojave_homelessness = homelessness[homelessness["state"].isin(canu)]
mojave_homelessness

Unnamed: 0,region,state,individuals,family_members,state_pop
2,Mountain,Arizona,7259.0,2606.0,7158024
4,Pacific,California,109008.0,20964.0,39461588
28,Mountain,Nevada,7058.0,486.0,3027341
44,Mountain,Utah,1904.0,972.0,3153550


## 3. New columns