---
title: "Create A Series Object From a List"
description: "A Series object is a 1D array, similar to a column in a spreadsheet (with a column name and row labels). Series objects in pandas behave much like one-dimensional NumPy ndarrays, and you can often pass them as parameters to NumPy ufuncs."
tags: Pandas, Numpy, Lists
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [1]:
import pandas as pd

In [6]:
# create a Series from a list
s = pd.Series([2, -1, 3, 5])
s

0    2
1   -1
2    3
3    5
dtype: int64

# Example&

## Similar to a 1D `ndarray`

`Series` objects behave much like one-dimensional NumPy `ndarray`s, and you can often pass them as parameters to NumPy functions:

In [8]:
import numpy as np
np.sin(s)

0    0.909297
1   -0.841471
2    0.141120
3   -0.958924
dtype: float64

Arithmetic operations on `Series` are also possible, and they apply *elementwise*, just like for `ndarray`s:

In [4]:
s + [1000,2000,3000,4000]

0    1002
1    1999
2    3003
3    4005
dtype: int64

Similar to NumPy, if you add a single number to a `Series`, that number is added to all items in the `Series`. This is called *broadcasting*:

In [5]:
s + 1000

0    1002
1     999
2    1003
3    1005
dtype: int64

The same is true for all binary operations such as `*` or `/`, and even conditional operations:

In [7]:
s < 0

0    False
1     True
2    False
3    False
dtype: bool
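
A boolean `Series` like the one above is often used as a mask. The following is a minimal sketch of that idea, reusing the same values; the names `mask`, `negatives` and `doubled` are only illustrative.

```python
import pandas as pd

s = pd.Series([2, -1, 3, 5])

# A comparison is broadcast to every item and yields a boolean Series.
mask = s < 0

# Indexing with the boolean Series keeps only the items where it is True.
negatives = s[mask]

# Broadcasting works the same way for other operators, e.g. multiplication.
doubled = s * 2

print(negatives)
print(doubled)
```

Note that the selected item keeps its original index label (`1`).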

---
title: "Index Labels For A Series"
description: "Each item in a pandas Series object has a unique identifier called the index label. By default, it is the rank of the item in the Series (starting at 0) but you can also set the index labels manually."
tags: Pandas
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [1]:
import pandas as pd

In [None]:
s = pd.Series(values, index = ["label_1", "label_2", "label_3", "..."])

# Example&

In [9]:
s2 = pd.Series([68, 83, 112, 68], index=["alice", "bob", "charles", "darwin"])
s2

alice       68
bob         83
charles    112
darwin      68
dtype: int64

# Learn More&

To make it clear when you are accessing by label or by integer location, it is recommended to always use the `loc` attribute when accessing by label, and the `iloc` attribute when accessing by integer location. Check out the bricks on these attributes to learn more!
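
As a quick sketch of that recommendation, the snippet below reads the same item once by label and once by integer location. It reuses the `s2` values from the example; the variable names `by_label` and `by_position` are only illustrative.

```python
import pandas as pd

s2 = pd.Series([68, 83, 112, 68], index=["alice", "bob", "charles", "darwin"])

by_label = s2.loc["bob"]     # access by index label
by_position = s2.iloc[1]     # access by integer location (0-based)

print(by_label, by_position)
```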

---
title: "Locate A Value By Label (loc) In A Series"
description: "Access any element by its label in a pandas Series using the loc attribute."
tags: Pandas
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [1]:
import pandas as pd

In [None]:
# locate value at label 'my_label'
s.loc['my_label']

# Example&

## Create an Example Series

In [10]:
tiger = pd.Series([5,3,3,4], index = ['the_masters', 'us_open', 'the_open', 'pga_championship'])
tiger

the_masters         5
us_open             3
the_open            3
pga_championship    4
dtype: int64

## Locate Value at Label 'the_masters'

In [11]:
tiger.loc['the_masters']

5

---
title: "Locate A Value By Index (iloc) In A Series"
description: "Access any element by its integer location in a pandas Series using the iloc attribute."
tags: Pandas
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [1]:
import pandas as pd

In [None]:
# locate the value at integer location `index` in Series s
s.iloc[index]

In [None]:
# Locate values in an index slice
s.iloc[lower : upper]

# Example&

## Create an Example Series

In [10]:
tiger = pd.Series([5,3,3,4], index = ['the_masters', 'us_open', 'the_open', 'pga_championship'])
tiger

the_masters         5
us_open             3
the_open            3
pga_championship    4
dtype: int64

## Locate the Value at Integer Location 1 (the 2nd Item)

In [12]:
tiger.iloc[1]

3

## Slicing a `Series` also slices the index labels:

In [14]:
tiger.iloc[2:4]

the_open            3
pga_championship    4
dtype: int64

# Example&

## Be Careful: Slicing a Series can lead to unexpected results when using default labels!

In [16]:
surprise = pd.Series([1000, 1001, 1002, 1003])
surprise

0    1000
1    1001
2    1002
3    1003
dtype: int64

In [17]:
surprise_slice = surprise[2:]
surprise_slice

2    1002
3    1003
dtype: int64

Oh look! The first element has index label `2`. The element with index label `0` is absent from the slice:

In [16]:
try:
    surprise_slice[0]
except KeyError as e:
    print("Key error:", e)

Key error: 0


But remember that you can access elements by integer location using the `iloc` attribute.

In [17]:
surprise_slice.iloc[0]

1002
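
If you would rather have the slice numbered from `0` again, one option is to reset its index. This is a small sketch, assuming you do not need the old labels; the name `renumbered` is only illustrative.

```python
import pandas as pd

surprise = pd.Series([1000, 1001, 1002, 1003])
surprise_slice = surprise.iloc[2:]      # index labels are still 2 and 3

# drop=True discards the old labels instead of keeping them as a column.
renumbered = surprise_slice.reset_index(drop=True)

print(renumbered)
```

After the reset, label `0` and integer location `0` refer to the same item again.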

---
title: "Create A Series Object From A Dictionary"
description: "A Series object is a 1D array, similar to a column in a spreadsheet (with a column name and row labels). Creating a Series object from a dictionary will make the keys the index labels."
tags: Pandas, Dictionaries
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [1]:
import pandas as pd

In [None]:
# my_dictionary's keys will be labels and values will be values
s = pd.Series(my_dictionary)

# Example&

In [18]:
population = {"NYC": 8550000, "LA": 4000000, "CHI": 3000000, "HOU": 2500000}
s = pd.Series(population)
s

NYC    8550000
LA     4000000
CHI    3000000
HOU    2500000
dtype: int64

You can control which elements you want to include in the `Series` and in what order by explicitly specifying the desired `index`:

In [20]:
s2 = pd.Series(population, index = ["LA", "CHI"])
s2

LA     4000000
CHI    3000000
dtype: int64
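
One detail worth knowing: a label in `index` that is not a key of the dictionary does not raise an error. The sketch below illustrates this with a made-up label `"SF"` and an illustrative name `s3`.

```python
import pandas as pd

population = {"NYC": 8550000, "LA": 4000000, "CHI": 3000000, "HOU": 2500000}

# "SF" is not a key of the dictionary, so its value becomes NaN.
s3 = pd.Series(population, index=["LA", "SF"])

print(s3)
```

Because `NaN` is a floating-point value, the resulting `Series` has dtype `float64` rather than `int64`.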

---
title: "Create A Series Object With A Scalar"
description: "Initialize a Series object using a scalar and a list of index labels such that all rows have the same value."
tags: Pandas
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [1]:
import pandas as pd

In [None]:
# all rows will have the scalar as their value
s = pd.Series(scalar, list_of_index_labels)

# Example&

In [22]:
meaning = pd.Series(42, ["life", "universe", "everything"])
meaning

life          42
universe      42
everything    42
dtype: int64

---
title: "Add a Name to a Series"
description: "Adding a name to a pandas Series object."
tags: Pandas
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [1]:
import pandas as pd

In [None]:
# add name 'my_series' to s
s = pd.Series(values, name = "my_series")

# Example&

In [22]:
s = pd.Series([100, 38], index=["tim", "jimmy"], name="quality of hair")
s

tim      100
jimmy     38
Name: quality of hair, dtype: int64
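
The name can also be read or changed after creation. The sketch below shows a few common ways to work with it; the new name `"hair score"` and the variables `renamed` and `frame` are only illustrative.

```python
import pandas as pd

s = pd.Series([100, 38], index=["tim", "jimmy"], name="quality of hair")

print(s.name)                      # the name is stored on the Series

renamed = s.rename("hair score")   # returns a renamed copy; s is unchanged
frame = s.to_frame()               # the name becomes the column label

print(renamed.name)
print(frame.columns.tolist())
```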

---
title: "Create A DataFrame from a Dictionary of Series."
description: "A DataFrame object represents a spreadsheet, with cell values, column names and row index labels. Initialize a DataFrame object from a dictionary of Series."
tags: Pandas, Dictionaries
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

In [None]:
# the keys of my_dictionary become the column names
# the index labels of each Series become the row labels and the Series values become the entries
df = pd.DataFrame(my_dictionary)

# Example&

## Create a Dictionary of Series

In [23]:
people_dict = {
    "weight": pd.Series([68, 83, 112], index=["alice", "bob", "charles"]),
    "birthyear": pd.Series([1984, 1985, 1992], index=["bob", "alice", "charles"], name="year"),
    "children": pd.Series([0, 3], index=["charles", "bob"]),
    "hobby": pd.Series(["Biking", "Dancing"], index=["alice", "bob"]),
}

## Create the DataFrame

In [24]:
people = pd.DataFrame(people_dict)
people

Unnamed: 0,weight,birthyear,children,hobby
alice,68,1985,,Biking
bob,83,1984,3.0,Dancing
charles,112,1992,0.0,


A few things to note:
* the `Series` were automatically aligned based on their index,
* missing values are represented as `NaN`,
* `Series` names are ignored (the name `"year"` was dropped),
* `DataFrame`s are displayed nicely in Jupyter notebooks, woohoo!
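
The first two points can be checked directly. This is a minimal sketch using a subset of the columns from the example above.

```python
import pandas as pd

people_dict = {
    "weight": pd.Series([68, 83, 112], index=["alice", "bob", "charles"]),
    "birthyear": pd.Series([1984, 1985, 1992], index=["bob", "alice", "charles"], name="year"),
    "children": pd.Series([0, 3], index=["charles", "bob"]),
}
people = pd.DataFrame(people_dict)

# Values follow their labels, not their position in the original Series.
print(people.loc["bob", "birthyear"])

# alice has no entry in the "children" Series, so her cell is NaN.
print(people["children"].isna())
```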

## You can access columns pretty much as you would expect

They are returned as `Series` objects:

In [52]:
people["birthyear"]

alice      1985
bob        1984
charles    1992
Name: birthyear, dtype: int64

You can also get multiple columns at once:

In [53]:
people[["birthyear", "hobby"]]

Unnamed: 0,birthyear,hobby
alice,1985,Biking
bob,1984,Dancing
charles,1992,


# Example&

If you pass a list of columns and/or index row labels to the `DataFrame` constructor, it will guarantee that these columns and/or rows will exist, in that order, and no other column/row will exist. For example:

In [54]:
d2 = pd.DataFrame(
        people_dict,
        columns=["birthyear", "weight", "height"],
        index=["bob", "alice", "eugene"]
     )
d2

Unnamed: 0,birthyear,weight,height
bob,1984.0,83.0,
alice,1985.0,68.0,
eugene,,,


---
title: "Create A DataFrame from a List of Lists."
description: "A DataFrame object represents a spreadsheet, with cell values, column names and row index labels. Initialize a DataFrame object from a list of lists."
tags: Pandas, Lists
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd
list_of_lists = [ [], [], ... ]

In [None]:
df = pd.DataFrame(list_of_lists)

In [None]:
# add column names and/or index labels
df = pd.DataFrame(
        list_of_lists,
        columns = ["name", "your", "columns", "here"],
        index = ["and/or", "name", "your", "index_labels"]
)

# Example&

## Create a list of lists that is our data

In [None]:
values = [
            [1985, 1, "Biking", 68],
            [1984, 3, "Dancing", 83],
            [1992, 0, "Skiing", 112]
         ]

## Create DataFrame from `values` with column names and index labels

In [27]:
d3 = pd.DataFrame(
        values,
        columns=["birthyear", "children", "hobby", "weight"],
        index=["alice", "bob", "charles"]
     )
d3

Unnamed: 0,birthyear,children,hobby,weight
alice,1985,1,Biking,68
bob,1984,3,Dancing,83
charles,1992,0,Skiing,112


---
title: "Create A DataFrame from a Dictionary"
description: "A DataFrame object represents a spreadsheet, with cell values, column names and row index labels. Initialize a DataFrame object from a dictionary."
tags: Pandas, Dictionaries
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

## By default, the keys become the columns

In [None]:
pd.DataFrame.from_dict(data)

## Specify `orient = 'index'` to use keys as the rows 

In [None]:
pd.DataFrame.from_dict(data, orient = 'index')
# when using 'index' orientation, you can rename columns
pd.DataFrame.from_dict(data, 
                       orient = 'index',
                       columns = ['name_1', 'name_2', '...'])

# Example&

# Keys as columns

## With index labels

In [128]:
import numpy as np  # needed for np.nan below
people_dict = {
    "weight": [68, 83, 112],
    "birthyear": [1985, 1984, 1992],
    "children": [np.nan, 3, 0],
    "hobby": ["Biking", "Dancing", np.nan],
}
people = pd.DataFrame(people_dict, index=["alice", "bob", "charles"])
people

Unnamed: 0,weight,birthyear,children,hobby
alice,68,1985,,Biking
bob,83,1984,3.0,Dancing
charles,112,1992,0.0,


## Without index labels

In [30]:
data = {"model" : ['s', '3', 'x', 'y'], "seats" : [2,4,6,8], "price" : [100,200,300,400]}
teslas = pd.DataFrame.from_dict(data)
teslas

Unnamed: 0,model,seats,price
0,s,2,100
1,3,4,200
2,x,6,300
3,y,8,400


# Example&

## `orient = 'index'`

In [34]:
data = {2018: [100, 200, 250, 300], 2019 : [115, 200, 230, 370]}
money = pd.DataFrame.from_dict(data, orient = 'index')
money

Unnamed: 0,0,1,2,3
2018,100,200,250,300
2019,115,200,230,370


In [35]:
money = pd.DataFrame.from_dict(data, 
                               orient = 'index',
                               columns = ['food', 'internet', 'books', 'rent'])
money

Unnamed: 0,food,internet,books,rent
2018,100,200,250,300
2019,115,200,230,370
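
To see what `orient` changes, it can help to build both orientations from the same dictionary and compare them. This is a small sketch; the names `by_rows` and `by_cols` are only illustrative.

```python
import pandas as pd

data = {2018: [100, 200, 250, 300], 2019: [115, 200, 230, 370]}

by_rows = pd.DataFrame.from_dict(data, orient="index")  # keys become row labels
by_cols = pd.DataFrame.from_dict(data)                  # keys become column names

print(by_rows.shape, by_cols.shape)

# Transposing one orientation gives the same values as the other.
print(by_cols.T.values.tolist() == by_rows.values.tolist())
```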


---
title: "Multi-Index DataFrame"
description: "A MultiIndex object is a way to create a hierarchical structure of columns and/or rows. You can think of a MultiIndex as a list of unique tuples.

Use multi-indexing to create a hierarchy of dimensions for your data. If all columns are tuples of the same size, then they are understood as a multi-index. The same goes for row index labels."
tags: Pandas
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html

# Example&

If all columns are tuples of the same size, then they are understood as a multi-index. The same goes for row index labels. For example:

In [59]:
import numpy as np  # needed for np.nan below
df = pd.DataFrame(
  {
    ("public", "birthyear"):
        {("Paris","alice"):1985, ("Paris","bob"): 1984, ("London","charles"): 1992},
    ("public", "hobby"):
        {("Paris","alice"):"Biking", ("Paris","bob"): "Dancing"},
    ("private", "weight"):
        {("Paris","alice"):68, ("Paris","bob"): 83, ("London","charles"): 112},
    ("private", "children"):
        {("Paris", "alice"):np.nan, ("Paris","bob"): 3, ("London","charles"): 0}
  }
)
df

Unnamed: 0_level_0,Unnamed: 1_level_0,private,private,public,public
Unnamed: 0_level_1,Unnamed: 1_level_1,children,weight,birthyear,hobby
London,charles,0.0,112,1992,
Paris,alice,,68,1985,Biking
Paris,bob,3.0,83,1984,Dancing


You can now get a `DataFrame` containing all the `"public"` columns very simply:

In [60]:
df["public"]

Unnamed: 0,Unnamed: 1,birthyear,hobby
London,charles,1992,
Paris,alice,1985,Biking
Paris,bob,1984,Dancing


In [61]:
df["public", "hobby"]  # Same result as df["public"]["hobby"]

London  charles        NaN
Paris   alice       Biking
        bob        Dancing
Name: (public, hobby), dtype: object
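
Rows with a multi-index can be selected in a similar way with `loc`. The sketch below uses a smaller, hypothetical `weights` frame (built with `pd.MultiIndex.from_tuples`) rather than the full example, to keep it self-contained.

```python
import pandas as pd

rows = pd.MultiIndex.from_tuples(
    [("London", "charles"), ("Paris", "alice"), ("Paris", "bob")]
)
weights = pd.DataFrame({"weight": [112, 68, 83]}, index=rows)

paris = weights.loc["Paris"]                    # all rows under "Paris"
bob = weights.loc[("Paris", "bob"), "weight"]   # one cell: full row tuple + column

print(paris)
print(bob)
```

Selecting on the outer level alone returns a `DataFrame` indexed by the remaining inner level.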

## Dropping a level
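Let's look at the `DataFrame` again, this time under the name `d5` used in the rest of this section (a copy, so `df` itself stays unchanged):

In [62]:
d5 = df.copy()
d5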

Unnamed: 0_level_0,Unnamed: 1_level_0,private,private,public,public
Unnamed: 0_level_1,Unnamed: 1_level_1,children,weight,birthyear,hobby
London,charles,0.0,112,1992,
Paris,alice,,68,1985,Biking
Paris,bob,3.0,83,1984,Dancing


There are two levels of columns, and two levels of indices. We can drop a column level by calling `droplevel()` (the same goes for indices):

In [63]:
d5.columns = d5.columns.droplevel(level = 0)
d5

Unnamed: 0,Unnamed: 1,children,weight,birthyear,hobby
London,charles,0.0,112,1992,
Paris,alice,,68,1985,Biking
Paris,bob,3.0,83,1984,Dancing


## Transposing
You can swap columns and indices using the `T` attribute:

In [64]:
d6 = d5.T
d6

Unnamed: 0_level_0,London,Paris,Paris
Unnamed: 0_level_1,charles,alice,bob
children,0.0,,3
weight,112.0,68,83
birthyear,1992.0,1985,1984
hobby,,Biking,Dancing


## Stacking and unstacking levels
Calling the `stack()` method will push the lowest column level after the lowest index:

In [65]:
d7 = d6.stack()
d7

Unnamed: 0,Unnamed: 1,London,Paris
children,bob,,3
children,charles,0.0,
weight,alice,,68
weight,bob,,83
weight,charles,112.0,
birthyear,alice,,1985
birthyear,bob,,1984
birthyear,charles,1992.0,
hobby,alice,,Biking
hobby,bob,,Dancing


Note that many `NaN` values appeared. This makes sense because many new combinations did not exist before (e.g., there was no `bob` in `London`).

Calling `unstack()` will do the reverse, once again creating many `NaN` values.

In [66]:
d8 = d7.unstack()
d8

Unnamed: 0_level_0,London,London,London,Paris,Paris,Paris
Unnamed: 0_level_1,alice,bob,charles,alice,bob,charles
children,,,0.0,,3,
weight,,,112.0,68,83,
birthyear,,,1992.0,1985,1984,
hobby,,,,Biking,Dancing,


If we call `unstack` again, we end up with a `Series` object:

In [67]:
d9 = d8.unstack()
d9

London  alice    children        None
                 weight           NaN
                 birthyear        NaN
                 hobby            NaN
        bob      children         NaN
                 weight           NaN
                 birthyear        NaN
                 hobby            NaN
        charles  children           0
                 weight           112
                 birthyear       1992
                 hobby           None
Paris   alice    children        None
                 weight            68
                 birthyear       1985
                 hobby         Biking
        bob      children           3
                 weight            83
                 birthyear       1984
                 hobby        Dancing
        charles  children         NaN
                 weight           NaN
                 birthyear        NaN
                 hobby           None
dtype: object

The `stack()` and `unstack()` methods let you select the `level` to stack/unstack. You can even stack/unstack multiple levels at once:

In [68]:
d10 = d9.unstack(level = (0,1))
d10

Unnamed: 0_level_0,London,London,London,Paris,Paris,Paris
Unnamed: 0_level_1,alice,bob,charles,alice,bob,charles
children,,,0.0,,3,
weight,,,112.0,68,83,
birthyear,,,1992.0,1985,1984,
hobby,,,,Biking,Dancing,


## Most methods return modified copies
As you may have noticed, the `stack()` and `unstack()` methods do not modify the object they apply to. Instead, they work on a copy and return that copy. This is true of most methods in pandas.
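
This is easy to verify on a small case. The sketch below uses a hypothetical `scores` frame (not part of the example above) and checks that the original is unchanged after stacking and unstacking.

```python
import pandas as pd

scores = pd.DataFrame({"math": [90, 80], "art": [70, 60]}, index=["alice", "bob"])

stacked = scores.stack()       # new Series with a two-level index
restored = stacked.unstack()   # new DataFrame built from that Series

print(scores.shape)                      # the original is untouched
print(stacked.loc[("alice", "math")])
print(restored.loc["bob", "art"])
```

Depending on your pandas version, `stack()` may emit a `FutureWarning` about a changed implementation; the values selected here are not affected by it.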

---
title: "Access Rows of a DataFrame By Index Label (loc)"
description: "The loc attribute of a pandas DataFrame lets you access rows by index label. For a single label, the result is a Series object whose index labels are the DataFrame's column names. For a list of labels, the result is a DataFrame."
tags: Pandas
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

In [None]:
# single label for a row => Series
df.loc['row_label']

In [None]:
# list of labels for many rows => DataFrame
df.loc[['row_1', 'row_2', '...']]

# Example&

## Create example DataFrame

In [36]:
import numpy as np  # needed for np.nan below
people_dict = {
    "weight": [68, 83, 112],
    "birthyear": [1985, 1984, 1992],
    "children": [np.nan, 3, 0],
    "hobby": ["Biking", "Dancing", np.nan],
}
people = pd.DataFrame(people_dict, index=["alice", "bob", "charles"])
people

Unnamed: 0,weight,birthyear,children,hobby
alice,68,1985,,Biking
bob,83,1984,3.0,Dancing
charles,112,1992,0.0,


## Get one row

In [38]:
people.loc['alice']

weight           68
birthyear      1985
children        NaN
hobby        Biking
Name: alice, dtype: object

## Get multiple rows

In [39]:
people.loc[['bob', 'charles']]

Unnamed: 0,weight,birthyear,children,hobby
bob,83,1984,3.0,Dancing
charles,112,1992,0.0,
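
`loc` also accepts a column selector after the row selector, and it accepts label slices. The following sketch shows both on the same example data; the names `cell`, `block` and `span` are only illustrative.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame(
    {
        "weight": [68, 83, 112],
        "birthyear": [1985, 1984, 1992],
        "children": [np.nan, 3, 0],
        "hobby": ["Biking", "Dancing", np.nan],
    },
    index=["alice", "bob", "charles"],
)

cell = people.loc["bob", "weight"]                          # row label, column label
block = people.loc[["alice", "bob"], ["weight", "hobby"]]   # lists on both axes
span = people.loc["alice":"bob"]                            # label slice includes "bob"

print(cell)
print(block)
print(span)
```

Unlike integer slices, a label slice includes its end label.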


---
title: "Access Rows of a DataFrame By Integer Location (iloc)"
description: "The iloc attribute of a pandas DataFrame lets you access rows by integer location. For a single location, the result is a Series object whose index labels are the DataFrame's column names. You can also get a DataFrame by passing a slice or a list of locations."
tags: Pandas
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

In [None]:
# single row from its index number => Series
df.iloc[index]

In [None]:
# slice or list of indices for many rows => DataFrame
df.iloc[start : end]

# Example&

## Create example DataFrame

In [36]:
import numpy as np  # needed for np.nan below
people_dict = {
    "weight": [68, 83, 112],
    "birthyear": [1985, 1984, 1992],
    "children": [np.nan, 3, 0],
    "hobby": ["Biking", "Dancing", np.nan],
}
people = pd.DataFrame(people_dict, index=["alice", "bob", "charles"])
people

Unnamed: 0,weight,birthyear,children,hobby
alice,68,1985,,Biking
bob,83,1984,3.0,Dancing
charles,112,1992,0.0,


## Get one row

In [40]:
people.iloc[0]

weight           68
birthyear      1985
children        NaN
hobby        Biking
Name: alice, dtype: object

## Get multiple rows

In [42]:
people.iloc[1:3]

Unnamed: 0,weight,birthyear,children,hobby
bob,83,1984,3.0,Dancing
charles,112,1992,0.0,
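
Besides slices, `iloc` accepts negative locations, lists of locations, and a column location after the row location. This is a short sketch on the same example data; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame(
    {
        "weight": [68, 83, 112],
        "birthyear": [1985, 1984, 1992],
        "children": [np.nan, 3, 0],
        "hobby": ["Biking", "Dancing", np.nan],
    },
    index=["alice", "bob", "charles"],
)

last_row = people.iloc[-1]     # negative locations count from the end
picked = people.iloc[[0, 2]]   # list of locations => DataFrame
cell = people.iloc[0, 1]       # row location 0, column location 1

print(last_row.name)
print(picked)
print(cell)
```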


---
title: "Filter Rows of DataFrame with a Boolean Expression"
description: "You can pass a boolean array to get the matching rows or filter the rows of a DataFrame with a boolean expression. Both will return DataFrames that are subsets of the original dataframe."
tags: Pandas
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

In [None]:
# get Series of True and False
df["column"] < 10

In [None]:
# get rows of dataframe when condition is True
df[ df["column"] != "Other" ]

In [None]:
import numpy as np
# get rows when True
df[np.array([True, False, True])]

# Example&

## Create example DataFrame

In [36]:
people_dict = {
    "weight": [68, 83, 112],
    "birthyear": [1985, 1984, 1992],
    "children": [np.nan, 3, 0],
    "hobby": ["Biking", "Dancing", np.nan],
}
people = pd.DataFrame(people_dict, index=["alice", "bob", "charles"])
people

Unnamed: 0,weight,birthyear,children,hobby
alice,68,1985,,Biking
bob,83,1984,3.0,Dancing
charles,112,1992,0.0,


## Boolean array to get the matching rows

In [46]:
people[np.array([True, False, True])]

Unnamed: 0,weight,birthyear,children,hobby
alice,68,1985,,Biking
charles,112,1992,0.0,


## Boolean expressions to get matching rows

In [47]:
people["birthyear"] < 1990

alice       True
bob         True
charles    False
Name: birthyear, dtype: bool

In [48]:
people[people["birthyear"] < 1990]

Unnamed: 0,weight,birthyear,children,hobby
alice,68,1985,,Biking
bob,83,1984,3.0,Dancing


In [52]:
# Fun Fact: NaN != NaN
people[people["hobby"] == people["hobby"]]

Unnamed: 0,weight,birthyear,children,hobby
alice,68,1985,,Biking
bob,83,1984,3.0,Dancing
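
Conditions can be combined with `&` (and) and `|` (or), as long as each condition is wrapped in parentheses. The sketch below also uses `notna()`, which is a more explicit way to express the `NaN != NaN` trick; the names `both` and `has_hobby` are only illustrative.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame(
    {
        "weight": [68, 83, 112],
        "birthyear": [1985, 1984, 1992],
        "children": [np.nan, 3, 0],
        "hobby": ["Biking", "Dancing", np.nan],
    },
    index=["alice", "bob", "charles"],
)

# Parentheses are required because & binds more tightly than < and >.
both = people[(people["birthyear"] < 1990) & (people["weight"] > 70)]

# Keep only the rows whose hobby is not missing.
has_hobby = people[people["hobby"].notna()]

print(both)
print(has_hobby)
```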


---
title: "Add Column to an Existing DataFrame"
description: "Add a new column to an existing pandas dataframe using a Series or a list. When adding a new column, it is added at the end (on the right) by default. You can also insert a column anywhere else using the insert() method."
tags: Pandas, Lists
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

In [None]:
# add new column to existing df from list
# len(df) *must ==* len(list_of_values)
df['new_column'] = list_of_values

In [None]:
# use insert to add somewhere else than at the end
df.insert(location_index, 'new_column', list_of_values)

# Example&

## Create example DataFrame

In [71]:
import numpy as np  # needed for np.nan below
people_dict = {
    "weight": [68, 83, 112],
    "birthyear": [1985, 1984, 1992],
    "children": [np.nan, 3, 0],
    "hobby": ["Biking", "Dancing", np.nan],
}
people = pd.DataFrame(people_dict, index=["alice", "bob", "charles"])
people

Unnamed: 0,weight,birthyear,children,hobby
alice,68,1985,,Biking
bob,83,1984,3.0,Dancing
charles,112,1992,0.0,


## Declaring new column

In [72]:
people["home"] = ['Boston', 'New York City', 'San Diego']
people

Unnamed: 0,weight,birthyear,children,hobby,home
alice,68,1985,,Biking,Boston
bob,83,1984,3.0,Dancing,New York City
charles,112,1992,0.0,,San Diego


When you add a new column from a list, it must have the same number of rows as the `DataFrame`. When you add a `Series`, it is aligned on the index instead: missing rows are filled with `NaN`, and extra rows are ignored:

In [73]:
people["pets"] = pd.Series({"bob": 0, "charles": 5, "eugene":1})  # alice is missing, eugene is ignored
people

Unnamed: 0,weight,birthyear,children,hobby,home,pets
alice,68,1985,,Biking,Boston,
bob,83,1984,3.0,Dancing,New York City,0.0
charles,112,1992,0.0,,San Diego,5.0
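
For comparison, a plain list has no labels to align on, so a list of the wrong length is rejected. This is a minimal sketch on a reduced version of the example data; the flag `raised` is only there to make the outcome easy to check.

```python
import pandas as pd

people = pd.DataFrame({"weight": [68, 83, 112]}, index=["alice", "bob", "charles"])

raised = False
try:
    people["home"] = ["Boston", "New York City"]   # only 2 values for 3 rows
except ValueError as e:
    raised = True
    print("Value error:", e)

print(people.columns.tolist())
```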


# Example&

## Using the `insert` method 

In [74]:
people = pd.DataFrame(people_dict, index=["alice", "bob", "charles"])
people

Unnamed: 0,weight,birthyear,children,hobby
alice,68,1985,,Biking
bob,83,1984,3.0,Dancing
charles,112,1992,0.0,


In [75]:
people.insert(1, "height", [172, 181, 185])
people

Unnamed: 0,weight,height,birthyear,children,hobby
alice,68,172,1985,,Biking
bob,83,181,1984,3.0,Dancing
charles,112,185,1992,0.0,


---
title: "Add Column to an Existing DataFrame Based off of Values in an Existing Column"
description: "Add a new column to an existing pandas dataframe. Use a Series, a list, or create a new column based off the values in an existing column."
tags: Pandas, Lists
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

In [None]:
# any operation will be performed element-wise
df['new_column'] = df['old_column'] / 100

# Example&

## Create example DataFrame

In [76]:
import numpy as np  # needed for np.nan below
people_dict = {
    "weight": [68, 83, 112],
    "birthyear": [1985, 1984, 1992],
    "children": [np.nan, 3, 0],
    "hobby": ["Biking", "Dancing", np.nan],
}
people = pd.DataFrame(people_dict, index=["alice", "bob", "charles"])
people

Unnamed: 0,weight,birthyear,children,hobby
alice,68,1985,,Biking
bob,83,1984,3.0,Dancing
charles,112,1992,0.0,


## New Column Based off of Existing column

In [77]:
people["age"] = 2019 - people["birthyear"]  # adds a new column "age"
people["over 30"] = people["age"] > 30      # adds another column "over 30"

people

Unnamed: 0,weight,birthyear,children,hobby,age,over 30
alice,68,1985,,Biking,34,True
bob,83,1984,3.0,Dancing,35,True
charles,112,1992,0.0,,27,False


---
title: "Apply Any Function to an Existing Column"
description: "Using the pandas apply method, you can create a new column by applying any function element-wise to the values of an existing column. Add a column to a DataFrame based on an existing one."
tags: Pandas
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

In [None]:
# using a lambda function
df['new_column'] = df['col'].apply(lambda x : x/100)

In [None]:
# using a function you built
def my_func(x):
    return x*100

df['new_column'] = df['col'].apply(my_func)

# Example&

## Create example DataFrame

In [74]:
import numpy as np  # needed for np.nan below
people_dict = {
    "weight": [68, 83, 112],
    "birthyear": [1985, 1984, 1992],
    "children": [np.nan, 3, 0],
    "hobby": ["Biking", "Dancing", np.nan],
}
people = pd.DataFrame(people_dict, index=["alice", "bob", "charles"])
people

Unnamed: 0,weight,birthyear,children,hobby
alice,68,1985,,Biking
bob,83,1984,3.0,Dancing
charles,112,1992,0.0,


## Create function

In [75]:
def is_it_cool(hobby):
    # fun fact: NaN != NaN
    if hobby != hobby:
        return "that's lame"
    return "that's cool"

## Apply functions

In [77]:
people["hobby_review"] = people['hobby'].apply(is_it_cool)
people["weight_lbs"] = people['weight'].apply(lambda x : 2.2*x)

In [78]:
people

Unnamed: 0,weight,birthyear,children,hobby,hobby_review,weight_lbs
alice,68,1985,,Biking,that's cool,149.6
bob,83,1984,3.0,Dancing,that's cool,182.6
charles,112,1992,0.0,,that's lame,246.4
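
For simple cases like these, the same columns can often be computed without `apply`, by operating on the whole column at once. The sketch below is one possible alternative, not the method used in the example; it relies on `np.where` and `Series.isna`.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame(
    {"weight": [68, 83, 112], "hobby": ["Biking", "Dancing", np.nan]},
    index=["alice", "bob", "charles"],
)

# Same values as .apply(lambda x: 2.2 * x), computed on the whole column.
people["weight_lbs"] = people["weight"] * 2.2

# isna() is a more explicit NaN test than the hobby != hobby trick.
people["hobby_review"] = np.where(people["hobby"].isna(), "that's lame", "that's cool")

print(people)
```

`apply` remains the more flexible choice when the logic does not map onto column-wise operations.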


---
title: "Apply Any Function to an Entire Row"
description: "Using the pandas apply method with axis = 1, you can create a new column by passing each entire row into a function, so that the new values are based on several columns of that row. Add a column to a DataFrame based on many columns."
tags: Pandas
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

In [None]:
import numpy as np
# using a lambda function
df['new_column'] = df.apply(
            lambda row : row['col_1'] * 100, 
            axis = 1)

In [None]:
# using a function you built
def my_func(row):
    return (row['col_1'] + row['col_2']) / 2

df['new_column'] = df.apply(my_func, axis = 1)

# Example&

## Create example DataFrame

In [82]:
people_dict = {
    "height": [172, 181, 185],
    "weight": [68, 83, 112],
    "birthyear": [1985, 1984, 1992],
    "children": [1, 3, 0],
    "hobby": ["Biking", "Dancing", np.nan],
}
people = pd.DataFrame(people_dict, index=["alice", "bob", "charles"])
people

Unnamed: 0,height,weight,birthyear,children,hobby
alice,172,68,1985,1,Biking
bob,181,83,1984,3,Dancing
charles,185,112,1992,0,


## Create function

In [83]:
def bmi(row):
    return row["weight"] / (row["height"] / 100) ** 2

## Apply functions

In [86]:
people['body_mass_index'] = people.apply(bmi, axis = 1)
people['meaningless_calculation'] = people.apply(
        lambda row : (row['height'] + row['children']) % 7,
        axis = 1)

In [87]:
people

Unnamed: 0,height,weight,birthyear,children,hobby,body_mass_index,meaningless_calculation
alice,172,68,1985,1,Biking,22.985398,5
bob,181,83,1984,3,Dancing,25.335002,2
charles,185,112,1992,0,,32.724617,3
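
When the row function only does arithmetic on columns, the same result can usually be obtained from the columns directly. The sketch below compares the two approaches for the BMI calculation; the names `row_wise` and `column_wise` are only illustrative.

```python
import pandas as pd

people = pd.DataFrame(
    {"height": [172, 181, 185], "weight": [68, 83, 112]},
    index=["alice", "bob", "charles"],
)

def bmi(row):
    return row["weight"] / (row["height"] / 100) ** 2

row_wise = people.apply(bmi, axis=1)                               # one call per row
column_wise = people["weight"] / (people["height"] / 100) ** 2     # whole columns at once

# Largest absolute difference between the two results.
print((row_wise - column_wise).abs().max())
```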


---
title: "Drop a Column from a DataFrame"
description: "Remove a column or many columns from a pandas dataframe."
tags: Pandas
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

In [None]:
# just remove
del df['column']

In [None]:
# remove and return
series = df.pop('column')

To drop multiple columns, use the `drop()` method with either `axis = 1` or `columns = ` (it does the same thing). Both return a new dataframe without the unwanted columns.

In [None]:
df.drop(['col_1', 'col_2', '...'], axis = 1)
# overwrite the original df
df = df.drop(columns = ['col_1', 'col_2', '...'])

# Example&

## Create example DataFrame

In [93]:
import numpy as np  # needed for np.nan below
people_dict = {
    "weight": [68, 83, 112],
    "birthyear": [1985, 1984, 1992],
    "children": [np.nan, 3, 0],
    "hobby": ["Biking", "Dancing", np.nan],
}
people = pd.DataFrame(people_dict, index=["alice", "bob", "charles"])
people

Unnamed: 0,weight,birthyear,children,hobby
alice,68,1985,,Biking
bob,83,1984,3.0,Dancing
charles,112,1992,0.0,


In [94]:
birthyears = people.pop("birthyear")
birthyears

alice      1985
bob        1984
charles    1992
Name: birthyear, dtype: int64

In [95]:
del people["children"]
people

Unnamed: 0,weight,hobby
alice,68,Biking
bob,83,Dancing
charles,112,


In [96]:
people.drop(columns = ['weight', 'hobby'])

alice
bob
charles
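
Unlike `del` and `pop()`, `drop()` leaves the original `DataFrame` untouched. The sketch below illustrates this on a reduced version of the example data, and also shows the `errors="ignore"` option for labels that may not exist (the `"height"` column here is deliberately absent).

```python
import pandas as pd

people = pd.DataFrame(
    {"weight": [68, 83, 112], "hobby": ["Biking", "Dancing", None]},
    index=["alice", "bob", "charles"],
)

trimmed = people.drop(columns=["hobby"])
print(people.columns.tolist())    # the original still has both columns
print(trimmed.columns.tolist())

# errors="ignore" skips labels that are not present instead of raising KeyError.
safe = people.drop(columns=["hobby", "height"], errors="ignore")
print(safe.columns.tolist())
```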


---
title: "Drop a Row from a DataFrame"
description: "Remove a row or many rows from a pandas dataframe."
tags: Pandas
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

In [None]:
# drop by index label
df.drop('row_label')
# or
df.drop(['row_label_1', 'row_label_2', '...'])

In [None]:
# drop by integer index label (the default labels 0, 1, 2, ...)
df.drop(index)
# or
df.drop([index_1, index_2, ...])

# Example&

## Create example DataFrame

In [103]:
import numpy as np  # needed for np.nan below
people_dict = {
    "weight": [68, 83, 112],
    "birthyear": [1985, 1984, 1992],
    "children": [np.nan, 3, 0],
    "hobby": ["Biking", "Dancing", np.nan],
}
people = pd.DataFrame(people_dict, index=["alice", "bob", "charles"])
people

Unnamed: 0,weight,birthyear,children,hobby
alice,68,1985,,Biking
bob,83,1984,3.0,Dancing
charles,112,1992,0.0,


## Dropping with index label

In [105]:
people.drop('charles')

Unnamed: 0,weight,birthyear,children,hobby
alice,68,1985,,Biking
bob,83,1984,3.0,Dancing


## Dropping with default integer index labels

In [106]:
people = pd.DataFrame(people_dict)
people

Unnamed: 0,weight,birthyear,children,hobby
0,68,1985,,Biking
1,83,1984,3.0,Dancing
2,112,1992,0.0,


In [107]:
people.drop([1,2])

Unnamed: 0,weight,birthyear,children,hobby
0,68,1985,,Biking
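
Keep in mind that `drop()` always works with labels, so `[1, 2]` above refers to the labels `1` and `2`, not to positions. If the index holds other labels and you want to drop by position, one possible approach is to look the labels up first, as in this sketch (the name `without_last_two` is only illustrative).

```python
import pandas as pd

people = pd.DataFrame({"weight": [68, 83, 112]}, index=["alice", "bob", "charles"])

# people.index[[1, 2]] translates positions 1 and 2 into the labels "bob" and "charles".
without_last_two = people.drop(people.index[[1, 2]])

print(without_last_two)
```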


---
title: "Assign Columns to a DataFrame"
description: "Use the assign method to create a new DataFrame object with all the data from a previous DataFrame in addition to new columns. Existing columns that are re-assigned are overwritten by this method. This method lets you assign multiple new columns in one step as well as create new columns that are functions of the values in other columns."
tags: Pandas
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Example&

## Create example DataFrame

In [110]:
import numpy as np  # needed for np.nan below
people_dict = {
    "weight": [68, 83, 112],
    "height": [172, 181, 185],
    "birthyear": [1985, 1984, 1992],
    "children": [np.nan, 3, 0],
    "hobby": ["Biking", "Dancing", np.nan],
}
people = pd.DataFrame(people_dict, index=["alice", "bob", "charles"])
people

Unnamed: 0,weight,height,birthyear,children,hobby
alice,68,172,1985,,Biking
bob,83,181,1984,3.0,Dancing
charles,112,185,1992,0.0,


## Assign some new columns

In [112]:
people.assign(
    body_mass_index = people["weight"] / (people["height"] / 100) ** 2,
    has_kids = people["children"] > 0
)

Unnamed: 0,weight,height,birthyear,children,hobby,body_mass_index,has_kids
alice,68,172,1985,,Biking,22.985398,False
bob,83,181,1984,3.0,Dancing,25.335002,True
charles,112,185,1992,0.0,,32.724617,False


Note that you cannot access columns created within the same assignment:

In [113]:
try:
    people.assign(
        body_mass_index = people["weight"] / (people["height"] / 100) ** 2,
        overweight = people["body_mass_index"] > 25
    )
except KeyError as e:
    print("Key error:", e)

Key error: 'body_mass_index'


But fear not, there is a simple solution. You can pass a function to the `assign()` method (typically a `lambda` function), and this function will be called with the `DataFrame` as a parameter:

In [114]:
(people
     .assign(body_mass_index = lambda df: df["weight"] / (df["height"] / 100) ** 2)
     .assign(overweight = lambda df: df["body_mass_index"] > 25)
)

Unnamed: 0,weight,height,birthyear,children,hobby,body_mass_index,overweight
alice,68,172,1985,,Biking,22.985398,False
bob,83,181,1984,3.0,Dancing,25.335002,True
charles,112,185,1992,0.0,,32.724617,True


Problem solved!
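
The two chained calls can usually be folded into one. Assuming pandas 0.23 or later running on Python 3.6 or later, the keyword arguments of `assign()` are evaluated in order, so a later `lambda` can see a column created by an earlier one. A small sketch with a reduced `people` table:

```python
import pandas as pd

people = pd.DataFrame(
    {"weight": [68, 83, 112], "height": [172, 181, 185]},
    index=["alice", "bob", "charles"],
)

# Both columns are created in one assign() call: the second lambda
# receives a DataFrame that already contains body_mass_index.
result = people.assign(
    body_mass_index=lambda df: df["weight"] / (df["height"] / 100) ** 2,
    overweight=lambda df: df["body_mass_index"] > 25,
)
```

As before, `people` itself is not modified; `assign()` returns a new `DataFrame`.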

---
title: "Querying A DataFrame"
description: "Query the columns of a DataFrame with a boolean expression and keep only the rows for which it holds true."
tags: Pandas
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

In [None]:
# use 'and' and 'or' to string multiple expressions together
df.query('query_expression')

In [None]:
# True: both expressions select the same rows
df.query('col_1 > col_2').equals(df[df.col_1 > df.col_2])

# Example&

## Create example DataFrame

In [110]:
people_dict = {
    "weight": [68, 83, 112],
    "height": [172, 181, 185],
    "birthyear": [1985, 1984, 1992],
    "children": [np.nan, 3, 0],
    "hobby": ["Biking", "Dancing", np.nan],
}
people = pd.DataFrame(people_dict, index=["alice", "bob", "charles"])
people

Unnamed: 0,weight,height,birthyear,children,hobby
alice,68,172,1985,,Biking
bob,83,181,1984,3.0,Dancing
charles,112,185,1992,0.0,


## Make a query

In [116]:
people.query("weight < 100")

Unnamed: 0,weight,height,birthyear,children,hobby
alice,68,172,1985,,Biking
bob,83,181,1984,3.0,Dancing


## String together expressions with `and`, `or`

In [117]:
people.query("height > 180 and birthyear % 2 == 0")

Unnamed: 0,weight,height,birthyear,children,hobby
bob,83,181,1984,3.0,Dancing
charles,112,185,1992,0.0,
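
Query strings can also refer to ordinary Python variables by prefixing the name with `@`. The sketch below uses a reduced `people` table and a hypothetical threshold `max_weight`, and compares the result with the equivalent boolean mask:

```python
import pandas as pd

people = pd.DataFrame(
    {
        "weight": [68, 83, 112],
        "height": [172, 181, 185],
        "birthyear": [1985, 1984, 1992],
    },
    index=["alice", "bob", "charles"],
)

max_weight = 100  # hypothetical threshold, defined outside the DataFrame

# @name refers to a Python variable rather than a column
by_query = people.query("weight < @max_weight and height > 175")

# the same filter written as a boolean mask
by_mask = people[(people["weight"] < max_weight) & (people["height"] > 175)]
```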


---
title: "Sort a DataFrame by Index Label"
description: "Sort the rows of a DataFrame by their index labels."
tags: Pandas
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

By default the `sort_index` method sorts by index label and will return a sorted copy of the DataFrame. 

To modify the `df` directly, set the `inplace` argument to `True`.

In [None]:
df.sort_index(inplace = True)

# Example&

## Create example DataFrame

In [118]:
people_dict = {
    "weight": [68, 83, 112],
    "height": [172, 181, 185],
    "birthyear": [1985, 1984, 1992],
    "children": [np.nan, 3, 0],
    "hobby": ["Biking", "Dancing", np.nan],
}
people = pd.DataFrame(people_dict, index=["alice", "bob", "charles"])
people

Unnamed: 0,weight,height,birthyear,children,hobby
alice,68,172,1985,,Biking
bob,83,181,1984,3.0,Dancing
charles,112,185,1992,0.0,


### The rows are already sorted by their index label in ascending (lexicographical) order, so let's reverse the order.

In [119]:
people.sort_index(inplace = True, ascending = False)
people

Unnamed: 0,weight,height,birthyear,children,hobby
charles,112,185,1992,0.0,
bob,83,181,1984,3.0,Dancing
alice,68,172,1985,,Biking
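
If you would rather keep the original order around, leave `inplace` at its default and work with the returned copy. A minimal sketch, assuming a small table whose rows start out unsorted:

```python
import pandas as pd

people = pd.DataFrame(
    {"weight": [83, 112, 68]},
    index=["bob", "charles", "alice"],
)

# returns a sorted copy; `people` keeps its original row order
sorted_people = people.sort_index()

# reverse order, still without modifying `people`
reversed_people = people.sort_index(ascending=False)
```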


---
title: "Sort a DataFrame by Column"
description: "Sort a DataFrame by its column names."
tags: Pandas
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

By default the `sort_index` method sorts by index label and will return a sorted copy of the DataFrame. We can set `axis = 1` to sort the column names. 

To modify the `df` directly, set the `inplace` argument to `True`.

In [None]:
df.sort_index(axis = 1, inplace = True)

# Example&

## Create example DataFrame

In [120]:
people_dict = {
    "weight": [68, 83, 112],
    "height": [172, 181, 185],
    "birthyear": [1985, 1984, 1992],
    "children": [np.nan, 3, 0],
    "hobby": ["Biking", "Dancing", np.nan],
}
people = pd.DataFrame(people_dict, index=["alice", "bob", "charles"])
people

Unnamed: 0,weight,height,birthyear,children,hobby
alice,68,172,1985,,Biking
bob,83,181,1984,3.0,Dancing
charles,112,185,1992,0.0,


## Sort the columns in lexicographical order

In [121]:
people.sort_index(axis = 1)

Unnamed: 0,birthyear,children,height,hobby,weight
alice,1985,,172,Biking,68
bob,1984,3.0,181,Dancing,83
charles,1992,0.0,185,,112


---
title: "Sort a DataFrame by Values in a Column"
description: "Sort a DataFrame by the values in one particular column."
tags: Pandas
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

By default the `sort_values` method returns a sorted copy of the DataFrame. To modify the `df` directly, set the `inplace` argument to `True`. 

In [None]:
# sort by my_column
df.sort_values(by = 'my_column')

In [None]:
# sort by multiple columns
df.sort_values(by = ['col_1', 'col_2'])

To sort in reverse order, set `ascending = False` (default is `True`) <br> To put NaNs at the beginning, set `na_position = 'first'` (default is `'last'`)

# Example&

## Create example DataFrame

In [122]:
people_dict = {
    "weight": [68, 83, 112, 68, 100],
    "height": [172, 181, 185, 155, 155],
    "birthyear": [1985, 1984, 1992, 2000, 1977],
    "children": [np.nan, 3, 0, 0, 2],
    "hobby": ["Biking", "Dancing", np.nan, "Hockey", "Reading"],
}
people = pd.DataFrame(people_dict)
people

Unnamed: 0,weight,height,birthyear,children,hobby
0,68,172,1985,,Biking
1,83,181,1984,3.0,Dancing
2,112,185,1992,0.0,
3,68,155,2000,0.0,Hockey
4,100,155,1977,2.0,Reading


## Sort by height

In [123]:
people.sort_values('height')

Unnamed: 0,weight,height,birthyear,children,hobby
3,68,155,2000,0.0,Hockey
4,100,155,1977,2.0,Reading
0,68,172,1985,,Biking
1,83,181,1984,3.0,Dancing
2,112,185,1992,0.0,


## Sort by children and birthyear with na first

In [127]:
people.sort_values(['children', 'birthyear'], na_position = 'first')

Unnamed: 0,weight,height,birthyear,children,hobby
0,68,172,1985,,Biking
2,112,185,1992,0.0,
3,68,155,2000,0.0,Hockey
4,100,155,1977,2.0,Reading
1,83,181,1984,3.0,Dancing
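
When sorting by several columns, `ascending` can also be a list with one entry per column. A short sketch on a reduced version of the table above:

```python
import numpy as np
import pandas as pd

people = pd.DataFrame(
    {
        "height": [172, 181, 185, 155, 155],
        "birthyear": [1985, 1984, 1992, 2000, 1977],
        "children": [np.nan, 3, 0, 0, 2],
    }
)

# tallest first; ties on height are broken by the earliest birthyear
by_height = people.sort_values(by=["height", "birthyear"], ascending=[False, True])

# NaN goes last by default, even when sorting in descending order
by_children = people.sort_values(by="children", ascending=False)
```

Here the two rows with height 155 end up ordered by `birthyear` (1977 before 2000), and the row with no `children` value stays at the bottom.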


---
title: "Aggregate Data into Groups with Groupby"
description: "Similar to the SQL language, pandas allows you to group your data to run calculations over each group."
tags: Pandas
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

In [None]:
# split groups
grouping_object = df.groupby('my_column')

In [None]:
# split groups and apply a function
grouping_object.mean()

# Example&

## Create some example data

In [129]:
data = {
    "sept": [8, 10, 4, np.nan, 9],
    "oct": [8, 9, 11, np.nan, 10],
    "nov": [9, 10, 5, np.nan, 11],
    "dec": [np.nan, np.nan, np.nan, np.nan, np.nan],
    "hobby": ['Biking', 'Dancing', np.nan, 'Dancing', 'Biking']
}
final_grades = pd.DataFrame(data, index=["alice", "bob", "charles", 'colin', 'darwin'])

final_grades

Unnamed: 0,sept,oct,nov,dec,hobby
alice,8.0,8.0,9.0,,Biking
bob,10.0,9.0,10.0,,Dancing
charles,4.0,11.0,5.0,,
colin,,,,,Dancing
darwin,9.0,10.0,11.0,,Biking


## Create grouping object by hobby

In [130]:
grouped_grades = final_grades.groupby("hobby")
grouped_grades

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x7fc0c05d0be0>

## Compute the average grade per hobby

In [131]:
grouped_grades.mean()

Unnamed: 0_level_0,sept,oct,nov,dec
hobby,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Biking,8.5,9.0,10.0,
Dancing,10.0,9.0,10.0,


Note: the `NaN` values have simply been skipped when computing the means.
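
Rows whose grouping key is `NaN` (charles here) are left out of every group by default, which is why only two hobbies show up. The sketch below repeats the grouping on a reduced table and also uses `agg()` to compute several statistics at once:

```python
import numpy as np
import pandas as pd

final_grades = pd.DataFrame(
    {
        "sept": [8, 10, 4, np.nan, 9],
        "oct": [8, 9, 11, np.nan, 10],
        "hobby": ["Biking", "Dancing", np.nan, "Dancing", "Biking"],
    },
    index=["alice", "bob", "charles", "colin", "darwin"],
)

grouped = final_grades.groupby("hobby")

# one row per hobby; charles (hobby is NaN) belongs to no group
means = grouped[["sept", "oct"]].mean()

# several aggregations at once, one column per function
sept_summary = grouped["sept"].agg(["count", "mean", "max"])
```

The `count` column makes the skipped `NaN` visible: the Dancing group has two members but only one September grade.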

# Learn More&

There is a lot more to learn about the pandas `groupby` function. Check out the documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

---
title: "Pivot Tables"
description: "Create a spreadsheet-like statistical summary of the data as a DataFrame."
tags: Pandas
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

In [None]:
# by default the pivot table computes numerical mean
pd.pivot_table(df, index = 'my_column')

In [None]:
# specify the columns to aggregate over horizontally
pd.pivot_table(df, index = 'my_column', columns = 'col')

In [None]:
# specify the list of columns whose values will be aggregated
pd.pivot_table(df, index = 'my_column', values = ['col_1', 'col_2'])

### Other useful parameters:

- `aggfunc` (function): use an aggregation other than the mean. This can also be a list of functions, in which case the resulting table has hierarchical columns with the function names at the top.
- `margins` (default = False): add an `All` row and column that aggregate over all rows and columns (computed with `aggfunc`)
- `fill_value` (scalar): replace all missing values with this value
- `dropna` (default = True): don't include any columns that have only NaN values
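
The parameters listed above can be tried out on a tiny table. This is only a sketch: the three-row `grades` frame is made up for illustration, and `values` is passed explicitly so that the text column `month` is never aggregated:

```python
import pandas as pd

grades = pd.DataFrame(
    {
        "name": ["alice", "alice", "bob"],
        "month": ["sept", "oct", "sept"],
        "grade": [8, 9, 10],
    }
)

# a list of functions gives hierarchical columns, function names on top
stats = pd.pivot_table(grades, index="name", values="grade", aggfunc=["mean", "max"])

# bob has no grade for oct; fill_value replaces the resulting NaN
filled = pd.pivot_table(
    grades, index="name", columns="month", values="grade", fill_value=0
)

# margins=True appends an "All" row and column computed with the same aggfunc
with_totals = pd.pivot_table(
    grades, index="name", columns="month", values="grade", margins=True
)
```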

# Example&

## Create example data

In [12]:
data = {
    "sept": [8, 10, 4, 9],
    "oct": [8, 9, 11, 10],
    "nov": [9, 10, 5, 11],
}
grades = pd.DataFrame(data, index=["alice", "bob", "charles", 'darwin'])
grades = grades.stack().reset_index()
grades.columns = ["name", "month", "grade"]
grades['bonus'] = [np.nan, np.nan, np.nan, 0, np.nan, 2, 3, 3, 0, 0, 1, 0]

In [13]:
grades

Unnamed: 0,name,month,grade,bonus
0,alice,sept,8,
1,alice,oct,8,
2,alice,nov,9,
3,bob,sept,10,0.0
4,bob,oct,9,
5,bob,nov,10,2.0
6,charles,sept,4,3.0
7,charles,oct,11,3.0
8,charles,nov,5,0.0
9,darwin,sept,9,0.0


## Group by the `name` column
Now we can call the `pd.pivot_table()` function for this `DataFrame`, asking to group by the `name` column. By default, `pivot_table()` computes the mean of each numeric column:

In [14]:
pd.pivot_table(grades, index="name")

Unnamed: 0_level_0,bonus,grade
name,Unnamed: 1_level_1,Unnamed: 2_level_1
alice,,8.333333
bob,1.0,9.666667
charles,2.0,6.666667
darwin,0.333333,10.0


## Calculate max instead of mean

In [16]:
pd.pivot_table(grades, index="name", values=["grade","bonus"], aggfunc=np.max)

Unnamed: 0_level_0,bonus,grade
name,Unnamed: 1_level_1,Unnamed: 2_level_1
alice,,9
bob,2.0,10
charles,3.0,11
darwin,1.0,11


## Aggregate horizontally over `month` and `margins = True`

In [23]:
pd.pivot_table(grades, index="name", values="grade", columns="month", margins=True)

month,nov,oct,sept,All
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
alice,9,8,8,8.333333
bob,10,9,10,9.666667
charles,5,11,4,6.666667
darwin,11,10,9,10.0
All,8,9,7,8.666667


## Specify multiple index or column names and pandas will create multi-level indices

In [24]:
pd.pivot_table(grades, index=("name", "month"), margins=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,bonus,grade
name,month,Unnamed: 2_level_1,Unnamed: 3_level_1
alice,nov,,9
alice,oct,,8
alice,sept,,8
bob,nov,2.0,10
bob,oct,,9
bob,sept,0.0,10
charles,nov,0.0,5
charles,oct,3.0,11
charles,sept,3.0,4
darwin,nov,0.0,11


# Learn More&

Check out the documentation on pivot tables here to learn more: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html

---
title: "Head of a DataFrame"
description: "When dealing with large DataFrames, it is useful to get a quick overview of their content. Pandas provides the head function to preview the first n rows of a large DataFrame."
tags: Pandas, EDA
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

In [None]:
# preview first 5 rows by default
df.head()

In [None]:
# preview first n rows
df.head(n)

# Example&

## Create large example DataFrame

In [26]:
much_data_such_wow = np.fromfunction(lambda x,y: (x+y*y)%17*11, (10000, 26))
large_df = pd.DataFrame(much_data_such_wow, columns=list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"))
large_df[large_df % 16 == 0] = np.nan
large_df.insert(3,"some_text", "very nice")
large_df

Unnamed: 0,A,B,C,some_text,D,E,F,G,H,I,...,Q,R,S,T,U,V,W,X,Y,Z
0,,11.0,44.0,very nice,99.0,,88.0,22.0,165.0,143.0,...,11.0,,11.0,44.0,99.0,,88.0,22.0,165.0,143.0
1,11.0,22.0,55.0,very nice,110.0,,99.0,33.0,,154.0,...,22.0,11.0,22.0,55.0,110.0,,99.0,33.0,,154.0
2,22.0,33.0,66.0,very nice,121.0,11.0,110.0,44.0,,165.0,...,33.0,22.0,33.0,66.0,121.0,11.0,110.0,44.0,,165.0
3,33.0,44.0,77.0,very nice,132.0,22.0,121.0,55.0,11.0,,...,44.0,33.0,44.0,77.0,132.0,22.0,121.0,55.0,11.0,
4,44.0,55.0,88.0,very nice,143.0,33.0,132.0,66.0,22.0,,...,55.0,44.0,55.0,88.0,143.0,33.0,132.0,66.0,22.0,
5,55.0,66.0,99.0,very nice,154.0,44.0,143.0,77.0,33.0,11.0,...,66.0,55.0,66.0,99.0,154.0,44.0,143.0,77.0,33.0,11.0
6,66.0,77.0,110.0,very nice,165.0,55.0,154.0,88.0,44.0,22.0,...,77.0,66.0,77.0,110.0,165.0,55.0,154.0,88.0,44.0,22.0
7,77.0,88.0,121.0,very nice,,66.0,165.0,99.0,55.0,33.0,...,88.0,77.0,88.0,121.0,,66.0,165.0,99.0,55.0,33.0
8,88.0,99.0,132.0,very nice,,77.0,,110.0,66.0,44.0,...,99.0,88.0,99.0,132.0,,77.0,,110.0,66.0,44.0
9,99.0,110.0,143.0,very nice,11.0,88.0,,121.0,77.0,55.0,...,110.0,99.0,110.0,143.0,11.0,88.0,,121.0,77.0,55.0


## Using the head function

In [27]:
large_df.head()

Unnamed: 0,A,B,C,some_text,D,E,F,G,H,I,...,Q,R,S,T,U,V,W,X,Y,Z
0,,11.0,44.0,very nice,99.0,,88.0,22.0,165.0,143.0,...,11.0,,11.0,44.0,99.0,,88.0,22.0,165.0,143.0
1,11.0,22.0,55.0,very nice,110.0,,99.0,33.0,,154.0,...,22.0,11.0,22.0,55.0,110.0,,99.0,33.0,,154.0
2,22.0,33.0,66.0,very nice,121.0,11.0,110.0,44.0,,165.0,...,33.0,22.0,33.0,66.0,121.0,11.0,110.0,44.0,,165.0
3,33.0,44.0,77.0,very nice,132.0,22.0,121.0,55.0,11.0,,...,44.0,33.0,44.0,77.0,132.0,22.0,121.0,55.0,11.0,
4,44.0,55.0,88.0,very nice,143.0,33.0,132.0,66.0,22.0,,...,55.0,44.0,55.0,88.0,143.0,33.0,132.0,66.0,22.0,


In [28]:
large_df.head(3)

Unnamed: 0,A,B,C,some_text,D,E,F,G,H,I,...,Q,R,S,T,U,V,W,X,Y,Z
0,,11.0,44.0,very nice,99.0,,88.0,22.0,165.0,143.0,...,11.0,,11.0,44.0,99.0,,88.0,22.0,165.0,143.0
1,11.0,22.0,55.0,very nice,110.0,,99.0,33.0,,154.0,...,22.0,11.0,22.0,55.0,110.0,,99.0,33.0,,154.0
2,22.0,33.0,66.0,very nice,121.0,11.0,110.0,44.0,,165.0,...,33.0,22.0,33.0,66.0,121.0,11.0,110.0,44.0,,165.0


---
title: "Tail of a DataFrame"
description: "When dealing with large DataFrames, it is useful to get a quick overview of their content. Pandas provides the tail function to preview the last n rows of a large DataFrame."
tags: Pandas, EDA
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

In [None]:
# preview last 5 rows by default
df.tail()

In [None]:
# preview last n rows
df.tail(n)

# Example&

## Create large example DataFrame

In [26]:
much_data_such_wow = np.fromfunction(lambda x,y: (x+y*y)%17*11, (10000, 26))
large_df = pd.DataFrame(much_data_such_wow, columns=list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"))
large_df[large_df % 16 == 0] = np.nan
large_df.insert(3,"some_text", "very nice")
large_df

Unnamed: 0,A,B,C,some_text,D,E,F,G,H,I,...,Q,R,S,T,U,V,W,X,Y,Z
0,,11.0,44.0,very nice,99.0,,88.0,22.0,165.0,143.0,...,11.0,,11.0,44.0,99.0,,88.0,22.0,165.0,143.0
1,11.0,22.0,55.0,very nice,110.0,,99.0,33.0,,154.0,...,22.0,11.0,22.0,55.0,110.0,,99.0,33.0,,154.0
2,22.0,33.0,66.0,very nice,121.0,11.0,110.0,44.0,,165.0,...,33.0,22.0,33.0,66.0,121.0,11.0,110.0,44.0,,165.0
3,33.0,44.0,77.0,very nice,132.0,22.0,121.0,55.0,11.0,,...,44.0,33.0,44.0,77.0,132.0,22.0,121.0,55.0,11.0,
4,44.0,55.0,88.0,very nice,143.0,33.0,132.0,66.0,22.0,,...,55.0,44.0,55.0,88.0,143.0,33.0,132.0,66.0,22.0,
5,55.0,66.0,99.0,very nice,154.0,44.0,143.0,77.0,33.0,11.0,...,66.0,55.0,66.0,99.0,154.0,44.0,143.0,77.0,33.0,11.0
6,66.0,77.0,110.0,very nice,165.0,55.0,154.0,88.0,44.0,22.0,...,77.0,66.0,77.0,110.0,165.0,55.0,154.0,88.0,44.0,22.0
7,77.0,88.0,121.0,very nice,,66.0,165.0,99.0,55.0,33.0,...,88.0,77.0,88.0,121.0,,66.0,165.0,99.0,55.0,33.0
8,88.0,99.0,132.0,very nice,,77.0,,110.0,66.0,44.0,...,99.0,88.0,99.0,132.0,,77.0,,110.0,66.0,44.0
9,99.0,110.0,143.0,very nice,11.0,88.0,,121.0,77.0,55.0,...,110.0,99.0,110.0,143.0,11.0,88.0,,121.0,77.0,55.0


## Using the tail function

In [29]:
large_df.tail()

Unnamed: 0,A,B,C,some_text,D,E,F,G,H,I,...,Q,R,S,T,U,V,W,X,Y,Z
9995,,,33.0,very nice,88.0,165.0,77.0,11.0,154.0,132.0,...,,,,33.0,88.0,165.0,77.0,11.0,154.0,132.0
9996,,11.0,44.0,very nice,99.0,,88.0,22.0,165.0,143.0,...,11.0,,11.0,44.0,99.0,,88.0,22.0,165.0,143.0
9997,11.0,22.0,55.0,very nice,110.0,,99.0,33.0,,154.0,...,22.0,11.0,22.0,55.0,110.0,,99.0,33.0,,154.0
9998,22.0,33.0,66.0,very nice,121.0,11.0,110.0,44.0,,165.0,...,33.0,22.0,33.0,66.0,121.0,11.0,110.0,44.0,,165.0
9999,33.0,44.0,77.0,very nice,132.0,22.0,121.0,55.0,11.0,,...,44.0,33.0,44.0,77.0,132.0,22.0,121.0,55.0,11.0,


In [30]:
large_df.tail(3)

Unnamed: 0,A,B,C,some_text,D,E,F,G,H,I,...,Q,R,S,T,U,V,W,X,Y,Z
9997,11.0,22.0,55.0,very nice,110.0,,99.0,33.0,,154.0,...,22.0,11.0,22.0,55.0,110.0,,99.0,33.0,,154.0
9998,22.0,33.0,66.0,very nice,121.0,11.0,110.0,44.0,,165.0,...,33.0,22.0,33.0,66.0,121.0,11.0,110.0,44.0,,165.0
9999,33.0,44.0,77.0,very nice,132.0,22.0,121.0,55.0,11.0,,...,44.0,33.0,44.0,77.0,132.0,22.0,121.0,55.0,11.0,


---
title: "Info Overview for a DataFrame"
description: "When dealing with large DataFrames, it is useful to get a quick overview of their content. Pandas provides the info function, which gives a summary of each column's contents."
tags: Pandas, EDA
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

In [None]:
df.info()

# Example&

## Create large example DataFrame

In [26]:
much_data_such_wow = np.fromfunction(lambda x,y: (x+y*y)%17*11, (10000, 26))
large_df = pd.DataFrame(much_data_such_wow, columns=list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"))
large_df[large_df % 16 == 0] = np.nan
large_df.insert(3,"some_text", "very nice")
large_df

Unnamed: 0,A,B,C,some_text,D,E,F,G,H,I,...,Q,R,S,T,U,V,W,X,Y,Z
0,,11.0,44.0,very nice,99.0,,88.0,22.0,165.0,143.0,...,11.0,,11.0,44.0,99.0,,88.0,22.0,165.0,143.0
1,11.0,22.0,55.0,very nice,110.0,,99.0,33.0,,154.0,...,22.0,11.0,22.0,55.0,110.0,,99.0,33.0,,154.0
2,22.0,33.0,66.0,very nice,121.0,11.0,110.0,44.0,,165.0,...,33.0,22.0,33.0,66.0,121.0,11.0,110.0,44.0,,165.0
3,33.0,44.0,77.0,very nice,132.0,22.0,121.0,55.0,11.0,,...,44.0,33.0,44.0,77.0,132.0,22.0,121.0,55.0,11.0,
4,44.0,55.0,88.0,very nice,143.0,33.0,132.0,66.0,22.0,,...,55.0,44.0,55.0,88.0,143.0,33.0,132.0,66.0,22.0,
5,55.0,66.0,99.0,very nice,154.0,44.0,143.0,77.0,33.0,11.0,...,66.0,55.0,66.0,99.0,154.0,44.0,143.0,77.0,33.0,11.0
6,66.0,77.0,110.0,very nice,165.0,55.0,154.0,88.0,44.0,22.0,...,77.0,66.0,77.0,110.0,165.0,55.0,154.0,88.0,44.0,22.0
7,77.0,88.0,121.0,very nice,,66.0,165.0,99.0,55.0,33.0,...,88.0,77.0,88.0,121.0,,66.0,165.0,99.0,55.0,33.0
8,88.0,99.0,132.0,very nice,,77.0,,110.0,66.0,44.0,...,99.0,88.0,99.0,132.0,,77.0,,110.0,66.0,44.0
9,99.0,110.0,143.0,very nice,11.0,88.0,,121.0,77.0,55.0,...,110.0,99.0,110.0,143.0,11.0,88.0,,121.0,77.0,55.0


## Get the info

In [31]:
large_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 27 columns):
A            8823 non-null float64
B            8824 non-null float64
C            8824 non-null float64
some_text    10000 non-null object
D            8824 non-null float64
E            8822 non-null float64
F            8824 non-null float64
G            8824 non-null float64
H            8822 non-null float64
I            8823 non-null float64
J            8823 non-null float64
K            8822 non-null float64
L            8824 non-null float64
M            8824 non-null float64
N            8822 non-null float64
O            8824 non-null float64
P            8824 non-null float64
Q            8824 non-null float64
R            8823 non-null float64
S            8824 non-null float64
T            8824 non-null float64
U            8824 non-null float64
V            8822 non-null float64
W            8824 non-null float64
X            8824 non-null float64
Y            8822

---
title: "Describe for a DataFrame"
description: "When dealing with large DataFrames, it is useful to get a quick overview of their content. Pandas provides the describe function, which gives a nice overview of the main aggregated values over each column."
tags: Pandas, EDA
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

In [None]:
df.describe()

The `describe()` method gives a nice overview of the main aggregated values over each column:
* `count`: number of non-null (not NaN) values
* `mean`: mean of non-null values
* `std`: [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation) of non-null values
* `min`: minimum of non-null values
* `25%`, `50%`, `75%`: 25th, 50th and 75th [percentile](https://en.wikipedia.org/wiki/Percentile) of non-null values
* `max`: maximum of non-null values
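
To see how the statistics in this list treat missing values, here is a minimal sketch on a single made-up column with one `NaN`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, np.nan]})

summary = df.describe()

# NaN is ignored: count is 4 and the other statistics use the 4 remaining values
count = summary.loc["count", "x"]
mean = summary.loc["mean", "x"]
median = summary.loc["50%", "x"]
```

With the values 1 to 4 left, both the mean and the 50th percentile come out at 2.5.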

# Example&

## Create large example DataFrame

In [26]:
much_data_such_wow = np.fromfunction(lambda x,y: (x+y*y)%17*11, (10000, 26))
large_df = pd.DataFrame(much_data_such_wow, columns=list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"))
large_df[large_df % 16 == 0] = np.nan
large_df.insert(3,"some_text", "very nice")
large_df

Unnamed: 0,A,B,C,some_text,D,E,F,G,H,I,...,Q,R,S,T,U,V,W,X,Y,Z
0,,11.0,44.0,very nice,99.0,,88.0,22.0,165.0,143.0,...,11.0,,11.0,44.0,99.0,,88.0,22.0,165.0,143.0
1,11.0,22.0,55.0,very nice,110.0,,99.0,33.0,,154.0,...,22.0,11.0,22.0,55.0,110.0,,99.0,33.0,,154.0
2,22.0,33.0,66.0,very nice,121.0,11.0,110.0,44.0,,165.0,...,33.0,22.0,33.0,66.0,121.0,11.0,110.0,44.0,,165.0
3,33.0,44.0,77.0,very nice,132.0,22.0,121.0,55.0,11.0,,...,44.0,33.0,44.0,77.0,132.0,22.0,121.0,55.0,11.0,
4,44.0,55.0,88.0,very nice,143.0,33.0,132.0,66.0,22.0,,...,55.0,44.0,55.0,88.0,143.0,33.0,132.0,66.0,22.0,
5,55.0,66.0,99.0,very nice,154.0,44.0,143.0,77.0,33.0,11.0,...,66.0,55.0,66.0,99.0,154.0,44.0,143.0,77.0,33.0,11.0
6,66.0,77.0,110.0,very nice,165.0,55.0,154.0,88.0,44.0,22.0,...,77.0,66.0,77.0,110.0,165.0,55.0,154.0,88.0,44.0,22.0
7,77.0,88.0,121.0,very nice,,66.0,165.0,99.0,55.0,33.0,...,88.0,77.0,88.0,121.0,,66.0,165.0,99.0,55.0,33.0
8,88.0,99.0,132.0,very nice,,77.0,,110.0,66.0,44.0,...,99.0,88.0,99.0,132.0,,77.0,,110.0,66.0,44.0
9,99.0,110.0,143.0,very nice,11.0,88.0,,121.0,77.0,55.0,...,110.0,99.0,110.0,143.0,11.0,88.0,,121.0,77.0,55.0


## Get the summary

In [32]:
large_df.describe()

Unnamed: 0,A,B,C,D,E,F,G,H,I,J,...,Q,R,S,T,U,V,W,X,Y,Z
count,8823.0,8824.0,8824.0,8824.0,8822.0,8824.0,8824.0,8822.0,8823.0,8823.0,...,8824.0,8823.0,8824.0,8824.0,8824.0,8822.0,8824.0,8824.0,8822.0,8823.0
mean,87.977559,87.972575,87.987534,88.012466,87.983791,88.00748,87.977561,88.0,88.022441,88.022441,...,87.972575,87.977559,87.972575,87.987534,88.012466,87.983791,88.00748,87.977561,88.0,88.022441
std,47.535911,47.535523,47.521679,47.521679,47.535001,47.519371,47.529755,47.536879,47.535911,47.535911,...,47.535523,47.535911,47.535523,47.521679,47.521679,47.535001,47.519371,47.529755,47.536879,47.535911
min,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,...,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0
25%,44.0,44.0,44.0,44.0,44.0,44.0,44.0,44.0,44.0,44.0,...,44.0,44.0,44.0,44.0,44.0,44.0,44.0,44.0,44.0,44.0
50%,88.0,88.0,88.0,88.0,88.0,88.0,88.0,88.0,88.0,88.0,...,88.0,88.0,88.0,88.0,88.0,88.0,88.0,88.0,88.0,88.0
75%,132.0,132.0,132.0,132.0,132.0,132.0,132.0,132.0,132.0,132.0,...,132.0,132.0,132.0,132.0,132.0,132.0,132.0,132.0,132.0,132.0
max,165.0,165.0,165.0,165.0,165.0,165.0,165.0,165.0,165.0,165.0,...,165.0,165.0,165.0,165.0,165.0,165.0,165.0,165.0,165.0,165.0


---
title: "Save a DataFrame as CSV, HTML, or JSON"
description: "Pandas can save DataFrames to various backends including file formats such as csv, html, or json."
tags: Pandas
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: "Data Import / Connectivity / Injest"

---

# Key Code&

In [None]:
import pandas as pd

In [None]:
df.to_csv("filename.csv")

In [None]:
df.to_html("filename.html")

In [None]:
df.to_json("filename.json")

# Example&

## Create an example DataFrame

In [130]:
my_df = pd.DataFrame(
    [["Biking", 68.5, 1985, np.nan], ["Dancing", 83.1, 1984, 3]], 
    columns=["hobby","weight","birthyear","children"],
    index=["alice", "bob"]
)
my_df

Unnamed: 0,hobby,weight,birthyear,children
alice,Biking,68.5,1985,
bob,Dancing,83.1,1984,3.0


## Save to a file

In [131]:
my_df.to_csv("my_df.csv")
my_df.to_html("my_df.html")
my_df.to_json("my_df.json")

## Done! Let's peek at the files to make sure

In [132]:
for filename in ("my_df.csv", "my_df.html", "my_df.json"):
    print("#", filename)
    with open(filename, "rt") as f:
        print(f.read())
        print()


# my_df.csv
,hobby,weight,birthyear,children
alice,Biking,68.5,1985,
bob,Dancing,83.1,1984,3.0


# my_df.html
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>hobby</th>
      <th>weight</th>
      <th>birthyear</th>
      <th>children</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>alice</th>
      <td>Biking</td>
      <td>68.5</td>
      <td>1985</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>bob</th>
      <td>Dancing</td>
      <td>83.1</td>
      <td>1984</td>
      <td>3.0</td>
    </tr>
  </tbody>
</table>

# my_df.json
{"hobby":{"alice":"Biking","bob":"Dancing"},"weight":{"alice":68.5,"bob":83.1},"birthyear":{"alice":1985,"bob":1984},"children":{"alice":null,"bob":3.0}}



**Note:** the index is saved as the first column (with no name) in a CSV file, as `<th>` tags in HTML and as keys in JSON.
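
If the index should not end up in the file, `to_csv()` accepts `index=False`. The sketch below also relies on the fact that `to_csv()` returns the CSV text when no path is given, which makes it easy to inspect the result without touching the disk; the table is a reduced version of `my_df`:

```python
import pandas as pd

my_df = pd.DataFrame(
    [["Biking", 68.5], ["Dancing", 83.1]],
    columns=["hobby", "weight"],
    index=["alice", "bob"],
)

# with no path, to_csv() returns the CSV text instead of writing a file
with_index = my_df.to_csv()

# index=False leaves the row labels out entirely
without_index = my_df.to_csv(index=False)
```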

Saving to other formats works very similarly, but some formats require extra libraries to be installed. For example, saving to Excel requires the openpyxl library:

In [133]:
try:
    my_df.to_excel("my_df.xlsx", sheet_name='People')
except ImportError as e:
    print(e)

No module named 'openpyxl'


---
title: "Load Data from CSV, HTML, JSON, or Excel"
description: "Pandas can easily load various file formats such as csv, html, json, or excel into a DataFrame."
tags: Pandas, Data Import / Connectivity / Injest
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('filename.csv', index_col = 0)

In [None]:
# for other file formats, there are similar functions
df = pd.read_json('filename.json')
df = pd.read_html('filename.html')[0]  # read_html returns a list of DataFrames, one per table
df = pd.read_excel('filename.xlsx')

# Example&

In [134]:
my_df_loaded = pd.read_csv("my_df.csv", index_col=0)
my_df_loaded

Unnamed: 0,hobby,weight,birthyear,children
alice,Biking,68.5,1985,
bob,Dancing,83.1,1984,3.0
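
The `index_col=0` argument matters because the saved index has no column name. A small sketch, reading a shortened copy of the CSV from an in-memory string instead of a file, shows what happens with and without it:

```python
import io

import pandas as pd

csv_text = ",hobby,weight\nalice,Biking,68.5\nbob,Dancing,83.1\n"

# index_col=0 turns the unnamed first column back into the index
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# without index_col, the labels become an ordinary column named "Unnamed: 0"
without_index = pd.read_csv(io.StringIO(csv_text))
```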


# Learn More&

There are more options available, in particular regarding datetime format. Check out the [documentation](http://pandas.pydata.org/pandas-docs/stable/io.html) for more details. <br>
Check out the read_csv documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

---
title: "Combine DataFrames with the Merge Function"
description: "One powerful feature of pandas is its ability to perform SQL-like joins on DataFrames. Various types of joins are supported: inner joins, left/right outer joins and full joins."
tags: Pandas, Data Import / Connectivity / Injest
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

In [None]:
# keeping only rows that have values in both dfs
# equivalent of a SQL 'INNER JOIN'
pd.merge(left = df1, right = df2, on = 'column')

In [None]:
# no rows get dropped (missing values are filled with NaN)
# equivalent of a SQL 'FULL OUTER JOIN'
pd.merge(left = df1, right = df2, on = 'column', how = 'outer')

A SQL `LEFT OUTER JOIN` or `RIGHT OUTER JOIN` (where only the rows existing in the left/right DataFrame end up in the result) can be achieved by setting `how = 'left'` or `how = 'right'`.

# Example&

## Create a couple of simple `DataFrame`s:

In [136]:
city_loc = pd.DataFrame(
    [
        ["CA", "San Francisco", 37.781334, -122.416728],
        ["NY", "New York", 40.705649, -74.008344],
        ["FL", "Miami", 25.791100, -80.320733],
        ["OH", "Cleveland", 41.473508, -81.739791],
        ["UT", "Salt Lake City", 40.755851, -111.896657]
    ], columns=["state", "city", "lat", "lng"])
city_loc

Unnamed: 0,state,city,lat,lng
0,CA,San Francisco,37.781334,-122.416728
1,NY,New York,40.705649,-74.008344
2,FL,Miami,25.7911,-80.320733
3,OH,Cleveland,41.473508,-81.739791
4,UT,Salt Lake City,40.755851,-111.896657


In [137]:
city_pop = pd.DataFrame(
    [
        [808976, "San Francisco", "California"],
        [8363710, "New York", "New-York"],
        [413201, "Miami", "Florida"],
        [2242193, "Houston", "Texas"]
    ], index=[3,4,5,6], columns=["population", "city", "state"])
city_pop

Unnamed: 0,population,city,state
3,808976,San Francisco,California
4,8363710,New York,New-York
5,413201,Miami,Florida
6,2242193,Houston,Texas


## Join these `DataFrame`s using the `merge()` function:

In [138]:
pd.merge(left=city_loc, right=city_pop, on="city")

Unnamed: 0,state_x,city,lat,lng,population,state_y
0,CA,San Francisco,37.781334,-122.416728,808976,California
1,NY,New York,40.705649,-74.008344,8363710,New-York
2,FL,Miami,25.7911,-80.320733,413201,Florida


**Note:** both `DataFrame`s have a column named `state`, so in the result they got renamed to `state_x` and `state_y`.

**Note:** Cleveland, Salt Lake City and Houston were dropped because they don't exist in *both* `DataFrame`s. 

# Example&

## Merge where no city gets dropped and `NaN` values are added

In [139]:
all_cities = pd.merge(left=city_loc, right=city_pop, on="city", how="outer")
all_cities

Unnamed: 0,state_x,city,lat,lng,population,state_y
0,CA,San Francisco,37.781334,-122.416728,808976.0,California
1,NY,New York,40.705649,-74.008344,8363710.0,New-York
2,FL,Miami,25.7911,-80.320733,413201.0,Florida
3,OH,Cleveland,41.473508,-81.739791,,
4,UT,Salt Lake City,40.755851,-111.896657,,
5,,Houston,,,2242193.0,Texas


# Example&

## Right outer join

In [140]:
pd.merge(left=city_loc, right=city_pop, on="city", how="right")

Unnamed: 0,state_x,city,lat,lng,population,state_y
0,CA,San Francisco,37.781334,-122.416728,808976,California
1,NY,New York,40.705649,-74.008344,8363710,New-York
2,FL,Miami,25.7911,-80.320733,413201,Florida
3,,Houston,,,2242193,Texas
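
The left outer join works the same way with `how="left"`. When it is unclear which side a row came from, `indicator=True` adds a `_merge` column that says so. A sketch on cut-down versions of the two tables (coordinates rounded for brevity):

```python
import pandas as pd

city_loc = pd.DataFrame(
    {"city": ["San Francisco", "Miami", "Cleveland"], "lat": [37.78, 25.79, 41.47]}
)
city_pop = pd.DataFrame(
    {"city": ["San Francisco", "Miami", "Houston"], "population": [808976, 413201, 2242193]}
)

# how="left" keeps every row of city_loc; Houston (right only) is dropped
left_join = pd.merge(left=city_loc, right=city_pop, on="city", how="left")

# indicator=True adds a _merge column saying where each row came from
outer_join = pd.merge(
    left=city_loc, right=city_pop, on="city", how="outer", indicator=True
)
origin = dict(zip(outer_join["city"], outer_join["_merge"].astype(str)))
```

In `origin`, Cleveland is marked `left_only`, Houston `right_only`, and the two shared cities `both`.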


# Example&

If the key to join on is actually in one (or both) `DataFrame`'s index, you must use `left_index=True` and/or `right_index=True`. If the key column names differ, you must use `left_on` and `right_on`. For example:

In [141]:
city_pop2 = city_pop.copy()
city_pop2.columns = ["population", "name", "state"]
pd.merge(left=city_loc, right=city_pop2, left_on="city", right_on="name")

Unnamed: 0,state_x,city,lat,lng,population,name,state_y
0,CA,San Francisco,37.781334,-122.416728,808976,San Francisco,California
1,NY,New York,40.705649,-74.008344,8363710,New York,New-York
2,FL,Miami,25.7911,-80.320733,413201,Miami,Florida
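
The example above covers `left_on` and `right_on`. For the index case, here is a minimal sketch in which the city names are assumed to be the index of the population table rather than a column (both tables are cut down for brevity):

```python
import pandas as pd

city_loc = pd.DataFrame(
    {"city": ["San Francisco", "Miami"], "lat": [37.78, 25.79]}
)

# here the city names are the index of the population table, not a column
city_pop = pd.DataFrame(
    {"population": [808976, 413201]},
    index=["San Francisco", "Miami"],
)

# match the "city" column on the left against the index on the right
merged = pd.merge(left=city_loc, right=city_pop, left_on="city", right_index=True)
```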


---
title: "Concatenate DataFrames with the Concat Function"
description: "Concatenate multiple DataFrames with the pandas concat() function."
tags: Pandas
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

In [None]:
# ignore_index will renumber the index 0 ... n-1 (default = False)
pd.concat([df1, df2, df...], ignore_index = True)

In [None]:
# only columns that exist in both DataFrames are returned
pd.concat([df1, df2, df...], join = 'inner')

In [None]:
# concatenate horizontally
pd.concat([df1, df2, df...], axis = 1)

# Example&

## Create example DataFrames

In [136]:
city_loc = pd.DataFrame(
    [
        ["CA", "San Francisco", 37.781334, -122.416728],
        ["NY", "New York", 40.705649, -74.008344],
        ["FL", "Miami", 25.791100, -80.320733],
        ["OH", "Cleveland", 41.473508, -81.739791],
        ["UT", "Salt Lake City", 40.755851, -111.896657]
    ], columns=["state", "city", "lat", "lng"])
city_loc

Unnamed: 0,state,city,lat,lng
0,CA,San Francisco,37.781334,-122.416728
1,NY,New York,40.705649,-74.008344
2,FL,Miami,25.7911,-80.320733
3,OH,Cleveland,41.473508,-81.739791
4,UT,Salt Lake City,40.755851,-111.896657


In [137]:
city_pop = pd.DataFrame(
    [
        [808976, "San Francisco", "California"],
        [8363710, "New York", "New-York"],
        [413201, "Miami", "Florida"],
        [2242193, "Houston", "Texas"]
    ], index=[3,4,5,6], columns=["population", "city", "state"])
city_pop

Unnamed: 0,population,city,state
3,808976,San Francisco,California
4,8363710,New York,New-York
5,413201,Miami,Florida
6,2242193,Houston,Texas


## Concatenate them together

In [142]:
result_concat = pd.concat([city_loc, city_pop])
result_concat

Unnamed: 0,city,lat,lng,population,state
0,San Francisco,37.781334,-122.416728,,CA
1,New York,40.705649,-74.008344,,NY
2,Miami,25.7911,-80.320733,,FL
3,Cleveland,41.473508,-81.739791,,OH
4,Salt Lake City,40.755851,-111.896657,,UT
3,San Francisco,,,808976.0,California
4,New York,,,8363710.0,New-York
5,Miami,,,413201.0,Florida
6,Houston,,,2242193.0,Texas


**Note:** this operation aligned the data horizontally (by columns) but not vertically (by rows). In this example, we end up with multiple rows having the same index label (e.g. 3). Pandas handles this rather gracefully:

In [143]:
result_concat.loc[3]

Unnamed: 0,city,lat,lng,population,state
3,Cleveland,41.473508,-81.739791,,OH
3,San Francisco,,,808976.0,California
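
If duplicate index labels are not wanted, `pd.concat` offers two options besides `ignore_index`: `keys=` tags each row with its source, and `verify_integrity=True` raises an error instead of keeping duplicates. A small sketch with trimmed `DataFrame`s follows; the labels `"loc"` and `"pop"` are arbitrary names chosen for this example.

```python
import pandas as pd

# trimmed versions of the example DataFrames
city_loc = pd.DataFrame({
    "city": ["San Francisco", "New York", "Miami", "Cleveland", "Salt Lake City"],
    "state": ["CA", "NY", "FL", "OH", "UT"],
})
city_pop = pd.DataFrame({
    "city": ["San Francisco", "New York", "Miami", "Houston"],
    "population": [808976, 8363710, 413201, 2242193],
}, index=[3, 4, 5, 6])

# keys= adds an outer index level that records where each row came from
labelled = pd.concat([city_loc, city_pop], keys=["loc", "pop"])
print(labelled.loc[("loc", 3), "city"])
print(labelled.loc[("pop", 3), "city"])

# verify_integrity=True refuses to build a result with duplicate index labels
try:
    pd.concat([city_loc, city_pop], verify_integrity=True)
    duplicates_detected = False
except ValueError as err:
    duplicates_detected = True
    print("duplicate index labels:", err)
```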


# Example&

## Tell pandas to just ignore the index

In [144]:
pd.concat([city_loc, city_pop], ignore_index=True)

Unnamed: 0,city,lat,lng,population,state
0,San Francisco,37.781334,-122.416728,,CA
1,New York,40.705649,-74.008344,,NY
2,Miami,25.7911,-80.320733,,FL
3,Cleveland,41.473508,-81.739791,,OH
4,Salt Lake City,40.755851,-111.896657,,UT
5,San Francisco,,,808976.0,California
6,New York,,,8363710.0,New-York
7,Miami,,,413201.0,Florida
8,Houston,,,2242193.0,Texas


**Note:** when a column does not exist in a `DataFrame`, it acts as if it was filled with `NaN` values. 

# Example&

## Only columns that exist in *both* `DataFrame`s

In [145]:
pd.concat([city_loc, city_pop], join="inner")

Unnamed: 0,state,city
0,CA,San Francisco
1,NY,New York
2,FL,Miami
3,OH,Cleveland
4,UT,Salt Lake City
3,California,San Francisco
4,New-York,New York
5,Florida,Miami
6,Texas,Houston


# Example&

## Concatenate `DataFrame`s horizontally

In [146]:
pd.concat([city_loc, city_pop], axis=1)

Unnamed: 0,state,city,lat,lng,population,city.1,state.1
0,CA,San Francisco,37.781334,-122.416728,,,
1,NY,New York,40.705649,-74.008344,,,
2,FL,Miami,25.7911,-80.320733,,,
3,OH,Cleveland,41.473508,-81.739791,808976.0,San Francisco,California
4,UT,Salt Lake City,40.755851,-111.896657,8363710.0,New York,New-York
5,,,,,413201.0,Miami,Florida
6,,,,,2242193.0,Houston,Texas


In this case it does not make much sense because the indices do not align well (e.g. Cleveland and San Francisco end up on the same row because they share the index label `3`). So let's reindex the `DataFrame`s by city name before concatenating:

In [147]:
pd.concat([city_loc.set_index("city"), city_pop.set_index("city")], axis=1)

Unnamed: 0,state,lat,lng,population,state.1
Cleveland,OH,41.473508,-81.739791,,
Houston,,,,2242193.0,Texas
Miami,FL,25.7911,-80.320733,413201.0,Florida
New York,NY,40.705649,-74.008344,8363710.0,New-York
Salt Lake City,UT,40.755851,-111.896657,,
San Francisco,CA,37.781334,-122.416728,808976.0,California


This looks a lot like a `FULL OUTER JOIN`, except that the `state` columns were not renamed to `state_x` and `state_y`, and the `city` column is now the index.
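
The comparison with a full outer join can be checked directly. The sketch below rebuilds the two example `DataFrame`s, runs both operations and compares the cities and column names; the variable names `wide` and `merged` are illustrative.

```python
import pandas as pd

city_loc = pd.DataFrame(
    [
        ["CA", "San Francisco", 37.781334, -122.416728],
        ["NY", "New York", 40.705649, -74.008344],
        ["FL", "Miami", 25.791100, -80.320733],
        ["OH", "Cleveland", 41.473508, -81.739791],
        ["UT", "Salt Lake City", 40.755851, -111.896657],
    ], columns=["state", "city", "lat", "lng"])
city_pop = pd.DataFrame(
    [
        [808976, "San Francisco", "California"],
        [8363710, "New York", "New-York"],
        [413201, "Miami", "Florida"],
        [2242193, "Houston", "Texas"],
    ], index=[3, 4, 5, 6], columns=["population", "city", "state"])

# horizontal concat aligned on the city index
wide = pd.concat([city_loc.set_index("city"), city_pop.set_index("city")], axis=1)
# full outer join on the city column
merged = pd.merge(left=city_loc, right=city_pop, on="city", how="outer")

print(sorted(wide.index) == sorted(merged["city"]))  # same set of cities
print(list(wide.columns))    # "state" appears twice, not renamed
print(list(merged.columns))  # "state_x" and "state_y"
```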

# Learn More&

Check out [the documentation on pd.concat](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) for more details.

---
title: "Append DataFrames Vertically"
description: "The append method is a useful shorthand for concatenating DataFrames vertically. It does not modify the DataFrame in place: it works on a copy and returns the modified copy. Pass sort = True or sort = False explicitly to silence a warning, since the default (None) is deprecated. Note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; use pd.concat there instead."
tags: Pandas
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

In [None]:
# ignore_index will renumber the index 0 ... n-1 (default = False)
# append is a DataFrame method (there is no pd.append); it also accepts a list of DataFrames
my_df = df1.append([df2, df...], ignore_index = True, sort = False)

In [None]:
# append returns a modified copy. It's not inplace
df1.append(df2, sort = True)

# Example&

## Create example DataFrames

In [67]:
city_loc = pd.DataFrame(
    [
        ["CA", "San Francisco", 37.781334, -122.416728],
        ["NY", "New York", 40.705649, -74.008344],
        ["FL", "Miami", 25.791100, -80.320733],
        ["OH", "Cleveland", 41.473508, -81.739791],
        ["UT", "Salt Lake City", 40.755851, -111.896657]
    ], columns=["state", "city", "lat", "lng"])
city_loc

Unnamed: 0,state,city,lat,lng
0,CA,San Francisco,37.781334,-122.416728
1,NY,New York,40.705649,-74.008344
2,FL,Miami,25.7911,-80.320733
3,OH,Cleveland,41.473508,-81.739791
4,UT,Salt Lake City,40.755851,-111.896657


In [68]:
city_pop = pd.DataFrame(
    [
        [808976, "San Francisco", "California"],
        [8363710, "New York", "New-York"],
        [413201, "Miami", "Florida"],
        [2242193, "Houston", "Texas"]
    ], index=[3,4,5,6], columns=["population", "city", "state"])
city_pop

Unnamed: 0,population,city,state
3,808976,San Francisco,California
4,8363710,New York,New-York
5,413201,Miami,Florida
6,2242193,Houston,Texas


## Append them together

In [73]:
together = city_loc.append(city_pop, sort = False, ignore_index = True)
together

Unnamed: 0,state,city,lat,lng,population
0,CA,San Francisco,37.781334,-122.416728,
1,NY,New York,40.705649,-74.008344,
2,FL,Miami,25.7911,-80.320733,
3,OH,Cleveland,41.473508,-81.739791,
4,UT,Salt Lake City,40.755851,-111.896657,
5,California,San Francisco,,,808976.0
6,New-York,New York,,,8363710.0
7,Florida,Miami,,,413201.0
8,Texas,Houston,,,2242193.0
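
`DataFrame.append` was removed in pandas 2.0, so on recent versions the call above raises an `AttributeError`. A sketch of the equivalent `pd.concat` call, assuming the same two example `DataFrame`s:

```python
import pandas as pd

city_loc = pd.DataFrame(
    [
        ["CA", "San Francisco", 37.781334, -122.416728],
        ["NY", "New York", 40.705649, -74.008344],
        ["FL", "Miami", 25.791100, -80.320733],
        ["OH", "Cleveland", 41.473508, -81.739791],
        ["UT", "Salt Lake City", 40.755851, -111.896657],
    ], columns=["state", "city", "lat", "lng"])
city_pop = pd.DataFrame(
    [
        [808976, "San Francisco", "California"],
        [8363710, "New York", "New-York"],
        [413201, "Miami", "Florida"],
        [2242193, "Houston", "Texas"],
    ], index=[3, 4, 5, 6], columns=["population", "city", "state"])

# same result as city_loc.append(city_pop, sort=False, ignore_index=True)
together = pd.concat([city_loc, city_pop], ignore_index=True, sort=False)
print(together)
```

Like `append`, `pd.concat` returns a new `DataFrame` and leaves `city_loc` and `city_pop` unchanged.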


---
title: "Categories (Levels) in a DataFrame"
description: "Categories (or levels) are categorical variables that usually take on a fixed number of possible values. Examples are sex, blood type, political affiliation etc. Pandas makes it easy to deal with these types of values."
tags: Pandas, Data Cleaning / Preprocessing
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

In [None]:
# make a new column from codes (ex: levels are 0, 1, 2)
df['new_col'] = df['levels'].astype('category')
df['new_col'].cat.categories # => array of categories/levels

In [None]:
# rename the categories for readability
# (assigning to .cat.categories directly was removed in pandas 2.0)
df['new_col'] = df['new_col'].cat.rename_categories(['name_1', 'name_2', 'name_3'])

# Example&

## Example DataFrame with codes/levels

In [36]:
city_eco = pd.DataFrame(
    [
        [808976, "San Francisco", "California", 0],
        [8363710, "New York", "New-York", 0],
        [413201, "Miami", "Florida", 3],
        [2242193, "Houston", "Texas", 7],
    ], columns=["population", "city", "state", "eco_code"])

city_eco

Unnamed: 0,population,city,state,eco_code
0,808976,San Francisco,California,0
1,8363710,New York,New-York,0
2,413201,Miami,Florida,3
3,2242193,Houston,Texas,7


Right now the `eco_code` column is full of apparently meaningless codes. Let's fix that. 

## Create a new categorical column based on the `eco_code`s

In [38]:
city_eco["economy"] = city_eco["eco_code"].astype('category')
city_eco["economy"].cat.categories

Int64Index([0, 3, 7], dtype='int64')

## Rename the category with a meaningful name

In [39]:
city_eco["economy"] = city_eco["economy"].cat.rename_categories(["Finance", "Energy", "Tourism"])
city_eco

Unnamed: 0,population,city,state,eco_code,economy
0,808976,San Francisco,California,0,Finance
1,8363710,New York,New-York,0,Finance
2,413201,Miami,Florida,3,Energy
3,2242193,Houston,Texas,7,Tourism


**Note:** categorical values are sorted according to their categorical order, *not* their alphabetical (lexicographical) order:

In [40]:
city_eco.sort_values(by="economy", ascending=False)

Unnamed: 0,population,city,state,eco_code,economy
3,2242193,Houston,Texas,7,Tourism
2,413201,Miami,Florida,3,Energy
1,8363710,New York,New-York,0,Finance
0,808976,San Francisco,California,0,Finance
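
The category order comes from the original codes (0, 3, 7), which is why `Tourism` sorts after `Energy` and `Finance`. If a different order is needed, `cat.reorder_categories` can set it. The sketch below uses a trimmed `city_eco` and a hypothetical extra column `economy_alpha` to show the effect.

```python
import pandas as pd

# trimmed version of the example DataFrame
city_eco = pd.DataFrame({
    "city": ["San Francisco", "New York", "Miami", "Houston"],
    "eco_code": [0, 0, 3, 7],
})

# build the categorical column; a dict maps each old category to its new name
economy = city_eco["eco_code"].astype("category")
city_eco["economy"] = economy.cat.rename_categories({0: "Finance", 3: "Energy", 7: "Tourism"})
print(list(city_eco["economy"].cat.categories))  # order follows the codes 0, 3, 7

# same values, but with the categories put in alphabetical order
city_eco["economy_alpha"] = city_eco["economy"].cat.reorder_categories(
    ["Energy", "Finance", "Tourism"])
sorted_cities = city_eco.sort_values(by="economy_alpha")["city"].tolist()
print(sorted_cities)
```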


---
title: "Handling Missing Data: Filling Missing Values (NaNs)"
description: "Fill all the NaNs in a DataFrame with a particular value."
tags: Pandas, Data Cleaning / Preprocessing
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

In [None]:
# replace all NaN in column with value
df['column'] = df['column'].fillna(value)

# Example&

## Create example DataFrame

In [41]:
import numpy as np
people_dict = {
    "weight": [68, 83, 112],
    "height": [172, 181, 185],
    "birthyear": [1985, 1984, 1992],
    "children": [np.nan, 3, np.nan],
    "hobby": ["Biking", "Dancing", "Snowboarding"],
}
people = pd.DataFrame(people_dict, index=["alice", "bob", "charles"])
people

Unnamed: 0,weight,height,birthyear,children,hobby
alice,68,172,1985,,Biking
bob,83,181,1984,3.0,Dancing
charles,112,185,1992,,Snowboarding


## Replace NaNs with 0's 

In [42]:
people['children'] = people['children'].fillna(0)
people

Unnamed: 0,weight,height,birthyear,children,hobby
alice,68,172,1985,0.0,Biking
bob,83,181,1984,3.0,Dancing
charles,112,185,1992,0.0,Snowboarding
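
`fillna` also accepts a dict, which allows a different fill value per column in one call. In this sketch the missing `weight` for charles is a made-up gap added for illustration; it is not in the original data.

```python
import numpy as np
import pandas as pd

# trimmed example; the NaN in "weight" is hypothetical
people = pd.DataFrame({
    "weight": [68, 83, np.nan],
    "children": [np.nan, 3, np.nan],
}, index=["alice", "bob", "charles"])

# fill "children" with 0 and "weight" with the mean of the known weights
filled = people.fillna({"children": 0, "weight": people["weight"].mean()})
print(filled)
```

`fillna` returns a new `DataFrame` here, so `people` itself still contains the `NaN` values.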


---
title: "Handling Missing Data: Interpolate Missing Values (NaNs)"
description: "Interpolate missing values (NaNs). Note: linear interpolation is the default method."
tags: Pandas, Data Cleaning / Preprocessing
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

In [None]:
# setting inplace to True changes the original df
df['column'].interpolate(inplace = True)

# Example&

## Create example DataFrame

In [49]:
import numpy as np
people_dict = {
    "weight": [68, 83, 112],
    "height": [172, 181, 185],
    "birthyear": [1985, 1984, 1992],
    "children": [np.nan, 3, np.nan],
    "hobby": ["Biking", "Dancing", "Snowboarding"],
    "order_surveyed": [11, np.nan, 13]
}
people = pd.DataFrame(people_dict, index=["alice", "bob", "charles"])
people

Unnamed: 0,weight,height,birthyear,children,hobby,order_surveyed
alice,68,172,1985,,Biking,11.0
bob,83,181,1984,3.0,Dancing,
charles,112,185,1992,,Snowboarding,13.0


## Interpolate the missing value

In [50]:
people['order_surveyed'].interpolate(inplace = True)
people

Unnamed: 0,weight,height,birthyear,children,hobby,order_surveyed
alice,68,172,1985,,Biking,11.0
bob,83,181,1984,3.0,Dancing,12.0
charles,112,185,1992,,Snowboarding,13.0
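
Calling `interpolate(inplace=True)` on a selected column relies on that column being a view of the `DataFrame`. With pandas Copy-on-Write semantics the change may not reach the parent `DataFrame`, so assigning the result back is the safer pattern. A minimal sketch with a trimmed `people` frame:

```python
import numpy as np
import pandas as pd

# trimmed version of the example DataFrame
people = pd.DataFrame({
    "order_surveyed": [11, np.nan, 13],
    "hobby": ["Biking", "Dancing", "Snowboarding"],
}, index=["alice", "bob", "charles"])

# interpolate() returns a new Series; assign it back to update the column
people["order_surveyed"] = people["order_surveyed"].interpolate()
print(people)
```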


---
title: "Handling Missing Data: Drop Rows or Columns with Missing Values (NaNs)"
description: "Drop rows or columns with missing values (NaNs). Drop a row or column if it contains any NaN values, or only if all of its values are NaN."
tags: Pandas, Data Cleaning / Preprocessing
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

# Key Code&

In [None]:
import pandas as pd

- `how = 'any'` (default): if any NA values are present, drop the row or column 
- `how = 'all'`: if all values in the row or column are NA, drop it
- `inplace = True` (default False): change the original df

## Drop Rows

In [None]:
df.dropna()

## Drop Columns

In [None]:
df.dropna(axis = 1)

# Example&

## Create example DataFrame

In [62]:
import numpy as np
people_dict = {
    "weight": [68, 83, 112, np.nan],
    "height": [172, 181, 185, np.nan],
    "birthyear": [1985, 1984, 1992, np.nan],
    "children": [np.nan, np.nan, np.nan, np.nan],
    "hobby": ["Biking", "Dancing", "Snowboarding", "cash me oustside"],
}
people = pd.DataFrame(people_dict, index=["alice", "bob", "charles", "danielle bregoli"])
people

Unnamed: 0,weight,height,birthyear,children,hobby
alice,68.0,172.0,1985.0,,Biking
bob,83.0,181.0,1984.0,,Dancing
charles,112.0,185.0,1992.0,,Snowboarding
danielle bregoli,,,,,cash me oustside


## Drop Columns

In [63]:
people.dropna(how = 'all', inplace = True, axis = 1)
people

Unnamed: 0,weight,height,birthyear,hobby
alice,68.0,172.0,1985.0,Biking
bob,83.0,181.0,1984.0,Dancing
charles,112.0,185.0,1992.0,Snowboarding
danielle bregoli,,,,cash me oustside


## Drop Rows

In [65]:
people = people.dropna()
people

Unnamed: 0,weight,height,birthyear,hobby
alice,68.0,172.0,1985.0,Biking
bob,83.0,181.0,1984.0,Dancing
charles,112.0,185.0,1992.0,Snowboarding
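
Besides `how`, `dropna` has two other parameters that narrow down what gets dropped: `subset` restricts the check to certain columns, and `thresh` sets a minimum number of non-NA values to keep a row. A sketch on a trimmed version of the example data; the names `by_subset` and `by_thresh` are illustrative.

```python
import numpy as np
import pandas as pd

# trimmed version of the example DataFrame
people = pd.DataFrame({
    "weight": [68, 83, 112, np.nan],
    "height": [172, 181, 185, np.nan],
    "children": [np.nan, np.nan, np.nan, np.nan],
    "hobby": ["Biking", "Dancing", "Snowboarding", "cash me oustside"],
}, index=["alice", "bob", "charles", "danielle bregoli"])

# only look at "weight" when deciding which rows to drop
by_subset = people.dropna(subset=["weight"])
print(by_subset)

# keep rows that have at least 2 non-NA values
by_thresh = people.dropna(thresh=2)
print(by_thresh)
```

Both calls keep alice, bob and charles even though their `children` value is missing, which a plain `people.dropna()` would not do.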
