<a href="https://colab.research.google.com/github/nachoacev/practice-data-science/blob/main/PandasTutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas Tutorial

This notebook collects everything we need from `pandas`, the most popular Python library for data analysis, to manipulate data correctly.

In [1]:
import pandas as pd
pd.set_option('display.max_rows', 5)

## Creating data

There are two core objects in pandas: the **DataFrame** and the **Series**.

**DataFrame**
A DataFrame is a table. It contains an array of individual *entries*, each of which has a certain *value*. Each entry corresponds to a row (or record) and a column.

In [2]:
pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})

Unnamed: 0,Yes,No
0,50,131
1,21,2


In [3]:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})

Unnamed: 0,Bob,Sue
0,I liked it.,Pretty good.
1,It was awful.,Bland.


We are using the `pd.DataFrame()` constructor to generate these DataFrame objects. The syntax for declaring a new one is a **dictionary** whose keys are the column names (Bob and Sue in this example), and whose values are lists of entries.

The dictionary-list constructor assigns values to the column labels, but just uses an ascending count from 0 (0, 1, 2, 3, ...) for the row labels. Sometimes this is OK, _but oftentimes we will want to assign these labels ourselves_.

The list of row labels used in a DataFrame is known as an __Index__. We can assign values to it by using an index parameter in our constructor:

In [4]:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
              'Sue': ['Pretty good.', 'Bland.']},
             index=['Product A', 'Product B'])

Unnamed: 0,Bob,Sue
Product A,I liked it.,Pretty good.
Product B,It was awful.,Bland.


**Series**

A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list. In fact, you can create one with nothing more than a list:

In [5]:
pd.Series([1, 2, 3, 4, 5])

Unnamed: 0,0
0,1
1,2
2,3
3,4
4,5


A Series is, in essence, a single column of a DataFrame. So you can assign row labels to the Series the same way as before, using an index parameter. However, a Series does not have a column name; it only has one overall `name`:

In [6]:
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')

Unnamed: 0,Product A
2015 Sales,30
2016 Sales,35
2017 Sales,40
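
The link between the two objects can be checked directly. The sketch below rebuilds the Series from the cell above and uses `to_frame()` to turn it into a one-column DataFrame; the variable names `sales`, `table` and `column` are only illustrative.

```python
import pandas as pd

# Same Series as in the cell above.
sales = pd.Series([30, 35, 40],
                  index=['2015 Sales', '2016 Sales', '2017 Sales'],
                  name='Product A')

# to_frame() wraps the Series in a DataFrame; its name becomes the column name.
table = sales.to_frame()
print(table.columns.tolist())

# Selecting that column gives back an equivalent Series.
column = table['Product A']
print(column.equals(sales))
```

Going in either direction keeps the row labels, which is why the index parameter behaves the same way for both constructors.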


## Reading data files `pd.read_csv()`, `to_csv()`

Data can be stored in any of a number of different forms and formats. By far the most basic of these is the humble CSV file. When you open a CSV file you get something that looks like this:
```
Product A,Product B,Product C,
30,21,9,
35,34,1,
41,11,11
```
So a CSV file is a table of values separated by commas. Hence the name: **"Comma-Separated Values"**, or CSV.

Let's now set aside our toy datasets and see what a real dataset looks like when we read it into a DataFrame. We'll use the `pd.read_csv()` function to read the data into a DataFrame.

In [7]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("zynicide/wine-reviews") + "/winemag-data_first150k.csv"

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/zynicide/wine-reviews?dataset_version_number=4...


100%|██████████| 50.9M/50.9M [00:02<00:00, 22.8MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/zynicide/wine-reviews/versions/4/winemag-data_first150k.csv


In [8]:
wine_reviews = pd.read_csv(path)

wine_reviews.shape

(150930, 11)

In [9]:
wine_reviews.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude


The `pd.read_csv()` function is well-endowed, with over 30 optional parameters you can specify. For example, we can see in this dataset that the CSV file has a built-in index, which pandas did not pick up on automatically. To make pandas use that column for the index (instead of creating a new one from scratch), we can specify an `index_col`.

In [10]:
wine_reviews = pd.read_csv(path, index_col=0)

wine_reviews.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude


In [11]:
animals = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]}, index=['Year 1', 'Year 2'])
animals

Unnamed: 0,Cows,Goats
Year 1,12,22
Year 2,20,19


In [12]:
animals.to_csv('animals.csv')
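
Writing and reading are two halves of the same round trip. As a small sketch on the toy `animals` frame, the code below keeps the CSV text in memory (calling `to_csv()` without a path returns a string) and reads it back with and without `index_col`; the names `csv_text`, `naive` and `restored` are illustrative.

```python
import io

import pandas as pd

# Toy frame mirroring the `animals` example above.
animals = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]},
                       index=['Year 1', 'Year 2'])

# With no path argument, to_csv() returns the CSV text instead of writing a file.
csv_text = animals.to_csv()
print(csv_text)

# Without index_col, the saved row labels come back as an ordinary column.
naive = pd.read_csv(io.StringIO(csv_text))
print(naive.columns.tolist())

# With index_col=0, the first column is used as the index again.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored.equals(animals))
```

This is the same situation we met with the wine reviews file: the index was saved as the first column, so it has to be requested explicitly when reading.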

# Indexing, Selecting & Assigning

In `Python`, we can access the property of an object by accessing it as an `attribute`. A book object, for example, might have a title property, which we can access by calling `book.title`. Columns in a pandas DataFrame work in much the same way.

In [13]:
# Select column called country as an attribute
wine_reviews.country

Unnamed: 0,country
0,US
1,Spain
...,...
150928,France
150929,Italy


If we have a Python dictionary, we can access its values using the indexing `([])` operator. We can do the same with columns in a DataFrame:

In [14]:
# Select the column called description with the indexing operator
desc = wine_reviews['description']
desc

Unnamed: 0,description
0,This tremendous 100% varietal wine hails from ...
1,"Ripe aromas of fig, blackberry and cassis are ..."
...,...
150928,"A perfect salmon shade, with scents of peaches..."
150929,More Pinot Grigios should taste like this. A r...


In [15]:
type(desc)

These are the two ways of selecting a specific Series out of a DataFrame. Neither of them is more or less syntactically valid than the other, but the indexing operator `[]` does have the advantage that *it can handle column names with reserved characters in them* (e.g. if we had a `country providence` column, `wine_reviews.country providence` wouldn't work).
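
A minimal sketch of that difference, on a made-up two-row frame (the `country providence` column and its values are hypothetical, taken from the wording above rather than from the dataset):

```python
import pandas as pd

# Hypothetical toy frame; 'country providence' is a made-up column name.
df = pd.DataFrame({'country': ['US', 'Spain'],
                   'country providence': ['California', 'Northern Spain']})

by_attribute = df.country               # works: the name is a valid identifier
by_brackets = df['country']             # same Series, selected with []
with_space = df['country providence']   # only [] can reach a name with a space

print(by_attribute.equals(by_brackets))
print(with_space.tolist())
```

Writing `df.country providence` would not even parse as Python, so `[]` is the only option for such names.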

A pandas Series is like a vector, so it is no surprise that to drill down to a single specific value, we need only use the indexing operator `[]` once more:

In [16]:
wine_reviews['country'][0]

'US'

## Indexing with `loc` and `iloc`

The indexing operator and attribute selection are nice because they work just like they do in the rest of the Python ecosystem. As a novice, this makes them easy to pick up and use. However, pandas has its own accessor operators, `loc` and `iloc`. For more advanced operations, these are the ones you're supposed to be using.

- `iloc`: It treats the DataFrame **as a matrix**.
- `loc`: It treats it __as a table__ with its indices.

### Index-based selection

Pandas indexing works in one of two paradigms. The first is **index-based selection**: selecting data based on its numerical position in the data. `iloc` follows this paradigm.

To select **the first row** of data in a DataFrame, we may use the following:

In [17]:
# Select first row
wine_reviews.iloc[0]

Unnamed: 0,0
country,US
description,This tremendous 100% varietal wine hails from ...
...,...
variety,Cabernet Sauvignon
winery,Heitz


Both `loc` and `iloc` are **row-first, column-second**. *This is the opposite of the plain indexing operator used above*, where `wine_reviews['country'][0]` names the column first and the row second.

The big advantage of these two accessors is that **they work like matrix notation.**

This means that it's marginally easier to retrieve rows, and marginally harder to retrieve columns. To get a column with `iloc`, we can do the following:

In [18]:
wine_reviews.iloc[:, 0]

Unnamed: 0,country
0,US
1,Spain
...,...
150928,France
150929,Italy


In [19]:
wine_reviews.iloc[[0, 1, 2], 0]

Unnamed: 0,country
0,US
1,Spain
2,US


In [20]:
wine_reviews.iloc[:3, 0]

Unnamed: 0,country
0,US
1,Spain
2,US


In [21]:
wine_reviews.iloc[1:3, 0]

Unnamed: 0,country
1,Spain
2,US


Finally, it's worth knowing that **negative numbers** can be used in selection. They *count backwards from the end of the values*, so `-1` is the last position. For example, here are the last five rows of the dataset:

In [22]:
wine_reviews.iloc[-5:]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
150925,Italy,Many people feel Fiano represents southern Ita...,,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Feudi di San Gregorio
150926,France,"Offers an intriguing nose with ginger, lime an...",Cuvée Prestige,91,27.0,Champagne,Champagne,,Champagne Blend,H.Germain
150927,Italy,This classic example comes from a cru vineyard...,Terre di Dora,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Terredora
150928,France,"A perfect salmon shade, with scents of peaches...",Grand Brut Rosé,90,52.0,Champagne,Champagne,,Champagne Blend,Gosset
150929,Italy,More Pinot Grigios should taste like this. A r...,,90,15.0,Northeastern Italy,Alto Adige,,Pinot Grigio,Alois Lageder
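
To see that `iloc` really ignores the labels, here is a small sketch on a toy frame whose index is made of letters rather than numbers. The frame and its values are invented for illustration.

```python
import pandas as pd

# Toy frame with non-numeric row labels, so positions and labels cannot be confused.
df = pd.DataFrame({'country': ['US', 'Spain', 'France', 'Italy'],
                   'points': [96, 96, 95, 90]},
                  index=['w', 'x', 'y', 'z'])

first_row = df.iloc[0]       # row at position 0, returned as a Series
first_col = df.iloc[:, 0]    # every row, column at position 0
cell = df.iloc[1, 0]         # row position 1, column position 0
last_two = df.iloc[-2:]      # negative positions count from the end

print(cell)
print(last_two.index.tolist())
```

In every call the row selector comes first and the column selector second, exactly as in matrix notation.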


### Label-based selection

The second paradigm for attribute selection is the one followed by the `loc` operator: **label-based selection**. In this paradigm, **it's the data index value**, not its position, which matters.

For example, to get the country of the first entry in `wine_reviews`, we would now do the following:

In [23]:
wine_reviews.loc[0, 'country']

'US'

In [24]:
wine_reviews.loc[:, ['province', 'variety', 'points']]

Unnamed: 0,province,variety,points
0,California,Cabernet Sauvignon,96
1,Northern Spain,Tinta de Toro,96
...,...,...,...
150928,Champagne,Champagne Blend,90
150929,Northeastern Italy,Pinot Grigio,90


### Choosing between loc and iloc

When choosing or transitioning between `loc` and `iloc`, there is one "gotcha" worth keeping in mind, which is that **the two methods use slightly different indexing schemes**.

`iloc` uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So `0:10` will select entries `0,...,9`. `loc`, meanwhile, indexes inclusively. So `0:10` will select entries `0,...,10`.

Why the change? Remember that `loc` can index any stdlib type: strings, for example. If we have a DataFrame with index values `Apples, ..., Potatoes, ...`, and we want to select "all the alphabetical fruit choices between Apples and Potatoes", then it's a lot more convenient to index `df.loc['Apples':'Potatoes']` than it is to index something like `df.loc['Apples':'Potatoet']` (t coming after s in the alphabet).

This is particularly confusing when the DataFrame index is a simple numerical list, e.g. `0,...,1000`. In this case `df.iloc[0:1000]` will return 1000 entries, while `df.loc[0:1000]` returns 1001 of them! To get 1000 elements using `loc`, you will need to go one lower and ask for `df.loc[0:999]`.

Otherwise, the semantics of using loc are the same as those for iloc.

In [25]:
wine_reviews.loc[0:99, ['country', 'province', 'region_1', 'region_2']]

Unnamed: 0,country,province,region_1,region_2
0,US,California,Napa Valley,Napa
1,Spain,Northern Spain,Toro,
...,...,...,...,...
98,France,Southwest France,Buzet,
99,France,Southwest France,Côtes de Gascogne,
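
The inclusive versus exclusive behaviour described in this section is easy to verify on small frames. The sketch below uses two invented frames, `numbers` (default integer index) and `fruit` (string index); the stock values are made up.

```python
import pandas as pd

# Default integer index 0..19, so labels and positions coincide.
numbers = pd.DataFrame({'value': range(20)})

n_iloc = len(numbers.iloc[0:10])   # positions 0..9, end excluded
n_loc = len(numbers.loc[0:10])     # labels 0..10, end included
print(n_iloc, n_loc)

# String labels: the slice keeps both endpoints.
fruit = pd.DataFrame({'stock': [3, 5, 2, 8]},
                     index=['Apples', 'Bananas', 'Potatoes', 'Tomatoes'])
picked = fruit.loc['Apples':'Potatoes']
print(picked.index.tolist())
```

With label slices there is no natural "one past the end" label, which is why including the endpoint is the more convenient convention for `loc`.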


## Conditional selection `&`, `|`, `isin`, `isnull`

- and: `&`
- or: `|`

So far we've been indexing various strides of data, using structural properties of the DataFrame itself. To do *interesting* things with the data, however, we often need to ask questions based on conditions.

For example, suppose that we're interested specifically in better-than-average wines produced in a particular country.

We can start by checking whether each wine comes from that country, here Italy:

In [26]:
wine_reviews.country == 'Italy'

Unnamed: 0,country
0,False
1,False
...,...
150928,False
150929,True


This operation produced a Series of `True`/`False` booleans based on the country of each record. This result can then be used inside of `loc` to select the rows where it is `True`. The cells below apply the same kind of test to Chile:

In [27]:
wine_reviews.loc[wine_reviews.country == 'Chile']

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
155,Chile,"Lightly herbal, horsey aromas of blueberry, bl...",Ecos de Rulo Single Vineyard El Chequén Estate,89,20.0,Marchigue,,,Carmenère,Viña Bisquertt
159,Chile,"Staunch berry, cassis and spice aromas are fri...",Fina Reserva Ensamblaje Malbec-Cabernet Sauvig...,89,19.0,Colchagua Valley,,,Red Blend,Estampa
...,...,...,...,...,...,...,...,...,...,...
150904,Chile,A lot of Chilean Cabernets seem to have a dist...,,81,10.0,Maipo Valley,,,Cabernet Sauvignon,De Martino
150905,Chile,There's not much point in making a reserve-sty...,Prima Reserva,80,13.0,Maipo Valley,,,Merlot,De Martino


We also wanted to know which ones are better than average. Wines are reviewed on an 80-to-100 point scale, so this could mean wines that accrued at least 90 points.

We can use the ampersand (`&`) to bring the two questions together:

In [28]:
wine_reviews.loc[(wine_reviews.country == 'Chile') & (wine_reviews.points >= 90)]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
537,Chile,"Deep, pure cassis and red berry aromas are int...",Edición Limitada B,92,30.0,Colchagua Valley,,,Bordeaux-style Red Blend,Caliterra
833,Chile,"Spicy, herbal plum and cassis aromas mix in no...",Amplus One,90,24.0,Peumo,,,Red Blend,Santa Ema
...,...,...,...,...,...,...,...,...,...,...
150785,Chile,"Rich and complex from the start, the nose and ...",Reserva de la Familia,90,15.0,Maipo Valley,,,Chardonnay,Santa Carolina
150789,Chile,This is what vineyard selection and winemaker ...,Wild Ferment La Escultura Estate,90,22.0,Casablanca Valley,,,Chardonnay,Errazuriz


Suppose we'll buy any wine that's made in Chile or which is rated above average. For this we use a pipe (`|`):

In [29]:
wine_reviews.loc[(wine_reviews.country == 'Chile') | (wine_reviews.points >= 90)]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
...,...,...,...,...,...,...,...,...,...,...
150928,France,"A perfect salmon shade, with scents of peaches...",Grand Brut Rosé,90,52.0,Champagne,Champagne,,Champagne Blend,Gosset
150929,Italy,More Pinot Grigios should taste like this. A r...,,90,15.0,Northeastern Italy,Alto Adige,,Pinot Grigio,Alois Lageder
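
The same logic can be followed more easily on a handful of rows. This is a sketch on an invented four-row frame; it also uses `~` (element-wise "not"), which does not appear in the cells above.

```python
import pandas as pd

# Invented toy data: two Chilean wines, one Italian, one from the US.
df = pd.DataFrame({'country': ['Chile', 'Chile', 'Italy', 'US'],
                   'points': [92, 85, 91, 80]})

# Each comparison yields a boolean Series with one entry per row.
is_chile = df.country == 'Chile'
is_good = df.points >= 90

both = df.loc[is_chile & is_good]         # Chilean AND at least 90 points
either = df.loc[is_chile | is_good]       # Chilean OR at least 90 points
neither = df.loc[~(is_chile | is_good)]   # everything else

print(both.index.tolist(), either.index.tolist(), neither.index.tolist())
```

When the conditions are written inline, each one needs its own parentheses, as in the cells above: `&` and `|` bind more tightly than `==` and `>=`, so without parentheses Python would group the expression differently.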


Pandas comes with a few **built-in conditional selector methods**, two of which we will highlight here.

The first is `isin`. `isin` lets you select data whose value "is in" a list of values. For example, here's how we can use it to select wines only from Italy or France:

In [30]:
wine_reviews.loc[wine_reviews.country.isin(['France', 'Italy'])]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude
10,Italy,"Elegance, complexity and structure come togeth...",Ronco della Chiesa,95,80.0,Northeastern Italy,Collio,,Friulano,Borgo del Tiglio
...,...,...,...,...,...,...,...,...,...,...
150928,France,"A perfect salmon shade, with scents of peaches...",Grand Brut Rosé,90,52.0,Champagne,Champagne,,Champagne Blend,Gosset
150929,Italy,More Pinot Grigios should taste like this. A r...,,90,15.0,Northeastern Italy,Alto Adige,,Pinot Grigio,Alois Lageder


The second is `isnull` (and its companion `notnull`). These methods let you highlight values which are (or are not) empty (`NaN`). For example, to filter out wines lacking a price tag in the dataset, here's what we would do:

In [31]:
wine_reviews.loc[wine_reviews.price.notnull()]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
...,...,...,...,...,...,...,...,...,...,...
150928,France,"A perfect salmon shade, with scents of peaches...",Grand Brut Rosé,90,52.0,Champagne,Champagne,,Champagne Blend,Gosset
150929,Italy,More Pinot Grigios should taste like this. A r...,,90,15.0,Northeastern Italy,Alto Adige,,Pinot Grigio,Alois Lageder
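
Because the full dataset hides most rows behind `...`, the effect of these selectors is clearer on a tiny frame. The data below is invented, with two missing prices written as `float('nan')`.

```python
import pandas as pd

nan = float('nan')
df = pd.DataFrame({'country': ['France', 'Italy', 'US', 'France'],
                   'price': [66.0, nan, 235.0, nan]})

from_fr_it = df.loc[df.country.isin(['France', 'Italy'])]   # membership test
priced = df.loc[df.price.notnull()]                         # rows that have a price
unpriced = df.loc[df.price.isnull()]                        # rows that lack one

print(from_fr_it.index.tolist())
print(priced.index.tolist(), unpriced.index.tolist())
```

`notnull()` and `isnull()` split the rows into two complementary groups, so together they always account for every row.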


In [32]:
top_oceania_wines = wine_reviews.loc[wine_reviews.country.isin(['Australia', 'New Zealand']) & (wine_reviews.points >= 95)]

top_oceania_wines

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
2148,Australia,Full-bodied and plush yet vibrant and imbued w...,The Factor,98,125.0,South Australia,Barossa Valley,,Shiraz,Torbreck
2458,Australia,This is a top example of the classic Australia...,The Peake,96,150.0,South Australia,McLaren Vale,,Cabernet-Shiraz,Hickinbotham
...,...,...,...,...,...,...,...,...,...,...
150562,Australia,"As unevolved as they are, the dense and multil...",Grange,96,185.0,South Australia,South Australia,,Shiraz,Penfolds
150563,Australia,"Seamless luxury from stem to stern, this ‘baby...",RWT,95,70.0,South Australia,Barossa Valley,,Shiraz,Penfolds


## Assigning data

Going the other way, assigning data to a DataFrame is easy. You can assign:

- A constant value.
- An iterable of values.

In [33]:
wine_reviews['critic'] = 'everyone'
wine_reviews['index_backwards'] = range(len(wine_reviews), 0, -1)

In [34]:
wine_reviews.loc[:, ['critic', 'index_backwards']]

Unnamed: 0,critic,index_backwards
0,everyone,150930
1,everyone,150929
...,...,...
150928,everyone,2
150929,everyone,1
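
Both kinds of assignment can be tried on a three-row frame. The sketch below also shows assignment through `loc` with a boolean condition, which goes slightly beyond the cells above; the frame and the value `'someone'` are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({'country': ['US', 'Spain', 'France']})

# A constant is repeated for every row.
df['critic'] = 'everyone'

# An iterable must supply exactly one value per row.
df['index_backwards'] = range(len(df), 0, -1)

# Assignment through loc only touches the selected rows.
df.loc[df.country == 'US', 'critic'] = 'someone'

print(df)
```

If the iterable had a different length from the DataFrame, pandas would raise a `ValueError` instead of guessing how to fill the column.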


# Summary, Functions and Maps

In this part we will extract useful insights from the data.


## Summary functions (`describe()`, `mean()`, `value_counts()`, `unique()`)

Pandas provides many simple "**summary functions**" (not an official name) which restructure the data in some useful way.

- `describe()` method: generates *a high-level summary* of the attributes of the given column.  **It is type-aware**, meaning that its output changes based on the data type of the input.

In [35]:
wine_reviews.points.describe()

Unnamed: 0,points
count,150930.000000
mean,87.888418
...,...
75%,90.000000
max,100.000000


In [36]:
wine_reviews.country.describe()

Unnamed: 0,country
count,150925
unique,48
top,US
freq,62397


If you want to get some particular simple summary statistic about a column in a DataFrame or a Series, **there is usually a helpful pandas function that makes it happen**.

For example, to see the mean of the points allotted (i.e. how well an averagely rated wine does), we can use the `mean()` function:

In [37]:
wine_reviews.points.mean()

87.8884184721394

To see a list of unique values we can use the `unique()` function:

In [38]:
wine_reviews.country.unique()

array(['US', 'Spain', 'France', 'Italy', 'New Zealand', 'Bulgaria',
       'Argentina', 'Australia', 'Portugal', 'Israel', 'South Africa',
       'Greece', 'Chile', 'Morocco', 'Romania', 'Germany', 'Canada',
       'Moldova', 'Hungary', 'Austria', 'Croatia', 'Slovenia', nan,
       'India', 'Turkey', 'Macedonia', 'Lebanon', 'Serbia', 'Uruguay',
       'Switzerland', 'Albania', 'Bosnia and Herzegovina', 'Brazil',
       'Cyprus', 'Lithuania', 'Japan', 'China', 'South Korea', 'Ukraine',
       'England', 'Mexico', 'Georgia', 'Montenegro', 'Luxembourg',
       'Slovakia', 'Czech Republic', 'Egypt', 'Tunisia', 'US-France'],
      dtype=object)

To see a list of unique values and how often they occur in the dataset, we can use the `value_counts()` method:

In [39]:
wine_reviews.country.value_counts().head()

Unnamed: 0_level_0,count
country,Unnamed: 1_level_1
US,62397
Italy,23478
France,21098
Spain,8268
Chile,5816


## Maps (`map()`, `apply()`)

A `map` is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values. In data science we often have a need for **creating new representations from existing data**, or for **transforming data from the format it is in now to the format that we want it to be in later**. Maps are what handle this work, making them extremely important for getting your work done!

There are two mapping methods that you will use often.

- `map()` is the first, and slightly simpler one. The function you pass to `map()` should expect a single value from the Series (a point value, in the example below), and return a transformed version of that value. `map()` returns a new Series where all the values have been transformed by your function.

For example, suppose that we wanted to remean the scores the wines received to 0 (centering the average to 0). We can do this as follows:

In [40]:
review_points_mean = wine_reviews.points.mean()

wine_reviews.points.map(lambda p: p - review_points_mean)


Unnamed: 0,points
0,8.111582
1,8.111582
...,...
150928,2.111582
150929,2.111582


- `apply()` is the equivalent method if we want to transform a whole DataFrame by calling a custom method **on each row** or **on each column**.

In [41]:
def remean_points(row):
  row.points = row.points - review_points_mean
  return row

wine_reviews.iloc[:1000].apply(remean_points, axis='columns').head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery,critic,index_backwards
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,8.111582,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz,everyone,150930
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,8.111582,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez,everyone,150929
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,8.111582,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley,everyone,150928
3,US,"This spent 20 months in 30% new French oak, an...",Reserve,8.111582,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi,everyone,150927
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,7.111582,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude,everyone,150926


Note that `map()` and `apply()` return new, transformed Series and DataFrames, respectively. They don't modify the original data they're called on. If we look at the first row of `wine_reviews`, we can see that it still has its original points value.

In [42]:
wine_reviews

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery,critic,index_backwards
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz,everyone,150930
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez,everyone,150929
...,...,...,...,...,...,...,...,...,...,...,...,...
150928,France,"A perfect salmon shade, with scents of peaches...",Grand Brut Rosé,90,52.0,Champagne,Champagne,,Champagne Blend,Gosset,everyone,2
150929,Italy,More Pinot Grigios should taste like this. A r...,,90,15.0,Northeastern Italy,Alto Adige,,Pinot Grigio,Alois Lageder,everyone,1


Now, the above isn't the best method to remean the variable points. **Pandas provides many common mapping operations as built-ins**. For example, here's a faster way of remeaning our points column:

In [43]:
review_points_mean = wine_reviews.points.mean()
wine_reviews.points - review_points_mean

Unnamed: 0,points
0,8.111582
1,8.111582
...,...
150928,2.111582
150929,2.111582


In this code we are performing an operation between a lot of values on the left-hand side (everything in the Series) and a single value on the right-hand side (the mean value). Pandas looks at this expression and figures out that we must mean to subtract that mean value from every value in the dataset.

Pandas will also understand what to do if we perform these operations between Series of equal length. For example, an easy way of combining country and region information in the dataset would be to do the following:

In [44]:
wine_reviews.country + ' - ' + wine_reviews.region_1

Unnamed: 0,0
0,US - Napa Valley
1,Spain - Toro
...,...
150928,France - Champagne
150929,Italy - Alto Adige


**These operators are faster than** `map()` or `apply()` because **they use speed ups built into pandas**. All of the standard Python operators (`>`, `<`, `==`, and so on) work in this manner.

However, they are not as flexible as `map()` or `apply()`, which can do more advanced things, like applying conditional logic, which cannot be done with addition and subtraction alone.
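
As an illustration of that extra flexibility, here is a sketch with conditional logic in both methods. The three-row frame, the `'high'`/`'regular'` labels and the `is_bargain` rule are all invented for this example.

```python
import pandas as pd

df = pd.DataFrame({'points': [96, 88, 91],
                   'price': [235.0, 12.0, 18.0]})

# map(): the function receives one value and returns one value.
labels = df.points.map(lambda p: 'high' if p >= 90 else 'regular')

# apply() with axis='columns': the function receives a whole row.
def is_bargain(row):
    return row.points >= 90 and row.price <= 20

bargains = df.apply(is_bargain, axis='columns')

print(labels.tolist())
print(bargains.tolist())
print(df.points.tolist())   # the original column is unchanged
```

The price of this flexibility is speed: the Python function is called once per value or per row, whereas the built-in operators work on the whole column at once.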

## Exercises

**1.**
Select the best bargain, i.e. the wine with the highest points-to-price ratio, and report its winery.

In [45]:
# We create a Series of the points-to-price ratio
reviews = wine_reviews.loc[wine_reviews.price.notnull()]
ratio_points_price = reviews.points / reviews.price

In [46]:
# We select the best bargain using the 'idxmax()' method
bargain_wine = reviews.loc[ratio_points_price.idxmax(), 'winery']

bargain_wine

'Bandit'

**2.**
There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"?

Create a Series `descriptor_counts` counting how many times each of these two words appears in the description column in the dataset. (For simplicity, let's ignore the capitalized versions of these words.)

In [47]:
n_trop = wine_reviews.description.map(lambda p: 'tropical' in p).sum()
n_fruity = wine_reviews.description.map(lambda p: 'fruity' in p).sum()

descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
descriptor_counts

Unnamed: 0,0
tropical,4135
fruity,8669


__3.__
We'd like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand - we'd like to translate them into simple star ratings. **A score of 95 or higher counts as 3 stars, a score of at least 85 but less than 95 is 2 stars. Any other score is 1 star.**

Also, the Canadian Vintners Association bought a lot of ads on the site, so any **wines from Canada should automatically get 3 stars, regardless of points**.

Create a series `star_ratings` with the number of stars corresponding to each review in the dataset.

In [48]:
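def get_stars(row):
  # Assumes the row already has a `star_ratings` entry (the column is created in the next cell).
  if row.country == 'Canada' or row.points >= 95:
    # Canadian wines get 3 stars regardless of points, as does any wine with 95+ points
    row.star_ratings = 3
    return row
  elif 85 <= row.points < 95:
    # At least 85 but less than 95 points: 2 stars
    row.star_ratings = 2
    return row
  else:
    # Any other score: 1 star
    row.star_ratings = 1
    return row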


In [49]:
wine_reviews['star_ratings'] = 0

star_ratings = wine_reviews.apply(get_stars, axis='columns').star_ratings

star_ratings

Unnamed: 0,star_ratings
0,3
1,3
...,...
150928,2
150929,2


# Grouping and Sorting (`groupby()`, `agg()`)

Maps allow us to transform data in a DataFrame or Series one value at a time for an entire column. However, often we want to group our data, and then do something specific to the group the data is in.

As we'll learn, we do this with the `groupby()` operation. We'll also cover some additional topics, such as more complex ways to index your DataFrames, along with how to sort your data.

## Groupwise analysis

One function we've been using heavily thus far is the `value_counts()` function. We can replicate what `value_counts()` does by doing the following:

In [50]:
wine_reviews.groupby('points').points.count()

Unnamed: 0_level_0,points
points,Unnamed: 1_level_1
80,898
81,1502
...,...
99,50
100,24


`groupby()` created a group of reviews which allotted the same point values to the given wines. Then, for each of these groups, we grabbed the `points` column and counted how many times it appeared. `value_counts()` is just a shortcut to this `groupby()` operation.
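
The equivalence can be checked on a few invented point values. One detail to keep in mind: `value_counts()` orders its result by frequency, while `groupby()` orders it by group key, so the sketch sorts the former by index before comparing.

```python
import pandas as pd

df = pd.DataFrame({'points': [90, 88, 90, 95, 88, 90]})

# Count the rows in each group of equal point values.
by_groupby = df.groupby('points').points.count()

# value_counts() gives the same counts, ordered by frequency by default.
by_value_counts = df.points.value_counts().sort_index()

print(by_groupby.tolist())
print(by_value_counts.tolist())
```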

We can use any of the summary functions we've used before with this data. For example, to get the cheapest wine in each point value category, we can do the following:

In [51]:
wine_reviews.groupby('points').price.min()

Unnamed: 0_level_0,price
points,Unnamed: 1_level_1
80,5.0
81,5.0
...,...
99,65.0
100,65.0


You can think of each group we generate as being a slice of our DataFrame containing only data with values that match. This DataFrame is accessible to us directly using the `apply()` method, and we can then manipulate the data in any way we see fit. For example, here's one way of selecting the name of the first wine reviewed from each winery in the dataset:

In [52]:
wine_reviews.groupby('winery').apply(lambda df: df.designation.iloc[0])


Unnamed: 0_level_0,0
winery,Unnamed: 1_level_1
'37 Cellars,
1+1=3,Cabernet Sauvignon
...,...
áster,Crianza
Štoka,Grganja


For even more fine-grained control, you can also group by more than one column. For example, here's how we would pick out the best wine by country and province:

In [53]:
wine_reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()])


Unnamed: 0_level_0,Unnamed: 1_level_0,country,description,designation,points,price,province,region_1,region_2,variety,winery,critic,index_backwards,star_ratings
country,province,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
Albania,Mirditë,Albania,This garnet-colored wine made from 100% Kallme...,,88,20.0,Mirditë,,,Kallmet,Arbëri,everyone,146288,0
Argentina,Mendoza Province,Argentina,"If the color doesn't tell the full story, the ...",Nicasia Vineyard,97,120.0,Mendoza Province,Mendoza,,Malbec,Bodega Catena Zapata,everyone,85599,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Uruguay,San Jose,Uruguay,While this ranks as one of the best Uruguayan ...,El Preciado Premier Gran Reserva,89,60.0,San Jose,,,Red Blend,Castillo Viejo,everyone,80773,0
Uruguay,Uruguay,Uruguay,"They call it Special Barrel, and one sniff tel...",Special Barrel,89,50.0,Uruguay,,,Tannat,Bouza,everyone,18448,0
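
Both `apply()` patterns above can also be written without passing whole sub-DataFrames to the lambda. The following is only a sketch of one alternative, on an invented four-row frame (the designations `'Trailside'` and `'Classico'` are made up): it selects the column before calling `apply()`, and uses `idxmax()` per group to find the best rows. Depending on the pandas version, calling `apply()` on a grouped DataFrame may emit a deprecation warning about the grouping columns; selecting a column first sidesteps that.

```python
import pandas as pd

df = pd.DataFrame({'winery': ['Heitz', 'Ponzi', 'Heitz', 'Ponzi'],
                   'designation': ["Martha's Vineyard", 'Reserve', 'Trailside', 'Classico'],
                   'points': [96, 96, 93, 90]})

# First designation reviewed for each winery: pick the column, then apply per group.
first_designation = df.groupby('winery').designation.apply(lambda s: s.iloc[0])

# Best-rated row per winery: idxmax() returns the row label of the maximum in each group.
best_row_labels = df.groupby('winery').points.idxmax()
best_rows = df.loc[best_row_labels]

print(first_designation.to_dict())
print(best_rows.index.tolist())
```

Unlike the `apply()` version, `best_rows` keeps the original row labels as its index rather than the group keys.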


- **`agg()` method:** Another `groupby()` method worth mentioning is `agg()`, which lets you run a bunch of different functions on your DataFrame simultaneously. For example, we can generate a simple statistical summary of the dataset as follows:

In [54]:
wine_reviews.groupby(['country']).price.agg([len, 'min', 'max']).head()

Unnamed: 0_level_0,len,min,max
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Albania,2,20.0,20.0
Argentina,5631,4.0,250.0
Australia,4957,5.0,850.0
Austria,3057,8.0,1100.0
Bosnia and Herzegovina,4,12.0,13.0
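The same idea can be tried on a small made-up table. The sketch below uses hypothetical data and hypothetical output column names (`n_priced`, `cheapest`, `priciest`); it relies on the keyword form of `agg()`, which lets us name each output column. Note that the string `'count'` skips missing values, whereas `len` (used above) counts every row.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data; one US price is missing on purpose
toy = pd.DataFrame({
    'country': ['US', 'US', 'US', 'Italy'],
    'price': [10.0, 30.0, np.nan, 20.0],
})

# Each keyword becomes a column in the result
summary = toy.groupby('country').price.agg(
    n_priced='count',   # number of non-missing prices
    cheapest='min',
    priciest='max',
)
print(summary)
```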


What is the best wine I can buy for a given amount of money?

In [55]:
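# One option: apply a function that looks up the top score within each price group
best_rating_per_price = wine_reviews.groupby(['price']).apply(lambda df: df.points.loc[df.points.idxmax()])
# Simpler option giving the same values (it overwrites the result above): take the maximum score per price
best_rating_per_price = wine_reviews.groupby(['price']).points.max().sort_index()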

best_rating_per_price


Unnamed: 0_level_0,points
price,Unnamed: 1_level_1
4.0,86
5.0,90
...,...
2013.0,91
2300.0,99


In [56]:
price_extremes = wine_reviews.groupby('variety').price.agg(['min', 'max'])
price_extremes

Unnamed: 0_level_0,min,max
variety,Unnamed: 1_level_1,Unnamed: 2_level_1
Agiorgitiko,8.0,65.0
Aglianico,6.0,130.0
...,...,...
Zweigelt,9.0,70.0
Žilavka,13.0,15.0


## Multi-indexes

In all of the examples we've seen thus far we've been working with DataFrame or Series objects with a single-label index. `groupby()` is slightly different in that, depending on the operation we run, it will sometimes produce what is called a multi-index.

A multi-index differs from a regular index in that it has multiple levels, and it has its own type `pandas.core.indexes.multi.MultiIndex`. For example:

In [57]:
countries_reviewed = wine_reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed

Unnamed: 0_level_0,Unnamed: 1_level_0,len
country,province,Unnamed: 2_level_1
Albania,Mirditë,2
Argentina,Mendoza Province,4742
...,...,...
Uruguay,San Jose,15
Uruguay,Uruguay,18


In [58]:
type(countries_reviewed.index)

pandas.core.indexes.multi.MultiIndex

Multi-indices have several methods for dealing with their tiered structure which are absent for single-level indices. They also require two levels of labels to retrieve a value. Dealing with multi-index output is a common "gotcha" for users new to pandas.

With `loc`, the labels for the index levels are passed together as a tuple: `loc[(index1, index2, ...), 'column']`.
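As a minimal sketch of this tuple syntax, here is a small made-up table (the `toy` DataFrame and the `n_reviews` column are hypothetical) indexed by country and province:

```python
import pandas as pd

# Hypothetical toy data with a two-level index
toy = pd.DataFrame({
    'country': ['US', 'US', 'Italy'],
    'province': ['California', 'Washington', 'Tuscany'],
    'n_reviews': [5, 3, 4],
}).set_index(['country', 'province']).sort_index()

# A full tuple of labels plus a column name returns a single value
california = toy.loc[('US', 'California'), 'n_reviews']

# Giving only the first level returns every row under that label
us_rows = toy.loc['US']

print(california)
print(us_rows)
```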

In general the multi-index method you will use most often is the one for **converting back to a regular index**, the `reset_index()` method:

In [59]:
countries_reviewed.reset_index()

Unnamed: 0,country,province,len
0,Albania,Mirditë,2
1,Argentina,Mendoza Province,4742
...,...,...,...
453,Uruguay,San Jose,15
454,Uruguay,Uruguay,18


## Sorting `sort_values(by=)`

Looking again at `countries_reviewed` we can see that grouping returns data in index order, not in value order. That is to say, when outputting the result of a `groupby`, the order of the rows depends on the values in the index, not on the data.

To get the data in the order we want, we can sort it ourselves. The `sort_values()` method is handy for this.


In [60]:
countries_reviewed = countries_reviewed.reset_index()
countries_reviewed.sort_values(by='len')

Unnamed: 0,country,province,len
154,Greece,Central Greece,1
207,Greece,Zitsa,1
...,...,...,...
442,US,Washington,9750
422,US,California,44508


`sort_values()` defaults to an ascending sort, where the lowest values go first. However, **most of the time we want a descending sort**, where the higher numbers go first. To get one, pass `ascending=False`:

In [61]:
countries_reviewed.sort_values(by='len', ascending=False)

Unnamed: 0,country,province,len
422,US,California,44508
442,US,Washington,9750
...,...,...,...
413,Switzerland,Vino da Tavola della Svizzera Italiana,1
175,Greece,Krania Olympus,1


Finally, know that you can **sort by more than one column at a time**, giving each column its own sort direction:

In [62]:
price_extremes.sort_values(by=['min', 'max'], ascending=[True, False])

Unnamed: 0_level_0,min,max
variety,Unnamed: 1_level_1,Unnamed: 2_level_1
Chardonnay,4.0,2013.0
Cabernet Sauvignon,4.0,625.0
...,...,...
Terret Blanc,,
Zelen,,


In [63]:
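# Count the reviews for each (country, variety) pair, most common first
country_variety_counts = reviews.groupby(['country', 'variety']).variety.count().sort_values(ascending=False)
# Same counts via size(), which overwrites the result above
country_variety_counts = reviews.groupby(['country', 'variety']).size().sort_values(ascending=False)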

country_variety_counts

Unnamed: 0_level_0,Unnamed: 1_level_0,0
country,variety,Unnamed: 2_level_1
US,Pinot Noir,10265
US,Cabernet Sauvignon,9142
...,...,...
Moldova,Bordeaux-style Red Blend,1
South Africa,Chenin Blanc-Viognier,1


# Data Types and Missing Values

Now we will learn how to investigate data types within a DataFrame or Series, and how to find and replace entries.

## Dtypes and `astype()`

The data type for a column in a DataFrame or a Series is known as the **dtype**.

You can use the `dtype` attribute to grab the type of a specific column. For instance, we can get the dtype of the `price` column in the `wine_reviews` DataFrame:

In [64]:
wine_reviews.price.dtype

dtype('float64')

Alternatively, the `dtypes` property (**note the plural!**) returns the **dtype of every column** in the DataFrame:

In [65]:
wine_reviews.dtypes

Unnamed: 0,0
country,object
description,object
...,...
index_backwards,int64
star_ratings,int64


Data types tell us something about how pandas is storing the data internally. `float64` means that it's using a **64-bit floating point number**; `int64` means a similarly sized integer instead, and so on.

One peculiarity to keep in mind (and on display very clearly here) is that **columns consisting entirely of strings do not get their own type**; they are instead **given the object type**.

It's possible to convert a column of one type into another wherever such a conversion makes sense by using the `astype()` method. For example, we may transform the points column from its existing `int64` data type into a `float64` data type:

In [66]:
wine_reviews.points.astype('float64')

Unnamed: 0,points
0,96.0
1,96.0
...,...
150928,90.0
150929,90.0


In [67]:
wine_reviews.points.astype(str)

Unnamed: 0,points
0,96
1,96
...,...
150928,90
150929,90


A DataFrame or Series index has its own `dtype`, too:

In [68]:
wine_reviews.index.dtype

dtype('int64')

## Missing data (`pd.isnull()`, `pd.notnull()`, `fillna()`, `replace()`)

Entries missing values are given the value `NaN`, **short for "Not a Number"**. For technical reasons these `NaN` values are always of the `float64` dtype.
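A quick way to see this effect is to build two small Series, one complete and one with a gap. This is only an illustrative sketch with made-up values:

```python
import pandas as pd
import numpy as np

whole = pd.Series([1, 2, 3])          # no missing values: stored as integers
with_gap = pd.Series([1, 2, np.nan])  # one missing value: stored as floats

print(whole.dtype)
print(with_gap.dtype)
```

The second Series is stored as `float64` even though its non-missing values are whole numbers.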

Pandas provides some methods specific to missing data. To select `NaN` entries you can use the `pd.isnull()` function (or its companion `pd.notnull()`, which selects the non-missing ones). It is meant to be used in the following way:

In [69]:
wine_reviews[pd.isnull(wine_reviews.country)]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery,critic,index_backwards,star_ratings
1133,,Delicate white flowers and a spin of lemon pee...,Askitikos,90,17.0,,,,Assyrtiko,Tsililis,everyone,149797,0
1440,,"A blend of 60% Syrah, 30% Cabernet Sauvignon a...",Shah,90,30.0,,,,Red Blend,Büyülübağ,everyone,149490,0
68226,,"From first sniff to last, the nose never makes...",Piedra Feliz,81,15.0,,,,Pinot Noir,Chilcas,everyone,82704,0
113016,,"From first sniff to last, the nose never makes...",Piedra Feliz,81,15.0,,,,Pinot Noir,Chilcas,everyone,37914,0
135696,,"From first sniff to last, the nose never makes...",Piedra Feliz,81,15.0,,,,Pinot Noir,Chilcas,everyone,15234,0


**Replacing missing values is a common operation**. Pandas provides a really handy method for this problem: `fillna()`. `fillna()` provides a few different strategies for mitigating such data. For example, **we can simply replace each `NaN` with an "Unknown"**:

In [70]:
wine_reviews.region_2.fillna('Unknown')

Unnamed: 0,region_2
0,Napa
1,Unknown
...,...
150928,Unknown
150929,Unknown


Or we could fill each missing value with the first non-null value that appears sometime after the given record in the database. **This is known as the backfill strategy** (useful for time dependent columns).
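As a minimal sketch of the backfill strategy, the example below uses a made-up Series (the `readings` name and values are hypothetical) and the `bfill()` method:

```python
import pandas as pd
import numpy as np

# Hypothetical ordered measurements with gaps
readings = pd.Series([np.nan, 2.0, np.nan, np.nan, 5.0, np.nan])

# Each gap takes the next non-null value that follows it
filled = readings.bfill()
print(filled)
```

The last entry stays missing, because no later value exists to fill it from.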

Alternatively, **we may have a non-null value that we would like to replace**. For example, suppose we would prefer the country code `US` to be written as `USA` throughout the dataset. One way to reflect this is the `replace()` method:

In [71]:
wine_reviews.country.replace('US', 'USA')

Unnamed: 0,country
0,USA
1,Spain
...,...
150928,France
150929,Italy


The `replace()` method is worth mentioning here because it's handy for replacing missing data which is given some kind of sentinel value in the dataset: things like "Unknown", "Undisclosed", "Invalid", and so on.
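For instance, sentinel strings can be turned back into real missing values so that `pd.isnull()` and `fillna()` see them. The sketch below uses a made-up Series; the particular sentinel strings are assumptions for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical column where missing data was recorded as text
regions = pd.Series(['Napa', 'Unknown', 'Toro', 'Undisclosed'])

# Map every sentinel string to NaN in one call
cleaned = regions.replace(['Unknown', 'Undisclosed'], np.nan)
print(cleaned.isnull().sum())
```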

In [72]:
# What are the most common wine-producing regions?
reviews_per_region = wine_reviews.region_1.fillna('Unknown').value_counts().sort_values(ascending=False)

reviews_per_region

Unnamed: 0_level_0,count
region_1,Unnamed: 1_level_1
Unknown,25060
Napa Valley,6209
...,...
Mâcon-Mancey,1
Coteaux du Tricastin,1


# Renaming and Combining

Data comes in from many sources. We will learn how to help it all make sense together.

Oftentimes data will come to us with column names, index names, or other naming conventions that we are not satisfied with. In that case, we'll learn how to use pandas functions to change the names of the offending entries to something better.

We'll also explore how to combine data from multiple DataFrames and/or Series.

## Renaming `rename()`

The first function we'll introduce here is `rename()`, which lets you change index names and/or column names. For example, to change the `points` column in our dataset to `score` (and, in the same call, `region_1` to `region` and `region_2` to `locale`), we would do:

In [76]:
wine_reviews.rename(columns={'points': 'score', 'region_1': 'region', 'region_2': 'locale'})

Unnamed: 0,country,description,designation,score,price,province,region,locale,variety,winery,critic,index_backwards,star_ratings
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz,everyone,150930,0
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez,everyone,150929,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
150928,France,"A perfect salmon shade, with scents of peaches...",Grand Brut Rosé,90,52.0,Champagne,Champagne,,Champagne Blend,Gosset,everyone,2,0
150929,Italy,More Pinot Grigios should taste like this. A r...,,90,15.0,Northeastern Italy,Alto Adige,,Pinot Grigio,Alois Lageder,everyone,1,0


`rename()` lets you rename index or column values by specifying an `index` or `columns` keyword parameter, respectively. It supports a variety of input formats, but usually **a Python dictionary is the most convenient**. Here is an example using it to rename some elements of the index:

In [74]:
wine_reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'})

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery,critic,index_backwards,star_ratings
firstEntry,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz,everyone,150930,0
secondEntry,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez,everyone,150929,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
150928,France,"A perfect salmon shade, with scents of peaches...",Grand Brut Rosé,90,52.0,Champagne,Champagne,,Champagne Blend,Gosset,everyone,2,0
150929,Italy,More Pinot Grigios should taste like this. A r...,,90,15.0,Northeastern Italy,Alto Adige,,Pinot Grigio,Alois Lageder,everyone,1,0


You'll probably rename columns very often, but rename index values very rarely. For that, `set_index()` is usually more convenient.
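As a minimal sketch of that alternative, the example below promotes an existing column to the index with `set_index()`. The `toy` DataFrame is made up for illustration:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'winery': ['Heitz', 'Gosset'],
    'points': [96, 90],
})

# Use the winery column as row labels instead of 0, 1, ...
by_winery = toy.set_index('winery')

# Rows can now be looked up by winery name
heitz_points = by_winery.loc['Heitz', 'points']
print(heitz_points)
```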

Both the row index and the column index can have their own name attribute. The complementary `rename_axis()` method may be used to change these names. For example:

In [75]:
wine_reviews.rename_axis('wines', axis='rows').rename_axis('fields', axis='columns')

fields,country,description,designation,points,price,province,region_1,region_2,variety,winery,critic,index_backwards,star_ratings
wines,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz,everyone,150930,0
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez,everyone,150929,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
150928,France,"A perfect salmon shade, with scents of peaches...",Grand Brut Rosé,90,52.0,Champagne,Champagne,,Champagne Blend,Gosset,everyone,2,0
150929,Italy,More Pinot Grigios should taste like this. A r...,,90,15.0,Northeastern Italy,Alto Adige,,Pinot Grigio,Alois Lageder,everyone,1,0


## Combining `pd.concat()`, `.join()`

When performing operations on a dataset, we will sometimes need to **combine different DataFrames and/or Series in non-trivial ways**. Pandas has three core methods for doing this. In order of increasing complexity, these are `pd.concat()`, `.join()`, and `merge()`. Most of what `merge()` can do can also be done more simply with `.join()`, so we will omit it and focus on the first two functions here.

The simplest combining method is `pd.concat()`. Given a list of elements, this function will smush those elements together along an axis.

This is useful when we have data in different DataFrame or Series objects but having the same fields (columns). One example: the YouTube Videos dataset, which splits the data up based on country of origin (e.g. Canada and the UK, in this example). If we want to study multiple countries simultaneously, we can use `pd.concat()` to smush them together:

In [79]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("datasnaek/youtube-new")

print("Path to dataset files:", path)

Path to dataset files: /root/.cache/kagglehub/datasets/datasnaek/youtube-new/versions/115


In [80]:
canadian_youtube = pd.read_csv(path + "/CAvideos.csv")
british_youtube = pd.read_csv(path + "/GBvideos.csv")

pd.concat([canadian_youtube, british_youtube])

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,n1WpP7iowLc,17.14.11,Eminem - Walk On Water (Audio) ft. Beyoncé,EminemVEVO,10,2017-11-10T17:00:03.000Z,"Eminem|""Walk""|""On""|""Water""|""Aftermath/Shady/In...",17158579,787425,43420,125882,https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg,False,False,False,Eminem's new track Walk on Water ft. Beyoncé i...
1,0dBIkQ4Mz1M,17.14.11,PLUSH - Bad Unboxing Fan Mail,iDubbbzTV,23,2017-11-13T17:00:00.000Z,"plush|""bad unboxing""|""unboxing""|""fan mail""|""id...",1014651,127794,1688,13030,https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg,False,False,False,STill got a lot of packages. Probably will las...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38914,-DRsfNObKIQ,18.14.06,Eleni Foureira - Fuego - Cyprus - LIVE - First...,Eurovision Song Contest,24,2018-05-08T20:32:32.000Z,"Eurovision Song Contest|""2018""|""Lisbon""|""Cypru...",14317515,151870,45875,26766,https://i.ytimg.com/vi/-DRsfNObKIQ/default.jpg,False,False,False,Eleni Foureira represented Cyprus at the first...
38915,4YFo4bdMO8Q,18.14.06,KYLE - Ikuyo feat. 2 Chainz & Sophia Black [A...,SuperDuperKyle,10,2018-05-11T04:06:35.000Z,"Kyle|""SuperDuperKyle""|""Ikuyo""|""2 Chainz""|""Soph...",607552,18271,274,1423,https://i.ytimg.com/vi/4YFo4bdMO8Q/default.jpg,False,False,False,Debut album 'Light of Mine' out now: http://ky...
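Note that `pd.concat()` keeps each input's original row labels, so the same label can appear more than once in the result. The sketch below, built on two tiny made-up tables (`ca` and `gb` are hypothetical stand-ins for the real files), shows the default behaviour next to `ignore_index=True`, which renumbers the rows:

```python
import pandas as pd

# Hypothetical stand-ins for two per-country tables with the same columns
ca = pd.DataFrame({'video_id': ['a1', 'a2'], 'views': [100, 200]})
gb = pd.DataFrame({'video_id': ['b1'], 'views': [300]})

# Default: original row labels are kept, so 0 appears twice
stacked = pd.concat([ca, gb])

# ignore_index=True assigns a fresh 0..n-1 index
renumbered = pd.concat([ca, gb], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```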


The middlemost combiner in terms of complexity is `.join()`. `.join()` lets you combine different DataFrame objects which have an index in common. This, for example, **allows you to complete information** or details about the data.

For example, to pull down videos that happened to be trending on the same day in both Canada and the UK, we could do the following:

In [81]:
left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])

left.join(right, lsuffix='_CAN', rsuffix='_UK')

Unnamed: 0_level_0,Unnamed: 1_level_0,video_id_CAN,channel_title_CAN,category_id_CAN,publish_time_CAN,tags_CAN,views_CAN,likes_CAN,dislikes_CAN,comment_count_CAN,thumbnail_link_CAN,...,tags_UK,views_UK,likes_UK,dislikes_UK,comment_count_UK,thumbnail_link_UK,comments_disabled_UK,ratings_disabled_UK,video_error_or_removed_UK,description_UK
title,trending_date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
Eminem - Walk On Water (Audio) ft. Beyoncé,17.14.11,n1WpP7iowLc,EminemVEVO,10,2017-11-10T17:00:03.000Z,"Eminem|""Walk""|""On""|""Water""|""Aftermath/Shady/In...",17158579,787425,43420,125882,https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg,...,"Eminem|""Walk""|""On""|""Water""|""Aftermath/Shady/In...",17158579.0,787420.0,43420.0,125882.0,https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg,False,False,False,Eminem's new track Walk on Water ft. Beyoncé i...
PLUSH - Bad Unboxing Fan Mail,17.14.11,0dBIkQ4Mz1M,iDubbbzTV,23,2017-11-13T17:00:00.000Z,"plush|""bad unboxing""|""unboxing""|""fan mail""|""id...",1014651,127794,1688,13030,https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Trump Advisor Grovels To Trudeau,18.14.06,lbMKLzQ4cNQ,The Young Turks,25,2018-06-13T04:00:05.000Z,"180612__TB02SorryExcuse|""News""|""Politics""|""The...",115225,2115,182,1672,https://i.ytimg.com/vi/lbMKLzQ4cNQ/default.jpg,...,,,,,,,,,,
【完整版】遇到恐怖情人該怎麼辦？2018.06.13小明星大跟班,18.14.06,POTgw38-m58,我愛小明星大跟班,24,2018-06-13T16:00:03.000Z,"吳宗憲|""吳姍儒""|""小明星大跟班""|""Sandy""|""Jacky wu""|""憲哥""|""中天...",107392,300,62,251,https://i.ytimg.com/vi/POTgw38-m58/default.jpg,...,,,,,,,,,,
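By default `.join()` performs a left join, which is why rows that only trended in Canada still appear above, with `NaN` in the `_UK` columns. To keep only the rows present in both tables, one option is `how='inner'`. The sketch below uses two tiny made-up tables (titles, dates and view counts are hypothetical) rather than the real YouTube files:

```python
import pandas as pd

# Hypothetical stand-ins for the Canadian and British tables
left = pd.DataFrame({
    'title': ['A', 'B'],
    'trending_date': ['d1', 'd1'],
    'views': [10, 20],
}).set_index(['title', 'trending_date'])

right = pd.DataFrame({
    'title': ['A', 'C'],
    'trending_date': ['d1', 'd1'],
    'views': [11, 30],
}).set_index(['title', 'trending_date'])

# Default (left) join: every row of `left` is kept, unmatched ones get NaN
left_join = left.join(right, lsuffix='_CAN', rsuffix='_UK')

# Inner join: only index entries found in both tables survive
both = left.join(right, lsuffix='_CAN', rsuffix='_UK', how='inner')

print(left_join)
print(both)
```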
