# Pandas - DataFrame

Pandas is an open source library, providing high-performance, easy-to-use data structures and data analysis tools for Python.

The DataFrame is one of Pandas' most important data structures. It's basically a way to store tabular data where you can label the rows and the columns. One way to build a DataFrame is from a dictionary.

In the exercises that follow you will be working with vehicle data from different countries. Each observation corresponds to a country and the columns give information about the number of vehicles per capita, whether people drive left or right, and so on.

Three lists are defined in the script:
- **names**, containing the country names for which data is available.
- **dr**, a list with booleans that tells whether people drive left or right in the corresponding country.
- **cpc**, the number of motor vehicles per 1000 people in the corresponding country.

These lists are combined into a dictionary: each dictionary key becomes a column label, and each value is a list holding that column's elements.

## Create

In [1]:
# Pre-defined lists
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]

# Import pandas as pd
import pandas as pd

# Create dictionary my_dict with three key:value pairs: my_dict
my_dict = {'country':names, 'drives_right':dr, 'cars_per_cap':cpc}

# Build a DataFrame cars from my_dict: cars
cars = pd.DataFrame(my_dict)

# Print cars
print(cars)

         country  drives_right  cars_per_cap
0  United States          True           809
1      Australia         False           731
2          Japan         False           588
3          India         False            18
4         Russia          True           200
5        Morocco          True            70
6          Egypt          True            45


Have you noticed that the row labels (i.e. the labels for the different observations) were automatically set to integers from 0 up to 6?

To replace them with more meaningful labels, a list `row_labels` has been created. You can use it to set the row labels of the `cars` DataFrame by assigning to its index attribute, `cars.index`.

In [2]:
import pandas as pd

# Build cars DataFrame
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
# Named my_dict rather than dict, so the built-in dict type is not shadowed
my_dict = {'country': names, 'drives_right': dr, 'cars_per_cap': cpc}
cars = pd.DataFrame(my_dict)
print(cars)

# Definition of row_labels
row_labels = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']

# Specify row labels of cars
cars.index = row_labels

# Print cars again
print(cars)

         country  drives_right  cars_per_cap
0  United States          True           809
1      Australia         False           731
2          Japan         False           588
3          India         False            18
4         Russia          True           200
5        Morocco          True            70
6          Egypt          True            45
           country  drives_right  cars_per_cap
US   United States          True           809
AUS      Australia         False           731
JAP          Japan         False           588
IN           India         False            18
RU          Russia          True           200
MOR        Morocco          True            70
EG           Egypt          True            45
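
As an alternative to assigning `cars.index` afterwards, the row labels can be supplied when the DataFrame is built. The sketch below assumes the same lists as above and uses the `index` parameter of `pd.DataFrame`; the variable name `cars_labeled` is only illustrative.

```python
import pandas as pd

names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr = [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
row_labels = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']

# Pass the row labels at construction time instead of setting cars.index later
cars_labeled = pd.DataFrame(
    {'country': names, 'drives_right': dr, 'cars_per_cap': cpc},
    index=row_labels,
)

print(cars_labeled.index.tolist())
```

Both approaches should give the same labelled table; which one to use is mostly a matter of whether the labels are known when the DataFrame is created.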


To import CSV data into Python as a Pandas DataFrame you can use `read_csv()`.

In [3]:
#cars.to_csv('data/cars.csv')

In [5]:
# Import pandas as pd
import pandas as pd

# Import the cars.csv data: cars
cars = pd.read_csv('data/cars.csv')

# Print out cars
print(cars)

  Unnamed: 0        country  drives_right  cars_per_cap
0         US  United States          True           809
1        AUS      Australia         False           731
2        JAP          Japan         False           588
3         IN          India         False            18
4         RU         Russia          True           200
5        MOR        Morocco          True            70
6         EG          Egypt          True            45


The row labels were read back as an ordinary column named `Unnamed: 0`. To fix this, specify the `index_col` argument inside `pd.read_csv()`: set it to 0 so that the first column is used as the row labels.

In [7]:
# Import pandas as pd
import pandas as pd

# Fix import by including index_col
cars = pd.read_csv('data/cars.csv', index_col=0)

# Print out cars
print(cars)

           country  drives_right  cars_per_cap
US   United States          True           809
AUS      Australia         False           731
JAP          Japan         False           588
IN           India         False            18
RU          Russia          True           200
MOR        Morocco          True            70
EG           Egypt          True            45
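
To see where the `Unnamed: 0` column comes from without needing a file on disk, here is a small round trip through an in-memory buffer. This is a sketch under assumptions: `io.StringIO` stands in for `data/cars.csv`, only three of the rows are used, and the names `csv_text`, `plain` and `restored` are illustrative.

```python
import io

import pandas as pd

cars = pd.DataFrame(
    {'country': ['United States', 'Australia', 'Japan'],
     'drives_right': [True, False, False],
     'cars_per_cap': [809, 731, 588]},
    index=['US', 'AUS', 'JAP'],
)

# to_csv writes the (unnamed) index as the first column of the CSV text
csv_text = cars.to_csv()

# Without index_col, that first column comes back as regular data
plain = pd.read_csv(io.StringIO(csv_text))
print(plain.columns.tolist())

# With index_col=0, the first column is restored as the row labels
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored.index.tolist())
```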


## Select

You can index and select Pandas DataFrames in many different ways. The simplest, but not the most powerful way, is to use square brackets.

The same cars data is imported from a CSV file as a Pandas DataFrame. To select only the `cars_per_cap` column from `cars`, you can use:

In [9]:
# Import cars data
import pandas as pd
cars = pd.read_csv('data/cars.csv', index_col = 0)

cars['cars_per_cap']

US     809
AUS    731
JAP    588
IN      18
RU     200
MOR     70
EG      45
Name: cars_per_cap, dtype: int64

In [10]:
cars[['cars_per_cap']]

     cars_per_cap
US            809
AUS           731
JAP           588
IN             18
RU            200
MOR            70
EG             45


The single-bracket version gives a Pandas Series; the double-bracket version gives a Pandas DataFrame.

In [16]:
# Print out country column as Pandas Series
print(cars['country'])
print(type(cars['country']))
print()

# Print out country column as Pandas DataFrame
print(cars[['country']])
print(type(cars[['country']]))
print()

# Print out DataFrame with country and drives_right columns
print(cars[['country', 'drives_right']])

US     United States
AUS        Australia
JAP            Japan
IN             India
RU            Russia
MOR          Morocco
EG             Egypt
Name: country, dtype: object
<class 'pandas.core.series.Series'>

           country
US   United States
AUS      Australia
JAP          Japan
IN           India
RU          Russia
MOR        Morocco
EG           Egypt
<class 'pandas.core.frame.DataFrame'>

           country  drives_right
US   United States          True
AUS      Australia         False
JAP          Japan         False
IN           India         False
RU          Russia          True
MOR        Morocco          True
EG           Egypt          True
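
Besides `type()`, the shape of the result is another quick way to tell the two selections apart. The following is a minimal sketch on a three-row version of the cars data; the names `as_series` and `as_frame` are illustrative.

```python
import pandas as pd

cars = pd.DataFrame(
    {'country': ['United States', 'Australia', 'Japan'],
     'drives_right': [True, False, False],
     'cars_per_cap': [809, 731, 588]},
    index=['US', 'AUS', 'JAP'],
)

as_series = cars['country']     # single brackets: one-dimensional Series
as_frame = cars[['country']]    # double brackets: two-dimensional DataFrame

print(as_series.shape)
print(as_frame.shape)
```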


In [18]:
# Print out first 3 observations
print(cars[0:3])
print()

# Print out fourth through seventh observations
print(cars[3:7])

           country  drives_right  cars_per_cap
US   United States          True           809
AUS      Australia         False           731
JAP          Japan         False           588

     country  drives_right  cars_per_cap
IN     India         False            18
RU    Russia          True           200
MOR  Morocco          True            70
EG     Egypt          True            45


### loc and iloc

With `loc` and `iloc` you can do practically any data selection operation on DataFrames you can think of. `loc` is label-based: you specify rows and columns by their row and column labels. `iloc` is integer-position based: you specify rows and columns by their integer position, as in the row slices above.
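
One practical difference between the two shows up when slicing: a label slice with `loc` includes both endpoints, while a position slice with `iloc` excludes the stop position, like a regular Python slice. The sketch below rebuilds the cars data inline so it runs on its own; `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

cars = pd.DataFrame(
    {'country': ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt'],
     'drives_right': [True, False, False, False, True, True, True],
     'cars_per_cap': [809, 731, 588, 18, 200, 70, 45]},
    index=['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG'],
)

# Label slice: both 'AUS' and 'IN' are part of the result
by_label = cars.loc['AUS':'IN']
print(by_label.index.tolist())

# Position slice: position 3 ('IN') is excluded
by_position = cars.iloc[1:3]
print(by_position.index.tolist())
```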

In [28]:
cars

           country  drives_right  cars_per_cap
US   United States          True           809
AUS      Australia         False           731
JAP          Japan         False           588
IN           India         False            18
RU          Russia          True           200
MOR        Morocco          True            70
EG           Egypt          True            45


In [22]:
print(cars.loc['RU'])
print()
print(cars.iloc[4])

country         Russia
drives_right      True
cars_per_cap       200
Name: RU, dtype: object

country         Russia
drives_right      True
cars_per_cap       200
Name: RU, dtype: object


In [23]:
print(cars.loc[['RU']])
print()
print(cars.iloc[[4]])

   country  drives_right  cars_per_cap
RU  Russia          True           200

   country  drives_right  cars_per_cap
RU  Russia          True           200


In [24]:
print(cars.loc[['RU', 'AUS']])
print()
print(cars.iloc[[4, 1]])

       country  drives_right  cars_per_cap
RU      Russia          True           200
AUS  Australia         False           731

       country  drives_right  cars_per_cap
RU      Russia          True           200
AUS  Australia         False           731


In [25]:
# Print out observation for Japan
print(cars.loc['JAP'])
print()
print(cars.iloc[2])
print()

# Print out observations for Australia and Egypt
print(cars.loc[['AUS', 'EG']])
print()
print(cars.iloc[[1,-1]])

country         Japan
drives_right    False
cars_per_cap      588
Name: JAP, dtype: object

country         Japan
drives_right    False
cars_per_cap      588
Name: JAP, dtype: object

       country  drives_right  cars_per_cap
AUS  Australia         False           731
EG       Egypt          True            45

       country  drives_right  cars_per_cap
AUS  Australia         False           731
EG       Egypt          True            45


`loc` and `iloc` also allow you to select both rows and columns from a DataFrame. 

In [27]:
cars

           country  drives_right  cars_per_cap
US   United States          True           809
AUS      Australia         False           731
JAP          Japan         False           588
IN           India         False            18
RU          Russia          True           200
MOR        Morocco          True            70
EG           Egypt          True            45


In [26]:
print(cars.loc['IN', 'cars_per_cap'])
print()
print(cars.iloc[3, 0])

18

India


In [29]:
print(cars.loc[['IN', 'RU'], 'cars_per_cap'])
print()
print(cars.iloc[[3, 4], 0])

IN     18
RU    200
Name: cars_per_cap, dtype: int64

IN     India
RU    Russia
Name: country, dtype: object


In [30]:
print(cars.loc[['IN', 'RU'], ['cars_per_cap', 'country']])
print()
print(cars.iloc[[3, 4], [0, 1]])

    cars_per_cap country
IN            18   India
RU           200  Russia

   country  drives_right
IN   India         False
RU  Russia          True


In [31]:
# Print out drives_right value of Morocco
print(cars.loc['MOR', 'drives_right'])
print()

# Print sub-DataFrame
print(cars.iloc[[4,5], [1,2]])

True

     drives_right  cars_per_cap
RU           True           200
MOR          True            70


It's also possible to select only columns with `loc` and `iloc`. In both cases, you simply put a slice going from beginning to end in front of the comma:

In [32]:
print(cars.loc[:, 'country'])
print()
print(cars.iloc[:, 1])

US     United States
AUS        Australia
JAP            Japan
IN             India
RU            Russia
MOR          Morocco
EG             Egypt
Name: country, dtype: object

US      True
AUS    False
JAP    False
IN     False
RU      True
MOR     True
EG      True
Name: drives_right, dtype: bool


In [33]:
print(cars.loc[:, ['country','drives_right']])
print()
print(cars.iloc[:, [1, 2]])

           country  drives_right
US   United States          True
AUS      Australia         False
JAP          Japan         False
IN           India         False
RU          Russia          True
MOR        Morocco          True
EG           Egypt          True

     drives_right  cars_per_cap
US           True           809
AUS         False           731
JAP         False           588
IN          False            18
RU           True           200
MOR          True            70
EG           True            45


In [35]:
# Print out drives_right column as Series
print(cars.loc[:, 'drives_right'])
print()
print(cars.iloc[:, 1])

US      True
AUS    False
JAP    False
IN     False
RU      True
MOR     True
EG      True
Name: drives_right, dtype: bool

US      True
AUS    False
JAP    False
IN     False
RU      True
MOR     True
EG      True
Name: drives_right, dtype: bool


In [36]:
# Print out drives_right column as DataFrame
print(cars.loc[:, ['drives_right']])
print()
print(cars.iloc[:, [1]])

     drives_right
US           True
AUS         False
JAP         False
IN          False
RU           True
MOR          True
EG           True

     drives_right
US           True
AUS         False
JAP         False
IN          False
RU           True
MOR          True
EG           True


In [37]:
# Print out cars_per_cap and drives_right as DataFrame
print(cars.loc[:, ['cars_per_cap', 'drives_right']])
print()
print(cars.iloc[:, [2, 1]])

     cars_per_cap  drives_right
US            809          True
AUS           731         False
JAP           588         False
IN             18         False
RU            200          True
MOR            70          True
EG             45          True

     cars_per_cap  drives_right
US            809          True
AUS           731         False
JAP           588         False
IN             18         False
RU            200          True
MOR            70          True
EG             45          True


<img src="images/loc-iloc-ix1.png" alt="" style="width: 800px;"/>

From: [Using iloc, loc, & ix to select rows and columns in Pandas DataFrames](https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/)

In [None]:
# Single selections using iloc and DataFrame
# Rows:
data.iloc[0] # first row of data frame (Aleshia Tomkiewicz) - Note a Series data type output.
data.iloc[1] # second row of data frame (Evan Zigomalas)
data.iloc[-1] # last row of data frame (Mi Richan)
# Columns:
data.iloc[:,0] # first column of data frame (first_name)
data.iloc[:,1] # second column of data frame (last_name)
data.iloc[:,-1] # last column of data frame (id)

# Multiple row and column selections using iloc and DataFrame
data.iloc[0:5] # first five rows of dataframe
data.iloc[:, 0:2] # first two columns of data frame with all rows
data.iloc[[0,3,6,24], [0,5,6]] # 1st, 4th, 7th, 25th row + 1st 6th 7th columns.
data.iloc[0:5, 5:8] # first 5 rows and 6th, 7th, 8th columns of data frame (county -> phone1).

Selections using the `loc` indexer are based on the index labels of the data frame. Once an index has been set on a DataFrame with `df.set_index()`, `.loc` selects rows directly by their index values. It also accepts a boolean Series, which selects the rows where the condition is True.

In [None]:
data.set_index("last_name", inplace=True)

In [None]:
# Select rows with first name Antonio, and all columns between 'city' and 'email'
data.loc[data['first_name'] == 'Antonio', 'city':'email']
 
# Select rows where the email column ends with 'hotmail.com', include all columns
data.loc[data['email'].str.endswith("hotmail.com")]   
 
# Select rows with first_name equal to some values, all columns
data.loc[data['first_name'].isin(['France', 'Tyisha', 'Eric'])]   
       
# Select rows with first name Antonio AND gmail email addresses
data.loc[data['email'].str.endswith("gmail.com") & (data['first_name'] == 'Antonio')] 
 
# select rows with id column between 100 and 200, and just return 'postal' and 'web' columns
data.loc[(data['id'] > 100) & (data['id'] <= 200), ['postal', 'web']] 
 
# A lambda function that yields True/False values can also be used.
# Select rows where the company name has 4 words in it.
data.loc[data['company_name'].apply(lambda x: len(x.split(' ')) == 4)] 
 
# Selections can be achieved outside of the main .loc for clarity:
# Form a separate variable with your selections:
idx = data['company_name'].apply(lambda x: len(x.split(' ')) == 4)
# Select only the True values in 'idx' and only the 3 columns specified:
data.loc[idx, ['email', 'first_name', 'company_name']]
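
The `data` frame used in these cells comes from the linked tutorial and is not defined in this notebook. To try the same selection patterns, here is a minimal sketch on a small stand-in frame. All rows are invented for illustration, and only a few of the tutorial's column names are reused.

```python
import pandas as pd

# Invented rows, for illustration only; not the tutorial's real dataset
data = pd.DataFrame({
    'first_name': ['Antonio', 'Tyisha', 'Eric', 'Antonio'],
    'last_name': ['Rossi', 'Lee', 'Novak', 'Silva'],
    'email': ['antonio@gmail.com', 'tyisha@hotmail.com',
              'eric@hotmail.com', 'a.silva@hotmail.com'],
    'company_name': ['Acme Tools', 'Blue Sky Travel',
                     'North Road Cycle Works', 'Harbor Light Books Ltd'],
    'id': [1, 2, 3, 4],
})
data = data.set_index('last_name')

# Two conditions combined with &; the == comparison needs parentheses
# because & binds more tightly than ==
mask = data['email'].str.endswith('hotmail.com') & (data['first_name'] == 'Antonio')
print(data.loc[mask, ['first_name', 'email']])

# Rows whose company name has exactly four words
four_words = data['company_name'].apply(lambda x: len(x.split(' ')) == 4)
print(data.loc[four_words, ['company_name']])
```

Because `last_name` is now the index, it no longer appears as a regular column, which is why the conditions are written against `first_name` and `email`.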

You might wonder if you can combine label-based selection the `loc` way with position-based selection the `iloc` way. Older versions of Pandas offered the `ix` indexer for this, but `ix` was deprecated in version 0.20.0 and removed in version 1.0, so the `ix` code below only runs on older releases.

For detailed explanation see: [Using iloc, loc, & ix to select rows and columns in Pandas DataFrames](https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/)

In [None]:
# ix indexing works just the same as .loc when passed strings
data.ix[['Andrade']] == data.loc[['Andrade']]
# ix indexing works the same as .iloc when passed integers.
data.ix[[33]] == data.iloc[[33]]
 
# ix only works in both modes when the index of the DataFrame is NOT an integer itself.
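
On Pandas versions without `ix`, mixed label and position selection can be written with `loc` or `iloc` alone by translating one side first. The sketch below shows two ways to do that on the cars data, rebuilt inline so the cell runs on its own; the variable names are illustrative.

```python
import pandas as pd

cars = pd.DataFrame(
    {'country': ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt'],
     'drives_right': [True, False, False, False, True, True, True],
     'cars_per_cap': [809, 731, 588, 18, 200, 70, 45]},
    index=['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG'],
)

# Row by position, column by label: look up the row label at position 3 first
value_via_loc = cars.loc[cars.index[3], 'cars_per_cap']

# Same cell: look up the integer position of the column label instead
value_via_iloc = cars.iloc[3, cars.columns.get_loc('cars_per_cap')]

print(value_via_loc, value_via_iloc)
```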

## Filtering Pandas DataFrame

Find all observations in `cars` where `drives_right` is True. `drives_right` is a boolean column, so you'll have to extract it as a Series and then use this boolean Series to select observations from `cars`.

In [40]:
# Extract drives_right column as Series: dr
dr = cars['drives_right']
print(dr)
print(type(dr))
print()

# Use dr to subset cars: sel
sel = cars[dr]

# Print sel and assert that drives_right is True for all observations
print(sel)
print(type(sel))

US      True
AUS    False
JAP    False
IN     False
RU      True
MOR     True
EG      True
Name: drives_right, dtype: bool
<class 'pandas.core.series.Series'>

           country  drives_right  cars_per_cap
US   United States          True           809
RU          Russia          True           200
MOR        Morocco          True            70
EG           Egypt          True            45
<class 'pandas.core.frame.DataFrame'>


In [41]:
# Convert code to a one-liner
sel = cars[cars['drives_right']]

# Print sel
print(sel)

           country  drives_right  cars_per_cap
US   United States          True           809
RU          Russia          True           200
MOR        Morocco          True            70
EG           Egypt          True            45


In [42]:
# Create car_maniac: observations that have a cars_per_cap over 500
car_maniac = cars[cars['cars_per_cap'] > 500]

# Print car_maniac
print(car_maniac)

           country  drives_right  cars_per_cap
US   United States          True           809
AUS      Australia         False           731
JAP          Japan         False           588


In [45]:
import numpy as np

# Select observations with cars_per_cap between 10 and 80 (both bounds excluded)
cpc = cars['cars_per_cap']
between = np.logical_and(cpc > 10, cpc < 80)
medium = cars[between]
print(medium)

     country  drives_right  cars_per_cap
IN     India         False            18
MOR  Morocco          True            70
EG     Egypt          True            45


In [47]:
# Create medium: observations with cars_per_cap between 100 and 500
cpc = cars['cars_per_cap']
between = np.logical_and(cpc > 100, cpc < 500)
medium = cars[between]

# Print medium
print(medium)

   country  drives_right  cars_per_cap
RU  Russia          True           200
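
The same filter can also be written without NumPy by combining the two boolean Series with the `&` operator. This is a sketch of an equivalent form, not a replacement for the cell above; each comparison needs its own parentheses because `&` binds more tightly than `>` and `<`. The names `with_numpy` and `with_operator` are illustrative.

```python
import numpy as np
import pandas as pd

cars = pd.DataFrame(
    {'country': ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt'],
     'drives_right': [True, False, False, False, True, True, True],
     'cars_per_cap': [809, 731, 588, 18, 200, 70, 45]},
    index=['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG'],
)
cpc = cars['cars_per_cap']

# NumPy version, as in the cell above
with_numpy = cars[np.logical_and(cpc > 100, cpc < 500)]

# Operator version: element-wise AND of two boolean Series
with_operator = cars[(cpc > 100) & (cpc < 500)]

print(with_operator)
print(with_numpy.equals(with_operator))
```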
