## Inspecting a DataFrame
When you get a new DataFrame to work with, the first thing you need to do is explore it and see what it contains. There are several useful methods and attributes for this.

- `.head()` returns the first few rows (the “head” of the DataFrame).
- `.info()` shows information on each of the columns, such as the data type and number of missing values.
- `.shape` returns the number of rows and columns of the DataFrame.
- `.describe()` calculates a few summary statistics for each column.

In [2]:
# Pre-defined lists
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]

# Import pandas as pd
import pandas as pd

# Create dictionary my_dict with three key:value pairs: my_dict
my_dict = {'country':names, 'drives_right': dr, 'cars_per_cap': cpc}

# Build a DataFrame cars from my_dict: cars
cars = pd.DataFrame(my_dict)

# Print the first rows of cars (at most 10)
cars.head(10)

Unnamed: 0,country,drives_right,cars_per_cap
0,United States,True,809
1,Australia,False,731
2,Japan,False,588
3,India,False,18
4,Russia,True,200
5,Morocco,True,70
6,Egypt,True,45


In [7]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 3 columns):
country         7 non-null object
drives_right    7 non-null bool
cars_per_cap    7 non-null int64
dtypes: bool(1), int64(1), object(1)
memory usage: 247.0+ bytes


In [9]:
cars.shape

(7, 3)

In [10]:
cars.describe()

Unnamed: 0,cars_per_cap
count,7.0
mean,351.571429
std,345.595552
min,18.0
25%,57.5
50%,200.0
75%,659.5
max,809.0


# Parts of a DataFrame
To better understand DataFrame objects, it's useful to know that they consist of three components, stored as attributes:

- **values**: A two-dimensional NumPy array of values.
- **columns**: An index of columns: the column names.
- **index**: An index for the rows: either row numbers or row names.

In [11]:
cars.values

array([['United States', True, 809],
       ['Australia', False, 731],
       ['Japan', False, 588],
       ['India', False, 18],
       ['Russia', True, 200],
       ['Morocco', True, 70],
       ['Egypt', True, 45]], dtype=object)

In [12]:
cars.columns

Index(['country', 'drives_right', 'cars_per_cap'], dtype='object')

In [14]:
cars.index

RangeIndex(start=0, stop=7, step=1)

In [4]:
# Build cars DataFrame
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
dictz = { 'country':names, 'drives_right':dr, 'cars_per_cap':cpc }
cars = pd.DataFrame(dictz)

# Definition of row_labels
row_labels = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']

# Specify row labels of cars
cars.index = row_labels

# Print cars again
cars

Unnamed: 0,country,drives_right,cars_per_cap
US,United States,True,809
AUS,Australia,False,731
JAP,Japan,False,588
IN,India,False,18
RU,Russia,True,200
MOR,Morocco,True,70
EG,Egypt,True,45


## NumPy and pandas working together
`Pandas` depends upon and interoperates with `NumPy`, the Python library for fast numeric array computations. For example, you can use the DataFrame attribute `.values` to represent a DataFrame `df` as a NumPy array. You can also pass pandas data structures to NumPy functions. In this exercise, pandas is imported as `pd` and world population data for every 10 years since 1960 is loaded into the DataFrame `df`. This dataset was derived from the one used in the previous exercise.

Your job is to extract the values and store them in an array using the attribute `.values`. You'll then use those values as input to the NumPy function `np.log10()` to compute the base 10 logarithm of the population values. Finally, you will pass the entire pandas DataFrame into the same `np.log10()` function and compare the results.

In [15]:
# Import numpy
import numpy as np

row_labels = [1960, 1970, 1980, 1990, 2000, 2010]
Population =   [3.03497056e+09, 3.68482270e+09, 4.43659036e+09, 5.28271599e+09, 6.11597449e+09, 6.92428294e+09]
# Use a name other than dict so the built-in dict type is not shadowed
data = {'Population': Population}
df = pd.DataFrame(data)
df.index = row_labels

# Create array of DataFrame values: np_vals
np_vals = df.values

# Create new array of base 10 logarithm values: np_vals_log10
np_vals_log10 = np.log10(np_vals)

# Create array of new DataFrame by passing df to np.log10(): df_log10
df_log10 = np.log10(df)

# Print original and new data containers
for name, obj in [('np_vals', np_vals), ('np_vals_log10', np_vals_log10),
                  ('df', df), ('df_log10', df_log10)]:
    print(name, 'has type', type(obj))

np_vals has type <class 'numpy.ndarray'>
np_vals_log10 has type <class 'numpy.ndarray'>
df has type <class 'pandas.core.frame.DataFrame'>
df_log10 has type <class 'pandas.core.frame.DataFrame'>
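
To make the comparison of the two results concrete, here is a minimal sketch that rebuilds the same population data (the names `years`, `population` and `same_numbers` are my own) and checks that both paths give the same numbers, while only the DataFrame path keeps the year labels:

```python
import numpy as np
import pandas as pd

# Same population figures as above, indexed by year
years = [1960, 1970, 1980, 1990, 2000, 2010]
population = [3.03497056e+09, 3.68482270e+09, 4.43659036e+09,
              5.28271599e+09, 6.11597449e+09, 6.92428294e+09]
df = pd.DataFrame({'Population': population}, index=years)

# Path 1: NumPy array in, NumPy array out (row and column labels are lost)
np_vals_log10 = np.log10(df.values)

# Path 2: DataFrame in, DataFrame out (index and column names are kept)
df_log10 = np.log10(df)

# The numbers agree; only the container differs
same_numbers = np.allclose(np_vals_log10, df_log10.values)
print(same_numbers)
print(list(df_log10.index))
```

Passing the DataFrame directly is usually more convenient when the labels are still needed afterwards.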

## Importing including index_col

You can set the index column when you import your file by passing the `index_col` argument to `read_csv`.

In [16]:
# Import pandas as pd
import pandas as pd

# Fix import by including index_col
cars = pd.read_csv('../data/17. Introducción a Pandas/cars.csv', index_col= 0)

# Print out cars
cars

Unnamed: 0,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JAP,588,Japan,False
IN,18,India,False
RU,200,Russia,True
MOR,70,Morocco,True
EG,45,Egypt,True
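
To see what `index_col` changes, the sketch below reads the same text twice. It uses a small in-memory string (`csv_text`, a hypothetical stand-in for the first rows of `cars.csv`) so that it runs without the file:

```python
import io
import pandas as pd

# Hypothetical stand-in for cars.csv: the first column holds the row labels
csv_text = """,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JAP,588,Japan,False
"""

# Without index_col: the label column becomes a regular column named 'Unnamed: 0'
plain = pd.read_csv(io.StringIO(csv_text))

# With index_col=0: the first column is used as the row index
labeled = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(plain.columns))
print(list(labeled.index))
```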


## Change name of columns

You can access the column names and change them using the `.columns` attribute.

In [17]:
cars_2 = cars.copy()
list_labels = ['cars_per_cap_2','countries','drives_left']
cars_2.columns = list_labels
cars_2.head()

Unnamed: 0,cars_per_cap_2,countries,drives_left
US,809,United States,True
AUS,731,Australia,False
JAP,588,Japan,False
IN,18,India,False
RU,200,Russia,True
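
Assigning to `.columns` replaces every name at once, so the list must match the column order. As an alternative sketch (not what the notes above use), `DataFrame.rename()` changes only the columns you name and returns a new DataFrame; the three-row `cars` below is a shortened copy built just for this example:

```python
import pandas as pd

# Shortened copy of the cars data, built inline so the example is self-contained
cars = pd.DataFrame(
    {'cars_per_cap': [809, 731, 588],
     'country': ['United States', 'Australia', 'Japan'],
     'drives_right': [True, False, False]},
    index=['US', 'AUS', 'JAP'])

# rename() maps old names to new ones; columns not listed keep their names
renamed = cars.rename(columns={'country': 'countries'})

print(list(renamed.columns))
print(list(cars.columns))  # the original DataFrame is unchanged
```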


# Use broadcast

Pandas makes it possible to broadcast over the dimensions added via a multidimensional and even hierarchical index, which is very powerful if you know how to use it. You don't need to code your own loops and conditions; you can rely on what already works. In the example below, the single string `state` is broadcast to every row of the DataFrame.

In [4]:
# Make a string with the value 'PA': state
state = 'PA'
cities = ['Manheim', 'Preston park', 'Biglerville', 'Indiana', 'Curwensville',
          'Crown', 'Harveys lake', 'Mineral springs', 'Cassville', 'Hannastown',
          'Saltsburg', 'Tunkhannock', 'Pittsburgh', 'Lemasters', 'Great bend']

# Construct a dictionary: data
data = {'state':state, 'city':cities}

# Construct a DataFrame from dictionary data: df
df = pd.DataFrame(data)

# Print the DataFrame
df.head()

Unnamed: 0,state,city
0,PA,Manheim
1,PA,Preston park
2,PA,Biglerville
3,PA,Indiana
4,PA,Curwensville
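
Broadcasting is not limited to the constructor. The following sketch, using a shortened city list and two column names of my own (`country`, `label`), shows a scalar being repeated on assignment and an element-wise string operation that needs no loop:

```python
import pandas as pd

cities = ['Manheim', 'Preston park', 'Biglerville']

# A scalar in the dictionary is repeated for every row of the list-valued column
df = pd.DataFrame({'state': 'PA', 'city': cities})

# Assigning a scalar to a new column broadcasts it the same way
df['country'] = 'USA'

# Operations between columns and scalars are applied element-wise, without a loop
df['label'] = df['city'] + ', ' + df['state']

print(df)
```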


# Subsetting columns
## Square Brackets
You saw that you can index and select Pandas DataFrames in many different ways. The simplest, but not the most powerful way, is to use square brackets.

Here, the same cars data is imported from a CSV file as a Pandas DataFrame. To select only the `cars_per_cap` column from `cars`, you can use:

`cars['cars_per_cap']`

`cars[['cars_per_cap']]`

The single bracket version gives a Pandas Series, the double bracket version gives a Pandas DataFrame.

In [19]:
cars['cars_per_cap']

US     809
AUS    731
JAP    588
IN      18
RU     200
MOR     70
EG      45
Name: cars_per_cap, dtype: int64

In [20]:
cars[['cars_per_cap']]

Unnamed: 0,cars_per_cap
US,809
AUS,731
JAP,588
IN,18
RU,200
MOR,70
EG,45


Square brackets can do more than just selecting columns. You can also use them to get rows, or observations, from a DataFrame. The following call selects the first five rows from the cars DataFrame:

`cars[0:5]`

The result is another DataFrame containing only the rows you specified.

**Pay attention**: You can only select rows using square brackets if you specify a slice, like `0:4`. Also, you're using the integer indexes of the rows here, not the row labels!

In [21]:
cars[0:3]

Unnamed: 0,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JAP,588,Japan,False
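
The three uses of square brackets can be checked side by side. This is a small sketch on a shortened copy of `cars` (built inline here); the variable names are only illustrative:

```python
import pandas as pd

# Shortened copy of the cars data, built inline so the example is self-contained
cars = pd.DataFrame(
    {'cars_per_cap': [809, 731, 588],
     'country': ['United States', 'Australia', 'Japan'],
     'drives_right': [True, False, False]},
    index=['US', 'AUS', 'JAP'])

as_series = cars['cars_per_cap']    # single brackets -> Series
as_frame = cars[['cars_per_cap']]   # double brackets -> DataFrame
first_two = cars[0:2]               # slice -> rows selected by integer position

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(list(first_two.index))

# A bare integer is looked up as a column label, not as a row position
try:
    cars[0]
except KeyError:
    print("cars[0] raises KeyError: there is no column named 0")
```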


# loc and iloc
With loc and iloc you can do practically any data selection operation on DataFrames you can think of. 

- `.loc` is label-based, which means that you have to specify rows and columns based on their row and column labels. 

- `.iloc` is integer index based, so you have to specify rows and columns by their integer index, as in the row slices above.

The following commands use loc and iloc to select observations. Each pair of commands gives the same result.

In [23]:
# Print out observation for Japan
cars.loc['JAP']

cars_per_cap      588
country         Japan
drives_right    False
Name: JAP, dtype: object

In [24]:
cars.iloc[2]

cars_per_cap      588
country         Japan
drives_right    False
Name: JAP, dtype: object

In [25]:
# Print out observations for Australia and Egypt
cars.loc[['AUS', 'EG']]

Unnamed: 0,cars_per_cap,country,drives_right
AUS,731,Australia,False
EG,45,Egypt,True


In [26]:
cars.iloc[[1,6]]

Unnamed: 0,cars_per_cap,country,drives_right
AUS,731,Australia,False
EG,45,Egypt,True


loc and iloc also allow you to select both rows and columns from a DataFrame. In the commands below, paired commands select the same data.

In [27]:
# Print out cars_per_cap value of Morocco (the list ['MOR'] makes the result a one-row Series)
cars.loc[['MOR'],'cars_per_cap']

MOR    70
Name: cars_per_cap, dtype: int64

In [28]:
cars.iloc[5,0]

70

In [29]:
cars.loc[['RU', 'MOR'],['country','drives_right']]

Unnamed: 0,country,drives_right
RU,Russia,True
MOR,Morocco,True


In [30]:
cars.iloc[[4,5],[1,2]]

Unnamed: 0,country,drives_right
RU,Russia,True
MOR,Morocco,True


It's also possible to select only columns with loc and iloc. In both cases, you simply put a slice going from beginning to end in front of the comma:

In [31]:
# Print out drives_right column as Series
cars.loc[:,'drives_right']

US      True
AUS    False
JAP    False
IN     False
RU      True
MOR     True
EG      True
Name: drives_right, dtype: bool

In [25]:
# Print out drives_right column as DataFrame
print(cars.loc[:,['drives_right']])

     drives_right
US           True
AUS         False
JAP         False
IN          False
RU           True
MOR          True
EG           True


In [32]:
# Print out cars_per_cap and drives_right as DataFrame
cars.loc[:,['cars_per_cap','drives_right']]

Unnamed: 0,cars_per_cap,drives_right
US,809,True
AUS,731,False
JAP,588,False
IN,18,False
RU,200,True
MOR,70,True
EG,45,True
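
One difference worth keeping in mind when slicing: `.loc` slices include the end label, while `.iloc` slices exclude the end position, as Python lists do. A minimal sketch on a five-row copy of `cars` (built inline here):

```python
import pandas as pd

# Five-row copy of the cars data, built inline so the example is self-contained
cars = pd.DataFrame(
    {'cars_per_cap': [809, 731, 588, 18, 200],
     'country': ['United States', 'Australia', 'Japan', 'India', 'Russia'],
     'drives_right': [True, False, False, False, True]},
    index=['US', 'AUS', 'JAP', 'IN', 'RU'])

# .loc slices by label and includes the end label ('IN' and 'drives_right')
by_label = cars.loc['AUS':'IN', 'country':'drives_right']

# .iloc slices by position and excludes the end position (row 4 and column 3)
by_position = cars.iloc[1:4, 1:3]

print(by_label.equals(by_position))
print(list(by_label.index))
```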


# Subsetting rows (masks)
You can filter observations from a DataFrame based on boolean arrays. Let's start simple and try to find all observations in cars where `drives_right` is `True`.

`drives_right` is a boolean column, so you'll have to extract it as a Series and then use this boolean Series to select observations from `cars`.

In [33]:
# Extract drives_right column as Series: dr
dr = cars["drives_right"]

# Use dr to subset cars: sel
sel = cars[dr]
sel

Unnamed: 0,cars_per_cap,country,drives_right
US,809,United States,True
RU,200,Russia,True
MOR,70,Morocco,True
EG,45,Egypt,True


In [34]:
# Convert code to a one-liner
sel = cars[cars['drives_right']]
sel

Unnamed: 0,cars_per_cap,country,drives_right
US,809,United States,True
RU,200,Russia,True
MOR,70,Morocco,True
EG,45,Egypt,True


This time you want to find out which countries have a high cars per capita figure. In other words, in which countries do many people have a car, or maybe multiple cars.

Similar to the previous example, you'll want to build up a boolean Series that you can then use to subset the cars DataFrame to select certain observations. If you want to do this in a one-liner, that's perfectly fine.

In [35]:
# Create car_maniac: observations that have a cars_per_cap over 500
cpc = cars['cars_per_cap']
many_cars = cpc > 500
car_maniac = cars[many_cars]
car_maniac

Unnamed: 0,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JAP,588,Japan,False


Remember `np.logical_and()`, `np.logical_or()` and `np.logical_not()`, the NumPy variants of the and, or and not operators? You can also use them on Pandas Series to do more advanced filtering operations.

Take this example that selects the observations that have a `cars_per_cap` between 100 and 500. Try out these lines of code step by step to see what's happening.

In [36]:
# Create medium: observations with cars_per_cap between 100 and 500
import numpy as np

cpc = cars['cars_per_cap']
between = np.logical_and(cpc > 100, cpc < 500)
medium = cars[between]
medium

Unnamed: 0,cars_per_cap,country,drives_right
RU,200,Russia,True


In [37]:
medium = cars[(cars['cars_per_cap']>100) & (cars['cars_per_cap']<500)]
medium

Unnamed: 0,cars_per_cap,country,drives_right
RU,200,Russia,True
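
The operator form also covers "or" and "not". As a sketch on the same seven countries (rebuilt inline; the thresholds 50 and 700 are arbitrary values chosen for illustration), `|` and `~` play the roles of `np.logical_or()` and `np.logical_not()`:

```python
import pandas as pd

# The cars data, built inline so the example is self-contained
cars = pd.DataFrame(
    {'cars_per_cap': [809, 731, 588, 18, 200, 70, 45],
     'country': ['United States', 'Australia', 'Japan', 'India',
                 'Russia', 'Morocco', 'Egypt'],
     'drives_right': [True, False, False, False, True, True, True]},
    index=['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG'])

cpc = cars['cars_per_cap']

# | is element-wise "or": very low or very high car ownership
extremes = cars[(cpc < 50) | (cpc > 700)]

# ~ is element-wise "not": countries that do not drive on the right
drives_left = cars[~cars['drives_right']]

print(list(extremes.index))
print(list(drives_left.index))
```

The parentheses around each comparison are required, because `&`, `|` and `~` bind more tightly than `<` and `>`.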


# Loop over DataFrame
Iterating over a Pandas DataFrame is typically done with the `.iterrows()` method. Used in a for loop, every observation is iterated over and on every iteration the row label and actual row contents are available:

`for lab, row in brics.iterrows() :
    ...
    `

In this and the following exercises you will be working on the cars DataFrame. It contains information on the cars per capita and whether people drive right or left for seven countries in the world.

In [43]:
# Iterate over rows of cars
for x, y in cars.iterrows() : 
    print('index: ' + x)
    print(y)

index: US
cars_per_cap              809
country         United States
drives_right             True
Name: US, dtype: object
index: AUS
cars_per_cap          731
country         Australia
drives_right        False
Name: AUS, dtype: object
index: JAP
cars_per_cap      588
country         Japan
drives_right    False
Name: JAP, dtype: object
index: IN
cars_per_cap       18
country         India
drives_right    False
Name: IN, dtype: object
index: RU
cars_per_cap       200
country         Russia
drives_right      True
Name: RU, dtype: object
index: MOR
cars_per_cap         70
country         Morocco
drives_right       True
Name: MOR, dtype: object
index: EG
cars_per_cap       45
country         Egypt
drives_right     True
Name: EG, dtype: object


The row data that's generated by `.iterrows()` on every run is a Pandas Series. This format is not very convenient to print out. Luckily, you can easily select variables from the Pandas Series using square brackets:

`for lab, row in brics.iterrows() :
    print(row['country'])`

In [38]:
# Adapt for loop
for lab, row in cars.iterrows() :
    print(str(lab) + ": " + str(row['cars_per_cap']))

US: 809
AUS: 731
JAP: 588
IN: 18
RU: 200
MOR: 70
EG: 45
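
If you only need a few fields per row, `.itertuples()` is another pandas method for row iteration. It is not used in the notes above; this sketch shows it as one possible alternative that yields a lightweight named tuple per row instead of a Series, with the row label available as `Index`:

```python
import pandas as pd

# Shortened copy of the cars data, built inline so the example is self-contained
cars = pd.DataFrame(
    {'cars_per_cap': [809, 731, 588],
     'country': ['United States', 'Australia', 'Japan'],
     'drives_right': [True, False, False]},
    index=['US', 'AUS', 'JAP'])

lines = []
for row in cars.itertuples():
    # row.Index is the row label; columns are available as attributes
    lines.append(str(row.Index) + ": " + str(row.cars_per_cap))

for line in lines:
    print(line)
```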


## Add column
To add the length of the country names of the brics DataFrame in a new column:

`for lab, row in brics.iterrows() :
    brics.loc[lab, "name_length"] = len(row["country"])`

In [45]:
# Code for loop that adds COUNTRY column
for lab, row in cars.iterrows():
    cars.loc[lab, "COUNTRY"] = str.upper(row["country"])
    
cars.head()

Unnamed: 0,cars_per_cap,country,drives_right,COUNTRY
US,809,United States,True,UNITED STATES
AUS,731,Australia,False,AUSTRALIA
JAP,588,Japan,False,JAPAN
IN,18,India,False,INDIA
RU,200,Russia,True,RUSSIA


Using `.iterrows()` to iterate over every observation of a Pandas DataFrame is easy to understand, but not very efficient. On every iteration, you're creating a new Pandas Series.

If you want to add a column to a DataFrame by calling a function on another column, the `.iterrows()` method in combination with a for loop is not the preferred way to go. Instead, you'll want to use `apply()`.

Compare the `.iterrows()` version with the `.apply()` version to get the same result in the brics DataFrame:

`for lab, row in brics.iterrows() :
    brics.loc[lab, "name_length"] = len(row["country"])`

`brics["name_length"] = brics["country"].apply(len)`

We can do a similar thing to call the `.upper()` method on every name in the country column. However, `.upper()` is a string method rather than a standalone function like `len()`, so we pass it to `.apply()` as `str.upper`:

In [39]:
# Use .apply(str.upper)
cars["COUNTRY"] = cars["country"].apply(str.upper)
cars

Unnamed: 0,cars_per_cap,country,drives_right,COUNTRY
US,809,United States,True,UNITED STATES
AUS,731,Australia,False,AUSTRALIA
JAP,588,Japan,False,JAPAN
IN,18,India,False,INDIA
RU,200,Russia,True,RUSSIA
MOR,70,Morocco,True,MOROCCO
EG,45,Egypt,True,EGYPT
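
For string columns, pandas also offers the `.str` accessor, which applies string methods to the whole column at once. The sketch below is an alternative to the `.apply()` version above, not a replacement prescribed by the notes; it uses a shortened copy of `cars` and a `name_length` column as in the brics example:

```python
import pandas as pd

# Shortened copy of the cars data, built inline so the example is self-contained
cars = pd.DataFrame(
    {'cars_per_cap': [809, 731, 588],
     'country': ['United States', 'Australia', 'Japan'],
     'drives_right': [True, False, False]},
    index=['US', 'AUS', 'JAP'])

# Vectorized string methods: no explicit loop and no apply()
cars["COUNTRY"] = cars["country"].str.upper()
cars["name_length"] = cars["country"].str.len()

# Same values as the apply() versions
same_upper = cars["COUNTRY"].tolist() == cars["country"].apply(str.upper).tolist()
same_len = cars["name_length"].tolist() == cars["country"].apply(len).tolist()

print(cars[["COUNTRY", "name_length"]])
print(same_upper, same_len)
```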
