# Pandas

#### Dictionary to DataFrame 
Pandas is an open source library, providing high-performance, easy-to-use data structures and data analysis tools for Python. Sounds promising!

The DataFrame is one of Pandas' most important data structures. It's basically a way to store tabular data where you can label the rows and the columns. One way to build a DataFrame is from a dictionary.

In the exercises that follow you will be working with vehicle data from different countries. Each observation corresponds to a country and the columns give information about the number of vehicles per capita, whether people drive left or right, and so on.

In [None]:
# Import pandas as pd
import pandas as pd

# Lists
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]

# Create dictionary my_dict with three key:value pairs: my_dict
my_dict = {'country':names, 'drives_right':dr, 'cars_per_cap':cpc}

# Build a DataFrame cars from my_dict: cars
cars = pd.DataFrame(my_dict)

# Print cars
print(cars)

The Python code that solves the previous exercise is included in the script. Have you noticed that the row labels (i.e. the labels for the different observations) were automatically set to integers from 0 up to 6?

To solve this, a list row_labels has been created. You can use it to specify the row labels of the cars DataFrame by setting the index attribute of cars, which you can access as cars.index.

In [None]:
# Build cars DataFrame
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
cars_dict = { 'country':names, 'drives_right':dr, 'cars_per_cap':cpc }
cars = pd.DataFrame(cars_dict)

# Definition of row_labels
row_labels = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']

# Specify row labels of cars
cars.index = row_labels

# Print cars again
print(cars)
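
The row labels can also be supplied when the DataFrame is built, instead of being assigned afterwards. The following is a minimal sketch that reuses the lists from the cell above and passes them through the `index` parameter of `pd.DataFrame`; the name `cars_labeled` is only illustrative.

```python
import pandas as pd

names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr = [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
row_labels = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']

# Pass the row labels at construction time via index=
# (cars_labeled is a hypothetical name used for this sketch)
cars_labeled = pd.DataFrame(
    {'country': names, 'drives_right': dr, 'cars_per_cap': cpc},
    index=row_labels,
)

# The index now holds the country codes instead of 0..6
print(cars_labeled.index.tolist())
```

Both approaches should give the same DataFrame; setting `index` up front simply saves one step.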

#### CSV to DataFrame 
Putting data in a dictionary and then building a DataFrame works, but it's not very efficient. What if you're dealing with millions of observations? In those cases, the data is typically available as files with a regular structure. One of those file types is the CSV file, which is short for "comma-separated values".

To import CSV data into Python as a Pandas DataFrame you can use read_csv().

Let's explore this function with the same cars data from the previous exercises. This time, however, the data is available in a CSV file named cars.csv. In this notebook the file is stored in the data folder one level above the working directory, so the path to the file is '../data/cars.csv'.

In [None]:
# Import the cars.csv data: cars
cars = pd.read_csv('../data/cars.csv')

# Print out cars
print(cars)

Your read_csv() call to import the CSV data didn't generate an error, but the output is not entirely what we wanted. The row labels were imported as another column without a name.
Remember index_col, an argument of read_csv(), that you can use to specify which column in the CSV file should be used as a row label? Well, that's exactly what you need here!

In [None]:
# Fix import by including index_col
cars = pd.read_csv('../data/cars.csv', index_col=0)

# Print out cars
print(cars)
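
To see the effect of index_col without needing the file on disk, here is a small sketch that reads CSV text from memory. The three-row `csv_text` is a hypothetical stand-in for cars.csv, assuming (as described above) that the first column has no header name.

```python
import io
import pandas as pd

# Hypothetical stand-in for cars.csv: the first column has an empty header
csv_text = """,country,drives_right,cars_per_cap
US,United States,True,809
AUS,Australia,False,731
JAP,Japan,False,588
"""

# Without index_col, the labels end up in an ordinary column
without_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row labels
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(without_index.columns.tolist())
print(with_index.index.tolist())
```

In the first call pandas gives the unnamed column a placeholder name and keeps the default integer index; in the second, the country codes become the index.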

#### Square Brackets 
You can index and select Pandas DataFrames in many different ways. The simplest, but not the most powerful way, is to use square brackets. Square brackets can do more than just selecting columns. You can also use them to get rows, or observations, from a DataFrame. 

In [None]:
# Print out country column as Pandas Series
print(f"Pandas Series\n--------------------\n{cars['country']}\n")

# Print out country column as Pandas DataFrame
print(f"Pandas DataFrame\n------------------\n{cars[['country']]}")


In [None]:
# Print out first 3 observations
print(cars[0:3])

In [None]:
# Print out fourth, fifth and sixth observation
print(cars[3:6])

In [None]:
# Create car_maniac: observations that have a cars_per_cap over 500
car_maniac = cars[cars['cars_per_cap'] > 500]
print(car_maniac)

#### loc and iloc 
With loc and iloc you can do practically any data selection operation on DataFrames you can think of. loc is label-based, which means that you specify rows and columns by their row and column labels. iloc is integer-position based, so you specify rows and columns by their integer position, much like the row slices in the previous exercise.

In [None]:
# Print out drives_right value of Morocco
print(cars.loc['MOR', 'drives_right'])

In [None]:
# Print sub-DataFrame
print(cars.loc[['RU', 'MOR'], ['country', 'drives_right']])

In [None]:
# Print out drives_right column as Series
print(cars.loc[:, 'drives_right'])

In [None]:
# Print out drives_right column as DataFrame
print(cars.loc[:, ['drives_right']])

In [None]:
# Print out cars_per_cap and drives_right as DataFrame
print(cars.loc[:, ['cars_per_cap', 'drives_right']])
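
The cells above all use loc. For comparison, here is a minimal sketch of the same kind of selection done with iloc, using a small hypothetical `cars_demo` DataFrame (three rows of the cars data) so that the block runs on its own.

```python
import pandas as pd

# Hypothetical three-row subset of the cars data, for illustration
cars_demo = pd.DataFrame(
    {'country': ['United States', 'Australia', 'Japan'],
     'drives_right': [True, False, False],
     'cars_per_cap': [809, 731, 588]},
    index=['US', 'AUS', 'JAP'],
)

# Same cell, selected by label (loc) and by integer position (iloc)
by_label = cars_demo.loc['JAP', 'cars_per_cap']
by_position = cars_demo.iloc[2, 2]

# Same sub-DataFrame, selected by label and by integer position
sub_label = cars_demo.loc[['US', 'AUS'], ['country', 'drives_right']]
sub_position = cars_demo.iloc[[0, 1], [0, 1]]

print(by_label, by_position)
print(sub_label.equals(sub_position))
```

Note that iloc depends on the current order of rows and columns, whereas loc keeps working if they are reordered.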

# Matplotlib


#### Line Plot 
With matplotlib, you can create a bunch of different plots in Python. The most basic plot is the line plot. *When you have a time scale along the horizontal axis, the line plot is your friend*.

In [None]:
import matplotlib.pyplot as plt

years = pd.read_csv('../data/years.csv', header=None).values.flatten()
population = pd.read_csv('../data/population.csv', header=None).values.flatten()

# Make a line plot: years on the x-axis, population on the y-axis
plt.plot(years, population)

# Display the plot with plt.show()
plt.show()

> The line plot above suggests that by around the year 2060 there will be more than ten billion human beings on this planet.

#### Scatter Plot 
When you're trying to assess if there's a correlation between two variables, for example, the *scatter plot is the better choice*.

Let's continue with the gdp_cap versus life_exp plot, the GDP and life expectancy data for different countries in 2007.

In [None]:
# Convert gapminder.csv to DataFrame
df_gapminder = pd.read_csv('../data/gapminder.csv')

gdp_cap = df_gapminder['gdp_cap'].values
life_exp = df_gapminder['life_exp'].values

# Make a scatter plot
plt.scatter(gdp_cap, life_exp)

# Put the x-axis on a logarithmic scale
plt.xscale('log')

# Show plot
plt.show()
gdp_cap

#### Compare Arrays 

In [None]:
import numpy as np

my_house = np.array([18.0, 20.0, 10.75, 9.50])
your_house = np.array([14.0, 24.0, 14.25, 9.0])

# my_house greater than or equal to 18
print(my_house >= 18)

# my_house less than your_house
print(my_house < your_house)

#### Histogram 
life_exp is the array containing data on the life expectancy for different countries in 2007. To see how life expectancy in different countries is distributed, let's create a histogram of life_exp.

In [None]:
# Create histogram of life_exp data
plt.hist(life_exp)

# Display histogram
plt.show()

In the previous exercise, you didn't specify the number of bins. In that case, Matplotlib uses 10 bins by default. The number of bins is pretty important: too few bins will oversimplify reality and won't show you the details, while too many bins will overcomplicate reality and won't show the bigger picture.

To control how many bins your data is divided into, you can set the bins argument.

In [None]:
# Build histogram with 5 bins
plt.hist(life_exp, bins=5)
plt.show()

# Build histogram with 20 bins
plt.hist(life_exp, bins=20)
plt.show()
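
To inspect the effect of bins numerically rather than visually, one option is `np.histogram`, which returns the counts and bin edges without drawing anything. This is a sketch only: the values in `life_exp_demo` are made up for illustration and are not the gapminder data.

```python
import numpy as np

# Hypothetical life expectancy values, for illustration only
life_exp_demo = np.array([43.8, 50.7, 56.7, 59.4, 64.1,
                          70.2, 72.3, 74.9, 78.5, 80.7])

# Counts per bin and the bin edges for 5 and for 20 bins
counts_5, edges_5 = np.histogram(life_exp_demo, bins=5)
counts_20, edges_20 = np.histogram(life_exp_demo, bins=20)

# Fewer bins: larger counts per bin; more bins: many sparse or empty bins
print(counts_5)
print(counts_20)
```

Each call returns one more edge than there are bins, and the counts always add up to the number of observations.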


#### Labels and Ticks 
It's time to customize your own plot. This is the fun part: you will see your plot come to life!

You're going to work on the scatter plot with world development data: GDP per capita on the x-axis (logarithmic scale), life expectancy on the y-axis. The code for this plot is available in the script.

As a first step, let's add axis labels, title, and ticks to the plot. You can do this with the xlabel(), ylabel(), title(), xticks(), and yticks() functions, available in matplotlib.pyplot.

In [None]:
# Basic scatter plot, log scale
plt.scatter(gdp_cap, life_exp)
plt.xscale('log') 

# Add axis labels
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')

# Add title
plt.title('World Development in 2007')

# Definition of tick_val and tick_lab
tick_val = [1000, 10000, 100000]
tick_lab = ['1k', '10k', '100k']

# Adapt the ticks on the x-axis
plt.xticks(tick_val, tick_lab)

# After customizing, display the plot
plt.show()

#### <span style="color:yellow">Colors and Size</span>
Right now, the scatter plot is just a cloud of blue dots, indistinguishable from each other. Let's change this. Wouldn't it be nice if the size of each dot corresponded to the country's population and its color to the continent?

In [None]:
import numpy as np

colors_dict = {
    'Asia':'red',
    'Europe':'green',
    'Africa':'blue',
    'Americas':'yellow',
    'Oceania':'black'
}

# Create a new column colors by mapping each continent to a color
df_gapminder['colors'] = df_gapminder['cont'].map(colors_dict)

# Specify c, s and alpha inside plt.scatter()
plt.scatter(x = gdp_cap, y = life_exp, alpha=0.8, c=df_gapminder.colors, s=df_gapminder['population'] / 1e6)

# Previous customizations
plt.xscale('log') 
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')
plt.xticks([1000,10000,100000], ['1k','10k','100k'])

# Add grid
plt.grid(True)

# Show the plot
plt.show()


# For Loops
#### Index and values 

Using a for loop to iterate over a list gives you each list element in turn, one per run. If you also need the index, that is, the position of the current element in the list, you can use enumerate().

In [None]:
# areas list
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Change for loop to use enumerate() and update print()
for index,value in enumerate(areas) :
    print(f'room {index}: {value}')
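
Because enumerate() counts from 0 by default, the first room above is labelled "room 0". A small sketch of one way to start the count at 1, using the `start` parameter of enumerate() (the list name `labels` is only illustrative):

```python
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# start=1 shifts the counter only; the values are unchanged
labels = [f'room {index}: {value}' for index, value in enumerate(areas, start=1)]

for line in labels:
    print(line)
```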

#### Loop over list of lists
Remember the house variable from the Intro to Python course? Have a look at its definition in the script. It's basically a list of lists, where each sublist contains the name and area of a room in your house.

In [None]:
# house list of lists
house = [["hallway", 11.25], 
         ["kitchen", 18.0], 
         ["living room", 20.0], 
         ["bedroom", 10.75], 
         ["bathroom", 9.50]]
         
# Build a for loop from scratch
for h in house:
    print(f'the {h[0]} is {h[1]} sqm')

#### Loop over dictionary
In Python 3, looping over a dictionary directly yields only its keys. To get both the key and the value on every iteration, use the items() method:

In [None]:
# Definition of dictionary
europe = {'spain':'madrid', 
          'france':'paris', 
          'germany':'berlin',
          'norway':'oslo', 
          'italy':'rome', 
          'poland':'warsaw', 
          'austria':'vienna' }
          
# Iterate over europe
for key, val in europe.items():
    print(f'the capital of {key} is {val}')
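
The difference between looping over the dictionary itself and over items() can be sketched with a shortened version of the dictionary (the names `keys_only` and `pairs` are illustrative):

```python
# Shortened version of the europe dictionary
europe = {'spain': 'madrid', 'france': 'paris'}

# Iterating over the dictionary itself yields only the keys
keys_only = [key for key in europe]

# Iterating over items() yields (key, value) tuples
pairs = [(key, val) for key, val in europe.items()]

print(keys_only)
print(pairs)
```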

#### Loop over NumPy array
If you're dealing with a 1D NumPy array, looping over all elements can be as simple as:

In [None]:
height = pd.read_csv('../data/height.csv', header=None).values.flatten()
for i in height:
    print(i)


If you're dealing with a 2D NumPy array, it's more complicated. A 2D array is built up of multiple 1D arrays, and a plain for loop yields those 1D arrays one at a time. To explicitly iterate over all **separate** elements of a multi-dimensional array, you'll need this syntax:

In [None]:
# house mixes strings and floats, so NumPy coerces every element to a string
np_house = np.array(house)
for i in np.nditer(np_house):
    print(i)
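
The contrast between a plain loop and np.nditer() is easier to see on a purely numeric array. A minimal sketch, assuming a small made-up 2D array `np_2d`:

```python
import numpy as np

# Hypothetical 2x2 numeric array, for illustration only
np_2d = np.array([[1.73, 1.68],
                  [65.4, 59.2]])

# A plain for loop over a 2D array yields one 1D row per iteration
rows = [row for row in np_2d]

# np.nditer visits every individual element
elements = [float(x) for x in np.nditer(np_2d)]

print(len(rows), rows[0].shape)
print(elements)
```

Here the plain loop runs twice (once per row), while np.nditer() runs four times (once per element).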

#### <span style="color:yellow">Loop over Pandas DataFrame</span>
Iterating over a Pandas DataFrame is typically done with the iterrows() method. Used in a for loop, every observation is iterated over and on every iteration the row label and actual row contents are available:

In [None]:
# Iterate over rows of cars
for lab, val in cars.iterrows():
    print(f'{lab}\n{val}\n')

The row data that's generated by iterrows() on every run is a Pandas Series. This format is not very convenient to print out. Luckily, you can easily select variables from the Pandas Series using square brackets:

In [None]:
# Adapt for loop
for lab, row in cars.iterrows() :
    print(f'{lab} - cars per capita: {row["cars_per_cap"]}')

##### *Add column*

In [None]:
# Code for loop that adds COUNTRY column
for lab, row in cars.iterrows():
    cars.loc[lab, 'COUNTRY'] = row['country'].upper()


# Print cars
print(cars)

Using iterrows() to iterate over every observation of a Pandas DataFrame is easy to understand, but not very efficient. On every iteration, you're creating a new Pandas Series.

If you want to add a column to a DataFrame by calling a function on another column, the iterrows() method in combination with a for loop is not the preferred way to go. Instead, you'll want to use apply().

In [None]:
# Use .apply(str.upper)
cars['COUNTRY'] = cars['country'].apply(str.upper)
print(cars)
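
For string columns specifically, pandas also offers vectorized string methods through the `.str` accessor, which can replace apply(str.upper) here. A minimal sketch on a hypothetical three-row `cars_demo` DataFrame:

```python
import pandas as pd

# Hypothetical three-row subset of the cars data, for illustration
cars_demo = pd.DataFrame(
    {'country': ['United States', 'Australia', 'Japan']},
    index=['US', 'AUS', 'JAP'],
)

# Vectorized alternative to .apply(str.upper)
cars_demo['COUNTRY'] = cars_demo['country'].str.upper()

# The apply() version from above, for comparison
via_apply = cars_demo['country'].apply(str.upper)

print(cars_demo)
```

Both versions should produce the same values; apply() remains the more general tool when the function is not one of the built-in string methods.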