### Session 4. 
Let's start with a refresher on tabular data.

First of all, the simple things.

#### How do we import Pandas and NumPy?
- The common practice is to import the ```pandas``` module with the alias ```pd```.
- Likewise, the ```numpy``` module is imported with the alias ```np```.

In [1]:
import pandas as pd
import numpy as np

#### But wait... what are these things actually?
- So far we have seen lists, dictionaries and even JSON formats.
- Pandas uses a familiar tabular format that organizes data into rows and columns.

The following example shows why that tabular format makes things clearer.
I don't expect you to code it.

#### Let's say you want an organized format for your data. In a list it would look like this:

In [2]:
tech_leaders = ["Tim Cook", "Elon Musk", "Mark Zuckerberg"]
print(tech_leaders)

['Tim Cook', 'Elon Musk', 'Mark Zuckerberg']


- Let's say we want to calculate their average age.
- We can certainly add another list with their ages.

In [3]:
ages = [62, 51, 38]

But what is the relationship here? These are just separate lists.
We could turn them into a dictionary with the zip function and calculate the average by unpacking the values, summing them, and dividing by the length of the dictionary.

In [4]:
tech_leaders_dict = dict(zip(tech_leaders, ages))
print(tech_leaders_dict)
print(sum([*tech_leaders_dict.values()])/len(tech_leaders_dict))

{'Tim Cook': 62, 'Elon Musk': 51, 'Mark Zuckerberg': 38}
50.333333333333336


What if we now add their net worth?

In [5]:
net_worth = [1.8, 265, 127]

... as you can see, it gets complicated quickly.
We would need a more complicated data structure, like a nested dictionary, to work with the data. However, if our goal is analytics, we want tools that make it easier to work with data.
 
Pandas is exactly that. It helps to abstract away some of the programming required to work with large datasets. 
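
To make the contrast concrete, here is a minimal sketch of the nested-dictionary approach mentioned above. The inner keys `"age"` and `"net_worth"` are names chosen for this illustration. Note that every new question needs its own hand-written loop or lambda.

```python
# Nested dictionary: one inner dictionary per person.
# The inner key names ("age", "net_worth") are chosen for this sketch.
tech_leaders_nested = {
    "Tim Cook": {"age": 62, "net_worth": 1.8},
    "Elon Musk": {"age": 51, "net_worth": 265},
    "Mark Zuckerberg": {"age": 38, "net_worth": 127},
}

# Average age: pull "age" out of every inner dictionary by hand.
average_age = sum(
    person["age"] for person in tech_leaders_nested.values()
) / len(tech_leaders_nested)

# Name of the person with the highest net worth: another custom lookup.
richest = max(
    tech_leaders_nested,
    key=lambda name: tech_leaders_nested[name]["net_worth"],
)

print(average_age)
print(richest)
```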

#### How would we do it with Pandas?
- First we need to define a DataFrame.
- We do it by passing in the data for our first column and giving that column a name.


In [6]:
data = pd.DataFrame(tech_leaders, columns=["tech_leaders"])
data

| | tech_leaders |
|---|---|
| 0 | Tim Cook |
| 1 | Elon Musk |
| 2 | Mark Zuckerberg |


#### What if we have more than one column?

In [7]:
# One way to achieve this is a dictionary of lists, one list per column:
dict_of_tech_leaders = {
    "tech_leaders": tech_leaders,
    "ages": ages,
    "net_worth": net_worth
}

df = pd.DataFrame(dict_of_tech_leaders)
df

| | tech_leaders | ages | net_worth |
|---|---|---|---|
| 0 | Tim Cook | 62 | 1.8 |
| 1 | Elon Musk | 51 | 265.0 |
| 2 | Mark Zuckerberg | 38 | 127.0 |
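
Coming back to the average age from earlier: with the data in a DataFrame, the same questions become one-liners. This is a small sketch that rebuilds the DataFrame so it runs on its own. `mean()` and `idxmax()` are standard Pandas methods, while the variable names `average_age` and `richest` are just picked for this example.

```python
import pandas as pd

# Same data as above, rebuilt here so the cell is self-contained.
df = pd.DataFrame({
    "tech_leaders": ["Tim Cook", "Elon Musk", "Mark Zuckerberg"],
    "ages": [62, 51, 38],
    "net_worth": [1.8, 265, 127],
})

# Average of a whole column, no manual unpacking needed.
average_age = df["ages"].mean()

# idxmax() returns the index label of the largest value in the column;
# .loc then looks up the name stored in that row.
richest = df.loc[df["net_worth"].idxmax(), "tech_leaders"]

print(average_age)
print(richest)
```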


### What if we want to work with flat files such as CSVs?
- Pandas provides a function for that: pd.read_csv()
- It takes the file path (or a file-like object) as its first argument, plus numerous optional parameters that I encourage you to look into.
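
As a rough sketch of how `pd.read_csv()` is called: the example below reads CSV text from an in-memory string via `io.StringIO`, so it runs without any file on disk. The CSV content is made up for this illustration. With a real file you would pass its path instead. `usecols` and `nrows` are two of the optional parameters.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk (content made up for this sketch).
csv_text = """tech_leaders,ages,net_worth
Tim Cook,62,1.8
Elon Musk,51,265
Mark Zuckerberg,38,127
"""

# read_csv accepts a path or a file-like object as its first argument.
df_from_csv = pd.read_csv(io.StringIO(csv_text))
print(df_from_csv)

# Optional parameters: keep only some columns and only the first rows.
first_two = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["tech_leaders", "ages"],
    nrows=2,
)
print(first_two)
```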

#### Exercise 1 - Descriptive Analysis
- For that, we will use the forbes_billionaires dataset. 
1. What is the shape of the dataframe?
2. What columns do we have? Show them as a list.
3. What are the datatypes of the columns?
4. Take a look at the top and bottom rows. How would you do it for the top 10 and the bottom 10?
5. Look at the 99th row.
6. Describe the dataframe with the standard deviation, max, min, and mean of the numerical values.
7. Are there any NA values? If yes, drop them.
8. What is the shape after dropping the values?
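
If you get stuck, the sketch below shows the kind of DataFrame attributes and methods these questions call for, applied to a tiny made-up DataFrame rather than the real dataset. The column names and values are hypothetical. The forbes_billionaires columns may be named differently.

```python
import numpy as np
import pandas as pd

# Tiny made-up DataFrame with two missing values (hypothetical data).
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "age": [62, 51, np.nan, 38],
    "net_worth": [1.8, 265.0, 127.0, np.nan],
})

print(toy.shape)              # (number of rows, number of columns)
print(toy.columns.tolist())   # column names as a plain list
print(toy.dtypes)             # datatype of each column
print(toy.head(2))            # first 2 rows; head(10) gives the top 10
print(toy.tail(2))            # last 2 rows; tail(10) gives the bottom 10
print(toy.iloc[1])            # row at position 1 (positions start at 0)
print(toy.describe())         # count, mean, std, min, quartiles, max
print(toy.isna().sum())       # number of missing values per column

# dropna() returns a new DataFrame without rows containing missing values.
toy_clean = toy.dropna()
print(toy_clean.shape)
```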

#### Exercise 2 - Simple operations
1. How many people in the dataset are from Spain?
2. Who is the richest in Spain? Who are the top 3 richest in Spain? 
3. What is the average age in Spain?
4. Who is the youngest? Who is the oldest in the whole dataset?
5. How many people in the dataset are under 30 years old? Show them.
6. Who are the 5 oldest in the list?
7. Print on each line the name of the five oldest in the list with their wealth and rank. (e.g. George Joseph is ranked 1580 in the world with $2 B of wealth). Order them by their rank.
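
As a hint for the techniques involved (filtering with a condition, sorting, and looping over rows), here is a sketch on a small made-up DataFrame. All names, values, and column names (`name`, `country`, `age`, `net_worth`, `rank`) are hypothetical, so check the real column names in the dataset first.

```python
import pandas as pd

# Small made-up DataFrame (hypothetical people and column names).
toy = pd.DataFrame({
    "name": ["Person A", "Person B", "Person C", "Person D"],
    "country": ["Spain", "France", "Spain", "Spain"],
    "age": [70, 29, 45, 81],
    "net_worth": [3.5, 1.2, 8.0, 2.0],
    "rank": [900, 2300, 400, 1580],
})

# Filtering: keep only the rows where a condition is True.
spain = toy[toy["country"] == "Spain"]
print(len(spain))
print(spain["age"].mean())

# nlargest(n, column) returns the n rows with the largest values.
richest_in_spain = spain.nlargest(2, "net_worth")
print(richest_in_spain)

# Another condition: everyone under 30.
under_30 = toy[toy["age"] < 30]
print(under_30)

# The 2 oldest people, then ordered by their rank.
oldest = toy.nlargest(2, "age").sort_values("rank")

# itertuples() yields one row at a time; columns become attributes.
lines = [
    f"{row.name} is ranked {row.rank} in the world "
    f"with ${row.net_worth} B of wealth"
    for row in oldest.itertuples()
]
for line in lines:
    print(line)
```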