### Session 4. 
Let's start with a refresher on tabular data.

First of all, the simple things.

#### How to import Pandas and NumPy?
- The common practice is to import the ```pandas``` module with the alias ```pd```.
- Likewise, the ```numpy``` module is imported with the alias ```np```.

In [1]:
import pandas as pd
import numpy as np

#### But wait.. what are these things actually?
- So far we have seen lists, dictionaries and even JSON formats.
- Pandas builds on a familiar tabular format that organizes data into rows and columns.

The following example shows why that tabular format is useful.
I don't expect you to code it.

#### Let's say you want to have an organized format for your data. In a list it would look like this:

In [2]:
tech_leaders = ["Tim Cook", "Elon Musk", "Mark Zuckerberg"]
print(tech_leaders)

['Tim Cook', 'Elon Musk', 'Mark Zuckerberg']


- Let's say we want to calculate their average age.
- We can add another list with age for sure. 

In [3]:
ages = [62, 51, 38]

But what is the relationship here? These are just separate lists..
We could turn them into a dictionary with the zip function and calculate the average by summing the values and dividing by the length of the dictionary.

In [4]:
tech_leaders_dict = dict(zip(tech_leaders, ages))
print(tech_leaders_dict)
print(sum([*tech_leaders_dict.values()])/len(tech_leaders_dict))

{'Tim Cook': 62, 'Elon Musk': 51, 'Mark Zuckerberg': 38}
50.333333333333336


What if we add their net worth now?

In [5]:
net_worth = [1.8, 265, 127]

... as you can see, it quickly gets complicated.
We would need a more involved data structure, such as a nested dictionary, to keep everything together. However, if our goal is analytics, we want tools that make it easier to work with data.
 
Pandas is exactly that. It helps to abstract away some of the programming required to work with large datasets. 
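
To make that concrete, here is a rough sketch of what the nested-dictionary approach could look like for the three lists above. The name `leaders` and the helper `average` are made up for illustration; they are not part of the session's code.

```python
# Hypothetical nested dictionary built from the three lists above.
tech_leaders = ["Tim Cook", "Elon Musk", "Mark Zuckerberg"]
ages = [62, 51, 38]
net_worth = [1.8, 265, 127]

# One inner dictionary per person, keyed by name.
leaders = {
    name: {"age": age, "net_worth": worth}
    for name, age, worth in zip(tech_leaders, ages, net_worth)
}


def average(records, field):
    """Average one field across all inner dictionaries (illustrative helper)."""
    values = [record[field] for record in records.values()]
    return sum(values) / len(values)


print(average(leaders, "age"))
print(average(leaders, "net_worth"))
```

Every new question (a filter, a sort, a second statistic) needs another hand-written loop like this one, which is the bookkeeping Pandas takes over.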

#### How would we do it with Pandas?
- We first need to define a DataFrame.
- We do that by passing in our first column and giving it a column name.


In [6]:
data = pd.DataFrame(tech_leaders, columns=["tech_leaders"])
data

Unnamed: 0,tech_leaders
0,Tim Cook
1,Elon Musk
2,Mark Zuckerberg


#### What if we have more than one column?

In [7]:
# We can achieve this for example with: 
dict_of_tech_leaders = {
    "tech_leaders": tech_leaders,
    "ages": ages,
    "net_worth": net_worth
}

df = pd.DataFrame(dict_of_tech_leaders)
df

Unnamed: 0,tech_leaders,ages,net_worth
0,Tim Cook,62,1.8
1,Elon Musk,51,265.0
2,Mark Zuckerberg,38,127.0
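
Coming back to the average-age question from earlier: with the data in a DataFrame, the calculation could be sketched roughly as below. The block rebuilds the same three-row frame so it runs on its own; the variable names `average_age` and `average_net_worth` are illustrative.

```python
import pandas as pd

# Same three-row frame as above, rebuilt so the snippet is self-contained.
df = pd.DataFrame({
    "tech_leaders": ["Tim Cook", "Elon Musk", "Mark Zuckerberg"],
    "ages": [62, 51, 38],
    "net_worth": [1.8, 265, 127],
})

# Selecting a column gives a Series, which has a built-in mean() method.
average_age = df["ages"].mean()
average_net_worth = df["net_worth"].mean()

print(average_age)
print(average_net_worth)
```

No zipping or unpacking is needed, and adding another column does not change how the existing ones are accessed.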


### What if we want to work with flat files such as CSVs?
- Pandas provides a function for that -> pd.read_csv()
- It takes the file path as its first argument, along with numerous optional parameters that I encourage you to look into.
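
As a small sketch of a few of those optional parameters: the block below reads from an in-memory string instead of a file on disk so that it runs anywhere. The semicolon-separated text is an assumption made up for the example (the real file used in this session is read with the defaults); the parameters `sep`, `usecols` and `nrows` are standard `pd.read_csv()` options.

```python
import io

import pandas as pd

# Stand-in for a file on disk; semicolons are used only to demonstrate `sep`.
csv_text = """Name;NetWorth;Country;Age
Jeff Bezos;$177 B;United States;57
Elon Musk;$151 B;United States;49
Bernard Arnault & family;$150 B;France;72
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",                             # column delimiter (the default is ",")
    usecols=["Name", "Country", "Age"],  # load only these columns
    nrows=2,                             # read only the first 2 data rows
)

print(sample)
```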

#### Exercise 1 - Descriptive Analysis
- For that, we will use the forbes_billionaires dataset. 
1. What is the shape of the dataframe?
2. What columns do we have? Show it in a list. 
3. What are the datatypes of the columns?
4. Take a look at the top and bottom rows - How would you do it for top 10, bottom 10? 
5. Look at the 99th row.
6. Describe the dataframe with standard deviation, max, min, mean of the numerical values.
7. Are there any NA values? If yes, drop them.
8. What is the shape after dropping the values?

## Exercise 1 - Solutions

In [8]:
import os 
print(os.getcwd())

C:\Users\ragna\ie_programming_lab


#### 1.0 Reading the CSV file with pd.read_csv()

In [9]:
df = pd.read_csv("forbes_billionaires.csv")

####  1.1 .shape attribute

In [10]:
# This dataset has 2755 rows and 7 columns
df.shape

(2755, 7)

####  1.2 .columns attribute

In [11]:
df.columns # gets you the columns

# You can do all sorts of operations with that
for idx, col in enumerate(df.columns):
    print(str(idx) + "th column is " + col)
    
# To get it in a list, we simply convert the type.
list(df.columns)

0th column is Name
1th column is NetWorth
2th column is Country
3th column is Source
4th column is Rank
5th column is Age
6th column is Industry


['Name', 'NetWorth', 'Country', 'Source', 'Rank', 'Age', 'Industry']

####  1.3 .dtypes attribute

In [12]:
df.dtypes

Name         object
NetWorth     object
Country      object
Source       object
Rank          int64
Age         float64
Industry     object
dtype: object

####  1.4 .head(), tail() methods

In [13]:
# To get the top 10 rows (without an argument, head() returns the top 5)
df.head(10)

Unnamed: 0,Name,NetWorth,Country,Source,Rank,Age,Industry
0,Jeff Bezos,$177 B,United States,Amazon,1,57.0,Technology
1,Elon Musk,$151 B,United States,"Tesla, SpaceX",2,49.0,Automotive
2,Bernard Arnault & family,$150 B,France,LVMH,3,72.0,Fashion & Retail
3,Bill Gates,$124 B,United States,Microsoft,4,65.0,Technology
4,Mark Zuckerberg,$97 B,United States,Facebook,5,36.0,Technology
5,Warren Buffett,$96 B,United States,Berkshire Hathaway,6,90.0,Finance & Investments
6,Larry Ellison,$93 B,United States,software,7,76.0,Technology
7,Larry Page,$91.5 B,United States,Google,8,48.0,Technology
8,Sergey Brin,$89 B,United States,Google,9,47.0,Technology
9,Mukesh Ambani,$84.5 B,India,diversified,10,63.0,Diversified


In [14]:
# To get the bottom 5 rows
df.tail()

Unnamed: 0,Name,NetWorth,Country,Source,Rank,Age,Industry
2750,Daniel Yong Zhang,$1 B,China,e-commerce,2674,49.0,Technology
2751,Zhang Yuqiang,$1 B,China,Fiberglass,2674,65.0,Manufacturing
2752,Zhao Meiguang,$1 B,China,gold mining,2674,58.0,Metals & Mining
2753,Zhong Naixiong,$1 B,China,conglomerate,2674,58.0,Diversified
2754,Zhou Wei family,$1 B,China,Software,2674,54.0,Technology


####  1.5 .iloc indexer

In [15]:
# To get the 99th row (positions are zero-based, so it sits at position 98):
df.iloc[98]

Name                    Alisher Usmanov
NetWorth                        $18.4 B
Country                          Russia
Source      steel, telecom, investments
Rank                                 99
Age                                67.0
Industry                Metals & Mining
Name: 98, dtype: object
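
One thing worth keeping in mind: `.iloc` selects by position, while `.loc` selects by index label. In a freshly loaded frame the two coincide, but they drift apart once rows are removed. The toy frame below (names "A" to "D" are placeholders, not from the dataset) is a minimal sketch of that difference.

```python
import pandas as pd

# Toy frame with one missing age; the names are placeholders.
toy = pd.DataFrame(
    {"Name": ["A", "B", "C", "D"], "Age": [57.0, None, 72.0, 65.0]}
)

# Dropping the row labelled 1 leaves the index labels 0, 2, 3.
cleaned = toy.dropna()

by_position = cleaned.iloc[1]  # second remaining row, whatever its label
by_label = cleaned.loc[2]      # row whose index label is 2

print(list(cleaned.index))
print(by_position["Name"], by_label["Name"])
```

Here position 1 and label 2 point at the same row. If positional access after filtering becomes confusing, `reset_index(drop=True)` renumbers the labels from zero.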

####  1.6 .describe() method

In [16]:
df.describe()

Unnamed: 0,Rank,Age
count,2755.0,2676.0
mean,1345.663521,63.113602
std,772.669811,13.445153
min,1.0,18.0
25%,680.0,54.0
50%,1362.0,63.0
75%,2035.0,73.0
max,2674.0,99.0


####  1.7 .isna() and .dropna() methods

In [17]:
# For that, we will use the isna() Pandas method.
df.isna()

# However, that returns a boolean for every cell, telling us whether that value is NA.
# Keep in mind that False == 0 and True == 1.
# This means that summing the values in each column tells us how many missing values it has.

print(False == 0, True == 1)

df.isna().sum()

True True


Name         0
NetWorth     0
Country      0
Source       0
Rank         0
Age         79
Industry     0
dtype: int64

In [18]:
df = df.dropna()

In [19]:
df.shape

(2676, 7)
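
Called without arguments, `dropna()` removes a row if any of its columns is missing. When only some columns matter, the `subset` parameter narrows the check. The sketch below uses a made-up three-row frame to show both behaviours next to the `isna().sum()` count.

```python
import pandas as pd

# Made-up frame with one gap in Age and one in Industry.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [57.0, None, 72.0],
    "Industry": ["Technology", "Automotive", None],
})

# True counts as 1, so the column sums are the number of missing values.
missing_per_column = toy.isna().sum()

# Only rows with a missing Age are dropped; gaps elsewhere are tolerated.
only_age_required = toy.dropna(subset=["Age"])

# Any missing value in any column drops the row.
all_required = toy.dropna()

print(missing_per_column)
print(only_age_required.shape, all_required.shape)
```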

#### Exercise 2 - Simple operations
1. How many people in the dataset are from Spain?
2. Who is the richest in Spain? Who are the top 3 richest in Spain? 
3. What is the average age in Spain?
4. Who is the youngest? Who is the oldest in the whole dataset?
5. How many people are in the dataset who are below 30 years old? Show them. 
6. Who are the 5 oldest in the list?
7. Print on each line the name of the five oldest in the list with their wealth and rank. (e.g. George Joseph is ranked 1580 in the world with $2 B of wealth). Order them by their rank.

## Solutions

####  2.1 Number of people from Spain

In [34]:
# We can make our lives easier by creating a dataframe from our existing one. 
spain_df = df[df['Country'] == 'Spain']

In [21]:
# Using len()
len(df[df['Country'] == 'Spain'])

# or 
len(spain_df)

28

In [32]:
# count() and shape work too: count() gives the non-NA count per column, shape[0] the number of rows
df[df['Country'] == 'Spain'].count()
df[df['Country'] == 'Spain'].shape[0]

28
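
Because `True == 1`, the boolean mask can also be summed directly, without building the filtered frame first. The sketch below shows that, plus `value_counts()` for getting the count of every country at once. The four-row frame is a reduced stand-in for the dataset.

```python
import pandas as pd

# Reduced stand-in for the dataset, using names that appear in it.
toy = pd.DataFrame({
    "Name": ["Amancio Ortega", "Jeff Bezos", "Juan Roig", "Bernard Arnault & family"],
    "Country": ["Spain", "United States", "Spain", "France"],
})

# The comparison yields a boolean Series; summing it counts the True values.
is_spain = toy["Country"] == "Spain"
spain_count = int(is_spain.sum())

# Counts for every country in one call, most frequent first.
per_country = toy["Country"].value_counts()

print(spain_count)
print(per_country)
```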

####  2.2 Top 3 in Spain

In [23]:
# Top 3 - the dataset is already sorted by Rank, so .head() on the filtered rows gives the three richest.
df[df['Country'] == 'Spain'].head(3)

Unnamed: 0,Name,NetWorth,Country,Source,Rank,Age,Industry
10,Amancio Ortega,$77 B,Spain,Zara,11,85.0,Fashion & Retail
348,Sandra Ortega Mera,$7.3 B,Spain,Zara,344,52.0,Fashion & Retail
697,Juan Roig,$4.2 B,Spain,supermarkets,680,71.0,Fashion & Retail
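
Using `.head(3)` works here because the rows already arrive in rank order. If that cannot be assumed, one possible approach is to ask for the smallest ranks explicitly with `nsmallest()`. The sketch below shuffles a few rows from the dataset on purpose to show that the result no longer depends on row order.

```python
import pandas as pd

# Rows taken from the dataset, deliberately out of rank order.
toy = pd.DataFrame({
    "Name": ["Juan Roig", "Jeff Bezos", "Amancio Ortega", "Sandra Ortega Mera"],
    "Country": ["Spain", "United States", "Spain", "Spain"],
    "Rank": [680, 1, 11, 344],
})

spain = toy[toy["Country"] == "Spain"]

# The three rows with the lowest Rank, i.e. the three richest.
top_3 = spain.nsmallest(3, "Rank")

# The single richest person is the first of those.
richest = top_3.iloc[0]["Name"]

print(top_3)
print(richest)
```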


####  2.3 Average age in Spain

In [24]:
# To get the average age in Spain: 
spain_df['Age'].mean()

df[df['Country'] == 'Spain']['Age'].mean()
# or 
df[df['Country'] == 'Spain'].Age.mean()

68.82142857142857

####  2.4 Oldest and youngest

In [41]:
# Youngest
df[df['Age'] == df['Age'].min()]
df[df['Age'] == 18.0]

# sort_values
df.sort_values(by='Age').head(1)

Unnamed: 0,Name,NetWorth,Country,Source,Rank,Age,Industry
940,Kevin David Lehmann,$3.3 B,Germany,drugstores,925,18.0,Fashion & Retail


In [39]:
# Oldest
df[df['Age'] == df['Age'].max()]

# sort_values
df.sort_values(by='Age', ascending = False).head(1)

Unnamed: 0,Name,NetWorth,Country,Source,Rank,Age,Industry
1611,George Joseph,$2 B,United States,insurance,1580,99.0,Finance & Investments
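
Another option, sketched below on a three-row stand-in, is `idxmin()` / `idxmax()`, which return the index label of the smallest or largest value so the row can be fetched with `.loc`. Note that they return only the first match when several people share the same age, whereas the boolean filter above returns all of them.

```python
import pandas as pd

# Three rows taken from the dataset.
toy = pd.DataFrame({
    "Name": ["Kevin David Lehmann", "George Joseph", "Jeff Bezos"],
    "Age": [18.0, 99.0, 57.0],
})

# idxmin/idxmax give the index label; .loc fetches the matching row.
youngest = toy.loc[toy["Age"].idxmin()]
oldest = toy.loc[toy["Age"].idxmax()]

print(youngest["Name"], oldest["Name"])
```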


####  2.5 Under 30 years old

In [46]:
# Using len
len(df[df['Age'] < 30])

10

In [48]:
df[df['Age'] < 30]

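# Note: NetWorth is stored as text (e.g. '$8.7 B'), so sorting by it orders the strings alphabetically, not by value.
#df[df['Age'] < 30].sort_values(by="NetWorth", ascending = False)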

Unnamed: 0,Name,NetWorth,Country,Source,Rank,Age,Industry
274,Sam Bankman-Fried,$8.7 B,United States,cryptocurrency,274,29.0,Finance & Investments
661,Gustav Magnar Witzoe,$4.4 B,Norway,fish farming,655,27.0,Food & Beverage
940,Kevin David Lehmann,$3.3 B,Germany,drugstores,925,18.0,Fashion & Retail
1328,Jonathan Kwok,$2.4 B,Hong Kong,Real Estate,1299,29.0,Real Estate
1338,Austin Russell,$2.4 B,United States,★,1299,26.0,Automotive
1596,Andy Fang,$2 B,United States,food delivery app,1580,28.0,Technology
1645,Stanley Tang,$2 B,United States,food delivery app,1580,28.0,Technology
2122,Wang Zelong,$1.5 B,China,chemicals,2035,24.0,Metals & Mining
2143,Alexandra Andresen,$1.4 B,Norway,investments,2141,24.0,Diversified
2144,Katharina Andresen,$1.4 B,Norway,investments,2141,25.0,Diversified
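
Since `NetWorth` has dtype `object` (text), sorting by it compares strings character by character. A rough sketch of one way around that is to derive a numeric column first. It assumes every value has the form `$<number> B`, which holds for the rows shown in this session but has not been checked for the whole file; the column name `NetWorthBillions` is made up.

```python
import pandas as pd

# A few rows taken from the dataset.
toy = pd.DataFrame({
    "Name": ["Sam Bankman-Fried", "Kevin David Lehmann", "Wang Zelong", "Robert Kuok"],
    "NetWorth": ["$8.7 B", "$3.3 B", "$1.5 B", "$12.6 B"],
})

# Assumption: every value looks like "$<number> B".
toy["NetWorthBillions"] = (
    toy["NetWorth"]
    .str.replace("$", "", regex=False)   # strip the currency sign
    .str.replace(" B", "", regex=False)  # strip the unit
    .astype(float)
)

as_text = toy.sort_values(by="NetWorth", ascending=False)
as_number = toy.sort_values(by="NetWorthBillions", ascending=False)

print(as_text["Name"].tolist())
print(as_number["Name"].tolist())
```

The text sort puts `$8.7 B` ahead of `$12.6 B` because "8" sorts after "1"; the numeric sort puts the largest fortune first.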


####  2.6 Five oldest billionaires

In [52]:
df.sort_values(by='Age', ascending = False).head()

Unnamed: 0,Name,NetWorth,Country,Source,Rank,Age,Industry
1611,George Joseph,$2 B,United States,insurance,1580,99.0,Finance & Investments
1626,Charles Munger,$2 B,United States,Berkshire Hathaway,1580,97.0,Finance & Investments
170,Robert Kuok,$12.6 B,Malaysia,"palm oil, shipping, property",171,97.0,Diversified
1559,David Murdock,$2.1 B,United States,"Dole, real estate",1517,97.0,Food & Beverage
729,Masatoshi Ito,$4 B,Japan,retail,727,96.0,Fashion & Retail


####  2.7 Five oldest as formatted strings

In [30]:
five_oldest = df.sort_values(by='Age', ascending = False).head()
five_oldest_ordered = five_oldest.sort_values(by='Rank')
# .values yields each row as an array: position 0 is Name, 1 is NetWorth, 4 is Rank
for person in five_oldest_ordered.values:
    print(person[0] + " is ranked " + str(person[4]) + " in the world with "+ person[1] + " of wealth")

Robert Kuok is ranked 171 in the world with $12.6 B of wealth
Masatoshi Ito is ranked 727 in the world with $4 B of wealth
David Murdock is ranked 1517 in the world with $2.1 B of wealth
George Joseph is ranked 1580 in the world with $2 B of wealth
Charles Munger is ranked 1580 in the world with $2 B of wealth
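
The positional lookups (`person[0]`, `person[4]`) break silently if the column order ever changes. A possible alternative, sketched below, is `itertuples()`, which exposes each row's columns by name, combined with an f-string. The frame is rebuilt from the five rows above so the block runs on its own, and `kind="stable"` is added so that tied ranks keep their original order.

```python
import pandas as pd

# The five oldest, copied from the output of 2.6.
five_oldest = pd.DataFrame({
    "Name": ["George Joseph", "Charles Munger", "Robert Kuok",
             "David Murdock", "Masatoshi Ito"],
    "NetWorth": ["$2 B", "$2 B", "$12.6 B", "$2.1 B", "$4 B"],
    "Rank": [1580, 1580, 171, 1517, 727],
})

# A stable sort keeps rows with equal Rank in their original order.
ordered = five_oldest.sort_values(by="Rank", kind="stable")

# itertuples() yields one namedtuple per row, with columns as attributes.
lines = [
    f"{row.Name} is ranked {row.Rank} in the world with {row.NetWorth} of wealth"
    for row in ordered.itertuples(index=False)
]

print("\n".join(lines))
```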
