
# **ISDI DMBA**
# Introduction to Python for Data Analysis 
# *Notebook 3: Advanced Manipulation*
---

### INDEX
1. CREATE NEW FIELDS IN DATAFRAME
2. AGGREGATING DATA
3. DATA CLEANING
4. EXERCISES

### Lecturer: Juan Martin Bellido
* [linkedin.com/in/jmartinbellido](https://www.linkedin.com/in/jmartinbellido/)
* jmbelldo@isdi.education
---


In [None]:
# importing libraries
import pandas as pd
import numpy as np

# CREATE NEW FIELDS IN DATAFRAME


In [None]:
# Import df
df_jamesbond = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv",index_col="Film")
df_jamesbond.dtypes # the dtypes attribute lists every field in the dataframe with its data type

Year                   int64
Actor                 object
Director              object
Box Office           float64
Budget               float64
Bond Actor Salary    float64
dtype: object

### New fields based on operations between existing fields
---

In [None]:
# We create a new column (field) as the result of an operation between two existing numeric fields
## round() rounds the resulting ratio to the nearest integer
df_jamesbond["profitability"]=(df_jamesbond["Box Office"]/df_jamesbond["Budget"]).round()
df_jamesbond.head()

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary,profitability
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6,64.0
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6,43.0
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2,44.0
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7,20.0
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,,4.0


In [None]:
# We now create a new field by concatenating two string fields
df_jamesbond["Director/Actor"]=df_jamesbond["Director"]+ " / " + df_jamesbond["Actor"]
df_jamesbond.head()

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary,profitability,Director/Actor
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6,64.0,Terence Young / Sean Connery
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6,43.0,Terence Young / Sean Connery
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2,44.0,Guy Hamilton / Sean Connery
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7,20.0,Terence Young / Sean Connery
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,,4.0,Ken Hughes / David Niven


### New field based on logic test
---

Let us now create a new field based on the result of an *if/else* condition using the *np.where()* function,
```
np.where(condition, value if true, value if false)
```
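
As a minimal sketch of the same pattern with a numeric condition and text labels (the toy values, the 50 threshold and the column name `budget_segment` are assumptions made up for illustration, not part of the course dataset):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, for illustration only)
df_toy = pd.DataFrame({"Budget": [7.0, 85.0, 41.9]})

# np.where(condition, value if true, value if false) is evaluated element by element
df_toy["budget_segment"] = np.where(df_toy["Budget"] > 50, "high", "low")

print(df_toy)
```

The condition is a boolean Series, so any comparison that returns one value per row can be used in its place.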


In [None]:
# We create a new field using a condition
## in this example, we assign 1 to rows where the actor is Daniel Craig (0 otherwise)
df_jamesbond["actor_is_daniel_craig"] = np.where(df_jamesbond["Actor"]=='Daniel Craig',1,0)
df_jamesbond[["Actor","actor_is_daniel_craig"]].head()

Film,Actor,actor_is_daniel_craig
Dr. No,Sean Connery,0
From Russia with Love,Sean Connery,0
Goldfinger,Sean Connery,0
Thunderball,Sean Connery,0
Casino Royale,David Niven,0


# AGGREGATING DATA

In [None]:
# Import df
df_jamesbond = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv",index_col="Film")
df_jamesbond.dtypes

Year                   int64
Actor                 object
Director              object
Box Office           float64
Budget               float64
Bond Actor Salary    float64
dtype: object

### Intro to aggregations
---

In data analytics, to aggregate data is to perform calculations that summarize raw data into a smaller set of values that are easier to interpret.

We start by aggregating all variables in the dataframe at once. To do so, we use the *agg()* method.

To aggregate data, we pass at least one *aggregate function*, which determines the type of operation performed.

```
object.agg(["aggregate function","aggregate function"])
```

*Aggregate functions*

| Function    | Description                     |
|-------------|---------------------------------|
| count       | Number of non-null observations |
| nunique     | Number of unique values         |
| sum         | Sum of values                   |
| mean        | Mean of values                  |
| median      | Median of values                |
| mode        | Mode                            |
| min         | Minimum                         |
| max         | Maximum                         |
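
The following sketch applies a few of the functions from the table to a small, made-up dataframe (the column names and values are assumptions for illustration). It shows that *count* skips null values while *mean* and *max* are computed over the non-null ones:

```python
import pandas as pd

# Toy data (hypothetical values); "salary" contains one missing value
df_toy = pd.DataFrame({
    "budget": [10.0, 20.0, 30.0],
    "salary": [1.0, None, 3.0],
})

# Each aggregate function becomes one row of the result
summary = df_toy.agg(["count", "mean", "max"])

print(summary)
```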




In [None]:
# Aggregating all columns using functions min and max
df_jamesbond.agg(["max","min"])

Unnamed: 0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary,profitability,Director/Actor
max,2015,Timothy Dalton,Terence Young,943.5,206.3,17.9,64.0,Terence Young / Sean Connery
min,1962,Daniel Craig,Guy Hamilton,250.9,7.0,0.6,3.0,Guy Hamilton / Roger Moore


In [None]:
df_jamesbond.agg(["max","min"]).T # the .T attribute transposes the table (rows become columns)

Unnamed: 0,max,min
Year,2015,1962
Actor,Timothy Dalton,Daniel Craig
Director,Terence Young,Guy Hamilton
Box Office,943.5,250.9
Budget,206.3,7
Bond Actor Salary,17.9,0.6
profitability,64,3
Director/Actor,Terence Young / Sean Connery,Guy Hamilton / Roger Moore


In [None]:
# Now let us aggregate all fields using count (which counts non-NaN observations) and nunique (which counts unique values)
df_jamesbond.agg(["count","nunique"])

Unnamed: 0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
count,26,26,26,26,26,18
nunique,24,7,13,25,26,17


### Aggregating specific fields
---
We normally don't want to apply the same operation (aggregate function) to *all* fields in a dataframe. We can map each field to its own list of functions using a dictionary,

```
object.agg({
  'col1':["aggregate function","aggregate function"]
  ,'col2':["aggregate function","aggregate function","aggregate function"]
})
```
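
A minimal sketch of the dictionary form on made-up data (column names and values are assumptions for illustration). Because the two fields receive different lists of functions, the result contains NaN wherever a function was not requested for a field:

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
df_toy = pd.DataFrame({
    "year": [1962, 1963, 1964],
    "box_office": [100.0, 200.0, 300.0],
})

# "mean" is requested for box_office only, so the year/mean cell is NaN
summary = df_toy.agg({
    "year": ["min", "max"],
    "box_office": ["min", "max", "mean"],
})

print(summary)
```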

In [None]:
# Let us now aggregate field "box office" using functions min, max and mean
df_jamesbond.agg({'Box Office':["min","max","mean"]}).T

Unnamed: 0,min,max,mean
Box Office,250.9,943.5,491.611538


In [None]:
# In a second example, we now aggregate "Year" and "Box Office"
df_jamesbond.agg(
    {
    'Year':["min","max"]
    ,"Box Office":["min","max","mean","sum"]
    })

## note that the output has NaN values for field "Year", as some operations (mean, sum) are not performed on this variable

Unnamed: 0,Year,Box Office
max,2015.0,943.5
mean,,491.611538
min,1962.0,250.9
sum,,12781.9


### Grouped aggregations
---

When aggregating data, we frequently want to group observations by one or more variables. The *groupby()* method groups rows by the values of one or more categorical variables.

```
object.groupby(["categorical variable","categorical variable"])
```
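
As a hedged sketch on made-up data, grouping can be combined with pandas *named aggregation*, which sets the output column names in the same call instead of renaming afterwards. The dataframe and the names `max_salary` and `avg_box_office` are assumptions for illustration:

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
df_toy = pd.DataFrame({
    "actor": ["A", "A", "B"],
    "salary": [1.0, 3.0, 5.0],
    "box_office": [100.0, 300.0, 500.0],
})

# Named aggregation: new_column=(source column, aggregate function)
summary = df_toy.groupby("actor").agg(
    max_salary=("salary", "max"),
    avg_box_office=("box_office", "mean"),
)

print(summary)
```

The grouping variable becomes the index of the result, with one row per distinct value.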


In [None]:
# We aggregate by one variable and sort
df_jamesbond.groupby("Actor").agg(
    {"Bond Actor Salary":"max"}
).sort_values("Bond Actor Salary", ascending=False)

Actor,Bond Actor Salary
Pierce Brosnan,17.9
Daniel Craig,14.5
Roger Moore,9.1
Timothy Dalton,7.9
Sean Connery,5.8
George Lazenby,0.6
David Niven,


In [None]:
# We aggregate two metrics grouped by one variable
df = df_jamesbond.groupby("Actor").agg(
    {"Bond Actor Salary":"max"
    ,"Box Office":"mean"}
).sort_values("Bond Actor Salary", ascending=False)

## renaming columns so that each name reflects the aggregation applied (max salary, mean box office)
df = df.rename(columns={'Bond Actor Salary':'max_bond_salary','Box Office':'avg_box_office'})

df # we display the object

Actor,max_bond_salary,avg_box_office
Pierce Brosnan,17.9,471.65
Daniel Craig,14.5,691.475
Roger Moore,9.1,422.957143
Timothy Dalton,7.9,282.2
Sean Connery,5.8,571.114286
George Lazenby,0.6,291.5
David Niven,,315.0


In [None]:
# We aggregate multiple metrics by a grouped variable
## when sorting, we need to specify a tuple to indicate reference metric and function
df_jamesbond.groupby("Actor").agg(
    {
    "Bond Actor Salary":["max","sum","mean","size"]
    ,"Budget":["max","min"]
    }
).sort_values(("Bond Actor Salary","max"),ascending=False) # we set a tuple as parameter


Actor,Bond Actor Salary (max),Bond Actor Salary (sum),Bond Actor Salary (mean),Bond Actor Salary (size),Budget (max),Budget (min)
Pierce Brosnan,17.9,46.5,11.625,4,158.3,76.9
Daniel Craig,14.5,25.9,8.633333,4,206.3,145.3
Roger Moore,9.1,16.9,8.45,7,91.5,27.7
Timothy Dalton,7.9,13.1,6.55,2,68.8,56.7
Sean Connery,5.8,20.3,3.383333,7,86.0,7.0
George Lazenby,0.6,0.6,0.6,1,37.3,37.3
David Niven,,0.0,,1,85.0,85.0


In [None]:
# We can set more than one variable when grouping data
## in that case, observations will be grouped by the combination of those fields
df_jamesbond.groupby(["Director","Actor"]).agg(
    {"Box Office":"median"}
)

Director,Actor,Box Office
Guy Hamilton,Roger Moore,397.15
Guy Hamilton,Sean Connery,631.45
Irvin Kershner,Sean Connery,380.0
John Glen,Roger Moore,373.8
John Glen,Timothy Dalton,282.2
Ken Hughes,David Niven,315.0
Lee Tamahori,Pierce Brosnan,465.4
Lewis Gilbert,Roger Moore,534.0
Lewis Gilbert,Sean Connery,514.2
Marc Forster,Daniel Craig,514.2


# DATA CLEANING
In this section, we will review three basic data cleaning operations,
*   handling null values
*   removing duplicates
*   identifying outliers


In [None]:
# Import dataframe
df_jamesbond = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv",index_col="Film")

### Handling null values
---

#### *Identify null values in a DataFrame*

The first step is to check whether the DataFrame contains null values and, if so, in which fields.

Two methods help us with this,

```
object.isna()
```
> *The isna() method tests whether each value is NaN (True) or not (False)*

```
boolean_object.any()
```
> *The any() method tests whether at least one of the boolean elements is True*
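
A short sketch of both methods chained on made-up data (names and values are assumptions for illustration). Replacing *any()* with *sum()* counts the null values per field, since True is counted as 1:

```python
import pandas as pd

# Toy data (hypothetical values); "height" contains one missing value
df_toy = pd.DataFrame({
    "name": ["Luke", "Leia", "Han"],
    "height": [172.0, None, 180.0],
})

has_nan = df_toy.isna().any()    # True for fields with at least one NaN
nan_count = df_toy.isna().sum()  # number of NaN values per field

print(has_nan)
print(nan_count)
```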







In [None]:
# Step 1: apply isna() method to the whole DataFrame to test for NaN values
nan_values = df_jamesbond.isna() # we store the result in a new object
nan_values # let us visualize the object to see what we are doing

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Dr. No,False,False,False,False,False,False
From Russia with Love,False,False,False,False,False,False
Goldfinger,False,False,False,False,False,False
Thunderball,False,False,False,False,False,False
Casino Royale,False,False,False,False,False,True
You Only Live Twice,False,False,False,False,False,False
On Her Majesty's Secret Service,False,False,False,False,False,False
Diamonds Are Forever,False,False,False,False,False,False
Live and Let Die,False,False,False,False,False,True
The Man with the Golden Gun,False,False,False,False,False,True


In [None]:
# Step 2: test if at least one value for each column is True (null)
nan_columns = nan_values.any() # we apply any() to nan_values and store that in a new object
nan_columns # let us visualize what we built
# note: the result shows that only field "Bond Actor Salary" contains NaN values

Year                 False
Actor                False
Director             False
Box Office           False
Budget               False
Bond Actor Salary     True
dtype: bool

#### *Handle null values in a DataFrame*
Once null values have been identified, we can decide whether to,
*   remove full rows with at least one missing value; or
*   fill null values
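
Both options can be sketched with the pandas methods *dropna()* and *fillna()* on made-up data (the dataframe is an assumption for illustration). Neither call modifies the original dataframe; each returns a new object:

```python
import pandas as pd

# Toy data (hypothetical values); "salary" contains one missing value
df_toy = pd.DataFrame({
    "film": ["X", "Y", "Z"],
    "salary": [2.0, None, 4.0],
})

# Option 1: remove rows where "salary" is null
df_dropped = df_toy.dropna(subset=["salary"])

# Option 2: fill nulls in "salary" with the mean of the non-null values
df_filled = df_toy.fillna({"salary": df_toy["salary"].mean()})

print(df_dropped)
print(df_filled)
```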



In [None]:
# We identified NaN values in column "Bond Actor Salary"; we now omit those rows
cond = df_jamesbond["Bond Actor Salary"].isna()
df_jamesbond[~cond] # ~ negates the boolean condition, keeping only rows that are not NaN for "Bond Actor Salary"

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Octopussy,1983,Roger Moore,John Glen,373.8,53.9,7.8
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
The Living Daylights,1987,Timothy Dalton,John Glen,313.5,68.8,5.2


In [None]:
# Alternatively, we can opt to fill NaN values
# here we fill them with the mean value of the variable
mean_bond_salary = df_jamesbond["Bond Actor Salary"].mean() # we calculate the mean (NaN values are ignored)
df_jamesbond["Bond Actor Salary"]=df_jamesbond["Bond Actor Salary"].fillna(mean_bond_salary) # we replace null values

# Note: in other cases, we might just want to replace null values with 0 -> .fillna(0)

### Removing duplicates
---

In [None]:
# Import dataframe
df_duplicates = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/renfe_estaciones_duplicates.csv")

#### *Identify duplicated values*

Key functions in this section,
```
object.duplicated()
```
> *The duplicated() method tests whether a row is an exact duplicate of an earlier row*



In [None]:
len(df_duplicates) # check number of rows

630

In [None]:
# identify duplicated rows
cond = df_duplicates.duplicated()
df_duplicates[cond]

Unnamed: 0,CODIGO,DESCRIPCION,LATITUD,LONGITUD,DIRECCION,C.P.,POBLACION,PROVINCIA,PAIS
16,5000,GRANADA,37.184036,-3.609036,AVENIDA DE LOS ANDALUCES. S/N,18014.0,Granada,Granada,España
24,10204,ZARZALEJO,40.538817,-4.158073,CALLE DEL FERROCARRIL. S/N,28293.0,Zarzalejo,Madrid,España
57,11208,VITORIA/GASTEIZ,42.841528,-2.672665,PLAZUELA DE LA ESTACION. 1,1005.0,Vitoria-Gasteiz,Araba/Álava,España
74,11203,MANZANOS,42.742875,-2.86753,RIO ZADORRA KALEA. S/N,1220.0,Ribera Baja/Erribera Beitia,Araba/Álava,España
90,15006,CISNEROS,42.191068,-4.850206,CARRETERA P-932 - ESTACION FERROCARRIL. S/N,34320.0,Cisneros,Palencia,España
181,23004,PONTEVEDRA,42.42164,-8.63583,AVENIDA DE LA ESTACION. S/N,36003.0,Pontevedra,Pontevedra,España
222,31205,A GUDIÑA,42.06069,-7.132436,CARRETERA N-525 - BEATO SEBASTIAN APARICIO. S/N,32540.0,Gudiña. A,Ourense,España
227,34005,SAN PEDRO DEL ARROYO,40.803838,-4.871408,CALLE ESTACION FERROCARRIL. S/N,5350.0,San Pedro del Arroyo,Ávila,España
242,35206,NAVALMORAL DE LA MATA,39.894854,-5.545566,PLAZA ESTACION FERROCARRIL. 1,10300.0,Navalmoral de la Mata,Cáceres,España
253,37300,PUERTOLLANO,38.691411,-4.111611,CALLE MUELLE. S/N,13500.0,Puertollano,Ciudad Real,España


#### *Remove duplicated values*

Key functions in this section,
```
object.drop_duplicates()
```
> *The drop_duplicates() method removes duplicated rows*
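
A minimal sketch of *duplicated()* and *drop_duplicates()* on made-up data (the dataframe is an assumption for illustration). By default only the repeated occurrences are flagged, so the first copy of each row is kept; `keep=False` flags every copy instead:

```python
import pandas as pd

# Toy data (hypothetical values); rows 1 and 2 are identical
df_toy = pd.DataFrame({
    "code": [1, 2, 2, 3],
    "station": ["A", "B", "B", "C"],
})

mask = df_toy.duplicated()                # flags the second copy only
mask_all = df_toy.duplicated(keep=False)  # flags every copy of a duplicated row
df_unique = df_toy.drop_duplicates()      # keeps the first copy of each row

print(df_unique)
```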

In [None]:
df_clean = df_duplicates.drop_duplicates() # we remove duplicates and store the output in a new object

In [None]:
len(df_clean) # we check the number of rows again to confirm the new count

620

### Identifying outliers


In [None]:
# Import dataframe
df_jamesbond = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv",index_col="Film")

It is important to identify outliers in numerical fields. Depending on the case, we might only want to check for possible data entry errors, or we might want to remove outliers even when they are not errors.

There are different ways to identify outliers in a DataFrame; in this course we introduce one of the simplest: *standardizing the variable into z-scores*. For this method, we assume that the variable is approximately normally distributed. In a normal distribution about 99.7% of values lie within three standard deviations of the mean, so a z-score higher than 3 or lower than -3 indicates that the value is a likely outlier.


Key functions in this section,

```
stats.zscore(column, nan_policy='omit')
```
> *The zscore() function standardizes a numerical variable into z-scores (number of standard deviations from the mean)*
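
To make the calculation explicit, the sketch below computes z-scores by hand with pandas on made-up data (the values are assumptions for illustration). It uses the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`, so for a series without null values the two approaches should agree:

```python
import pandas as pd

# Toy data (hypothetical values): 20 ordinary observations and one extreme value
s = pd.Series([10.0] * 10 + [12.0] * 10 + [100.0])

# z = (value - mean) / standard deviation
z = (s - s.mean()) / s.std(ddof=0)

# Values more than 3 standard deviations away from the mean
outliers = s[z.abs() > 3]

print(outliers)
```

Note that with very few observations no value can reach a z-score of 3, so this rule is only informative on reasonably sized samples.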





In [None]:
# We standardize variable Box Office to check for outliers
from scipy import stats # zscore() is part of scipy.stats, which has not been imported yet
df_jamesbond["z_Box Office"] = stats.zscore(df_jamesbond["Box Office"],nan_policy='omit').round(2)

In [None]:
# We select variable and its normalized version and sort
df_jamesbond[["Box Office","z_Box Office"]].sort_values("z_Box Office",ascending=False)
# Note: we do not find any outliers, as there are no z-scores higher than 3 or lower than -3

Film,Box Office,z_Box Office
Skyfall,943.5,2.61
Thunderball,848.1,2.06
Goldfinger,820.4,1.9
Spectre,726.7,1.36
Casino Royale,581.5,0.52
From Russia with Love,543.8,0.3
Moonraker,535.0,0.25
The Spy Who Loved Me,533.0,0.24
GoldenEye,518.5,0.16
You Only Live Twice,514.2,0.13


# EXERCISES

##### EX 1

> Dataset https://data-wizards.s3.amazonaws.com/datasets/dataset_videogames_games.csv

##### Create two new fields in the dataframe,
*   *total_sales* that reflects total sales in all regions
*   *videogame_segment* that flags videogames that sold more than 30MM ("top sales") vs. the rest ("not top sales")

Filter to only videogames for console 'N64' and display columns *name, platform_code, total_sales, videogame_segment* 

---



In [None]:
import pandas as pd
df_videogames = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/dataset_videogames_games.csv")
df_videogames.dtypes

rank               int64
name              object
platform_code     object
year               int64
genre             object
publisher         object
NA_sales         float64
EU_sales         float64
JP_sales         float64
Other_sales      float64
dtype: object

In [None]:
df_videogames["total_sales"] = df_videogames["NA_sales"] + df_videogames["EU_sales"] + df_videogames["JP_sales"] + df_videogames["Other_sales"]
df_videogames["videogame_segment"] = np.where(df_videogames["total_sales"]>30,'top_sales','not_top_sales') 

cond = df_videogames["platform_code"] == 'N64'
df_videogames[cond][['name','platform_code','total_sales','videogame_segment']].sort_values('total_sales',ascending=False)

Unnamed: 0,name,platform_code,total_sales,videogame_segment
46,Super Mario 64,N64,11.90,not_top_sales
63,Mario Kart 64,N64,9.87,not_top_sales
84,GoldenEye 007,N64,8.09,not_top_sales
94,The Legend of Zelda: Ocarina of Time,N64,7.60,not_top_sales
157,Super Smash Bros.,N64,5.56,not_top_sales
...,...,...,...,...
14015,Big Mountain 2000,N64,0.03,not_top_sales
14832,Super Bowling,N64,0.02,not_top_sales
14833,Rat Attack!,N64,0.02,not_top_sales
15929,PGA European Tour,N64,0.01,not_top_sales


##### EX 2 
> Dataset https://data-wizards.s3.amazonaws.com/datasets/fortune1000.csv

##### EX 2.1 Calculate total revenue by sector
##### EX 2.2 Repeat exercise 2.1 but filtering for only companies in sector Technology, Energy or Retailing
---

In [None]:
import pandas as pd
df_fortune = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/fortune1000.csv")
df_fortune.dtypes

Rank          int64
Company      object
Sector       object
Industry     object
Location     object
Revenue       int64
Profits       int64
Employees     int64
dtype: object

In [None]:
# EX 2.1
df_fortune.groupby("Sector").agg(
    {"Revenue":"sum"}
).sort_values("Revenue",ascending=False)

Sector,Revenue
Financials,2217159
Health Care,1614707
Energy,1517809
Retailing,1465076
Technology,1377600
"Food, Beverages & Tobacco",555967
Industrials,497581
Food and Drug Stores,483769
Motor Vehicles & Parts,482540
Telecommunications,461834


In [None]:
# EX 2.2
cond = df_fortune["Sector"].isin(["Technology","Energy","Retailing"])

df_fortune[cond].groupby("Sector").agg(
    {"Revenue":"sum"}
).sort_values("Revenue",ascending=False)

Sector,Revenue
Energy,1517809
Retailing,1465076
Technology,1377600


##### EX 3 
> Dataset https://data-wizards.s3.amazonaws.com/datasets/starwarsdb_people.csv

##### Extract the 5 homeworlds with the highest number of characters in the dataframe
---

In [None]:
import pandas as pd
df_starwars_people = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/starwarsdb_people.csv")
df_starwars_people.dtypes

name           object
height        float64
mass          float64
hair_color     object
skin_color     object
eye_color      object
birth_year    float64
gender         object
homeworld      object
species        object
sex            object
dtype: object

In [None]:
df_starwars_people.groupby('homeworld').agg({
  "name":"nunique"
}).rename(
    {"name":"count_characters"}
    ,axis='columns'
).sort_values('count_characters',ascending=False)\
.iloc[:5,]

homeworld,count_characters
Naboo,11
Tatooine,10
Alderaan,3
Kamino,3
Coruscant,3


##### EX 4 
> Dataset https://data-wizards.s3.amazonaws.com/datasets/dataset_na_who.csv

##### Aggregate the following metrics by continent,

*   *Total population*
*   *Average gross national income (GNI) per capita*
*   *Average % of population living below poverty line*

---

In [None]:
import pandas as pd
df_who = pd.read_csv('https://data-wizards.s3.amazonaws.com/datasets/dataset_na_who.csv')
df_who.dtypes

Country                                                    object
CountryID                                                   int64
ContinentID                                                 int64
Adolescent fertility rate (%)                             float64
Adult literacy rate (%)                                   float64
Gross national income per capita (PPP international $)    float64
Net primary school enrolment ratio female (%)             float64
Net primary school enrolment ratio male (%)               float64
Population (in thousands) total                           float64
Population annual growth rate (%)                         float64
Population in urban areas (%)                             float64
Population living below the poverty line                  float64
Continent                                                  object
dtype: object

In [None]:
# EX 4
output = df_who.groupby('Continent').agg({
    'Population (in thousands) total':'sum'
    ,'Gross national income per capita (PPP international $)':'mean'
    ,'Population living below the poverty line':'mean'
})

output['Population (in thousands) total'] = round(output['Population (in thousands) total'])
output['Gross national income per capita (PPP international $)'] = round(output['Gross national income per capita (PPP international $)'])
output['Population living below the poverty line'] = round(output['Population living below the poverty line'])

output.sort_values('Population (in thousands) total',ascending=False)

Continent,Population (in thousands) total,Gross national income per capita (PPP international $),Population living below the poverty line
Asia,2859153.0,2866.0,28.0
Europe,880241.0,19777.0,3.0
Africa,759147.0,3128.0,36.0
Oceania,714480.0,11716.0,12.0
South America,453480.0,7397.0,14.0
North America,441464.0,24524.0,3.0
Middle East,336867.0,14894.0,2.0


##### EX 5 
> Dataset https://data-wizards.s3.amazonaws.com/datasets/movies.csv

##### Extract top 10 directors with highest average IMDB score. Include only directors with more than 5 movies directed.
---

In [None]:
import pandas as pd
df_movies = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/movies.csv",index_col="movie_title")
df_movies.dtypes

color                         object
director_name                 object
num_critic_for_reviews       float64
duration                     float64
director_facebook_likes      float64
actor_3_facebook_likes       float64
actor_2_name                  object
actor_1_facebook_likes       float64
gross                        float64
genres                        object
actor_1_name                  object
num_voted_users                int64
cast_total_facebook_likes      int64
actor_3_name                  object
facenumber_in_poster         float64
plot_keywords                 object
movie_imdb_link               object
num_user_for_reviews         float64
language                      object
country                       object
content_rating                object
budget                       float64
title_year                   float64
actor_2_facebook_likes       float64
imdb_score                   float64
aspect_ratio                 float64
movie_facebook_likes           int64
dtype: object

In [None]:
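# EX 5
## "movie_title" was set as the index on import, so we restore it as a column before counting it
output = df_movies.reset_index().groupby("director_name").agg({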
    "imdb_score":"mean"
    ,"movie_title":"count"
    }).rename({
    "imdb_score":"avg_imdb_score"
    ,"movie_title":"total_movies"
},axis='columns')

output[output['total_movies']>5].sort_values('avg_imdb_score',ascending=False).iloc[:10,]

director_name,avg_imdb_score,total_movies
Christopher Nolan,8.425,8
Quentin Tarantino,8.2,8
Stanley Kubrick,8.05,6
James Cameron,7.914286,7
Peter Jackson,7.888889,9
Alejandro G. Iñárritu,7.783333,6
David Fincher,7.75,10
Martin Scorsese,7.66,20
Wes Anderson,7.628571,7
Paul Greengrass,7.585714,7


##### EX 6
> Dataset https://data-wizards.s3.amazonaws.com/datasets/fortune1000.csv

##### Create a new field ("company_size") that segments companies in terms of number of employees using the following criteria,
*   *small*, when number of employees is 10000 or fewer
*   *medium*, when number of employees is 10000 < x <= 100000
*   *big*, when number of employees is higher than 100000

##### Calculate median profits by (i) sector and (ii) company size
---

In [None]:
import pandas as pd
import numpy as np
df_fortune = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/fortune1000.csv")
df_fortune.dtypes

Rank          int64
Company      object
Sector       object
Industry     object
Location     object
Revenue       int64
Profits       int64
Employees     int64
dtype: object

In [None]:
# step 1: creating a new column based on conditions
df = df_fortune
df["company_size"] = np.where(
    df["Employees"]>100000,"big"                      # condition 1
    ,np.where(df["Employees"]>10000,"medium","small"  # condition 2
  )
)

# step 2: passing the new variable to the groupby
df.groupby(["Sector","company_size"]).agg(
    {"Profits":"median"}
)

Sector,company_size,Profits
Aerospace & Defense,big,5176.0
Aerospace & Defense,medium,545.0
Aerospace & Defense,small,182.0
Apparel,medium,415.5
Apparel,small,174.0
Business Services,big,169.0
Business Services,medium,419.0
Business Services,small,148.0
Chemicals,medium,1166.0
Chemicals,small,144.5
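
As an alternative sketch, the nested *np.where()* used in step 1 can be expressed with the pandas function *pd.cut()*, which bins a numeric variable into labelled intervals. The toy series below is an assumption for illustration; the bin edges mirror the thresholds used in the solution (10000 and 100000), with each interval closed on the right:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values), including both threshold values
employees = pd.Series([5000, 10000, 50000, 100000, 250000])

# Intervals: (-inf, 10000] -> small, (10000, 100000] -> medium, (100000, inf) -> big
company_size = pd.cut(
    employees,
    bins=[-np.inf, 10000, 100000, np.inf],
    labels=["small", "medium", "big"],
)

print(company_size)
```

This scales better than nesting when there are more than two or three segments, as each extra segment only adds one edge and one label.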


##### EX 7
> Dataset https://data-wizards.s3.amazonaws.com/datasets/dataset_star_wars.csv

Identify fields with missing values. Remove observations with null values for any numerical variable

---


In [None]:
# Importing libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
from matplotlib.pyplot import figure
from scipy import stats
import scipy
from datetime import datetime

In [None]:
# Importing df
df_starwars = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/dataset_star_wars.csv")
df_starwars.dtypes

name           object
height        float64
mass          float64
hair_color     object
skin_color     object
eye_color      object
birth_year    float64
gender         object
homeworld      object
species        object
dtype: object

In [None]:
df_starwars.isna().any()

name          False
height         True
mass           True
hair_color     True
skin_color    False
eye_color     False
birth_year     True
gender         True
homeworld      True
species        True
dtype: bool

In [None]:
# We now omit rows with a null value in any numerical variable
cond_1 = df_starwars["height"].isna()
cond_2 = df_starwars["mass"].isna()
cond_3 = df_starwars["birth_year"].isna()

# | (or) flags rows where at least one field is null; ~ negates the mask to keep the remaining rows
df_starwars_clean = df_starwars[~(cond_1 | cond_2 | cond_3)]