<a href="https://colab.research.google.com/github/jmartinbellido/DMBA/blob/main/DMBA_Python_N2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **ISDI DMBA**
# Introduction to Python for Data Analysis 
# *Notebook 2: DataFrames*
---

### TABLE OF CONTENTS
1. OPERATING WITH COLUMNS IN A DATAFRAME
2. SELECTING ELEMENTS IN A DATAFRAME
3. SORTING A DATAFRAME
4. CHAINING OPERATIONS
5. EXERCISES

### Lecturer: Juan Martin Bellido
* [linkedin.com/in/jmartinbellido](https://www.linkedin.com/in/jmartinbellido/)
* jmbelldo@isdi.education
---

First, we import the libraries we need.

In [None]:
# We import libraries
## note: in case of error we should check if the library is installed in our environment
import pandas as pd

# 1. OPERATING WITH COLUMNS IN A DATAFRAME
A dataframe is a two-dimensional (rows, columns) data structure whose elements are indexed. It is the most widely used data structure in data analytics, and it was introduced to Python by the *pandas* library.

In [None]:
# We import a df from a URL
df_cars = pd.read_csv('https://data-wizards.s3.amazonaws.com/datasets/dataset_us_cars.csv')

In [None]:
# To simplify, we are going to restrict the rows to the top 5
df_cars = df_cars.head()

In [None]:
# We invoke the object to visualize it
df_cars

Unnamed: 0,year,brand,price,mileage,color,state,country
0,2008,toyota,6300,274117,black,new jersey,usa
1,2011,ford,2899,190552,silver,tennessee,usa
2,2018,dodge,5350,39590,silver,georgia,usa
3,2014,ford,25000,64146,blue,virginia,usa
4,2018,chevrolet,27700,6654,red,florida,usa


### Selecting columns in a dataframe
---

We use the following syntax to select a column:

```
df["column_name"]
```
We insert a list in case we want to select more than one column:
```
df[["column_1","column_2"]]
```





In [None]:
# Selecting one column
df_cars["brand"]

0       toyota
1         ford
2        dodge
3         ford
4    chevrolet
Name: brand, dtype: object

In [None]:
# Selecting more than one column
df_cars[["brand","price","mileage"]]

Unnamed: 0,brand,price,mileage
0,toyota,6300,274117
1,ford,2899,190552
2,dodge,5350,39590
3,ford,25000,64146
4,chevrolet,27700,6654


Alternatively, when selecting a column, we can use the following syntax. This only works when selecting a single column.

```
object.column_name
```



In [None]:
# Alternative syntax to select only one column
df_cars.brand

0       toyota
1         ford
2        dodge
3         ford
4    chevrolet
Name: brand, dtype: object
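
The two selection syntaxes can be compared side by side. The sketch below uses a small hypothetical dataframe (`df_demo`, built inline rather than downloaded) and assumes a column named `brand-color` to illustrate one limit of the dot syntax: it only works when the column name is a valid Python identifier, so names containing hyphens or spaces need the bracket syntax.

```python
import pandas as pd

# Hypothetical miniature of df_cars, built inline so the example runs offline
df_demo = pd.DataFrame({
    "brand": ["toyota", "ford", "dodge"],
    "price": [6300, 2899, 5350],
    "brand-color": ["toyota - black", "ford - silver", "dodge - silver"],
})

# Both syntaxes return the same column when the name is a simple identifier
same_result = df_demo["brand"].equals(df_demo.brand)
print(same_result)

# A name with a hyphen can only be selected with brackets:
# df_demo.brand-color would be read as "df_demo.brand minus color"
hyphenated = df_demo["brand-color"]
print(hyphenated.tolist())
```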

### Creating a new column (variable)
---
To create a new column in an existing df, we use the following syntax,

```
df["new_column"] = [value_1, value_2, ...]
```

When defining a new column from a list, the list must contain exactly as many elements as the df has rows.



*Important*. If a column with that name already exists in the df, we *overwrite* it.

```
df["existing_column"] = [value_1, value_2, ...]
```


In [None]:
# We include a new column into df_cars
df_cars["is_Ford"] = [False,True,False,True,False] # defining a new column
df_cars # invoking the object

Unnamed: 0,year,brand,price,mileage,color,state,country,is_Ford
0,2008,toyota,6300,274117,black,new jersey,usa,False
1,2011,ford,2899,190552,silver,tennessee,usa,True
2,2018,dodge,5350,39590,silver,georgia,usa,False
3,2014,ford,25000,64146,blue,virginia,usa,True
4,2018,chevrolet,27700,6654,red,florida,usa,False


In [None]:
# Now let us overwrite an existing column
df_cars["price"] = [7000,3000,6000,25000,28000] # overwriting, as "price" already exists in dataframe
df_cars # we invoke the object

Unnamed: 0,year,brand,price,mileage,color,state,country,is_Ford
0,2008,toyota,7000,274117,black,new jersey,usa,False
1,2011,ford,3000,190552,silver,tennessee,usa,True
2,2018,dodge,6000,39590,silver,georgia,usa,False
3,2014,ford,25000,64146,blue,virginia,usa,True
4,2018,chevrolet,28000,6654,red,florida,usa,False


In [None]:
# Let us now create a new column, operating on two existing columns in the dataframe
df_cars["brand-color"] = df_cars["brand"] + " - " + df_cars["color"]
df_cars

Unnamed: 0,brand,year,price,mileage,color,state,country,is_Ford,brand-color
0,toyota,2008,7000,274117,black,new jersey,usa,False,toyota - black
1,ford,2011,3000,190552,silver,tennessee,usa,True,ford - silver
2,dodge,2018,6000,39590,silver,georgia,usa,False,dodge - silver
3,ford,2014,25000,64146,blue,virginia,usa,True,ford - blue
4,chevrolet,2018,28000,6654,red,florida,usa,False,chevrolet - red
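
The length requirement can be checked directly. This is a minimal sketch on a hypothetical three-row dataframe (`df_demo`): assigning a list of the wrong length raises a `ValueError`, whereas a column derived from an existing column has the right length by construction.

```python
import pandas as pd

# Hypothetical three-row dataframe
df_demo = pd.DataFrame({"brand": ["toyota", "ford", "dodge"]})

try:
    df_demo["is_Ford"] = [False, True]  # only 2 values for 3 rows
    length_error = None
except ValueError as err:
    length_error = str(err)

print(length_error)

# Deriving the column from an existing one avoids counting elements by hand
df_demo["is_Ford"] = df_demo["brand"] == "ford"
print(df_demo["is_Ford"].tolist())
```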


### Dropping elements in a dataframe
---
We can drop (eliminate) both rows and columns using the drop() function,

```
df.drop("index/row/column",axis="rows/columns")
```



In [None]:
# Dropping a column
df_cars.drop("country",axis="columns") # it is crucial to indicate that we are dropping a column by setting the "axis" parameter

Unnamed: 0,year,brand,price,mileage,color,state,is_Ford
0,2008,toyota,7000,274117,black,new jersey,False
1,2011,ford,3000,190552,silver,tennessee,True
2,2018,dodge,6000,39590,silver,georgia,False
3,2014,ford,25000,64146,blue,virginia,True
4,2018,chevrolet,28000,6654,red,florida,False


In [None]:
# Important: we did not overwrite the df
df_cars
## to overwrite the object we should have run the following code -> df_cars = df_cars.drop("country",axis="columns")

Unnamed: 0,year,brand,price,mileage,color,state,country,is_Ford
0,2008,toyota,7000,274117,black,new jersey,usa,False
1,2011,ford,3000,190552,silver,tennessee,usa,True
2,2018,dodge,6000,39590,silver,georgia,usa,False
3,2014,ford,25000,64146,blue,virginia,usa,True
4,2018,chevrolet,28000,6654,red,florida,usa,False
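
The cells above only drop a column. As a sketch on a hypothetical dataframe (`df_demo`), the same method can drop a row by its index label, or several columns at once by passing a list. In both cases the original object is left unchanged unless we reassign the result.

```python
import pandas as pd

# Hypothetical miniature of df_cars
df_demo = pd.DataFrame({
    "brand": ["toyota", "ford", "dodge"],
    "price": [6300, 2899, 5350],
    "country": ["usa", "usa", "usa"],
})

# Dropping a row: pass the index label and axis="rows"
without_first_row = df_demo.drop(0, axis="rows")

# Dropping several columns at once: pass a list of names
only_brand = df_demo.drop(["price", "country"], axis="columns")

print(without_first_row.index.tolist())
print(only_brand.columns.tolist())
print(df_demo.shape)  # the original dataframe still has 3 rows and 3 columns
```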


### Setting an index
---
By default, pandas automatically generates a numeric index that assigns each row a value from 0 to n-1. Alternatively, we can set an existing column as the index using the set_index() function,

```
df.set_index("col_name")
```




In [None]:
# Defining a column as index
df_cars = df_cars.set_index('brand') # setting column "brand" as new index for our dataframe
df_cars
## observe that Brand is no longer a column in our df

Unnamed: 0_level_0,year,price,mileage,color,state,country,is_Ford,brand-color
brand,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
toyota,2008,7000,274117,black,new jersey,usa,False,toyota - black
ford,2011,3000,190552,silver,tennessee,usa,True,ford - silver
dodge,2018,6000,39590,silver,georgia,usa,False,dodge - silver
ford,2014,25000,64146,blue,virginia,usa,True,ford - blue
chevrolet,2018,28000,6654,red,florida,usa,False,chevrolet - red


In [None]:
# We can reset our index to default using the reset_index() function
df_cars = df_cars.reset_index()
df_cars

Unnamed: 0,brand,year,price,mileage,color,state,country,is_Ford
0,toyota,2008,7000,274117,black,new jersey,usa,False
1,ford,2011,3000,190552,silver,tennessee,usa,True
2,dodge,2018,6000,39590,silver,georgia,usa,False
3,ford,2014,25000,64146,blue,virginia,usa,True
4,chevrolet,2018,28000,6654,red,florida,usa,False


### Renaming elements in a dataframe
---
We can use the rename() function to rename elements (both rows and columns) in a dataframe,

```
df.rename({name:new_name},axis="rows/columns")
```
Observe that we need to pass a dictionary to the function to specify the old name and the new name.


In [None]:
# Let us now change the name of two columns
df_cars.rename({
    "price":"car_price"         # first element to modify "old name":"new name"
    ,"brand":"car_brand"        # second element to modify "old name":"new name"
  },axis="columns"              # we need to specify that we are modifying columns
)

Unnamed: 0_level_0,year,car_price,mileage,color,state,country,is_Ford,brand-color
brand,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
toyota,2008,7000,274117,black,new jersey,usa,False,toyota - black
ford,2011,3000,190552,silver,tennessee,usa,True,ford - silver
dodge,2018,6000,39590,silver,georgia,usa,False,dodge - silver
ford,2014,25000,64146,blue,virginia,usa,True,ford - blue
chevrolet,2018,28000,6654,red,florida,usa,False,chevrolet - red
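
The example above renames columns. A minimal sketch of the row case, assuming a hypothetical dataframe (`df_demo`) indexed by brand: with `axis="rows"`, the dictionary keys are matched against index labels instead of column names.

```python
import pandas as pd

# Hypothetical dataframe indexed by brand
df_demo = pd.DataFrame({"price": [6300, 2899]}, index=["toyota", "ford"])

# Renaming a row: the dictionary maps old index label -> new index label
renamed = df_demo.rename({"toyota": "Toyota"}, axis="rows")

print(renamed.index.tolist())
print(df_demo.index.tolist())  # the original is unchanged unless we reassign
```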


# 2. SELECTING ELEMENTS IN A DATAFRAME


In [None]:
# We import a df and set index
df_jamesbond = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv",index_col="Film")
df_jamesbond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


### Selecting elements using index and column name
---


The *.loc[]* indexer allows us to select elements (rows, columns) by their labels, i.e. index values and column names,

```
object.loc[rows,columns]
object.loc[[row_1,row_2,...],[col_1,col_2,...]]
```



In [None]:
# We use the .loc[] method to select elements (rows, columns) in the df
df_jamesbond.loc[["From Russia with Love","Goldfinger"],:] # selecting two specific rows, all columns

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2


In [None]:
# Selecting all rows from an interval, two specific columns
df_jamesbond.loc["From Russia with Love":"Goldfinger",["Year","Director"]] # selecting a range in rows

Unnamed: 0_level_0,Year,Director
Film,Unnamed: 1_level_1,Unnamed: 2_level_1
From Russia with Love,1963,Terence Young
GoldenEye,1995,Martin Campbell
Goldfinger,1964,Guy Hamilton


### Selecting elements using index and column position
---


The *.iloc[]* indexer allows us to select elements by their position within the dataframe (positions start at 0),

```
object.iloc[row_position,column_position]
```



In [None]:
# Let us now select elements using their position in the df
df_jamesbond.iloc[:5,[1,3,5]] # selecting the first five rows; columns at positions 1, 3 and 5 (counting from 0)

Unnamed: 0_level_0,Actor,Box Office,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
A View to a Kill,Roger Moore,275.2,9.1
Casino Royale,Daniel Craig,581.5,3.3
Casino Royale,David Niven,315.0,
Diamonds Are Forever,Sean Connery,442.5,5.8
Die Another Day,Pierce Brosnan,465.4,17.9
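
One difference between the two indexers is easy to miss: a label range in `.loc[]` includes its end label, while a position range in `.iloc[]` excludes its end position. The sketch below shows this on a hypothetical four-row dataframe (`df_demo`) that reuses a few film titles.

```python
import pandas as pd

# Hypothetical four-film sample indexed by title
df_demo = pd.DataFrame(
    {"Year": [1962, 1963, 1964, 1965], "Actor": ["Sean Connery"] * 4},
    index=["Dr. No", "From Russia with Love", "Goldfinger", "Thunderball"],
)

# Label range: both ends are included -> 3 rows
by_label = df_demo.loc["Dr. No":"Goldfinger", ["Year"]]

# Position range: the end position is excluded -> 2 rows
by_position = df_demo.iloc[0:2, [0]]

print(by_label.index.tolist())
print(by_position.index.tolist())
```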


### Selecting rows in a dataframe using a boolean vector
---


We can select rows in a dataframe by passing a boolean vector with the same number of elements as rows in the dataframe.

```
object[[True,False,False]]
```



In [None]:
# We import a df from a URL and reduce the number of rows to top 5
df_cars = pd.read_csv('https://data-wizards.s3.amazonaws.com/datasets/dataset_us_cars.csv')
df_cars = df_cars.head()
df_cars

Unnamed: 0,year,brand,price,mileage,color,state,country
0,2008,toyota,6300,274117,black,new jersey,usa
1,2011,ford,2899,190552,silver,tennessee,usa
2,2018,dodge,5350,39590,silver,georgia,usa
3,2014,ford,25000,64146,blue,virginia,usa
4,2018,chevrolet,27700,6654,red,florida,usa


In [None]:
# We pass a boolean vector to select specific rows
df_cars[[True,False,False,True,False]] # we select first and fourth row

Unnamed: 0,year,brand,price,mileage,color,state,country
0,2008,toyota,6300,274117,black,new jersey,usa
3,2014,ford,25000,64146,blue,virginia,usa


### Selecting rows using conditions
---
As a next step, we will build boolean vectors that reflect specific conditions, then use them to select rows in a dataframe.

```
object[condition]
```



In [None]:
# Let us first create a boolean vector based on a condition
cond = df_jamesbond["Budget"]>100 # building a boolean vector based on a logic test
cond

Film
A View to a Kill                   False
Casino Royale                       True
Casino Royale                      False
Diamonds Are Forever               False
Die Another Day                     True
Dr. No                             False
For Your Eyes Only                 False
From Russia with Love              False
GoldenEye                          False
Goldfinger                         False
Licence to Kill                    False
Live and Let Die                   False
Moonraker                          False
Never Say Never Again              False
Octopussy                          False
On Her Majesty's Secret Service    False
Quantum of Solace                   True
Skyfall                             True
Spectre                             True
The Living Daylights               False
The Man with the Golden Gun        False
The Spy Who Loved Me               False
The World Is Not Enough             True
Thunderball                        False
Tomorrow Ne

In [None]:
# We pass the boolean vector to the df to filter
df_jamesbond[cond] # filtering using the boolean vector

In [None]:
# We can create and combine multiple conditions
cond_1 = df_jamesbond["Budget"]>100
cond_2 = df_jamesbond["Actor"]=='Daniel Craig'
cond_3 = df_jamesbond["Year"]<1965

df_jamesbond[(cond_1 & cond_2) | cond_3] # filtering using multiple conditions; & is evaluated before |, parentheses make the grouping explicit

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,


In [None]:
# chaining methods to filter for both rows and columns
df_jamesbond[(cond_1 & cond_2) | cond_3][['Year','Director','Actor']]

Unnamed: 0_level_0,Year,Director,Actor
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Casino Royale,2006,Martin Campbell,Daniel Craig
Dr. No,1962,Terence Young,Sean Connery
From Russia with Love,1963,Terence Young,Sean Connery
Goldfinger,1964,Guy Hamilton,Sean Connery
Quantum of Solace,2008,Marc Forster,Daniel Craig
Skyfall,2012,Sam Mendes,Daniel Craig
Spectre,2015,Sam Mendes,Daniel Craig
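
When conditions are combined, `&` (and) is evaluated before `|` (or), so the placement of parentheses decides which rows are kept. The sketch below uses a hypothetical four-film sample (`df_demo`) with the same column names as df_jamesbond to show that the unparenthesized filter equals `(A & B) | C`, and that regrouping gives a different result.

```python
import pandas as pd

# Hypothetical four-film sample with the same column names as df_jamesbond
df_demo = pd.DataFrame(
    {
        "Year": [1962, 1964, 2002, 2012],
        "Actor": ["Sean Connery", "Sean Connery", "Pierce Brosnan", "Daniel Craig"],
        "Budget": [7.0, 18.6, 154.2, 170.2],
    },
    index=["Dr. No", "Goldfinger", "Die Another Day", "Skyfall"],
)

high_budget = df_demo["Budget"] > 100
is_craig = df_demo["Actor"] == "Daniel Craig"
before_1965 = df_demo["Year"] < 1965

# & is evaluated before |, so these two filters are identical
implicit = df_demo[high_budget & is_craig | before_1965]
explicit = df_demo[(high_budget & is_craig) | before_1965]

# Moving the parentheses changes the question being asked
regrouped = df_demo[high_budget & (is_craig | before_1965)]

print(implicit.index.tolist())
print(regrouped.index.tolist())
```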


In [None]:
# We can also negate conditions
cond_1 = df_jamesbond["Year"] < 2000    # we create one condition

df_jamesbond[~cond_1]                   # we negate that condition with ~ (unary minus is not supported on boolean data)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,
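
The operator for negating a boolean vector is `~`. As a minimal sketch on a hypothetical three-film sample (`df_demo`), negating `Year < 2000` selects the same rows as writing `Year >= 2000` directly.

```python
import pandas as pd

# Hypothetical three-film sample
df_demo = pd.DataFrame(
    {"Year": [1962, 2002, 2012]},
    index=["Dr. No", "Die Another Day", "Skyfall"],
)

before_2000 = df_demo["Year"] < 2000

negated = df_demo[~before_2000]              # rows where the condition is False
equivalent = df_demo[df_demo["Year"] >= 2000]  # the same filter, written directly

print(negated.index.tolist())
```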


### Useful functions when building conditions
---

#### isin(): filtering variables by multiple values

In [None]:
# Let us use the isin() function to filter for multiple values
cond = df_jamesbond['Director'].isin(['Martin Campbell','Terence Young']) # filtering by more than one value

df_jamesbond[cond]

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7


#### str.contains(): finding characters in string (identifying text patterns)

In [None]:
# importing df, not specifying index
df_jamesbond = pd.read_csv('https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv')
df_jamesbond.head()

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [None]:
# creating conditions that detect string patterns
cond_1 = df_jamesbond['Film'].str.contains('love', regex=False, case=False)
cond_2 = df_jamesbond['Film'].str.contains('die', regex=False, case=False)

# using those conditions to filter
df_jamesbond[cond_1 | cond_2]

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
8,Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,
10,The Spy Who Loved Me,1977,Roger Moore,Lewis Gilbert,533.0,45.1,
19,Tomorrow Never Dies,1997,Pierce Brosnan,Roger Spottiswoode,463.2,133.9,10.0
21,Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
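
The Film column in this dataset has no missing values, but text columns often do. Depending on the pandas version, `str.contains()` may return a missing value for those rows instead of `False`, and such a mask cannot be used directly as a filter. A hedged sketch on a hypothetical column with one missing title: the `na=False` argument treats missing text as "no match".

```python
import pandas as pd

# Hypothetical sample with one missing title
df_demo = pd.DataFrame(
    {"Film": ["From Russia with Love", "Die Another Day", None, "Skyfall"]}
)

# na=False: rows with a missing title are treated as not matching
cond = df_demo["Film"].str.contains("die", regex=False, case=False, na=False)
matches = df_demo[cond]

print(cond.tolist())
print(matches["Film"].tolist())
```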


# 3. SORTING A DATAFRAME
The *sort_index()* method sorts the dataframe by its index, in ascending order by default.

```
object.sort_index(ascending=True/False)
```


In [None]:
# We import the df setting "Film" as index, then we sort based on index
df_jamesbond = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv",index_col="Film").sort_index()
df_jamesbond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9



Alternatively, the *sort_values()* function allows us to sort the dataframe by one or more columns.

```
object.sort_values(column,ascending=True/False)
```



In [None]:
# We import the df and sort using a column (i.e. not the index)
df_jamesbond = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv",index_col="Film")

df_jamesbond.sort_values("Bond Actor Salary",ascending=False) # we set the ascending parameter to False to sort in descending order

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
The World Is Not Enough,1999,Pierce Brosnan,Michael Apted,439.5,158.3,13.5
Tomorrow Never Dies,1997,Pierce Brosnan,Roger Spottiswoode,463.2,133.9,10.0
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
Licence to Kill,1989,Timothy Dalton,John Glen,250.9,56.7,7.9
Octopussy,1983,Roger Moore,John Glen,373.8,53.9,7.8
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
The Living Daylights,1987,Timothy Dalton,John Glen,313.5,68.8,5.2


In [None]:
# Let us now sort using two different columns
df_jamesbond.sort_values(["Actor","Year"],ascending=[True,False]) # note that we now need to use lists

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,
Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
The World Is Not Enough,1999,Pierce Brosnan,Michael Apted,439.5,158.3,13.5
Tomorrow Never Dies,1997,Pierce Brosnan,Roger Spottiswoode,463.2,133.9,10.0
GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
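
Columns such as "Bond Actor Salary" contain missing values. By default, `sort_values()` places them at the end regardless of the sort direction; the `na_position` parameter changes that. A minimal sketch on a hypothetical three-film sample (`df_demo`):

```python
import pandas as pd

# Hypothetical sample: the first salary is missing
df_demo = pd.DataFrame({
    "Film": ["Spectre", "Skyfall", "Dr. No"],
    "Bond Actor Salary": [float("nan"), 14.5, 0.6],
})

# Default: missing values go last, even when sorting in descending order
default_order = df_demo.sort_values("Bond Actor Salary", ascending=False)

# na_position="first" moves missing values to the top instead
nan_first = df_demo.sort_values(
    "Bond Actor Salary", ascending=False, na_position="first"
)

print(default_order["Film"].tolist())
print(nan_first["Film"].tolist())
```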


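# 4. CHAINING OPERATIONS
pandas supports *method chaining*: because most dataframe methods return a new object, we can call the next method directly on the result and split the chain across several lines of code.

Syntax (we wrap the chain in parentheses so that each line can carry its own comment; a backslash line continuation cannot be followed by a comment),

```
(
  dataframe
  .operation()    # operation 1
  .operation()    # operation 2
  .operation()    # operation 3
  .operation()    # operation 4
)
```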


In [None]:
(
  pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv",index_col="Film")   # operation 1: importing df
  .sort_values(["Actor"],ascending=False)   # operation 2: sorting
  .iloc[:5,:]   # operation 3: selecting the first five rows
)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Licence to Kill,1989,Timothy Dalton,John Glen,250.9,56.7,7.9
The Living Daylights,1987,Timothy Dalton,John Glen,313.5,68.8,5.2
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
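
A chain is equivalent to running the same operations one at a time and storing each intermediate result. The sketch below checks this on a hypothetical four-film sample (`df_demo`, built inline); the variable names are illustrative.

```python
import pandas as pd

# Hypothetical four-film sample indexed by title
df_demo = pd.DataFrame(
    {
        "Actor": ["Sean Connery", "Timothy Dalton", "Roger Moore", "Daniel Craig"],
        "Year": [1962, 1989, 1985, 2012],
    },
    index=["Dr. No", "Licence to Kill", "A View to a Kill", "Skyfall"],
)

# Step by step, storing each intermediate result
sorted_df = df_demo.sort_values("Actor", ascending=False)
top_two = sorted_df.iloc[:2, :]

# The same operations written as a single chain
chained = (
    df_demo
    .sort_values("Actor", ascending=False)  # operation 1: sorting
    .iloc[:2, :]                            # operation 2: first two rows
)

print(chained.index.tolist())
```

The chained form avoids naming intermediate objects that are never used again, at the cost of making individual steps slightly harder to inspect.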


# 5. EXERCISES
*Note: solutions to exercises will be provided after each session*

##### EX 1
Dataset > https://data-wizards.s3.amazonaws.com/datasets/fortune1000.csv

##### EX 1.A. Import dataframe. Select columns: "Company", "Sector" and "Revenue".
##### EX 1.B. Rename column "Revenue" to "Company_Revenue".
##### EX 1.C. Build a new column "Profits_per_Employee" based on dividing field "Profits" by field "Employees".

In [None]:
import pandas as pd

In [None]:
# EX 1.A
df_fortune = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/fortune1000.csv")

In [None]:
df_fortune[["Company","Sector","Revenue"]]

Unnamed: 0,Company,Sector,Revenue
0,Walmart,Retailing,482130
1,Exxon Mobil,Energy,246204
2,Apple,Technology,233715
3,Berkshire Hathaway,Financials,210821
4,McKesson,Health Care,181241
...,...,...,...
995,New York Community Bancorp,Financials,1902
996,Portland General Electric,Energy,1898
997,Portland General Electric,Energy,1898
998,Wendy’s,"Hotels, Resturants & Leisure",1896


In [None]:
# EX 1.B
df_fortune = df_fortune.rename(
    {"Revenue":"Company_Revenue"}
    ,axis="columns"
)

df_fortune

Unnamed: 0,Rank,Company,Sector,Industry,Location,Company_Revenue,Profits,Employees
0,1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
1,2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
2,3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000
3,4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),"Omaha, NE",210821,24083,331000
4,5,McKesson,Health Care,Wholesalers: Health Care,"San Francisco, CA",181241,1476,70400
...,...,...,...,...,...,...,...,...
995,996,New York Community Bancorp,Financials,Commercial Banks,"Westbury, NY",1902,-47,3448
996,997,Portland General Electric,Energy,Utilities: Gas and Electric,"Portland, OR",1898,172,2646
997,997,Portland General Electric,Energy,Utilities: Gas and Electric,"Portland, OR",1898,172,2646
998,999,Wendy’s,"Hotels, Resturants & Leisure",Food Services,"Dublin, OH",1896,161,21200


In [None]:
# EX 1.C
df_fortune["Profits_per_Employee"] = df_fortune["Profits"] / df_fortune["Employees"]
df_fortune

Unnamed: 0,Rank,Company,Sector,Industry,Location,Company_Revenue,Profits,Employees,Profits_per_Employee
0,1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000,0.006389
1,2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600,0.213624
2,3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000,0.485400
3,4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),"Omaha, NE",210821,24083,331000,0.072758
4,5,McKesson,Health Care,Wholesalers: Health Care,"San Francisco, CA",181241,1476,70400,0.020966
...,...,...,...,...,...,...,...,...,...
995,996,New York Community Bancorp,Financials,Commercial Banks,"Westbury, NY",1902,-47,3448,-0.013631
996,997,Portland General Electric,Energy,Utilities: Gas and Electric,"Portland, OR",1898,172,2646,0.065004
997,997,Portland General Electric,Energy,Utilities: Gas and Electric,"Portland, OR",1898,172,2646,0.065004
998,999,Wendy’s,"Hotels, Resturants & Leisure",Food Services,"Dublin, OH",1896,161,21200,0.007594


##### EX 2

> Dataset > https://data-wizards.s3.amazonaws.com/datasets/dataset_star_wars.csv

##### EX 2.A. Import dataframe. Select columns "name", "homeworld", "species" and filter for the first 10 rows in the dataframe.

##### EX 2.B. Select characters that a) are NOT *species* human and b) are from *homeworld* Naboo, Endor or Kashyyyk (either one of those three).

---

In [None]:
import pandas as pd

In [None]:
# EX 2.A
df_starwars = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/dataset_star_wars.csv",index_col="name")
df_starwars.head()

Unnamed: 0_level_0,height,mass,hair_color,skin_color,eye_color,birth_year,gender,homeworld,species
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Mon Mothma,150.0,,auburn,fair,blue,48.0,female,Chandrila,Human
Yoda,66.0,17.0,white,green,brown,896.0,male,,Yoda's species
Tion Medon,206.0,80.0,none,grey,black,,male,Utapau,Pau'an
Ratts Tyerell,79.0,15.0,none,grey & blue,unknown,,male,Aleen Minor,Aleena
Luke Skywalker,172.0,77.0,blond,fair,blue,19.0,male,Tatooine,Human


In [None]:

df_starwars[["homeworld","species"]].iloc[:10,:]

Unnamed: 0_level_0,homeworld,species
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Mon Mothma,Chandrila,Human
Yoda,,Yoda's species
Tion Medon,Utapau,Pau'an
Ratts Tyerell,Aleen Minor,Aleena
Luke Skywalker,Tatooine,Human
R2-D2,Naboo,Droid
Beru Whitesun lars,Tatooine,Human
Chewbacca,Kashyyyk,Wookiee
Raymus Antilles,Alderaan,Human
Mace Windu,Haruun Kal,Human


In [None]:
# EX 2.B
## solution without using .isin function
cond_1 = df_starwars["species"]!='Human'
cond_2 = df_starwars["homeworld"]=='Naboo'
cond_3 = df_starwars["homeworld"]=='Endor'
cond_4 = df_starwars["homeworld"]=='Kashyyyk'

df_starwars[cond_1 & (cond_2 | cond_3 | cond_4)]


Unnamed: 0_level_0,height,mass,hair_color,skin_color,eye_color,birth_year,gender,homeworld,species
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
R2-D2,96.0,32.0,,white & blue,red,33.0,,Naboo,Droid
Chewbacca,228.0,112.0,brown,unknown,blue,200.0,male,Kashyyyk,Wookiee
Rugor Nass,206.0,,none,green,orange,,male,Naboo,Gungan
Tarfful,234.0,136.0,brown,brown,blue,,male,Kashyyyk,Wookiee
Quarsh Panaka,183.0,,black,dark,brown,62.0,male,Naboo,
Roos Tarpals,224.0,82.0,none,grey,orange,,male,Naboo,Gungan
Wicket Systri Warrick,88.0,20.0,brown,brown,brown,8.0,male,Endor,Ewok
Ric Olié,183.0,,brown,fair,blue,,male,Naboo,
Jar Jar Binks,196.0,66.0,none,orange,orange,52.0,male,Naboo,Gungan


In [None]:
# EX 2.B
## solution using .isin function
cond_1 = df_starwars["species"]!='Human'
cond_2 = df_starwars["homeworld"].isin(["Naboo","Endor","Kashyyyk"])

df_starwars[cond_1 & cond_2]

Unnamed: 0_level_0,height,mass,hair_color,skin_color,eye_color,birth_year,gender,homeworld,species
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
R2-D2,96.0,32.0,,white & blue,red,33.0,,Naboo,Droid
Chewbacca,228.0,112.0,brown,unknown,blue,200.0,male,Kashyyyk,Wookiee
Rugor Nass,206.0,,none,green,orange,,male,Naboo,Gungan
Tarfful,234.0,136.0,brown,brown,blue,,male,Kashyyyk,Wookiee
Quarsh Panaka,183.0,,black,dark,brown,62.0,male,Naboo,
Roos Tarpals,224.0,82.0,none,grey,orange,,male,Naboo,Gungan
Wicket Systri Warrick,88.0,20.0,brown,brown,brown,8.0,male,Endor,Ewok
Ric Olié,183.0,,brown,fair,blue,,male,Naboo,
Jar Jar Binks,196.0,66.0,none,orange,orange,52.0,male,Naboo,Gungan


##### EX 3
> Dataset > https://data-wizards.s3.amazonaws.com/datasets/fortune1000.csv

Import dataframe. Filter for top 10 companies with highest revenue in the Technology sector.


In [None]:
# EX 3
df_fortune = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/fortune1000.csv",index_col="Company")
df_fortune.head()

Unnamed: 0_level_0,Rank,Sector,Industry,Location,Revenue,Profits,Employees
Company,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Walmart,1,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
Exxon Mobil,2,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
Apple,3,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000
Berkshire Hathaway,4,Financials,Insurance: Property and Casualty (Stock),"Omaha, NE",210821,24083,331000
McKesson,5,Health Care,Wholesalers: Health Care,"San Francisco, CA",181241,1476,70400


In [None]:
df_fortune[df_fortune['Sector']=='Technology'].sort_values('Revenue',ascending=False).iloc[:10,]

##### EX 4
> Dataset > https://data-wizards.s3.amazonaws.com/datasets/movies.csv

##### EX 4.A. Import dataframe. Select columns *country*, *director_name*, *imdb_score* and first 10 rows in dataframe
##### EX 4.B. Select movies produced in US, whose IMDB score is higher than 8.5 and directed by either James Cameron, Peter Jackson or Tim Burton

---



In [None]:
df_movies = pd.read_csv('https://data-wizards.s3.amazonaws.com/datasets/movies.csv',index_col='movie_title')
df_movies.head()

Unnamed: 0_level_0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
movie_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1
Avatar,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,886204,4834,Wes Studi,0.0,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,Johnny Depp,471220,48350,Jack Davenport,0.0,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
Spectre,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,Christoph Waltz,275868,11700,Stephanie Sigman,1.0,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,Tom Hardy,1144337,106759,Joseph Gordon-Levitt,0.0,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
Star Wars: Episode VII - The Force Awakens,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,Doug Walker,8,143,,0.0,,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,,,,,,,12.0,7.1,,0


In [None]:
# EX 4.A
# We keep three columns, then take the first 10 rows by position
df_movies[["country","director_name","imdb_score"]].iloc[:10]

movie_title,country,director_name,imdb_score
Avatar,USA,James Cameron,7.9
Pirates of the Caribbean: At World's End,USA,Gore Verbinski,7.1
Spectre,UK,Sam Mendes,6.8
The Dark Knight Rises,USA,Christopher Nolan,8.5
Star Wars: Episode VII - The Force Awakens,,Doug Walker,7.1
John Carter,USA,Andrew Stanton,6.6
Spider-Man 3,USA,Sam Raimi,6.2
Tangled,USA,Nathan Greno,7.8
Avengers: Age of Ultron,USA,Joss Whedon,7.5
Harry Potter and the Half-Blood Prince,UK,David Yates,7.5
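
The same selection can be written in more than one way. The sketch below uses a small made-up dataframe (`df_toy` and its values are hypothetical, not taken from the movies dataset) to check that slicing with `.iloc`, calling `.head()` and combining `.iloc` with `.loc` all return the same result.

```python
import pandas as pd

# Hypothetical toy data standing in for the movies dataset
df_toy = pd.DataFrame(
    {
        "country": ["USA", "UK", "USA", "France"],
        "director_name": ["Director A", "Director B", "Director C", "Director D"],
        "imdb_score": [7.9, 6.8, 8.5, 7.1],
        "duration": [178, 148, 164, 120],
    },
    index=pd.Index(["Movie 1", "Movie 2", "Movie 3", "Movie 4"], name="movie_title"),
)

cols = ["country", "director_name", "imdb_score"]

# Option 1: select columns, then slice the first 2 rows by position
by_iloc = df_toy[cols].iloc[:2]

# Option 2: select columns, then use head()
by_head = df_toy[cols].head(2)

# Option 3: slice rows by position first, then select columns by label
by_loc = df_toy.iloc[:2].loc[:, cols]

# All three options produce the same dataframe
print(by_iloc.equals(by_head) and by_iloc.equals(by_loc))
```

Which form to use is mostly a matter of taste; `.head(n)` tends to read more clearly when we only need the first rows.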


In [None]:
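# EX 4.B
# Condition 1: country is not 'USA' and the IMDB score is above 8.5
cond_1_1 = df_movies['country']!='USA'
cond_1_2 = df_movies['imdb_score']>8.5

# Condition 2: the director is one of these three
cond_2 = df_movies['director_name'].isin(['James Cameron','Peter Jackson','Tim Burton'])

# We keep rows that meet condition 1 or condition 2, then select the columns of interest
df_movies[(cond_1_1 & cond_1_2)|cond_2][['country','director_name','imdb_score']]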

movie_title,country,director_name,imdb_score
Avatar,USA,James Cameron,7.9
The Hobbit: The Battle of the Five Armies,New Zealand,Peter Jackson,7.5
The Hobbit: The Desolation of Smaug,USA,Peter Jackson,7.9
King Kong,New Zealand,Peter Jackson,7.2
Titanic,USA,James Cameron,7.7
Alice in Wonderland,USA,Tim Burton,6.5
The Hobbit: An Unexpected Journey,USA,Peter Jackson,7.9
Charlie and the Chocolate Factory,USA,Tim Burton,6.7
Dark Shadows,USA,Tim Burton,6.2
The Lord of the Rings: The Fellowship of the Ring,New Zealand,Peter Jackson,8.8
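
The position of the parentheses matters when we combine conditions with `&` and `|`. The sketch below uses a small made-up dataframe (all names and values in `df_toy` are hypothetical) to show that "(A and B) or C" and "A and (B or C)" can select different rows.

```python
import pandas as pd

# Hypothetical toy data, not taken from the movies dataset
df_toy = pd.DataFrame(
    {
        "country": ["USA", "UK", "New Zealand", "USA"],
        "director_name": ["James Cameron", "Someone Else", "Peter Jackson", "Another Person"],
        "imdb_score": [7.9, 8.9, 7.2, 9.0],
    }
)

# Each comparison is stored in its own variable, so no extra parentheses are needed around it
not_usa = df_toy["country"] != "USA"
high_score = df_toy["imdb_score"] > 8.5
known_director = df_toy["director_name"].isin(["James Cameron", "Peter Jackson", "Tim Burton"])

# (A and B) or C
rows_a = df_toy[(not_usa & high_score) | known_director]

# A and (B or C)
rows_b = df_toy[not_usa & (high_score | known_director)]

print(rows_a.index.tolist())  # [0, 1, 2]
print(rows_b.index.tolist())  # [1, 2]
```

In Python, `&` binds more tightly than `|`, and both bind more tightly than comparisons such as `>` or `!=`. Writing the parentheses explicitly, or storing each comparison in a variable first, avoids surprises.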


##### EX 5
> Dataset > https://data-wizards.s3.amazonaws.com/datasets/movies.csv

##### Import the dataframe. Select the 10 Sci-Fi movies with the highest IMDB score, keeping only the fields *title_year*, *director_name* and *imdb_score*.
---


In [None]:
df_movies = pd.read_csv('https://data-wizards.s3.amazonaws.com/datasets/movies.csv',index_col='movie_title')
df_movies.head()

movie_title,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Avatar,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,886204,4834,Wes Studi,0.0,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,Johnny Depp,471220,48350,Jack Davenport,0.0,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
Spectre,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,Christoph Waltz,275868,11700,Stephanie Sigman,1.0,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,Tom Hardy,1144337,106759,Joseph Gordon-Levitt,0.0,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
Star Wars: Episode VII - The Force Awakens,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,Doug Walker,8,143,,0.0,,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,,,,,,,12.0,7.1,,0


In [None]:
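# We flag movies whose genres include 'Sci-Fi' (literal, case-insensitive match; missing genres count as no match)
cond_1 = df_movies['genres'].str.contains('Sci-Fi', regex=False, case=False, na=False)

# We sort by IMDB score (highest first), keep the first 10 rows and select the requested columns
df_movies[cond_1].sort_values('imdb_score',ascending=False)\
  .iloc[:10][['title_year','director_name','imdb_score']]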

movie_title,title_year,director_name,imdb_score
Daredevil,,,8.8
Star Wars: Episode V - The Empire Strikes Back,1980.0,Irvin Kershner,8.8
Inception,2010.0,Christopher Nolan,8.8
The Matrix,1999.0,Lana Wachowski,8.7
Star Wars: Episode IV - A New Hope,1977.0,George Lucas,8.7
Interstellar,2014.0,Christopher Nolan,8.6
Outlander,,,8.5
Alien,1979.0,Ridley Scott,8.5
The Prestige,2006.0,Christopher Nolan,8.5
Terminator 2: Judgment Day,1991.0,James Cameron,8.5
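
Two details of this kind of "filter by text, then take the top N" query are worth a quick sketch. First, when the text column has missing values, `str.contains` returns a missing value for those rows unless we pass `na=False`, and a mask with missing values typically cannot be used to filter. Second, `nlargest` is an alternative to sorting and slicing. The dataframe below (`df_toy`) is hypothetical and only illustrates the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing genre
df_toy = pd.DataFrame(
    {
        "genres": ["Action|Sci-Fi", "Drama", np.nan, "Sci-Fi|Thriller", "Adventure|Sci-Fi"],
        "title_year": [2010.0, 1999.0, 2005.0, 1979.0, 2014.0],
        "imdb_score": [8.8, 9.0, 7.0, 8.5, 8.6],
    },
    index=pd.Index(["M1", "M2", "M3", "M4", "M5"], name="movie_title"),
)

# na=False treats rows with a missing genre as "does not contain Sci-Fi"
is_scifi = df_toy["genres"].str.contains("Sci-Fi", regex=False, na=False)

# Option 1: sort in descending order and keep the first 2 rows
top_sorted = df_toy[is_scifi].sort_values("imdb_score", ascending=False).head(2)

# Option 2: ask directly for the 2 rows with the largest score
top_nlargest = df_toy[is_scifi].nlargest(2, "imdb_score")

print(top_sorted.index.tolist())    # ['M1', 'M5']
print(top_nlargest.index.tolist())  # ['M1', 'M5']
```

When several movies share the same score, the two options may order the tied rows differently, so results can differ at the cutoff.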
