# **IEB MiM&A** 
# Introduction to Python for Data Analysis 🐍📊
# *Notebook 2: Intro to DataFrames*
---
### 🎯 OBJECTIVES
* Introduce *pandas DataFrames* as key structures in Data Analysis
* Learn basic operations with DataFrames 

### 📋 TABLE OF CONTENTS
1. INTRO TO DATAFRAMES
2. FILTERING ELEMENTS IN DATAFRAMES
3. BASIC OPERATIONS ON DATAFRAMES
4. EXERCISES

Lecturer: Juan Martin Bellido (Martin)
* [linkedin.com/in/jmartinbellido](https://www.linkedin.com/in/jmartinbellido/)
* juan.martin.bellido.arias@claustro-ieb.es

Conventions used in this notebook:
* ⚠️ Warning
* ✅ Key remark

---

In [None]:
# Importing libraries used in this notebook
import pandas as pd
import seaborn as sns

# 1. INTRO TO DATAFRAMES
---
A DataFrame is essentially a data table, that is to say, a two-dimensional structure where data is organized in rows and columns. It is the most important data structure in Data Analytics. However, it is not built into Python itself; it is provided by the pandas library.

> ✅ In a DataFrame, the index holds the labels (keys) that identify each row

> ✅ By default, DataFrames generate a numerical index (0, 1, 2, ...)

> ⚠️ The index is displayed at the very left of a DataFrame; do not confuse the index with a column
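
To make the difference between the index and the columns concrete, here is a minimal sketch. The table and the name `df_demo` are made up for illustration and are not part of the course datasets.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df_demo = pd.DataFrame({"film": ["Dr. No", "Goldfinger"], "year": [1962, 1964]})

print(df_demo.index)    # the default numerical index (row labels 0 and 1)
print(df_demo.columns)  # the column labels ('film' and 'year')

# The index is stored separately from the columns
print("film" in df_demo.columns)  # True
print(0 in df_demo.columns)       # False: 0 is a row label, not a column
```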



### Importing a DataFrame
---
After importing the *pandas* library, we will use one of its functions to import a DataFrame into our environment.

```
pd.read_csv('path')
```
* The *pd.read_csv()* function imports a CSV or TXT file into a DataFrame
* The path can be either the location of a local file or (as in our case) a URL
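
As a small sketch of how *pd.read_csv()* behaves, the example below reads CSV text held in memory instead of a file or URL. The text in `csv_text` is invented for illustration; `sep` is the parameter that tells pandas which character separates the columns (a comma by default).

```python
import io
import pandas as pd

# Hypothetical file contents; io.StringIO lets pandas treat a string like a file
csv_text = "film;year\nDr. No;1962\nGoldfinger;1964\n"

# Here the columns are separated by semicolons, so we say so explicitly
df_demo = pd.read_csv(io.StringIO(csv_text), sep=";")
print(df_demo)
```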


In [None]:
# Using pd.read_csv() to import a dataframe
pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv")


Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
5,You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
6,On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
7,Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
8,Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,
9,The Man with the Golden Gun,1974,Roger Moore,Guy Hamilton,334.0,27.7,


In [None]:
# Repeating the operation, this time storing the data on an object
df_jamesbond = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv")

In [None]:
# We use type() to confirm object type
type(df_jamesbond)

pandas.core.frame.DataFrame

### Exporting a DataFrame
---
We will now use the DataFrame's *.to_csv()* method to export a DataFrame to a file. If we provide only a file name rather than a full path, Python stores the file in our working directory (the default location).

```
df.to_csv('path')
```
> ⚠️ On Google Colab, the file is saved to the Colab session's storage (e.g. */content*), not to your own computer
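
By default, *.to_csv()* also writes the index as the first column of the file, which shows up as an extra unnamed column when the file is read back. The sketch below, built on a made-up `df_demo` and a temporary folder, uses `index=False` to leave the index out.

```python
import os
import tempfile
import pandas as pd

# Hypothetical toy data, used only for illustration
df_demo = pd.DataFrame({"film": ["Dr. No", "Goldfinger"], "year": [1962, 1964]})

# Write into a temporary folder so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.csv")
    df_demo.to_csv(path, index=False)  # index=False: do not write the row index
    df_back = pd.read_csv(path)        # read the file back to check the result

print(df_back.columns.tolist())  # ['film', 'year']
```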

In [None]:
# Exporting dataframe
df_jamesbond.to_csv("dataset_jamesbond.csv")

In [None]:
# We can export the dataframe as a tab-separated txt file by changing the "sep" parameter
df_jamesbond.to_csv("dataset_jamesbond.txt",sep='\t')

In [None]:
# We can check our working directory using os.getcwd()
import os
os.getcwd()

'/content'

### First operations on a DataFrame
---
The first operations we will perform on a DataFrame:
* check columns (variables) names and type of variables in our DataFrame
* review number of rows in our DataFrame
* change index in our DataFrame
* display first/last n rows in our DataFrame


In [None]:
# .dtypes attribute: lets us review the name and type of each variable in a df
df_jamesbond.dtypes

Film                  object
Year                   int64
Actor                 object
Director              object
Box Office           float64
Budget               float64
Bond Actor Salary    float64
dtype: object

In [None]:
# len() function: provides number of rows in df
len(df_jamesbond)

26

In [None]:
# In some cases, we might be interested in changing numerical index with a new index generated using an existing column in our df
# .set_index(column): converts a column into DataFrame's index
df_jamesbond.set_index("Film") # we use column "Film" as df's new index

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,
The Man with the Golden Gun,1974,Roger Moore,Guy Hamilton,334.0,27.7,


In [None]:
# .head() / .tail() methods: review the first/last n rows in df; n=5 by default
df_jamesbond.head()

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


### Building a DataFrame manually
---
Though not commonly used, we can build a DataFrame from scratch from one of Python's basic structures (e.g. from a dictionary).


In [None]:
# We create a dictionary
car_dic = {
    "car":['Honda Civic','VW Golf','Toyota Corolla'],
    "price":[12000,13000,15000],
    "is_new":[False,True,True]
    }

In [None]:
# pd.DataFrame() function: building a df manually
pd.DataFrame(car_dic)

Unnamed: 0,car,price,is_new
0,Honda Civic,12000,False
1,VW Golf,13000,True
2,Toyota Corolla,15000,True


# 2. FILTERING ELEMENTS IN DATAFRAMES
---



In [None]:
# Importing a df
df_cars = pd.read_csv('https://data-wizards.s3.amazonaws.com/datasets/dataset_us_cars.csv')

In [None]:
# Overwriting the df above to keep only the first 10 rows
df_cars = df_cars.head(10)

In [None]:
# Display df
df_cars

Unnamed: 0,year,brand,price,mileage,color,state,country
0,2008,toyota,6300,274117,black,new jersey,usa
1,2011,ford,2899,190552,silver,tennessee,usa
2,2018,dodge,5350,39590,silver,georgia,usa
3,2014,ford,25000,64146,blue,virginia,usa
4,2018,chevrolet,27700,6654,red,florida,usa
5,2018,dodge,5700,45561,white,texas,usa
6,2010,chevrolet,7300,149050,black,georgia,usa
7,2017,gmc,13350,23525,gray,california,usa
8,2018,chevrolet,14600,9371,silver,florida,usa
9,2017,ford,5250,63418,black,texas,usa


### Filtering using Python base
---
As introduced in notebook 1, Python uses [ ] (brackets) as the basic syntax to select elements stored in data structures.

```
data_structure[x]
```

* This applies to *lists, dictionaries, tuples and DataFrames* (sets are the exception: they are unordered and cannot be accessed with brackets)
* The syntax *within* the brackets differs depending on the type of data structure

> ⚠️ In case of questions, please review *notebook 1: Python Base*


#### Filtering columns

To select a single column from a DataFrame, we use the following syntax,

```
df["column_name"]
```

To select multiple columns, we simply pass a list inside the syntax described above,

```
df[["column_1","column_2"]]
```
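
The two forms return different types of object, which matters for later operations. A minimal sketch with a made-up `df_demo`:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df_demo = pd.DataFrame({"brand": ["toyota", "ford"], "price": [6300, 2899]})

one_col = df_demo["brand"]     # a column name on its own returns a Series
same_col = df_demo[["brand"]]  # a list (even with one name) returns a DataFrame

print(type(one_col).__name__)   # Series
print(type(same_col).__name__)  # DataFrame
```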





In [None]:
# Selecting a single column
df_cars["brand"]

0       toyota
1         ford
2        dodge
3         ford
4    chevrolet
5        dodge
6    chevrolet
7          gmc
8    chevrolet
9         ford
Name: brand, dtype: object

In [None]:
# Selecting multiple columns
df_cars[["brand","price","mileage"]]

Unnamed: 0,brand,price,mileage
0,toyota,6300,274117
1,ford,2899,190552
2,dodge,5350,39590
3,ford,25000,64146
4,chevrolet,27700,6654
5,dodge,5700,45561
6,chevrolet,7300,149050
7,gmc,13350,23525
8,chevrolet,14600,9371
9,ford,5250,63418


Alternatively, we can use a simplified syntax,

```
object.column_name
```
> ✅ Pandas exposes the column names of a DataFrame as attributes of the object

> ⚠️ This works only when selecting a single column
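
The attribute shortcut has two further limits worth knowing: the column name must be a valid Python name (no spaces), and it must not clash with an existing DataFrame method. The sketch below uses a made-up `df_demo` to show both cases; the bracket syntax always works.

```python
import pandas as pd

# Hypothetical toy data: one name contains a space, the other matches a method name
df_demo = pd.DataFrame({"Box Office": [448.8, 543.8], "count": [1, 2]})

# "Box Office" contains a space, so it can only be selected with brackets
box_office = df_demo["Box Office"]

# "count" is also a DataFrame method, so the attribute gives the method, not the column
print(callable(df_demo.count))  # True

# Brackets return the column as expected
counts = df_demo["count"]
print(counts.tolist())          # [1, 2]
```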



In [None]:
# We use an alternative method to select a single column
df_cars.brand

0       toyota
1         ford
2        dodge
3         ford
4    chevrolet
5        dodge
6    chevrolet
7          gmc
8    chevrolet
9         ford
Name: brand, dtype: object

#### Filtering rows
To filter rows in a DataFrame using base Python syntax, we pass a *list of booleans* inside the brackets. 

```
df[[True,False,False...]]
```
> ✅ When pandas finds a list of booleans inside the basic selection syntax, it understands that we are filtering rows (instead of selecting columns)

> ⚠️ The boolean list needs to have as many elements as rows in the DataFrame we are trying to filter

This method will become particularly relevant when filtering rows using conditions.
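
The length requirement can be checked with a short sketch. It assumes a made-up three-row `df_demo`; passing only two booleans makes pandas raise an error, while three booleans filter as expected.

```python
import pandas as pd

# Hypothetical toy data with 3 rows
df_demo = pd.DataFrame({"brand": ["toyota", "ford", "dodge"]})

try:
    df_demo[[True, False]]  # only 2 booleans for 3 rows
    outcome = "no error"
except ValueError as error:
    outcome = "ValueError"
    print(error)

# With one boolean per row, the rows marked True are kept
kept = df_demo[[True, False, True]]
print(kept)
```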


In [None]:
# We create a list containing 10 boolean elements
my_boolean_list = [True,False,False,False,False,True,True,True,True,True]
df_cars[my_boolean_list]
# We can observe that we only keep rows, for which we included a True in our boolean list

Unnamed: 0,year,brand,price,mileage,color,state,country
0,2008,toyota,6300,274117,black,new jersey,usa
5,2018,dodge,5700,45561,white,texas,usa
6,2010,chevrolet,7300,149050,black,georgia,usa
7,2017,gmc,13350,23525,gray,california,usa
8,2018,chevrolet,14600,9371,silver,florida,usa
9,2017,ford,5250,63418,black,texas,usa


### Filtering a DataFrame using Pandas: loc[ ] & iloc[ ] methods
---
Pandas' .loc[] and .iloc[] methods let us filter rows and columns at once using matrix-like logic (rows, columns). 

> ⚠️ Methods typically use parentheses () to receive parameters. As an exception to this rule, these two methods use brackets [].


#### The .loc[ ] method

The *.loc[]* method allows us to access rows and columns by their *labels*.

```
object.loc[rows,columns]
object.loc[[row_1,row_2,...],[col_1,col_2,...]]
```
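
One detail that often surprises newcomers is that a label range in *.loc[]* includes both endpoints. A minimal sketch, assuming a small made-up `df_demo` indexed by film title:

```python
import pandas as pd

# Hypothetical toy data, indexed by film title
df_demo = pd.DataFrame(
    {"year": [1962, 1963, 1964], "director": ["Young", "Young", "Hamilton"]},
    index=["Dr. No", "From Russia with Love", "Goldfinger"],
)

# A label range keeps BOTH the first and the last label
subset = df_demo.loc["Dr. No":"From Russia with Love", ["year"]]
print(subset)

# A single row label plus a single column label returns one value
value = df_demo.loc["Goldfinger", "director"]
print(value)  # Hamilton
```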

In [None]:
# Importing a df
df_jamesbond = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv")
df_jamesbond = df_jamesbond.set_index("Film") # changing index
df_jamesbond.head() # displaying first 5 rows

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [None]:
# Let us use .loc to filter data
df_jamesbond.loc[["From Russia with Love","Goldfinger"],:] # filtering for a list of row labels, all columns (:)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2


In [None]:
# Filtering for rows in range, two columns
df_jamesbond.loc["From Russia with Love":"Goldfinger",["Year","Director"]] # we use a list within second parameter

Unnamed: 0_level_0,Year,Director
Film,Unnamed: 1_level_1,Unnamed: 2_level_1
From Russia with Love,1963,Terence Young
Goldfinger,1964,Guy Hamilton


#### The .iloc[ ] method
The pandas *.iloc[]* method allows us to access rows and columns by their *position* within the structure.

```
object.iloc[row_position,column_position]
```
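
Position ranges follow the usual Python slicing rules: the end position is excluded, and negative positions count from the end. A short sketch with a made-up `df_demo`:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df_demo = pd.DataFrame({"a": [10, 20, 30, 40], "b": [1, 2, 3, 4]})

first_two = df_demo.iloc[0:2, :]  # positions 0 and 1; position 2 is excluded
last_row = df_demo.iloc[-1, :]    # -1 is the last row

print(first_two)
print(last_row)
```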


In [None]:
# Importing df
df_jamesbond = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv")

In [None]:
# Let us select elements, by position
df_jamesbond.iloc[:5,[1,3,5]] # selecting first 5 rows; columns 2nd, 4th and 6th (remember that Python indexes at 0)

Unnamed: 0,Year,Director,Budget
0,1962,Terence Young,7.0
1,1963,Terence Young,12.6
2,1964,Guy Hamilton,18.6
3,1965,Terence Young,41.9
4,1967,Ken Hughes,85.0


### Filtering rows, based on logical conditions
---
We can also choose to filter rows based on specific conditions we build. 

```
object[object[condition]]
```

We use comparison operators to build a boolean Series (one True/False value per row), then we use it to filter.
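
Conditions can also be written directly inside the brackets. In that case each comparison needs its own parentheses, because `&` and `|` take precedence over `>` and `<`. A minimal sketch, assuming a made-up `df_demo`:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df_demo = pd.DataFrame({"budget": [7.0, 145.3, 206.3], "year": [1962, 2006, 2015]})

# Each comparison is wrapped in parentheses before combining them with &
expensive_recent = df_demo[(df_demo["budget"] > 100) & (df_demo["year"] > 2010)]
print(expensive_recent)
```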


In [None]:
# Importing df
df_jamesbond = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv")

In [None]:
# We first create a boolean Series by testing a condition on a column
cond = df_jamesbond["Budget"]>100
cond

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19     True
20     True
21     True
22     True
23     True
24     True
25     True
Name: Budget, dtype: bool

In [None]:
# We now use the boolean list created above to filter rows
df_jamesbond[cond]

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
19,Tomorrow Never Dies,1997,Pierce Brosnan,Roger Spottiswoode,463.2,133.9,10.0
20,The World Is Not Enough,1999,Pierce Brosnan,Michael Apted,439.5,158.3,13.5
21,Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
22,Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
23,Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
24,Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
25,Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,


In [None]:
# we could create multiple conditions and merge them using AND (&) or OR (|) operators 
cond_1 = df_jamesbond["Budget"]>100
cond_2 = df_jamesbond["Actor"]=='Daniel Craig'
cond_3 = df_jamesbond["Year"]<1965

df_jamesbond[(cond_1 & cond_2) | cond_3] # filtering rows


Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
22,Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
23,Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
24,Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
25,Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,


In [None]:
# Filtering rows and selecting columns (two operations on a same coding line)
df_jamesbond[(cond_1 & cond_2) | cond_3][['Year','Director','Actor']]

Unnamed: 0,Year,Director,Actor
0,1962,Terence Young,Sean Connery
1,1963,Terence Young,Sean Connery
2,1964,Guy Hamilton,Sean Connery
22,2006,Martin Campbell,Daniel Craig
23,2008,Marc Forster,Daniel Craig
24,2012,Sam Mendes,Daniel Craig
25,2015,Sam Mendes,Daniel Craig


In [None]:
# We can also negate conditions using the ~ operator
cond_1 = df_jamesbond["Actor"]=='Daniel Craig'
df_jamesbond[~cond_1] # this turns all True values into False and vice versa

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
5,You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
6,On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
7,Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
8,Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,
9,The Man with the Golden Gun,1974,Roger Moore,Guy Hamilton,334.0,27.7,


#### Some useful methods


In [None]:
# The .isin() method is useful when filtering for multiple values on a categorical variable
cond = df_jamesbond['Director'].isin(['Martin Campbell','Terence Young'])
df_jamesbond[cond]

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
18,GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
22,Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3


In [None]:
# We can use the .str.contains() method to test if text is contained in string variable
cond_1 = df_jamesbond['Film'].str.contains('love', case=False)
cond_2 = df_jamesbond['Film'].str.contains('die', case=False)

df_jamesbond[cond_1 | cond_2]

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
8,Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,
10,The Spy Who Loved Me,1977,Roger Moore,Lewis Gilbert,533.0,45.1,
19,Tomorrow Never Dies,1997,Pierce Brosnan,Roger Spottiswoode,463.2,133.9,10.0
21,Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9


### Filtering columns, based on type of variable
---
The .select_dtypes() method allows us to select columns based on their type of variable.

```
df.select_dtypes(include=[...], exclude=[...])
```
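
Besides specific type names, *.select_dtypes()* accepts the generic value `"number"`, which selects integer and float columns together. A minimal sketch with a made-up `df_demo`:

```python
import pandas as pd

# Hypothetical toy data: one text column, one integer column, one float column
df_demo = pd.DataFrame({"film": ["Dr. No"], "year": [1962], "budget": [7.0]})

# "number" covers both integer and float columns
numeric_cols = df_demo.select_dtypes(include="number")
print(numeric_cols.columns.tolist())  # ['year', 'budget']
```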





In [None]:
# Importing df
# Using .dtypes method to review columns type of variables
df_jamesbond = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv")
df_jamesbond.dtypes

Film                  object
Year                   int64
Actor                 object
Director              object
Box Office           float64
Budget               float64
Bond Actor Salary    float64
dtype: object

In [None]:
# We use the .select_dtypes() method to filter only for columns type 'object' (text) or 'int64' (integer)
df_jamesbond.select_dtypes(include=['object','int64']).head()

Unnamed: 0,Film,Year,Actor,Director
0,Dr. No,1962,Sean Connery,Terence Young
1,From Russia with Love,1963,Sean Connery,Terence Young
2,Goldfinger,1964,Sean Connery,Guy Hamilton
3,Thunderball,1965,Sean Connery,Terence Young
4,Casino Royale,1967,David Niven,Ken Hughes


In [None]:
# We now use the same method, this time to select all excluding columns type 'object'
df_jamesbond.select_dtypes(exclude='object').head()

Unnamed: 0,Year,Box Office,Budget,Bond Actor Salary
0,1962,448.8,7.0,0.6
1,1963,543.8,12.6,1.6
2,1964,820.4,18.6,3.2
3,1965,848.1,41.9,4.7
4,1967,315.0,85.0,


### Filtering rows using SQL-like syntax
---
If you are familiar with SQL (the language of relational databases), you might find it easy to filter rows using the .query() method, whose syntax resembles a SQL WHERE clause.

```
df.query("cond")
```
* Multiple conditions should be combined using and (&) / or (|)
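
Two features of *.query()* are handy in practice: backticks let us refer to column names containing spaces, and `@` lets us use a Python variable inside the condition. The sketch below assumes a made-up `df_demo` and a hypothetical threshold `min_box_office`.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df_demo = pd.DataFrame(
    {"Film": ["Dr. No", "Skyfall"], "Box Office": [448.8, 943.5], "Year": [1962, 2012]}
)

min_box_office = 500  # hypothetical threshold defined outside the query

# Backticks quote a column name with a space; @ refers to a Python variable
result = df_demo.query("`Box Office` > @min_box_office and Year > 2000")
print(result)
```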


In [None]:
# Importing df
df_jamesbond = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv")

In [None]:
# Using the .query() method to filter rows
df_jamesbond.query("Year > 2000 or Actor == 'Daniel Craig'")

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
21,Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
22,Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
23,Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
24,Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
25,Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,


# 3. BASIC OPERATIONS ON DATAFRAMES
---


In [None]:
# Importing df
df_cars = pd.read_csv('https://data-wizards.s3.amazonaws.com/datasets/dataset_us_cars.csv')

In [None]:
# To simplify df, we keep only the first 5 rows (the default for .head())
df_cars = df_cars.head()

In [None]:
# Displaying df
df_cars

Unnamed: 0,year,brand,price,mileage,color,state,country
0,2008,toyota,6300,274117,black,new jersey,usa
1,2011,ford,2899,190552,silver,tennessee,usa
2,2018,dodge,5350,39590,silver,georgia,usa
3,2014,ford,25000,64146,blue,virginia,usa
4,2018,chevrolet,27700,6654,red,florida,usa


### Creating a new column (variable)
---
We use the following syntax to create a new column in a DataFrame. If the column name already exists, pandas will overwrite that column with the new data provided.

```
df["new_column"] = [...]
```
* We should provide a list with as many elements as rows in the DataFrame we are modifying
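
Typing the values by hand does not scale to large tables. As a sketch of an alternative, a new column can be computed from existing ones; the example uses a made-up `df_demo`, and the column names `is_ford` and `price_k` are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df_demo = pd.DataFrame({"brand": ["toyota", "ford", "dodge"], "price": [6300, 2899, 5350]})

# A condition on a column produces one True/False value per row
df_demo["is_ford"] = df_demo["brand"] == "ford"

# Arithmetic on a column is applied to every row
df_demo["price_k"] = df_demo["price"] / 1000

print(df_demo)
```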




In [None]:
# Creating a new column
df_cars["is_ford_card"] = [False,True,False,True,False]
df_cars

Unnamed: 0,year,brand,price,mileage,color,state,country,is_ford_card
0,2008,toyota,6300,274117,black,new jersey,usa,False
1,2011,ford,2899,190552,silver,tennessee,usa,True
2,2018,dodge,5350,39590,silver,georgia,usa,False
3,2014,ford,25000,64146,blue,virginia,usa,True
4,2018,chevrolet,27700,6654,red,florida,usa,False


In [None]:
# Modifying an existing column
df_cars["brand"] = ["Toyota","Ford","Dodge","Ford","Chevrolet"] 
df_cars

Unnamed: 0,year,brand,price,mileage,color,state,country,is_ford_card
0,2008,Toyota,6300,274117,black,new jersey,usa,False
1,2011,Ford,2899,190552,silver,tennessee,usa,True
2,2018,Dodge,5350,39590,silver,georgia,usa,False
3,2014,Ford,25000,64146,blue,virginia,usa,True
4,2018,Chevrolet,27700,6654,red,florida,usa,False


In [None]:
# Creating a new column, based on an operation between two existing columns
df_cars["brand-color"] = df_cars["brand"] + " - " + df_cars["color"]
df_cars

Unnamed: 0,year,brand,price,mileage,color,state,country,is_ford_card,brand-color
0,2008,Toyota,6300,274117,black,new jersey,usa,False,Toyota - black
1,2011,Ford,2899,190552,silver,tennessee,usa,True,Ford - silver
2,2018,Dodge,5350,39590,silver,georgia,usa,False,Dodge - silver
3,2014,Ford,25000,64146,blue,virginia,usa,True,Ford - blue
4,2018,Chevrolet,27700,6654,red,florida,usa,False,Chevrolet - red


### Deleting columns on DataFrames
---
To delete a column from a DataFrame, we could simply select all the columns we would like to keep, using any method introduced in the previous section. However, on some occasions it is easier to explicitly drop columns. 

```
df.drop("column_name",axis="columns")
```
* we must specify that we are dropping columns using the "axis" parameter
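
As a sketch of two common variations: the `columns` parameter is an alternative to `axis="columns"` and accepts a list of names, and the result has to be stored in an object if we want to keep it. The names `df_demo` and `df_small` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df_demo = pd.DataFrame({"brand": ["ford"], "state": ["texas"], "country": ["usa"]})

# Dropping two columns at once and storing the result in a new object
df_small = df_demo.drop(columns=["state", "country"])

print(df_small.columns.tolist())  # ['brand']
print(df_demo.columns.tolist())   # the original DataFrame is unchanged
```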



In [None]:
# Deleting a specific column
df_cars.drop("state",axis="columns")

Unnamed: 0,year,brand,price,mileage,color,country,is_ford_card,brand-color
0,2008,Toyota,6300,274117,black,usa,False,Toyota - black
1,2011,Ford,2899,190552,silver,usa,True,Ford - silver
2,2018,Dodge,5350,39590,silver,usa,False,Dodge - silver
3,2014,Ford,25000,64146,blue,usa,True,Ford - blue
4,2018,Chevrolet,27700,6654,red,usa,False,Chevrolet - red


In [None]:
# Please note that we did NOT overwrite the DataFrame, therefore we still keep the column (we only changed data displayed)
df_cars

Unnamed: 0,brand-color,year,brand,price,mileage,color,state,country,is_ford_card
0,Toyota - black,2008,Toyota,6300,274117,black,new jersey,usa,False
1,Ford - silver,2011,Ford,2899,190552,silver,tennessee,usa,True
2,Dodge - silver,2018,Dodge,5350,39590,silver,georgia,usa,False
3,Ford - blue,2014,Ford,25000,64146,blue,virginia,usa,True
4,Chevrolet - red,2018,Chevrolet,27700,6654,red,florida,usa,False


### Renaming columns in DataFrame
---
We use the rename() method to rename columns on a DataFrame,

```
df.rename({name:new_name},axis="rows/columns")
```
* This method uses dictionaries to define which column name should be replaced by which new name
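
Like most DataFrame methods, *.rename()* returns a new DataFrame and leaves the original untouched. The sketch below assumes a made-up `df_demo` and uses the `columns` parameter as an alternative to `axis="columns"`.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df_demo = pd.DataFrame({"brand": ["ford"], "price": [2899]})

# Store the result in a new object to keep the new names
df_renamed = df_demo.rename(columns={"brand": "car_brand", "price": "car_price"})

print(df_renamed.columns.tolist())  # ['car_brand', 'car_price']
print(df_demo.columns.tolist())     # ['brand', 'price']
```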


In [None]:
# Let us change two column names
# First, we create a dictionary with the column - new name mapping
# Then, we use that dictionary into the .rename() method

new_names_dic = {"brand":"car_brand","price":"car_price"}
df_cars.rename(new_names_dic,axis="columns")

Unnamed: 0_level_0,year,car_brand,car_price,mileage,color,state,country,is_ford_card
brand-color,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Toyota - black,2008,Toyota,6300,274117,black,new jersey,usa,False
Ford - silver,2011,Ford,2899,190552,silver,tennessee,usa,True
Dodge - silver,2018,Dodge,5350,39590,silver,georgia,usa,False
Ford - blue,2014,Ford,25000,64146,blue,virginia,usa,True
Chevrolet - red,2018,Chevrolet,27700,6654,red,florida,usa,False


### Sorting a DataFrame
---
We can choose to sort a DataFrame either by (i) index or (ii) one or more columns.




#### Sorting by index
The *sort_index()* method sorts a DataFrame by its index, in ascending order by default.

```
object.sort_index(ascending=True/False)
```

In [None]:
# Importing df, setting new index
df_jamesbond = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv")
df_jamesbond = df_jamesbond.set_index("Film")
df_jamesbond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [None]:
# Sorting by index
df_jamesbond.sort_index()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
For Your Eyes Only,1981,Roger Moore,John Glen,449.4,60.2,
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2


#### Sorting by column(s)
The *sort_values()* method allows us to sort by one or more columns in a DataFrame.

```
object.sort_values(column,ascending=True/False)
```
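
Missing values (NaN) are placed at the end by default, regardless of the sorting direction. The `na_position` parameter changes this. A minimal sketch with a made-up `df_demo`:

```python
import pandas as pd

# Hypothetical toy data with one missing salary
df_demo = pd.DataFrame({"film": ["A", "B", "C"], "salary": [3.3, None, 17.9]})

# Default behaviour: the row with the missing value comes last
default_sort = df_demo.sort_values("salary", ascending=False)

# na_position="first" moves the missing value to the top
nan_first = df_demo.sort_values("salary", ascending=False, na_position="first")

print(default_sort)
print(nan_first)
```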



In [None]:
# Importing df
df_jamesbond = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv")

In [None]:
# Sorting by column
df_jamesbond.sort_values("Bond Actor Salary",ascending=False)

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
21,Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
24,Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
20,The World Is Not Enough,1999,Pierce Brosnan,Michael Apted,439.5,158.3,13.5
19,Tomorrow Never Dies,1997,Pierce Brosnan,Roger Spottiswoode,463.2,133.9,10.0
15,A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
23,Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
17,Licence to Kill,1989,Timothy Dalton,John Glen,250.9,56.7,7.9
14,Octopussy,1983,Roger Moore,John Glen,373.8,53.9,7.8
7,Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
16,The Living Daylights,1987,Timothy Dalton,John Glen,313.5,68.8,5.2


In [None]:
# Sorting DataFrame by multiple columns
df_jamesbond.sort_values(["Actor","Year"],ascending=[True,False])
# Note that we use a list within method parameter

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
25,Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,
24,Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
23,Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
22,Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
6,On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
21,Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
20,The World Is Not Enough,1999,Pierce Brosnan,Michael Apted,439.5,158.3,13.5
19,Tomorrow Never Dies,1997,Pierce Brosnan,Roger Spottiswoode,463.2,133.9,10.0
18,GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1


### Nesting operations in Python
---
Python lets us nest operations natively, that is to say, without the need for any library: each operation is applied to the result of the previous one. 

Example for nested operations:

```
df.iloc[x,y].head().sort_values(x)
```
* In this example, we are performing three different operations on a single line of code


In [None]:
# Example of nesting operations: (i) import df, (ii) filter rows, (iii) sorting
pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv").iloc[0:5,:].sort_values("Film")

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7


#### Python method chaining
Lines of code get long when nesting multiple operations, which makes the code difficult to read. In those situations, we can place each operation on its own line, using so-called *method chaining*.

```
df.iloc[x,y]\
  .head()\
  .sort_values(x)
```
* We use a backslash (\) to separate operations
* We place the backslash right after an operation, immediately followed by a new line

> ⚠️ Method chaining with backslashes does not allow any character (including comments) after the backslash
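
A common alternative to backslashes is to wrap the whole chain in parentheses, which also allows a comment on each line. The sketch below assumes a made-up `df_demo`.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df_demo = pd.DataFrame(
    {"film": ["Goldfinger", "Dr. No", "Thunderball"], "year": [1964, 1962, 1965]}
)

# Inside parentheses, Python continues the expression across lines
result = (
    df_demo
    .iloc[0:2, :]         # keep the first two rows
    .sort_values("film")  # sort the remaining rows by film title
)
print(result)
```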

In [None]:
# Below is an example of nested operations using the method chaining
pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv")\
  .iloc[0:5,:]\
  .sort_values("Film")

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7


# 4. EXERCISES
---
> ⚠️ Solutions to exercises will be provided after finishing with notebook


### EX 1 Filtering elements in DataFrames
---

##### EX 1.A. Import dataframe. Select columns *'name', 'homeworld' and 'species'* for the first 10 rows in DataFrame.

##### EX 1.B. Filter for characters that a) are not human and b) are from *homeworld* either Naboo, Endor or Kashyyyk.

> *Dataset https://data-wizards.s3.amazonaws.com/datasets/dataset_star_wars.csv*




In [None]:
import pandas as pd

In [None]:
# EX 1.A
df_starwars = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/dataset_star_wars.csv",index_col="name")
df_starwars.head()

Unnamed: 0_level_0,height,mass,hair_color,skin_color,eye_color,birth_year,gender,homeworld,species
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Mon Mothma,150.0,,auburn,fair,blue,48.0,female,Chandrila,Human
Yoda,66.0,17.0,white,green,brown,896.0,male,,Yoda's species
Tion Medon,206.0,80.0,none,grey,black,,male,Utapau,Pau'an
Ratts Tyerell,79.0,15.0,none,grey & blue,unknown,,male,Aleen Minor,Aleena
Luke Skywalker,172.0,77.0,blond,fair,blue,19.0,male,Tatooine,Human


In [None]:
df_starwars[["homeworld","species"]].iloc[:10,:]

Unnamed: 0_level_0,homeworld,species
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Mon Mothma,Chandrila,Human
Yoda,,Yoda's species
Tion Medon,Utapau,Pau'an
Ratts Tyerell,Aleen Minor,Aleena
Luke Skywalker,Tatooine,Human
R2-D2,Naboo,Droid
Beru Whitesun lars,Tatooine,Human
Chewbacca,Kashyyyk,Wookiee
Raymus Antilles,Alderaan,Human
Mace Windu,Haruun Kal,Human


In [None]:
# EX 1.B
cond_1 = df_starwars["species"]!='Human'
cond_2 = df_starwars["homeworld"]=='Naboo'
cond_3 = df_starwars["homeworld"]=='Endor'
cond_4 = df_starwars["homeworld"]=='Kashyyyk'

df_starwars[cond_1 & (cond_2 | cond_3 | cond_4)]


name,height,mass,hair_color,skin_color,eye_color,birth_year,gender,homeworld,species
R2-D2,96.0,32.0,,white & blue,red,33.0,,Naboo,Droid
Chewbacca,228.0,112.0,brown,unknown,blue,200.0,male,Kashyyyk,Wookiee
Rugor Nass,206.0,,none,green,orange,,male,Naboo,Gungan
Tarfful,234.0,136.0,brown,brown,blue,,male,Kashyyyk,Wookiee
Quarsh Panaka,183.0,,black,dark,brown,62.0,male,Naboo,
Roos Tarpals,224.0,82.0,none,grey,orange,,male,Naboo,Gungan
Wicket Systri Warrick,88.0,20.0,brown,brown,brown,8.0,male,Endor,Ewok
Ric Olié,183.0,,brown,fair,blue,,male,Naboo,
Jar Jar Binks,196.0,66.0,none,orange,orange,52.0,male,Naboo,Gungan


In [None]:
# EX 1.B (alternative solution using .isin())
cond_1 = df_starwars["species"]!='Human'
cond_2 = df_starwars["homeworld"].isin(["Naboo","Endor","Kashyyyk"])

df_starwars[cond_1 & cond_2]

name,height,mass,hair_color,skin_color,eye_color,birth_year,gender,homeworld,species
R2-D2,96.0,32.0,,white & blue,red,33.0,,Naboo,Droid
Chewbacca,228.0,112.0,brown,unknown,blue,200.0,male,Kashyyyk,Wookiee
Rugor Nass,206.0,,none,green,orange,,male,Naboo,Gungan
Tarfful,234.0,136.0,brown,brown,blue,,male,Kashyyyk,Wookiee
Quarsh Panaka,183.0,,black,dark,brown,62.0,male,Naboo,
Roos Tarpals,224.0,82.0,none,grey,orange,,male,Naboo,Gungan
Wicket Systri Warrick,88.0,20.0,brown,brown,brown,8.0,male,Endor,Ewok
Ric Olié,183.0,,brown,fair,blue,,male,Naboo,
Jar Jar Binks,196.0,66.0,none,orange,orange,52.0,male,Naboo,Gungan
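
Note that Quarsh Panaka and Ric Olié, whose *species* is missing, appear in the result: comparing a missing value with `!=` evaluates to True. The sketch below reproduces this on a made-up three-row table (`df_toy` and its values are illustrative only) and shows one possible way to also require a known species, should that be the intended reading of "not human".

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
df_toy = pd.DataFrame(
    {
        "species": ["Human", "Gungan", np.nan],
        "homeworld": ["Naboo", "Naboo", "Naboo"],
    },
    index=["A", "B", "C"],
)

# NaN != 'Human' evaluates to True, so the row with a missing species is kept
not_human = df_toy["species"] != "Human"
print(df_toy[not_human].index.tolist())

# Optionally require the species to be known as well
known_non_human = not_human & df_toy["species"].notna()
print(df_toy[known_non_human].index.tolist())
```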


### EX 2 Filtering elements in DataFrames
---

##### EX 2.A. Select columns *'movie_title', 'country', 'director_name' and 'imdb_score'* for the first 10 rows in DataFrame.
##### EX 2.B. Select movies (i) produced outside the USA or with an IMDB score higher than 8.5 and (ii) directed by either James Cameron, Peter Jackson or Tim Burton.

> Dataset https://data-wizards.s3.amazonaws.com/datasets/movies.csv




In [None]:
df_movies = pd.read_csv('https://data-wizards.s3.amazonaws.com/datasets/movies.csv')
df_movies.dtypes

color                         object
director_name                 object
num_critic_for_reviews       float64
duration                     float64
director_facebook_likes      float64
actor_3_facebook_likes       float64
actor_2_name                  object
actor_1_facebook_likes       float64
gross                        float64
genres                        object
actor_1_name                  object
movie_title                   object
num_voted_users                int64
cast_total_facebook_likes      int64
actor_3_name                  object
facenumber_in_poster         float64
plot_keywords                 object
movie_imdb_link               object
num_user_for_reviews         float64
language                      object
country                       object
content_rating                object
budget                       float64
title_year                   float64
actor_2_facebook_likes       float64
imdb_score                   float64
aspect_ratio                 float64
m

In [None]:
# EX 2.A
df_movies[["movie_title","country","director_name","imdb_score"]].iloc[:10,]

Unnamed: 0,movie_title,country,director_name,imdb_score
0,Avatar,USA,James Cameron,7.9
1,Pirates of the Caribbean: At World's End,USA,Gore Verbinski,7.1
2,Spectre,UK,Sam Mendes,6.8
3,The Dark Knight Rises,USA,Christopher Nolan,8.5
4,Star Wars: Episode VII - The Force Awakens,,Doug Walker,7.1
5,John Carter,USA,Andrew Stanton,6.6
6,Spider-Man 3,USA,Sam Raimi,6.2
7,Tangled,USA,Nathan Greno,7.8
8,Avengers: Age of Ultron,USA,Joss Whedon,7.5
9,Harry Potter and the Half-Blood Prince,UK,David Yates,7.5


In [None]:
# EX 2.B
cond_1_1 = df_movies['country']!='USA'
cond_1_2 = df_movies['imdb_score']>8.5
cond_2 = df_movies['director_name'].isin(['James Cameron','Peter Jackson','Tim Burton'])

df_movies[(cond_1_1 | cond_1_2) & cond_2][['movie_title','country','director_name','imdb_score']]

Unnamed: 0,movie_title,country,director_name,imdb_score
20,The Hobbit: The Battle of the Five Armies,New Zealand,Peter Jackson,7.5
25,King Kong,New Zealand,Peter Jackson,7.2
267,The Lord of the Rings: The Fellowship of the Ring,New Zealand,Peter Jackson,8.8
335,The Lord of the Rings: The Return of the King,USA,Peter Jackson,8.9
336,The Lord of the Rings: The Two Towers,USA,Peter Jackson,8.7
3508,The Terminator,UK,James Cameron,8.1
3704,Heavenly Creatures,New Zealand,Peter Jackson,7.4
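
The parentheses around `(cond_1_1 | cond_1_2)` matter: in Python, `&` binds more tightly than `|`, so leaving them out changes which rows are selected. A minimal sketch on made-up data (the table `df_toy` and the director "X" are illustrative assumptions):

```python
import pandas as pd

# Hypothetical toy data, for illustration only
df_toy = pd.DataFrame(
    {
        "country": ["USA", "UK", "USA", "UK"],
        "imdb_score": [9.0, 7.0, 7.0, 9.0],
        "director_name": ["X", "Tim Burton", "Tim Burton", "X"],
    }
)

not_usa = df_toy["country"] != "USA"
high_score = df_toy["imdb_score"] > 8.5
by_director = df_toy["director_name"].isin(["Tim Burton"])

# Intended logic: (outside the USA OR high score) AND chosen director
with_parens = df_toy[(not_usa | high_score) & by_director]

# Without parentheses this reads as: not_usa OR (high_score AND by_director)
without_parens = df_toy[not_usa | high_score & by_director]

print(with_parens.index.tolist())
print(without_parens.index.tolist())
```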


### EX 3 Basic operations on DataFrames
---
##### EX 3.A. Import the DataFrame. Select columns "Company", "Sector" and "Revenue".
##### EX 3.B. Rename column "Revenue" to "Company_Revenue".
##### EX 3.C. Create a new column "Profits_per_Employee" as the result of dividing column "Profits" by "Employees".


> Dataset https://data-wizards.s3.amazonaws.com/datasets/fortune1000.csv




In [None]:
import pandas as pd

In [None]:
# EX 3.A
df_fortune = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/fortune1000.csv")

In [None]:
df_fortune[["Company","Sector","Revenue"]]

Unnamed: 0,Company,Sector,Revenue
0,Walmart,Retailing,482130
1,Exxon Mobil,Energy,246204
2,Apple,Technology,233715
3,Berkshire Hathaway,Financials,210821
4,McKesson,Health Care,181241
...,...,...,...
995,New York Community Bancorp,Financials,1902
996,Portland General Electric,Energy,1898
997,Portland General Electric,Energy,1898
998,Wendy’s,"Hotels, Resturants & Leisure",1896


In [None]:
# EX 3.B
df_fortune = df_fortune.rename(
    {"Revenue":"Company_Revenue"}
    ,axis="columns"
)

df_fortune

Unnamed: 0,Rank,Company,Sector,Industry,Location,Company_Revenue,Profits,Employees
0,1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
1,2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
2,3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000
3,4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),"Omaha, NE",210821,24083,331000
4,5,McKesson,Health Care,Wholesalers: Health Care,"San Francisco, CA",181241,1476,70400
...,...,...,...,...,...,...,...,...
995,996,New York Community Bancorp,Financials,Commercial Banks,"Westbury, NY",1902,-47,3448
996,997,Portland General Electric,Energy,Utilities: Gas and Electric,"Portland, OR",1898,172,2646
997,997,Portland General Electric,Energy,Utilities: Gas and Electric,"Portland, OR",1898,172,2646
998,999,Wendy’s,"Hotels, Resturants & Leisure",Food Services,"Dublin, OH",1896,161,21200
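
The result of `rename()` is assigned back to `df_fortune` because the method returns a new DataFrame and leaves the original untouched. A small sketch on made-up data (`df_toy` and its values are illustrative only):

```python
import pandas as pd

# Hypothetical toy data, for illustration only
df_toy = pd.DataFrame({"Company": ["A", "B"], "Revenue": [10, 20]})

# rename() returns a new DataFrame; df_toy itself keeps its column names
renamed = df_toy.rename({"Revenue": "Company_Revenue"}, axis="columns")

print(df_toy.columns.tolist())
print(renamed.columns.tolist())
```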


In [None]:
# EX 3.C
df_fortune["Profits_per_Employee"] = df_fortune["Profits"] / df_fortune["Employees"]
df_fortune

Unnamed: 0,Rank,Company,Sector,Industry,Location,Company_Revenue,Profits,Employees,Profits_per_Employee
0,1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000,0.006389
1,2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600,0.213624
2,3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000,0.485400
3,4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),"Omaha, NE",210821,24083,331000,0.072758
4,5,McKesson,Health Care,Wholesalers: Health Care,"San Francisco, CA",181241,1476,70400,0.020966
...,...,...,...,...,...,...,...,...,...
995,996,New York Community Bancorp,Financials,Commercial Banks,"Westbury, NY",1902,-47,3448,-0.013631
996,997,Portland General Electric,Energy,Utilities: Gas and Electric,"Portland, OR",1898,172,2646,0.065004
997,997,Portland General Electric,Energy,Utilities: Gas and Electric,"Portland, OR",1898,172,2646,0.065004
998,999,Wendy’s,"Hotels, Resturants & Leisure",Food Services,"Dublin, OH",1896,161,21200,0.007594
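
Assigning to `df_fortune["Profits_per_Employee"]` modifies the DataFrame in place. As a possible alternative, `assign()` returns a new DataFrame with the extra column and leaves the original as it was. A sketch on made-up data (`df_toy` and its values are illustrative only):

```python
import pandas as pd

# Hypothetical toy data, for illustration only
df_toy = pd.DataFrame(
    {"Company": ["A", "B"], "Profits": [100, -30], "Employees": [50, 10]}
)

# assign() builds a new DataFrame that includes the derived column
with_ratio = df_toy.assign(
    Profits_per_Employee=df_toy["Profits"] / df_toy["Employees"]
)

print(with_ratio["Profits_per_Employee"].tolist())
print("Profits_per_Employee" in df_toy.columns)
```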


### EX 4 Basic operations on DataFrames
---

##### Import the DataFrame. Extract the top 10 companies with the highest revenue in the Technology sector.

> Dataset https://data-wizards.s3.amazonaws.com/datasets/fortune1000.csv




In [None]:
df_fortune = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/fortune1000.csv")
df_fortune.head()

Unnamed: 0,Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
0,1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
1,2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
2,3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000
3,4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),"Omaha, NE",210821,24083,331000
4,5,McKesson,Health Care,Wholesalers: Health Care,"San Francisco, CA",181241,1476,70400


In [None]:
df_fortune[df_fortune['Sector']=='Technology'].sort_values('Revenue',ascending=False).iloc[:10,]

Unnamed: 0,Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
2,3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000
17,18,Amazon.com,Technology,Internet Services and Retailing,"Seattle, WA",107006,596,230800
19,20,HP,Technology,"Computers, Office Equipment","Palo Alto, CA",103355,4554,287000
24,25,Microsoft,Technology,Computer Software,"Redmond, WA",93580,12193,118000
30,31,IBM,Technology,Information Technology Services,"Armonk, NY",82461,13190,411798
35,36,Alphabet,Technology,Internet Services and Retailing,"Mountain View, CA",74989,16348,61814
50,51,Intel,Technology,Semiconductors and Other Electronic Components,"Santa Clara, CA",55355,11420,107300
53,54,Cisco Systems,Technology,Network and Other Communications Equipment,"San Jose, CA",49161,8981,71833
76,77,Oracle,Technology,Computer Software,"Redwood City, CA",38226,9938,132000
109,110,Qualcomm,Technology,Network and Other Communications Equipment,"San Diego, CA",25281,5271,33000
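
Sorting in descending order and keeping the first rows is one way to get a "top N". pandas also offers `nlargest()`, which could be used for the same purpose. The sketch below compares both on made-up data (`df_toy` and its values are illustrative only; there are no ties in the toy revenue figures):

```python
import pandas as pd

# Hypothetical toy data, for illustration only
df_toy = pd.DataFrame(
    {
        "Company": ["A", "B", "C", "D"],
        "Sector": ["Technology", "Energy", "Technology", "Technology"],
        "Revenue": [50, 90, 70, 10],
    }
)

tech = df_toy[df_toy["Sector"] == "Technology"]

# Approach used in the solution: sort descending, then keep the first rows
top2_sorted = tech.sort_values("Revenue", ascending=False).iloc[:2]

# Possible alternative: nlargest() returns the n rows with the largest values
top2_nlargest = tech.nlargest(2, "Revenue")

print(top2_sorted["Company"].tolist())
```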


### EX 5 Basic operations on DataFrames
---
##### Import dataframe. Extract top 10 Sci-Fi movies with highest IMDB score; select only the following fields: *title_year*, *director_name* and *imdb_score*.

> Dataset https://data-wizards.s3.amazonaws.com/datasets/movies.csv




In [None]:
df_movies = pd.read_csv('https://data-wizards.s3.amazonaws.com/datasets/movies.csv',index_col='movie_title')
df_movies.head()

movie_title,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Avatar,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,886204,4834,Wes Studi,0.0,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,Johnny Depp,471220,48350,Jack Davenport,0.0,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
Spectre,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,Christoph Waltz,275868,11700,Stephanie Sigman,1.0,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,Tom Hardy,1144337,106759,Joseph Gordon-Levitt,0.0,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
Star Wars: Episode VII - The Force Awakens,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,Doug Walker,8,143,,0.0,,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,,,,,,,12.0,7.1,,0


In [None]:
# Step 1: build a boolean Series (mask) with the str.contains() method
cond_1 = df_movies['genres'].str.contains('Sci-Fi', regex=False, case=False)
# Step 2: filter with the mask, sort by score, keep the top 10 rows and select a few columns
df_movies[cond_1].sort_values('imdb_score',ascending=False).iloc[:10,][['title_year','director_name','imdb_score']]

movie_title,title_year,director_name,imdb_score
Daredevil,,,8.8
Star Wars: Episode V - The Empire Strikes Back,1980.0,Irvin Kershner,8.8
Inception,2010.0,Christopher Nolan,8.8
The Matrix,1999.0,Lana Wachowski,8.7
Star Wars: Episode IV - A New Hope,1977.0,George Lucas,8.7
Interstellar,2014.0,Christopher Nolan,8.6
Outlander,,,8.5
Alien,1979.0,Ridley Scott,8.5
The Prestige,2006.0,Christopher Nolan,8.5
Terminator 2: Judgment Day,1991.0,James Cameron,8.5
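
`str.contains()` returns a missing value for rows where the text itself is missing, and a mask containing missing values cannot be used directly for filtering. If a column may have gaps, passing `na=False` treats those rows as non-matches. A sketch on made-up data (`df_toy` and its values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
df_toy = pd.DataFrame(
    {
        "genres": ["Action|Sci-Fi", "Drama", np.nan, "sci-fi|Thriller"],
        "imdb_score": [8.0, 9.0, 7.0, 8.5],
    },
    index=["M1", "M2", "M3", "M4"],
)

# case=False ignores capitalisation; na=False counts missing genres as no match
is_scifi = df_toy["genres"].str.contains(
    "Sci-Fi", regex=False, case=False, na=False
)

top = df_toy[is_scifi].sort_values("imdb_score", ascending=False)
print(top.index.tolist())
```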
