# Introduction to pandas

## Understanding pandas and NumPy


<p></p><center><img alt="anatomy of a dataframe" src="images/df_anatomy_static_resized.svg"></center><p></p>

## About pandas

Key Features of Pandas
- Fast and efficient DataFrame object with default and customized indexing.
- Tools for loading data into in-memory data objects from different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of data sets.
- Label-based slicing, indexing and subsetting of large data sets.
- Columns from a data structure can be deleted or inserted.
- Group by data for aggregation and transformations.
- High performance merging and joining of data.
- Time Series functionality.

## Importing pandas

[Installation guide](https://pandas.pydata.org/docs/getting_started/install.html)

In [9]:
import pandas as pd

In [10]:
import numpy as np

In [11]:
pd.__version__

'1.5.1'

In [12]:
# Open the documentation (IPython/Jupyter only). The ? must be the last
# character on the line; a trailing comment after it raises a SyntaxError.
pd.read_csv?

More detailed documentation, along with tutorials and other resources, can be found at http://pandas.pydata.org/.

## Introduction to the Data

<p>The data set is a CSV file called <code>f500.csv</code>. Here is a data dictionary for some of the columns in the CSV:</p>
<ul>
<li><code>company</code>: Name of the company.</li>
<li><code>rank</code>: Global 500 rank for the company.</li>
<li><code>revenues</code>: Company's total revenue for the fiscal year, in millions of dollars (USD).</li>
<li><code>revenue_change</code>: Percentage change in revenue between the current and prior fiscal year.</li>
<li><code>profits</code>: Net income for the fiscal year, in millions of dollars (USD).</li>
<li><code>ceo</code>: Company's Chief Executive Officer.</li>
<li><code>industry</code>: Industry in which the company operates.</li>
<li><code>sector</code>: Sector in which the company operates.</li>
<li><code>previous_rank</code>: Global 500 rank for the company for the prior year.</li>
<li><code>country</code>: Country in which the company is headquartered.</li>
</ul>
</div>

<img src="images/02_io_readwrite.svg">

https://pandas.pydata.org/docs/user_guide/io.html?highlight=io

In [43]:
f500 = pd.read_csv("./data/f500.csv", index_col=0)
f500


company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798
State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456
Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4,China,"Beijing, China",http://www.sinopec.com,19,713288,106523
China National Petroleum,4,262573,-12.3,1867.5,585619,-73.7,Zhang Jianhua,Petroleum Refining,Energy,3,China,"Beijing, China",http://www.cnpc.com.cn,17,1512048,301893
Toyota Motor,5,254694,7.7,16899.3,437575,-12.3,Akio Toyoda,Motor Vehicles and Parts,Motor Vehicles & Parts,8,Japan,"Toyota, Japan",http://www.toyota-global.com,23,364445,157210
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Teva Pharmaceutical Industries,496,21903,11.5,329.0,92890,-79.3,Yitzhak Peterburg,Pharmaceuticals,Health Care,0,Israel,"Petach Tikva, Israel",http://www.tevapharm.com,1,56960,33337
New China Life Insurance,497,21796,-13.3,743.9,100609,-45.6,Wan Feng,"Insurance: Life, Health (stock)",Financials,427,China,"Beijing, China",http://www.newchinalife.com,2,54378,8507
Wm. Morrison Supermarkets,498,21741,-11.3,406.4,11630,20.4,David T. Potts,Food and Drug Stores,Food & Drug Stores,437,Britain,"Bradford, Britain",http://www.morrisons.com,13,77210,5111
TUI,499,21655,-5.5,1151.7,16247,195.5,Friedrich Joussen,Travel Services,Business Services,467,Germany,"Hanover, Germany",http://www.tuigroup.com,23,66779,3006


In [None]:
type(f500)

In [None]:
f500.shape
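To see what `index_col=0` changes without needing the course file, the sketch below reads a tiny CSV from an in-memory string. The three rows are copied from the output above; the `csv_text` variable and the use of `io.StringIO` are illustrative assumptions, not part of the original notebook.

```python
import io

import pandas as pd

# A tiny stand-in for f500.csv (hypothetical in-memory data, three columns only).
csv_text = """company,rank,revenues
Walmart,1,485873
State Grid,2,315199
Sinopec Group,3,267518
"""

# Without index_col, pandas builds a default RangeIndex (0, 1, 2, ...).
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column ("company") becomes the row labels.
labeled = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index.shape)     # company is still a regular column here
print(labeled.shape)           # company has moved into the index
print(labeled.index.tolist())
```

The labeled frame has one column fewer, because `company` is no longer data but the row index.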

## Introducing Pandas Objects - Data Structures

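One of the keys to understanding pandas is to understand its data model. Historically, pandas was built around three data structures (Panel was deprecated in pandas 0.20 and removed in 0.25, so only Series and DataFrame are available in the version used here):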

- Series — 1D (can be understood as columns of a spreadsheet)

<img src="images/01_table_series.svg">

- DataFrame — 2D (can be understood as a single spreadsheet)

<img src="images/01_table_dataframe.svg">

- Panel — 3D (can be understood as a group of spreadsheets)

<table class="table table-bordered">
<tbody><tr>
<th style="text-align:center;">Data Structure</th>
<th style="text-align:center;">Dimensions</th>
<th style="text-align:center;">Description</th>
</tr>
<tr>
<td style="text-align:center;">Series</td>
<td style="text-align:center;">1</td>
<td style="text-align:center;">1D labeled homogeneous array, size-immutable.</td>
</tr>
<tr>
<td style="text-align:center;">DataFrame</td>
<td style="text-align:center;">2</td>
<td style="text-align:center;">General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns.</td>
</tr>
<tr>
<td style="text-align:center;">Panel</td>
<td style="text-align:center;">3</td>
<td style="text-align:center;">General 3D labeled, size-mutable array. Deprecated in pandas 0.20 and removed in 0.25.</td>
</tr>
</tbody></table>
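As a minimal sketch of how the two current structures relate, the example below builds a Series and a DataFrame by hand and shows that a single DataFrame column is itself a Series. The values are taken from the f500 output above; the variable names are illustrative.

```python
import pandas as pd

# A Series is a 1D labeled array: values plus an index of labels.
revenues = pd.Series(
    [485873, 315199], index=["Walmart", "State Grid"], name="revenues"
)

# A DataFrame is a 2D table whose columns may have different dtypes.
df = pd.DataFrame(
    {"revenues": [485873, 315199], "country": ["USA", "China"]},
    index=["Walmart", "State Grid"],
)

# Selecting one column of a DataFrame gives back a Series.
column = df["revenues"]

print(type(column).__name__)
print(df.shape, column.shape)
print(column.tolist() == revenues.tolist())
```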

## Introducing DataFrames

In [None]:
f500.head()

In [None]:
f500.tail(3)

In [None]:
f500.dtypes

In [None]:
f500.info()

In [None]:
f500.describe()

## Pandas Data Selection - indexing

### Selecting a Column From a DataFrame by Label (.loc)
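Selecting with `.loc` can return either a view or a copy of the data; pandas does not guarantee which, so call `.copy()` when an independent object is needed.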

    df.loc[row_label, column_label]

<img src="images/03_subset_columns.svg">

In [None]:
f500.head(2)

In [None]:
# rank_col = f500.loc[:,"rank"]
rank_col = f500["rank"]  # shorthand: no .loc and no ":" for all rows
rank_col

In [None]:
# Attribute access (f500.industry) also works, but it fails for column names
# that clash with DataFrame methods (e.g. f500.rank), so prefer brackets.
industry_col = f500["industry"]
industry_col
type(industry_col)

In [None]:
type(industry_col.values)

In [None]:
industry_col.values.shape

In [None]:
print(industry_col.values.dtype)

In [None]:
# f500.loc[:,["country", "rank"]]
f500[["country","rank"]]

In [None]:
f500.loc[:,"profits":"ceo"]  # with label-based slicing the end label (here "ceo") is included

<div>
<p><img alt="dataframe exploded" src="images/df_exploded_resized.svg"></p>
</div>

<div>

<p>A summary of the techniques we've learned so far is below:</p>
<p></p><center>
<table>
<thead>
<tr>
<th>Select by Label</th>
<th>Explicit Syntax</th>
<th>Common Shorthand</th>
<th>Other Shorthand</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single column</td>
<td><code>df.loc[:,"col1"]</code></td>
<td bgcolor="#00FF00"><code>df["col1"]</code></td>
<td><code>df.col1</code></td>
</tr>
<tr>
<td>List of columns</td>
<td><code>df.loc[:,["col1", "col7"]]</code></td>
<td bgcolor="#00FF00"><code>df[["col1", "col7"]]</code></td>
<td></td>
</tr>
<tr>
<td>Slice of columns</td>
<td bgcolor="#00FF00"><code>df.loc[:,"col1":"col4"]</code></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
</center><p></p>
</div>
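The three column-selection forms in the table can be tried on a small frame. This is a sketch on a hypothetical four-column miniature of `f500`, not the full data set; only the column names and values are borrowed from it.

```python
import pandas as pd

# Hypothetical miniature of f500 with company names as row labels.
df = pd.DataFrame(
    {
        "rank": [1, 2, 3],
        "revenues": [485873, 315199, 267518],
        "profits": [13643.0, 9571.3, 1257.9],
        "country": ["USA", "China", "China"],
    },
    index=["Walmart", "State Grid", "Sinopec Group"],
)

single = df["rank"]                        # one column -> Series
several = df[["country", "rank"]]          # list of columns -> DataFrame, in the order given
sliced = df.loc[:, "revenues":"country"]   # label slice -> the end label is included

print(type(single).__name__, type(several).__name__)
print(several.columns.tolist())
print(sliced.columns.tolist())
```

Note that the list form returns the columns in the order requested, while the slice form follows the order in which the columns are stored.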

<div class="alert alert-block alert-info">
<b>Exercise:</b> Select the country column. Assign the result to the variable name countries.</div>

In [None]:
f500.head()
countries = f500["country"]
countries

<div class="alert alert-block alert-info">
<b>Exercise:</b> In order, select the revenues and years_on_global_500_list columns. Assign the result to the variable name revenues_years.</div>

In [None]:
f500.head()
revenues_years = f500[["revenues","years_on_global_500_list"]]
print(revenues_years)

<div class="alert alert-block alert-info">
<b>Exercise:</b> In order, select all columns from ceo up to and including sector. Assign the result to the variable name ceo_to_sector.</div>

In [None]:
ceo_to_sector = f500.loc[:,"ceo":"sector"]
ceo_to_sector

### Selecting Rows From a DataFrame by Label (.loc)

    df.loc[row_label, column_label]

<img src="images/03_subset_rows.svg">

**Select a single row**

In [None]:
f500.head()

In [None]:
single_row = f500.loc["Sinopec Group"]
single_row
print(single_row.dtype)

In [None]:
row = f500.loc["State Grid" : "Toyota Motor"]
row

In [None]:
sectors = f500["sector"]
count_sectors = sectors.value_counts()  # groups equal values and counts them

In [None]:
count_sectors["Materials"]

In [None]:
count_sectors[["Materials", "Media"]]

**Select a list of rows**

In [None]:
rows = ["Toyota Motor", "Walmart"]
f500.loc[rows]  # a list of row labels returns the rows in the order given



**Select a slice object with labels**

In [None]:
slice_rows = f500.loc["State Grid":"Toyota Motor"]  # a label slice must be written inside the brackets
slice_rows

<img alt="series vs dataframe: series" src="images/df_series_s_updated.svg">

<img alt="series vs dataframe: dataframe" src="images/df_series_df_updated.svg">

### Selecting Items from a Series by Label (.loc)

<table>
<thead>
<tr>
<th>Select by Label</th>
<th>Explicit Syntax</th>
<th>Shorthand Convention</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single item from series</td>
<td><code>s.loc["item8"]</code></td>
<td bgcolor="#00FF00"> <code>s["item8"]</code></td>
</tr>
<tr>
<td>List of items from series</td>
<td><code>s.loc[["item1","item7"]]</code></td>
<td bgcolor="#00FF00"><code>s[["item1","item7"]]</code></td>
</tr>
<tr>
<td>Slice of items from series</td>
<td><code>s.loc["item2":"item4"]</code></td>
<td bgcolor="#00FF00"><code>s["item2":"item4"]</code></td>
</tr>
</tbody>
</table>
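The same three forms apply to a Series. The sketch below uses a hand-built Series of sector counts (the numbers are made up for illustration) and writes the slice with explicit `.loc`.

```python
import pandas as pd

# Hypothetical sector counts, indexed by sector name.
counts = pd.Series(
    [25, 22, 9, 8],
    index=["Financials", "Energy", "Wholesalers", "Technology"],
)

one = counts["Energy"]                       # single label -> scalar
some = counts[["Technology", "Financials"]]  # list of labels -> Series in that order
span = counts.loc["Energy":"Technology"]     # label slice -> both ends included

print(one)
print(some.tolist())
print(span.index.tolist())
```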

### Summary of label selection (.loc)

<table>
<thead>
<tr>
<th>Select by Label</th>
<th>Explicit Syntax</th>
<th>Shorthand Convention</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single column from dataframe</td>
<td><code>df.loc[:,"col1"]</code></td>
<td bgcolor="#00FF00"><code>df["col1"]</code></td>
</tr>
<tr>
<td>List of columns from dataframe</td>
<td><code>df.loc[:,["col1","col7"]]</code></td>
<td bgcolor="#00FF00"><code>df[["col1","col7"]]</code></td>
</tr>
<tr>
<td>Slice of columns from dataframe</td>
<td bgcolor="#00FF00"><code>df.loc[:,"col1":"col4"]</code></td>
<td></td>
</tr>
<tr>
<td>Single row from dataframe</td>
<td bgcolor="#00FF00"><code>df.loc["row4"]</code></td>
<td></td>
</tr>
<tr>
<td>List of rows from dataframe</td>
<td bgcolor="#00FF00"><code>df.loc[["row1", "row8"]]</code></td>
<td></td>
</tr>
<tr>
<td>Slice of rows from dataframe</td>
<td bgcolor="#00FF00"><code>df.loc["row3":"row5"]</code></td>
<td><code>df["row3":"row5"]</code></td>
</tr>
<tr>
<td>Single item from series</td>
<td><code>s.loc["item8"]</code></td>
<td bgcolor="#00FF00"><code>s["item8"]</code></td>
</tr>
<tr>
<td>List of items from series</td>
<td><code>s.loc[["item1","item7"]]</code></td>
<td bgcolor="#00FF00"><code>s[["item1","item7"]]</code></td>
</tr>
<tr>
<td>Slice of items from series</td>
<td><code>s.loc["item2":"item4"]</code></td>
<td bgcolor="#00FF00"><code>s["item2":"item4"]</code></td>
</tr>
</tbody>
</table>
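Row selection, and row plus column selection in one call, can be sketched the same way. The frame below is a hypothetical four-row miniature of `f500`.

```python
import pandas as pd

df = pd.DataFrame(
    {"rank": [1, 2, 3, 4], "country": ["USA", "China", "China", "China"]},
    index=["Walmart", "State Grid", "Sinopec Group", "China National Petroleum"],
)

one_row = df.loc["Sinopec Group"]              # single label -> Series holding one row
picked = df.loc[["Sinopec Group", "Walmart"]]  # list of labels -> DataFrame, in the order given
block = df.loc["State Grid":"Sinopec Group", ["country"]]  # row slice plus column list

print(one_row["rank"])
print(picked.index.tolist())
print(block.shape)
```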

<div class="alert alert-block alert-info">
<b>Exercise:</b> Create a new variable big_movers containing the rows with labels Aviva, HP, JD.com, and BHP Billiton, in that order, and the rank and previous_rank columns, in that order.</div>

In [13]:
big_movers = f500.loc[["Aviva", "HP", "JD.com", "BHP Billiton"], ["rank","previous_rank"]]
big_movers

<div class="alert alert-block alert-info">
<b>Exercise:</b> Create a new variable bottom_companies containing all rows with labels from National Grid to AutoNation, inclusive, and the rank, sector, and country columns.</div>

In [None]:
bottom_companies = f500.loc["National Grid":"AutoNation", ["rank","sector","country"]]
bottom_companies

In [None]:
f500.head(5)

In [None]:
f500.loc["China National Petroleum":"Volkswagen", ["ceo", "industry", "country"]]

## Vectorized Operations

<p><img alt="Vectorized operation" src="images/vectorized.gif"></p>

In [None]:
my_series = pd.Series([1, 2, 3, 4, 5]) # manual init of pandas series
my_series

In [None]:
my_series = my_series + 10

In [None]:
my_series

<div>
<ul>
<li><code>series_a + series_b</code> - Addition</li>
<li><code>series_a - series_b</code> - Subtraction</li>
<li><code>series_a * series_b</code> - Multiplication (element-wise; this is not the matrix multiplication used in linear algebra).</li>
<li><code>series_a / series_b</code> - Division</li>
</ul>
</div>
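One detail worth seeing in a small example: arithmetic between two Series pairs values by index label rather than by position. The sketch below uses two hand-built Series whose labels are deliberately in a different order; the names and numbers are illustrative.

```python
import pandas as pd

rank = pd.Series([1, 2, 3], index=["Walmart", "State Grid", "Sinopec Group"])
previous_rank = pd.Series([4, 2, 1], index=["Sinopec Group", "State Grid", "Walmart"])

# Scalar operations are applied to every element.
shifted = rank + 10

# Series-to-Series operations pair values by index label, not by position.
change = previous_rank - rank

print(shifted.tolist())
print(change.to_dict())
```

Sinopec Group ends up with 4 - 3 = 1 even though it is first in one Series and last in the other.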

In [None]:
change_rank = f500["rank"] - f500["previous_rank"]
change_rank

##  Series Data Exploration Methods

<div>
<ul>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.max.html"><code>Series.max()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.min.html"><code>Series.min()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mean.html"><code>Series.mean()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.median.html"><code>Series.median()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mode.html"><code>Series.mode()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sum.html"><code>Series.sum()</code></a></li>
</ul>

</div>
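A short sketch of the methods listed above on a hand-made Series (the values are arbitrary):

```python
import pandas as pd

values = pd.Series([3, 1, 4, 1, 5])

print(values.max())     # largest value
print(values.min())     # smallest value
print(values.mean())    # arithmetic mean
print(values.median())  # middle value of the sorted data
print(values.sum())     # total
# mode() returns a Series, because several values can tie for most frequent.
print(values.mode().tolist())
```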

In [None]:
my_series = pd.Series([0, 1, 2, 3, 4])
my_series

In [None]:
print(my_series.sum())

<div class="alert alert-block alert-info">
<b>Exercise:</b> Use the Series.max() method to find the maximum value of the change_rank series computed above. Assign the result to the variable rank_change_max.</div>

In [None]:
rank_change_max = change_rank.max()
rank_change_max

<div class="alert alert-block alert-info">
<b>Exercise:</b> Use the Series.min() method to find the minimum value of the change_rank series computed above. Assign the result to the variable rank_change_min.</div>

In [None]:
rank_change_min = change_rank.min()
rank_change_min

### Series Describe Method

In [None]:
assets = f500["assets"]
assets.describe()

In [None]:
f500["country"].describe()  # describe() adapts its output to the column's dtype

<div class="alert alert-block alert-info">
<b>Exercise:</b> Return a series of descriptive statistics for the rank column in f500.</div>

In [None]:
rank = f500["rank"]
rank.describe()
rank.describe()["count"]

<div class="alert alert-block alert-info">
<b>Exercise:</b> Return a series of descriptive statistics for the previous_rank column in f500.</div>

In [None]:
prev_rank = f500["previous_rank"]
prev_rank.describe()


## Method Chaining

In [None]:
f500["country"].value_counts()["USA"]
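Method chaining simply calls each method on the result of the previous one. The sketch below shows the stepwise and chained forms side by side on a hypothetical list of countries.

```python
import pandas as pd

countries = pd.Series(["USA", "China", "USA", "Japan", "USA", "China"])

# Step by step: count each distinct value, then look one label up.
counts = countries.value_counts()
usa_stepwise = counts.loc["USA"]

# Chained: each method is called on the result of the previous one.
usa_chained = countries.value_counts().loc["USA"]

print(usa_stepwise, usa_chained)
```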

<div class="alert alert-block alert-info">
<b>Exercise:</b> Use Series.value_counts() and Series.loc to return the number of companies with a value of 0 in the previous_rank column in the f500 dataframe. Assign the result to zero_previous_rank.</div>

In [None]:
# Find how many companies have a previous_rank equal to 0.
# f500.loc[f500["previous_rank"] == 0].describe()
zero_previous_rank = f500["previous_rank"].value_counts().loc[0]
zero_previous_rank

## Dataframe Exploration Methods

<div>

<ul>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.max.html"><code>Series.max()</code></a> and <a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.max.html"><code>DataFrame.max()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.min.html"><code>Series.min()</code></a> and <a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.min.html"><code>DataFrame.min()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mean.html"><code>Series.mean()</code></a> and <a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html"><code>DataFrame.mean()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.median.html"><code>Series.median()</code></a> and <a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.median.html"><code>DataFrame.median()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mode.html"><code>Series.mode()</code></a> and <a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mode.html"><code>DataFrame.mode()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sum.html"><code>Series.sum()</code></a> and <a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html"><code>DataFrame.sum()</code></a></li>
</ul>

<p><img alt="dataframe axis parameters" src="images/axis_param.svg"></p>

</div>
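The `axis` parameter is easiest to read from a tiny example. In this sketch (made-up numbers) `axis=0` collapses the rows and yields one result per column, while `axis=1` collapses the columns and yields one result per row.

```python
import pandas as pd

df = pd.DataFrame(
    {"revenues": [10, 20, 60], "profits": [1, 2, 9]},
    index=["a", "b", "c"],
)

per_column = df.sum(axis=0)  # collapse the rows: one result per column
per_row = df.sum(axis=1)     # collapse the columns: one result per row

print(per_column.to_dict())
print(per_row.to_dict())
```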

In [None]:
# median of the revenues and profits columns
f500[["revenues", "profits"]].median(axis=0)  # axis=0 is the default: one median per column, computed down the rows

<div class="alert alert-block alert-info">
<b>Exercise:</b> Use the DataFrame.max() method to find the maximum value for only the numeric columns from f500 (you may need to check the documentation). Assign the result to the variable max_f500.</div>

In [None]:
# find the maximum value of each numeric column
max_f500 = f500.max(numeric_only=True)
max_f500

### Dataframe Describe Method

In [14]:
f500.describe()


In [None]:
f500.describe(include=["O"])

<div class="alert alert-block alert-info">
<b>Exercise:</b> Return a dataframe of descriptive statistics for all of the numeric columns in f500. Assign the result to f500_desc.</div>

## Assignment with pandas

In [None]:
top5 = f500[["rank", "revenues"]].head(5).copy()  # explicit copy: assignments below cannot touch f500
top5["revenues"] = 0
top5.loc["Sinopec Group","revenues"] = 999
top5

In [None]:
f500.head()  # the original values were not affected
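The difference between assigning into an independent copy and assigning into the original can be sketched on a small frame. The data below is a hypothetical three-row miniature; the explicit `.copy()` is an addition for this sketch that makes the independence unambiguous.

```python
import pandas as pd

df = pd.DataFrame(
    {"rank": [1, 2, 3], "revenues": [485873, 315199, 267518]},
    index=["Walmart", "State Grid", "Sinopec Group"],
)

# An explicit copy is independent of the original.
top2 = df.head(2).copy()
top2["revenues"] = 0                      # overwrite a whole column
top2.loc["State Grid", "revenues"] = 999  # overwrite a single cell

# Assigning through .loc on the original changes it in place.
df.loc["Sinopec Group", "rank"] = 30

print(top2["revenues"].tolist())
print(df["revenues"].tolist())
print(df.loc["Sinopec Group", "rank"])
```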

<div class="alert alert-block alert-info">
<b>Exercise:</b> The company "Dow Chemical" has named a new CEO. In the f500 dataframe, set the value in the ceo column for the row labeled Dow Chemical to Jim Fitterling.</div>

In [None]:
f500.loc["Dow Chemical", "ceo"] = "Jim Fitterling"
f500.loc["Dow Chemical"]

## Using Boolean Indexing with pandas Objects

In [None]:
d = {'name': ['Bob', 'Eva', 'Sara', 'Mihael'], 'num': [12, 8, 5, 8]}
df = pd.DataFrame(data=d, index=['w', 'x', 'y', 'z'])
df

In [8]:
# boolean mask: True for the rows where num equals 8
df["num"] == 8
df2 = df[df["num"] == 8]
print(df2)

     name  num
x     Eva    8
z  Mihael    8


<div class="alert alert-block alert-info">
<b>Exercise:</b> Create a boolean series, motor_bool, that compares whether the values in the industry column from the f500 dataframe are equal to "Motor Vehicles and Parts".
Use the motor_bool boolean series to index the country column. Assign the result to motor_countries.</div>

In [None]:
# industry -> "Motor Vehicles and Parts"
# we want the countries of the companies that pass this filter

In [24]:
motor_bool = f500["industry"] == "Motor Vehicles and Parts"
motor_countries = f500.loc[motor_bool, "country"]
print(motor_countries)

# Chained indexing, as in the commented line below, works for reading but is
# discouraged; see the later notes on DataFrame copies.
# motor_countries = f500[motor_bool]["country"]



company
Toyota Motor                                 Japan
Volkswagen                                 Germany
Daimler                                    Germany
General Motors                                 USA
Ford Motor                                     USA
Honda Motor                                  Japan
SAIC Motor                                   China
Nissan Motor                                 Japan
BMW Group                                  Germany
Dongfeng Motor                               China
Robert Bosch                               Germany
Hyundai Motor                          South Korea
China FAW Group                              China
Beijing Automotive Group                     China
Peugeot                                     France
Renault                                     France
Kia Motors                             South Korea
Continental                                Germany
Denso                                        Japan
Guangzhou Automobile In

### Using Boolean Arrays to Assign Values

In [28]:
sector = "Motor Vehicles & Parts"
f500[f500["sector"] == sector].shape[0]  # number of rows where sector equals "Motor Vehicles & Parts"


34

In [29]:
sector_and = "Motor Vehicles and Parts"
f500[f500["sector"] == sector_and].shape[0]

0

In [44]:
# assign values to column only at specific values
f500.loc[f500["sector"] == "Motor Vehicles & Parts","sector"] = "Motor Vehicles and Parts"
f500

company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798
State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456
Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4,China,"Beijing, China",http://www.sinopec.com,19,713288,106523
China National Petroleum,4,262573,-12.3,1867.5,585619,-73.7,Zhang Jianhua,Petroleum Refining,Energy,3,China,"Beijing, China",http://www.cnpc.com.cn,17,1512048,301893
Toyota Motor,5,254694,7.7,16899.3,437575,-12.3,Akio Toyoda,Motor Vehicles and Parts,Motor Vehicles and Parts,8,Japan,"Toyota, Japan",http://www.toyota-global.com,23,364445,157210
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Teva Pharmaceutical Industries,496,21903,11.5,329.0,92890,-79.3,Yitzhak Peterburg,Pharmaceuticals,Health Care,0,Israel,"Petach Tikva, Israel",http://www.tevapharm.com,1,56960,33337
New China Life Insurance,497,21796,-13.3,743.9,100609,-45.6,Wan Feng,"Insurance: Life, Health (stock)",Financials,427,China,"Beijing, China",http://www.newchinalife.com,2,54378,8507
Wm. Morrison Supermarkets,498,21741,-11.3,406.4,11630,20.4,David T. Potts,Food and Drug Stores,Food & Drug Stores,437,Britain,"Bradford, Britain",http://www.morrisons.com,13,77210,5111
TUI,499,21655,-5.5,1151.7,16247,195.5,Friedrich Joussen,Travel Services,Business Services,467,Germany,"Hanover, Germany",http://www.tuigroup.com,23,66779,3006
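The masked assignment above can be reproduced on a small frame. This sketch uses four hypothetical rows; the pattern is `df.loc[mask, column] = value`, which writes only into the cells where the mask is `True`.

```python
import pandas as pd

df = pd.DataFrame(
    {"sector": ["Retailing", "Motor Vehicles & Parts", "Energy", "Motor Vehicles & Parts"]},
    index=["Walmart", "Toyota Motor", "State Grid", "Volkswagen"],
)

# Boolean mask: True for the rows to update.
mask = df["sector"] == "Motor Vehicles & Parts"

# .loc[mask, column] selects those cells; assigning a scalar fills all of them.
df.loc[mask, "sector"] = "Motor Vehicles and Parts"

print(int(mask.sum()))
print(df["sector"].tolist())
```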


## Creating New Columns

In [45]:
f500["rank_change"] = f500["previous_rank"] - f500["rank"]



In [42]:
f500

company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity,rank_change
Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798,0
State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456,0
Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4,China,"Beijing, China",http://www.sinopec.com,19,713288,106523,1
China National Petroleum,4,262573,-12.3,1867.5,585619,-73.7,Zhang Jianhua,Petroleum Refining,Energy,3,China,"Beijing, China",http://www.cnpc.com.cn,17,1512048,301893,-1
Toyota Motor,5,254694,7.7,16899.3,437575,-12.3,Akio Toyoda,Motor Vehicles and Parts,Motor Vehicles and Parts,8,Japan,"Toyota, Japan",http://www.toyota-global.com,23,364445,157210,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Teva Pharmaceutical Industries,496,21903,11.5,329.0,92890,-79.3,Yitzhak Peterburg,Pharmaceuticals,Health Care,0,Israel,"Petach Tikva, Israel",http://www.tevapharm.com,1,56960,33337,-496
New China Life Insurance,497,21796,-13.3,743.9,100609,-45.6,Wan Feng,"Insurance: Life, Health (stock)",Financials,427,China,"Beijing, China",http://www.newchinalife.com,2,54378,8507,-70
Wm. Morrison Supermarkets,498,21741,-11.3,406.4,11630,20.4,David T. Potts,Food and Drug Stores,Food & Drug Stores,437,Britain,"Bradford, Britain",http://www.morrisons.com,13,77210,5111,-61
TUI,499,21655,-5.5,1151.7,16247,195.5,Friedrich Joussen,Travel Services,Business Services,467,Germany,"Hanover, Germany",http://www.tuigroup.com,23,66779,3006,-32


## Exercise: Top Performers by Country

In [55]:
# Find the 5 most common industries in the USA
f500.loc[f500["country"] == "USA", "industry"].value_counts().head(5)

Banks: Commercial and Savings               8
Insurance: Property and Casualty (Stock)    7
Aerospace and Defense                       6
Petroleum Refining                          6
Specialty Retailers                         6
Name: industry, dtype: int64

In [56]:
# Find the 5 most common sectors in China
f500.loc[f500["country"] == "China", "sector"].value_counts().head(5)

Financials                    25
Energy                        22
Wholesalers                    9
Engineering & Construction     8
Technology                     8
Name: sector, dtype: int64

Operators for combining conditions (filters):
- `&` ... AND
- `|` ... OR
- `~` ... NOT
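The three operators can be sketched on a small hypothetical frame. Each comparison is wrapped in parentheses or stored in a variable first, because `&` and `|` bind more tightly than `==` in Python.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "country": ["USA", "USA", "China", "Japan"],
        "sector": ["Retailing", "Energy", "Energy", "Retailing"],
    },
    index=["A", "B", "C", "D"],
)

is_usa = df["country"] == "USA"
is_retail = df["sector"] == "Retailing"

both = df[is_usa & is_retail]    # AND: both conditions hold
either = df[is_usa | is_retail]  # OR: at least one condition holds
not_retail = df[~is_retail]      # NOT: invert the mask

print(both.index.tolist())
print(either.index.tolist())
print(not_retail.index.tolist())
```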

In [58]:
# combining conditions
f500.loc[(f500["country"] == "USA") &
         (f500["sector"] == "Retailing")]  # inside brackets, a line can be broken without a backslash


company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity,rank_change
Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798,0
Costco,36,118719,2.2,2350.0,33163,-1.1,W. Craig Jelinek,General Merchandisers,Retailing,38,USA,"Issaquah, WA",http://www.costco.com,23,172000,12079,2
Home Depot,59,94595,6.9,7957.0,42966,13.5,Craig A. Menear,Specialty Retailers,Retailing,69,USA,"Atlanta, GA",http://www.homedepot.com,23,406000,4333,10
Target,107,69495,-5.8,2737.0,37431,-18.6,Brian C. Cornell,General Merchandisers,Retailing,97,USA,"Minneapolis, MN",http://www.target.com,23,323000,10953,-10
Lowe’s,122,65017,10.1,3093.0,34408,21.5,Robert A. Niblock,Specialty Retailers,Retailing,148,USA,"Mooresville, NC",http://www.lowes.com,20,240000,6434,26
Best Buy,258,39403,-0.9,1228.0,13856,36.9,Hubert B. Joly,Specialty Retailers,Retailing,244,USA,"Richfield, MN",http://www.bestbuy.com,19,125000,4709,-14
TJX,321,33184,7.2,2298.2,12884,0.9,Ernie L. Herrman,Specialty Retailers,Retailing,338,USA,"Framingham, MA",http://www.tjx.com,16,235000,4511,17
Macy’s,425,25778,-4.8,619.0,19851,-42.3,Jeffrey Gennette,General Merchandisers,Retailing,389,USA,"Cincinnati, OH",http://www.macysinc.com,23,148300,4323,-36
Sears Holdings,489,22138,-12.0,-2221.0,9362,,Edward S. Lampert,General Merchandisers,Retailing,425,USA,"Hoffman Estates, IL",http://www.searsholdings.com,23,140000,-3824,-64
Dollar General,492,21987,7.9,1251.1,11672,7.4,Todd J. Vasos,Specialty Retailers,Retailing,0,USA,"Goodlettsville, TN",http://www.dollargeneral.com,1,121000,5406,-492


In [64]:
usa_or_china = (f500["country"] == "USA") | (f500["country"] == "China")
not_retailing = ~(f500["sector"] == "Retailing")
f500.loc[usa_or_china & not_retailing, "industry"].value_counts().head(5)

Banks: Commercial and Savings    18
Mining, Crude-Oil Production     13
Aerospace and Defense            12
Motor Vehicles and Parts          9
Pharmaceuticals                   8
Name: industry, dtype: int64

## Reading CSV files with pandas

<div>
<p><img alt="csv_to_dataframe" src="images/csv_to_dataframe.svg"></p>


</div>

An example of loading the file without specifying an index column (for cases where no column makes a sensible index):

In [71]:
f500 = pd.read_csv("data/f500.csv")
f500.head()
f500.columns
f500.index

RangeIndex(start=0, stop=500, step=1)

- `iloc`: integer-position-based indexing
- `loc`: label-based indexing

## Using iloc to select by integer position

In [83]:
cols = ['company', 'rank', 'revenues']
minif500 = f500[cols].head()
print(minif500.iloc[4])
print("------")
print(minif500.iloc[3,2])
print("------")
print(minif500.iloc[2,:])


company     Toyota Motor
rank                   5
revenues          254694
Name: 4, dtype: object
------
262573
------
company     Sinopec Group
rank                    3
revenues           267518
Name: 2, dtype: object


<p><img alt="selection using iloc" src="images/selection_iloc.svg"></p>

    df.iloc[row_index, column_index]
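A minimal sketch of the `df.iloc[row_index, column_index]` pattern on a hypothetical three-row frame with the default integer index:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "company": ["Walmart", "State Grid", "Sinopec Group"],
        "rank": [1, 2, 3],
        "revenues": [485873, 315199, 267518],
    }
)

cell = df.iloc[1, 0]         # second row, first column -> scalar
row = df.iloc[2]             # third row -> Series
block = df.iloc[[0, 2], :2]  # rows 0 and 2, first two columns

print(cell)
print(row["company"])
print(block.shape)
```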

<div class="alert alert-block alert-info">
<b>Exercise:</b> Select just the fifth row of the f500 dataframe. Assign the result to fifth_row.</div>

In [93]:
fifth_row = f500.iloc[4]  # the row at position 4 (the fifth row), all columns
print(fifth_row)

company                                     Toyota Motor
rank                                                   5
revenues                                          254694
revenue_change                                       7.7
profits                                          16899.3
assets                                            437575
profit_change                                      -12.3
ceo                                          Akio Toyoda
industry                        Motor Vehicles and Parts
sector                            Motor Vehicles & Parts
previous_rank                                          8
country                                            Japan
hq_location                                Toyota, Japan
website                     http://www.toyota-global.com
years_on_global_500_list                              23
employees                                         364445
total_stockholder_equity                          157210
Name: 4, dtype: object


<div class="alert alert-block alert-info">
<b>Exercise:</b> Select the value in the first row of the company column. Assign the result to company_value.</div>

In [97]:
company_value = f500.iloc[0, 0]
company_value
# f500.iloc[0]["company"]  # chained indexing: again, be careful about whether this works on a copy

'Walmart'

<div>

<table>
<thead>
<tr>
<th>Select by integer position</th>
<th>Explicit Syntax</th>
<th>Shorthand Convention</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single column from dataframe</td>
<td><code>df.iloc[:,3]</code></td>
<td></td>
</tr>
<tr>
<td>List of columns from dataframe</td>
<td><code>df.iloc[:,[3,5,6]]</code></td>
<td></td>
</tr>
<tr>
<td>Slice of columns from dataframe</td>
<td><code>df.iloc[:,3:7]</code></td>
<td></td>
</tr>
<tr>
<td>Single row from dataframe</td>
<td><code>df.iloc[20]</code></td>
<td></td>
</tr>
<tr>
<td>List of rows from dataframe</td>
<td><code>df.iloc[[0,3,8]]</code></td>
<td></td>
</tr>
<tr>
<td>Slice of rows from dataframe</td>
<td><code>df.iloc[3:5]</code></td>
<td><code>df[3:5]</code></td>
</tr>
<tr>
<td>Single item from series</td>
<td><code>s.iloc[8]</code></td>
<td><code>s[8]</code></td>
</tr>
<tr>
<td>List of items from series</td>
<td><code>s.iloc[[2,8,1]]</code></td>
<td><code>s[[2,8,1]]</code></td>
</tr>
<tr>
<td>Slice of items from series</td>
<td><code>s.iloc[5:10]</code></td>
<td><code>s[5:10]</code></td>
</tr>
</tbody>
</table>
</div>

<div class="alert alert-block alert-info">
<b>Exercise:</b> Select the first three rows of the f500 dataframe. Assign the result to first_three_rows.</div>

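### Caution: the difference between loc and iloc
`f500.iloc[0:3]` selects by integer position and excludes the end of the slice, so it returns the rows at positions 0, 1 and 2.  

`f500.loc[0:3]` treats the bounds as labels (here the labels are the index values) and includes the end label, so it returns the rows labelled 0 through 3.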

<p><img alt="loc vs iloc for rows in different order" src="images/integer_labels_2.svg"></p>
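The difference in slice endpoints can be checked on a small frame. This is a minimal sketch; `demo` is a hypothetical toy dataframe, not part of the f500 data.

```python
import pandas as pd

# Hypothetical toy frame; the default index labels are 0..4.
demo = pd.DataFrame({"company": ["A", "B", "C", "D", "E"]})

by_position = demo.iloc[0:3]  # positions 0, 1, 2 (end excluded)
by_label = demo.loc[0:3]      # labels 0, 1, 2, 3 (end included)

print(len(by_position), len(by_label))
```

With the default index the labels happen to equal the positions, which is why the two calls look so similar and differ only in the last row.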

In [101]:
first_three_rows = f500.iloc[:3]
first_three_rows

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
0,Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798
1,State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456
2,Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4,China,"Beijing, China",http://www.sinopec.com,19,713288,106523


<div class="alert alert-block alert-info">
<b>Exercise:</b> Select the first and seventh rows and the first five columns of the f500 dataframe. Assign the result to first_seventh_row_slice.</div>

In [107]:
first_seventh_row_slice = f500.iloc[[0,6],:5]
first_seventh_row_slice

Unnamed: 0,company,rank,revenues,revenue_change,profits
0,Walmart,1,485873,0.8,13643.0
6,Royal Dutch Shell,7,240033,-11.8,4575.0


## Using pandas methods to create boolean masks

In [114]:
f500["revenue_change"].isnull().value_counts()

False    498
True       2
Name: revenue_change, dtype: int64

In [116]:
f500[f500["revenue_change"].isnull()]

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
90,Uniper,91,74407,,-3557.5,51541,,Klaus Schafer,Energy,Energy,0,Germany,"Dusseldorf, Germany",http://www.uniper.energy,1,12890,12889
180,Hewlett Packard Enterprise,181,50123,,3161.0,79679,,Margaret C. Whitman,Information Technology Services,Technology,0,USA,"Palo Alto, CA",http://www.hpe.com,1,195000,31448


<div class="alert alert-block alert-info">
<b>Exercise:</b> Use the Series.isnull() method to select all rows from f500 that have a null value for the previous_rank column. Select only the company, rank, and previous_rank columns. Assign the result to null_previous_rank.</div>

In [118]:
import numpy as np
# prepared in advance
f500 = pd.read_csv("data/f500.csv")

# wherever previous_rank is 0, set it to NaN (missing value);
# this makes the data easier to work with later
f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan
f500


Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
0,Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1.0,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798
1,State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2.0,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456
2,Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4.0,China,"Beijing, China",http://www.sinopec.com,19,713288,106523
3,China National Petroleum,4,262573,-12.3,1867.5,585619,-73.7,Zhang Jianhua,Petroleum Refining,Energy,3.0,China,"Beijing, China",http://www.cnpc.com.cn,17,1512048,301893
4,Toyota Motor,5,254694,7.7,16899.3,437575,-12.3,Akio Toyoda,Motor Vehicles and Parts,Motor Vehicles & Parts,8.0,Japan,"Toyota, Japan",http://www.toyota-global.com,23,364445,157210
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,Teva Pharmaceutical Industries,496,21903,11.5,329.0,92890,-79.3,Yitzhak Peterburg,Pharmaceuticals,Health Care,,Israel,"Petach Tikva, Israel",http://www.tevapharm.com,1,56960,33337
496,New China Life Insurance,497,21796,-13.3,743.9,100609,-45.6,Wan Feng,"Insurance: Life, Health (stock)",Financials,427.0,China,"Beijing, China",http://www.newchinalife.com,2,54378,8507
497,Wm. Morrison Supermarkets,498,21741,-11.3,406.4,11630,20.4,David T. Potts,Food and Drug Stores,Food & Drug Stores,437.0,Britain,"Bradford, Britain",http://www.morrisons.com,13,77210,5111
498,TUI,499,21655,-5.5,1151.7,16247,195.5,Friedrich Joussen,Travel Services,Business Services,467.0,Germany,"Hanover, Germany",http://www.tuigroup.com,23,66779,3006


In [126]:
# the isnull method finds NaN (null) values
null_previous_rank = f500.loc[f500["previous_rank"].isnull(), ["company", "rank", "previous_rank"]]
null_previous_rank

Unnamed: 0,company,rank,previous_rank
48,Legal & General Group,49,
90,Uniper,91,
123,Dell Technologies,124,
138,Anbang Insurance Group,139,
140,Albertsons Cos.,141,
180,Hewlett Packard Enterprise,181,
267,Hengli Group,268,
271,Johnson Controls International,272,
341,Chubb,342,
375,Charter Communications,376,


## Working with Integer Labels

<p><img alt="loc vs iloc for rows in different order" src="images/integer_labels_2.svg"></p>

<div class="alert alert-block alert-info">
<b>Exercise:</b> Assign the first five rows of the null_previous_rank dataframe to the variable top5_null_prev_rank by choosing the correct method out of either loc[] or iloc[].</div>
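A possible way to reason about this exercise, sketched on a hypothetical stand-in frame (`null_demo` and its labels are assumptions chosen to resemble a filtered dataframe). After filtering, the index labels are no longer 0, 1, 2, ..., so only positional selection reliably returns "the first five rows".

```python
import pandas as pd

# Hypothetical stand-in for a filtered dataframe: labels are not consecutive.
null_demo = pd.DataFrame(
    {"company": ["a", "b", "c", "d", "e", "f"]},
    index=[48, 90, 123, 138, 140, 180],
)

top5 = null_demo.iloc[:5]      # first five rows by position
by_label = null_demo.loc[:5]   # label slice: no label up to 5 exists here

print(list(top5.index))
print(len(by_label))
```

The same pattern applied to the real data would be `null_previous_rank.iloc[:5]`.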

## Pandas Index Alignment
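pandas matches values on index labels: when a series is assigned to a dataframe or combined with another series, values are aligned by label rather than by position, and labels without a match produce NaN.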

In [131]:
previously_ranked = f500.loc[f500["previous_rank"].notnull()]
previously_ranked

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
0,Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1.0,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798
1,State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2.0,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456
2,Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4.0,China,"Beijing, China",http://www.sinopec.com,19,713288,106523
3,China National Petroleum,4,262573,-12.3,1867.5,585619,-73.7,Zhang Jianhua,Petroleum Refining,Energy,3.0,China,"Beijing, China",http://www.cnpc.com.cn,17,1512048,301893
4,Toyota Motor,5,254694,7.7,16899.3,437575,-12.3,Akio Toyoda,Motor Vehicles and Parts,Motor Vehicles & Parts,8.0,Japan,"Toyota, Japan",http://www.toyota-global.com,23,364445,157210
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
490,National Grid,491,22036,-3.2,10150.6,82310,160.2,John Pettigrew,Utilities,Energy,471.0,Britain,"London, Britain",http://www.nationalgrid.com,12,22132,25463
492,Telecom Italia,493,21941,-17.4,1999.4,74295,,Flavio Cattaneo,Telecommunications,Telecommunications,404.0,Italy,"Milan, Italy",http://www.telecomitalia.com,18,61227,22366
496,New China Life Insurance,497,21796,-13.3,743.9,100609,-45.6,Wan Feng,"Insurance: Life, Health (stock)",Financials,427.0,China,"Beijing, China",http://www.newchinalife.com,2,54378,8507
497,Wm. Morrison Supermarkets,498,21741,-11.3,406.4,11630,20.4,David T. Potts,Food and Drug Stores,Food & Drug Stores,437.0,Britain,"Bradford, Britain",http://www.morrisons.com,13,77210,5111


In [132]:
previously_ranked.loc[498:500] # the label 498 survives the filtering, so the label-based slice still finds it

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
498,TUI,499,21655,-5.5,1151.7,16247,195.5,Friedrich Joussen,Travel Services,Business Services,467.0,Germany,"Hanover, Germany",http://www.tuigroup.com,23,66779,3006


In [133]:
previously_ranked.iloc[498:500] # fewer than 498 rows remain, so these integer positions no longer exist and the result is empty

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity


In [142]:
food = pd.DataFrame({'fruit_veg': ['fruit', 'veg', 'fruit', 'veg', 'veg'], 'qty': [4, 2, 4, 1, 2]}, 
                    index=['tomato', 'carrot', 'lime', 'corn', 'eggplant'])
food



Unnamed: 0,fruit_veg,qty
tomato,fruit,4
carrot,veg,2
lime,fruit,4
corn,veg,1
eggplant,veg,2


In [143]:
# values are matched on the index labels
alt_name = pd.Series(['rocket', 'aubergine', 'maize'], index=['arugula', 'eggplant', 'corn'])

In [145]:
food["alt_name"] = alt_name
food

Unnamed: 0,fruit_veg,qty,alt_name
tomato,fruit,4,
carrot,veg,2,
lime,fruit,4,
corn,veg,1,maize
eggplant,veg,2,aubergine


<div class="alert alert-block alert-info">
<b>Exercise:</b> Use the Series.notnull() method to select all rows from f500 that have a non-null value for the previous_rank column. Assign the result to previously_ranked. From the previously_ranked dataframe, subtract the rank column from the previous_rank column. Assign the result to rank_change. Assign the values in rank_change to a new column in the f500 dataframe, "rank_change".</div>

In [147]:
previously_ranked = f500.loc[f500["previous_rank"].notnull()]

In [151]:
rank_change = previously_ranked["previous_rank"] - previously_ranked["rank"]
rank_change

0       0.0
1       0.0
2       1.0
3      -1.0
4       3.0
       ... 
490   -20.0
492   -89.0
496   -70.0
497   -61.0
498   -32.0
Length: 467, dtype: float64

In [150]:
f500["rank_change"] = rank_change
f500.tail()

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity,rank_change
495,Teva Pharmaceutical Industries,496,21903,11.5,329.0,92890,-79.3,Yitzhak Peterburg,Pharmaceuticals,Health Care,,Israel,"Petach Tikva, Israel",http://www.tevapharm.com,1,56960,33337,
496,New China Life Insurance,497,21796,-13.3,743.9,100609,-45.6,Wan Feng,"Insurance: Life, Health (stock)",Financials,427.0,China,"Beijing, China",http://www.newchinalife.com,2,54378,8507,-70.0
497,Wm. Morrison Supermarkets,498,21741,-11.3,406.4,11630,20.4,David T. Potts,Food and Drug Stores,Food & Drug Stores,437.0,Britain,"Bradford, Britain",http://www.morrisons.com,13,77210,5111,-61.0
498,TUI,499,21655,-5.5,1151.7,16247,195.5,Friedrich Joussen,Travel Services,Business Services,467.0,Germany,"Hanover, Germany",http://www.tuigroup.com,23,66779,3006,-32.0
499,AutoNation,500,21609,3.6,430.5,10060,-2.7,Michael J. Jackson,Specialty Retailers,Retailing,,USA,"Fort Lauderdale, FL",http://www.autonation.com,12,26000,2310,


## Boolean Operators

<div>
<table>
<thead>
<tr>
<th>pandas</th>
<th>Python equivalent</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>a &amp; b</code></td>
<td><code>a and b</code></td>
<td><code>True</code> if both <code>a</code> and <code>b</code> are <code>True</code>, else <code>False</code></td>
</tr>
<tr>
<td><code>a | b</code></td>
<td><code>a or b</code></td>
<td><code>True</code> if either <code>a</code> or <code>b</code> is <code>True</code></td>
</tr>
<tr>
<td><code>~a</code></td>
<td><code>not a</code></td>
<td><code>True</code> if <code>a</code> is <code>False</code>, else <code>False</code></td>
</tr>
</tbody>
</table>

<p><img alt="boolean operators example 1" src="images/bool_ops_1.svg"></p>

<p><img alt="boolean operators example 2" src="images/bool_ops_2.svg"></p>

<p><img alt="boolean operators example 3" src="images/bool_ops_3.svg"></p>

<p><img alt="boolean operators example 4" src="images/bool_ops_4.svg"></p>

</div>
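The operators in the table can be tried on a small frame. This is a minimal sketch with hypothetical values. Each comparison is stored in its own variable here; when comparisons are written inline, each one needs its own parentheses, because `&` and `|` bind more tightly than `>` or `<`.

```python
import pandas as pd

# Hypothetical toy data, in the same units as f500 (millions of USD).
demo = pd.DataFrame({
    "revenues": [150_000, 90_000, 120_000],
    "profits": [-50.0, 10.0, 300.0],
})

big = demo["revenues"] > 100_000   # boolean series
loss = demo["profits"] < 0         # boolean series

both = demo[big & loss]     # rows where both masks are True
either = demo[big | loss]   # rows where at least one mask is True
not_big = demo[~big]        # rows where the mask is False

print(list(both.index), list(either.index), list(not_big.index))
```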

<div class="alert alert-block alert-info">
<b>Exercise:</b> Select all companies with revenues over 100 billion and negative profits from the f500 dataframe. The result should include all columns.</div>

In [163]:
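# The exercise asks for negative profits, which would be (f500["profits"] < 0).
# This cell filters on profits > 0 instead, so the output below lists profitable companies.
f500[(f500["revenues"] > 100_000) & (f500["profits"] > 0)]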

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity,rank_change
0,Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1.0,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798,0.0
1,State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2.0,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456,0.0
2,Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4.0,China,"Beijing, China",http://www.sinopec.com,19,713288,106523,1.0
3,China National Petroleum,4,262573,-12.3,1867.5,585619,-73.7,Zhang Jianhua,Petroleum Refining,Energy,3.0,China,"Beijing, China",http://www.cnpc.com.cn,17,1512048,301893,-1.0
4,Toyota Motor,5,254694,7.7,16899.3,437575,-12.3,Akio Toyoda,Motor Vehicles and Parts,Motor Vehicles & Parts,8.0,Japan,"Toyota, Japan",http://www.toyota-global.com,23,364445,157210,3.0
5,Volkswagen,6,240264,1.5,5937.3,432116,,Matthias Muller,Motor Vehicles and Parts,Motor Vehicles & Parts,7.0,Germany,"Wolfsburg, Germany",http://www.volkswagen.com,23,626715,97753,1.0
6,Royal Dutch Shell,7,240033,-11.8,4575.0,411275,135.9,Ben van Beurden,Petroleum Refining,Energy,5.0,Netherlands,"The Hague, Netherlands",http://www.shell.com,23,89000,186646,-2.0
7,Berkshire Hathaway,8,223604,6.1,24074.0,620854,,Warren E. Buffett,Insurance: Property and Casualty (Stock),Financials,11.0,USA,"Omaha, NE",http://www.berkshirehathaway.com,21,367700,283001,3.0
8,Apple,9,215639,-7.7,45687.0,321686,-14.4,Timothy D. Cook,"Computers, Office Equipment",Technology,9.0,USA,"Cupertino, CA",http://www.apple.com,15,116000,128249,0.0
9,Exxon Mobil,10,205004,-16.7,7840.0,330314,-51.5,Darren W. Woods,Petroleum Refining,Energy,6.0,USA,"Irving, TX",http://www.exxonmobil.com,23,72700,167325,-4.0


<div class="alert alert-block alert-info">
<b>Exercise:</b> Select all rows for companies headquartered in either Brazil or Venezuela. Assign the result to brazil_venezuela.</div>

In [178]:
brazil_venezuela = f500[f500["country"].isin(["Brazil", "Venezuela"])]
brazil_venezuela

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity,rank_change
74,Petrobras,75,81405,-16.3,-4838.0,246983,,Pedro Pullen Parente,Petroleum Refining,Energy,58.0,Brazil,"Rio de Janeiro, Brazil",http://www.petrobras.com.br,23,68829,76779,-17.0
112,Itau Unibanco Holding,113,66876,21.4,6666.4,415972,-13.7,Candido Botelho Bracher,Banks: Commercial and Savings,Financials,159.0,Brazil,"Sao Paulo, Brazil",http://www.itau.com.br,4,94779,37680,46.0
150,Banco do Brasil,151,58093,-13.4,2013.8,426416,-52.3,Paulo Rogerio Caffarelli,Banks: Commercial and Savings,Financials,115.0,Brazil,"Brasilia, Brazil",http://www.bb.com.br,23,100622,26551,-36.0
153,Banco Bradesco,154,57443,31.3,5127.9,366418,-5.7,Luiz Carlos Trabuco Cappi,Banks: Commercial and Savings,Financials,209.0,Brazil,"Osasco, Brazil",http://www.bradesco.com.br,21,94541,32369,55.0
190,JBS,191,48825,-0.1,107.7,31605,-92.3,Wesley Mendonca Batista,Food Production,"Food, Beverages & Tobacco",185.0,Brazil,"Sao Paulo, Brazil",http://jbss.infoinvest.com.br,8,237061,7307,-6.0


<div class="alert alert-block alert-info">
<b>Exercise:</b> Select the first five companies in the Technology sector that are not headquartered in the USA from the f500 dataframe. Assign the result to tech_outside_usa.</div>

In [180]:
tech_outside_usa = f500[(f500["sector"] == "Technology") & (f500["country"] != "USA")].head(5)
tech_outside_usa

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity,rank_change
14,Samsung Electronics,15,173957,-2.0,19316.5,217104,16.8,Oh-Hyun Kwon,"Electronics, Electrical Equip.",Technology,13.0,South Korea,"Suwon, South Korea",http://www.samsung.com,23,325000,154376,-2.0
26,Hon Hai Precision Industry,27,135129,-4.3,4608.8,80436,-0.4,Terry Gou,"Electronics, Electrical Equip.",Technology,25.0,Taiwan,"New Taipei City, Taiwan",http://www.foxconn.com,13,726772,33476,-2.0
70,Hitachi,71,84558,1.2,2134.3,86742,48.8,Toshiaki Higashihara,"Electronics, Electrical Equip.",Technology,79.0,Japan,"Tokyo, Japan",http://www.hitachi.com,23,303887,26632,8.0
82,Huawei Investment & Holding,83,78511,24.9,5579.4,63837,-5.0,Ren Zhengfei,Network and Other Communications Equipment,Technology,129.0,China,"Shenzhen, China",http://www.huawei.com,8,180000,20159,46.0
104,Sony,105,70170,3.9,676.4,158519,-45.1,Kazuo Hirai,"Electronics, Electrical Equip.",Technology,113.0,Japan,"Tokyo, Japan",http://www.sony.net,23,128400,22415,8.0


## Sorting Values

In [186]:
china_rows = f500[f500["country"] == "China"]
china_rows.sort_values("employees", ascending=False).head(5)

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity,rank_change
3,China National Petroleum,4,262573,-12.3,1867.5,585619,-73.7,Zhang Jianhua,Petroleum Refining,Energy,3.0,China,"Beijing, China",http://www.cnpc.com.cn,17,1512048,301893,-1.0
118,China Post Group,119,65605,-5.8,4980.3,1221649,18.7,Li Guohua,"Mail, Package, and Freight Delivery",Transportation,105.0,China,"Beijing, China",http://www.chinapost.com.cn,7,941211,43114,-14.0
1,State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2.0,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456,0.0
2,Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4.0,China,"Beijing, China",http://www.sinopec.com,19,713288,106523,1.0
37,Agricultural Bank of China,38,117275,-12.1,27687.8,2816039,-3.6,Zhao Huan,Banks: Commercial and Savings,Financials,29.0,China,"Beijing, China",http://www.abchina.com,18,501368,189682,-9.0


<div class="alert alert-block alert-info">
<b>Exercise:</b> Find the company headquartered in Japan with the largest number of employees.</div>

### After sorting, iloc[0] always returns the top row: it is positional, like indexing a Python list, and does not depend on the index labels

In [206]:
f500[f500["country"] == "Japan"].sort_values("employees", ascending=False).iloc[0]["company"] # after sorting, iloc[0] is the top row regardless of its index label

'Toyota Motor'

## Using Loops with pandas

In [210]:
avg_rev_by_country = {}

countries = f500["country"].unique() # unique values
countries

array(['USA', 'China', 'Japan', 'Germany', 'Netherlands', 'Britain',
       'South Korea', 'Switzerland', 'France', 'Taiwan', 'Singapore',
       'Italy', 'Russia', 'Spain', 'Brazil', 'Mexico', 'Luxembourg',
       'India', 'Malaysia', 'Thailand', 'Australia', 'Belgium', 'Norway',
       'Canada', 'Ireland', 'Indonesia', 'Denmark', 'Saudi Arabia',
       'Sweden', 'Finland', 'Venezuela', 'Turkey', 'U.A.E', 'Israel'],
      dtype=object)

In [211]:
for c in countries:
    selected_rows = f500[f500["country"] == c]
    mean = selected_rows["revenues"].mean()
    avg_rev_by_country[c] = mean

In [212]:
avg_rev_by_country

{'USA': 64218.371212121216,
 'China': 55397.880733944956,
 'Japan': 53164.03921568627,
 'Germany': 63915.0,
 'Netherlands': 61708.92857142857,
 'Britain': 51588.708333333336,
 'South Korea': 49725.6,
 'Switzerland': 51353.57142857143,
 'France': 55231.793103448275,
 'Taiwan': 46364.666666666664,
 'Singapore': 54454.333333333336,
 'Italy': 51899.57142857143,
 'Russia': 65247.75,
 'Spain': 40600.666666666664,
 'Brazil': 52024.57142857143,
 'Mexico': 54987.5,
 'Luxembourg': 56791.0,
 'India': 39993.0,
 'Malaysia': 49479.0,
 'Thailand': 48719.0,
 'Australia': 33688.71428571428,
 'Belgium': 45905.0,
 'Norway': 45873.0,
 'Canada': 31848.0,
 'Ireland': 32819.5,
 'Indonesia': 36487.0,
 'Denmark': 35464.0,
 'Saudi Arabia': 35421.0,
 'Sweden': 27963.666666666668,
 'Finland': 26113.0,
 'Venezuela': 24403.0,
 'Turkey': 23456.0,
 'U.A.E': 22799.0,
 'Israel': 21903.0}
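For comparison, the same per-group mean can also be computed with `groupby`. This is a sketch on hypothetical toy data and not the approach used in this section. It only illustrates that the loop and `DataFrame.groupby(...).mean()` give the same numbers.

```python
import pandas as pd

# Hypothetical toy data with the same column names as f500.
demo = pd.DataFrame({
    "country": ["USA", "China", "USA", "Japan", "China"],
    "revenues": [100, 200, 300, 400, 600],
})

# Loop version, mirroring the cells above.
avg_loop = {}
for c in demo["country"].unique():
    avg_loop[c] = demo.loc[demo["country"] == c, "revenues"].mean()

# groupby version: one mean per distinct country, converted to a dict.
avg_groupby = demo.groupby("country")["revenues"].mean().to_dict()

print(avg_loop)
print(avg_groupby)
```

The loop makes every step explicit, which is useful while learning boolean indexing, whereas `groupby` expresses the same aggregation in one line.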

<div class="alert alert-block alert-info">
<b>Exercise:</b> Find the company that employs the most people in each country.</div>

In [213]:
top_employer_by_country = {}

countries = f500["country"].unique()
for c in countries:
    selected_rows = f500[f500["country"] == c]
    sorted_rows = selected_rows.sort_values("employees", ascending=False)
    top_employer = sorted_rows.iloc[0]
    employer_name = top_employer["company"]
    top_employer_by_country[c] = employer_name

In [214]:
top_employer_by_country

{'USA': 'Walmart',
 'China': 'China National Petroleum',
 'Japan': 'Toyota Motor',
 'Germany': 'Volkswagen',
 'Netherlands': 'EXOR Group',
 'Britain': 'Compass Group',
 'South Korea': 'Samsung Electronics',
 'Switzerland': 'Nestle',
 'France': 'Sodexo',
 'Taiwan': 'Hon Hai Precision Industry',
 'Singapore': 'Flex',
 'Italy': 'Poste Italiane',
 'Russia': 'Gazprom',
 'Spain': 'Banco Santander',
 'Brazil': 'JBS',
 'Mexico': 'America Movil',
 'Luxembourg': 'ArcelorMittal',
 'India': 'State Bank of India',
 'Malaysia': 'Petronas',
 'Thailand': 'PTT',
 'Australia': 'Wesfarmers',
 'Belgium': 'Anheuser-Busch InBev',
 'Norway': 'Statoil',
 'Canada': 'George Weston',
 'Ireland': 'Accenture',
 'Indonesia': 'Pertamina',
 'Denmark': 'Maersk Group',
 'Saudi Arabia': 'SABIC',
 'Sweden': 'H & M Hennes & Mauritz',
 'Finland': 'Nokia',
 'Venezuela': 'Mercantil Servicios Financieros',
 'Turkey': 'Koc Holding',
 'U.A.E': 'Emirates Group',
 'Israel': 'Teva Pharmaceutical Industries'}

## Example: Calculating Return on Assets by Sector
<b> SettingWithCopyWarning in Pandas </b>

    return on assets = profits / assets

<div class="alert alert-block alert-info">
<b>Exercise:</b> Create a new column roa in the f500 dataframe, containing the return on assets metric for each company.</div>

<div class="alert alert-block alert-info">
<b>Exercise:</b> Aggregate the data by the sector column, and create a dictionary top_roa_by_sector.</div>
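One possible approach to both exercises, sketched on hypothetical toy data. It assumes that `top_roa_by_sector` should map each sector to the name of the company with the highest ROA, and it reuses the loop pattern from the previous section.

```python
import pandas as pd

# Hypothetical toy data with the same column names as f500.
demo = pd.DataFrame({
    "company": ["A", "B", "C", "D"],
    "sector": ["Energy", "Energy", "Technology", "Technology"],
    "profits": [10.0, 30.0, 5.0, 40.0],
    "assets": [100.0, 200.0, 50.0, 800.0],
})

# Return on assets = profits / assets, computed row by row.
demo["roa"] = demo["profits"] / demo["assets"]

# Assumption: store the company with the highest ROA per sector.
top_roa_by_sector = {}
for sector in demo["sector"].unique():
    sector_rows = demo[demo["sector"] == sector]
    top_row = sector_rows.sort_values("roa", ascending=False).iloc[0]
    top_roa_by_sector[sector] = top_row["company"]

print(top_roa_by_sector)
```

On the real data the same steps would start from `f500["roa"] = f500["profits"] / f500["assets"]`.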

## Understanding SettingWithCopyWarning in pandas


### What is SettingWithCopyWarning?



<img class="full-width" src="https://www.dataquest.io/wp-content/uploads/2019/01/view-vs-copy.png" alt="view-vs-copy">



<img class="full-width" src="https://www.dataquest.io/wp-content/uploads/2019/01/modifying.png" alt="modifying">


### Chained assignment




In [219]:
f500[f500["sector"] == "Energy"]["sector"] = "Oil"
# the first indexing step may create a copy, and pandas cannot tell whether the assignment lands on a copy or on a view of f500, so it shows a warning

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  f500[f500["sector"] == "Energy"]["sector"] = "Oil"


In [220]:
# Now we can see that the data in the original df has not changed
f500.head()

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity,rank_change
0,Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1.0,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798,0.0
1,State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2.0,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456,0.0
2,Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4.0,China,"Beijing, China",http://www.sinopec.com,19,713288,106523,1.0
3,China National Petroleum,4,262573,-12.3,1867.5,585619,-73.7,Zhang Jianhua,Petroleum Refining,Energy,3.0,China,"Beijing, China",http://www.cnpc.com.cn,17,1512048,301893,-1.0
4,Toyota Motor,5,254694,7.7,16899.3,437575,-12.3,Akio Toyoda,Motor Vehicles and Parts,Motor Vehicles & Parts,8.0,Japan,"Toyota, Japan",http://www.toyota-global.com,23,364445,157210,3.0


In [221]:
# We must use .loc when assigning new values
f500.loc[f500["sector"] == "Energy","sector"] = "Oil"

In [222]:
f500

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity,rank_change
0,Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1.0,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798,0.0
1,State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Oil,2.0,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456,0.0
2,Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Oil,4.0,China,"Beijing, China",http://www.sinopec.com,19,713288,106523,1.0
3,China National Petroleum,4,262573,-12.3,1867.5,585619,-73.7,Zhang Jianhua,Petroleum Refining,Oil,3.0,China,"Beijing, China",http://www.cnpc.com.cn,17,1512048,301893,-1.0
4,Toyota Motor,5,254694,7.7,16899.3,437575,-12.3,Akio Toyoda,Motor Vehicles and Parts,Motor Vehicles & Parts,8.0,Japan,"Toyota, Japan",http://www.toyota-global.com,23,364445,157210,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,Teva Pharmaceutical Industries,496,21903,11.5,329.0,92890,-79.3,Yitzhak Peterburg,Pharmaceuticals,Health Care,,Israel,"Petach Tikva, Israel",http://www.tevapharm.com,1,56960,33337,
496,New China Life Insurance,497,21796,-13.3,743.9,100609,-45.6,Wan Feng,"Insurance: Life, Health (stock)",Financials,427.0,China,"Beijing, China",http://www.newchinalife.com,2,54378,8507,-70.0
497,Wm. Morrison Supermarkets,498,21741,-11.3,406.4,11630,20.4,David T. Potts,Food and Drug Stores,Food & Drug Stores,437.0,Britain,"Bradford, Britain",http://www.morrisons.com,13,77210,5111,-61.0
498,TUI,499,21655,-5.5,1151.7,16247,195.5,Friedrich Joussen,Travel Services,Business Services,467.0,Germany,"Hanover, Germany",http://www.tuigroup.com,23,66779,3006,-32.0


### Hidden chaining



In [253]:
???

In [254]:
???
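As a hedged illustration of what hidden chaining usually looks like: the two indexing steps are split across separate lines, so the chain is not visible in a single expression. The sketch below uses hypothetical toy data and shows one common remedy, an explicit `.copy()`, which states that the filtered frame is meant to be independent of the original.

```python
import pandas as pd

# Hypothetical toy data with the same column names as f500.
demo = pd.DataFrame({
    "company": ["A", "B", "C"],
    "sector": ["Energy", "Retailing", "Energy"],
})

# First indexing step: filter, then copy so the result owns its data.
energy = demo[demo["sector"] == "Energy"].copy()

# Second step, possibly many lines later: assign on the independent copy.
energy["sector"] = "Oil"

print(energy)
print(demo)
```

Without the `.copy()`, pandas versions that emit SettingWithCopyWarning would typically warn on the assignment, because `energy` was derived from `demo` by indexing. If the goal is to change `demo` itself, a single `.loc` assignment on `demo` is the unambiguous form.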

### Tips and tricks for dealing with SettingWithCopyWarning

### Chained assignment in Depth

In [246]:
df1 = pd.DataFrame(np.arange(6).reshape((3,2)), columns=list('AB'))
df1

Unnamed: 0,A,B
0,0,1
1,2,3
2,4,5


In [247]:
df2 = df1.loc[:1]

In [249]:
id(df1)

1771473588416

In [248]:
id(df2)

1771514249168

In [250]:
a = df2
id(a)

1771514249168

In [251]:
a.loc[0,"A"] = 10

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  a.loc[0,"A"] = 10
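The `id()` comparisons above can be restated with `is`. Plain assignment (`a = df2`) binds a second name to the same object, while indexing with `.loc` returns a new object derived from `df1`. The sketch below rebuilds the same frames and, as one possible way to avoid the ambiguity, modifies an explicit copy. The names `same_object`, `new_object` and `safe` are introduced only for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape((3, 2)), columns=list("AB"))
df2 = df1.loc[:1]   # new object derived from df1
a = df2             # second name for the same object as df2

same_object = a is df2        # plain assignment does not copy
new_object = df2 is not df1   # indexing returned a different object

# An explicit copy can be modified without touching df1.
safe = df1.loc[:1].copy()
safe.loc[0, "A"] = 10

print(same_object, new_object)
print(df1.loc[0, "A"], safe.loc[0, "A"])
```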
