# Introduction to pandas

## Understanding pandas and NumPy


<p></p><center><img alt="anatomy of a dataframe" src="images/df_anatomy_static_resized.svg"></center><p></p>

## About pandas

## Importing pandas

In [1]:
import numpy as np
import pandas as pd

In [2]:
pd.__version__

'1.3.3'

In [3]:
pd?

More detailed documentation, along with tutorials and other resources, can be found at http://pandas.pydata.org/.

## Introduction to the Data

<p>The data set is a CSV file called <code>f500.csv</code>. Here is a data dictionary for some of the columns in the CSV:</p>
<ul>
<li><code>company</code>: Name of the company.</li>
<li><code>rank</code>: Global 500 rank for the company.</li>
<li><code>revenues</code>: Company's total revenue for the fiscal year, in millions of dollars (USD).</li>
<li><code>revenue_change</code>: Percentage change in revenue between the current and prior fiscal year.</li>
<li><code>profits</code>: Net income for the fiscal year, in millions of dollars (USD).</li>
<li><code>ceo</code>: Company's Chief Executive Officer.</li>
<li><code>industry</code>: Industry in which the company operates.</li>
<li><code>sector</code>: Sector in which the company operates.</li>
<li><code>previous_rank</code>: Global 500 rank for the company for the prior year.</li>
<li><code>country</code>: Country in which the company is headquartered.</li>
</ul>

In [6]:
f500 = pd.read_csv("data/f500.csv", index_col=0)

In [12]:
f500.head(3)

company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798
State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456
Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4,China,"Beijing, China",http://www.sinopec.com,19,713288,106523


In [13]:
type(f500)

pandas.core.frame.DataFrame

In [14]:
f500.shape

(500, 16)

## Introducing Pandas Objects - Data Structures

<ol class=""><li id="5a05" class="mr ms dz ap mu b mv mw mx my mz na nb nc nd ne nf od oe of" data-selectable-paragraph="">Series: 1D (comparable to a single column of a spreadsheet)</li><li id="bdf1" class="mr ms dz ap mu b mv og mx oh mz oi nb oj nd ok nf od oe of" data-selectable-paragraph="">DataFrame: 2D (comparable to a single spreadsheet)</li><li id="6491" class="mr ms dz ap mu b mv og mx oh mz oi nb oj nd ok nf od oe of" data-selectable-paragraph="">Panel: 3D (comparable to a group of spreadsheets); deprecated in pandas 0.20 and removed in 0.25, so it is not available in the pandas 1.3.3 used here</li></ol>

<table class="table table-bordered">
<tbody><tr>
<th style="text-align:center;">Data Structure</th>
<th style="text-align:center;">Dimensions</th>
<th style="text-align:center;">Description</th>
</tr>
<tr>
<td style="text-align:center;">Series</td>
<td style="text-align:center;">1</td>
<td style="text-align:center;">1D labeled homogeneous array, size-immutable.</td>
</tr>
<tr>
<td style="text-align:center;">DataFrame</td>
<td style="text-align:center;">2</td>
<td style="text-align:center;">General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed
columns.</td>
</tr>
<tr>
<td style="text-align:center;">Panel (removed in pandas 0.25)</td>
<td style="text-align:center;">3</td>
<td style="text-align:center;">General 3D labeled, size-mutable array.</td>
</tr>
</tbody></table>
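
As a minimal sketch of the two structures that are still available, the block below builds a Series and a DataFrame by hand. The variable names `revenues` and `companies` are illustrative assumptions; the values are copied from the first three rows of `f500` shown earlier.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
revenues = pd.Series(
    [485873, 315199, 267518],
    index=["Walmart", "State Grid", "Sinopec Group"],
    name="revenues",
)

# A DataFrame is a two-dimensional table whose columns may have different dtypes.
companies = pd.DataFrame(
    {
        "revenues": [485873, 315199, 267518],
        "country": ["USA", "China", "China"],
    },
    index=["Walmart", "State Grid", "Sinopec Group"],
)

print(revenues.ndim, companies.ndim)  # 1 2
print(companies.shape)                # (3, 2)

# Each column of a DataFrame is itself a Series.
print(type(companies["revenues"]))
```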

## Introducing DataFrames

In [15]:
f500.head(3)

company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798
State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456
Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4,China,"Beijing, China",http://www.sinopec.com,19,713288,106523


In [16]:
f500.tail(3)

company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
Wm. Morrison Supermarkets,498,21741,-11.3,406.4,11630,20.4,David T. Potts,Food and Drug Stores,Food & Drug Stores,437,Britain,"Bradford, Britain",http://www.morrisons.com,13,77210,5111
TUI,499,21655,-5.5,1151.7,16247,195.5,Friedrich Joussen,Travel Services,Business Services,467,Germany,"Hanover, Germany",http://www.tuigroup.com,23,66779,3006
AutoNation,500,21609,3.6,430.5,10060,-2.7,Michael J. Jackson,Specialty Retailers,Retailing,0,USA,"Fort Lauderdale, FL",http://www.autonation.com,12,26000,2310


In [17]:
f500.dtypes

rank                          int64
revenues                      int64
revenue_change              float64
profits                     float64
assets                        int64
profit_change               float64
ceo                          object
industry                     object
sector                       object
previous_rank                 int64
country                      object
hq_location                  object
website                      object
years_on_global_500_list      int64
employees                     int64
total_stockholder_equity      int64
dtype: object

In [20]:
f500.info()

<class 'pandas.core.frame.DataFrame'>
Index: 500 entries, Walmart to AutoNation
Data columns (total 16 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   rank                      500 non-null    int64  
 1   revenues                  500 non-null    int64  
 2   revenue_change            498 non-null    float64
 3   profits                   499 non-null    float64
 4   assets                    500 non-null    int64  
 5   profit_change             436 non-null    float64
 6   ceo                       500 non-null    object 
 7   industry                  500 non-null    object 
 8   sector                    500 non-null    object 
 9   previous_rank             500 non-null    int64  
 10  country                   500 non-null    object 
 11  hq_location               500 non-null    object 
 12  website                   500 non-null    object 
 13  years_on_global_500_list  500 non-null    int64  
 14  em

## Pandas Data Selection - indexing

### Selecting a Column From a DataFrame by Label (.loc)

    df.loc[row_label, column_label]

In [21]:
rank_col = f500.loc[:, "rank"]

In [23]:
# shorthand for selecting a column
rank_col = f500["rank"]

In [24]:
rank_col

company
Walmart                             1
State Grid                          2
Sinopec Group                       3
China National Petroleum            4
Toyota Motor                        5
                                 ... 
Teva Pharmaceutical Industries    496
New China Life Insurance          497
Wm. Morrison Supermarkets         498
TUI                               499
AutoNation                        500
Name: rank, Length: 500, dtype: int64

In [25]:
type(rank_col)

pandas.core.series.Series

In [40]:
rank_col = f500.rank  # don't do this: DataFrame.rank is a method, so this does not select the column

In [41]:
type(rank_col)

method

In [31]:
industry = f500["industry"]
industry.head(3)

company
Walmart          General Merchandisers
State Grid                   Utilities
Sinopec Group       Petroleum Refining
Name: industry, dtype: object

In [32]:
industry = f500.loc[: ,"industry"]
industry.head(3)

company
Walmart          General Merchandisers
State Grid                   Utilities
Sinopec Group       Petroleum Refining
Name: industry, dtype: object

In [43]:
industry = f500.industry  # works, but not recommended
industry.head(6)

company
Walmart                        General Merchandisers
State Grid                                 Utilities
Sinopec Group                     Petroleum Refining
China National Petroleum          Petroleum Refining
Toyota Motor                Motor Vehicles and Parts
Volkswagen                  Motor Vehicles and Parts
Name: industry, dtype: object

In [44]:
type(industry)

pandas.core.series.Series

In [45]:
type(industry.values)

numpy.ndarray

In [47]:
industry.values[:5]

array(['General Merchandisers', 'Utilities', 'Petroleum Refining',
       'Petroleum Refining', 'Motor Vehicles and Parts'], dtype=object)

In [49]:
print(industry.values.dtype)

object


In [52]:
rank_col = f500["rank"]
print(rank_col.values.dtype)

int64



<p><img alt="dataframe exploded" src="images/df_exploded_resized.svg"></p>


In [56]:
f500.loc[:, ["country", "rank"]].head()  # the result is a DataFrame

company,country,rank
Walmart,USA,1
State Grid,China,2
Sinopec Group,China,3
China National Petroleum,China,4
Toyota Motor,Japan,5


In [57]:
f500[["country", "rank"]].head()

company,country,rank
Walmart,USA,1
State Grid,China,2
Sinopec Group,China,3
China National Petroleum,China,4
Toyota Motor,Japan,5


In [59]:
f500.loc[:, "rank": "profits"].head()

company,rank,revenues,revenue_change,profits
Walmart,1,485873,0.8,13643.0
State Grid,2,315199,-4.4,9571.3
Sinopec Group,3,267518,-9.1,1257.9
China National Petroleum,4,262573,-12.3,1867.5
Toyota Motor,5,254694,7.7,16899.3


<p></p><center>
<table>
<thead>
<tr>
<th>Select by Label</th>
<th>Explicit Syntax</th>
<th>Common Shorthand</th>
<th>Other Shorthand</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single column</td>
<td><code>df.loc[:,"col1"]</code></td>
<td bgcolor="#00FF00"><code>df["col1"]</code></td>
<td><code>df.col1</code></td>
</tr>
<tr>
<td>List of columns</td>
<td><code>df.loc[:,["col1", "col7"]]</code></td>
<td bgcolor="#00FF00"><code>df[["col1", "col7"]]</code></td>
<td></td>
</tr>
<tr>
<td>Slice of columns</td>
<td bgcolor="#00FF00"><code>df.loc[:,"col1":"col4"]</code></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
</center><p></p>

<div class="alert alert-block alert-info">
<b>Exercise:</b> Select the country column. Assign the result to the variable name countries.</div>

In [61]:
countries = f500["country"]
countries.head()

company
Walmart                       USA
State Grid                  China
Sinopec Group               China
China National Petroleum    China
Toyota Motor                Japan
Name: country, dtype: object

<div class="alert alert-block alert-info">
<b>Exercise:</b> In order, select the revenues and years_on_global_500_list columns. Assign the result to the variable name revenues_years.</div>

In [63]:
revenues_years = f500[["revenues", "years_on_global_500_list"]]
revenues_years.head()

company,revenues,years_on_global_500_list
Walmart,485873,23
State Grid,315199,17
Sinopec Group,267518,19
China National Petroleum,262573,17
Toyota Motor,254694,23


<div class="alert alert-block alert-info">
<b>Exercise:</b> In order, select all columns from ceo up to and including sector. Assign the result to the variable name ceo_to_sector.</div>

In [64]:
ceo_to_sector = f500.loc[:, "ceo":"sector"]
ceo_to_sector.head()

company,ceo,industry,sector
Walmart,C. Douglas McMillon,General Merchandisers,Retailing
State Grid,Kou Wei,Utilities,Energy
Sinopec Group,Wang Yupu,Petroleum Refining,Energy
China National Petroleum,Zhang Jianhua,Petroleum Refining,Energy
Toyota Motor,Akio Toyoda,Motor Vehicles and Parts,Motor Vehicles & Parts


### Selecting Rows From a DataFrame by Label (.loc)

    df.loc[row_label, column_label]

**Select a single row**

In [65]:
single_row = f500.loc["Sinopec Group", :]

In [70]:
single_row = f500.loc["Sinopec Group"]

In [71]:
print(type(single_row))

<class 'pandas.core.series.Series'>


In [72]:
single_row

rank                                             3
revenues                                    267518
revenue_change                                -9.1
profits                                     1257.9
assets                                      310726
profit_change                                -65.0
ceo                                      Wang Yupu
industry                        Petroleum Refining
sector                                      Energy
previous_rank                                    4
country                                      China
hq_location                         Beijing, China
website                     http://www.sinopec.com
years_on_global_500_list                        19
employees                                   713288
total_stockholder_equity                    106523
Name: Sinopec Group, dtype: object

In [69]:
type(single_row["rank"])

numpy.int64

**Select a list of rows**

In [73]:
list_rows = f500.loc[["Walmart", "Toyota Motor"]]

In [74]:
print(type(list_rows))

<class 'pandas.core.frame.DataFrame'>


In [75]:
list_rows

company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798
Toyota Motor,5,254694,7.7,16899.3,437575,-12.3,Akio Toyoda,Motor Vehicles and Parts,Motor Vehicles & Parts,8,Japan,"Toyota, Japan",http://www.toyota-global.com,23,364445,157210


**Select a slice of rows by label (both endpoints are included)**

In [78]:
f500.loc["State Grid": "China National Petroleum"].head()

company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456
Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4,China,"Beijing, China",http://www.sinopec.com,19,713288,106523
China National Petroleum,4,262573,-12.3,1867.5,585619,-73.7,Zhang Jianhua,Petroleum Refining,Energy,3,China,"Beijing, China",http://www.cnpc.com.cn,17,1512048,301893


<img alt="series vs dataframe: series" src="images/df_series_s_updated.svg">

<img alt="series vs dataframe: dataframe" src="images/df_series_df_updated.svg">

### Selecting Items from a Series by Label (.loc)

In [85]:
sectors = f500["sector"]
type(sectors)

pandas.core.series.Series

In [86]:
sectors.value_counts().head()

Financials                118
Energy                     80
Technology                 44
Motor Vehicles & Parts     34
Wholesalers                28
Name: sector, dtype: int64

In [87]:
sec_ind = f500[["sector", "industry"]]

In [88]:
type(sec_ind)

pandas.core.frame.DataFrame

In [90]:
sec_ind.value_counts().head()

sector                  industry                       
Financials              Banks: Commercial and Savings      51
Motor Vehicles & Parts  Motor Vehicles and Parts           34
Energy                  Petroleum Refining                 28
Financials              Insurance: Life, Health (stock)    24
Food & Drug Stores      Food and Drug Stores               20
dtype: int64

In [91]:
top_10_countries = f500["country"].value_counts().head(10)

In [92]:
top_10_countries

USA            132
China          109
Japan           51
Germany         29
France          29
Britain         24
South Korea     15
Netherlands     14
Switzerland     14
Canada          11
Name: country, dtype: int64

<table>
<thead>
<tr>
<th>Select by Label</th>
<th>Explicit Syntax</th>
<th>Shorthand Convention</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single item from series</td>
<td><code>s.loc["item8"]</code></td>
<td bgcolor="#00FF00"> <code>s["item8"]</code></td>
</tr>
<tr>
<td>List of items from series</td>
<td><code>s.loc[["item1","item7"]]</code></td>
<td bgcolor="#00FF00"><code>s[["item1","item7"]]</code></td>
</tr>
<tr>
<td>Slice of items from series</td>
<td><code>s.loc["item2":"item4"]</code></td>
<td bgcolor="#00FF00"><code>s["item2":"item4"]</code></td>
</tr>
</tbody>
</table>

In [100]:
top_10_countries["France"]

29

In [101]:
top_10_countries["South Korea": "Canada"]

South Korea    15
Netherlands    14
Switzerland    14
Canada         11
Name: country, dtype: int64

### Summary of label selection (.loc)

<table>
<thead>
<tr>
<th>Select by Label</th>
<th>Explicit Syntax</th>
<th>Shorthand Convention</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single column from dataframe</td>
<td><code>df.loc[:,"col1"]</code></td>
<td bgcolor="#00FF00"><code>df["col1"]</code></td>
</tr>
<tr>
<td>List of columns from dataframe</td>
<td><code>df.loc[:,["col1","col7"]]</code></td>
<td bgcolor="#00FF00"><code>df[["col1","col7"]]</code></td>
</tr>
<tr>
<td>Slice of columns from dataframe</td>
<td bgcolor="#00FF00"><code>df.loc[:,"col1":"col4"]</code></td>
<td></td>
</tr>
<tr>
<td>Single row from dataframe</td>
<td bgcolor="#00FF00"><code>df.loc["row4"]</code></td>
<td></td>
</tr>
<tr>
<td>List of rows from dataframe</td>
<td bgcolor="#00FF00"><code>df.loc[["row1", "row8"]]</code></td>
<td></td>
</tr>
<tr>
<td>Slice of rows from dataframe</td>
<td bgcolor="#00FF00"><code>df.loc["row3":"row5"]</code></td>
<td><code>df["row3":"row5"]</code></td>
</tr>
<tr>
<td>Single item from series</td>
<td><code>s.loc["item8"]</code></td>
<td bgcolor="#00FF00"><code>s["item8"]</code></td>
</tr>
<tr>
<td>List of items from series</td>
<td><code>s.loc[["item1","item7"]]</code></td>
<td bgcolor="#00FF00"><code>s[["item1","item7"]]</code></td>
</tr>
<tr>
<td>Slice of items from series</td>
<td><code>s.loc["item2":"item4"]</code></td>
<td bgcolor="#00FF00"><code>s["item2":"item4"]</code></td>
</tr>
</tbody>
</table>

<div class="alert alert-block alert-info">
<b>Exercise:</b> Create a new variable, big_movers, containing the rows with indices Aviva, HP, JD.com, and BHP Billiton and the columns rank and previous_rank, each in that order.</div>

In [102]:
big_movers = f500.loc[["Aviva", "HP", "JD.com", "BHP Billiton"], ["rank", "previous_rank"]]
big_movers

company,rank,previous_rank
Aviva,90,279
HP,194,48
JD.com,261,366
BHP Billiton,350,168


<div class="alert alert-block alert-info">
<b>Exercise:</b> Create a new variable, bottom_companies, containing all rows with indices from National Grid to AutoNation, inclusive, and the rank, sector, and country columns.</div>

In [104]:
bottom_companies = f500.loc["National Grid":"AutoNation", ["rank", "sector", "country"]]
bottom_companies

company,rank,sector,country
National Grid,491,Energy,Britain
Dollar General,492,Retailing,USA
Telecom Italia,493,Telecommunications,Italy
Xiamen ITG Holding Group,494,Wholesalers,China
Xinjiang Guanghui Industry Investment,495,Wholesalers,China
Teva Pharmaceutical Industries,496,Health Care,Israel
New China Life Insurance,497,Financials,China
Wm. Morrison Supermarkets,498,Food & Drug Stores,Britain
TUI,499,Business Services,Germany
AutoNation,500,Retailing,USA


## Vectorized Operations


<p><img alt="Vectorized operation" src="images/vectorized.gif"></p>


In [106]:
my_series = pd.Series([1,2,3,4,5])
my_series

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [107]:
my_series = my_series + 10

In [108]:
my_series

0    11
1    12
2    13
3    14
4    15
dtype: int64


<ul>
<li><code>series_a + series_b</code> - Addition</li>
<li><code>series_a - series_b</code> - Subtraction</li>
<li><code>series_a * series_b</code> - Multiplication (element-wise, not the matrix multiplication used in linear algebra).</li>
<li><code>series_a / series_b</code> - Division</li>
</ul>
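
The four operators listed above can be sketched on two small series. The values below are made up for illustration; `series_a` and `series_b` are the placeholder names from the list.

```python
import pandas as pd

series_a = pd.Series([10, 20, 30])
series_b = pd.Series([1, 2, 3])

# Each operator is applied to the pair of values that share a label.
total = series_a + series_b       # 11, 22, 33
difference = series_a - series_b  # 9, 18, 27
product = series_a * series_b     # 10, 40, 90
quotient = series_a / series_b    # 10.0, 10.0, 10.0 (division yields floats)

print(product.tolist())
print(quotient.dtype)
```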


In [109]:
rank_change = f500["previous_rank"] - f500["rank"]

In [110]:
rank_change.head(10)

company
Walmart                     0
State Grid                  0
Sinopec Group               1
China National Petroleum   -1
Toyota Motor                3
Volkswagen                  1
Royal Dutch Shell          -2
Berkshire Hathaway          3
Apple                       0
Exxon Mobil                -4
dtype: int64

##  Series Data Exploration Methods

In [113]:
my_series = pd.Series([0, 1, 2, 3, 4])

In [114]:
my_series.sum()

10

In [115]:
my_series.mean()

2.0


<ul>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.max.html"><code>Series.max()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.min.html"><code>Series.min()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mean.html"><code>Series.mean()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.median.html"><code>Series.median()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mode.html"><code>Series.mode()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sum.html"><code>Series.sum()</code></a></li>
</ul>


<div class="alert alert-block alert-info">
<b>Exercise:</b> Use the Series.max() method to find the maximum value for the rank_change series. Assign the result to the variable rank_change_max.</div>

In [None]:
rank_change_max = rank_change.max()
rank_change_max

<div class="alert alert-block alert-info">
<b>Exercise:</b> Use the Series.min() method to find the minimum value for the rank_change series. Assign the result to the variable rank_change_min.</div>

In [112]:
rank_change_min = rank_change.min()
rank_change_min

-500

### Series Describe Method

In [117]:
f500["assets"].describe()

count    5.000000e+02
mean     2.436323e+05
std      4.851937e+05
min      3.717000e+03
25%      3.658850e+04
50%      7.326150e+04
75%      1.805640e+05
max      3.473238e+06
Name: assets, dtype: float64

In [118]:
f500["country"].describe()

count     500
unique     34
top       USA
freq      132
Name: country, dtype: object

<div class="alert alert-block alert-info">
<b>Exercise:</b> Return a series of descriptive statistics for the rank column in f500.</div>

In [119]:
f500["rank"].describe()

count    500.000000
mean     250.500000
std      144.481833
min        1.000000
25%      125.750000
50%      250.500000
75%      375.250000
max      500.000000
Name: rank, dtype: float64

<div class="alert alert-block alert-info">
<b>Exercise:</b> Return a series of descriptive statistics for the previous_rank column in f500.</div>

In [120]:
f500["previous_rank"].describe()

count    500.000000
mean     222.134000
std      146.941961
min        0.000000
25%       92.750000
50%      219.500000
75%      347.250000
max      500.000000
Name: previous_rank, dtype: float64

## Method Chaining

In [123]:
countries = f500["country"]
countries_count = countries.value_counts()
countries_count.head()

USA        132
China      109
Japan       51
Germany     29
France      29
Name: country, dtype: int64

In [124]:
# the same, but in a single line
f500["country"].value_counts().head()

USA        132
China      109
Japan       51
Germany     29
France      29
Name: country, dtype: int64

In [125]:
f500["country"].value_counts().head().loc["China"]

109

<div class="alert alert-block alert-info">
<b>Exercise:</b> Use Series.value_counts() and Series.loc to return the number of companies with a value of 0 in the previous_rank column in the f500 dataframe. Assign the result to zero_previous_rank.</div>

In [128]:
zero_previous_rank = f500["previous_rank"].value_counts().loc[0]
zero_previous_rank

33

## Dataframe Exploration Methods


<ul>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.max.html"><code>Series.max()</code></a> and <a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.max.html"><code>DataFrame.max()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.min.html"><code>Series.min()</code></a> and <a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.min.html"><code>DataFrame.min()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mean.html"><code>Series.mean()</code></a> and <a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html"><code>DataFrame.mean()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.median.html"><code>Series.median()</code></a> and <a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.median.html"><code>DataFrame.median()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mode.html"><code>Series.mode()</code></a> and <a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mode.html"><code>DataFrame.mode()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sum.html"><code>Series.sum()</code></a> and <a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html"><code>DataFrame.sum()</code></a></li>
</ul>

<p><img alt="dataframe axis parameters" src="images/axis_param.svg"></p>
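
A minimal sketch of the `axis` parameter, assuming a hand-built three-row table named `small` (a hypothetical name) with values copied from the first rows of `f500`. It contrasts reducing down the rows with reducing across the columns.

```python
import pandas as pd

small = pd.DataFrame(
    {"revenues": [485873, 315199, 267518], "profits": [13643.0, 9571.3, 1257.9]},
    index=["Walmart", "State Grid", "Sinopec Group"],
)

# axis=0 (or axis="index"): collapse the rows, giving one result per column.
per_column = small.max(axis=0)

# axis=1 (or axis="columns"): collapse the columns, giving one result per row.
per_row = small.max(axis=1)

print(per_column)
print(per_row)
```

The result of `axis=0` is indexed by column label, while the result of `axis=1` is indexed by row label.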


In [129]:
medians = f500[["revenues", "profits"]].median(axis=0)
# we could also use .median(axis="index")
print(medians)

revenues    40236.0
profits      1761.6
dtype: float64


<div class="alert alert-block alert-info">
<b>Exercise:</b> Use the DataFrame.max() method to find the maximum value for only the numeric columns from f500 (you may need to check the documentation). Assign the result to the variable max_f500.</div>

In [132]:
max_f500 = f500.max(axis=0, numeric_only=True)
max_f500

rank                            500.0
revenues                     485873.0
revenue_change                  442.3
profits                       45687.0
assets                      3473238.0
profit_change                  8909.5
previous_rank                   500.0
years_on_global_500_list         23.0
employees                   2300000.0
total_stockholder_equity     301893.0
dtype: float64

### Dataframe Describe Method

In [133]:
f500.describe()

,rank,revenues,revenue_change,profits,assets,profit_change,previous_rank,years_on_global_500_list,employees,total_stockholder_equity
count,500.0,500.0,498.0,499.0,500.0,436.0,500.0,500.0,500.0,500.0
mean,250.5,55416.358,4.538353,3055.203206,243632.3,24.152752,222.134,15.036,133998.3,30628.076
std,144.481833,45725.478963,28.549067,5171.981071,485193.7,437.509566,146.941961,7.932752,170087.8,43642.576833
min,1.0,21609.0,-67.3,-13038.0,3717.0,-793.7,0.0,1.0,328.0,-59909.0
25%,125.75,29003.0,-5.9,556.95,36588.5,-22.775,92.75,7.0,42932.5,7553.75
50%,250.5,40236.0,0.55,1761.6,73261.5,-0.35,219.5,17.0,92910.5,15809.5
75%,375.25,63926.75,6.975,3954.0,180564.0,17.7,347.25,23.0,168917.2,37828.5
max,500.0,485873.0,442.3,45687.0,3473238.0,8909.5,500.0,23.0,2300000.0,301893.0


In [134]:
f500.describe(include=["O"])

,ceo,industry,sector,country,hq_location,website
count,500,500,500,500,500,500
unique,500,58,21,34,235,500
top,C. Douglas McMillon,Banks: Commercial and Savings,Financials,USA,"Beijing, China",http://www.walmart.com
freq,1,51,118,132,56,1


<div class="alert alert-block alert-info">
<b>Exercise:</b> Return a dataframe of descriptive statistics for all of the numeric columns in f500. Assign the result to f500_desc.</div>

## Assignment with pandas

In [138]:
top5 = f500[["rank", "revenues"]].head()
top5

company,rank,revenues
Walmart,1,485873
State Grid,2,315199
Sinopec Group,3,267518
China National Petroleum,4,262573
Toyota Motor,5,254694


In [141]:
top5["revenues"] = 0
top5

company,rank,revenues
Walmart,1,0
State Grid,2,0
Sinopec Group,3,0
China National Petroleum,4,0
Toyota Motor,5,0


In [143]:
top5.loc["Sinopec Group", "revenues"] = 999
top5

company,rank,revenues
Walmart,1,0
State Grid,2,0
Sinopec Group,3,999
China National Petroleum,4,0
Toyota Motor,5,0


<div class="alert alert-block alert-info">
<b>Exercise:</b> The company "Dow Chemical" has named a new CEO. In the f500 dataframe, set the ceo value in the row labeled Dow Chemical to Jim Fitterling.</div>

In [145]:
f500.loc["Dow Chemical","ceo"] = "Jim Fitterling"

## Using Boolean Indexing with pandas Objects

In [None]:
d = {'name': ['Bob', 'Eva', 'Sara', 'Mihael'], 'num': [12, 8, 5, 8]}
df = pd.DataFrame(data=d, index=['w', 'x', 'y', 'z'])
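
One way the small dataframe above might be used to illustrate boolean indexing is sketched below. The comparison value 8 and the names `num_bool`, `result`, and `names` are assumptions chosen for the example.

```python
import pandas as pd

d = {"name": ["Bob", "Eva", "Sara", "Mihael"], "num": [12, 8, 5, 8]}
df = pd.DataFrame(data=d, index=["w", "x", "y", "z"])

# Comparing a column with a value yields a boolean Series (a "mask").
num_bool = df["num"] == 8
print(num_bool.tolist())  # [False, True, False, True]

# Indexing with the mask keeps only the rows where it is True.
result = df[num_bool]
print(result.index.tolist())  # ['x', 'z']

# With .loc the same mask can be combined with a column selection.
names = df.loc[num_bool, "name"]
print(names.tolist())  # ['Eva', 'Mihael']
```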

<div class="alert alert-block alert-info">
<b>Exercise:</b> Create a boolean series, motor_bool, that compares whether the values in the industry column from the f500 dataframe are equal to "Motor Vehicles and Parts".
Use the motor_bool boolean series to index the country column. Assign the result to motor_countries.</div>

### Using Boolean Arrays to Assign Values
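
A minimal sketch of assignment through a boolean mask, reusing the four-row `df` from the previous section. The replacement value 0 and the name `num_is_8` are arbitrary choices for illustration.

```python
import pandas as pd

d = {"name": ["Bob", "Eva", "Sara", "Mihael"], "num": [12, 8, 5, 8]}
df = pd.DataFrame(data=d, index=["w", "x", "y", "z"])

# Build the mask, then use it as the row selector of .loc on the
# left-hand side; only rows where the mask is True are modified.
num_is_8 = df["num"] == 8
df.loc[num_is_8, "num"] = 0

print(df["num"].tolist())  # [12, 0, 5, 0]
```

Passing both the mask and the column label to a single `.loc` call keeps the assignment on the original dataframe rather than on an intermediate selection.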

## Creating New Columns
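
As a sketch, a new column can be created by assigning a series to a label that does not exist yet. The table `small` is a hypothetical three-row stand-in for `f500`, with rank values copied from the outputs shown earlier.

```python
import pandas as pd

small = pd.DataFrame(
    {"rank": [3, 4, 5], "previous_rank": [4, 3, 8]},
    index=["Sinopec Group", "China National Petroleum", "Toyota Motor"],
)

# Assigning to a column label that is not present appends a new column.
small["rank_change"] = small["previous_rank"] - small["rank"]

print(small.columns.tolist())         # ['rank', 'previous_rank', 'rank_change']
print(small["rank_change"].tolist())  # [1, -1, 3]
```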

## Exercise: Top Performers by Country

## Reading CSV files with pandas

    f500 = pd.read_csv("data/f500.csv", index_col=0)
    f500.index.name = None

    company,rank,revenues,revenue_change
    Walmart,1,485873,0.8
    State Grid,2,315199,-4.4
    Sinopec Group,3,267518,-9.1
    China National Petroleum,4,262573,-12.3
    Toyota Motor,5,254694,7.7


<p><img alt="csv_to_dataframe" src="images/csv_to_dataframe.svg"></p>
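
The effect of `index_col=0` can be sketched without the file on disk by reading the five sample rows from an in-memory string. Using `io.StringIO` here is an assumption made to keep the example self-contained; the names `csv_text`, `sample`, and `default_index` are illustrative.

```python
import io

import pandas as pd

# The five sample rows shown above, held in memory instead of a file.
csv_text = """company,rank,revenues,revenue_change
Walmart,1,485873,0.8
State Grid,2,315199,-4.4
Sinopec Group,3,267518,-9.1
China National Petroleum,4,262573,-12.3
Toyota Motor,5,254694,7.7
"""

# index_col=0 turns the first column (company) into the row labels.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(sample.index.name)  # company

# Clearing the index name removes the "company" label above the row labels.
sample.index.name = None
print(sample.shape)  # (5, 3)

# Without index_col, company stays a regular column and rows are labeled 0..4.
default_index = pd.read_csv(io.StringIO(csv_text))
print(default_index.shape)  # (5, 4)
```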


## Using iloc to select by integer position


<p><img alt="selection using label" src="images/selection_loc.svg"></p>

<p><img alt="selection using iloc" src="images/selection_iloc.svg"></p>


    df.iloc[row_index, column_index]
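
A short sketch of positional selection with `iloc`, assuming a hand-built table `small` (hypothetical name) whose values are copied from the first three rows of `f500`.

```python
import pandas as pd

small = pd.DataFrame(
    {
        "rank": [1, 2, 3],
        "revenues": [485873, 315199, 267518],
        "country": ["USA", "China", "China"],
    },
    index=["Walmart", "State Grid", "Sinopec Group"],
)

# Positions start at 0, so row 1 / column 2 is State Grid's country.
value = small.iloc[1, 2]
print(value)  # China

# A single integer selects a whole row as a Series.
second_row = small.iloc[1]

# Unlike .loc label slices, .iloc slices exclude the end position.
first_two = small.iloc[0:2]
print(first_two.index.tolist())  # ['Walmart', 'State Grid']
```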

<div class="alert alert-block alert-info">
<b>Exercise:</b> Select just the fifth row of the f500 dataframe. Assign the result to fifth_row.</div>

<div class="alert alert-block alert-info">
<b>Exercise:</b> Select the value in the first row of the company column. Assign the result to company_value.</div>


<table>
<thead>
<tr>
<th>Select by integer position</th>
<th>Explicit Syntax</th>
<th>Shorthand Convention</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single column from dataframe</td>
<td><code>df.iloc[:,3]</code></td>
<td></td>
</tr>
<tr>
<td>List of columns from dataframe</td>
<td><code>df.iloc[:,[3,5,6]]</code></td>
<td></td>
</tr>
<tr>
<td>Slice of columns from dataframe</td>
<td><code>df.iloc[:,3:7]</code></td>
<td></td>
</tr>
<tr>
<td>Single row from dataframe</td>
<td><code>df.iloc[20]</code></td>
<td></td>
</tr>
<tr>
<td>List of rows from dataframe</td>
<td><code>df.iloc[[0,3,8]]</code></td>
<td></td>
</tr>
<tr>
<td>Slice of rows from dataframe</td>
<td><code>df.iloc[3:5]</code></td>
<td><code>df[3:5]</code></td>
</tr>
<tr>
<td>Single item from series</td>
<td><code>s.iloc[8]</code></td>
<td><code>s[8]</code></td>
</tr>
<tr>
<td>List of items from series</td>
<td><code>s.iloc[[2,8,1]]</code></td>
<td><code>s[[2,8,1]]</code></td>
</tr>
<tr>
<td>Slice of items from series</td>
<td><code>s.iloc[5:10]</code></td>
<td><code>s[5:10]</code></td>
</tr>
</tbody>
</table>


<div class="alert alert-block alert-info">
<b>Exercise:</b> Select the first three rows of the f500 dataframe. Assign the result to first_three_rows.</div>

<div class="alert alert-block alert-info">
<b>Exercise:</b> Select the first and seventh rows and the first five columns of the f500 dataframe. Assign the result to first_seventh_row_slice.</div>
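
The row and column forms from the table can be combined in a single `iloc` call. Here is a sketch on a hypothetical 8x6 frame of consecutive integers (`grid` and its column names are placeholders, not the f500 data):

```python
import numpy as np
import pandas as pd

# Hypothetical 8x6 frame of consecutive integers; column names are placeholders.
grid = pd.DataFrame(np.arange(48).reshape(8, 6), columns=list('abcdef'))

# Slice of rows: the end position (3) is excluded, as with Python lists.
first_three_rows = grid.iloc[:3]

# List of row positions combined with a slice of column positions.
first_seventh_row_slice = grid.iloc[[0, 6], :5]
print(first_seventh_row_slice.shape)  # (2, 5)
```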

## Using pandas methods to create boolean masks

<div class="alert alert-block alert-info">
<b>Exercise:</b> Use the Series.isnull() method to select all rows from f500 that have a null value in the previous_rank column. Select only the company, rank, and previous_rank columns. Assign the result to null_previous_rank.</div>
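
A sketch of the pattern on a made-up four-row frame; `toy` and all of its values are hypothetical and only stand in for f500:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for f500; names and numbers are made up.
toy = pd.DataFrame({
    'company': ['A', 'B', 'C', 'D'],
    'rank': [1, 2, 3, 4],
    'previous_rank': [2.0, np.nan, 1.0, np.nan],
    'country': ['USA', 'China', 'Japan', 'Germany'],
})

# isnull() returns True wherever the value is missing.
null_mask = toy['previous_rank'].isnull()

# The mask selects rows; the list of labels selects columns.
null_previous_rank = toy.loc[null_mask, ['company', 'rank', 'previous_rank']]
print(null_previous_rank['company'].tolist())  # ['B', 'D']
```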

## Working with Integer Labels


<p><img alt="loc vs iloc for rows in different order" src="images/integer_labels_2.svg"></p>


<div class="alert alert-block alert-info">
<b>Exercise:</b> Assign the first five rows of the null_previous_rank dataframe to the variable top5_null_prev_rank, choosing the correct method out of loc[] and iloc[].</div>
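
The difference matters once a filter has removed rows, because the remaining integer labels no longer match the positions. A sketch with hypothetical labels (`filtered` and its contents are invented for illustration):

```python
import pandas as pd

# Hypothetical result of a boolean filter: the surviving rows keep their
# original integer labels, so labels and positions no longer agree.
filtered = pd.DataFrame(
    {'company': ['P', 'Q', 'R', 'S', 'T', 'U']},
    index=[48, 90, 123, 138, 140, 200],
)

# iloc counts positions, so this is always the first five rows.
top5 = filtered.iloc[:5]

# loc slices by label; no label is 5 or lower here, so nothing is selected.
by_label = filtered.loc[:5]
print(len(top5), len(by_label))  # 5 0
```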

## Pandas Index Alignment

In [14]:
food = pd.DataFrame({'fruit_veg': ['fruit', 'veg', 'fruit', 'veg', 'veg'], 'qty': [4, 2, 4, 1, 2]}, 
                    index=['tomato', 'carrot', 'lime', 'corn', 'eggplant'])

In [20]:
alt_name = pd.Series(['rocket', 'aubergine', 'maize'], index=['arugula', 'eggplant', 'corn'])
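
A short sketch of what alignment does with the two objects above: when the series is assigned as a new column, pandas matches values by index label rather than by position. The column name `alt_name` is chosen here for illustration:

```python
import pandas as pd

food = pd.DataFrame({'fruit_veg': ['fruit', 'veg', 'fruit', 'veg', 'veg'],
                     'qty': [4, 2, 4, 1, 2]},
                    index=['tomato', 'carrot', 'lime', 'corn', 'eggplant'])
alt_name = pd.Series(['rocket', 'aubergine', 'maize'],
                     index=['arugula', 'eggplant', 'corn'])

# Values are matched on index labels: corn and eggplant receive a name,
# rows without a match get NaN, and 'arugula' (absent from food) is dropped.
food['alt_name'] = alt_name
print(food.loc['corn', 'alt_name'])  # maize
```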

<div class="alert alert-block alert-info">
<b>Exercise:</b> Use the Series.notnull() method to select all rows from f500 that have a non-null value in the previous_rank column. Assign the result to previously_ranked. From the previously_ranked dataframe, subtract the rank column from the previous_rank column. Assign the result to rank_change. Assign the values in rank_change to a new column in the f500 dataframe, "rank_change".</div>
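
The steps can be sketched on a three-row stand-in (`toy` and its values are hypothetical). Because the assignment aligns on the index, the row that was filtered out ends up with a missing rank_change:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for f500, indexed by made-up company names.
toy = pd.DataFrame({'rank': [1, 2, 3],
                    'previous_rank': [3.0, np.nan, 1.0]},
                   index=['A', 'B', 'C'])

# Keep only the rows that have a previous rank.
previously_ranked = toy[toy['previous_rank'].notnull()]

# Positive values mean the company moved up the list.
rank_change = previously_ranked['previous_rank'] - previously_ranked['rank']

# Index alignment fills the unmatched row 'B' with NaN.
toy['rank_change'] = rank_change
```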

## Boolean Operators


<table>
<thead>
<tr>
<th>pandas</th>
<th>Python equivalent</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>a &amp; b</code></td>
<td><code>a and b</code></td>
<td><code>True</code> if both <code>a</code> and <code>b</code> are <code>True</code>, else <code>False</code></td>
</tr>
<tr>
<td><code>a | b</code></td>
<td><code>a or b</code></td>
<td><code>True</code> if either <code>a</code> or <code>b</code> is <code>True</code></td>
</tr>
<tr>
<td><code>~a</code></td>
<td><code>not a</code></td>
<td><code>True</code> if <code>a</code> is <code>False</code>, else <code>False</code></td>
</tr>
</tbody>
</table>

<p><img alt="boolean operators example 1" src="images/bool_ops_1.svg"></p>

<p><img alt="boolean operators example 2" src="images/bool_ops_2.svg"></p>

<p><img alt="boolean operators example 3" src="images/bool_ops_3.svg"></p>

<p><img alt="boolean operators example 4" src="images/bool_ops_4.svg"></p>


<div class="alert alert-block alert-info">
<b>Exercise:</b> Select all companies with revenues over 100 billion and negative profits from the f500 dataframe. Note that the revenues column is expressed in millions of dollars. The result should include all columns.</div>
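
A sketch of combining two conditions with `&` on a hypothetical frame (`toy` and its figures are invented). Since revenues are in millions, 100 billion corresponds to 100000:

```python
import pandas as pd

# Hypothetical stand-in for f500; figures are in millions of dollars.
toy = pd.DataFrame({'revenues': [150000, 120000, 50000],
                    'profits': [-500.0, 3000.0, -20.0]},
                   index=['Alpha', 'Beta', 'Gamma'])

large_revenue = toy['revenues'] > 100000
negative_profits = toy['profits'] < 0

# & compares the two masks element by element.
big_rev_neg_profit = toy[large_revenue & negative_profits]
print(big_rev_neg_profit.index.tolist())  # ['Alpha']
```

When the comparisons are written inline, each one needs its own parentheses, for example `toy[(toy['revenues'] > 100000) & (toy['profits'] < 0)]`, because `&` binds more tightly than `>` and `<`.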



<div class="alert alert-block alert-info">
<b>Exercise:</b> Select all rows for companies headquartered in either Brazil or Venezuela. Assign the result to brazil_venezuela.</div>

<div class="alert alert-block alert-info">
<b>Exercise:</b> Select the first five companies in the Technology sector that are not headquartered in the USA from the f500 dataframe. Assign the result to tech_outside_usa.</div>
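
The `|` and `~` operators follow the same pattern. A sketch on a hypothetical five-row frame (`toy`, its labels and its values are made up):

```python
import pandas as pd

# Hypothetical stand-in for f500.
toy = pd.DataFrame({
    'sector': ['Technology', 'Technology', 'Energy', 'Energy', 'Technology'],
    'country': ['USA', 'South Korea', 'Brazil', 'Venezuela', 'Taiwan'],
}, index=['A', 'B', 'C', 'D', 'E'])

# | keeps a row when either condition holds.
brazil_venezuela = toy[(toy['country'] == 'Brazil') | (toy['country'] == 'Venezuela')]

# ~ inverts a mask; head(5) keeps at most the first five matches.
tech_outside_usa = toy[(toy['sector'] == 'Technology') & ~(toy['country'] == 'USA')].head(5)
print(tech_outside_usa.index.tolist())  # ['B', 'E']
```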

## Sorting Values

<div class="alert alert-block alert-info">
<b>Exercise:</b> Find the company headquartered in Japan with the largest number of employees.</div>
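
One way to approach this is to filter first and then sort with `DataFrame.sort_values()`. The sketch below uses a hypothetical frame; the company names and employee counts are invented:

```python
import pandas as pd

# Hypothetical stand-in for f500; names and employee counts are made up.
toy = pd.DataFrame({
    'company': ['Alpha', 'Beta', 'Gamma', 'Delta'],
    'country': ['Japan', 'USA', 'Japan', 'Japan'],
    'employees': [120000, 500000, 350000, 80000],
})

japan = toy[toy['country'] == 'Japan']

# ascending=False puts the largest value in the first position.
sorted_japan = japan.sort_values('employees', ascending=False)
top_japanese_employer = sorted_japan.iloc[0]['company']
print(top_japanese_employer)  # Gamma
```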

## Using Loops with pandas

<div class="alert alert-block alert-info">
<b>Exercise:</b> Find the company that employs the most people in each country.</div>
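
A loop over the unique values of a column is one way to repeat the same filter-and-sort step per country. This is a sketch on invented data; the dictionary name `top_employer_by_country` is an assumption, not something the exercise prescribes:

```python
import pandas as pd

# Hypothetical stand-in for f500; names and employee counts are made up.
toy = pd.DataFrame({
    'company': ['Alpha', 'Beta', 'Gamma', 'Delta'],
    'country': ['Japan', 'USA', 'Japan', 'USA'],
    'employees': [120000, 500000, 350000, 80000],
})

top_employer_by_country = {}
for country in toy['country'].unique():
    # Rows for this country only, largest employer first.
    selected = toy[toy['country'] == country]
    top_row = selected.sort_values('employees', ascending=False).iloc[0]
    top_employer_by_country[country] = top_row['company']

print(top_employer_by_country)  # {'Japan': 'Gamma', 'USA': 'Beta'}
```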

## Example: Calculating Return on Assets by Country

<div class="alert alert-block alert-info">
<b>Exercise:</b> Create a new column roa in the f500 dataframe, containing the return on assets metric for each company.</div>

<div class="alert alert-block alert-info">
<b>Exercise:</b> Aggregate the data by the sector column, and create a dictionary top_roa_by_sector.</div>
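
A sketch of both steps under two stated assumptions: return on assets is taken to be profits divided by assets, and top_roa_by_sector is taken to map each sector to the company with the highest roa. The frame `toy` and its figures are hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for f500; all figures are made up.
toy = pd.DataFrame({
    'company': ['Alpha', 'Beta', 'Gamma', 'Delta'],
    'sector': ['Retailing', 'Energy', 'Energy', 'Retailing'],
    'profits': [100.0, 50.0, 300.0, 10.0],
    'assets': [1000, 1000, 2000, 500],
})

# Assumed definition: return on assets = profits / assets.
toy['roa'] = toy['profits'] / toy['assets']

# Assumed shape of the result: sector -> company with the highest roa.
top_roa_by_sector = {}
for sector in toy['sector'].unique():
    in_sector = toy[toy['sector'] == sector]
    top_row = in_sector.sort_values('roa', ascending=False).iloc[0]
    top_roa_by_sector[sector] = top_row['company']

print(top_roa_by_sector)  # {'Retailing': 'Alpha', 'Energy': 'Gamma'}
```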

## Understanding SettingWithCopyWarning in pandas


Each row of our data set concerns a single bid on a specific eBay Xbox auction. Here is a brief description of each column:

- auctionid: A unique identifier for each auction.
- bid: The value of the bid.
- bidtime: The age of the auction, in days, at the time of the bid.
- bidder: eBay username of the bidder.
- bidderrate: The bidder's eBay user rating.
- openbid: The opening bid set by the seller for the auction.
- price: The winning bid at the close of the auction.

### What is SettingWithCopyWarning?


<img class="full-width" src="https://www.dataquest.io/wp-content/uploads/2019/01/view-vs-copy.png" alt="view-vs-copy">



<img class="full-width" src="https://www.dataquest.io/wp-content/uploads/2019/01/modifying.png" alt="modifying">



### Chained assignment
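
As a hedged sketch, chained assignment can be pictured as two indexing operations in a row, where the second one may act on a temporary copy. The frame `bids` below is a hypothetical three-row stand-in for the auction data, and the chained form is shown only in a comment because its behaviour depends on the pandas version:

```python
import pandas as pd

# Hypothetical stand-in for the auction data; values are made up.
bids = pd.DataFrame({'bidder': ['a', 'b', 'a'],
                     'bid': [10.0, 20.0, 30.0]})

# Chained form (two indexing operations; this is the pattern the warning targets):
#     bids[bids['bidder'] == 'a']['bid'] = 0

# Single loc call: rows and column are selected in one operation,
# so the assignment is applied to bids itself.
bids.loc[bids['bidder'] == 'a', 'bid'] = 0
print(bids['bid'].tolist())  # [0.0, 20.0, 0.0]
```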





### Hidden chaining



### Tips and tricks for dealing with SettingWithCopyWarning
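
One commonly used habit, sketched here on invented data, is to call `.copy()` when a subset is meant to be modified independently of the original. The frame `bids` and the name `high_bids` are hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for the auction data; values are made up.
bids = pd.DataFrame({'bidder': ['a', 'b', 'c'],
                     'bid': [10.0, 20.0, 30.0]})

# An explicit copy makes it unambiguous that high_bids is a separate object.
high_bids = bids[bids['bid'] > 15].copy()
high_bids['bid'] = high_bids['bid'] * 2

print(high_bids['bid'].tolist())  # [40.0, 60.0]
print(bids['bid'].tolist())       # [10.0, 20.0, 30.0]
```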