# Introduction to Pandas

## Understanding pandas and NumPy


<p></p><center><img alt="anatomy of a dataframe" src="images/df_anatomy_static_resized.svg"></center><p></p>

## About pandas

Key Features of Pandas
- Fast and efficient DataFrame object with default and customized indexing.
- Tools for loading data into in-memory data objects from different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of data sets.
- Label-based slicing, indexing and subsetting of large data sets.
- Columns from a data structure can be deleted or inserted.
- Group by data for aggregation and transformations.
- High performance merging and joining of data.
- Time Series functionality.

## Importing pandas

[Installation guide](https://pandas.pydata.org/docs/getting_started/install.html)

In [None]:
import pandas as pd

In [None]:
import numpy as np

In [None]:
pd.__version__

More detailed documentation, along with tutorials and other resources, can be found at http://pandas.pydata.org/.

## Introduction to the Data

<p>The data set is a CSV file called <code>f500.csv</code>. Here is a data dictionary for some of the columns in the CSV:</p>
<ul>
<li><code>company</code>: Name of the company.</li>
<li><code>rank</code>: Global 500 rank for the company.</li>
<li><code>revenues</code>: Company's total revenue for the fiscal year, in millions of dollars (USD).</li>
<li><code>revenue_change</code>: Percentage change in revenue between the current and prior fiscal year.</li>
<li><code>profits</code>: Net income for the fiscal year, in millions of dollars (USD).</li>
<li><code>ceo</code>: Company's Chief Executive Officer.</li>
<li><code>industry</code>: Industry in which the company operates.</li>
<li><code>sector</code>: Sector in which the company operates.</li>
<li><code>previous_rank</code>: Global 500 rank for the company for the prior year.</li>
<li><code>country</code>: Country in which the company is headquartered.</li>
</ul>
</div>

<img src="images/02_io_readwrite.svg">

## Introducing Pandas Objects - Data Structures

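One of the keys to understanding pandas is to understand its data model. At the core of pandas are three data structures (the third, Panel, was deprecated in pandas 0.20 and removed in 0.25, so current versions provide only Series and DataFrame):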

- Series — 1D (can be understood as columns of a spreadsheet)

<img src="images/01_table_series.svg">

- DataFrame — 2D (can be understood as a single spreadsheet)

<img src="images/01_table_dataframe.svg">

- Panel — 3D (can be understood as a group of spreadsheets)

<table class="table table-bordered">
<tbody><tr>
<th style="text-align:center;">Data Structure</th>
<th style="text-align:center;">Dimensions</th>
<th style="text-align:center;">Description</th>
</tr>
<tr>
<td style="text-align:center;">Series</td>
<td style="text-align:center;">1</td>
<td style="text-align:center;">1D labeled homogeneous array, size-immutable.</td>
</tr>
<tr>
<td style="text-align:center;">DataFrame</td>
<td style="text-align:center;">2</td>
<td style="text-align:center;">General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns.</td>
</tr>
<tr>
<td style="text-align:center;">Panel</td>
<td style="text-align:center;">3</td>
<td style="text-align:center;">General 3D labeled, size-mutable array (removed in pandas 0.25).</td>
</tr>
</tbody></table>

## Introducing DataFrames

## Pandas Data Selection - indexing

### Selecting a Column From a DataFrame by Label (.loc)

    df.loc[row_label, column_label]

<img src="images/03_subset_columns.svg">

<div>
<p><img alt="dataframe exploded" src="images/df_exploded_resized.svg"></p>
</div>

<div>

<p>A summary of the techniques we've learned so far is below:</p>
<p></p><center>
<table>
<thead>
<tr>
<th>Select by Label</th>
<th>Explicit Syntax</th>
<th>Common Shorthand</th>
<th>Other Shorthand</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single column</td>
<td><code>df.loc[:,"col1"]</code></td>
<td bgcolor="#00FF00"><code>df["col1"]</code></td>
<td><code>df.col1</code></td>
</tr>
<tr>
<td>List of columns</td>
<td><code>df.loc[:,["col1", "col7"]]</code></td>
<td bgcolor="#00FF00"><code>df[["col1", "col7"]]</code></td>
<td></td>
</tr>
<tr>
<td>Slice of columns</td>
<td bgcolor="#00FF00"><code>df.loc[:,"col1":"col4"]</code></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
</center><p></p>
</div>
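
The column selections in the table can be tried on a small frame. This is a minimal sketch: the frame `df` and its labels `col1` to `col4` are made up for illustration and are not part of the f500 data.

```python
import pandas as pd

# Hypothetical toy frame; the labels are illustrative only
df = pd.DataFrame(
    {"col1": [1, 2], "col2": [3, 4], "col3": [5, 6], "col4": [7, 8]},
    index=["row1", "row2"],
)

single = df["col1"]                  # Series, same as df.loc[:, "col1"]
several = df[["col1", "col3"]]       # DataFrame with the listed columns
sliced = df.loc[:, "col2":"col4"]    # label slices include the end label
```

Note that a single label returns a Series, while a list of labels returns a DataFrame even when the list has one element.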

<div class="alert alert-block alert-info">
<b>Exercise:</b> Select the country column. Assign the result to the variable name countries.</div>

<div class="alert alert-block alert-info">
<b>Exercise:</b> In order, select the revenues and years_on_global_500_list columns. Assign the result to the variable name revenues_years.</div>

<div class="alert alert-block alert-info">
<b>Exercise:</b> In order, select all columns from ceo up to and including sector. Assign the result to the variable name ceo_to_sector.</div>

### Selecting Rows From a DataFrame by Label (.loc)

    df.loc[row_label, column_label]

<img src="images/03_subset_rows.svg">

**Select a single row**

**Select a list of rows**

In [None]:
rows = ["Toyota Motor", "Walmart"]  # row labels (company names)



**Select a slice object with labels**

In [None]:
# A label slice is only valid inside the indexer; both end labels are included
slice_rows_names = f500.loc["State Grid":"Toyota Motor"]

    
    

<img alt="series vs dataframe: series" src="images/df_series_s_updated.svg">

<img alt="series vs dataframe: dataframe" src="images/df_series_df_updated.svg">

### Selecting Items from a Series by Label (.loc)

<table>
<thead>
<tr>
<th>Select by Label</th>
<th>Explicit Syntax</th>
<th>Shorthand Convention</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single item from series</td>
<td><code>s.loc["item8"]</code></td>
<td bgcolor="#00FF00"> <code>s["item8"]</code></td>
</tr>
<tr>
<td>List of items from series</td>
<td><code>s.loc[["item1","item7"]]</code></td>
<td bgcolor="#00FF00"><code>s[["item1","item7"]]</code></td>
</tr>
<tr>
<td>Slice of items from series</td>
<td><code>s.loc["item2":"item4"]</code></td>
<td bgcolor="#00FF00"><code>s["item2":"item4"]</code></td>
</tr>
</tbody>
</table>

### Summary of label selection (.loc)

<table>
<thead>
<tr>
<th>Select by Label</th>
<th>Explicit Syntax</th>
<th>Shorthand Convention</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single column from dataframe</td>
<td><code>df.loc[:,"col1"]</code></td>
<td bgcolor="#00FF00"><code>df["col1"]</code></td>
</tr>
<tr>
<td>List of columns from dataframe</td>
<td><code>df.loc[:,["col1","col7"]]</code></td>
<td bgcolor="#00FF00"><code>df[["col1","col7"]]</code></td>
</tr>
<tr>
<td>Slice of columns from dataframe</td>
<td bgcolor="#00FF00"><code>df.loc[:,"col1":"col4"]</code></td>
<td></td>
</tr>
<tr>
<td>Single row from dataframe</td>
<td bgcolor="#00FF00"><code>df.loc["row4"]</code></td>
<td></td>
</tr>
<tr>
<td>List of rows from dataframe</td>
<td bgcolor="#00FF00"><code>df.loc[["row1", "row8"]]</code></td>
<td></td>
</tr>
<tr>
<td>Slice of rows from dataframe</td>
<td bgcolor="#00FF00"><code>df.loc["row3":"row5"]</code></td>
<td><code>df["row3":"row5"]</code></td>
</tr>
<tr>
<td>Single item from series</td>
<td><code>s.loc["item8"]</code></td>
<td bgcolor="#00FF00"><code>s["item8"]</code></td>
</tr>
<tr>
<td>List of items from series</td>
<td><code>s.loc[["item1","item7"]]</code></td>
<td bgcolor="#00FF00"><code>s[["item1","item7"]]</code></td>
</tr>
<tr>
<td>Slice of items from series</td>
<td><code>s.loc["item2":"item4"]</code></td>
<td bgcolor="#00FF00"><code>s["item2":"item4"]</code></td>
</tr>
</tbody>
</table>
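
The row and series selections in the summary can be sketched the same way. The frame below is a made-up example; its row and column labels are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy frame indexed by row label
df = pd.DataFrame(
    {"col1": [1, 2, 3], "col2": [4, 5, 6]},
    index=["row1", "row2", "row3"],
)

one_row = df.loc["row2"]                 # Series holding a single row
two_rows = df.loc[["row1", "row3"]]      # DataFrame with the listed rows
row_slice = df.loc["row1":"row2"]        # label slice, end label included
cell = df.loc["row3", "col2"]            # single value at a row and a column

s = df["col1"]
item = s["row2"]                         # shorthand for s.loc["row2"]
```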

<div class="alert alert-block alert-info">
<b>Exercise:</b> Create a new variable big_movers containing the rows with indices Aviva, HP, JD.com, and BHP Billiton, in that order, and the rank and previous_rank columns, in that order.</div>

In [None]:
big_movers = f500.loc[["Aviva", "HP", "JD.com", "BHP Billiton"], ["rank","previous_rank"]]
big_movers

<div class="alert alert-block alert-info">
<b>Exercise:</b> Create a new variable bottom_companies containing all rows with indices from National Grid to AutoNation, inclusive, and the rank, sector, and country columns.</div>

In [None]:
bottom_companies = f500.loc["National Grid":"AutoNation", ["rank","sector","country"]]
bottom_companies

## Vectorized Operations

<p><img alt="Vectorized operation" src="images/vectorized.gif"></p>

In [None]:
my_series = pd.Series([1, 2, 3, 4, 5])
my_series

In [None]:
my_series = my_series + 10

In [None]:
my_series

<div>
<ul>
<li><code>series_a + series_b</code> - Addition</li>
<li><code>series_a - series_b</code> - Subtraction</li>
<li><code>series_a * series_b</code> - Multiplication (element-wise; this is unrelated to the matrix multiplication used in linear algebra).</li>
<li><code>series_a / series_b</code> - Division</li>
</ul>
</div>
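
As a quick illustration of the operators listed above, here is a sketch with two made-up series of equal length. Values are paired by index label, which here is the default 0, 1, 2.

```python
import pandas as pd

series_a = pd.Series([10, 20, 30])
series_b = pd.Series([1, 2, 3])

total = series_a + series_b        # 11, 22, 33
difference = series_a - series_b   # 9, 18, 27
product = series_a * series_b      # 10, 40, 90 (element-wise)
ratio = series_a / series_b        # 10.0, 10.0, 10.0 (always float)
```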

## Series Data Exploration Methods

<div>
<ul>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.max.html"><code>Series.max()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.min.html"><code>Series.min()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mean.html"><code>Series.mean()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.median.html"><code>Series.median()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mode.html"><code>Series.mode()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sum.html"><code>Series.sum()</code></a></li>
</ul>

</div>

In [None]:
my_series = pd.Series([0, 1, 2, 3, 4])
my_series

In [None]:
print(my_series.sum())

<div class="alert alert-block alert-info">
<b>Exercise:</b> Use the Series.max() method to find the maximum value for the rank_change series. Assign the result to the variable rank_change_max.</div>

<div class="alert alert-block alert-info">
<b>Exercise:</b> Use the Series.min() method to find the minimum value for the rank_change series. Assign the result to the variable rank_change_min.</div>

### Series Describe Method
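
`Series.describe()` returns a new Series of summary statistics, and the statistics it reports depend on the dtype of the data. The sketch below uses made-up values rather than the f500 columns.

```python
import pandas as pd

numbers = pd.Series([1, 2, 3, 4, 5])
num_stats = numbers.describe()
# Index for numeric data: count, mean, std, min, 25%, 50%, 75%, max

words = pd.Series(["a", "b", "a"])
word_stats = words.describe()
# Index for object data: count, unique, top, freq
```

Because the result is itself a Series, individual statistics can be read by label, for example `num_stats["mean"]`.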

In [None]:
assets = f500["assets"]




<div class="alert alert-block alert-info">
<b>Exercise:</b> Return a series of descriptive statistics for the rank column in f500.</div>

In [None]:
rank = f500["rank"]



<div class="alert alert-block alert-info">
<b>Exercise:</b> Return a series of descriptive statistics for the previous_rank column in f500.</div>

In [None]:
prev_rank = f500["previous_rank"]


## Method Chaining
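
Method chaining means calling a method or indexer directly on the result of the previous call, without storing the intermediate value. A minimal sketch on made-up data (the `colors` series is hypothetical and not part of f500):

```python
import pandas as pd

# Hypothetical data for illustration
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])

# value_counts() returns a Series indexed by the distinct values,
# so .loc can be applied directly to its result
red_count = colors.value_counts().loc["red"]
```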

<div class="alert alert-block alert-info">
<b>Exercise:</b> Use Series.value_counts() and Series.loc to return the number of companies with a value of 0 in the previous_rank column in the f500 dataframe. Assign the result to zero_previous_rank.</div>

## Dataframe Exploration Methods

<div>

<ul>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.max.html"><code>Series.max()</code></a> and <a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.max.html"><code>DataFrame.max()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.min.html"><code>Series.min()</code></a> and <a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.min.html"><code>DataFrame.min()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mean.html"><code>Series.mean()</code></a> and <a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html"><code>DataFrame.mean()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.median.html"><code>Series.median()</code></a> and <a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.median.html"><code>DataFrame.median()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mode.html"><code>Series.mode()</code></a> and <a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mode.html"><code>DataFrame.mode()</code></a></li>
<li><a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sum.html"><code>Series.sum()</code></a> and <a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html"><code>DataFrame.sum()</code></a></li>
</ul>

<p><img alt="dataframe axis parameters" src="images/axis_param.svg"></p>

</div>
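
The `axis` parameter shown in the figure controls the direction in which a DataFrame method aggregates. A small sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

col_sums = df.sum(axis=0)                # one value per column: a=6, b=60
row_sums = df.sum(axis=1)                # one value per row: 11, 22, 33
col_sums_named = df.sum(axis="index")    # string alias for axis=0
```

`axis=0` (or `"index"`) collapses the rows and gives one result per column; `axis=1` (or `"columns"`) collapses the columns and gives one result per row.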

<div class="alert alert-block alert-info">
<b>Exercise:</b> Use the DataFrame.max() method to find the maximum value for only the numeric columns from f500 (you may need to check the documentation). Assign the result to the variable max_f500.</div>

### Dataframe Describe Method

<div class="alert alert-block alert-info">
<b>Exercise:</b> Return a dataframe of descriptive statistics for all of the numeric columns in f500. Assign the result to f500_desc.</div>

## Assignment with pandas

<div class="alert alert-block alert-info">
<b>Exercise:</b> The company "Dow Chemical" has named a new CEO. In the f500 dataframe, update the value at the row label Dow Chemical and the ceo column to Jim Fitterling.</div>

## Using Boolean Indexing with pandas Objects

In [None]:
d = {'name': ['Bob', 'Eva', 'Sara', 'Mihael'], 'num': [12, 8, 5, 8]}
df = pd.DataFrame(data=d, index=['w', 'x', 'y', 'z'])
df
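
Using the same small frame, a comparison produces a boolean Series that can then be used to select rows. This is a sketch; the variable names `num_bool`, `eights`, and `names` are introduced here for illustration.

```python
import pandas as pd

d = {'name': ['Bob', 'Eva', 'Sara', 'Mihael'], 'num': [12, 8, 5, 8]}
df = pd.DataFrame(data=d, index=['w', 'x', 'y', 'z'])

num_bool = df['num'] == 8            # boolean Series: False, True, False, True
eights = df[num_bool]                # keeps rows 'x' and 'z'
names = df.loc[num_bool, 'name']     # only the name column of those rows
```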

<div class="alert alert-block alert-info">
<b>Exercise:</b> Create a boolean series, motor_bool, that compares whether the values in the industry column from the f500 dataframe are equal to "Motor Vehicles and Parts".
Use the motor_bool boolean series to index the country column. Assign the result to motor_countries.</div>

### Using Boolean Arrays to Assign Values

In [None]:
sector = "Motor Vehicles & Parts"


In [None]:
sector_and = "Motor Vehicles and Parts"
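
A boolean comparison can also sit on the left-hand side of an assignment, so that only the matching rows are updated. The sketch below uses a made-up three-row frame instead of f500; the names `old` and `new` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for f500
df = pd.DataFrame(
    {"sector": ["Motor Vehicles & Parts", "Energy", "Motor Vehicles & Parts"]}
)

old = "Motor Vehicles & Parts"
new = "Motor Vehicles and Parts"

# Rows where the comparison is True get the new value; others are untouched
df.loc[df["sector"] == old, "sector"] = new
```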


## Creating New Columns

## Exercise: Top Performers by Country

## Reading CSV files with pandas

<div>
<p><img alt="csv_to_dataframe" src="images/csv_to_dataframe.svg"></p>


</div>

In [None]:
f500 = pd.read_csv("data/f500.csv")
f500.head()

In [None]:
f500.index

In [None]:
f500.columns
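
By default `pd.read_csv()` gives the rows integer labels starting at 0, while `index_col=0` uses the first column as row labels. The sketch below reads from an in-memory string so it runs without a file; the CSV text is made up and is not the content of `f500.csv`.

```python
import io
import pandas as pd

# Made-up CSV text, read from memory instead of from disk
csv_text = "company,rank,revenues\nAlpha,1,500\nBeta,2,300\n"

default_index = pd.read_csv(io.StringIO(csv_text))           # labels 0, 1
labelled = pd.read_csv(io.StringIO(csv_text), index_col=0)   # labels Alpha, Beta

beta_rank = labelled.loc["Beta", "rank"]
```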

## Using iloc to select by integer position

In [None]:
cols = ['company', 'rank', 'revenues']



<p><img alt="selection using iloc" src="images/selection_iloc.svg"></p>

    df.iloc[row_index, column_index]

<div class="alert alert-block alert-info">
<b>Exercise:</b> Select just the fifth row of the f500 dataframe. Assign the result to fifth_row.</div>

<div class="alert alert-block alert-info">
<b>Exercise:</b> Select the value in the first row of the company column. Assign the result to company_value.</div>

<div>

<table>
<thead>
<tr>
<th>Select by integer position</th>
<th>Explicit Syntax</th>
<th>Shorthand Convention</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single column from dataframe</td>
<td><code>df.iloc[:,3]</code></td>
<td></td>
</tr>
<tr>
<td>List of columns from dataframe</td>
<td><code>df.iloc[:,[3,5,6]]</code></td>
<td></td>
</tr>
<tr>
<td>Slice of columns from dataframe</td>
<td><code>df.iloc[:,3:7]</code></td>
<td></td>
</tr>
<tr>
<td>Single row from dataframe</td>
<td><code>df.iloc[20]</code></td>
<td></td>
</tr>
<tr>
<td>List of rows from dataframe</td>
<td><code>df.iloc[[0,3,8]]</code></td>
<td></td>
</tr>
<tr>
<td>Slice of rows from dataframe</td>
<td><code>df.iloc[3:5]</code></td>
<td><code>df[3:5]</code></td>
</tr>
<tr>
<td>Single item from series</td>
<td><code>s.iloc[8]</code></td>
<td><code>s[8]</code></td>
</tr>
<tr>
<td>List of items from series</td>
<td><code>s.iloc[[2,8,1]]</code></td>
<td><code>s[[2,8,1]]</code></td>
</tr>
<tr>
<td>Slice of items from series</td>
<td><code>s.iloc[5:10]</code></td>
<td><code>s[5:10]</code></td>
</tr>
</tbody>
</table>
</div>
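
The positional selections in the table can be sketched on a small made-up frame. Unlike label slices with `loc`, position slices with `iloc` exclude the end position.

```python
import pandas as pd

# Hypothetical toy frame; the string row labels are ignored by iloc
df = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "c": [9, 10, 11, 12]},
    index=["w", "x", "y", "z"],
)

second_row = df.iloc[1]            # Series for the row at position 1 ("x")
first_two_rows = df.iloc[0:2]      # positions 0 and 1; the end is excluded
corner = df.iloc[[0, 3], 0:2]      # rows 0 and 3, columns at positions 0 and 1
value = df.iloc[2, 1]              # single value: 7
```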

<div class="alert alert-block alert-info">
<b>Exercise:</b> Select the first three rows of the f500 dataframe. Assign the result to first_three_rows.</div>

<div class="alert alert-block alert-info">
<b>Exercise:</b> Select the first and seventh rows and the first five columns of the f500 dataframe. Assign the result to first_seventh_row_slice.</div>

## Using pandas methods to create boolean masks

In [None]:
cols = ["company","country","sector"]

<div class="alert alert-block alert-info">
<b>Exercise:</b> Use the Series.isnull() method to select all rows from f500 that have a null value for the previous_rank column. Select only the company, rank, and previous_rank columns. Assign the result to null_previous_rank.</div>

In [None]:
import numpy as np
# prepared in advance
f500 = pd.read_csv("data/f500.csv")
f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan

## Working with Integer Labels

<p><img alt="loc vs iloc for rows in different order" src="images/integer_labels_2.svg"></p>
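
When the labels are integers that are no longer in order, `loc` and `iloc` can return different items for the same number. A minimal sketch with a made-up series:

```python
import pandas as pd

# Integer labels that are deliberately out of order
s = pd.Series(["a", "b", "c", "d"], index=[3, 0, 2, 1])

by_label = s.loc[0]       # "b": the item whose label is 0
by_position = s.iloc[0]   # "a": the item in the first position
first_two = s.iloc[:2]    # positions 0 and 1, which carry labels 3 and 0
```

This situation typically arises after filtering or sorting a dataframe that was read with the default integer index.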

<div class="alert alert-block alert-info">
<b>Exercise:</b> Assign the first five rows of the null_previous_rank dataframe to the variable top5_null_prev_rank by choosing the correct method out of either loc[] or iloc[].</div>

## Pandas Index Alignment

In [None]:
food = pd.DataFrame({'fruit_veg': ['fruit', 'veg', 'fruit', 'veg', 'veg'], 'qty': [4, 2, 4, 1, 2]}, 
                    index=['tomato', 'carrot', 'lime', 'corn', 'eggplant'])

In [None]:
alt_name = pd.Series(['rocket', 'aubergine', 'maize'], index=['arugula', 'eggplant', 'corn'])
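
With the two objects defined above, assigning the series as a new column shows how pandas aligns on index labels rather than on position. The sketch repeats both definitions so that it runs on its own; the column name `alt_name` is chosen here for illustration.

```python
import pandas as pd

food = pd.DataFrame(
    {'fruit_veg': ['fruit', 'veg', 'fruit', 'veg', 'veg'], 'qty': [4, 2, 4, 1, 2]},
    index=['tomato', 'carrot', 'lime', 'corn', 'eggplant'],
)
alt_name = pd.Series(['rocket', 'aubergine', 'maize'],
                     index=['arugula', 'eggplant', 'corn'])

# Values are matched by label: 'corn' and 'eggplant' receive a value,
# rows without a match get NaN, and 'arugula' is dropped because
# food has no row with that label
food['alt_name'] = alt_name
```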

<div class="alert alert-block alert-info">
<b>Exercise:</b> Use the Series.notnull() method to select all rows from f500 that have a non-null value for the previous_rank column. Assign the result to previously_ranked. From the previously_ranked dataframe, subtract the rank column from the previous_rank column. Assign the result to rank_change. Assign the values in rank_change to a new column in the f500 dataframe, "rank_change".</div>

## Boolean Operators

<div>
<table>
<thead>
<tr>
<th>pandas</th>
<th>Python equivalent</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>a &amp; b</code></td>
<td><code>a and b</code></td>
<td><code>True</code> if both <code>a</code> and <code>b</code> are <code>True</code>, else <code>False</code></td>
</tr>
<tr>
<td><code>a | b</code></td>
<td><code>a or b</code></td>
<td><code>True</code> if either <code>a</code> or <code>b</code> is <code>True</code></td>
</tr>
<tr>
<td><code>~a</code></td>
<td><code>not a</code></td>
<td><code>True</code> if <code>a</code> is <code>False</code>, else <code>False</code></td>
</tr>
</tbody>
</table>

<p><img alt="boolean operators example 1" src="images/bool_ops_1.svg"></p>

<p><img alt="boolean operators example 2" src="images/bool_ops_2.svg"></p>

<p><img alt="boolean operators example 3" src="images/bool_ops_3.svg"></p>

<p><img alt="boolean operators example 4" src="images/bool_ops_4.svg"></p>

</div>
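
The three operators from the table can be combined with comparisons to build row filters. This is a sketch on a made-up frame; the column names and values are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"revenue": [120, 80, 150, 60], "profit": [-5, 10, 20, -1]})

# Each comparison needs its own parentheses because & and | bind
# more tightly than comparison operators such as > and <
big_and_losing = df[(df["revenue"] > 100) & (df["profit"] < 0)]   # row 0
big_or_losing = df[(df["revenue"] > 100) | (df["profit"] < 0)]    # rows 0, 2, 3
not_big = df[~(df["revenue"] > 100)]                              # rows 1, 3
```

The Python keywords `and`, `or`, and `not` do not work element-wise on a Series, which is why pandas uses `&`, `|`, and `~` instead.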

<div class="alert alert-block alert-info">
<b>Exercise:</b> Select all companies with revenues over 100 billion and negative profits from the f500 dataframe. The result should include all columns.</div>

<div class="alert alert-block alert-info">
<b>Exercise:</b> Select all rows for companies headquartered in either Brazil or Venezuela. Assign the result to brazil_venezuela.</div>

<div class="alert alert-block alert-info">
<b>Exercise:</b> Select the first five companies in the Technology sector that are not headquartered in the USA from the f500 dataframe. Assign the result to tech_outside_usa.</div>

## Sorting Values
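
`DataFrame.sort_values()` returns a new dataframe ordered by the given column; `ascending=False` puts the largest values first. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy frame
df = pd.DataFrame({"company": ["A", "B", "C"], "employees": [500, 1200, 800]})

by_size = df.sort_values("employees", ascending=False)
largest = by_size.iloc[0]["company"]   # first row after sorting: "B"
```

The original row labels travel with the rows, so after sorting `iloc[0]` is the way to reach the first row regardless of its label.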

<div class="alert alert-block alert-info">
<b>Exercise:</b> Find the company headquartered in Japan with the largest number of employees.</div>

## Using Loops with pandas

In [None]:
avg_rev_by_country = {}

countries = f500["country"].unique()

In [None]:
for c in countries:
    selected_rows = f500[f500["country"] == c]
    mean = selected_rows["revenues"].mean()
    avg_rev_by_country[c] = mean
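
The loop above builds the dictionary one country at a time. For comparison, the sketch below runs the same loop on a made-up frame and then computes the same result with `DataFrame.groupby()`, a different technique that this section does not otherwise cover. The frame `f500_toy` is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for f500
f500_toy = pd.DataFrame({
    "country": ["USA", "Japan", "USA", "Japan"],
    "revenues": [100.0, 50.0, 300.0, 150.0],
})

# Loop version: filter the rows for each country, then average
avg_loop = {}
for c in f500_toy["country"].unique():
    avg_loop[c] = f500_toy.loc[f500_toy["country"] == c, "revenues"].mean()

# groupby version: split by country, average revenues, convert to a dict
avg_groupby = f500_toy.groupby("country")["revenues"].mean().to_dict()
```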

<div class="alert alert-block alert-info">
<b>Exercise:</b> Find the company that employs the most people in each country.</div>

In [None]:
top_employer_by_country = {}

countries = f500["country"].unique()
for c in countries:
    selected_rows = f500[f500["country"] == c]
    sorted_rows = selected_rows.sort_values("employees", ascending=False)
    top_employer = sorted_rows.iloc[0]
    employer_name = top_employer["company"]
    top_employer_by_country[c] = employer_name

In [None]:
#top_employer_by_country

## Example: Calculating Return on Assets by Country

    return on assets = profit/assets
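
Because division between two columns is vectorized, the formula can be applied to every row at once and stored as a new column. The sketch uses a made-up two-row frame, not f500; the column name `roa` follows the exercise below.

```python
import pandas as pd

# Hypothetical toy frame with the two columns the formula needs
toy = pd.DataFrame(
    {"profits": [10.0, 30.0], "assets": [100.0, 600.0]},
    index=["A", "B"],
)

# Element-wise division; assigning to a new label creates the column
toy["roa"] = toy["profits"] / toy["assets"]
```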

<div class="alert alert-block alert-info">
<b>Exercise:</b> Create a new column roa in the f500 dataframe, containing the return on assets metric for each company.</div>

<div class="alert alert-block alert-info">
<b>Exercise:</b> Aggregate the data by the sector column, and create a dictionary top_roa_by_sector.</div>

## Understanding SettingWithCopyWarning in pandas


### What is SettingWithCopyWarning?



<img class="full-width" src="https://www.dataquest.io/wp-content/uploads/2019/01/view-vs-copy.png" alt="view-vs-copy">



<img class="full-width" src="https://www.dataquest.io/wp-content/uploads/2019/01/modifying.png" alt="modifying">


### Chained assignment




### Hidden chaining



### Tips and tricks for dealing with SettingWithCopyWarning

### Chained assignment in Depth

In [None]:
df1 = pd.DataFrame(np.arange(6).reshape((3,2)), columns=list('AB'))
df1
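
Starting from the same `df1`, the sketch below contrasts a chained assignment (left as a comment, since its behavior depends on the pandas version) with a single `.loc` call and an explicit copy. The name `subset` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape((3, 2)), columns=list('AB'))

# Chained form: two indexing steps in a row. The assignment may land on
# a temporary copy, so df1 may not change and pandas may emit a warning:
#   df1[df1['A'] > 2]['B'] = 0

# Single .loc call: rows and column are chosen in one indexing step
df1.loc[df1['A'] > 2, 'B'] = 0

# An explicit copy can be modified without affecting df1
subset = df1[df1['A'] > 2].copy()
subset['B'] = 99
```

After this runs, column `B` of `df1` is 1, 3, 0, and the change to `subset` has not propagated back.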