# Part 04: Parsing, data analysis, and report generation

# Reading and writing special file formats

## Introduction to pandas

> Install pandas into the virtual environment

<div><p>The NumPy library makes working with data easier. Because we can work across multiple dimensions, our code is easier to understand, and by using vectorized operations instead of loops, it runs faster on larger data.</p>
<p>Although NumPy provides fundamental structures and tools that make working with data easier, there are several things that limit its usefulness:</p>
<ul>
<li>The lack of support for column names forces us to frame questions as multi-dimensional array operations.</li>
<li>Support for only one data type per ndarray makes it more difficult to work with data that contains both numeric and string data.</li>
<li>There are lots of low level methods, but there are many common analysis patterns that don't have pre-built methods.</li>
</ul>
<p>The <strong>pandas</strong> library provides solutions to all of these pain points and more. Pandas is not so much a replacement for NumPy as an <em>extension</em> of NumPy. The underlying code for pandas uses the NumPy library extensively, which means the concepts you've been learning will come in handy as you begin to learn more about pandas.</p>
<p>The primary data structure in pandas is called a <a target="_blank" href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html#pandas.DataFrame"><strong>dataframe</strong></a>. Dataframes are the pandas equivalent of a NumPy 2D ndarray, with a few key differences:</p>
<ul>
<li>Axis values can have string <strong>labels</strong>, not just numeric ones.</li>
<li>Dataframes can contain columns with <strong>multiple data types</strong>: including integer, float, and string.</li>
</ul>
</div>

### About pandas

pandas is an open-source Python library that provides high-performance data manipulation and analysis tools built on its own data structures. The name is derived from "panel data", an econometrics term for multidimensional data sets.

In 2008, developer Wes McKinney started building pandas because he needed a high-performance, flexible tool for data analysis.

Before pandas, Python was used mostly for data munging and preparation and contributed little to data analysis itself. pandas filled this gap. With pandas, we can carry out five typical steps in processing and analyzing data, regardless of its origin: load, prepare, manipulate, model, and analyze.

Python with pandas is used in a wide range of academic and commercial fields, including finance, economics, statistics, and analytics.

Key features of pandas:
- Fast and efficient DataFrame object with default and customized indexing.
- Tools for loading data into in-memory data objects from different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of data sets.
- Label-based slicing, indexing and subsetting of large data sets.
- Columns from a data structure can be deleted or inserted.
- Group by data for aggregation and transformations.
- High performance merging and joining of data.
- Time Series functionality.

Similar to the import convention for NumPy (import numpy as np), the import convention for pandas is:



In [2]:
import pandas as pd

In [6]:
f500 = pd.read_csv('data/f500.csv')

In [7]:
f500.head(3)

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
0,Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798
1,State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456
2,Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4,China,"Beijing, China",http://www.sinopec.com,19,713288,106523


> We show the basic methods for inspecting the data
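
The following is a minimal sketch of the inspection methods most often used on a freshly loaded dataframe. The small dataframe is a stand-in built from the three rows shown above, so the snippet runs without `data/f500.csv`; the choice of methods is an assumption about what the note refers to.

```python
import pandas as pd

# Stand-in data copied from the first rows shown above; the course uses data/f500.csv.
df = pd.DataFrame(
    {
        "company": ["Walmart", "State Grid", "Sinopec Group"],
        "rank": [1, 2, 3],
        "revenues": [485873, 315199, 267518],
    }
)

print(df.shape)       # (number of rows, number of columns)
print(df.dtypes)      # data type of each column
print(df.head(2))     # first two rows
print(df.describe())  # summary statistics for the numeric columns
df.info()             # column names, non-null counts, memory usage
```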

### Reading and Writing Data with pandas

The pandas I/O API is a set of top level reader functions accessed like pandas.read_csv() that generally return a pandas object. The corresponding writer functions are object methods that are accessed like DataFrame.to_csv(). Below is a table containing available readers and writers.

[IO tools (text, CSV, HDF5, …)](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html)

<table class="colwidths-given table">
<colgroup>
<col style="width: 12%">
<col style="width: 40%">
<col style="width: 24%">
<col style="width: 24%">
</colgroup>
<thead>
<tr class="row-odd"><th class="head"><p>Format Type</p></th>
<th class="head"><p>Data Description</p></th>
<th class="head"><p>Reader</p></th>
<th class="head"><p>Writer</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p>text</p></td>
<td><p><a class="reference external" href="https://en.wikipedia.org/wiki/Comma-separated_values">CSV</a></p></td>
<td><p><a class="reference internal" href="#io-read-csv-table"><span class="std std-ref">read_csv</span></a></p></td>
<td><p><a class="reference internal" href="#io-store-in-csv"><span class="std std-ref">to_csv</span></a></p></td>
</tr>
<tr class="row-odd"><td><p>text</p></td>
<td><p>Fixed-Width Text File</p></td>
<td><p><a class="reference internal" href="#io-fwf-reader"><span class="std std-ref">read_fwf</span></a></p></td>
<td></td>
</tr>
<tr class="row-even"><td><p>text</p></td>
<td><p><a class="reference external" href="https://www.json.org/">JSON</a></p></td>
<td><p><a class="reference internal" href="#io-json-reader"><span class="std std-ref">read_json</span></a></p></td>
<td><p><a class="reference internal" href="#io-json-writer"><span class="std std-ref">to_json</span></a></p></td>
</tr>
<tr class="row-odd"><td><p>text</p></td>
<td><p><a class="reference external" href="https://en.wikipedia.org/wiki/HTML">HTML</a></p></td>
<td><p><a class="reference internal" href="#io-read-html"><span class="std std-ref">read_html</span></a></p></td>
<td><p><a class="reference internal" href="#io-html"><span class="std std-ref">to_html</span></a></p></td>
</tr>
<tr class="row-even"><td><p>text</p></td>
<td><p>Local clipboard</p></td>
<td><p><a class="reference internal" href="#io-clipboard"><span class="std std-ref">read_clipboard</span></a></p></td>
<td><p><a class="reference internal" href="#io-clipboard"><span class="std std-ref">to_clipboard</span></a></p></td>
</tr>
<tr class="row-odd"><td><p>binary</p></td>
<td><p><a class="reference external" href="https://en.wikipedia.org/wiki/Microsoft_Excel">MS Excel</a></p></td>
<td><p><a class="reference internal" href="#io-excel-reader"><span class="std std-ref">read_excel</span></a></p></td>
<td><p><a class="reference internal" href="#io-excel-writer"><span class="std std-ref">to_excel</span></a></p></td>
</tr>
<tr class="row-even"><td><p>binary</p></td>
<td><p><a class="reference external" href="http://www.opendocumentformat.org">OpenDocument</a></p></td>
<td><p><a class="reference internal" href="#io-ods"><span class="std std-ref">read_excel</span></a></p></td>
<td></td>
</tr>
<tr class="row-odd"><td><p>binary</p></td>
<td><p><a class="reference external" href="https://support.hdfgroup.org/HDF5/whatishdf5.html">HDF5 Format</a></p></td>
<td><p><a class="reference internal" href="#io-hdf5"><span class="std std-ref">read_hdf</span></a></p></td>
<td><p><a class="reference internal" href="#io-hdf5"><span class="std std-ref">to_hdf</span></a></p></td>
</tr>
<tr class="row-even"><td><p>binary</p></td>
<td><p><a class="reference external" href="https://github.com/wesm/feather">Feather Format</a></p></td>
<td><p><a class="reference internal" href="#io-feather"><span class="std std-ref">read_feather</span></a></p></td>
<td><p><a class="reference internal" href="#io-feather"><span class="std std-ref">to_feather</span></a></p></td>
</tr>
<tr class="row-odd"><td><p>binary</p></td>
<td><p><a class="reference external" href="https://parquet.apache.org/">Parquet Format</a></p></td>
<td><p><a class="reference internal" href="#io-parquet"><span class="std std-ref">read_parquet</span></a></p></td>
<td><p><a class="reference internal" href="#io-parquet"><span class="std std-ref">to_parquet</span></a></p></td>
</tr>
<tr class="row-even"><td><p>binary</p></td>
<td><p><a class="reference external" href="https://orc.apache.org/">ORC Format</a></p></td>
<td><p><a class="reference internal" href="#io-orc"><span class="std std-ref">read_orc</span></a></p></td>
<td></td>
</tr>
<tr class="row-odd"><td><p>binary</p></td>
<td><p><a class="reference external" href="https://msgpack.org/index.html">Msgpack</a></p></td>
<td><p><a class="reference internal" href="#io-msgpack"><span class="std std-ref">read_msgpack</span></a></p></td>
<td><p><a class="reference internal" href="#io-msgpack"><span class="std std-ref">to_msgpack</span></a></p></td>
</tr>
<tr class="row-even"><td><p>binary</p></td>
<td><p><a class="reference external" href="https://en.wikipedia.org/wiki/Stata">Stata</a></p></td>
<td><p><a class="reference internal" href="#io-stata-reader"><span class="std std-ref">read_stata</span></a></p></td>
<td><p><a class="reference internal" href="#io-stata-writer"><span class="std std-ref">to_stata</span></a></p></td>
</tr>
<tr class="row-odd"><td><p>binary</p></td>
<td><p><a class="reference external" href="https://en.wikipedia.org/wiki/SAS_(software)">SAS</a></p></td>
<td><p><a class="reference internal" href="#io-sas-reader"><span class="std std-ref">read_sas</span></a></p></td>
<td></td>
</tr>
<tr class="row-even"><td><p>binary</p></td>
<td><p><a class="reference external" href="https://en.wikipedia.org/wiki/SPSS">SPSS</a></p></td>
<td><p><a class="reference internal" href="#io-spss-reader"><span class="std std-ref">read_spss</span></a></p></td>
<td></td>
</tr>
<tr class="row-odd"><td><p>binary</p></td>
<td><p><a class="reference external" href="https://docs.python.org/3/library/pickle.html">Python Pickle Format</a></p></td>
<td><p><a class="reference internal" href="#io-pickle"><span class="std std-ref">read_pickle</span></a></p></td>
<td><p><a class="reference internal" href="#io-pickle"><span class="std std-ref">to_pickle</span></a></p></td>
</tr>
<tr class="row-even"><td><p>SQL</p></td>
<td><p><a class="reference external" href="https://en.wikipedia.org/wiki/SQL">SQL</a></p></td>
<td><p><a class="reference internal" href="#io-sql"><span class="std std-ref">read_sql</span></a></p></td>
<td><p><a class="reference internal" href="#io-sql"><span class="std std-ref">to_sql</span></a></p></td>
</tr>
<tr class="row-odd"><td><p>SQL</p></td>
<td><p><a class="reference external" href="https://en.wikipedia.org/wiki/BigQuery">Google BigQuery</a></p></td>
<td><p><a class="reference internal" href="#io-bigquery"><span class="std std-ref">read_gbq</span></a></p></td>
<td><p><a class="reference internal" href="#io-bigquery"><span class="std std-ref">to_gbq</span></a></p></td>
</tr>
</tbody>
</table>
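
As a rough illustration of the reader/writer pairing, the sketch below writes a dataframe with `to_csv` and reads it back with `read_csv`. The two-row dataframe and the in-memory `StringIO` buffer (used instead of a file on disk) are assumptions made to keep the example self-contained.

```python
import io

import pandas as pd

df = pd.DataFrame({"company": ["Walmart", "State Grid"], "rank": [1, 2]})

# Writers are DataFrame methods; with no path given, to_csv returns the CSV text.
csv_text = df.to_csv(index=False)

# Readers are top-level functions; StringIO stands in for a file on disk.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

The other formats in the table follow the same pattern: a `pd.read_*` function on the way in and a `DataFrame.to_*` method on the way out.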

## Reading and writing JSON files
    

Since its inception, JSON has quickly become the de facto standard for information exchange. Chances are you’re here because you need to transport some data from here to there. Perhaps you’re gathering information through an API or storing your data in a document database. One way or another, you’re up to your neck in JSON, and you’ve got to Python your way out.

### A (Very) Brief History of JSON

Not so surprisingly, JavaScript Object Notation was inspired by a subset of the JavaScript programming language dealing with object literal syntax. They’ve got a nifty website that explains the whole thing. Don’t worry though: JSON has long since become language agnostic and exists as its own standard, so we can thankfully avoid JavaScript for the sake of this discussion.

Ultimately, the community at large adopted JSON because it’s easy for both humans and machines to create and understand.

JSON supports primitive types, like strings and numbers, as well as nested lists and objects.
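
To make those types concrete, here is a small sketch that parses a JSON document and prints the Python type each value ends up with. The document itself is invented for illustration, and the `json` module used here is introduced in the next section.

```python
import json

# Illustrative JSON document invented for this sketch.
sample = """
{
    "name": "Ford Prefect",
    "age": 200,
    "is_hitchhiker": true,
    "towel": null,
    "aliases": ["Ix"],
    "home": {"planet": "Betelgeuse Five"}
}
"""

parsed = json.loads(sample)
for key, value in parsed.items():
    # Show which Python type each JSON value was converted to.
    print(f"{key}: {type(value).__name__}")
```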

> Wait, that looks like a Python dictionary! I know, right? It’s pretty much universal object notation at this point, but I don’t think UON rolls off the tongue quite as nicely.

### Python Supports JSON Natively

Python comes with a built-in package called json for encoding and decoding JSON data.

Just throw this little guy up at the top of your file:

In [1]:
import json

The process of encoding JSON is usually called serialization. This term refers to the transformation of data into a series of bytes (hence serial) to be stored or transmitted across a network. You may also hear the term marshaling, but that’s a whole other discussion. Naturally, deserialization is the reciprocal process of decoding data that has been stored or delivered in the JSON standard.

Yikes! That sounds pretty technical. Definitely. But in reality, all we’re talking about here is reading and writing. Think of it like this: encoding is for writing data to disk, while decoding is for reading data into memory.

### Serializing JSON

What happens after a computer processes lots of information? It needs to take a data dump. Accordingly, the json library exposes the dump() method for writing data to files. There is also a dumps() method (pronounced as “dump-s”) for writing to a Python string.

Simple Python objects are translated to JSON according to a fairly intuitive conversion.

<div class="table-responsive">
<table class="table table-hover">
<thead>
<tr>
<th>Python</th>
<th>JSON</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>dict</code></td>
<td><code>object</code></td>
</tr>
<tr>
<td><code>list</code>, <code>tuple</code></td>
<td><code>array</code></td>
</tr>
<tr>
<td><code>str</code></td>
<td><code>string</code></td>
</tr>
<tr>
<td><code>int</code>, <code>float</code></td>
<td><code>number</code></td>
</tr>
<tr>
<td><code>True</code></td>
<td><code>true</code></td>
</tr>
<tr>
<td><code>False</code></td>
<td><code>false</code></td>
</tr>
<tr>
<td><code>None</code></td>
<td><code>null</code></td>
</tr>
</tbody>
</table>
</div>

Imagine you’re working with a Python object in memory that looks a little something like this:

In [2]:
data = {
    "president": {
        "name": "Zaphod Beeblebrox",
        "species": "Betelgeusian"
    }
}

It is critical that you save this information to disk, so your mission is to write it to a file.

Using Python’s context manager, you can create a file called data_file.json and open it in write mode. (JSON files conveniently end in a .json extension.)

In [3]:
with open("data/data_file.json", "w") as write_file:
    json.dump(data, write_file)

Note that dump() takes two positional arguments: (1) the data object to be serialized, and (2) the file-like object to which the JSON text will be written.

Or, if you were so inclined as to continue using this serialized JSON data in your program, you could write it to a native Python str object.

In [4]:
json_string = json.dumps(data)
print(json_string)

{"president": {"name": "Zaphod Beeblebrox", "species": "Betelgeusian"}}


Notice that the file-like object is absent since you aren’t actually writing to disk. Other than that, dumps() is just like dump().
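
Both functions also accept keyword arguments that control the layout of the output. The sketch below uses `indent` and `sort_keys`, two standard-library options, to produce a more readable string; whether you want them depends on who will read the result.

```python
import json

data = {"president": {"name": "Zaphod Beeblebrox", "species": "Betelgeusian"}}

# indent adds line breaks and indentation; sort_keys orders the keys alphabetically.
pretty = json.dumps(data, indent=4, sort_keys=True)
print(pretty)
```

The same keyword arguments can be passed to dump() when writing to a file.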

### Deserializing JSON

Great, looks like you’ve captured yourself some wild JSON! Now it’s time to whip it into shape. In the json library, you’ll find load() and loads() for turning JSON encoded data into Python objects.

Just like serialization, there is a simple conversion table for deserialization, though you can probably guess what it looks like already.

<div class="table-responsive">
<table class="table table-hover">
<thead>
<tr>
<th>JSON</th>
<th>Python</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>object</code></td>
<td><code>dict</code></td>
</tr>
<tr>
<td><code>array</code></td>
<td><code>list</code></td>
</tr>
<tr>
<td><code>string</code></td>
<td><code>str</code></td>
</tr>
<tr>
<td><code>number</code> (int)</td>
<td><code>int</code></td>
</tr>
<tr>
<td><code>number</code> (real)</td>
<td><code>float</code></td>
</tr>
<tr>
<td><code>true</code></td>
<td><code>True</code></td>
</tr>
<tr>
<td><code>false</code></td>
<td><code>False</code></td>
</tr>
<tr>
<td><code>null</code></td>
<td><code>None</code></td>
</tr>
</tbody>
</table>
</div>

Technically, this conversion isn’t a perfect inverse to the serialization table. That basically means that if you encode an object now and then decode it again later, you may not get exactly the same object back. I imagine it’s a bit like teleportation: break my molecules down over here and put them back together over there. Am I still the same person?

In reality, it’s probably more like getting one friend to translate something into Japanese and another friend to translate it back into English. Regardless, the simplest example would be encoding a tuple and getting back a list after decoding, like so:

In [8]:
blackjack_hand = (8, "Q")
encoded_hand = json.dumps(blackjack_hand)
decoded_hand = json.loads(encoded_hand)

In [9]:
blackjack_hand == decoded_hand

False

In [10]:
type(blackjack_hand)

tuple

In [11]:
type(decoded_hand)

list

In [12]:
blackjack_hand == tuple(decoded_hand)

True
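
Dictionary keys are another place where a round trip changes the data: JSON object keys are always strings, so integer keys come back as `str`. A short sketch with made-up data:

```python
import json

scores = {1: "first", 2: "second"}

# Integer keys are written as JSON strings and stay strings after decoding.
decoded = json.loads(json.dumps(scores))

print(decoded)            # {'1': 'first', '2': 'second'}
print(scores == decoded)  # False
```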

This time, imagine you’ve got some data stored on disk that you’d like to manipulate in memory. You’ll still use the context manager, but this time you’ll open up the existing data_file.json in read mode.

In [5]:
with open("data/data_file.json", "r") as read_file:
    data = json.load(read_file)

In [6]:
data

{'president': {'name': 'Zaphod Beeblebrox', 'species': 'Betelgeusian'}}

Things are pretty straightforward here, but keep in mind that this method can return any of the allowed data types from the conversion table. This only matters if you’re loading data you haven’t seen before. In most cases, the root object will be a dict or a list.

If you’ve pulled JSON data in from another program or have otherwise obtained a string of JSON formatted data in Python, you can easily deserialize that with loads(), which naturally loads from a string:

In [15]:
json_string = """
{
    "researcher": {
        "name": "Ford Prefect",
        "species": "Betelgeusian",
        "relatives": [
            {
                "name": "Zaphod Beeblebrox",
                "species": "Betelgeusian"
            }
        ]
    }
}
"""
data = json.loads(json_string)

In [17]:
data

{'researcher': {'name': 'Ford Prefect',
  'species': 'Betelgeusian',
  'relatives': [{'name': 'Zaphod Beeblebrox', 'species': 'Betelgeusian'}]}}

### Exercise: parsing JSON data

Objective: using data file 'interface-data.json', create output that resembles the following by parsing the included JSON file.

    Interface Status
    ================================================================================
    DN                                                 Description           Speed    MTU  
    -------------------------------------------------- --------------------  ------  ------
    topology/pod-1/node-201/sys/phys-[eth1/33]                              inherit   9150 
    topology/pod-1/node-201/sys/phys-[eth1/34]                              inherit   9150 
    topology/pod-1/node-201/sys/phys-[eth1/35]                              inherit   9150 

In [7]:
# Report header; the column widths match the format spec used for the data rows.
head = (
    f"{'=' * 67}\n"
    f"{'DN':50} {'Speed':8} {'MTU':7}\n"
    f"{'-' * 50} {'-' * 8} {'-' * 7}\n"
)

In [8]:
import json

with open('data/exer1-interface-data.json') as f:
    #jsondata = f.read()
    json_object = json.load(f)

In [9]:
imdata = json_object["imdata"]

In [10]:
with open('data/interface_output.txt', 'w') as f:
    f.write(head)
    #print(head)
    for interface in imdata:
        attributes = interface["l1PhysIf"]["attributes"]
        dn = attributes["dn"]
        speed = attributes["speed"]
        mtu = attributes["mtu"]
        data_string = f"{dn:50} {speed:8} {mtu:7}\n"
        #print(data_string)
        f.write(data_string)

### Parsing JSON Files With the pandas Library

In [8]:
import pandas as pd

[Example datasets](https://github.com/jdorfman/awesome-json-datasets#bitcoin)

Reading a JSON string into a pandas object can take a number of parameters. The parser will try to parse a DataFrame if `typ` is not supplied or is None. To explicitly force Series parsing, pass `typ='series'`.

    pd.read_json(json)

- dtype : if True, infer dtypes, if a dict of column to dtype, then use those, if False, then don’t infer dtypes at all, default is True, apply only to the data.
- convert_axes : boolean, try to convert the axes to the proper dtypes, default is True
- convert_dates : a list of columns to parse for dates; If True, then try to parse date-like columns, default is True.
- keep_default_dates : boolean, default True. If parsing dates, then parse the default date-like columns.
- numpy : direct decoding to NumPy arrays. default is False; Supports numeric data only, although labels may be non-numeric. Also note that the JSON ordering MUST be the same for each term if numpy=True. (This parameter is deprecated since pandas 1.0.)
- precise_float : boolean, default False. Set to enable usage of higher precision (strtod) function when decoding string to double values. Default (False) is to use fast but less precise builtin functionality.
- date_unit : string, the timestamp unit to detect if converting dates. Default None. By default the timestamp precision will be detected, if this is not desired then pass one of ‘s’, ‘ms’, ‘us’ or ‘ns’ to force timestamp precision to seconds, milliseconds, microseconds or nanoseconds respectively.
- lines : reads file as one json object per line. 
- encoding : The encoding to use to decode py3 bytes.
- chunksize : when used in combination with lines=True, return a JsonReader which reads in chunksize lines per iteration.
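
As a sketch of how `lines` and `chunksize` work together, the example below reads a small JSON Lines text, first in one go and then in chunks. The three records are invented, and `StringIO` is used in place of a file on disk.

```python
import io

import pandas as pd

# Hypothetical JSON Lines content: one JSON object per line.
jsonl = (
    '{"city": "Ljubljana", "temp": 21.5}\n'
    '{"city": "Maribor", "temp": 19.0}\n'
    '{"city": "Koper", "temp": 24.1}\n'
)

# lines=True parses each line as a separate record.
df = pd.read_json(io.StringIO(jsonl), lines=True)
print(df)

# With chunksize, read_json returns an iterator that yields DataFrames.
reader = pd.read_json(io.StringIO(jsonl), lines=True, chunksize=2)
sizes = [len(chunk) for chunk in reader]
print(sizes)
```

Chunked reading is mainly useful when the file is too large to load comfortably in one piece.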

#### Orient options

`orient` :
- Series:
    - default is index
    - allowed values are {split, records, index}
- DataFrame:
    - default is columns
    - allowed values are {split, records, index, columns, values, table}

In [9]:
dfjo = pd.DataFrame(dict(A=range(1, 4), B=range(4, 7), C=range(7, 10)), columns=list('ABC'), index=list('xyz'))

In [10]:
dfjo

Unnamed: 0,A,B,C
x,1,4,7
y,2,5,8
z,3,6,9


The format of the JSON string:

<table class="colwidths-given table">
<colgroup>
<col style="width: 12%">
<col style="width: 88%">
</colgroup>
<tbody>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">split</span></code></p></td>
<td><p>dict like {index -&gt; [index], columns -&gt; [columns], data -&gt; [values]}</p></td>
</tr>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">records</span></code></p></td>
<td><p>list like [{column -&gt; value}, … , {column -&gt; value}]</p></td>
</tr>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">index</span></code></p></td>
<td><p>dict like {index -&gt; {column -&gt; value}}</p></td>
</tr>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">columns</span></code></p></td>
<td><p>dict like {column -&gt; {index -&gt; value}}</p></td>
</tr>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">values</span></code></p></td>
<td><p>just the values array</p></td>
</tr>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">table</span></code></p></td>
<td><p>adhering to the JSON <a class="reference external" href="https://specs.frictionlessdata.io/json-table-schema/">Table Schema</a></p></td>
</tr>
</tbody>
</table>

- Column oriented (the default for DataFrame) serializes the data as nested JSON objects with column labels acting as the primary index:

In [11]:
dfjo.to_json(orient="columns")

'{"A":{"x":1,"y":2,"z":3},"B":{"x":4,"y":5,"z":6},"C":{"x":7,"y":8,"z":9}}'

- Index oriented (the default for Series) similar to column oriented but the index labels are now primary:

In [12]:
dfjo.to_json(orient="index")

'{"x":{"A":1,"B":4,"C":7},"y":{"A":2,"B":5,"C":8},"z":{"A":3,"B":6,"C":9}}'

- Record oriented serializes the data to a JSON array of column -> value records, index labels are not included. This is useful for passing DataFrame data to plotting libraries, for example the JavaScript library d3.js

In [13]:
dfjo.to_json(orient="records")

'[{"A":1,"B":4,"C":7},{"A":2,"B":5,"C":8},{"A":3,"B":6,"C":9}]'

- Value oriented is a bare-bones option which serializes to nested JSON arrays of values only, column and index labels are not included:

In [14]:
dfjo.to_json(orient="values")

'[[1,4,7],[2,5,8],[3,6,9]]'

- Split oriented serializes to a JSON object containing separate entries for values, index and columns. Name is also included for Series:

In [15]:
dfjo.to_json(orient="split")

'{"columns":["A","B","C"],"index":["x","y","z"],"data":[[1,4,7],[2,5,8],[3,6,9]]}'

- Table oriented serializes to the JSON Table Schema, allowing for the preservation of metadata including but not limited to dtypes and index names.

In [16]:
dfjo.to_json(orient="table")

'{"schema":{"fields":[{"name":"index","type":"string"},{"name":"A","type":"integer"},{"name":"B","type":"integer"},{"name":"C","type":"integer"}],"primaryKey":["index"],"pandas_version":"0.20.0"},"data":[{"index":"x","A":1,"B":4,"C":7},{"index":"y","A":2,"B":5,"C":8},{"index":"z","A":3,"B":6,"C":9}]}'
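
The same `orient` value has to be used when reading the string back. The sketch below rebuilds `dfjo` and round-trips it through two orientations to show which ones keep the index labels. Passing `convert_axes=False` is a choice made here so that the string labels are left exactly as stored.

```python
import io

import pandas as pd

dfjo = pd.DataFrame(
    dict(A=range(1, 4), B=range(4, 7), C=range(7, 10)),
    columns=list("ABC"),
    index=list("xyz"),
)

# "split" stores index, columns and data separately, so the labels survive.
split_json = dfjo.to_json(orient="split")
restored = pd.read_json(io.StringIO(split_json), orient="split", convert_axes=False)

# "records" drops the index, so a default integer index is created on read.
records_json = dfjo.to_json(orient="records")
no_index = pd.read_json(io.StringIO(records_json), orient="records")

print(restored.index.tolist())
print(no_index.index.tolist())
```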

#### Example: ocenas.json

In [17]:
# approach 1
ocenas = pd.read_json('data/ocenas.json', orient='columns')
ocenas.drop(columns='description', inplace=True)
ocenas.drop(['title', 'units', 'base_period', 'missing'], inplace=True)
ocenas.index.name = 'year'
ocenas.rename(columns={'data':'temp_anomaly_celsius'}, inplace=True)
ocenas.index = pd.to_datetime(ocenas.index).year
ocenas.head()

Unnamed: 0_level_0,temp_anomaly_celsius
year,Unnamed: 1_level_1
1880,-0.12
1881,-0.09
1882,-0.1
1883,-0.18
1884,-0.27


#### Example: temperatures.json

In [18]:
#!cat ./data/temperatures.json

In [19]:
import json
#load json object
with open('data/temperatures.json') as f:
    d = json.load(f)

In [20]:
# convert the data back into a JSON string in memory
temps_json = json.dumps(d['data'])

In [21]:
temps = pd.read_json(temps_json, orient='index')
temps.head()

Unnamed: 0,value,anomaly
189512,50.34,-1.68
189612,51.99,-0.03
189712,51.56,-0.46
189812,51.43,-0.59
189912,51.01,-1.01


#### Example: cities.json

In [22]:
# the file can be read this way, but the geolocation values cannot be extracted
cities = pd.read_json('data/cities.json', orient='records')
cities.head(2) 

Unnamed: 0,name,id,nametype,recclass,mass,fall,year,reclat,reclong,geolocation,:@computed_region_cbhk_fwbd,:@computed_region_nnqa_25f4
0,Aachen,1,Valid,L5,21.0,Fell,1880-01-01T00:00:00.000,50.775,6.08333,"{'type': 'Point', 'coordinates': [6.08333, 50....",,
1,Aarhus,2,Valid,H6,720.0,Fell,1951-01-01T00:00:00.000,56.18333,10.23333,"{'type': 'Point', 'coordinates': [10.23333, 56...",,


In [23]:
with open('data/cities.json') as f:
    d = json.load(f)

In [24]:
from pandas import json_normalize  # importing from pandas.io.json is deprecated since pandas 1.0
cities = json_normalize(d)



In [25]:
cities['coordinates_x'] = cities['geolocation.coordinates'].str[0]
cities['coordinates_y'] = cities['geolocation.coordinates'].str[1]
cities.drop(columns=['geolocation.coordinates', ':@computed_region_cbhk_fwbd', ':@computed_region_nnqa_25f4'], inplace=True)
cities.set_index('name', inplace=True)
cities['mass'] = pd.to_numeric(cities['mass'])
# convert the remaining columns in a similar way
cities.head()

Unnamed: 0_level_0,id,nametype,recclass,mass,fall,year,reclat,reclong,geolocation.type,coordinates_x,coordinates_y
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Aachen,1,Valid,L5,21.0,Fell,1880-01-01T00:00:00.000,50.775,6.08333,Point,6.08333,50.775
Aarhus,2,Valid,H6,720.0,Fell,1951-01-01T00:00:00.000,56.18333,10.23333,Point,10.23333,56.18333
Abee,6,Valid,EH4,107000.0,Fell,1952-01-01T00:00:00.000,54.21667,-113.0,Point,-113.0,54.21667
Acapulco,10,Valid,Acapulcoite,1914.0,Fell,1976-01-01T00:00:00.000,16.88333,-99.9,Point,-99.9,16.88333
Achiras,370,Valid,L6,780.0,Fell,1902-01-01T00:00:00.000,-33.16667,-64.95,Point,-64.95,-33.16667


#### Example: transactions.json

In [26]:
!head -n 10 data/transactions.json

{
"txs":[

{
   "lock_time":0,
   "ver":1,
   "size":373,
   "inputs":[
      {
         "sequence":4294967295,


In [27]:
with open('data/transactions.json') as f:
    data = json.load(f)

[pandas.json_normalize](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html): Normalize semi-structured JSON data into a flat table.

In [28]:
from pandas import json_normalize  # importing from pandas.io.json is deprecated since pandas 1.0

#json_normalize(data['txs']).head()

In [29]:
trans = json_normalize(data['txs'], record_path=['out'], meta=['time', 'relayed_by', 'vout_sz', 'hash'])



In [30]:
trans.head()

Unnamed: 0,spent,tx_index,type,addr,value,n,script,time,relayed_by,vout_sz,hash
0,False,0,0,1H7r57SXAwaKs3Tf5ugbkRNxwfh9YaxC5b,7541,0,76a914b0cd787a7a879ac0a5277b0013ec7b11c145055d...,1586376721,0.0.0.0,2,0f06714015f334626a168ee3e0aa5e0d3866a33dad504b...
1,False,0,0,1BPULhbGfrojrknyD7aZYMtRVUu38Cn75j,1364400,1,76a91471f13b222426eb80b47d2413d21a8904ec1966b2...,1586376721,0.0.0.0,2,0f06714015f334626a168ee3e0aa5e0d3866a33dad504b...
2,False,0,0,1LQ6YURobx4EGZRp8bdEDHup6T56o5NGKN,3127836,0,76a914d4c895721d3a8cd74bb3ccbb699a3dbe342c0807...,1586376722,0.0.0.0,2,3684072a50d7389933210d7adf4f98640d3d53c8cb245e...
3,False,0,0,1HSLVVSSQmzaNG8sbakhFDrmpzUPZLnYCe,30036732,1,76a914b44cae99837337275d21d2c5c6ed6cddf7a7e9f7...,1586376722,0.0.0.0,2,3684072a50d7389933210d7adf4f98640d3d53c8cb245e...
4,False,0,0,3Lb2MJWbBE88BUHf6tAw8ZzhkR6H2cYRhR,206183,0,a914cf48401e3cf81080352f281ea859ccabd51a821487,1586376721,0.0.0.0,3,3d3cc141654170060a7e298a9e5298557970e8cd0051ab...
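
To see what `record_path` and `meta` do without the full transactions file, here is a sketch on a much smaller, made-up structure. Each item in the nested `out` list becomes one row, and the fields named in `meta` are copied from the parent transaction onto every row.

```python
import pandas as pd

# Hypothetical, heavily simplified transactions; the real file has many more fields.
txs = [
    {
        "hash": "aaa",
        "time": 100,
        "out": [{"addr": "A1", "value": 5}, {"addr": "A2", "value": 7}],
    },
    {
        "hash": "bbb",
        "time": 101,
        "out": [{"addr": "B1", "value": 9}],
    },
]

# One row per element of "out"; "hash" and "time" are repeated from the parent.
flat = pd.json_normalize(txs, record_path=["out"], meta=["hash", "time"])
print(flat)
```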


#### Example: all_hour_geo.json

In [297]:
from pandas import json_normalize  # importing from pandas.io.json is deprecated since pandas 1.0
import json

In [298]:
#load json object
with open('data/all_hour_geo.json') as f:
    d = json.load(f)

In [300]:
data = [element['properties'] for element in  d['features']]
all_hour_geo = json_normalize(data)

In [96]:
all_hour_geo.head(3)

Unnamed: 0,mag,place,time,updated,tz,url,detail,felt,cdi,mmi,...,ids,sources,types,nst,dmin,rms,gap,magType,type,title
0,0.69,"16km ESE of Anza, CA",1586352802900,1586353032308,-480,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/earthquakes/feed/v...,,,,...,",ci39143639,",",ci,",",geoserve,nearby-cities,origin,phase-data,scit...",12.0,0.05468,0.14,98.0,ml,earthquake,"M 0.7 - 16km ESE of Anza, CA"
1,2.34,"7km ENE of Pahala, Hawaii",1586352794640,1586353127910,-600,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/earthquakes/feed/v...,,,,...,",hv71464377,",",hv,",",geoserve,origin,phase-data,",49.0,0.02127,0.13,136.0,ml,earthquake,"M 2.3 - 7km ENE of Pahala, Hawaii"
2,0.85,"15km ESE of Anza, CA",1586352704490,1586352926133,-480,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/earthquakes/feed/v...,,,,...,",ci39143631,",",ci,",",geoserve,nearby-cities,origin,phase-data,scit...",30.0,0.04884,0.14,55.0,ml,earthquake,"M 0.9 - 15km ESE of Anza, CA"


#### Example: rates.json

In [301]:
#load json object
with open('data/rates.json') as f:
    d = json.load(f)

In [316]:
# json_normalize(d['rates'])

In [315]:
rates = json_normalize(d['rates'], record_path=['periods'], meta=['name', 'code', 'country_code'])
rates.head()

Unnamed: 0,effective_from,rates.super_reduced,rates.reduced,rates.standard,rates.reduced1,rates.reduced2,rates.parking,name,code,country_code
0,0000-01-01,4.0,10.0,21.0,,,,Spain,ES,ES
1,0000-01-01,,9.0,20.0,,,,Bulgaria,BG,BG
2,0000-01-01,,,27.0,5.0,18.0,,Hungary,HU,HU
3,0000-01-01,,12.0,21.0,,,,Latvia,LV,LV
4,0000-01-01,,,23.0,5.0,8.0,,Poland,PL,PL


## Reading and Writing CSV Files

Let’s face it: you need to get information into and out of your programs through more than just the keyboard and console. Exchanging information through text files is a common way to share info between programs. One of the most popular formats for exchanging data is the CSV format. But how do you use it?

Let’s get one thing clear: you don’t have to (and you won’t) build your own CSV parser from scratch. There are several perfectly acceptable libraries you can use. The Python csv library will work for most cases. If your work requires lots of data or numerical analysis, the pandas library has CSV parsing capabilities as well, which should handle the rest.



### What Is a CSV File?

A CSV file (Comma Separated Values file) is a type of plain text file that uses specific structuring to arrange tabular data. Because it’s a plain text file, it can contain only actual text data—in other words, printable ASCII or Unicode characters.

The structure of a CSV file is given away by its name. Normally, CSV files use a comma to separate each specific data value.

Normally, the first line identifies each piece of data (in other words, it names each data column). Every subsequent line is actual data, and the number of lines is limited only by file size constraints.

In general, the separator character is called a delimiter, and the comma is not the only one used. Other popular delimiters include the tab (\t), colon (:) and semi-colon (;) characters. Properly parsing a CSV file requires us to know which delimiter is being used.
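To make the role of the delimiter concrete, here is a minimal sketch using the standard csv module. The semicolon-delimited text is a made-up sample, not one of the course data files. It shows that the same text splits correctly only when the right delimiter is passed.

```python
import csv
import io

# Hypothetical sample: a small table delimited with semicolons instead of commas.
text = "name;department;birthday month\nJohn Smith;Accounting;November\n"

# Parsing with the correct delimiter splits each line into three fields.
rows = list(csv.reader(io.StringIO(text), delimiter=';'))
print(rows[0])
print(rows[1])

# Parsing with the default delimiter (a comma) leaves each line as one field.
wrong = list(csv.reader(io.StringIO(text)))
print(wrong[0])
```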

CSV files are normally created by programs that handle large amounts of data. They are a convenient way to export data from spreadsheets and databases as well as import or use it in other programs. For example, you might export the results of a data mining program to a CSV file and then import that into a spreadsheet to analyze the data, generate graphs for a presentation, or prepare a report for publication.

CSV files are very easy to work with programmatically. Any language that supports text file input and string manipulation (like Python) can work with CSV files directly.

### Parsing CSV Files With Python’s Built-in CSV Library

The csv library provides functionality to both read from and write to CSV files. Designed to work out of the box with Excel-generated CSV files, it is easily adapted to work with a variety of CSV formats. The csv library contains objects and other code to read, write, and process data from and to CSV files.

#### Reading CSV Files With csv

Reading from a CSV file is done using the reader object. The CSV file is opened as a text file with Python’s built-in open() function, which returns a file object. This is then passed to the reader, which does the heavy lifting.

Here’s code to read a small file, data/employee_birthday.txt, whose first line holds the column names:

In [11]:
import csv

with open('data/employee_birthday.txt') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    for row in csv_reader:
        if line_count == 0:
            print(f'Column names are {", ".join(row)}')
            line_count += 1
        else:
            print(f'\t{row[0]} works in the {row[1]} department, and was born in {row[2]}.')
            line_count += 1
    print(f'Processed {line_count} lines.')

Column names are name, department, birthday month
	John Smith works in the Accounting department, and was born in November.
	Erica Meyers works in the IT department, and was born in March.
Processed 3 lines.


Each row returned by the reader is a list of strings containing the data found between the delimiters. The first row returned contains the column names, so the code handles it separately from the data rows.
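If you would rather not track the header row by hand, csv.DictReader is one alternative: it uses the first row as keys and yields each following row as a dictionary. The sketch below assumes file contents that match the output shown above and reads them from an in-memory string rather than from disk.

```python
import csv
import io

# Assumed contents of the employee file, reconstructed from the printed output.
sample = io.StringIO(
    "name,department,birthday month\n"
    "John Smith,Accounting,November\n"
    "Erica Meyers,IT,March\n"
)

reader = csv.DictReader(sample)  # the first row becomes the field names
records = list(reader)

for record in records:
    print(f"{record['name']} works in the {record['department']} department.")
```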

#### Writing CSV Files With csv

You can also write to a CSV file using a writer object and the .writerow() method:

In [12]:
import csv

# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('data/employee_file1.csv', mode='w', newline='') as employee_file:
    employee_writer = csv.writer(employee_file, delimiter=',')

    employee_writer.writerow(['John Smith', 'Accounting', 'November'])
    employee_writer.writerow(['Erica Meyers', 'IT', 'March'])
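What happens when a value itself contains the delimiter? The sketch below writes to an in-memory buffer instead of a file and uses a made-up name with a comma in it. With the default minimal quoting, the writer wraps only that field in quotes, and a reader recovers the original value.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

writer.writerow(['name', 'department'])
# Hypothetical value containing the delimiter; the writer quotes it automatically.
writer.writerow(['Smith, John', 'Accounting'])

output = buffer.getvalue()
print(output)

# Reading the text back returns the original, unquoted field.
round_trip = list(csv.reader(io.StringIO(output, newline='')))
print(round_trip[1])
```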

### Parsing CSV Files With the pandas Library

Of course, the Python CSV library isn’t the only game in town. Reading CSV files is possible in pandas as well. It is highly recommended if you have a lot of data to analyze.

pandas is an open-source Python library that provides high performance data analysis tools and easy to use data structures. pandas is available for all Python installations, but it is a key part of the Anaconda distribution and works extremely well in Jupyter notebooks to share data, code, analysis results, visualizations, and narrative text.

Reading the CSV into a pandas DataFrame is quick and straightforward:

In [1]:
import pandas
df = pandas.read_csv('data/hrdata.csv')
df

Unnamed: 0,Name,Hire Date,Salary,Sick Days remaining
0,Graham Chapman,03/15/14,50000.0,10
1,John Cleese,06/01/15,65000.0,8
2,Eric Idle,05/12/14,45000.0,10
3,Terry Jones,11/01/13,70000.0,3
4,Terry Gilliam,08/12/14,48000.0,7
5,Michael Palin,05/23/13,66000.0,8


That’s it: three lines of code, and only one of them is doing the actual work. pandas.read_csv() opens, analyzes, and reads the CSV file provided, and stores the data in a DataFrame.
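By default the Hire Date column is read as plain text and the rows get a numeric index. As a small sketch, the snippet below uses two rows copied from the output above (read from an in-memory string, so the exact file layout is an assumption), picks Name as the index, and converts the dates explicitly with pd.to_datetime.

```python
import io
import pandas as pd

# Two rows taken from the output above; the real file may be formatted differently.
sample = io.StringIO(
    "Name,Hire Date,Salary,Sick Days remaining\n"
    "Graham Chapman,03/15/14,50000.00,10\n"
    "John Cleese,06/01/15,65000.00,8\n"
)

hr = pd.read_csv(sample, index_col='Name')
# Convert the text dates using an explicit month/day/two-digit-year format.
hr['Hire Date'] = pd.to_datetime(hr['Hire Date'], format='%m/%d/%y')

print(hr.dtypes)
print(hr.loc['John Cleese', 'Hire Date'])
```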

Writing the data to JSON:

In [2]:
df.to_json('data/hrdata.json', orient='index')

In [3]:
! cat data/hrdata.json

{"0":{"Name":"Graham Chapman","Hire Date":"03\/15\/14","Salary":50000.0,"Sick Days remaining":10},"1":{"Name":"John Cleese","Hire Date":"06\/01\/15","Salary":65000.0,"Sick Days remaining":8},"2":{"Name":"Eric Idle","Hire Date":"05\/12\/14","Salary":45000.0,"Sick Days remaining":10},"3":{"Name":"Terry Jones","Hire Date":"11\/01\/13","Salary":70000.0,"Sick Days remaining":3},"4":{"Name":"Terry Gilliam","Hire Date":"08\/12\/14","Salary":48000.0,"Sick Days remaining":7},"5":{"Name":"Michael Palin","Hire Date":"05\/23\/13","Salary":66000.0,"Sick Days remaining":8}}
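The reverse direction works in a similar way. This is a minimal sketch with a small made-up DataFrame: to_json(orient='index') produces text in the same shape as the file above, and pd.read_json with the same orient rebuilds the table. Note that read_json infers dtypes again, so numeric columns may come back with a different dtype than they started with.

```python
import io
import pandas as pd

# Small stand-in for the HR data shown above.
df = pd.DataFrame({'Name': ['Graham Chapman', 'John Cleese'],
                   'Salary': [50000.0, 65000.0]})

json_text = df.to_json(orient='index')
print(json_text)

# Use the same orient when reading, so the outer keys become the row index again.
restored = pd.read_json(io.StringIO(json_text), orient='index')
print(restored)
```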

#### Example 1: seaslug.txt

In [37]:
seaslug = pd.read_csv('data/seaslug.txt', sep='\t')

In [38]:
seaslug.head()

Unnamed: 0,Time,Percent
0,99,0.067
1,99,0.133
2,99,0.067
3,99,0.0
4,99,0.0


- `sep: str, defaults to ',' for read_csv(), \t for read_table()`: Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\\r\\t'.

- `delimiter: str, default None`: Alias for sep.
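As a sketch of the automatic detection described for sep, the snippet below passes sep=None together with engine='python' so that pandas sniffs the separator itself. The semicolon-delimited sample is made up for illustration. Detection is a heuristic, so passing the separator explicitly is the safer choice when you know it.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited sample with the same columns as seaslug.txt.
sample = io.StringIO("Time;Percent\n99;0.067\n99;0.133\n")

# sep=None asks the Python engine to detect the delimiter.
df = pd.read_csv(sample, sep=None, engine='python')
print(df)
```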

#### Example 2: FOOD_DES.txt

Encoding: `iso-8859-1`, separator: `^`

In [16]:
! head -n 3 ./data/FOOD_DES.txt

~01001~^~0100~^~Butter, salted~^~BUTTER,WITH SALT~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87
~01002~^~0100~^~Butter, whipped, with salt~^~BUTTER,WHIPPED,W/ SALT~^~~^~~^~Y~^~~^0^~~^6.38^^^
~01003~^~0100~^~Butter oil, anhydrous~^~BUTTER OIL,ANHYDROUS~^~~^~~^~Y~^~~^0^~~^6.38^4.27^8.79^3.87


- It turns out that the file uses carets (^) instead of commas to separate the fields.
- For some reason, the USDA decided to wrap strings in tildes (~). Thankfully, we can handle this with the quotechar argument.

In [23]:
pd.read_csv('data/FOOD_DES.txt', sep='^', encoding='iso-8859-1', header=None, nrows=5, quotechar='~')

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,1001,100,"Butter, salted","BUTTER,WITH SALT",,,Y,,0,,6.38,4.27,8.79,3.87
1,1002,100,"Butter, whipped, with salt","BUTTER,WHIPPED,W/ SALT",,,Y,,0,,6.38,,,
2,1003,100,"Butter oil, anhydrous","BUTTER OIL,ANHYDROUS",,,Y,,0,,6.38,4.27,8.79,3.87
3,1004,100,"Cheese, blue","CHEESE,BLUE",,,Y,,0,,6.38,4.27,8.79,3.87
4,1005,100,"Cheese, brick","CHEESE,BRICK",,,Y,,0,,6.38,4.27,8.79,3.87


- `nrows: int, default None` Number of rows of file to read. Useful for reading pieces of large files.

- `header: int or list of ints, default 'infer'` Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file, if column names are passed explicitly then the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names.

- `encoding: str, default None` Encoding to use for UTF when reading/writing (e.g. 'utf-8'). [List of Python standard encodings](https://docs.python.org/3/library/codecs.html#standard-encodings).

- `quotechar: str (length 1)`: The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.
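The parameters above can be combined on a tiny in-memory example. The two records below are made up (they only imitate the FOOD_DES layout) and are encoded as ISO-8859-1 bytes to show that encoding matters for non-ASCII text. The dtype argument is an extra assumption added here to keep the leading zero of the first column.

```python
import io
import pandas as pd

# Hypothetical records in the caret/tilde layout, encoded as ISO-8859-1 bytes.
raw = (
    "~01001~^~Crème, fraîche~^6.38\n"
    "~01002~^~Butter, salted~^4.27\n"
).encode('iso-8859-1')

df = pd.read_csv(
    io.BytesIO(raw),
    sep='^',                # fields are separated by carets
    quotechar='~',          # tildes wrap text values, so commas inside stay intact
    encoding='iso-8859-1',  # decode the bytes with the file's actual encoding
    header=None,            # the data has no header row
    dtype={0: str},         # read the first column as text to keep '01001'
)
print(df)
```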

#### Example 3: MplsStops.csv

In [39]:
! head -n 3 ./data/mpls_stops.csv

Unnamed: 0,id Num,date,problem,MDC,citation Issued,person Search,vehicle Search,pre Race,race,gender,lat,long,police Precinct,neighborhood
,idNum,date,problem,MDC,citationIssued,personSearch,vehicleSearch,preRace,race,gender,lat,long,policePrecinct,neighborhood
6823.0,17-000003,2017-01-01 00:00:42,suspicious,MDC,,NO,NO,Unknown,Unknown,Unknown,44.96661711,-93.24645826,1,Cedar Riverside


In [40]:
mpls = pd.read_csv('data/mpls_stops.csv', nrows=3)
mpls

Unnamed: 0.1,Unnamed: 0,id Num,date,problem,MDC,citation Issued,person Search,vehicle Search,pre Race,race,gender,lat,long,police Precinct,neighborhood
0,,idNum,date,problem,MDC,citationIssued,personSearch,vehicleSearch,preRace,race,gender,lat,long,policePrecinct,neighborhood
1,6823.0,17-000003,2017-01-01 00:00:42,suspicious,MDC,,NO,NO,Unknown,Unknown,Unknown,44.96661711,-93.24645826,1,Cedar Riverside
2,6824.0,17-000007,2017-01-01 00:03:07,suspicious,MDC,,NO,NO,Unknown,Unknown,Male,44.98045,-93.27134,1,Downtown West


In [121]:
mpls.columns

Index(['Unnamed: 0', 'id Num', 'date', 'problem', 'MDC', 'citation Issued',
       'person Search', 'vehicle Search', 'pre Race', 'race', 'gender', 'lat',
       'long', 'police Precinct', 'neighborhood'],
      dtype='object')

In [122]:
new_column_names = ['Unnamed: 0', 'id Num', 'date', 'problem', 'MDC', 'citation Issued',
       'person Search', 'vehicle Search', 'pre Race', 'race', 'gender', 'lat',
       'long', 'police Precinct', 'neighborhood']

In [123]:
new_column_names = [name.lower().replace(' ', '_') for name in new_column_names]
new_column_names[0] = 'case_number_id'
print(new_column_names)

['case_number_id', 'id_num', 'date', 'problem', 'mdc', 'citation_issued', 'person_search', 'vehicle_search', 'pre_race', 'race', 'gender', 'lat', 'long', 'police_precinct', 'neighborhood']


In [124]:
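import datetime

# pd.datetime was removed in pandas 2.0; use the standard-library datetime module instead
dateparse = lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')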

In [125]:
mpls = pd.read_csv('data/mpls_stops.csv', 
                   names=new_column_names, 
                   skiprows=2, 
                   engine='c',
                   true_values=['YES'],
                   false_values=['NO'],
                   dtype={'mdc': 'category', 'problem':'category', 'citation_issued': 'float',
                         'person_search': 'float', 'vehicle_search': 'float',  'pre_race':'category'},
                   index_col='case_number_id',
                   parse_dates=['date'],
                   date_parser=dateparse,
                   na_values=['Unknown'])

mpls.index = mpls.index.astype('int')
mpls.head()

case_number_id,id_num,date,problem,mdc,citation_issued,person_search,vehicle_search,pre_race,race,gender,lat,long,police_precinct,neighborhood
6823,17-000003,2017-01-01 00:00:42,suspicious,MDC,1.0,0.0,0.0,,,,44.966617,-93.246458,1,Cedar Riverside
6824,17-000007,2017-01-01 00:03:07,suspicious,MDC,1.0,0.0,0.0,,,Male,44.98045,-93.27134,1,Downtown West
6825,17-000073,2017-01-01 00:23:15,traffic,MDC,1.0,0.0,0.0,,White,Female,44.94835,-93.27538,5,Whittier
6826,17-000092,2017-01-01 00:33:48,suspicious,MDC,1.0,0.0,0.0,,East African,Male,44.94836,-93.28135,5,Whittier
6827,17-000098,2017-01-01 00:37:58,traffic,MDC,1.0,0.0,0.0,,White,Female,44.979078,-93.262076,1,Downtown West


- `names: array-like, default None` List of column names to use. If file contains no header row, then you should explicitly pass header=None. Duplicates in this list are not allowed.


- `skiprows: list-like or integer, default None` Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

- `engine: {'c', 'python'}` Parser engine to use. The C engine is faster while the Python engine is currently more feature-complete.

In [72]:
%timeit mpls = pd.read_csv('data/mpls_stops.csv', names=new_column_names, skiprows=2, engine='python')

776 ms ± 17.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [74]:
%timeit mpls = pd.read_csv('data/mpls_stops.csv', names=new_column_names, skiprows=2, engine='c')

193 ms ± 3.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


- `true_values: list, default None` Values to consider as True.
- `false_values: list, default None` Values to consider as False.

- `index_col: int, str, sequence of int / str, or False, default None` Column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a sequence of int / str is given, a MultiIndex is used.

- `dtype: Type name or dict of column -> type, default None` Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32} (unsupported with engine='python'). Use str or object together with suitable na_values settings to preserve and not interpret dtype.

- `parse_dates: boolean or list of ints or names or list of lists or dict, default False.` 
    - If True -> try parsing the index.
    - If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
    - If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
    - If {'foo': [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’. A fast-path exists for iso8601-formatted dates.

- `date_parser: function, default None` Function to use for converting a sequence of string columns to an array of datetime instances. The default uses dateutil.parser.parser to do the conversion. pandas will try to call date_parser in three different ways, advancing to the next if an exception occurs: 1) Pass one or more arrays (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values from the columns defined by parse_dates into a single array and pass that; and 3) call date_parser once for each row using one or more strings (corresponding to the columns defined by parse_dates) as arguments. Note that date_parser is deprecated since pandas 2.0 in favor of date_format.

- `na_values: scalar, str, list-like, or dict, default None` Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default, strings such as '', 'NA', 'N/A', 'NaN', 'NULL' and 'nan' are already interpreted as NaN.
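Several of these parameters can be tried out on a tiny in-memory sample. The rows and column names below are made up and only loosely follow the police stops data. Because date_parser is deprecated in recent pandas versions, this sketch converts the date column after reading, with pd.to_datetime and an explicit format, which is one possible replacement rather than the approach used in the cell above.

```python
import io
import pandas as pd

# Hypothetical sample rows, loosely modelled on the stops data.
sample = io.StringIO(
    "id,date,searched,race\n"
    "1,2017-01-01 00:00:42,NO,Unknown\n"
    "2,2017-01-01 00:03:07,YES,White\n"
)

df = pd.read_csv(
    sample,
    index_col='id',
    true_values=['YES'],    # read 'YES' as True
    false_values=['NO'],    # read 'NO' as False
    na_values=['Unknown'],  # treat 'Unknown' as missing
)
# Parse the dates explicitly instead of passing date_parser.
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S')

print(df)
print(df.dtypes)
```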

#### Example 4: iperf.txt

In [41]:
!head data/iperf.txt

Wed Aug 15 19:35:11 CEST 2018
Connecting to host x.x.x.x, port 5201
[  4] local x.x.x.x port 48944 connected to x.x.x.x port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   375 MBytes  3.14 Gbits/sec  273    471 KBytes
[  4]   1.00-2.00   sec   428 MBytes  3.59 Gbits/sec  145    376 KBytes
[  4]   2.00-3.00   sec   360 MBytes  3.02 Gbits/sec  148    454 KBytes
[  4]   3.00-4.00   sec   339 MBytes  2.84 Gbits/sec   83    407 KBytes
[  4]   4.00-5.00   sec   305 MBytes  2.56 Gbits/sec  104    414 KBytes
[  4]   5.00-6.00   sec   301 MBytes  2.53 Gbits/sec  186    440 KBytes


We want the following columns:
- timestamp, transfer_mbytesec, bandwidth_gbitsec, retr, cwnd_kbytes

In [None]:
# as we can see, this does not work
#pd.read_csv('data/iperf.txt')

In [16]:
# read the data into a list
with open('data/iperf.txt', 'r') as f:
    data = f.readlines()
    data = [line.strip() for line in data]

In [17]:
print(data)

['Wed Aug 15 19:35:11 CEST 2018', 'Connecting to host x.x.x.x, port 5201', '[  4] local x.x.x.x port 48944 connected to x.x.x.x port 5201', '[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd', '[  4]   0.00-1.00   sec   375 MBytes  3.14 Gbits/sec  273    471 KBytes', '[  4]   1.00-2.00   sec   428 MBytes  3.59 Gbits/sec  145    376 KBytes', '[  4]   2.00-3.00   sec   360 MBytes  3.02 Gbits/sec  148    454 KBytes', '[  4]   3.00-4.00   sec   339 MBytes  2.84 Gbits/sec   83    407 KBytes', '[  4]   4.00-5.00   sec   305 MBytes  2.56 Gbits/sec  104    414 KBytes', '[  4]   5.00-6.00   sec   301 MBytes  2.53 Gbits/sec  186    440 KBytes', '[  4]   6.00-7.00   sec   325 MBytes  2.73 Gbits/sec  174    485 KBytes', '[  4]   7.00-8.00   sec   434 MBytes  3.64 Gbits/sec   81    677 KBytes', '[  4]   8.00-9.00   sec   412 MBytes  3.46 Gbits/sec  226    537 KBytes', '[  4]   9.00-10.00  sec   409 MBytes  3.43 Gbits/sec   47    372 KBytes', '[  4]   10.00-11.00  sec   523 MBytes  3.

In [34]:
# extract the start time -> table of format codes: https://www.journaldev.com/23365/python-string-to-datetime-strptime
import datetime

start_time = datetime.datetime.strptime(data[0], '%a %b %d %H:%M:%S CEST %Y')
print(start_time, type(start_time))

2018-08-15 19:35:11 <class 'datetime.datetime'>


In [46]:
rows = []

for line in data[4:]:
    line_splited = line.split()
    # seconds to add to start time
    add_seconds = int(line_splited[2].split('.')[0])
    timestamp = start_time + datetime.timedelta(seconds=add_seconds)
    transfer_mbytesec = int(line_splited[4])
    bandwidth_gbitsec = float(line_splited[6])
    retr = int(line_splited[8])
    cwnd_kbytes = int(line_splited[9])
    rows.append((timestamp, transfer_mbytesec, bandwidth_gbitsec, retr, cwnd_kbytes))
    
print(rows)

[(datetime.datetime(2018, 8, 15, 19, 35, 11), 375, 3.14, 273, 471), (datetime.datetime(2018, 8, 15, 19, 35, 12), 428, 3.59, 145, 376), (datetime.datetime(2018, 8, 15, 19, 35, 13), 360, 3.02, 148, 454), (datetime.datetime(2018, 8, 15, 19, 35, 14), 339, 2.84, 83, 407), (datetime.datetime(2018, 8, 15, 19, 35, 15), 305, 2.56, 104, 414), (datetime.datetime(2018, 8, 15, 19, 35, 16), 301, 2.53, 186, 440), (datetime.datetime(2018, 8, 15, 19, 35, 17), 325, 2.73, 174, 485), (datetime.datetime(2018, 8, 15, 19, 35, 18), 434, 3.64, 81, 677), (datetime.datetime(2018, 8, 15, 19, 35, 19), 412, 3.46, 226, 537), (datetime.datetime(2018, 8, 15, 19, 35, 20), 409, 3.43, 47, 372), (datetime.datetime(2018, 8, 15, 19, 35, 21), 523, 3.81, 96, 422)]
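Splitting on whitespace and picking fields by position works for this file, but it raises an error on any line that has a different shape. A regular expression is one possible alternative. The sketch below assumes the interval-line layout seen in the sample output; the pattern and the parse_interval helper are illustrative names, not part of iperf or pandas.

```python
import re

# Assumed layout of an iperf interval line, based on the sample output above.
LINE_RE = re.compile(
    r'\[\s*\d+\]\s+(?P<start>\d+\.\d+)-(?P<end>\d+\.\d+)\s+sec\s+'
    r'(?P<transfer>\d+)\s+MBytes\s+(?P<bandwidth>[\d.]+)\s+Gbits/sec\s+'
    r'(?P<retr>\d+)\s+(?P<cwnd>\d+)\s+KBytes'
)


def parse_interval(line):
    """Return a dict of parsed values, or None if the line is not an interval row."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return {
        'start_sec': float(match['start']),
        'transfer_mbytes': int(match['transfer']),
        'bandwidth_gbits': float(match['bandwidth']),
        'retr': int(match['retr']),
        'cwnd_kbytes': int(match['cwnd']),
    }


parsed = parse_interval('[  4]   0.00-1.00   sec   375 MBytes  3.14 Gbits/sec  273    471 KBytes')
header = parse_interval('[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd')
print(parsed)
print(header)
```

With a helper like this, header and status lines can be skipped by checking for None instead of slicing the list at a fixed position.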


In [49]:
# write the data to a new file
import csv

headers = ['timestamp', 'transfer_mbytesec', 'bandwidth_gbitsec', 'retr', 'cwnd_kbytes']

# newline='' lets the csv module control line endings
with open('data/iperf_clean.csv', 'w', newline='') as f:
    f_csv = csv.writer(f)
    f_csv.writerow(headers)
    f_csv.writerows(rows)

In [58]:
# read the data back
iperf_data = pd.read_csv('data/iperf_clean.csv', parse_dates=['timestamp'], index_col=['timestamp'])
iperf_data

timestamp,transfer_mbytesec,bandwidth_gbitsec,retr,cwnd_kbytes
2018-08-15 19:35:11,375,3.14,273,471
2018-08-15 19:35:12,428,3.59,145,376
2018-08-15 19:35:13,360,3.02,148,454
2018-08-15 19:35:14,339,2.84,83,407
2018-08-15 19:35:15,305,2.56,104,414
2018-08-15 19:35:16,301,2.53,186,440
2018-08-15 19:35:17,325,2.73,174,485
2018-08-15 19:35:18,434,3.64,81,677
2018-08-15 19:35:19,412,3.46,226,537
2018-08-15 19:35:20,409,3.43,47,372


In [59]:
iperf_data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 11 entries, 2018-08-15 19:35:11 to 2018-08-15 19:35:21
Data columns (total 4 columns):
transfer_mbytesec    11 non-null int64
bandwidth_gbitsec    11 non-null float64
retr                 11 non-null int64
cwnd_kbytes          11 non-null int64
dtypes: float64(1), int64(3)
memory usage: 440.0 bytes
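If the cleaned CSV file is not needed for anything else, the intermediate file can be skipped. As a sketch, the list of tuples can be passed directly to the DataFrame constructor; only the first two parsed rows are repeated here to keep the example short.

```python
import datetime
import pandas as pd

# First two parsed rows from the example above.
rows = [
    (datetime.datetime(2018, 8, 15, 19, 35, 11), 375, 3.14, 273, 471),
    (datetime.datetime(2018, 8, 15, 19, 35, 12), 428, 3.59, 145, 376),
]
headers = ['timestamp', 'transfer_mbytesec', 'bandwidth_gbitsec', 'retr', 'cwnd_kbytes']

# Build the DataFrame in memory and use the timestamp column as the index.
iperf_direct = pd.DataFrame(rows, columns=headers).set_index('timestamp')
print(iperf_direct)
```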


## Reading and Writing Excel Files

To facilitate working with multiple sheets from the same file, the ExcelFile class can be used to wrap the file and can be passed into read_excel. There will be a performance benefit for reading multiple sheets, as the file is read into memory only once.

The sheet_names property returns a list of the sheet names in the file.

In [31]:
# Assign spreadsheet filename: file
file = 'data/battledeath.xlsx'

# Load spreadsheet: xls
xls = pd.ExcelFile(file)

# Print sheet names
print(xls.sheet_names)

['2002', '2004']


The read_excel function reads an Excel file into a pandas DataFrame. It supports the xls, xlsx, xlsm, xlsb, and odf file extensions, read from a local filesystem or URL, and it can read a single sheet or a list of sheets.


[pandas.read_excel](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html)

In [32]:
df_2002 = pd.read_excel(xls, '2002')

In [33]:
df_2002.head()

Unnamed: 0,"War, age-adjusted mortality due to",2002
0,Afghanistan,36.08399
1,Albania,0.128908
2,Algeria,18.31412
3,Andorra,0.0
4,Angola,18.96456


The ExcelFile class can also be used as a context manager. The primary use-case for an ExcelFile is parsing multiple sheets with different parameters.

In [34]:
with pd.ExcelFile(file) as xls:
    df_2002 = pd.read_excel(xls, '2002', names=['Country', 'AAM due to War (2002)'], index_col='Country')
    df_2004 = pd.read_excel(xls, '2004', names=['Country', 'War(2004)'], index_col='Country')

In [35]:
df_2002.head(2)

Country,AAM due to War (2002)
Afghanistan,36.08399
Albania,0.128908


In [36]:
df_2004.head(2)

Country,War(2004)
Afghanistan,9.451028
Albania,0.130354
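When all sheets share the same layout, read_excel can also load them in one call: passing sheet_name=None returns a dictionary that maps sheet names to DataFrames. The sketch below first writes a small made-up workbook to a temporary file so that it is self-contained. It assumes that an Excel engine such as openpyxl is installed, and the values are just the first rows shown above.

```python
import os
import tempfile
import pandas as pd

# Hypothetical two-sheet workbook, written to a temporary file for the demo.
path = os.path.join(tempfile.mkdtemp(), 'demo.xlsx')
with pd.ExcelWriter(path) as writer:
    pd.DataFrame({'Country': ['Afghanistan'], 'Value': [36.08399]}).to_excel(
        writer, sheet_name='2002', index=False)
    pd.DataFrame({'Country': ['Afghanistan'], 'Value': [9.451028]}).to_excel(
        writer, sheet_name='2004', index=False)

# sheet_name=None reads every sheet; the result is a dict of DataFrames.
sheets = pd.read_excel(path, sheet_name=None, index_col=0)
print(list(sheets))

# Put the sheets side by side; the sheet name becomes the outer column level.
combined = pd.concat(sheets, axis=1)
print(combined)
```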


## Other Formats

There are common situations that you may encounter while working with files. Most of these cases can be handled using other modules. Two common file types you may need to work with are .csv and .json. Real Python has already put together some great articles on how to handle these:
- Reading and Writing CSV Files in Python
- Working With JSON Data in Python

Additionally, there are built-in libraries out there that you can use to help you:
- wave: read and write WAV files (audio)
- aifc: read and write AIFF and AIFC files (audio; removed from the standard library in Python 3.13)
- sunau: read and write Sun AU files (removed from the standard library in Python 3.13)
- tarfile: read and write tar archive files
- zipfile: work with ZIP archives
- configparser: easily create and parse configuration files
- xml.etree.ElementTree: create or read XML based files
- msilib: read and write Microsoft Installer files (removed from the standard library in Python 3.13)
- plistlib: generate and parse Mac OS X .plist files

There are plenty more out there. Additionally, there are even more third-party tools available on PyPI. Some popular ones are the following:
- PyPDF2: PDF toolkit
- xlwings: read and write Excel files
- Pillow: image reading and manipulation
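As a quick illustration of one of the standard-library modules listed above, here is a minimal zipfile sketch. It builds an archive in memory and reads a member back; the member name and its contents are made up for the example.

```python
import io
import zipfile

# Create a ZIP archive in memory and add one small text file to it.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode='w') as archive:
    archive.writestr('report.csv', 'name,value\nalpha,1\n')

# Reopen the same bytes for reading and extract the member as text.
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    content = archive.read('report.csv').decode('utf-8')

print(names)
print(content)
```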

## Coloring Cells and Exporting

See the pandas release notes on [Excel output for styled DataFrames](https://pandas.pydata.org/pandas-docs/version/0.20/whatsnew.html#excel-output-for-styled-dataframes).

In [40]:
import pandas as pd

def highlight_cols(value):
    # Return a CSS rule: green background for truthy values, red otherwise
    color = 'green' if value else 'red'
    return f'background-color: {color}'


df = pd.DataFrame({'rez1': [2, 4, 8, 0,10],
                   'rez2': [2, 0, 0, 0,5],
                   'uspesnost': [True, False, False, True, True]})

styled = df.style.applymap(highlight_cols, subset=pd.IndexSlice[:, ['uspesnost']])
styled.to_excel('data/styled.xlsx', engine='openpyxl')