# File Formats

There are thousands of file formats for different purposes, such as:

- Audio files
- Binary files
- Database files
- Documents
- Image files
- Source code
- Video files
- ...

In this workshop we will focus on markup and generic text-based file formats
used all over the web.

## Materials & Resources

| Material                                                                                              | Time |
|:------------------------------------------------------------------------------------------------------|-----:|
| [Files & File Systems: Crash Course Computer Science #20](https://youtu.be/KN8YgJnShPM) (*till 4:45*) | 4:45 |
| [Text Files (Part 1) What is a text file?](https://youtu.be/H7R0LN41N8c)                              | 3:56 |
| [5 Minute Metadata - What is a CSV?](https://youtu.be/_blfh7uR05A)                                    | 4:42 |
| [XML Tutorial for Beginners \| What is XML \| Learn XML](https://youtu.be/KeLiQXqVgMI)                | 6:38 |
| [What is JSON? - 3 Minutes of Code](https://youtu.be/sSL2to7Jg5g)                                     | 2:33 |

## Material Review

- What is XML?
  <!--
    It stands for eXtensible Markup Language. A simple text file format to store
    and transport data.
  -->
- How is data stored in XML files?
  <!--
    It is stored between an opening and a closing tag. Tags can be repeated
    and nested, so XML can represent arbitrarily complex structures. It is
    similar to HTML; however, XML has no predefined tags, so it is flexible
    and customizable.
  -->
- Can we nest data in XML files?
  <!--
    Just like in HTML any tag can contain 0 or more other tags, so you can nest
    data.
    <employee>
      <name>...</name>
      <department>...</department>
    </employee>
  -->
- How can we add special information to the data in XML?
  <!--
    We can define attributes on the tags just like in HTML.
  -->
- What is the CSV format used for?
  <!--
    CSV describes tabular data: it has rows and columns.
    Columns are separated by a comma.
  -->
- What is the difference between CSV and TSV?
  <!--
    In TSV the columns are separated with tabs.
  -->
- Are there different types of CSV?
  <!--
    Sometimes the columns are separated with a semicolon to avoid confusion
    when a value contains a comma.
  -->
- What is JSON?
  <!--
    JavaScript Object Notation, a widely used file format to transfer and store
    data. It comes from the JS object format.
  -->
- What are the valid data types in JSON?
  <!--
    Array, boolean, number, string, null, object
  -->
- What is the benefit/drawback of JSON over XML?
  <!--
    JSON is usually more compact, so it takes less space.
    XML can attach metadata to values through attributes.
  -->
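
To make the comparison in the review questions concrete, the sketch below serializes one record in all three formats using only the Python standard library. The `employee` record, its field names, and the `id` attribute are made up for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical record used only for illustration.
employee = {"name": "Jane Doe", "department": "IT", "age": 34}

# CSV: a header row and one data row, columns separated by commas.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(employee), lineterminator="\n")
writer.writeheader()
writer.writerow(employee)
as_csv = buffer.getvalue()

# JSON: key/value pairs; the age stays a number instead of becoming text.
as_json = json.dumps(employee)

# XML: one element per field; the id is stored as an attribute (metadata).
root = ET.Element("employee", id="1")
for key, value in employee.items():
    ET.SubElement(root, key).text = str(value)
as_xml = ET.tostring(root, encoding="unicode")

print(as_csv)
print(as_json)
print(as_xml)
```

The XML version is the longest of the three, but it is also the only one that carries the extra `id` attribute alongside the values.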

## Workshop

In [71]:
from pandas import read_csv, read_json
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
from bs4 import BeautifulSoup
import re

In [2]:
%load_ext pycodestyle_magic

### Find oldest movie

Read this [input file](./movies.csv) and print the title of the oldest movie.
The file has the following columns:

- Title
- Year
- Director

In [3]:
df = read_csv("movies.csv", sep=";", header=None)

In [4]:
df.loc[df[1].idxmin(), 0]  # the smallest year belongs to the oldest movie

### Remove useless data

In [this file](./election.csv) you can find the raw data of a public election.
Unfortunately something went wrong and some rows cannot be used because a
mandatory value is missing. We need to remove these rows and print them to the
console. Columns (mandatory fields are marked with *):

- Name *
- Candidate *
- Time
- State *

In [5]:
df = read_csv("election.csv", header=None)

In [6]:
df[~df.index.isin(df.dropna().index)]

|   | 0               | 1        | 2                   | 3  |
|--:|:----------------|:---------|:--------------------|:---|
| 2 | Richard Gere    |          | 2018-09-21 11:41:53 | FL |
| 4 |                 | John Doe | 2018-09-21 12:50:29 | WA |
| 5 | Michael Jackson | Jane Doe | 2018-09-21 16:11:57 |    |
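
Note that `dropna()` without arguments rejects a row when any value is missing, including the optional Time. A minimal standard-library sketch that checks only the mandatory columns could look like this. The last three sample rows are the rejected rows from the output above; the first two rows are hypothetical, and the column positions assume the Name, Candidate, Time, State order.

```python
import csv
import io

# Sample data: the first two rows are hypothetical, the last three are the
# rejected rows shown in the output above.
raw = """\
Alice Example,Jane Doe,2018-09-21 10:00:00,NY
Bob Example,John Doe,,TX
Richard Gere,,2018-09-21 11:41:53,FL
,John Doe,2018-09-21 12:50:29,WA
Michael Jackson,Jane Doe,2018-09-21 16:11:57,
"""

MANDATORY = (0, 1, 3)  # Name, Candidate, State; Time (index 2) is optional

valid, invalid = [], []
for row in csv.reader(io.StringIO(raw)):
    # A row is usable only if every mandatory field is non-empty.
    if all(row[i].strip() for i in MANDATORY):
        valid.append(row)
    else:
        invalid.append(row)

# Print the removed rows, as the task asks.
for row in invalid:
    print(row)
```

With pandas the same idea would be `df.dropna(subset=[0, 1, 3])`, which keeps a row like the second one even though its Time is empty.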


### Find the post with the most popular comments

You can find some posts and their comments in [this file](./posts.json). Find
the post with the most popular comments, where popularity is the sum of the
likes on a post's comments.

In [45]:
df = read_json("posts.json", encoding="utf8")

In [46]:
df = df.dropna(subset=["comments"])

In [59]:
for i in df.index:
    df.loc[i, "commentlikes"] = json_normalize(df.loc[i, "comments"]).like_count.sum()

In [64]:
df.loc[df.commentlikes.idxmax(), :]  # idxmax returns an index label, so use .loc

author_id                                                     210
comments        [{'id': 64523, 'author_id': 12, 'published_at'...
content                                      Happy new year guys!
id                                                            145
is_public                                                    True
like_count                                                     87
published_at                                  2019-01-01 00:01:14
commentlikes                                                  320
Name: 1, dtype: object
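
The same aggregation can be sketched without pandas. The structure below (a list of posts, each with an optional `comments` list whose items carry a `like_count`) is inferred from the output above; the individual values are invented for illustration.

```python
import json

# Hypothetical data shaped like the posts in the output above.
raw = """
[
  {"id": 144, "content": "First!", "comments": null},
  {"id": 145, "content": "Happy new year guys!",
   "comments": [{"id": 1, "like_count": 200}, {"id": 2, "like_count": 120}]},
  {"id": 146, "content": "Monday again",
   "comments": [{"id": 3, "like_count": 15}]}
]
"""

posts = json.loads(raw)


def comment_likes(post):
    """Sum of like_count over a post's comments (0 if it has none)."""
    return sum(c["like_count"] for c in post.get("comments") or [])


# The post whose comments collected the most likes in total.
best = max(posts, key=comment_likes)
print(best["id"], comment_likes(best))
```

Treating a missing or `null` comment list as zero likes plays the role of the `dropna(subset=["comments"])` step.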

### USD transactions

[transactions.xml](./transactions.xml) contains money transfers. Your task is
to filter all USD transactions and print them to the console in a
user-friendly format.

In [179]:
with open("transactions.xml") as f:
    soup = BeautifulSoup(f.read(), 'xml')

In [202]:
trans = soup.find_all("amount", currency="USD")

In [236]:
for tran in trans:
    print(f"""From: {tran.parent.find("from").string} \
To: {tran.parent.find("to").string} \
{tran.parent.find("amount").string} USD""")

From: 465345 To: 46548743 2350 USD
From: 38644 To: 8756113 8000 USD
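
For comparison, here is a sketch of the same filter with the standard library's `xml.etree.ElementTree` instead of BeautifulSoup. The sample document is hypothetical: the `from`, `to` and `amount` names mirror the lookups used above, while the `transactions` and `transaction` wrapper elements are assumptions about the file layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; only from/to/amount are known from the code above.
raw = """
<transactions>
  <transaction>
    <from>465345</from><to>46548743</to>
    <amount currency="USD">2350</amount>
  </transaction>
  <transaction>
    <from>111</from><to>222</to>
    <amount currency="EUR">100</amount>
  </transaction>
</transactions>
"""

root = ET.fromstring(raw.strip())
lines = []
for tx in root.iter("transaction"):
    amount = tx.find("amount")
    # Keep only transfers whose amount is tagged with currency="USD".
    if amount is not None and amount.get("currency") == "USD":
        lines.append(
            f"From: {tx.findtext('from')} To: {tx.findtext('to')} "
            f"{amount.text} USD"
        )

print("\n".join(lines))
```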


### Exam performance

Here are the fictitious [results](./exams.tsv) of an exam. The examiners
tracked the user id, the points and the time spent on the exam. There was no
standard time format, so each mentor used their own. Your task is to find the
user who earned the most points per minute, that is, the highest
points/minutes ratio in the dataset.

In [72]:
df = read_csv("exams.tsv", sep="\t")

In [163]:
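def calculateMin(string):
    """Convert a duration such as "3600s", "32m", ".5h", "1h2m20s" or
    "1:02:08" to minutes. Returns None if the format is not recognised."""
    # re.match anchors at the start, so each pattern below only inspects
    # the leading number and the unit that follows it.
    s = re.match(r"\.?\d+(?=s)", string)
    m = re.match(r"\.?\d+(?=m)", string)
    h = re.match(r"\.?\d+(?=h)", string)
    # Three numbers split by any non-digit: "1h2m20s" and "1:02:08".
    hms = re.match(r"(\d{1,2})\D(\d{1,2})\D(\d{1,2})", string)
    if hms:
        hours, minutes, seconds = (float(g) for g in hms.groups())
        return hours * 60 + minutes + seconds / 60
    elif s:
        return float(s.group()) / 60
    elif m:
        return float(m.group())
    elif h:
        return float(h.group()) * 60
    return None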

In [174]:
df = df.dropna(subset=["time"])
df["min"] = df.time.apply(calculateMin).apply(round, ndigits=2)

In [175]:
df

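|   | user | points | time     |   min |
|--:|-----:|-------:|:---------|------:|
| 0 |    1 |     13 | 3600s    |  60.0 |
| 1 |    2 |     18 | 1h2m20s  | 62.33 |
| 2 |    3 |      2 | 600s     |  10.0 |
| 3 |    4 |     20 | 32m      |  32.0 |
| 4 |    5 |     19 | .5h      |  30.0 |
| 5 |    6 |     25 | 1h12m38s | 72.63 |
| 6 |    7 |     13 | 65m      |  65.0 |
| 7 |    8 |      0 | 98s      |  1.63 |
| 8 |    9 |     17 | 1:02:08  | 62.13 |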
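
The cells above stop at the normalized `min` column; the points/minutes ratio is one more step. A minimal sketch, with the values copied from the table above into plain Python tuples:

```python
# (user, points, minutes) copied from the table above.
results = [
    (1, 13, 60.0),
    (2, 18, 62.33),
    (3, 2, 10.0),
    (4, 20, 32.0),
    (5, 19, 30.0),
    (6, 25, 72.63),
    (7, 13, 65.0),
    (8, 0, 1.63),
    (9, 17, 62.13),
]

# Points earned per minute for every user.
ratios = {user: points / minutes for user, points, minutes in results}

# The user with the highest points/minutes ratio.
best_user = max(ratios, key=ratios.get)
print(best_user, round(ratios[best_user], 3))  # prints: 5 0.633
```

On the DataFrame itself the equivalent would be a new column such as `df["ratio"] = df.points / df["min"]` followed by `df.loc[df.ratio.idxmax()]`.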


### Transform data

- Transform [users.csv](./users.csv) into `json` and save it.

In [309]:
df = read_csv("users.csv")

In [310]:
with open("user.json", "w") as f:
    for row in df.iterrows():
        row[1].to_json(f)
        f.write("\n")

- Transform [flowers.json](./flowers.json) into `xml` and save it.
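
One possible approach for the JSON to XML task, sketched with the standard library only. The layout of `flowers.json` is not shown here, so the input below is hypothetical: it assumes a list of flat objects. The `flowers` and `flower` tag names are also just illustrative choices.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical input; the real flowers.json may be structured differently.
raw = '[{"name": "Rose", "color": "red"}, {"name": "Tulip", "color": "yellow"}]'


def to_xml(records, root_tag="flowers", item_tag="flower"):
    """Build an element tree with one child element per record field."""
    root = ET.Element(root_tag)
    for record in records:
        item = ET.SubElement(root, item_tag)
        for key, value in record.items():
            ET.SubElement(item, key).text = str(value)
    return root


root = to_xml(json.loads(raw))
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)

# To save the result: ET.ElementTree(root).write("flowers.xml", encoding="utf-8")
```

Nested objects or lists inside a record would need a recursive version of `to_xml`; this sketch only handles flat records.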

### A66 (Green Fox HQ) entering

Write a method which can read and parse a file containing information about
the door chip usage at the A66. The method must return how many times each
chip was used at the main door on each day *(A66 - 04 FÕBEJÁRAT (F-1) Door #1)*.

During the development you will need only three fields from the input:

- Date
- Description #1 - the used door
- Card number


#### Example

[Example file can be found here.](./logs.csv)

Each line represents an entry and contains the following information:

```csv
Id,Date and Time,Event message,Event number,Object #1,Description #1,Object #2,Description #2,Object #3,Description #3,Object #4,Description #4,Card number
1,2019.01.02. 9:21:49,Access granted,203,12,A66 - 04 FÕBEJÁRAT (F-1) Door #1,5,Czender András,0,,0,,00215:09895
...
```

Example output (numbers may differ):

```js
{
  ...,
  '00088:56736': {
    '2019.01.02.': 3,
    '2019.01.03.': 5,
    '2019.01.04.': 1,
    ...
  },
  '00247:27091': {
    '2019.01.02.': 7,
    '2019.01.04.': 4,
    ...
  },
  '00038:28945': {
    '2019.01.02.': 2,
    '2019.01.03.': 1,
    '2019.01.05.': 6,
    ...
  },
  ...
}
```

In [267]:
df = read_csv("logs.csv", header=None)

In [268]:
df

|   | 0  | 1                    | 2              | 3   | 4  | 5                                | 6 | 7                | 8 | 9 | 10 | 11 | 12          |
|--:|---:|:---------------------|:---------------|----:|---:|:---------------------------------|--:|:-----------------|--:|:--|---:|:---|:------------|
| 0 | 1  | 2019.01.02. 9:21:49  | Access granted | 203 | 12 | A66 - 04 FÕBEJÁRAT (F-1) Door #1 | 5 | Czender András   | 0 |   | 0  |    | 00215:09895 |
| 1 | 2  | 2019.01.02. 9:22:54  | Access granted | 203 | 12 | A66 - 17 Recepció (3-1) Door #1  | 5 | Czender András   | 0 |   | 0  |    | 00215:09895 |
| 2 | 3  | 2019.01.02. 9:29:15  | Access granted | 203 | 12 | A66 - 12 Recepció (2-1) Door #1  | 5 | Puskás Nóra      | 0 |   | 0  |    | 00059:58046 |
| 3 | 4  | 2019.01.02. 9:31:19  | Access granted | 203 | 12 | A66 - 17 Recepció (3-1) Door #1  | 5 | Puskás Nóra      | 0 |   | 0  |    | 00059:58046 |
| 4 | 5  | 2019.01.02. 9:36:51  | Access granted | 203 | 12 | A66 - 04 FÕBEJÁRAT (F-1) Door #1 | 5 | Ripka Péter      | 0 |   | 0  |    | 00110:57041 |
| 5 | 6  | 2019.01.02. 9:38:00  | Access granted | 203 | 12 | A66 - 17 Recepció (3-1) Door #1  | 5 | Ripka Péter      | 0 |   | 0  |    | 00110:57041 |
| 6 | 7  | 2019.01.02. 9:44:46  | Access granted | 203 | 12 | A66 - 04 FÕBEJÁRAT (F-1) Door #1 | 5 | Szívós István    | 0 |   | 0  |    | 00008:58673 |
| 7 | 8  | 2019.01.02. 9:46:16  | Access granted | 203 | 12 | A66 - 17 Recepció (3-1) Door #1  | 5 | Szívós István    | 0 |   | 0  |    | 00008:58673 |
| 8 | 9  | 2019.01.02. 10:02:19 | Access granted | 203 | 12 | A66 - 04 FÕBEJÁRAT (F-1) Door #1 | 5 | Megyaszai Dániel | 0 |   | 0  |    | 00055:39162 |
| 9 | 10 | 2019.01.02. 10:03:26 | Access granted | 203 | 12 | A66 - 17 Recepció (3-1) Door #1  | 5 | Megyaszai Dániel | 0 |   | 0  |    | 00055:39162 |
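
The cells above only load the file. One way to finish the task is sketched below with the standard library: it keeps the main-door rows and counts them per card number and day. The three sample rows are copied from the table above; the function name `main_door_usage` is a made-up helper, and the column positions (1 for the date, 5 for the door, 12 for the card number) assume the headerless layout shown in the table.

```python
import csv
import io
from collections import Counter, defaultdict

MAIN_DOOR = "A66 - 04 FÕBEJÁRAT (F-1) Door #1"

# Three rows copied from the table above, without a header line.
raw = """\
1,2019.01.02. 9:21:49,Access granted,203,12,A66 - 04 FÕBEJÁRAT (F-1) Door #1,5,Czender András,0,,0,,00215:09895
2,2019.01.02. 9:22:54,Access granted,203,12,A66 - 17 Recepció (3-1) Door #1,5,Czender András,0,,0,,00215:09895
5,2019.01.02. 9:36:51,Access granted,203,12,A66 - 04 FÕBEJÁRAT (F-1) Door #1,5,Ripka Péter,0,,0,,00110:57041
"""


def main_door_usage(lines):
    """Return {card number: {date: count}} for main-door entries."""
    usage = defaultdict(Counter)
    for row in csv.reader(lines):
        date = row[1].split(" ")[0]  # "2019.01.02. 9:21:49" -> "2019.01.02."
        door, card = row[5], row[12]
        if door == MAIN_DOOR:
            usage[card][date] += 1
    # Convert to plain dicts to match the shape of the example output.
    return {card: dict(days) for card, days in usage.items()}


print(main_door_usage(io.StringIO(raw)))
```

For the real file, pass an open file object instead of the `StringIO` buffer, for example `open("logs.csv", encoding="utf-8", newline="")`; the encoding is an assumption and may need adjusting.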
