# Load Data from Files
## Session 3 - Unit 2 [Self Guided]

<img src=media/files.jpg width=200/>

Welcome to the "Load Data from Files" unit of the Python Academy! In this unit, you will learn:
  - File Formats (DSV, JSON, Excel)
  - Pandas read methods
  - Open a file in Python
  - Slicing
  - Alternative File Formats

In [1]:
import pandas as pd

## File Formats

Learning how to handle the most common file formats is a fundamental skill for Python developers. Technically, a **file format** is a **standard way to encode information for storage** in a computer file. This standardization means that we can handle all files of a given format in (roughly) the same way. Learn once, use forever. Cool, right?

There are three fundamental types we need to look out for:
  - Delimiter Separated Values (CSV, TSV, etc.)
  - Excel
  - JSON

### Delimiter Separated Values

A [DSV](https://en.wikipedia.org/wiki/Delimiter-separated_values) is a two-dimensional, **tabular-like** storage; each row represents a data entry and the values are separated into columns by specific delimiter characters. They are **widely supported**, which makes them great for exchanging data between multiple applications.

The most common delimiters include:
  - comma (`,`), *aka comma separated values or CSV*
  - tab (`\t`), *aka tab separated values or TSV*
  - semicolon (`;`)
  - pipe (`|`)

Before we import them to Python, let's take a look at a CSV to understand better how data is stored. We will use the first 3 rows of [Spotify's Top 100 Songs of 2010-2019](https://www.kaggle.com/datasets/muhmores/spotify-top-100-songs-of-20152019) dataset.

<img src=media/csv_example.png />

We highlighted each column with a different color to make things clearer. Do you see how the **comma delimiter** works? For each row, we start at the far left and read until we find a delimiter (here, a comma) or reach the end of the line (cf. [newline](https://en.wikipedia.org/wiki/Newline)); all the text read up to that point belongs to the same column, and the next column starts right after the delimiter.
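As a rough illustration of the rule above, the sketch below splits a made-up CSV snippet by hand and then parses it with Python's built-in `csv` module. The sample text is hypothetical (it is not taken from the Spotify dataset) and is chosen to show why a naive split on commas is not always enough.

```python
import csv
import io

# Hypothetical sample, not taken from the Spotify dataset
raw_text = 'title,artist,year\n"Hello, World",Some Artist,2010\n'

# Naive approach: cut each line at every comma
naive_rows = [line.split(",") for line in raw_text.splitlines()]

# csv.reader follows the same idea but respects quoted values
parsed_rows = list(csv.reader(io.StringIO(raw_text)))

print(naive_rows[1])   # the comma inside the quotes wrongly splits the title
print(parsed_rows[1])  # the title stays in a single column
```

Values that contain the delimiter itself are usually wrapped in quotes, which is one reason to rely on a parser instead of splitting strings by hand.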

### Excel

[Excel](https://www.microsoft.com/en-us/microsoft-365/excel) is a proprietary **spreadsheet** software developed by Microsoft, used for numerical computation, data visualization, data re-arranging and macro programming. The spreadsheets are organized in a **tabular form**, using a grid of cells arranged in rows and columns. By being flexible on its data capabilities, Excel files can be used to both **store and manipulate data** within the same Excel file.

Despite being widely used, Excel faces an uphill climb on development efficiency when we incorporate higher levels of complexity within our data processing flow. Incorporating complex if-else conditional logic or applying the same operation multiple times (e.g. while, for loops) is possible but cumbersome. **This is when we should migrate data into Python**.

### JSON

JSON is a **language-independent, human-readable** data format. It was derived from JavaScript but, being language independent, many languages have built-in support to parse (i.e. read) and generate JSON data. It is a very common format for electronic communication between browsers, servers and databases, so expect to see it a lot!

The data stored in JSON files is typically represented with key-value pairs (similar to `dict`) and arrays (similar to `list`).

In [2]:
# example of JSON file
# {
#     "title": {
#         "0": "STARSTRUKK (feat. Katy Perry)",
#         "1": "My First Kiss (feat. Ke$ha)"
#     },
#     "artist": {
#         "0": "3OH!3",
#         "1": "3OH!3"
#     }
# }
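To make the `dict` analogy concrete, here is a minimal sketch that parses the JSON above with Python's built-in `json` module. The text is held in a string instead of a file, and the `genres` list at the end is a made-up value used only to show an array.

```python
import json

json_text = """
{
    "title": {"0": "STARSTRUKK (feat. Katy Perry)", "1": "My First Kiss (feat. Ke$ha)"},
    "artist": {"0": "3OH!3", "1": "3OH!3"}
}
"""

data = json.loads(json_text)   # JSON object -> Python dict
print(type(data))
print(data["title"]["1"])      # nested objects become nested dicts

# Going the other way: Python objects -> JSON text (hypothetical values)
round_trip = json.dumps({"genres": ["dance pop", "pop soul"]})
print(round_trip)
```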

## Pandas `read_*` methods 

Remember when we said the `pandas` package has a lot of built-in functionality for data analysis and manipulation? Well, this is just one example of it.

Pandas provides an [extensive API](https://pandas.pydata.org/pandas-docs/stable/reference/io.html#) to perform **input/output** operations with **data files**. It ranges from classical file formats (e.g. json, csv), to newer alternatives (e.g. parquet) and even databases (e.g. SQL, which we'll cover in a following Unit).

The standard way to input data is with `pd.read_{format}`, where `{format}` is the abbreviated name of the file format. Conversely, to output data into a file you can use `DataFrame.to_{format}`. Let's see a couple of examples with a small sample of the Spotify dataset.

In [2]:
# comma separated values
df1 = pd.read_csv('data/spotify_1.csv')
df1.head()

Unnamed: 0,title,artist,top genre,year released,added,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,pop,top year,artist type
0,STARSTRUKK (feat. Katy Perry),3OH!3,dance pop,2009,2022‑02‑17,140,81,61,-6,23,23,203,0,6,70,2010,Duo
1,My First Kiss (feat. Ke$ha),3OH!3,dance pop,2010,2022‑02‑17,138,89,68,-4,36,83,192,1,8,68,2010,Duo
2,I Need A Dollar,Aloe Blacc,pop soul,2010,2022‑02‑17,95,48,84,-7,9,96,243,20,3,72,2010,Solo
3,Airplanes (feat. Hayley Williams of Paramore),B.o.B,atl hip hop,2010,2022‑02‑17,93,87,66,-4,4,38,180,11,12,80,2010,Solo
4,Nothin' on You (feat. Bruno Mars),B.o.B,atl hip hop,2010,2022‑02‑17,104,85,69,-6,9,74,268,39,5,79,2010,Solo


`pd.read_csv` can be **configured in many ways** ([API Reference](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)), but for now let's focus on the simpler stuff. By default, the file is read with commas as delimiters and the header is inferred from the first line.

But what if my file is separated by another delimiter? Let's see how pandas loads a CSV which is poorly configured.

In [3]:
# not really comma separated right?
pd.read_csv('data/spotify_2.csv').head(3)

Unnamed: 0,title;artist;top genre;year released;added;bpm;nrgy;dnce;dB;live;val;dur;acous;spch;pop;top year;artist type
0,STARSTRUKK (feat. Katy Perry);3OH!3;dance pop;...
1,My First Kiss (feat. Ke$ha);3OH!3;dance pop;20...
2,I Need A Dollar;Aloe Blacc;pop soul;2010;2022‑...


What we see is that the **delimiters** are **semicolons** (`;`) rather than commas, and that is messing up the data loading. Since the engine finds no commas, it creates no new columns, and each row of data ends up bundled together in a single column. Not cool.

To fix this, we can provide the `sep` argument (or its alias `delimiter`), which specifies the **separator** the data loading engine should use to split each row into columns.

In [4]:
# this is more like it
pd.read_csv('data/spotify_2.csv', sep=';').head(3)

Unnamed: 0,title,artist,top genre,year released,added,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,pop,top year,artist type
0,STARSTRUKK (feat. Katy Perry),3OH!3,dance pop,2009,2022‑02‑17,140,81,61,-6,23,23,203,0,6,70,2010,Duo
1,My First Kiss (feat. Ke$ha),3OH!3,dance pop,2010,2022‑02‑17,138,89,68,-4,36,83,192,1,8,68,2010,Duo
2,I Need A Dollar,Aloe Blacc,pop soul,2010,2022‑02‑17,95,48,84,-7,9,96,243,20,3,72,2010,Solo


Despite its extension, the file above does not hold comma-separated values but semicolon-separated values. The `.csv` extension is not the most appropriate name here, but you will often find these scenarios in the wild, so it is better to get used to them.

**As a good developer should, use your intuition but be skeptical about it.**
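One way to act on that skepticism is to peek at the header line before trusting the extension. The helper below, `guess_delimiter`, is a hypothetical and deliberately naive sketch: it picks the candidate character that appears most often in the header. The standard library's `csv.Sniffer` class offers a more thorough approach.

```python
def guess_delimiter(header_line, candidates=(",", ";", "\t", "|")):
    """Return the candidate character that occurs most often in the header line."""
    counts = {sep: header_line.count(sep) for sep in candidates}
    return max(counts, key=counts.get)


# Header in the style of the semicolon-separated file above
header = "title;artist;top genre;year released"
print(repr(guess_delimiter(header)))
```

The guess could then be passed along as `sep=guess_delimiter(header)` when calling `pd.read_csv`, although checking the result by eye is still worthwhile.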

---

Alternatively, we can load the same data with pandas from [JSON](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html) and [Excel](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) files.

In [5]:
# read with JSON
pd.read_json('data/spotify_1.json').head(3)

# read from Excel file
pd.read_excel('data/spotify_1.xlsx').head(3)

Unnamed: 0,title,artist,top genre,year released,added,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,pop,top year,artist type
0,STARSTRUKK (feat. Katy Perry),3OH!3,dance pop,2009,2022‑02‑17,140,81,61,-6,23,23,203,0,6,70,2010,Duo
1,My First Kiss (feat. Ke$ha),3OH!3,dance pop,2010,2022‑02‑17,138,89,68,-4,36,83,192,1,8,68,2010,Duo
2,I Need A Dollar,Aloe Blacc,pop soul,2010,2022‑02‑17,95,48,84,-7,9,96,243,20,3,72,2010,Solo
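The examples above only read data. As a minimal sketch of the opposite direction, the snippet below writes a tiny DataFrame with `DataFrame.to_csv` and reads it back with `pd.read_csv`. The data is made up, and an in-memory `io.StringIO` buffer stands in for a file on disk so that the example does not depend on any path.

```python
import io

import pandas as pd

# Tiny hypothetical frame, standing in for the Spotify sample
df = pd.DataFrame({
    "title": ["I Need A Dollar", "Talking to the Moon"],
    "artist": ["Aloe Blacc", "Bruno Mars"],
    "bpm": [95, 146],
})

buffer = io.StringIO()                   # in-memory stand-in for a file
df.to_csv(buffer, sep=";", index=False)  # index=False: do not write the row labels
buffer.seek(0)                           # rewind before reading back

restored = pd.read_csv(buffer, sep=";")
print(restored)
```

The same separator has to be used on both sides; writing with `sep=";"` and reading with the default comma would reproduce the bundled-column problem shown earlier.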


## Open a file in Python 

Besides the alternatives provided by pandas, Python lets you **open any text file** with the built-in `open()` function, which is used for **reading and writing data**. It takes the file path as an argument and an optional access mode (among other parameters). The **access mode** specifies the main purpose of opening the file.

| Access Mode   | Purpose               |
|---------------|-----------------------|
| `r`           | Read only (default). |
| `w`           | Write only. If the file already exists, its contents are deleted and writing starts from the beginning of the file. |
| `x`           | Exclusive creation. If the file already exists, the operation fails. |
| `a`           | Append only. New content is added to the end of the file. |
| `b`           | Binary mode (combined with another mode, e.g. `rb`). |
| `+`           | Updating, i.e. reading and writing (combined with another mode, e.g. `r+`). |
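The following sketch tries out the `w`, `a`, `x` and `r` modes from the table. It works inside a throwaway temporary folder, and the file name `modes_demo.txt` is a hypothetical choice for this example.

```python
import os
import shutil
import tempfile

tmp_dir = tempfile.mkdtemp()                    # throwaway folder for the demo
path = os.path.join(tmp_dir, "modes_demo.txt")  # hypothetical file name

f = open(path, "w")     # 'w': create the file, or wipe it if it already exists
f.write("first line\n")
f.close()

f = open(path, "a")     # 'a': keep the contents and write at the end
f.write("second line\n")
f.close()

try:
    open(path, "x")     # 'x': fails because the file already exists
    exclusive_failed = False
except FileExistsError:
    exclusive_failed = True

f = open(path)          # no mode given, so 'r' is used
contents = f.read()
f.close()

shutil.rmtree(tmp_dir)  # clean up the throwaway folder

print(contents)
print(exclusive_failed)
```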

Let's see what happens when we try to read the data. Notice that the `open()` function **does not return the text contents of the file immediately** (tip: get `type(f)`). Instead, we need to use the `.readlines()` method in order to store each line as an element of a list. Alternatively, we could read the whole file contents with `.read()`.

After we are done with the file, it is **good practice** to **close the connection** to the file object with `.close()`. Check this [StackOverflow answer](https://stackoverflow.com/a/25070939) on "Why should I close files in Python?" for more context.

In [2]:
# create a file object
f = open('data/spotify_2.csv', 'r')

# create a list of lines
lines_list = f.readlines()

# close the connection
f.close()
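To compare the reading options side by side, here is a small self-contained sketch. It first writes a two-line hypothetical file into a temporary folder, so it does not depend on `data/spotify_2.csv` being available.

```python
import os
import shutil
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.csv")  # hypothetical file

f = open(path, "w")
f.write("title;artist\nI Need A Dollar;Aloe Blacc\n")
f.close()

f = open(path, "r")
print(type(f))           # a file object, not the text itself
whole_text = f.read()    # one string with the entire contents
f.seek(0)                # go back to the start before reading again
lines = f.readlines()    # list of strings, one per line, newline included
f.close()

f = open(path, "r")
stripped = [line.rstrip("\n") for line in f]  # iterating yields one line at a time
f.close()

shutil.rmtree(tmp_dir)   # clean up the temporary folder

print(lines)
print(stripped)
```

Iterating over the file object directly avoids holding every line in memory at once, which can help with larger files.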

### Slicing

To check the results of opening our file, notice how we use a **type of indexing** we haven't seen before: a **slice**. Instead of specifying a single element of the list (proper indexing), we provide ***start*** and ***stop*** **indices**, and the slice returns the contiguous elements between them, with the *stop* index excluded. If nothing is provided, the **default** is to **start at 0** (first element) and to **stop at the end of the list** (last element included). Note that an explicit stop of `-1` would leave the last element out. So, whenever we write `lines_list[:3]`, we are saying: get me the elements of `lines_list` from the start (0, since it's not specified) up to index 3 (not included), so we are basically asking for the first 3 elements.

Additionally, we can add an **optional third argument** to a slice which is called the **step**. The *step* sets the interval at which the elements are included in the slice.

In [7]:
# slicing - get first 3 lines
lines_list[:3]

['title;artist;top genre;year released;added;bpm;nrgy;dnce;dB;live;val;dur;acous;spch;pop;top year;artist type\n',
 'STARSTRUKK (feat. Katy Perry);3OH!3;dance pop;2009;2022‑02‑17;140;81;61;-6;23;23;203;0;6;70;2010;Duo\n',
 'My First Kiss (feat. Ke$ha);3OH!3;dance pop;2010;2022‑02‑17;138;89;68;-4;36;83;192;1;8;68;2010;Duo\n']

In [3]:
# slicing - start at 1, stop before 2, step of 2: only the element at index 1
lines_list[1:2:2]

['STARSTRUKK (feat. Katy Perry);3OH!3;dance pop;2009;2022‑02‑17;140;81;61;-6;23;23;203;0;6;70;2010;Duo\n']

In [8]:
# slicing - get all the lines, except the first, with steps of 2 (e.g. lines 1, 3, 5, 7, 9 if 0-indexed)
lines_list[1::2]

['STARSTRUKK (feat. Katy Perry);3OH!3;dance pop;2009;2022‑02‑17;140;81;61;-6;23;23;203;0;6;70;2010;Duo\n',
 'I Need A Dollar;Aloe Blacc;pop soul;2010;2022‑02‑17;95;48;84;-7;9;96;243;20;3;72;2010;Solo\n',
 "Nothin' on You (feat. Bruno Mars);B.o.B;atl hip hop;2010;2022‑02‑17;104;85;69;-6;9;74;268;39;5;79;2010;Solo\n",
 'The Time (Dirty Bit);Black Eyed Peas;dance pop;2010;2022‑02‑17;128;81;82;-8;60;44;308;7;7;75;2010;Band/Group\n',
 'Talking to the Moon;Bruno Mars;dance pop;2010;2022‑02‑17;146;59;50;-5;11;8;218;51;3;87;2010;Solo\n']
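The slicing rules from this section can also be checked on a small list that does not depend on the data file. The list `letters` below is a hypothetical example.

```python
letters = ["a", "b", "c", "d", "e", "f"]

print(letters[:3])     # ['a', 'b', 'c']            start defaults to 0
print(letters[3:])     # ['d', 'e', 'f']            stop defaults to the end of the list
print(letters[:-1])    # ['a', 'b', 'c', 'd', 'e']  -1 as stop excludes the last element
print(letters[1::2])   # ['b', 'd', 'f']            every second element from index 1
print(letters[1:2:2])  # ['b']                      the step has nothing left to skip to
print(letters[::-1])   # ['f', 'e', 'd', 'c', 'b', 'a']  negative step walks backwards
```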

### `with` statement

The `with` statement simplifies the way to open and handle files in Python. Using `with`, Python:
  1. **automatically closes** the file you are working with;
  2. creates a **context manager** that helps exception handling.

Instead of assigning the result of `open()` to a variable and closing it ourselves, we name the file object after `as` (here `file_object`) and indent the code that uses it inside the `with` block.


```python
with open('my_file.txt') as file_object:
    ...  # your code here
```

In [9]:
# write a single phrase to a file
with open('data/my_file.txt', 'w') as file:
    file.write("My first file!")
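To see the automatic closing in action, the sketch below checks the `closed` attribute of the file object inside and outside the `with` block, including the case where an error is raised. The file lives in a temporary folder, and the `ValueError` is a simulated failure introduced only for this example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "with_demo.txt")  # hypothetical file name

    with open(path, "w") as file:
        file.write("My first file!")
        inside = file.closed        # still open inside the block

    after = file.closed             # closed as soon as the block ends

    try:
        with open(path) as file:
            raise ValueError("simulated failure")
    except ValueError:
        closed_after_error = file.closed  # closed even though an error occurred

print(inside, after, closed_after_error)
```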

## Other File Formats

Beyond the traditional file types we've already covered, you should also be aware of some alternatives that may be better suited to your needs depending on the type of application you are developing. 


| Format     | Properties                                                                  | Appropriate For                      | Extension     |
|------------|-----------------------------------------------------------------------------|--------------------------------------|---------------|
| parquet    | columnar file format, stores metadata, optimized to speed up column queries | big data; Hadoop ecosystem           | `.parquet`      |
| pickle     | serializes Python objects and saves them to a file                          | any Python object                    | `.pickle`, `.pkl` |
| plain text | unformatted text                                                            | unstructured data; small data files  | `.txt`          |
| feather    | fast, lightweight, language-agnostic                                        | dataframes                           | `.feather`      |
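As a small taste of one of these formats, here is a minimal sketch of a pickle round trip using Python's built-in `pickle` module. The `song` dictionary is made up, and the object is serialized to bytes in memory instead of a `.pkl` file to keep the example self-contained.

```python
import pickle

# Hypothetical Python object to serialize
song = {"title": "I Need A Dollar", "bpm": 95, "genres": ("pop soul",)}

payload = pickle.dumps(song)      # Python object -> bytes
restored = pickle.loads(payload)  # bytes -> Python object

print(type(payload))
print(restored == song)
```

To store the object on disk, `pickle.dump(song, file)` and `pickle.load(file)` work with a file opened in binary mode (`'wb'` and `'rb'`). Only unpickle data you trust, since loading a pickle can execute arbitrary code.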

## Recap

Congratulations, you made it all the way through the "Load Data from Files" unit! By now, you should be able to get some data into Python, transform it and save it afterwards. By the end of this notebook, you should have a clear idea of:
  1. File Formats
  2. How to leverage pandas to read/write data
  3. Using the pythonic way to read & write data
  4. Alternative file formats