# Data File Types

Data can come in many different forms and, for our purposes, can be stored in various formats. In this notebook we will touch on some of the different file types you may encounter.

## What we will accomplish

In this notebook we will discuss the following file types:
- Basic text or `.txt` files,
- Comma separated values or `.csv` files,
- Tab separated values or `.tsv` files,
- JavaScript object notation or `.json` files,
- Columnar data or `.parquet` files,
- Zipped files, and
- Compressed tar files. 

## Uncompressed files

### Basic .txt files

A file that ends in `.txt` is a plain text file. It contains a sequence of characters (letters, numbers, special characters, etc.) with no additional formatting. To see an example of this file type, open `sample_text_file.txt` from the `data` folder.

These files are often used to store text-based data like transcripts, essays, raw tweets (meaning without metadata), and more. You can open and read these files using `open()` as described in <a href="Reading and Writing to File.ipynb">Reading and Writing to File</a>.

In [5]:
with open("../data/sample_text_file.txt", "r") as file:
    print(file.read())


Hello this is a sample text file :).


### .csv files

A `.csv` or comma separated values file, sometimes called a <i>comma delimited file</i>, is a file containing data organized into columns and rows. Each column in the file is separated by a `,` and each row is placed on a new line within the file. We can think of these as a table of data represented in a text file. For an example see `sample_csv_file.csv` in the `data` folder.


While we could also read these in using `open()`, it is easier to use a package like `pandas`. In the code below we read in `sample_csv_file.csv` using the `read_csv()` function, <a href="https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html">https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html</a>.

In [10]:
## importing pandas as pd
import pandas as pd


In [None]:
csv_table = pd.read_csv("../data/sample_csv_file.csv")
print(type(csv_table))
print(csv_table)
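
To make the "table stored as text" idea concrete, here is a minimal sketch that parses comma separated text with Python's built-in `csv` module instead of `pandas`. The `csv_text` string is a made-up stand-in for the contents of a small file, and `io.StringIO` is used only so the example runs without the `data` folder.

```python
import csv
import io

## made-up stand-in for the contents of a small .csv file
csv_text = "column_1,column_2,column_3\n1,a,8\n2,b,5\n3,c,3\n"

## io.StringIO lets us treat the string like an opened file
with io.StringIO(csv_text) as file:
    reader = csv.reader(file)
    header = next(reader)   ## the first line holds the column names
    rows = list(reader)     ## every remaining line is one row of data

print(header)
print(rows)
```

Notice that every value comes back as a string. Converting columns to numbers is one of the chores that `pandas` handles for us.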

### .tsv files

A `.tsv` or tab separated values file, sometimes called a <i>tab delimited file</i>, is just like a `.csv` file except that the columns of data are separated with tabs, or `\t` characters, instead of commas. For an example see `sample_tsv_file.tsv` in the `data` folder.

These can be read in using the `.read_table()` function in `pandas`, <a href="https://pandas.pydata.org/docs/reference/api/pandas.read_table.html">https://pandas.pydata.org/docs/reference/api/pandas.read_table.html</a>.

In [None]:
table = pd.read_table("../data/sample_tsv_file.tsv")
print(table)
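
The only difference from the comma separated case is the delimiter. As a rough sketch using only the standard library, we can pass `delimiter="\t"` to the `csv` module. The `tsv_text` string below is made up for illustration and is not the contents of the sample file.

```python
import csv
import io

## made-up stand-in for the contents of a small .tsv file
tsv_text = "column_1\tcolumn_2\tcolumn_3\n1\ta\t8\n2\tb\t5\n"

with io.StringIO(tsv_text) as file:
    ## DictReader uses the first line as keys, delimiter="\t" splits on tabs
    records = list(csv.DictReader(file, delimiter="\t"))

print(records[0])
```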

### JSON files

A `.json` or JavaScript object notation file is a file type that stores data the way it is typically transported between applications and servers online. The structure of the JSON format is quite similar to that of a Python dictionary (`dict`), as we will soon see. To look at an example open `miserables.json`, which is a JSON file taken from the d3.js GitHub repository, <a href="https://github.com/d3/d3-plugins/blob/master/graph/data/miserables.json">https://github.com/d3/d3-plugins/blob/master/graph/data/miserables.json</a>. This file stores the characters from the novel <u>Les Miserables</u> by Victor Hugo as a series of nodes and links in order to describe a character network. Here each node is a character in the novel and a link between A and B represents the number of chapters in which characters A and B both appear.

We can load in this file using `open()` and the `json` module.

In [None]:
## Import json
import json

In [None]:
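## open the file for reading
## (this path assumes miserables.json sits in the data folder like the other samples)
file = open("../data/miserables.json", "r")

## we can load in the data with json.load()
les_mis = json.load(file)

file.close()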

In [None]:
## As we can see the data is read in as a dictionary
print(type(les_mis))

In [None]:
## see :)
print(les_mis.keys())
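
If you do not have the file handy, the following sketch shows the same idea on a tiny made-up string. The `nodes` and `links` keys mirror the structure described above, but the names and values here are invented for illustration. Note that `json.loads()` parses a string, while `json.load()` parses an opened file.

```python
import json

## a tiny made-up network with the nodes/links shape described above
json_text = """
{
  "nodes": [{"name": "A", "group": 1}, {"name": "B", "group": 1}],
  "links": [{"source": 0, "target": 1, "value": 3}]
}
"""

## json.loads() turns the JSON text into Python objects
network = json.loads(json_text)

print(type(network))
print(list(network.keys()))

## nested JSON objects become nested dictionaries and lists
print(network["links"][0]["value"])
```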


### Parquet files

Parquet files (`.parquet`) offer a different way of storing data that can significantly improve query speed, lower memory requirements, and speed up some calculations when compared to other file types like `.csv` and `.tsv`.

File types like `.csv` and `.tsv` take a row-based approach to data storage; that is, each row represents an observation and each column a variable. Parquet takes what is known as a columnar approach to storing data. In columnar data storage the layout is flipped: each row represents a variable and each column represents an observation.

For example `sample_csv_file.csv` looks like this in storage:

`column_1, column_2, column_3`

`1, 'a', 8`

`2, 'b', 5`

`3, 'c', 3`

`4, 'd', 2`

`5, 'e', 7`.


The equivalent `.parquet` file would theoretically look like this in storage:


`column_1: 1, 2, 3, 4, 5`

`column_2: 'a', 'b', 'c', 'd', 'e'`

`column_3: 8, 5, 3, 2, 7`

However, we cannot open a `.parquet` file with a text editor and see this layout, because the data is stored in a binary format rather than as plain text.

Columnar formats have advantages when storing large data files. Because the values in a column share a single type and are often similar to one another, they can be encoded and compressed in ways that row-based formats cannot, which greatly reduces the amount of data that needs to be stored.
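
The two layouts can be mimicked with plain Python objects. This is only an illustration of the idea using the sample values shown above, not of how Parquet actually encodes data on disk.

```python
## row-based: one tuple per observation, as in a .csv file
rows = [
    (1, "a", 8),
    (2, "b", 5),
    (3, "c", 3),
    (4, "d", 2),
    (5, "e", 7),
]
column_names = ["column_1", "column_2", "column_3"]

## columnar: one list per variable
## zip(*rows) regroups the values so each group holds a single column
columns = {name: list(values) for name, values in zip(column_names, zip(*rows))}

print(columns["column_3"])

## summing one variable only touches a single list in the columnar layout
total_from_columns = sum(columns["column_3"])

## while the row-based layout has to walk through every row
total_from_rows = sum(row[2] for row in rows)

print(total_from_columns, total_from_rows)
```

Both sums agree, but the columnar version never has to look at `column_1` or `column_2`, which hints at why queries on a few columns can be faster.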

For our purposes, we can read in a `.parquet` file with the `pandas` `read_parquet` function, <a href="https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html">https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html</a>. When we read in a `.parquet` file this way, the result is a regular `pandas` `DataFrame`, just as with `.csv` and `.tsv` files. Note that in order to use `read_parquet` you must have either `pyarrow` or `fastparquet` installed:
- `pyarrow` installation instructions: <a href="https://arrow.apache.org/docs/python/install.html">https://arrow.apache.org/docs/python/install.html</a>
- `fastparquet` installation instructions: <a href="https://github.com/dask/fastparquet/#installation">https://github.com/dask/fastparquet/#installation</a>.

If you are unsure how to install a Python package, you can learn more from the Python package installation guide on the Erd&#337;s Institute Python prep website.

The parquet format is actually a little more complicated than what was presented above, but that is a good introduction. If you would like to learn more about `.parquet` files check out the optional notebook, <a href="Parquet Files (Optional).ipynb">Parquet Files (Optional)</a>.
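
As a hedged sketch of what this looks like in practice, the code below first checks whether `pandas` and one of the two engines are installed, then writes a small made-up `DataFrame` to a temporary `.parquet` file and reads it back. The file name `example.parquet` and the data are hypothetical and only exist for this illustration.

```python
import importlib.util
import os
import tempfile

## find_spec returns None when a package is not installed
engines = [name for name in ("pyarrow", "fastparquet")
           if importlib.util.find_spec(name) is not None]
pandas_available = importlib.util.find_spec("pandas") is not None

if engines and pandas_available:
    import pandas as pd

    ## a small made-up table
    df = pd.DataFrame({"column_1": [1, 2, 3], "column_2": ["a", "b", "c"]})

    with tempfile.TemporaryDirectory() as folder:
        ## hypothetical file name inside a throwaway folder
        path = os.path.join(folder, "example.parquet")
        df.to_parquet(path)                 ## write the DataFrame to parquet
        round_trip = pd.read_parquet(path)  ## read it back into a DataFrame

    print(round_trip)
else:
    print("Install pandas plus pyarrow or fastparquet to run this example.")
```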

## Compressed file types

If you have a particularly large set of data, perhaps spanning multiple files, it is likely that you will initially receive it in a compressed format. In order to access the data you will first need to uncompress the files. Let's review how to do so for two of the most popular compressed file types.

### Zipped files

You have most likely encountered these before. Zipped files are those that end with `.zip`. To uncompress these you can do the following:

#### MacOS

Double click on the `.zip` file in the Finder.

#### Windows

Locate the `.zip` file and double click.

#### Linux

To uncompress a `.zip` file on a Linux machine follow the instructions at this post, <a href="https://linuxize.com/post/how-to-unzip-files-in-linux/">https://linuxize.com/post/how-to-unzip-files-in-linux/</a>.

You can practice using the `example.zip` file in the `data` folder.
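
Regardless of operating system, Python's built-in `zipfile` module offers another way to uncompress these files. The sketch below builds a small throwaway archive first so that it runs on its own. The names `demo.zip`, `hello.txt`, and `unzipped` are made up for illustration.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as folder:
    ## build a small throwaway archive so the example needs no real file
    archive_path = os.path.join(folder, "demo.zip")
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("hello.txt", "Hello from inside a zip file.")

    ## uncompress every file in the archive into a new folder
    output_folder = os.path.join(folder, "unzipped")
    with zipfile.ZipFile(archive_path, "r") as archive:
        member_names = archive.namelist()
        archive.extractall(output_folder)

    ## the extracted file can now be read like any other text file
    with open(os.path.join(output_folder, "hello.txt"), "r") as file:
        extracted_text = file.read()

print(member_names)
print(extracted_text)
```

To try this on a real file, you would replace `archive_path` with the path to your own `.zip` file and skip the step that builds the archive.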

### Tar files

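Another very popular compressed file type is known as a <i>tar file</i> or <i>tarball</i>. These files end in `.tar` or `.tar.gz`. Strictly speaking, a plain `.tar` file only bundles files together into a single archive, while the `.gz` ending indicates that the archive has also been compressed with gzip.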

These can be uncompressed like so:

#### MacOS

You can double click the file, or execute `tar -xzf file_name.tar.gz` or `tar -xf file_name.tar` in the command prompt, depending on how the filename ends. Note, this assumes you are in the correct directory on your computer.

#### Windows

Follow the instructions found at <a href="https://www.7-zip.org/">https://www.7-zip.org/</a>. This piece of software will enable you to extract the data in a tar file.

#### Linux

Follow the command prompt instructions under the MacOS header.


You can practice this using the `example.tar.gz` file in the `data` folder.
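
Tar files can also be handled from Python with the built-in `tarfile` module, which works the same way on every operating system. The sketch below creates its own small `.tar.gz` file before extracting it. The names `demo.tar.gz`, `hello.txt`, and `untarred` are made up for illustration.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as folder:
    ## write a small text file and pack it into a throwaway .tar.gz archive
    source_path = os.path.join(folder, "hello.txt")
    with open(source_path, "w") as file:
        file.write("Hello from inside a tarball.")

    archive_path = os.path.join(folder, "demo.tar.gz")
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(source_path, arcname="hello.txt")

    ## newer Python versions accept filter="data" to block unsafe archive members
    extract_options = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}

    ## "r:gz" reads a gzip compressed archive, use "r" for a plain .tar file
    output_folder = os.path.join(folder, "untarred")
    with tarfile.open(archive_path, "r:gz") as archive:
        member_names = archive.getnames()
        archive.extractall(output_folder, **extract_options)

    with open(os.path.join(output_folder, "hello.txt"), "r") as file:
        extracted_text = file.read()

print(member_names)
print(extracted_text)
```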

We now know of the most common data file types. Let's move on and learn how we might find some data files of our own.

--------------------------

This notebook was written for the Erd&#337;s Institute C&#337;de Data Science Boot Camp by Matthew Osborne, Ph. D., 2023.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md)