Read data from plain text source files, which come in different formats and encodings and can be large, into a pandas `DataFrame`, and identify the data's granularity.

## 8.1 Plain Text Format
Plain text files are easy to read with a text editor such as **Sublime, Vim, or Emacs**.

### 1. Delimited Format

1. Each line represents a record; lines end with a newline, `\n` or `\r\n`.
2. Within each line, a special character (the delimiter) separates the data values:
    - comma: CSV
    - tab: TSV
    - white space(s) or colons
3. The first line typically contains the names of the table's columns (features).
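
A minimal sketch of how the delimiter changes the call, assuming pandas is installed. The two in-memory strings are made-up stand-ins for a CSV and a TSV file; `sep` is the `pd.read_csv` parameter that names the delimiter.

```python
import io
import pandas as pd

# Made-up sample text standing in for files on disk.
csv_text = 'business_id,score\n19,94\n24,98\n'
tsv_text = 'business_id\tscore\n19\t94\n24\t98\n'

# The first line supplies the column names; sep names the delimiter.
from_csv = pd.read_csv(io.StringIO(csv_text))            # comma is the default
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t')  # tab-delimited

# Same records, different delimiters: the resulting tables match.
print(from_csv.equals(from_tsv))
```

Once loaded, the delimiter no longer matters; both calls yield the same table.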

People often confuse CSV and TSV files with spreadsheets. This is in part because most spreadsheet software (like Microsoft Excel) will automatically display a CSV file as a table in a workbook. Behind the scenes, Excel looks at the file format and encoding just like we’ve done in this section. However, Excel files have a different format than CSV and TSV files, and we need to use different pandas functions to read these formats into Python.

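#### filename extension
The extension (e.g., `.csv`, `.tsv`, `.txt`) is an indicator of the format: it sets expectations and suggests how to read the file, but it does not guarantee what the file contains.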

It’s good practice to inspect the contents of the file before loading it into a data frame. If the file is not too large, you can open and examine it with a plain text editor. Otherwise, you can view a few lines using `.readline()` or shell commands.

#### pathlib
The built-in `pathlib` library has a useful `Path` object for specifying paths to files and folders in a way that works across platforms.

Paths are tricky when working across different operating systems (OS). For instance, a typical path in Windows might look like `C:\files\data.csv`, while a path in Unix or macOS might look like `~/files/data.csv`. Because of this, code that works on one OS can fail to run on other operating systems.

The `pathlib` Python library was created to avoid OS-specific path issues. By using it, the code shown here is more portable: it works across Windows, macOS, and Unix.

```
from pathlib import Path

# Build the path piece by piece; Path inserts the right separator for the OS
p = Path('/Users/jasonwu/GitHub/ds/data-100') / 'lec' / 'lec06' / 'data' / 'inspections.csv'
print(p)
```

```
/Users/jasonwu/GitHub/ds/data-100/lec/lec06/data/inspections.csv
```


```
def head(filepath, n=5, width=-1):
    '''Print the first n lines of filepath, truncated to width characters if width >= 0'''
    with filepath.open() as f:
        for _ in range(n):
            line = f.readline().rstrip('\n')  # drop the newline; print adds its own
            print(line if width < 0 else line[:width])

head(p, width=65)
```

```
"business_id","score","date","type"
19,"94","20160513","routine"
19,"94","20171211","routine"
24,"98","20171101","routine"
24,"98","20161005","routine"
```



### 2. Fixed-width Format

The fixed-width format (FWF) does not use delimiters to separate data values. Instead, the values for a specific field appear in the exact same position in each line.

Because values are aligned from one row to the next, some of them can look squished together, with nothing separating adjacent fields. The `codebook` that accompanies the data provides the position of each field, along with information for some basic checks.

Consider a file with more than 200 thousand lines and over 280 million characters: on average, there are about 1200 characters per line. This might be why its publishers chose a fixed-width rather than a CSV format. Think how much larger the file would be if there was a comma between every field!

Use `pandas.read_fwf` with `colspecs` to read a subset of the fields in a large fixed-width file:
```
import pandas as pd

# (start, end) character positions of each field, taken from the codebook
colspecs = [(0, 6), (14, 29), (33, 35), (35, 37), (37, 39), (1213, 1214)]
varNames = ["id", "wt", "age", "sex", "race", "type"]
dawn = pd.read_fwf('data/DAWN-Data.txt', colspecs=colspecs,
                   header=None, index_col=0, names=varNames)
```
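
To see how positions alone separate the fields, here is a small self-contained sketch, assuming pandas is installed. The three records and their layout (id in positions 0-5, age in 6-7, sex in 8) are invented for illustration and are not the DAWN layout.

```python
import io
import pandas as pd

# Made-up fixed-width records: id in columns 0-5, age in 6-7, sex in 8.
# Nothing separates age from sex, so only the positions tell them apart.
fwf_text = '00000134M\n00000227F\n00000351M\n'

colspecs = [(0, 6), (6, 8), (8, 9)]   # half-open [start, end) positions
names = ['id', 'age', 'sex']
small = pd.read_fwf(io.StringIO(fwf_text), colspecs=colspecs,
                    header=None, names=names)
print(small)
```

Each `colspecs` tuple is a half-open interval, so `(6, 8)` covers the two characters at positions 6 and 7.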

### 3. Hierarchical Formats

Data are stored in a nested form; examples include JSON, XML, and HTML.
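
A rough sketch of turning nested data into table-like rows, using only the standard library. The JSON text is made up for illustration (businesses that each hold a list of inspections); real files will nest differently.

```python
import json

# Made-up nested JSON: each business holds a list of its inspections.
text = '''
[{"business_id": 19, "inspections": [{"score": 94, "date": "20160513"},
                                     {"score": 94, "date": "20171211"}]},
 {"business_id": 24, "inspections": [{"score": 98, "date": "20171101"}]}]
'''

businesses = json.loads(text)

# Flatten to one row per inspection, copying the parent's id into each row.
rows = [
    {'business_id': b['business_id'], **insp}
    for b in businesses
    for insp in b['inspections']
]
print(len(rows), rows[0])
```

A list of flat dictionaries like `rows` can be passed directly to `pd.DataFrame`.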

### 4. Loosely Formatted Text

##### Loosely formatted:
The text follows an organizational pattern but has no consistent delimiters.

##### Examples:
Web logs contain information such as the date and time and the type of request made to a website:
```
169.237.46.168 - -
[26/Jan/2004:10:47:58 -0800]"GET /stat141/Winter04 HTTP/1.1" 301 328
"http://anson.ucdavis.edu/courses"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"
```
Wireless device log: reports the timestamp, identifier, location of the device, and the signal strengths that it picks up from other devices. This information uses a combination of formats: _key=value pairs, semicolon delimited, and comma delimited values_
```
t=1139644637174;id=00:02:2D:21:0F:33;pos=2.0,0.0,0.0;degree=45.5;
00:14:bf:b1:97:8a=-33,2437000000,3;00:14:bf:b1:97:8a=-38,2437000000,3;
```
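
One possible way to pull the wireless log line apart is to peel off the delimiters one layer at a time. This is a sketch under assumptions: it treats the first four fields as a header and everything after as per-device readings, which matches the sample line but may not hold for every record. The function name `parse_wireless` is hypothetical.

```python
line = ('t=1139644637174;id=00:02:2D:21:0F:33;pos=2.0,0.0,0.0;degree=45.5;'
        '00:14:bf:b1:97:8a=-33,2437000000,3;00:14:bf:b1:97:8a=-38,2437000000,3;')

def parse_wireless(line, n_header=4):
    '''Split one log line into header fields and per-device readings.'''
    # Semicolons separate fields; the trailing one leaves an empty string.
    fields = [f for f in line.strip().split(';') if f]
    # Each field is key=value; split on the first '=' only.
    pairs = [f.split('=', 1) for f in fields]
    header = dict(pairs[:n_header])
    header['pos'] = [float(x) for x in header['pos'].split(',')]
    # Remaining fields: device address = comma-delimited numbers.
    readings = [(mac, [int(v) for v in values.split(',')])
                for mac, values in pairs[n_header:]]
    return header, readings

header, readings = parse_wireless(line)
print(header['id'], len(readings), readings[0])
```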

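### Note: file format and structure of data are different
1. file format: how the data are stored in the file
2. data structure: a mental representation of how the data are organized, such as a pandas DataFrame or a SQL relation. The same structure can be stored in text, hierarchical, or binary files.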

## 8.2 Encoding

#### categories

So far, we have used the term ‘plain text’ to broadly cover formats that can be viewed with a text editor. However, a plain text file may have different encodings, and if we don’t specify the encoding correctly, the values in the data frame might contain gibberish. We give an overview of file encoding next.

Computers store only 0s and 1s, so we need a map (an encoding) to translate those bits into human-readable text. Common encodings:

1. ASCII
2. Latin-1 (ISO-8859-1)
3. UTF-8 (backwards compatible with ASCII)
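
A small illustration of why the choice of map matters, using only built-in string methods. The word `café` is an arbitrary example.

```python
text = 'café'

utf8_bytes = text.encode('utf-8')      # é takes two bytes
latin1_bytes = text.encode('latin-1')  # é takes one byte

# Pure ASCII text is byte-for-byte identical under all three encodings.
same_ascii = ('cafe'.encode('ascii') == 'cafe'.encode('utf-8')
              == 'cafe'.encode('latin-1'))

# Decoding with the wrong map does not fail here; it silently garbles.
garbled = utf8_bytes.decode('latin-1')

print(len(utf8_bytes), len(latin1_bytes), same_ascii, garbled)
```

The reverse mistake, decoding `latin1_bytes` as UTF-8, raises a `UnicodeDecodeError` instead of producing gibberish.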

#### figuring out encoding

1. check the data's documentation
2. check the metadata
3. use the `chardet` package: its `detect()` function infers a file's encoding and reports its confidence in that guess

```
from pathlib import Path
import chardet

line = '{:<25} {:<10} {}'.format

# for each file, print its name, encoding & confidence in the encoding
print(line('File Name', 'Encoding', 'Confidence'))

for filepath in Path().glob('*'):
    if filepath.is_file():  # skip folders, which cannot be read as bytes
        result = chardet.detect(filepath.read_bytes())
        # encoding is None when chardet cannot make a guess
        print(line(str(filepath), str(result['encoding']), result['confidence']))
```

```
File Name                 Encoding   Confidence
wrangling.ipynb           utf-8      0.99
```


#### specify the encoding when loading data

Pass the encoding's name as a string, for example `pd.read_csv(filename, encoding='ISO-8859-1')`.
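
A sketch of what the `encoding` argument changes, assuming pandas is installed. The bytes below are a made-up one-row file saved as Latin-1; the business name is invented.

```python
import io
import pandas as pd

# Made-up file contents, encoded as Latin-1 (ISO-8859-1).
raw = 'name,city\nCafé Réveille,San Francisco\n'.encode('latin-1')

# With the default UTF-8 assumption, the accented bytes cannot be decoded.
try:
    pd.read_csv(io.BytesIO(raw))
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Naming the right encoding recovers the text.
df = pd.read_csv(io.BytesIO(raw), encoding='latin-1')
print(utf8_failed, df.loc[0, 'name'])
```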

## 8.3 Size

#### Examine size

##### 1. os
Begin data work by making sure the files are of manageable size. `os.path.getsize` returns a file's size in bytes:

`os.path.getsize(filepath)`
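
A runnable sketch of the call. It writes a tiny throwaway file first so that it does not depend on any particular data file; the file contents are made up.

```python
import os
import tempfile

# Write a small throwaway file so the sketch runs anywhere.
with tempfile.NamedTemporaryFile('wb', suffix='.csv', delete=False) as f:
    f.write(b'business_id,score\n19,94\n')
    tmp_path = f.name

size_bytes = os.path.getsize(tmp_path)   # size on disk, in bytes
print(f'{tmp_path}: {size_bytes} bytes ({size_bytes / 1024:.2f} KiB)')

os.remove(tmp_path)                      # clean up the throwaway file
```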

##### 2. Command Line Interface (CLI)
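**_Commands_** are available in a shell interpreter (sh, bash, or zsh) and perform operations on files. Each shell has its own syntax, built-in commands, and language, but commands share a general form:

`command -options arguments`

CLI tools are particularly useful in the following situations:
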
Documentation
- if you need to record what you did

Error reduction
- if you want to reduce typographical errors and other simple but potentially harmful mistakes

Reproducibility
- if you need to repeat the same process in the future or you plan to share your process with others, you have a record of your actions

Volume
- if you have many repetitive operations to perform, the size of the file you are working with is large, or you need to perform things quickly, then CLI tools can help.



`ls`

lists files; useful options:
`-l`: provide extra information about each file;
`-h`: provide file sizes in a more human-readable format;
`-L`: follow symbolic links and report on the files they point to

`du`

disk usage; shows the size in units called blocks

`-s`: summarize, showing one total for each file or folder given;
`-h`: display quantities in the standard KiB, MiB, GiB format

`wc`

word count: reports the number of lines, words, and characters in a file
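
When no shell is at hand, similar counts can be approximated in Python. This is a rough stand-in and not a reimplementation of `wc`: it counts bytes, which equal characters only for single-byte encodings such as ASCII. It reads one line at a time, so the whole file never has to fit in memory.

```python
import io

def wc(stream):
    '''Return (lines, words, bytes) for a binary stream, one line at a time.'''
    n_lines = n_words = n_bytes = 0
    for raw in stream:               # iterating reads one line per step
        n_lines += raw.count(b'\n')
        n_words += len(raw.split())  # split on runs of whitespace
        n_bytes += len(raw)
    return n_lines, n_words, n_bytes

# Made-up two-line sample standing in for a file opened with open(path, 'rb').
sample = io.BytesIO(b'19,"94","20160513","routine"\n24,"98","20171101","routine"\n')
print(wc(sample))
```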

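`head`, `tail`

display the first (`head`) or last (`tail`) 10 lines of a file by default

`-n #` or `-#`: display # lines of the file

`cat`

concatenate: print the entire contents of a file

**Take care when using this command, as printing a large file can cause a crash.**

`file`

helps us determine a file’s encoding; `-I` asks for the MIME type and character set (macOS syntax; GNU/Linux `file` uses lowercase `-i`)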

```
file -I data/*

data/DAWN-Data.txt:   text/plain; charset=us-ascii
data/businesses.csv:  application/csv; charset=iso-8859-1
data/co2_mm_mlo.txt:  text/plain; charset=us-ascii
data/inspections.csv: application/csv; charset=us-ascii
data/legend.csv:      application/csv; charset=us-ascii
data/violations.csv:  application/csv; charset=us-ascii

```

#### Deal with large files



Very large data sets arise in scientific domains like astronomy, where telescopes capture images of space that can be petabytes in size. While their data are not quite as big, social media giants and health care providers also work with large data sets.

##### 1. Subset The Data
We can either select a specific part of the data (e.g., one day’s worth), or randomly sample the data set.

This approach is simple, but we may lose many of the benefits of a large data set, such as the ability to study rare events.
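
One way to draw a random subset without loading the whole file is to read it in chunks and sample each chunk. This is a sketch assuming pandas is installed; `sample_csv` is a hypothetical helper and the 1,000-row input is made up.

```python
import io
import pandas as pd

def sample_csv(source, frac, chunksize=100_000, seed=0):
    '''Randomly keep about `frac` of the rows, reading one chunk at a time.'''
    kept = []
    # chunksize makes read_csv return an iterator of DataFrames.
    for i, chunk in enumerate(pd.read_csv(source, chunksize=chunksize)):
        # Vary the seed per chunk so chunks do not all keep the same positions.
        kept.append(chunk.sample(frac=frac, random_state=seed + i))
    return pd.concat(kept)

# Made-up data: 1,000 rows standing in for a much larger file.
big = io.StringIO('x\n' + '\n'.join(str(i) for i in range(1000)) + '\n')
subset = sample_csv(big, frac=0.1, chunksize=100)
print(len(subset))
```

To take a specific part instead of a random one, `pd.read_csv` also accepts `nrows` to stop after a fixed number of rows.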

##### 2. Data System
Relational database management systems (RDBMS) are specifically designed to store large data sets.

Downsides:
- an RDBMS typically requires a separate server for the data, which needs its own configuration
- SQL is less flexible in what it can compute than Python, which becomes especially relevant for modeling

Hybrid approach:
- SQL: subset, aggregate, or sample the data
- Python: more sophisticated analysis on the smaller result
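
The hybrid approach can be sketched with the standard library's `sqlite3` module standing in for a full RDBMS. The table name, columns, and four rows are assumptions modeled on the inspections rows printed earlier.

```python
import sqlite3

conn = sqlite3.connect(':memory:')   # throwaway in-memory database
conn.execute('CREATE TABLE inspections (business_id INTEGER, score INTEGER, date TEXT)')
conn.executemany(
    'INSERT INTO inspections VALUES (?, ?, ?)',
    [(19, 94, '20160513'), (19, 94, '20171211'),
     (24, 98, '20171101'), (24, 98, '20161005')],
)

# SQL step: subset to 2017 and aggregate per business inside the database.
query = '''
    SELECT business_id, COUNT(*) AS n, AVG(score) AS mean_score
    FROM inspections
    WHERE date >= '20170101'
    GROUP BY business_id
    ORDER BY business_id
'''
rows = conn.execute(query).fetchall()
conn.close()

# Python step: only the small result crosses into memory for further analysis.
print(rows)
```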

##### 3. Distributed Computing System
Examples: MapReduce, Spark, and Ray.

These systems work best on tasks that can be split into many smaller parts where they divide up data sets into smaller pieces and run programs on all of the smaller data sets at once. These systems have great flexibility and can be used in a variety of scenarios. Their main downside is that they can require a lot of work to install and configure properly because they are typically installed across many computers that need to coordinate with each other.



#### Read data into Memory (RAM)

All Python code requires the use of RAM, no matter how short the code is. A computer’s RAM is typically much smaller than its disk storage.

Multiple | Notation | Number of Bytes
--- | --- | --- |
Kibibyte|KiB|1024
Mebibyte|MiB|1024²
Gibibyte|GiB|1024³
Tebibyte|TiB|1024⁴
Pebibyte|PiB|1024⁵

You also see the typical SI prefixes used to describe size—kilobytes, megabytes, and gigabytes, for example. Unfortunately, these prefixes are used inconsistently. Sometimes a kilobyte refers to 1000 bytes; other times, a kilobyte refers to 1024 bytes. To avoid confusion, we stick to kibi-, mebi-, and gibibytes which clearly represent multiples of 1024.


As a rule of thumb, reading in a file using pandas usually requires available memory of at least five times the file size. Memory is shared by all programs running on a computer, including the operating system, web browsers, and Jupyter notebook itself.
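
The unit table and the rule of thumb can be combined into a quick check. Both helper names are hypothetical, the sizes are illustrative, and the factor of five is only the heuristic stated above, not a guarantee.

```python
UNITS = ['B', 'KiB', 'MiB', 'GiB', 'TiB', 'PiB']

def format_bytes(n_bytes):
    '''Express a byte count in the largest binary unit that keeps it >= 1.'''
    size = float(n_bytes)
    for unit in UNITS:
        if size < 1024 or unit == UNITS[-1]:
            return f'{size:.1f} {unit}'
        size /= 1024

def likely_fits(file_bytes, available_bytes, factor=5):
    '''Rule of thumb: need roughly `factor` times the file size in free RAM.'''
    return file_bytes * factor <= available_bytes

file_size = 280_000_000      # hypothetical file size in bytes
free_ram = 4 * 1024**3       # hypothetical: 4 GiB of free memory
print(format_bytes(file_size), likely_fits(file_size, free_ram))
```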

## 8.4 Shape and Granularity 

Shape: the number of rows and columns in the table.

Granularity: what each row in the table represents.

#### Primary Key and Foreign Keys

One field as a unique identifier (its values never repeat): `len(DataFrame) == DataFrame[id].nunique()`

Two or more (combination of multiple fields); every group size should be 1:
`DataFrame.groupby([field1, field2]).size().sort_values()`

Once we have identified the primary and foreign keys of several tables, we can potentially join these tables.
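
A short sketch of both checks, assuming pandas is installed. The four rows mirror the inspections rows printed earlier; `is_key` is a hypothetical helper.

```python
import pandas as pd

# Rows modeled on the first lines of the inspections file.
insp = pd.DataFrame({
    'business_id': [19, 19, 24, 24],
    'date': ['20160513', '20171211', '20171101', '20161005'],
    'score': [94, 94, 98, 98],
})

def is_key(df, fields):
    '''True if no two rows share the same values in `fields`.'''
    return not df.duplicated(subset=fields).any()

print(is_key(insp, ['business_id']))           # one field alone
print(is_key(insp, ['business_id', 'date']))   # combination of two fields

# Group sizes, largest last: any size above 1 marks a repeated combination.
sizes = insp.groupby(['business_id', 'date']).size().sort_values()
```

Here `business_id` repeats, so it cannot identify an inspection on its own, while the pair of business and date does.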

#### Weight

In the DAWN survey data (the fixed-width file read earlier), weights are provided so that the sample reflects the sampling scheme and is representative of the population of all drug-related ER visits in a year. We must apply the weight to each record when we compute summary statistics, build histograms, and fit models. (The `wt` field contains these values.)

The weights take into account the chance of an ER visit like this one appearing in the sample. By “like this one” we mean a visit with similar features, such as the visitor age, race, visit location, and time of day.

It is critical to include the survey weights in your analysis to get data that represents the population at large.
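
A toy calculation showing how weights shift a summary statistic. The two ages and weights are invented and are not DAWN values.

```python
# Made-up records: each sampled visit has an age and a survey weight.
ages = [20, 40]
wts = [1.0, 3.0]   # the second visit stands in for three times as many visits

unweighted_mean = sum(ages) / len(ages)
# Weighted mean: each value counts in proportion to its weight.
weighted_mean = sum(w * a for w, a in zip(wts, ages)) / sum(wts)

print(unweighted_mean, weighted_mean)
```

With arrays, `numpy.average(ages, weights=wts)` computes the same weighted mean.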

## 8.5 Questions

1. What does a record represent?

2. Do all records in a table capture granularity at the same level? Sometimes a table contains additional summary rows that have a different granularity, and you want to use only those rows that are at the right level of detail.

3. If the data were aggregated, how was the aggregation performed, and what kind of aggregation was it (e.g., a sum or an average)?

The wrangling techniques in this chapter help us bring data from a source file into a data frame and understand its structure. Once we have a data frame, further wrangling is needed to assess and improve quality and prepare the data for analysis.