### Path

Paths are tricky when working across different operating systems (OSs). For instance, a typical path in Windows might look like `C:\files\data.csv`, while a path in Unix or macOS might look like `~/files/data.csv`. Because of this, code that works on one OS can fail to run on another.

The `pathlib` Python library was created to avoid OS-specific path issues. By using it, the code shown here is more portable: it works across Windows, macOS, and Unix.

In [14]:
from pathlib import Path

# Create a Path pointing to our data file
insp_path = Path() / 'data' / 'inspections.csv'

In [15]:
text = insp_path.read_text()
# Print first five lines
print('\n'.join(text.split('\n')[:5]))

business_id,score,date,type
19,94,20160513,routine
19,94,20171211,routine
24,98,20171101,routine
24,98,20161005,routine
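
To see what the `/` operator does under each platform's rules, the sketch below uses `PurePosixPath` and `PureWindowsPath`. These `pathlib` classes only manipulate path strings and never touch the file system, so both flavors can be tried on any OS. The directory and file names are the ones used above.

```python
from pathlib import PurePosixPath, PureWindowsPath

# The same join expression, evaluated with each OS's path rules
posix_path = PurePosixPath('data') / 'inspections.csv'
windows_path = PureWindowsPath('data') / 'inspections.csv'

print(posix_path)    # data/inspections.csv
print(windows_path)  # data\inspections.csv

# The path components are the same regardless of the separator
print(posix_path.parts == windows_path.parts)  # True
print(windows_path.suffix)                     # .csv
```

In everyday code, plain `Path` is usually enough: it picks the flavor that matches the OS the code is running on.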


#### File Encoding

When we have a text file, we usually need to figure out its encoding.  

Computers store data as sequences of bits: 0s and 1s. Character encodings, like ASCII, tell the computer how to translate between bits and text. For example, in ASCII, the bits 1000001 stand for the letter A and 1000010 for B. The most basic kind of plain text supports only standard ASCII characters, which include the uppercase and lowercase English letters, numbers, punctuation symbols, and spaces.

ASCII does not include many special characters or characters from other languages. More modern character encodings can represent many more characters. Common encodings for documents and web pages are Latin-1 (ISO-8859-1) and UTF-8. UTF-8 can represent over a million characters and is backward compatible with ASCII, meaning that it uses the same representation for English letters, numbers, and punctuation as ASCII.
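
A quick way to see these encodings at work is to look at the bits and bytes Python produces for a few characters. This is a small illustrative sketch using only built-in functions; the sample characters are arbitrary choices.

```python
# ASCII assigns each character a number that fits in 7 bits
for ch in 'AB':
    print(ch, ord(ch), format(ord(ch), '07b'))
# A 65 1000001
# B 66 1000010

# UTF-8 uses the same single byte as ASCII for these characters
print('A'.encode('ascii') == 'A'.encode('utf-8'))  # True

# A character outside ASCII gets different bytes in each encoding
print('Ñ'.encode('latin-1'))  # b'\xd1'
print('Ñ'.encode('utf-8'))    # b'\xc3\x91'

# ASCII cannot represent it at all
try:
    'Ñ'.encode('ascii')
except UnicodeEncodeError as err:
    print('ASCII failed:', err.reason)
```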



In [5]:
#!pip install chardet

Collecting chardet
  Downloading chardet-5.2.0-py3-none-any.whl.metadata (3.4 kB)
Downloading chardet-5.2.0-py3-none-any.whl (199 kB)
Installing collected packages: chardet
Successfully installed chardet-5.2.0


In [19]:
import chardet

# for each file, print its name, encoding & confidence in the encoding
print('File Name', 'Encoding', 'Confidence')

for filepath in Path('data').glob('*'):
    result = chardet.detect(filepath.read_bytes())
    print(str(filepath), result['encoding'], result['confidence'])

File Name Encoding Confidence
data/inspections.csv ascii 1.0
data/businesses.csv ISO-8859-1 0.73


In [20]:
import chardet

line = '{:<25} {:<10} {}'.format

# for each file, print its name, encoding & confidence in the encoding
print(line('File Name', 'Encoding', 'Confidence'))

for filepath in Path('data').glob('*'):
    result = chardet.detect(filepath.read_bytes())
    print(line(str(filepath), result['encoding'], result['confidence']))


File Name                 Encoding   Confidence
data/inspections.csv      ascii      1.0
data/businesses.csv       ISO-8859-1 0.73


`line = '{:<25} {:<10} {}'.format`  
binds the `format` method of a template string to the name `line`, so calling `line(a, b, c)` fills in the three fields. The `<` in a field requests left alignment within a column of a given width:

`{:<25}` → Pads the first value to at least 25 characters and aligns it to the left.  
`{:<10}` → Pads the second value to at least 10 characters and aligns it to the left.  
`{}` → The third value has no specified width, so it takes up only as much space as it needs.
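
The same format specification can be tried in isolation. This sketch reuses the template above with one of the rows from the output and shows an f-string that produces the same text; the variable names are illustrative.

```python
line = '{:<25} {:<10} {}'.format

row = line('data/inspections.csv', 'ascii', 1.0)
print(repr(row))
print(len(row))  # 25 + 1 + 10 + 1 + 3 = 40

# Equivalent f-string with the same widths
name, enc, conf = 'data/inspections.csv', 'ascii', 1.0
same = f'{name:<25} {enc:<10} {conf}'
print(row == same)  # True

# The width is a minimum: longer values are not truncated
print(repr('{:<5}|'.format('ISO-8859-1')))  # 'ISO-8859-1|'
```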

In [11]:
import pandas as pd

In [29]:
# naively reads file without considering encoding
pd.read_csv('data/businesses.csv')


UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 234790: invalid continuation byte

To successfully read the data, we must specify the ISO-8859-1 encoding:

In [23]:
bus = pd.read_csv('data/businesses.csv', encoding='ISO-8859-1')
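
The error message can be reproduced on a tiny scale. The sketch below builds a short byte string by hand (a made-up sample, not taken from businesses.csv) that contains the byte 0xd1 and decodes it both ways. In UTF-8, 0xd1 announces a two-byte character, so the decoder expects a continuation byte next and fails when it finds an ordinary letter. In ISO-8859-1, every byte maps to exactly one character, so decoding always succeeds.

```python
raw = b'PE\xd1A'  # hypothetical bytes; 0xd1 is Ñ in ISO-8859-1

try:
    raw.decode('utf-8')
    utf8_error = None
except UnicodeDecodeError as err:
    utf8_error = err
    print(err)

decoded = raw.decode('ISO-8859-1')
print(decoded)  # PEÑA
```

Because ISO-8859-1 never raises a decoding error, a successful read does not by itself confirm that the encoding is right. It is worth looking at a few values that contain non-English characters after loading.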

### Table Shape and Granularity

We refer to a dataset’s structure as a mental representation of the data, and in particular, we represent data that have a table structure by arranging values in rows and columns. We use the term granularity to describe what each row in the table represents, and the term shape quantifies the table’s rows and columns.

In [25]:
bus = pd.read_csv('data/businesses.csv', encoding='ISO-8859-1')
insp = pd.read_csv("data/inspections.csv")
viol = pd.read_csv("data/violations.csv")

In [26]:
print(" Businesses:", bus.shape, "\t Inspections:", insp.shape, 
     "\t Violations:", viol.shape)

 Businesses: (6406, 9) 	 Inspections: (14222, 4) 	 Violations: (39042, 3)


The name of the field business_id suggests that it uniquely identifies each restaurant. We can confirm this by checking whether the number of records in the dataframe matches the number of unique values in business_id:

In [27]:
print("Number of records:", len(bus))
print("Number of unique business ids:", len(bus['business_id'].unique()))

Number of records: 6406
Number of unique business ids: 6406


The number of unique business_ids matches the number of rows in the table, so it seems safe to assume that each row represents a restaurant.

For the inspections table, we can check whether the combination of business_id and date identifies each row by counting the rows for each pair:

In [28]:
(insp
 .groupby(['business_id', 'date'])  # Grouping by 'business_id' and 'date'
 .size()                            # Counting the number of occurrences per group
 .sort_values(ascending=False)      # Sorting in descending order
 .head(5)                           # Selecting the top 5 results
)


business_id  date    
64859        20150924    2
87440        20160801    2
77427        20170706    2
19           20160513    1
71416        20171213    1
dtype: int64
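
The counts show that a few (business_id, date) pairs occur twice, so this pair does not quite identify a row. One way to pull out the repeated rows for a closer look is `DataFrame.duplicated` with `keep=False`. The sketch below runs on a small made-up table with the same columns as the inspections data (the values are invented for illustration), not on the real file.

```python
import pandas as pd

# Toy stand-in for the inspections table; values are invented
toy = pd.DataFrame({
    'business_id': [19, 19, 24, 24, 24],
    'score': [94, 94, 98, 96, 98],
    'date': [20160513, 20171211, 20171101, 20171101, 20161005],
    'type': ['routine'] * 5,
})

key = ['business_id', 'date']

# True for every row whose key appears more than once
is_dup = toy.duplicated(subset=key, keep=False)
dups = toy[is_dup].sort_values(key)
print(dups)

# A key identifies rows only if no key value repeats
print('key is unique:', not toy.duplicated(subset=key).any())
```

With `keep=False`, every member of a repeated group is flagged, which makes it easy to compare the rows side by side and decide whether they are true duplicates or distinct records.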

#### Why Use Parentheses?
Readability: Breaking the method chain across several lines, one call per line, makes each step easy to read and to comment.  
Avoiding line continuation errors: Inside parentheses, Python allows a line to begin with a dot (.), so we do not need a trailing \ at the end of each line.  
Ensuring proper execution: Everything between the opening and closing parentheses is parsed as a single expression, so the whole chain is evaluated as one statement.
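
The rule is not specific to pandas. As a minimal sketch, the same chained call on a plain string (an invented example) can be written with backslashes or with parentheses; only the parenthesized form leaves room for a comment on each line.

```python
text = '  Routine,Follow-Up,routine  '

# Backslash continuation: works, but nothing may follow the backslash
kinds_backslash = text \
    .strip() \
    .lower() \
    .split(',')

# Parentheses: no backslashes, and each step can carry a comment
kinds_parens = (text
    .strip()      # drop surrounding whitespace
    .lower()      # normalize case
    .split(',')   # one entry per comma-separated value
)

print(kinds_parens)                     # ['routine', 'follow-up', 'routine']
print(kinds_backslash == kinds_parens)  # True
```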