# INTRODUCTION TO IMPORTING DATA IN PYTHON

One of the core strengths of Python as a data science, engineering, and automation language is its extensive capability to **import and export data** from a wide range of sources. Whether working with flat files, files produced by other software, or relational databases, Python provides robust tools to read, process, and write data efficiently.


### Types of Data Sources

#### 1. Flat Files

- Text files (`.txt`)
- Comma-Separated Values (`.csv`)
- Tab-delimited files (`.tsv`)
- These are the most common formats for raw or exported data.

#### 2. Files from Other Software

- Excel files (`.xlsx`, `.xls`)
- SPSS, Stata, SAS
- Binary formats and data from statistical or business software

#### 3. Relational Databases

- SQL-based databases: SQLite, PostgreSQL, MySQL, etc.
- Non-relational and semi-structured sources (MongoDB, JSON, XML) can also be handled, but they fall outside the relational model and require different approaches.

### Reading Text Files in Python

The foundational tool for file I/O in Python is the built-in `open()` function, which provides an interface to interact with files in various modes:

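- `'r'` – read (default); raises `FileNotFoundError` if the file does not exist
- `'w'` – write (creates the file, or truncates an existing one)
- `'a'` – append (adds to the end of the file, creating it if needed)
- `'b'` – binary, combined with one of the above (e.g. `'rb'`) for non-text data; reads return `bytes`
- `'t'` – text (default), combined with one of the above (e.g. `'rt'`); reads return `str`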
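
As a rough illustration of how the text and binary flags change what a read returns, the sketch below writes a small file and reads it back twice. The file name `modes_demo.txt` and the temporary directory are assumptions made only so the example runs anywhere.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory (illustration only)
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'modes_demo.txt')

f = open(path, mode='w', encoding='utf-8')   # 'w' is shorthand for 'wt'
f.write('café\n')
f.close()

f = open(path, mode='r', encoding='utf-8')   # Text mode: bytes on disk are decoded to str
as_text = f.read()
f.close()

f = open(path, mode='rb')                    # Binary mode: raw bytes, no decoding
as_bytes = f.read()
f.close()

print(type(as_text).__name__, repr(as_text))
print(type(as_bytes).__name__, repr(as_bytes))
```

The binary read shows the accented character as two UTF-8 bytes, which is why non-text data should be opened with `'b'`.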

#### Basic Pattern: Reading a Text File

```python
filename = 'data.txt'
file = open(filename, mode='r')  # Open file for reading
text = file.read()               # Read entire file contents as a single string
file.close()                     # Always close the file to free resources
```

##### Why Close the File?

- Closing the file releases system resources, flushes any buffered data, and avoids resource leaks.
- Failing to close files, especially in long-running programs or scripts processing many files, can lead to subtle bugs or performance issues.
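
One way to guarantee the close happens, sketched here under the assumption of a throwaway file named `close_demo.txt`, is to wrap the work in `try`/`finally` and inspect the file object's `closed` attribute:

```python
import os
import tempfile

# Hypothetical scratch file so the sketch runs anywhere
path = os.path.join(tempfile.mkdtemp(), 'close_demo.txt')

file = open(path, mode='w')
try:
    file.write('buffered text\n')   # May sit in an in-memory buffer until flushed
    print(file.closed)              # False: the handle is still open
finally:
    file.close()                    # Runs even if write() raised; flushes the buffer

print(file.closed)                  # True: resources have been released
closed_after = file.closed
```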

### Writing to a Text File

Writing data follows the same `open()` pattern, with mode set to `'w'` (write):

```python
filename = 'output.txt'
file = open(filename, mode='w')  # Open file for writing (creates/overwrites)
file.write("Hello, world!\n")    # Write text to file
file.close()                     # Always close after writing
```

- Be aware that opening a file in write mode will erase its previous contents.
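
The difference between overwriting and appending can be seen in a short sketch. The file name `output_demo.txt` is hypothetical and lives in a temporary directory.

```python
import os
import tempfile

# Hypothetical scratch file (illustration only)
path = os.path.join(tempfile.mkdtemp(), 'output_demo.txt')

file = open(path, mode='w')
file.write('first\n')
file.close()

file = open(path, mode='w')      # Truncates the file: 'first' is gone
file.write('second\n')
file.close()

file = open(path, mode='a')      # Appends after the existing content
file.write('third\n')
file.close()

file = open(path, mode='r')
contents = file.read()
file.close()

print(contents)
```

Only `second` and `third` survive, because the second `'w'` call discarded what the first one wrote.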

### The Pythonic Way: Context Managers with `with`

Using the `with` statement (context manager) is the **recommended, Pythonic approach** for file I/O. It ensures that the file is **automatically closed** when the block is exited, even if errors occur.

```python
with open('data.txt', 'r') as file:
    data = file.read()
    # The file is open inside this block

# Once outside, file is automatically closed
```

- Context managers improve reliability and code readability.
- The same pattern works for reading, writing, and appending (`'r'`, `'w'`, `'a'`, etc.).
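
The sketch below, which assumes a hypothetical scratch file `with_demo.txt`, shows a `with` block used for writing and then for reading, including the case where an exception is raised inside the block:

```python
import os
import tempfile

# Hypothetical scratch file (illustration only)
path = os.path.join(tempfile.mkdtemp(), 'with_demo.txt')

# Writing: the file is flushed and closed when the block ends
with open(path, 'w') as out_file:
    out_file.write('line one\n')

# Reading: an exception inside the block still closes the file
try:
    with open(path, 'r') as in_file:
        first_line = in_file.readline()
        raise ValueError('simulated failure while processing')
except ValueError:
    pass

print(out_file.closed, in_file.closed)   # True True
print(repr(first_line))
```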

### Why These Techniques Matter

- Data rarely comes in a single, ready-to-use format. Mastering Python’s I/O enables seamless integration with diverse data sources and downstream workflows.
- Robust file handling practices (closing files, using context managers) prevent bugs, resource leaks, and data corruption.
- This foundational knowledge prepares you to work with higher-level libraries (such as `pandas`, `csv`, `json`, `openpyxl`, and `sqlalchemy`) for more sophisticated data import/export tasks.

```python
# Open the file moby_dick.txt as read-only using a with statement and bind it to the variable file.
with open("data/moby_dick.txt") as file:
    print(file.read())
```

CHAPTER 1. Loomings.

Call me Ishmael. Some years ago--never mind how long precisely--having
little or no money in my purse, and nothing particular to interest me on
shore, I thought I would sail about a little and see the watery part of
the world. It is a way I have of driving off the spleen and regulating
the circulation. Whenever I find myself growing grim about the mouth;
whenever it is a damp, drizzly November in my soul; whenever I find
myself involuntarily pausing before coffin warehouses, and bringing up
the rear of every funeral I meet; and especially whenever my hypos get
such an upper hand of me, that it requires a strong moral principle to
prevent me from deliberately stepping into the street, and methodically
knocking people's hats off--then, I account it high time to get to sea
as soon as I can. This is my substitute for pistol and ball. With a
philosophical flourish Cato throws himself upon his sword; I quietly
take to the ship. There is nothing surprising in this. If th

### Importing text files line by line
For large files, you may not want to print the entire contents to the shell; often the first few lines are enough. The `.readline()` method supports this: when a file object named `file` is open, `file.readline()` returns the first line. Calling it again returns the second line, and so on.

A context manager binds the open file to a variable for the duration of the `with` block:

```python
with open('huck_finn.txt') as file:
    ...  # `file` is available inside this block
```
While execution remains inside this block, the variable `file` refers to the file object returned by `open('huck_finn.txt')`. To print the first line of the file to the shell, all the code you need to execute is:
```python
with open('huck_finn.txt') as file:
    print(file.readline())
```

You'll now use these tools to print the first few lines of moby_dick.txt!

```python
# Open moby_dick.txt using the with context manager and the variable file.
with open("data/moby_dick.txt") as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())
```

CHAPTER 1. Loomings.



Call me Ishmael. Some years ago--never mind how long precisely--having
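
The extra blank lines in this output appear because each string returned by `.readline()` keeps its trailing newline, and `print()` then adds one of its own. The sketch below uses an in-memory `io.StringIO` object as a stand-in for the text file (an assumption, so that it runs without `moby_dick.txt`); the sample text is copied from the output above.

```python
import io
from itertools import islice

# In-memory stand-in for a text file (assumption, keeps the sketch self-contained)
sample = io.StringIO(
    'CHAPTER 1. Loomings.\n'
    '\n'
    'Call me Ishmael. Some years ago--never mind how long precisely--having\n'
)

first = sample.readline()
print(repr(first))        # The trailing '\n' is part of the returned string
print(first, end='')      # end='' stops print() from adding a second newline

# A file object is also an iterator over its lines; islice() takes the next two
remaining = [line.rstrip('\n') for line in islice(sample, 2)]
print(remaining)
```

Iterating with `islice()` is one option for taking the first few lines without repeating `.readline()` calls; a plain `for` loop with a counter would work equally well.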




## The Importance of Flat Files in Data Science

### What Are Flat Files?

**Flat files** are the most fundamental and widely used format for storing and exchanging tabular data. They consist of plain text files in which each row represents a single record (observation) and each column represents a feature or attribute (field) of that record. Because of their simplicity and universal compatibility, flat files underpin the vast majority of data workflows, data exchanges, and archival systems in both research and industry.

### Key Concepts

- **Record**: A row in the file; each record is an individual observation or entry.
- **Field / Attribute**: A column; each field is a property, variable, or feature of the record.
- **Header Row**: Typically, the first row in the file specifies the names of each column, providing context for the data.
- **Delimiters**: Characters that separate fields within a record—commonly commas (`,`), tabs (`\t`), or semicolons (`;`).
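
These terms can be made concrete with a few lines of plain Python. The sketch below splits a small, hypothetical sample held in a string; it is a naive illustration, not a robust parser.

```python
# Hypothetical flat-file contents, held in a string for illustration
raw = 'name,age,score\nAlice,23,88\nBob,21,93\n'

delimiter = ','
lines = raw.splitlines()

header = lines[0].split(delimiter)                        # Header row: column names
records = [line.split(delimiter) for line in lines[1:]]   # One list of fields per record

print(header)
for record in records:
    # Pair each field with its column name
    print(dict(zip(header, record)))
```

A plain `split()` breaks down when a field itself contains the delimiter (for example, a quoted `"Smith, Jane"`), which is one reason dedicated readers such as the standard-library `csv` module exist.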

### Common File Extensions and Formats

- `.csv` – **Comma-Separated Values** (most common): Each field is separated by a comma.
- `.txt` – **Plain Text**: Can be structured with various delimiters (commas, tabs, spaces).
- `.tsv` – **Tab-Separated Values**: Each field is separated by a tab character.

**Example CSV:**
```
name,age,score
Alice,23,88
Bob,21,93
```

**Example Tab-Delimited (.txt or .tsv):**
```
name    age    score
Alice   23     88
Bob     21     93
```
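
As a sketch of how the two layouts relate, the standard-library `csv` module can read both by changing only the delimiter. The strings below mirror the examples above, assuming the tab-delimited version separates fields with a single tab character (`\t`); `read_records` is a hypothetical helper name.

```python
import csv
import io

csv_text = 'name,age,score\nAlice,23,88\nBob,21,93\n'
tsv_text = 'name\tage\tscore\nAlice\t23\t88\nBob\t21\t93\n'

def read_records(text, delimiter):
    """Return a list of dicts, one per record, keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)

from_csv = read_records(csv_text, ',')
from_tsv = read_records(tsv_text, '\t')

print(from_csv[0])
print(from_csv == from_tsv)   # True: same records, different delimiter
```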

### Flat Files in Practice

#### Why Are Flat Files So Important in Data Science?

- **Universality**: Supported by virtually all software tools, programming languages, databases, and operating systems.
- **Human-Readable**: Easily inspected and edited with basic text editors.
- **Portable**: Easy to share, version, and archive.
- **Simplicity**: No embedded formulas, macros, or binary structures—just raw data.
- **Interoperability**: Used as the lingua franca for data exchange between disparate systems.

#### Data Science Scenarios

- **Data Acquisition**: Many open datasets (e.g., from Kaggle, UCI, government portals) are distributed as flat files.
- **Intermediate Processing**: Data pipelines often use CSV or TSV as staging or checkpoint formats.
- **Archival**: Flat files are ideal for long-term storage and future-proofing data against software obsolescence.

### Importing Flat Files in Python

Two primary packages dominate the import of flat files in the data science ecosystem:

- **NumPy**: Optimised for fast import of numeric tabular data into arrays (ideal for large, homogeneous datasets).
    - Functions like `numpy.loadtxt()` and `numpy.genfromtxt()` read in delimited text files.
- **pandas**: The de facto library for data manipulation, capable of importing flat files with both numeric and string data, handling headers, missing values, and various delimiters.
    - The `pandas.read_csv()` and `pandas.read_table()` functions are highly flexible and feature-rich.

#### Importing Examples

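```python
import numpy as np
import pandas as pd

# NumPy: assumes data.csv holds only numeric columns; skiprows=1 skips its header row
numeric_data = np.loadtxt('data.csv', delimiter=',', skiprows=1)

# pandas: handles headers and mixed data types by default
df = pd.read_csv('data.csv')
```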
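
For a version that runs without a `data.csv` on disk, the sketch below feeds the same kind of content through `io.StringIO`. The sample rows are hypothetical, and `usecols` is used so that NumPy only sees the numeric columns.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical file contents; io.StringIO stands in for a file on disk
text = 'name,age,score\nAlice,23,88\nBob,21,93\n'

# NumPy: skip the header and keep only the numeric columns (indices 1 and 2)
numeric = np.loadtxt(io.StringIO(text), delimiter=',', skiprows=1, usecols=(1, 2))
print(numeric.shape)   # (2, 2)
print(numeric.dtype)   # float64

# pandas: the header becomes column labels and each column gets its own dtype
df = pd.read_csv(io.StringIO(text))
print(df.dtypes)
```

NumPy returns a single homogeneous array, whereas pandas keeps the string column alongside the numeric ones, which is the practical reason to prefer it for mixed data.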

### Beyond the Basics: Limitations and Considerations

- **No Data Types**: A flat file stores every field as text; types must be inferred by the reader or specified explicitly during import.
- **No Metadata**: Flat files lack embedded information about units, relationships, or formats.
- **Scalability**: Very large flat files can be unwieldy and require chunked or streamed processing.
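
The first and last points can be sketched together using only the standard library: rows are streamed one at a time, and each text field is converted explicitly. The data and the helper name `stream_records` are hypothetical; with a real file, the `io.StringIO` object would be replaced by a handle from `open()`.

```python
import csv
import io

# Hypothetical data; in practice this would be a large file opened with open()
source = io.StringIO('name,age,score\nAlice,23,88\nBob,21,93\nCara,25,79\n')

def stream_records(handle):
    """Yield one typed record at a time instead of loading the whole file."""
    for row in csv.DictReader(handle):
        # Every field arrives as a str; convert the numeric ones explicitly
        yield {'name': row['name'], 'age': int(row['age']), 'score': int(row['score'])}

total = 0
count = 0
for record in stream_records(source):
    total += record['score']
    count += 1

mean_score = total / count
print(mean_score)
```

Because the generator holds only one record in memory at a time, the same loop works regardless of file size; pandas offers a comparable option through the `chunksize` parameter of `read_csv()`.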