# 22MIC0041
# KHUSHI KS



### Overview: The `pd.read_csv()` Function

The `pd.read_csv()` function is the primary tool for **reading data from a text file (like a CSV or TSV) and turning it into a Pandas DataFrame**. A DataFrame is a 2-dimensional, tabular data structure (like a spreadsheet or a SQL table) which is the fundamental object you work with in Pandas for data analysis.

This function is flexible because text files come in many different, often messy, formats. `read_csv` exposes around **50 parameters** (the exact count varies by pandas version) to handle most formatting quirks you are likely to encounter.

---

### Key Concepts and Parameters Explained

The pandas user guide groups these options into the following areas:

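#### 1. Basic Reading (`filepath_or_buffer`, `sep`)
*   **`filepath_or_buffer`**: The first argument and the only required one. It can be a local file path, a URL, or any file-like object with a `read()` method. A plain string is always treated as a path or URL, so CSV text held in memory must be wrapped in `io.StringIO` first.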
    ```python
    df = pd.read_csv('local_file.csv')        # Local file
    df = pd.read_csv('https://example.com/data.csv') # From a URL
    ```
*   **`sep` / `delimiter`**: Specifies the character that separates (delimits) each field. The default is a comma (`,`).
    ```python
    df = pd.read_csv('file.tsv', sep='\t')    # For tab-separated files (TSV)
    df = pd.read_csv('file.txt', sep='|')     # For pipe-separated files
    ```
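
Because a plain string is interpreted as a path, CSV text that already lives in memory needs a file-like wrapper. A minimal sketch, assuming made-up pipe-separated sample text (the variable names `raw_text` and `df_pipe` are illustrative only):

```python
import io

import pandas as pd

# Hypothetical in-memory data; the pipe character separates the fields.
raw_text = "name|age|city\nAlice|25|New York\nBob|30|London\n"

# Wrap the string in StringIO so read_csv receives a file-like object.
df_pipe = pd.read_csv(io.StringIO(raw_text), sep="|")

print(df_pipe.shape)          # (2, 3)
print(list(df_pipe.columns))  # ['name', 'age', 'city']
```

The same pattern is handy for small, reproducible examples because nothing has to be written to disk.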

#### 2. Handling Column Names (`header`, `names`)
*   **`header`**: Specifies which row (0-indexed) to use as the column names. The default, `header='infer'`, behaves like `header=0` (use the first row) unless `names` is passed, in which case it behaves like `header=None`.
    ```python
    df = pd.read_csv('file.csv', header=0)    # Default: first row is headers
    df = pd.read_csv('file.csv', header=None) # No headers, use numbers 0, 1, 2...
    ```
*   **`names`**: Provides a list of column names to use. If you have a file without headers, this is essential.
    ```python
    df = pd.read_csv('no_header_file.csv', header=None, names=['Name', 'Age', 'City'])
    ```
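
`header` and `names` also interact when a file already has a header row that you want to replace. The sketch below, which assumes a small invented file with cryptic column names, shows that `header=0` must be stated explicitly in that case; otherwise the old header is read as a data row:

```python
import io

import pandas as pd

# Hypothetical file whose existing header row uses unhelpful names.
csv_text = "nm,ag,cty\nAlice,25,New York\nBob,30,London\n"

# header=0 discards the original header row; names supplies the replacements.
df_renamed = pd.read_csv(
    io.StringIO(csv_text), header=0, names=["Name", "Age", "City"]
)

# With names alone, pandas assumes there is no header, so "nm,ag,cty"
# ends up as the first data row.
df_wrong = pd.read_csv(io.StringIO(csv_text), names=["Name", "Age", "City"])

print(len(df_renamed))  # 2
print(len(df_wrong))    # 3
```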

#### 3. Handling the Index (`index_col`)
*   **`index_col`**: Tells pandas which column to use as the row index (labels) for the DataFrame instead of creating a default 0, 1, 2... index.
    ```python
    df = pd.read_csv('file.csv', index_col=0)   # Use first column as index
    df = pd.read_csv('file.csv', index_col='id') # Use column named 'id' as index
    ```
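
One practical benefit of setting the index at read time is label-based lookup with `.loc`. A minimal sketch, assuming an invented table keyed by an `id` column:

```python
import io

import pandas as pd

# Hypothetical data with a unique identifier column.
csv_text = "id,name,age\n101,Alice,25\n102,Bob,30\n"

# Use the 'id' column as the row index instead of the default 0, 1, 2...
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col="id")

# Rows can now be selected by their id label.
bob_age = df_indexed.loc[102, "age"]
print(bob_age)  # 30
```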

#### 4. Data Types and Parsing (`dtype`, `parse_dates`)
*   **`dtype`**: Allows you to specify the data type for specific columns. This can improve performance and memory usage, and it prevents pandas from guessing the wrong type (e.g., reading a ZIP code such as `02134` as the integer `2134` and dropping the leading zero).
    ```python
    df = pd.read_csv('file.csv', dtype={'zipcode': str, 'phone_number': str})
    ```
*   **`parse_dates`**: Attempts to parse specified columns into datetime objects. This is much better than working with dates as strings.
    ```python
    df = pd.read_csv('file.csv', parse_dates=['birth_date', 'order_date'])
    ```
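
The effect of both parameters is easiest to see side by side. This sketch assumes a tiny invented file with a ZIP code column and an ISO-formatted date column, and compares the inferred types with the explicitly requested ones:

```python
import io

import pandas as pd

# Hypothetical data: ZIP codes with leading zeros and ISO dates.
csv_text = "zipcode,order_date\n02134,2024-01-15\n90210,2024-02-01\n"

# Default inference: zipcode becomes an integer and loses its leading zero.
df_guess = pd.read_csv(io.StringIO(csv_text))

# Explicit types: zipcode stays text, order_date becomes a datetime column.
df_typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"zipcode": str},
    parse_dates=["order_date"],
)

print(df_guess["zipcode"].iloc[0])              # 2134
print(df_typed["zipcode"].iloc[0])              # 02134
print(df_typed["order_date"].dt.month.tolist()) # [1, 2]
```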

#### 5. Dealing with Messy Data (`na_values`, `skiprows`, `comment`)
*   **`na_values`**: Additional strings to recognize as NaN/missing values, on top of the defaults pandas already treats as missing (such as `''`, `'N/A'`, `'NULL'`, and `'NaN'`). It is most useful for dataset-specific placeholders like `-1`.
    ```python
    df = pd.read_csv('file.csv', na_values=['N/A', 'NULL', '', '-1'])
    ```
*   **`skiprows`**: Skip a number of rows at the start of the file or skip specific row numbers. Useful if the file has metadata or comments at the top.
    ```python
    df = pd.read_csv('file.csv', skiprows=3)   # Skip first 3 rows
    df = pd.read_csv('file.csv', skiprows=[0, 2, 4]) # Skip specific rows
    ```
*   **`comment`**: A single character (e.g., `#`) that marks the rest of a line as a comment. Everything from that character to the end of the line is dropped, and a line that starts with it is skipped entirely.
    ```python
    df = pd.read_csv('config.csv', comment='#') # Drop everything from # to the end of each line
    ```
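
These three parameters are often combined. The sketch below assumes an invented export with one metadata line, one comment line, a custom missing-value marker (`missing`), and a trailing inline comment:

```python
import io

import pandas as pd

# Hypothetical messy export.
csv_text = (
    "Report: hypothetical export\n"                     # metadata line
    "# fully commented line, ignored by the parser\n"
    "name,score\n"
    "Alice,90\n"
    "Bob,missing\n"
    "Carol,75# late submission\n"
)

df_clean = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=1,             # drop the metadata line
    comment="#",            # drop comment lines and inline comments
    na_values=["missing"],  # treat the custom marker as NaN
)

print(list(df_clean.columns))             # ['name', 'score']
print(df_clean["score"].isna().tolist())  # [False, True, False]
```

Note that `skiprows` counts physical lines, including commented ones, so the metadata line is removed before the comment handling applies.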

#### 6. Performance and Large Files (`usecols`, `nrows`, `chunksize`)
*   **`usecols`**: Read only a subset of columns. This saves memory and time if you don't need all columns.
    ```python
    df = pd.read_csv('huge_file.csv', usecols=['name', 'email']) # Only these two cols
    ```
*   **`nrows`**: Read only a specific number of rows. Great for getting a preview of a huge file before loading it completely.
    ```python
    df_preview = pd.read_csv('huge_file.csv', nrows=1000) # Load first 1000 rows
    ```
*   **`chunksize`**: This is a game-changer for massive files that don't fit into memory. Instead of returning one DataFrame, it returns an **iterator** where each chunk is a manageable-sized DataFrame. You then process each chunk one at a time.
    ```python
    chunk_iterator = pd.read_csv('massive_file.csv', chunksize=100000)
    for chunk in chunk_iterator:
        # Process each 100,000-row chunk here
        print(f"Chunk shape: {chunk.shape}")
    ```
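
In practice each chunk usually feeds a running result rather than being printed. A minimal sketch, assuming an invented ten-row table with an `amount` column, that accumulates a total without ever holding the full file in one DataFrame:

```python
import io

import pandas as pd

# Hypothetical data: ten rows with amounts 0 through 9.
csv_text = "region,amount\n" + "".join(f"r{i % 3},{i}\n" for i in range(10))

total = 0
row_count = 0

# Each iteration yields a DataFrame of at most 4 rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["amount"].sum()
    row_count += len(chunk)

print(total)      # 45
print(row_count)  # 10
```

Sums, counts, minima, and maxima combine cleanly across chunks; statistics such as the median need a different strategy because they cannot be merged from per-chunk results.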

---



In [3]:
import pandas as pd
import numpy as np
import os

# 1. READ A CSV FILE
# Using pd.read_csv() to read a CSV file into a DataFrame
# Let's first create a sample CSV file to work with

# Create sample data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney'],
    'Salary': [50000, 60000, 70000, 55000, 65000]
}

# 2. WRITE NUMERIC DATA INTO A CSV FILE
# Create a DataFrame from our sample data
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df.to_csv('employee_data.csv', index=False)
print("CSV file 'employee_data.csv' created successfully!")

# 3. EXTRACT CONTENTS OF A CSV FILE INTO A PANDAS DATAFRAME
# Read the CSV file we just created
df_from_csv = pd.read_csv('employee_data.csv')

# Display the DataFrame
print("\nContents of the CSV file:")
print(df_from_csv)

# 4. APPEND TO A CSV FILE
# Create new data to append
new_data = {
    'Name': ['Frank', 'Grace'],
    'Age': [40, 27],
    'City': ['Berlin', 'Toronto'],
    'Salary': [80000, 58000]
}

# Create a DataFrame from the new data
df_new = pd.DataFrame(new_data)

# Append to the existing CSV file
# mode='a' means append, header=False means don't write column names again
df_new.to_csv('employee_data.csv', mode='a', header=False, index=False)
print("\nAppended new data to the CSV file")

# Read the updated CSV file
df_updated = pd.read_csv('employee_data.csv')
print("\nUpdated CSV file contents:")
print(df_updated)

# 5. READ A CSV CHUNK-BY-CHUNK
# This is useful for large files that don't fit in memory
print("\nReading CSV file in chunks:")
chunk_size = 2  # Process 2 rows at a time
chunk_counter = 1

# read_csv returns an iterable TextFileReader object when chunksize is specified
for chunk in pd.read_csv('employee_data.csv', chunksize=chunk_size):
    print(f"\nChunk {chunk_counter}:")
    print(chunk)
    chunk_counter += 1

# 6. WRITE TEXT DATA INTO A CSV FILE
# Create text data
text_data = {
    'Quote': [
        'The only way to do great work is to love what you do.',
        'Innovation distinguishes between a leader and a follower.',
        'Life is what happens when you\'re busy making other plans.'
    ],
    'Author': ['Steve Jobs', 'Steve Jobs', 'John Lennon']
}

# Create DataFrame and save as CSV
df_text = pd.DataFrame(text_data)
df_text.to_csv('quotes.csv', index=False)
print("\nText CSV file 'quotes.csv' created successfully!")

# Read and display the text CSV
df_quotes = pd.read_csv('quotes.csv')
print("\nContents of quotes CSV:")
print(df_quotes)

# Clean up: remove the created files
os.remove('employee_data.csv')
os.remove('quotes.csv')
print("\nTemporary files removed.")

# Additional example: Handling different CSV options
print("\n" + "="*50)
print("ADDITIONAL EXAMPLES")
print("="*50)

# Create a more complex CSV with missing values and different data types
complex_data = {
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones'],
    'Price': [999.99, 25.50, 49.99, 299.99, 79.99],
    'In_Stock': [True, False, True, True, None],  # None represents missing data
    'Category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Audio']
}

df_complex = pd.DataFrame(complex_data)
df_complex.to_csv('products.csv', index=False)

# Reading with different parameters
# to_csv writes None as an empty field, which read_csv already treats as missing;
# na_values adds further strings to recognize as missing/NA
df_products = pd.read_csv('products.csv', na_values=['None', 'null', 'NaN'])
print("\nProducts data with handled missing values:")
print(df_products)

# Display data types of each column
print("\nData types:")
print(df_products.dtypes)

# Clean up
os.remove('products.csv')

print("\nAll examples completed successfully!")

CSV file 'employee_data.csv' created successfully!

Contents of the CSV file:
      Name  Age      City  Salary
0    Alice   25  New York   50000
1      Bob   30    London   60000
2  Charlie   35     Paris   70000
3    Diana   28     Tokyo   55000
4      Eve   32    Sydney   65000

Appended new data to the CSV file

Updated CSV file contents:
      Name  Age      City  Salary
0    Alice   25  New York   50000
1      Bob   30    London   60000
2  Charlie   35     Paris   70000
3    Diana   28     Tokyo   55000
4      Eve   32    Sydney   65000
5    Frank   40    Berlin   80000
6    Grace   27   Toronto   58000

Reading CSV file in chunks:

Chunk 1:
    Name  Age      City  Salary
0  Alice   25  New York   50000
1    Bob   30    London   60000

Chunk 2:
      Name  Age   City  Salary
2  Charlie   35  Paris   70000
3    Diana   28  Tokyo   55000

Chunk 3:
    Name  Age    City  Salary
4    Eve   32  Sydney   65000
5  Frank   40  Berlin   80000

Chunk 4:
    Name  Age     City  Salary
6  G

# Column-wise Chunking in Pandas

Pandas' `read_csv` cannot chunk a file column-wise directly: the `chunksize` parameter only splits by rows. However, several workarounds let you process a large dataset a few columns at a time.

Here's a comprehensive example with different approaches:

```python
import pandas as pd
import numpy as np

# Create a sample CSV with many columns for demonstration
np.random.seed(42)  # For reproducible results

# Create a DataFrame with 1000 rows and 50 columns
data = {}
for i in range(50):
    data[f'col_{i}'] = np.random.randn(1000) * (i+1)

df = pd.DataFrame(data)
df.to_csv('large_dataset.csv', index=False)
print("Created large_dataset.csv with 1000 rows and 50 columns")

# Approach 1: Read only specific columns (most efficient)
def read_specific_columns():
    """Read only specific columns from a CSV file"""
    print("\n1. READING SPECIFIC COLUMNS ONLY")
    
    # Read just the first 5 columns
    columns_to_read = ['col_0', 'col_1', 'col_2', 'col_3', 'col_4']
    df_subset = pd.read_csv('large_dataset.csv', usecols=columns_to_read)
    print(f"Read {len(df_subset.columns)} columns: {list(df_subset.columns)}")
    print(f"Shape: {df_subset.shape}")
    return df_subset

# Approach 2: Read column names first, then process columns in chunks
def process_columns_in_chunks():
    """Process columns in chunks by reading column names first"""
    print("\n2. PROCESSING COLUMNS IN CHUNKS")
    
    # First, read just the column names
    column_names = pd.read_csv('large_dataset.csv', nrows=0).columns.tolist()
    print(f"Total columns: {len(column_names)}")
    
    # Define chunk size for columns
    col_chunk_size = 10
    chunks_processed = 0
    
    # Process columns in chunks
    for i in range(0, len(column_names), col_chunk_size):
        chunk_columns = column_names[i:i+col_chunk_size]
        
        # Read only the current chunk of columns
        df_chunk = pd.read_csv('large_dataset.csv', usecols=chunk_columns)
        
        chunks_processed += 1
        print(f"Chunk {chunks_processed}: Columns {i} to {i+len(chunk_columns)-1}")
        print(f"  Columns: {chunk_columns}")
        print(f"  Shape: {df_chunk.shape}")
        
        # Here you would do your actual processing on the column chunk
        # For demonstration, let's just calculate the mean of each column
        means = df_chunk.mean()
        print(f"  Column means: {means.round(2).tolist()}")
    
    return chunks_processed

# Approach 3: Using iterators with column subsets
def column_wise_iterator():
    """Process data using an iterator with column subsets"""
    print("\n3. COLUMN-WISE ITERATOR APPROACH")
    
    # Get all column names
    all_columns = pd.read_csv('large_dataset.csv', nrows=0).columns.tolist()
    
    # Define how many columns to process at a time
    columns_per_iteration = 8
    iterations = 0
    
    for i in range(0, len(all_columns), columns_per_iteration):
        # Select columns for this iteration
        selected_columns = all_columns[i:i+columns_per_iteration]
        
        # Read the data for these columns
        chunk = pd.read_csv('large_dataset.csv', usecols=selected_columns)
        iterations += 1
        
        print(f"Iteration {iterations}: Processed columns {i} to {i+len(selected_columns)-1}")
        
        # Example processing: calculate statistics
        stats = chunk.agg(['mean', 'std']).round(2)
        print(f"  Statistics:\n{stats}")
    
    return iterations

# Approach 4: Using Dask for out-of-core computations (for very large datasets)
def suggest_dask_alternative():
    """Suggest Dask for lazy, out-of-core processing one column at a time"""
    print("\n4. FOR LAZY, OUT-OF-CORE COLUMN PROCESSING, CONSIDER DASK")
    print("""
Pandas doesn't support column-wise chunking natively.
For datasets that are too wide to fit in memory, consider using Dask DataFrames.

Installation: pip install dask

Example:
import dask.dataframe as dd

# Create a Dask DataFrame
ddf = dd.read_csv('large_dataset.csv')

# Process columns in chunks
for column in ddf.columns:
    col_data = ddf[column]  # This doesn't load all data at once
    mean = col_data.mean().compute()
    print(f"Mean of {column}: {mean}")
""")

# Execute the approaches
df_subset = read_specific_columns()
chunks_processed = process_columns_in_chunks()
iterations = column_wise_iterator()
suggest_dask_alternative()

# Clean up
import os
os.remove('large_dataset.csv')
print("\nCleaned up: removed large_dataset.csv")
```

## Key Points:

1. **Pandas doesn't support column-wise chunking** - The `chunksize` parameter only works for rows
2. **Best alternatives**:
   - Use `usecols` parameter to read only specific columns
   - Process columns in batches by reading column names first
   - For very large datasets, consider using Dask
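
Row chunks and column batches can also be combined so that memory use is bounded in both directions. The following is a sketch under stated assumptions: the 20x6 table, the batch size of 2 columns, and the chunk size of 8 rows are all invented for illustration, and the data is kept in memory with `io.StringIO` instead of a file:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical wide table: 20 rows, 6 integer columns.
rng = np.random.default_rng(0)
wide = pd.DataFrame(
    rng.integers(0, 100, size=(20, 6)),
    columns=[f"col_{i}" for i in range(6)],
)
csv_text = wide.to_csv(index=False)

# Read only the header to discover the column names.
all_columns = pd.read_csv(io.StringIO(csv_text), nrows=0).columns.tolist()

column_sums = {}
for start in range(0, len(all_columns), 2):
    batch = all_columns[start:start + 2]
    sums = pd.Series(0, index=batch)
    # Each chunk holds at most 8 rows of the 2 selected columns.
    for chunk in pd.read_csv(io.StringIO(csv_text), usecols=batch, chunksize=8):
        sums = sums + chunk.sum()
    column_sums.update(sums.to_dict())

print(len(column_sums))  # 6
```

The trade-off is that the file is re-parsed once per column batch, so this only pays off when memory, not parsing time, is the limiting factor.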

## Sample CSV

The code writes `large_dataset.csv` (1000 rows and 50 columns of random data) and deletes it at the end. You can modify the generation step to create a CSV with your specific data structure.

For extremely wide datasets that don't fit in memory:
1. Use the `usecols` parameter to read only the columns you need
2. Consider Dask or Modin for out-of-core computations
3. Re-evaluate your data structure: extremely wide datasets are often better stored in a database or in a columnar format like Parquet

In [5]:
import pandas as pd
import numpy as np

# 1. Create sample data
data = pd.DataFrame({
    'temperature': [22.1, 23.5, 21.8, 24.2],
    'humidity': [45, 62, 38, 71],
    'city': ['London', 'Paris', 'Berlin', 'Madrid']
})

# 2. Save to HDF5 file (like a dictionary)
data.to_hdf('weather_data.h5', key='weather', mode='w')

# 3. Read it back
loaded_data = pd.read_hdf('weather_data.h5', key='weather')
print("Loaded data:\n", loaded_data)

# 4. Add more data to same file
more_data = pd.DataFrame({
    'temperature': [19.7, 25.3],
    'humidity': [55, 48],
    'city': ['Rome', 'Lisbon']
})
more_data.to_hdf('weather_data.h5', key='more_weather', mode='a')

# 5. See what's stored
with pd.HDFStore('weather_data.h5') as store:
    print("\nKeys in file:", store.keys())

print("Loaded data:\n", loaded_data)

Loaded data:
    temperature  humidity    city
0         22.1        45  London
1         23.5        62   Paris
2         21.8        38  Berlin
3         24.2        71  Madrid

Keys in file: ['/more_weather', '/weather']
Loaded data:
    temperature  humidity    city
0         22.1        45  London
1         23.5        62   Paris
2         21.8        38  Berlin
3         24.2        71  Madrid


In [6]:
print(df)

      Name  Age      City  Salary
0    Alice   25  New York   50000
1      Bob   30    London   60000
2  Charlie   35     Paris   70000
3    Diana   28     Tokyo   55000
4      Eve   32    Sydney   65000


# HDF5 Exploration Snippet with h5py

```python
import h5py
import numpy as np

# 1. Create a new HDF5 file
with h5py.File('explore_hdf5.h5', 'w') as f:
    # 2. Create datasets (like files)
    f.create_dataset('temperature', data=np.array([22.1, 23.5, 21.8, 24.2]))
    f.create_dataset('humidity', data=np.array([45, 62, 38, 71]))
    
    # 3. Create groups (like folders)
    weather_group = f.create_group('weather_data')
    weather_group.create_dataset('cities', data=np.array(['London', 'Paris', 'Berlin', 'Madrid'], dtype='S'))
    
    # 4. Add metadata (self-describing)
    f.attrs['created_by'] = 'Python h5py'
    f.attrs['creation_date'] = '2024-01-15'
    weather_group.attrs['description'] = 'Weather measurements for European cities'

# 5. Explore the hierarchical structure
with h5py.File('explore_hdf5.h5', 'r') as f:
    print("File structure:")
    def print_structure(name, obj):
        print(f"  {name} ({type(obj).__name__})")
    f.visititems(print_structure)
    
    # 6. Access data and metadata
    print("\nMetadata:")
    for key, value in f.attrs.items():
        print(f"  {key}: {value}")
    
    print("\nWeather group metadata:")
    for key, value in f['weather_data'].attrs.items():
        print(f"  {key}: {value}")
    
    # 7. Read data like numpy arrays
    print(f"\nTemperature data: {f['temperature'][:]}")
    print(f"Humidity data: {f['humidity'][:]}")
    print(f"Cities: {[city.decode() for city in f['weather_data/cities'][:]]}")

# 8. Demonstrate compression (performance feature)
with h5py.File('compressed_data.h5', 'w') as f:
    large_data = np.random.randn(1000, 1000)
    f.create_dataset('compressed_array', data=large_data, 
                    compression='gzip', compression_opts=9)
    print(f"\nOriginal size: {large_data.nbytes / 1e6:.1f} MB")
    print(f"Compressed size: {f['compressed_array'].id.get_storage_size() / 1e6:.1f} MB")
```

**Output:**
```
File structure:
  humidity (Dataset)
  temperature (Dataset)
  weather_data (Group)
  weather_data/cities (Dataset)

Metadata:
  created_by: Python h5py
  creation_date: 2024-01-15

Weather group metadata:
  description: Weather measurements for European cities

Temperature data: [22.1 23.5 21.8 24.2]
Humidity data: [45 62 38 71]
Cities: ['London', 'Paris', 'Berlin', 'Madrid']

Original size: 8.0 MB
Compressed size: 2.1 MB
```

### Key Characteristics Demonstrated:

1. **Hierarchical Structure**: Groups (`weather_data`) and datasets (`temperature`, `humidity`)
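2. **Large Data Handling**: Stored a 1000x1000 float64 array (8 MB) as a single compressed dataset
3. **Self-Describing**: Metadata stored in `.attrs` for both file and groups
4. **Performance**: GZIP compression reduces the stored size; the ratio depends heavily on the data, and random floating-point values usually compress far less than repetitive measurements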
5. **NumPy Integration**: Datasets behave like NumPy arrays (`f['temperature'][:]`)
6. **Cross-Platform**: File can be read by HDF5 tools in any language

This shows how HDF5 organizes data like a filesystem and handles large datasets efficiently!