# 2.0 Introduction to the pandas library
## 2.0.3 Main DataFrame attributes

Here's how to create a simple DataFrame and display its dimensions:

```python
import pandas as pd  

# Create a fictional DataFrame 
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['Paris', 'New York', 'London']}  

df = pd.DataFrame(data)

# Display the shape of the DataFrame as (rows, columns)
print(df.shape)
```
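
Because `shape` is a plain `(rows, columns)` tuple, it can be unpacked into two integers. A minimal sketch reusing the same fictional data; the names `n_rows` and `n_cols` are only illustrative:

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['Paris', 'New York', 'London']}
df = pd.DataFrame(data)

# shape is a (rows, columns) tuple, so it unpacks into two integers
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")  # 3 rows, 3 columns
```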

Display the column names of the DataFrame:

```python
# Display the column names 
print(df.columns)
```

Display the row indices of the DataFrame:

```python
# Display the row labels 
print(df.index)
```

Display the data types of each column:

```python
# Display the data types of the columns 
print(df.dtypes)
```

View the first few rows of the DataFrame:

```python
# Display the first few rows of the DataFrame 
print(df.head()) 
```

# 2.1 Flat Files and Other Structured Formats
## 2.1.1 CSV, TXT and TSV

Here's a basic example of reading a CSV file:

```python
import pandas as pd

CSV_PATH = './data/raw/source.csv'

df = pd.read_csv(CSV_PATH,
                 sep=';',          # common in French exports, where the comma is the decimal separator
                 encoding='utf-8') # optional: UTF-8 is already the default

# We can then display a preview of the first few rows to ensure everything is read correctly
df.head()
```
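
To see why the separator matters, the sketch below reads the same semicolon-separated text twice, once with the default comma and once with `sep=';'`. The in-memory string `raw` is a hypothetical stand-in for a real file:

```python
import io
import pandas as pd

# Hypothetical in-memory content standing in for a semicolon-separated file
raw = "name;age;city\nAlice;25;Paris\nBob;30;New York\n"

# With the default separator (','), each line is read as a single column
df_default = pd.read_csv(io.StringIO(raw))

# With sep=';', the three columns are split correctly
df_semicolon = pd.read_csv(io.StringIO(raw), sep=';')

print(df_default.shape)    # (2, 1)
print(df_semicolon.shape)  # (2, 3)
```

A single-column result right after loading is usually a sign that the wrong separator was used.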

Process large CSV files in chunks:

```python
import pandas as pd

chunk_size = 10000  # number of rows per chunk
csv_path = 'big_file.csv'

def my_function(chunk):
    # Example modification: add 1 to a column named 'example_column'
    chunk = chunk.copy()
    chunk['example_column'] = chunk['example_column'] + 1
    return chunk

# Iterate over the file one chunk at a time
for i, chunk in enumerate(pd.read_csv(csv_path, chunksize=chunk_size)):
    # Apply the modifications to the chunk
    modified_chunk = my_function(chunk)

    # Write the modified chunk to its own CSV file
    modified_chunk.to_csv(f'./data/processed/modified_chunk_{i}.csv')
```
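
When the processed result fits in memory, the chunks can instead be collected in a list and combined with `pd.concat`. A minimal sketch, assuming a hypothetical ten-row input held in the string `raw` and a deliberately tiny chunk size:

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a large file (values 0..9)
raw = "example_column\n" + "\n".join(str(n) for n in range(10))

processed_chunks = []
for chunk in pd.read_csv(io.StringIO(raw), chunksize=4):
    # Same example modification: add 1 to 'example_column'
    chunk['example_column'] = chunk['example_column'] + 1
    processed_chunks.append(chunk)

# Combine the processed chunks into a single DataFrame with a fresh index
result = pd.concat(processed_chunks, ignore_index=True)

print(len(processed_chunks))  # 3 chunks: 4 + 4 + 2 rows
print(result.shape)           # (10, 1)
```

Writing each chunk to disk keeps memory usage flat, while concatenating is simpler when the final table is small enough to hold at once.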

## 2.1.2 XLS, XLSX

Read an Excel file using pandas:

```python
import pandas as pd

XLSX_PATH = './data/raw/results.xlsx'

df = pd.read_excel(XLSX_PATH, 
                   sheet_name='Results',
                   usecols=[1,2,3,7,8,9])

df.head()
```

## 2.1.3 JSON

Create a function to read JSON files with error handling:

```python
import pandas as pd

def load_json_with_pandas(file_path): 
    try: 
        dataframe = pd.read_json(file_path) 
        return dataframe 
    except FileNotFoundError: 
        print(f'The file "{file_path}" was not found.')
        return None
    except ValueError as e:
        # pandas raises ValueError when the JSON content is malformed
        print(f'Error reading the JSON file with pandas: {e}')
        return None
```
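
The sketch below shows the two outcomes side by side: a well-formed, records-oriented document becomes a DataFrame, while a truncated one makes `pd.read_json` raise `ValueError`. Both JSON strings are hypothetical and are read from memory through `io.StringIO` rather than from a file:

```python
import io
import pandas as pd

# Hypothetical records-oriented JSON: a list of objects, one per row
valid_json = '[{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]'
broken_json = '[{"name": "Alice", "age": 25'  # truncated on purpose

df = pd.read_json(io.StringIO(valid_json))
print(df.shape)  # (2, 2)

try:
    pd.read_json(io.StringIO(broken_json))
    parse_failed = False
except ValueError as e:
    # Malformed JSON surfaces as a ValueError
    parse_failed = True
    print(f'Could not parse JSON: {e}')
```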

## 2.1.4 XML

Using Python's built-in XML library:

```python
import xml.etree.ElementTree as ET

def load_xml_file(file_path):
    try:
        tree = ET.parse(file_path)
        return tree.getroot()
    except ET.ParseError as e:
        print(f"Error reading the XML file: {e}")
        return None

def traverse_elements(parent_element):
    for child in parent_element:
        print(f"Tag: {child.tag}, Text: {child.text}")
        traverse_elements(child)

# Replace 'example.xml' with the path to your XML file
xml_file = 'example.xml'
root = load_xml_file(xml_file)

if root is not None:
    traverse_elements(root)
```
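
The same kind of traversal can feed a DataFrame by hand when the XML is flat and regular. A minimal sketch, assuming a hypothetical document in which every `<person>` element holds one row and each child tag is a column:

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Hypothetical XML: each <person> element becomes one row
xml_text = """
<people>
    <person><name>Alice</name><age>25</age></person>
    <person><name>Bob</name><age>30</age></person>
</people>
""".strip()

root = ET.fromstring(xml_text)

# One dict per <person>, mapping each child tag to its text
rows = [{child.tag: child.text for child in person} for person in root]

xml_df = pd.DataFrame(rows)
print(xml_df)
```

Element text is always read as a string, so numeric columns such as `age` still need an explicit conversion (for example with `astype(int)`).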

Using pandas to read XML:

```python
import pandas as pd

def load_xml_with_pandas(file_path):
    try:
        # Using Pandas' read_xml function to load the XML file
        dataframe = pd.read_xml(file_path)
        return dataframe
    except Exception as e:
        print(f"Error reading the XML file with Pandas: {e}")
        return None

# Replace 'example.xml' with the path to your XML file
xml_file = 'example.xml'
xml_dataframe = load_xml_with_pandas(xml_file)

if xml_dataframe is not None:
    print(xml_dataframe)
```

## 2.1.5 PDF
### 2.1.5.1 Reading a Simple PDF with PyPDF2

Extract text from a PDF:

```python
import PyPDF2

# Open the file in binary mode; the with block closes it automatically
with open('./data/source.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # For example, we can access the number of pages in the document
    print(len(pdf_reader.pages))

    # Now we create a page object (pages are zero-indexed, so index 2 is the third page)
    page = pdf_reader.pages[2]

    # We extract the text from this page using the extract_text() method
    page_content = page.extract_text()

print(page_content)
```

### 2.1.5.2 Table Extraction with Tabula

Extract tables from PDF using Tabula:

```python
import tabula

PDF_PATH = './data/source.pdf'

# read_pdf returns a list of DataFrames, one per table detected on the page
tables = tabula.read_pdf(PDF_PATH, pages=1)
first_table = tables[0]
```

Convert PDF tables to CSV:

```python
# Reuses tabula and PDF_PATH from the previous block
tabula.convert_into(PDF_PATH, './data/processed/table.csv', output_format='csv')
```

## 2.1.6 Images

Extract tables from images:

```python
from img2table.ocr import TesseractOCR
from img2table.document import Image

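# OCR instantiation (Tesseract expects three-letter language codes such as "eng")
ocr = TesseractOCR(n_threads=1, lang="eng")

# Replace 'example.png' with the path to your image file
src = 'example.png'

# Document instantiation (for PDF files, use img2table.document.PDF instead)
doc = Image(src)

# Table extraction: returns a list of the tables detected in the image
tables = doc.extract_tables(ocr=ocr,
                            implicit_rows=False,
                            borderless_tables=False,
                            min_confidence=50)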
```