## 1. File manipulation using `open` built-in function

`open()` returns a file object, and is most commonly used with two arguments: `open(filename, mode)`.

The first argument is a string containing the filename. The second argument is another string containing a few characters describing the way in which the file will be used. `mode` can be `'r'` when the file will only be read, `'w'` for only writing (an existing file with the same name will be erased), and `'a'` opens the file for appending; any data written to the file is automatically added to the end. `'r+'` opens the file for both reading and writing. The `mode` argument is optional; `'r'` will be assumed if it’s omitted.

Normally, files are opened in *text mode*, which means that you read and write strings from and to the file, and those strings are encoded in a specific encoding. If `encoding` is not specified, the default is platform dependent (`locale.getpreferredencoding(False)` is called to get the current locale encoding). `'b'` appended to the mode opens the file in *binary mode*: now the data is read and written in the form of bytes objects. This mode should be used for all files that don’t contain text.

| Character | Meaning                                                         |
|-----------|-----------------------------------------------------------------|
| `r`       | open for reading (default)                                      |
| `w`       | open for writing, truncating the file first                     |
| `x`       | open for exclusive creation, failing if the file already exists |
| `a`       | open for writing, appending to the end of the file if it exists |
| `b`       | binary mode                                                     |
| `t`       | text mode (default)                                             |
| `+`       | open for updating (reading and writing)                         |
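
The mode characters in the table can be tried out safely in a temporary directory. The following is a minimal sketch; the file name `modes_demo.txt` is made up for illustration.

```python
import os
import tempfile

# Work in a throwaway directory so no real file is touched.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "modes_demo.txt")  # hypothetical file name

    with open(path, "x") as f:      # 'x': create, fail if the file exists
        f.write("first\n")

    try:
        open(path, "x")             # the file exists now, so this raises
        exclusive_failed = False
    except FileExistsError:
        exclusive_failed = True

    with open(path, "a") as f:      # 'a': keep the content, write at the end
        f.write("second\n")

    with open(path) as f:           # no mode given: 'r' in text mode
        after_append = f.read()

    with open(path, "w") as f:      # 'w': truncate before writing
        f.write("third\n")

    with open(path, "rb") as f:     # 'b': read raw bytes instead of str
        after_truncate = f.read()

print(repr(after_append))
print(after_truncate)
```

After the `'w'` step only the last write survives, which is why `'a'` or `'x'` is usually the safer choice when existing data must not be lost.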

### Context managers (`with` statement)

It is good practice to use the `with` keyword when dealing with file objects. The advantage is that the file is properly closed after its suite finishes, even if an exception is raised at some point. Using `with` is also much shorter than writing equivalent `try-finally` blocks:

```python
with open('workfile') as f:
    read_data = f.read()
```

The equivalent `try-finally` block of the `with` block above would be:

```python
f = open('workfile')
try:
    read_data = f.read()
finally:
    f.close()
```

Compared with the `try-finally` version, the `with` version eliminates the boilerplate: the file is closed automatically however the nested block exits, whether it finishes normally, returns early, or raises an exception.

The `with` statement is used to wrap the execution of a block with methods defined by a context manager. A context manager is an object that defines the runtime context to be established when executing a `with` statement. The context manager handles the entry into, and the exit from, the desired runtime context for the execution of the block of code. 

Typical uses of context managers include saving and restoring various kinds of global state, locking and unlocking resources, closing opened files, etc.
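
As an illustration of saving and restoring global state, here is a sketch of a custom context manager built with `contextlib.contextmanager`. The helper name `working_directory` is hypothetical; it temporarily switches the current working directory and restores it afterwards.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily switch the current working directory (hypothetical helper)."""
    previous = os.getcwd()      # save the global state on entry
    os.chdir(path)
    try:
        yield                   # the body of the `with` block runs here
    finally:
        os.chdir(previous)      # restore it on exit, even if the body raised


start = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    with working_directory(tmp):
        inside = os.getcwd()    # we are inside the temporary directory
    after = os.getcwd()         # back where we started
```

The `try`/`finally` around `yield` plays the same role as the `finally: f.close()` in the file example: the cleanup runs regardless of how the block exits.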

### Methods of File Objects

To read a file’s contents, call `f.read(size)`, which reads some quantity of data and returns it as a string (in text mode) or bytes object (in binary mode). `size` is an optional numeric argument. When `size` is omitted or negative, the entire contents of the file will be read and returned; it’s your problem if the file is twice as large as your machine’s memory. Otherwise, at most `size` characters (in text mode) or `size` bytes (in binary mode) are read and returned. If the end of the file has been reached, `f.read()` will return an empty string.



In [1]:
f = open('file_example.txt', 'r+')
f.read(20)

'Beautiful is better '
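
When a file may be too large to read in one call, `read(size)` can be called in a loop until it returns an empty string. A minimal sketch, assuming a text file; the helper `iter_chunks` and the sample file are made up for illustration.

```python
import os
import tempfile


def iter_chunks(path, size=1024):
    """Yield successive pieces of a text file, at most `size` characters each."""
    with open(path) as f:
        while True:
            chunk = f.read(size)
            if chunk == "":      # an empty string means end of file
                break
            yield chunk


# Build a small sample file to demonstrate on.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.txt")
    with open(sample, "w") as f:
        f.write("abcdefghij")
    chunks = list(iter_chunks(sample, size=4))

print(chunks)  # ['abcd', 'efgh', 'ij']
```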

`f.readline()` reads a single line from the file:

In [2]:
print(f.readline(), end='')
print(f.readline(), end='')

than ugly.
Explicit is better than implicit.


For reading lines from a file, you can loop over the file object. This is memory efficient, fast, and leads to simple code:

In [3]:
for line in f:
    print(line, end='')

Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Readability counts.
Readability counts.
Readability counts.
Readability counts.
Readability counts.
Readability counts.
Readability counts.


If you want to read all the lines of a file into a list, you can also use `list(f)` or `f.readlines()`.

`f.write(string)` writes the contents of string to the file, returning the number of characters written.

In [4]:
f.write('Readability counts.\n')

20

`f.tell()` returns an integer giving the file object’s current position in the file represented as number of bytes from the beginning of the file when in binary mode and an opaque number when in text mode.

In [5]:
f.tell()

369

To change the file object’s position, use `f.seek(offset, whence)`. The position is computed from adding `offset` to a reference point; the reference point is selected by the `whence` argument. A `whence` value of 0 measures from the beginning of the file, 1 uses the current file position, and 2 uses the end of the file as the reference point. `whence` can be omitted and defaults to 0, using the beginning of the file as the reference point.

In [6]:
f.seek(0)
f.read(10)

'Beautiful '
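
The other `whence` values are easiest to see in binary mode, where offsets are plain byte counts. In text mode, only seeks relative to the beginning of the file are allowed, apart from seeking to the very end with `seek(0, 2)`. The sketch below uses the named constants from the `io` module and a made-up file name.

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "seek_demo.bin")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"0123456789")

    with open(path, "rb") as f:
        f.seek(-3, io.SEEK_END)    # whence=2: three bytes before the end
        tail = f.read()            # b'789'

        f.seek(2)                  # whence=0 (default): absolute offset 2
        f.seek(3, io.SEEK_CUR)     # whence=1: three bytes further on
        position = f.tell()        # 5
        middle = f.read(2)         # b'56'
```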

If you’re not using the `with` keyword, then you should call `f.close()` to close the file and immediately free up any system resources used by it. If you don’t explicitly close a file, Python’s garbage collector will eventually destroy the object and close the open file for you, but the file may stay open for a while.

In [7]:
f.close()

## Exercises 1

1. Write a function `grep` that receives `text` and `file` as parameters and returns a list with all the lines in the file containing `text`. 
1. Write another function `grepinto` that receives `text`, `infile` and `outfile` as parameters and writes to `outfile` the lines in `infile` that contain `text`. Open both files within one `with` statement. 
    
    [!] `file`, `infile` and `outfile` are all file names, not file objects.


## 2. Working with the file system (`os`, `os.path`, `glob`)

### `os`

The `os` module contains functions to get information on local directories, files, processes, and environment variables.

`os.getcwd()` - returns the current working directory

In [8]:
import os
current_path = os.getcwd()
print(current_path)

/Users/iulia/PycharmProjects/python-advanced-custom/docs


`os.listdir(path)` - returns a list of all the entries in the directory given by `path`

In [9]:
os.listdir(current_path)

['11. Concurrent execution.ipynb',
 'employees.xml',
 'out.txt',
 '15. Android API used with Python.ipynb',
 '03. More on functions and iterables.ipynb',
 '12. Logging.ipynb',
 'images',
 '05. Decorators.ipynb',
 'process_files.py',
 'books.xml',
 'Financial Sample.xlsx',
 'output.json',
 '13. Other useful modules.ipynb',
 '16. Building Dashboards in Python .ipynb',
 'users.json',
 'file_example.txt',
 '10. Testing your code.ipynb',
 'output.csv',
 'data.csv',
 'data.json',
 '07. Context Managers.ipynb',
 'books.csv',
 '00. Table of Contents.ipynb',
 'output.xml',
 '14. GUI automation.ipynb',
 '09. Working with databases.ipynb',
 'out2.txt',
 'output.xlsx',
 'example.txt',
 '.ipynb_checkpoints',
 'modified.xlsx',
 'employees_modified.xml',
 '06. Object-Oriented Programming.ipynb',
 '04. Advanced Data Structures.ipynb',
 'employees_updated.xml',
 'output.txt',
 '02. Control Flow.ipynb',
 'data_out.json',
 '08. Working with different data formats.ipynb',
 '01. PyCharm Overview.ipynb']

`os.mkdir(path)` - creates a directory

`os.makedirs(path)` - creates a directory recursively, creating any missing intermediate directories along the way

In [10]:
os.mkdir('testdir')
assert 'testdir' in os.listdir(current_path)

`os.chdir()` - changes the current working directory

In [11]:
os.chdir('testdir')
print('Items in testdir:', os.listdir())
os.chdir(current_path)

Items in testdir: []


`os.rename(source, dest)` - renames the file or directory 

In [12]:
os.rename('testdir', 'new_testdir')
assert 'testdir' not in os.listdir(current_path)
assert 'new_testdir' in os.listdir(current_path)

`os.remove(path)` - removes a file

`os.rmdir(path)` - removes the directory path

`os.removedirs(path)` - removes the leaf directory, then each parent directory named in `path` that has become empty

In [13]:
os.rmdir('new_testdir')
assert 'new_testdir' not in os.listdir(current_path)
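
The recursive variants can be sketched in a temporary directory as follows. The directory and file names are made up for illustration, and the `keep.txt` file exists only to show where `os.removedirs()` stops.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    nested = os.path.join(tmp, "outer", "inner")   # hypothetical names

    os.makedirs(nested)                 # creates 'outer' and 'outer/inner'
    os.makedirs(nested, exist_ok=True)  # no error if it already exists
    created = os.path.isdir(nested)

    # Put a file next to 'outer' so removedirs() stops before deleting tmp.
    keep = os.path.join(tmp, "keep.txt")
    open(keep, "w").close()

    os.removedirs(nested)               # removes 'inner', then the empty 'outer'
    outer_gone = not os.path.exists(os.path.join(tmp, "outer"))
    tmp_kept = os.path.isdir(tmp)       # tmp is not empty, so it survives
```

Note that `os.removedirs()` only ever removes empty directories; it does not delete files.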

`os.walk(path)` - Directory tree generator. For each directory in the directory tree rooted at `path`, yields a 3-tuple `dirpath, dirnames, filenames`:
    
* `dirpath` is a string, the path to the directory.
* `dirnames` is a list of the names of the subdirectories in `dirpath` (excluding '.' and '..').
* `filenames` is a list of the names of the non-directory files in `dirpath`.

In [14]:
for dirpath, dirnames, filenames in os.walk('.'):
    print(dirpath, dirnames, filenames)

. ['images', '.ipynb_checkpoints'] ['11. Concurrent execution.ipynb', 'employees.xml', 'out.txt', '15. Android API used with Python.ipynb', '03. More on functions and iterables.ipynb', '12. Logging.ipynb', '05. Decorators.ipynb', 'process_files.py', 'books.xml', 'Financial Sample.xlsx', 'output.json', '13. Other useful modules.ipynb', '16. Building Dashboards in Python .ipynb', 'users.json', 'file_example.txt', '10. Testing your code.ipynb', 'output.csv', 'data.csv', 'data.json', '07. Context Managers.ipynb', 'books.csv', '00. Table of Contents.ipynb', 'output.xml', '14. GUI automation.ipynb', '09. Working with databases.ipynb', 'out2.txt', 'output.xlsx', 'example.txt', 'modified.xlsx', 'employees_modified.xml', '06. Object-Oriented Programming.ipynb', '04. Advanced Data Structures.ipynb', 'employees_updated.xml', 'output.txt', '02. Control Flow.ipynb', 'data_out.json', '08. Working with different data formats.ipynb', '01. PyCharm Overview.ipynb']
./images [] ['multithreading.png', 'py

### `os.path`

`os.path` contains functions for manipulating filenames and directory names.

`os.path.exists(path)` - test whether a path exists

In [15]:
os.path.exists(current_path)

True

`os.path.isfile(path)` - test whether a path is a regular file

In [16]:
os.path.isfile(current_path)

False

`os.path.isdir(path)` - return true if the pathname refers to an existing directory

In [17]:
os.path.isdir(current_path)

True

`os.path.split(path)` - split a pathname;  returns tuple `(head, tail)` where `tail` is everything after the final slash

In [18]:
os.path.split(current_path)

('/Users/iulia/PycharmProjects/python-advanced-custom', 'docs')

`os.path.join(path, *paths)` - join two or more pathname components, inserting `os.sep` as needed.

In [19]:
os.path.join(current_path, 'testdir', 'innerdir')

'/Users/iulia/PycharmProjects/python-advanced-custom/docs/testdir/innerdir'

### `glob`

The glob module is another tool in the Python standard library. It's an easy way to get the contents of a directory programmatically, and it uses the sort of wildcards that we may already be familiar with from working on the command line.

`glob.glob(pathname, recursive=False)` - Return a list of paths matching a `pathname` pattern. The pattern may contain simple shell-style wildcards. If `recursive` is true, the pattern `'**'` will match any files and zero or more directories and subdirectories.

`glob.iglob(pathname, recursive=False)` - Return an iterator which yields the paths matching a pathname pattern.

In [20]:
import glob
glob.glob('*Types*')

[]

In [21]:
# Note: recursive=True only changes the result when the pattern contains '**'
glob.glob('*.ipynb', recursive=True)

['11. Concurrent execution.ipynb',
 '15. Android API used with Python.ipynb',
 '03. More on functions and iterables.ipynb',
 '12. Logging.ipynb',
 '05. Decorators.ipynb',
 '13. Other useful modules.ipynb',
 '16. Building Dashboards in Python .ipynb',
 '10. Testing your code.ipynb',
 '07. Context Managers.ipynb',
 '00. Table of Contents.ipynb',
 '14. GUI automation.ipynb',
 '09. Working with databases.ipynb',
 '06. Object-Oriented Programming.ipynb',
 '04. Advanced Data Structures.ipynb',
 '02. Control Flow.ipynb',
 '08. Working with different data formats.ipynb',
 '01. PyCharm Overview.ipynb']
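
To see what `'**'` adds, compare a plain pattern with a recursive one on a small directory tree. This is a sketch run in a temporary directory; the file layout is made up for illustration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: two files at the top level, one in a subdirectory.
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt"), "c.csv"):
        open(os.path.join(tmp, rel), "w").close()

    top_only = glob.glob(os.path.join(tmp, "*.txt"))
    everything = glob.glob(os.path.join(tmp, "**", "*.txt"), recursive=True)
    lazily = glob.iglob(os.path.join(tmp, "**", "*.txt"), recursive=True)

    # Make the results relative and sorted so they are easy to compare.
    top_names = sorted(os.path.relpath(p, tmp) for p in top_only)
    all_names = sorted(os.path.relpath(p, tmp) for p in everything)
    lazy_names = sorted(os.path.relpath(p, tmp) for p in lazily)

print(top_names)   # only the top-level .txt file
print(all_names)   # also includes the one inside 'sub'
```

`glob.iglob()` yields the same paths as `glob.glob()`, but one at a time, which helps when a tree contains many matches.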

## Exercises 2

1. Write a Python program that creates a directory `outdir` at the current location and a directory `innerdir` inside `outdir`. Create an empty file inside `innerdir`. Use `os.walk()` to print the directory tree for `outdir`. Remove the directories and the file.
2. Write a function that returns a list (or iterator) of all the file names with an extension from a directory. Give the path and the file extension as parameters.
   
## 3. Working with paths in the file system (`pathlib`)

The `pathlib` module provides an object-oriented approach to handling file system paths. It allows you to work with file and directory paths in a more intuitive and Pythonic way than traditional string manipulation.

The main class in this module is `Path`, which represents a file or directory path:

In [22]:
from pathlib import Path

Instantiate this class to create paths:

In [23]:
current_dir = Path()
root_path = Path("/")
relative_path = Path("images/multithreading.png")

Attributes of `Path` objects:

In [24]:
relative_path.parent  # parent directory

PosixPath('images')

In [25]:
relative_path.stem  # final path component, without its suffix

'multithreading'

In [26]:
relative_path.suffix  # file extension

'.png'

`Path` objects can be used to build new paths:

In [27]:
python_path = root_path / "usr" / "bin" / "python3"
python_path

PosixPath('/usr/bin/python3')

Methods of `Path` objects to check various properties:

In [28]:
python_path.exists()

True

In [29]:
python_path.is_file()

True

In [30]:
python_path.is_dir()

False

The `Path` class also provides, as methods, the directory-manipulation functions of the `os` module:

In [31]:
new_dir = current_dir / "new_dir"
new_dir.mkdir(exist_ok=True)

for directory in current_dir.iterdir():
    if directory.is_dir():
        print(directory)

images
new_dir
.ipynb_checkpoints


In [32]:
for txt_file in current_dir.glob("**/*.txt"):
    print(txt_file)

out.txt
file_example.txt
out2.txt
example.txt
output.txt
images/ceva.txt


In [33]:
new_dir.rmdir()

In [34]:
# Path.walk() requires Python 3.12 or newer
for dirpath, dirnames, filenames in current_dir.walk():
    print(dirpath, dirnames, len(filenames))

. ['images', '.ipynb_checkpoints'] 38
images [] 8
.ipynb_checkpoints [] 6
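
A few more `Path` methods mirror `os` functions shown earlier. The sketch below runs in a temporary directory; all directory and file names are made up for illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    inner = base / "demo" / "inner"              # hypothetical names

    inner.mkdir(parents=True, exist_ok=True)     # like os.makedirs
    report = inner / "report.txt"
    report.touch()                               # create an empty file

    # like os.rename; with_suffix() builds the new name
    renamed = report.rename(report.with_suffix(".log"))

    # rglob() searches recursively, like glob() with a '**/' prefix
    found = [p.relative_to(base).as_posix() for p in base.rglob("*.log")]

    renamed.unlink()                             # like os.remove
    inner.rmdir()                                # like os.rmdir
    still_there = inner.exists()

print(found)  # ['demo/inner/report.log']
```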


## Exercises 3

1. Solve the same exercises as above (Exercises 2), but using the `pathlib` module.

## 4. Working with JSON format

### What is JSON?

JSON (JavaScript Object Notation) is a lightweight data format used for exchanging data between systems. It is easy for humans to read and write and easy for machines to parse and generate. JSON is based on key-value pairs and supports basic data types such as strings, numbers, arrays, and objects.

Example:  
```json
{
  "name": "Alice",
  "age": 30,
  "isEmployed": true,
  "skills": ["Python", "SQL", "JavaScript"],
  "address": {
    "city": "New York",
    "zipcode": "10001"
  }
}
```

### `json` module

The `json` module in Python provides functions to **encode** (convert Python objects to JSON) and **decode** (convert JSON to Python objects).


- `json.loads()` - parses a JSON string and converts it into a Python object.

In [35]:
import json

json_string = '{"name": "Alice", "age": 30}'
data = json.loads(json_string)

In [36]:
data

{'name': 'Alice', 'age': 30}

In [37]:
type(data)

dict

- `json.dumps()` - converts a Python object into a JSON string.

In [38]:
json_string = json.dumps(data)

In [39]:
json_string

'{"name": "Alice", "age": 30}'

- `json.load()` - reads JSON data from a file and converts it into a Python object.

In [40]:
with open("data.json", "r") as f:
    data = json.load(f)

In [41]:
data

{'name': 'Mia', 'hobbies': ['painting', 'jogging']}

In [42]:
data["age"] = 20
data["hobbies"].append("cooking")
data

{'name': 'Mia', 'hobbies': ['painting', 'jogging', 'cooking'], 'age': 20}

- `json.dump()` - writes a Python object as JSON data to a file.

In [43]:
with open("output.json", "w") as f:
    json.dump(data, f)

- Formatting with `indent` and `sort_keys`: both the `dump` and `dumps` functions also accept the optional `indent` and `sort_keys` parameters, which add indentation and sort keys alphabetically.

In [44]:
formatted_json = json.dumps(data, indent=4, sort_keys=True)
print(formatted_json)

{
    "age": 20,
    "hobbies": [
        "painting",
        "jogging",
        "cooking"
    ],
    "name": "Mia"
}
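
Encoding followed by decoding does not always return the original object, because JSON has fewer types than Python. A short sketch with made-up data:

```python
import json

record = {
    "name": "Alice",
    "scores": (9.5, 8),      # tuple: JSON only has arrays
    "active": True,          # becomes true
    "manager": None,         # becomes null
    1: "integer key",        # JSON object keys are always strings
}

encoded = json.dumps(record)
decoded = json.loads(encoded)

print(encoded)
print(decoded)
```

After the round trip, the tuple comes back as a list and the integer key `1` comes back as the string `"1"`, so `decoded != record`.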


## Exercises 4
1. Using the [`users.json`](users.json) file:
   - open it and decode the Python object inside it
   - filter users with `"email"` key and encode the resulting object in a JSON string; print the string to the console
   - filter users with ages between 20 and 40 and encode the resulting object in a JSON file, using `indent` and `sort_keys` parameters.

## 5. Working with CSV format

### What is CSV?
 
CSV (Comma-Separated Values) is a simple file format used to store tabular data, such as a spreadsheet or database. Each line in a CSV file represents a data record, and each record consists of fields separated by a delimiter (commonly a comma).

Example:
```csv
Name,Age,City
Alice,30,New York
Bob,25,Los Angeles
Charlie,35,Chicago
```

### `csv` module

The `csv` module in Python provides functionality to read, write, and manipulate CSV files easily. It supports different delimiters, quoting styles, and file encoding.

* `csv.reader()` - reads a CSV file and returns an iterable object for processing each row as a list.

In [45]:
import csv

with open("data.csv", "r", newline="") as file:  # newline="" is recommended by the csv docs
    reader = csv.reader(file)
    for row in reader:
        print(row)

['Name', 'Age', 'City']
['Alice', '30', 'New York']
['Bob', '25', 'Los Angeles']
['Charlie', '35', 'Chicago']


- `csv.writer()` - writes data to a CSV file row by row.

In [67]:
data = [
    ["Name", "Age", "City"],
    ["Alice", 30, "New York"],
    ["Bob", 25, "Los Angeles"],
    ["Charlie", 35, "Chicago"]
]

with open("output.csv", "w", newline="") as file:
    writer = csv.writer(file)
    # Alternative: write all rows in one call with writer.writerows(data)
    for line in data:
        writer.writerow(line)
        print(f"{line} written to CSV file.")

['Name', 'Age', 'City'] written to CSV file.
['Alice', 30, 'New York'] written to CSV file.
['Bob', 25, 'Los Angeles'] written to CSV file.
['Charlie', 35, 'Chicago'] written to CSV file.


- `csv.DictReader()` - reads a CSV file and converts each row into a dictionary, with column headers as keys.

In [47]:
with open("data.csv", "r", newline="") as file:  # newline="" is recommended by the csv docs
    reader = csv.DictReader(file)
    for row in reader:
        print(row)

{'Name': 'Alice', 'Age': '30', 'City': 'New York'}
{'Name': 'Bob', 'Age': '25', 'City': 'Los Angeles'}
{'Name': 'Charlie', 'Age': '35', 'City': 'Chicago'}


- `csv.DictWriter()` - writes dictionaries to a CSV file, using specified `fieldnames` as headers.

In [68]:
data = [
    {"Name": "Alice", "Age": 30, "City": "New York"},
    {"Name": "Bob", "Age": 25, "City": "Los Angeles"},
    {"Name": "Charlie", "Age": 35, "City": "Chicago"}
]

with open("output.csv", "w", newline="") as file:
    fieldnames = ["Name", "Age", "City"]
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(data)
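
The delimiter and quoting style mentioned at the start of this section are set through keyword arguments. The sketch below writes to an in-memory `io.StringIO` buffer instead of a real file, and the rows are made up for illustration.

```python
import csv
import io

rows = [
    ["Name", "Quote"],
    ["Alice", "Hello; world"],     # contains the delimiter
    ["Bob", 'She said "hi"'],      # contains the quote character
]

buffer = io.StringIO(newline="")   # in-memory stand-in for a file
writer = csv.writer(buffer, delimiter=";", quoting=csv.QUOTE_MINIMAL)
writer.writerows(rows)
text = buffer.getvalue()
print(text)

# Reading back with the same delimiter recovers the original rows.
parsed = list(csv.reader(io.StringIO(text, newline=""), delimiter=";"))
```

With `csv.QUOTE_MINIMAL`, only fields that contain the delimiter, the quote character, or a line break are quoted, and embedded quotes are doubled.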

## Exercises 5
1. Using [`books.csv`](books.csv), do the following:
   - read the CSV file
   - create two other CSV files: `mathematics_books.csv` and `computer_science_books.csv`, containing only books in each genre (_Genre_ column should be equal to _mathematics_ or _computer_science_, respectively), with all columns in _books.csv_ except _Genre_

## 6. Working with Excel files (`openpyxl` module)

[`openpyxl`](https://openpyxl.readthedocs.io/en/stable/) is a third-party Python package used for reading, writing, and manipulating Excel files in `.xlsx` and `.xlsm` formats. It allows you to automate tasks like creating workbooks, modifying cell data, applying styles, and creating charts, all without relying on Microsoft Excel itself.

In order to work with it, you should install it using `pip`:

```bash
pip install openpyxl
```

The main class in `openpyxl` is `Workbook`. A workbook is the container for all other parts of the document. It contains one or more worksheets, which are modeled by the `Worksheet` class.

#### Creating a New Workbook

In [49]:
from openpyxl import Workbook

# Create a workbook and add data
wb = Workbook()
ws = wb.active  # Get the default worksheet
ws.title = "MySheet"
ws.append(["Name", "Age", "City"])  # Add a header row
ws.append(["Alice", 30, "New York"])  # Add data
wb.save("output.xlsx")  # Save the file

#### Reading an Existing Excel File

In [69]:
from openpyxl import load_workbook

wb = load_workbook("output.xlsx")  # Load the workbook
ws = wb["MySheet"]

for row in ws.iter_rows(values_only=True):
    print(row)

('Name', 'Age', 'City')
('Alice', 30, 'New York')


#### Modifying Data in a Workbook

In [51]:
ws["A2"] = "Bob"  # Update a cell value
ws.append(["Charlie", 35, "Chicago"])  # Add a new row

wb.save("modified.xlsx")

## Exercises 6

1. Using `Financial Sample.xlsx`, do the following:
   - Read the file
   - Add a new sheet _Q4 2014_
   - Iterate on rows and write to the new sheet those rows that have `Profit` greater than or equal to `100.000$` and `Date` between 1.10.2014 and 31.12.2014

## 7. Working with XML format

### What is XML?

XML is a markup language used for storing and transporting data. It uses tags to define elements and attributes to add metadata.

**Example XML Document**:
```xml
<employees>
    <employee id="1">
        <name>Alice</name>
        <department>HR</department>
        <salary>60000</salary>
    </employee>
    <employee id="2">
        <name>Bob</name>
        <department>Engineering</department>
        <salary>75000</salary>
    </employee>
</employees>
```

### `xml.etree.ElementTree` module

`xml.etree.ElementTree` is a Python module used for parsing, creating, and manipulating XML data. XML (eXtensible Markup Language) is a widely-used format for data exchange between applications.

#### Parsing XML
- **Purpose**: Load an XML file or string into Python for processing.

In [52]:
import xml.etree.ElementTree as ET

# Parse XML from a file
tree = ET.parse("employees.xml")
root = tree.getroot()

# Print the root tag
print(root.tag)  # Output: employees

# Iterate over child elements
for employee in root:
    print(employee.tag, employee.attrib)

employees
employee {'id': '1'}
employee {'id': '2'}


#### Accessing and Iterating Over Elements
- **Purpose**: Navigate through XML elements and attributes.

In [53]:
for employee in root.findall("employee"):
    name = employee.find("name").text
    department = employee.find("department").text
    print(f"Name: {name}, Department: {department}")

Name: Alice, Department: HR
Name: Bob, Department: Engineering


#### Creating an XML Document
- **Purpose**: Generate an XML structure programmatically.

In [54]:
# Create root element
root = ET.Element("employees")

# Add a child element
employee = ET.SubElement(root, "employee", id="1")
ET.SubElement(employee, "name").text = "Alice"
ET.SubElement(employee, "department").text = "HR"
ET.SubElement(employee, "salary").text = "60000"

# Write to a file
tree = ET.ElementTree(root)
tree.write("output.xml", encoding="utf-8", xml_declaration=True)

#### Modifying an XML Document
- **Purpose**: Update an existing XML structure.

In [55]:
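# Re-parse the original file: `root` and `tree` were rebound to the
# newly created document in the previous cell
tree = ET.parse("employees.xml")
root = tree.getroot()

# Update an employee's salary
for employee in root.findall("employee"):
    if employee.get("id") == "1":
        employee.find("salary").text = "65000"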

# Save changes back to the file
tree.write("employees_updated.xml")

#### Deleting an Element
- **Purpose**: Remove specific elements from the XML structure.

In [56]:
# Remove an employee with id="2"
for employee in root.findall("employee"):
    if employee.get("id") == "2":
        root.remove(employee)

# Save the modified XML
tree.write("employees_modified.xml")
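
XML can also be parsed from a string with `ET.fromstring()`, and `find()`/`findall()` accept a limited XPath syntax, including attribute predicates. A minimal sketch using a shortened version of the example document; the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

xml_text = """
<employees>
    <employee id="1"><name>Alice</name><salary>60000</salary></employee>
    <employee id="2"><name>Bob</name><salary>75000</salary></employee>
</employees>
"""

doc_root = ET.fromstring(xml_text)   # returns the root Element directly

# The predicate [@id='2'] selects elements by attribute value.
bob = doc_root.find("employee[@id='2']")
bob_name = bob.findtext("name")

# Element text is always a string, so convert before doing arithmetic.
total_salary = sum(int(e.findtext("salary")) for e in doc_root.iter("employee"))

# Serialize back to a str (the default would be bytes).
as_text = ET.tostring(doc_root, encoding="unicode")

print(bob_name, total_salary)  # Bob 135000
```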

## Exercises 7

1. Using `books.xml`, do the following:
   - read the XML file
   - append records in the XML file to `books.csv`: leave nonexistent fields empty, change `genre`:
       - to be all lowercase
       - to have spaces ` ` replaced with underscores `_`
   - create a new XML file where genres become tags, and books are grouped by genre:
     ```xml
     <catalog>
       <computer>
         <book id="bk101">
           <author>Gambardella, Matthew</author>
           <title>XML Developer's Guide</title>
           [...]
         </book>
         <book id="bk110">
           <author>O'Brien, Tim</author>
           <title>Microsoft .NET: The Programming Bible</title>
           [...]
         </book>
        </computer>
      </catalog>
      ```
     
     - include all tags under `book`, except `genre`
     - change date format `2000-12-01` -> `1 December 2000`

## 8. Working with HTML/XML (`BeautifulSoup`)

`BeautifulSoup` is a third-party Python library used for parsing HTML and XML documents. It creates a tree structure from raw markup, enabling easy navigation, search, and modification of the document's elements. Find the official documentation [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

**Why Use It?**
- Simplifies the process of web scraping.
- Compatible with different parsers like `html.parser`, `lxml`, and `html5lib`.
- Makes HTML and XML documents easy to traverse and manipulate.

### Installation

Install `BeautifulSoup` via `pip` along with a parser (optional):
```bash
pip install beautifulsoup4
pip install lxml  # Optional: Faster parser
```

`lxml` is considered the fastest parser and will be automatically used, if installed. See more details [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use).

* **Parsing HTML** - Load HTML content into a `BeautifulSoup` object.

In [57]:
from bs4 import BeautifulSoup

html = "<html><body><h1>Hello!</h1></body></html>"
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid a warning

print(soup.prettify())

<html>
 <body>
  <h1>
   Hello!
  </h1>
 </body>
</html>



* **Navigating the Parse Tree** - Traverse and access specific elements in the document.

In [58]:
# Access the first occurrence of a tag
print(soup.h1.text)

Hello!


In [59]:
# Access parent and children
print(soup.body.h1.parent.name)

body


In [60]:
# Find all elements of a tag
print(soup.find_all("h1"))

[<h1>Hello!</h1>]


* **Searching for Elements** - Use methods like `find()` and `find_all()` to locate elements by tag, class, or attributes.

In [61]:
html = """
<html>
    <body>
        <p class="intro">Introduction paragraph.</p>
        <p class="content">Main content.</p>
    </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

# Find by class
intro = soup.find("p", class_="intro")
print(intro.text)

Introduction paragraph.


In [62]:
# Find all paragraphs
all_paragraphs = soup.find_all("p")
for p in all_paragraphs:
    print(p.text)

Introduction paragraph.
Main content.


* **Extracting Text and Attributes** - Extract text content or attribute values.

In [63]:
print(soup.p.text)

Introduction paragraph.


In [64]:
print(soup.p["class"])

['intro']


* **Modifying HTML** - Add, remove, or modify elements and attributes.

In [65]:
# Add a new tag
new_tag = soup.new_tag("h2")
new_tag.string = "A new heading"
soup.body.append(new_tag)

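# Modify an existing tag (soup.p is the first <p>, the "intro" paragraph)
soup.p["class"] = "updated-class"

# Remove an element: this deletes that same first <p>, so the class change
# above does not appear in the output below
soup.p.decompose()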

print(soup.prettify())

<html>
 <body>
  <p class="content">
   Main content.
  </p>
  <h2>
   A new heading
  </h2>
 </body>
</html>



## Exercises 8

1. Go to https://docs.python.org/3/library/index.html and save the page to your computer (optionally, use [`requests`](https://requests.readthedocs.io/en/latest/) to get the page content programmatically).
2. Open the HTML file with `BeautifulSoup`.
3. Extract all links (`<a>` tags) and save their `href`, `title` attributes and their text in a CSV file.
E.g. the following html:

    ```html
    <a href="https://docs.python.org/3/library/intro.html" title="Introduction" accesskey="N">next</a>
    ```
    
    will become a row in the csv file:
   
    ```csv
    url,title,text
    https://docs.python.org/3/library/intro.html,Introduction,next
    ```