<p style="text-align:center">
    <a href="https://www.ict.mahidol.ac.th/en/" target="_blank">
    <img src="https://www3.ict.mahidol.ac.th/ICTSurveysV2/Content/image/MUICT2.png" width="400" alt="Faculty of ICT">
    </a>
</p>

# Tutorial 02: Handling Data

Python, together with the Pandas and NumPy libraries, provides a versatile and efficient toolkit for data manipulation and analysis.

Pandas offers data structures like DataFrames, which are similar to tables, allowing for easy organization and manipulation of data. With Pandas, you can effortlessly clean, filter, merge, and aggregate data, making it a crucial tool for data preprocessing and exploration.

NumPy, on the other hand, excels at numerical computations. It provides support for arrays and matrices, enabling fast and efficient operations on large datasets. NumPy is particularly useful for tasks like mathematical operations, data manipulation, and scientific computing.

Together, Pandas and NumPy form a powerful combination for handling and manipulating data in Python. Whether you're cleaning messy data, performing complex calculations, or exploring patterns in your data, these libraries provide the tools you need to get the job done.

## This tutorial aims to teach you how to:

* Read and write simple files using `open()`.
* Select data in dataframes.
* Load data with Pandas.

## Exercise 01: Reading Text Files in Python

This exercise covers the basics of reading text files in Python and best practices for file handling.

### Reading Text Files (Basic Method)

The most straightforward way to read a file is to call the `open()` function with the `'r'` (read) mode:

In [None]:
file = open("files/sample.txt", "r")  # Open the file in read mode
content = file.read()           # Read the entire content into a string
print(content)                  # Print the content
file.close()                   # Close the file (important!)

This reads the entire file content into a single string. You can also read line by line:

In [None]:
file = open("files/sample.txt", "r")
count = 0
for line in file:
    count = count + 1
    print('Line '+str(count)+': '+line.strip()) # strip() removes newline characters
file.close()

### A Better Way to Open a File (Using <code>with</code>)

The <code>with</code> statement provides a cleaner and safer way to handle files. It automatically closes the file, even if errors occur.

Key advantages of using <code>with</code>:
* Automatic file closing: Prevents resource leaks.
* Exception handling: Ensures the file is closed even if errors occur within the with block.

This <code>with</code> method is the recommended way to work with files in Python. It's concise, readable, and less prone to errors. This tutorial provides a basic understanding of reading text files in Python. You can further explore reading specific lines, handling different encodings, and other file operations based on your needs.

In [None]:
with open("files/sample.txt", "r") as file:
    content = file.read()
    print(content)

# Or read line by line:

with open("files/sample.txt", "r") as file:
    count = 0
    for line in file:
        count = count + 1
        print('Line '+str(count)+': '+line.strip())
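
The follow-up topics mentioned above (specific lines and encodings) can be sketched as follows. This is a minimal illustration, not part of the original exercise: the file `demo.txt` and the names `demo_path`, `lines`, and `second_line` are hypothetical, and the file is created in a temporary folder so the cell runs on its own.

```python
import os
import tempfile

# Hypothetical demo file, created here so the example is self-contained.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "w", encoding="utf-8") as file:
    file.write("first line\nsecond line\nthird line\n")

# Passing encoding explicitly avoids relying on the platform default.
with open(demo_path, "r", encoding="utf-8") as file:
    lines = file.readlines()  # A list of strings, one per line, newlines kept

# Pick a specific line by its list position (0-based).
second_line = lines[1].strip()
print(second_line)

# enumerate() can replace a manual line counter.
with open(demo_path, "r", encoding="utf-8") as file:
    for number, line in enumerate(file, start=1):
        print(f"Line {number}: {line.strip()}")
```

`readlines()` loads the whole file into memory, so for very large files iterating over the file object, as in the second loop, is usually the gentler choice.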

## Exercise 02: Writing and Saving Files in Python

This exercise covers writing, appending, and copying files in Python.

### Writing Files
The `open()` function with the 'w' (write) mode creates a new file or overwrites an existing one.

In [None]:
with open("my_new_file.txt", "w") as file:
    file.write("This is the first line.\n")
    file.write("This is the second line.\n")
    file.write("This is the third line.\n")

Let's check the contents of the file.

In [None]:
with open("my_new_file.txt", "r") as file:
    print(file.read())

**Important Note:** If "my_new_file.txt" already existed, its contents would have been completely overwritten.

### Appending Files

To add content to an existing file without overwriting it, use the 'a' (append) mode.

In [None]:
with open("my_new_file.txt", "a") as file:
    file.write("This line is appended.\n")
    file.write("Another appended line.\n")

Let's check the contents of the file again.

In [None]:
with open("my_new_file.txt", "r") as file:
    print(file.read())

### Additional File Modes
*   `'r'`: Read (default). Opens the file for reading.
*   `'w'`: Write. Opens the file for writing. Creates a new file if it does not exist or truncates the file if it exists.
*   `'a'`: Append. Opens the file for appending. Creates a new file if it does not exist.
*   `'x'`: Exclusive creation. Opens a file for exclusive creation. If the file already exists, the operation fails.
*   `'b'`: Binary mode. Used for non-text files (e.g., images, audio).
*   `'t'`: Text mode (default). Used for text files.
*   `'+'`: Open for updating (reading and writing).

You can combine modes, for example:

*   `'rb'`: Read binary.
*   `'w+'`: Read and write (overwrites the file).
*   `'a+'`: Read and append.

Example of using 'x' mode:

In [None]:
try:
    with open("new_file_exclusive.txt", "x") as file:
        file.write("This file was created exclusively.")
except FileExistsError:
    print("File already exists. Cannot create.")
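
The combined modes listed above can be tried out in the same way. The sketch below assumes a hypothetical scratch file in a temporary folder (`scratch_path` is an illustrative name) and shows that, after writing in `'w+'` or `'a+'` mode, the file position has to be moved back with `seek(0)` before reading.

```python
import os
import tempfile

# Hypothetical scratch file so the example does not touch your own files.
scratch_path = os.path.join(tempfile.mkdtemp(), "scratch.txt")

# 'w+' truncates (or creates) the file and allows reading as well as writing.
with open(scratch_path, "w+") as file:
    file.write("alpha\nbeta\n")
    file.seek(0)          # Move back to the start before reading
    text = file.read()
print(text)

# 'a+' keeps the existing content, writes at the end, and also allows reading.
with open(scratch_path, "a+") as file:
    file.write("gamma\n")
    file.seek(0)          # Reading starts wherever the position is, so rewind
    all_lines = file.read().splitlines()
print(all_lines)
```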

### Copy a File

There are several ways to copy files in Python. One simple way is to read the contents of one file and write them to another. For binary files, use binary modes ('rb' and 'wb'). For text files, the following approach works well:

In [None]:
def copy_file(source, destination):
    try:
        # Open both files in one with statement so both are closed automatically
        with open(source, "r") as source_file, open(destination, "w") as dest_file:
            for line in source_file:
                dest_file.write(line)
        print(f"File '{source}' copied to '{destination}' successfully.")
    except FileNotFoundError:
        print(f"Error: Source file '{source}' not found.")
    except Exception as e:
        print(f"An error occurred: {e}")

Example usage:

In [None]:
copy_file("my_new_file.txt", "my_new_file_copy.txt")

Check the copy:

In [None]:
with open("my_new_file_copy.txt", "r") as file:
    print(file.read())
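
For binary files, the same idea works with the `'rb'` and `'wb'` modes mentioned earlier. The following is a sketch under stated assumptions: `copy_binary`, `chunk_size`, and the file names are hypothetical, and the source file is generated in a temporary folder so the cell is self-contained.

```python
import os
import tempfile

def copy_binary(source, destination, chunk_size=64 * 1024):
    """Copy a file byte for byte, reading it in fixed-size chunks."""
    with open(source, "rb") as src, open(destination, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:      # An empty bytes object means end of file
                break
            dst.write(chunk)

# Hypothetical binary file containing the byte values 0..255.
folder = tempfile.mkdtemp()
src_path = os.path.join(folder, "data.bin")
dst_path = os.path.join(folder, "data_copy.bin")
with open(src_path, "wb") as f:
    f.write(bytes(range(256)))

copy_binary(src_path, dst_path)

with open(dst_path, "rb") as f:
    copied = f.read()
print(len(copied))
```

Reading in chunks keeps memory use bounded for large files. In everyday code, the standard library's `shutil.copyfile(source, destination)` does the same job in one call.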

Summary: This exercise has covered the basics of writing, appending, using different file modes, and copying files in Python. Remember to use the `with` statement for proper file handling.

## Exercise 03: Selecting Data in a Pandas DataFrame

This exercise demonstrates how to select data from a Pandas DataFrame using various methods.

### Creating a DataFrame and Series

First, let's import the Pandas library and create a sample DataFrame.

In [None]:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 22, 28, 24],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney'],
    'Score': [85, 92, 78, 88, 95]
}

df = pd.DataFrame(data)
print("DataFrame:")
print(df)
print("\n")

In [None]:
# Creating a Series
ages = pd.Series([25, 30, 22, 28, 24], name="Ages")
print("Series:")
print(ages)
print("\n")

### Locating Data using loc
The `.loc` indexer accesses data by label (row and column names). It is used with square brackets, not called like a function.

In [None]:
# Selecting a single row
print("Select the row with index 2:")
print(df.loc[2])
print("\n")

In [None]:
# Selecting multiple rows
print("Select rows with index 1 and 3:")
print(df.loc[[1, 3]])
print("\n")

In [None]:
# Selecting a single column
print("Select the 'Name' column:")
print(df.loc[:, 'Name'])
print("\n")

In [None]:
# Selecting multiple columns
print("Select the 'Name' and 'City' columns:")
print(df.loc[:, ['Name', 'City']])
print("\n")

In [None]:
# Selecting specific rows and columns
print("Select the 'Age' and 'Score' from rows 0 and 2:")
print(df.loc[[0, 2], ['Age', 'Score']])
print("\n")

In [None]:
# Selecting a range of rows and columns
print("Select rows from index 1 to 3 and columns 'Age' to 'City':")
print(df.loc[1:3, 'Age':'City']) # Note: with loc, the end is *inclusive*
print("\n")

In [None]:
# Boolean indexing with loc()
print("Select rows where 'Age' is greater than 25:")
print(df.loc[df['Age'] > 25])
print("\n")

In [None]:
print("Select 'Name' and 'City' where 'Score' is greater than 90:")
print(df.loc[df['Score'] > 90, ['Name', 'City']])
print("\n")
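
Boolean conditions can also be combined. The sketch below rebuilds the sample data under the hypothetical name `df_demo` so that it runs on its own. Each condition needs its own parentheses, and the element-wise operators `&` (and) and `|` (or) are used instead of Python's `and` / `or`.

```python
import pandas as pd

# Same sample data as above, under a separate (hypothetical) name.
df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 22, 28, 24],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney'],
    'Score': [85, 92, 78, 88, 95]
})

# Both conditions must hold: older than 24 AND score above 86.
older_and_high = df_demo.loc[(df_demo['Age'] > 24) & (df_demo['Score'] > 86), 'Name']
print(older_and_high.tolist())

# At least one condition must hold: younger than 23 OR score above 90.
young_or_top = df_demo.loc[(df_demo['Age'] < 23) | (df_demo['Score'] > 90), 'Name']
print(young_or_top.tolist())
```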

### Locating Data using iloc

The `.iloc` indexer accesses data by integer position (row and column index). Like `.loc`, it is used with square brackets, not called like a function.

In [None]:
# Selecting a single row
print("Select the row at index 2:")
print(df.iloc[2])
print("\n")

In [None]:
# Selecting multiple rows
print("Select rows at indices 1 and 3:")
print(df.iloc[[1, 3]])
print("\n")

In [None]:
# Selecting a single column
print("Select the column at index 0 (Name):")
print(df.iloc[:, 0])
print("\n")

In [None]:
# Selecting multiple columns
print("Select the columns at indices 0 and 2 (Name and City):")
print(df.iloc[:, [0, 2]])
print("\n")

In [None]:
# Selecting specific rows and columns
print("Select the values at row indices 0 and 2, and column indices 1 and 3:")
print(df.iloc[[0, 2], [1, 3]])
print("\n")

In [None]:
# Selecting a range of rows and columns
print("Select rows from index 1 to 3 (exclusive) and columns from index 1 to 3 (exclusive):")
print(df.iloc[1:3, 1:3]) # Note: with iloc, the end is *exclusive*
print("\n")
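
With the default index 0, 1, 2, ... labels and positions look the same, which can hide the difference between the two indexers. The sketch below uses a hypothetical DataFrame named `scores` with string labels to make the contrast visible.

```python
import pandas as pd

# Hypothetical data with string row labels instead of the default 0..n-1.
scores = pd.DataFrame(
    {'Age': [25, 30, 22, 28], 'Score': [85, 92, 78, 88]},
    index=['alice', 'bob', 'charlie', 'david'],
)

# Label-based slice: both 'bob' and 'charlie' are included.
by_label = scores.loc['bob':'charlie']
print(by_label)

# Position-based slice: positions 1 and 2 are included, position 3 is not.
by_position = scores.iloc[1:3]
print(by_position)
```

Both slices return the same two rows here, but for different reasons: `.loc` includes the end label, while `.iloc` stops before the end position.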

### Slicing

Slicing a DataFrame directly, as in `df[start:stop]`, selects rows rather than columns. With integer bounds the slice is positional, and the end is exclusive.

In [None]:
print("Select the first 3 rows:")
print(df[:3]) # Equivalent to df.iloc[:3]
print("\n")

In [None]:
print("Select rows from index 2 to the end:")
print(df[2:])
print("\n")

To slice rows and select specific columns at the same time, use `.loc` or `.iloc`:

In [None]:
print("Select the first 2 rows of 'Name' and 'Age' columns")
print(df.loc[:1, ['Name', 'Age']])
print("\n")

Summary: We covered the fundamental ways to select data from a Pandas DataFrame. Using `.loc` for label-based indexing and `.iloc` for integer-based indexing provides flexible and powerful data access. Remember the important difference in how the end of a range is handled: `.loc` is inclusive, `.iloc` is exclusive.

## Exercise 04: Loading Data with Pandas and Basic Data Exploration

This exercise covers loading data from a CSV file using Pandas and performing basic data exploration.

### Introduction to Pandas

Pandas is a powerful Python library for data manipulation and analysis. It provides data structures like DataFrames, which are similar to tables in a relational database or spreadsheets.

### Importing a CSV File into Pandas

In [None]:
# First, let's import the Pandas library.

import pandas as pd

Now, let's load a CSV file into a Pandas DataFrame. We have provided you with a CSV file called 'titanic.csv'.

In [None]:
try:
    df = pd.read_csv('files/titanic.csv')
    print("CSV file loaded successfully!")
except FileNotFoundError:
    print("Error: 'files/titanic.csv' not found. Please check that the path is correct relative to your notebook.")
    raise  # Re-raise so execution stops here; exit() can shut down the notebook kernel
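
`read_csv` accepts optional parameters that control what gets loaded. The sketch below is illustrative only: it reads a small made-up CSV from a string via `io.StringIO`, so the rows, names, and values are hypothetical and are not taken from `titanic.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV content; the third row has no Age value.
csv_text = """PassengerId,Name,Age,Pclass
1,Passenger A,22,3
2,Passenger B,38,1
3,Passenger C,,3
"""

# index_col uses a column from the file as the row labels.
full = pd.read_csv(io.StringIO(csv_text), index_col='PassengerId')
print(full)

# usecols keeps only the listed columns; nrows limits how many rows are read.
subset = pd.read_csv(io.StringIO(csv_text), usecols=['Name', 'Pclass'], nrows=2)
print(subset)
```

The empty Age field is loaded as `NaN`, which is how Pandas marks missing values.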

### Viewing Data

Here are some ways to view the data:

*   `head()`: Displays the first few rows (default is 5).

In [None]:
print("\nFirst 5 rows:")
print(df.head())

*   `tail()`: Displays the last few rows.

In [None]:
print("\nLast 5 rows:")
print(df.tail())

*   `info()`: Provides information about the DataFrame, including data types and non-null values.

In [None]:
print("\nDataFrame info:")
df.info()

*   `shape`: Returns the dimensions (rows, columns) of the DataFrame.

In [None]:
print("\nDataFrame shape:")
print(df.shape)

*   `columns`: Returns the column names.

In [None]:
print("\nColumn names:")
print(df.columns)
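
A few related calls are often useful at this stage. The sketch below uses a small hypothetical DataFrame named `toy` (not the Titanic data) to show `head(n)` with an explicit row count, `dtypes`, and a per-column count of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing ages.
toy = pd.DataFrame({
    'Name': ['Passenger A', 'Passenger B', 'Passenger C', 'Passenger D'],
    'Age': [22.0, np.nan, 35.0, np.nan],
    'Pclass': [3, 1, 2, 3],
})

# head() takes the number of rows to show.
first_two = toy.head(2)
print(first_two)

# dtypes lists the data type of each column.
print(toy.dtypes)

# isna() marks missing cells; summing counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```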

### Accessing Data

You can access data in a DataFrame using various methods:

**Column selection:**

In [None]:
print("\nSelecting the 'Name' column:")
print(df['Name'])  # Or df.Name

**Row selection using `.loc` (label-based) and `.iloc` (integer-based):**

In [None]:
print("\nSelecting the row with index 1 using loc:")
print(df.loc[1])

In [None]:
print("\nSelecting the row at index 0 using iloc:")
print(df.iloc[0])

**Selecting specific cells:**

In [None]:
print("\nSelecting the 'Age' value in the second row (index 1) using loc:")
print(df.loc[1, 'Age'])

In [None]:
print("\nSelecting the 'Age' value in the first row (index 0) using iloc:")
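age_position = df.columns.get_loc('Age')  # Look up the column position instead of hard-coding it
print(df.iloc[0, age_position])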

### Simple Exploratory Data Analysis (EDA)

The `describe()` method provides summary statistics for numerical columns.

In [None]:
print("\nSummary statistics:")
print(df.describe())

For categorical columns, you can use `value_counts()`:

In [None]:
print("\nValue counts for Pclass")
print(df['Pclass'].value_counts())
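
`value_counts()` has two options that are handy during exploration. The sketch below uses a hypothetical DataFrame named `toy` with made-up values: `normalize=True` returns shares instead of counts, and `dropna=False` also counts missing values, which are skipped by default.

```python
import pandas as pd

# Hypothetical data; one Embarked value is missing.
toy = pd.DataFrame({
    'Pclass': [3, 1, 3, 2, 3, 1],
    'Embarked': ['S', 'C', 'S', None, 'S', 'Q'],
})

# Shares of each class instead of raw counts (they sum to 1).
shares = toy['Pclass'].value_counts(normalize=True)
print(shares)

# By default missing values are left out of the counts...
without_missing = toy['Embarked'].value_counts()
# ...unless dropna=False is passed.
with_missing = toy['Embarked'].value_counts(dropna=False)
print(with_missing)
```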

### Practice: Finding Answers in the Data

Let's try to answer some questions using the data.

1. What is the average age of the passengers?

<details><summary> >> Hint </summary>

```
average_age = df['Age'].mean()
print(f"\nAverage age: {average_age}")
```
    
</details>

In [None]:
# Your code here

2. How many passengers survived?

<details><summary> >> Hint </summary>

```
number_survived = df['Survived'].sum() # Because survived is 1 or 0 we can just sum it
print(f"\nNumber of passengers survived: {number_survived}")
```

</details>

In [None]:
# Your code here

3. What is the name of the first passenger?

<details><summary> >> Hint </summary>

```
first_passenger_name = df.loc[0, 'Name']
print(f"\nName of the first passenger: {first_passenger_name}")
```

</details>

In [None]:
# Your code here

4. How many passengers were in each Pclass?

<details><summary> >> Hint </summary>

```
passengers_per_class = df['Pclass'].value_counts()
print(f"\nPassengers per Pclass:\n{passengers_per_class}")
```

</details>

In [None]:
# Your code here

Summary: This exercise has covered the basics of loading data with Pandas, viewing and accessing data, and performing some basic EDA.

<p style="text-align:center;">That's it! Congratulations! <br> 
    Let's now work on your lab assignment.</p>