<p style="text-align:center">
    <a href="https://www.ict.mahidol.ac.th/en/" target="_blank">
    <img src="https://www3.ict.mahidol.ac.th/ICTSurveysV2/Content/image/MUICT2.png" width="400" alt="Faculty of ICT">
    </a>
</p>

# Tutorial 02: Handling Data

In the realm of data manipulation and analysis, Python, along with its powerful libraries Pandas and NumPy, provides a versatile and efficient toolkit.

Pandas offers data structures like DataFrames, which are similar to tables, allowing for easy organization and manipulation of data. With Pandas, you can effortlessly clean, filter, merge, and aggregate data, making it a crucial tool for data preprocessing and exploration.

NumPy, on the other hand, excels at numerical computations. It provides support for arrays and matrices, enabling fast and efficient operations on large datasets. NumPy is particularly useful for tasks like mathematical operations, data manipulation, and scientific computing.

Together, Pandas and NumPy form a powerful combination for handling and manipulating data in Python. Whether you're cleaning messy data, performing complex calculations, or exploring patterns in your data, these libraries provide the tools you need to get the job done.

## This tutorial aims to teach you how to:

* Read and write simple files using `open()`.
* Select data in dataframes.
* Load data with Pandas.

## Exercise 01: Reading Text Files in Python

This exercise covers the basics of reading text files in Python and best practices for file handling.

### Reading Text Files (Basic Method)

The most straightforward way to read a file is to call the `open()` function with the `'r'` (read) mode:

In [4]:
file = open("files/sample.txt", "r")  # Open the file in read mode
content = file.read()           # Read the entire content into a string
print(content)                  # Print the content
file.close()                   # Close the file (important!)

This is the first line.
This is the second line.
And this is the third line.


This reads the entire file content into a single string. You can also read line by line:

In [5]:
file = open("files/sample.txt", "r")
count = 0
for line in file:
    count = count + 1
    print('Line '+str(count)+': '+line.strip()) # strip() removes newline characters
file.close()

Line 1: This is the first line.
Line 2: This is the second line.
Line 3: And this is the third line.
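
The manual counter in the loop above can also be written with the built-in `enumerate()`. The following is a minimal sketch, not part of the original exercise: it creates its own throwaway file (the name `demo.txt` is hypothetical) so it does not depend on `files/sample.txt`, and it assumes the file is UTF-8 encoded.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")  # hypothetical file name
file = open(path, "w", encoding="utf-8")
file.write("first\nsecond\nthird\n")
file.close()

numbered = []
file = open(path, "r", encoding="utf-8")  # assumed encoding
# enumerate() yields (counter, line) pairs; start=1 makes the counter 1-based.
for number, line in enumerate(file, start=1):
    numbered.append(f"Line {number}: {line.strip()}")
file.close()

print("\n".join(numbered))
```

Passing `encoding` explicitly avoids relying on the platform default, which can differ between operating systems.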


### A Better Way to Open a File (Using <code>with</code>)

The with statement provides a cleaner and safer way to handle files. It automatically closes the file, even if errors occur.

Key Advantages of using with:
* Automatic file closing: Prevents resource leaks.
* Exception handling: Ensures the file is closed even if errors occur within the with block.

The <code>with</code> statement is the recommended way to work with files in Python: it is concise, readable, and less prone to errors. From here, you can explore reading specific lines, handling different encodings, and other file operations based on your needs.

In [6]:
with open("files/sample.txt", "r") as file:
    content = file.read()
    print(content)

# Or read line by line:

with open("files/sample.txt", "r") as file:
    count = 0
    for line in file:
        count = count + 1
        print('Line '+str(count)+': '+line.strip())

This is the first line.
This is the second line.
And this is the third line.
Line 1: This is the first line.
Line 2: This is the second line.
Line 3: And this is the third line.
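
The claim that `with` closes the file even when an error occurs can be checked through the file object's `closed` attribute. This is a small sketch under stated assumptions: the file is a throwaway one with a hypothetical name, and the `ValueError` is raised deliberately to simulate a failure.

```python
import os
import tempfile

# Throwaway file so the sketch does not depend on the tutorial's files/ folder.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")  # hypothetical file name
with open(path, "w") as f:
    f.write("hello\n")

# Normal exit from the block: the file is closed afterwards.
with open(path, "r") as f:
    inside = f.closed   # False while the block is running
after = f.closed        # True once the block has ended

# Exit through an exception: the file is still closed.
try:
    with open(path, "r") as g:
        raise ValueError("simulated failure inside the with block")
except ValueError:
    closed_after_error = g.closed

print(inside, after, closed_after_error)
```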


## Exercise 02: Writing and Saving Files in Python

This exercise covers writing, appending, and copying files in Python.

### Writing Files
The `open()` function with the 'w' (write) mode creates a new file or overwrites an existing one.

In [7]:
with open("my_new_file.txt", "w") as file:
    file.write("This is the first line.\n")
    file.write("This is the second line.\n")
    file.write("This is the third line.\n")

Let's check the contents of the file.

In [8]:
with open("my_new_file.txt", "r") as file:
    print(file.read())

This is the first line.
This is the second line.
This is the third line.



**Important Note:** If "my_new_file.txt" already existed, its contents would have been completely overwritten.

### Appending Files

To add content to an existing file without overwriting it, use the 'a' (append) mode.

In [9]:
with open("my_new_file.txt", "a") as file:
    file.write("This line is appended.\n")
    file.write("Another appended line.\n")

Let's check the contents of the file again.

In [10]:
with open("my_new_file.txt", "r") as file:
    print(file.read())

This is the first line.
This is the second line.
This is the third line.
This line is appended.
Another appended line.



### Additional File Modes
*   `'r'`: Read (default). Opens the file for reading.
*   `'w'`: Write. Opens the file for writing. Creates a new file if it does not exist or truncates the file if it exists.
*   `'a'`: Append. Opens the file for appending. Creates a new file if it does not exist.
*   `'x'`: Exclusive creation. Opens a file for exclusive creation. If the file already exists, the operation fails.
*   `'b'`: Binary mode. Used for non-text files (e.g., images, audio).
*   `'t'`: Text mode (default). Used for text files.
*   `'+'`: Open for updating (reading and writing).

You can combine modes, for example:

*   `'rb'`: Read binary.
*   `'w+'`: Read and write (overwrites the file).
*   `'a+'`: Read and append.
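
The combined modes listed above can be sketched as follows. This is an illustrative example only: the file name `modes_demo.txt` is hypothetical, and the file is created in a temporary directory so nothing from the exercise is overwritten.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")  # hypothetical file name

# 'w+' truncates (or creates) the file and allows reading as well as writing.
with open(path, "w+") as f:
    f.write("alpha\n")
    f.seek(0)               # move back to the start before reading
    after_write = f.read()

# 'a+' keeps the existing content; writes always go to the end of the file.
with open(path, "a+") as f:
    f.write("beta\n")
    f.seek(0)
    after_append = f.read()

# 'rb' returns bytes instead of str.
with open(path, "rb") as f:
    raw = f.read()

print(repr(after_write))
print(repr(after_append))
print(type(raw).__name__)
```

Note the `seek(0)` calls: after writing, the file position sits at the end, so reading without moving back would return an empty string.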

Example of using 'x' mode:

In [11]:
try:
    with open("new_file_exclusive.txt", "x") as file:
        file.write("This file was created exclusively.")
except FileExistsError:
    print("File already exists. Cannot create.")

File already exists. Cannot create.


### Copy a File

There are several ways to copy files in Python. One simple way is to read the contents of one file and write them to another. For binary files, use binary modes ('rb' and 'wb'). For text files, the following approach works well:

In [12]:
def copy_file(source, destination):
    try:
        with open(source, "r") as source_file, open(destination, "w") as dest_file:
            for line in source_file:
                dest_file.write(line)
        print(f"File '{source}' copied to '{destination}' successfully.")
    except FileNotFoundError:
        print(f"Error: Source file '{source}' not found.")
    except Exception as e:
        print(f"An error occurred: {e}")

Example usage:

In [13]:
copy_file("my_new_file.txt", "my_new_file_copy.txt")

File 'my_new_file.txt' copied to 'my_new_file_copy.txt' successfully.


Check the copy:

In [14]:
with open("my_new_file_copy.txt", "r") as file:
    print(file.read())

This is the first line.
This is the second line.
This is the third line.
This line is appended.
Another appended line.
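
For binary files, the same idea applies with the `'rb'` and `'wb'` modes mentioned earlier. The sketch below is one possible approach, not the only one: the helper name `copy_binary`, the chunk size, and the file names are all assumptions chosen for illustration.

```python
import os
import tempfile

def copy_binary(source, destination, chunk_size=64 * 1024):
    """Copy any file byte for byte, reading one fixed-size chunk at a time."""
    with open(source, "rb") as src, open(destination, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:   # an empty bytes object means end of file
                break
            dst.write(chunk)

tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "data.bin")        # hypothetical file names
dst_path = os.path.join(tmp_dir, "data_copy.bin")

with open(src_path, "wb") as f:
    f.write(bytes(range(256)) * 1000)   # 256,000 bytes of sample data

copy_binary(src_path, dst_path)

with open(src_path, "rb") as a, open(dst_path, "rb") as b:
    identical = a.read() == b.read()
print(identical)
```

Reading in chunks keeps memory use bounded for large files. The standard library's `shutil.copyfile(src, dst)` does the same job in one call and is usually the simpler choice in practice.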



Summary: This exercise has covered the basics of writing, appending, using different file modes, and copying files in Python. Remember to use the `with` statement for proper file handling.

## Exercise 03: Selecting Data in a Pandas DataFrame

This exercise demonstrates how to select data from a Pandas DataFrame using various methods.

### Creating a DataFrame and Series

First, let's import the Pandas library and create a sample DataFrame.

In [15]:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 22, 28, 24],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney'],
    'Score': [85, 92, 78, 88, 95]
}

df = pd.DataFrame(data)
print("DataFrame:")
print(df)
print("\n")

DataFrame:
      Name  Age      City  Score
0    Alice   25  New York     85
1      Bob   30    London     92
2  Charlie   22     Paris     78
3    David   28     Tokyo     88
4      Eve   24    Sydney     95




In [16]:
# Creating a Series
ages = pd.Series([25, 30, 22, 28, 24], name="Ages")
print("Series:")
print(ages)
print("\n")

Series:
0    25
1    30
2    22
3    28
4    24
Name: Ages, dtype: int64




### Locating Data using <code>loc</code>
The `loc` indexer accesses data by label (row and column names). It is used with square brackets, for example `df.loc[2]`, rather than called like a function.

In [17]:
# Selecting a single row
print("Select the row with index 2:")
print(df.loc[2])
print("\n")

Select the row with index 2:
Name     Charlie
Age           22
City       Paris
Score         78
Name: 2, dtype: object




In [18]:
# Selecting multiple rows
print("Select rows with index 1 and 3:")
print(df.loc[[1, 3]])
print("\n")

Select rows with index 1 and 3:
    Name  Age    City  Score
1    Bob   30  London     92
3  David   28   Tokyo     88




In [19]:
# Selecting a single column
print("Select the 'Name' column:")
print(df.loc[:, 'Name'])
print("\n")

Select the 'Name' column:
0      Alice
1        Bob
2    Charlie
3      David
4        Eve
Name: Name, dtype: object




In [20]:
# Selecting multiple columns
print("Select the 'Name' and 'City' columns:")
print(df.loc[:, ['Name', 'City']])
print("\n")

Select the 'Name' and 'City' columns:
      Name      City
0    Alice  New York
1      Bob    London
2  Charlie     Paris
3    David     Tokyo
4      Eve    Sydney




In [21]:
# Selecting specific rows and columns
print("Select the 'Age' and 'Score' from rows 0 and 2:")
print(df.loc[[0, 2], ['Age', 'Score']])
print("\n")

Select the 'Age' and 'Score' from rows 0 and 2:
   Age  Score
0   25     85
2   22     78




In [22]:
# Selecting a range of rows and columns
print("Select rows from index 1 to 3 and columns 'Age' to 'City':")
print(df.loc[1:3, 'Age':'City']) # Note: with loc, the end is *inclusive*
print("\n")

Select rows from index 1 to 3 and columns 'Age' to 'City':
   Age    City
1   30  London
2   22   Paris
3   28   Tokyo




In [23]:
# Boolean indexing with loc
print("Select rows where 'Age' is greater than 25:")
print(df.loc[df['Age'] > 25])
print("\n")

Select rows where 'Age' is greater than 25:
    Name  Age    City  Score
1    Bob   30  London     92
3  David   28   Tokyo     88




In [24]:
print("Select 'Name' and 'City' where 'Score' is greater than 90:")
print(df.loc[df['Score'] > 90, ['Name', 'City']])
print("\n")

Select 'Name' and 'City' where 'Score' is greater than 90:
  Name    City
1  Bob  London
4  Eve  Sydney
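
Boolean conditions can also be combined. The sketch below rebuilds the same sample data under the hypothetical name `df_demo` so it runs on its own. It uses `&` (and), `|` (or), and `~` (not); each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

# Same sample data as the exercise, under a separate (hypothetical) name.
df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 22, 28, 24],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney'],
    'Score': [85, 92, 78, 88, 95]
})

# AND: older than 24 and scoring above 85.
both = df_demo.loc[(df_demo['Age'] > 24) & (df_demo['Score'] > 85), 'Name']

# OR: younger than 23 or scoring above 90.
either = df_demo.loc[(df_demo['Age'] < 23) | (df_demo['Score'] > 90), 'Name']

# NOT: everyone who does not live in London.
not_london = df_demo.loc[~(df_demo['City'] == 'London'), 'Name']

print(both.tolist())
print(either.tolist())
print(not_london.tolist())
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the symbolic operators are used here.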




### Locating Data using <code>iloc</code>

The `iloc` indexer accesses data by integer position (row and column position, counted from 0). Like `loc`, it is used with square brackets rather than called like a function.

In [25]:
# Selecting a single row
print("Select the row at index 2:")
print(df.iloc[2])
print("\n")

Select the row at index 2:
Name     Charlie
Age           22
City       Paris
Score         78
Name: 2, dtype: object




In [26]:
# Selecting multiple rows
print("Select rows at indices 1 and 3:")
print(df.iloc[[1, 3]])
print("\n")

Select rows at indices 1 and 3:
    Name  Age    City  Score
1    Bob   30  London     92
3  David   28   Tokyo     88




In [27]:
# Selecting a single column
print("Select the column at index 0 (Name):")
print(df.iloc[:, 0])
print("\n")

Select the column at index 0 (Name):
0      Alice
1        Bob
2    Charlie
3      David
4        Eve
Name: Name, dtype: object




In [28]:
# Selecting multiple columns
print("Select the columns at indices 0 and 2 (Name and City):")
print(df.iloc[:, [0, 2]])
print("\n")

Select the columns at indices 0 and 2 (Name and City):
      Name      City
0    Alice  New York
1      Bob    London
2  Charlie     Paris
3    David     Tokyo
4      Eve    Sydney




In [29]:
# Selecting specific rows and columns
print("Select the values at row indices 0 and 2, and column indices 1 and 3:")
print(df.iloc[[0, 2], [1, 3]])
print("\n")

Select the values at row indices 0 and 2, and column indices 1 and 3:
   Age  Score
0   25     85
2   22     78




In [30]:
# Selecting a range of rows and columns
print("Select rows from index 1 to 3 (exclusive) and columns from index 1 to 3 (exclusive):")
print(df.iloc[1:3, 1:3]) # Note: with iloc, the end is *exclusive*
print("\n")

Select rows from index 1 to 3 (exclusive) and columns from index 1 to 3 (exclusive):
   Age    City
1   30  London
2   22   Paris




### Slicing

You can use slicing directly on the DataFrame, but it primarily works on row indices.

In [31]:
print("Select the first 3 rows:")
print(df[:3]) # Equivalent to df.iloc[:3]
print("\n")

Select the first 3 rows:
      Name  Age      City  Score
0    Alice   25  New York     85
1      Bob   30    London     92
2  Charlie   22     Paris     78




In [32]:
print("Select rows from index 2 to the end:")
print(df[2:])
print("\n")

Select rows from index 2 to the end:
      Name  Age    City  Score
2  Charlie   22   Paris     78
3    David   28   Tokyo     88
4      Eve   24  Sydney     95




Slicing with specific columns requires using `.loc` or `.iloc`.

In [33]:
print("Select the first 2 rows of 'Name' and 'Age' columns")
print(df.loc[:1, ['Name', 'Age']])
print("\n")

Select the first 2 rows of 'Name' and 'Age' columns
    Name  Age
0  Alice   25
1    Bob   30




Summary: We covered the fundamental ways to select data from a Pandas DataFrame. Using `loc` for label-based indexing and `iloc` for integer-position indexing provides flexible and powerful data access. Remember the important difference in how the end of a slice is handled: `loc` includes it, while `iloc` excludes it.
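
The difference between label-based and position-based slicing is easiest to see when the index is not the default 0, 1, 2, ... range. The following is a small sketch with made-up data; the name `scores` and the string labels are hypothetical.

```python
import pandas as pd

scores = pd.DataFrame(
    {'Score': [85, 92, 78, 88]},
    index=['a', 'b', 'c', 'd'],   # string labels instead of 0, 1, 2, ...
)

by_label = scores.loc['b':'c']    # label slice: both 'b' and 'c' are included
by_position = scores.iloc[1:3]    # position slice: rows 1 and 2; row 3 is excluded
one_wider = scores.loc['b':'d']   # extending the label slice includes 'd' as well

print(by_label.equals(by_position))
print(len(one_wider))
```

With the default integer index, labels and positions happen to coincide, which can hide this distinction until the DataFrame is filtered, sorted, or re-indexed.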

## Exercise 04: Loading Data with Pandas and Basic Data Exploration

This exercise covers loading data from a CSV file using Pandas and performing basic data exploration.

### Introduction to Pandas

Pandas is a powerful Python library for data manipulation and analysis. It provides data structures like DataFrames, which are similar to tables in a relational database or spreadsheets.

### Importing a CSV File into Pandas

In [34]:
# First, let's import the Pandas library.

import pandas as pd

Now, let's load a CSV file into a Pandas DataFrame. We have provided you with a CSV file called 'titanic.csv'.

In [35]:
try:
    df = pd.read_csv('files/titanic.csv')
    print("CSV file loaded successfully!")
except FileNotFoundError:
    print("Error: 'files/titanic.csv' not found. Please make sure the 'files' folder is in the same directory as your notebook or provide the correct path.")
    raise  # Re-raise so later cells do not run without a loaded DataFrame

CSV file loaded successfully!
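
`pd.read_csv` accepts optional parameters that control what gets loaded. The sketch below shows three of them (`usecols`, `index_col`, `nrows`). To stay self-contained it reads from an in-memory string with made-up rows rather than from `titanic.csv`; the column names are hypothetical.

```python
import io
import pandas as pd

# Inline CSV text standing in for a file on disk (made-up rows).
csv_text = """id,name,age,fare
1,Ann,22,7.25
2,Ben,,71.28
3,Cy,26,7.92
4,Di,35,53.10
"""

subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['id', 'name', 'age'],  # load only these columns
    index_col='id',                 # use the 'id' column as the row index
    nrows=3,                        # read only the first 3 data rows
)
print(subset)
```

The empty `age` field in the second row is loaded as `NaN`, which is how Pandas represents missing values.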


### Viewing Data

Here are some ways to view the data:

*   `head()`: Displays the first few rows (default is 5).

In [36]:
print("\nFirst 5 rows:")
print(df.head())


First 5 rows:
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  

*   `tail()`: Displays the last few rows.

In [37]:
print("\nLast 5 rows:")
print(df.tail())


Last 5 rows:
     PassengerId  Survived  Pclass                                      Name  \
886          887         0       2                     Montvila, Rev. Juozas   
887          888         1       1              Graham, Miss. Margaret Edith   
888          889         0       3  Johnston, Miss. Catherine Helen "Carrie"   
889          890         1       1                     Behr, Mr. Karl Howell   
890          891         0       3                       Dooley, Mr. Patrick   

        Sex   Age  SibSp  Parch      Ticket   Fare Cabin Embarked  
886    male  27.0      0      0      211536  13.00   NaN        S  
887  female  19.0      0      0      112053  30.00   B42        S  
888  female   NaN      1      2  W./C. 6607  23.45   NaN        S  
889    male  26.0      0      0      111369  30.00  C148        C  
890    male  32.0      0      0      370376   7.75   NaN        Q  


*   `info()`: Provides information about the DataFrame, including data types and non-null values.

In [38]:
print("\nDataFrame info:")
df.info()


DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
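
The non-null counts reported by `info()` show that some columns have missing values. One way to count them directly is `isna().sum()`. This is a sketch on a small made-up DataFrame (the name `people` and its values are hypothetical), not on the Titanic data itself.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    'Name': ['Ann', 'Ben', 'Cy', 'Di'],
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
})

# isna() marks each missing cell as True; summing counts them per column.
missing_per_column = people.isna().sum()

# The mean of the same boolean table is the fraction of missing rows.
missing_share = people.isna().mean()

print(missing_per_column)
print(missing_share)
```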


*   `shape`: Returns the dimensions (rows, columns) of the DataFrame.

In [39]:
print("\nDataFrame shape:")
print(df.shape)


DataFrame shape:
(891, 12)


*   `columns`: Returns the column names.

In [40]:
print("\nColumn names:")
print(df.columns)


Column names:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')


### Accessing Data

You can access data in a DataFrame using various methods:

**Column selection:**

In [41]:
print("\nSelecting the 'Name' column:")
print(df['Name'])  # Or df.Name


Selecting the 'Name' column:
0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object


**Row selection using `.loc` (label-based) and `.iloc` (integer-based):**

In [42]:
print("\nSelecting the row with index 1 using loc:")
print(df.loc[1])


Selecting the row with index 1 using loc:
PassengerId                                                    2
Survived                                                       1
Pclass                                                         1
Name           Cumings, Mrs. John Bradley (Florence Briggs Th...
Sex                                                       female
Age                                                         38.0
SibSp                                                          1
Parch                                                          0
Ticket                                                  PC 17599
Fare                                                     71.2833
Cabin                                                        C85
Embarked                                                       C
Name: 1, dtype: object


In [43]:
print("\nSelecting the row at index 0 using iloc:")
print(df.iloc[0])


Selecting the row at index 0 using iloc:
PassengerId                          1
Survived                             0
Pclass                               3
Name           Braund, Mr. Owen Harris
Sex                               male
Age                               22.0
SibSp                                1
Parch                                0
Ticket                       A/5 21171
Fare                              7.25
Cabin                              NaN
Embarked                             S
Name: 0, dtype: object


**Selecting specific cells:**

In [44]:
print("\nSelecting the 'Age' value in the second row (index 1) using loc:")
print(df.loc[1, 'Age'])


Selecting the 'Age' value in the second row (index 1) using loc:
38.0


In [45]:
print("\nSelecting the 'Age' value in the first row (index 0) using iloc:")
print(df.iloc[0, 5])  # 'Age' is the column at position 5


Selecting the 'Age' value in the first row (index 0) using iloc:
22.0
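
Because `iloc` takes a column position rather than a column name, counting positions by hand is easy to get wrong. One way to avoid counting is to look the position up with `columns.get_loc()`. This is a minimal sketch on a tiny stand-in DataFrame; the name `people` is hypothetical and it holds only three of the Titanic columns.

```python
import pandas as pd

# Tiny stand-in with a few of the Titanic columns.
people = pd.DataFrame({
    'PassengerId': [1, 2],
    'Survived': [0, 1],
    'Age': [22.0, 38.0],
})

age_position = people.columns.get_loc('Age')  # integer position of the column
first_age = people.iloc[0, age_position]      # row 0, column found by name

print(age_position, first_age)
```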


### Simple Exploratory Data Analysis (EDA)

The `describe()` method provides summary statistics for numerical columns.

In [46]:
print("\nSummary statistics:")
print(df.describe())


Summary statistics:
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  


For categorical columns, you can use `value_counts()`:

In [47]:
print("\nValue counts for Pclass")
print(df['Pclass'].value_counts())


Value counts for Pclass
Pclass
3    491
1    216
2    184
Name: count, dtype: int64
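
`value_counts()` also has options for proportions and ordering. The sketch below uses a short made-up Series (the values are not taken from the Titanic file) to show `normalize=True` and `sort_index()`.

```python
import pandas as pd

pclass = pd.Series([3, 1, 3, 3, 2, 1, 3, 2], name='Pclass')  # made-up values

counts = pclass.value_counts()                 # sorted by count, descending
shares = pclass.value_counts(normalize=True)   # fractions that sum to 1
by_label = pclass.value_counts().sort_index()  # ordered by class label instead

print(counts)
print(shares)
print(by_label)
```

Proportions are often easier to compare than raw counts when groups differ a lot in size.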


### Practice: Finding Answers in the Data

Let's try to answer some questions using the data.

1. What is the average age of the passengers?

<details><summary> >> Hint </summary>

```
average_age = df['Age'].mean()
print(f"\nAverage age: {average_age}")
```
    
</details>

In [56]:
df['Age'].mean()
df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


2. How many passengers survived?

<details><summary> >> Hint </summary>

```
number_survived = df['Survived'].sum() # Because survived is 1 or 0 we can just sum it
print(f"\nNumber of passengers survived: {number_survived}")
```

</details>

In [55]:
# Your code here
df['Survived'].value_counts()

Survived
0    549
1    342
Name: count, dtype: int64

3. What is the name of the first passenger?

<details><summary> >> Hint </summary>

```
first_passenger_name = df.loc[0, 'Name']
print(f"\nName of the first passenger: {first_passenger_name}")
```

</details>

In [50]:
# Your code here

4. How many passengers were in each Pclass?

<details><summary> >> Hint </summary>

```
passengers_per_class = df['Pclass'].value_counts()
print(f"\nPassengers per Pclass:\n{passengers_per_class}")
```

</details>

In [51]:
# Your code here

Summary: This exercise has covered the basics of loading data with Pandas, viewing and accessing data, and performing some basic EDA.

<p style="text-align:center;">That's it! Congratulations! <br> 
    Let's now work on your lab assignment.</p>