#### Mar 2024 ASG1 Submission

In [1]:
### Student ID: 4070163B
### Student Name: Lai Jing Kai

# Grade: << By Grader, after marking >>

We will be using a modified, anonymized dataset from Titanic Passenger Data. This dataset is a partial subset of the full original data, with missing or unusual data already cleaned or adjusted.

Column details:

*passenger_id*: (anonymized, unique id numbers assigned to each passenger) <br>
*pclass*: represents passenger cabin class, (1 - first class, 2 - second class, 3 - third class) <br>
*survived*: (0 - no, 1 - yes) <br>
*gender*: (0 - female, 1 - male) <br>
*age*: (numerical age of passenger) <br>
*sibsp*: (number of siblings and/or spouses the passenger had onboard together with them) <br>
*parch*: (number of parents and/or children the passenger had onboard together with them) <br>
*fare*: (assume in $US based on 1912 pricing) <br>


Q1.
Read the provided data file into Jupyter Notebook using suitable file opening functions, and perform the following tasks: <br> <br>
i) Print a list (python data structure) of all the column header names of the dataset (the column names are in the first line of the data file) <br>
ii) Print the first 5 column header names, followed by the first 5 rows of the data

For ii), your output should clearly display the column names and the required rows of data as per the example below.

Hint: Use Python's open() and readline() to open the provided file, and to read the file's column headers and rows of the data, line by line.

Reminder: use of non-basic python such as csv or pandas libraries will result in 50% penalties.

In [4]:
file_name = "titanic_mod.csv"

# Task i)
# Open the file and read the first line to get the column headers.
# 'utf-8-sig' strips a leading byte-order mark (BOM) if the file has one.
with open(file_name, 'r', encoding='utf-8-sig') as file:
    header = file.readline().strip().split(',')

# Print the list of column header names
print("List of column header names:")
print(header)
print()

# Task ii)
# Open the file again to read the first 5 rows of data
with open(file_name, 'r', encoding='utf-8-sig') as file:
    # Skip the first line (header)
    file.readline()

    # Print the first 5 column header names, padded to a fixed width
    print(''.join(name.ljust(15) for name in header[:5]))

    for _ in range(5):
        row = file.readline().strip().split(',')
        row_output = ''.join(value.ljust(15) for value in row[:5])  # Pad each value to the same width
        print(row_output)  # Print each row of data

List of column header names:
['passenger_id', 'survived', 'pclass', 'gender', 'age', 'sibsp', 'parch', 'fare']

passenger_id   survived       pclass         gender         age            
1              1              1              0              29             
2              1              1              1              1              
3              0              1              0              2              
4              0              1              1              30             
5              0              1              0              25             
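
Both Q1 tasks can also be served by a single pass over the file. The block below is a minimal sketch under stated assumptions, not the submitted solution: `io.StringIO` with three made-up rows stands in for `titanic_mod.csv` so the code runs without the data file, and `read_header_and_rows` is a hypothetical helper name.

```python
import io

# Made-up sample standing in for titanic_mod.csv (assumption: same comma-separated layout)
sample_text = (
    "passenger_id,survived,pclass,gender,age,sibsp,parch,fare\n"
    "1,1,1,0,29,0,0,211.3375\n"
    "2,1,1,1,1,1,2,151.55\n"
    "3,0,1,0,2,1,2,151.55\n"
)

def read_header_and_rows(file, n_rows):
    """Read the header line, then up to n_rows data lines, from an open file object."""
    header_line = file.readline().strip().split(',')
    rows = []
    for _ in range(n_rows):
        line = file.readline()
        if not line:  # readline() returns '' at end of file
            break
        rows.append(line.strip().split(','))
    return header_line, rows

# With the real file this would be: with open("titanic_mod.csv", encoding="utf-8-sig") as f:
with io.StringIO(sample_text) as f:
    header_names, first_rows = read_header_and_rows(f, 5)

print(header_names)
print(''.join(name.ljust(15) for name in header_names[:5]))
for row in first_rows:
    print(''.join(value.ljust(15) for value in row[:5]))
```

The end-of-file check keeps the loop from printing blank rows when the file holds fewer data lines than requested.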


Q2.
Using relevant NumPy functions, load the data file into an array, excluding the first row of column headers.
Display the contents and properties (such as size and shape) of the array. You may need to use NumPy’s set_printoptions() function to achieve the desired display.

Hint: Use numpy's genfromtxt() function to load data from a text or csv file. Read documentation and consider carefully what parameters to use when calling genfromtxt().

In [5]:
import numpy as np

# Open the CSV file; 'utf-8-sig' strips a leading byte-order mark (BOM) if present
with open('titanic_mod.csv', 'r', encoding='utf-8-sig') as file:
    # Use np.genfromtxt to load the CSV data into a NumPy array
    array = np.genfromtxt(file, delimiter=',', dtype=str, skip_header=1)
    
# Set print options to display the entire array
np.set_printoptions(threshold=np.inf)

# Print the first 5 rows of the NumPy array
print(array[0:5])

print("The array type is:", type(array))

# Print the number of rows of the array
print("Number of rows:", len(array))

# Display the shape of the array
print("Shape:", array.shape)

# Display the size of the array
print("Size:", array.size)

# Print the dimensions of the array
print("Dimensions:", array.ndim)


[['1' '1' '1' '0' '29' '0' '0' '211.3375']
 ['2' '1' '1' '1' '1' '1' '2' '151.55']
 ['3' '0' '1' '0' '2' '1' '2' '151.55']
 ['4' '0' '1' '1' '30' '1' '2' '151.55']
 ['5' '0' '1' '0' '25' '1' '2' '151.55']]
The array type is: <class 'numpy.ndarray'>
Number of rows: 1046
Shape: (1046, 8)
Size: 8368
Dimensions: 2
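
Since every column in this dataset is numeric, one alternative is to let `genfromtxt()` produce a float array directly, which avoids repeated `astype()` calls later. This is a sketch under assumptions, not the approach used above: the three rows are made up and `io.StringIO` replaces the real file.

```python
import io
import numpy as np

# Made-up sample standing in for titanic_mod.csv
sample_text = (
    "passenger_id,survived,pclass,gender,age,sibsp,parch,fare\n"
    "1,1,1,0,29,0,0,211.3375\n"
    "2,1,1,1,1,1,2,151.55\n"
    "3,0,1,0,2,1,2,151.55\n"
)

# skip_header=1 drops the column-name line; the default dtype is float
numeric = np.genfromtxt(io.StringIO(sample_text), delimiter=',', skip_header=1)

# suppress=True prints floats in fixed-point rather than scientific notation
np.set_printoptions(suppress=True)
print(numeric)
print("Shape:", numeric.shape)
print("Size:", numeric.size)
print("Dtype:", numeric.dtype)
```

The trade-off is that integer-like columns such as `age` are stored as floats, so they need `astype(int)` wherever an integer display is wanted.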


Q3.
Write a user-defined function that has two parameters: a column index number and a passenger age number. It will count the occurrences of the passenger age number in the column index of the Numpy array and return the total occurrences.

Use the above user-defined function to answer the following question:

In the dataset, print the 3 most frequent ages of the passengers. Include the proportion as a % out of entire passenger manifest, to 3 decimal places, for each age.

In [6]:
# Function to find the top 3 most common ages and their occurrences
def find_top3_common_ages():
    # Select the age column and convert it to int
    ages = array[:, 4].astype(int)
    
    # Select the passenger_id column; its length is the total number of passengers
    indices = array[:, 0]
    
    # Find unique ages and their counts
    unique_ages, counts = np.unique(ages, return_counts=True)
    
    # Sort the counts in descending order and get the indices of top 3 counts
    sorted_indices = np.argsort(counts)[::-1]
    
    # Get the top 3 most common age indices
    top3_indices = sorted_indices[:3]
    
    # Get the top 3 most common ages and their occurrences
    top3_ages = unique_ages[top3_indices]
    top3_occurrences = counts[top3_indices]
    
    # Calculate the percentage of passengers for each age
    percentages = top3_occurrences / len(indices) * 100
    
    return top3_ages, top3_occurrences, percentages

# Call the function and unpack the ages, their counts, and their percentages
top3_ages, top3_occurrences, percentages = find_top3_common_ages()

# Print one line per age, with the percentage to 3 decimal places
for age, percentage in zip(top3_ages, percentages):
    print(f"Passengers aged {age} accounted for {percentage:.3f}% of the passenger population.")


Passengers aged 24 accounted for 4.589% of the passenger population.
Passengers aged 30 accounted for 4.111% of the passenger population.
Passengers aged 22 accounted for 4.111% of the passenger population.
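
Q3 asks for a helper that takes a column index and an age and returns the number of occurrences. A minimal sketch of that two-parameter form follows, assuming a small made-up array named `sample` in place of the notebook's `array`; `count_age` is a hypothetical name.

```python
import numpy as np

# Made-up string array in the same column layout as the notebook's `array`
sample = np.array([
    ['1', '1', '1', '0', '29', '0', '0', '211.3375'],
    ['2', '1', '1', '1', '24', '1', '2', '151.55'],
    ['3', '0', '1', '0', '24', '1', '2', '151.55'],
    ['4', '0', '3', '1', '30', '0', '0', '7.25'],
    ['5', '0', '3', '1', '30', '0', '0', '8.05'],
    ['6', '1', '2', '0', '24', '0', '0', '13.0'],
])

def count_age(column_index, age):
    """Count how many rows of `sample` hold `age` in column `column_index`."""
    column = sample[:, column_index].astype(int)
    return int(np.sum(column == age))

# Count every distinct age with the helper, then rank by count (highest first)
distinct_ages = sorted(set(sample[:, 4].astype(int)))
age_counts = {int(age): count_age(4, age) for age in distinct_ages}
top3 = sorted(age_counts.items(), key=lambda item: item[1], reverse=True)[:3]

for age, count in top3:
    share = count / len(sample) * 100
    print(f"Passengers aged {age} accounted for {share:.3f}% of the passenger population.")
```

Calling the helper once per distinct age is slower than a single `np.unique(..., return_counts=True)` call, but it matches the function signature the question describes.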


Q4.
It is often important to explore the data to gain preliminary insights, before proceeding to predictive models or deciding on a problem statement to investigate.

Please print out the following values amongst passengers (when appropriate, to 2 decimal places): <br>

    i)   Highest value of number of siblings and/or spouses onboard
    ii)  Mean value of parents and/or children onboard
    iii) 50th-percentile value of fare paid
    iv)  Cheapest non-zero fare paid



In [7]:
# i
# Convert to int
sibsp_column = array[:, 5].astype(int)

# Find the highest value
highest_sibsp_value = np.max(sibsp_column)

print("Highest value of number of siblings and/or spouses onboard:", highest_sibsp_value)


Highest value of number of siblings and/or spouses onboard: 8


In [8]:
# ii
# Convert to int
parch_column = array[:, 6].astype(int)

# Find the mean value
mean_parch_val = np.mean(parch_column)

print(f"Mean value of parents and/or children onboard: {mean_parch_val:.2f}")


Mean value of parents and/or children onboard: 0.42


In [9]:
# iii
# Convert to float
fare_column = array[:, 7].astype(float)

# Find the median (50th-percentile) value
median_fare_val = np.median(fare_column)

print(f"50th-percentile value of fare paid: {median_fare_val:.2f}")


50th-percentile value of fare paid: 15.75


In [10]:
# iv
# Convert the fare column to float and keep only the non-zero fares
fares = array[:, 7].astype(float)
non_zero_fares = fares[fares > 0]

# Find the minimum fare
cheapest_non_zero_fare = np.min(non_zero_fares)

# Display the minimum fare with two decimal places
print(f"Cheapest non-zero fare paid: {cheapest_non_zero_fare:.2f}")


Cheapest non-zero fare paid: 3.17
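
The four Q4 statistics map onto four NumPy calls. The sketch below uses made-up values rather than the dataset, and the variable names are illustrative only; it also shows that `np.percentile(..., 50)` gives the same result as `np.median()`.

```python
import numpy as np

# Made-up values, not taken from the dataset
sibsp = np.array([0, 1, 3, 0])
parch = np.array([0, 2, 1, 0])
fare = np.array([0.0, 7.25, 151.5, 8.75])

highest_sibsp = np.max(sibsp)
mean_parch = np.mean(parch)
median_fare = np.percentile(fare, 50)    # the 50th percentile is the median
cheapest_paid = np.min(fare[fare > 0])   # the mask drops zero fares before taking the minimum

print("Highest sibsp:", highest_sibsp)
print(f"Mean parch: {mean_parch:.2f}")
print(f"Median fare: {median_fare:.2f}")
print(f"Cheapest non-zero fare: {cheapest_paid:.2f}")
```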


Q5.
An example of a more advanced investigation, requiring comparison across aggregated values of different features/columns, could be, a researcher wanting to measure the difference in fares between males that survived and males that did not:

Print out the difference between mean amount of fare paid by males that survived, and mean amount of fare paid by males that did not, appropriately formatted.

In [11]:
# Build a boolean mask from the two conditions (survived and gender), use it to select
# the matching rows of the last column (fare), then convert to float and take the mean.

mean_fare_svm = array[(array[:, 1] == '1') & (array[:, 3] == '1'), -1].astype(float).mean()

mean_fare_deadm = array[(array[:, 1] == '0') & (array[:, 3] == '1'), -1].astype(float).mean()

diff = mean_fare_svm - mean_fare_deadm

print(f"{diff:.2f} is the difference between the mean amount of fare paid by males that survived, and mean amount of fare paid by males that did not.")

14.37 is the difference between the mean amount of fare paid by males that survived, and mean amount of fare paid by males that did not.
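
The masking step is easier to follow when each condition gets its own name. A minimal sketch on a made-up three-column numeric array (survived, gender, fare), assuming the same 0/1 coding as the dataset:

```python
import numpy as np

# Made-up rows: columns are survived, gender (1 = male), fare
toy = np.array([
    [1, 1, 30.0],
    [1, 1, 50.0],
    [0, 1, 10.0],
    [0, 1, 20.0],
    [1, 0, 80.0],   # female survivor, excluded by both masks
])

survived = toy[:, 0] == 1
male = toy[:, 1] == 1

# Combine the masks with & and select the fare column in the same indexing step
mean_fare_survived_males = toy[survived & male, 2].mean()
mean_fare_deceased_males = toy[~survived & male, 2].mean()
fare_difference = mean_fare_survived_males - mean_fare_deceased_males

print(f"Difference in mean fare: {fare_difference:.2f}")
```

Here `~survived` negates the mask, so the second selection picks males who did not survive without repeating the comparison.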


Q6.
A research think-tank has tasked you with automating some of the common queries that their members make about the Titanic dataset.

Write a simple Python program for the user to query the data based on his/her given inputs. When a user enters an option from 0 to 3, the program will process the option accordingly. After the option has been processed, the program will display the main menu again and the process is repeated until the user chooses to exit.
The options are explained in Questions 7 to 9.
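
One way to structure such a menu loop is a dispatch table that maps each option number to a handler. The sketch below is illustrative only: `run_menu`, the placeholder handlers, and the injectable `read`/`write` parameters are assumptions made so the loop can run without keyboard input.

```python
def run_menu(handlers, read=input, write=print):
    """Show the menu, dispatch the chosen option, and repeat until option 0."""
    while True:
        write("Main Menu")
        for number, (label, _) in sorted(handlers.items()):
            write(f"{number}. {label}")
        write("0. Exit")
        choice = read("Enter your option (0 to exit): ").strip()
        try:
            option = int(choice)
        except ValueError:
            write("Invalid input. Please enter a valid number.")
            continue
        if option == 0:
            write("Exiting program.")
            break
        elif option in handlers:
            handlers[option][1]()   # call the handler registered for this option
        else:
            write("Invalid option. Please enter a number from 0 to 3.")

# Placeholder handlers; real ones would implement Q7 to Q9
calls = []
demo_handlers = {
    1: ("Compute Correlation", lambda: calls.append(1)),
    2: ("Oldest Survivors", lambda: calls.append(2)),
    3: ("Female Survivors", lambda: calls.append(3)),
}

# Scripted answers stand in for keyboard input so the sketch runs unattended
scripted = iter(["2", "x", "0"])
output_lines = []
run_menu(demo_handlers, read=lambda prompt: next(scripted), write=output_lines.append)
print("\n".join(output_lines))
```

Because the menu is written at the top of the loop body, it reappears after every processed option, including invalid ones.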


Q7.
Correlation between quantities may indicate some underlying relationship or likely pattern of behaviour.

For the Compute Correlation option, display a numbered list of all the column header names and prompt the user to input the numbers representing the two quantities for the computation of correlation. The computed correlation should be rounded off to 3 decimal places.
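
The correlation step can be sketched with `np.corrcoef()`, which returns a 2x2 matrix whose off-diagonal entry is the Pearson correlation of the two inputs. The data and the two-column layout below are made up, and `correlation` is a hypothetical helper name.

```python
import numpy as np

column_names = ["age", "fare"]   # hypothetical two-column subset
toy = np.array([
    [20.0, 10.0],
    [30.0, 20.0],
    [40.0, 30.0],
    [50.0, 40.0],
])

def correlation(data, first, second):
    """Pearson correlation between two columns, rounded to 3 decimal places."""
    matrix = np.corrcoef(data[:, first], data[:, second])
    return round(float(matrix[0, 1]), 3)

# Numbered list of column names, as the menu option would display it
for number, name in enumerate(column_names):
    print(number, name)

print(f"Correlation: {correlation(toy, 0, 1):.3f}")
```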


Q8.
In the absence of actual lifeboat data, survivor age can be used to gauge if certain demographics were allowed on the lifeboats first.

Prompt the user to enter the passenger class number, before displaying the corresponding rows of the 20 oldest survivors for that passenger class, in order from oldest to youngest.
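
The filter-then-sort logic can be sketched as follows, assuming a made-up numeric array with only four columns (passenger_id, survived, pclass, age); `oldest_survivors` is a hypothetical helper name.

```python
import numpy as np

# Made-up rows: passenger_id, survived, pclass, age
toy = np.array([
    [1, 1, 1, 29],
    [2, 1, 1, 63],
    [3, 0, 1, 70],   # did not survive, so excluded
    [4, 1, 2, 45],   # different class, so excluded for pclass 1
    [5, 1, 1, 48],
])

def oldest_survivors(data, pclass, limit=20):
    """Rows of survivors in `pclass`, oldest first, at most `limit` rows."""
    mask = (data[:, 1] == 1) & (data[:, 2] == pclass)
    selected = data[mask]
    order = np.argsort(-selected[:, 3], kind="stable")  # negate for descending age
    return selected[order][:limit]

result = oldest_survivors(toy, pclass=1)
print(result)
```

Negating the age column before `np.argsort()` gives a descending order, and the stable sort keeps passengers of equal age in their original row order.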


Q9.
It was reported that while generally women were allowed onto lifeboats first, researchers are also keen to identify female survivors with larger numbers of family members onboard (not including themselves).

Write a simple lambda function to calculate a new Numpy array column containing each passenger's non-self family members onboard, by adding the count of sibling and/or spouses, to the count of parents and/or children, for each passenger.

Append this column to the existing Numpy 2-D array of values (you may need to use numpy.reshape() before appending) and display the top 20 rows of female survivors, ordered by highest to lowest by non-self family member count primarily, and in case of a tie, by highest to lowest fare secondarily.
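
The lambda, reshape, append, and two-key sort can be sketched on a small made-up numeric array. The six-column layout below (passenger_id, survived, gender, sibsp, parch, fare) is an assumption for illustration and differs from the dataset's column order.

```python
import numpy as np

# Made-up rows: passenger_id, survived, gender (0 = female), sibsp, parch, fare
toy = np.array([
    [1, 1, 0, 1, 2, 50.0],
    [2, 1, 0, 0, 0, 80.0],
    [3, 1, 0, 2, 1, 90.0],   # ties with passenger 1 on family size, higher fare
    [4, 1, 1, 4, 2, 30.0],   # male, excluded
    [5, 0, 0, 3, 3, 20.0],   # did not survive, excluded
])

# Lambda adding the sibsp and parch columns element-wise
family_size = lambda sibsp, parch: sibsp + parch

# reshape(-1, 1) turns the 1-D result into a column so it can be appended with axis=1
family_column = family_size(toy[:, 3], toy[:, 4]).reshape(-1, 1)
extended = np.append(toy, family_column, axis=1)

female_survivors = extended[(extended[:, 1] == 1) & (extended[:, 2] == 0)]

# np.lexsort treats the LAST key as primary: family size first, fare as tie-breaker
order = np.lexsort((-female_survivors[:, 5], -female_survivors[:, 6]))
ranked = female_survivors[order][:20]
print(ranked)
```

Both keys are negated because `np.lexsort()` sorts in ascending order, so negation yields highest-to-lowest for family size and for fare.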



In [12]:
import numpy as np

def user():
    while True:
        # Show the main menu before every prompt so it reappears after each option (Q6)
        print("""Main Menu
1. Compute Correlation
2. Ranked List of 20 Oldest Survivors by Passenger Class Number
3. Ranked List of 20 Female Survivors by Highest Non-self Family Member Onboard Count and then by Highest Fare
0. Exit""")
        print()
        option_first = input("Enter your option (0 to exit): ")
        print()
        
        try:
            option = int(option_first)
            if option == 1:
                print("""List of header names for calculating correlation
0 passenger_id
1 survived
2 pclass
3 gender
4 age
5 sibsp
6 parch
7 fare""")
                print()
                      
                option_cor1 = int(input("Enter the number for the first quantity:"))
                option_cor2 = int(input("Enter the number for the second quantity:"))

                # Map each column number to its header name, used in the output message below
                correlation_dict = {
                    0: "passenger_id",
                    1: "survived",
                    2: "pclass",
                    3: "gender",
                    4: "age",
                    5: "sibsp",
                    6: "parch",
                    7: "fare"
                }
                
                # Convert the string array loaded in Q2 to numeric values
                array_numeric = np.array(array, dtype=float)
                
                corrcoef = np.corrcoef(array_numeric[:, option_cor1], array_numeric[:, option_cor2]) 
                print()
                print(f"The correlation between {correlation_dict[option_cor1]} and {correlation_dict[option_cor2]} is {corrcoef[0, 1]:.3f}")

            elif option == 2:
                print("Enter the passenger class number (1 to 3):")
                print()
                pclass = int(input())

                # Filter array to get survivors of the specified class
                survivors_class = array[(array[:, 2] == str(pclass)) & (array[:, 1] == '1')]
                
                # The array holds strings, so ages are converted to int only when sorting below

                # Round fare to two decimal places (the values are stored back as strings)
                survivors_class[:, 7] = np.round(survivors_class[:, 7].astype(float), 2)

                # Sort survivors by age in descending order
                oldest_survivors = survivors_class[np.argsort(-survivors_class[:, 4].astype(int))][:20]

                # Determine maximum width of each column including the header
                max_widths = [max(len(str(header[i])), max(len(str(value)) for value in column)) for i, column in enumerate(oldest_survivors.T)]
                
                print("List of the 20 Oldest Survivors for Passenger Cabin Class Number", pclass)
                print()

                # Print header with aligned columns
                header_str = ' '.join(header[i].ljust(max_widths[i] + 6) for i in range(len(header)))  
                print(header_str)
                
                # Print rows with aligned columns
                for row in oldest_survivors:
                    row_str = ' '.join(str(row[i]).ljust(max_widths[i] + 6) for i in range(len(row)))  
                    print(row_str)
                                        
            elif option == 3:
                
                # Define a lambda function to calculate non-self family members on-board
                calculate_ns_family_members = lambda row: int(row[5]) + int(row[6])

                # Apply the lambda function to each row of the array and create a new column
                family_members_column = np.apply_along_axis(calculate_ns_family_members, axis=1, arr=array)

                # Reshape the new column to match the shape of the existing array
                family_members_column = np.reshape(family_members_column, (-1, 1))

                # Append the new column to the existing array
                array_with_ns_family_members = np.append(array, family_members_column, axis=1)

                #Filter using boolean mask logic, retaining only female survivors
                female_survivors_array = array_with_ns_family_members[(array_with_ns_family_members[:, 1] == '1') & (array_with_ns_family_members[:, 3] == '0')]

                # Sort by family member size first [-1]. If tied, use fare [7]
                sorted_indices = np.lexsort((-female_survivors_array[:, 7].astype(float), -female_survivors_array[:, -1].astype(int)))
                female_survivors_sorted = female_survivors_array[sorted_indices][:20]

                new_header = header + ['sibsp_parch']
                
                final_array = np.vstack((new_header, female_survivors_sorted))
                
                print("List of 20 Female Survivors by Highest Non-self Family Member Onboard Count, and then by Highest Fare, in descending order")
                print()

                # Print the combined array with left-aligned, fixed-width columns
                print(''.join(name.ljust(15) for name in final_array[0]))  # Print the header row first
                for row in final_array[1:]:
                    row_formatted = [value if i != 7 else f"{float(value):.2f}" for i, value in enumerate(row)]  # Format 'fare' column to two decimal places
                    print(''.join(value.ljust(15) for value in row_formatted))  # Print each row of data

            elif option == 0:
                print("Exiting program.")
                break
                
            else:
                print("Invalid option. Please enter a number from 0 to 3.")

        except ValueError:
            print("Invalid input. Please enter a valid number.")
        except Exception as e:
            print(f"An error occurred: {e}")

user()


Main Menu
1. Compute Correlation
2. Ranked List of 20 Oldest Survivors by Passenger Class Number
3. Ranked List of 20 Female Survivors by Highest Non-self Family Member Onboard Count and then by Highest Fare
0. Exit



Enter your option (0 to exit):  0



Exiting program.


Q10. Reflection<BR>
Share 3 things that you have learned from this assignment (i.e. Question 1 – 9) and/or topics covered from Week 1 to 7.

Discuss 2 things that you found interesting from this assignment and/or topics covered from Week 1 to 7


## End of Notebook