AIM #1: Loading the dataset and printing basic information 
1. Import the Titanic dataset using pandas
2. Create a Dataframe from the dataset
3. Print the first 10 rows of the dataset
4. Print the last 20 rows of the dataset
5. Print dataset's information
6. Describe the dataset
7. Make sure all the information returned by the different functions is displayed in a single table and not on multiple lines

In [1]:
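import pandas as pd
import seaborn as sns

# Show every column and keep each row on one line so wide tables are not wrapped
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

# Load the Titanic dataset; seaborn returns it as a pandas DataFrame
df = sns.load_dataset('titanic')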

# Print the first 10 rows of the dataset
print("First 10 rows of the dataset:")
print(df.head(10))

# Print the last 20 rows of the dataset
print("\nLast 20 rows of the dataset:")
print(df.tail(20))

# Print dataset's information
print("\nDataset's information:")
df.info()  # info() prints its summary directly and returns None

# Describe the dataset
print("\nDataset description:")
print(df.describe())


AIM #2: Finding issues (empty, NAs, incorrect value, incorrect format, outliers, etc.) 
1. Find out how many missing values there are in the dataset
2. For the 'Age' column, find the best way to handle the missing values
    2.1. Use an appropriate plot to study the nature of the 'Age' column
    2.2. Figure out what is the best way to calculate the central tendency of the 'Age' column based on the above plot
    2.3. Using the most suitable central tendency measure, fill the missing values in the age column
3. Decide what is the best way to handle the missing values in the 'Cabin' column
4. Similarly, decide what is the best way to handle the missing values in the 'Embarked' column
5. Handle the incorrect data under the 'Survived' column using an appropriate measure
6. Handle the incorrectly formatted data under the 'Fare' column


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Titanic dataset
df = sns.load_dataset('titanic')

# Count missing values in each column
missing_values = df.isnull().sum()
print("Missing values in each column:\n", missing_values)

# Plotting the age distribution to understand its nature
plt.figure(figsize=(10, 6))
sns.histplot(df['age'].dropna(), kde=True)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

# Calculate mean and median age
mean_age = df['age'].mean()
median_age = df['age'].median()
print(f"Mean Age: {mean_age}")
print(f"Median Age: {median_age}")

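# Fill missing values in the 'Age' column with the median age
# (assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas)
df['age'] = df['age'].fillna(median_age)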

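# seaborn's copy of the dataset has no 'cabin' column; the cabin letter is stored in 'deck'.
# Cast to object first so 'Unknown' can be filled in even if 'deck' is categorical.
df['deck'] = df['deck'].astype(object).fillna('Unknown')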

# Fill missing values in the 'Embarked' column with the mode
mode_embarked = df['embarked'].mode()[0]
df['embarked'] = df['embarked'].fillna(mode_embarked)

# Convert non-numeric to numeric with errors becoming NaN, then fill NaN with mode
df['survived'] = pd.to_numeric(df['survived'], errors='coerce')
df['survived'] = df['survived'].fillna(df['survived'].mode()[0])

# Convert 'Fare' to numeric, coercing errors to NaN
df['fare'] = pd.to_numeric(df['fare'], errors='coerce')

# Fill missing values in the 'Fare' column with the median fare
median_fare = df['fare'].median()
df['fare'] = df['fare'].fillna(median_fare)

# Detecting outliers in the 'Age' column
Q1 = df['age'].quantile(0.25)
Q3 = df['age'].quantile(0.75)
IQR = Q3 - Q1
outliers = ((df['age'] < (Q1 - 1.5 * IQR)) | (df['age'] > (Q3 + 1.5 * IQR)))
print("Outliers in the 'Age' column:\n", df[outliers])
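
The choice between mean and median in task 2.2 can also be checked numerically rather than only by eye. The following is a minimal sketch, assuming made-up ages (not taken from the Titanic data) and an assumed rule of thumb that an absolute skewness above 0.5 favours the median; the names `ages`, `fill_value`, and the 0.5 threshold are illustrative choices, not part of the assignment.

```python
import pandas as pd

# Hypothetical ages for illustration only; None marks a missing value
ages = pd.Series([2, 18, 22, 24, 25, 27, 30, 31, 35, 71, None], dtype="float64")

skewness = ages.skew()        # positive value means a longer right tail
mean_age = ages.mean()        # pulled upwards by the 71-year-old
median_age = ages.median()    # robust to that single large value

# Assumed rule of thumb: prefer the median when the distribution is clearly skewed
fill_value = median_age if abs(skewness) > 0.5 else mean_age
ages_filled = ages.fillna(fill_value)

print(f"skewness={skewness:.2f}, mean={mean_age}, median={median_age}")
print(ages_filled.tolist())
```

With a right-skewed sample like this one, the mean sits above the median, which is one reason the median is often the safer fill value for a column such as 'Age'.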

AIM #3: Grouping 
1. Find out the average fare grouped by Pclass
    1.1. Plot the above using a suitable plot
2. Find out the average fare grouped by Sex
    2.1. Plot the above using a suitable plot

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Titanic dataset
df = sns.load_dataset('titanic')

# Calculate the average fare by passenger class
avg_fare_pclass = df.groupby('pclass')['fare'].mean()

# Plot the average fare by passenger class using a bar plot
plt.figure(figsize=(8, 6))
sns.barplot(x=avg_fare_pclass.index, y=avg_fare_pclass.values)
plt.title('Average Fare by Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Average Fare')
plt.show()

# Calculate the average fare by sex
avg_fare_sex = df.groupby('sex')['fare'].mean()

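# Plot the average fare by sex using a bar plot, one colour per bar
# (seaborn's `color` takes a single colour, so matplotlib's bar() is used for the two-colour version)
plt.figure(figsize=(8, 6))
plt.bar(avg_fare_sex.index, avg_fare_sex.values, color=['blue', 'green'])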
plt.title('Average Fare by Sex')
plt.xlabel('Sex')
plt.ylabel('Average Fare')
plt.show()
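
A group mean on its own says nothing about how many rows stand behind it or how far outliers pull it. As a small sketch on made-up fares (the `fares` frame below is hypothetical, not the Titanic data), `agg` can return the mean, median, and group size side by side:

```python
import pandas as pd

# Hypothetical fares for illustration only
fares = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3, 3],
    "fare": [80.0, 120.0, 20.0, 30.0, 7.0, 8.0, 9.0],
})

# One row per class, one column per statistic
summary = fares.groupby("pclass")["fare"].agg(["mean", "median", "count"])
print(summary)
```

The same call should work on the real dataset by replacing `fares` with `df`.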

AIM #4: Dataset visualization using pandas

1. Plot the distribution of 'Age' using a suitable plot
2. Plot the distribution of 'Fare' using a suitable plot
3. Plot the distribution of 'Pclass' using a suitable plot
4. Plot the distribution of 'Survived' using a suitable plot
5. Plot the distribution of 'Embarked' using a suitable plot
6. Plot the distribution of 'Fare' grouped by 'Survived'
7. Plot the distribution of 'Fare' grouped by 'Pclass'
8. Plot the distribution of 'Age' grouped by 'Survived'
9. Plot the distribution of 'Age' grouped by 'Pclass'
10. Combine the 'SibSp' and 'Parch' and plot its distribution grouped by 'Survived'
11. Combine the 'SibSp' and 'Parch' and plot its distribution grouped by 'Pclass'
12. Plot a distribution between 'Age' and 'Fare' to see if there's any relationship
13. Are there any other possibilities to show relationships?

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Titanic dataset
df = sns.load_dataset('titanic')

# Plot the distribution of 'Age'
plt.figure(figsize=(10, 6))
sns.histplot(df['age'].dropna(), kde=True)
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

# Plot the distribution of 'Fare'
plt.figure(figsize=(10, 6))
sns.histplot(df['fare'].dropna(), kde=True)
plt.title('Distribution of Fare')
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.show()

# Plot the distribution of 'Pclass'
plt.figure(figsize=(10, 6))
sns.countplot(x='pclass', data=df)
plt.title('Distribution of Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Count')
plt.show()

# Plot the distribution of 'Survived'
plt.figure(figsize=(10, 6))
sns.countplot(x='survived', data=df)
plt.title('Distribution of Survival Status')
plt.xlabel('Survived')
plt.ylabel('Count')
plt.show()

# Plot the distribution of 'Embarked'
plt.figure(figsize=(10, 6))
sns.countplot(x='embarked', data=df, order=['S', 'C', 'Q'])
plt.title('Distribution of Embarked Port')
plt.xlabel('Embarked')
plt.ylabel('Count')
plt.show()

# Plot the distribution of 'Fare' grouped by 'Survived'
plt.figure(figsize=(10, 6))
sns.boxplot(x='survived', y='fare', data=df)
plt.title('Fare Distribution Grouped by Survival Status')
plt.xlabel('Survived')
plt.ylabel('Fare')
plt.show()

# Plot the distribution of 'Fare' grouped by 'Pclass'
plt.figure(figsize=(10, 6))
sns.boxplot(x='pclass', y='fare', data=df)
plt.title('Fare Distribution Grouped by Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Fare')
plt.show()

# Plot the distribution of 'Age' grouped by 'Survived'
plt.figure(figsize=(10, 6))
sns.boxplot(x='survived', y='age', data=df)
plt.title('Age Distribution Grouped by Survival Status')
plt.xlabel('Survived')
plt.ylabel('Age')
plt.show()

# Plot the distribution of 'Age' grouped by 'Pclass'
plt.figure(figsize=(10, 6))
sns.boxplot(x='pclass', y='age', data=df)
plt.title('Age Distribution Grouped by Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Age')
plt.show()

# Combine 'SibSp' and 'Parch' and plot its distribution grouped by 'Survived'
df['family_size'] = df['sibsp'] + df['parch'] + 1
plt.figure(figsize=(10, 6))
sns.boxplot(x='survived', y='family_size', data=df)
plt.title('Family Size Distribution Grouped by Survival Status')
plt.xlabel('Survived')
plt.ylabel('Family Size')
plt.show()

# Combine 'SibSp' and 'Parch' and plot its distribution grouped by 'Pclass'
plt.figure(figsize=(10, 6))
sns.boxplot(x='pclass', y='family_size', data=df)
plt.title('Family Size Distribution Grouped by Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Family Size')
plt.show()

# Plot a distribution between 'Age' and 'Fare' to see if there's any relationship
plt.figure(figsize=(10, 6))
sns.scatterplot(x='age', y='fare', data=df.dropna(subset=['age', 'fare']))
plt.title('Relationship between Age and Fare')
plt.xlabel('Age')
plt.ylabel('Fare')
plt.show()

# Are there any other possibilities to show relationships?
# Yes, we can use pair plots, heatmaps of correlations, or combined plots like facet grids.
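
One way to prepare the input for such a heatmap is a pivot table of rates. The sketch below uses a made-up frame (`toy` is hypothetical, not the Titanic data) and only builds the table; the resulting `rate` frame could then be passed to `sns.heatmap(rate, annot=True)`.

```python
import pandas as pd

# Hypothetical passengers for illustration only
toy = pd.DataFrame({
    "pclass":   [1, 1, 1, 2, 2, 3, 3, 3],
    "sex":      ["female", "male", "male", "female", "male", "female", "male", "male"],
    "survived": [1, 0, 1, 1, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the survival rate of each (class, sex) cell
rate = toy.pivot_table(index="pclass", columns="sex", values="survived", aggfunc="mean")
print(rate)
```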

AIM #5: Correlation

1. Generate a correlation matrix for the entire dataset
2. Find correlation between 'Age' and 'Fare'
3. What other possible correlations can be found in the dataset?

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Titanic dataset
df = sns.load_dataset('titanic')

# Keep only the numeric columns, since corr() needs numeric input
df_numeric = df.select_dtypes(include='number')

# Calculate the correlation matrix
corr_matrix = df_numeric.corr()

# Plot the correlation matrix using a heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of the Dataset')
plt.show()

# Calculate the correlation between 'Age' and 'Fare'
age_fare_corr = df[['age', 'fare']].corr().loc['age', 'fare']
print(f"Correlation between 'Age' and 'Fare': {age_fare_corr:.2f}")

# Calculate additional correlations
pclass_fare_corr = df[['pclass', 'fare']].corr().loc['pclass', 'fare']
print(f"Correlation between 'Pclass' and 'Fare': {pclass_fare_corr:.2f}")

parch_survived_corr = df[['parch', 'survived']].corr().loc['parch', 'survived']
print(f"Correlation between 'Parch' and 'Survived': {parch_survived_corr:.2f}")

sibsp_survived_corr = df[['sibsp', 'survived']].corr().loc['sibsp', 'survived']
print(f"Correlation between 'SibSp' and 'Survived': {sibsp_survived_corr:.2f}")

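# 'embarked' holds text, so map each port to an integer before correlating.
# The codes are arbitrary labels, so read this value as a rough indicator only.
embarked_codes = df['embarked'].map({'S': 0, 'C': 1, 'Q': 2})
embarked_survived_corr = embarked_codes.corr(df['survived'])
print(f"Correlation between 'Embarked' and 'Survived': {embarked_survived_corr:.2f}")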
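
For task 3, text columns such as 'Embarked' or 'Sex' can take part in a correlation once they are one-hot encoded, which avoids imposing an arbitrary order on the categories. This is a minimal sketch on made-up rows (`toy` is hypothetical, not the Titanic data), using `pd.get_dummies` and `DataFrame.corrwith`:

```python
import pandas as pd

# Hypothetical passengers for illustration only
toy = pd.DataFrame({
    "embarked": ["S", "C", "S", "Q", "C", "S"],
    "survived": [0, 1, 0, 0, 1, 1],
})

# One 0/1 indicator column per port
dummies = pd.get_dummies(toy["embarked"], prefix="embarked", dtype=float)

# Pearson correlation of each indicator with survival
port_corr = dummies.corrwith(toy["survived"])
print(port_corr)
```

Each value describes how strongly boarding at that particular port goes together with survival in the sample, so the ports can be compared one by one.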