AIM #1: Loading the dataset and printing basic information 
1. Import the Titanic dataset using pandas
2. Create a DataFrame from the dataset
3. Print the first 10 rows of the dataset
4. Print the last 20 rows of the dataset
5. Print dataset's information
6. Describe the dataset
7. Make sure all the information returned by the different functions is displayed in a single table and not across multiple lines

In [None]:
import pandas as pd

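# Load the dataset into a DataFrame
df = pd.read_csv('titanic.csv')

# Print the first 10 rows
print("First 10 rows:")
print(df.head(10))

# Print the last 20 rows
print("\nLast 20 rows:")
print(df.tail(20))

# Print the dataset's information
# (df.info() prints directly and returns None, so it is not wrapped in print())
print("\nDataset info:")
df.info()

# Describe the dataset
print("\nDataset description:")
print(df.describe())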
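
For item 7, pandas by default wraps or truncates wide frames when printing. The sketch below uses a synthetic 30-column frame (`wide` is a stand-in, not the Titanic data) to show display options that should keep the output in one table. Treat it as one possible setup rather than the required one.

```python
import pandas as pd

# Hypothetical wide frame used only to demonstrate the display options.
wide = pd.DataFrame({f"col_{i}": range(3) for i in range(30)})

# Temporarily show every column and disable wrapping across several blocks.
with pd.option_context(
    "display.max_columns", None,
    "display.expand_frame_repr", False,
):
    rendered = repr(wide)

print(rendered)
```

To apply the same settings to the whole notebook, `pd.set_option` accepts the same option names.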


AIM #2: Finding issues (empty, NAs, incorrect value, incorrect format, outliers, etc.) 
1. Find out how many missing values there are in the dataset
2. For the 'Age' column, find the best way to handle the missing values
    2.1. Use an appropriate plot to study the nature of the 'Age' column
    2.2. Figure out what is the best way to calculate the central tendency of the 'Age' column based on the above plot
    2.3. Using the most suitable central tendency measure, fill the missing values in the age column
3. Decide what is the best way to handle the missing values in the 'Cabin' column
4. Similarly, decide what is the best way to handle the missing values in the 'Embarked' column
5. Handle the incorrect data under the 'Survived' column using an appropriate measure
6. Handle the incorrectly formatted data under the 'Fare' column
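
For items 2.2 and 2.3, one way to back up what the plot suggests is to compare the mean, median, and skewness numerically. The sketch below uses a small made-up series (`ages` is hypothetical, not the Titanic data), and the 0.5 skewness threshold is only a rule of thumb assumed for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed ages with two missing values.
ages = pd.Series(
    [2, 18, 19, 21, 22, 22, 24, 25, 27, 30, 35, 71, np.nan, np.nan],
    name="Age",
)

skewness = ages.skew()      # positive value indicates a long right tail
mean_age = ages.mean()      # pulled upward by the 71-year-old
median_age = ages.median()  # robust to that extreme value

# Assumed rule of thumb: prefer the median when the skew is noticeable.
fill_value = median_age if abs(skewness) > 0.5 else mean_age
filled = ages.fillna(fill_value)

print(f"skew={skewness:.2f}, mean={mean_age:.2f}, median={median_age:.2f}")
```

When the distribution is roughly symmetric the two measures nearly coincide, so the choice matters mainly for skewed columns.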


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Count the missing values in each column
print("\nMissing values per column:")
print(df.isnull().sum())

# Study the 'Age' column: plot its distribution to choose a central tendency measure
plt.figure(figsize=(10, 6))
sns.histplot(df['Age'].dropna(), kde=True)
plt.title('Age Distribution')
plt.show()

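# Fill the missing 'Age' values with the median
# (assign the result back; chained fillna(..., inplace=True) is unreliable in recent pandas)
age_median = df['Age'].median()
df['Age'] = df['Age'].fillna(age_median)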

# Fill the missing 'Cabin' values with a placeholder label
df['Cabin'] = df['Cabin'].fillna('Unknown')

# Fill the missing 'Embarked' values with the most frequent value (mode)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Normalize the 'Survived' column to 0/1
# (any value not listed below, including missing values, becomes 0)
df['Survived'] = df['Survived'].apply(lambda x: 1 if x in [1, '1', 'yes', 'Yes'] else 0)

# Fix incorrectly formatted values in the 'Fare' column
# (pd.to_numeric converts numeric strings to floats; unparsable entries become NaN)
df['Fare'] = pd.to_numeric(df['Fare'], errors='coerce')
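
The cleaning steps for 'Survived' and 'Fare' can be made more explicit. The sketch below runs on made-up messy values (`raw`, `survived_map`, and `to_flag` are hypothetical names, and the real file may contain different spellings). Unrecognized 'Survived' entries become missing instead of 0, and stray characters such as currency symbols are stripped from 'Fare' before conversion.

```python
import pandas as pd

# Hypothetical messy values; the actual dataset may differ.
raw = pd.DataFrame({
    "Survived": [1, "1", "yes", "No", 0, None],
    "Fare": ["7.25", "$71.28", 8.05, "abc", None, "13"],
})

# Assumed spellings for survived / did not survive.
survived_map = {"1": 1, "yes": 1, "y": 1, "0": 0, "no": 0, "n": 0}

def to_flag(value):
    """Map a raw entry to 0/1, or to <NA> when it is missing or unrecognized."""
    if pd.isna(value):
        return pd.NA
    return survived_map.get(str(value).strip().lower(), pd.NA)

survived = raw["Survived"].map(to_flag).astype("Int64")

# Keep only digits and the decimal point, then convert; leftovers become NaN.
fare_text = raw["Fare"].astype(str).str.replace(r"[^0-9.]", "", regex=True)
fare = pd.to_numeric(fare_text, errors="coerce")

print(survived.tolist())
print(fare.tolist())
```

Keeping unrecognized values as missing makes them visible in `isnull().sum()`, so they can be handled deliberately rather than counted as non-survivors.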


AIM #3: Grouping 
1. Find out the average fare grouped by Pclass
    1.1. Plot the above using a suitable plot
2. Find out the average fare grouped by Sex
    2.1. Plot the above using a suitable plot

In [None]:
# Compute the average fare grouped by Pclass
avg_fare_pclass = df.groupby('Pclass')['Fare'].mean()
print("\nAverage fare by Pclass:")
print(avg_fare_pclass)

# Plot the average fare by Pclass as a bar chart
plt.figure(figsize=(10, 6))
sns.barplot(x=avg_fare_pclass.index, y=avg_fare_pclass.values)
plt.title('Average Fare by Pclass')
plt.xlabel('Pclass')
plt.ylabel('Average Fare')
plt.show()

# Compute the average fare grouped by Sex
avg_fare_sex = df.groupby('Sex')['Fare'].mean()
print("\nAverage fare by Sex:")
print(avg_fare_sex)

# Plot the average fare by Sex as a bar chart
plt.figure(figsize=(10, 6))
sns.barplot(x=avg_fare_sex.index, y=avg_fare_sex.values)
plt.title('Average Fare by Sex')
plt.xlabel('Sex')
plt.ylabel('Average Fare')
plt.show()
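
Because fares tend to be skewed, it may help to look at more than the mean per group. The sketch below uses a tiny invented sample (`sample` is hypothetical and only reuses the notebook's column names) to show several statistics per class, and a two-way table combining both grouping keys.

```python
import pandas as pd

# Hypothetical mini-sample with the same column names as the notebook.
sample = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3],
    "Sex": ["female", "male", "female", "male", "female", "male", "male"],
    "Fare": [80.0, 60.0, 20.0, 15.0, 8.0, 7.0, 9.0],
})

# Mean, median, and group size per class in a single call.
fare_by_class = sample.groupby("Pclass")["Fare"].agg(["mean", "median", "count"])

# Mean fare for every Pclass/Sex combination as a 2-D table.
fare_by_class_sex = sample.pivot_table(
    values="Fare", index="Pclass", columns="Sex", aggfunc="mean"
)

print(fare_by_class)
print(fare_by_class_sex)
```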


AIM #4: Dataset visualization using pandas

1. Plot the distribution of 'Age' using a suitable plot
2. Plot the distribution of 'Fare' using a suitable plot
3. Plot the distribution of 'Pclass' using a suitable plot
4. Plot the distribution of 'Survived' using a suitable plot
5. Plot the distribution of 'Embarked' using a suitable plot
6. Plot the distribution of 'Fare' grouped by 'Survived'
7. Plot the distribution of 'Fare' grouped by 'Pclass'
8. Plot the distribution of 'Age' grouped by 'Survived'
9. Plot the distribution of 'Age' grouped by 'Pclass'
10. Combine the 'SibSp' and 'Parch' and plot its distribution grouped by 'Survived'
11. Combine the 'SibSp' and 'Parch' and plot its distribution grouped by 'Pclass'
12. Plot a distribution between 'Age' and 'Fare' to see if there's any relationship
13. Are there any other possibilities to show relationships?

In [None]:
# Plot the distribution of 'Age'
plt.figure(figsize=(10, 6))
sns.histplot(df['Age'], kde=True)
plt.title('Age Distribution')
plt.show()

# Plot the distribution of 'Fare'
plt.figure(figsize=(10, 6))
sns.histplot(df['Fare'], kde=True)
plt.title('Fare Distribution')
plt.show()

# Plot the distribution of 'Pclass'
plt.figure(figsize=(10, 6))
sns.countplot(x='Pclass', data=df)
plt.title('Pclass Distribution')
plt.show()

# Plot the distribution of 'Survived'
plt.figure(figsize=(10, 6))
sns.countplot(x='Survived', data=df)
plt.title('Survived Distribution')
plt.show()

# Plot the distribution of 'Embarked'
plt.figure(figsize=(10, 6))
sns.countplot(x='Embarked', data=df)
plt.title('Embarked Distribution')
plt.show()

# Plot the distribution of 'Fare' grouped by 'Survived'
plt.figure(figsize=(10, 6))
sns.boxplot(x='Survived', y='Fare', data=df)
plt.title('Fare Distribution by Survived')
plt.show()

# Plot the distribution of 'Fare' grouped by 'Pclass'
plt.figure(figsize=(10, 6))
sns.boxplot(x='Pclass', y='Fare', data=df)
plt.title('Fare Distribution by Pclass')
plt.show()

# Plot the distribution of 'Age' grouped by 'Survived'
plt.figure(figsize=(10, 6))
sns.boxplot(x='Survived', y='Age', data=df)
plt.title('Age Distribution by Survived')
plt.show()

# Plot the distribution of 'Age' grouped by 'Pclass'
plt.figure(figsize=(10, 6))
sns.boxplot(x='Pclass', y='Age', data=df)
plt.title('Age Distribution by Pclass')
plt.show()

# Combine 'SibSp' and 'Parch' into 'FamilySize' and plot its distribution grouped by 'Survived'
df['FamilySize'] = df['SibSp'] + df['Parch']
plt.figure(figsize=(10, 6))
sns.boxplot(x='Survived', y='FamilySize', data=df)
plt.title('Family Size Distribution by Survived')
plt.show()

# Plot the distribution of 'FamilySize' (SibSp + Parch) grouped by 'Pclass'
plt.figure(figsize=(10, 6))
sns.boxplot(x='Pclass', y='FamilySize', data=df)
plt.title('Family Size Distribution by Pclass')
plt.show()

# Plot the relationship between 'Age' and 'Fare'
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Age', y='Fare', data=df)
plt.title('Age vs Fare')
plt.show()
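
For item 13, scatter plots are not the only option. seaborn's `pairplot` and `heatmap` cover numeric columns, while a cross-tabulation suits two categorical ones. The sketch below shows the cross-tabulation on an invented sample (`sample` is hypothetical, not the Titanic data).

```python
import pandas as pd

# Hypothetical sample with two categorical columns.
sample = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0, 0],
})

# Row-normalized table: share of each outcome within every 'Sex' group.
rates = pd.crosstab(sample["Sex"], sample["Survived"], normalize="index")

print(rates)
# rates.plot(kind="bar", stacked=True) would draw the same table as a chart.
```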


AIM #5: Correlation

1. Generate a correlation matrix for the entire dataset
2. Find correlation between 'Age' and 'Fare'
3. What other possible correlations can be found in the dataset?

In [None]:
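# Generate the correlation matrix
# (restricted to numeric columns; text columns such as 'Name' make df.corr() fail in recent pandas)
correlation_matrix = df.select_dtypes(include='number').corr()
print("\nCorrelation matrix:")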
print(correlation_matrix)

# Find the correlation between 'Age' and 'Fare'
age_fare_correlation = df['Age'].corr(df['Fare'])
print("\nCorrelation between 'Age' and 'Fare':")
print(age_fare_correlation)

# List the other possible correlations, strongest positive first
print("\nOther possible correlations:")
print(correlation_matrix.unstack().sort_values(ascending=False).drop_duplicates())
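
The ranked list above still contains the self-correlations (1.0), and `drop_duplicates` can also discard distinct pairs that happen to share a value. An alternative sketch keeps only the upper triangle of the matrix, so each pair appears exactly once, and ranks by absolute strength. It uses synthetic data (`toy` and its columns are hypothetical).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x = rng.normal(size=100)

# Hypothetical data: 'b' is strongly tied to 'a', 'c' is unrelated noise.
toy = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.1, size=100),
    "c": rng.normal(size=100),
    "label": ["x"] * 100,  # non-numeric column, excluded below
})

corr = toy.select_dtypes(include="number").corr()

# Boolean mask that is True strictly above the diagonal.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# One entry per unordered pair, ordered by absolute correlation.
pairs = corr.where(mask).stack().dropna()
pairs = pairs.reindex(pairs.abs().sort_values(ascending=False).index)

print(pairs)
```

Sorting by absolute value also surfaces strong negative correlations, which a plain descending sort would push to the bottom.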
