# Exercise: Data cleaning

Before doing actual data analysis, we usually first need to clean the data. 
This might involve steps such as dealing with missing values and encoding categorical variables as integers.

Load the Titanic data set in `titanic.csv` and perform the following tasks:

1. Report the number of observations with missing `Age`, for example using [`isna()`](https://pandas.pydata.org/docs/reference/api/pandas.isna.html).
2. Compute the average age in the data set. Use the following approaches and compare your results:
    1.  Use the [`mean()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html) method.
    2.  Convert the `Age` column to a NumPy array using [`to_numpy()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html). Experiment with NumPy's [`np.mean()`](https://numpy.org/doc/2.0/reference/generated/numpy.mean.html) and [`np.nanmean()`](https://numpy.org/doc/2.0/reference/generated/numpy.nanmean.html) to see if you obtain the same results.
3. Replace all missing ages with the mean age you computed above, rounded to the nearest integer.
   Note that in "real" applications, replacing missing values with sample means is usually not a good idea.
4. Convert this updated `Age` column to integer type using [`astype()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html).
5. Generate a new column `Female` which takes on the value one if `Sex` is equal to `"female"` and zero otherwise. 
   This is called an _indicator_ or _dummy_ variable, and is preferable to storing such categorical data as strings.
   Delete the original column `Sex`.
6. Save your cleaned data set as `titanic-clean.csv` using [`to_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html) with `,` as the field separator.
   Tell `to_csv()` to *not* write the `DataFrame` index to the CSV file as it's not needed in this example.

In [2]:
import pandas as pd

DATA_PATH = '../data'

#path to CSV file
file = f'{DATA_PATH}/titanic.csv'

#load sample data set
df = pd.read_csv(file, sep = ',')

In [9]:
# report the number of observations with missing age (using isna())
# pandas lets you chain methods: isna() returns True/False values and sum() counts each True as 1 and each False as 0
df['Age'].isna().sum()

177

In [20]:
# compute the average age in the data set. use the following approaches and compare your results
# use the mean() method
import numpy as np
df['Age'].mean()


29.69911764705882

In [23]:
# convert the age column to a numpy array using to_numpy(). experiment with numpy's np.mean() and np.nanmean() to see if you obtain the same result
array = df['Age'].to_numpy()
np.mean(array) # this gives nan because any calculation that involves nan returns nan

nan

In [25]:
np.nanmean(array) #therefore we use np.nanmean() instead

29.69911764705882
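
Tasks 3 to 6 are not worked out in the cells above. The following is a minimal sketch of one way to approach them, not a definitive solution. To keep it self-contained it uses a small made-up `DataFrame` instead of `titanic.csv`, so the sample values are assumptions. It also returns the CSV text as a string rather than writing a file; passing a path such as `'titanic-clean.csv'` as the first argument to `to_csv()` would write to disk instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data (values are made up)
df = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, np.nan, np.nan],
})

# Task 3: replace missing ages with the mean age, rounded to an integer.
# Series.mean() skips missing values by default.
mean_age = int(round(df['Age'].mean()))
df['Age'] = df['Age'].fillna(mean_age)

# Task 4: with no missing values left, the column can be converted to integers
df['Age'] = df['Age'].astype(int)

# Task 5: indicator variable, 1 if female and 0 otherwise
df['Female'] = (df['Sex'] == 'female').astype(int)
df = df.drop(columns='Sex')

# Task 6: create CSV output without the index.
# With no path given, to_csv() returns the CSV content as a string.
csv_text = df.to_csv(sep=',', index=False)
print(csv_text)
```

Note that `astype(int)` raises an error if the column still contains missing values, which is why the replacement in task 3 has to come first.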

# Exercise: Working with strings

Much of the data we deal with contains strings, i.e., text data (names, addresses, etc.). Often, such data is not in the format needed for analysis, and we have to perform additional string manipulation to extract the exact data we need. This can be achieved using the pandas [string methods](https://pandas.pydata.org/docs/user_guide/text.html#string-methods).

To illustrate, we use the Titanic data set for this exercise.

1.  Load the Titanic data and restrict the sample to men. (This simplifies the task. Women in this data set have much more complicated names as they contain both their husband's and their maiden name.)
2.  Print the first five observations of the `Name` column. As you can see, the data is stored in the format _"Last name, Title First name"_ where title is something like Mr., Rev., etc.
3. Split the `Name` column by `,` to extract the last name and the remainder as separate columns. You can achieve this using the [`partition()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.partition.html#pandas.Series.str.partition) string method.
4. Split the remainder (containing the title and first name) using the space character `" "` as separator to obtain individual columns for the title and the first name.
5. Store the three data series in the original `DataFrame` (using the column names `FirstName`, `LastName` and `Title`) and delete the `Name` column which is no longer needed.
6. Finally, extract the ship deck from the values in `Cabin`. The ship deck is the first character in the string stored in `Cabin` (A, B, C, ...). You can extract the first character using the [`get()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.get.html#pandas.Series.str.get) string method. Store the result in the column `Deck`.

*Hint*: Pandas's string methods can be accessed using the `.str` attribute. For example, to partition values in the column `Name`, you need to use
```python
df['Name'].str.partition()
```
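
The steps of this exercise could be sketched as follows. This is one possible approach under stated assumptions, not the reference solution: the sample rows are a small made-up stand-in for `titanic.csv`, and the intermediate variable names (`parts`, `remainder`, etc.) are chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample rows in the "Last name, Title First name" format
df = pd.DataFrame({
    'Name': [
        'Braund, Mr. Owen Harris',
        'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
        'Allen, Mr. William Henry',
        'Byles, Rev. Thomas Roussel Davids',
    ],
    'Sex': ['male', 'female', 'male', 'male'],
    'Cabin': [None, 'C85', 'E46', None],
})

# Task 1: restrict the sample to men (copy to avoid modifying a view)
df = df.loc[df['Sex'] == 'male'].copy()

# Task 2: print the first five names
print(df['Name'].head(5))

# Task 3: partition() splits at the first separator and returns a
# DataFrame with columns 0 (before), 1 (separator) and 2 (after)
parts = df['Name'].str.partition(',')
last_name = parts[0].str.strip()
remainder = parts[2].str.strip()

# Task 4: split the remainder at the first space into title and first name
parts = remainder.str.partition(' ')
title = parts[0]
first_name = parts[2].str.strip()

# Task 5: store the new columns and drop the original Name column
df['FirstName'] = first_name
df['LastName'] = last_name
df['Title'] = title
df = df.drop(columns='Name')

# Task 6: the deck is the first character of Cabin (missing cabins stay missing)
df['Deck'] = df['Cabin'].str.get(0)
print(df)
```

Calling `str.strip()` after each partition removes the leading space that remains after the comma, which would otherwise end up in the title and first name.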
