### Dealing with the data

1. Show the dataframe shape.
2. Standardize header names.
3. Which columns are numerical?
4. Which columns are categorical?
5. Check and deal with `NaN` values.
6. Datetime format - Extract the months from the dataset and store them in a separate column. Then filter the data to show only the information for the first quarter, i.e. January, February and March. _Hint_: If data from March does not exist, consider only January and February.
7. BONUS: Put all the previously mentioned data transformations into a function.

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('marketing_customer_analysis.csv')

1. Show the dataframe shape.

In [None]:
df.shape

In [None]:
# did this just out of interest
df.iloc[100]

2. Standardize header names.

In [None]:
print(df.columns.tolist())

In [None]:
df.columns = [col_name.lower().replace(' ', '_') for col_name in df.columns]
df.columns

In [None]:
df = df.rename(columns={'employmentstatus':'employment_status'})
df = df.drop(['unnamed:_0'], axis=1)
df.columns

3. Which columns are numerical?
4. Which columns are categorical?

In [None]:
df.dtypes

In [None]:
display(df)
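
The `dtypes` listing answers questions 3 and 4 by eye. As a minimal sketch, the two groups can also be collected programmatically with `select_dtypes`. The toy dataframe below is a hypothetical stand-in for `df`; the columns `income` and `monthly_premium_auto` and all values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's dataframe (values are made up).
toy = pd.DataFrame({
    'state': ['California', 'Oregon', None],
    'income': [48029, 0, 22139],
    'monthly_premium_auto': [61.0, 64.5, 100.0],
})

# Numerical columns: every integer or float dtype.
numerical_cols = toy.select_dtypes(include='number').columns.tolist()

# Categorical columns: here, simply everything that is not numeric.
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()

print('numerical:  ', numerical_cols)
print('categorical:', categorical_cols)
```

Treating every non-numeric column as categorical is an assumption. A date column stored as text, such as `effective_to_date`, would land in the categorical list until it is converted.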

5. Check and deal with NaN values.

In [None]:
# isnull() is an alias of isna(), so this gives the same counts as the next cell
df.isnull().sum()

In [None]:
df.isna().sum()

In [None]:
df[df['vehicle_class'].isna()]

NaNs in the columns 'state' and 'response' are not important when analysing a client's claim history, so I don't delete the respective rows; I only replace NaN with 'no data' to make the dataframe more readable.

In [None]:
df = df.fillna({'state': 'no data', 'response': 'no data'})
df['state'].unique()

The other columns need values in order to analyse customer behaviour, so I drop the rows with NaNs. It might be sufficient to have any one of vehicle_size, vehicle_type or vehicle_class for a customer in order to judge their claim history and to predict future insurance usage. But I don't know how to code that... :)

In [None]:
df = df.dropna(subset=['months_since_last_claim','number_of_open_complaints','vehicle_class', 'vehicle_size', 'vehicle_type'])
display(df)
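
The idea of keeping a customer when at least one of the three vehicle columns is filled can be sketched with `dropna(..., how='all')`, which drops a row only when every column in `subset` is missing. The toy data below is hypothetical and only mirrors the column names used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows 0-2 have at least one vehicle value, row 3 has none.
toy = pd.DataFrame({
    'vehicle_class': ['Four-Door Car', np.nan, np.nan, np.nan],
    'vehicle_size': ['Medsize', 'Small', np.nan, np.nan],
    'vehicle_type': [np.nan, np.nan, 'A', np.nan],
})

vehicle_cols = ['vehicle_class', 'vehicle_size', 'vehicle_type']

# how='all' drops a row only if ALL columns listed in subset are NaN.
kept = toy.dropna(subset=vehicle_cols, how='all')
print(kept)
```

The default `how='any'`, used in the cell above, is stricter: it drops a row as soon as one of the listed columns is missing.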

6. Datetime format - Extract the months from the dataset and store them in a separate column. Then filter the data to show only the information for the first quarter, i.e. January, February and March. Hint: If data from March does not exist, consider only January and February.

In [None]:
# did this out of interest, to see how to get the current date and time
import datetime

now = datetime.datetime.now()

print(now)

In [None]:
# same as above, did this just out of interest
from datetime import date

today = date.today()
today

In [None]:
df['month'] = pd.to_datetime(df['effective_to_date']).dt.month_name()
print(df)

In [None]:
df['month'].value_counts()

In [None]:
# The dataframe is already filtered since there's only data from January & February.
display(df)
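
If the data did contain later months, an explicit filter would be needed. A minimal sketch, assuming the date column parses cleanly with `pd.to_datetime`. The toy dates are hypothetical and written in ISO format, which may differ from the format in the CSV.

```python
import pandas as pd

# Hypothetical toy data with one date outside the first quarter.
toy = pd.DataFrame({'effective_to_date': ['2011-01-15', '2011-02-03', '2011-04-20']})

dates = pd.to_datetime(toy['effective_to_date'])
toy['month'] = dates.dt.month_name()

# Keep only January, February and March (month numbers 1 to 3).
first_quarter = toy[dates.dt.month.isin([1, 2, 3])]
print(first_quarter)
```

Filtering on the month number avoids depending on how the month names are spelled.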

7. BONUS: Put all the previously mentioned data transformations into a function.

In [None]:
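def preprocess_df(df):
    """Return a cleaned copy of the raw marketing dataframe."""
    # Work on a copy so the caller's dataframe is not modified in place.
    df = df.copy()
    # Standardize header names.
    df.columns = [col_name.lower().replace(' ', '_') for col_name in df.columns]
    df = df.rename(columns={'employmentstatus': 'employment_status'})
    df = df.drop(['unnamed:_0'], axis=1)
    # Keep rows with a missing 'state' or 'response', but label the gaps.
    df = df.fillna({'state': 'no data', 'response': 'no data'})
    # Drop rows that lack values needed for the analysis.
    df = df.dropna(subset=['months_since_last_claim', 'number_of_open_complaints', 'vehicle_class', 'vehicle_size', 'vehicle_type'])
    # Store the month name of the effective date in a separate column.
    df['month'] = pd.to_datetime(df['effective_to_date']).dt.month_name()
    return df

# Run on the raw file: the df from the cells above is already transformed,
# and dropping 'unnamed:_0' a second time would raise a KeyError.
df = preprocess_df(pd.read_csv('marketing_customer_analysis.csv'))
df.head()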