# Lab 2: Data Preprocessing

In this assignment, we will learn how to explore the raw data and preprocess it. 

The dataset we are going to explore is an insurance dataset. It provides the following features for each user:
* age: age of the user
* sex: gender of the user
* bmi: body mass index, i.e. weight in kilograms divided by the square of height in meters
* children: number of children covered by health insurance / number of dependents
* smoker: smoker or not
* region: the user's residential area in the US: northeast, southeast, southwest, or northwest

The medical cost of each user is also provided:
* charges: the medical cost 

Please follow Lecture 5_data_understanding and Lecture 6_data_preprocessing to complete the following questions.

### Q1. Load the data with Pandas and output the basic information of this dataset, such as the features and their data types. Which features are numerical and which are categorical?


In [None]:
# your code
import pandas as pd # for read_csv
import matplotlib.pyplot as plt

# constants
DATASET_FILENAME = r'insurance.csv'         # filename of dataset
N_HIST_BINS = 100                           # bins per histogram

# value types represented by `dtypes` in pandas
VALUE_TYPES = {
    'numerical': (r'int64', r'float64'),
    'categorical': (r'object',)
}

# load the dataset into dataframe df
df = pd.read_csv(DATASET_FILENAME)

# print the original dataframe
print('\n===data frame===')
print(df)

# print the basic information of the dataset
print('\n===data frame information===')
df.info()

# show the statistics
print('\n===data frame stats===')
print(df.describe())

# print the columns headers and their datatypes by themselves
print('\n===column datatypes===')
print(df.dtypes)

# find which features are numerical and which are categorical

# initialize empty dict for features by value type
features = {}
# loop through value types and the corresponding set of dtypes
for value_type, accepted_dtypes in VALUE_TYPES.items():
    # dtypes must first be converted to strings using the `str` function
    # because dtypes do not have well defined hashes,
    # and do not otherwise play well with Python's `in`
    features[value_type] = df.dtypes.apply(str).isin(accepted_dtypes)
    # print the value type and corresponding columns
    print(f"\n==={value_type} features===")
    print(df.columns[features[value_type]])
# next value_type, dtypes


===data frame===
      age     sex     bmi  children smoker     region      charges
0      19  female  27.900         0    yes  southwest  16884.92400
1      18    male  33.770         1     no  southeast   1725.55230
2      28    male  33.000         3     no  southeast   4449.46200
3      33    male  22.705         0     no  northwest  21984.47061
4      32    male  28.880         0     no  northwest   3866.85520
...   ...     ...     ...       ...    ...        ...          ...
1333   50    male  30.970         3     no  northwest  10600.54830
1334   18  female  31.920         0     no  northeast   2205.98080
1335   18  female  36.850         0     no  southeast   1629.83350
1336   21  female  25.800         0     no  southwest   2007.94500
1337   61  female  29.070         0    yes  northwest  29141.36030

[1338 rows x 7 columns]

===data frame information===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    No

### Q2. Check whether there are missing values in this dataset.

In [None]:
# check for any missing values
any_missing = df.isnull().any()
print('===any missing value per column?===')
print(any_missing)

===any missing value per column?===
age         False
sex         False
bmi         False
children    False
smoker      False
region      False
charges     False
dtype: bool


### Q3. Visualize all numerical features with histograms to see the distribution of each numerical feature.


In [None]:
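# select the numerical features using the boolean mask built in Q1
df_num = df.loc[:, features['numerical']]

# plot one histogram per numerical column
for column in df_num.columns:
    print(f"===histogram of {column}===")
    plt.hist(df_num[column], bins=N_HIST_BINS)
    plt.title(column)
    plt.xlabel(column)
    plt.ylabel('count')
    plt.show()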

age
bmi
children
charges


### Q4. Use the corr() function of Pandas to show the correlation between the numerical features.

In [None]:
# your code
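
One possible starting point for Q4 is sketched below. It is not the reference solution. To stay self-contained it uses the five rows printed in the Q1 output (the `sample` name is an assumption made for this sketch); in the lab, the same calls would be applied to the full `df`.

```python
import pandas as pd

# Five rows copied from the Q1 output, for illustration only;
# in the lab, use the full `df` loaded from insurance.csv instead.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.900, 33.770, 33.000, 22.705, 28.880],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
})

# keep only the numerical columns, since correlation is not defined for text
numeric = sample.select_dtypes(include="number")

# pairwise Pearson correlation (the default method of DataFrame.corr)
corr_matrix = numeric.corr()
print(corr_matrix.round(3))
```

Each entry of the resulting matrix lies between -1 and 1, and the diagonal is 1 because every feature is perfectly correlated with itself. With only five rows the values themselves are not meaningful.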


### Q5. For all categorical features, use a bar plot to visualize the number of users within each category.

In [None]:
# your code
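
A minimal sketch of one way to approach Q5, assuming `value_counts()` is an acceptable way to count users per category. The `sample` frame holds the five rows printed in the Q1 output, and the names `sample` and `category_counts` are introduced here for illustration only.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Five rows copied from the Q1 output, for illustration only;
# in the lab, use the full `df` loaded from insurance.csv instead.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.900, 33.770, 33.000, 22.705, 28.880],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
})

# every non-numerical column is treated as categorical
categorical_columns = sample.select_dtypes(exclude="number").columns

# one bar chart per categorical feature, side by side
fig, axes = plt.subplots(1, len(categorical_columns), figsize=(12, 3))
category_counts = {}
for ax, column in zip(axes, categorical_columns):
    counts = sample[column].value_counts()   # number of users per category
    category_counts[column] = counts
    counts.plot.bar(ax=ax, title=column)
    ax.set_ylabel("number of users")
fig.tight_layout()
plt.show()
```

Keeping the counts in a dictionary makes it easy to inspect the numbers behind each plot.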

### Q6. Convert all categorical features into numerical features with Label Encoding or One-Hot Encoding

In [None]:
# your code
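
The sketch below shows both encodings named in Q6, using plain pandas rather than any particular library from the lectures. Which encoding goes with which column is an assumption of this sketch: the two-valued features `sex` and `smoker` are label encoded, and `region` is one-hot encoded because its categories have no natural order. The data is the five rows printed in the Q1 output.

```python
import pandas as pd

# Five rows copied from the Q1 output, for illustration only;
# in the lab, use the full `df` loaded from insurance.csv instead.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.900, 33.770, 33.000, 22.705, 28.880],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
})

encoded = sample.copy()

# label encoding: each category becomes an integer code,
# assigned in sorted order of the category names
for column in ["sex", "smoker"]:
    encoded[column] = encoded[column].astype("category").cat.codes

# one-hot encoding: one 0/1 indicator column per category of `region`
encoded = pd.get_dummies(encoded, columns=["region"], dtype=int)

print(encoded)
```

Only three regions occur in these five rows, so three indicator columns are created here; on the full dataset, which lists four regions, there would be four.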



### Q7. Normalize all numerical features

In [None]:
# your code
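
"Normalize" can mean several things. The sketch below assumes min-max scaling, which maps each feature to the range [0, 1] via (x - min) / (max - min); the lectures may intend a different method, such as standardization to zero mean and unit variance. Whether the target `charges` should be scaled as well is a judgment call, and it is included here only for illustration. The data is the numerical part of the five rows printed in the Q1 output.

```python
import pandas as pd

# Numerical columns of the five rows shown in the Q1 output, for illustration
# only; in the lab, use the full `df` loaded from insurance.csv instead.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "bmi": [27.900, 33.770, 33.000, 22.705, 28.880],
    "children": [0, 1, 3, 0, 0],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
})

numeric_columns = ["age", "bmi", "children", "charges"]

# per-column minimum and maximum
col_min = sample[numeric_columns].min()
col_max = sample[numeric_columns].max()

# min-max scaling; assumes no column is constant (max > min)
normalized = sample.copy()
normalized[numeric_columns] = (sample[numeric_columns] - col_min) / (col_max - col_min)

print(normalized.round(3))
```

After scaling, the smallest value in each column is 0 and the largest is 1, so features measured on very different scales, such as `children` and `charges`, become comparable.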

### Q8. Save your preprocessed data into a csv file. Submit your code and the preprocessed data.
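
A minimal sketch of the saving step, assuming the preprocessed data lives in a DataFrame. The frame below is a small stand-in built from three rows of the Q1 output, and the output filename `insurance_preprocessed.csv` is hypothetical. The file is written to the system temporary directory only to keep the sketch side-effect free; in the lab, write it next to your notebook instead.

```python
import tempfile
from pathlib import Path

import pandas as pd

# stand-in for the preprocessed data (three rows from the Q1 output)
preprocessed = pd.DataFrame({
    "age": [19, 18, 28],
    "bmi": [27.900, 33.770, 33.000],
    "charges": [16884.92400, 1725.55230, 4449.46200],
})

# hypothetical output location for this sketch
out_path = Path(tempfile.gettempdir()) / "insurance_preprocessed.csv"

# index=False keeps the row index out of the file, so reloading
# does not produce an extra "Unnamed: 0" column
preprocessed.to_csv(out_path, index=False)

# read the file back as a quick sanity check
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```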