### 1. Import the necessary libraries if you are starting a new notebook.

In [4]:
# 📚 Basic libraries
import os # file management
import pandas as pd # data manipulation
import numpy as np # numerical operations
import warnings # warning message management

# ⚙️ Settings
pd.set_option('display.max_columns', None)
warnings.filterwarnings('ignore') # ignore warnings

# 🔄 Functions
import sys # system path to our functions
sys.path.append("C:/Users/apisi/01. IronData/01. GitHub/01. IronLabs/unit_4_py/lab-cleaning-categorical-data")

from easy.functions import open_data # quick data overview
from easy.functions import snake_columns # snake_case
from easy.functions import explore_data # checks for duplicates, NaN & empty spaces

### 2. Load the csv. Use the variable `customer_df` as `customer_df = pd.read_csv()`.

In [5]:
file_path = os.path.join("C:/Users/apisi/01. IronData/01. GitHub/01. IronLabs/unit_4_py/lab-cleaning-categorical-data/01_data/we_fn_use_c_marketing_customer_value_analysis.csv")
customer_df = pd.read_csv(file_path)

### 3. What should we do with the `customer_id` column?

In [6]:
# Drop it like it's NaN.
# IDs are arbitrary identifiers that carry no predictive information most of the time
data_c = customer_df.copy() # To choose a path of less violence, we first create a copy
data_c = data_c.drop('Customer', axis=1) # And then... say adios, my friend

In [7]:
data_c.columns

Index(['State', 'Customer Lifetime Value', 'Response', 'Coverage', 'Education',
       'Effective To Date', 'EmploymentStatus', 'Gender', 'Income',
       'Location Code', 'Marital Status', 'Monthly Premium Auto',
       'Months Since Last Claim', 'Months Since Policy Inception',
       'Number of Open Complaints', 'Number of Policies', 'Policy Type',
       'Policy', 'Renew Offer Type', 'Sales Channel', 'Total Claim Amount',
       'Vehicle Class', 'Vehicle Size'],
      dtype='object')
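Before dropping an identifier it can be worth confirming that it really is unique per row. The following is a minimal sketch on a made-up toy frame (the values and the `toy` name are assumptions, not the lab data):

```python
import pandas as pd

# Toy frame standing in for customer_df; all values are invented.
toy = pd.DataFrame({
    "Customer": ["A1", "B2", "C3"],
    "State": ["Washington", "Arizona", "Nevada"],
    "Total Claim Amount": [384.8, 1131.5, 566.5],
})

# An identifier that is unique per row cannot help a model generalise.
id_is_unique = toy["Customer"].is_unique

# drop() returns a new frame, so the original stays untouched.
toy_no_id = toy.drop(columns="Customer") if id_is_unique else toy
print(id_is_unique, list(toy_no_id.columns))
```

If the column turned out not to be unique, that would hint at duplicated customers and deserve a closer look before dropping anything.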

### 4. Load the continuous and discrete variables into `numerical_df` and `categorical_df` variables, e.g.:
    ```py
    numerical_df = customer_df.select_dtypes()
    categorical_df = customer_df.select_dtypes()
    ```

In [8]:
# Selecting Numericals
n = data_c.select_dtypes(include=np.number)

# Selecting Categoricals
c = data_c.select_dtypes(exclude=np.number)
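As a small illustration of how `select_dtypes` splits a frame, here is a sketch on a toy frame (the `toy`, `numerical_part` and `categorical_part` names are assumptions for the example):

```python
import numpy as np
import pandas as pd

# Toy frame with one text column and two numeric columns.
toy = pd.DataFrame({
    "State": ["Washington", "Arizona"],
    "Income": [56274, 0],
    "Total Claim Amount": [384.811147, 1131.464935],
})

# include/exclude with np.number covers both integer and float columns.
numerical_part = toy.select_dtypes(include=np.number)
categorical_part = toy.select_dtypes(exclude=np.number)

print(list(numerical_part.columns))
print(list(categorical_part.columns))
```

Note that discrete numeric columns (counts, for example) end up on the numerical side with this split, so they may need to be moved by hand if they should be treated as categories.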

### 5. Plot every categorical variable. What can you see in the plots? Note that in the previous lab you used a bar plot for categorical data, with each unique category of the column on the x-axis and an appropriate measure on the y-axis. This time you will try a different plot: for each categorical variable, put each unique category of the column on the x-axis and the target (which is numerical) on the y-axis.

In [9]:
# For this step, I will directly copy and paste my code from the previous lab.
# Also, isn't it better to use .unique() or value_counts() instead of plotting everything?
# All categorical data can be encoded; in some cases it can be bucketed
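A numeric counterpart to the category-versus-target plot is a grouped summary of the target. This is only a sketch on invented data; the `target_by_category` helper and the column names are assumptions for illustration:

```python
import pandas as pd

# Invented values; "total_claim_amount" plays the role of the numerical target.
toy = pd.DataFrame({
    "coverage": ["Basic", "Extended", "Basic", "Premium", "Extended", "Basic"],
    "total_claim_amount": [100.0, 300.0, 200.0, 900.0, 500.0, 300.0],
})

def target_by_category(df, column, target):
    """Return count, mean and median of `target` for each category in `column`."""
    return df.groupby(column)[target].agg(["count", "mean", "median"])

summary = target_by_category(toy, "coverage", "total_claim_amount")
print(summary)
```

The same table could feed a box plot or bar plot per category; the summary just makes the differences between categories readable without a figure.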

### Encoding Categoricals
* We will count `unique` for each feature.
* **If** it follows a hierarchy, ordinal encoding. **Elif**, manual encoding. **Elif** (too many uniques), get dummies. **Else** (dates), transform it to a datetime object and then create new columns for `day`, `month` & `year`
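The decision rule above can be sketched as a small helper. Whether a column has a hierarchy or needs a manual mapping cannot be detected automatically, so this sketch assumes the analyst passes those column names in; `suggest_encoding` and its parameters are hypothetical names, not part of the lab's `easy.functions` module:

```python
import pandas as pd

def suggest_encoding(series, ordered=(), manual=(), dates=()):
    """Suggest an encoding for one column from analyst-supplied column lists."""
    name = series.name
    if name in dates:
        return "datetime"   # split into day / month / year afterwards
    if name in ordered:
        return "ordinal"    # categories follow a hierarchy
    if name in manual:
        return "manual"     # hand-written mapping, e.g. No/Yes -> 0/1
    return "dummies"        # no hierarchy: one-hot encode

# Invented rows using column names from the notebook.
toy = pd.DataFrame({
    "response": ["No", "Yes", "No"],
    "coverage": ["Basic", "Extended", "Premium"],
    "state": ["Washington", "Arizona", "Nevada"],
    "effective_to_date": ["2/24/11", "1/31/11", "2/19/11"],
})

plan = {
    col: suggest_encoding(
        toy[col],
        ordered={"coverage"},
        manual={"response"},
        dates={"effective_to_date"},
    )
    for col in toy.columns
}
print(plan)
```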

In [10]:
snake_columns(c)

Unnamed: 0,state,response,coverage,education,effective_to_date,employmentstatus,gender,location_code,marital_status,policy_type,policy,renew_offer_type,sales_channel,vehicle_class,vehicle_size


In [11]:
# One by one, we will check unique values to encode them manually if it's necessary
c['response'].unique()

array(['No', 'Yes'], dtype=object)

In [12]:
binary = {'No' : 0, 'Yes' : 1}
c['response'] = c['response'].replace(binary) # assign back rather than relying on chained inplace

In [13]:
c['coverage'].unique()

array(['Basic', 'Extended', 'Premium'], dtype=object)

In [14]:
# In this case, ordinal encoding. Premium > Extended > Basic
ordinal = {'Basic' : 0, 'Extended' : 1, 'Premium' : 2}
c['coverage'] = c['coverage'].replace(ordinal)

In [15]:
c['education'].unique()

array(['Bachelor', 'College', 'Master', 'High School or Below', 'Doctor'],
      dtype=object)

In [16]:
# Then again, ordinal. Doctor > Master > College > Bachelor > High School or Below
ordinal = {'High School or Below' : 0, 'Bachelor' : 1, 'College' : 2, 'Master' : 3, 'Doctor' : 4}
c['education'] = c['education'].replace(ordinal)
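One risk with hand-written ordinal mappings is a label that is missing from the dictionary. A minimal sketch of a guard, assuming a hypothetical `encode_ordinal` helper and invented data:

```python
import pandas as pd

coverage = pd.Series(["Basic", "Extended", "Premium", "Basic"], name="coverage")
order = {"Basic": 0, "Extended": 1, "Premium": 2}

def encode_ordinal(series, mapping):
    """Map labels to integer ranks and fail loudly on labels missing from `mapping`."""
    unknown = set(series.dropna().unique()) - set(mapping)
    if unknown:
        raise ValueError(f"Unmapped categories: {sorted(unknown)}")
    return series.map(mapping)

encoded = encode_ordinal(coverage, order)
print(encoded.tolist())
```

Unlike `replace`, which silently leaves unknown labels as they are, this version stops as soon as the data contains a category the mapping does not cover.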

In [17]:
# Dates are complex. First, we will convert the column to datetime format
c['effective_to_date'] = c['effective_to_date'].astype('datetime64[ns]')

In [18]:
c['year'] = c['effective_to_date'].dt.year
c['month'] = c['effective_to_date'].dt.month
c['day'] = c['effective_to_date'].dt.day
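The same date split can be sketched with `pd.to_datetime` and an explicit format. The raw string format used here (`month/day/two-digit year`) is an assumption for the example, as are the values:

```python
import pandas as pd

# Invented date strings; the "%m/%d/%y" format is an assumption about the raw data.
dates = pd.Series(["2/24/11", "1/31/11", "2/19/11"], name="effective_to_date")
parsed = pd.to_datetime(dates, format="%m/%d/%y")

parts = pd.DataFrame({
    "year": parsed.dt.year,
    "month": parsed.dt.month,
    "day": parsed.dt.day,
})

# A date part with a single distinct value adds no information to a model.
constant_parts = [col for col in parts.columns if parts[col].nunique() == 1]
print(parts)
print(constant_parts)
```

Checking for constant parts is a cheap way to decide whether a column such as `year` is worth keeping after the split.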

In [19]:
c.head(3) # To see the changes

Unnamed: 0,state,response,coverage,education,effective_to_date,employmentstatus,gender,location_code,marital_status,policy_type,policy,renew_offer_type,sales_channel,vehicle_class,vehicle_size,year,month,day
0,Washington,0,0,1,2011-02-24,Employed,F,Suburban,Married,Corporate Auto,Corporate L3,Offer1,Agent,Two-Door Car,Medsize,2011,2,24
1,Arizona,0,1,1,2011-01-31,Unemployed,F,Suburban,Single,Personal Auto,Personal L3,Offer3,Agent,Four-Door Car,Medsize,2011,1,31
2,Nevada,0,2,1,2011-02-19,Employed,F,Suburban,Married,Personal Auto,Personal L3,Offer1,Agent,Two-Door Car,Medsize,2011,2,19


In [20]:
# We then drop `effective_to_date`
c = c.drop(['effective_to_date'], axis=1)

In [21]:
c.head(3)

Unnamed: 0,state,response,coverage,education,employmentstatus,gender,location_code,marital_status,policy_type,policy,renew_offer_type,sales_channel,vehicle_class,vehicle_size,year,month,day
0,Washington,0,0,1,Employed,F,Suburban,Married,Corporate Auto,Corporate L3,Offer1,Agent,Two-Door Car,Medsize,2011,2,24
1,Arizona,0,1,1,Unemployed,F,Suburban,Single,Personal Auto,Personal L3,Offer3,Agent,Four-Door Car,Medsize,2011,1,31
2,Nevada,0,2,1,Employed,F,Suburban,Married,Personal Auto,Personal L3,Offer1,Agent,Two-Door Car,Medsize,2011,2,19


In [22]:
# Next, employmentstatus:
c['employmentstatus'].unique() # In this case, we will use get_dummies, since we don't want to represent a hierarchy

array(['Employed', 'Unemployed', 'Medical Leave', 'Disabled', 'Retired'],
      dtype=object)

In [23]:
c['gender'].unique() # We have two genders in this dataset, so get_dummies

array(['F', 'M'], dtype=object)

In [24]:
c['location_code'].unique() # Again, we don't want to show any hierarchy so we will use get_dummies

array(['Suburban', 'Rural', 'Urban'], dtype=object)

In [25]:
c['marital_status'].unique() # get_dummies

array(['Married', 'Single', 'Divorced'], dtype=object)

In [26]:
c['policy_type'].unique()

array(['Corporate Auto', 'Personal Auto', 'Special Auto'], dtype=object)

In [27]:
# Then again, hierarchy. Special Auto > Corporate Auto > Personal Auto
ordinal = {'Personal Auto' : 0, 'Corporate Auto' : 1, 'Special Auto' : 2}
c['policy_type'] = c['policy_type'].replace(ordinal)

In [28]:
c['policy'].unique() # ordinal encoding, since the policy levels follow a hierarchy

array(['Corporate L3', 'Personal L3', 'Corporate L2', 'Personal L1',
       'Special L2', 'Corporate L1', 'Personal L2', 'Special L1',
       'Special L3'], dtype=object)

In [29]:
# Then again, hierarchy. Special L3 > Special L2 > Special L1 > Corporate L3 > Corporate L2 > Corporate L1 > Personal L3 > Personal L2 > Personal L1
ordinal = {'Personal L1' : 0, 'Personal L2' : 1, 'Personal L3': 2, 'Corporate L1' : 3, 'Corporate L2' : 4, 'Corporate L3' : 5, 'Special L1' : 6, 'Special L2' : 7, 'Special L3' : 8}
c['policy'] = c['policy'].replace(ordinal)

In [30]:
c['renew_offer_type'].unique() # get_dummies, we don't know the hierarchy of the offers

array(['Offer1', 'Offer3', 'Offer2', 'Offer4'], dtype=object)

In [31]:
c['sales_channel'].unique() # get_dummies

array(['Agent', 'Call Center', 'Web', 'Branch'], dtype=object)

In [32]:
c['vehicle_class'].unique() # There is a clear hierarchy (Luxury > Sports) but not among the others, so we will use get_dummies

array(['Two-Door Car', 'Four-Door Car', 'SUV', 'Luxury SUV', 'Sports Car',
       'Luxury Car'], dtype=object)

In [33]:
c['vehicle_size'].unique()

array(['Medsize', 'Small', 'Large'], dtype=object)

In [34]:
ordinal = {'Small' : 0, 'Medsize' : 1, 'Large': 2}
c['vehicle_size'] = c['vehicle_size'].replace(ordinal)

In [35]:
# We now select all the categoricals we have already encoded, before applying get_dummies to the rest
c_n = c.select_dtypes(include = np.number)
c_n.head(3)

Unnamed: 0,response,coverage,education,policy_type,policy,vehicle_size,year,month,day
0,0,0,1,1,5,1,2011,2,24
1,0,1,1,0,2,1,2011,1,31
2,0,2,1,0,2,1,2011,2,19


In [36]:
# We concat them to check them with a correlation matrix (we have ordinal categoricals, so it makes sense)
X_N = pd.concat([c_n, n], axis=1) # concat with our numerical values, target on the right
X_N.head(3)

Unnamed: 0,response,coverage,education,policy_type,policy,vehicle_size,year,month,day,Customer Lifetime Value,Income,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Total Claim Amount
0,0,0,1,1,5,1,2011,2,24,2763.519279,56274,69,32,5,0,1,384.811147
1,0,1,1,0,2,1,2011,1,31,6979.535903,0,94,13,42,0,8,1131.464935
2,0,2,1,0,2,1,2011,2,19,12887.43165,48767,108,18,38,0,2,566.472247
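The correlation check mentioned above could look roughly like this. It is a sketch on invented numbers, with snake_case column names chosen for the example:

```python
import pandas as pd

# Invented values: one ordinal-encoded column, one numerical column, and the target.
toy = pd.DataFrame({
    "coverage": [0, 1, 2, 0, 1, 2],
    "monthly_premium_auto": [60, 90, 120, 65, 95, 110],
    "total_claim_amount": [300.0, 500.0, 800.0, 250.0, 600.0, 700.0],
})

# Pearson correlation by default; corr(method="spearman") is a rank-based alternative.
corr = toy.corr()

# Correlation of every feature with the target, strongest first.
target_corr = (
    corr["total_claim_amount"]
    .drop("total_claim_amount")
    .sort_values(ascending=False)
)
print(target_corr)
```

Since ordinal codes only express an order, the rank-based Spearman variant may be the more defensible choice for those columns.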


In [37]:
# Saved for later
X_N.to_csv('C:/Users/apisi/01. IronData/01. GitHub/01. IronLabs/unit_4_py/lab-cleaning-numerical-data/01_data/X_N.csv')

In [38]:
# Now again, we select only categoricals to encode them with get_dummies
c = c.select_dtypes(exclude=np.number)
c.head(3)

Unnamed: 0,state,employmentstatus,gender,location_code,marital_status,renew_offer_type,sales_channel,vehicle_class
0,Washington,Employed,F,Suburban,Married,Offer1,Agent,Two-Door Car
1,Arizona,Unemployed,F,Suburban,Single,Offer3,Agent,Four-Door Car
2,Nevada,Employed,F,Suburban,Married,Offer1,Agent,Two-Door Car


In [39]:
# Now, get_dummies
c_dumm = pd.get_dummies(c, drop_first=False)
c_dumm.sample(5)

Unnamed: 0,state_Arizona,state_California,state_Nevada,state_Oregon,state_Washington,employmentstatus_Disabled,employmentstatus_Employed,employmentstatus_Medical Leave,employmentstatus_Retired,employmentstatus_Unemployed,gender_F,gender_M,location_code_Rural,location_code_Suburban,location_code_Urban,marital_status_Divorced,marital_status_Married,marital_status_Single,renew_offer_type_Offer1,renew_offer_type_Offer2,renew_offer_type_Offer3,renew_offer_type_Offer4,sales_channel_Agent,sales_channel_Branch,sales_channel_Call Center,sales_channel_Web,vehicle_class_Four-Door Car,vehicle_class_Luxury Car,vehicle_class_Luxury SUV,vehicle_class_SUV,vehicle_class_Sports Car,vehicle_class_Two-Door Car
7162,1,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1
7166,0,0,0,1,0,0,1,0,0,0,0,1,0,1,0,0,1,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0
6126,0,0,0,0,1,0,1,0,0,0,1,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0
138,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0
976,1,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,1,0,0
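With `drop_first=False`, the dummy columns of each feature always sum to 1, so one of them is redundant. The sketch below compares both settings on a toy frame (the values and variable names are assumptions):

```python
import pandas as pd

toy = pd.DataFrame({
    "gender": ["F", "M", "F"],
    "location_code": ["Suburban", "Rural", "Urban"],
})

# Keep every category as its own 0/1 column.
full = pd.get_dummies(toy, drop_first=False, dtype=int)

# Drop the first category of each feature; it is implied when all others are 0.
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

Keeping all columns is easier to read, while dropping one per feature avoids perfectly collinear inputs, which matters for linear models.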


### 6. For the categorical data, check if there is any data cleaning that you need to perform.
**Hint**: You can use the function `value_counts()` on each of the categorical columns and check the representation of different categories in each column. Discuss if this information might in some way be used for data cleaning.

In [40]:
# Already done in the previous answer. In a previous lab (cleaning-numerical-data) I went through this whole process at length.
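One way `value_counts()` can feed data cleaning is by grouping under-represented categories. This is a minimal sketch on invented counts; the 15% threshold and the `"Other"` label are arbitrary assumptions:

```python
import pandas as pd

# Invented distribution of a categorical column.
vehicle_class = pd.Series(
    ["Four-Door Car"] * 5 + ["SUV"] * 3 + ["Luxury Car"] + ["Luxury SUV"],
    name="vehicle_class",
)

# Share of rows per category.
shares = vehicle_class.value_counts(normalize=True)

# Categories below the (arbitrary) 15% threshold are considered rare.
rare = shares[shares < 0.15].index

# Keep frequent labels, replace rare ones with a single bucket.
cleaned = vehicle_class.where(~vehicle_class.isin(rare), "Other")
print(cleaned.value_counts())
```

Bucketing like this keeps the number of dummy columns down, at the cost of losing the distinction between the merged categories.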