## Issue 1: Data Cleaning (Handle Missing Values & Duplicates)
- Description: Handle missing values and remove duplicates.
- Solution Steps:
  1. Load the dataset.
  2. Identify missing values.
  3. Impute or remove missing data.
  4. Remove duplicates.
  5. Standardize text values, drop redundant columns, and save the cleaned dataset.
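
The steps above can be sketched end to end on a tiny in-memory frame. This is only an illustration: the `toy` frame and its values are made up, and the real notebook applies the same calls to the full coupon dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are invented for illustration).
toy = pd.DataFrame({
    "weather": ["Sunny", "Sunny", "Rainy", "Sunny"],
    "Bar": ["never", "never", np.nan, "never"],
    "Y": [1, 1, 0, 1],
})

# Step 2: count missing values per column.
missing = toy.isnull().sum()

# Step 3: impute the categorical column with its most frequent value.
toy["Bar"] = toy["Bar"].fillna(toy["Bar"].mode()[0])

# Step 4: drop exact duplicate rows, keeping the first occurrence.
cleaned = toy.drop_duplicates().reset_index(drop=True)

print(missing)
print(cleaned)
```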

## Importing Necessary Libraries

In [None]:
import numpy as np
import pandas as pd

## Loading the Dataset and Analysing it

In [None]:
data = pd.read_csv('in-vehicle-coupon-recommendation.csv') # Read the dataset
data.head() # Show the first 5 rows

Unnamed: 0,destination,passanger,weather,temperature,time,coupon,expiration,gender,age,maritalStatus,...,CoffeeHouse,CarryAway,RestaurantLessThan20,Restaurant20To50,toCoupon_GEQ5min,toCoupon_GEQ15min,toCoupon_GEQ25min,direction_same,direction_opp,Y
0,No Urgent Place,Alone,Sunny,55,2PM,Restaurant(<20),1d,Female,21,Unmarried partner,...,never,,4~8,1~3,1,0,0,0,1,1
1,No Urgent Place,Friend(s),Sunny,80,10AM,Coffee House,2h,Female,21,Unmarried partner,...,never,,4~8,1~3,1,0,0,0,1,0
2,No Urgent Place,Friend(s),Sunny,80,10AM,Carry out & Take away,2h,Female,21,Unmarried partner,...,never,,4~8,1~3,1,1,0,0,1,1
3,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,2h,Female,21,Unmarried partner,...,never,,4~8,1~3,1,1,0,0,1,0
4,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,1d,Female,21,Unmarried partner,...,never,,4~8,1~3,1,1,0,0,1,0


In [None]:
data.shape # Checking the shape of dataset

(12684, 26)

In [None]:
data.info() # Information about the dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12684 entries, 0 to 12683
Data columns (total 26 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   destination           12684 non-null  object
 1   passanger             12684 non-null  object
 2   weather               12684 non-null  object
 3   temperature           12684 non-null  int64 
 4   time                  12684 non-null  object
 5   coupon                12684 non-null  object
 6   expiration            12684 non-null  object
 7   gender                12684 non-null  object
 8   age                   12684 non-null  object
 9   maritalStatus         12684 non-null  object
 10  has_children          12684 non-null  int64 
 11  education             12684 non-null  object
 12  occupation            12684 non-null  object
 13  income                12684 non-null  object
 14  car                   108 non-null    object
 15  Bar                   12577 non-null

In [None]:
data.describe() # Statistics of Dataset

Unnamed: 0,temperature,has_children,toCoupon_GEQ5min,toCoupon_GEQ15min,toCoupon_GEQ25min,direction_same,direction_opp,Y
count,12684.0,12684.0,12684.0,12684.0,12684.0,12684.0,12684.0,12684.0
mean,63.301798,0.414144,1.0,0.561495,0.119126,0.214759,0.785241,0.568433
std,19.154486,0.492593,0.0,0.496224,0.32395,0.410671,0.410671,0.495314
min,30.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,55.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
50%,80.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0
75%,80.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0
max,80.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


## Checking for Missing Values

In [None]:
missing_values = data.isnull().sum() # Checking missing values in each column of dataset
missing_values

Unnamed: 0,0
destination,0
passanger,0
weather,0
temperature,0
time,0
coupon,0
expiration,0
gender,0
age,0
maritalStatus,0


In [None]:
data = data.drop(columns=['car']) # Drop the 'car' column: only 108 of its 12684 values are non-null
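
Rather than picking the column by eye, the decision to drop `car` could be driven by the fraction of missing values per column. The sketch below is one possible way to do that; the helper name `columns_above_missing_threshold`, the 0.5 threshold, and the toy values are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def columns_above_missing_threshold(df, threshold=0.5):
    """Return names of columns whose fraction of missing values exceeds threshold."""
    missing_fraction = df.isnull().mean()  # mean of booleans = share of NaNs
    return missing_fraction[missing_fraction > threshold].index.tolist()

# Toy frame (invented values): 'car' is mostly empty, 'Bar' only partly.
toy = pd.DataFrame({
    "car": [np.nan, np.nan, np.nan, "Scooter and motorcycle"],
    "Bar": ["never", np.nan, "less1", "1~3"],
    "Y": [1, 0, 1, 0],
})

to_drop = columns_above_missing_threshold(toy)  # threshold is an assumed cut-off
toy = toy.drop(columns=to_drop)
print(to_drop)
```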

In [None]:
def fill_missing_values(df):
  """Fill NaNs in the frequency columns with each column's mode (modifies df in place)."""
  categorical_columns = ['Bar', 'CoffeeHouse', 'CarryAway', 'RestaurantLessThan20', 'Restaurant20To50'] # Categorical columns with missing values
  for col in categorical_columns:
    df[col] = df[col].fillna(df[col].mode()[0])  # Fill missing categorical values with the mode

fill_missing_values(data)
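
The function above changes `data` in place. If the original frame should stay untouched for later comparison, a copy-returning variant might look like the following sketch. The name `fill_with_mode` and the toy values are hypothetical; the guard for all-NaN columns is an added assumption, since such a column has no mode.

```python
import numpy as np
import pandas as pd

def fill_with_mode(df, columns):
    """Return a copy of df with NaNs in the given columns replaced by each column's mode."""
    filled = df.copy()
    for col in columns:
        modes = filled[col].mode()
        if not modes.empty:  # an all-NaN column has no mode; leave it unchanged
            filled[col] = filled[col].fillna(modes[0])
    return filled

# Toy frame with invented values.
toy = pd.DataFrame({
    "Bar": ["never", np.nan, "never", "less1"],
    "CoffeeHouse": ["1~3", "1~3", np.nan, "gt8"],
})

filled = fill_with_mode(toy, ["Bar", "CoffeeHouse"])
print(filled)
```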

In [None]:
missing_values = data.isnull().sum() # Checking missing values in each column of dataset
missing_values

Unnamed: 0,0
destination,0
passanger,0
weather,0
temperature,0
time,0
coupon,0
expiration,0
gender,0
age,0
maritalStatus,0


## Checking for Duplicated Values

In [None]:
duplicates = data.duplicated().sum() # Count duplicated rows in the dataset
duplicates

np.int64(74)

In [None]:
duplicate_rows = data[data.duplicated()] # Checking Duplicated Rows
duplicate_rows

Unnamed: 0,destination,passanger,weather,temperature,time,coupon,expiration,gender,age,maritalStatus,...,CoffeeHouse,CarryAway,RestaurantLessThan20,Restaurant20To50,toCoupon_GEQ5min,toCoupon_GEQ15min,toCoupon_GEQ25min,direction_same,direction_opp,Y
4192,Work,Alone,Sunny,80,7AM,Carry out & Take away,1d,Male,26,Single,...,never,1~3,less1,less1,1,1,1,0,1,1
4236,Work,Alone,Sunny,80,7AM,Carry out & Take away,1d,Male,26,Single,...,gt8,gt8,4~8,less1,1,1,1,0,1,1
4280,Work,Alone,Sunny,80,7AM,Carry out & Take away,1d,Female,26,Single,...,never,4~8,1~3,less1,1,1,1,0,1,1
4324,Work,Alone,Sunny,80,7AM,Carry out & Take away,1d,Female,46,Single,...,never,4~8,1~3,1~3,1,1,1,0,1,1
4409,Work,Alone,Sunny,80,7AM,Carry out & Take away,1d,Female,21,Single,...,never,less1,1~3,never,1,1,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8511,Home,Alone,Sunny,80,6PM,Bar,2h,Male,46,Married partner,...,1~3,1~3,less1,1~3,1,0,0,1,0,1
8512,Home,Partner,Sunny,30,10PM,Carry out & Take away,2h,Male,46,Married partner,...,1~3,1~3,less1,1~3,1,1,0,0,1,1
8513,Work,Alone,Rainy,55,7AM,Restaurant(<20),2h,Male,46,Married partner,...,1~3,1~3,less1,1~3,1,1,1,0,1,0
8515,Work,Alone,Snowy,30,7AM,Restaurant(20-50),1d,Male,46,Married partner,...,1~3,1~3,less1,1~3,1,1,1,0,1,0


In [None]:
data.drop_duplicates(inplace=True) # Removing Duplicated Rows
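
Note that `duplicated()` with its default `keep='first'` flags only the second and later copies of a row, which is why the count matches the number of rows removed. To inspect every member of each duplicate group, `keep=False` can be used. A small sketch on invented data:

```python
import pandas as pd

# Toy frame with invented values: rows 0, 2 and 3 are identical.
toy = pd.DataFrame({
    "destination": ["Work", "Home", "Work", "Work"],
    "coupon": ["Bar", "Bar", "Bar", "Bar"],
})

later_copies = toy.duplicated().sum()          # counts only the repeats
all_copies = toy[toy.duplicated(keep=False)]   # every row that has a twin

# Assigning the result (instead of inplace=True) and resetting the index
# leaves a contiguous 0..n-1 index after the rows are removed.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(later_copies, len(all_copies), len(deduped))
```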

In [None]:
data.shape # Shape changed from (12684, 26) to (12610, 25): 74 duplicate rows and the 'car' column were removed

(12610, 25)

## Checking For Unique Values

In [None]:
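def unique_values(df):
  """Print the unique values of every column in df."""
  for col in df.columns:
    values = df[col].unique()  # renamed so the local variable does not shadow the function name
    print(f"Unique values in column '{col}':")
    print(values)
    print("-" * 50)

unique_values(data)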

Unique values in column 'destination':
['No Urgent Place' 'Home' 'Work']
--------------------------------------------------
Unique values in column 'passanger':
['Alone' 'Friend(s)' 'Kid(s)' 'Partner']
--------------------------------------------------
Unique values in column 'weather':
['Sunny' 'Rainy' 'Snowy']
--------------------------------------------------
Unique values in column 'temperature':
[55 80 30]
--------------------------------------------------
Unique values in column 'time':
['2PM' '10AM' '6PM' '7AM' '10PM']
--------------------------------------------------
Unique values in column 'coupon':
['Restaurant(<20)' 'Coffee House' 'Carry out & Take away' 'Bar'
 'Restaurant(20-50)']
--------------------------------------------------
Unique values in column 'expiration':
['1d' '2h']
--------------------------------------------------
Unique values in column 'gender':
['Female' 'Male']
--------------------------------------------------
Unique values in column 'age':
['21' '46' 

In [None]:
categorical_cols = data.select_dtypes(include=['object']).columns

data[categorical_cols] = data[categorical_cols].apply(lambda x: x.str.lower().str.strip()) # Convert all string values to lower case and strip surrounding whitespace

In [None]:
unique_values(data)

Unique values in column 'destination':
['no urgent place' 'home' 'work']
--------------------------------------------------
Unique values in column 'passanger':
['alone' 'friend(s)' 'kid(s)' 'partner']
--------------------------------------------------
Unique values in column 'weather':
['sunny' 'rainy' 'snowy']
--------------------------------------------------
Unique values in column 'temperature':
[55 80 30]
--------------------------------------------------
Unique values in column 'time':
['2pm' '10am' '6pm' '7am' '10pm']
--------------------------------------------------
Unique values in column 'coupon':
['restaurant(<20)' 'coffee house' 'carry out & take away' 'bar'
 'restaurant(20-50)']
--------------------------------------------------
Unique values in column 'expiration':
['1d' '2h']
--------------------------------------------------
Unique values in column 'gender':
['female' 'male']
--------------------------------------------------
Unique values in column 'age':
['21' '46' 

## Removing Redundant Columns

In [None]:
data = data.drop(columns=['toCoupon_GEQ5min', 'direction_opp'])
# toCoupon_GEQ5min has only one value (always 1), so it carries no information; drop it
# direction_opp and direction_same are complements of each other, so keeping one is enough
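
Both claims in the comments above can be checked programmatically before dropping anything. The sketch below uses an invented toy frame with the same column names; on the real data the same two expressions would be run against `data`.

```python
import pandas as pd

# Toy frame with invented values, mirroring the three columns discussed above.
toy = pd.DataFrame({
    "toCoupon_GEQ5min": [1, 1, 1, 1],
    "direction_same": [0, 1, 0, 0],
    "direction_opp": [1, 0, 1, 1],
})

# Columns with a single distinct value carry no information.
constant_cols = [c for c in toy.columns if toy[c].nunique() == 1]

# Two binary columns are complements if they always sum to 1.
are_complements = bool(((toy["direction_same"] + toy["direction_opp"]) == 1).all())

print(constant_cols, are_complements)
```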

In [None]:
data.shape

(12610, 23)

In [None]:
data.to_csv("cleaned-in-vehicle-coupon-recommendation.csv", index=False)